An Energy-Efficient Outlier Detection Based on Data Clustering in WSNs

Sensor nodes in wireless sensor networks are prone to malfunction because they are directly exposed to their surrounding environment. Consequently, sensor nodes may produce incorrect readings, which are called outliers. Since an outlier deviates from normal sensor readings and can cause problems, various techniques to detect outliers have been proposed. In this paper, we propose an energy-efficient outlier detection technique based on data clustering. To decide the width of the clusters that group the sensor readings, we apply the Pigeonhole Principle and then detect the outliers based on the resulting clusters. In experiments, we demonstrate the efficiency of our proposed technique compared to other outlier detection techniques.


Introduction
With recent advances in integrated circuits, sensors have become steadily smaller, and various sensors are now built into the sensor nodes of a wireless sensor network (WSN). Sensor nodes collect large amounts of diverse information (e.g., temperature, humidity, and light) from their environments and communicate with the base station and with one another by radio transmission. Accordingly, WSNs are used in various applications such as environment and habitat monitoring [1,2], combat field surveillance [3], security [4], and health care [5].
Commonly, sensor nodes are severely constrained in terms of computation power, communication bandwidth, and battery power. Among these limitations, power is of utmost importance, since replacing the battery of a sensor node is either too expensive or impossible [6,7]. Thus, energy preservation is a major research issue since it directly impacts the lifetime of the network. Much research has shown that radio communication is more expensive than computation or sensing. Thus, many techniques [8][9][10][11][12][13] have been proposed to reduce the communication overhead.
In particular, since sensor nodes are placed outdoors for applications such as disaster monitoring [14] and habitat monitoring, the sensor nodes can malfunction or their readings may be incorrect due to harsh external environments [15]. In addition, due to sudden changes in the external environment, some sensor readings may deviate significantly from the normal sensor readings. These abnormal sensor readings are called outliers. For example, assume that several sensor nodes are deployed on a mountain to monitor forest fires. When an outlier value is detected and sent to a forest guard, the forest guard can identify an actual forest fire, or he/she can reinitialize the sensor node if the outlier was generated by a malfunctioning sensor. Thus, outlier detection is an important task both for detecting events and for keeping sensor networks operating correctly.
In this paper, we focus on an energy-efficient outlier detection technique in WSNs based on data clustering. To construct data clusters, we use the Pigeonhole Principle. By applying a cluster width, called the permission range, obtained by the Pigeonhole Principle, we partition the data space into several clusters. However, if we partition the data space evenly, some sensor readings are identified as outliers although similar sensor readings are detected by other sensor nodes. Thus, we partition the domain of data unevenly based on the location of sensor readings. Then, we identify the outliers according to a user-defined threshold on the number of sensor readings in a cluster.
International Journal of Distributed Sensor Networks
The remainder of the paper is organized as follows. Section 2 discusses related work. In Section 3, we present the background of our work. Section 4 introduces the proposed outlier detection technique in WSNs. Section 5 presents an empirical evaluation. Section 6 summarizes the paper.

Related Work
Many outlier detection techniques have been proposed for WSNs. In [16], an outlier detection technique was proposed to collect the outliers with respect to the neighbor sensor nodes. Each sensor node calculates the median of its own readings and the sensor readings received from its neighbors. Each sensor node then computes the mean $\mu$ and the standard deviation $\sigma$ of the differences between its sensor readings and the calculated median. If the standardized value ($= (V - \mu)/\sigma$) of a sensor reading V is greater than or equal to the user-defined threshold, it is regarded as an outlier.
Palpanas et al. proposed an outlier detection technique based on the Epanechnikov kernel function [17]. Given a value V, each sensor node estimates the number of values around V using the kernel function. If the number of values around V is less than a user-defined threshold, V is regarded as an outlier. However, to build the kernel function, each sensor node transmits the required information to the base station along the routing path. Thus, it wastes a lot of energy.
In [18], an outlier detection technique based on data clustering, called DC, was proposed. Given a user-defined threshold, if the distance between any two sensor readings is less than the threshold, they form a cluster. When the distance between a pair of cluster centers is less than the threshold, the clusters are merged. The intercluster distance (ICD) of a cluster to its $k$-nearest clusters is computed to detect the outlier clusters. When the ICD of a cluster differs considerably from the mean of the ICDs, it is regarded as an outlier cluster. Then, the information of outlier clusters is broadcast into the WSN. Thus, it consumes much energy.
In [19], an outlier detection technique based on an ellipsoid was proposed. Each sensor node constructs an ellipsoid boundary of sensor readings using the mean and the covariance of sensor readings and transmits its ellipsoid boundary to the base station. The base station merges all received ellipsoids, computes the global ellipsoid boundary, and broadcasts the global ellipsoid to all sensor nodes. Then, with respect to the global ellipsoid boundary, each sensor node identifies the outliers among its sensor readings.
Recently, an outlier detection technique based on the distance between sensor readings and the estimation deviation was proposed [20]. Each sensor node computes the expected deviation and the average distance between pairs of sensor readings detected within a recent time interval. If the average distance is greater than the expected deviation, the sensor reading at the current time becomes an outlier. However, when the expectation model is frequently updated, the communication overhead increases.

Preliminary
In this section, we present the basic model of sensor networks briefly.
3.1. Sensor Networks. We consider a sensor network consisting of $N$ stationary sensor nodes $\{s_1, s_2, \ldots, s_N\}$ deployed in a field of interest and a powered base station serving as an access point for users to pose ad hoc queries. We use a routing tree [9], which is frequently used as a primitive to collect data from sensor nodes. Two nodes capable of direct bidirectional wireless communication are referred to as neighbors of each other. Each sensor node can broadcast a message to all of its neighbors (or from a parent to its child nodes) at a time. A simple sensor network using tree routing is shown in Figure 1. In Figure 1, $s_1$ to $s_3$ are intermediate sensor nodes that have child sensor nodes, and $s_4$ to $s_9$ are leaf sensor nodes that have no child node. Each sensor node generates its readings periodically. A sampling period is known as an epoch [21]. To agree on a global time base that allows sensor nodes to start and finish each epoch simultaneously, each sensor node executes the SMACS protocol [22] or a global time synchronization protocol [23]. Based on the synchronized global time, nodes sleep for a certain period in each epoch to minimize energy consumption, and each sensor node wakes to sample and to receive results when its neighbors propagate a message.

Outlier Detection Based on Clustering
In this section, we present our proposed outlier detection technique, which is based on data clustering in WSNs. In addition, we introduce an efficient data transmission scheme for our outlier detection technique.

Clustering Technique.
Generally, in outlier detection techniques based on data clustering, the width of the clusters is the most important parameter, since it determines the number of sensor readings in each cluster. If the width is too large, all sensor readings may belong to a single cluster. If it is too small, each cluster may contain only a single sensor reading. In either case, the outliers cannot be identified. To solve this problem, we apply the Pigeonhole Principle to determine the width of clusters. We regard the subdomains of the sensor readings as pigeonholes and the sensor readings as pigeons. Thus, when we partition the domain into a number of subdomains that is smaller than the number of sensor readings, at least one subdomain contains two or more sensor readings.
Given a set of sensor readings $R = \{r_1, r_2, \ldots, r_n\}$ ($|R| = n$), we can acquire the domain of $R$ as $[\mathrm{MIN}(R), \mathrm{MAX}(R)]$. Then, we obtain the permission range (PR) as follows:

$$PR = \frac{\mathrm{MAX}(R) - \mathrm{MIN}(R)}{n - 1}. \quad (1)$$

We use PR as the width of clusters and identify the outliers. For instance, given the set of sensor readings $R$ shown in Figure 2, the PR of $R$ is obtained as $(\mathrm{MAX}(R) - \mathrm{MIN}(R))/(n - 1) = (19 - 2)/(11 - 1) = 1.7$. If the number of sensor readings in a cluster is less than or equal to the user-defined threshold, the cluster is an outlier cluster, and the sensor readings in outlier clusters are considered outliers.
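The permission range in (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `permission_range` and the sample readings are hypothetical, and the readings were chosen only so that they reproduce the minimum (2), maximum (19), and count (11) quoted for Figure 2.

```python
def permission_range(readings):
    """Return (MAX - MIN) / (n - 1) for a collection of at least two readings."""
    n = len(readings)
    if n < 2:
        raise ValueError("at least two readings are needed")
    return (max(readings) - min(readings)) / (n - 1)


# Hypothetical readings (not the actual data behind Figure 2):
# 11 values spanning the domain [2, 19].
readings = [2, 3, 4, 8, 11, 12, 15, 16, 17, 18, 19]
pr = permission_range(readings)  # (19 - 2) / (11 - 1)
```

Because the domain is divided by n - 1 rather than n, there are fewer subdomains than readings, which is what lets the Pigeonhole Principle guarantee a subdomain with at least two readings.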
In Figure 2, the clusters represented by dotted lines are obtained when we partition the domain of $R$ evenly. Assume that the user-defined threshold is 1. In Figure 2, $C_2$, $C_4$, and $C_8$ are the outlier clusters (denoted by ellipses), and the sensor readings in each outlier cluster are regarded as outliers. The sensor reading $r_4$ in the cluster $C_4$ is definitely an outlier, since no sensor reading similar to $r_4$ exists. However, the sensor readings $r_3$ and $r_7$ should not be outliers, since $r_2$ and $r_8$ are similar to $r_3$ and $r_7$, respectively. In other words, even when the difference between a pair of sensor readings is smaller than the differences between other pairs, these two readings may fall into separate clusters if we partition the domain of $R$ evenly.
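The drawback of even partitioning can be reproduced with a small sketch. The code below is illustrative only: the function name and the sample readings are hypothetical (they are not the data of Figure 2), and the bin-index computation is one straightforward way to realize equal-width clusters of width PR.

```python
from collections import defaultdict


def equal_width_outliers(readings, threshold):
    """Partition [MIN, MAX] into n - 1 equal bins of width PR and return the
    readings that fall into bins holding at most `threshold` readings."""
    n = len(readings)
    lo, hi = min(readings), max(readings)
    if n < 2 or hi == lo:
        return []
    pr = (hi - lo) / (n - 1)
    n_bins = n - 1
    bins = defaultdict(list)
    for x in readings:
        # The maximum reading is clamped into the last bin.
        idx = min(int((x - lo) / pr), n_bins - 1)
        bins[idx].append(x)
    return sorted(x for members in bins.values()
                  if len(members) <= threshold for x in members)


# Hypothetical readings: 11 values in [2, 19], so PR = 1.7.
readings = [2.0, 3.5, 3.9, 7.0, 10.0, 10.4, 14.0, 14.5, 15.0, 18.5, 19.0]
flagged = equal_width_outliers(readings, threshold=1)
```

With these values, 7.0 is flagged, which is reasonable because its nearest neighbors are about 3 units away. However, 3.9 is also flagged although 3.5 lies only 0.4 away, simply because a bin boundary separates the two readings.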
To overcome this drawback, we propose nonequipartitioning based on the permission range (PR). In nonequipartitioning, a set of clusters is constructed with respect to Definition 1.

Definition 1.
If the difference between a pair of sensor readings $r_i$ and $r_j$ in $R$ is less than or equal to the permission range PR obtained by (1), we say $r_i$ is close to $r_j$.
In our proposed technique, if $r_i$ and $r_j$ are close to each other, a single cluster contains both $r_i$ and $r_j$. For instance, as shown in Figure 3, since the sensor readings $r_1$ and $r_2$ are close, a cluster $C_1$ is constructed for them. Then, the sensor reading $r_3$ is inserted into $C_1$, since the difference between $r_2$ and $r_3$ is less than PR. But $r_4$ is not inserted into $C_1$. The result of nonequipartitioning is presented in Figure 3. The clusters are $C_1 = \{r_1, r_2, r_3\}$, $C_2 = \{r_4\}$, $C_3 = \{r_5, r_6\}$, and $C_4 = \{r_7, r_8, r_9, r_{10}, r_{11}\}$. Consequently, when the user-defined threshold is 1, the cluster $C_2$ is an outlier cluster and the sensor reading $r_4$ in $C_2$ is an outlier.
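Nonequipartitioning under Definition 1 can be sketched as a single pass over the sorted readings: a reading joins the current cluster when it is close to the previous reading and otherwise opens a new cluster. The function names and the sample readings below are hypothetical, and reading "close" as a chain over adjacent sorted readings is an interpretation of the example above rather than a confirmed detail of the authors' implementation.

```python
def cluster_by_closeness(readings):
    """Group sorted readings so that adjacent readings whose difference is
    at most PR share a cluster. Returns (clusters, pr)."""
    values = sorted(readings)
    if len(values) < 2:
        return ([values] if values else []), 0.0
    pr = (values[-1] - values[0]) / (len(values) - 1)
    clusters = [[values[0]]]
    for prev, cur in zip(values, values[1:]):
        if cur - prev <= pr:
            clusters[-1].append(cur)   # close to the previous reading
        else:
            clusters.append([cur])     # gap larger than PR: new cluster
    return clusters, pr


def outliers(clusters, threshold):
    """Readings in clusters holding at most `threshold` readings."""
    return [x for c in clusters if len(c) <= threshold for x in c]


# Hypothetical readings: 11 values in [2, 19], so PR = 1.7.
readings = [2.0, 3.5, 3.9, 7.0, 10.0, 10.4, 14.0, 14.5, 15.0, 18.5, 19.0]
clusters, pr = cluster_by_closeness(readings)
found = outliers(clusters, threshold=1)
```

On this sample, only the isolated reading 7.0 ends up in a singleton cluster, while 3.9 stays with its close neighbors 2.0 and 3.5.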

Clustering Scheme for WSNs.
If each sensor node blindly transmits its readings to the base station at each epoch and the base station computes the outliers, each sensor node consumes much energy. In this section, we present an efficient data transmission scheme for our outlier detection algorithm. We assume that all sensor nodes take sensor readings periodically and keep these readings in their local storage for a time window. Note that when the window size is 1, each sensor node transmits data to the base station at each epoch. First, according to Definition 1, each sensor node constructs clusters using the sensor readings it detected within the window. If the number of sensor readings in a cluster is greater than the user-defined threshold, none of the sensor readings in this cluster can be outliers (i.e., the cluster is a nonoutlier cluster (NOC)). Otherwise, all sensor readings in the cluster may be outliers, and we call such clusters outlier candidate clusters (OCCs).
Along the routing path to the base station, each sensor node transmits its NOCs and OCCs. For NOCs, only the cluster ranges (CRs) are transmitted, where a CR consists of the minimum and maximum values of the sensor readings in a NOC. In contrast, for OCCs, the sensor readings in each OCC are transmitted to the parent node.
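The split into NOCs and OCCs, and what each contributes to a message, can be sketched as follows. The function name `build_payload`, the tuple/list representation, and the sample clusters are assumptions made for illustration; the source does not specify a message format.

```python
def build_payload(clusters, threshold):
    """Split clusters into NOC ranges and OCC reading lists.

    A cluster with more than `threshold` readings is a NOC and is reduced
    to its cluster range (min, max); any other cluster is an OCC and keeps
    all of its readings."""
    noc_ranges, occ_readings = [], []
    for cluster in clusters:
        if len(cluster) > threshold:
            noc_ranges.append((min(cluster), max(cluster)))
        else:
            occ_readings.append(list(cluster))
    return noc_ranges, occ_readings


# Hypothetical clusters built by a node within one time window.
clusters = [[2.0, 3.5, 3.9], [7.0], [10.0, 10.4], [14.0, 14.5, 15.0], [18.5, 19.0]]
nocs, occs = build_payload(clusters, threshold=2)

# Number of values sent: two per NOC plus every reading of every OCC.
values_sent = 2 * len(nocs) + sum(len(c) for c in occs)
```

In this sample, 9 values are sent instead of the 11 raw readings; the saving grows with the number of readings that each NOC absorbs.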
When a parent node receives the CRs of NOCs and the OCCs from its child nodes, it attempts to merge them with its own clusters. To merge the clusters, we use the following definition. Note that, when two clusters are merged into a new cluster and at least one of them is a NOC, the merged cluster cannot be an OCC, since the number of sensor readings in a NOC is already greater than the threshold. Thus, when a sensor node transmits a NOC, it does not need to transmit all sensor readings in the NOC; the cluster range CR of the NOC is sufficient. Each sensor node therefore reduces its energy consumption when it transmits NOCs, since the volume of data transmitted from each sensor decreases.

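Definition 2. Given two cluster ranges $CR_i = [\min_i, \max_i]$ and $CR_j = [\min_j, \max_j]$, where $\min_i < \min_j$, if $\min_j - \max_i \le PR$, we say $CR_i$ and $CR_j$ overlap within PR.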
In contrast, when two OCCs are merged into a new cluster, we check whether the number of sensor readings in the new cluster is greater than the threshold. Since each sensor node sends all sensor readings in each OCC, we can easily count the number of sensor readings in the new cluster.
Along the routing path from each sensor node to the base station, the CRs of NOCs and the OCCs are merged and transmitted step by step. Finally, the base station determines the outliers among the received OCCs. Recall that the cluster ranges (CRs), rather than all sensor readings in NOCs, are transmitted along the routing paths. Thus, we can reduce the energy consumption of each sensor node.
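The merge step at a parent node can be sketched as a sweep over clusters sorted by their minimum value: two clusters are merged when the gap between the upper bound of the first and the lower bound of the second is at most PR. The `Cluster` class, the helper `occ`, and the choice to pass PR in as a parameter are assumptions for illustration; the individual readings of the nodes are hypothetical and were picked only to reproduce a merged range of [15, 17] and the three singleton OCCs [3], [8], and [18].

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    lo: float
    hi: float
    readings: Optional[List[float]]  # None for a NOC (only the range is known)


def occ(*values):
    """Build an OCC from raw readings (hypothetical helper)."""
    return Cluster(min(values), max(values), sorted(values))


def merge_clusters(clusters, pr, threshold):
    """Merge clusters whose ranges overlap within pr, then turn every OCC
    that holds more than `threshold` readings into a NOC."""
    merged = []
    for c in sorted(clusters, key=lambda c: c.lo):
        last = merged[-1] if merged else None
        if last is not None and c.lo - last.hi <= pr:
            if last.readings is None or c.readings is None:
                readings = None  # merging with a NOC always yields a NOC
            else:
                readings = sorted(last.readings + c.readings)
            merged[-1] = Cluster(min(last.lo, c.lo), max(last.hi, c.hi), readings)
        else:
            kept = None if c.readings is None else list(c.readings)
            merged.append(Cluster(c.lo, c.hi, kept))
    return [Cluster(c.lo, c.hi, None)
            if c.readings is not None and len(c.readings) > threshold else c
            for c in merged]


# Hypothetical readings: a parent holding 16 with children reporting 15 and 17.
at_parent = merge_clusters([occ(16), occ(15), occ(17)], pr=1.7, threshold=1)
# Three readings that are pairwise far apart stay separate OCCs.
separate = merge_clusters([occ(3), occ(8), occ(18)], pr=1.7, threshold=1)
```

Storing `None` instead of a reading list for a NOC mirrors the transmission scheme: once a cluster is known to exceed the threshold, only its range is needed for later merges.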
For example, consider the WSN with 11 sensor nodes in Figure 4, where the user-defined threshold is 1 and the domain of the set of sensor readings is [2,19]. Each sensor node constructs clusters using its sensor readings. Then, the cluster of each sensor node becomes an outlier candidate cluster (OCC), since the threshold is 1. As shown in Figure 5, the leaf nodes (i.e., $s_8$ to $s_{11}$) transmit their clusters to their respective parents. In Figure 5, $s_4$ merges its own OCCs with the OCCs received from $s_8$ and $s_9$ based on Definition 2, and then $s_4$ obtains NOC 1: [15,17]. Similarly, $s_6$ obtains the OCCs (i.e., OCC 1: [3], OCC 2: [8], and OCC 3: [18]).
In Figure 6, $s_2$ receives the NOCs and OCCs from its child nodes ($s_4$, $s_5$, and $s_6$), and then $s_2$ merges the received clusters. Since the cluster range of OCC 1 coming from $s_6$ overlaps the sensor readings of $s_2$ and $s_5$ within PR, $s_2$ generates a new cluster NOC 2. Similarly, $s_2$ merges NOC 1 and OCC 3 into NOC 3. But OCC 2 is not merged since it does not overlap with any other cluster. Another sensor node, $s_3$, generates the clusters OCC 3 and OCC 4.
The sensor node $s_1$ receives the clusters from its child nodes as shown in Figure 7. $s_1$ merges NOC 3 and OCC 4 into NOC 4, and it merges its own sensor reading and OCC 3 into NOC 5. Finally, $s_1$ transmits the clusters to the base station, and then the base station detects the outlier cluster using the outlier candidate cluster OCC 8.
When a sensor node obtains weather information, it detects temperature as well as other parameters such as humidity and light. Thus, we propose a clustering scheme for multidimensional data.

Experimental Environments.
To evaluate the performance of our proposed algorithm compared with the state-of-the-art algorithm, we used a set of real-life data provided by the Intel Berkeley Research Lab [24]. The sensor network consists of 54 sensor nodes deployed in a 40.5 m × 31 m area as shown in Figure 8. The base station is located at the center of the area. We set the default communication distance to 7 m. The maximum depth and the maximum width of the routing tree in the sensor network are 5 and 3, respectively. Sensor readings consist of temperature (Celsius), humidity (%), and illumination (Lux).
As competitors, we implemented Brute-Force (BF) and an outlier detection technique based on data clustering (DC) [18]. In BF, each sensor node transmits its readings to the base station at each epoch.
In DC, each sensor node in the network transmits a set of intercluster distances (ICDs) for its sensor readings along the routing paths to the base station. The intercluster distance is the distance between the centers of any two data clusters in each sensor node. If the intercluster distance is less than the user-defined threshold, the two data clusters are merged. When the base station receives the ICDs from all sensor nodes, it computes the ICDs of the $k$-nearest clusters. Then, the base station broadcasts the mean of the ICDs in order to identify the outliers in each sensor node. For DC, we set the threshold and $k$ to 0.26 and 4, respectively. We call our proposed algorithm PC.
(2)
To receive this message, a sensor expends

$$E_{Rx}(b) = E_{Rx\text{-}elec}(b) = b \cdot E_{elec}, \quad (3)$$

where $b$ is the size of the message in bits. In this experiment, we set the electronic circuit constant ($E_{elec}$) to 50 nJ/bit and the transmit amplifier constant ($\epsilon_{amp}$) to 100 pJ/bit/m$^2$. We set the packet size to 40 bytes. The parameters used in our experiment are summarized in Table 2.
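As a worked example with the constants above, the per-packet energy can be computed as follows. The receive cost follows the relation stated in the text. The transmit cost is an assumption: the transmit equation is not preserved in this section, so the sketch uses the widely used first-order radio model, in which the circuit cost is increased by an amplifier term proportional to the squared distance. All function and constant names are hypothetical.

```python
E_ELEC_NJ_PER_BIT = 50.0        # electronic circuit constant from the text
EPS_AMP_PJ_PER_BIT_M2 = 100.0   # transmit amplifier constant from the text
PACKET_BITS = 40 * 8            # 40-byte packet


def rx_energy_nj(bits):
    """Energy to receive `bits` bits, in nJ: bits * E_elec."""
    return bits * E_ELEC_NJ_PER_BIT


def tx_energy_nj(bits, distance_m):
    """Energy to transmit `bits` bits over `distance_m` meters, in nJ.

    ASSUMPTION: first-order radio model, E_elec * bits + eps_amp * bits * d^2;
    the source's own transmit equation is not available here."""
    amp_nj = EPS_AMP_PJ_PER_BIT_M2 * bits * distance_m ** 2 / 1000.0  # pJ -> nJ
    return bits * E_ELEC_NJ_PER_BIT + amp_nj


rx = rx_energy_nj(PACKET_BITS)        # 320 bits * 50 nJ/bit = 16000 nJ
tx = tx_energy_nj(PACKET_BITS, 7.0)   # at the 7 m default distance
```

Under this assumed model, sending one packet over 7 m costs 17568 nJ, so at this distance the circuit term dominates and the number of bits transmitted is the main lever for saving energy.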

Experimental Result.
To evaluate the energy consumption of each outlier detection algorithm, we ran our own simulator for 1000 epochs and plotted the total energy consumption.
Figure 9 shows the energy consumption for varying window sizes. When the window is small (i.e., a window size of 2), BF shows the best performance, since BF transmits only sensor readings whereas DC and PC transmit cluster information.
However, as the window size increases, the performance of PC and DC improves, since cluster information rather than sensor readings is transmitted. Furthermore, the performance gap between PC and DC widens as the window size increases. As the window size increases, the number of nonoutlier clusters (NOCs) increases in PC. Thus, the size of the data to be transmitted decreases. In DC, the energy consumption is also reduced, since DC is based on data clustering like PC. However, our proposed PC is better than DC because PC uses the permission range to determine the cluster width. Additionally, in DC, the mean of the ICDs needs to be broadcast to obtain the outliers in each sensor node. Our proposed algorithm PC outperforms DC by about 58% on average.
Figure 10 shows the energy consumption for varying values of the user-defined threshold. As shown in Figure 10, the performance of all techniques is stable as the threshold varies. The energy consumption of DC is constant, since the intercluster information has a fixed size. The energy consumption of PC increases slightly, since the number of sensor readings in an outlier candidate cluster increases slightly. Nonetheless, our proposed technique PC shows the best performance in terms of energy efficiency. In this experiment, our proposed technique outperforms DC by about 51%.
Figure 11 shows the energy consumption for varying dimensionality of the sensor readings. The energy consumption of all techniques increases with the dimensionality, because a packet carrying multidimensional readings contains more information than one carrying one-dimensional readings. The energy consumption of PC is lower than that of BF and DC. In this experiment, our proposed technique outperforms DC by about 48%.

Conclusion
In this paper, we present an efficient outlier detection technique in WSNs.To obtain the appropriate width of clusters, we applied the Pigeonhole Principle.In our proposed technique, each sensor node in WSNs constructs and merges the clusters based on the permission range PR.Then, our proposed technique uses two kinds of clusters (NOC and OCC) in order to detect the outliers and reduce the energy consumption of each sensor.In our experiments with a set of real-life data, we show that our proposed technique outperforms existing techniques significantly.
$r_6\}$, and $C_4 = \{r_7, r_8, r_9, r_{10}, r_{11}\}$,
$CR_i = [\min_i, \max_i]$ and $CR_j = [\min_j, \max_j]$, where $\min_i < \min_j$, if $\min_j - \max_i \le PR$, we say $CR_i$ and $CR_j$ overlap within PR. If a pair of cluster ranges overlap within PR, there are at least two sensor readings that are close and contained in different clusters. Thus, if two cluster ranges $CR_i$ ($= [\min_i, \max_i]$) and $CR_j$ ($= [\min_j, \max_j]$) of clusters $C_i$ and $C_j$ overlap within the permission range PR, the two clusters $C_i$ and $C_j$ are merged into a new cluster whose CR is $[\mathrm{MIN}(\{\min_i, \min_j\}), \mathrm{MAX}(\{\max_i, \max_j\})]$.
Scheme. Generally, since the sensor nodes in WSNs are equipped with several sensor devices, sensor readings are multidimensional.

Figure 5: Clustering in $s_4$ and $s_6$.

$r_2, \ldots, r_n\}$ ($|R| = n$). An $m$-dimensional sensor reading $r_i$ is represented as $\langle r_i[1], \ldots, r_i[m] \rangle$, where $r_i[j]$ denotes the $j$th dimensional value of $r_i$. On each dimension $j$ such that $1 \le j \le m$, we obtain the permission range $PR_j$ =

Definition 3. Let an $m$-dimensional cluster range of a cluster $C_a$ be $CR_a$, where the $j$th dimensional ranges of $CR$ … $PR_j$ on a dimension $j$, we say $CR_a$ and $CR_b$ overlap within $PR_j$ on the $j$th dimension.

Based on Definition 3, if $CR_a$ and $CR_b$ overlap within $PR_j$ for every dimension $j$ with $1 \le j \le m$, we merge the clusters $C_a$ and $C_b$ into a new cluster. For example, given three cluster ranges $CR_1$, $CR_2$, and $CR_3$ where $m = 2$, $n = 7$, $PR_1 = 1$, and $PR_2 = 1.5$ in Table 1, we merge $CR_1$ and $CR_2$ based on Definition 3.

Table 1: Example of CRs on 2 dimensions.
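The multidimensional merge condition can be sketched as a per-dimension gap check: two cluster ranges are merged only if, on every dimension, the gap between their ranges is at most the permission range of that dimension. The function names and the three cluster ranges below are hypothetical (the contents of Table 1 are not available here); only the two permission ranges, 1 and 1.5, are taken from the example in the text.

```python
def overlaps_within(cr_a, cr_b, prs):
    """True if, on every dimension, the gap between the two ranges is at
    most that dimension's permission range. Each cluster range is a list of
    (lo, hi) pairs, one per dimension."""
    return all(
        max(a_lo, b_lo) - min(a_hi, b_hi) <= pr
        for (a_lo, a_hi), (b_lo, b_hi), pr in zip(cr_a, cr_b, prs)
    )


def merge_ranges(cr_a, cr_b):
    """Per-dimension union of two cluster ranges."""
    return [(min(a_lo, b_lo), max(a_hi, b_hi))
            for (a_lo, a_hi), (b_lo, b_hi) in zip(cr_a, cr_b)]


prs = [1.0, 1.5]  # permission ranges of the two dimensions, as in the text

# Hypothetical two-dimensional cluster ranges (not the values of Table 1).
cr1 = [(20.0, 21.0), (40.0, 42.0)]
cr2 = [(21.5, 23.0), (43.0, 44.0)]
cr3 = [(21.5, 22.0), (50.0, 51.0)]

can_merge_12 = overlaps_within(cr1, cr2, prs)  # gaps 0.5 and 1.0: both within PR
can_merge_13 = overlaps_within(cr1, cr3, prs)  # gap 8.0 on dimension 2: too far
merged_12 = merge_ranges(cr1, cr2)
```

Taking the larger lower bound minus the smaller upper bound gives the same result as ordering the two ranges by their minimum first, and it is negative when the ranges already intersect.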