IP2P K-means: An Efficient Method for Data Clustering on Sensor Networks

Many wireless sensor network applications rely on data gathering as the most important part of their operation. There is increasing demand for innovative methods that improve energy efficiency and prolong network lifetime. Clustering is considered an efficient topology control method in wireless sensor networks, one that can increase network scalability and lifetime. This paper presents IP2P K-means (Improved P2P K-means), a method that uses efficient leveling in the clustering approach, reduces false labeling and restricts the necessary communication among sensors, which in turn saves energy. The proposed method is examined in Network Simulator version 2 (NS2), and the preliminary results show that the algorithm works effectively and with relatively higher precision.


Introduction
During the past few years, wireless communication devices have grown in popularity worldwide, which has raised interest in communication infrastructure and led to the emergence of wireless sensor networks (WSN) (ChitraDevia et al. 2012). These networks normally include intelligent sensors equipped with advanced microsensors to sense their environment, a small processor and, in some cases, a short-range wireless communication device.
In such networks, sensors communicate with each other to build a global picture of the environment. In many sensor network applications, real-time data processing and globally meaningful techniques for intelligent and rapid decision making are unavoidable (Khalil & Attea 2011; Schaffer et al. 2012). To take advantage of these models, the collected information must be mined, and the primary concern is how to cluster the data with an appropriate data mining technique so that groups of similar objects with common attributes can be processed together. Clustering the sensor data gives an overall view of how the data are distributed, and it is the first step in processing the data (Aioffi, Valle et al. 2011). Clustering is also considered one of the effective solutions for enhancing the energy efficiency and scalability of large-scale wireless sensor networks. The primary objective of clustering is to identify a subset of nodes in a wireless sensor network through which all other nodes communicate with the network sink (Bhardwaj, SoniDinesh et al. 2012). However, many existing clustering algorithms are tightly coupled with exact sensor locations derived through either triangulation techniques or extra hardware such as GPS equipment. In practice, it is difficult to determine sensor location coordinates precisely because of influencing factors such as random deployment and low-power, low-cost sensing devices (Ribas, Colonna et al. 2012; Silva, Chiky et al. 2012; Wei, Chen et al. 2012).
Because the network is distributed and its communication resources are restricted and not fully known in advance, distributed algorithms are necessary. In this paper, we present a new distributed data clustering method for sensor networks that takes bandwidth, energy and memory restrictions into account (Liu & Li, 2012).

Sensor networks
Sensor networks always face a variety of challenges, including restrictions on energy, data processing, communication and routing. The design of protocols and routing algorithms that minimize energy consumption in sensor networks is an area of open research. Routing protocols must provide three main capabilities: identifying topology changes, establishing communication in the network and detecting appropriate routes. When nodes are in the sleep state, the presence of intermediate nodes increases packet transmission delay (Akkaya & Senel, 2009; Bajaber & Awan 2011).

Material and methods
In this paper, we present a clustering algorithm whose primary part handles data stream processing, while the other part is responsible for the final data clustering. Because a data stream is a continuous flow, the stream processing part of the algorithm is always running. It is impossible to store the entire data stream in main memory, so the proposed algorithm is an approximation algorithm. The method seeks a solution whose objective function value is within a constant factor of the optimal value.
The proposed algorithm uses location reduction to process the data stream in restricted memory. Location reduction is the transformation of m data points into l points (l < m) such that the l points retain the characteristics of the m points.
The steps of the proposed algorithm are:
1. Continue sampling as long as the number of observed points, that is, the data gathered by the sensor node, has not exceeded the memory constraint (m).
2. Using the classical k-means algorithm, calculate O(k) centers for the m points and replace the m points with them. This location reduction step uses 2k centers. Clustering is more precise when k is higher, but memory consumption is also higher. Assign a weight to every center; the weight is the number of points assigned to it.
3. Repeat steps 1 and 2 until m^2/(2k) points have been read and m centers have been obtained. These initial centers are considered level-1 centers.
4. Use the k-means algorithm to reduce the m level-1 centers to 2k level-2 centers.
5. Keep at most m level-i centers in memory, and produce 2k level-(i+1) centers when the number of level-i centers reaches m. The weight of a new center is the sum of the weights of the centers assigned to it.
6. If global clustering is requested, apply the k-means algorithm to all centers created at all levels; otherwise go to the previous step.
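The steps above can be illustrated with a minimal single-node sketch. It is not the authors' implementation: the names `StreamReducer`, `weighted_kmeans`, `nearest` and `sq_dist`, the fixed iteration count, the random seeding and the one-dimensional test readings are all assumptions made for illustration, and no inter-node communication is modeled.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda c: sq_dist(point, centers[c]))


def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Lloyd-style k-means on weighted points (illustrative helper).

    Returns (centers, center_weights). A center's weight is the total
    weight of the points assigned to it; empty clusters are dropped.
    """
    k = min(k, len(points))
    if k == 0:
        return [], []
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c]
            total = sum(weights[i] for i in members)
            if total > 0:
                centers[c] = [
                    sum(weights[i] * points[i][d] for i in members) / total
                    for d in range(dim)
                ]
    assign = [nearest(p, centers) for p in points]
    center_weights = [0.0] * k
    for i, a in enumerate(assign):
        center_weights[a] += weights[i]
    kept = [c for c in range(k) if center_weights[c] > 0]
    return [centers[c] for c in kept], [center_weights[c] for c in kept]


class StreamReducer:
    """Hypothetical per-node buffer following steps 1 to 6."""

    def __init__(self, k, m):
        self.k, self.m = k, m
        self.raw = []     # unreduced readings (step 1)
        self.levels = []  # levels[i] holds (center, weight) pairs of level i+1

    def add(self, point):
        self.raw.append(tuple(point))
        if len(self.raw) >= self.m:  # step 2: m points -> 2k weighted centers
            cs, ws = weighted_kmeans(self.raw, [1.0] * len(self.raw), 2 * self.k)
            self.raw = []
            self._push(0, list(zip(cs, ws)))

    def _push(self, level, items):
        while len(self.levels) <= level:
            self.levels.append([])
        self.levels[level].extend(items)
        if len(self.levels[level]) >= self.m:  # steps 3 to 5
            pts = [c for c, _ in self.levels[level]]
            wts = [w for _, w in self.levels[level]]
            cs, ws = weighted_kmeans(pts, wts, 2 * self.k)
            self.levels[level] = []
            self._push(level + 1, list(zip(cs, ws)))

    def finalize(self):
        """Step 6: cluster everything still in memory into k centers."""
        pts, wts = list(self.raw), [1.0] * len(self.raw)
        for lvl in self.levels:
            for c, w in lvl:
                pts.append(c)
                wts.append(w)
        return weighted_kmeans(pts, wts, self.k)


# Usage: 200 scalar readings drawn around 0 and around 10 (made-up data).
reducer = StreamReducer(k=2, m=20)
data_rng = random.Random(1)
for i in range(200):
    base = 0.0 if i % 2 == 0 else 10.0
    reducer.add((base + data_rng.uniform(-1, 1),))
final_centers, final_weights = reducer.finalize()
```

With m = 20 and k = 2, each batch of 20 readings collapses to at most 4 weighted centers, so a level fills after roughly m/(2k) = 5 batches, which matches the m^2/(2k) = 100 points of step 3. Weights are summed on every reduction, so the total weight of the final centers equals the number of readings consumed.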

Results
To evaluate the efficiency of the proposed algorithm, two indicators are used. To measure the precision of the proposed clustering, the first indicator (LRI) shows the percentage of errors in data point labeling. This indicator is the proportion of data points whose cluster labels differ between the two executions (the distributed algorithm and the non-distributed, central algorithm), and is defined as follows:
LRI = ILC/n × 100%   (1)
where ILC is the number of points whose cluster labels differ between the distributed and central algorithms and n is the total number of points. The second indicator is the average distance between the cluster centers of the central and distributed approaches. We denote this indicator by DRC_D, where J is the number of clusters, p is the number of network nodes, C_i^D(j) is the center of the j-th cluster at the i-th node in the distributed algorithm, C_j^C is the center of the j-th cluster in the central k-means algorithm and ||·|| is the Euclidean (second-order) norm.
To evaluate the performance of the proposed algorithm, a 400-second scenario with 24 nodes is run on NS2. To generate different traffic, the scenario consists of 8 UDP agents with an FTP application attached and 16 TCP nodes carrying CBR traffic. The data rate is 15 Mbit/s for CBR and 11 Mbit/s for FTP. The proposed algorithm is run on every node (Fig. 1). We have analyzed the results based on γ.
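As a rough illustration, the two indicators might be computed as in the sketch below. The function names `lri` and `drc` are hypothetical. The LRI follows Eq. (1) directly. For the second indicator the sketch assumes that the average is taken over all p nodes and J clusters of the Euclidean distance between C_i^D(j) and C_j^C, and that cluster indices have already been matched between the distributed and central runs; both points are assumptions rather than details confirmed by the text.

```python
import math


def lri(central_labels, distributed_labels):
    """Percentage of points whose labels differ between the two runs (Eq. 1)."""
    n = len(central_labels)
    ilc = sum(1 for a, b in zip(central_labels, distributed_labels) if a != b)
    return 100.0 * ilc / n


def drc(distributed_centers, central_centers):
    """Assumed form of the second indicator: mean center-to-center distance.

    distributed_centers[i][j] is the center of cluster j at node i;
    central_centers[j] is the center of cluster j in the central run.
    """
    p = len(distributed_centers)
    J = len(central_centers)
    total = sum(
        math.dist(distributed_centers[i][j], central_centers[j])
        for i in range(p)
        for j in range(J)
    )
    return total / (p * J)


# Made-up inputs: one of four labels differs; one node, two clusters.
example_lri = lri([0, 0, 1, 1], [0, 1, 1, 1])
example_drc = drc([[(0.0, 0.0), (3.0, 4.0)]], [(0.0, 0.0), (0.0, 0.0)])
```

In practice the label-matching step matters: k-means assigns arbitrary cluster indices, so the distributed and central labelings need to be aligned before either indicator is meaningful.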

γ parameter
The proposed model uses the γ parameter as defined by Bandyopadhyay and Giannella (2006), where the termination condition is based on γ. This parameter is a threshold on the change in the centers between two consecutive iterations. A lower value of γ yields higher clustering precision, but it requires more iterations and is therefore more costly. Hence, it is important to select an appropriate value for this parameter, which requires a tradeoff between clustering precision and communication cost. The changes of the LRI with γ are shown in Fig. 2. As we can observe, as γ decreases, the communication cost increases and false labeling is reduced.
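The effect of γ on the number of iterations can be seen in a small single-node sketch. The exact criterion of Bandyopadhyay and Giannella (2006) is not reproduced in this paper, so the sketch assumes a simple rule: stop when the largest Euclidean shift of any center between two consecutive iterations is at most γ. The function names and the toy data are hypothetical, and no message exchange between nodes is modeled.

```python
import math


def max_center_shift(prev, curr):
    """Largest Euclidean movement of any center between two iterations."""
    return max(math.dist(a, b) for a, b in zip(prev, curr))


def kmeans_until_stable(points, centers, gamma, max_iters=100):
    """Plain k-means that stops once the centers move by at most gamma.

    Returns (centers, iterations_used).
    """
    centers = [tuple(c) for c in centers]
    for it in range(1, max_iters + 1):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            groups[j].append(p)
        # Recompute each center as the mean of its group; keep empty ones in place.
        new = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        shift = max_center_shift(centers, new)
        centers = new
        if shift <= gamma:
            return centers, it
    return centers, max_iters


# Toy scalar readings with a deliberately poor starting guess.
points = [(0.0,), (1.0,), (2.0,), (9.0,), (10.0,), (11.0,)]
start = [(0.0,), (1.0,)]
loose_centers, loose_iters = kmeans_until_stable(points, start, gamma=6.0)
tight_centers, tight_iters = kmeans_until_stable(points, start, gamma=0.01)
```

With the same starting centers, a smaller γ can never stop earlier than a larger one, because any shift that satisfies the tighter threshold also satisfies the looser one. In a distributed setting each additional iteration implies additional messages, which is the precision versus communication tradeoff described above.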

Conclusion
Recent advances in wireless communications and electronics have enabled the development of low-cost sensor networks. Sensor networks can be used in different applications, and they raise various technical issues that are the subject of ongoing research. A high-density wireless sensor network can be deployed for specific information-gathering tasks. In such a network, sensors need to route their sensed data to a base station, consuming a highly limited and unreplenishable energy resource. Therefore, one of the most important issues in designing sensor data gathering algorithms is to minimize energy consumption for network longevity while meeting given requirements, such as delay constraints, which may vary depending on the specific application or environmental situation.
In this paper, a new distributed data clustering method for sensor networks was proposed, motivated by the fact that communication is the main source of energy consumption in sensor networks. The proposed algorithm therefore attempts to reduce communication and message exchange by limiting false labeling, which saves energy. The proposed algorithm was tested on a scenario in NS2 (Network Simulator version 2), and the results showed efficient performance of the algorithm.

Fig. 2. The changes of LRI in accordance with γ