An Affinity Propagation-Based Self-Adaptive Clustering Method for Wireless Sensor Networks

A wireless sensor network (WSN) is an essential component of the Internet of Things (IoT), enabling information exchange and communication between ubiquitous smart objects. Clustering techniques are widely applied to improve network performance during the routing phase of a WSN. However, existing clustering methods still have drawbacks such as uneven distribution of cluster heads (CHs) and unbalanced energy consumption. Recently, much attention has been paid to intelligent clustering methods based on machine learning to solve these issues. In this paper, an affinity propagation-based self-adaptive (APSA) clustering method is presented. The advantages of K-medoids, a traditional machine learning algorithm, are combined with the affinity propagation (AP) method to achieve more reasonable clustering. AP is first used to determine the number of CHs and to search for the optimal initial cluster centers for K-medoids. Then the modified K-medoids is used to form the topology of the network by iteration. The presented method effectively avoids the weaknesses of traditional K-medoids with respect to homogeneous clustering and convergence rate. Simulation results show that the proposed algorithm outperforms recent work such as the unequal cluster-based routing scheme for multi-level heterogeneous WSN (UCR-H), the low-energy adaptive clustering hierarchy using affinity propagation (LEACH-AP) algorithm, and the energy degree distance unequal clustering algorithm (EDDUCA).


Introduction
With the development of embedded devices and micro-electro-mechanical systems (MEMS), the wireless sensor network (WSN), as an indispensable part of the Internet of Things (IoT), has also developed rapidly in recent years [1][2][3][4][5]. A WSN commonly consists of a large number of tiny sensors, which form the network in a self-organizing and multi-hop manner. WSNs have unique features such as easy deployment, self-organization, low cost and fault tolerance. Therefore, they have been widely used in applications such as environmental detection [6], industrial production monitoring [7] and smart homes [8].
One of the key research issues for WSNs is energy efficiency [9][10][11][12][13][14], since the tiny sensors are generally powered by a limited battery supply, and the battery replacement for these sensors is [...]
The energy degree distance unequal clustering algorithm (EDDUCA) [26] partitions the network using the Sierpinski triangle. The outer triangles of the sensor field generally contain more sensor nodes.

Network Model
In this paper, the network is composed of numerous sensors and a base station (BS), as shown in Figure 1. Physical information about the sensor field, such as temperature and humidity, is detected by the sensors, which transmit their monitored data to their corresponding CHs. Each sensor plays one of two roles. Member nodes monitor their surroundings and send the monitored data to their corresponding CHs. CHs not only detect environmental information, but also receive the data packages from their members and conduct data fusion. Finally, the fused data is uploaded to the BS by the CHs. The following assumptions are made to simplify the simulation.

1. All the sensors are deployed in a rectangular area by planes or other vehicles, and they remain stationary after deployment.
2. Sensor nodes can be identified by their unique IDs.
3. Each sensor knows its own position through equipment such as the Global Positioning System (GPS), and it can obtain the information of other nodes by information exchange.
4. All the sensors have the same initial energy and their batteries cannot be replaced. Once a sensor exhausts its energy, it becomes useless.
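
The assumptions above can be captured in a small data structure. The following is a minimal sketch, not the authors' simulator: the `Sensor` class and the `deploy` function are hypothetical names, and the default values (50 nodes, a 100 m × 100 m field, 2 J of initial energy) are borrowed from the simulation settings reported later in the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Sensor:
    node_id: int         # unique ID (assumption 2)
    x: float             # position, assumed known via GPS (assumption 3)
    y: float
    energy: float        # residual energy in joules (assumption 4)
    is_ch: bool = False  # role: cluster head or ordinary member node


def deploy(num_nodes=50, width=100.0, height=100.0, init_energy=2.0, seed=None):
    """Place stationary sensors uniformly at random in a rectangular field."""
    rng = random.Random(seed)
    return [
        Sensor(i, rng.uniform(0.0, width), rng.uniform(0.0, height), init_energy)
        for i in range(num_nodes)
    ]


sensors = deploy(seed=1)
```

Passing a fixed `seed` makes a deployment reproducible, which is convenient when several protocols are compared on the same network model.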

Energy Model
As previous research [15][27][28][29][30] has discussed, the energy used for transmission accounts for the majority of the total energy consumption. Therefore, only the energy consumed by transmission is considered in this paper. The transmission hardware is generally divided into two parts, the sending unit and the receiving unit, as shown in Figure 2. In the sending unit, the digital signal is transformed into an analog signal by the transmit electronics, and then the analog signal is strengthened by the amplifier. The power of the amplifier is adjustable according to the communication distance, and a threshold value d_0 is calculated to adjust it. If the communication distance is below the threshold value d_0, the free space model is used; otherwise, the multi-path fading model is used. In the receiving unit, the analog signal is transformed into a digital signal again, and the energy used in this part depends only on the amount of data.
The total energy E_Tx used by the sending unit is calculated using Formula (1), where d represents the communication distance between the source node and the target node, L denotes the length of the data package, and E_elec represents the energy consumed by transmitting one bit of data between two sensors. ε_amp is the energy consumption of the amplifier, which is calculated by Formula (2), where ε_fs represents the energy consumption for the free space model and ε_mp represents the energy consumption for the multi-path fading model. The threshold value d_0, at which the amplifier adjusts its power, is calculated by Formula (3).
The total energy E_Rx used by the receiving unit is calculated using Formula (4).
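
As an illustration of the energy model, the sketch below implements the first-order radio model as it is commonly written in the WSN literature: E_Tx = L·E_elec + L·ε_amp, with ε_amp = ε_fs·d² below d_0 and ε_mp·d⁴ otherwise, d_0 = sqrt(ε_fs/ε_mp), and E_Rx = L·E_elec. That Formulas (1) to (4) take exactly this form is an assumption, and the constants are typical literature values chosen for illustration rather than the entries of Table 2.

```python
import math

# Illustrative constants (assumed, typical values; not taken from Table 2).
E_ELEC = 50e-9       # electronics energy per bit, J/bit
EPS_FS = 10e-12      # free space amplifier coefficient, J/bit/m^2
EPS_MP = 0.0013e-12  # multi-path amplifier coefficient, J/bit/m^4


def threshold_distance(eps_fs=EPS_FS, eps_mp=EPS_MP):
    """d_0: the distance at which the two amplifier models give equal energy."""
    return math.sqrt(eps_fs / eps_mp)


def amplifier_energy(d, eps_fs=EPS_FS, eps_mp=EPS_MP):
    """Per-bit amplifier energy: free space below d_0, multi-path otherwise."""
    if d < threshold_distance(eps_fs, eps_mp):
        return eps_fs * d ** 2
    return eps_mp * d ** 4


def tx_energy(bits, d):
    """Energy spent by the sending unit to transmit `bits` over distance d."""
    return bits * E_ELEC + bits * amplifier_energy(d)


def rx_energy(bits):
    """Energy spent by the receiving unit; depends only on the data volume."""
    return bits * E_ELEC
```

With these assumed constants the threshold is roughly 88 m, so every link inside a 100 m × 100 m field with a central BS and short intra-cluster hops would mostly fall under the free space branch.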

The Proposed Affinity Propagation-Based Self-Adaptive (APSA) Algorithm
In this section, the affinity propagation-based self-adaptive (APSA) algorithm is described in detail. APSA consists of an initial phase, a set-up phase and a communication phase. During the initial phase, sensors obtain from their neighbors the information necessary for forming the network. After all the preparations are finished, the set-up phase starts, in which the network topology is determined by AP and the modified K-medoids. Then the network enters the communication phase, in which data transmission is conducted.

Initial Phase
After all the sensors are deployed, the system enters the initial phase. In this phase, the network has not yet been organized, and each sensor can only obtain its own location by GPS and record its residual energy. Then sensors exchange their own information with their neighbors until the sink has obtained the information of all the sensors. When the information exchange is finished, the system enters the set-up phase.

Set-Up Phase
The main goal of the set-up phase is to find the CHs and divide all the sensor nodes into appropriate clusters. During this phase, the AP algorithm is first used to find the optimal number of clusters and the positions of the initial cluster centers. Then the K-medoids algorithm is used to obtain the final clustering result. In the traditional K-medoids algorithm, the initial cluster centers are selected randomly, which means that the algorithm needs more iterations to converge. Additionally, the traditional K-medoids easily runs into local optima. To solve these problems, AP is adopted to determine the initial cluster centers and thereby enhance the performance of K-medoids.
Firstly, the similarity between sensors is calculated by Formula (5), where X represents the location of the sensors and s(m, n) denotes the similarity between node m and node n, defined as the negative of the squared Euclidean distance between them. The similarity indicates how suitable node n is to be the CH of node m. For each node n, a real number s(n, n) represents the preference with which it will be chosen as a cluster head; s(n, n) is calculated by Formula (6), where p represents the negative cost of adding a cluster. Numerous simulations show that the AP algorithm achieves a good result when p is set to -6000.
Let r denote the responsibility and a the availability. a is first set to zero, and then r and a are updated using Formulas (7) and (8), where r(m, n) measures how well suited node n is to serve as the CH of node m, and a(m, n) measures how appropriate it is for node m to select node n as its CH. Finally, Formula (9) is used to calculate the initial cluster centers, where T represents the set of the initial cluster centers. The pseudocode of AP is given in Algorithm 1.
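
For reference, the update rules of the standard affinity propagation algorithm can be written in the notation of this section as follows. This is a sketch of the textbook formulation; that Formulas (5) to (9) coincide with it term by term is an assumption, since only their symbol definitions are given here.

```latex
\begin{align*}
s(m,n) &= -\lVert X_m - X_n \rVert^{2}, \qquad m \neq n \\
s(n,n) &= p \\
r(m,n) &\leftarrow s(m,n) - \max_{n' \neq n} \bigl\{ a(m,n') + s(m,n') \bigr\} \\
a(m,n) &\leftarrow \min \Bigl\{ 0,\; r(n,n) + \sum_{m' \notin \{m,n\}} \max\{0,\, r(m',n)\} \Bigr\}, \qquad m \neq n \\
a(n,n) &\leftarrow \sum_{m' \neq n} \max\{0,\, r(m',n)\} \\
T &= \bigl\{\, n : r(n,n) + a(n,n) > 0 \,\bigr\}
\end{align*}
```

Because every off-diagonal similarity is negative, a more negative preference p makes it more costly for a node to declare itself a cluster center, so fewer centers tend to emerge.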

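Algorithm 1: The method for obtaining initial cluster centers
calculate the similarity s(m, n) between every pair of nodes by Formula (5);
set the preference s(n, n) of every node n by Formula (6);
set a(m, n) to zero for all m and n;
Repeat
  For each node m
    For each node n
      update r(m, n) by Formula (7);
      update a(m, n) by Formula (8);
    End for
  End for
  calculate the set T by Formula (9);
Until T does not change
Output: the set T of initial cluster centers.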
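
The procedure of Algorithm 1 can be sketched in plain Python as below. This is a minimal illustration of standard affinity propagation for 2-D positions, not the authors' MATLAB code. The function name and the `damping`, `max_iter` and `stable_iters` parameters are assumptions: damping is a common stabilizing device in AP implementations, and "T does not change" is interpreted here as T staying the same for `stable_iters` consecutive iterations.

```python
def affinity_propagation(points, preference, damping=0.5, max_iter=200, stable_iters=15):
    """Return the indices of the initial cluster centers (needs >= 2 points)."""
    n = len(points)
    # s(m, n): negative squared Euclidean distance; s(n, n) is the preference p.
    s = [[-((xm - xn) ** 2 + (ym - yn) ** 2) for (xn, yn) in points]
         for (xm, ym) in points]
    for k in range(n):
        s[k][k] = preference
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities, initially zero
    centers, unchanged = [], 0
    for _ in range(max_iter):
        # Responsibility update: compare each candidate with its best rival.
        for m in range(n):
            scores = [a[m][k] + s[m][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            runner_up = max(v for k, v in enumerate(scores) if k != best)
            for k in range(n):
                target = s[m][k] - (runner_up if k == best else scores[best])
                r[m][k] = damping * r[m][k] + (1 - damping) * target
        # Availability update: accumulate positive support for candidate k.
        for k in range(n):
            positive = [max(0.0, r[m][k]) for m in range(n)]
            positive[k] = r[k][k]  # r(k, k) enters the sum unclipped
            total = sum(positive)
            for m in range(n):
                if m == k:
                    target = total - r[k][k]
                else:
                    target = min(0.0, total - positive[m])
                a[m][k] = damping * a[m][k] + (1 - damping) * target
        current = [k for k in range(n) if r[k][k] + a[k][k] > 0]
        unchanged = unchanged + 1 if current == centers else 0
        centers = current
        if centers and unchanged >= stable_iters:
            break
    return centers
```

A call such as `affinity_propagation(positions, preference=-6000)` would mirror the setting reported in the paper. The number of centers returned depends on the preference and on the node layout, which is what lets the method adapt the number of CHs to the network.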
The initial cluster centers obtained through the AP algorithm are not optimal, and there may be outliers. Because of these disadvantages, the K-medoids algorithm is adopted to further optimize the clustering results. K-medoids uses real points as the cluster centers instead of virtual points, and therefore the absolute error can be effectively reduced. By combining the advantages of the AP and K-medoids algorithms, the distance between each member node and its corresponding CH is minimized. Formula (10) describes the problem that the algorithm needs to solve, namely minimizing the absolute error criterion σ, where s is a common node in T_i and T_i represents the set of nodes of cluster i.
A greedy method is adopted to minimize σ. The set of cluster centers is represented as T = {τ_1, τ_2, ..., τ_j, ..., τ_(k−1), τ_k}. A node τ_random is randomly selected in the network to replace a node in set T; the residual energy of τ_random must be higher than that of the other nodes in set T. Formula (11) describes the replacement method.
The new absolute error criterion σ* is calculated by Formula (10). If σ(t), the criterion in the t-th iteration, is greater than σ*, T(t+1) is set to T*.
During the iteration, the remaining energy of each CH is monitored. In each iteration, once the average residual energy of all sensor nodes is greater than that of a CH, the CH must give up the election and become a member node. By repeating Formula (11), the final clustering results are obtained. The pseudocode of the modified K-medoids is given in Algorithm 2.

Algorithm 2: The method for clustering
let T as the set of initial cluster centers;
calculate the number of initial cluster centers k = |T|
Repeat
  assign each remaining common node to the cluster with the nearest medoid;
  randomly select a common sensor node τ_random;
  calculate the cost function S (S = σ* − σ) of swapping node τ_j with τ_random;
  if S < 0 then swap τ_j with τ_random to form the new set of k clusters;
Until no change
Output: a set of k clusters.
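
The swap loop of Algorithm 2 can be illustrated as follows. This is a simplified sketch under stated assumptions rather than the authors' implementation: instead of drawing τ_random at random, it sweeps over every common node as a swap candidate, which gives the same acceptance rule (S = σ* − σ < 0) a deterministic stopping point. The energy condition is read as "the candidate must have at least as much residual energy as the medoid it replaces", and the rule that demotes a CH whose energy falls below the network average is not covered. All function names are hypothetical.

```python
import math


def total_cost(points, medoids):
    """Absolute error criterion: sum of distances to the nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def assign(points, medoids):
    """Map each medoid index to the indices of the nodes closest to it."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
        clusters[nearest].append(i)
    return clusters


def modified_k_medoids(points, energy, initial_medoids):
    """Greedy medoid swapping, starting from the centers found by AP."""
    medoids = list(initial_medoids)
    cost = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for j in range(len(medoids)):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                # Assumed energy rule: never swap in a node with less energy.
                if energy[cand] < energy[medoids[j]]:
                    continue
                trial = medoids[:j] + [cand] + medoids[j + 1:]
                trial_cost = total_cost(points, trial)
                if trial_cost - cost < 0:  # S = sigma* - sigma < 0
                    medoids, cost = trial, trial_cost
                    improved = True
    return assign(points, medoids), cost
```

Because a swap is accepted only when the cost strictly decreases, the loop always terminates. Starting from well-placed AP centers typically leaves few improving swaps, which is the convergence benefit the text attributes to the combination.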

Communication Phase
The clustering algorithm is executed on a remote server, and the clustering result is sent to each sensor by broadcasting. When the sensor nodes receive the clustering message, the real network architecture is established. In each round, the member nodes communicate with their corresponding CHs to upload the monitored data and their own residual energy. Each CH gathers the monitored data of its members and then conducts data fusion to filter out redundant data. Next, the compressed data is transmitted to the BS. At the end of each round, the BS uploads all the data of the round to the remote server. Finally, the remote server calculates the network topology for the next round and returns it to the BS. The BS decides whether a reconstruction message needs to be sent by checking whether the topology of the current round is the same as that of the previous round. The next round starts with a message from the BS, and the network repeats the process from the set-up phase.

Simulation Parameters
MATLAB is a powerful engineering software package that has been widely used in automatic control, machine design and mathematical statistics. Researchers can solve complicated engineering problems efficiently using its integrated toolboxes. Additionally, MATLAB can dynamically simulate the operation of a system and conveniently visualize the data. MATLAB R2016a is run on a personal computer equipped with an Intel Core i5 central processing unit (CPU) to test the performance of the proposed APSA. The simulator randomly generates the sensors in a specific area with the same initial energy. A round is used as the period of the network, and in each round a sensor needs to upload a data package to the base station via single-hop or multi-hop communication. According to [19], considering the discriminability and run time of the simulation results, the initial energy E_init of each node is set to 2 J and the data aggregation energy E_DA is 5 nJ/bit/signal. All the relevant parameters used in the simulation are listed in Table 2. It is assumed that the sensors can communicate with the other nodes in their transmission range. In each round, each node generates a data package which contains the monitored information of its surroundings, and the target of the network is to gather the packages of all sensors.
In the simulation, 50 sensor nodes are first deployed randomly in a 100 m × 100 m sensor field. Then a BS is set at the center of the monitoring area. The AP algorithm is executed to search for the optimal initial cluster centers, and the modified K-medoids is adopted to form clusters. After the clusters are formed, the BS collects data at regular intervals. Generally, a normal node transmits the monitored data directly to its CH if the CH is within its one-hop transmission range. Otherwise, it chooses a relay node by a greedy algorithm to forward its data package to the CH. In the greedy algorithm, a node chooses as the relay a neighbor which is closer to the sink than itself. After the data is received by the CH, the CH compresses the data and forwards it to the BS.
Figure 3 shows the final clustering result of APSA. As clearly shown in Figure 3, the proposed algorithm divides all the sensors into five clusters. The small dots denote the sensors, and the blue lines represent the virtual links between sensors and CHs. Another 50 sensors are then added to the network to test the presented algorithm, and the simulation result is illustrated in Figure 4. APSA changes the number of CHs adaptively and divides all sensors into six clusters.
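
The greedy relay selection described above can be sketched as follows. The paper only states that the relay must be a neighbor closer to the destination than the current node; picking the closest such neighbor is an assumption made here to keep the choice deterministic, and the function names and the `tx_range` parameter are hypothetical.

```python
import math


def next_hop(node, target, positions, tx_range):
    """Return the target if it is in range, else a neighbor that is closer to it."""
    own_dist = math.dist(positions[node], positions[target])
    if own_dist <= tx_range:
        return target
    candidates = [
        n for n in positions
        if n != node
        and math.dist(positions[node], positions[n]) <= tx_range
        and math.dist(positions[n], positions[target]) < own_dist
    ]
    if not candidates:
        return None  # no neighbor makes progress toward the target
    return min(candidates, key=lambda n: math.dist(positions[n], positions[target]))


def greedy_route(source, target, positions, tx_range):
    """Follow greedy hops from source to target; None if the route gets stuck."""
    path = [source]
    while path[-1] != target:
        hop = next_hop(path[-1], target, positions, tx_range)
        if hop is None:
            return None
        path.append(hop)
    return path
```

Each hop strictly reduces the distance to the target, so the route cannot loop. It can, however, get stuck when no neighbor is closer, which is a known limitation of purely greedy forwarding.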

Analysis of Energy Consumption
The presented APSA is compared with LEACH-AP, UCR-H and EDDUCA, which are all centralized routing protocols. For each protocol, 50 different samples of the network model are generated to execute the protocol, and the reported result is the average over the repeated simulations. In Figure 5, the x-axis represents the number of rounds the network has run and the y-axis represents the total energy consumption of the network. Figure 5 shows that, as the rounds proceed, the total energy consumption of the presented APSA increases more slowly than that of the other three algorithms. At about the 1000th round, APSA achieves about 33.33%, 52.5% and 54.21% performance gain compared to UCR-H, LEACH-AP and EDDUCA, respectively.

Figure 5. Energy consumption between different algorithms (50 sensor nodes).

Analysis of Network Lifetime
The network lifetime is defined as the time at which about half of the sensors in the network have died. At this point, the network is divided into several isolated portions, which leads to a serious decline in network performance. For a fair evaluation of the different protocols, the same network model is used to execute the APSA, UCR-H, LEACH-AP and EDDUCA algorithms. The simulation result is shown in Figure 6. As Figure 6 shows, the lifetime of APSA is 1511 rounds, and it achieves about 16.23%, 31.39% and 51.1% performance gain compared to UCR-H, LEACH-AP and EDDUCA, respectively.
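
The lifetime definition above can be made concrete with a small helper. This is a sketch under one assumption: "about half" is interpreted as the first round in which at least half of the nodes (rounded up) are dead. The function name and the input format are hypothetical.

```python
import math


def network_lifetime(death_rounds):
    """death_rounds[i] is the round in which node i died, or None if alive.

    Returns the round at which at least half of the nodes are dead,
    or None if that point has not been reached.
    """
    needed = math.ceil(len(death_rounds) / 2)
    dead = sorted(r for r in death_rounds if r is not None)
    if needed == 0 or len(dead) < needed:
        return None
    return dead[needed - 1]
```

For example, with four nodes dying in rounds 100, 200 and 300 and one node still alive, the lifetime under this reading is round 200.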

Analysis of Clustering Result
Reasonable CHs are expected to be selected during the selection procedure, and one significant criterion for evaluating their reasonableness is the average communication distance between a CH and its members. The same simulation parameters are used and the number of sensors is set to 100. The presented algorithm is compared with LEACH-AP [19], and the simulation result is shown in Figure 7. From Figure 7, it can be seen that the presented APSA greatly reduces the intracluster communication distance and improves performance by about 30.5% compared to LEACH-AP.

Uneven distribution of CHs results in unbalanced energy consumption between clusters and accelerates the death of nodes. The more evenly the CHs are distributed, the closer the numbers of members in the clusters will be. Therefore, the evenness of the CH distribution can be evaluated by the difference between the largest and the smallest number of cluster members. The simulation result is illustrated in Figure 8. From Figure 8, it can be seen that, in the worst case, the difference is 2 for the presented APSA, whereas it can reach 6 for LEACH-AP.
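
The two clustering criteria used in this section, the average CH-to-member distance and the difference between the largest and smallest cluster sizes, can be computed as in the sketch below. The function names and the cluster representation (a mapping from each CH index to the list of its member indices, excluding the CH itself) are assumptions made for illustration.

```python
import math


def average_intracluster_distance(points, clusters):
    """Mean distance between each member node and its cluster head."""
    distances = [
        math.dist(points[ch], points[member])
        for ch, members in clusters.items()
        for member in members
    ]
    return sum(distances) / len(distances)


def member_count_spread(clusters):
    """Difference between the largest and smallest number of cluster members."""
    sizes = [len(members) for members in clusters.values()]
    return max(sizes) - min(sizes)
```

A smaller average distance means cheaper intracluster transmissions, and a smaller spread means the CHs carry more similar loads.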

Study of Affinity Propagation (AP) Preference
The parameter p has a great impact on the performance of APSA in terms of convergence time and number of clusters. Different values of p are tested under the same network model (100 sensors), and the simulation result is shown in Table 3. As shown in Table 3, when p is set to -6000, the algorithm converges faster and achieves a more reasonable number of clusters.

Discussion
The initial cluster centers are obtained by iteration using the AP algorithm. As the scale of the network increases, the time needed to calculate the initial cluster centers also increases rapidly. Therefore, one drawback of the presented clustering method is that it is not suitable for large-scale WSNs. Additionally, the presented APSA can be improved by adjusting the value of the AP preference. The value of the AP preference is obtained empirically and has a significant influence on the performance of AP; -6000 is a suitable value for the AP preference, but we cannot ensure that it is the optimal one. Therefore, our future work will focus on optimizing the parameter p.
The simulations in this paper are run in MATLAB, which can only approximate the real world. In real applications, many other problems need to be solved. For example, the simulation assumes that transmissions between sensors are always successful, while in a real environment a transmission may fail because of harsh conditions or a busy communication channel. Therefore, the presented algorithm still needs to be improved to adapt to real environments.
Our future work will also focus on improving the scalability of the method. We will further combine mobile sink technology and data fusion technology with our clustering method to improve performance.

Conclusions
The design of an energy-efficient routing algorithm has always been an important research issue for WSNs. In this paper, an adaptive clustering method based on the AP algorithm is presented, which reduces the average data transmission distance of the network and balances the routing load. It first uses the AP algorithm to calculate the initial cluster centers. Then a modified K-medoids algorithm is adopted to partition the whole network into clusters, starting from the initial cluster centers calculated by AP. Simulation results show that about 33.33%, 52.5% and 54.21% performance gain is achieved in terms of energy consumption, and about 16.23%, 31.39% and 51.1% performance gain is achieved in terms of network lifetime, compared to the UCR-H, LEACH-AP and EDDUCA algorithms, respectively.

Acknowledgments:
This work is supported by the National Natural Science Foundation of China (61772454, 61811530332, 61811540410). Se-Jung Lim is the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.

Data Availability:
The data that support the findings of this study are available from the corresponding author upon reasonable request.