Research on Reliability-Oriented Data Fusaggregation Algorithm in Large-Scale Probabilistic Wireless Sensor Networks

Many studies emphasize either data aggregation or data fusion alone, which hinders thorough analysis of the sensed data and leaves the aggregation results underused. Worse, real networks always contain lossy links, yet many available aggregation algorithms are based on ideal network models and perform no further analysis or fusion of the aggregation results. We therefore propose the concept of data fusaggregation, which supports processing sensed data while it is being transmitted in large-scale probabilistic wireless sensor networks (WSNs), and a reliability-oriented data fusaggregation algorithm (RODFA) that helps users obtain monitoring information from the monitored geographic environment and measure the reliability of that information. RODFA also helps the network administrator improve the sensing performance of large-scale probabilistic WSNs. In RODFA, the parameter η, which intuitively reflects the reliability of the aggregation result, is defined and calculated; it plays an important part in helping users process the aggregation result further. Simulation results verify the validity of RODFA, and the influence of network size and network performance on data fusaggregation is analyzed.


Introduction
A wireless sensor network (WSN) is a class of wireless ad hoc network consisting of thousands of sensor nodes (SNs). Recent advances in microelectronics, wireless communications, and sensor technologies have made the development of low-cost, low-power, multifunctional sensor nodes possible [1]. Owing to their capability for pervasive surveillance, sensor networks have attracted significant attention in many application domains, such as habitat monitoring [2,3], object tracking [4,5], environment monitoring [6][7][8], military [9], traffic management [10], disaster management [11], and smart environments. In these applications, the aggregation and fusion of sensed data are very important for users to obtain summary information from monitored areas. Many works on data aggregation and data fusion for WSNs have emerged, further broadening the application domains of WSNs [12][13][14]. However, four issues remain. Firstly, existing studies rarely take into consideration that SNs are resource constrained because of their limited battery supply. Secondly, most research studies network capacity under an ideal network model that does not match real networks. Thirdly, data aggregation and data fusion each have a single, different focus: data aggregation focuses on transmitting data, while data fusion emphasizes analyzing sensed data, yet both transmission and analysis are important to WSN applications. Lastly, there has been little in-depth discussion of, or research on, the influence of network size and network performance on the data fusaggregation result (a concept defined below). We study these four issues in the following.
Firstly, the energy of SNs is usually supplied by battery cells that cannot be recharged while the WSN is operating, so SNs are resource constrained. Consequently, this paper takes energy saving into consideration to make the network model more practical. For better energy utilization, we choose cluster-based WSNs [15,16]. In cluster-based WSNs, SNs residing in a nearby area form a cluster and select one node of the cluster to be their cluster-head node (CH). The CH organizes the data pieces received from SNs into an aggregated result and then forwards the result to the base station (sink node) along the regular routing paths.

International Journal of Distributed Sensor Networks
Secondly, to the best of our knowledge, many of the studies mentioned above are based on the ideal deterministic network model (DNM), in which any pair of nodes in a network is either connected or disconnected. If two nodes are connected, that is, there is a deterministic link between them, a successful data transmission is guaranteed as long as there is no collision. Otherwise, direct communication between them is assumed to be impossible. For real applications, however, the DNM assumption is too idealistic because of the "transitional region phenomenon" [17,18]. Because of this phenomenon, a large number of network links (probably more than 90%) are unreliable; such links are called lossy links [17]. Even without collisions, data transmission over a lossy link succeeds only with a certain probability rather than being guaranteed. Therefore, a more practical network model for WSNs is the probabilistic network model (PNM) [17], in which data communication over a link succeeds with a certain probability rather than always succeeding or always failing. For convenience, WSNs considered under the DNM/PNM are called deterministic/probabilistic WSNs. Hence, to bring our research closer to real networks and give it more practical value than existing studies, we study probabilistic WSNs, as described by the network model in Section 2.
Thirdly, in order to truly process sensed data while it is transmitted among the network, users, and the network administrator, we present the concept of data fusaggregation, which builds on the definitions of data aggregation and data fusion. In many applications, WSNs are mainly used for gathering data acquired from the physical environment to an external base station [19]. During data gathering, if the raw data can be aggregated so that only an aggregation value is transmitted to the sink node, the process is called data aggregation [20]. Data fusion is a widely adopted signal processing technique that improves system sensing performance by jointly considering the measurements of multiple sensors [21]. As data aggregation and data fusion both play important roles in many WSN applications but have different focuses, processing sensed data while transmitting it has become important for WSNs. We therefore define data fusaggregation as follows.
Definition 1 (data fusaggregation). The raw data are aggregated, and only one aggregation value is transmitted to the sink node. By analyzing the aggregation value, the sink node obtains further information that can measure the performance of the network or be useful to users.
Lastly, on the basis of the concept of data fusaggregation, we propose a data fusaggregation algorithm that comprehensively analyzes and uses sensed data to help improve system sensing performance. This algorithm is named the reliability-oriented data fusaggregation algorithm (RODFA). In this algorithm, the parameter $\eta$ (the lower limit value of reliability) is first defined to measure the reliability of the aggregation result. We then obtain the formula for calculating $\eta$ through theoretical derivation. Lastly, the aggregation result and the parameter $\eta$ are sent to users through the sink, where $\eta$ serves as a reference for how users handle the information they receive. In our experiments, we study the influence of network size and network performance on the RODFA result.
The rest of this paper is organized as follows. Section 2 discusses the network model, including its assumptions and the problem definitions. Section 3 describes the mathematical foundations of RODFA. The calculation of $\eta$ for $\widehat{\mathrm{Sum}}(S_t)$ is given in Section 4. Section 5 shows the validity of RODFA. Section 6 discusses the factors affecting RODFA (network size and network performance). The last section concludes the paper.

Assumptions of the Model.
In this paper, we consider a large-scale probabilistic WSN consisting of $N$ sensors. Let $N_t$ be the number of active SNs in the monitored area at time $t$. $N_t$ varies with time and is unknown to the sink. Let $s_{ti}$ ($1 \le i \le N_t$) be the sensed data of active sensor node $i$ at time $t$, and let $S_t = \{s_{t1}, s_{t2}, \ldots, s_{tN_t}\}$ be the set of all sensed data in the monitored area at time $t$. $s_{ti}$ is stored in active sensor node $i$ for $1 \le i \le N_t$. Since all sensed data are bounded, we use $\sup(S_t)$ to denote the upper bound of all sensed data.
On the basis of the analysis in Section 1, for better energy utilization, we apply the dynamic minimal spanning tree routing protocol (DMSTRP) to our network model, and our radio power model is similar to [22]. DMSTRP is a cluster-based routing protocol for large-scale wireless sensor networks proposed in [23]. According to [23], in our network model the monitored geographical area is divided into a number of clusters, which are formed as in [24] and are mutually disjoint. Here, we assume that the monitored geographical area is fully covered by active sensors and that it is divided into $k$ clusters. Let $h$ be the maximum number of network layers at time $t$, and let $n_1, n_2, n_3, \ldots, n_h$ be the numbers of nodes in layer 1, layer 2, layer 3, ..., and layer $h$ at time $t$, respectively. Let $S_t^1 = \{s_{t1}^1, s_{t2}^1, s_{t3}^1, \ldots, s_{tn_1}^1\}$ be the set of all sensed data in layer 1 at time $t$, and let $S_t^2, S_t^3, \ldots, S_t^h$ be the sets of all sensed data in layer 2, layer 3, ..., layer $h$ at time $t$, respectively.
When the sink node initiates one round of data collection at time $t$, let $d$ be the distance of every lossy link, over which a transmission succeeds with probability $p$, and let $S_t^{(p)}$ be the set of all sensed data delivered successfully to the sink from layer 1 at time $t$. The sensed data of layer 2, layer 3, ..., layer $h$ are then delivered successfully to the sink with probabilities $p^2, p^3, \ldots, p^h$, respectively, and we let $S_t^{(p^2)}, S_t^{(p^3)}, \ldots, S_t^{(p^h)}$ be the sets of all sensed data delivered successfully to the sink node from layer 2, layer 3, ..., layer $h$ at time $t$, respectively. The functional relationship between $p$ and $d$ is presented in the next section.

Problem Definition.
In this paper, we take the SUM operation as an example when studying data fusaggregation and propose RODFA to comprehensively analyze and use sensed data, so as to further broaden the application domains of WSNs. The exact sum of the monitored area at time $t$ is defined as $\mathrm{Sum}(S_t) = \sum_{i=1}^{N_t} s_{ti}$. Before introducing the steps of RODFA, we briefly introduce some related definitions of reliability.
Definition 4 ($\eta$). For a given network and $\varepsilon$ ($\varepsilon \ge 0$), $\eta$ ($0 \le \eta \le 1$) is the lower limit of the probability that $\hat{I}$ is an $\varepsilon$-estimate of $I$; that is, $\eta \le \Pr(|(\hat{I} - I)/I| \le \varepsilon)$. We call $\eta$ the lower limit value of reliability.
$\eta$ is an important parameter that measures the reliability of aggregation results. It is sent to users to help them decide how to handle the aggregation result: whether to use it as a basis for decision making, to treat it merely as a reference point, or to discard it directly. The calculation of $\eta$ is completed in RODFA. The main steps of RODFA are shown in Figure 1 and are described in detail as follows.
Step 1. The sink node launches one round of data collection and sends the data acquisition command to every cluster head, and each cluster head forwards this command to its own child nodes.
Step 2. According to [22], every node sends its sensed data to its cluster head. After analyzing and processing the sensed data, the cluster head forwards the processing result to the sink node.
Step 3. The sink node obtains an approximate sum $\widehat{\mathrm{Sum}}(S_t)$, which is discussed in Section 3.
Step 4. According to the formula for $\eta$ (whose derivation is given in Section 4), the sink node calculates the value of $\eta$ and returns it to users together with $\widehat{\mathrm{Sum}}(S_t)$, or returns it to the network administrator to help the administrator improve the system sensing performance. The algorithm then stops.
Given the above steps, the key of RODFA is to obtain the value of $\eta$ for $\widehat{\mathrm{Sum}}(S_t)$. The problem of computing the value of $\eta$ is defined as follows.

Input: (2) $\varepsilon$ ($\varepsilon \ge 0$), $h$, and $p$; (3) aggregation operator sum.

Output: the value of $\eta$ for the estimator of the sum, $\widehat{\mathrm{Sum}}(S_t)$.
To facilitate the later analysis, we give two further definitions: unbiased estimate and delay.
Definition 5 (unbiased estimate). $\hat{I}$ is an unbiased estimate of $I$ if the mathematical expectation of $\hat{I}$ is equal to $I$, that is, $E(\hat{I}) = I$; otherwise $\hat{I}$ is a biased estimate of $I$.
According to [23], DMSTRP connects the nodes in each cluster by a minimum spanning tree (MST). In each cluster, all nodes including the CH are connected by an MST, and the CH acts as the leader that collects data from the whole tree. All CHs are connected by another MST that routes toward the sink, and data fusion is performed along the tree route. As Figure 2 shows, using MSTs reduces the average transmission distance of each node and thus the energy dissipated in transmitting data.
In Figure 2, node 1, node 2, and node 6 are leaf nodes; node 3 and node 5 are father nodes; node 4 is the cluster head node. Because node 1 and node 2 have the same father, only one of them (node 1) can transmit in the first period. Thus the transmitting queue of the first period is {1, 6}, which means that node 1 and node 6 can transmit their data at the same time. The transmitting queues of the following periods are {2, 5} and {3}. Based on the above analysis, we define delay as follows.
Definition 6 (delay). The delay is the number of periods from the moment the sink node issues a data collection command until all sensed data from the SNs have been delivered to the sink node, regardless of how long each period lasts.
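The per-period transmitting queues described for Figure 2 can be reproduced with a short sketch. The tree used here (nodes 1 and 2 under node 3, node 6 under node 5, and nodes 3 and 5 under cluster head 4) is inferred from the description in the text rather than taken from the figure itself. The scheduling rule (a node transmits only after all of its children have transmitted, at most one child per father per period, lowest ID first) is an assumption, and `transmit_schedule` is a hypothetical helper name.

```python
def transmit_schedule(father):
    """Return the list of per-period transmitting queues for one cluster tree.

    father maps each node to its father node; the cluster head maps to None.
    """
    children = {}
    for node, parent in father.items():
        if parent is not None:
            children.setdefault(parent, []).append(node)

    pending = {node for node, parent in father.items() if parent is not None}
    sent = set()
    schedule = []
    while pending:
        # A node is ready once every one of its children has already transmitted.
        ready = sorted(
            node for node in pending
            if all(child in sent for child in children.get(node, []))
        )
        period, busy_fathers = [], set()
        for node in ready:
            # A father can receive from only one child per period (assumed rule).
            if father[node] not in busy_fathers:
                busy_fathers.add(father[node])
                period.append(node)
        schedule.append(period)
        sent.update(period)
        pending.difference_update(period)
    return schedule


# Tree inferred from the description of Figure 2 (an assumption).
figure2_tree = {1: 3, 2: 3, 3: 4, 5: 4, 6: 5, 4: None}
schedule = transmit_schedule(figure2_tree)
print(schedule)
print("periods inside the cluster:", len(schedule))
```

Under these assumptions the queues match those listed in the text. Periods spent between cluster heads and the sink are not modelled in this sketch.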
According to the concept of delay, we can get that the delay of Figure 2 is

Mathematic Foundations
The calculation of the probability $p$ is the foundation of RODFA. Existing studies show that $p$ is a function of the distance $d$ between a pair of connected nodes. Next, we briefly describe the derivation of this function and then study the estimator of the sum.   of one frame, and then the functional relationship between and can be described as [18] = (1 − 0

Relationship between $p$ and $d$
In this paper, the propagation model is the lognormal shadowing model [25], and the relationship between $p$ and the distance $d$ is derived as follows.
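How a link success probability can follow from distance under lognormal shadowing is sketched below. Everything numeric here is an assumption for illustration: the reference loss, path loss exponent, transmit power, noise floor, and frame size are hypothetical values, and the bit-error expression is an NCFSK packet-reception form often paired with this propagation model in the link-layer literature. The paper's own formula and parameter settings may differ. The shadowing term is set to its zero mean unless a value is passed in.

```python
import math


def path_loss_db(d, d0=1.0, pl_d0_db=55.0, exponent=4.0, shadowing_db=0.0):
    """Lognormal shadowing path loss at distance d (metres), in dB."""
    return pl_d0_db + 10.0 * exponent * math.log10(d / d0) + shadowing_db


def snr_db(d, p_trans_dbm=0.0, noise_dbm=-105.0, **model):
    """SNR in dB: received power (dBm) minus the noise floor (dBm)."""
    p_recv_dbm = p_trans_dbm - path_loss_db(d, **model)
    return p_recv_dbm - noise_dbm


def link_success_probability(d, frame_bytes=50, **model):
    """Probability that one frame crosses a link of length d (assumed NCFSK form)."""
    snr_linear = 10.0 ** (snr_db(d, **model) / 10.0)
    # 0.64 is an assumed data-rate to noise-bandwidth ratio of the radio.
    bit_error = 0.5 * math.exp(-snr_linear / 2.0 / 0.64)
    return (1.0 - bit_error) ** (8 * frame_bytes)


for distance in (5.0, 10.0, 20.0):
    print(distance, snr_db(distance), link_success_probability(distance))
```

With these illustrative settings the probability falls from near one to near zero over a few tens of metres, which is the kind of transitional region that motivates the probabilistic network model.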
Let $PL(d)$ be the path loss at a given location at distance $d$, $d_0$ the reference distance, $n$ the path loss exponent, and $X_\sigma$ a zero-mean Gaussian random variable. Then the relationship between $PL(d)$ and $d$, in dB, is described by formula (2):

$$PL(d) = PL(d_0) + 10\,n\,\log_{10}\!\left(\frac{d}{d_0}\right) + X_\sigma. \tag{2}$$

On the basis of formula (2), let $P_{\mathrm{trans}}$ be the output power of the sender and $P_{\mathrm{recv}}$ the power received by the receiver, both in dBm. Then $P_{\mathrm{recv}}$ is obtained from

$$P_{\mathrm{recv}} = P_{\mathrm{trans}} - PL(d). \tag{3}$$

Let $P_n$ be the platform noise. From the relationship between the SNR $\gamma$ and the power received by the receiver, we see that

$$\gamma = P_{\mathrm{recv}} - P_n. \tag{4}$$

Once the model is determined, all of the parameters in the above formulas are given. As we use the lognormal shadowing model, the parameter settings for our experiment are similar to [26]. Combining formulas (1)-(4), we obtain the relationship between $p$ and $d$, referred to as formula (5).

Estimator of Sum.
Let $S_t^1, S_t^2, S_t^3, \ldots, S_t^h$ be the sets of sensed data in layer 1, layer 2, layer 3, ..., layer $h$ at time $t$, respectively, and let $S_t^{(p)}, S_t^{(p^2)}, S_t^{(p^3)}, \ldots, S_t^{(p^h)}$ be the sets of all sensed data delivered successfully to the sink node from layer 1, layer 2, layer 3, ..., layer $h$ at time $t$, respectively. The estimator of the sum is denoted by $\widehat{\mathrm{Sum}}(S_t)$ and is computed by

$$\widehat{\mathrm{Sum}}(S_t) = \sum_{i=1}^{h} \frac{1}{p^i} \sum_{s \in S_t^{(p^i)}} s,$$

where $p$ is the probability that a transmission over a lossy link succeeds at time $t$. According to the above definition of the unbiased estimate, Theorem 7 shows that $\widehat{\mathrm{Sum}}(S_t)$ is an unbiased estimator of the exact sum $\mathrm{Sum}(S_t)$.
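The unbiasedness of such an estimator can be checked numerically with a minimal sketch. It assumes that a reading in layer $i$ reaches the sink independently with probability $p^i$ and that the sink divides each layer's received sum by that probability; the layer sizes, reading range, and the value of `p` are hypothetical, and `estimate_sum` is an illustrative name rather than the paper's implementation.

```python
import random


def estimate_sum(layers, p, rng):
    """Scaled sum of the readings that reach the sink.

    layers[i] holds the readings of layer i + 1; each reading is assumed to
    arrive independently with probability p ** (i + 1).
    """
    total = 0.0
    for depth, readings in enumerate(layers, start=1):
        q = p ** depth
        received = sum(x for x in readings if rng.random() < q)
        total += received / q
    return total


rng = random.Random(0)
# Hypothetical network: three layers with 200, 500, and 300 readings.
layers = [[rng.uniform(10.0, 30.0) for _ in range(n)] for n in (200, 500, 300)]
exact = sum(sum(layer) for layer in layers)
runs = [estimate_sum(layers, 0.93, rng) for _ in range(2000)]
mean_estimate = sum(runs) / len(runs)
print(f"exact sum: {exact:.1f}, mean of 2000 estimates: {mean_estimate:.1f}")
```

A single estimate fluctuates around the exact sum, while the average over many collections settles close to it, which is what an unbiased estimator should do.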

Calculating $\eta$ of $\widehat{\mathrm{Sum}}(S_t)$
According to the steps of RODFA in Section 2.2, the key step is to calculate $\eta$ of $\widehat{\mathrm{Sum}}(S_t)$. We study this issue next.
The steps of calculating the value of $\eta$ are (1) proving that $\widehat{\mathrm{Sum}}(S_t)$ obeys a normal distribution; (2) transforming this normal distribution into the standard normal distribution; and (3) using the properties of the standard normal distribution to calculate the value of $\eta$.
For any layer $i$ ($1 \le i \le h$), let the random variable $X_j^i$ ($1 \le j \le n_i$) be as follows:

$$X_j^i = \begin{cases} s_{tj}^i, & \text{if } s_{tj}^i \text{ is delivered successfully to the sink (with probability } p^i\text{)}, \\ 0, & \text{otherwise.} \end{cases}$$

Then $\widehat{\mathrm{Sum}}(S_t^i) = (1/p^i)\sum_{j=1}^{n_i} X_j^i$. Firstly, we need to prove that $\widehat{\mathrm{Sum}}(S_t)$ obeys a normal distribution. Since a linear combination of independent normally distributed variables is still normally distributed, it suffices to prove that the sum of each layer's data obeys a normal distribution. Reference [28] shows that, if the data of each layer satisfy the Lyapunov condition, the sum of each layer's data meets the conditions of the central limit theorem; that is, it obeys a normal distribution. Theorem 8 proves that the data in layer 1, layer 2, ..., layer $h$ each satisfy the Lyapunov condition.
The method of analysis is the same for every layer, so we describe the analysis of one generic layer $i$ and omit the others.
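Theorem 8 shows that, for each layer $i$ ($1 \le i \le h$), the sequence of random variables $X_j^i$ ($1 \le j \le n_i$) satisfies the Lyapunov condition. That is, the estimate for each layer, $\widehat{\mathrm{Sum}}(S_t^i) = (1/p^i)\sum_{j=1}^{n_i} X_j^i$, obeys a normal distribution. Since whether the sensed data of one layer are delivered successfully to the sink node is independent of the other layers, $\widehat{\mathrm{Sum}}(S_t)$ is the sum of these independent normally distributed variables $\widehat{\mathrm{Sum}}(S_t^i)$. Thus $\widehat{\mathrm{Sum}}(S_t)$ obeys a normal distribution. For a given relative error limit $\varepsilon$, Theorem 9 describes the calculation of $\eta$ of $\widehat{\mathrm{Sum}}(S_t)$.

Theorem 9. Suppose $\eta = 1 - \delta$ and $\varphi_{\delta/2}$ is the $\delta/2$ quantile of the standard normal distribution. If $\varphi_{\delta/2}$ satisfies

$$\varphi_{\delta/2}^2 \le \frac{\inf(N_t)\,\inf(S_t)\,p^h\,\varepsilon^2}{(1 - p^h)\,\sup(S_t)}, \tag{21}$$

then the probability that the relative error between $\widehat{\mathrm{Sum}}(S_t)$ and $\mathrm{Sum}(S_t)$ satisfies the given error limit $\varepsilon$ is greater than or equal to $\eta$; that is,

$$\Pr\left(\left|\frac{\widehat{\mathrm{Sum}}(S_t) - \mathrm{Sum}(S_t)}{\mathrm{Sum}(S_t)}\right| \le \varepsilon\right) \ge \eta. \tag{22}$$

Proof. From formula (21), $\inf(N_t)\,\inf(S_t)\,\varepsilon^2 \ge \varphi_{\delta/2}^2\,\sup(S_t)\,((1 - p^h)/p^h)$. As $\inf(N_t)$ and $\inf(S_t)$ are, respectively, the lower limit of $N_t$ and the lower limit of the value of all sensed data, $\mathrm{Sum}(S_t) \ge \inf(N_t)\,\inf(S_t)$, so

$$\mathrm{Sum}(S_t)\,\varepsilon^2 \ge \varphi_{\delta/2}^2\,\sup(S_t)\,\frac{1 - p^h}{p^h}. \tag{23}$$

Theorem 7 shows that $\mathrm{Var}(\widehat{\mathrm{Sum}}(S_t)) \le ((1 - p^h)/p^h)\,\sup(S_t)\,\mathrm{Sum}(S_t)$ and $E(\widehat{\mathrm{Sum}}(S_t)) = \mathrm{Sum}(S_t)$. As $\widehat{\mathrm{Sum}}(S_t)$ obeys a normal distribution, formula (23) gives

$$\frac{\varepsilon\,\mathrm{Sum}(S_t)}{\sqrt{\mathrm{Var}(\widehat{\mathrm{Sum}}(S_t))}} \ge \varphi_{\delta/2}. \tag{24}$$

Combining the properties of the standard normal quantile [29] with (23), (24), and $\eta = 1 - \delta$, and writing $Z$ for a standard normal variable, we have

$$\Pr\left(\left|\frac{\widehat{\mathrm{Sum}}(S_t) - \mathrm{Sum}(S_t)}{\mathrm{Sum}(S_t)}\right| \le \varepsilon\right) = \Pr\left(|Z| \le \frac{\varepsilon\,\mathrm{Sum}(S_t)}{\sqrt{\mathrm{Var}(\widehat{\mathrm{Sum}}(S_t))}}\right) \ge \Pr\left(|Z| \le \varphi_{\delta/2}\right) = 1 - \delta = \eta.$$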
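A bound of this kind can be turned into a number with a few lines of code. The sketch below assumes the reading in which the bound involves the success probability of the deepest layer, `p ** h`, takes the largest $\varphi_{\delta/2}$ that the condition allows, and converts it to $\eta = 1 - \delta$ through the standard normal distribution. The function name and all input values are hypothetical.

```python
import math


def reliability_lower_bound(eps, p, h, n_min, s_min, s_max):
    """Lower limit of reliability (eta) for relative error bound eps.

    p      link success probability, h number of layers (deepest layer: p ** h)
    n_min  lower limit of the number of active nodes
    s_min  lower limit of the sensed values, s_max their upper bound
    """
    q = p ** h
    if q >= 1.0:
        return 1.0  # lossless links: the estimate equals the exact sum
    phi_squared = n_min * s_min * q * eps ** 2 / ((1.0 - q) * s_max)
    phi = math.sqrt(phi_squared)
    # Pr(|Z| <= phi) for standard normal Z equals erf(phi / sqrt(2)).
    return math.erf(phi / math.sqrt(2.0))


# Hypothetical inputs: 1000 active nodes with readings between 10 and 30.
for eps in (0.01, 0.03, 0.05):
    print(eps, reliability_lower_bound(eps, p=0.93, h=3, n_min=1000, s_min=10.0, s_max=30.0))
```

The returned value grows with the error bound and with the link success probability, and it shrinks as the number of layers increases.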

Validity of RODFA
To evaluate RODFA, we use Matlab to simulate a sensor network of 5000 nodes. The 5000 nodes and 285 cluster heads are randomly deployed in network areas of 5000 m × 5000 m, 10000 m × 10000 m, and 15000 m × 15000 m, and the DMSTRP protocol runs in the simulated network to connect all nodes and cluster heads into a whole. For time $t$ in the ready phase, we obtain the maximum number of network layers $h$ by counting the numbers of layers of all clusters. As all nodes are deployed randomly and as uniformly as possible in the network area, we randomly select one cluster from the network and obtain the distance $d$ of all lossy links as the mean distance between every two linked nodes in this cluster. Plugging this distance into formula (5), we get the probability $p$ at time $t$. This group of experiments investigates whether RODFA is valid. As data communication over a link in probabilistic WSNs succeeds with a certain probability rather than always succeeding or always failing [17], each calculation of the reliability of $\widehat{\mathrm{Sum}}(S_t)$ yields a different value even if all relevant information about the network is fixed (i.e., the number of active nodes, the structure of the clusters, the number of clusters, the number of nodes in every cluster, the distance of all links, and the layout of all nodes are unchanged).
We perform 10000 calculations with all relevant information in the network fixed, and let the initial energy of the sink node, cluster heads, and nodes be large enough that no node dies during the experiment. For each of these 10000 calculations, we compute the relative error between $\widehat{\mathrm{Sum}}(S_t)$ and $\mathrm{Sum}(S_t)$, both obtained from our simulated network. To obtain the value of $\mathrm{Sum}(S_t)$ in this experiment, we assume that the CHs not only forward the aggregation result to the sink node but also relay all sensed data received from other SNs or CHs. We then compute the cumulative probability under different relative error bounds $\varepsilon$, shown as the blue dashed lines in Figures 3, 4, and 5, and we call this reliability the statistical reliability. Furthermore, according to the above analysis, the sink node can easily obtain the values of $h$ and $p$ for time $t$. Substituting $h$ and $p$ into $\varphi_{\delta/2}^2 \le \inf(N_t)\,\inf(S_t)\,p^h\,\varepsilon^2 / ((1 - p^h)\,\sup(S_t))$, we calculate the value of $\eta$ of $\widehat{\mathrm{Sum}}(S_t)$ for every relative error bound $\varepsilon$ and plot it as the red forked lines in Figures 3, 4, and 5; we call this reliability the RODFA reliability.
In Figure 3, the blue dashed line shows that the statistical reliability for $\varepsilon = 0.0225$ is 0.9999, and the red forked line shows that, when the RODFA reliability of $\widehat{\mathrm{Sum}}(S_t)$ reaches 0.9999, the relative error bound between $\widehat{\mathrm{Sum}}(S_t)$ and $\mathrm{Sum}(S_t)$ is 0.0907; the difference between 0.0225 and 0.0907 is 0.0682. The blue dashed line also shows that, for the network of size 5000 m × 5000 m, the maximum relative error between $\widehat{\mathrm{Sum}}(S_t)$ and $\mathrm{Sum}(S_t)$ in the network simulation results is 0.0234. In Figure 4, when the statistical reliability of $\widehat{\mathrm{Sum}}(S_t)$ (the blue dashed line) reaches 0.9999, $\varepsilon$ is equal to 0.0367, and the calculation results of RODFA show that, when the RODFA reliability reaches 0.9999, the relative error bound is 0.1379; the difference between 0.0367 and 0.1379 is 0.1012. The blue dashed line also shows that the maximum relative error between $\widehat{\mathrm{Sum}}(S_t)$ and $\mathrm{Sum}(S_t)$ in the network simulation results is 0.0413. In Figure 5, when the statistical reliability of $\widehat{\mathrm{Sum}}(S_t)$ reaches 0.9999, the relative error bound is equal to 0.0508, and the red forked line shows that, when the RODFA reliability of $\widehat{\mathrm{Sum}}(S_t)$ reaches 0.9999, the relative error bound is 0.1768; the difference between 0.0508 and 0.1768 is 0.126. The blue dashed line also shows that the maximum relative error in the network simulation results is 0.0513.
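The comparison between the two kinds of reliability can be imitated on a toy network. The sketch below is not the Matlab experiment: it uses a hypothetical three-layer network, assumes that a reading in layer $i$ arrives with probability `p ** i`, uses the current number of readings as the lower limit of active nodes, and evaluates the bound in the `p ** h` reading. The statistical reliability is the fraction of repeated collections whose relative error stays within `eps`.

```python
import math
import random


def estimate_sum(layers, p, rng):
    """Scaled sum of delivered readings; layer i arrives with probability p ** i."""
    total = 0.0
    for depth, readings in enumerate(layers, start=1):
        q = p ** depth
        total += sum(x for x in readings if rng.random() < q) / q
    return total


def rodfa_bound(eps, p, h, n_min, s_min, s_max):
    """Lower limit of reliability implied by the bound (assumed p ** h form)."""
    q = p ** h
    phi = math.sqrt(n_min * s_min * q * eps ** 2 / ((1.0 - q) * s_max))
    return math.erf(phi / math.sqrt(2.0))


rng = random.Random(1)
layers = [[rng.uniform(10.0, 30.0) for _ in range(n)] for n in (200, 500, 300)]
readings = [x for layer in layers for x in layer]
exact = sum(readings)

p, eps, runs = 0.93, 0.03, 2000
errors = [abs(estimate_sum(layers, p, rng) - exact) / exact for _ in range(runs)]
statistical = sum(e <= eps for e in errors) / runs
bound = rodfa_bound(eps, p, len(layers), len(readings), min(readings), max(readings))
print(f"statistical reliability: {statistical:.4f}, lower bound: {bound:.4f}")
```

On this toy network the computed bound stays below the observed fraction, mirroring the gap between the two curves reported for the simulations.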
Having compared the two reliabilities figure by figure, we now analyze the features common to Figures 3, 4, and 5.
Firstly, the three figures show that the maximum relative errors on the blue dashed lines are all small. These numbers show that the data aggregation method in this paper approximates the exact sum well, which agrees with the proof that $\widehat{\mathrm{Sum}}(S_t)$ is an unbiased estimate of $\mathrm{Sum}(S_t)$. Secondly, there is a gap between the blue dashed line and the red forked line in each figure. The reasons for this gap are (1) the relaxation used when deriving the upper limit of the variance of $\widehat{\mathrm{Sum}}(S_t)$, namely $\mathrm{Var}(\widehat{\mathrm{Sum}}(S_t)) \le ((1 - p^h)/p^h)\,\sup(S_t)\,\mathrm{Sum}(S_t)$, and (2) the fact that only the distance is considered, while other factors that also affect the value of $p$ between two connected nodes are neglected.
Thirdly, as the network area increases across the three figures, the statistical reliability decreases (for the same relative error value) and the maximum relative error between $\widehat{\mathrm{Sum}}(S_t)$ and $\mathrm{Sum}(S_t)$ grows (from 0.0234 to 0.0513, shown by the blue dashed lines in the three figures).
Lastly, as the network area increases, the RODFA reliability also falls, and it changes faster than the statistical reliability (0.0682 < 0.1012 < 0.126).
The above analysis indicates that RODFA is valid and that the $\eta$ of $\widehat{\mathrm{Sum}}(S_t)$ calculated by RODFA is very likely to be the minimum of the reliability of $\widehat{\mathrm{Sum}}(S_t)$.

Discussing the Factors Affecting RODFA
The aim of the previous part of the experiment was to demonstrate the effectiveness of RODFA. Next, we embed RODFA in our simulated network and investigate the factors affecting it. We use a Matlab simulator to implement a wireless sensor network. To ensure that the simulation results in this paper are correct, we use our simulator to repeat the network lifetime experiments of DMSTRP, which were implemented in C/C++ in [23], and obtain approximately the same results. The optimal number of clusters in our simulated network is chosen according to [30].
In this section, the first group of experiments describes how the value of $\eta$ of $\widehat{\mathrm{Sum}}(S_t)$ (calculated by RODFA) changes as the number of network running rounds increases. The experimental results for different network areas are shown in Figure 6.
As the three lines belong to the same type of experimental result, we analyze only the shape of the blue curve in Figure 6. The blue line describes how $\eta$ changes with the number of network running rounds for the network of 5000 m × 5000 m (when $\varepsilon = 0.08$). Establishing the clusters takes a certain amount of time (how long depends on the network area and the number of active nodes), which is why $\eta$ is 0 from the 0th round to the 9th round. After the sink node obtains the values of $h$ and $p$ ($h = 3$ and $p = 0.9308$), RODFA works out the value of $\eta$, and $\eta$ leaps to 0.9611 in the 9th round.
Based on the above analyses, the values of $h$ and $p$ mainly depend on the network area, the number of active nodes, the internal structure of the clusters, and so on. As each of the three lines describes the simulation results of a network with a fixed area, we do not need to investigate the influence of network area on $h$ and $p$ here. From the blue line, we find that the number of active nodes is 5000 in the first 610 rounds. Near the 400th round, however, the clusters are remodeled, which changes the network structure and also the values of $h$ and $p$. Thus, following the 409th round, $\eta$ suddenly drops to 0.9457 and stays there for 210 rounds. Secondly, the number of active nodes begins to decline from the 600th round. As the setup phase and the ready phase cycle continuously in the network, the values of $h$ and $p$ differ over time. As a result, from the 619th round $\eta$ shows a sustained downward trend until its value becomes 0. Lastly, we also find that $\eta$ is lower in the subsequent 123 rounds (from the 811th round to the 934th round). This is because the number of active nodes remains at a low level in these rounds: when fewer active nodes are deployed in a network of fixed area, the value of $d$ is larger and $p$ is smaller, which makes $\eta$ small.
The second group of experiments investigates the influence of network size on RODFA. According to $\varphi_{\delta/2}^2 \le \inf(N_t)\,\inf(S_t)\,p^h\,\varepsilon^2 / ((1 - p^h)\,\sup(S_t))$, computing the value of $\eta$ requires $\varepsilon$, $h$, and $p$. Based on the above analysis, the values of $h$ and $p$ depend on the network size when the routing protocol is fixed. As the value of $\varepsilon$ is given by the user's application requirement, $\eta$ changes with the network area. As shown in Figure 7, when $\varepsilon$ is set to 0.01, 0.03, and 0.05, respectively, $\eta$ decreases as the network size grows.
The last group of experiments analyzes the network delay. According to the definition of delay in Section 2, we obtain the delays of the simulated network models with sizes of 5000 m × 5000 m, 10000 m × 10000 m, and 15000 m × 15000 m and plot them in Figure 8. Figure 8 shows that the network delay fluctuates around 524 when the number of alive nodes is more than 1500, whereas when the number of alive nodes is less than 1500, the network delay begins to decrease until its value becomes 0. This is probably due to the numbers of links and clusters in the network, which decrease as the number of alive nodes decreases and thus lead to the trend of network delay shown in Figure 8.

Figure 6: The lower limit value of reliability $\eta$ over network running rounds for network areas of 5000 m × 5000 m, 10000 m × 10000 m, and 15000 m × 15000 m. From the blue line, we find that the value of $\eta$ is 0 from the 0th to the 9th round; in the 9th round, $\eta$ leaps to 0.9611 and stays there for 400 rounds. In the 409th round, $\eta$ suddenly drops to 0.9457 and stays there for 210 rounds, and there is another sudden drop in the 619th round. Then $\eta$ shows a sustained downward trend until its value becomes 0. The red line shows the $\eta$ of $\widehat{\mathrm{Sum}}(S_t)$ for a network area of 10000 m × 10000 m. During the first 11 rounds, the value of $\eta$ is 0; it then leaps to 0.9133 in the 11th round and stays there for 300 rounds. In the next 200 rounds (from the 311th round to the 511th round), $\eta$ is 0.8742, and in the 511th round, $\eta$ drops again to 0.8504. After a sudden drop in the 631st round, $\eta$ shows a sustained downward trend until its value becomes 0. The green line describes the $\eta$ of $\widehat{\mathrm{Sum}}(S_t)$ for a network area of 15000 m × 15000 m. $\eta = 0$ for the first 13 rounds; in the 413th round, $\eta$ has a sudden drop, and then its value continues to fall until it becomes 0.

Conclusions
Our proposed algorithm, RODFA, is a data fusaggregation algorithm that aggregates sensed data from large-scale wireless sensor networks and performs fusion analysis on the aggregation results. We choose SUM to design RODFA in this paper, and the main idea of RODFA is to calculate the parameter $\eta$ of $\widehat{\mathrm{Sum}}(S_t)$ through automatic analysis and synthesis of $\widehat{\mathrm{Sum}}(S_t)$. RODFA provides $\eta$ to users to help them decide how to handle $\widehat{\mathrm{Sum}}(S_t)$; it can also help the network administrator improve system sensing performance. The experiments in Section 5 indicate that RODFA is valid and that the parameter $\eta$ calculated by RODFA is very likely to be the minimum of the reliability of $\widehat{\mathrm{Sum}}(S_t)$. In Section 6, we investigate the influence of network size and network performance on data fusaggregation to guide the network administrator in improving the system sensing performance.