A Subspace Approach to Sparse Sampling Based Data Gathering in Wireless Sensor Networks

Data gathering is an essential concern in Wireless Sensor Networks (WSNs). This paper proposes an efficient data gathering method in clustered WSNs based on sparse sampling to reduce energy consumption and prolong the network lifetime. For data gathering scheme, we propose a method that can collect sparse sampled data in each time slot with a fixed percent of nodes remaining in sleep mode. For data reconstruction, a subspace approach is proposed to enforce an explicit low-rank constraint for data reconstruction from sparse sampled data. Subspace representing spatial distributions of the WSNs data can be estimated from previous reconstructed data. Incorporating total variation constraint, the proposed reconstruction method reconstructs current time slot data efficiently. The results of experiments indicate that the proposed method can reduce the energy consumption and prolong the network lifetime with satisfying recovery accuracy.


Introduction
Recently, wireless sensor networks (WSNs) have been applied in many fields, such as target tracking, medical care, and environmental monitoring. The component of a WSN includes a number of sensor nodes monitoring and collecting the physical environmental information and a sink aggregating the collected data. The data collection from each node to the sink, known as data gathering, is a significant research issue in WSNs.
In most real-world applications, the sensor nodes in WSNs are always limited by computational capacity and battery power. Lots of data gathering methods have been proposed to reduce energy consumptions in WSNs. These data gathering methods can be mainly classified into two categories: methods utilizing data compression techniques [1][2][3] and methods based on designing network protocols [4][5][6]. In the past decades, inspired by the emergence of compressed sensing (CS) [7] and matrix completion [8] theory in signal processing field, data gathering methods based on data compression techniques obtain more attention. It is important to note that the decrease in data transmission between nodes can effectively reduce energy consumption and prolong the network lifetime. Considering the redundant and highly correlated data sensed in neighboring sensors during consecutive times, data gathering methods taking advantage of data compression techniques have contributed to reducing the amount of data transmission and prolonging the network lifetime. Specifically, Compressive Data Gathering (CDG) was proposed [2] based on compressed sensing theory [7]. Instead of collecting each raw sensed data, the sink in CDG receives a weighted sum of all the readings from nodes. Compared with the traditional data gathering methods in WSNs, CDG can balance the energy consumption and prolong the lifetime of the network. Afterwards, many extensions to the CS based data gathering methods in WSNs have been developed [9][10][11] based on dense sensing matrix. To further reduce the amount of transmission data, methods utilizing sparse sensing matrix to random sample the raw sensed data in WSNs were also proposed [12][13][14]. More recently, as the rank of matrix is interpreted as a measure of second-order sparsity, matrix completion method [8] has attracted the attention of researchers, and many data gathering and reconstruction methods in WSNs based on matrix completion were proposed [3,[15][16][17][18][19]. By distributing data collected from different nodes in different time slots into an environment matrix, the matrix completion based methods enforce the low-rank constraint on the matrix to take advantage of the spatiotemporal correlation. In particular, Spatio-Temporal Compressive Data Collection (STCDG) [3] was proposed to reduce the amount of traffic and improve the level of recovery accuracy by enforcing the low-rank constraint based on matrix factorization approach. Furthermore, methods utilizing joint low-rank and sparsity constraints were also proposed to further improve the data recovery accuracy [16,19]. Considering the inherent correlation among multi-attribute data, this method [19] extends the low-rank constraint based on matrix to tensor model to further exploit the correlation.
Although these methods enforcing low-rank constraint achieve better data recovery performances than the CS based methods, the requirement to form an environment matrix limits the real-time data gathering and reconstruction in WSNs. Besides, the sparse sampling strategy is adopted in matrix completion based methods, resulting in the existence of several nodes remaining in sleep mode. These inactive nodes will cause the failure of real-time data gathering. To achieve real-time data gathering with matrix completion, this paper proposes a sparse data gathering method based on clustered WSNs. Sparse sampled data can be collected in each time slot even with a fixed percent of nodes remaining in sleep mode. For data reconstruction, a subspace approach is proposed to enforce an explicit low-rank constraint. With subspace representing temporal distributions of the WSNs data estimated from previous reconstructed data, data collected in current time slot can be recovered efficiently. To guarantee data recovery performance even with low sampling ratio, we incorporate total variation constraint to further improve data recovery accuracy.
The rest of this paper is organized as follows. Section 2 reviews related works on matrix completion. Section 3 describes the proposed sparse data gathering method and the subspace approach for data reconstruction. The experimental results and analysis are presented in Section 4. Finally, Section 5 concludes this paper.

Matrix Completion
Matrix completion problem was proposed to recover a data matrix from its partially sampled entries, and it has been proved that most low-rank matrices can be perfectly recovered from an incomplete set of entries [8]. Specifically, recovering an incomplete matrix M ∈ R n×m can be cast as a rank minimization problem: where X is the decision variable, Π the sampled subset of the complete set of entries [n] × [m] (Here and in the sequel, [n] denotes the list {1, ..., n}), and rank (X) represents the rank of the matrix X.
Essentially, the rank of matrix is treated as the measure of the second order sparsity. Therefore, the low-rank constraint can utilize the spatial and temporal correlation in data matrix. However, the optimization problem (1) is nonconvex and NP-hard. To address this issue, two popular approaches are utilized in numerous applications to enforce the low-rank constraint. One is the nuclear norm based approach, and the other is matrix factorization based approach. For the first one, since the nuclear norm is the best convex surrogate to the rank function over matrices with spectral norm less than or equal to one [20], the nuclear norm minimization is used as an alternative: Here, · * denotes the nuclear norm, which is a convex function and equal to the sum of the singular values of the matrix. Then, (2) can be rewritten as a regularized unconstrained problem, and singular value thresholding algorithm [21] can be used for the resulting problem. It is worth noting that computing Singular Value Decomposition (SVD) in each iteration is the main computational cost. For the second one, the incomplete matrix M with rank r can be expressed as the product of two matrices LR, where L ∈ R n×r , R ∈ R r×m . The matrix factorization is not unique, the factorization where the matrices L and R have Frobenius norm as small as possible can be searched. Then (1) can be cast as: The matrix factorization based approach has gained great popularity due to fewer requirements for storage capacity and computational overhead. It is worth noting that the problem (3) is a nonconvex quadratic program and can be solved by standard nonlinear optimization algorithms, such as alternating minimization method [22] and gradient descent method.

Matrix Completion Based Method in WSNs
Inspired by the great success of matrix completion, many data gathering and reconstruction methods in WSNs were proposed based on matrix completion. Considering a WSN consisting of n sensor nodes and one sink with symbol N 1 , N 2 , · · · , N n used to represent sensor nodes. These sensor nodes sense the environment and transmit readings to the sink once every τ time. Therefore, an environment matrix (EM) M organized by n × m readings can be obtained during mτ time (i.e., m time slots): where f N i , T j denotes the reading collected from node N i in the jth time slot. In most matrix completion based methods, partially sampled readings sensed during m time slots are transmitted to the sink together. Since readings generated by the nodes in a certain area during consecutive times are redundant and highly correlated, the EM M has approximately low-rank structure. Then, the low-rank constraint can be enforced to exploit the correlation to reconstruct the unsampled readings. Methods based on matrix completion have achieved great success. However, the requirement to form an EM limits the real-time data gathering and reconstruction in WSNs.

The Proposed Method
In matrix completion based methods in WSNs, sparse sampling is adopted in data gathering to reduce the amount of transmission data and energy consumption. Typically, the sparse sampled readings sensed during a fixed number of time slots are transmitted to the sink together in the last time slot, which limits real-time data gathering and reconstruction. To implement real-time data gathering in WSNs, the sparse sampled data should be collected and transmitted to the sink in every time slot. However, since only partial sensor nodes wake up and collect data while other nodes remaining in sleep mode, the traditional data gathering methods fail due to the existence of inactive nodes. If the successful aggregation of sparse sampled data is guaranteed at the expense of all nodes waking up to transmit data, the goal of reducing energy consumption through sparse sampling can not be achieved. Moreover, for data reconstruction in real-time case, only data in current time slot need to be reconstructed. In this paper, a sparse data gathering method based on clustered WSNs is proposed to implement real-time data gathering, and a subspace approach based method is proposed to reconstruct data in current time slot by enforcing an explicit low-rank constraint.

The Sparse Sampling Data Gathering Scheme
To ensure the success of sparse sampled data gathering without waking up all nodes, a sparse sampling data gathering scheme is proposed in clustered WSNs. In clustered WSNs, all nodes are divided into multiple clusters, and each cluster contains one cluster head (CH) and a number of cluster members (CMs). The cluster head is selected for each round to communicate with the sink, and the cluster members only communicate with CHs in their clusters. The proposed sparse sampling data gathering scheme based on clustered WSNs is shown in Figure 1. For clarity, nodes are divided into two layers: sensor layer containing CMs in each cluster and cluster head layer containing all CHs. Sparse sampling is adopted for CMs in each cluster, that is, only partial CMs wake up and transmit the sensed data to the CH. The CHs transmit the received data and its own sensed data to the sink. It is worth noting that several CHs which locate far from sink node can communicate with sink using multi-hop transmission. For simplicity, only direct communication between CHs and the sink is illustrated in Figure 1. The proposed sparse sampling data gathering scheme can combine with many existing clustering algorithms. In this paper, the well known distributed energy-efficient clustering (DEEC) scheme [4] is selected as the clustering algorithm. In DEEC, cluster heads are selected according to a probability based on the ratio between the residual energy of each node and the average energy of the network. Because of adapting the rotating epoch of each node to its energy, the nodes with high initial and residual energy have a higher probability of becoming CHs than those with low energy. Specifically, the proposed sampling data gathering scheme using DEEC can be divided into rounds, and each round contains two phases as follows.
In the first phase, clusters are formed and the sampling ratio of each node is determined. Specifically, each node decides whether to become a CH based on its own probability threshold, which is related to the residual energy and estimated average energy of networks at current round. After the CHs are selected, the other nodes, cluster members, determine the dependent cluster according to the signal strength of the received information, and notify the corresponding CH to complete the establishment of the clusters. Then the sink broadcasts a fixed sampling ratio ρ to all sensor nodes in the network. Note that the CHs are always awake to ensure the successful aggregation of data. As a result, the sampling ratio for CHs can be set to 1. For the CMs, the sampling ratio is ρ CH = (ρn − n CH ) (n − n CH ), where n denotes the number of nodes still alive and n CH the number of CHs. n − n CH represents the number of CMs.
In the second phase, sparse sampling and data gathering are held. Specifically, each node in CMs generates a random number between 0 and 1. If the number is less than the sampling ratio ρ CH then the node senses the environment and transmits readings to the corresponding CH, otherwise the node remains in sleep mode in current time slot. After receiving data from CMs in its cluster, the CH transmits the received data and its own sensed data to the sink. The network will start a new round after a fixed number of time slots.
Instead of using all nodes to collect the information, the proposed sampling data gathering scheme only wakes partial nodes up in each time slot, which greatly reduces the energy consumption and prolongs the lifetime of the network compared to original clustering algorithm.

The Sliding Window Model
In real-time data gathering with sparse sampling, data from CHs and partial CMs are collected in the sink for current time slot. We use T c to represent the current time slot, and let T c−j denote the last (j + 1)th time slot. Figure 2 shows the data sampled from n sensor nodes between time slots T c−w to T c . To implement the real-time data reconstruction, the sliding window model is utilized by introducing the reconstructed data in previous time slots. As illustrated in Figure 2, two adjacent windows are shown with the width fixed to w time slots. The first window, the previous window of the current window, contains time slots from T c−w to T c−1 , and the second window, the current window,contains time slots from T c−w+1 to T c . Data in each window can form an environment matrix, and the window slides forward a time slot each time. That is, the oldest time slot will be deleted, while the current time slot enters the window. Meanwhile, the environment matrix is updated accordingly.

Sensor nodes
Time slot Figure 2. Sampled data using sliding window model.
For example, to reconstruct data x c in current time slot T c , the previous window containing time slots T c−w to T c−1 slides forward a time slot. Then the current window including data from T c−w+1 to T c is obtained. Divide the correspondingly current EM [M p x c ] into two parts: the previous reconstruction data M p (data in time slots T c−w+1 to T c−1 ) and the current data x c (collected in current time slot). Since each column in M p has been reconstructed in the corresponding time slot, the proposed subspace approach can enforce an explicit low-rank constraint to reconstruct the current data x c in real time.
It is worth noting that the sink needs the indexes of the sparse sampled data to reconstruct the current data. To avoid overhead incurred by index information, the preset random seed can be used. At the beginning of each round, each CM generates a random seed as a pseudo-random number generator and sends it to the CHs and sink. The generated random number determines whether the CM wakes up and collects data. With all the information of random seeds, the sink can obtain the indexes of the sparse sampled data without extra overhead.

The Subspace Approach for Reconstruction
With the proposed sparse sampling data gathering scheme, the sink can obtain the current sparse sampled data d = Ω(x c ), where Ω : R n×1 → R s×1 is the sparse sampling operator to obtain the partially known data from the current data. According to the sliding window model, the current EM can be expressed as [M p x c ]. Readings in EM are collected from nodes in a certain area during a consecutive time, then EM has approximately low-rank structure. Using the matrix factorization based approach, we have [M p x c ] = LR. Here, L ∈ R n×r and R ∈ R r×w can be regarded as the subspace representing the spatial distributions and the temporal distributions of the WSNs data, respectively. With R rewritten as [R p r c ], where R p ∈ R r×(w−1) and r c ∈ R r×1 , we have [M p x c ] = L[R p r c ] = [LR p Lr c ]. That is, M p = LR p and x c = Lr c . Since the subspace L representing the spatial distributions can be estimated from previous reconstructed data M p , the current data can be reconstructed bŷ To further improve the recovery accuracy, the total variation constraint ∇ x ([M p x c ]) 1 can be jointly utilized to enforce the temporal stability, where ∇ x represents the horizontal finite difference operator. It is worth noting that the vertical finite difference operator is not used. Since adjacent nodes in the matrix may not adjacent in space, constraining the vertical finite difference can not utilize the spatial stability. Besides, M p is known by previous reconstruction, then the constraint ∇ x ([M p x c ]) 1 can be simplified as ∇ x ([m p x c ]) 1 , where m p is the last column of M p . By jointly enforcing the constraint and introducing a quadratic penalty term, (5) can be converted into a corresponding unconstrained formulation as: where λ denotes the regularization parameter. The proposed subspace approach incorporates both the explicit low-rank constraint and the temporal stability constraint in a single formulation (i.e., (6)). To solve the optimization problem in (6), an efficient algorithm based on alternating direction method of multipliers (ADMM) [23,24] is developed. First, using variable splitting, we can convert (6) into the following equivalent constrained optimization problem: Second, the augmented Lagrangian function for (7) can be obtained: where a is the Lagrangian multiplier, and α > 0 is the penalty parameters. Third, (8) can be minimized alternatively as following: To solve the subproblem in (9), which is a quadratic optimization problem, the preconditioned conjugate gradient (PCG) algorithm is applied in this paper. To solve the subproblem in (10), the well-known soft-thresholding formula [25] can be utilized. Then we have where S(q, τ) i := sign(q i ) max(|q i | − τ, 0) is a soft-thresholding operator for each element q i in q.
The procedures of the reconstruction algorithm based on ADMM for solving (6) can be summarized in Table 1. In practical implementation, we initialize r 0 c , y 0 , and a 0 with zeros vectors. Since (6) is a convex optimization problem, the ADMM based method is guaranteed to have global convergence from any initializations [24]. The stopping criteria for the algorithm are r k c − r k−1 c 2 / r k−1 c 2 ≤ and k > K max , where and K max are the predefined tolerance parameter and the maximum number of iterations, respectively. The algorithm is terminated until each criterion is satisfied. Table 1. The reconstruction algorithm to solve the optimization problem in (6).

Input:
Initialized r 0 c ,y 0 ,and a 0 with zeros vectors; The subspace L representing the spatial distributions; The regularization parameter λ and the penalty parameter α; The measurements d, the sampling operation Ω, and the iteration number k = 0; do 1)Iteration number k = k + 1 ; 2)Update r k c by solving (9); 3)Update y k by solving (12); 4)Update a k by solving (11); Output: x c = Lr k c

Experiments and Analysis
In this section, the experimental environments are established to verify the effectiveness of the proposed method. For the reconstruction experiments, the performance of the subspace based reconstruction method was compared with local interpolation method K-Nearest Neighbors (KNN) [26], the CS based method [13], Seq-Prog-CS method [10], and joint CS and matrix completion method (CSMC) [16], and the real dataset collected from GreenOrbs was adopted. For the energy consumption experiments, the proposed sparse sampling data gathering scheme using DEEC was compared with the original DEEC to verify the benefit of the proposed method in reducing the energy consumption. All experiments were conducted using MATLAB R2017a on a computer with 1.8GHz Intel core i7-8550U CPU and 8GB RAM.

Data Reconstruction Performance
To verify the effectiveness of the subspace approach for reconstruction, a small real dataset collected from GreenOrbs [27] was selected as the ground truth. The readings in the small real dataset were collected from 94 nodes in 124 time slots for three attributes: temperature, humidity, and light. Then we verified the effectiveness of the proposed method from two aspects. One is the recovery of data in current time slot. Another one is the long-term recovery of data in consecutive time slots.
For the first aspect, the 50th time slot in the dataset was selected as the current time slot, and the width of sliding window was set to w = 41. x c ∈ R n×1 denote the data in current time slot, and M p ∈ R n×40 denotes the previous reconstructed data. Here, n = 94 is the number of nodes. With predefined sampling ratio ρ, only partial readings d ∈ R nρ ×1 can be sampled using the proposed sparse sampling data gathering scheme. Mathematically, the sampling procedure can be expressed as d = Ω(x c ). The proposed method and compared methods were used to reconstructx c from d. The Normalized Mean Absolute Error (NMAE) was adopted to measure the recovery performance and defined as: Here,x c is the recovered data, and Π represents the unsampled subset of the complete set of entries [n] × 1. That is, we only calculated the recovery accuracy of the unsampled data.
In the simulation, the spatial distributions L need to be estimated for the proposed subspace approach, which need M p and the rank r. For simplicity, the historical data collected from 10th to 49th time slots were used as M p , and the rank r was set to 15 according to singular values of M p . Besides, to make the comparison more convincing, the same historical data were also used in CSMC and Seq-Prog-CS method. The sampling ratio ρ was set as [0.4, 0.3, 0.2, 0.1, 0.07, 0.04, 0.01], and the sampling operator Ω was generated with a uniform random sampling pattern. Each experiment was repeated 100 times to calculate the mean NMAE value for each method. For each experiment, the same sampling operator and sampled data were used for all the methods. Figure 3 shows the recovery performance of methods for the temperature, humidity, and light data in GreenOrbs, respectively.  As shown in Figure 3, the proposed method achieves the lowest NMAE with each sampling ratio for all attributes. With the decrease of sampling ratio, the performance of other three methods degrades dramatically while the proposed method keeps satisfying performance. Even with the sampling ratio as low as 0.01, the proposed method obtains NMAE = 0.018, 0.021, 0.037 for temperature, humidity, and light data, respectively. Figure 4 shows the running time performance of methods for temperature data in GreenOrbs. The average running time of KNN, CS, CSMC, and the proposed method is graphically represented. As shown in Figure 4, the proposed method achieves the lowest running time. The reason is that the proposed method avoids the singular value decomposition (SVD) in (6). For traditional matrix completion methods, the main computational complexity is from the SVD of EM for each iteration, and exact SVD of a matrix with size n × m has time complexity O(min{nm 2 , n 2 m}). In the proposed subspace approach, the SVD is used to obtain the subspace L from previous reconstructed data M p . The calculation only needs to be run once and can be done between two time slots. Then the computational complexity in each iteration to solve (6)   The historical data were used as M p in the above simulation, while M p should be the previous reconstructed data in the practical implementation. It is necessary to verify the effectiveness of the proposed method in long-term reconstruction, which updates M p with previous reconstruction and estimates spatial distribution L for next time slot recovery. Then for the second aspect, the long-term reconstruction, data collected from the 50th time slot to the 124th time slot in the dataset were selected. Set the number of cluster head n CH as [1,5,9] and the sampling ratio ρ = 0.8. Other settings were set as the previous simulation. Figure 5 shows the long-term recovery performance of the proposed method for the temperature data in GreenOrbs. Despite the reduction in the recovery accuracy of the proposed method for long-term reconstruction, the overall reconstruction accuracy is still satisfying. The average NMAE is [0.024, 0.0245, 0.0254] for the proposed method with [1,5,9] cluster heads, respectively. It is obvious that less number of cluster head leads to lower NMAE. With a fixed number of CH, only nρ − n CH sampling nodes are randomly selected while n CH nodes are fixed. As a result, if n CH = 0, all sampling nodes are randomly selected, which can achieve better recovery.

The Energy Consumption Performance
To further verify the benefit of the proposed method in reducing the energy consumption, we established the experimental environments for the proposed sparse sampling data gathering scheme using DEEC as the clustering algorithm.
In the simulation, a two-level heterogeneous network containing advanced nodes and normal nodes was utilized. The initial energy of normal node was E 0 , and the initial energy of advance node was aE 0 . There were 100 sensors (80 advanced nodes and 20 normal nodes) randomly distributed in a 100m × 100m area and one sink located in the center of the area.
The parameters used in simulations are shown in Table 2. The transmission process adopted free space model and the multipath fading model for distance d ≤ 87m and d > 87m, respectively. There were 100 times data gathering in one round, and the cluster heads were reselected at the beginning of each round. The number of cluster head was set to 10 for both methods, and the sampling ratio was set as [0.2, 0.4, 0.6] for the proposed method.  Figures 6 and 7 show the energy consumption of original DEEC and the proposed sparse sampling data gathering scheme using DEEC. As shown in Figure 6, the nodes start to die in 798th round for original DEEC, while the proposed method keeps all nodes alive until the 1356th, 1941th, 3504th round with ρ = 0.6, ρ = 0.4, ρ = 0.2, respectively. To keep 50% nodes alive, the original DEEC can only carry out 1081 rounds. The proposed method with ρ = 0.2 can carry out 4300 rounds, which is about 4 times than that of original DEEC. As the sampling ratio decreases, more network nodes can keep alive after a fixed number of rounds. As shown in Figure 7, the total energy of network decreases slowly using the proposed sparse sampling data gathering scheme. When the sampling ratios are 0.2, 0.4, and 0.6, the total energy of the proposed method is 33.9J, 24.55J, and 20.46J, respectively, while the original DEEC only had 14.1J at the 4000th round. To keep 20% total energy, the original DEEC can only carry out 3692 rounds, while the proposed method with ρ = 0.2 can carry out 8829 rounds. It is obvious that the proposed method can reduce the energy consumption significantly compared with the original DEEC.

Conclusions
This paper proposes an efficient sparse sampling data gathering method in clustered WSNs including a sparse sampling data gathering scheme and a subspace based reconstruction algorithm. The proposed sparse sampling data gathering scheme not only uses the sparse sampling strategy to reduce the amount of transmission data and the energy consumption, but also ensures the success of real-time sparse sampled data gathering in clustered WSNs. Then the previous reconstructed data are introduced with the sliding window model and used to estimate the subspace representing spatial distributions of the WSNs data. In this way, the proposed subspace based reconstruction approach can enforce an explicit low-rank constraint for data reconstruction from sparse sampled data. Besides, the total variation constraint is joint utilized to further improve the recovery accuracy. The experimental results show that the proposed method outperforms the state-of-the-art methods in data recovery and reduces the energy consumption of network efficiently.