A DCT Regularized Matrix Completion Algorithm for Energy Efficient Data Gathering in Wireless Sensor Networks

This paper presents a novel matrix completion algorithm that enables energy efficient data gathering in wireless sensor networks. The algorithm takes full advantage of both the low-rankness and the DCT compactness of the sensory data to improve recovery accuracy. The time complexity of the algorithm is analyzed and shown to be low. Moreover, the recovery error is carefully analyzed and a theoretical upper bound is derived; the bound is then validated by experimental results. Extensive experiments are conducted on three datasets collected from two testbeds. Experimental results show that the proposed algorithm outperforms state-of-the-art methods at low sampling rates and achieves good recovery accuracy even when the sampling rate is very low.


Introduction
Wireless sensor networks (WSNs) have been widely used in a variety of applications such as precision agriculture [1], personal health monitoring [2], and environment surveillance [3], in which sensor nodes with limited battery energy are deployed and periodically report their sensor readings to the base station or sink node. Therefore, a key issue in wireless sensor networks is how to efficiently gather these data from sensors provided with only limited energy resources. In the classical data gathering approach [4], each sensor node simply forwards its sensor readings to the sink, resulting in a large amount of traffic and energy consumption.
Recently, Compressive Sensing (CS) [5,6] has emerged as a new approach to tackle the efficient data gathering problem in WSNs. Taking advantage of the sparsity in sensor readings, CS based methods [7][8][9] require fewer data packets than the classical approach. However, there are several practical problems when applying CS to data gathering in WSNs. Firstly, CS based methods require a prior dictionary to sparsify the sensor readings. Secondly, the measurement matrix in CS is composed of independent and identically distributed random Gaussian entries, which is dense with very few zero elements. Therefore, sensor nodes need to sample all sensor readings and perform a considerable number of measurement operations, resulting in a large amount of wasted energy. Thirdly, CS requires the number of measurements to exceed a certain threshold (depending on the sparsity level of the sensor readings) to achieve exact recovery. However, real sensor signals are rarely exactly sparse. Therefore, a low sampling rate may lead to insufficient measurements and result in poor recovery accuracy.
As an extension of CS, matrix completion [10] has shown its potential for enabling efficient data gathering in WSNs. Because the sensor readings are highly temporally and spatially correlated, the data matrix formed by the sensor readings is well approximated by a low-rank matrix. Therefore, the sink node can gather only a fraction of the sensor readings and adopt a matrix completion algorithm to reconstruct the missing data. Unlike CS based methods, matrix completion based methods do not require a prior dictionary to sparsify the original signal. Furthermore, the sampling matrix (or measurement matrix) in matrix completion is much sparser than in CS, which makes it more suitable for wireless sensor networks.
Utilizing the low-rankness of sensory data, there are many pioneering works [11][12][13] on applying matrix completion to WSNs, which adopt the alternating least squares technique to estimate the low-rank matrix. The recovery accuracy is further improved by exploiting the spatiotemporal structure of the WSN data. However, the improvement is limited because the spatiotemporal structure directly implies the low-rank feature; in some sense, these two features are equivalent. Moreover, the alternating least squares algorithm does not scale to large ranks.
Besides the low-rankness feature, we observe that the sensor readings in WSNs also exhibit a Discrete Cosine Transform (DCT) compactness feature: the sensor readings can be approximated by only a small number of DCT coefficients. Therefore, by taking full advantage of the DCT compactness feature of WSN data, in this paper, we propose a DCT Regularized Matrix Completion (DRMC) algorithm. We analyze the time complexity of DRMC and show that the proposed algorithm has a low computational cost. Moreover, we analyze the recovery error of DRMC and derive a theoretical upper bound, which is then validated by experimental results. Extensive experiments are carried out on three datasets collected from two realistic WSN testbeds, comparing the performance of DRMC with state-of-the-art methods. Experimental results show that DRMC outperforms state-of-the-art methods at low sampling rates and achieves good recovery accuracy even when the sampling rate is very low.
The main contributions of this paper are summarized in the following: (i) We examine the sensor data collected from real-world WSNs, which reveal two data features: (1) low-rankness, (2) DCT compactness.
(ii) Inspired by these observations, we design a novel DCT Regularized Matrix Completion (DRMC) algorithm to estimate the missing sensory data. Experimental results indicate that DRMC outperforms state-of-the-art methods when the sampling rate is low.
(iii) We analyze the time complexity of the DRMC algorithm, which indicates that DRMC has a low computational cost.
(iv) We analyze the recovery error of the DRMC algorithm and derive a theoretical upper bound, which is then validated by experimental results.
The rest of this paper is organized as follows. Section 2 reviews the state-of-the-art methods. Section 3 models the problem. Section 4 examines the data features in WSNs. Section 5 proposes the DRMC algorithm. Section 6 analyzes the time complexity of the algorithm. Section 7 analyzes the recovery error of the algorithm. Section 8 evaluates the effectiveness of the proposed algorithm. Section 9 concludes this paper.

Related Works
In this section, we briefly review previous works related to the data gathering problem in wireless sensor networks.

Compressive Sensing.
Compressive Sensing (CS) theory suggests that sparse signals can be accurately reconstructed from only a small number of measurements [5,6]. It is a new paradigm for signal processing of networked data [7], and there are many CS based methods for data gathering in wireless sensor networks. Luo et al. proposed a data gathering scheme that applies Compressive Sensing theory to reduce communication cost [14]. Quer et al. presented a framework for data gathering and signal recovery in actual WSN deployments with the integration of CS [9]. Ebrahimi and Assi recently proposed a decentralized method to apply Compressive Sensing to data gathering in wireless sensor networks [15].

Matrix Completion.
Recently, the matrix completion technique has been applied to wireless sensor networks in many ways. Utilizing low-rankness and spatiotemporal correlation, Zhang et al. proposed a method to recover the lost data in Internet traffic matrices [13]. Kong et al. designed an algorithm using the low-rank structure, time stability, space similarity, and multiattribute correlation to estimate the missing data in highly incomplete data matrices [12]. Cheng et al. presented a Spatiotemporal Compressive Data Collection (STCDG) algorithm that utilizes the low-rankness and short-term stability features to reduce data traffic in WSNs [11].
In our earlier work [16], we have studied the data recovery problem in wireless sensor networks when historical data are available and proposed a DCT-Regularized Partial Matrix Completion (DCT-RPMC) algorithm. However, the new algorithm proposed in this paper does not depend on any historical data, which greatly widens its applicability to more general scenarios.

Problem Formulation
In this section, we formally formulate the data gathering and recovery problem in wireless sensor networks and state the goal of this paper. The main notations that will be used in the rest of this paper are listed in Summary of Notations.
Assume that the network is composed of n sensor nodes. During a certain sampling period, the i-th sensor node acquires t samples, which are modeled as a t-dimensional sensor vector x⃗_i = (x_i(Δ), x_i(2Δ), …, x_i(tΔ)), where Δ is the sampling interval. Therefore, all the samples in the network can be organized as an environment matrix X ∈ R^{n×t} whose i-th row is x⃗_i. In order to reduce energy consumption, only a fraction of the entries of X will be transmitted to the sink node. We then define a binary matrix S ∈ R^{n×t} as the sampling matrix to indicate which entries of X are transmitted to the sink:

S_ij = 1 if X_ij is transmitted to the sink, and S_ij = 0 otherwise.

And the sampling rate ρ is defined in the following: ρ = (Σ_{i,j} S_ij)/(nt).
Let M ∈ R^{n×t} denote the data transmitted to the sink. M is an incomplete version of X, with the missing entries replaced by zeros. Therefore, we have

M = S ∘ X,

where the symbol ∘ denotes the element-wise (Hadamard) matrix product.
After obtaining M, the sink can reconstruct the original environment matrix with the algorithm proposed in Section 5. Our goal is to generate a reconstructed matrix X̂ that approximates the original environment matrix X as closely as possible. We measure the recovery performance by the Normalized Mean Absolute Error (NMAE), computed over the missing entries:

NMAE = Σ_{(i,j): S_ij = 0} |X̂_ij − X_ij| / Σ_{(i,j): S_ij = 0} |X_ij|. (6)
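For concreteness, the sampling model and the NMAE metric can be sketched in a few lines of numpy. The matrix names X, S, M follow the notation of this section; nmae is our own helper name, and we assume (as above) that the error is measured over the unsampled entries only:

```python
import numpy as np

def nmae(X, X_hat, S):
    """Normalized Mean Absolute Error over the unsampled entries (S == 0)."""
    mask = (S == 0)
    return np.abs(X_hat[mask] - X[mask]).sum() / np.abs(X[mask]).sum()

# Toy example: a rank-1 "environment matrix" with roughly half the entries sampled.
rng = np.random.default_rng(0)
X = np.outer(rng.random(6), rng.random(8)) + 1.0   # keep entries positive
S = (rng.random(X.shape) < 0.5).astype(float)      # binary sampling matrix
M = S * X                                          # incomplete data seen by the sink
```

A perfect reconstruction gives NMAE = 0, while the trivial all-zeros estimate gives NMAE = 1 for positive data.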

Exploring the Data Features
In this section, we examine the data features in real-world wireless sensor networks.

Experimental Datasets.
We use three datasets, which are collected from two WSN testbeds, to serve as the ground truth. The summary of the datasets is shown in Table 1.
The first category of datasets was collected by 54 Mica2Dot nodes deployed in the Intel Berkeley Research Lab [17] between February 28 and April 5, 2004. Each Mica2Dot node reports collected sensor data, including humidity and temperature, once every 30 seconds. However, we find that the raw dataset has considerable missing data. Therefore, we have rearranged the raw data (by changing the reporting interval from 30 seconds to 10 minutes) to avoid the missing entries.
The second category of datasets consists of temperature readings, collected at a 10-minute interval by our own testbed, namely, PARED, which consists of 50 sensor nodes. More details about PARED can be found in [18].

Low-Rank Structure.
We first examine the low-rank structure in the WSN datasets using the Singular Value Decomposition (SVD). The environment matrix X can be decomposed into three matrices by SVD:

X = U Σ V^T,

where U is an n × n orthonormal matrix, V is a t × t orthonormal matrix, and Σ is an n × t diagonal matrix with singular values σ_1, σ_2, ⋯, σ_r sorted in descending order (r = min(n, t)).
Because the sensor readings in a WSN are spatiotemporally correlated, the environment matrix X exhibits a low-rank feature. More precisely, X should be well approximated by a low-rank matrix of rank k, so the first k singular values occupy most of the energy of X. We use the following metric to examine the quality of the low-rank approximation:

g(k) = (Σ_{i=1}^{k} σ_i) / ‖X‖_*,

where ‖ ⋅ ‖_* is the nuclear norm and ‖X‖_* = Σ_{i=1}^{r} σ_i. Figure 1 shows the low-rank approximation quality of the three datasets. We found that the largest 10 singular values occupy 93%-99% of the total energy, which suggests that the WSN datasets exhibit a good low-rank feature.
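As an illustration, this energy metric can be computed directly from the singular values. A minimal numpy sketch (lowrank_energy is our own name, applied here to a synthetic nearly rank-2 matrix rather than the paper's datasets):

```python
import numpy as np

def lowrank_energy(X, k):
    """Fraction of nuclear-norm energy captured by the k largest singular values."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values, descending order
    return s[:k].sum() / s.sum()

# A nearly rank-2 matrix: two dominant singular values carry almost all the energy.
rng = np.random.default_rng(1)
X = rng.random((20, 2)) @ rng.random((2, 30)) + 1e-4 * rng.standard_normal((20, 30))
```

For real WSN traces one would load the environment matrix in place of the synthetic X.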

DCT Compactness.
We also observe that the sensor readings in WSNs exhibit a DCT compactness feature. In other words, the first k DCT coefficients of a sensor vector x⃗ concentrate most of the energy of x⃗.
We first define D as the t × t Discrete Cosine Transform matrix:

D_ij = c_i cos(π(i − 1)(2j − 1)/(2t)),  with c_1 = √(1/t) and c_i = √(2/t) for i ≥ 2.

Then, the t × t orthonormal matrix D can be divided into two submatrices:

D = [D_1; D_2],

where D_1 consists of the first k rows of D and D_2 consists of the last t − k rows of D. Therefore, if the first k DCT coefficients occupy most of the energy of x⃗, we will have ‖D_1 x⃗‖_2/‖x⃗‖_2 ≈ 1 and ‖D_2 x⃗‖_2/‖x⃗‖_2 ≈ 0. The same holds at the matrix level under the Frobenius norm ‖X‖_F = √(Σ_{i,j} X_ij²). So, we use the following function to examine the DCT compactness feature:

h(k) = ‖X D_1^T‖_F / ‖X‖_F.

From Figure 2, we can see that the first 10 DCT coefficients concentrate 99% of the total energy, which suggests that these WSN datasets exhibit a good DCT compactness feature.
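A quick way to check this property is to take a row-wise DCT and measure how much Frobenius energy the first k coefficients hold. A minimal sketch using scipy's orthonormal DCT (dct_compactness is our own name; the example rows are synthetic low-frequency signals standing in for sensor vectors):

```python
import numpy as np
from scipy.fft import dct

def dct_compactness(X, k):
    """h(k): fraction of Frobenius energy in the first k row-wise DCT coefficients."""
    C = dct(X, axis=1, norm='ortho')   # C = X @ D.T with D the orthonormal DCT matrix
    return np.linalg.norm(C[:, :k]) / np.linalg.norm(X)

# Smooth (low-frequency) rows are compact under the DCT.
t = 64
j = np.arange(t)
X = np.vstack([np.cos(np.pi * f * (2 * j + 1) / (2 * t)) for f in range(3)]) + 2.0
```

Because the DCT is orthonormal, h(t) = 1 always; a value of h(k) near 1 for small k indicates compactness.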

Algorithm
In this section, we propose a novel matrix completion algorithm, namely, DCT Regularized Matrix Completion (DRMC), to solve the data recovery problem in WSN data gathering. DRMC takes full advantage of the low-rankness and DCT compactness features to improve the recovery accuracy.

Utilization of Low-Rankness.
As mentioned before in Section 3, the goal of the recovery problem is to estimate X from only a fraction of known entries. According to [10], if X is a low-rank matrix, we can recover it by solving the following rank minimization problem:

min_X rank(X)  subject to  S ∘ X = M. (12)

However, the rank minimization problem (12) is NP-hard and is not solvable in polynomial time. Since the nuclear norm is the optimal convex approximation of the rank function, a reasonable solution is to solve a convex relaxation problem with the rank function replaced by the nuclear norm:

min_X ‖X‖_*  subject to  S ∘ X = M. (13)

However, in a more realistic setting, the environment matrix X is not an exactly low-rank matrix. There may not exist low-rank matrices that exactly satisfy the constraints in problem (13). So, we convert the constrained optimization problem (13) into the following nuclear norm regularized optimization problem:

min_X λ‖X‖_* + (1/2)‖S ∘ X − M‖_F², (14)

where λ > 0 is the nuclear norm regularization parameter.

[Algorithm 1 (DRMC). Input: collected data matrix M ∈ R^{n×t}, sampling matrix S ∈ R^{n×t}, nuclear norm regularization parameter λ, DCT regularization parameter η. Output: reconstructed environment matrix X̂ ∈ R^{n×t}.]

Utilization of DCT Compactness.
Though we can estimate X by solving the optimization problem (14), the solution overfits the known entries of M when the sampling rate is low, which leads to large errors in the estimates of the missing entries.
Therefore, to reduce the overfitting in (14), we exploit the DCT compactness feature of the sensor data. As mentioned in Section 4.3, ‖X D_2^T‖_F/‖X‖_F ≈ 0. So, we add η‖X D_2^T‖_F² as a DCT regularization term to (14), and finally, we obtain the following optimization problem:

min_X λ‖X‖_* + (1/2)‖S ∘ X − M‖_F² + η‖X D_2^T‖_F², (15)

where η is the DCT regularization parameter.
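For concreteness, the regularized objective in (15) can be evaluated as follows. This is a sketch under our notation (X environment matrix, M collected data, S sampling matrix, k the number of retained DCT coefficients); the DCT term is computed via a row-wise transform rather than an explicit D_2 matrix, and drmc_objective is our own name:

```python
import numpy as np
from scipy.fft import dct

def drmc_objective(X, M, S, lam, eta, k):
    """lam*||X||_* + 0.5*||S∘X - M||_F^2 + eta*||X D2^T||_F^2  (problem (15))."""
    nuclear = np.linalg.svd(X, compute_uv=False).sum()
    fit = 0.5 * np.linalg.norm(S * X - M) ** 2
    C = dct(X, axis=1, norm='ortho')     # row-wise DCT coefficients
    tail = np.linalg.norm(C[:, k:]) ** 2 # energy outside the first k coefficients
    return lam * nuclear + fit + eta * tail
```

With lam = eta = 0 this reduces to the plain least-squares data-fit term, which is a handy sanity check.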

The DRMC Algorithm.
We obtain the DRMC algorithm by solving the optimization problem in (15). The pseudocode is shown in Algorithm 1. Next, we describe the design of DRMC in detail.
The objective function in (15) can be rewritten in the following form:

F(X) = f(X) + g(X),

with

f(X) = (1/2)‖S ∘ X − M‖_F² + η‖X D_2^T‖_F²,  g(X) = λ‖X‖_*.

Note that g(X) is a proper, convex, lower semicontinuous (lsc) [19] function but is nonsmooth, while f(X) is a convex smooth function and is continuously differentiable, with

∇f(X) = S ∘ (S ∘ X − M) + 2η X D_2^T D_2.

Furthermore, ∇f(X) is Lipschitz continuous with a positive constant L_f:

‖∇f(X) − ∇f(Y)‖_F ≤ L_f‖X − Y‖_F,  ∀X, Y ∈ dom f,

where dom f = {X | f(X) < ∞}. Proposition 2 indicates that the Lipschitz constant of f(X) is L_f = 1 + 2η.
Since g(X) is nonsmooth, it is difficult to directly minimize the objective function F(X). Instead, we choose to iteratively minimize a sequence of quadratic approximations of F(X), which is an effective way to minimize an unconstrained nonsmooth convex function [20][21][22]. The quadratic approximation of F(⋅) at a point Y is defined as the following:

Q(X, Y) = f(Y) + ⟨∇f(Y), X − Y⟩ + (L_f/2)‖X − Y‖_F² + λ‖X‖_*.

And the objective variable X_new is repeatedly updated to the minimizer of Q(X, X_old), until ‖X_new − X_old‖_F/‖X_old‖_F < ε. The convergence of such an iterative process is well studied in [21]. We then introduce an auxiliary variable Z = X_old − (1/L_f)∇f(X_old) to minimize Q(X, X_old). As suggested by Proposition 5, we can minimize Q(X, X_old) using the singular value shrinkage operator defined in (26). Thus, we have X_new = T_{λ/L_f}(Z). What is more, we consider a warm-start technique for the nuclear regularization parameter λ. Rather than remaining unchanged, λ is monotonically decreased in the iterative process: it starts with an initial value λ_1 and gradually declines to λ̄, forming a sequence λ_1 > λ_2 > ⋯ > λ_J (λ_J = λ̄).
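The overall iteration (gradient step on the smooth part, singular value shrinkage, warm-started λ) can be sketched as below. This is a simplified reading of the algorithm, not the authors' exact implementation: we assume a linear warm-start schedule from 10λ̄ down to λ̄, use a full SVD instead of PROPACK, and the names drmc and svt are ours:

```python
import numpy as np
from scipy.fft import dct, idct

def svt(Z, tau):
    """Singular value shrinkage operator T_tau."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def drmc(M, S, lam_bar=1e-3, eta=1.0, k=10, n_lam=20, tol=1e-6, max_iter=500):
    """Proximal-gradient sketch of DRMC: warm-started lambda, gradient step, SVT."""
    L = 1.0 + 2.0 * eta                      # Lipschitz constant of the smooth part
    X = M.copy()
    for lam in np.linspace(10 * lam_bar, lam_bar, n_lam):   # assumed warm-start schedule
        for _ in range(max_iter):
            C = dct(X, axis=1, norm='ortho')
            C[:, :k] = 0.0                   # keep only the tail (D2) coefficients
            grad = S * (S * X - M) + 2.0 * eta * idct(C, axis=1, norm='ortho')
            X_new = svt(X - grad / L, lam / L)
            done = np.linalg.norm(X_new - X) <= tol * max(np.linalg.norm(X), 1e-12)
            X = X_new
            if done:
                break
    return X
```

On synthetic rows that are exactly band-limited to a few DCT coefficients, this sketch recovers the matrix from half of the entries to within a few percent.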

Lemma 4.
Let Z ∈ R^{n×t} with SVD Z = U Σ V^T. Then,

T_τ(Z) = arg min_X { τ‖X‖_* + (1/2)‖X − Z‖_F² }, (26)

where T_τ(Z) = U diag(max(σ_i − τ, 0)) V^T is the singular value shrinkage operator applied to Z.
Proof. The proof of Lemma 4 can be found in [23].
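Lemma 4 says the shrinkage operator is the proximal map of the nuclear norm. A small numeric illustration (helper names are ours): the shrunk matrix attains a lower value of τ‖X‖_* + ½‖X − Z‖_F² than nearby perturbations.

```python
import numpy as np

def svt(Z, tau):
    """Singular value shrinkage operator T_tau (Lemma 4)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def prox_objective(X, Z, tau):
    """tau * ||X||_* + 0.5 * ||X - Z||_F^2, the function that T_tau(Z) minimizes."""
    return tau * np.linalg.svd(X, compute_uv=False).sum() + 0.5 * np.linalg.norm(X - Z) ** 2

rng = np.random.default_rng(4)
Z = rng.standard_normal((8, 6))
X_star = svt(Z, 0.5)
```

Since the objective is convex and T_τ(Z) is its global minimizer, every perturbed point should score at least as high.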

Proposition 5.
Let X ∈ R^{n×t} and g(X) = λ‖X‖_*. Assume that ∇f(X) is Lipschitz continuous with constant L_f and define Z ∈ R^{n×t} with

Z = Y − (1/L_f)∇f(Y). (28)

Then, one has

arg min_X Q(X, Y) = T_{λ/L_f}(Z). (29)

Proof. Consider

Q(X, Y) = f(Y) + ⟨∇f(Y), X − Y⟩ + (L_f/2)‖X − Y‖_F² + λ‖X‖_* = (L_f/2)‖X − Z‖_F² + λ‖X‖_* + const, (30)

where const collects the terms independent of X. Thus, combined with Lemma 4, we can obtain that

arg min_X Q(X, Y) = T_{λ/L_f}(Z). (31)

Complexity Analysis
In this section, we discuss the time complexity of the proposed algorithm. After analyzing the steps in Algorithm 1, we find that the most computationally intensive step is the singular value shrinkage operation, which performs an SVD on Z and dominates the computational complexity of this algorithm. The time complexity of SVD for an n × t matrix is O(nt²) [24]. And for any ε > 0, the iterative process in Algorithm 1 will terminate in O(√(L_f/ε)) iterations with an ε-optimal solution [21,25]. Consequently, the time complexity of our algorithm is O(√(L_f/ε) · nt²). We can further decrease the time complexity of the DRMC algorithm by applying the Partial Reorthogonalization Package (PROPACK) [26] in the singular value shrinkage operation. PROPACK uses the Lanczos method [24] to compute only a partial SVD of Z. However, it cannot a priori determine which singular values are greater than the threshold λ_j/L_f. Hence, we need to predetermine the number of singular values to be computed (denoted as sv_j) at the beginning of the j-th iteration, and PROPACK can then compute the sv_j largest singular values and corresponding singular vectors. We adopt the prediction rule proposed in [27]:

sv_{j+1} = svn_j + 1 if svn_j < sv_j, and sv_{j+1} = svn_j + 5 otherwise,

where sv_j is the predicted number of singular values, svn_j is the actual number of singular values that are larger than λ_j/L_f, and sv_0 = 10.
The time complexity of the Lanczos method is O(ntr) for an n × t matrix with rank r [24]. Therefore, the time complexity of the DRMC algorithm is O(√(L_f/ε) · ntr) if PROPACK is used. For a fixed number of iterations, the complexity of DRMC can be simplified to O(ntr), while the state-of-the-art matrix completion based methods [11][12][13] require a complexity of O(ntr²). So, DRMC is more computationally efficient.
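A partial-SVD shrinkage step and the truncation-level prediction rule can be sketched with scipy's Lanczos-based svds as a stand-in for PROPACK (function names are ours; a small noise floor keeps the iterative solver well behaved):

```python
import numpy as np
from scipy.sparse.linalg import svds

def svt_partial(Z, tau, sv):
    """Approximate T_tau(Z) from only the sv largest singular triplets."""
    sv = min(sv, min(Z.shape) - 1)        # svds requires sv < min(Z.shape)
    U, s, Vt = svds(Z, k=sv)              # Lanczos-based partial SVD (ascending s)
    return (U * np.maximum(s - tau, 0.0)) @ Vt, int((s > tau).sum())

def next_sv(sv, svn):
    """Prediction rule of [27]: grow slowly while the estimate is loose."""
    return svn + 1 if svn < sv else svn + 5

rng = np.random.default_rng(5)
Z = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 40)) \
    + 0.01 * rng.standard_normal((30, 40))   # rank ~3 plus a tiny noise floor
```

When the truncation level covers every singular value above the threshold, the partial shrinkage agrees with the full one, since the omitted singular values would shrink to zero anyway.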

Error Analysis
In this section, we analyze the recovery error of the DRMC algorithm and present a theoretical upper bound.
Before starting the analysis, we first introduce some assumptions and lemmas.

Assumption 6. The original data matrix X can be approximated by its first k DCT coefficients, with approximation error ε_d:

‖X D_2^T‖_F ≤ ε_d. (33)

Assumption 7. Let A be the gradient-step operator

A(X) = X − (1/L_f)∇f(X), (34)

and let I be the identity operator. Then, there exists a constant 0 ≤ γ < 1, such that

‖A(X̂) − A(X)‖_F ≤ γ‖X̂ − X‖_F. (35)

Lemma 8. Suppose that T_τ is the singular value shrinkage operator; then T_τ is nonexpansive:

‖T_τ(X) − T_τ(Y)‖_F ≤ ‖X − Y‖_F. (36)

Proof. The detailed proof of Lemma 8 can be found in [28].

Lemma 9.
Suppose that Z ∈ R^{n×t} has rank r; then,

‖T_τ(Z) − Z‖_F ≤ τ√r. (37)

Proof. Consider the SVD Z = U Σ V^T with singular values σ_1, …, σ_r. Then

T_τ(Z) − Z = −U diag(min(σ_i, τ)) V^T, (38)

so that

‖T_τ(Z) − Z‖_F² = Σ_{i=1}^{r} min(σ_i, τ)² ≤ rτ². (39)

Therefore, we have ‖T_τ(Z) − Z‖_F ≤ τ√r.

The upper bound of the recovery error is then given in the following theorem.
Theorem 10. Suppose that X̂ is the estimate of X obtained by the DRMC algorithm; then, one has

‖X̂ − X‖_F ≤ (λ̄√r + 2η ε_d)/((1 − γ)(1 + 2η)). (41)

Proof. When the iterative process in Algorithm 1 has converged, X_new will be equal to X_old. Hence, X̂ will be the fixed point of the following:

X̂ = T_{λ̄/L_f}(A(X̂)). (42)

Therefore, we have

‖X̂ − X‖_F ≤ ‖T_{λ̄/L_f}(A(X̂)) − T_{λ̄/L_f}(A(X))‖_F + ‖T_{λ̄/L_f}(A(X)) − A(X)‖_F + ‖A(X) − X‖_F. (43)

Combining (43) and Lemma 8, we have

‖X̂ − X‖_F ≤ ‖A(X̂) − A(X)‖_F + ‖T_{λ̄/L_f}(A(X)) − A(X)‖_F + ‖(I − A)(X)‖_F, (44)

where I is the identity operator and A is the operator defined in (34). Then, applying Assumption 6 and Lemma 9 to (44), we have

‖X̂ − X‖_F ≤ ‖A(X̂) − A(X)‖_F + (λ̄√r + 2η ε_d)/(1 + 2η). (45)

Combining Assumption 7, (24), and (45), we finally obtain the following error bound:

‖X̂ − X‖_F ≤ (λ̄√r + 2η ε_d)/((1 − γ)(1 + 2η)). (46)

Let ε(λ̄) represent the upper bound of the recovery error. Then, according to Theorem 10, ε(λ̄) = (λ̄√r + 2η ε_d)/((1 − γ)(1 + 2η)). Note that ε(λ̄) is an increasing function of λ̄. So, we expect the actual recovery error of the DRMC algorithm to increase with λ̄, which is confirmed later by the simulation results in Section 8.3.
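As a quick sanity check of the two shrinkage-operator facts the proof relies on, nonexpansiveness (Lemma 8) and the bound on how far shrinkage moves a rank-r matrix (Lemma 9), both can be verified numerically (svt is our own name for the shrinkage operator):

```python
import numpy as np

def svt(Z, tau):
    """Singular value shrinkage operator."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(3)
r, tau = 4, 0.7
Z = rng.standard_normal((12, r)) @ rng.standard_normal((r, 15))  # rank <= r
Y = Z + 0.3 * rng.standard_normal(Z.shape)
```

Both inequalities hold for any matrices and any threshold; the random instances here are only illustrative.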

Evaluation
We designed a data gathering scheme based on the proposed DRMC algorithm. The data gathering procedure is similar to [11]. Firstly, the sink node broadcasts a sampling rate to all sensor nodes. Secondly, each sensor node randomly and independently decides whether to forward its readings to the sink according to the sampling rate. Finally, the sink node collects the incomplete data matrix and uses DRMC to recover the missing data. After implementing this data gathering scheme in Matlab, we carried out extensive experiments on three real-world datasets (shown in Table 1) to evaluate the effectiveness of DRMC.
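The per-node sampling decision amounts to an independent Bernoulli trial per reading; a minimal sketch (sample_readings is our own name, not part of the paper's Matlab implementation):

```python
import numpy as np

def sample_readings(X, rho, rng):
    """Each reading is forwarded independently with probability rho (the broadcast rate)."""
    S = (rng.random(X.shape) < rho).astype(float)   # per-reading Bernoulli decision
    return S, S * X    # sampling matrix and the incomplete matrix seen by the sink
```

For a large matrix the empirical fraction of forwarded readings concentrates around the broadcast rate rho.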

Baseline Methods.
We select two state-of-the-art methods to compare with DRMC. The first method is Compressive Sensing (CS); we choose the DCT matrix defined in (9) to serve as the orthonormal basis in CS. The second method is Spatiotemporal Compressive Data Collection (STCDG), whose parameters are set to 0.5 and 10, respectively. Note that since our earlier work, DCT-RPMC, depends on historical data while the proposed algorithm does not, we do not select DCT-RPMC as a baseline method.

Recovery Accuracy.
Firstly, we compare the recovery accuracy of the proposed algorithm with the two baseline methods described above. The parameters of DRMC are listed in Table 2.
Simulation experiments are carried out on the three real-world datasets. Each simulation is conducted for 100 independent trials. The recovery errors are computed according to (6) and averaged over the 100 trials.
Comparison results are shown in Figures 3-5. For experiments on the Intel Temperature Trace, all methods achieve nearly the same recovery accuracy when the sampling rate is high. When the sampling rate drops below a certain value (about 0.1), the recovery performance of the baseline methods deteriorates quickly, while DRMC still achieves good accuracy. When the sampling rate is as low as 0.03, which corresponds to 97% data loss, DRMC can reconstruct the lost data with a recovery error of less than 10%, while the recovery errors of CS and STCDG are close to 100%.
Comparison results on the Intel Humidity Trace are very similar to those on the Intel Temperature Trace. The recovery error of DRMC is about 9% when the sampling rate is 0.03, which is noticeably better than that of the baseline methods.
For experiments on the PARED Temperature Trace, DRMC still outperforms the baseline methods at low sampling rates. The recovery error of DRMC is about 16% when the sampling rate is 0.03, slightly worse than on the other two traces. This is because the low-rankness and DCT compactness features of PARED Temperature are not as good as those of the other two traces, as shown in Figures 1 and 2.

Parameter Settings.

The DRMC algorithm depends on several input parameters, and the choice of these parameters affects its recovery performance. In this subsection, we discuss how to choose the parameters for DRMC. The nuclear norm regularization parameter is important for ensuring the low-rank feature of the reconstructed data matrix. In DRMC, we adopt a warm-start strategy in which λ is linearly reduced from λ_1 to λ̄; in the implementation, λ_1 = 10λ̄ + 1, λ_J = λ̄, and J = 20. We tested DRMC on a range of values of λ̄ to investigate how λ̄ affects the performance of DRMC. Figure 6 shows the experimental results. The recovery errors of DRMC increase with λ̄, just as predicted by the theoretical error bound (41) in Section 7. Note that when λ̄ < 0.1, the recovery performance of DRMC is not sensitive to λ̄. So, in practice, we simply set it to a small enough value, λ̄ = 0.001.
Parameter η is a regularization parameter that enforces the DCT compactness feature of the recovered signal. Figure 7 shows the effect of η on the recovery performance of DRMC. The recovery errors decline as η grows and stabilize when η > 0.1. So, we choose to use η = 1 in our experiments.
Recall that k represents the number of concentrated DCT coefficients. As discussed before in Section 4.3, the first 10 DCT coefficients concentrate 99% of the total energy. As expected, Figure 8 shows that the recovery errors drop quickly as k grows and are stable when k > 10. So, we choose to use k = 10 for our experiments. Note that the recovery errors increase slightly with the growth of k when k > 10. We explain this by considering the extreme case k = t. If k = t, D_1 is equal to D. As a result, the DCT regularization term ‖X D_2^T‖_F² in (15) is automatically equal to zero. Therefore, the DCT compactness property of the sensory data is not utilized and the recovery errors increase, just as in the case in Figure 7 when η = 0.

Energy Consumption and Network Lifespan.

The DRMC-based data gathering protocol is more energy efficient because it transmits fewer packets than the classic one (receiving and forwarding). As a result, the DRMC-based protocol can save more energy and prolong the lifespan of wireless sensor networks. To verify this, simulation experiments are conducted; the simulation configuration is shown in Table 3.
In the simulation, sensor nodes are randomly deployed in a 500 m × 500 m area and the sink node is deployed at the center. Each sensor node is equipped with 1 J of energy. To evaluate the energy consumption, we adopt the following energy model [29]:

E_Tx(l, d) = E_Tx-elec · l + ε_amp · l · d²,
E_Rx(l) = E_Rx-elec · l.

E_Tx(l, d) represents the energy consumption of transmitting l bits of data over distance d. E_Rx(l) denotes the energy consumption of receiving l bits of data. E_Tx-elec is the energy consumed by the transmitting circuit to process 1 bit of data, E_Rx-elec is the energy consumed by the receiving circuit to process 1 bit of data, and ε_amp is the energy consumed by the power amplifying circuit.

Figure 9 demonstrates the network lifespan of the DRMC-based protocol and the baseline protocols under different sampling rates. Note that the network lifespan is defined as the time when the first energy-exhausted node appears. Apparently, the sampling rate does not play a role in classic data gathering, since it directly transmits all data without compression; therefore, the lifespan curve of the classic protocol in Figure 9 is a straight line. For the CS method, the smaller the sampling rate is, the fewer measurements are taken, and as shown in Figure 9, the lifespan of CS decreases as the sampling rate grows. However, when the sampling rate is above a certain value, the lifespan of CS is even worse than that of the classic protocol; the reason why CS performs badly at large sampling rates is well analyzed in [30]. Figure 9 shows that the DRMC-based protocol achieves the best lifespan. Similarly, the lifespan of DRMC decreases as the sampling rate grows. When the sampling rate increases to 1, the DRMC-based protocol is equivalent to the classic protocol and their lifespans are equal. Note that the lifespan of STCDG is exactly the same as that of DRMC, because both methods are based on matrix completion. The lifespan of DRMC is longer than that of CS because the sampling matrix in DRMC is much sparser than that in CS.
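The first-order radio model above can be written as two small helpers. The numeric default constants below are illustrative placeholders only, not the values used in the paper's simulation (those are given in Table 3):

```python
def tx_energy(l_bits, d, e_elec_tx=50e-9, eps_amp=100e-12):
    """Energy (J) to transmit l_bits over d meters: circuit cost plus d^2 amplifier cost."""
    return e_elec_tx * l_bits + eps_amp * l_bits * d ** 2

def rx_energy(l_bits, e_elec_rx=50e-9):
    """Energy (J) to receive l_bits: circuit cost only, independent of distance."""
    return e_elec_rx * l_bits
```

The quadratic distance term is why transmitting fewer packets over long hops, as the DRMC-based protocol does, dominates the energy budget.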

Conclusion
In this paper, we studied the data gathering and reconstruction problem in WSNs. We modeled it as a matrix completion problem and investigated the data features in real WSN datasets. Then, by taking advantage of the low-rankness and DCT compactness features of WSN data, we proposed a DCT Regularized Matrix Completion (DRMC) algorithm to reconstruct the missing data. The recovery error of DRMC is carefully analyzed and a theoretical upper bound is presented. Experimental results show that DRMC outperforms state-of-the-art methods at low sampling rates and achieves good recovery accuracy even when the sampling rate is very low.