Kernel Sparse Representation with Hybrid Regularization for On-Road Traffic Sensor Data Imputation

The problem of missing values (MVs) in traffic sensor data analysis is universal in current intelligent transportation systems because of various reasons, such as sensor malfunction, transmission failure, etc. Accurate imputation of MVs is the foundation of subsequent data analysis tasks since most analysis algorithms need complete data as input. In this work, a novel MVs imputation approach termed as kernel sparse representation with elastic net regularization (KSR-EN) is developed for reconstructing MVs to facilitate analysis with traffic sensor data. The idea is to represent each sample as a linear combination of other samples due to inherent spatiotemporal correlation, as well as periodicity of daily traffic flow. To discover few yet correlated samples and make full use of the valuable information, a combination of l1-norm and l2-norm is employed to penalize the combination coefficients. Moreover, the linear representation among samples is extended to nonlinear representation by mapping input data space into high-dimensional feature space, which further enhances the recovery performance of our proposed approach. An efficient iterative algorithm is developed for solving KSR-EN model. The proposed method is verified on both an artificially simulated dataset and a public road network traffic sensor data. The results demonstrate the effectiveness of the proposed approach in terms of MVs imputation.


Introduction
With the rapid development of finance and society, traffic congestion has become an urgent worldwide problem causing the waste of travel time, environmental pollution, etc. To alleviate traffic congestion and improve road capacity, intelligent transportation system (ITS) [1] was developed by integrating various techniques, such as computer, communication, artificial intelligence, and so on. Data acquisition through different sensors is the foundation of ITS, where higher-level functions including traffic forecasting, route planning, etc., usually depend on the quality and quantity of sensor data. Unfortunately, traffic data acquired by sensors are usually incomplete, indicating the loss of many entries, due to various unpredictable factors, including sensor malfunction, transmission network anomaly, storage equipment damage, etc. For example, for a dense road network in Melbourne city, about 8% of its sensors can reach up to 56% missing data. Likewise, about 10% of daily traffic volume in Beijing is missing [2], and according to Turner et al. [3], more than 5% of data within the PeMS database are lost. The loss of traffic sensor data is unavoidable under existing technical level and objective conditions, therefore posing great challenge for effective deployment of ITS. For instance,

•
We perform MVs imputation in kernel-induced high-dimensional feature space instead of the original input space [29,30]. By doing so, nonlinear relationship among data samples can be discovered and leveraged for recovery performance improvement. To the best of our knowledge, this is the first work to which KSR was applied for MVs imputation.

•
We propose to apply the combination of l 1 -norm and l 2 -norm, namely elastic net in statistics literature [26], as regularization on the representation coefficients, with the hope that enough information can be extracted from those highly correlated samples for recovering MVs.
• An iterative algorithm is developed for solving the resulting KSR-EN model by integrating monotone fast iterative shrinkage thresholding algorithm (FISTA) [31][32][33], as well as projected gradient descent (PGD) approach [24]. • The proposed model is evaluated on both synthetic data and real-work traffic sensor data.
The results demonstrate that KSR-EN outperforms other competing algorithms in terms of MVs imputation.
The remainder of this paper is organized as follows. In Section 2, we briefly review some popular approaches for MVs imputation. In Section 3, KSR-EN model and corresponding solution algorithm are proposed. Section 4 reports some experiments on simulated as well as real-world traffic sensor data. Finally, we give some conclusions and future works in Section 5.

Related Work
In order to solve the problem of MVs imputation, many algorithms have been put forward by researchers from different perspectives. The most widely applied approaches can be roughly divided into three categories which are discussed in sequel.

Probabilistic Model Based Methods
For this type of method, a statistical model responsible for producing complete data needs to be specified. One of the most common choices is the multivariate Gaussian model and its extension [34]. Such a model is used to describe the inherent relationship between variables, forming the basis for MVs recovery. The model parameters, along with those MVs, can be estimated simultaneously via alternating optimization. In probabilistic principal component analysis (PPCA) [2,35], the data is assumed to be drawn from a single low-dimensional linear subspace. Then, the likelihood of observed values is derived and maximized through the well-established expectation maximization (EM) algorithm [36]. Furthermore, Bayesian PCA [37] combines the advantages of Bayesian learning [38] and PPCA such that the dimensionality of latent space in PPCA can be automatically inferred based on data. These methods are suitable for data with dominant structure. However, PPCA may perform poorly when missing ratio in data is high [39]. Moreover, how to specify a proper statistical model for real-world data is not a trivial task. In practice, it may be infeasible to postulate a uniformed model for different types of data.

Regression Model Based Methods
The regression-based methods attempt to establish regression equation to characterize the relationship between a number of observed variables and other variables with MVs based on a set of samples. Various concrete regression techniques can be adopted for this purpose, such as linear regression, support vector machine (SVM) [40,41], and neural networks (NNs) [42]. Then, MVs are replaced with the conditional expectation of the regression results. Local least squares (LLS) regression is a typical regression based imputation method which has been successfully applied in different situations [43]. By using K-nearest neighboring search, LLS first chooses a small number of variables which are most similar to the target variable. Then, least squares criterion is used to establish the regression equation based on the observed data. By doing so, LLS succeeds in exploiting underlying local similarity structure among data and thus is more applicable when data distribute in a nonlinear way. However, LLS may fail in the case of high missing ratio due to unreliable estimation of nearest neighbors or improper regression model.

Matrix Completion Based Methods
As stated in Section 1, traffic sensor data in the same road network usually exhibit strong spatiotemporal correlation because of road structure, as well as people's travel behavior. As a consequence, the data matrix usually has low-rank property, implying the number of independent rows (and columns) is much smaller than the size of matrix. Under such a circumstance, low-rank matrix completion (LRMC) [18,19] can be used to recover MVs through rank (or its surrogate nuclear norm) minimization on the whole matrix. Many efficient optimization algorithms have been developed to solve the problem, e.g., SVT [18,44], FPCA [45], ADMM [46], etc. However, LRMC depends on global linear correlation, which is restricted for real data, such as traffic sensor data [19]. Recently, sparse representation (SR) [23] based subspace clustering [21] has drawn much attention because it supposes that the samples are drawn from a union of multiple subspaces, instead of a single one. Due to such advantage, it can reveal complex structure of data, thus leading to better imputation performance [20]. Nevertheless, many datasets in practice are not necessarily well characterized by multiple linear models, and in such cases, existing algorithms may produce suboptimal imputation results [47].

Linear and Kernel SR-EN
p and N denote the number of features and samples, respectively. In the case we study, not all of the entries in sample X i are known. Therefore, let Ω indicate the index set of observed entries in X, that is, for all (i, j) ∈ Ω, x i (j) is observable. The task here is to obtain an accurate estimation for those x i (j), ∈ Ω. Sharing similar flavor with SR [23] and sparse subspace clustering [21,48], we hope that each sample can be well approximated as a linear combination of other samples, i.e., indicating the contribution of x j in reconstructing x i . Additionally, the constraint w i (i) = 0, i.e., diag(W) = 0 in matrix form, is added to avoid trivial solution. Besides this reconstruction criterion, we also impose a penalty to W so as to alleviate possible overfitting. In this work, elastic net regularization [26], as a hybrid of l 1 -norm and l 2 -norm, is adopted since it not only achieves sparse variable selection but also would benefit from highly correlated variables. By integrating the above ingredients together, we propose the following MVs imputation model termed as SR with elastic net regularization (SR-EN): and 1 ≥ α ≥ 0 are two parameters used to balance the role of regularization and l 1 -norm, respectively. When α = 1 or α = 0, elastic net regularization will degenerate to pure l 1 -norm or l 2 -norm regularization. The above SR-EN model is able to recover missing values when samples distribute in a union of multiple linear subspaces. However, it may produce suboptimal imputation when applied to data with nonlinear structure. Therefore, we further extend SR-EN to deal with samples distributed nonlinearly in raw input space. To achieve this goal, motivated by kernel method [5], we first map the original input space R p to a reproducing kernel Hilbert space (RKHS) H with higher or even infinite dimensionality, by employing nonlinear mapping function φ : R p → H . Let φ(x) ∈ H be the image of sample x in feature space H, and φ(X) = [φ(x 1 ), φ(x 2 ), · · · , φ(x N )] denotes the entire sample matrix after mapping. Similar to classic kernel-based learning, we make an assumption that nonlinear distribution of a given dataset in original input space R p can be well converted into linear distribution in feature space H with much higher dimensionality, and thus facilitating the application of SR-EN. Based on the above discussion, the kernel SR-EN (KSR-EN) we propose can be expressed as: (2) Figure 1 presents the effect of nonlinear mapping. Note that the above KSR-EN model reduces to its linear version (1) when the mapping function φ is linear, namely, φ(x) = x. Therefore, in what follows, we will concentrate on KSR-EN (2) and develop an effective algorithm for solving it. We let λ 1 = Cα and λ 2 = C(1 − α) to simplify notations when deriving optimization algorithm.

Optimization Algorithm
As we can see from (2), the coupling between decision variables and makes it difficult to find optimal solutions for and simultaneously. Therefore, in this work, we choose alternating optimization scheme to find the optimal solution iteratively. Specifically, we attempt to solve (2) by alternatively optimizing over and while holding the other variable fixed.
Optimize while fixing . In such a case, (2) leads to the following optimization problem: In terms of our problem ( operator of a matrix. Furthermore, based on the rule of matrix derivative [49], the gradient of ( ) w.r.t. is given by ∇ ( ) = ( ) ( ) − ( ) ( ). We notice immediately that both ( ) and its gradient, ∇ ( ), depend exclusively on the inner products between the images of all pairs of samples in feature space, without having to give explicit representation for . In other words, we can introduce kernel Gram matrix = ( ) ( ), with element computed as = ( ) � �. Then, we have: According to Mercer's theorem [50], any matrix could be a valid kernel as long as it is positive semi-definite (PSD). Some commonly used kernels include polynomial kernel, radial basis function (RBF) kernel, sigmoid kernel, etc. In this work, RBF kernel, defined as , is used because of its simplicity, along with good empirical performance in various kernel-based learning algorithms. In RBF kernel, is a free parameter controlling the smoothness degree of the kernel.
On the other hand, the second part, ( ) , i.e., the regularization term, is convex yet nondifferentiable, thus restricting the application of traditional gradient descent algorithm. Nevertheless, considering the facts that ∇ ( ) is Lipschitz-continues and ( ) has a closed-form proximity operator, we develop a first-order algorithm to find the optimal solution of (3) under the proximal gradient descent framework. This framework, also known as fast iterative shrinkage thresholding approach (FISTA), has been widely used to solve various sparsity-related problems, such as [33].
Specifically, given the Lipschitz constant of ∇ ( ) and the current solution of (3) at the -

Optimization Algorithm
As we can see from (2), the coupling between decision variables X and W makes it difficult to find optimal solutions for X and W simultaneously. Therefore, in this work, we choose alternating optimization scheme to find the optimal solution iteratively. Specifically, we attempt to solve (2) by alternatively optimizing over W and X while holding the other variable fixed.
Optimize W while fixing X. In such a case, (2) leads to the following optimization problem: In terms of our problem (3), the objective can be decomposed into the sum of two parts f (W) + g(W), where f (W) = 1 2 φ(X) − φ(X)W 2 and g(W) = λ 1 W 1 + λ 2 2 W 2 . Let us focus on the first part, f (W). It is obvious that f (W) is convex and differentiable and can be reformulated as f (W) = where tr(·) denotes the trace operator of a matrix. Furthermore, based on the rule of matrix derivative [49], the gradient of . We notice immediately that both f (W) and its gradient, ∇ f (W), depend exclusively on the inner products between the images of all pairs of samples in feature space, without having to give explicit representation for φ. In other words, we can introduce kernel Gram matrix Then, we have: According to Mercer's theorem [50], any matrix K could be a valid kernel as long as it is positive semi-definite (PSD). Some commonly used kernels include polynomial kernel, radial basis function (RBF) kernel, sigmoid kernel, etc. In this work, RBF kernel, defined as is used because of its simplicity, along with good empirical performance in various kernel-based learning algorithms. In RBF kernel, γ is a free parameter controlling the smoothness degree of the kernel.
On the other hand, the second part, g(W), i.e., the regularization term, is convex yet nondifferentiable, thus restricting the application of traditional gradient descent algorithm. Nevertheless, considering the facts that ∇ f (W) is Lipschitz-continues and g(W) has a closed-form proximity operator, we develop a first-order algorithm to find the optimal solution of (3) under the proximal gradient descent framework. This framework, also known as fast iterative shrinkage thresholding approach (FISTA), has been widely used to solve various sparsity-related problems, such as [33].
Specifically, given the Lipschitz constant L of ∇ f (W) and the current solution W k of (3) at the k-th iteration, it is possible to construct an approximating function q(W, W k ) majorizing the original f (W) at W k : where the definitions of f (W k ) and ∇ f (W k ) are given in (4). From the definition of q(W, W k ), we have: where the equality holds if and only if W = W k . This fact motivates the following update: We can prove that W k+1 will lead to an improved objective value for (3) because: In order to solve (7), we rewrite the objective (5) as: and incorporating the definition of g(W), W k+1 can be obtained by: Notice that all of the entries in W are independent of each other, and thus can be optimized separately and parallelly. For the sake of simplicity, consider the proximity operator for elastic net regularization with univariate as follows: where a, b ∈ R. Problem (11) has a closed-form optimal solution as: where (a) + = max{0, a}. Incorporating the above result into (10), one can obtain the solution W k+1 as: The whole algorithm for solving (3) is summarized in Algorithm 1. Input: Estimated data matrix X, trade-off parameters λ 1 , λ 2 , Gaussian kernel parameter γ Optimize X while fixing W. In such a case, (2) leads to the following optimization problem with equality constraint: where we have used the results in (4). To solve (14), we employ the projected gradient descent method with Armijo step size rule. The essential ingredient is the calculation of derivative w.r.t. decision variables.
Denoting the objective in (14), as h(X) = tr K − KW − W T K + W T KW , its partial derivative w.r.t. the kernel matrix K is given by: , we can easily compute: To calculate the derivative of h(X) w.r.t. the entries in X, we apply the chain rule to get: where ∂h(X) ∂K ik and ∂K ik ∂x i (j) can be obtained from (15) and (16), respectively. For clarity, the algorithm for solving (14) is presented in Algorithm 2. Algorithm 2. Gradient descent algorithm for solving (14) Input: Coefficient matrix W Output: Estimated data matrix X Procedure Initialize data matrix X while not converge do

Configuration
Besides the proposed KSR-EN (and its linear version SR-EN), we include some typical imputation algorithms including LLS, PPCA, and LRMC in order to comprehensively evaluate their performance. As stated in Section 2, LLS is a regression based method, PPCA relies on statistical assumption of data, and LRMC imposes low-rank property of sample matrix. Some recent studies [2,51] have manifested that PPCA and LRMC are two effective approaches for traffic sensor data. All of these methods are implemented in MATLAB 2015a on a PC with Core i7 2.4 GHz CPU and 12GB RAM. Following [20,24], the parameters in each method, such as the number of K-nearest neighbors in LLS, the dimensionality of latent space in PPCA, are tuned to give optimal performance.
To measure the accuracy of each method, we randomly produce MVs and then employ different methods to obtain corresponding estimations. Finally, the estimated values are compared with the real values and the difference between them are calculated. In this work, two widely applied metrics, i.e., root mean square error (RMSE) and relative error (RELERR), are calculated as follows: where T denotes the total number of missing entries, andx i (j) and x i (j) denote the imputed value and the real value, respectively. Obviously, the smaller the RMSE and RELERR, the better the imputation performance. In addition, we repeat each test 10 times and report the mean imputation error and the associated standard deviation, so as to reduce the potential bias caused by randomness.

Synthetic Data
We first evaluate the proposed method on nonlinear synthetic dataset to intuitively illustrate its behavior. The simulated samples are shown as the red points in Figure 2. The specific equations for generating these samples are described as follows: where latent variable t is sampled from a one-dimensional uniform distribution in the interval − π 2 , 0 . As we can see, the entire set of samples is drawn from two arcs which intersect at the origin. The intrinsic dimension for each arc is 1. Gaussian noise with zero mean and 0.05 standard deviation is added to each sample. We synthesize 100 samples from each arc and the whole sample set, organized in matrix X ∈ R 3×200 in this case, forms a nonlinear structure in three-dimensional space.
To get incomplete matrix, a randomly selected entry for each sample is removed, thus resulting in an incomplete matrix with missing ratio of 33.33%. Then, different imputation approach is applied to restore the missing entries of the data matrix. Figure 3 show the results in one experiment. Notice that in Figure 3, the first column depicts the real samples (red points) and the imputed samples (blue points), the second column shows the scatter plot of real and estimated values, and the last column further shows the residual. The averaged errors obtained by different methods across 10 tests are summarized in Table 1, where the best results are highlighted in bold. diagonal structure spreads in wider range in obtained by SR-EN. For KSR-EN, we also notice from the figure of KSR-EN that some nonzero values in appear far away from the main diagonal, seemingly violating our hypothesis. After careful analysis, we found that these off-diagonal nonzero values are mainly caused by the samples drawn from the intersection area of two arcs. Since the samples in this area are heavily mixed, it may be impossible to distinguish them completely.  diagonal structure spreads in wider range in obtained by SR-EN. For KSR-EN, we also notice from the figure of KSR-EN that some nonzero values in appear far away from the main diagonal, seemingly violating our hypothesis. After careful analysis, we found that these off-diagonal nonzero values are mainly caused by the samples drawn from the intersection area of two arcs. Since the samples in this area are heavily mixed, it may be impossible to distinguish them completely.

Data Collection
In this experiment, we evaluate the proposed KSR-EN algorithm on a real-world traffic flow dataset which is publicly accessible at http://portal.its.pdx.edu/. The data is collected from 40 loop detectors installed at Interstate 205 (I205) interstate highways. The selected sub-area road network is shown in Figure 5. The traffic volume is aggregated every 15 min and the unit is vehicles per 15 min (veh/15 min). For each loop detector, 96 data points is recorded per day. We use traffic data of 30  As can be seen from these results, LRMC performs worst since it heavily depends on the global low-rank structure of data, which is violated in our case. Instead of minimizing the rank of data for recovery, PPCA makes use of maximum likelihood estimation and EM algorithm to jointly optimize model parameters and MVs. As a result, PPCA works better than LRMC, although it also imposes Gaussian distribution for data. SR-EN further improves the results of PPCA, although it is also based on linear structure of data. Different from PPCA, SR-EN is applicable to data lying on or close to a union of linear subspaces [48]. LLS achieves significantly smaller errors than LRMC, PPCA, and SR-EN. It may be because LLS is an imputation algorithm based on local rather than global linear relationship between samples. Finally, as expected, KSR-EN, as nonlinear extension of SR-EN, consistently outperforms all other imputation algorithms under both performance metrics. For example, the imputed results obtained by KSR-EN fit well with the true values according to Figure 3. For more clearly illustrating this, we present in Figure 4 the coefficient matrix W obtained by SR-EN and KSR-EN. Note that the sample index in the figure roughly reflects the proximity relation between samples. We can see that W derived from KSR-EN exhibits clear sparse and diagonal structure, indicating that in the high-dimensional feature space, each sample can be well represented as a linear combination of its neighboring samples within the same class. It also confirms that KSR-EN successfully discovers the underlying nonlinear structure of multiple subspaces. In contrast, the diagonal structure spreads in wider range in W obtained by SR-EN. For KSR-EN, we also notice from the figure of KSR-EN that some nonzero values in W appear far away from the main diagonal, seemingly violating our hypothesis. After careful analysis, we found that these off-diagonal nonzero values are mainly caused by the samples drawn from the intersection area of two arcs. Since the samples in this area are heavily mixed, it may be impossible to distinguish them completely.

Data Collection
In this experiment, we evaluate the proposed KSR-EN algorithm on a real-world traffic flow dataset which is publicly accessible at http://portal.its.pdx.edu/. The data is collected from 40 loop detectors installed at Interstate 205 (I205) interstate highways. The selected sub-area road network is shown in Figure 5. The traffic volume is aggregated every 15 min and the unit is vehicles per 15 min (veh/15 min). For each loop detector, 96 data points is recorded per day. We use traffic data of 30

Data Collection
In this experiment, we evaluate the proposed KSR-EN algorithm on a real-world traffic flow dataset which is publicly accessible at http://portal.its.pdx.edu/. The data is collected from 40 loop detectors installed at Interstate 205 (I205) interstate highways. The selected sub-area road network is shown in Figure 5. The traffic volume is aggregated every 15 min and the unit is vehicles per 15 min (veh/15 min). For each loop detector, 96 data points is recorded per day. We use traffic data of 30 weekdays in the year of 2015 since the traffic flow profile on weekends and holidays are very different from weekdays. The total number of traffic volumes is 96 × 40 × 30 = 115,200. The whole data is organized as a 96 × 1200 matrix with each column denoting a sample. Figure 6 illustrates the traffic flow profiles captured by 40 sensors in the same day. As can be seen, despite of overall similarity, the variation of traffic flow at different road segments shows remarkable difference in terms of maximum traffic flow, the duration of rush hours, etc. The whole data is organized as a 96 × 1200 matrix with each column denoting a sample. Figure 6 illustrates the traffic flow profiles captured by 40 sensors in the same day. As can be seen, despite of overall similarity, the variation of traffic flow at different road segments shows remarkable difference in terms of maximum traffic flow, the duration of rush hours, etc.
In order to comprehensively evaluate the recovery performance of different imputation algorithms under complex environments, we artificially inject MVs into original data by simulating three different missing patterns [2]. (i) Missing completely at random (MCAR), which means the presence or absence of MVs is completely independent of observed values and other parameters of interest. (ii) Missing at random (MAR), indicating that the occurrence of MVs depends on its neighboring points. (iii) Mixture of MCAR and MAR (MIXED), where half of MVs obey MCAR and the other half are from MAR. Generally speaking, MCAR is easier than MAR in the sense of recovery because MVs in the former case appear as isolated points randomly distributed, while MVs in the latter case often look like continuous missing, thus making accurate imputation more challenge. In addition, we also change the missing ratio δ, that is, the ratio of the number of missing entries to the total number of entries in data matrix, from 0.1 to 0.5 with step 0.1. Obviously, the larger δ, the harder the imputation task.     Generally speaking, MCAR is easier than MAR in the sense of recovery because MVs in the former case appear as isolated points randomly distributed, while MVs in the latter case often look like continuous missing, thus making accurate imputation more challenge. In addition, we also change the missing ratio δ, that is, the ratio of the number of missing entries to the total number of entries in data matrix, from 0.1 to 0.5 with step 0.1. Obviously, the larger δ, the harder the imputation task.

Imputation Error
Tables 2-4 summarize the imputation errors obtained by different methods under MCAR, MAR, and MIXED missing patterns, respectively. The best results are highlighted in bold for clarity. From these results, we can obtain some interesting observations. As a typical local structure (instead of global structure) based approach, LLS is able to produce small imputation errors when the missing ratio is low. However, the performance deteriorates rapidly with increased missing ratio. It may be because that the reliable estimation of local structure or neighboring samples turns out to be very difficult given a set of samples with many unknown entries. Different from LLS, LRMC and PPCA all depend on the global linear subspace structure of data. PPCA works slightly better than LRMC on this dataset, and both outperform LLS when δ is larger than 0.3. However, these methods may fail to deliver good performance on the dataset with complex intrinsic structure. SR-EN, our proposed linear imputation approach, works better than the above conventional approaches, since it successfully accounts for the potential multiple linear subspace structure, thus avoiding the assumption of single subspace. KSR-EN further relax the restriction on linear subspace structures of SR-EN via effective nonlinear mapping from original input space to a high-dimensional feature space. By doing so, KSR-EN is able to explore multiple linear subspace structure in feature space, which is nonlinear in input space, and thus enhance the performance of SR-EN. As a result, KSR-EN is superior to SR-EN and the other competing approaches w.r.t. imputation errors, regardless of specific missing pattern or missing ratio. It suggests that the integration of nonlinear kernel mapping, as well as SR-EN, allows significant performance gain for the recovery of MVs in traffic scenarios.

Influence of Parameters
In this work, the linear combination of l 1 -norm and l 2 -norm, namely elastic net, is used as regularization to encourage highly correlated samples can be selected for reconstructing each target sample. As a result, the trade-off parameter α plays an indispensable role in the final model. On one hand, small α means that l 2 -norm dominates the regularization and makes the solution dense. On the other hand, large α will increase the portion of l 1 -norm and thus leads to solution with more sparsity. To this end, we first study the impact of parameter α. In this example, we focus on MCAR missing pattern and missing ratio δ = 0.2. To investigate the influence of α, we fix C = 2 −5 and change α from 0 to 1 with step 0.2. The variation of RMSE and RELERR w.r.t. α is shown in Figure 7. Moreover, one sample is selected to show the variation of representation coefficients w.r.t. α. The results are shown in Figure 8. As we can see from Figure 7, small α and large α both degrade the imputation performance. From the viewpoint of sparsity, when α equals to zero, the resulting coefficients are dense. Large α tends to shrink many elements in the coefficients towards zero. We observe from extensive experiments that in most cases, optimal performance is achieved when α is between 0 and 1, thus confirming the effectiveness of elastic net regularization.
In what follows, we investigate the influence of trade-off parameter C. Through tuning parameter C, we can control the strength of regularization on coefficient matrix. Small C will weaken the role of regularization and leads to coefficient matrix with large magnitude values. It will make the resulting model fit well on the observed values but perform worse on other MVs. This is also known as overfitting in machine learning [52]. In contrast, large C tends to drive many coefficients to be small (or exact zero) and produces a model incapable of characterizing the inherent structure of data sufficiently. Similar to the above experiment, we fix α = 0.6 and tune the value of C from 2 −9 , 2 −7 , 2 −5 , 2 −3 , 2 −1 . The variation of RMSE and RELERR is shown in Figure 9. Accordingly, the variation of representation coefficients of one selected sample is illustrated in Figure 10. We notice that when C is not too large or small, the obtained performance is satisfactory. This observation is consistent with the above analysis we made.
Next, we investigate the influence of aggregation time on the imputation performance of each method. Larger aggregation time leads to smoother traffic flow profile, thus causing loss of detail information. Due to different aggregation time, the variation range of traffic flow in unit interval is much different. Therefore, relative error is more suitable when comparing imputation performance under different aggregation time. The aggregation time is set to be 30 min and the experimental results are shown in Figure 11. As can be seen, the proposed KSR-EN outperforms all the other approaches regardless of specific missing pattern or missing ratio, thus verifying the effectiveness of KSR-EN again.

Influence of Parameters
In this work, the linear combination of l1-norm and l2-norm, namely elastic net, is used as regularization to encourage highly correlated samples can be selected for reconstructing each target sample. As a result, the trade-off parameter plays an indispensable role in the final model. On one hand, small means that l2-norm dominates the regularization and makes the solution dense. On the other hand, large will increase the portion of l1-norm and thus leads to solution with more sparsity. To this end, we first study the impact of parameter . In this example, we focus on MCAR missing pattern and missing ratio δ = 0.2. To investigate the influence of , we fix = 2 −5 and change from 0 to 1 with step 0.2. The variation of RMSE and RELERR w.r.t. is shown in Figure 7. Moreover, one sample is selected to show the variation of representation coefficients w.r.t. . The results are shown in Figure 8. As we can see from Figure 7, small and large both degrade the imputation performance. From the viewpoint of sparsity, when equals to zero, the resulting coefficients are dense. Large tends to shrink many elements in the coefficients towards zero. We observe from extensive experiments that in most cases, optimal performance is achieved when is between 0 and 1, thus confirming the effectiveness of elastic net regularization.  In what follows, we investigate the influence of trade-off parameter . Through tuning parameter , we can control the strength of regularization on coefficient matrix. Small will weaken the role of regularization and leads to coefficient matrix with large magnitude values. It will make the resulting model fit well on the observed values but perform worse on other MVs. This is also known as overfitting in machine learning [52]. In contrast, large tends to drive many coefficients to be small (or exact zero) and produces a model incapable of characterizing the inherent structure of data sufficiently. Similar to the above experiment, we fix = 0.6 and tune the value of from {2 −9 , 2 −7 , 2 −5 , 2 −3 , 2 −1 }. The variation of RMSE and RELERR is shown in Figure 9. Accordingly, the variation of representation coefficients of one selected sample is illustrated in Figure 10. We notice that when is not too large or small, the obtained performance is satisfactory. This observation is consistent with the above analysis we made.     information. Due to different aggregation time, the variation range of traffic flow in unit interval is much different. Therefore, relative error is more suitable when comparing imputation performance under different aggregation time. The aggregation time is set to be 30 min and the experimental results are shown in Figure 11. As can be seen, the proposed KSR-EN outperforms all the other approaches regardless of specific missing pattern or missing ratio, thus verifying the effectiveness of KSR-EN again.

Computational Time
In this section, we report the computational time of different approaches employed in this paper. We take MIXED missing pattern as it is a combination of MCAR and MAR cases. The experimental results are shown in Table 5 as follows. As we can see, the proposed KSR-EN performs slower than other traditional approaches. It is well-known that sparse representation is a time-consuming procedure especially when the number of samples is large. In addition, gradient descent generally leads to slow convergence. To deal with this problem and improve the efficiency of our approach, several strategies can be exploited. For example, we can approximately solve sparse representation problem through greedy algorithms, such as matching pursuit (MP) or orthogonal MP (OMP) [53], etc. Alternatively, quasi-Newton approach can be used to replace the gradient descent, owing to its affordable memory request but fast convergence. We will concentrate on improving the efficiency of this model in the future work.

Computational Time
In this section, we report the computational time of different approaches employed in this paper. We take MIXED missing pattern as it is a combination of MCAR and MAR cases. The experimental results are shown in Table 5 as follows. As we can see, the proposed KSR-EN performs slower than other traditional approaches. It is well-known that sparse representation is a time-consuming procedure especially when the number of samples is large. In addition, gradient descent generally leads to slow convergence. To deal with this problem and improve the efficiency of our approach, several strategies can be exploited. For example, we can approximately solve sparse representation problem through greedy algorithms, such as matching pursuit (MP) or orthogonal MP (OMP) [53], etc. Alternatively, quasi-Newton approach can be used to replace the gradient descent, owing to its affordable memory request but fast convergence. We will concentrate on improving the efficiency of this model in the future work.

Conclusions
The problem of missing traffic sensor data imputation is studied in this work. Conventional sparse representation based imputation may lead to the loss of information conveyed in highly correlated samples. The application to real-world data with complex nonlinear structure is also problematic due to the drawbacks of the linear model. Therefore, we propose the KSR-EN model, which integrates elastic net regularization and the kernel method in a unified framework. In such a way, MVs imputation is performed in high-dimensional feature space rather than original input data space, benefiting accurate imputation. To solve the resulting model, an iterative algorithm is further developed by optimizing the representation coefficients and MVs alternatively. Experiments on both synthetic and traffic sensor data verify that exploiting nonlinear sparse representation, along with the combination of l 1 -norm and l 2 -norm, can provide more accurate imputation than other competing approaches.
In current work, MVs are estimated from a statistical perspective. Despite reduced estimation errors, the dynamic property of traffic flow is ignored in this work. Many traffic state estimation models have been developed using different techniques, such as extended Kalman filter (EKF) [13], Hamilton-Jacobi equations [12], etc. In future work, we will try to incorporate the traffic state estimation models into our MVs imputation approach to exploit the inherent relationship between variables. Another aspect deserving investigation is the theoretical supporting argument of our proposed model, such as the condition that accurate imputation can be guaranteed. We will focus on the problems in future work.