AN ADAPTIVE K NEAREST NEIGHBOUR METHOD FOR IMPUTATION OF MISSING TRAFFIC DATA BASED ON TWO SIMILARITY METRICS

Abstract: Traffic flow is one of the fundamental parameters for traffic analysis and planning. With the rapid development of intelligent transportation systems, a large number of detectors of various kinds have been deployed on urban roads and, consequently, a huge amount of traffic flow data is now available. However, the traffic flow data collected by these detectors are often degraded by missing values, which can lead to erroneous analysis and decisions if they are not handled appropriately. To remedy this issue, considerable research effort has been made and various imputation techniques have been proposed in recent years, among which the k nearest neighbour (kNN) algorithm has gained great popularity because it is easy to implement and imputes missing data effectively. In the work presented in this paper, we first analyse the stochastic effect of traffic flow, to which the difficulties of the kNN algorithm can be attributed. This motivates us to improve the algorithm while eliminating the requirement to predefine parameters. Such a parameter-free algorithm is realized by introducing a new similarity metric, which is combined with the conventional metric to avoid parameter setting that usually requires adequate domain knowledge. Unlike the conventional kNN algorithm, the proposed algorithm employs a multivariate linear regression model to estimate the weights for the final output, based on a set of data smoothed by a wavelet technique. A series of experiments has been performed on traffic flow data reported from several different countries to examine the adaptive determination of parameters and the smoothing effect. Additional experiments have been conducted to evaluate the performance of the proposed algorithm by comparing it with a number of widely used imputation algorithms.


Introduction
Perceiving traffic flow parameters through detectors facilitates an accurate estimate of the traffic state, which can be used in various applications such as dynamic routing and signal control. Therefore, investment has been intensively put into the construction of traffic detectors in recent years. However, in the data collection process, it is inevitable that various unpredictable malfunctions, such as communication interruptions, power outages or storage equipment damage, occur even for advanced detectors, resulting in missing data. For example, the solar-powered sensors recently promoted in China sometimes cannot function properly during the rainy season. In addition, data transmission can be interrupted for a short time when the communication system is updated (e.g., from 3rd generation to 4th generation). The missing data problem, in which some subsets of traffic data become missing, and its associated effects have been constantly reported in recent years, and it has greatly hindered the collection and subsequent analysis, estimation and prediction of traffic flow data (Wang and Mao, 2019). At the Texas Transportation Institute, the rate of missing data is between 16% and 93% (Li, Zhang, Wang, et al., 2019). Tan et al. found that more than 5% of the data are missing from the PeMS traffic flow database (Tan, Feng, Feng et al., 2013). Similarly, the missing ratio of the traffic data reported in (Xu, Li and Shi, 2010) can be as high as 90%, with an average ratio of 50% over a period of 7 years. An analysis of the traffic flow data obtained from the microwave sensors mounted on the Beijing ring expressways reports up to 50% of data missing (Ma, Luan, Du, et al., 2017). There is no doubt that missing data form an intangible barrier to understanding and modelling traffic phenomena because of the incomplete information.
To remedy the undesired effects caused by missing data, a number of imputation approaches have been proposed in the last few decades. These methods follow different procedures to provide plausible estimates for the missing values given the other observed data, based on the assumption that a real traffic system usually possesses an inherent structure due to the dynamic characteristics of its flow. Although the existing approaches can be classified from various aspects (e.g., the angle of data construction (Tan, Feng, Feng et al., 2013)), the mechanism adopted by each method to infer the missing values largely determines its imputation performance and application scope. Therefore, we group the existing imputation approaches into three categories: prediction-based, interpolation-based and statistical learning-based approaches. However, it should be noted that a few approaches may fit into different categories depending on the point of view.

Prediction-based imputing approaches
The approaches in this category usually use sample data to build a model of the mapping between historical data and future data, and the model is subsequently used to predict the values of the missing data. Typical examples include Bayesian networks (BNs) (Tang, Wang, Zhang, et al., 2015) and support vector regression (Castro-Neto, Jeong, Jeong et al., 2009). In addition, regression techniques, such as linear or polynomial regression, can also be used to estimate the missing values. Although these approaches can model even highly non-linear relationships between independent and dependent variables for prediction purposes, their reliance on continuous chunks of the time series can significantly degrade the imputation performance because not all of the observed data are used efficiently.

Interpolation-based imputing approaches
The simplest interpolation method is nearest-neighbour interpolation, which uses the value at the previous time instant nearest to the instant of a missing datum, or the value at the same time instant of the previous day (Chen, and Shao, 2000). This approach is often a favourable choice if real-time operation is a strict requirement. A slightly more complicated approach is to fill in the missing data by averaging the observed data that are generally close in time to the missing data. More complicated interpolation approaches include linear, spline, and polynomial interpolation (e.g., Lagrange's interpolation formula (Stoeck, and Prajwowski, 2010)). This kind of imputation approach can be further divided into two groups, interpolation and extrapolation. Although the term 'interpolation' is generally used to refer to both situations, interpolation and extrapolation differ according to whether the missing data lie between or outside the observed data. As these approaches force the interpolating function to pass exactly through the given data points, spurious features may be generated in the region of the missing data.
Wang, Y., Xiao, Y., Lai, J., Chen, Y., Archives of Transport, 54 (2), 59-73, 2020

Statistical learning-based imputing approaches
One of the representative approaches in this category is k nearest neighbours (kNN) (Sliva, H. D., and Perera, A. S., 2017; Esawey, M. E., and Sayed, T., 2012), which attempts to derive the missing data from a number of similar patterns. Local least squares (LLS) (Kim, Golub, and Park, 2005) also takes advantage of the information from similar patterns to infer possible values for the missing data. Methods based on principal component analysis (PCA), such as probabilistic principal component analysis (PPCA) and Bayesian principal component analysis (BPCA), extract the statistical characteristics of the observed data and map the relationship between the observed data and latent variables by constructing a number of principal components (Qu, Li, Zhang et al., 2009; Li, Li, and Li, 2014). PPCA employs the expectation-maximization algorithm to determine the projection matrix for the latent variables, whereas BPCA applies a Bayesian estimation approach. As these approaches make use of both global and local information in the observed data, better performance than other conventional imputation methods (e.g., nearest-neighbour interpolation, spline interpolation, and mean historical methods) has been reported for traffic flow data. However, it has been shown that there is no significant difference in imputation accuracy between the different PCA-based approaches (Li, Li and Li, 2013). In recent years, a number of tensor-based approaches have been proposed to impute missing data by taking advantage of spatial-temporal traffic information (Chen, He and Sun, 2019). In theory, considering more correlation information in the imputation process generally produces more accurate estimates of the missing data. However, numerous detectors are deployed outside urban areas and most of them are sparsely spaced, resulting in very weak spatial correlation.
Without loss of generality, the work presented in this paper ignores spatial correlation information in an attempt to deliver an imputation method that can also be applied to rural areas, where the spatial correlation is generally weak. Based on the temporal information of traffic, we propose a novel imputation approach built on the kNN method, which has been reported to be one of the competent imputation approaches in terms of accuracy and efficiency (Loukopoulos, Sampath, Pilidis, et al., 2016). The motivation for improving the kNN approach is presented in the next section. The proposed imputation approach is introduced in Section 3. Section 4 presents the experiments and the corresponding results. The concluding remarks are provided in Section 5.

Motivation
In this section, the imputation problem of traffic flow data considered in this paper is first formulated, followed by a short introduction to the kNN algorithm. Based on an analysis of the stochastic characteristic of traffic flow, the difficulties in applying the kNN algorithm to traffic flow imputation are then discussed, which motivates the improvement.

The traffic flow imputation problem
As stated above, the spatial information is not considered in the imputation process in this paper and, consequently, the imputation problem can be modelled as follows. Suppose that we have a set of data X for D days. That is:
$$X = [x_1, x_2, \ldots, x_D]$$
in which the dth day's traffic flow data is a vector, whose elements are sequentially arranged in time order, as follows:
$$x_d = [x_d(1), x_d(2), \ldots, x_d(M)]^T$$
Note that M is the number of recordings of traffic flow and is determined by the recording frequency adopted in the detector system.
The last column in the data set X contains a number of missing data, whose values need to be estimated. In this paper, it is assumed that values in the data set are missing at random, which means that the missing data are independent of each other. Note that the data set X need not contain traffic flow recordings from consecutive days. In order to take advantage of the observed data, the data set may exclude previous days that contain missing values and instead include fully observed data from earlier days. Furthermore, it is natural to exclude data that follow a different traffic pattern (e.g., the traffic patterns for weekdays and weekends are generally different).

kNN Algorithm
As a non-parametric method, the kNN algorithm has enjoyed great popularity in many classification and regression applications due to its simplicity. The idea of the kNN algorithm for missing data imputation is based on the assumption of local similarity in data space: the missing data in a time series are expected to have values similar to those observed at the same time instants in historical time series, provided that the remaining parts of those time series are similar. Figure 1 illustrates the imputation process for a number of missing data using the kNN algorithm. Suppose we have a set of data, which has been arranged as a matrix X with each row and column denoting the time of day and the day, respectively. Since traffic data are normally sensed at fixed time instants with a constant interval, we simply use indices to indicate the time instant and the day at which the data are collected. The last column vector, $x_D$, contains a number of missing data, each of which is denoted by a question mark in Figure 1. For convenience, we separate the missing data from the observed data and denote the corresponding parts by $x_D^{miss}$ and $x_D^{obse}$, respectively. That is:
$$x_D = \{x_D^{obse}, x_D^{miss}\}$$
The first step is to find the k nearest neighbours of the observed part $x_D^{obse}$, which can be mathematically expressed as:
$$N_k = \underset{x_d^{obse} \in H}{\operatorname{arg\,min}_k}\; s(x_D^{obse}, x_d^{obse})$$
where H contains the historical data $x_d^{obse}$ (d = 1, 2, …, D-1), s is a function that estimates the distance between $x_D^{obse}$ and $x_d^{obse}$, and $\operatorname{arg\,min}_k$ returns the k elements of H with the smallest distances. A number of metrics, such as Manhattan, Chebyshev, and Levenshtein (Abbasifard, Ghahremani, and Naderi, 2014), have been proposed in the literature. However, the Euclidean distance has been widely chosen as the similarity metric since a large number of problems can be defined in Euclidean space. The number of neighbours, k, is usually specified by users based on their experience or by trial and error.
As shown in Figure 1, the column vectors marked in green are the k neighbours nearest to the observed part of the last column. Subsequently, the weight of the ith neighbour, $w_i$, is obtained by normalizing its distance over the sum of the distances of the k neighbours. Finally, the missing value at the mth time instant can be estimated as the weighted sum of the neighbours' values at that instant:
$$\hat{x}_D(m) = \sum_{i=1}^{k} w_i\, x_i(m)$$
where $x_i(m)$ is the value of the ith neighbour at the mth time instant. For easy reference in the sequel, we summarize the primary steps of the kNN algorithm in Figure 2. Although the implementation of the kNN algorithm is straightforward, a number of elements, such as the parameter k and the similarity metric, are critical to the success of the imputation. To determine these critical elements, domain knowledge is frequently employed, which has resulted in a number of variants of the kNN algorithm. An excellent review of these variants can be found in (Bhatia, and Vandana, 2010). However, to the best of our knowledge, there have been few attempts to consider the stochastic characteristic of traffic to improve the imputation accuracy of the kNN algorithm. The next sub-section presents an analysis of this stochastic characteristic and discusses the difficulties that may arise from it.
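The conventional procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: the function name `knn_impute`, the use of `None` to mark missing values, and in particular the inverse-distance weighting are assumptions made for the sketch.

```python
import math

def knn_impute(history, current, k):
    """Impute None entries of `current` from the k nearest rows of `history`.

    history: list of fully observed daily series (lists of floats).
    current: daily series of the same length, with None marking missing values.
    """
    observed = [m for m, v in enumerate(current) if v is not None]
    missing = [m for m, v in enumerate(current) if v is None]

    # Euclidean distance computed on the observed time instants only.
    def distance(day):
        return math.sqrt(sum((day[m] - current[m]) ** 2 for m in observed))

    neighbours = sorted(history, key=distance)[:k]
    dists = [distance(day) for day in neighbours]

    # Assumption of this sketch: inverse-distance weights normalised to sum to one.
    eps = 1e-12  # avoids division by zero when a neighbour matches exactly
    inverse = [1.0 / (d + eps) for d in dists]
    total = sum(inverse)
    weights = [v / total for v in inverse]

    # Weighted sum of the neighbours' values at each missing time instant.
    imputed = list(current)
    for m in missing:
        imputed[m] = sum(w * day[m] for w, day in zip(weights, neighbours))
    return imputed
```

For example, `knn_impute(history, current, k=4)` returns a copy of `current` in which every `None` has been replaced by an estimate; the observed entries are left unchanged.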

The Effect of Stochastic Characteristics
Traffic flow often exhibits a strong stochastic characteristic, which can be attributed to various factors such as stochastic travel demand or drivers' bounded rationality. To examine this characteristic, the signal-to-noise ratios (SNRs) have been estimated for three sets of traffic flow data collected in the Portland metropolitan area, on the freeways in California, and in the sub-urban area of Beijing. For easy reference, the three data sets used in the following experiments are denoted as 'Portal' (i.e., the Portland Oregon Regional Transportation Archive Listing (http://portal.its.pdx.edu, 2018)), 'PeMS' (i.e., the freeway Performance Measurement System (http://pems.dot.ca.gov, 2017)), and 'Beijing' (i.e., a provincial road in the sub-urban area of Beijing). The aggregated 5-minute flow data of 5 weekdays in each of the three sets were used to examine the stochastic characteristics of traffic flow. Figure 3 shows the power spectra of the traffic flows together with the SNR values obtained for the three data sets. From the results presented in Figure 3, it is evident that the traffic flows have a significant stochastic characteristic. It should be noted that the term "noise" is used here to reflect the stochasticity of traffic flow; it does not mean noise that reflects nothing about the intrinsic characteristics of traffic flow. Due to this stochastic characteristic, the k nearest neighbours determined by the kNN algorithm with the Euclidean distance as the similarity metric can vary depending on which data are missing.
To examine such sensitivity, the three data sets were used with the missing rate ranging from 0.05 to 0.9 with the interval of 0.05. For each missing rate, 100 independent runs were performed and the missing data were randomly selected for each run. With the parameter k set to be 4 for all tests, the order of k nearest neighbours determined for each run was recorded.
To reflect the fluctuation in the order of the k nearest neighbours, we define a ratio, named the Difference Ratio, as the number of unique orders of the k nearest neighbours over the 100 independent runs. The results obtained for the three data sets are illustrated in Figure 4. From Figure 4, it can be seen that the uncertainty in the determination of the k nearest neighbours increases with the missing rate for all test data. Such uncertainty, resulting from the stochastic characteristic of traffic flow, can obviously affect the imputation performance. To remedy this issue, we propose a mild solution, described in Section 3, instead of striving hard for the exact answer.
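The sensitivity test can be reproduced in outline as follows. The sketch assumes that the Difference Ratio is the number of distinct neighbour orderings divided by the number of runs; the function names, the fixed random seed and the rounding of the number of removed points are illustrative choices, not details taken from the paper.

```python
import math
import random

def neighbour_order(history, current, observed, k):
    """Indices of the k nearest historical days, nearest first."""
    def distance(d):
        return math.sqrt(sum((history[d][m] - current[m]) ** 2 for m in observed))
    return tuple(sorted(range(len(history)), key=distance)[:k])

def difference_ratio(history, current, missing_rate, k=4, runs=100, seed=0):
    """Number of unique neighbour orderings divided by the number of runs."""
    rng = random.Random(seed)
    n = len(current)
    n_missing = int(round(missing_rate * n))
    orders = set()
    for _ in range(runs):
        # Remove a fresh random subset of time instants in every run.
        missing = set(rng.sample(range(n), n_missing))
        observed = [m for m in range(n) if m not in missing]
        orders.add(neighbour_order(history, current, observed, k))
    return len(orders) / runs
```

A value of 1/runs means that the same ordered set of neighbours was found in every run, while a value close to 1 means that almost every run produced a different ordering.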

Methodology
This section begins with an introduction to the structure of the improved version of kNN, followed by an explanation of the modified implementation. As shown in Figure 5, the algorithm starts with the preparation of the data set, which includes the traffic flow with missing data and the historical data without missing values. For convenience, we use "current" and "historical" data in the sequel to refer to the flow data of a day with and without missing values, respectively. The current data contain the missing values that need to be imputed. Next, the similarity of each historical time series to the current flow data is measured using two metrics, namely the interweaving degree and the Euclidean distance, which are introduced in the following sub-section. According to the estimated similarities, the historical flow data are classified into different groups, which are used to identify the k nearest neighbours. Subsequently, the k nearest neighbours are subjected to a smoothing process before they are employed to build a regression model. Finally, the values of the missing data are obtained by applying the constructed regression model.

The Identification of k Nearest Neighbours
As explained in Section 2, the stochastic characteristic of traffic flow makes it difficult to correctly identify the nearest neighbours when using the conventional similarity metrics. This difficulty stems from the fact that the random fluctuation of traffic flow makes the nearest neighbours indistinguishable. Instead of strenuously discriminating the neighbouring ranks of series that are highly interwoven, we classify those indistinguishable neighbours into one group and the remaining series into another group. For convenience, these two groups are called the "neighbouring group" and the "remote group", respectively, in this paper. As shown in Figure 5, two similarity metrics, namely the interweaving degree and the Euclidean distance, are employed to separate the historical time series into the corresponding groups. Consequently, four groups, denoted as "NN", "NR", "RN", and "RR", can be formed by combining the neighbouring and remote groups of the two metrics (e.g., "NR" represents the neighbouring group for the first metric, the interweaving degree, and the remote group for the second, the Euclidean distance). The elements in the "NN" group are selected as the k nearest neighbours to be used in the subsequent process. In this way, the parameter k is determined adaptively without the need to be predefined by users. Note that the "NN" group may be empty; if so, the historical data in the neighbouring group for the Euclidean distance metric are used as the nearest neighbours.
The interweaving degree proposed in this paper is defined as follows:
$$ID = \frac{N_c}{N}$$
where N and $N_c$ denote the number of points of a time series and the number of crossing points between two time series, respectively. After linearly interpolating each successive pair of data points of a time series, it is straightforward to determine the crossing points for each pair of line segments. After the interweaving degrees have been obtained, a simple clustering process is performed: each neighbouring series is classified with respect to two centres, which are defined as the two extreme values of the calculated interweaving degrees, and it is assigned to the neighbouring group if its interweaving degree is closer to the largest interweaving degree (i.e., the centre of the neighbouring group). The main reason for adopting such a simple classification is to reduce the computational overhead. The same procedure is also employed to classify the neighbouring series for the Euclidean distance metric.
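The grouping step can be illustrated with the following sketch. It assumes that the interweaving degree is the number of crossings divided by the number of points, that both metrics are evaluated only at the time instants where the current series is observed, and that a series equidistant from the two centres is placed in the neighbouring group. All function names are hypothetical.

```python
import math

def interweaving_degree(a, b):
    """Crossing points between two piecewise-linear series, divided by N."""
    crossings = 0
    prev = 0.0  # last non-zero difference seen so far
    for x, y in zip(a, b):
        d = x - y
        if d == 0:
            continue  # the series touch here; no side is taken yet
        if prev != 0 and (d > 0) != (prev > 0):
            crossings += 1  # the sign flipped, so two line segments crossed
        prev = d
    return crossings / len(a)

def neighbouring_group(scores, larger_is_closer):
    """Indices whose score is nearer to the neighbouring centre than to the other."""
    low, high = min(scores), max(scores)
    near, far = (high, low) if larger_is_closer else (low, high)
    return {i for i, s in enumerate(scores) if abs(s - near) <= abs(s - far)}

def select_neighbours(history, current, observed):
    """Indices of the 'NN' group, falling back to the Euclidean neighbouring group."""
    cur = [current[m] for m in observed]
    parts = [[day[m] for m in observed] for day in history]
    degrees = [interweaving_degree(p, cur) for p in parts]
    dists = [math.sqrt(sum((u - v) ** 2 for u, v in zip(p, cur))) for p in parts]
    by_degree = neighbouring_group(degrees, larger_is_closer=True)
    by_distance = neighbouring_group(dists, larger_is_closer=False)
    nn = by_degree & by_distance
    return sorted(nn) if nn else sorted(by_distance)
```

The length of the returned list plays the role of the adaptively determined parameter k.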

The Smoothing Process
The indistinguishable neighbours determined previously are all subjected to a smoothing process to eliminate the random effect on the subsequent steps. While a wide spectrum of smoothing techniques, such as the moving average filter (Arce, 2005), the Butterworth filter (De Boor, 2001), and the smoothing spline (Bianchi, and Sorrentino, 2007), have been reported in the literature, the smoothing technique based on the wavelet transform (Misiti, and Misiti, 2007) is employed here to eliminate the random effect of traffic flow, as it is able to localize the characteristics in the temporal and frequency domains through hierarchically organized decompositions (Chui, 1992). Apart from the theoretical soundness of the wavelet-based smoothing technique, the decomposition level is in practice one of the critical factors with a significant impact on the smoothing performance (El-Dahshan, 2011). If the smoothing process performs successfully, the part removed from the original time series should manifest a stochastic process that can be modelled as white noise. According to the Wiener-Khinchin theorem (Zbilut, and Marwan, 2008), the corresponding correlation function $R(\tau)$ can be derived as follows:
$$R(\tau) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_0\, e^{j\omega\tau}\, d\omega = S_0\,\delta(\tau)$$
where $\omega$ is the frequency, $\tau$ is the time shift, $S_0$ is a constant and δ(τ) = 1 for τ = 0, and 0 otherwise. Then, the autocorrelation coefficient is:
$$\rho(\tau) = \frac{R(\tau)}{R(0)} = \delta(\tau)$$
Ideally, the part removed by the smoothing process should be a random noise-like time series, and its autocorrelation coefficient should be 1 for τ = 0 and 0 for τ > 0. However, it is unlikely that the smoothing operation yields such a pure noise-like time series in general. Therefore, the approximate 95% confidence interval for a noise-like time series is also computed from the sample autocorrelation. Given the dependence structure for a set of time lags, the number of coefficients that fall within the confidence bounds is counted. That is, the more coefficients fall within the confidence bounds, the more likely the removed part is random noise.
Such a calculation is performed for each decomposition level, which is incremented by one each time until a pre-defined maximum level is reached. The decomposition level at which the random part of a time series is maximally removed is used as the final decomposition level to smooth the time series data.
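The level selection loop can be sketched as follows. The paper does not state which wavelet family is used, so the sketch substitutes a simple stand-in: the series is replaced by block means over 2^level samples, which corresponds to keeping only the Haar approximation at that level. The bound of 1.96/sqrt(N) is the usual approximate 95% confidence bound for the sample autocorrelation of white noise; the function names, the default of 20 lags and the tie-breaking rule are assumptions.

```python
import math

def haar_smooth(x, level):
    """Replace each block of 2**level samples by its mean (Haar approximation)."""
    size = 2 ** level
    out = []
    for start in range(0, len(x), size):
        block = x[start:start + size]
        out.extend([sum(block) / len(block)] * len(block))
    return out

def autocorr(x, lag):
    """Sample autocorrelation coefficient at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:
        return 0.0
    num = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return num / denom

def fraction_within_bounds(residual, max_lag):
    """Share of lags 1..max_lag whose coefficient lies inside +/-1.96/sqrt(N)."""
    bound = 1.96 / math.sqrt(len(residual))
    inside = sum(1 for lag in range(1, max_lag + 1)
                 if abs(autocorr(residual, lag)) <= bound)
    return inside / max_lag

def choose_level(x, max_level, max_lag=20):
    """Pick the level whose removed part looks most like white noise."""
    best_level, best_score = 1, -1.0
    for level in range(1, max_level + 1):
        smooth = haar_smooth(x, level)
        residual = [a - b for a, b in zip(x, smooth)]
        score = fraction_within_bounds(residual, max_lag)
        if score > best_score:  # ties keep the lower level
            best_level, best_score = level, score
    return best_level
```

The series must be longer than `max_lag`. In practice the stand-in smoother would be replaced by a proper wavelet decomposition and reconstruction, with the selection loop left unchanged.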

The Estimation of Missing Values
As explained in Section 2, the stochastic characteristics of traffic flow are likely to introduce an error into the similarity measurement used by the kNN algorithm. As a consequence, this error can be embedded in the weights when the weighted sum of the k nearest neighbours is used to produce the final estimate. Therefore, we adopt a regression model in this work to estimate the missing values in order to avoid possible weighting errors. Although a large variety of regression models is available, we employ the multivariate linear regression model as a compromise between imputation accuracy and computational overhead. Multivariate linear regression is a generalization of simple linear regression to the case where the relationship between two or more explanatory variables and a response variable is modelled by a linear function (Wichura, 2006). The basic model for multivariate linear regression can be represented as:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$$
for each p-dimensional observation indexed by i = 1, …, n, where $\beta_0, \ldots, \beta_p$ are the regression coefficients and $\varepsilon_i$ is the error term. In our case, the observations are the values of the k neighbouring time series at the time instants where the response is observed, so the dimension p is equal to k. That is to say, we use the regression technique rather than the similarity distance to determine the weights for the final estimate.
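The regression step can be illustrated with a small, dependency-free sketch. It fits the current series on the selected neighbours by ordinary least squares, using the normal equations, at the observed time instants and then predicts the missing ones. The inclusion of an intercept, the function names and the use of `None` for missing values are assumptions of the sketch.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def regression_impute(neighbours, current):
    """Fit current ~ b0 + sum_j b_j * neighbour_j on observed instants, then predict."""
    observed = [m for m, v in enumerate(current) if v is not None]
    rows = [[1.0] + [day[m] for day in neighbours] for m in observed]
    y = [current[m] for m in observed]
    p = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    imputed = list(current)
    for m, v in enumerate(current):
        if v is None:
            imputed[m] = beta[0] + sum(b * day[m] for b, day in zip(beta[1:], neighbours))
    return imputed, beta
```

The fit requires more observed time instants than neighbours and neighbours that are not exactly collinear; otherwise the normal equations are singular and a more robust solver would be needed.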

Experiments
In order to evaluate the imputation performance of the improved kNN algorithm, we have designed and performed a series of experiments on the three data sets described in Section 2.3. Before conducting the experiments, the traffic flow data were prepared as follows. The aggregated 5-minute flow data of 9 weekdays were chosen for each data set, and a number of data in the traffic flow of the last day were selected randomly and removed as the missing data. Note that it is not necessary to choose traffic data from successive days. In the experiments presented below, a set of missing rates, ranging from 0.05 to 0.9 with an interval of 0.05, was used to evaluate the imputation performance for different missing situations. For each missing rate, 100 independent runs were performed, with the missing data randomly chosen for each run. The performance averaged over the 100 runs of each experiment is reported in this section, after the performance indicators adopted for the evaluation have been introduced.

Performance Indicators and Prediction Examples
The performance indicators, namely mean absolute percentage error (MAPE) and variance of absolute percentage error (VAPE) (Zhang, and Liu, 2011), have been chosen to evaluate the imputation performance. While MAPE calculates the average relative error between the estimated values and the actual observed data, VAPE represents the performance stability:
$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\frac{|x(i) - \hat{x}(i)|}{x(i)} \times 100\%$$
$$\mathrm{VAPE} = \mathrm{Var}\left(\frac{|x(i) - \hat{x}(i)|}{x(i)}\right) \times 100\%$$
where $x(i)$ and $\hat{x}(i)$ are the ith true and predicted values and N is the number of data. Figure 6 shows the results imputed by our proposed algorithm for the traffic flow data sets 'PeMS', 'Portal' and 'Beijing', respectively, with a missing rate of 0.7, and Table 1 summarises the parameter k and the coefficients of the multivariate linear regression determined for the experiments shown in Figure 6. As the parameter k is determined adaptively, the values of k differ for the three tests. The solid lines in Figure 6 indicate the true traffic flow data, and the data imputed for the missing values are marked with "o". Note that a number of data are randomly selected from the original data (the solid lines in Figure 6) and discarded according to the specified missing rate before the proposed algorithm is applied to impute them. As shown in Figure 6, the traffic flow from the data set 'PeMS' is relatively larger in magnitude than those from 'Portal' and 'Beijing'. Furthermore, it can also be seen that the missing data can be estimated with reasonable accuracy, even though the traffic flows fluctuate significantly over the time of day.
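The two indicators can be computed as in the following sketch. It assumes that VAPE is the population variance (divisor N) of the absolute relative errors expressed as a percentage; the function name `mape_vape` is illustrative.

```python
def mape_vape(true_values, predicted_values):
    """Return (MAPE, VAPE) in percent for paired true and predicted values."""
    # Absolute relative error of each imputed value; true values must be non-zero.
    ape = [abs(t - p) / t for t, p in zip(true_values, predicted_values)]
    n = len(ape)
    mean = sum(ape) / n
    # Assumption of this sketch: population variance with divisor N.
    variance = sum((e - mean) ** 2 for e in ape) / n
    return mean * 100.0, variance * 100.0
```

Only the time instants that were removed and then imputed should be passed to the function, so that the observed values do not dilute the error.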

Comparison to the Original kNN Algorithm
The first experiment was designed to evaluate the performance of the proposed imputation algorithm by comparing it with the original kNN algorithm, which served as the basis for the improvements. As the parameter k is critical for both the original kNN algorithm and the proposed one, a fair comparison requires the same value of k for both algorithms. For the original kNN algorithm, the parameter k is generally specified by users before the imputation is performed, and a lack of sufficient domain knowledge can lead to a subjective choice of the parameter. In contrast, the improved kNN algorithm determines the parameter automatically with the assistance of the two similarity metrics. Therefore, we first use the proposed algorithm to determine the parameter k, which is subsequently employed by the original kNN algorithm. Table 2 lists the MAPE and VAPE values obtained by the original kNN algorithm and the improved version, denoted as "kNN" and "IkNN", respectively. As shown in Table 2, the values of MAPE and VAPE obtained by IkNN are lower than those obtained by kNN in most cases, indicating that the proposed algorithm outperforms the original kNN algorithm when the parameter k is the same. The reductions in MAPE and VAPE are most pronounced for the PeMS data set compared with the other two sets. However, a close inspection of the performance of the proposed algorithm shows increasing trends in the values of MAPE and VAPE with the missing rate for all test data sets, implying that the imputation performance is adversely affected as the missing rate increases. The values of k determined by the proposed algorithm are similar for the PeMS and Beijing data sets, but different for the Portal data set. For the PeMS and Beijing data sets, the values of k indicate that only 2 or 3 historical time series can be classified as neighbours.
On the other hand, more historical time series are similar in the Portal data set, as the parameter k determined ranges from 3 to 5. Figure 7 shows the dependence and adaptation of the parameter k on the missing rate. Note that different colours, as shown in the colour bar, indicate the value of k determined by the method described in Section 3.1. From Figure 7, it seems that the parameter k fluctuates only slightly when the missing rate is low, but its fluctuation tends to increase when the number of missing data becomes large. When only a few data are missing, the information available to estimate the similarity is relatively adequate, which may partially explain the relatively small fluctuation in the parameter k over the 100 independent runs. When many data are missing, on the other hand, only a small part of the time series is available for the similarity measure and different parts may provide different local information, resulting in a large fluctuation of the parameter k over the 100 runs.

Evaluation of Smoothing Effect
The experiments presented in this sub-section aim to evaluate the imputation performance gained from the smoothing process. To this end, the three sets of traffic flow data, with missing rates ranging from 0.05 to 0.9, were imputed by the proposed algorithm without and with the smoothing process, and the values of MAPE and VAPE obtained are listed in Table 3, along with the automatically determined decomposition level. The results in Table 3 indicate that the smoothing process effectively improves the imputation accuracy and reduces the fluctuations over the 100 independent runs for each missing rate. In addition, it can also be seen that the imputation performance, in terms of averaged accuracy and fluctuation over the 100 independent runs, degrades slightly as the missing rate increases. The decomposition levels determined for the PeMS and Beijing data sets are close to 3 and 4, respectively, while the decomposition level for the Portal data set is below 3. This implies that the traffic flow in the Portal data set is less stochastic than that in the PeMS and Beijing data sets.

Comparison to Other Typical Methods
An additional experiment was performed to compare the proposed algorithm with a set of existing imputation methods, namely interpolation-based techniques, ARIMA, PPCA, and the nearest historical recording method (denoted as 'NH' in the following), which are typical algorithms from the prediction-based, interpolation-based, and statistical learning-based categories. Three interpolation techniques, i.e., linear, spline, and pchip interpolation, were used for comparison and are denoted as 'Linear', 'Spline', and 'Pchip', respectively, in the following. The parameters p and q of ARIMA were determined by calculating the autocorrelation and partial autocorrelation coefficients, while the augmented Dickey-Fuller test was used to help determine the differencing parameter d. For the imputation by PPCA, three principal components, determined by trial and error, were used for reconstruction. The simplest imputation method adopted for comparison is the nearest historical recording method, in which the recordings for the same time period on the previous day are used to impute the missing values. Figures 8(a), (b), and (c) present the values of MAPE obtained by the test imputation algorithms for the three data sets. It is evident that the proposed algorithm produces more accurate imputation than the other test algorithms for different missing rates, even though ARIMA achieves a similar performance for the data sets 'PeMS' and 'Portal' when the missing rate is small. In general, the 'Spline' and 'Pchip' interpolation techniques perform poorly, while the 'Linear' interpolation produces imputation of similar accuracy to 'ARIMA' and 'PPCA'. Amongst all the existing algorithms used for comparison, 'PPCA' performs better than the others when the number of missing data becomes large.
In addition, it is interesting to note that the simplest imputing strategy, 'NH', produces medium accuracy among all test algorithms, and its performance varies only slightly with the missing rate. The values of VAPE obtained by the test algorithms are shown in Figure 9. For the 'Portal' and 'Beijing' data sets, the proposed algorithm produces the least fluctuation when different data are removed as missing data. However, ARIMA was more stable than the other algorithms on the 'PeMS' data set. Again, the 'Spline' and 'Pchip' interpolation techniques show poor stability, whereas the 'Linear' interpolation algorithm achieves reasonable stability. Furthermore, the 'NH' method achieves medium stability, with no significant difference across missing rates, indicating that it is insensitive to the number of missing data.
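For reference, the two simplest baselines in this comparison, 'Linear' and 'NH', can be sketched as below. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments. Missing records are assumed to be marked with `None`, the series is assumed to be sampled at a fixed interval, and the parameter `samples_per_day` (the number of records per day) is a hypothetical name introduced here. How gaps at the ends of the series or on the first day were handled in the experiments is not stated, so the choices made below are assumptions.

```python
def impute_linear(observed):
    """Fill None gaps by straight-line interpolation between the nearest known records."""
    known = [i for i, v in enumerate(observed) if v is not None]
    if not known:
        raise ValueError("at least one observed value is required")
    filled = list(observed)
    for i, v in enumerate(observed):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = observed[right]   # assumption: hold the first known value
        elif right is None:
            filled[i] = observed[left]    # assumption: hold the last known value
        else:
            w = (i - left) / (right - left)
            filled[i] = observed[left] + w * (observed[right] - observed[left])
    return filled


def impute_nearest_historical(observed, samples_per_day):
    """Fill each None with the record from the same time slot one day earlier."""
    filled = list(observed)
    for i, v in enumerate(filled):
        if v is None and i >= samples_per_day:
            # The earlier slot may itself have been filled from the day before it;
            # if it is still None (e.g. on the first day), the gap is left open.
            filled[i] = filled[i - samples_per_day]
    return filled
```

The sketch makes one property of 'NH' visible: each imputed value depends only on the record one day earlier and not on how many neighbouring records are missing, which is consistent with the method's insensitivity to the missing rate observed above.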

Conclusions
The work presented in this paper attempts to improve imputation performance by proposing a modified version of the kNN algorithm. The modification is based on an analysis of the stochastic characteristic of traffic flow, which makes the critical elements of the original kNN algorithm difficult to determine. The improvements were motivated by the intention to eliminate both the uncertainties resulting from the stochastic characteristic of traffic flow and the requirement to predefine parameters. A series of experiments on a set of traffic flow data has been performed to evaluate the imputation improvements of the proposed algorithm from various aspects. The comparative study indicates that, in general, the proposed algorithm outperforms the other conventional approaches in terms of imputation errors and the corresponding fluctuations. Furthermore, the absence of any need to predefine parameters is a unique advantage of the developed imputing algorithm over the other commonly used algorithms: one of the critical parameters is determined automatically, in an adaptive fashion, to fit different traffic patterns. On the other hand, the experimental results also reveal a number of weaknesses, which need to be addressed in further research. The current strategy for adaptively determining the critical parameter cannot guarantee an optimal imputation in terms of accuracy, even though the current version outperforms the original kNN algorithm with the same parameter settings. It would also be interesting to investigate the imputation performance when other regression techniques are employed in place of the multivariate linear model.

Fig. 9. The values of VAPE obtained by a set of imputing algorithms for (a) PeMS data set, (b) Portal data set, (c) Beijing data set.