Learning Spatiotemporal Latent Factors of Traffic via Regularized Tensor Factorization: Imputing Missing Values and Forecasting

Intelligent transportation systems are a key component in smart cities, and the estimation and prediction of the spatiotemporal traffic state is critical to capture the dynamics of traffic congestion, i.e., its generation, propagation and mitigation, in order to increase operational efficiency and improve livability within smart cities. While spatiotemporal data related to traffic is becoming commonplace due to the wide availability of cheap sensors and the rapid deployment of IoT platforms, the data still suffers from challenges related to sparsity, incompleteness, and noise, which make traffic analytics difficult. In this article, we investigate the problem of missing data or noisy information in the context of real-time monitoring and forecasting of traffic congestion for road networks in a city. The road network is represented as a directed graph in which nodes are junctions (intersections) and edges are road segments. We assume that the city has deployed high-fidelity sensors for speed reading on a subset of edges; the objective is to infer the speed readings for the remaining edges in the network, and to estimate the missing values on the segments for which sensors have stopped generating data due to technical problems (e.g., battery, network, etc.). We propose a tensor representation for the series of road network snapshots, and develop a regularized factorization method to estimate the missing values while learning the latent factors of the network. The regularizer, which incorporates spatial properties of the road network, improves the quality of the results. The learned factors, with a graph-based temporal dependency, are then used in an autoregressive algorithm to predict the future state of the road network over a large horizon. Extensive numerical experiments with real traffic data from the cities of Doha (Qatar) and Aarhus (Denmark) demonstrate that the proposed approach is appropriate for imputing the missing data and predicting the traffic state. It is accurate and efficient and can easily be applied to other traffic datasets.


INTRODUCTION
Rapid urbanization, GDP growth, decline in fuel prices, and increase in car ownership are all factors that contribute directly or indirectly to creating and/or worsening road congestion. Most cities in the world regularly monitor yearly traffic congestion-related KPIs that help them evaluate the road infrastructure. According to [16], the economic cost of congestion in Qatar in 2016 was estimated to be between US$1.53B and US$1.80B, which translates to a loss of about 0.9-1.0 percent of the GDP. Thus, it is extremely important for cities to deploy the systems and applications required to provide access to real-time congestion information. For city planners and operators, knowing what is going to happen in their road networks is as important as knowing the real-time situation. For example, predicted traffic congestion information is essential for anticipating automated actions that avoid heavy congestion in strategic locations. This can help urban planners fine-tune signal timings or strategically allocate police patrol units to regulate the traffic.
From the user perspective, it is essential to provide the fastest path from a source to a destination with an accurate Estimated Time of Arrival (ETA). These features can only be offered if the system has the capability to efficiently and effectively forecast traffic congestion in addition to providing real-time monitoring.
With the rise of IoT technologies, smart cities have become a testbed for technologies that capture real-time spatio-temporal data on citizens' mobility. This has been enabled mainly by the wide availability of cheap sensors (e.g., closed-loop detectors, Bluetooth sensors). However, the data still suffers from major challenges related to sparsity, incompleteness, and noise, which make traffic analytics difficult. It is for instance reported in many studies that the percentage of missing values in real traffic monitoring scenarios can be as high as 90 percent [21], [23], [24]. The obvious reason for the inherent sparsity problem we observe in traffic data is that not all road segments are equipped with sensors. And even when a road segment has a sensor, it is often the case that it temporarily stops emitting due to a variety of failures (e.g., network, battery, and low recall). This problem makes traditional time-series approaches to traffic forecasting obsolete [3], [24], [27]; and even more robust decomposition-based methods such as matrix factorization, which aims to project objects into a lower-dimensional latent space, have been shown to be NP-hard in the presence of missing values, see e.g., [7]. In fact, missing values in decompositions are often handled via imputation or marginalization. In the former approach, missing values are estimated iteratively using the Expectation Maximization algorithm; this results in an easy implementation with very few changes to the model [15]. In the latter approach, i.e., marginalization or weighted regression, the missing values are simply ignored during the optimization process [22].
However, given the specific nature of traffic data, such as the high temporal and spatial correlation that can be leveraged to mitigate the data sparsity problem, and the periodicity in the data that can be exploited for the problem of multi-step-ahead forecasting, it is important to design new analytical frameworks that naturally account for the above-mentioned properties.
Motivated by these challenges, we propose a temporal regularized tensor factorization framework (TRTF) suitable for high-dimensional time-series analysis. TRTF models traffic data as follows. First, the road network of a city is modeled as a directed graph in which nodes are junctions (e.g., intersections, Y-merges, round-abouts, etc.) and edges are road segments linking junctions. The adjacency matrix $A_{adj}$ of size $N \times N$, where $N$ is the number of nodes of the road network, is assumed to remain unchanged over time. Second, speed readings associated with edges in the graph are modeled as a three-way tensor $\mathcal{G}$ of size $N \times N \times T$, where $T$ is the number of time points. In other words, the tensor $\mathcal{G}$ is a time-dependent sequence of $T$ successive matrices $G_t$ such that $G_t^{(i,j)} = 0$ if $A_{adj}^{(i,j)} = 0$, i.e., there is no road segment between junctions $i$ and $j$, or if $A_{adj}^{(i,j)} = 1$ but there is no speed reading for road segment $[i,j]$ at time $t$; and $G_t^{(i,j)} > 0$ otherwise. The objective then is to decompose the tensor $\mathcal{G}$ into three factor matrices $A$, $B$ and $C$ minimizing the following error:

$$\min_{A,B,C} \; \mathcal{E}(\mathcal{G} \,\|\, A, B, C) + \lambda_s \mathcal{R}_s(A, B) + \lambda_t \mathcal{R}_t(C), \tag{1}$$

where the first two factors $A$ and $B$ capture the spatial constraints of the road network, whereas $C$ captures the structure of the temporal dependencies among the temporal embeddings. The decomposition problem can be solved by an alternating minimization process over $A$, $B$, and $C$. The minimization is regularized to avoid overfitting and to incorporate the spatial dependency through $\mathcal{R}_s$ and the temporal dependency through $\mathcal{R}_t$. One of our main contributions in this paper is the adaptation of a novel autoregressive temporal regularizer $\mathcal{R}_t(C)$ for tensor decompositions. This modeling is adapted from the regularized matrix representation introduced in [25]. In that work, traffic data is represented as a matrix in which rows are road segments and columns are time points; that is, each row is a time-series of speed readings of a particular road segment. The proposed Temporal Regularized Tensor Factorization (TRTF) framework learns the temporal dependencies between the latent dimensions by incorporating an autoregressive temporal regularizer into the factorization process, and thereby enables multi-step-ahead forecasting using tensor decomposition.
Latent space modeling has attracted a lot of attention lately. Deng et al. [4] proposed to use matrix factorization with spatial and temporal regularizers to deal with the two problems of missing values and forecasting. Although the authors highlight the difference between latent models for social networks and those for traffic prediction, they eventually propose a model similar to latent social network models. Moreover, they bring in the temporal aspect by learning a fixed transition matrix between consecutive snapshots.
One solution to cope with this problem is to further exploit the summary/stacked transition model technique of Rossi et al. [17] and possibly make the model learn longer temporal patterns spanning multiple snapshots. In our case, however, we take a tensor-based approach, which elegantly and naturally captures the temporal dependencies in the latent space. Another interesting approach to capturing time dependencies is that of Wang et al., who propose a tensor-based model for segment as well as path cost estimation using GPS data [23]. However, their approach has a few shortcomings. First, they aggregate the historical data per time-slot over the entire history, thereby possibly losing valuable information. Second, in order to build a real-time scalable solution, they only consider the most recent time-slots; this, along with the aggregation, makes the model unsuitable for forecasting, where we would like the periodic patterns in the data to be preserved. Third, their approach uses a lot of auxiliary information which may not be available for all datasets. Moreover, unlike the holistic manifold smoothing used in [4], Wang et al. manually construct the spatial features, which constitutes a serious limitation for a data-driven approach. Fourth, their approach relies on the Tucker decomposition [11], which has been advised against for temporal problems as it suffers from rotational freedom [5].
In this study, we use the CP decomposition, which is more appropriate when preserving the temporal order is required, see e.g., [1]. We run extensive experiments on two real traffic datasets from two different cities, Aarhus in Denmark and Doha in Qatar, showing different sparsity distributions. In Aarhus, the percentage of missing values is only about 4 percent, whereas the Doha dataset suffers from 85 percent missing values. This, in itself, constitutes a good test-bed to evaluate against several recent algorithms designed for the same purpose, such as LSM-RN [4], TRMF [25], and CP-WOPT [1].
The main contributions of this paper are:
1) We propose a novel temporal regularized tensor factorization framework (TRTF) for high-dimensional traffic data. TRTF provides a principled approach to account for both the spatial structure and the temporal dependencies.
2) We introduce a novel data-driven graph-based autoregressive model, where the weights are learned from the data. Hence, the regularizer can account for both positive and negative correlations.
3) We show that incorporating temporal embeddings into CP-WOPT [1] leads to accurate multi-step forecasting, compared to state-of-the-art matrix factorization based methods.
4) We conduct extensive experiments on real traffic congestion datasets from two different cities and show the superiority of TRTF, and of tensor-based methods in general, for both tasks of missing value completion and multi-step-ahead forecasting under different experimental settings. For instance, TRTF outperforms LSM-RN by about 24 percent and TRMF by about 29 percent.
The remainder of this paper is organized as follows. We describe our problem in Section 2 before we discuss the literature review and its limitations in Section 3. In Section 4 we introduce our temporal regularized tensor factorization framework, and in Section 5 we describe the autoregressive extension allowing for multi-step-ahead forecasting. We demonstrate the suitability and superiority of our framework through an extensive set of experiments on two different real traffic datasets in Section 6. We finally conclude the paper in Section 7.

PROBLEM FORMULATION
We represent the road network of a city as a directed graph consisting of $N$ nodes and $M$ edges. Nodes represent junctions and edges represent road segments connecting two junctions. We assume the existence of $K \ll M$ physical sensors located on $K$ different road segments (edges) that continuously generate speed readings at a particular rate, e.g., every minute. Sensors can occasionally fail to generate readings, yielding missing values. This traffic data is used to build a three-way tensor $\mathcal{G}$ of size $N \times N \times T$, where $T$ is the number of time points at which speed data has been generated (from time $1$ to time $T$). Each slice $G_t$ in $\mathcal{G}$ is an $N \times N$ matrix whose cells $G_t^{(i,j)}$ report the speed observed on segment $[i,j]$ at time $t$. $G_t^{(i,j)}$ takes the value $0$ if the edge $[i,j]$ does not exist or if its corresponding sensor fails to generate a speed reading at time $t$, see Fig. 1. The distinction between the zeros of non-existing edges and those of missing values is made by looking at the adjacency matrix of the road network.
Our problem is therefore: 1) to complete the (data) tensor $\mathcal{G}$ by inferring all its missing values due to absent or malfunctioning sensors; and 2) to forecast the future tensor $\mathcal{G}_{new}$ that extends the initial tensor $\mathcal{G}$ from time $T$ to $T + h$, where $h$ is the desired horizon.
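To make the formulation concrete, the following minimal sketch (Python/numpy, with hypothetical names such as readings for the list of (i, j, t, speed) records) shows one way to assemble the tensor $\mathcal{G}$ and the 0-1 observation mask used later; it illustrates the data layout, not the paper's actual pipeline.

    import numpy as np

    # A sketch: build the speed tensor G and its observation mask from raw
    # sensor records. `readings` is assumed to be an iterable of
    # (i, j, t, speed) tuples; A_adj is the N x N adjacency matrix.
    def build_tensor(readings, A_adj, N, T):
        G = np.zeros((N, N, T))   # 0 = no edge, or edge with missing reading
        W = np.zeros((N, N, T))   # 1 where a speed reading was observed
        for i, j, t, speed in readings:
            G[i, j, t] = speed
            W[i, j, t] = 1.0
        # Entries with A_adj[i, j] == 1 but W[i, j, t] == 0 are missing values
        # to impute; entries with A_adj[i, j] == 0 are structural zeros.
        return G, W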

RELATED WORK
In this section, we review previous work related to ours. We mainly focus on two categories of work: traffic prediction, and the use of tensors in the emerging discipline of urban computing.

Traffic Prediction
Research in traffic prediction can be classified on the basis of the target being estimated or the data being utilized. There are mainly two traveling costs being estimated: (1) segment cost and (2) path cost. Moreover, research is fragmented on the type of data employed as input for the various cost estimations. Here, we have two main categories: (1) GPS-based data and (2) sensor-based data. A sensor only captures data for the road segment it is tracking, but the data collection is more fine-grained and continuous. GPS data, unlike sensor data, does not provide continuous data streams, but instead gives a more global view as the entire trip of a vehicle is recorded. Irrespective of the cost being estimated and the data being utilized, sparsity of the data, whether due to missing sensors or un-traveled segments, is a central concern. The existence of missing data reduces performance sharply [2], [3], [10]. Therefore, models that treat each segment's data as an independent time-series fail. However, a promising feature of traffic data is that it is highly correlated both temporally and spatially. Therefore, several latent as well as correlated time-series based models have been proposed that exploit this nature of the traffic data. Recently, Deng et al. [4] proposed a latent spatio-temporal model based on traffic snapshots of the road crossing network. Inspired by social network models [9], [17], where different actors have roles that evolve over time, they similarly hypothesize that various road crossings have certain latent traffic roles which change over time. They treat the network partitions in every snapshot as roles and also learn how role transitions occur between consecutive snapshots in the form of a role transition matrix.

Tensors in Urban Computing
Given the spatial-temporal dynamics in cities, tensorial analysis has attracted increasing attention in the field of urban informatics. Zheng et al. highlighted the importance of using tensors in this area in their insightful review [28] of recent research and challenges at the convergence of city science and urban computing using human mobility data. Others used tensorial formulations to address different problems, varying from noise pollution prediction [29] and gas station recommendation [18] to modeling urban refueling events [26] by adapting the Tucker tensor decomposition to capture contextual features. Moreover, researchers have used tensor formulations to study and characterize human mobility in the city. Fan et al. [6] used Non-negative Tensor Factorization (NTF) with spatial regularization to model people's flow in the city. The algorithm decomposes a people-flow tensor into basic life-pattern tensors, in order to capture real-world human mobility. The tool was used to quantify the fluctuation of urban mobility before and after the Great East Japan Earthquake in 2011. Moreover, Tan et al. [21] showed the efficiency of using tensor-based methods, integrating the Tucker decomposition and the EM algorithm to infer missing values; their experiments demonstrated superior performance compared to state-of-the-art imputation approaches even when the missing ratio is as high as 90 percent. Another study, by Takeuchi et al. [20], used a Non-negative Multiple Tensor Factorization algorithm to capture and reveal spatio-temporal patterns of people's daily routines, such as leisure, drinking, and shopping activity, from online review datasets. Similarly, Han et al. [8] proposed the use of non-negative tensor decomposition to identify spatial-temporal patterns of traffic. The study has shown the usefulness of tensorial decomposition techniques for multivariate data sequences.

PROPOSED METHODOLOGY
Let $\mathcal{G}$ be a three-way tensor of size $N \times N \times T$, and assume its rank is $R$. With perfect data, i.e., data with no missing values, the tensorial decomposition of $\mathcal{G}$ is defined by factor matrices $A \in \mathbb{R}^{N \times R}$, $B \in \mathbb{R}^{N \times R}$ and $C \in \mathbb{R}^{T \times R}$, and denoted as $\mathcal{G} = [\![A, B, C]\!]$, such that

$$G_t^{(i,j)} = \sum_{r=1}^{R} A_{ir} B_{jr} C_{tr}, \quad \text{for all } 1 \le i, j \le N \text{ and } 1 \le t \le T, \tag{2}$$

see Fig. 1. This special tensorial decomposition has been discovered and named by many researchers independently, e.g., as CANDECOMP (canonical decomposition) and PARAFAC (parallel factors), and it preserves uniqueness under some mild conditions [14].

Remark 1. The rank $R$ of the tensor $\mathcal{G}$ is defined as the minimal number of rank-1 tensors whose linear combination yields $\mathcal{G}$, i.e.,

$$\mathcal{G} = \sum_{r=1}^{R} a_r \circ b_r \circ c_r,$$

where $a_r$, $b_r$ and $c_r$ are column vectors of the factor matrices $A$, $B$ and $C$, respectively. Here, $\circ$ denotes the standard outer product. Contrary to the case of matrices, the rank of a tensor is presently not well understood. It is known that the problem of computing the rank of a tensor is NP-hard.
For traffic data obtained from stationary and mobile sensors, missing values are unavoidable due to failures such as detector malfunction. Therefore, the true $\mathcal{G}$ is not observable and we cannot expect equality in (2). Instead, the CP decomposition should minimize the error function

$$\mathcal{E}(\mathcal{G} \,\|\, A, B, C) = \big\| \mathcal{G} - [\![A, B, C]\!] \big\|_F^2. \tag{3}$$

Unlike the singular value decomposition (SVD) for matrices, PARAFAC (usually) does not impose any orthogonality constraints. The two main competitors of PARAFAC are the Tucker3 method [13], and simply unfolding the tensor into a matrix and then performing standard two-way methods such as PCA. The alternating least-squares framework is used to solve the CP approximation problem. It proceeds by solving a sequence of structured linear least-squares problems.
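As a small illustration of (2) and (3), the following sketch reconstructs a tensor from its CP factors and evaluates the unweighted squared error; the einsum index pattern mirrors the elementwise formula above.

    import numpy as np

    # Reconstruct G_hat[i, j, t] = sum_r A[i, r] * B[j, r] * C[t, r], as in (2).
    def cp_reconstruct(A, B, C):
        return np.einsum('ir,jr,tr->ijt', A, B, C)

    # Plain (unweighted) squared error of (3).
    def cp_error(G, A, B, C):
        return np.sum((G - cp_reconstruct(A, B, C)) ** 2)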

Remark 2. Modern applications, such as Intelligent Transportation Systems, generate massive amounts of data with multiple aspects such as sparsity and high dimensionality, and tensors provide a natural representation for such data. Consequently, tensor decompositions have become important tools for analysis. Given a high-order large-scale tensor, how can we decompose it into latent factors in a scalable manner? In a recent work [19], the authors propose Coordinate Descent for Tensor Factorization, which is more memory efficient and applicable to more loss functions, and Subset Alternating Least Squares, which converges faster to a better solution. These distributed tensor factorization methods are scalable with all aspects of the data. Moreover, there have been attempts to improve the accuracy and efficiency of the decomposition by encoding prior knowledge of the application, and by using additional contextual information (such as time and location) as a regularization term in the objective function.
Our approach to dealing with missing values is to use a weighted version of the error function that ignores missing data and models only the known entries.

Definition 1. We define the nonnegative binary weight index tensor $\mathcal{W}$, of the same size as $\mathcal{G}$, as a 0-1 tensor which indicates whether entries of $\mathcal{G}$ are observed or missing, i.e.,

$$\mathcal{W}_t^{(i,j)} = \begin{cases} 1 & \text{if } G_t^{(i,j)} \text{ is observed}, \\ 0 & \text{if } G_t^{(i,j)} \text{ is missing}. \end{cases} \tag{4}$$

Therefore the weighted version of the error is

$$\mathcal{E}_{\mathcal{W}}(\mathcal{G} \,\|\, A, B, C) = \big\| \mathcal{W} * (\mathcal{G} - [\![A, B, C]\!]) \big\|_F^2, \tag{5}$$

where $*$ represents the Hadamard product, i.e., point-wise multiplication. By minimizing this objective function, we obtain optimized $A$, $B$ and $C$. We can then recover the missing values in $\mathcal{G}$ by

$$\hat{\mathcal{G}} = \mathcal{W} * \mathcal{G} + (1 - \mathcal{W}) * [\![A, B, C]\!].$$

Remark 3. If we restrict the core array of the Tucker model to be diagonal with ones on the diagonal, we arrive at the CP model, which (in general) has a unique minimizer of the cost function (up to scaling and permutation of the components), in contrast to the Tucker model.
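The weighted error of (5) and the recovery rule above translate directly into code; the sketch below assumes a 0-1 mask W of the same shape as G, as in Definition 1.

    import numpy as np

    # Weighted error of (5): the binary mask zeroes the residual at missing
    # entries, so only observed values drive the fit.
    def weighted_cp_error(G, W, A, B, C):
        G_hat = np.einsum('ir,jr,tr->ijt', A, B, C)
        return np.sum((W * (G - G_hat)) ** 2)

    # Recovery: keep observed entries, fill the missing ones from the model.
    def recover(G, W, A, B, C):
        G_hat = np.einsum('ir,jr,tr->ijt', A, B, C)
        return W * G + (1.0 - W) * G_hat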
We can then solve

$$\min_{A,B,C} \big\| \mathcal{W} * (\mathcal{G} - [\![A, B, C]\!]) \big\|_F^2 + \lambda_s \mathcal{R}_s(A, B) + \lambda_t \mathcal{R}_t(C), \tag{6}$$

where the spatial regularizer is

$$\mathcal{R}_s(A, B) = \mathrm{tr}(A^\top L A) + \mathrm{tr}(B^\top L B) + \lambda_a \|A\|_F^2 + \lambda_b \|B\|_F^2. \tag{7}$$

Here $L \in \mathbb{R}^{N \times N}$ is the graph Laplacian of the road network. This allows embedding the spatial structure of the road network into the formulation. The choice of the Frobenius norm alone is clearly not appropriate, as it does not take into account the dependencies among the columns of the factor matrices. Hence, it is important to impose prior knowledge of the application, such as the spatial dependencies of the road network, on the learning algorithm. The Frobenius-norm, or Tikhonov-type, regularization makes the learning algorithm numerically stable, since the sizes of the elements of the factor matrices remain bounded; its penalty parameters are $\lambda_a$ and $\lambda_b$. The matrices $A$ and $B$ may therefore be considered as representing the spatial latent factors, whereas the matrix $C$ corresponds to the latent temporal embedding.
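A minimal sketch of the spatial term in (7), under the assumption that the (possibly directed) adjacency matrix is symmetrized before forming the Laplacian; the exact weighting of the Laplacian and Frobenius terms follows the reconstruction above.

    import numpy as np

    # Graph Laplacian of the road network; symmetrizing the directed
    # adjacency matrix is an assumption made for illustration.
    def graph_laplacian(A_adj):
        S = np.maximum(A_adj, A_adj.T)
        return np.diag(S.sum(axis=1)) - S

    # Spatial regularizer R_s(A, B) of (7), with penalties lam_a and lam_b.
    def spatial_regularizer(A, B, L, lam_a, lam_b):
        return (np.trace(A.T @ L @ A) + np.trace(B.T @ L @ B)
                + lam_a * np.sum(A ** 2) + lam_b * np.sum(B ** 2))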
As for the temporal regularizer $\mathcal{R}_t(C)$, one of the basic and common ideas is graph-based regularization of temporal dependencies, i.e., to incorporate general dependencies among temporal variables and to make the latent representations of two entities as close as possible if there exists a relationship between them, i.e., an edge in the temporal graph, see Fig. 2. Assume that there exists a temporal graph $\mathcal{G}_c = (\mathcal{V}_c, \mathcal{E}_c)$, whose adjacency matrix $W_{\mathcal{G}_c} \in \mathbb{R}^{T \times T}$ encodes the relationships between the $T$ columns of $\bar{C} = C^\top$. Two columns of $\bar{C}$, e.g., $\bar{c}_s$ and $\bar{c}_t$ of size $R$, which are connected by an edge in this temporal graph, are then required to be "close" to each other in terms of the Euclidean distance, and hence temporally dependent, see Fig. 3.
Remark 4. We use $\bar{C} = C^\top$ for clarity purposes. This does not affect the mathematical formalism. In so doing, a row of $\bar{C}$, i.e., $\bar{c}_i^\top$ for $i = 1, \ldots, R$, represents a time-series.

In the context of graph-based embedding, temporal regularization is achieved by including the following regularizer as part of the objective function:

$$\mathcal{R}_t(C) = \frac{1}{2} \sum_{s \sim t} w_{st} \,\|\bar{c}_s - \bar{c}_t\|^2 + \frac{\eta}{2} \sum_{t} \|\bar{c}_t\|^2, \tag{8}$$

where $s \sim t$ denotes an edge between the $s$th node and the $t$th node, whose weight is $w_{st}$, and the second term is used to guarantee strong convexity. This regularizer forces the temporal variable $C$ to be faithful to the underlying temporal graph structure. The matrix $\mathrm{Lap}_{\mathcal{G}_c} \in \mathbb{R}^{T \times T}$ is the graph Laplacian associated with $\mathcal{G}_c$ and is defined as

$$\mathrm{Lap}_{\mathcal{G}_c} = \mathrm{diag}(W_{\mathcal{G}_c} \mathbb{1}) - W_{\mathcal{G}_c}, \tag{9}$$

where $\mathbb{1}$ is the vector of all ones of size $T$.

Lemma 1. The objective function in (8) is convex.
To apply graph-based regularizers to temporal dependencies, we need to specify the repeating dependency pattern by a lag set $\mathcal{L}$ and a weight vector $w$ such that all edges $s \sim t$ of distance $\ell$, i.e., $|s - t| = \ell$, share the same weight $w_\ell$. Therefore, given $\mathcal{L}$ and $w$, the temporal regularizer (8) becomes

$$\mathcal{R}_t(C \mid w) = \frac{1}{2} \sum_{\ell \in \mathcal{L}} \sum_{t} w_\ell \,\|\bar{c}_t - \bar{c}_{t-\ell}\|^2 + \frac{\eta}{2} \sum_{t} \|\bar{c}_t\|^2. \tag{10}$$

While this graph-based regularization approach is intuitive, the explicit temporal dependency structure is usually not readily available. Therefore, one could try to learn this structure, i.e., the unknown weights $w_\ell$. This leads to the following optimization problem:

$$\min_{w \ge 0} \; \min_{A,B,C} \big\| \mathcal{W} * (\mathcal{G} - [\![A, B, C]\!]) \big\|_F^2 + \lambda_s \mathcal{R}_s(A, B) + \lambda_t \mathcal{R}_t(C \mid w). \tag{11}$$

While this formulation is common and widely used, it has some issues. First, the method does not handle cases where negative correlations exist between two time points. Second, the method assumes that the structure of the temporal dependency is available, which is not the case in real-life applications; most often, the explicit temporal dependency needs to be inferred. Moreover, a simple derivation of Equation (11) with respect to $w$ shows that the above optimization yields the trivial all-zero solution for $w$; and even with an additional constraint, such as a simplex constraint on $w$, there remains a trivial solution. Therefore, learning the weights of the temporal dependencies is a challenging problem, and cannot be solved by the mere addition of a regularizer to the error minimization.

Fig. 2. Illustration of the graph-based regularization of temporal dependencies. Each node represents a column vector of $\bar{C} = C^\top$, and an edge between two nodes represents a temporal dependency. The weights of the edges will be learned. In this example, $\mathcal{L} = \{1, 4\}$.

Fig. 3. The matrix $\bar{C}$ captures the latent temporal variables. Each row represents a time-series. The forecasted $\bar{C}_{new}$ is obtained via an AR algorithm using a temporal graph whose weights are learned.
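For concreteness, a small sketch of the lag-based regularizer in (10); C_bar is the $R \times T$ matrix $\bar{C} = C^\top$, lags the lag set $\mathcal{L}$, w the per-lag weights, and eta the strong-convexity parameter.

    import numpy as np

    # Lag-based temporal regularizer of (10).
    def temporal_regularizer(C_bar, lags, w, eta):
        T = C_bar.shape[1]
        reg = 0.0
        for l, w_l in zip(lags, w):
            # difference c_t - c_{t-l} for all valid time points t
            diff = C_bar[:, l:T] - C_bar[:, :T - l]
            reg += 0.5 * w_l * np.sum(diff ** 2)
        return reg + 0.5 * eta * np.sum(C_bar ** 2)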
Therefore, we use well-studied time-series models to describe the temporal dependencies among the columns $\{\bar{c}_t\}$ explicitly. Such models are of the form, see e.g., [25],

$$\bar{c}_t = \mathcal{M}_{\Theta}\big(\{\bar{c}_{t-\ell} : \ell \in \mathcal{L}\}\big) + \epsilon_t, \tag{12}$$

where $\epsilon_t$ is a noise vector, assumed to be Gaussian here, and $\mathcal{M}_{\Theta}$ is the time-series model, which is parameterized by the set $\mathcal{L}$ containing the lag indices $\ell$, denoting a dependency between time points $\bar{c}_t$ and $\bar{c}_{t-\ell}$, and by $\Theta$, which captures the weight information of the temporal dependencies, similar to a transition matrix in autoregressive models.
To incorporate the temporal dependencies into (6), we use a new (temporal) regularizer $\mathcal{T}_{\mathcal{M}}(C \mid \Theta)$ which encourages the structure induced by the time-series model $\mathcal{M}_{\Theta}$. This regularizer is obtained from the general problem of maximizing the likelihood of the parameters, given the data. To estimate its parameters, we run a grid search over possible values to find the set of values for which the observed sample was most likely; that is, we find the set of parameter values that, given the model $\mathcal{M}_{\Theta}$, were most likely to have generated the observed data. In other words, we want to know the distribution of the unknown parameters conditional on the observed data, i.e., $p(\text{Model} \mid \text{Data})$, which is known as the inverse probability problem. This cannot be calculated directly, so we go around it through the concept of the likelihood. And to get a convex function, we use the negative log-likelihood, i.e.,

$$\mathcal{T}_{\mathcal{M}}(C \mid \Theta) = -\log p(C \mid \Theta). \tag{13}$$

The question is how to estimate the parameters when we cannot maximize the likelihood analytically. To do so, we need to (1) be able to evaluate the likelihood function for a given set of parameters; and (2) find a way to evaluate a sequence of likelihoods conditional on different parameter vectors, so that we can feel confident that we have found the parameter vector that maximizes the likelihood. In this case, grid search, i.e., dividing the range of parameters into a grid and evaluating all possible combinations, is a method which guarantees finding the global optimum: with a fine enough grid, grid search always finds the global optimum, provided the parameter space is bounded. However, it may be computationally infeasible for models with a large number of parameters. Among the methods to address this issue, one can cite the steepest-ascent method, which is feasible for models with a large number of parameters, but can be hard to calibrate, even for simple models, to achieve the right rate of convergence. Another method is Newton-Raphson, which is similar to steepest ascent but also computes a step-size that depends on the second derivative. It usually converges faster than steepest ascent, but it requires concavity, and so it is less robust when the shape of the likelihood function is unknown.
When $\Theta$ is known, we can use $\mathcal{R}_t(C) = \mathcal{T}_{\mathcal{M}}(C \mid \Theta)$ to encourage $\{\bar{c}_t\}$ to follow the temporal dependency induced by $\mathcal{M}_{\Theta}$; and when $\Theta$ is unknown, we learn it by treating it as another set of variables and introducing another regularizer $\mathcal{R}_{\theta}(\Theta)$ into (6), that is,

$$\min_{A,B,C,\Theta} \big\| \mathcal{W} * (\mathcal{G} - [\![A, B, C]\!]) \big\|_F^2 + \lambda_s \mathcal{R}_s(A, B) + \lambda_t \mathcal{T}_{\mathcal{M}}(C \mid \Theta) + \lambda_{\theta} \mathcal{R}_{\theta}(\Theta), \tag{14}$$

which can be solved by an alternating minimization procedure over $A$, $B$, $C$ and $\Theta$. Unlike the direct use of a graph-based regularizer to embed the temporal dependencies, as in Equation (11), which leads to trivial solutions for the weights, the formulation (14) avoids this issue. When $A$, $B$ and $C$ are fixed, the optimization problem reduces to

$$\min_{\Theta} \; \lambda_t \mathcal{T}_{\mathcal{M}}(C \mid \Theta) + \lambda_{\theta} \mathcal{R}_{\theta}(\Theta), \tag{15}$$

and therefore the set of parameters $\Theta$ can be learned automatically from the data. This is a data-driven approach to learning the temporal dependency and obtaining a non-trivial $\Theta$, as shown in [25].
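To illustrate the factorization core of (14), the following runnable sketch fits the masked CP model by plain gradient descent with a simple Frobenius penalty; it is a simplified stand-in for the alternating minimization described above, and the spatial and temporal regularizers would enter the gradients additively in the same way.

    import numpy as np

    # Simplified fit of the masked CP model by gradient descent; lam plays
    # the role of a plain Frobenius penalty (the full model adds the R_s and
    # T_M terms to the gradients).
    def fit_weighted_cp(G, W, R, lr=1e-3, lam=0.1, n_iter=200, seed=0):
        rng = np.random.default_rng(seed)
        N, _, T = G.shape
        A = 0.1 * rng.standard_normal((N, R))
        B = 0.1 * rng.standard_normal((N, R))
        C = 0.1 * rng.standard_normal((T, R))
        for _ in range(n_iter):
            Res = W * (np.einsum('ir,jr,tr->ijt', A, B, C) - G)  # masked residual
            gA = 2 * np.einsum('ijt,jr,tr->ir', Res, B, C) + 2 * lam * A
            gB = 2 * np.einsum('ijt,ir,tr->jr', Res, A, C) + 2 * lam * B
            gC = 2 * np.einsum('ijt,ir,jr->tr', Res, A, B) + 2 * lam * C
            A -= lr * gA
            B -= lr * gB
            C -= lr * gC
        return A, B, C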
We now show how the general temporal regularizer $\mathcal{T}_{\mathcal{M}}(C \mid \Theta)$, which incorporates dependencies specified by the time-series model $\mathcal{M}_{\Theta}$, can be instantiated to the more specific case of an autoregressive (AR) model, which is parameterized by a lag set $\mathcal{L}$ and the weights $\{W^{(\ell)} \in \mathbb{R}^{R \times R} : \ell \in \mathcal{L}\}$. Assume that $\bar{c}_t \in \mathbb{R}^R$ is a noisy linear combination of some previous points, i.e.,

$$\bar{c}_t = \sum_{\ell \in \mathcal{L}} W^{(\ell)} \bar{c}_{t-\ell} + \epsilon_t, \tag{16}$$

where $\epsilon_t$ is assumed to be Gaussian noise, see Equation (12).
Here, the main challenge is to learn the weight matrix $W^{(\ell)}$ associated with each $\ell \in \mathcal{L}$. To avoid overfitting, $W^{(\ell)}$ is assumed to be diagonal. Algorithm 1 details the procedure.
Algorithm 1. Learning the AR Weights
1: procedure LEARNWEIGHTS(C, L)
2:   m <- max(L); idx <- (m+1, ..., T)   // time points with all lags available
3:   W <- [ ]                            // initialize the matrix W
4:   for r = 1, ..., R do
5:     x = C(:, r)                       // rth column of the matrix C
6:     y = x(idx)
7:     X = [ ]                           // initialize the matrix X
8:     for i = 1, ..., |L| do
9:       l = L(i)
10:      X(:, i) = x(idx - l)
11:    end for
12:    w = (X'X + lambda_w I)^{-1} X'y   // rth column of W
13:    W = [W, w]
14:  end for
15: end procedure

This algorithm learns the weights of the temporal graph, given the matrix C and the lag set L. We use ridge regression to avoid the potential singularity of X'X. We could also use LASSO or the Elastic Net to learn the weights; however, numerical experiments show that ridge regression is sufficient. The algorithm is very efficient: its cost is essentially that of one ridge regression per latent dimension, i.e., R times the cost of a ridge regression. In Matlab, the inversion is simply done by the backslash operator.
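A numpy transcription of Algorithm 1 may help; lam_w denotes the ridge parameter (an assumed name), and the diagonal matrices $W^{(\ell)}$ are stored compactly as the rows of the returned array.

    import numpy as np

    # Algorithm 1: learn the AR weight of each lag, for each of the R latent
    # time-series (columns of C, of size T x R), by ridge regression.
    def learn_ar_weights(C, lags, lam_w=1e-2):
        T, R = C.shape
        m = max(lags)
        idx = np.arange(m, T)             # time points with all lags available
        W = np.zeros((len(lags), R))
        for r in range(R):
            x = C[:, r]                   # rth latent time-series
            y = x[idx]
            X = np.column_stack([x[idx - l] for l in lags])
            # ridge solve: w = (X^T X + lam_w I)^{-1} X^T y
            W[:, r] = np.linalg.solve(X.T @ X + lam_w * np.eye(len(lags)),
                                      X.T @ y)
        return W                          # W[i, r] = weight of lag lags[i]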
The choice of the lag set L is flexible: since the weights are learned, L can be chosen to be discontinuous (if domain knowledge such as periodicity and/or seasonality is included), or very large, to account for long-range dependencies. The model is explainable, and the temporal dependency graph gets sparsified when using LASSO.

PREDICTION
In this section, we present the autoregressive procedure used to enable multi-step forecasting with the time-regularized tensor decomposition (see Algorithm 2). The main idea is to learn the weights of the lag set in the graph-based temporal dependencies using Algorithm 1, and then use them in the prediction procedure. The (vector) autoregressive model is one of the most successful, flexible, and easy-to-use models for the analysis of multivariate time-series. Since the weight matrix $W^{(\ell)}$ is diagonal, the matrix-vector product $W^{(\ell)} \bar{c}_{t-\ell}$ is efficiently computed using the Hadamard product $*$.
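As a concrete rendering of this step (a sketch under the conventions of the previous listing, not a verbatim transcription of Algorithm 2), the latent embeddings are rolled forward with the learned diagonal AR weights, stored as rows of W, and each forecasted embedding is turned back into a network snapshot via the spatial factors.

    import numpy as np

    # Multi-step forecast: roll the latent temporal embeddings forward with
    # the learned AR weights, then reconstruct the forecasted snapshots.
    def forecast(A, B, C, W, lags, h):
        T, R = C.shape
        C_ext = np.vstack([C, np.zeros((h, R))])
        for t in range(T, T + h):
            # c_t = sum_l W^(l) c_{t-l}; diagonal W^(l) acts as a
            # Hadamard product with the stored weight row
            C_ext[t] = sum(W[i] * C_ext[t - l] for i, l in enumerate(lags))
        # G_new[i, j, s] = sum_r A[i, r] * B[j, r] * C_new[s, r]
        G_new = np.einsum('ir,jr,tr->ijt', A, B, C_ext[T:])
        return C_ext[T:], G_new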

EXPERIMENTS
In this section, we describe the experimental setting and discuss the results achieved by three different algorithms against our TRTF (Temporal Regularized Tensor Factorization). Two of the baseline algorithms are based on matrix factorization and use different ways to represent speed information. LSM-RN (Latent Space Road Network) captures speed data in $T$ snapshots of $N \times N$ matrices, where each cell $s_t(i,j)$ reports the speed observed on the road segment linking $i$ to $j$ at time $t$ [4]. TRMF (Temporal Regularized Matrix Factorization) uses a matrix in which rows are road segments and columns are timestamps, i.e., each row is the time-series of speeds of a particular road segment [25]. The third baseline is CP-WOPT (CP Weighted Optimization), the state-of-the-art method for tensor decomposition with missing data [1].
We evaluate the four algorithms on two real datasets of speed readings observed in the road networks of two cities. The evaluation concerns two tasks: (1) missing value completion and (2) forecasting of future values. In addition to their general accuracy, we are interested in how the different algorithms perform in the two scenarios of rush hour (e.g., 7am-8am) versus non-rush hour (e.g., 10am-11am). All reported results are averages of several runs on four weeks of data.

Traffic Datasets
We use real traffic data from two different cities, namely (1) Aarhus in Denmark, a mature and developed Northern European city, and (2) Doha in Qatar, one of the fastest-growing cities in the world, which suffers from a huge congestion problem due to its extraordinary yearly population growth (≈ 10%).

Aarhus
This data is part of the Aarhus smart city project and has been made available for download via the CityPulse Project Portal, see [12]. The data is used for various traffic analytics, such as day-to-day developments of roadworks, major changes, and abnormal events. Data is collected from 126 measuring points using Bluetooth sensors on selected stretches in the municipality of Aarhus. By looking at the time it takes a car to drive from one measuring point to the next, its speed is computed from the distance and travel time. Speed time-series are then aggregated at a five-minute granularity. In this study, we use the chunk of data captured between February 2014 and March 2014. The fraction of the Aarhus road network covered by the data is modeled as a directed graph consisting of 136 nodes (junctions) and 443 edges (road segments), observed over two periods: the first spans from February 13th, 2014 to June 9th, 2014 and the second from August 1st, 2014 to September 30th, 2014, see Fig. 4a.
The dataset contains 20,402,174 speed readings. 2,055,404 readings are missing from the five-minute-granularity time-series of all road segments, which represents approximately 10 percent of missing values.

Doha
The data is provided by the Qatar Mobility Innovation Center, a local company that deployed different types of traffic sensors and mobile applications to monitor traffic and mobility in the city. The data consist of two files: one with the details of the road network, such as the IDs of the road segments, their start and end points as geographic coordinates, and the lengths of the segments; the second file contains timestamped speed readings for a subset of segments. As one would expect, the data is far from complete. Many road segments lack readings, and there is high variability in the completeness of the speed time-series of different road segments. That is, the percentage of missing values varies a lot from one road segment to another, depending on its importance in the city traffic, its proximity to the city center, the presence of sensors, the network coverage, and many other parameters. Given the nature of the data sources used by the company (a few Bluetooth sensors), it is often the case that there is not enough information to calculate speeds for all road segments. Thus, we observe a very high rate of missing values, reaching up to 99 percent in some cases. The part of Doha's road network covered by the dataset is shown in Fig. 4b.

Some Remarks Regarding the Datasets
It is important to note that the two datasets have completely different properties. First, the fraction of missing values in Doha is 96 percent, whereas Aarhus suffers from 10 percent missing values; this can have a significant impact on the quality of the different algorithms. Second, because of the high level of data sparsity in Doha, the segments for which speed readings are available are weakly connected in the road network, modeled as a directed graph, i.e., the adjacency matrix of the dual road network graph is very sparse. This is another challenge that could presumably make it difficult to effectively learn the role of spatial properties in speed inference. In sum, we believe that having access to these two different types of real datasets is a good opportunity to understand how different methods respond under different circumstances.

Evaluation Metrics
Based on our literature review, we found that it is common to use several metrics to evaluate the quality of both missing value completion and forecasting accuracy in the context of traffic prediction [4], [25]. Thus, we report the results of the four algorithms based on the following metrics: Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), Normalized RMSE (NRMSE), and Normalized Deviation (ND), each of which captures a different aspect of the quality and accuracy of the results. Denoting by $y_i$, $i = 1, \ldots, n$, the observed values and by $\hat{y}_i$ their estimates, the standard definitions are

$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{|y_i|}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},$$

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{\frac{1}{n} \sum_{i=1}^{n} |y_i|}, \qquad \mathrm{ND} = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{\sum_{i=1}^{n} |y_i|}.$$
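The four metrics, implemented directly from the definitions above (y holds the held-out observed values and y_hat the corresponding estimates; the observed speeds are assumed nonzero):

    import numpy as np

    def mape(y, y_hat):
        return 100.0 * np.mean(np.abs(y - y_hat) / np.abs(y))

    def rmse(y, y_hat):
        return np.sqrt(np.mean((y - y_hat) ** 2))

    def nrmse(y, y_hat):
        return rmse(y, y_hat) / np.mean(np.abs(y))

    def nd(y, y_hat):
        return np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y))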

Missing Value Completion
We first report the completion accuracy of the different algorithms under different percentages of missing values {10, 30, 50, 70, 90 percent}. Given that the input tensors are sparse and already comprise many missing values, the percentages we test with are relative to the known values. The overall results with the best parameters for each dataset and each evaluation metric are reported in Table 1. For a closer inspection, we report in Fig. 5 the MAPE scores achieved by the different methods in Aarhus (Fig. 5a) and Doha (Fig. 5b). Among all algorithms, LSM-RN is the worst, especially in the cases where the data is very sparse. TRTF outperforms all methods in the case of Doha, where the data is noisy and sparse compared to Aarhus, which makes it difficult for the other methods to correctly learn the latent space features. This is especially true in the case where only 10 percent of the data is available, in which TRTF clearly outperforms all other methods by far, achieving a MAPE score of 34 percent against 53.3 percent for TRMF, 76.6 percent for CP-WOPT, and 99.9 percent for LSM-RN. The same trends can be seen for all other metrics in Table 1, for both Aarhus and Doha.
On the contrary, in the case of dense road network graphs and complete data, LSM-RN seems to slightly outperform our method, which indicates that in such scenarios LSM-RN is able to accurately learn the spatial and temporal properties of the traffic phenomena.

Rush Hour Versus Non Rush Hour
Next, we investigate the accuracy of missing value completion in two time windows known to have different traffic dynamics: rush hours (congestion) and non-rush hours (free flow). We choose the time window between 7 am and 8 am to represent the rush hour scenario, and the one from 10 am to 11 am to represent the non-rush hour scenario. Fig. 6 plots the results (MAPE scores) for the case of 10 percent missing values. The full set of results is reported in Table 2. Once again, our method TRTF outperforms the other methods in most cases, for both cities (Aarhus and Doha) and both scenarios (rush hour and non-rush hour). While the difference between TRTF and CP-WOPT in the case of Aarhus (Fig. 6a) is narrow, with TRTF achieving only a 3 percent improvement over CP-WOPT, the difference is much more significant in the case of Doha, where TRTF achieves over 9 percent improvement compared to the second-best method, CP-WOPT. This means that in the extreme cases of missing values and disconnected networks, the use of both temporal and spatial properties leads to lower errors compared to cases where only temporal properties are learned. Another interesting observation is that TRMF completely degenerates when completing missing values for Doha, especially in the non-rush hour case, which can be explained by the low effectiveness of this method under large amounts of missing values. Last, with the exception of TRMF, all three methods TRTF, CP-WOPT, and LSM-RN achieve better completion scores during non-rush hours than during rush hours; the gain is especially noticeable for CP-WOPT and TRTF. This suggests that traffic dynamics are more predictable in free-flow scenarios.

Weekly Patterns of MAPE Scores
As demonstrated in the previous section, TRTF is effective both in general and in the specific scenarios of rush and non-rush hours. We plot in Fig. 7 the heat-maps of typical MAPE scores achieved by TRTF throughout a week. Note that we are only interested in the five working days of each city (Monday through Friday in Aarhus and Sunday through Thursday in Doha). For each day, we only report the 18 hours during which it makes sense to monitor the traffic, i.e., from 6 am to midnight. In the heat-maps, rows represent days of the week and columns represent hours of the day. Each cell reports the average level of MAPE error observed at a typical hour of that day (e.g., the typical level of error made on any Tuesday at 9 am). Due to space limitations, we only plot heat-maps for our method. Unsurprisingly, the scores for Doha are noticeably higher than those for Aarhus, which can be explained by the quality of the respective datasets. Doha shows two distinguishable periods in which TRTF does not perform well; these correspond to the morning (7 am and 8 am) and evening (5 pm and 6 pm) commutes. In Aarhus, the general trend suggests that the method underperforms in the morning period (6 am to 12 pm), then slowly improves in the afternoon and evening. This plot is particularly important in that it shows that the completion error is never uniform across hours and days, especially when dealing with complex phenomena like traffic. Thus, it is important to train different sets of parameters for different relevant temporal segments.

Forecasting Evaluation
In this set of experiments, we evaluate the quality of the different methods in predicting future speed values of the road network in Aarhus and Doha. The two main parameters for all methods are (1) the horizon h, which is the number of steps ahead to predict, and (2) the size of the training window. For the three methods that use auto-regression, CP-WOPT, TRMF, and TRTF, we also vary the size of the lag set. Due to space limitations, we only report a subset of the results obtained; the same trends hold in all other cases.
As an illustrative example, we assume a case in which we know the traffic status in each city up to 7 am (time t), and try to predict the speeds beyond, i.e., at $\{t+1, t+2, \ldots, t+12\}$, where each step corresponds to a horizon of 5 minutes. Fig. 8 reports the MAPE scores of the different methods when predicting speeds at different horizons between 7:05 am and 8 am. The training set is fixed to the 30 minutes before 7 am (data from 6:30 am to 7 am) and the size of the lag set is 6. As one would expect, the general trend is that the error increases as the horizon increases, which can be explained by error accumulation.
We observe that the tensor-based methods (CP-WOPT and TRTF) that use the autoregressive regularization procedure outperform by far the matrix-based methods LSM-RN and TRMF. This demonstrates that a good incorporation of temporal properties into a simple tensor factorization algorithm yields remarkably good results compared to those achieved by matrix-factorization methods (i.e., LSM-RN and TRMF). For instance, in the case of lag 6 (Fig. 8a), the gain in MAPE score of our method is about 23.65 percent compared to LSM-RN and 28.96 percent compared to TRMF, which is very significant. The reason TRTF outperforms CP-WOPT is that TRTF includes both a spatial and a temporal regularizer, which are not embedded in CP-WOPT. These results hold across different experimental settings, as shown in Figs. 8b and 8c.
Next, we investigate the impact of the lag set size on the accuracy of TRTF forecasting. We report in Fig. 9 the average MAPE scores observed in horizon bins of 15 minutes. We vary the size of the lag set, which we assume to be contiguous, taking values in {4, 6, 12}. Interestingly, we find that the accuracy of the predictions is much better in the case of Aarhus compared to Doha. This is probably due to the quality of the data and the connectedness of the road network. In the case of Aarhus (Fig. 9a), we found that a lag of 6 yields better results, whereas in Doha the best lag size turned out to be 4.

Time Performance
We report in Table 3 the performance of the different methods observed in the training phase of the missing value completion task. TRMF has similar running times on the Aarhus and Doha datasets. On the other hand, CP-WOPT and TRTF, like other tensor-factorization-based methods, are one order of magnitude faster in the case of sparse data (Doha) and become slower for bigger datasets. LSM-RN, on the other side, requires much more time to learn the spatio-temporal properties than any other method, mainly due to the costly phase of learning the different matrices representing the attributes of the vertices of the road network.

Fit and Convergence
We finally discuss the convergence of our algorithm TRTF in its best-case scenarios. Convergence is captured here in terms of the number of iterations required to reach the convergence criterion. In practice, we compute the fit score at each iteration and declare convergence when the fit becomes equal to or greater than 0.98, i.e.,

$$\mathrm{fit} = 1 - \frac{\|\mathcal{G} - [\![A, B, C]\!]\|_F}{\|\mathcal{G}\|_F} \ge 0.98,$$

where $\|\mathcal{G}\|_F$ is the Frobenius norm of the three-way tensor of speed values. Fit curves are reported in Fig. 10. We see that our algorithm converges after only 27 iterations in the case of Doha, compared to 61 iterations for Aarhus. Time-wise, convergence happens within 4.30 seconds in Doha versus 112.9 seconds in Aarhus. This significant difference in time is mainly due to the fact that tensor operations tend to be slower when the tensors are dense.
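The convergence check is a one-liner, sketched here with the fit defined as above:

    import numpy as np

    # Fit score and the 0.98 stopping rule used to declare convergence;
    # np.linalg.norm on a 3-D array returns the Frobenius norm of the
    # flattened tensor.
    def fit_score(G, G_hat):
        return 1.0 - np.linalg.norm(G - G_hat) / np.linalg.norm(G)

    def converged(G, G_hat, tol=0.98):
        return fit_score(G, G_hat) >= tol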

New Set of Experiments
Fetching real-time data feeds from PeMS can be done via FTP. We wrote a script that does the following:

Step 1: the script logs into PeMS using the login() function; users need to provide a username and password in login().
Step 2: the script reads station_ids.csv, where the station_ids can be retrieved via the 'EXPORT TEXT' button; see step2.png for an example for Los Angeles.
Step 3: the script goes through each ID in station_ids and generates the URL to retrieve that ID's time series.

England's traffic data can be downloaded via API or manually: (1) Webtris API: http://webtris.highwaysengland.co.uk/api/swagger/ui/index#!/Areas/Areas_Get_0; and (2) Webtris R interface: https://cran.r-project.org/web/packages/webTRISr/index.html.
Remark 6. We developed a script to extract the adjacency matrix from England's road network, but realized that working on the England dataset would require a lot of time and could be the scope of another paper. Therefore, we decided to form the adjacency matrices for Aarhus, Denmark, i.e., the same city as in the earlier experiments, but with more recent data, see Table 4. Dataset 1, i.e., for the period January-June 2014, had already been processed with good results, and we ran experiments on the new datasets in order to assess how the proposed methodology performs. It is clear that the proposed methodology competes with the other state-of-the-art algorithms (LSM-RN, TRMF), as well as with the baseline (CP-WOPT), on three real datasets, see Figs. 11, 12, 13, 14, 15 and 16. The flexibility of the proposed algorithm TRTF can be noted when it is tested under different experimental settings.

CONCLUSION
We presented in this paper TRTF, an algorithm for temporal regularized tensor decomposition, and showed how it can be used for several traffic-related tasks such as missing value completion and forecasting. Our algorithm incorporates both spatial and temporal properties into the tensor decomposition procedure, thus learning better factors. We also extended TRTF with an autoregressive procedure to allow for multi-step-ahead forecasting of future values. We compared our method to recently developed algorithms that deal with the same type of problems using regularized matrix factorization, and showed that under many circumstances TRTF provides better results. This is particularly true when the data suffers from high proportions of missing values, which is common in the traffic context. For instance, TRTF achieves a 20 percent gain in MAPE score compared to the second-best algorithm (CP-WOPT) when completing missing values under the extreme sparsity observed in Doha.
As future work, we will first focus on adding non-negativity constraints to TRTF, although the highest fraction of negative values generated by our method throughout all the experiments did not exceed 0.7 percent. Our second focus will be to optimize the TRTF training phase in order to increase its scalability to large dense tensors, and to implement it in a parallel environment.

Fig. 15. During rush hours, TRTF has the lowest MAPE score and thus performs better than all other algorithms; it also seems to perform best during non-rush hours. TRMF seems to perform worse during non-rush hours, unlike the other three algorithms, which seem to perform better during non-rush hours.

Fig. 16. Results from Aarhus seem to differ from those for Doha: TRMF performed worse for Doha than it did for Aarhus. TRMF seems to perform well overall, but worse during non-rush hours, which seems to be a frequent pattern across all instances.
Abdelkader Baggag is a senior scientist with the Qatar Computing Research Institute, and an associate professor of data science at Hamad Bin Khalifa University, where he teaches advanced machine learning. His research focuses on developing data-driven models for finding patterns in complex data (mobility and health data) and implementing these methods in high-performance solutions, in particular for multidimensional data and sequences of states, to support domain experts in traffic analytics using sensor data, and in eHealth for analyzing large-scale wearable sensor signals. His expertise is in machine learning, representation learning, temporal causal modeling, artificial intelligence for health and mobility analytics, and missing data imputation.
Sofiane Abbar is a researcher and senior software engineer with the Social Computing Department, Qatar Computing Research Institute, Hamad Bin Khalifa University. His research interests include social computing, machine learning, and datadriven approaches for urban analytics.
Ankit Sharma is currently working toward the PhD degree in the Computer Science Department, University of Minnesota, Minneapolis. He was a research associate with the Social Computing Department, Qatar Computing Research Institute, Hamad Bin Khalifa University, when this work was conducted.
Tahar Zanouda is a data scientist at Ericsson, Sweden. His research interest lies at the intersection of machine learning and urban mobility. He was a research associate in the Social Computing department at Qatar Computing Research Institute, Hamad Bin Khalifa University, at the time of this work.
Abdulaziz Al-Homaid is a research associate with the Qatar Computing Research Institute in Doha. He worked on improving access to Education data across response organizations during a fellowship with The United Nations Office for the Coordination of Humanitarian Affairs (UN OCHA). His research is on applying deep learning methods in Smart City domains such as city-wide traffic prediction and e-Health.
Abhiraj Mohan received the BS degree in computer science and engineering from the University of Minnesota at Twin Cities. His interests include machine learning and data science. He has worked on detecting behavior patterns for users from wearable technology, and on imputing missing values from traffic data. He has been a recipient of the UROP Scholarship at Minnesota, and has been on the Dean's list multiple times during his studies.
Jaideep Srivastava is a professor with the Department of Computer Science and Engineering, University of Minnesota, Twin Cities. He was the chief scientist with the Qatar Computing Research Institute, Hamad Bin Khalifa University, when this work was conducted. He does research in databases, computing in social sciences, arts and humanities, data mining, and works on the use of advanced machine learning for many applications such as traffic analytics and health analytics. He is an IEEE fellow.