Short-term traffic prediction using physics-aware neural networks

In this work, we propose an algorithm performing short-term predictions of the flux of vehicles on a stretch of road, using past measurements of the flux. This algorithm is based on a physics-aware recurrent neural network. A discretization of a macroscopic traffic flow model (using the so-called Traffic Reaction Model) is embedded in the architecture of the network and yields flux predictions based on estimated and predicted space-time dependent traffic parameters. These parameters are themselves obtained using a succession of LSTM and simple recurrent neural networks. Besides, on top of the predictions, the algorithm yields a smoothing of its inputs which is also physically constrained by the macroscopic traffic flow model. The algorithm is tested on raw flux measurements obtained from loop detectors.


Introduction
Using neural networks to produce short-term traffic predictions is a subject of growing interest, especially since advances in transportation systems have substantially increased the availability of large amounts of traffic data from various sources [38,40]. We refer the interested reader to [7,14] for surveys of neural network-based models that have been proposed for short-term traffic predictions. These methods have proven particularly fit to tackle the challenges associated with traffic prediction, such as the modeling of spatio-temporal dependencies in traffic data or the influence of external factors (e.g. weather and road conditions, unpredictable traffic events) [32].
Particular architectures have been proposed to model the temporal dependencies present in traffic data (see [36] for a comprehensive survey), which are due for instance to the fact that these data often consist of time series of measurements of traffic-related quantities (such as the occupancy, speed and flux) made by sensors on a road. Consequently, algorithms taking sequential data as input have been proposed for traffic predictions: the input then consists of a sequence of (possibly multivariate) variables corresponding to measurements at some times $t_{-(N_p-1)} \le \dots \le t_{-1} \le t_0$, and the algorithm generates predictions of some quantity of interest at times $t_1 \le \dots \le t_{N_f}$ (where $t_1 \ge t_0$). Examples of such algorithms include time-delayed neural networks (e.g. [1,18,39]), (graph) convolutional neural networks (e.g. [24,37]) and recurrent neural networks (RNNs) (e.g. [22,33,34]), which will be of particular interest in this work.
However, despite their strong prediction performance, a caveat of these approaches is their black-box nature: due to their complexity and high number of parameters, these models lack interpretability, as it is hard to understand how (and if) they manage to produce reliable traffic predictions [7,32]. Also, these algorithms (and their performance) depend on a number of hyperparameters (such as the number of hidden layers or the number of nodes) that must be tuned by the practitioner in order to achieve the best performance for a given dataset.
Bearing this in mind, we propose in this work a "grey-box" RNN architecture which aims at providing short-term flux predictions based on past flux measurements along a stretch of road. The originality of this architecture is that it combines a conventional RNN with a discretization of a macroscopic traffic flow model in an attempt to yield interpretable predictions. Indeed, the space-time varying parameters of the traffic model are seen as a set of unobserved time series which are learned from the flux observations. In a first step, a conventional RNN is used to extract these time series of parameters from the observations and predict their evolution over a short period of time. The output of this first algorithm is then passed to the traffic model to yield both a re-estimation of the flux observations and their flux predictions. Besides, since the macroscopic model is defined on the whole road, flux estimates and predictions can be obtained everywhere on the road, hence providing the practitioner with an algorithm able to complete and predict the flux variations along a road.
Note that the traffic predictions obtained with the proposed approach come directly from the (discretized) macroscopic traffic model, and therefore are bound to follow the associated physical model. The idea of constraining neural networks with physical models is receiving much attention in the context of solving forward and inverse problems for partial differential equations (PDEs) [23]. Two main approaches have been proposed in this context:
• either augmenting the cost function used to train the neural network with terms that describe the PDE and its possible boundary or initial conditions (e.g. [26,35]),
• or explicitly embedding the PDE inside the architecture of the neural networks by dedicating some parts of the network to the approximation of differential operators (e.g. [20,21]) or to the numerical approximation of the PDE itself (e.g. [2,6]).
In particular, Pakravan et al. [23] proposed so-called physics-aware neural networks for solving inverse PDE problems, which, much like our approach, consist of using a neural network to learn the parameters of a PDE, and then inputting these parameters into a numerical scheme designed to approximate the PDE. Hence, our approach can be seen as an extension of their approach for prediction purposes and in the presence of real (traffic) data.
Another original aspect of this research is the use of a particular discretization scheme for the macroscopic traffic model: the so-called Traffic Reaction Model [19]. This scheme models flow dynamics along a discretized road as a chain of chemical reactions happening between adjacent cells (now seen as compartments containing "fictional" chemical reactants). This scheme was shown to be able to reproduce complex traffic patterns observed when performing state and parameter estimation from real-world traffic data [25]. It also allows one to assimilate the space-time varying parameters of the macroscopic traffic model to dimensionless and bounded parameters that are interpreted as the reaction rates of the chemical reactions mentioned above.
In addition to the interpretation that can be given to the parameters of our architecture and the origin of the predictions it produces, most of its hyperparameters are set by the configuration of the detectors along the road of interest, hence removing the burden of tuning them. Moreover, our approach is particularly suited to high resolution data (i.e. data composed of measurements made with high frequency, for instance every minute or less). Such datasets are known to be problematic in prediction applications, due to the high level of noise they might have [16]. In such cases, noise reduction techniques, such as smoothing or wavelet algorithms (e.g. [11]), or aggregation of the measurements are usually applied to the data before proceeding to the predictions. But there is no consensus on which approach to select (and with which parameters), and besides, these techniques have been shown to impact and distort the properties of time series [31]. Our approach, on the other hand, naturally incorporates a smoothing of the measurements inherited from the discretized macroscopic traffic model, and is therefore physically justified.
In summary, we propose a physics-aware algorithm for short-term traffic predictions. This algorithm produces a smoothing of the input data and predictions (even on unobserved sections of the road) that are highly interpretable as they arise directly from a (discretization of a) macroscopic traffic flow model. Besides, both the architecture and the parameters of our algorithm find physical interpretations.
The outline of this paper is as follows. In Section 2, we present the macroscopic traffic model and its discretization used in the flux prediction architecture. In Section 3, we provide some reminders on RNNs and present the architecture. Finally, in Section 4, we present numerical experiments based on a dataset of flux measurements taken on a Dutch motorway.

Traffic models and data
2.1. First order macroscopic traffic flow model. First order macroscopic traffic flow models link the evolution of two quantities describing the flow and state of the traffic at any point in time and space: the flux of vehicles $f$ and the density of vehicles $\rho$. Both quantities can in particular be coupled through a conservation law expressing a balance between the numbers of vehicles entering and exiting any section of the road and the number of vehicles present in this same section. Such balance laws are common when studying for instance the flow of incompressible fluids and take the following form:
$$\frac{\partial \rho}{\partial t}(t,x) + \frac{\partial f}{\partial x}(t,x) = q(t,x), \tag{1}$$
where $q$ is a function acting as a source term and describing for instance on- and off-ramps along the road. One of the most popular macroscopic traffic flow models, the so-called Lighthill-Whitham-Richards (LWR) model [17,27], makes the additional assumption that the flux of vehicles can be expressed as a function of the density of vehicles. A popular choice of flux function is the so-called Greenshields flux [10], for which $f$ takes the form
$$f = 4 f_m \, \frac{\rho}{\rho_m}\left(1 - \frac{\rho}{\rho_m}\right), \tag{2}$$
where $\rho_m > 0$ denotes the maximal value the density can take (i.e. the density value corresponding to bumper-to-bumper traffic) and $f_m \ge 0$ denotes the (possibly space-time-dependent) maximal flux of vehicles. This choice follows naturally when enforcing two straightforward and physically meaningful conditions that a flux function should satisfy, namely that the flux should be 0 when the road is empty (i.e. $\rho = 0$) or at full capacity (i.e. $\rho = \rho_m$). Note that considering a space-time-dependent maximal flux $f_m$ allows one to model for instance situations where the traffic flow is impacted by road conditions or policies that may vary in time or space. For any given bounded and integrable initial condition $\rho_0 \equiv \rho(0, \cdot)$, and any $T > 0$, the existence and uniqueness of (entropy) solutions of the resulting PDE in $C([0, T), L^1(\mathbb{R}) \cap L^\infty(\mathbb{R}))$ has been proven in the general case, where the maximal flux is taken to be a function $(t, x) \mapsto f_m(t, x)$ that is bounded and Lipschitz-continuous [5,12].
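As a minimal illustration, the Python sketch below evaluates a Greenshields-type flux and checks the two conditions mentioned above (zero flux on an empty road and on a full road). It assumes the normalization in which the flux peaks at the maximal flux for a half-full road; the function name and the numerical values are hypothetical and only serve the example.

```python
def greenshields_flux(rho, f_max, rho_max):
    """Greenshields-type flux (assumed normalization: peak value f_max at rho_max / 2)."""
    if not 0.0 <= rho <= rho_max:
        raise ValueError("density must lie in [0, rho_max]")
    u = rho / rho_max  # dimensionless occupancy in [0, 1]
    return 4.0 * f_max * u * (1.0 - u)


# Illustrative (assumed) values: density in veh/km, flux in veh/h.
RHO_MAX = 400.0
F_MAX = 6000.0

print(greenshields_flux(0.0, F_MAX, RHO_MAX))          # empty road
print(greenshields_flux(RHO_MAX / 2, F_MAX, RHO_MAX))  # half-full road
print(greenshields_flux(RHO_MAX, F_MAX, RHO_MAX))      # bumper-to-bumper traffic
```

With this normalization, the speed of vehicles (flux divided by density) never exceeds four times the maximal flux divided by the maximal density.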

2.2. Solving numerically the PDE: the Traffic Reaction Model. In general, no closed-form solution of PDE (1) is available, and therefore, numerical methods must be used to approximate it. In particular, finite volume schemes have been widely used to compute solutions to (hyperbolic) PDEs of the form (1) since these solutions can develop discontinuities in finite time but remain locally integrable [15]. For reasons which will be clarified in subsequent sections of this paper, we focus on a particular scheme proposed by Liptak et al. [19] and called Traffic Reaction Model (TRM). This scheme can be seen as a particular instance of the finite volume schemes investigated in [4].
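Let us once again consider PDE (1) on $\mathbb{R}_+ \times \mathbb{R}$ and the following space-time discretization: in time we consider equidistant time steps $t_n = n\Delta t$ (for some step size $\Delta t > 0$ and $n \in \mathbb{N}$) and in space we consider equispaced cells of size $\Delta x > 0$ with centroids $x_j = j\Delta x$ (for $j \in \mathbb{Z}$). We denote by $x_{j\pm 1/2} = x_j \pm \Delta x/2$ the boundary points of the $j$-th cell. For each cell $j$ and time step $t_n$, a quantity $\rho^n_j$ is defined as follows. At $t = t_0$,
$$\rho^0_j = \frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}} \rho_0(x)\, dx, \quad j \in \mathbb{Z},$$
where $\rho_0$ denotes the initial condition associated with PDE (1), and is assumed to be known. Then, the TRM approximates the solution of PDE (1) at $(t_n, x_j)$ by the quantity $\rho^n_j$ obtained through the following recurrence relation
$$\rho^{n+1}_j = \rho^n_j + \rho_m\left(C^n_{j-1} F(\rho^n_{j-1}, \rho^n_j) - C^n_j F(\rho^n_j, \rho^n_{j+1})\right) + q^n_j, \quad j \in \mathbb{Z},\ n \in \mathbb{N}, \tag{3}$$
where the quantities $C^n_j$ and $q^n_j$ are given by
$$C^n_j = \frac{4}{\rho_m \Delta x}\int_{t_n}^{t_{n+1}} f_m(t, x_{j+1/2})\, dt, \qquad q^n_j = \frac{1}{\Delta x}\int_{t_n}^{t_{n+1}}\int_{x_{j-1/2}}^{x_{j+1/2}} q(t, x)\, dx\, dt, \quad j \in \mathbb{Z},\ n \in \mathbb{N},$$
and $F$ denotes the numerical flux defined by
$$F(u, v) = \frac{u}{\rho_m}\left(1 - \frac{v}{\rho_m}\right), \quad u, v \in [0, \rho_m].$$
Note in particular that, in order to apply this recurrence relation, the maximal flux $f_m$ and the source term $q$ (or at least their discretized counterparts $C^n_j$, $q^n_j$) must be known at any time and space location.

Remark 2.1. Note that the recurrence relation (3) actually stems from a stable forward Euler discretization of the system of ordinary differential equations (ODEs) satisfied by a set of functions $\{\rho_j\}_{j\in\mathbb{Z}}$ and given by
$$\frac{d\rho_j}{dt}(t) = \frac{4}{\Delta x}\left(f_m(t, x_{j-1/2})\, F(\rho_{j-1}(t), \rho_j(t)) - f_m(t, x_{j+1/2})\, F(\rho_j(t), \rho_{j+1}(t))\right) + Q_j(t), \quad j \in \mathbb{Z},$$
where for $j \in \mathbb{Z}$, $Q_j$ is given by
$$Q_j(t) = \frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}} q(t, x)\, dx.$$

As defined, the quantity $\rho^n_j$ acts as an approximation of the cell average, at time $t_n$ and on the $j$-th cell, of the solution $\rho$ of PDE (1), i.e. of the ratio of the integral over the $j$-th cell of the function $x \mapsto \rho(t_n, x)$ to the cell size. Assuming that $f_m$ is bounded and Lipschitz-continuous, Chainais-Hillairet and Champier [4] proved the convergence of a class of finite volume schemes that includes the TRM as the step sizes $\Delta t, \Delta x \to 0$, provided that these quantities satisfy a so-called Courant-Friedrichs-Lewy (CFL) condition given by
$$\frac{4\,\Delta t}{\rho_m \Delta x}\, \|f_m\|_\infty \le \frac{1}{2}, \tag{4}$$
where $\|f_m\|_\infty = \sup_{(t,x)} |f_m(t, x)|$. Note that since $f_m$ is nonnegative, this condition imposes that the coefficients $C^n_j$ satisfy
$$C^n_j \in [0, 1/2], \quad j \in \mathbb{Z},\ n \in \mathbb{N}. \tag{5}$$
This condition is instrumental in proving the stability, monotonicity and ultimately convergence of the TRM scheme.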
Remark 2.2. Multiplying the recurrence relation by $\Delta x$ yields an equation describing the variation of the number of vehicles inside the $j$-th cell between $t_n$ and $t_{n+1}$:
$$\rho^{n+1}_j \Delta x = \rho^n_j \Delta x + (\rho_m \Delta x)\, C^n_{j-1} F(\rho^n_{j-1}, \rho^n_j) - (\rho_m \Delta x)\, C^n_j F(\rho^n_j, \rho^n_{j+1}) + q^n_j \Delta x.$$
In this last equation, the quantity $q^n_j \Delta x$ corresponds to the net volume of vehicles added to or subtracted from the cell by the source term, and the quantity $(\rho_m \Delta x)\, C^n_j F(\rho^n_j, \rho^n_{j+1})$ corresponds to the volume of vehicles exiting the cell $j$ and entering the cell $(j + 1)$. Hence, dividing this last quantity by the time step $\Delta t$ gives an approximation of the flux of vehicles $f(t_n, x_{j+1/2})$ at the interface between the cells $j$ and $(j + 1)$, and at time $t_n$:
$$f(t_n, x_{j+1/2}) \approx \frac{\rho_m \Delta x}{\Delta t}\, C^n_j F(\rho^n_j, \rho^n_{j+1}).$$

Remark 2.3. The TRM can be derived by assimilating the traffic flow dynamics (on the discretized road) to a chemical reaction network [19,25]. More precisely, each cell is modeled as a compartment containing two (homogeneously distributed) fictional chemical reactants: molecules of occupied space $O_j$ and molecules of free space $\Phi_j$. The density of vehicles $\rho_j$ within a cell $j$ is then assimilated to the concentration of occupied space $O_j$, and the variations in time of this density are interpreted as resulting from chemical reactions happening at the boundaries of the cell. Namely, the chemical reaction happening at the boundary between the cells $j$ and $j + 1$ "transforms" a molecule of occupied space $O_j$ in $j$ and a molecule of free space $\Phi_{j+1}$ in $j + 1$ into a molecule of free space $\Phi_j$ in $j$ and a molecule of occupied space $O_{j+1}$ in $j + 1$ (cf. Figure 2.1).
The TRM then follows from writing the kinetics of these reactions using the law of mass action [8] (assuming that the reactions described above happen with some rate $k_j$ that can be time-dependent). As for the coefficients $C_j$ and $q_j$, they are seen respectively as reaction rates between compartments and as external intakes or outtakes of reactants (cf. [25] for details).
In practice, the modeled road is not infinite, and only a finite number of cells is used to discretize it. Hence, boundary conditions must be supplied to define the evolution of the outer cells of the road. Indeed, assuming that the road is discretized using $N_x$ cells (numbered from 1 to $N_x$), the recurrence relation (3) takes the form
$$\rho^{n+1}_j = \rho^n_j + \rho_m\left(F^n_{j-1} - F^n_j\right) + q^n_j, \quad j \in \{1, \dots, N_x\},\ n \in \mathbb{N}, \tag{6}$$
where for the interior cells (cells 2 to $N_x - 1$), the quantities $F^n_k$ are defined as the TRM numerical fluxes:
$$F^n_k = C^n_k\, F(\rho^n_k, \rho^n_{k+1}) = C^n_k\, \frac{\rho^n_k}{\rho_m}\left(1 - \frac{\rho^n_{k+1}}{\rho_m}\right), \quad k \in \{1, \dots, N_x - 1\}. \tag{7}$$
For the boundary cells (cells 1 and $N_x$), the TRM numerical flux would require the quantities $\rho^n_0, \rho^n_{N_x+1} \in [0, \rho_m]$, which are no longer defined. A natural approach would consist in considering these quantities as additional parameters of the model. Then, for instance, the numerical flux $F^n_0$ would be given by
$$F^n_0 = C^n_0\, \frac{\rho^n_0}{\rho_m}\left(1 - \frac{\rho^n_1}{\rho_m}\right),$$
and is therefore parameterized only by the product of the independent parameters $C^n_0$ and $\rho^n_0$ (divided by the maximal density $\rho_m$). Therefore, we propose to directly consider the factor $C^n_0 \rho^n_0/\rho_m \in (0, 1/2)$ as a parameter of the model, and denote it, with a slight abuse of notation, by $C^n_0$. Using the same approach for the numerical flux $F^n_{N_x}$, we end up with the following expressions:
$$F^n_0 = C^n_0\left(1 - \frac{\rho^n_1}{\rho_m}\right), \qquad F^n_{N_x} = C^n_{N_x}\, \frac{\rho^n_{N_x}}{\rho_m}, \tag{8}$$
where $n \in \mathbb{N}_0$. Note that this setting is equivalent to considering that the road is full upstream (i.e. $\rho^n_0 = \rho_m$) and empty downstream (i.e. $\rho^n_{N_x+1} = 0$). The reaction rates $C^n_0$ and $C^n_{N_x}$ are then used to control the amount of vehicles allowed to enter or exit the road between $t_n$ and $t_{n+1}$.
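To make the finite-road recurrence concrete, the following Python sketch performs one TRM step on a road with a handful of cells. It is only an illustration under stated assumptions: the source terms are omitted, the road is treated as full upstream and empty downstream as described above, and the function names and numerical values are hypothetical.

```python
def trm_fluxes(rho, rates, rho_max):
    """Dimensionless numerical fluxes at the Nx + 1 interfaces of a road with Nx cells."""
    n_cells = len(rho)
    if len(rates) != n_cells + 1:
        raise ValueError("one reaction rate per interface is required")
    # Upstream boundary: the road is treated as full before the first cell.
    fluxes = [rates[0] * (1.0 - rho[0] / rho_max)]
    # Interior interfaces: occupied space upstream reacts with free space downstream.
    for k in range(1, n_cells):
        fluxes.append(rates[k] * (rho[k - 1] / rho_max) * (1.0 - rho[k] / rho_max))
    # Downstream boundary: the road is treated as empty after the last cell.
    fluxes.append(rates[n_cells] * rho[n_cells - 1] / rho_max)
    return fluxes


def trm_step(rho, rates, rho_max):
    """One step of the recurrence: each cell gains its inflow and loses its outflow."""
    fluxes = trm_fluxes(rho, rates, rho_max)
    new_rho = [r + rho_max * (fluxes[j] - fluxes[j + 1]) for j, r in enumerate(rho)]
    return new_rho, fluxes


# Illustrative (assumed) values: 4 cells, rates in (0, 1/2), densities in [0, rho_max].
density = [30.0, 60.0, 90.0, 20.0]
reaction_rates = [0.3, 0.4, 0.2, 0.5, 0.1]
next_density, interface_fluxes = trm_step(density, reaction_rates, rho_max=100.0)
print(next_density)
print(interface_fluxes)
```

In this sketch, the total density can only change through the two boundary fluxes, which is the discrete counterpart of the conservation law.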
Remark 2.4. Note that, using Remark 2.2, the numerical fluxes $F^n_j$ can be assimilated to measurements of the flux of vehicles entering and exiting the road by scaling them by a factor $\rho_m \Delta x/\Delta t$.
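The scaling of Remark 2.4 can be sketched as a pair of conversion helpers between dimensionless numerical fluxes and fluxes expressed in vehicles per unit of time. The function names and the numerical values (maximal density, cell size, time step) are assumptions made for the example only.

```python
def to_physical_flux(numerical_flux, rho_max, dx, dt):
    """Scale a dimensionless numerical flux by rho_max * dx / dt (vehicles per unit of time)."""
    return numerical_flux * rho_max * dx / dt


def to_numerical_flux(physical_flux, rho_max, dx, dt):
    """Inverse scaling, e.g. to compare a flux measurement with a TRM numerical flux."""
    return physical_flux * dt / (rho_max * dx)


# Assumed values: rho_max = 0.6 veh/m, dx = 150 m, dt = 2 s.
flux_per_second = to_physical_flux(0.04, rho_max=0.6, dx=150.0, dt=2.0)
print(flux_per_second * 3600.0)  # the same flux expressed in vehicles per hour
```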

Neural networks
3.1. Recurrent neural networks. Recurrent neural networks (RNNs) are a particular type of neural networks designed to take sequential or time-ordered data as inputs, and widely used in applications such as time series prediction, natural language processing, or speech recognition [13]. We now provide a short introduction on these algorithms, but refer the interested reader to [9,13,29,30] for more details on the matter.
Given an input sequence $X = (x^{(1)}, \dots, x^{(N_{seq})})$ (composed of vectors of size $N_{in} > 0$), a RNN computes a sequence of so-called cell states $S = (s^{(1)}, \dots, s^{(N_{seq})})$ (composed of vectors of size $N_{state} > 0$) which are turned into a sequence of outputs $Y = (y^{(1)}, \dots, y^{(N_{seq})})$ (composed of vectors of size $N_{out} > 0$). Assuming that an initial state $s^{(0)}$ has been set, the operations performed by the RNN can be summed up by the recurrence relations (9), where $n \in \{1, \dots, N_{seq}\}$ and $F_1$, $F_2$ are two nonlinear (vector-valued) functions depending on parameters determined during the training of the RNN. For instance, in the so-called Simple RNN (SRNN), the functions $F_1$ and $F_2$ are MultiLayer Perceptrons (MLPs) (cf. Appendix A).
A shortcoming of RNNs is the problem of vanishing and exploding gradients, which occurs when backpropagation is used to compute the gradient of the cost function used when training the algorithm. This can result in a training phase that fails to produce adequate parameters for the model to yield low-error predictions. The Long Short-Term Memory (LSTM) architecture was proposed to avoid this pitfall. In this setting, the state vectors $s^{(n)}$ ($n \in \{1, \dots, N_{seq}\}$) can be seen as composed of two subvectors, $s^{(n)} = (c^{(n)}, h^{(n)})$. The nonlinear functions $F_1$ and $F_2$ defining the RNN in (9) then take the form of the LSTM update equations, where $n \in \{1, \dots, N_{seq}\}$, $P_i$, $i \in \{1, \dots, 4\}$, denote single layer perceptrons (cf. Appendix A), $\sigma$ denotes the application of a nonlinear activation function (usually tanh) to all entries of a given vector, and $\odot$ denotes the component-wise product between two vectors (cf. Figure 3.1 for a visual representation of an LSTM).

Figure 3.1. LSTM architecture: $P_i$, $i \in \{1, \dots, 4\}$, denote single layer perceptrons, $\odot$ denotes the component-wise product between two vectors, $+$ denotes the sum of two vectors and $\sigma$ denotes the application of an activation function to the entries of a vector.

3.2. The TRM as a RNN. Since the TRM can be seen as a forward Euler discretization of a system of ODEs (cf. Remark 2.1), it can also be interpreted as an instance of RNN for which the successive states of the RNN represent successive states of the discretized system of ODEs [28,29]. Hence, taking the notations introduced in the previous subsection, the state vector $s^{(n)}$ of the RNN associated with the TRM is none other than the vector containing the numerical solution of PDE (1) at the $n$-th time step, i.e. $s^{(n)} = (\rho^n_1, \dots, \rho^n_{N_x})$. As for the input $x^{(n)}$, it is given by the vector containing the reaction rates at the $n$-th time step, i.e. $x^{(n)} = (C^n_0, \dots, C^n_{N_x})$. Since our goal is to perform flux predictions, the output of the RNN that we associate to the TRM is taken to be the value of the numerical fluxes $y^{(n)} = (F^n_0, \dots, F^n_{N_x})$, as given by the relations (7)-(8). By definition of the TRM recurrence (3), when omitting the source terms, the state update in the corresponding RNN equations (9) is given, for any $n \in \{1, \dots, N_{seq}\}$, by
$$s^{(n)} = s^{(n-1)} + \rho_m\left(f^{(n-1)}_{0,\dots,N_x-1} - f^{(n-1)}_{1,\dots,N_x}\right),$$
where $f^{(n-1)}_{0,\dots,N_x-1}$ (resp. $f^{(n-1)}_{1,\dots,N_x}$) denotes the first (resp. last) $N_x$ entries of the vector $f^{(n-1)} = (F^{n-1}_0, \dots, F^{n-1}_{N_x})$ of numerical fluxes. Hence, note that the densities approximated by the TRM define the internal state of the corresponding RNN, and not its output. We therefore work with an approach similar to state-space models, where the state of the system is the density of vehicles, the dynamics of the system are defined by the TRM (and therefore, by extension, the macroscopic traffic flow model (1)) and the outputs of the model are the numerical fluxes of the TRM.
As a RNN, the TRM has a few particularities. First, and as described above, it has no tuning parameters, and its inputs, output and states are interpretable as components of a traffic flow model. Second, one can note that the nonlinearities that appear in the derivation of the TRM are only polynomial, as the states and outputs can be written as polynomial expressions of the inputs and past states.
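The RNN reading of the TRM can be sketched as a plain loop in which the state holds the densities, the inputs are the reaction rates and the outputs are the numerical fluxes. This is a hedged illustration rather than the implementation used in this work: the source terms are omitted, the output at each step is computed from the current state before that state is updated (an indexing convention assumed here), and all names and values are hypothetical.

```python
def trm_rnn(initial_density, rate_sequence, rho_max):
    """Unroll the TRM like an RNN: state = densities, input = rates, output = fluxes."""
    state = list(initial_density)
    outputs = []
    for rates in rate_sequence:
        n = len(state)
        # Output: fluxes at the n + 1 interfaces (full upstream, empty downstream).
        f = [rates[0] * (1.0 - state[0] / rho_max)]
        f += [rates[k] * state[k - 1] / rho_max * (1.0 - state[k] / rho_max)
              for k in range(1, n)]
        f.append(rates[n] * state[n - 1] / rho_max)
        outputs.append(f)
        # State update: only products and sums of inputs and states (polynomial).
        state = [s + rho_max * (f[j] - f[j + 1]) for j, s in enumerate(state)]
    return outputs, state


# Assumed toy setting: 5 cells, empty road, constant rates of 0.25 for 200 steps.
rates_over_time = [[0.25] * 6 for _ in range(200)]
fluxes_over_time, final_density = trm_rnn([0.0] * 5, rates_over_time, rho_max=1.0)
print(final_density)
```

Note that the loop has no trainable weights: everything that can be learned enters through the sequence of reaction rates.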

3.3. Traffic predictions. In our setting, we assume that traffic is observed on a stretch of road with no on- or off-ramps, for the sake of simplicity. The road is discretized into segments of length $\Delta x$ and the flux of vehicles is recorded every $\Delta T > 0$ seconds at some locations along the road. In particular, we assume that the data then consists of measurements of the flux of vehicles at some (or all) of the interfaces between the road segments (or "road interfaces" for short). We denote by $N_o$ the number of road interfaces at which the flux is measured and by $N_i \ge N_o$ the total number of road interfaces in the road. In particular, the number $N_s$ of segments used to discretize the road satisfies $N_s = N_i - 1$.
We propose to use RNNs to perform short-term predictions of the flux of vehicles based on the TRM. More precisely, we start with flux measurements at $N_p$ consecutive time steps $t_{-N_p+1}, \dots, t_{-1}, t_0$ (where $t_{i+1} = t_i + \Delta T$, $i \in \{-N_p + 1, \dots, -1\}$) at the $N_o$ observed interfaces of the road and wish to predict the flux of vehicles at time $t_{N_f} = t_0 + N_f \Delta T$ at both the observed and unobserved interfaces of the road. Hence we aim at jointly inpainting and forecasting the flux of vehicles along the road.
We propose to use the TRM RNN of Section 3.2 to model the evolution of the traffic along the road between $t_{-N_p+1}$ and $t_{N_f}$. We hence consider a TRM defined on the discretized road, and to comply with the CFL condition (4), we follow the multilevel approach proposed in [25] and consider a time step $\Delta t$ for the TRM given by $\Delta t = \Delta T/P_t$, where $P_t$ is an integer chosen large enough for $\Delta t$ and $\Delta x$ to satisfy the CFL condition (4). Such a choice ensures the stability of the TRM [19,25].
Remark 3.1. Recall that $\rho_m$ and $\|f_m\|_\infty$ denote respectively the maximal values the density and flux of vehicles can take. The former can be approximated by the ratio between the number of lanes in the road and the average length of the vehicles. As for the latter, note that the speed of vehicles can be identified with the ratio between the flux and the density. Assuming that this speed is bounded by some value $v_m$ (for instance 130 km/hour) and using (2), we can upper-bound $\|f_m\|_\infty$ by $v_m \rho_m/4$. Hence, we can take $P_t$ to be the smallest integer such that the CFL condition (4) holds when $\|f_m\|_\infty$ is replaced by this upper bound.

Then, the corresponding TRM RNN requires us to specify:
• an input sequence, taken as the sequence of $(N_f + N_p - 1)P_t + 1$ vectors containing the values of the TRM reaction rates (at every road interface) at times $t_{-N_p+1} + k\Delta t$, for $k \in \{0, \dots, (N_f + N_p - 1)P_t\}$;
• an initial state, taken as the vector containing the density of vehicles inside each road cell at time $t_{-N_p+1}$.
The TRM RNN then outputs a sequence of $(N_f + N_p - 1)P_t + 1$ vectors, containing the values of the flux of vehicles at every road interface, between $t_{-N_p+1}$ and $t_{N_f}$: the last element of this sequence gives the desired flux prediction.
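The choice of the number of TRM substeps per observation period can be illustrated with the short Python sketch below. It rests on a loudly stated assumption: the CFL condition is taken in a form that bounds the reaction rates by 1/2, which, combined with the speed bound of Remark 3.1, reduces to requiring that the number of substeps be at least twice the maximal speed times the observation period divided by the cell size. The function names and the numerical values are hypothetical.

```python
import math


def max_density(n_lanes, vehicle_length):
    """Approximate maximal density: number of lanes over average vehicle length."""
    return n_lanes / vehicle_length


def substeps_per_observation(v_max, obs_period, dx):
    """Smallest integer P_t such that dt = obs_period / P_t meets the ASSUMED CFL form.

    Assumption: 4 * dt * f_bound / (rho_max * dx) <= 1/2 with f_bound = v_max * rho_max / 4,
    i.e. P_t >= 2 * v_max * obs_period / dx (the maximal density cancels out).
    """
    return max(1, math.ceil(2.0 * v_max * obs_period / dx))


v_max = 130.0 / 3.6  # 130 km/h expressed in m/s
p_t = substeps_per_observation(v_max, obs_period=60.0, dx=150.0)
print(p_t, 60.0 / p_t)            # number of substeps and resulting TRM time step (s)
print(max_density(3, 5.0))        # assumed 3 lanes and 5 m per vehicle, in veh/m
```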
In order to specify the input sequence needed by the TRM RNN, we proceed as follows. Starting from the $N_p$ flux observations at times $t_{-N_p+1}, \dots, t_0$ and at the $N_o$ observed road interfaces, we use a LSTM RNN to estimate the reaction rates at the same time stamps and at all of the road interfaces (including the unobserved ones). Namely, we consider a LSTM network whose input sequence is given by the flux observations and whose outputs are used as estimates for the reaction rates at times $t_{-N_p+1}, \dots, t_0$. We refer to this LSTM network as "Extractor" as it is in fact used to extract reaction rates from flux observations. The initial state of this RNN is obtained as the output of a 2-layer MLP taking as input the flux measurement at time $t_{-N_p+1}$.
Then, the reaction rates at times $t_1, \dots, t_{N_f}$ are predicted using a SRNN whose inputs are vectors of zeros, and with initial state given by the last state of the LSTM RNN (i.e. the state of the LSTM RNN at time $t_0$). We refer to this SRNN as "Predictor", as it is in fact used to predict reaction rates beyond the time scope of the observations (i.e. for times greater than $t_0$).

Remark 3.2. Note that since the LSTM RNN and SRNN are tasked with estimating and predicting reaction rates, their outputs should take values in (0, 1/2), in accordance with the CFL condition (cf. Eq. (5)). This is achieved by choosing the activation functions of these RNNs to be sigmoids (hence ensuring that the output is in (0, 1)) and scaling the outputs by a factor 1/2 before using them.
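The output scaling of Remark 3.2 amounts to the following few lines of Python, given here as a minimal sketch with hypothetical function names.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function with values in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def to_reaction_rate(pre_activation):
    """Map an unconstrained network output to a reaction rate in (0, 1/2)."""
    return 0.5 * sigmoid(pre_activation)


for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, to_reaction_rate(z))
```

In floating-point arithmetic the sigmoid saturates for very large inputs, so the bounds are only strict for moderate pre-activations.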
Hence, we end up with reaction rates at times $t_{-N_p+1}, \dots, t_0, t_1, \dots, t_{N_f}$ and at every road interface. We then complete these reaction rates by assuming that they stay constant between these time steps, thus allowing us to have reaction rates at any time $t_{-N_p+1} + k\Delta t$, for $k \in \{0, \dots, (N_f + N_p - 1)P_t\}$ (as needed by the TRM RNN). These reaction rates are then used in the TRM RNN described above and yield estimates for the flux of vehicles at the same time steps. In particular, since the TRM is nothing but the discretization of the continuous traffic model (1), the TRM RNN then recreates, under this traffic model, the traffic dynamics during the time scope of the observations (i.e. between $t_{-N_p+1}$ and $t_0$) and extends them to future times (i.e. $t_1$ to $t_{N_f}$) to yield predictions.
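The piecewise-constant completion of the reaction rates can be sketched as a zero-order hold, as in the Python snippet below. The function name and the toy values are assumptions; the only property taken from the text is the length of the resulting sequence.

```python
def hold_rates(coarse_rates, p_t):
    """Repeat each rate vector over p_t fine time steps, keeping the final one once."""
    if not coarse_rates:
        return []
    fine = []
    for rates in coarse_rates[:-1]:
        # The rates stay constant until the next observation or prediction time.
        fine.extend(list(rates) for _ in range(p_t))
    fine.append(list(coarse_rates[-1]))
    return fine


# Assumed toy setting: N_p = 3 past times, N_f = 2 future times, P_t = 4, 2 interfaces.
coarse = [[0.10, 0.20], [0.15, 0.25], [0.20, 0.30], [0.25, 0.35], [0.30, 0.40]]
fine = hold_rates(coarse, p_t=4)
print(len(fine))  # (N_f + N_p - 1) * P_t + 1 vectors
```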
In the end, our proposed architecture takes as input a sequence of past flux observations, and outputs a re-estimation of these observations under the assumption that they arise from the traffic flow model (1), and then uses this same model to extend the traffic dynamics to future times, hence yielding the flux predictions. In particular, when the observations consist of noisy (raw) flux measurements, their re-estimation by the TRM RNN can be seen as a smoothing of the measurements. This smoothing is a consequence of the smooth nature of the TRM as a finite volume discretization of a PDE, and was already noted in [25]. Hence, it has a physical interpretation as resulting from the traffic flow model (1), and is integrated into the learning process. We sum up in Figure 3.2 the architecture used for flux predictions.
The number of free parameters to learn, as well as the sizes of the input, output and (when applicable) state vectors or sequences of the different networks involved in our prediction approach, are presented in Table 3.1. These parameters are determined through training, in a supervised learning setting: a cost function evaluating the discrepancy between "ideal" outputs of the network and the actual outputs obtained is minimized to determine a set of optimal parameters.
Figure 3.2. The circles stand for road interfaces: they are green if an observation or estimate is available, and black otherwise. The data, which consists of a sequence of flux observations, is fed to a LSTM RNN (the Extractor) tasked with estimating the reaction rates that give rise to the flux observations. These reaction rates are then extended to future time steps using a second RNN, the Predictor. All these reaction rates then serve as input of a TRM RNN which produces a smoothed version of the observed traffic states, and extends them to future time steps. The blocks MLP 1 and MLP 2 correspond to the multilayer perceptrons used to initialize respectively the Extractor and the TRM.

In practice, we perform training as follows. We assume that we have at our disposal a "long" history of flux observations. From these observations, we can form a large set of so-called training examples, which consist of pairs $(F, \Phi)$ where
• $F$ is a sequence of $N_p$ consecutive flux observations denoted by $F = (f_{-N_p+1}, \dots, f_0)$ and corresponding to some consecutive observation times $t_{-N_p+1}, \dots, t_0$;
• $\Phi$ is a sequence of $N_p + N_f$ consecutive flux observations denoted by $\Phi = (f_{-N_p+1}, \dots, f_{N_f})$ and corresponding to some consecutive observation times $t_{-N_p+1}, \dots, t_0, \dots, t_{N_f}$; note in particular that the first $N_p$ elements of $\Phi$ are the same as the elements of $F$.
In an example $(F, \Phi)$, $F$ is used as an input of the flux prediction network $N$ that we are training, and $\Phi$ is used as the ideal output we would like to get from $N$.
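The construction of training examples from a long history of observations can be sketched with a sliding window, as below. This is an illustration under assumptions: the window is moved by one observation at a time (the stride is not specified in the text), and the function and variable names are hypothetical.

```python
def make_examples(flux_history, n_past, n_future):
    """Build (input, target) pairs from consecutive flux observations.

    Each target holds n_past + n_future consecutive observations and each
    input holds the first n_past observations of its target.
    """
    window = n_past + n_future
    examples = []
    for start in range(len(flux_history) - window + 1):
        target = flux_history[start:start + window]
        inputs = target[:n_past]
        examples.append((inputs, target))
    return examples


# Toy history: 10 observation times, one flux value per observed interface (2 here).
history = [[float(t), float(t) + 0.5] for t in range(10)]
pairs = make_examples(history, n_past=3, n_future=2)
print(len(pairs))
```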
Let us denote by $\hat{\Phi} = N(F, \theta) = (\hat{f}_{-N_p+1}, \dots, \hat{f}_{N_f})$ the output of $N$ with a set of parameters $\theta$ and when inputting $F$, and let us consider a set of $N_{ex}$ examples $\{(F^{(1)}, \Phi^{(1)}), \dots, (F^{(N_{ex})}, \Phi^{(N_{ex})})\}$. We determine the optimal set of parameters $\theta$ associated with these examples by minimizing a cost function $L$ made of three terms: a term measuring, over the examples, the discrepancy between $\Phi$ and $\hat{\Phi}$ at the observation times $t_{-N_p+1}, \dots, t_0$, a term measuring this discrepancy at the prediction times, and a regularization term $R(\theta)$. In this cost function, discrepancies are measured with the Euclidean norm $\|\cdot\|$ of vectors.

Table 3.1. Sizes and number of parameters.

Note that the first two terms defining this cost function have very particular roles: the first one assesses the capacity of the TRM to recreate the dynamics of the observations, and the second one assesses its capacity to extend them to make predictions. As for the regularization term, its role is to counteract the tendency the TRM might have to overfit the data through physically unsound reaction rates, as investigated in [25]. As proposed in [25], a regularization term limiting the spatial variations of these reaction rates can be used for this purpose: we hence choose $R(\theta)$ to be a term penalizing the spatial variations of the reaction rates.

Numerical experiments
In the numerical experiment, we consider only a subsection of the road of length ≈ 8200 m. This stretch of road and the position of the detectors are represented in Figure 4.1, where we can notice that the detectors are approximately separated by (multiples of) 150 m. This allows us to discretize the road into segments of length $\Delta x = 150$ m and have the detectors located at the interfaces of these segments (cf. Figure 4.1). Note in particular that the considered stretch of road contains an off-ramp (which starts at the location of the interface 24) and an on-ramp (which ends at the location of the interface 31). Nevertheless, we do not explicitly include them in the TRM as we have no information on the in- and out-fluxes of vehicles they generate.
Starting from the Dutch motorway dataset, we create a training dataset from the measurements made in February, March and April (totaling 84 days), then use the measurements made in the first 5 days of May to create a validation dataset (used to choose the hyperparameters of the algorithm), and finally use the measurements made on the following 5 days as a test set (used to check the performance of the algorithm). These measurements are used as is, without any smoothing or preprocessing (besides a spatial linear interpolation when a sensor stopped working). Besides, while training the algorithm, we hide the data from the detectors marked as 4 and 7 in Figure 4.1.

We consider the problem of predicting the flux at a road interface up to 10 minutes in the future. Taking the notations introduced in Section 3.3, we therefore consider the architecture yielding a prediction $N_f = 10$ time steps in the future. The algorithm will then return both a smooth estimate of the most recent measurement it received as an input, as well as predictions of what the flux might be 1, 2, . . ., 10 time steps in the future (one time step being equal to 1 minute).

4.2. Analysis of the results. In order to determine the number of past observations $N_p$, we perform a quick grid search: we train the algorithm with $N_p \in \{5, 7, 10, 12, 15, 18, 20\}$ and choose the value of $N_p$ yielding the smallest root-mean-square error (RMSE) and mean absolute percentage error (MAPE) over the validation dataset. These errors are reported in Table 4.1. Based on these results, we take $N_p = 15$ as this value yields small values of both RMSE and MAPE. Once the number of past observations is chosen, we apply the associated trained algorithm to the test dataset to assess its performance. In particular, all subsequent errors will be given in terms of MAPE in order to ease the comparison with existing methods carried out in Section 4.3.
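For reference, the two error measures used in the grid search can be computed as in the following Python sketch. It assumes the usual definitions (in particular a MAPE expressed in percent and undefined when a measurement is zero); the function names and the toy values are hypothetical.

```python
import math


def rmse(predictions, measurements):
    """Root-mean-square error between two sequences of equal length."""
    if len(predictions) != len(measurements) or not measurements:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predictions, measurements)]
    return math.sqrt(sum(squared) / len(squared))


def mape(predictions, measurements):
    """Mean absolute percentage error, in percent (assumed standard definition)."""
    if len(predictions) != len(measurements) or not measurements:
        raise ValueError("sequences must be non-empty and of equal length")
    if any(m == 0 for m in measurements):
        raise ValueError("MAPE is undefined when a measurement is zero")
    ratios = [abs(m - p) / abs(m) for p, m in zip(predictions, measurements)]
    return 100.0 * sum(ratios) / len(ratios)


# Toy values (assumed), in vehicles per minute.
print(rmse([55.0, 62.0, 48.0], [50.0, 60.0, 52.0]))
print(mape([55.0, 62.0, 48.0], [50.0, 60.0, 52.0]))
```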
First, we look at how good the predictions 3, 5 and 10 minutes ahead are. Examples of such predictions are presented in Figures 4.2(a) to 4.2(d), where we show, for various road interfaces along the road, a comparison between the 3-, 5- and 10-step predictions and the corresponding measurements obtained on May 10th (which is one of the days used in the test dataset). We also compute, over all the measurements in the test dataset and considering only the road interfaces used in the training phase, the MAPE between predictions and measurements (also called prediction MAPE). We compare it to the MAPE obtained when we use the last known measurement (i.e. the measurement at time $t_0$) as a predictor for the flux in the future (i.e. at time $t_3$, $t_5$ or $t_{10}$). These errors are reported in Figure 4.2(e). We note that the TRM predictions show an improvement of 22% to 27% compared to the case where the last known measurement is used as predictor.
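The baseline used in this comparison, which repeats the last known measurement, can be sketched in a few lines of Python. The function names and the toy flux values below are hypothetical; the sketch only shows how such a baseline MAPE and the relative improvement over it might be computed.

```python
def persistence_mape(series, horizon):
    """MAPE (in percent) of predicting series[t + horizon] by series[t]."""
    terms = []
    for t in range(len(series) - horizon):
        last_known, future = series[t], series[t + horizon]
        if future != 0:  # skip zero flux to avoid dividing by zero
            terms.append(abs(last_known - future) / abs(future))
    return 100.0 * sum(terms) / len(terms)


def relative_improvement(model_mape, baseline_mape):
    """Relative MAPE reduction (in percent) of the model over the baseline."""
    return 100.0 * (baseline_mape - model_mape) / baseline_mape


# Toy per-minute flux values at one interface (hypothetical numbers).
flux = [100.0, 110.0, 100.0, 110.0, 100.0]
baseline_3 = persistence_mape(flux, horizon=3)
```

With this convention, a model MAPE of 15% against a baseline MAPE of 20% corresponds to a relative improvement of 25%.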
We then take a closer look at how these prediction MAPEs are distributed along the road. To do so, we compute, on each road interface, the MAPE between the TRM predictions and the measurements, and display these errors in Figure 4.3(a). We note that the errors are stable across the interfaces, and therefore that there is no major difference in the quality of the predictions across space. In particular, the errors obtained on the two hidden interfaces (namely, the interfaces 4 and 7) are similar to the errors obtained on the surrounding interfaces that were used during training. This suggests that the algorithm is stable with respect to missing data, and therefore that it is capable of completing the data and yielding trustworthy predictions on missing interfaces. Note also that the errors are slightly higher in the middle of the road. This is expected since this is where the model is most constrained by both the TRM dynamics (which propagate free space from right to left and occupied space from left to right) and the measurements, and therefore has to balance them as best it can.
We also give in Figures 4.3(b) to 4.3(d) the overall MAPE between predictions and measurements obtained on observed and hidden interfaces, and the distribution of these errors. In both cases, and for all three prediction horizons, 75% to 80% of the errors are below 20%. The remaining "extreme" errors could be explained by the discrepancy between the TRM, which yields flux predictions based on relatively smooth traffic dynamics (as modeled by the macroscopic traffic model), and the high level of noise in the data. This fact is already noticeable on the prediction plots given in Figures 4.2(a) to 4.2(d), but also when plotting space-time plots of measurements and predictions as in Figure 4.5. We confirm this by showing in Figure 4.4 histograms comparing the distributions of measurements and predictions obtained on the test dataset: it is clear that the fat tails of the measurement distribution no longer appear in the prediction distributions.
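The kind of summary reported here (share of errors below 20%, computed separately on observed and hidden interfaces) can be illustrated with the following Python sketch. The function names and the error values are hypothetical and serve only to show the bookkeeping; they are not results from the test dataset.

```python
def share_below(errors, threshold=20.0):
    """Fraction of absolute percentage errors strictly below `threshold`."""
    return sum(1 for e in errors if e < threshold) / len(errors)


def split_by_group(errors_by_interface, hidden):
    """Pool per-interface errors into observed and hidden groups."""
    groups = {"observed": [], "hidden": []}
    for interface, errors in errors_by_interface.items():
        key = "hidden" if interface in hidden else "observed"
        groups[key].extend(errors)
    return groups


# Hypothetical absolute percentage errors (in %), keyed by interface index.
toy_errors = {
    2: [5.0, 12.0, 31.0, 8.0],
    4: [9.0, 22.0, 14.0, 18.0],
    9: [3.0, 19.0, 45.0, 7.0],
}
groups = split_by_group(toy_errors, hidden={4, 7})
shares = {name: share_below(errs) for name, errs in groups.items()}
```

Similar shares in both groups, as in this toy case, are what one would expect if the hidden interfaces are predicted about as well as the observed ones.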
This raises the question of the level of smoothing introduced by our approach. Recall indeed that our algorithm yields predictions by recreating, under a macroscopic traffic model, the traffic dynamics as seen in the past observations (at times $t_{-14}, \dots, t_0$) and extending them to future times ($t_1, \dots, t_{10}$). In particular, at $t_0$, a smoothed version of the most recent observation is available: we refer to it as the smoothed flux. This smoothed flux is for instance represented in some particular cases in Figures 4.2(a) to 4.2(d).

Comparing the measurements with these smoothed fluxes, note first in the histograms in Figure 4.6(a) that, as expected, the smoothed fluxes present lighter tails than the measurements. Besides, as seen in Figure 4.6(b), when taking the difference between the smoothed fluxes and the measurements, we obtain errors with a symmetric distribution around zero: hence the smoothing does not tend to overestimate or underestimate the fluxes. When comparing the distribution of these errors to a normal distribution (cf. QQ-plot in Figure 4.6(c)), we can see that most of the errors (98% of them) can be compared to Gaussian errors, but the remaining ones tend to take more extreme values than Gaussian errors would. Hence, we have a non-Gaussian distribution of smoothing errors, with fatter tails than a Gaussian distribution. These fat tails reflect the ability of the algorithm to propose smoothed fluxes that are robust to outliers in the measurements.
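The tail comparison underlying such a QQ-plot can be illustrated numerically. The Python sketch below standardizes a set of residuals (adjusting them in mean and variance) and compares the fraction of extreme values with what a standard normal distribution would give. The function names, the threshold of 3 standard deviations and the toy residuals are assumptions of this example, not quantities taken from the experiment.

```python
import math
import statistics


def standardize(residuals):
    """Center residuals and scale them to unit (population) variance."""
    mu = statistics.fmean(residuals)
    sigma = statistics.pstdev(residuals)
    return [(r - mu) / sigma for r in residuals]


def gaussian_tail_prob(z):
    """P(|Z| > z) for a standard normal variable Z."""
    return math.erfc(z / math.sqrt(2.0))


def empirical_tail_prob(standardized, z):
    """Fraction of standardized residuals with absolute value above z."""
    return sum(1 for v in standardized if abs(v) > z) / len(standardized)


# Toy residuals: mostly small values plus two symmetric outliers (hypothetical).
residuals = [0.5, -0.5] * 49 + [10.0, -10.0]
z_scores = standardize(residuals)
heavy_tailed = empirical_tail_prob(z_scores, 3.0) > gaussian_tail_prob(3.0)
```

In this toy case the residuals are symmetric around zero, yet 2% of them lie beyond 3 standard deviations, far more than a Gaussian distribution would predict, which is the qualitative pattern described above.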
In conclusion, our algorithm yields smoothed fluxes that are physics-aware (since they arise from the macroscopic traffic model) and well-behaved (since they are robust to outliers and unbiased). Besides, these smoothed fluxes are available on all of the interfaces dividing the road, and not only at the locations where measurements were taken. If we now use these smoothed fluxes as ground truth when computing prediction MAPEs (instead of using the raw flux measurements), we get errors that are drastically reduced compared to the case where we compare our predictions to the raw measurements (cf. Figure 4.7(a)): indeed, reductions ranging from 45% to 55% can be observed on the considered prediction horizons (when considering in both cases only the observed interfaces). When looking at the spatial distribution of these errors (cf. Figure 4.7(b)), we again find that they are higher further away from the boundaries, just like the prediction errors. Besides, the errors peak at the interfaces where data are observed, which can be seen as a sign that the algorithm locally modifies the dynamics to match the data recorded there (whereas only the dynamics of the traffic model play a role on the other interfaces). These observations are consistent with the fact that our algorithm was optimized to extend to the future the dynamics that gave rise to the smoothed flux: hence, the predictions obtained by our algorithm can be seen as accurate predictions of a physics-aware smoothing of the raw measurements.

4.3. Comparison to existing methods. Comparing the prediction performance obtained in the numerical experiment with other works is no easy task. Indeed, for the comparison to be fair, our algorithm should be compared to approaches that take as input the same type of information, namely only past raw flux measurements taken at a relatively high frequency (every minute in our case). Besides, the choice of the metric used to compare different studies is not straightforward. Indeed, prediction errors are computed by comparing the prediction with what is assumed to be the ground truth. However, most studies apply some preprocessing to the raw data (e.g. aggregation or smoothing) and use the result as ground truth. Since these steps usually differ from one study to the other, and are not present at all in ours, comparing the performances becomes complicated. Finally, the amount of parameterization in a model and the amount of data used to train it play a central role in the performance of an algorithm, and are hard to compare from one approach to the other.

Let us set these caveats aside and nevertheless compare the prediction performance of our algorithm with other deep neural network approaches. As noted in the review article [7], such approaches were able to yield prediction MAPEs ranging from 5% to 10%. Compared to the MAPEs we computed from the raw data (cf. Figure 4.2(e)), our algorithm seems to yield higher errors. However, compared to the MAPEs we computed from the smoothed fluxes (cf.

Conclusion
In this work, we propose a physics-aware algorithm for short-term flux predictions based on a macroscopic traffic flow model. This algorithm has two main components. The first one is an RNN which aims at extracting the space-time varying parameters of the traffic flow model associated with a set of past observations and extending them to future time steps. These parameters are then passed to the second component, which is a TRM discretization of the traffic flow model: we then obtain flux estimates at past and future time steps that arise directly from the traffic flow model. A numerical experiment, conducted on a dataset of raw flux measurements taken along a stretch of freeway in the Netherlands, highlights the advantages of our approach.
First, the algorithm outputs a re-estimation of the most recent measurement it got as input, which can be used as a smooth estimate of the traffic state. This is particularly desirable when dealing with raw data, for which choosing and justifying the right aggregation or smoother to reduce the noise can be cumbersome. In our case, the noise reduction is integrated with the prediction and is physically grounded in the macroscopic traffic flow model. As for the predictions, they arise from the same traffic model and are therefore a simple continuation in time of the dynamics inferred from the past observations. Hence both the smoothing of raw data and the subsequent predictions have a clear physical interpretation. Finally, the algorithm yields smooth flux estimates and predictions everywhere on the road (and not only at the locations where measurements are made). Hence our algorithm also serves as a data augmentation tool. Future work includes the extension of the approach to networks, which can be made possible by extending the TRM part of the architecture to networks.

Multilayer perceptron: $S_1, \dots, S_n$ denote single layer perceptrons, $h^{(1)}, \dots, h^{(n-1)}$ denote hidden layers, and circles are used to denote input and/or output vectors.

Figure 2.1. Representation of the Traffic Reaction Model interpretation of traffic flow.

Figure 3.2. Detail of the architecture used for flux predictions. The circles stand for road interfaces: they are green if an observation or estimate is available, and black otherwise. The data, which consists of a sequence of flux observations, is fed to an LSTM RNN (the Extractor) tasked with estimating the reaction rates that give rise to the flux observations. These reaction rates are then extended to future time steps using a second RNN, the Predictor. All these reaction rates then serve as input of a TRM RNN which produces a smoothed version of the observed traffic states, and extends them to future time steps. The blocks MLP 1 and MLP 2 correspond to the multilayer perceptrons used to initialize respectively the Extractor and the TRM.

4.1. Setup. The code used in this section was implemented in Python, using TensorFlow and Keras (2.4.0). We use a dataset which consists of measurements of flux made by detectors placed along a stretch of freeway A12 in the Netherlands (courtesy of Rijkswaterstaat - Centre for Transport and Navigation). This data was collected (almost) every day from January 5, 2006 to May 19, 2006. Each day, the flux and speed of vehicles at each station were recorded every minute, for a total duration of 5 hours.
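To make the shape of the data concrete, the following Python sketch slices one day of per-minute measurements (5 hours, hence 300 time steps) into pairs of past observations and future targets. The window lengths 15 and 10 are the values of $N_p$ and $N_f$ used in the numerical experiment; the function name `make_windows`, the use of overlapping windows and the restriction of each window to a single day are assumptions of this example, since the exact slicing is not described here.

```python
def make_windows(day_series, n_past, n_future):
    """Slice one day of per-minute measurements into (past, future) pairs."""
    windows = []
    last_start = len(day_series) - (n_past + n_future)
    for start in range(last_start + 1):
        past = day_series[start:start + n_past]
        future = day_series[start + n_past:start + n_past + n_future]
        windows.append((past, future))
    return windows


# Placeholder for 5 hours of per-minute flux values at one detector.
day = list(range(300))
pairs = make_windows(day, n_past=15, n_future=10)
n_windows = len(pairs)  # 300 - (15 + 10) + 1 windows per day
```

Under these assumptions, each day contributes 276 training samples per detector, each made of 15 past values and 10 future values.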

Figure 4.1. Road and detectors. The red ticks on the horizontal axis mark the actual position of the detectors on the road. Below this axis, we represent the road discretization we consider, obtained by dividing the road into segments of 150 m: we use continuous red lines for the road interfaces for which data are available, and black dashed lines for the others.

While training, the data from the detectors marked as 4 and 7 in Figure 4.1 are hidden in order to assess the capacity of the algorithm to complete and predict the flux at unobserved interfaces. In the end, we have $N_o = 15$ observed interfaces (those marked 0, 2, 9, 12, 16, 20, 24, 27, 31, 35, 38, 42, 47, 51 and 54 in Figure 4.1), for a total of $N_i = 55$ road interfaces.

Figure 4.2. Measured, smoothed and predicted fluxes on various road interfaces on May 10th, and overall MAPE between predictions and measurements on the test dataset (using only observed road interfaces).
The smoothed flux shown in Figures 4.2(a) to 4.2(d) represents the current traffic state as seen by the macroscopic traffic model. We perform in Figure 4.6 a comparison between the measurements and these smoothed fluxes.
Distribution of MAPE values on hidden interfaces.

Figure 4.3. MAPE of TRM predictions made on the test dataset, on observed and hidden road interfaces.

Figure 4.4. Histogram comparison between all the available measurements (i.e. on both observed and hidden interfaces) and predictions (i.e. on all 55 interfaces composing the road) of the test dataset.

Figure 4.5. Comparison between the space-time variations of the measurements and predictions on May 10th. The indices marked in red correspond to the observed interfaces, and those in black to the hidden interfaces.
QQ-plot of the difference between the smoothed fluxes and the measurements (adjusted in mean and variance) with respect to the standard normal distribution. 98% of the points lie between the two blue lines.
MAPE on each road interface between the predicted and smoothed fluxes.The red lines mark the observed interfaces.The indices marked in red also correspond to the observed interfaces, and those in black to the hidden interfaces.

Figure 4.7. Comparison between the predictions and the smoothed fluxes computed on the test dataset.

Table 4.1. Grid search results: prediction RMSE and MAPE on the validation dataset when predicting up to $N_f = 10$ time steps ahead using $N_p$ past observations.