Physics-Informed Deep Learning For Traffic State Estimation: A Survey and the Outlook

For its robust predictive power (compared to pure physics-based models) and sample-efficient training (compared to pure deep learning models), physics-informed deep learning (PIDL), a paradigm hybridizing physics-based models and deep neural networks (DNN), has been booming in science and engineering fields. One key challenge of applying PIDL to various domains and problems lies in the design of a computational graph that integrates physics and DNNs, in other words, how physics is encoded into DNNs and how the physics and data components are represented. In this paper, we present a variety of architecture designs of PIDL computational graphs and show how these structures are customized to traffic state estimation (TSE), a central problem in transportation engineering. When observation data, problem type, and goal vary, we demonstrate potential architectures of PIDL computational graphs and compare these variants using the same real-world dataset.

Impact Statement: Despite the benefits of physics-informed deep learning (PIDL) for robust and efficient pattern learning, how to inject physics into neural networks (NN) remains underexplored and varies case by case and across domains. We summarize and establish a systematic design pipeline for hybrid computational graphs that facilitates the integration of physics and NNs. The insights into the modular design of the hybrid PIDL paradigm, as well as the established visualization tool, will not only be useful to guide transportation researchers in pursuing PIDL, but can also help researchers at large to better understand and decompose a PIDL problem when applied to their own application domains. It will hopefully open up the conversation across domains about establishing a unified PIDL paradigm.

I. INTRODUCTION
PHYSICS-informed deep learning (PIDL) [1], also named "theory-guided data science" [2], "model-informed machine learning" [3], or "physics-informed machine learning" [4], has gained increasing traction in various scientific and engineering fields. Its underlying rationale is to leverage the pros of both physics-based and data-driven approaches while compensating for the cons of each. The physics-based approach refers to scientific hypotheses about what underlying physics govern observations, such as first principles. Normally, scientists or engineers first come up with a prior assumption of how a quantity of interest is computed from other physical quantities. Then laboratory or field experiments are designed to collect data that are used to calibrate the involved parameters. In contrast, the data-driven approach does not bear any prior knowledge of how things work and how different quantities are correlated. Instead, it relies on machine learning (ML) techniques such as deep learning (DL) to learn and infer patterns from data. The former is data-efficient and interpretable but may not generalize to unobserved data, while the latter is generalizable at the cost of relying on huge amounts of training samples and may be incapable of offering deductive insights. Thus, the PIDL paradigm opens up a promising research direction that leverages the strengths of both physics-based and data-driven approaches. Fig. 1 summarizes the amounts of data (on the x-axis) and scientific theory (on the y-axis) used for each paradigm. Model-based approaches heavily rely on scientific theory discovered from domain knowledge and little data for system discovery, while machine learning approaches mostly rely on data for mechanism discovery. In contrast, PIDL employs a small amount of data (i.e., "small data" [4]) for pattern discovery while leveraging a certain amount of scientific knowledge to impose physically consistent constraints.
The "sweetspot" of PIDL lies in the small data and partial knowledge regime. In other words, PIDL achieves the best performance in accuracy and robustness with small data and partial domain knowledge. In this paper, we will validate the advantage of PIDL using transportation applications, and present a series of experiments using the same real-world dataset against conventional physics-based models.  Fig. 1: Comparison of the pure physics based, the data-driven, and the hybrid paradigms (adapted from [2]).
There could be a variety of structural designs for PIDL. Here we primarily focus on the PIDL framework represented by a hybrid computational graph (HCG) consisting of two graphs: a physics-informed computational graph (PICG) and a physics-uninformed neural network (PUNN). Despite the numerous benefits of PIDL, there are still many open questions that need to be answered, including the construction of the HCG, the choice of the physics and the architecture of the PICG, the architecture of the PUNN, the loss function, and the training algorithms. Among them, how to encode physics into (deep) neural networks (DNN) remains under-explored and varies case by case and across domains. In this paper, we will primarily establish a systematic design pipeline for hybrid computational graphs that facilitates the integration of physics and DL. To this end, we will review the state-of-the-art of the PIDL architecture for the traffic state estimation (TSE) problem. This survey paper can be used as a guideline for researchers at large when they consider using PIDL for problems at hand.
We have seen a growing number of studies that apply PIDL to physical and biological systems [4], [5]. However, its feasibility for social systems, in particular human decision-making processes such as driving behaviors, remains largely unexplored. Human decision-making involves complex cognitive processes compounded by perception errors, noisy observations, and output randomness. Moreover, human driving behaviors exhibit highly unstable and nonlinear patterns, leading to stop-and-go traffic waves and traffic congestion [6]-[8]. Accordingly, neither the model-driven nor the data-driven approach alone suffices to predict such behaviors with high accuracy and robustness. Thereby, we strongly believe that a hybrid method, which leverages the advantages of both model-driven and data-driven approaches, is promising [9], [10].
There is a vast amount of literature on TSE [11], [12] and on PIDL [4], [5], respectively. To distinguish this paper from other survey papers on TSE, we primarily focus on data-driven approaches, PIDL in particular. Since PIDL for TSE is less studied than physics-based models and the existing literature focuses on single roads, we will primarily examine TSE along links and leave TSE on a road network, which includes both link and node models, for future research. To distinguish this paper from other PIDL surveys, we primarily focus on the modular design of the hybrid PIDL paradigm and show how to customize various designs for accurate and robust identification of traffic dynamic patterns. Here, modular design refers to the architecture design of each component in the graph and how these components are wired; in other words, how physics laws are injected into DNNs. The generic architecture of a PIDL consists of two computational graphs: one DNN (i.e., the data-driven component) for predicting the unknown solution, and the other (i.e., the physics-driven component), in which physics are represented, for evaluating whether the prediction aligns with the given physics. The physics-encoded computational graph can be treated as a regularization term for the deep neural network to prevent overfitting, i.e., high variance. In summary, the hybrid of both components overcomes the high bias and high variance induced by each individual one, rendering it possible to leverage the advantages of both the physics-based and data-driven methods in terms of model accuracy and data efficiency.
Application of PIDL to TSE is a relatively new area. We hope that the insights into the modular design of the hybrid PIDL paradigm as well as the established visualization tool will not only be useful to guide transportation researchers to pursue PIDL, but also facilitate researchers at large to better understand a PIDL pipeline when applied to their own application domains.
Overall, this paper offers a comprehensive overview of the state-of-the-art in TSE using PIDL, while striving to provide insights into the pipeline of implementing PIDL, from architecture design to training and testing. In particular, we 1) propose a computational graph that visualizes both physics and data components in PIDL; 2) establish a generic way of designing each module of the PIDL computational graphs for both prediction and uncertainty quantification; and 3) benchmark the performance of various PIDL models using the same real-world dataset and identify the advantage of PIDL in the "small data" regime. The rest of the paper is organized as follows: Section II introduces the preliminaries of TSE and its state-of-the-art. Section III-A lays out the framework of PIDL for TSE. Two types of problems for TSE, namely deterministic prediction and uncertainty quantification, are detailed in Sections III-IV, respectively. Section V concludes our work and projects future research directions in this promising arena.

II. PRELIMINARIES AND RELATED WORK
A. PIDL

Definition II.1 (Generic framework for physics-informed deep learning). Define location x ∈ [0, L] and time t ∈ [0, T], where L, T ∈ R+. The spatiotemporal (ST) domain of interest is the continuous set of points D = {(x, t) | x ∈ [0, L], t ∈ [0, T]}. Denote the state as s and its observed quantity as ŝ. Denote the (labeled) observations O, B, I and the (unlabeled) collocation points C as:

O = {(x^(i), t^(i); ŝ^(i)) | i = 1, ..., N_o},
B = {(x^(i_b), t^(i_b); ŝ^(i_b)) | i_b = 1, ..., N_b},
I = {(x^(i_0), 0; ŝ^(i_0)) | i_0 = 1, ..., N_0},
C = {(x^(j), t^(j)) | j = 1, ..., N_c},

where i and j are the indices of observation and collocation points, respectively, and i_b, i_0 are the indices of boundary and initial data, respectively. The numbers of observed data, boundary conditions, initial conditions, and collocation states are denoted as N_o, N_b, N_0, N_c, respectively. We design a hybrid computational graph (HCG) consisting of two computational graphs: (1) a PUNN, denoted as f_θ(x, t), to approximate the mapping from (x^(i), t^(i)) to s^(i), and (2) a PICG, denoted as f_λ(x, t), for computing traffic states s^(j) on collocation points. In summary, a general PIDL model, denoted as f_θ,λ(x, t), trains an optimal parameter set θ* for the PUNN and an optimal parameter set λ* for the physics. The PUNN parameterized by θ* can then be used to predict a traffic state ŝ_new on a new set of observed ST points O_new ⊆ D, and λ* gives the most likely model parameters that describe the intrinsic physics of the observed data.
One important application of PIDL is to solve generic partial differential equations (PDE). We briefly introduce the underlying rationale. Define a PDE over the ST domain as:

s_t + N_x(s) = 0, (x, t) ∈ D,
B(s) = 0, (x, t) ∈ ∂D,
I(s) = 0, t = 0,

where N_x(·) is a nonlinear differential operator, B, I are the boundary and initial condition operators, respectively, and ∂D is the boundary of the domain D. We then define a residual function r_c(x, t) := (f_θ)_t + N_x(f_θ) by substituting the PUNN approximation f_θ into the PDE. If the PUNN is well trained, the residual should be as close to zero as possible.

Fig. 2: Schematic of PDE solution approximation.

Physics are usually encoded as governing equations, physical constraints, or regularization terms in loss functions when training PUNNs [13]. When "soft regularization" is implemented [14], the loss of PIDL is a weighted sum of two distance measures: one for the distance between the observed quantity and the predicted one (aka. data discrepancy) and the other for the distance between the quantity computed from physics and the predicted one (aka. physics discrepancy). Its specific forms will be detailed in Secs. III-IV.
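As a concrete illustration, the soft-regularization loss above can be sketched as a weighted sum of the two discrepancies. This is a minimal sketch: the function name, the weights alpha and beta, and the plain-list representation are ours, not from the paper.

```python
def pidl_loss(pred_obs, obs, residuals, alpha=1.0, beta=1.0):
    """Soft-regularization PIDL loss: a weighted sum of the data
    discrepancy (prediction vs. observation on observed points) and the
    physics discrepancy (mean squared PDE residual on collocation points)."""
    data_loss = sum((p - o) ** 2 for p, o in zip(pred_obs, obs)) / len(obs)
    physics_loss = sum(r ** 2 for r in residuals) / len(residuals)
    return alpha * data_loss + beta * physics_loss
```

In practice both terms are computed on tensors inside an automatic-differentiation framework so the residuals themselves carry gradients, but the structure of the objective is exactly this two-term sum.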

B. Traffic state estimation
Traffic state inference is a central problem in transportation engineering and serves as the foundation for traffic operation and management.
Definition II.2. Traffic state estimation (TSE) infers traffic states in single-lane or multi-lane settings along a stretch of highway or arterial segments over a period of time, represented by traffic density (veh/km/lane, denoted by ρ), traffic velocity (km/hour, denoted by u), and traffic flux or volume (veh/hour/lane, denoted by q), using noisy observations from sparse traffic sensors [12]. The ultimate goal of TSE is traffic management and control building on the inferred traffic states.
Remark. 1) The three traffic quantities are connected via a universal formula:

q = ρu. (4)

Knowing two of them automatically derives the third. Thus, in this paper, we will primarily focus on ρ and u; q can then be derived via Eq. 4.
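A sketch of this identity, with units as defined above (the function names are ours):

```python
def flux_from(density, speed):
    """q (veh/hour/lane) = rho (veh/km/lane) * u (km/hour)."""
    return density * speed

def speed_from(density, flux):
    """u = q / rho, valid whenever rho > 0."""
    return flux / density
```

For example, a density of 40 veh/km/lane moving at 60 km/hour yields a flux of 2400 veh/hour/lane.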
Here, q_max is the capacity of a road, ρ_jam is the jam density (i.e., the bumper-to-bumper traffic scenario), and u_max is the maximum speed (usually represented by the speed limit). How to calibrate these parameters is shown in Tab. I. TSE is essentially end-to-end learning from the ST domain to labels. Denote by f_θ(·) a mapping, parameterized by θ, from an ST point (x, t) ∈ D to traffic states s = {ρ, u}: s = f_θ(x, t). The key question is to find the set of parameters θ* and the functional form f_θ(·) that best fit observational data.
Remark. TSE can be implemented using supervised learning as presented in this paper, where the physics-based models are used to regularize the training of the data-driven models. TSE can also be formulated as unsupervised learning, such as matrix/tensor completion that estimates unknown traffic states [15], [16]. Instead of using physics-based models, matrix/tensor completion methods use prior knowledge such as the low-rank property to regularize the estimation. We would like to pinpoint here that such prior knowledge regularization can also be integrated into our framework by including the rank of the predicted traffic state matrix in the PIDL loss function.
Traffic sensors range from conventional ones placed on roadside infrastructure (in Eulerian coordinates), such as inductive loop detectors and roadside closed-circuit television (CCTV) or surveillance cameras, to in-vehicle sensors (in Lagrangian coordinates), including Global Positioning System (GPS) receivers, onboard cameras, LiDARs, and smartphones. The emerging traffic sensors mounted on connected and automated vehicles (CAVs) are expected to generate terabytes of streaming data [17]; such vehicles can serve as "probe vehicles" or "floating cars" for traffic measurements. Future cities will be radically transformed by the Internet of Things (IoT), which will provide ubiquitous connectivity between physical infrastructure, mobile assets, humans, and control systems [18] via communication networks (e.g., 5G [19], DSRC [20], [21], xhaul fiber networks [22], and edge-cloud [23] and cloud servers). Fig. 3 illustrates the observational data types for TSE. Building on individual vehicle trajectories, we can aggregate velocity and density for each discretized cell and time interval. With the availability of high-resolution multi-modality data, we should also consider developing disaggregate methods for TSE that can directly use individual trajectories or images as inputs.

Fig. 3: Observational data types for TSE [12], including fixed-location sensors (blue hexagons), roadside cameras, and collocation points (black crosses).

In the TSE problem, the physics-based approach refers to the scientific hypotheses about the evolution of traffic flow on the micro-, meso-, and macro-scale, while the ML approach refers to data-driven models that mimic human intelligence using deep neural networks, reinforcement learning, imitation learning, and other advanced data science methods [6].

1) Physics-based models
The field of transportation has been buttressed by rich theories and models tracing back to as early as the 1930s [24]. A large number of theories have since been successfully developed to explain real-world traffic phenomena, to prognose and diagnose anomalies for operation and management, and to make predictions for planning and management.
These models make ample use of scientific knowledge and theories about transportation systems, ranging from closed-form solutions to numerical models and simulations. Transportation models have demonstrated their analytical and predictive power over the past few decades. For example, microscopic car-following models and macroscopic traffic flow models succeed in capturing transient traffic behavior, including shock waves and stop-and-go phenomena.
The first constitutive law that traffic flow needs to satisfy is a conservation law (CL), or transport equation, meaning that inflow equals outflow when there is no source or sink. Mathematically,

ρ_t + (ρu)_x = 0. (6)

A second equation that stipulates the relation between ρ and u can be a fundamental diagram (FD), u = U(ρ), for the first-order traffic flow model, or a moment equation for the second-order traffic flow model (Eq. 7), where U(·) is a fundamental diagram, a mapping from traffic density to velocity, and g(U(ρ)) is a nonlinear function of U(ρ) appearing in the moment equation. A fundamental diagram can also be a mapping from density to volume/flux, as exemplified in Fig. 4, calibrated on a real-world traffic dataset. In the literature, several FD functions have been proposed; interested readers can refer to [31] for a comprehensive survey.
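As one concrete example of an FD, the classical Greenshields diagram takes U(ρ) to be linear in density; it is only one of the FD functions surveyed in [31], and the parameter values below are illustrative, not calibrated.

```python
def greenshields_speed(rho, u_max=100.0, rho_jam=120.0):
    """Greenshields FD: U(rho) = u_max * (1 - rho / rho_jam)."""
    return u_max * (1.0 - rho / rho_jam)

def greenshields_flux(rho, u_max=100.0, rho_jam=120.0):
    """Flux q = rho * U(rho); maximized at the critical density rho_jam / 2."""
    return rho * greenshields_speed(rho, u_max, rho_jam)
```

Plugging u = U(ρ) into the conservation law of Eq. 6 closes the system and yields the first-order LWR model.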
When a traffic flow model is selected as the underlying dynamical system, data assimilation is employed to find "the most likely state", using observations to correct the model prediction; methods include the extended Kalman filter (EKF) [32]-[34], the unscented KF [35], the ensemble KF [36], and the particle filter [37].
To quantify the uncertainty of TSE problems, the model-based approach usually makes a prior assumption about the distribution of traffic states by adding Brownian motion on top of deterministic traffic flow models, leading to Gaussian stochastic traffic flow models. Another school of literature derives intrinsically stochastic traffic flow models with more complex probability distributions [34], [38]-[40]. With a stochastic traffic flow model, a large-population approximation or fluid limit is applied to extract the first and second moments of the stochastic processes to facilitate the application of filtering methods.
2) Data-driven approach

The data-driven approach aims to learn traffic dynamics directly from data. Machine learning extracts knowledge, patterns, and models automatically from large volumes of data. DL, especially DNNs, has revived the interest of the scientific community since 2006 [41]. Using ML for TSE has primarily focused on data imputation leveraging temporal or spatial correlation, including autoregressive integrated moving average [42], probabilistic graphical models [43], k-nearest neighbors [44], principal component analysis [45], [46], and long short-term memory models [47]. The majority of these methods assume that a traffic quantity at a time interval or within a cell depends on its historical or neighboring values, regardless of the physical characteristics of traffic flow. Accordingly, the data-driven approach is not as popular as the model-based one and does not achieve as high an accuracy as the latter [12]. More recent ML techniques aim to model the non-linearity in traffic dynamics, leveraging deep hidden layers together with sparse autoregressive techniques [48] and fuzzy neural networks [49]. Given the complementary advantages of model-based and data-driven approaches, it is natural to consider a hybrid one for TSE.

3) PIDL
In the pioneering works [50], [51], PIDL was proposed as an alternative solver of PDEs. Since its inception, PIDL has become an increasingly popular tool for data-driven solution or discovery of nonlinear dynamical systems in various engineering areas [52]-[56]. While PIDL has increasingly demonstrated its predictive power in various fields, transportation modeling is lagging behind in combining both physics and data aspects.

C. Two classes of problems
The existing literature on PIDL aims to solve two classes of problems: (1) PDE solution inference, and (2) uncertainty quantification. In the next two sections, we will detail these two problems one by one.

III. PIDL FOR DETERMINISTIC TSE
A. PIDL for traffic state estimation (PIDL-TSE)

Definition III.1. PIDL for traffic state estimation (PIDL-TSE) aims to infer the spatiotemporal fields of traffic states by integrating physics-based and deep learning methods.
Define the (labeled) observation sets as O_d, O_p, the boundary and initial observation sets as B, I, and the (unlabeled) collocation point set as C below:

O_d = {(x^(i_s), t^(i_s); ŝ^(i_s)) | i_s = 1, ..., N_os},
O_p = {(X(n, t^(i_m)), t^(i_m); ŝ^(i_m)) | i_m = 1, ..., N_om, n = 1, ..., N_n},
B = {(x^(i_b), t^(i_b); ŝ^(i_b)) | i_b = 1, ..., N_b},
I = {(x^(i_0), 0; ŝ^(i_0)) | i_0 = 1, ..., N_0},
C = {(x^(j), t^(j)) | j = 1, ..., N_c}, (8)

where i_s, i_m are the indices of data collected from stationary and mobile sensors, respectively; i_b, i_0 are the indices of data collected from boundary and initial conditions, respectively; and j is still the index of collocation points. The numbers of stationary sensor data, mobile data, boundary and initial conditions, and collocation points are denoted as N_os, N_om, N_b, N_0, N_c, respectively. The number of mobile trajectories is denoted as N_n. X(n, t^(i_m)) is the n-th vehicle's position at time t^(i_m). Observation data are limited to the times and locations where traffic sensors are placed. In contrast, collocation points C have neither measurement requirements nor location limitations, and are thus controllable.
In the next two sections, we will elaborate on the PIDL-TSE framework, covering the architecture of the HCG and the training methods.

B. Hybrid computational graph (HCG)
The HCG is a tool we have invented to facilitate the visualization of the two components, namely the PUNN and the PICG, and how they are wired. On an HCG, the architectures of the PUNN and the PICG, the loss function used to train the PUNN, and the training paradigm can all be defined visually. A computational graph, establishing mathematical consistency across scales, is a labeled directed graph whose nodes are (un)observable physical quantities representing input information, intermediate quantities, and target objectives. The directed edges connecting physical quantities represent the dependency of a target variable on source variables, each carrying a mathematical or ML mapping from a source to a target quantity. A path from a source to the observable output quantities represents one configuration of a model; establishing a model configuration thus amounts to selecting a path within the HCG [57].

C. Training paradigms
Once the physics model is selected, we need to determine the sequence of parameter calibration: prior to, or during, the training of the PUNN. The former corresponds to solely inferring time-dependent traffic flow fields, and the latter corresponds to system identification of traffic states [51].
1) Sequential training

Sequential training aims to first calibrate the parameters of the PICG (i.e., parameter calibration) and then encode the known physics into the PUNN for training. Parameter calibration has been extensively studied in TSE using non-linear programming [58], genetic algorithms [59], least-squares fitting [60]-[63], and kernel smoothing [64]. The physics-based parameters include ρ_max, u_max, and other nominal parameters. Tab. I summarizes the existing methods for model discovery.
Sequential training is the default paradigm in most PIDL-related studies, with a focus on how to make the training robust and stable when large-scale NNs and complicated physics-informed loss functions are involved. A growing number of works aim to develop more robust NN architectures and training algorithms for PIDL. For example, one can use an adaptive activation function, introducing a scalable hyper-parameter in the activation function at some layers of the PUNN, to improve the convergence rate and solution accuracy [72]. The adaptive activation function has also been used in DeLISA (deep learning based iteration scheme approximation), which adopts the implicit multistep method and Runge-Kutta method as the time iteration scheme to construct the learning loss when training the PUNN [73]. To achieve efficient and stable convergence in the training phase, [74] investigates the training dynamics using neural tangent kernel (NTK) theory and proposes an NTK-guided gradient descent algorithm to adaptively adjust the hyperparameters for each loss component. New algorithms and computational frameworks for improving general PIDL training are currently a popular research area, and we refer readers to [4] by Karniadakis et al. for a detailed survey on this topic.
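The adaptive-activation idea of [72] amounts to placing a trainable scale inside the nonlinearity and optimizing it jointly with the network weights. A minimal scalar sketch (the function name and scalar form are ours; in [72] the scale is a per-layer learnable parameter):

```python
import math

def adaptive_tanh(x, a=1.0):
    """tanh activation with a trainable slope a; tuning a jointly with the
    weights reshapes the activation and can speed up convergence."""
    return math.tanh(a * x)
```

With a = 1 this reduces to the standard tanh used in most PUNNs; larger a sharpens the activation, smaller a flattens it.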

2) Joint training
Physics parameters and hyperparameters of the PUNN and the PICG are updated simultaneously or iteratively in the training process. All the existing literature on PIDL-TSE employs simultaneous updating of all parameters associated with both the PICG and the PUNN, which will be our focus below. However, we would like to pinpoint that there is increasing interest in training both modules iteratively [75], which could be a future direction to improve the training efficiency of PIDL-TSE.

a) Challenges
The PUNN in PIDL is a typical deep learning component to which most training techniques apply. In contrast, the training challenges incurred by the PICG with unknown physics parameters are nontrivial; accordingly, substantial research and additional adaptive efforts are needed.
First, some traffic flow models may include a large number of physical parameters that need to be discovered in TSE, and it is challenging to train all the parameters at once. For example, the three-parameter LWR model in section III-D involves 5 parameters, and it is reported that direct joint training of all the parameters with real-world noisy data leads to unsatisfactory results [71]. For this issue, the alternating direction method of multipliers (ADMM) [76] is an option to improve training stability, i.e., to train one subset of physical parameters at a time with the rest fixed. An advanced ADMM variant, deep learning ADMM (dlADMM), may further address the global convergence problem in nonconvex optimization with faster convergence [77].
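The alternating scheme above can be sketched as a block update: take a gradient step only on an "active" subset of the physics parameters while the rest stay fixed. This captures the spirit of training one subset at a time, not dlADMM itself; all names and values are illustrative.

```python
def block_update(params, grads, active, lr=0.1):
    """One block step: update only the active subset of physics parameters
    (e.g., first {delta, p}, then {sigma, rho_max, eps}), keep others fixed."""
    return {k: v - lr * grads[k] if k in active else v
            for k, v in params.items()}
```

Cycling the active set over the parameter blocks, with the loss gradients recomputed after each block, gives the alternating training loop described in the text.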
Second, a highly sophisticated traffic flow model may contain complicated terms that are unfriendly to differentiation-based learning, making model parameter discovery less satisfactory on real data. In this case, the structural design of the PICG plays an important role in making the framework trainable. Specifically, additional efforts such as variable conversion, decomposition, and factorization need to be made before encoding, so that the structure is learnable and the loss converges. Alternatively, as will be discussed in section III-C2b, one can use an ML surrogate, such as a small NN, to represent the complicated terms in the PICG to avoid tackling them directly [71].

b) ML surrogate

When physics and PUNN are jointly trained, Fig. 5 further demonstrates the flow of simultaneously tuning the parameters in the physics model and the hyperparameters in the PUNN. The top blue box encloses the data-driven component that contributes to the loss function, and the bottom red box encloses the physics-based component. Note that in Fig. 5, we omit details about how ρ and u interact within a PUNN. These two quantities can be generated in sequence or in parallel. When ρ is generated first, u can be derived from an FD that stipulates the relation between ρ and u. Otherwise, one PUNN can be trained to generate both ρ and u altogether, or two PUNNs are trained to generate ρ and u separately. Since ρ, u need to satisfy the CL, but quantities generated by PUNNs cannot guarantee that constraint, ρ, u are then fed into the physics component to impose this constraint. To inject the minimum amount of knowledge, such as the universal CL defined in Eq. 6, into PIDL, and to leave out assumption-based relations such as the FD defined in Eq. 7, we propose an ML surrogate model that replaces the mapping from traffic density to velocity with a DNN and learns the relation between these two quantities purely from data.
A fundamental diagram that establishes a relation between ρ and u, can be treated as an embedded physics component within traffic flow models. According to empirical studies, the relation between ρ and u is generally not a one-to-one mapping, especially in the congested regime. Thus, it is natural to employ an ML surrogate component to characterize the interaction between these two traffic quantities. We can further explore to what extent the addition of surrogates affects the performance and where ML surrogates are appropriate to use.
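A minimal sketch of such a surrogate: a one-hidden-layer network mapping density to speed, standing in for a closed-form U(ρ). The sizes, weights, and function name are illustrative, and in practice the surrogate is trained jointly with the rest of the PIDL loss rather than used with fixed weights.

```python
import math

def fd_surrogate(rho, w1, b1, w2, b2):
    """One-hidden-layer tanh network u = NN(rho) replacing the FD U(rho).
    w1, b1: hidden-layer weights/biases; w2, b2: output weights/bias."""
    hidden = [math.tanh(w * rho + b) for w, b in zip(w1, b1)]
    return sum(v * h for v, h in zip(w2, hidden)) + b2
```

Because the surrogate is unconstrained by any assumed FD shape, it can in principle capture the scatter of the congested regime that a one-to-one U(ρ) cannot.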
Tab. II summarizes the existing studies that employ PIDL for TSE. Leveraging PIDL for the TSE problem was first proposed by Huang et al. [65] and Shi et al. [67], [68], [71], concurrently and independently.
Next, we will demonstrate how to design the architecture of PIDL and the corresponding PICG. We will first present a numerical example to demonstrate how the physics law of three-parameter-based LWR is injected into the PICG to inform the training of the PUNNs, and then compare all the existing architectures on a real-world dataset.

D. Numerical data validation for three-parameter-based LWR
In this example, we show the ability of PIDL to estimate the traffic dynamics governed by the three-parameter-based LWR traffic flow model on a ring road. Mathematically,

ρ_t + (Q(ρ))_x = ε·ρ_xx, (9)

where ε = 0.005. The initial and boundary conditions are ρ(x, 0) = 0.1 + 0.8·e^(−25(x−0.5)^2) and ρ(0, t) = ρ(1, t), respectively.
In this model, the three-parameter flux function Q(ρ) [60] is employed. As the function's name suggests, δ, p, and σ are its three free parameters. The parameters σ and p control the maximum flow rate and the critical density (where the flow is maximized), respectively, and δ controls the roundness level of Q(ρ). In addition to these three parameters, we also have ρ_max and the diffusion coefficient ε as part of the model parameters. In this numerical example, we set δ = 5, p = 2, σ = 1, ρ_max = 1, and ε = 0.005.
The PIDL architecture that encodes the LWR model is shown in Fig. 6. This architecture consists of a PUNN for traffic density estimation, followed by a PICG for calculating the residual r_c := ρ_t + (Q(ρ))_x − ε·ρ_xx on collocation points. The estimated traffic density ρ is calculated by the PUNN f_θ(x, t), an NN that maps a spatiotemporal point (x, t) directly to ρ, i.e., ρ = f_θ(x, t). The PUNN f_θ(x, t), parameterized by θ, is designed as a fully-connected feedforward neural network with 8 hidden layers and 20 hidden nodes in each hidden layer. The hyperbolic tangent function (tanh) is used as the activation function for hidden neurons. By replacing ρ with f_θ, the residual becomes r_c = (f_θ)_t + (Q(f_θ))_x − ε·(f_θ)_xx. With the estimated ρ and the observations ρ̂ on the observation points, we can obtain the data loss L_o. In the PICG, in contrast, the connecting weights are fixed and the activation function of each node is designed to conduct a specific nonlinear operation for calculating an intermediate value of r_c. The physics discrepancy L_c is the mean square of r_c on collocation points.
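In the architecture above, r_c is obtained by automatic differentiation of the PUNN output. Purely to illustrate what the residual measures, here is a finite-difference version evaluated on a grid; the grid layout and names are ours, not the paper's computation.

```python
def lwr_residual(rho, k, i, dt, dx, Q, eps):
    """Central-difference approximation of r_c = rho_t + Q(rho)_x - eps*rho_xx
    at interior grid point (time index k, space index i); rho[k][i] is density."""
    rho_t = (rho[k + 1][i] - rho[k - 1][i]) / (2.0 * dt)
    q_x = (Q(rho[k][i + 1]) - Q(rho[k][i - 1])) / (2.0 * dx)
    rho_xx = (rho[k][i + 1] - 2.0 * rho[k][i] + rho[k][i - 1]) / dx ** 2
    return rho_t + q_x - eps * rho_xx
```

A spatially and temporally uniform density field satisfies the PDE trivially, so its residual is zero; driving the mean squared residual toward zero on collocation points is exactly the physics discrepancy L_c.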
To customize the training of PIDL to Eqs. 9, we additionally introduce boundary collocation points C_B for learning the two boundary conditions. Different from B in Eqs. 8, observations on boundary points are not required in C_B. Then, we obtain the following loss:

Loss_θ = α·L_o + β·L_c + γ·L_b, (10)

where L_o is the data discrepancy on the observation points, L_c is the mean square of r_c on the collocation points, and L_b is the boundary loss calculated on the boundary collocation points. Because L_b may change with different scenarios, it is omitted from Fig. 6 for simplicity.

TSE and system identification using loop detectors: In this experiment, five model variables δ, p, σ, ρ_max, and ε are encoded as learning variables in the PICG depicted in Fig. 6. Define λ = (δ, p, σ, ρ_max, ε); the residual r_c is then affected by both θ and λ, resulting in the objective Loss_θ,λ. We now use observations from loop detectors, i.e., only the traffic density at the locations where loop detectors are installed can be observed. By default, loop detectors are evenly located along the road.
We use N_c = 150,000 collocation points; other experimental configurations are given in SUPPLEMENTARY A1. We conduct PIDL-based TSE experiments using different numbers of loop detectors to solve (θ*, λ*) = argmin_θ,λ Loss_θ,λ. In addition to the traffic density estimation errors of ρ(x, t; θ*), we evaluate the estimated model parameters λ* using the L2 relative error (RE), presented in percentage. Fig. 7 visualizes the estimated traffic density ρ (left half) and traffic velocity (right half) of the PIDL; comparisons at certain time points are also presented. Note that the PUNN in Fig. 6 does not predict u directly; instead, u is calculated by Q(ρ)/ρ in post-processing. More results are provided in SUPPLEMENTARY Table I; the observation is that the PIDL architecture in Fig. 6 with five loop detectors can achieve satisfactory performance on both traffic density estimation and system identification. In general, more loop detectors help our model improve the TSE accuracy, as well as the convergence to the true model parameters. Specifically, for five loop detectors, an estimation error of 3.186 × 10^−2 is obtained, and the model parameters converge to δ* = 4.86236, p* = 0.19193, σ* = 0.10697, ρ*_max = 1.00295, and ε* = 0.00515, which are decently close to the ground truth. These observations demonstrate that PIDL can handle both TSE and system identification with five loop detectors for the traffic dynamics governed by the three-parameter-based LWR.
We conduct sensitivity analysis on the number of collocation points and how they are selected; details are presented in SUPPLEMENTARY Table II. A larger collocation rate (i.e., the ratio of the number of collocation points to the number of grid points) benefits both TSE and system identification, because it imposes more physical-consistency constraints on the learning process. Empirically, more collocation points lead to longer training time, and the performance gain diminishes once a certain collocation rate is reached.
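The collocation-point setup can be sketched as below; this is an illustrative helper under assumed choices (uniform random sampling over the space-time domain, an arbitrary domain size), not the paper's sampling procedure.

```python
import numpy as np

# Hypothetical helpers: sample N_c unlabeled collocation points over the
# (x, t) domain, and compute the collocation rate, i.e., the ratio of the
# number of collocation points to the number of grid points.

def sample_collocation(n_c, x_range=(0.0, 1.0), t_range=(0.0, 3.0), seed=0):
    rng = np.random.default_rng(seed)
    xc = rng.uniform(*x_range, size=n_c)   # spatial coordinates
    tc = rng.uniform(*t_range, size=n_c)   # temporal coordinates
    return xc, tc

def collocation_rate(n_c, n_x_grid, n_t_grid):
    return n_c / (n_x_grid * n_t_grid)
```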

E. Real-world data validation
It would be interesting to compare against state-of-the-art methods based on either physics or data-driven approaches in order to better quantify the added value of the proposed class of approaches. We use a widely adopted real-world open dataset, the Next Generation SIMulation (NGSIM) dataset, detailed in Tab. III. Fig. 8 plots the traffic density heatmap using data collected from the US 101 highway.
The performance of two baseline models and four PIDL variants for deterministic TSE is presented in Fig. 9. As shown on the y-axis, the REs of the traffic density and velocity are used for evaluation. The comparison is made under representative combinations of probe vehicle ratios (see the x-axis) and numbers of loop detectors (see the titles of the sub-figures). We implement an EKF and a pure NN model as the representative pure physics-driven and data-driven baseline approaches, respectively. The EKF uses the three-parameter-based LWR as the core physics when conducting estimation. The NN contains only the PUNN component in Fig. 6 and uses the first term in Eq. 10 as the training loss. Among the PIDL variants, PIDL-LWR and PIDL-ARZ encode the three-parameter-based LWR and the Greenshields-based ARZ, respectively, into the PICG. PIDL-LWR-FDL and PIDL-ARZ-FDL are variants of PIDL-LWR and PIDL-ARZ in which the FD components in the PICG are replaced with an embedded neural network (i.e., the FD learner); this FD-learner technique was introduced by [71]. Performance of baseline models: The experimental results show that the performance of all models improves as the amount of data increases. The EKF performs better than the NN, especially when the number of observations is small. This is reasonable because the EKF is a physics-driven approach that makes sufficient use of the traffic flow model to estimate unobserved values when limited data are available. However, it cannot fully capture the complicated traffic dynamics in the real world, and its performance improves slowly as the data amount increases. The NN catches up with the EKF when the data are relatively plentiful (see the case with loop=2 and ratio=3.0%), but its data efficiency is low and large amounts of data are needed for accurate TSE.
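The EKF baseline above follows the standard extended-Kalman-filter recursion; a minimal sketch is given below under assumed structure (the state is a density vector over road cells propagated by some discretized flow model `f`, with loop detectors observing a few cells), not the paper's implementation.

```python
import numpy as np

# Minimal EKF step: predict with a (possibly nonlinear) discretized traffic
# model f and its Jacobian F_jac, then update with loop-detector readings
# z = H @ rho + noise. All matrices/noise covariances are assumptions.

def ekf_step(rho, P, f, F_jac, H, z, Q_noise, R_noise):
    # predict: propagate the state and its covariance through the model
    rho_pred = f(rho)
    F = F_jac(rho)
    P_pred = F @ P @ F.T + Q_noise
    # update: blend prediction and measurement via the Kalman gain
    S = H @ P_pred @ H.T + R_noise
    K = P_pred @ H.T @ np.linalg.inv(S)
    rho_new = rho_pred + K @ (z - H @ rho_pred)
    P_new = (np.eye(len(rho)) - K @ H) @ P_pred
    return rho_new, P_new
```

With the three-parameter-based LWR, `f` would be the discretized conservation law and `F_jac` its linearization at the current state.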
Comparison between PIDL-based models: The PIDL-based approaches generally outperform the baseline models. PIDL-ARZ achieves lower errors than PIDL-LWR because the ARZ model is a second-order model that captures more complicated traffic dynamics and informs the PIDL in a more sophisticated manner.
Effects of using FDL: Comparing the models with an FD learner (PIDL-LWR-FDL and PIDL-ARZ-FDL) to those without (PIDL-LWR and PIDL-ARZ), the former generally show better data efficiency. In PIDL-LWR-FDL and PIDL-ARZ-FDL, the FD equation is replaced by a small internal neural network that learns the hidden FD relation of the real traffic dynamics. A proper integration of this internal network avoids directly encoding complicated terms in the PIDL and trades off the sophistication of the model-driven component against training flexibility, making the framework a better fit for the TSE problem.
Comparison between PIDL-FDL based models: PIDL-LWR-FDL achieves lower errors than PIDL-ARZ-FDL, implying that sophisticated traffic models do not always lead to better performance: the model may contain complicated terms that make the TSE performance sensitive to the PIDL structural design. With the NGSIM data, PIDL-LWR-FDL balances the trade-off between the sophistication of the PIDL and the training flexibility more properly.
Transition from pure physics-driven to data-driven TSE models: The contributions of the physics-driven and data-driven components can be controlled by tuning the hyperparameters α and β in Eq. 10. Fig. 10 shows how the optimal β/α ratio changes as the data size increases. The x-axis is the number of loop detectors, which represents the training data size; the y-axis is the optimal β/α corresponding to the minimally achievable estimation errors of the PIDL-LWR methods shown in Fig. 9. This tunability enables the PIDL-based methods to transition smoothly from a pure physics-driven to a pure data-driven TSE model: in the sufficient-data regime, with a small β/α ratio, the PIDL behaves more like a pure data-driven TSE model, making ample use of the traffic data and mitigating the issue that the real dynamics cannot be easily modeled by simple PDEs; in the "small-data" regime, with a large ratio, the PIDL behaves like a pure physics-driven model and generalizes better to unobserved domains.

IV. PIDL FOR UQ
It is a widely regarded modeling and computational challenge to quantify how uncertainty propagates within dynamical systems that could result in cascading errors, unreliable predictions, and worst of all, non-optimal operation and management strategies. It is thus crucial to characterize uncertainties in traffic state estimators and consequently, in traffic management that relies on TSE predictors. UQ for TSE using PIDL is still at its nascent stage.
Definition IV.1. Uncertainty quantification (UQ) aims to assess the robustness of the developed model and to bound the prediction errors of dynamical systems, by estimating the probability density of quantities given input features and boundary conditions [79]. Stochastic effects and uncertainties potentially arise from various sources, including variability and errors in measurements, dynamic actions of various entities, model biases, and discretization and algorithmic errors. In summary, there are two types of uncertainty [79], [80] in the context of TSE: 1) Aleatoric uncertainty (or data uncertainty): an endogenous property of the data and thus irreducible, coming from measurement noise, incomplete data, and mismatch between training and test data. 2) Epistemic uncertainty (or knowledge uncertainty, systematic uncertainty, model discrepancy): a property of the model arising from inadequate knowledge of the traffic states. For example, traffic normally constitutes multi-class vehicles (e.g., passenger cars, motorcycles, and commercial vehicles), and a single model can be insufficient to capture such diversely manifested behaviors.
UQ is "as old as the disciplines of probability and statistics" [79]. In recent years, its explosive growth in large-scale applications has been bolstered by the advance of big data and new computational models and architectures. The conventional UQ techniques include but not limited to: sensitivity analysis and robust optimization [81]; probabilistic ensemble methods and Monte-Carlo methods with multi-level variants [82]- [84]; stochastic spectral methods [79]; and methods based on the Frobenius-Perron and Koopman operators [85], [86] for dynamic systems.
Epistemic uncertainty arising from model discrepancy, often biased, can be compensated by improved domain knowledge, which has received increasing attention, especially with the advance of PIDL. In the computational architecture of PIDL, the data component can be treated as a compensation term for inaccurate or biased physics supplied by the physics-based component. Thus, it is natural to generalize PIDL to UQ, where the physics-based component provides partial domain knowledge when stochasticity is propagated within highly nonlinear models, and the data-driven component learns extra randomness arising from both data and model errors.
Definition IV.2. UQ for traffic state estimation (UQ-TSE) aims to capture the randomness of traffic states ŝ = {ρ̂, û} with probabilistic models. It is assumed that ŝ follows the observational distribution, i.e., ŝ ∼ p_data(ŝ|x, t). The goal of the UQ-TSE problem is to train a probabilistic model G_θ parameterized by θ such that the distribution of the prediction s ∼ p_θ(s|x, t) resembles the distribution of the observation ŝ ∼ p_data(ŝ|x, t). One widely used metric to quantify the discrepancy between p_data and p_θ is the reverse Kullback-Leibler (KL) divergence.
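The reverse KL metric above can be estimated from samples; the sketch below uses a simple shared-histogram estimator, which is an assumed evaluation choice for illustration, not necessarily the estimator used in the experiments.

```python
import numpy as np

# Sketch of the reverse KL divergence KL(p_theta || p_data), estimated from
# prediction samples and data samples by discretizing both onto a shared
# histogram. `eps` smooths empty bins (an assumed regularization choice).

def reverse_kl(pred_samples, data_samples, bins=20, eps=1e-9):
    lo = min(pred_samples.min(), data_samples.min())
    hi = max(pred_samples.max(), data_samples.max())
    p, edges = np.histogram(pred_samples, bins=bins, range=(lo, hi))
    q, _ = np.histogram(data_samples, bins=edges)
    p = p / p.sum() + eps   # empirical p_theta
    q = q / q.sum() + eps   # empirical p_data
    return float(np.sum(p * np.log(p / q)))
```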
Since the majority of the literature on UQ-PIDL employs deep generative models, including generative adversarial networks (GAN) [87], normalizing flows [88], and variational autoencoders (VAE) [89], here we focus on how to leverage deep generative models for UQ problems. Among them, the physics-informed generative adversarial network (PhysGAN) is the most widely used model; it has been applied to solve stochastic differential equations [90]-[92] and to quantify uncertainty in various domains [71], [93]. Little has been done on using physics-informed VAEs for the UQ-TSE problem, which can be a future direction.
A. PIDL-UQ for TSE
1) Physics-informed generative adversarial network (PhysGAN)
One way to formulate the UQ-TSE problem is to use the generative adversarial network (GAN) [87], which imitates the data distribution without specifying an explicit density and can overcome the computational challenge of non-Gaussian likelihoods [52], as opposed to using Gaussian processes [94], [95]. We formulate the UQ problem in the context of conditional GANs. The generator G_θ learns the mapping from the input (x, t) and a random noise z to the traffic state s, G_θ : (x, t, z) → s, where θ is the parameter of the generator. The objective of the generator G_θ is to fool an adversarially trained discriminator D_φ : (x, t, s) → [0, 1]. The losses of the GAN take the standard adversarial form: the discriminator is trained to assign high scores to the ground truth ŝ and low scores to the predicted traffic state s = G_θ(x, t, z), while the generator is trained to make D_φ score its predictions highly. With physics imposed, the generator loss carries the same form as Eq. 10, while the data loss L_o and boundary loss L_b are replaced by their adversarial counterparts, evaluated on the observation and boundary points, respectively. Different PhysGAN variants adopt different ways of integrating physics into GANs, and the exact form of L_c changes accordingly. Below we introduce the general structure of the PhysGAN and its four variants. The general structure of the PhysGAN is illustrated in Fig. 11. The top blue box encloses the data-driven component, a GAN consisting of a generator G_θ and a discriminator D_φ. We omit details of how ρ and u interact within the generator; these two quantities can be generated by a shared NN or by separate NNs. The PICG can be encoded with either the LWR or the ARZ equations.
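The adversarial objectives described above can be written out as follows. This is a hedged sketch of a standard (non-saturating) conditional-GAN formulation operating on discriminator scores; the exact form used by each PhysGAN variant may differ, and the function names are illustrative.

```python
import numpy as np

# Standard adversarial losses, given discriminator scores in [0, 1]:
# d_real = D_phi(x, t, s_hat) on ground-truth states,
# d_fake = D_phi(x, t, G_theta(x, t, z)) on generated states.

def discriminator_loss(d_real, d_fake, eps=1e-9):
    # push D(real) toward 1 and D(fake) toward 0
    return float(-np.mean(np.log(d_real + eps))
                 - np.mean(np.log(1.0 - d_fake + eps)))

def generator_loss(d_fake, eps=1e-9):
    # non-saturating generator loss: push D(fake) toward 1
    return float(-np.mean(np.log(d_fake + eps)))
```

With physics imposed, the physics loss L_c (computed in the PICG) is added to the generator objective with a weight, mirroring Eq. 10.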
PI-GAN [90], [91], [96], [97] calculates the physics loss L_c from the residual r_c, as illustrated in branch B_1 of Fig. 11 (a). L_c is added to the PUNN loss L_o via a weighted sum. This is the first and most widely used way to encode physics into the generator.
PID-GAN [92] feeds the residual r_c into the discriminator D_φ to provide additional information on whether the predictions deviate from the physics equations, which helps the discriminator distinguish between the prediction and the ground truth. This way of integrating physics is illustrated in branch B_2 of Fig. 11 (a). It is worth mentioning that the PID-GAN and the PI-GAN share the same structure for the data-driven component; they differ in how the physics are incorporated, i.e., informing the generator (branch B_1) or the discriminator (branch B_2). [92] shows that, by informing the discriminator, the PID-GAN can mitigate the gradient imbalance issue of the PI-GAN.
The above-mentioned two PhysGAN variants use deterministic physics, that is, the parameters in the physics equations are deterministic.
Mean-GAN [98] incorporates stochastic differential equations into the physics component (illustrated as the PICG in Fig. 11 (a)), where the physics parameters are assumed to follow Gaussian distributions. The randomness in the physics parameters is the source of the epistemic uncertainty and induces randomness in the residual r_c. The physics loss is calculated from the squared error of the mean of r_c over parameter samples, i.e., L_c = (1/N_c) Σ_i |(1/N_k) Σ_k r_c^(k)|^2, where N_k is the number of physics-parameter samples. This physics loss is then added to the loss function of the PUNN via a weighted sum.
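The Mean-GAN physics loss can be sketched as below; the Gaussian parameterization of λ follows the text, while the function signature and sampling details are illustrative assumptions.

```python
import numpy as np

# Sketch of the Mean-GAN physics loss: draw N_k physics-parameter samples
# lambda_k ~ N(lam_mean, lam_std), average the residual over the parameter
# samples at each collocation point, then average the squared means over
# the N_c collocation points.

def mean_gan_physics_loss(residual_fn, coll_pts, lam_mean, lam_std,
                          n_k=10, seed=0):
    rng = np.random.default_rng(seed)
    lams = rng.normal(lam_mean, lam_std, size=n_k)   # N_k parameter samples
    # r_bar[i] = (1/N_k) * sum_k r_c(point_i; lambda_k)
    r_bar = np.mean([residual_fn(coll_pts, lam) for lam in lams], axis=0)
    return float(np.mean(r_bar ** 2))                # (1/N_c) * sum_i |r_bar|^2
```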
Within the PICG in Fig. 11 (a), the parametric FD can also be replaced with ML surrogates, as is done in PI-GAN-FDL [91], [99].
2) Physics-informed normalizing flow (PhysFlow)
PhysFlow [100] employs the normalizing flow as an alternative generative model to the GAN. The normalizing flow explicitly estimates the likelihood and is thus more straightforward to train than the GAN. It estimates the likelihood by constructing an invertible function G_θ that transforms a Gaussian prior ẑ into the traffic states s. The structure of the PhysFlow, i.e., PI-Flow, is illustrated in Fig. 11 (b). The top blue box encloses the data-driven component, a normalizing flow model. The inverse function G_θ^-1 takes the traffic states as input and outputs a predicted prior z. The training objective is to make z follow a Gaussian distribution, which is achieved by maximum likelihood estimation. The bottom red box encloses the physics component, which is the same as in PI-GAN.
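The change-of-variables idea behind the flow likelihood can be illustrated with a single affine transform; this toy stand-in replaces the stacked coupling layers of a real normalizing flow and is only a sketch of the principle.

```python
import numpy as np

# Toy normalizing-flow likelihood: an invertible map G(z) = a*z + b sends a
# standard Gaussian prior z to states s. The exact log-likelihood is
# log p(s) = log N(G^{-1}(s); 0, 1) + log |d G^{-1} / d s|,
# which is what maximum likelihood training maximizes.

def flow_log_likelihood(s, a, b):
    z = (s - b) / a                                  # inverse transform G^{-1}
    log_prior = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)
    log_det = -np.log(np.abs(a))                     # log |dz/ds|
    return log_prior + log_det
```

A real flow composes many such invertible transforms (e.g., affine coupling layers), accumulating the log-determinant terms layer by layer.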

3) Physics-informed flow based GAN (PhysFlowGAN)
PhysFlowGAN combines the merits of GANs, normalizing flows, and PIDL. It uses a normalizing flow as the generator for explicit likelihood estimation, while exploiting adversarial training with the discriminator to ensure sample quality. The structure of PhysFlowGAN is shown in Fig. 11 (c); it consists of a normalizing flow, a discriminator, and a PICG. The data loss L_o is composed of two parts: L_o^GAN, calculated from the discriminator, and L_o^flow, calculated from the normalizing flow. The physics loss is calculated in the same way as in PI-GAN. One PhysFlowGAN model, TrafficFlowGAN [99], has been applied to the UQ-TSE problem.
Tab. IV summarizes the hybrid architectures used for UQ-TSE.

B. Numerical data validation for Greenshields-based ARZ
As PI-GAN is the most widely used UQ-TSE model, we conduct numerical experiments using PI-GAN for demonstration. In the next subsection, we will use real-world data to compare the performance of the aforementioned UQ-TSE models.
The ARZ numerical data are generated from the Greenshields-based ARZ traffic flow model on a ring road, where U_eq(ρ) = u_max(1 − ρ/ρ_max) is the Greenshields speed function, with ρ_max = 1.13 and u_max = 1.02; τ is the relaxation time, set to 0.02. The boundary condition is ρ(0, t) = ρ(1, t). The initial conditions of ρ and u are ρ(x, 0) = 0.1 + 0.8e^(−25(x−0.5)^2) and u(x, 0) = 0.5, respectively. Gaussian noise following N(0, 0.02) is added to impose randomness. The results of applying PI-GAN to the ARZ numerical data are shown in SUPPLEMENTARY Table IV. Fig. 12 illustrates the predicted traffic density (left half) and velocity (right half) of the PI-GAN when the number of loop detectors equals 3. The snapshots at sampled time steps show good agreement between the prediction and the ground truth.
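The numerical setup stated above can be reproduced directly; the parameter values below are taken from the text, while the discretization grid is an arbitrary choice for illustration.

```python
import numpy as np

# Greenshields-based ARZ setup from the text: parameters, equilibrium speed,
# and initial conditions on a unit ring road (grid resolution is assumed).
rho_max, u_max, tau = 1.13, 1.02, 0.02

def U_eq(rho):
    """Greenshields equilibrium speed U_eq = u_max * (1 - rho/rho_max)."""
    return u_max * (1.0 - rho / rho_max)

def initial_conditions(x):
    rho0 = 0.1 + 0.8 * np.exp(-25.0 * (x - 0.5) ** 2)   # rho(x, 0)
    u0 = np.full_like(x, 0.5)                           # u(x, 0)
    return rho0, u0

x = np.linspace(0.0, 1.0, 241)      # ring road of unit length (assumed grid)
rho0, u0 = initial_conditions(x)
```

With the periodic boundary ρ(0, t) = ρ(1, t), the density bump at x = 0.5 propagates around the ring as the ARZ dynamics evolve.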

C. Real-world data validation
We apply the PIDL-UQ models to the NGSIM dataset. We use four model types, i.e., PI-GAN, PI-GAN-FDL, PI-Flow, and TrafficFlowGAN, for demonstration. Each model can be informed by either the LWR or the ARZ equations, resulting in eight model variants in total. EKF and GAN are used as baselines. The EKF uses the three-parameter-based LWR as the physics. The results are shown in Fig. 13. As shown on the y-axis, the upper panels are the REs of the traffic density and velocity, and the lower panels are the Kullback-Leibler (KL) divergences of the traffic density and velocity. The comparison is made under representative combinations of probe vehicle ratios (see the x-axis) and numbers of loop detectors (see the titles of the sub-figures). We interpret the results from three perspectives. Effect of loop data: When the probe vehicle ratio is fixed at 0.5%, the performance of all models improves significantly as the loop number increases from 0 to 2, because the loop data provide more information. The improvement is not significant when the probe vehicle ratio is 3%, as the probe data alone provide sufficient information.
Effects of using FDL. When the loop number is 2 and the probe vehicle ratio is 0.5%, PI-GAN-FDL achieves significantly lower REs and KLs compared to PI-GAN and PI-Flow, while this advantage becomes less significant when the data is sparse. This is because the ML surrogate requires more data to train. Also, PI-GAN-FDL achieves lower KLs than PI-GAN in general, indicating that PI-GAN-FDL can better capture uncertainty.
Comparison between the ARZ-based and LWR-based models: The ARZ-based models outperform the LWR-based models in general, which shows that the second-order physics is more suitable for the real-world scenario.
Comparison between PIDL-UQ models: As the amount of data increases, performance improves and the performance differences across models become small. Among all PIDL-based models, TrafficFlowGAN generally achieves the lowest error in terms of both RE and KL, because it combines the advantages of PhysGAN and PhysFlow. In Fig. 13(e) we summarize the RE (x-axis), i.e., the relative difference between the predicted and ground-truth means, and the KL (y-axis), i.e., the statistical difference between the predicted and ground-truth distributions, of all models with the different training datasets shown in Fig. 13(a-d). Each point represents a combination of metrics obtained by applying one model type to one training dataset. We interpret these points by assigning them to four regions: • Region A (optimal RE and KL): Most points in this region belong to the TrafficFlowGAN model type (stars), which shows that the combination of PI-GAN and PI-Flow helps achieve the best performance in terms of both RE and KL.
• Region B (low RE and high KL): Most points in this region belong to GAN (inverted triangles) and PI-GAN (dots), a sign that the GAN-based models are prone to mode collapse.
• Region C (balanced RE and KL): Most points in this region belong to the PI-Flow model type, indicating that explicit estimation of data likelihood helps balance RE and KL.
• Region D (high RE and low KL): Most points in this region belong to the EKF (triangles) and PI-GAN-FDL (squares), showing that these two types of models capture the uncertainty better than the mean. Transition from pure physics-driven to data-driven TSE models: Fig. 14 shows the optimal β/α ratio of the TrafficFlowGAN under different numbers of loop detectors. The optimal β/α exhibits a similar decreasing trend as in the deterministic TSE problem shown in Fig. 10.
quantification. We present the concept of the HCG that integrates both a PUNN and a PICG. For the TSE problem in particular, we present various architecture designs. For the traffic state inference problem, the PUNN and PICG can be linked in sequence or in parallel, depending on whether traffic velocity is computed from traffic density using FDs, or whether both velocity and density are computed from PUNN(s) simultaneously. For the UQ problem, GAN-based and non-GAN-based adversarial generative models are presented to characterize the impact of measurement uncertainty on traffic states. The architecture variants, including PI-GAN, PID-GAN, Mean-GAN, PI-GAN-FDL, PI-Flow, and TrafficFlowGAN, are introduced. This is the first study to compare PIDL model variants using the same real-world dataset, providing a benchmark platform for model comparison. By comparing various PIDL models, we demonstrate that this paradigm is advantageous over pure physics-based or data-driven approaches in the "small data" regime, and that it allows a smooth transition from pure physics-driven to data-driven models by tuning the hyperparameters in the loss. Tab. V summarizes the existing types of hybrid architectures used in PIDL-TSE.

B. Outlook
Looking forward, we pinpoint several promising research directions that we hope will guide researchers to exploit this under-tapped arena.

1) Physics representation
Which physics model to select depends on data fidelity, model fidelity, and available computational resources.
While there are a significant number of models describing traffic dynamics at the micro- [101], [102], meso- [103], [104], and macro-scale [26], [29], [30], [105], the future lies in the discovery of a multiscale traffic model (i.e., a mapping from multiscale measurements to traffic states on different scales) that collectively infers traffic states from various measurements. Analytical multiscale models have been thoroughly studied in fields like biology [13] and materials science [106], but remain under-exploited in transportation modeling. The drivers for developing a multiscale traffic model are twofold: (1) Sensor data are at different scales. The collected traffic data include high-resolution individual trajectories and low-resolution aggregate information, both at various spatiotemporal granularities. Traffic measurements on different scales essentially measure the same system and phenomenon. Accordingly, a multiscale model capable of integrating measurements at various scales could characterize traffic dynamics more accurately with a smaller dataset. (2) Traffic optimization and management strategies rely on diverse models and data at different scales. Individual traces are normally modeled using ODEs, while aggregate traffic patterns are modeled with PDEs. An integrated ODE-PDE physics model can potentially accommodate measurements at both the micro- and macro-scale [107]. Multiscale traffic models could also help reduce model complexity and speed up simulation for real-time applications.
A multiscale traffic model, however, presents computational challenges due to the curse of dimensionality (i.e., high-dimensional input-output pairs). It is thus important to utilize reduced-order modeling techniques and multi-fidelity modeling [108], [109].
An emerging direction for improving physics representation is to consider proxy models. For example, symbolic regression has been shown to learn relatively simple physical forms that describe complex physical systems [110], [111]. Traffic engineers have spent decades discovering physical laws to describe human driving behavior, from car-following and lane-changing to other driving scenarios and tasks. Is it possible to develop a systematic method to select among mathematical models? For example, the selection of a model can be reformulated as finding an optimal path over the PICG from inputs to outputs [112]; model selection is thereby converted into an optimization problem. Machine learning methods, such as neural architecture search [113] and automatic modularization of network architecture [114], could enable the automatic architecture design of the PICG and PUNN.
2) Learning discontinuity in patterns
Traffic patterns, especially congested traffic, are highly driven by the underlying physics; thus, how to learn patterns around shockwaves, which correspond to discontinuities where gradients do not exist, remains an active research area in PIDL. A small amount of literature has discussed this limitation of PIDL for nonlinear PDEs whose solutions contain shocks and waves. One solution is to add viscosity terms [68], [71], [115] to smooth the discontinuity.
Because the state of the art primarily focuses on the "soft" method, which imposes the physics constraints as part of the loss function, the fulfillment of the physics constraints cannot be guaranteed, leading to poor predictions near shocks and waves. Thus, "hard" methods that enforce the physics can be a radical remedy. One option is to convert the TSE problem using a variational formulation [116] and define an energy-based loss function [117] that naturally adopts the objective function of the variational formulation.
3) Transfer and meta-learning
For engineers, an important question is: if we have trained a PUNN using data collected from city A, can we directly generalize the predictions of the NN to data from city B? Traffic datasets from two cities could differ drastically in the underlying traffic dynamics (arising from heterogeneity in driver behavior, such as that between San Francisco and Mumbai), traffic environments, road conditions, initial and boundary conditions, and traffic demands. To address this challenge, we have to modify the input of the PUNN to use, instead of (x, t), other attributes that vary across datasets, including but not limited to road type, geometry, lane width, lane number, speed limit, travel demand, and traffic composition.
Another direction to meet the transfer needs is to ask: how can we train a family of related physics-based PDEs (such as LWR and ARZ) and generalize the tuned hyperparameters to other members of the family? Meta-learning the parameters involved in the PIDL pipeline could be a potential solution [118].
4) IoT data for urban traffic management
How can we fully exploit the advantages of multimodal, multi-fidelity IoT data, including but not limited to individual trajectories, camera images, radar heatmaps, and lidar point clouds? The existing practice is to preprocess these multimodal, multi-fidelity measurements using computer vision and extract aggregate traffic information in terms of traffic velocity and/or density within discretized spatial cells and time intervals. Rather than converting these data formats to conventional ones, we should think outside the box and, potentially, redefine the entire TSE framework. This demands a paradigm shift in what constitutes traffic states, toward one taking individual trajectories and images as inputs. In other words, is it sufficient to describe traffic states simply with aggregate traffic velocity, density, and flux? The aggregate traffic measures were introduced decades ago when inductive loop detectors were deployed to measure cumulative traffic counts. Traffic flow has long been regarded as analogous to fluids, and accordingly, hydrodynamic theory is applied to model traffic flow dynamics. It is natural to adapt physical quantities defined for fluids to traffic flow. However, traffic systems are not physical but social systems, in which road users constitute a complex traffic environment while interacting continuously with the built environment. Contextual information greatly influences driving behaviors and traffic dynamics. Accordingly, the question is: when we describe the state of traffic and aim to optimize traffic management strategies, would traffic contextual information perceived by our eyes and brains (represented by DNNs) be a helpful addition to the widely used quantitative measures, especially as this type of information becomes more widely available thanks to IoT and smart-city technology?
Furthermore, if TSE models can be enriched with multimodal data, what new challenges and opportunities arise for the traffic control and optimization models that rely on traffic state inference? There are extensive studies on causal discovery and causal inference using observational data when unobserved confounders are present [119], [120]. However, little work has been done to leverage explicit causal relations from physical knowledge to improve PIDL. For example, counterfactual analysis of physical dynamics concerns identifying the causal effects of various interventions, including traffic control and the sequential decision-making of other agents in the environment [121]-[123]. Without performing extra experimentation, which is risky and unsafe, how can we design and validate traffic management strategies using the new information provided by IoT data?

5) TSE on networks
With ubiquitous sensors in smart cities, traffic state estimation on large-scale road networks becomes more feasible and useful. To generalize from single road segments to networks, the challenge lies in the spatial representation of graphs as well as the temporal evolution of traffic dynamics. When PIDL is applied to sparse networked sensing information, we need to work out in what representation existing physics models should be incorporated; in other words, how traffic states on links and at junctions should be encoded into deep learning models. [124] predicts network flows using a spatio-temporal differential equation network (STDEN) that integrates a differential equation network for the evolution of the traffic potential energy field into DNNs.
Many open questions remain unanswered. As this field continues to grow, we leave them for readers to ponder: 1) How do we leverage various observation data to fully exploit the strengths of PIDL? 2) What types of sensors and sensing data would enrich the application domains of PIDL and better leverage its benefits? 3) Would there exist a universal architecture of hybrid computational graphs across domains? 4) What are robust evaluation methods and metrics for PIDL models against baselines?

I. SUPPLEMENTARY MATERIALS
A. Experimental details for TSE and system identification using loop detectors

1) Experimental configurations
The PUNN f_θ(x, t) parameterized by θ is designed as a fully-connected feedforward neural network with 8 hidden layers and 20 hidden nodes in each hidden layer. The specific structure is layers = [2,20,20,20,20,20,20,20,20,1], meaning that the PUNN maps the inputs (x, t) to the estimated density ρ. The hyperbolic tangent (tanh) is used as the activation function for each hidden neuron in the PUNN. In addition, the PICG part of the PIDL architecture is parameterized by the physics parameters λ.
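The PUNN structure above can be sketched directly from the layers list. This is an illustrative forward pass with randomly initialized weights (the initialization scheme is an assumption), not the trained network.

```python
import numpy as np

# Sketch of the PUNN: a fully-connected tanh network built from the stated
# layers list [2, 20, ..., 20, 1], mapping (x, t) to an estimated density.
layers = [2, 20, 20, 20, 20, 20, 20, 20, 20, 1]

def init_params(layers, seed=0):
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)),
             np.zeros(n_out))
            for n_in, n_out in zip(layers[:-1], layers[1:])]

def punn_forward(params, x, t):
    h = np.stack([x, t], axis=-1)        # input features (x, t)
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)           # tanh hidden activations
    W, b = params[-1]
    return (h @ W + b).squeeze(-1)       # scalar density estimate per point
```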
We train the PUNN and identify λ through the PIDL architecture using the adaptive moment estimation (Adam) optimizer for rough training over about 1,000 iterations. Follow-up fine-grained training is done by the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimizer to stabilize convergence; the process terminates when the loss change between two consecutive steps is no larger than 10^-16. This training process converges to a local optimum (θ*, λ*) that minimizes the loss.
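The two-stage schedule can be sketched on a toy objective. Adam is implemented by hand here, and plain gradient descent stands in for L-BFGS in the fine-tuning stage (an assumed simplification to keep the sketch dependency-free); the stopping rule on the consecutive loss change mirrors the one described above.

```python
import numpy as np

# Stage 1: rough Adam training; Stage 2: fine-tuning that stops once the
# loss change between two consecutive steps is <= tol (10^-16 in the paper).
# In the real pipeline the loss would be the PIDL loss over (theta, lambda)
# and the second stage would use an L-BFGS optimizer.

def adam(grad_fn, theta, lr=0.1, steps=1000, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for k in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat, v_hat = m / (1 - b1 ** k), v / (1 - b2 ** k)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

def fine_tune(loss_fn, grad_fn, theta, lr=0.01, tol=1e-16, max_steps=100000):
    prev = loss_fn(theta)
    for _ in range(max_steps):
        theta = theta - lr * grad_fn(theta)   # gradient-descent stand-in
        cur = loss_fn(theta)
        if abs(prev - cur) <= tol:            # the paper's stopping rule
            break
        prev = cur
    return theta
```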
For a fixed number of loop detectors, we use grid search for hyperparameter tuning by default. Specifically, since the Adam optimizer is scale-invariant, we fix the hyperparameter α to 100 and tune the other hyperparameters over [1, 10, 50, 100, 150, 200], with some follow-up fine-tuning.
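The grid search can be sketched as follows; `val_error` is a hypothetical callback returning the validation error for a given weight setting, and the sweep over two remaining hyperparameters is an illustrative assumption.

```python
import itertools

# Sketch of the grid search: alpha is fixed at 100 and the remaining loss
# weights are swept over the candidate list, keeping the setting with the
# lowest validation error.

def grid_search(val_error, candidates=(1, 10, 50, 100, 150, 200), alpha=100):
    best = None
    for beta, gamma in itertools.product(candidates, repeat=2):
        err = val_error(alpha, beta, gamma)
        if best is None or err < best[0]:
            best = (err, beta, gamma)
    return best   # (lowest error, best beta, best gamma)
```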
The training details are presented in Algorithm 1.
3) Additional experimental results using real-world data
Fig. 1 shows the error heatmaps for the NN and PIDL-LWR-FDL models when the number of loop detectors is 2 and the probe vehicle ratio is 0.05, where "SE" means the squared error.

4) Sensitivity analysis on collocation points
We conduct sensitivity analysis on different numbers of collocation points; the results are presented in Table II. The detector number is fixed to 5 and the hyperparameters remain unchanged. The performance of estimation and parameter discovery improves when more collocation points are used. In this case study, the performance is sensitive to the number of collocation points when the collocation rate (C.R.), i.e., the ratio of the number of collocation points to the number of grid points, is smaller than 0.01. Table III presents the computational time and prediction error of the deterministic PIDL and baselines with the LWR model as the physics when the loop detector number is 2 and the probe vehicle ratio is 3%. The first column is the name of the model used in the experiment, and the second column is the type of the model. The third and fourth columns are the computation times for training and testing, respectively. The remaining columns are the performance metrics in terms of prediction error.

5) Computational effort and model accuracy
As EKF is not a learning-based method, it can make predictions directly once the parameters of the dynamic model, i.e., the discrete LWR or ARZ, are pre-calibrated. Although the deep learning models (PIDL-FDL, NN) require training, their test time is much shorter than that of the EKF. In terms of accuracy, which is another aspect of a method's efficiency, the PIDL and PIDL-FDL models achieve a significant accuracy improvement over EKF and NN. A short computational time in the test phase (i.e., at run time) and high accuracy are both major considerations in real-world applications.

B. Experimental details for UQ-TSE and system identification using loop detectors

1) Experimental configurations
For the PI-GAN model, the generator structure is layers = [3,20,40,60,80,60,40,20,2], meaning that the generator takes (x, t, z) as input and estimates both the traffic density ρ and the traffic velocity u. The discriminator structure is layers = [4,20,20,40,60,80,60,40,20,20,1], meaning that the discriminator takes (x, t, ρ, u) as input and outputs a one-dimensional real number indicating whether the input ρ and u come from the real-world data or not. Note that the discriminator has a more complex structure than the generator to stabilize the training. The Rectified Linear Unit (ReLU) is used as the activation function for each hidden layer.
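The layer lists above fully determine the two MLPs. As a sketch (random untrained weights, numpy instead of a deep-learning framework, He-style initialization assumed), the architectures can be materialized and shape-checked like this:

```python
import numpy as np

def build_mlp(layers, seed=0):
    # one (weight, bias) pair per consecutive layer-size pair in the list
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in),
             np.zeros(fan_out))
            for fan_in, fan_out in zip(layers[:-1], layers[1:])]

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:      # ReLU on hidden layers, linear output
            x = np.maximum(x, 0.0)
    return x

gen = build_mlp([3, 20, 40, 60, 80, 60, 40, 20, 2])           # (x, t, z) -> (rho, u)
disc = build_mlp([4, 20, 20, 40, 60, 80, 60, 40, 20, 20, 1])  # (x, t, rho, u) -> score

rho_u = forward(gen, np.zeros((5, 3)))    # batch of 5 generator inputs
score = forward(disc, np.zeros((5, 4)))   # batch of 5 discriminator inputs
```

The output shapes, (5, 2) for the generator and (5, 1) for the discriminator, match the estimated (ρ, u) pair and the scalar real/fake score described above.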
For the PI-Flow model, the scale neural network and the transition neural network share the same structure, i.e., 2 hidden layers with 64 neurons each. The number of transformations is 6, that is, there are 6 scale neural networks and 6 transition neural networks in total. Leaky ReLU is used as the activation function for each hidden layer.
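The flow described above follows the RealNVP-style affine-coupling pattern: each of the 6 transformations rescales part of the state by a scale network s(·) and shifts it by a transition network t(·), and the whole stack is invertible by construction. The sketch below replaces the 2-layer, 64-unit MLPs with tiny fixed functions purely so the invertibility of a 6-coupling stack can be checked directly; the function names and splitting scheme are illustrative assumptions, not the paper's exact code.

```python
import numpy as np

def coupling_forward(x, s_fn, t_fn):
    # transform the second half conditioned on the (unchanged) first half
    x1, x2 = x[..., :1], x[..., 1:]
    y2 = x2 * np.exp(s_fn(x1)) + t_fn(x1)
    return np.concatenate([x1, y2], axis=-1)

def coupling_inverse(y, s_fn, t_fn):
    # exact inverse of coupling_forward
    y1, y2 = y[..., :1], y[..., 1:]
    x2 = (y2 - t_fn(y1)) * np.exp(-s_fn(y1))
    return np.concatenate([y1, x2], axis=-1)

# six (scale, transition) pairs; in PI-Flow each would be a 2x64 MLP
nets = [(lambda h, k=k: 0.1 * k * h, lambda h, k=k: 0.05 * k + 0.0 * h)
        for k in range(1, 7)]

z = np.array([[0.3, -1.2]])
x = z
for s_fn, t_fn in nets:            # forward through all 6 couplings
    x = coupling_forward(x, s_fn, t_fn)
z_rec = x
for s_fn, t_fn in reversed(nets):  # invert the stack, recovering z
    z_rec = coupling_inverse(z_rec, s_fn, t_fn)
```

A practical flow would also permute or swap the two halves between couplings so that every dimension gets transformed; that detail is omitted here for brevity.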
For the TrafficFlowGAN, its PUNN is a normalizing flow model that shares the same architecture as that of the PI-Flow model introduced above. The discriminator consists of a stack of 3 convolutional layers; the kernel size of each layer is 3, and the numbers of channels are 4, 8, and 16, respectively.
We use the Adam optimizer to train the neural networks for 5,000 iterations. Different from training PIDL for the deterministic TSE, the L-BFGS optimizer is not used, because the GAN model requires training the generator and the discriminator in turn, for which L-BFGS is not suitable. We fix the hyperparameter β = 1 − α and tune α over [0.1, 0.3, 0.4, 0.6, 0.7].
The training details are presented in Algorithm 2.
2) Additional experimental results using numerical data validation for Greenshields-based ARZ
The results under different numbers of loop detectors m are shown in Table IV.
3) Additional experimental results using real-world data
Fig. 2 shows the error and prediction standard deviation heatmaps of the EKF and TrafficFlowGAN models when the number of loop detectors is 2 and the probe vehicle ratio is 0.05. Table V presents the computational time and prediction error of the stochastic PIDL-UQ models with the LWR model as the physics when the loop detector number is 2 and the probe vehicle ratio is 3%. Its column specification is the same as that of Table III, except that there are more performance metrics for the UQ-TSE problem. The deep learning based models (GAN, PI-GAN, PI-Flow, and TrafficFlowGAN) achieve a much shorter test time than the EKF, and the PIDL based models, especially TrafficFlowGAN, achieve a significant accuracy improvement. This behavior is similar to that in Table III for the deterministic PIDL models. Among the PIDL based models, PI-Flow has the shortest training time because it uses the normalizing flow model as the generator (PUNN), which can be trained by maximum likelihood estimation and does not require a discriminator. Using both normalizing flow and GAN, TrafficFlowGAN consumes the most training time.

C. Performance metrics
We use four different metrics to quantify the performance of our models. In the formulas below, s(x, t) represents the predicted traffic state (i.e., traffic density ρ or traffic velocity u) at location x and time t, and ŝ(x, t) represents the ground truth.
1) Relative Error (RE) measures the relative difference between the predicted and ground-truth traffic states. For the deterministic TSE, RE is calculated by:
$\mathrm{RE}_s = \frac{\| s - \hat{s} \|_2}{\| \hat{s} \|_2}$,
where $\| \cdot \|_2$ denotes the L2 norm taken over all locations and time.
2) Squared Error (SE) measures the squared difference between the predicted and ground-truth traffic states at location x and time t. For the deterministic TSE, SE is calculated by:
$\mathrm{SE}_s(x, t) = (s(x, t) - \hat{s}(x, t))^2$.
3) Mean Squared Error (MSE) calculates the average SE over all locations and time, which is depicted as follows:
$\mathrm{MSE}_s = \frac{1}{N} \sum_{i=1}^{N} (s(x_i, t_i) - \hat{s}(x_i, t_i))^2$,
where N is the total number of data points.
4) Kullback-Leibler divergence (KL) measures the difference between two distributions P(x) and Q(x), which is depicted as follows:
$\mathrm{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$.
Note that for the UQ-TSE, the traffic state s(x, t) follows a distribution, and the aforementioned errors measure the difference between the predicted and the ground-truth means instead. For example, with $\mu_s(x, t)$ denoting the mean of the predicted distribution of s(x, t), RE and SE for the UQ-TSE are defined as:
$\mathrm{RE}_s = \frac{\| \mu_s - \hat{s} \|_2}{\| \hat{s} \|_2}, \quad \mathrm{SE}_s(x, t) = (\mu_s(x, t) - \hat{s}(x, t))^2$.
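The four metrics are straightforward to express in numpy. A minimal sketch, reading RE as the relative L2 norm over all grid points, SE/MSE as pointwise and averaged squared errors, and KL as its discrete form (with a small epsilon assumed for numerical safety, which is not part of the definition above):

```python
import numpy as np

def relative_error(s, s_true):
    # RE: L2 norm of the error relative to the L2 norm of the ground truth
    return float(np.linalg.norm(s - s_true) / np.linalg.norm(s_true))

def squared_error(s, s_true):
    # SE: pointwise squared difference, same shape as the inputs
    return (s - s_true) ** 2

def mean_squared_error(s, s_true):
    # MSE: SE averaged over all N locations and times
    return float(np.mean((s - s_true) ** 2))

def kl_divergence(p, q, eps=1e-12):
    # discrete KL(P || Q) between two normalized histograms
    p = p / p.sum(); q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

s_hat = np.array([1.0, 2.0, 2.0])   # toy predictions
s_gt = np.array([1.0, 2.0, 4.0])    # toy ground truth
mse = mean_squared_error(s_hat, s_gt)
```

For the UQ-TSE, the same `relative_error` and `squared_error` would be applied to the mean of the predicted distribution rather than a single deterministic prediction.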