Analyzing the Impact of Traffic Congestion Mitigation: From an Explainable Neural Network Learning Framework to Marginal Effect Analyses

Computational graphs (CGs) have been widely utilized in numerical analysis and deep learning to represent directed forward networks of data flows between operations. This paper aims to develop an explainable learning framework that can fully integrate three major steps of decision support: Synthesis of diverse traffic data, multilayered traffic demand estimation, and marginal effect analyses for transport policies. Following the big data-driven transportation computational graph (BTCG) framework, which is an emerging framework for explainable neural networks, we map different external traffic measurements collected from household survey data, mobile phone data, floating car data, and sensor networks to multilayered demand variables in a CG. Furthermore, we extend the CG-based framework by mapping different congestion mitigation strategies to CG layers individually or in combination, allowing the marginal effects and potential migration magnitudes of the strategies to be reliably quantified. Using the TensorFlow architecture, we evaluate our framework on the Sioux Falls network and present a large-scale case study based on a subnetwork of Beijing using a data set from the metropolitan planning organization.


Introduction
As the population size, economic growth, and personal travel activities continue to increase, traffic congestion in metropolitan areas remains one of the major concerns for urban transportation planning and management agencies [1]. Recurring traffic congestion is often caused by a regional imbalance involving excess demand and limited infrastructure, while nonrecurring congestion is mainly caused by various traffic incidents and severe weather conditions. Many advanced intelligent transportation systems (and more recently, smart transportation initiatives) have been developed to enable more sophisticated traffic demand management and reliable selection of effective congestion mitigation strategies.
The emerging traffic big data environment brings both opportunities and challenges to the core functions of traffic demand estimation and control. For example, Toole et al. (2015) demonstrated progress in using sampled mobile phone data to infer the most traveled routes, which are extremely 1.
The gap between layers 1 and 2 is related to the wide variety of available traffic data sources and the fact that core traffic network models are typically difficult to calibrate consistently. 2.
The gap between layers 2 and 3 lies in the fact that traffic models contain many elements with limited certainty, whereas mission-critical scenario evaluation requires reliable current-state estimates and policy-sensitive forecasts.
Sensors 2019, 19, x FOR PEER REVIEW 2 of 31 vehicle identification (AVI) data), planners and decision-makers hope to gain deep insight into human mobility patterns, which will accordingly increase the confidence level and reduce the uncertainty of transportation strategy evaluations [3][4][5][6][7][8]. Over the past several decades, many researchers have highlighted the need for a fully integrated connection from big data sources to pattern recognition and the deployment of final traffic demand/control strategies [9][10][11][12].
While it is important to extract and visualize spatial and temporal patterns from diverse data sources collected from advanced sensor networks, transportation planners and managers are more concerned with how to gain insight into the different causes of congestion because the benefits of different urban traffic congestion mitigation strategies must be better quantified in a data-rich environment. This expectation requires transformative advances not only within the traditional domains of traffic behavioral analysis and demand estimation, but also within the emerging field of big data itself [13,14]. Moreover, how to develop an explainable deep learning framework has been identified as one of the top 10 key challenges for the AI community [15].
Based on the above research needs, we attempt to bridge two important gaps between the three functional layers in a data-centric transportation planning and management framework: Models, data, and policy scenario evaluation, as shown in Figure 1.
1. The gap between layers 1 and 2 is related to the wide variety of available traffic data sources and the fact that core traffic network models are typically difficult to calibrate consistently.
2. The gap between layers 2 and 3 lies in the fact that traffic models contain many elements with limited certainty, whereas mission-critical scenario evaluation requires reliable current-state estimates and policy-sensitive forecasts.

Literature Review
A traffic demand management system is characterized by three modules: Traffic demand models, different data sources, and traffic demand management policies (e.g., congestion mitigation policies), as displayed in Figure 1. Both model-driven and data-driven approaches provide estimates of traffic demand states for evaluating the effects of congestion mitigation strategies in different scenarios.

Model-Driven Travel Demand Estimation Approaches
The first module includes conventional model-driven approaches for traffic demand estimation that are motivated by a priori knowledge of the kinds of decision-making that must be performed while traveling [16,17]. Traditionally, these decision-making processes are conceived as four-step choices, including trip generation, trip distribution, mode-split models, and traffic assignment [18]. Individual route choice behavior is assumed based on the user equilibrium condition, in which each user chooses the optimal route that minimizes his or her travel costs [19]. Discrete choice models are also used to estimate the modes and route choice behaviors of users [20]. Recently, a number of researchers have attempted to develop activity-based models to estimate users' trip chains [21] (see Figure 1).
One model-driven approach that focuses on network-wide demand estimation and prediction of origin-destination (OD) demand and route choice probabilities, as well as the resulting traffic network flow patterns, is based on "link proportions" obtained through static and dynamic traffic assignment (STA/DTA). However, in most of the literature, the four-step process and demand estimation are independent. Some studies simply combine STA/DTA with OD matrix estimation (ODME) using bi-level programming models [22][23][24], where the lower level is an STA/DTA model and the upper-level model uses econometric estimators to obtain estimates of the OD matrix.
It should be noted that the real-world deployment of model-driven methods faces technical and computational barriers. Different types of models are not consistent in their formulations of traffic demand. For example, dynamic network assignment for the management of information to travelers (DynaMIT) uses a deterministic queuing and speed model [25], dynamic traffic assignment and simulation for advanced network informatics (DYNASMART) adopts a Greenshield model [26], and the visual interactive system for transportation algorithms (VISTAs) use a cell transmission model [27]. Because the STA/DTA process provides "link proportions" as inputs for demand estimation, different models might lead to different estimates.
Conversely, the accuracy of STA/DTA models also depends heavily on the data available to calibrate the models (see Figure 1). In particular, DTA models require excessive data processing. The input of a DTA model includes demand-side inputs, the OD demand, supply-side inputs, and the network topology. The barriers involve the collection of representative data, the transformation of those data into the DTA input format, and the modeling and evaluation of control strategies.
One task undertaken in this paper is to develop a general framework that integrates the model-driven approach with the traffic demand estimation process using the forward and backward mechanism of computational graphs (CGs). Furthermore, the concept of CGs provides us with a comprehensive learning framework to coordinate the multiple traffic data sources required in traditional four-step methods.
Inductive loops, radars, and cameras have become the predominant fixed vehicle detection devices in most cities due to their low unit equipment costs and relatively high performance [10,36,37]. Many existing ODME methods focus on the estimation of OD matrices using link counts derived from fixed sensor observations. It has been widely recognized that OD flow patterns are not unique because of the non-uniqueness of the path flows and the limited observation data collected from sensors on links [38,39]. The in-car navigation Global Position System (GPS) technology has matured into a rapidly growing industry. Floating car data have been used in conjunction with link count data and video camera data to derive OD matrices and analyze route choices and trip length distributions [40,41]. Furthermore, mobile phone data make it possible to capture the characteristics of human mobility. Hao et al. [21] applied mobile phone data in activity-based models to generate user tours. Interestingly, Bonnel et al. [42] compared OD matrices that were separately generated from household surveys and mobile phone data and found a large disparity between the two outputs. In the existing literature, secondary data sources from social network services have also been used to study individual travel behaviors [43][44][45].
In data-driven traffic demand estimation models, it is extremely challenging to determine an overall cause-effect explanation based on the partially observable information acquired from multiple data sources. In addition, to achieve the full potential of data-assisted traffic congestion mitigation and demand management, it is critical to have a data mining platform that is sensitive to the requirements of urban planners in a heterogeneous data environment.
From the perspective of big data fusion, Wu et al. [46] proposed the big data-driven transportation computational graph (BTCG) framework as a fundamental mathematical modeling tool for performing multilayered traffic demand estimation based on the forward and backward propagation mechanism. The multilayered representation used in this framework structurally models four levels of travel demand variables, including trip generation, OD matrices, path flows, and link flows. Ma et al. (2019) extended the framework to the problem of multiclass traffic demand estimation [47]. This paper further investigates the application of CGs to facilitate data-assisted traffic congestion mitigation and traffic demand management.

Existing Congestion Mitigation Strategies
Traffic congestion can be defined in many ways. Because traffic demands change constantly, city congestion varies from day to day. However, approximately half of all congestion recurs every day in the same location. This paper focuses on investigating such recurrent congestion (see Figure 1). Traffic congestion can be reduced using various mitigation strategies. The most intuitive solution is to build new roadways to add new capacity. However, the famous Braess paradox shows that sometimes, new capacity may decrease utility for users [19]. Moreover, in metropolitan areas, the right-of-way is expensive; consequently, other mitigation options must be considered. The advent of ride-sharing services means that providing high-occupancy vehicle (HOV) lanes for vehicles with two or more passengers is also a feasible solution for expanding capacity.
Despite the success of congestion pricing in some cities (London has seen benefits from the implementation of congestion pricing in 2003) [60], toll schemes do not appeal to the public. Based on this fact, Wu et al. [61] indicated the importance of calculating the marginal external cost of one extra toll on links because congestion pricing may lead to a possible loss of overall social welfare. Wu et al. [61] recommended a Pareto-improving strategy to increase social benefit without increasing the travel expense for every stakeholder. However, a Pareto-efficient system is not always guaranteed. In most cases, it is important to develop an efficient method of calculating the marginal effect of one extra toll on each link and the tradeoff between benefits and costs.
Traffic congestion is also closely related to urban planning factors, such as job locations, land use, and house (rental) pricing. From the urban planning perspective, to reduce long-term congestion, planners tend to implement population transfers and redesign residential locations and workplaces to balance employment supply and demand [62][63][64][65][66]; however, to the best of our knowledge, very few studies have comprehensively considered the marginal effects of macroscopic policies on the traffic volume on each link.
Another contribution of this paper is to extend the computational graph (CG) based learning framework to evaluate the marginal effects of different congestion mitigation strategies in a big data context.

Outline
The remainder of this paper is organized as follows. Section 3 presents a general description of the overall system architecture and a conceptual illustration based on CGs. In Section 4, we propose a compact description of the traffic demand estimation problem based on the existing CG framework and extend the deep learning framework to the evaluation of the effects of congestion mitigation strategies by mapping those strategies onto CGs. Section 5 reports computational results obtained based on the Sioux Falls network and a subnetwork of Beijing to demonstrate the effectiveness and applicability of our framework. Section 6 concludes the paper and discusses some possible extensions of our work.

System Architecture and Conceptual Illustration
As background, in this section, we introduce the overall architecture of the "super-simulation system" that is designed for the new generation of data-driven traffic management platforms. It should be noted that the proposed CG-based learning framework for traffic demand estimation is only one of the engines of this "super-simulation" system. The architecture depicted in Figure 2 reveals the relationship between the CG-based demand estimation engine and the STA/DTA simulation engine. studies have comprehensively considered the marginal effects of macroscopic policies on the traffic volume on each link. Another contribution of this paper is to extend the computational graph (CG) based learning framework to evaluate the marginal effects of different congestion mitigation strategies in a big data context.

Outline
The remainder of this paper is organized as follows. Section 3 presents a general description of the overall system architecture and a conceptual illustration based on CGs. In Section 4, we propose a compact description of the traffic demand estimation problem based on the existing CG framework and extend the deep learning framework to the evaluation of the effects of congestion mitigation strategies by mapping those strategies onto CGs. Section 5 reports computational results obtained based on the Sioux Falls network and a subnetwork of Beijing to demonstrate the effectiveness and applicability of our framework. Section 6 concludes the paper and discusses some possible extensions of our work.

System Architecture and Conceptual Illustration
As background, in this section, we introduce the overall architecture of the "super-simulation system" that is designed for the new generation of data-driven traffic management platforms. It should be noted that the proposed CG-based learning framework for traffic demand estimation is only one of the engines of this "super-simulation" system. The architecture depicted in Figure 2 reveals the relationship between the CG-based demand estimation engine and the STA/DTA simulation engine. In addition, we will briefly report on how big data sources and real-world congestion mitigation strategies interact with our traffic demand estimation engine (i.e., the CG-based learning framework). This discussion is followed by a comprehensive review of the concept of a CG and its mathematical relationship to discrete choice models, which shows that the CG representation is sufficiently flexible to be used in formulating various problems in the field of transportation modeling.

System Architecture
The proposed CG-based learning framework is an important component of the overall system, as shown in Figure 2. A central traffic database is established that receives multiple types of traffic data, In addition, we will briefly report on how big data sources and real-world congestion mitigation strategies interact with our traffic demand estimation engine (i.e., the CG-based learning framework). This discussion is followed by a comprehensive review of the concept of a CG and its mathematical relationship to discrete choice models, which shows that the CG representation is sufficiently flexible to be used in formulating various problems in the field of transportation modeling.

System Architecture
The proposed CG-based learning framework is an important component of the overall system, as shown in Figure 2. A central traffic database is established that receives multiple types of traffic data, including household survey data, mobile phone data, floating car data, loop detector data, and data from other possible sources. After being cleaned and formatted, the data are input into the CG-based traffic demand estimation engine and the traffic assignment engine. The outputs of these two engines can be viewed as the input to the simulation engine for visualizing the traffic system [67].
As shown in Figure 2, the CG-based traffic demand estimation engine and the traffic assignment engine (for STA/DTA) provide feedback to each other. On the one hand, the CG-based traffic demand estimation engine provides input files for the traffic assignment engine. On the other hand, the traffic assignment engine initializes the candidate route set for the traffic demand estimation engine. Zhuge et al. [68] reported on how to generate the candidate route set using a tree-based assignment (20 iterations are usually required to arrive at relatively stable conditions; see [68]). The CG-based traffic demand estimation engine should also be fully integrated with the transport policy resources. Then, we will be able to predict the possible demand patterns under different transport policies.
In summary, the essential tasks of the traffic demand estimation engine are as follows: 1. Estimate multilayered traffic demands (i.e., trip generation, trip distribution, and path/link flows).

2.
Produce the input file for the traffic assignment engine.

3.
Evaluate the effects of various transport policies.
Because this paper focuses on policy analysis, we do not consider the interaction between traffic assignment and traffic demand estimation. We invoke the STA engine only to generate the candidate physical route set. Furthermore, we consider only the private car mode in a regional network during rush hour. The traffic demand estimation engine will be extended in the future.

Computational Graphs and Marginal Effect Analyses
The CG concept serves as a basic description language for many machine learning methods, such as artificial neural networks [69]. In particular, CGs serve as the basis of a fine-grained framework that can be used to decompose complex composite functions into a sequence of nested mappings. Each mapping in a CG (represented by an edge) expresses an elementary operation involving only one or two arguments.
In the field of computer science, CGs provide a paradigm for developing machine learning systems in a big data environment because they enable the efficient and flexible management, operation, and control of data sources using intuitive graphical representations and elementary mathematical operations. Consequently, CGs have become basic building blocks for many current machine learning software packages [70][71][72][73]. For example, TensorFlow, developed by the Google Brain team, is a machine learning library based on a CG framework.
CGs are also related to traditional transportation demand modeling. Below, we present an illustrative example adapted from Koppelman and Bhat (2006) [20] in the context of discrete choice models to show how to formulate a mode-split model (a multinomial logit model) using a CG. We refer readers to a previous introduction to the relationship between CGs and deep learning methods [69]. Furthermore, Zhao et al. (2019) provide a good discussion of the relationship between machine learning and logit modeling [74].

Comparison between Discrete Choice Modeling and Computational Graph Modeling Based on a Mode-Split Model
Consider an example with two available alternatives: Drive alone (DA) and transit (TR). The planners intend to encourage users to use transit services instead of private driving to mitigate congestion in the driving network. The following utility function implies that the decision-maker preferences are a function of the average income (INC) and travel time (TT). For simplicity, we do not consider interaction terms or constant terms: Then, the probability of choosing DA can be calculated using the following multinomial logit model: .
(3) Table 1 shows an example in which β DA,1 = 0.004, β TR,1 = 0, and β 2 = 0.02 for an individual from a household with a $50,000 annual income facing travel times of 30 and 50 min for DA and TR, respectively. The utility and probability calculations are shown in Table 1. Table 1. Utility and probability calculations with transit as the base alternative.

Utility Exponent Probability Expression Value
Drive alone As indicated by Goodfellow et al. [65], a logit model for a binary mode choice can be viewed as the simplest type of neural network, with three layers corresponding to two steps of calculations.

1.
The first layer is a stack of neurons that express the utility function of the differences between pairs of alternatives for predicting the DA probability: Consider −β 1 = β TR,1 − β DA,1 ; then, where x = (−INC, TT DA − TT TR ) can be viewed as a vector that includes both income and travel time.

2.
The second layer applies the logistic sigmoid function, σ(U DA − U TR ), as an activation function to squeeze the output of the linear utility function into the interval (0, 1).

3.
The third layer calculates the probability of choosing DA: . (6) The mode-split model is a composite function consisting of both a linear utility function (Equation (5)) and the sigmoid function (Equation (6)). Figure 3 shows how to describe this function using a CG.
There are seven operations in the expression: Two additions, three multiplications, one exponential operation, and one reciprocal operation. We can describe the mode-split model using the inputs, the outputs, and the following six intermediary variables as vertexes in a CG: In the CG plotted in Figure 3, each edge corresponds to a derivative. For example, edge 1 corresponds to the partial derivative of P DA with respect to d, and edge 2 corresponds to the partial derivative of d with respect to e. We evaluate the output value of P DA by setting the input variables/parameters to certain values and then computing the vertexes progressively through the graph in the upward direction. Let us set β 1 = 0.004, β 2 = 0.02, INC = 50 thousand dollars, TT DA = 30 min, and TT TR = 50 min. We obtain the same results as in Table 1 using the CG (P DA = 0.65).
Sensors 2019, 19, x FOR PEER REVIEW 8 of 31 There are seven operations in the expression: Two additions, three multiplications, one exponential operation, and one reciprocal operation. We can describe the mode-split model using the inputs, the outputs, and the following six intermediary variables as vertexes in a CG: = − , In the CG plotted in Figure 3, each edge corresponds to a derivative. For example, edge ① corresponds to the partial derivative of with respect to , and edge ② corresponds to the partial derivative of with respect to . We evaluate the output value of by setting the input variables/parameters to certain values and then computing the vertexes progressively through the To understand the partial derivatives in these cases, it is important to understand the chain rule in calculus. According to the chain rule, to calculate the partial derivative of P DA with respect to any parameter, we need to sum over all possible paths from vertex P DA to the vertexes of that variable/parameter in the graph while multiplying the derivatives on each edge on the same path. Examples are given below: can be calculated by multiplying the partial derivatives on the path 1 → 2 → 3 → 6 → 7 : Typically, "loss errors" can be propagated to update the estimated variables, e.g., β 1 and β 2 , across the CG using the stochastic gradient descent algorithm. This approach minimizes the sum of the The gradients, , can then easily be calculated easily as follows: where

Marginal Effect Analyses Using a Computational Graph
Interestingly, the above formulations can equivalently be used to derive the marginal effect equations obtained with discrete choice models in [19]: . (22) This relationship implies the potential importance of CGs in economics. Furthermore, the outstanding advantage of CGs is that, when a large number of parameters are involved in a real-world economic system, we can efficiently implement marginal analyses based on the principle of dynamic programming (DP) [69]. The DP principle can be applied in CGs to avoid duplicate calculations and sequentially update the derivatives along the "computing" paths in a backward fashion via the chain rule.
For example, in our case, for all three paths, we can first calculate: Then, we can reuse the intermediate result of -0.2284 to calculate the following marginal values: Provided that we estimate β 1 = 0.004 and β 1 = 0.02, we can also use the intermediate result of -0.2284 to calculate several important marginal values: These values imply that if the travel time of transit services decreases by 1 minute, then 0.46% more users will use the transit service system instead of private driving. Thus, the congestion in the driving network can be mitigated. By contrast, if the average income of users increases by 1 dollar, then 0.09% of users will use private driving instead of transit services.

Congestion Mitigation Strategies Based on the Computational-Graph-Based Learning System
In Sections 4.1-4.3, we used a compact formulation to rewrite the multilayer CG representation proposed in [46], which captures three steps of the traditional four-step process: Trip generation, trip distribution, and path-flow-based traffic assignment. In contrast to [46], we described the methodology from the perspective of control theories instead of mathematical optimization. In Section 4.4, we attempt to map different congestion mitigation strategies on the CG representation as controllable inputs. Then, the marginal effects of these strategies can be calculated to reflect whether the corresponding control policies can increase the welfare of the transportation system.

Variables
The three basic groups of estimated state variables used in the CG are as follows.

Group I: Demand Variables
α indicates the trip generation variable vector, containing the trip generation results from all zones. q indicates the trip distribution variable vector, containing the flow volume between all OD pairs. f indicates the path flow variable vector, containing all flow volume on each path in the candidate route set generated by the traffic assignment engine. v indicates the link flow variable vector, containing the flow volume on each link in the network.
To relate variables, f and v, we define δ as a link-route incidence parameter matrix that indicates whether a route passes through a particular link.

Group II: Proportional Variables
γ indicates the OD split rate variable matrix expressing the rate at which each OD pair is selected from each traffic zone.
ρ indicates the route choice proportion variable matrix, expressing the rate at which each route between each OD pair is selected.

Group III: Variables in Discrete Choice Models
In the field of traffic modeling, ρ can be determined by a multinomial logit model, which is the simplest type of stochastic network loading algorithm [18]. In the field of deep learning, the multinomial logit model is called the softmax function. The softmax function applies when we wish to represent a probability distribution over a discrete variable with multiple possible values. The softmax function can be regarded as a generalization of the logistic sigmoid function used in the illustrative example in the previous section: where P(w) represents all candidate routes, r, between OD pairs, and V(r) represents the set of links that are passed through on route r. We express the softmax function (logit model) as follows: where TC and TT are the travel time parameter vector and the toll parameter vector, respectively.
U r denotes the utility of route r, where TC r is the toll cost of the route, which is calculated by aggregating the link toll on each related link, and TT r is the observed travel time of the route, which is calculated by approximately aggregating the observed link travel time for each related link.
In the logit model, we have three estimated variable vectors, as follows: β 1 indicates the variable vector collecting all β w,1 between each OD pair, w. β 2 indicates the variable vector collecting all β w,2 between each OD pair, w. β indicates the variable vector collecting all β w between each OD pair, w.

System Equations to Express the Computational Graph
The overall traffic demand estimation problem can be represented by the CG G(V c , A c ) depicted in Figure 4. This CG can be generally viewed as a control system containing the following items.
where ( ) represents all candidate routes, , between OD pairs, and ( ) represents the set of links that are passed through on route . We express the softmax function (logit model) as follows: where TC and TT are the travel time parameter vector and the toll parameter vector, respectively.
denotes the utility of route , where is the toll cost of the route, which is calculated by aggregating the link toll on each related link, and is the observed travel time of the route, which is calculated by approximately aggregating the observed link travel time for each related link.
In the logit model, we have three estimated variable vectors, as follows: indicates the variable vector collecting all , between each OD pair, . indicates the variable vector collecting all , between each OD pair, . indicates the variable vector collecting all between each OD pair, .

System Equations to Express the Computational Graph
The overall traffic demand estimation problem can be represented by the CG ( , ) depicted in Figure 4. This CG can be generally viewed as a control system containing the following items.

State Variables
denotes the set of vertexes, including all state variables of the system, i.e., the estimated variable vectors and matrices listed in Section 4.1, and other parameters that can be expressed as scalars, vectors, matrices, or tensors.

State Transitions
denotes the set of directed edges that describe different operations in the traffic demand estimation model. If a variable, y, is computed by applying an operation to a variable, x, we draw a directed edge from the vertex representing x to the vertex representing y and annotate the vertex representing y with the name of the operation. Complicated functions are described by combining many elementary operations in a recursive fashion [69].

State Variables
V c denotes the set of vertexes, including all state variables of the system, i.e., the estimated variable vectors and matrices listed in Section 4.1, and other parameters that can be expressed as scalars, vectors, matrices, or tensors.

State Transitions
A c denotes the set of directed edges that describe different operations in the traffic demand estimation model. If a variable, y, is computed by applying an operation to a variable, x, we draw a directed edge from the vertex representing x to the vertex representing y and annotate the vertex representing y with the name of the operation. Complicated functions are described by combining many elementary operations in a recursive fashion [69].
The graph in Figure 4 expresses the following four constraints to enforce flow conservation and capture user choice behaviors (layer-based state transitions): where the values of all flow variables are required to be nonnegative. The above constraints can also be viewed as a state transition model in a control system.

Mapping of Data Measurements on the Computational Graph
Traffic demand variable values are propagated forward through the CG in Figure 4 until they reach the output layers, which are connected to external data sources. The outputs of the CG can then be compared to reference measurements to calculate the loss errors. The CG simultaneously uses traffic demand estimation models (e.g., four-step processes, in the field of transportation modeling) and multiple measurements from different data sources to generate estimates of the system's variables (its states) that are better than the estimates obtainable when using only a single measurement type. As such, CGs can be viewed as a common data fusion method. Furthermore, in the era of big data, we can directly use multiple data samples and the stochastic gradient descent (SGD) algorithm to represent the noise in the system instead of assuming complex probability distributions.
The loss functions used in this paper are as follows:

Measurements Associated with Trip Generation,ᾱ
We obtained a reference set of trip generation results for a zone using household survey data. The number of trips associated with this zone can be calculated by multiplying the population size by the rate of trips using private cars [28,75]. The population and trip rate can be calculated using the average numbers for different groups. Then, we have a reference set of trip generation results, α, and we can obtain the following loss function: where M 1 is the number of survey samples. Reference OD split rates can be generated from raw mobile phone data that match the given zoning system. Because it is technically difficult to identify the mode of transportation using mobile phone data, in this paper, we use mobile phone data to generate the reference OD split rates, γ:

Measurements Associated with the Origin-Destination Split Rates, γ
where M 2 is the number of mobile phone samples. Notably, the locations of cellular base stations are generally not consistent with traffic zones. Hence, some matching assumptions must be made to derive OD splits from cellular records. Furthermore, a rule should also be specified for identifying stationary activities [14,41].

Measurements Associated with the Route Choice Proportions, ρ
Floating car data can be matched to the driving network using matching algorithms [76,77]. Thus, we can obtain reference route choice proportions using floating car data: where M 3 is the number of floating car samples. It should be noted that floating car data are sampled data. We should again specify a rule for identifying stationary activities and then match the GPS records to a map using a map matching algorithm (e.g., a hidden Markov chain algorithm) [76,77].

Measurements Associated with the Link Flows, v
Sensor flow counts, v, are collected from a subset of links: where M 4 is the number of sensor data samples. Loop detector data are usually collected at regular time intervals (e.g., every 15 min). In this paper, we sum the counts for all lanes to generate the flow count on each link for each hour. Overall, the CG can be expressed in terms of the following optimization problem: s.t. Equations (32)- (35).
Here, we simply assume that all four objective functions are equally important.

Mapping of Congestion Strategies on the Layers of the Computational Graph
A CG can serve as a helpful tool to support the evaluation of a congestion mitigation strategy. The marginal effects of a policy can be calculated from the partial derivatives of a variable/parameter with respect to all other variables/parameters. That is, the gradients can be viewed as indicators of whether an imposed policy is reasonable. From the perspective of control theories, congestion mitigation strategies can be viewed as controllable inputs acting on traffic demand variables. Figure 5 shows how different strategies map to different variables in the CG. For example, pricing on paths/links can be viewed as controlling the route choice proportions, ρ, or the link flows, v. When planners aim to redesign urban layouts or relocate workplaces, the corresponding policies can similarly be viewed as controlling the variables, γ and q. Thus, we can define the marginal effect (ME) of such control as follows:

Algorithm
As shown in Figures 4 and 5, we can obtain a first-order partial derivative on each edge of the CG. Based on these derivatives, the loss errors (gradients) from the measurements are backpropagated to each vertex starting from the outputs. The gradients are calculated using the chain rule. The updating process is widely used in the field of deep learning in the context of the backpropagation (BP) algorithm [69]. In this paper, we implement the algorithm as follows: Step 1. Estimation: Step 1.1. The forward passing step implements trip generation, trip distribution estimation, and traffic assignment.
Step 1.2. The backward propagation step updates the estimated variables using the SGD algorithm.

Definition:
The marginal effect (ME) of a congestion mitigation strategy is defined as "the change in the overall negative utility of the system caused by a marginal change in the controlled variable". In this paper, the negative utility is measured as the total travel time for all users in the transportation network. Let the vector field, F(v), be a link travel time function mapping from R |v| to R |v| . Then, the ME can be calculated as: where ∆ represents the marginal changes in the link flows caused by the control (i.e., the policy). The physical meaning of the ME is the total reduction in "vehicle transport" on all links caused by the policy.

1.
If ME ≤ 0, the policy decreases the total travel time for users and has a positive effect.

2.
If ME > 0, the policy increases the total travel time for users and has a negative effect.
One advantage of a CG is that it can directly yield the reciprocal of ∆. Table 2 lists the congestion mitigation strategies considered and the variables they control in the CG. Table 2 also shows how to calculate 1/∆ based on the chain rule in calculus. Each 1/∆ in the table corresponds to a certain "pathway" propagated through the CG, as shown in Figures 4 and 5. The initial values of the variables before learning can be set with rerence to typical historical data.

Algorithm
As shown in Figures 4 and 5, we can obtain a first-order partial derivative on each edge of the CG. Based on these derivatives, the loss errors (gradients) from the measurements are backpropagated to each vertex starting from the outputs. The gradients are calculated using the chain rule. The updating process is widely used in the field of deep learning in the context of the backpropagation (BP) algorithm [69]. In this paper, we implement the algorithm as follows: Step 1. Estimation: Step 1.1. The forward passing step implements trip generation, trip distribution estimation, and traffic assignment.
Step 1.2. The backward propagation step updates the estimated variables using the SGD algorithm.
Step 1.3. Steps 1.1. and 1.2. are iteratively implemented until convergence is reached.
Step 2. Evaluation: Step 2.1. Different mitigation strategies are evaluated by calculating their MEs on link volume.
Because our proposed learning framework can be regarded as a kind of control system, in Table 3, we compare the above algorithm with the standard Kalman filter (KF) (i.e., a linear Gaussian state-space model) [5,[78][79][80]. The similarity between them is that they both provide a general framework for state estimation and data fusion. Both algorithms are recursive and follow a two-step process. The prediction step of the KF, corresponding to the forward passing process in Step 1.1 of our CG-based algorithm, produces the current estimates. The update process of the KF, corresponding to the backward propagation in Step 1.2 of our CG-based algorithm, updates the estimates based on the observed measurements. The difference between them is that they use different methods of updating their estimates. While the KF uses the Kalman optimal gain to update the state variables, the above algorithm implements a gradient-based update strategy. Furthermore, the KF applies a linear transition matrix to generate the next-stage estimates and uses covariance matrices to express the correlations between state variables. By contrast, our framework directly uses a CG to describe these relationships. Similar to other deep learning methods, our algorithm uses the SGD to capture the noise in the data, while the KF usually requires an assumed probability distribution.
It is important to note that the above algorithm can be easily implemented using the TensorFlow data programming architecture, which has been widely applied in many machine learning applications. To see the source code of our CG-based learning framework using the TensorFlow Python API, readers can refer to [81].

Numerical Examples
In this section, we present two numerical examples to illustrate the effectiveness of our methodology.

A Case Study Based on the Sioux Falls Network
We implemented our model based on multiple samples in the Sioux Falls network to validate the effectiveness of our framework. The input data (measurements) included two samples of hypothetical household survey data (trip generation results, α), two samples of mobile phone data (OD split rates, γ), two samples of floating car data (route choice proportions, ρ, for some routes), and three samples of sensor data (link flow counts, ν). We assumed that the data were collected during one hour of the morning peak time on several working days.
Notably, the accuracy of calibration will depend on the synthesized hypothetical data. To avoid discrepancies between different data sources, we designed one "seed" group of hypothetical data that satisfied the flow conservation constraints of Equations (36)- (39) in an ad hoc manner. The data set included the following: 1.
One sample of floating car data: We enumerated all candidate paths between the 20 OD pairs, then randomly selected 7 of these paths and adopted assumed route choice proportions for them. 4.
One sample of sensor data: We assigned assumed link counts to 7 links.
Then, we added some random perturbations to the "seed" samples to generate additional samples. The complete data set can be found in [81]. Figure 6 shows only the average values of our data set. All flows are expressed in units of vehicles/hour. Figure 7 shows the trend lines between the average estimated values and the average reference measurements. We find that reasonably satisfactory R-square values are achieved for all data sources. However, this finding does not imply that our proposed learning framework can always achieve good loss errors. When the inconsistency between different data sets increases, the obtained loss errors will worsen. However, this case study demonstrates that our proposed learning framework can simultaneously decrease the values of all four loss functions between the estimated values and the reference measurements. Figure 7 displays the convergence curves for the four different data sources. Because of the use of the SGD algorithm, the curves fluctuate, reflecting the random noise in the data. Figure 6 also displays four links (links (1, 3), (3,12), (12,13), and (13,24)) with estimated volumes of >390 vehicles/hour (i.e., link flow/link capacity > = 1.3). We find that sensors are not installed on any of these four links.

Analysis of Congestion Components
Because the CG-based learning framework internally integrates the four-step process of transportation modeling and enables the estimation of the flows on all paths in the candidate path set, traffic managers can apply the results to analyze the individual components of the congestion on each link. Figure 8 displays the congestion pie charts for links (3,12) and (13,24). then randomly selected 7 of these paths and adopted assumed route choice proportions for them. 4. One sample of sensor data: We assigned assumed link counts to 7 links. Then, we added some random perturbations to the "seed" samples to generate additional samples. The complete data set can be found in [81]. Figure 6 shows only the average values of our data set. All flows are expressed in units of vehicles/hour.  Figure 7 shows the trend lines between the average estimated values and the average reference measurements. We find that reasonably satisfactory R-square values are achieved for all data sources. However, this finding does not imply that our proposed learning framework can always achieve good loss errors. When the inconsistency between different data sets increases, the obtained loss errors will worsen. However, this case study demonstrates that our proposed learning framework can simultaneously decrease the values of all four loss functions between the estimated values and the reference measurements. Figure 7 displays the convergence curves for the four different data sources. Because of the use of the SGD algorithm, the curves fluctuate, reflecting the random noise in the data. Figure 6 also displays four links (links (1, 3), (3,12), (12,13), and (13,24)) with estimated volumes of >390 vehicles/hour (i.e., link flow/link capacity > = 1.3). We find that sensors are not installed on any of these four links.

Analysis of Congestion Components
Because the CG-based learning framework internally integrates the four-step process of transportation modeling and enables the estimation of the flows on all paths in the candidate path set, traffic managers can apply the results to analyze the individual components of the congestion on each link. Figure 8 displays the congestion pie charts for links (3,12) and (13,24).

Analysis of Congestion Components
Because the CG-based learning framework internally integrates the four-step process of transportation modeling and enables the estimation of the flows on all paths in the candidate path set, traffic managers can apply the results to analyze the individual components of the congestion on each link. Figure 8 displays the congestion pie charts for links (3,12) and (13,24).  Table 4 further shows how the volume on link (3,12) is composed of 10 path flows. As shown in Figure 8A and Table 4, the following three paths (bolded in Table 4) contribute the majority (a total of 70%) of the traffic volume on link (3, 12):  (3,12); (B) The OD volumes that contribute to the volume on link (3,12); (C) The trip generations that contribute to the volume on link (3,12); (D) The path flows that contribute to the volume on link (13,24); (E) The OD volumes that contribute to the volume on link (13,24); (F) The trip generations that contribute to the volume on link (13,24). Table 4 further shows how the volume on link (3,12) is composed of 10 path flows. As shown in Figure 8A and Table 4, the following three paths (bolded in Table 4) contribute the majority (a total of 70%) of the traffic volume on link (3,12): 1→3→12→13→24→23→22. Furthermore, by simply combining the corresponding path flows, it can be seen from Figure 8B and Table 4 that link (3,12) is mainly passed by users of OD pairs (2,24), (1,22), and (1, 24) (26%, 44%, and 12%, respectively). Figure 8C and Table 4 show that 61% of the flow on link (3,12) is generated from node 1. We can analyze the congestion components for link (13,24) in a similar manner; see Table 5 and Figure 8D-F. The pie charts show that 37% of the flow on link (13,24) is generated from node 13 and that most of the corresponding paths terminate at node 24. The detailed congestion components are presented in Table 4. This congestion component analysis provides us with useful information about the sources of congestion. Knowledge of the congestion pie charts is very important for selecting the zones, OD pairs, or paths where congestion mitigation strategies will have the greatest effect. For example, if we wish to decrease the flows on link (3,12), relocating some workplaces at node 22 to other zones might be an effective method.

Marginal Effect Analysis
In this experiment, we aimed to evaluate the MEs of various congestion mitigation strategies. We used the Bureau of Public Roads (BPR)-form function as the travel time function, F(v), in Equation (36). The current estimated link flows and the parameters of the BPR function are reported in Appendix A. Figure 9 shows the changes in the flows and travel times on each link. The MEs of the six strategies are displayed in Figure 9.
Several interesting observations can be found: 1.
In Figure 9A,B, the toll successfully decreases the flows and travel times on links (3,12) and (1,3). It also reduces the total travel time on all links. Interestingly, although both links (3,12) and (1,3) are congested (465 vehicles/hour and 583 vehicles/hour, respectively), imposing a toll of one extra dollar on link (1, 3) produces more benefit than doing the same on link (3,12). The former policy reduces the number of vehicles in the system by 44 (ME = 44 vehicles), while the latter results in a decrease of only 4 vehicles (ME = 4 vehicles). These findings demonstrate that similar pricing policies can have different effects.

2.
As shown in Figure 9C, if one user changes his/her destination from node 24 to node 9, then the traffic flows on links (3,12), (12,13), and (13,24) will decrease (by 0.7 vehicles/hour, 0.99 vehicles/hour, and 0.99 vehicle/hour, respectively). This figure justifies the importance of job-housing balancing in urban planning.

3.
We also find that the impacts of the policies on the flows and travel times are complex, with some mitigation strategies potentially decreasing the overall welfare of the system. For the scenario depicted in Figure 9D, the ME corresponds to an increase of 9.89 vehicles in the system. The policy actually decreases the flow levels (by approximately 0.4 vehicles/hour) on links (3,4), (4,5), and (12, 11) (links 6, 9, and 36, respectively, in the plot); however, these three links are not approaching their capacities in their current states (257.3 vehicles/hour, 116 vehicles/hour, and 216 vehicles/hour, respectively; see the Appendix A). Unfortunately, the strategy also guides additional traffic flows (approximately 0.9 vehicles/hour) to links (12,13) and (13,24), which are already congested (391 vehicles/hour and 615 vehicles/hour, respectively). This is the reason why, in the short term, sometimes the functional relocation of a metropolitan area can sometimes lead to a worse result than before. 4. Figure 9E shows that the strategy of "population transfer" achieves good performance in relieving traffic congestion. As seen in Figure 9F, if methods are implemented to make fewer people from zone 1 use private cars, this strategy will also apparently increase the overall utility of the system. In particular, these policies greatly reduce the flow on the congested link (13,24). The reason for the beneficial effects of these policies can be identified from the congestion component pie chart shown in Figure 9: In total, 42% of the flow on link (13,24) is generated from node 1. (C) ME analyses of removing one person from OD pair (1,22) and adding one person to OD pair (1,9); (D) ME analyses of removing one person from OD pair (1,11) and adding one person to OD pair (1,24); (E) ME analyses of removing one person from zone 1 and adding one person to zone 7; (F) ME analyses of removing one person from zone 1.
Several interesting observations can be found:  (1,22) and adding one person to OD pair (1,9); (D) ME analyses of removing one person from OD pair (1,11) and adding one person to OD pair (1,24); (E) ME analyses of removing one person from zone 1 and adding one person to zone 7; (F) ME analyses of removing one person from zone 1.

Application Study in a Beijing Subnetwork
To better demonstrate the applicability of the proposed CG-based learning framework, a median-scale experiment was conducted based on a subnetwork of Beijing with 2502 nodes, 236 zones, 14,967 OD pairs, 40,494 paths, and 5397 links. The traffic demand outside the subnetwork was merged in the traffic zones near the boundaries. We did not use mobile phone data or floating car data in this experiment. We utilized only the reference measurements for the trip generation results and OD split rates provided by the metropolitan planning organization. Loop detector data (from one hour on 113 links) from this organization were also applied. Hence, the objective function, minF 1 (α) + F 2 (γ) + F 4 (v), was used to integrate the data sources into the learning process.
The maximum number of iterations was set to 1000, and the initial learning rate was set to 0.00001. The procedure was run under Linux on a Dell PowerEdge T630 tower server with two Intel Xeon Quad CPUs, eight 16 GB of RAM, and 512 GB of SSD storage. The number of variables in the CG can be estimated as follows: Number of neurons in each layer + number of connections between layers ((236 + 14,967 + 40,494 + 5,397) + (14,967 + 40,494 + 40,494 × 5397) = 116,555). The total CPU time required for the learning process was 2 hours.
Several interesting results were obtained after the experiment: 1.
Estimated outputs: Figure 10A shows the physical network. The other panels in this figure illustrate the outputs estimated using different layers of the CG. Figure 10B displays the estimated trip distribution. As displayed in Figure 10D, the estimated link flows per hour per lane were also obtained using the proposed learning framework.

2.
Congestion analysis: Figure 10C shows one of the most congested links (i.e., link 4501, from node 2119 to node 2243). The estimated link flow is 8325 vehicles/hour. Based on the CG, there are a total of 2157 paths passing through link 4501. Figure 10C displays the two paths that most strongly contribute to congestion on this link (paths 20218 and 20219). Furthermore, the flow on the link comes from 557 OD pairs and 102 traffic zones. Figure 13A,B display the top 30 OD pairs and trip-generating traffic zones that contribute the most to the flow volume on link 4501. We also labelled the OD pairs and traffic zones associated with the largest volume on the map. We find that the congestion on link 4501 is primarily caused by traffic demands for travel from the southern urban area to the northern subarea. 3. ME analysis: Because link 4501 and zone 83 are near several universities and companies in Beijing, one possible congestion mitigation strategy is to move some workplaces from zone 83 to zone 117. We can calculate the ME of this policy as follows. Figure 11 shows the changes in travel times and volumes on related links. Interestingly, this strategy actually increases the total volume on the links. However, it can reduce the travel times on certain highly congested links. The ME is −37.1 vehicles, which implies that the policy can indeed mitigate congestion in the traffic system.

4.
Calibration: The estimation processes using different data sources are shown in Figure 12. During the experiment, we normalized the estimates and the references to lie within the range of [0, 10]. We found that the different objective functions can be simultaneously estimated during the learning process.  2. Congestion analysis: Figure 10C shows one of the most congested links (i.e., link 4501, from node 2119 to node 2243). The estimated link flow is 8325 vehicles/hour. Based on the CG, there are a total of 2157 paths passing through link 4501. Figure 10C displays the two paths that most strongly contribute to congestion on this link (paths 20218 and 20219). Furthermore, the flow on the link comes from 557 OD pairs and 102 traffic zones. Figure 11A,B display the top 30 OD pairs and trip-generating traffic zones that contribute the most to the flow volume on link 4501. We also labelled the OD pairs and traffic zones associated with the largest volume on the map. We find that the congestion on link 4501 is primarily caused by traffic demands for travel from the southern urban area to the northern subarea. 4. Calibration: The estimation processes using different data sources are shown in Figure 13. During the experiment, we normalized the estimates and the references to lie within the range of [0, 10].
We found that the different objective functions can be simultaneously estimated during the learning process.  4. Calibration: The estimation processes using different data sources are shown in Figure 13. During the experiment, we normalized the estimates and the references to lie within the range of [0, 10].
We found that the different objective functions can be simultaneously estimated during the learning process.

Conclusions and Future Research Plans
In a traffic system, congestion and inefficiency are fundamentally related to the relative balance between demand and supply. A traffic system can operate efficiently if the traffic demand can be Beijing, one possible congestion mitigation strategy is to move some workplaces from zone 83 to zone 117. We can calculate the ME of this policy as follows. Figure 12 shows the changes in travel times and volumes on related links. Interestingly, this strategy actually increases the total volume on the links. However, it can reduce the travel times on certain highly congested links. The ME is −37.1 vehicles, which implies that the policy can indeed mitigate congestion in the traffic system.

Conclusions and Future Research Plans
In a traffic system, congestion and inefficiency are fundamentally related to the relative balance between demand and supply. A traffic system can operate efficiently if the traffic demand can be accurately estimated. This paper proposed an integrated framework that combines traditional model-driven traffic demand estimation with emerging data-driven approaches based on a CG framework. This CG-based learning framework might contribute to filling the two research gaps mentioned in Section 1. First, the framework can help to overcome challenges related to data mining and fusion when processing big data from multiple heterogeneous sources. Second, congestion mitigation strategies can be integrated with the CG to evaluate the benefits and costs of the corresponding policies. A real-world case study was also presented to demonstrate the applicability of the framework.
However, the proposed framework also has some limitations that should be addressed in the future. First, we anticipate that the proposed CG-based framework can be developed to describe more realistic traffic demands. We can extend the learning framework based on the fundamental unit method [28], that is: α = g·t(p 1 , p 2 , . . . , p n ) (42) where g is a vector expressing the populations of different groups and the trip rate vector, t(p 1 , p 2 , . . . , p n ), expresses the relationships between trip rates and various influencing factors ((p 1 , p 2 , . . . , p n )). The learning framework can be used to calibrate the related parameters. For example, suppose that the trip rate of a certain population group is impacted by the price of gas, p 1 . Managers can impose gas taxes to reduce the trip rate of this group of people. When trip chain data are available, we can also apply a trip chain layer in place of the OD layer shown in 4 and reconstruct the CG based on activity-based models [80][81][82][83][84]. Second, the current framework has not fully considered several important traffic flow characteristics. We simply used the BPR function to describe the basic relationship between travel times and link flows, which is not accurate according to the fundamental diagrams in traffic flow theories. Finally, the only traffic mode considered in this research was driving in private cars. We also need to extend the CG by incorporating a mode-split layer or extending the path layer based on a supernetwork to capture both automobile and public transit networks [18]. This modification would allow managers to additionally impose strategies for adjusting the choices between different traffic modes. The final version of this demand estimation engine should serve as a decision-making support platform that can simulate transportation scenarios that cannot be practically realized on the basis of the closely interconnected relationship between traffic planners and travelers. It should have the capability of predicting and visualizing the evolution of multimodal traffic systems (including private cars, buses, and metro systems) and provide useful decision support advice for managers.

Conflicts of Interest:
The authors declare that there are no conflict of interest.