Deep Reinforcement Learning for the Co-Optimization of Vehicular Flow Direction Design and Signal Control Policy for a Road Network

Reinforcement Learning (RL) is a popular approach for deciding on an optimum traffic signal control policy to alleviate congestion in a road network. However, the traffic signal control policy can also be optimized in conjunction with the design of vehicular flow directions to further improve traffic performance. The design of vehicular flow directions refers to the right of way or directional restrictions imposed in a road network. Here, a new RL-based technique is presented for the co-optimization of the design of vehicular flow directions and the control policy for traffic signals. This technique consists of a two-step iterative process, wherein a set of vehicular flow directions for a road network is generated, and then an RL-based approach is used to train the traffic signal control policy over the given set of vehicular flow directions. Following the proposed technique, vehicular flow directions with poor traffic performance are iteratively eliminated, while new vehicular flow directions are generated to achieve better traffic performance and converge to the maximum possible expected traffic performance. The proposed RL-based technique is evaluated by using two examples under rush hour and non-rush hour traffic conditions. It is found that, compared to an RL-based approach in which only the traffic signal control policy is considered, the proposed approach can be used to obtain better traffic performance in terms of vehicular queue length and throughput.


I. INTRODUCTION
Increasing population growth and corresponding vehicle ownership have resulted in heightened traffic congestion, presenting significant challenges for transportation authorities. Traffic congestion can lead to vehicular queueing, travel delays, fuel consumption, economic loss, and pollution, particularly in urban areas, placing a newfound emphasis on the importance of intelligent urban planning [2]. Two key
factors that contribute to traffic congestion in road networks are (i) the design of vehicular flow directions and (ii) the control policy used for traffic signals. Vehicular flow direction design refers to the right of way directions for roads in a network [3]. For example, urban road networks are composed of a combination of roads with one- or two-way vehicular flow directions and turning restrictions at their intersections [4], [5]. With vehicular flow design, one specifies whether a road is restricted to a one-way or two-way direction, while the traffic signal control policy dictates traffic signal patterns for non-conflicting directions at intersections, controlling the vehicular flow between connected roads. The traffic signal control policy can be adjusted according to real-time traffic information, including data collected from GPS-equipped vehicles, navigation systems, and sensors [6]. Traffic performance in congested areas is significantly impacted by the vehicular flow design and the control of traffic signals, whose effects are inherently coupled. Different vehicular flow directions can result in distinct traffic signal control policies and vice versa. Due to this interdependence, the authors propose the simultaneous consideration of the design of vehicular flow directions and traffic signal control policies in order to optimize traffic performance in urban road networks.
In the literature, approaches to traffic signal control may be broken down into two broad categories: model-based optimization and model-free Reinforcement Learning (RL) techniques. Cong et al. [7] propose four model-based optimization approaches to jointly find an optimal road network topology and traffic signal controller policy. However, model-based techniques often rely on strong assumptions and specific traffic control models, making the generalization of this approach difficult. On the other hand, model-free RL techniques [8], [9] have two key benefits. First, they require less restrictive assumptions, and optimal control policies can be learned from available data, such as GPS-equipped vehicles and loop detector sensors, lending the approach to more general problem instances. Within the RL framework, optimal performance can be achieved through constant back-and-forth interaction between an agent and the traffic environment. Additionally, function approximation techniques can be used to improve computational efficiency in high-dimensional state-action spaces [10]. The efficiency of model-free RL approaches is especially important in traffic optimization problems, as the number of variables grows exponentially with the size of the network.
RL-based techniques have been successfully used for solving traffic signal control problems [8], [9]. Such approaches dominate much of the literature and differ in their characterizations of the state-action space and reward function, and the RL algorithm variations employed. Conventionally, the state space is defined in terms of real-time traffic information such as vehicular flow rate [8], [11], queue length [12], [13], and average delay time [14], [15]. Recent efforts incorporate image-like features [15], [16] to provide a more comprehensive description of the traffic conditions, with positions of vehicles along the lanes represented in the form of a binary matrix. The action space is traditionally defined as a set of traffic phase patterns. A traffic signal phase is used to specify the timing of the permission (green light) or restriction (red light) of vehicular flow directions. Generally speaking, the action space may be (i) step-based [11], [17], in which the traffic controller decides whether to switch or stay in a traffic phase for a pre-determined time duration, or (ii) phase-based [8], [18], [19], wherein the controller is used to decide the time duration for each traffic phase. The reward function, which is a measure of traffic performance, is typically defined as a weighted sum of several metrics, such as travel time [11], [20], queue length [13], [14] and vehicular throughput. Additionally, various RL algorithms have been investigated for learning optimal traffic signal control policies, namely Deep Q-learning Network (DQN) [10], [14], [21], policy gradient method [22], and actor-critic method [23]. Problem instances for evaluation also vary, spanning single intersection control [14], multi-intersection control [24], [25], and even massive-scale scenarios, such as the road network of Manhattan with 3,971 traffic signals [9]. Even though significant progress has been made within this domain, popular techniques only focus on the optimization of the traffic signal controller assuming a pre-determined or fixed design of vehicular flow directions.
The task of designing vehicular flow directions is commonly viewed as an endeavor disconnected from traffic signal control policy optimization. Nonetheless, the urban transportation network design problem is a popular topic in the literature, and is concerned with building new streets, expanding existing road capacity, designing public transportation networks (i.e., bus networks) and restricting turning directions based on current traffic scenarios [26]. The scope of road network design problems may be classified into three broad categories: strategic decisions, tactical decisions, and operational decisions. These resolutions range from long-term decisions that relate to the design of new infrastructures [27] to short-term decisions that improve the flow of traffic in preexisting networks [3]. The proposed approach has applications to both classes of road design problems, as it may be applied to find optimal vehicular flow design patterns and a corresponding optimal control policy for new networks or extensions to existing networks; furthermore, the proposed co-optimization framework can be applied to make real-time traffic flow modifications such as imposing turning restrictions based on traffic demand [28], [29]. The problem of simultaneously optimizing traffic control and vehicular flow design in such application cases is one of the main motivations of this work.
Another key motivation behind this paper is that traditional RL-based approaches, which are formulated as standard Markov Decision Processes (MDP) [30], do not consider vehicular flow direction design during the learning procedure. To resolve this issue, the authors couple the design and control optimization and employ a bi-level technique, one that first generates a set of candidate vehicular flow direction designs for the road network and then optimizes the traffic signal controller for each design. In this way, the authors are able to co-optimize the design and control while capturing their interdependent relationship.
The contributions made in this paper are as follows:
1) For the first time in the literature, an integrated approach to the traffic optimization problem is considered in this paper. The authors integrate the design of vehicular flow directions into the RL-based traffic signal control problem. This new extension helps simultaneously perform a heuristic search of the design space while optimizing the control policy, which has not been considered in the conventional MDP-based approaches. In this way, the proposed bi-level optimization approach is a new extension of a standard MDP [31], [32]. Furthermore, the proposed approach is data-driven and requires fewer assumptions than those made in conventional model-based techniques, for example, [7].
2) The incorporation of two new modeling schemes. First, a directed graph is used to model the road network, which permits the exhaustive generation of feasible vehicular flow designs [33]. Second, conventional Deep Q-Learning is extended to approximate traffic signal control performance for different vehicular flow direction designs and combined with a decaying random search strategy to explore the design space. In this way, the proposed DQN approach can be used to explore a diversified set of feasible design alternatives, and eventually converge to the best combination of the vehicular flow direction design and traffic signal control policy.
3) The merits and versatility of the proposed approach are illustrated by applying the obtained solution to two road network topologies, namely 4- and 12-intersection grid networks. Furthermore, the application allows for the specification of designed roads and incorporates a signal synchronization scheme, extending the applicability of the model. Through this application, the authors illustrate that the co-optimization of vehicular flow design and traffic signal control can scale up to larger networks and, as evidenced in the applications, outperform the conventional RL-based approach that considers traffic signal control only.
The rest of the paper is organized as follows. In Section II, the authors present an overview of MDP and DQN to establish the necessary background. The proposed RL-based approach is outlined in Section III, and includes the problem definition, a detailed description of the algorithm and the accompanying deep neural network, an overview of parameter selection procedures, and an outline of the model assumptions. Following that, in Section IV, the performance of the proposed technique is demonstrated by using two examples. Finally, some concluding remarks are offered in the last section, Section V.

II. BACKGROUND
A. MARKOV DECISION PROCESS (MDP)
MDPs are formally defined by the tuple {S, A, T, R, γ}, with finite environment state space S, action space A, transition function T : S × A × S → [0, 1], reward function R : S × A × S → R and discount factor γ. Given an MDP, an agent observes the state of the environment s_t ∈ S at timestep t and takes an action a_t ∈ A according to a control policy π : S × A → [0, 1]. The control policy π(a_t | s_t) returns the probability of taking an action a_t given state s_t. The transition function T(s_t, a_t, s_{t+1}) = P(s_{t+1} | s_t, a_t) dictates the transition from one state to the next at timestep t by considering the immediate reward r_t = R(s_t, a_t, s_{t+1}). The objective of the agent is to learn a control policy π* that maximizes the cumulative discounted reward at timestep t,

R_t = Σ_{k=0}^{T−t} γ^k r_{t+k},

where γ ∈ [0, 1) is a discount factor that weights the effect of immediate and future rewards, and T is the terminal timestep.
With MDP algorithms, one learns an optimal control policy based on a Q-value function, which quantifies the estimated expected reward for the agent's action in a particular state. The Q-value function is given by

Q^π(s, a) = E[ R_t | s_t = s, a_t = a, π ],

where E is the expected value operator. The function computes an expected reward starting from state s_t, taking an action a_t, and thereafter following policy π. The agent then chooses the action that yields the maximum Q-value at each timestep. Thus, the optimal Q-function, Q*(s, a) = max_π Q^π(s, a) = max_π E[R_t | s, a, π], provides the optimal policy π* by selecting the action a which maximizes the Q-value for the state s according to π*(s) = arg max_a Q*(s, a), ∀s ∈ S.
Through the lens of dynamic programming, the Q-value for state-action pairs is learned via the Bellman equation [34]

Q^π(s_t, a_t) = E[ r_t + γ Q^π(s_{t+1}, π(s_{t+1})) ].
In practice, the estimates for Q^π are updated with a learning rate α as

Q(s_t, a_t) ← Q(s_t, a_t) + α ( y_t − Q(s_t, a_t) ),

where y_t = r_t + γ max_a Q(s_{t+1}, a) is the Temporal Difference (TD) target, which is used to specify the target reward value based on previous estimates. As an off-policy method, Q-learning updates the agent's action values by maximizing Q-values over the actions in a greedy fashion.
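To make this update concrete, the following minimal Python sketch (an illustration only; the paper's implementation is built in MATLAB, as described in Section IV) performs one tabular Q-learning step with placeholder state and action indices.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q[s, a] toward the TD target y_t."""
    y = r + gamma * np.max(Q[s_next])      # y_t = r_t + gamma * max_a' Q(s_{t+1}, a')
    Q[s, a] += alpha * (y - Q[s, a])       # Q(s_t, a_t) <- Q(s_t, a_t) + alpha * (y_t - Q(s_t, a_t))
    return Q

# toy usage: 5 states, 3 actions, one observed transition
Q = np.zeros((5, 3))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```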

B. DEEP Q-LEARNING
Deep Q-learning is an improved version of classical reinforcement learning wherein the Q-value function is approximated with a deep neural network, or Deep Q-Learning Network (DQN) [35]. Deep RL algorithms lend themselves to high-dimensional and complex systems, for which standard RL approaches are ineffective at learning the features necessary for function approximation. Within the DQN framework, the Q-value function is approximated as Q(s, a; θ) ≈ Q*(s, a), where θ are the neural network parameters [37]. Conventional DQNs take in the state of the environment as input, estimate the Q-value by using a fully connected neural network architecture to train the network, and then output an action according to (3). While many modifications to traditional Deep RL algorithms exist, two techniques introduced by Mnih et al. [36] have been shown to significantly stabilize learning in DQNs: experience replay and a target network. Experience replay is used to update the Q-network based on past experiences, so as to mitigate the potential of harmful correlations leading to diverging action values. At each timestep, the DQN agent interacts with the environment, obtains the data (s_t, a_t, r_t, s_{t+1}) and stores the data in the memory store D. During training, mini-batches are sampled uniformly from D, and the Q-network is updated according to a loss function

L(θ) = E_{(s,a,r,s′)∼U(D)} [ ( r + γ max_{a′} Q(s′, a′; θ−) − Q(s, a; θ) )² ],

where θ− represents the parameters of the target neural network and E is the expected value over the uniform distribution U(D) of the replay memory D. A target network is further employed to stabilize learning by designating two separate networks in the DQN: the main network that approximates the Q-function and the target network that computes the Temporal Difference (TD) target update for the main network. While the main network parameters θ are updated during each iteration in training, the target network parameters θ− are updated only after a user-specified number of timesteps. The use of a target network also allows for a Double Dueling DQN (DDQN), which extends the standard Q-learning algorithm with a single estimator to one with two estimators [38].
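The following PyTorch sketch illustrates the two stabilization techniques described above, experience replay and a target network; the layer sizes, learning rate, and buffer length are placeholders and are not the settings used in the paper.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small fully connected Q-network: state -> one Q-value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)

state_dim, n_actions, gamma = 8, 4, 0.99
q_net = QNet(state_dim, n_actions)
target_net = QNet(state_dim, n_actions)
target_net.load_state_dict(q_net.state_dict())        # target network starts as a copy (theta^- = theta)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
replay = deque(maxlen=10_000)                          # experience replay memory D of (s, a, r, s') tuples

def train_step(batch_size=32):
    """Sample a mini-batch from D and take one gradient step on the TD loss."""
    if len(replay) < batch_size:
        return
    s, a, r, s2 = map(torch.stack, zip(*random.sample(replay, batch_size)))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)      # Q(s, a; theta)
    with torch.no_grad():
        y = r + gamma * target_net(s2).max(dim=1).values          # TD target computed with theta^-
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# after a user-specified number of timesteps, copy theta into theta^-
target_net.load_state_dict(q_net.state_dict())
```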

III. PROPOSED REINFORCEMENT LEARNING APPROACH
In the following section, the authors outline the co-optimization approach to traffic design and signal control. First, a description of the RL environment is offered by formally defining the state-action space and the reward function. Next, the authors propose a new extension of an MDP, which formulates the co-optimization problem at hand in an RL setting. Then, a directed graph model is presented, which is used to determine the feasibility of vehicular flow directions. Finally, the co-optimization framework and corresponding assumptions are discussed.

A. ENVIRONMENT DEFINITION
The problem objective is to co-optimize the design of vehicular flow directions and the control of traffic signals at one or multiple intersections such that the overall traffic performance in the road network is maximized. For traffic performance, one considers the number of vehicles that safely pass through intersections and the congestion of vehicles in the road network. To solve this traffic optimization problem, the authors use an RL framework that is defined by the state space, the action space, and a reward function. The state space S represents the traffic state in the road network and the action space A is a set of actions the agent may take at each timestep. The reward function R, which drives the learning process of the agent, quantifies the current traffic state. The goal of the RL agent is to learn a policy that maximizes traffic performance by observing the current state of the system and choosing an action at each timestep.

1) STATE
The state space captures the information received during the agent's observation of the environment. Suppose there are m intersections in the road network, where intersection i is composed of n_i roads. The state of the environment is defined by three variables: the head-car distance from the intersection, the number of vehicles driving towards each intersection, and the head-car turning intention. Formally, the observed state at time t is given by the matrix s_t ∈ R^{N×3}, where N = Σ_{i=1}^{m} n_i is the number of roads flowing into all intersections, referred to as in-roads. Row j of the matrix, where 1 ≤ j ≤ N, represents the observation of in-road j and is stored as the tuple (d_t^j, ν_t^j, τ_t^j), where d_t^j is the distance between the leading vehicle on in-road j and the intersection, referred to as the head-car distance, ν_t^j ∈ N is the number of cars on road j, and τ_t^j ∈ {−1, 0, 1} is the head-car intention (left-turn, straight ahead or right-turn). At each timestep 0 ≤ t ≤ T, the agent observes the current state s_t and receives a reward based on this current state.
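As a small illustration of this state representation, the sketch below (Python, with made-up observation values) stacks the per-in-road observations into the N × 3 matrix s_t.

```python
import numpy as np

def build_state(in_roads):
    """Stack per-in-road observations into the N x 3 state matrix s_t.

    in_roads: list of dicts with keys 'head_dist' (distance of the leading
    vehicle to the intersection), 'n_vehicles', and 'intention' in {-1, 0, 1}.
    """
    return np.array([[road["head_dist"], road["n_vehicles"], road["intention"]]
                     for road in in_roads], dtype=float)

# toy example with N = 3 in-roads
s_t = build_state([
    {"head_dist": 12.5, "n_vehicles": 4, "intention": -1},   # head car turns left
    {"head_dist": 3.0,  "n_vehicles": 1, "intention": 0},    # head car goes straight
    {"head_dist": 27.8, "n_vehicles": 7, "intention": 1},    # head car turns right
])
print(s_t.shape)   # (3, 3)
```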

2) REWARD
The traffic performance is calculated based on a weighted sum of two normalized criteria, where the weights are assumed to be predetermined by a traffic authority. The reward function R : S × A × S → R is used to quantify the current traffic performance, where higher values indicate a better traffic condition. The first criterion is the cumulative vehicle throughput for all intersections in a road network. Let η_t^i be the number of vehicles that pass through intersection i during timestep t. The cumulative vehicle throughput is defined as the sum of the number of vehicles passing through all intersections per unit of time, Σ_{i=1}^{m} η_t^i. The second criterion is the cumulative queue length of vehicles per unit time for all intersections. The queue length q_t^j of in-road j, where 1 ≤ j ≤ N, at time t is defined as the number of vehicles waiting to be served by a traffic signal. The cumulative queue length is then calculated by Σ_{i=1}^{m} max_{j ∈ n_i} q_t^j, where n_i is the set of indices of in-roads to intersection i. Thus, the cumulative queue length measures the sum of the worst, or longest, queue length at each intersection. The reward r_t at time t is calculated by using the weighted sum

r_t = w_f Σ_{i=1}^{m} η_t^i − w_q Σ_{i=1}^{m} max_{j ∈ n_i} q_t^j,

where w_f and w_q are predetermined weighting parameters set by traffic authorities. The objective of the agent is to follow a policy that maximizes this reward and thereby improve traffic performance in the network.
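A minimal sketch of this reward computation is given below; combining the two criteria as throughput minus the worst-queue sum (i.e., penalizing long queues) is an assumption about the exact weighted sum, and the weights w_f and w_q are placeholders.

```python
def reward(throughput_per_intersection, queues_per_intersection, w_f=1.0, w_q=1.0):
    """Weighted traffic-performance reward r_t.

    throughput_per_intersection: eta_t^i for each intersection i.
    queues_per_intersection: for each intersection i, the queue lengths q_t^j
    of its in-roads. Only the worst queue per intersection is penalized; the
    negative sign on the queue term is an assumption for this sketch.
    """
    eta_t = sum(throughput_per_intersection)                      # cumulative throughput
    q_t = sum(max(queues) for queues in queues_per_intersection)  # sum of worst queues
    return w_f * eta_t - w_q * q_t

r_t = reward([3, 5], [[2, 0, 4], [1, 6, 3]])   # two intersections, toy values
```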

3) ACTION
Given the vehicular flow direction design for a road network, a centralized traffic control policy π is assumed to control the traffic signals for all intersections. Each traffic signal can take on a phase from a predetermined, finite set of phase patterns; therefore, the agent is used to take phase-based actions. For the road networks considered in this paper, it is assumed that a traffic signal controls the flow of traffic for each road entering the intersection. A phase is used to assign the permission (green light) or restriction (red light) to a combination of non-conflicting vehicular flow directions through an intersection. For example, in the four-legged intersection shown in Figure 1b, there can be four signal phases as shown in Figure 2. In general, an n_i-legged intersection is controlled by n_i traffic signals with n_i phase options. The goal of the centralized traffic controller is to choose a signal phase for each intersection at each timestep. For a road network with m multi-legged intersections, the traffic signal controller is used to decide on a combination of signal phases a_t = (a_t^1, a_t^2, . . . , a_t^m) for the intersections. Thus, at each time t, the total number of signal phase combinations is n_1 × n_2 × · · · × n_m.
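The joint action described above is simply the Cartesian product of the per-intersection phase options; the short sketch below enumerates it for a toy example with two four-legged intersections.

```python
from itertools import product

# phase options per intersection: here m = 2 four-legged intersections with 4 phases each
phase_options = [range(4), range(4)]

# every joint action a_t = (a_t^1, ..., a_t^m); |A| = n_1 * n_2 = 16 in this toy case
joint_actions = list(product(*phase_options))
print(len(joint_actions))    # 16
print(joint_actions[:3])     # (0, 0), (0, 1), (0, 2)
```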

4) DESIGN
The vehicular flow direction design is defined by the vector x = (x^1, x^2, . . . , x^N) ∈ {−1, 0, 1}^N, where N is the number of roads to be designed in the network, and −1, 0 and 1 represent one-way clockwise, two-way and one-way counterclockwise flow directions, respectively. The vehicular flow direction of a road can be designed to have a one- or two-way direction, for a total of three design options per road. Examples of vehicular flow directions are shown in Figure 1. Consider the design of vehicular flow directions for the roads in a road network composed of N roads and m multi-legged intersections, where intersection i has n_i roads. Since each road has three design options (two one-way vehicular flow directions and one two-way flow direction), there are 3^N possible design options for the network. However, not all design options are feasible, since conflicts in vehicle flow may arise, especially as the size of the network grows. Further, note that the subset of feasible signal phases also depends on the flow directions of the roads into intersections. Within the co-optimization framework, the vehicular flow design acts as a component of the state observation space.
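The design encoding can be enumerated directly, as in the sketch below; the number of designed roads is a placeholder, and feasibility screening (next subsection) is still required.

```python
from itertools import product

N_DESIGN_ROADS = 4   # placeholder number of internal roads to design

# every candidate design x in {-1, 0, 1}^N: -1 one-way clockwise, 0 two-way,
# 1 one-way counterclockwise; there are 3**N candidates before feasibility screening
candidate_designs = list(product((-1, 0, 1), repeat=N_DESIGN_ROADS))
print(len(candidate_designs))    # 81 candidates for N = 4
```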

B. MDP EXTENSION
The co-optimization of vehicular flow direction design and traffic signal control is formulated as an extension of a conventional MDP, defined by the tuple {X, S, A, T, R, γ}, where X is the design space. In the proposed approach, a heuristic search of the design space is integrated with a traditional MDP to create a bi-level optimization framework, as illustrated in Figure 4. The action a_t ∈ A taken by the agent dictates the traffic signal control policy π, and this action is taken based on both the state of the environment and the vehicular flow design variable. Thus, the policy is redefined as the function π : S × A × X → [0, 1], where π(a_t | s_t, x) is used to compute the probability of taking action a_t given current state s_t and design x. The vehicular flow design in a road network governs the traffic controller's interaction with the environment through a transition probability function T(s_t, a_t, s_{t+1}, x) = P(s_{t+1} | s_t, a_t, x). The transition function, formally defined as T : S × A × S × X → [0, 1], dictates the transition from one state to the next at each timestep. In this way, the reward function R : S × A × S × X → R is also dependent on the vehicular flow direction design, with the immediate reward given by r_t = R(s_t, a_t, s_{t+1}, x). The objective is to maximize the expected future reward R by co-optimizing the vehicular flow direction design x and traffic signal control policy π, with the optimal Q-value

Q* = max_{x, π} E[ R_0 | s_0, x, π ],

where s_0 is the initial state. As before, the agent's objective is to learn a control policy π and design x that achieve this maximum expected future reward, which is approximated by the Q-value function in (2) and (4).

C. DIGRAPH MODEL FOR FEASIBLE DESIGN CLASSIFICATION
The design of the vehicular flow directions for a road network can be modeled by a directed graph G = (V, E), with the vertex set V and the edge set E. In the model, vertices represent roads in the network and edges represent vehicular flow between roads. Specifically, an edge e_{ii′} ∈ E corresponds to a vehicular flow direction from road v_i to road v_{i′}. In this way, the vehicular flow direction design for a road network can be visualized as a directed graph, as shown in Figure 3. The directed graph model allows for the classification of feasible designs [39]. A road design is feasible if the injection roads into the network are strongly connected, that is, if there is a directed path between all pairs of vertices that represent injection roads. To determine whether injection roads are strongly connected, the authors employ Kosaraju's algorithm [33], in which one uses depth-first search to recursively traverse the directed graph to find strongly connected components.
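A sketch of this feasibility test is given below in Python, using networkx to compute strongly connected components (the paper uses Kosaraju's depth-first-search algorithm for the same purpose); the toy road names and injection roads are placeholders.

```python
import networkx as nx

def is_feasible(flow_edges, injection_roads):
    """A design is feasible if all injection roads are mutually reachable,
    i.e., they all lie in a single strongly connected component of G = (V, E).

    flow_edges: (road_i, road_j) pairs, one per permitted flow direction.
    injection_roads: vertices that inject vehicles into the network.
    """
    g = nx.DiGraph(flow_edges)
    for component in nx.strongly_connected_components(g):
        if set(injection_roads) <= component:
            return True
    return False

# toy digraph: roads a-d on a directed cycle, injections on a and c
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
print(is_feasible(edges, ["a", "c"]))    # True
```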

D. CO-OPTIMIZATION FRAMEWORK
The DQN approach to the aforementioned RL technique may be extended to account for the co-optimization of design and control as well. In addition to the traffic signal controller inputs, the Q-value function is extended to account for the design variables as additional inputs: Q(s, a, x; θ) ≈ Q*(s, a, x), with neural network parameters θ. As such, the objective is to find the design and control policy that jointly maximize the Q-value function, max_{x ∈ X} max_π Q^π(s, a, x). Following this formulation, the optimization of vehicular flow direction design and traffic signal control are coupled. The Q-value function is used to estimate the traffic control performance for different designs, which is then used to eliminate poor performing feasible designs until an optimal design is achieved. Once a vehicular flow direction design is fixed, the optimal value of the Q-value function, Q* = max_π Q^π, yields an optimal control policy π*(a_t | s_t, x) for this optimal design. The combination of signal phases a_t for multiple intersections is decided by obtaining the maximum Q-value as a_t ∈ arg max_a Q*(s_t, a, x).
The co-optimization is done by using a co-learning approach, in which one trains the Q-value function to simultaneously optimize the vehicular flow direction design and the traffic signal controller. In the approach, first, the Q-value function is approximated by using a neural network Q(s, a, x; θ), where the parameters of the neural network θ are randomly initialized. Then, an iterative optimization process follows for updating the Q-value function. The computational framework is illustrated in the flowchart of Figure 4 and outlined in Algorithm 1.

Algorithm 1
1: Initialize Q-network parameters θ randomly and set target parameters θ− = θ
2: Initialize replay memory D
3: Generate all possible vehicular flow direction designs
4: Determine the feasible design set X by using Kosaraju's algorithm (Section III-C)
5: while more than one feasible design remains do
6:   Generate sample of N feasible designs from U(X)
7:   for i = 1, . . . , N do
8:     Initialize design x ∼ U(X)
9:     for t = 1, . . . , T do
10:      Take action a_t = arg max_a Q^π(s_t, a, x; θ) with probability 1 − ϵ
11:      Store tuple (s_t, a_t, r_t, s_{t+1}, x) in replay memory D
12:      Sample D tuples randomly from D
13:      Compute the TD targets for the sampled tuples
14:      Update θ by stochastic gradient descent on the loss below
15:      Set θ− = θ every M steps
16:    end for
17:  end for
18:  Eliminate bottom k designs from X based on the Q-value estimation
19:  Set N = N · δ
20:  Check whether converged to one design option
21: end while

Algorithm 1 is begun by generating all possible designs and determining a set of feasible designs according to the directed graph model outlined in Line 4. Following these initialization steps, the learning process is started by generating a sample of N designs from a uniform distribution of all feasible designs in Line 6. At this point, the RL training process to learn an optimal policy π is begun in Lines 5-21. By choosing a new design x ∼ U(X) at the beginning of each training iteration in Line 8, the agent attempts to learn a control policy that is optimal for all feasible design candidates. At each timestep, the centralized traffic signal controller decides on a combination of signal phases a_t according to the current control policy π(a_t | s_t, x) such that a_t ∈ arg max_a Q(s_t, a, x) (Line 10). The agent receives an immediate traffic performance reward r_t and observes the resulting traffic state s_{t+1} in Line 11. After completing one training iteration, a set of traffic data for vehicular flow direction design x is collected as {s_0, a_0, r_0, s_1, . . . , s_{T−1}, a_{T−1}, r_{T−1}, s_T, x}. In the next step, the Q-value function is updated with the collected traffic data using experience replay in Line 13. A mini-batch of D traffic datapoints is uniformly drawn at random from the stored traffic data D. Then, the stochastic gradient-descent (SGD) method is used in Line 14 to update the Q-value function parameters θ by minimizing the loss function

L(θ) = E_{(s,a,r,s′,x)∼U(D)} [ ( r + γ max_{a′} Q(s′, a′, x; θ−) − Q(s, a, x; θ) )² ],

where θ− is the set of neural network parameters from the previous training iteration [37]. During the procedure, the Q-value function is updated based on data from a diversified set of feasible vehicular flow direction designs. As such, the agent learns a policy π that is optimal for all designs in the current feasible set. Next, the updated Q-value function is used to approximate the traffic performance on the current set of feasible designs. The estimation of traffic performance for a single design x is calculated by the summation of the Q-values for a set of traffic states, given the current control policy,

Σ_{s ∈ s_x} max_a Q(s, a, x; θ),

where s_x is a randomly generated set of traffic states that corresponds to design x. Once a full training cycle is complete, this Q-value estimation is used to quantify the performance of the sample of vehicular flow direction designs. Based on this estimation, a portion of the design alternatives with the lowest estimated traffic performance is removed from further consideration in Line 18. This portion is dictated by an elimination rate k ∈ (0, 1), which determines the number of designs that are eliminated from the optimization. The elimination rate is determined via experimentation to balance the speed of convergence and the quality of the solution obtained. The final step is to check whether there is only a single vehicular flow direction design remaining. If not, the approach is used to explore the remaining design candidates by sampling another round of designs and continuing to train the signal control policy. This process is repeated until only one candidate design remains. Thereafter, the traffic signal controller is trained by using that design until the agent converges to an optimal control policy.
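The outer loop of Algorithm 1 can be summarized by the following simplified Python sketch; train_one_episode and estimate_performance are hypothetical placeholders for the inner DQN training episode and the Q-value-based performance estimate described above, and the default rates are illustrative.

```python
import math
import random

def co_optimize(feasible_designs, train_one_episode, estimate_performance,
                sample_size, elim_rate=0.75, decrease=0.8):
    """Simplified outer loop: train a shared DQN over sampled designs, then
    repeatedly eliminate the worst-performing designs until one remains."""
    designs = list(feasible_designs)
    n = sample_size
    while len(designs) > 1:
        for _ in range(n):
            design = random.choice(designs)       # x ~ U(X)
            train_one_episode(design)             # one T-step episode with this design
        # rank the remaining designs by their estimated traffic performance
        designs.sort(key=estimate_performance, reverse=True)
        keep = max(1, math.ceil(len(designs) * (1.0 - elim_rate)))
        designs = designs[:keep]                  # drop the bottom k fraction
        n = max(1, int(n * decrease))             # shrink the sample size by delta
    return designs[0]                             # control training continues with this design
```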
The computational complexity of the proposed approach is dependent on the number of timesteps T in one simulation, the sample size of designs N, and the size of the action space |A|, which is a function of the number of intersections m. Namely, in a network with m L-legged intersections, one has |A| = L^m. To analyze the complexity of Algorithm 1, the authors consider the number of function calls performed in one training iteration. A function call is defined as an instance when the agent calls the reward function or the Q-value function [40], [41]. As a result, if one neglects the one-time cost of Kosaraju's algorithm in Line 4 and the neural network parameter updates via SGD, the worst-case cost at time t is L^m. Overall, in a network with sample size N and T timesteps, the algorithm requires N · T · L^m, that is, O(L^m), function calls per iteration. The authors explore ways to mitigate this complexity in Section III-G.

E. NEURAL NETWORK ARCHITECTURE
In the proposed co-optimization framework, the DQN plays a similar role as in traditional RL-based traffic control algorithms. The DQN outputs an estimated Q-value for each action a ∈ A, denoted by Q(s, a, x; θ), which approximates the value of taking action a given the current state of the environment. The agent uses this estimation to choose an action a ∈ arg max_a Q(s, a, x), which maximizes the future expected reward. The input of the DQN is extended to include the design variable x in addition to the current state of the environment s. As such, the input layer is a vector of size 3N + K, where K is the number of roads to design in the network. The output is then a vector of size Σ_{i=1}^{m} n_i, assuming that intersection i has n_i possible phase options. The hidden layers of the DQN are fully connected layers with Rectified Linear Unit (ReLU) activation functions, the sizes of which are determined via a grid search of hyperparameters, outlined in Section III-F2. A target network is employed to stabilize training, as outlined in Section II-B. This network has the same architecture as the main network, shown in Figure 5; however, it takes in the previous state s′ and design x, and outputs Q(s′, a′, x; θ−) to evaluate the successive action a′. In the proposed technique, the authors also make use of a DDQN and use the target network estimation to evaluate the action selected by the main network, by updating the TD target according to (5).
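A PyTorch sketch of a network with this input-output structure is shown below; the 512- and 400-node hidden layers follow the sizes reported in Section IV-A, while the in-road, design, and output dimensions in the usage line are placeholders.

```python
import torch
import torch.nn as nn

class DesignAwareDQN(nn.Module):
    """Q-network that takes the traffic state together with the design vector.

    Input size 3N + K (N in-roads with 3 features each, K designed roads);
    two fully connected ReLU hidden layers; one output per signal phase option.
    """
    def __init__(self, n_in_roads, n_design_roads, n_phase_outputs):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * n_in_roads + n_design_roads, 512), nn.ReLU(),
            nn.Linear(512, 400), nn.ReLU(),
            nn.Linear(400, n_phase_outputs),
        )

    def forward(self, state, design):
        # flatten the N x 3 state matrix and append the design vector x
        x = torch.cat([state.flatten(start_dim=1), design], dim=1)
        return self.net(x)

net = DesignAwareDQN(n_in_roads=16, n_design_roads=4, n_phase_outputs=16)
q_values = net(torch.zeros(1, 16, 3), torch.zeros(1, 4))   # batch of one observation
```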

F. PARAMETER SELECTION EXPERIMENTS
The proposed co-optimization approach is dependent on a number of parameters that impact its performance and the stability of the learning process. In particular, there are two sets of parameters that impact algorithm performance: design sampling parameters and DQN hyperparameters. The former influence how designs are sampled from the feasible design set, and the latter affect the approximation of Q-values in the deep neural network. Both sets of parameters are determined based on the results of a grid search of the parameter space, as outlined below. For both experiments, the impact on stability, measured by the standard deviation of the agent's reward at each training episode, and on traffic performance, quantified by the average episode reward received by the agent, is recorded.

1) DESIGN SAMPLING PARAMETERS
Design sampling parameters govern the manner in which designs are chosen from the feasible design set at the beginning of each training iteration. The following four design sampling parameters are tested for their impact on learning stability and performance: sample size N, elimination rate k, decreasing factor δ, and distribution of designs (weighted or uniform). The sample size N ∈ Z dictates how many designs are sampled from the feasible design set X at Step 6 in Algorithm 1. In all cases, the sample size N exceeds the number of feasible designs; thus, the traffic controller agent trains on the same design for multiple iterations. At the beginning of each training episode, a feasible design is sampled from a distribution of candidate designs, which dictates the design of vehicular flow directions in the agent's environment for all timesteps in the ensuing episode (Step 8 in Algorithm 1). The elimination rate k ∈ (0, 1) determines the ratio of designs that are eliminated from consideration after each training cycle. Poor performing designs, as determined by the cumulative estimated Q-values of the previous iteration according to (12), are eliminated from the candidate design set at the end of each training cycle until one design remains. As such, the elimination rate directly impacts the speed of design convergence in the proposed algorithm. The decreasing factor δ ∈ (0, 1] governs the update to the sample size made after each training cycle in Step 19 of Algorithm 1. The rationale behind the decreasing factor δ is that the sample size of feasible designs should decrease with the number of feasible candidate designs. Finally, the distribution used in the sampling dictates how designs are sampled from the feasible candidate set X. A uniform sampling distribution selects design x_i with equal probability p_i = 1/|X|, while a weighted distribution selects design x_i with a probability that increases with R̄_i, the average reward obtained with design x_i during the previous training cycle. Defined in this way, a weighted sampling will sample better performing designs with higher probability.
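The difference between the two sampling distributions can be sketched as follows; weighting each design proportionally to its shifted average reward is an assumption made for illustration, since the exact weighting formula is not reproduced here.

```python
import random

def sample_design(designs, avg_rewards=None):
    """Draw one design for the next training episode.

    With avg_rewards=None the draw is uniform (p_i = 1/|X|); otherwise designs
    with higher average past reward are drawn more often (proportional weighting
    is an assumption of this sketch).
    """
    if avg_rewards is None:
        return random.choice(designs)
    shift = min(avg_rewards)                           # keep the weights non-negative
    weights = [r - shift + 1e-6 for r in avg_rewards]
    return random.choices(designs, weights=weights, k=1)[0]

designs = ["x1", "x2", "x3"]
print(sample_design(designs))                                  # uniform draw
print(sample_design(designs, avg_rewards=[2.0, 5.0, 1.0]))     # biased toward x2
```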

2) DQN HYPERPARAMETERS
Deep Q-learning approaches are sensitive to the tuning of various hyperparameters that dictate the structure of the neural network and its ability to learn features during training. Such parameters include the quantity and size of layers in the network, the learning rate α, the ϵ-decay rate, the experience replay buffer length, and the mini-batch size D. A combination of these five hyperparameters was chosen through experimentation for use with each application network through a grid search of the parameter space. The standard deviation of episode rewards during training and the average episode reward were used as metrics of comparison to quantify the impact of each hyperparameter on learning stability and performance. Results and analysis from the hyperparameter experimentation are presented in Section IV-A.
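The grid search itself amounts to enumerating all combinations of candidate values and training one agent per combination, as in the sketch below; the candidate values are placeholders rather than the grids used in the experiments.

```python
from itertools import product

# placeholder candidate values for the five hyperparameters discussed above
grid = {
    "hidden_layers": [(512, 400), (256, 256)],
    "learning_rate": [5e-5, 1e-4],
    "eps_decay": [1e-3, 5e-3],
    "replay_length": [10_000, 50_000],
    "batch_size": [256, 512],
}

# one parameter set per combination; each would be scored by average episode
# reward (performance) and its standard deviation (stability)
combos = [dict(zip(grid.keys(), values)) for values in product(*grid.values())]
print(len(combos), "parameter sets to evaluate")
```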

G. MODEL ASSUMPTIONS
While the proposed approach is applicable to general road networks, the design and action spaces scale exponentially with the number of roads and intersections, respectively, so reductions of these spaces must be imposed to allow for application to larger road networks. Furthermore, a number of simplifying assumptions were made during the implementation of the proposed approach. To increase the applicability of the approach and address the scalability of the model, two simplifications are made.

First, traffic signal synchronization is used to control traffic signals in clusters, wherein all signals in one cluster follow the same phase pattern. This is a popular approach in the literature, as traffic signal synchronization may be used to improve the flow of traffic along heavily traveled routes [42]. For networks composed of more than 4 intersections, a signal synchronization scheme of k clusters is fixed, as shown in Figure 6 for a 12-node network. In this example, 4 clusters of 3 traffic signals are chosen to sufficiently reduce the action space to 4^4 = 256, ensuring the problem is computationally feasible. In road networks with more diverse intersection types, the dimension of the action space becomes the product of phase options over the clusters, ∏_i n_i, where n_i is the number of incoming roads to the intersections in cluster i. Continuing the discussion on computational complexity from Section III-D, by using M ≪ m signal clusters in a network with L-legged intersections, one reduces the number of function calls per timestep to L^M.

Second, the model may be generalized to consider the design of a subset of internal roads in the network. This could allow practitioners to fix the design of certain roads in the network while including a smaller set of roads in the design space, possibly permitting the redesign of particular roads in a network as in reference [43]. For implementation, the roads in the design space are decided by the importance of each internal road. By using the directed graph model of road networks presented in Section III-C, the importance of a road can be measured by calculating the betweenness centrality of each node,

b(u) = Σ_{s ≠ u ≠ t} n_{st}(u) / N_{st},

where n_{st}(u) is the number of shortest paths from node s to node t that pass through node u, and N_{st} is the total number of shortest paths from s to t [44]. The betweenness centrality of a road measures its importance as the likelihood of a vehicle driving along that particular road based on the network topology. By using this metric, only a subset of roads is chosen as the most important and used as the design space in the co-optimization framework.
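The road-importance ranking described above can be computed directly from the digraph model, for example with networkx as sketched below; the toy edges and the number of designed roads are placeholders.

```python
import networkx as nx

def most_important_roads(flow_edges, n_design):
    """Rank roads (digraph vertices) by betweenness centrality and keep the
    top n_design roads as the design space; the remaining roads stay fixed."""
    g = nx.DiGraph(flow_edges)
    centrality = nx.betweenness_centrality(g)     # b(u) = sum_{s != u != t} n_st(u) / N_st
    ranked = sorted(centrality, key=centrality.get, reverse=True)
    return ranked[:n_design]

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("b", "d")]
print(most_important_roads(edges, n_design=2))
```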

IV. APPLICATION & RESULTS
The proposed RL-based approach is implemented by using the traffic simulation tool OpenTrafficLab [45], which is built in MATLAB [46] by using the Automated Driving Toolbox [47]. To evaluate performance, the authors implement the model on two road network sizes with two different traffic scenarios. In the first application case, the model is implemented on a grid-shaped road network with 4 intersections, as shown in Figure 8a, as a proof of concept and to demonstrate the inner workings of the algorithm. In the next example, the authors consider a road network composed of twelve intersections, as shown in Figure 8b, to demonstrate the scalability of the problem and the application to complex road networks. Additionally, the authors apply the approach in two traffic scenarios: (1) symmetric high traffic flow from all injection roads with a 4-node network, shown in Figure 8a, meant to simulate rush-hour traffic across the road network, and (2) asymmetric high traffic flow from a subset of injection roads with a 12-node network, meant to simulate concentrated rush-hour traffic. In each case, the co-optimization approach is compared to three control agents with random designs, which are trained using the conventional RL-based approach for traffic signal optimization. Through these two examples, one can quantify the usefulness of the proposed approach and exhibit the interdependence of signal control and vehicular flow direction design in road networks.

The co-optimization agent undergoes offline training to learn an optimal design and control policy by interacting with the traffic environment. The traffic environment is defined by the road network topology and a vehicle traffic simulation, which imitates real-world traffic scenarios. During the simulation, vehicles enter the road network from injection roads according to a Poisson distribution. As a vehicle approaches an intersection, it randomly chooses one of the following actions: (1) turn right, (2) turn left or (3) continue straight, each with equal probability. A vehicle that enters the network from an injection road acts in this manner until it exits the road network and leaves the simulation. The reward function is computed as the weighted sum of traffic performance criteria according to (8) at each timestep. After the algorithm converges to an optimal design, the signal control policy is further optimized with the design fixed. At each training iteration, the traffic performance reward is recorded to quantify learning progress.
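The vehicle generation process described above can be mimicked with a few lines of Python; the injection rates, seed, and horizon below are placeholders, not the rates used in the simulations.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def inject_vehicles(injection_rates, n_steps):
    """Poisson arrivals per injection road and per timestep; asymmetric rates
    model the concentrated rush-hour scenario."""
    return rng.poisson(lam=injection_rates, size=(n_steps, len(injection_rates)))

def choose_turn():
    """Each vehicle turns left (-1), goes straight (0), or turns right (1) with equal probability."""
    return rng.choice([-1, 0, 1])

arrivals = inject_vehicles([0.3, 0.3, 0.1, 0.1], n_steps=2000)
print(arrivals.shape, choose_turn())
```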

A. PARAMETER SELECTION
In order to effectively train the deep reinforcement learning agent within the co-optimization framework, the authors investigate the impact of design sampling and DQN parameters on offline training, as outlined in Section III-F. To begin, a set of DQN hyperparameters is fixed and the offline training scheme in Algorithm 1 is run with different design sampling parameters until the algorithm converges to a single design. Next, by using this set of design sampling parameters, the DQN hyperparameter space is searched on the same algorithm. The following investigation was performed separately for the 4- and 12-node road network topologies, and the parameters are outlined in Table 1. The authors search all possible permutations of the parameters. Thus, the experiment consisted of 24 design sampling parameter sets and 16 DQN hyperparameter sets for both the 4- and 12-node road networks. Each set of parameters is evaluated based on its offline traffic performance, as measured by the reward function, and the stability of learning, as measured by the standard deviation of the agent's offline learning curve.
Design sampling parameters directly impact both the speed of convergence and the quality of the design determined in the co-optimization framework. For example, a high elimination rate will lead to faster convergence, but at the cost of lower exploration, which impacts the quality of the converged design. From the grid search outlined above, the authors found the optimal set of design sampling parameters for the 4-node road network to be N = 1000, k = 0.75, dist = W(X) and δ = 0.75, and N = 5000, k = 0.75, dist = W(X) and δ = 0.8 for the 12-node road network. The high sampling size N allows the agent to explore a large set of designs, while a high elimination rate and a low decreasing factor δ eliminate poor designs more quickly, allowing for faster and more stable convergence to high-quality designs.

FIGURE 8. Road network scenarios used during implementation, where bold and thin arrows indicate rush-hour and non-rush-hour traffic flow, respectively, and each color shade of intersections corresponds to clusters of traffic signals whose phase is synchronized: (a) Four-Node Road Network with symmetric injection rates across injection roads and four independent traffic signal phases; (b) Twelve-Node Road Network with asymmetric injection rates and four signal phase clusters.
By using the aforementioned optimal design sampling parameters, the authors evaluate the impact of the DQN hyperparameter sets. For the 4-node road network, the authors found the optimal set of DQN parameters to be a 2-layer neural network with 512 nodes in layer 1 and 400 nodes in layer 2, learning rate α = 5 × 10^−5, decay rate ϵ = 10^−3 and mini-batch size D = 256. Similarly, for the 12-node road network these parameters were 512- and 400-node layers, α = 5 × 10^−5, ϵ = 10^−3 and D = 512. Due to the randomness of the vehicle scenario, it was found that a lower learning rate produced more stable learning from the agent, while a higher ϵ-decay rate led to more exploration of the action space, producing more effective control policies during the co-optimization procedure. In addition to the DQN hyperparameters and design sampling parameters determined via experimentation, the deep RL approach is implemented with a discount factor γ = 0.99 and exploration rate ϵ = 0.9.

B. FOUR-NODE ROAD NETWORK
The design-control co-optimization agent is trained in the four-node road network environment (Figure 8a) following the proposed approach outlined in Section III, by using the set of DQN parameters and design sampling parameters determined in Section IV-A. At each timestep, the centralized traffic signal controller is used to choose a signal phase (or action) for each of the four intersections from the set of non-conflicting signal phases shown in Figure 7. These signal phases comprise the action space for the signal control policy, which has size 4^m for m independently operating traffic signals. In the 4-node road network, the action space has size 4^4 = 256 at each timestep t. At the beginning of the algorithm, all possible vehicular flow designs for the internal roads are generated, of which there are 3^M for a road network with M internal roads. Thereafter, the set of feasible designs is obtained through the directed-graph model outlined in Section III-C. In the four-node road network, the initial design space comprises 3^4 = 81 designs, 31 of which are feasible. These 31 feasible designs comprise the initial design sampling space in the co-optimization algorithm.

1) OFFLINE TRAINING
The learning curve for the offline training of the co-optimization agent is presented in Figure 10a. The agent converged to a single feasible design after the first 5000 iterations, upon which the optimal design was fixed and the control optimization continued for another 4000 iterations. The traffic performance remained stagnant during the design optimization, likely due to the complex dynamics of the traffic environment and the symmetric injection rates. The training curve indicates that the agent struggled to improve traffic performance while optimizing control and design simultaneously in the first 5000 iterations; however, once the design converged, the agent effectively improved traffic conditions in the continuation of the control optimization. It appears the agent was able to effectively use its past experiences to quickly improve the control policy after the design optimization terminated, eliminating the overhead of learning a control policy from scratch. To further explore the effectiveness of this approach, the authors compare these results to conventional RL-based signal control optimization agents.

2) ONLINE SIMULATION
To display the effectiveness of the proposed co-optimization approach, the authors train three traffic control agents by using a conventional RL-based approach with a randomly generated, fixed design, and compare the performance of the agents in an online traffic environment simulation. The control-only agents begin with a random feasible design and train for 4000 iterations in an offline environment. The vehicular flow direction designs are presented in Figure 9. Next, a random vehicular flow pattern is initialized and each agent acts in the environment for 2000 timesteps, with the traffic performance measured at each timestep. To measure traffic performance, the authors record the cumulative vehicular throughput and cumulative queue length, the two metrics which define the reward function. To statistically analyze the results, the experiment is run by using 20 different randomly generated traffic scenarios. A plot of the results from one online simulation is presented in Figure 11, and the averages across simulations are presented in Table 2.
It is observed that the co-optimization agent outperforms all three control-only agents on average, producing a higher cumulative vehicular throughput and lower queue length. This corresponds to more vehicles passing through intersections and less waiting time for vehicles in the network. Based on the averages and standard deviations summarized in Table 2, one can accept this result with 95% confidence, indicating that co-optimizing design and control produces a signal control policy that is more effective at improving traffic conditions compared to the conventional control-only approach.

C. TWELVE-NODE ROAD NETWORK
To exhibit the scalability of the approach, the authors extend the approach to a more complex road network of twelve intersections, displayed in Figure 8b. In addition to the increased size of the road network topology, the injection rates are asymmetric, with incoming traffic flow concentrated in the top left of the grid system. In order to scale the approach to this larger road network, two techniques are used to reduce the action and design space sizes. The first is a signal clustering scheme, summarized in Figure 6, which reduces the action space to 4^4 = 256 at each timestep for the centralized traffic controller. In the second technique, one reduces the design space by selecting a subset of internal roads to design, as summarized in Section III-G. With the reduced design space, the agent decides on an optimal design for 5 internal roads, while the other 12 internal roads are fixed to be bi-directional. Therefore, the design space is composed of 3^5 = 243 possible designs, 181 of which are feasible. At the beginning of the algorithm, these 181 feasible designs comprise the design space which the agent searches throughout the co-optimization framework.

1) OFFLINE TRAINING
Similar to the implementation of the four-node network, the co-optimization agent is trained in an offline traffic environment by using the empirically determined parameters in Section IV-A. The offline learning curve for the agent in the twelve node environment is shown in Figure 10b. With a larger design space to search, the agent converged to a single design after 8000 iterations, and thereafter continued to optimize the control policy for another 6000 iterations. During the design and control optimization, the agent was able to effectively improve its reward through interactions with the environment. It appears that the asymmetry in the injection roads improved the agent's ability to learn while searching the feasible design space. After the optimal design was fixed, the agent continued to improve traffic performance while converging to an optimal control policy.

2) ONLINE SIMULATION
The performance of the co-optimization agent is compared to the conventional RL-based control-only approach in the twelve-intersection road network scenario. As before, three feasible designs are selected from a uniform distribution and three control agents are trained for 6000 iterations using the same DQN hyperparameters with a fixed design, shown in Figure 12. A random vehicular flow pattern is initialized and each agent interacts with the environment for 2000 timesteps. The traffic information is measured at each timestep and is summarized in Figure 13. This process is repeated 20 times, and the averages across simulations are presented in Table 2.
Again, the co-optimization technique outperforms the three control-only agents, allowing more vehicles to pass through intersections and reducing vehicle waiting times at intersections. From the statistics provided in Table 2, one can accept this result with 95% confidence, indicating that the co-optimization framework is effective in improving traffic conditions compared to conventional control-only approaches. Furthermore, the proposed technique is capable of capturing the dynamics of larger and more complex road networks. It should be noted that, due to the increased size of the design space, the training process was slow even when run on a high-performance computing cluster, indicating that reducing the computational cost is key to scaling up the approach to larger networks.

V. CONCLUSION
The goal of any intelligent transportation system is to improve traffic conditions in a road network and effectively capture the complex dynamics of urban transportation systems. To that end, the goal of the present work is to propose a novel RL-based approach to road network management that is successful at this task. The authors present a technique in which the design of vehicular flow directions is integrated into the conventional RL-based traffic signal control problem. In the approach, a directed graph model is leveraged to determine design feasibility, a centrality measure is used to quantify road importance, and a set of reasonable measures is employed to reduce the design and action space sizes. At a high level, this approach is an extension of the deep reinforcement learning framework that explores design options via random search, optimizing signal control while eliminating feasible designs based on performance. After a sufficient number of training iterations, the algorithm is found to converge to a best combination of vehicular flow direction design and a corresponding signal control policy.
The proposed approach is demonstrated in two problem instances: a four-node grid network with symmetric vehicle injection, and a twelve-node grid network with asymmetric vehicle injection. These applications are used to illustrate the algorithm's ability to optimize the vehicular flow design for multiple roads and to learn an effective control policy implemented by a centralized traffic controller. In each instance, the design-control co-optimization approach is compared to several RL-based control techniques in an online traffic simulation. When acting in this environment, the co-optimization method is found to achieve better traffic performance than the conventional control approaches. This comparison not only illustrates the ability of the present technique to capture the complex dynamics of road networks, but also displays the interdependence of vehicular flow design and signal control; the results indicate that the simultaneous optimization of design and control produces a policy better equipped to reduce traffic congestion.
The present data-driven approach has many advantages. Compared to model-based approaches, the problem is easy to generalize to different road network topologies and traffic scenarios. For example, the approach can be modified to design a specified subset of roads, or to incorporate the synchronization of traffic signals. Although the authors were able to make realistic assumptions to scale the approach to mid-scale networks, the scalability of the approach is one potential limitation. As the use of a centralized traffic controller scales the problem exponentially with the number of independent traffic signals, further modifications to the approach, such as the implementation of a decentralized traffic controller, may be needed to scale to the mega-scale networks present in the real world. Nonetheless, the present co-optimization framework sheds light on the interdependence of the design and control of road systems, and effectively reduces traffic congestion in urban road networks.
XIANGXUE ZHAO received the bachelor's degree in electrical science and technology from the University of Electronic Science and Technology of China, the master's degree in industrial and system engineering from the University of Michigan-Dearborn, and the Ph.D. degree in reliability engineering from the University of Maryland (UMD), College Park, MD, USA. Her research interests include predicting, designing, and controlling of system behavior with simulation, experiment, and sensor data using statistical machine learning.