Optimizing the Junction-Tree-Based Reinforcement Learning Algorithm for Network-Wide Signal Coordination

This study develops three measures to optimize the junction-tree-based reinforcement learning (RL) algorithm, which will be used for network-wide signal coordination. The first measure is to optimize the frequency of running the junction-tree algorithm (JTA) and the intersection status division. The second one is to optimize the JTA information transmission mode. The third one is to optimize the operation of a single intersection. A test network and three test groups are built to analyze the optimization effect. Group 1 is the control group, group 2 adopts the optimizations for the basic parameters and the information transmission mode, and group 3 adopts optimizations for the operation of a single intersection. Environments with different congestion levels are also tested. Results show that optimizations of the basic parameters and the information transmission mode can improve the system efficiency and the flexibility of the green light, and optimizing the operation of a single intersection can improve the efficiency of both the system and the individual intersection. By applying the proposed optimizations to the existing JTA-based RL algorithm, network-wide signal coordination can perform better.


Introduction
Signal control is an important means of improving urban traffic operation. As the understanding of traffic and the supporting technology have developed, urban traffic signal control systems have passed through three stages: single-point, linear coordinated, and regional coordinated control. Regional signal coordination is considered more effective in alleviating traffic congestion than single-point and linear coordinated control.

Review of the Literature on Signal Coordination.
Signal coordination has been studied extensively over the past 30 years. The first signal coordination control systems developed include SCOOT [1], SCATS [2], PRODYN [3], OPAC [4], RHODES [5], UTOPIA [6], CRONOS [7], and TUC [8]. Although signal coordination control can achieve better effects than single-point signal control and inductive signal control, it also has many restrictions, such as difficulty in parameter calibration, computational complexity, and poor adaptability and stability.
Considering these restrictions, and because the dynamic characteristics of the traffic environment call for learning through interaction with that environment, machine learning algorithms have been proposed for signal coordination control research. Among the machine learning algorithms, the reinforcement learning (RL) algorithm is the most widely used in the field of traffic signal control.
Liang et al. [9] proposed a deep reinforcement learning model to control the traffic light cycle. Aslani et al. [10] introduced the actor-critic method to solve the problem of the trade-off between exploration of the traffic environment and exploitation of the knowledge already obtained. Aslani et al. [11] developed adaptive traffic signal controllers based on continuous residual reinforcement learning to improve their stability. Jeon et al. [12] suggested a novel artificial intelligence that uses only video images of an intersection; the image-based RL model outperformed both the actual operation of fixed signals and a fully actuated operation. Aziz et al. [13] applied an R-Markov Average Reward Technique-based reinforcement learning algorithm to the vehicular signal control problem, leveraging information sharing among signal controllers in the connected vehicle environment. Darmoul et al. [14] suggested an Immune Network Algorithm-based multiagent system to control a network of signalized intersections, which is able to handle different traffic scenarios.
Graph theory models can reduce the computational complexity of RL, especially when joint action of multiagents needs to be calculated. But not much research has been done in this area. Some work has included developments in the max-plus algorithm and junction-tree algorithm (JTA); these have been applied to signal coordination control research at the road network level.
Medina and Benekohal [15] applied the max-plus algorithm as a coordinating strategy in the network-wide signal control problem. However, the max-plus algorithm has two key limitations. Firstly, it is only applicable to tree-structured networks and cannot guarantee convergence to an optimal solution for general cyclic networks. Secondly, on cyclic networks it relies on loopy propagation, in which the messages received at a node are inexact. Thus, it only provides an approximate inference of the exact message being passed. Zhu et al. [16] first proposed using the JTA instead of the max-plus algorithm to obtain the best joint action for traffic signals and to realize network-wide signal coordination. The JTA itself was first proposed by Jensen et al. [17]. The advantage of the JTA is that it is computationally efficient, can handle both looped and acyclic road networks, and accurately infers the best joint scheme.

Motivations and Contributions of this Study.
Zhu et al. [16] demonstrated that the test network can perform better under the JTA compared to an adaptive or single-agent RL-based control. Although the network system improved, some intersections still experienced poor operations. Zhu et al. [16] also noted that it is necessary to assess the variance of performance metrics at the intersection level, and modified schemes should be developed to optimize the system to ensure the desired level of performance at local intersections.
To summarize, the research goals are as follows:
(1) To optimize the basic parameters of the JTA algorithm so that the signal coordination control scheme is consistent with actual requirements
(2) To evaluate the impact of existing algorithms on local intersection operations
(3) To propose optimization measures for local intersections to improve the practical application value of the algorithm

Reinforcement Learning (RL) and Its Application in Signal Control.
The basic RL model is shown in Figure 1. It contains an environment, agents, learners, and strategies. The agent obtains the state "s" from the environment and selects action "a" according to the state. The action "a" acts on the environment, which then moves to a new state "s′" and sends feedback "r" to the agent. After repeated interactions, the agent can learn an optimal strategy for the situations presented.
In the application of RL to traffic signal control, the road network is the environment and the signal controller is the agent. During the decision period, the signal controller takes an action to activate a signal phase, and the state of the environment changes accordingly. The goal of the algorithm is to obtain the optimal strategy, that is, the one that achieves the maximum return. The optimal strategy maps the traffic state to the phase to be activated. The feedback can include the average delay and the number of stops. Its value can be extracted directly from the environment.
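The interaction loop described above can be sketched in code. The example below assumes a tabular Q-learning update with an epsilon-greedy phase choice; the source does not state which update rule or exploration scheme is used, and the names choose_phase, q_update, alpha, gamma, and epsilon are illustrative only.

```python
import random
from collections import defaultdict


def choose_phase(q_table, state, phases, epsilon, rng):
    """Epsilon-greedy phase selection (an assumed exploration scheme)."""
    if rng.random() < epsilon:
        return rng.choice(phases)          # explore: random phase
    # exploit: phase with the highest learned Q value in this state
    return max(phases, key=lambda p: q_table[(state, p)])


def q_update(q_table, state, action, reward, next_state, phases,
             alpha=0.1, gamma=0.9):
    """One standard Q-learning step; alpha and gamma are assumed values."""
    best_next = max(q_table[(next_state, p)] for p in phases)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Usage: state labels and the reward are placeholders; in the setting of the
# text, the reward would come from the simulated road network.
phases = ("A", "B")
q_table = defaultdict(float)
rng = random.Random(0)
phase = choose_phase(q_table, "low", phases, epsilon=0.1, rng=rng)
q_update(q_table, "low", phase, reward=1.0, next_state="medium", phases=phases)
```

The table is keyed by (state, phase) pairs, which mirrors the text's description of a strategy as a mapping from traffic state to the phase to activate.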

Junction-Tree Algorithm and Application in Signal Control.
The key idea of the JTA is to find a way to decompose the global computation of joint probability into a set of related local computations. The JTA is introduced to reveal the important connections between global and local probabilistic reasoning using graph theory. The essence of the JTA is information transmission. The forward transmission is the transfer from the root node to the leaf node, while the reverse transmission is from the leaf node to the root node. The process of information transfer can be expressed by equations (1)-(4).
Forward transmission from v to s:
Forward transmission from s to w:
Reverse transmission from w to s:
Reverse transmission from s to v:
In the equations above, v is the root node; w is the leaf node; s is the separation node; $\psi_v$, $\psi_w$, and $\phi_s$ denote the potential functions of v, w, and s; and $\psi_v'$, $\phi_s'$, and $\psi_w'$ denote the corresponding potential functions after transmission.
JTA and RL have the same objective function in terms of calculating the maximum a posteriori probability. They both decompose the whole-network optimization problem into local subproblems, and both use their Markov properties to do so. In the probability model, the probability of a node depends on the adjacent nodes. In coordinated traffic signal control, the phase selection of an intersection depends on the phases of the adjacent intersections. Therefore, the JTA is selected to solve the coordinated traffic signal control problem. The JTA is well suited to this problem because it is a fast and exact inference algorithm.

The Junction-Tree-Based RL Algorithm.
The control flow of the JTA-based RL algorithm is shown in Figure 2. In the applied method, RL is the core algorithm of signal control, and the JTA is used to find the signal control scheme with the highest return. Existing research verifies that the applied method is better than fixed-time signal control, independent Q-learning signal control, and maximum-queue-length-priority signal control under different traffic intensities.
It should be noted that the RL algorithm can learn the Q value under a specific traffic demand and signal control scheme for one or two adjacent intersections. However, the RL algorithm cannot learn the Q value for a whole network with many intersections because the amount of knowledge to be learned is too large. The JTA is adopted to find the signal control scheme that gives the best Q value for the whole network. In the proposed algorithm, there is no cycle time and no split. If the JTA is run every 1 s, the algorithm decides only which phase is green at each intersection for the next 1 s.

Optimizing Basic Parameters
3.1.1. Frequency of Running the JTA.
As the JTA determines the phase switching at intersections, the less frequently it is run, the longer a given phase lasts. To adjust the signal control scheme promptly according to feedback, the interval between JTA runs should not exceed the headway of queued vehicles crossing the stop line. Both Shao et al. [18] and Zhao et al. [19] have verified that the headway is less than 2 s when the queue is longer than 10 vehicles. However, existing research on the JTA uses an interval of 5 s, which cannot meet actual control requirements. To improve the sensitivity of the signal control scheme, and considering the minimum step size of the signal control scheme, an interval of 1 s is employed in this study.

Intersection Status Division.
The JTA-based RL algorithm selects the phase scheme with the highest return according to the state of the road network. Phase schemes are determined by the number of intersections and the phases of a single intersection, which are relatively fixed. Therefore, the accuracy of the applied method for signal control is determined by how the state of the road network is described. However, because signal coordination control involves a large number of intersections, a status division that is too detailed may lead to a long learning time. Existing studies treat saturation as the evaluation index of an intersection entrance; the saturations of all phases are summed and divided into three levels.
That is, each intersection contains three states, and the state of two adjacent intersections is divided into nine. In general, this state division is rough and makes the signal control scheme less sensitive to the traffic state of the road network.
Considering that the state will be defined as an eight-dimensional vector in the program of the applied method, the saturation of each intersection entrance is divided into three levels, so that each intersection, with four entrances, has 3^4 = 81 states. In future applications, the status of the intersection can be divided in more detail based on specific requirements.
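The 81-state division can be illustrated with a small sketch. It assumes four entrances per intersection and hypothetical level boundaries of 0.4 and 0.7 for the saturation; the source does not give the boundaries used for the status division, and the function names are illustrative.

```python
from bisect import bisect_right


def saturation_level(saturation, thresholds=(0.4, 0.7)):
    """Map a saturation in [0, 1] to level 0, 1, or 2.

    The thresholds are an assumption for illustration only.
    """
    return bisect_right(thresholds, saturation)


def intersection_state(entrance_saturations, thresholds=(0.4, 0.7)):
    """Encode the levels of all entrances as a single base-3 index."""
    base = len(thresholds) + 1            # three levels per entrance
    state = 0
    for s in entrance_saturations:
        state = state * base + saturation_level(s, thresholds)
    return state


# Usage: four entrances give indices 0..80, that is, 81 distinct states.
example_state = intersection_state([0.2, 0.5, 0.9, 0.1])
```

A pair of adjacent intersections would then be described by eight entrance levels, which matches the eight-dimensional state vector mentioned in the text.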

Analysis of the JTA Information Transmission Mode.
The JTA uses a continuous function when calculating the maximum a posteriori probability, which should not be applied directly to information transmission in traffic signal coordination control. Therefore, a new information transmission mode for signal coordination control is defined.
The new transmission mode is illustrated as follows, taking four intersections as the example.
Suppose that all four intersections have only two phases, A and B; phase A is for north-south traffic, and phase B is for east-west traffic. The virtual road network can be transformed into a junction tree using moralization and triangulation (see Figure 3). Intersections 1, 2, and 3 form the root node; intersections 2, 3, and 4 form the leaf node; and intersections 2 and 3 form the separation node. The key parameter Q is the value of two adjacent intersections and is shown in Table 1.
The target function of the JTA is $\arg\max(Q_{12} + Q_{13} + Q_{24} + Q_{34})$.
Journal of Advanced Transportation

Initialization: Define the Potential Function of All Nodes.
The potential functions of the root and leaf nodes are the sums of the Q values of the three intersections that form each node. The potential function of the separation node is the phase combination of the two intersections that form the node; its initial value is null.
The potential function of the root node is $\psi_{123} = Q_{123} = Q_{12} + Q_{13}$.
The potential function of the separation node is $\phi_{23} = \text{Null}$.
The potential function of the leaf node is $\psi_{234} = Q_{234} = Q_{24} + Q_{34}$.
After transmission, $\psi_{123}$ should achieve the maximum value $\psi_{123}'$ under all possible potential functions $\phi_{23}$ and also achieve the best phase combination $\phi_{23}'$. The transmission result is shown in Table 2.
After transmission, the potential function of the leaf node $\psi_{234}$ changes to $\psi_{234}'$.
After transmission, $\psi_{234}'$ should achieve the maximum value $\psi_{234}''$ under all possible potential functions $\phi_{23}'$ and the best phase combination $\phi_{23}''$. The transmission result is shown in Table 3.
By combining $\phi_{23}'$ and $\arg\max(Q_{24} + Q_{34})$, it can be seen that $\phi_{23}'(Q_{24} + Q_{34})$ achieves the maximum value only when $\phi_{23}''$ selects combination 4. In other words, $\phi_{23}'(Q_{24} + Q_{34})$ can achieve the maximum value only when intersections 2, 3, and 4 are all in phase B; at the same time, $\psi_{234}''$ must be 13.

Reverse Transmission from the Separation Node to the Root Node.
The transmission function is $\psi_{123}'' = \phi_{23}'' \psi_{123}'$. After transmission, $\psi_{123}'$ changes to $\psi_{123}''$ based on $\phi_{23}''$. At this time, $\psi_{123}''$ is 16, and intersection 1 is in phase B. The result of applying the JTA is obtained after the above information transmission: the joint action of the four intersections becomes (B, B, B, B), for which the junction tree achieves its highest potential function.
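The four-intersection example can be traced in code. The sketch below is a max-sum variant of the transmission steps (messages are added rather than multiplied, because the target is a sum of Q values), which is an assumption rather than the exact form of the transmission functions in the text. The pairwise Q values are hypothetical and are not those of Table 1, so the totals differ from the values quoted above.

```python
from itertools import product

PHASES = ("A", "B")

# Hypothetical pairwise Q values keyed by (phase of first, phase of second).
# Illustrative only; these are NOT the values of Table 1.
Q12 = {("A", "A"): 3, ("A", "B"): 1, ("B", "A"): 2, ("B", "B"): 4}
Q13 = {("A", "A"): 2, ("A", "B"): 3, ("B", "A"): 1, ("B", "B"): 5}
Q24 = {("A", "A"): 4, ("A", "B"): 2, ("B", "A"): 1, ("B", "B"): 3}
Q34 = {("A", "A"): 1, ("A", "B"): 2, ("B", "A"): 3, ("B", "B"): 5}


def total_q(joint):
    """Target function Q12 + Q13 + Q24 + Q34 for one joint action."""
    a1, a2, a3, a4 = joint
    return Q12[a1, a2] + Q13[a1, a3] + Q24[a2, a4] + Q34[a3, a4]


def junction_tree_best_action():
    # Initialization: potentials of the root (1, 2, 3) and leaf (2, 3, 4).
    psi_root = {(a1, a2, a3): Q12[a1, a2] + Q13[a1, a3]
                for a1, a2, a3 in product(PHASES, repeat=3)}
    psi_leaf = {(a2, a3, a4): Q24[a2, a4] + Q34[a3, a4]
                for a2, a3, a4 in product(PHASES, repeat=3)}

    # Forward, root -> separator: maximise over intersection 1.
    phi = {(a2, a3): max(psi_root[a1, a2, a3] for a1 in PHASES)
           for a2, a3 in product(PHASES, repeat=2)}

    # Forward, separator -> leaf: the leaf absorbs the message.
    psi_leaf_new = {k: v + phi[k[0], k[1]] for k, v in psi_leaf.items()}

    # Reverse, leaf -> separator: the best entry fixes phases 2, 3, and 4.
    (a2, a3, a4), best_value = max(psi_leaf_new.items(), key=lambda kv: kv[1])

    # Reverse, separator -> root: choose phase 1 given phases 2 and 3.
    a1 = max(PHASES, key=lambda a: psi_root[a, a2, a3])
    return (a1, a2, a3, a4), best_value


def brute_force_best_action():
    """Enumerate all 16 joint actions as a reference."""
    best = max(product(PHASES, repeat=4), key=total_q)
    return best, total_q(best)
```

For these hypothetical values, both functions return the joint action (B, B, B, B). The junction-tree version examines clique tables of size 8 instead of all 16 joint actions, and that gap widens quickly as intersections are added.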

Optimizations for Single Intersection's Operation.
Network-wide signal coordination control pursues both system optimization and the requirements of individual intersections. For example, the queue at a single intersection entrance should not be too long when the network has a low average queue length. The JTA-based RL algorithm takes system optimization as its goal; however, this tends to make the queues of some entrance lanes too long.
To improve the performance of single intersections, two optimizations are proposed.

Information Transmission Rule-Based Optimization.
In the JTA-based RL algorithm, the root and leaf nodes determine the direction of information transmission along the junction tree. The existing study of Zhu et al. [16] simply assigns the endpoints of the junction tree as the root and leaf nodes, without considering the signal control requirements. Analyses of the JTA information transmission modes show that the intersection's phase is determined in the reverse transmission process. For these reasons, it is proposed that the phase of the intersection with poor operation should be determined first. Therefore, the worst running node should be taken as the leaf node while all endpoints of the junction tree are taken as root nodes. The information transmission rule before and after optimization is shown in Figure 4.

Differentiated Return-Based Optimization.
System ... Q(A, B); thus, the differences cannot be learned in signal timing. Therefore, the differentiated return-based optimization method is proposed to optimize the definition of Q values.
If the saturation q is taken as the evaluation index and varies from 0 to 1, q should be divided into n levels, and the return of the kth level should be $2^k$ ($k = 1, \dots, n$). When the saturations of the adjacent intersections A and B are $q_1$ and $q_2$, $q_1$ belongs to level $k_1$, and $q_2$ belongs to level $k_2$. Therefore, the Q value of the adjacent intersections is expressed as follows:

Network Description.
This study used VISSIM 5.4 to build a virtual road network and test the validity of the optimizations of the JTA-based RL algorithm. Details about the modules in VISSIM (e.g., car-following, lane-changing, traffic light control) can be found in the VISSIM manual. The JTA-based RL algorithm is coded in VB.NET and interacts with VISSIM through the component object model (COM) interface.
A virtual road network identical to the one in Zhu et al.'s study [16] was built, so that the results of the two studies can be compared under the same test environment. The network uses a structure with six horizontal and three vertical roads. The number of lanes is randomly set. There are 18 intersections in the network, and each entrance has an independent left-turn lane, as shown in Figure 5. The given network is also transformed into a junction tree, as shown in Figure 6. The lengths of the road sections in the test network are set randomly, and the channelization schemes of the 18 intersections are also not uniform.
The performance of the JTA-based RL algorithm is tested at three levels of congestion: low, medium, and high. The traffic demand is input into the network through the 18 link origins in Figure 5. The congestion levels are reflected in the ranges of the demand inputs, which are 500 vph to 600 vph, 600 vph to 800 vph, and 900 vph to 1200 vph, respectively.

Test Group Settings.
In the test case, the queue length $w_{ij}$ is adopted to build the return and objective functions. The objective function is created to achieve the shortest queue length for the system. The return function is as follows:
where $f^{r}_{ij}(t)$ is the return of intersection i in phase j at time t, $q^{t}_{ij}$ is the traffic volume of the key entrance of intersection i in phase j at time t, $J_{ij}$ is the density of the key entrance when it is congested, and $l_{ij}$ is the lane length available for queueing at intersection i in phase j.
Three test groups are set to test the effectiveness of the optimization methods. The details of the settings are as follows:
Group 1: existing research of Zhu et al. [16] applying JTA in signal coordination
(1) Frequency of running JTA: 5 s
(2) Intersection status division: each intersection contains three states, and the state of the two adjacent intersections is divided into nine parts
(3) JTA information transmission mode: the mode introduced in Section 2.2
(4) Root and leaf node: V (1, 2, 4) is the root node, and V (14, 16, 17) and V (15, 17, 18) are the leaf nodes

Group 3:
(1) Frequency of running JTA: same as group 2
(2) Intersection status division: same as group 2
(3) JTA information transmission mode: same as group 2
(4) Root and leaf node: the worst running node is taken as the leaf node while all endpoints of the junction tree are taken as root nodes
(5) Q value: differentiated returns are calculated and applied
In addition to the above settings, the training time of group 1 is 5 h, while that of groups 2 and 3 is 10 h. After training, the three groups are applied in signal coordination; each group contains 10 simulation runs (each with a different random seed), and each simulation lasts 1 h. The differentiated-return-based optimization method adopted in group 3 requires the queue length $w_{ij}$ to be classified. It is divided into three levels in this study: the first level is [0, 0.4), the second is [0.4, 0.7), and the third is [0.7, 1]. The returns of the three levels are 2, 4, and 8, respectively.
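The level classification and the returns 2, 4, and 8 stated above can be written as a short sketch. The boundaries and returns come from the text; the function names are illustrative, and how the returns of two adjacent intersections are combined into a Q value is not covered here.

```python
from bisect import bisect_right

# Level boundaries from the text: [0, 0.4), [0.4, 0.7), [0.7, 1].
LEVEL_BOUNDS = (0.4, 0.7)


def queue_level(w):
    """Return the level k (1, 2, or 3) of a normalised queue length w."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("queue length must lie in [0, 1]")
    return bisect_right(LEVEL_BOUNDS, w) + 1


def level_return(w):
    """Differentiated return 2**k for the level that w falls in."""
    return 2 ** queue_level(w)


# Usage: one return per level, doubling from level to level.
returns = [level_return(w) for w in (0.2, 0.5, 0.9)]
```

Because the return doubles with each level, a change of level at one intersection produces a clearly different return, which is the differentiation the method is after.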

Test Result Analysis.
Comparing the test results of the three groups leads to the following conclusions.

4.3.1. The Green Light of Each Phase Is More Flexible.
Taking intersection 8 as an example, 50 randomly selected continuous phases under the medium congestion level are extracted, and the corresponding green light durations are shown in Figure 7. As the JTA in group 1 is called every 5 s, the green time of every phase is a multiple of 5 s, while the green time in group 2 is not subject to this constraint. The green time in group 2 can be adjusted according to the length of the queue. It can be concluded that optimization of the basic parameters increases the flexibility of the green light duration, which in turn makes the green light more reasonable.

4.3.2. The Efficiency of Signal Coordination Is Improved.
The queue lengths of the system and of the intersections at different congestion levels are shown in Table 4. The queue length of an intersection is the longest queue length of all its entrance lanes while the phase is being switched. The average queue length of the system is the average queue length of all 18 intersections. As the traffic demand is input into the network via link origins, the outermost intersections of the network are directly affected by the traffic input, which may then also affect the evaluation result. For this reason, only intersections 5, 8, 11, and 14 are selected and analyzed.
In terms of the queue length of the system, the table shows that the length of group 2 is shorter than that of group 1 by over 10%. It can be concluded that optimizations of the basic parameters and the JTA information transmission mode can improve the efficiency of signal coordination. The lengths of group 2 and group 3 are not significantly different, which means that optimizing the operation of a single intersection has little effect on the system operation.
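The two queue-length measures defined above, and the relative reduction used to compare groups, can be computed as in the following sketch. The function names and the sample values are hypothetical and are not taken from Table 4.

```python
from statistics import mean


def intersection_queue(entrance_queues):
    """Queue length of an intersection: the longest of its entrance lanes."""
    return max(entrance_queues)


def system_average_queue(queues_by_intersection):
    """Average of the intersection queue lengths over the network."""
    return mean(intersection_queue(q) for q in queues_by_intersection.values())


def relative_reduction(baseline, optimized):
    """Fractional reduction of the optimized value relative to the baseline."""
    return (baseline - optimized) / baseline


# Usage with made-up normalised queue lengths for two intersections.
sample = {5: [0.2, 0.5, 0.3, 0.1], 8: [0.1, 0.3, 0.2, 0.2]}
avg = system_average_queue(sample)
```

A relative reduction above 0.10 between two groups corresponds to the "over 10%" comparison made in the text.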

Problems after Parameter Optimization and the Information Transmission Mode Are Still Significant.
The optimization methods improve system operation, but the operations of some intersections are still poor. Table 4 shows that the average queue length of some intersections in group 2 is longer than that in group 1; for example, intersection 5 under a low congestion level and intersection 8 under a high congestion level. Queue lengths of 50 randomly selected continuous phases of these two intersections are also shown in Figures 8 and 9. The two figures show large fluctuations in queue length; for example, intersection 5 under a low congestion level has a maximum queue length of 0.55 and a minimum queue length of 0.16. In other words, after optimizing the basic parameters and the information transmission mode, the operation of a single intersection still needs to be improved.

Optimizations for Operating Single Intersections Can Reduce the Maximum Queue Length of the System.
The maximum queue length of the system under low and high congestion levels is counted at intervals of 10 s and shown in Figures 10 and 11. The queue length of group 3 is clearly the lowest. In other words, the maximum queue length of the system is reduced after the optimizations for the operation of a single intersection are adopted.

Optimizations for the Operation of a Single Intersection Can Reduce the Fluctuation of the Queue Length at the Intersection.
After applying the differentiated return-based optimization, group 3 should be more sensitive to returns than groups 1 and 2. The queue length of intersection 5 under a low congestion level in the different groups can be taken as an example. The variations in the queue lengths are shown in Figure 12.

Discussion and Conclusion
The study proposed three optimization methods for the JTA-based RL algorithm, which can be used for network-wide signal coordination. Three test groups were built to analyze the optimization effect.
Group 1 used the existing algorithm applying JTA in signal coordination; this group was taken as the control group.
Group 2 applied optimizations of the basic parameters and the information transmission mode relative to group 1.
Group 3 applied optimizations of the transmission rule and the return relative to group 2.
Detailed grouping and improvement effects are shown in Table 5. Table 5 shows that the optimizations proposed in this paper improve the operation of the JTA-based RL algorithm used for network-wide signal coordination. Optimizations of the basic parameters and the information transmission mode can improve the system efficiency and the flexibility of green lights. Optimizations of the information transmission rule and the return can improve the efficiency of both the system and the single intersection. It can be concluded that better operational results can be achieved in network-wide signal coordination by applying the proposed optimizations to existing JTA-based RL algorithms.
However, the results reported here are based on a hypothetical network. Results from real-world implementation should be studied in future research, which would strengthen these conclusions. Moreover, each intersection is divided into only 81 states; the possibility of a more detailed state division should be studied.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Grouping: Group 1
Features: Existing algorithm
Improvement: The control group

Grouping: Group 2
Features: Optimizations of basic parameters and the information transmission mode
Improvement:
(i) The green light of each phase was more flexible
(ii) The system efficiency of signal coordination improved
(iii) The operations of some intersections were still poor and needed to be improved

Grouping: Group 3
Features: Optimizations of the information transmission rule and the return
Improvement:
(i) The maximum queue length of the system was reduced
(ii) The fluctuation of the queue length at the intersection was reduced