Constraint-Guided Behavior Transformer for Centralized Coordination of Connected and Automated Vehicles at Intersections

The centralized coordination of Connected and Automated Vehicles (CAVs) at unsignalized intersections aims to enhance traffic efficiency, driving safety, and passenger comfort. Autonomous Intersection Management (AIM) systems introduce a novel approach for centralized coordination. However, existing rule-based and optimization methods often face the challenges of poor generalization and low computational efficiency when dealing with complex traffic environments and highly dynamic traffic conditions. Additionally, current Reinforcement Learning (RL)-based methods encounter difficulties around policy inference and safety. To address these issues, this study proposes Constraint-Guided Behavior Transformer for Safe Reinforcement Learning (CoBT-SRL), which uses transformers as the policy network to achieve efficient decision-making for vehicle driving behaviors. This method leverages the ability of transformers to capture long-range dependencies and improve data sample efficiency by using historical states, actions, and reward and cost returns to predict future actions. Furthermore, to enhance policy exploration performance, a sequence-level entropy regularizer is introduced to encourage policy exploration while ensuring the safety of policy updates. Simulation results indicate that CoBT-SRL exhibits stable training progress and converges effectively. CoBT-SRL outperforms other RL methods and vehicle intersection coordination schemes (VICS) based on optimal control in terms of traffic efficiency, driving safety, and passenger comfort.


Introduction
Intersections are frequent sites of accidents in urban traffic and major bottlenecks for congestion. Traditional traffic management methods often fail to effectively address these issues, especially during peak hours and in complex traffic environments. With the rapid development of artificial intelligence and wireless communication technologies, Vehicle-to-Infrastructure (V2I) technology is considered the most promising solution to these problems [1][2][3]. V2I technology enables efficient communication between vehicles and road infrastructure such as traffic signals, sensors, and cameras, facilitating real-time information sharing and processing. Through this technology, Connected and Automated Vehicles (CAVs) within intersections can communicate in real time with road infrastructure, achieving centralized control and coordinated management of vehicles. This not only allows for collision-free driving among vehicles but also optimizes traffic flow, reduces congestion, and improves the overall efficiency of the traffic system.
Against this backdrop, Autonomous Intersection Management (AIM) systems have emerged. AIM leverages V2I technology, intelligent algorithms, and real-time data to coordinate the passage of vehicles through intersections.
To address the challenge of insufficient model reasoning capabilities in RL, transformers have been introduced due to their outstanding performance, generalization ability, and scalability [18]. Transformer models are based on attention mechanisms. They possess strong sequence modeling capabilities and are able to capture long-distance dependencies, making them suitable for handling complex temporal data. The introduction of transformers has provided new insights for applying RL in complex environments. Zambaldi et al. [19] integrated transformers into RL using multi-head dot-product attention to achieve relational reasoning in structured observations, significantly enhancing the application of attention mechanisms in complex RL environments. However, the unique demands of RL training, including alternating data collection and policy optimization, present challenges for transformer applications, including instability due to changing data distributions and sensitivity to design choices. Offline RL has emerged as a transformative approach to address these challenges, solving RL problems using large-scale experience datasets and reframing them as sequence modeling problems. This method allows transformers to leverage their sequence modeling capabilities without the need to discount future rewards. Noteworthy advances include the Decision Transformer (DT) by Chen et al. [20], which predicts actions based on past states, actions, and rewards, and the Trajectory Transformer (TT) by Janner et al. [21], which uses beam search to predict the optimal action based on past states, actions, and rewards. However, TT models perform poorly on tasks requiring high real-time responsiveness due to the time-intensive nature of beam search. Further innovations, such as the Generalized Decision Transformer (GDT) by Furuta et al. [22] and the Q-learning Decision Transformer (QDT) by Yamagata et al. [23], have introduced methods such as hindsight information matching and dynamic programming, enhancing the practical utility of transformers in RL. Nonetheless, despite these advances, offline RL may be constrained by the quality of the training dataset, potentially leading to suboptimal policies. To address this issue, Zheng et al. [24] proposed the Online Decision Transformer (ODT), combining offline RL with online fine-tuning. However, ODT relies on supervised learning to minimize the difference between actions collected online and expert data, potentially limiting the model's ability to explore high-reward actions and achieve robust policy updates. Additionally, learning safe policies from experience datasets presents a significant challenge when applying transformers to offline RL. To address these safety concerns, Liu et al. [25] adopted a Multi-Objective Optimization (MOO) approach in offline RL to maximize rewards while ensuring safety constraints. Zhang et al. [26] proposed SaFormer, which restricts the action space with cost-related labels and enforces safety through a posterior validation step.
Ensuring safety in the application of RL to AIM is a crucial concern, and Safe RL has emerged as a solution to this challenge. Constrained Markov Decision Processes (CMDPs) are employed to address Safe RL issues by introducing safety constraints. These constraints ensure that the model not only maximizes rewards but also adheres to optimal policies within the defined safety limits. Various model-based and model-free methods have been developed to tackle CMDP problems, including primal-dual methods, trust region methods with safety constraints, Lyapunov-based methods, and Gaussian process optimization [27][28][29][30][31][32][33][34][35]. Additionally, algorithms such as PID-based Lagrangian methods and "SauteRL" have been designed to enhance safety during the learning process by using state augmentation and penalty adjustments. Other approaches, such as "SimmerRL", dynamically adjust safety thresholds to improve policy robustness and safety. Model-based Safe RL methods leverage environmental models to predict future states and rewards, thereby improving learning efficiency and safety [36][37][38][39]. However, constructing accurate models in complex large-scale environments such as intelligent traffic systems remains challenging, limiting the applicability of model-based Safe RL methods in AIM systems. While model-free methods do not require environmental models and directly learn policies through interaction with the environment, existing RL methods often struggle to solve autonomous intersection management problems due to insufficient inference capabilities, making it difficult to cope with complex and highly stochastic traffic environments.
To address the challenges of insufficient model reasoning capabilities and difficulty in ensuring robust safety in RL, in this research we propose the Constraint-Guided Behavior Transformer Safe Reinforcement Learning (CoBT-SRL) method. This Safe RL approach first involves collecting a large amount of historical trajectory data using an expert policy. Specifically, these trajectory data include information such as states, actions, rewards, and costs. These data undergo rigorous collection and filtering processes to ensure their quality and reliability. In the AIM dataset, each data point includes a past state and action along with the associated future Reward to Go (RTG) and Cost to Go (CTG). The RTG represents the total reward obtainable from the current state over a future period, while the CTG represents the potential total cost over the same period. Next, an autoregressive modeling approach is used, taking these historical data as input to predict future actions. In this process, a transformer model is introduced to leverage its powerful capabilities in sequence modeling. The transformer model excels at capturing long-range dependencies and effectively handling and modeling complex temporal data. This approach enhances data efficiency while ensuring robust safety, thereby improving generalization to highly complex and stochastic traffic environments.
The main contributions of this study are as follows: (1) This study proposes CoBT-SRL, utilizing a transformer as the policy to achieve sequential decision-making for vehicle driving behavior. This method leverages historical states, actions, RTG, and CTG to predict future actions, taking advantage of the transformer's ability to capture long-range dependencies and strong reasoning capabilities while improving data sample utilization efficiency and safety. (2) This study models the AIM system as a continuous task rather than a discrete one. By maintaining and dynamically updating the real-time status information of CAVs, the system can better simulate and respond to dynamic conditions in real-world environments. This continuous task modeling approach not only enhances the realism of simulations but also significantly improves the generalization ability of RL-based AIM policies under various traffic densities and scenarios at unsignalized intersections. (3) The effectiveness of the model was validated through extensive simulation experiments. The results indicate that the CoBT-SRL training process is stable and converges well. During testing, CoBT-SRL was compared with advanced RL methods such as PPO and TD3 as well as the optimal control-based Vehicle Intersection Coordination Scheme (VICS). CoBT-SRL achieved superior results in terms of traffic efficiency, driving safety, and passenger comfort.
The rest of this paper is structured as follows: Section 2 defines the problem and describes the framework of this study; Section 3 introduces the modeling of the AIM problem within the CMDP framework, defining the state space and action space and then designing the reward and cost functions; Section 4 discusses the methodology for reinforcement learning; Section 5 details the experimental setup and analysis of the results; finally, Section 6 summarizes the work and looks forward to future research directions.

Problem Definition
The intersection scenario studied in this paper involves a typical unsignalized four-way intersection with a multi-lane road layout. Each inbound lane is designed to allow vehicles to go straight, turn left, or turn right as needed. As vehicles gradually approach the intersection from a distance, they move from an uncontrolled area into a controlled area. Within this controlled area, the AIM system monitors and controls all approaching vehicles in real time. Advanced sensor and communication technologies continuously obtain critical data for each vehicle, including real-time position, speed, and driving direction. Based on these data, the system dynamically adjusts the driving routes and speeds of the vehicles to ensure safe and efficient passage through the intersection. Utilizing complex algorithms and decision models, the system optimizes the sequence and paths of vehicles in real time to minimize traffic conflicts and congestion to the greatest extent possible. Each vehicle i periodically sends its dynamic driving data s_i = [x_i, y_i, v_i, β_i, ζ_i] to the AIM system via V2I wireless communication. In these data, x_i and y_i represent the real-time positional coordinates of vehicle i, providing precise spatial location information; v_i represents the current driving speed of the vehicle, reflecting its motion state; β_i indicates the intended driving direction at the intersection (left turn, straight, or right turn); and ζ_i is the positional encoding information of the vehicle on the road, which is used for localization and trajectory planning.
The AIM system uses these data to control the driving behavior of each vehicle in real time through an RL algorithm. Based on the traffic conditions and the state of each vehicle, the system dynamically adjusts the target speed v_i^goal and target steering angle to ensure efficient and orderly traffic flow at the intersection. This approach enhances overall traffic efficiency and improves passenger comfort while ensuring driving safety. At the vehicle motion control level, a PID controller is employed to achieve precise longitudinal and lateral control. The PID controller calculates the required throttle, braking force, and steering wheel angle based on the target speed and steering angle. By adjusting these parameters, the vehicle can smoothly and safely accelerate, decelerate, and turn, thereby achieving precise motion control and ensuring that the vehicle follows the predetermined trajectory.
The problem is defined as follows: at each time step, given the static road information of the intersection (i.e., road topology and geometry) and the dynamic state information of all vehicles, including position, speed, driving intention, and positional encoding information, the CoBT-SRL-based AIM system determines the desired speed and steering angle of vehicles within the control area in real time. This coordination aims to achieve collision-free and efficient traffic flow for all vehicles at the intersection.

Method Framework
The CoBT-SRL framework proposed in this study is divided into two main components, as shown in Figure 1. In the dataset collection and filtering phase, historical trajectory information is obtained through interactions between the existing policy and the online environment, forming an expert dataset. This dataset includes past global states, joint actions, reward and cost returns, and other relevant information. The collected data are then filtered to further enhance the quality of the dataset. The filtered expert dataset is used for subsequent policy training and evaluation.
During the policy training and evaluation phase, a transformer model is employed as the policy network to predict future actions based on historical states, actions, RTG, and CTG. The transformer's ability to capture long-range dependencies improves data sample utilization efficiency and provides robust safety. Through policy training and evaluation, a more robust policy can be obtained for the AIM system that enhances traffic efficiency, driving safety, and passenger comfort.

Safe RL Formulation
To address the challenges of Safe RL, the traditional Markov Decision Process (MDP) framework is extended by introducing additional safety constraints, resulting in a CMDP model. This CMDP framework explicitly considers safety factors during the decision-making process, providing a foundation for solving the complexity of Safe RL problems. The CMDP model is defined as a tuple {S, A, P, r, c, d, ρ_0, γ}, where S denotes the state space, encompassing any state the agent acquires from the environment; A denotes the action space, encompassing all possible actions the agent can perform; and P : S × A × S → [0, 1] represents the transition probability distribution, describing the probability of transitioning from state s_t ∈ S to the next state s_{t+1} ∈ S after executing action a_t ∈ A at time step t. The reward function r : S × A × S → R represents the immediate reward obtained after transitioning from state s_t to state s_{t+1} by performing action a_t. The set of cost functions c = {c_i}_{i=1}^{N_c} defines specific environmental safety risk constraints, with each cost function c_i : S × A × S → R mapping the transition tuple to a safety cost. Here, N_c represents the number of different safety constraints considered, with N_c distinct cost functions c_i in the set c. The threshold set d = {d_i}_{i=1}^{N_c} associated with the cost functions represents the maximum allowable cost to ensure safety, while ρ_0 represents the initial state distribution describing the probability distribution of states at the start of the process and γ ∈ [0, 1) is the discount factor indicating the extent to which future rewards and costs are considered during decision-making.
According to the CMDP model, the agent interacts with the environment in discrete time steps. At each time step t, the agent receives the current state s_t ∈ S from the environment and executes an action a_t ∼ π(· | s_t) based on the policy π. After performing action a_t, the agent receives a reward r = r(s_t, a_t, s_{t+1}) and a set of costs c = {c_i(s_t, a_t, s_{t+1})}_{i=1}^{N_c}. The environment then transitions to a new state s_{t+1} ∼ P(· | s_t, a_t).
The interactions between all agents and the environment continuously generate trajectories, forming a complete trajectory τ = {s_t, a_t, r_t, c_t}_{t=1}^{N_t}, where N_t represents the number of time steps in an episode. In each trajectory, the RTG R_t and CTG C_t are defined as

  R_t = Σ_{t'=t}^{N_t} r_{t'},   C_t = Σ_{t'=t}^{N_t} c_{t'},

evaluated at t = T_current, where T_current represents the current time step. These metrics indicate the expected return and cost from the current time t to the end of the trajectory. They provide crucial insights for the CMDP model and help the agent to evaluate future rewards and costs to optimize actions under specified constraints, thereby maximizing rewards while adhering to cost constraints.
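As an illustration, the RTG and CTG labels can be computed as suffix sums over a trajectory's per-step rewards and costs. The function name and example values below are purely illustrative, not from the paper:

```python
def returns_to_go(values):
    """Suffix sums: element t holds the total from step t to the end."""
    rtg = [0.0] * len(values)
    running = 0.0
    for t in reversed(range(len(values))):
        running += values[t]
        rtg[t] = running
    return rtg

# Example per-step rewards and costs of one short trajectory
rewards = [1.0, 2.0, 0.5, 1.5]
costs = [0.0, 0.1, 0.3, 0.0]

R = returns_to_go(rewards)  # RTG label for each time step
C = returns_to_go(costs)    # CTG label for each time step
print(R)  # [5.0, 4.0, 2.0, 1.5]
print(C)
```

At each step, R[t] is the total reward still obtainable and C[t] the cost budget consumed from t onward, matching the definitions above.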

State Space and Action Space
In the Safe RL-based AIM system, designing a reasonable state space is crucial for the learning process. A well-designed global state space allows each vehicle to consider its own state and action, understand and predict the actions of other vehicles, and grasp the overall dynamics of the environment. Unlike previous RL-based AIM problems that only consider fixed vehicles at intersections, this study focuses on continuous and stochastic traffic flow at intersections. In this context, randomly arriving vehicles more realistically simulate actual traffic flow, prompting the model to learn more robust, adaptable, and generalizable policies, thereby enhancing its ability to handle highly complex and stochastic real-world scenarios. Additionally, designing a reasonable observation space in a dynamic environment is essential for the model to effectively understand the random variations in traffic scenarios and make more flexible and efficient decisions.
To effectively address the complex and dynamically changing intersection scenarios, this study designs an efficient global state space that includes temporal information. These states are observed at different time points and influence vehicle decisions, thereby better reflecting the historical trajectories of vehicle states and their interactions with the environment. Moreover, the global state space at each moment is defined as the ordered concatenation of all vehicle states in order to promote the coordination efficiency of the AIM system. Based on this design, for a vehicle i at the current specific time step T_current, the input state can be represented as {s_t^i}_{t=T_current−K}^{T_current}, where K represents the context length. Each vehicle's state space includes its current speed information v_t^i, distance to the target point d_t^i, steering intention β_t^i, and lane position information ζ_t^i:

  s_t = [s_t^1, s_t^2, ..., s_t^{N_a}],   s_t^i = [v_t^i, d_t^i, β_t^i, ζ_t^i],

where N_a represents the total number of vehicles considered in the scenario and the speed information v_t^i represents the real-time speed of vehicle i at time step t. The AIM system determines the vehicle's target position at the intersection exit based on the current vehicle's steering intention β_t^i and road encoding ζ_t^i. Subsequently, the target distance d_t^i between the vehicle and the target position can be calculated based on the specific trajectory. Notably, in order to enhance the efficiency of neural network learning and strengthen the representation of state space features, the steering intention and road encoding information of the vehicle are represented using one-hot encoding.
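A minimal sketch of this per-vehicle state construction, with the speed and distance kept as scalars and the steering intention and lane encoding one-hot encoded. The number of lane encodings is an assumption; the paper does not fix these dimensions here:

```python
TURN_OPTIONS = ["left", "straight", "right"]  # steering intentions beta
NUM_LANES = 4                                 # assumed number of lane encodings zeta

def one_hot(index, size):
    """One-hot vector of the given size with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def vehicle_state(v, d, beta, zeta):
    """State of one vehicle: [v, d] plus one-hot beta and one-hot zeta."""
    return ([v, d]
            + one_hot(TURN_OPTIONS.index(beta), len(TURN_OPTIONS))
            + one_hot(zeta, NUM_LANES))

s = vehicle_state(v=8.3, d=42.0, beta="left", zeta=2)
print(len(s))  # 2 + 3 + 4 = 9 features

# The global state is the ordered concatenation of all vehicle states
global_state = vehicle_state(8.3, 42.0, "left", 2) + vehicle_state(5.1, 30.5, "straight", 0)
```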
By designing a global state space that considers the temporal and dynamic characteristics of real intersection scenarios, the complex state space fully leverages the transformer's global attention mechanism as the policy. This maps the global state to efficient and reasonable joint actions. The action output is defined as the action set for each vehicle, represented as {a_t^i}_{t=T_current}^{T_current+K}. Similarly, future actions over multiple time steps are used as outputs to help the model with long-term planning and adjustment. Specifically, for each vehicle, the speed v and steering angle φ are used as action outputs to account for lateral and longitudinal motion control, represented as

  a_t^i = [v_t^{i,goal}, φ_t^i].

Furthermore, the motion control layer of each vehicle utilizes a PID controller to compute the required throttle, brake force, and steering wheel angle for the vehicle's motion, enabling longitudinal and lateral control. Specifically, the longitudinal and lateral control errors e_t^{i,lon} and e_t^{i,lat} are calculated as follows:

  e_t^{i,lon} = v_t^{i,goal} − v_t^i,   e_t^{i,lat} = ∠(v⃗_t^i, ω⃗_t^i),

where v_t^{i,goal} and v_t^i represent the desired velocity and current velocity of vehicle i, respectively, v⃗_t^i represents the velocity vector of vehicle i, and ω⃗_t^i represents the vector from the current position of vehicle i to its target position. Then, the longitudinal control input u_t^{i,lon} and lateral control input u_t^{i,lat} are computed separately using PID controllers:

  u_t^{i,lon} = k_p^{lon} e_t^{i,lon} + k_i^{lon} Σ_{k≤t} e_k^{i,lon} Δt + k_d^{lon} (e_t^{i,lon} − e_{t−1}^{i,lon}) / Δt,
  u_t^{i,lat} = k_p^{lat} e_t^{i,lat} + k_i^{lat} Σ_{k≤t} e_k^{i,lat} Δt + k_d^{lat} (e_t^{i,lat} − e_{t−1}^{i,lat}) / Δt,

where k_p^{lon}, k_i^{lon}, k_d^{lon} and k_p^{lat}, k_i^{lat}, k_d^{lat} represent the coefficients of the proportional, integral, and derivative terms in the longitudinal and lateral control, respectively.
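The PID law described above can be sketched as a small discrete-time controller. The gains, time step, and example error below are placeholder values, not the paper's tuned parameters:

```python
class PID:
    """Discrete PID: u = kp*e + ki*sum(e*dt) + kd*(de/dt)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        self.integral += error * self.dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Longitudinal channel: error is desired speed minus current speed
lon = PID(kp=1.0, ki=0.1, kd=0.05, dt=0.1)
u = lon.step(10.0 - 8.0)  # e = 2.0 -> a throttle-like command
print(u)
```

A second PID instance with its own gains would handle the lateral (heading) error in the same way.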

Reward and Cost Functions
In RL, reward functions and cost functions are crucial for guiding the agent to learn effective policies. For traditional tasks, maximizing rewards typically directs the agent to learn optimal policies, while penalizing undesirable actions provides negative feedback. However, in autonomous driving this approach may lead to vehicles simply avoiding any actions that might incur penalties, resulting in overly conservative driving behavior and reduced efficiency of the learned policy. By introducing cost functions for multi-objective optimization, vehicles can find an appropriate balance between efficiency, comfort, and adherence to safety constraints, rather than merely avoiding actions that might incur penalties. The reward and cost functions together jointly guide the learning behavior of the policy, ensuring that vehicles in the AIM system can drive efficiently, safely, and comfortably.
For traditional AIM systems with single scenarios and simple tasks, using a specific reward structure can accelerate the learning efficiency of the policy. However, this approach may also lead to overfitting and difficulties in adapting to new environments. In contrast, for complex intersection scenarios with randomly arriving vehicles, a comprehensive and careful consideration of reward design is necessary to enhance the generalization ability and robustness of the policy. Specifically, the designed reward function R_CoBT-SRL comprehensively considers traffic efficiency, safety, and comfort, combining both sparse and dense rewards, and is divided into three main components, represented as

  R_CoBT-SRL = R_efficiency + R_comfort + R_safety.

The efficiency term R_efficiency measures vehicle passage metrics within a fixed time step. It considers two aspects, traffic throughput and average speed, represented as

  R_efficiency = δ_1 N_t^pass + ω_1 v̄_t,

where N_t^pass denotes the number of vehicles that successfully pass through the intersection at time step t.
The parameter δ_1 is a fixed reward for each vehicle that successfully passes through the traffic system, representing the system's goal of maintaining high throughput. The weight coefficient ω_1 determines the importance of average speed in the reward function, allowing the system to balance throughput and speed depending on the traffic management objectives. The average speed of all vehicles at time t, denoted as v̄_t, is calculated as

  v̄_t = (1/N_a) Σ_{i=1}^{N_a} v_t^i,

where N_a is the number of vehicles and v_t^i is the individual speed of vehicle i. Compared to the total sum of speeds, the average speed better reflects real-time traffic flow, preventing scenarios in which the high speed of a few vehicles skews the perception of traffic conditions and promoting a smooth and efficient overall flow of traffic.
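The efficiency term can be sketched as a per-step bonus for each vehicle that completes its crossing plus a weighted average speed. The parameter values below are illustrative placeholders, not the paper's:

```python
DELTA_1 = 1.0  # assumed fixed bonus per vehicle that passed this step
OMEGA_1 = 0.1  # assumed weight on the average speed

def efficiency_reward(num_passed, speeds):
    """Throughput bonus plus weighted mean speed of all current vehicles."""
    avg_speed = sum(speeds) / len(speeds) if speeds else 0.0
    return DELTA_1 * num_passed + OMEGA_1 * avg_speed

r = efficiency_reward(num_passed=2, speeds=[8.0, 6.0, 10.0])
print(r)  # 2 * 1.0 + 0.1 * 8.0
```

Using the mean rather than the sum of speeds keeps the term comparable across varying vehicle counts, as the paragraph above argues.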
Subsequently, the comfort term R_comfort is designed by considering the smoothness of vehicle movement in terms of acceleration and jerk, represented as follows:

  R_comfort = −Σ_{i=1}^{N_a} (ω_2 |a_t^i| + ω_3 |j_t^i|),

where ω_2 and ω_3 are weight coefficients that reflect the relative importance of acceleration and jerk in the comfort evaluation. These parameters can be adjusted to prioritize different aspects of comfort, such as minimizing acceleration to avoid passenger discomfort or reducing jerk to ensure smoother transitions. Here, a_t^i and j_t^i represent the acceleration and jerk (rate of change of acceleration) of vehicle i at time t, respectively. Acceleration is calculated as a_t^i = Δv_t^i / Δt, while jerk is calculated as j_t^i = Δa_t^i / Δt. Smooth traffic flow typically requires vehicles to maintain relatively constant speeds while avoiding frequent acceleration or deceleration, thereby ensuring overall traffic smoothness and passenger comfort. Finally, considering the importance of safety, the negative value of the cost function is included as a penalty term in the reward function, specifically as

  R_safety = −C_CoBT-SRL.

In order to effectively ensure the safety of the AIM system, the cost function primarily considers whether the Distance to Collision (DTC) between vehicles is maintained stably, whether the Time to Collision (TTC) between vehicles at risk of collision satisfies constraints, and the cost that is incurred in the event of a collision. This can be represented as

  C_CoBT-SRL = C_dtc + C_ttc + C_collision,

where C_dtc is used to measure the risk between two vehicles when the relative distance dtc_{i,i'} is less than a certain threshold. The cost function is increased to quantify the risk at that moment, which can be represented as

  C_dtc = δ_2 (DTC_threshold − dtc_{i,i'}) / DTC_threshold  if dtc_{i,i'} < DTC_threshold, and 0 otherwise.
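The comfort penalty for one vehicle can be sketched with finite differences over consecutive speed samples: acceleration as Δv/Δt and jerk as Δa/Δt. The weights and time step below are placeholders:

```python
OMEGA_2, OMEGA_3 = 0.5, 0.2  # assumed comfort weights
DT = 0.1                     # assumed control time step in seconds

def comfort_penalty(v_prev2, v_prev, v_curr):
    """Negative weighted sum of |acceleration| and |jerk| from three speed samples."""
    a_prev = (v_prev - v_prev2) / DT   # acceleration at the previous step
    a_curr = (v_curr - v_prev) / DT    # acceleration at the current step
    jerk = (a_curr - a_prev) / DT      # rate of change of acceleration
    return -(OMEGA_2 * abs(a_curr) + OMEGA_3 * abs(jerk))

p = comfort_penalty(8.0, 8.2, 8.3)  # a gentle deceleration of the speed change
print(p)
```

A constant-speed vehicle incurs zero penalty, which is exactly the smoothness the term rewards.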
This design increases the cost as the distance between vehicles decreases, encouraging vehicles to maintain safe distances more accurately. By dynamically adjusting the distance threshold DTC_threshold, the penalty intensity can be flexibly set according to different traffic scenarios and safety requirements. Similarly, for vehicles with potential conflict points, TTC_threshold represents their respective TTC thresholds. The cost function C_ttc is designed analogously:

  C_ttc = δ_3 (TTC_threshold − ttc_{i,i'}) / TTC_threshold  if ttc_{i,i'} < TTC_threshold, and 0 otherwise.
By combining the comprehensive effects of C_dtc and C_ttc, the AIM system achieves thorough safety risk assessment, accurately predicting collision risks and enabling the system to respond flexibly in various situations. For example, in low-speed or congested scenarios the distance cost is emphasized, whereas under high-speed conditions the TTC penalty better reflects collision risks. Additionally, a significant penalty is imposed if a collision occurs between vehicles, helping to guide vehicles away from behaviors that might lead to collisions, as shown in Figure 2. This can be represented as

  C_collision = δ_4  if a collision occurs, and 0 otherwise.

The cost function C_CoBT-SRL consists of three subcomponents, each associated with immediate costs represented by δ_2, δ_3, and δ_4. Among these, C_dtc and C_ttc are seen as preventive measures aimed at mitigating behaviors that increase the risk of collisions: C_dtc focuses on maintaining static safety boundaries, while C_ttc carries out dynamic risk assessment to address collision risk prevention at various levels. In contrast, C_collision directly responds to collisions that have already occurred, clearly indicating that such behaviors lead to unacceptable outcomes. The parameters δ_2, δ_3, and δ_4 are chosen to appropriately weigh the importance of different safety considerations, reflecting the severity and priority of each component in the overall safety strategy. Specifically, δ_2 might be set higher if maintaining static safety boundaries is prioritized, while δ_3 could be more significant when dynamic risk assessments are critical; δ_4 should typically have a large value in order to heavily penalize actual collisions, as they represent the most severe safety breach. Together, these components form a multi-level safety protection system designed to minimize collision risks in complex traffic environments by preventing and penalizing potential dangerous behaviors across different dimensions and stages, thereby enhancing the overall safety of the AIM system.
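The multi-level cost can be sketched as three additive terms: a distance cost that grows as the gap shrinks below a threshold, an analogous TTC cost, and a large flat collision cost. The thresholds, weights, and scaling form here are assumptions for illustration; the paper's exact shaping may differ:

```python
DTC_THRESHOLD = 5.0  # metres (assumed)
TTC_THRESHOLD = 2.0  # seconds (assumed)
DELTA_2, DELTA_3, DELTA_4 = 1.0, 2.0, 50.0  # assumed cost weights

def step_cost(dtc, ttc, collided):
    """Sum of distance, time-to-collision, and collision costs for one vehicle pair."""
    cost = 0.0
    if dtc < DTC_THRESHOLD:
        # grows linearly as the gap shrinks toward zero
        cost += DELTA_2 * (DTC_THRESHOLD - dtc) / DTC_THRESHOLD
    if ttc < TTC_THRESHOLD:
        cost += DELTA_3 * (TTC_THRESHOLD - ttc) / TTC_THRESHOLD
    if collided:
        cost += DELTA_4  # dominant penalty for an actual collision
    return cost

print(step_cost(dtc=3.0, ttc=5.0, collided=False))   # distance risk only
print(step_cost(dtc=10.0, ttc=1.0, collided=False))  # TTC risk only
```

Setting DELTA_4 much larger than the preventive weights mirrors the priority ordering discussed above.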

Dataset Collection and Filtering
When training a transformer model through autoregressive modeling, the quality of the dataset is crucial for the model's performance. Collecting high-quality datasets in the real world is time-consuming and expensive, requiring significant investments in terms of time, resources, and logistics. To address this issue and maximize the utilization of trajectory samples, a dataset was constructed using historical data generated from an optimized expert policy π_expert from a previous work.
Specifically, the details of dataset collection and filtering are shown in the first half of Algorithm 1. First, a sample of the initial global state s_1 is obtained in an online environment. Then, joint actions a_t are generated and executed in the environment using the expert policy, resulting in the next state s_{t+1} (lines 1-7). Meanwhile, the reward r_t and cost c_t for the current time step are calculated using the reward function R and cost function C. This process is repeated for the specified number of time steps T_l to obtain a complete trajectory τ = {s_t, a_t, r_t, c_t}_{t=1}^{T_l} (lines 8-11). Subsequently, a filtering step is performed by summing the costs in the trajectory and checking whether the total is below the initially set cost constraint threshold d (line 12). If the cost constraints are met, then the RTG and CTG are calculated as new labels for subsequent model input. Finally, the filtered and processed trajectory τ = {R_t, C_t, s_t, a_t}_{t=1}^{T_l} is added to the dataset T. This process iterates continuously to obtain the final AIM dataset T (lines 13-17).
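The filtering and relabeling half of this procedure can be sketched as follows: trajectories whose total cost exceeds the threshold d are discarded, and the survivors are relabeled with RTG and CTG. The data layout and names are illustrative:

```python
def suffix_sums(xs):
    """RTG/CTG relabeling: element t is the total from step t to the end."""
    out, run = [0.0] * len(xs), 0.0
    for t in reversed(range(len(xs))):
        run += xs[t]
        out[t] = run
    return out

def filter_and_relabel(trajectories, d):
    """Keep trajectories with total cost <= d and attach RTG/CTG labels."""
    dataset = []
    for traj in trajectories:
        if sum(traj["costs"]) > d:  # violates the cost budget -> discard
            continue
        traj["rtg"] = suffix_sums(traj["rewards"])
        traj["ctg"] = suffix_sums(traj["costs"])
        dataset.append(traj)
    return dataset

safe = {"states": [0, 1], "actions": [0, 0], "rewards": [1.0, 1.0], "costs": [0.0, 0.1]}
risky = {"states": [0, 1], "actions": [0, 0], "rewards": [3.0, 3.0], "costs": [2.0, 2.0]}
kept = filter_and_relabel([safe, risky], d=1.0)
print(len(kept))  # only the low-cost trajectory survives
```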
(Algorithm 1, second half: predict the joint vehicle actions â_t, compute the NLL loss and entropy, construct the Lagrangian function, and update the policy parameters θ.) The dataset was further refined by filtering and extracting optimized data based on the cost values in the trajectories, allowing the training process to benefit from a diverse set of trajectories and promoting the development of a robust foundational policy. This approach reduces the need for expensive real-world data collection while still providing a comprehensive training dataset. Utilizing these optimized historical trajectories ensures that the autoregressive modeling process is based on realistic high-quality data, enhancing the robustness and adaptability of the policy when applied to new and unseen traffic scenarios.

Policy Training and Evaluation
The above data collection and filtering process resulted in a high-quality expert dataset T for use in RL training of a transformer-based autoregressive model. In this phase, the precollected data enable the learned policy to generalize effectively across various environments without the need for real-time interaction. This section introduces the sequence modeling for the AIM dataset, focusing on the trajectory τ in the dataset T, as shown in Figure 3. The trajectory τ is represented as

  τ = {R_t, C_t, s_t, a_t}_{t=1}^{T_l}.
Figure 3. Methodology of CoBT-SRL. Using a causal transformer as the policy, historical states, actions, rewards, and cost return information obtained from the traffic environment are input for sequence modeling. This approach facilitates sequential decision-making and predicts future actions. The policy is then updated using the maximum entropy RL method, ensuring the accuracy of action outputs while enhancing the exploration capability of the policy.
CoBT-SRL uses historical global states, joint actions, RTG, and CTG to predict current actions, achieving the goal of sequential decision-making. Relying solely on RTG to represent expected returns is insufficient for safety-critical tasks such as autonomous driving. CoBT-SRL aims not only to maximize cumulative returns but also to adhere to predefined cost constraints. The CTG metric is indispensable, as it represents the allowable cost from the current time step T_current to the end of the trajectory, playing a crucial role in integrating safety constraints into action selection. Unlike traditional RL methods, CoBT-SRL constructs its model input as {R_t, C_t, s_t, a_t}_{t=T_current−K}^{T_current−1} ∪ {R_{T_current}, C_{T_current}, s_{T_current}} to achieve sequential decision-making through sequence modeling.
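The sliding-window input construction can be sketched as follows, where build_context is a hypothetical helper and K is the context length; the current step contributes only its RTG, CTG, and state, since the action a_t is what the model must predict:

```python
def build_context(tokens, t, K):
    """Model input at step t: up to K-1 preceding (R, C, s, a) tuples
    plus the current (R_t, C_t, s_t); the action a_t is left out
    because it is the prediction target."""
    history = tokens[max(0, t - K + 1):t]                  # full past tuples
    current = {k: tokens[t][k] for k in ("R", "C", "s")}   # no action yet
    return history, current
```

Near the start of a trajectory the window is simply shorter, which matches how causal-transformer policies are usually trained with padded contexts.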
When predicting actions, CoBT-SRL employs a stochastic policy, modeling the action distribution through a multivariate Gaussian distribution with a diagonal covariance matrix. This approach aims to enhance policy adaptability by maximizing the likelihood of actions within the dataset while introducing the flexibility to respond to different environmental conditions. The policy and its parameters are denoted by π and θ, with the mean vector represented by µ_θ and the covariance matrix by Σ_θ. This setup allows the stochastic policy to dynamically adjust to the decision-making environment, as detailed in Figure 3.
Compared to a deterministic policy, a stochastic policy offers significant advantages. By introducing randomness into action selection, a stochastic policy facilitates exploration of the environment's performance limits rather than merely executing known optimal actions. This approach avoids local optima while enhancing exploration capabilities. Conversely, a deterministic policy is prone to out-of-distribution (OOD) actions due to systematic bias. By varying actions in similar states, a stochastic policy enables agents to develop more robust behaviors, thereby enhancing the generalization capability of the policy.
To quantify the consistency between the actions predicted by the stochastic policy and the actual benchmark actions, the negative log-likelihood (NLL) loss function is used.
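For a diagonal Gaussian policy, the NLL of a benchmark action has a simple closed form. The sketch below (pure-Python and per-sample; a real implementation would use a batched distribution such as torch.distributions.Normal) illustrates the quantity being minimized:

```python
import math

def gaussian_nll(action, mu, log_std):
    """NLL of `action` under a diagonal Gaussian with mean `mu` and
    per-dimension log standard deviation `log_std`, summed over dims:
    sum_i [ 0.5*log(2*pi) + log_std_i + 0.5*((a_i - mu_i)/std_i)**2 ]."""
    nll = 0.0
    for a, m, ls in zip(action, mu, log_std):
        std = math.exp(ls)
        nll += 0.5 * math.log(2 * math.pi) + ls + 0.5 * ((a - m) / std) ** 2
    return nll
```

Minimizing this loss pulls the predicted mean toward the benchmark action while the log-standard-deviation term penalizes needlessly diffuse predictions.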
The policy update process focuses on minimizing the NLL loss. However, concentrating solely on this metric may limit exploratory behavior and increase the risk of overfitting. To mitigate these issues, the Shannon entropy H(θ) of the policy π_θ is integrated into the optimization framework [40]. This approach is a hallmark of maximum entropy RL algorithms, which aim not only to minimize prediction error but also to maximize entropy. In this way, the policy is encouraged to explore a more diverse range of actions, thereby enhancing its robustness and adaptability. This dual focus on entropy maximization and loss minimization ensures a more balanced learning approach. The entropy is calculated as H(θ) = E_{a∼π_θ}[−log π_θ(a | ·)].
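For the diagonal Gaussian policy described above, the entropy has a closed form that depends only on the predicted log standard deviations; a minimal sketch:

```python
import math

def diag_gaussian_entropy(log_std):
    """Entropy of a diagonal Gaussian: sum_i (0.5*log(2*pi*e) + log_std_i).
    Larger predicted standard deviations mean a more exploratory policy."""
    return sum(0.5 * math.log(2 * math.pi * math.e) + ls for ls in log_std)
```

Because the entropy grows monotonically with each log_std_i, the entropy bonus directly rewards the network for keeping its action distribution wide.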
The objective of the policy update is to minimize the NLL loss while ensuring that the policy's entropy H(θ) meets or exceeds a predefined lower bound β. This approach not only improves the accuracy of the policy but also ensures an adequate level of exploration. By setting a lower entropy bound, the policy is encouraged to explore a broader range of behaviors, thereby avoiding premature convergence to suboptimal solutions. This method promotes a balanced development of exploration and exploitation, which is crucial for effective learning in complex environments. This objective can be expressed as min_θ L_NLL(θ) subject to H(θ) ≥ β. To tackle the objective problem characterized by this inequality constraint, described in Equation (21), the Lagrangian function of the primal problem is constructed as L(θ, λ) = L_NLL(θ) + λ(β − H(θ)), where λ ∈ [0, +∞) represents the Lagrange multiplier (i.e., the dual variable), serving as a penalty coefficient that autonomously adjusts to mitigate potential constraint violations. Thus, the dual problem requires solving max_{λ≥0} min_θ L(θ, λ). The primal-dual gradient descent method resolves this optimization by alternately updating the policy parameters θ and the Lagrange multiplier λ. Initially, λ is held constant to solve the minimization problem for θ, expressed as min_θ L(θ, λ). Subsequently, with θ fixed, the process maximizes with respect to λ, focusing on max_{λ≥0} λ(β − H(θ)).
These optimization steps are alternated until convergence, ensuring stability in the values of θ and λ. The updates are carried out as θ ← θ − α_θ ∇_θ L(θ, λ) and λ ← max(0, λ + α_λ ∇_λ L(θ, λ)), where α_θ and α_λ denote the respective learning rates and ∇_θ L(θ, λ) and ∇_λ L(θ, λ) denote the gradients of L(θ, λ) with respect to θ and λ; the projection onto [0, +∞) keeps λ non-negative. This methodology allows for iterative refinement of the policy during training, integrating maximum entropy to bolster the policy's exploration capabilities while reinforcing the model's robustness.
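A one-dimensional illustration of the alternating primal-dual update (the gradient callables and step sizes here are placeholders; the actual method updates transformer parameters by backpropagation):

```python
def primal_dual_step(theta, lam, grad_nll, entropy_fn, grad_entropy,
                     beta, alpha_theta, alpha_lam):
    """One alternating update for
    min_theta max_{lam >= 0}  L(theta, lam) = NLL(theta) + lam*(beta - H(theta))."""
    # Primal step: gradient descent on theta for the Lagrangian.
    theta = theta - alpha_theta * (grad_nll(theta) - lam * grad_entropy(theta))
    # Dual step: gradient ascent on lam, projected onto lam >= 0.
    lam = max(0.0, lam + alpha_lam * (beta - entropy_fn(theta)))
    return theta, lam
```

Intuitively, λ only grows while the entropy constraint H(θ) ≥ β is violated, at which point the entropy term gets a larger weight in the primal step.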
Policy training and evaluation are detailed in the latter part of Algorithm 1. First, all trajectories used for training are extracted from the AIM dataset. From these trajectories, all training samples with a context length of K are extracted (lines 18-22). These samples are then used to train the transformer, which predicts the next action based on the RTG, CTG, past states, and past actions. The NLL loss between the predicted and actual actions is computed, then the entropy of the predicted actions is calculated to construct the Lagrangian function, which is solved using the primal-dual method. This approach continuously updates the model parameters to achieve the goal of training (lines 23-30).

Experiment
Experiments were conducted in the CARLA high-fidelity autonomous driving simulator [41] to evaluate the performance of the proposed method. This section first provides a detailed description of the parameters and specific experimental settings used during implementation, then analyzes the training process and presents the results obtained from testing the trained model upon deployment.

Experimental Settings
All experiments were conducted in a simulated environment. A road intersection scenario was built using CARLA 0.9.11, and the RL model was constructed based on the PyTorch framework. Additionally, CARLA's built-in sensors were used to transmit vehicle state information in real time. The initial positions and speeds of all vehicles were randomly set, and traffic operated in a continuous framework rather than a segmented one. The desired vehicle speed and steering angle predicted by the policy neural network were converted into control signals for longitudinal (throttle and brake) and lateral (steering angle) motion using PID controllers. This integration ensured precise maneuverability and stability of the vehicles in the simulated environment. The experiments were conducted on an NVIDIA GeForce RTX 3090 GPU with Ubuntu 18.04 as the operating system. The total time spent interacting with the environment to collect datasets and train the model was approximately 70 h.
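As an illustration of the speed-tracking layer described above (the gains and the throttle/brake split are assumptions for the sketch, not CARLA's or the paper's tuned values), a minimal longitudinal PID controller might look like:

```python
class PID:
    """Minimal PID controller mapping a speed error to throttle/brake
    commands in [0, 1]. Gains kp, ki, kd are placeholders."""
    def __init__(self, kp, ki, kd, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, target_speed, measured_speed):
        error = target_speed - measured_speed
        self.integral += error * self.dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        u = self.kp * error + self.ki * self.integral + self.kd * deriv
        # Positive control effort -> throttle, negative -> brake.
        throttle = min(max(u, 0.0), 1.0)
        brake = min(max(-u, 0.0), 1.0)
        return throttle, brake
```

A second PID of the same form could track the steering-angle reference for the lateral channel, with its outputs clipped to the steering range instead.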
This experiment used a four-way dual-lane unsignalized intersection in CARLA TOWN 05 to train and test the RL model. The road width was 14.2 m. Considering the road characteristics in the CARLA map and the V2I communication range, the control distance was set to 70 m in both the east-west and north-south directions. Vehicle lengths ranged from 3.6 to 5.4 m, widths from 1.8 to 2.2 m, and heights from 1.5 to 2 m, with vehicle arrivals following a Poisson distribution. Consistent with the actual vehicle control cycle, the time step was set to 0.1 s. This simulation environment was meticulously constructed to reflect real-world intersection scenarios in order to ensure the model's performance and applicability under actual traffic conditions.

This section primarily covers two sets of experiments. The first set evaluated the performance of CoBT-SRL during training, including key metric changes and convergence speed. The second set compared the test results of the trained policy in terms of traffic efficiency, driving safety, and passenger comfort. The main hyperparameters of the experiments are shown in Table 1. In order to comprehensively evaluate the proposed method, CoBT-SRL was compared with existing centralized coordination methods for unsignalized intersections. The benchmark methods included:

(1) PPO [15]: This centralized coordination scheme for autonomous vehicles utilizes an MA-PPO algorithm, integrating a priori models into PPO to enhance sample efficiency and accelerate the learning process. This method achieves outstanding performance in both traffic efficiency and computation time. The state space includes information on each vehicle's distance to the center of its corresponding trajectory and its velocity, resulting in a total of sixteen dimensions. The action space consists of the longitudinal acceleration for each vehicle, comprising eight dimensions. The reward function is designed to prioritize traffic safety and improve efficiency while encouraging task completion. A high negative reward is assigned for collisions, a small negative reward is given at each step to minimize waiting time, a positive reward is provided when a vehicle successfully passes through the intersection, and a higher positive reward is granted when all vehicles safely pass through the intersection.
(2) TD3 [17]: A deep reinforcement learning policy based on the TD3 algorithm is employed to optimize cooperative trajectory planning at unsignalized intersections. This approach improves traffic throughput and safety for controlled vehicles (CVs) while reducing computational complexity. The state space includes the position and velocity of each CV, their intended maneuvers (left turn, straight, or right turn), and the queue states, providing a comprehensive view of the traffic situation. The action space comprises acceleration commands that dictate the vehicles' velocity, heading angle, and steering angle. The reward function balances safety and efficiency. Large negative rewards are assigned for collisions, while positive rewards are given for vehicles that pass through the intersection without incidents. Additionally, intermediate rewards are granted to encourage the completion of maneuvers, with overall system performance evaluated by summing individual rewards.

(3) VICS [10]: This centralized coordination method for unsignalized intersections utilizes an MPC framework to optimize vehicle trajectories, aiming to prevent collisions and enhance traffic efficiency. The state space comprises the position, velocity, and acceleration of each vehicle approaching the intersection, represented in a Cartesian coordinate system. The control inputs, which are the longitudinal accelerations of the vehicles, are optimized to ensure efficient and safe passage through the intersection. The objective function is designed to minimize the risk of cross-collisions by controlling vehicle trajectories, considering constraints such as maximum velocity, safe following distance, and lateral acceleration limits for turning maneuvers. This optimization problem is addressed iteratively, with control inputs recalculated at each time step to adapt to real-time traffic conditions and ensure collision-free operation.
The training time for PPO was approximately 120 h, while the training time for TD3 was around 80 h. In contrast, VICS employed the MPC method for control, which did not require any training. All learning-based methods were trained under the same hardware environment to ensure fairness and comparability of the experimental results.

Training Performance
In this experiment, CoBT-SRL was trained on the expert dataset according to the parameters set in Table 1; the training results are shown in Figure 4. Figure 4a depicts the variation of the loss function during training. As shown, the loss value converged before 1000 iterations and remained stable thereafter, indicating a rapid convergence speed of CoBT-SRL during the training process. Figure 4b illustrates the change in exploration entropy throughout the training process. The exploration entropy increased during the initial training phase, peaked around 1000 iterations, and then gradually decreased. The rate of decrease was slow between 1000 and 2500 iterations, accelerated after 2500 iterations, and eventually converged to a low level. This indicates that CoBT-SRL's exploratory nature initially increased and then decreased before ultimately converging to a lower level, which helps the model better explore the environment during training and enhances its generalization performance. During the experiment, the loss function showed a rapid decline in the initial phase and stabilized after 1000 iterations, indicating that the model quickly adjusted its parameters to fit the training data. This rapid convergence not only shortens the training time but also provides a stable foundation for subsequent model tuning, demonstrating the algorithm's efficiency during learning.
Simultaneously, the exploration entropy curve reveals the model's ability to explore the policy space. The initial rise in entropy indicates the model's high tolerance for environmental uncertainty, encouraging diverse policy exploration. After peaking, the entropy gradually decreased as iterations progressed, with a more rapid decline after 2500 iterations. This trend suggests that after sufficiently exploring various policies, the model gradually focuses on optimizing the existing policy, thereby improving overall performance. This shift from exploration to exploitation ensures that the model not only thoroughly explores the policy space but also effectively selects the optimal policy to enhance generalization and adaptability.

Testing Performance
The trained CoBT-SRL model was tested and compared with the aforementioned benchmark methods. To comprehensively evaluate the performance of CoBT-SRL and the other methods, tests were conducted in unsignalized intersection scenarios under both sparse and dense traffic conditions. The primary evaluation metrics included average travel time, collision rate, TTC violations, and average acceleration. Average travel time was used to assess traffic efficiency, collision rate and TTC violations were used to evaluate driving safety, and average acceleration was used to measure passenger comfort. The test results are shown in Table 2. From the table, it is clear that CoBT-SRL achieved the best performance under both sparse and dense traffic conditions, with collision rates and TTC violations both at 0, average travel times of 16.34 s and 21.91 s, and average accelerations of 0.39 m/s² and 0.30 m/s², respectively. Compared to the other benchmark methods, CoBT-SRL demonstrated superior performance in traffic efficiency, driving safety, and passenger comfort, indicating significant advantages in coordinating autonomous vehicles at unsignalized intersections. Although the VICS method based on MPC also achieved a zero collision rate, it performed less well in terms of traffic efficiency and passenger comfort, highlighting the distinct advantages of CoBT-SRL. Additionally, to evaluate the distribution and stability of the test data from multiple runs, box plots were used to represent the test results. Using the dense traffic scenario as an example, the test results are shown in Figure 5.
Figure 5a shows the distribution of the average travel time. From the figure, it can be seen that CoBT-SRL's average travel time is relatively stable and short, indicating an advantage in traffic efficiency. The average travel times of the other three methods are higher than that of CoBT-SRL, with VICS having the longest average travel time due to its optimization-based approach, which requires longer computation times. Although TD3 also has a low standard deviation in average travel time, its mean value is much higher than that of CoBT-SRL, demonstrating CoBT-SRL's superior performance in traffic efficiency. Figure 5b shows the distribution of collision rates. From the box plot, it can be seen that both CoBT-SRL and VICS have collision rates of 0, while PPO and TD3 have higher collision rates. VICS achieves a collision rate of 0 because its MPC-based method can optimize vehicle trajectories through optimal control to avoid collisions. Similarly, CoBT-SRL achieves a collision rate of 0 by considering safety constraints during training, effectively preventing collisions. In contrast, PPO and TD3, which do not prioritize safety constraints, have higher collision rates, highlighting CoBT-SRL's significant advantage in driving safety.
Figure 5c shows the distribution of TTC violations. It can be seen that CoBT-SRL has zero TTC violations, while VICS also maintains a low level of TTC violations. VICS avoids TTC violations by optimizing vehicle trajectories through optimal control, while CoBT-SRL achieves the same outcome by enforcing safety constraints during training, demonstrating excellent performance in ensuring driving safety. In contrast, PPO and TD3 have a higher number of TTC violations, mainly because these RL methods do not prioritize safety constraints, further underscoring CoBT-SRL's superior results in driving safety.
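For reference, TTC is typically computed as the gap to the leading vehicle divided by the closing speed, and a violation is a TTC below some safety threshold. The sketch below assumes a 3 s threshold purely for illustration; the paper's exact threshold is not restated here:

```python
def ttc(gap, closing_speed):
    """Time-to-collision for a follower closing on a leader:
    gap / closing speed; infinite when the gap is not shrinking."""
    return float("inf") if closing_speed <= 0 else gap / closing_speed

def ttc_violation(gap, closing_speed, threshold=3.0):
    """A TTC below the threshold (an assumed 3 s here) counts as a violation."""
    return ttc(gap, closing_speed) < threshold
```

Counting such events per test episode yields the TTC-violation metric plotted in Figure 5c.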
Figure 5d shows the distribution of average acceleration. CoBT-SRL's average acceleration is relatively stable and low, while the average acceleration of the other three algorithms is higher and shows a more unstable overall distribution. Average acceleration reflects passenger comfort, with smoother driving improving passenger comfort. Therefore, the lower acceleration of CoBT-SRL indicates an advantage in passenger comfort.
In summary, CoBT-SRL achieves the best performance in traffic efficiency, driving safety, and passenger comfort, demonstrating significant advantages in coordinating autonomous vehicles at unsignalized intersections.

Additionally, the proposed algorithm demonstrates a certain level of generalizability, making it applicable to other types of intersection scenarios, such as three-way intersections (T-junctions or Y-junctions), roundabouts, and other irregular traffic environments. Before applying the algorithm to new settings, it is essential to fine-tune the model's hyperparameters, including adjusting the coefficients of the reward and cost functions, vehicle turning formulas, control area size, and other environment-specific parameters. For scenarios with significantly increased complexity, additional state variables, such as heading angle and driving direction, can be introduced to provide a more comprehensive description of the traffic situation. Furthermore, more action options, such as passage permissions and route choices, can be included to optimize traffic flow control. Increasing the number of neural network layers and neurons per layer can enhance the model's flexibility to handle more complex problems, although it is important to consider the risk of overfitting that may arise from a more complex model. Therefore, with careful design and appropriate optimization, the proposed method is expected to deliver excellent performance even in more complex situations.

Conclusions
This study proposes the CoBT-SRL framework for centralized coordination among vehicles at unsignalized intersections, aiming to address the limitations of traditional RL policy networks in reasoning capability and ensuring policy safety. First, an expert dataset is constructed by collecting data on past states, actions, RTG, and CTG, and data filtering is performed. Leveraging the powerful ability of transformers to capture long-distance dependencies when solving sequence modeling problems, autoregressive modeling is applied to the collected dataset to predict future actions based on past states, actions, and other trajectory information. Second, the proposed framework introduces a maximum-entropy RL approach, which increases policy exploration by maximizing the entropy of the policy within constrained bounds, thereby enhancing the model's generalization performance. Experimental results demonstrate that CoBT-SRL offers significant advantages in centralized coordination of autonomous vehicles at unsignalized intersections, exhibiting excellent traffic efficiency, driving safety, and passenger comfort.
Although CoBT-SRL achieves significant advantages in centralized coordination of autonomous vehicles at unsignalized intersections, several challenges remain when applying this technology to more complex traffic scenarios. The current research primarily focuses on the management of single intersections. Extending this approach to broader traffic systems, such as coordinated control between main roads or collaborative management of multiple intersections within the same network, holds more profound practical significance and application value. Moreover, in light of the rapid development and widespread application of large models, future research should explore how to further leverage these advanced technologies. Through this approach, we aim to significantly enhance the model's generalization capability and reasoning performance, enabling more effective handling of complex and dynamic traffic environments.

Figure 1.
Figure 1. The proposed CoBT-SRL framework is primarily divided into two components: (1) Data Collection and Filtering, and (2) Policy Training and Evaluation. Historical trajectory information is collected by interacting with the environment using the expert policy, and the collected data are filtered. Then, a transformer model is used as the policy for training and evaluation.

Figure 2.
Figure 2. Consideration of safety cost sub-items and trajectories with varying risk levels at intersection: the white dashed lines indicate safe trajectories, the blue dashed lines potentially risky trajectories, and the black dashed lines highly dangerous trajectories.

Figure 4.
Performance evaluation during training: (a) normalized loss function and (b) exploration entropy.

Figure 5.
Performance comparison after deployment in terms of efficiency, safety, and comfort: (a) mean travel time; (b) collision rate; (c) number of TTC violations; (d) mean acceleration.

Algorithm 1
CoBT-SRL for AIM
Require: expert policy π_expert, trajectory samples T_s, trajectory length T_l, cost limit d, target entropy β, batch size B, learning rate α_θ, training iterations I, context length K
1: for each sample s in T_s do
2:   Initialize an empty trajectory τ
3:   Sample the initial global state s_1 from the environment
4:   for t = 1, . . . , T_l do
5:     Generate joint action a_t using the expert policy π_expert
6:     Execute a_t in the environment
7:     Observe the next state s_{t+1}
8:     Compute reward r_t = R(s_t, a_t, s_{t+1})
9:     Compute cost c_t = C(s_t, a_t, s_{t+1})
10:    Append the new tokens to the trajectory τ = {s_k, a_k, r_k, c_k}_{k=1}^{t−1} ∪ {s_t, a_t, r_t, c_t}
11:   end for
12:   if Σ_{t=1}^{T_l} c_t ≤ d then
13:     Compute RTG R_t and CTG C_t for each time step
14:     Get the new trajectory τ = {R_t, C_t, s_t, a_t}_{t=1}^{T_l}
15:     Add τ to the dataset T
16:   end if
17: end for
18: for iteration i = 1, . . . , I do
19:   Sample a batch of B trajectories from T
20:   for each trajectory τ in the batch do
21:     Extract training samples with context length K
22:   end for
23:   for each training sample do
24:     Predict joint action of vehicles â_t
25:     Compute NLL loss and entropy and construct the Lagrangian function L(θ, λ)
26:     Update policy parameter θ
27:     Update Lagrange multiplier λ
28:   end for
29:   Evaluate the updated policy
30: end for