Abstract

As a popular research field, autonomous driving may offer great benefits for human society. To achieve it, current studies often apply machine learning methods such as reinforcement learning, which enable an agent to interact and learn in a simulated environment. However, most simulators lack realistic traffic, which may cause a deficiency in realistic interaction. The present study adopted the SMARTS platform to create a simulator in which the trajectories of the vehicles in the NGSIM I-80 dataset were extracted and used as the background traffic. The built NGSIM simulator was used to train a model with the proximal policy optimization (PPO) method. An actor-critic neural network was applied, and the model takes as input 38 features that encode the information of the host vehicle and the nearest surrounding vehicles in the current and adjacent lanes. A2C was selected as a comparative method. The results revealed that the PPO model outperformed the A2C model in the current task by collecting more rewards, traveling longer distances, and encountering fewer dangerous events during model training and testing. The PPO model achieved an 84% success rate in the test, which is comparable to related studies. The present study demonstrated that public driving datasets and reinforcement learning can provide a useful tool for achieving autonomous driving.

1. Introduction

It is widely accepted that autonomous driving (AD) may help alleviate traffic congestion and reduce vehicle accidents and the driver fatigue that accompanies manual driving [1, 2]. Researchers and car companies have devoted considerable effort to achieving AD, and in recent years its development has made great progress; commercial deployment has been realized in some specific closed, low-speed scenarios.

The key technologies of autonomous vehicles (AVs) include perception, decision-making, planning, and control [3]. Among them, decision-making, also referred to as the driving policy, is responsible for deciding the behavior of AVs [4]. The driving policy takes the information collected from perception and outputs an appropriate action for the AV. In real-world scenarios with complex road environments and dynamic traffic, it is vital to design a driving policy that considers the uncertainty of the driving environment and negotiates with surrounding traffic safely.

Two common approaches have been applied to establish the driving policy of AVs [5]. One is the rule-based method, which adopts traffic rules and expert knowledge to construct a rule library from which an appropriate rule is selected according to the current situation of the AV [6, 7]. As it is hard to enumerate all real-world situations, the rule-based method may not generalize well when confronted with a new situation [8].

The other approach is the learning-based method. In contrast with manually designing rules, the learning-based method forms the driving strategy from data automatically. The flourishing of machine learning enables the use of expressive models such as neural networks to represent complex behaviors like driving, and learning-based methods have attracted increasing attention in recent years [9, 10]. Imitation learning (IL), one of the best-known learning-based methods, has been applied by researchers to achieve AD [11, 12]. The principle of IL is to directly learn the mapping between drivers’ actions and the corresponding states. Although IL has been proven effective in some studies, several shortcomings have also been found. First, IL requires collecting a large number of driver demonstrations, which can be expensive and time-consuming. Second, the learned policy of IL may suffer from the covariate shift problem [13].

Another type of learning-based method is reinforcement learning (RL). In RL, an agent learns by interacting with the environment in a trial-and-error manner [14]. RL does not require collecting expert demonstrations, and the agent learns by maximizing long-term returns, which helps avoid the covariate shift problem of IL. Combined with deep learning, deep reinforcement learning (DRL) has been successfully applied to solve the game of Go [15], play Atari games [16], and accomplish locomotion tasks [17]. In applications of DRL to AD, Zhang et al. [9] used the deep deterministic policy gradient algorithm to achieve automatic driving; the trained model can reach the defined goal and successfully avoid obstacles. Cai et al. [18] proposed an algorithm called DQ-GAT, which combines deep Q-learning and graph attention-based networks to achieve safe and efficient autonomous driving in different urban environments. Shi et al. [19] sought to solve the problem of controlling AVs at urban unsignalized intersections using the proximal policy optimization (PPO) algorithm.

In the related studies, relatively simple and unrealistic background traffic was used in the simulators. Since RL algorithms require interaction with the environment, an unrealistic environment may lead to unsafe or unrealistic behavior of the learned policy. In this study, the next-generation simulation (NGSIM) dataset was extracted and used as the background traffic for the simulated environment. NGSIM provides, so far, the largest traffic dataset recorded by roadside cameras on US national highways [20]. The realistic and diverse features of the NGSIM dataset make it suitable for creating a simulator to train and test RL algorithms for AD. Therefore, in the present study, a simulation environment incorporating the NGSIM traffic dataset was built on the SMARTS platform [21], and the DRL algorithm proximal policy optimization (PPO) [22] was applied to realize AD in a highway scenario.

The contributions of this paper can be summarized as follows: (1) a data-driven approach is introduced to establish a realistic environment in which the background traffic is reproduced from the NGSIM dataset; (2) a modern DRL algorithm (PPO) is applied to train an agent to learn to drive in this environment; (3) a state representation is proposed that extracts the most relevant information about the surrounding traffic; (4) multiple indexes are applied to analyze the training and test results of the trained models.

The rest of the paper is organized as follows: Section 2 briefly reviews the background of this study. Section 3 describes the architecture of the proposed model, the details of the state representation, and the proposed algorithm. Section 4 describes the baseline method and the evaluation metrics. Section 5 presents the training and the test results. The final section presents the discussion and conclusion.

2. Background

2.1. Reinforcement Learning

The Markov decision process (MDP) is often used to model the sequential decision-making problem in RL. An MDP consists of a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$. Specifically, $\mathcal{S}$ denotes the state space and $\mathcal{A}$ denotes the action space; $\mathcal{P}(s' \mid s, a)$ denotes the transition matrix, which gives the probability of transitioning from state $s$ to state $s'$ after taking action $a$; $\mathcal{R}(s, a)$ denotes the reward function, which encodes the objectives or preferences of the agent in RL; and $\gamma \in [0, 1)$ is the discount factor.

The objective of RL is to seek the optimal policy $\pi^{*}$ that attains the maximum value for all states, as illustrated in equation (1) [14]:

$$\pi^{*} = \arg\max_{\pi} \, \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right], \tag{1}$$

where $\mathbb{E}_{\pi}[\cdot]$ denotes the expected total reward when following policy $\pi$, $a_t$ is sampled from $\pi(\cdot \mid s_t)$, and $s_{t+1}$ is the next state reached when taking action $a_t$ in state $s_t$, as determined by the transition matrix.

In model-free RL, there are three fundamental methods for finding the optimal policy: the value-based method, the policy-based method, and the actor-critic method. Among them, the actor-critic method combines the advantages of the other two and has become the basis of modern RL algorithms [23].

In the actor-critic framework, the actor is responsible for choosing the action of the agent in order to interact with the environment, and the critic is responsible for evaluating the agent’s actions. The state-action value function and the advantage in actor-critic are defined in equations (2) and (3), respectively:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t, a_t\right], \tag{2}$$

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t). \tag{3}$$

2.2. Proximal Policy Optimization

PPO is an improved version of the policy-based method that adopts the actor-critic framework. PPO is well known for its strong performance across a wide range of tasks [24] and has therefore become a default choice for OpenAI’s projects.

There are two main features of PPO. First, the vanilla policy gradient often produces excessively large policy updates, which bring high variance and make training difficult to converge. Following the theory of trust region policy optimization, the predecessor of PPO, PPO constructs a clipped surrogate objective to constrain excessive policy updates at every policy-gradient step. The objective of the actor in PPO is

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right], \tag{4}$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio and $\mathrm{clip}(\cdot)$ clips the value of $r_t(\theta)$ to the range $[1-\epsilon, 1+\epsilon]$. $\epsilon$ is a hyperparameter recommended to be 0.2, and $\theta$ denotes the parameters of the actor.

Second, generalized advantage estimation (GAE) is used to compute the advantage required by the PPO gradient. GAE uses a weighted combination of n-step bootstrapped estimates to obtain a low-bias, low-variance estimate of the advantage $\hat{A}_t$:

$$\hat{A}_t = \sum_{l=0}^{\infty}(\gamma\lambda)^{l}\,\delta_{t+l}, \tag{5}$$

where $\delta_t$ is the temporal difference (TD) error

$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t), \tag{6}$$

and $\phi$ denotes the parameters of the critic.
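To make the recursion concrete, the following is a minimal sketch of how the GAE of equation (5) can be computed over a finite rollout. The function name, variable names, and the default values of the discount and GAE factors are illustrative assumptions rather than the paper’s actual implementation (whose hyperparameters are listed in Table 2).

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward recursion for GAE; `values` has length T + 1 because it
    includes a bootstrap value for the state after the last rollout step."""
    values = np.asarray(values, dtype=np.float32)
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        # TD error of equation (6); (1 - done) cuts the bootstrap at episode ends
        delta = rewards[t] + gamma * values[t + 1] * (1.0 - dones[t]) - values[t]
        # recursive form of equation (5): A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * (1.0 - dones[t]) * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # targets for the critic update
    return advantages, returns
```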

For the whole actor-critic, the loss function, which combines the clipped loss for the actor and the squared error for the critic, is defined in equation (7):

$$L(\theta, \phi) = \mathbb{E}_t\left[-L^{CLIP}(\theta) + c_1 L^{VF}(\phi) - c_2 H\left(\pi_\theta(\cdot \mid s_t)\right)\right], \tag{7}$$

where $L^{VF}(\phi)$ denotes the loss for the critic, calculated as $\left(V_\phi(s_t) - V_t^{\mathrm{target}}\right)^2$; $H(\pi_\theta)$ denotes the entropy of the policy $\pi_\theta$; and $c_1$ and $c_2$ are hyperparameters set to 0.5 and 0.01, respectively.
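As an illustration of equation (7), the following PyTorch-style sketch computes the combined loss from precomputed log-probabilities, advantages, and value targets. The tensor names and the function interface are assumptions, while the coefficients follow the values stated above ($\epsilon = 0.2$, $c_1 = 0.5$, $c_2 = 0.01$).

```python
import torch

def ppo_loss(new_logprob, old_logprob, advantage, value_pred, value_target,
             entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    # probability ratio r_t(theta), computed in log space for numerical stability
    ratio = torch.exp(new_logprob - old_logprob)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    actor_loss = -torch.min(unclipped, clipped).mean()         # clipped surrogate, equation (4)
    critic_loss = (value_pred - value_target).pow(2).mean()    # squared error for the critic
    entropy_bonus = entropy.mean()                             # encourages exploration
    return actor_loss + c1 * critic_loss - c2 * entropy_bonus
```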

2.3. NGSIM Simulator

A driving simulator with high fidelity is crucial for training an RL agent to learn to drive. Current studies make use of open-source simulators like CARLA [25] and SUMO [26]. Despite the many successful applications built on these simulators, none of them fully addresses the problem of multiagent interaction in driving, and interaction with diverse road users is considered a key open challenge for AVs. SMARTS was developed to enable realistic and diverse multiagent interaction and help the research community address this interaction challenge in AD [21]. The key features of SMARTS include realistic physics supported by the PyBullet physics engine, traffic simulation through SUMO integration, and web-based visualization with recording.

The present study adopted the SMARTS platform to build the NGSIM simulator. The NGSIM I-80 dataset was extracted and integrated as the background traffic. The NGSIM dataset is so far unique in traffic research and has been widely applied and analyzed by many researchers [27, 28]. The I-80 dataset is one of three public datasets included in NGSIM. It contains three 15-minute periods: 4:00 p.m. to 4:15 p.m., 5:00 p.m. to 5:15 p.m., and 5:15 p.m. to 5:30 p.m. [20]. These periods represent traffic before the rush hour, the transition to the rush hour, and the rush hour itself. Roadside cameras recorded video of the traffic in the monitored area, and vehicle trajectories were extracted from the recorded video. The extracted dataset contains 3366 vehicle trajectories. For each vehicle, the data include speed, longitudinal and lateral position, vehicle length and width, and the IDs of the following and lead vehicles in the current lane.

As illustrated in Figure 1, to set up the NGSIM simulator, the road map of the I-80 road segment first needs to be created. The road alignment parameters of the I-80 segment were thoroughly investigated and written in the SUMO road network format, which is supported by SMARTS. The road is about 310 m long and has six lanes, including a merging ramp lane. Then, the SMARTS scenario studio library was applied to generate the traffic. The route of every vehicle in the traffic was assigned according to the historical trajectories recorded in the I-80 dataset. Finally, when the NGSIM driving simulator is used for training, a random vehicle in the traffic is selected as the host vehicle to be controlled by the RL agent. The bicycle model is employed as the kinematics model for the host vehicle. The motion states of the other vehicles in the traffic are updated according to their historical trajectories, which have been smoothed by an extended Kalman filter. Since 3366 vehicles are included in the traffic, randomly picking one vehicle as the host vehicle brings high diversity to the simulator, which facilitates training and testing of the model.
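For intuition, the host vehicle’s motion under the acceleration and yaw-rate actions described in Section 3.2 can be approximated by a simple Euler update, sketched below. This is only an assumption-laden simplification; SMARTS’ internal bicycle model additionally accounts for wheelbase and steering geometry, and the time step used here is illustrative.

```python
import math

def kinematic_step(x, y, heading, speed, accel, yaw_rate, dt=0.1):
    """One Euler step of a simplified kinematic model driven by acceleration
    and yaw rate (the action space of Section 3.2)."""
    x += speed * math.cos(heading) * dt
    y += speed * math.sin(heading) * dt
    heading += yaw_rate * dt
    speed = max(0.0, speed + accel * dt)   # no reversing on the highway
    return x, y, heading, speed
```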

During simulation with the NGSIM simulator, the agent library of SMARTS provides rich sets of observations. A simulated radar is used to collect information on nearby vehicles. Events involving the host vehicle, such as collisions, driving off the road, and driving in the wrong direction, are also provided. SMARTS also provides a visualization tool called Envision, which enables visually checking the simulation in a web browser.

3. The Proposed Model

As mentioned, the PPO algorithm was selected to train an agent to learn to drive in the aforementioned NGSIM environment. This section will introduce the network architecture of the actor-critic which is the core component of PPO, the design of state representation, and the reward function.

3.1. The Architecture of Actor-Critic

After testing, the architecture of the actor-critic was determined as shown in Figure 2. The input of the actor-critic is the state, which consists of 38 features representing the information the agent observes during driving. The actor and critic networks have a similar structure, each with two hidden layers of 200 units; the last layer of the actor uses a tanh activation function to convert the output values to the range [−1, 1]. These outputs are then multiplied by the maximum range of each action to determine its proper scale. The critic outputs the value of the present state, and the actor outputs the action to be taken by the agent in the present state. Fully connected layers and tanh activation functions are used throughout this architecture.
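A minimal PyTorch sketch of this architecture is given below. The orthogonal initialization follows the training tricks mentioned in Section 3.4, while the Gaussian action distribution with a learnable log standard deviation is an assumption, since the paper does not describe how the policy distribution is parameterized.

```python
import math
import torch
import torch.nn as nn

def init_layer(layer, gain=math.sqrt(2)):
    """Orthogonal weight initialization with zero biases (Section 3.4 trick)."""
    nn.init.orthogonal_(layer.weight, gain)
    nn.init.zeros_(layer.bias)
    return layer

class ActorCritic(nn.Module):
    def __init__(self, state_dim=38, action_dim=2, hidden=200):
        super().__init__()
        self.actor = nn.Sequential(
            init_layer(nn.Linear(state_dim, hidden)), nn.Tanh(),
            init_layer(nn.Linear(hidden, hidden)), nn.Tanh(),
            init_layer(nn.Linear(hidden, action_dim), gain=0.01), nn.Tanh(),  # output in [-1, 1]
        )
        self.critic = nn.Sequential(
            init_layer(nn.Linear(state_dim, hidden)), nn.Tanh(),
            init_layer(nn.Linear(hidden, hidden)), nn.Tanh(),
            init_layer(nn.Linear(hidden, 1), gain=1.0),
        )
        # learnable log standard deviation for a Gaussian policy (an assumption)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mean = self.actor(state)     # normalized action mean in [-1, 1]
        value = self.critic(state)   # state value V(s)
        return mean, self.log_std.exp(), value
```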

3.2. State Representation and Action Space

The state space should contain all the necessary information for the agent to decide on an appropriate action. In the present study, the state is represented by 38 features grouped into two parts, as illustrated in Table 1. The first part contains features about the host vehicle. The second part describes the surrounding traffic. Different ways of representing the surrounding traffic were tested. To extract the most relevant information and to reduce the dimension of the state space, the nearby vehicles provided by the NGSIM simulator were further filtered to the six nearest vehicles (in blue), as illustrated in Figure 3. The six nearest vehicles are the vehicles in front of and behind the host vehicle in the left and right adjacent lanes, plus the lead and following vehicles in the same lane. Then, the Euclidean distance $d$, the relative speed $\Delta v$ (defined as the host vehicle’s speed minus the surrounding vehicle’s speed), the longitudinal distance $d_x$, and the lateral distance $d_y$ between the host vehicle (in yellow) and each of the six surrounding vehicles were calculated. The time to collision (TTC), defined as the longitudinal distance divided by the relative speed, was also calculated to represent the risk of collision. It should be noted that the distances here refer to the distances between the bounding boxes of the two vehicles, calculated by considering the width and length of the vehicles.
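The per-neighbor features described above can be computed as in the following sketch. The `VehicleState` container and its field names are hypothetical, and the TTC convention used here (longitudinal gap divided by closing speed, infinite when not closing) is one common choice.

```python
import math
from dataclasses import dataclass

@dataclass
class VehicleState:          # hypothetical container for one vehicle's state
    x: float                 # longitudinal position (m)
    y: float                 # lateral position (m)
    speed: float             # m/s
    length: float            # m
    width: float             # m

def neighbor_features(host: VehicleState, other: VehicleState):
    # gaps between bounding boxes, accounting for vehicle length and width
    dx = max(abs(other.x - host.x) - (other.length + host.length) / 2.0, 0.0)
    dy = max(abs(other.y - host.y) - (other.width + host.width) / 2.0, 0.0)
    dist = math.hypot(dx, dy)             # Euclidean distance d
    dv = host.speed - other.speed         # relative speed (positive when closing)
    ttc = dx / dv if dv > 1e-3 else float("inf")   # time to collision
    return dist, dv, dx, dy, ttc
```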

The present study adopted a continuous action space consisting of acceleration and yaw rate. To limit the action values to a reasonable scope, the maximum range of acceleration was set to [−3, 3] and the maximum range of yaw rate was set to [−1, 1], following the study of Xiao-fei et al. [8].
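The mapping from the actor’s normalized output to these action ranges can be expressed as follows; the physical units are not stated in the paper and are therefore left implicit here.

```python
import numpy as np

# maximum magnitudes of [acceleration, yaw rate] from Section 3.2
ACTION_MAX = np.array([3.0, 1.0])

def scale_action(normalized_action):
    """Map the actor's tanh output in [-1, 1] to the ranges [-3, 3] and [-1, 1]."""
    return np.clip(normalized_action, -1.0, 1.0) * ACTION_MAX
```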

3.3. Reward Function

The design of the reward function is vital for ensuring the convergence and performance of the RL algorithm. The reward function should encode the objective of the agent. The present study first tried a sparse reward configuration: the agent received a large reward when reaching the destination and a negative reward when any of the aforementioned dangerous events occurred. However, the results showed that the agent could not learn effectively in the NGSIM simulator, which has high-dimensional observations.

Therefore, the present study used a shaped reward function. After careful testing, the reward function was determined to be a linear combination of the three parts below.

3.3.1. Speed Reward

The speed of the host vehicle should not exceed the speed limit of the road. Therefore, the speed reward is set to discourage the agent from violating the speed limit. When the speed of the agent exceeds the speed limit, the agent receives a negative reward of −1; otherwise, the agent gets a small reward of 0.01.

3.3.2. Lane-Keeping Reward

Lane keeping is the most common task in daily driving, especially in the NGSIM simulator, which has rather congested traffic. To encourage the agent to maintain its lateral position around the lane center, the lane-keeping reward is calculated by a Gaussian function as in equation (8); when the vehicle drifts far from the lane center, the reward becomes smaller:

$$r_{\mathrm{lane}} = \exp\left(-\frac{(d - \mu)^2}{2\sigma^2}\right), \tag{8}$$

where $d$ denotes the distance to the center of the current lane, and $\mu$ and $\sigma$ denote the mean and the variance of the lateral position, set to 0.9 and 0.2, respectively.

3.3.3. The Terminal Reward

When the ego vehicle reaches a terminal state, such as colliding with another vehicle, reaching the destination, or driving off the road, the simulation ends and restarts from a random initial position. The terminal reward, defined in equation (9), assigns a positive reward for reaching the destination and a negative reward for the dangerous terminal events.

The overall reward function is defined in equation (10). A constant $c$ is introduced in the formulation to encourage the agent to continue to explore:

$$R = w_1 r_{\mathrm{speed}} + w_2 r_{\mathrm{lane}} + w_3 r_{\mathrm{terminal}} + c, \tag{10}$$

where $w_1$, $w_2$, and $w_3$ are the weights of the speed reward, the lane-keeping reward, and the terminal reward, respectively.
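Putting the three terms together, the shaped reward of equation (10) can be sketched as below. The weights $w_1$–$w_3$, the exploration constant $c$, and the terminal reward magnitudes are placeholder values, since the paper’s tuned values are not reproduced here; only $\mu$ and $\sigma$ follow Section 3.3.2.

```python
import math

def shaped_reward(speed, speed_limit, lane_offset, terminal_event=None,
                  w1=1.0, w2=1.0, w3=1.0, c=0.1, mu=0.9, sigma=0.2):
    # speed reward (Section 3.3.1): penalize exceeding the speed limit
    r_speed = -1.0 if speed > speed_limit else 0.01
    # lane-keeping reward (equation (8)): Gaussian in the distance to the lane center
    r_lane = math.exp(-((lane_offset - mu) ** 2) / (2.0 * sigma ** 2))
    # terminal reward (equation (9)): placeholder magnitudes, zero for non-terminal steps
    terminal_values = {"reached_goal": 1.0, "collision": -1.0,
                       "off_road": -1.0, "wrong_way": -1.0}
    r_terminal = terminal_values.get(terminal_event, 0.0)
    # overall reward (equation (10)) with an exploration constant c
    return w1 * r_speed + w2 * r_lane + w3 * r_terminal + c
```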

3.4. The Proposed Algorithm

The proposed algorithm is presented in Algorithm 1 below. PyTorch was used to implement the proposed model, and the Adam optimizer was used for training. Several tricks were implemented to facilitate training, including linear decay of the learning rate, orthogonal initialization of the network parameters, and normalization of the state input [29].

Input: Randomly initialize the parameters of the actor-critic as $\theta_0$; set the initial learning rate $\alpha_0$
For $k = 0$ to $N - 1$, repeat the following steps
 Use the policy $\pi_{\theta_k}$ to interact with the NGSIM environment for $T$ steps, record the trajectories of the agent as $\tau_k$, and calculate the reward according to equation (10) for every state in the trajectories.
 Compute the advantage $\hat{A}_t$ using GAE.
 Compute the gradient according to equation (7) with $K$ epochs and minibatch size $M$, and update $\theta$ using the Adam optimizer.
 Linearly decay the learning rate $\alpha$
End
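The following high-level sketch mirrors Algorithm 1. The environment wrapper and the rollout and minibatch helpers (`collect_rollout`, `iterate_minibatches`, `evaluate_batch`) are hypothetical placeholders, the default hyperparameter values stand in for those listed in Table 2, and the loop reuses the `compute_gae` and `ppo_loss` sketches above.

```python
import torch

def train(env, model, iterations=1000, rollout_steps=2048,
          epochs=10, minibatch_size=64, lr0=3e-4):
    # collect_rollout, iterate_minibatches, and evaluate_batch are hypothetical helpers
    optimizer = torch.optim.Adam(model.parameters(), lr=lr0)
    for k in range(iterations):
        # 1. interact with the NGSIM environment for T steps with the current policy
        states, actions, old_logprobs, rewards, values, dones = collect_rollout(
            env, model, rollout_steps)
        # 2. compute advantages and critic targets with GAE (equations (5)-(6))
        advantages, returns = compute_gae(rewards, values, dones)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        # 3. K epochs of minibatch updates on the combined loss (equation (7))
        for _ in range(epochs):
            for batch in iterate_minibatches(states, actions, old_logprobs,
                                             advantages, returns,
                                             batch_size=minibatch_size):
                loss = ppo_loss(*evaluate_batch(model, batch))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        # 4. linearly decay the learning rate
        for group in optimizer.param_groups:
            group["lr"] = lr0 * (1.0 - (k + 1) / iterations)
```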
3.5. Hyperparameters

The choice of hyperparameters is an important factor that may greatly affect the performance of an RL algorithm. The hyperparameters used in this study were carefully tuned to achieve the best training performance and are listed in Table 2.

4. Model Investigated and Evaluation Metrics

4.1. A2C

A2C, which stands for advantage actor-critic, was proposed to solve the problem of high variance in the original actor-critic method [30]. The algorithm estimates the advantage by subtracting a baseline value from the return. A2C has been found to be more stable during training and effective for continuous control problems. Therefore, A2C was chosen as the baseline method to be compared with PPO in the present study.
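For contrast with PPO’s clipped objective, a minimal sketch of the A2C loss used by the baseline is shown below; the tensor names and coefficient values mirror the PPO sketch above and are assumptions, since the baseline’s hyperparameters follow the original A2C paper.

```python
import torch

def a2c_loss(logprob, advantage, value_pred, value_target,
             entropy, c1=0.5, c2=0.01):
    # policy gradient with the critic's value as a baseline (advantage = return - value)
    actor_loss = -(logprob * advantage.detach()).mean()
    # squared error for the critic
    critic_loss = (value_pred - value_target).pow(2).mean()
    return actor_loss + c1 * critic_loss - c2 * entropy.mean()
```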

For a fair comparison, the same network architecture used in the PPO model as described above was adopted for the A2C model. The state input, action space, and reward design are all the same as those of the PPO model. The RMSProp optimizer was used to train the A2C model, and the learning rate was set to 5e−5. The hyperparameters used in A2C were set according to the original paper [30].

4.2. Evaluation Metrics

To evaluate the performance of the models, three kinds of metrics were adopted as follows:
(1) As the most representative index in RL, the mean trajectory reward, calculated by averaging the accumulated rewards over every trajectory in an episode, was selected as an index. Following the objective of RL, the agent should collect more rewards during the training process.
(2) To evaluate the agent’s ability to drive safely in the NGSIM simulator, the mean distance traveled by the agent in each episode was chosen as another metric. A longer distance traveled by the agent represents better skill in negotiating with road traffic.
(3) The respective numbers of the three different dangerous events, namely colliding with other vehicles, driving in the wrong direction, and driving off the road, were selected as metrics.
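The bookkeeping behind these three metrics can be sketched as follows; the per-episode record structure is hypothetical.

```python
def summarize_episodes(episodes):
    """`episodes`: list of per-episode records, each a dict with keys
    "total_reward", "distance", and "terminal_event"."""
    n = len(episodes)
    mean_reward = sum(ep["total_reward"] for ep in episodes) / n    # metric (1)
    mean_distance = sum(ep["distance"] for ep in episodes) / n      # metric (2)
    event_counts = {"collision": 0, "wrong_way": 0, "off_road": 0}  # metric (3)
    successes = 0
    for ep in episodes:
        event = ep["terminal_event"]
        if event == "reached_goal":
            successes += 1
        elif event in event_counts:
            event_counts[event] += 1
    return mean_reward, mean_distance, event_counts, successes / n
```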

5. Results

Figure 4 presents the change in mean trajectory rewards for the PPO and A2C models during training. Global steps, which represent the total number of steps the agent interacts with the NGSIM simulator, are used as the X-axis. As can be seen, the mean trajectory rewards increased gradually for both models as training proceeded. The reward curves fluctuate, but such fluctuation is very common in the training of RL models [22]. The PPO agent collects much higher mean trajectory rewards than the A2C agent. For the PPO model, the mean trajectory rewards tend to stabilize after about 3e5 global steps, indicating that the model has converged.

Figure 5 shows the trend of the mean travel distance during training. The mean travel distance increases with the same trend as the mean trajectory rewards, which shows that both the PPO agent and the A2C agent learn to improve their driving skill during training. The PPO agent maintains a longer distance than the A2C agent throughout the training process, which fits well with the trend of the mean trajectory rewards in Figure 4. At the beginning of training, the PPO agent can only drive about 30 m on average; however, after the model converges at about 3e5 global steps, the mean travel distance rises to 310 m in some episodes and fluctuates around 280 m at the end of training. Since the overall length of the road is about 310 m, this result shows that the trained PPO agent can nearly complete the NGSIM road segment.

To test the trained models, the simulation was repeated 100 times for each of the two models in the NGSIM environment. Table 3 lists the test results, including the success rate, the mean travel distance, and the respective counts of the aforementioned events. During the test, the PPO agent reached the destination 84 times; in the remaining 16 runs, the agent failed to finish the scenario, mainly because of collisions. The mean travel distance is 281.37 m, which is very similar to the mean travel distance at the end of training and demonstrates that the trained model has fair generalization ability. In comparison, the A2C agent only has a success rate of 38%, and its mean travel distance of 122.95 m is much shorter than that of the PPO agent. The A2C agent encounters a large number of collisions in the test.

The results above all demonstrate the superior performance of the PPO model compared with the A2C model. Therefore, the rest of this section presents further analysis of the PPO model only. During the test, speed, headway, and relative speed were recorded and analyzed to gain a better understanding of the PPO model. The test simulations include about 30613 data points, corresponding to about 51 minutes in total. A normal distribution was fitted to each of the aforementioned variables.

Figure 6 presents the distribution of speed in the test data. The mean speed is 7.75 m/s and the standard deviation is 3.76 m/s. The relatively low speed indicates that the simulated environment has very congested traffic. Figure 7 presents the time headway (TH) distribution during the test. Data with a TH higher than 10 s were excluded, which leaves about 30177 data points. As can be seen, the TH values are centered around 2.8 s, with only a small portion distributed below 2 s. The distribution of TH shows that the PPO agent can maintain a safe distance from the lead vehicle.

Figure 8 shows the distribution of TTC. Data points with a TTC higher than 8 s or below 0 s were excluded, leaving 1880 data points for analysis. The majority of TTC values are above 3 s, revealing a relatively safe following strategy of the host vehicle. However, a small portion of TTC values are below 1.5 s, which represents a high probability of collision, and some of these situations did eventually lead to a collision.

To show the performance of the trained agent, visualizations of two simulations are presented below. As shown in Figure 9, the host vehicle in red (ID 1695) was initialized at the position shown in keyframe 1 (Figure 9(a)); it remained in the same lane and kept a proper distance from the front vehicle until the end of the road. In Figure 9(c), it can be seen that the traffic ahead was congested, and the agent even learned to stop for a short period to wait for the front traffic to move.

Figure 10 shows a simulation of the host vehicle merging into the adjacent lane. Starting from the rightmost lane (Figure 10(a)), the host vehicle in red has to merge into the left adjacent lane because the rightmost lane narrows ahead. As can be seen, the agent successfully negotiated with the surrounding vehicles and merged into the target lane (Figure 10(d)). However, it should be noted that during the test described above, most collisions happened when the host vehicle was initialized in the rightmost lane and had to merge into the left traffic; the success rate of merging is thus lower than average.

6. Discussion and Conclusion

In the present study, the SMARTS platform, which features multiagent interaction, was adopted to build a simulator for training an agent to learn to drive. The NGSIM I-80 dataset was applied to generate the background traffic for the simulator. The built NGSIM simulator has high diversity, since the host vehicle controlled by the agent is randomly picked from the background traffic. The PPO algorithm was applied to train the agent, and the proposed model used an actor-critic neural network. Thirty-eight features were selected as inputs, including 8 features related to the host vehicle and 30 features associated with the surrounding vehicles. A linear combination of three reward terms was used as the reward function: a reward penalizing the vehicle for exceeding the speed limit, a reward encouraging lane keeping, and a reward for the terminal states. Another DRL method, A2C, was selected as the baseline algorithm for comparison.

The results showed that the PPO method outperformed the A2C method in both the training and test phases. The results of the present study are in line with the study of Schulman et al. [22], in which PPO performed better than other algorithms on various continuous control environments. When the PPO model converged, the mean trajectory reward and the mean travel distance had increased greatly. The trained PPO model achieved an 84% success rate and a mean travel distance of 281.37 m in the test. As reported by Chen et al. [31], their trained model achieved a success rate of over 80% in a roundabout scenario, and in the study of Folkers et al. [32], the proposed model had a similar success rate of over 80% in an urban scenario. The comparable success rate in this study indicates that the proposed model has learned to drive in the NGSIM environment. The two simulations presented in the results reveal that the model can handle the most common and important daily driving skills, such as lane keeping and car following, and that it has a preliminary ability to merge into an adjacent lane.

The proposed model still has a considerable failure rate, especially in the merging situation. The reasons can be summarized in two aspects. First, the reward function designed in this study encourages lane keeping, whereas lane changing is needed in the process of merging; a new design of the reward function may therefore be required for better performance in the merging scenario. Second, the PPO agent only encounters the merging situation when it is initialized in the rightmost lane, which happens with low probability; this relatively limited experience in the merging scenario may cause the worse performance. Merging is also an open problem in the study of AD [33, 34], and further study should be specifically conducted on this topic.

The present study had some important limitations. The proposed model focused on the scenario of multivehicle interaction on an expressway. The road geometry in the simulator is relatively simple, with no curves or intersections involved; future studies should take these scenarios into account. In addition, the proposed model used precise information about the road environment and the surrounding vehicles, which may be expensive to obtain in real-world situations; image inputs are a promising choice for future studies. Finally, the present study did not consider riding comfort in the design of the reward function. Recent studies have emphasized the significance of human-like AD models in which riding comfort and human reaction characteristics are considered to improve the acceptability of AVs [35, 36]. Future work should investigate incorporating these factors into the design of AD models.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was jointly supported by the Research and Development Project of Xiangyang City under Grant no. 2022ABH006510, the Hubei Key Laboratory of Power System Design and Test for Electrical Vehicle Opening Fund under Grant no. ZDSYS202217, the Hubei Superior and Distinctive Discipline Group of “New Energy Vehicle and Smart Transportation” under Grant no. XKTD012023, and the Industry-University Cooperation Education Program of the Ministry of Education under Grant no. 220600805293405.