A Personalized Behavior Learning System for Human-Like Longitudinal Speed Control of Autonomous Vehicles

As the main component of an autonomous driving system, the motion planner plays an essential role in safe and efficient driving. However, traditional motion planners cannot make full use of the on-board sensing information and lack the ability to efficiently adapt to different driving scenes and to the behaviors of different drivers. To overcome this limitation, a personalized behavior learning system (PBLS) is proposed in this paper to improve the performance of the traditional motion planner. This system is based on the neural reinforcement learning (NRL) technique, which can learn from human drivers online based on the on-board sensing information and realize human-like longitudinal speed control (LSC) through the learning from demonstration (LFD) paradigm. Under the LFD framework, the desired speed of human drivers can be learned by PBLS and converted to low-level control commands by a proportional-integral-derivative (PID) controller. Experiments using a driving simulator and real driving data show that PBLS can adapt to different drivers by reproducing their driving behaviors for LSC in different scenes. Moreover, in a comparative experiment with a traditional adaptive cruise control (ACC) system, the proposed PBLS demonstrates superior performance in maintaining driving comfort and smoothness.


Introduction
During the last several decades, considerable efforts have been made to design and develop highly autonomous vehicles that can drive with little or even no intervention from human drivers. However, the overall architecture of autonomous vehicles has changed little. Most existing autonomous vehicles share the same three-layer system architecture, i.e., a "sensing and perception" layer, a "motion planner" layer, and a "vehicle controller" layer [1,2].
Of the three layers, the motion planner is responsible for generating a feasible reference trajectory for the low-level controllers to follow [3]. In simple traffic environments with few or no surrounding vehicles, this kind of motion planner has shown its effectiveness and has been successfully applied to autonomous driving [4]. However, when more complex environments with dense traffic are considered, the increased requirements of driving smoothness, comfort, and personalized adaptation complicate the motion planner and make it difficult to find a feasible reference trajectory within the time limit [1]. In contrast, experienced human drivers have been found to work well in such complex environments without using a sophisticated algorithm to compute the optimal trajectory [5,6].

1. A reinforcement-learning-based system is proposed in this paper to learn the driver behavior and realize human-like control. Based on RL, the system dynamics are not required in advance and can be learned directly from the interaction between drivers and the driving environment.

2. By incorporating the controller into the learning system, the learned driving behavior can be converted to control commands for autonomous vehicles online, which realizes personalized adaptation for newly involved drivers.

The remainder of this paper is organized as follows. Section 2 describes the system architecture of PBLS and gives the definition of different modules in the architecture. Then, Section 3 presents a solution algorithm for training PBLS. After that, two tests based on the driving simulator and the real driving data are shown in Sections 4 and 5 to evaluate the performance of PBLS. Finally, Section 6 concludes the paper and gives future directions of the research.

Proposed Personalized Behavior Learning System
The system architecture for PBLS is shown in Figure 1, where the learning module is combined with a proportional-integral-derivative (PID) controller to interact with the traffic environment during the learning process. In this study, the focus is on a typical LSC problem for car following. Considering the difficulties and the risks of testing the online learning system in real-world scenarios, the testing traffic scenarios are built in PreScan, a simulation tool for simulating vehicle dynamics and traffic environments [28]. The testing scenario for LSC consists of one host vehicle and one leading vehicle. The objective of the host vehicle is to follow the leading vehicle and keep a stable distance to it. Here, the host vehicle can be controlled either by a human driver or by the proposed system. The driving data collected from the on-board sensors can be directly transferred to the learning module to activate the learning process.
The learning module is based on RL and can learn the desired longitudinal speed from human drivers when they are controlling the host vehicle. Using a PID controller, the desired speed can be converted to the low-level control commands for throttle and brake pressure. In this way, the human car-following behavior can be reproduced.

Formulation of the Learning Module
The objective of the learning module is to learn the desired speed of the human driver, i.e., to track the speed trajectory of a human driver. Thus, the learning problem can be defined as a trajectory tracking problem for the given system $\dot{x} = f(x, u)$ and a desired trajectory $(x_h, u_h)$. Let $s = x - x_h$ and $a = u - u_h$; the trajectory tracking problem can then be solved by a linear quadratic regulator (LQR) that minimizes the following cost function:

$$J = \sum_{k} \left( s_k^{\top} C s_k + a_k^{\top} D a_k \right) \tag{1}$$

such that

$$s_{k+1} = A s_k + B a_k \tag{2}$$

where $k$ is the time index, $s_k$ is the state vector of the trajectory tracking problem, $a_k$ is the control action, $A$ and $B$ are matrices related to the system dynamics, and $C$ and $D$ are positive-definite matrices for weighting the cost function. The system dynamics are required for solving this problem with the traditional LQR. In real applications, the system dynamics are usually difficult to know in advance. In this case, an RL method can be used to learn the optimal solution to the trajectory tracking problem defined above. Following [29], the cost at each time step can be defined by:

$$r_k = s_k^{\top} C s_k + a_k^{\top} D a_k \tag{3}$$

It should be noted that $r_k$ is different from its counterpart in the traditional RL problem, where $r_k$ is the reward at each time step and is used to formulate a maximization problem. Here, $r_k$ is related to the tracking error for the human behavior and thus should be minimized. For each state-action pair $(s_k, a_k)$, the Q function can be defined following the Bellman equation and is given as follows:

$$Q(s_k, a_k) = \begin{bmatrix} s_k \\ a_k \end{bmatrix}^{\top} \begin{bmatrix} H_{ss} & H_{sa} \\ H_{as} & H_{aa} \end{bmatrix} \begin{bmatrix} s_k \\ a_k \end{bmatrix} \tag{4}$$

where $H_{ss}$, $H_{sa}$, $H_{as}$, and $H_{aa}$ are matrices related to the system dynamics and the weights of the cost function. By setting the derivative of $Q$ with respect to $a_k$ to zero, i.e., $\nabla_{a_k} Q(s_k, a_k) = 0$, the optimal action can be derived and expressed by:

$$a_k^{*} = -H_{aa}^{-1} H_{as} s_k \tag{5}$$

For the car-following scenario considered here, the system dynamics are highly related to the distance between the host vehicle and the leading vehicle (denoted by $d$) as well as the speed of the host vehicle (denoted by $v$). Thus, for the system $\dot{x} = f(x, u)$, the following variables can be defined:

$$x = \begin{bmatrix} v & d \end{bmatrix}^{\top}, \qquad u = a \tag{6}$$

Therefore, the state of the tracking problem can be defined as:

$$s_k = \begin{bmatrix} v_k - v_{h,k} \\ d_k - d_{h,k} \end{bmatrix} \tag{7}$$

with the control action given by the acceleration difference $a_k - a_{h,k}$, where $v_k$ and $v_{h,k}$ are the speeds of the host vehicle controlled by the learning system and the human driver at time step $k$, respectively, $d_k$ and $d_{h,k}$ are the distances controlled by the learning system and the human driver at time step $k$, respectively, and $a_k$ and $a_{h,k}$ are the accelerations of the host vehicle controlled by the learning system and the human driver at time step $k$, respectively. Given the definition of state and action, the weights for the cost function can be determined as:

$$C = \begin{bmatrix} C_1 & 0 \\ 0 & C_2 \end{bmatrix}, \qquad D = D \tag{8}$$

where $C_1 + C_2 + D = 1$.
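As a small numerical illustration of the formulation above, the sketch below evaluates the per-step tracking cost and the optimal action. It assumes the usual quadratic forms $r_k = s_k^{\top} C s_k + a_k^{\top} D a_k$ and $a_k^{*} = -H_{aa}^{-1} H_{as} s_k$; the function names, the error values, and the entries of `H_as` and `H_aa` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def stage_cost(s, a, C, D):
    """Quadratic tracking cost r_k = s^T C s + a^T D a (assumed form)."""
    s = np.asarray(s, dtype=float)
    a = np.atleast_1d(np.asarray(a, dtype=float))
    return float(s @ C @ s + a @ D @ a)

def optimal_action(s, H_as, H_aa):
    """Action that zeroes dQ/da for a quadratic Q: a* = -H_aa^{-1} H_as s."""
    return -np.linalg.solve(H_aa, H_as @ np.asarray(s, dtype=float))

# Equal weights, matching the setting C1 = C2 = D = 1/3 used in the experiments.
C = np.diag([1 / 3, 1 / 3])
D = np.array([[1 / 3]])

# Hypothetical tracking errors: speed error (m/s) and distance error (m).
s = np.array([0.6, -1.5])
a = 0.3  # hypothetical acceleration difference (m/s^2)

r = stage_cost(s, a, C, D)  # (0.36 + 2.25 + 0.09) / 3 = 0.9

# Hypothetical blocks of the Q-function matrix (not values from the paper).
H_as = np.array([[0.2, 0.1]])
H_aa = np.array([[0.5]])
a_star = optimal_action(s, H_as, H_aa)  # -(0.12 - 0.15) / 0.5 = 0.06
```

Using `np.linalg.solve` rather than an explicit inverse is a design choice for numerical robustness; for the scalar action considered here the two are equivalent.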

Function Approximation Using ANN
It can be seen from Equation (4) that, to get the explicit Q value for each state-action pair, the system dynamics are required, i.e., the exact values of $H_{ss}$, $H_{sa}$, $H_{as}$, and $H_{aa}$ should be given. In the learning problem considered in this study, the system dynamics are not known in advance; thus, the Q values are instead learned from data samples, and $H_{ss}$, $H_{sa}$, $H_{as}$, and $H_{aa}$ are estimated for the purpose of calculating the optimal action.
Assume that the Q function can be approximated by a linear function of a parameter vector $\theta$, as given in Equation (9). Under the definition of Equation (9), Equation (4) can be rewritten as a linear function through the substitution in Equation (12). Substituting Equation (12) into Equation (4) yields Equation (13), from which Equation (14) is easily obtained. In this way, $H_{ss}$, $H_{sa}$, $H_{as}$, and $H_{aa}$ can be constructed once $\theta$ is obtained. From the definition of state and action, it can be seen that both variables are continuous; thus, traditional RL methods such as standard Q-learning, which can only deal with discrete state and action spaces, cannot be used here. Under such circumstances, the neural Q-learning (NQL) algorithm is adopted in this study to deal with the continuous problem. Under the framework of NQL, the continuous Q function can be approximated by an artificial neural network (ANN), so that all possible state and action values can be handled.
As shown in Figure 1, a three-layer feed forward ANN similar to [29] is designed for the learning module. To guarantee the performance of the ANN, all the input variables should be normalized [30]. For the feed forward ANN considered in this study, the state and action defined in Equation (7) are normalized as follows.
where $\Delta v_k = v_k - v_{h,k}$ and $\Delta d_k = d_k - d_{h,k}$; $\Delta v_{max}$ and $\Delta d_{max}$ are the maximum values of these two variables, and $\Delta v_{min}$ and $\Delta d_{min}$ are their minimum values. Here, $\Delta v_{min}$, $\Delta v_{max}$, $\Delta d_{min}$, and $\Delta d_{max}$ can be obtained from the data. In this way, both $s_{k,1}$ and $s_{k,2}$ are normalized into the range $[-1, 1]$. The action is normalized similarly through Equation (16), so that all the elements of $\xi_k$ are normalized into $[-1, 1]$. Let $\xi_k$ be the input vector for the input layer; the feed forward ANN can then be defined by its activation functions $\Gamma_i$, $i = 0, 1, 2, 3$, one for each node. The output of the ANN is the optimal Q function, which can be expressed by:

$$Q(\xi_k) = \Gamma_0\!\left( \sum_{i=1}^{3} w_{o,i}\, \Gamma_i\!\left( w_{h,i}\, \xi_k + b_i \right) + b_0 \right) \tag{17}$$

where $w_{h,i} = [\, w^{1}_{h,i} \; \cdots \; w^{5}_{h,i} \,]$ (for five input variables) is the weight vector for the $i$th node in the hidden layer, $w_{o,i}$ is the weight for the link from the $i$th hidden node to the output node, $b_i$ is the bias for the $i$th hidden node, and $b_0$ is the bias of the output node. When the optimal Q value is obtained, the elements of the parameter vector $\theta$ can be calculated by Equation (18). Then, the optimal action can be derived from Equation (5) by reconstructing $H_{aa}$ and $H_{as}$ from $\theta$ according to Equation (14).

Speed Control Module
Given the action, the desired speed can be easily derived from Equation (16) and calculated by Equation (19), where $v_{d,k}$ and $v_{d,k+1}$ are the desired speeds for the $k$th and the $(k+1)$th time step, respectively. The speed control module can then convert the desired speed to control commands for the throttle and the brake pressure of the host vehicle using a PID controller [31], given by Equation (20),
where $v_e(t)$ is the tracking error between the desired speed and the actual speed, $K_p$ is the proportional gain, $T_I$ is the integral time, $T_D$ is the derivative time, and $y(t)$ is the output of the controller, which can be converted to the throttle and the braking control commands by a conversion block. Both the PID controller and the conversion block are embedded in PreScan and implemented as a module named "Path follower". In this study, the default parameter values provided by PreScan for the PID controller are applied in all of the experiments: $K_p = 20$, $K_p/T_I = 0.3$, and $K_p/T_D = 3.0625$.
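To make the role of the speed controller concrete, the following is a minimal discrete-time PID sketch that turns a speed tracking error into a single control output. It is not PreScan's "Path follower" implementation: the class name `DiscretePID` is hypothetical, the three values 20, 0.3, and 3.0625 are treated here as the proportional, integral, and derivative gains (an assumption about how PreScan uses them), and the step size of 0.05 s is inferred from the 20 steps per second implied by the experiments.

```python
class DiscretePID:
    """Textbook discrete PID: y = kp*e + ki*integral(e) + kd*de/dt."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0      # running integral of the error
        self.prev_error = None   # no derivative on the first step

    def step(self, v_desired, v_actual):
        error = v_desired - v_actual
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Gains interpreted from the listed defaults (assumption), dt = 0.05 s.
pid = DiscretePID(kp=20.0, ki=0.3, kd=3.0625, dt=0.05)
y0 = pid.step(v_desired=15.0, v_actual=14.0)  # 20*1 + 0.3*0.05 = 20.015
y1 = pid.step(v_desired=15.0, v_actual=14.5)  # 10 + 0.0225 - 30.625 = -20.6025
```

A positive output would be mapped to throttle and a negative output to brake pressure by the conversion block; that mapping is not described in the text and is therefore left out of the sketch.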

Training Algorithm for PBLS
Technically, the goal of the learning system is to find the optimal Q value and its corresponding parameter vector $\theta$. Temporal difference (TD) learning [32] solves this problem by driving the TD error $e_k$ defined by Equation (21) toward zero. Here, the feed forward ANN is used to accomplish this goal. For $N$ time steps, the errors are accumulated to formulate the loss function for the ANN; for ease of calculation, a quadratic loss function is defined in Equation (22). The first term of Equation (22) is the sum of squared errors, which should be minimized by the ANN. The second term of Equation (22) is the weight decay term, which is used here to avoid over-fitting by reducing the magnitude of the weights [33].
From the definition of $e_k$, it can be seen that the bias of the hidden node does not affect the loss function and thus can be removed from Equation (17). Let $\Gamma_0$ be a linear function and $\Gamma_i$, $i = 1, 2, 3$, be hyperbolic tangent functions; Equation (17) can then be rewritten as Equation (23). The hyperbolic tangent function is selected, as it is a typical activation function for ANNs and has been proven to be effective in many practical cases [30].
Thus, according to Equations (18) and (21), the elements of $\theta$ can be obtained from Equation (24). The second term of Equation (24) is very small when the weights and the biases are small and can be ignored, as suggested by [29]. Therefore, $\theta$ is only related to the weight matrix of the ANN and can be calculated when $w_{o,i}$ and $w^{l}_{h,i}$ are updated. As the objective of the ANN is to minimize the loss function shown in Equation (22), the gradient descent method can be used here to update the weights. The weights for the output layer, the weights for the hidden layer, and the biases are each updated by a gradient descent step on the loss, where $u$ is the updating index and the network is updated every $N$ time steps. The key issue is then how to get the gradients of the weights and biases at each updating step, i.e., the terms $\partial e_k / \partial w_{o,i}$, $\partial e_k / \partial w^{l}_{h,i}$, and $\partial e_k / \partial b_i$. To this end, the back propagation (BP) algorithm can be used to train the ANN via a mini-batch training method. Updating the weights of the neural network too frequently, e.g., step by step with $N = 1$, may lead to poor generalization and unstable learning curves, especially when learning unstable human behaviors. To overcome this limitation, the network weights are usually updated every $N$ steps ($N > 1$) using a small batch of data. This training method is named mini-batch training and has been widely used for training neural networks [34]. In this paper, mini-batch training is used to train the feed forward ANN and, as suggested by [34], a small batch size of $N = 10$ (within the range of 2 to 32) is selected. This setting helps to avoid bias toward newly collected driving data and gives a relatively smooth learning curve in our experiment. Based on BP, the whole algorithm for the learning system is developed and shown in Algorithm 1.
a. Observe the state $s_k$ at the current step and get the recorded state $s_{k-1}$ and action $a_{k-1}$.
b. Get the reward $r_{k-1}$ through Equation (3).
c. Get the action $a_k$ through Equation (5).
The error for the output: $\delta_o \leftarrow e_k$ through Equation (21).
The error for the hidden layer:
Calculate the gradients for $i = 1, 2, 3$ and $l = 1, 2, 3, 4, 5$:
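The training steps can be sketched as one mini-batch back-propagation update for a network with five inputs, three hyperbolic tangent hidden nodes, and a linear output. Several details are assumptions rather than the paper's definitions: the TD error is taken as $e_k = r_{k-1} + Q(\xi_k) - Q(\xi_{k-1})$ without discounting, the gradient is taken only through $Q(\xi_{k-1})$ (a semi-gradient), the output bias is held fixed because it cancels in this error, and the learning rate, decay coefficient, and random inputs are placeholder values.

```python
import numpy as np

def q_value(xi, W_h, b_h, w_o, b_o):
    """Network output: three tanh hidden nodes followed by a linear output node."""
    h = np.tanh(W_h @ xi + b_h)           # hidden activations, shape (3,)
    return float(w_o @ h + b_o), h

def minibatch_update(batch, W_h, b_h, w_o, b_o, lr=0.01, decay=0.0):
    """One gradient-descent step over a batch of (xi_prev, r_prev, xi_now)."""
    g_Wh = np.zeros_like(W_h)
    g_bh = np.zeros_like(b_h)
    g_wo = np.zeros_like(w_o)
    loss = 0.0
    for xi_prev, r_prev, xi_now in batch:
        q_prev, h_prev = q_value(xi_prev, W_h, b_h, w_o, b_o)
        q_now, _ = q_value(xi_now, W_h, b_h, w_o, b_o)
        e = r_prev + q_now - q_prev        # assumed TD error (undiscounted)
        loss += 0.5 * e ** 2
        delta_o = -e                       # d(0.5 e^2) / d q_prev
        delta_h = delta_o * w_o * (1.0 - h_prev ** 2)  # back-propagated error
        g_wo += delta_o * h_prev
        g_Wh += np.outer(delta_h, xi_prev)
        g_bh += delta_h
    # Weight decay term on the weights (not on the biases).
    g_wo += decay * w_o
    g_Wh += decay * W_h
    loss += 0.5 * decay * (np.sum(w_o ** 2) + np.sum(W_h ** 2))
    return W_h - lr * g_Wh, b_h - lr * g_bh, w_o - lr * g_wo, loss

# Usage with a batch of N = 10 placeholder samples (random, not driving data).
rng = np.random.default_rng(0)
W_h = 0.1 * rng.standard_normal((3, 5))
b_h = np.zeros(3)
w_o = 0.1 * rng.standard_normal(3)
batch = [(rng.uniform(-1, 1, 5), rng.uniform(0, 1), rng.uniform(-1, 1, 5))
         for _ in range(10)]
W_h, b_h, w_o, loss = minibatch_update(batch, W_h, b_h, w_o, 0.0, lr=0.01, decay=1e-3)
```

Accumulating the gradients over the whole batch before changing any weight is what distinguishes this from the step-by-step update with $N = 1$ discussed above.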

Experiments with Constant Speed
The proposed learning system (PBLS) is tested in a simulation platform built by PreScan and Matlab/Simulink in this section. As mentioned in Section 2, the vehicle information and the driver data are required by the learning system. In PreScan, both the host and the leading vehicles are equipped with a virtual lidar system, a Global Positioning System (GPS), and vehicle-to-vehicle (V2V) communication systems. The vehicle information in terms of location, speed, and distance between the host vehicle and the leading vehicle can be obtained through these on-board systems.
As shown in Figure 2, driver data can be collected by the Logitech G29 driving simulator through the human-in-the-loop experiments. For real applications, the driving data can be obtained through the on-board sensing system. The vehicles involved in the experiments are modeled by the typical 2-D vehicle dynamics models (single-track model), which are embedded in Matlab/Simulink. The traffic environment and the driving scene are simulated in PreScan, which is connected to the driving simulator and provides drivers with the visual information. Two groups of experiments with different speed profiles for the leading vehicle, constant speed (CS) and variant speed (VS), are carried out to evaluate the performance of the proposed system.
In all the tests, the weight values are set as C 1 = C 2 = D = 1/3 to guarantee that each part of the cost r k has the same importance. Other parameters for PBLS are shown in Table 1, which are chosen according to experience and can guarantee a stable performance of PBLS.


Experimental Settings
The driving scene used in the constant speed scenarios is shown in Figure 3. A straight two-lane urban road with a length of 30 km is considered. In the test, the driver is asked to drive the host vehicle first, and then the driving data collected from the driver are transferred to PBLS, which is used to control the host vehicle in the same scene and learn the driving behavior from the collected driving data online. Once the learning algorithm has converged, PBLS can reproduce the learned behavior by setting the learning rate to zero.

In the first test, the leading vehicle keeps a constant speed, and three speed profiles, namely, low speed (L, 10 m·s⁻¹), medium speed (M, 15 m·s⁻¹), and high speed (H, 22 m·s⁻¹), are designed to form three different test scenarios. To test the adaptive learning ability of the proposed system, two drivers (A and B) are involved and asked to follow the leading vehicle in all three scenarios. Then, the learning system is triggered to learn the driving behavior from these two drivers. It should be noted here that the focus of this study is to develop a personalized learning system that has the ability to adapt to different driving behaviors. This kind of adaptation can be tested by involving two different drivers in this section. Analytical work involving more drivers can be considered in our future study to analyze the algorithm performance under various kinds of driving behaviors. The root mean square error (RMSE) is used to measure the learning error of the learning system, which is calculated by:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left( z_k - \hat{z}_k \right)^2}$$

where $z_k$ is the data point related to the learning system at step $k$, $\hat{z}_k$ is the observed data from human drivers at step $k$, and $n$ is the number of data points.
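The RMSE used as the learning error can be computed as in the short sketch below, which assumes the standard definition (square root of the mean squared difference between the system trace and the human trace). The function name and the three-sample speed traces are hypothetical and are not data from the experiments.

```python
import math

def rmse(system, human):
    """Root mean square error between two equally long traces."""
    if len(system) != len(human) or len(system) == 0:
        raise ValueError("traces must be non-empty and of equal length")
    squared = [(z - z_hat) ** 2 for z, z_hat in zip(system, human)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical speed traces in m/s: learning system versus human driver.
speed_rmse = rmse([10.0, 10.2, 9.9], [10.0, 10.0, 10.0])  # sqrt(0.05 / 3)
```

The same function applies to the distance traces, with the result then expressed in metres.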

Experimental Results
Figure 4 presents the learning curves of PBLS for different speed scenarios. In all three scenarios, the learning system can learn the stable distance and the speed curves within 5000 time steps (250 s). As shown in Figure 5, in the low-speed scenario, the learning RMSE for both the speed and the distance of the two drivers is kept at a very low level close to zero. However, as the speed of the leading vehicle grows, the performance of PBLS gets worse, with the RMSE for the speed increasing from 0.01 m·s⁻¹ to 0.37 m·s⁻¹ and the RMSE for the distance increasing from 0.05 m to 2.43 m. This result is reasonable, as in the low-speed scenario, both drivers can perform well in keeping a stable distance to the leading vehicle. In this situation, the curves for speed and distance are very smooth without large fluctuation after around 5000 time steps, and thus PBLS performs better in this scenario.
In all three scenarios, PBLS shows a better performance in reproducing the behavior of Driver A than that of Driver B, with a lower RMSE for Driver A. This is mainly because Driver A has more experience in driving and can keep a relatively stable curve for both speed and distance.

Experiments with Variant Speed
In the previous section, the learning ability of the proposed system was tested in scenarios with constant speed. In this section, three driving scenes with variant speeds for the leading vehicle are considered. In the first two driving scenes, the whole test is similar to the constant speed case, except that a traditional adaptive cruise control (ACC) system is considered here to make a comparison with PBLS, which is the focus of this section. In the third driving scene, driving data collected from real vehicles on the real road are used to test the learning system. The driver (Driver A) with more driving experience is involved in this section. In the following test, PBLS only learns from Driver A.
The ACC system is a widely used longitudinal speed control system, which is designed to assist drivers in keeping a pre-set time headway between the host vehicle and the leading vehicle [35]. The time headway is defined as the ratio of the distance ($d$) to the speed of the host vehicle ($v$). The desired time headway for ACC is set as 1.8 s in the test, which can keep $d$ between 20 m and 40 m when the leading vehicle has a speed between 10 m·s⁻¹ and 20 m·s⁻¹. In this way, the distances kept by ACC and PBLS fall within the same range, which helps to make a fair comparison.
Two indicators suggested by [36] are used here to evaluate and compare the performance of PBLS and ACC on driving comfort and smoothness. These two indicators are given by:

$$J_1 = \frac{a_{mean}}{v_{mean}}, \qquad J_2 = \frac{\mathrm{d}a}{\mathrm{d}t}$$

where the driving comfort is measured by $J_1$, which is obtained by dividing the average acceleration $a_{mean}$ by the average speed $v_{mean}$, and the driving smoothness is measured by $J_2$, which is the jerk of the vehicle, i.e., the time derivative of its acceleration $a$. The driving comfort is considered low when $J_1$ is at a high level, while a high driving smoothness corresponds to a low and stable $J_2$.
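The two indicators can be estimated from a sampled speed trace by finite differences, as in the sketch below. Two choices here are assumptions not stated in the text: the average acceleration is taken as the mean of the absolute acceleration, and the derivatives are approximated by first-order differences with a step of 0.05 s. The function name and the four-sample trace are hypothetical.

```python
import numpy as np

def comfort_and_smoothness(speed, dt):
    """Return J1 (mean |acceleration| / mean speed) and the jerk series J2."""
    v = np.asarray(speed, dtype=float)
    accel = np.diff(v) / dt            # finite-difference acceleration
    jerk = np.diff(accel) / dt         # finite-difference jerk (J2)
    j1 = np.mean(np.abs(accel)) / np.mean(v)
    return j1, jerk

# Hypothetical speed samples in m/s at 20 Hz.
j1, j2 = comfort_and_smoothness([10.0, 10.1, 10.3, 10.4], dt=0.05)
# accelerations are about [2, 4, 2] m/s^2, so the jerk is about [40, -40] m/s^3
```

Because differencing amplifies measurement noise, a real speed signal would usually be smoothed before the jerk is computed; the sketch omits this step.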

Driving Scene I
As shown in Figure 6a, in the first driving scene, the road layout is the same as in Section 4, while the speed of the leading vehicle changes between 10 m·s⁻¹ and 20 m·s⁻¹ during the whole test. For data collection, the driver in the host vehicle is asked to follow the leading vehicle with variant speed in the first run. After that, the proposed PBLS is triggered for behavior learning. In this case, the algorithm runs for 80,000 time steps (around 1 h for convergence) for learning and then runs for 40,000 time steps with the learning rate set to zero to reproduce the learned behavior. Figure 7 presents the distance and the speed comparison among the driver, PBLS, and ACC in the first driving scene. PBLS performs well in learning from the driver, with distance and speed curves close to those of the driver, which means the learning error (RMSE) of PBLS is at a very low level.
Compared to PBLS, the speed of ACC fluctuates more strongly, especially when the speed is close to 20 m·s⁻¹. As shown in Figure 8, the acceleration and the jerk ($J_2$) of ACC vary significantly during the whole test, while PBLS can keep a relatively stable curve for both the acceleration and the jerk. Thus, PBLS can provide better driving smoothness than ACC.

Driving Scene II
In the second driving scene presented in Figure 6b, the leading vehicle is controlled by a human driver without predefined speed profiles. Therefore, in the data collection phase, both the host vehicle and the leading vehicle are driven by human drivers. A typical intersection with a traffic light is involved to form Driving Scene II.
In this scene, the leading vehicle is asked to go through the intersection according to the traffic light, and the host vehicle follows the leading vehicle all the time. The traffic light changes following the order: yellow, red, and green. The time for the yellow light is set as 5 s (100 steps), and the red light lasts for 40 s (800 steps). There is no time limit for the green light, which guarantees that both vehicles can pass through the intersection.
The initial speed for the leading vehicle and the host vehicle is 8 m·s⁻¹, and the initial distance between these two vehicles is 30 m. It can be seen from Figure 9 that, because of the yellow and the red light, the leading vehicle slows down in the first 600 steps (30 s) when it is approaching the stop line. Then, it restarts and speeds up after waiting for 300 steps at the stop line.

In this test, the algorithm runs for 12,000 time steps (600 s) to converge, which means the whole test needs to be repeated 10 times. Similar to the test in Driving Scene I, after learning, the learning rate of the algorithm is set to zero to reproduce the learned behavior. As shown in Figure 10, compared with ACC, PBLS has better driving smoothness, with smoother acceleration and jerk trajectories. From Figure 11, it can be seen that PBLS can reproduce the behavior of the driver who controls the host vehicle with a very low RMSE, while the difference between the curves of ACC and the driver is very large (see Figure 9). Thus, compared with ACC, PBLS is more consistent with the driver's behavior and habits. In addition to driving smoothness, PBLS also performs better than ACC in driving comfort. As shown in Figure 12, the $J_1$ of PBLS is much smaller than the $J_1$ of ACC.
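The learning error quoted here is a root-mean-square error between the driver's trajectory and the reproduced one. A minimal sketch of that computation follows. The function name `rmse` and its argument names are hypothetical, and the sketch assumes both traces are sampled at the same time steps.

```python
import math


def rmse(reference, reproduced):
    """Root-mean-square error between two equally sampled traces."""
    if len(reference) != len(reproduced):
        raise ValueError("traces must have the same length")
    if not reference:
        raise ValueError("traces must not be empty")
    squared = [(r - p) ** 2 for r, p in zip(reference, reproduced)]
    return math.sqrt(sum(squared) / len(squared))
```

The same function could be applied separately to the distance and the speed traces, so that each controller is scored against the driver on both quantities.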

Driving Scene III
In the third driving scene, as shown in Figure 13, two real vehicles are involved for collecting the real driving data. The Beijing Institute of Technology (BIT) intelligent vehicle [37] is used as the host vehicle in this work. This vehicle is equipped with on-board sensing systems to capture the speed and the distance information. The detailed description of the host vehicle can be found in [37]. Both host and leading vehicles are driven by human drivers. The driver in the leading vehicle is asked to drive along a straight road with a changeable speed.
After the data collection process, real driving data are used to test the learning system. Testing the on-line learning and control system directly on a real-world road is highly risky, as slight learning deviations may lead to severe safety issues for both testing and surrounding vehicles. Thus, in this study, the real driving data are used to reproduce the observed real driving scene in PreScan, where the simulated leading vehicle follows the speed profile observed from the real world. The real behavior data collected from the host vehicle are used to train the PBLS in PreScan. The collected data shown in Figure 13 are divided into eight groups, and each group contains the data collected from 2000 time steps. Seven groups of data are used to train the algorithm, and the remaining group is used for testing. The test result is shown in Figure 14. Compared with Driving Scenes I and II, PBLS in Driving Scene III performs slightly worse with higher RMSE for both distance and speed. This is mainly because the real driving data are noisier than the simulation data, especially when the leading vehicle has a changeable speed.
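The grouping of the recorded data into seven training groups and one test group can be sketched as follows. The group size (2000 steps) and the number of groups (eight) come from the text. Which group is held out is not stated, so holding out the last group is an assumption, and the function names are hypothetical.

```python
GROUP_SIZE = 2000  # time steps per group, as stated in the text


def split_into_groups(samples, group_size=GROUP_SIZE):
    """Cut a recorded trace into consecutive, equally sized groups."""
    if len(samples) % group_size != 0:
        raise ValueError("trace length must be a multiple of group_size")
    return [samples[i:i + group_size]
            for i in range(0, len(samples), group_size)]


def hold_out(groups, test_index=-1):
    """Return (training groups, test group); last group held out by assumption."""
    test_index %= len(groups)
    train = [g for i, g in enumerate(groups) if i != test_index]
    return train, groups[test_index]
```

With a 16,000-step trace, `split_into_groups` yields eight groups and `hold_out` returns seven of them for training and one for testing.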

Conclusions
A personalized behavior learning system (PBLS) was proposed in this paper to learn the human driving behavior from demonstrations. PBLS is based on a reinforcement learning method named neural Q-learning (NQL), which can approximate the Q function in a continuous state and action space, such that the human-like longitudinal speed control (LSC) problem can be solved properly. To train PBLS online, a batch-updating algorithm based on back-propagation (BP) was developed.
A series of driving simulator experiments with different speed profiles for the leading vehicle were carried out to evaluate the performance of PBLS. In all the experiments, PBLS kept a low learning error, especially for the driver whose operation was stable. In the test with varying speed, by learning from an experienced driver, PBLS achieved higher driving comfort and smoothness than the traditional adaptive cruise control (ACC) system.
As mentioned in Section 4, this study focused on developing a personalized behavior learning system that can adapt to different drivers. In future work, a systematic analysis involving more drivers will be conducted to investigate the effects of different drivers and driving styles on the performance of the learning system.