Unmanned Ground Vehicle Behavior Decision via Improved Bayesian Inverse Reinforcement Learning

Abstract: In this paper, we propose an improved Bayesian inverse reinforcement learning (BIRL) algorithm for the behavior decision-making problem of unmanned ground vehicles, which learns from teaching sequences. We demonstrate how an autonomous driving vehicle, using the proposed approach, executes the optimal action in its current state according to an expert's demonstrations. Simulation tests show that the improved BIRL algorithm makes reasonable decisions for the driverless car.


Introduction
In recent years, the unmanned ground vehicle (UGV) has gradually attracted attention as an important research topic for reducing traffic accidents in intelligent transportation systems. It is therefore very important to design an effective autopilot system for a UGV. Generally speaking, the autopilot system of an unmanned ground vehicle can be divided into three parts. As Fig. 1 shows, the first part is environment recognition, which captures all the information from the surroundings and can be seen as the vehicle's "eyes". The decision-making module is the part regarded as the "brain"; its main task is to decide the vehicle's actions according to the current information provided by the environment recognition system. The third part, the control module, plans the optimal path and implements the decision made in the second part. From this framework, we can see that the decision-making module plays a central role in the autopilot system.
To the best of our knowledge, current research on decision making has already obtained some achievements. For example, Multiple Criteria Decision Making (MCDM) [1] and Multiple Attribute-Based Decision Making (MADM) [2] can provide good decisions in simple external environments with a small number of human-defined states. In paper [1], a maneuver decision-making method for autonomous vehicles in urban environments is studied. The algorithm can be decomposed into three steps: the first step selects all possible logical maneuvers, the second step removes the maneuvers that break traffic rules, and the last step uses Multiple Attribute Decision Making to select the optimal driving maneuver in the scenario, considering safety and efficiency. In paper [2], the decision-making stage likewise selects and executes the most appropriate alternative among the driving maneuvers that have been determined to be feasible in the current traffic situation. More precisely, this method breaks a general objective into a lower hierarchy level containing more specific objectives, which specify how to achieve the objective of the higher level. Inverse reinforcement learning (IRL) [3] has also been proposed to solve the decision-making problem. As pointed out in paper [3], IRL problems are of great interest for a wide range of applications, from basic science [4,5] to optimal control of aircraft [6] and, more recently, aerobatic helicopter flight [7] within the robotics community. Supervised learning approaches such as [8-12], which learn a direct mapping from states to actions and directly mimic the demonstrator, do not work for highway driving or more complex driving situations. The inverse learning approach is, to some extent, the most succinct and robust way to explain expert behavior, and it generalizes well in the feature domain. Bayesian inverse reinforcement learning [13] is one
inverse reinforcement learning algorithm among many. It maintains a prior over reward functions, updates it to a posterior, and recovers the expert's actions from this posterior. However, the algorithm still has serious problems in driving behavior decision making, especially its poor rate of convergence: more time is needed to obtain an acceptable reward function if the algorithm is not efficient enough. This is dangerous when the algorithm is applied to recover an expert's driving behaviors. Moreover, to a large extent, the learning process does not take the existing noise problem into account. That is to say, in a driving environment, it is hard to derive a good policy in a short time. New solutions to this problem are therefore necessary, and this paper proposes an improved algorithm.
In this paper, we apply improved Bayesian inverse reinforcement learning to learn the expert's optimal policy. The improved algorithm converges faster and recovers the expert's driving behaviors more accurately; from an engineering perspective, both points are important improvements over traditional BIRL. First, assuming that the expert's policy is a stationary stochastic policy, we obtain the posterior distribution of the reward function via Bayes' theorem. Then, a sampling method estimates the mean of the posterior distribution as our predictive value. We improve the existing Bayesian inverse reinforcement learning algorithm to achieve better performance and to learn the expert's policy by making full use of the demonstration information. This algorithm allows the driverless car to make safer decisions in its environment and to execute the relatively optimal actions. Finally, we present simulation results to verify the algorithm.
The rest of this paper proceeds as follows. Section II gives a brief introduction to the basic properties of the Markov decision process (MDP) and the inverse reinforcement learning algorithm. Section III reviews the BIRL algorithm and introduces the improved BIRL algorithm; our main contributions, revising the algorithm's potential function and adding the fitting process, are presented there. Simulation and experimental tests are shown in Section IV, together with an analysis of the results. We conclude in Section V with a brief discussion of opportunities for future work.

Markov Decision Process
The inverse reinforcement learning problem is generally described in the Markov decision process formalism. A Markov decision process is a five-element tuple (S, A, {P_sa(·)}, γ, R), wherein: S is a finite set of n states; A = {a_1, ..., a_k} is a set of k actions; {P_sa(·)} are the state transition probabilities upon taking action a in state s; γ ∈ [0,1) is the discount factor; and R is the reward function.
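To make the tuple concrete, here is a minimal sketch of how the MDP elements (S, A, {P_sa(·)}, γ, R) might be represented numerically; the 2-state, 2-action transition numbers are illustrative assumptions, not values from the paper.

```python
import numpy as np

# P[a, s, s'] = Pr(s' | s, a): one row-stochastic matrix per action.
n_states, n_actions = 2, 2
P = np.array([
    [[0.9, 0.1],    # action a_1
     [0.2, 0.8]],
    [[0.5, 0.5],    # action a_2
     [0.6, 0.4]],
])
gamma = 0.9                   # discount factor in [0, 1)
R = np.array([1.0, 0.0])      # state-based reward R(s)

# Every row P[a, s, :] must be a valid probability distribution.
assert np.allclose(P.sum(axis=2), 1.0)
```

Representing the transitions as a single (|A|, |S|, |S|) tensor keeps the later Bellman computations to one matrix product per action.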
A policy is defined as any map from states to actions, π: S → A. The value function of a policy π, evaluated at a state s_1, is given by:

V^π(s_1) = E[R(s_1) + γR(s_2) + γ²R(s_3) + ... | π].

In the same way, we have the Q-function:

Q^π(s, a) = R(s) + γ Σ_{s'} P_sa(s') V^π(s').

The optimal value function is V*(s) = sup_π V^π(s), and the optimal Q-function is Q*(s, a) = sup_π Q^π(s, a).
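For a finite MDP these definitions can be evaluated exactly: V^π solves the linear system V^π = R + γ P_π V^π. The following sketch (toy numbers assumed, not from the paper) computes V^π by a direct linear solve and then Q^π from it.

```python
import numpy as np

def policy_evaluation(P, R, gamma, pi):
    """Solve V^pi = (I - gamma * P_pi)^{-1} R exactly for a finite MDP."""
    n = len(R)
    P_pi = P[pi, np.arange(n)]              # P_pi[s, s'] = P[pi(s), s, s']
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

def q_function(P, R, gamma, V):
    """Q(s, a) = R(s) + gamma * sum_{s'} P[a, s, s'] * V(s')."""
    return R[:, None] + gamma * np.einsum('asn,n->sa', P, V)

# Toy 2-state, 2-action MDP (illustrative numbers).
P = np.array([[[1.0, 0.0], [1.0, 0.0]],    # a_1 always leads to state 0
              [[0.0, 1.0], [0.0, 1.0]]])   # a_2 always leads to state 1
R = np.array([1.0, 0.0])
gamma, pi = 0.9, np.array([0, 0])          # policy: take a_1 everywhere

V = policy_evaluation(P, R, gamma, pi)
Q = q_function(P, R, gamma, V)
```

The linear solve replaces iterative value updates, which is convenient at the small state counts used in IRL experiments; for large state spaces an iterative or approximate method would be needed instead.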
To solve the IRL problem, two basic properties of MDPs are needed. They can be characterized as follows [14]:

Theorem 1 (Bellman Equations). Let an MDP M = (S, A, {P_sa(·)}, γ, R) and a policy π: S → A be given. Then, for all s ∈ S and a ∈ A, V^π and Q^π satisfy:

V^π(s) = R(s) + γ Σ_{s'} P_sπ(s)(s') V^π(s'),
Q^π(s, a) = R(s) + γ Σ_{s'} P_sa(s') V^π(s').

Theorem 2 (Bellman Optimality). Let an MDP M = (S, A, {P_sa(·)}, γ, R) and a policy π: S → A be given. Then π is an optimal policy for M if and only if, for all s ∈ S,

π(s) ∈ argmax_{a∈A} Q^π(s, a).

From the two basic properties above, an important theorem can be derived that explains expert behavior in an MDP via the reward function. This theorem can be regarded as the core of inverse reinforcement learning:

Theorem 3. Let a finite state space S, a set of actions A = {a_1, ..., a_k}, transition probability matrices {P_a}, and a discount factor γ be given. Then the policy π given by π(s) ≡ a_1 is optimal if and only if, for all a = a_2, ..., a_k, the reward satisfies (elementwise):

(P_{a_1} - P_a)(I - γP_{a_1})^{-1} R ≥ 0.

This result characterizes the set of all reward functions that are solutions to the inverse reinforcement learning problem. Using this inequality, we can recover expert behavior via the reward function for a finite-state MDP. Further discussion of the criteria for choosing a reward and of approximating the reward function in large-scale state spaces can be found in paper [15].
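The Theorem 3 condition is directly checkable with linear algebra. The sketch below tests, for an assumed toy MDP, whether the always-take-a_1 policy is optimal under a candidate reward; the transition and reward numbers are hypothetical.

```python
import numpy as np

def a1_is_optimal(P, R, gamma, tol=1e-9):
    """Theorem 3 check: pi(s) = a_1 is optimal iff, for every a != a_1,
    (P_a1 - P_a) (I - gamma * P_a1)^{-1} R >= 0 elementwise."""
    n = P.shape[1]
    x = np.linalg.solve(np.eye(n) - gamma * P[0], R)   # (I - gamma P_a1)^{-1} R
    return all(np.all((P[0] - P[a]) @ x >= -tol) for a in range(1, P.shape[0]))

# Toy MDP: a_1 always moves to state 0, a_2 always moves to state 1.
P = np.array([[[1.0, 0.0], [1.0, 0.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
```

With reward concentrated on state 0, staying there via a_1 satisfies the inequality; with reward on state 1 it does not, illustrating how the condition carves out the set of rewards consistent with the observed policy.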

Problem Formulation
In considering the decision-making problem of autonomous driving vehicles, we assume that the car can be taught by an expert. Our aim is to design a more efficient algorithm that learns a reward function from sample trajectories. The reward function captures the state features and helps the driverless car make decisions regardless of whether a state has ever been encountered before. The application of our algorithm to decision making is illustrated in Fig. 2. In the training process, this framework is repeated again and again. Once we capture a reward function that explains the expert's behaviour on the visited states, the training process terminates, and the resulting value function can then choose the appropriate action from the current state to the next state.

Bayesian Inverse Reinforcement Learning
We recall some basic definitions and assumptions of Bayesian inverse reinforcement learning. We assume that the expert's demonstrations correspond to a reward function drawn from a posterior distribution, and that an agent implements the optimal policy according to this reward function. An expert trajectory is expressed as O = {(s_1, a_1), (s_2, a_2), ..., (s_k, a_k)}, which means that the expert was in state s_i and took action a_i. These two assumptions guarantee that the problem is defined in the Markov decision framework without abandoning its mathematical essence. Because the demonstrator employs a stationary policy, we can make a further independence assumption, giving:

Pr(O | R) = ∏_{i=1}^{k} Pr((s_i, a_i) | R).

The expert's goal is to maximize the total accumulated reward, which amounts to executing, at every step, the action with the maximum action-value Q(s, a). Therefore, the larger Q(s_i, a_i) is, the more likely it is that the demonstrator would choose action a_i at state s_i. We can thus model the distribution with the potential function:

Pr((s_i, a_i) | R) = (1/Z_i) e^{α Q*(s_i, a_i; R)},

where α is a bias parameter. The likelihood of the entire evidence is then:

Pr(O | R) = (1/Z') e^{α E(O, R)},  with E(O, R) = Σ_{i=1}^{k} Q*(s_i, a_i; R),

where Z' is the appropriate normalizing constant. Finally, applying Bayes' theorem, the posterior probability of the reward function can be written as:

Pr(R | O) = Pr(O | R) Pr(R) / Pr(O).

Detailed derivations of these equations and the complete algorithm can be found in prior work [13].
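Given a precomputed Q-table for a candidate reward, the likelihood above can be evaluated in log space. The sketch below assumes the demonstration is a list of (state, action) index pairs and the Q-table numbers are hypothetical; it is an illustration of the exponential model, not the paper's implementation.

```python
import numpy as np

def log_likelihood(demo, Q, alpha=1.0):
    """log Pr(O|R): each demonstrated pair (s_i, a_i) contributes
    alpha * Q[s_i, a_i] minus the log of the per-state normalizer Z_i."""
    ll = 0.0
    for s, a in demo:
        z = alpha * Q[s]
        ll += z[a] - np.log(np.exp(z).sum())   # log-softmax of the chosen action
    return ll

def log_posterior(demo, Q, log_prior_R, alpha=1.0):
    """log Pr(R|O) = log Pr(O|R) + log Pr(R), up to an additive constant."""
    return log_likelihood(demo, Q, alpha) + log_prior_R

# Hypothetical Q-table where action a_i is best in state s_i.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
demo = [(0, 0), (1, 1)]
```

Working in log space avoids underflow when trajectories are long, and only ratios of posteriors are needed by sampling methods, so the constant Pr(O) never has to be computed.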

Improved Bayesian Inverse Reinforcement Learning
In the original BIRL algorithm, maximizing the likelihood of the demonstrator's trajectories amounts to maximizing the sum E(O, R), so the model tends to increase all the weight values during iteration. From this point of view, multiplying the reward function's weight vector ω by a scalar β does not change the original policy, yet it yields a new potential function βE(O, R) and a new likelihood, so the algorithm keeps selecting a new reward function to replace the last one. This process is hard to stop in a short time, which means we cannot obtain a proper reward function quickly; moreover, the reward function recovered by the unimproved BIRL tends to overfit. This section introduces a series of efforts to improve Bayesian inverse reinforcement learning. The main method is to reconstruct the potential function and then to design a new algorithm framework combined with a linear function mapping state features to rewards. The first modified potential function is:

E_1(O, R) = Σ_i [Q(s_i, a_i) - Q(s_i, π(s_i))],

where π(s_i) is a random policy that samples, in every state, an action different from the expert's action, i.e., π(s_i) ∈ A\{a_i}. The meaning of this potential function is to maximize the distance between the expert's policy and a random policy. It is also easy to express with a linear reward function and to prove its convergence. Nevertheless, its improvement over the original algorithm is not particularly significant, because its performance depends on the starting policy. We therefore rewrite the potential function as:

E_2(O, R) = Σ_i [Q(s_i, a_i) - max_{a∈A\{a_i}} Q(s_i, a)].

This equation tries to maximize the difference between the optimal and suboptimal strategies. Experimental results show that this potential function improves the algorithm's performance more than the first one, but it still tends to choose a large weight vector. To solve this problem, we made a further attempt. Consider the quantity:

P(s_i, a_i) = 1 / (1 + exp(-(Q(s_i, a_i) - max_{a∈A\{a_i}} Q(s_i, a)))).

This function can be intuitively understood as the probability of the expert's action: if Q(s_i, a_i) - max_{a∈A\{a_i}} Q(s_i, a) is a large positive value, the probability of carrying out the optimal action in the current state is very close to 1. Under this interpretation, the new potential function is the expectation of the expert's optimal actions:

E(O, R) = Σ_i P(s_i, a_i).

There is no need to increase the reward value for actions that are already optimal with maximum probability; the algorithm instead tends to increase the probability of the other actions included in the sample trajectories. Our improved learning algorithm is as follows:

Require: prior distribution P, MDP M, bias α, feature vector φ, threshold value δ, a random initial weight vector ω, expected threshold value ε for the policy error, expert policy π_E.
While the policy error exceeds ε, do:
1. Pick a reward vector R' uniformly at random from the neighbours of the current R.
2. Compute Q^π(s, a; R') for all (s, a).
3. Update R according to the acceptance test on the posterior Pr(R' | O).
Upon termination, the algorithm returns R.
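A minimal sketch of the sigmoid-based potential follows; the demonstration format and the Q-table numbers are illustrative assumptions. Because each term is squashed into (0, 1), the potential is bounded by the number of demonstrated pairs, which is the property that stops weight scaling from inflating it without bound.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def improved_potential(demo, Q):
    """E(O, R) = sum_i sigma(Q(s_i, a_i) - max_{a != a_i} Q(s_i, a)):
    the expected number of demonstrated actions that are optimal.
    Each term lies in (0, 1), so the sum is bounded by len(demo)."""
    total = 0.0
    for s, a in demo:
        margin = Q[s, a] - np.delete(Q[s], a).max()
        total += sigmoid(margin)
    return total

# Hypothetical Q-tables: the second has larger margins for the expert's actions.
demo = [(0, 0), (1, 1)]
Q_small = np.array([[0.1, 0.0], [0.0, 0.1]])
Q_large = np.array([[5.0, 0.0], [0.0, 5.0]])
```

Once the margin is already large, sigmoid(margin) is nearly saturated at 1, so further increasing the same weights yields almost no gain; this matches the text's observation that there is no need to keep rewarding actions that are already optimal with maximum probability.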

Simulation Results
We evaluated the algorithm on more concrete behaviors in a simple highway-driving simulator, shown in Fig. 3, with evaluations similar to other work [3]. The task is to navigate a car on a three-lane highway, where all other vehicles move at constant speeds of different levels: from the leftmost to the rightmost lane, the speed level is set to 90 km/h, 60 km/h and 30 km/h, respectively. The target car has two additional behaviors, acceleration and deceleration. Under these settings, we assume the car has a somewhat random driving style: it avoids collisions and otherwise drives as fast as possible. For example, the car may be in the leftmost lane at a speed of 120 km/h. If there is another car ahead, the target car changes to the middle lane when the middle lane is empty; when another car also occupies the middle lane, the target car instead decreases its speed to 90 km/h and keeps going in the left lane. In the real world, driving styles are complex and varied, and complex driving styles are very hard to train, so we evaluate the algorithm with random driving styles. We first compare the convergence rates of the previous algorithm and the improved algorithm in the three-lane simulator, and then analyze the algorithms' performance in recovering the reward function by examining all the features' weight values one by one.
As Table 1 and Table 2 show, our improved BIRL algorithm, applied to the driving-decision problem, needs fewer iterations and converges faster: the median number of iterations for the improved BIRL is smaller than for BIRL, and its success rate within 500 steps is 100%, while that of the unimproved BIRL is only 42%. These comparisons make the difference in performance obvious. Similarly, we can compare the previous and improved algorithms in the three-lane simulator in terms of the reward function's weight values.
Owing to the larger number of states, it is hard to guarantee the demonstrator's optimality, so the unimproved BIRL algorithm faces a bigger challenge in recovering a reward function with a proper weight vector. We compared our improved BIRL with the initial BIRL. All features appear in our sample data except "Collision". As Table 3 shows, the feature weights include two larger values, for the left lane and for 120 km/h. It is interesting that the collision weight for the improved BIRL is smaller than for the initial BIRL. To explain this phenomenon, recall the advantage of the improved BIRL algorithm: it is relatively insensitive to features that have never appeared in the demonstrator's strategies, so the reward function becomes smaller when the environmental state data include the collision feature. Besides, the improved BIRL algorithm also makes the weight values positive for "Middle lane" and "90 km/h", and the weights for "Distance 60 to 120" and "30 km/h" each increase a little. Other features included in the sample data receive proper weight values according to their proportions in the sample trajectories; for example, the weights for "Right lane" and "Distance 30 to 60" each decrease a little because less of the sample data includes them. These changes show that the improved BIRL recovers a more reasonable reward function from the demonstrations.

Figure 3. The simulation environment.

Table 1 .
Key parameters for BIRL

Table 2 .
Key parameters for improved BIRL

Table 3 .
Key features' weight value