A Review on Deep Reinforcement Learning for Autonomous Driving

Autonomous driving technology has gained significant attention, offering opportunities to modernize transportation systems worldwide. Deep reinforcement learning (DRL) has emerged as a robust approach for designing smart driving policies in intricate and changeable environments. This paper provides a detailed investigation of state-of-the-art DRL methodologies that have been applied to autonomous driving. It begins by explaining the fundamental concepts of deep learning and reinforcement learning, highlighting their application to the control of self-driving vehicles. Subsequently, the paper presents an overview of various DRL algorithms, including Deep Q-Networks (DQN), Deep Deterministic Policy Gradients (DDPG), and Actor-Critic methods, describing their structures, training approaches, and applications in autonomous driving situations. Recent advancements in DRL research, such as domain adaptation, imitation learning, and meta-learning, are also addressed, with an investigation of their potential implications for autonomous driving. Through a thorough assessment of current literature, key trends, challenges, and research directions are identified for exploiting DRL in autonomous car development. This review intends to provide researchers, practitioners, and enthusiasts with a comprehensive understanding of the current and future possibilities of DRL for self-driving vehicles.


A. Introduction
Autonomous driving technology is a significant milestone in the progress of transportation systems across the globe. Because it promises better safety, productivity, and availability, it has attracted the interest of researchers, industry personnel, policymakers, and the general public. At the core of this technological revolution is DRL, a state-of-the-art method that enables vehicles to steer and make choices in intricate and ever-changing settings [1][2][3][4].
The fusion of deep neural networks [5] with reinforcement learning concepts has advanced autonomous driving technology to new levels of refinement. In contrast to conventional rule-based systems that depend on manually constructed algorithms and heuristic principles, DRL gives vehicles the ability to develop optimal driving techniques via trial and error, just as human drivers hone their skills through practice [6][7][8]. By leveraging data-driven decision-making [9][10][11], DRL has the potential to open up new possibilities in autonomy, permitting vehicles to respond to varying traffic situations, unanticipated occurrences, and changes in road conditions [12][13]. The purpose of this review paper is to present a thorough analysis of current cutting-edge DRL techniques used in autonomous driving. By examining the concepts, recent developments, advanced algorithms, and upcoming research trends, we aim to provide readers with a comprehensive understanding of the benefits and challenges involved in utilizing DRL for the progress of autonomous vehicle technology.
Our aim is to allow researchers, practitioners, and enthusiasts to gain a deeper understanding of the significance of DRL in shaping the future of autonomous driving. We hope to encourage them to explore new ways to further the progress of autonomous vehicle technology by providing a synthesis of insights from existing literature along with critical analysis. Ultimately, we seek to pave the way for safer, more efficient, and more sustainable transportation systems through this review paper.
The remainder of this paper is organized as follows: Section B explains the fundamentals of Reinforcement Learning. Section C provides an overview of DRL algorithms. Section D covers applications of Reinforcement Learning in autonomous driving, while Section E delves into recent advancements in DRL. Section F surveys related works, Section G discusses the findings, and Section H concludes the paper.

B. Fundamentals of Reinforcement Learning
Reinforcement Learning (RL) is a type of machine learning that focuses on how an agent can learn to achieve a goal by interacting with its environment [14]. Through trial and error, the agent takes actions in the environment and receives either rewards or penalties as feedback. The goal of the agent is to learn a policy that maximizes the cumulative reward it collects over time.
At the core of RL are several fundamental concepts: the agent, the environment, states, actions, and rewards. Markov Decision Processes (MDPs) offer a mathematical framework for modeling RL problems. An MDP [16] can be described by a tuple (S, A, P, R, γ), where S denotes the set of states, A represents the set of actions, and P is the function governing state transitions, which gives the probability of transitioning to the next state s′ ∈ S when action a ∈ A is chosen in response to observing state s ∈ S. In cases involving continuous action or state spaces, the mathematical formulation becomes more intricate. Additionally, R denotes the reward function mapping state-action-state tuples to real numbers (R: S × A × S → ℝ), and γ ∈ [0, 1] represents the discount rate.
The objective of the agent is to acquire a policy, πθ: S × A → [0, 1], that assigns to each state a probability distribution over actions such that the expected cumulative reward is maximized.
1. For the state-value function V*(s):
$$V^{*}(s) = \max_{\pi_\theta} E\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s\right] \tag{1}$$
2. For the action-value function Q*(s, a):
$$Q^{*}(s, a) = \max_{\pi_\theta} E\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a\right] \tag{2}$$
Here:
• V*(s) represents the optimal value function for state s,
• Q*(s, a) represents the optimal action-value function for state-action pair (s, a),
• πθ represents the policy parameterized by θ,
• E[⋅] represents the expected value operator,
• γ is the discount factor,
• rt+k represents the reward at time t + k,
• st represents the state at time t,
• at represents the action at time t,
• S is the state space, and
• A is the action space.
In this context, the state s corresponds to the input from car sensors, such as cameras and distance sensors. The action is the steering angle, denoted by a continuous parameter a ∈ ℝ. The transition function P gives the probability of reaching the next state s′ given the car's current state s and action a. Based on the car's current state s and the action a, the reward function R provides feedback to the algorithm. The discount rate γ measures the significance of long-term rewards relative to immediate rewards, ranging from considering only immediate rewards (γ = 0) to weighting immediate and long-term rewards equally (γ = 1). The interaction between the agent and the environment is illustrated in the diagram from [16].
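To make the role of the discount rate concrete, the following minimal Python sketch computes the discounted return of a short reward sequence for several values of γ. The reward values and the helper name `discounted_return` are illustrative assumptions and are not taken from the reviewed works.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Work backwards so that each step applies G = r + gamma * G.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical per-step rewards: small rewards now, a large one later.
rewards = [1.0, 1.0, 1.0, 10.0]

print(discounted_return(rewards, 0.0))  # only the immediate reward counts
print(discounted_return(rewards, 0.5))  # later rewards are progressively down-weighted
print(discounted_return(rewards, 1.0))  # all rewards are weighted equally
```

With γ = 0 the large delayed reward is ignored entirely, whereas with γ = 1 it dominates the return, which is why the choice of γ shapes how far ahead a driving policy effectively plans.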

To illustrate, consider a simplified example with two states: "Stop" and "Go." The autonomous vehicle can take two actions: "Brake" or "Accelerate." The transition probabilities and immediate rewards associated with each action are as follows:
1. Stop State:
• If the vehicle brakes, it remains in the stop state with a high probability (e.g., 0.9) and may transition to the go state with a lower probability (e.g., 0.1).
• Braking in the stop state incurs a negative reward due to the delay in reaching the destination (e.g., -10).
2. Go State:
• If the vehicle accelerates, it remains in the go state with a high probability (e.g., 0.8) and may transition back to the stop state with a lower probability (e.g., 0.2).
• Accelerating in the go state incurs a positive reward as it progresses towards the destination (e.g., +5).
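The Stop/Go example can be encoded as a small MDP and solved numerically. The sketch below is only an illustration under stated assumptions: the text specifies the outcomes of braking in the stop state and accelerating in the go state, so the entries for the two remaining state-action pairs, the discount rate of 0.9, and all function names are assumed here to complete the model.

```python
# P[state][action] = list of (probability, next_state, reward).
# Entries marked ASSUMED are not given in the text and are only illustrative.
P = {
    "Stop": {
        "Brake": [(0.9, "Stop", -10.0), (0.1, "Go", -10.0)],
        "Accelerate": [(0.1, "Stop", 0.0), (0.9, "Go", 0.0)],  # ASSUMED
    },
    "Go": {
        "Accelerate": [(0.8, "Go", 5.0), (0.2, "Stop", 5.0)],
        "Brake": [(0.1, "Go", -1.0), (0.9, "Stop", -1.0)],  # ASSUMED
    },
}
GAMMA = 0.9  # ASSUMED discount rate


def q_value(V, state, action):
    """Expected reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[state][action])


def value_iteration(tol=1e-8):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: max(q_value(V, s, a) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
    return V, policy


V, policy = value_iteration()
print(V)
print(policy)
```

Under these assumed numbers, accelerating is the greedy action in both states, since braking only delays progress towards the destination.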
Two frequently employed methods for solving RL problems are Dynamic Programming and Q-Learning. In Dynamic Programming, algorithms such as policy iteration and value iteration compute optimal policies by iteratively updating value functions. In contrast, Q-Learning is a model-free RL algorithm that learns the value of state-action pairs from experience. It uses the Bellman equation to update Q-values.
Understanding these elementary concepts is crucial for comprehending the principles behind the implementation of RL techniques in self-driving systems.
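As a hedged illustration of the Q-Learning update, the sketch below runs tabular Q-learning on a toy two-state Stop/Go model. The transition entries marked ASSUMED, the discount rate, the learning rate, the purely random exploration scheme, and all names are assumptions chosen for this example. The learner only sees sampled transitions through `step`, which is what makes the method model-free.

```python
import random

# Toy Stop/Go model used only as a simulator; entries marked ASSUMED are illustrative.
P = {
    "Stop": {
        "Brake": [(0.9, "Stop", -10.0), (0.1, "Go", -10.0)],
        "Accelerate": [(0.1, "Stop", 0.0), (0.9, "Go", 0.0)],  # ASSUMED
    },
    "Go": {
        "Accelerate": [(0.8, "Go", 5.0), (0.2, "Stop", 5.0)],
        "Brake": [(0.1, "Go", -1.0), (0.9, "Stop", -1.0)],  # ASSUMED
    },
}
GAMMA = 0.9  # ASSUMED discount rate
ALPHA = 0.1  # ASSUMED learning rate


def step(state, action, rng):
    """Sample (next_state, reward) from the simulator."""
    u = rng.random()
    cumulative = 0.0
    for prob, next_state, reward in P[state][action]:
        cumulative += prob
        if u < cumulative:
            return next_state, reward
    return next_state, reward  # guard against floating-point round-off


def q_learning(num_steps=50000, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    state = "Stop"
    for _ in range(num_steps):
        action = rng.choice(list(P[state]))  # uniform random exploration
        next_state, reward = step(state, action, rng)
        # Bellman-style update towards r + gamma * max_a' Q(s', a').
        target = reward + GAMMA * max(Q[next_state].values())
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state
    return Q


Q = q_learning()
greedy = {s: max(Q[s], key=Q[s].get) for s in Q}
print(greedy)
```

Because Q-learning is off-policy, the greedy policy extracted from the learned table can be good even though the data were collected with random actions.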

C. Deep Reinforcement Learning Algorithms
DRL methods [17] are essential for autonomous vehicles because they allow them to learn complicated driving behaviors by actively engaging with their surroundings. This section summarizes the primary DRL algorithms used in autonomous driving R&D, with mathematical formulations to explain their principles.

Deep Q-Networks (DQN)
DQN is a method introduced in 2015 by Mnih et al. [18][19] that utilizes deep neural networks to approximate the Q-function. The Q-function gives the expected cumulative reward for taking a given action in a particular state and is represented mathematically as:
$$Q(s, a) = E\left[r + \gamma \max_{a'} Q(s', a') \,\middle|\, s, a\right] \tag{3}$$
where s is the state, a is the action, r is the immediate reward, s′ is the next state, and γ is the discount factor.
The DQN algorithm aims to minimize the temporal difference error between the predicted Q-value and the bootstrapped target:
$$L(\theta) = E\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right] \tag{4}$$
where θ represents the parameters of the neural network, and θ⁻ denotes the target network parameters.
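The temporal difference objective can be sketched in a few lines of plain Python. This is a minimal illustration rather than a full DQN: lookup tables stand in for the online and target networks, and the `done` flag, the state names, and the numerical values are assumptions introduced for the example.

```python
def td_target(reward, next_q_values, gamma, done):
    """y = r + gamma * max_a' Q(s', a'; theta^-), or just r when the episode ends."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def dqn_loss(batch, q_online, q_target, gamma):
    """Mean squared TD error over a batch of (s, a, r, s_next, done) transitions."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        y = td_target(r, q_target(s_next), gamma, done)
        total += (y - q_online(s)[a]) ** 2
    return total / len(batch)


# Hypothetical stand-ins for the online network (theta) and target network (theta^-).
online_table = {"s0": [1.0, 2.0], "s1": [0.0, 0.5]}
target_table = {"s0": [1.0, 2.0], "s1": [0.0, 1.0]}


def q_online(state):
    return online_table[state]


def q_target(state):
    return target_table[state]


batch = [
    ("s0", 1, 1.0, "s1", False),  # target: 1.0 + 0.9 * 1.0 = 1.9, prediction: 2.0
    ("s1", 0, 0.0, "s0", True),   # terminal transition, target: 0.0, prediction: 0.0
]
loss = dqn_loss(batch, q_online, q_target, gamma=0.9)
print(loss)
```

In an actual DQN, the loss would be differentiated with respect to θ only, while θ⁻ is held fixed and refreshed periodically from θ.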

Deep Deterministic Policy Gradients (DDPG)
DDPG is a technique proposed by Lillicrap et al. in 2016 for continuous action spaces [21]. DDPG uses an actor-critic design in which the actor network learns a policy that maps states to specific actions, and the critic network evaluates the quality of these actions by approximating the value function.
For DDPG, the actor and critic neural networks use the same architecture for visual feature extraction, which consists of three blocks, each containing a convolutional layer, a batch normalization layer, and a ReLU activation layer [20]; see Figure 4.

Actor-Critic Methods
Actor-Critic methods merge features from value-based and policy-based techniques by using different networks for the actor (policy) and critic (value function). These algorithms are aimed at maximizing expected cumulative rewards while also estimating the value function to steer policy improvement. Mathematically, the policy gradient for actor-critic methods can be expressed as:
$$\nabla_{\theta^{\mu}} J \approx E\left[\nabla_{\theta^{\mu}} \log \mu(a \mid s; \theta^{\mu})\, Q(s, a; \theta^{Q})\right] \tag{6}$$
where μ(a|s;θμ) represents the policy function, Q(s,a;θQ) is the action-value function, and J denotes the expected return [23].
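The policy gradient expression can be evaluated exactly for a very small case. The sketch below assumes a softmax policy over two discrete actions in a single state, with the actor parameters given by one logit per action and the critic replaced by fixed, hypothetical Q-values. These modeling choices and all names are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient(logits, q_values):
    """Exact E_{a~pi}[grad log pi(a|s) * Q(s, a)] for a softmax policy in one state."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, q_a) in enumerate(zip(pi, q_values)):
        for i in range(len(logits)):
            # For a softmax policy: d log pi(a) / d theta_i = 1[i == a] - pi(i).
            grad_log = (1.0 if i == a else 0.0) - pi[i]
            grad[i] += p_a * grad_log * q_a
    return grad


logits = [0.0, 0.0]      # actor parameters for actions [Brake, Accelerate]
q_values = [-10.0, 5.0]  # hypothetical critic estimates Q(s, a)

grad = policy_gradient(logits, q_values)
learning_rate = 0.1
new_logits = [t + learning_rate * g for t, g in zip(logits, grad)]

print(grad)
print(softmax(logits), softmax(new_logits))
```

One gradient ascent step shifts probability mass towards the action that the critic rates more highly, which is the mechanism by which the critic steers policy improvement.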

Other DRL Algorithms
In addition to DQN, DDPG, and Actor-Critic methods, various other DRL algorithms have been proposed and applied to autonomous driving tasks. These include Deep Q-Learning from Demonstrations (DQfD) [24], Twin Delayed Deep Deterministic Policy Gradients (TD3) [25], and Soft Actor-Critic (SAC) [26], among others.

D. Applications of Reinforcement Learning in Autonomous Driving
RL techniques have diverse applications in self-driving systems, including lane keeping [27], traffic light [28] and intersection management [29], collision avoidance [30], and path planning [31]. RL algorithms help vehicles navigate through complex environments, predict potential hazards, ensure safe passage, and optimize trajectories to minimize congestion. This versatility makes RL algorithms a crucial tool for enhancing the autonomy and safety of driving systems.

E. Recent Advancements in DRL
Benchmark datasets [16] and standardized evaluation metrics [17] are essential for advancing research and development in DRL for self-driving vehicles. Benchmark datasets and simulators, such as the Waymo Open Dataset, the Udacity Self-Driving Car dataset, and the CARLA simulator, provide standard training and test environments and scenarios for benchmarking RL algorithms. Furthermore, established evaluation metrics, including safety measures, success rates, and computational efficiency, ensure that comparisons between different techniques and approaches are fair. In summary, recent developments in DRL techniques for self-driving cars have enhanced the capabilities of autonomous driving systems and addressed critical challenges. Researchers have leveraged various techniques, including imitation learning, domain adaptation, multi-agent learning, meta-learning, and benchmark datasets, to unlock new opportunities and overcome remaining hurdles towards mass deployment of safe and efficient autonomous vehicles.

F. Related Works
Sallab et al. (2017) [32] presented a framework for autonomous driving utilizing deep reinforcement learning. They address the challenges inherent in developing such agents and delineate three primary task categories: recognition, prediction, and decision-making. Their framework integrates Recurrent Neural Networks and attention models to address partially observable scenarios and emphasize pertinent information.
Liang et al. (2018) [33] addressed the challenge of devising optimal driving policies for autonomous urban driving. They propose a Controllable Imitative Reinforcement Learning (CIRL) approach that leverages encoded experiences mimicking human demonstrations to enhance exploration efficiency, and that tailors adaptive policies and steering-angle reward designs. The method has shown significant performance improvements compared to previous approaches on the CARLA driving benchmark.
Wang et al. (2018) [34] explored the application of deep reinforcement learning in developing real-world autonomous driving systems using the deep deterministic policy gradient algorithm. They employed TORCS as the environment, incorporating a range of sensors and reward mechanisms, and devised a network architecture for both the actor and critic components. The model was evaluated across various modes in TORCS, demonstrating positive results both quantitatively and qualitatively.
Nageshrao et al. (2019) [35] proposed a method for autonomous highway driving employing deep reinforcement learning and a modified version of the DDQN algorithm to train the decision-making neural network. The approach aims to mitigate unforeseen behaviors and adapt to diverse driving scenarios.
Pusse et al. (2019) [36] introduced HyLEAP, a hybrid approach that combines deep reinforcement learning and approximate POMDP planning to address pedestrian collision-free navigation in self-driving cars. The performance evaluation is based on GIDAS, and the paper discusses the pros and cons of each method.
Folkers et al. (2019) [37] presented a control approach based on deep reinforcement learning for autonomous vehicles. They discuss the training of a neural network agent with proximal policy optimization in a simulated environment and its application to a full-sized research vehicle for autonomous exploration of a parking lot, turning maneuvers, and obstacle avoidance. The paper also compares various model-based control approaches and references previous research that utilized deep learning methods for steering a vehicle solely based on camera images.
Moghadam et al. (2019) [38] proposed an approach to autonomous driving that utilizes a hierarchical architecture with deep reinforcement learning. Their method aims to generate high-level sequential commands for lower-level controllers, ensuring consistent performance in uncertain environments.
Spielberg et al. (2019) [39] introduced a novel adaptive, model-free controller for general discrete-time processes, utilizing deep reinforcement learning. Their proposed controller learns the control policy in real time through interactions with the process. The effectiveness and advantages of the controller are demonstrated through simulations across various scenarios.
You et al. (2019) [40] discussed the planning problem of autonomous vehicles in traffic. The authors propose a stochastic Markov decision process model that incorporates road geometry to accommodate diverse driving styles. They also design the reward function.
Semnani et al. (2020) [41] proposed a hybrid algorithm combining deep reinforcement learning (RL) and force-based motion planning (FMP) for distributed motion planning in dense and dynamic environments, where each agent has a fixed final position that cannot be exchanged with another agent. The proposed algorithm outperforms both deep RL and FMP algorithms, resulting in up to 50% more successful scenarios than deep RL and up to 75% less extra time to reach the goal than FMP.
Liu et al. (2020) [42] proposed a method for learning personalized discretionary lane-change initiation for fully autonomous driving based on reinforcement learning. The proposed offline algorithm employs a reinforcement learning technique to learn how to initiate lane changes from traffic context, the action of a self-driving vehicle, and in-vehicle user feedback. A multi-dimensional driving scenario is considered to represent a more realistic lane-change trade-off. The results show that the lane-change initiation model obtained by this method can reproduce the personal lane-change tactic, and that the customized models perform much better than the non-customized models.
Muzahid et al. (2020) [43] discussed a conceptual framework based on reinforcement learning for threat assessment of multiple vehicle collisions in autonomous driving. They emphasize the importance of real-time crash risk prediction in ensuring a secure and effective autonomous driving system. The paper advocates for cross-disciplinary efforts that integrate various technological fields, such as robotics, artificial intelligence, machine learning, IoT, and reinforcement learning.
Rong et al. (2020) [44] conducted a study on safe reinforcement learning with policy-guided planning for autonomous driving. The research addresses uncertainty and complexity within autonomous driving. The authors introduce a hierarchical structure that works with formal specifications to ensure safety standards are met, and they demonstrate the effectiveness of this approach for safe reinforcement learning through numerical experiments.
Liao et al. (2020) [45] investigated decision-making strategies on highways for autonomous vehicles using deep reinforcement learning (DRL). The study explores the dueling deep Q-network (DDQN) method to derive the highway decision-making strategy and conducts a series of simulation experiments to evaluate its effectiveness.
Kim et al. (2020) [46] examined the application of deep reinforcement learning in developing intelligent self-driving policies aimed at minimizing injury and collision incidents in unexpected traffic scenarios. The study demonstrates that the trained agents surpassed human drivers and an autonomous emergency braking system in terms of collision avoidance and reducing injury severity.
Hishmeh et al. (2020) [47] analyzed the performance of various deep reinforcement learning algorithms in the context of autonomous driving within a simulated environment. Their objective is to develop a short-term planner that complements other components responsible for long-term planning and safety.
Lu et al. (2020) [48] proposed an approach to address the challenges of autonomous decision-making and motion planning for intelligent vehicles in complex traffic scenarios. The approach comprises two layers: a decision-making layer employing a kernel-based least-squares policy iteration algorithm with an uneven sampling and pooling strategy, and a lower layer concentrating on lateral motion planning using a dual heuristic programming algorithm. The effectiveness and efficiency of this approach are demonstrated through extensive simulations.
Josef et al. (2020) [49] introduced a deep reinforcement learning approach for local planning in unknown rough terrains for unmanned ground vehicles. The approach demonstrated superior performance compared to potential fields or local motion planning search space methods.
Palanisamy et al. (2020) [50] proposed the utilization of Partially Observable Markov Games (POSG) to devise learning-based solutions for connected autonomous driving (CAD) under realistic assumptions. They present a taxonomy of multi-agent learning environments and offer an extensible set of CAD simulation environments to develop algorithms for CAD systems in multi-agent settings.
Duan et al. (2020) [51] explored the decomposition of driving tasks into three maneuvers and the learning of sub-policies for each maneuver. They design a master policy to select the maneuver to execute in the current state. All policies, including the master policy and maneuver policies, are represented by fully connected neural networks and trained using asynchronous parallel reinforcement learners. The paper concludes by demonstrating how this method can safely and smoothly drive a car on a highway.
Emuna et al. (2020) [52] examined a deep reinforcement learning approach aimed at simulating human driver behavior to develop "human-like driving policies." The study concentrates on mixed traffic scenarios, encompassing both autonomous vehicles and human-controlled vehicles. The authors advocate for autonomous vehicles to demonstrate human-like driving behavior to maintain efficient and safe traffic flow.
Fernando et al. (2021) [53] examined the significance of precise behavior modeling in autonomous driving and reviewed the primary approaches and notable advancements made by researchers, with a focus on the potential of deep inverse reinforcement learning. The authors offer quantitative and qualitative assessments to substantiate their insights and explore promising avenues for future research breakthroughs.
Orgovan et al. (2021) [54] focused on developing an algorithm for autonomous drifting using Machine Learning, specifically Reinforcement Learning. The algorithm utilized a model-free learning method known as Twin Delayed Deep Deterministic Policy Gradients (TD3), which was trained on six different tracks within the CARLA simulator, which is designed for autonomous driving applications. The study also highlighted the effectiveness of Deep Reinforcement Learning (DRL) in addressing motion planning challenges, with network inputs including factors such as destination and vehicle parameters, and network outputs controlling vehicle actions such as steering, torque, and braking.
Chukamphaeng et al. (2021) [16] discussed the utilization of end-to-end reinforcement learning for autonomous vehicles, which employs a single reinforcement learning model to develop the autonomous car. Additionally, the paper describes the design of a novel, efficient reward function aimed at accelerating the agent's learning process and constructing the car with only essential perceptions and sensors. The paper concludes that end-to-end reinforcement learning holds promise as an approach for autonomous driving vehicles.
Zou et al. (2021) [55] introduced a deep imitation reinforcement learning (DIRL) framework comprising two main components: the perception module and the control module. The framework employs a deep deterministic policy gradient (DDPG) algorithm for controlling self-driving vehicles via vision-based policies. The authors validated the effectiveness of the DIRL framework using TORCS, an open racing car simulator.
Cao et al. (2022) [56] introduced a method that combines reinforcement learning (RL) with a baseline rule-based driving policy to develop more intelligent driving policies for autonomous vehicles. Termed "confidence-aware reinforcement learning" (CARL), the proposed method is assessed through a case study involving driving in a two-lane roundabout scenario. The study demonstrates that the proposed approach surpasses both the pure RL policy and the baseline rule-based policy in terms of performance.
Chen et al. (2022) [57] suggested an approach to address complex urban scenarios using an interpretable deep reinforcement learning method, aiming to reduce the sample complexity of reinforcement learning. Their method incorporates a latent environment model that generates a semantic bird's-eye mask to connect with specific intermediate properties and explain the behaviors of the learned policy. Comparative tests conducted in a realistic driving simulator demonstrate that the method performs significantly better in urban scenarios with surrounding vehicles than numerous baseline approaches.

G. Discussion
Table 1 provides a comprehensive review of various algorithms, their associated advantages, disadvantages, and results in the realm of autonomous driving. It highlights the diverse array of techniques utilized, ranging from deep reinforcement learning (DRL) to imitation learning and hierarchical reinforcement learning (RL), each aimed at tackling specific challenges such as lane keeping, urban dynamics, and autonomous highway driving.
While each algorithm offers unique benefits such as improved safety, efficiency, and adaptability, each also comes with its own set of limitations, including computational costs, reliance on simulated environments, and challenges in real-world applicability. Notably, the effectiveness of these approaches is assessed through simulations and real-world experiments, showcasing their potential for practical implementation in autonomous driving systems.
However, the review reveals common gaps across these methodologies, including limited generalizability, uncertainty in effectiveness, and the need for further research to address scalability and transferability issues. These findings emphasize the ongoing challenges in autonomous driving research and underscore the importance of continued exploration and development in this rapidly evolving field.

H. Conclusion
This paper presents a comprehensive review of deep RL frameworks for autonomous driving, which have become a promising and thriving research field. Deep RL provides the robustness needed to design adaptive and intelligent driving policies for complex, unstructured, and dynamic environments. The reviewed studies indicate that RL algorithms can outperform traditional methods in the scenarios tested and could become a game-changer for autonomous driving. However, the study outcomes also reveal that deep RL has not yet provided satisfactory solutions for dynamic, real-world environments. While deep RL makes fewer assumptions about the behavior of physical and social phenomena, it is still subject to sample randomness and distributional shift. These techniques are also computationally expensive and require abundant data and computational resources. Further research is needed to address pressing practical issues in order to boost the development of the autonomous driving industry and make the future of autonomous vehicles more promising. Future research can explore techniques that bridge the gap between simulation and the real world, handle non-stationary environments, require less computation, and improve the scalability of these systems.

Figure 1. The structure of end-to-end autonomous driving based on reinforcement learning.
• Agent: The entity in charge of decision-making and taking actions within the environment.
• Environment: The external system with which the agent interacts.
• State: A representation of the current situation or configuration of the environment.
• Action: The choices made by the agent that influence the state of the environment.
• Reward: Numeric feedback provided by the environment to indicate the desirability of a particular action taken by the agent.