Social Learning for Sequential Driving Dilemmas

Abstract: Autonomous vehicle (AV) technology has elicited discussion on social dilemmas in which trade-offs between individual preferences, social norms, and collective interests may impact road safety and efficiency. In this study, we aim to identify whether social dilemmas exist in AVs' sequential decision making, which we call "sequential driving dilemmas" (SDDs). Identifying SDDs in traffic scenarios can help policymakers and AV manufacturers better understand under what circumstances SDDs arise and how to design rewards that incentivize AVs to avoid SDDs, ultimately benefiting society as a whole. To achieve this, we leverage a social learning framework, in which AVs learn through interactions with random opponents, to analyze their policy learning when facing SDDs. We conduct numerical experiments on two fundamental traffic scenarios: an unsignalized intersection and a highway. We find that SDDs exist for AVs at intersections, but not on highways.


Introduction
While much of the effort in artificial intelligence (AI) for autonomous vehicles (AVs) has focused on computer vision, the intelligence of AVs ultimately lies in their optimal decision making in the motion-planning stage while driving alongside human drivers [1][2][3]. How individual AVs should develop the ability to survive in a complex traffic environment is a prerequisite to realizing the anticipated societal benefits of AVs [4,5], such as improved traffic safety [6][7][8][9], efficiency [10][11][12][13], and the promotion of sustainable human-machine ecologies. One major advantage that AVs have over human drivers is their ability to promptly assess situations with a greater amount of information, allowing them to react optimally. However, AVs still lack the ability to handle sequential decision making in complex driving scenarios, which may result in underperformance compared to humans. In particular, it remains unclear how self-driving technology should handle conflicts between individual self-interest and the collective interest of a group, the so-called "social dilemma". Social dilemmas can arise when individual AVs prioritize their own safety and efficiency, potentially compromising the safety and efficiency of other AVs or human-driven vehicles on the road. For instance, when changing lanes on highways, AVs must consider the speed and trajectory of surrounding vehicles while prioritizing their own utility. If each AV prioritizes its self-interest, the result may be car accidents and decreased overall efficiency on the road. Understanding social dilemmas involving AVs is critical at this early stage of AV adoption.
The aim of this paper is to study social dilemmas that arise when AVs make sequential decisions in a dynamic driving environment (as shown in Figure 1). To achieve this, we propose a social learning framework in which AVs learn cooperation and defection strategies in a Markov game through interactions with random opponents selected from an agent pool. By obtaining the empirical payoff matrix of the Markov game under these learned strategies, we can check whether the social dilemma conditions hold (following [14]).

Related Work
Social dilemmas have been widely studied to understand cooperative and defective behaviors among players, as seen in the works of Allen et al. [15], Alex et al. [16], Qi et al. [17], and Hilbe et al. [18]. While recent studies have investigated social dilemmas in the context of autonomous vehicles (AVs), such as the Trolley problem, which aims to investigate how AVs can make moral decisions [19,20], these studies are limited to matrix games. Matrix games are stateless and cannot capture the sequential decision-making process in dynamic traffic environments. Moreover, they do not account for other road users that may affect the state of the traffic environment. To overcome these limitations, this paper focuses on Markov games, which allow for sequential decision-making and cumulative rewards over time, making them suitable for modeling AVs' behavior in traffic environments.
To identify dilemmas in Markov games, sequential social dilemmas have been proposed to capture cooperative and defective behaviors in a dynamic environment, building on the concept of social dilemmas in two-player matrix games [14,[21][22][23]. Prior research assumes that agents play with a fixed opponent in Markov games, which does not work for real-world scenarios in which AVs may randomly interact with each other. To tackle this challenge, this paper leverages a social learning framework where a population of agents [24] learn policies through repeated interactions with randomly selected opponents in multi-agent systems. In this study, we aim to identify sequential social dilemmas for AVs in traffic scenarios, utilizing a social learning scheme for Markov games.
Many other studies focus on enhancing coordination and cooperation among players at game equilibrium. One approach to guiding players' cooperative behavior in a social group is through social norms or conventions, which are shared standards of acceptable behavior in groups [25][26][27][28] and can lead to desired social outcomes [24]. It is important to note that social dilemmas describe a game in which there is a conflict between collective and individual interests, while social norms denote a game equilibrium that can lead to desired social outcomes. We provide a summary of related work on social dilemmas and social norms in Table 1.

Table 1. Literature on social dilemmas and social norms.
The main contributions of this paper are as follows:
1. We propose a social learning framework to investigate sequential driving dilemmas (SDDs) in autonomous driving systems.
2. We develop a reinforcement learning algorithm for AVs' policy learning to estimate SDDs in traffic scenarios.
3. We apply the proposed algorithm to two traffic scenarios, an unsignalized intersection and a highway, in order to identify SDDs.
The remainder of this paper is organized as follows. Section 3 presents preliminaries regarding social dilemmas in matrix games. Section 4 introduces a social learning framework to model interactions among random AV players in dynamic driving environments. Section 5 conducts numerical experiments to identify SDDs in traffic scenarios. Section 6 concludes and discusses future work.

Social Dilemma in a Matrix Game
In this subsection, we briefly introduce social dilemmas in a traditional matrix game. Table 2 shows the payoff matrix of a game between two players. Both players can choose between cooperative and defective policies, which are defined as follows.

Definition 1. Define π^C ∈ Π^C and π^D ∈ Π^D as cooperative and defective policies, respectively. Players choose to cooperate with opponents when conducting π^C and to defect when conducting π^D. Π^C and Π^D denote the cooperative and defective policy sets.

Note that the social dilemma in Definition 2 is a one-shot matrix game. The cooperative and defective policies π^C, π^D are static strategies, which are not applicable to real-world scenarios. We extend it to sequential driving dilemmas with dynamic settings in the following subsection.

Sequential Driving Dilemma in a Markov Game
We first introduce a Markov game to model the sequential decision making of agents in a dynamic driving environment. We assume the Markov game is non-cooperative: each agent aims to maximize her own cumulative payoff. The Markov game is denoted by M. We specify each component of the Markov game as follows:
• Agents. There are m adaptive AVs in the agent pool, denoted by {1, 2, · · · , m}.
• State s ∈ S. The environment state in the driving environment, denoted by s ∈ S, refers to global information such as the spatial distribution of all road users and road conditions. There may also be other road users, such as background vehicles, who are non-strategic players in the Markov game. The environment state space is denoted by S. Note that the environment state is not fully observable to agents, making the Markov game a partially observable Markov decision process (POMDP).
• Observation o_i ∈ O_i. Each agent draws a private observation from her neighborhood environment, which is a subset of the global environment state s. Specifically, agent i ∈ {1, 2, · · · , m} draws a private observation denoted by o_i ∈ O_i, where O_i is the observation space of agent i. The joint observation space for all agents is denoted by O, which captures the overall observation of the driving environment by all agents. Each agent is limited to observing her surroundings rather than the entire environment state.
• Action a ∈ A. For simplicity, we adopt a discrete action space. The joint action is a = (a_1, a_2, · · · , a_m), where a_i ∈ A_i, i = 1, 2, ..., m, and A_i is the action set of agent i. Actions in different traffic scenarios are detailed in Section 5.
• Transition P. After the joint action a is taken in state s, the environment arrives at a new state s′ with transition probability P(s′ | s, a). Agents interact with the environment to gain state transition experiences, i.e., (s, a, s′).
• Reward R. Agent i ∈ {1, 2, · · · , m} receives a reward r_i(s, a, s′) at each time step, which can be the travel cost in the driving environment.
• Discount factor γ. The discount factor is used to discount future rewards. In this study, we take γ = 1, because drivers usually complete trips in a finite horizon and value future and immediate rewards equally.
Agent i ∈ {1, 2, · · · , m} uses a policy π_i : O_i × A_i → [0, 1] to choose actions after drawing observation o_i. The policy is designed to maximize the agent's expected cumulative reward by selecting an optimal action. This process of observing the environment, selecting an action, and receiving a reward repeats until the agents reach their own terminal state, which occurs when either a crash happens or the agents complete their trips. In a multi-agent system, the optimal policy for agent i is denoted π_i* and is defined as the agent's best response to the policies of the other agents when those policies are held constant. For all agents i ∈ {1, · · · , m}, the value achieved by agent i from any state is maximized given the other agents' policies:

V_i(s; π_i*, π_−i) ≥ V_i(s; π_i, π_−i), ∀π_i, ∀s ∈ S,

where π_−i denotes the joint policy of agent i's opponents, and V_i(s; π_i, π_−i) denotes the value function of agent i starting from state s, given the policy π_i of agent i and the policies π_−i of her random opponents in the environment. This study employs independent reinforcement learning to facilitate individual agents' optimal policy learning. Each agent is trained with its own policy networks.
We now present how agents learn the cooperative policy and defective policy in the Markov game, as well as the method for calculating the empirical payoff matrix of the agents with respect to these policies. To obtain the cooperative policy and defective policy within the social learning framework, we introduce a weight parameter w to represent the desired level of social outcomes [29] in an agent's reward. Specifically, we define the weight associated with desired social behaviors for the cooperative policy as w^C and that for the defective policy as w^D. The reward obtained by agent i at each time step in a dynamic environment is then calculated as follows:

r_i = r_i^a + w · r_i^s, w ∈ {w^C, w^D}, (3)

where r_i^a is the reward associated with the selected action and r_i^s is the reward associated with some desired social outcome (e.g., road safety). The weight parameter w balances the importance of these two types of rewards. We have w^C > w^D, which means agents prefer collective interest to self-interest when adopting π^C and vice versa. Cooperative and defective policies are trained based on w^C and w^D, respectively (see the learning algorithm in Section 4.2). There are many other ways to obtain π^C and π^D. For instance, a cooperative agent can be defined as a driver who always yields at the intersection, and a defective agent as a driver who crosses the intersection aggressively without taking other road users into account [36]. The level of aggressiveness can be measured for each agent via social behavior metrics [14] to determine π^C and π^D.
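As an illustration, the weighted reward above can be computed as follows. The additive form and the concrete weight and reward values are assumptions for this sketch, not the paper's calibrated settings:

```python
def shaped_reward(r_action: float, r_social: float, w: float) -> float:
    """Combine the action reward r^a and the social-outcome reward r^s
    with weight w, following the additive form r = r^a + w * r^s."""
    return r_action + w * r_social

# Hypothetical weights: cooperative agents weigh the social outcome more,
# i.e., w^C > w^D as required in the text.
W_COOP, W_DEFECT = 0.8, 0.2

# Example per-step rewards: a travel-time cost and a no-collision bonus.
r_act, r_soc = -1.0, 5.0
assert shaped_reward(r_act, r_soc, W_COOP) > shaped_reward(r_act, r_soc, W_DEFECT)
```

The same reward terms thus yield a higher shaped reward under the cooperative weight, which is what steers the two Q-networks toward different policies.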

Definition 4.
Define an empirical payoff matrix for a pair of agents i and j who play a Markov game starting from state s ∈ S in Table 3, where R_{π_i, π_j}(s) is the payoff of agent i when agents i and j adopt policies π_i and π_j, respectively. Following the matrix game convention, the four cells are denoted R(s) for mutual cooperation (π^C, π^C), S(s) for unilateral cooperation (π^C, π^D), T(s) for unilateral defection (π^D, π^C), and P(s) for mutual defection (π^D, π^D).

Table 3. Empirical payoff matrix in a Markov game.

Definition 5.
A Markov game is a sequential driving dilemma (SDD) when there exist states s ∈ S for a pair of agents whose empirical payoff matrix satisfies the following social dilemma conditions (Definition 2):
1. 2R(s) > T(s) + S(s): when starting from state s ∈ S, agents prefer mutual cooperation over an equal probability of unilateral cooperation and defection;
2. T(s) > R(s): agents prefer unilateral defection to mutual cooperation (greed); or P(s) > S(s): agents prefer mutual defection to unilateral cooperation (fear).
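For a given empirical payoff matrix, these conditions can be checked mechanically. The sketch below uses the standard matrix-game social dilemma inequalities; the exact set of inequalities is our reading of Definition 2 and should be treated as an assumption:

```python
def is_social_dilemma(R: float, P: float, S: float, T: float) -> bool:
    """Check the standard social dilemma inequalities on one empirical
    payoff matrix (R: mutual cooperation, P: mutual defection,
    S: unilateral cooperation, T: unilateral defection)."""
    return (
        R > P                # mutual cooperation beats mutual defection
        and R > S            # cooperation beats being exploited
        and 2 * R > T + S    # cooperation beats alternating C/D (condition 1)
        and (T > R or P > S) # greed or fear motivates defection (condition 2)
    )

# Classic prisoner's dilemma payoffs satisfy the conditions:
assert is_social_dilemma(R=3, P=1, S=0, T=5)
# A harmony game (no greed, no fear) does not:
assert not is_social_dilemma(R=4, P=1, S=2, T=3)
```

A Markov game is then flagged as an SDD if this check passes for some starting state s and some pair of agents.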
Note that the empirical payoff matrix in a Markov game is utilized to identify the existence of dilemmas. We will investigate game equilibrium in sequential driving dilemmas in the future.

Social Learning Framework
In this section, we introduce a social learning framework that utilizes a Markov game to model interactions among autonomous vehicles (AVs) that are randomly selected from an agent pool.

Social Learning Scheme
In order to understand road users' behaviors in the traffic environment, each AV must learn through interactions with its opponents. In traditional learning schemes, an agent repeatedly interacts with a fixed opponent until a stable equilibrium is reached. However, in the context of navigating the traffic environment, AVs encounter different opponents dynamically instead of a fixed one.
In a social learning scheme, agents learn through interactions with opponents randomly selected from an agent pool. As a result, each agent plays games repeatedly with random opponents and develops its own policy based on personal experience. This approach allows AVs to learn from a diverse set of opponents and adapt to changing circumstances, which is critical for ensuring safe and efficient driving behavior in real-world scenarios. In contrast to the fixed agents in traditional learning schemes, the agents in a social learning scheme are randomly selected from a pool in each episode. These agents then play a Markov game and update their policies based on the outcomes.
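A minimal sketch of this pairing scheme is given below, with an abstract `play_markov_game` callback standing in for the game and policy updates (both names are ours, for illustration):

```python
import random

def social_learning_episode(agent_pool: list, play_markov_game) -> None:
    """One episode of the social learning scheme: agents are drawn in
    random pairs without replacement until the pool is empty, and each
    pair plays a Markov game and updates its own policy internally."""
    pool = agent_pool[:]   # work on a copy; the pool is refilled each episode
    random.shuffle(pool)
    while len(pool) >= 2:
        i, j = pool.pop(), pool.pop()  # random opponents, removed from pool
        play_markov_game(i, j)         # each agent learns from the outcome

# Usage with a stand-in game that just records the pairings:
pairs = []
social_learning_episode(list(range(6)), lambda i, j: pairs.append((i, j)))
assert len(pairs) == 3  # six agents yield three pairwise games
```

Because the pairing is reshuffled every episode, each agent's experience mixes many different opponents, unlike the fixed-opponent setting.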

Social Learning Algorithm
In this section, we introduce a multi-agent reinforcement learning algorithm based on Deep Q-network (DQN) [37] to learn the cooperative policy and defective policy for AVs in the social learning scheme. The proposed algorithm is summarized in Algorithm 1.
We first initialize deep Q-networks for each agent. The Q-networks for cooperative and defective policies are denoted Q^C(o, a; θ^C) and Q^D(o, a; θ^D), parameterized by θ^C and θ^D, respectively. Their target networks are Q̃^C(o, a; θ̃^C) and Q̃^D(o, a; θ̃^D). Hyperparameters, including the exploration rate ε, learning rate α, target network update period T, and update parameter τ, are predetermined (see Table 4). In each episode, two agents are repeatedly selected at random and removed until the agent pool is empty. When an agent is selected to participate in a Markov game, she randomly chooses to either cooperate or defect with her opponent. If agent i chooses cooperation, her cooperative Q-network Q^C(o, a; θ^C) and target network Q̃^C(o, a; θ̃^C) are updated with the reward weight w^C in the Markov game. Similarly, if agent i chooses defection, her defective Q-network Q^D(o, a; θ^D) and target network Q̃^D(o, a; θ̃^D) are updated with the reward weight w^D. From an initial environment state s_0, agent i draws a private observation o_i and takes action a_i according to the widely used ε-greedy method (i.e., the agent chooses an action randomly with probability ε and greedily from the current optimal policy with probability 1 − ε) until reaching a terminal state. The joint action a is executed in the environment, which results in a state transition s → s′. After the state transition, each agent receives a new private observation and a corresponding reward according to Equation (3). This process repeats until a terminal state is reached, which occurs when a crash happens or agents complete their trips. The experience tuple (o, a, r, o′) of agent i is stored in a replay buffer, which is used to update the target network every T time steps.

Algorithm 1 DQN-SDD
1: Initialize two Q-networks for cooperative and defective policies for each agent in the agent pool: Q^C(o, a; θ^C), Q^D(o, a; θ^D), and their target networks Q̃^C(o, a; θ̃^C), Q̃^D(o, a; θ̃^D).
2: Input: exploration parameter ε, learning rate α, target network update period T, and update parameter τ.
3: for each episode do
4:   while the agent pool is not empty do
5:     Randomly select two agents from the pool to play a Markov game.
6:     Each agent chooses to cooperate or defect with her opponent.
7:     while s is not terminal do
8:       For each agent, select action using the ε-greedy policy;
9:       Update state transition s → s′;
10:      Store (o, a, r, o′) for each agent.
11:     end while
12:     Decay ε.
13:     if it is time for a target update (every T steps) then
14:       Update target networks: θ̃ ← (1 − τ)θ̃ + τθ.
15:     end if
16:   end while
17:   Update Q-functions with the selected learning rate α.
18:   Refill the pool with agents who have updated policies.
19: end for

After the training process, the trained Q-networks are utilized to calculate the empirical payoff matrix. First, a pair of agents is randomly selected from the agent pool. Then, the pair executes four Markov games, corresponding to the four combinations of cooperative and defective policies, namely (π^C, π^C), (π^C, π^D), (π^D, π^C), and (π^D, π^D), according to their own Q-networks Q^C(o, a; θ^C) and Q^D(o, a; θ^D). The empirical payoff for agent i is calculated as the cumulative reward received, averaged over a fixed number of simulation runs. Each cell in the resulting payoff matrix is calculated based on Equation (4), which represents the average reward obtained by agent i when the pair of agents takes joint action a given the current state s and policies π_i and π_j.
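The payoff-estimation step above can be sketched as follows, with a hypothetical noisy `rollout` function standing in for a Markov game played by a trained pair of agents (all names and values here are illustrative, not the paper's):

```python
import random
import statistics

def empirical_payoff(rollout, policy_i, policy_j, n_sims: int = 1000) -> float:
    """Estimate one cell of the empirical payoff matrix: the average
    cumulative reward of agent i over repeated rollouts in which agents
    i and j follow the given (cooperative or defective) policies."""
    return statistics.fmean(rollout(policy_i, policy_j) for _ in range(n_sims))

def noisy_rollout(pi, pj):
    """Stand-in rollout: prisoner's-dilemma-like mean payoffs plus noise."""
    base = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0}
    return base[(pi, pj)] + random.gauss(0.0, 0.1)

random.seed(0)
# Two of the four cells (mutual cooperation R and unilateral defection T):
R = empirical_payoff(noisy_rollout, "C", "C")
T = empirical_payoff(noisy_rollout, "D", "C")
assert abs(R - 3.0) < 0.05 and abs(T - 5.0) < 0.05
```

Repeating this for all four policy combinations fills one empirical payoff matrix, which can then be tested against the SDD conditions of Definition 5.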

Numerical Experiments
In this section, we investigate two traffic scenarios (Figure 2): unsignalized intersections and highways. For each traffic scenario, we first introduce the environment set-up and then discuss the SDD results.

Environment Set-Up

The traffic environment is an unsignalized intersection (Figure 2a) comprising two intersecting roads, Road 1 and Road 2. Vehicles represented by green and red colors navigate Road 1 and Road 2, respectively. The intersection is discretized into uniform cells, and the state is defined by the locations of the agents. Each agent can observe only the road she navigates, and the action set consists of two actions: "Go" and "Stop". The "Go" action corresponds to moving forward by one cell, while the "Stop" action represents taking no action. At each time step, agents receive a negative reward for taking either action, denoting the instantaneous travel time, to encourage them to complete the game quickly. If a collision occurs at the intersection cell, each agent incurs a negative reward reflecting the cost of the collision. If no collision occurs, agents complete their trips and the game terminates. The desired social outcome is that no collision occurs and agents successfully complete their trips.
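The intersection dynamics just described can be sketched as a minimal step function. The geometry (one shared conflict cell) and the reward values are assumptions for illustration only:

```python
def step(state, actions):
    """One transition of a toy unsignalized intersection: each agent's
    state is her cell index along her own road, both roads cross at one
    conflict cell, 'Go' advances one cell and 'Stop' stays put."""
    STEP_COST, CRASH_COST, CONFLICT_CELL = -1.0, -10.0, 3  # assumed values
    positions = [p + 1 if a == "Go" else p for p, a in zip(state, actions)]
    rewards = [STEP_COST, STEP_COST]                 # per-step travel-time cost
    crashed = positions[0] == positions[1] == CONFLICT_CELL
    if crashed:
        rewards = [r + CRASH_COST for r in rewards]  # both pay the crash cost
    done = crashed or all(p > CONFLICT_CELL for p in positions)
    return positions, rewards, done

# One agent yields while the other crosses: no collision.
s, r, d = step([2, 2], ["Go", "Stop"])
assert s == [3, 2] and r == [-1.0, -1.0] and not d
# Both go: they meet in the conflict cell and crash.
s, r, d = step([2, 2], ["Go", "Go"])
assert s == [3, 3] and d and r == [-11.0, -11.0]
```

The tension is visible already in this toy version: stopping avoids the crash cost but accumulates travel-time cost, which is exactly the conflict the SDD analysis probes.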

SDD Results
We investigate whether an SDD exists at the unsignalized intersection. In Figure 3, each point represents the empirical payoff matrix (Table 3) obtained by a pair of agents randomly selected from the pool who play the intersection game from some initial state s. For example, to estimate one cell R_{π_i, π_j}(s) in the payoff matrix, we simulate the game between these two agents driven by policies π_i and π_j 1000 times and compute the average payoff. The x-axis represents T(s) − R(s) and the y-axis represents P(s) − S(s) in the empirical payoff matrix. The squares and circles represent cases satisfying and violating the SDD conditions, respectively. According to Definition 5, the intersection game is an SDD. The cooperation and defection performance in the two scenarios is summarized in Table 5. We look into the points satisfying the SDD conditions (orange squares). Note that these points are in the first quadrant, where T(s) > R(s) and P(s) > S(s). T(s) > R(s) means agents prefer unilateral defection to mutual cooperation: when AVs encounter those who choose to yield at the intersection, they prefer crossing the intersection aggressively over yielding to others. P(s) > S(s) means agents prefer mutual defection to unilateral cooperation: when AVs encounter aggressive agents, they likewise prefer crossing the intersection aggressively over yielding. According to Definition 3, the intersection game is a prisoner's dilemma.

Environment Set-Up
The traffic environment is a highway (Figure 2b) with two lanes, going from south to north. Vehicles in green and red represent AV players in a Markov game. The environment is discretized into uniform cells, and the state space S is defined by the locations of the agents. The initial locations of the agents on the two lanes are randomized. Agents are capable of observing both lanes. The action set comprises three actions: "Go", "Stop", and "Lane-change". The "Go" action is associated with moving forward by one cell, whereas the "Stop" action is associated with taking no action. The "Lane-change" action allows switching to the other lane. At each time step, agents receive a negative reward that denotes their travel time for taking any of the three actions. Agents receive a positive reward for forming a car platoon, which refers to a group of vehicles that travel in a coordinated manner, with one leading vehicle and the others following behind [38]. Figure 4 displays the empirical payoff matrices obtained by simulating the platoon game. The points in the plot satisfy T(s) < R(s) and P(s) < S(s), indicating that the platoon game does not satisfy the SDD conditions. Specifically, T(s) < R(s) implies that agents prefer mutual cooperation over unilateral defection, indicating that AVs are more likely to form a platoon when interacting with cooperative agents. On the other hand, P(s) < S(s) implies that agents prefer unilateral cooperation over mutual defection, suggesting that AVs are more likely to change lanes and form a platoon even with defective agents.
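The platoon-formation reward in the highway set-up can be sketched with assumed values. The adjacency test for "one leading vehicle, one following" and the concrete reward magnitudes are our illustrative assumptions:

```python
def is_platoon(lanes, positions):
    """Return True when the two vehicles travel in the same lane in
    adjacent cells, i.e., one leads and the other follows directly."""
    return lanes[0] == lanes[1] and abs(positions[0] - positions[1]) == 1

STEP_COST, PLATOON_BONUS = -1.0, 2.0  # assumed per-step reward values

def rewards(lanes, positions):
    """Per-step rewards: every action costs travel time, and both agents
    earn a bonus while they form a platoon."""
    bonus = PLATOON_BONUS if is_platoon(lanes, positions) else 0.0
    return [STEP_COST + bonus, STEP_COST + bonus]

assert rewards([0, 0], [4, 3]) == [1.0, 1.0]    # platoon formed
assert rewards([0, 1], [4, 3]) == [-1.0, -1.0]  # different lanes: no bonus
```

Since the bonus is paid to both agents whenever the platoon forms, cooperation and self-interest point the same way here, which is consistent with the finding that the platoon game is not an SDD.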

SDD Results
Summarizing the two games, we find that the intersection game is an SDD but the platoon game is not. The intuitive explanation is as follows. In the intersection game, the collective interest of agents is to avoid crashes at the intersection, while the individual interest is to minimize travel time. If agents want to avoid crashes, one of them needs to stop and wait until the other crosses the intersection, which increases her travel time. There thus exists a conflict between the collective and individual interests. In the presence of defective agents, the intersection game becomes a prisoner's dilemma in which agents may defect to gain individual advantages at the cost of the collective interest.

Conclusions
In this study, we employ a social learning scheme for Markov games to identify SDDs in AVs' policy learning. We propose a learning algorithm to train cooperative and defective policies and evaluate SDDs in traffic scenarios. We investigate the existence of SDDs in intersection and platoon games. The overall findings include: (1) The intersection game is an SDD, while the platoon game on highways is not. (2) The Markov game at an unsignalized intersection resembles a prisoner's dilemma, where agents may defect to gain individual advantages, but at the cost of the collective interest.
We briefly discuss the limitations of this work: (1) AVs' decision making in driving environments is simplified into discrete action sets, whereas there are many continuous decision variables for AV players, including velocity, brake rate, and headway. (2) The number of agents who randomly encounter one another in Markov games is limited; the number of agents and their interactions in real-world traffic scenarios can be larger and more complex.
This work can be extended in the following ways: (1) We will leverage multi-agent reinforcement learning (MARL) to identify SDDs in more complex real-world scenarios with many road users (e.g., pedestrians). (2) Addressing SDDs for AVs will require policymakers and road planners to design policies and incentives that encourage AVs to make decisions that benefit the public good. This may involve regulations that enforce safe and efficient behavior, in line with established traffic rules and social norms. Incentive mechanisms such as gifting in a multi-agent system can be explored to study how to enhance AVs' cooperative behaviors.
Funding: This work is partially supported by the National Science Foundation CAREER under award number CMMI-1943998.
Data Availability Statement: Not applicable.

Conflicts of Interest:
We confirm that neither the manuscript nor any parts of its content are currently under consideration or published in another journal.