Fair collaborative vehicle routing: A deep multi-agent reinforcement learning approach

Collaborative vehicle routing occurs when carriers collaborate through sharing their transportation requests and performing transportation requests on behalf of each other. This achieves economies of scale, thus reducing cost, greenhouse gas emissions and road congestion. But which carrier should partner with whom, and how much should each carrier be compensated? Traditional game theoretic solution concepts are expensive to calculate as the characteristic function scales exponentially with the number of agents. This would require solving the vehicle routing problem (NP-hard) an exponential number of times. We therefore propose to model this problem as a coalitional bargaining game solved using deep multi-agent reinforcement learning, where - crucially - agents are not given access to the characteristic function. Instead, we implicitly reason about the characteristic function; thus, when deployed in production, we only need to evaluate the expensive post-collaboration vehicle routing problem once. Our contribution is that we are the first to consider both the route allocation problem and gain sharing problem simultaneously - without access to the expensive characteristic function. Through decentralised machine learning, our agents bargain with each other and agree to outcomes that correlate well with the Shapley value - a fair profit allocation mechanism. Importantly, we are able to achieve a reduction in run-time of 88%.


Introduction
Heavy goods vehicles (HGVs) in the UK contributed 4.3% of the UK's total greenhouse gas emissions in 2019 (UK BEIS 2021). HGVs are utilised inefficiently, at 61% of their total weight capacity. Moreover, 30% of the distance travelled carries zero freight (UK DfT 2020, RFS0125).
Collaborative vehicle routing (CVR) has been proposed to improve HGV utilisation. Here, carriers collaborate through sharing their delivery information in order to achieve economies of scale. If carriers agree to work together, they are said to be in a coalition. As a result of improved utilisation, total travel costs across collaborating carriers can be reduced, resulting in a collaboration gain. The remaining question then is how to allocate this collaboration gain in a fair manner such that carriers are incentivised to form coalitions. An example of CVR is given in Figure 1.
Figure 1: Three agents (denoted by colours) before and after collaboration. Squares denote depots. Crosses denote customer locations. Node indices (arbitrary) are denoted in black, with costs given in their respective colours. The collaboration gain is defined as the difference in social welfare (or total cost) before and after collaboration. In Figure 1b, Agents 1, 2 and 3 all decide to collaborate, which reduces the system's total cost by 0.88 (or 26%). This results in a collaboration gain per capita (assuming agents split the gain equally) of 0.29. For detailed calculations, see Section 3.1. (Panel label: total cost: 2.47.)
Prior literature suggests that collaborative vehicle routing can reduce costs by around 4-46% and also reduce greenhouse gas emissions and road congestion (Cruijssen et al. 2007; Zhang et al. 2017; Gansterer and Hartl 2018b; Pan et al. 2019; Gansterer and Hartl 2020; Cruijssen 2020; Ferrell et al. 2020). Sharing resources may also lead to improved resilience to fluctuations in supply and/or demand. Despite these benefits, real-world adoption remains limited, with only a few companies participating (Cruijssen et al. 2007; Guajardo and Rönnqvist 2016; Cruijssen 2020). Currently, a key barrier is the computational complexity of calculating a fair gain sharing mechanism that scales with a larger number of companies. Guajardo and Rönnqvist (2016) recommend that future work should investigate approximate gain sharing methods. Our paper follows this recommendation.
Our first contribution is modelling the collaborative routing problem as a coalitional bargaining game (Okada 1996) with intelligent agents obtained through the use of deep multi-agent reinforcement learning (MARL). We provide the theoretical grounding in this paper, tying together the fields of collaborative vehicle routing, coalitional bargaining, and deep multi-agent reinforcement learning in order to obtain a theoretically grounded approach that significantly reduces run-time. Here, agents attempt to reach agreement on selecting the 'best' carrier(s) to partner with, and rationally share the collaboration gain amongst the coalition. This bargaining process takes place over multiple rounds of bargaining (see Section 3.3 for a formal definition). A benefit of this approach is that both the routing problem (who should deliver which requests?) and the gain sharing problem (who receives how much of the added value?) are considered simultaneously, whereas a key limitation of many previous methods is that they consider these sub-problems in isolation from one another (Gansterer and Hartl 2018b). Moreover, our approach is agnostic to the underlying routing problem: the complexity of the vehicle routing problem (VRP) formulation can be increased with further constraints such as time windows, without further modification to the method.
Our second contribution is that agents do not need access to the full characteristic function explicitly. To obtain the full characteristic function, the collaboration gain for all possible coalitions must be calculated. In the three-player setting, there are four possible coalitions: {1, 2, 3}, {1, 2}, {1, 3} and {2, 3}. Therefore, obtaining the full characteristic function requires solving $2^{n-1}$ NP-hard post-collaboration VRPs (for a formal introduction, see Section 2.1.3). As a result, methods that require full access to the characteristic function are intractable for settings with more than 6 carriers (Cruijssen 2020). Instead, our agents can implicitly reason about the characteristic function through only receiving a high-dimensional graph input of delivery information (for example, latitudes and longitudes), as well as other agents' actions. This eliminates the need to fully evaluate the characteristic function when deployed in production, which involves solving the expensive post-collaboration VRP an exponential number of times. Instead, we only need to solve the post-collaboration VRP once when deployed in real-world settings, thus allowing our approach to achieve a significant run-time reduction. In addition, our approach utilises Centralised Training with Decentralised Execution (CTDE) to obtain decentralised agent policies (Lowe et al. 2020). Decentralised policies are desirable in real-world applications as each agent does not necessarily require access to the global, underlying state. This helps ensure that companies' sensitive information will not be leaked to competitors. This also helps to stabilise training in multi-agent settings and to reduce communication costs. Furthermore, our approach is inductive, as opposed to the transductive nature of prior methods. This enables our agents to generalise to agents never seen during training and thus reduces computational cost.
The remainder of this paper is organised as follows. Section 2 positions our work within the wider context of both collaborative vehicle routing and deep multi-agent reinforcement learning. Section 3 provides a formal introduction to coalitional games, coalitional bargaining and reinforcement learning. Section 4 discusses and justifies various design decisions regarding our agents. Section 5 details our experimental setup, results, discussion and future work. Finally, Section 6 concludes our findings and provides broader managerial implications as a result of this work.

Collaborative vehicle routing
Prior collaborative routing literature tackles the partner selection sub-problem (i.e., who should each carrier work with?) by estimating the collaboration gain between different carriers using heuristics (Palhazi Cuervo et al. 2016; Adenso-Díaz et al. 2014). However, a limitation of these approaches is that they do not consider how much each agent should be compensated, nor whether agents even agree to join the same coalitions (i.e., whether the coalitions are stable). Posing this problem as a coalitional bargaining game not only allows us to tackle the partner selection aspect, but also lets us consider the gain sharing aspect simultaneously.
The majority of the collaborative routing literature is concerned with the exchange of individual transportation requests amongst the carriers. This can be divided into three types of planning approaches: centralised; decentralised without auctions; and decentralised with auctions (Gansterer and Hartl 2018b, 2020).

Centralised planning
Centralised planning approaches aim simply to maximise social welfare (the sum of each company's profits). Typically, this goal is achieved by using a form of mixed integer linear programming or (meta)heuristics (Cruijssen et al. 2007; Gansterer and Hartl 2018b; Angelelli et al. 2022). This can be viewed as a common-payoff setting, i.e., where all agents are on the same team and receive the same reward. However, assuming a common-payoff setting in practice is unrealistic as companies are self-interested: they mostly care only about their own profits (Cruijssen et al. 2007). Moreover, there exists fierce competition, especially in horizontal collaborations. Therefore, the more realistic setting of decentralised control is needed, where agents are modelled to be self-interested.

Decentralised planning
There have also been a few attempts to tackle CVR with decentralised approaches. One approach focuses on the problem of partner selection, i.e. "who should work with whom?". Adenso-Díaz et al. (2014) propose an a priori index to estimate the collaboration gain between carriers based on their transportation requests. However, a key limitation is that they do not consider the gain sharing aspect and thus the coalitions formed may not be stable.
A key challenge in decentralised settings is managing the explosion in the number of bundles. Consider Figure 1, where Agent 2 may desire to sell delivery node 10 to Agent 1. However, if Agent 2 offers both nodes 10 and 11 as a bundle, then Agent 2 may be able to command a higher price. Indeed, the number of possible bundles scales as $O(2^m)$, where $m$ is the number of deliveries. To manage this explosion, a heuristic is typically implemented where agents can only submit or request a few bundles (sometimes only one), which would limit optimality (Bo Dai and Chen 2009).
A second challenge is to elicit other agents' preferences over all bundles. One approach is to impose structure on the problem in the form of combinatorial auctions, which aids optimality (Krajewska et al. 2008; Gansterer and Hartl 2018b; Gansterer et al. 2019; Los et al. 2022). In an auction, carriers submit requests they do not wish to fulfil to a common pool. Then, other carriers can submit bids on these requests, with various methods of determining the "winners" of said bids. Combinatorial auctions in these settings allow carriers to bid on bundles of transportation requests instead of individual transportation requests, which increases their expressivity and optimality. However, this additional structure comes at the cost of additional computational complexity. Moreover, in auction mechanism design, there are four desirable properties: efficiency; individual rationality; incentive compatibility; and budget balance. Gansterer et al. (2019) propose two auction-based approaches which may be useful in practice, but which are unable to satisfy all four properties simultaneously: there exists a trade-off instead. Los et al. (2022) investigate large-scale carrier collaboration containing 1,000 carriers with decentralised auctions. Whilst impressive in scale, their approach ignores the difficulty of large-scale gain sharing.
Both auction-based and non-auction-based approaches may also be exploited by strategic agent behaviour. Would agents intentionally misreport the costs associated with performing deliveries in order to maximise their own profits? Whilst we do not tackle this problem in our work, we believe MARL could be a useful tool to investigate this strategic behaviour in future work.

Gain sharing
Whilst gain sharing has been studied in collaborative routing using cooperative game theory (Guajardo and Rönnqvist 2016), the solution concepts typically assume that the characteristic function is given. For a set of $n$ agents, $N = \{1, \dots, n\}$, the characteristic function $v : 2^N \to \mathbb{R}_{\geq 0}$ assigns a value, or in our case collaboration gain, to every possible coalition that could be formed. Note that there exist $O(2^n)$ possible coalitions. This is intractable for settings with more than a few agents, because evaluating the collaboration gain of even a single coalition involves solving a vehicle routing problem, which is NP-hard. For detailed calculations of the collaboration gain, see Section 3.1. Guajardo and Rönnqvist (2016) review 55 papers from the collaborative transportation literature concerning gain sharing. They recommend that a future research direction should focus on developing approximate gain sharing approaches based on cooperative game theory that scale with the number of agents.
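The growth in the number of coalitions can be made concrete with a short enumeration. This is a minimal sketch rather than part of the method: the helper name `all_coalitions` is hypothetical, and it simply counts the non-empty subsets of $N$, each of which would need its own VRP solve if the characteristic function were evaluated in full.

```python
from itertools import combinations


def all_coalitions(agents):
    """Return every non-empty subset of `agents` as a tuple."""
    agents = list(agents)
    return [
        subset
        for size in range(1, len(agents) + 1)
        for subset in combinations(agents, size)
    ]


# Three-player example: 2^3 - 1 = 7 non-empty coalitions, of which the
# four with two or more members are the ones that can yield a gain.
three_player = all_coalitions([1, 2, 3])
multi_agent = [c for c in three_player if len(c) >= 2]

# The number of non-empty coalitions roughly doubles with each extra carrier.
counts = {n: len(all_coalitions(range(1, n + 1))) for n in (3, 6, 10)}
```

Because every entry would require solving an NP-hard routing problem, even modest values of $n$ make full evaluation impractical.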
In the wider algorithmic game theory literature, coalition formation has also been extensively studied (Chalkiadakis et al. 2011). However, much of the existing literature again assumes that the full characteristic function is given. Alternatively, they aim to find more succinct representations of the characteristic function, typically at a cost of increased computational complexity when computing solution concepts (Chalkiadakis et al. 2011). Examples include Induced Subgraph Games and Marginal Contribution Nets (Deng and Papadimitriou 1994; Ieong and Shoham 2005); however, even these succinct representation schemes require evaluating the value of multiple coalitions and thus solving multiple NP-hard VRPs. We argue that in many real-world scenarios the characteristic function is a function of the agents' assets or capabilities. In the collaborative routing setting, this is a function of the transportation requests an agent possesses. We therefore ask: "Can agents form optimal coalitions from the delivery information alone instead of having full access to the characteristic function?". Therefore, our paper can be viewed as using an alternative, succinct representation scheme which approximates a rational outcome by using a function approximator.

Deep multi-agent reinforcement learning
Single agent reinforcement learning has seen increasing adoption in supply chain management. However, supply chains can be naturally modelled as a system comprising multiple self-interested agents (Fox et al. 2000; Xu et al. 2021; Brintrup 2021). For a thorough review of reinforcement learning applied towards supply chain management, see Yan et al. (2022).
Recently, MARL has seen success in playing board and video games such as Go, StarCraft II and Dota 2 (Silver et al. 2016; Vinyals et al. 2019; OpenAI et al. 2019). Whilst these are tremendous feats in the AI space, the underlying games tend to be 2-player and zero-sum. However, most real-world applications, including supply chain management (Gabel and Riedmiller 2012; Kosasih and Brintrup 2021), are n-player and mixed-motive (with potential 'sequential social dilemmas' [Leibo et al. 2017]).
Whilst there is some research in this direction, the majority of MARL research focuses on pure coordination or pure competition settings (see Table 1). Our work is 3-player and mixed-motive, which leads to a more challenging joint-policy space, allowing for complex behaviours such as collusion.
The most similar work to ours from a multi-agent learning perspective is that of Bachrach et al. (2020) and Chalkiadakis and Boutilier (2004). Bachrach et al. (2020) apply deep MARL to a spatial and non-spatial Weighted Voting Game, where agents are given full access to the characteristic function. Chalkiadakis and Boutilier (2004) apply a Bayesian MARL approach to coalition formation, as their problem has uncertainty in the characteristic function. In their problem, each agent knows its own capability, but does not observe other agents' capabilities. As a result, agents maintain a belief over other agents' capabilities. However, each agent's capabilities remain constant. In our work, each agent's 'capability' can be thought of as the transportation requests it possesses, which change between episodes. Thus, our agents must be able to generalise across differing agent capabilities.

Collaborative Vehicle Routing
We denote the set of $n$ agents as $N = \{1, \dots, n\}$. A coalition is a subset of $N$, i.e. $C \subseteq N$. The grand coalition is where all agents are in the coalition, i.e. $C = N$.
Pre-collaboration profit and social welfare: The pre-collaboration profit of Agent 1 in Figure 2 is calculated as follows: the Revenue is 3 (1 for each delivery); the Cost is 1.42 (sum of the edge distances); thus the Profit is 1.58 (Revenue minus Cost). Similarly, the pre-collaboration profits of Agents 2 and 3 are 2 and 2.07. The pre-collaboration social welfare is the sum of the pre-collaboration profits, thus 1.58 + 2 + 2.07 = 5.65.
Post-collaboration "profit" and social welfare: Assuming agents agree to form the grand coalition C = {1, 2, 3}, the post-collaboration "profit" of Agent 1 can be calculated as 1 − (0.06 + 0.06) = 0.88. Note that the post-collaboration "profit" for Agent 1 appears to have decreased from 1.58 to 0.88 as a result of collaboration. This will be accounted for when discussing the characteristic function and thus Agent 1 will not lose out when we calculate its reward. For Agents 2 and 3, the post-collaboration "profit" is 2.19 and 3.46 respectively. Thus, the post-collaboration social welfare is 0.88 + 2.19 + 3.46 = 6.53.
Figure 2: Three agents, Agents 1, 2 and 3, are denoted by the colours green, orange and purple respectively. Squares denote depots. Crosses denote customer locations. Node indices (arbitrary) are denoted in black, with costs given in their respective colours. The collaboration gain is defined as the difference in social welfare before and after collaboration. Figure 2b and Figure 2c refer to two possible post-collaboration scenarios with collaboration gains per capita of 0.29 and 0.38 respectively. Thus, it would be rational for the coalition {1, 2} to form instead of the grand coalition {1, 2, 3}.
Collaboration gain: The collaboration gain is defined as the difference in social welfare before and after collaboration for a given coalition, in this case 6.53 − 5.65 = 0.88 for the grand coalition. Note that the collaboration gain is always greater than or equal to 0. The value per capita is 0.88 / 3 = 0.29. During the bargaining process, agents are able to choose how to divide this collaboration gain amongst themselves. In the unique case where agents agree to divide the collaboration gain equally, i.e. according to the value per capita, we refer to this as equal gain sharing. Note that if only Agents 1 and 2 form a coalition (and exclude Agent 3), then the collaboration gain (assuming equal gain sharing) is divided by 2 instead, thus making it rational to object and form the coalition {1, 2} (the value per capita of this coalition is 0.38).
Characteristic function: The characteristic function, $v : 2^N \to \mathbb{R}$, calculates the collaboration gain for every possible coalition. Importantly, fully evaluating the characteristic function would require solving a variant of the Vehicle Routing Problem for every possible coalition, which scales as $O(2^n)$.
It is important to note that the characteristic function is 0-normalised, essential and super-additive (see Section 3.2 for a formal definition). This guarantees that agents will not lose profits as a result of collaboration. The final take-home profit that each agent (or carrier) receives can then be calculated as the sum of the pre-collaboration profit and its respective allocation of the collaboration gain. For Agents 1, 2 and 3, this would equate to 1.58 + 0.88 / 3 = 1.87, 2 + 0.88 / 3 = 2.29 and 2.07 + 0.88 / 3 = 2.36 respectively (assuming equal gain sharing). In reality, carriers will receive this take-home profit (which is always greater than or equal to the pre-collaboration profit) as an incentive to collaborate.
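The arithmetic of this running example can be reproduced in a few lines. The sketch below only restates the figures quoted in this section (profits before and after collaboration for the grand coalition) and assumes equal gain sharing; the variable names are illustrative.

```python
# Figures quoted in the running example (grand coalition {1, 2, 3}).
pre_profit = {1: 1.58, 2: 2.00, 3: 2.07}     # pre-collaboration profits
post_profit = {1: 0.88, 2: 2.19, 3: 3.46}    # post-collaboration "profits"

pre_welfare = sum(pre_profit.values())       # social welfare before
post_welfare = sum(post_profit.values())     # social welfare after
gain = post_welfare - pre_welfare            # collaboration gain
per_capita = gain / len(pre_profit)          # equal gain sharing

# Take-home profit = pre-collaboration profit + allocated share of the gain.
take_home = {agent: profit + per_capita for agent, profit in pre_profit.items()}
```

Rounded to two decimal places, this gives a gain of 0.88, a per-capita value of 0.29 and take-home profits of 1.87, 2.29 and 2.36, so no agent ends up below its pre-collaboration profit even though Agent 1's raw post-collaboration "profit" fell.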

Coalitional games
We consider the n-player coalitional game, also called a cooperative game, with a set of agents $N = \{1, \dots, n\}$. A coalition is defined as a subset of $N$, i.e. $C \subseteq N$. The set of all coalitions is denoted $\Sigma$. The grand coalition is where the coalition consists of all agents in $N$, i.e. $C = N$. A singleton coalition is where the coalition consists of only one agent, i.e. $C = \{i\}$ for some $i \in N$.
A (transferable utility) coalitional game is a pair $G = \langle N, v \rangle$. The characteristic function $v : 2^N \to \mathbb{R}_{\geq 0}$ represents the value (or collaboration gain in our setting) that a given coalition $C$ receives. Like Okada (1996), we assume that the characteristic function is 0-normalised, essential and super-additive. The characteristic function is 0-normalised if the value of all singleton coalitions is 0, i.e. $v(\{i\}) = 0$ for all $i \in N$. The payoff vector $x^C = (x_i^C)_{i \in C}$ denotes the payoff for each player $i$ in the coalition $C$.
The set of all feasible payoff vectors for a given coalition $C$ is $X^C$, and $X_+^C$ denotes the subset of $X^C$ whose payoff vectors have only non-negative elements.
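The three assumed properties can be checked mechanically for a small game. The sketch below assumes the standard definitions, which this section does not spell out in full: 0-normalised means $v(\{i\}) = 0$ for every agent, essential means $v(N) > 0$, and super-additive means $v(S \cup T) \geq v(S) + v(T)$ for all disjoint coalitions $S$ and $T$. The values for {1, 2} and {1, 2, 3} come from the running example; the values for {1, 3} and {2, 3} are hypothetical placeholders, and all function names are illustrative.

```python
from itertools import combinations


def subsets(agents):
    """All non-empty subsets of `agents`, as frozensets."""
    return [
        frozenset(c)
        for size in range(1, len(agents) + 1)
        for c in combinations(sorted(agents), size)
    ]


def is_zero_normalised(v, agents):
    return all(v[frozenset({i})] == 0 for i in agents)


def is_essential(v, agents):
    return v[frozenset(agents)] > 0


def is_super_additive(v, agents):
    coalitions = subsets(agents)
    return all(
        v[s | t] >= v[s] + v[t]
        for s in coalitions
        for t in coalitions
        if not (s & t)  # only disjoint pairs are compared
    )


agents = {1, 2, 3}
v = {frozenset({i}): 0.0 for i in agents}
v[frozenset({1, 2})] = 0.76      # 2 x 0.38 per capita, from the running example
v[frozenset({1, 3})] = 0.50      # hypothetical value, not from the source
v[frozenset({2, 3})] = 0.40      # hypothetical value, not from the source
v[frozenset({1, 2, 3})] = 0.88   # grand-coalition gain from the running example

properties = (
    is_zero_normalised(v, agents),
    is_essential(v, agents),
    is_super_additive(v, agents),
)
```

Note that this check itself needs the full table of coalition values, which is exactly what becomes impractical to compute as the number of carriers grows.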

Coalitional bargaining
The purpose of this work is to find a partition of the carriers in $N$ with an associated payoff vector, i.e. (CS, x), which all self-interested, rational carriers agree to. Notice how this does not imply any sequential decision making. However, it was found that certain cooperative solution concepts can be retrieved as the outcome of non-cooperative, extensive form games such as coalitional bargaining (Nash 1953). Therefore, this necessitates sequential decision making in our problem, where we propose to obtain intelligent agents through the use of MARL. Okada (1996) presents the n-player, random proposers, alternating offers coalitional bargaining game, which we adopt. At every time-step $t = 1, 2, \dots$ an agent from $N$ is selected uniformly at random to be the proposer. The proposer, player $i$, has two actions: the proposed coalition and the proposed payoff vector. The proposed coalition $C$ must contain player $i$ and the value of the coalition $v(C)$ must be greater than 0. Due to the characteristic function being 0-normalised, this implies $|C| \geq 2$.
The payoff vector $x^C$ must be in the set of all feasible, non-negative payoff vectors $X_+^C$. After player $i$ has proposed, the remaining players, called the responders, are selected sequentially and uniformly at random to either accept or reject the proposal. If all agents in the proposed coalition $C$ accept, then those agents form a coalition with the agreed-upon proposal. The remaining players outside of $C$ continue negotiating from the next time-step. If any responder in $C$ rejects the proposal, then all players receive an immediate reward of zero and negotiations go on to the next round of bargaining. Then, a new proposer is selected uniformly at random and the time-step is incremented by 1. This continues until either agreement is reached, or the maximum time-step is reached. When a proposal $(C, x^C)$ is agreed upon at time $t$, every agent $i$ in $C$ receives a reward of $\gamma^{t-1} x_i^C$, where $\gamma \in [0, 1]$ is the discount factor. The discount factor decreases the reward received as time passes. This encourages agents to reach agreement within the first time-step in the three-player setting, as shown in Okada (1996). The discount factor in this setting is analogous to the patience of an agent, or the urgency of the delivery decision. Any agent who is not in a coalition at the end of this process is assumed to have a reward of zero. In the three-player setting, note that if one proposal is accepted, then no more feasible coalitions can form; thus, this denotes the end of the bargaining process, as seen in Figure 3.
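The bargaining protocol described above can be sketched as a small simulation loop. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function `bargain` and the callback signatures are hypothetical, feasibility is taken to mean non-negative payoffs that sum to at most $v(C)$, and the episode ends at the first agreement, which matches the three-player setting only.

```python
import random


def bargain(v, agents, propose, respond, gamma=0.9, max_steps=10, seed=0):
    """Play one random-proposer bargaining episode.

    v       : dict mapping frozenset coalition -> collaboration gain
    propose : callable(proposer, t) -> (coalition, payoffs dict)
    respond : callable(responder, coalition, payoffs, t) -> bool
    Returns (rewards per agent, time-step of agreement or None).
    """
    rng = random.Random(seed)
    rewards = {i: 0.0 for i in agents}
    for t in range(1, max_steps + 1):
        proposer = rng.choice(sorted(agents))  # uniform random proposer
        coalition, payoffs = propose(proposer, t)
        feasible = (
            proposer in coalition
            and v.get(coalition, 0.0) > 0
            and all(x >= 0 for x in payoffs.values())
            and sum(payoffs.values()) <= v[coalition] + 1e-9
        )
        if not feasible:
            raise ValueError("infeasible proposal")
        responders = sorted(j for j in coalition if j != proposer)
        rng.shuffle(responders)  # responders are polled in random order
        if all(respond(j, coalition, payoffs, t) for j in responders):
            for i in coalition:
                rewards[i] = gamma ** (t - 1) * payoffs[i]  # discounted payoff
            return rewards, t
    return rewards, None  # no agreement: everyone keeps a reward of zero


agents = [1, 2, 3]
grand = frozenset(agents)
v = {grand: 0.88}  # grand-coalition gain from the running example


def equal_split(proposer, t):
    return grand, {i: v[grand] / len(grand) for i in grand}


def accept_all(responder, coalition, payoffs, t):
    return True


rewards, t_agreed = bargain(v, agents, equal_split, accept_all)
```

With these hand-written stand-in policies, agreement is reached at the first time-step and each agent receives the undiscounted equal share. In the approach described in this paper the two callbacks would instead be learned policies.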

Methodology
In summary, analytically calculating cooperative game theory solution concepts is intractable for settings with more than 6 carriers (Cruijssen 2020). Instead, we can recover these cooperative solution concepts through non-cooperative, extensive form games such as coalitional bargaining (Serrano 2004). However, coalitional bargaining requires intelligent, rational agents and it is difficult to manually craft rule-based agents for collaborative routing due to its exponential and NP-hard nature. Instead, we propose to develop intelligent, rational agents by having agents learn through trial-and-error, learning to collaborate in the presence of multiple other self-interested, rational agents (i.e., multi-agent reinforcement learning). A holistic diagram depicting the whole pipeline can be found in Appendix E. The remainder of this section focuses on the reinforcement learning algorithm employed. Pseudo-code of the pipeline can be found in Appendix D.

Single Agent Reinforcement Learning
Reinforcement Learning (RL) is a subfield of machine learning. Here, the field studies an agent learning what actions to take for a given state in order to maximise a numerical reward. In supervised learning, the ground truth target labels are provided.
In RL, we are not told the "correct" actions to take that will maximise (expected) cumulative reward. Instead, the agent must learn through trial-and-error. This leads to an exploration-exploitation dilemma. Should the agent try new actions (explore) in the hope that there is a better sequence of actions that leads to an even higher expected reward? Or, should the agent stick with its current best-known actions (exploit), since the agent believes it is unlikely there will be a better sequence of actions with higher expected reward? (Sutton and Barto 2018). The agent selects actions according to its policy based on the current state. The action is sent to the environment, which calculates the reward and next state, which are then returned to the agent. Through the learning process, we aim to obtain a policy that maximises the expected cumulative reward.
In our setting of collaborative vehicle routing, the environment is the coalitional bargaining game as described in Section 3.3. Each carrier is represented as an individual agent. The state is the locations of depots and customers, as well as auxiliary features to describe the current state of the coalitional bargaining process (see Section 4.3 for further details). There are three actions that an agent can take, depending on whether it is proposing or responding. When proposing, the agent must decide (a) which other carriers the agent should propose to partner with, and (b) how much each carrier in the proposal should be paid. When responding, the agent must decide (c) whether it accepts or rejects the proposal. The reward is the collaboration gain the agent is allocated as a result of the coalitional bargaining process. Throughout the training process, we train our agents' policies (or neural networks) to maximise expected cumulative reward. See Section 4 for a formal definition of states, actions and rewards in our setting.
We can formalise the problem using Markov decision processes (MDPs) (Puterman 1994). Formally, a finite-horizon, discounted Markov decision process $M$ can be defined by the tuple $M = \langle S, A, T, R, \rho_0, \gamma \rangle$, where $S$ is the set of states, $A$ is the set of actions, $T : S \times A \to S$ is the transition probability distribution, $R : S \times A \times S \to \mathbb{R}$ is the reward function, $\rho_0 : S \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in [0, 1]$ is the discount factor.
An episode begins by first sampling an initial state $s_0$ from $\rho_0$. A trajectory $(s_0, a_0, s_1, a_1, \dots)$ is generated by sampling actions from the agent's policy $a_t \sim \pi(a_t \mid s_t)$. The next states are obtained by sampling the transition dynamics function $s_{t+1} \sim T(s_{t+1} \mid s_t, a_t)$ until reaching a terminal state. At each time-step, a reward $R_t \sim R(s_t, a_t, s_{t+1})$ is received. At time-step $t$, the discounted return, $G_t$, is defined as:
$$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$$
where $T$ is the maximum time-step and $\gamma \in [0, 1]$ is the discount factor. As $\gamma$ approaches 1, the agent will take into account rewards received far into the future. However, as $\gamma$ approaches 0, the agent will only account for the immediate reward $R_{t+1}$, and the agent is often said to be myopic.
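The effect of the discount factor can be illustrated numerically. The sketch below assumes the standard definition of the discounted return, $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$, and computes it for every time-step with a single backward pass; the function name `discounted_returns` is illustrative.

```python
def discounted_returns(rewards, gamma):
    """rewards[k] holds R_{k+1}; returns the list [G_0, G_1, ..., G_{T-1}]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        returns[k] = running
    return returns


sparse = [0.0, 0.0, 1.0]                      # reward only at the final step
patient = discounted_returns(sparse, 1.0)     # [1.0, 1.0, 1.0]
halved = discounted_returns(sparse, 0.5)      # [0.25, 0.5, 1.0]
myopic = discounted_returns(sparse, 0.0)      # [0.0, 0.0, 1.0]
```

With $\gamma = 0$ only the immediate reward contributes, which is the myopic case described above, while $\gamma = 1$ credits the final reward equally to every earlier time-step.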
The state-value function of a state $s$ under a policy $\pi$ is denoted by $V^{\pi}(s)$. This is the expected return when the agent starts in $s$ and continues following its policy $\pi$. Formally:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right]$$
A similar notion is the action-value function, which is denoted by $Q^{\pi}(s, a)$. This is the expected return when the agent starts from $s$, but also takes the action $a$, and follows its policy $\pi$ afterwards. Formally:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s, a_t = a \right]$$

Multi-agent reinforcement learning
A stochastic game generalises MDPs to involve multiple agents. This can be defined as a tuple $\langle N, S, A, T, R, \gamma \rangle$ where:
• $N$ denotes the set of $n$ agents.
• $S$ denotes the set of states, including the initial state $s_0$.
• $A = \times \{A_i : i \in \{1, \dots, n\}\}$ denotes the set of joint actions, where $A_i$ is player $i$'s set of actions and $\times$ denotes the Cartesian product.
For every time-step $t$, an agent $i \in N$ receives an observation of the global state $s$ and outputs an action $a_{i,t}$ sampled from its policy $\pi_i(a_{i,t} \mid s_t)$. We update the state $s_t$ to include agent $i$'s action before sending this new state to agent $j \in N$, $j \neq i$. Note that the time-step is not yet incremented. We continue this process until all agents in $N$ have submitted their actions to the environment. This yields the joint action $a = (a_1, \dots, a_n)$. We calculate the reward $R_{i,t} \sim R(s_t, a, s_{t+1}, i)$. We consider the sparse reward setting, i.e., all rewards are zero until the episode terminates. Upon termination, we calculate the reward for agent $i$ depending on whether agent $i$ successfully joined a coalition or not. When a proposal $(C, x^C)$ is agreed upon at time $t$, every agent in $C$ receives a reward of $\gamma^{t-1} x_i^C v(C)$. Otherwise, if the agent is not in a coalition $C$, it is assumed to receive a reward of zero. The return $G_i$ is discounted by a factor $\gamma \in [0, 1]$, given by $G_i = \sum_{t=1}^{T} \gamma^{t-1} r_{i,t}$. Agent $i$'s objective is to find a policy $\pi_{\theta_i}$ which maximises its expected discounted sum of rewards, $\mathbb{E}[G_i]$. It is important to note that this maximisation assumes all opponents' policies $\pi_{\theta_j}$, $\forall j \neq i$, to be fixed. Thus, one of the key challenges in MARL is the non-stationarity present due to multiple concurrently learning agents.
In our setting, we assume perfect information and thus agents have full access to the global state. We make this assumption as the aim of our paper is to provide the theoretical grounding between collaborative vehicle routing, coalitional bargaining, and multi-agent reinforcement learning. The imperfect information setting is also a promising research direction, e.g., to investigate the value of information. Future work could study the applicability of decentralised partially observable Markov decision processes (dec-POMDPs) (Oliehoek and Amato 2016) to imperfect information settings in collaborative vehicle routing.
A challenge in reinforcement learning is handling the curses (plural) of dimensionality (Powell 2022). With "tabular" methods, the policy is represented by a lookup table. One curse is that the size of the state space grows exponentially with the number of dimensions (even if the state space is discrete). In our setting, the state space is continuous, thus further exacerbating the challenge. As a result, we must resort to function approximation methods (Sutton et al. 2000). Here, we aim to replace the lookup table with a parameterised model, with parameters $\theta \in \mathbb{R}^d$, to map from states to actions. Thus, we can write the policy for agent $i$ as $\pi_{\theta_i}(a_{i,t} \mid s_t)$ instead. Respectively, the state-value function and action-value function can also be re-written as $V(s, \theta) \approx V^{\pi}(s)$ and $Q(s, a, \theta) \approx Q^{\pi}(s, a)$. Importantly, the dimensionality $d$ of the model is typically much less than the number of states. Changing one parameter will affect the estimated value of many other states. Thus, if we can generalise across states, this could greatly accelerate learning. Note that any parameterised model can be used: a linear function, multi-layer perceptron, decision trees etc. Historically, linear functions were favoured due to favourable convergence guarantees. However, deep neural networks have demonstrated significant success due to their high capacity and generalisability (Sutton and Barto 2018; Vinyals et al. 2019; Mnih et al. 2015; OpenAI et al. 2019). Thus, we also opt for deep neural networks.
Policy gradient-based approaches are a common way to learn a parameterised policy $\pi_{\theta_i}$ which maximises an agent's expected discounted return. They are also performant; for example, they achieved great success in playing Dota 2 (OpenAI et al. 2019), amongst others. Typically, a scalar performance measure $J(\theta)$ is defined and we maximise it using approximate gradient ascent: $\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$, where $\widehat{\nabla J(\theta_t)} \in \mathbb{R}^d$ is a stochastic estimate whose expectation approximates the gradient of $J(\theta_t)$ with respect to $\theta_t$. However, a challenge is that the performance depends on both the policy's action selection and the distribution of states in which these actions are selected. Varying $\theta$ affects both of these distributions, and we typically do not know the effect of our policy on the state distribution. The policy gradient theorem (Sutton et al. 2000; Sutton and Barto 2018) shows that we can approximate the gradient of performance with respect to $\theta$ without requiring the derivative of the state distribution. Formally:
$$\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} Q^\pi(s, a) \, \nabla_\theta \pi_\theta(a \mid s),$$
where $\mu$ is the on-policy state distribution under $\pi_\theta$.
The simplest approach is the REINFORCE algorithm (Williams 1992). Here, an agent plays $M$ episodes in parallel until termination and remembers all states, actions and rewards it encountered (the trajectory). Next, it estimates the (undiscounted) policy gradient using:
$$\hat{g} = \frac{1}{M} \sum_{m=1}^{M} \sum_{t=0}^{T} \hat{A}_t \, \nabla_\theta \log \pi_\theta(a^m_t \mid s^m_t),$$
where, for REINFORCE, $\hat{A}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r(s^m_{t'}, a^m_{t'})$. The agent updates its policy using stochastic gradient ascent, i.e., $\theta \leftarrow \theta + \alpha \hat{g}$, where $\alpha$ is the learning rate. The intuition for this policy update is that, for each action the agent took in a given state, it will increase or decrease the (log) probability of taking that same action in proportion to the discounted return it received during that episode. However, policy gradient methods are notorious for having high variance in the policy gradient. As a result, we employ multiple variance reduction techniques to mitigate this problem, such as using $M$ parallel environments.
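As an illustration of the REINFORCE weighting term $\hat{A}_t$ described above, the following minimal sketch computes the discounted return from every time-step of a single episode. The function name `discounted_returns` and the example reward sequence are assumptions made for illustration and are not taken from our implementation.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = sum_{t' >= t} gamma^(t' - t) * r_{t'} for each step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical sparse-reward episode: only the final step pays out.
example_returns = discounted_returns([0.0, 0.0, 0.3], gamma=0.95)
print(example_returns)
```

With a sparse terminal reward, earlier time-steps simply receive the terminal reward scaled by the appropriate power of $\gamma$.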
Another variance reduction technique is to subtract a baseline. A baseline $b(s)$ can be any function that may or may not depend on the state $s$. Importantly, it must not vary with the action $a$. We can replace REINFORCE's estimate of $\hat{A}_t$ by using $\hat{A}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r(s^m_{t'}, a^m_{t'}) - b(s^m_t)$. It can be shown that introducing a baseline does not introduce bias into the policy gradient, but may significantly reduce variance (Williams 1992; Greensmith et al. 2004; Sutton and Barto 2018). An example baseline is the average return an agent received. The difference between the discounted return and the baseline can be thought of as how much better than the baseline an agent performed as a result of choosing its action. A common choice of $b(s)$ is an estimate of the state-value, $b(s) \approx V^\pi(s)$. Selecting a good baseline is crucial. We discuss our proposed baseline functions in Section 4.7.
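For completeness, the standard argument for why an action-independent baseline leaves the policy gradient unbiased can be sketched as follows, assuming a discrete action space (the continuous case replaces the sum with an integral):

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]
  &= b(s) \sum_{a} \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \\
  &= b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s) \\
  &= b(s)\, \nabla_\theta 1 = 0 .
\end{aligned}
```

The step from the first to the second line relies on $b(s)$ not depending on $a$, which is why the baseline must not vary with the action.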
In REINFORCE, typically only one gradient update is used per batch of trajectories. As a result, REINFORCE is typically said to be sample inefficient: it requires a lot of episodes to train a performant policy. In addition, REINFORCE can be unstable during training, and sometimes performance collapse may occur as a result of the data distribution changing too drastically.
Proximal Policy Optimisation (PPO) (Schulman et al. 2017) aims to improve the sample efficiency by performing multiple gradient updates to maximise the use of each gathered data point. However, this risks changing the data distribution too drastically and thus risks performance collapse. To rectify this, the intuition behind PPO is to constrain the policy from deviating too greatly. Let the current policy (before any gradient updates) be denoted $\pi_{\theta_\text{old}}(a_t \mid s_t)$. One round of gradient updates yields a new policy, denoted $\pi_\theta(a_t \mid s_t)$. PPO constrains the probability ratio, $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$, of taking action $a_t$ in the same state $s_t$ under the new policy versus the old policy to deviate from 1 by no more than a certain fraction $\varepsilon$. This should prevent the risk of policy collapse if $\varepsilon$ is chosen carefully. Moreover, PPO is then able to perform more gradient updates on the same data points, thus greatly improving its sample efficiency. In addition, it is also more stable during training and is less sensitive to the chosen hyperparameters. As a result, PPO has been applied to a wide range of domains, most notably in OpenAI Five (bots to play Dota 2) (OpenAI et al. 2019) and also in ChatGPT (OpenAI 2022).
PPO adjusts the neural network parameters $\theta$ to increase or decrease the probability ratio $r_t(\theta)$ in proportion to the advantage $\hat{A}_t$ the agent received. PPO enforces the $\varepsilon$ threshold by clipping the probability ratio $r_t(\theta)$ to remain within $1 \pm \varepsilon$. We can further encourage exploration by adding an entropy bonus. Thus, the PPO policy gradient can be estimated as follows:
$$\hat{g} = \nabla_\theta \, \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \; \operatorname{clip}\left(r_t(\theta), 1 - \varepsilon, 1 + \varepsilon\right) \hat{A}_t \right) + \beta H\left(\pi_\theta(\cdot \mid s_t)\right) \right],$$
where $\hat{A}_t$ is the advantage estimate, $\beta$ is the entropy regularisation coefficient and $H$ is the entropy bonus. An entropy bonus encourages agents to explore rather than exploit. It is important to note that when the advantage is positive, we clip $r_t(\theta)$ only if it is greater than $1 + \varepsilon$. If the advantage is negative, we clip $r_t(\theta)$ only if it is less than $1 - \varepsilon$ (see Figure 1 of Schulman et al. (2017) for further details). The clip function clips its first argument to the lower and upper bounds given by its second and third arguments respectively.
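The one-sided nature of the clipping described above can be checked with a small sketch of the per-sample clipped surrogate term (without the entropy bonus). The helper names `clip` and `ppo_clipped_term` and the example ratios are illustrative assumptions; the default of 0.05 mirrors the value of $\varepsilon$ used in our experiments.

```python
def clip(value: float, lower: float, upper: float) -> float:
    """Clip `value` to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


def ppo_clipped_term(ratio: float, advantage: float, epsilon: float = 0.05) -> float:
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return min(unclipped, clipped)


# Hypothetical (ratio, advantage) pairs covering the four cases in the text.
for ratio, advantage in [(1.2, 1.0), (0.8, 1.0), (0.8, -1.0), (1.2, -1.0)]:
    print(ratio, advantage, ppo_clipped_term(ratio, advantage))
```

Taking the minimum means the clipped value is only used when it lowers the objective, so a ratio below $1 - \varepsilon$ with a positive advantage (or above $1 + \varepsilon$ with a negative advantage) is left unclipped.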

State space
The agents receive a variety of inputs from the environment, as seen in Figure 4. Let the state at time $t$ be denoted by $s_t \in S$, which can be represented by the tuple $\langle D, c, x, r, t, p, a \rangle$. The deliveries matrix $D \in \mathbb{R}^{12 \times 4}$ describes the features of each of the three depots and nine customers, yielding twelve rows, where we refer to each row as a location. A location can be represented by the tuple $\langle x, y, o, d \rangle$, where $x \in \mathbb{R}$ is the x-coordinate; $y \in \mathbb{R}$ is the y-coordinate; $o \in \mathbb{N}$ denotes the agent who owns the location; and $d \in \{0, 1\}$ denotes whether the location is a depot or a customer. For instance, to represent Agent 2's depot located at $\langle x = 0.2, y = 0.173 \rangle$, its corresponding row in $D$ would be $\langle 0.2, 0.173, 2, 1 \rangle$, and the remaining rows in $D$ would comprise similar entries for the remaining depots and customers, yielding a shape of $12 \times 4$. The vector $c \in \{0, 1\}^{|N|}$ denotes which agents were selected to be in the proposed coalition. The vector $x \in \mathbb{R}^{|N|}$ denotes the proposed pay-off vector, and the vector $r \in \{0, 1\}^{|N|}$ denotes the responses of the agents. The vectors $c$, $x$ and $r$ are initialised to zero if no agent has taken an action in the current round of bargaining. The scalar $t \in \mathbb{N}_0$ denotes the current round of bargaining, $p \in \mathbb{N}$ denotes which agent was selected to propose in the current round of bargaining, and $a \in \{0, 1\}$ denotes whether the current agent is proposing or responding.
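To make the state tuple concrete, the sketch below assembles an initial state for three agents. The class name `BargainingState`, the helper names, the row ordering (depots first), the customer coordinates and the encoding `is_proposing = 1` for "proposing" are all assumptions made for illustration; only the depot coordinates and the row layout $\langle x, y, o, d \rangle$ come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

Location = Tuple[float, float, int, int]  # (x, y, owner, is_depot)

DEPOTS = [(-0.2, 0.173), (0.2, 0.173), (0.0, -0.173)]


@dataclass
class BargainingState:
    deliveries: List[Location]  # D: 12 rows x 4 features
    coalition: List[int]        # c: proposed coalition membership
    payoff: List[float]         # x: proposed pay-off vector
    responses: List[int]        # r: accept/reject responses
    round_index: int            # t: current bargaining round
    proposer: int               # p: agent selected to propose
    is_proposing: int           # a: 1 if proposing, 0 if responding (assumed encoding)


def build_deliveries(customers: List[Tuple[float, float, int]]) -> List[Location]:
    """Stack depot rows (d = 1) followed by customer rows (d = 0)."""
    rows = [(x, y, owner, 1) for owner, (x, y) in enumerate(DEPOTS, start=1)]
    rows += [(x, y, owner, 0) for (x, y, owner) in customers]
    return rows


def initial_state(customers: List[Tuple[float, float, int]], proposer: int) -> BargainingState:
    """c, x and r start at zero because no agent has acted yet."""
    n_agents = len(DEPOTS)
    return BargainingState(build_deliveries(customers), [0] * n_agents,
                           [0.0] * n_agents, [0] * n_agents, 0, proposer, 1)


# Hypothetical customers: three per agent, placed near the owner's depot.
example_customers = [(dx + 0.05 * k, dy - 0.05 * k, owner)
                     for owner, (dx, dy) in enumerate(DEPOTS, start=1)
                     for k in range(1, 4)]
state = initial_state(example_customers, proposer=1)
print(len(state.deliveries), state.deliveries[1])
```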

Action space
The agents have three action heads: coalitions, proposals and response.
The coalitions action is denoted by $c \in \{0, 1\}^{|N|}$ and indicates which agents are selected to be in the proposed coalition. The proposals action is the pay-off vector $x$. This vector denotes how much of the collaboration gain is assigned to each respective agent (as a percentage). Note that in game theory, a pay-off vector is feasible if $\sum_{i \in C} x^C_i \le v(C)$. However, agents never know the value of $v(C)$ a priori (although they can implicitly reason about it). Thus, to practically implement our neural network, we output a vector that is interpreted as percentages as opposed to absolute values. These percentages are then multiplied by the value of a coalition $v(C)$ to obtain a feasible pay-off vector.
Note that this is a continuous action space, as opposed to the other actions, which are discrete. To parameterise the proposals action head, we use the Dirichlet distribution, which is a multivariate generalisation of the Beta distribution. The neural network outputs three logits $\alpha$, which are used as the concentration parameters of the Dirichlet distribution $\text{Dir}(\alpha)$. The Dirichlet distribution has support over the probability simplex $\{x \in \mathbb{R}^3 : x_i \ge 0, \sum_{i=1}^{3} x_i = 1\}$. Intuitively, agents will propose an equal gain share with high probability if the inputs to the Dirichlet are large and equal. Agents will make proposals approximately uniformly at random within the probability simplex if the inputs to the Dirichlet are small and equal, but greater than 1. If Agent 1 wanted to collaborate with Agent 2 but not Agent 3, the input to the Dirichlet could be $\langle 10000, 10000, 1.001 \rangle$. This would result in approximately a 50/50 split between Agents 1 and 2 with high probability.
The Dirichlet distribution is appealing for two reasons. Firstly, the proposals vector must sum to 1, which matches the form of the Dirichlet distribution. Secondly, the Dirichlet distribution has finite support. In continuous action spaces, a Gaussian distribution is typically used, which has infinite support and can lead to bias (Chou et al. 2017). Chou et al. (2017) overcome this issue by using a Beta distribution instead, as it has finite support, and find that their agents learn more efficiently.
To calculate the proposals, the state inputs are passed through a variety of dense layers (see Figure 4) to produce an embedding.A linear layer with 3 output neurons is applied to the embedding.As in Chou et al. (2017) we add 1.001 to the output logits to ensure the resultant Dirichlet distribution remains unimodal.As a result, during evaluation the agents can fully exploit by proposing the mode of the distribution, instead of having to sample from the Dirichlet which may involve exploration.The output logits are then masked by the coalitions vector, i.e. if a player i is not in the coalition S, its corresponding output logit will be 1.001.Finally, to calculate the pay-off vector, we sample from the Dirichlet distribution with the masked output logits.
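The masking and mode-based evaluation described above can be sketched as follows. We assume here that the raw network outputs are passed through a softplus before the 1.001 offset is added, so that every concentration parameter exceeds 1; this non-negativity step, the helper names and the example inputs are assumptions for illustration rather than a description of the exact implementation.

```python
import math
import random
from typing import List

OFFSET = 1.001  # keeps every concentration parameter above 1 (unimodal)


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z)); assumed non-negativity transform."""
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)


def concentration(raw_outputs: List[float], coalition: List[int]) -> List[float]:
    """Masked Dirichlet parameters: agents outside the coalition get 1.001."""
    return [OFFSET + softplus(z) if member else OFFSET
            for z, member in zip(raw_outputs, coalition)]


def dirichlet_mode(alpha: List[float]) -> List[float]:
    """Mode of Dir(alpha), valid when every alpha_i > 1."""
    denominator = sum(alpha) - len(alpha)
    return [(a - 1.0) / denominator for a in alpha]


def dirichlet_sample(alpha: List[float], rng: random.Random) -> List[float]:
    """Sample from Dir(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


# The example from the text: Agent 1 partners with Agent 2 but not Agent 3.
alpha = [10000.0, 10000.0, 1.001]
print(dirichlet_mode(alpha))
print(dirichlet_sample(alpha, random.Random(0)))
```

During evaluation the mode would be proposed directly, whereas during training a sample is drawn, which retains some exploration.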
The response action r ∈ {0, 1} denotes whether an agent accepts or rejects a given proposal.It takes the resultant embedding followed by a single linear layer with one output neuron.The output is then fed through a Bernoulli distribution.
Whilst we have chosen to use Bernoulli and Dirichlet distributions to parameterise the three action spaces, it may be beneficial to experiment with more expressive probability distributions or e.g.output actions auto-regressively.This may speed up learning and would be an interesting line of future research.

Reward function
Our reward function is sparse, i.e., at timestep t the agents will always receive an immediate reward R t of zero until the coalitional bargaining game terminates.Upon termination, we calculate a reward for each agent.
If agent $i$ successfully joins a coalition $C$ by having all agents in $C$ accept the proposal, then it receives a reward of $r_{i,t} = v(C) \cdot x_i$, where $v(C)$ is the collaboration gain obtained by coalition $C$, and $x_i$ is the $i$th element of the pay-off vector $x$. For clarity, if agent $i$ is the proposer and has its proposal rejected by the responder agents, it will receive an immediate reward of zero. However, there is potential for agent $i$ to obtain more than zero immediate reward in future rounds of bargaining, and thus the discounted return can still be greater than zero.
Else, if agent i does not successfully join a coalition C by the end of the episode, then it will receive a terminal reward of zero.
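A minimal sketch of this sparse terminal reward is given below. The function name `terminal_rewards`, the boolean `accepted` flag (true when every member of the proposed coalition accepts) and the example numbers are illustrative assumptions.

```python
from typing import List


def terminal_rewards(coalition: List[int], shares: List[float],
                     accepted: bool, coalition_value: float) -> List[float]:
    """Reward v(C) * x_i for members of an accepted coalition, else zero."""
    if not accepted:
        return [0.0] * len(coalition)
    return [coalition_value * share if member else 0.0
            for member, share in zip(coalition, shares)]


# Hypothetical outcome: Agents 1 and 2 split a collaboration gain of 0.4.
print(terminal_rewards([1, 1, 0], [0.5, 0.5, 0.0], True, 0.4))
print(terminal_rewards([1, 1, 0], [0.5, 0.5, 0.0], False, 0.4))
```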

Transfer learning
A key challenge with policy gradient approaches is their sample inefficiency, even in single-agent settings. This is further exacerbated by the non-stationary learning dynamics imposed by having multiple agents learn concurrently. In typical RL settings, agents learn "tabula rasa", i.e., without any prior knowledge. Whilst this is mathematically elegant, learning tasks tabula rasa for problems with high complexity, such as in real-world, multi-agent settings, is rare (Agarwal et al. 2022). Instead, it may be preferable to pre-train on some offline dataset in order to learn a good feature extractor. For example, Silver et al. (2016) and Vinyals et al. (2019) pre-train their networks on human gameplay data in a supervised learning setting before using RL. This idea of transfer learning, or more recently, reincarnating RL (Agarwal et al. 2022), is well accepted in the RL literature, and the reader is referred to Taylor and Stone (2009) and Agarwal et al. (2022) for a thorough review. Furthermore, transfer learning is well accepted in supervised learning, especially in the computer vision and natural language processing domains, leading to the likes of ChatGPT (OpenAI 2022). In our case, the pre-training process aids in efficiently initialising the agents' policies and facilitates faster convergence in the MARL framework.
We therefore pre-train our agents to learn a good feature extractor in a supervised learning fashion.We hypothesise that a good feature extractor should be able to predict whether a given coalition for a given state is productive or not.As a result, we create a dataset of one million instances and randomly select a feasible coalition per instance and calculate the social welfare obtained.Next, we train a neural network to predict the social welfare for a given state and coalition.We optimise the neural network to minimise the mean squared error.We split the dataset using an 80/20 train/test split.The neural network design can be seen in Figure 5.We experimented with different neural network architectures but found this architecture performed best.Whilst this is not the exact task agents must perform in the collaborative routing scenario, the intuition is that the neural network should still learn useful patterns which are transferable to the full collaborative routing problem.

Policy gradient baselines
As discussed in Section 4, a useful baseline helps reduce the variance in policy gradient methods.We use two types of baselines: one for the response action (when agents are responding); and one shared for both the coalitions and proposals actions (when agents are proposing).The neural network architectures for the baselines can be found in Figure 6.
The response action is discrete and thus we can easily implement Counterfactual Multi-Agent Policy Gradients (COMA) (Foerster et al. 2017). They use the following baseline:
$$A_i(s, a) = Q(s, a) - \sum_{a'_i} \pi_i(a'_i \mid s) \, Q\left(s, (a_{-i}, a'_i)\right),$$
where $a_{-i}$ denotes the actions of all agents other than $i$. In the discrete setting, it is easy to sum over all other actions agent $i$ could have taken. However, with continuous actions using Dirichlet distributions in the proposals action, this can be difficult. Therefore, we instead estimate the state-value, i.e. the expected discounted return conditioned on the state $s$. We denote this baseline by $\hat{V}^\pi(s, i; w)$, where $w$ denotes the parameters of a function approximator such as a neural network. Thus, our baseline for both the coalitions action and proposals action is given by:
$$A_i(s_t, a_t) = \sum_{t'=t}^{T} \gamma^{t'-t} r_{i,t'} - \hat{V}^\pi(s_t, i; w).$$
Finally, we normalise $A_i(s, a)$ by subtracting the mean and dividing by the standard deviation, due to the small magnitude of the rewards.
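For the binary response action, the counterfactual sum has only two terms. The sketch below computes a COMA-style advantage for one agent and then normalises a batch of advantages. The function names, the critic values and the small constant added to the standard deviation are illustrative assumptions.

```python
import math
from typing import List


def coma_advantage(q_values: List[float], policy_probs: List[float],
                   action_taken: int) -> float:
    """Q of the taken action minus the policy-weighted average of Q.

    q_values[k] is the critic's estimate when agent i plays response k
    (0 = reject, 1 = accept) and all other agents' actions are held fixed.
    """
    baseline = sum(p * q for p, q in zip(policy_probs, q_values))
    return q_values[action_taken] - baseline


def normalise(advantages: List[float], eps: float = 1e-8) -> List[float]:
    """Subtract the batch mean and divide by the batch standard deviation."""
    mean = sum(advantages) / len(advantages)
    variance = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    std = math.sqrt(variance)
    return [(a - mean) / (std + eps) for a in advantages]


# Hypothetical critic outputs: accepting is worth 0.2, rejecting 0.0.
print(coma_advantage([0.0, 0.2], [0.25, 0.75], action_taken=1))
print(normalise([0.01, 0.02, 0.03]))
```

Because the baseline marginalises only over agent $i$'s own action, the resulting advantage isolates that agent's contribution to the outcome.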

Time limits
It is crucial to deal with time limits properly in this setting. The full coalitional bargaining game presented in Okada (1996) is infinite horizon, i.e., negotiation could go on indefinitely. Clearly, this is impossible to simulate on a finite computer and we must set a maximum number of rounds. Nevertheless, it is still possible to optimise for the infinite horizon, but care must be taken, as shown in Pardo et al. (2018). They argue that if an episode terminates only due to reaching the maximum number of rounds, we should bootstrap the discounted estimated value of the next state, $\hat{v}_\pi(s')$. Thus, if agents reach agreement within the maximum number of rounds, they should receive a reward $r$ as expected. However, if they exceed the maximum number of rounds, they should receive a reward of $r + \gamma \hat{v}_\pi(s')$. In our setting, with the maximum number of rounds equal to 10, if agents do not reach agreement within 10 rounds, we fictitiously step them into the next state $s'$ at round 11, with proposers selected uniformly at random. If player $i$ is not selected as a proposer, then the selected proposer is asked to propose a coalition and pay-off vector $(S, x)$ in this fictitious round. We then use a critic to estimate the value of this state, $\hat{v}_\pi(s')$.
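The distinction between genuine termination and termination caused by the round limit can be sketched as a single target computation. The function name `final_step_target` and the example values are assumptions; `next_state_value` stands in for the critic's estimate $\hat{v}_\pi(s')$ of the fictitious next state.

```python
def final_step_target(reward: float, hit_round_limit: bool,
                      gamma: float, next_state_value: float) -> float:
    """Return r on agreement, or r + gamma * v(s') if only the limit was hit."""
    if hit_round_limit:
        # The game would have continued, so bootstrap from the critic.
        return reward + gamma * next_state_value
    return reward


# Hypothetical values: no agreement after the final round, critic says 0.2.
print(final_step_target(0.0, True, 0.95, 0.2))
# Agreement reached in time: the reward is used as-is.
print(final_step_target(0.3, False, 0.95, 0.2))
```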

Skill retention
Okada (1996) shows that agents should reach agreement without delay. Therefore, as agents learn to collaborate better, they will reach agreement sooner, which is beneficial due to the environment's discount factor. However, this may lead to agents forgetting how to play the game at later time-steps. To enable retention of skills at later time-steps, we employ a targeted training design. During training, instead of starting all bargaining games at round 1, we start them at a round selected uniformly at random between round 1 and the last round of bargaining, $T - 1$. Therefore, agents will always be exposed to a range of bargaining scenarios, even if agents are collaborating optimally.
Each depot has three distinct service radii, which are selected uniformly at random. Customers may be located uniformly at random within any of the corresponding depot's service radii.

Problem setting
We base our problem setting on a modified version of the setting in Gansterer and Hartl (2018a). We consider an environment with three companies, each represented by an agent. Each agent has one depot and three customers that it must deliver to. The depot $(x, y)$ locations for the three agents are held fixed at $\{(-0.2, 0.173), (0.2, 0.173), (0, -0.173)\}$ respectively. The depots' service radius for each instance is selected uniformly at random from the set $\{0.3, 0.4, 0.6\}$. The rationale given by Gansterer and Hartl (2018a) is that varying the depots' service radius varies the degree of overlap and thus competition (or collaboration opportunity) between carriers. A high degree of overlap, using a radius of 0.6, creates high collaboration opportunity between carriers. A low degree of overlap, using a radius of 0.3, creates low collaboration opportunity between carriers. The small radius of 0.3 can analogously be seen as the scenario in which depots do not lie in close proximity to each other. The customers' locations are then generated uniformly at random within the depot's service radius.
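An instance generator consistent with this description might look as follows. We assume a single service radius per instance and draw customers uniformly over the disc around their owner's depot (using the square-root transform on the radial coordinate); the function names and the seeding are illustrative assumptions.

```python
import math
import random
from typing import List, Tuple

DEPOTS = [(-0.2, 0.173), (0.2, 0.173), (0.0, -0.173)]
RADII = (0.3, 0.4, 0.6)


def sample_in_disc(centre: Tuple[float, float], radius: float,
                   rng: random.Random) -> Tuple[float, float]:
    """Uniform point in a disc: sqrt on the radius avoids clustering at the centre."""
    r = radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return (centre[0] + r * math.cos(theta), centre[1] + r * math.sin(theta))


def generate_instance(rng: random.Random, customers_per_agent: int = 3
                      ) -> Tuple[float, List[Tuple[float, float, int]]]:
    """Return the sampled radius and a list of (x, y, owner) customers."""
    radius = rng.choice(RADII)
    customers = []
    for owner, depot in enumerate(DEPOTS, start=1):
        for _ in range(customers_per_agent):
            x, y = sample_in_disc(depot, radius, rng)
            customers.append((x, y, owner))
    return radius, customers


radius, customers = generate_instance(random.Random(42))
print(radius, len(customers))
```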
To calculate the pre-collaboration and post-collaboration gains, the shortest paths are calculated exactly using Gurobi (Gurobi Optimization, LLC 2021).The pre-collaboration shortest paths can be calculated by solving three (un)Capacitated Vehicle Routing Problems (one for each agent).The post-collaboration shortest paths are calculated by solving a single multi-depot vehicle routing problem.Problem formulations for the capacitated VRP and multi-depot VRP can be found in Appendices A and B respectively.Capacity is effectively removed by setting the capacity of each vehicle to an arbitrarily large number and the weight of each delivery to 1.
Whilst this problem setting is rather simplistic, this is important as it allows us to evaluate our agents rigorously.To calculate optimal solutions (for evaluation purposes only), we must brute force the characteristic function.This is expensive and only possible for small, simple VRPs and 3 agents.
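To illustrate why brute-forcing the characteristic function is only feasible for a handful of agents, the sketch below enumerates every non-empty coalition, each of which would require its own routing problem to be solved. The function name `non_empty_coalitions` is an assumption made for illustration.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence


def non_empty_coalitions(agents: Sequence[int]) -> List[FrozenSet[int]]:
    """All 2^n - 1 non-empty subsets of the agents."""
    return [frozenset(subset)
            for size in range(1, len(agents) + 1)
            for subset in combinations(agents, size)]


for n in (3, 10, 20):
    print(n, len(non_empty_coalitions(range(1, n + 1))))
```

With three agents there are only seven coalitions to evaluate, but the count doubles with every additional agent.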

Experimental design
We perform 10 independent runs with different random seeds to train our agents. Agents are trained for 10,000 epochs and evaluated every 100 epochs. Agents are evaluated on instances they have never seen during training. We train using a batch size of 2048 and evaluate with a batch size of 2048. All agents use a discount factor $\gamma$ of 0.95. All agents' observations are normalised with a running estimate of the mean and standard deviation. The maximum number of bargaining rounds $T$ is set to 10. The learning rate is held constant at $3 \times 10^{-4}$ and we use Adam optimisation. We clip the global norm of gradient updates if it exceeds 1. We use $\varepsilon = 0.05$ to clip the probability ratios in PPO, as this appears to help stability (Yu et al. 2021). All code to generate results is run on the Wilkes 3 high performance computing cluster with AMD EPYC™ 7763 64-Core Processors and NVIDIA A100 GPUs. Note that we only use a supercomputer to perform runs in parallel. Training takes approximately 8 hours per run.

Correlation with the Shapley value
The objective of our work is to find a partition of the N carriers with an associated fair pay-off vector.We emphasise that certain cooperative solution concepts (e.g.Shapley values) can be retrieved as the outcome of non-cooperative, extensive form games (e.g.coalitional bargaining as in our work).The Shapley value is the most common gain sharing mechanism used in the collaborative vehicle routing setting (Guajardo and Rönnqvist 2016) as it is widely accepted in game theory to be fair -each agent gets paid proportional to their marginal contribution.In addition, it is also guaranteed to be unique.We believe that both of these arguments would help transportation planners to reach agreements better, in line with (Krajewska et al. 2008).Thus, we compare the outcomes that our MARL agents agree to with the Shapley value for each instance by measuring the correlation, mean absolute error, and mean squared error.
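For reference, the Shapley value used as the comparison target can be computed by averaging each agent's marginal contribution over all orderings of the agents. The sketch below does this for a hypothetical three-agent characteristic function in which Agent 3 adds nothing to any coalition; the function name and the numbers are illustrative assumptions and are not results from our experiments.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Sequence


def shapley_values(players: Sequence[int],
                   v: Dict[FrozenSet[int], float]) -> Dict[int, float]:
    """Average marginal contribution of each player over all join orders."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[int] = frozenset()
        for p in order:
            joined = coalition | {p}
            totals[p] += v[joined] - v[coalition]
            coalition = joined
    return {p: totals[p] / len(orderings) for p in players}


# Hypothetical collaboration gains: only Agents 1 and 2 create value together.
v = {
    frozenset(): 0.0,
    frozenset({1}): 0.0, frozenset({2}): 0.0, frozenset({3}): 0.0,
    frozenset({1, 2}): 0.6, frozenset({1, 3}): 0.0, frozenset({2, 3}): 0.0,
    frozenset({1, 2, 3}): 0.6,
}
print(shapley_values([1, 2, 3], v))
```

Note that this computation needs the value of every coalition, which is exactly the characteristic function that our agents are not given access to.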

Baseline bots
We compare our MARL agents against two rule-based bots as baselines. The heuristic bot always proposes the grand coalition with equal gain share and always accepts every proposal. The random bot proposes coalitions and gain shares, and responds to proposals, all uniformly at random. These two bots help us to understand that (a) our MARL agents are learning interesting, complex behaviours, and (b) our experimental setup is not too easy in design, in that simple, intuitive policies are not sufficient for this setting.

Accuracy
A simple evaluation metric is to measure how often the agents propose the correct coalition.For player i, the correct coalition C * i is defined to be the coalition C which would maximise player i's reward.This involves brute forcing the characteristic function to evaluate the value of each possible coalition which is only possible since we consider 3 agents.We emphasise that brute force is only required to evaluate our agents -brute force is not required to train the agents.The reward R is the collaboration gain from agreeing to coalition C, v(C), multiplied by the ith element of the pay-off vector, x i .

Optimality gap
We denote the absolute and relative optimality gap of player $i$ by $\phi_i$ and $\eta_i$ respectively. The absolute optimality gap $\phi_i$ is defined in terms of the correct coalition $C^*_i$, player $i$'s proposed coalition $C_i$, and the characteristic function $v(\cdot)$ (i.e. the collaboration gain of a given coalition); the relative optimality gap $\eta_i$ is the corresponding relative measure. Since the data is randomly generated, there could be scenarios where there is no value in collaborating, i.e. even the value of the grand coalition is 0, $v(N) = 0$. We exclude these scenarios when calculating the above evaluation metrics; however, this only occurs 1.9% of the time when brute-forcing 51,200 instances.
Okada (1996) analyses this coalitional bargaining game in a non-collaborative routing setting, and proves that agents will cooperate by sharing gains equally in our setting. Therefore, in addition to the above metrics, we check that the agents' behaviour agrees with that predicted by Okada (1996). Firstly, we check that agents do converge to an equal gain share. Secondly, all agents should reach agreement in the first time-step in the three-player setting.

Results
We perform ten independent runs comparing our RL bot to the heuristic bot.Ten independent runs in (MA)RL is commonly accepted following the work of Henderson et al. (2019).We also compare to a random bot which simply proposes coalition structures and pay-off vectors as well as responds all uniformly at random.
From Figures 8a and 8b, we conclude that our agents have learnt close to optimal behaviour. Our agents reach an average accuracy of 77% and an average optimality gap of 0.01 (or 3.9%). Moreover, we can see from Figure 9a that Agent 1 learns to share gains equally, as expected by game theory (Okada 1996). Whilst we only show the plot for Agent 1, similar plots can be made for Agents 2 and 3 but are omitted due to space constraints. Interestingly, three 'phases' of learning are identified, as shown in Figure 9b. In Phase 1 (approximately the first 300 epochs), proposers act extremely myopically and propose that they receive the majority of the gain (up to 90%). Occasionally, the responders will accept these sub-optimal proposals and thus the proposer could receive a high reward. However, the responders learn to reject more proposals so that they can potentially counter-offer in the next round. This leads to more rounds of bargaining. After about 300 epochs, both proposers and responders reach agreement quickly; however, the gains are not equally shared. In Phase 2, responders realise they can do better by rejecting proposals and potentially making counter-proposals. This drives the proposers to propose more equal gain shares. Finally, in Phase 3, we can see that agents have learnt to maximally cooperate with equal gain share and reach agreement within the first time-step, as expected by Okada (1996).
Figure 8: Shaded regions denote ± two standard deviations. After training for 10,000 epochs, our RL agents reach an average accuracy of 77% with an average optimality gap of 3.9%.

Correlation with Shapley Values
In Figure 10, we see that the outcomes from our bargaining procedure correlate well with the calculated Shapley values.The three agents receive an R 2 score of 0.76, mean squared error of 0.08, and mean absolute error of 0.01 (averaged across all three agents).
In addition, it is promising that when agent 1 is excluded from the coalition (denoted by orange cross markers), this is usually when Agent 1 has low marginal contribution (as seen by the orange kernel density estimate plot at the top of the x-axis).As a result, we conclude that our agents learn to agree to fair outcomes.This is important from a managerial perspective as fairness could be crucial to help incentivise carriers to participate in collaborative vehicle routing (Guajardo and Rönnqvist 2016).

Ablations
We further perform two ablations to strengthen the confidence in our findings. Each ablation is carried out with 10 random seeds. The first ablation changes the maximum number of bargaining rounds from 10 to 30. This ablation is carried out since the underlying coalitional bargaining game is infinite-horizon, yet we must set a maximum number of bargaining rounds. Our ablation shows that increasing the maximum number of time-steps does not significantly change the quality of our agents' solutions. The agents still agree to share gains equally, with an average optimality gap of 4.1% (up from 3.9%) and identifying the correct coalitions 76% of the time (down from 77%). Therefore, we conclude that a maximum of 10 time-steps is sufficient. This is expected as we deal with time limits properly, as discussed in Section 4.8. The second ablation changes the agents' discount factor from 0.95 to 1.0. This ablation is carried out as we use a discount factor to reduce variance in the return, and we test whether it is possible to use a higher discount factor. We find that using a discount factor of 1.0 decreases performance, which we suspect to be due to the increased variance. With a discount factor of 1.0, agents achieve an average optimality gap of 6.07% (up from 3.9%). The agents still converge on an equal gain share but identify the correct coalitions only 68% of the time (down from 77%). We conclude that using a discount factor of 0.95 is sufficient to achieve a set of strong agents.
Figure 9: In (a), after 10,000 epochs, Agent 1 converges on an approximately equal gain share. In (b), after 10,000 epochs, agents reach agreement after an average of 1.03 rounds of bargaining. Both of these results agree with Okada (1996).

Discussion
Our RL agents are able to reach agreement in 512 parallel instances within an average of 3.0s (or 0.006s per instance). We note that the prior literature, such as Krajewska et al. (2008), assumes full access to the characteristic function. Using these prior methods to solve 512 instances takes 24.3s (or 0.047s per instance). Thus, our RL agents achieve an 88% reduction in computational time when compared with prior methods to calculate the Shapley value, such as in Krajewska et al. (2008). Whilst 0.047s per instance may seem reasonable even with traditional methods, we stress that this is due to the simplistic VRP setting we consider: prior methods will not scale with the number of agents, nor with problem complexity via additional constraints such as time-windows. Importantly, our agents agree to outcomes that correlate well with Shapley values and thus we conclude that our method produces fair outcomes. This is important to fairly compensate carriers and thereby enable wide-spread industrial adoption of collaborative vehicle routing. Our agents also reach agreement in a decentralised and self-interested manner, which overcomes the limitations of central orchestration methods mentioned in Section 2.
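The reported run-time figures follow directly from the two batch timings:

```latex
\begin{aligned}
\text{per-instance time (RL agents)} &= \frac{3.0\,\text{s}}{512} \approx 0.006\,\text{s}, \\
\text{per-instance time (prior methods)} &= \frac{24.3\,\text{s}}{512} \approx 0.047\,\text{s}, \\
\text{relative reduction} &= \frac{24.3 - 3.0}{24.3} \approx 0.877 \approx 88\%.
\end{aligned}
```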
Furthermore, our MARL agents are able to outperform the two baseline bots in both accuracy and optimality. The heuristic bot and random bot have an accuracy of 62% and 25% respectively, and an optimality gap of 8% and 32% respectively. The relatively low performance of both the heuristic bot and random bot suggests that the experimental setup is sufficiently challenging (due to the NP-hard nature of vehicle routing problems), and that simple policies are not performant in this setting. The heuristic bot shows that, 38% of the time, it is not desirable to form the grand coalition as some agents may contribute very little. The random bot's high optimality gap shows that, whilst there is symmetry in our problem and depots are equidistant, the choice of partners is still important. This necessitates more intelligent agents and thus more complex methods such as MARL. More importantly, we conclude that our MARL agents have learnt interesting behaviours, such as excluding opponents if they contribute little to the coalition, as seen in Figure 10.
In this work, we make the assumption that each carrier possesses only one truck.We further assume that the same truck driver is assigned to the same truck.This is a reasonable assumption as the road freight industry is highly fragmented: for example, in the UK, there are 60,000 registered carriers (Office for National Statistics 2022) in 2022, and 1 million registered carriers in the EU in 2020 (Eurostat 2020).However, if a single carrier possesses multiple trucks and thus multiple drivers, it would be possible to decompose the problem at different levels of granularity.One could consider coalitions of carriers; coalitions of trucks; or even coalitions of truck drivers.Our framework should be applicable to deal with all three types of modelling choices, but clearly the more granular the modelling choice, the more computational power that will be required.
The benefit of studying collaborative routing in a coalitional bargaining game is that game theory describes optimal, rational behaviour in this setting. As a result, we have a measure of the gap to optimality. This is important because of the challenging nature of 3-player, mixed-motive settings for MARL; thus, we can understand if the agents are learning correctly. However, there are three main limitations of this approach. Firstly, collaborative vehicle routing is most fruitful with a large number of participating carriers (Cruijssen et al. 2007; Los et al. 2022). Future work must investigate scaling our MARL approach to a larger number of carriers. We believe this to be possible in a hybrid centralised-decentralised manner. The advantage of our decentralised MARL approach is that it enables us to provide a large volume of high-quality solutions to optimise the central agent. Secondly, future work should investigate the performance of MARL-based approaches on real-world data distributions with real-world constraints. One direction would be to study the effect of data imbalance (such as the locations of depots and customers, as well as the delivery volumes) on the performance of MARL-based methods. Another direction could be to study the effect of partial observability of other carriers' information; we currently consider the perfect information scenario where all delivery information is publicly shared (though, crucially, the characteristic function is still unknown). It would be interesting in future work to explore imperfect information settings, such as the value of information sharing. This could be tackled using decentralised partially observable Markov decision processes (dec-POMDPs) (Oliehoek and Amato 2016). Thirdly, our approach currently only incentivises carriers. An independent third-party logistics provider may be required to enable collaborative routing. How should we incentivise third-party logistics providers? How should we incentivise shippers? What role could government play to incentivise collaboration? Moving in these directions with MARL would result in using more complex and flexible games; however, optimal, rational behaviour would be unknown. Nevertheless, MARL may still be applied to these complex games but in a descriptive manner (Shoham et al. 2007), i.e. to analyse the emergent behaviour of agents assuming a given MARL algorithm. We believe this to be an exciting line of future research.

Conclusions and Managerial Implications
Over the last two decades, collaborative vehicle routing has promised cost savings of between 4% and 46%. Yet industrial adoption remains limited. A key remaining barrier is the design of a gain sharing mechanism that is fair and scalable, such that carriers are incentivised to collaborate. Orchestration of truck sharing is usually proposed via a central optimiser, where an intermediary would receive information from each carrier and allocate trucks to each route. The benefits of subscribing to an intermediary do not necessarily outweigh the costs, and carriers typically do not obtain any benefits from sharing their trucks. In this paper, we propose an automated, decentralised approach, where software agents representing carriers find optimal routes through a coalitional bargaining game, and any gain obtained via improved truck utilisation is shared between the carriers. Manual orchestration costs are also avoided as the approach is automated.
To facilitate decentralised optimisation and fair gain sharing, we utilised deep multi-agent reinforcement learning. The main challenge of our setting is the inability of extant methods to fully evaluate the characteristic function due to its high computational complexity. The characteristic function calculates the collaboration gain for every possible coalition, which requires solving an exponential number of NP-hard VRPs. The autonomous agents designed in this work instead reason over a high-dimensional graph input to implicitly reason about the characteristic function. This eliminates the need to evaluate the expensive post-collaboration vehicle routing problem an exponential number of times and increases the practicability of our approach, as we only need to evaluate this problem once. Furthermore, applying MARL to mixed-motive games is highly non-trivial, and applying out-of-the-box MARL algorithms to this problem does not work. We show that we are able to achieve strong performance through careful design decisions, such as transfer learning, a targeted training design and COMA, and provide intuition for why these approaches help.
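To make the cost of the characteristic function concrete, the following minimal sketch enumerates every non-empty coalition and computes Shapley values by brute force. It is an illustration only and not the method proposed in this paper: the helper names (`all_coalitions`, `shapley_values`) and the gains in `GAINS` are hypothetical, and in the real setting each evaluation of the value of a coalition would require solving a vehicle routing problem rather than a dictionary look-up.

```python
from itertools import combinations, permutations
from math import factorial


def all_coalitions(agents):
    """Yield every non-empty coalition of `agents` (2**n - 1 in total)."""
    for size in range(1, len(agents) + 1):
        for members in combinations(agents, size):
            yield frozenset(members)


def shapley_values(agents, v):
    """Average each agent's marginal contribution over all join orders."""
    totals = {agent: 0.0 for agent in agents}
    for order in permutations(agents):
        members = frozenset()
        for agent in order:
            joined = members | {agent}
            totals[agent] += v(joined) - v(members)
            members = joined
    n_orders = factorial(len(agents))
    return {agent: total / n_orders for agent, total in totals.items()}


# Hypothetical collaboration gains for 3 carriers (illustrative values only,
# not taken from the paper's experiments).
GAINS = {
    frozenset(): 0.0,
    frozenset({1}): 0.0,
    frozenset({2}): 0.0,
    frozenset({3}): 0.0,
    frozenset({1, 2}): 0.3,
    frozenset({1, 3}): 0.3,
    frozenset({2, 3}): 0.6,
    frozenset({1, 2, 3}): 0.9,
}

agents = [1, 2, 3]
values = shapley_values(agents, lambda coalition: GAINS[coalition])
n_coalitions = sum(1 for _ in all_coalitions(agents))
print(values, n_coalitions)
```

With 3 agents there are only 7 non-empty coalitions, but the count doubles with every additional agent (1023 for 10 agents), which is why avoiding explicit evaluation of the characteristic function matters.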
Moreover, the multi-agent reinforcement learning approach designed in our work is applicable to any coalitional bargaining game. Thus, our work may be suitable for problems in the broader collaborative logistics literature, such as warehouse sharing. Another important point is that collaboration is not centrally orchestrated but facilitated using decentralised decision making. This marks an important step towards real-world adoption, which might encourage transportation planners to consider more profitable and fair collaboration scenarios. Whilst we initially envisage this system operating as a decision support system, as transportation planners gain trust in the agents' decisions, we ultimately envisage this system operating fully autonomously. This would enable even faster decision making that is traceable and consistent, potentially enabling a more responsive supply chain (Brintrup et al. 2009). We urge transport planners and software system providers to consider potential adoption scenarios and integration into information systems.
Our work has limitations which provide avenues for future research. The current focus of this work is to obtain strong autonomous agents that maximally cooperate in the challenging mixed-motive setting of collaborative vehicle routing. Whilst we have achieved this, we have restricted ourselves to a setting with 3 carriers, as the aim of our work was to provide the theoretical link between collaborative vehicle routing, coalitional bargaining, and deep multi-agent reinforcement learning. Future work should investigate the scalability of a MARL approach to a larger number of agents. Furthermore, CVR problems typically include various additional considerations such as axle weights, goods compatibility, and packing orders, which have not yet been incorporated into the framework proposed here. Our approach is agnostic to the underlying optimisation design; as such, we do not envisage the incorporation of additional problem features hindering its function.

Appendix A. Capacitated vehicle routing problem
In our paper, the pre-collaboration social welfare can be calculated by first solving three independent Capacitated Vehicle Routing Problems, where we assume an arbitrarily high capacity for each vehicle.
The capacitated vehicle routing problem (CVRP) and its variants have been studied for over 60 years (Toth and Vigo 2014). Here we show the three-index (vehicle-flow) formulation.
The CVRP considers the setting where goods are distributed to $n$ customers. The goods are initially located at the depot, denoted by nodes (or vertices) $o$ and $d$. Node $o$ refers to the starting point of a route, and node $d$ to the end point of a route. The customers are denoted by the set of nodes $N = \{1, 2, \ldots, n\}$. Each customer $i \in N$ has a demand $q_i \geq 0$. In our setting, we consider $q_i = 1$ for all customers. A fleet of $|K|$ vehicles $K = \{1, 2, \ldots, |K|\}$ is said to be homogeneous if all vehicles have the same capacity $Q > 0$. In our setting, we consider only one vehicle and set its capacity $Q$ to an arbitrarily high number to remove the capacity constraint. A vehicle must start at the depot, and can deliver to a set of customers $S \subseteq N$ before returning to the depot. A travel cost $c_{i,j}$, which we take to be the Euclidean distance, is associated with a vehicle travelling between nodes $i$ and $j$.
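As an illustration of the special case described above (one vehicle, unit demands and no binding capacity), the following minimal sketch finds the cheapest depot-to-depot route by exhaustive search and sums the result over three independent carriers. It is not the solver used in this paper: the function names and the coordinates in `CARRIERS` are hypothetical, and exhaustive search is only feasible for a handful of customers.

```python
from itertools import permutations
from math import dist


def route_cost(depot, customers_in_order):
    """Euclidean length of the route depot -> customers (in order) -> depot."""
    stops = [depot, *customers_in_order, depot]
    return sum(dist(a, b) for a, b in zip(stops, stops[1:]))


def solve_single_vehicle(depot, customers):
    """Return (cost, visiting order) of the cheapest route by brute force.

    With one vehicle and no binding capacity, every customer is served on a
    single route, so we only need to search over visiting orders.
    """
    best = min(permutations(customers), key=lambda order: route_cost(depot, order))
    return route_cost(depot, best), best


# Hypothetical instance: carrier -> (depot, customer locations).
CARRIERS = {
    1: ((0.0, 0.0), [(0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]),
    2: ((0.0, 0.0), [(3.0, 4.0)]),
    3: ((2.0, 2.0), []),
}

# Pre-collaboration total cost: each carrier solves its own problem.
pre_collaboration_cost = sum(
    solve_single_vehicle(depot, customers)[0]
    for depot, customers in CARRIERS.values()
)
print(pre_collaboration_cost)
```

The search visits every permutation of the customers, so its cost grows factorially; it is shown here only to make the problem definition concrete.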
This problem can be modelled as a complete directed graph $G = (V, A)$, where the vertex set is $V := N \cup \{o, d\}$ and the arc set is $A := (V \setminus \{d\}) \times (V \setminus \{o\})$. We define the in-arcs of $S$ as $\delta^{-}(S) = \{(i, j) \in A : i \notin S, j \in S\}$. The out-arcs of $S$ are $\delta^{+}(S) = \{(i, j) \in A : i \in S, j \notin S\}$. The binary decision variable $x_{ijk}$ denotes whether a vehicle $k \in K$ travels over the arc $(i, j) \in A$. The binary decision variable $y_{ik}$ denotes whether a vehicle $k \in K$ visits node $i \in V$. The variable $u_{ik}$ denotes the load in vehicle $k$ before visiting node $i$. We define the demand at the depot nodes $o$ and $d$ to be 0, i.e. $q_o = q_d = 0$. This yields:
\begin{align}
\min \quad & \sum_{k \in K} \sum_{(i,j) \in A} c_{i,j} \, x_{ijk} \tag{1a} \\
\text{subject to} \quad & \sum_{k \in K} y_{ik} = 1, \quad \forall i \in N, \tag{1b}
\end{align}

Figure 3: Flowchart of the n-player coalitional bargaining game (Okada 1996).

Our proposed approach is therefore to obtain a set of intelligent agents that can bargain with each other in a coalitional bargaining game. To achieve a suitable level of agent intelligence, we train our agents using deep multi-agent reinforcement learning.

Figure 7: A plot of the distribution of depot and customer locations. Depots are denoted by squares. Each depot has three distinct service radii which are selected uniformly at random. Customers may be located uniformly at random within any of the corresponding depot's service radii.

Figure 8: Learning curves of (a) average accuracy and (b) average optimality gap, averaged across all 3 agents for readability. Solid lines denote the mean across 10 independent runs. Shaded regions denote ± two standard deviations. After training for 10,000 epochs, our RL agents reach an average accuracy of 77% with an average optimality gap of 3.9%.

Figure 9: Learning curves of (a) Agent 1's average proposed pay-offs and (b) the average number of bargaining rounds, averaged across all 3 agents for readability. Solid lines denote the mean across 10 independent runs. Shaded regions denote ± two standard deviations. Dashed lines denote the proposed pay-off of an equal gain share agent. In (a), after 10,000 epochs, Agent 1 converges on an approximately equal gain share. In (b), after 10,000 epochs, agents reach agreement after an average of 1.03 rounds of bargaining. Both of these results agree with Okada (1996).

Figure 10: The empirical pay-off Agent 1 receives as a result of coalitional bargaining vs. the theoretical Shapley values for 2048 test instances. Green circle markers denote when Agent 1 was included in the coalition. Orange cross markers denote when Agent 1 was excluded from the coalition. The R² score is 0.76, with a mean squared error of 0.08 and a mean absolute error of 0.01.

\begin{align}
x = (x_k) &\in \{0, 1\}^{K \times A}, \tag{1h} \\
y = (y_k) &\in \{0, 1\}^{K \times V}. \tag{1i}
\end{align}
• The objective function (1a) minimises the Euclidean distance travelled by the vehicle.
• Constraint (1b) ensures the vehicle only visits each customer once.
• Constraint (1c) ensures that the sum of vehicles entering node d and exiting node d is −1. This ensures that a vehicle k performs a route starting at o and ending

Table 1: Characteristics of selected games studied in MARL.

Table 2: Notation table.

Symbol        Description
              Value of the coalition C, or the collaboration gain of the coalition C in the collaborative vehicle routing setting.
0             Distribution of the initial state, s_0
a             An action
t             Time-step index
G_t           Return following time t
T             Maximum time-step (or the horizon length)
π             Agent's policy
V_π(s)        State-value function of a state s following a policy π
V(s, θ)       Policy's (parameterised by θ) estimate of the state-value function given the state s
Q(s, a, θ)    Policy's (parameterised by θ) estimate of the action-value function given the state s and taking the action a
Q_π(s, a)     Action-value function of a state s taking the action a following a policy π

|N|, where |N| is the total number of agents, in this case, 3. Note that in game theory, agents typically propose a coalition of size |C| instead of |N|. However, it is beneficial to output coalitions in this manner as it keeps the output size constant. The coalitions action denotes whether the respective agent index is part of the coalition C. Note that this game assumes that player i is in the coalition c, i.e., c_i = 1. The deliveries matrix, D, is fed through two dense layers with 256 hidden neurons. These parameters come from a supervised pre-training step (see Section 4.6). The output is fed through a linear layer with |N| outputs. These outputs are passed into |N| independent Bernoulli distributions to determine the probability that a given agent is in the coalition C. A Bernoulli distribution is chosen as the number of outputs required scales linearly with the number of agents. Alternatively, this action can be output auto-regressively, but this would be more computationally expensive. It may also be useful to introduce correlation in the agents' actions via more expressive probability distributions, which may speed up learning. The proposals action is denoted by x ∈ R^|N| where i

Figure 4: Actors' neural network design. Grey boxes denote state inputs. Blue boxes denote MLP parameters which come from supervised pre-training (see Section 4.6). Note that the linear layer to produce coalition logits is learnt and not pre-trained. White boxes denote learnt parameters. Red boxes denote actions. Numbers in brackets denote the output shapes (ignoring batch size as it's shared by all).

Figure 5: Pre-trained neural network design. Grey boxes denote state inputs. White boxes denote learnt parameters. Red box denotes the output, which predicts the collaboration gain for this given state and coalition. Numbers in brackets denote the output shapes (ignoring batch size as it's shared by all).
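The coalition action described above, with |N| logits feeding |N| independent Bernoulli distributions, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the logits are passed in directly instead of being produced by the network, and forcing the proposer's entry to 1 (and excluding it from the log-probability) is one assumed way of encoding c_i = 1.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_coalition(logits, proposer, rng):
    """Sample c in {0, 1}^|N| from |N| independent Bernoulli distributions.

    Assumption: the proposer is always a member, so c[proposer] is set to 1.
    """
    coalition = [1 if rng.random() < sigmoid(z) else 0 for z in logits]
    coalition[proposer] = 1
    return coalition


def coalition_log_prob(logits, coalition, proposer):
    """Log-probability of a coalition under the independent Bernoullis.

    The proposer's entry is deterministic (assumed), so it is skipped.
    """
    total = 0.0
    for i, (z, c_i) in enumerate(zip(logits, coalition)):
        if i == proposer:
            continue
        p = sigmoid(z)
        total += math.log(p if c_i == 1 else 1.0 - p)
    return total


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]  # hypothetical network output for |N| = 3 agents
coalition = sample_coalition(logits, proposer=0, rng=rng)
print(coalition, coalition_log_prob(logits, coalition, proposer=0))
```

Because each agent's membership is sampled independently, the output size is |N| regardless of the coalition chosen, whereas a categorical distribution over all coalitions would need one output per coalition.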