Dynamic stochastic electric vehicle routing with safe reinforcement learning

Dynamic routing of electric commercial vehicles is a challenging problem since, besides the uncertainty of energy consumption, there are also random customer requests. This paper introduces the Dynamic Stochastic Electric Vehicle Routing Problem (DS-EVRP). A Safe Reinforcement Learning method is proposed for solving the problem. The objective is to minimize expected energy consumption in a safe way, which means also minimizing the risk of battery depletion while en route by planning charging whenever necessary. The key idea is to learn offline about the stochastic customer requests and energy consumption using Monte Carlo simulations, in order to plan the route predictively and safely online. The method is evaluated using simulations based on energy consumption data from a realistic traffic model for the city of Luxembourg and a high-fidelity vehicle model. The results indicate that it is possible to save energy while maintaining reliability by planning the routes and charging in an anticipative way. The proposed method has the potential to improve transport operations with electric commercial vehicles, capitalizing on their environmental benefits.


Introduction
With growing interest in electric commercial vehicles, driven in part by legislation pushing vehicle manufacturers to develop such vehicles, but also by public awareness and by more transport companies investing in green transportation, it is important to properly integrate them into current operations. These vehicles have several advantages compared with their internal combustion engine (ICE) counterparts and can contribute to reducing local pollution and noise (Jochem et al., 2016). But despite the latest technology developments, electric vehicles are still limited by their battery capacities. Their batteries are heavy, big and expensive, which usually translates into a limited driving range. Furthermore, energy consumption depends on several uncertain factors, such as driving behavior and traffic conditions. Therefore, safe route planning becomes essential to ensure that the vehicle will not run out of energy along the route.
When considering more dynamic transport operations with customer requests that can arrive while the vehicle is already driving, and taking into account the uncertainty of energy consumption, the routing problem becomes even more challenging. This has been a particularly relevant issue recently, with the on-demand economy having developed rapidly during the pandemic and putting extra pressure on transport operators, who now have to fulfill more requests faster, often with same-day delivery, while at the same time aiming to meet emission reduction goals. One common example of such a scenario is dynamic pick-up, where the vehicle is supposed to collect parcels during a day, but the requests arrive during that same day. In that case, every time a new request arrives the previously planned route is no longer valid. A new route needs to be planned in real-time taking into account the remaining battery capacity and the remaining customers to be visited, planning additional charging stops if necessary. Solving this problem requires predicting both future customer requests and energy consumption. With historical data it could be possible to predict dynamic customer requests, and energy consumption could be predicted using models available in the literature. These two predictive aspects then need to be integrated into a single model to decide the best route.
In this paper we propose an anticipative method for dynamically routing a single electric commercial vehicle with reliable charge planning, considering stochastic energy consumption and dynamic customer requests. The safety aspect is also taken into account in order to avoid battery depletion while the vehicle is driving.
The method provides a reliable way to adopt electric commercial vehicles in transport operations with high levels of dynamism and stochasticity. To the best of our knowledge there is no published paper dealing with this problem. The proposed method makes it possible to anticipate dynamic customer requests and predict energy consumption, thereby supporting the dispatcher and driver when planning routes and charging. Additionally, the method can be used to react in real-time to unexpected events such as accidents and congestion by re-routing or planning extra charging stops. As a result, it could be possible to operate vehicles with smaller batteries, making them cheaper, lighter and with higher payload. Consequently, it not only benefits the driver and the transport company, but also has the potential to contribute to achieving environmental goals such as the ones proposed by the European Union (EC, 2011).

Literature review
The Vehicle Routing Problem (VRP) aims at finding the optimal set of routes for a number of vehicles to visit a number of customers. The problem has been studied since the fifties in many variations and several solution methods have been developed. For an overview we refer to Toth and Vigo (2014). The objective has been traditionally to find the shortest routes in terms of distance or travel time, starting at a depot, visiting all customers and returning to the same depot.
During recent years there has been some effort to add environmental aspects into the problem, such as the Pollution Routing Problem (PRP). Minimization of emissions, of fuel consumption and of energy consumption of electric vehicles have been studied (e.g. Erdoğan and Miller-Hooks, 2012; Schneider et al., 2014; Felipe et al., 2014; Tahami et al., 2020). Recent reviews can be found in Lin et al. (2014), Demir et al. (2014), Pelletier et al. (2016), Bektaş et al. (2016) and Bektaş et al. (2019). Several different models for estimating energy/fuel/emissions have been investigated (Demir et al., 2011; Cuma and Koroglu, 2015), including some specific to electric vehicles such as Genikomsakis and Mitrentsis (2017), Qi et al. (2018) and Fiori and Marzano (2018). In Basso et al. (2019, 2021), an energy estimation model for electric vehicles is developed based on Bayesian machine learning. With that model it is possible to estimate link-level probabilistic energy consumption coefficients that can be used in routing problems. One of its advantages is that it estimates the full probability distribution of energy consumption, not only the expected value. The model was validated with simulations and its results will be used in this paper. Real-life traffic environments often present a series of uncertain factors, such as traffic conditions and driving behavior. Additionally, some transport operations may include a certain degree of dynamism, with new customer requests arriving while the vehicles are already driving. These two elements make routing problems even more challenging. In Laporte et al. (1992) the VRP with stochastic travel and service times is introduced for the first time, presenting a chance-constrained model and two recourse models. Most later problem formulations consider stochastic customer locations, stochastic demands or stochastic times (i.e. service and travel times).
For an overview we refer to Toth and Vigo (2014) and Gendreau et al. (1996). Only a few publications consider the case of stochastic energy consumption, such as Basso et al. (2021) and Pelletier et al. (2019). Dynamic routing has also been examined over time, with surveys provided by Ritzinger et al. (2015), Gendreau et al. (2016), Psaraftis et al. (2016) and Oyola et al. (2018). In Asamer et al. (2016) the authors present a sensitivity analysis of energy demand for electric vehicles and quantify the impact of parameter uncertainty. In Sweda et al. (2017) heuristics are presented to adaptively find the route for an electric vehicle with uncertain charging station availability and waiting time. In Jafari and Boyles (2017) the authors consider stochastic travel times and availability of charging stations to find the shortest path for an electric vehicle. In Liao (2017) an online method is presented to solve the routing problem with stochastic travel times and customer requests, targeting minimization of emissions. In Eshtehadi et al. (2017) robust optimization is used to solve the PRP with uncertain travel time and customer demand. In Masmoudi et al. (2018) the authors solve the Dial-a-Ride Problem for electric vehicles with battery-swap stations using Evolutionary Variable Neighborhood Search algorithms. In Tang et al. (2019) a method for scheduling electric buses is proposed taking into account stochastic traffic conditions. Reinforcement Learning (RL) (Sutton and Barto, 2018) and Approximate Dynamic Programming (ADP) (Powell, 2007) consist of a vast collection of techniques that can be used to solve many different kinds of problems. The key idea is that an agent, as it is called in the literature of the field, learns about the environment and is able to take successively better decisions. The agent can learn online or offline with simulations. 
The environment is usually modeled as a Markov Decision Process (MDP), since many algorithms use dynamic programming techniques by learning the cumulative reward or cost from a certain state until the end of the process. This is usually done in an approximate way, using Value Function Approximation methods, either learning post-decision states or state-action pairs. RL and ADP methods do not assume full knowledge of the model, so they can be used in more complex MDPs and even model-free. They have been successfully applied to dynamic and stochastic routing problems (Powell et al., 2012; Ulmer, 2017). In Secomandi (2000, 2001) a few different ADP algorithms are developed and evaluated for the VRP with stochastic demands. Those algorithms are further improved in Novoa and Storer (2009). In Adler and Mirchandani (2014) an ADP method is developed to solve online routing of electric vehicles and make battery reservations minimizing time. In Çimen and Soysal (2017) the authors propose a time-dependent PRP variant with stochastic speed and solve it using ADP. In Mao and Shen (2018) two variants of Q-Learning are used for solving routing problems in stochastic time-dependent road networks. In He et al. (2018) ADP is used for scheduling buses with stochastic trip times. In Ulmer et al. (2019) an offline-online ADP method is presented to solve the dynamic VRP with stochastic requests. Liu et al. (2020) integrate Dijkstra's algorithm with inverse RL to plan routes for food delivery. RL with neural networks is not very commonly used for VRPs. In Nazari et al. (2018) RL is used with neural networks to solve the capacitated VRP. In Shi et al. (2019) the authors introduce an RL method with neural networks to solve the ride-hailing problem for electric vehicles. As can be seen in the reviews by Ritzinger et al. (2015) and Psaraftis et al. (2016), there is a strong focus on single-vehicle problems when using ADP/RL methods and modeling the dynamic problem as an MDP. In Fan et al. (2006) and Goodson et al. (2013) multiple vehicles are considered using decomposition schemes to assign customers to vehicles beforehand. Both solve the dynamic VRP with stochastic demands but with known customer locations. In Shi et al. (2019) and Liang et al. (2021) the authors use neural networks as approximators for the state values of each vehicle while formulating the fleet dispatch as a linear assignment problem. In Jin et al. (2019) the authors present a ride-hailing problem with two layers represented by neural networks. In Turan et al. (2020) a Deep RL method is proposed to minimize the total cost of a fleet of electric autonomous vehicles for ride-hailing services, where the complete system state is approximated using neural networks. In Chen et al. (2021) the authors propose a Deep Q-Learning method to assign customers to vehicles and drones for same-day delivery.
One of the branches of the field is Safe Reinforcement Learning, which aims at minimizing cost or maximizing expected reward while satisfying safety constraints during learning or after deployment. Such methods are relevant for environments where failures can have a significant impact (e.g. accidents), and they are also useful when the environment can deviate from the training scenarios. An overview of the methods and a review of recent literature is provided in García and Fernández (2015). In the context of electric vehicles, battery depletion is a risk with potentially severe consequences, such as extra costs for towing, delays, congestion and even traffic accidents. Therefore a safe approach can be used to minimize that risk.
To the best of our knowledge there is no published paper solving electric vehicle routing with stochastic energy consumption, dynamic customer requests and charge planning. The use of Safe Reinforcement Learning as a solution method to plan the route predictively and ensure that the vehicle will not run out of energy while driving is also novel.

Contributions
The main contributions presented in this paper are:
• A model of the DS-EVRP as a Markov Decision Process;
• A Safe Reinforcement Learning solution method including:
  - A tailored Value Function Approximation;
  - A safe policy;
  - A training strategy;
• Realistic computational experiments to evaluate the proposed approach.
The Dynamic Stochastic Electric Vehicle Routing Problem is modeled as a Markov Decision Process (MDP). We propose a state representation, a set of valid actions and a state transition function. It considers stochastic customer requests and stochastic energy consumption. Like most related methods found in the literature, this paper focuses on a single vehicle.
The solution method for the problem is a Safe Reinforcement Learning approach based on Q-Learning, but with some modifications. It aims at learning the cumulative energy cost and risk of failure (i.e. battery depletion) from a certain state when taking a certain action and following the process until termination (e.g. returning to the depot). The first contribution of this article is a Value Function Approximation based on look-up tables indexed by a tailored reduced state representation to learn the value of state-action pairs. The second contribution is a chance-constrained policy with two layers of safety to ensure that the battery will not be depleted while en route. The third contribution is a training approach with a rollout heuristic to guide the learning to focus on the most promising actions directly when a state is first visited. Since to the best of our knowledge there are no solution methods for this problem in the literature, we present a deterministic online reoptimization method that is used as a benchmark in the experiments. Both the reoptimization method and the rollout for training the agent use a tabu-search heuristic that is shown to find routes very close to the optimal.
In order to evaluate the proposed method, a set of experiments is presented. A probabilistic energy consumption model is used based on Basso et al. (2021). In that paper a series of simulations is performed with a realistic traffic model for the city of Luxembourg (Codeca et al., 2015) and a high-fidelity vehicle model; machine learning is then used to estimate the link-level energy consumption probability distributions for the city center. In this paper the energy consumption is sampled from those probability distributions. Customer requests are sampled from probabilities defined for each test instance. The reinforcement learning method is compared with the reoptimization benchmark in terms of reliability and energy consumption. Different levels of dynamism are investigated by allowing late new requests and by knowing fewer customers at the start. Results are also shown for scenarios with stochastic customer demand (i.e. payload weight). Runtime and memory usage are not in focus, but an initial analysis is shown nonetheless. Finally, a discussion about the hyper-parameters and the algorithm is presented.
The structure of the paper is as follows: Section 2 presents the problem formulation and MDP model; Section 3 presents the solution method; Section 4 describes the experiments and results; Section 5 presents conclusions and proposes potential future work. Appendix A describes the test instances provided in the supplementary material.

Problem formulation
The goal of the Dynamic Stochastic Electric Vehicle Routing Problem (DS-EVRP) is to dynamically find a route for a single electric vehicle to service customer requests that can arrive randomly during a certain period (e.g. a full or half working day). The vehicle departs from and returns to the same depot. There are two types of customer requests, (i) deterministic requests that are known before the vehicle leaves the depot and (ii) stochastic requests that are received after leaving the depot with a known probability. All customer locations are known. It is assumed that all requests are accepted and can be served by the vehicle. The amount of energy needed to drive a certain stretch of road is random with a known probability distribution. Since the vehicle is electric, the battery capacity is assumed to be limited and therefore there might be a need to dynamically plan for charging along the route.
The DS-EVRP is modeled as a Markov Decision Process (MDP). The problem can be visualized on a complete, directed graph 𝒢 = (𝒩, 𝒫), with 𝒩 = {0} ∪ 𝒞 ∪ ℱ = {0, 1, 2, …, C+F} as the set of nodes and 𝒫 as the set of paths connecting each pair of nodes. The total number of customers is given by C and the total number of charging stations is given by F. The customer nodes are defined by 𝒞 = {1, …, C}, the charging stations are ℱ = {C+1, …, C+F} and the depot is represented by node 0.
Each customer i ∈ 𝒞 has a demanded cargo weight q_i (kg). The total probability of customer i sending a request is p_i and follows a binomial distribution with K trials, so the probability of a new request at epoch k ≤ K is given by ρ_i = 1 − (1 − p_i)^(1/K). The maximum payload (kg) for the vehicle is Q and the empty vehicle (curb) weight (kg) is m_c. The vehicle has a battery with total capacity B (Wh) and minimum accepted battery level b_min (e.g. b_min = 0.2B, minimum 20% of the total capacity).
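For illustration, the per-epoch request probability ρ_i can be computed from the total probability p_i with the K-th-root relation above. A minimal Python sketch (the function name is our own):

```python
def per_epoch_probability(p_total: float, K: int) -> float:
    """Per-epoch request probability rho such that the chance of at least
    one request over K independent epochs equals p_total:
    1 - (1 - rho)**K = p_total  =>  rho = 1 - (1 - p_total)**(1/K)."""
    return 1.0 - (1.0 - p_total) ** (1.0 / K)
```

For example, a customer with a 50% total probability over K = 10 epochs has a per-epoch probability of roughly 6.7%.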
The state of the system at epoch k is given by S_k = (n_k, b_k, w_k, ℛ_k, 𝒱_k), where:
• n_k is the vehicle node position, with n_k ∈ 𝒩;
• b_k is the battery level, with b_k ≤ B;
• w_k is the current payload, with w_k ≤ Q;
• ℛ_k is the set of active requests, with ℛ_k ⊆ 𝒞 ∖ 𝒱_k;
• 𝒱_k is the set of visited customers, with 𝒱_k ⊆ 𝒞.
A decision epoch k is delimited by the time when the vehicle leaves node n_k. In the initial state S_0 = (0, B, w_0, ℛ_D, ∅) the vehicle is at the depot, fully charged and with initial payload w_0 (e.g. zero for the pick-up case). The set of known (i.e. deterministic) requests before the vehicle departs is given by ℛ_D ⊆ 𝒞. It is assumed that the requested demands throughout the route will not exceed the total payload capacity Q. Furthermore, new requests will only be received until epoch K, and a visited customer cannot send a new request.
When the process is in state S_k at epoch k, the decision a_k to be taken is which node will be visited next. The set of feasible actions is:

𝒜(S_k) = { a ∈ 𝒩 : a ∈ ℛ_k ∪ ℱ if ℛ_k ≠ ∅;  a ∈ ℱ ∪ {0} if ℛ_k = ∅ }   (1)-(3)

Constraint (1) requires that a_k, the next node to be visited, is in the node set 𝒩. Constraint (2) requires a customer to be visited while there are still active requests. Constraint (3) makes the vehicle return to the depot when all requests have been served. Charging stops are allowed at any time.
When the process is in state S_k and decision a_k has been taken, a cost e_{n_k,a_k} is accumulated, which is the total energy consumed to drive from node n_k to node a_k. The energy consumption framework used in this paper is based on Basso et al. (2021). The probability distribution of the energy consumption to drive from node i to node j is known and given by:

E_ij ∼ N(μ_ij, σ²_ij)   (4)

where μ_ij is the expected energy and σ²_ij is the variance, following a Normal distribution. The total mass of the vehicle (curb weight + payload) is given by m = m_c + w_k. Coefficients θ_ij and φ_ij are calculated using the method described in Basso et al. (2021), where θ_ij = [ν_ij λ_ij] is an array with coefficients for the expected energy and φ_ij is an array with coefficients for the variance. These energy coefficients embed information about topography, speed, traffic and the vehicle. For simplification, the expected energy consumption can be re-written as:

μ_ij = λ_ij m + ν_ij   (5)

which is the same as Eq. (4). Therefore e_{n_k,a_k} denotes a sample from the probability distribution N(μ_{n_k a_k}, σ²_{n_k a_k}) at epoch k.
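Under this model, a single arc traversal can be simulated by sampling from the Normal distribution with a mass-dependent mean. The following Python sketch assumes the simplified linear-in-mass form of Eq. (5); function and argument names are illustrative:

```python
import random

def sample_energy(lam, nu, sigma2, curb_weight, payload, rng=random):
    """Sample the stochastic energy (Wh) to traverse one arc (i, j).
    The expected energy is linear in total vehicle mass: mu = lam * m + nu."""
    m = curb_weight + payload      # total vehicle mass m = m_c + w (kg)
    mu = lam * m + nu              # expected energy mu_ij (Wh)
    return rng.gauss(mu, sigma2 ** 0.5)
```

With zero variance the sample reduces to the expected value, which is the quantity used by the deterministic benchmark later in the paper.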
From epoch k to epoch k+1 there might be new customer requests, represented by the set 𝒩ℛ_{k+1}. Each customer i has a total probability p_i of requesting a visit during the allowed period (i.e. 0 < k ≤ K). The state transition from S_k to S_{k+1} is then given by:

n_{k+1} = a_k   (6)
b_{k+1} = b_k − e_{n_k,a_k}   (7)
w_{k+1} = w_k + q_{a_k}   (8)
ℛ_{k+1} = (ℛ_k ∖ {a_k}) ∪ 𝒩ℛ_{k+1}   (9)
𝒱_{k+1} = 𝒱_k ∪ ({a_k} ∩ 𝒞)   (10)

Eq. (6) assigns the new node location. Eq. (7) calculates the remaining battery level after arriving at node a_k. Eq. (8) increases the vehicle payload; if a_k is a charging station or the depot, then q_{a_k} = 0. Eq. (9) removes node a_k from the current requests, if a_k is a customer, and adds any new requests 𝒩ℛ_{k+1} that might have arrived while driving from n_k to a_k. Eq. (10) adds node a_k to the list of visited customers, if a_k is a customer. If node a_k is a charging station, the battery is fully charged before leaving (i.e. b_{k+1} = B).
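The transition equations above can be sketched as a small Python function. Names and container types are our own, and the sampling of new request arrivals is left to the caller:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    node: int            # current node n_k
    battery: float       # battery level b_k (Wh)
    payload: float       # current payload w_k (kg)
    requests: frozenset  # active requests R_k
    visited: frozenset   # visited customers V_k

def transition(s, action, energy, demand, new_requests, customers, capacity):
    """Apply the transitions above: move to `action`, subtract the sampled
    arc energy, and update payload, requests and visited customers.
    `demand[j]` is the cargo weight picked up at customer j."""
    battery = s.battery - energy
    payload, requests, visited = s.payload, s.requests, s.visited
    if action in customers:
        payload += demand[action]
        requests = requests - {action}
        visited = visited | {action}
    elif action != 0:    # charging station: battery full on leaving
        battery = capacity
    requests = requests | frozenset(new_requests)
    return State(action, battery, payload, requests, visited)
```

A route simulation then just alternates action selection, energy sampling and this transition until the terminal condition is met.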
The MDP is terminated when all customers have been visited and the depot is reached, or when the battery is completely depleted (i.e. b_{k+1} ≤ 0). An overview of the model is shown in Fig. 1, where the dashed lines represent stochastic transitions. As shown above, parts of the model are deterministic and parts are stochastic. Because the policy is deterministic, the actions are deterministic as well: once the next node is decided, it is assumed that the vehicle will drive there unless the battery is depleted on the way. The state transitions, on the other hand, are stochastic with unknown probability distribution, since new requests are random. The rewards are stochastic since the energy consumption is random, but its probability distribution is assumed to be known. Furthermore, it is important to notice that this MDP has an infinite state space, since the state representation contains continuous variables (i.e. battery level and payload). It is assumed that the current state is fully observable.
The objective of the problem is to minimize expected energy consumption and guarantee that the vehicle will be able to complete the route by planning charging when necessary. It is important to observe that these two objectives are somewhat conflicting. To increase safety it is usually necessary to add more charging stops, which in turn can increase energy consumption. Conversely, reducing the number of charging stops might reduce energy consumption but can potentially decrease safety. The objective function can be formulated as:

min E[ Σ_{k=0}^{K_T−1} e_{n_k,a_k} ]   (11)

subject to

ℛ_D ⊆ 𝒱_{K_T}   (12)
⋃_{k=0}^{K_T} ℛ_k ⊆ 𝒱_{K_T}   (13)
Pr( b_k < b_min ) ≤ α  ∀k   (14)

Eq. (11) minimizes the expected energy consumption for the complete route. Constraints (12) and (13) force the vehicle to visit all early requests in ℛ_D and all dynamic requests in ℛ_k from all epochs until the process is finished (i.e. k = K_T, the terminal epoch). The set of feasible actions given by 𝒜(S_k) also constrains the problem to visit all customers with active requests before returning to the depot. Since e_{n_k,a_k} is stochastic, the battery level b_k is also stochastic, and therefore the problem is chance-constrained. Eq. (14) implies that the risk of the battery being below the minimum b_min is less than or equal to α (e.g. 5%) for the complete route.
The next section presents methods to try to achieve this objective.

Solution method
In order to find solutions for the problem described above, we present two different approaches. The first one is a reoptimization method. The second is a Reinforcement Learning method.

Deterministic reoptimization
In this section a deterministic online reoptimization method is presented to solve the MDP shown above. This approach adapts a tabu-search heuristic for the static EVRP to find the routes dynamically, and will be used as a benchmark in the computational experiments. It uses only the deterministic information known from state S_k at decision epoch k to take a decision a_k, considering only the current requests ℛ_k to decide the route. To plan for charging and estimate the total energy consumption it uses the expected value of energy, as shown in Eq. (4).
At every epoch k, the best action a_k is decided by running a tabu-search heuristic considering the current requests. First a route is constructed using a greedy policy from node n_k to visit all customer requests in ℛ_k and return to the depot, choosing the nearest neighbors in terms of expected energy. Secondly, this initial route is improved using tabu-search with 2-opt moves. If the route is energy-infeasible (i.e. not enough battery capacity to drive all the way), then a charging station is inserted and the route is improved again. The best charging station is chosen. Since energy is stochastic, the method tries to keep the battery above the minimum limit in order to increase reliability. The heuristic is described in detail in Section 4.2. The solutions generated are shown to be very close to optimal.
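As an illustration of the construction and improvement steps (not the authors' full tabu-search implementation, which also maintains tabu lists and inserts charging stations), a minimal nearest-neighbour plus 2-opt sketch in Python, assuming symmetric expected-energy costs for simplicity:

```python
def greedy_route(start, requests, cost):
    """Nearest-neighbour construction: from `start`, repeatedly visit the
    pending request with the lowest expected energy, then return to depot 0."""
    route, current, pending = [start], start, set(requests)
    while pending:
        nxt = min(pending, key=lambda j: cost[current][j])
        route.append(nxt)
        pending.remove(nxt)
        current = nxt
    route.append(0)
    return route

def two_opt(route, cost):
    """2-opt improvement: reverse a segment whenever doing so lowers the
    total expected energy, until no improving move remains."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(route) - 2):
            for j in range(i + 1, len(route) - 1):
                delta = (cost[route[i - 1]][route[j]] + cost[route[i]][route[j + 1]]
                         - cost[route[i - 1]][route[i]] - cost[route[j]][route[j + 1]])
                if delta < -1e-9:
                    route[i:j + 1] = reversed(route[i:j + 1])
                    improved = True
    return route
```

On a directed graph with asymmetric energy costs, the segment reversal would also change the cost of every internal arc, so the delta computation would need to account for that.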

Reinforcement learning
This section introduces a reinforcement learning method to solve the MDP presented in Section 2. The approach is to try to anticipate future requests and predict energy consumption in order to make better decisions. As described, some aspects of the problem are known, some are unknown with known probabilities and some are unknown with unknown probabilities. Learning about all the unknowns can be difficult considering the curse of dimensionality (Powell, 2007). Furthermore, vehicle routing problems are well known for growing exponentially in size with an increasing number of nodes (i.e. customers and charging stations).
The proposed method is based on Q-learning (Sutton and Barto, 2018), an algorithm that learns the value of taking an action from a certain state. This value is usually expressed as a reward, and the most common approach is to maximize the reward from the current state until the process is terminated. There are similarities with Dynamic Programming, and with extensive training the policy converges to a near-optimal solution. However, one of the advantages of Q-learning is that a complete MDP model is not necessary and the algorithm can even run model-free. One of the variants of Q-learning keeps a table with the reward for each state-action pair, called the Q-value. It is an off-policy algorithm, which means that it learns about the optimal policy by following an exploration policy (e.g. ε-greedy, with some actions selected randomly). Fig. 2 shows an overview of how the agent interacts with the environment, with some aspects that will be discussed in the following subsections. For the problem discussed in this article, the MDP accumulates a cost (i.e. energy consumed) instead of a reward, which should therefore be minimized. But it is also important to keep safety constraints, not letting the vehicle run out of energy while driving. Thus a Safe Reinforcement Learning approach will be used, learning the expected cumulative energy cost and the risk from state S_k until the end of the route if action a_k is taken. These will be stored in two Q-tables, Q^E for expected energy and Q^R for the probability of failure (i.e. battery depletion). Following a certain policy π, they can be defined as:

Q^E_π(S_k, a_k) = E_π[ Σ_{t=k}^{K_T−1} e_{n_t,a_t} | S_k, a_k ]   (15)

Q^R_π(S_k, a_k) = Pr_π( b_t ≤ 0 for some t > k | S_k, a_k )   (16)

By selecting the next node to visit using Q^E and Q^R, it is possible to plan robust routes and keep safety constraints. The method will be presented in detail in the following sections, focusing on its three main components: a value function approximation, a safe policy and a training strategy.

Value function approximation
The MDP presented in Section 2 has an infinite number of states, since both the battery and the payload are continuous variables. Furthermore, the number of combinations of and grows very fast when increasing the number of customers. Therefore, in order to use Q-learning in an efficient way, the state needs to be discretized. With that it is possible to reduce storage space of the Q-tables and improve exploration. Hence, a value function approximation is presented, based on a reduced state representation.
To find a suitable reduced state representation Ŝ_k, it is necessary to select which variables from S_k will be included and to discretize the continuous ones. A number of alternatives were evaluated. Since knowing the battery level is crucial for taking decisions (e.g. whether to charge or not), b_k is discretized into b̂_k. The experiments will use 10 levels, from above 90% down to 0% (i.e. 9 to 0). Furthermore, the current requests are necessary in order to know which actions are valid. Thus the reduced state representation is given by Ŝ_k = (n_k, b̂_k, ℛ_k). The payload weight and the previously visited customers were removed since they are less important for the solution method to work properly.
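The battery discretization can be sketched as follows; the exact bucket boundaries are an assumption consistent with the 10-level scheme described above:

```python
def discretize_battery(b: float, capacity: float, levels: int = 10) -> int:
    """Map a continuous battery level to one of `levels` buckets.
    With the default 10 levels: above 90% -> 9, ..., (0%, 10%] -> 0,
    and a depleted battery also maps to 0."""
    if b <= 0:
        return 0
    frac = min(b / capacity, 1.0)
    return min(int(frac * levels), levels - 1)
```

For example, 95% of capacity maps to level 9 and 5% maps to level 0, so nearby battery levels share the same reduced state.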
As a result, any state S_k (infinite cardinality) in the MDP has a corresponding representation Ŝ_k (finite cardinality) that is used to index Q^E and Q^R. Therefore it is possible to get an approximation of the cumulative energy cost and risk from state S_k when action a_k is taken by reading Q^E(Ŝ_k, a_k) and Q^R(Ŝ_k, a_k) respectively. With that approximation the size of the Q^E and Q^R tables is significantly reduced and finite. Despite the total number of reduced states still being quite large, the number of state-action pairs actually visited is significantly lower, as many states are invalid or unreachable (e.g. less than 100% battery at the start). This will be discussed in Section 4. Another advantage of this approximation is that similar states (e.g. with small differences in battery level) are merged, improving exploration. The complete state representation is used during the simulations to keep track of the system state and calculate energy consumption.

Policy
As the method is based on Q-learning, the target policy depends on Q^E and Q^R. The objective of the problem is to minimize energy consumption for the complete route, but also to minimize the risk of running out of battery while on the road. As energy consumption has the same value over time, the problem does not include any discount factor. The value of Q^R(Ŝ_k, a_k) estimates the probability of failure (i.e. battery depletion) from state S_k until the end of the route if action a_k is taken. Therefore we propose a policy π* that is greedy with respect to Q^E and chance-constrained with respect to Q^R, with a maximum accepted risk α (e.g. 5%). Taking into account the feasible actions from state S_k, the actions that satisfy the risk criterion are:

𝒜_α(S_k) = { a ∈ 𝒜(S_k) : Q^R(Ŝ_k, a) ≤ α }   (17)

The optimal action from state S_k is then decided using the policy:

π*(S_k) = argmin_{a ∈ 𝒜_α(S_k)} Q^E(Ŝ_k, a)  if 𝒜_α(S_k) ≠ ∅;  argmin_{a ∈ 𝒜(S_k)} Q^R(Ŝ_k, a)  otherwise   (18)

If there are actions that satisfy the risk criterion, then the action with the lowest expected energy consumption is chosen from 𝒜_α(S_k). If no actions satisfy the risk criterion (i.e. Q^R(Ŝ_k, a) > α ∀a ∈ 𝒜(S_k)), then the action with the lowest expected risk is chosen from all feasible actions 𝒜(S_k).
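A minimal sketch of this chance-constrained action selection, with the learned Q-values for one state given as plain dictionaries (names are illustrative):

```python
def safe_greedy_action(feasible, q_energy, q_risk, alpha=0.05):
    """Policy pi*: among feasible actions whose learned failure risk is at
    most alpha, pick the one with the lowest expected cumulative energy;
    if none qualifies, fall back to the action with the lowest risk."""
    safe = [a for a in feasible if q_risk[a] <= alpha]
    if safe:
        return min(safe, key=lambda a: q_energy[a])
    return min(feasible, key=lambda a: q_risk[a])
```

Note that the cheapest action overall may be rejected purely because its risk estimate exceeds α.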
As this is a probabilistic approach, there is still a risk of up to α of the vehicle running out of battery. Although it is possible to reduce the accepted risk (e.g. α = 1%), some tours might still fail. That might not be acceptable for transport companies, since it can incur high expenses (e.g. towing the vehicle, customer compensations, reputation damage, risk of accidents). Therefore we propose a second layer of safety with policy π**.
This second layer of safety tries to keep it always feasible to reach a charging station from any node. It uses a two-step lookahead to predict energy cost. The first step evaluates the expected energy cost from the current node n_k to the chosen destination a_k, as shown in Eq. (19). The second step evaluates the cost from node a_k to its nearest charging station (i.e. in terms of energy cost), as shown in Eq. (20).

ē¹ = λ_{n_k a_k} (m_c + w_k) + ν_{n_k a_k}   (19)

ē² = min_{f ∈ ℱ} [ λ_{a_k f} (m_c + w_k + q_{a_k}) + ν_{a_k f} ]   (20)

If the sum of those two costs is larger than the current battery level b_k, then the policy overrides the previous decision and chooses to charge the battery at the charging station nearest to the current node:

π**(S_k) = argmin_{f ∈ ℱ} μ_{n_k f}  if ē¹ + ē² > b_k;  π*(S_k)  otherwise.

In Section 4 a discussion about the implications of adding this second layer of safety to the policy will be presented, in terms of increased safety and increased energy cost.

Training
One of the biggest challenges of combinatorial optimization problems is that the number of possible solutions grows very quickly with the number of elements. For VRPs it can be computationally hard to find the optimal solution for problems with many customers to visit, and even harder for problems which incorporate stochasticity. This becomes even more critical for the MDP presented in Section 2, since it has an infinite number of states due to the continuous values of battery and payload. As a consequence it is hard to train a Reinforcement Learning agent by visiting all feasible states many times. Even considering the reduced state representation, the total number of combinations grows very quickly with the number of customers. Therefore we propose a focused strategy to train the agent by finding good candidate actions directly the first time a state is visited.
The training method is offline off-policy learning using ε-greedy as the exploration policy and π** as the target policy. Every episode simulated is a complete route from the depot, servicing all requests and returning to the depot, with charging stops if necessary. The episode is also terminated if the battery is depleted. The method is shown in Algorithm 1.
The basic idea is to simulate a complete episode (Lines 3 to 25) and then update the cost and risk tables, Q and R respectively (Lines 27 to 43). The number of times a state-action pair is visited is stored in N(ŝ, a). The Q-table for expected energy is updated in Line 36 and the table for risk is updated in Line 37, based on the average over all visits to the state-action pair. To keep track of the risk, the updates are done backwards, from the end of the route to the beginning.
The next action is decided from Line 8 to Line 14. If a state ŝ has never been visited before, the action is decided using the same rollout heuristics as described in Sections 3.1 and 4.2. Otherwise an ε-greedy policy is followed, but the random actions are chosen only from the set of feasible actions A(ŝ).
Function step, in Line 15, computes the transition from state s_k to the next state s_{k+1}. In that function a realization of the energy consumption is sampled from the probability distribution given by Eqs. (4) and (5). Furthermore, new requests are sampled based on the request probability of each customer and added to the current requests in s_{k+1}. After the agent is trained offline, it can be used online to take decisions using policy π** and the learned tables Q and R. Since the method is based on look-up tables, very little computation is needed to take decisions online. Re-training may only be required if the set of customers or charging stations changes, or if the probabilities for energy and customer requests change significantly.
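The backward Monte Carlo update of the Q and R tables described above can be sketched as follows. This is a simplified illustration with dictionary-based tables and an episode given as (state, action, energy) triples; Algorithm 1 in the paper interleaves this with the simulation and uses the reduced state representation:

```python
def update_tables(episode, failed, Q, R, N):
    """After a simulated episode, walk the trajectory backwards,
    accumulating the energy cost-to-go and propagating whether the episode
    ended in battery depletion; update the running averages of the expected
    energy table Q and the risk table R for each visited state-action pair,
    with visit counts kept in N."""
    cost_to_go = 0.0
    fail = 1.0 if failed else 0.0
    for state, action, energy in reversed(episode):
        cost_to_go += energy
        key = (state, action)
        N[key] = N.get(key, 0) + 1
        n = N[key]
        q = Q.get(key, 0.0)
        Q[key] = q + (cost_to_go - q) / n   # running mean of cost-to-go
        r = R.get(key, 0.0)
        R[key] = r + (fail - r) / n         # running mean = failure frequency
```

Because the update runs backwards, every visited pair is credited with the full cost (and failure outcome) from that point to the end of the route, which is what R needs in order to estimate the probability of depletion until the end of the route.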

Experiments
In order to evaluate the solution method, several numerical experiments are presented. The main objective is to demonstrate the energy savings and reliability of the reinforcement learning method compared with the reoptimization approach. Computational cost, runtime and memory usage are not the focus, but a simple assessment is shown nevertheless. An analysis of the impact of different levels of dynamism on the results is presented, as well as a discussion about hyperparameter selection and the structure of the algorithm.
First the setup of the experiments is described. Secondly, an evaluation of the reliability of the reoptimization method is shown. Lastly, a detailed analysis of the reinforcement learning is presented.

Test environment
To evaluate the methods, 50 test instances were generated, half of them with 20 customers and half with 10 customers each. All customers are randomly located in the city center of Luxembourg with the same depot and two charging stations for all test instances, as shown in Fig. 3. The energy prediction model and the link-level energy coefficients are derived from the experiments in Basso et al. (2021).
The test instances were generated to reflect real scenarios for urban distribution of goods with trucks. The number of customers depends on the type of goods and the city as described in Holguín-Veras et al. (2013). Therefore, data from a company in Gothenburg was used to set those parameters based on Sanchez-Diaz et al. (2020). In those scenarios one vehicle typically delivers or picks-up between 10 and 20 packages, due to time and capacity constraints, of which around half can be dynamic.
To find the paths between all pairs of nodes, the Bellman-Ford algorithm was used in the same way as in Basso et al. (2021), producing the energy coefficients α and β for each path (i, j). During the simulations, both for training and evaluation, the energy is sampled from the probability distributions given by Eqs. (4) and (5). The test instances are provided in the supplementary material, which is described in Appendix A.
Every customer has a weight demand, but the total payload for each instance does not exceed the maximum payload of the truck. The experiments focus on the pick-up case, so the payload weight is increased at every customer visit. Each customer has a certain probability of requesting a visit during an episode. They can send a new request up to epoch k = 5 or k = 10, for the 10 and 20 customer instances respectively. About half the customer requests are known before the vehicle leaves the depot (i.e. at k = 1), so there is a fixed initial request set R₀ for each test instance, with |R₀| ≈ 5 for 10 customers and |R₀| ≈ 10 for 20 customers. The vehicle leaves the depot with the battery fully charged and recharges the battery completely when it stops at charging stations.
The vehicle model is an all-electric medium-duty truck with two gears, similar to current truck models on the market (e.g. Volvo FL Electric). Curb weight is 10,700 kg including the battery, and maximum payload is 16,000 kg. The battery capacity B and minimum battery level b_min are different for each experiment, therefore they are given in the descriptions below. For the value function approximation (i.e. the Q and R tables), the battery is discretized into ten levels, as described in Section 3.2.1, to produce the reduced state representation ŝ.
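The ten-level battery discretization for the reduced state representation could look like the sketch below; the exact binning used in the paper may differ:

```python
def battery_level(battery_wh, capacity_wh, n_levels=10):
    """Discretize the continuous battery state into n_levels bins for the
    reduced state representation (level 0 = near empty, n_levels-1 = full).
    The value is clamped so out-of-range readings still map to a valid bin."""
    frac = max(0.0, min(1.0, battery_wh / capacity_wh))
    return min(n_levels - 1, int(frac * n_levels))
```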
Each episode simulates a complete route for a single vehicle, starting at the depot with initial customer requests, receiving new random requests while driving, and finishing back at the depot after visiting all requests, unless the battery is depleted. The method to choose the route is either the reoptimization from Section 3.1 or the reinforcement learning from Section 3.2. The training for the reinforcement learning method is done by simulating 500,000 episodes, if not stated otherwise, with ε = 0.05 without decay. The maximum accepted risk is α = 0.1 (i.e. 10% chance of failure). A discussion about these hyperparameters is presented later.
In order to acquire statistically significant data, the evaluation is done by simulating 20 000 episodes, both for the reoptimization and reinforcement learning. For the reoptimization approach, this adds runtime requirements, since for each episode there might be up to 10 events with new customer requests (for the case of 20 customers), which means that the route needs to be re-planned up to 10 times per episode. As shown in Section 4.2, for 20 customer requests with charging, on average it takes 0.69 s to find a route with the heuristics and 1087 s with a MILP solver. It took more than 21 h to run the reoptimization experiments for the 25 test instances with 20 customers and charging, 51 min per instance on average as discussed in Section 4.4.2. Therefore, using a MILP solver instead of the heuristics in the reoptimization approach would make the solution time impractical. Furthermore, the solutions computed by the heuristics are very close to optimal, as shown below.

Rollout heuristics
The algorithm shown below, based on tabu-search, tries to find a route to service a set of requests, considering the already visited customers in the route and a starting node.
First it builds an initial route with function BuildRoute from the current node visiting all customers with active requests. That first route is constructed using a greedy algorithm (i.e. nearest neighbor) with respect to energy consumption.
Next it tries to improve the initial route with 2-opt moves in function ImproveRoute. The function routeCost calculates the total energy cost for the route and checks if there is any battery violation (i.e. battery below the minimum level b_min). If no improvement can be found, the algorithm tries again using a move generated by TSMove. That function uses the principle of tabu-search, generating a random 2-opt move that has not been performed before.
If the improved route violates the battery constraints, the algorithm tries to find the best charger. In function BetterRoute, the battery constraints are prioritized over energy: a route with no battery violations is always preferred.

Below we present the comparison of the heuristics with an optimal solution, considering the test instances deterministic (i.e. all customers are visited). The CPLEX 12.9 mixed integer linear programming solver was used to solve the problem to optimality. The code was developed using Matlab 2017b and run on a computer with a 3.1 GHz dual-core i5 processor and 16 GB RAM. Tables 1 and 2 show the results for 10 and 20 customers with battery B = 200,000 Wh. Tables 3 and 4 show the results for 10 and 20 customers with batteries B = 20,000 Wh and B = 30,000 Wh respectively, which means that at least one charging stop is needed. Columns Energy show the amount of energy required to complete the route in Wh. Columns Runtime show the total time needed to compute the solution with the different methods in seconds. Column Diff shows the optimality gap in percent for the heuristics. The solution time for CPLEX was limited to 30 min, therefore some instances have slightly better solutions by the heuristics.
As the results indicate, the gap from the heuristics to the optimal solution is small. At most it is 3.61%, for an instance with 20 customers and charging. On average, for the 10 customer instances the gap is close to zero, while for the 20 customer instances it is still quite small. The runtime difference, however, is significant. While the optimal solution reached the limit of half an hour for many instances, the longest time to solve with the heuristics was 1.1 s.

Deterministic reoptimization
In this section the goal is to evaluate how reliable the deterministic online reoptimization is. Since energy consumption is stochastic, but the method uses the expected energy to find the route, it is necessary to keep a certain margin with regards to the battery level. Therefore, in this set of experiments we evaluate the minimum level b_min by running the reoptimization method with the 20 customer instances. The battery capacity is B = 30 kWh, which means that in most cases at least one charging stop will be necessary to complete the route. Three different minimum levels are evaluated: b_min = 0, b_min = 3000 Wh and b_min = 6000 Wh. Table 5 shows the results of simulating the deterministic online reoptimization method with minimum battery levels of 20% (b_min = 6000 Wh), 10% (b_min = 3000 Wh) and 0 (b_min = 0). Column Cost shows the average energy cost (Wh) for each test instance over 20,000 simulated episodes. Column Fail shows how many times the battery got completely depleted while driving.
These results indicate that the only safe margin is 20%, which did not produce any battery depletion for any test instance. Therefore, the results for b_min = 6000 Wh will be used as a benchmark for comparing with the reinforcement learning method below.

Reinforcement learning
In this section the goal is to evaluate how much energy can be saved, while keeping reliability, using the reinforcement learning method compared with the deterministic online reoptimization approach. For the instances with 10 customers the battery is B = 20 kWh and for the instances with 20 customers the battery is B = 30 kWh. The minimum battery level for the reoptimization is b_min = 4 kWh and b_min = 6 kWh for 10 and 20 customers respectively. For the reinforcement learning method b_min = 0. This means that in most episodes at least one charging stop will have to be planned. For the tests without charging, the battery is B = 200 kWh, which is enough for any route. Tables 6 and 7 show the results of the experiments with the 10 and 20 customer test instances respectively. Column Reopt shows the average energy cost (Wh) using the reoptimization method evaluated with 20,000 simulated episodes. Column RL shows the average energy cost (Wh) using the reinforcement learning method evaluated with 20,000 simulated episodes, after training with 500,000 episodes. Column Diff shows the percentage difference, with a negative number indicating that the reinforcement learning method needs less energy on average. Additionally, although not shown in the tables, the average number of charging stops was slightly smaller for the reinforcement learning method.
As can be seen in the tables, it is possible to save energy for all instances. Using the reinforcement learning method there was not a single episode during evaluation where the battery was completely depleted. In summary, it was possible to save up to 12% energy while maintaining reliability. Another advantage is that, after training offline, the method is very fast to run online. These are important results, since transport companies could benefit from better routing instead of investing in more expensive vehicles with larger batteries.
In order to analyze how well the method anticipates customer requests and predicts energy consumption, the Expected Value of Perfect Information (EVPI) is calculated for the instances with 20 customers and charging. After an episode is simulated and all dynamic and stochastic information has been revealed (i.e. customer requests and energy consumption), the a posteriori route is computed with the heuristics presented in Section 4.2. Table 8 shows the results of the experiments. Columns Cost show the average energy cost for the a posteriori routes. Columns Diff show the average difference between the route generated by the reoptimization or RL method and the a posteriori route for each episode. As can be seen, if the problem were deterministic with all information known a priori, it would be possible to save an additional 7% energy on average using the RL method. However, considering the stochastic and dynamic nature of the problem, this gap is reasonable.

Dynamism analysis
In this subsection an analysis of how different levels of dynamism affect the results is presented. To do that, the test instances with 20 customers and battery B = 30 kWh are re-run with different settings. The first test changes the time limit for receiving new customer requests, allowing only early requests with k = 5 or allowing late requests with k = 15, which means that new requests can be received until k nodes have been visited. The second test uses a different number of requests known from the beginning, with |R₀| ≈ 5 or |R₀| ≈ 15. The exact number of initial requests varies between test instances. The other parameters are the same as before. Table 9 presents the results of the comparison between the reoptimization and reinforcement learning methods. When varying the number of customer requests known from the beginning (i.e. |R₀|), the results are slightly less favorable for both scenarios. With fewer known requests (i.e. |R₀| ≈ 5) the environment is more dynamic and it is also harder to predict ahead. With more known customers (i.e. |R₀| ≈ 15) the environment loses dynamism and gets closer to the static version of the problem. When varying the time limit for new requests, the results move in either direction. With a more dynamic environment (i.e. k = 15) the results are slightly less favorable. One of the main reasons is that it is more difficult to find better actions and predict ahead when there can be changes so late during the route. It should be noted that k = 15 is a quite late decision epoch. Considering that many routes will not have 15 total requests, they might receive new requests when there are just a few customers left to visit, which reduces the possibilities of improving the route.
The last test was performed to evaluate the impact of random weights for the customer requests. It was assumed that each customer has a demanded weight that follows a normal distribution with expected value μ and standard deviation 0.1μ (i.e. 10% of the mean). Table 10 shows the results of the experiments. As shown, the average savings are quite similar to the results in Table 7, where the requested demand is deterministic.

Algorithm and hyper-parameters discussion
Different values for the hyper-parameters and different variations of the algorithm were tested. However, since the main focus of the paper is to show the reliability and energy savings of the algorithm, we refrain from an extensive report of these experiments. A short discussion about the choice of hyper-parameters and algorithm is presented in this subsection instead.
Since combinatorial optimization problems can have a very large number of combinations even for a small set of inputs, using reinforcement learning can be challenging because the state-action space can be extensive. Therefore our solution method includes a rollout function to be used every time a state is visited for the first time. With that it is possible to get a good action to be taken from a state directly when that state is first visited. Consequently it is not necessary to evaluate all possible actions from a state to find a good candidate. As a result, it is possible to use a small epsilon for exploration. The choices of ε = 0.05 and ε = 0.1 for 20 and 10 customers respectively were made so that, on average, one action per episode is random, exploring around the regions with good candidate solutions initially found with the rollout heuristics. Another consequence is that it is not necessary to decay epsilon. With a larger epsilon with decay, the algorithm would explore outside the regions with good candidates, which could eventually produce even better results, but would also take considerably longer.
For the instances with 20 customers and 2 charging stations, the total number of states, using the reduced state representation presented in Section 3.2.1, is slightly over 241 million. The number of state-action pairs is then a bit over 5.5 billion. However, on average, only around 890 thousand state-action pairs are visited during training with 500 thousand episodes. Table 11 shows the results for every instance. That means that on average only 0.3% of the states are actually visited and only about 0.02% of the state-action pairs are evaluated. That is mainly due to two reasons: the first is that there are many unreachable or invalid states; the second is that the algorithm focuses on good solution candidates from the beginning.
Different variations of Algorithm 1 were also tried. The updates of the Q and R tables are done by averaging over time for each state-action pair. A few other approaches were tested, including Temporal Difference learning with different values of the learning rate and a moving average with different window lengths.
The risk of failure is managed by the policy with two layers as shown in Section 3.2.2. The first layer tries to keep the risk of battery depletion within a maximum accepted risk α. The second layer tries to keep the nearest charging station reachable from any customer location. They are complementary, since the first helps during exploration to find good candidate actions by chance-constraining the risk, while the second can override the preferred action in order to guarantee safety. Removing the second layer produces slightly higher energy savings but there are some failures. If the first layer is removed there are no failures, but the energy savings are reduced. Different levels of risk were tested, but the best results were achieved by keeping 5% ≤ α ≤ 10% to maintain a good balance between the two layers of safety in the policy. A larger accepted risk increases the chance of the second layer being triggered, which increases the average energy cost. A smaller accepted risk becomes too restrictive and also increases the average energy cost.

As mentioned before, runtime performance was not the focus of the paper, but a short discussion is presented, although convergence to optimality is not included. The algorithm was implemented in Matlab 2017b. An ordinary computer with a 3.1 GHz dual-core i5 processor and 16 GB RAM was used for the experiments. Training runtime for 500 thousand episodes was 126 min on average for the test instances with 20 customers and charging (i.e. B = 30 kWh). For the validation of those instances, the average time for running 20 thousand episodes with the reoptimization method was 51 min, while for the trained RL agent it was 33 s. Training for 100 thousand episodes took 45 min and still produced good energy savings while keeping reliability. Therefore, by lowering the number of training episodes it was possible to get good results and keep reliability while considerably reducing the time required for training.
Total runtime and energy savings for 50, 100, 250 and 500 thousand episodes are shown in Table 12. Fig. 4 illustrates how reducing the number of episodes reduces the total time without reducing much the energy savings. It is important to emphasize that the implementation was not focusing on speed and that there are several ways to make it faster. Therefore, with proper implementation it should be possible to use it in real transport operations. Furthermore, the online runtime difference between the reoptimization and the trained RL agent are substantial.

Conclusion
This paper introduces the Dynamic Stochastic Electric Vehicle Routing Problem (DS-EVRP) and models it as a Markov Decision Process. A solution method is proposed based on Safe Reinforcement Learning with the following contributions: (i) a Value Function Approximation with a reduced state representation to decrease the size of the Q-tables and improve exploration; (ii) a chance-constrained policy with two layers of safety to minimize energy consumption and avoid failures (i.e. battery depletion while in transit); (iii) a training approach with rollout heuristics based on tabu-search in order to focus the exploration on the most relevant parts of the state-action space. Lastly, the paper presents a set of computational experiments to evaluate the solution method and analyze its properties. The main results can be summarized as follows:
• The reinforcement learning method could save on average 4.8% (up to 12%) energy by planning the route and charging anticipatively, compared with the deterministic online reoptimization approach;
• Although the reoptimization method can solve the problem, it is not as reliable as the reinforcement learning method and can produce some failures (i.e. battery depletion) even when keeping a certain margin for the battery level;
• By using the two-layer policy, the reinforcement learning method demonstrated more reliability than the reoptimization method, avoiding failures altogether;
• The method performed well for different levels of dynamism. The reinforcement learning algorithm performed better than the reoptimization method even when allowing late requests, knowing only a few customer requests before leaving the depot, or with random customer demands (i.e. payload weight);
• The proposed training process can produce satisfactory results even with a low number of episodes, due to the rollout function and the Value Function Approximation. Therefore it could be tailored to be deployed in current transport operations;
• The online computation and memory usage could be suitable for onboard implementation, since the training is done offline and the reduced state representation decreases the memory used for the Q-tables. Furthermore, the trained RL agent has a significantly lower online runtime than the reoptimization method.
To the best of our knowledge there is no published paper considering the presented problem. With the proposed method it is possible to reliably integrate electric commercial vehicles into dynamic and stochastic transport operations. By anticipating dynamic customer requests and predicting energy consumption, this approach can help drivers and dispatchers plan their routes in real time, including charge planning. Additionally, it has the potential to support achieving environmental goals. Therefore, implementing this method opens an avenue for meeting the high service-level demands of the on-demand economy while transitioning to a fossil-free economy.
Possible future work includes adapting the method for large instances (e.g. 100 customers and above) with multiple vehicles. It would be interesting to investigate the use of Neural Networks either by using Deep Q-Learning or policy optimization methods. Since energy consumption is just one aspect in the total cost of transport operations, it could be interesting to analyze different objective functions for the problem, aiming at operational cost minimization, for instance by including time aspects and partial recharging. The same method could also be applied for time-windows, by estimating the risk of not meeting them. Additionally, it could be possible to integrate other sources of stochasticity, such as waiting times at charging stations or service times at customers.

Funding
This work was co-funded by Vinnova, Sweden through the project EL-FORT 2: Electric Fleet Optimization in Real-Time, by the FFI program Efficient and Connected Transport Systems.
Balázs Kulcsár and Ivan Sanchez-Diaz acknowledge the support of Transport Area of Advance within Chalmers University of Technology, Sweden.

Appendix A. Supplementary data
The supplementary material provided with this article contains all test instances used in the experiments. Each test instance is stored in a folder named instance_10_1 to instance_10_25 for the instances with 10 customers and instance_20_1 to instance_20_25 for the instances with 20 customers. Each folder contains 7 CSV files. The file customers.csv is a table whose first column is the weight demand (kg) for each customer and whose second column is the probability (%) of that customer sending a request during an episode.
The other files are matrices representing the graph G, as described in Section 2. Each row and column represents a node in the graph. The first is the depot, the following are the customers and the last two are the charging stations.
Files matrixAlpha.csv and matrixBeta.csv contain the coefficients for the expected energy for Eq. (4). Files matrixSigma1.csv and matrixSigma2.csv contain the coefficients for the energy variance for Eq. (5). Those coefficients are used to get the energy consumption (Wh) probability distribution for each path, as a function of total mass (i.e. curb weight + payload).
Files matrixDistance.csv and matrixTime.csv represent the distance (m) and time (seconds) for driving each path. They are not actually used in the experiments but are provided for a better visualization of the test instances.