Online Learning of Energy Consumption for Navigation of Electric Vehicles

Energy efficient navigation constitutes an important challenge in electric vehicles, due to their limited battery capacity. We employ a Bayesian approach to model the energy consumption at road segments for efficient navigation. In order to learn the model parameters, we develop an online learning framework and investigate several exploration strategies such as Thompson Sampling and Upper Confidence Bound. We then extend our online learning framework to the multi-agent setting, where multiple vehicles adaptively navigate and learn the parameters of the energy model. We analyze Thompson Sampling and establish rigorous regret bounds on its performance in the single-agent and multi-agent settings, through an analysis of the algorithm under batched feedback. Finally, we demonstrate the performance of our methods via experiments on several real-world city road networks.


Introduction
Today, electric vehicles experience a fast-growing role in many different transport systems. However, the applicability of electric vehicles is often constrained by the limited capacity of their batteries. Due to the historically high cost of batteries, the range of electric vehicles has generally been much shorter than that of conventional vehicles. This has led to the fear of being stranded when the battery is depleted, an effect known as "range anxiety". Such concerns could be alleviated by improving the navigation algorithms and route planning methods for these systems. Therefore, in this paper we aim at developing principled methods for energy efficient navigation of electric vehicles.
Several works employ variants of shortest path algorithms for the purpose of finding the routes that minimize the energy consumption. Some of them (e.g., [5,51]) focus on computational efficiency in searching for feasible paths where the constraints induced by limited battery capacity are satisfied. Both [5] and [51] use energy consumption as edge weights for the shortest path problem. They also consider recuperation of energy, modeled as negative edge weights. A separate line of related work addresses online learning when reward observations are delayed [14]. The authors of [34] propose a black box algorithm which may convert any stochastic bandit algorithm into an algorithm handling delayed feedback. The converted algorithms retain the regret bounds of the original algorithms, except for an additive term which is constant in the horizon and linear in the maximum delay. In [43], a lower regret bound for the two-armed bandit problem with batched feedback is derived, again exhibiting a linear dependence with respect to the batch size.
We take inspiration from the frequentist analysis of batched linear contextual UCB presented in [28] and extended to the generalized linear setting in [47], utilizing a similar technique in our analysis to decompose the Bayesian regret over the batches. Another extension of [28] is [46], which presents a greedy LASSO-based algorithm for a high-dimensional batched linear contextual bandit setting, where the dimension of the context is assumed to be much higher than the time horizon. To provide an upper bound for the frequentist regret, they assume that the context is stochastic with enough variance to induce sufficient exploration. This assumption does not hold for the non-contextual setting studied in our work.
Finally, regarding incremental learning of energy consumption in graphs, the authors of [8] use a Bayesian approach, similar to the one in this work, to learn the edge-specific distributions of electric vehicle energy consumption in a road network. They utilize the posterior distributions to formulate and solve an electric vehicle routing problem (EVRP) for commercial vehicles, where the paths between customers, charging stations and depots are selected using learned parameters and information from the environment. Since exploration is not the focus of their work, their method of calculating the shortest paths most closely corresponds to the greedy baseline used for the experiments in this work.

Our Contributions
First and foremost, we propose a novel online learning framework for energy efficient navigation of electric vehicles, in a setting where the vehicle energy consumption of road segments is assumed to be stochastic and the corresponding distributions are unknown a priori. We utilize a physical model of vehicle energy consumption to assign the edge-specific parameters of prior distributions for Bayesian bandit algorithms, such as Thompson Sampling and BayesUCB, in order to intelligently guide necessary exploration towards reasonable paths. The multi-armed bandit problem can be seen as a resource allocation problem, and as such, bandit algorithms are most useful where there is a limited number of agents available for data collection.
While travel time in a road network is both stochastic and a common edge weight in shortest path problems, there is an abundance of travel time data available from various sources, e.g., from cellular devices. For vehicle energy consumption, however, there are factors limiting the number of agents. As energy consumption depends heavily on the specific vehicle type used, internal vehicle sensors are required for data collection. Furthermore, energy consumption also depends on the characteristics of the road traveled, like slopes, curvature, bumps, etc. Hence, it is a problem setting highly suited for Bayesian bandit algorithms.
While several works on Bayesian combinatorial bandit algorithms have been empirically evaluated using uninformative priors, experiments where informative priors are used to explore combinatorial arm sets more efficiently are less common. We not only utilize informative priors in our experiments, but also study the exploration of the road network through visual inspection of geospatial plots. Furthermore, we experimentally evaluate the robustness of the proposed framework to prior misspecification. We perform experiments for the road networks of multiple cities, using realistic traffic environment data.
As far as we are aware, there are no previous works analyzing the Bayesian regret of batched combinatorial Thompson Sampling. Furthermore, we also extend our analysis to the synchronous multi-agent setting. While there is prior work for batched linear contextual bandits (e.g., UCB in [28] and [47]), a combinatorial bandit problem is only a special case of the linear bandit problem for linear reward functions. Our technique, however, is feasible to extend for non-linear reward functions, such as in [4], where combinatorial Thompson Sampling is used to address the problem of finding paths which minimize their maximum edge weights.
Finally, this is the first work to extend the BayesUCB algorithm [36] to the online shortest path problem and evaluate it in this setting, empirically demonstrating its good performance.

Energy Consumption Model
In this section, we start by describing how we model the road network and the different factors affecting the energy consumption of a vehicle traversing a specific road segment. We then outline two different Bayesian approaches to extend the deterministic energy consumption model to a probabilistic setting.

Setup of the Energy Consumption Model
We model the road network by a directed graph G(V, E, w) where each vertex u ∈ V represents an intersection of the road segments, and E indicates the set of directed edges. Each edge e = (u_1, u_2) ∈ E is an ordered pair of vertices u_1, u_2 ∈ V such that u_1 ≠ u_2, and it represents the road segment between the intersections associated with u_1 and u_2. In the cases where bidirectional travel is allowed on a road segment represented by (u_1, u_2) ∈ E, we add an edge (u_2, u_1) ∈ E in the opposite direction. A directed path is a sequence of vertices u_1, u_2, . . . , u_n, where u_h ∈ V for h = 1, . . . , n and (u_h, u_{h+1}) ∈ E for h = 1, . . . , n − 1. Hence, a path p can also be viewed as a sequence of edges. If p starts and ends with the same vertex, p is called a cycle. Note that, in this work, different paths may have different numbers of vertices.
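As a concrete illustration, the directed road graph above can be sketched with a plain adjacency-map structure; the class, attribute names and example values below are illustrative, not taken from the paper's implementation:

```python
# Minimal sketch of the road-network graph G(V, E, w): each directed edge
# (u1, u2) is a road segment carrying the attributes used by the energy model.
class RoadGraph:
    def __init__(self):
        self.adj = {}  # vertex -> {neighbor: edge attribute dict}

    def add_edge(self, u1, u2, length_m, speed_mps, incline_rad,
                 bidirectional=False):
        self.adj.setdefault(u1, {})[u2] = {
            "length": length_m, "speed": speed_mps, "incline": incline_rad}
        if bidirectional:  # opposite-direction edge with negated inclination
            self.adj.setdefault(u2, {})[u1] = {
                "length": length_m, "speed": speed_mps, "incline": -incline_rad}

    def edges(self):
        return [(u, v) for u, nbrs in self.adj.items() for v in nbrs]

g = RoadGraph()
g.add_edge("a", "b", 500.0, 13.9, 0.01, bidirectional=True)
g.add_edge("b", "c", 300.0, 8.3, 0.0)
```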
We associate a weight vector w with the graph, where each element w_e represents the total energy consumed by a vehicle traversing edge e ∈ E. We extend the notation so that the total weight of a path p is denoted w_p := Σ_{e∈p} w_e. For each edge e, we also define other edge attributes associated with road segments, such as the average speed v_e, the length l_e, and the inclination α_e.
In our setting, the amount of energy consumed at different road segments is stochastic and a priori unknown. We adopt a Bayesian approach to model the energy consumption at each road segment e ∈ E, i.e., the edge weights. Such a choice provides a principled way to induce prior knowledge. Furthermore, as we will see, this approach fits well with the online learning and exploration of the parameters of the energy model.
We first consider a deterministic model of the vehicle energy consumption E_e for an edge e, which will later be used as the prior. Similar to e.g., [27,8], our model is based on longitudinal vehicle dynamics and Newton's second law of motion. For convenience, we assume that vehicles drive with constant speed along individual edges, so that we can disregard the longitudinal acceleration term. However, this assumption is only used for the prior. We then have the following equation for the approximated energy consumption (in watt-hours):

E_e = (l_e / (3600 · η)) · ( m · g · (C_r · cos α_e + sin α_e) + 0.5 · ρ · A · C_d · v_e² ).    (2.1)

In Eq. 2.1, the vehicle mass m, the rolling resistance coefficient C_r, the front surface area A and the air drag coefficient C_d are vehicle-specific parameters, whereas the road segment length l_e, speed v_e and inclination angle α_e are location (edge) dependent. In principle, C_r could also be considered edge-specific (since it also depends on the surface of the road), but in this work, we assume that it is the same for all edges. We treat the gravitational acceleration g and the air density ρ as constants. The powertrain efficiency η is vehicle-specific and can be approximated by the constant η = 1 for an ideal vehicle with no battery-to-wheel energy losses.
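A minimal sketch of this deterministic model, assuming the standard longitudinal-dynamics form with the parameters listed above (the default parameter values are illustrative, not from the paper):

```python
import math

def edge_energy_wh(l, v, alpha, m=1500.0, C_r=0.01, A=2.5, C_d=0.3,
                   g=9.81, rho=1.2, eta=1.0):
    """Approximate energy (Wh) to traverse one edge at constant speed.
    l: segment length [m], v: speed [m/s], alpha: inclination [rad].
    Rolling resistance + grade force + aerodynamic drag, converted J -> Wh."""
    force = (m * g * (C_r * math.cos(alpha) + math.sin(alpha))
             + 0.5 * rho * A * C_d * v ** 2)
    return force * l / (3600.0 * eta)

flat = edge_energy_wh(1000.0, 15.0, 0.0)     # level 1 km segment
uphill = edge_energy_wh(1000.0, 15.0, 0.05)  # same segment, 5% grade-ish
```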
Actual energy consumption can be either positive (traction) or negative (regenerative braking). If the energy consumption is modeled accurately and used as w e in a graph G(V, E, w), the law of conservation of energy guarantees that there exists no cycle c in G where w c < 0. However, since we are modeling and estimating the expected energy consumption of each individual road segment independently (to ensure that the problem is tractable), this guarantee does not necessarily hold in our case.
While modeling energy recuperation is desirable from an accuracy perspective, it introduces some difficulties. In terms of computational complexity, Dijkstra's algorithm [20] does not allow negative edge weights and the Bellman-Ford algorithm [52,22,10] is slower by an order of magnitude. There are methods to overcome this (e.g., [33]), but they still assume that there are no negative edge weight cycles in the network. Hence, we choose to only consider positive edge weights when solving the energy efficient (shortest path) problem, which enables us to use Dijkstra's algorithm in this work. This approximation still achieves meaningful results, since even with recuperation discarded, edges with high energy consumption are avoided. So while the powertrain efficiency η has a higher value when the energy consumption is negative than when it is positive, we believe using a constant is a justified simplification as we only consider positive edge-level energy consumption in the optimization stage. However, we emphasize that our generic online learning framework is independent of such approximations, and can be employed with any sensible energy model and shortest path algorithm.

Rectified Gaussian Model of Energy Consumption
Motivated by [56], as a first attempt at a probabilistic model of energy consumption, we assume that the stochastic energy consumption Ẽ_e of a road segment represented by an edge e follows a Gaussian distribution, given a certain small range of inclination, vehicle speed and acceleration. We also assume that Ẽ_e is independent of Ẽ_{e′} for all e′ ∈ E where e′ ≠ e, and that we may observe negative energy consumption. In other words, we assume that we may observe the energy recuperation of the vehicle, even though we only use estimates of the non-negative energy consumption when solving the shortest path problem (as stated in Section 2.1). The likelihood function (where, for later convenience, Ẽ_e is negated so that θ*_e indicates a mean reward) is then P(Ẽ_e | θ*_e, σ²_e) := N(−Ẽ_e | θ*_e, σ²_e).
Here, for clarity, we assume the noise variance σ²_e is given. We can then follow a Bayesian approach and use a Gaussian conjugate prior over the mean energy consumption:

P(θ*_e | µ_{e,0}, ς²_{e,0}) := N(θ*_e | µ_{e,0}, ς²_{e,0}),

where we choose µ_{e,0} ← −E_e and ς²_{e,0} ← (ϑµ_{e,0})² for some constant ϑ > 0. Due to the conjugacy properties, we have closed-form expressions for updating the posterior distributions with new observations of Ẽ_e. For any path p in G, we have E[Σ_{e∈p} Ẽ_e] = Σ_{e∈p} E[Ẽ_e], which means we can find the path with the lowest expected energy demand if we set w_e ← E[Ẽ_e] and solve the shortest path problem over G(V, E, w). When the expected energy consumption is estimated instead of being known, to deal with the risk of w_e < 0 (i.e., negative weights), we instead set w_e ← E[z_e], where z_e is distributed according to the rectified Gaussian distribution N^R(−θ*_e, σ²_e), which is defined so that z_e := max(0, Ẽ_e) with Ẽ_e ∼ N(−θ*_e, σ²_e). The expected value of z_e is then

E[z_e] = −θ*_e · Φ(−θ*_e / σ_e) + σ_e · φ(−θ*_e / σ_e),

where Φ and φ are the standard Gaussian CDF and PDF, respectively. Thus, since we observe both negative and positive energy consumption, we may utilize the conjugacy properties of the Gaussian likelihood and prior distribution to efficiently update and sample from the posterior distribution over the (non-negative) rectified Gaussian mean.
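The conjugate Gaussian update and the rectified Gaussian mean can be sketched as below; the function names and example values are hypothetical, and the rectified mean uses the standard identity E[max(0, X)] = µΦ(µ/σ) + σφ(µ/σ) for X ∼ N(µ, σ²):

```python
import math

def posterior_update(mu, var, obs_reward, noise_var):
    """One conjugate Gaussian update of the mean-reward posterior N(mu, var)
    after observing reward obs_reward = -energy, with known noise variance."""
    new_var = 1.0 / (1.0 / var + 1.0 / noise_var)
    new_mu = new_var * (mu / var + obs_reward / noise_var)
    return new_mu, new_var

def rectified_gaussian_mean(theta, sigma):
    """E[max(0, E_tilde)] for E_tilde ~ N(-theta, sigma^2): the expected
    non-negative energy consumption used as the edge weight."""
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    mean = -theta  # mean energy consumption
    return mean * Phi(mean / sigma) + sigma * phi(mean / sigma)
```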

Log-Gaussian Model of Energy Consumption
Alternatively, instead of assuming a rectified Gaussian distribution for the energy consumption of each edge, we model the non-negative edge weights by (conjugate) Log-Gaussian likelihood and prior distributions. By definition, if we have a Log-Gaussian random variable Z ∼ LN(µ, σ²), then the logarithm of Z is a Gaussian random variable (log Z) ∼ N(µ, σ²). Therefore, we have the expected value E[Z] = exp{µ + 0.5σ²} and the variance Var[Z] = (exp{σ²} − 1) · exp{2µ + σ²}. We can then define the likelihood function as

P(Ẽ_e | θ*_e, σ̃²_e) := LN(Ẽ_e | log(−θ*_e) − σ̃²_e/2, σ̃²_e),    (2.2)

such that E[Ẽ_e | θ*_e] = −θ*_e, where σ̃²_e := log(1 + σ²_e/ψ²_e) and we let ψ_e = µ_{e,0}. We also choose the prior hyper-parameters such that E[θ*_e] = µ_{e,0} and Var[θ*_e] = ς²_{e,0}, where µ_{e,0} and ς_{e,0} are calculated in the same way as for the Gaussian prior (except that µ_{e,0} is restricted to be negative), in order to make fair comparisons between the Log-Gaussian and rectified Gaussian results. The resulting prior distribution is

P(θ*_e | µ_{e,0}, ς²_{e,0}) := LN(−θ*_e | log(−µ_{e,0}) − ς̃²_{e,0}/2, ς̃²_{e,0}),  where ς̃²_{e,0} := log(1 + ς²_{e,0}/µ²_{e,0}).    (2.3)

We emphasize that the specific parameterization that we use for the Log-Gaussian model in Eq. 2.2 allows for closed-form posterior updates with the prior distribution in Eq. 2.3. Since −θ*_e is drawn from a Log-Gaussian prior distribution, the first parameter of the Log-Gaussian likelihood in Eq. 2.2 (i.e., a linear function of log(−θ*_e)) is Gaussian (i.e., it has the conjugate prior distribution of the first parameter under the standard parameterization). For more details on Bayesian updates with this Log-Gaussian parameterization, see e.g., [30]. We summarize the notation used in the preceding sections and the rest of the paper in Table A1 of Appendix A.
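The moment-matching step for the Log-Gaussian prior, solving the stated E[Z] and Var[Z] identities for the underlying Gaussian parameters, can be sketched as (variable names and the example values are illustrative):

```python
import math

def lognormal_params_from_moments(mean, var):
    """Parameters (m, s2) of LN(m, s2) matching a target mean and variance,
    inverting E[Z] = exp(m + s2/2) and Var[Z] = (exp(s2) - 1) * exp(2m + s2)."""
    s2 = math.log(1.0 + var / mean ** 2)
    m = math.log(mean) - 0.5 * s2
    return m, s2

# e.g., prior mean 50 Wh and variance 25 Wh^2 for the energy -theta of an edge
mu0, var0 = 50.0, 25.0
m, s2 = lognormal_params_from_moments(mu0, var0)
```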

Algorithm 1 Online learning for energy efficient navigation
Require: µ_0, ς_0
1: for t ← 1, . . . , T do
2:   w_t ← edge weights computed from the posterior parameters µ_{t−1}, ς_{t−1} (e.g., via Algorithm 3 or Algorithm 4)
3:   a_t ← solution of the shortest path problem over G(V, E, w_t)
4:   Play a_t and observe the reward r_t(a_t)
5:   µ_t, ς_t ← updated posterior parameters (Algorithm 2)

Algorithm 2 Gaussian parameter update of the energy model
Require: µ_{t−1}, ς_{t−1}, a_t, r_t(a_t)
1: for each edge e ∈ a_t do
2:   ς²_{e,t} ← (1/ς²_{e,t−1} + 1/σ²_e)^{−1}
3:   µ_{e,t} ← ς²_{e,t} · (µ_{e,t−1}/ς²_{e,t−1} + r_{e,t}(a_t)/σ²_e)
4: (parameters of edges e ∉ a_t remain unchanged)
5: return µ_t, ς_t

Algorithm 1 describes these steps, where the vectors µ_{t−1} and ς_{t−1} refer to the current posterior parameters of the energy model for all the edges at the current time t, which are used to obtain the current edge weight vector w_t. Whenever we refer to an element of a vector indexed by a time step t, we always let the rightmost index be t, e.g., w_{e,t} in the vector w_t. We solve the optimization problem using w_t to determine the optimal action (or arm, in the nomenclature of multi-armed bandit problems) a_t, which in this context is a path through the graph. The action a_t is applied and a reward r_t(a_t) is observed, consisting of the actual measured energy consumption for each of the traversed edges. We assume that the energy consumption distribution of each edge is fixed over time, and therefore, we exclude the subscript t of the reward where it is not needed, such as for the expected reward E[r(a_t)]. Since we want to minimize energy consumption, we regard it as a negative reward when we update the parameters (shown for the rectified Gaussian model in Algorithm 2). T denotes the total number of time steps, sometimes called the horizon.
To measure the effectiveness of our online learning algorithm, we consider its regret, which is the difference in the total expected reward between always playing the optimal action and playing actions according to the algorithm. Formally, the instant regret at time t (or alternatively the gap of the action selected at time t) is defined as

∆_t := E[r(a*)] − E[r(a_t)],

where a* := arg max_a E[r(a)] is the action which maximizes the expected reward, and the cumulative regret is defined as Regret(T) := Σ_{t=1}^{T} ∆_t. Since our framework uses a Bayesian approach, we also consider Bayesian regret, which is the expected value of the regret over problem instances sampled from the prior distribution, so that BayesRegret(T) := E[Regret(T)].
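As a small illustration of these definitions, the cumulative regret of a run can be computed as follows (the path names and expected rewards are hypothetical):

```python
def cumulative_regret(expected_rewards, chosen_arms):
    """Regret(T) = sum_t [ max_a E[r(a)] - E[r(a_t)] ], given the expected
    reward of every arm and the sequence of arms actually played."""
    best = max(expected_rewards.values())
    return sum(best - expected_rewards[a] for a in chosen_arms)

# two candidate paths with expected rewards (negative expected energy, Wh)
expected = {"p1": -10.0, "p2": -12.0}
regret = cumulative_regret(expected, ["p2", "p1", "p2"])
```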

Shortest Path Problem as Multi-Armed Bandit
A combinatorial bandit [12,23] is a multi-armed bandit problem where an agent is only allowed to pull sets of arms instead of an individual arm. However, there may be restrictions on the feasible combinations of the arms. We consider the combinatorial semi-bandit case where the rewards are observed for each individual arm pulled by an agent during a round.
A number of different combinatorial problems can be cast to multi-armed bandits in this way, among which the online shortest path problem [23,39,57] is the focus of this work. An efficient algorithm for the deterministic problem (e.g., [20]) can be used as an oracle [54] to provide feasible sets of arms to the agent, as well as to maximize the expected reward.
We connect this to the optimization problem in Algorithm 1, where we want to find an arm a_t. At time t, let G(V, E, w_t) be a directed graph with weight vector w_t and sets of vertices V and edges E. Given a source vertex u_1 ∈ V and a target vertex u_n ∈ V, let P be the set of all paths p in G such that p = u_1, . . . , u_n. Assuming non-negative edge costs w_{e,t} for each edge e ∈ E, the problem of finding the shortest path (arm a_t) from u_1 to u_n can be formulated as

a_t := arg min_{p∈P} Σ_{e∈p} w_{e,t}.

For the analysis, we introduce some formal definitions for this stochastic combinatorial semi-bandit problem. There is a set of base arms A, which corresponds to E in the considered graph. The set of arms selected at time t is called the super-arm a_t ⊆ A. The set of feasible super-arms I, such that a_t ∈ I, is equal to the set of paths P. We further define the expected reward of a super-arm a with respect to a particular mean reward vector (over all base arms) θ as f_θ(a) := Σ_{i∈a} θ_i. Hence, according to the previously introduced definition of regret, we have that ∆_t = f_{θ*}(a*) − f_{θ*}(a_t).
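A self-contained sketch of such an oracle, here a plain Dijkstra implementation over an adjacency map (the graph instance and names are illustrative):

```python
import heapq

def dijkstra_path(adj, source, target):
    """Shortest path with non-negative weights; adj: {u: {v: w_uv}}.
    Serves as the exact oracle queried once per round by the bandit loop."""
    dist, prev = {source: 0.0}, {}
    heap, visited = [(0.0, source)], set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == target:
            break
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, u = [target], target  # walk predecessors back to the source
    while u != source:
        u = prev[u]
        path.append(u)
    return list(reversed(path)), dist[target]

adj = {"s": {"a": 1.0, "b": 4.0}, "a": {"b": 1.0, "t": 5.0}, "b": {"t": 1.0}}
path, cost = dijkstra_path(adj, "s", "t")
```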

Thompson Sampling
In our Bayesian setup, a greedy strategy chooses the arm which maximizes the expected reward according to the current estimate of the mean rewards. Since the greedy method does not actively explore the environment, there are other methods which perform better in terms of minimizing cumulative regret. One commonly used method is ε-greedy, where a (uniformly) random arm is selected with probability ε and the greedy strategy is used otherwise. While, in principle, it would be possible to select paths uniformly at random in the exploration time steps, the size of the set of all paths P (corresponding to the set of feasible super-arms I) can be exponential with respect to the number of edges in the graph. This set might even include paths similar to random walks through the graph, which would almost certainly be very inefficient in terms of accumulated edge costs. Hence, this method is not well suited to the shortest path problem. A modification of ε-greedy (based on Algorithm 1 in the supplementary material of [15]), where only a single edge (and the shortest path through it) is sampled, is introduced in Algorithm 7. However, for large graphs this might still lead to unreasonable exploration paths (e.g., a path between New York City and Boston through a randomly selected detour around Los Angeles).
An alternative method for exploration is Thompson Sampling (TS). In contrast to the greedy method, with TS (as in ε-greedy), arms are randomly sampled. However, whereas arms are sampled uniformly at random with ε-greedy, the TS agent samples from the model: during each time step, it selects an arm which has a high probability of being optimal, by sampling mean rewards from the posterior distribution and choosing an arm which maximizes them. In other words, the method utilizes the prior beliefs about the parameter values to guide exploration towards reasonable arms.
Thompson Sampling for the energy consumption shortest path problem is outlined in Algorithm 3, which can be used in Algorithm 1 to obtain the edge weights of the network (only shown for the rectified Gaussian model).

Algorithm 3 Thompson Sampling of edge weights (rectified Gaussian model)
1: for each edge e ∈ E do
2:   θ̂_e ← sample from the posterior N(µ_{e,t−1}, ς²_{e,t−1})
3:   w_{e,t} ← E[z_e] for z_e ∼ N^R(−θ̂_e, σ²_e)
4: return w_t
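One TS round under the rectified Gaussian model can be sketched as follows: sample a mean reward per edge from its posterior, then convert it to a non-negative edge weight via the rectified Gaussian mean (the function and variable names are illustrative):

```python
import math, random

def thompson_sample_weights(posterior, rng=random):
    """One TS round: posterior maps edge -> (mu, var, sigma). For each edge,
    draw theta ~ N(mu, var), then use E[max(0, N(-theta, sigma^2))] as a
    non-negative weight for the shortest path oracle."""
    weights = {}
    for e, (mu, var, sigma) in posterior.items():
        theta = rng.gauss(mu, math.sqrt(var))
        mean = -theta  # sampled mean energy consumption
        Phi = 0.5 * (1.0 + math.erf(mean / (sigma * math.sqrt(2.0))))
        phi = math.exp(-0.5 * (mean / sigma) ** 2) / math.sqrt(2.0 * math.pi)
        weights[e] = mean * Phi + sigma * phi  # rectified mean, always >= 0
    return weights
```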

Regret Analysis
In this section, we provide an analysis of the cumulative regret of Thompson Sampling for the shortest path navigation problem. While better upper bounds on the Bayesian regret of combinatorial TS are possible (e.g., using our proof for the batched combinatorial setting in Theorem 3 with batch size 1, we obtain a Bayesian regret upper bound of Õ(|E|√T)), this result may give some insight into the relationship between reinforcement learning problems and combinatorial bandit problems.

Proposition 1. The Bayesian regret of Algorithm 1 is upper bounded by Õ(τ √(τ |V| |E| T)), where τ is the episode (path) length.

We arrive at this result by relating the problem to recent results in the reinforcement learning literature [41]. We view the online shortest path problem as an episodic reinforcement learning problem on an unknown finite time horizon Markov decision process (MDP). Here, each vertex u ∈ V corresponds to a state, each edge e ∈ E corresponds to an action, and the reward distributions of the actions are the same as in the bandit problem. As in the bandit problem formulation, the rewards of different states are assumed to be independent. Furthermore, given a state and an (allowed) action, transitions are deterministic, such that the next state is the end vertex of the edge corresponding to the action. Each episode starts in the source vertex state and ends when the target vertex state is reached. In other words, each episode corresponds to a time step (and path selection) in the bandit problem formulation.
Applying posterior sampling for reinforcement learning (PSRL) [42] to this problem, using identical priors over reward distribution parameters as in the bandit problem, is equivalent to using TS on the combinatorial semi-bandit problem. At the start of each episode, PSRL samples an MDP from the current prior / posterior distribution over MDPs (here, a distribution over reward distributions, since the transitions are deterministic and known).
The policy used during this episode by PSRL is then the optimal policy with respect to the sampled MDP. In this problem, since the rewards are the negative edge weights of the graph, the shortest path between the source and target vertices will be selected.
Since the posterior parameters involved in PSRL are updated in the same way as in the bandit problem, identical observations and samples will lead to identical posterior updates. Hence, the two methods are equivalent, and a regret bound for one will apply to the other. From [41], with τ being the episode length and T from the bandit problem corresponding to the number of episodes in the RL problem, we have

BayesRegret(T) ≤ Õ(τ √(τ |V| |E| T)).

We also note that Conjecture 1 of [41] would improve this result, so that

BayesRegret(T) ≤ Õ(τ √(|V| |E| T)).

We note that the combinatorial semi-bandit problem formulation of Section 3.1 can be seen as a simpler special case of the reinforcement learning problem, with fewer complexities to learn (e.g., fewer parameters to estimate, no modeling of state transitions, etc.). In particular, whereas the traffic environment is affected by the paths that we choose, any state changes caused by an agent do not typically affect it immediately, since an edge is likely not traversed more than once during a single episode (path). If we want to adapt to the observed immediate rewards of different base arms while driving on a selected path, this could be modeled as a reinforcement learning problem, e.g., like the (#P-hard) stochastic shortest path problem with recourse [44], which may (in principle) then be addressed by PSRL. In general, however, choosing the less complex (though still meaningful) bandit problem formulation enables us to use powerful methods with proven strong performance guarantees.

Bayesian Upper Confidence Bound
Another class of algorithms demonstrated to work well in the context of multi-armed bandits is the collection of the methods developed around the Upper Confidence Bound (UCB). Informally, these methods are designed based on the principle of optimism in the face of uncertainty. The algorithms achieve efficient exploration by choosing the arm with the highest empirical mean reward added to an exploration term (the confidence width). Hence, the arms chosen are those with a plausible possibility of being optimal.
In [15], a combinatorial version of UCB (CUCB) is shown to achieve sub-linear regret for combinatorial semi-bandits. However, a Bayesian approach is beneficial in this problem, since it allows us to encode the theoretical knowledge of the energy consumption in a prior. Hence, we consider BayesUCB [36] and adapt it to the combinatorial semi-bandit setting. Similar to [36], we denote the quantile function of a distribution λ by Q(β, λ), defined such that for a random variable X distributed according to λ (i.e., X ∼ λ), we have Pr(X ≤ Q(β, λ)) = β. The idea of that work is to use upper quantiles of the posterior distributions of the expected arm rewards to select arms. If λ denotes the posterior distribution of a base arm and t is the current time step, the Bayesian Upper Confidence Bound (BayesUCB) of that base arm is Q(1 − 1/t, λ). This method is outlined in Algorithm 4 for the rectified Gaussian model. Here, since the goal is to minimize the energy consumption, which can be considered as the negative of the reward, we use the lower quantile Q(1/t, λ).
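The lower-quantile computation for a Gaussian posterior can be sketched with the standard library; the clipping at zero (to keep weights non-negative for Dijkstra) and the guard at t = 1 are assumptions for illustration:

```python
from statistics import NormalDist

def bayes_lcb_weight(mu, var, t):
    """Optimistic (low) edge weight for BayesUCB: the 1/t quantile of the
    posterior over the energy consumption -theta_e ~ N(-mu, var)."""
    beta = 1.0 / max(t, 2)  # guard: inv_cdf(1.0) at t = 1 is undefined
    q = NormalDist(mu=-mu, sigma=var ** 0.5).inv_cdf(beta)
    return max(0.0, q)      # clip so the shortest path oracle sees w >= 0

w10 = bayes_lcb_weight(-10.0, 1.0, 10)    # mean energy 10 Wh, round t = 10
w100 = bayes_lcb_weight(-10.0, 1.0, 100)  # narrower quantile later on
```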

Multi-Agent Learning and Exploration
Online learning may be sped up by having multiple agents explore simultaneously and share information about the observed rewards with each other. In our particular application, this corresponds to a fleet of vehicles of similar type sharing information about energy consumption across the fleet. Such a setting can be very important for road planning, the electric vehicle industry, vehicle fleet operators and city authorities.
The communication between the agents for the sake of sharing the observed rewards can be synchronous or asynchronous. In this paper, we consider the synchronous setting, where the vehicles drive concurrently in each time step and share their accumulated knowledge with the fleet before the next iteration starts. At each time step, each individual vehicle independently selects a path to explore/exploit according to the online learning strategies provided in Section 3. Here, we assume that all vehicles start their paths at the same source vertex and end them at the same target vertex, though even without this assumption, vehicles would benefit from information sharing as long as there is some overlap between selected paths. The vehicles share information synchronously, when all agents have finished their trips for a certain time step. During each time step, the agents are allowed to select paths which are overlapping (with shared edges), but we do not model any physical interactions between vehicles (e.g., how increased traffic intensity on those road segments affects energy consumption). However, this could be an interesting topic for future work.
Below, we provide two different regret bounds for TS-based multi-agent learning under the synchronous setting. Both are based on the idea of viewing the synchronous multi-agent problem as a single-agent problem with delayed feedback received in batches. Specifically, the delay corresponds to the number of vehicles in the fleet, since we wait for all of them to finish traversing their selected paths until we update the posterior distributions and start the next time step.
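The synchronous round described above can be sketched abstractly: every agent selects a path from the same posterior snapshot, and all feedback is pooled into a single batched update before the next round (the hook functions and demo values below are placeholders, not the paper's implementation):

```python
def synchronous_round(posterior, select_path, observe, update, num_agents):
    """One synchronous fleet round: K agents pick paths from the same frozen
    posterior, drive concurrently, then all observations are applied at once."""
    chosen = [select_path(posterior) for _ in range(num_agents)]
    feedback = [(p, observe(p)) for p in chosen]   # drive concurrently
    for path, rewards in feedback:                 # pooled batched update
        posterior = update(posterior, path, rewards)
    return posterior, chosen

# tiny demo with stub hooks: count how many edge observations were pooled
post = {"n_obs": 0}
post, chosen = synchronous_round(
    post,
    select_path=lambda p: ("e1", "e2"),            # fixed 2-edge path
    observe=lambda path: {e: -1.0 for e in path},  # one reward per edge
    update=lambda p, path, rew: {"n_obs": p["n_obs"] + len(rew)},
    num_agents=3)
```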

Thompson Sampling with Queued Delayed Feedback
The first approach is based on the method of [34], which converts any algorithm for non-delayed stochastic bandit problems to an algorithm which handles delayed feedback, with a term constant in T added to the regret. This method and other similar queue-based methods have previously been used to adapt (and analyze) existing bandit algorithms for various problem settings with delayed feedback (see e.g., [40,32]). The approach of [34] is to wrap the original algorithm in an outer algorithm, which they call Queued Partial Monitoring with Delays (QPM-D). In essence, the inner algorithm functions as in the non-delayed case, unless the feedback of a selected arm is delayed and not available yet.
In that case, the outer algorithm takes over and repeatedly plays the selected arm until feedback is received. Since the arm is played multiple times, excess delayed feedback, not immediately used by the inner algorithm, is also received. The outer algorithm stores the excess feedback in a queue data structure (where the order in which elements are inserted is also the order in which they are later retrieved, i.e., First In, First Out, or FIFO). This allows the inner algorithm to retrieve feedback from the queue the next time the arm is selected, instead of having to wait for delayed feedback. We outline QPM-D adapted to our problem in Algorithm 5.

Algorithm 5 QPM-D adapted to the online shortest path problem
1: for t ← 1, . . . , T do
2:   Let b be the next super-arm selected by Algorithm 1.
3:   if Queue[b] is not empty then
4:     Retrieve a stored reward from Queue[b] and feed it to Algorithm 1.
5:   else
6:     There are no queued rewards for b, so perform arm a_t ← b at time t to receive rewards (possibly delayed) from the environment.
7:   Update: Let D_t be the set of (delayed) rewards received at time t and each (s, r_s(a_s)) ∈ D_t be the timestamped reward r_s(a_s) resulting from the arm a_s at time s.
8:   for (s, r_s(a_s)) ∈ D_t do
9:     Add the reward r_s(a_s) to Queue[a_s].
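The FIFO queue structure at the core of QPM-D can be sketched as follows (the class and method names are illustrative, not taken from [34]):

```python
from collections import defaultdict, deque

class FeedbackQueues:
    """FIFO queues of stored excess feedback, keyed by super-arm: the inner
    bandit algorithm first consumes queued rewards, and only plays in the
    real (delayed) environment when its queue for the selected arm is empty."""
    def __init__(self):
        self.q = defaultdict(deque)

    def put(self, arm, reward):
        self.q[arm].append(reward)      # store excess delayed feedback

    def try_get(self, arm):
        """Pop the oldest stored reward for this arm, or None if empty."""
        return self.q[arm].popleft() if self.q[arm] else None
```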
Theorem 2. Let K be the number of agents, T be the horizon and Regret_k(T) be the regret of each agent k ∈ [K]. In the synchronous multi-agent online shortest path setting (i.e., a fleet of K agents / vehicles working in parallel in each time step), the total fleet regret incurred by invoking Algorithm 5 satisfies

E[ Σ_{k∈[K]} Regret_k(T) ] ≤ BayesRegret(KT) + O(K · |P|).

Proof. The result is obtained as a corollary of Theorem 6 in [34], which converts online algorithms for the non-delayed case into ones that can handle delays in the feedback (i.e., Algorithm 5), while retaining their theoretical guarantees. We consider the online shortest path problem as a standard stochastic bandit problem where the paths are the arms, and handle the multi-agent setting using Algorithm 5, as a sequential setting with delayed feedback. Let κ_t denote the feedback delay of the action at time t; in the synchronous setting, κ_t ≤ K. Then, according to [34], the regret of the delayed algorithm exceeds that of the non-delayed algorithm by an additive term linear in the number of arms and the maximum delay max_t κ_t. While this additional first term of the regret is constant in T, it is also linear in |P|, which may be exponential w.r.t. |E|.

Thompson Sampling with Batched Feedback
In order to remove the exponential factor in Theorem 2, we outline a second approach. While the synchronous multi-agent setting can be cast as a delayed feedback problem, the general delay model is not actually necessary. Since the updates are synchronous, viewing it as a batched problem setting is sufficient. In this setting, rewards for selected arms are received periodically at fixed intervals, i.e., in tumbling windows. We note that this problem formulation can be useful beyond the multi-agent setting, e.g., in environments where feedback may be delayed due to wireless connection problems.
The regret analysis is not as straightforward as that of Theorem 2. We combine ideas on batched bandit algorithms and their analyses from [24], [28] and [47] with the general proof framework for deriving Bayesian regret bounds introduced by [49]. Before considering the multi-agent case, we start by outlining Thompson Sampling for the batched combinatorial semi-bandit setting in Algorithm 6. Here, we first consider a general stochastic combinatorial semi-bandit problem (i.e., not limited to the online shortest path problem) where the reward of each base arm i ∈ A is drawn from N(θ*_i, σ²_i), with θ*_i ∼ N(µ_{i,0}, ς²_{i,0}) and finite (and known) variance σ²_i. Also, we let B be the total number of batches, each of size K, such that T = BK. Furthermore, we denote the last time step of each batch b ∈ [B] by t_b, i.e., t_b = bK. We also define the history H_t as the sequence of actions and rewards until time step t, such that H_t = (a_1, r_1(a_1), ..., a_{t−1}, r_{t−1}(a_{t−1})). Since the actions and rewards are random variables, H_t is a random variable as well. We denote a realization of H_t by H, i.e., a fixed history of actions and rewards.

5: a_t ← arg max_{a∈I} f_θ̃(a)
6: Play super-arm a_t
7: Observe batched rewards r_{t_{b−1}+1}, ..., r_{t_b}. Append the corresponding arms and rewards to the history of selected super-arms and received rewards, such that H_{t_b+1} = (a_1, r_1(a_1), ..., a_{t_b}, r_{t_b}(a_{t_b})).
8: Compute posterior parameters µ_{t_b}, ς_{t_b} given the history H_{t_b+1}.
In this problem setting and algorithm, the rewards for all arms played during a batch are received at the end of that batch. Hence, in each time step, parameters are sampled from the posterior distribution given the rewards observed at the end of the previous batch.
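The batched sampling-and-update loop described above can be sketched as follows. This is a simplified stand-in for Algorithm 6 with an explicit enumeration of super-arms instead of a shortest path oracle, Gaussian rewards with known observation variance, and conjugate Gaussian posterior updates performed only at the end of each batch; all names are illustrative.

```python
import numpy as np

def batched_ts(super_arms, theta_star, sigma2, mu0, var0, B, K, rng):
    """Sketch of batched Thompson Sampling for a combinatorial semi-bandit:
    within a batch, parameters are sampled from the posterior computed at the
    end of the previous batch; rewards arrive only when the batch closes."""
    mu, var = mu0.copy(), var0.copy()
    buffered = []   # rewards collected during the current batch
    plays = []
    for b in range(B):
        for t in range(K):
            theta = rng.normal(mu, np.sqrt(var))  # sample from current posterior
            a = max(super_arms, key=lambda s: sum(theta[i] for i in s))
            plays.append(a)
            r = rng.normal(theta_star[list(a)], np.sqrt(sigma2[list(a)]))
            buffered.append((a, r))
        # End of batch: conjugate Gaussian update with known noise variance,
        # using all rewards received at the end of the batch.
        for a, r in buffered:
            for i, ri in zip(a, r):
                prec = 1.0 / var[i] + 1.0 / sigma2[i]
                mu[i] = (mu[i] / var[i] + ri / sigma2[i]) / prec
                var[i] = 1.0 / prec
        buffered.clear()
    return plays, mu, var
```

Posterior parameters are frozen for K consecutive steps, which is exactly the source of the additive Õ(|A|K) term in Theorem 3.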

Regret analysis
We analyze the regret of this algorithm and establish the following bound.

Theorem 3. For Algorithm 6, with horizon T and batch size K, we have BayesRegret(T) = Õ(|A|K + |A|√T).
In order to prove Theorem 3, we need a few intermediary lemmas and assumptions. For base arm i, let θ̂_{i,t} be the average reward of i until time step t, and N_t(i) be the number of plays of i until time step t.
Assumption 1. For each base arm i ∈ A, the variance σ²_i is finite, and σ²_i ≤ 1.
Since we assume that the variance σ²_i of each base arm i ∈ A is finite, we let, for convenience of notation, σ²_i ≤ 1 for all i ∈ A (which can always be achieved by scaling the feedback distributions of all base arms).
Assumption 2. Given the horizon T and the number of base arms |A|, we have T ≥ |A|.
Assumption 3. Each base arm i ∈ A has been played once initially, such that N_0(i) = 1.
Assumptions 2 and 3 are mainly for convenience, to reduce the complexity of the proofs, whereas the finite variance assumption is needed for the concentration inequality we utilize in the proof of Lemma 7. We begin the analysis by defining upper and lower confidence bounds for a super-arm a and history H (as defined in Algorithm 6):

U(a, H) = Σ_{i∈a} (θ̂_i(H) + √(8 log T / N(i, H))),   L(a, H) = Σ_{i∈a} (θ̂_i(H) − √(8 log T / N(i, H))),

where θ̂_i(H) and N(i, H) denote the empirical mean reward and the number of plays of base arm i in the history H. Using these definitions, we can decompose the regret in a way similar to [49] as follows.

Lemma 4. The Bayesian regret of Algorithm 6 decomposes as

BayesRegret(T) = E[Σ_{b∈[B]} Σ_{t=t_{b−1}+1}^{t_b} (U(a_t, H_{t_{b−1}+1}) − L(a_t, H_{t_{b−1}+1}))] + E[Σ_{b∈[B]} Σ_{t=t_{b−1}+1}^{t_b} (L(a_t, H_{t_{b−1}+1}) − f_{θ*}(a_t))] + E[Σ_{b∈[B]} Σ_{t=t_{b−1}+1}^{t_b} (f_{θ*}(a*) − U(a*, H_{t_{b−1}+1}))].

Proof. By the definition of Bayesian regret, we have that

BayesRegret(T) = Σ_{b∈[B]} Σ_{t=t_{b−1}+1}^{t_b} E[E[f_{θ*}(a*) − f_{θ*}(a_t) | H_{t_{b−1}+1}]]   (tower rule).

Conditioned on the history H_{t_{b−1}+1}, up to and including the last batch b − 1, all super-arms a_t for t = t_{b−1}+1, ..., t_b and the optimal super-arm a* are identically distributed. Hence, for any realization H of H_{t_{b−1}+1}, we have that E[U(a_t, H)] = E[U(a*, H)], since U is a deterministic function of a super-arm and a history; adding and subtracting U inside the conditional expectation yields the stated decomposition.
To bound the last two terms of the decomposed Bayesian regret, we use the following lemma.
The two inequalities of Lemma 5 are proven in the same way, so we focus on the first one, which follows by applying Lemma 6 and Lemma 7. To bound the expected overestimation (or, correspondingly, underestimation) in the second-to-last inequality of the proof of Lemma 5, we derive two intermediate results, Lemma 6 and Lemma 7. For both lemmas, we let ν̄_{i,x} be the average reward of base arm i over the first x times it has been played, i.e., contained in any played super-arm. In other words, for each batch b ∈ [B] we have that θ̂_{i,t_{b−1}} = ν̄_{i,N_{t_{b−1}}(i)}. Additionally, for the proofs of both lemmas, we note that the average ν̄_{i,x} is Gaussian with mean θ*_i and variance σ²_i/x. Since, by Assumption 1, we have σ²_i ≤ 1, this implies that (ν̄_{i,x} − θ*_i) has mean 0 and variance at most 1.

Lemma 6. For any batch b ∈ [B] and base arm i ∈ A, the expected overestimation of base arm i (the inner expectation of Eq. 4.1) is bounded by the expression in Eq. 4.2.

Proof. For any fixed integer x > 0, the inner expectation in Eq. 4.1 is the expected value of the corresponding Gaussian distribution truncated below 0, which (by, e.g., Theorem 2 of [31]) is increasing in −√(8 log T / x). Consequently, Eq. 4.2 holds, and the claim follows by bounding the inner expectation of Eq. 4.1 using Eq. 4.2.
Proof. We perform a standard concentration analysis using union bounds and Hoeffding's inequality, adapted to the batched feedback setting; Hoeffding's inequality for 1-subgaussian random variables applies here, since θ*_i − ν̄_{i,x} is Gaussian with mean 0 and variance at most 1, by Assumption 1.

With the last two terms of the regret decomposition of Lemma 4 bounded using Lemma 5, we may focus on the first term. We split it into two parts: the first part bounds the regret resulting from the batch delays, while the second bounds the regret of the Thompson Sampling algorithm for the corresponding non-batched combinatorial semi-bandit setting. We bound the delay part directly, and then bound the second part as in the proof of Lemma 1 in [49]. This completes the proof of the lemma.
With these lemmas, we can finish the proof of Theorem 3.

Proof of Theorem 3. We bound the terms in the regret decomposition of Lemma 4 using Lemma 5 and Lemma 8, which yields the stated bound.

The result in Theorem 3 applies to a setting with unbounded Gaussian rewards. While general, it does not directly correspond to either of the models described in Section 2. However, it is straightforward to modify the proof so that it applies to a setting with rectified Gaussian base arm rewards (i.e., to a batched version of Algorithm 3).
Proposition 9. The Bayesian regret of Algorithm 6, modified to sample arms as in Algorithm 3, with horizon T and batch size K, satisfies BayesRegret(T) = Õ(|A|K + |A|√T).
Proof. Let f^R_θ(a) := −Σ_{i∈a} E_{z_i ∼ N^R(−θ_i, σ²_i)}[z_i] be the expected super-arm reward function for a combinatorial semi-bandit with rectified Gaussian base arm feedback. Note that, to connect the super-arm reward function to the rectified Gaussian model in Section 2.2 and the online shortest path problem formulation, we let base arm feedback be negative, with rectification above 0. The first term of the regret decomposition in Lemma 4 is bounded in Lemma 8 using only the confidence width term of the upper and lower confidence bounds, not involving the estimated expected super-arm rewards. Hence, under the assumption that we can use the same confidence bounds as in the (non-rectified) Gaussian setting, we only need to ensure that the bounds on the last two terms of the regret decomposition still hold. We can do this with a modification of the proof of Lemma 5.
After Eq. 4.3, the rest of the proof of Lemma 5 holds unmodified. Hence, the bound of Theorem 3 also holds in the case of rectified Gaussian base arm feedback.
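For reference, the mean of a Gaussian rectified at zero has the closed form E[max(X, 0)] = mΦ(m/s) + sφ(m/s) for X ∼ N(m, s²), which is what an implementation of the expected super-arm reward f^R_θ would evaluate. A small sketch (function names are our own, not from the paper's codebase):

```python
import math

def rectified_gaussian_mean(m, s):
    """E[max(X, 0)] for X ~ N(m, s^2): equals m*Phi(m/s) + s*phi(m/s)."""
    z = m / s
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard PDF
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard CDF
    return m * Phi + s * phi

def expected_super_arm_reward(theta, sigma, arm):
    """Sketch of f^R_theta(a) = -sum_i E_{z_i ~ N^R(-theta_i, sigma_i^2)}[z_i]:
    base arm feedback is negative, rectified above 0, and summed over the arm."""
    return -sum(rectified_gaussian_mean(-theta[i], sigma[i]) for i in arm)
```

For m far above zero the rectification is negligible and the mean approaches m, matching the intuition that rectification only matters for edges whose consumption is close to zero.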
We can extend this result to the multi-agent online shortest path setting through the following corollary (where the set of edges E corresponds to the set of base arms A used throughout the proof of Theorem 3). We note that, recently, a similar result was derived in [13] for frequentist regret in a linear contextual bandit setting.

Corollary 10. Let K be the number of agents, T be the horizon and Regret_k(T) be the regret of each agent k ∈ [K]. In the synchronous multi-agent online shortest path setting (i.e., a fleet of K agents / vehicles operating in parallel in each time step), the total fleet regret incurred by invoking Algorithm 6 satisfies

Σ_{k∈[K]} Regret_k(T) = Õ(|E|K + |E|√(KT)).

Proof. We prove this in the same way as Theorem 2, but using Theorem 3 instead of the result for QPM-D in [34].
For completeness, we also formally state the Bayesian regret upper bound mentioned in Section 3.2.1, as the following corollary of Theorem 3 and Proposition 9 with batch size 1.

Corollary 11. The Bayesian regret of Algorithm 1 is upper bounded by Õ(|E|√T).

This corollary matches the bound from Proposition 3 of [49], which can be applied to any combinatorial semi-bandit problem with a linear super-arm reward function, when seen as a special case of the linear bandit problem. However, our analysis does not assume that the prior distributions have bounded support.
One way to discuss the optimality of the upper bounds derived in Theorem 3 and Proposition 9 is to compare them with existing lower bounds. To our knowledge, there is no established lower bound for the specific setting studied in this work (i.e., the batched feedback combinatorial semi-bandit problem). However, there are related bounds from which one could possibly derive a lower bound, or in terms of which the upper bound can be discussed. Perchet et al. derived a lower bound (Theorem 4 in [43]) on the excess regret due to the delay in the two-armed bandit problem, which is a special case of our problem. Furthermore, there are lower bounds for the (non-delayed) combinatorial semi-bandit problem (e.g., by Kveton et al., Proposition 2 in [38]), which induce an unavoidable term in any lower bound for this problem.
Combining these two results yields a lower bound to which the upper bound we derive in Theorem 3 is not tight in the excess regret term, since the upper bound includes a linear dependence on the number of base arms. We conjecture that it should also be possible to adapt the lower bound (for linear contextual bandits with adversarially generated contexts) by Ren et al. in Theorem 1 of [47], which includes a square-root factor (i.e., √|A| in the notation used in our work) for the excess regret term. Notably, under both of these conjectured lower bounds, the Õ(√T) term of our upper bound is optimal up to polylogarithmic factors.

Experimental Results
In this section, we describe different experimental studies. For real-world experiments, we extend the simulation framework presented in [50] to network/graph bandits with general directed graphs, in order to enable exploration scenarios in realistic road networks. Furthermore, we add the ability to generate synthetic networks of specified size to this framework, in order to compare with the derived regret bounds (as the ground truth is provided for the synthetic networks). In all experiments, Dijkstra's algorithm is used to compute the shortest paths through the networks.

Real-World Experiments
For the experiments on real-world road networks, we study one scenario with realistic energy consumption distributions handled by the agents using misspecified wide prior distributions, and another scenario where the prior distributions are completely known and utilized by the agents. In the second scenario, the parameters of the underlying energy consumption distributions are sampled from the prior distributions before each experiment run, whereas in the first scenario, the underlying distributions are fixed over multiple runs. Based on the second setting, we also consider a third setting where the energy consumption of different edges is correlated.
For each of the settings, we perform experiments using data from three cities: Luxembourg, Monaco and Turin. For Luxembourg, specifically, we study two problem instances (denoted #1 and #2) with different source and target vertices. For these cities, we utilize the Luxembourg SUMO Traffic (LuST) [17], Monaco SUMO Traffic (MoST) [18] and Turin SUMO Traffic (TuST) [45] scenarios, respectively, to provide realistic traffic patterns and vehicle speed distributions for each hour of the day. This is used in conjunction with altitude data [21] and parameters from an electric vehicle. The resulting graph G for Luxembourg has |V| = 2247 nodes and |E| = 5651 edges, representing a road network with 955 km of highways, arterial roads and residential streets.
We use the default vehicle parameters provided for the energy consumption model in [7], with vehicle front surface area A = 8 m², air drag coefficient C_d = 0.7 and rolling resistance coefficient C_r = 0.0064. The vehicle is a medium duty truck with vehicle mass m = 14750 kg, which is the curb weight plus half of the payload capacity.
We approximate the powertrain efficiency during traction by η⁺ = 0.88 and the powertrain efficiency during regeneration by η⁻ = 1.2. In addition, we use the constant gravitational acceleration g = 9.81 m/s² and air density ρ = 1.2 kg/m³.
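The exact energy model of [7] is not reproduced here, but a standard longitudinal-dynamics formulation with the parameters above can be sketched as follows. The constant-speed, constant-grade treatment and the efficiency convention (dividing tractive work by η⁺ and regenerated work by η⁻ > 1, which caps recuperation) are our assumptions for illustration.

```python
import math

# Vehicle and environment constants from Section 5.1 (medium duty truck).
A, C_D, C_R = 8.0, 0.7, 0.0064   # frontal area [m^2], air drag, rolling resistance
M, G, RHO = 14750.0, 9.81, 1.2   # mass [kg], gravity [m/s^2], air density [kg/m^3]
ETA_POS, ETA_NEG = 0.88, 1.2     # powertrain efficiency: traction / regeneration

def edge_energy_wh(distance_m, speed_mps, grade):
    """Illustrative per-edge energy [Wh] at constant speed on a constant grade
    (grade = rise/run, i.e., tan of the road angle)."""
    theta = math.atan(grade)
    force = (M * G * C_R * math.cos(theta)          # rolling resistance
             + 0.5 * RHO * C_D * A * speed_mps**2   # aerodynamic drag
             + M * G * math.sin(theta))             # grade resistance
    work_j = force * distance_m
    # Battery-side energy: losses reduce recuperation and increase traction draw.
    energy_j = work_j / ETA_POS if work_j > 0 else work_j / ETA_NEG
    return energy_j / 3600.0
```

On a sufficiently steep downhill segment the net force is negative and the returned energy is negative, i.e., the battery is recharged, which is the source of the negative edge weights discussed in the introduction.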

Prior distribution misspecified by agent
In this set of experiments, with results shown in Figure 1 and Table 1, we study a scenario where agents do not have access to the true prior distributions of the environment. To simulate the ground truth of the energy consumption, we take the average speed v_e of each edge e from a full 24-hour scenario in each city road network. In particular, for LuST we observe the values during a peak hour (8 AM), with approximately 5500 vehicles active in the network. This hour is selected to increase the risk of traffic congestion, which makes finding the optimal path more challenging. We also obtain the variance of the speed of each road segment from the SUMO scenarios. Using this information, we sample the speed value for each visited edge and use the energy consumption model to generate the rewards for the arms.
For the probabilistic model, we assume σ_e to be proportional to E_e in Eq. 2.1, such that σ²_e = (φE_e)², where we set φ = 0.1. For the prior distribution of an edge e ∈ E, we misspecify it by using the speed limit of e as v_e, indicating that the real average speed is unknown. Then µ_{e,0} = −E_e and ς²_{e,0} = (ϑµ_{e,0})², where ϑ = 0.25. As a baseline, we consider the greedy algorithm for both the rectified Gaussian and Log-Gaussian models, where the exploration rule is to always choose the path with the lowest currently estimated expected energy consumption, similar to the recent method in [8].
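The construction of the likelihood and misspecified prior parameters described above can be sketched as follows; the function name and dictionary layout are illustrative, and E_e is assumed to be the deterministic consumption computed from the speed limit rather than the unknown true average speed.

```python
def misspecified_prior(expected_consumption_wh, phi=0.1, vartheta=0.25):
    """Build per-edge likelihood and prior parameters as in Section 5.1.1:
    sigma_e^2 = (phi * E_e)^2, mu_{e,0} = -E_e (reward = negative consumption),
    varsigma_{e,0}^2 = (vartheta * mu_{e,0})^2."""
    params = {}
    for e, E_e in expected_consumption_wh.items():
        params[e] = {
            "sigma2": (phi * E_e) ** 2,     # observation noise variance
            "mu0": -E_e,                    # misspecified prior mean
            "var0": (vartheta * E_e) ** 2,  # prior variance
        }
    return params
```

With φ = 0.1 and ϑ = 0.25, an edge with E_e = 100 Wh gets observation noise variance 100 and prior variance 625, i.e., a deliberately wide prior around the speed-limit-based estimate.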
We run the simulations for the BayesUCB, TS and greedy algorithms with a horizon of T = 2000 (i.e., T = 2000 time steps). Table 1 and Figures 1b, 1d, 1f and 1h show the cumulative regret for the rectified Gaussian and Log-Gaussian models (indicated in all tables and figures with the prefixes "N-" and "LN-", respectively, before the name of each algorithm), where the regret is averaged over 10 runs for each agent in each city. The intuition is that the energy saved by using the TS and BayesUCB agents instead of the baseline greedy agent is the difference in regret, expressed in watt-hours. It is clear that Thompson Sampling with the Log-Gaussian model has the best performance in terms of cumulative regret, but the other non-greedy agents also achieve good results. To illustrate that Thompson Sampling explores the road network in a reasonable way, Figures 1a, 1c, 1e and 1g visualize the road network and the paths visited by this exploration algorithm in each city. Each plot displays all paths visited by the agent during a single experiment, where more frequently traveled paths are indicated with darker shades of red. We observe that in Figures 1a, 1c and 1g, no significant detours are performed, in the sense that most paths are close to the optimal path. While there are some detours shown in Figure 1e, we note that the distances in Monaco are small compared to the other cities, and that Figure 1f indicates that the detours do not result in much additional regret.
For the multi-agent case, we use LuST and a horizon of T = 100 and 10 scenarios where we vary the number of concurrent agents by K ∈ [1,10] in each scenario. The cumulative regret averaged over the agents in these scenarios is shown in Figure 2 for each K. In the figure, the final cumulative regret for each agent decreases sharply with the addition of just a few agents to the fleet. This continues until there are five agents, after which there seems to be diminishing returns in adding more agents. While there is some overhead (parallelism cost), just enabling two agents to share knowledge with each other decreases their average cumulative regret at t = T by almost a third. This observation highlights the benefit of providing collaboration early in the exploration process, which is also supported by the regret bound in Corollary 10.

Prior distribution known by agent
In Section 5.1.1 we had realistic unknown energy consumption distributions (fixed across all experiment runs), handled by the agents using misspecified prior distributions. For the second set of experiments, with results shown in Figure 3 and Table 2, we instead assume that the prior distributions are completely known by the agents. In other words, the environment samples the unknown mean vector θ * from the prior before all of the agents are applied to the problem instance specified by θ * . Again, the regret results are averaged over 10 runs of each agent, in this setting resulting in an estimate of the Bayesian regret for each agent.
Since we assume that each agent is aware of the true prior distribution in this problem setting, we settle on the rectified Gaussian model of energy consumption for these experiments, with Gaussian prior distributions. As replacements for the Log-Gaussian agents, we increase the number of baselines by implementing a version of ε-greedy adapted to combinatorial semi-bandits, based on Algorithm 1 introduced in the supplementary material of [15]. The key steps of Algorithm 7 are: sample an edge (u_h, u_{h'}) uniformly from E, or let a_t be the shortest path w.r.t. µ_{t−1} between the source and target vertices; then play a_t and update the posterior parameters µ_t, ς_t using the observed rewards r_t(a_t). As outlined in Algorithm 7, at each time step t, with probability ε_t, we select an edge (u_h, u_{h'}) ∈ E uniformly at random. We then find the shortest paths, with respect to the posterior mean vector, between (1) the source vertex of the problem instance and u_h, and (2) u_{h'} and the target vertex. The resulting concatenated path, including the edge (u_h, u_{h'}), is used to explore the road network graph. With probability 1 − ε_t, we instead greedily select the shortest path between the source and target vertices, exploiting the current posterior mean estimates.
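This via-edge exploration rule can be sketched as a self-contained toy implementation with a minimal Dijkstra routine; the function names and graph representation are our own, not the paper's codebase.

```python
import heapq
import random

def dijkstra(graph, src, dst):
    """Shortest path by cost; graph: {u: [(v, w), ...]} with weights w >= 0.
    Assumes dst is reachable from src."""
    dist, prev, pq = {src: 0.0}, {}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, u = [dst], dst
    while u != src:
        u = prev[u]
        path.append(u)
    return path[::-1]

def eps_greedy_path(graph, edges, src, dst, eps, rng=random):
    """Sketch of the adapted epsilon-greedy rule: with probability eps, force
    the route through a uniformly drawn edge (u, v); otherwise act greedily."""
    if rng.random() < eps:
        u, v = rng.choice(edges)
        # Concatenate src -> u, the forced edge (u, v), and v -> dst.
        return dijkstra(graph, src, u) + dijkstra(graph, v, dst)
    return dijkstra(graph, src, dst)
```

Here the edge weights would be the current posterior means −µ_{t−1} of the energy consumption, so the greedy branch exploits while the via-edge branch guarantees that every edge retains some probability of being visited.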

We evaluate agents using constant values of ε_t (0.1 and 0.5), as well as an agent with ε_t decaying in t (with ε_t = 1/t). We motivate the latter with Theorem 4 in the supplementary material of [15], where the authors show a sub-linear upper bound on the expected regret of their ε_t-greedy algorithm, with ε_t on the order of 1/t (with an additional constant factor derived from information about the problem instance). As shown in Figures 3a, 3b, 3c and 3d, the results from the experiments with the TS, BayesUCB and (pure) greedy agents closely match the corresponding experiments in the misspecified prior problem setting of the previous section, while ε-greedy with decaying ε_t has performance comparable to the greedy agent. The ε-greedy agents with constant ε_t perform consistently worse than the other agents. As also supported by Table 2, the regret of the TS agent still saturates rapidly, and TS achieves the best average regret of the evaluated agents in all cities.

Networks with correlated edge weights
To demonstrate that the proposed framework performs well even when a few environment assumptions are relaxed, we run an additional set of experiments in a variation of the setting described in Section 5.1.2, with results shown in Figure 4 and Table 3. Whereas in the previous sections the stochastic weights of all edges are assumed to be mutually independent, we now introduce correlation between edge weights. An example of this in real-world road networks is that traffic congestion on one road segment is likely to affect nearby road segments as well. As in the previous section, a mean vector θ* unknown to the agents is generated by the environment, where each element is sampled independently from the (Gaussian) prior distribution of each edge in the road network. Subsequently, we randomly assign all edges in E to a set of |E|/2 pairs of edges. We let the energy consumption of the individual edges in each such pair (e, e′) ∈ E × E be perfectly correlated, but we define the marginal distributions according to the model in Section 2.2. In each time step, we jointly sample the energy consumption of each pair (e, e′) from a two-dimensional distribution with mean vector θ*_(e,e′) and covariance matrix Σ_(e,e′), which (given the perfect correlation and the marginal variances σ²_e and σ²_{e′}) has diagonal entries σ²_e and σ²_{e′}, and off-diagonal entries σ_e σ_{e′}. Beyond the generation of correlated energy consumption by the environment, the experiments are set up exactly as in Section 5.1.2. The agents are assumed to be unaware of the correlation, and only attempt to estimate the parameters of the marginal distributions. As shown in Figures 4a, 4b, 4c and 4d, as well as in Table 3, when compared with the results in the previous section, the performance of the agents is not noticeably affected by the presence of correlation.
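Jointly sampling a perfectly correlated pair while preserving the Gaussian marginals can be sketched as follows (names are illustrative); the off-diagonal covariance σ_e σ_{e′} makes the correlation coefficient equal to one.

```python
import numpy as np

def sample_correlated_pair(theta, sigma, rng):
    """Jointly sample consumption for an edge pair (e, e'):
    theta = (theta_e, theta_e'), sigma = (sigma_e, sigma_e').
    Covariance sigma_e * sigma_e' gives correlation 1 while keeping the
    marginals N(theta_e, sigma_e^2) of the model in Section 2.2."""
    cov = np.array([[sigma[0] ** 2, sigma[0] * sigma[1]],
                    [sigma[0] * sigma[1], sigma[1] ** 2]])
    return rng.multivariate_normal(theta, cov)
```

Note that the covariance matrix is singular (rank one), which is expected for perfect correlation: both coordinates are driven by a single shared Gaussian factor.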

Synthetic Networks
In order to evaluate the regret bound in Proposition 1, we design synthetic directed acyclic network instances G(V, E, w) with a specified number of vertices n and number of edges o (with the constraint that n − 1 ≤ o ≤ n(n − 1)/2). We start the procedure by adding n vertices u_1, ..., u_n to V. Then, for each h ∈ [1, n − 1], we add an edge (u_h, u_{h+1}) to E. This ensures that the network contains a path visiting all vertices in V. Finally, we add the remaining o − (n − 1) edges (u_h, u_{h'}) uniformly at random to E, such that h < h′ and h′ ≠ h + 1.
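The generation procedure above can be sketched as follows (the function name is our own); vertices are represented by their indices 1, ..., n.

```python
import random

def synthetic_dag(n, o, rng=random):
    """Generate a synthetic DAG with n vertices and o edges: a chain
    u_1 -> ... -> u_n guaranteeing a path through all vertices, plus random
    forward 'shortcut' edges (u_h, u_h') with h < h' and h' != h + 1."""
    assert n - 1 <= o <= n * (n - 1) // 2
    edges = {(h, h + 1) for h in range(1, n)}  # the Hamiltonian chain
    while len(edges) < o:
        h = rng.randrange(1, n - 1)            # 1 .. n-2
        hp = rng.randrange(h + 2, n + 1)       # h+2 .. n, skips the chain successor
        edges.add((h, hp))                     # set semantics avoid duplicates
    return sorted(edges)
```

Rejection via set membership keeps the edge count exact, and the constraint o ≤ n(n − 1)/2 guarantees that enough distinct forward edges exist for the loop to terminate.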
Since these networks are synthetic, instead of modeling probabilistic energy consumption we design instances in which it is difficult for an exploration algorithm to find the path with the lowest expected cost. Given a synthetic network G generated according to the aforementioned procedure, we select p = u_1, ..., u_n to be the optimal path; in other words, p contains every vertex u ∈ V. The reward distribution of each edge e in p is chosen to be N(−Ẽ_e | θ*_e, σ²_e) with θ*_e = −10, while the expected cost of each shortcut edge grows with the number of vertices skipped by the shortcut. This guarantees that, no matter the size of the network and the number of edges that form shortcuts between vertices in p, p will always have a lower expected cost than any other path in G.
For the agent prior N(θ*_e | µ_{e,0}, ς²_{e,0}), we set µ_{e,0} = −11(h′ − h) and ς²_{e,0} = 8. This choice of prior mean implies that, according to our prior beliefs, every path from the source u_1 to the target u_n initially has the same estimated expected cost.
We run the synthetic network experiment with T = 2000 time steps, varying the number of vertices |V| ∈ {30, 40, 50, 60} and edges |E| ∈ {200, 250, 300, 350, 400}. In Figure 5, each plot represents the cumulative regret at T = 2000 for a fixed |V|, as a function of |E|. We observe that the regret increases no more than linearly with the number of edges, which is consistent with the theoretical regret bound in Corollary 11.

Conclusion
We developed a Bayesian online learning framework for the problem of energy efficient navigation of electric vehicles. Our Bayesian model assumes a rectified Gaussian or Log-Gaussian energy model. To learn the unknown parameters of the model, we adapted exploration methods such as Thompson Sampling and BayesUCB within the online learning framework. We extended the framework to multi-agent and batched feedback settings, and established theoretical regret bounds. Finally, we demonstrated the performance of the framework with several real-world and synthetic experiments.

Õ(·)  Order of a function (excluding polylogarithmic factors)
P(·)  Probability distribution of a random variable
Pr{·}  Probability of an event
Q(β, λ)  Quantile function of distribution λ with probability threshold β
Queue[a]  Delayed feedback queue of super-arm a
r_t(a)  Reward of super-arm a at time t
Regret(T)  Frequentist regret until horizon T
Regret_k(T)  Frequentist regret of agent k until horizon T
U(a, H)  Upper confidence bound of super-arm a given history H
Var[·]  Variance of a random variable
φ(x)  Standard Gaussian probability density function (PDF)
Φ(x)  Standard Gaussian cumulative distribution function (CDF)

Table A1: Summary of the notation used throughout the paper.