Solving Large-Scale Dynamic Vehicle Routing Problems with Stochastic Requests

Dynamic vehicle routing problems (DVRPs) arise in several applications such as technician routing, meal delivery, and parcel shipping. We consider the DVRP with stochastic customer requests (DVRPSR), in which vehicles must be routed dynamically with the goal of maximizing the number of served requests. We model the DVRPSR as a multi-stage optimization problem, where the first-stage decision defines route plans for serving scheduled requests. Our main contributions are knapsack-based linear models that accurately approximate the expected reward-to-go, measured as the number of accepted requests, at any state of the stochastic system. These approximations are based on representing each vehicle as a knapsack with a capacity given by the remaining service time available along the vehicle's route. We combine these approximations with optimal acceptance and assignment decision rules and derive efficient and high-performing online scheduling policies. We further leverage good predictions of the expected reward-to-go to design initial route plans that facilitate serving dynamic requests. Computational experiments on very large instances based on a real street network demonstrate the effectiveness of the proposed methods in prescribing high-quality offline route plans and online scheduling decisions.


Introduction
We consider the dynamic vehicle routing problem with stochastic requests (DVRPSR), a fundamental problem in freight transport. In the multi-vehicle DVRPSR, a set of static requests is known in advance, while other requests (hereafter, dynamic requests) arrive randomly during the service period. Vehicles depart from a given location, and must return to that location before the end of the service period. The goal is to serve as many customer requests as possible by assigning an initial route plan to each vehicle, and by routing vehicles dynamically in response to new requests.
Previous research on the DVRPSR is mostly restricted to small- and medium-scale instances on synthetic graphs with Euclidean or Manhattan distances (e.g., Azi et al., 2012; Klapp et al., 2018; Ulmer et al., 2018a; Voccia et al., 2019). This paper scales the DVRPSR up to practical sizes: we address the dynamic and real-time routing of multiple vehicles on a real street network with more than 16,000 nodes (intersections), each corresponding to a potential customer location. The tight time frames of real-time decision-making pose considerable challenges when solving the DVRPSR in this large-scale setting.
A solution to the DVRPSR consists of an initial route plan and an online scheduling policy.
The former is computed offline, i.e., before the beginning of the service period, and defines the initial vehicle routes covering all static requests. In many recently developed DVRPSR solution methods, initial route plans are computed by myopic heuristics such as cheapest insertion (CI) and savings heuristics (e.g., Ulmer et al., 2018a; van Heeswijk et al., 2019), which attempt to minimize the travel time required for serving static requests but completely ignore the probable realization of dynamic requests. Ideally, however, the initial route plan should also facilitate the acceptance of dynamic requests during the service period. As shown by Bent & Van Hentenryck (2004) and Ferrucci & Bock (2016), exploiting the stochastic knowledge of dynamic requests can improve the quality of the initial routes.
In addition to the initial route plan, the DVRPSR requires an online policy to quickly determine acceptance/rejection decisions, request-vehicle assignments, and updated routes. Value function approximation (VFA) and rollout algorithms are the most widely used approximate dynamic programming (ADP) methods for computing online policies (e.g., Klapp et al., 2018; van Heeswijk et al., 2019). Both methods estimate the expected reward-to-go by simulation and can effectively solve small DVRPSR instances. However, when the problem scale becomes realistically large, the application of many VFA policies (e.g., those developed in Ulmer et al., 2018a) is restrained by the high dimensionality of lookup tables. Rollout algorithms are unsuitable for large-scale problems as well, since they perform all simulations online and their solution quality and computation time depend largely on the number of simulated scenarios.
Our key methodological contributions are knapsack-based linear models to approximate the expected reward-to-go at any state of the dynamic system. We derive these models by representing each vehicle as a knapsack whose capacity is given by the remaining available service time (or budget) along the vehicle's route. We assess the cost (or budget consumption) of assigning future requests to a vehicle by predicting the vehicle's future location along its planned route. We combine these approximations with the decision rules of an optimal policy in order to derive high-performing and efficient scheduling policies. Moreover, we demonstrate how these approximations can be used to select initial route plans that facilitate the acceptance of requests in the dynamic phase.

Figure 1 notes. The green dots represent the served requests; the red dots represent the accepted requests that have not yet been served; the red circle represents a dynamic request that has just arrived; the blue square in the southeast represents the depot; the other squares represent the vehicles' current locations, and the lines represent the remaining planned routes (each vehicle is associated with a specific color).
To summarize, the main contributions of this paper are the following:

1. We propose a sequential optimization model for the DVRPSR, which jointly optimizes the offline decision concerning the initial route plan and the online decisions for scheduling dynamic requests. This is the first DVRPSR model that incorporates, at the same time, multiple vehicles, a real street network, and real-time decision-making. We further characterize the decision rules of optimal scheduling policies.

2. We develop interpretable knapsack-based approximations of the expected reward-to-go, which accurately and efficiently predict the expected number of future accepted requests from any state in our DVRPSR model. These approximations improve the service area coverage achieved by the initial route plan and, in combination with the optimal decision rules, lead to high-performing and efficient online scheduling policies.

3. We demonstrate, through an extensive computational study on large-scale instances, the merits of initial route plans that consider the spatiotemporal distribution of dynamic requests. Further, we show that the scheduling policies based on the knapsack approximations are suitable for real-time decision-making and, at the same time, outperform conventional ADP methods widely applied to solve dynamic vehicle routing problems (DVRPs).

4. We make freely available the implementation of a generic DVRPSR simulator (see Figure 1 for a snapshot and https://youtu.be/D57xNfU73as for animated simulations) and a set of standard benchmark instances to allow reproducibility and comparison of results.
The remainder of this paper is organized as follows. Section 2 reviews the relevant literature.
Section 3 introduces a sequential optimization model for the DVRPSR. Section 4 and Section 5 elaborate on online scheduling policies and offline planning algorithms, respectively. The results of the computational study are presented and discussed in Section 6. Finally, Section 7 draws concluding remarks and proposes directions for further research.

Literature Review
A detailed review of contributions to the DVRPSR is presented in Section 2.1. The applications of ADP to DVRPs are discussed in Section 2.2.

Dynamic Vehicle Routing with Stochastic Requests
We first focus on the previous research on the DVRPSR. Table 1 provides an overview of the relevant contributions. In terms of problem settings, we indicate, for each paper, whether decisions are made in real-time (i.e., whether decision epochs are triggered by request arrivals), the maximum expected number of requests, the maximum number of vehicles, and the graph type (synthetic graph or street network). An "n/a" (not applicable) in the column "Veh." means that the fleet size is not constrained. Table 1 also summarizes the methodologies employed for computing initial route plans and online scheduling policies. An "n/a" in the column "Initial plan" indicates that static requests are not considered.
Early research on the DVRPSR proposes and evaluates several policies that follow simple decision rules (e.g., first-come, first-served) and ignore probabilistic knowledge about future requests when prescribing decisions (e.g., Bertsimas & Van Ryzin, 1991; Tassiulas, 1996; Gendreau et al., 1999; Larsen et al., 2002). Chen & Xu (2006) develop a dynamic column generation framework to solve a DVRPSR, assuming that no probabilistic information on future requests is available. Hvattum et al. (2006) and Ichoua et al. (2006) are among the earliest works to show that exploiting stochastic information about dynamic requests leads to significantly better initial route plans and online scheduling policies. The multiple scenario approaches (MSAs) presented in Bent & Van Hentenryck (2004) and Voccia et al. (2019) compute and reoptimize route plans for multiple scenarios, each corresponding to a static VRP involving known requests and randomly sampled future requests. From the multiple route plans, the best one is selected by consensus functions. Ichoua et al. (2006) and Ferrucci & Bock (2015) integrate randomly sampled future requests into tabu search metaheuristics for static VRPs, from which anticipatory DVRPSR policies are derived. MSAs and anticipatory (meta-)heuristics are applied to DVRPSRs by decomposing the problem into sequential static VRPs (e.g., Hvattum et al., 2006; Ferrucci & Bock, 2015; Klapp et al., 2018). In these methods, decision epochs occur at the end of fixed time intervals.
Thus, there is usually sufficient computation time for reoptimizing static VRPs, but new requests must wait until the end of the current time interval before being scheduled or rejected.
To improve customer responsiveness, the decision epochs in many DVRPSRs are triggered by events, e.g., when a vehicle arrives at a node or when a new request arrives. In these cases, sequential stochastic optimization models are more suitable than sequential static VRP formulations. For example, Thomas (2007) and Ulmer et al. (2018a) propose Markov decision processes for single-vehicle DVRPSRs, where a decision epoch occurs every time the vehicle stops at a customer location or returns to the depot. At each epoch, a subset of the newly arrived requests is accepted, and the vehicle's route is updated accordingly.
In most DVRPSRs, some dynamic requests may be rejected due to the limited number of vehicles and hard deadlines of service periods (e.g., Klapp et al., 2018;Ulmer et al., 2018a;Voccia et al., 2019). It is hence necessary to consider customers' expectations for quick responses from the service provider. If a decision epoch occurs only when a vehicle stops or at predetermined time instants, most responses to dynamic requests will be delayed. Column "RT" of Table 1 shows that only 4 models define decision epochs as the moments when new requests arrive. Among these, Azi et al. (2012) spend more than 200 seconds to compute a decision for small instances with 3 vehicles and 144 expected requests, whereas in the remaining works, the computation times are not specified.
Concerning problem scales, most DVRPSR instances in the previous research are defined on synthetic graphs with modest numbers of requests and vehicles. In van Heeswijk et al. (2019), several instances based on the real street network of Copenhagen (Denmark) are solved, but the authors consider only 10 customer locations, so the problem scale is still limited. Ferrucci & Bock (2015) formulate test instances on the real street network of Dortmund (Germany), but neither the number of nodes (potential customer locations) nor the geographical distribution of dynamic requests is specified. Therefore, no research has addressed solving large-scale DVRPSR instances in real-time.

Approximate Dynamic Programming in DVRPs
Sequential stochastic optimization (also referred to as Markov decision process) is a commonly used mathematical model for DVRPs (including DVRPSRs) (Pillac et al., 2013;Psaraftis et al., 2016;Ritzinger et al., 2016;Soeffker et al., 2022). However, in practice, solving these models exactly by classical dynamic programming is computationally prohibitive because of the "curses of dimensionality". Consequently, a variety of ADP methods have been developed for DVRPs. The basic idea of ADP is to make decisions according to the estimated reward-togo, which is generally obtained by simulation. We refer the interested reader to Powell (2011) and Powell et al. (2012) for tutorials on the different ADP methods and their applications to various transportation and logistics problems. Powell (2011) classifies ADP into four broad categories: myopic policies, look-ahead policies, value function approximation (VFA), and policy function approximation (PFA). Myopic policies guide the selection of decisions in such a way that immediate rewards are maximized.
For DVRPs, these policies tend to perform poorly because they do not take into account any forecast information about dynamic requests. Look-ahead policies, on the contrary, approximate the reward-to-go by explicitly sampling and simulating the future. Rollout policies are a typical look-ahead policy class, initially proposed by Bertsekas & Tsitsiklis (1996) (Powell & Meisel, 2015). VFA policies, in turn, conduct their simulations offline and store the resulting value estimates, e.g., in lookup tables, whereas parametric VFA (P-VFA) fits an analytical function of the state. P-VFA requires less computational effort, but its linear basis functions ignore non-linearities and cannot approximate the value function in every detail.
The last category of ADP policies, PFA, is similar to VFA in the sense that all simulations are conducted offline, but it is easier to implement because it usually only requires tuning the value of a parameter rather than approximating a complex value function. PFA policies are suitable for cases where the structure of a good decision policy can be easily observed and captured by an analytical function, which returns a decision given a state (Powell et al., 2012). For instance, one PFA policy in the literature is based on the intuition that it is beneficial to serve remote customers by drones and nearby customers by vehicles; with offline simulations, the authors tune a threshold determining whether a customer is far enough from the depot that it should be served by a drone.
The three categories of anticipatory ADP policies (i.e., look-ahead, VFA, and PFA) require a sufficiently large number of online or offline simulations. As a result, it is hard to balance their computational efficiency and solution quality. To the best of our knowledge, there is no established method that accurately solves large-scale DVRPSR instances while respecting the tight time frames imposed by real-time decision-making. This paper fills this gap by proposing accurate and efficient approximations of the reward-to-go, which lead to practical algorithms for solving large-scale DVRPSRs in real-time. Moreover, the proposed approximations are based on sample scenarios and hence cope better with heterogeneous stochastic information (e.g., time-varying request rates) than VFA (Soeffker et al., 2022). With a set of PFA and rollout policies as benchmarks, we validate through an extensive computational study that the proposed algorithms outperform these conventional ADP policies in terms of both efficiency and quality.

Problem Definition
In this section, we formally introduce the DVRPSR. Section 3.1 presents general definitions, and Section 3.2 models the DVRPSR as a sequential stochastic optimization problem. The notations used in this section are summarized in Table A.1 in Appendix A.

General Definitions
The DVRPSR is defined on a strongly connected digraph G = (V, A), where V is the set of nodes and A is the set of arcs. Graph G represents the street network of the service area, and sets V and A represent, respectively, road intersections and road segments. We assume that all potential customers are located at intersections. The length of each segment (i, j) ∈ A is given by d ij . The service period is represented by the interval [0, U]. We use interchangeably the terms time and instant to refer to a value u ∈ [0, U]. Before the service period begins, a fleet of K vehicles is stationed at node 0, which corresponds to the location of the depot. Vehicles travel at a constant speed s, so the travel time along a segment (i, j) is given by t ij = d ij /s. Further, the duration (under speed s) of the fastest path from node i to node j is given by t(i, j).
A customer request consists of a triple (u, i, d), where u ∈ [0, U] is the request arrival time (alternatively, the time when the request is generated by the customer), i ∈ V \ {0} is the customer location, and d ∈ R >0 is the duration of the request, i.e., the service time required by the customer to fulfill the request. A set S of static requests is known before the beginning of the service period. Static requests must be served by the end of the service period. In addition, dynamic requests arrive randomly during the service period according to a non-homogeneous Poisson process with request rate Λ(u) = Σ_{i ∈ V\{0}} λ i (u), where the λ i (u) are node-specific request rates. We denote by D(u) the ordered set of dynamic requests that have arrived up to instant u. Once a dynamic request arrives, the scheduling policy in effect decides whether to accept or to reject the request. Accepted requests must be served before the end of the service period, and rejected requests are lost (it is assumed that rejected requests are served by competitors).
The scheduling policy must take a decision in an online fashion, that is, almost immediately after a dynamic request arrives. A static or accepted dynamic request (u, i, d) is assigned to a vehicle k ∈ {1, . . . , K}. The request is served by routing vehicle k to location i and keeping the vehicle there for an amount of time d. Once a request is served, the vehicle proceeds to the location of another request, or back to the depot if there are no further requests assigned to that vehicle. In particular, waiting is not allowed, except when a vehicle is idle at the depot.
The goal in the DVRPSR is to route the K vehicles, so that all requests in S are served, all vehicles return to the depot no later than instant U , and the number of dynamic requests accepted and served is maximized.
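For intuition, a sample path of the non-homogeneous Poisson arrival process above can be generated by standard thinning; the rate-function signature `lam(u, i)` and the `durations` helper below are assumptions of this sketch, not part of the model.

```python
import random

def sample_requests(lam, lam_max, U, nodes, durations, seed=None):
    """Sample one path of dynamic requests (u, i, d) over [0, U] by thinning.

    lam(u, i)    : node-specific rate lambda_i(u) (illustrative signature).
    lam_max      : an upper bound on the total rate Lambda(u) over [0, U].
    durations(i) : service duration of a request at node i (illustrative).
    """
    rng = random.Random(seed)
    path, u = [], 0.0
    while True:
        u += rng.expovariate(lam_max)       # candidate arrival of the bounding process
        if u > U:
            return path
        rates = [lam(u, i) for i in nodes]
        if rng.random() < sum(rates) / lam_max:  # keep with probability Lambda(u)/lam_max
            i = rng.choices(nodes, weights=rates)[0]
            path.append((u, i, durations(i)))
```

With constant rates the thinning step always accepts and the path reduces to an ordinary Poisson process.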

A Sequential Stochastic Optimization Model for the DVRPSR
We now propose a model for the DVRPSR inspired by the unified ADP framework of Powell et al. (2012) and by route-based Markov decision processes recently proposed for modeling DVRPs.

Decision Epochs and States
Decision epochs correspond to the moments when decisions must be taken. In the DVRPSR, in addition to a first decision concerning the initial route plan to serve requests in S, a decision is also made whenever a dynamic request arrives. Hence, there are in total 1 + |D(U )| decision epochs, where |D(U )| is uncertain.
The state of the system at the moment a decision must be made is represented by state variables S t , t ∈ {0, . . . , T }, where T = |D(U )| is the last decision epoch. State S 0 represents all parameters of the problem, since no uncertainty has been realized before the beginning of the service period. Next, we formally define routes, which are instrumental for characterizing all other states S t , t ≠ 0.

Definition 1 (Route). A route θ is a sequence of f nodes in G, where each position i ∈ {1, . . . , f − 1} is associated with the (possibly empty) set of requests to serve when arriving at the i-th node in θ. We denote by ∆(θ) the set of all requests scheduled along θ, by ∆ u (θ) the subset of those requests not yet served at instant u, and by b(τ, θ) the budget of route θ when started at instant τ, that is, the remaining service time available along the route before it must end at instant U.

A vehicle may be either in an idle state, in which case we say that V k = ∅, or in a routing state executing a route θ k , in which case we say that V k = (τ k , θ k ), where τ k is the instant when vehicle k started route θ k and b(τ k , θ k ) ≥ 0. As u t , d t , and τ k are continuous variables, the number of possible states is infinite. The idea of incorporating route information in the state space was originally proposed in route-based Markov decision process models of DVRPs, and it is particularly useful here as our solution methods rely heavily on information about planned routes to guide online decisions. When in a routing state, a vehicle must be either in transit (traveling along a road segment) or in service (fulfilling a request at a customer).
Since travel times and service times of accepted requests are deterministic, a state collects sufficient information to define precisely the locations of all vehicles. In particular, given a state it is possible to identify the status of each vehicle (idle, in transit or in service), to compute the remaining time before a vehicle reaches the endpoint of the road segment along which it currently travels (if the vehicle is in transit), or to compute the remaining time that a vehicle will stay at a particular node serving requests (if the vehicle is in service).

Offline and Online Decisions
It is convenient to discuss separately the decision at S 0 and the decisions at other states S t , t ≠ 0, since they are structurally different. We denote by F the feasible decision set at state S 0 . Set F contains all sets of routes {θ 1 , . . . , θ k }, k ≤ K, such that b(0, θ 1 ), . . . , b(0, θ k ) ≥ 0 and ∆(θ 1 ), . . . , ∆(θ k ) is a partition of S. We represent generically the decision made at S 0 by a decision vector y ∈ F. This decision corresponds to an initial route plan and is computed offline (i.e., before the start of the service period) by, e.g., one of the algorithms described in Section 5.
We now consider a state S t = (V 1 , . . . , V K , r t ), t ≠ 0, that is, a state where the dynamic request r t = (u t , i t , d t ) has just arrived and an online decision must be taken, in real-time.
Given S t , a scheduling policy is the set of rules that determines: (i) An acceptance decision: this is an 'accept' or 'reject' decision relative to request r t .
(ii) An assignment decision: if the request is accepted, this decision allocates the request to a vehicle k ∈ {1, . . . , K}.
(iii) A routing decision: when the request is allocated to vehicle k, this decision defines how route θ k will be adjusted (in case V k = (τ k , θ k )) or initialized (in case V k = ∅) to accommodate the accepted request.
There are 1 + K combinations of acceptance and assignment decisions, but in general there are many more possibilities concerning routing decisions, especially when considering the road network graph. The set of possible decisions when the system is at state S t is denoted by X t .
Each decision x t ∈ X t , t ≠ 0, specifies acceptance and, when required, also assignment and routing decisions. Formally, a (deterministic) scheduling policy maps every possible state S t , t ≠ 0, to a decision x t ∈ X t . We further denote by X π (S t ) the decision prescribed by policy π when at state S t . Once a decision is made at state S t , t ≠ T , the system evolves and the next decision epoch occurs when the (t + 1)-th request arrives. State S t+1 is specified by the preceding state S t , the decision x t ∈ X t , and the (t + 1)-th request r t+1 = (u t+1 , i t+1 , d t+1 ).
This functional relationship is represented by the transition function S M (·): S t+1 = S M (S t , x t , r t+1 ).

Rewards and Objective Function
We denote by R(S t , x t ) the reward associated with decision x t ∈ X t when at state S t . At state S 0 , any decision y ∈ F has a reward of 0. At any other state S t , t ≠ 0, the reward of a decision is 1 if the dynamic request (u t , i t , d t ) is accepted, and 0 otherwise.
The previous definitions allow us to propose the following sequential optimization model (SOM) for the DVRPSR:

max_{y ∈ F, π ∈ Π} E[ E[ Σ_{t=1}^T R(S t , X π (S t )) | S 1 ] ],    (DVRP-SOM)

where Π is the set of all scheduling policies. The double expectation is required because state S 1 = S M (S 0 , y, r 1 ) is unknown, since it depends on the (also unknown) first dynamic request r 1 . Model (DVRP-SOM) maximizes the expected total reward (or the expected number of accepted requests) conditional on the state resulting from the initial decision y. In this work, we are concerned with finding a good policy π as well as a good initial plan y to be used in combination with π. The online and offline algorithms that we propose in Sections 4 and 5, respectively, are based on the idea of the potential Φ π (S t ) of a state S t , t ≠ 0, which is defined as the expected reward-to-go once the system occupies state S t and is controlled by policy π:

Φ π (S t ) = E[ Σ_{t'=t}^T R(S t' , X π (S t' )) | S t ].    (1)

When policy π is relatively simple, (1) can be estimated in a reasonable time by simulation and sample average approximation. An example of such a simple policy is a greedy policy that accepts a request whenever it is feasible to serve it and assigns requests to vehicles according to cheapest insertion. We further discuss this policy (and its rollout version) in Section 4.6.
In the following section, we describe computationally efficient approximations of Φ π (S t ) when policy π is anticipatory, that is, when it takes into account the spatiotemporal distribution of future requests and prescribes acceptance and assignment decisions accordingly.
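As a toy illustration of estimating the potential by simulation and sample average approximation under a simple greedy policy, the sketch below collapses the full vehicle state into a single scalar budget and an illustrative `cost` function; both simplifications are assumptions of the sketch.

```python
def estimate_potential(budget, cost, sample_paths):
    """Sample-average estimate of the reward-to-go under a greedy policy that
    accepts a future request whenever it still fits in the budget.

    cost(r) : illustrative stand-in for the budget consumed by request r.
    """
    values = []
    for omega in sample_paths:       # omega: one sampled sequence of future requests
        b, accepted = budget, 0
        for r in omega:
            c = cost(r)
            if c <= b:               # greedy: accept whenever feasible
                b -= c
                accepted += 1
        values.append(accepted)
    return sum(values) / len(values)
```

For example, with budget 10 and unit costs equal to the requests themselves, the paths [4, 4, 4] and [5, 6] yield 2 and 1 acceptances, so the estimate is 1.5.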

Online Scheduling Policies
In this section, we develop several scheduling policies. Across all policies, routing decisions are prescribed according to one of the two routing policies defined in Section 4.1. Section 4.2 characterizes acceptance and assignment decisions in the optimal scheduling policy. Next, in Section 4.3 we derive our potential approximation model following a sequence of interpretable steps. Sections 4.4 and 4.5 introduce two potential-based scheduling policies, and Section 4.6 presents greedy and rollout benchmark policies. All notations used in this section are summarized in Table A.2 in Appendix A.

Routing Policies
When a dynamic request is accepted and assigned to a vehicle in a routing state, it is reasonable to schedule the new request to be served in a position such that the budget of the adjusted route is maximized. Depending on whether it is allowed to reorder unserved requests, this leads to two main routing policies, which are defined next: Definition 2 (Cheapest Insertion (CI) Routing Policy ρ CI ). Consider a request r = (u, i, d) and a vehicle state V = (τ, θ). Under the CI routing policy ρ CI , route θ is adjusted into a new route ρ CI (θ, r) in such a way that ∆ u (ρ CI (θ, r)) \ ∆ u (θ) = {r}, ∆ u (θ) is a subsequence of ∆ u (ρ CI (θ, r)) and the budget b(τ, ρ CI (θ, r)) is maximized.
Definition 3 (Reoptimization Routing Policy ρ R ). Consider a request r = (u, i, d) and a vehicle state V = (τ, θ). Under the reoptimization routing policy ρ R , route θ is adjusted into a new route ρ R (θ, r) in such a way that ∆ u (ρ R (θ, r)) \ ∆ u (θ) = {r} and the budget b(τ, ρ R (θ, r)) is maximized.
Given a route θ and a request r, ρ CI (θ, r) can be computed in time polynomial in |θ|, while computing ρ R (θ, r) requires solving a traveling salesman problem with |∆ u (θ)| + 1 nodes.
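A minimal sketch of the insertion step underlying ρ CI, assuming travel times are available as a function t(i, j); since the request's service duration is incurred regardless of position, maximizing the budget reduces to minimizing the travel-time detour.

```python
def cheapest_insertion(route, loc, t):
    """Insert `loc` into `route` (a node sequence fixed at both ends) at the
    position minimizing the added travel time, i.e., maximizing the remaining
    budget. Runs in time linear in the route length; assumes len(route) >= 2."""
    best_pos, best_detour = None, float("inf")
    for p in range(1, len(route)):        # insert between route[p-1] and route[p]
        a, b = route[p - 1], route[p]
        detour = t(a, loc) + t(loc, b) - t(a, b)
        if detour < best_detour:
            best_pos, best_detour = p, detour
    return route[:best_pos] + [loc] + route[best_pos:], best_detour
```

Under ρ CI the relative order of previously scheduled stops is preserved, matching Definition 2.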

Characterizing the Optimal Scheduling Policy
Given a first-stage decision y, the optimal scheduling policy π* is such that, at every state S t , t ≠ 0:

X π* (S t ) = argmax_{x t ∈ X t} { R(S t , x t ) + E[Φ π* (S t+1 ) | S t , x t ] }.

We now define a class of potential-based policies Π P ⊂ Π such that acceptance and assignment decisions are prescribed according to the following decision rules (DRs):

Decision Rule 1 (Acceptance Rule). Consider that the system is at state S t , and let x t − be the 'reject' decision with respect to request r t . Then, a policy π ∈ Π P accepts request r t if and only if there is a feasible 'accept' decision x t ∈ X t such that 1 + E[Φ π (S t+1 ) | x t ] ≥ E[Φ π (S t+1 ) | x t − ].    (DR1)

Decision Rule 2 (Assignment Rule). Consider that request r t is accepted at state S t , and let x t k denote the decision of accepting r t and assigning it to vehicle k. Then, a policy π ∈ Π P assigns the accepted request r t to the vehicle k* such that k* = argmax_{k ∈ {1, . . . , K}} E[Φ π (S t+1 ) | x t k ].    (DR2)

Essentially, (DR1) determines that request r t is accepted only if the immediate reward offsets the expected decrease in potential (due to budget consumption) at the (random) state S t+1 , and (DR2) determines that, if r t is accepted, it is assigned to the vehicle such that the expected potential at S t+1 is maximized. Note that π* ∈ Π P , otherwise π* would not be optimal.
Intuitively, good approximations of E[Φ π * (S t+1 )|x t ], x t ∈ X t , can be used to prescribe sensible decisions according to (DR1) and (DR2). Next, we discuss how to efficiently compute such approximations.
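Given any estimator of the post-decision potential, the acceptance and assignment rules translate directly into code; `potential_after` and `feasible` below are hypothetical callbacks standing in for the approximations developed next.

```python
def schedule(request, vehicles, feasible, potential_after):
    """Prescribe acceptance/assignment in the spirit of (DR1)-(DR2).

    potential_after(k) : estimate of E[Phi(S_{t+1})] after accepting `request`
                         and assigning it to vehicle k; k = None encodes the
                         'reject' decision. Illustrative interface.
    """
    reject_value = potential_after(None)
    candidates = [k for k in vehicles if feasible(k, request)]
    if not candidates:
        return None                                 # no feasible assignment: reject
    k_star = max(candidates, key=potential_after)   # assignment: best post-decision potential
    if 1 + potential_after(k_star) >= reject_value: # acceptance: reward offsets potential loss
        return k_star
    return None
```

Returning a vehicle encodes 'accept and assign'; returning `None` encodes 'reject'.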

Multiple-knapsack Approximation of the Expected Potential
Given a request r t = (u t , i t , d t ), our goal is to approximate E[Φ π* (S t+1 )|x t ] for any decision x t ∈ X t . To this end, we propose an approximation based on a multiple-knapsack model. The objective is to maximize the number of future requests accepted. The cost (in terms of budget consumption) of serving a future request depends on each vehicle state (or knapsack).
In particular, it depends on the vehicle location and the remaining route at the moment the future request arrives. The procedure to predict the location of a vehicle at a future instant is based on the following two observations: Remark 1 (Late Depot Arrival). Since the objective in the DVRPSR is to maximize the number of accepted requests, vehicles will usually return to the depot at an instant near the end of the service period.
Remark 2 (Request Order Preservation). When following the CI routing policy, the relative order of scheduled requests along a route does not change when a new request is assigned to the route. When following the reoptimization policy, the relative order of already scheduled requests usually changes only slightly, if at all, when a new request is assigned.
Note that our goal of approximating the expected number of requests that can be served from a given state is related, in an inverse or symmetric fashion, to the problem of approximating the length of the minimum-length tour to visit a given set of nodes. The latter is studied in the stream of literature on routing approximations (Daganzo, 1984b,a). In our case, the route lengths are given, and we estimate the maximum number of additional requests that can be accommodated along the planned routes.
Remarks 1 and 2 suggest the concept of effective speed, which is useful to predict the location of a vehicle at future instants. The effective speed is defined as the traveling speed such that a route finishes exactly at instant U, assuming no further requests are added to the route. We denote by š ut (V k ) the effective speed of a vehicle with state V k = (τ k , θ k ) at instant u t (the formula for computing š ut (V k ) is given in Appendix B.1). Then, it is reasonable to predict the location of vehicle k at a future instant u' > u t by, from the current location, following route θ k at speed š ut (V k ) for an amount of time u' − u t . By predicting at the current instant u t the location of vehicle k at a future instant u', we also estimate the set of nodes along θ k to be traversed after instant u', which we denote by V ut (V k , u'). The procedure to predict the vehicle location and determine V ut (V k , u') requires a simple simulation of θ k . The set V ut (V k , u') is then used to estimate the cost (or budget consumption) c ut (V k , r') of assigning a future request r' = (u', i', d') to a vehicle with state V k , namely, the minimum budget consumption over the positions at which r' can be scheduled among the nodes in V ut (V k , u').

Consider now the state resulting from a decision x t ∈ X t , so vehicle states V 1 , . . . , V K are known and already take into account decision x t . Let ω = {r ω 1 , . . . , r ω Tω } be a sample path of the request arrival process {Λ(u)} u∈(ut,U] , and let K̄ = {k ∈ {1, . . . , K} : V k ≠ ∅}. Then, at time u t , after decision x t is taken, the number of requests in ω accepted by the optimal scheduling policy is approximated by the following model with variables z kr , k ∈ K̄ and r ∈ ω:

maximize    Σ_{k ∈ K̄} Σ_{r ∈ ω} z kr    (MKA)
subject to  Σ_{r ∈ ω} c ut (V k , r) z kr ≤ b(τ k , θ k )  for all k ∈ K̄,    (2)
            Σ_{k ∈ K̄} z kr ≤ 1  for all r ∈ ω,    (3)
            0 ≤ z kr ≤ 1  for all k ∈ K̄, r ∈ ω.

Model (MKA) captures the multiple-knapsack structure of the online phase of the DVRPSR.
The objective function measures the total number of requests in ω accepted and assigned to a vehicle. For each vehicle, constraints (2) limit the total budget consumption due to assigned requests. These constraints consider the predicted vehicle locations, as determined by effective speeds and planned routes, and charge a vehicle-dependent cost for each accepted request, based on scheduling the request in a position such that budget consumption is minimum.
Constraints (3) forbid assigning a request to more than one vehicle. Idle vehicles are ignored in model (MKA), since it is not possible to predict their future locations with the concept of effective speed. Unless the fleet size is over-dimensioned, all vehicles remain busy during the service period and are considered in the model.
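The effective-speed prediction of the stops remaining after a future instant can be sketched as follows, assuming the remaining route is encoded as nominal leg travel times and per-stop service times (an illustrative encoding; the exact formula for the effective speed is in Appendix B.1).

```python
def nodes_after(legs, service, u, U, u_future):
    """Predict which stops of the remaining route are visited after u_future,
    when the vehicle travels at the effective speed that makes the route
    finish exactly at instant U.

    legs    : nominal travel time of each remaining leg (at speed s)
    service : service time at the stop ending each leg
    """
    # Travel is slowed so that total travel fills all time not spent in service.
    stretch = (U - u - sum(service)) / sum(legs)   # >= 1 whenever the budget is >= 0
    elapsed = 0.0
    for stop, (leg, srv) in enumerate(zip(legs, service)):
        elapsed += leg * stretch                   # arrival at this stop, relative to u
        if elapsed >= u_future - u:
            return list(range(stop, len(legs)))    # this stop and all later ones
        elapsed += srv
    return []                                      # route completed before u_future
```

In the example below the route (legs 1, 1, 2; services 0.5, 0.5, 0) has budget 2, so travel is stretched by 1.25 and the vehicle is predicted to reach its second stop only at time 3.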
We consider continuous instead of binary variables in (MKA) so that the model can be solved quickly in a real-time context where computational time is scarce. This relaxation is also justified because, first, the coefficients of the formulation are themselves approximations, so solving (MKA) with integer variables would not necessarily produce a better approximation; and second, |ω| is usually much larger than |𝒦|, so the difference between the optimal values of (MKA) and its integer counterpart is relatively small. These observations lead us to hypothesize that the approximation attained by (MKA) is sensible, which is confirmed empirically. To achieve stability when approximating E[Φ^{π*}(S_{t+1}) | x_t], we take the average of the optimal values obtained over a set Ω = {ω_1, …, ω_H} of H sample paths:

  Φ̂(S_{t+1} | x_t) = (1/H) Σ_{h=1}^{H} Φ_MKA(ω_h),   (4)

where Φ_MKA(ω_h) denotes the optimal value of (MKA) under sample path ω_h. Finally, note that our approximation scheme requires no parameters other than H.
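As an illustration, the linear relaxation of (MKA) can be solved with an off-the-shelf LP solver. The sketch below uses SciPy's `linprog`; the names `budgets` and `costs` are illustrative stand-ins for b(τ_k, θ_k) and c_{u_t}(V_k, r), not the paper's implementation (which is written in C++ with CPLEX).

```python
# Minimal sketch of the (MKA) linear relaxation, solved with SciPy's linprog.
import numpy as np
from scipy.optimize import linprog

def mka_potential(budgets, costs):
    """LP bound on the number of sampled requests the fleet can still accept.

    budgets: remaining time budget b_k of each non-idle vehicle k
    costs:   costs[k][r] = estimated budget consumption of request r on vehicle k
    """
    K, R = len(budgets), len(costs[0])
    c = -np.ones(K * R)                      # maximize the sum of z_{kr}
    A, b = [], []
    for k in range(K):                       # knapsack constraint (2), one per vehicle
        row = np.zeros(K * R)
        row[k * R:(k + 1) * R] = costs[k]
        A.append(row); b.append(budgets[k])
    for r in range(R):                       # constraint (3): request to at most one vehicle
        row = np.zeros(K * R)
        row[r::R] = 1.0
        A.append(row); b.append(1.0)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=(0, 1), method="highs")
    return -res.fun
```

Averaging the value of `mka_potential` over the H sample paths gives the potential estimate (4).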

Potential-based Policy (PbP)
The PbP scheduling policy consists in approximating state potentials according to (4) and applying decision rules (DR1) and (DR2). Given a routing policy ρ and a state S_t = (V_1, …, V_K, r_t), r_t = (u_t, i_t, d_t), the policy works as follows:
(i) If there is an idle vehicle k and it can serve request r_t, that is, the fastest route from the depot to node i_t and back, including the service duration d_t, can be completed by instant U, then r_t is accepted, assigned to vehicle k, and the state of vehicle k is updated so that θ_k corresponds to the fastest route from the depot to node i_t, and back.
(ii) Otherwise, the 'reject' decision x_t^- is added to the candidate set X̂_t.
(iii) For each non-idle vehicle k, if the decision x̂_t^k of accepting and assigning r_t to k under routing policy ρ is feasible, it is added to X̂_t.
(iv) A decision is selected from X̂_t by evaluating decision rules (DR1) and (DR2), with state potentials approximated according to (4).
Different variants of the PbP arise, depending on the adopted routing policy. We denote by PbP R (H) the PbP under routing policy ρ R , when H sample paths are used to approximate state potentials.

Simplified Potential-based Policy (S-PbP)
Step (iv) of the PbP requires solving model (MKA) once for every candidate decision, which may be too time-consuming in a real-time setting. An alternative approximation can be computed faster by considering single-knapsack models. Consider a state S_t = (V_1, …, V_K, r_t), r_t = (u_t, i_t, d_t). Given a non-idle vehicle k with V_k = (τ_k, θ_k) and a sample path ω = {r_1^ω, …, r_{T_ω}^ω} of process {Λ(u)}_{u∈(u_t,U]}, we define the following model (KA) with variables z_r, r ∈ ω:

  maximize   Σ_{r∈ω} z_r
  subject to Σ_{r∈ω} c_{u_t}(V_k, r) z_r ≤ b(τ_k, θ_k),
             0 ≤ z_r ≤ 1  for all r ∈ ω.

Model (KA) overestimates the number of requests from ω that will be assigned to vehicle k, since it does not take into account other vehicles that may compete for the same requests. To compensate for this bias, we define a coefficient α_ω as the ratio between the potential estimated by model (MKA) and the sum, over all non-idle vehicles, of the potentials estimated by model (KA), both computed before accepting or rejecting request r_t. Ratio α_ω falls within the unit interval and decreases with the amount of competition among vehicles to serve requests. Then, another approximation (6) of E[Φ^{π*}(S_{t+1}) | x̂_t^k], where x̂_t^k is an 'accept' decision that assigns request r_t to vehicle k and changes its state to V_k^+, is obtained by scaling the single-knapsack estimates, computed with the updated state V_k^+, by α_ω and averaging over the H sample paths. Policy S-PbP follows by using (6) in decision rules (DR1) and (DR2). Analogously to PbP_R(H), we denote by S-PbP_R(H) the S-PbP when routing policy ρ_R is adopted and H sample paths are used to approximate potentials.
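Because (KA) has unit profits, its linear relaxation can be solved greedily by filling the budget with the cheapest requests first. A minimal sketch with illustrative names, including the competition factor α computed as described above:

```python
# Sketch of the single-knapsack approximation (KA) and the competition
# factor alpha; the greedy fill solves the unit-profit knapsack LP exactly.

def ka_potential(budget, costs):
    """LP value of (KA): unit profits, so fill cheapest requests first."""
    total, remaining = 0.0, budget
    for c in sorted(costs):
        if c <= remaining:
            total += 1.0
            remaining -= c
        else:
            total += remaining / c   # fractional part of the last request
            break
    return total

def alpha(mka_value, ka_values):
    """Ratio between the fleet-level (MKA) estimate and the sum of the
    per-vehicle (KA) estimates; corrects the single-knapsack optimism."""
    return mka_value / sum(ka_values)
```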

Benchmark Policies
We discuss a few benchmark policies in this section.

Greedy Policy
Given a routing policy ρ and a state S t = (V 1 , . . . , V K , r t ), r t = (u t , i t , d t ), the greedy policy is specified as follows: (i) Same as step (i) from Section 4.4.
(ii) Otherwise, request r t is accepted if and only if there is a non-idle vehicle k, V k = (τ k , θ k ), such that b(τ k , ρ(θ k , r t )) ≥ 0.
(iii) If r t is accepted, it is assigned to the non-idle vehicle k, V k = (τ k , θ k ), such that b(τ k , θ k )− b(τ k , ρ(θ k , r t )) is minimum.
Again, different variants of the greedy policy arise, depending on the adopted routing policy.
We denote by GP CI and GP R the greedy policies obtained by considering routing policies ρ CI and ρ R , respectively.
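The greedy acceptance and assignment rule can be sketched as follows; the insertion budgets stand in for b(τ_k, ρ(θ_k, r_t)) and would come from the routing policy ρ (hypothetical data layout, not the paper's code):

```python
# Sketch of the greedy rule: accept r_t iff some non-idle vehicle can
# absorb it, and assign it where the budget loss is smallest.

def greedy_decide(budgets, insertion_budgets):
    """budgets[k]: current budget of vehicle k.
    insertion_budgets[k]: budget after inserting r_t into vehicle k's
    route under rho, or None if the insertion is infeasible."""
    best_k, best_loss = None, float("inf")
    for k, new_b in enumerate(insertion_budgets):
        if new_b is None or new_b < 0:
            continue                      # vehicle k cannot serve r_t
        loss = budgets[k] - new_b         # budget consumed by accepting r_t
        if loss < best_loss:
            best_k, best_loss = k, loss
    return best_k                         # None means "reject"
```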

PFA Policy
PFA (policy function approximation) is a class of approximate dynamic programming (ADP) methods that rely on analytical functions; it is suitable for cases where the structure of the optimal policy is known (Powell et al., 2012). We design a benchmark PFA policy in which the expected reward-to-go E[Φ^{π*}(S_{t+1}) | x_t] is estimated by an analytical function Φ̂_PFA(S_{t+1} | x_t) parameterized by a value γ > 0. The precise mathematical definition of Φ̂_PFA(S_{t+1} | x_t) and the offline procedure to tune parameter γ are detailed in Appendix C.
Given a routing policy ρ and a state S t = (V 1 , . . . , V K , r t ), r t = (u t , i t , d t ), the PFA policy is specified as follows: (i)-(iii) Same as steps (i)-(iii) from Section 4.4.
(iv) A decision is selected from X̂_t by evaluating decision rules (DR1) and (DR2), with the reward-to-go of each candidate decision x̂_t ∈ X̂_t computed by (C.1).
We denote by PFA CI and PFA R the PFA policies with routing policies ρ CI and ρ R adopted in the online phase, respectively. As the offline procedure to tune parameter γ requires simulating the policy under a large number of sample paths, to guarantee computational efficiency the routing policy ρ CI is employed in the offline simulations.

Rollout Policies
A rollout algorithm is a general procedure to improve the performance of a base policy. It works by simulating candidate decisions, online and under the base policy, and then prescribing the decision with the maximum simulated reward-to-go. We refer to Bertsekas & Tsitsiklis (1996) for more details and a formal introduction to the method. As shown in Table 1, rollout algorithms are frequently employed for solving DVRPs, and for this reason we also define rollout-based benchmark policies.
Given a base policy, a routing policy ρ and a state S t = (V 1 , . . . , V K , r t ), r t = (u t , i t , d t ), the rollout policy is specified as follows: (i)-(iii) Same as steps (i)-(iii) from Section 4.4.
(iv) For each candidate decision x̂_t ∈ X̂_t, the system is simulated over H sample paths from the current instant u_t until the final instant U, given that decision x̂_t is taken at state S_t and the base policy is in effect. As a result of the simulations, we obtain r̂(x̂_t) for each x̂_t ∈ X̂_t, the approximated reward-to-go associated with decision x̂_t.
(v) The policy prescribes the decision x̂_t ∈ X̂_t such that r̂(x̂_t) is maximum.
Because of the reduced computational time available in a real-time context and the large number of online simulations required in step (iv) of the rollout procedure, only very efficient scheduling policies are suitable as base policies. Among the policies previously presented, only GP CI and PFA CI are appropriate as base policies, since all other policies require solving multiple integer programs or multiple linear programs before a decision is taken. In fact, as we discuss in Section 6.3, even under the computationally efficient base policies GP CI and PFA CI the decision times of rollout policies in large instances are severely compromised due to the costly online simulations.
Finally, different routing policies can be adopted when defining the set of candidate decisions in step (iii) of the procedure. We denote by R_CI-GP_CI(H) and R_CI-PFA_CI(H) the rollout policies with base policies GP_CI and PFA_CI, respectively, where routing policy ρ_CI is adopted to generate candidate decisions and H sample paths are simulated in step (iv) to approximate rewards-to-go. Likewise, we define rollout policies R_R-GP_CI(H) and R_R-PFA_CI(H) when routing policy ρ_R is used to generate candidate decisions.
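The generic rollout skeleton behind these policies can be sketched as follows, with a caller-supplied `simulate` function standing in for a full system simulation under the base policy:

```python
# Generic rollout skeleton: score each candidate decision by simulating the
# base policy over H sample paths and keep the best average reward.
# `simulate(state, decision, sample_path) -> reward` is an assumed interface.

def rollout(candidates, state, simulate, sample_paths):
    def score(x):
        return sum(simulate(state, x, w) for w in sample_paths) / len(sample_paths)
    return max(candidates, key=score)
```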

Offline Planning Algorithms
We now focus on the decision at the initial state S 0 . In Section 5.1, we briefly discuss previous methods for route planning in DVRPs, and formalize the problem. Section 5.2 describes a benchmark method based on travel time minimization, and Section 5.3 presents our potential-based approach for route planning.

Anticipative Route Planning
A good route plan should cover static requests and leave the system in a favorable state for fitting dynamic requests that arrive during the service period. In this context, previous work proposed designing route plans that include static and sampled dynamic requests (Bent & Van Hentenryck, 2004), waiting strategies in anticipation of the location of dynamic requests (Branke et al., 2005;Thomas, 2007), combinations between waiting and request sampling (Ichoua et al., 2006), and managing slack times to facilitate acceptance of future requests when time windows are enforced (Mitrović-Minić et al., 2004). For a detailed discussion, we refer to Ichoua et al. (2007). Our potential-based planner is closest to the multiple scenario approach from Bent & Van Hentenryck (2004), as we rely on sample paths from the spatiotemporal request distribution to evaluate the potential of candidate planned routes.
The first-stage decision can be formalized as a set-partitioning model, (DVRP-SPM), whose variables select planned routes from the set Θ of feasible routes. The feasible solution space of (DVRP-SPM) is defined by the partitioning constraints (7), which require each static request to be covered by exactly one planned route, together with a limit of K on the number of selected routes.
The states of the vehicles to which a planned route is assigned are initialized accordingly, and it is assumed that these vehicles depart from the depot at instant 0 (i.e., waiting at the depot is not allowed). The cost function g is non-linear and also depends on the optimal (but unknown) policy π*. A reasonable step towards a tractable offline planning algorithm is to replace the cost function g by an additively separable (over V_1, …, V_K), mostly order-preserving mapping g̃, and to solve (DVRP-SPM) with a standard (static) VRP algorithm or heuristic. In the remainder of this section, we discuss two possible ways of implementing this approach.

Budget-based (Myopic) Planner
The budget of a route is a suitable proxy for assessing the capacity of the route to accommodate future requests. Therefore, a sensible mapping is given by:

  g̃_myo(V_1, …, V_K) = Σ_{k} b(0, θ_k).   (8)

We call g̃_myo the myopic mapping, since it does not anticipate the realization of future dynamic requests. Under cost function (8), solving (DVRP-SPM) amounts to solving a distance-constrained vehicle routing problem (DCVRP) on an auxiliary complete graph G′ whose nodes are the depot and the static requests, where with each arc from request (u_i, i, d_i) to request (u_j, j, d_j) ∈ S a cost t(i, j) + d_j is associated. Then, a myopic plan is obtained by solving a DCVRP on G′, with node 0 as the depot, up to K vehicles available, and a duration limit of U on individual routes.
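The construction of the auxiliary graph's arc costs can be sketched as follows (illustrative data layout: static requests as a mapping from node to service duration):

```python
# Sketch of the auxiliary graph used by the myopic planner: every static
# request becomes a node, and the cost of arc (i, j) is the travel time
# t(i, j) plus the service duration d_j at the head request.

def auxiliary_arc_costs(requests, travel_time):
    """requests: dict node -> service duration d_i.
    travel_time(i, j): travel time t(i, j) between nodes.
    Returns {(i, j): t(i, j) + d_j} over all ordered pairs, depot = node 0."""
    nodes = [0] + list(requests)
    return {(i, j): travel_time(i, j) + requests.get(j, 0.0)
            for i in nodes for j in nodes if i != j}
```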
Depending on the number of static requests |S|, finding the optimal solution to the DCVRP may be computationally prohibitive. Moreover, a near-optimal solution is enough to allow us to assess the quality of budget-based first-stage DVRPSR decisions compared to the potential-based plans described in the next section. Therefore, we opt for a heuristic method instead of an exact algorithm to solve the DCVRP. The two-step heuristic works as follows:
(i) A large set of feasible routes Θ′ ⊂ Θ is generated by applying column generation (Barnhart et al., 1998) to the linear relaxation of model (DVRP-SPM) under cost function (8). The details of this procedure, which employs standard techniques commonly applied in (static) VRP algorithms, are described in Appendix B.2.
(ii) A myopic plan is determined by finding the set of routes {θ_1, …, θ_{k′}} ⊂ Θ′, k′ ≤ K, that serves all static requests with minimum total duration. To this end, we employ an off-the-shelf integer solver to solve exactly model (DVRP-SPM) under cost function (8) with the restricted set of variables Θ′.
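Step (ii) selects routes from the restricted pool by solving a set-partitioning model; the sketch below replaces the integer solver with a brute-force search over a tiny route pool, which is viable only for illustration:

```python
# Brute-force stand-in for step (ii): pick, from a small route pool, the
# subset of routes that serves every static request exactly once with
# minimum total duration (the real implementation calls an integer solver).
from itertools import combinations

def best_plan(routes, durations, static_requests, max_vehicles):
    best, best_cost = None, float("inf")
    for k in range(1, max_vehicles + 1):
        for subset in combinations(range(len(routes)), k):
            covered = [r for i in subset for r in routes[i]]
            if sorted(covered) == sorted(static_requests):  # partition check
                cost = sum(durations[i] for i in subset)
                if cost < best_cost:
                    best, best_cost = subset, cost
    return best, best_cost
```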

Potential-based Planner
Instead of maximizing budget, the potential-based planner maximizes the potential of the state resulting from the first-stage decision. Given sample paths Ω = {ω_1, …, ω_H} of the request arrival process {Λ(u)}_{u∈[0,U]}, we define the potential-based mapping g̃_pb:

  g̃_pb(V_1, …, V_{k′}) = (1/H) Σ_{h=1}^{H} Σ_{k=1}^{k′} Φ_KA(V_k, ω_h),   (9)

where Φ_KA(V_k, ω_h) denotes the optimal value of the single-knapsack model (KA) for vehicle k under sample path ω_h, and V_k = (0, θ_k), k ∈ {1, …, k′}, are the initial states of the vehicles to which a route is assigned.
Mappingg pb takes into account the distribution of future requests and is also additively separable on V 1 , . . . , V K . An algorithm for generating potential-based plans follows: (i) Same as step (i) from Section 5.2.
(ii) A potential-based plan is determined by finding the set of routes {θ_1, …, θ_{k′}} ⊂ Θ′, k′ ≤ K, that serves all static requests with maximum (estimated) potential. As in step (ii) of the myopic planner, we invoke an integer solver to solve exactly model (DVRP-SPM) under cost function (9) with the restricted set of variables Θ′.
Compared to the myopic planner, the potential-based planner favors routes with a total budget that can be more efficiently used to serve dynamic requests. The potential-based mapping promotes flexibility to serve dynamic requests as planned routes with more slack time are preferred over low-budget routes. In addition, whereas the DCVRP-based myopic plan usually consists of few routes, each serving several static requests, the potential-based first stage decision exhibits a more balanced distribution of static requests to vehicles and promotes a better coverage of the service area. By considering a large number of sample paths when evaluating (9), we ensure that the initial plan is robust, in the sense that it performs well, on average, for a variety of possible dynamic request trajectories. Also note that, similar to the potential-based policies, the potential-based planner requires no parameter other than the number of sample paths.

Computational Study
In this section, we present an extensive computational study on large-scale test instances based on a real street network. Section 6.1 introduces the detailed configurations of the test instances. Section 6.2 compares the myopic and potential-based offline planning algorithms, and Section 6.3 compares the online scheduling policies. All algorithms and simulations are programmed in C++ and executed on a single core of an Intel® Xeon® Gold 6130 (2.1 GHz) processor with 8 GB of available RAM. IBM® CPLEX® version 12.10 is employed for solving linear and integer programs. When adopting the reoptimization policy ρ_R, the resulting TSPs are solved with a standard branch-and-cut procedure. Finally, all software and test instances developed in this project, with which all results presented in this section can be replicated, are available open-source at https://github.com/amflorio/dvrp-stochastic-requests.

Test Instances
Using geographical data from the OpenStreetMap (OSM) project, we construct our test instances based on the street network of Vienna, Austria. The street network graph is generated from OSM data in the following way: first, we fetch all available data (nodes and ways in OSM terminology) for the city of Vienna; next, we remove the ways that correspond to walkways or service roads, and remove the nodes that are not intersections of two or more streets. The final graph consists of 16,080 nodes (intersections) and 36,424 arcs (street segments) and is illustrated in Figure 2. The depot, indicated by the square, is located in the southeast of the city, and two request clusters, depicted as circular shaded areas, are defined in the northeast and southwest areas.
The fleet size K varies from 2 to 20, and each vehicle travels at a constant speed s̄ = 20 km/h. The duration of the service period is set to U = 600 minutes. The ratio between the expected number of dynamic requests and the expected total number of requests (static and dynamic) is given by the degree of dynamism η = E[T/(|S| + T)]. We assume Σ_{i∈V\{0}} λ_i(u) = Λ, u ∈ [0, U], where Λ ∈ {0.2, 0.4, 0.8, 1.5} is the (constant) overall request rate per minute. Note that our smallest instances, with Λ = 0.2 and η = 0.75, such that E[T] + |S| = 160, are larger than most synthetic instances considered in the previous works mentioned in Table 1.
Given Λ and η, the expected numbers of static and dynamic requests are fixed. Static requests are uniformly distributed over the network nodes. To assess the performance of the proposed methods under distinct dynamic request patterns, we generate dynamic requests according to three spatiotemporal distributions: uniform and time-independent (UTI), clustered and time-independent (CTI), and clustered and time-dependent (CTD). In the UTI distribution, dynamic requests are also uniformly distributed over the network nodes (λ_i(u) = λ_j(u) for all i, j ∈ V \ {0} and u ∈ [0, U]), while in the CTI and CTD distributions, half of the dynamic requests are expected to originate from a node within the two circular areas with a radius of 3 km (see Figure 2). The request rate at each node does not change over time in the UTI and CTI distributions (λ_i(u) = λ_i(u′) for all i ∈ V \ {0} and u, u′ ∈ [0, U]). In the CTD distribution, the request rates in the northeast and southwest clusters decrease and increase over time, respectively: 80% (20%) of the clustered requests are expected to appear in the northeast (southwest) cluster at the beginning of the service period (u = 0), and this percentage decreases (increases) linearly to 20% (80%) at the end of the service period (u = U). The UTI distribution models a practical large-scale instance more closely, because the high density of nodes in the inner city induces a natural demand cluster. Distributions CTI and CTD are somewhat arbitrary but useful nevertheless, as they allow testing the policies on request trajectories largely distinct from those given by the UTI distribution. Regardless of the adopted request distribution, the service duration of each request, static or dynamic, is normally distributed with a mean of 10 minutes and a standard deviation of 150 seconds.
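A sample path of dynamic requests under the CTD distribution can be generated roughly as follows; node selection within the clusters is omitted and the linear 80%-to-20% drift is implemented directly (a simplified sketch, not the actual instance generator):

```python
# Sketch of sample-path generation: homogeneous arrivals at overall rate
# `lam` per minute, with the CTD cluster split drifting linearly from
# 80/20 to 20/80 over the service period [0, U].
import random

def sample_path(lam, U, seed=None):
    rng = random.Random(seed)
    path, u = [], 0.0
    while True:
        u += rng.expovariate(lam)          # exponential interarrival times
        if u > U:
            return path
        p_northeast = 0.8 - 0.6 * (u / U)  # CTD drift: 80% -> 20% over [0, U]
        cluster = "NE" if rng.random() < p_northeast else "SW"
        service = max(1.0, rng.gauss(10.0, 2.5))  # mean 10 min, sd 150 s
        path.append((u, cluster, service))
```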
Each test instance is characterized by a certain combination of request rate Λ, degree of dynamism η, fleet size K, and request distribution. For each instance, we independently generate one set of static requests and multiple sets of dynamic requests. Combining the former with one of the latter yields a request scenario, which corresponds to a specific problem and is solved individually. Table 2 summarizes all test instances, the number of request scenarios per instance, and the offline and online algorithms employed to solve each scenario. We parameterize both the potential-based planner and the potential-based policies with H = 50 sample paths, as our preliminary experiments indicated that additional sample paths do not improve solution quality. Hence, for simplicity, S-PbP_R(50) and PbP_R(50) are from now on referred to as S-PbP and PbP, respectively. When solving a request scenario under a non-deterministic online policy (all policies except GP_CI and GP_R), we repeat the solution procedure 15 times to evaluate the policy's performance in expectation. All sample paths required by the non-deterministic policies are independently generated. Considering all instances, request scenarios, and algorithms tested, we simulate a total of 27,964 executions of the DVRPSR. As the total computational time required by this simulation study was quite high (approximately 248 CPU days), we also make the complete raw simulation results available in the online repository.
Throughout each simulation, the implemented framework records the arrival of requests, the decisions taken, and the states' evolution, and it also outputs each state in a visual format.
The visualization of a simulation enables a better interpretation of the decisions taken by the scheduling policy. As an example, at https://youtu.be/D57xNfU73as we can visualize a simulation of policies R R -GP CI (50) and PbP on an instance with Λ = 0.4, K = 5 and UTI distributed requests. The main indicators used to evaluate the performance of the algorithms are the acceptance rate, defined as the number of dynamic requests accepted divided by the total number of dynamic requests generated, and the maximum decision time, which is the maximum computation time required for taking a single decision. The maximum decision time is an important metric to assess a scheduling policy's suitability for real-time applications.

Evaluation of Offline Planning Algorithms
To evaluate the performance of the myopic and potential-based planners, we employ both algorithms to compute initial route plans for all instances with Λ ∈ {0.2, 0.4}.
Notes to Table 2: Dist., distribution; Scen., scenario; MY, myopic planner; PB, potential-based planner; * the online algorithms marked by * are tested with the potential-based planner only; ** only one scenario is tested for each large instance with Λ = 1.5 due to the long simulation time.
Table 3 compares
the average acceptance rates across eight online policies, where columns "Impr." indicate the percentage improvement achieved by the potential-based plan relative to the myopic plan when a specific online algorithm is used. A more detailed comparison, broken down by request distribution, is provided in Table D.1 in Appendix D. Overall, when combined with the best-performing policies S-PbP and PbP, the potential-based plans improve acceptance rates by 3.6% and 5.0%, respectively. Although myopic and potential-based plans perform equally well in the smallest instance with Λ = 0.2 and K = 2, in all other instances the potential-based plans significantly improve the average acceptance rates, regardless of the request distribution and the adopted online policy. To develop intuition behind these improvements, Figure 3 compares the initial plans generated by the two offline planning algorithms for two instances. In the upper instance, we see that the myopic planner defines two initial routes and leaves one vehicle idle at the depot (it is activated upon arrival of the first dynamic request). This myopic plan maximizes the total budget but ignores the spatiotemporal distribution of dynamic requests. In comparison, the potential-based planner defines one additional route and, by doing so, promotes a better coverage of the dense area in the city center, which constitutes a natural cluster as requests are uniformly distributed over intersections. Similarly, in the lower instance, we observe that the potential-based planner uses the third vehicle to improve the coverage of the northeast cluster, where the request rate is high, especially during the early stage of the service period.
Notes to Table 3: Impr., relative improvement of PB over MY: (PB−MY)/MY×%. * indicates improvements that are NOT statistically significant (i.e., p > 0.05 in paired two-sample t-tests).
The success of potential-based plans is also attributed to a more balanced set of initial routes, which cover similar distances and serve a similar number of static requests. This is achieved simply by evaluating routes under the knapsack-based potential approximation, without requiring additional parameters. The capacity (in terms of budget) of those routes is consumed in a more balanced way along the service period, which gives more flexibility to the scheduling policy when deciding the best vehicle to serve a dynamic request. On the other hand, myopic plans are characterized by very long routes and idle vehicles. While it is possible to force the myopic planner to use all vehicles, this alone does not lead to a balanced set of routes, as the DCVRP model minimizes the total duration; additional constraints (and parameters) would be required to achieve the same effect.
Notes to Figure 3: Each red dot represents a static request. Each colored line represents the initial route of a vehicle.
Notes: MY, myopic planner; PB, potential-based planner.

Evaluation of Online Scheduling Policies
We first compare the online scheduling policies applied to the instances with Λ ∈ {0.2, 0.4}.
As Section 6.2 demonstrates that potential-based plans outperform myopic plans, all results presented in this section are based on the former. Moreover, since the spatiotemporal distribution of dynamic requests has little impact on the performance of online policies, all results presented in this section are averaged over the three distributions. The interested reader is referred to Tables D.2 to D.5 in Appendix D for detailed results. Taking the results of policy GP_CI as a reference, Table 5 presents the relative improvements in acceptance rate obtained by the other policies (the acceptance rates of GP_CI can be found in Table 3). In addition, the maximum decision times are presented in Table 6. In these tables, * indicates improvements that are NOT statistically significant (i.e., p > 0.05 in paired two-sample t-tests). The best-performing policy is PbP, which displays an average acceptance rate higher than those of the PFA and rollout algorithms, with or without reoptimization, and for any number of sample paths. Moreover, PbP takes decisions much faster than rollout policies: for the same number of sample paths, H = 50, PbP yields a sixfold decrease in decision times, on average. Policy S-PbP is even more computationally efficient, at the cost of somewhat lower acceptance rates compared to PbP (but still better, on average, than the other policies except R_R-PFA_CI(H ≥ 50)). In these instances, PFA-based policies require no more than 60 seconds of offline computation time for parameter tuning. We also notice that, compared to the CI routing policy ρ_CI, the reoptimization routing policy ρ_R slightly improves the acceptance rates of both the greedy and PFA policies, at the cost of a short additional computation time per decision. Similar differences between ρ_CI and ρ_R can be observed in the rollout policies, as illustrated in Figure D.1 in Appendix D.
Figure 4 compares the overall performance and decision times of all scheduling algorithms (except the rollout algorithms with the CI routing policy ρ CI ) graphically. As expected, the performance and decision times of rollout policies increase with the number of sample paths.
As also expected, policies GP CI and PFA CI are the least computationally expensive, but the former has the worst solution quality. Although the PFA-based policies lead to drastically better results than the GP-based policies, their solution quality does not surpass that of policy S-PbP until the number of sample paths increases to H = 50, at which point their maximum decision times are 21 and 7 times longer than those of S-PbP and PbP, respectively, which limits their application in a real-time context. Tables 7 and 8 present the acceptance rates and decision times, respectively, obtained in the instances with Λ ∈ {0.8, 1.5}. In Table 7, the relative improvements over GP CI are provided in parentheses. The number of sample paths in rollout algorithms is limited to H = 10 as rollout decision times increase considerably in these large instances. Furthermore, the offline runtime required by the PFA-based policies grows to 15 minutes in the largest instances with Λ = 1.5 and K = 20. Policies S-PbP and PbP perform significantly better than the other algorithms.
In addition, we see that S-PbP requires at most 11.3 seconds to compute a decision in the largest instances, which reaffirms S-PbP as a high-performing scheduling policy suitable for real-time operation (in Table 7, * indicates improvements that are NOT statistically significant, i.e., p > 0.05 in paired two-sample t-tests). As observed in Table 8, the average decision times of policy S-PbP grow linearly with K. Quite remarkably, across scenarios both models are able to predict with significant accuracy, already at the beginning of the service period (i.e., before any stochasticity is disclosed) and regardless of the request distribution and instance size, the total number of accepted requests during the entire period. In combination with the decision rules applied in the optimal scheduling policy, these predictions lead to high-quality dynamic decisions that can be computed in real-time. Although only a few simulations are plotted, the general shape of these profiles represents well what is generally observed.
While the greedy policy GP_R exhausts the fleet's service capability relatively early, the PFA and rollout policies with H = 25 sample paths or more prescribe better decisions and achieve higher acceptance rates. Capacity is consumed most efficiently by the potential-based policies, which accept requests at a steadier rate until near the end of the service period.

Conclusions
We studied a vehicle routing problem where a service provider must determine, in real-time, which dynamic customer requests to accept and how to route vehicles to serve them, with the goal of maximizing the number of served requests.

Notation:
ρ(θ, r): new route resulting from adding request r to route θ by policy ρ.
π* ∈ Π: optimal scheduling policy.
x_t^- ∈ X_t: 'reject' decision with respect to r_t.
X_t^k: set of decisions where r_t is accepted and assigned to vehicle k.
x̂_t^ǩ: decision of accepting r_t and assigning it to vehicle ǩ.
š_{u_t}(V_k): effective speed of a vehicle with state V_k estimated at instant u_t.

(iv) Taking into account the initial conditions set by ũ_s and ũ_t, the remaining route θ_k(t) is simulated under effective speed š_{u_t}(V_k) for an amount of time u′ − u_t.
Step 2: Determining V_{u_t}(V_k, u′): first unvisited node along route θ_k at instant u′.
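The two steps can be sketched as follows, assuming a simplified route representation as (distance, service time, node) legs; this is an illustration of the procedure, not the paper's implementation:

```python
# Sketch of effective-speed location prediction. A remaining route is a
# list of legs (distance_to_node_km, service_min, node).

def effective_speed(legs, u_t, U):
    """Speed (km/min) at which the remaining route finishes exactly at U."""
    total_dist = sum(d for d, _, _ in legs)
    total_service = sum(s for _, s, _ in legs)
    return total_dist / ((U - u_t) - total_service)

def predict_unvisited(legs, u_t, U, u_future):
    """Nodes of the remaining route predicted to be reached after u_future."""
    v = effective_speed(legs, u_t, U)
    t, out = u_t, []
    for dist, service, node in legs:
        t += dist / v              # drive to the node at the effective speed
        if t > u_future:
            out.append(node)       # node reached after the prediction instant
        t += service               # then serve the request at the node
    return out
```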
The linear relaxation of (DCVRP-SPM) cannot be solved directly because the number of variables |Θ| grows exponentially with |V |. So, we apply column generation and, in the process, generate a large pool of routes to subsequently use in the offline planning heuristic.
Within column generation, the restricted master problem (RMP) refers to the linear relaxation of (DCVRP-SPM) when only a subset Θ̂_R ⊂ Θ̂ of variables is considered. A profitable variable to add to Θ̂_R is identified by solving the pricing problem, which calls for a column θ̂ with negative reduced cost:

  c(θ̂) − Σ_{i∈θ̂} β_i − β_0 < 0,   (B.1)

where c(θ̂) denotes the cost (total duration) of route θ̂, and β_i (i ∈ V \ {0}) and β_0 are, respectively, the dual values associated with the partitioning and the maximum-number-of-vehicles constraints in (DCVRP-SPM), obtained by solving the RMP. Once profitable variables are identified, they are added to Θ̂_R and the procedure iterates.
Convergence is attained when no further variable satisfies (B.1).
The pricing problem is solved with a labeling procedure, as common in algorithms for several VRP variants (Costa et al., 2019). Each label corresponds to a partial path from node 0 to a node j ∈ V \ {0}, and stores the reduced cost accumulated so far along the partial path.
To ensure tractability of the labeling procedure, we allow non-elementary routes according to ng-route relaxation (Baldacci et al., 2011), and control the combinatorial growth of labels with dominance rules.
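A toy version of the labeling procedure, without the ng-route relaxation or dominance rules (so it enumerates elementary routes and is viable only on tiny graphs), can be sketched as:

```python
# Toy label-setting sketch for the pricing problem: extend partial paths
# from the depot (node 0), accumulating reduced cost t(i, j) + d_j - beta_j,
# and keep the route with the most negative total reduced cost (B.1).

def price(nodes, travel_time, service, beta, beta0, U):
    """Return (reduced_cost, route) of the best route found, or (0.0, [0])."""
    best = (0.0, [0])                      # only negative reduced costs improve
    labels = [(-beta0, 0.0, [0])]          # (reduced cost, duration, partial path)
    while labels:
        rc, dur, path = labels.pop()
        if len(path) > 1:                  # try closing the route at the depot
            total = rc + travel_time(path[-1], 0)
            if total < best[0] and dur + travel_time(path[-1], 0) <= U:
                best = (total, path + [0])
        for j in nodes:                    # extend the label to unvisited nodes
            if j in path:
                continue
            step = travel_time(path[-1], j) + service[j]
            if dur + step + travel_time(j, 0) <= U:
                labels.append((rc + step - beta[j], dur + step, path + [j]))
    return best
```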

Appendix C. PFA Policy
Given a state S_t = (V_1, …, V_K, r_t), r_t = (u_t, i_t, d_t), it is clear that r_t should be accepted only if this acceptance leads to an increase in the expected total reward, as stated by decision rule (DR1). Hence, we design a PFA policy with an analytical function Φ̂_PFA(S_{t+1} | x_t), which approximates the expected reward-to-go E[Φ^{π*}(S_{t+1}) | x_t] based on the total time budget:

  Φ̂_PFA(S_{t+1} | x_t) = γ Σ_{k=1}^{K} b(τ_k, θ_k)  for x_t ∈ {x_t^-} ∪ (∪_k X_t^k),   (C.1)

where the vehicle states (τ_k, θ_k) on the right-hand side already reflect decision x_t, x_t^- is the 'reject' decision with respect to r_t, X_t^k is the set of 'accept' decisions where r_t is assigned to vehicle k, and coefficient γ > 0 is a relative weight that trades off the immediate reward and the reward-to-go. The larger γ is, the more likely request r_t is rejected so as to preserve the fleet's future service capability.
Parameter γ is tuned in advance via offline simulations. As the optimal value of γ depends not only on instance settings (graph type, fleet size, request rate and distribution, etc.) but also on the initial route plan, the tuning process of γ takes place after the initial routes have been fixed by an offline planner and before the beginning of the service period. During the tuning process, we simulate PFA policies with different values of γ over H complete sample paths. The values of γ simulated are selected based on a variable step size search procedure, which assumes that the total reward is concave on γ, as observed empirically. Finally, we select the value of γ which leads to the highest simulated total reward.
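The variable-step-size search can be approximated by a ternary search, which locates the maximizer of a concave function; `simulated_reward` stands in for the offline simulation of the PFA policy for a given γ (an illustrative sketch, not the paper's tuning routine):

```python
# Ternary search over gamma, exploiting the (empirically observed)
# concavity of the simulated total reward as a function of gamma.

def tune_gamma(simulated_reward, lo=0.0, hi=10.0, tol=1e-4):
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if simulated_reward(m1) < simulated_reward(m2):
            lo = m1                       # the maximum lies to the right of m1
        else:
            hi = m2
    return (lo + hi) / 2
```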

Appendix D. Detailed Computational Results
This section presents the detailed computational results with respect to the three request distributions.
Notes to Table D.1: Dist., request distribution; Impr., relative improvement of PB over MY: (PB−MY)/MY×%. * indicates improvements that are NOT statistically significant (i.e., p > 0.05 in paired two-sample t-tests).