Dynamic Capacitated Arc Routing Problem in E-Bike Sharing System: A Monte Carlo Tree Search Approach

Abstract. This paper studies a dynamic capacitated arc routing problem for battery replacement in an e-bike sharing system, where workers dynamically replace the batteries of underpowered e-bikes along street segments. The objective is to replace as many batteries as possible while minimizing pickup failures. The temporal dependency of the routing decisions, the conflicts among the workers' operations, and the stochastic and dynamic nature of user demands all make this a difficult problem. To cope with these difficulties, a "Partition-First, Route-Second" bi-level solution framework is adopted to describe the problem on two different time scales. On the large time scale, a spatiotemporal partitioning method, which divides the road network into nonoverlapping subzones, is proposed to decompose the problem. On the small time scale, this paper models the routing decision process of an individual worker as a Markov Decision Process. We adopt a lookahead policy that simulates future information and decisions over some horizon to evaluate the long-term influence of the current feasible decisions. A Monte Carlo Tree Search algorithm is also used to improve the simulation efficiency. By performing numerical experiments on a test case study and comparing with benchmark policies, we demonstrate the effectiveness and efficiency of the proposed method.


Introduction
e "Mobile Internet" and "Sharing Economy" have greatly promoted the development of shared micromobility services, represented by free-floating bike sharing system, shared electric scooter system, and e-bike sharing system, which provide a more convenient and environmentally friendly "start/last mile" travel solution. e E-Bike Sharing System (EBSS) is a relatively new form of shared micromobility service, which could provide auxiliary power through the battery and help users reduce the physical energy consumption when they are riding against the wind, uphill, with loads, or over long distances and thus can significantly improve user experience. In China, the country with one of the world's largest markets for electric bicycles, the shared micromobility service providers, such as Hellobike and Qingju Bike, have already launched EBSS in many cities, especially those with lots of slopes in the road network.
Despite the improvement in user experience, EBSS faces more challenges in daily operation. A new challenge is how to replace the detachable batteries of underpowered e-bikes scattered throughout the city in a timely and cost-effective way. At present, EBSS providers in China usually take the approach of setting up "virtual parking spots", as illustrated in Figure 1. The virtual parking spots are areas where parking is allowed, determined by the built-in GPS positioning, and are usually distributed along street segments. Users who park e-bikes outside these areas face punitive dispatch fees. This approach not only reduces the workers' effort to find an e-bike but also restrains messy parking behavior.
In the above context, since the underpowered e-bikes mostly aggregate at the virtual parking spots along street segments, the routing problem of battery replacement in EBSS can be modeled as a capacitated arc routing problem (CARP), in the same family as the Chinese postman problem [1]. One option is to replace batteries at night, when user demands can be ignored, corresponding to a static version of CARP (SCARP). In the static version, the distribution of underpowered e-bikes is fixed and known in advance, and a set of minimum-cost routes that service all underpowered e-bikes is sought for a fleet of capacitated vehicles located at a given depot. However, this passive static approach has a drawback: it cannot guarantee that the number of e-bikes with sufficient power remains high throughout the next day, resulting in lower utilization rates of the system, especially during the evening peak. Therefore, this paper mainly considers a dynamic version of the CARP (DCARP) in EBSS: battery replacement is performed during the daytime, when the users' participation must be considered and new underpowered e-bikes may appear stochastically after the fleet departs from the depot. Although there is a large body of studies on the dynamic capacitated vehicle routing problem, surveyed by Pillac et al. [2] and Psaraftis et al. [3], direct studies on DCARP are relatively few. Some studies [4][5][6][7] investigated the periodic CARP in certain scenarios and proposed various heuristic algorithms; others [8][9][10][11][12] studied the dynamic case where the demand or deadheading cost of each arc is stochastic and proposed repair operators or adaptive algorithms rather than just an initial static solution.
However, the DCARP in the scenario of battery replacement in EBSS poses the following difficulties. The first difficulty is the temporal dependency of the routing decisions. The current routing decisions for battery replacement affect the success of subsequent pickup demands, which in turn affects the future state of e-bikes and the future routing decisions. This intricate temporal dependency significantly increases the complexity of the problem. The second difficulty is the conflict among the workers' operations. Since there is usually more than one worker, it is important to design a cooperation mechanism that avoids simultaneous operations on the same street segment by different workers and improves the operation efficiency of the entire system. The third difficulty is how to anticipate the future influence of current decisions under stochastic and dynamic user demands. In the dynamic case, the goal is not only to maximize the number of batteries replaced but also to maximize the successful attempts to pick up e-bikes, which requires anticipating potential future user demands when making current decisions. These difficulties make it hard to compute a fixed decision sequence in advance. To cope with them, this paper adopts a "Partition-First, Route-Second" bi-level solution framework to describe the problem on two different time scales, as shown in Figure 2.
On the upper level, the road network is partitioned into subzones to decompose the problem into multiple smaller subproblems. This idea of partitioning streets into clusters has been widely used in the classical CARP and its variants [13][14][15][16][17]. In the setting of our problem, the road network partition constrains the feasible operation area and reward of each worker, preventing potential conflicts among the workers' operations.
The partitioning also reduces the temporal dependence between different workers' routing decisions to some extent, owing to the short-distance nature of e-bike trips. To adapt the partition to the real-time state, a spatiotemporal partitioning method is proposed on the large time scale, where an initial partition is generated at the beginning of the day and is then adjusted dynamically with the real-time state of e-bikes based on heuristic algorithms.
On the lower level, the routing decision process of each worker is modeled as a Markov Decision Process (MDP) on a small time scale, which is fine enough to capture the details of the user demands and worker operations. In the MDP, a decision about which adjacent street segment to visit next is made whenever a worker has finished the task on the previous street segment. This paper adopts a lookahead policy with a Monte Carlo Tree Search (MCTS) algorithm, which solves an approximation of the problem over some horizon to anticipate future impacts [18,19]. At a decision point of a worker, we model all feasible sequences of routing decisions for the worker over some horizon as a decision tree and then search and simulate it in the simulator with the MCTS algorithm, which allocates more simulation runs to more highly rewarding routes and thus saves computational resources in the simulation process. The main contributions of this paper are as follows. We model the routing problem of battery replacement in EBSS as a DCARP based on the distribution of virtual parking spots. Then, we adopt a "Partition-First, Route-Second" solution framework to cope with the difficulties of this problem. We take both user satisfaction and operational efficiency into account and adopt a lookahead policy with the MCTS algorithm to balance exploration and exploitation during the simulation. The remainder of this paper is structured as follows. In Section 2, we briefly review the related literature. In Section 3, we summarize the notations used in this paper and formulate the problem. In Section 4, a spatiotemporal partitioning method is proposed in detail. In Section 5, we introduce the lookahead policy with the MCTS algorithm. In Section 6, we present the design of the system simulator. In Section 7, we analyze the results of numerical studies. Finally, we conclude the paper in Section 8.

Literature Review
As a convenient way for short-distance travel, bike sharing systems have been expected to help mitigate traffic congestion, alleviate growing environmental issues, and promote sustainable transportation development [20]. However, the development of bike sharing systems has also undergone a long trial-and-error process. The first generation of bike sharing systems was dockless and operated as nonprofits but failed due to theft and damage [21]. Then, to facilitate maintenance, station-based bike sharing systems became the dominant form for a long time and went through stages such as coin-deposit systems, IT-based systems, and demand-responsive multimodal systems that made it easier for users to pick up or return bikes [22]. In addition, numerous studies [23][24][25][26][27] have investigated bike allocation and repositioning problems in station-based bike sharing systems to rebalance the system under nonhomogeneous, asymmetric demand processes. However, the user base of station-based bike sharing systems has been limited by the difficulty of building dense stations, given the high cost and road space required.
In recent years, with the support of emerging technologies such as mobile payment and electronic locking, the Internet-based free-floating bike sharing system has become the main mode, where users are allowed to pick up and return bikes at any location near their origin and destination.
This increases the coverage of the system and attracts more users. However, with the rapid expansion of free-floating bike sharing systems, a series of new problems have arisen, such as the recycling of faulty bikes [28] and messy parking behaviors [29]. Therefore, promoting virtual parking spots in some areas is necessary [29]. The EBSS, which requires frequent battery replacement for underpowered e-bikes, has adopted the approach of setting up virtual parking spots in many cities in China. These virtual parking spots are quite different from the parking stations in traditional station-based bike sharing systems. They do not require the construction of physical parking stations or docks and thus can be set up more densely throughout the city at a lower cost and in a more flexible way. However, to the best of our knowledge, there has been no research on the problem of dynamic battery replacement in EBSS with the virtual parking spot setting so far.
From another perspective, the dynamic battery replacement problem in EBSS is similar to the dynamic rebalancing problem in traditional station-based bike sharing systems in terms of user demand distribution and system operation mechanism. Thus, it is helpful to review some of the dynamic rebalancing policies in station-based bike sharing systems. The rebalancing problem is usually modeled as a pickup and delivery problem in the literature. Brinkmann et al. [23] proposed a short-term rebalancing policy in which the vehicle is sent to the nearest shared bike station where the safety buffer is breached. This policy is simple and computationally efficient but does not take future information into account. Ghosh et al. [24] simplified the relocation problem by aggregating stations and trips and proposed a mixed-integer linear program on a rolling horizon. The high computational cost of this algorithm makes it difficult to meet the requirements of real-time applications. Brinkmann et al. [25] presented a dynamic lookahead policy that anticipates potential future demands over a predefined horizon by rollout simulation. However, this policy is less efficient when multiple vehicles must be considered and when decision points occur too frequently. Li et al. [26] used a neural network to approximate the long-term value function of current decisions via a reinforcement learning approach.
This deep reinforcement learning approach has the appeal of fast decision speed and the ability to consider long-term effects. However, learning an accurate and stable value function network remains challenging due to the stochastic exogenous information and the complexity of the environment in EBSS.
Taking the distribution of virtual parking spots into consideration, we model the battery replacement problem in EBSS as a dynamic capacitated arc routing problem. The capacitated arc routing problem aims to determine an optimal traversal of an arc subset of a graph by capacitated vehicles [1] and has been widely studied over the past years, for example, by converting it to the capacitated vehicle routing problem [30,31], by exact optimization methods [32,33], or by lower bounding procedures [34][35][36][37] and heuristic methods [13,[38][39][40][41]. Among them, the "Partition-First, Route-Second" heuristic has been widely used in CARP and its variants and has proved effective [13][14][15][16][17]. This paper adopts this heuristic idea to decompose the original problem into smaller subproblems by partitioning the arcs into subzones. The dynamic version of CARP considers stochastic and dynamic demands or deadheading costs, which greatly increases the problem complexity. Christiansen et al. [8] considered a variant of the CARP with stochastic demands on each arc and proposed a branch-and-price algorithm; they incorporated the stochastic demand into the pricing problem. Laporte et al. [9] proposed an adaptive large neighborhood search heuristic for the CARP with stochastic demands. Tagmouti et al. [10,11] studied the CARP with time-dependent service costs and proposed a variable neighborhood descent heuristic. Mei [12] aimed to find a robust solution for all possible environments, instead of the optimal solution for a specific environment, for the dynamic CARP and designed a repair operator that can dynamically repair a solution to feasibility. However, these methods may fail to address the difficulties of the DCARP in EBSS, such as the temporal dependency of the routing decisions and the influence of future demand trends.
Thus, this paper proposes a lookahead policy based on the MCTS algorithm, an efficient algorithm for sequential decision problems [19], for the DCARP in EBSS.

Notation and Problem Formulation
In this section, we summarize the notations used in this paper in Section 3.1, formulate the DCARP in Section 3.2, and introduce the decomposition of the decision process of multiple workers and model the routing decision process of an individual worker as an MDP in Section 3.3.

Notation.
The main notations and their values used in this paper are summarized in Table 1. Figure 3 shows a simple EBSS scenario, which consists of the road network, e-bikes scattered among the virtual parking spots or in use on the road, and vehicles/workers replacing batteries for underpowered e-bikes. The road network is abstracted as a connected graph G = (V, E).

Problem Formulation.
In the vertex set, v_0 is the depot and {v_i | i = 1, 2, ..., N_v} is the set of intersections. Assume that each edge has two directions and a vehicle can only serve one direction at a time. Thus, an edge (v_i, v_j) can be split into two directed arcs, ⟨v_i, v_j⟩ and ⟨v_j, v_i⟩, as shown by the edge (v_1, v_2) in Figure 3. Then, the road network can also be represented as a directed graph G = (V, A), where A = {⟨v_i, v_j⟩ | (v_i, v_j) ∈ E, i ≠ j} is the arc set. The undirected representation is used in the road network partitioning, and the directed representation is used in the routing decision process. The operating day is divided into T discrete time steps t ∈ {1, 2, ..., T} according to a predefined time interval Δt (1 minute). The dynamic user demand is reflected in the stochastically generated pickup and return demands at each virtual parking spot at each time step, which leads to dynamic changes in the locations and battery power levels of e-bikes. The number of e-bikes in the system is assumed to be fixed at N_eb, with no new e-bikes being put into or removed from the system. The battery range of each e-bike lies in [0, ψ_max] (km). When the battery range falls below a threshold ψ_min (km), an e-bike is regarded as underpowered and unavailable to users. For simplicity, we assume that the time cost for a worker to replace a battery is a constant τ_b (e.g., 1 minute).
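The split of each undirected edge into two directed arcs can be sketched as follows; the edge list, edge lengths, and travel speed below are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: represent the road network as an undirected edge list,
# then split each edge (v_i, v_j) into the two directed arcs <v_i, v_j>
# and <v_j, v_i> used in the routing decision process.

def build_arcs(edges, lengths, speed_kmh=20.0):
    """edges: list of (v_i, v_j); lengths: km per edge.
    Returns {arc: deadheading cost in minutes}, one entry per direction."""
    arcs = {}
    for (vi, vj), l in zip(edges, lengths):
        cost = l / speed_kmh * 60.0  # constant speed -> positive cost c_ij
        arcs[(vi, vj)] = cost
        arcs[(vj, vi)] = cost
    return arcs

# Hypothetical two-edge network: (v1, v2) of 1 km and (v2, v3) of 0.5 km.
arcs = build_arcs([(1, 2), (2, 3)], [1.0, 0.5])
```

Each edge yields two arcs with equal deadheading cost, matching the assumption that a vehicle serves one direction at a time.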
Assume there are N_w workers (or vehicles) in the system, denoted by the set W = {w_1, w_2, ..., w_{N_w}}. Each vehicle has the same battery capacity B (e.g., 40 batteries). Suppose t_begin (e.g., 7:00 am) and t_end (e.g., 10:00 pm) are the beginning and end of working hours. During working hours, workers continuously replace batteries for underpowered e-bikes. Once the available batteries are used up, a worker must return to the depot and reload batteries. Suppose vehicles travel at a constant speed that is independent of time, leading to a positive deadheading cost c_ij for each arc ⟨v_i, v_j⟩ ∈ A. Assume that the time cost of reloading one battery in the depot is τ_d (e.g., 0.5 minutes) and there are always enough fully charged batteries in the depot, so the time cost of reloading a vehicle's batteries in the depot is τ_d · B.
Under the limited working hours and time cost constraints introduced above, on the one hand, we hope to route the vehicles in a way that replaces as many batteries as possible within the limited time; on the other hand, we also hope to prioritize service on those arcs that have the potential to reduce user pickup failures in the near future. Take worker w_1 in Figure 3 for example: if a user will pick up an e-bike on arc ⟨v_4, v_7⟩ shortly, it would be better for w_1 to serve arc ⟨v_4, v_7⟩ first instead of arc ⟨v_6, v_7⟩. However, directly maximizing the sum of the number of replaced batteries and the number of successful pickups may not be appropriate. In fact, replacing one more battery has a higher expected reward than avoiding one more pickup failure, because a fully charged e-bike can usually be used multiple times and can prevent more potential pickup failures in the following days. For this reason, the optimization objective function is designed as a weighted sum of these two factors:

max Σ_{t=1}^{T} Σ_{⟨v_i,v_j⟩∈A} [(ψ_max − ψ_min) · b_tij + ψ_user · d_tij],   (1)

where b_tij is the number of batteries replaced on arc ⟨v_i, v_j⟩ at time step t, d_tij is the number of successful pickups on arc ⟨v_i, v_j⟩ at time step t, and ψ_user is the expected battery power consumption per e-bike trip. The term Σ_t Σ_{⟨v_i,v_j⟩∈A} ψ_user · d_tij denotes the battery power consumed by the users, representing the short-term reward during the current day, while Σ_t Σ_{⟨v_i,v_j⟩∈A} (ψ_max − ψ_min) · b_tij denotes the battery power replenished by the workers, representing the long-term reward of system sustainability.
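The weighted objective in equation (1) can be computed directly from the per-step counters; the parameter values for ψ_max, ψ_min, and ψ_user below are illustrative assumptions, not values from the paper.

```python
# Sketch of the objective in equation (1), assuming b[t][a] and d[t][a]
# count replaced batteries and successful pickups on arc a at time step t.

def objective(b, d, psi_max=30.0, psi_min=5.0, psi_user=3.0):
    total = 0.0
    for t in b:
        for a in b[t]:
            total += (psi_max - psi_min) * b[t][a]  # long-term reward term
            total += psi_user * d[t].get(a, 0)      # short-term reward term
    return total
```

With these assumed weights, one battery replacement contributes (30 − 5) = 25 units, while one successful pickup contributes 3, reflecting the paper's argument that replacement carries a higher expected reward.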
Table 1 (excerpt):
τ_b: serving time cost of replacing the battery of an e-bike (1 minute)
N_eb: number of e-bikes in the system (1000)
τ_d: time cost of reloading a battery in the depot (0.5 minutes)
ψ_user: expectation of battery range consumption for e-bike trips
x^m_k: decision of worker w_m at the kth decision point; a one-hot vector of length |A| indicating the arc to visit next

Decomposition of the Problem.
For the above dynamic capacitated arc routing problem, it is hard to compute the optimal route for each worker in advance due to the stochastic and complex nature of the system. Therefore, we formalize the problem as a sequential decision process in a cooperative multiagent system. A decision point occurs when a worker completes the task on the current arc and must decide which adjacent arc to visit next. Note that, as the number of tasks and the deadheading cost vary across arcs, the execution duration of each decision may differ, so the decision processes of different workers are asynchronous. At each time step, only a subset of the workers may make a decision while the others keep executing the tasks on their current arcs.
To simplify the problem, we assume that each worker is assigned a nonoverlapping subzone and is allowed to perform battery replacement operations only within that subzone. As there are N_w workers, the road network G = (V, A) is partitioned into N_w subzones G_m = (V_m, A_m), m = 1, 2, ..., N_w. In this case, the routing decisions of each worker have very limited effects on other workers, and we can approximately decompose the cooperative multiagent decision problem on the whole road network into multiple mono-agent decision problems, one in each subzone.
As shown in Figure 4, the arc set A is partitioned into two subsets A_1 and A_2 for workers w_1 and w_2, and the decision process of each worker is modeled as a Markov Decision Process. The operating day is divided into discrete time steps t according to a small time interval Δt (e.g., 1 minute). At each time step, a worker can make a routing decision or continue with his current task. For example, the decision points for worker w_1 in Figure 4 are time steps t + 2, t + 4, and t + 6. We also divide the day into discrete time periods p according to a larger time interval Δp (e.g., 180 minutes). At each time period (e.g., at p_1 and p_2 in Figure 4), the partition can be adjusted according to the real-time e-bike state.

Table 1 (continued):
l: vector of the lengths of the edges, l = (l_i), i ∈ {1, 2, ..., |E|}
β: temperature in the simulated annealing algorithm
β_0: initial temperature in the simulated annealing algorithm (100)
α: decay factor of the temperature in the simulated annealing algorithm (0.9)
MaxIter_inner: inner iteration limit in the simulated annealing algorithm (10)
MaxIter_outer: outer iteration limit in the simulated annealing algorithm (20)
To model the routing decision process of each worker as an MDP, the following assumptions are made: (i) After a worker leaves the depot, he follows the shortest path (computed by the Dijkstra algorithm) to the nearest vertex of his subzone. During this process, the worker does not replace batteries.
Once the remaining batteries in the vehicle run out, the worker follows the shortest path back to the depot to reload the batteries and then goes to the previous location and continues his operation. (ii) As new underpowered e-bikes may be generated on the arcs that have been served, multiple visits and multiple services to an arc are allowed. However, we suppose that temporary stops of vehicles are not allowed and that the vehicle's power is always sufficient during working hours.
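The shortest-path step in assumption (i) can be sketched with a standard Dijkstra implementation over the directed arc costs; the example graph in the test is an assumption.

```python
import heapq

def dijkstra(arcs, source):
    """arcs: {(u, v): deadheading cost}; returns {vertex: shortest cost
    from source} for all reachable vertices."""
    adj = {}
    for (u, v), c in arcs.items():
        adj.setdefault(u, []).append((v, c))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist.get(u, float('inf')):
            continue  # stale heap entry
        for v, c in adj.get(u, []):
            nd = du + c
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

The same routine serves both the trip from the depot to the subzone and the return trip for battery reloading.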
Based on the above assumptions, the MDP for the mth worker w_m consists of the following components. Decision state: s^m_k describes the state of the system observed by worker w_m at his kth decision point, as illustrated in Figure 4. Decision: x^m_k is a one-hot vector of length |A| representing the next arc to visit.
State transition function: P_m(s^m_{k+1} | s^m_k, x^m_k) describes the probability that the system moves to the next decision state s^m_{k+1} after worker w_m makes decision x^m_k in the current decision state s^m_k. The next decision state s^m_{k+1} is also influenced by the stochastic exogenous information in the system (e.g., the stochastic user demand) and the operations of other workers. Reward function: R_m(s^m_k, x^m_k) gives the reward collected in the subzone of worker w_m after decision x^m_k (equation (2)). Since the subzones do not overlap, the cumulative reward in a subzone depends largely on the decisions of the worker in that subzone and only weakly on the decisions of other workers, and the sum of the cumulative rewards over all subzones is the total system reward, consistent with the optimization objective in equation (1). Policy: π_m describes the policy for making a decision under the current state of worker w_m. Let s^m_0 be the initial state of worker w_m and K_m his set of decision points. For worker w_m, our goal is to find an optimal policy π*_m that maximizes his expected accumulation of rewards:

π*_m = argmax_{π_m} E[ Σ_{k∈K_m} R_m(s^m_k, x^m_k) | s^m_0 ].   (3)
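The expected accumulation of rewards that defines the optimal policy can be estimated in practice by Monte Carlo simulation. The sketch below is generic: `simulate_episode` is a hypothetical callback (not from the paper) that runs one sampled episode under a given policy and returns its total reward.

```python
import random

def evaluate_policy(policy, simulate_episode, n_runs=1000, seed=0):
    """Estimate E[sum of rewards] under `policy` by averaging the
    returns of n_runs independently sampled episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        total += simulate_episode(policy, rng)
    return total / n_runs
```

Averaging over many sampled episodes accounts for the stochastic exogenous information (user demands) that makes exact evaluation intractable.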

Spatiotemporal Partitioning Method
As introduced in the previous sections, on the upper level of the solution framework, we decompose the decision processes of multiple workers by partitioning the road network into subzones. Therefore, the way the road network is partitioned has a great impact on labor force allocation and operation efficiency. In this section, we first propose the objective function of the road network partitioning in Section 4.1 and then introduce the algorithms for initial partition generation and partition adjustment at each time period in Sections 4.2 and 4.3.

Objective.
The objective of road network partitioning should not only take into account the static topology of the road network but also make the amount of tasks, i.e., the number of underpowered e-bikes, in each subzone as consistent as possible. This avoids situations where some workers are overworked while others are underworked. We use an edge (i.e., a street segment) as the partition unit, so the two arcs of an edge are always assigned to the same subzone. Therefore, the edge set E is partitioned into N_w (i.e., the number of workers) subsets E_1, E_2, ..., E_{N_w}, with E_i ∩ E_j = ∅ for i ≠ j and ∪_i E_i = E, and the subgraph induced by E_i is a connected graph for i = 1, 2, ..., N_w. If the subgraph G_i induced by E_i is represented in the form of a directed graph G_i = (V_i, A_i), it is obvious that G_i satisfies the constraints for the subzones described in Section 3.3.
Since the graph is partitioned on edges instead of vertices, we use the adjacency matrix M = (m_ij)_{|E|×|E|} to represent the link relationship of the edges in set E: m_ij = 1 indicates that the ith and jth edges in E are linked; otherwise, m_ij = 0. An edge is linked to another edge if and only if they share a common vertex. Suppose an edge is not linked to itself, i.e., m_ii = 0 (see the example in Figure 3). We choose three indicators to evaluate a road network partition. The first indicator is that the number of underpowered e-bikes in each subzone should be as similar as possible, which facilitates the efficiency of battery replacement and makes the workers' incomes (usually related to how many batteries they replace) as even as possible. The second indicator is that the sum of the lengths of the edges in each subzone should be as similar as possible, to make the subzones similar in size. The third indicator is that the links between subzones should be as few as possible, which ensures that the street segments within each subzone are as closely connected as possible.
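The edge-adjacency matrix M can be built directly from the edge list: two edges are linked if and only if they share a vertex, and no edge is linked to itself. A minimal sketch with a hypothetical edge list:

```python
def edge_adjacency(edges):
    """edges: list of (v_i, v_j) tuples. Returns the |E| x |E| 0-1 matrix
    M with m_ij = 1 iff edges i and j share a vertex (and m_ii = 0)."""
    n = len(edges)
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if set(edges[i]) & set(edges[j]):  # common vertex
                M[i][j] = M[j][i] = 1
    return M
```

M is symmetric with a zero diagonal, as required by the definition above.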
The partition is obtained by minimizing

TaskCut(g) = ϕ_1 · Var_i(e_i^T u_p / 1^T u_p) + ϕ_2 · Var_i(e_i^T l / 1^T l) + ϕ_3 · (Σ_{i=1}^{N_w} e_i^T M (1 − e_i)) / (1^T M 1),   (4)

subject to

Σ_{i=1}^{N_w} e_i = 1,   (5)
e_i ∈ {0, 1}^{|E|}, ∀i ∈ {1, 2, ..., N_w},   (6)
Connected(M, e_i) = 1, ∀i ∈ {1, 2, ..., N_w},   (7)

where e_i is the 0-1 indicator vector of edge subset E_i, e_i^T u_p denotes the number of underpowered e-bikes in E_i at time period p, e_i^T l denotes the total length of the edges in E_i, e_i^T M(1 − e_i) denotes the number of links between E_i and its complement in E, and ϕ_1, ϕ_2, and ϕ_3 are the weight coefficients of the three indicators, respectively. Therefore, in equation (4), the first term represents the variance of the (normalized) underpowered e-bike numbers across subzones, the second term is the variance of the (normalized) total edge lengths across subzones, and the third term is the ratio of the number of links between subzones to the total number of links in the road network. Equation (5) ensures that each edge is assigned to exactly one subzone, and equation (6) ensures that the decision variables are 0-1 integers.
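The partition objective can be sketched as follows. The indicator-matrix layout (rows indexed by edges, columns by subzones) and the weight values ϕ are illustrative assumptions consistent with the three indicators above; the connectivity constraint is omitted for brevity.

```python
def task_cut(e, u, l, M, phi=(1.0, 1.0, 1.0)):
    """e: |E| x N_w 0-1 indicator lists (e[i][k] = 1 iff edge i is in
    subzone k); u: underpowered e-bike count per edge; l: edge lengths;
    M: edge adjacency matrix."""
    n_w = len(e[0])
    tasks = [sum(e[i][k] * u[i] for i in range(len(u))) for k in range(n_w)]
    lens = [sum(e[i][k] * l[i] for i in range(len(l))) for k in range(n_w)]

    def norm_var(x):  # variance after normalizing to proportions
        s = sum(x)
        if s == 0:
            return 0.0
        y = [v / s for v in x]
        m = sum(y) / len(y)
        return sum((v - m) ** 2 for v in y) / len(y)

    # links whose endpoints lie in different subzones, over total links
    total_links = sum(sum(row) for row in M)
    cut = sum(M[i][j]
              for i in range(len(M)) for j in range(len(M))
              if not any(e[i][k] and e[j][k] for k in range(n_w)))
    ratio = cut / total_links if total_links else 0.0
    return phi[0] * norm_var(tasks) + phi[1] * norm_var(lens) + phi[2] * ratio
```

A partition that balances tasks and lengths while cutting few links scores lower, which is the direction the adjustment algorithm optimizes.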
Equation (7) ensures that the subgraph induced by each E_i is connected. Since the amount of tasks in each subzone depends on the real-time state of the e-bikes and may change over time, it may not be appropriate to keep a fixed road network partition throughout the day. However, frequent changes to the partition may confuse the workers. Therefore, in the proposed spatiotemporal partitioning method, an initial partition is first generated at the beginning of the operating day and is then adjusted dynamically on the large time scale.

Initial Partition Generation.
When generating the initial road network partition, we first ignore the dynamic e-bike state and consider only the static topology of the road network, ensuring that the initial partition is the same at the beginning of each day. We adopt the solution of the RatioCut problem

min RatioCut(g) = Σ_{i=1}^{N_w} e_i^T M (1 − e_i) / (1^T e_i),   (8)

subject to constraints (5) and (6), as the initial partition. The RatioCut problem in equation (8) is a classic graph-cutting problem that attempts to minimize the links between subzones while making the number of edges in each subzone as close as possible. The number of edges in a subzone is often correlated with the number of tasks and the total length of the edges in that subzone. Various algorithms for this problem have been proposed, and we adopt a solution based on the spectral clustering algorithm; see the literature [42,43]. If a subgraph in the initial partition is not connected, the nonmaximal connected components of that subgraph are assigned to other adjacent subgraphs to ensure that all subgraphs in the initial partition are connected.
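A minimal sketch of the spectral approach: bisect the edge graph by the sign of the Fiedler vector of the graph Laplacian. This is a two-way illustration only; the paper's method produces N_w subzones via spectral clustering [42,43].

```python
import numpy as np

def spectral_bisect(M):
    """M: symmetric 0-1 adjacency matrix of the edge graph.
    Returns 0/1 cluster labels from the sign of the Fiedler vector."""
    M = np.asarray(M, dtype=float)
    L = np.diag(M.sum(axis=1)) - M      # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    fiedler = vecs[:, 1]                # eigenvector of 2nd smallest eigenvalue
    return (fiedler >= 0).astype(int)
```

On a graph with two densely connected groups joined by a single link, the sign split recovers the two groups, approximating the RatioCut minimizer.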

Partition Adjustment Algorithm.
For a new time period p and the road network partition g of the previous time period (for period p_0, the previous partition is the initial partition), we first determine whether the existing partition needs to be adjusted according to the real-time e-bike state; if so, we adjust the existing partition by minimizing the objective in equation (4).
In the last two decades, various algorithms for community detection or graph partitioning have been proposed, a large part of them optimization-based, e.g., modularity optimization methods [44][45][46][47]. Following the literature [46], a simulated annealing heuristic is adopted to adjust the road network partition. Simulated annealing is an effective heuristic that can avoid getting trapped in local minima and, with a suitable cooling schedule, tends toward the global optimum. The process is shown in Algorithm 1.
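The simulated annealing skeleton of Algorithm 1 (Metropolis acceptance with geometric cooling) can be sketched generically. The toy cost function and neighborhood used in the test are assumptions; in the paper, the cost is TaskCut and the neighborhood is given by the three operators.

```python
import math
import random

def anneal(x0, cost, neighbor, beta0=100.0, alpha=0.9,
           inner=10, outer=20, seed=0):
    """Generic simulated annealing: accept worse solutions with
    probability exp(-delta / beta), cool the temperature by alpha."""
    rng = random.Random(seed)
    x, best = x0, x0
    beta = beta0
    for _ in range(outer):
        for _ in range(inner):
            y = neighbor(x, rng)
            delta = cost(y) - cost(x)
            if delta <= 0 or rng.random() < math.exp(-delta / beta):
                x = y  # Metropolis acceptance
            if cost(x) < cost(best):
                best = x  # track the best solution found so far
        beta *= alpha  # geometric cooling, as in Algorithm 1
    return best
```

The default β_0 = 100, α = 0.9, and the inner/outer iteration limits mirror the values listed in Table 1.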
In Algorithm 1, we use three operators to generate new solutions: the task balance operator, the length balance operator, and the local search operator.
The task balance operator consists of the following steps, as illustrated in Figure 5. Step 1: for the existing partition g, TaskCut(g) is calculated according to equation (4), as shown in Figure 5(a).
Step 2: select two adjacent subzones randomly according to their difference in the number of underpowered e-bikes. We ensure that adjacent subzone pairs with a larger difference in underpowered e-bike number have a higher probability of being selected (e.g., through a Boltzmann exploration strategy). Suppose the subzones E_2 and E_3 in Figure 5(b) are selected. As e_2^T u_p > e_3^T u_p, we select the edges in E_2 that are linked to edges in E_3 as the removable edge set of E_2, i.e., the edges (v_0, v_3) and (v_2, v_5).
Step 3: randomly select an edge from the removable edge set of E_2 and transfer it to E_3, such as the edge (v_2, v_5). Note that the selected edge must not break the connectivity of E_2 after being transferred, i.e., the constraint in equation (7) must remain satisfied. Then, a new partition g′ is created, as shown in Figure 5(c). The length balance operator and the local search operator are the same as the task balance operator except for step 2 above: in the length balance operator, the two adjacent subzones for the edge transfer are selected randomly according to their difference in the sum of edge lengths; in the local search operator, they are selected completely at random.
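Step 2's Boltzmann exploration can be sketched as softmax selection over adjacent subzone pairs, weighted by their task-count difference; the temperature value is an assumption.

```python
import math
import random

def select_pair(pairs, diffs, temperature=1.0, rng=None):
    """pairs: adjacent subzone pairs; diffs: their task-count differences.
    Samples a pair with probability proportional to exp(diff / T), so
    pairs with larger imbalance are chosen more often."""
    rng = rng or random.Random()
    weights = [math.exp(d / temperature) for d in diffs]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for pair, w in zip(pairs, weights):
        acc += w
        if r <= acc:
            return pair
    return pairs[-1]  # numerical safety fallback
```

The length balance operator reuses the same sampler with edge-length differences; the local search operator corresponds to equal weights.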
We define TaskCut_min in Algorithm 1 as the minimum TaskCut value at the previous time period. This means that no change to the existing partition is needed as long as the variance of the underpowered e-bike number across subzones in the current period is no greater than that of the previous period. For the first period p0, we set TaskCut_min = 0.
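The acceptance rule and cooling schedule of Algorithm 1 can be sketched as follows; this is a minimal illustration, not the full algorithm, and `task_cut`, `neighbors`, and all parameter values are stand-ins for the TaskCut function, the three operators, and the tuned settings:

```python
import math
import random

def accept(delta, beta, rng=random.random):
    # Metropolis rule: always accept improvements (delta <= 0);
    # accept a worse partition with probability exp(-delta / beta).
    return delta <= 0 or rng() < math.exp(-delta / beta)

def anneal(task_cut, neighbors, g, beta0=1.0, alpha=0.9, outer=50, inner=20):
    # Skeleton of Algorithm 1: an inner loop of operator moves with
    # Metropolis acceptance, geometric cooling, and best-so-far tracking.
    best, best_cut = g, task_cut(g)
    beta = beta0
    for _ in range(outer):
        for _ in range(inner):
            g_new = neighbors(g)  # one of the balance/local-search operators
            if accept(task_cut(g_new) - task_cut(g), beta):
                g = g_new
            if task_cut(g) < best_cut:
                best, best_cut = g, task_cut(g)
        beta *= alpha  # cooling: beta = alpha * beta
    return best
```

The best-so-far bookkeeping mirrors the role of g_best in Algorithm 1: a worse move may be accepted to escape a local minimum, but the output is always the best partition seen.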

Lookahead Policy with MCTS Algorithm
After partitioning the road network into subzones, on the lower level of the solution framework, we model the routing decision process of each worker in their subzone as a single-agent MDP and adopt a lookahead policy with an MCTS algorithm. The lookahead policy based on the rollout heuristic is introduced in Section 5.1; an MCTS algorithm that improves the simulation efficiency is proposed in Section 5.2; and a heuristic pruning rule is added to the MCTS algorithm in Section 5.3.

Lookahead Policy.
Input: u^p at time period p, the edge lengths l, the existing partition g, the initial temperature β0, MaxIter_inner, MaxIter_outer, α ∈ (0, 1), TaskCut_min
Output: new road network partition g_best
If TaskCut(g) ≤ TaskCut_min:
    Keep the existing partition g as g_best
Else:
    Set the temperature β = β0
    Repeat:
        Repeat:
            Create a new partition g′ from g using the task balance operator
            Calculate Δ_task = TaskCut(g′) − TaskCut(g)
            Accept g = g′ with probability min{1, exp(−Δ_task/β)}
            Create a new partition g′ from g using the length balance operator
            Calculate Δ_task = TaskCut(g′) − TaskCut(g) and accept g = g′ with probability min{1, exp(−Δ_task/β)}
        Until the iteration limit MaxIter_inner is reached
        Create a new partition g′ from g using the local search operator
        Accept g = g′ if TaskCut(g′) < TaskCut(g)
        Set β = αβ
        Use g_best to record the best partition solution found
    Until the iteration limit MaxIter_outer is reached
Output the new road network partition g_best

ALGORITHM 1: Simulated annealing-based partition adjustment algorithm.

This paper takes an arc as the unit of the routing decision process, as the two arcs of an edge generally cannot be served at the same time. We use Q*_k(s^m_k, x^m_k) to denote the optimal Q-value of the state-decision pair, representing the expected reward obtained along the optimal policy after executing decision x^m_k in state s^m_k. Thus, the optimal decision can be deduced by the following equation:

x^m*_k = argmax_{x ∈ X(s^m_k)} Q*_k(s^m_k, x). (9)

However, it is difficult to obtain the optimal Q-value in our problem. The difficulty lies not only in the complexity of the problem but also in the stochastic and time-dependent user demands.
This paper therefore adopts a lookahead policy. The rollout policy is one of the lookahead policies [18, 48]; it estimates the approximate value of a feasible decision from simulated experience generated in a simulator, which produces stochastic pickup and return demands according to a priori knowledge or a prediction of the realistic environment. The rollout policy is suitable for problems with a very large state space. The process of the lookahead policy with the rollout heuristic is shown in Figure 6. For the kth decision point of worker w_m, we first simulate from each feasible decision x for a predefined number of simulation runs I_r and then use the Q-value calculated by equation (10) to approximate the optimal Q-value in equation (9).
Q̄_k(s^m_k, x) = (1/I_r) Σ_{j=1}^{I_r} Σ_{i=k}^{k^j_max} λ^{i−k} r^m_{ij}, (10)

where λ is a time discount factor that reduces the impact of future rewards, since the uncertainty grows further into the future. As there are a large number of time steps in an operating day, we truncate the rollout trajectories and set the simulation horizon to H time steps; H is set to 180 according to preliminary tests. k^j_max is the terminal decision point within the horizon H in the jth simulation trajectory, and r^m_{ij} is the reward at the ith decision point of worker w_m in the jth simulation trajectory. For the decision points i = k + 1, k + 2, . . . , k^j_max, the decisions are made following the base heuristic policy in Algorithm 2; the other workers also follow the base heuristic policy when making decisions during the simulation.
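A minimal sketch of this truncated, discounted rollout evaluation, under the assumption that a simulator returns the sequence of per-decision-point rewards of one trajectory (all names are illustrative; λ appears as `gamma`):

```python
def rollout_value(simulate, decision, runs=16, gamma=0.95):
    # Estimate Q(s, x): average, over `runs` trajectories, of the
    # discounted reward collected after committing to `decision` and
    # then following the base heuristic policy until the horizon.
    total = 0.0
    for _ in range(runs):
        rewards = simulate(decision)  # per-decision-point rewards r_k, r_{k+1}, ...
        total += sum(gamma ** i * r for i, r in enumerate(rewards))
    return total / runs

def best_decision(simulate, feasible, **kw):
    # Greedy choice over the estimated Q-values, as in equation (9).
    return max(feasible, key=lambda x: rollout_value(simulate, x, **kw))
```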

MCTS Algorithm.
The rollout policy spends the same number of simulation runs on every feasible decision, so it underexploits decisions that appear promising and wastes simulation on decisions that are clearly inferior to others [48]. Therefore, this paper adopts the MCTS sampling algorithm, a variant of the rollout policy that better trades off exploration against exploitation when allocating simulation resources.
When making a routing decision, we are in effect trying to find an arc belonging to the optimal route. Constructing a decision tree of all feasible decisions is equivalent to traversing all possible routes that a worker can take in the future.
In the decision tree, a node represents an arc. The root node (denoted as node_1) represents the arc where the worker is currently located. The child nodes of a given node indicate the arcs that can be expanded from the given arc. All children of the root node form the feasible decision set. The nodes between the root node and a leaf node form a possible route L = (node_1, node_2, . . .). It is worth noting that if the worker runs out of batteries within the simulation horizon, he returns to the depot immediately to reload batteries and then goes back to his subzone to continue operating; the arcs passed during this detour are not recorded in the decision tree.
On the basis of the rollout policy, the MCTS algorithm maintains a decision tree during the simulation. Each node in the decision tree has two attributes: n_i denotes the number of times node_i has been visited, and score_i denotes the value of node_i. The root node and its child nodes are initialized with n_i = 0 and score_i = 0.
MCTS is an iterative algorithm whose maximum iteration number is set to I_max. Each iteration contains four main parts: tree traversal, node expansion, random simulation, and backpropagation.

(i) Tree traversal: tree traversal starts from the root node and selects child nodes layer by layer until the current node is a leaf node. The criterion for selecting a child node is the UCB rule [49]:

UCB(node_i) = score_i/n_i + c · sqrt(ln N / n_i), (11)

where node_i is a child node, N denotes the total number of visits of its parent node, and c is a weighting factor. The child node with the largest UCB value is selected. After a child node is selected, the simulator simulates the user demand and the operation of the workers. When the worker completes the task, a new child node is selected, until the current node is a leaf node.

(ii) Node expansion: if the current leaf node has not been visited, go to the next step; if it has been visited, expand its children into the decision tree and initialize the visit counts and values of these child nodes to 0. One child node is then selected at random for the next step.

(iii) Random simulation: random simulation is performed from the current leaf node until the simulation horizon H, according to the base heuristic policy in Algorithm 2. Suppose the worker is w_m, the current decision point is k, and the terminal decision point in the simulation is k_max. At the end of the simulation, a route L of w_m is obtained and evaluated by

Score(L) = Σ_{i=k}^{k_max} λ^{i−k} r^m_i. (12)

(iv) Backpropagation: after obtaining Score(L), the nodes of the decision tree that also lie on route L are updated according to

n_i = n_i + 1, (13)
score_i = score_i + Score(L). (14)

After the backpropagation, the decision tree is updated. The algorithm then returns to the tree traversal in part (i) and continues iterating until the iteration number reaches I_max.

Journal of Advanced Transportation
Finally, the child node of the root node with the highest value will be selected as the next arc to visit.
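The tree statistics above can be maintained roughly as follows; a hypothetical sketch in which each node stores the pair (accumulated score, visit count) and child selection follows the UCB rule:

```python
import math

def ucb(score, visits, parent_visits, c=1.4):
    # UCB value of a child node; unvisited children are tried first.
    if visits == 0:
        return float("inf")
    return score / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, c=1.4):
    # `children` maps node id -> (accumulated score, visit count);
    # pick the child with the largest UCB value.
    parent_visits = sum(n for _, n in children.values())
    return max(children, key=lambda i: ucb(*children[i], parent_visits, c))

def backpropagate(tree, route, score):
    # Update every node on route L: add Score(L) to its value and
    # increment its visit count, as in the backpropagation step.
    for node in route:
        s, n = tree[node]
        tree[node] = (s + score, n + 1)
```

With this representation, the final move choice is simply the root child maximizing the average value score/visits.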

k-Cycle Pruning Rule.
The rules for expanding child nodes introduced earlier, i.e., expanding the arcs that are adjacent to the parent node and in the same subzone, are relatively loose and may generate some apparently low-value routes, such as circular routes. Take Figure 3 for example; the circular route (⟨v7, v4⟩, ⟨v4, v7⟩, ⟨v7, v4⟩, ⟨v4, v7⟩, . . .) may yield a low reward because of repeated visits to the same arcs within a short time. To avoid such situations, we propose a k-cycle pruning rule.
For node_i at the ith level of the decision tree (the root node node_1 being the first level), denote the route from the root node to it by L = (node_1, node_2, . . . , node_{i−1}, node_i). If a child node of node_i is in the set {node_{i+1−min(k,i)}, node_{i+2−min(k,i)}, . . . , node_{i−1}, node_i}, this child node will not be expanded (unless there is no other expandable child node). k is an empirical constant that is tuned manually; routes containing cycles of length less than or equal to k are pruned from the decision tree. For example, if k is set to 2, the decision tree from the arc ⟨v7, v4⟩ in Figure 3 is cut as shown in Figure 7. In the experiment, k is tuned to 6 according to preliminary tests.
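The rule reduces to a membership test over the last min(k, i) nodes of the route; a hypothetical helper, with nodes represented by arc identifiers:

```python
def violates_k_cycle(route, child, k):
    # True if appending `child` would revisit one of the last
    # min(k, i) nodes of the route, creating a cycle of length <= k.
    window = route[len(route) - min(k, len(route)):]
    return child in window

def expandable_children(route, candidates, k):
    # Children that survive pruning; if every candidate is pruned,
    # keep them all so the node still has an expandable child.
    kept = [c for c in candidates if not violates_k_cycle(route, c, k)]
    return kept if kept else list(candidates)
```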

System Simulator
As noted in [50], when no real e-bike trip data are available and the proposed policy cannot be tested in a real system, simulation becomes the natural alternative; it is an integral part of most studies of dynamic planning in complex systems. A system simulator is designed to capture the basic dynamics of the interactions among the road network, e-bikes, workers, and users, as shown in Figure 8.
In the "road network partition" module, an initial road network partition will be generated at the beginning of the operating day and will be adjusted every new period.
An e-bike in the EBSS has three attributes: location, battery power level, and usage state. It is assumed that users only pick up or return e-bikes at the virtual parking spots.

Input: state s^m_k = (t^m_k, η^m_k, φ^m_k, L^m_k, C^m_k), current road network partition g
Output: decision x^m_k
(1) Calculate the feasible decision set X(s^m_k) based on the current worker's location L^m_k and the road network partition g.
(2) Calculate the ratio u_dp(x) of the underpowered e-bike number to the available e-bike number on each feasible arc x ∈ X(s^m_k) based on the e-bike states η^m_k, φ^m_k. If the available e-bike number is 0, replace it with a sufficiently small value.
(3) Select a decision x^m_k with Boltzmann exploration: P(x^m_k | s^m_k) = exp(u_dp(x^m_k)) / Σ_{x ∈ X(s^m_k)} exp(u_dp(x)).

ALGORITHM 2: Base heuristic policy.

Therefore, e-bikes are only parked within the virtual parking spots or used on the road. The configuration at the beginning of the simulation includes the number of e-bikes in each virtual parking spot and the initial battery power level of each e-bike. In each time step, the attributes of an e-bike can be changed by two factors: the worker's operations and the users' usage.
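Step (3) of Algorithm 2 can be sketched as follows; `utilities` maps each feasible arc to its ratio u_dp(x), and the sampling is plain inverse-transform over the Boltzmann weights (names are illustrative):

```python
import math
import random

def boltzmann_choice(utilities, rng=random):
    # Sample an arc with probability proportional to exp(u_dp(x)).
    arcs = list(utilities)
    weights = [math.exp(utilities[a]) for a in arcs]
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    for arc, w in zip(arcs, weights):
        cumulative += w
        if threshold < cumulative:
            return arc
    return arcs[-1]  # guard against floating-point round-off
```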
In the "pickup demand" module, the simulator generates a batch of new orders every time step. An order contains the following attributes: origin virtual parking spot, destination virtual parking spot, distance, and expected travel time.
The origins of the orders are first generated stochastically according to a certain distribution. Spatially, the virtual parking spots are aggregated by arc, i.e., virtual parking spots on the same arc share the same pickup demand distribution. Temporally, the pickup demand at each time step is assumed to follow a Poisson distribution. Given the origin of an order, the destination is generated from a Rayleigh distribution, f(x) = (x/σ²)exp(−x²/(2σ²)), where x is the Manhattan distance from the destination to the origin and σ is a parameter set to 3 empirically. Given the origin and the destination, the shortest path between them in the road network is taken as the distance (computed, e.g., by the Dijkstra or Floyd algorithm). The user's speed is assumed to be drawn from a normal distribution N(12 km/h, 5 (km/h)²), set empirically. The expected travel time can therefore be estimated and is rounded to an integer number of time steps.
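The order-generation rules above can be sketched as follows (a simplified illustration: Poisson pickup counts via Knuth's multiplication method, Rayleigh trip distances via the inverse CDF, and normally distributed speeds; the parameter values follow the text):

```python
import math
import random

def sample_pickups(rate, rng=random):
    # Number of new pickup orders in one time step: Poisson(rate),
    # sampled with Knuth's multiplication method.
    limit, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_trip(sigma=3.0, rng=random):
    # Trip distance (km) ~ Rayleigh(sigma) via the inverse CDF, and
    # speed (km/h) ~ N(12, 5); returns (distance, travel time in hours).
    u = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    distance = sigma * math.sqrt(-2.0 * math.log(u))
    speed = max(rng.gauss(12.0, math.sqrt(5.0)), 1.0)  # clip implausibly slow draws
    return distance, distance / speed
```

In the simulator proper, the Rayleigh draw fixes a Manhattan distance to the destination and the shortest network path then gives the actual trip distance; here both are collapsed into one number for brevity.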
For each new order, the simulator first tries to dispatch it to an e-bike. If there is an available e-bike at the origin virtual parking spot or a neighboring virtual parking spot on the same arc, the dispatch succeeds, the order is tracked by the simulator, and the usage state of the dispatched e-bike is changed to "in use"; otherwise, the dispatch fails and the order is deleted.
In the "return demand" module, at each time step, the simulator subtracts one time step from the expected travel time of the orders in progress. If the expected travel time of an order reaches zero, the order is finished and is deleted from the simulator. The attributes of the e-bike dispatched to this order are then updated: the location is changed to the destination, the battery power level (expressed as remaining range) is reduced by the distance of the order, and the usage state is changed to "not in use".
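The bookkeeping of this module can be sketched as follows; the dictionary field names are illustrative, not the simulator's actual data model:

```python
def advance_orders(orders, ebikes):
    # One time step of the return-demand module: decrement the remaining
    # travel time of each in-progress order; on completion, move the
    # e-bike, deduct its range, and mark it available again.
    finished = []
    for order in orders:
        order["ttl"] -= 1
        if order["ttl"] <= 0:
            bike = ebikes[order["bike"]]
            bike["location"] = order["dest"]
            bike["range_km"] -= order["distance"]
            bike["in_use"] = False
            finished.append(order)
    for order in finished:
        orders.remove(order)  # finished orders leave the simulator
    return len(finished)
```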
Finally, when the time step reaches T, the simulation ends and the simulation data are analyzed.

Numerical Studies
In this section, the results of the numerical study on a test case are shown. In Section 7.1, the test case is presented; Section 7.2 introduces the benchmark policies; a comparison of the performance of different policies is presented in Section 7.3; and more insights into the experiment are given in Sections 7.4-7.6. Some parameters of the model, as well as some tuned parameters in the algorithms, are given in Table 1.

Test Case.
The simulated scenario of the test case study is shown in Figure 9, covering an area of 6 × 6 = 36 km², with 81 vertices and 272 arcs. Vertex 41 is also the location of the depot (i.e., vertex v0). To make the scenario as realistic as possible, we define three types of districts, indicated by red, blue, and green, respectively: the red arcs represent the central business district, the blue arcs represent the residential districts, and the green arcs represent the connective district.
Arcs in differently colored districts are assumed to have different temporal trends of user demand. For example, in residential districts, the pickup demand for e-bikes is higher in the morning rush than in the evening rush, while in the central business district, it is higher in the evening rush than in the morning rush. The distribution of user demand is assumed to be the same for all time steps within an hour. The relative pickup demand levels for virtual parking spots in the three districts during each hour are shown in Figure 10.
Given the relative pickup demand level and the total pickup demand number, the specific pickup demands can be generated for each virtual parking spot at each time step. To verify the validity of the proposed policy, the total daily pickup demand is tested at three levels: 2000, 4000, and 8000 pickup demands/day. Given the origin of an order, the destination, distance, and expected travel time are generated by the rules in Section 6.
There are three arc lengths in the road network: 0.5 km, 1 km, and 2 km. We assume 1-3 virtual parking spots are randomly distributed along each arc, so the density of virtual parking spots is greater in the residential and central business districts than in the connective district, because most arcs in the former are shorter than those in the latter.
This is usually in line with reality. The total number of e-bikes in the system (i.e., N_eb) is fixed at 1000. In the initial configuration, the e-bikes are randomly distributed among the virtual parking spots and the battery level is uniformly distributed in the range [0, ψ_max], where ψ_max is set to 40 km.
We assume there are 4 workers/vehicles and that the battery capacity of each vehicle is 30. The speed of a worker operating in his subzone is assumed constant at 10 km/h in the residential and central business districts and 12 km/h in the connective district; the speed when traveling between the depot and his subzone is assumed constant at 15 km/h. From these speed parameters, the deadheading cost of each arc can be calculated. The time cost to replace a battery is set to 1 minute, and the time cost to reload a battery at the depot is set to 0.5 minutes. We simulated 7 days to evaluate each policy, with working hours from 7 am to 10 pm. The initial settings of the system are generated from the same distribution every day to test the effectiveness of the dynamic operations.

Benchmarks.
To verify the effectiveness of the suggested policy, the following benchmark policies are designed.
Random policy: the random policy selects the feasible decisions with equal probability.
ϵ-greedy policy: without considering future information, the ϵ-greedy policy selects the next arc to visit based on the ratio of the underpowered e-bike number to the available e-bike number on each feasible arc (if the available e-bike number is 0, it is replaced with a sufficiently small value). Suppose the number of feasible arcs is n. The feasible arc with the maximum ratio is selected with probability (1 − ϵ) + ϵ/n; any other feasible arc is selected with probability ϵ/n. Here ϵ is tuned to 0.3.
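A minimal sketch of this benchmark; `ratios` maps each feasible arc to its (already zero-guarded) underpowered/available ratio:

```python
import random

def epsilon_greedy(ratios, eps=0.3, rng=random):
    # With probability eps pick a feasible arc uniformly at random;
    # otherwise pick the arc with the maximum ratio. The best arc is
    # thus chosen with total probability (1 - eps) + eps/n.
    arcs = list(ratios)
    if rng.random() < eps:
        return rng.choice(arcs)       # explore
    return max(arcs, key=ratios.get)  # exploit
```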
Rollout policy: the lookahead policy with the rollout heuristic (denoted as the rollout policy) introduced in Section 5.1 is also used as a benchmark. Its simulation horizon and evaluation method are consistent with those of the MCTS algorithm.

Results.
We compare the suggested lookahead policy with the MCTS algorithm (denoted as the MCTS policy) against the benchmarks at multiple demand levels. The number of simulation runs per feasible decision, I_r, is tuned to 16 in the rollout policy. The maximum iteration number I_max in the MCTS policy is set to the number of feasible decisions multiplied by 16, so that the total number of simulation runs at each decision point is the same for both policies.
Three demand levels are tested: 2000, 4000, and 8000 pickup demands/day. The average score of each policy is calculated from the simulation trajectories by equation (1), and the results are shown in Figure 11. The rollout policy and the MCTS policy significantly outperformed the myopic random and ϵ-greedy policies. Except at the demand level of 2000 pickup demands/day, the MCTS policy also outperformed the rollout policy. Notably, at 2000 pickup demands/day, the MCTS policy performed slightly worse than the rollout policy, and the ϵ-greedy policy slightly worse than the random policy. This occurs because at a low demand level (with the same numbers of e-bikes and workers), the amount of work per worker is relatively low; the labor force is then relatively abundant, and the results are not very sensitive to the choice of policy, given the assumption that vehicles cannot stop temporarily during operation.
The number of replaced batteries varies more across policies, while the number of successful pickups varies less, so the number of replaced batteries has a higher impact on the final score. In fact, we found that the impact of the battery replacement operation alone on the pickup success rate is limited, because the EBSS also has other problems that affect pickup success, such as the unbalanced distribution of e-bike supply and user demand, which battery replacement cannot solve.

The Partitioning.
For a given scenario, the road network partition is adjusted every time period. Taking the test case with a demand level of 4000 pickup demands/day as an example, the road network partition during several working hours of a day is shown in Figure 12.
In Figure 12, edges of the same color belong to the same subzone. The results show that the spatiotemporal partitioning method partitions the road network well according to its topological characteristics and can adapt to the temporal workload characteristics.
The results show that the subzones are relatively close in size and that the partitions in successive periods are relatively similar, which ensures that subzone changes do not cause too much confusion for the workers. The average number of batteries replaced by each worker at different demand levels is shown in Table 2, where the average number of initial underpowered e-bikes in each subzone at the beginning of the operating day is denoted "initial_underpowered," the average number of newly emerged underpowered e-bikes in each subzone during the operating day is denoted "new_underpowered," and the average number of batteries replaced by each worker is denoted "battery_replaced." The %completion_rate is defined as

%completion_rate = battery_replaced / (initial_underpowered + new_underpowered) × 100%.
As the number of user demands increases, the number of batteries replaced by each worker also increases, but the %completion_rate of each worker decreases. For a given demand level, the numbers of batteries replaced by the workers are relatively close, which demonstrates the effectiveness of the partitioning algorithm. The workers w2 and w4 replaced somewhat more batteries than w1 and w3: in subzones where the road network is sparse and the e-bikes are more sparsely distributed (such as the subzones of w1 and w3 in Figure 12), a worker has to spend more time traveling, resulting in fewer batteries replaced in the same operating time. The results also show that, at each demand level, the %completion_rate of the different workers is relatively close, which indicates that the partition is reasonable and distributes the labor force appropriately. Furthermore, the number of workers is sufficient at the demand levels of 2000 and 4000 pickup demands/day, because the batteries replaced exceed the newly appeared underpowered batteries; at 8000 pickup demands/day, more workers are needed, as each worker replaces fewer batteries than the newly appeared underpowered ones.

Dynamic and Static Operations.
To verify the necessity of dynamic operation, we compared the number of lost users under dynamic and static battery replacement operations. In the static case, no operation is applied to the e-bikes during the daytime, but the initial setting of a day is the same as in the dynamic case, under the assumption that a static battery replacement operation takes place at night. Lost users are divided into two categories: users lost because there is no e-bike at the virtual parking spots, denoted "unbalance lost," and users lost because the e-bikes at the virtual parking spots are underpowered, denoted "underpowered lost." The average numbers of lost users under static and dynamic operations at the three demand levels are shown in Table 3. The results show that the dynamic operation reduces the total number of lost users compared with the static case. In addition, dynamic battery replacement mainly reduces the number of "underpowered lost" users but may exacerbate the imbalance of the e-bike supply, resulting in more "unbalance lost" users. One reason for the high "unbalance lost" number under dynamic operation is that, for an arc with underpowered e-bikes, the subsequent failed pickups are recorded as "underpowered lost" in the static case; under dynamic operation, however, the underpowered e-bikes may have their batteries replaced by the worker in time and be picked up by the users who arrive first, so the subsequent failed pickups on this arc are recorded as "unbalance lost." Therefore, both the battery replacement operation and the rebalancing operation are essential for the EBSS.
The average numbers of lost users saved by the dynamic operation (compared with the static operation) at the three demand levels are 87.8, 314.0, and 1131.8, respectively. This suggests that at a low demand level, such as 2000 pickup demands/day, it may not be cost-effective to perform dynamic operations, or the number of workers can be reduced appropriately, owing to the low ratio of user demand to e-bike supply.
Simulation Runs.
In the MCTS policy and the rollout policy, the number of simulation runs is a key factor in estimating the value of feasible decisions. In general, more simulation runs yield a more accurate estimation but also require more computational cost.
Because the MCTS policy can allocate more simulation runs to more rewarding routes, it requires less computation to achieve the same effect as the rollout policy. We compared the results of the two policies for different numbers of simulation runs at the demand level of 8000 pickup demands/day, as shown in Table 4.
Four numbers of simulation runs are tested: 4, 8, 16, and 32 per feasible decision. Taking 8 simulation runs as an example, if a decision point has 3 feasible decisions, the total number of simulation runs at that point is 24. The results show that more simulation runs yield a higher final score for both the rollout policy and the MCTS policy. With 8, 16, and 32 simulation runs, the MCTS policy outperforms the rollout policy, indicating that the MCTS policy needs only half the simulation runs to obtain results as good as or better than the rollout policy. With 4 simulation runs, however, the MCTS policy performs worse than the rollout policy: when the total simulation budget is low, the runs allocated to some feasible decisions may be so inadequate that the MCTS estimates deviate greatly.

Conclusion
In this paper, we have modeled battery replacement in an EBSS as a DCARP according to the distribution characteristics of the virtual parking spots. The objective is to maximize a weighted sum of the number of successful pickup demands and the number of batteries replaced. A "Partition-First, Route-Second" solution framework was proposed for the scenario with multiple workers: a spatiotemporal partitioning method partitions the road network into subzones on the large time scale, and the routing decision process of each worker is modeled as an MDP on the small time scale. A lookahead policy with an MCTS sampling algorithm was proposed to anticipate the potential future influence of current decisions. The numerical results show that the spatiotemporal partitioning method avoids conflicts between workers' operations and allocates the labor force reasonably. The MCTS policy outperformed the myopic policies and reduced the simulation runs significantly with no loss of effectiveness compared with the rollout policy. The limitations of the application scenarios of the suggested policy should also be noted. Like the rollout policy and other lookahead policies, the MCTS policy relies on a forecast of future exogenous information, such as the user demand trend. As analyzed in [25], if the noise in user demand becomes too significant, anticipating future demand becomes less feasible, and a policy based only on the current state may yield better results than a lookahead policy.
Future research on the EBSS may focus on performing the battery replacement and rebalancing operations simultaneously. The imbalance between supply and demand has been studied extensively in bike sharing systems and also exists in the EBSS; our experiments show that dynamic battery replacement alone yields limited improvement in the pickup success rate. As an e-bike has not only a location attribute but also a battery power level attribute, the rebalancing problem of an EBSS may be more complicated and needs further study.

Data Availability
The data used to support the findings of this study are available at https://drive.google.com/file/d/12Dl_Y1HkTVHl3FNoRQgr3S2lKtYnvQOe/view?usp=sharing.

Conflicts of Interest
The authors declare that they have no conflicts of interest.