Stochastic charging scheduling of a parking lot with wind power: a state-aggregation method based on Markov decision processes

Optimal charging scheduling of electric vehicles with renewable energy can greatly reduce the electricity cost of parking lots and also contributes to environmental protection. However, the uncertainties of renewable energy and electric vehicle charging demand make it challenging to obtain the optimal charging policy. To this end, this paper presents a state-aggregation-based dynamic programming method for such a stochastic charging scheduling problem. Specifically, a new Markov decision process based formulation is first established for the multi-stage stochastic programming problem. Second, a novel state-aggregation method is proposed to relieve the curse of dimensionality caused by the large state-action space without sacrificing the optimality of the charging policy. In addition, the consistency of the problem before and after state aggregation is verified by the proposed theorems on both feasibility and optimality. Numerical tests show that both economic efficiency and computational efficiency are improved, which demonstrates the effectiveness of the proposed method.


INTRODUCTION
To alleviate environmental pollution, electric vehicles (EVs) have developed vigorously owing to their zero-carbon emission characteristics [1][2][3]. Meanwhile, the integration of renewable energy also contributes to environmental protection. To this end, the optimal scheduling of EVs with renewable energy has attracted increasing attention and become an urgent problem.
Charging EVs with renewable energy is significant for both environmental protection and energy saving. In recent years, studies on charging EVs with renewable energy have mainly followed two directions. One is the safe operation of the power grid, which focuses on the centralized scheduling of EV groups; these works usually schedule the charging time and quantity of EVs to shift the charging load according to the amount of renewable energy [4][5][6][7][8][9]. The other is the economic dispatch of EVs, which decides the charging behaviour of EVs to minimize the charging cost [10,11]. In this paper, we focus on the economical scheduling of coordinated multi-EV charging and propose an effective method that accounts for the uncertainties of both EV charging demand and renewable energy.

Most existing works on the charging problem focus on the EVs themselves. However, EV-centric charging policies require a high degree of cooperation from EV users, which is sometimes hard to obtain. We therefore schedule charging from the viewpoint of the charging piles and conduct the research based on the charging piles in the parking lot. Scheduling the charging behaviour of charging piles in the parking lot is meaningful: with the growth of EVs, parking lots have become the most common charging places in cities [12][13][14], and it is more feasible for the operator to manage the charging piles than to control the EV users. In this paper, we take charging scheduling in an enterprise's private parking lot as the application scenario, considering the willingness of EV users to cooperate and the regularity of EV journeys in such parking lots.
Solving the charging scheduling problem in the parking lot involves three main challenges. First, there are large uncertainties in both the charging demand of EVs and the renewable energy output, which generate a large number of charging states. Second, the problem is time-coupled and involves multiple processes: each EV has a charging deadline, so the current charging decision affects future decision-making, and wind power breaks the independence of the charging piles, turning the charging process of a single pile into the cooperative charging process of multiple piles. Third, the large number of uncertainties makes the stochastic problem hard and time-consuming to solve.
To address the challenges above, we model the multi-stage stochastic optimization problem as a Markov decision process (MDP), which clearly represents the stochasticity of the problem and the relation between adjacent stages. Many studies also model the charging problem as an MDP, but they decide the total charging power for an EV aggregation, which cannot guarantee that the charging constraint of each EV is satisfied. In [15], the profit-maximizing joint charging problem of a public charging station is modelled as an MDP that decides the total charging rate for all EVs. In [16], the joint optimization problem of EVs is modelled as an MDP that aggregates the load of all EVs and decides the total charging power at each stage. In contrast, the MDP model in this paper describes a joint charging problem of EVs whose action is a high-dimensional vector composed of the charging action of each EV, and whose state is the combination of the states of all EVs. The MDP thus directly decides the action of each EV and chooses the optimal action within a feasible action zone calculated from the action constraint to ensure feasibility. A few works build MDP models that consider the state of each EV for joint charging scheduling, usually taking the parking time and charging demand of each parked EV as states [17,18]. We add a new state for each EV, called the occupancy state, which indicates whether the corresponding pile is occupied and better describes the randomness of EV arrivals. However, solving the MDP is another difficulty because of the curse of dimensionality when the state and action spaces are very large. Dynamic programming (DP) is the traditional method for solving MDP problems and yields the global optimal solution [19], but it is time-consuming for problems with a large number of states.
In [20], the stochastic dual DP (SDDP) method is used to solve a charging scheduling problem considering the uncertainty of future charging demand; it takes several hours to solve the problem. Approximate dynamic programming (ADP) is another family of methods for solving MDP problems [21], which reduces the computational complexity by sacrificing part of the solution accuracy. ADP methods for charging problems can be divided into three categories. The first defines a deterministic charging rule with little decision-making complexity. In [22], an energy scheduling problem of EVs in a residential distribution network is modelled as an MDP, and a rule-based charging policy based on the contribution of each EV is proposed to guide the charging action. The second category is lookahead policies, which simulate the state transitions and action selections over a future period to evaluate the value function, such as Monte Carlo (MC) simulation and the rollout algorithm. In [17,18], a simulation-based policy improvement method is proposed for the charging scheduling problem, which samples a large number of scenarios to approximate the value function. In [23], a two-stage approximate dynamic programming algorithm is proposed to obtain the optimal charging policy for EVs in a commercial-building parking lot; it reduces the computational complexity by generating a large number of deterministic samples containing the stochastic vehicle arrivals and departures as well as the dynamic electricity price. The third category is function estimation, which fits the value function directly by constructing a parameterized function [24]. With the development of neural networks (NN), traditional methods are often combined with NNs to reduce the difficulty of feature construction.
In [25], the charging problem of EVs is modelled as a two-level MDP, and proximal policy optimization (PPO) is used to solve both the lower- and upper-level MDP problems. In [26], safe deep reinforcement learning (SDRL) is proposed to solve the constrained charging scheduling problem to minimize the charging cost. However, rule-based methods cannot deal with the dynamics of future stages, which causes solution errors in stochastic problems. The accuracy of lookahead policies depends on the number of samples, and it is difficult to balance accuracy against computation time. Value-function estimation takes a long time to train the function and easily falls into local optima. In contrast, our method obtains the global optimal solution at an acceptable time cost.
In this paper, we aim to obtain an optimal charging policy that minimizes the cost of purchasing electricity from the power grid. A novel state-aggregation method is proposed to decrease the computational complexity, and a state-aggregation-based dynamic programming (SABDP) method is proposed to obtain the optimal charging policy. The main contributions of this paper are as follows.
• An MDP is used to model the joint charging scheduling problem of multiple EVs, describing the uncertainties of each EV and the time-coupled relations more clearly.
• A novel state-aggregation method is proposed, which aggregates states by merging those with the same constituent elements but different arrangements.
• SABDP is proposed to solve the MDP problem, and numerical experiments show that it obtains the global optimal value and the optimal charging policy with less computation time and memory than other classic methods.
The rest of the paper is organized as follows. Section 2 formulates the proposed optimal charging model, and the state-aggregation method and solution method are introduced in Section 3. Numerical testing is presented in Section 4, and Section 5 concludes the paper.

System description and assumption
In this paper, we focus on a stochastic charging scheduling problem in a parking lot equipped with N fast charging piles, AC/DC converters, and small wind turbines, as shown in Figure 1. The charging energy can be obtained directly from the wind generators or purchased from the power grid, and the cost of wind power is regarded as zero. An intelligent scheduler in the parking lot decides the charging actions of the EVs, considering the charging information of the EVs and the available power resources. A basic consensus is assumed: the charging behaviour of the EVs during the parking period is completely controlled by the scheduler, and the charging station must meet the charging demand of each EV at a low charging cost before the EV leaves. Moreover, the parking lot can obtain the distributions of the charging demand and the wind power output from historical data, so as to predict the future uncertainty of both the EVs and the wind power. The model is established on the following assumptions.
1) The output power of the charging pile is constant.
2) The charging process of EVs can be interrupted.
Assumption (1) simplifies the charging decision to deciding only which charging piles work, rather than the charging power [17]. Assumption (2) means that the smart scheduler allows a discontinuous charging process and flexible control, to cope with fluctuations in renewable energy and electricity price for better economic performance in the parking lot. This charging mode is widely used in current research on the economics of EV charging [14,15,25].

Markov decision processes model
An MDP is used to model the stochastic dynamic programming problem. An MDP model is defined by the tuple ⟨𝒯, 𝒮, 𝒜, P, R, γ⟩, comprising the decision times, state space, action space, state transition matrix, reward function, and discount factor [27]. The established MDP-based model for a general charging scheduling problem in the parking lot is described as follows.

State space
State space  contains all the possible states of the problem. Each state for the joint-scheduling problem is a highdimension state, consisting of the state of the charging piles and renewable energy, denoted as S t = [s 1 For the convenience of discretization, S W ,t is mainly considered to be affected by the velocity of wind. The calculation formula is shown in Equation (3).

Action space
The joint action of the problem is to decide which charging piles work at stage t, defined as A_t = [a_t^1, a_t^2, …, a_t^N], where a_t^i ∈ {0, 1}; if a_t^i is 1, the ith charging pile is chosen to work.
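The feasible action zone mentioned in the introduction can be sketched by enumerating binary joint-action vectors and filtering them against the bus-power limit. This is a minimal illustration with our own parameter names; the deadline part of the action constraint is omitted here for brevity:

```python
from itertools import product

def feasible_actions(states, p_pile, p_max):
    """states: list of (o, l, e) tuples per pile -- occupancy flag, remaining
    parking intervals, remaining demand. Enumerates binary joint actions that
    respect the bus-power limit; only occupied piles with remaining demand
    may charge (deadline handling omitted in this sketch)."""
    actions = []
    for a in product((0, 1), repeat=len(states)):
        ok = all(not a[i] or (o == 1 and e > 0)
                 for i, (o, l, e) in enumerate(states))
        if ok and sum(a) * p_pile <= p_max:
            actions.append(a)
    return actions

# Two occupied piles with demand, one idle; bus power admits at most one pile
acts = feasible_actions([(1, 4, 3), (1, 2, 1), (0, 0, 0)],
                        p_pile=30, p_max=30)
```

Enumerating 2^N vectors is only viable for the small pile counts considered in the paper (fewer than eight piles); larger N would need a smarter search.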

State transition matrix
P_t describes the transition probability between states of adjacent decision stages. Before giving the state transition matrix, the state transition functions are derived from the physical characteristics of the states.
Equation (5) describes the transition of the occupancy state, where o_t^i = 1 represents that the ith charging pile is occupied and Δt is the decision interval. Equations (6) and (7) are the state transition functions of l_t^i and e_t^i. The first lines of Equations (6) and (7) show that the transition is deterministic for the ith charging pile if it is occupied at time t. The second lines indicate that the future l_t^i and e_t^i of the ith charging pile are uncertain if the pile is not occupied at time t: l̃_{t+1}^i and ẽ_{t+1}^i are two random values following certain distributions, representing the possible parking time and charging demand.
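The deterministic-while-occupied, random-on-arrival structure of Equations (5)-(7) can be sketched as a single-pile step function. The arrival probability and the arrival sampler below are illustrative placeholders, not the paper's distributions:

```python
import random

def step_pile(o, l, e, a, dt=1, charge_per_step=1,
              sample_arrival=lambda: (random.randint(1, 6), random.randint(1, 4)),
              p_arrive=0.3):
    """One-step transition for a single pile: deterministic while occupied
    (Eqs. (5)-(7), first lines), random new (l, e) on arrival (second lines).
    The arrival sampler and probability are illustrative placeholders."""
    if o == 1:
        l2 = l - dt                           # parking time counts down
        e2 = max(e - a * charge_per_step, 0)  # demand shrinks if charged
        if l2 <= 0:                           # EV departs at its deadline
            return (0, 0, 0)
        return (1, l2, e2)
    if random.random() < p_arrive:            # a new EV arrives with random l, e
        l_new, e_new = sample_arrival()
        return (1, l_new, e_new)
    return (0, 0, 0)
```

The occupied branch is fully deterministic, matching the first lines of Equations (6) and (7); randomness enters only through new arrivals.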
The transition probability of s_t^i is calculated by Equation (8), which is the joint probability of o_t^i, l_t^i, and e_t^i. The charging demand is regarded as following the Chi-square distribution χ²(k). The transition of S_{W,t} is regarded as a Markov process, which means the wind power transfers to a random value according to a probability distribution, with transition probability denoted as p(S_{W,t+1} | S_{W,t}). The total transition probability, combining the transition probabilities of S_{W,t} and S_{CP,t}, is given in Equation (12).

Reward function
The reward function calculates the immediate reward of taking an action and reflects the worth of that action. In this paper, the reward is the expected cost of purchasing electricity.

Objective function
The goal is to get the optimal charging policy to minimize the total expected cost of purchasing electricity, shown in Equation (14).
where E_π(⋅) is the expected charging cost under strategy π. The first constraint restricts the total charging power at stage t to be less than the bus power P_max. The second is the action constraint, which requires that the charging demand be satisfied before the EV leaves; Δl is the maximum redundancy time, that is, the maximum difference between the occupancy time and the actual charging time.
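One reading of the action constraint is that a pile must be charged once its remaining slack falls to the redundancy margin Δl; otherwise the demand can no longer be met before departure. This sketch encodes that reading (our interpretation, with illustrative parameter names):

```python
import math

def must_charge(l, e, charge_per_step=1, dl=0):
    """Whether a pile must charge now so its demand can still be met before
    departure. With constant pile power (Assumption 1), finishing takes
    ceil(e / charge_per_step) intervals; charging is forced once the slack
    l - steps_needed drops to the redundancy margin dl (our reading of the
    paper's action constraint, not a verbatim reproduction)."""
    steps_needed = math.ceil(e / charge_per_step)
    slack = l - steps_needed
    return e > 0 and slack <= dl
```

Combined with the bus-power limit, such a check carves the feasible action zone out of the 2^N binary action vectors at each stage.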
The Bellman equation is used to iteratively calculate the optimal value in Equation (15), where V* is the optimal value function. It is the basis for obtaining the optimal results of the MDP.
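The backward value recursion behind Equation (15) can be sketched generically as follows. The function signatures (`actions`, `trans`, `reward`) are our own abstractions over the paper's model, with zero terminal values assumed:

```python
def backward_dp(stages, states_by_stage, actions, trans, reward):
    """Backward value recursion:
    V*_t(s) = min_a [ R_t(s, a) + sum_{s'} P_t(s'|s, a) * V*_{t+1}(s') ].
    trans(t, s, a) returns a dict {s': prob}; terminal values are zero."""
    V = {s: 0.0 for s in states_by_stage[stages]}
    policy = {}
    for t in range(stages - 1, -1, -1):
        V_next, V = V, {}
        for s in states_by_stage[t]:
            best, best_a = float("inf"), None
            for a in actions(t, s):
                q = reward(t, s, a) + sum(p * V_next[s2]
                                          for s2, p in trans(t, s, a).items())
                if q < best:
                    best, best_a = q, a
            V[s] = best
            policy[(t, s)] = best_a
    return V, policy

# Tiny two-stage check: action 1 costs 1.0 per stage, action 0 costs 2.0
V, pol = backward_dp(
    2, {0: ["s"], 1: ["s"], 2: ["s"]},
    actions=lambda t, s: [0, 1],
    trans=lambda t, s, a: {"s": 1.0},
    reward=lambda t, s, a: [2.0, 1.0][a])
```

The recursion visits every state at every stage, which is exactly why the state aggregation of the next section matters.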
The established MDP-based model is given by Equations (1)-(14). However, the combinatorial explosion of states makes it hard to obtain the optimal results of the MDP problem; thus, more efficient methods need to be proposed.

SOLUTION METHODOLOGY
Solving the MDP problem may run into the curse of dimensionality, especially when the problem scale is large. To reduce the computational complexity caused by the large state space, we propose an optimality-consistency state aggregation method (OCSAM) that reduces the number of states while retaining the optimality of the solution. Moreover, we propose an improved dynamic programming method based on OCSAM to improve the solution efficiency.

The optimality-consistency state aggregation method
State aggregation is important for reducing the computational burden of large-scale MDP problems. Different from the existing literature [28,29], the novel state aggregation method proposed in this subsection does not sacrifice the accuracy of the optimal results, and the consistency of the solution precision is guaranteed by the theorems below.
The main difficulty in solving the charging problem comes from the combinatorial explosion of the states of multiple charging piles. Each pile state s_t^i has three elements, and each element takes several discrete values; the number of states of a single charging pile is given in Equation (16).
Equation (17) shows that the number of joint states grows exponentially with the number of charging piles. Traversing all these states is both time-consuming and space-consuming. In this paper, we reduce the number of states by exploiting the symmetry of the charging piles: the joint states are generated by state combination instead of state permutation. We propose two theorems to prove the feasibility of the method and the consistency of the results before and after aggregation. The proofs are given in the Appendix.
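The gap between permutation and combination counting can be made concrete: with k possible states per pile and N interchangeable piles, k^N ordered tuples collapse to C(k+N-1, N) multisets. The value k = 50 below is purely illustrative:

```python
from math import comb

def n_states_permutation(k, n):
    """Joint pile states when piles are treated as distinguishable: k^N."""
    return k ** n

def n_states_combination(k, n):
    """After aggregation only the multiset of pile states matters:
    C(k + N - 1, N) multisets of size N drawn from k pile states."""
    return comb(k + n - 1, n)

# e.g. k = 50 states per pile (illustrative), N = 7 piles
before = n_states_permutation(50, 7)  # 781,250,000,000
after = n_states_combination(50, 7)   # 231,917,400
```

The multiset count still grows with N, but polynomially rather than exponentially for fixed k, which is what makes exact DP tractable after aggregation.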
Sort(⋅) is a sorting function that arranges a joint state in ascending order of l and e. Theorem 1 proves that any states with the same composition [s_t^1, s_t^2, …, s_t^N] have the same value V, whatever the arrangement order of the s_t^i is.
Theorem 2. Suppose 𝒮_t and 𝒮′_t are the state sets before and after aggregation, where S′_t^j is the substitution state for a set of states G_t^m that have the same composition as S_t^i in 𝒮_t. The theorem shows that the state aggregation method does not influence the optimality of the results.
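The aggregation that both theorems rely on can be sketched by mapping each joint state to a sorted canonical representative, so that all permutations of the same multiset collapse into one group (an illustrative Python sketch, not the paper's implementation):

```python
def canonical(joint_pile_states):
    """Sort(.) from Theorem 1: map a joint pile state to the canonical
    representative of its aggregation group by sorting the per-pile states.
    All permutations of the same multiset collapse to one state."""
    return tuple(sorted(joint_pile_states))

# Group some example joint states (strings stand in for pile states)
groups = {}
for s in [("a", "b", "c"), ("c", "a", "b"), ("b", "a", "a"), ("a", "a", "b")]:
    groups.setdefault(canonical(s), []).append(s)
# ("a","b","c") and ("c","a","b") fall into one group; the two
# permutations of the multiset {a, a, b} fall into another.
```

Because sorting is a deterministic map onto one member of each group, the replacement required by Theorem 2 is well defined.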
We take a small-scale case to illustrate the aggregation method. Suppose there are three charging piles; the aggregation rules are written as follows, where a, b, c represent three different states of a charging pile:

r1 = {[a, b, c], [a, c, b], [b, a, c], [b, c, a], [c, a, b], [c, b, a]}
r2 = {[a, a, b], [a, b, a], [b, a, a]}
r3 = {[a, a, a]} (22)

In Equation (22), r1 contains all the combinations with three different kinds of state, and r2 and r3 are the combination rules for two kinds and one kind of state, respectively. The states satisfying the same rule r_i ∈ R are aggregated into a group and replaced by one state in the group.

ALGORITHM 1 The optimality-consistency state aggregation method
1: Calculate all the aggregation rules, denoted as R. Each rule describes a set of state combination forms, recorded as r.
2: Aggregate the states into M groups according to the rules r ∈ R, such that all the states in a group G_t^m have the same constituents.
3: Replace all the states in G_t^m by any state S_t^j ∈ G_t^m, and rewrite the state transition probability as below.
5: The total number of states at stage t after reduction is calculated by the following formula, where x is the number of different kinds of states.

ALGORITHM 2 State-aggregation based dynamic programming
1: Given the state space 𝒮_c and transition matrix P_c of each charging pile, and the initial state S_0.
2: Generate all the possible states of each charging pile from t to T.
3: If t = T, generate the aggregated state space 𝒮′_T and initialize the values of all terminal states.
4: If t > 0, set t = t − 1, aggregate the states, and update the state space 𝒮_t and transition matrix P_t. Then calculate the optimal value of each state in 𝒮_t by the Bellman function, Equation (15).

State-aggregation based dynamic programming method
SABDP combines OCSAM with DP, reducing the computational complexity while maintaining the optimality of the results. The method is presented in Algorithm 2.

NUMERICAL RESULTS
In this section, we conduct simulation experiments on charging scheduling in the parking lot of an enterprise and analyse the results from three aspects to show the advantages of the SABDP method. First, we obtain the optimal expected charging cost under different scenarios, which shows that our method is flexible and universal. Second, we compare the time and space complexity with those of DP to verify that our method reduces the scale of the state space greatly and improves computational efficiency. Third, we compare the SABDP method with two experience-based charging policies and two ADP methods to show the optimality of the decisions in our charging policy.

Parameters
The parking lot is equipped with fast charging piles. Considering their cost, the number of fast charging piles in a parking lot is usually fewer than eight [30]. A fast charging pile takes about 30 min to 2 h to fully charge an EV [17]; we take 2 h as the full-charge time according to the parameters of the BYD e6, and the decision interval is 20 min [31]. Many works have shown that wind power is mainly related to wind speed, which is regarded as obeying a Weibull distribution. The shape and scale parameters of the Weibull distribution, 1.309 and 7.0576, were fitted in [17] from data of the National Renewable Energy Laboratory of the United States. For simplicity, we divide the wind power output into six levels, and the wind power is calculated by Equation (23). At each time, the wind power may take any of the six values with different probabilities, so the number of successor states after a transition is at least six.
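Equation (23) is not reproduced in this excerpt; the sketch below uses the piecewise wind-power curve common in the literature (zero below cut-in and above cut-out, a linear ramp up to rated power), together with Weibull wind-speed sampling using the fitted parameters from [17]. The cut-in/rated/cut-out speeds and the linear ramp are our assumptions:

```python
import random

def wind_power(v, p_rated, v_in=3.0, v_rated=12.0, v_out=25.0):
    """Piecewise wind-power curve commonly used in the literature (the
    paper's Equation (23) may differ; the linear ramp and the cut-in/
    rated/cut-out speeds here are illustrative assumptions)."""
    if v < v_in or v > v_out:
        return 0.0                                   # below cut-in / above cut-out
    if v < v_rated:
        return p_rated * (v - v_in) / (v_rated - v_in)  # ramp region
    return p_rated                                   # rated region

def sample_wind_speed(shape=1.309, scale=7.0576):
    """Weibull-distributed wind speed with the parameters fitted in [17]."""
    return random.weibullvariate(scale, shape)
```

Discretizing the resulting power into six levels then yields the finite wind state S_W,t used in the MDP.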
The occupancy time obeys a normal distribution; Table 1 shows the means and variances at different hours. The initial charging demand is correlated with d_t and obeys the Chi-squared distribution with variance v = 8.488 [17]. The relation between charging demand and mileage is shown in Equation (24).
We adopt the time-of-use (TOU) electricity price of Xi'an [32]. Other important parameters are listed in Table 2.

Optimal charging cost and charging policy
We conduct experiments under different wind power levels, numbers of decision stages, and numbers of charging piles, and obtain the optimal charging cost for each setting. Figure 2 shows the evolutionary process from the initial state to the terminal states. The blue circles at stage t are the V values of the states that S_{t-1} can transfer to, and the red solid point is the expected value of all states at stage t. The black line is one deterministic charging scenario sampled from the charging process. We analyse the expected charging cost based on the whole charging process and the optimal charging decisions based on the sampled scenarios. Table 3 shows the optimal expected cost and the abandoned wind power (AW) under different charging scenarios calculated by SABDP, together with the relation between the optimal charging cost and the wind power. From the table, we obtain the following results:
(1) The optimal cost decreases greatly as the wind power increases. Taking seven charging piles and 72 decision stages as an example, compared with the charging cost with no wind, the optimal cost is reduced by 62.28%, 93.16%, and 96.36% as the wind power changes from 5 kW to 12 kW.

FIGURE 3 The charging policies under different wind power
(2) An appropriate amount of wind power is significant for reducing the charging cost and improving energy utilization. If the wind power is too large, the charging demand can be met entirely by wind power, which leads to wasted wind power. If the wind power is too small, electricity must be bought to meet the remaining demand, which causes a large charging cost. However, when the wind power is close to the maximum charging demand of the EVs, there is a good balance between the charging cost and the utilization of wind power, which makes the optimization of EV charging more significant.
In Figure 3, we contrast the charging policies under different wind power levels to demonstrate the impact of wind power. The charging scenarios start from the same initial state and evolve according to their own charging policies. At each stage, the green and red lines are the maximum and minimum charging energy the charging piles can choose, which form the feasible zone of the action. The blue line is the actual charging energy, and the light blue line is the wind power. The optimal charging decision is mainly affected by the wind power, the electricity price, and the action constraint. When the wind power is far less than the charging demand, as in Figure 3(a), extra electricity must be bought after fully using the wind power almost all the time, so the wind power has little influence on the charging actions. The charging policy is then mainly driven by the electricity price, so it tends to buy as much electricity as possible at a low price before stage 22. When the wind power is large enough, as in Figure 3(c), the charging decision is almost the same as that of the myopic policy, which chooses the charging action that makes the current charging cost the lowest; it considers little about the fluctuation of wind power in the future. These results show that the charging policy obtained by SABDP is very close to the experience-based charging policies when the wind power is either very small or very large, and it may seem unnecessary to use SABDP for the global optimal solution, since the experience-based policies obtain a near-optimal solution with less complexity. However, when the wind power is moderate, as in Figure 3(b), the charging policy considers both the wind power information and the electricity price, which reduces the charging cost markedly. For example, comparing Figure 3(a,b) at decision stage three, the maximum charging demand is 60 kWh, which means all charging piles have charging demand.
The charging policy of Figure 3(a) chooses to charge all the charging piles, because there is no need to delay the charging demand to the next stage when the wind power is inadequate all the time. However, in Figure 3(b), fewer EVs are charged at stage three, when the wind power is small, and some of the charging demand is delayed to stage four, which has larger wind power, to avoid purchasing electricity at stage three.

Time and space complexity
State aggregation reduces the scale of the state space greatly. As the number of states decreases, the computational complexity is also reduced. We contrast DP and SABDP on both space-time complexity and solution accuracy. The results indicate that the SABDP method effectively reduces the time and space complexity without affecting the accuracy of the results.
In Table 4, we contrast the number of states before and after state aggregation. The SABDP method alleviates the danger of dimension disaster because it saves storage space greatly. In particular, as the number of charging piles increases, the state-reduction ratio reaches 99.99%.
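The order of magnitude of that reduction ratio can be reproduced from the permutation-vs-combination counts; the per-pile state count k = 50 below is an illustrative stand-in, not the paper's exact figure:

```python
from math import comb

def reduction_ratio(k, n):
    """Fraction of joint states removed by aggregation: k^N permutations
    collapse to C(k + N - 1, N) multisets (k states per pile, N piles)."""
    before = k ** n
    after = comb(k + n - 1, n)
    return 1 - after / before

# With the per-pile state count fixed, the ratio approaches 1 as N grows,
# consistent with the ~99.99% reduction reported in Table 4.
r = reduction_ratio(50, 7)
```

For k = 50 and N = 7 the ratio already exceeds 99.96%, so the 99.99% figure in Table 4 is plausible for the problem sizes tested.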
The charging cost and computation time are shown in Table 5. As the number of charging piles increases, the computation time of SABDP is significantly lower than that of DP, and SABDP extends the problem scale that can be solved. Moreover, the optimal charging costs of DP and SABDP are exactly the same, which proves that SABDP obtains optimal results while reducing the computational complexity.

Comparison of charging policies
To show clearly that the SABDP method obtains the optimal charging policy with the minimum charging cost, we compare the charging policy and charging cost of SABDP with those of two experience-based charging policies and two ADP methods. For conciseness, the methods are defined as follows. PCP: Plug-and-charge policy, which charges an EV immediately once it plugs in.
IRP: Minimum immediate-reward policy, which chooses the action that minimizes the one-step reward.
MC: Monte Carlo method, a simulation-based method that approximates the charging cost by sampling a large number of deterministic scenarios.
DQN: Deep Q-learning network, a deep reinforcement learning method that obtains the charging cost and the charging strategy by estimating the value function of the states. Table 6 shows the optimal expected cost under three charging policies with wind power Pw = 12 kW. S/P denotes the comparison of SABDP with PCP, and S/I the comparison of SABDP with IRP. The results show that the cost of SABDP is reduced by 65.51% and 5.92% on average compared with PCP and IRP, respectively. PCP always charges at the maximum charging demand without considering the current wind power and electricity price; thus it may buy electricity at a high price when the wind power is small and abandon wind power when the charging demand is far less than the wind power. IRP guarantees the optimal immediate reward but cannot ensure the global optimal value, because it does not consider the effect of future wind power and electricity prices. Table 7 shows the charging cost and computation time of SABDP, MC, and DQN. Because DQN is difficult to converge as the state and action spaces grow, we only take the scenario of five charging piles over 24 time periods with wind power Pw = 8 kW as an example. The results show that the charging cost of SABDP is lower than that of MC and higher than that of DQN. MC has a large error relative to SABDP because it samples only part of the scenarios to estimate the charging cost. The charging cost estimated by DQN, however, is unreasonable: it appears better than that of SABDP because of overestimation. Therefore, SABDP achieves a more accurate charging cost than the other two methods in an acceptable time. Figure 4 shows the charging processes of the three policies under the same initial state and wind power output curve. PCP incurs the largest charging cost among the three policies because it always charges the EVs at the maximum charging demand, no matter how much wind power is available or what the electricity price is.
For example, from decision stage 1 to 6, PCP buys 92 kWh of electricity to meet the maximum charging demand and abandons 36 kWh of electricity generated from the wind power, whereas the charging policy of SABDP buys only 60 kWh of electricity and abandons 0 kWh of wind power to meet the same charging demand. IRP obtains near-optimal charging decisions that are closest to those of SABDP, but its charging cost is still higher, mainly because IRP does not consider the future wind power and electricity prices. For example, IRP and SABDP satisfy the same charging demand from decision stage 1 to 3. At stage 1, the wind power is 36 kWh, which is enough to charge three EVs but not four. IRP chooses to charge three EVs to avoid buying extra electricity and delays some of the charging demand to the next stage; it then has to buy 12 kWh of electricity at stage 2 to satisfy both the original and the delayed charging demand because of insufficient wind power. SABDP, by contrast, does not delay the charging demand and buys only 8 kWh to satisfy the same demand. SABDP makes charging decisions by considering the wind power and the electricity price comprehensively. If the electricity price is unchanged over a period, the charging decision is mainly influenced by the future wind power; the price influences the decision especially around price changes. If the electricity price is low, the policy buys electricity to satisfy as much of the charging demand as possible rather than delaying part of it to the next stage, even if the wind power is insufficient. For example, at stage 21, SABDP charges the EVs at the maximum charging demand while IRP charges at the minimum, although the wind power cannot cover the maximum demand. The reason is that the electricity price rises at stage 22, and delaying the charging demand to stage 22 would incur a high cost for buying extra electricity.

FIGURE 4 The action of three policies in a scenario

FIGURE 5 The action of three methods in a scenario
In Figure 5, we compare the charging actions of the different methods under the same scenario to show the optimality of the action selection. The results show that the charging policy of MC is almost the same as that of SABDP under the same scenario, but its precision depends on the number of samples. The charging policy obtained from DQN is not sensitive to the wind power: the charging demand it chooses may be far larger than the wind power, for example at stages 1 and 2, so this policy buys more electricity than SABDP. SABDP, however, chooses the optimal charging decision considering both the wind power and the electricity price, achieving a lower charging cost and higher wind power utilization.

CONCLUSION
In this paper, we consider a joint charging scheduling problem in a parking lot with wind power. The charging problem is modelled as an MDP to describe the uncertainties of the wind power and the charging events. We then propose a state-aggregation method to reduce the number of states, and a novel aggregation-based method to obtain the global optimal value more efficiently. In addition, several experiments verify that our method obtains the optimal charging cost with less computation time than other methods, and that the SABDP method can be extended to charging problems of different scales. In future research, we will pay more attention to large-scale and multi-scale charging scheduling problems.

ACKNOWLEDGMENT
This work was supported in part by National Key R&D Program of China (2016YFB0901900), and National Natural Science Foundation of China (61903293).

A. Indices
t  Time step index
i  Index of charging piles
m  Index of state aggregations

B. Parameters
𝒮_t  State space at stage t
𝒜_t  Action space at stage t
P_t  Transition probability matrix
𝒮_c  State space of a charging pile
P_c  Transition probability of a charging pile
R  One-step reward function
N  Number of charging piles
v_t  Wind speed at stage t
v_rated / v_in / v_out  Rated / cut-in / cut-out wind speed