The distributed economic dispatch of smart grid based on deep reinforcement learning

To address the inefficiency, inflexibility and insecurity of traditional centralized algorithms in dispatch optimization, and in view of the growing application of artificial intelligence technology to smart grids, a novel distributed solution is proposed that combines deep reinforcement learning with consensus theory to optimize economic dispatch. Firstly, the optimal commitment sequence of massive units is obtained by constructing a deep reinforcement learning model. Secondly, the optimal unit output and efficient economic dispatch are obtained by combining an improved consensus algorithm with the Adam algorithm. Finally, simulation results on the IEEE-14 and IEEE-162 node systems demonstrate the effectiveness of the proposed solution for smart grids with complex network structures: it can handle massive data processing, and it reduces the dependence on an exact objective function when dealing with extremely complex load distribution scenarios and distributed generation.


INTRODUCTION
With the development of large-scale renewable energy, a smart grid with a high density of intermittent energy sources needs sufficient controllable resources to guarantee the efficiency, security and reliability of power system operation. Compared with traditional controllable generators, flexible loads may play an important role in maintaining system balance. Therefore, how to exploit the connections among the parts of the "source-network-load-storage" chain and effectively manage diversified, scattered demand responses for global optimization [1,2] has attracted great attention in science and technology. Early solutions to the economic dispatch (ED) problem usually adopt centralized algorithms: all distributed generation (DG) units and flexible loads send their parameters to a central controller for centralized computing, processing and unified release. However, such a centralized algorithm is not suitable for the massive amounts of complex data produced by large-scale intermittent renewable energy integration, the growing number of customers and the complex user types [3], nor can it meet the calculation requirements for DG access with plug-and-play features. Therefore, for distributed computing of ED [4][5][6][7][8][9], a distributed algorithm based on the consensus principle has been applied in many multi-agent systems [4]. It allows local agents to share information iteratively through two-way communication links to find globally optimal decisions. A consensus is reached when all agents agree on the value of the information [5]. For computing massive amounts of complex data, a distributed control strategy is proposed in [6] to avoid single points of failure and reduce computing costs. This strategy reduces the core role of the traditional central controller in ED and considers the scalability of the calculation and communication load of each DG.
In addition, to improve the convergence speed and global optimality of consensus-based algorithms, a primal-dual principle is proposed in [5] to decompose the main problem into several sub-problems, which are then solved in parallel to improve the convergence speed. In [7], subgradient steps are combined with the conventional consensus principle to update local optimization variables cyclically and gradually reach the global optimum.
The distributed algorithm based on the consensus principle has many advantages in solving the ED problem, but it may rely on the predicted generation and load conditions [10]. In practice, the actual unit commitment is uncertain and non-static over a long time scale, so considering real-time power allocation alone is of little practical significance for ED [11]. In addition, in some cases it is difficult to represent the model with accurate mathematical formulas; for example, the cost function parameters of wind turbines are very hard to estimate because they are strongly affected by changing environmental factors [12]. Therefore, designing an appropriate ED solution that not only removes the difficulty of establishing a mathematical model but also makes optimal decisions based on historical operating data [13] is of great significance in theory and practice.
However, when the data to be processed in a large-scale smart grid is too large, the "continuous loop" calculation mode cannot meet the system's requirements for the algorithm, especially when the environmental parameters change during the calculation. In addition, when the objective function model is unknown or its order is too high, the above methods cannot be used. Deep reinforcement learning (DRL), which stores all previously calculated data as experience and can learn from, memorize and reuse these data, can effectively improve computing efficiency. In the management of distributed energy resources [14], DRL is used to process high-dimensional data in smart grids, calculate energy optimization strategies online, and improve the efficiency of electricity use through real-time feedback and control. In [15], a quadratic function is used in DRL to approximate the value function of each agent, achieving optimal cooperative control and solving the optimal consensus problem of multi-agent systems whose input/output information can be measured. However, DRL struggles with the "plug and play" [16] characteristics of distributed energy resources, with continuous variables, and with other issues.
Inspired by the above research developments, a distributed ED solution is proposed that combines deep reinforcement learning with consensus theory to study the ED of smart grids. The major contributions of this paper include:
1. The distributed method is used for source-load-storage economic power allocation in smart grids and realizes the coordination of the selected consensus variables among the agents. The iterative calculation is completed in each local agent device, without a control centre for centralized calculation. Different from [3], the distributed method proposed in this paper is simpler and more reliable, and is suitable for large-scale smart grids. Compared with [4][5][6], owing to the "adjustment item" in this method, various constraints are satisfied automatically during the calculation and the "plug and play" requirements of DG can be met, which eliminates the inherent shortcomings of DRL [14,15].
2. In response to the problems of the consensus algorithm, an improved algorithm is proposed by using the Adam algorithm [17]. Different from [17], the theory of Adam and the node agglomeration method are applied to the conventional consensus algorithm, strengthening the relationship between computing nodes and realizing a distributed form of the Adam algorithm. The proposed algorithm improves stability and computational efficiency. Compared with [8,9], it has better convergence efficiency and faster calculation speed.
3. DRL is combined with the consensus principle, resulting in a solution to both the discrete problem of unit commitment and the continuous problem of power allocation. Compared with [12,13], the proposed algorithm greatly reduces the computational load of DRL while ensuring the optimality and stability of the calculation results. In addition, the objective function of unit commitment is digitized, so that it no longer depends on a specific mathematical formula [13].
This paper is organized as follows. Models of ED are presented in Section 2. The ED solution theory is provided in Section 3. The algorithm convergence analysis is presented in Section 4. The simulation results are presented in Section 5. Finally, Section 6 contains the acknowledgments and Section 7 concludes this paper.

Power distribution
The goal of load distribution is to find the optimal dispatch strategy: the total load is distributed among the N′ adjustable units in operation so that the total cost, that is, the sum of the costs C_i(P_i) over these units, is minimized (Equation (1)). Here C_i and P_i denote the cost function and the power of adjustable unit i, respectively; P_i is the generated power for a generator and the dissipated power for a flexible load; N′ is the number of adjustable units in operation.
The cost function of adjustable unit i is given by Equation (2), where a_i, b_i and c_i are its cost coefficients. Load distribution is restricted by both the power balance constraint and the capacity constraint of each adjustable unit.
The power balance constraint is given by Equation (3), where P_loss represents the transmission loss and D is the total rigid load that cannot be adjusted. Line transmission power constraints have a certain effect on the convergence process and optimization results of the algorithm. Generally, transmission loss accounts for about 3-7% of the total load [18]. The capacity constraint of an adjustable unit, P_i^min ≤ P_i ≤ P_i^max, is given by Equation (4), where P_i^min and P_i^max represent the minimum and maximum output power of the adjustable unit i, respectively. Other constraints are given in [18].
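As a concrete reading of the load-distribution model, the following minimal sketch evaluates the total cost and checks the two constraints for a candidate dispatch. It assumes the common quadratic cost form C_i(P_i) = a_i P_i² + b_i P_i + c_i and treats all units as generators; the function names and the numerical values are illustrative and are not taken from the paper.

```python
def unit_cost(a, b, c, p):
    """Cost of one adjustable unit (assumed quadratic form a*p^2 + b*p + c)."""
    return a * p * p + b * p + c


def total_cost(coeffs, powers):
    """Sum of the unit costs over the adjustable units in operation."""
    return sum(unit_cost(a, b, c, p) for (a, b, c), p in zip(coeffs, powers))


def is_feasible(powers, limits, rigid_load, loss=0.0, tol=1e-6):
    """Check power balance (supply = rigid load + loss) and capacity limits."""
    balanced = abs(sum(powers) - (rigid_load + loss)) <= tol
    within = all(lo <= p <= hi for p, (lo, hi) in zip(powers, limits))
    return balanced and within


# Hypothetical two-unit example; the 4 MW loss is 4% of the 100 MW rigid load.
coeffs = [(0.05, 2.0, 10.0), (0.04, 3.0, 8.0)]
limits = [(10.0, 80.0), (10.0, 90.0)]
powers = [60.0, 44.0]
cost = total_cost(coeffs, powers)
feasible = is_feasible(powers, limits, rigid_load=100.0, loss=4.0)
```

The loss is passed in as a fixed number here; in the model it depends on the line flows, which this sketch does not represent.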

Unit commitment
When a smart grid with a large number of generating units supplies the total load D, the major goal of unit commitment is to find a dispatch strategy that minimizes the total cost over a period of time T (Equation (5)), where N is the total number of generating units, I_{i,t} indicates whether power generation unit i is in operation during period t, C_{i,SD} represents the possible shutdown cost and C_{i,SU} the start-up cost [18]. In addition to meeting the requirements of the load distribution in each time period, all power generation units i = 1, 2, …, N are subject to the minimum continuous up/down time constraints [19] and the generation ramp constraints [19], where R_{i,U} and R_{i,D} are the ramp-up and ramp-down limits, respectively; T_{i,U} and T_{i,D} are the minimum continuous uptime and downtime of power generation unit i, respectively; X_{i,ON}(t) is the time for which unit i has been continuously committed at t, and X_{i,OFF}(t) is the time for which unit i has been continuously stopped at t. Other detailed constraints are given in [18,19].
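The minimum up/down time and ramp constraints can be checked on a candidate schedule as sketched below. The sketch assumes the usual reading of these constraints: an interior ON run must last at least T_{i,U} periods, an interior OFF run at least T_{i,D} periods, and consecutive outputs may rise by at most R_{i,U} and fall by at most R_{i,D}. The function names and data are hypothetical, and runs that touch the start or end of the horizon are not checked.

```python
def min_up_down_ok(schedule, t_up, t_down):
    """Check minimum up/down times on a 0/1 commitment schedule of one unit."""
    runs = []  # consecutive runs as [state, length]
    for state in schedule:
        if runs and runs[-1][0] == state:
            runs[-1][1] += 1
        else:
            runs.append([state, 1])
    # Only runs strictly inside the horizon have a known full length.
    for state, length in runs[1:-1]:
        required = t_up if state == 1 else t_down
        if length < required:
            return False
    return True


def ramp_ok(power, r_up, r_down):
    """Check that each step changes the output by at most r_up / r_down."""
    return all(-r_down <= nxt - cur <= r_up for cur, nxt in zip(power, power[1:]))


ok_schedule = min_up_down_ok([1, 1, 1, 0, 0, 1, 1], t_up=2, t_down=2)
bad_schedule = min_up_down_ok([1, 0, 1, 1, 0, 0, 1], t_up=2, t_down=2)
```

In the second schedule the unit is switched off for a single period, which violates a minimum downtime of two periods.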

Solution strategy
Power distribution is a continuous problem, while unit commitment is a discrete one. The traditional strategy converts the continuous and discrete objects into the same type in order to solve them, which can hardly achieve the optimal ED of smart grids and sacrifices the efficiency and accuracy of the calculation. This paper proposes an optimal strategy consisting of a series of algorithms to cope with the two types of problems. First, an improved DRL algorithm is adopted to solve the discrete unit commitment problem: the optimal unit commitment sequence and the approximate unit power output are obtained by fixing some necessary variables. The improved distributed consensus algorithm is then used to obtain the detailed unit power output based on these results.

DISTRIBUTED DRL-BASED SOLUTION
To solve the above model, this section proposes a combined ED solution of DRL and the improved distributed algorithm based on the consensus principle.

DRL algorithm
It can be seen from Equation (3) that the state-action value function can be formulated in a recursive form, which can be used to calculate the optimal solution. In this paper, s_{i,t} and a_{i,t} are defined as the current start-stop state and the current start-stop action of each unit i, respectively, and Q_i(s_{i,t}, a_{i,t}) as the state-action value function. When the amount of data is large and the function Q is difficult to formulate, it can be solved by the deep Q network (DQN) algorithm.

Updated state-action value functions
Each time (s_{i,t}, a_{i,t}) is visited, the state-action value function Q(S_{I,t}, A_{I,t}) is updated according to Equation (8) [20],
where η represents the learning rate when agent i in the state space S_{I,t} takes an action from the action space A_{I,t} at the t-th time, rew_{I,t} denotes the reward and punishment feedback generated by the decision at the current moment, and γ represents the discount coefficient. The state space S_{I,t} and the action space A_{I,t} are defined by Equations (9) and (10).
In Equation (8), the first term represents the processing of historical information, and the second term introduces the new information brought by each visit.
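The two-term structure described for Equation (8) matches the standard tabular Q-learning update, which can be sketched as follows. This is an illustration under assumptions, not the paper's exact formula: the symbol gamma for the discount coefficient, the dictionary-based table, the default values and the state and action labels are all hypothetical, and the minimum over next actions is used because the dispatch strategy is described as minimizing the target value.

```python
def q_update(q, state, action, reward, next_state, next_actions,
             eta=0.1, gamma=0.9):
    """One tabular update: (1 - eta) * old estimate + eta * new sample."""
    # Best (here: smallest) value reachable from the next state.
    best_next = min(q.get((next_state, a), 0.0) for a in next_actions)
    old = q.get((state, action), 0.0)
    # First term keeps historical information, second term adds the new sample.
    q[(state, action)] = (1.0 - eta) * old + eta * (reward + gamma * best_next)
    return q[(state, action)]


q_table = {}
first = q_update(q_table, "off", "start", 5.0, "on", ["keep", "stop"])
q_table[("on", "keep")] = 2.0
q_table[("on", "stop")] = 1.0
second = q_update(q_table, "off", "start", 5.0, "on", ["keep", "stop"])
```

With a reward that is to be maximized instead of a cost that is to be minimized, the `min` would be replaced by `max`.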

Exploration and exploitation
The balance between exploration and exploitation is achieved by selecting actions according to the ε-greedy algorithm [21]. Specifically, with probability 1−ε the action with the highest value under the selected dispatch strategy π(s_{i,t}) is chosen, and with probability ε the start-stop action a_{i,t} of unit i is selected at random.

Dispatch strategy adjustment
The dispatch strategy π(s_{i,t}) is improved according to the current state-action value function (Equation (11)). If there are multiple A_{i,t+1} that minimize the target value, one of them is randomly selected.
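Action selection and strategy adjustment as described above can be sketched in a few lines. The sketch assumes a dictionary-based Q table and treats the best action as the one with the smallest Q value, following the minimization wording of the strategy adjustment; the function names and the example table are hypothetical.

```python
import random


def greedy_actions(q, state, actions):
    """All actions that attain the smallest Q value in this state."""
    values = {a: q.get((state, a), 0.0) for a in actions}
    best = min(values.values())
    return [a for a in actions if values[a] == best]


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit.

    Ties among equally good actions are broken at random.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)
    return rng.choice(greedy_actions(q, state, actions))


q_table = {("s", "start"): 1.0, ("s", "stop"): 0.5, ("s", "keep"): 0.5}
actions = ["start", "stop", "keep"]
choice = epsilon_greedy(q_table, "s", actions, epsilon=0.1)
```

Setting `epsilon=0` gives the purely greedy strategy, and `epsilon=1` gives purely random start-stop actions.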

3.1.4 Learning process of DQN model
When the objective function model in Equation (1) is unknown or non-convex, or the state space S_{I,t} is so complex that it cannot be represented by detailed mathematical formulas, the function Q cannot be obtained in the form of a state transition function or a table. In this case, Q needs to be represented by a function approximation method. The DQN model is a suitable algorithm for approximating the tabular function Q by using a convolutional neural network (CNN) [22]. Take the IEEE-14 node system with 4 thermal power units as an example; please refer to [23] for detailed data. During the learning process, let p^load_t be the current interaction power with the main grid and p^load_{t+1} the interaction power with the main grid at the next moment. Let p^load_t, p^load_{t+1}, S_{I,t}, S_{I,t+1}, A_{I,t} and A_{I,t+1} be the inputs to the CNN model. The input layer has size 18 × 230, representing the input data over 230 h. The first and second convolutional layers have 36 kernels of size 2 × 1 and 72 kernels of size 2 × 2, respectively. The tanh function is used as the activation function. Max-pooling is used in the pooling layer to transform the data into a one-dimensional vector of length 352, which is input to the fully connected layer. The fully connected layer includes 2 hidden layers with 60 and 30 neurons, respectively. The output layer has 1 neuron, which gives the value Q obtained from the current start-stop status of the units. The ReLU function is used as the activation function.
The CNN in the DQN algorithm is used to approximate the function Q. In DQN [24], the update formula of the function is given in Equation (12), where ξ is the learning rate of DQN and ω_t is the weight parameter for fitting Q in DQN. According to the constraints in Equations (6) and (7) of the model in Equation (5), the reward and punishment feedback function rew_t is defined in Equations (13)-(17), where r_1 is the reward item of the ramp constraint, r_2 and r_3 are the reward items of the minimum continuous up-time and down-time constraints, r_4 is the reward item of the power balance offset, ζ is the deviation reward coefficient, and Δp is the deviation limit value.
When training the DQN, the mean square error is used as the error function of the network. An optimal strategy is obtained by computing the gradient of the error function with respect to ω_t, and stochastic gradient descent is used to update the parameters to obtain the optimal value Q.
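For orientation, the training step described here has the following generic form in the DQN literature. It is given as a hedged sketch rather than as the paper's Equation (12) or its error function: the discount symbol γ, the target y_t and the use of a minimum over next actions (following the minimization convention of the dispatch strategy) are assumptions, and the target is treated as a constant when the gradient is taken.

```latex
\begin{aligned}
y_t &= rew_t + \gamma \min_{A} Q\big(S_{I,t+1}, A;\, \omega_t\big), \\
L(\omega_t) &= \big( y_t - Q(S_{I,t}, A_{I,t};\, \omega_t) \big)^2, \\
\omega_{t+1} &= \omega_t - \xi \, \nabla_{\omega_t} L(\omega_t).
\end{aligned}
```

In practice the squared error is averaged over a mini-batch of stored samples, which gives the mean square error mentioned in the text.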

Fully distributed implementation
As can be seen from the consensus algorithm [25], the state transition matrix W is determined by the communication network topology and is usually replaced by a Laplace matrix. Taking the IEEE-14 node system as an example, the 14 × 14 adjacency matrix D of the 14 controllable units is established based on graph theory. If unit i and unit j are directly adjacent, the element d_ij of the matrix D is set to 1, otherwise to 0; the element d_ii is set to 0. However, using the same adjacency matrix elements for all adjacent nodes limits the convergence speed of the consensus variable. To solve this problem, the idea of different weights is integrated into the adjacency matrix, since the final convergence accuracy is not affected by changes of the elements of the Laplace matrix as long as the convergence coefficient remains unchanged [25]. Therefore, the concept of node agglomeration is introduced in this paper to improve the state transition matrix W and the convergence speed.
Specifically, the node agglomeration method [26] takes the two nodes whose degree of connection is to be judged, agglomerates them into a core node, and takes the degree of each node connected to the core node into account when calculating importance. The importance α of a node and the element w′_ij of the improved state transition matrix W′ are defined accordingly, where n is the total number of nodes; V is the set of all nodes in the network; x_ij and y_ij represent the degree and the number of edges of the core node after node i and node j are agglomerated, respectively; and l_ij is an element of the Laplace matrix.
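The construction of the 0/1 adjacency matrix and the corresponding Laplace matrix can be sketched as follows. The four-node ring used here is a hypothetical stand-in for the IEEE-14 topology, and the function names are illustrative.

```python
def adjacency_matrix(n, edges):
    """Symmetric 0/1 adjacency matrix with a zero diagonal."""
    d = [[0] * n for _ in range(n)]
    for i, j in edges:
        d[i][j] = 1
        d[j][i] = 1
    return d


def laplacian_matrix(d):
    """Graph Laplacian: node degree on the diagonal, -1 for each neighbour."""
    n = len(d)
    return [[sum(d[i]) if i == j else -d[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical ring of four units: 0-1, 1-2, 2-3, 3-0.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
D = adjacency_matrix(4, edges)
L = laplacian_matrix(D)
```

Every row of the Laplacian sums to zero, which is the property that lets consensus iterations built on it preserve the average of the exchanged variable.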

Update of power adjustment items
In general, the equal incremental cost criterion introduced in [8] is a classic solution to the problem of power system dispatch optimization. When the incremental cost μ is selected as the consensus variable of the first-order consensus algorithm and constitutes the consensus item, it is known from [8] that during the iteration μ gradually approaches a specific value that is not necessarily μ*. That is, the consensus term alone cannot satisfy the various constraints and cannot solve the model correctly. Therefore, an adjustment item needs to be added as feedback to correct the result and drive it towards μ*. The consensus variable is modified and updated as in Equations (22) and (23) [27], where μ_i is the incremental cost of the controllable unit i, usually defined as the derivative of the cost function in Equation (2) with respect to power; φ_i is the adjustment term; δ is the adjustment coefficient, a small positive number; v_ij is an element of the transpose of the matrix W′; and X_i is the set of indices of the neighbouring agents of agent i. The network used in this paper is a standard IEEE network model whose topology is strongly connected [19], as required by the distributed algorithm. Thus, the proposed model cannot cause congestion in the grid.
In the iterative calculation, the power adjustment item determines the convergence direction of the consensus variable through Equation (23), so that the power decisions continuously approach the optimal solution that satisfies the various constraints in Equation (3). The verification is as follows. For simplicity, Equation (23) is written in vector/matrix form as Equation (24).
Since V is a non-negative column stochastic matrix, that is, the elements in each column are non-negative and sum to 1, we have 1ᵀV = 1ᵀ. Equation (26) then follows from Equation (25), and reducing Equation (26) shows that the total Σ[P_i(k) + φ_i(k)] is preserved over the iterations. Therefore, setting the initial values such that Σ[P_i(0) + φ_i(0)] = 0 makes φ_i act as negative feedback and converge towards 0 during the calculation. When all φ_i converge to 0, the active power shortage of the system is 0 and the constraint in Equation (3) is satisfied. At the same time, the consensus variable μ_i in Equation (22) converges to the value μ_i*.
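The interplay of the consensus term and the adjustment term can be illustrated with the classic consensus iteration for economic dispatch, written here with the symbols μ, φ and δ of this section. This is a minimal sketch under assumptions and not the paper's Equations (22) and (23): the cost is taken as quadratic, capacity limits are omitted, all three units are generators, the weight matrix and all numbers are hypothetical, and the rigid demand is kept explicit so that the initial values satisfy Σ[P_i(0) + φ_i(0)] = demand rather than 0.

```python
def consensus_dispatch(a, b, demand, p0, w, delta=0.005, iters=3000):
    """Consensus iteration with an adjustment (mismatch) term.

    a, b   : quadratic and linear cost coefficients (assumed cost a*p^2 + b*p + c)
    demand : total rigid load to be covered
    p0     : initial power outputs
    w      : row-stochastic weight matrix of the communication graph
    """
    n = len(a)
    p = list(p0)
    mu = [2 * a[i] * p[i] + b[i] for i in range(n)]  # incremental cost dC_i/dP_i
    phi = [demand / n - p[i] for i in range(n)]      # sum(p) + sum(phi) = demand
    for _ in range(iters):
        # Consensus term plus the adjustment term scaled by delta.
        mu_new = [sum(w[i][j] * mu[j] for j in range(n)) + delta * phi[i]
                  for i in range(n)]
        # Power that matches the new incremental cost (capacity limits omitted).
        p_new = [(mu_new[i] - b[i]) / (2 * a[i]) for i in range(n)]
        # Mix the mismatch with transposed weights, subtract the local power change.
        phi = [sum(w[j][i] * phi[j] for j in range(n)) - (p_new[i] - p[i])
               for i in range(n)]
        mu, p = mu_new, p_new
    return mu, p, phi


a = [0.05, 0.04, 0.0625]
b = [2.0, 3.0, 2.5]
w = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
mu, p, phi = consensus_dispatch(a, b, demand=150.0, p0=[40.0, 40.0, 40.0], w=w)
```

Because the transposed weights are column stochastic, the sum of `p` and `phi` stays equal to the demand in every iteration, so driving `phi` to zero enforces the power balance while the consensus term equalizes the incremental costs.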

Improvement based on Adam algorithm
Gradient descent can be used to optimize the objective function in Equation (1), which is continuously differentiable. This paper uses the Adam algorithm, an effective stochastic gradient descent optimization method, to optimize the results calculated by the consensus algorithm in order to accelerate the overall convergence. Because the proposed algorithm is completely distributed, the Adam algorithm is modified into a distributed form in this part. With the objective function C(P), the exponential moving averages of the gradient, m_k, and of the squared gradient, v_k, are updated in the calculation, and the hyper-parameters β_1, β_2 ∈ [0, 1) control the exponential decay rates of these moving averages [17].
The update formulas for the biased estimates m(k) and v(k) follow [17], where β_1^k and β_2^k are β_1 and β_2 to the power k, respectively. The bias-corrected estimates also follow [17], and the centralized update formula of P [17] is given in Equation (30), where θ is the step size, generally set to 0.001, and ε is the coefficient for fine-tuning the step size. The distributed form of Equation (30) proposed in this paper is given in Equation (31), where w″ is an element of the transform matrix W″, which satisfies 1ᵀW″ = 0ᵀ and W″1 = 0, with 1 and 0 denoting vectors. The element w″ is given in Equation (32), where x_i is the degree of agent i. With Equations (31) and (32), the distributed form of the Adam algorithm is implemented and is better integrated into the distributed consensus algorithm. At the same time, the convergence speed is further improved because the gradient descent principle of the Adam algorithm is integrated into the consensus algorithm. Consequently, the calculation efficiency of the entire proposed algorithm is improved and the total calculation time is reduced.
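The centralized Adam update from [17] that this part builds on can be sketched as follows. Only the centralized rule is shown; the distributed form with the matrix W″ is not reproduced. The test objective (a separable quadratic with hypothetical coefficients and targets) and the step size of 0.05, chosen larger than the usual 0.001 so that the small example converges quickly, are assumptions made for the illustration.

```python
import math


def adam_minimize(grad, x0, theta=0.05, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=3000):
    """Plain (centralized) Adam iteration on a list of variables."""
    x = list(x0)
    m = [0.0] * len(x)  # moving average of the gradient
    v = [0.0] * len(x)  # moving average of the squared gradient
    for k in range(1, steps + 1):
        g = grad(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** k)  # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** k)  # bias-corrected second moment
            x[i] -= theta * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Hypothetical objective: sum_i a_i * (x_i - target_i)^2.
a = [0.05, 0.04]
target = [60.0, 44.0]


def grad(x):
    return [2 * a[i] * (x[i] - target[i]) for i in range(len(x))]


x_opt = adam_minimize(grad, [50.0, 50.0])
```

Because each step is normalized by the running gradient magnitude, both coordinates move at a similar pace even though their curvatures differ, which is the property that motivates combining Adam with the consensus iteration.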

ALGORITHM CONVERGENCE ANALYSIS
In this section, the convergence of the fully distributed optimization algorithm based on consensus theory and the Adam algorithm is analysed to establish its effectiveness theoretically.
To find the optimal value of the independent variable P of the function C(P), the test function E(K) is defined to represent the degree of convergence of the algorithm: it is the sum, over all previous steps up to K, of the differences between the online prediction C(P(k)) and the value C(P*) at the best fixed point parameter from a feasible set X. Specifically, the test function is defined in Equation (33), where P* = arg min_{P∈X} C(P) and C(P) represents Equation (2).
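Written out from the verbal definition above, the test function takes the following form. This is a reconstruction from the surrounding text, in the style of the regret used in the convergence analysis of [17], and its layout may differ from the original Equation (33).

```latex
E(K) = \sum_{k=1}^{K} \big[\, C(P(k)) - C(P^{*}) \,\big],
\qquad
P^{*} = \arg\min_{P \in X} C(P).
```

If E(K) grows more slowly than K, the average gap E(K)/K tends to zero, which is the sense in which the algorithm is said to converge.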

Lemma 1. If a function f: R^d → R is convex, then for all x, y ∈ R^d, f(y) ≥ f(x) + ∇f(x)ᵀ(y − x).
The above lemma can be used to verify the existence of an upper bound on the test function. The proof of the main theorem is built on the update rules proposed in this paper.
Theorem 1. Assume that the function C(P) has bounded gradients, ‖∇_P C‖_2 ≤ G and ‖∇_P C‖_∞ ≤ G_∞ for all P ∈ R^d, and that the distance between any P(k) generated by the algorithm is bounded, ‖P(m) − P(n)‖_2 ≤ D and ‖P(m) − P(n)‖_∞ ≤ D_∞ for any m, n ∈ {1, 2, …, K}. The algorithm proposed in this paper then guarantees an upper bound on the test function E(K) for all K ≥ 1.
Proof. Let ∇C(P(k)) = g_k, P(k) = P_k, m(k) = m_k, v(k) = v_k, m′(k) = m′_k and v′(k) = v′_k. Lemma 1 gives a first inequality. A second relation follows from the update rule in Equation (31) of the fully distributed optimization algorithm. Squaring both sides of the update rule in Equation (28) gives a third relation, and Equations (36) and (38) can then be combined into Equation (39).
Therefore, Equation (42) follows from Equation (41). Finally, it shows that the test function converges.
[Figure 1: Topology of IEEE-14 node system]

SIMULATION ANALYSIS
In this section, an IEEE-14 node system and an IEEE-162 node system are built to evaluate the performance of the proposed solution. The overall calculation process of the proposed algorithm is presented. The effectiveness of the algorithm under different working conditions is verified by simulation examples on the IEEE-14 node system, and the calculation speed of the algorithm in large-scale smart grid systems is further verified by simulation examples on the IEEE-162 node system.

IEEE-14 node system
In this part, the IEEE-14 node system is constructed to simulate the intelligent distribution network and test the performance of the proposed algorithm. In the system, nodes 1, 2, 3 and 6 are thermal power generators, node 7 is an energy storage device, and the remaining nodes have adjustable flexible loads. In addition, node 1 is also connected to the main grid for power complementation. The simulated grid is a smart main grid. The system topology [8] is illustrated in Figure 1, and the basic parameters of the IEEE-14 node system are listed in Table 1.
2: Collect the initial start-stop status and initial power output of each unit, and build the initial Q function value table in the form of Equations (9) and (10).
3: repeat
5: Use the ε-greedy algorithm to select A_{I,t}.
6: Seek value scores as current reward feedback rew_t.
7: Calculate the current function value Q via Equation (8).
9: until meet constraints and optimal solution requirements.
10: Build a CNN network model and train historical data. Achieve offline learning and online calculation.
11: Get and analyse new network topology data.
12: Build the matrix W′ and W″ via Equations (14), (15)

The interactive power data of the simulated intelligent main power grid and power distribution network are collected as a feature. The optimal start-stop sequence of the unit is calculated based on the changing trend of the feature, the current environment variables and the state variables. A total of 280 interactive power data points are collected, as shown in Figure 2.
The unit start-stop sequence and the power output sequence corresponding to each data point are calculated by the deep reinforcement learning algorithm without a specific objective function for unit commitment. The first 230 data points are used as training samples and the last 50 as testing samples. The 18 × 230 training set and the 1 × 230 Q-value labels are fed to the input and output layers of the CNN, respectively, for training. The simulation results after training are illustrated in Figure 3, which reflects the quality of the CNN model. The MSE value is obtained by comparing the output of the CNN model with the test data; it reflects whether the CNN model fits the objective function well, so that the optimal unit commitment can be completed quickly even when the objective function is fuzzy and non-convex or there are too many data. Figure 3 shows that the fit on the first 230 data points is better, with a lower MSE value, while the testing result on the last 50 data points is basically within the qualified range. The higher test error arises mainly because the first 230 training data points and the last 50 testing data points differ somewhat, which lowers the DQN effectiveness on the test data, but the overall effectiveness remains good. The digitization of the objective function of unit commitment is basically realized, and it no longer depends on a specific mathematical formula.
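The 230/50 split and the MSE comparison described above amount to the following bookkeeping. The sketch uses placeholder numbers in place of the interactive power data, and the helper names are hypothetical.

```python
def split_samples(samples, n_train=230):
    """First n_train samples for training, the remainder for testing."""
    return samples[:n_train], samples[n_train:]


def mse(predicted, target):
    """Mean square error between model outputs and reference values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Placeholder series standing in for the 280 collected data points.
samples = [float(k) for k in range(280)]
train, test = split_samples(samples)
example_error = mse([1.0, 2.0], [1.0, 4.0])
```

A noticeably larger MSE on the 50 test points than on the 230 training points is the pattern the text attributes to the difference between the two data ranges.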

Single dispatch instruction
It is assumed that the interactive power from the simulated intelligent main power grid makes it necessary to start all four generators, and that the total power imbalance at that moment is 0.86 MW. That is, the simulated intelligent distribution network receives a single dispatch instruction (SDI) and performs a single action. The simulation results of the proposed algorithm are shown in Figures 4-6: the output power of the 14 adjustable units becomes stable, the active power of the system finally reaches balance, and the consensus variable converges to a common value, which shows that the proposed algorithm converges well.
It is important to note that the proposed algorithm converges to the near-optimum much faster and more smoothly than the other distributed methods in [6], [8] and [9]. For example, compared with the experiments in [8], the proposed algorithm has higher convergence efficiency under the same experimental conditions, and the optimization results are the same, which supports the superiority and correctness of the algorithm. One of the published experiments in [9] solved a similar ED problem after 80 iterations, while the proposed algorithm solves it by the 52nd iteration.
Although the generation supply and load demand change significantly during the iterations, Figure 5 illustrates that the maximum power deviation of the entire simulated intelligent distribution network is about 5.61 MW, which is a very small fraction (5.61 / 210 ≈ 0.027) of the total capacity and cannot cause a large frequency deviation. This indicates that the algorithm meets the demand for basic stability of power systems [8]. Therefore, the algorithm proposed in this paper also has good research value and application prospects in the operation and control of smart grids.

SDI considering transmission loss
Line transmission power constraints have a certain effect on the convergence process and optimization results of the algorithm. Generally, transmission loss (TL) accounts for about 3% to 7% of the total load [28]. Taking 4% as an example, Equation (21) is obtained, where i′ and i″ represent the serial numbers of the power generation units and the flexible loads, respectively. Under the same experimental environment and conditions as in Section 5.1.2, the SDI simulation results considering the transmission loss are shown in Figures 7 and 8. Figure 7 illustrates that when the transmission loss is taken into account in the model, the consensus variables of all adjustable units converge to a lower common value than that in Figure 4. Therefore, as shown in Figure 8, the output power of the generation units is reduced and the power absorbed by the flexible loads is also reduced, but this has little effect on the algorithm's convergence performance.

Multiple dispatch instruction
In reality, the interactive power is not static: the dispatch instructions change whenever the balance between the load of the entire smart grid and the power of the total generator set is slightly broken [29]. Therefore, to verify the timeliness of the proposed algorithm, the interactive power is set to 46.7, 73, 132.6 and 194.7 MW in turn, and the time cycle of multiple dispatch instruction (MDI) is set to 0.4 s, in order to simulate the operation of the algorithm in real dispatch. The simulation results are given in Figures 9-11. They show that the proposed algorithm meets the basic practical application requirements: within each dispatch cycle, the power output of each adjustable unit becomes stable, and the active power of the system eventually reaches balance. As the interactive power increases, the consensus variable of the system also gradually increases. All variables converge and obtain the optimal solution, and there is no case where the algorithm fails to converge within the prescribed period. The simulation results demonstrate that the proposed algorithm can meet the basic dispatch requirements of the system in complex environments.
[Figure: Consensus variable when G1 reaches the limit]

Power exceeds limit
To verify the effectiveness of the proposed algorithm when the power output of some adjustable units reaches the limit, the interactive power of the simulated smart grid is set to 441.8 MW, so that the power output of some adjustable units reaches the upper limit. The convergence results of the algorithm are shown in Figures 12-14. Figure 14 illustrates that at the 10th iteration the power output of G1 reaches its maximum of 50 MW, and the consensus variable also reaches a maximum. To maintain the balance of active power in the smart distribution network, the other adjustable units undertake more of the power balancing task. Figure 13 shows that the effectiveness of the proposed algorithm is verified, since the unbalanced power of the system finally reaches 0.
[Figure: Power output when G1 reaches the limit]

Plug and play
The distributed power should meet the plug-and-play (PAP) requirements in the operation of smart grids [29]. To verify the effectiveness of the proposed algorithm in the plug-and-play case, the following scenario is constructed: the initial conditions of the simulation are the same as those in Section 4.1.1. After one dispatch period (0.4 s) elapses, the distributed generator G15, drawn with dotted lines in Figure 1, is connected to the 13th node of the system. Figures 15 and 17 show that once G15 is connected to the system, it takes over part of the power allocation, so the output power of the remaining adjustable generators is reduced. Consequently, the consensus variable of the system is also reduced. Figure 16 illustrates that while G15 is being connected, the total net power deviation of the entire simulated intelligent distribution network is too small to cause a large frequency deviation; that is, the algorithm meets the basic stability requirement of the power system. Therefore, the proposed algorithm can support the plug-and-play function of distributed power in smart grids.
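The drop in the consensus variable after G15 is connected can be made plausible with a short derivation. It rests on assumptions that are not taken from the paper: quadratic costs $C_i(P_i) = a_i P_i^2 + b_i P_i$, no binding output limits, no transmission loss, and a fixed total demand $D$. The symbols $w_i$, $S_w$, $S_b$ and the index $n$ for the newly connected unit are introduced here for illustration only.

```latex
% Assumed notation: w_i = 1/(2 a_i), S_w = \sum_i w_i, S_b = \sum_i b_i w_i.
% Before the new unit n is connected, equal incremental cost gives
\lambda = \frac{D + S_b}{S_w}.
% After unit n (with w_n = 1/(2 a_n)) joins, the same condition gives
\lambda' = \frac{D + S_b + b_n w_n}{S_w + w_n}.
% Subtracting and using D + S_b = \lambda S_w:
\lambda' - \lambda = \frac{w_n \,(b_n - \lambda)}{S_w + w_n},
\qquad
P_n = w_n(\lambda' - b_n) = \frac{w_n S_w}{S_w + w_n}\,(\lambda - b_n).
```

Under these assumptions the common value falls exactly when $b_n < \lambda$, that is, when the new unit is cheap enough to produce a positive output $P_n$, and the remaining units then reduce their output accordingly.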

Contrast with traditional centralized algorithm
A comparative simulation with two commonly used centralized algorithms is conducted in this part to verify the performance of the proposed algorithm. The total unbalanced power is assumed to be 100 MW, and the remaining initial conditions of the simulation are the same as those in Section 5.1.1. The convergence efficiency of each algorithm can be observed from the process by which the unbalanced power converges to 0. The simulation results are shown in Figure 18. Curves 1 and 2 represent the convergence of the centralized consensus algorithm and of the algorithm based on the traditional gradient descent principle, respectively. Curve 3 shows the convergence of the fully distributed algorithm based on the improved Adam algorithm.
As shown in Figure 18, the first and second algorithms converge after 95 and 88 iterations, respectively, whereas the third converges after 58 iterations. Moreover, the computing efficiency of the proposed algorithm in the early stage of iteration is clearly better than that of the other two. The iteration efficiency of the proposed algorithm is therefore significantly higher than that of the commonly used traditional algorithms.
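As a quick check on the size of the improvement, the reported iteration counts (95, 88 and 58) can be turned into relative reductions. The symbols $r_1$ and $r_2$ are introduced here only for this arithmetic.

```latex
% Relative reduction in iterations versus the centralized consensus algorithm (curve 1)
r_1 = \frac{95 - 58}{95} \approx 38.9\%
% and versus the gradient-descent-based algorithm (curve 2)
r_2 = \frac{88 - 58}{88} \approx 34.1\%
```

These figures compare iteration counts only; they say nothing about the cost of a single iteration, which may differ between the three algorithms.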

FIGURE 19 Consensus variable of IEEE-162 node system

IEEE-162 node system
In this part, an IEEE-162 node system is built to verify the effectiveness of the proposed algorithm in a large energy network with massive adjustable units. The test case consists of 162 buses, 17 generators, 284 lines, 9 transformers and 91 loads. The proposed algorithm is used to optimize the dispatch of 126 adjustable units in this case. It is assumed that the designed topology is a strongly connected graph, the total load demand is 18,422 MW, and the interactive power is increased by 4800 MW. The initial conditions of the simulation are the same as those given in [8]; because the detailed basic data are extensive, they are not reproduced here and can be found in [8].
The proposed algorithm is used to perform the optimization calculation, and the simulation results are presented in Figure 19.
Compared with the experiments reported in [6], [8] and [9], the results in Figure 19 indicate that the proposed algorithm is effective when processing massive data in large-scale smart grids. In particular, one of the simulations in [8] required 500 iterations to solve the same case, whereas the proposed algorithm obtains the same result at the 397th iteration, in less time.
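For reference, the iteration counts quoted for the IEEE-162 case (500 in [8] and 397 here) correspond to the following relative reduction. The symbol $r$ is introduced only for this arithmetic.

```latex
r = \frac{500 - 397}{500} = \frac{103}{500} = 20.6\%
```

This is a reduction in iteration count only; the saving in wall-clock time also depends on the per-iteration cost of each method.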

CONCLUSION
In this paper, a distributed hybrid solution is proposed for the optimal dispatch of unit commitment and load distribution. Based on DRL and consensus theory, the proposed solution reduces the computational cost while maintaining good dynamic performance when processing massive data for large-scale power grids. The convergence efficiency is significantly improved by integrating the improved Adam algorithm with the consensus principle. In the context of data-driven artificial intelligence, applying DRL in the proposed solution effectively improves decision-making by optimizing dispatch strategies, and it broadens the methodological ideas for ED. Given the unprecedented attention that global power grid dispatch has attracted, this approach has broad application prospects.