Research on Hierarchical and Distributed Control for Smart Generation Based on Virtual Wolf Pack Strategy

Haze has become a serious problem in our society, and one significant remedy is the large-scale introduction of renewable energy. How to ensure that the power system can accommodate the integration and consumption of new energy has therefore become a scientific issue. This study explores a smart generation control scheme called hierarchical and distributed control based on virtual wolf pack strategy. The proposed method is based on the multiagent system stochastic consensus game principle and integrates a new win-lose judgment criterion and an eligibility trace. Simulations conducted on a modified IEEE two-area load frequency control power system model and on the Hubei power grid model in China demonstrate that the proposed method can obtain the optimal collaborative control of AGC units in a given regional power grid. Compared with several other smart methods, the proposed one improves closed-loop system performance and reduces carbon emission, while also achieving faster convergence and stronger robustness.


Introduction
Thermal power generation has recently made environmental pollution, especially air pollution, more serious. Therefore, more and more clean energy sources such as wind and photovoltaics are being integrated into the strongly coupled interconnected power grid [1]. However, this brings new problems, such as voltage limit violations, power fluctuations, and frequency instability [2][3][4], and the safe operation of the power grid is also affected. Because energy sources are becoming more dispersed, traditional centralized automatic generation control (AGC) cannot match the control performance of decentralized AGC. Research on decentralized AGC is therefore an inevitable trend for the future smart grid.
In recent years, many scholars have devoted themselves to optimal control strategies for decentralized AGC [5][6][7][8][9][10][11][12][13]. The authors of [6] put forward the concept of optimal AGC by using the original dual transformation method, which is based on optimal control theory. They showed that the dynamic equation and the constructed AGC control strategy of the interconnected system could realize multiarea decentralized optimal AGC. However, the optimal AGC controller needed feedback of all the state variables, which are difficult to obtain directly in an actual system. In [8], a new method based on model predictive control was proposed. It focused on a decentralized optimal AGC strategy for a cooperative synchronous power grid. However, the stability and robustness of the multivariable predictive control method, including its application in an actual AGC system, needed further study, and the method was computationally expensive and time-consuming. Yu et al. [11] demonstrated that an optimal AGC can be achieved when the number of agents is small. However, the algorithm is applicable only to systems with a small number of agents, so its application is limited. Decentralized control was also studied in earlier work by the authors, namely, decentralized correlated equilibrium Q(λ)-learning (DCEQ(λ)) [12] based on multiagent (MA) systems. It can handle the complex stochastic dynamic characteristics and the optimal coordination control of AGC after distributed energy is connected. Nevertheless, as the number of agents increases, the time needed to search for the MA equilibrium solution grows geometrically, which limits the application of DCEQ(λ) in larger systems. Therefore, the decentralized win or learn fast policy hill-climbing(λ) (DWoLF-PHC(λ)) algorithm [13] based on MA was developed, which uses an average mixed strategy instead of an equilibrium strategy. Thus, the dynamic characteristics of the system are effectively improved, and dynamic optimization control of the total power is also obtained. However, DWoLF-PHC(λ) still has a multisolution problem, which results in system instability when the number of agents increases sharply.
The works above share a limitation: they focus only on the control strategy for the total power in AGC, and the dynamic optimal allocation of that total power is not addressed. In fact, the modern power grid has gradually developed into a hierarchical and distributed control (HDC) structure that integrates large-scale new energy. For this reason, a single control strategy can hardly meet the requirements of the control performance standards (CPS). Therefore, a hierarchical and distributed control based on virtual wolf pack strategy (HDC-VWPS) is proposed in order to attenuate the stochastic disturbance caused by the massive integration of new energy into the power grid. The proposed strategy is based on the multiagent system stochastic consensus game (MAS-SCG) and is divided into two parts. The first part is an AGC optimal control method that combines a new win-lose judgment criterion, the policy hill-climbing (PHC) algorithm [14], and an eligibility trace [15]. The new win-lose judgment criterion is named policy dynamics-based WoLF (PDWoLF) [16]. The resulting control method, called PDWoLF-PHC(λ), is based on multiagent system stochastic game (MAS-SG) theory. The second part is the collaborative consensus (CC) algorithm [17], which is based on multiagent system collaborative consensus game (MAS-CC) theory. This algorithm is used to distribute the total power dynamically and optimally. Consequently, AGC control and power distribution are combined, and intelligence is obtained from the whole system down to its branches. The significant difference between smart generation control (SGC) and AGC is that the original proportional-integral (PI) control in AGC is replaced by smart control in SGC.
The rest of the paper is organized as follows. The SGC framework based on HDC structure is proposed in Section 2. The HDC-VWPS is expounded in Section 3. Section 4 presents the AGC design based on HDC-VWPS. Section 5 covers the case study, and Section 6 concludes the paper.

SGC Framework Based on HDC Structure
Hierarchical reinforcement learning (HRL) [18] is a hierarchical control method that can effectively solve the "curse of dimensionality" problem of traditional reinforcement learning. A new method, namely, HDC-VWPS, is put forward to obtain the optimal total power and its optimal dispatch dynamically. The term "virtual wolf pack" refers to the generator set group (GSG) of a certain control area. PDWoLF-PHC(λ), which has the win or learn fast (WoLF) attribute and is based on heterogeneous MAS-SG theory, is adopted to obtain the total power of each GSG. Meanwhile, the ramp time CC algorithm based on homogeneous MAS-CC theory is used to distribute the total power to each unit dynamically in order to achieve the optimal coordination control of each GSG. The "leader" of a virtual wolf pack is a new dispatcher that is responsible for communicating and cooperating with the leaders of the other GSGs and for sending instructions to each unit in its own GSG. Each GSG has only one leader. The SGC framework based on HDC structure is shown in Figure 1, where $\Delta P_{tie}$ is the tie-line exchange power, $\Delta f$ is the frequency error of the interconnected power grid, $\Delta P_i$ is the total power of GSGi ($i = 1, 2, \ldots, n$), and $\Delta P_{iu}$ is the regulation power of the $u$th unit in GSGi.

HDC-VWPS
An HDC-VWPS is designed to coordinate and optimize the operation of GSGs in the SGC system with HDC structure through the integration of MAS-SG and MAS-CC.

MAS-SG Framework.
Based on the MAS-SG framework, a PDWoLF-PHC(λ) algorithm is proposed for the game among GSGs to obtain the total power command of each GSG.
The WoLF principle meets the convergence requirement by changing the learning rate without sacrificing rationality, namely, learning quickly when losing and cautiously when winning [14]. However, in games larger than 2 × 2, the players cannot calculate the win-lose criterion accurately and can only rely on estimation. Therefore, an improved version of WoLF, PDWoLF, whose judgment criterion can be computed accurately in games larger than 2 × 2, was explored in [16]. It can also converge to a Nash equilibrium in games with more than two actions.
It is shown in [14] that the PHC algorithm meets the rationality requirement. Therefore, PDWoLF-PHC satisfies the convergence and rationality requirements at the same time. It also converges faster with a higher learning rate ratio [16]. PDWoLF-PHC is an extension of classical Q-learning [19]. It incorporates the multistep backtracking idea of SARSA(λ) [15] to search for the optimal action-value function dynamically through continuous trial and error. The parameter λ refers to the use of an eligibility trace, which solves the temporal credit assignment problem of time-delayed reinforcement learning. The optimal value function $V^{\pi^*}(s)$ and strategy $\pi^*(s)$ are as follows.
where A is the set of possible actions under state s.
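As a minimal sketch of how the optimal value function and strategy relate to a learned Q table, the snippet below assumes the usual greedy definitions, $V(s) = \max_{a \in A} Q(s, a)$ and $\pi(s) = \arg\max_{a \in A} Q(s, a)$. The state and action names and the Q values are hypothetical and serve only as an illustration.

```python
def greedy_value_and_policy(q_table):
    """Return V(s) = max_a Q(s, a) and pi(s) = argmax_a Q(s, a) for every state."""
    values, policy = {}, {}
    for state, action_values in q_table.items():
        # Pick the action with the largest Q value in this state.
        best_action = max(action_values, key=action_values.get)
        policy[state] = best_action
        values[state] = action_values[best_action]
    return values, policy


# Hypothetical two-state, three-action look-up table.
q_table = {
    "s0": {"down": -1.0, "hold": 0.5, "up": 0.2},
    "s1": {"down": 0.3, "hold": -0.2, "up": 0.1},
}
values, policy = greedy_value_and_policy(q_table)
```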
The eligibility trace $e_k(s, a)$ is updated at every step, where $e_k(s, a)$ denotes the eligibility trace at the $k$th iteration under state $s$ and action $a$, $\gamma$ is the discount factor, and $\lambda$ is the trace-attenuation factor.
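The interplay between the eligibility trace and the Q-function update can be sketched as follows. This is an illustration under stated assumptions rather than the exact update used in this paper: it assumes an accumulating trace that decays by $\gamma\lambda$ at every step and is incremented for the visited pair, and a temporal-difference error that uses the greedy action in the next state. The function name and the tiny two-state table are hypothetical.

```python
def q_lambda_step(Q, e, s, a, reward, s_next, alpha=0.1, gamma=0.9, lam=0.9):
    """One Q(lambda)-style update on look-up tables Q and e (lists of lists)."""
    # Temporal-difference error with the greedy action a' in the next state.
    td_error = reward + gamma * max(Q[s_next]) - Q[s][a]
    # Decay every trace by gamma * lambda (assumed accumulating-trace form).
    for i in range(len(e)):
        for j in range(len(e[i])):
            e[i][j] *= gamma * lam
    # Give the visited state-action pair fresh credit.
    e[s][a] += 1.0
    # Spread the TD error over all pairs in proportion to their trace.
    for i in range(len(Q)):
        for j in range(len(Q[i])):
            Q[i][j] += alpha * td_error * e[i][j]
    return td_error


# Hypothetical 2-state, 2-action example.
Q = [[0.0, 0.0], [0.0, 1.0]]
e = [[0.0, 0.0], [0.0, 0.0]]
td = q_lambda_step(Q, e, s=0, a=1, reward=1.0, s_next=1)
```

With these numbers the TD error is 1 + 0.9 × 1 − 0 = 1.9, so only the visited entry changes, by α × 1.9 = 0.19.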
The Q function is updated iteratively, where $0 < \alpha < 1$ is the Q-learning rate, $R(s_k, s_{k+1}, a_k)$ is the reward obtained in the transition from state $s_k$ to $s_{k+1}$ under the selected action $a_k$, $Q(s, a)$ is the value of executing action $a$ in state $s$, stored in a look-up table, and $a'$ is a greedy action. After sufficient trial-and-error iterations, $Q(s, a)$ converges to the $Q^*$ matrix with probability one. Finally, an optimal control strategy, represented by the optimal Q function ($Q^*$ matrix), is obtained. The win-lose criterion of PDWoLF-PHC(λ) is determined by two parameters, $\delta_{win}$ and $\delta_{lose}$, for a given agent. The strategy $\pi(s_k, a_k)$ of an agent is updated according to (4) for the state-action pair $(s_k, a_k)$.
where $\Delta_{s_k a_k}$ is the increment used in the strategy update. The updating rule is described as follows.
In (6), $|A|$ is the number of possible actions, and $\delta$ is the variable learning rate with $\delta_{win} < \delta_{lose}$, both taken between 0 and 1. Also, $\varphi = \delta_{lose}/\delta_{win}$ is defined as the variable learning rate ratio. The rate $\delta$ is selected according to $\Delta^2_k(s_k, a_k)$, the decision space slope value, and $\Delta_k(s_k, a_k)$, the decision change rate, at the $k$th iteration. Both $\Delta^2(s_k, a_k)$ and $\Delta(s_k, a_k)$ are themselves updated at every iteration.
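The strategy update described above can be sketched for a single state as follows. The sketch follows one common reading of PDWoLF-PHC and is an assumption, not a transcription of (4) to (8): an action "wins" when the product of its decision change rate and decision space slope is negative, each non-greedy action loses at most $\delta/(|A|-1)$ of probability, and the greedy action gains what the others lose. The function name and list layout are hypothetical; the rates 0.06 and 0.24 correspond to $\delta_{win} = 0.06$ and $\varphi = 4$ from the parameter settings.

```python
def pdwolf_phc_update(pi, q, delta, delta2, d_win=0.06, d_lose=0.24):
    """Update the mixed strategy pi of one state in place.

    pi, q, delta, delta2 are lists indexed by action.
    """
    n = len(pi)
    greedy = max(range(n), key=lambda a: q[a])
    step = [0.0] * n
    for a in range(n):
        if a == greedy:
            continue
        # Win-lose test: a negative product means "winning", so learn cautiously.
        rate = d_win if delta[a] * delta2[a] < 0 else d_lose
        # Never remove more probability than the action currently holds.
        step[a] = -min(pi[a], rate / (n - 1))
    # The greedy action receives exactly what the other actions give up.
    step[greedy] = -sum(step)
    for a in range(n):
        delta2[a] = step[a] - delta[a]  # decision space slope
        delta[a] = step[a]              # decision change rate
        pi[a] += step[a]
    return pi


pi = [1 / 3, 1 / 3, 1 / 3]
q = [0.0, 1.0, 0.0]
delta = [0.0, 0.0, 0.0]
delta2 = [0.0, 0.0, 0.0]
pdwolf_phc_update(pi, q, delta, delta2)
```

Because the greedy increment is defined as the sum of the other decrements, the strategy remains a valid probability distribution after every update.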

MAS-CC Framework.
The MAS-CC framework is introduced into the HDC-VWPS to dynamically allocate the total power command to each unit.

Graph Theory. The topology of a MAS can be expressed as a directed graph $G$ with a set of vertices, a set of edges, and a weighted adjacency matrix $B = [b_{ij}] \in \mathbb{R}^{n \times n}$. Here, $v_i$ denotes the $i$th agent, an edge represents the relationship between two agents, and the constant $b_{ij} \ge 0$ is the weight factor between $v_i$ and $v_j$. If there is a connection between any two vertices, the graph $G$ is called strongly connected. The Laplacian matrix $L = [l_{ij}] \in \mathbb{R}^{n \times n}$ of graph $G$ can be written as follows.
where the matrix L reflects the topology of the MA network.
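For illustration, the Laplacian can be built from the weighted adjacency matrix as in the sketch below. It assumes the standard graph-theoretic definition, $l_{ii} = \sum_{j \ne i} b_{ij}$ and $l_{ij} = -b_{ij}$ for $i \ne j$; the three-agent directed ring is a hypothetical example.

```python
def laplacian(B):
    """Build the graph Laplacian from a weighted adjacency matrix B."""
    n = len(B)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                L[i][j] = -B[i][j]   # off-diagonal: negative edge weight
                L[i][i] += B[i][j]   # diagonal: sum of the row's weights
    return L


# Hypothetical directed ring of three agents: 1 -> 2 -> 3 -> 1.
B = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
L = laplacian(B)
```

Every row of the resulting matrix sums to zero, which is a quick sanity check on any Laplacian built this way.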

Collaborative Consensus.
In a MAS, the process in which an agent interacts with its adjacent agents to reach agreement is usually called collaborative consensus (CC) [20]. In a MAS consisting of $n$ autonomous agents, each agent is regarded as a node in a directed graph $G$. The purpose of CC is for every agent to reach a common value by updating its state in real time after communicating with its neighboring agents. Because of the communication delay among agents, the first-order CC algorithm of a discrete system is chosen as follows.
where $\psi_i$ is the state of the $i$th agent, $k$ is the discrete time index, and $d_{ij}[k]$ denotes the $(i, j)$ entry of the row stochastic matrix $D = [d_{ij}] \in \mathbb{R}^{n \times n}$ at discrete time $k$. The entry $d_{ij}[k]$ is given as follows.
With continuous communication and constant gains $b_{ij}$, consensus can be achieved if and only if the directed graph is strongly connected.
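A first-order discrete consensus iteration of this kind can be sketched as follows. Two assumptions are made because the corresponding expressions are not reproduced here: each agent's new state is the $D$-weighted average of the current states, $\psi_i[k+1] = \sum_j d_{ij}[k]\,\psi_j[k]$, and the row stochastic matrix is obtained by normalizing the absolute Laplacian entries row by row, $d_{ij} = |l_{ij}| / \sum_j |l_{ij}|$. The Laplacian below is that of a hypothetical strongly connected three-agent ring.

```python
def row_stochastic(L):
    """Normalize |l_ij| row by row (assumes no row of L is all zeros)."""
    D = []
    for row in L:
        total = sum(abs(x) for x in row)
        D.append([abs(x) / total for x in row])
    return D


def consensus_step(psi, D):
    """psi_i[k+1] = sum_j d_ij * psi_j[k]."""
    return [sum(d * p for d, p in zip(row, psi)) for row in D]


# Laplacian of a hypothetical directed ring of three agents.
L = [[1, -1, 0],
     [0, 1, -1],
     [-1, 0, 1]]
D = row_stochastic(L)

psi = [1.0, 2.0, 3.0]
for _ in range(200):
    psi = consensus_step(psi, D)
```

In this particular example $D$ is also column stochastic, so the agents agree on the average of their initial states, 2.0.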

Ramp Time Collaborative Consensus.
The ramp time is chosen as the consensus variable among all units in a GSG. A unit with a higher ramp rate is allocated a larger share of the disturbance. The ramp time of the $u$th unit in GSGi can be obtained as follows.
where $\Delta P_{iu}$ is the regulation power of the $u$th unit in GSGi and $\Delta P^{rate}_{iu}$ is the ramp rate of the unit, which is calculated as follows.
where $\Delta P^{rate+}_{iu}$ and $\Delta P^{rate-}_{iu}$ are the upper and lower bounds of the ramp rate, respectively. The ramp time of the $u$th unit in GSGi can be updated according to (10) as follows.
where $U_i$ is the total number of units in GSGi.
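The effect of using the ramp time as the consensus variable can be illustrated with a small sketch. It assumes that the ramp time is the regulation power divided by the ramp rate, $t_{iu} = \Delta P_{iu} / \Delta P^{rate}_{iu}$, and that the upper bound of the ramp rate applies for a positive total power command and the lower (negative) bound otherwise. The function names, the 60 MW command, and the ramp rates in MW/min are hypothetical.

```python
def ramp_rate(total_power, rate_up, rate_down):
    """Pick the ramp-rate bound that matches the sign of the power command."""
    return rate_up if total_power > 0 else rate_down


def equal_ramp_time_dispatch(total_power, rates_up, rates_down):
    """Split total_power so that every unit needs the same ramp time."""
    rates = [ramp_rate(total_power, up, down)
             for up, down in zip(rates_up, rates_down)]
    t = total_power / sum(rates)          # common ramp time
    powers = [t * r for r in rates]       # unit power = ramp time * ramp rate
    return t, powers


# Two hypothetical units with ramp rates of 10 and 20 MW/min.
t, powers = equal_ramp_time_dispatch(60.0, [10.0, 20.0], [-10.0, -20.0])
```

Here both units need 2 min, and the faster unit takes 40 MW of the 60 MW command, which matches the statement that a unit with a higher ramp rate takes a larger share. Power limits are ignored in this sketch.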
Then the ramp time of the GSGi leader can be updated as follows.
where $\xi_i > 0$ is the power error adjustment factor of GSGi and $\Delta P_{error-i}$ denotes the power error between the total power of GSGi and the summed power of all its units, which is obtained as follows.
When the total power command satisfies $\Delta P_i > 0$, the ramp time $t_{iu}$ needs to be increased if $\Delta P_{error-i} > 0$ and reduced otherwise. Conversely, when $\Delta P_i < 0$, the direction of the adjustment is reversed.
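The leader correction described above can be sketched as one iteration of a leader-follower scheme. The following is an assumption-laden illustration, not the paper's exact update: every unit averages the ramp times with the weights of a row stochastic matrix $D$, the power error is taken as $\Delta P_i$ minus the summed unit powers, and only the leader adds $\xi_i$ times that error, with the sign flipped for a negative command. The function name, the matrix $D$, the rates, and $\xi_i = 0.01$ are hypothetical values.

```python
def leader_follower_step(t, rates, total_power, D, xi, leader=0):
    """One ramp-time consensus step with a power-error correction at the leader."""
    powers = [ti * r for ti, r in zip(t, rates)]
    error = total_power - sum(powers)            # power still to be allocated
    direction = 1.0 if total_power > 0 else -1.0
    # Every unit averages the ramp times it receives.
    new_t = [sum(d * tj for d, tj in zip(row, t)) for row in D]
    # Only the leader sees the GSG-level power error.
    new_t[leader] += direction * xi * error
    return new_t, error


rates = [10.0, 20.0]                 # MW/min, hypothetical
D = [[0.5, 0.5], [0.5, 0.5]]         # hypothetical row stochastic matrix
t = [0.0, 0.0]
first_t, first_error = leader_follower_step(t, rates, 60.0, D, xi=0.01)

for _ in range(300):
    t, error = leader_follower_step(t, rates, 60.0, D, xi=0.01)
```

For these numbers the iteration settles at a common ramp time of 2 min with zero power error. Whether the iteration converges in general depends on the choice of $\xi_i$, which is why the paper ties it to $\Delta P_i$.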
Because a ramp time CC algorithm is adopted among the units, the power of some units may exceed their maximum power. Moreover, the smaller the maximum ramp time $t^{max}_{iu}$ of a unit, the sooner its power limit is reached. When the power limit is reached, the power and ramp time of the $u$th unit are as follows.
where $\Delta P^{max}_{iu}$ and $\Delta P^{min}_{iu}$ are the maximum and minimum reserve capacities of the $u$th unit in GSGi, respectively. Furthermore, if the power $\Delta P_{iu}$ of the $u$th unit exceeds its limit, the weight factor becomes as follows.
where $B = [b^i_{uv}] \in \mathbb{R}^{U_i \times U_i}$ is the weighted adjacency matrix of GSGi.
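The limit handling can be sketched as below. Both helpers are hypothetical and rest on assumptions: a unit that hits a limit is held at that limit with the corresponding ramp time, and its links are removed from the weighted adjacency matrix by zeroing its row and column, so that it no longer takes part in the consensus. The exact weight rule of the paper is not reproduced here.

```python
def apply_power_limit(power, rate, p_min, p_max):
    """Clip a unit's power to its reserve capacity and recompute its ramp time."""
    clipped = min(max(power, p_min), p_max)
    saturated = clipped != power
    return clipped, clipped / rate, saturated


def isolate_unit(B, u):
    """Return a copy of B with the links of unit u set to zero (assumed rule)."""
    n = len(B)
    return [[0.0 if (r == u or c == u) else B[r][c] for c in range(n)]
            for r in range(n)]


# Hypothetical unit: 50 MW requested, 10 MW/min ramp rate, limits -30..40 MW.
power, ramp_time, saturated = apply_power_limit(50.0, 10.0, -30.0, 40.0)
B = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]
if saturated:
    B = isolate_unit(B, 0)
```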

AGC Design Based on HDC-VWPS
Reward Function Selection.
The impact of the energy management system (EMS) on the environment is considered by introducing carbon emission (CE) as part of the reward function. Meanwhile, in load frequency control (LFC), each regional power grid controls the generator sets in its area according to its own area control error (ACE), with the main purpose of driving the ACE to zero at steady state. Therefore, the weighted sum of CE and ACE is taken as the objective function in the reward function. The reward function in GSGi is defined as follows.
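A weighted-sum reward of this kind can be sketched as follows. The exact functional form is not reproduced here, so the sketch makes explicit assumptions: the reward is the negative of the weighted sum so that a smaller ACE magnitude and lower CE give a higher reward, the ACE enters through its absolute value, and the weight `eta` and the numerical values are hypothetical.

```python
def reward(ace, carbon_emission, eta=0.5):
    """Negative weighted sum of |ACE| and CE (assumed form, hypothetical weight)."""
    return -(eta * abs(ace) + (1.0 - eta) * carbon_emission)


# Hypothetical operating point: ACE of -10 MW and CE of 4 units.
r = reward(ace=-10.0, carbon_emission=4.0, eta=0.5)
```

In practice the two terms have different units and magnitudes, so the weight would also have to absorb a scaling between them.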

Parameter Setting.
A reasonable set of six parameters, λ, γ, α, δ, φ, and $\xi_i$, is required in the design of the control system. The trace-attenuation factor λ allocates credit among state-action pairs. The parameter λ lies between 0 and 1. It determines the convergence rate and the non-Markov decision process (MDP) effects for large time-delay systems. Generally, the factor λ can be interpreted as a time scaling element in the backtracking. For Q-function errors, a small λ means that little credit is given to the historical state-action pairs, while a large λ means that much credit is assigned. Trial and error shows that 0.7 < λ < 0.95 is acceptable. Here, λ = 0.9 is selected.
The discount factor γ lies between 0 and 1 and discounts the future rewards in the Q functions. A value close to 1 should be chosen because the latest rewards are the most important in the thermal-dominated LFC process. Experiments demonstrate that 0.6 < γ < 0.95 is proper. Here, γ = 0.9 is chosen.
The Q-learning rate α is set between 0 and 1 and trades off the convergence rate of the Q-functions against algorithm stability. A larger α accelerates learning, while a smaller α enhances system stability. In the prelearning process, the initial value of α is chosen to be 0.1 to obtain an overall search. After that, it is reduced linearly in order to gradually increase the stability of the system.
The variable learning rate δ lies between 0 and 1 and derives an optimal policy by maximizing the action value. In particular, the algorithm degrades into Q-learning if δ equals 1, because the maximum-value action is then always executed in every iteration. For a fast convergence rate, the greedy strategy with a variable learning rate ratio $\varphi = \delta_{lose}/\delta_{win} = 4$ is selected in a stochastic game. Trial and error shows that $\delta_{win} = 0.06$ yields stable control characteristics, which gives $\delta_{lose} = 0.24$.
The value of the power error adjustment factor $\xi_i$ in GSGi depends on $\Delta P_i$, as given in (21), where $\Delta P_i$ is the total power of GSGi in MW.

Case Study
The Modified IEEE Two-Area LFC Power System Model.
In order to test the control performance of the proposed strategy, a modified IEEE two-area LFC power system model [21] is selected as the simulation object, whose framework is shown in Figure 2. The system parameters are taken from [22], and those of GSG1 and GSG2 are provided in Table 1.
The work cycle of the AGC is set to 4 s. Note that HDC-VWPS has to undergo sufficient prelearning through off-line trial and error before the final online implementation. This includes extensive exploration of the CPS state space for the optimization of the Q-functions and state-value functions [23]. Figure 3 presents the prelearning of each area produced by a continuous 10 min sinusoidal disturbance. It shows that HDC-VWPS can converge to the optimal strategy in each GSG with qualified CPS1 (the average of 10 min CPS1) and $E_{\text{AVE-10min}}$ (the average of 10 min ACE).
Furthermore, the 2-norm of the difference between successive Q matrices, $\|Q_i^k(s, a) - Q_i^{k-1}(s, a)\|_2 \le \varsigma$, is used as the criterion for terminating the prelearning of an optimal strategy [24], where $\varsigma = 0.1$ is a specified positive constant. Both the Q values and the look-up table are saved automatically after the prelearning, so that HDC-VWPS can be applied to a real power system. The convergence of the Q-function differences obtained in each GSG during the prelearning is given in Figure 4, which shows that HDC-VWPS accelerates the convergence rate by nearly 26.7% to 40% over that of Q(λ).
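The termination test can be sketched as follows. The sketch assumes that the 2-norm is taken entrywise over the look-up table (the Frobenius norm); if the induced matrix 2-norm is intended instead, the norm computation would differ. The function name and the small tables are hypothetical; the tolerance 0.1 is the value of $\varsigma$ given above.

```python
import math


def q_converged(q_new, q_old, tol=0.1):
    """Return True when the entrywise 2-norm of q_new - q_old is at most tol."""
    squared = sum((a - b) ** 2
                  for row_new, row_old in zip(q_new, q_old)
                  for a, b in zip(row_new, row_old))
    return math.sqrt(squared) <= tol


q_old = [[1.0, 2.0], [3.0, 4.0]]
almost_same = [[1.0, 2.0], [3.0, 4.05]]
still_moving = [[1.0, 2.0], [3.0, 4.5]]
done = q_converged(almost_same, q_old)
not_done = q_converged(still_moving, q_old)
```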

Complexity
In order to evaluate the robustness of each algorithm, the control performances of DWoLF-PHC(λ), Q(λ), and Q-learning are compared with that of HDC-VWPS under a step and a stochastic load disturbance in GSG1. The simulation results under a step load disturbance are shown in Figure 5. Figure 5(a) shows that the overshoots are around 6.3758%, 4.907%, 7.2614%, and 13.0435%, respectively. Figure 5(b) shows that the average values of ACE are 0.1261 MW, 1.0682 MW, 1.2216 MW, and 1.0438 MW, respectively. In addition, Figure 5(c) illustrates that the minimum CPS1 is 189.6487%, 186.7696%, 189.6426%, and 190.1703%, respectively. The simulation results under a stochastic load disturbance are described in Figure 6. Figure 6(a) demonstrates that HDC-VWPS has the strongest robustness. Figure 6(b) shows that the average values of ACE are 22.7175 MW, 45.1846 MW, 66.6484 MW, and 75.7486 MW, respectively. Moreover, Figure 6(c) shows that the minimum CPS1 is 167.7471%, 159.4400%, 150.6757%, and 127.3168%, respectively. Therefore, HDC-VWPS can provide better control performance for AGC units.
After the prelearning process, stochastic white noise is used as the load disturbance, and the control performance of each algorithm obtained in each GSG is summarized in Figure 7. CE, $\Delta f$ (average value of the frequency deviation), $E_{\text{AVE-1min}}$ (average value of 1 min ACE), and CPS1 are averages over 24 h. It can be seen from Figure 7 that, compared with the other methods, HDC-VWPS can reduce CE by 1.21% to 1.51%, $\Delta f$ by $4.5647 \times 10^{-4}$ to $7.5851 \times 10^{-4}$ Hz, and $E_{\text{AVE-1min}}$ by 5.79% to 44.22%, and can increase CPS1 by 0.0007% to 0.02%.
$K_{p4} = 0.0025$, and $T_p = 20$. The generation rate constraint (GRC) in this study is $\Delta P^{rate+}_{iu} / \Delta P^{rate-}_{iu}$. GRC and all the other system parameters are given in Table 1.

The system includes coal-fired power plants, hydropower plants, and pumped storage power plants. The output of each plant depends on its own governor, and the AGC set point is obtained according to the optimal dispatch. The long-term AGC control performance based on MA is evaluated by a statistical experiment with a 30-day stochastic load disturbance. Four types of controllers are simulated, that is, Q-learning, Q(λ), DWoLF-PHC(λ), and HDC-VWPS. The statistical results obtained under impulsive perturbations and under stochastic white noise load fluctuation are shown in Figures 10 and 11, respectively. Here, $\Delta f$ and ACE are the average values of the frequency deviation and of the ACE, and CPS1, CPS2, and CPS are the monthly compliance percentages. The same weight of HDC-VWPS is chosen in each GSG, which gives more effective joint cooperation than the other policies. As a result, higher scalability and self-learning efficiency can be achieved.
It can be seen from the simulation results that HDC-VWPS has stronger adaptability and better control performance than the other three methods. In each GSG area, the win-lose criterion of a unit depends on the sign of the product of $\Delta(s_k, a_k)$ and $\Delta^2(s_k, a_k)$. Once an agent is determined to be "losing" or "winning", the corresponding variable learning rate is selected, and the optimal Q function is obtained by updating the Q values dynamically. Meanwhile, the increment used in the mixed strategy update is determined. Finally, the optimal mixed strategy is obtained through continuous dynamic updating. The results also demonstrate that the proposed strategy can effectively reduce the CE and improve the utilization rate of new energy.

Conclusion
Based on MAS-SCG theory, a novel HDC-VWPS method with a new win-lose judgment criterion and an eligibility trace is proposed to dynamically obtain the optimal total power and its optimal dispatch. It can also attenuate the stochastic disturbance caused by the massive integration of new energy into the power grid.
Based on MAS-SG, a PDWoLF-PHC(λ) algorithm is proposed to address the universality problem that agents usually require a strict knowledge system under the traditional MAS-SG framework. It also solves the problem that agents cannot accurately calculate the judgment criterion and converge slowly to the Nash equilibrium in games larger than 2 × 2. Based on MAS-CC theory, the ramp time CC algorithm is used to allocate the total power command to each unit dynamically.
The simulation results obtained with the modified IEEE two-area LFC power system model and the Hubei power grid model in China verify the effectiveness of the proposed strategy. Compared with the three other smart methods, the proposed one satisfies the CPS requirements and improves the performance of the closed-loop system. It can also reduce the CE and maximize the utilization rate of energy.

Figure 1: The SGC framework based on HDC structure.

Figure 2: Modified power system model based on IEEE two-area LFC.
The HDC-VWPS controller output

Figure 4: The convergence result of Q-function differences obtained in each GSG during prelearning.

Figure 5: Control performance of four AGC controllers under a step load disturbance.
Model of Hubei Power Grid.
The four-area model of the Hubei power grid is shown in Figure 8. As shown in Figure 9,

Figure 6: Control performance of four AGC controllers under a stochastic load disturbance.

Figure 7: Statistic performance of each GSG in the two-area LFC modified power system model.

Figure 8: The interconnected network of Hubei power grid model in China.

Figure 10: Statistic experiment results obtained under the impulsive perturbation in the Hubei power grid model.

Figure 11: Statistic experiment results obtained under the white noise load fluctuation in the Hubei power grid model.

Table 1: Model parameters of GSG units in the Hubei power grid.