Risk-based Reactive Power Optimization Based on Tribe Q-Learning Algorithm

In this paper, risk assessment theory is introduced into the traditional reactive power optimization problem. Moreover, a novel tribe Q-learning (TQL) algorithm with knowledge transfer is proposed, which combines the search mechanism of artificial intelligence algorithms with the iteration mode of Q-learning. The Q matrix is adopted as the knowledge matrix that stores the search information of the tribe. During online learning, the convergence rate of TQL can be accelerated significantly via knowledge transfer. Simulation on the IEEE 118-bus system demonstrates that TQL is two to twenty times faster than other AI algorithms while global convergence is ensured.


Introduction
With the advance of industrialization, the construction of power systems has accelerated, and the interconnection of regional power grids and cross-regional transmission have developed rapidly. At the same time, more large-scale wind energy, solar energy and electric vehicles have been connected to the system on the distribution network side, which makes the power grid more complex and may pose severe challenges to the secure and stable operation of power grids. In order to obtain an appropriate trade-off between system security and economy, several scholars have studied the security issues of reactive power optimization since the 1990s [1]. Based on the traditional reactive power optimization model, reference [2] attempted to make up for the deficiencies of traditional methods through the rational placement of reactive power compensation. Reference [3] proposed a reactive power optimization model based on Monte Carlo simulation and voltage security constraints, which takes the improvement of the node voltage level as the goal of optimization. From the perspective of risk, reference [4] analysed the influence of power loss, voltage instability and voltage violations on the operation of the power system, and configured the various reactive power resources in the system with the aim of minimizing operational risk. However, the above studies evaluate the security of the system from the perspective of the static voltage state, ignoring the effects of line overload and load fluctuations.
In order to improve the ability of the power system to withstand operational risks, the theory of power system risk assessment is introduced into the traditional reactive power optimization problem, and a risk-based reactive power optimization (RBRPO) mathematical model is constructed. The model aims at reducing the operational risk and the active power loss of the system, and adopts a probabilistic model to evaluate the risk of transmission line overload and node voltage violation under transmission line faults.
Risk-based reactive power optimization is a complex mixed discrete nonlinear programming problem. Solution methods mainly include classical mathematical methods and heuristic artificial intelligence algorithms. Compared with classical mathematical methods, artificial intelligence (AI) algorithms such as the genetic algorithm (GA) [5], particle swarm optimization (PSO) [6] and the artificial bee colony (ABC) [7] have been widely applied to various areas of power system optimization, owing to their low dependence on an accurate system model, their ease of application and global optimization capability, and their suitability for discrete, non-linear, large-scale problems [8] [9]. However, these algorithms can only deal with each problem in isolation, without the ability to store information or learn, which results in inefficiency when dealing with similar tasks. Transfer technology has become a powerful tool to accelerate machine learning for similar multitask optimization. In practice, many historical tasks and newly arriving tasks share a number of common features. Transfer learning finds the similarity between past and present tasks and uses previous knowledge to guide the current task, which significantly improves the efficiency of task optimization [10]. Based on this analysis, this paper introduces a new tribe Q-learning algorithm (TQL) with knowledge transfer to solve the risk-based reactive power optimization model. Different from the random search mode of common artificial intelligence algorithms, TQL uses four kinds of individuals, i.e., chiefs, patriarchs, civilians and rangers, as the search subjects to find the optimal solution. TQL uses the Q matrix of the Q-learning algorithm to store the group optimization information and to guide the next optimization step.
At the same time, the dimensionality of the knowledge matrix is reduced, which avoids the curse of dimensionality in large-scale systems. In the pre-learning process, TQL stores the optimization information of the source tasks in the optimal knowledge matrices. Through the extraction of similarity, the initial matrix of a similar new task is formed by a similarity-weighted transfer. Therefore, the optimization of new tasks is significantly accelerated during the online learning process. In order to verify the effectiveness of the new algorithm, TQL is applied to RBRPO over 24 scenarios on the IEEE 118-bus system, and its performance is compared with that of other existing AI algorithms.

Operation risk assessment
The operation risk assessment of power systems is a comprehensive evaluation of both the possibility and the severity of random perturbations, which can be described by the sum of the product of the probability and the consequence of each random disturbance [11]:

R = Σ_i P(s_i) I(s_i)    (1)

where s_i is the ith random perturbation, and P(s_i) and I(s_i) are the probability and the risk index of s_i, respectively. This article focuses on contingencies caused by transmission line outages. According to statistical data, the failure of transmission line L_i within a certain time interval follows a Poisson distribution, so its probability can be described as

P(L_i) = 1 − exp(−λ_i Δt / 8760)    (2)

where λ_i is the annual failure rate of transmission line i, and Δt is the fault calculation time in hours.
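As a minimal sketch, the Poisson outage probability of Eq. (2) can be computed as follows (the function name is illustrative):

```python
import math

def outage_probability(lam_annual: float, dt_hours: float = 1.0) -> float:
    """Probability that a line fails within dt_hours, assuming the
    Poisson failure model of Eq. (2): P = 1 - exp(-lambda * dt / 8760)."""
    return 1.0 - math.exp(-lam_annual * dt_hours / 8760.0)

# e.g. a line with an annual failure rate of 2 over a 1-hour interval
p = outage_probability(2.0, 1.0)
```

For small λΔt/8760, the probability is approximately λΔt/8760 itself, which matches the intuition that hourly outage probabilities are tiny.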
If line outages are independent events, the probability of a single failure at any time can be described as [12]:

P(s_i) = P(L_i) ∏_{j∈U_L, j≠i} (1 − P(L_j))    (3)

where U_L is the set of all transmission lines in normal operation; the probability of two or more simultaneous line failures can be obtained in the same way. The outage of a transmission line may result in the transfer of active power flow, and a sudden line overload or a severe node voltage deviation may occur in the neighbourhood of the failure point. In order to distinguish more effectively between low-probability but serious failures and high-probability but slight failures, a utility function is employed to describe the risk indices of branch power and node voltage. The branch power risk index describes the overload of the line power flow and is defined as

I_T = Σ_{i∈α} a·exp(b·T_i / T_imax)    (4)

where α is the set of overloaded transmission lines, T_i is the apparent power flowing through line i, T_imax is the power flow limit of line i, and a, b are positive real numbers. The node voltage risk index measures the extent of node voltage violation and is defined as

I_V = Σ_{i∈β} a_u·exp(b·max(V_i − U_imax, U_imin − V_i) / (U_imax − U_imin))    (5)

where β is the set of voltage-violating nodes, V_i is the actual voltage amplitude of node i, U_imax and U_imin are the upper and lower voltage limits of node i, and a_u, b are positive real numbers.
Taking into account the failure probabilities of the transmission lines, the system comprehensive risk index can be calculated as

R = Σ_{k∈C} ρ_k (I_T,k + I_V,k)    (6)

where ρ_k is the occurrence probability of the kth expected fault and C is the set of expected faults.
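The comprehensive risk index can be sketched as a probability-weighted sum of severities; the additive combination of the branch power and node voltage indices is an assumption consistent with the text:

```python
def system_risk(fault_probs, branch_risks, voltage_risks):
    """Comprehensive risk index: sum over the expected fault set C of the
    fault probability times the combined severity. Adding the branch and
    voltage severity terms is an assumption, not stated in the paper."""
    return sum(p * (it + iv)
               for p, it, iv in zip(fault_probs, branch_risks, voltage_risks))
```

Each entry of the three lists corresponds to one expected fault in the set C.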

Objective function and constraints of RBRPO
Under the premise of meeting the various operational constraints, RBRPO adjusts the distribution of the power flow by reasonably configuring the switching capacity of the reactive power compensation devices, the generator terminal voltages and the transformer tap ratios, so as to reduce the active power loss of the power grid and the operational risk as much as possible. In this paper, the linear weighting method is adopted to convert the multi-objective problem into a single-objective one. The objective function of RBRPO can be described as

min F = ω_1 f_1 + ω_2 f_2 + ω_3 f_3    (7)

where f_1, f_2 and f_3 denote the three objectives of the active power loss, the risk index and the voltage deviation component after normalization, respectively, and ω_1, ω_2, ω_3 are the weights of each objective.
The voltage deviation component V_d can be calculated as

V_d = Σ_{i∈S_N} |V_i − V_i^N|    (8)

where S_N is the set of nodes and V_i^N is the nominal voltage of node i. The active power loss P_Loss can be described as

P_Loss = Σ_{(i,j)∈S_L} G_ij (V_i² + V_j² − 2 V_i V_j cos θ_ij)    (9)

where θ_ij is the voltage phase angle difference between nodes i and j, and G_ij is the conductance of line i-j. Constraints include power flow constraints, control variable constraints and state variable constraints:

P_Gi − P_Di = V_i Σ_{j∈S_N} V_j (G_ij cos θ_ij + B_ij sin θ_ij), i ∈ S_N    (10)
Q_Gi + Q_Ci − Q_Di = V_i Σ_{j∈S_N} V_j (G_ij sin θ_ij − B_ij cos θ_ij), i ∈ S_N    (11)
x_min ≤ x ≤ x_max    (12)

where the vector variable x = [Q_C, K_T, U, θ, P_G, Q_G]^T denotes the switching capacity of the reactive power compensation devices, the transformer tap ratios, the node voltage amplitudes, the node voltage phase angles, and the active and reactive power of the generators, respectively; P_Di and Q_Di are the active and reactive load of node i, respectively; B_ij is the susceptance of line i-j; and S_C, S_T, S_G, S_D and S_L are the sets of reactive power compensation devices, transformers, generators, load buses and lines, respectively.
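A minimal sketch of the active power loss and voltage deviation calculations, assuming per-unit voltages and a nominal voltage of 1.0 p.u. (the function names and the reference voltage are illustrative):

```python
import numpy as np

def active_power_loss(V, theta, G, lines):
    """Active power loss over a set of lines (i, j):
    sum of G_ij * (V_i^2 + V_j^2 - 2*V_i*V_j*cos(theta_i - theta_j))."""
    loss = 0.0
    for i, j in lines:
        loss += G[i, j] * (V[i] ** 2 + V[j] ** 2
                           - 2.0 * V[i] * V[j] * np.cos(theta[i] - theta[j]))
    return loss

def voltage_deviation(V, V_ref=1.0):
    """Voltage deviation component: sum of |V_i - V_ref| over all nodes.
    V_ref = 1.0 p.u. is an assumption."""
    return float(np.abs(np.asarray(V) - V_ref).sum())
```

With equal voltages and zero angle differences the loss term vanishes, as expected from the cosine form of the formula.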

Knowledge matrix
The Q-value matrix of Q-learning is adopted as the knowledge matrix of TQL and is defined as a record of the algorithm's optimization strategy [14][15][16]. The element of the knowledge matrix, Q(s,a), denotes the expected accumulated reward obtained by selecting action a in state s. In the process of optimization, the algorithm achieves convergence after a large number of iterative trial-and-error steps, and the mapping from states to actions built up during this process is stored in the knowledge matrix. As shown in Figure 1, an agent (here, a tribal member) can obtain an action policy for a given state from the knowledge matrix and update its prior knowledge matrix by feedback. Basically, the Q-value matrix is a lookup table of size |S|×|A|. For large-scale complex systems, the action space |A| increases exponentially with the number of variables, i.e., the "curse of dimensionality", which makes the calculation difficult to carry out. Therefore, in order to considerably reduce the dimension, the original knowledge matrix is divided into several interrelated low-dimensional sub-matrices. Each sub-matrix corresponds to one variable; the rows and columns of the matrix correspond to the states and actions of that variable, respectively, and the number of rows of Q_{i+1} equals the number of columns of Q_i. In other words, the action space A_i of the ith variable is also the state space S_{i+1} of the (i+1)th variable. The action selection processes of the different variables are no longer isolated but extend as a chain of state-action pairs, i.e., the state of the (i+1)th variable cannot be selected until the action of the ith variable has been determined. In the knowledge matrix, the elements not only reflect the merits of the current strategy but also reflect the degree of compactness between adjacent variables: the larger the element value, the closer the state-action combination of the adjacent variables.
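The chained low-dimensional knowledge matrices described above can be sketched as follows, where the action index chosen for variable i serves as the state index of variable i+1 (names and sizes are illustrative):

```python
import numpy as np

def init_knowledge_chain(action_sizes):
    """Chained low-dimensional knowledge matrices: the i-th sub-matrix Q_i has
    shape |S_i| x |A_i|, where |S_i| equals |A_{i-1}| (the action chosen for
    variable i-1 is the state of variable i). The first state space has size 1."""
    sizes = [1] + list(action_sizes)
    return [np.zeros((sizes[i], sizes[i + 1])) for i in range(len(action_sizes))]

def greedy_chain_actions(Q_chain):
    """Walk the chain greedily: each variable's chosen action index becomes
    the next variable's state index."""
    state, actions = 0, []
    for Q in Q_chain:
        a = int(np.argmax(Q[state]))
        actions.append(a)
        state = a
    return actions
```

Storing one |A_{i-1}|×|A_i| matrix per variable keeps memory linear in the number of variables, rather than exponential as for the single |S|×|A| table.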
A multi-agent cooperative mechanism is introduced into the TQL algorithm to update the knowledge matrix. Each time the agents (tribal members) complete a trial and error, the algorithm updates the knowledge matrix with the Q-learning rule

Q_{k+1}(s_i, a_i) = Q_k(s_i, a_i) + α[R_k + γ max_{a'} Q_k(s_{i+1}, a') − Q_k(s_i, a_i)]    (13)

where α is the learning rate, γ is the discount factor, and R_k is the reward obtained by an agent in state s_i under a selected action a_ij^k in the kth iteration.
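A sketch of the per-trial knowledge matrix update, using the standard Q-learning rule; the learning rate and discount factor values below are illustrative, not from the paper:

```python
def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Standard Q-learning update applied to one knowledge sub-matrix after a
    tribal member's trial:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q[s][a] += alpha * (reward + gamma * max(Q[s_next]) - Q[s][a])
    return Q
```

In the multi-agent setting, every tribal member writes its feedback into the same shared sub-matrices, so the knowledge accumulated by one member guides the others.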

Optimization mode
In order to improve their ability to adapt to the environment, primitive people living in the same habitat spontaneously formed tribes, in which different tribal members achieved survival and development through mutual cooperation. Inspired by this primitive human social activity, TQL achieves global convergence and accurate local search through the mutual cooperation of different tribal members.
The algorithm classifies the tribe members according to their reward function values. The top 25% of individuals are the tribal patriarchs, among whom the best individual is the chief; the middle 25% are civilians; and the remaining 50% are rangers. There are two tendencies in the optimization mode of reinforcement learning, i.e., exploration and exploitation. Focusing on exploration enhances global convergence, while focusing on exploitation improves the convergence rate.
The chief, the tribal patriarchs and the civilians, who undertake the main search task, search by following behavioural strategies or chaotic search strategies. Chaotic phenomena are random, regular and ergodic; therefore, this kind of search enriches the diversity of the population and helps it jump out of local optima.
The fitness of the tribal patriarchs is dominant within the tribe; they guide the civilians and rangers but follow the lead of the chief. Thus, the patriarchs take chaotic searches based on the Logistic map as their main mode of movement and exhibit a certain following behaviour toward the chief. The movement of a patriarch can be described as

x_i^{t+1} = x_i^t + f_1 + f_2    (14)
f_1 = r_sign · r_i^t · Step,  Step = x_rand − x_i^t    (15)
f_2 = h · (x_lead − x_i^t),  r_i^{t+1} = μ r_i^t (1 − r_i^t)    (16)

where f_1 and f_2 are the chaotic search component and the chief-following component, respectively; μ is the chaos control parameter, set to 4 in this paper; r_i^t is the random number generated by the chaotic sequence in the tth cycle; r_sign is a random number with a value of 1 or −1; h is the approximation factor characterizing the degree to which an individual follows the chief, set to 0.1 in this paper; x_lead is the position of the chief; and x_rand is a randomly selected patriarch different from the individual itself. The chief moves in the same way as the patriarchs, also according to Eqs. (14) to (16), except that its following component toward itself is zero.
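The Logistic-map chaotic sequence used by the patriarchs (with μ = 4 as in the paper) can be sketched as:

```python
def logistic_chaos(r0=0.3, steps=5, mu=4.0):
    """Logistic-map chaotic sequence r_{t+1} = mu * r_t * (1 - r_t) used for
    the chaotic search component; mu = 4 as in the paper, r0 is illustrative."""
    seq, r = [], r0
    for _ in range(steps):
        r = mu * r * (1.0 - r)
        seq.append(r)
    return seq
```

At μ = 4 the map is fully chaotic and ergodic on [0, 1], which is what gives the chaotic search its population-diversity benefit.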
The movement of a civilian consists of following components toward the chief and toward the patriarchs:

x_i^{t+1} = x_i^t + f_3 + f_4    (17)
f_3 = c_1 r_1 (x_lead − x_i^t),  f_4 = c_2 r_2 (x_str − x_i^t)    (18)

where D is the dimension of the solution component; c_1, c_2 are the following factors of the chief and the patriarchs, set to 1.5 and 1 in this paper, respectively; r_1, r_2 are random numbers in [0, 1]; and x_str is the patriarch closest to civilian i. Rangers adopt the P_m-greedy strategy: under the guidance of the knowledge matrix, they search the feasible domain and improve the efficiency of the algorithm through the utilization of stored information. It can be described as

a = { a_s (roulette selection from P_i), if r_3 < P_m;  argmax_a Q_i(s_i, a), if r_3 ≥ P_m }    (19)

where r_3 ∈ [0, 1] is a random number, P_m ∈ [0, 1] denotes the migration probability, and a_s denotes roulette selection. When r_3 < P_m, the ranger selects an action by roulette according to the action probability matrix P_i; when r_3 ≥ P_m, the ranger chooses the action with the maximum expected accumulated reward in the current state, i.e., the greedy strategy.
The action probability matrix P_i denotes the selection probability of each state-action pair and is positively correlated with the value of the knowledge matrix element Q_i(s_i, a_i). P_i is updated as

P_i(s, a) = exp(β Q_i(s, a)) / Σ_{a'} exp(β Q_i(s, a'))    (20)

where β is the divergence factor, which magnifies the divergence of the sub-matrices, and e_i is the transition matrix.
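A sketch of the ranger's P_m-greedy action selection; the softmax form of the probability row is an assumption consistent with the text's statement that selection probability is positively correlated with the Q-values:

```python
import math
import random

def softmax_probs(q_row, beta=1.0):
    """Action probabilities positively correlated with Q-values; the softmax
    form with divergence factor beta is an assumption."""
    exps = [math.exp(beta * q) for q in q_row]
    total = sum(exps)
    return [e / total for e in exps]

def pm_greedy(q_row, p_m=0.2, beta=1.0, rng=random):
    """P_m-greedy rule: with probability p_m pick an action by roulette over
    the probability row, otherwise act greedily on the Q-values."""
    if rng.random() < p_m:
        probs = softmax_probs(q_row, beta)
        return rng.choices(range(len(q_row)), weights=probs, k=1)[0]
    return max(range(len(q_row)), key=lambda a: q_row[a])
```

Larger β sharpens the roulette distribution toward the best-known actions, while smaller β keeps the search closer to uniform exploration.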
After each round of iteration, all the tribal members have completed their search and obtained the feedback of the current round's reward function. According to the reward function, the tribal members are re-ranked and assigned new roles: the top-ranked members become the chief, patriarchs and civilians, the lower-ranked members become rangers, and the former chief retains its original position. Therefore, the algorithm not only guarantees the continuity of the elite individual but also maintains global search performance in the solution space.

Transfer learning
If the tasks to be completed by TQL include multiple similar tasks, the efficiency of solving new tasks can be improved through knowledge transfer.
As illustrated in Figure 2, knowledge transfer based on the existing knowledge of the source tasks can accelerate the learning process of new tasks. In order to acquire the initial knowledge of similar new tasks, the source tasks must first be studied in pre-learning. Assuming that the action space and the state space remain constant, the optimal knowledge matrices of the source tasks can be treated as the initial knowledge matrices of the target tasks. In the transfer learning process, the optimal knowledge matrices of the source tasks Q_S are transferred to the initial knowledge matrices of similar new tasks Q_N. The optimal knowledge matrix of a source task contains both related and invalid information about new tasks. Therefore, if the related knowledge cannot be fully extracted, the invalid information will interfere with the learning of the new task and reduce the effect of transfer learning, i.e., negative transfer. To handle this, the new task extracts only the closely relevant knowledge and takes similarity as the criterion for selecting the object to learn from during the transfer learning process. In RBRPO, a task corresponds to a time section, and the demand for active load differs between time scenarios. Since the solution of RBRPO is mainly determined by the power flow of the system, the active power deviation between different time scenarios is defined as the similarity between source and new tasks.

Source task
Assuming that the active power demand of the new task x is P_Dx, and that the two source tasks with the smallest active power deviations from task x are tasks i and k with P_Di < P_Dx < P_Dk, the similarity between task x and the two source tasks can be calculated as

η_1 = (P_Dk − P_Dx) / (P_Dk − P_Di),  η_2 = (P_Dx − P_Di) / (P_Dk − P_Di)    (21)

where η_1 and η_2 are the similarity weighting factors, with η_1 + η_2 = 1. The knowledge matrices of the new task x can then be obtained by a linear transfer:

Q_x,m = η_1 Q_i,m + η_2 Q_k,m    (22)

where Q_x,m, Q_i,m and Q_k,m denote the knowledge sub-matrices of the mth variable in the new task x and the source tasks i and k, respectively.
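The similarity-weighted knowledge transfer can be sketched as a linear interpolation between the two nearest source-task sub-matrices (the weight expressions are a reconstruction consistent with η_1 + η_2 = 1):

```python
import numpy as np

def transfer_knowledge(Q_i, Q_k, P_Di, P_Dk, P_Dx):
    """Initial knowledge matrix for a new task x by similarity-weighted transfer.
    eta1 weights the source task whose load is nearer to P_Dx; the linear
    interpolation weights are a reconstruction, assuming P_Di < P_Dx < P_Dk."""
    eta1 = (P_Dk - P_Dx) / (P_Dk - P_Di)
    eta2 = 1.0 - eta1
    return eta1 * np.asarray(Q_i) + eta2 * np.asarray(Q_k)
```

When P_Dx coincides with one of the source loads, the corresponding source matrix is copied unchanged, so the transfer degrades gracefully to direct reuse.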

Case Studies
In this paper, the TQL algorithm is used to solve RBRPO on the IEEE 118-bus system, and its performance is compared with that of GA [5], PSO [6], ABC [7], ANT-Q [18], the quantum genetic algorithm (QGA) [19] and ant colony optimization (ACO) [20]. The simulation is undertaken in Matlab R2014a on a personal computer with an Intel(R) Core(TM) i7-6700 CPU at 3.40 GHz and 16 GB of RAM. The power flow calculation is based on the Matpower 6.0 toolbox in Matlab R2014a. Figure 3 shows the typical load curve of the IEEE 118-bus system. Based on the demand for active power, the load is uniformly divided into 7 intervals, {[3556, 3897), [3897, 4239), …, [5604, 5945]}. Therefore, the number of source tasks for the IEEE 118-bus system is 8.
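As an interpretation of the scenario setup (reading the 8 source tasks as the 8 interval boundary points is an assumption), the uniform load intervals can be generated as:

```python
import numpy as np

# Divide the load range [3556, 5945] MW uniformly into 7 intervals; the 8
# boundary points are taken as the 8 source-task load levels (an assumption).
bounds = np.linspace(3556.0, 5945.0, 8)

def interval_of(p_load):
    """Index (0-6) of the load interval containing p_load."""
    return int(np.clip(np.searchsorted(bounds, p_load, side="right") - 1, 0, 6))
```

Each hourly scenario can then be binned by `interval_of` to pick the nearest source tasks for transfer.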

Algorithm comparison
The optimization of the objective functions performed by each algorithm within the 24 hours of the day is shown in figure 4, where the blue solid line represents the optimization result of TQL and the dotted lines represent the other algorithms. It can be seen from the figure that the trend of the objective function of TQL is basically the same as that of the other algorithms over one day; its objective function curve is only slightly higher than that of the ACO algorithm and superior to the other AI algorithms. This shows that the algorithm can take full advantage of the related knowledge obtained in the pre-learning process, avoid negative transfer, and achieve good global convergence by fully exploiting the similarity between the source tasks and the new tasks. In general, the solving process of an AI algorithm is random and uncertain. In order to compare the performance of each algorithm fairly, each algorithm was run 10 times. Since each algorithm carries out the optimization of all 24 hourly load levels in each round of simulation, the total number of simulations per algorithm is 10×24 = 240, so the convergence of the algorithms is fully reflected. Table 1 gives the average of the objective functions obtained from the 10 runs of each algorithm. The values of the power loss, the voltage deviation component, the risk index and the objective function are the sums over the 24 tasks. The calculation time is the total time for each algorithm to complete the 24 tasks, and the convergence time is the average time to complete a single task. It should be noted that the convergence quality of an algorithm is determined only by the objective function value, not by the power loss, the voltage deviation component or the risk index alone.
It can be seen from table 1 that the TQL algorithm needs only about 895 s to complete the optimization of the 24 tasks, which is much faster than the other algorithms. Moreover, the convergence rate of TQL is 2 to 20 times faster than that of the other 6 algorithms, averaging more than 10 times that of the other algorithms. The objective function value obtained by TQL is 3407.87, which places second among the 7 algorithms; however, its power loss and voltage deviation component are better than those of the ACO algorithm. This shows that TQL fully exploits the similarities between source and new tasks and significantly accelerates the optimization process through knowledge transfer. At the same time, TQL organically combines the trial-and-error iteration mechanism of Q-learning with the random optimization mode of the tribe to ensure the global convergence of the algorithm. Figures 5 and 6 show the speed advantage and excellent search ability of the TQL algorithm intuitively. In the table, the variance, standard deviation and relative standard deviation are calculated to evaluate the convergence stability of each algorithm. TQL shows the best performance among all the AI algorithms, especially in convergence stability, where its relative standard deviation is only 40% of that of the ANT-Q algorithm. This is because TQL reduces the blindness and uncertainty of the global search by transferring and using past knowledge.

Optimization analysis
The node voltage and the power flow distribution of the IEEE 118-bus system at load section 20 are shown in figures 7 and 8, respectively, which compare the situation before and after TQL optimization. After the optimization, the voltage deviation of the system nodes is reduced and distributed in the range [0.96, 1.04]; therefore, the risk of node voltage violation is effectively controlled. It can be seen from figure 8 that after the optimization the distribution of the power flow is more uniform, which avoids the occurrence of branch overload. Figure 9 compares the results before and after optimization. After the optimization by TQL, the values of the power loss, voltage deviation, risk index and objective function are all lower than before optimization, which verifies the validity of the multi-objective reactive power optimization model proposed in this paper. The improvement in system operational risk is the most obvious: the optimized risk index is only 68% of its value before optimization, which means that the ability of the power system to resist uncertainty risk is significantly improved by TQL. Figure 9. The distribution radar map of the objective function and sub-objectives.

Conclusion
In this paper, a novel TQL algorithm is proposed for RBRPO. The main innovations can be summarized as follows: 1) The theory of risk assessment is introduced into the traditional reactive power optimization problem and the RBRPO model is proposed. The model aims at reducing the operational risk and the power loss of the system while optimizing the system voltage stability, which is beneficial to the secure and economical operation of the power system.
2) TQL organically combines the trial-and-error iteration mechanism of Q-learning with the random optimization mode of the tribe to ensure both the local deep-search ability and the global convergence of the algorithm. 3) The active load deviation is defined as the similarity degree. TQL can accelerate the optimization of a new task by knowledge transfer, using the knowledge of similar source tasks.
The convergence stability and excellent performance of TQL are confirmed by the simulation results on the IEEE 118-bus system: TQL is 2 to 20 times faster than existing AI algorithms for RBRPO, while the quality of the optimal solution and the convergence stability are guaranteed. Thus TQL can be a useful tool for the risk-based reactive power optimization problem in power systems.