Adaptive Online-Learning Volt-Var Control for Smart Inverters Using Deep Reinforcement Learning

The increasing penetration of renewable distributed generation in the power grid causes significant voltage fluctuations. Providing reactive power helps balance the voltage in the grid. This paper proposes a novel adaptive volt-var control algorithm based on deep reinforcement learning. The learning agent is an online-learning deep deterministic policy gradient that is applicable under real-time conditions in smart inverters for reactive power management. The algorithm only uses input data from the grid connection point of the inverter itself; thus, no additional communication devices are needed, and it can be applied individually to any inverter in the grid. The proposed volt-var control is successfully simulated at various grid connection points in a 21-bus low-voltage distribution test feeder. The resulting voltage behavior is analyzed, and a systematic voltage reduction is observed both in a static and a dynamic grid environment. The proposed algorithm enables flexible adaptation to changing environments through continuous exploration during the learning process and, thus, contributes to a decentralized, automated voltage control in future grids.


Introduction
The ongoing decentralization of the power system due to the growing penetration of renewable distributed generation (DG) can cause voltage problems in the distribution grid, since bidirectional power flows increase the risk of voltage violations [1,2]. Furthermore, the fluctuating character of renewable energy feed-in causes rapid changes in the power flows and significantly affects the voltage behavior [3,4]. To overcome these problems, reactive power injection or absorption can be used to compensate voltage changes [5]. Thus, volt-var control (VVC) is essential for stable grid operation [6].
Various volt-var control approaches are currently under research, such as automatic voltage regulators, switchable capacitors, distribution static compensators (DSTATCOM), and smart inverters (SI) [7][8][9][10]. This work focuses on VVC in smart inverters to combine DG and voltage control efficiently and provide a fast response to voltage changes. Various volt-var strategies for SI have been investigated recently [10]. Reactive power can be provided either with a fixed power factor, with a fixed reactive power Q, or with voltage-dependent reactive power Q(U) [11,12]. Since the cable length and the grid topology influence the local voltage behavior, the reactive power demand at each grid connection point (GCP) is individual [13,14]. Thus, regarding the voltage-dependent approach, research is being conducted on optimizing the Q(U) feed-in [15,16].
The recent success of deep reinforcement learning (DRL) in many fields, including games [17,18] and robotics [19,20], has also attracted attention in power and energy applications [21,22]. Several studies focus on applying DRL methods to power grid operations in order to provide intelligent control algorithms [23][24][25]. In References [26,27], an optimized coordination of various voltage-regulating devices was realized through DRL. Several works suggest DRL algorithms for multi-agent smart inverter coordination [28][29][30][31], requiring knowledge from other grid nodes. The deep deterministic policy gradient (DDPG) [32] in particular is highly promising in the field of control algorithms. In Reference [33], a two-stage deep reinforcement learning method for inverter-based volt-var control is presented, consisting of an offline and an online stage in the learning process.
In contrast to most of these studies, which use centralized methods with the need for measurement and communication devices within the grid, in this paper, a fully decentralized DRL volt-var control is demonstrated. Within the DRL framework, a DDPG learning agent is used, which only receives input data from its own connection point to the grid and does not require any previous knowledge of the grid. Hence, the proposed method can be applied to single inverters individually and is able to regulate the local voltage without any further measurement or communication equipment. This strongly reduces the computational costs and the amount of data and, thus, increases the potential for real-time applications. Additionally, the individual implementation allows a progressive and demand-oriented upgrading of the inverters in the grid without complex guidance through the distribution grid operator (DGO). Furthermore, the learning process is fully realized as online learning, allowing ongoing exploration. This enables the control algorithm to continually adapt to fluctuating power flows in the short term and changing grid contributors in the long term.
The main contributions of this paper are summarized below:
• No need for additional equipment: this saves installation effort and costs.
• Reduction of data flows: with a self-learning individual volt-var control, there is no need for data exchange with other grid actors.
• Individual application at the point of demand: this enables DGOs to progressively adapt the distribution grid to higher shares of DG feed-in.
• Flexible adaptation to changing environments through online learning: the ongoing exploration in the learning process allows continuous adaptation to the actual reactive power demand.
The remainder of the paper is organized as follows: Section 2 presents the proposed deep reinforcement learning method, including the formulation of the DRL agent parameters and the reward function. Subsequently, in Section 3, the simulation test feeder is illustrated and analyzed regarding its reactive power demand at each connection point. Accordingly, the proposed DRL method is applied at various nodes in the test feeder, and the static and dynamic grid behavior is investigated in Section 4.

Proposed DRL Volt-Var Control Algorithm
The proposed voltage control method is based on a reinforcement learning framework consisting of a training environment and a deep learning agent. The learning agent used in this algorithm is a deep deterministic policy gradient (DDPG) agent, a model-free actor-critic deep reinforcement learning agent for continuous action spaces. The algorithm was developed in 2015 by Lillicrap et al. [32] based on the deterministic policy gradient (Silver et al. [34]). It uses an actor-critic architecture with deep neural networks as function approximators. The learning agent interacts with its environment by observing the environment and receiving rewards for the performed actions. Together with the observed values, this reward is used to update the neural networks and, thus, influences the output values.

In the following, this learning method is applied to volt-var control in inverters. The algorithm is designed in such a way that only input data from the grid connection point of the inverter is used. Therefore, the observed data (input) is limited to the voltage values at the grid connection point, while the performed action (output) of the learning agent is the reactive power output. The flowchart of the proposed algorithm is shown in Figure 1. To initialize the algorithm, the active and reactive power at the grid connection point are set. Every second step, the reactive power is calculated by the DRL agent and then kept constant for two steps in order to prevent oscillating behavior. Afterwards, the active and reactive power are fed into the electric grid by the inverter, and the resulting voltage values Re{U} and Im{U} (real and imaginary part of the voltage) at the grid connection point are measured. Subsequently, the voltage deviation ∆U between the measured voltage and the nominal voltage of 1 pu, as well as the temporal derivative of the voltage and the reward R, are calculated.
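The measure/observe/act cadence of this loop, including the rule of holding the reactive power constant for two steps, can be sketched in a few lines of Python. The linear voltage model and the simple integral controller below are assumptions for illustration only; in the paper, the environment is the simulated test feeder and the controller is the DDPG agent.

```python
# Sketch of the control loop from Figure 1. K_P, K_Q, and the integral
# controller are assumed stand-ins; the paper uses the simulated feeder
# as environment and a DDPG agent as controller.
K_P, K_Q = 0.0025, 0.0015   # assumed voltage sensitivities (pu/kW, pu/kVar)

def measure_voltage(p_kw, q_kvar):
    """Stand-in grid measurement: returns Re{U} and Im{U} in pu."""
    return 1.0 + K_P * p_kw + K_Q * q_kvar, 0.0

class IntegralAgent:
    """Placeholder for the DDPG agent: a damped integral rule so that
    the loop runs stand-alone."""
    def __init__(self):
        self.q = 0.0

    def act(self):
        return self.q

    def observe(self, delta_u, re_u, im_u, du_dt):
        # Two observations fall into each hold period, so a gain of 0.5
        # cancels a constant deviation exactly.
        self.q -= 0.5 * delta_u / K_Q

def control_loop(agent, p_profile):
    q, u_prev, delta_u = 0.0, 1.0, 0.0
    for t, p in enumerate(p_profile):
        if t % 2 == 0:          # Q is recomputed every second step and
            q = agent.act()     # held constant in between (Figure 1)
        re_u, im_u = measure_voltage(p, q)
        delta_u = (re_u**2 + im_u**2) ** 0.5 - 1.0   # deviation from 1 pu
        du_dt = re_u - u_prev                        # discrete derivative of U
        agent.observe(delta_u, re_u, im_u, du_dt)
        u_prev = re_u
    return delta_u
```

With a constant active power feed-in, this stand-in controller drives the voltage deviation to zero within two hold periods; in the proposed method, the learned DDPG policy replaces the fixed rule.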
For updating the DRL agent, the values of ∆U, Re{U}, Im{U}, and the temporal derivative of U are used as observation values (input). After updating the DRL agent, the new reactive power output Q(t) is calculated as the action value. For calculating the reward R, a specific reward function was developed. The reward function is essential for reinforcement learning algorithms because it defines the behavior of the DRL agent. For the considered application, the voltage is supposed to be regulated to 1 pu; therefore, any voltage deviation results in a negative reward. The developed reward function weights the absolute voltage deviation with the factor 1000 and combines this with an additional term that rewards long-lasting voltage stability:

R = −1000 · |∆U| − 1/(1 + b), (1)

where a describes the admissible absolute voltage deviation in pu and b counts the successive steps for which ∆U stays within the desired interval [−a, a]. This specific reward function was developed in order to prevent oscillating voltage behavior. Taking into account the parameter b, the function rewards successive compliance of ∆U with the desired interval [−a, a]. The longer the voltage is kept within the interval 1 ± a, the bigger the value b gets; thus, the term 1/(1 + b) tends to zero, leading to a higher total reward.
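As an illustration, the reward computation could be sketched as follows in Python. Only the weighting factor 1000, the 1/(1 + b) term, and the role of the interval [−a, a] are fixed by the description above; the exact reset rule for the counter b when the voltage leaves the interval is an assumption.

```python
def reward(delta_u, b, a=0.002):
    """Sketch of the reward function: -1000 weights the absolute voltage
    deviation, 1/(1 + b) rewards long-lasting voltage stability.
    Resetting b to zero outside [-a, a] is an assumption."""
    if abs(delta_u) <= a:
        b += 1      # one more successive step within the tolerance band
    else:
        b = 0       # assumed reset on leaving the band
    return -1000.0 * abs(delta_u) - 1.0 / (1.0 + b), b
```

The longer |∆U| stays below a, the larger b grows and the closer the penalty term 1/(1 + b) gets to zero, matching the behavior described above.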
The proposed algorithm was implemented in Python with the help of OpenAI Gym [35] and keras-rl [36]. Optimized parameters for the DRL agent were set as listed in Table 1.

Simulation Test Feeder
The proposed algorithm was tested in the feeder shown in Figure 2. The test feeder was developed in the study 'Merit Order Netzausbau 2030' (MONA) and is a three-phase 21-bus system [37]. It can be considered a European low-voltage distribution grid with ten residential households at a voltage level of 400 V and a frequency of 50 Hz. The test feeder was modeled in MATLAB Simulink and exported to the training environment in Python as a functional mock-up unit (FMU file) with FMIKit [38].

The simulation scenario combines distributed generation through photovoltaics (PV) at every household (N1 to N10) with individual active and reactive loads from the households. For this, load profiles by Tjaden et al. [39] were used that provide three-phase values for active and reactive power at every second; the first ten profiles were utilized. For the PV feed-in, a normalized PV profile [40] was multiplied by a fixed factor for every household (see Table 2) to model different PV sizes. These factors were chosen randomly around an average PV size of 5 kWp for residential PV installations. For all following simulations, the three phases are assumed symmetric; thus, only the results of phase A are presented. As an example, Figure 3 shows the load and PV profiles at node N10.

Figure 4 shows the voltage topology in the test feeder with zero reactive power feed-in for a static grid situation. As input data, the values from the load and PV profiles were used as listed in Table 2. The x- and y-axes indicate the distance to the transformer. The transformer is circled in black, the households in gray. The fill color of each bus indicates its voltage. At the transformer, the voltage is 1 pu and rises with the distance up to 1.02 pu due to the active power injections.
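The construction of the per-household PV feed-in described above can be sketched as follows. The normalized profile values and the factor range are placeholders (the actual profile is from [40] and the actual factors are listed in Table 2); only the 5 kWp average is from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (assumption)

# Placeholder for the normalized PV profile from [40] (1.0 = rated power).
normalized_pv = np.array([0.0, 0.2, 0.6, 1.0, 0.8, 0.3, 0.0])

# One scaling factor in kWp per household N1..N10, drawn randomly around
# the 5 kWp average named in the text (assumed range).
factors_kwp = rng.uniform(3.0, 7.0, size=10)

# Per-household PV feed-in in kW: rows = households, columns = time steps.
pv_feed_in = factors_kwp[:, None] * normalized_pv[None, :]
```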
Thus, Figure 4 illustrates the voltage rise along the lines in distribution grids with high PV penetration and demonstrates the risk of overvoltages at distant nodes. This emphasizes the need for voltage-regulating measures, for instance, additional reactive power injection.

Reactive Power Demand in the Test Feeder
The reactive power demand in the test feeder was analyzed. Based on the test feeder, it was investigated how much reactive power is necessary to regulate the voltage to 1 pu under static PV feed-in. For this purpose, the reactive power at the grid connection point was systematically varied for different in-feeds with fixed active power until a voltage of 1 pu was observed. With the exception of the node under consideration, all nodes were kept constant according to Table 2. In this way, the reactive power demand was recorded as a function of the active power injection and is shown for every connection point in Figure 5. The figure shows that the reactive power demand is individual for every connection point, which emphasizes the relevance of a self-learning volt-var control algorithm. In this case study, reactive power in the magnitude of −20 kVar to −60 kVar is required to maintain a voltage level of 1 pu, depending on the active power feed-in. This comparatively high amount of reactive power results from the fact that a target voltage of exactly 1 pu was used in this study in order to show the theoretical potential of DRL VVC. In practice, however, wider tolerance intervals will most probably be allowed, which reduces the reactive power demand.
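The demand analysis can be illustrated with the common single-line voltage-rise approximation ∆U ≈ (R·P + X·Q)/U_N². The resistance and reactance values below are assumed examples, not the MONA feeder parameters; in the paper, the full feeder simulation is used instead.

```python
# Assumed cumulative line parameters up to the connection point (ohm);
# example values only, not the MONA feeder data.
R, X = 0.4, 0.25
U_N = 400.0  # nominal voltage (V)

def voltage_rise_pu(p_w, q_var):
    """Approximate voltage rise in pu caused by active power p_w (W) and
    reactive power q_var (var) at the connection point."""
    return (R * p_w + X * q_var) / U_N**2

def q_demand(p_w):
    """Reactive power (var) that cancels the rise caused by p_w, i.e.,
    solves voltage_rise_pu(p_w, q) = 0."""
    return -R * p_w / X

# Demand curve over an active power range of 0 to 8 kW.
curve = [(p, q_demand(p)) for p in range(0, 8001, 2000)]
```

Note the sign: negative Q (absorption) cancels the rise caused by positive P, in line with the negative reactive power demand reported above.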

Static Grid Behavior
In this section, the performance of the proposed algorithm is investigated in a static grid environment. The power was kept constant at all nodes according to Table 2. At one node, the DRL agent was implemented and trained over a period of 80,000 steps. The aim was to control the voltage to 1 pu with an admissible control deviation of 0.2%; thus, the parameter a in (1) was set to 0.002 pu. Other values for a are also possible, e.g., 0.05 pu in case less reactive power is available. During training, the active power feed-in at the considered node was varied according to the PV profile.
After the training process, the active power was increased from 0 to 8 kW in 1000 steps to determine the characteristic curve of the reactive power output. The simulation was carried out for the nodes N1, N5, and N10. As a result of these simulation runs, Figure 6 shows the reactive power as a function of the active power. These characteristic curves can be interpreted as the reactive power curves learned by the DRL agent. They agree closely with the corresponding ideal characteristic curves calculated in Section 3, which are also shown in Figure 6.
The figure shows that the proposed algorithm has learned the individual reactive power demand in a static grid environment in close agreement with the theoretical benchmark. All learned curves show only slight deviations from the corresponding optimal curves. Minor deviations occur, especially in the range of very small and very large active power. This observation may be due to the fact that the training was carried out with real PV data instead of uniformly distributed data; thus, some values are not represented in the training data. Despite the limited training data, the proposed algorithm was able to learn the reactive power demand for a large feed-in spectrum.

To visualize the training process, the moving average reward is shown in Figure 7. The moving average reward was calculated over 1000 successive learning steps at the three different nodes, N1, N5, and N10. For all nodes, the proposed DRL algorithm shows an improvement of the reward with the number of training steps. This verifies the suitability of the presented reward function. After 60,000 steps, the moving average reward increases only very slowly, indicating a stable training result.
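The moving average of the reward can be computed, for instance, with a simple convolution; the window length of 1000 steps is from the text, the rest is a generic sketch.

```python
import numpy as np

def moving_average(rewards, window=1000):
    """Moving average of the reward over `window` successive learning steps."""
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")
```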
To investigate the effect of the DRL algorithm on adjacent nodes while applying the self-learning volt-var control at node N10 (red circle), the voltage at each bus was evaluated. The resulting voltage topology is presented in Figure 8. The x- and y-axes indicate the distance to the transformer, whereas the fill color of each bus indicates its voltage. With the proposed DRL VVC algorithm, the voltage at the transformer is about 0.992 pu and rises to 1.002 pu at distant nodes. It is noticeable that the voltage deviations in the network are significantly smaller than in Figure 4. For Q = 0, the voltage at the transformer was 1 pu and rose with the distance up to 1.02 pu due to the active power injections. Thus, the voltage difference when applying the proposed VVC is only half as large as in the reference with zero reactive power (Figure 4). Furthermore, with VVC, the voltage at the grid connection point of the smart inverter can be controlled within a 0.2% interval of the nominal voltage. This value corresponds exactly to the tolerance range defined in the reward function (a = 0.002 pu).
In addition to the voltage topology, the voltage at the nodes along the line from the transformer to the inverter was calculated. For the nodes N1, N5, and N10, the voltage rise along the line length l is shown in Figure 9 with and without self-learning volt-var control. The volt-var control was always applied at the node under consideration; all other nodes had no VVC. In order to regulate the transformer voltage to 1 ± 0.002 pu, as well, and thereby further reduce the voltage differences along the line, a combination of various smart inverters at different nodes could be investigated in future studies. Nevertheless, with the proposed DRL VVC algorithm the voltage differences along the line were reduced significantly by up to 50% in this case study.

Dynamic Grid Behavior
After the successful application of the proposed DRL VVC algorithm in static grid environments, the performance in a dynamic grid environment is investigated in this subsection. For this purpose, the load and PV values were varied in time according to their profiles from Table 2. In all subsequent simulations, the time t = 0 indicates 21 June, 00:00:00 of the data set. As before, the data for phase A was used for all three phases. The active power feed-in changed every 50 s according to its PV profile. The test feeder was simulated over a period of 120 h (5 days), and the learning progress of the DRL agent was observed.
The self-learning volt-var control was located at node N10 as an online-learning algorithm without any additional knowledge or previous training. After only 3 simulated days of online training, the agent was able to balance the voltage at its connection point within the given interval; thus, the voltage rise at node N10 was eliminated. During the DRL training process, the voltage deviations never exceeded those of the reference without VVC. However, an inherent problem was observed: every time the active power feed-in changes, a voltage peak occurs because the algorithm lags by one step due to the measurement duration. Thus, there is a time lag between the voltage change and the response of the algorithm in the form of reactive power injection. Despite these voltage peaks, a significant voltage reduction was achieved with the proposed DRL agent compared to the voltage curve without reactive power injection. Figure 10 shows a section of the voltage behavior with and without VVC at node N10. Ignoring the voltage peaks every 50 s, with DRL VVC a systematic voltage reduction of up to 0.06 pu can be observed at node N10. Under application of the volt-var control, the voltage at N10 ranges around 1 pu, with small deviations caused by the ongoing exploration of the DRL agent. These results verify the performance of the proposed DRL volt-var control algorithm and demonstrate its potential for future application in smart inverters.

Conclusions
This paper proposes a novel self-learning volt-var control algorithm based on deep reinforcement learning. The algorithm is an online-learning DDPG that can be applied under real-time conditions in smart inverters for reactive power management. In contrast to other machine learning-based volt-var control methods, no additional communication devices are needed. The only input data for the proposed algorithm are the measured voltage values at the grid connection point of the inverter.
The proposed DRL volt-var control was successfully tested in simulations at different nodes in a 21-bus low-voltage distribution grid. A significant voltage reduction was shown both in a static grid environment and a dynamic environment, and the proposed DRL algorithm was able to keep the voltage within the desired range of 1 ± 0.002 pu. Furthermore, in the static case, the voltage differences along the lines were reduced by up to 50% using DRL VVC.
In this study, the aim was to control the voltage to 1 pu within a tolerance range of 0.2%. However, by adjusting the reward function, other intervals and use cases are also possible, such as wider tolerances or the most economical use of reactive power.
Thus, the proposed volt-var control algorithm is promising for application in real inverters and can make a significant contribution to decentralized, automated voltage control in future grids with high DG-penetration.