Optimal Dispatch of PV Inverters in Unbalanced Distribution Systems using Reinforcement Learning

In this paper, a Reinforcement Learning (RL)-based approach to optimally dispatch PV inverters in unbalanced distribution systems is presented. The proposed approach exploits a decentralized architecture in which PV inverters are operated by agents that perform all computational processes locally, while communicating with a central agent to guarantee voltage magnitude regulation within the distribution system. The dispatch problem of PV inverters is modeled as a Markov Decision Process (MDP), enabling the use of RL algorithms. A rolling horizon strategy is used to avoid the computational burden usually associated with continuous state and action spaces, coupled with a computationally efficient learning algorithm to model the action-value function for each PV inverter. The effectiveness of the proposed decentralized RL approach is compared with the optimal solution provided by a centralized nonlinear programming (NLP) formulation. Results showed that within several executions, the proposed approach converges either to the optimal solution or to solutions with a PV curtailment excess of less than 2.5% while still enforcing voltage magnitude regulation.


Introduction
According to the International Energy Agency, a total of 107 GW was added to the global solar PV capacity in 2020 [1]. From this new PV capacity, approximately 36% comes from residential, commercial, and industrial projects, usually located at low voltage (LV) and medium voltage (MV) distribution networks [2]. Due to these constantly increasing levels of PV generation, Distribution System Operators (DSOs) are facing several technical and operational challenges, including overvoltage issues, an increase in the frequency of tap changes in distribution transformers as well as in power losses, and violation of the thermal limits of lines, among others [3,4].
Various strategies can be found in the literature to cope with the technical issues caused on distribution networks by high PV penetration. These strategies can be grouped into coordinated and locally implemented strategies. Locally implemented strategies are easy to implement and do not require any type of communication infrastructure. Among these strategies, one can find those based on droop control, such as in [5] and [6]. In these droop-based control strategies, the PV inverters regulate their active and reactive power injection as a function of the voltage magnitude at their point of connection with the distribution system [7].
Despite their effectiveness in solving overvoltage issues, as curtailment decisions are made based only on local information, a larger amount of active power is curtailed, especially when compared with coordinated strategies that consider the operation of the whole distribution network. Moreover, they can be seen as unfair, as PV inverters located at the end of the feeders curtail more than those located closer to the distribution transformer [8]. This issue can be solved, for instance, if the operation of the droop control is coordinated among all PV inverters, as shown in [9].
In contrast to locally implemented strategies, coordinated strategies can ensure minimum PV power curtailment, but they require the deployment of either a centralized (e.g., [10]) or a distributed (e.g., [11,12]) communication infrastructure. The dispatch of all PV inverters within the distribution system can be formulated as a nonlinear optimization problem to ensure minimum PV power curtailment, such as in [10] and [13].
Although optimality can be guaranteed through convexification procedures, these centralized approaches show poor scalability. To overcome this issue, works such as [12] and [14] have developed distributed strategies in which all the information required to perform coordination is shared either with a centralized operator or between closely located PV inverters. Nevertheless, due to their distributed nature, an online iterative procedure must be executed until a convergence criterion is reached. If such a criterion is not met, optimality and feasibility cannot be guaranteed.
Recently, coordinated methods based on reinforcement learning (RL) have drawn much attention for their capacity to learn from historical data and/or from continuous interaction with an environment [15]. If properly designed, RL-based approaches offer multiple advantages when compared with other optimization-based methods: distributed implementation is straightforward; they can be used in real-time (they are usually trained offline); and they do not require an accurate physical model, since they can be updated after interacting with the environment [16], among others. An updated review on the application of RL approaches to energy systems problems can be found in [17]. Regarding the dispatch problem of PV inverters, in [18], a centralized deep RL algorithm is implemented. Results showed that, once trained, the developed deep RL approach can successfully mitigate overvoltage issues with lower PV power curtailment when compared with a droop-based strategy. A similar centralized deep RL strategy is presented in [19]. An RL-based strategy based on a multi-agent approach is presented in [20] and [21] to enable distributed implementation. In these works, deep neural networks are also used to model the value function within the RL strategy. Nevertheless, although deep neural networks have shown promising results in several RL application areas [22,23], as these are nonlinear parametric models, their convergence within RL frameworks is not guaranteed, complicating their implementation [24]. Moreover, a general procedure to optimally define some intrinsic parameters (e.g., number of layers, number of units, types of activation functions) is not yet available. To overcome this issue, the value and/or action-value function can be approximated using linear models [25]. In this sense, the main advantage of linear parametric models is that their convergence is theoretically guaranteed as long as enough exploration is ensured [26].
Based on the aforementioned, an RL-based approach to optimally dispatch PV inverters in unbalanced distribution systems is presented in this paper. The proposed approach exploits a decentralized architecture in which PV inverters are operated by agents that perform all computational processes locally, while communicating with a central agent to guarantee voltage magnitude regulation within the distribution system.
Here, the PV inverters dispatch problem is modeled as a Markov Decision Process (MDP), enabling the use of RL algorithms. Within the proposed RL model, a computationally efficient learning algorithm is used to model the action-value function. The effectiveness of the proposed decentralized RL approach is compared with the optimal solution provided by a centralized nonlinear programming (NLP) formulation. The main contribution of this paper is as follows: • A decentralized RL approach able to optimally dispatch PV inverters in an unbalanced distribution system considering voltage magnitude constraints is presented. The proposed RL approach uses a customized reward function and state definition that enable it to reach the centralized optimal solution, while still enabling all computational processes to be performed locally (also known as on-device machine learning).
The remainder of this paper is structured as follows: Sec. 2 presents a centralized NLP formulation for the optimal dispatch problem of PV inverters. Later, Sec. 3 introduces Markov Decision Processes (MDPs) and RL. Sec. 4 presents the optimal dispatch problem of PV inverters as an MDP and the proposed RL approach, while Sec. 5 presents the simulation results used to validate the proposed approach. Finally, conclusions are drawn in Sec. 6.

Optimal Dispatch of PV Inverters
The optimal dispatch problem of PV inverters in unbalanced distribution networks can be modeled using the NLP formulation given by (1)-(13). The objective function in (1) aims at minimizing the total PV generation curtailment over the time horizon T.
Constraints (4) and (5) model the real and imaginary voltage drops in lines, respectively. The active and reactive power consumption is modeled using (6) and (7), respectively, while the active and reactive PV power generation is modeled using (8) and (9), respectively. Notice in (9) that it is assumed that the PV inverter operates with unity power factor. Constraint (10) models the active power output of the PV inverters as a function of the PV power curtailment percentage (∆P^PV_{m,t}). Finally, constraints (11) and (12) enforce the voltage magnitude limits and the thermal limits of lines, respectively, while (13) defines the limits for the PV power curtailment percentage. Notice that a centralized approach must be used to solve the above-presented NLP formulation. A central operator gathers the operational data (e.g., nominal capacity, long-term expected PV generation, etc.) to define the dispatch decisions for the PV inverters while enforcing voltage magnitude limits. Simultaneously, the central operator defines the total amount of curtailed power.
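To illustrate the interplay between objective (1) and the voltage limit (11), the following toy sketch assumes a diagonal voltage sensitivity (curtailing ΔP of inverter m lowers only its own voltage by s_m · ΔP · P^PV_m), under which the minimal per-node curtailment has a closed form. All numbers and the sensitivity model are illustrative assumptions; the actual formulation uses the full unbalanced power-flow constraints (4)-(9).

```python
import numpy as np

# Toy sketch of objective (1) with voltage limit (11) under an assumed
# diagonal voltage/curtailment sensitivity; all numbers are synthetic.
P_pv  = np.array([1.5, 1.8, 2.0])     # available PV power (MW), assumed
V0    = np.array([1.08, 1.09, 1.11])  # voltages with no curtailment (p.u.), assumed
s     = np.full(3, 0.01)              # p.u. voltage drop per MW curtailed, assumed
V_MAX = 1.10

# Minimal curtailment fraction per node that restores V_m <= V_MAX
dp = np.clip((V0 - V_MAX) / (s * P_pv), 0.0, 1.0)
total_curtailed = float(dp @ P_pv)    # objective (1), in MW
```

In this toy case only the inverter whose uncurtailed voltage exceeds the limit curtails; with cross-sensitivities between nodes (the realistic case) the nodal decisions couple, which is precisely why the centralized NLP, or the coordinated RL approach of this paper, is needed.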

Markov Decision Process and Reinforcement Learning
In this section, some background on Markov Decision Processes (MDPs) and the Reinforcement Learning (RL) algorithm used is provided.

Markov Decision Process (MDP)
In general, an MDP can be described by the 5-tuple (S, A, P, R, γ), where S is a finite set of states s ∈ S (also known as the state space); A is a finite set of actions a ∈ A (also known as the action space); P is a Markovian transition model that states the probability of transitioning from one state to another after taking an action; R : S × A × S → R is a reward function that maps each s, s′ ∈ S and a ∈ A to a reward r = R(s, a, s′), obtained when the system transitions from state s to state s′ after implementing action a; and γ ∈ [0, 1) is a discount factor. From now on, we will refer to the 4-tuple (s, a, s′, r) as a transition.
Let S_t and A_t denote the state and action at time t, respectively, and R_t the reward received after taking action A_t in state S_t. Let P denote the probability operator; then P_t(s′|s, a) is the probability of transitioning from state s to state s′ after taking action a at time t. Thus, one can estimate the expected reward of a state-action pair (s, a) as

R(s, a) = E[ R_t | S_t = s, A_t = a ],   (14)

where E[·] denotes the expectation operator. The total discounted reward from time t until the system reaches a terminal state at time T, denoted by G_t and also known as the return, can be defined as

G_t = Σ_{k=0}^{T−t−1} γ^k R_{t+k+1}.   (15)

Let us define a deterministic policy π that maps from S to A, such that a = π(s), s ∈ S, a ∈ A. Then, one can define an action-value function Q^π(s, a) under policy π as

Q^π(s, a) = E[ G_t | S_t = s, A_t = a, π ],   (16)

where Q^π(s, a) estimates the expected return when taking action a in state s and following policy π thereafter. In this sense, the action-value function Q^π(s, a) estimates the quality of the state-action pair (s, a) for a given policy π. If the optimal action-value function Q*(s, a) is known, an optimal policy can be derived as π*(s) = arg max_{a∈A} Q*(s, a). It then follows from (15) and (16) that Q*(s, a) satisfies the Bellman optimality equation (see [24]),

Q*(s, a) = E[ R_t + γ max_{a′∈A} Q*(S_{t+1}, a′) | S_t = s, A_t = a ].   (17)

In this case, if P is known and both the state and action spaces are finite, the action-value function can be represented exactly in tabular form for all pairs (s, a) ∈ S × A by solving (17) recursively. If P is not known, the action-value function can be approximated from a batch of transition samples obtained by directly interacting with the system, using RL algorithms such as Q-Learning [28].
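The recursive solution of the Bellman optimality equation (17) for a known P can be sketched as value iteration on Q; the tiny MDP below is synthetic and purely illustrative, not from the paper.

```python
import numpy as np

# Toy illustration: solving the Bellman optimality equation (17) by
# value iteration on a small MDP with known transition model P and
# reward R(s, a); all numbers are synthetic.
n_states, n_actions, gamma = 3, 2, 0.9

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.standard_normal((n_states, n_actions))                    # R(s, a)

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    # Bellman backup: Q(s,a) <- R(s,a) + γ Σ_s' P(s'|s,a) max_a' Q(s',a')
    Q_new = R + gamma * P @ Q.max(axis=1)
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new

pi_star = Q.argmax(axis=1)  # greedy (optimal) policy derived from Q*
```

Because the backup is a γ-contraction, the iteration converges to the unique fixed point of (17) regardless of the initialization.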

Reinforcement Learning and Action-Value Function Approximation
All RL algorithms follow a similar step-by-step procedure, i.e., for a state s ∈ S, take an action a ∈ A either randomly or using Q(s, a); observe a new state s′ ∈ S and a reward r; update the action-value function Q(s, a); and repeat until convergence. If the state space S and the action space A are finite, conventional Q-Learning can be used, in which the state-action spaces are discretized (see e.g., [29]). However, this procedure may suffer from the curse of dimensionality, depending on the size of the discretization step. In practical applications, when the state space S is large or continuous, the action-value function Q(s, a) can be approximated by a parametric function, such as a linear model [30] or a neural network [31], or by a non-parametric function, such as a decision tree [32]. If a linear function approximation is used, Q(s, a) can be represented as

Q(s, a) ≈ φ(s, a)⊤ ω,   (18)

where φ(·) : S × A → R^f is a feature function for (s, a), also referred to as a basis function, and ω ∈ R^f is a parameter vector.
One of the most data-efficient algorithms available in the literature to estimate parameters ω, and thus approximate the action-value function Q(s, a), is the Least Squares Policy Iteration (LSPI) algorithm [26]. To be executed, the LSPI algorithm requires a collection of transition samples D = {(s, a, s′, r) : s, s′ ∈ S, a ∈ A} to iteratively estimate ω. To better understand the intuition behind the LSPI algorithm, define an error estimation function J(ω) as

J(ω_k) = Σ_{(s,a,s′,r)∈D} ( Q(s, a) − φ(s, a)⊤ ω_k )²,   (19)

where ω_k corresponds to the approximation of ω at iteration k. Notice that Q(s, a) is not known and can be replaced by the temporal-difference (TD) target r + γ ω_k⊤ φ(s′, a′), where a′ = arg max_{a∈A} ω_k⊤ φ(s′, a) is the optimal action taken in state s′ based on the current approximation of parameters ω_k. Thus, J(ω_k) can be expressed as

J(ω_k) = Σ_{(s,a,s′,r)∈D} ( r + γ ω_k⊤ φ(s′, a′) − φ(s, a)⊤ ω_k )².   (20)

Therefore, at iteration k + 1, ω_{k+1} can be approximated by solving the following unconstrained optimization problem:

ω_{k+1} = arg min_ω Σ_{(s,a,s′,r)∈D} ( r + γ ω_k⊤ φ(s′, a′) − φ(s, a)⊤ ω )².   (21)

As can be seen from (21), at each iteration, the LSPI algorithm finds the parameters ω that minimize the mean squared error between the TD target and Q(s, a) over all transition samples available in D. This process is repeated until a convergence criterion ||ω_{k+1} − ω_k||₂ < ε is met, where ||·||₂ denotes the L2 norm and ε is a small number.
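As a concrete (toy) illustration of the iteration in (21), the sketch below runs the least-squares TD update on a small synthetic chain MDP with one-hot (tabular) features. The MDP, feature map, and numbers are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy sketch of the LSPI-style update in (21): repeatedly fit ω by least
# squares against TD targets r + γ ω⊤φ(s', a'). The 5-state chain MDP
# and one-hot features below are illustrative assumptions.
n_s, n_a, gamma = 5, 2, 0.9
f = n_s * n_a

def phi(s, a):
    v = np.zeros(f)
    v[s * n_a + a] = 1.0          # one-hot feature for the pair (s, a)
    return v

def step(s, a):
    # a=1 moves right along the chain; reaching the last state pays 1
    s2 = min(s + 1, n_s - 1) if a == 1 else s
    return s2, (1.0 if s2 == n_s - 1 else 0.0)

# Batch D of transitions (s, a, s', r) covering all state-action pairs
D = [(s, a, *step(s, a)) for s in range(n_s) for a in range(n_a)] * 5

omega = np.zeros(f)
Phi = np.array([phi(s, a) for s, a, s2, r in D])
for k in range(500):
    # TD targets with greedy a' under the current parameters ω_k
    y = np.array([r + gamma * max(phi(s2, b) @ omega for b in range(n_a))
                  for s, a, s2, r in D])
    omega_new, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    if np.linalg.norm(omega_new - omega) < 1e-10:  # convergence criterion
        break
    omega = omega_new

greedy = [int(np.argmax([phi(s, b) @ omega for b in range(n_a)]))
          for s in range(n_s)]
```

With tabular features the least-squares fit reduces to averaging the TD targets per state-action pair, so the iteration recovers the optimal Q-values of the chain and the greedy policy moves right everywhere before the terminal node.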
The LSPI algorithm has multiple advantages. First, linear functions are used to approximate the action-value function Q(s, a), which allows the algorithm to handle MDPs with large and continuous state spaces S and guarantees learning convergence. Second, at each iteration, the whole available batch of transition samples is used to approximate ω, thus increasing data efficiency. Third, different from the classic Q-Learning algorithm, there is no need to define a learning rate; thus, fewer hyper-parameters need to be tuned.
Interested readers are referred to [26] for more details on convergence and performance guarantee.

PV Inverters Dispatch Problem as an MDP
The PV inverters dispatch problem is modelled as an MDP as in [19]. Let P^PV_{m,t} represent the PV generation of the PV inverter connected at node m at time t, with voltage magnitude V_{m,φ,t}, and P^PV_{m,t}(1 − ∆P^PV_{m,t}) the PV generation after curtailing ∆P^PV_{m,t}; notice that these two quantities cannot be observed at the same time step t. There is a time delay between applying the PV curtailment action and the distribution system reaching a new steady state, in which PV inverter m perceives V_{m,φ,t+1}. In other words, V_{m,φ,t+1} is the result of the distribution system reaching a steady state considering the current PV generation power of PV inverter m as P^PV_{m,t}(1 − ∆P^PV_{m,t}), instead of P^PV_{m,t+1}, as shown in Fig. 1. The described modelling representation is the basis for the definition of the transition model, later explained in Sec. 4.4.

Figure 1: Transition representation used to model the PV inverters dispatch problem as an MDP as in [19]. At time t, V_{m,φ,t} is observed; at time t + 1, P^PV_{m,t+1} is observed, and V_{m,φ,t+1} is the result of the distribution system reaching steady state considering the current PV generation power of PV inverter m as P^PV_{m,t}(1 − ∆P^PV_{m,t}).
An agent-based architecture is developed to facilitate the implementation of the proposed RL approach as in Fig. 2. Two types of agents are considered: PV Agents and a centralized Distribution System (DS) Agent.
PV Agents are in charge of controlling the PV inverters, while the DS Agent is in charge of supervising the distribution network's operation, enforcing voltage magnitude constraints. Regarding these agents, the following assumptions are assumed to hold: 1. The DS Agent is aware of the topology of the distribution network and can execute a power flow algorithm considering the curtailment actions proposed by each PV Agent. After executing the power flow algorithm, the DS Agent shares with each PV Agent their expected voltage magnitude.
2. The PV Agents have enough computational resources to execute all the required processes of the LSPI algorithm locally, as explained in Sec. 4.5. Also, such agents only communicate and share data with the DS Agent and not between themselves, ensuring privacy. The shared data are limited to the proposed curtailment action and their PV power forecast.
The remaining definitions for the proposed RL approach, regarding state space, action space, reward function, and transition models, are presented next.

State Space
For the PV Agent m, connected to node m ∈ N of the distribution system, define at time t the state s_{m,t} = (P^PV_{m,t}, V_{m,t}), s_{m,t} ∈ S, formed by the tuple of the expected PV active power generation and the maximum voltage magnitude over all phases at the point of connection, V_{m,t}.

Figure 2: RL approach using an agent-based architecture. Two types of agents can be found: a DS Agent and the PV Agents. PV Agents share limited information with the DS Agent, while the update of Q(s, a) is done locally and in parallel.

Action Space
For the PV Agent m, actions are defined as a discrete PV power curtailment percentage of the expected PV power generation at time step t, i.e., a_{m,t} = ∆P^PV_{m,t}. Thus, the action space is defined as A = {0, ∆, 2∆, ..., 1.0}, where ∆ defines the discretization step used.

Reward Function
As discussed in Sec. 2, the centralized objective of the optimal dispatch of PV inverters problem is to solve local voltage issues while minimizing the total amount of PV active power curtailed. The centralized objective function in (1) can be translated into a local reward function for each PV Agent m, R_{m,t}(·), as follows:

R_{m,t}(s_{m,t}, a_{m,t}, s′_{m,t}) = −δ_A ∆P^PV_{m,t} − δ_V max{0, V_{m,t} − V̄, V̲ − V_{m,t}},   (22)

where δ_A and δ_V are positive penalty parameters, and V̄ and V̲ are the maximum and minimum voltage magnitude limits. In expression (22), the first term corresponds to a penalty proportional to the action taken: PV Agents need to choose lower-valued actions ∆P^PV_{m,t}, thus reducing the total amount of PV active power curtailed. The second term penalizes actions that result in voltage magnitude violations, and is depicted in Fig. 3 as a function of the voltage magnitude. Notice that if the maximum voltage magnitude over all phases of the PV Agent m, i.e., V_{m,t}, is below V̄ and above V̲, the penalty term is equal to zero; otherwise, the penalty increases linearly with slope δ_V.
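The reward structure described above can be sketched as follows; the penalty parameter values and voltage limits are assumed for illustration.

```python
# Sketch of the reward in (22): a curtailment penalty plus a voltage
# violation penalty. The values of δ_A, δ_V and the limits are assumed.
DELTA_A, DELTA_V = 1.0, 100.0
V_MIN, V_MAX = 0.90, 1.10

def reward(dp_pv, v_max_phase):
    """dp_pv: curtailment fraction in [0, 1];
    v_max_phase: max voltage magnitude over phases (p.u.)."""
    violation = max(0.0, v_max_phase - V_MAX, V_MIN - v_max_phase)
    return -DELTA_A * dp_pv - DELTA_V * violation
```

The maximum attainable reward is zero (no curtailment, no violation), which is consistent with the learning curves discussed in Sec. 5.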

Transition Model
Once the PV Agents share the proposed PV curtailment percentages (actions) with the DS Agent, following the MDP modelling explained in Sec. 4 and the state definition provided in Sec. 4.1, the DS Agent solves a nonlinear power flow to estimate V_{m,φ,t+1} for all PV Agents m. This information is used by the PV Agents to define the values of the state transition, from s_{m,t} = (P^PV_{m,t}, V_{m,t}) to s′_{m,t} = (P^PV_{m,t+1}, V_{m,t+1}), knowing that V_{m,t+1} is the result of implementing action a_{m,t} = ∆P^PV_{m,t} in the current state. The definitions of states s_{m,t} and s′_{m,t} are needed in order to update the approximation of the action-value function Q̂_m(s, a).

Action-Value Function Approximation
The algorithm used to approximate (learn) the action-value function Q̂_m(s, a) is based on the LSPI algorithm presented in Sec. 3.2. Although LSPI is a data-efficient algorithm, it may become computationally intractable as the action space A grows. To overcome this issue, instead of approximating a general Q(s, a), a separate approximation is defined for each action a ∈ A and each time step t ∈ T.
Thus, each PV Agent m learns an approximated optimal action-value function of the form Q̂_m(s, a^(l)) = φ^(l)(s, a^(l))⊤ ω^(l)_m, where a^(l) is the l-th component of the action space A, ω^(l)_m are the parameters associated with action a^(l)_{m,t}, and φ^(l)(·, ·) is a vector of basis functions. In this paper, radial basis functions (RBFs) of the form e^(−(x−x_c)²/σ²) are used, where x is a generic variable of the state s, x_c is a generic (and constant) center related to the generic variable x, and σ is the standard deviation of the RBFs, forming the feature vector φ^(l)(s_{m,t}, a^(l)_{m,t}) of dimension κ + 2, where κ corresponds to a positive number that indicates the total number of RBFs used.
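A minimal sketch of such an RBF feature block is given below. The centers, widths, and exact layout of the (κ + 2)-dimensional vector are assumptions for illustration (here, a bias entry plus Gaussian bumps over each state variable), not the paper's exact configuration.

```python
import numpy as np

def rbf(x, centers, sigma):
    """Gaussian RBFs e^{-(x - x_c)^2 / σ^2} evaluated at scalar x."""
    return np.exp(-((x - np.asarray(centers)) ** 2) / sigma ** 2)

def phi_l(p_pv, v, p_centers, v_centers, sigma_p=0.5, sigma_v=0.05):
    # One feature block per action a^(l): a bias entry followed by RBFs
    # over both state variables (expected PV power and voltage magnitude).
    return np.concatenate(([1.0],
                           rbf(p_pv, p_centers, sigma_p),
                           rbf(v, v_centers, sigma_v)))

# Example with assumed centers, in p.u.
p_centers = np.linspace(0.0, 2.0, 3)
v_centers = np.linspace(0.90, 1.10, 3)
features = phi_l(1.2, 1.02, p_centers, v_centers)   # length 1 + 3 + 3 = 7
```

Each RBF responds most strongly when the state variable is near its center, giving the linear model in (18) a smooth, localized representation of the continuous state.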
Based on this definition, the LSPI algorithm shown in Algorithm 1 is used to learn the parameters ω_m ∈ R^f that approximate Q̂_m(s, a). Notice that Algorithm 1 requires as input a collection of transition samples D_m. The procedure used by each PV Agent m to obtain these collections of samples is explained next.

Overview of the Proposed RL Approach
The proposed RL approach builds on the agent-based architecture shown in Fig. 2 and follows the step-by-step procedure presented in Algorithm 2, implemented over a rolling time horizon strategy. Initially, the time horizon is divided into larger time steps t_h ∈ T_h; e.g., T_h can be a set of times in hours, while T is a granular partition of each hour. In other words, if the duration between two time steps t ∈ T is 15 min, i.e., ∆t = 0.25 h, then set T will have a total of four partitions, i.e., T = {t_h, t_h + ∆t, t_h + 2∆t, t_h + 3∆t}.
By doing this, the RL algorithm is executed at each t_h ∈ T_h, while decisions are taken for the next time steps T = {t_h, t_h + ∆t, ..., t_h + (n − 1)∆t}, where n is the number of partitions. This approach reduces the need for a long-term PV generation forecast by the PV Agents, as well as the need for advanced computational infrastructure to execute Algorithm 1. Additionally, limiting the LSPI algorithm to take decisions only for a few future time steps allows the proposed RL approach to adapt to changes in the system dynamics.
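The rolling-horizon partition described above can be sketched as:

```python
# Sketch of the rolling-horizon partition from Sec. 4.5: each hourly
# step t_h is split into n granular steps of Δt = 0.25 h (15 min).
dt, n = 0.25, 4

def partition(t_h):
    """Granular time steps T = {t_h, t_h + Δt, ..., t_h + (n-1)Δt}."""
    return [round(t_h + k * dt, 6) for k in range(n)]

T_h = range(24)          # hourly horizon
T_12 = partition(12)     # the four 15-min steps within hour 12
```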
According to Algorithm 2, the following procedure is executed to learn parameters ω_{m,t_h}, which are used to take the optimal actions a*_{m,t} for t ∈ T as a*_{m,t} = arg max_{a∈A} φ(s_{m,t}, a)⊤ ω_{m,t_h}. In t_h, for each time step t ∈ T, each PV Agent m either chooses a random action from set A or the best action obtained using the current estimation of ω_{m,t_h} (also known as ε-greedy action selection). In a real implementation, this procedure is done in parallel by all PV Agents m, each using only its local computational infrastructure, as shown in Fig. 2, improving its current approximation of ω_{m,t_h} by executing Algorithm 1. This procedure is repeated until a maximum number of iterations J is reached.
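The ε-greedy selection can be sketched as follows. The exact decay expression ε_j is clipped in the source; a common exponential schedule with a 0.05 floor (so that ε_j never becomes zero, consistent with Sec. 5) is assumed here.

```python
import random

# Sketch of the ε-greedy selection in Algorithm 2. The decay schedule
# ε_j = max(0.05, ε_0 · η^j) is an assumption; the paper's exact
# expression is truncated in the source.
def epsilon(j, eps0=1.0, eta=0.99, eps_min=0.05):
    return max(eps_min, eps0 * eta ** j)

def select_action(j, actions, q_values, rng=random):
    if rng.random() < epsilon(j):
        return rng.choice(actions)                      # explore
    best = max(range(len(actions)), key=lambda i: q_values[i])
    return actions[best]                                # exploit
```

As the iteration counter j grows, ε_j decays towards its floor, so random exploration becomes rare while never disappearing entirely.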

Results and Discussion
In this section, simulation results are presented. Comparisons with the optimal solution of the centralized PV dispatch formulation in Sec. 2 are also presented.

Simulation Setup
The proposed RL approach has been implemented in Python and executed on a notebook with an Intel Core i7 processor and 16 GB of RAM. The unbalanced 25-bus system shown in Fig. 4 is used; load consumption per node, as well as resistance and reactance data, can be found in [33]. The load level per time step is shown in Fig. 5. In total, three PV Agents are considered, located at nodes m = 13, 17, 25, with nominal capacities of 1500 kW, 1800 kW, and 2000 kW, respectively. The irradiance profile used for simulations is shown in Fig. 5. All PV inverters are assumed to operate with unity power factor. The nominal voltage of the distribution system in Fig. 4 is 4.16 kV, with the substation voltage set to 1.03 p.u. to avoid undervoltage problems during peak consumption, while the minimum and maximum voltage magnitude limits have been defined as 0.90 p.u. and 1.10 p.u., respectively. The base power used for the power flow formulation and the state representations is 1000 kVA.

Algorithm 2: RL-based approach used to define the optimal dispatch power of the PV Agents. Input: J: maximum number of iterations; ε_0: parameter to control exploration; η: decay rate to control exploration. Output: ω_{m,t_h}: updated parameter vectors for each t_h ∈ T_h; a*_{m,t}: optimal actions to implement for each t ∈ T. Initialize j = 0, ω_{m,t_h} = 0, ∀m ∈ N, t_h ∈ T_h. For each t_h ∈ T_h, approximate the action-value function: while j < J, update ε_j = min{0.05, …}; then define the optimal actions: for each t ∈ T and each m ∈ N, a*_{m,t} = arg max_{a∈A} φ(s_{m,t}, a)⊤ ω_{m,t_h}.

Validation and Comparison
Fig. 6 shows the learning results obtained after executing Algorithm 2 for hour t = 40 (i.e., 12:00). As each hour is discretized into shorter time steps of ∆t = 0.25 h, Algorithm 2 provides the optimal actions ∆P^PV_{m,t} for time steps t = 12:00, 12:15, 12:30, and 12:45. Fig. 6(a) presents the typical learning curve obtained when training RL algorithms, in which it is possible to observe how the reward improves over the learning process.
As described in Algorithm 2, at the beginning of the learning process, as the actions proposed by the PV Agents are defined randomly, lower (negative) reward values are obtained. Nevertheless, as more and more samples are added by the PV Agents to the set D_m, the estimation of the parameters ω_{m,t_h} improves, leading the PV Agents to propose decisions that receive higher reward values.
Due to the randomness associated with the exploration of the state-action space during the learning process, different estimations of parameters ω_{m,t_h} can be obtained after convergence in different executions. As a consequence, different optimal actions can also be defined. To assess this, Fig. 6(b)-(d) shows the mean and the standard deviation of the rewards obtained by each of the PV Agents over five different executions.
As expected, higher standard deviations are observed at the beginning of the learning process due to exploration. Nevertheless, as the learning process continues, convergence to higher-reward actions is attained. In this case, observe that even at the end of the learning process the mean does not converge to a single value (and thus the standard deviation is different from zero). This is because exploration never finishes, i.e., ε_j never becomes zero. This is done to continuously improve the estimation of parameters ω_{m,t_h}, allowing the PV Agents to keep discovering better solutions.
To assess the quality of the solutions obtained in different executions of the proposed RL approach, comparisons are presented in Table 1 for each PV Agent. In terms of actions ∆P^PV_{m,t}, all PV Agents were able to achieve the optimal solution in at least one execution. The optimal solutions provided in Table 1 were obtained by solving the centralized model presented in Sec. 2 using a continuous and a discrete NLP formulation. As expected, the optimal solutions obtained by the proposed RL approach enforce the voltage magnitude within the required limits. Notice in Table 1 that although the optimal solution is attained by the PV Agents in at least one execution, different voltage magnitude results are obtained when compared with the optimal solution provided by the NLP formulation. This is because, for instance, PV Agent 13 may have obtained the optimal solution while the remaining PV Agents converged to a quasi-optimal solution; as a result, different voltage magnitude profiles are observed. In terms of the worst solutions obtained, notice that they differ from the optimal solution in curtailing more PV power than needed to enforce voltage magnitude regulation. In this case, for the worst solutions, PV Agents 13, 17, and 25 curtail 1.23%, 3.65%, and 2.47% more than the optimal solution, respectively. Nevertheless, when compared with the centralized NLP formulation, the main advantage of the proposed RL approach lies in how good-quality solutions can be attained in a distributed fashion, performing computations locally at the PV Agents and sharing only limited information.

Computational Time Assessment
In order to implement the proposed RL approach, the total computational time required for the PV Agents to achieve convergence must fit within the time step discretization of the rolling horizon strategy used, which in this case is 1 hour (∆t_h = 1 h). To assess this, the wall-clock time of the proposed RL approach was measured, resulting in an average time per iteration (of Algorithm 2) lower than 2 s, and an average total time lower than 32 min (all PV Agents perform computations in parallel). As these average results are well below the time step discretization of 1 hour, the proposed RL approach can be implemented to operate in real-time. Notice that the most computationally expensive operation within the proposed approach corresponds to the last step of Algorithm 1, in which the inverse of matrix B_{|D_m|} needs to be calculated to estimate parameters ω_{m,k}. The inversion of this matrix can be avoided by using the Sherman-Morrison formula, which allows the inverse to be estimated iteratively, as explained in [26].
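The Sherman-Morrison identity mentioned above can be sketched as follows; the matrix sizes and update vectors are synthetic, purely to illustrate the rank-one update.

```python
import numpy as np

# Sketch of the Sherman-Morrison identity:
# (B + u v^T)^{-1} = B^{-1} - (B^{-1} u)(v^T B^{-1}) / (1 + v^T B^{-1} u).
# Maintaining B^{-1} through rank-one updates avoids re-inverting the
# full matrix after every new sample; all numbers below are synthetic.
def sherman_morrison_update(B_inv, u, v):
    Bu = B_inv @ u
    vB = v @ B_inv
    return B_inv - np.outer(Bu, vB) / (1.0 + v @ Bu)

rng = np.random.default_rng(0)
f = 6
B = np.eye(f)                    # e.g. a regularized initialization c·I
B_inv = np.eye(f)
for _ in range(10):              # ten symmetric rank-one sample updates
    x = rng.standard_normal(f)
    B += np.outer(x, x)
    B_inv = sherman_morrison_update(B_inv, x, x)
```

Each update costs O(f²) instead of the O(f³) of a full inversion, which is what makes the incremental variant attractive for online operation.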

Full-Time Horizon Operation
To assess the effectiveness of the proposed RL approach under different irradiance and consumption conditions, continuous simulations were executed for a time horizon of 24 h considering the irradiance and load level consumption data shown in Fig. 5. The obtained results are discussed based on Fig. 7, which presents the rewards for all PV Agents over the learning process for the hours 6:00, 7:00, and 8:00. These hours are selected because the irradiance increases as time passes. In terms of actions, as the irradiance is relatively low at 6:00, curtailment actions are not required to enforce voltage magnitude limits. This can be seen in Fig. 7, as the maximum reward obtained by the PV Agents is zero, which necessarily implies that no PV curtailment is performed.
Nevertheless, the irradiance conditions change during the next hours; at 7:00 and 8:00, curtailment actions are required to enforce voltage magnitude limits. In these cases, as shown in Fig. 7, the proposed RL approach is able to converge to good-quality solutions when compared with the optimal reward (obtained with the centralized NLP formulation). In operational terms, the total PV curtailment for PV Agents 13, 17, and 25 was estimated to be 1.6%, 2.13%, and 0.76% higher than the centralized optimal solution, respectively, which validates the effectiveness of the proposed RL approach in obtaining good-quality actions during continuous operation. Notice that all the curtailment actions defined during continuous operation enforce all voltage magnitude constraints during the full-time horizon, as shown in Fig. 8. Finally, notice in Fig. 7 that, in the case of continuous operation, once the optimal curtailment actions are defined for time step t_h, the parameters ω_{m,t_{h+1}} for time step t_{h+1} are initialized as zero. This is done because the LSPI algorithm is biased towards the current system state; thus, if ω_{m,t_h} were used as an initial approximation for ω_{m,t_{h+1}}, exploration would be limited to the vicinity of the actions obtained after convergence in time step t_h, possibly even leading to infeasible solutions.

Conclusion
In this paper, a reinforcement learning (RL)-based approach to optimally dispatch PV inverters in distribution networks was presented. The proposed approach takes advantage of a decentralized architecture that enables all computational processes to be performed locally by the PV Agents. To avoid the computational burden usually associated with Markov Decision Processes (MDPs) with continuous state and action spaces, a rolling horizon strategy was used, together with a computationally efficient learning algorithm to model the action-value function. Results showed that in several executions, the proposed RL approach converged to the optimal solution and, in the worst case, converged to solutions with an excess of PV curtailment lower than 2.5%. In both cases, the solutions still enforce voltage magnitude limits.
Continuous operation of the proposed RL approach was also tested, obtaining similar results. Compared with other distributed optimization-based approaches, RL approaches offer the advantage of straightforward implementation while guaranteeing convergence to good-quality solutions. Moreover, RL approaches do not rely on strict convexity assumptions as long as linear parametric functions are used to model the action-value function.

Figure 8: Voltage magnitude during the full-time horizon using the proposed RL approach. Notice that all actions implemented guaranteed that voltage magnitude constraints are enforced, when compared with the voltage magnitude profile when no control is applied (grey lines).

Figure 3 :
Figure 3: Representation of the second term of the reward function in (22), related to the penalty due to voltage magnitude violation, as a function of the maximum voltage over the phases, V_{m,t}.
…where a^(l) is the l-th component of the action space A, ω^(l)_{m,t} are the parameters associated with action a^(l)_{m,t}, and φ^(l)(·, ·) is a vector of basis functions. In this paper, we propose to use radial basis functions (RBFs).

Algorithm 1:
LSPI algorithm used by PV Agent m to learn ω_m and approximate the action-value function Q̂_m(s, a). Input: D_m: transition samples for PV Agent m; φ(·, ·): RBF approximation; γ: discount factor; ε: small threshold value; c: small number. Output: ω_m: updated parameter vector to approximate Q̂_m(s, a) as φ(s, a)⊤ ω_m. Initialize ω_{m,−1} = 0_f and k = 0.

Note that φ^(l)(·, ·) : S × A → R^(κ+2), and that φ(s, a) : S × A × T → R^f, where f = (κ + 2) × |T| × |A|. Therefore, as the function approximation φ(s, a) is a collection of feature vectors of the form shown in (24), when constructing φ(s, a) for the pair (s_{m,t}, a^(l)_{m,t}), the remaining entries of φ(s, a), corresponding to all a ∈ A different from a^(l)_{m,t} and all time steps different from t ∈ T, are set to 0_{κ+2}.

Notice in Algorithm 2 that as the number of iterations increases, parameter ε_j decreases, reducing the chance of selecting random actions, thus allowing the balance between exploration and exploitation to be controlled. Once all PV Agents have individually proposed one action, they share this information with the DS Agent, which uses the transition model explained in Sec. 4.4. The output information from this step (see also Fig. 2) is shared with each PV Agent and used to estimate the reward r_{m,t} and construct the next state s′_{m,t}, as explained in Sec. 4.3 and Sec. 4.4. After all this, each PV Agent m updates its collection of samples D_m.

Figure 4 :
Figure 4: 25-node unbalanced distribution system used for testing, with PV Agents located at nodes m = 13, 17, and 25.

Figure 5 :
Figure 5: Irradiance in kW/m² vs time (in ∆t = 15 min time steps), and load level vs time. Note: all active and reactive power consumption for each node is multiplied by this load level factor, where a load level factor of 1.0 is equivalent to the data provided in [33].

Figure 6 :
Figure 6: Rewards for 1000 iterations for the PV Agents 13, 17, and 25 at 12:00. (a) Rewards for all PV Agents in one execution. (b), (c), and (d) Mean and standard deviation of rewards for PV Agents 13, 17, and 25, respectively, over five different executions.

Figure 7 :
Figure 7: Rewards for the PV Agents 13, 17, and 25 at 6:00 (from iteration 0 to 1000), 7:00 (from iteration 1000 to 2000), and 8:00 (from iteration 2000 to 3000), when executed in continuous operation for the full-time horizon of 24 h. The red dashed line represents the optimal reward obtained when using the centralized NLP formulation.
* Optimal solution solving the continuous NLP formulation in Sec. 2.