Multi-Stage Volt/VAR Support in Distribution Grids: Risk-Aware Scheduling With Real-Time Reinforcement Learning Control

The ever-increasing penetration of intermittent renewable resources in low-voltage power grids necessitates efficient operational strategies for voltage regulation as well as power scheduling of the available resources. In this paper, a risk-aware Volt/VAR support framework followed by a real-time reinforcement learning controller is presented for three-phase distribution systems. In the risk-aware stochastic scheduling stage, the legacy voltage regulating assets along with inverter-based photovoltaics (PVs) and energy storage system (ESS) are optimized considering day-ahead and intra-day markets. Moreover, demand response (DR) and voltage reduction plans are included in the proposed scheduling framework. By incorporating voltage-dependent load modeling in this study, the implementation of the voltage reduction plan reduces energy consumption in feeders by running the network at lower permissible voltage limits. The shiftable loads under the DR program are employed for peak shaving and to reduce operational costs. The result shows that DR implementation also reduces dependencies on the operation of traditional devices. The stochasticity of abrupt changes in PV generations is represented as the Gaussian Mixture Model (GMM), indicating a non-unimodal probability distribution in day-ahead PV forecasting errors. The scenario sets for uncertain variables are then reduced using a fuzzy clustering technique. Decisions made in the scheduling, associated with PV inverters and ESS operation, are revised with a real-time controller, i.e., Deep Deterministic Policy Gradient (DDPG) reinforcement learning. The DDPG is adopted in the control stage of the framework considering the detailed modeling of unbalanced three-phase distribution grids to minimize the voltage deviation and power ramping of ESS. The performance of the proposed multi-stage scheme is verified using a three-phase active distribution grid under different scenarios.


I. INTRODUCTION
A. MOTIVATION AND BACKGROUND
In recent years, distributed energy resources (DERs) such as solar photovoltaic (PV) units and energy storage [1], together with demand response programs, have proliferated to provide clean energy and to increase grid flexibility, leading to more reliable and sustainable operation of distribution networks. The increasing penetration of DERs may cause technical and operational problems such as voltage violations and uncertainty in decision-making due to the intermittent behavior of renewable-based energy resources [2]. To cope with voltage regulation problems and the uncertainty imposed by DERs, a robust distribution management system (DMS) is required to operate and control the system optimally and reliably under a realistic uncertainty model of renewable energy resources. In this regard, legacy devices, e.g., capacitor banks (CBs) and transformer taps, and advanced Volt/VAR devices such as inverters need to be coordinated to take advantage of their capabilities at different time intervals. Legacy devices are typically scheduled hours ahead because of their slow response. In contrast, inverter-based resources are fast and flexible complementary resources that can adjust voltages by injecting or absorbing reactive power within a short period of time [3], ranging from several seconds to minutes. The role of real-time inverter control in regulating voltage fluctuations becomes more apparent on partially cloudy days, when cloud movement over short time intervals leads to large PV generation and voltage fluctuations.
To track the intermittent changes in PV outputs, the historical forecasted and actual data can be used to quantify the exact distribution function of PV forecasting error. It is very common to model the forecasting error with the unimodal Gaussian distribution function; however, this study indicates that the Gaussian Mixture Model (GMM) better represents the forecasting error for realistic PV data. Note that even the best forecasting/uncertainty modeling may not be able to capture the sudden changes in sub-hour time intervals due to the fast weather-related changes in a very short period of time. Hence, the available fast response resources need to be re-dispatched in sub-hour time intervals to compensate for the drawbacks of day-ahead (DA) or hourly forecasting tools using a robust and fast control policy. This study implements Reinforcement Learning (RL) as a real-time controller to redispatch the resources in real-time [4], [5].
In addition, conservation voltage reduction (CVR) is considered a voltage-related load plan for reducing consumption by simply setting the voltage at a lower permissible level [6]. Previous studies show that annual energy consumption in the United States can be reduced by approximately 3% by implementing CVR on all the distribution feeders [7]. To achieve the CVR goal, a centralized voltage control system can be envisaged, that seeks to reduce the total power consumption by integrating all voltage control devices, including capacitor banks, transformer tap-changers, and smart inverters. This plan requires voltage-dependent load modeling in an optimization framework, which is usually neglected. Another flexible demand-based plan is the demand response (DR) program that can coordinate with inverter-based PV and storage for enhancing operational flexibility [8] and reducing stress on legacy Volt/VAR devices. The DR has several other advantages such as peak shaving and improving economic metrics and reliability of the grid [9].

B. CHALLENGES AND RATIONALITY
Coordination of legacy devices and inverter-based resources has been performed using multi-stage optimization techniques such as stochastic [2], [10], chance-constrained [11], [12], risk-aware [13], and robust [14] optimization. In these methods, several linearization approaches and approximations are used to reduce the computation time. As a result, the available resources are dispatched effectively for DA and hourly scheduling, but these methods may not be fast enough for real-time control based on the states of the grid. On the other hand, several studies implement RL for Volt/VAR control (VVC) while ignoring the DA and intra-day (ID) participation of distribution grids in electricity markets, as well as DR and CVR plans for improving techno-economic metrics. In this study, we propose to link the scheduling and the real-time control stages together while minimizing the operation cost, regulating the voltages, implementing CVR and DR plans for energy saving and cost reduction, and regulating the set points of resources in real-time under uncertain weather conditions. To meet these goals, a cost-effective multi-stage Volt/VAR support framework is proposed that solves a risk-aware stochastic optimization complemented by a real-time RL-based controller in three-phase distribution feeders. This linkage between scheduling and control is required because fast cloud movements, prediction errors, optimization modeling errors, or other real-time problems cause fluctuations in PV generation and nodal voltages, so the centrally scheduled set points of fast-response resources need to be readjusted.
This adjustment maintains the desired nodal voltages obtained based on the uncertainty realization of variables in the scheduling stage to meet voltage-dependent load and operational cost objectives. Therefore, the expected voltages obtained from the scheduling stages are taken as the reference values for real-time control. This coordinates the scheduling optimization with the real-time control of local resources, which minimizes voltage deviations from the optimal values set in the scheduling stage. Note that we need an accurate and fast model to control the set points in real-time. This is why an RL agent trained under different PV and load scenarios is implemented, interfacing with accurate non-linear power flow representations and detailed security constraints of the distribution grid to handle the mismatches between the scheduled decisions and the real-time set points. Moreover, model-free RL methods are adaptable to large networks without concern for computation time [15].
VOLUME 11, 2023 54823 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

C. THE MOST RELEVANT STUDIES
Different model-based operation strategies with a focus on Volt/VAR optimization (VVO) in active distribution management systems have been recently studied. In [16], the authors use a stochastic VVO formulation in active distribution grids, focusing on inverter control. The study in [17] proposes a risk-aware VVO for voltage regulation and loss minimization in balanced distribution feeders. In [18], an alternating direction method of multipliers is used in a stochastic VVO problem. That study addresses technical problems, whereas optimal cost-effective operation is not investigated. In this regard, reference [19] optimally adjusts the voltages and dispatches power in a microgrid. The risk formulations are included in the operational stochastic formulation to prevent over-optimistic decisions. Additionally, the study in [20] develops stochastic daily scheduling of unbalanced distribution feeders using peak shaving and voltage adjustment. However, the risk of uncertainty is not quantified, and neither voltage-dependent DR nor CVR plans are investigated. Overall, the reviewed studies implement CVR and DR plans only to a limited extent in their voltage regulation problems.
In [7], CVR is adopted as the main objective function for energy saving. The uncertainty quantification of PV is not included in [7]. The CVR and DR are financially and technically analyzed in [21] with a non-linear model. Also, the PV uncertainty is represented by the common unimodal distribution functions. In [22], the impact of CVR and soft open points (SOPs) is studied for energy consumption and losses; an SOP is a power electronic apparatus that provides more flexibility for switching and control commands. The authors of [23] present a non-linear operational planning framework for distribution grids equipped with SOPs, handling external disturbances effectively.
In the context of VVO, RL-based controllers have attracted attention for real-time distribution system management. This is because model-based approaches require solving an expensive optimization problem with several approximations, which limits their response to sudden changes of nodal voltages under intermittent weather conditions. In contrast, a model-free RL controller is computationally efficient and robust for real-time control. In [24], a bi-level off-policy deep RL algorithm is proposed for VVC on different time scales. The authors of [25] use consensus multi-agent deep RL to determine the set points of legacy devices such as voltage regulators, capacitors, and on-load tap changers in the VVC problem. The regulation of inverter-based resources has not been studied. In [3], a coordinated VVC is formulated to simultaneously minimize bus voltage deviations and losses using multi-agent RL in a balanced distribution system. The uncertainties of PV and load are simulated by stochastic programming. The authors of [26] propose hierarchically coordinated VVC and battery peak-shaving using DA optimization and RL-based control of inverters. The operation of legacy devices, DR programs, the CVR strategy, and different electricity markets are not considered in that study. It also ignores the prediction uncertainty in the scheduling stage and does not model the complexity of three-phase distribution feeders. Our proposed study addresses the aforementioned research gaps. Moreover, the authors of [27] and [28] combine RL and model-based optimization for resilient operation of distribution networks after outages, showing that the hierarchical combination of classical optimization and RL is an effective approach for solving power system scheduling and control problems.

D. SUMMARY OF CONTRIBUTIONS
This paper proposes a multi-stage and multi-timescale Volt/VAR scheduling and control framework for unbalanced distribution grids with inverter-based resources and legacy voltage regulation devices considering CVR and DR programs in ID and DA market environments. The technical contributions are listed below:
• For the first time, a comprehensive risk-aware stochastic model and real-time RL-based control are developed for cost-effective Volt/VAR support in three-phase unbalanced distribution systems, including battery degradation, CB and tap operational limits, four-quadrant inverters, multiple electricity markets, and voltage-dependent load constraints.
• The shiftable DR program and voltage-dependent loads are included in the linear three-phase stochastic power flow model to minimize power consumption in feeders and shave the peak load while minimizing loss, risk, and operation costs. McCormick relaxation is used to linearize the three-phase ZIP load models including DR constraints.
• The uncertainty of PV forecasting error is quantified with the non-unimodal GMM and expectation maximization (EM) algorithm, and the scenario sets of load, PV, and price are reduced with an efficient fuzzy-based clustering technique.
• A robust and computationally efficient RL control is adopted to minimize the voltage deviations and ESS power ramping in real-time by re-adjusting the scheduled set points of ESS and inverters, relying on the hourly-scheduled legacy devices, DR participation, and ESS status as fixed plans. The states of RL are the locationally-scarce voltage measurements throughout the network. A detailed three-phase non-linear model with mutual impedance is built in the OpenDSS simulation environment to train and test the proposed RL framework under near-to-real-world conditions.
The rest of the article is organized as follows. The uncertainty modeling and quantification are presented in Section II. In Section III, the McCormick envelopes and relevant mathematical formulations are discussed. The proposed approach for scheduling and control of the resources is presented in Section IV, and the detailed description of the method and its formulations are given in Section V. The case studies and numerical analysis are discussed in Section VI. Finally, Section VII presents the concluding remarks.

II. UNCERTAINTY ASSESSMENT
This section addresses the uncertainty modeling and scenario generation for stochastic programming.

A. MODELING OF UNCERTAINTY
Considering real-world situations and geographical features, stochastic variables may not be adequately modeled by unimodal probability distribution functions, such as beta or Gaussian. These stochastic behaviors can be captured by a non-unimodal (non-standard) distribution representation obtained with data-driven techniques [29]. A more realistic characterization of these uncertainties is a weighted combination of Gaussian distributions, namely the GMM, instead of a standard unimodal Gaussian distribution function [13], [30]. The GMM is a weighted sum of Gaussian distribution functions with the associated parameters including mean, variance, and weights. The GMM is applied as (1) for a stochastic variable x, which cannot be represented by common distribution functions [31].
In (1), $M$ is the number of Gaussian distribution components; $\omega_m$ is the proportional weight associated with the $m$th component; and $\wp$ is the parameter set $\{\omega_m, \mu_m, \sigma_m^2\}$. The weights in (1) sum to unity, i.e., $\sum_{m=1}^{M} \omega_m = 1$. Moreover, $g(x \mid \mu_m, \sigma_m^2)$ is the Gaussian density of the $m$th component expressed as (2), where $\mu_m$ is the mean and $\sigma_m^2$ denotes the variance.
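Under the definitions above, the mixture density and its Gaussian components take the usual textbook form. The block below is a sketch in standard notation, assumed (not confirmed) to match (1) and (2):

```latex
\begin{align}
f(x \mid \wp) &= \sum_{m=1}^{M} \omega_m \, g(x \mid \mu_m, \sigma_m^2), \\
g(x \mid \mu_m, \sigma_m^2) &= \frac{1}{\sqrt{2\pi\sigma_m^2}}
  \exp\!\left(-\frac{(x-\mu_m)^2}{2\sigma_m^2}\right).
\end{align}
```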
A parameter identification method named expectation maximization (EM) is adopted to obtain the GMM parameters. This technique iteratively maximizes the likelihood of the dataset. For a sequence of $k$ training data $X = \{x_1, x_2, \ldots, x_k\}$, the maximum likelihood estimate of the GMM is defined as (3).
The main concept behind the EM algorithm is to determine a new estimate $\wp_c$, starting with an initial estimate $\wp$, such that $f(X \mid \wp_c) \geq f(X \mid \wp)$ [32]. The new estimate then replaces the initial estimate in the next iteration, and this process continues until convergence. The re-estimation calculations for weights, means, and variances applied in each EM iteration are given by (4), (5), and (6), respectively.
The probability of distribution component $m$ used in (4)-(6) is calculated as (7).
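The EM iteration described above can be illustrated with a minimal one-dimensional sketch. It uses the standard textbook EM updates for a Gaussian mixture, which are assumed (not confirmed) to correspond to (4)-(7). The quantile-based initialization, the function names, and the synthetic bimodal error data are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Gaussian density g(x | mean, var) of a single mixture component."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_em(data, n_components, n_iter=300, tol=1e-9):
    """Fit a 1-D GMM by EM; returns (weights, means, variances)."""
    k = len(data)
    ordered = sorted(data)
    grand_mean = sum(data) / k
    grand_var = max(sum((x - grand_mean) ** 2 for x in data) / k, 1e-9)
    # Assumed initialization: equal weights, means at evenly spaced quantiles.
    weights = [1.0 / n_components] * n_components
    means = [ordered[(2 * m + 1) * k // (2 * n_components)] for m in range(n_components)]
    variances = [grand_var] * n_components
    prev_loglik = -math.inf
    for _ in range(n_iter):
        # E-step: probability that each sample belongs to each component.
        resp = []
        loglik = 0.0
        for x in data:
            joint = [w * gaussian_pdf(x, mu, var)
                     for w, mu, var in zip(weights, means, variances)]
            total = max(sum(joint), 1e-300)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weights, means, and variances.
        for m in range(n_components):
            n_m = max(sum(r[m] for r in resp), 1e-12)
            weights[m] = n_m / k
            means[m] = sum(r[m] * x for r, x in zip(resp, data)) / n_m
            variances[m] = max(
                sum(r[m] * (x - means[m]) ** 2 for r, x in zip(resp, data)) / n_m, 1e-9)
        # Stop once the log-likelihood no longer improves noticeably.
        if abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik
    return weights, means, variances


def synthetic_forecast_errors(seed=1):
    """Hypothetical bimodal PV forecast errors (p.u.), for illustration only."""
    rng = random.Random(seed)
    return ([rng.gauss(-0.3, 0.05) for _ in range(300)]
            + [rng.gauss(0.2, 0.1) for _ in range(700)])
```

Calling `fit_gmm_em(synthetic_forecast_errors(), 2)` should recover two components close to the ones used to generate the data, which a single Gaussian fit could not represent.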

B. SCENARIO GENERATION AND REDUCTION
The derived GMM parameters are then used in the scenario generation algorithm. Also, the prediction errors of power market price and load are assumed to follow a normal distribution function. The generated scenarios need to be reduced to make the scenario sets computationally tractable in the optimizer. A fuzzy-based clustering method is implemented as the scenario reduction technique in this study. This method is preferred because it detects soft clusters, whereas k-means finds hard clusters. In soft clustering, a sample may belong to several groups, with its degree of belonging represented through affinity values. The fuzzy-based method minimizes the objective function (8) for scenario reduction [33], [34].
where $n$ is the number of data points; $s$ is the number of reduced scenarios; $m$ ($> 1$) is the parameter that controls the degree of fuzzy overlap; and $x_i$ is the $i$th data point. In (8), $u_{ik}$ is the degree of membership of the data point $x_i$ in the $k$th cluster, and $c_k$ is the centre of the $k$th cluster. Scenario reduction is necessary for stochastic programming since the computation time of this approach is highly dependent on the number of scenarios [35]. The reduction technique helps in reducing the computation time for real-world applications. Also, the reduced scenario set must represent the full set of possible scenarios as closely as possible. In this study, we use Silhouette [36] and Davies-Bouldin [37] scores to assess the quality of the reduced scenarios. More details can be found in [13].
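The clustering step can be sketched with the standard fuzzy c-means iteration, which alternates membership and centre updates to reduce the objective $\sum_{i=1}^{n}\sum_{k=1}^{s} u_{ik}^{m}\,\lVert x_i - c_k \rVert^2$. This is assumed to be the form of (8). Taking each reduced scenario's probability as the average membership of its cluster is an illustrative assumption, as are the function names and the random initialization.

```python
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=300, tol=1e-8, seed=0):
    """Cluster scenario vectors; returns (centres, memberships)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Assumed initialization: centres drawn at random from the data.
    centres = [list(p) for p in rng.sample(points, n_clusters)]
    memberships = [[1.0 / n_clusters] * n_clusters for _ in points]
    for _ in range(n_iter):
        # Membership update: u_ik is proportional to d_ik^(-2/(m-1)).
        largest_change = 0.0
        for i, p in enumerate(points):
            sq_dist = [max(sum((p[d] - c[d]) ** 2 for d in range(dim)), 1e-12)
                       for c in centres]
            inv = [d2 ** (-1.0 / (m - 1.0)) for d2 in sq_dist]
            total = sum(inv)
            new_row = [v / total for v in inv]
            largest_change = max(
                largest_change,
                max(abs(a - b) for a, b in zip(new_row, memberships[i])))
            memberships[i] = new_row
        # Centre update: mean of the points weighted by u_ik^m.
        for k in range(n_clusters):
            w = [row[k] ** m for row in memberships]
            denom = sum(w)
            centres[k] = [sum(wi * p[d] for wi, p in zip(w, points)) / denom
                          for d in range(dim)]
        if largest_change < tol:
            break
    return centres, memberships


def reduced_scenarios(points, n_clusters, **kwargs):
    """Return cluster centres as reduced scenarios with assumed probabilities."""
    centres, memberships = fuzzy_c_means(points, n_clusters, **kwargs)
    n = len(points)
    # Assumption: scenario probability = average membership of the cluster.
    probs = [sum(row[k] for row in memberships) / n for k in range(n_clusters)]
    return centres, probs
```

Because every membership row sums to one, the resulting scenario probabilities also sum to one, so the reduced set can be used directly as a discrete distribution in the stochastic program.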

III. McCormick ENVELOPES
McCormick envelopes are one of the most efficient techniques to tackle bi-linear terms in nonlinear programming. This method relaxes the non-linear equations into linear, convex constraints, helping the solver to find the global optimum. Also, the relaxation of bounds using this technique reduces the computation time in solving problems [38]. For instance, a bi-linear term $xz$ is replaced by a new auxiliary variable $w$, considering the variable limits $x \in [x^L, x^U]$ and $z \in [z^L, z^U]$. Fig. 1 shows the underestimators and overestimators using the McCormick technique, where $U$ and $L$ indicate the upper and lower limits, respectively. Mathematically, the McCormick relaxation is defined for the underestimators of $w$ as:
$$w \geq x^L z + x z^L - x^L z^L \quad (9)$$
$$w \geq x^U z + x z^U - x^U z^U \quad (10)$$
Moreover, the overestimators for $w$ are represented as:
$$w \leq x^U z + x z^L - x^U z^L \quad (11)$$
$$w \leq x^L z + x z^U - x^L z^U \quad (12)$$
With tight variable bounds, the McCormick relaxation reduces the size of the relaxed feasible region, so the lower-bound solution lies closer to the solution of the original problem.
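As a small numerical illustration of the envelopes, the sketch below evaluates the standard McCormick under- and overestimators of $w = xz$ at a given point. The function name and the example bounds (a voltage in p.u. and a participation factor) are hypothetical.

```python
def mccormick_bounds(x, z, x_lo, x_hi, z_lo, z_hi):
    """Return (under, over) so that under <= x*z <= over inside the box."""
    # Underestimators: the larger of the two lower planes is the active one.
    under = max(x_lo * z + x * z_lo - x_lo * z_lo,
                x_hi * z + x * z_hi - x_hi * z_hi)
    # Overestimators: the smaller of the two upper planes is the active one.
    over = min(x_hi * z + x * z_lo - x_hi * z_lo,
               x_lo * z + x * z_hi - x_lo * z_hi)
    return under, over


# Hypothetical example: voltage x in [0.95, 1.05] p.u., factor z in [0, 1].
under, over = mccormick_bounds(1.0, 0.5, 0.95, 1.05, 0.0, 1.0)
gap = over - under  # the gap shrinks as the variable bounds are tightened
```

The relaxation is exact at the corners of the box, which is why narrow voltage limits such as 0.95 to 1.05 p.u. keep the approximation error small.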

IV. PROPOSED COORDINATED SCHEDULING AND REAL-TIME CONTROL
A. SCHEDULING IN INTER-HOUR PERIODS
After preparing the scenario sets for PV generation, electricity price, and loads, the day-ahead scheduling optimizer is executed with these input scenarios. This scheduling level dispatches the resources in two coordinated stages as follows:
First stage: This stage adjusts the legacy apparatuses, such as the on-load tap changer (OLTC) and CB, for 24 hours ahead with 1-hour intervals. Moreover, the purchased power in the DA market and the hourly charging or discharging status of the ESS are determined.
Second stage: In the second stage, which is set up for the ID market, the dispatch of ESS, the purchased power from the ID market, DR participation, voltage reduction for energy saving, and the generation of inverters are optimally set considering multiple scenarios for each hour. This stage gives different solutions for the possible scenarios in inter-hour periods, coordinated with the decisions made in the first stage. The linking constraints connect the two stages [39].
Consequently, the expected optimal set points of resources are obtained through the coordination of both stages to minimize the operation costs in the day-ahead risk-aware stochastic framework. In fact, this optimization stage is developed to obtain primary expected set points for resources.

B. REAL-TIME CONTROL IN INTRA-HOUR PERIODS
In real-time operation, grid operators may face sudden variations of solar generation or load due to weather-related factors or socio-economic parameters that are not captured by the hourly scenario generation. Therefore, an efficient and fast re-dispatching control strategy is required to correct the set points of inverters and batteries proposed at the scheduling level. More specifically, since the set point corrections are performed in real-time, a robust and computationally efficient model is also required to guarantee the security of the operation with a detailed model of the grid and to satisfy the technical constraints. In this regard, RL-based control is developed in the third stage, explained as follows:
Third stage: At this stage, flexible resources, such as ESS and inverters, adjust their outputs from the initially determined reference (or base) values, which were established at the optimization scheduling level, by utilizing an RL-based controller. These resources track the voltage changes due to PV generation and load variations in real-time. The main grid also works as a backup for providing active and reactive power to compensate for local shortages and technical restrictions. Note that the set points of the legacy devices and the participation level of responsive loads are not changed in real-time due to the slow response of legacy devices and the DR program policies.
More specifically, the control center receives the local real-time voltage measurements of a limited number of buses and immediately re-adjusts the reactive and active power outputs to minimize the voltage deviation from the optimal reference voltage. The reference voltage is the expected voltage of nodes, calculated at the scheduling level. The reference voltages are techno-economically optimized considering the voltage-dependent load plans in the optimization stage. The substation voltage can also be considered as the voltage reference. The ramping of the ESS active power is also controlled to minimize sharp fluctuations between the real-time output and the scheduled reference output of the ESS. Using the optimal reference set points for nodal voltages and ESS outputs, the scheduling and the real-time controller are hierarchically coordinated, considering the techno-economic perspectives across multiple stages and time scales.
This control policy is trained in Python through interaction with a distribution grid simulator, i.e., the OpenDSS, as the environment containing the detailed non-linear model of the system with mutual impedances and technical constraints. This environment emulates real-world conditions, where the optimized set points are given and some of them are allowed to be adjusted again by the controller. The proposed framework is illustrated in Fig. 2.

1) Objective Function:
The objective function G minimizes the operation cost in two coordinated stages, as given in (13)-(21), subject to the linearized constraints. Stage 1 and stage 2 are denoted by ι and ιι, respectively.
where $\lambda$ indicates the cost; $\gamma^{c}$, $\gamma^{tap}$, and $\gamma^{DA}$ are defined as the switching cost of CB, the tap positioning cost, and the DA price of the electricity market in the first stage, respectively; and $\gamma^{E}$, $\gamma^{PV}$, and $\gamma^{ID}$ show the cost of battery degradation, the PV maintenance cost, and the ID market price, respectively.
2) Risk constraint: To reduce the impact of worst scenarios on the system operation, CVaR-based constraints are added and reformulated [40], through which the CVaR, defined as equation (22), is excluded from the problem objective function.
By incorporating CVaR within the constraints, constraint (23) enforces a risk-averse strategy.
where shows the confidence level. The range of β, the defined risk-awareness level, is between 1 and ∂. The feature and number of scenarios determine the magnitude of ∂. Setting greater values than ∂ makes this constraint ineffective [13], [41]. In addition, constraint (24) is added to the stated CVaR-based constraints to calculate the cost differences and auxiliary variables [17].
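To make the risk measure concrete, the sketch below computes the CVaR of a discrete set of scenario costs using the well-known Rockafellar-Uryasev formulation, $\mathrm{CVaR} = \min_{\eta}\{\eta + \frac{1}{1-\alpha}\sum_s p_s \max(c_s - \eta, 0)\}$. The symbol `alpha` for the confidence level, the function names, and the example costs are assumptions made for illustration. The block does not reproduce constraints (22)-(24).

```python
def expected_cost(costs, probabilities):
    """Probability-weighted average of the scenario costs."""
    return sum(c * p for c, p in zip(costs, probabilities))


def cvar(costs, probabilities, alpha):
    """CVaR at confidence level alpha (0 <= alpha < 1) of scenario costs."""
    def objective(eta):
        excess = sum(p * max(c - eta, 0.0) for c, p in zip(costs, probabilities))
        return eta + excess / (1.0 - alpha)
    # The objective is convex and piecewise linear in eta with breakpoints at
    # the scenario costs, so evaluating the breakpoints finds the minimum.
    return min(objective(eta) for eta in costs)


# Hypothetical daily operation costs ($) of four equiprobable scenarios.
scenario_costs = [100.0, 120.0, 150.0, 300.0]
scenario_probs = [0.25, 0.25, 0.25, 0.25]
tail_risk = cvar(scenario_costs, scenario_probs, alpha=0.75)
```

With these numbers the expected cost is 167.5, while the CVaR at a 0.75 confidence level equals the cost of the single worst scenario (300), which shows how a CVaR-based constraint penalizes the expensive tail that the expectation averages away.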

B. ZIP LOADS
The linear form of the voltage-related active and reactive demands is shown in (25)-(26). the main demand over the scheduling period.
Note that the DR constraints add non-linearity to the ZIP load constraints because of the multiplication of two continuous variables ($P_{\phi,i,t,s}$ and $V_{\phi,i,t,s}$), so the McCormick relaxation [38] is applied and a new variable $D$ is defined as (31) [10].
Consequently, each active ZIP load constraint is converted to 5 new constraints, given in (32)-(36) [42]. Reactive loads are also modified in the same way.
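The effect of voltage reduction on a voltage-dependent load can be illustrated with the standard ZIP model, $P = P_0 (Z V^2 + I V + P_c)$ with $Z + I + P_c = 1$. The sketch below also shows one common linearization, $V^2 \approx 2V - 1$ around 1 p.u. This is an assumption and may differ from the linear form used in (25)-(26). The coefficient values are hypothetical.

```python
def zip_load(p_nominal, v_pu, z_share, i_share, p_share):
    """Standard ZIP load: constant-impedance, -current, and -power parts."""
    return p_nominal * (z_share * v_pu ** 2 + i_share * v_pu + p_share)


def zip_load_linear(p_nominal, v_pu, z_share, i_share, p_share):
    """ZIP load with V^2 approximated by 2V - 1 (assumed linearization)."""
    return p_nominal * (z_share * (2.0 * v_pu - 1.0) + i_share * v_pu + p_share)


# Hypothetical 100 kW load with a 40/30/30 % split, operated at 0.97 p.u.
exact = zip_load(100.0, 0.97, 0.4, 0.3, 0.3)
approx = zip_load_linear(100.0, 0.97, 0.4, 0.3, 0.3)
saving_kw = 100.0 - exact  # demand reduction obtained by lowering the voltage
```

For this example the exact model gives about 96.74 kW and the linearized model 96.70 kW, so the linearization error stays small within the permissible voltage band.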

D. NETWORK REPRESENTATION
A three-phase distribution power flow model is used for network modeling. This mathematical model is applied in [43]. However, the line losses are also modeled in the proposed formulation. Since the mutual impedances are much lower than the self impedances, they are ignored as described in [43] and [44], but they are considered in the critical real-time control stage, which uses the detailed system model. The power balance equations at each node $i$ and phase $\phi$ are defined as (37)-(38).
The linear form of the quadratic loss is modeled as (39)-(40). The piecewise linearization technique is adopted for the reformulation.
In equations (39)-(40), $P^{*}_{i,k,t,s}$ and $Q^{*}_{i,k,t,s}$ show the negative variables in the linearized constraints and are employed to measure the loss in the reverse direction. The slopes of the linear segments $k$ and $l$ are given by $a$, $b$, $c$, and $d$ [14].
The power and voltage are related by (41). In (41), $r_{\phi,i}$ and $x_{\phi,i}$ are the real and imaginary parts (resistance and reactance) of the line impedance in phase $\phi$.
The voltage of nodes is restricted by (42).

E. INVERTER-BASED ESS
The operational constraints of ESS are represented as (43)-(49). Constraint (43) indicates the stored power in ESS. The charging and discharging powers are calculated by (44) and (45), respectively. The available capacity of ESS is expressed as (46), and it is bounded by (47). Parameters $\eta_d$ and $\eta_c$ are the discharging and charging efficiencies, respectively. In (48), the reactive power output of the inverter-based ESS is determined. The linear expression of (48) is obtained with the polyhedral norm model as (49) [13], [45]. The parameter $o$ shows the number of linear constraints in the modified formulation.
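One common way to linearize an inverter capability limit of the form $P^2 + Q^2 \leq S^2$ is to replace the circle with a regular polygon made of $o$ half-planes. The sketch below builds an inscribed (conservative) polygon. It is an assumed illustration of the general idea, and the exact polyhedral norm form used in (49) may differ. The function names are hypothetical, and the 320 kVA rating is taken from the test system described later.

```python
import math


def inverter_polygon(s_rated, n_sides):
    """Half-planes a*P + b*Q <= rhs of a polygon inscribed in the S-circle."""
    # Scaling by cos(pi/n) puts the polygon vertices on the circle, so every
    # feasible point also satisfies the original quadratic limit.
    rhs = s_rated * math.cos(math.pi / n_sides)
    return [(math.cos(2.0 * math.pi * j / n_sides),
             math.sin(2.0 * math.pi * j / n_sides),
             rhs) for j in range(n_sides)]


def satisfies(p, q, constraints, eps=1e-9):
    """Check an operating point (P, Q) against all linear constraints."""
    return all(a * p + b * q <= rhs + eps for a, b, rhs in constraints)


# ESS inverter of 320 kVA approximated with o = 12 linear constraints.
ess_limits = inverter_polygon(320.0, 12)
ok_point = satisfies(200.0, 200.0, ess_limits)       # |S| is about 283 kVA
violating_point = satisfies(300.0, 200.0, ess_limits)  # |S| is about 361 kVA
```

Increasing the number of sides tightens the approximation at the cost of more constraints per inverter, phase, hour, and scenario.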

F. INVERTER-BASED PV
The output of PV is adjusted by (50). The constraint (51) is the linear form of the original constraint (50).

G. LEGACY DEVICES
The reactive power of CB is determined by (52), where $B_{\phi,i,t}$ indicates the number of connected units of size $Q^{C}_{unit}$. The unit number of CB is restricted through (53). The parameter $B^{total}_{\phi,i}$ limits the number of CB switching operations, as represented in (54).
The voltage of the substation bus (i.e., the bus connected to the main grid) is adjusted by tap positioning, formulated as (55). The base voltage is shown by $V_o$ in (55).
The tap position $W_{\phi,i,t}$ is defined through (56), and the number of tap changes is less than or equal to $W^{total}_{\phi,i}$, as presented in (57).
It is necessary to reformulate nonlinear absolute variables into linear forms [20], given as (58)-(60), for off-the-shelf optimizers. For instance, constraints of CB are linearized by new auxiliary integer variables ϑ and α. The same technique is used for tap positioning constraints.

H. REAL-TIME CONTROL FRAMEWORK
The deep deterministic policy gradient (DDPG) algorithm [46], which is a blending of the Deterministic Policy Gradient (DPG) and the Deep Q-Network (DQN) algorithms, is used for real-time RL-based control.
The selection of states, actions, and reward function is very important for training an RL agent effectively. In real-time Volt/VAR control, a vector of a certain number of node voltages is taken as the system state. The RL agent controls the reactive power of PV inverters and the active and reactive power of inverter-based ESS based on the state. Therefore, the action is a vector of new settings for the reactive power of PV inverters and the active and reactive power of ESS inverters. Since our goal is to minimize voltage deviation with respect to the optimal nodal voltages as well as ramping of ESS active power during real-time operations, the reward function is designed by accounting for both of these factors.
The total reward at time step $t$ is calculated as (61), where $\beta_v$ and $\beta_r$ represent weighting factors corresponding to the voltage deviation and ramping of ESS active power, respectively, which encourage the agent to optimize the reward.
The part of the reward function associated with the average relative voltage deviation (ARVD) is represented by $R^{v}_{t}$ as the first term of the reward equation (61), and it motivates the RL agent to keep this deviation as small as possible [47]. For this reason, the term is assigned a negative sign. $R^{v}_{t}$ is calculated as (62), where $I$ is the set of nodes in the system; $V_{\phi,i,t}$ is the voltage at phase $\phi$ of node $i$ at time $t$ after an action is taken by the RL agent; and $V_{\phi,i,ref}$ is the optimal reference voltage for node $i$ determined by the scheduling stages at that particular hour.
In the second part of the reward equation (61), $R^{r}_{t}$ represents the relative deviation of ESS active power. It encourages the RL agent to minimize ESS active power changes (ramping), which can lower battery degradation. This is why a negative sign is assigned to this term. $R^{r}_{t}$ is expressed as (63), where $P^{E}_{\phi,i,t}$ is the active power of the ESS at the $i$th node at time $t$; $P^{E}_{\phi,i,ref}$ is the expected scheduled active power of the ESS determined by the hourly stochastic scheduling; and $N_b$ is the set of ESS.
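The two-term reward described above can be sketched as follows. The structure (a weighted sum of two penalties, both with negative sign) follows the text. The exact normalizations are assumptions: the voltage term is averaged over all measured phase-node voltages, and the ESS term is normalized by a rated power so that it stays defined when the scheduled power is zero. All names and numbers are illustrative.

```python
def average_relative_voltage_deviation(voltages, reference_voltages):
    """Mean of |V - V_ref| / V_ref over the measured phase-node voltages."""
    pairs = list(zip(voltages, reference_voltages))
    return sum(abs(v - ref) / ref for v, ref in pairs) / len(pairs)


def relative_ess_power_deviation(ess_power, ess_reference, ess_rated):
    """Mean of |P - P_ref| / P_rated over the ESS units (assumed scaling)."""
    terms = [abs(p - ref) / rated
             for p, ref, rated in zip(ess_power, ess_reference, ess_rated)]
    return sum(terms) / len(terms)


def reward(voltages, reference_voltages, ess_power, ess_reference, ess_rated,
           beta_v=1.0, beta_r=1.0):
    """Negative weighted sum of the voltage and ESS ramping penalties."""
    r_v = average_relative_voltage_deviation(voltages, reference_voltages)
    r_r = relative_ess_power_deviation(ess_power, ess_reference, ess_rated)
    return -beta_v * r_v - beta_r * r_r


# Hypothetical step: two measured voltages (p.u.) and one ESS (kW).
step_reward = reward([1.02, 0.98], [1.0, 1.0], [150.0], [100.0], [200.0],
                     beta_v=10.0, beta_r=1.0)
```

Here the voltage penalty is 0.02 and the ramping penalty is 0.25, giving a reward of about -0.45. The weights $\beta_v$ and $\beta_r$ decide which of the two deviations the agent prioritizes.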

2) TRAINING ATTRIBUTES
The real-time RL-based approach is first trained for a specific number of episodes. At the start of the training process, the parameters of the main critic and actor networks (φ and θ) are randomly initialized. The parameters of the target networks ($\varphi_{targ}$ and $\theta_{targ}$) are set equal to those of the main networks. Each episode starts with the environment being reset, which initializes the state. A specific number of time steps constitutes each episode. The actor network decides the actions (reactive power of PV inverters and active and reactive power of ESS inverters) for each time step. Gaussian noise is added to the actions once they are generated. A specific deterministic policy (π) is determined using (64), often known as the Q-function or the Bellman equation [48].
where $E$ is the expectation operator; $\gamma$ indicates the discount factor; and $R(S_t, A_t)$ is the reward function defined for the state-action pair $(S_t, A_t)$ [47]. During the training of the critic network, its parameters are tuned to minimize the loss function (65). This method avoids the iterative update of the Q-function.
For a parameterized action function $\pi(S_t \mid \theta)$ that deterministically maps states to actions, the policy update of the actor network is performed by gradient ascent based on (66). The parameters of the target critic and actor networks are then adjusted using a small constant $\tau$, as given in (67) and (68).
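The critic target, critic loss, and target-network update used in DDPG training can be sketched with plain Python lists standing in for network parameters. The formulas follow the standard DDPG algorithm and are assumed to correspond to (64), (65), (67), and (68). The function names and numerical values are illustrative.

```python
def td_target(reward, next_q_target, gamma, done=False):
    """Bellman target y = R + gamma * Q_targ(S', pi_targ(S')) for one sample."""
    return reward + (0.0 if done else gamma * next_q_target)


def critic_loss(q_values, targets):
    """Mean squared error between critic outputs and Bellman targets."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


def soft_update(target_params, main_params, tau):
    """Move each target parameter a small step tau toward the main network."""
    return [tau * w + (1.0 - tau) * w_targ
            for w, w_targ in zip(main_params, target_params)]


# Hypothetical numbers: reward 1.0, target critic value 2.0, gamma = 0.9.
y = td_target(1.0, 2.0, gamma=0.9)
new_target = soft_update([0.0, 0.0], [1.0, 2.0], tau=0.1)
```

A small `tau` makes the target networks change slowly, which keeps the Bellman targets stable while the main critic and actor are being updated.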

VI. CASE STUDIES AND NUMERICAL ANALYSIS

A. TEST SYSTEM
The designed scheduling framework, complemented with the real-time control model, is examined on a modified three-phase 33-node grid, as shown in Fig. 3. The technical data and descriptions of this verified three-phase test system are given in [43]. The scheduling horizon is 24 hours ahead with hourly intervals, and real-time control is performed to revise the decisions in sub-hour intervals. The RL-based fast control can be performed every second or minute without loss of generality, depending on the possible changes in the states of the grid during short intervals. The initial voltage on the secondary side of the transformer is 1.0 p.u., and the allowable voltage range is between 95% and 105% of the base voltage. Seven PV sources are placed at phase A of nodes 4 and 17, phase B of nodes 9, 21, and 24, and phase C of nodes 14 and 32, and their profiles are adopted from [13]. This PV configuration adds generation imbalance to the grid. The capacity of each PV inverter is 300 kVA, where the allowed range of reactive power of each inverter is between −100 and 100 kVar in each phase and the active power generation of each inverter does not exceed 270 kW per phase. A three-phase ESS is located at bus 19. The maximum active power and stored energy of the ESS are 200 kW and 800 kWh per phase, respectively. The total capacity of the ESS inverter is 320 kVA per phase. The charging/discharging efficiency is assumed to be 1 [49]. The reactive power of the ESS inverter varies between −200 and 200 kVar. ESS inverters operate in four quadrants, similar to PV inverters. The locationally-scarce voltage measurements are assumed to be located on buses with PV and ESS, which are coordinated with the control center (CC) through communication links for real-time control. The remotely dispatchable capacitor bank (CB) consists of two separate units, each with a capacity of 30 kVar. The switching cost of each unit is assumed to be 0.24 $/time [20].
The transformer tap changer can be set between −10 and 10, and each tap change costs 1.4 $/time per phase [20]. The initial tap position is 0, and the tap is required to be at a position below 5 at the end of the day. Additionally, the number of tap changes and CB switching operations within a 24-hour time frame is limited to a maximum of 6. The operational cost of PV units is 0.0045 $/kWh, and the degradation cost of the ESS is 0.0035 $/kWh [13]. The electricity price data for the DA and ID markets are taken from [50].
The maximum total consumption under the generated scenarios is 1494.47 kW for phase B, with a power factor of 0.95. Its daily pattern is shown in the analysis section of this study and in our previous work [13]. The ZIP load coefficients K_PZ, K_PI, and K_PP are 0.4, 0.1, and 0.5, respectively [13]. The base load consumption in each phase is distributed equally among the buses of the distribution grid. In addition, a random variable is used to slightly vary the base loads across phases.

B. SCHEDULING ANALYSIS
In this section, the uncertainty-aware scheduling of demand plans, i.e., DR and CVR, as well as inverter-based units is addressed for the next 24 hours. The DA and ID markets are the candidates for purchasing power in the scheduling phase, considering the quantification of risks. The risk parameter β and the confidence level are set to 1.018 and 0.8, respectively.

1) UNCERTAINTY QUANTIFICATION AND SCENARIO CREATION
To cope with the intermittent behavior of resources, the uncertainty of PV prediction is quantified by comparing actual PV outputs with their corresponding forecasted values [13]. The prediction error is modeled using a GMM [13] with two components, whose mean values are −0.32 and −0.18. The standard deviations of the first and second components are estimated as 0.256 and 0.0032, respectively, and the component weights are 0.62 and 0.38. Note that the prediction errors are restricted to at most 60%, as errors near or above this level are infrequent. Furthermore, the deviations in load and market price predictions follow normal distributions, with standard deviations of 20% and 30% of the forecasted values, respectively [13], [17].
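As an illustration of how scenarios could be drawn from the fitted mixture, the sketch below samples PV forecast errors from the two-component GMM with the reported weights, means, and standard deviations. It assumes the errors are expressed as fractions of the forecast, and it enforces the 60% restriction by clipping; the clipping rule and the function name are assumptions, since the text does not state how the restriction is applied.

```python
import numpy as np

def sample_pv_forecast_errors(n_samples, rng=None):
    """Draw PV forecast errors from the two-component GMM reported in the text."""
    rng = np.random.default_rng(rng)
    weights = np.array([0.62, 0.38])      # component weights
    means = np.array([-0.32, -0.18])      # component means
    stds = np.array([0.256, 0.0032])      # component standard deviations

    # Pick a component for every sample, then draw from that component's Gaussian.
    component = rng.choice(2, size=n_samples, p=weights)
    errors = rng.normal(means[component], stds[component])

    # ASSUMPTION: the 60% restriction is applied symmetrically by clipping.
    return np.clip(errors, -0.6, 0.6)
```

A call such as `sample_pv_forecast_errors(10000, rng=0)` returns a one-dimensional array that can be multiplied by the forecast to perturb it; load and price scenarios could be drawn analogously from single Gaussians.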
The characteristics of the GMM for PV and the Gaussian models for the load and ID electricity price make it possible to generate a large number of scenarios. The obtained scenario sets are then reduced to 5 representative scenarios for each variable using the fuzzy clustering method, with the degree of fuzziness set to m = 1.1. The efficacy of the fuzzy clustering method is compared with that of the k-means clustering technique, which is widely used in power and energy system studies [51], [52]. For instance, the Davies-Bouldin index of the PV scenarios decreases from 2.585 to 2.516 with fuzzy clustering, indicating better clustering performance. Moreover, the Silhouette index increases by 7.7%, which is 0.083. A greater Silhouette value indicates more separation of a sample from the other clusters and more cohesion of that sample with its own cluster, and hence better performance.
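The scenario reduction step can be sketched with a plain fuzzy c-means routine. This is a generic implementation, not the authors' code: the random initialization, the stopping rule, and the use of summed memberships as scenario probabilities are assumptions. Memberships are computed in log space because the exponent 1/(m − 1) equals 10 for m = 1.1.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=5, m=1.1, n_iter=100, tol=1e-6, seed=0):
    """Reduce scenarios X (n_scenarios x n_features) to n_clusters representatives."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    # ASSUMPTION: initial centers are distinct, randomly chosen scenarios.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]

    for _ in range(n_iter):
        # Squared distances between every scenario and every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # Membership u_ik is proportional to d_ik^(-2/(m-1)), normalized per scenario.
        log_w = -np.log(d2) / (m - 1.0)
        log_w -= log_w.max(axis=1, keepdims=True)
        U = np.exp(log_w)
        U /= U.sum(axis=1, keepdims=True)
        # Center update: membership-weighted mean with weights u^m.
        Um = U ** m
        new_centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        if shift < tol:
            break

    # ASSUMPTION: scenario probabilities are the normalized total memberships.
    probs = U.sum(axis=0) / len(X)
    return centers, probs, U
```

Each row of `centers` is one representative scenario and `probs` gives its weight; with m close to 1 the memberships are nearly crisp, so the result behaves similarly to k-means while still yielding soft probabilities.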

2) VOLTAGE REDUCTION PLAN
The DA scheduling problem is solved with and without the CVR plan. The 3-D profiles shown in Fig. 4 represent the bus voltages along the feeders. Implementing the CVR plan reduces the voltages while keeping them within the ANSI ranges. Ignoring CVR increases the voltage magnitudes of the buses, represented with different colors. In contrast, CVR mitigates the risk of over-voltage problems by keeping the voltages closer to the minimum acceptable voltage (0.95 p.u.). Moreover, energy consumption decreases through the implementation of the CVR plan, as shown in Fig. 5-a for one particular node, where the consumption reduction is indicated with dotted lines. This reduction is possible thanks to the realistic voltage-dependent load model considered in the ZIP constraints, which allows decision makers to modify the load behavior simply by changing the nodal voltages. Consequently, the total energy consumed in phase A over all nodes decreases from 22.03 MWh to 21.45 MWh. Another advantage of the voltage reduction plan is the reduced dependency on legacy devices. Fig. 5-b indicates that CVR reduces or delays the set-point changes during daily operation. Reducing the number of CB switching and tap-changing operations extends the lifetime of these devices, which is directly related to these set-point changes [53]. More specifically, CVR leads to lower operational cost, loss minimization, and risk hedging by modifying loads, as explained in the following comparison section and Table 2.
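The effect of voltage on consumption can be illustrated with the standard polynomial ZIP form and the coefficients given for the test system (0.4, 0.1, and 0.5). The function name and the per-unit nominal voltage of 1.0 are assumptions of this sketch, and the paper's exact ZIP constraints may be written differently.

```python
def zip_active_power(p_nominal_kw, v_pu, k_pz=0.4, k_pi=0.1, k_pp=0.5):
    """Active power of a ZIP load at voltage v_pu (per unit of nominal voltage)."""
    # Constant-impedance part scales with V^2, constant-current part with V,
    # and constant-power part is independent of voltage.
    return p_nominal_kw * (k_pz * v_pu ** 2 + k_pi * v_pu + k_pp)
```

Under these coefficients, a 100 kW nominal load draws about 95.6 kW at 0.95 p.u., which shows why operating near the lower voltage limit reduces energy consumption. The network-wide reduction reported above is smaller than this single-load figure because not every node operates at the minimum voltage.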

3) RISK MANAGEMENT
The risk-aware strategy mitigates the consequences of the worst scenario in operation by controlling the resources in a robust way [54]. The risk parameters are defined based on previous cases, with β set to 1.018 and a confidence level of 0.8. In Fig. 6-a and Fig. 6-b, the total three-phase power generated by the ESS is modified to cope with the worst scenarios under the risk-aware policy. Fig. 6-a shows that charging occurs during off-peak and low-price hours, but the stored energy differs between the risk-neutral and risk-aware policies. Furthermore, the discharging increases remarkably to compensate for the PV generation drop in the evening and peak hours (5 pm-9 pm). The ESS inverter generates more reactive power during peak hours to regulate the voltage and minimize the operational costs, as indicated in Fig. 6-b. In addition, Fig. 6-c demonstrates that the risk-aware strategy leads to purchasing more energy from the DA market than the risk-neutral strategy. This is because the risk-aware operator prefers to face less price uncertainty and purchases from the DA market at a finalized deterministic price, whereas the risk-neutral (risk-taker) operator accepts uncertainty and chooses to purchase energy from the ID market in the hope of obtaining cheaper energy. Thus, under the risk-aware policy, the energy purchased from the ID market decreases from 31.59 MWh to 31.41 MWh, while the energy purchased from the DA market increases by 4.67% (8.56 MWh). Moreover, the transformer tap position is also affected by the risk-aware decisions and is set to position 2 earlier than in the risk-neutral mode, to be ready for sudden changes such as generation reduction and load increase, known as the worst scenario. The tap changes are shown in Fig. 6-d. This is also economically beneficial, as the number of tap changes does not increase, preserving the slow legacy devices for a longer life span.
The capacitor bank provides a base reactive power of 30 kVar under both the risk-neutral and risk-aware strategies.
A sensitivity analysis is also performed on the risk-awareness parameter β, as indicated in Table 1. Increasing the parameter increases the CVaR, while the operation cost decreases. Conversely, a small value, 1.01, makes the scheduling robust against the worst scenario, as the quantified risk (CVaR) decreases; however, the expected cost increases because the resources are scheduled with the worst scenarios in mind, meaning that the decisions may not be optimal for the expected scenarios. The confidence level is also changed to 0.9 with β equal to 1.018. The expected cost and CVaR increase to $1,208.1 and $1,229.9, respectively. As the confidence level increases, the scheduling focuses more on the worst, highest-cost scenarios, which increases both the operation cost and the CVaR. Increasing the confidence level may therefore result in less economical decisions and higher risk costs (CVaR), even when such high confidence is not necessary.
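The dependence of CVaR on the confidence level can be illustrated on a discrete scenario set. The sketch below uses the standard Rockafellar-Uryasev definition; it does not reproduce the paper's risk-weighted objective with β, and the scenario costs and probabilities in the usage note are made-up numbers for illustration only.

```python
import numpy as np

def cvar(costs, probs, alpha=0.8):
    """CVaR at confidence level alpha (0 < alpha < 1) for discrete cost scenarios."""
    costs = np.asarray(costs, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(costs)
    costs, probs = costs[order], probs[order]

    # VaR: smallest cost whose cumulative probability reaches alpha
    # (the small tolerance guards against floating-point round-off).
    cum = np.cumsum(probs)
    idx = min(int(np.searchsorted(cum, alpha - 1e-12)), len(costs) - 1)
    var = costs[idx]

    # Rockafellar-Uryasev form: VaR plus the scaled expected excess over VaR.
    excess = np.maximum(costs - var, 0.0)
    return var + probs @ excess / (1.0 - alpha)
```

For hypothetical costs of 1000, 1100, 1300, and 1500 with probabilities 0.5, 0.3, 0.1, and 0.1, this gives a CVaR of 1400 at a confidence level of 0.8 and 1500 at 0.9, which mirrors the observation that a higher confidence level concentrates on the costliest scenarios.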

4) DEMAND RESPONSE PROGRAM
DR is an established program for changing consumption patterns in response to dynamic electricity market prices, equipment failures, and desired technical objectives such as Volt/VAR regulation. In this section, the technical and financial impacts of the DR program are investigated. It is assumed that up to 15% of loads are manageable and can be shifted from high-demand hours to off-peak periods. In high-demand hours, the electricity price is usually higher; thus, reducing energy consumption during these hours can reduce the operation costs. Also, using a shiftable load plan, the operator can still satisfy the demand in other low-price hours, guaranteeing both customer satisfaction and lower operating costs. Figs. 7-a, 7-b, and 7-c show the peak shaving achieved by implementing the DR program. The energy consumption is reduced by 12.51%, 12.29%, and 12.22% in phases A, B, and C, respectively, between 5 pm and 10 pm, which are the high-demand and high-price hours. Note that voltage drops can occur during peak-load hours, so the voltage regulators in the distribution feeders operate to compensate for this reduction, as shown in Fig. 7-d. When DR implementation leads to peak shaving in high-demand hours, voltage regulation is no longer a critical problem during these hours. For this reason, traditional regulators such as the transformer tap operate less often, and the number of tap changes decreases, as shown in Fig. 7-d. This strategy also reduces the mechanical stress on the traditional transformer tap, preserving these devices for longer periods and more crucial hours.
Furthermore, the demand modification reduces the risk of load shedding [55] under probable outages or failures at the upstream grid level, as the peak load is minimized. Moreover, it provides operational flexibility [56] for the operators to satisfy other objectives and security constraints in daily operation. One of these important objectives is the expected operation cost, which reaches $1,174.2. This cost is 2.2% lower than in the base case without the contribution of the DR program. Other benefits are discussed in the next section in comparison with the other cases.
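The energy-neutral nature of load shifting can be shown with a simple heuristic. In the paper the shifted amounts are decided by the scheduling optimization; the rule below, which moves a fixed 15% share out of given peak hours and spreads it evenly over the remaining hours, is only an assumed stand-in, as are the function name and the example profile.

```python
import numpy as np

def shift_peak_load(load_kw, peak_hours, shiftable_share=0.15):
    """Move a share of the peak-hour load to the remaining hours, conserving energy."""
    load = np.asarray(load_kw, dtype=float).copy()
    peak = np.zeros(load.size, dtype=bool)
    peak[list(peak_hours)] = True

    # Energy removed from each peak hour (at most the shiftable share of that hour).
    shifted = shiftable_share * load[peak]
    load[peak] -= shifted

    # ASSUMPTION: the removed energy is spread evenly over all off-peak hours.
    load[~peak] += shifted.sum() / (~peak).sum()
    return load
```

For a flat 1000 kW profile with 1400 kW between hours 17 and 21, the peak falls to 1190 kW while the daily energy is unchanged, which is the peak-shaving behavior described above.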

5) TECHNO-ECONOMIC COMPARISONS
In this part, three cases are compared in terms of operation cost, quantified risk, and energy loss. The cases correspond to scheduling the resources 1) without CVR and DR (base case), 2) with CVR and without DR, and 3) with CVR and DR. In case 1, the operation cost, risk, and loss are higher than in the other cases, which implement voltage-dependent load plans, as shown in Table 2. In case 2, implementing CVR with voltage-dependent loads reduces the operation cost and CVaR (risk) to $1,200.7 and $1,222.4, respectively, and reduces the loss by 3.2% simply by adjusting the nodal voltages. Furthermore, implementing DR and CVR together brings more advantages to the system's operation [57], [58]. In this case (case 3), the operation cost and CVaR decrease by 4.52% compared to the base case, and the loss decreases by 119 kWh and 53.7 kWh compared to case 1 and case 2, respectively. These benefits are obtained because the demand level is reduced by saving energy under the CVR plan and the peak portion is shifted to off-peak hours under the DR plan.

C. REAL-TIME CONTROL
The positions of the legacy devices are kept according to the schedule determined for each hour. Furthermore, the decisions associated with the DR program are not changed in the real-time stage. However, the set points of flexible resources such as PV and battery inverters, determined in the scheduling stage with the associated scenarios, may not necessarily satisfy the objectives during real-time operation under new possible scenarios. Therefore, PV and battery inverters are re-dispatched in real time using RL-based control, as these resources are flexible and responsive enough over very short periods [59]. Moreover, the power supplied by the main grid can change, and if it exceeds the expected purchased amount optimized in scheduling, it is subject to a 20% extra payment penalty for real-time deviation. A limited number of voltage measurement devices are assumed to be installed at the nodes where PV and ESS are connected, and the voltage status provided by these measurements enables the control center to adjust the inverters and active powers efficiently.

1) TRAINING OF RL-BASED MODEL
The proposed RL-based controller is trained for 2,000 episodes. Both the actor and critic networks have two hidden layers with 60 neurons in each layer. Adam optimizers with learning rates of 0.001 and 0.002 are used for the actor and critic networks, respectively. Each episode starts by resetting the environment, which is done by choosing random values of PV and loads. The PV values are drawn from a uniform distribution between 40% and 120% of the base values, simulating cloud movement conditions, and the loads are randomly varied between 95% and 105% of the base load. For these randomly chosen values of PV and loads, a detailed three-phase power flow is run using the OpenDSS simulation environment, and the system state is initialized with a selected number of voltage measurements. Based on the given voltage states, the RL agent takes an action by choosing the reactive power of the PV inverters and the active and reactive power of the ESS. The action vector is fed to the reward generator function, which calculates the reward. The weighting factors β_v and β_r in the reward function are set to 100 and 1, respectively. The states, actions, and rewards are stored in a memory buffer. A batch of stored data is drawn from the memory buffer to initiate the learning process of the neural network model. Since the batch size is 120, the model collects 120 samples of states, actions, and rewards before it starts the learning process. Fig. 8 shows the learning curve of the proposed RL-based model, which indicates that the total rewards are very low during the initial episodes. However, as training progresses, the total rewards continuously increase and become almost constant after 500 episodes. The RL controller may need to be retrained in different months or seasons. The frequency of retraining depends on the geographical features and possible uncertainties [60].
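The episode reset and the memory buffer described above can be sketched as follows. The sampling ranges and the batch size of 120 come from the text; the buffer capacity, the choice of one independent random factor per PV unit and per load, and all class and function names are assumptions. The power flow itself (run with OpenDSS in the paper) is outside this sketch.

```python
import random
from collections import deque

import numpy as np

BATCH_SIZE = 120  # batch size reported in the text

def reset_environment(pv_base_kw, load_base_kw, rng):
    """Draw random PV and load values for the start of an episode."""
    pv_base_kw = np.asarray(pv_base_kw, dtype=float)
    load_base_kw = np.asarray(load_base_kw, dtype=float)
    # PV between 40% and 120% of base (cloud movement), loads between 95% and 105%.
    # ASSUMPTION: an independent factor is drawn for every PV unit and every load.
    pv = pv_base_kw * rng.uniform(0.40, 1.20, size=pv_base_kw.shape)
    load = load_base_kw * rng.uniform(0.95, 1.05, size=load_base_kw.shape)
    return pv, load

class ReplayBuffer:
    """Fixed-size memory of (state, action, reward) samples."""

    def __init__(self, capacity=10_000):  # ASSUMPTION: capacity is not given in the text
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward):
        self.buffer.append((state, action, reward))

    def ready(self):
        # Learning starts only once a full batch has been collected.
        return len(self.buffer) >= BATCH_SIZE

    def sample(self):
        return random.sample(list(self.buffer), BATCH_SIZE)
```

In a training loop, `reset_environment` would be called at the start of each episode, and network updates would begin once `ready()` returns true.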

2) TESTING AND IMPLEMENTATION
During the testing and implementation of the trained model, we use 12 test cases with random PV and load patterns, as shown in Fig. 9-a, corresponding to consecutive time slots within a sub-hour period (i.e., twelve 5-minute intervals within an hour) starting at 11 am. The control can also be performed every minute or every few seconds without loss of generality, because the trained model-free algorithm is computationally very efficient in real-time applications. The RL-based real-time control takes only 0.4 to 0.6 seconds to execute each scenario on a 64-bit PC with an Intel Core i5 processor. This low computation time makes RL-based controllers suitable for real-time applications.
The abrupt drop in PV generation between the 5th and 7th sub-hour intervals is thought to result from significant cloud coverage. Fig. 9-b shows the immediate response of the inverters under real-time RL-based control against the scheduled inverter outputs during the worst sub-hour interval (the 7th interval) between 11 am and 12 pm. The ESS output is 149.5 kW, which deviates by only 1.1 kW from the scheduled reference power thanks to the ESS ramp minimization in RL and the reasonable scheduled set point from the previous stage. This deviation can be larger under other contingencies and power shortages [61]. Table 3 compares the average relative voltage deviation (ARVD) obtained using the decisions made by pure scheduling (case 1) and by scheduling complemented with RL-based control (case 2). ARVD decreases for all cases when the RL-based corrective actions are considered in the final stage of the proposed model, compared to the pure scheduling model. This comparison justifies the use of the RL-based controller as a complementary layer of scheduling. Furthermore, Fig. 10 shows the voltage profile of all 99 nodes for 50 new random scenarios with low PV generation. The figure demonstrates the robustness of the model, as the node voltages remain within the ANSI range for all scenarios with the real-time controller and the adopted voltage constraints.

VII. CONCLUSION
In this paper, a multi-stage Volt/VAR support scheme comprising day-ahead risk-aware stochastic scheduling and an RL-based real-time controller has been proposed for economic operation and voltage regulation in three-phase active distribution systems. The forecasting errors of PV generation data are quantified realistically with the Gaussian Mixture Model (GMM) and the Expectation-Maximization (EM) algorithm. The uncertainties associated with the PV, load, and price are represented by generated scenario sets, which are then reduced using the unsupervised fuzzy clustering method. Voltage-dependent load plans such as demand response (DR) and voltage reduction are utilized to reduce the operation cost, power loss, and risk in operation. The load plans also reduce the dependency of grid operation on slow-responding legacy devices, including capacitor banks and transformer taps, which can increase their lifespan.
Furthermore, the conditional value at risk (CVaR) is included to minimize the cost of possible risks by adjusting the set points of resources with higher situational awareness. The set points of legacy devices and load plans are fixed in the scheduling stage and used in the real-time stage. The RL-based real-time controller revises the hourly scheduled set points of inverters and ESS using the voltage states captured by locationally-scarce measurements. The RL-based model is designed to minimize the voltage deviation and ESS ramping in real time under possible cloud movements that cause PV generation drops and uncertain load variations. The proposed scheme is evaluated and compared with different cases on the modified three-phase distribution system.
Regarding future work, the adaptive inverter droop constraints will be included in the optimization model. Also, the stability constraints of inverters can be added to the problem. The real-time model can be improved with model-based reinforcement learning techniques. The power flow formulation can be changed to address the current sharing problem in distribution systems by utilizing equations based on current injection in the linear power flow model [62].

ACKNOWLEDGMENT
This article was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof. This work was supported by the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy (EERE) under the Solar Energy Technologies Office Award Number DE-EE0009022.