Noise-Immune Machine Learning and Autonomous Grid Control

Most recently, stochastic control methods such as deep reinforcement learning (DRL) have proven to be efficient and quickly converging methods for providing localized grid voltage control. Because of the random dynamical characteristics of grid reactive loads and bus voltages, such stochastic control methods are particularly useful for accurately predicting future voltage levels and minimizing associated cost functions. Although DRL can quickly infer future voltage levels given specific voltage control actions, it is prone to high variance when the learning rate or discount factor is set for rapid convergence in the presence of bus noise. Evolutionary learning is also capable of minimizing a cost function and can be leveraged for localized grid control, but it does not infer future voltage levels given specific control inputs; instead, it simply retains those control actions that result in the best voltage control, making it less sensitive to noisy value estimates. For this reason, evolutionary learning is better suited than DRL for voltage control in noisy grid environments. To illustrate this, using a cyber adversary to inject random noise, we compare evolutionary learning and DRL for autonomous voltage control (AVC) under noisy control conditions and show that it is possible to achieve superior mean voltage control using a genetic algorithm (GA). We show that the GA can additionally provide superior AVC relative to DRL with comparable computational efficiency. We illustrate that the superior noise immunity properties of evolutionary learning make it a good choice for implementing AVC in noisy environments or in the presence of random cyber-attacks.


I. INTRODUCTION
It has been shown that a deep reinforcement learning (DRL) [1] agent using deep artificial neural networks (ANNs) [2] can, in many electrical grid control applications, readily learn an optimal action policy by assessing state-action pairs within the grid's operational environment. Recent advances in autonomous grid voltage control (AVC) research have been illustrated primarily using double Q-learning (DQL) [3] or deep deterministic policy gradient (DDPG) RL [4], with slow convergence times of greater than 10,000 training episodes. A more rapidly converging AVC method, which imports target Q-network and random exploration values during training and converges in under 1,000 training episodes, was devised in [5]. While the DRL AVC method explained in [5] reduces the convergence time, the average reward and controlled voltage variance can be high due to Q-value overestimation. The key methodological comparison for AVC between DQL and DDPG centers on how each method learns: DQL uses a value function to select from a discrete set of actions, while DDPG parameterizes a deterministic policy to select a continuous action and therefore does not require discretizing the continuous action space during optimization.
Like DQL, DDPG utilizes the Bellman equation in an off-policy manner to optimize a reward value function but, unlike DQL, it additionally uses the policy gradient method in an on-policy manner to optimize the acting neural network's policy. DQL generally uses epsilon-greedy exploration for selecting discrete actions, while DDPG requires a continuous action space and can therefore utilize sophisticated continuous exploration methods such as an uncorrelated Gaussian or a correlated Ornstein-Uhlenbeck (OU) [6] process. Although DDPG provides greater AVC control stability, its actor-critic interactions come with considerable processing overhead compared to the DQL architecture [5]. Both DDPG and DQL utilize exploration methods that inject noise (i.e., epsilon-greedy or uncorrelated Gaussian), which often leads to sub-optimal tuning of ANN hyperparameters and, in turn, sub-optimal AVC.
Unlike machine learning-based AVC, traditional power distribution grid voltage control methods tend to rely on decentralized control schemes that set the positions of switching capacitor banks and on-load voltage tap changers [7], or the active and reactive power of smart inverters [8]. Deterministic physics-based methods for optimizing voltage control in power distribution systems have also been proposed, but they require knowledge of the system model and can be hard to implement [9].
In this paper, we discuss a noise-immune control policy search method that utilizes a trained ANN for AVC and does not suffer from the effects of sub-optimal hyperparameter tuning. A genetic algorithm is used to produce an optimal policy from which an on-policy ANN is trained to produce low-variance AVC with a superior mean bus control voltage. We compare the use of evolutionary learning with DDPG and DQL for grid AVC, and illustrate its advantages when employed on a medium-scale grid configuration using the IEEE 8500 node model. It is shown that evolutionary learning, as compared to DDPG and DQL, can provide superior noise immunity and rapid convergence in noisy environments.

II. BACKGROUND
The power grid is equipped with electronic meters that measure electrical variables, such as grid bus voltages and line currents, by converting analog signals into digital values that can be stored in computers and transmitted through computer networks. These meters are equipped with transducers, sensors, analog-to-digital converters (ADCs), and algorithms that compose the measurement chain. Each link in this chain contributes to errors in the measurements. For instance, quantization errors in ADCs [10], variations caused by temperature fluctuations, numerical errors in floating-point arithmetic operations, and noise in transducers and electronic circuits [11] all contribute to measurement uncertainty. The electrical industry has developed standards for classifying meters depending on how much error is tolerated. For instance, the American National Standards Institute defines electricity meter classes based on their accuracy, defined as a percentage of true value at full load [12]. Voltage control devices rely on these noisy measurements from intelligent electronic devices within substations or in the field to decide how to operate actuators such as capacitor banks and on-load tap changers, or to control the reactive power output of smart inverters.
An additional source of error is related to the transmission of data from meters in the field to the systems that collect and process meter data. Communication systems might be subject to conditions that cause meter data to be unavailable or significantly delayed, such as bandwidth limitations, availability problems, high latency, and packet loss [13]. Due to the dynamic nature of the power grid, delayed measurements might be an additional source of error for time-sensitive applications, like controls, since the data might not reflect the current state of the system when the measurement is processed [14], [15].
Finally, the increased interconnection of grid Operational Technology with IT networks brought about by the adoption of Smart Grid concepts has enlarged the attack surface available to malicious actors aiming to launch cyberattacks on the power grid [16]. Consequently, biases in measurements caused by cyber-incidents can be an additional source of measurement error.
Because noise riding on grid voltage signals is inherently present, it is especially important that noise immunity properties are designed into the AVC agents. Once noise immunity is built into an agent, it has the dual benefit of also countering external noise introduced by a cyber-adversary. In the following sections of this paper, the two most commonly used DRL algorithms for AVC are compared with an inherently noise-immune approach utilizing evolutionary learning.
The remainder of this paper is organized as follows. In Section III the evolutionary learning AVC methodology devised is explained in detail. Section IV describes the experimental process used and the results, while in Section V the conclusions and future work are discussed.

III. EVOLUTIONARY LEARNING AND CONTROL METHODOLOGY

A. SIMULATED GRID ENVIRONMENT
The OpenDSS [17] electrical grid distribution system simulator was used to study the effectiveness and efficiency of the devised AVC algorithm. Fig. 1 shows the topology of the IEEE 8500 node model [18], with its attributes listed in Table 1. The model contains 4876 buses, 7113 devices, and 8531 total nodes, and, most notably, its bus voltages are to be maintained within a range of [0.8950, 1.0526] p.u. The system also has 12 single-phase voltage regulators (tap changers) located at 4 buses (one of which is the substation) and 9 single-phase switching capacitors.
One effective way to visualize voltage is the daisy chain plot generated by OpenDSS, shown in Fig. 2. Fig. 2 shows how voltage changes as power flows from the substation (distance zero) to the loads through power lines, transformers, and voltage control devices. The voltages of each phase are shown in a different color (red, blue, and black). Solid lines represent the primary distribution system (medium voltage) and dashed lines show voltages on the secondary distribution system (low voltage). The plot shows that voltages stay within, or at least very close to, the accepted voltage limits.

The primary goal of the control model described in this paper is to intelligently adjust inductive reactance using the 12 strategically placed voltage regulators within the model. The task of voltage control devices in power distribution systems is to maintain the voltage magnitude of all system nodes within a predefined range. To do so, they must respond to variations in system voltage levels caused by fluctuations of the electric currents flowing through power distribution system lines and transformers. Voltage control is achieved in this model by modulating the turns ratio of the voltage regulators, adjusting the tap settings of each regulator according to the actions recommended by the control agents discussed in the sections that follow.
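Although the exact regulator model parameters are not restated here, the tap-to-turns-ratio relationship can be illustrated with a small sketch. The helper below is hypothetical, not from the paper; it assumes the common 32-step, ±10% step-voltage regulator (0.625% of nominal per step) and the 1-to-32 tap encoding described in Section III-F.

```python
# Hypothetical helper (not from the paper): maps an encoded tap index (1..32, per
# Section III-F) to a regulator turns ratio, assuming the common 32-step, +/-10%
# step-voltage regulator in which each step changes the ratio by 0.625%.
def tap_to_turns_ratio(tap_index: int) -> float:
    if not 1 <= tap_index <= 32:
        raise ValueError("tap index outside the 32 available positions")
    tap_position = tap_index - 17          # shift to roughly -16 .. +15 around neutral
    return 1.0 + 0.00625 * tap_position    # 0.625% of nominal per tap step

print(tap_to_turns_ratio(1))    # ~0.90, the lowest setting
print(tap_to_turns_ratio(17))   # 1.00, neutral
print(tap_to_turns_ratio(32))   # ~1.09, near the maximum boost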

1) POTENTIAL EFFECTS OF ADVERSARIAL ACTION ON VOLTAGE PROFILE
An attack on the voltage regulators of a distribution feeder can have very serious effects. To demonstrate the damage that could be caused, we ran a simulation in which the voltage regulators go to their lowest setting. The results are shown in III-B, which show that this attack could cause all voltages in the system to go well below the minimum limit of 0.95 p.u., with some voltages reaching almost 0.6 p.u. At such low voltage levels it is very likely that much, if not most, of the electronic equipment on the feeder would malfunction or not work at all.

B. REWARD FUNCTION
AVC methods work by sampling bus voltage levels while the reactive components of the grid load vary. A reward function that includes the bus voltages is maximized in order to maintain grid bus voltages within a specified limit, with the reward being maximized at each time sample i. The reward is maximum if all voltages are within bounds, v_{i,j} ∈ [v^-, v^+]. ANSI standard C84.1 defines Range A of utilization voltage for systems over 600 V as +5% to −2.5% of nominal voltage and Range B as +5.8% to −5% of nominal voltage [16].
In the reward function (2), for example: v_{i,j} is a nodal voltage (for each phase, in per unit); v^+ = 1.05; v^- = 0.975; R_max = 1; α^+ = 1/0.008² = 15,625; α^- = 1/0.025² = 1,600; and N is the number of nodes in the grid. This formulation, however, assumes all bus voltages are available, which is not always the case; N could then incorporate only the subset of buses that are being monitored at a given time. Optimal control is achieved by an agent when the reward is maximized. This reward function provides a general framework for AVC. It imposes a quadratic penalty on the voltage deviation at each bus whenever it goes outside the ANSI C84.1 voltage limits. This nonlinear reward function heavily penalizes large deviations and leverages the capability of the proposed AVC algorithms to handle nonlinear functions. The proposed parameters are chosen so that the reward is the same when voltages reach either limit of ANSI C84.1 Range B.
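The reward expression (2) itself is not reproduced above; the following is a minimal sketch consistent with the parameters just defined (quadratic penalty outside the limits, equal penalty of 1.0 at either ANSI Range B limit). The function name and the averaging over the N monitored nodes are assumptions, not the authors' published formula.

```python
import numpy as np

def voltage_reward(v_pu, v_plus=1.05, v_minus=0.975,
                   alpha_plus=1 / 0.008**2, alpha_minus=1 / 0.025**2,
                   r_max=1.0):
    """Sketch of the quadratic-penalty reward over monitored nodal voltages (p.u.).

    Returns the full reward r_max when every voltage lies in [v_minus, v_plus];
    the penalty equals 1.0 when a voltage reaches either ANSI C84.1 Range B limit
    (1.058 or 0.95 p.u.), as implied by the alpha parameters above.
    """
    v = np.asarray(v_pu, dtype=float)
    over = np.maximum(v - v_plus, 0.0)      # overvoltage deviation
    under = np.maximum(v_minus - v, 0.0)    # undervoltage deviation
    penalty = alpha_plus * over**2 + alpha_minus * under**2
    return r_max - penalty.mean()           # averaged over the N monitored nodes

print(voltage_reward([1.00, 1.01, 0.99]))   # all in bounds -> 1.0
print(voltage_reward([1.00, 1.00, 0.95]))   # one node at the Range B floor -> 1 - 1/3
```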

C. GRID CONTROL SYSTEM
In conducting the simulations (Fig. 4), three different grid AVC agents and a cyber adversary were used in deriving an optimal AVC policy from the set of individual optimal policies P = [π_{θ,1}, π_{θ,2}, π_{θ,3}, π_{θ,4}]. The RL DQL and DDPG agents, over a set number of steps i for each training episode e, input a predicted optimal control regulator tap vector T_{e,d,i} composed of 12 elements corresponding to each voltage regulator d, where i = 1, 2, 3, . . . , I; e = 1, 2, 3, . . . , E; I is the total number of steps per episode; and E is the total number of episodes per simulation. The bus voltage vector V_{e,b,i} is returned after the voltage regulation adjustment T_{e,d,i} is input to the model, with b representing the index of each bus and the reward values R_{e,d,i} being calculated from V_{e,b,i}.
Using the concept of maximum entropy, the Adversary draws T_{e,d,i} values from a uniform random distribution and presents them to the model, while the DDPG and DQL agents utilize stochastic RL to select T_{e,d,i} values. The following sections discuss the details of how the DDPG and DQL agents identify optimal policies. The GA agent utilizes evolutionary learning over a set of generations until a highly fit population is selected; T_{g,d,i} and V_{g,b,i} vectors are processed during the evolution, with g representing the generation index and the reward values R_{g,d,i} being calculated from V_{g,b,i}.
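For concreteness, the agent-environment interaction can be sketched with OpenDSSDirect as below. This is a simplified illustration, not the authors' implementation: the Master.dss path, the shift from the paper's 1-to-32 tap encoding to OpenDSS's signed TapNumber convention, and the reuse of voltage_reward from the previous sketch are all assumptions.

```python
import numpy as np
import opendssdirect as dss

# Assumed path to the compiled IEEE 8500 node model; not the authors' actual script name.
dss.run_command("Redirect ieee8500/Master.dss")
reg_names = dss.RegControls.AllNames()              # the system's voltage regulators

def apply_taps_and_measure(tap_vector):
    """Apply one tap-setting action T_{e,d,i} and return the bus voltage vector V_{e,b,i}."""
    dss.RegControls.First()
    for tap in tap_vector:
        dss.RegControls.TapNumber(int(tap) - 17)    # 1..32 encoding shifted to signed taps (assumed)
        dss.RegControls.Next()
    dss.Solution.Solve()                            # re-solve the power flow
    return np.array(dss.Circuit.AllBusMagPu())      # per-unit node voltage magnitudes

# One training step: an agent proposes a tap vector, the adversary perturbs it,
# and the reward is computed from the resulting bus voltages.
proposed = np.random.randint(1, 33, size=len(reg_names))                    # agent action
perturbed = np.clip(proposed + np.random.randint(-2, 3, size=proposed.shape), 1, 32)
voltages = apply_taps_and_measure(perturbed)
reward = voltage_reward(voltages)                   # reward sketch from Section III-B
```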

D. DOUBLE Q LEARNING
Q-learning establishes viable policies by optimizing a cumulative future reward using a value function shown in (3) below.
where γ ∈ [0, 1] is a discount factor that, when set to 0, considers no future rewards and, when set to 1, considers the maximum number of future rewards when calculating the Q-value Q^π(s, a) at a specific state s with a given action a under a given policy π. The maximum Q-value, max Q^π(s, a), is selected at each state, and the optimal policy π* consists of those state-action pairs with the corresponding max Q^π(s, a) values. In a temporal sense the Q-value function is defined by (4) and (5), where after action A_t is initiated in state S_t, the reward is R_{t+1} and the next state is S_{t+1}; Y_t is a target value and α is the step size. The goal with each update to Q(s, a; θ_t) is to converge on the target value Y_t. A deep Q-network (DQN) consists of a deep ANN that, for each state s, outputs a vector of action values Q(s, [a_1, . . . , a_o]; θ), where θ denotes the network weights and hyperparameters. There are two separate, identical ANNs: an online network and a target network, with the target network's parameters designated θ_targ. The ANN parameters θ are transferred from the online network to the target network every several steps σ. The target for the DQN is shown in (6).
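For reference, since equations (3) through (6) are not reproduced above, the standard Q-learning forms corresponding to the definitions given (Y_t, α, γ, θ_targ) are supplied here; the correspondence to the original numbering is assumed rather than quoted:

Q^π(s, a) = E[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + . . . | S_t = s, A_t = a ]   (cf. (3))

Q(S_t, A_t) ← Q(S_t, A_t) + α [ Y_t − Q(S_t, A_t) ],   Y_t = R_{t+1} + γ max_a Q(S_{t+1}, a)   (cf. (4), (5))

Y_t^DQN = R_{t+1} + γ max_a Q(S_{t+1}, a; θ_targ)   (cf. (6))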
Experience replay [20] is used to enhance target ANN training: archived transitions are sampled randomly from the replay archive, and the target network is further trained with those samples. Double Q-learning (DQL) [20] uses two separate value functions and randomly updates the online network with one value function and the target network with the other. In DQL the two separate ANNs work together to reduce overestimation by letting the online network find the best action and the target ANN evaluate it. A greedy policy along with a value function is used by the online network, and the target network's value function further refines the online network's estimate. This mitigates the Q-value overestimation problem to a degree by separating the max operators in (4) and (5) and preventing the same value from being used for both action selection and policy evaluation. Although DQL is more robust than DQN against overestimation caused primarily by noise and imprecise function approximation, overestimation error still exists, which in the grid voltage control application ultimately elevates voltage regulation variance.
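The decoupling described above corresponds, in the standard double-DQN formulation (supplied here because the original equation is not reproduced), to the target

Y_t^DQL = R_{t+1} + γ Q( S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_targ ),

in which the online parameters θ_t select the action and the target parameters θ_targ evaluate it, so the same, possibly noise-inflated, estimate is never used for both selection and evaluation.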

E. DEEP DETERMINISTIC POLICY GRADIENT LEARNING
As mentioned previously, DDPG [21], [22], [23], [24], [25], [26] parameterizes a deterministic policy to select a continuous action and therefore does not carry the overhead of optimizing a discretized version of the continuous action space. DDPG utilizes Bellman equation predictions to optimize a value function and the policy gradient method to optimize the acting policy. Beyond the deterministic policy itself, exploration is further driven by an uncorrelated Gaussian or a correlated OU process. DDPG forms and updates a deterministic on-policy actor policy π(s) = μ(s; θ^π), where μ(s; θ^π) is the on-policy neural network mapping, and θ^π is iteratively updated until the Q-function Q(s, a) is efficiently maximized over the continuous state-action space. The off-policy critic's neural network Q_θ(s, a) is updated during training in the same manner as in Q-learning, using the Bellman equation,
with sampling occurring from the replay buffer L, and with the policy gradient update

∇_{θ^π} J(θ^π) = E_{s∼L}[ ∇_a Q_θ(s, a)|_{a=μ(s;θ^π)} ∇_{θ^π} μ(s; θ^π) ].   (9)

The exploration policy's use of an uncorrelated Gaussian or OU process greatly speeds up convergence. Formulating an efficient learning strategy and exploration method is challenging when we consider the many existing methods beyond the simpler, most commonly used ones described above for DDPG. Continuous learning, for example, is a learning methodology based on continually restructuring the exploration algorithm according to immediate periodic state transitions [27], [28], [29]. Such methods center on astutely adjusting value-function and loss-function related hyperparameters dynamically throughout the search process [30], [31]. Other exploration approaches involve maximizing variation and the use of compression [32], [33], [34], which allows the agent to discover patterns in the reward distributions that drive shaping of the rewards. Shaping the rewards is often subjective and can deviate greatly from the environment's true reward over the exploration period [35]. Besides the commonly used Gaussian or OU exploration methods in DDPG, there are frequency, value-function variability, and priority-based exploration methods [36], [37], [38] that can be implemented within the DDPG framework; however, such methods add considerable computational time complexity over simpler stochastic exploration methods such as Gaussian and OU.
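As an aside, the correlated OU exploration noise referenced above is straightforward to generate. A minimal discrete-time sketch is shown below; the parameter values are illustrative, not those used in this work.

```python
import numpy as np

def ou_noise(n_steps, n_dims, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=0):
    """Discrete-time Ornstein-Uhlenbeck process used for correlated DDPG exploration.

    x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * N(0, 1)
    """
    rng = np.random.default_rng(seed)
    x = np.zeros((n_steps, n_dims))
    for t in range(1, n_steps):
        x[t] = x[t - 1] + theta * (mu - x[t - 1]) * dt \
               + sigma * np.sqrt(dt) * rng.standard_normal(n_dims)
    return x

# Example: temporally correlated exploration noise added to 12 continuous tap actions.
noise = ou_noise(n_steps=32, n_dims=12)
```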
To reduce the reward-signal noise over each training episode, the approach used in this research combined a Gaussian exploration function with a noise-reducing autoencoder (NRA) [2] that filters the reward signal over each training episode. Gaussian noise was injected at each training step to facilitate rapid exploration, and the NRA consists of the neural network layers in Fig. 5, where x = R_{e,d,i} and z is a compressed representation of x. On successive training episodes, the reward signal R_{e,d,i} is filtered to exclude reward value outliers. The autoencoder generalizes and encodes the input data as shown in Fig. 5, compressing P(z|x) into a lower-dimensional feature space while simultaneously removing noise. The decoder portion of the autoencoder reconstructs the input data from the generalized P(z|x) back to its original form P(x|z), minus the reward-signal noise, together with the associated tap values T_{e,d,i}. To add optimization stability on successive training episodes, Q(s, a) is calculated using the previously filtered R_{e,d,i} and T_{e,d,i} values.
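Since the layer dimensions of Fig. 5 are not reproduced here, the following Keras sketch shows only the general shape of such a noise-reducing autoencoder; the layer widths, bottleneck size, and training data are illustrative assumptions rather than the authors' configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_nra(signal_len, bottleneck=8):
    """Noise-reducing autoencoder sketch: compress the per-episode reward trace x
    into a small latent z (P(z|x)), then reconstruct it (P(x|z)), with the
    bottleneck smoothing out reward outliers."""
    inputs = keras.Input(shape=(signal_len,))
    h = layers.Dense(32, activation="relu")(inputs)
    z = layers.Dense(bottleneck, activation="relu", name="z")(h)    # compressed representation
    h = layers.Dense(32, activation="relu")(z)
    outputs = layers.Dense(signal_len, activation="linear")(h)      # reconstructed reward trace
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

nra = build_nra(signal_len=32)
reward_traces = np.random.rand(256, 32)            # illustrative stand-in for recorded R_{e,d,i}
nra.fit(reward_traces, reward_traces, epochs=5, batch_size=32, verbose=0)
filtered = nra.predict(reward_traces, verbose=0)   # smoothed rewards reused in the Q(s, a) update
```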

F. EVOLUTIONARY LEARNING
Evolutionary learning algorithms [39], [40], [41], [42], [43] are global optimization algorithms inspired by biological evolution in which an initial set of candidate solutions is generated and iteratively updated. Each stochastically generated new generation retains the most fit solutions, and the least fit solutions are removed from each generation. Fig. 6 illustrates the basic evolutionary learning process used by the AVC control system's GA agent, as described in the steps that follow.
1. The GA agent selects each member of the initial population from a uniform random distribution, where each individual is a voltage regulator tap control vector T_{g,d,i}, and inputs the population to the grid model, which returns the grid bus voltages V_{g,b,i} from which the reward metric R_{g,i} is calculated using (2).
2. The GA agent selects the T_{g,d,i} subpopulation that yields the highest R_{g,i} values.
3. Crossover and mutation are performed iteratively on the selected T_{g,d,i} subpopulation, resulting in a final subpopulation of T_{g,d,i} that is the optimal control set ψ_4, yielding the highest set of rewards and the optimal policy π_{θ,4}.

There are several considerations when selecting a fitness function for choosing the T_{g,d,i} individuals. Two activities occur during the selection process: first, R_{g,i} is calculated using (2), and then a selection algorithm is used to ensure adequate population diversity and a convergence that yields an optimal population of T_{g,d,i} individuals. Several selection algorithms are available, but two commonly used and notable ones are fitness-proportionate selection with a roulette wheel and sigma scaling. In fitness-proportionate selection with a roulette wheel, every individual is assigned a slice of a circular roulette wheel, with the size of the slice proportional to the individual's fitness. On each spin of the wheel, the individual under the wheel's marker is selected for the next generation. Because this selection algorithm is stochastic, the selected population's diversity at convergence will be the statistically expected number of the most fit individuals; however, if the diversity of the initial population is low, convergence will occur rapidly with a suboptimal set of individuals selected. In sigma scaling, an individual's expected value is a function of its fitness, and the next generation is selected based on the current population's mean and standard deviation.
At the beginning of a run, when the standard deviation of fitness is typically high, the fitter individuals will not be many standard deviations above the mean and will not be allocated many offspring. Later in the run, when the population is more converged and the standard deviation is typically lower, the fitter individuals will stand out more from the mean which allows evolution to continue with an optimal set of individuals selected at convergence.
The selection process needs to be balanced against the variation introduced by crossover and mutation. Each candidate solution is a vector of 13 voltage regulator tap-setting values, with each element taking an integer value from 1 to 32, and there are no constraints on the number of tap positions a voltage regulator can change at each action. Each T_{g,d,i} vector is considered a chromosome on which crossover and mutation operations occur. Prior to crossover, the integers are encoded into binary format. During crossover the agent randomly chooses a locus within each chromosome and exchanges the subsequences before and after that locus between two chromosomes to create two offspring. In mutation the agent randomly flips some of the bits in a chromosome. Selection must be balanced with the variation from the crossover probability λ and mutation probability µ to strike the proper balance between exploitation and exploration. The use of very high selection probabilities will result in a suboptimal, highly fit population of individuals, reducing the diversity needed for further change and progress, while very low probabilities will result in a slow evolution process.
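The paper implements this loop with Python's DEAP package (Section IV-A). The following is a condensed sketch under assumed settings: the population size, crossover and mutation probabilities, the binding to the grid reward through the earlier apply_taps_and_measure and voltage_reward sketches, and the use of the integer-coded chromosome directly (rather than the binary encoding described above) are all placeholders.

```python
import random
from deap import base, creator, tools, algorithms

N_REGULATORS, N_TAPS = 13, 32        # chromosome length and tap positions from Section III-F

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_tap", random.randint, 1, N_TAPS)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.attr_tap, N_REGULATORS)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

def evaluate(individual):
    """Fitness = grid reward R_{g,i}: apply the taps, solve, and score the voltages."""
    voltages = apply_taps_and_measure(individual)      # sketch from Section III-C
    return (voltage_reward(voltages),)                 # DEAP expects a tuple

toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxOnePoint)                          # one-point crossover
toolbox.register("mutate", tools.mutUniformInt, low=1, up=N_TAPS, indpb=0.05)
toolbox.register("select", tools.selRoulette)                       # fitness-proportionate selection

pop = toolbox.population(n=5000)                                    # initial population (5K assumed)
pop, _ = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.1,
                             ngen=40, verbose=False)
best = tools.selBest(pop, k=1)[0]                                   # optimal tap control vector
```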

G. COMPARING OPTIMIZATION TECHNIQUES
Theoretical computational time complexities for the DQL, DDPG, and GA algorithms are listed in Table 2 [44], [45]. The complexity listed for the genetic algorithm assumes one-point mutation and crossover with a roulette wheel selection algorithm. The time complexities shown in Table 2 indicate that the number of samples n used to train DQN and DDPG is the dominant time complexity factor. The GA has multiple factors g, p, and m, where in practice the dominant time complexity factor is the initial population size p. Because the GA's worst-case time complexity is a degree-3 polynomial, compared to degree 2 for DDPG and DQN, it is important to carefully select the initial population and selection algorithm so that the number of generations the GA needs to reach convergence is minimal.

As will be discussed in the results in Section IV, a GA, like DDPG, can efficiently optimize to a high convergence reward value in a continuous environment and provides stable AVC with a great deal of noise immunity.

IV. EXPERIMENTAL PROCESS AND RESULTS

Fig. 7 and Table 3 list and describe the components of the grid distribution system simulator used for the research. The DDPG, DQL, and GA agents and the cyber adversary were all implemented using the Python language and the Python package OpenDSSDirect.

A. COSTS AND COMPONENTS USED
The agents accessed the OpenDSSDirect application programming interface to gather the bus voltage vectors V_{g,b,i} from which the reward R_{e,d,i} is calculated. Based on the reward, the optimal tap vectors T_{g,d,i} were then transferred back to the OpenDSS simulation engine's resident IEEE 8500 node model instance to gather the corresponding V_{g,b,i} state change. Optimal policies were derived at the end of the predetermined series of episodes, as discussed in Sections IV-B and IV-C.
The DDPG and DQL agents' deep neural networks were implemented using the Python TensorFlow and Keras machine learning packages. The GA agent utilized Python's DEAP package to implement the functions described in Section III-F. The cost of materials was limited to the simulation workstation itself, as the Python development tools and software packages were open source. Table 4 below lists the DDPG and DQL agent neural network hyperparameters of Fig. 4, where each agent was trained over 500 episodes. Through iterative experimentation with different numbers of training episodes, it was found that 500 training episodes yielded the highest mean reward with the minimal number of training episodes. Additionally, extensive parameter tuning was performed with different PCNN, TNN, and CONN hidden layer sizes, gamma values, learning rates, dropout rates, and replay buffer sizes. Table 4 represents the parameter configuration that yielded the highest mean reward over the 500 training episodes.
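As a structural illustration only, a PCNN-style feed-forward network of the kind described in Section IV-B (six ReLU hidden layers with dropout, trained with the Adam optimizer) might be assembled as below; the layer width, dropout rate, and learning rate are placeholders for the Table 4 values, which are not reproduced here.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_pcnn(n_inputs, n_regulators, hidden=64, dropout=0.1, lr=1e-3):
    """Policy capture network sketch: observed bus voltages in, one tap action per regulator out."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_inputs,)))
    for _ in range(6):                               # six ReLU hidden layers (Section IV-B)
        model.add(layers.Dense(hidden, activation="relu"))
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(n_regulators, activation="linear"))   # one tap setting per regulator
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss="mse")
    return model

pcnn = build_pcnn(n_inputs=8531, n_regulators=12)    # node and regulator counts from Table 1
```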

B. EXPERIMENTS
In Table 4, the policy capture networks (PCNN) of the DDPG and DQL agents both contain 6 hidden layers with rectified linear unit (ReLU) activation functions and use the ADAM optimizer to avoid vanishing gradients and to control the rate of gradient descent such that there is minimal oscillation as it approaches the global minimum. Learning and dropout rates were carefully selected to ensure that learning progressed at a nominal rate with minimal probability of overfitting the data. To measure the respective policy optimization search capabilities of DDPG and DQL, experiments were run with the number of steps varied between 16 and 32 training steps per episode. Additionally, experiments were conducted in which the GA described in Section III-F was given initial populations of 5K and 10K individuals for finding an optimal control policy. After comparing the mean rewards over the training periods as shown in Table 4, the policy derived by the optimization algorithm and hyperparameter combination with the highest observed mean reward was chosen. A deep feed-forward NN AVC controller was subsequently trained with the selected optimal policy. The results of the experimentation are discussed in the following section. Figs. 8, 9 and 10 show a rise in reward value as the agents learn the set of optimal voltage regulator tap settings while the adversary described in Section III-C continually injects perturbing regulator tap control setting T_{e,d,i} values. The agents in effect form optimal control policies that simultaneously counteract the adversary's actions by forming mean filtered search trajectories. Table 5 summarizes the conducted experiments and is intended to illustrate the isolated training results of the three AVC agents when uniform randomly generated tap control injections are made to the IEEE 8500 model.
The regulator tap control injections in this research represent control noise injected by a cyber adversary. When considering agent performance, the primary goal is to achieve the highest mean reward with relatively low CPU runtime and reward standard deviation. We chose to train the three agents over 500 training episodes, as this was found to be the smallest number of training episodes at which the reward curves leveled off; training past 500 episodes did not yield any further gain in maximum reward value. When considering these tradeoffs, experiments g46 (GA agent) and r41 (DQL agent) are notable. Table 6 summarizes the training results illustrated in Figs. 8, 9 and 10. Because the DQL and DDPG agents use predictive value functions (4), (5) and (8) to guide the search for optimal control values, their learning begins much earlier in the training period than the GA agent's. Because the DDPG agent utilizes the predictive value function (8) and the gradient-based convergence of (8) and (9), the DDPG agent required the fewest seconds to reach its maximum reward of all the agents. Although the GA agent in Tables 4 and 5 converged more slowly than the DDPG and DQL agents, the mean reward of the GA agent over the training period was the highest of all agents, at 4.099 in experiment g23. The GA agent's stochastically guided fitness-proportionate selection convergence model achieved a higher mean reward because the DDPG and DQL agents relied on exploration methods (i.e., epsilon-greedy or uncorrelated Gaussian noise) that produced sets of sub-optimal trajectories and policies and sub-optimal tuning of the PCNN hyperparameters.
Weighing CPU runtime, reward standard deviation, and mean reward, the GA agent (experiment g46) exhibited the best balanced tradeoff performance compared to the other agent configurations in Table 4. The GA agents exhibited the highest mean reward with relatively low reward standard deviation, with the drawback that convergence and learning rate lagged those of the DRL agents. This lag occurs in the GA agents because the entire initial population (5K or 10K) of T_{g,d,i} vectors in the first generation is presented to the simulation environment. By the 40th generation, through selection and elimination of the least fit, the number of T_{g,d,i} vectors has substantially decreased and learning accelerates dramatically. For this reason, the constraining factor when using evolutionary methods for AVC is the size of the initial population to be used such that the tradeoff among the agent's runtime, mean reward, and reward standard deviation is optimized.
Using the g46 GA control policy, a deep feed-forward neural network controller was trained. A control simulation was subsequently run in which the cyber adversary randomly injects perturbing T_{e,d,i} values while the trained neural network controller performs AVC on the IEEE 8500 node model. Figs. 11, 12 and 13 show the mean uncontrolled and controlled bus voltage values observed over a 3000-second simulation period for the GA-policy controller as well as for the DQL and DDPG agents, and Table 7 summarizes the uncontrolled and controlled mean bus voltages shown in these figures.
As Table 7 and Fig. 11 illustrate, the IEEE 8500 model bus voltage regulation requirement of [0.8950, 1.0526] p.u. is maintained by the NN AVC controller trained with the GA policy, with the GA AVC controlled mean bus voltage having the smallest delta from the regulation mean of 0.974 p.u.

V. CONCLUSION AND FUTURE WORK
In this paper, we have shown that inherent signal noise, with the addition of cyber attacker perturbations, within an electrical grid can prevent full convergence of the most commonly used DRL AVC algorithms, resulting in sub-optimal control policies. An alternative AVC approach using evolutionary learning was illustrated that converges less rapidly than the most commonly used DRL methods but ultimately converges with a higher mean reward and relatively low reward standard deviation. It was shown that, weighing CPU runtime, reward standard deviation, and mean reward, a GA agent exhibited a better balanced tradeoff performance than the commonly used DRL agents. Furthermore, it was illustrated that, using the optimal GA agent's derived AVC policy, it was possible to train an AVC neural network controller to properly regulate the bus voltages of the IEEE 8500 node distribution system while the system was experiencing normally occurring inherent voltage signal noise and random perturbations from a cyber-adversary.
As discussed in Sections III-G and IV-B, the use of a large initial population of control-set individuals results in a high mean reward, a low reward standard deviation, and a superior AVC policy. However, computational time complexity grows with the initial population size; thus, the use of large initial populations can be operationally limiting. Future research will focus on the use of parallel computing methods coupled with further research into GA optimization techniques.

ACKNOWLEDGMENT
Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.