Role of reinforcement learning for risk‐based robust control of cyber‐physical energy systems

Critical infrastructures such as cyber‐physical energy systems (CPS‐E) integrate information flow and physical operations that are vulnerable to natural and targeted failures. Safe, secure, and reliable operation and control of CPS‐E is critical to ensure societal well‐being and economic prosperity. Automated control is key for real‐time operations and may be mathematically cast as a sequential decision‐making problem under uncertainty. The emergence of data‐driven techniques for decision making under uncertainty, such as reinforcement learning (RL), has led to promising advances for addressing sequential decision‐making problems for risk‐based robust CPS‐E control. However, existing research challenges include understanding the applicability of RL methods across diverse CPS‐E applications, addressing the effect of risk preferences across multiple RL methods, and developing open‐source domain‐aware simulation environments for RL experimentation within a CPS‐E context. This article systematically analyzes the applicability of four types of RL methods (model‐free, model‐based, hybrid model‐free and model‐based, and hierarchical) for risk‐based robust CPS‐E control. Problem features and solution stability for the RL methods are also discussed. We demonstrate and compare the performance of multiple RL methods under different risk specifications (risk‐averse, risk‐neutral, and risk‐seeking) through the development and application of an open‐source simulation environment. Motivating numerical simulation examples include representative single‐zone and multizone building control use cases. Finally, six key insights for future research and broader adoption of RL methods are identified, with specific emphasis on problem features, algorithmic explainability, and solution stability.

physical attack surface (Rao et al., 2016). These vulnerabilities may exist in the cyber space or physical space and could be exploited/targeted by natural and/or man-made threat agents, leading to potentially catastrophic losses. As a result, safe, secure, and reliable operation and control of CPS-E is critical for ensuring societal well-being and economic prosperity.
Essential features of a CPS-E include (Alur, 2015): (1) reactive and real-time computation: continuous interaction with the environment through inputs/outputs, including timing delays and timing-dependent coordination protocols; (2) concurrency: multithreaded information execution and parallel information exchange to meet compute goals; (3) feedback control: measurement via sensors and influence via actuators; and (4) safety-critical operations: assurance for detecting errors and ensuring reliable operations. Based on these system properties, and from a safe and secure operability/control standpoint, a critical emerging research challenge is the seamless coupling of closed-loop dynamics with discrete actions of networked components, especially under hazard and information uncertainty (Liu et al., 1999; Lygeros et al., 2008; Cardenas et al., 2009; Kim et al., 2013).
While CPS-E representations (such as the open hybrid automata model abstraction in Heracleous et al., 2017) enable large-scale system simulation, managing reliable, safe, and secure automated control operations continues to be an ongoing research pursuit from both mathematical formalism and computational architecture viewpoints. Per probabilistic risk analysis principles, a performance-based CPS-E consequence assessment approach may comprise interconnected computational phases (Kaplan & Garrick, 1981; Cornell & Krawinkler, 2000; Garrick, 2008; Chatterjee et al., 2021): (1) system modeling, (2) hazard intensity analysis, (3) engineering parameter analysis, (4) damage analysis, and (5) loss exceedance analysis. Loss exceedance analysis may yield system responses that are useful for resource allocation optimization and decision support. While estimating CPS-E loss exceedance functions is a challenging research problem due to hybrid system dynamics and hazard and information uncertainties, it also presents opportunities to develop decision support capabilities for robust CPS-E control.
Decision support for efficient system response and recovery in this context may be mathematically formulated as a sequential decision-making (SDM) problem under uncertainty. In the artificial intelligence (AI) literature (Kochenderfer, 2015; Russell & Norvig, 2016), sequential decision making refers to problems of reasoning under uncertainty where decision agents select actions based on knowledge and observations of the system-of-interest in a dynamic setting. Typically, the system may produce stochastic outcomes based on operational conditions. Also, action choices of decision agents influence the system state over time. The goal for a decision agent is then to determine optimal actions under different system states. Reinforcement learning (RL), a machine learning paradigm, is designed to solve such sequential decision-making problems (Sutton & Barto, 2018) and is appropriate in a CPS-E context to facilitate automated decisions via computational agents. RL is a type of sequential learning method that involves repeated interactions between a decision agent and an environment and focuses on mapping system states to optimal actions that maximize an aggregated numerical reward or payoff over time. This learning process often involves balancing a tradeoff between exploration (trying new courses of action that may yield favorable outcomes) and exploitation (pursuing known actions based on prior experience and current knowledge) to maximize cumulative reward in the long run. RL approaches are designed specifically for solving Markov decision process (MDP) problems with incomplete or unknown system dynamics; an MDP satisfies the Markov property that the future outcome is independent of the past given the present state.
While RL-based solutions are promising for addressing SDM problems, with recent success in gameplay settings (Silver et al., 2016), practical challenges associated with real-world CPS-E in the energy domain, including safety, observability, dimensionality, sample efficiency, and explainability, still need to be addressed before wider deployment in practice (Dulac-Arnold et al., 2019). In line with the challenges above, key research questions addressed in this article include: (1) What RL methods are appropriate under diverse real-world SDM problem settings, with a focus on cyber-physical energy system control? (2) How do risk preferences (i.e., risk-averse, risk-neutral, and risk-seeking) of a decision agent influence the performance of multiple RL methods in a risk-based robust CPS-E control setting? and (3) Can open-source simulation platforms be developed for experimentation with RL algorithms within a cyber-physical energy system control context?
This article represents an important contribution to how AI-based methods can be utilized for risk-informed decision analysis, particularly for robust learning-based cyber-physical energy system control. The novel contributions of this work addressing the research questions above include: (1) systematic mapping and suitability analysis of RL methods to SDM problem features with an emphasis on CPS-E control, (2) performance evaluation and comparison of model-free, model-based, and hybrid model-free and model-based RL algorithms under different risk specifications (i.e., risk-averse, risk-neutral, and risk-seeking) with a representative multizone cyber-physical building system control example, and (3) development of an open-source OpenAI Gym simulation environment for experimentation and reproducibility of results with deep RL methods.
The rest of the article is organized as follows. In Section 2, a brief background on CPS-E risk analysis is presented first, followed by a discussion of four types of RL methods (model-free, model-based, hybrid model-free and model-based, and hierarchical) and their applicability in developing risk-based robust control of cyber-physical energy systems. Section 3 presents a motivating numerical example of a risk-based RL approach for cyber-physical building system control. Following that, in Section 4, numerical results are discussed, and in Section 5, key insights for future research on the applicability of RL methods for cyber-physical energy system control are summarized. Section 6 contains concluding remarks.

Background
CPS-E risk analysis typically may consist of interconnected computational phases (i.e., system modeling, hazard intensity analysis, engineering parameter analysis, damage analysis, and loss exceedance analysis) based on well-grounded foundations of quantitative probabilistic risk assessment (Kaplan & Garrick, 1981) and performance-based engineering principles as described in Cornell and Krawinkler (2000), Audigier et al. (2000), Porter (2003), Moehle and Deierlein (2004), Baker and Cornell (2006), Günay and Mosalam (2013), and Chatterjee et al. (2021). System modeling focuses on abstractions of an embedded system (C) that can be formalized via data-driven and/or physics-based methods. This includes discrete concurrent components operating at multiple modes interacting with continuous dynamics in the physical environment. Hazard intensity analysis generates measures of intensity (IM) associated with varying hazard levels for a given CPS-E model abstraction (C). Engineering parameter analysis calculates the response of a system, in terms of an engineering parameter (EP) variability, for a given IM. System damage analysis produces measures of damage (DM) to system elements using EP variability and fragility functions. Finally, loss/performance exceedance analysis involves using these DMs to develop probabilistic estimates of loss (e.g., system performance degradation) that will serve as decision variables/functions (DV) for decision making.
Translating the computational phases above into an analytical approach, while characterizing uncertainty at each phase and propagating it probabilistically, leads to the mathematical formulation presented in Equation 1 below (Audigier et al., 2000; Cornell & Krawinkler, 2000; Chatterjee et al., 2021; Porter, 2003; Baker & Cornell, 2006; Günay & Mosalam, 2013; Moehle & Deierlein, 2004):

g[DV | C] = ∫∫∫ g[DV | DM] p[DM | EP] p[EP | IM] p[IM | C] dIM dEP dDM    (1)

where g[X | Y] is P(X > x | Y); p[X | Y] refers to the probability density of X given Y; C represents a cyber-physical system; IM is a hazard-based intensity measure; EP is an engineering system parameter; DM is a hazard-induced damage measure associated with the system; and DV is a loss-related decision variable for response and recovery decisions.
In the context of CPS-E and the mathematical formulation in Equation 1 above, an example system C may represent a multizone building with supervisory control (Du et al., 2021; Yu et al., 2020). The hazard-based intensity measure IM may then correspond to the arrival rate of malicious cyber-based energy demand requests (Bhattacharya et al., 2019). The engineering system parameter EP may correspond to operational thresholds of the control system (e.g., thermal comfort limits for a building controller). The damage measure DM corresponds to energy inefficiencies incurred due to malicious energy demand requests and/or system faults (e.g., deviation of indoor air temperature from the thermal comfort limits in a building). Finally, a loss-related decision variable DV may comprise economic costs incurred due to operational inefficiencies and system degradation. Analyzing CPS-E risks can feed into risk-based SDM under uncertainty based on optimization and learning paradigms such as adaptive control, RL, and Bayesian optimization. In this article, we focus on the suitability of multiple RL paradigms for risk-based decision support in the context of a CPS-E.
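To make the chain in Equation 1 concrete, the following sketch propagates uncertainty through the IM → EP → DM → DV phases by Monte Carlo sampling. The conditional samplers (sample_im, sample_ep, sample_dm, sample_dv) and their distributions are hypothetical placeholders for calibrated hazard, response, fragility, and loss models of a specific CPS-E; only the structure of the loss-exceedance estimate follows Equation 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conditional samplers for one CPS-E configuration C.
# Each stands in for a calibrated hazard, response, fragility, or loss model.
def sample_im():            # hazard intensity (e.g., malicious request arrival rate)
    return rng.exponential(scale=2.0)

def sample_ep(im):          # engineering parameter response given intensity
    return rng.normal(loc=0.5 * im, scale=0.1)

def sample_dm(ep):          # damage measure given response (fragility)
    return rng.gamma(shape=1.0 + abs(ep), scale=1.0)

def sample_dv(dm):          # loss-related decision variable given damage
    return rng.lognormal(mean=np.log(1.0 + dm), sigma=0.25)

# Monte Carlo estimate of the loss-exceedance function g[DV > x | C].
samples = []
for _ in range(10_000):
    im = sample_im()
    ep = sample_ep(im)
    dm = sample_dm(ep)
    samples.append(sample_dv(dm))
samples = np.asarray(samples)

for x in np.percentile(samples, [50, 90, 99]):
    print(f"P(DV > {x:.2f} | C) ≈ {(samples > x).mean():.3f}")
```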
Figure 1 encapsulates the overarching computational phases (as described in Equation 1) of CPS-E risk analysis and RL-based decision support within a feedback loop. Note that the RL module in this figure leverages the flow of information associated with system state transition and impact into the decision support engine, along with optimal action recommendations back to the system.
The feedback between CPS-E risk analysis and RL-based decision support may be contextualized through an agent-based setting. For example, a blue agent (operator) may seek to implement RL-based optimal action recommendations to minimize risk. Conversely, a red agent (adversary) may seek to poison or tamper with the system impact information (generated through risk analysis) used by the RL algorithm, leading to suboptimal action recommendations. The risk preference of the blue agent, in terms of utility functions (capturing red agent effects on the system) for learning optimal actions, may then guide the selection of the RL algorithm and influence the recommended action outcomes. To the best of our knowledge, identifying the need and suitability of RL methods remains an essential task in the CPS-E risk literature to foster broader real-world adoption for energy systems. So far, there is no all-purpose RL method that is applicable to all types of energy system control problem settings. In the sections below, we first present a brief review of RL, followed by an overview and comparison of four categories of RL methods: (1) model-free, (2) model-based, (3) hybrid model-free and model-based, and (4) hierarchical.

A brief review of RL
RL is a machine learning paradigm and branch of artificial intelligence that focuses on learning how to map situations to actions by trial and error to maximize a reward signal. A learning agent discovers optimal actions by experimentation and through responses to system feedback. RL is specifically designed for solving MDP problems with incomplete or unknown system dynamics. In an MDP, at each state, the agent takes an action and receives a reward from the environment. Then the agent moves to the next state with a certain probability and repeats the above process until reaching the end of the time horizon. The goal of the agent is to maximize the total accumulated reward. An MDP is composed of five fundamental elements: (1) a set of environment states s ∈ S; (2) a set of actions a ∈ A; (3) a reward function r; (4) the probability p(s′, r | s, a) that describes the transition from state s, after taking action a, to the next state s′ with reward r; and (5) a discount rate γ such that 0 ≤ γ < 1, which refers to the extent to which future rewards are assumed to be valued lower compared to immediate rewards.

FIGURE 1 CPS-E risk analysis and reinforcement learning-based decision support feedback. (Note: The CPS-E dynamics illustrated here are based on an open hybrid automata abstraction as described in Heracleous et al., 2017.)
A key input for solving an MDP is the transition probability, which describes the environment dynamics. With a known state transition probability model, the MDP can be formulated explicitly as a stochastic dynamic program and solved analytically using Bellman's optimality equation, given by:

q*(s, a) = Σ_{s′, r} p(s′, r | s, a) [r + γ max_{a′} q*(s′, a′)]    (2)

where q* is the optimal action-value function that captures the accumulated reward under the optimal policy when starting from state s and action a. Nevertheless, in many real-world CPS-E, an accurate estimation of the state transition probabilities is not a trivial task due to hazard and system-related uncertainties.
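When the transition probabilities p(s′, r | s, a) are known, Equation 2 can be solved by dynamic programming. The sketch below applies value iteration to a small synthetic MDP; the transition and reward tables are purely illustrative.

```python
import numpy as np

# Synthetic MDP: 3 states, 2 actions.
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.6, 0.4], [0.3, 0.0, 0.7]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # absorbing terminal state
])
R = np.array([[1.0, 0.0],
              [0.5, 2.0],
              [0.0, 0.0]])
gamma = 0.99

# Value iteration on the action-value function q*(s, a).
q = np.zeros((3, 2))
for _ in range(5000):
    v = q.max(axis=1)                       # v*(s') = max_a' q*(s', a')
    q_new = R + gamma * P @ v               # Bellman optimality backup
    if np.max(np.abs(q_new - q)) < 1e-8:    # stop when successive iterates stop changing
        break
    q = q_new

print("q* =\n", q)
print("greedy policy:", q.argmax(axis=1))
```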
Figure 2 presents an overarching categorization of RL methods and lists some representative approaches in each category. Broadly, the focus here is on four types of methods: (1) model-free (environment dynamics unknown), (2) model-based (learned environment dynamics), (3) hybrid model-free and model-based (combines model-free learning with model-based planning), and (4) hierarchical RL (the learning task is divided into sub-tasks that collectively yield desired goals). Interested readers may refer to Sutton and Barto (2018), Moerland et al. (2020), and Barto and Mahadevan (2003) for detailed algorithmic approaches under each category of RL method. Figure 3 presents an illustrative comparison of the four types of RL methods that are discussed in the following subsections. As presented in Figure 3A, model-free RL can be applied in an environment with unknown dynamics, where only observation of the environment and the reward is needed for the agent to learn a policy (i.e., a mapping from system states to optimal actions). In contrast, per Figure 3B, model-based RL requires state and reward information along with a learned system dynamics model based on environment interactions to make look-ahead plans. The hybrid model-free and model-based RL in Figure 3C combines the two methods and learns a data-driven environment model to make look-ahead plans. Finally, the hierarchical RL method in Figure 3D is mostly applied in scenarios with multiple goals and task hierarchies that can be decomposed into sub-problems.

Model-free RL
Model-free RL does not require formulation of the environment dynamics into optimization models; instead, it directly learns optimal actions based on interactive experience with the environment. This desirable feature qualifies model-free RL for developing optimal policies for a broad range of complex control problems where an accurate model of the environment is inaccessible. Cyber-physical security assessment of power systems constitutes such a problem (Baggott & Santos, 2020). The power system is one of the most complex critical infrastructures and involves risks and uncertainties from multiple aspects, including underlying topologies, generation, load, and coupling of electric devices. As a result, accurate measurement and modeling of the bulk grid is difficult. In this context, model-free RL has been applied to develop adaptive emergency control strategies for the bulk power system under contingency (Huang et al., 2020). Emergency control refers to enforcing actions that lower the risk of system collapse, such as dynamic braking and fault-induced under-voltage load shedding. Conventional model-based control methods that rely on solving algebraic and differential equations face scalability issues for large-scale power systems and are not readily generalizable to unobserved scenarios. Model-free RL methods, on the other hand, can learn a more robust control strategy owing to the high-dimensional feature extraction and nonlinear generalization capabilities of deep neural networks. Once well trained, the algorithm can be directly applied to unseen fault scenarios to produce control strategies with limited computational cost, which is highly desirable for real-time power system emergency control against operational risks.
Model-free RL has also been proposed for developing optimal risk-averse defensive strategies for CPS-E. Considering the randomness of cyber-attacks and the real-time requirement for taking recovery measures, model-free RL methods can be trained offline using representative attack scenarios. Once trained, the RL pipeline can be deployed in large-scale online settings to guard against real-time operational threats. Model-free RL has also been applied to intelligent transportation systems for controlling traffic signals, autonomous driving, hybrid vehicle energy management, and navigation control (e.g., speed limit control, toll pricing design, and ramp metering) (Haydari & Yilmaz, 2020). Given the possible life-threatening conditions in real-world experiments, these studies are mostly performed on traffic simulators or using historical datasets. Despite training in simulated environments, model-free RL methods produce policies that are more generalizable compared to standard control methods under risk-prone conditions and unseen uncertainties in the real world.
In summary, model-free RL is mostly applicable for solving control and optimization problems with uncertain or unknown information and intractable environment dynamics. The model-free RL method is desirable due to its simplicity of implementation, scalability to large-scale systems when combined with deep learning, and generalizability against uncertain risks and variations.

Model-based RL
Model-based RL requires learning the system dynamics of the environment or using a known model as an input to obtain an optimal policy. The major advantage of model-based RL methods is sampling efficiency during training, which reduces the need for a large number of interactive episodes with the environment. In model-based RL, the optimal policy is derived from simulations of a learned model that corresponds to the real-world dynamics, which can greatly reduce the interactions between the agent and the physical environment. This feature is especially desirable when interacting with the real world is costly and/or time consuming. For example, in the case of robotic control, using simulations instead of actual interactions results in significantly less mechanical wear of the robots.
In the literature, application of model-based RL is mostly studied in the field of intelligent robotic control against risks and uncertainties (Berseth et al., 2018; Dong et al., 2020; Xiao et al., 2019). In this regard, a simulated model environment is learned first, and reward and state transition probability functions are estimated, followed by optimal policy computation (Berseth et al., 2018). The availability of a model facilitates look-ahead planning over time to minimize adverse system impacts. Nevertheless, learning a dynamic model that captures system complexity with adequate fidelity continues to be a key research challenge. One research thread is focused on learning an ensemble of dynamical models (Xiao et al., 2019). A policy learned under such an ensemble holds different beliefs about the true dynamics of the system and is more adaptive to uncertainties. An alternative approach is to develop a two-layer learning structure, where the inner layer is still the canonical RL approach and the outer layer learns to fine-tune the estimated system dynamics models based on the inner-layer feedback (Dong et al., 2020). This decomposition approach facilitates auto-tuning capability and leads to computational savings.
Another prominent application of model-based RL lies in autonomous vehicle control. Real-time path tracking of autonomous vehicles remains a challenging task because the vehicle may often encounter unseen terrains during online testing, which undermines the maneuverability of control policies learned in an offline environment and can result in unsafe action recommendations. An alternative is to implement an online model-based RL approach, where the model learning and the policy learning are carried out simultaneously (Kim & Kim, 2016). The learned model is used as a predictor for RL to make look-ahead control plans that avoid unsafe scenarios.
In summary, the model-based RL method is mostly suitable for problems where testing in the real-world environment is expensive (e.g., mechanical degradation of robotic devices) or inaccessible (e.g., life-threatening traffic conditions). The key to successful application of model-based RL methods is the acquisition of a simulated environment model that is accurate enough for the algorithm to plan ahead and reach an optimal control strategy.

Hybrid model-free and model-based RL
Model-free RL methods have demonstrated success in solving complex system control and optimization problems with hidden information, while the model-based RL method is effective due to its sampling efficiency. Given the complementary advantages of both approaches, it is natural to combine them for better system performance. Hybrid model-free/model-based RL is an emerging topic in the RL community. Among the relatively few existing studies focused on hybrid RL, most find their niche in developing robust robotic locomotion control strategies. In one study (Nagabandi et al., 2018), a model-based RL method that follows a logic similar to model predictive control is first applied to learn the environment dynamics; the policy learned by the model-based RL is then used as "expert knowledge" to initialize the model-free RL algorithm. Efforts have also been devoted to improving the accuracy of the simulated model through the design of neural network modules and the use of synthetic data for training (Che et al., 2018; Wang et al., 2018). A recent study (Feinberg et al., 2018) also describes a hybrid approach that uses a dynamics model for a short-term horizon and Q-learning to estimate long-term value beyond the simulation horizon, yielding improved value estimates and faster learning.
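A minimal sketch of the short-horizon rollout idea in the hybrid approach cited above (Feinberg et al., 2018): a learned dynamics model supplies an H-step reward rollout and a Q-function bootstraps the value beyond the simulation horizon. The policy, model, reward, and Q-function callables are placeholders, not a specific published implementation.

```python
def expanded_value(state, policy, model, reward_fn, q_fn, horizon=3, gamma=0.99):
    """Hybrid value estimate: model-based rollout for `horizon` steps, then a
    model-free Q-value bootstrap at the rollout's end (cf. Feinberg et al., 2018)."""
    value, discount, s = 0.0, 1.0, state
    for _ in range(horizon):
        a = policy(s)
        value += discount * reward_fn(s, a)   # short-term rewards from the learned model
        s = model(s, a)                       # predicted next state
        discount *= gamma
    value += discount * q_fn(s, policy(s))    # long-term value beyond the simulation horizon
    return value
```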
In summary, hybrid model-free and model-based RL methods overcome the issue of sampling inefficiency in model-free RL, while still maintaining generalizability across problem settings. In this hybrid manner, the model-based RL component can serve as a domain knowledge-guided system representation to characterize the evolution of the model-free RL methods, leading to more stable and plausible policies against uncertainties.

Hierarchical RL
Despite the success of RL in a variety of fields, two open challenges still impact its widespread application: (1) the curse of dimensionality: when facing control problems with a high-dimensional state space, the number of parameters that need to be trained in the RL algorithm grows exponentially, resulting in costly consumption of computation and storage resources; and (2) sparse rewards: for control problems with long interactive trajectories, where the agent can only get feedback from the environment at the final step, the learning agent can get confined in a local optimum without proper guidance during exploration. Hierarchical RL (HRL) has been proposed to tackle the above challenges. The core idea behind HRL is divide-and-conquer, where a complex control problem is divided into multiple sub-problems and the global optimization goal is achieved by solving these sub-problems in a nested setting (Nachum et al., 2018). One scenario in which HRL is a good fit is developing robust control and operation strategies for the bulk power grid under hazard events (such as compound cyber-attacks and natural disasters) through topology changes and generator scheduling to guarantee long-term reliable and sustainable power supply. The grid operation scenario is usually based on a long time horizon, for example, weeks or months, which leads to a long MDP chain, and the control strategy can easily get confined in a local optimum, along with the unpredictable occurrence of hazard events. An HRL approach can be applied by dividing the long operation time window into sub-intervals and exploring the optimal control strategy for each interval successively. A sub-goal can be defined for each time interval, and only after the sub-goal is achieved is the system transferred to the next interval. This sub-goal setting can serve as a guide for algorithmic evolution.
HRL could also contribute to developing recovery strategies for bulk power systems under cascading outages. A cascading outage refers to a situation where the grid uncontrollably and successively loses elements, triggered by an initial incident at any location. A cascading outage can result in widespread electric service interruption and system collapse if proper remedial measures are not taken. Within the framework of HRL, each cascading outage stage constitutes one hierarchy, and the goal of the RL agent is to find the most economical recovery strategy for the current hierarchy. The hierarchies are temporally correlated since the power system operation states are consecutive. Achieving the optimum at each outage stage sums up to the global optimum for the entire process.
To summarize, HRL is most applicable for solving long time-horizon control and optimization problems with high-dimensional search spaces, where the logic of divide-and-conquer offers an efficient stepwise solution approach. However, HRL has not become a standard RL solution due to critical issues such as the non-stationarity arising from hierarchical policy updates and algorithmic instability. Nevertheless, HRL still holds promise for learning high-level actions and generalized skills to develop more adaptive and robust risk-based control policies.

Summary of RL methods
Table 1 summarizes the properties of the RL methods described above in terms of problem features, solution stability, and scalability considerations under uncertainty. Typical applications of these methods in the context of CPS-E control are also mentioned. Despite methodological differences, the four types of RL methods also share some common aspects. For instance, all of them can be applied for solving problems with either discrete or continuous state/action spaces. A loss function can be defined as an indicator of algorithm convergence for all of them. Also, a trade-off between action exploration and exploitation is mostly required to obtain better policies while making full use of the learned experiences.

NUMERICAL ILLUSTRATION
In this section, we describe two numerical examples to illustrate the performance of risk-aware RL approaches for a CPS-E control application. Specifically, we consider the optimal supervisory control problem for a representative single-zone and multizone building, where the controller's objective is to maintain the indoor temperature within the occupant comfort bounds. Next, we discuss the two use-cases and compare the RL algorithms under different risk preferences for each use-case.


Single-zone building control use case
We consider a stylized single zone building control setup, where the objective of a building operator is to minimize the overall thermal discomfort over an operating horizon.
The operator decides when to switch on or off the heating, ventilation, and air conditioning (HVAC) system (control action) based on the current indoor temperature (system state), subject to temperature dynamics (environment dynamics) and occupant comfort constraints (operational bounds). The operator's objective is to determine a control policy that minimizes the risk-inferred utility of the accrued discomfort based on their perception of the system state and operations. The mathematical form of these utility functions may be derived in practice from a performance-based CPS consequence assessment approach, as outlined in Figure 1. For demonstration purposes, we considered three standard utility functions: linear (risk-neutral), exponential with increasing marginal utility (risk-seeking), and exponential with decreasing marginal utility (risk-averse). For the single-building use case, the nominal open-loop CPS-E control problem can be formulated as an SDM model (Equations 3-5) that minimizes the accumulated violation of the indoor temperature comfort set-points over the operating horizon, subject to the indoor temperature dynamics and the binary on/off control constraint. In the objective (Equation 3), T_in(t) is the indoor temperature at time step t, expressed in hours; the upper and lower comfort set-points of the indoor temperature are set to 24 and 21 °C, respectively, in our simulations; and the accumulated temperature violation is evaluated over N_T = 24 hours. Equation 4 describes the indoor temperature dynamics (Lu, 2012), where T_out(t) is the current outdoor temperature; H is the equivalent heat rate (W) with a value of 400; R is the equivalent thermal resistance (°C/W) with a value of 0.1208; δ(t) is a binary control variable indicating the on/off status of the HVAC system; Δt is the time resolution for the dynamics, set to 1 min; and C is the equivalent heat capacity (J/°C) with a value of 3599.3. In this model, we assume that the zone can only be heated when the HVAC is ON; a model that considers both heating and cooling is introduced in Section 3.2.
The nominal model described in Equations 3-5 can be formulated as a Markov decision process (MDP). Under the Markovian assumption, the indoor temperature at the current time step is only related to the temperature at the previous time step, as presented by Equation 4. The state of the MDP, s(t), is the indoor temperature T_in(t) at time step t; the binary action a(t) is the on/off status of the HVAC system, δ(t); and the reward is the negative indoor temperature deviation from the comfort set-points. With a reward x and a utility function u(x), we consider the following three forms of the utility function: (1) risk-neutral: u(x) = x, (2) risk-seeking: u(x) = e^x − 1, and (3) risk-averse: an exponential utility with decreasing marginal utility (taken here as u(x) = 1 − e^(−x)). Figure 4 illustrates the mathematical form of the utility functions. Observe that if the utility u(x) is risk-neutral, then Equation 6 becomes standard Q-learning where the reward is the same as the utility; if the utility function is risk-seeking, then u(x) will be greater than x, which indicates that the Q value will be overestimated and the RL agent will be less conservative in making decisions; if the utility function is risk-averse, then u(x) will be less than x, which indicates that the Q value will be underestimated and the RL agent will be more conservative in making decisions. In the next subsections, we discuss results for model-free (Q-learning) and model-based (approximate dynamic programming) RL methods applied to the single-zone building control use case.
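A minimal sketch of the three risk-informed utility functions is given below, assuming the exponential forms stated above (with the risk-averse form taken as the mirror image of the risk-seeking form, an assumption consistent with the stated decreasing marginal utility):

```python
import numpy as np

def utility(x, risk_type="neutral"):
    """Risk-informed utility of a reward x (negative temperature deviation).

    The exponential forms follow the description in the text; the risk-averse
    expression 1 - exp(-x) is an assumed common choice with decreasing
    marginal utility, used here for illustration only.
    """
    if risk_type == "neutral":
        return x                      # u(x) = x
    if risk_type == "seeking":
        return np.exp(x) - 1.0        # u(x) > x for x < 0: deviations are discounted
    if risk_type == "averse":
        return 1.0 - np.exp(-x)       # u(x) < x for x < 0: deviations are amplified
    raise ValueError(risk_type)

# Example: a 2 degree comfort violation maps to reward x = -2.
for rt in ("neutral", "seeking", "averse"):
    print(rt, utility(-2.0, rt))
```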

3.1.1 Model-free Q-learning
Assuming that the system dynamics are unknown to the learning agent, we first use a model-free, risk-aware Q-learning algorithm (Sutton & Barto, 2018) to solve the MDP introduced in the previous section. The utility functions serve as the learning signal embedded within risk-sensitive Q-value estimates. We introduce the following risk-sensitive Q-learning method to solve the MDP, where the action-value Q is updated through the following iterative procedure (Shen et al., 2014):

Q(s, a) ← Q(s, a) + α [u(r + γ max_{a′} Q(s′, a′) − Q(s, a)) − x_0]    (6)

In Equation 6, u(⋅) is the utility function that captures the operator's risk tolerance; x_0 is the reference level, set to 0 in this case; α is the learning rate, which is the update speed of the action-value, set to 1 to allow for a quick update; and the discount factor γ is set to 0.99 to indicate that the future reward is worth less than the immediate reward and to avoid algorithm divergence.
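To make the update in Equation 6 concrete, the following tabular sketch applies the utility to the temporal-difference error and subtracts the reference level x_0. The environment step function, state discretization, and initial state are placeholders rather than the implementation used in this study.

```python
import numpy as np

def risk_sensitive_q_learning(env_step, n_states, n_actions, utility,
                              episodes=1000, horizon=24,
                              alpha=1.0, gamma=0.99, x0=0.0, epsilon=0.1):
    """env_step(s, a) -> (next_state, reward, done) is a placeholder for the
    single-zone thermal simulation; states are assumed already discretized."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = 0                                             # assumed initial state index
        for _ in range(horizon):
            # epsilon-greedy exploration/exploitation trade-off
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, done = env_step(s, a)
            # utility-shaped temporal-difference update (cf. Shen et al., 2014)
            td = r + gamma * Q[s_next].max() - Q[s, a]
            Q[s, a] += alpha * (utility(td) - x0)
            s = s_next
            if done:
                break
    return Q
```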
The Q-learning algorithm is trained over a 1-day horizon with 24 hourly intervals to learn the optimal control action of the HVAC system for each hour. Figure 5 shows the risk-inferred rewards learned over multiple 24-hour training episodes. The risk-averse utility function values take around 300 episodes to converge, which is much faster than the other two utility functions (risk-neutral takes over 2,000 episodes and risk-seeking takes around 1,000 episodes).
The Q-learning simulation experiments were performed on a laptop with an Intel Core i7-9850H 2.59 GHz CPU and 16.00 GB RAM. The risk-sensitive Q-learning method was implemented in Python 3.7 with PyCharm Community 2020 (JetBrains, 2022). The convergence times of the Q-learning method under the three utility functions were 11 s (risk-neutral), 18 s (risk-seeking), and 12 s (risk-averse). Figure 6 presents the learned HVAC control actions and the indoor temperature dynamics from the final training episode under the three risk-informed utility functions. In the figure, the green curve represents the indoor temperature; the blue dashed curve represents the outdoor temperature (consistent across the three utility function cases); the yellow shaded area represents the acceptable indoor temperature range; and the orange bars represent the ON status of the HVAC system (conversely, the absence of an orange bar at a time point represents the OFF status of the HVAC system). Per the figure, with the risk-averse utility function, the magnitude of the indoor temperature violations is relatively lower compared to the risk-seeking case and is comparable to the risk-neutral case. Also, the learned HVAC control actions over time for the risk-seeking case are significantly different from the other two cases. The accumulated thermal discomfort in the final episode is −10.80 for the risk-neutral case, −11.64 for the risk-seeking case, and −10.56 for the risk-averse case. This implies that over a 24-hour planning horizon, the risk-averse strategy led to the least discomfort for the building occupants. Interestingly, we also observe that the violations during normal operating hours from 7 a.m. to 7 p.m. (when the building occupancy is highest) are lower for the risk-averse and risk-neutral cases compared to the risk-seeking case. However, outside the normal operating hours, the risk-seeking case can exhibit lower violations than the other two cases.

Model-based approximate dynamic programming
Next, we use a model-based RL approach to solve the single-zone HVAC problem. Specifically, we use an approximate dynamic programming (ADP) approach (Jiang, 2015) that uses an approximate mathematical representation of the true Q value-function for learning the optimal policy. In our setting, we use a quadratic polynomial approximation Q̂ of the Q-function, with tuning parameters θ_0, θ_1, θ_2, θ_3 ∈ ℝ that are learnt by minimizing an L2 loss function between the true expected future reward and the estimated future reward evaluated via Q̂. Note that the choice of Q̂ is often guided by domain knowledge and system dynamics. The ADP algorithm was coded in Python, and model training was completed using Python's PyTorch library (Paszke et al., 2019) and an Intel i7 Core processor with 32 GB of RAM and 1 TB of storage. Figure 7 illustrates the evolution of the zone temperatures for the different risk-based utility functions. Similar to Q-learning, the risk-seeking agent exhibits the worst performance with regard to occupant comfort due to high variability in its action sequences. On the other hand, the risk-averse agent is the most conservative and provides the best occupant comfort values, followed by the risk-neutral utility. Note that for the given choice of Q̂, we did not observe any noticeable improvement in convergence rate compared to the model-free Q-learning approach.
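The fitted-Q idea behind the ADP approach can be sketched as below, assuming quadratic polynomial features of the (temperature, action) pair and an L2 loss; the feature choice, learning rate, and transition data are illustrative placeholders, not the exact setup used here.

```python
import numpy as np

def features(s, a):
    # Assumed quadratic polynomial features of state (indoor temperature) and action.
    return np.array([1.0, s, s * s, a])

def fit_q(transitions, theta, utility, gamma=0.99, lr=1e-3, sweeps=200):
    """transitions: list of (s, a, r, s_next) tuples collected from the model.
    Minimizes an L2 loss between Q_theta(s, a) and the one-step lookahead target."""
    for _ in range(sweeps):
        for s, a, r, s_next in transitions:
            target = utility(r) + gamma * max(features(s_next, a2) @ theta for a2 in (0, 1))
            phi = features(s, a)
            error = phi @ theta - target          # gradient direction of the L2 loss
            theta -= lr * error * phi             # stochastic gradient step
    return theta

theta0 = np.zeros(4)   # initial parameters theta_0, ..., theta_3
```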

Multizone building control use case
Next, we consider a more complex multizone building use-case where the HVAC control system seeks to minimize user discomfort and total HVAC energy consumption in multiple zones of a building. Let T be the number of decision stages in the operating horizon and N be the number of zones in the building. In the multizone setup, the HVAC controller can heat or cool the zones as desired to meet the control objectives. The nominal optimal-control problem for a multizone building is formulated as the SDM in Equations 8-12. The objective in Equation 8 minimizes the overall discomfort and HVAC energy consumption of the building, where the discomfort (d_{t,i}) is weighted by a penalty coefficient and C is the penalty per unit time for keeping the HVAC ON during the planning horizon. The binary variables δ^i_{t,H}, δ^i_{t,C} ∈ {0, 1} indicate whether the HVAC is heating or cooling, respectively, zone i at time t. The building dynamics for each zone i are expressed in Equation 9 in terms of building envelope parameters, the mass-flow rate m^i_t, and the temperatures T_H and T_C of the air supplied by the HVAC to any zone to heat or cool, respectively; we assume T_H and T_C to be constant in this problem. Finally, Equation 12 represents a discrete set of mass-flow rates available to the HVAC controller. Table 2 summarizes the values for different parameters and variables in our experimentation.
Similar to the single-zone problem, we consider three different utility functions based on the risk preference of the system operator: risk-neutral, risk-seeking, and risk-averse. Next, we describe the open-source simulation environment used for RL experimentation on the multizone HVAC control use-case.

TABLE 2 Summary of the parameter values used for the multizone HVAC use-case
Comfort bounds during peak hours (11 a.m. to 8 p.m.): 19.5 °C, 22.5 °C
Comfort bounds during all other hours: 18 °C, 24 °C
Supply-air temperature for heating and cooling during peak hours: 22.5 °C, 19.5 °C
Supply-air temperature for heating and cooling during off-peak hours: 24 °C, 18 °C

Domain-aware OpenAI Gym simulation environment for RL experimentation
Figure 8 illustrates the customized domain-aware OpenAI Gym simulation environment (Brockman et al., 2016) that we developed for this study. At the start of each RL decision epoch, the learning agent sends a control action to the heater/cooler regulator to switch the HVAC system ON or OFF for heating or cooling. The RL agent solves the optimal-control problem in Equations 8-12 for T = L × w minutes, where L is the number of lookahead windows and w specifies the number of minutes in a window. For each zone, the RL policy prescribes L different actions for the L windows, while keeping actions fixed at all time instances within any window. For the experiments in this article, we keep the granularity of the data sampling process to 1 min and set the number of zones to 2. At each time step (one minute) t in window k, the regulator defines a^k_1 for Zone 1 and a^k_2 for Zone 2, where a^k_i represents the action in zone i during window k. At the end of each time step, the indoor temperature evolves according to Equation 9. The Reward Aggregator module in the environment aggregates the rewards (or costs) accrued using Equation 8 over the decision epochs. If the time duration exceeds (L × w) minutes, an observation is formed using the current indoor temperature, and a final risk-inferred reward is calculated based on the aggregated rewards and the risk type. As feedback, both the observation and the final reward are sent to the agent to optimize the control policy.
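A skeletal version of such a custom OpenAI Gym environment is shown below to illustrate the interface (reset/step following the classic Gym API); the thermal dynamics, comfort bounds, and reward aggregation are simplified placeholders rather than the released environment.

```python
import gym
import numpy as np
from gym import spaces

class MultiZoneHVACEnv(gym.Env):
    """Simplified two-zone HVAC control environment (illustrative skeleton only)."""

    def __init__(self, n_zones=2, minutes_per_window=3):
        self.n_zones = n_zones
        self.w = minutes_per_window
        # One discrete action per zone: 0 = OFF, 1 = heat, 2 = cool.
        self.action_space = spaces.MultiDiscrete([3] * n_zones)
        self.observation_space = spaces.Box(low=0.0, high=40.0,
                                            shape=(n_zones,), dtype=np.float32)
        self.temps = None

    def reset(self):
        self.temps = np.full(self.n_zones, 20.0)
        return self.temps.copy()

    def step(self, action):
        action = np.asarray(action)
        reward = 0.0
        for _ in range(self.w):                      # minute-level dynamics in one window
            delta = np.where(action == 1, 0.05, np.where(action == 2, -0.05, 0.0))
            self.temps = self.temps + delta - 0.01   # placeholder drift/dynamics
            # Placeholder comfort bounds (18-24 degrees C) and HVAC operating cost.
            discomfort = np.clip(self.temps - 24.0, 0, None) + np.clip(18.0 - self.temps, 0, None)
            reward -= float(discomfort.sum()) + 0.01 * float((action != 0).sum())
        done = False
        return self.temps.copy(), reward, done, {}
```

A risk-inferred final reward could then be obtained by applying one of the utility functions described earlier to the aggregated reward before returning it to the agent.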
We ran the simulation on a high-performance computing (HPC) cluster with 64 central processing unit (CPU) nodes and one graphics processing unit (GPU) node. Each experiment deployed four simulation jobs which ran in parallel with the same initial system conditions. During training, each simulation job ran for 1,000 daily RL episodes, where each daily RL episode simulates 24 hours. In the next subsections, we discuss results for model-free (deep Q-network, DQN) and hybrid model-free and model-based (AlphaZero) RL methods applied to the multizone building control use case. Based on these experimental specifications, each DQN simulation job took 7 h to run for 1,000 daily episodes, whereas each AlphaZero simulation job took 18 h for the same number of episodes.

3.2.2 Model-free deep Q-network
The model-free deep Q-network (DQN) algorithm was developed by DeepMind (Mnih et al., 2015) by enhancing the classical Q-learning algorithm (described in Section 3.1.1) with deep neural networks. The key idea in DQN is to use neural networks as a function approximator to estimate the Q value functions defined for each state and action pair, that is, Q*(s, a) ≈ Q(s, a; θ), where θ is the vector of parameters of a neural network that approximates the optimal value function Q*. The parameters in θ are learnt iteratively using the stochastic gradient descent (SGD) algorithm to minimize the temporal-difference (TD) loss over all state-action transitions observed during an episode. To improve the stability and sampling efficiency of DQN, an experience replay buffer is used to sample a minibatch of previously observed state-action transitions to compute the loss and update its gradient over the iterations. Similar to the single-zone setup, we train the DQN algorithm for episodes that span 24 h. To reduce the size of the action space, we set the number of look-ahead windows (L) to 2, where each window spans 3 min (i.e., w = 3). Thus, new control actions are evaluated every 3 min, while the problem in Equations 8-12 is solved for T = 2 × 3 = 6 min in a rolling fashion over the 24-h horizon. The DQN network is trained for each of the three risk-inferred utility functions for 900 daily episodes. Figure 9 describes the convergence of DQN for each utility function class. We note that the Q function converges fastest for the risk-averse utility, followed by risk-seeking, and then risk-neutral. This trend is similar to the one observed for traditional Q-learning, where the risk-averse utility led to the quickest convergence while risk-neutral had the slowest convergence rate. Next, in Figures 10 and 11, we illustrate the performance of the trained DQN for a test operating day for Zones 1 and 2, respectively, for the three risk-inferred utilities.
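A condensed sketch of the DQN ingredients described above (Q-network approximator, experience replay buffer, and temporal-difference loss) is given below; network sizes, the joint-action encoding, and hyperparameters are illustrative only.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Feed-forward approximator Q(s, a; theta) over all discrete joint actions."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)

def dqn_update(q_net, target_net, buffer, optimizer, batch_size=32, gamma=0.99):
    """One SGD step on the TD loss over a minibatch sampled from the replay buffer.
    Each transition is (state_vector, action_index, reward, next_state_vector)."""
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)
    s, a, r, s2 = map(lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch))
    a = a.long().unsqueeze(1)
    q_sa = q_net(s).gather(1, a).squeeze(1)                      # Q(s, a; theta)
    with torch.no_grad():
        target = r + gamma * target_net(s2).max(dim=1).values   # TD target
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example setup: 2 zone temperatures as state, 9 joint actions (3 per zone x 2 zones).
q_net, target_net = QNetwork(2, 9), QNetwork(2, 9)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=50_000)   # experience replay buffer
```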
Note that in each zone, the zone temperature remains within the comfort bounds over the entire 24-h horizon for each of the risk-inferred utility functions. In Zone 2, the risk-seeking utility function results in larger fluctuations of the zone temperature, but it never crosses the comfort bounds. This points to DQN's ability to learn a highly effective control policy that maintains occupant comfort irrespective of the perceived value of occupant discomfort. However, maintaining occupant comfort comes at the expense of keeping the HVAC ON for longer time durations. Tables 3 and 4 tabulate the percentage of time the HVAC is switched ON for either heating or cooling for Zones 1 and 2, respectively, for each of the utility functions. It is interesting to note that in both zones, the HVAC is ON for the largest fraction of time for the risk-seeking utility function, which favors a highly volatile control policy with frequent switching between heating and cooling decisions due to higher sensitivity towards perceived occupant discomfort. On the other hand, a risk-averse utility prescribes a more conservative control policy that maintains temperature within bounds by keeping the HVAC ON for a shorter time duration, as it is less sensitive to user discomfort. A risk-neutral utility results in the HVAC being kept ON for an average duration that lies between the risk-seeking and risk-averse values.

Hybrid model-free and model-based AlphaZero
The AlphaZero algorithm is a hybrid RL approach developed by DeepMind (Silver et al., 2018) that combines model-free learning with model-based planning using deep neural networks. The algorithm augments a tree-search procedure that uses two heuristics: one to evaluate how good a current position (or system state) is, and the other to branch on moves (or action sequences) that are not obviously wrong. When training starts, both heuristics are initialized randomly, and the tree search only has access to a meaningful reward signal at the final goal (or terminating) states. These heuristics are then improved iteratively through self-play using a two-headed neural network. The search component is powered by Monte Carlo Tree Search (MCTS), which implements a good compromise between breadth-first and depth-first search and provides a principled way to manage the uncertainty introduced by the heuristics. The network's policy heuristic is updated to match the output of MCTS on all encountered states. AlphaZero has mainly been implemented for discrete game environments (such as chess and Go) with sparse reward structures.
For our multizone building control setting, we make a minor modification in the cost structure for executing the AlphaZero algorithm. Let δ^i_{t,D} ∈ {0, 1} be an indicator variable that denotes whether the indoor temperature in zone i violates the comfort constraints at time t. The cost function for AlphaZero then replaces the continuous discomfort term d_{t,i} in Equation 8 with this violation indicator. Note that this cost function does not include the true deviation from the comfort bounds; rather, it only considers whether the bounds have been violated or not. This modification makes the rewards sparse and discrete (compared to a continuous deviation term), which is more amenable to AlphaZero.
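A small sketch of this sparse cost is given below, assuming the indicator simply flags comfort-bound violations per zone and time step; the penalty weights are placeholders.

```python
import numpy as np

def alphazero_cost(temps, actions, lower, upper, violation_penalty=1.0, hvac_penalty=0.01):
    """temps, actions: arrays of shape (T, n_zones); lower/upper: comfort bounds.
    Counts violations instead of measuring their magnitude (sparse, discrete reward)."""
    violated = (temps < lower) | (temps > upper)    # indicator of comfort-bound violation
    hvac_on = actions != 0                          # heating or cooling active
    return violation_penalty * violated.sum() + hvac_penalty * hvac_on.sum()
```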
Figures 12 and 13 describe the evolution of the temperatures in Zones 1 and 2, respectively, for the different risk-inferred utility functions using the trained AlphaZero RL policy. Interestingly, in both Zones 1 and 2, the temperature trajectories for the risk-seeking case exhibit higher variability and regularly violate the bound constraints. By comparison, the risk-averse trajectory never violates the comfort bounds in either zone, while the risk-neutral trajectory violates the bounds only for a short duration of time in Zone 2. It is important to note that, compared to DQN, AlphaZero is insensitive to the degree of deviation from the comfort bounds and will in general exhibit larger temperature deviations than DQN.
In Tables 5 and 6, we tabulate the results for the fraction of time the HVAC is kept ON for a given test day by the trained AlphaZero-based HVAC controller for the different categories of utility function. Similar to DQN, the risk-seeking utility leads to the highest fraction of time when the HVAC is kept ON, due to high volatility and frequent switching between ON and OFF states caused by higher sensitivity towards comfort-bound violations. However, the difference in operational time fractions between the risk-neutral and risk-averse cases is less pronounced for AlphaZero than for DQN; in fact, the risk-averse case has slightly higher values, and therefore more energy consumption, than the risk-neutral case. But this trend is balanced by the fact that risk-neutral is less efficient in keeping the temperature within the comfort bounds compared to risk-averse, especially in Zone 2.

DISCUSSION
For the single-zone use-case, two RL algorithms were employed to solve the HVAC control problem: model-free Q-learning and model-based ADP. For both of these approaches, we observed that a risk-seeking utility often results in high variability in control-action sequences and resulting indoor-temperature fluctuations that lower occupant comfort. By contrast, a risk-averse utility leads to the most conservative control actions and minimal occupant discomfort. For both Q-learning and ADP, we observed that the Q-values converged fastest for the risk-averse case, followed by the risk-seeking, and then risk-neutral. However, due to the small size of the single-zone building use-case, we did not observe any significant difference between the convergence rates of Q-learning and ADP.
For the larger multizone building use-case, the HVAC controller has the flexibility to either heat or cool multiple building zones, has variable mass-flow rates, and seeks to minimize both occupant discomfort and energy consumption. Using our customized, domain-aware OpenAI Gym simulation environment, the multizone building control problem was solved using two deep-neural-network-based RL methods: model-free DQN and hybrid AlphaZero. We observed that the neural-network-based approaches learned the multizone control policy effectively for all three risk specifications. Specifically, the learned DQN control policy kept the temperature of both zones within the comfort bounds under all three risk utility functions. However, the risk-seeking DQN policy ended up keeping the HVAC ON for much longer time durations compared to the risk-averse and risk-neutral cases. The hybrid AlphaZero showed similar trends for the risk-seeking case; interestingly, the risk-neutral AlphaZero policy kept the HVAC system ON for a shorter time duration on average than the risk-averse case. This is likely due to the modification of the cost function used for AlphaZero, which only uses comfort violation as a cost indicator rather than the degree of comfort violation.
In summary, deep RL algorithms for CPS-E applications can be broadly categorized into model-free (i.e., an inductive process of learning a direct mapping from states to actions) and model-based (i.e., a deductive process of learning a predictive environment model and then deriving actions) methods. However, the suitability of specific algorithms depends on several factors, including model complexity, solution stability, and computational efficiency. Model-free deep RL methods are effective for learning complex risk-aware CPS-E control policies from raw system-state data with minimal assumptions on the underlying physics model; however, they require extensive training data of agent interactions with the environment and are not readily generalizable across multiple systems and operational regimes. Specifically, model-free deep RL training can be computationally challenging in CPS-E systems with nested (or hierarchical) control architectures and highly coupled subsystems operating at multiple decision-making time scales.
By contrast, model-based deep RL methods are promising due to their relatively fast training at scale with less data generated in a self-supervised manner and their ease of generalizability under diverse risk specifications. This enables the agent to learn adaptive control policies that allow switching between different risk postures (i.e., risk-neutral, risk-averse, risk-seeking) using learnt data-driven system state and planning models. However, model-based deep RL methods require comprehensive knowledge of the process physics and can suffer from accuracy issues due to model overfitting and increased bias in the learned policies. Consequently, the learned models may lead to unsatisfactory performance during the testing stages. Hybrid methods (e.g., AlphaZero) combine elements from both model-free and model-based learning to determine control policies in a sample-efficient manner in sparse data environments, while providing generalizability in some cases. Future efforts may further explore hybrid methods under different risk tolerances, model assumptions, and data availability for large-scale CPS-E applications.

KEY INSIGHTS FOR FUTURE RESEARCH
While significant progress has been made in applying RL methods, gaps still exist for RL to be fully harnessed across diverse real-world CPS-E applications. Major research gaps include: (1) explainability of policies and actions, (2) partial observability, (3) learning from limited samples, (4) safety tolerance, (5) high-dimensional state and action spaces, and (6) complex reward structures. Moreover, the emergence of different branches of RL, such as model-free, model-based, hybrid model-free and model-based, and hierarchical, as well as methodological approaches such as deep RL, inverse RL, safe RL, and distributional RL, points to opportunities and challenges in theoretical and applied RL advances. Based on the discussion and results above, key insights for future research with RL methods for the risk-based robust CPS-E control problem are listed below:
1. RL methods are suitable for sequential decision-making problems under uncertainty: If the Markov property holds within a sequential decision-making setting, and the state, action, and reward structures are defined adequately based on the problem objective, an RL approach might be suitable to meet automated decision support goals.
2. Explainability with RL methods is key for wide adoption in real-world applications: RL methods are still far from full-scale adoption, mainly because of their opacity and inexplicability. Lack of explainability of the learned policies often leads to reduced trustworthiness for practical deployment in real-world CPS-E settings. Therefore, it is highly desirable to develop theoretical foundations for explainability with RL methods. Algorithmic explainability dimensions may be local versus global (i.e., explaining a specific decision or the entire model behavior) or post hoc versus intrinsic (i.e., explanations after training via a reduced-order model or self-explanations during training). This may involve an aggregation of approaches that investigate the logic and reasoning behind RL-based policies, making them more interpretable to human operators. Developing interpretability will help render RL policies more understandable, which can in turn enhance the trust and confidence between RL agents and human operators, and eventually widen and deepen the application of RL methods.
4. Scalability and validation are key for wider adoption of RL methods: As real-world CPS-E grow in both scale and complexity, increasing the scalability of existing RL methods to handle high-dimensional state and action spaces in large-scale systems becomes a key challenge. This requires advances in both algorithm design and computational architecture. In addition, generalization of RL-based policies in practical settings necessitates rigorous validation. Challenges associated with RL validation include inaccessibility of the real physical environment, design of ubiquitous benchmark metrics, and enforcement of constraints for safety-critical systems.
5. Multi-agent reinforcement learning will continue to play an important role with the advances of single-agent RL techniques: Multi-agent RL (MARL) focuses on addressing sequential decision-making problems involving multiple autonomous agents, each maximizing their respective objective(s) by interacting with other agents and the environment. Under such situations, the system dynamics may become more uncertain, requiring advances in algorithmic rules of engagement for policy generation across multiple agents. On the other hand, MARL is increasingly applicable to a wide range of real-world CPS-E compared to single-agent RL, such as autonomous grid operations and smart building control.
6. Risk criteria will help boost the application of reinforcement learning in solving safety-critical decision-making problems: For real-world decision-making problems where safety remains the top priority, reducing system risks must be taken into consideration as part of the optimization objectives. Systemic risk can be measured by various criteria, such as chance constraints, value-at-risk (VaR), and conditional value-at-risk (CVaR). CVaR is a percentile risk measure that calculates the expected reward in the worst α-percentile of all possible outcomes (a small computational sketch follows this list). Most recent endeavors in the RL community have focused on incorporating CVaR into the MDP to learn safer and more robust control policies. Integration of risk criteria into RL will advance implementation in real-world mission-critical applications.
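As a concrete illustration of the last insight, the sketch below estimates empirical VaR and CVaR from sampled episode returns; the α level and the synthetic returns are placeholders.

```python
import numpy as np

def var_cvar(returns, alpha=0.05):
    """Empirical value-at-risk and conditional value-at-risk of the worst
    alpha-fraction of sampled episode returns (higher return = better)."""
    returns = np.sort(np.asarray(returns))
    k = max(1, int(np.ceil(alpha * len(returns))))
    var = returns[k - 1]            # alpha-quantile of the return distribution
    cvar = returns[:k].mean()       # expected return in the worst alpha-tail
    return var, cvar

rng = np.random.default_rng(1)
sampled_returns = rng.normal(loc=-10.0, scale=3.0, size=10_000)   # placeholder episode returns
print(var_cvar(sampled_returns, alpha=0.05))
```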

CONCLUDING REMARKS
Safe, secure, and reliable operation and control of CPS-E is essential for safeguarding critical infrastructure assets and networks from natural and targeted disruptions. Decision making in such settings is a sequential process under uncertainty and may be mathematically formulated via optimization and data-driven learning methods. RL is a powerful data-driven mathematical construct for solving sequential decision-making problems under uncertainty in an automated CPS-E control context. This paper systematically analyzed the applicability of four types of RL methods (model-free, model-based, hybrid model-free and model-based, and hierarchical) for risk-based robust CPS-E control and demonstrated their comparative performance within an open-source, domain-aware simulation environment. Based on CPS-E problem features and the solution stability of RL methods, we designed simulation experiments to analyze the effect of risk specifications within robust CPS-E control. Numerical simulation examples with representative single-zone and multizone building control use cases illustrate the tradeoff between control objectives (discomfort versus energy consumption) for the different risk specifications, and the power of deep neural networks in learning complex control policies in the presence of strict operational constraints and multiple nonlinear dynamical equations. The results clearly illustrated the varying degree of volatility in the control action sequences based on the given risk specifications, which affects control performance. Future research directions may include emphasis on algorithmic explainability, scalability, validation, and safety tolerance based on risk criteria.

ACKNOWLEDGMENTS
This research was supported by the U.S. Department of Energy, through the Office of Advanced Scientific Computing Research's "Data-Driven Decision Control for Complex Systems (DnC2S)" project. Pacific Northwest National Laboratory (PNNL) is a multiprogram laboratory operated by Battelle Memorial Institute for the U.S. Department of Energy under Contract No. DE-AC05-76RL01830. Yan Du and Ashutosh Dutta were with PNNL when they contributed to this research. The authors thank the anonymous reviewers for providing helpful feedback.
