Applications of reinforcement learning in energy systems

Energy systems are undergoing major transitions to facilitate the large-scale penetration of renewable energy technologies and to improve efficiency, leading to the integration of many sectors into the energy system domain. As the complexity of this domain increases, it becomes challenging to control energy flows using existing techniques based on physical models. Moreover, although data-driven models, such as reinforcement learning (RL), have gained considerable attention in many fields, a direct shift to RL is not yet feasible in the energy domain despite its growing complexity. To this end, a top-down approach is used to understand this behavior by reviewing the current state of the art. We classified RL papers in the literature into seven categories based on their area of application. Subsequently, publications under each category were further examined relative to problem diversity, the RL technique employed, performance improvement (compared with other white and gray box models), verification


Introduction
With the escalating accumulation of CO2 emissions in the atmosphere, the frequent occurrence of extreme climate events, and the rapid increase in global population, particularly in urban areas, significant changes in the energy sector are necessary [1]. It is anticipated that energy sustainability and energy efficiency improvements will play vital roles in the urban sector, where the integration of sustainable energy technologies is necessary [2,3]. In addition, the energy nexus, such as that between water, agriculture, and transportation, should be considered, as these elements tend to improve the sustainability of several sectors while minimizing greenhouse gas emissions [4].
However, the introduction of these changes into energy systems is an exigent task when both demand and generation are taken into account [5].

Problems on growing energy system complexity
As the complexities in the energy sector increase, it becomes more difficult to optimally control energy systems. For example, centralized generation is gradually moving toward distributed energy systems, replacing fossil fuel-based dispatchable energy sources with renewable energy technologies [6]. The inclusion of renewable energy technologies (e.g., solar photovoltaics (PV), solar thermal, and wind) makes the operation of distributed energy systems more problematic because of the intermittent nature of these sources. Energy storage and dispatchable energy technologies, such as combined heat and power (CHP) generators, are necessary because of the short- and long-term changes (stochastic nature) in renewable energy potential and energy demand [7]. It is difficult to integrate these components into a single system because of the intermittent nature of energy potentials, variations in demand, and complexities in energy conversion processes [8,9]. Similarly, the complexity of operation increases when introducing energy storage to work in harmony with the internal combustion engines in automobiles [10]. In the energy sector, these changes become increasingly common because of the demand for CO2 emission reduction.
Energy transition introduces problems that extend well beyond the boundaries of energy systems; for example, the energy nexus between transportation, agriculture, waste management, and buildings also requires considering the interactions among these sectors. This typically necessitates co-simulation platforms that lead to bulky models [11], which are difficult to employ for control purposes. Furthermore, existing models focus on representing the physical interactions among the different sectors and usually fail to take cyber interactions into account [12]. Considering both cyber and physical interactions for control purposes using existing model-based approaches is another problem [13]. In addition to the limitations caused by the increasing complexity of energy systems, particular attention should be devoted to uncertainty and security management [14]. Uncertainties, such as climate change, energy market variations, and improvements in energy technologies, are considered using bulky physical models that demand extensive computational time, especially for control purposes [15][16][17]. Therefore, the multi-dimensional impact of certain uncertainties is often neglected, thereby making it difficult to assess the effect of certain critical phenomena (e.g., climate change) on the energy sector [18]. In view of these limitations of the present state-of-the-art techniques, significant changes in the modeling methods employed in the design and operation sectors are necessary. In conclusion, with the increasing complexity of energy systems, cyber-physical interactions, uncertainties, and security challenges, a paradigm shift in the present state-of-the-art methodologies for energy system control is required [5].

Emergence of data science
The application of machine learning techniques has become a main research focus across research domains, especially with the emergence of deep learning [19]. The number of research publications on machine learning has rapidly increased, and machine learning methods have gradually attracted the attention of researchers in the energy sector for managing the aforementioned complexities because of the model-free nature of these techniques [20]. Machine learning employs a data-driven methodology that can support energy experts in considering complex planning problems at the urban, regional, and national scales [20]. The main branches of machine learning (i.e., supervised, semi-supervised, unsupervised, and reinforcement) are already well established in the energy domain [21,22] and can be readily understood by referring to energy-related publications that elaborate on machine learning techniques. Within the energy system domain, machine learning techniques are employed in all the major steps of the energy system design process. The number of publications that discuss machine learning techniques, covering topics ranging from renewable energy forecasting to the development of complex surrogate models for energy system design, has rapidly increased (Fig. 1).

Emergence of reinforcement learning
Reinforcement learning (RL), a branch of machine learning, incorporates human-level control [23,24] and has attracted attention in several fields. Although the application of RL in the energy domain has gained considerable interest, there is a noticeable time lag between such applications and the publication of papers on machine learning in general (including RL). A lag is also evident when RL publications are compared with those that focus on model predictive control (MPC) methods (Fig. 2). Since 2008, there has been a clear gap between the number of publications featuring MPC and RL (irrespective of the domain); only in 2018 did this gap begin to narrow to a moderate extent (Fig. 2). In the energy domain, however, the gap persists and tends to widen further. This is unexpected, particularly given the increasing complexity of energy systems, uncertainties, and security problems, which are difficult to control entirely using model-based approaches. Accordingly, it is important to conduct a more holistic analysis of the present state-of-the-art applications of RL in the energy sector to identify the root causes of the gap. This requires a more thorough assessment than a mere paper review. Accordingly, this study uses a top-down approach to review the present state-of-the-art applications of RL.

Objectives and methodology
In the literature, several papers present a comprehensive overview of the state-of-the-art methods. Cheng and Yu [25] extensively reviewed the machine learning methods implemented in the energy and electric power system domains, covering many aspects of the energy sector, including RL. Their paper was intended to provide a general overview of the state of the art rather than a review of each article published since machine learning techniques started to be used extensively in the energy domain. Han et al. [26] focused only on RL and the control of occupant comfort in buildings. Vázquez-Canteli and Nagy [27] discussed RL applications in demand response as well as building control problems related to demand, generation, and energy management. These studies employed a paper-by-paper approach and presented the previous studies conducted within a definite scope. Recently, Ref. [28] reviewed the recent progress in building energy management systems using the same approach.
All these studies provide a comprehensive overview of the use of RL in building energy systems. In each, a paper-by-paper review is presented in which the specific application as well as the RL method are classified in a comprehensive manner. However, the number of publications related to RL grows rapidly, making it hard to track every paper, especially for newcomers; this situation is common in machine learning areas such as computer vision. Although energy transition introduces several energy management problems that cannot be resolved by the mere use of classical control theory-based approaches, most of these problems have many similarities because they are related to energy flow management. Hence, it is possible to develop a common knowledge base. However, this is an extremely exigent task because the collection and organization of relevant literature require considerable effort, and it is more difficult to present the material in a paper-by-paper format similar to that in Refs. [26,27]. Accordingly, in this study, the research papers related to RL applications in the energy system domain are classified into several categories and subsequently extended by considering the crosslinks among these classes. Section 2 provides a comprehensive overview of the control/operation problems where RL has been used, leading to a classification of these applications into six main classes. Similar to the applications, RL methods can be classified into several classes, which makes it easier to understand the use of each RL algorithm. Section 3 presents a classification of RL techniques used in the present state of the art with a comprehensive theoretical background on each method.
A detailed cross comparison among problem classes and RL techniques is performed in Section 4, focusing on:
• The complexity of the control problem considered within the domain
- Handling non-linearity
- Use of approximation models/data-driven models
- Expansion of decision and objective space variables
• Similarity among the different classes of problems
• Verification of results and reproducibility of approaches
• Computational burden for each problem class
Based on Section 4, Section 5 discusses future perspectives on RL applications in the energy system sector, particularly extending the boundaries of RL problems related to sector coupling, linking control problems with multi-resolution time steps, and shifting focus from control to coupled control and design problems. Finally, Section 6 presents the conclusions of the study.

Broader applications of RL
The integration of variable renewable energy technologies introduces problems into the energy system domain from the perspectives of control, stability, and security. It is therefore important to understand the advantages afforded by the use of RL in dealing with these problems. Section 2.1 presents the benchmarking of RL against other existing techniques to better comprehend these advantages. As shown in Fig. 3, the peak energy generated by renewable energy technologies may not satisfy demand, making it essential to depend on other energy technologies or energy storage. The stochastic nature of demand and generation plays a vital role when managing energy flows within the system. This leads to a set of operation optimization problems in scheduling generation (commonly known as the energy dispatch problem), energy system operation within buildings, device control (e.g., PV panels), and market interaction, which are deemed the main applications of RL; thus, in this domain, RL application is not limited to the dispatch problem. Section 2.2 elaborates and classifies the various RL applications in the energy system domain.

Incorporation of RL in energy system domain
Several different approaches are employed in the state of the art to control energy systems. These can be categorized into three classes: white box, gray box, and data-driven (black box) models [29]. The white box models, which are also known as model-based control strategies, apply physical principles to represent the relationship between model inputs and outputs during the control process; the model predictive control (MPC) strategy discussed in the Introduction section is classified as a white box model. Data-driven models, also known as black box control methods, use the knowledge derived by processing online or offline data instead of depending on the explicit or implicit information of the mathematical model; RL belongs to the data-driven category. Gray box models are those that lie between white and black box models; models based on fuzzy logic are among those classified under this topic. In this section, these approaches are differentiated to help readers gain a better understanding of the techniques rather than presenting a broad categorization of existing modeling tools in the energy domain.
Consider a simple energy system with renewable energy technologies and energy storage (Fig. 4), i.e., the previously explained dispatch problem. The energy system is connected to the grid and caters to the electricity demand in a neighborhood. When sufficient renewable energy is not generated to satisfy the energy demand, either a battery bank or the grid can be used to compensate for the mismatch. The choice depends on several factors, such as the current price of electricity in the grid, the price forecast of grid electricity for the time horizon, the renewable energy generation forecast for the time horizon, and the demand forecast for the time horizon. The white box approach uses either dynamic programming or the MPC technique to select the appropriate control decision. This approach uses a detailed model to represent the energy and cash flows within the energy system, which are subsequently linked to an optimization algorithm to derive the optimal states for the control horizon. However, the uncertainties in the forecast may also play a vital role in such instances; in this case, stochastic MPC and stochastic dynamic programming techniques are used. Model dependencies and convexity guarantees make it considerably problematic to extend such an approach to more complex energy systems [30]. In such instances, gray box models use approximation methods, such as fuzzy logic, to assist in achieving dispatch decisions [31]. Although fuzzy rules are typically defined, the pool of fuzzy decisions significantly increases with the complexity of the energy system, making optimization difficult [32]. In this regard, RL takes a different approach that is solely based on a data-driven model by employing an agent to participate in the process. The agent makes decisions based on the data-driven model and accepts inputs from the surroundings. The data-driven model that makes control decisions is fine-tuned by participating in the decision-making process and maximizing its reward.
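The dispatch decision described above can be sketched in code. The following is a minimal, illustrative example of a hand-crafted (rule-based) dispatcher for the battery/grid choice; all quantities, the price threshold, and the function name are assumptions for demonstration, not taken from any reviewed study. RL would replace the fixed threshold rule with a policy learned from data.

```python
# Minimal sketch of the dispatch problem: a neighborhood system with
# renewable generation, a battery bank, and a grid connection.
# All numbers and the rule-based policy are illustrative assumptions.

def dispatch_step(demand_kw, renewable_kw, soc_kwh, price,
                  capacity_kwh=100.0, max_rate_kw=25.0):
    """Decide how a 1-hour mismatch between demand and generation is met.

    Returns the new battery state of charge, grid import (kW), and cost.
    """
    mismatch = demand_kw - renewable_kw          # >0: deficit, <0: surplus
    if mismatch <= 0:
        # Surplus: charge the battery with what fits, curtail the rest.
        charge = min(-mismatch, max_rate_kw, capacity_kwh - soc_kwh)
        return soc_kwh + charge, 0.0, 0.0
    # Deficit: a simple price-threshold rule -- discharge the battery first
    # when grid electricity is expensive, otherwise buy from the grid.
    if price > 0.20:  # assumed threshold in $/kWh
        discharge = min(mismatch, max_rate_kw, soc_kwh)
        grid_import = mismatch - discharge
        return soc_kwh - discharge, grid_import, grid_import * price
    return soc_kwh, mismatch, mismatch * price

# One expensive hour: 60 kW demand, 20 kW renewables, half-full battery.
soc, grid, cost = dispatch_step(60.0, 20.0, 50.0, 0.30)
```

An RL agent would instead observe (demand, generation, state of charge, price) as its state and be rewarded with the negative cost, tuning its decisions without any hand-set threshold.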
There are a number of different methodologies under the broad scope of RL that may be utilized to formulate and train the data-driven model; these are discussed in detail in Section 3. The data-driven model is effective in managing model uncertainties and complexities, which are difficult to achieve using white box models.

Classification of RL problems in energy systems
The changes in outdoor conditions (e.g., fluctuations in temperature, solar irradiation, and wind speed) influence generation and demand. Accordingly, the entire energy system and each device should maximize their performance (e.g., the use of maximum power point tracking for wind turbines and PV panels). Moreover, a number of other factors, such as occupancy and equipment usage patterns, are expected to introduce uncertainties into the demand side. Therefore, considering a standalone operation, the mere matching of demand and generation is already a complex problem. Numerous publications regard demand and generation as two distinct control problems, as shown in Fig. 3. The optimal operation of heating, ventilation, and air conditioning (HVAC) systems that considers changes in the environment is typically regarded as a demand-side problem, which neglects uncertainties in power generation. There are a number of recent publications pertaining to vehicles and grids, which fit well with both the demand and generation sides. In addition, there are numerous instances in which the uncertainties on both the demand and generation sides are discussed concurrently. Accordingly, it is difficult to define the boundaries of the foregoing problems. On the other hand, specific problems, such as maximum power point tracking (MPPT) of wind turbines and solar panels, can be defined without infringing on other problems.
By conducting a keyword search in Scopus, two groups of publications are found: 1) those that deal with specific problems (SPs) and 2) those that deal with integrated problems (IPs). The SP category focuses on delving deep into a well-defined and extremely confined domain, whereas IPs are typically broader. It is extremely exigent to provide an exact definition for each group, as there are many gray areas. The SPs are further classified into the following six groups considering the application domain:
• Building energy management system (BEMS)
• Dispatch problem
• Vehicle energy system
• Vehicle to Grid (V2G)
• Energy market
• Energy device
The BEMS is a common framework in the energy system domain for implementing optimal control strategies, such as those for HVAC, lighting, and blinds, related to the thermal inertia of buildings, weather uncertainties, and occupant behaviors. Within the BEMS, catering to the heating, air-conditioning, and ventilation demands of the building, job scheduling, optimal control of building elements (such as window blinds), and maintaining indoor air quality are often considered. The main focus of BEMS is to either increase the comfort level and energy efficiency or minimize cost. The scope of the problem extends notably when moving from BEMS to the dispatch problem. The main focus of the dispatch problem is delivering the electricity, heat, and cooling demand by optimally using energy storage, renewable energy technologies, and dispatchable energy sources. Some studies take price signals from the grid to determine the optimal dispatch strategy for the energy system. Although it is not common, there are instances in which the energy system is expected to cater to diverse applications, such as desalination, beyond heating, cooling, and electricity. Cost minimization is often considered the main objective of the dispatch problem; however, minimizing emissions is gradually gaining popularity due to environmental concerns.
It is difficult to define the exact boundary between the dispatch problem and BEMS because many publications present problems that can be categorized between these two; hence, IP, BEMS, and dispatch are merely defined to classify the publications that discuss the elements of BEMS and dispatch problems. In terms of RL applications in the transportation sector (within the energy system domain), the control problems can be classified into two: optimal charging-discharging using grid electricity (Vehicle to Grid (V2G)) and the energy management problem within the vehicle (vehicle energy system). Although these two are closely interlinked, no publication discusses them together; hence, they are treated separately. A vehicle energy system is regarded as an SP that does not maintain any links with BEMS or dispatch problems. In contrast, the V2G problem is usually associated with BEMS, energy markets, or dispatch problems related to the optimal time slots for charging vehicles based on changes in the grid electricity price. Accordingly, the problem is typically extended considering the optimal time slots to charge vehicles, the electricity price in the grid, and renewable energy generation, which interlinks power demand and generation. Therefore, two IPs are introduced, V2G-dispatch and V2G-BEMS, to consider the interactions with the transportation sector.
The energy system operation (dispatch decisions) is influenced by a number of factors, such as demand, grid conditions, the energy market, and changes in the performance of energy system components. The use of RL for grid control constitutes a broad field of research in which the transient stability of the grid, N-1 security, voltage and frequency regulation, and optimal power flow are considered. In the present study, a detailed description of grid control using RL is not provided, and the scope is limited to the impact of the grid on the energy system. Promising methods for operating energy systems while actively participating in the energy markets are evaluated. The participation of energy systems in the day-ahead, balancing, and real-time markets is considered in this context. Similar to vehicles, the impacts of energy markets on the BEMS and dispatch problems are taken into account separately, although RL applications are not yet significant in these areas. Finally, the energy device classification focuses on the optimal control of energy system components (devices), which performs an important function in improving energy system efficiency. In this subset, the focus is set on the maximum power point tracking (MPPT) of wind turbines and solar panels.

RL methods and applications in different problem classes
RL is a branch of machine learning that mainly focuses on sequential decision-making under uncertainty. Recent advances in deep RL have achieved remarkable performance in games [33,34], continuous control [35], and robotics [36]. RL can also be defined as the problem of learning how to act optimally in an environment through experience. In this regard, an RL agent must interact with its environment and learn how to maximize a certain cumulative reward over time. RL has made considerable progress in the recent past. This section reviews the state-of-the-art methodologies used within the RL community that are applicable to energy system operation. A reader who is already familiar with RL methods or interested only in RL applications in energy systems can safely skip this section. Section 3.1 provides the necessary background on RL; Section 3.2 discusses various RL methods used in practice; Section 3.3 summarizes the implementation packages of deep RL algorithms.

Background
Formally, RL is a game between an agent and an environment, modeled as a Markov decision process (MDP) M = (S, A, P, R, γ, P_0), where S is the state space, A is the action space, P(⋅|s, a) is the transition probability, R(s, a) is the reward function, γ ∈ (0, 1) is the discount factor, and P_0 is the initial state distribution. A stationary policy π : S → A maps states to actions, and the set Π denotes the set of all stationary policies. An optimal policy intuitively maximizes the overall cumulative (discounted) reward. The game between the agent and the environment proceeds as follows:
• At time step t = 0: S_0 ∼ P_0(⋅)
• At each time step t = 0, 1, 2, …:
- the agent observes the environment's state S_t ∈ S
- the agent chooses an action A_t = π(S_t) ∈ A
- the environment returns the reward R(S_t, A_t) and moves to the next state S_{t+1} ∼ P(⋅|S_t, A_t)
The above game is graphically illustrated in Fig. 5. An agent executing a policy π : S → A in the environment M obtains a random cumulative reward. The objective of RL is to find an optimal policy, π*, which maximizes the discounted sum of rewards (1):

J(π) := E[Σ_{t=0}^∞ γ^t R(S_t, π(S_t))], π* := arg max_{π∈Π} J(π). (1)

Value Functions: Value functions are estimations of the expected return over a certain time horizon given the current state s ∈ S and are often used to construct the optimal policy. In particular, the state-value function V^π of a given policy π is defined as (2)

V^π(s) := E[Σ_{t=0}^∞ γ^t R(S_t, π(S_t)) | S_0 = s], (2)

which stands for the expected return starting from state s ∈ S. The optimal value function V* corresponds to the value function of an optimal policy π*, i.e., (3):

V*(s) := max_{π∈Π} V^π(s). (3)

Once V* is available, the optimal policy can be recovered by picking an action a that is greedy with respect to V*, i.e., (4):

π*(s) = arg max_{a∈A} [R(s, a) + γ E_{s′∼P(⋅|s,a)}[V*(s′)]]. (4)

In the RL setup, model information such as the transition probabilities and reward distributions is usually unavailable; thus, it is not easy to directly obtain the optimal policy from the state-value function.
Instead, the action-value function, or Q-function, is considered, which is defined as (5)

Q^π(s, a) := E[Σ_{t=0}^∞ γ^t R(S_t, A_t) | S_0 = s, A_0 = a, A_t = π(S_t) for t ≥ 1], (5)

and the corresponding optimal Q-function is defined as (6)

Q*(s, a) := max_{π : S → A} Q^π(s, a). (6)
Observe that once Q* is available, the optimal policy can be retrieved by π*(s) = arg max_{a∈A} Q*(s, a), which does not require the model information, in contrast to the state-value function case.
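The agent-environment game and the discounted return defined above can be sketched in a few lines of code. The tiny two-state MDP below (its transition probabilities and rewards) is an arbitrary assumption chosen only to make the example runnable; it does not come from any reviewed study.

```python
import random

# Illustrative two-state MDP (states 0/1, actions 0/1); the transition
# probabilities and rewards are arbitrary assumptions for demonstration.
P = {  # P[s][a] = list of (next_state, probability)
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 0.7), (1, 0.3)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}   # R[s][a]
gamma = 0.9

def sample_next(s, a, rng):
    states, probs = zip(*P[s][a])
    return rng.choices(states, weights=probs)[0]

def rollout(policy, horizon=50, seed=0):
    """Play the game described above; return the discounted sum of rewards."""
    rng = random.Random(seed)
    s = 0                          # S_0 ~ P_0 (here: deterministic start)
    g = 0.0
    for t in range(horizon):
        a = policy(s)              # agent observes S_t, chooses A_t
        g += (gamma ** t) * R[s][a]
        s = sample_next(s, a, rng)  # environment transitions to S_{t+1}
    return g

always_1 = lambda s: 1             # a simple stationary policy
ret = rollout(always_1)
```

The objective J(π) in (1) is the expectation of `ret` over many such rollouts; an RL method searches over policies to maximize it.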
Dynamic Programming: When the MDP M is known, finding π* is called a planning problem, and it can be solved efficiently via dynamic programming (DP) algorithms. The value functions for a given π can be computed or estimated by solving the Bellman equations (7) and (8) [38][39][40]:

V^π(s) = R(s, π(s)) + γ E_{s′∼P(⋅|s,π(s))}[V^π(s′)], (7)

Q^π(s, a) = R(s, a) + γ E_{s′∼P(⋅|s,a)}[Q^π(s′, π(s′))], (8)

which are derived from the Markov property [40] of the MDP and the definition of the value function. This step is often called policy evaluation. Once a policy π is evaluated, an improved policy can be obtained by the greedy policy π′(s) = arg max_{a∈A} Q^π(s, a), which is called policy improvement.
Various dynamic programming algorithms, such as policy iteration and value iteration, are based on different ways of alternating between policy evaluation and policy improvement [39].
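As an illustration of how evaluation and improvement are folded into one backup, the following value iteration sketch applies the Bellman optimality operator on a tiny MDP with assumed (illustrative) dynamics and rewards:

```python
# Value iteration on a tiny illustrative MDP (assumed numbers), sketching
# the alternation of policy evaluation and improvement described above.
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},   # P[s][a] = [(next_state, prob)]
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}   # R[s][a]
gamma, states, actions = 0.9, [0, 1], [0, 1]

V = {s: 0.0 for s in states}
for _ in range(500):                 # repeat the Bellman optimality backup
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in actions)
         for s in states}

# Greedy policy with respect to the converged V, as in Eq. (4).
policy = {s: max(actions,
                 key=lambda a: R[s][a]
                 + gamma * sum(p * V[s2] for s2, p in P[s][a]))
          for s in states}
```

Here V converges to V*(1) = 2/(1 − 0.9) = 20 and V*(0) = 1 + 0.9·20 = 19, and the greedy policy selects action 1 in both states.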

RL methods
RL methodologies can be grouped along several dimensions; therefore, it is challenging to come up with a single comprehensive classification. In this study, we provide a broad classification of RL methodologies by categorizing them into three sets: Value-based, Policy-based, and Model-based (cf. Fig. 6). We can further group the model-based RL methods into the following three groups: (i) value-based, e.g., the Dyna-Q algorithm [42], the Deep Dyna-Q algorithm [43], and Value-Aware Model Learning (VAML) [44]; (ii) policy-based, e.g., Model-Based Policy Gradient (MBPG) [45], ME-TRPO, SLBO, and Policy-Aware Model Learning (PAML) [46]; and (iii) actor-critic, e.g., Model-Based Actor-Critic (MBAC) [47], Model-Augmented Actor-Critic (MAAC) [48], and Dyna-DDPG [49]. Some methods belong to several sets. For example, Deterministic Policy Gradient (DPG) and Deep Deterministic Policy Gradient (DDPG) belong to both the value and policy sets. Within this broad classification, this section is divided into five classes, namely, Value-based, Policy-based, Actor-critic, Model-based, and Batch RL. RL methods can also be categorized into on-policy and off-policy methods. On-policy methods estimate the value of the policy that is used for control. In contrast, off-policy methods evaluate a policy (estimation policy) different from that used to generate behavior (behavior policy).

Value based methods
Value-based RL methods aim to learn the state or action-value function and then to select actions accordingly. SARSA [50] and Q-learning [51] are the two key algorithms in this category.
Temporal-Difference (TD) Learning: TD learning is the most commonly used policy evaluation algorithm. The TD learning algorithm estimates the state-value function V^π of a given policy π iteratively. The update rule for TD learning is derived from the squared Bellman error and is given by (9)

V_{k+1}(s_k) = V_k(s_k) + α_k [R(s_k, π(s_k)) + γ V_k(s_{k+1}) − V_k(s_k)], (9)

where s_k ∼ d^π, s_{k+1} ∼ P(⋅|s_k, π(s_k)), and α_k is the learning rate (or step-size).
Note that the update is done directly after witnessing the transition (s k , a k = π(s k ), r k+1 , s k+1 ).
The notation d^π above denotes the stationary state distribution under policy π. The update term, R(s_k, π(s_k)) + γV_k(s_{k+1}) − V_k(s_k), is called the TD error, and it measures the difference between the current estimate V_k(s_k) and the improved estimate R(s_k, π(s_k)) + γV_k(s_{k+1}). For any fixed policy π, the TD update converges to V^π almost surely (i.e., with probability 1) if the step-size satisfies the so-called Robbins-Monro rule, Σ_k α_k = ∞ and Σ_k α_k² < ∞.
SARSA: "SARSA" refers to the procedure of updating the Q-value by following a sequence of experience …, s_k, a_k, r_{k+1}, s_{k+1}, a_{k+1}, …, and the update rule is given by (10):

Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + α_k [r_{k+1} + γ Q_k(s_{k+1}, a_{k+1}) − Q_k(s_k, a_k)]. (10)

SARSA runs TD learning to evaluate the state-action value function Q^π corresponding to the current policy π, computes an improved policy using Q^π, and alternates both steps to find Q*. SARSA is an on-policy method because the actions a_k and a_{k+1} used in the update equation are both derived from the policy that is being followed at the time of the update.
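A minimal sketch of the SARSA update (10) is given below on a toy three-state chain. The environment dynamics, rewards, and hyperparameters are illustrative assumptions chosen to make the example self-contained.

```python
import random

# Toy 3-state chain (assumed dynamics): move right (a=1) toward the
# terminal state 2, which pays reward 1; all other transitions pay 0.
def env_step(s, a):
    s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == 2 else 0.0)

def epsilon_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return rng.choice([0, 1])
    return max([0, 1], key=lambda a: Q[(s, a)])

rng = random.Random(0)
gamma, alpha, eps = 0.9, 0.5, 0.1
Q = {(s, a): 0.0 for s in range(3) for a in range(2)}

for episode in range(200):
    s = 0
    a = epsilon_greedy(Q, s, eps, rng)
    while s != 2:
        s_next, r = env_step(s, a)
        a_next = epsilon_greedy(Q, s_next, eps, rng)  # on-policy: next action
        # SARSA update, Eq. (10): the target uses the action actually
        # taken in the next state, a_{k+1}.
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next

best_action_at_start = max([0, 1], key=lambda a: Q[(0, a)])
```

Because the next action a_{k+1} comes from the same ε-greedy policy being followed, the update evaluates exactly the policy that generates the behavior.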
Fig. 5. Interaction between the agent and the environment (source: adapted from Ref. [37]).
Q-Learning: The Q-learning algorithm obeys the following update rule (11):

Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + α_k [r_{k+1} + γ max_{a∈A} Q_k(s_{k+1}, a) − Q_k(s_k, a_k)], (11)

where s_k ∼ d^{π_b}, s_{k+1} ∼ P(⋅|s_k, π_b(s_k)), and π_b is called the behavior policy, which usually refers to the policy used to collect observations for learning. The algorithm converges to Q* almost surely [39] provided
that the step-size satisfies the Robbins-Monro rule and every state is visited infinitely often. SARSA and Q-learning differ in how they evaluate the policy they are optimizing. SARSA evaluates the policy based on experience from the policy itself, whereas Q-learning evaluates the policy based on experience from any behavior policy. Hence, a database of past experiences, also referred to as an experience replay buffer, can be used in Q-learning. In stark contrast, a new experience must be created each time the policy is updated in SARSA. As a result, SARSA is called an on-policy method, whereas Q-learning is referred to as an off-policy method.
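The off-policy character of the update (11) can be sketched on the same kind of toy chain: the behavior policy below is purely random, yet the greedy (max) target still recovers the optimal action-values. All dynamics and hyperparameters are illustrative assumptions.

```python
import random

# Tabular Q-learning, Eq. (11), on a toy 3-state chain (assumed dynamics):
# the behavior policy is purely random, but the greedy target lets the
# agent learn the optimal Q-values anyway (off-policy learning).
def env_step(s, a):
    s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == 2 else 0.0)

rng = random.Random(0)
gamma, alpha = 0.9, 0.5
Q = {(s, a): 0.0 for s in range(3) for a in range(2)}

for episode in range(300):
    s = 0
    while s != 2:
        a = rng.choice([0, 1])                    # behavior policy: random
        s_next, r = env_step(s, a)
        target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # greedy (max) backup
        s = s_next

greedy = {s: max([0, 1], key=lambda a: Q[(s, a)]) for s in range(2)}
```

Here Q(1, 1) converges to 1 and Q(0, 1) to γ·1 = 0.9, so the greedy policy moves right in both non-terminal states even though the data were generated at random.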
Deep Q-Network (DQN): Function approximators are used to scale up the above lookup-table methods to problems with very large state and/or action spaces. The deep Q-network (DQN) algorithm [33] is a variant of Q-learning based on neural network (NN) approximations of the Q-function. In particular, an NN parameterized by θ is trained to minimize the following loss function L(θ) (12):

L(θ) = E_{(s,a,r,s′)∼D}[(r + γ max_{a′∈A} Q(s′, a′; θ⁻) − Q(s, a; θ))²], (12)

where D := (e_0, e_1, …, e_{l−1}) is an experience replay buffer of some user-chosen length l, which stores the agent's past experience e_i := (s_i, a_i, r_i, s_{i+1}) to reduce correlations between observations; and θ, θ⁻ are called the online and target variables, respectively. Stochastic gradient descent steps are taken while freezing the target variable θ⁻, and the target variable is replaced with the online variable θ periodically after a number of stochastic gradient steps. The experience replay and target network significantly improve and stabilize the training procedure of Q-learning.
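The loss (12) can be sketched without any deep learning framework by using a simple linear Q-function in place of the NN; the parameters, transitions, and buffer contents below are random illustrative stand-ins, not a real training setup.

```python
import numpy as np

# Sketch of the DQN loss, Eq. (12): a linear Q-function Q(s, a; theta)
# (one weight vector per action), an online and a frozen target parameter
# set, and a small replay buffer of (s, a, r, s') tuples.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.99

theta = rng.normal(size=(n_actions, n_states))   # online parameters
theta_target = theta.copy()                       # frozen target copy

def q_values(params, s):
    return params @ s                             # one value per action

def dqn_loss(theta, theta_target, batch):
    losses = []
    for s, a, r, s_next in batch:
        # Bootstrap target computed with the *frozen* parameters.
        y = r + gamma * np.max(q_values(theta_target, s_next))
        losses.append((y - q_values(theta, s)[a]) ** 2)
    return np.mean(losses)

# A replay buffer of random transitions (stand-in for real experience).
buffer = [(rng.normal(size=n_states), rng.integers(n_actions),
           rng.normal(), rng.normal(size=n_states)) for _ in range(32)]

loss = dqn_loss(theta, theta_target, buffer)
# In DQN proper, gradient steps are taken on this loss while theta_target
# stays frozen, and theta_target is overwritten with theta periodically.
```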
There are many extensions of DQN that improve on the original design, such as Double DQN [52] and Dueling DQN [53]. The max operator in DQN uses the same network values both to select and to evaluate an action; thus, the DQN algorithm suffers from substantial overestimation of the value function, which Double DQN addresses. Like DQN, the dueling network is also a deep NN function approximator for learning the Q-function. In contrast, it approximates the Q-function by decoupling the state-value function and the advantage function.

Policy based methods
Policy-based methods learn the policy directly as a parameterized function of θ, π(a|s; θ). Compared with value-based methods, policy-based methods are effective in continuous action spaces, and they can learn stochastic policies. The following objective is considered (13):

J(θ) := Σ_{s∈S} d^{π_θ}(s) V^{π_θ}(s), (13)

where d^{π_θ}(s) is the stationary distribution of the Markov chain induced by π_θ. The policy gradient theorem, which lays the theoretical foundation for various policy gradient algorithms, states that the gradient of the above objective is given by (14):

∇_θ J(θ) = E_{π_θ}[G_t ∇_θ ln π(A_t|S_t; θ)]. (14)

Here, G_t is the discounted cumulative return starting from time step t. Policy gradient methods search for a local maximum in J(θ) by ascending the gradient of the policy with respect to the parameters θ. Below, we discuss three prominent policy-based methods: REINFORCE, Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO) (cf. Fig. 6).
REINFORCE REINFORCE, also known as Monte Carlo policy gradient, relies on a Monte Carlo estimate of the return from sampled episodes to update the policy parameter θ (15):

θ ← θ + α γ^t G_t ∇_θ ln π_θ(a_t|s_t).

The stochastic gradient method ensures convergence to a local optimum when a decreasing step size α_t is chosen such that Σ_t α_t = ∞ and Σ_t α_t² < ∞. The above vanilla policy gradient update is unbiased but has high variance. A widely used variation of REINFORCE subtracts a baseline value from the return G_t to reduce the variance of the gradient estimate while leaving its expected value unchanged. For example, a common baseline is to subtract the state value from the action value.
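The REINFORCE update with a baseline can be sketched on a toy two-armed bandit (a hypothetical example, not drawn from the cited papers). Arm 1 pays 1.0 on average and arm 0 pays 0.2, so the softmax policy should come to prefer arm 1:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.zeros(2)            # policy parameters; pi = softmax(theta)
baseline, alpha = 0.0, 0.1     # running-average baseline and step size

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = int(rng.choice(2, p=pi))
    G = rng.normal(loc=(0.2, 1.0)[a], scale=0.1)   # sampled return G_t
    baseline += 0.05 * (G - baseline)              # variance-reducing baseline
    grad_log_pi = -pi                              # gradient of log pi(a|theta)
    grad_log_pi[a] += 1.0
    theta += alpha * (G - baseline) * grad_log_pi  # REINFORCE update
```

Subtracting the baseline changes only the variance of the update, not its expectation, which is why the policy still converges toward the better arm.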
Trust region policy optimization (TRPO) To improve training stability, we should avoid parameter updates that change the policy too much at one step. The key idea in TRPO [54] is to define a KL-divergence-based trust region that constrains updates to the policy. This constraint is in the policy space rather than in the parameter space and becomes the new "step size" of the algorithm. In this way, we can approximately ensure that the new policy after the policy update performs better than the old policy. Concretely, TRPO solves the following constrained optimization problem (16)-(19):

max_θ E[ ( π_θ(a|s) / π_{θ_old}(a|s) ) Â(s, a) ]  subject to  E[ D_KL( π_{θ_old}(·|s) ‖ π_θ(·|s) ) ] ≤ δ,

where Â(s, a) is an estimate of the advantage function and δ is a hyperparameter. Proximal Policy Optimization (PPO) PPO relies on a clipped surrogate objective function to ensure that the new policy does not move far from the old policy. PPO is significantly simpler to implement and empirically appears to perform at least as well as TRPO. Denote the probability ratio between the new and old policies as (20):

r_k(θ) = π_θ(a|s) / π_{θ_old}(a|s).

PPO imposes the constraint by forcing r_k(θ) to stay within a small interval around 1, namely [1 − ε, 1 + ε], where ε is a hyperparameter.
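The clipped surrogate built from the ratio r_k(θ) in (20) takes only a few lines; the sample ratios and advantages below are made up for illustration:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Mean clipped surrogate over a batch of (ratio, advantage) samples."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # The elementwise minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps].
    return np.minimum(unclipped, clipped).mean()

ratio = np.array([0.5, 1.0, 1.5])   # hypothetical new/old probability ratios
adv = np.array([1.0, 1.0, 1.0])     # hypothetical advantage estimates
```

For a positive advantage, a ratio of 1.5 is clipped to 1.2, so the surrogate stops rewarding further increases; for a negative advantage, the unclipped (more pessimistic) term is kept.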

Actor-critic methods
Policy-based (actor-only) methods explicitly learn a policy that implicitly maximizes the discounted cumulative reward; however, these methods are disadvantaged by high variance in gradient estimates. Value-based (critic-only) methods use the Bellman optimality relationship to derive the policy from the learned value function; these methods have a lower variance in expected return estimates. Actor-critic methods combine the advantages of the actor-only and critic-only methods. Actor-critic algorithms parameterize both policy and value functions and simultaneously update them during training; thus, they belong to both the value-based and policy-based groups (cf. Fig. 6). They often exhibit better empirical performance than purely value-based or purely policy-based methods.
Deterministic policy gradient (DPG) In an off-policy setting, we consider the following performance objective (23):

J_b(μ) = E_{s ∼ ρ^b} [ V^μ(s) ] = E_{s ∼ ρ^b} [ Q^μ(s, μ(s)) ],

which is the value function of the target policy μ, averaged over the state distribution ρ^b of a (stochastic) behavior policy b : S × A → [0, 1]. Then, we consider parameterized deterministic policies {μ_θ : θ ∈ Θ} and search for θ that maximizes the performance objective J_b(μ_θ). By an abuse of notation, we denote J_b(θ) = J_b(μ_θ). If the policy parameterization μ_θ is differentiable, then under some regularity conditions on the MDP, the gradient of J_b(θ) can be expressed as (24):

∇_θ J_b(θ) ≈ E_{s ∼ ρ^b} [ ∇_θ μ_θ(s) ∇_a Q^μ(s, a) |_{a = μ_θ(s)} ].
The above equation is referred to as the off-policy deterministic policy gradient [55]. Deep deterministic policy gradient (DDPG): DDPG [35] is an off-policy actor-critic method applicable to continuous action spaces. It maintains a parameterized actor function μ_θ, which specifies the current deterministic policy. The critic Q_w(s, a), parameterized by w, is learned using the Bellman equation by minimizing the empirical Bellman residual

L(w) = E_{(s_t, a_t, r_t, s_{t+1}) ∼ b} [ ( Q_w(s_t, a_t) − y_t )² ],  where y_t = r_t + γ Q_w(s_{t+1}, μ_θ(s_{t+1}))

and b is a behavior policy used to collect samples. The workflow of the DDPG algorithm is shown in Fig. 7. Twin delayed DDPG (TD3): It is well known that Q-learning and deterministic policy gradients suffer from an overestimation bias due to noise in the value estimates. To address this issue, Ref. [56] proposed a variant of DDPG built on double Q-learning, with the following changes: 1. It maintains a pair of critics along with a single actor. 2. It uses clipped target-action noise: noise is added as in DDPG but bounded to a fixed range (25):

ã = μ_θ(s) + ε,  ε ∼ clip(N(0, σ), −c, c).

3. The critics are updated more frequently than the policy. Soft actor-critic (SAC) The SAC algorithm [57] incorporates the entropy of the policy into the reward to encourage exploration. It is an off-policy actor-critic model following the maximum entropy RL framework. The SAC algorithm searches for the policy (26):

π* = argmax_π Σ_t E [ r(s_t, a_t) + α H(π(·|s_t)) ],

where H denotes the entropy and α is the trade-off between reward and entropy. The entropy maximization leads to policies that can (1) explore more and (2) capture multiple modes of near-optimal strategies. For a detailed review of policy gradient and actor-critic methods, see Ref. [58].
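TD3's clipped target-action noise (25) and twin-critic minimum can be sketched as follows; the policy and critics here are toy closed-form functions standing in for learned networks, not an implementation from the cited works:

```python
import numpy as np

rng = np.random.default_rng(2)

def td3_target(r, s_next, mu, q1, q2, gamma=0.99, sigma=0.2, c=0.5,
               a_low=-1.0, a_high=1.0):
    # Clipped Gaussian noise on the target action, as in eq. (25).
    noise = float(np.clip(rng.normal(0.0, sigma), -c, c))
    a_next = float(np.clip(mu(s_next) + noise, a_low, a_high))
    # Minimum over the twin critics curbs overestimation.
    return r + gamma * min(q1(s_next, a_next), q2(s_next, a_next))

mu = lambda s: 0.5 * s            # toy deterministic policy
q1 = lambda s, a: s + a           # toy critic 1
q2 = lambda s, a: s + a + 0.3     # toy critic 2, biased upward; min() discards the bias
```

Because q2 is uniformly larger than q1 here, the minimum always falls back on q1, illustrating how the twin-critic trick discards an upward-biased estimate.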

Model-based RL
Although model-free approaches have been successfully applied in several domains, such as games and robotics, their high sample complexity is a critical problem in practice. Model-based approaches have long been recognized as a potential avenue for reducing the sample complexity of RL algorithms. In model-based RL, the agent interacts with the environment and gathers experience to learn a model of it. Recently, Ref. [59] demonstrated that model-based methods can be exponentially more sample-efficient than model-free methods in general contextual decision processes. For linear quadratic regulators, a gap between model-based and certain model-free algorithms has been reported [60].
The Dyna algorithm [42] alternates between learning the model from data gathered by executing the current policy in the environment and improving the policy with imaginary data generated by the learned model. The model-ensemble trust-region policy optimization (ME-TRPO) [61] is a Dyna-style algorithm that maintains an ensemble of neural networks to model the dynamics; it uses TRPO in the policy improvement step with data generated by the learned dynamics models. Recently, Luo et al. [62] proposed an algorithmic framework for learning both policy and model with a monotonic improvement guarantee; their final practical algorithm, stochastic lower bound optimization (SLBO), is a variant of ME-TRPO. Apart from Dyna-style algorithms, there are other types of model-based RL algorithms (cf. [63]).
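A minimal tabular Dyna-Q loop on a hypothetical three-state chain (moving right from the last state pays reward 1 and resets the agent) illustrates the alternation between direct RL updates from real transitions and planning updates from the learned model; this is an illustrative sketch, not code from the cited works:

```python
import numpy as np

rng = np.random.default_rng(3)
n_s, n_a, gamma, alpha, eps = 3, 2, 0.9, 0.5, 0.5   # actions: 0 = left, 1 = right

def step(s, a):
    """Deterministic toy environment: right from state 2 pays 1 and resets."""
    if a == 1:
        return (0, 1.0) if s == 2 else (s + 1, 0.0)
    return (max(s - 1, 0), 0.0)

Q = np.zeros((n_s, n_a))
model = {}                    # learned deterministic model: (s, a) -> (s', r)
s = 0
for _ in range(1000):
    # epsilon-greedy: execute the current policy with exploration
    a = int(rng.integers(n_a)) if rng.random() < eps else int(np.argmax(Q[s]))
    s2, r = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])   # direct RL update
    model[(s, a)] = (s2, r)                                  # model learning
    keys = list(model)
    for _ in range(10):                                      # planning updates
        ps, pa = keys[int(rng.integers(len(keys)))]
        ps2, pr = model[(ps, pa)]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps2].max() - Q[ps, pa])
    s = s2
```

The ten imagined updates per real step are what make Dyna-style methods sample-efficient: value information propagates through the model without further environment interaction.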

Batch RL
The RL algorithms discussed above are all interaction-based (online RL) methods, i.e., they interact with the environment to collect data while updating the policy or value parameters. Batch RL algorithms decouple data collection from policy optimization: they operate on a fixed set of experience {(s, a, r, s′)} collected using certain behavioral policies and do not interact with the environment during policy training [64]. Batch algorithms therefore sidestep the exploration-exploitation problem and fully leverage the available data; that is, they are more data-efficient. The least-squares policy iteration (LSPI) [65] is a model-free batch RL algorithm that alternates between policy evaluation (learning a linear Q-function approximation) and policy improvement steps. The fitted Q-iteration (FQI) [66,67] is the most popular algorithm in batch RL; it is a straightforward batch version of Q-learning that allows the use of any function approximator for the Q-function (e.g., random forests and deep neural networks). Some recent batch deep RL algorithms, such as random ensemble mixture (REM) [68], batch-constrained deep Q-learning (BCQ) [69], and bootstrapping error accumulation reduction Q-learning (BEAR-QL) [70], should also be mentioned.
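The FQI iteration can be sketched on a tiny fixed batch of transitions from a hypothetical two-state MDP (action 1 in state 1 pays reward 1); here the regression step is a one-hot least-squares fit, which reduces to averaging targets per state-action pair, but any regressor (random forest, deep network) could be substituted:

```python
import numpy as np

# Fixed batch of transitions (s, a, r, s'); no further environment interaction.
batch = [
    (0, 0, 0.0, 0), (0, 1, 0.0, 1),
    (1, 0, 0.0, 0), (1, 1, 1.0, 0),
]
gamma, n_s, n_a = 0.9, 2, 2
Q = np.zeros((n_s, n_a))
for _ in range(50):                        # FQI iterations
    targets = {}
    for s, a, r, s2 in batch:              # targets y = r + gamma * max_a' Q(s', a')
        targets.setdefault((s, a), []).append(r + gamma * Q[s2].max())
    Q_new = np.zeros_like(Q)
    for (s, a), ys in targets.items():
        Q_new[s, a] = np.mean(ys)          # "regression" step: one-hot least squares
    Q = Q_new
```

Each iteration is one Bellman backup fitted by regression, so the Q-values converge geometrically (rate γ) toward the fixed point; here Q[1, 1] approaches 1/(1 − γ²) ≈ 5.26.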

Implementation
Several open-source deep RL packages are available with either Tensorflow or PyTorch implementations of state-of-the-art deep RL algorithms and good documentation. The prominent ones include OpenAI Baselines [71], Stable Baselines [72], Tensorforce [73], and MushroomRL [74]. These libraries enable quick and reliable implementation and testing of RL models for the problem of interest. Ref. [75] prescribes a list of best practices for deep RL researchers and practitioners to follow when running deep RL experiments and reporting the results.

Top-down view of applications
Several RL methodologies have been successfully applied in the energy system domain. Regardless of the methodology or application, two driving factors can be considered the main reasons for selecting RL over other available methods: RL 1) is effective in handling uncertainties and 2) is a model-free approach. For example, uncertainties in demand, renewable energy potential, energy markets, and grid conditions are regarded as major problems in energy dispatch; these uncertainties can be effectively managed by using RL [32]. In other cases, such as BEMSs, it is difficult to develop a comprehensive model of the energy flows, and the model-free nature of RL is advantageous. As explained in Section 3, different RL approaches have been used to overcome these problems. In this section, the effectiveness of these approaches in different domains is assessed in a more holistic manner using a top-down approach. In view of this, Section 4.1 discusses the diversity of the foregoing problems. Section 4.2 provides a cross comparison among the different classes of problems, highlighting their similarities and differences. Based on that, Section 4.3 elaborates on the different RL methodologies used in the energy system domain. Finally, Section 4.4 explicates the verification methodology and reproducibility of existing state-of-the-art applications.

Problem diversity
As explained in Section 3, RL can be effectively used in a number of different domains, some of which can be easily interlinked. This section provides a more holistic overview of the diversity of the control problems addressed in each domain.

Building energy management systems (BEMS)
Building energy management systems (BEMS) mainly focus on controlling the energy flows within a building while taking into account changes in climate, occupancy, equipment usage, etc. RL has been used to control heating, ventilation, and air conditioning (HVAC), lighting, and blinds with several objectives, such as minimizing operational cost and improving energy efficiency, comfort, and indoor air quality. A number of publications related to BEMS, including the different elements of the problem, are summarized in Table 1. The majority of publications (more than 80%) focus on the optimal use of HVAC systems (Table 1), whereas papers that mainly discuss other building elements, such as blinds and lighting appliances, are limited. Considering energy consumption and visual and thermal comfort, the lighting control and HVAC of a building can be effectively linked; however, except for the papers of Park et al. [76] and Cheng et al. [77], this link has not been reported, and only lighting has been discussed. Similarly, although interactions with the grid play an important role in HVAC system operation, they are presented in only four papers. Indoor air quality has been elaborated only in Refs. [78][79][80]. As for minimizing operational cost, only Wen et al. reported on the scheduling of building jobs [81]. It can thus be concluded that the majority of RL applications focus only on improving the thermal performance of BEMS. Accordingly, considering several building elements together offers considerable potential to improve the application diversity. Although cost minimization is a major problem, most publications in BEMS only discuss energy efficiency and visual or thermal comfort. Numerous studies pursue several objectives by means of weighted objective functions.
It can be concluded that only a narrow set of control problems is considered in the BEMS sector, leaving ample opportunity to improve the diversity.

Dispatch problem
The dispatch problem is one of the most widely examined in the energy system domain. Irrespective of the methodology, the dispatch problem is well investigated because of the popularity of renewable energy technologies and energy storage. Several notable changes can be observed when moving from BEMS to the dispatch problem. According to the list in Table 2, the focus shifts from heat to electricity, and cost is given a higher priority in the dispatch problem. Except for [82][83][84], none of the papers reported on heating. Kofinas et al. [85,86] considered desalination as an application of the dispatch problem, but all other papers (95%) focused only on the electricity sector. Furthermore, more than 80% of the articles focused on minimizing system cost, again indicating a notable deviation from BEMS. A considerably broader problem with a number of control elements is considered in the dispatch problem. The optimal operation of energy storage and dispatchable sources is considered by 82% and 60% of the publications, respectively. The dispatch strategy is sensitive to fluctuations in renewable energy sources, as reported in more than 70% of the papers. Similarly, the factors that influence grid price signals are considered by more than 50% of the publications. Evidently, compared to BEMS, a more complex problem is taken into account in the dispatch problem, and more diversified system designs are considered according to their application. A moderate number of publications use multi-agent reinforcement learning (MARL). Although the majority of papers report that a cooperative scenario can lead to a correlated equilibrium, its problem formulation is usually more demanding than that of a single-agent scenario. Foruzan et al. [87] considered the non-cooperative scenario, where the Nash equilibrium is guaranteed, making it a more extensive process than single-agent RL.
Finally, it can be concluded that RL is well established as a potential method for solving the dispatch problem. More diverse operation problems have been solved considering numerous elements in an energy system. It will be interesting to consider heating, cooling, and other energy services as well as the dispatch problem for future research.

Energy markets and grid
The number of RL applications beyond BEMS and dispatch problems is steadily increasing. Participation in energy markets (the day-ahead, balancing, or spot market) has been addressed using RL. All papers related to energy markets focus only on electricity (Table 3). Optimal control strategies for energy storage, dispatchable sources, and demand response, including fluctuations in the energy markets, are obtained by using RL. Compared to the dispatch problem, a simplified energy system, which can be extended to consider more comprehensive problems, is considered in these studies. Notably, 44% of the papers consider MARL, which is a reasonable improvement over both BEMS and dispatch problems, allowing the participation of multiple sectors. This indicates that a theoretical platform already exists for a broader extension of energy market problems by linking them with other problems, such as dispatch and BEMS.
A grid facilitates the linking of energy systems with other energy systems and markets. Therefore, it performs an important role when considering RL applications in energy systems; it can also be considered a boundary. RL has been appropriately used for grid operation from low to high voltage levels, taking into account several aspects (Table 4). This study does not attempt to review the publications that use RL to resolve the grid operation problem; instead, a few papers that discuss the interaction between energy systems and the grid are selected. The publications that present the link between energy systems and the grid consider two aspects: reliable operation and the stability of the grid linked with generation. The papers that discuss reliable operation consider n-1 security (Zarrabian et al. [88]) and component outages caused by maintenance activities (Rocchetta et al. [89]). Some papers consider the stability aspect, including frequency/voltage stability and the transient stability of the grid. However, jointly addressing the optimal operation of the energy system and the healthy operation of the grid remains an open problem; this will be an interesting research direction for future investigations.

Vehicles and energy devices
Applications in hybrid vehicles and energy devices have a relatively specific scope compared to those presented in Sections 4.1.1-3. RL algorithms used in the energy management systems of vehicles involve either multiple storage devices or an energy storage device combined with an internal combustion engine (ICE). The energy management problem of vehicles is similar to the dispatch problem discussed in Section 4.1.2, although the operational strategy varies depending on the traffic and the driving style. Most publications focus on the combination of an ICE and a battery bank (Table 5), and a few report on the combination of a battery bank with H2 or supercapacitor storage. Furthermore, the majority of publications focus on improving fuel efficiency, whereas Ermon et al. [90], Xiong et al. [91], and Reddy et al. [92] successfully improved battery lifetime. Brusey et al. [93] used RL to improve the thermal comfort inside vehicles, formulating a problem similar to the BEMS. However, the link between transportation and electric charging, which is usually discussed with the BEMS or dispatch problems, has not been considered in any of these publications, and the details of energy flow within a vehicle are not considered in the same depth as in other publications. It is further observed that most of these papers were published by a single research group. Similar to energy systems in vehicles, RL applications in energy devices are mainly related to optimal power point tracking for renewable energy devices. Maximum power point tracking (MPPT) for wind turbines, solar panels, and wave energy generators is considered in this context (Table 6). The main difference between MPPT and the typical BEMS and dispatch problems is that MPPT must be solved at a finer time resolution: usually, MPPT operates on the scale of seconds, whereas the BEMS and dispatch problems are solved at a resolution of 15 min or 1 h.
As a result, no study that uses RL to link energy system operation and MPPT is found, although these two problems are closely related (see Fig. 8).

Cross comparison among different topics and studies covering several areas
The operation problems discussed under Section 4.1 share many similarities. As shown in Fig. 10, dividing each problem class into sub-areas makes the interrelations easy to understand. For example, BEMS is closely linked with the dispatch problem, with which it shares many common areas: demand response, the integration of renewables, energy storage, energy demand for heating and cooling, and overall operational cost reduction are commonly discussed in these two classes. This becomes quite clear when analyzing the integrated class of problems, as shown in Fig. 9. Of the publications in the integrated class, 70% consider BEMS together with the dispatch problem. Most of these publications focus on cost minimization while considering fluctuations in the grid electricity price, renewable energy generation, and demand (Table 7). Similarly, electric vehicle charging has been closely linked with both BEMS and dispatch problems. Such a detailed workflow covering several sectors shows the potential of RL to link several energy management problems; this is an exciting attribute for the energy transition, where sector coupling is expected to perform a major function. However, reasonable simplification of the BEMS and dispatch problems is observed when the scope of the problem is extended: most of these publications do not pay careful attention to the building energy model, the energy conversion processes in the system components, etc. Furthermore, BEMS has relatively few interactions with the energy market and grid; these interactions need to be strengthened to interlink the building sector with the energy internet.
The dispatch problem is closely related to the energy market and grid domains, in addition to its close interlinks with BEMS. The close relationship between dispatch and market is easy to understand, as the operation of energy systems is often controlled by price signals from the energy markets. Several studies, broadly discussed in Section 4.1.2, use RL for storage management and renewable energy integration while taking the energy markets into account; often, these studies consider only price signals from the markets. However, some studies consider the detailed dynamics of both the energy markets and dispatch, as presented in Section 4.1.3. Simultaneously, multi-agent models have been introduced to represent the participation of different micro-grids, and of different components of the same micro-grid, in energy markets. Claessens et al. [94] and Foruzan et al. [87] used MARL in developing an energy management strategy, which is a more difficult task. Linking several sectors can introduce additional complexities related to uncertainties and model approximation. Similar to BEMS, the dispatch problem is oversimplified when moving into multi-agent systems that deal with the energy markets. It is not easy to find publications that cover a reasonable level of physics (depth) while accommodating both dispatch and energy markets. Energy markets are undergoing significant expansion, opening up to many participants. Energy markets with large-scale participation, where a large group of distributed energy systems interacts, are not discussed (cases where the energy markets, the grid, and the dispatch problem need to be considered together).
The relationship between the grid and the energy system can be understood as follows: the dispatch strategy helps maintain grid stability, which leads to a coupling between the optimal power flow and optimal dispatch problems. This is clear when analyzing the publications in the dispatch domain (Section 4.1.2). As in the building sector, a reasonable simplification of the control problem is observed when linking these two problems. Machine learning techniques such as supervised learning (especially with graph convolutions) have been used for optimal power flow problems more often than reinforcement learning. Accordingly, joint consideration of the optimal power flow problem and the dispatch strategy using RL is not found in the literature. Often, the influence of micro-grids on local grid stability and the healthy operation of the grid are taken into account along with the dispatch problem. It would be interesting to evaluate the potential of reinforcement learning to consider the dispatch and grid problems jointly, which would notably help peer-to-peer trading, frequency regulation markets, etc. Networked multi-agent architectures might be quite interesting in this regard, but they have been little discussed within the domain. A limited number of publications attempt to link several sectors. Most of the publications covering a broad set of domains are centered on the dispatch problem, which often focuses on the electricity sector. This highlights the dispatch problem's capability to serve as a hub while facilitating other domains to connect to it. Secondly, BEMS has been closely linked with the dispatch problem within the integrated problem domain.
Table 2. Diversity of the elements considered for the dispatch problem. Much broader problems, mainly focused on the electricity sector, are considered under the dispatch problem.
This highlights RL's potential to extend the boundaries of the classical dispatch or unit commitment problem, which is often limited to generation. RL enables the consideration of complex aspects of energy demand, such as lighting and user comfort (Lee & Choi [95]), along with the dispatch problem, making energy system operation more user-friendly and adaptive. Similarly, the dispatch problem can be easily linked with the vehicle-to-grid problem, which is quite similar to the job shop scheduling problem. This highlights RL's capability to link the manufacturing sector to the energy markets while considering complex operations within the industrial sector, an interesting future extension. However, the interlinks between the vehicle and device domains are limited. Only a limited number of links between grid operation and the MPPT of devices are found with the other fundamental problems, mainly because of the mismatch among time resolutions: grid operation and MPPT both demand a considerably finer time resolution than the other problems. It would be interesting to explore potential means of linking these sectors while resolving the time-resolution mismatch.
Table 3. Elements considered for RL problems in energy markets.
Table 5. Objectives considered in vehicle energy system studies: [142], Liu et al. [143], and Zou et al. [144]: fuel consumption by engine; Yuan et al. [145]: hydrogen consumption; Xiong et al. [91]: battery energy loss, ultra-capacitor energy loss, and DC/DC converter loss; Brusey et al. [93]: thermal comfort; Reddy et al. [92]: improving battery life by minimizing charge-discharge cycles; Wu et al. [146]: sum of costs for fuel and electricity consumption; Zhou et al. [147]: energy used for vehicle operation; Qi et al. [148]: fuel efficiency.
Table 6. Applications of RL in energy system devices (including Aguirre et al. [153]).

RL methodologies
RL methods can be classified in several different ways. In this study, RL techniques are grouped into three main classes: interaction-based methods, interaction-free methods (termed batch RL), and other methods, such as gradient-free techniques. Based on the state of the art, the majority of publications (more than 80%) report the use of interaction-based methods. A detailed evaluation of the applicability of these techniques is presented in Section 4.2.1. Interaction-free methods include batch learning, which appears promising for energy systems because of the availability of historical data, as discussed in Section 4.2.2. Finally, other alternatives, which do not fall under either of these classes, are also discussed.
Table 7. Applications of RL in integrated problems.

Interaction-based methods
The class of interaction-based RL methods includes many techniques. RL algorithms that belong to this class interact with the environment during the learning process. The literature on interaction-based methods is categorized into value-based RL, policy-based RL, actor-critic, and model-based RL methods, in line with the classification presented in Section 3.
Value-based RL. Value-based methods, including Monte Carlo (MC), Q-learning, and SARSA methods, have been extensively used in the energy system domain. Among these three techniques, the MC approach is the least preferred by the energy system community because of its sample complexity, which results from the high variance of complete trajectories (Table 8). MC has been used in only one study (Rayati et al. [157]), under the IP category in the tabular setting. SARSA is also a less frequently used approach; approximately 7% of papers in the SP category report the use of this technique, and the authors found no study that uses SARSA to solve integrated problems. Both the standard and lambda variants of SARSA have been applied in the domains of BEMS [104], dispatch [122], energy markets [137], and vehicle energy systems [148]. A clear contrast is observed when shifting from both MC and SARSA to Q-learning (Table 8).
In the energy system domain, Q-learning is the predominantly employed approach, accounting for more than 50% in the specific problem category and 57% in the integrated problem domain (cf. Table 8). Q-learning has been applied in all energy system domains. As for its branches, Q-learning in the tabular setting is used in all domains: BEMS [78,96], dispatch [83,85], energy markets [129,132], grid [88,138], vehicle energy systems [10,142], and devices [149,151]. On the other hand, the use of the Q(lambda) algorithm has been reported only in publications related to BEMS [103] and dispatch [117,120]; it has potential application in other domains as well. In addition to tabular Q-learning, different function approximators have been explored, i.e., linear function approximation [124], multi-layer perceptrons [115], convolutional neural networks [140], and deep neural networks (DNN) [79,89,131]. Q-learning with function approximators is typically preferred in the specific problem category. However, note that there is a difference between Q-learning with a DNN and the deep Q-network (DQN): in addition to function approximation, the DQN introduces further stratagems, such as experience replay and a target network, to improve stability. Only publications related to BEMS (SP) and dispatch (SP) have reported the use of the DQN. The use of sophisticated DQN variants, such as double DQN (except for Ref. [79]), dueling DQN, and Rainbow, has not been reported in any of the published papers, although some papers report the use of other variants of Q-learning, such as the greedy-GQ algorithm [124], speedy Q-learning [143], and multistep RL [147]. Regarding SARSA, in addition to tabular methods, function approximation methods, such as linear function approximation [137] and tile coding [93], have also been considered. It is interesting to note that Ebell and Pruckner [126] considered an MDP setting with both discrete and continuous state spaces.
Policy-based RL. In the RL-based energy system literature, the use of policy-based methods is less frequently reported (Table 8). In particular, among all the papers considered, only 3.2% in the specific problem group and 3.8% in the integrated problem category applied policy-based RL methods. Even these few works only utilized policy-based methods such as the vanilla policy gradient (with and without a baseline) [105,108,158] and TRPO [82]. In the selected pool of papers, the authors have not found any work that uses PPO, another state-of-the-art policy-based approach. It is thus concluded that a detailed investigation of the effectiveness of policy-based methods for energy-related problems should be performed.
Actor-critic methods. In the model-free setting, the actor-critic approach is the most frequently applied in energy systems after Q-learning. Specifically, 8.4% of works in the specific problem domain and 15.4% in the integrated problem category applied actor-critic methods (cf. Table 8). Modern deep actor-critic algorithms, such as A3C [99,112] and DDPG [82,100,146,159], have only been used by works focusing on BEMS (SP), dispatch (SP), vehicle energy systems (SP), and IP. Furthermore, the authors have not encountered any work that utilizes state-of-the-art actor-critic algorithms, such as TD3 and SAC, in the pool of papers considered. Finally, it should be noted that despite the successful results achieved by model-free RL methods, these techniques often require numerous samples.

Model-based RL (MBRL).
Interestingly, considerable effort (13%) has been devoted to applying model-based RL/planning methods in the specific problem category, whereas the integrated problem domain has entirely ignored this approach. Compared to specific problems, integrated problems involve more complex energy models; thus, learning an accurate model (with a small error) in the IP setting would require significantly more samples than in the SP case. Learning an accurate model is crucial for MBRL; otherwise, a compounding error problem results. In this situation, learning local models would help improve the sample complexity while keeping the modeling error as low as possible [163]. In addition to the tabular representation of transition dynamics, function approximators are also used to represent the model [98,111].
Note that only the studies related to BEMS [76,98] and vehicle energy systems [90] have attempted to learn the model and apply RL/planning algorithms to the learned model. In these works, while learning the model, the transition dynamics is represented as a table [10,76,90,144], a random forest (RF) [98], a neural network (NN) [98], or a DNN ensemble [111]. The Dyna-Q algorithm, which alternates between model-learning and Q-learning blocks, has also been utilized in energy systems [10,144]. It should be noted that recent deep RL papers report extensive investigations of MBRL because of this technique's sample-efficient nature [59,60]. MBRL methods are predominantly employed in the robotics domain [36,164]. The energy system community should also consider these recent advances and adapt them to energy-related problems.
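The Dyna-Q alternation between direct RL and planning on the learned (here tabular) model can be sketched as follows. The four-state deterministic chain, a toy stand-in for, e.g., a storage charge/discharge task, and all constants are illustrative assumptions; exploration is uniformly random for simplicity, which is valid because Q-learning is off-policy.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma, alpha = 4, 2, 0.95, 0.5
Q = np.zeros((nS, nA))
model = {}   # learned deterministic model: (s, a) -> (r, s')

def step(s, a):
    # toy environment: move left (a=0) or right (a=1) along a 4-state chain;
    # reward 1.0 on reaching the last state (hypothetical target state)
    s2 = min(s + 1, nS - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == nS - 1 else 0.0
    return r, s2

s = 0
for _ in range(3000):
    a = int(rng.integers(nA))                               # random exploration
    r, s2 = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])  # direct RL block
    model[(s, a)] = (r, s2)                                 # model-learning block
    keys = list(model)
    for _ in range(10):                                     # planning block
        ps, pa = keys[int(rng.integers(len(keys)))]
        pr, ps2 = model[(ps, pa)]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps2].max() - Q[ps, pa])
    s = 0 if s2 == nS - 1 else s2                           # episodic restart
```

The ten planning updates per real step replay stored transitions from the learned model, which is what gives Dyna-style methods their sample efficiency relative to plain Q-learning.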

Batch RL
These methods depend on historical data to learn control strategies; moreover, they do not interact with the environment while learning. The use of batch RL algorithms has only been observed in works related to BEMS (SP), dispatch (SP), and BEMS + dispatch (IP). This method has significant potential to be used with model predictive control, which can provide a rich historical dataset. In Refs. [84,94,110], batch RL methods have been used in the tabular setting. The fitted Q-iteration algorithm with function approximators, such as a DNN [97,156], ERT [106], and an ERT ensemble [107], has also been employed.
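Fitted Q-iteration, the workhorse of these batch methods, can be sketched in its simplest (tabular) form: a fixed batch of historical transitions is collected once, and the Q-function is then refit repeatedly to bootstrapped targets without any further interaction. The three-state environment (holding in one state yields the reward, e.g., discharging at peak price) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 3, 2, 0.9

def step(s, a):
    # hypothetical environment: a=1 cycles through the states, a=0 stays;
    # holding (a=0) in state 2 yields the reward
    s2 = (s + 1) % nS if a == 1 else s
    r = 1.0 if (s == 2 and a == 0) else 0.0
    return r, s2

# 1) fixed historical batch from a random behavior policy (no interaction later)
batch, s = [], 0
for _ in range(3000):
    a = int(rng.integers(nA))
    r, s2 = step(s, a)
    batch.append((s, a, r, s2))
    s = s2

# 2) fitted Q-iteration: repeatedly regress Q onto bootstrapped targets, offline
Q = np.zeros((nS, nA))
for _ in range(50):
    tot = np.zeros((nS, nA))
    cnt = np.zeros((nS, nA))
    for (bs, ba, br, bs2) in batch:
        tot[bs, ba] += br + gamma * Q[bs2].max()
        cnt[bs, ba] += 1
    Q = tot / np.maximum(cnt, 1)   # tabular "regression" = per-pair averaging
```

Replacing the per-pair averaging with a DNN or extremely randomized trees fit on the same targets yields the function-approximation variants cited above; the batch itself is never revisited by the environment.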

Other techniques
RL applications that are not categorized under the aforementioned classes are considered under other techniques. Most of these publications report the use of non-gradient optimization techniques and mainly rely on evolutionary algorithms. Evolutionary algorithms have been used directly to tune the control rules of Markov decision processes for both BEMS and dispatch problems. Both fuzzy [31,165] and crisp rules [166,167] have been considered in this regard. Crisp rules are frequently employed for simple energy systems (e.g., standalone hybrid energy systems), whereas fuzzy rules are used for the dispatch problem of grid-integrated energy systems, which present a considerably more complex state space. Although there is a recent trend towards neuro-evolutionary RL, the authors have not encountered any publication that reports the use of evolutionary algorithms to optimize the function approximation process. Hybrid approaches that combine gradient methods and evolutionary algorithms have also not been found. Most of the studies that use evolutionary algorithms consider distributed rewards, which would also permit gradient-based methods; it would be interesting to analyze whether the use of evolutionary algorithms in some of those applications affords added advantages. In addition to evolutionary algorithms, fuzzy logic and TRPO have been directly applied.
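As a minimal illustration of tuning a crisp rule with an evolutionary method, the following (1+1) evolution strategy optimizes a single threshold of a buy-to-storage rule against a simulated price trace. The price model, storage rule, and all parameters are toy assumptions of ours, not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(4)
prices = rng.uniform(0.1, 1.0, 200)   # hypothetical day-ahead price trace

def fitness(theta):
    # crisp rule: when the price is below the threshold theta, buy one extra
    # unit into storage (capacity 5); otherwise serve demand from storage if
    # possible, else buy at the market price; fitness = negative total cost
    cost, storage = 0.0, 0.0
    for p in prices:
        if p < theta and storage < 5:
            cost += 2 * p            # buy demand (1 unit) + storage (1 unit)
            storage += 1.0
        elif storage >= 1.0:
            storage -= 1.0           # serve demand from storage
        else:
            cost += p                # buy demand at the market price
    return -cost

# (1+1) evolution strategy over the single rule parameter
theta, best = 0.5, fitness(0.5)
for _ in range(300):
    cand = float(np.clip(theta + rng.normal(0.0, 0.05), 0.0, 1.0))
    f = fitness(cand)
    if f >= best:                    # keep the candidate if it is no worse
        theta, best = cand, f
```

No gradient of the fitness is ever computed, which is exactly why such methods suit rule-based controllers with non-differentiable, episodic objectives.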

Final remarks
Both BEMS and dispatch-related studies have employed all RL methods except the MC method. On the other hand, studies on the BEMS + dispatch integrated problem have only used Q-learning and batch RL methods. In the investigations related to energy markets and vehicle energy systems, Q-learning, SARSA, actor-critic, and evolutionary/other methods have been employed; the studies related to vehicle energy systems have also used MBRL methods. In grid-related research, only Q-learning, actor-critic, and MBRL methods have been used. In device-related studies, only Q-learning, MBRL, and evolutionary/other methods have been utilized. On the integrated problem side, Foruzan et al. [87] (dispatch + market) employed Q-learning, and Odonkor and Lewis [159] (market + BEMS + dispatch) used the actor-critic method. In the V2G + BEMS + dispatch category, Q-learning, MC, and PG methods have been applied. Furthermore, the dispatch + V2G-related studies have used Q-learning and actor-critic methods. Q-learning is the most predominantly used approach (more than 50%) in both SPs and IPs. Given the broad range of RL methods available, it would be advantageous to investigate other RL methods across all SPs and IPs. It is necessary to create benchmark datasets for energy-related problems so that different RL algorithms can be fairly compared. Model-based RL (specifically, learning local models for integrated problems) and batch RL methods seem worth exploring because they may lead to interesting open research problems in the RL literature.

Performance improvement, verification, and reproducibility
Considering the state of the art, 65% of the studies discussed in Section 4.1 compare the results obtained using RL with an alternative method, such as a rule-based technique, an alternative RL method, fuzzy logic, or a heuristic model. The majority of publications pertaining to the BEMS sector report the use of an alternative method to validate the results; in contrast, fewer studies related to the dispatch problem employ an alternative methodology for validation. However, simple approaches, such as set-point strategies or simple rule-based techniques, have been used to benchmark RL algorithms in the BEMS sector, achieving a highly promising 10-20% improvement; such a significant performance gain is not typically observed in other sectors. For example, the performance improvement in the vehicle and device sectors is usually less than 5%. In some cases, RL has been outperformed by detailed white box approaches, such as model predictive control. It is thus necessary to benchmark the algorithms against more advanced control strategies rather than only simple rule-based strategies. To create such an environment, it is important to maintain a public repository of these algorithms and problem datasets, as is done in other communities, such as computer vision. Such repositories will encourage research groups to benchmark RL applications, especially for uncommon problems, such as energy system dispatch. Finally, considerably few publications report the use of experimental techniques to validate results. Experimental validation requires creating a prototype and implementing the RL algorithm in practice. Such an implementation, in addition to computational simulation, will considerably aid in presenting the potential of RL to a wider range of communities.

Future perspectives
This section is devoted to explaining the bottlenecks of the present state-of-the-art methods and identifying more promising directions by extending the discussion in Section 4. In Section 5.1, the possibility of improving the diversity of the problems is presented as an extension of Section 4.1. This will allow the use of RL for sector coupling problems, which enable decarbonization in several sectors. It is interesting to check whether the use of RL can be extended in the energy sector to cases where the energy management problem becomes part of a bigger problem; Section 5.2 attempts to predict such extended uses of RL. These extended RL applications introduce many problems. The possibility of using recent developments in RL, optimization methods, and computational facilities is discussed in Section 5.3.

Potential to extend the model concerning more diverse problems
RL has been effectively used for a set of problems related to energy systems. Reasonable progress has been achieved in most cases even when basic RL methods are used, indicating the potential for notable improvement. In Section 4.1.5, it is demonstrated that RL can be effectively used for more diverse applications than simple dispatch or BEMS problems. Such an interlink, which connects several sectors, is also known as sector coupling [168]. Sector coupling is regarded as a potential method to decarbonize multiple sectors, such as building, transportation, and manufacturing. RL can be effectively used in such applications to consider the uncertainties and complex interrelations between different sectors, whose modeling is complex when the white box approach is applied, especially with the intervention of cyber-physical systems. The optimal control of resources within interconnected systems introduces several problems. First, from the training perspective, the application will be more demanding because it formulates a complex optimization problem. Second, each sector has its own interest, which may be completely different from those of other sectors. For example, grid operation, the dispatch problem, and BEMS may focus on grid stability, minimization of generation cost, and comfort of occupants, respectively. In certain cases, these objectives may be conflicting, making it difficult to minimize them simultaneously. Information sharing between different sectors is another problem. The BEMS requires sensitive information, such as the presence of people and the use of equipment at specific time intervals. These are difficult to share publicly, making it difficult to link the BEMS with the dispatch problem in certain cases. The formulation and use of innovative methods that can handle data privacy would be interesting in this regard. Finally, there are difficulties in synchronizing these sectors, mainly because of the mismatch in response time.
For example, the dispatch and BEMS problems often operate at 15 min to 1 h time resolutions, whereas grid operation requires considerably finer time resolutions (seconds or even finer). Resolving these mismatches is important when controlling the energy flows within interconnected systems. Considering all the foregoing, it can be concluded that RL has the potential to be used in solving sector-coupling problems despite the presence of several challenges, some of which require the use of more advanced machine learning and optimization techniques, as discussed in detail in Sections 5.3-5.4.

RL applications beyond energy flow control
The state of the art clearly demonstrates that RL can be effectively used for a number of control problems in the energy system domain. More importantly, its potential for resolving complex problems, such as those in sector coupling, has been demonstrated. This can considerably aid in the energy transition and climate change mitigation. It would be interesting to investigate the potential of RL beyond simply controlling energy flows. Although RL has been effectively combined with supervised and unsupervised learning in other sectors, limited examples are found in the energy system domain.
The energy system design problem is usually linked with the optimal control problem and conducted as a bi-level optimization problem, where the system operation (optimal control) and system-sizing problems are considered at the primary and secondary levels, respectively. Such a bi-level design approach is usually applied in the design of microgrids and of energy systems in automobiles. Initially, simple rule-based techniques were applied to represent the control strategy. Subsequently, fuzzy logic and MPC were introduced to facilitate more complex energy flows. However, there are a number of limitations in both gray and white box approaches, especially when considering both cyber and physical interactions in the energy system domain. RL can be an attractive alternative in this regard, as it can effectively handle such complex environments because of its model-free approach [32]. This will require a substantial extension of the optimization because of the mismatch between the optimization techniques employed for energy system design and for RL: energy system design is usually achieved using either linear/mixed-integer linear or heuristic methods, whereas RL algorithms are trained using gradient descent methods. Accordingly, linking these two problems would significantly increase the use of RL in the energy system domain. Furthermore, the possibility of utilizing RL with unsupervised learning techniques, such as clustering, will aid in locating energy systems more effectively than existing approaches that only use unsupervised learning. Considering all the foregoing, it can be concluded that RL may be employed beyond energy flow control, which would be an interesting new research area.

Limitations of present state-of-the-art
There are many limitations in the present state-of-the-art reinforcement learning techniques. As carefully investigated in Ref. [75], deep RL algorithms are sensitive to the choice of hyperparameter values, network architecture, reward shaping, and implementation codebase. The instability of deep RL algorithms (the learning process exhibits high variance, and a near-optimal policy can turn arbitrarily bad) tends to affect their performance adversely. Thus, in recent years, the RL research community has focused on developing stable RL algorithms with reduced variance [169,170]. Furthermore, the majority of deep RL methods fail to perform well when there is some difference between the training and testing scenarios, thereby posing serious safety and security concerns. To this end, learning policies that are robust to environmental shifts, mismatched configurations, and even mismatched control actions is becoming increasingly important. Some works build on the robust MDP framework [171], for example [172] and [173], whereas others leverage the equivalence between action-robust and robust MDPs introduced by Ref. [174], for example [175] and [176]. Despite the impressive empirical progress, training robust RL objectives remains an open and critical challenge.
The successful application of deep RL methods is attributed to the implementation of meaningful representation learning via deep neural networks. However, for real-world problems, such as energy system optimization, standard representation techniques for a vision-based application may not be advantageous. It is therefore necessary to contextualize the representations along with deep RL methods to enhance their applicability in the energy system domain. Most current deep RL methods augment their main objective with additional losses, typically facilitating and regularizing the representation learning process [177][178][179][180]. Recently, deep RL researchers have started exploring unsupervised representation learning methods for RL [181][182][183].

Present state of the art developments: optimization in RL
In the RL community, there is recent interest in understanding the standard (policy gradient-based) RL algorithms from an optimization perspective. The research community has also attempted to study the theoretical convergence properties of PG methods from a non-convex optimization perspective [184,185]. In particular, by leveraging recent advancements in non-convex optimization, new algorithmic solutions are being proposed for RL. Furthermore, by exploiting the minimax duality of the Bellman equations, a class of stochastic primal-dual (SPD) methods that is computation- and sample-efficient has been proposed for RL [186]. The SPD Q-learning in Ref. [187] extends this approach to the Q-learning framework with off-policy learning. It is presumed that the energy system domain will benefit from such optimization-based RL over time.
Energy system operation is often considered a multi-objective optimization problem, in which conflicting objectives, such as cost, comfort level, and environmental impact, must be traded off. Both Pareto and weighted objective functions are used in this regard. Within the RL community, multi-task RL is used to handle such tasks. Multi-task RL is a promising approach to alleviate the sample complexity issue of RL algorithms that learn individual tasks from scratch; multi-task RL methods share structure across multiple tasks to enable more efficient learning [188,189]. However, they pose significant optimization challenges compared to standard RL methods that learn tasks independently [190]. By assuming additional structure on the similarity between tasks, researchers have proposed more efficient optimization algorithms for multi-task RL with provable guarantees [191]. Multi-task RL also suffers from scalability issues when the number of tasks grows large. Decentralized optimization remedies these scalability issues by distributing computation across multiple units; recently, Ref. [192] proposed scalable multi-task RL algorithms with improved convergence guarantees.
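The weighted-objective formulation mentioned above can be illustrated with a minimal example: two toy objectives (energy cost and thermal discomfort, both assumed quadratic models of ours) are folded into one scalar reward, and sweeping the weight traces out an approximation of the Pareto front.

```python
import numpy as np

def objectives(setpoint):
    # illustrative toy models: cost grows with the heating setpoint, and
    # discomfort is the squared deviation from a 21 degC comfort target
    cost = 0.1 * setpoint ** 2
    discomfort = (21.0 - setpoint) ** 2
    return cost, discomfort

def scalar_reward(setpoint, w):
    c, d = objectives(setpoint)
    return -(w * c + (1.0 - w) * d)   # weighted scalarization into one reward

# sweeping the weight w traces out an approximation of the Pareto front
front = []
for w in np.linspace(0.05, 0.95, 10):
    sp = max(np.arange(0.0, 25.0, 0.1), key=lambda x: scalar_reward(x, w))
    front.append(objectives(sp))
```

Each weight yields one Pareto-efficient trade-off; a multi-task RL agent would instead learn a family of such policies jointly rather than re-solving the problem per weight.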
The ever-increasing complexity of energy systems demands reasonable improvements in the optimization techniques used for RL. As a result, there is a recent surge in customizing online (convex) optimization algorithms for energy system-related problems [193][194][195]. Based on the min-max RL formulation, reductions between RL methods and online learning algorithms have been established [196], leading to the systematic development of new RL algorithms. In Ref. [197], connections between DP and (constrained) convex optimization are established, and several policy iteration algorithms are formulated in the optimization language. For example, they link conservative policy iteration to the Frank-Wolfe algorithm, mirror descent modified policy iteration to mirror descent, and Politex (policy iteration using expert prediction) to dual averaging. Accordingly, it is expected that considerable effort can be devoted in this direction by leveraging these relationships for more complex energy problems [196,197].

Distributed RL
As discussed in Sections 5.1 and 5.2, MARL can perform a major function in understanding the interactions among multiple sectors in the energy system domain. However, learning in a multi-agent environment is considerably more challenging than in a single-agent setting because each agent has to interact with both the environment and the other agents [198]. For a recent survey on deep MARL, the reader is referred to Refs. [199,200]; for distributed optimization-based MARL, cf. [201,202]. Despite several practical problems (e.g., non-stationarity issues [203], the curse of dimensionality, and computational demand) associated with deep MARL, several recent works have reported empirical success in complex multi-agent scenarios. This success is achieved by carefully scaling the algorithms originally introduced for RL and multi-agent learning to deep MARL. Most of these studies, which directly use single-agent algorithms in the multi-agent setting (i.e., independent learners), lack theoretical/convergence guarantees. In some studies [204,205], the focus is mainly set on analyzing and evaluating DRL algorithms in a multi-agent environment under cooperative, competitive, and mixed scenarios. Littman [206] studied the convergence properties of joint-action RL agents in Markov games. Recently, several MARL algorithms, such as distributed TD learning [207], distributed Q-learning [208], and a distributed actor-critic algorithm [209], have been proposed by leveraging the core idea of averaging consensus-based distributed optimization [210]. These methods achieve global consensus on the optimal policy only through local computation and communication with neighboring agents. Extensions of these distributed optimization methods to recent deep RL algorithms are currently under investigation.
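The consensus-averaging idea behind such distributed methods can be sketched with a toy distributed TD(0) evaluation: each agent observes only its own local reward (e.g., its own sector's cost), performs a local TD update, and then averages its value estimates with its ring neighbors via a doubly stochastic mixing matrix. The rewards, topology, and constants are illustrative assumptions in the spirit of [207,210], not an implementation of any cited algorithm.

```python
import numpy as np

rng = np.random.default_rng(5)
n_agents, nS, gamma, alpha = 4, 3, 0.9, 0.05

# each agent sees only its own per-state reward; the team objective is the
# average of the local rewards (values below are illustrative)
local_r = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [3.0, 0.0, 0.0]])

# doubly stochastic mixing matrix for a 4-agent ring (consensus step)
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

V = np.zeros((n_agents, nS))          # each agent's local value estimates
s = 0
for _ in range(20000):
    s2 = int(rng.integers(nS))        # shared exogenous dynamics (uniform)
    for i in range(n_agents):         # local TD(0) update with the local reward
        td = local_r[i, s] + gamma * V[i, s2] - V[i, s]
        V[i, s] += alpha * td
    V = W @ V                         # consensus: average with ring neighbors
    s = s2
```

Because the mixing matrix preserves the column means, the agents' average estimate follows ordinary TD(0) on the averaged reward, so the agents approximately agree on the team's value function using only neighbor-to-neighbor communication.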
Compared to classical RL methods, optimization-based RL methods can more effectively handle the multiple objectives that arise in real-world control problems, such as energy systems, by expressing them in the form of constraints.

Conclusions
The energy system is going through a significant transition with the decarbonization of the energy sector, notably increasing the complexity of energy systems. The integration of non-dispatchable renewable energy technologies and distributed energy storage, the introduction of complex market mechanisms, the uncertainties brought by energy markets, climate, and occupancy, and the expansion of the energy system boundaries through sector coupling all demand a paradigm shift in the methods used to control energy systems: from the present state-of-the-art methods, such as model-based, gray box, and rule-based methods, to data-driven approaches. Three major bottlenecks are expected to be addressed by data-driven models:
• Handling the complexity of the models used (non-linear and non-convex nature of objective functions)
• Limitations in the physical models, especially in the building sector
• Limitations in handling large state and control variable spaces.
Addressing these limitations will notably improve efficiency and reduce operating costs while enhancing sustainability, user comfort, security, and the quality of the energy systems' service.
Fig. 11. State of the art and future perspectives concerning RL and energy system control.
Reinforcement Learning becomes quite an attractive alternative in this regard. As a branch of machine learning, Reinforcement Learning is gradually gaining popularity beyond the machine learning community because of its broad applicability. Compared with model predictive control strategies, Reinforcement Learning employs a model-free approach, which significantly increases its applicability. In addition, the capability to handle a large decision space with little knowledge about the problem physics makes it competitive with the rule-based controllers used in the present state of the art (cf. Fig. 11). Accordingly, a progressive increase is observed in publications that report the use of Reinforcement Learning.
Publications within the energy system domain are growing rapidly and focus on a broad group of problems, making it difficult to consider them one by one. A novel top-down approach is proposed in the present study to address this problem. The publications that discuss the resolution of energy system-related problems using Reinforcement Learning are clustered into seven groups, six of which are related to specific control problems: building energy management, dispatch, energy systems in hybrid vehicles, energy markets, grid, and energy devices. Within this classification, the study investigates:
1. The complexity of the problem within each problem class: (a) handling non-linearity; (b) use of approximation models/data-driven models; (c) expansion of decision and objective space variables.
2. The types of RL methods used.
3. How successful the present state-of-the-art methods have been.
The study reveals that most RL studies (45%) focus on either the building energy management or the dispatch problem. Most of the papers related to building energy management systems focus on HVAC, whereas a few publications center on lighting and the control of blinds together with the HVAC. More diversity is observed when shifting to the dispatch problem. Energy systems with considerably complex system configurations are considered in the dispatch problem, with a focus on the uncertainties in renewable generation, demand, and price signals in the grid. The study reveals that the dispatch problem has a close link with the energy market and the grid, in addition to building energy management systems. Multi-agent Reinforcement Learning has been used to couple the dispatch and energy markets; this enables the scheduling of distributed energy systems considering day-ahead and spot markets. Grid models have been used along with Reinforcement Learning algorithms to guarantee the secure and stable operation of distribution networks while accommodating distributed energy sources. The study reveals the potential of linking dispatch problems with the optimal power flow problem in the grid, which will help improve the efficiency of the grid while guaranteeing stability. The flexibility demonstrated by the dispatch problem in linking with building energy management, the grid, and energy markets makes it an ideal candidate to act as the central hub maintaining interactions between the many participants within the energy system domain. However, connectivity between energy devices or vehicle energy systems and the other sectors is difficult to observe; these sectors operate at finer time resolutions than the others, making them more difficult to link. Within these different domains, RL has shown the potential to handle large state spaces and complex nonlinear models that are difficult to address using the other existing techniques.
Several major bottlenecks are observed when looking into the present state of the art. Most studies lack proper benchmarking against model-based approaches or gray-box models, which makes it difficult to draw any conclusions regarding performance improvement. The development of a public repository of computational codes and test cases to validate performance improvements could be considerably advantageous for the research community, as evidenced by its implementation in other ML communities, such as computer vision. The reproducibility of results is another major issue that has not been discussed broadly; comparing the performance of RL algorithms on case studies with different environments would be helpful in this regard. Although RL can adequately solve integrated problems spanning several sectors, only a limited number of publications have discussed this broader application. One of the most remarkable observations is that most studies do not use deep learning techniques; instead, the tabular method or a shallow network for function approximation is employed. In practically all publications, existing libraries and optimization algorithms are used to implement Reinforcement Learning; the development of RL algorithms tailored to the bottlenecks of energy system control problems is not reported. Considering all of the aforementioned aspects, there remain many open questions relative to both energy systems and machine learning, and the answers to these questions are expected to lead to significant improvements in the energy system domain. Nevertheless, even with the current limited use of reinforcement learning, these methods exhibit a 10-20% performance improvement in many applications (especially in building energy management), although there are a few cases where model predictive control outperforms RL by a considerably close margin.
The operation of energy systems will benefit from progress in Reinforcement Learning techniques to address many open-ended problems that have not yet been addressed. The study reveals that the participation of multiple agents having different priorities will become more common, especially in open energy markets, where Reinforcement Learning could immensely help improve the participants' profits and maintain robust operation. Given the recent advances in the robust Reinforcement Learning literature, robust RL methods could be utilized in energy systems to handle uncertainty in the system parameters. The possibility of accommodating a large pool of participants will be beneficial, especially for applications like the energy internet, where a large group of devices having energy storage (for example, mobile phones and dishwashers) will interact with the energy systems. One of the promising directions is sector coupling. The consideration of multiple objectives within the control problem is another important aspect of the energy system domain; for example, minimizing both cost and emissions is considered vital in the dispatch problem. Introducing multi-task RL, which can address this specific issue, would be quite interesting in this regard. The application of Reinforcement Learning algorithms to link several domains will help improve the penetration levels of renewable energy technologies while minimizing CO2 generation in these sectors; the shift to multi-agent Reinforcement Learning will be beneficial in this regard. One major limitation observed when extending the scope of the problem is oversimplification. It is worthwhile to investigate possible ways of considering the detailed physics of the problems while extending the problem's scope, where linking existing physical models with Reinforcement Learning techniques would be immensely helpful.
Developing a hybrid technique that links model predictive control and Reinforcement Learning will also be an interesting approach to improve energy efficiency in building energy systems. Furthermore, Reinforcement Learning can be used to locate distributed energy systems, where it can be employed together with clustering algorithms. Therefore, advances in Reinforcement Learning techniques will play a vital role in the improvement of the energy sector.
The study reveals that Reinforcement Learning can effectively address many limitations of the present state-of-the-art methods used to operate energy systems. It has already demonstrated significant potential as a major candidate to address the energy system operation problem. More importantly, the potential of data-driven methods to address complex energy-related problems is significant. The advances taking place in the machine learning community play a vital role in this regard, facilitating the use of Reinforcement Learning algorithms for a diverse group of problems. Based on the literature review, the authors have noted the potential of RL beyond its use in control problems.
Considering all these aspects, RL definitely has a broad scope of application with a huge potential to resolve global problems in the energy sector.