Gym-ANM: Reinforcement Learning Environments for Active Network Management Tasks in Electricity Distribution Systems

Active network management (ANM) of electricity distribution networks include many complex stochastic sequential optimization problems. These problems need to be solved for integrating renewable energies and distributed storage into future electrical grids. In this work, we introduce Gym-ANM, a framework for designing reinforcement learning (RL) environments that model ANM tasks in electricity distribution networks. These environments provide new playgrounds for RL research in the management of electricity networks that do not require an extensive knowledge of the underlying dynamics of such systems. Along with this work, we are releasing an implementation of an introductory toy-environment, ANM6-Easy, designed to emphasize common challenges in ANM. We also show that state-of-the-art RL algorithms can already achieve good performance on ANM6-Easy when compared against a model predictive control (MPC) approach. Finally, we provide guidelines to create new Gym-ANM environments differing in terms of (a) the distribution network topology and parameters, (b) the observation space, (c) the modelling of the stochastic processes present in the system, and (d) a set of hyperparameters influencing the reward signal. Gym-ANM can be downloaded at https://github.com/robinhenry/gym-anm.


Introduction
Reinforcement learning (RL) is a vibrant field of machine learning aiming to mimic the human learning process. This allows us to solve numerous complex decision-making problems [1]. In the field of power systems (a term used to refer to the management of electricity networks), researchers and engineers have used RL techniques for many years [2]. Over the last few years, however, decision-making challenges in power systems have drawn less attention than other domains in which RL has been successfully and extensively applied, such as the fields of games [3,4,5,6], robotics [7,8,9,10], and autonomous driving [11,12,13]. A plausible explanation for this is the lack of off-the-shelf simulators that model such problems. Indeed, despite its many recent breakthroughs, RL research remains largely dependent on the availability of artificial simulators that can be used as surrogates for the real world [14]. Training on real systems is often too slow and constraining, while simulators allow us to take advantage of large computational resources and do not constrain exploration.
Developing efficient and reliable algorithms to solve decision-making challenges in power systems is becoming more and more crucial for ensuring a smooth transition to sustainable energy systems. Power grids have experienced profound structural and operational changes over the last two decades [15]. The liberalization of electricity markets introduced a competitive aspect in their management, driving network improvements and cheaper energy generation [16]. The arrival of distributed generators, such as wind turbines and photovoltaic arXiv:2103.07932v2 [cs.LG] 30 Jun 2021 panels (PVs), has compromised the traditional model of decentralized generation. In particular, we have seen the appearance of (virtual) microgrids creating local energy ecosystems in which consumers are now also producers [17]. In the near future, we can expect the addition of even more distributed generators to the grid [18], along with an increase in the number of large loads due to the fast electric vehicle market growth [19]. Power grids are also facing the emergence of distributed energy storage (DES), with certain technologies already available, such as batteries [20] and power-to-gas [21]. As a result, system operators are facing many new complex decision-making problems (overvoltages, transmission line congestion, voltage coordination, investment issues, etc.), some of which might benefit from advances in the very active area of RL research.
Through this work, we seek to promote the application of RL techniques to active network management (ANM) problems, a class of sequential decision-making tasks in the management of electricity distribution networks (DNs). In the power system literature, ANM refers to the design of control schemes that modulate the generators, the loads, and/or the DES devices connected to the grid. This is done to avoid problems at the distribution level and maximize profitability through, e.g., avoidable energy loss [22]. This modulation, operated by distribution network operators (DNOs), may result in a necessary reduction in the output of generators from what they could otherwise have produced given available resources, often referred to as the process of curtailment. Such generation curtailment, along with storage and transmission losses, constitute the principal sources of energy loss that we would like to minimize through ANM. At the same time, the ANM scheme must ensure a safe and reliable operation of the DN. This is often expressed as a set of operational constraints that must be satisfied.
More specifically, we propose Gym-ANM, a framework that facilitates the design and the implementation of RL environments that model ANM tasks. Our goal was to release a tool that could be used without an extensive background in power system analysis. We thus engineered Gym-ANM so as to abstract away most of the complex dynamics of power system modelling. With its different customizable components, Gym-ANM is a suitable framework to model a wide range of ANM tasks, from simple ones that can be used for educational purposes, to complex ones designed to conduct advanced research. In addition, Gym-ANM is built on top of the OpenAI Gym toolkit [23], an interface with which a large part of the RL community is already familiar. Note that Gym-ANM environments do not solve ANM problems but, rather, provide a simple programming interface to test and compare various optimization and RL algorithms that aim to do so.
The remainder of this paper is organized as follows. First, we introduce a series of background concepts and notations in RL, DNs, and MPC in Section 2. Section 3 then formalizes the generic ANM task that we consider as a partially observable Markov decision process (POMDP). Next, we propose a specific Gym-ANM environment that highlights common ANM challenges, ANM6-Easy, in Section 4. The performance of the state-of-the-art proximal policy optimization [24] (PPO) and soft actor-critic [25] (SAC) deep RL algorithms are evaluated on ANM6-Easy in Section 5. Finally, Section 6 concludes our work. To keep this paper accessible to a broad audience and to provide interested readers with a formal introduction to power system modelling, technical details about the inner working of the power grid simulator are gathered in Appendices A and B. Guidelines to design and implement new Gym-ANM environments are also provided in Appendices C and D, and more in-depth tutorials and documentation can be found on the project repository at https://github.com/robinhenry/gym-anm.

Reinforcement Learning
We consider the standard RL setting for continuing tasks where an agent interacts with an environment E over an infinite sequence of discrete timesteps T = {0, 1, . . .}, modelled as a Markov decision process (MDP). At each timestep t, the agent selects an action a t ∈ A based on a state s t ∈ S according to a stochastic policy π : S × A → [0, 1], such that a t ∼ π(·|s t ). After the action is applied, the agent transitions to a new state s t+1 ∼ p(·|s t , a t ) ∈ S and receives the reward r t = r(s t , a t , s t+1 ) ∈ R. The return from state s t is defined as R t = lim T →∞ T −1 i=t γ i−t r t , where γ ∈ [0, 1) is the discount factor determining the weight of shortversus long-term rewards. We also distinguish a set of terminal states S terminal ⊂ S. Given the distribution of initial states p 0 (·) and the set of stationary policies Π, a policy π ∈ Π is considered optimal if it maximizes the expected return J π (s 0 ) = E si>0,r i≥0 ∼E,ai∼π [R 0 |s 0 ] for all s 0 that belong to the support of p 0 (·), and where rewards received after reaching a terminal state are always zero. In the context of problems with large and/or continuous state-action spaces, RL often focuses on learning a parameterized policy π φ ∈ Π with parameters φ whose expected return J π φ (s 0 ) is as close as possible to that of an optimal policy. In many cases, the environment may be partially observable so that the agent only has access to observations o ∈ O. The agent must thus adequately infer, directly or indirectly, an approximation of the state s t from the history of observation-action-reward tuples h t = (o 0 , a 0 , r 0 . . . , a t−1 , r t−1 , o t ). We designed the Gym-ANM framework so that it is straightforward for researchers to experiment with different degrees of observability in each environment.

Distribution Networks
An electricity distribution network can be represented as a directed graph G(N , E), where N = {0, 1, . . . , N −1} is a set of positive integers representing the buses (or nodes) in the network, and E ⊆ N ×N is the set of directed edges linking buses together. The notation e ij ∈ E refers to the directed edge with sending bus i and receiving bus j. Each bus might be connected to several electrical devices, which may inject into or withdraw power from the grid. The set of all devices is denoted by D = {0, 1, . . . , D − 1}, the set of all devices connected to bus i ∈ N by D i ⊆ D, and it is assumed that each device is connected to a single bus.
Several variables (complex phasors) are associated with each bus i ∈ N : a bus voltage level V i , a bus current injection I i , an active (real) power injection P Similarly, variables I ij , P ij , Q ij , and S ij refer to the directed flow of these quantities in branch e ij ∈ E, as measured at bus i. Note that, as a result of transmission losses, power and current flows may have different magnitudes at each end of the branch, e.g. |P ij | = |P ji |.

Model Predictive Control (MPC) and Optimal Power Flow (OPF)
In this work, we also present a model predictive control (MPC) approach to solving the ANM tasks that we propose with Gym-ANM. MPC in discrete-time settings is a control strategy in which, based on a known model of the dynamics of the system, a multi-stage optimization problem is solved at each timestep over a finite time horizon. The solution found is applied to the system at the current timestep, and the process is repeated at the next one, indefinitely [26]. The fact that a multi-stage optimization problem based on a model of the system is solved at each time step allows MPC to plan ahead and anticipate the system's behavior. This leads to near-optimal performance as the optimization horizon is increased (assuming an accurate model of the system).
The optimization problem solved by our MPC control algorithm is a multi-stage optimal power flow (OPF) problem. Since its first formulation by Carpentier in 1962 [27], solving a single instance or multiple instances of the OPF problem at regular time intervals has been the dominant approach to tackling decision-making problems in the management of power systems when network constraints are taken into account. In its most general form, the OPF problem is a non-convex constrained optimization problem with equality and inequality constraints. The objective function to minimize is often a representation of network operating costs, the equality constraints model the physical flows of electricity, and the inequality constraints model operational constraints. There exist many different formulations of the OPF problem, each designed to solve a particular control task in power systems. Although many solution methods have been proposed using a wide range of optimization tools and techniques, no single formulation has been accepted as suitable for all forms of OPF problems and it remains an active area of research. For the interested reader, comprehensive surveys of such approaches can be found in [28,29].

Gym-ANM
In this section, we propose Gym-ANM, a framework that can model a wide range of novel sequential decision-making ANM tasks to be solved by RL agents. Each Gym-ANM task is provided as a Gym [23] environment E that we describe by the MDP (S, A, O, p 0 , p, r, γ) E . Our formalization of these MDPs follows closely, and was inspired by, the work of Gemine et al. in [30].
For mathematical convenience, the set of electrical devices D connected to the grid is divided into three disjoint subsets D G , D L , and D DES , so that |D G | + |D L | + |D DES | = |D|. The set D G contains the generators, D L the loads, and D DES the DES units. Generators represent devices that only inject power into the grid, such as renewable energy resources (RER) D RER ⊂ D G or other traditional power plants D G − D RER . Loads group the passive devices that only withdraw power from the grid. Storage units, on the other hand, can both inject and withdraw power into/from the network. The only exception is the slack generator g slack ∈ D G − D RER , assumed to be the only device connected to the slack bus. The slack bus is a special bus used to balance power flows in the network and provide a voltage reference. The slack bus can also either inject or withdraw power into/from the network, such that the total generation remains equal to the total load plus transmission losses, at all times.

Overview
The structure of the Gym-ANM framework is illustrated in Figure 2, in which grey blocks represent components (functions) that are fully customizable by the user to design unique ANM tasks. At each timestep t, the agent receives an observation o t ∈ O and a reward r t−1 ∈ R, based on which it then selects an action a t ∈ A to be applied in the environment.
Once the environment has received the selected action a t , it samples a series of internal variables using the next vars() generative process conditioned on the current state s t ∈ S. These internal variables model the temporal stochastic evolution of the electricity demand and of the maximum renewable energy production before curtailment across the DN, as further described in later sections.
The internal variables are then passed, along with a t and s t , to the main function next state(), which applies the action to the environment and outputs the new state s t+1 ∈ S. The next state() block behaves deterministically for a given DN. It first maps the selected action a t to the current available action space A(s t ) ⊆ A before applying it to the environment. All the currents, voltages, energy storage levels, and power flows and injections are then updated, resulting in a new state s t+1 . Most of the power system modelling of the environment is handled by the next state() component, which we provide as a built-in part of the framework.
The new state s t+1 is then used to compute the new observation o t+1 ∈ O and reward r t ∈ R. Much like the next vars() block, the behavior of the observation() component can be freely designed by the designer of the environment. This way, it becomes straightforward to investigate the impact of different observation vectors on the performance of a given algorithm on a given ANM task. To simplify the use of our framework, we also provide a set of default common observation spaces that researchers can experiment with.
Our framework provides a built-in reward() component that computes the reward r t as: where ∆E t:t+1 is the total energy loss during (t, t+1], φ(s t+1 ) is a penalty term associated with the violation of operating constraints, λ is a weighting hyperparameter, and r clip > 0 keeps the rewards within a finite range [−r clip , r clip ]. This reward function was designed to reflect the overall goal: learn a control policy π that ensures a secure operation of the DN while minimizing its operating costs. In the management of realworld DNs, there are many varied sources of operating costs. For simplicity, however, we consider energy losses and the violation of operational constraints to be the only sources of costs. Our reward formulation also assumes that the action is selected by the agent at time t, immediately applied in the environment at time t + , with → 0, and that all power injections remain constant during (t + , t + 1].
The Gym-ANM framework allows for the creation of environments that model highly customizable ANM tasks. In particular, varying any of the following components will result in a different MDP, and therefore a different ANM task: 1. Topology and characteristics of the DN. Its topology is described by the tuple (D, N , E) and its characteristics refer to the parameters of each of its device d ∈ D, bus i ∈ N , and transmission link e ij ∈ E. In particular, the number of devices |D| and their respective operating range will shape the resulting state space S and action space A. A detailed list of all the DN parameters modelled in Gym-ANM is provided in Appendix D.
2. Stochastic processes. This corresponds to the design of the next vars() component in Figure 2. This component must model the temporal evolution of the electricity demand P (dev) l,t of each load l ∈ D L , the maximum production P (max) g,t that each generator g ∈ D G − {g slack } could produce at time t (before curtailment is applied if g ∈ D RER ), and a set of K auxiliary variables {aux Observation space. The observation space O can be changed to make the task more or less challenging for the agent by modifying the observation() function.

4.
Hyperparameters. Although the reward() component is built-in as a part of the Gym-ANM framework, it nonetheless relies on three hyperparameters that can be chosen for each new task: the penalty weighting hyperparameter λ, the amount of time ∆t (in fraction of hour) elapsed between subsequent discretization timesteps, and the clipping hyperparameter r clip . Because we consider a policy to be optimal if it minimizes the expected sum of discounted costs, we also consider the discount factor γ ∈ [0, 1) to be another fixed hyperparameter part of the task description.
In the remainder of this section, we explore the resulting MDP in more detail.

State Space
At any timestep t, the state of a Gym-ANM environment is fully described by the state of the DN that it models. We represent this state using a set of state variables aggregated into a vector s t ∈ S: refer to the active and reactive power injections of device d ∈ D into the grid, respectively, • SoC d,t is the charge level, or state of charge (SoC), of DES unit d ∈ D DES , is the maximum production that generator g ∈ D G − {g slack } can produce, is the value of the (k − 1) th auxiliary variable generated by the next vars() block during the transition from timestep t to timestep t + 1.
In (2), the first 2|D| + |D DES | variables (P (dev) 0,t , . . . , SoC |D DES |−1,t ) can be used to compute any other electrical quantities of interest in the DN (i.e., currents, voltages, power flows and injections, and energy storage levels), as derived in Appendix A. We also include the maximum generation variables P (max) g,t in s t because, even though they do not affect the physical electric flows in the network, they are required to compute the reward signal (see Section 3.6).
These variables do not, however, provide any information about the temporal behavior of the system. Hence, they are not sufficient to describe the full state of the system from a Markovian perspective. For instance, it may not be enough to know the active and reactive power injections from a load l ∈ D L at time t to fully describe the probability distribution of its next demand P (dev) l,t+1 .
In order to make s t Markovian, we chose to include a set of K auxiliary variables {aux k=0 that can be used to model other temporal factors that influence the outcomes P (dev) l,t+1 and P (max) g,t+1 during the next vars() call of Figure 2. This leads to state transitions that are only conditioned on the current state of the environment and on the action the agent selects, i.e., s t+1 ∼ p(·|s t , a t ). The overall task is thus indeed a MDP.
For example, the environment ANM6-Easy that we introduce in Section 4 uses a single auxiliary variable that represents the time of the day. This is sufficient to make s t Markovian, since the underlying stochastic processes can all be expressed as a function of the time of day. Another example would be an environment in which the next demand of each load and the generation from each generator is solely dependent on their current value. In this case, s t would not require any extra auxiliary variables. As environments become more and more complex, we expect state vectors to contain many auxiliary variables. Such examples could include solar irradiation and wind speed information to better represent the evolution of the electricity produced by renewable energy resources.
Finally, the environment may also reach a terminal state s t ∈ S terminal , indicating that no solution to the power flow equations (see Appendix A.5) was found as a result of the action taken by the agent. This means that the power grid has collapsed and is often due to a voltage collapse problem [31].

Action Space
Given the current state of the environment s t ∈ S, the available actions are denoted by the action space A(s t ). We define an action vector a t ∈ A(s t ) as: for a total of N a = 2|D G | + 2|D DES | − 2 control variables to be chosen by the agent at each timestep. Each control variable belongs to one of four categories: • a Pg,t : an upper limit on the active power injection from generator g ∈ D G −{g slack }. If g ∈ D DER , then a Pg,t is the curtailment value. For classical generators, it simply refers to a set-point chosen by the agent. The slack generator is excluded, since it is used to balance load and generation and, as a result, its power injection cannot be controlled by the agent. That is, g slack will inject the amount of power needed to fill the gap between the total generation and demand into the network.
• a Qg,t : the reactive power injection from each generator g ∈ D G − {g slack }. Again, the injection from the slack generator is used to balance reactive power flows and cannot be controlled by the agent.
• a P d,t : the active power injection from each DES unit d ∈ D DES .
• a Q d,t : the reactive power injection from each DES unit d ∈ D DES .
The resulting action space A(s t ) is bounded by three sets of constraints. First, individual control variables in a t ∈ A(s t ) are restricted to finite ranges [P , P ] or [Q, Q]. This is because electrical devices cannot physically inject (withdraw) infinite active or reactive power into (from) the network. Second, generators and DES units may have additional constraints on their current injections, such as current limits of power converters. These constraints further restrict the range of (P, Q) injection points that these devices can apply, i.e. they cannot simultaneously operate at full capacity for both active and reactive power. Third, the range of possible active power injection from each DES unit depends on its current storage level (provided in s t ). Indeed, empty (full) units cannot inject (withdraw) any power into (from) the network. Note that the first two sets of constraints remain the same for all s t ∈ S (see Appendix A.3).
For simplicity, the agent is never given the precise boundaries of the action space A(s t ). Instead, we let it choose an action within a larger set A bounded only by the first set of constraints, i.e. A ignores current limits in generators and DES units, as well as storage levels. In the case where the agent selects an action a t ∈ A that falls outside of the current action space A(s t ), the action that is actually applied in the environment during the next state() call is the action in A(s t ) that stands the closest to a t , according to the Euclidean distance (see Appendix A.6).
As a result, A is always bounded. Its bounds can be retrieved by the agent through the built-in action space() function. This allows users to follow good practices by working with agents that generate normalized action vectors in [−1, 1] Na .

Observation Space
In general, DNOs rarely have access to the full state of the distribution network when doing ANM. To model these real-world scenarios, Gym-ANM allows the design of a unique observation space O through the implementation of the observation() component, which may result in a partially observable task. We only assume that the size of o t remains constant.
To simplify the design of customized observation spaces, Gym-ANM also allows researchers to simply specify a set of variables to include in the observation vectors (e.g., branch active power flows {P 12 , P 23 } and bus voltage magnitudes {|V 0 |, |V 2 |}) of the new environment. The full list of available variables from which to choose is given in Appendix C.
The agent can access the bounds of the observation space through the function call observation space(). This functionality may be of particular interest to agents that use neural networks to learn ANM policies, in which case normalized input vectors may increase training speed and stability.

Transition Function
Each state transition occurs in two steps. First, the outcomes of the internal stochastic variables {P are generated by the next vars() block of the Gym-ANM framework (see Figure 2). Once the selected action a t ∈ A has been passed to the environment, the remainder of the transition is handled by the next state() component in a deterministic way. The reactive power injection of each load d ∈ D is directly inferred from its active power injection (assuming a constant power factor). The action a t is then mapped to A(s t ) according to the Euclidean distance and applied in the environment. Finally, all electrical quantities are updated by solving a set of so-called network equations (see Appendix A.5). The computational steps taken by next state() are described in more detail in Appendix A.6.

Reward Function
The reward signal is implemented by the built-in reward() block of Figure 2 and is given by: where Using a reward clipping parameter r clip ensures that any transition from a non-terminal state to a terminal one (i.e., when the power grid collapses), generates a much larger reward than any other transition does. As a result, it encourages the agent to learn a policy that avoids such scenarios at all costs. Subsequent rewards are always zero, until a new trajectory is started by sampling a new initial state s 0 .
During all other transitions, the energy loss ∆E t:t+1 is computed in three parts: where: • ∆E (1) t:t+1 is the total transmission energy loss during (t, t + 1]. This is a result of leakage in transmission lines and transformers. • ∆E (2) t:t+1 is the total net amount of energy flowing from the grid into DES units during (t, t + 1]. Over a sufficiently large number of timesteps, the sum of these terms will approximate the amount of energy lost due to leakage in DES units. That is, taking an energy of ∆E from the grid using a DES unit d ∈ D DES will yield a cost of ∆E. Given a charging and discharging efficiency factor of η d for d, injecting the remaining energy after a total round-trip loss will result in a cost of −η 2 ∆E, totalling a round-trip cost of (1 − η 2 )∆E. This is the total energy loss over the round-trip.
• ∆E (3) t:t+1 is the total amount of energy loss as a result of renewable generation curtailment of generators D RER during (t, t + 1]. Depending on the regulation, this can be thought of as a fee paid by the DNO to the owners of the generators that get curtailed, as financial compensation. In the penalty term φ(s t+1 ), we consider two types of network-wide operating constraints. The first is the limit on the amount of power 1 that can flow through a transmission link e ij ∈ E, referred to as the rating of that link. These constraints are needed to prevent lines and transformers from overheating. The second type of constraint is a limit on the allowed voltage magnitude |V i | at each bus i ∈ N . The latter are necessary conditions to maintain stability throughout the network and ensure proper operation of devices connected to the grid.
In practice, violating any network constraint can lead to damaging parts of the DN infrastructure (e.g., lines or transformers) or power outages. Both can have important economic consequences for the DNO. For that reason, ensuring that the DN operates within its constraints is often prioritized compared to minimizing energy loss. Although our choice of reward function does not guarantee that an optimal policy will never violate these constraints, choosing a large λ will ensure that these violations remain small. This would, in practice, have a negligible impact on the operation of the DN. In addition, the risk of violating reallife constraints in the DN could be further reduced by setting an over-restrictive set of constraints in the environment.
The technical details behind the computation of r t can be found in Appendix A.7.

Model Predictive Control Scheme
In order to quantify how well an agent is performing on a specific Gym-ANM task, we can cast the task as a MPC problem in which a multi-stage (N -stage) OPF problem is solved at each timestep. The resulting policy provides us with a loose lower bound on the best performance achievable in the environment.
The general MPC algorithm that we provide takes as input forecasts of demand for each load l ∈ D L and of maximum generation for each non-slack generator g ∈ D G − {g slack } over the optimization horizon [t + 1, t + N ]. We refer to the resulting policy as π M P C−N . We then consider two variants: policies π constant M P C−N and π perf ect M P C−N . The former, π constant M P C−N , uses constant forecasts over the optimization horizon. Its simplicity means that it can be used in any Gym-ANM environment 2 . The other variant, π perf ect M P C−N , assumes perfect predictions of future demand and generation are available for planning. Although it can only be used in simple environments such as ANM6-Easy (see Section 4.2), its performance is superior to that of π constant M P C−N . This means it provides the user with a tighter lower bound on the best achievable performance. Both variants are formally described in Appendix B.
Both MPC-based control schemes model the power grid using the DC power flow equations, a linearized version of the AC power flow equations. They thus solve a multi-stage DCOPF problem at each timestep. The DCOPF formulation relies on three assumptions: (a) transmission lines are lossless, (b) the difference between adjacent bus voltage angles is small, and (c) bus voltage magnitudes are close to unity.
It is worth stressing that the MPC method that we propose here is an example of a traditional approach to tackling ANM problems. Because RL algorithms make less assumptions about the intrinsic structure of the problem, however, they have the potential to overcome the limitations of such optimization approaches and reach better solutions. This, of course, does not mean that RL should be blindly applied to most multi-step OPF-like problems, but, rather, that it might prove to be a good alternative when traditional approaches reach their limitations. This remains a hypothesis which, we hope, Gym-ANM will help confirm or deny.

Gym-ANM Environments.
In conformity with the Gym framework, any Gym-ANM environment provides four main functions that allow the agent to interact with it: reset(), step(action), render(), and close(). An example of code illustrating the interactions between an agent agent and an environment env is shown in Listing 1 (inspired from [23]). The agent-learning procedure is omitted for clarity. Guidelines to design and implement new Gym-ANM environments can be found in Appendix C.

ANM6-Easy.
Along with this paper we are also releasing ANM6-Easy, a Gym-ANM environment that models a series of ANM characteristic problems. ANM6-Easy is built around a DN consisting of six buses, with one highvoltage to low-voltage transformer, connected to a total of three passive loads, two renewable energy generators, one DES unit, and one fossil fuel generator used as slack generator. The topology of the network is shown in Figure 1 and its technical characteristics are summarized in Appendix E. We use a time discretization of ∆t = 0.25 (i.e., 15 minutes) by analogy with the typical duration of a market period, much like the work of [30]. The observation() component is the identity function. This leads to a fully observable environment with o t = s t . The discount factor is fixed to γ = 0.995, the reward penalty to λ = 10 3 , and the reward clipping value to r clip = 100.
In order to limit the complexity of the task, we also chose to make the processes generated by the next vars() block deterministic. To do so, we use a fixed 24-hour time series that repeats every day, indefinitely. A single auxiliary variable aux (0) t = (T 0 + t) mod 24 ∆t representing the time of day is used to index the time series, where T 0 ∈ {0, 1, . . . , 24 ∆t − 1} is the starting timestamp of the trajectory. During each timestep transition, the next vars() function thus behaves as described by Algorithm 1, where P l [0, . . . , 24 ∆t − 1] and P g [0, . . . , 24 ∆t − 1] are the fixed daily time series of load injections P (dev) l,t and max- respectively. The initialization procedure of the environment is also provided in Appendix E.
Algorithm 1 Implementation of next vars() in ANM6-Easy. 1: aux The daily patterns were engineered so as to produce three problematic situations in the DN. Figures 3, 4, and 5 show the power injections, power flows, and voltage levels that would result in each situation if the agent neither curtailed the renewable energies nor used the DES unit. Each situation lasts for seven, three, and three hours, respectively, during which the power injections remain constant. A two-hour-long period is used to transition between situations, during which each power injection is linearly incremented from its old to new value. Situation 1 This situation ( Figure 3) characterizes a windy night, when the consumption is low, the PV production null, and the wind production at its near maximum. Due to the very low demand from the industrial load, the wind production must be curtailed to avoid an overheating of the transmission lines connecting buses 0 and 4. This is also a period during which the agent might use this extra generation to charge the DES unit in order to prepare to meet the large morning demand from the EV charging garage (see Situation 2).
Situation 2 In this situation (Figure 4), bus 5 is experiencing a substantial demand due to a large number of EVs being plugged-in at around the same time. This could happen in a large public EV charging garage.
In the morning, workers of close-by companies would plug in their car after arriving at work and, in the evening, residents of the area would plug in their cars after getting home. In order to emphasize the problems arising from this large localized demand, we assume that the other buses (3 and 4) inject or withdraw very little power into/from the network. During those periods of the day, the DES unit must provide enough power to ensure that the transmission path from bus 0 to bus 5 is not over-rated, which would lead to an overheating of the line. For this to be possible, the agent must strategically plan ahead to ensure a sufficient charge level at the DES unit. Figure 4: Situation 2, lasting between 08:00 a.m. and 11:00 a.m. and between 06:00 p.m. and 09:00 p.m. every day.
Situation 3 Situation 3 ( Figure 5) represents a scenario that might occur in the middle of a sunny windy weekday. No one is home to consume the solar energy produced by residential PVs at bus 1 and the wind energy production exceeds the industrial demand at bus 2. In this case, both renewable generators should be adequately curtailed while again storing some of the extra energy to anticipate the EV late afternoon charging period, as depicted in Situation 2.

Experiments
In this section, we illustrate the use of the Gym-ANM framework. We compare the performance of PPO and SAC, two model-free deep RL algorithms, against that of the MPC-based policies π constant M P C−N and π perf ect M P C−N introduced in Section 3.7 on the ANM6-Easy task. For both algorithms, we used the implementations from Stable Baselines 3 [32], a popular library of RL algorithms. Since our goal was not to compute an excellent approximation of an optimal policy, but rather to show that existing RL algorithms can already yield good performance with very little hyperparameter tuning, most hyperparameters were set to their default value (see Appendix F). The code used for all experiments in this section can be found at https://github.com/ robinhenry/gym-anm-exp.

Algorithms
Proximal Policy Optimization PPO is a stable and effective on-policy policy gradient algorithm. It alternates between collecting experience, in the form of finite-length trajectories starting from states s 0 ∼ p 0 (·) and following the current policy, and performing several epochs of optimization on the collected data to update the current policy (after which the collected experience is discarded). During each policy update step, the policy parameters θ are updated by maximizing (e.g., stochastic gradient ascent) a clipped objective function characterized by a hyperparameter that dictates how far away the new policy π θ is allowed to diverge from the old π θ old . The objective also requires the use of an advantage-function estimator, which is achieved using a learned-state value function V φ (s). In the Stable Baselines 3 implementation that we used, both the policy π θ and the state-value function V φ (s) were represented using separate fully connected MLPs with weights θ and φ, respectively, each with two layers of 64 units and tanh nonlinearities.
Soft Actor-Critic SAC is an off-policy actor-critic algorithm based on the maximum entropy RL framework. The policy is trained to maximize a trade-off between expected return and entropy, a measure of randomness in the policy. It alternates between collecting and storing experience of the form (s t , a t , r t , s t+1 ) into a replay buffer, regularly ending the current trajectory to start from a new initial state s 0 ∼ p 0 (·), and updating the policy π θ (actor) and a soft Q-function Q φ (s t , a t ) (critic) from batches sampled from the replay buffer (e.g., stochastic gradient descent), in an offline manner. In the same manner as the work of Haarnoja et al. [25], the implementation that we used makes use of two Q-functions to mitigate positive bias in the policy improvement step. Both the policy π θ and the Q-functions Q φ1 , Q φ2 were represented using separate fully connected MLPs with weights θ, φ 1 , and φ 2 , respectively, each with two layers of 64 units and ReLU nonlinearities. Separate target Q-networks that slowly track Q φ1 , Q φ2 were also used to improve stability, using an exponentially moving average with smoothing constant τ .

Performance metric
We evaluate the performance of the different algorithms on the ANM6-Easy task as follows. Every N eval steps the agent takes in the environment (i.e., selects an action), we freeze the training procedure and evaluate the current policy on another instance of the environment. To do so, we collect N r rollouts of T timesteps each, using the current policy π θ , and report: where s (i) 0 ∼ p 0 (·) and r (i) t are the initial state and rewards obtained in the i th rollout, respectively. Because the reward signal is bounded by a finite constant r clip ∈ R (i.e., |r t | ∈ [−r clip , r clip ], ∀t), approximating J π θ (s 0 ) = lim T →∞ In our experiments, we used N r = 5 and set T = 3000, such that r clip γ T 1−γ < 10 −2 results in negligible error terms. Discounted return computed as in Eqn. (7) SAC PPO Figure 6: Evolution of the empirical discounted return J π θ (T = 3000) during training.

Results
We trained both the PPO and SAC algorithms on the ANM6-Easy environment for three million steps, starting from a new initial state s 0 ∼ p 0 (·) every 5000 steps (or earlier if a terminal state is reached), and evaluated their performance every N eval = 10 4 steps. Both algorithms used normalized observation and action vectors. We repeated the same procedure with 5 random seeds and plotted the mean and standard deviation of the evolution of their performance during training in Figure 6. Table 1 reports the average performance of policies π constant M P C−N and π perf ect M P C−N for different planning steps N and safety margin hyperparameters β (see Appendix B). As expected, the performance of π perf ect M P C−N increases with N , since the algorithm has access to perfect demand and generation forecasts. In the case of π constant M P C−N , the best average return is capped at 129.1 and increasing N does not improve performance.  Table 1: Average discounted returns J π θ for π constant M P C−N (left) and π perf ect M P C−N (right), for different planning horizons N and safety margin hyperparameters β. Table 2 compares the best performance of the trained agents against that of the MPC policies. Note that both RL agents reach better performances than π constant M P C−N . That is, both PPO and SAC outperform a MPC-based policy in which future demand and generation are assumed constant.
Finally, Table 2 also summarizes computational CPU times required for each control policy to select an action on a MacBook 2.3 GHz Intel Core i5 with 8GB of RAM. Clearly, RL policies have the advantage of requiring significantly less time for action selection, since the mapping from state (or observation) to action is stored in the form of function approximators, which can be efficiently evaluated. Nevertheless, the learning of these function approximators may require significant computational times, which vary greatly between different RL algorithms.

Conclusion
In this paper, we proposed Gym-ANM, a framework for designing and implementing RL environments that model ANM problems in electricity distribution networks. We also introduced ANM6-Easy, a particular instance of such environments that highlights common challenges in ANM. Finally, we showed that stateof-the-art RL algorithms can already reach performances similar to that of MPC-based policies that solve multi-stage DCOPF problems, with little hyperparameter tuning.  We hope that our work will inspire others in the RL community to tackle decision-making problems in electricity networks, potentially through the use of our framework. We believe that Gym-ANM has the potential to model tasks of a wide range of complexity, creating a novel extensive playground for advanced RL research. [

Appendices A Electricity Distribution Network Simulator
This Appendix describes in more detail the dynamics of the alternative current (AC) power grid on top of which Gym-ANM environments are built. Section A.1 introduces some technical power system notions used in later analyses. Sections A.2, A.3.1, A.3.2, and A.3.3 describe the mathematical model and assumptions used to simulate the behavior of transmission links, passive loads, distributed generators, and DES units, respectively. Section A.4 then introduces the set of network constraints that we would like the learned ANM control scheme to satisfy, and Section A.5 derives the set of equations that govern the network electricity flows. Finally, Sections A.6 and A.7 derive the sequence of computational steps that make up the environment transition and reward functions, respectively.

A.1 Preliminaries
Today, the majority of AC transmission and distribution networks dispatch electricity using the so-called three-phase system. In this system, electricity flows in three parallel circuits, each associated with its own phase. In a balanced three-phase network, the electrical quantities of each phase have the same magnitude and differ by a 120 • phase shift, i.e. phase 3 is a time-delayed version of phase 2, which is itself a timedelayed version of phase 1. Conveniently, any balanced three-phase system can thus be analyzed using an equivalent single-phase representation, where only one of the phases is taken into account. The complex phasors corresponding to the other two phases can be obtained by applying a 120 • or 240 • phase shift to the first-phase phasors. All systems implemented by Gym-ANM are assumed to be such three-phase balanced networks, and we adopt its equivalent single-phase representation in the following derivations.
In order to efficiently generate and distribute electricity, power grids are also divided into so-called voltage zones. Each zone is characterized by a particular nominal voltage level that represents the average voltage level of the nodes in that zone. For instance, a 220kV (ultra-high voltage) transmission network may be connected to an intermediary 150kV (high voltage) network, which is then connected to a 30kV (medium voltage) distribution network. Transitions between the different voltage levels are carried out by power transformers that bring up (step-up transformers) or down (step-down transformers) voltages while minimizing power losses. For mathematical convenience, power systems that include several voltage zones are often analyzed using the per-unit (p.u.) notation, in which all electrical quantities are normalized with respect to a set of base quantities chosen for the whole system. In practice, the per-unit analysis method becomes very handy as it removes the need to include nominal voltage levels in derivations. This allows us to analyze the network as a single circuit and cancels out the effect of transformers whose tap ratio is identical to the ratio of the base voltages of the zones it connects. In other words, only so-called off-nominal transformers need to be considered. In the remainder of this Appendix, all quantities are expressed in p.u.

A.2 Branches
As introduced in Section 2.2, we model a distribution network as a set of nodes N connected by a set of directed edges E. Each edge e ij ∈ E may represent a sequence of (a) transmission lines, (b) power transformers, and/or (c) phase shifters linking buses i and j. Any combination of (a)-(c) components can be equivalently mapped to the common branch representation adapted from [33] and shown in Figure 7. Formally, branch e ij ∈ E is characterized by five parameters: a series resistance r ij , a series reactance x ij , a total charging susceptance b ij , a tap ratio magnitude τ ij , and a phase shift θ ij . The branch series admittance is given by y ij = (r ij + ix ij ) −1 , each shunt admittance by y sh ij = i bij 2 , and the complex tap ratio of the off-nominal transformer by t ij = τ ij e iθij . Note that one can use a value of t ij = 1 to represent the absence of a transformer, or, equivalently, the presence of an on-nominal transformer.

A.3 Electrical Devices
The different electrical devices D connected to the grid are classified as passive loads D L , generators D G , or DES units D DES . Within generators, we further differentiate between renewable generators D RER ⊂ D G and the slack generator g slack ∈ D G − D RER . Much like what was done by Gemine et al. [30], the range

A.3.1 Passive Loads
We define passive loads as the devices that only withdraw power from the network. We also assume that each passive load l ∈ D L has a constant power factor cos φ l and that its negative injection P (dev) l,t is lower bounded 3 by P l . Formally, the range of operation R l,t = R l of l is defined by: for all t ∈ T .

A.3.2 Generators
Generators, with the exception of g slack , refer to devices that only inject power into the network. The physical limitations of any generator g ∈ D G are modelled by a range of allowed active power injections [P g , P g ] and of reactive power injections [Q g , Q g ]. Additional linear constraints Q g can also be added to limit the flexibility of reactive power injection when P is close to its maximum value. 4 These constraints result in the range of operation shown in Figure 8. Finally, a dynamic upper bound P (max) g,t ∈ [P g , P g ] is also generated by the next_vars() block to model time- The resulting dynamic region of operation R g,t is formally expressed as: 3 When designing a new environment, the user can set P l = −∞ to model an unbounded load. Note that a finite lower bound value is required to have a bounded state space S. 4 These additional linear flexibility constraints can be used to approximate current limits of power converters and/or of electric generators [34]. They can also be ignored by setting Q + g = Q g and Q − g = Q g . where τ

18
g , ρ (2) g are computed based on the parameters {P g , P + g , Q g , Q g , Q + g , Q − g } provided in the network input dictionary (see Appendix D) as: In order to ensure that a solution to the network equations derived in Section A.5 is found at each timestep, we do not restrict the range of operation of the slack generator g slack . Instead, we assume that it can provide unlimited active and reactive power to the network.

A.3.3 Distributed Energy Storage (DES)
DES units can both inject power into (discharge) and withdraw power from (charge) the network. Their timeindependent physical constraints are modelled much like that of generators, as shown in Figure 9, where:  where η ∈ [0, 1] is the charging and discharging efficiency factor (assumed equal), the condition SoC d ≤ SoC d,t+1 ≤ SoC d can be re-expressed as a constraint on P (dev) d,t+1 as: In summary, the range of operation of each DES unit d ∈ D DES is modelled by the time-varying constrained set R d,t :

A.4 Network Constraints
Constraints on the operating range of each electrical device in D (derived in Appendix A.3) get enforced by the environment during each timestep transition (see Appendix A.6). Unlike these constrains, however, network constraints will be left unchecked but will generate a large negative reward when not met, as further detailed in Appendix A.7. That is, the simulator will allow the network to operate past the following network constraints, but will penalize through negative rewards any policy that does so.
As introduced in Section 3.6, we consider two types of such network constraints that network operators should ensure are satisfied at all times: voltage and line current constraints. The first one is a constraint on bus voltage magnitudes, which must be kept within a close range of their nominal value to ensure stability of the grid: where V i and V i are often chosen close to 1 p.u. and voltages are expressed as root mean squared (RMS) values.
The second one is an upper limit on line currents, which are determined by materials and environmental conditions. Let I ij be the maximum physical current magnitude allowed through branch e ij ∈ E. In practice, such limits are often expressed as apparent power flow limits S ij at a 1 p.u. nodal voltage. The reason behind this choice is the fact that the apparent power flow |S ij | = |V i I * ij | is close to |I ij | when voltage magnitudes are kept close to unity by constraint (20). For consistency with existing optimization tools that model line current limits as apparent power flow constraints, we chose to adopt the same approach in Gym-ANM. In addition, for a given branch e ij ∈ E, the branch current at the sending end |I ij | may be different to the current injection at the receiving end |I ji |. This is due to the asymmetry of the common branch model of Figure 7. The constraints must thus be respected at each end of the branch: |S ij,t | ≤ S ij and |S ji,t | ≤ S ij , ∀e ij ∈ E, ∀t ∈ T .

A.5 Network Equations
The flow of electricity within a power network is dictated by a set of network equations, or power flow equations, which we will now derive. The following derivations assume that all AC quantities are expressed in RMS terms.
The ideal transformer with complex tap ratio t ij : 1 used in the common branch model introduced in Section A.2 can be further described by the relations: Applying Kirchhoff's current law at nodes A and B of Figure 7 yields: which, after substituting (22), becomes: Expressions (24) can be equivalently presented in matrix form: which is one possible formulation of the power flow equations.
However, the most commonly used formulation in practice is obtained after applying Kirchhoff's current law at each bus i ∈ N , which results in the classical matrix formulation: where I = [I 0 , I 1 , . . . , I |N |−1 ] T is the vector of bus current injections, V = [V 0 , V 1 , . . . , V |N |−1 ] T the vector of corresponding bus voltages, and Y ∈ C |N |×|N | the nodal admittance matrix with elements: Finally, (26) can also be formulated in terms of nodal power injections and voltage levels, removing the need to compute current injections: where Y i denotes the i th row of the admittance matrix Y.

A.6 Transition Function
Based on the current state s t ∈ S, each timestep transition starts by sampling the internal variables through the next_vars() block of Figure 2. Note that this block can be uniquely designed for different environments. The remainder of the transition function happens with the next_state() component in a deterministic manner, which we now describe as a series of steps analogous to the underlying implementation.
1. Load injection point First, the reactive power injection Q (dev) l,t+1 of each load l ∈ D L is inferred from its new demand P (dev) l,t+1 outputted by next_vars(), according to (9): where P (dev) l,t+1 is first clipped to [P l , 0].

Distributed generator injection point
The power injection point of each distributed generator g ∈ D G − {g slack } is computed based on its allowed range of operation R g,t+1 given by (10). The active and reactive injections a Pg,t , a Qg,t are then set by the agent in a t : || a Pg,t , a Qg,t − (P, Q)|| .
In the case where the (a Pg,t , a Qg,t ) injection point set by the agent falls outside of R g,t+1 , the environment selects the closest point in R g,t+1 , according to the Euclidean distance.

DES injection point
Similarly, the power injection point of each DES unit d ∈ D DES is computed based on the (a P d,t , a Q d,t ) point chosen by the agent in a t and the operating range R d,t+1 of d given by (19). We again use the Euclidean distance as the distance metric, resulting in: ||(a P d,t , a Q d,t ) − (P, Q)|| .

Power flows & bus voltages
Now that the power injection point of each device, with the exception of the slack generator, is known, the total nodal active and reactive power injection for each non-slack bus i is computed using: After fixing the slack bus voltage to unity, the environment then solves the network equations given by (28). Our implementation uses the Newton-Raphson procedure [35] to do so. From the solution, we obtain the voltage V (bus) i at each non-slack bus and the slack generator power injection point (P g slack ,t+1 , Q g slack ,t+1 ).

State construction
The new state vector s t+1 can now be constructed according to the structure defined by (2). The active and reactive power injection points P d,t+1 , Q d,t+1 have already been computed. The new charge level SoC d,t+1 of each DES unit d ∈ D DES is obtained using expression (16). Finally, the P (max) g,t+1 and aux (k) t+1 variables are simply copied from the output of next_vars().

A.7 Reward Function
The main component of the reward signal, as introduced in (4) and (5), is a sum of three energy losses and a penalty term associated with violating operating constraints: We chose to compute both terms in p.u. to ensure similar orders of magnitude.

A.7.1 Energy loss
The transmission energy loss, ∆E (1) t:t+1 , is computed as: where ∆t is used to get the energy loss in p.u. per hour. The net amount of energy flowing from the grid into DES units, ∆E t:t+1 , is obtained using: Finally, the amount of energy loss as a result of renewable energy curtailment, ∆E t:t+1 , is: Summing (34)-(36) together yields the total energy loss:

A.7.2 Constraint-violation penalty
Let Φ : S → R be the penalty function that adds a large cost λΦ(s t+1 ) to a policy that leads to a violation of operating constraints. To compute Φ(s t+1 ), the environment first computes the node voltages V i,t+1 using (28) and the directed branch currents I ij,t+1 and I ji,t+1 for each branch e ij ∈ E using (25). The obtained values are then plugged into |S ij,t+1 | = |V i,t+1 I * ij,t+1 | and |S ji,t+1 | = |V j,t+1 I * ji,t+1 | to compute the corresponding branch's apparent power flows. The penalty term Φ(s t+1 ) is finally obtained using:

B Model Predictive Control Scheme B.1 Introduction
This appendix describes the MPC problem solved by the MPC-based policy π M P C−N introduced in Section 3.7. At each timestep, the policy solves a multi-stage DCOPF problem with an optimization horizon of N timesteps. As a linear approximation of the actual ACOPF that we would like to solve, the DCOPF formulation relies on three assumptions, included here again for clarity: 1. Transmission lines are lossless: r ij = 0, ∀e ij ∈ E, 2. The difference between adjacent bus voltage angles is small: 3. Bus voltage magnitudes are close to unity: We start by giving a general formulation of the MPC problem in which the algorithm takes as input predictions of future demand and generation in Section B.2. We call this policy π M P C−N . We then consider two particular forecasting methods in Section B.3: one which assumes constant values over the optimization horizon, policy π constant M P C−N , and another that generates perfect forecasts, policy π perf ect M P C−N .

B.2 General formulation B.2.1 Policy overview
The action selection procedure followed by π M P C−N at timestep t is given by Algorithm 2. In this algorithm, solveMPC() refers to solving 5 the optimization problem (40)-(49) and extracting the vector of device active power injection P (dev) t+1 from the solution. The considered-optimal power injections from all non-slack generators and DES units are then concatenated into an action vector a t . Since reactive power flows are ignored by the DCOPF formulation, we chose to simply set the reactive power set-points in a t to zero.
In this general formulation, Algorithm 2 takes as inputs the network state s t (directly extracted from the Gym-ANM simulator) and forecasts of demand and generation over the optimization horizon k = t + 1, . . . , t + N . We denote these forecasted values asP respectively. An additional safety margin hyperparameter, β ∈ [0, 1], is also introduced to further constrain the power flow on each transmission line in the OPF. This is done with the hope that it will account for any errors introduced with the linear DC approximation, thus ensuring that line current constraints are respected. The penalty hyperparameter λ is taken to be the same as in the reward function.

The optimization problem
We now describe the optimization problem (40)-(49) in more detail. The objective function is a simplified version of the cost function used in the reward signal, originally defined as: In the above formulation, load injections and maximum generations of generators are non-controllable variables (i.e., constants), which can thus be removed from the objective function. In addition, the DCOPF assumptions have |V i | = 1, which leads to |S ij | = |P ij |, from which the penalty term φ(s t+1 ) can be greatly simplified. The resulting objective function to minimize is given by (40). Note that we define it as the discounted sum of costs over the optimization horizon. This is to reflect the agent's objective of learning a policy that minimizes the expected discounted return.

B.2.3 Further considerations
The performance achieved by π M P C−N provides a lower bound on the best performance achievable in a given environment. This bound is not tight, however, since the achieved performance depends on (a) the quality of the DC linear approximation, (b) the accuracy of the forecasted values, and (c) the length of the optimization horizon N . Note that, in general, the performance of an MPC-based policy increases as N → ∞. Because of (a) and (b), however, this may not be the case with π M P C−N , since, e.g., erroneous long-term forecasts may harm policies with larger N 's. As a result, N may have to be tuned, depending on the environment.

B.3 Special cases: constant and perfect forecast
We now consider two special cases of the MPC-based policy π M P C−N . Both policies were used in the ANM6-Easy environment in Section 5.

B.3.1 Constant forecast
The first variant that we consider is π constant M P C−N . It assumes that load injections P (dev) l,t and maximum generations P (max) g,t remain constant during the optimization horizon. As such, it is one of the simplest variants of π M P C−N that one could use. Formally, we can describe π constant M P C−N by its constant forecasts: ∀l ∈ D L , k = t + 1, . . . , t + N, The main advantage of π constant M P C−N is that it can be used out-of-the-box in any Gym-ANM environment. More information on how to do this can be found on the project repository.

B.3.2 Perfect forecast
The second variant that we consider is π perf ect M P C−N . This variant is specifically tailored for the ANM6-Easy environment introduced in Section 4.2. This is because it assumes perfect forecasts of load injections and maximum generations. In other words, it relies on the fact that ANM6-Easy is a deterministic environment in which future demand and generation can be perfectly predicted. Formally, π perf ect M P C−N uses perfect forecasts: ∀l ∈ D L , k = t + 1, . . . , t + N, Unlike π constant M P C−N , policy π perf ect M P C−N can only be used in deterministic environments of the like of ANM6-Easy. Nevertheless, it offers a large advantage in that it yields a much better performance in such environments. This provides the user with a tighter lower bound on the best achievable performance in the environment.

Listing 2: Implementation template for new Gym-ANM environments.
__init__() This method, known as a constructor in object-oriented languages, is called when a new instance of the new environment MyANMEnv is created. In order to initialize the environment, the following arguments need to be passed to the superclass, through the call super().__init__(): • network: a Python dictionary that describes the structure and characteristics of the distribution network G and the set of electrical devices D. Its structure should follow the one given in Appendix D.
• obs: a list of tuples corresponding to the variables to include in observation vectors. In Listing 2, o t is constructed as (P  Table 3. Alternatively, the obs object can be defined as a customized callable object (function) that returns observation vectors when called (i.e., o t = obs(s t )), or as a string 'state'. In the later case, the environment becomes fully observable and observations o t = s t are emitted.
• K: the number of auxiliary variables K in the state vector given by (2).
• delta, gamma, lamb, r_clip: the hyperparameters ∆t, γ, λ, and r clip , respectively, used to compute the rewards and returns, as introduced in Section 3.1.
• seed: an integer to be used as random seed.
init_state() This method will be called once at the start of each new trajectory and it should return an initial state vector s 0 that matches the structure of (2). In the case where s 0 falls outside of S because, for instance, the (P, Q) injection point of a device falls outside of its operating range, the environment will map s 0 to the closest element of S according to the Euclidean distance. In short, init_state() implements p 0 (·).
next_vars() As introduced in Section 3.1, next_vars() is a method that receives the current state vector s t and should return the outcomes of the internal variables for timestep t + 1. It must be implemented by the designer of the task, with the only constraint being that it must return a list of |D L | + |D RER | + K values.  observation_bounds() This method is optional and only useful if the observation space is specified as a callable object. In the latter case, observation_space() should return the (potentially loose) bounds of the observation space O, so that agents can easily normalize emitted observation vectors.
Additional render() and close() methods can also be implemented to support rendering of the interactions between the agent and the new environment. render() should update the visualization every time it gets called, and close() should end the rendering process. For more information, we refer to the official OpenAI Gym documentation [23]. 7

D Network Input Dictionary
This appendix describes the structure of the Python dictionary required to build new Gym-ANM environments. The dictionary should contain four keys: 'baseMVA', 'bus', 'device', and 'branch'. The value given to the key 'baseMVA' should be a single integer, representing the base power of the system (in MVA) used to normalize values to per-unit. Each of the other three keys should be associated with a numpy 2D array, in which each row represents a single bus, device, or branch of the distribution network. The structures of the 'bus', 'device', and 'branch' arrays are described in Tables 4, 5  Type of device (-1 = load; 0 = slack; 1 = classical generator; 2 = distributed renewable energy generator; 3 = DES unit). 3 Constant ratio of reactive power over active power (Q/P ) d (loads only). 4 Maximum active power output P d (MW). 5 Minimum active power output P d (MW). 6 Maximum reactive power output Q d (MVAr). 7 Minimum reactive power output Q d (MVAr). 8 Positive active power output of PQ capability curve P + d (MW). 9 Negative active power output of PQ capability curve P − d (MW). 10 Positive reactive power output of PQ capability curve Q + d (MVAr). 11 Negative reactive power output of PQ capability curve Q − d (MVAr). 12 Maximum state of charge of storage unit SoC d (MWh). 13 Minimum state of charge of storage unit SoC d (MWh). 14 Charging and discharging efficiency coefficient of storage unit η d . Receiving-end bus unique ID j. 2 Branch series resistance r ij (p.u.). 3 Branch series reactance x ij (p.u.). 4 Branch total charging susceptance b ij (p.u.). 5 Branch rating S ij (MVA). 6 Transformer off-nominal turns ratio τ ij . 7 Transformer phase shift angle θ ij (degrees) (> 0 = delay).

E ANM6-Easy Environment
This appendix describes in more detail the ANM6-Easy Gym-ANM environment introduced in Section 4.2.

E.1 Network Characteristics
Tables 7, 8, and 9 summarize the characteristics of buses, electrical devices, and branches, respectively, following the network input dictionary structure given in Appendix D. Based on the values given in Table 8, the range of operation (see Appendices A.3.2 and A.3.3) of the distributed generators and the DES unit of ANM6-Easy are also plotted in Figure 10.    Table 9: Description of each branch e ij ∈ E of ANM6-Easy.
The fixed time series used by the next_vars() component of ANM6-Easy, in order to model the evolution of the loads and of the maximum generation from renewable energy resources, are provided in Table 10.

E.2 Environment Initialization
The initialization procedure of ANM6-Easy, according to which initial states s 0 ∼ p 0 (·) are drawn, is illustrated in Algorithm 3. Time series P 1−5 refer to Table 10. An initial time of day t 0 is sampled and used to initialize the aux respective constant power factor. We assume that each distributed generator operates at its maximum active power (i.e., no generator is curtailed) and that its reactive power injection is sampled uniformly. The initial power injection point of each generator is then mapped onto the generator's allowed region of operation R g,0 (line 9). The initial state of charge of the DES unit is also uniformly sampled and its power injection point is set to zero (lines 12-13). Finally, the slack power injection is obtained after solving the set of network equations (line 15).

F Experimental hyperparameters
The hyperparameters used for the experiments presented in Section 5 are summarized in Table 11 for the PPO and in Table 12 for the SAC algorithms. Both implementations were taken from the Stable Baselines 3 library [32]. The horizon T is the maximum number of steps per episode used during training.