Optimal Energy System Scheduling Using A Constraint-Aware Reinforcement Learning Algorithm

The massive integration of renewable-based distributed energy resources (DERs) inherently increases the energy system's complexity, especially when it comes to defining its operational schedule. Deep reinforcement learning (DRL) algorithms arise as a promising solution due to their data-driven and model-free features. However, current DRL algorithms fail to enforce rigorous operational constraints (e.g., power balance, ramping up or down constraints) limiting their implementation in real systems. To overcome this, in this paper, a DRL algorithm (namely MIP-DQN) is proposed, capable of \textit{strictly} enforcing all operational constraints in the action space, ensuring the feasibility of the defined schedule in real-time operation. This is done by leveraging recent optimization advances for deep neural networks (DNNs) that allow their representation as a MIP formulation, enabling further consideration of any action space constraints. Comprehensive numerical simulations show that the proposed algorithm outperforms existing state-of-the-art DRL algorithms, obtaining a lower error when compared with the optimal global solution (upper boundary) obtained after solving a mathematical programming formulation with perfect forecast information; while strictly enforcing all operational constraints (even in unseen test days).


Introduction
To reduce the impact of the energy sector on the environment, distributed energy resources (DERs) are being integrated into our energy systems.Such DERs, in the form of renewable-based systems (e.g., PV systems and wind turbines) and small-scale energy storage systems (ESSs), provide more flexibility, enabling a more efficient operation.Nevertheless, these DERs also increase the energy system's complexity, especially when it comes to defining its operational schedule.Moreover, due to their weather-dependent nature, renewable-based DERs inherently increase the energy system's levels of uncertainty, requiring scheduling algorithms capable of providing fast and good-quality, but feasible, solutions [1].In the technical literature, two main approaches are available to deal with the optimal scheduling of energy systems; namely, model-based and model-free approach.A detailed literature review is presented next.

Literature review
In general, model-based approaches rely on precise models to build complex mathematical formulations in order to consider the energy system' operational constraints.Depending on how these constraints are modeled, the derived mathematical formulations can be classified as linear, nonlinear programming, or dynamic programming problems [2].In this regard, in [3], a mixed-integer nonlinear programming (MINLP) formulation is used to determine the optimal operation of an unbalanced three-phase energy system.In order to reduce the complexity of the proposed formulations, linearizations and simplifications are introduced.Similar work has been done in [4].Nevertheless, the model-based nature of these methods requires considerable precision of the built mathematical models, which limits their performance, especially if uncertainty is to be considered.
Generally, in model-based approaches, uncertainty is modeled either by using a probability distribution function or by leveraging a set of representative scenarios, leading to stochastic or robust mathematical formulations, such as the ones presented in [5][6][7].Other approaches, such as the one in [8], leverage a rolling time horizon approach to eliminate the forecast error when defining the DERs optimal energy scheduling.To guarantee the feasibility of the defined schedule under various operational scenarios, in [9], an adjustable two-stage robust optimization framework is proposed, solving simultaneously a dayahead scheduling and real-time regulation problem of an integrated

Notation
The notation used throughout this paper is reproduced below for reference.Maximum/minimum charging/discharging limit of the ESSs  providing good quality solutions, existing model-based approaches are not adequate for handling the increased uncertainty level of renewablebased energy systems, as their performance and efficiency mainly depend on the accuracy of the used models and their approximations.Moreover, the computational complexity of these methods increases dramatically with the system size, imposing scalability and convergence challenges.

Sets
To overcome this, model-free approaches have been introduced as an alternative solution.The most promising approach is based on the use of reinforcement learning (RL) [11], modeling the decisionmaking problem as a Markov Decision Process (MDP).One of the most interesting features of RL algorithms is that they can learn any system's dynamics by interaction, providing good-quality solutions guided by a reward value used as a performance indicator [12].Recently, deep reinforcement learning (DRL) algorithms have shown good performance when solving MDPs in energy systems tasks [13], ranging from, home energy management [14], microgrid dispatch [15], voltage regulation [16], and electricity network operation [17].Other applications include, for instance, a standardized DRL approaches for demand response in smart buildings [13], and learning to solve fast optimal power flow problems using DRL algorithms, specifically the proximal policy optimization (PPO) algorithm and imitation learning [18].In [19], a performance comparison of the soft actor-critic (SAC) algorithm with a rule-based control method on the surrogate simulation model developed by [13], is presented.In [16], the voltage regulation problem of a distribution network is first modeled as a partial-observable MDP, and then multi-agent DRL algorithms are leveraged to execute the optimal solutions.In [20], a DRL approach-based proactive operation framework is proposed to model the stochastic behavior and uncertainty of solar energy for residential buildings.In [21], a DRL algorithm is developed to solve a stochastic energy management problem considering power flow constraints, resulting in an optimal policy that minimizes total operational cost (although operational constraints are disregarded).
Different from the energy-related MDPs presented above, the operational schedule of DERs within an energy system must enforce a rigorous set of operational constraints to ensure a reliable and safe operation, e.g., generation and consumption must always be balanced during real-time operation, ramping-up and down constraints, etc.Nevertheless, current DRL algorithms lack of safety guarantees [22], as these constraints cannot be directly imposed in the algorithm's formulation.Different strategies to indirectly enforce operational constraints have been proposed to overcome this.In [23], a DG unit is set as a slack bus with unlimited generation capacity, avoiding unbalance by the outputs of the generators controlled by DRL agents.In [24], a penalty term is added to the reward function to guide the learning process aiming to reduce operating costs while enforcing power balance.A similar penalty approach has been used to enforce voltage magnitude constraints in case the electricity network operation is considered.For instance, [25] modeled the dispatch of PV inverters as an MDP, and built a decentralized dispatch framework penalizing RL agents when actions lead to voltage violations.In research [26], an on-policy RL algorithm with eligibility traces is developed to dispatch the energy storage system to minimize the cost and regulate voltage magnitudes.A similar work is presented in [27].In [28], a service assistant restoration problem is modeled as MDP.Then, imitation learning is employed as expert demonstrations enabling a deep deterministic policy gradient (DDPG) agent learn a safe policy for online implementation.In [29], a double auction market-based coordination framework is proposed to schedule the energy trading between multi-energy microgrids.Multiagent twin delayed deep deterministic algorithm (TD3) is used to solve the formulated problem, while a large penalty is imposed on the reward function to reduce the energy unbalance.In [30], the SAC algorithm is leveraged to control a virtual power plant to provide frequency regulation services, penalizing any frequency deviation.Nevertheless, although these strategies may enforce operational constraints during training, they are either based on nonpractical assumptions or fail to guarantee the feasibility of the defined operating schedule in realtime, especially in cases of large peak consumption or renewable-based generation [31].EV management Strategies based on safe RL have also been proposed to directly enforce operational constraints, exploiting results from different research areas, such as robot manipulation [32,33].In [34], an action projection layer is implemented, correcting the action defined by the DRL algorithm via a projection operator.Unfortunately, this projection operator degrades the DRL algorithm's performance, as shown in [35].In [36], safe DDPG is used for real-time automatic control of a smart hub, while a safety net is used to estimate the feasibility of decided actions.A similar strategy is proposed in [37], in which the action proposed by the DRL algorithm is used as starting point to solve a mathematical programming formulation, ensuring constraints compliance.In [38], a constrained policy gradient approach is proposed, updating the parameters of the DNN model in the direction that minimizes the power unbalance.In [39], the same approach is used to solve an EVs coordination problem.This policy optimization approach allows the DRL algorithm to provide a probabilistic notion of safety.Nevertheless, feasibility is paramount in energy systems operation, and it should be certifiable.In this regard, enforcing operational constraints during the online scheduling stage is a critical challenge for DRL algorithms and it must be addressed in order to enable their wide adoption in real systems.A summary of the discussed research literature is presented in Table 1.The openness and free online availability of the algorithms discussed here are also highlighted in Table 1.

Contributions
To overcome the above-discussed limitations, this paper proposes a DRL algorithm (namely MIP-DQP) to define the optimal schedule of a renewable-based energy system, capable of strictly enforcing all the operational constraints in the action space, ensuring the feasibility of the defined scheduled in real-time operation.To do this, we used recent optimization advances for DNNs that allow their representation as a mixed-integer linear (MIP) formulation, enabling further consideration of any action space constraints.Such approaches have been also employed in the context of feature visualization and adversarial machine learning [40].The performance of the proposed algorithm has been compared with other state-of-the-art DRL algorithms available in the literature, including DDPG, PPO, SAC, and TD3 algorithms [11], to show its effectiveness.A comparison with the optimal global solution is also presented, obtained by solving the energy system scheduling problem as a mathematical programming formulation considering full knowledge of future information (i.e., consumption, dynamic prices, and renewable-based generation).The main contributions of this paper are as follows: • A value-based DRL algorithm to solve the energy system scheduling problem is proposed, capable of dealing with continuous action spaces.Different from other actor-critic DRL algorithms (e.g., DDPG, PPO, and TD3 [11]), we make use of the action-value function approximated using a DNN, while discarding the policy model learning used during exploration.• An innovative online execution approach that guarantees that the proposed DRL algorithm strictly meets all operational conditions in the action space (e.g., the power balance constraint), even in unseen test data, is also proposed.This is done by leveraging new optimization results from DNNs that allow their representation as a MIP formulation, enabling further consideration of any action space constraints.
The rest of this paper is organized as follows.In Section 2, the optimal energy system scheduling problem is formulated.Then, in Section 3, the formulated problem is modeled as MDP while the proposed MIP-DQN algorithm is illustrated and used to solve the optimal energy system scheduling problem in Section 4. Simulation tests are presented, analyzed and discussed in Section 5, while final conclusions are presented in Section 6.

Mathematical programming formulation of the energy systems scheduling problem
The structure of the considered energy system is shown in Fig. 1, including various DERs, such as solar photovoltaic (PV), ESSs, DGs, and loads, while a connection to the utility grid is leveraged to address a demand surplus or shortage problem.For tractable analysis, we assume the day-ahead market where the electricity price of each hour is revealed beforehand.For the energy system in Fig. 1, the optimal energy system scheduling problem can be modeled by the nonlinear programming (NLP) formulation described by ( 1)- (11).The objective function in (1) aims at minimizing the operating cost for the whole time horizon  , comprising the operating cost of the DG units, as presented in (2), and the cost of buying/selling electricity from/to the main network, as in (3).Given the output power of DG units   , , the operating cost can be estimated by using a quadratic function as in (2).The transaction cost between the energy system and the network is settled according to Time-of-Use (ToU) prices, in which it is assumed that selling prices are lower than the purchasing prices.In (3),   is the ToU price at time slot , while    refers to the exported/imported power transaction to/from the network.
Subject to: Expression (4) defines the power balance constraint.Expression (5) defines the DG units generation power limits while ( 6) and ( 7) enforce the DG unit's ramping up and down constraints, respectively.Energy storage systems (ESSs) are modeled using ( 8)- (10).In this model, the operation cost of ESSs is not considered, while ESSs are allowed to schedule their discharge and charge power in advance.Expression (8) defines the charging and discharging power limits, while expression ( 9) models the state of charge (SOC) as a function of the charging and discharging power.Expression in (10) limits the energy stored in the ESSs, avoiding the impacts caused by over-charging and over-discharging.Finally, the main network export/import power limit is modeled by the expression in (11).Notice that in order to solve the mathematical formulation described by ( 1)-( 11), full knowledge of future information (e.g., renewable-based generation, consumption and dynamic prices) is required, for instance, provided via a forecasting algorithm.The proposed DRL algorithm is able to provide good-quality solutions with only current information, as shown later.Next, the MDP formulation of the optimal scheduling problem is presented.

MDP formulation & value-based DRL
The above-presented decision-making problem can be modeled as a finite MDP, represented by a 5-tuple (, , , , ), where  represents the set of system states,  the set of actions,  the state transition probability function,  the reward function, and  a discount factor.In this formulation, the energy system operator can be modeled as an RL agent.The state information provides an important basis for the operator to dispatch units.We define a state at time  as   = (   ,    ,   −1 ,   ),   ∈ , while the actions, defining the scheduling of the DG units and the ESSs, as   = (  , ,    ),   ∈ .Notice that the RL agent does not directly control the transaction between the energy system and the main network (i.e.,    ).Instead, after any action is executed, power is exported/imported from the main network to maintain the power balance.Nevertheless, a maximum power capacity constraint exists and must be enforced i.e., (11).Notice that if the maximum export/import limits are defined to be a low value (as done in this paper), in most cases, the power balance constraint will not be automatically met.
Given the state   and action   at time step , the energy system transit to the next state  +1 defined by the next transition probability function which models the energy system's dynamics.In model-based algorithms, the uncertainty is predicted by a determined value or sampling from a prior probability distribution.In contrast, DRL algorithms are a model-free approach, capable of learning such dynamics from interactions.To guide learning, a reward   must be provided by the environment in order for the RL agent to quantify the goodness of any action taken.In the energy system scheduling problem, the reward function (  ,   ) should guide the RL agent to take actions that minimize the total operating cost, while enforcing the power balance constraint.This can be done by using the reward function in which   corresponds to the power unbalance at time-step , defined as, In (13),  1 and  2 are used to control the order of magnitude and the trade-off between the operating cost minimization and the penalty incurred in case of power unbalance.The procedure used to solve the proposed MDP using value-based RL algorithms is presented next.

DRL value-based algorithms
Define   (  ,   ) as the action-value function that estimates the expected cumulative reward given that action   is taken at state   and following policy (⋅) after that.The action-value function   (  ,   ) can be expressed recursively as [12], Bellman's principle of optimality states that the optimal action-value function for an MDP has the recursive expression which solution can be obtained by using a Temporal Difference (TD) algorithm [41], which solves the following update rule iteratively.
in which Q(⋅) corresponds to a function approximator used to represent  *  (⋅) and  ∈ (0, 1] is a learning rate.Once a good quality representation of  *  (⋅) is obtained via Q(⋅), at time step  and state   , optimal actions   can be sampled from the optimal policy, i.e.,   ∼  * (  ), obtained as For continuous state and action spaces, the optimal action-value function  *  (⋅) can be approximated using a DNN i.e., Q(⋅) =   (⋅) with parameters , leading to an algorithm known as deep Q-networks (DQNs) [42].In this case, the iterative procedure shown in (17) can be seen as a regression problem whose objective is to estimate the DNN's parameters  via stochastic gradient ascent.In DQNs, the   is updated using the value   + max ∈    (  , ), where   target is a target Q-function. 1  ( Notice that in continuous action spaces, the procedure used in (18) to sample actions from the action-value function   is not feasible since an exhaustive action enumeration (i.e., the Max-Q problem) is not possible.Moreover, in (18) actions constraints are completely disregarded.To overcome this, we combine value-based DRL algorithms with mixed-integer programming, as explained next.
1 i.e., a copy of model   which parameters are updated less frequently.This procedure helps to stabilize learning within the DRL algorithm.For a more detailed explanation, see [43]. 2 For a more detailed derivation of the loss function in (19), see [43].

Proposed MIP-DQN algorithm
The proposed DRL algorithm is named MIP-DQN and is defined through two main procedures: training and deployment (or online execution).The main objective of the training procedure is to estimate the parameters  of the DNN used to approximate the action-value function   ; whereas during deployment, the obtained function   is used to take actions to directly operate assets within the energy system.Both procedures are explained in detail below.

Training procedure
The training process developed for the MIP-DQN algorithm is described in Algorithm 1.This process starts by randomly initializing the parameters of the DNN functions   ,   target .Then, interactions with a model of the energy system take place.In traditional valuedbased RL algorithms, exploration is done by sampling actions from the current estimate of the action-value function   .However, and as explained before, sampling actions from   following (18) is not a feasible procedure in continuous action spaces.Instead, we propose to use a parameterized deterministic optimal policy   , which is also approximated using a DNN model and randomly initialized.Similar to other works [43,44], the policy function   , the action-value functions   and   target , will be jointly approximated.
Within one epoch, for each time step , a transition tuple of the form (  ,   ,   ,  +1 ) is collected and store in a replay buffer .Then, a subset  of these samples is selected and used to update the parameters of functions   ,   target and   as shown in Algorithm 1.This procedure is iteratively done until a maximum number of epochs is reached.
Different from other DRL algorithms, such as DDPG and PPO, after training, we make use of the action-value function   and discard the approximated policy   .Moreover, it is critical to notice that the power balance constraint is only enforced via the penalty added to the reward function in (13).Thus, it is expected that at the end of the training procedure, such equality constraint is not strictly met.The procedure used to enforce constraints is developed for the deployment or online execution, as explained next.

Deployment (online execution) procedure
After convergence of the training procedure, the action-value function   , with fixed parameters , can be used to take actions to control different energy resources.To do this, the problem stated in (18) must be solved.In this case, as function   represents a DNN, in order to solve (18), we leverage recent optimization results for DNNs.Thus, proposing a transformation of the DNN model   into a MIP formulation.

MIP for deep neural networks
Let the DNN   (, ) in Fig. 2 consists of  +1 layers, listed from 0 to .Layer 0 is the input of the DNN, while the last layer,  refers to the outputs of the DNN.Each layer  ∈ {0, 1, … , } have   units, which is denoted by  , , the  ℎ unit of the layer .Let   refers to the output vector of layer , then    is the output of unit  , , ( = 1, 2, … ,   ).As layer 0 is the input of the DNN, then  0  is  ℎ input value for the DNN.For each layer  ≤ 1, the unit  , computes the output vector   below: where  −1 and  −1 are matrices of weights and biases that compose the set of parameters  i.e.,  = { , } and ℎ(⋅) is the activation function, which in this case corresponds to the ReLU function, described as: for a real vector , ReLU() ∶= max{0, }.Based on the above definitions, the DNN of Fig. 2, with fixed parameters , can be modeled as a valid MIP problem by modeling the ReLU function using binary constraints.Thus, using a binary activation variable    for each unit  , , the MIP formulation of a DNN can be expressed as [40]: Subject to: In the above formulation, weights  −1 , and biases    are fixed (constant) parameters; while the same holds for the objective function costs    and    .The ReLU function output for each unit is defined by (22), while ( 23) and ( 24) define lower and upper bounds for the  and  variables: for the input layer ( = 0), these bounds have physical meaning (same limits of the   inputs i.e.,  and ), while for  ≥ 1, these bounds can be defined based on the fixed parameters  [45].Finally, notice that in order for the MIP formulation to be equivalent to the DNN, ReLU activation functions must be used, as explained in [40].

Enforcing constraints in online execution
For an arbitrary state   , the optimal action   can be obtained by solving the MIP in ( 21)-( 24) derived from   .In this case, as the decision variables are the actions   (see (18)), the power balance constraint in (4) as well as the ramp-up and ramp-down constraints in ( 6) and ( 7), respectively; can also be added to the MIP formulation described by ( 21)-( 24).As a result, the optimal actions obtained by solving this MIP strictly enforce all operational constraints in the action space.This problem can be represented as, 22)-( 24), ( 4), ( 6), ( 7). ( To better understand the MIP formulation stated in (25), Fig. 3 shows a re-interpretation of the power balance constraint in (4) as a hyperplane that define the feasibility region (for a three dimensional space) of the action space.Notice that such hyperplane may have different parameters for different time steps.Thus, if the hyperplane that enforces the power balance constraint is added to the MIP formulation that represents the DNN   , the solution of such mathematical problem will ensure minimum operating cost (via the maximization of   ) and enforce all action space constraints, as exemplified in Fig. 4. In this case, this re-interpretation of the DNN as a MIP formulation offers enough flexibility to enforce equality constraints (as well as other constraints over the action space) for the energy system scheduling problem, such as the power balance.Algorithm 2 shows the stepby-step procedure used during the online execution of the proposed MIP-DQN algorithm.

Algorithm 2: Online Execution for the MIP-DQN Algorithm
Extract trained parameters  from   ; Formulate the Q-function network   as a MIP formulation according to ( 21)- (24).Add all action space constraints i.e., ( 4), ( 6) and ( 7).Extract initial state  0 based on real-time data; for  = 1 to  do For state   , get optimal action by solving (25) using commercial MIP solvers;

Simulation results and discussions
In this section, simulation results and discussions are presented.A comparison with DRL algorithms available in the literature, including PPO, SAC, DDPG and TD3 algorithms, is also presented.

Case study and simulations setup
To test the developed MIP-DQN algorithm, an energy system consisting of three DG units and an ESS is defined.The DG unit's parameters are shown in Table 2, while for the ESS, the charging/discharging Fig. 3. Action space (grey) and feasible action space (red) illustration.Actions  1 ,  2 ,  3 refer to generic actions in a three dimension action space .For each time step , the power balance constraint in (4) can be seen as the hyperplane  1 +  2 +  3 =  that defines the feasible actions space.Fig. 4. Visualization of the constraint space whose boundaries are formed by the hyperplanes ℎ   (⋅) defined by the ReLU activation functions derived from the deconstructed DNN   (, ⋅) as a MIP formulation, for an specific state  and actions  1 and  2 .The grey are shows the increasing value (from darker to lighter) of ∇  .The red point exemplifies the optimal solution of max If such a constraint is added to the MIP formulation, the solution represented with the blue point will be reached.limits, nominal capacity, and energy efficiency (  ) are set to 100 kW, 500 kW, and 0.90, respectively.We assume that the network's maximum export/import limit is defined as 30 kW.To encourage the use of renewable energies, we set selling prices as half of the current electricity prices, i.e.,  = 0.5.One-year demand consumption and PV generation data are used as the original data-set, sampled in hour resolution.Fig. 5 shows the mean and standard deviation of the demand consumption and PV generation during summer and winter for a period of 24 h, defined as the length of one episode ( = 24).The original dataset is divided into two additional datasets: training and testing.The training dataset contains the first three weeks of each month, while the testing dataset contains the remaining data.This allows the DRL algorithm to learn any seasonal and weekly behavior available in the PV generation and demand consumption data [31].During training, the EES's initial SOC was randomly set.To implement our MIP-DQN algorithm, PyTorch and OMLT (see [45]) package has been used.Default settings were used for all the implemented DRL algorithms, as shown in Table 3.All implemented algorithms are openly available in [46].Hyper-parameters  1 and  2 are defined as 0.01 and 20, respectively, as default values.Each test is run with five random seeds to eliminate randomness from code implementation.

Validation and algorithms for comparison
In the research literature, DRL algorithms are usually compared with simple rule-based or MPC-based algorithms (considering the impacts of any forecasting error) [47].Nevertheless, this procedure does not allow us to estimate the optimality gap between current DRL algorithms and the optimal global solution with a perfect forecast of the stochastic variables (i.e., generation and demand consumption).In this case, this optimal global solution with full knowledge should be regarded as an upper boundary, as none algorithm would perform better.Based on this, to validate and fairly compare the performance of the proposed MIP-DQN algorithm, besides comparing the optimal DERs schedule defined by several state-of-the-art DRL algorithms (DDPG, PPO, TD3), we compared with the optimal global solution obtained considering perfect forecast for the next 24 h.In this case, the optimal global solution is found by solving the nonlinear mathematical programming formulation in Section 2, implemented using Pyomo [48].Notice that different from the optimal global solution, all the tested DRL algorithms are able to make decisions only using current information.Finally, to evaluate the DRL algorithms' performance, the total operating cost, as in (1), and the power unbalance, as in (14), are used as metrics.

Performance on the training set
Fig. 6 shows the average reward, operating cost, and power unbalance for the developed MIP-DQN algorithm and other DRL algorithms during the training process.As can be seen in Fig. 6, the average reward increases rapidly after 100 episodes of training, while the operating cost and the power unbalance significantly decrease.This behavior during training is typical of DRL algorithms as the DNN's parameters are randomly initialized, leading initially to random actions causing high power unbalance.Throughout the training, and due to the introduction of the penalty terms used in the reward definition in (13), the DNN's parameters are updated, leading to higher quality actions, reducing  Next, we show how our proposed algorithm can overcome this during online execution, even in unseen data.

Performance on the test set
After training, the DNN's parameters of all the DRL algorithms are fixed as shown in Algorithm 2. A performance comparison is now made on the test set.Recall that the data on the test set is not used during training; therefore, it has not been seen by any of the DRL algorithms.To compare results on the test set, Fig. 7 shows the cumulative operating cost and power unbalance (which can be seen as a cumulative error) for 10 different days using the proposed MIP-DQN algorithm, as well as other DRL algorithms.The optimal global solution obtained by solving the NLP formulation and considering the perfect forecast is also presented.As can be seen in Fig. 7, during online operation and for all 10 test days, the proposed MIP-DQN algorithm strictly meets the power balance constraint, while other DRL algorithms fail to deal with such equality constraint.Notice in Fig. 7 how DRL algorithms such as DDPG and TD3 reach a cumulative power unbalance near 0.14 MW at the end of the test period.As a result of such high unbalances, an operating cost of 53.3% higher than the optimal global solution is also observed.In contrast, the proposed MIP-DQN algorithm achieves an operating cost of 94 $, i.e., 17.6% higher than the optimal solution.
To test the performance with a higher number of test days, Table 4 presents the average cumulative error (with respect to the solution obtained by solving the NLP formulation with perfect forecast), the average power unbalances, and total average computational time (over 30 test days) of the proposed MIP-DQN algorithm as well as other DRL algorithms.As can be seen, the proposed MIP-DQN algorithm has the lowest average error, 13.7%; while strictly meeting the power balance (and other) constraint.In contrast, algorithms such as PPO showed poor performance reaching an error of 52.4%.As expected, the total computational time required to execute the proposed MIP-DQN algorithm is higher than other DRL algorithms.This increase in the computational time is a result of the MIP formulation required to be solved in order to enforce the equality constraint (see (25)).Nevertheless, for this case, the proposed MIP-DQN algorithm can still be used for real-time operation as it only requires less than 20 s for execution.In this case, it is important to highlight that the computation time of the proposed MIP-DQN algorithm is impacted by the size of formulated MIP problem, which is only determined by the size of the used Q network (layers, units of each layer, etc.) and not by the size of the energy system (microgrid) considered.Previous research   has shown that (small) neural networks can generalize well in real environments [19,28], supporting the applicability of DRL models in real systems.

Dispatch decisions comparison
Until now, the general performance of the proposed MIP-DQN algorithm has been presented, highlighting its capability of strictly enforcing the power balance constraint, even in unseen operational days.Next, a comparison in terms of the scheduling of the DG units and the ESSs is introduced.To do this, Fig. 8 displays the output power of all the DG units, ESSs and the imported/exported power from the network for: the proposed MIP-DQN algorithm (Fig. 8𝑏), and the optimal solution obtained after solving the NLP formulation considering perfect forecast (Fig. 8𝑐).Notice in Fig. 8 that when the electricity price is high, and the net power is low, the proposed MIP-DQN algorithm dispatches the ESSs in charging mode, and a similar dispatch decision is observed in the optimal global solution.Notice also that, when compared with the optimal solution, the proposed MIP-DQN algorithm dispatched 3 ℎ DG during the peak hour, which can be considered a sub-optimal decision as the operating cost of such DG is higher than the others.This difference in this dispatch decision can be due to the estimated -function, which might not be good enough to represent the true action-value function.In this sense, as the proposed MIP-DQN algorithm chooses actions that maximize its -value estimation, the largest -value might not represent the best action for this specific state-action pair.Nevertheless, even in executing a sub-optimal decision, the proposed MIP-DQN algorithm is able to meet the power balance constraint, guaranteeing operational feasibility.Finally, although differences in the dispatch decisions made by the proposed MIP-DQN algorithm and the optimal solution can be observed, it is important to highlight that the optimal global solution is obtained considering the perfect forecast of the future generation and demand consumption for the next 24 h, while the proposed MIP-DQN algorithm provides dispatch decisions in an hourly basis, without knowledge of the future values of the stochastic variables.

Sensitivity analysis
To better understand the impact of hyperparameter  2 in the reward function in (13), Fig. 9 shows the average operating cost and power unbalance (during training) for the proposed MIP-DQN algorithm for  2 = 20, 50, 100.As can be seen in Fig. 9, and as expected, higher values of  2 accelerate the convergence of the proposed MIP-DQN algorithm to rapidly reduce power unbalance, while having no apparent impact on the convergence of the operating cost.On the other hand, lower values of  2 seem to accelerate the convergence of the operating cost leaving behind the convergence of the power unbalance.In general, for the test performed, it was observed that the proposed MIP-DQN algorithm could converge in less than 200 episodes.

Comparison with safe DDPG algorithm
A comparison with current safe DRL algorithms is also performed.In this case, the proposed MIP-DQN algorithm is compared with a Safe DDPG algorithm, as presented in [49].Fig. 10 shows the average reward (Fig. 10𝑎), operating cost (Fig. 10𝑏), and power unbalance (Fig. 10𝑐) for the two algorithms being compared.In this case, and as expected, both algorithms fail to enforce the power unbalance constraint strictly during training.At the beginning of the training stage, the Safe DDPG algorithm shows a lower operating cost and power unbalance, and higher reward, when compared to the MIP-DQN algorithm.This is mainly due to the trained linear safe layer of the Safe DDPG, which projects the exploration action to a safer one, while the MIP-DQN algorithm is free to explore the action space regardless of the feasibility of the decided action.Nevertheless, along with the training, the Safe DDPG algorithm fails to learn to reduce further or eliminate power unbalance, while our proposed MIP-DQN algorithm reduces the unbalance sharply.This behavior is mainly due to the reward shaping of the MIP-DQN algorithm, which can learn to avoid the penalty due to the power unbalance during the training.It is important to highlight that the performance of the Safe DDPG algorithm depends on the quality of the trained safe layer that project the original action of the DDPG algorithm to a feasible one.In this case, as the safe layer is a linear function, its generalization capabilities may not be enough to learn the complex nonlinear energy system dynamic.Thus, even after projection, the action cannot fully meet the power unbalance constraint.Moreover, as the safe layer modified the action during exploration, it also harms the performance of the trained RL algorithm as shown in Fig. 10.Compared to the Safe DDPG algorithm, the proposed MIP-DQN algorithm learns to eliminate the unbalance in a small value after training and guarantees the feasibility during the execution (Fig. 7).

Larger case study
To test the performance of the proposed MIP-DQN algorithm on an energy system with multiple ESSs, an environment with three ESSs and three DG generators is designed.For this new environment, Fig. 11 shows the average operating cost and power unbalance of the proposed MIP-DQN algorithm as well as other state-of-the-art DRL algorithms, during the training process.As can be seen in Fig. 11, the operating cost and power unbalance are significantly reduced.In this case, all tested DRL algorithms converged at around 400 episodes.The power unbalances (presented by the average with 95% confident interval) of the DDPG, SAC, PPO and TD3 algorithms are 97±125 kW, 533±208 kW, 45 ± 19 kW, 462 ± 98 kW, respectively.In contrast, a power unbalance of 17 ± 22 kW was observed for the proposed MIP-DQN algorithm.Similar to the results presented in Section 5.3 for the smaller case study (see Fig. 6), none of the tested DRL algorithms can strictly enforce the power balance during training.Most of the observed power balance violations happen during peak load days, consistent with previous results [31].Nevertheless, the proposed MIP-DQN algorithm is able to   enforce power unbalance during the online execution, even on peak load days, as shown next.Additionally, compared to the result of simulations in Section 5.3, no performance degeneration is observed, proving the scalability of the proposed MIP-DQN algorithm.
Fig. 12 shows the scheduling decisions from the MIP-DQN algorithm for all three ESSs (Fig. 12𝑏) and DG generators (Fig. 12𝑐), and corresponding SOC changes (Fig. 12𝑑) in a typical day with extreme peak load.Notice that the power balance is strictly enforced during the peak load day.For instance, at 19 h, the load is extremely high, and the MIP-DQN algorithm dispatches all the ESSs in discharging mode.This avoided importing electricity from the main grid as the electricity price was high at that particular time.These results showed that the proposed MIP-DQN algorithm learned to schedule feasible decisions for multiple ESSs in extreme peak situations.Notice also that, at hours 3 and 4, the proposed MIP-DQN algorithm dispatches the 2 ℎ DG, instead of fully using the 1 ℎ DG, which can be considered as a sub-optimal decision because the operating cost of 2 ℎ DG is higher than that of 1 ℎ DG.A similar result was observed in Fig. 8. Nevertheless, even in executing a sub-optimal decision, the proposed MIP-DQN algorithm is able to meet the power balance constraint, guaranteeing operational feasibility.Thus, the proposed MIP-DQN algorithm can provide feasible dispatch decisions hourly for multiple ESSs, displaying prominent scalability features.

Discussion
The penetration of renewable-based DERs energies significantly increases the uncertainty and complexity of the operation of energy systems.Existing model-based approaches may not perform well when defining the operational schedule of DERs in real time due to their poor accuracy and high computational time requirements.Due to this, current efforts are put into leveraging RL algorithms' model-free and data-driven nature.After offline training, RL algorithms can provide near-optimal solutions in real-time.Nevertheless, the most critical challenge to enabling RL algorithms deployment in real energy systems scheduling frameworks is their lack of constraint enforcing guarantee.Even though several safe RL algorithms have tackled this problem, these approaches fail to meet the required security levels of energy systems operation [50].In general, model-based optimization approaches can guarantee the feasibility of the defined DERs schedule by setting hard constraints in the mathematical formulation, which is impossible to do in current RL algorithms.
To overcome the problem mentioned above, inspired by recent advances in deep learning and optimization research areas, we first bring constraint enforcement in RL algorithms combining deep learning and optimization theory.We developed a DRL algorithm, namely MIP-DQN, that can theoretically guarantee the feasibility of the decided solution and get the optimal solution during the online scheduling stage.To do this, we redesigned the training and online-scheduling procedure.The proposed MIP-DQN algorithm uses a trained -network to approximate the state-action values function.Exploration and exploitation are executed based on a trained policy network to update the Q-network parameters.After training, the -network is assumed to approximate the optimal -values.Then, the trained -network is extracted and formulated as MIP formulation, which can be used to impose hard constraints in the action space, ensuring the feasibility of the defined schedule.In this case, the power balance constraint is used as an example to show the effectiveness of the proposed approach.Results showed that MIP-DQN strictly meets the power balance constraint, showing a lower error when compared with other DRL algorithms and the optimal global solution.
The essence of the proposed MIP-DQN algorithm is using a trained -network as a surrogate function for the optimal operational decisions.As above-mentioned, the optimality is defined by the -network modeled as a MIP formulation.Thus, the approximation quality of the  network determines the proposed algorithm's performance.In Fig. 8, we showed that the proposed MIP-DQN could be considered a good quality operational schedule, albeit sub-optimal.Thus, efforts to reduce the error when compared with the optimal global solution must be centered on increasing the quality of the approximation of the Q-values via the used deep neural network.Additionally, the proposed MIP-DQN algorithm still needs to integrate a penalty term into the reward function to explore the right direction during the training process.This introduces extra hyperparameters that also impact the approximation performance of the obtained -function.An alternative exploration approach that can be used is to model the DNN as a MIP formulation in each iteration step; nevertheless, this would imply higher training time.

Conclusion
This paper proposed a value-based DRL algorithm, namely MIP-DQN, to define the optimal dispatch decisions of multiple distributed energy resources within a renewable-based energy system.The proposed DRL algorithm was developed for continuous action (and state) spaces with the main feature of strictly enforcing all operational constraints in the action space during online execution, ensuring the feasibility of the defined schedule.This is done by re-formulating the deep neural network (DNN), used to approximate the action-value Qfunction, as a mixed-integer programming (MIP) formulation enabling to further consider any action space constraint.Results showed that the proposed MIP-DQN algorithm obtained near-optimal solutions, with an error of 13.7% when compared with the optimal solution obtained with a perfect forecast of the stochastic variables.A comparison with other DRL algorithms was also presented, observing higher errors than the proposed algorithm while failing to meet the power balance constraint on unseen test days.Future work directions include implementing plug-and-play features, considering DERs' uncertain availability.

Fig. 1 .
Fig. 1.Illustration of the considered energy system structure composed of various DERs, such as solar photovoltaic (PV), ESSs, DGs, and loads.

Fig. 2 .
Fig. 2. Layer structure of the DNN used to approximate the action-value function (, ).We denoted this DNN model as   (, ) in Algorithm 1.

Fig. 5 .
Fig. 5. Mean and standard deviation of the demand consumption and PV generation.

Fig. 6 .
Fig. 6.Mean and 95% confident interval for the reward, operating cost and power unbalance for the developed MIP-DQN algorithm, as well as for other DRL algorithms, during training.As expected, none of these DRL algorithms are able to enforce the power balance constraint.

Fig. 7 .
Fig. 7. Cumulative costs and power unbalance for 10 days in the test set.The proposed MIP-DQN algorithm is able to strictly meet the power balance constraint while other DRL algorithms fail to do so.

Fig. 8 .
Fig. 8. Operational schedule of all DG units and ESSs defined by the proposed MIP-DQN algorithm and the optimal global solution obtained by solving the NLP formulation considering perfect forecast.

Fig. 9 .
Fig. 9. Average reward, operating cost, and power unbalance of the proposed MIP-DQN algorithm for different values of  2 .

Fig. 10 .
Fig. 10.Mean and 95% confident interval for the reward, operating cost and power unbalance for the developed MIP-DQN and Safe DDPG algorithms.

Fig. 11 .
Fig. 11.Mean and 95% confident interval for the operating cost and power unbalance for the developed MIP-DQN algorithm, as well as for other DRL algorithms, during training.

Fig. 12 .
Fig. 12. Operational schedule of all ESSs and DG units defined by the proposed MIP-DQN algorithm for a larger case study composed of three ESSs and three DG units.
,  target ,  Parameters for the DNN's   ,   target and     ,   , Under this value definition, parameters  can be obtained minimizing a loss function over mini-batches  of past data { (  ,   ,   ,  +1 ) } || =1.In this case, the loss definition used to train the DQN is based on the mean squared Bellman error, defined as2

Algorithm 1 :
Training procedure for MIP-DQN Define the maximum training epochs  , episode length .Initialize parameters of functions   ,   target , and   ; Initialize reply buffer . ; for  = 1 to  do Sample an initial state  0 from the initial distribution for  = 1 to  do Sample an action with exploration noise   ∼   (  ) + ,  ∼  (0, ) and observe reward   and new state  +1 .;

Table 3
Parameters for DRL algorithms.

Table 4
Performance comparison of different DRL algorithms in a new test set of 30 days.