Prescribing net demand for two-stage electricity generation scheduling

We consider a two-stage generation scheduling problem comprising a forward dispatch and a real-time re-dispatch. The former must be conducted facing an uncertain net demand that includes non-dispatchable electricity consumption and renewable power generation. The latter copes with the plausible deviations with respect to the forward schedule by making use of balancing power during the actual operation of the system. Standard industry practice deals with the uncertain net demand in the forward stage by replacing it with a good estimate of its conditional expectation (usually referred to as a point forecast), so as to minimize the need for balancing power in real time. However, it is well known that the cost structure of a power system is highly asymmetric and dependent on its operating point, with the result that minimizing the amount of power imbalances is not necessarily aligned with minimizing operating costs. In this paper, we propose a bilevel program to construct, from the available historical data, a prescription of the net demand that does account for the power system's cost asymmetry. Furthermore, to accommodate the strong dependence of this cost on the power system's operating point, we use clustering to tailor the proposed prescription to the foreseen net-demand regime. By way of an illustrative example and a more realistic case study based on the European power system, we show that our approach leads to substantial cost savings compared to the customary way of doing.


Introduction
Many decision-making processes under uncertainty can be modeled by optimization problems where some of the input parameters are not perfectly known. The field of Optimization under Uncertainty focuses on developing tools to tackle these problems depending on the knowledge of those parameters that the decision maker actually has. For example, if these parameters can be modeled reasonably well as random variables following certain probability distributions, then the decision maker should probably resort to stochastic programming techniques (Birge and Louveaux, 2011). In contrast, if all the decision maker knows about said parameters is their range of variation or support, then she should rather opt for robust optimization methods instead (Ben-Tal et al., 2009).
In the realm of power system operations, there is a vast literature on operations research methods, models, and algorithms for power dispatch that rely on stochastic programming or robust optimization or hybrids of both.
The richness of this literature makes it materially impossible and pointless to embrace it all in this paper. Instead, we refer the reader to monograph Morales et al. (2013) and references therein for examples of power systems operating problems based on stochastic programming, to the seminal work Bertsimas et al. (2012) on the application of robust optimization for unit commitment and generation scheduling, and to the recent contribution Dvorkin (2019) on a distributionally robust chance-constrained electricity market.
Despite the firm and promising advances in Optimization under Uncertainty, still one of the most widely extended practices in decision making is to replace the unknown parameter with a sensible value or estimate, some sort of "the most likely value" that the parameter can take on. A natural candidate to play that role is the expected value of the parameter. Thus, the decision maker can, in addition, exploit all the powerful tools that the disciplines of Statistics, Forecasting and Machine Learning have developed for decades to estimate that expected value conditional on all the information the decision maker has available at the moment the decision must be made.
The adherence of the power sector to this strategy is particularly notorious, essentially because it is argued to be simpler, more transparent, computationally cheaper and easily accepted by the different stakeholders (see, e.g., Morales et al. (2014); Wang and Hobbs (2015); Morales and Pineda (2017); Kazempour et al. (2018) for further details on this issue). However, operations researchers have repeatedly shown that this strategy results in suboptimal decisions in general, because the conditional expected value of the parameter ignores the impact of the parameter's uncertainty on the decision's value (Birge and Louveaux, 2011).
Against this background, research efforts have been recently placed on finding a compromise solution. Intuitively, the idea is still to replace the uncertain parameter but with a point estimate (generally different from the parameter's conditional expectation) that is purposely computed to result in the optimal or a nearly optimal decision in view of the parameter's uncertainty. This alternative point estimate is usually called a prescription and the term smart predict has been recently coined to refer to this midway solution strategy. In this line, we find a number of research works, e.g., Donti et al. (2017); Muñoz et al. (2022); Elmachtoub and Grigas (2022); El Balghiti et al.
(2019). In particular, the authors in Donti et al. (2017) propose a heuristic gradient-based procedure to produce estimates of uncertain parameters in optimization problems based on the objective of the task for which these estimates will be used. In Muñoz et al. (2022), instead, they introduce a bilevel programming framework to the same end. In Elmachtoub and Grigas (2022), they deal with linear programs with an uncertain cost vector, for which they develop a tailored convex loss function to compute an estimate of the cost vector that accounts for the underlying linear optimization problem. Finally, the work in El Balghiti et al. (2019) is a continuation of that in Elmachtoub and Grigas (2022) (first released as an arXiv preprint in 2017), where the authors provide bounds on how well a certain method to predict the cost vector from training data generalizes out of sample.
In the field of power systems, it has also been shown that, by smartly tuning the input parameters of current operational and procedures, these can mimic the performance of their stochastic-programming-based counterparts to a large extent. For instance, in Morales et al. (2014), they propose a bilevel programming model to compute the amount of (uncertain) renewable power generation that must be considered in a forward electricity market to maximize the short-run market efficiency. In the same vein, the authors in Dvorkin et al. (2018) show that, by means of bilevel programming too, the reserve requirements in an European-style two-stage electricity market can be set so that the market can be almost as cost-efficient as the ideal two-stage electricity market run by a full stochastic programming approach.
Our proposal is in the same spirit, but we make use of bilevel programming not to directly prescribe renewable production or reserve capacity, but to determine straightforward rules to improve the value of the (net-demand) forecasts currently employed by system operators. In any case, the works of Morales et al. (2014) and Dvorkin et al. (2018) reveal that the power sector can highly benefit from the aforementioned smart-predict strategy.
Actually, in Donti et al. (2017), they apply it to power generation and gridscale electricity battery operation, and in Muñoz et al. (2022) to the offering problem of a thermal power producer competing strategically in an electricity market. In Carriere and Kariniotakis (2019) and Muñoz et al. (2020), they focus instead on renewable energy producers, for which they propose different smart-predict strategies for energy trading. Lastly, the authors in Garcia et al. (2021) use a bilevel programming framework similar to that proposed in Muñoz et al. (2022) whereby they train several autoregressive models to estimate the uncertain demand and the size of the energy reserves in a joint reserve allocation and energy dispatch problem.
Within this context, and given a two-stage power scheduling problem, our contributions are the following: • We propose a bilevel program to learn the value of the net demand (i.e., that obtained by subtracting the weather-driven renewable power production from the non-dispatchable power load) that the scheduling problem must consider so that the power system operating costs are minimized in expectation. This value (that is, the prescription) is, in general, different from the conditional mean of the net demand that is currently employed in standard industry practice.
• We show that our method can be easily used to upgrade any given point forecast of the net demand into a prescription that accounts for the power system's cost asymmetry.
• Motivated by the fact that a power system usually features distinct net-demand regimes, we introduce a data clustering and partitioning strategy that, on the one hand, increases the performance of the prescription of the net demand that our approach produces (by making it dependent on the foreseen net-demand regime), and, on the other, substantially decreases the computational effort to solve the associated bilevel estimation problem.
• We evaluate the economic benefits that our approach achieves through an out-of-sample test on a stylized version of the European power system that makes use of real data, in particular, of actual and dayahead predicted net-demand values downloaded from the ENTSO-e Transaparency Platform (ENTSO-E Transparency Platform, 2020).
The rest of this paper is organized as follows. Section 2 describes the two-stage power generation scheduling problem we consider throughout our paper, motivates the ultimate goal of our work, and formulates the generic bilevel program we use to construct, from the available historical data, prescriptions of the net demand intended to minimize the expected system operation costs. In Section 3, we resort to various simplifying assumptions to guarantee that the globally optimal solution of the bilevel program can be obtained by solving a mixed-integer linear program (MILP). The potential of the net-demand prescriptions provided by this MILP is then illustrated, discussed and justified using a small power system in Section 4. In Section 5, we introduce a procedure based on data clustering and partitioning to enhance the value of our prescription and to speed up the solution of the mixed-integer prescription problem. A case study based on the European power system is used in Section 6 to investigate the benefits of our approach on real data. Finally, Section 7 concludes the paper with some final remarks.

Problem description
We consider a two-stage generation scheduling problem consisting of a forward power dispatch and a real-time balancing phase. The forward dispatch is decided some time prior to the actual delivery of energy, for instance, from 15 minutes to 36 hours in advance. The forward dispatch is required to decide the output levels of inflexible generating units (e.g. nuclear-based power plants) that need advance planning due to technical limits such as ramping constraints or minimum times. The real-time balancing phase processes the energy imbalances with respect to the forward production schedule. The real-time balancing rescheduling aims at adjusting the output of flexible generating units (such as gas-based power plants) that can quickly deviate from the forward schedule to ensure the equilibrium between electricity production and consumption. More details about the two-stage generation scheduling problem used in this paper can be found in Morales et al. (2013) In our framework, the forward stage first determines the power dispatch that minimizes the anticipated electricity production costs by solving where z F denotes the scheduling decisions (such as the dispatch of thermal generating units and the consumption of dispatchable loads) andŷ represents a point or single-value estimate of the uncertain parameters (in our case, the net demand, given as the difference between the non-dispatchable load and the weather-driven renewable power generation). Objective function (1a) determines the forward system operating costs, while the generic constraints (1b) and (1c) enforce the technical limitations of the generating units and the network, respectively.
Since the net demand is uncertain, the scheduling decisions that results from the forward problem (1) are to be adjusted during the real-time operation of the power system to accommodate the eventual net-demand value that is realized. This adjustment is conducted through the following generic real-time balancing problem: where z B symbolizes the adjustment variables in relation to the forward decisions z F -here treated as a known and fixed input vector coming from (1)and y represents the eventually realized value of the uncertain parameters.
The objective function (2a) includes both the forward dispatch cost (which, at this point, is known and constant and, therefore, could be removed from the minimization) and the imbalance cost caused by unexpected net-load variations that occur very close to the real-time operation of the system.
Similarly to problem (1), the constraints (2b) and (2c) enforce the technical limitations of the generating units and the network, respectively. As customary, the objective function and constraints of models (1) and (2) are well known and properly defined. Morales et al. (2014) show that the valueŷ of the net demand that is used in the forward scheduling problem (1) may have a major impact on the subsequent balancing costs (2a). Current practice, however, is content with a simple and direct solution, which is to takeŷ as an estimate of the expectation of the net demand y conditional on all the information at the forecaster's disposal. This information is generally known as context. Yet, this expectation is oblivious to the minimization of the balancing costs that drives the real-time balancing problem (2) and therefore, may turn out to be highly suboptimal. In other words, the conditional expectation overlooks the typical asymmetries affecting the minimization of the balancing costs, such as the fact that the distribution of the renewable generation is often skewed and that the costs of supplying upward and downward balancing energy are usually different.
Instead of employing the conditional expectation of the uncertain parameters y as the estimateŷ used in the forward scheduling problem (1), we propose a regression procedure that provides an alternative value forŷ that explicitly accounts for the potential impact of the uncertain parameters on the subsequent real-time balancing problem (2). This alternative value is what is called a prescription and a procedure to obtain it runs as follows. Suppose we have a sample of N data points expressed in the a vector of features or covariates making up the context and y ∈ R m is a vector of uncertain parameters. Our objective is to utilize said sample to infer a functional relationŷ = h(x) with h : R p → R m such that, given the context x, the provided prescriptionŷ is trained to deliver the minimum total system costs in expectation when inserted into the forward scheduling problem (1). Now suppose that the prescriptive function h is parameterized on a coefficient vector q of appropriate dimension, we propose to train h q (·), i.e., to estimate q, by way of the following bilevel optimization problem: Essentially, the bilevel program (3) seeks the q-parameterized prescriptive function h q (·) that minimizes the empirical expectation of the total dispatch Depending on the nature of variables z F , z B and functions f obj , f gen , f net , g obj , g gen , g net , and h q , the difficulty of solving the bilevel optimization problem (3) can vary significantly. For instance, if the forward scheduling decisions z F include binary variables representing the on/off status of thermal generating units, then computing the global optimal solution of the bilevel problem is tremendously challenging as discussed in Fanghänel and Dempe (2009). Likewise, even in the case that all scheduling decisions z F are continuous, the lower-level problem (3d)-(3f) can be non-convex if constraints g net account for the nonlinear AC power flow equations. In that case too, computing the global optimal solution of (3) is also a very challenging task.
If all variables are continuous and all functions affine, then the bilevel optimization problem (3) can be solved by replacing the lower-level problem with its equivalent KKT optimality conditions (Muñoz et al., 2022). Said strategy, however, requires additional binary variables and large enough constants that increase the numerical instability and the computational burden of the resulting single-level optimization problem. Besides, as discussed in Pineda and Morales (2019) and Kleinert et al. (2020), existing methods to validate these upper bounds may fail and lead to suboptimal solutions.
Given that the development of methods to solve a generic bilevel program like (3) is outside the scope of this paper, in the next section we make use of some simplifying assumptions that allow us to efficiently solve the bilevel problem (3) to global optimality using commercially available software for mixed-integer programming. In practice, we may not need to solve the estimation problem (3) to global optimality. In effect, finding a good feasible solution to this problem may be enough to reap the bulk of the benefits of our approach. This enables the use of heuristics to solve problem (3).
We also remark that some system operators are indeed aware of the importance of scheduling generation taking into account the subsequent real-time operation of the power system. For example, the California Independent System Operator (CAISO) manually modifies the load forecast, anticipating the ramping capacity needs based on the operator's experience (Yurdakul et al., 2021). This is explicitly explained in California Independent System Operator (2020) as follows: "Operators in the ISO and energy imbalance market can manually modify load forecasts used in the market through load adjustments (...) to increase ramping capacity within the ISO by (...) committing additional units." Our approach is not only consistent with this modus operandi, but also supports it with a scientifically grounded methodology to modify the load forecast.
Once the bilevel problem (3) is solved and the prescriptive function h q is obtained, the forward dispatch for an unseen value of the context x can be easily determined by solving the following deterministic optimization problem: It is important to realize that the strategy we propose is completely in contrast with those other approaches that make use of stochastic programming to simultaneously decide the forward scheduling decisions and the real-

Contextual merit-order dispatch
For the remainder of this article, we assume the following.
• Forward dispatch decisions and real-time balancing actions are all modeled by continuous variables, that is, on/off binary variables are not considered.
• The inter-temporal constraints of power production portfolios, such as ramping limits and minimum-up and -down times, are not explicitly accounted for in the generation scheduling problem.
• Network constraints are only taken care of in the real-time stage using a pipeline representation of the transmission network. That is, constraints (3f) are removed from the bilevel optimization problem (3).
• All power loads are assumed to be non-dispatchable and therefore, the objective function of the two-stage scheduling problem boils down to the minimization of the generating cost of dispatchable units.
With these assumptions in place, the forward scheduling problem (1) is simplified to a merit-order dispatch whereby generators are scheduled based on their marginal production costs, as explained later. Furthermore, with the pipeline representation of the transmission network, our setup becomes more aligned with European practices, in keeping with the case study we present in Section 6. Likewise, we consider non-dispatchable loads only just to simplify the subsequent exposition of our approach. Finally, by way of the rest of the assumptions, the bilevel model (3)  All in all, the particular forward generation scheduling problem we consider in this paper is given as follows: where p g , P g , C g ∈ R + and G ⊆ N is the set of generation units (it can also represent generation blocks with different cost). Each unit g has associated a production level p g and a marginal cost C g . Equation (5b) enforces the aggregate power balance, with parameter L ∈ R + representing a point or single-value estimate of the total net demand L ∈ R + in the system, which is unknown at the moment the forward scheduling is determined and thus, is to be treated as a random variable. Lastly, equation (5c) sets the capacity of each generation unit.
The linear program (5) stands for an economic dispatch problem whereby generation units are dispatched following a cost-merit order, meaning that units g with a lower cost C g are dispatched first. This is illustrated in Figure 1 for four units with capacities P 1 , P 2 , P 3 , P 4 and marginal costs C 1 , C 2 , C 3 , C 4 , respectively. The vertical solid line represents the total net demand L and the shadow area indicates the optimal dispatch computed by model (5). Since units 1 and 2 are the cheapest ones, their dispatch is set at their maximum capacity. Unit 3 is partially dispatched until the demand is fulfilled. Unit 4 is the most expensive one and therefore, is left out of the demand supply. To ease the discussion that follows, hereinafter we consider that the units in the set G are ordered such that g < g ′ if and only if C g < C g ′ . Hence, if we denote the optimal solution to (5) as {p * g } g∈G , it holds that p * g ′ > 0 implies that p * g = P g , whenever g < g ′ . Since the system net demand is uncertain, the power dispatch that results from the forward problem (5) is to be adjusted during the real-time operation of the power system to satisfy the actual net demand. The aim of the real-time balancing problem is to correct the imbalance of the system in a cost-efficient manner. To this end, we consider a pipeline model where G(b) and D(b) represent the set of generation units and loads that are connected to node b, in that order. With some abuse of notation, let (L di ∈ R) d∈D(b) be a certain realization i ∈ N of the net load d ∈ D(b) connected to node b of the system. We also define o(l) and e(l) as the origin and ending nodes of line l, respectively. Thus, {l : o(l) = b} and {l : e(l) = b} represent the subset of lines that start or end at node b, in that order. Once introduced this notation, the real-time balancing problem under consideration renders where Ξ := {r u g , r d g ∈ R + , g ∈ G, f l ∈ R, l ∈ Λ} is the set of decision variables and p * g , R u g , R d g , C u g , C d g , P g , F l ∈ R + and L di ∈ R are known parameters. The power output of each flexible unit g may be increased by an amount r u g , based on the marginal cost for upward balancing energy C u g , or decreased by an amount r d g of downward balancing energy, which entails a marginal benefit (linked to fuel-cost savings) of C d g . These actions are driven by the nodal power balance equation (6e) and the minimization of the total balancing costs (6a). Naturally, the amount of balancing energy provided from each generation unit g, either upward or downward, must be such that the eventual power output from that unit (taking into account the forward optimal schedule p * g ) is positive and lower than its capacity P g , as stated in equation (6b). Moreover, constraints (6c) and (6d) limit the amount of up-and down-balancing power that can be deployed from each generation unit to R u g and R d g , which are indicative of how flexible the underlying asset actually is. Finally, line capacity limits are imposed by (6f), with f l being the power flow through line l and F l the capacity of the line. Now suppose we have a sample of N data points expressed in the form vector of features or covariates making up the context and L ∈ R + is the random net system demand. As stated in the previous section, our objective is to utilize said sample to infer a functional relation L = h(x), h : R p → R + , such that, given the context x, the provided prescription L is trained to deliver the minimum total system costs in expectation when inserted into the power balance equation (5b). For simplicity, and because it proves to perform very satisfactorily in the numerical experiments of Section 6, we restrict h to the family of affine linear functions, i.e., h(x) = q ⊤ x, with q ∈ R p and with one of the features, say x 1 , fixed to one. This selection, together with the particular structure of problems (5) and (6) circumvent the need for applying the conventional procedures to solve the bilevel problem (3).
Instead, to determine q, we solve the following empirical risk minimization problem: where Υ := {p gi , r u gi , r d gi , u gi , f li } {i,g,l} . Intuitively, the estimation problem (7) computes the coefficient vector q = (q 1 , . . . , q p ) such that the total system cost averaged over the sample is minimized. This is the reason why all the decision variables related to the power dispatch and the provision of regulating power, i.e., p gi , r u gi , r d gi , appear augmented with the sample index i in (7). Constraints (7b)-(7g) serve exactly the same purpose as their analogs in (5) and (6). Equation (7h) expresses the prescription L i of the net system demand L under context x i as an affine function of the features, whose coefficients are to be computed by solving (7). Finally, the set of constraints (7i)-(7l) guarantee that the power dispatch {p gi } g∈G coming from (7) for each sample i is optimal in the forward problem (5), that is, these constraints constitute the optimality conditions of the forward dispatch problem (5). Accordingly, these constraints enforce, for each sample i, the merit-order dispatch of the generation units {p gi } g∈G , forcing that p g ′ i > 0 =⇒ p gi = P g , for all g ∈ G : g < g ′ . Importantly, this formulation of the optimality conditions of problem (5) neither necessitates dual variables, nor large enough constants, both of which are frequently used to solve lineal bilevel programs (Pineda and Morales, 2019).
Problem (7) is, therefore, a mixed-integer linear program due to the binary character of variables u gi , which are used to impose the cost-merit order. As such, this problem can be solved using commercially available solvers such as CPLEX (IBM ILOG CPLEX Optimization Studio, 2020).
Once we obtain the optimal coefficient vector q * , we can produce the netdemand prescription L = (q * ) ⊤ x, which is to be fed into (5b) under the context x to readily obtain the forward dispatch decisions. Now, let L F denote the expected net-demand value (which is typically known as a point prediction of the net demand). Even though this expected value has been and is normally used as L in the forward problem (5), it is not consistent with the plausible asymmetry in the cost of dealing with the subsequent prediction errors through the real-time problem (6). Indeed, it is most often the case that the cost of increasing the electricity production in real time is different from that of diminishing it. In this line, problem (7) offers a handy way to construct a new estimate L to be used in (5) that takes into account the referred cost asymmetry. Indeed, the training problem (7) can be used in practice to upgrade the point prediction L F to a prescription for L, which does account for the asymmetry of the power system's balancing costs. For this, it suffices to include L F as one of the features that are part of the context x. This is what we do in the following example and in the case study of Section 6. That said, the context x could include any other information that could be deemed useful to improve the prescription L, for instance, the estimated standard deviation or quantiles of the conditional net demand, if these were available through a probabilistic forecast.
We finish this discussion with an important remark. Despite the fact that constraints (7i)-(7l) turn our training problem (7) into a mixed-integer program, enforcing the cost-merit order through these constraints is critical to train an affine model h(x) = q ⊤ x that renders economic benefits within the two-stage scheduling problem described in Section 2. Furthermore, even though the training problem (7) may require some computational effort to be solved, the affine model it delivers is intended to remain effective for a period of time (e.g., weeks or months), and hence, the task of solving the mixed-integer program (7) only has to be undertaken once in a while. That said, the parameter vector q is to be re-estimated offline on a regular basis to capture changes in the dynamics of the electricity market and the power system (for example, to account for seasonal variations).

Example
Consider the small three-bus system depicted in Figure 2, which is composed of one random demand L at bus 3, two thermal generators, G 1 and G 2 , at buses 1 and 2, respectively; and two lines, Line 1 and Line 2, connecting nodes 1 and 3, and buses 2 and 3, in that order.   The technical and economic characteristics of generating units G 1 and G 2 are collated in Table 1. Note that, in comparison, unit G 1 is smaller and cheaper than G 2 . In contrast, the latter is significantly more flexible as it features re-dispatch costs, i.e., C u g and C d g , that are much more competitive. We remark that C d 1 = −20 e/MWh implies that this power unit must be paid 20 e for each MWh its production is decreased in the real-time stage.
Unless stated otherwise, line capacities are assumed infinite.
The only demand in the system, namely, L, is random. Suppose we represents a classical point prediction of the demand L built out of whichever available information the forecaster had at her disposal to produce it by way of whatever machine learning or forecasting technique she could have developed to that end. We stress that this setup is very common in reality, where power system operators often count on specialized software to produce good point predictions L F i . Our objective is to use our methodology and training model (7) described in Section 3 to recycle this standard point prediction with the aim of fabricating a better value for L in equation (5b).

For this small example, we generate samples in the form
as follows. We consider that the per-unit (p.u.) point forecast of the net demand L follows a uniform distribution between a and b. Therefore, L F ∼ L · U (a, b), where L is a factor representing the maximum power load at bus 3. We further assume that the per-unit net demand itself L/L follows a Beta distribution with mean equal to L F /L and standard deviation σ.
Hence, L ∼ L · Beta(α, β), where the scale and shape parameters α and β are related to the mean L F /L and the standard deviation σ as follows: Equations (8) guarantee that the point prediction L F is an unbiased estimator of the conditional net demand L. In this illustrative example we fix σ = 0.075 p.u. and generate 20 samples . Given L F i /L and σ, and provided that the system of nonlinear equations (8) has a solution (notice that α, β > 0), parameters α i and β i can be computed as Each L i is then randomly taken from L · Beta(α i , β i ). We take the first 500 data points of each sample as the training set and the last 250 as the test set.
We postulate the affine model L = q 0 + q 1 L F and solve problem (7) on the training set to determine coefficients q 0 and q 1 . Finally, to evaluate the performance of the affine model, for each data point (x i , L i ) in the test set, we simulate the sequential forward and real-time problems (5) and (6), with L = q 0 + q 1 L F i in (5b), and L i in (6e). We then compute the sum of the forward and real-time production costs averaged over the 250 data points in the test set. This mean sum is further averaged over the 20 samples we generate. Our approach, which uses a prescription of the system net demand for scheduling generation, is referred to as P-SC (from prescriptive scheduling). We compare it with the customary practice of directly using the point forecast L F i as L in (5b), which is referred to as F-SC (from forecast scheduling). Notice that our approach boils down to the conventional one if q 0 = 0 and q 1 = 1. Finally, the relative cost difference between these approaches is denoted as ∆cost.
In the results we discuss next, we set a base case with a = 0.03, b = 0.97, L = 100 MW, and the technical and economic parameters of the three-bus system described above 1 . We then define variants of this case by changing one or some of those parameters. Table 2 provides the cost savings that our approach achieves with respect to the conventional one under different G 2 's power regulation costs.

Impact of power regulation costs
For completeness, this table also includes the average cost of these two approaches for the test set and the values of the intercept q 0 and the linear coefficient q 1 of the affine model for L our approach utilizes. These values represent expectations over the test data points of the 20 samples generated as indicated above. The first row in the table corresponds to the base case.
Interestingly, our approach systematically corrects the point forecast of the net demand L downwards, with a linear coefficient q 1 which is, on av-erage, lower than or equal to 1, and a negative intercept q 0 in expectation. This is so because it is economically advantageous for the system to cope with positive net demand errors (i.e., eventual demand increases) by deploying upward balancing energy from unit G 2 . Indeed, the alternative would be to deal with negative demand errors by down-regulating (i.e., by providing downward balancing energy) with unit G 1 , a recourse that is clearly much more expensive.
To further elaborate on this phenomenon, the second row in Table 2 provides results for a variant of the base case in which C u 2 has been decreased from 20 to 15 e/MWh. Now that up-regulating (i.e., providing upward balancing energy) through unit G 2 is even cheaper, the downward correction of our approach to the net demand point forecast is more pronounced and the associated cost savings due to said phenomenon become larger. On the contrary, if it is the provision of downward balancing energy by G 2 what becomes 5 e/MWh cheaper and, hence, free (see third row of Table 2), the net demand point forecast is barely corrected and the costs savings brought by our approach (with regard to F-SC) become smaller as a result. Note that correcting the point forecast upwards in this case (in an attempt to profit from the free downward balancing energy provided by G 2 ) would be counterproductive in reality, as the system may risk having to resort to the high-cost downward balancing energy of unit G 1 in those likely scenarios in which the net demand ends up being lower than the capacity of this unit.
At this point, it may be instructive to see what happens when we drop constraints (7i)-(7l) from the mixed-integer program through which we train the affine model L = q 0 + q 1 L F . This is indeed very tempting, because, if these constraints are removed, the training model (7) becomes a very pleasant linear program, similar to the stochastic-programming-based for-  (7) playing the role of the "scenarios" in Pritchard et al. (2010)).
However, these constraints ensure the optimality of the lower-level problem (3d)-(3f) and guarantee that the above affine model is learned following a cost-merit-order principle. Therefore, if these constraints are dropped from (7), the affine model L = q 0 + q 1 L F is not trained for the target task. This is exactly what Table 3 shows. This table is analogous to Table 2, but for a linear training model made up of constraints (7a)-(7h) only. We denote this approach as L-SC from "Linear"). The training model L-SC ignores the merit order and hence, takes for granted that the system can benefit from the cheap downward balancing energy of unit G 2 by allocating a non-zero production to this unit in the forward problem regardless of whether unit G 1 has been fully dispatched or not. This is, however, an strategy forbidden by the problem, which explains the poor actual performance of L-SC. This phenomenon is especially notorious for the case C d 2 = 15 e/MWh, in which the demand is heavily overestimated (the mean of the estimated demand is increased by 45%) and only the free downward regulation from unit G 2 is used.
Therefore, due to the catastrophic impact that removing the merit-order constraints (7i)-(7l) from the training model (7) may have on the actual performance of the obtained affine model, the strategy L-SC is no longer considered in the rest of our analysis.

Impact of grid congestion
Here we introduce a variant of the base case in which the capacity of Line 1 has been set to 30 MW. Recall that the capacity of this line in the base case is unlimited, which we denote by symbol "∞" in Table 4. The results collated in this new table are analogous to those in Table 2.
Recall that the estimation problem (7), whereby we determine the affine function L = q 0 + q 1 L F , explicitly accounts for network constraints. In contrast, the computation of the net-demand point forecast L F is typically based on statistical criteria alone and, consequently, ignores any possible limiting effect of the grid.
When the capacity of Line 1 is limited to 30 MW, our approach strongly corrects the point prediction L F downwards, so that L is kept in between 16 and 32 MW approximately. Thus, unit G 1 is dispatched well below the expected demand. This is clever because, in doing so, no (expensive) downward regulation from this unit has to be deployed in real time to comply with the limiting capacity of Line 1. In this way, the eventual realized demand at bus 3 can be satisfied, instead, with cheaper up-regulation from unit G 2 through Line 2. The ultimate result is that using L, given by our approach, in the forward problem (5) is way more profitable than using the raw point forecast L F .
Based on this analysis, if network constraints were accounted for at the day-ahead stage, that is, in the optimization problem (5), the cost savings achieved by our approach would decrease, at the expense of significantly complicating the resolution of the bilevel problem (3).

Impact of the peak demand
Now we change the peak demand and consider two variants of the base case in which we take L = 50 MW and L = 150 MW (in the base case, L = 100 MW). The results of this new analysis are compiled in Table 5.
Again, as in the analysis of the impact of G 2 's balancing energy costs in Section 4.1, our approach systematically corrects the net-demand point forecast downwards to reduce the usage of down-regulation from G 1 in favor of the up-regulation from G 2 . However, the cost savings achieved by our approach get diluted as the peak demand is augmented. The reason for this is twofold. First, the probability of events where the net demand takes on a value below the capacity of unit G 1 diminishes with growing L. when L = 50 MW, the probability that the net demand is smaller than the capacity of G 1 is equal to one, which explains why our method delivers the highest cost savings in this variant (from among the three cases considered in this analysis). In contrast, as L grows, that probability diminishes and the cheaper downward balancing generation from G 2 becomes more available.
Second, the balancing costs account for a lower percentage of the total costs as the peak demand L increases.

Impact of the net demand regime
We conclude this small example by studying how the net demand regime affects the prescriptive power of the affine function L = q 0 + q 1 L F that we determine by way of problem (7). To this end, we modify the support of the uniform distribution from which the per-unit net-demand point prediction is randomly drawn. Thus, we distinguish a low-demand regime, with L F ∼ L · U(0.03, 0.5), and a high-demand regime, with L F ∼ L · U(0.5, 0.97). We also consider the base case, where L F ∼ L · U(0.03, 0.97) and therefore, no demand regime is differentiated. The corresponding results are provided in Table 6.
In line with the observations in the previous analysis of the impact of the peak demand, under a low-demand regime, the expensive, but flexible unit G 2 is not dispatched in the forward problem. The downward correction to the net-demand point forecast our approach prescribes is then intended to benefit from the up-regulation provided by G 2 , which is clearly more competitive than the down-regulation offered by G 1 . The system features, therefore, a distinct cost asymmetry given by the expensive downward balancing generation of G 1 versus the cheap upward balancing generation of G 2 . Our approach sees this asymmetry and corrects the net-demand point forecast downwards accordingly. In addition, since the beta distribution modeling the point forecast error is right-skewed for low levels of demand, said correction leads to substantial cost savings. In contrast, under a highdemand regime, G 2 is very likely to participate in the forward dispatch, whereas there is a lower probability that G 1 be needed to provide downward balancing energy, since the distribution of the point forecast error is leftskewed. Consequently, the cost structure of the system looks very different under a high-demand regime, which prompts a quite different affine function and reduces the cost savings obtained from our method.
Most importantly, in the base case, when no net-demand regime is distinguished, most of the benefits our approach can potentially bring for low values of net demand are lost. This motivates us to cluster net-demand observations into different regimes and use optimization problem (7) to compute a possibly different affine model in the form L = q 0 + q 1 L F for each demand regime, similarly to segmented regression in classical statistics. This is formalized in the next section.  Table 6. Impact of the net-demand regime.

Data clustering and partitioning
Take N := {1, . . . , i, . . . , N }, that is, the index set of the data sample We partition N into a collection {N k } K k=1 of K subsets that are pairwise disjoint and whose union is equal to N . Consider the one-to-one mapping Therefore, N k := {i ∈ N : φ(i) = k}.
We compute K affine models of the form L = q ⊤ k x, k ≤ K, by solving the estimation problem (7) for each subset sample N k . In practice, this means replacing N and N in (7) with |N k | and N k , respectively.
To construct a meaningful mapping φ, we employ the K-means algorithm that is implemented in the Python package scikit-learn (Pedregosa et al., 2011), using the Euclidean distance. We note that, to construct φ, this algorithm receives the feature sample {x} i∈N as input. Once the mapping φ is computed, it is used to determine the cluster to which a new feature vector x belongs. That cluster is the one that minimizes the distance between its centroid and the feature vector x. That is, given a new observation of x, say x N +1 , φ(x N +1 ) = k means that x N +1 is predicted to belong to partition N k , and therefore, L = q ⊤ k x N +1 is to be used in the forward problem (5).
On a different issue, the estimation problem (7) is a MIP program and, as such, computationally expensive in general. Actually, the size of (7) grows linearly with the sample size. To keep the time to solve (7) reasonably low, we reduce the cardinality of subsets {N k } K k=1 by means of the PAM K-medoids algorithm (Kaufman and Rousseeuw, 2009) through the Python package implementation scikit-learn-extra. This algorithm selects the most representative data points within each subset N k , the so-called medoids, by minimizing the sum of distances between each point in N k and said medoids. We remark that this reduction process results in data points (the medoids) with unequal probability masses, so extra care should be taken when formulating objective function (7a) for each subset N k considering the medoids only. More specifically, the uniform weight 1 N appearing in the objective function (7a) should be replaced with a medoid-dependent weight representing the probability mass assigned to each medoid as a result of the reduction process.

Case Study
In this section we assess the performance of our approach in a realistic case study that is based on the stylized model for the European power system that is described in Nahmmacher et al. (2014). Accordingly, we consider a pipeline network model with 28 nodes, each representing an European country. The capacities of the lines are also obtained from Nahmmacher et al. (2014), in particular, we take the values from " respectively. Again, the available capacity of both technologies has been assigned based on the data in Nahmmacher et al. (2014) corresponding to 2020 for each country. More specifically, the base power-plant capacities have been obtained by adding up the installed capacities of the technologies "Nuclear", "Hard coal", "Oil" and "Lignite" and the peak power-plant capacities from the technologies "Natural Gas", "Waste" and "Other gases".
The nodes of the system and the resulting generation capacities of each type are listed in Table 7.
To build a data sample of the form {(x i , L i )} i∈N , we have collected the actual aggregate hourly demand, wind, solar and hydro energy production for each country (node of the system) in 2020 from the ENTSO-e Transparency Platform (ENTSO-E Transparency Platform, 2020). We have also retrieved the day-ahead forecast of the hourly demand and the produced wind and solar energy from this platform. To get series of net demand values (both forecast and actual), we have subtracted the respective wind, solar and hydro power data series from the aggregate day-ahead forecast/actual demand series. We clarify that no day-ahead forecast for the hydro power production is available in ENTSO-E Transparency Platform (2020), so the  series of real hydro power production has been used (instead of the missing day-ahead hydro forecast) for the computation of day-ahead forecasts of the nodal net demands. Some minor gaps in the data extracted from ENTSO-E Transparency Platform (2020) have been filled through linear interpolation.
The marginal costs of energy generation and up-and downward balancing energy provision of each unit are randomly sampled from the uniform distributions specified in Table 8. The so-obtained values for these costs have remained fixed throughout the experiments performed in this section.
We point out that in the uniform distributions of Table 8, we have considered that base power units are cheap but inflexible, and thus, with costly balancing energy. In contrast, peak power plants are expensive, but flexible, and hence, with more competitive balancing energy costs.
We conduct a rolling simulation on the data of 2020, in which we gradually select non-overlapping windows of 150 points each. From each window, we randomly sub-sample (without replacement) the indexes corresponding to the training and test sets, which are eventually made up of 100 and 50 samples, respectively. We take ten windows over which we average the results that follow. First, we compute the average cost of the baseline method F-SC, i.e., the method that uses the point forecast L F of the net demand as L in (5), which amounts to 5092.8ke. For comparison, we also compute the average cost of a perfect information benchmark in which L is equal to the actual realization of the net demand L. This perfect forecast approach is unrealistic but can be used to assess how much could be gained from improving the prescribed net demand. For this case study, the perfect information cost is 3380.9ke, i.e, 33.6% lower than that delivered by method F-SC.
As in the example of Section 4, we consider a feature vector x made up of the day-ahead forecast of the system net demand, L F , (measured in MWh), enlarged with an additional feature fixed to one to accommodate the intercept of the affine models L = q ⊤ k x = q 0k + q 1k L F , k ≤ K. In the analysis we conduct next, we consider various values for K (number of partitions and hence of affine models) and several percentage reductions of the number of data in each partition N k , k ≤ K. The results of this analysis are summarized in Table 9, where "r%" in the first column means that only that percentage of medoids in the partition N k (more precisely r 100 |N k | , where ⌈·⌉ denotes the ceiling operator) have been used to estimate the affine function L = q ⊤ k x through (7). This table shows the average cost achieved by our approach for different values of r and K, the cost savings with respect to the cost of the baseline method F-SC (5092.8ke), and the average time the solution to the K estimation problems (7) takes. The reported cost savings have been computed out of sample, that is, on the test sets. Beyond the fact that these savings are significant in general, it is clear that our prescriptive approach benefits from exploiting different affine models under different net-demand regimes, which confirms the preliminary conclusion we draw in this regard through the small example of Section 4.
Nevertheless, it is also true that the added benefit rapidly plateaus as K grows. Actually, the bulk of the economic gains we get through the par-titioning of the data sample is already reaped with K = 2. On the other hand, increasing K has a positive side effect: It remarkably reduces the time to solve the MIP problem (7). In addition, this time can be shortened even further, with a tolerable reduction in cost savings, by using only the medoids of the partitions N k , k ≤ K, when estimating the affine models through (7). Notwithstanding this, K cannot be made arbitrarily big since it can lead to overfitting. Therefore, the choice of K should be guided by a training-validation scheme similar to the one we show in Table 9 to ensure that increasing K does not lead to a reduction in the benefits of the affine rules out of sample.
To comprehend where those cost savings our approach yields come from, in Figure 3 we plot the predicted aggregate net demand L F against the one prescribed by our method, i.e., L. The plot corresponds to one window of 150 data points taken at random out of the ten we have considered in the rolling-window simulation. Furthermore, the figure depicts results from the case with five partitions (K = 5). It can be seen that, when the system net demand is predicted low, our method prescribes to overestimate it. This prescription is motivated by two facts. On the one hand, the overestimation of the net demand in the forward problem is covered by cheap power plants, whereas it reduces the need for upward balancing. On the other, even though it slightly increases the demand for downward balancing, the group of units that down-regulate remains the same in any case, i.e., with and without the overestimation, due to the limitations of the network. As a result, the cost savings linked to the reduction in up-balancing outweigh the extra costs incurred by the increase in down-balancing. It is interesting to note that, as mentioned at the end of Section 2, system operators, based on their accumulated experience, often introduce an upward bias into the net demand  Table 9. Average cost savings and average time to solve the estimation problem (7)  forecast (California Independent System Operator, 2020), precisely as our model here suggests.
As the level of net demand grows, the overestimation of the system net load that our method prescribes diminishes to a point where the prescribed amount flattens (see partitions N 4 and N 5 ). Again, this phenomenon is caused by the network and the limitations it imposes. Indeed, our method avoids dispatching power plants in the forward problem which, despite being their turn in the cost-merit order, would have been irretrievably downregulated in real time because of network bottlenecks. For instance, in partition N 4 , F-SC consistently dispatches the DE base generator, with its massive 46 GW, to maximum capacity. However, due to grid constraints, this unit is subsequently down-redispatched to around 30 GW. On the contrary, P-SC takes into consideration that this power plant is one of the latest to be scheduled in this partition and foresees the grid limitation on the power flow, thus constraining the aggregated energy production and systematically dispatching such a unit to the previously mentioned 30 GW.
Finally, Figure 3 also illustrates that, even though the prescriptive functions h q (·) we employ are affine, the use of several clusters leads to a piecewise affine function that is able to approximate the non-linear relationships between the features and the prescribed parameters.

Conclusions
In this paper, we have proposed a data-driven method to prescribe the value of net demand that the forward stage in a two-stage generation scheduling problem should use in order to minimize the expected total cost of operating the underlying power system. For this purpose, we have formulated a mixed-integer linear program that trains an affine function to map the predicted net demand into the prescribed one.
Numerical experiments conducted out of sample on a stylized model of the European electricity grid reveal that the cost savings implied by the estimated affine mappings are substantial, well above 2%. Furthermore, on the grounds that the cost structure of a power system is highly dependent on its operating point, and hence, on the level of net demand, we have devised a K-means-based partition strategy of the data sample to train different affine mappings for different net-demand regimes. The utilization of this strategy is shown to have a positive twofold effect in the form of substantially increased costs savings and a remarkable drop in the computational burden of the proposed MIP training model. Finally, we have further complemented the partitioning of the data sample with a medoid-based reduction in the size of the partitions, achieving additional speedups in solution times. All this together opens up the possibility to leverage our prescriptive approach in larger instances.
Future work will include attempts to optimize the partitioning of the data sample by embedding it into the MIP training model and to devise a procedure to efficiently update the parameters of the affine rules as new data become available without having to solve a new instance of the estimation problem (3). Another direction for future research is the extension of the proposed methodology to prescribe net demand trajectories (of 24 hours, for example) for generation scheduling problems that include inter-temporal constraints such as ramping limits or minimum times.