A bilevel framework for decision-making under uncertainty with contextual information

In this paper, we propose a novel approach for data-driven decision-making under uncertainty in the presence of contextual information. Given a finite collection of observations of the uncertain parameters and potential explanatory variables (i.e., the contextual information), our approach fits a parametric model to those data that is specifically tailored to maximizing the decision value, while accounting for possible feasibility constraints. From a mathematical point of view, our framework translates into a bilevel program, for which we provide both a fast regularization procedure and a big-M-based reformulation that can be solved using off-the-shelf optimization solvers. We showcase the benefits of moving from the traditional scheme for model estimation (based on statistical quality metrics) to decision-guided prediction using three different practical problems. We also compare our approach with existing ones in a realistic case study that considers a strategic power producer that participates in the Iberian electricity market. Finally, we use these numerical simulations to analyze the conditions (in terms of the firm's cost structure and production capacity) under which our approach proves to be more advantageous to the producer.


Introduction
In the last couple of decades, the field of decision-making under uncertainty has regained momentum, spurred by the new opportunities that the Digital Age has brought to modern economies. As a result, this field has been prolific in the design and development of new tools capable of exploiting the vast amount of information that human societies currently generate, compile and record, mainly in the form of data.
From among all the exciting advances that have been achieved in the realm of decision making under uncertainty in recent years, we highlight the so-called data-driven optimization under uncertainty, which endows the decision maker with a powerful and versatile mathematical framework to hedge her decisions against both the intrinsic risk of an uncertain world and the limited and incomplete knowledge of the random phenomena that can be retrieved from a finite set of observations or data.
Data-driven optimization under uncertainty has been applied to a broad range of contexts and problems, for instance, inventory management [1,2], nurse staffing [3], portfolio optimization [4,5,6], shipment planning [4], network flow [5], power dispatch [7], and vehicle routing [8], just to name a few. For a recent survey on the topic and its applications, we refer the reader to [9] and [10].
In this paper, we first compare the proposed methodology with existing ones using two classical conditional stochastic optimization problems, namely, the newsvendor problem [3,11,12] and the product placement problem [1]. Additionally, we consider the problem of a strategic firm that has to decide the generation quantity that maximizes its expected profit while facing the uncertainty related to market conditions. This problem has a long tradition in the Economics and Management Science literature (see, for instance, [13,14,15]). In particular, we take electricity as the homogeneous good to be produced and thus, we place ourselves in the context of electricity markets, where this problem has received a great deal of attention since the deregulation of the power sector [16,17]. Most existing models address this problem by forecasting, as accurately as possible, the electricity market behavior. Then, such forecasts are used to compute the decision that maximizes the producer's profit. Here we present a novel and alternative data-driven procedure that considers the problem structure and leverages available auxiliary data to enhance market participation and increase profits. Our approach is formulated as a bilevel program that, under convexity assumptions, can be efficiently solved using commercially available optimization solvers. We demonstrate the superior performance of the proposed approach on a realistic case study that uses data from the Iberian electricity market.
In short, our contributions are threefold, namely:
-From a methodological point of view, we propose a novel data-driven framework for conditional stochastic optimization, whereby the parameters that are input to the decision-making problem are formulated as a function of some covariates or features. This function is, in turn, estimated factoring in its impact on the decision value. Finally, by way of this function, we construct a deterministic (single-scenario) surrogate optimization model that delivers decisions that are good in terms of the original conditional stochastic program. In Section 2, we introduce and mathematically formalize our proposal along with alternative state-of-the-art approaches available in the technical literature. Our approach is formulated as a bilevel optimization problem that can be reformulated as a single-level optimization problem and solved using off-the-shelf optimization solvers as discussed in Section 3.
-From a theoretical perspective, we compare our approach with existing ones in Section 4 for three different applications, namely, the newsvendor problem, the product placement problem, and the strategic producer problem.
-From a more practical point of view, Section 5 provides simulation results for the strategic producer problem using both an illustrative example and a realistic case study based on the Iberian electricity market.
The numerical experiments show that our proposal can significantly increase the competitive edge of the strategic producer depending on her cost structure and the market demand elasticity.
We conclude the paper with a brief compilation of the most relevant observations in Section 6.

Mathematical framework and related work
In decision making we often model the uncertainty as a random vector of parameters (y ∈ Y ⊆ R m ) governed by an unknown true distribution Y and, typically, some relevant contextual information (x ∈ X ⊆ R p ) ∼ X is available before the decision is to be made. Following this scenario, the decision maker is interested in solving the conditional stochastic optimization problem where f : R n × R m → R is a known function in the decision z ∈ R n , and Z ⊆ R n is a nonempty, compact set known with certainty (i.e., independent of Y ), to which the decision z must belong. In practice, neither the joint distribution of X and Y , nor the conditional distribution of Y given X = x are known and therefore, problem (1) cannot be solved. On top of that, even if the true distribution were known and the decision z were fixed, problem (1) would typically require computing the expectation of a function of a continuous random vector (i.e., a multivariate integral), which is, in itself, a hard task in general. Instead, the only information that the decision maker typically has is a sample S = {(y i , x i ), ∀i ∈ N }, where y i ∈ R m is a particular outcome of the uncertainty Y recorded under the context x i ∈ X , and N denotes the set of available samples. Against this background, problem (1) is alternatively replaced with a surrogate optimization problem, in the hope that the solution to the latter is good enough for the former. In this line, different approaches have been proposed to construct such a surrogate optimization problem. For instance, the traditional modus operandi follows the rule "first predict, then optimize," which results in the following surrogate problem to approximate the solution to (1): where ŷ denotes an estimate of the outcome of the uncertainty Y under the contextual information x ∈ X ⊆ R p . The surrogate problem (2) is attractive for several reasons. First and foremost, it is much simpler and faster to solve than (1). 
Actually, it is a deterministic optimization problem that, as opposed to (1), only requires evaluating the cost function f (z; ·) at the single value or scenario ŷ. Furthermore, problem (2) seems intuitive and natural, especially when ŷ represents "the most likely value" for Y given X = x. Indeed, the single scenario ŷ is often chosen as an estimate of the expected value of the uncertainty Y conditional on X = x, i.e., ŷ ≈ E[Y |X = x], where, logically, the approximation is built from the available sample S. In the realm of forecasting, the estimate ŷ is usually referred to as a point prediction.
In order to build the estimate ŷ ≈ E[Y |X = x], a function g FO : X × R q → R m is normally chosen from a w-parameterized family G FO , with w ∈ R q , to construct the forecasting model ŷ = g FO (x; w). The goodness of a certain parameter vector w is quantified in terms of a loss function l FO (y, ŷ) : Y × R m → R that measures the accuracy of the estimate. Then, given the sample S = {(y i , x i ), ∀i ∈ N }, the choice of w is driven by the minimization of the in-sample loss, as expressed below: In this framework, the optimal decision z FO under the context X = x is thus obtained by solving the following deterministic problem: We refer to this approach, which relies on a good forecast of the uncertainty Y (in particular, an estimate of E[Y |X = x]), as FO (short for FOrecasting). Even though this approach is intuitive and may perform relatively well in many situations, it is fundamentally flawed for the following two basic reasons. First, since ŷ ≈ E[Y |X = x] in FO, the surrogate problem (2) works as a proxy of the problem which, in general, is not equivalent to (1). Second, even in those cases where these two problems are indeed equivalent, the loss function l FO that is typically used to compute w FO (for example, the squared error) is solely intended to get a statistically good estimate of E[Y |X = x] and does not account for the nominal objective f or the constraints that the decision z must satisfy. For instance, approach (3)-(4) is unable to capture that overestimating E[Y |X = x] might worsen the objective function f much more than underestimating it. In view of these design flaws, a number of works have proposed to replace the problem-agnostic l FO that is generally used in (3) with a problem-aware loss function l SP (y, ŷ) = f (ż(ŷ); y), where l SP : R m × R m → R and ż : Y → Z is defined as ż(y) = arg min z∈Z f (z; y). 
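For concreteness, the FO pipeline (3)-(4) can be sketched on a toy problem. The affine model, the cost f (z; y) = (z − y)² + 0.1z, the feasible set Z = [0, 10], and all numbers below are illustrative assumptions, not part of the original formulation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Toy sample S = {(y_i, x_i)}: y depends linearly on a scalar feature x.
N = 200
x = rng.uniform(0.0, 2.0, size=N)
y = 3.0 * x + rng.normal(0.0, 0.3, size=N)

# Step 1 ("predict"): fit an affine model g_FO(x; w) = w[0]*x + w[1] by
# minimizing the in-sample squared loss, as in problem (3).
X = np.column_stack([x, np.ones(N)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 2 ("optimize"): plug the point prediction y_hat into the deterministic
# surrogate (4), here with the hypothetical cost f(z; y) = (z - y)**2 + 0.1*z
# and feasible set Z = [0, 10].
x_now = 1.5
y_hat = w[0] * x_now + w[1]
res = minimize_scalar(lambda z: (z - y_hat) ** 2 + 0.1 * z,
                      bounds=(0.0, 10.0), method="bounded")
z_fo = res.x
```

Note that the squared loss in the fitting step is oblivious to f: the same w would be obtained for any downstream cost, which is precisely the design flaw discussed above.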
Therefore, function l SP evaluates the loss of optimality associated with the decision ż(ŷ) that is prescribed by the surrogate decision-making problem (2) for the single value ŷ. Accordingly, the optimal parameter vector w SP is obtained as the one that minimizes the in-sample optimality loss, that is: where the function g SP : X × R q → R m is chosen from a family of functions G SP . We use the acronym SP, which stands for "Smart Predict", to refer to this setup. Solving problem (6) using descent optimization methods requires computing the gradient of the loss function l SP (y, ŷ) with respect to w. This may not be feasible, since it involves the differentiation of the discontinuous function ż(y) [18]. To overcome this difficulty, a great deal of research has been devoted to finding methods to approximate the gradient of (6) for particular instances. The work developed in [19], for example, describes a procedure to solve (6) under the following three conditions: i) f is quadratic, ii) the uncertainty is only present in the coefficients of the linear terms of f , and iii) no constraints are imposed on the decision z, which means Z = R n . Some years later, the authors of [2] proposed a heuristic gradient-based procedure to solve (6) for strongly convex problems with deterministic equality constraints and inequality chance constraints. Almost concurrently, reference [5] discusses the difficulties of solving (6) in the case of linear problems, since such a formulation may lead to an uninformative loss function. To overcome this issue, they successfully develop a convex surrogate that allows them to efficiently train g SP (x i ; w) in the linear case. More recently, the authors in [20] apply an approach similar to that in [2] to combinatorial problems with a regularized linear objective function. In summary, the four references above propose ad-hoc gradient methods for specific instances of (6). 
However, the technical literature lacks, to the best of our knowledge, a general-purpose procedure to solve such a problem using available optimization solvers. To fill this gap, we propose the following bilevel program [21] as a generic mathematical formulation of (6): where g BL : X × R q → R m is selected similarly to g FO and g SP . Problem (7) is formulated as a bilevel optimization model commonly used to mathematically characterize non-cooperative and sequential Stackelberg games in which the leader makes her decisions anticipating the reaction of the follower [22]. In this sense, the upper-level problem determines the optimal parameter vector w anticipating the decision provided by each lower-level problem (7b) if the value ŷ i is given by g BL (x i ; w). We denote this approach based on bilevel programming as BL (acronym of BiLevel). In Section 3, we discuss the assumptions that problem (1) must satisfy so that problem (7) can be reformulated as a single-level optimization problem to be solved using off-the-shelf optimization solvers. Although solving the bilevel problem (7) may be computationally expensive, this is a task that can be performed offline.
Once w BL is determined, the optimal decision z BL under context X = x is computed by solving the following problem: The bilevel program (7a)-(7b) computes the value for the parameter vector w that maximizes the in-sample performance of the surrogate decision-making model (8). For this estimation to be of use, it must be guaranteed that two equal contexts x i = x j lead to equal decisions ẑ i = ẑ j ; in short, under equal contexts, equal decisions. This is a condition that is reminiscent of the notion of non-anticipativity in Stochastic Programming. Importantly, this condition is automatically satisfied if the solution to the lower-level problem (7b) is unique for any value of w. Otherwise, the bilevel program (7a)-(7b) would choose the ẑ i from the optimal solution set of (7b) that minimizes the upper-level objective function (7a) given -i.e., by anticipating- the uncertainty outcome y i . This is so because the bilevel program (7a)-(7b), as we have formulated it, delivers the optimistic Stackelberg solution. For instance, let us assume that there exists a value w̃ such that f (z; g BL (x i ; w̃)) = ϑ for all i ∈ N , where ϑ is a constant. In this case, the lower-level problems (7b) boil down to feasibility problems imposing that z ∈ Z and therefore, ẑ i can violate non-anticipativity and adapt to the realization y i for all i ∈ N . More importantly, using w̃ in (8) would lead to degenerate and highly suboptimal decisions under any context X = x. This issue is reported in [5] for linear objective functions, where the authors propose a convex surrogate function of l SP to train meaningful instances of model g SP (·; w SP ). Similarly, we propose in Section 4.3 a modified lower-level surrogate model for the strategic producer problem in order to ensure non-anticipativity for any parameter vector w.
Next, we discuss other surrogate decision-making models different from (2), which have also been recently proposed to approximate the solution of (1). For this purpose, notice first that problem (1) can be equivalently recast as where Q |x represents the conditional probability distribution of Y given X = x. Thus, a second family of surrogate decision-making models can be introduced with the following general form: where Q̂ |x is an approximation of the unknown probability measure Q |x that is constructed from the available sample S = {(y i , x i ), ∀i ∈ N }. For the surrogate problem (10) to be computationally tractable, the proxy Q̂ |x is often built as a discrete probability distribution supported on a finite number of points, more specifically, on the y-locations of the sample, i.e., {y i , ∀i ∈ N }. This way, the solution to (10) under context X = x, which we denote as z ML (x), can be generically expressed as: with {g ML (x, x i ; w), ∀i ∈ N } being the probability masses that the specific proxy Q̂ |x that is used places on {y i , ∀i ∈ N }. These masses or weights are determined as a function g ML : X × X × R q → R of the historical contextual information x i , the current context x, and some parameters w.
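As a minimal illustration of the weighted sample-average surrogate (11), the sketch below uses k-nearest-neighbor weights for g ML and a hypothetical quadratic cost f (z; y) = (z − y)²; all names and numbers are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

# Sample S and the current context x.
N = 300
x = rng.uniform(0.0, 2.0, size=N)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, size=N)
x_now = 1.0

# k-nearest-neighbor weights: g_ML(x, x_i; w) = 1/k for the k contexts x_i
# closest to x, and 0 otherwise (here the "parameter" w is just k).
k = 20
nearest = np.argsort(np.abs(x - x_now))[:k]
weights = np.zeros(N)
weights[nearest] = 1.0 / k

# Surrogate (11): minimize the weighted sample-average cost, with the
# hypothetical cost f(z; y) = (z - y)**2 and Z = [-2, 2].
res = minimize_scalar(lambda z: np.sum(weights * (z - y) ** 2),
                      bounds=(-2.0, 2.0), method="bounded")
z_ml = res.x
```

Since the weights only use data close to the given context, such local estimators need fairly large samples to be reliable, which is the point made in the comparison below.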
In essence, this scheme adapts the Sample Average Approximation (a well-known data-driven solution strategy in Stochastic Programming [23,24]) to the case of conditional stochastic programs. It was first formalized in [1] and, since then, has been subject to a number of improvements (e.g., regularization procedures for bias-variance reduction [25]; robustification [26]; and algorithmic upgrades [27]) and extensions, e.g., to a dynamic decision-making setting [4]. Recently, the work in [11] introduces a bilevel formulation to optimally tune the parameters w that determine the weights g ML (x, x i ; w). Using our notation, the method proposed in [11] can be formulated as follows: where the function g ML : X × X × R q → R used to compute the weights can be chosen from a catalog of several classical machine learning algorithms G ML such as k-nearest neighbors, Nadaraya-Watson kernel regression or Random Forest. The author of [11] resorts to tailor-made approximations and greedy algorithms for each machine learning technique that is used to construct function g ML , but does not provide a general-purpose solution strategy valid for any function g ML . This approach, which is based on machine learning techniques, is called ML (short for Machine Learning). After solving (12), the optimal decision z ML (x) under context X = x is obtained by solving (11) with w = w ML . The surrogate problems (2) and (10) are, by design, different, in part because they are the result of distinct frameworks to address the conditional stochastic program (1). The surrogate problem (2) is based on the assumption that it is possible to find a good decision z in terms of the conditional expected cost E[f (z; Y )|X = x] by optimizing that decision for a single scenario ŷ of the uncertainty Y . Naturally, all the complexity of this approach lies in how to infer, from the data sample S, the single scenario ŷ that unlocks the best decision z. 
This inference process makes use of global methods that consider all data points in the sample to obtain more robust decision mappings. In contrast, all the difficulty of the surrogate problem (10) rests on how to retrieve a good approximation of the true conditional distribution Q |x from the sample S. Such an approximation is performed using local machine learning methods that only employ data close to the given context x and consequently, a large amount of data is required to avoid overfitting. In more practical terms, embedding local machine learning methods into the estimation problem (12) makes this problem computationally intractable in most cases. Besides, the surrogate problem (2) is computationally less demanding than (10), because the latter requires evaluating the cost function f (z; ·) for multiple values of the uncertainty Y .
Finally, there is a third class of surrogate decision-making models that arises from the idea of using the sample S to directly learn the optimal decision z as a function of the context x, this way bypassing the need for constructing the estimate ŷ or the proxy distribution Q̂ |x . Following this logic, we seek a decision rule or mapping g DR : X × R q → R n that directly prescribes a decision for each context. Particularizing for the empirical distribution of the data, this approach renders: One clear advantage of directly learning the optimal decision policy is that, after solving (13), the decision z DR to be implemented under context X = x is efficiently computed as follows: Actually, the mapping (14) constitutes the surrogate decision-making model itself. This method, which aims at determining an optimal decision rule, is denoted as DR (acronym of Decision Rule). Nevertheless, feasibility issues may arise as this approach does not necessarily guarantee that the resulting z DR obtained through (14) belongs to Z for any plausible context x. The authors of [3] propose and investigate this approach for the popular newsvendor problem, for which they consider a linear decision rule. Their newsvendor formulation does not involve any constraint and therefore, decisions yielded by (14) are always valid. However, the use of this approach is questionable for many other practical applications in which decisions must satisfy a set of constraints.
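A minimal sketch of the DR approach with a linear rule and a hypothetical squared-error cost, under which problem (13) reduces to ordinary least squares; it also illustrates the feasibility caveat when the rule is evaluated outside the training contexts:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample S: demand y depends on a scalar feature x (plus an intercept feature).
N = 200
x = rng.uniform(0.0, 2.0, size=N)
y = 3.0 * x + rng.normal(0.0, 0.3, size=N)
X = np.column_stack([x, np.ones(N)])

# DR with a linear rule z(x) = g_DR(x; w) = w[0]*x + w[1] and the hypothetical
# cost f(z; y) = (z - y)**2: problem (13) then reduces to ordinary least squares.
w_dr, *_ = np.linalg.lstsq(X, y, rcond=None)

# Decision under a new context, as in (14): a single function evaluation.
# Nothing forces the output to lie in Z; e.g., with Z = R_+ and a context
# outside the training range, the prescribed decision is negative (infeasible).
z_dr = w_dr[0] * (-0.5) + w_dr[1]
```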
In summary, the contributions of the proposed bilevel model (7) with respect to the other approaches presented in this section are: -Unlike the traditional approach (3), ours provides estimates of the uncertainty Y by leveraging information about the optimization problem to be solved.
-Unlike the existing "predict-then-optimize" methodology (6), our approach is formulated as a generic bilevel optimization that, under convexity assumptions, is reformulated as a single-level optimization problem that can be solved using off-the-shelf optimization software.
-Unlike approach (12), ours makes use of global estimation methods that use all available data to infer the point forecast of the uncertainty that unlocks the best decision. Therefore, our approach is less prone to overfitting, especially for small data samples. In addition, formulation (12) is more difficult to solve than (7).
-Unlike approach (13), ours guarantees the feasibility of the resulting optimal decision under any context.

Solution strategy
In this section, we elaborate on how to solve the general-purpose bilevel program (7) we propose to compute the best single scenario ŷ to be fed into the surrogate problem (2). To do so, we particularize the generic formulation (1) as follows: where z constitutes the vector of here-and-now variables independent of the uncertainty, s(Y ) represents the wait-and-see decisions, and constraints (15b), (15c) must be satisfied for Q |x -almost all y (i.e., with probability one).
We also assume that f 0 , f j are convex functions with respect to all variables, h k are affine functions, and function g BL is continuous in the parameter vector w.
Our method solves the following bilevel optimization problem: On the assumption that the lower-level minimization problems (16d)-(16f) satisfy some constraint qualification, the classical strategy to solve (16) is to replace each lower level (16d)-(16f) with its equivalent Karush-Kuhn-Tucker (KKT) conditions [28], that is, where λ ji , υ ki ∈ R are, respectively, the Lagrange multipliers related to constraints (16e) and (16f) for each lower-level problem; (17a) and (17b)-(17c) are, in that order, the objective function and constraints of the upper-level problem, and constraints (17d), (17e)-(17f), (17g), (17h) are, respectively, the stationarity, primal feasibility, dual feasibility and complementarity slackness conditions of the lower-level problems. As discussed in [29], problem (17) violates the Mangasarian-Fromovitz constraint qualification at every feasible point and therefore, interior-point methods fail to find even a local optimal solution to this problem. To overcome this issue, a regularization approach was first introduced in [30] and further investigated in [31]. This method replaces all complementarity constraints (17h) with inequality (18c) below: where ε is a small non-negative scalar that allows us to reformulate (17) as the parametrized nonlinear optimization problem (18), which typically satisfies a constraint qualification and can then be efficiently solved by standard nonlinear optimization solvers. The authors of [30] prove that, as ε tends to 0, the solution of (18) tends to a local optimal solution of (17). In the remainder of the manuscript, we will refer to this approach as BL-R. Alternatively, the complementarity slackness conditions can be linearized according to Fortuny-Amat [32] as follows: where u ji are binary variables, and M P , M D ∈ R + are large enough constants whose values can be determined as proposed in [33]. The resulting model (19) is a single-level mixed-integer non-linear problem. We denote this method as BL-M. 
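The logic of the Fortuny-Amat linearization in (19) can be checked numerically on a single complementarity pair: within the big-M box, the union over u ∈ {0, 1} of the linearized feasible sets coincides with the complementarity set. The bounds below are illustrative:

```python
import numpy as np

# One complementarity pair from (17h): lam >= 0, s >= 0, lam * s = 0.
# Its Fortuny-Amat linearization (19) replaces the bilinear product with the
# linear constraints lam <= M_D * u and s <= M_P * (1 - u), with u binary.
M_P, M_D = 10.0, 10.0   # illustrative big-M constants

def complementarity(lam, s):
    return lam >= 0 and s >= 0 and lam * s == 0

def big_m(lam, s, u):
    return lam >= 0 and s >= 0 and lam <= M_D * u and s <= M_P * (1 - u)

# Within the big-M box, a point satisfies complementarity if and only if the
# linearized constraints hold for some binary u.
grid = np.linspace(0.0, M_P, 21)
for lam in grid:
    for s in grid:
        assert complementarity(lam, s) == any(big_m(lam, s, u) for u in (0, 1))
```

If M P or M D is chosen too small, the linearized model cuts off legitimate KKT points, which is why the bounding procedure of [33] matters in practice.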
Solving the bilevel problem (7) using either BL-R or BL-M is valid for a conditional stochastic problem that satisfies the conditions described in this section. Nonetheless, the complexity of solving the regularized non-linear problem (18) or the mixed-integer non-linear program (19) highly depends on functions f 0 , f j , h k , g BL . In some cases (see, for instance, the particular applications discussed in Section 4), problem (19) can be reformulated as a mixed-integer linear/quadratic optimization problem that can be solved to global optimality using standard optimization solvers. In the general case, problems (18) and (19) can also be solved using off-the-shelf optimization solvers, but global optimality may not be guaranteed. Notwithstanding this, local optimal solutions of the proposed bilevel formulation (7) may still lead to optimal decisions that are significantly better than those computed by FO or DR.

Applications
In Section 2, we introduce a common mathematical framework to present five different approaches for contextual decision-making under uncertainty, namely, the predict-then-optimize strategies FO, SP, and BL; method ML, which relies on a proxy of the true conditional distribution that is built using machine-learning techniques; and the decision-rule approach DR. Unfortunately, in the technical literature, methods SP and ML have only been applied to conditional stochastic optimization problems with a specific structure and they both lack a solution strategy for more general conditional stochastic programs. For this reason, in this section, we limit ourselves to comparing approaches FO, BL, and DR on various contextual decision-making problems under uncertainty, each of which illustrates a certain relevant aspect of our proposal. Subsection 4.1 compares these methodologies using the newsvendor problem, a well-known stochastic programming problem with simple recourse. The proposed methodology is also applied in Subsection 4.2 to the product placement problem, a two-stage stochastic programming problem with full recourse. Finally, Subsection 4.3 presents a strategic producer problem formulated as a one-stage stochastic program in which the uncertainty only affects the objective function.

Newsvendor problem
We start with the popular newsvendor problem in the spirit of [3], a work that elicited renewed interest [11,12] in the solution to the conditional stochastic program (1). In the newsvendor problem, the goal of the decision maker is to find the optimal ordering quantity for a product with unknown random demand Y . In turn, this (positive) demand may be influenced by a random vector of features X representing, for instance, product information, weather conditions, customer profiles, etc. The decision maker has, therefore, a collection of observations {(x i , y i ), ∀i ∈ N }, which s/he would like to exploit to make an informed ordering quantity z under the context X = x. Let d and r, with r > d > 0, be the cost and revenue of manufacturing and selling one product unit, respectively. This problem can be formulated as the following conditional stochastic program: Approaches FO and BL both follow a "predict-then-optimize" strategy, whereby the ordering quantity is obtained as the solution to the following surrogate decision-making model: We can use an auxiliary variable s to get rid of the inner minimization and write instead: whose solution is trivial, namely, z * = s * = ŷ. FO and BL differ in the particular single value or scenario ŷ that each of them uses. In the case of FO, ŷ is an estimate of E[Y |X = x]. Consequently, it becomes apparent that, for the newsvendor problem, approach FO is fundamentally inconsistent, because it is well known that the solution to (20) corresponds to the quantile (r − d)/r of the demand distribution Y conditional on X = x. Now, if we take ŷ = g BL (x; w) = w T x in our approach, the optimal vector of linear coefficients w BL is computed as follows: which, based on our previous argument, boils down to: Therefore, our approach coincides exactly with that proposed in [3], which, in turn, is given by problem (13) in Section 2 when g DR (x; w) = w T x. This equivalence is far from being general though, as we will see with the other applications below.
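The inconsistency of FO for the newsvendor problem can be verified empirically: on a skewed demand sample, ordering the critical fractile (r − d)/r attains a lower average cost than ordering the sample mean. The demand distribution and prices below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Newsvendor cost of ordering z against realized demand y: pay d per unit
# ordered, earn r per unit sold (negative values are profits).
d, r = 1.0, 3.0

def cost(z, y):
    return d * z - r * np.minimum(z, y)

# Skewed (illustrative) demand sample: mean and critical fractile differ.
y = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

z_fo = y.mean()                       # FO: order the point prediction of E[Y]
z_q = np.quantile(y, (r - d) / r)     # critical-fractile (quantile) order

# The quantile order attains a lower in-sample average cost than the mean.
assert cost(z_q, y).mean() <= cost(z_fo, y).mean()
```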

Product placement problem
Given a graph G = (B, A) with node-arc matrix A, in the product placement problem, the goal is to decide the amount z b ∈ R + of a certain product to be placed in each node b ∈ B of the graph [1]. After this decision is made, the demand for the product at each node y b is realized, and the inventories of product throughout the network are shipped across the arcs A so as to satisfy the actually observed nodal demands. As in the newsvendor problem, these demands may be affected by some exogenous factors X that may be also random, but that are disclosed before the product placement decision is to be made. Let h ∈ R |B| and g ∈ R |A| be the cost of initially placing products in the nodes of the network and the cost of shipping products through the edges of the graph, respectively. The product placement problem under uncertain demand, but with contextual information, can be formulated as follows: In problem (26), we have included a variable vector p ∈ R |B| ≥0 to allow for unsatisfied demand, with the associated penalty cost r ∈ R |B| . Furthermore, the decision vector f ∈ R |A| represents the amount of product shipped across the arcs of the network. The cost function (26a) takes the form of a two-stage linear cost, with the integration of a recourse problem. More importantly, unlike in the newsvendor problem, the recourse is given by a full-fledged (linear) minimization problem. The surrogate decision-making model associated with the predict-then-optimize strategies FO and BL is as follows: To ease the exposition and the notation that follows, we make the additional assumption that r > h > 0, where the inequality holds component-wise. In this case, the variable vector p in (27) is zero at the optimum and the surrogate model can be simplified to: As previously discussed, problem (28) is a deterministic mathematical program whereby the decision z is solely optimized for the point prediction of demand ŷ. 
While the traditional FO approach sets such a prediction to E[Y |X = x], the rationale behind the approach BL is to compute a W-parameterized function such that the surrogate problem (28) delivers the decision z that minimizes the in-sample cost, that is: where we have taken ŷ = g BL (x; W ) = W x with W ∈ R |B|×p . As discussed in Section 2, the lower-level problem (29d)-(29e) must have a unique solution. This can be guaranteed if, for example, all the shipping routes that can be taken to satisfy each demand in the graph entail a different cost. If this condition is not satisfied, the degeneracy of the lower-level problem can be eliminated by using classical results from the linear programming literature as described in [34]. As stated in Section 3, problem (29) can be addressed by replacing the lower-level linear program (29d)-(29e) with its KKT optimality conditions: where α i ∈ R |B| is the vector of Lagrange multipliers associated with constraint (29e). Thus, problem (30) can be solved by regularizing the complementarity slackness conditions or by using their Fortuny-Amat big-M reformulation. In the latter case, we arrive at a MIP problem that can be solved using commercial optimization software such as CPLEX or GUROBI.
Finally, if we also take a linear decision mapping z(x) = g DR (x; W ) = W x, where W ∈ R |B|×p , the DR approach solves the following minimization problem to compute the optimal matrix of linear coefficients W DR : It is apparent that the estimation problems (30) and (31), which BL and DR solve, respectively, are structurally different and so are W BL and W DR in general. For instance, think of a graph for which min{g ℓ } ℓ∈A > max{h b } b∈B . This represents a network where it is always cheaper to satisfy the nodal demand y b , b ∈ B, through the amount z b of product that is initially placed at the demand location, that is, a graph where product shipping would be uneconomical if the nodal demands were certainly known in advance. Indeed, take the ℓ-th row of g + A T α i in equation (30e) for any i ∈ N , that is, g ℓ + α o(ℓ),i − α e(ℓ),i , where o(ℓ) and e(ℓ) denote the origin and end nodes of arc ℓ, respectively. We have that inf{g ℓ + α o(ℓ),i − α e(ℓ),i : 0 ≤ α i ≤ h} ≥ g ℓ − h e(ℓ) > 0, by the assumption min{g ℓ } ℓ∈A > max{h b } b∈B . Hence, f ℓ = 0, ∀ℓ ∈ A and the system of inequalities (30d)-(30f) boils down to: which, unlike (31d)-(31e), allows for feasible solutions in the form ẑ b,i = 0 with w T b x i < 0 (and α i,b = 0), where w b is the b-th row of matrix W . Furthermore, recasting (31e) as ẑ i − W x i = 0 and setting α i = h, ∀i ∈ N , it is trivial to see that any feasible point of DR is also feasible for BL. Since the feasible region of (31) is contained in the feasible region of (30), but the opposite is not true, the optimum of (30) is in general lower than that of (31).
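The recourse problem embedded in (26) can be evaluated numerically for a fixed placement. The sketch below solves the second-stage LP for a toy two-node network with one arc in each direction; all costs and demands are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

# Second-stage (recourse) cost of (26) for a toy network with two nodes and
# one arc in each direction: given the placement z and the realized demand y,
# ship f across the arcs and leave p units unserved at penalty r.
g_cost, r_pen = 0.5, 5.0        # illustrative shipping cost and penalty
z = np.array([3.0, 0.0])        # first-stage placement
y = np.array([1.0, 2.0])        # realized nodal demand

# Variables v = [f12, f21, p1, p2], all nonnegative.
c = np.array([g_cost, g_cost, r_pen, r_pen])
# Nodal balances z + A f + p >= y, rewritten as A_ub v <= b_ub.
A_ub = np.array([[ 1.0, -1.0, -1.0,  0.0],    # node 1
                 [-1.0,  1.0,  0.0, -1.0]])   # node 2
b_ub = z - y
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 4)
recourse_cost = res.fun   # the optimum ships 2 units from node 1 to node 2
```

In the BL formulation, this LP (with p removed under the assumption r > h > 0) is precisely the lower-level problem whose KKT conditions appear in (30).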

Strategic producer problem
Here we apply our decision-making framework to the problem of a strategic producer partaking in a forward market [16]. This strategic player must decide the production quantity q ∈ R that maximizes her profit while facing uncertainty about market conditions. Let c(q) : R → R_+ denote the generation cost function, whose parameters are assumed to be known with certainty. Let p(q; Y) : R × R^m → R represent the inverse demand function expressing the impact of the generation quantity q on the good's price. For some goods, such as electricity, the inverse demand function varies depending on the season of the year, the day of the week, or the hour of the day. Besides, this function is also uncertain at the time producers must make their generation decisions q, since it may depend, for example, on weather conditions. If Q represents the known feasible region of variable q according to technical or economic constraints, the strategic producer must solve the following conditional stochastic optimization problem: As is customary, we assume that the price and the demand are linearly related as p(q; α, β) = α − βq, where α ∈ R and β ∈ R_+ are unknown parameters. Similarly, we assume that the production cost is given by the quadratic function c(q) = c_2 q^2 + c_1 q, where c_1, c_2 > 0 are known parameters related, respectively, to proportional production costs (such as fuel cost) and to the increase in marginal cost due to technological factors (such as efficiency loss) [35]. To ease the notation, we define α′ = α − c_1 and β′ = β + c_2. Finally, we consider that the production quantity q is bounded by known capacity limits, i.e., q̲ ≤ q ≤ q̄ with q̲, q̄ ∈ R_+. Thus, problem (33) can be reformulated as: Since the quantity decision q is independent of the outcome of the uncertainty (β′, α′), the above can be further simplified to: Therefore, the optimal solution q* is driven by the conditional expected values of α′ and β′.
To be more precise, since β′ > 0, q* could be equivalently computed as follows: Unfortunately, E[α′|x] and E[β′|x] are both unknown and therefore need to be estimated somehow. As explained further in Section 5.2.1, the producer has available a set of historical observations of the uncertain parameters, together with the associated features x_i ∈ R^p, in order to accomplish this task. At this point, it should be underlined that the strategic producer problem (33) is of a distinctly different nature from that of the newsvendor problem (20) and the product placement problem (25). Indeed, the conditional stochastic program (33) has no recourse and the uncertain parameters appear only in its objective function. Consequently, solving (33) is apparently as "simple" as estimating the two conditional expectations E[α′|x] and E[β′|x]. Our claim, however, is that the way the producer draws decisions from a finite data sample (all we usually have in practice) may have a significant impact on the actual expected performance of the producer's strategy. Actually, the best estimates of E[α′|x] and E[β′|x] in a statistical sense do not necessarily result in the best offer q.
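Under the linear price and quadratic cost assumptions above, the benchmark decision rule is a clipped ratio of the (conditional) expectations. A minimal sketch, where `alpha_p` and `beta_p` stand for estimates of E[α′|x] and E[β′|x] (the function name and arguments are ours, not the paper's):

```python
def producer_quantity(alpha_p, beta_p, q_min, q_max):
    """Profit-maximizing offer q* = alpha' / (2 beta'),
    projected onto the capacity interval [q_min, q_max]."""
    q_unc = alpha_p / (2.0 * beta_p)       # unconstrained maximizer of alpha' q - beta' q^2
    return min(max(q_unc, q_min), q_max)   # clip to the feasible interval
```

For example, with E[α′|x] = 10, E[β′|x] = 2 and capacity limits [0, 5], the rule returns 2.5, and it saturates at the bounds whenever the unconstrained maximizer leaves the interval.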
According to predict-then-optimize strategies, the surrogate model of this problem is formulated as follows: As explained in Section 2, the traditional approach aims at learning the uncertain parameters α′_i, β′_i as a function of the available information x_i. If we assume the family of linear functions, that is, α̂′_i = w_α^T x_i and β̂′_i = w_β^T x_i with w_α, w_β ∈ R^p, and we choose the squared error as the loss function l_FO, then the standard implementation of (3) is: The optimal quantity under context X = x is the solution to the following optimization problem: Alternatively, w_α and w_β can be determined following the proposed approach by solving the following bilevel formulation: For this particular application, the bilevel optimization problem rendered by the proposed approach has a significant drawback: the global optimal solution of (40) is w_α = w_β = 0. Consequently, the lower-level problem (40b) can be replaced by the feasibility condition q̲ ≤ q̂_i ≤ q̄, and the optimal values of q̂_i are determined as if the uncertain parameters α′ and β′ were known in advance, which violates non-anticipativity. While this solution does lead to the minimum value of objective function (40a), it is useless for determining optimal decisions under any context X = x. This degenerate solution occurs because all coefficients of the objective function (37) are uncertain. Interestingly, this shortcoming does not affect the newsvendor and product placement problems, because in those applications the uncertainty only affects the feasible region.
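The FO estimation step described at the start of this passage is plain least squares followed by a plug-in decision. A self-contained sketch under the linear hypothesis class (function names are ours; rows of `X` are the feature vectors x_i):

```python
import numpy as np

def fit_fo(X, alpha_p, beta_p):
    """FO estimation: independent least-squares fits
    alpha'_i ~ w_a^T x_i and beta'_i ~ w_b^T x_i (squared-error loss)."""
    w_a, *_ = np.linalg.lstsq(X, alpha_p, rcond=None)
    w_b, *_ = np.linalg.lstsq(X, beta_p, rcond=None)
    return w_a, w_b

def q_fo(x, w_a, w_b, q_min, q_max):
    """Plug-in decision: solve the surrogate with the point predictions."""
    a_hat = float(x @ w_a)
    b_hat = float(x @ w_b)
    return min(max(a_hat / (2.0 * b_hat), q_min), q_max)
```

Note that the fitting step is completely unaware of the decision problem: it minimizes forecast error, not decision loss, which is precisely the behavior the bilevel formulation is designed to correct.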
In this paper, we propose to ensure non-anticipativity by formulating a bilevel optimization problem that considers the following modified surrogate model: where γ = α′/β′. For known values of α′ and β′, the optimal solutions of (37) and (41) coincide. However, surrogate model (41) is simpler, since it only includes one uncertain parameter instead of two. Assuming a linear relationship between the new uncertain parameter γ and the contextual information, the proposed methodology yields the following bilevel problem: Formulation (42) has the following advantages with respect to (40): i) it includes fewer parameters and is therefore less prone to overfitting; ii) it ensures non-anticipativity for any parameter vector w_γ; and iii) under certain conditions, it is able to retrieve the true model relating the random variable γ to the context X, and hence the optimal solution to (34), as the sample size |N| grows to infinity, as shown in Proposition 1 in Appendix A. By replacing the lower-level problem with its KKT conditions, we obtain the following single-level problem: where λ̲_i, λ̄_i are the dual variables corresponding to the capacity limit constraints. Notice that, if the complementarity conditions (43c) and (43d) are reformulated using the Fortuny-Amat approach, problem (43) can be solved to global optimality as a mixed-integer quadratic program using off-the-shelf optimization software. According to this procedure, optimal decisions under context X = x are made by solving: Finally, we can directly learn the optimal production as a function of the known information, as proposed in [3]. Assuming the linear mapping q̂_i = w_q^T x_i with w_q ∈ R^p, problem (13) for this particular application is formulated as: Formulation (45) is a convex quadratic optimization problem and can thus be solved using commercial software such as CPLEX.
In line with (14), the optimal quantity under context X = x is directly computed as: Although not true in general, approaches (43) and (45) may lead to the same solution under specific conditions. For instance, if the produced quantity q is not limited by minimum/maximum bounds, then constraint (43b) boils down to q̂_i = w_γ^T x_i / 2. Consequently, the solutions of (43) and (45) satisfy w_q^DR = w_γ^BL / 2 and therefore q_BL(x) = q_DR(x) for any context X = x. As we show in the following section, the decisions q_BL delivered by our approach are significantly more profitable than q_DR in the constrained case.
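The relationship between the two decision mappings is easy to verify numerically. The toy sketch below (our own example) encodes both rules and shows that, with w_q = w_γ/2, they coincide whenever the capacity bounds are inactive, while DR's linear rule can leave the feasible interval:

```python
def q_bl(x, w_gamma, q_min, q_max):
    """BL decision: surrogate optimum clip(w_gamma^T x / 2) -- always feasible."""
    q = sum(wi * xi for wi, xi in zip(w_gamma, x)) / 2.0
    return min(max(q, q_min), q_max)

def q_dr(x, w_q):
    """DR decision: direct linear rule w_q^T x -- may violate the bounds."""
    return sum(wi * xi for wi, xi in zip(w_q, x))
```

For instance, with a single feature, w_γ = (0.3,), w_q = (0.15,) and bounds [0, 1], both rules return 0.6 at x = 4, but at x = 10 DR returns the infeasible value 1.5 while BL saturates at 1.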

Numerical simulations
As an additional contribution, we assess and compare the performance of the proposed approach for the strategic producer problem using numerical simulations. In Section 5.1 we illustrate the advantages of BL with respect to FO and DR using a small example with a reduced data sample. Additionally, Section 5.2 presents the numerical results of a realistic case study that uses real data from the Iberian Electricity Market and the Spanish Transmission System Operator [36,37].

Illustrative example
This section aims at gaining insight into the performance of the proposed approach with a small example of the strategic producer problem. For the sake of simplicity, we only consider four realizations of the uncertain parameters α′_i, β′_i and a single feature x_i ∈ [0, 10], whose values are shown in Table 1. Approach FO fits affine models of α′_i and β′_i in the feature; approach BL considers γ̂_i = w^BL_{γ,0} + w^BL_{γ,1} x_i; and approach DR considers q̂_i = w^DR_{q,0} + w^DR_{q,1} x_i. These three approaches are compared with a benchmark method (BN) that assumes perfect knowledge of the uncertain parameters α′, β′ and, consequently, yields the best possible offer for each time period. Obviously, this method cannot be implemented in practice and is used here for comparison purposes only. Given the reduced size of this example, methods BL-R and BL-M provide the same results and are thus jointly referred to as BL.
First, we deal with the unconstrained case, that is, the case in which the capacity constraints are disregarded. Table 2 shows the in-sample results obtained from methods BN, FO, DR, and BL, namely, the optimal production quantity q_i for each time period, the absolute income (I), and the relative income with respect to the benchmark (RI). Notice that the income for each time period can be computed as −β′_i q_i^2 + α′_i q_i. As discussed in Subsection 4.3 in connection with the unconstrained case, coefficients w_DR are equal to w_BL/2 and, as a result, the decisions and incomes obtained by DR and BL are the same. It is also interesting that the income of these two methods is 5% higher than that of FO. To explain this, we refer to Fig. 1a, which depicts the optimal production quantities given by the different methods as a function of the context x ∈ [0, 10], as stated in (47). This figure shows that methods BL and DR return decisions much closer to the benchmark ones than method FO for the four data points in the sample. Therefore, even for unconstrained optimization problems, the proposed methodology may outperform the classical "first-predict-then-optimize" approach, which is purely based on reducing the error of forecasting the uncertain parameters, simply because minimizing this error is not necessarily aligned with maximizing the decision value.

Now we consider the constrained case, that is, we bring the capacity constraints back into this small example. In particular, the minimum and maximum outputs of the strategic producer are set to 0 and 1, respectively. Similarly to Table 2, the in-sample results obtained in the capacity-constrained case are collated in Table 3, where we can see that the optimal quantity q_i reaches its maximum value for some time periods and methods FO, DR and BL all provide different results. Methods FO and DR achieve incomes 7.5% and 8.2% lower than the benchmark, respectively.
This poor in-sample performance is better understood by means of Fig. 1b, which, similarly to Fig. 1a, represents the optimal quantities as a function of the context in the constrained case according to (39), (44) and (46). First, since method FO is unaware of the feasible region of the original conditional stochastic problem, it provides the same prediction of the uncertain parameters α, β in the unconstrained and constrained cases. However, using these forecasts in the surrogate model (37) enforces q = 1 for x ≥ 7.1 in the constrained case. As observed, reducing the forecast error of α, β does not lead to the maximization of the decision value in the constrained case either. Second, method DR must ensure feasible solutions for all samples, a condition that also leads to quite poor approximations of the optimal quantities for most values of the context x. Furthermore, this approach would return infeasible solutions q > 1 for x > 9, as shown in Fig. 1b. On the contrary, the proposed approach BL can find a linear relation between γ and x to be used in the surrogate model (41) that results in decisions q that perfectly match those provided by the benchmark for the four data points; therefore, this method achieves the highest possible in-sample income. In summary, this small example sheds light on the reasons why the proposed methodology outperforms existing ones for both unconstrained and constrained optimization problems under uncertainty: our approach provides forecasts of the uncertain parameters that take into account the objective function and the feasible region of the decision maker. Such enhanced forecasts translate into decisions that are much closer to those obtained in the ideal perfect-information instance.
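The income and RI metrics used throughout this example can be sketched in a few lines (toy numbers of our own, not the values of Tables 2-3):

```python
def income(q, alpha_p, beta_p):
    """Per-period income: -beta' * q^2 + alpha' * q."""
    return -beta_p * q * q + alpha_p * q

def relative_income(q_method, q_benchmark, scenarios):
    """RI: total income of a method over the benchmark's total income.
    scenarios is a list of (alpha', beta') realizations."""
    total = sum(income(q, a, b) for q, (a, b) in zip(q_method, scenarios))
    total_bn = sum(income(q, a, b) for q, (a, b) in zip(q_benchmark, scenarios))
    return total / total_bn
```

With a single realization (α′, β′) = (4, 1), the benchmark offer q = 2 earns 4 while an offer of q = 1 earns 3, giving RI = 0.75.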

Case study
In this section, we compare the proposed approach with existing ones using realistic data from the Iberian electricity market, as described in detail in Section 5.2.1. Sections 5.2.2, 5.2.3 and 5.2.4 investigate how the type of generation portfolio, the quadratic cost term c 2 , and the residual demand elasticity impact the performance of the proposed methodology, respectively. These three subsections only include the global optimal solutions given by method BL-M. Finally, Section 5.2.5 provides computational solution times for all the approaches and discusses the differences between BL-R and BL-M in that respect.

Experimental setup
In order to test our proposal, we consider a realistic case study based on actual data from the Iberian electricity market. We construct a data set of the form {(x i , α i , β i ), ∀i ∈ N } from which we derive the rest of the parameters required for our simulations as explained in Section 4.3. We gather raw market data from the forward market OMIE [36] to compute parameters α i , β i of the inverse demand function. Furthermore, we collect wind and solar power forecasts of the aggregated production of Spain to be used as a vector of contextual information x i . The wind and solar forecasts, originally published by the Spanish TSO, are downloaded from the ENTSO-e Transparency Platform [37].
Historical raw hourly block-wise bids and offers submitted by buyers and sellers to the Iberian day-ahead energy market are processed to obtain parameters α_i, β_i as follows. For each hour of the year, we have access to the set of bids and offers defined as {(q_b, p_b), ∀b ∈ B} and {(q_o, p_o), ∀o ∈ O}, respectively, where q_{b/o} is the amount of energy to buy/sell at price p_{b/o}. Thus, the residual demand r to be potentially covered by a new producer entering the market at each possible price p is defined as r(p) := Σ_{b∈B: p_b ≥ p} q_b − Σ_{o∈O: p_o ≤ p} q_o, that is, the aggregated demand minus the aggregated production. The stepwise function relating the residual demand r and the electricity price p is plotted in Fig. 2 for illustrative purposes. Now consider that a new strategic producer enters the market with an offer to sell quantity q at offer price 0. If we assume that the remaining bids and offers stay unaltered, the market price would decrease following the right-hand part of the step-wise function depicted in Fig. 2. Therefore, a strategic producer aiming at maximizing her profit is interested in modeling the dependence between her offered quantity q and the market price p in the shaded area, with parameter δ being a constant sufficiently larger than the producer's maximum generation capacity. In connection with Section 4.3, we approximate this dependence using a linear function p_i(q) = α_i − β_i q, as illustrated in Fig. 2; the values of α_i, β_i for each hour are therefore obtained by determining the linear function that best approximates the blocks shaded in gray.
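The construction of r(p) and the subsequent linear fit can be sketched as follows. This is a simplified illustration of our own: we fit p = α − βq by ordinary least squares over sampled points of the residual demand curve, whereas the paper's exact block-wise fitting procedure may differ in detail:

```python
import numpy as np

def residual_demand(p, bids, offers):
    """r(p): demand bids with price >= p minus supply offers with price <= p.
    bids/offers are lists of (quantity, price) pairs."""
    return (sum(q for q, pb in bids if pb >= p)
            - sum(q for q, po in offers if po <= p))

def fit_inverse_demand(qs, ps):
    """Least-squares fit of p = alpha - beta * q to sampled curve points."""
    A = np.column_stack([np.ones(len(qs)), -np.asarray(qs, dtype=float)])
    coef, *_ = np.linalg.lstsq(A, np.asarray(ps, dtype=float), rcond=None)
    alpha, beta = coef
    return float(alpha), float(beta)
```

Fitting points that lie exactly on a line p = 50 − 2q recovers (α, β) = (50, 2), and in practice the fit is restricted to the shaded region of width δ around the market-clearing point.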
We collect data from November 2018 to October 2019 in order to build a data set of 8600 hours (almost one year), which is divided into 43 bins of 200 consecutive samples. Each bin is randomly split into training and test sets with a ratio of 80%/20%, respectively. This process is repeated five times for each bin. Therefore, each approach is solved for 215 different training sets of 160 samples, and the obtained solutions are evaluated using the corresponding 215 test sets of 40 samples each. The out-of-sample results provided in Sections 5.2.2, 5.2.3, 5.2.4, and 5.2.5 are obtained by averaging the outcomes over these 215 test sets. We select a value of δ equal to 5 GW in order to encompass enough bids and offers to obtain accurate approximations of p_i(q) throughout the whole data set. We determine the optimal parameters w through problems (38), (43), and (45), which we denote FO, BL and DR, respectively. More specifically, we name BL-M the Fortuny-Amat big-M reformulation of model (43) and BL-R its regularized counterpart, which are particularizations of (17) and (18), respectively.
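The bookkeeping of this experimental design (43 bins of 200 consecutive hours, each split 80%/20% five times, for 215 train/test pairs) can be sketched as follows; the function name and seed are ours:

```python
import random

def make_splits(n_hours=8600, bin_size=200, repeats=5, train_frac=0.8, seed=0):
    """Return 43 bins x 5 repeats = 215 (train, test) index pairs,
    each bin of 200 consecutive hours split 160/40 at random."""
    rng = random.Random(seed)
    splits = []
    for start in range(0, n_hours, bin_size):
        idx = list(range(start, start + bin_size))
        for _ in range(repeats):
            shuffled = idx[:]
            rng.shuffle(shuffled)
            cut = int(train_frac * bin_size)
            splits.append((shuffled[:cut], shuffled[cut:]))
    return splits
```

Keeping the bins contiguous in time preserves seasonal structure within each bin while the random 80/20 split controls for sampling variability across the five repetitions.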
Each bin is executed in parallel with the following resources: 1 CPU (Intel E5-2670 @ 2.6 GHz) and 1 GB of RAM. Each instance of model BL-M is solved using the MIQP solver CPLEX [38] with a maximum time of 20 minutes or a relative gap of 10^{-8}. On the other hand, BL-R is executed using the NLP solver CONOPT [39] without time limit.

Table 4. Generation technology data.

Impact of the generation portfolio
As previously stated, the main advantage of our approach is that it yields forecast values for the uncertain parameters that are tailored to the optimization problem by which the strategic power producer determines her optimal market sale. However, such an advantage may translate into higher or lower incomes depending on the firm's generation portfolio. In this section, therefore, we evaluate the performance of the various approaches for three generic power plants characterized by different linear costs (c_1) and capacities (q̄). Table 4 provides the values of c_1, c_2 and q̄ for these three generic units. For simplicity, the minimum output q̲ of all units is assumed equal to 0 and the value of c_2 is set to 0.005 €/MWh² [35]. The base unit can represent a nuclear power station and is characterized by a low fuel cost and a high capacity. The medium unit can be, for example, a coal-based power station with a lower capacity and higher fuel costs. Finally, peak units, such as combined cycle power plants, typically have the highest fuel cost and a smaller generation capacity. Table 4 also includes the percentage of time periods in which q_BN = 0, 0 < q_BN < q̄ and q_BN = q̄, denoted N_{q_BN=0}, N_{0<q_BN<q̄} and N_{q_BN=q̄}, respectively, where q_BN represents the optimal quantity that the strategic firm would place into the market under the true inverse demand function (that is, the solution given by the benchmark approach). It is observed that the base unit generates at maximum capacity for most time periods and is only shut down in 8% of the cases. The medium generating unit is idle 32% of the time (when prices are too low) and is at maximum capacity in 39% of the time periods. Finally, the peak unit is not dispatched most of the time, since electricity prices are usually below its marginal production cost.
A probably more interesting remark relates to the impact of the uncertainty about the inverse demand function on the market revenues accrued by each generating technology. Since the base unit is at full capacity most of the time, the uncertainty pertaining to the residual demand does not affect its revenues much, and the three methods obtain relative incomes above 94%. On the contrary, the participation of the medium and peak units highly depends on market conditions and therefore this very same uncertainty remarkably deteriorates market revenues, with the eventual result that the maximum relative incomes amount to 80% and 59%, respectively, for the best-performing method (which is BL-M). On a different front, the DR approach produces infeasible offers in a considerable number of time periods, whereas FO and BL-M are guaranteed to provide feasible production quantities in all cases. The percentage of periods for which method DR results in an infeasible q is higher for the base unit because the medium and peak units are idle more frequently. For this particular application, making DR decisions feasible can be easily achieved by computing min(max(q_i, q̲), q̄). However, this post-processing step to guarantee feasibility can be much more challenging in applications with general convex feasible sets. It is also apparent that the DR approach provides the lowest RI in the three cases considered; therefore, this method is not recommended even for decision-making models where the decision vector is simply bounded component-wise.
Finally, we notice that, for the three generation technologies, the proposed method BL-M always provides higher incomes than the FO approach. However, relative income improvements vary widely across cases. For the base unit, the relative income of BL-M is only 0.2% higher than that of FO. This is understandable since this power plant is at full capacity most of the time and thus the impact of the uncertainty is comparatively minor, as mentioned before. For the peak unit, in contrast, the relative income of BL-M is 14.6% higher than that of FO. Note that, unlike for base units, making small errors in the forecasts of the market conditions can be catastrophic for peak units, because such deviations may mean the difference between producing nothing and producing at maximum capacity. The ability of BL-M to reduce the forecast error when the consequences are worse, together with the lower absolute incomes of peak units, explains this large percentage difference. The gain of BL-M with respect to FO for the medium unit has an intermediate value of 2.6%.

Table 6. Income distribution for the peak generating unit (out-of-sample results).
To conclude this section, Table 6 includes, for the peak generating unit, the percentage of periods with a positive income, with a negative income and with an income equal to zero, denoted as N I>0 , N I<0 and N I=0 , in that order. The total sum of positive and negative incomes is also provided in the last two rows, represented by the symbols I + and I − , respectively. Interestingly, BL-M achieves the highest percentage of periods with a positive income and succeeds in providing the highest value of I + .

Impact of parameter c 2
While parameter c_1 basically depends on the cost of the fuel used by each unit, the interpretation of c_2 is not as straightforward. Indeed, this parameter measures the increase in the plant's marginal cost as production grows and is connected to technological aspects of the plant's economy of scale, such as the way the plant's efficiency varies across operating points. For this reason, in this section we investigate the impact of c_2 on the performance of the proposed method. Notice that, if q̲ = 0 MW, the unit's marginal cost ranges from c_1 to c_1 + c_2 q̄. In a similar way as Table 4, Table 7 shows the operating regime of a medium generating unit with c_1 = 35 €/MWh, q̄ = 500 MW and different values of c_2. As expected, a decrease in c_2 entails a reduction in the marginal production cost of the plant and, as a result, the amount of electricity the strategic firm places into the market increases. Table 8 provides the same results as Table 5, but for different values of c_2 and for the medium generating unit only. Naturally, reducing the plant's marginal costs increases both the absolute income of the benchmark approach and the relative income achieved by all methods. Nevertheless, BL-M proves to be between 1.8% and 2.6% more profitable to the producer than the traditional FO approach for the values of c_2 considered.

Impact of the residual demand elasticity
So far, we have centered our study on the cost structure of the generation portfolio owned by the strategic firm. Here, on the contrary, we focus on the elasticity of the market residual demand. Roughly speaking, this elasticity is inversely proportional to parameter β of the inverse demand function. Bearing this in mind, we compare the following two market situations, namely, the "Normal" and the "Low-elast" instances. The former corresponds to the values of β in the original data set, while the latter is obtained by multiplying these β-values by two.

Table 9. Impact of residual demand elasticity (out-of-sample results).
Table 9 reports incomes relative to those of the benchmark; the numbers correspond to the medium power plant of Table 4. The overall effect of increasing the residual demand elasticity (lower β-values) is analogous to that of decreasing parameter c_2, i.e., the involvement of the strategic producer in the market augments, thus leading to higher revenues. Results in Table 9 show that the proposed BL approach outperforms FO and DR for different values of the residual demand elasticity, improving the competitive edge of the strategic producer by more than 2% with respect to FO in terms of relative income.

Computational results
In Sections 5.2.2, 5.2.3 and 5.2.4 we have only included results from BL-M, and not from BL-R, because the former variant of the proposed bilevel framework guarantees global optimality for the strategic producer problem for appropriate values of the large constants M_P, M_D. However, solving model BL-M can be computationally very expensive. Alternatively, locally optimal solutions of the proposed bilevel model (43) can be efficiently found by way of the particularization of the regularization approach (18) that we named BL-R.
Next, we first compare the solutions given by methods BL-M and BL-R. In order to solve model BL-R, we iteratively shrink the regularization parameter ε, taking values from the discrete set {10^6, 10^4, 10^2, 1, 10^{-1}, 10^{-2}, 0}. In each iteration, we initialize the model with the solution of the previous problem. It is also worth mentioning that method BL-M is warm-started with the solution delivered by BL-R.
Results in Table 10 compare the relative incomes of BL-M and BL-R for each generating unit whose data are collated in Table 4. Although method BL-R logically yields lower incomes, the differences with respect to BL-M are below 0.8%. This means that near-optimal solutions of model (43) (within 0.8% of optimality) can be efficiently computed by solving the regularized NLP version of our approach. Finally, we compare the computational burden of methods FO, DR, BL-M, and BL-R. The average simulation times invested in solving problems (38), (43) and (45) for the three generation technologies are indicated in Table 11, where the maximum solution time has been limited to 20 minutes for all methods. These results highlight the higher computational burden required by BL-M to ensure global optimality. On the other hand, the computing times of BL-R are very affordable, especially considering the competitive edge that this method gives to the strategic firm.

Conclusions
In this paper, we have addressed the problem of data-driven decision-making under uncertainty in the presence of contextual information. More precisely, our ultimate purpose has been to construct a parametric model to predict, based on some covariate information, the uncertain parameters that are input to the optimization model by which the decision is made. To this end, we have proposed a bilevel framework whereby such a parametric model is estimated taking into account the impact of its outputs on the feasibility and value of the decision. Under convexity assumptions, we have provided two single-level reformulations of the bilevel program, namely, a non-linear regularized optimization problem and a mixed-integer non-linear reformulation based on the use of large enough constants. When compared to alternative approaches available in the technical literature, ours features two major advantages: it guarantees feasibility in constrained decision-making problems, and its solution can be directly tackled using off-the-shelf optimization solvers under convexity assumptions.
We have theoretically compared our approach with existing ones for three different applications, namely, the newsvendor problem, the product placement problem, and the strategic producer problem. Additionally, we have evaluated the performance of our approach and its practical relevance through a realistic case study of a strategic producer that participates in the Iberian electricity market. Specifically, numerical results show that our framework not only significantly increases the revenue streams of the firm in general, but also proves to be critical for generation portfolios mainly consisting of peak power units. Indeed, the market revenues of a strategic peak generation portfolio are especially sensitive to the uncertainty in the inverse demand function. Therefore, in this case, the strategic firm may put the bulk of its market income at risk by being left out of the market or trading at a loss. Our approach, however, is by construction aware of that sensitivity and is thus able to retain most of the profit the firm would make under a perfectly predictable inverse demand function.
Potential extensions of this work would include the use of more advanced techniques in the resolution of our bilevel framework such as those employed in more general MPCC problems. Likewise, the generalization of our approach to multi-stage decision-making problems under uncertainty requires further analysis.
Promotion of Talent and its Employability in R&D&I, within the framework of the State Plan for Scientific and Technical Research and Innovation 2017-2020 and by the European Social Fund. Finally, the authors thankfully acknowledge the computer resources, technical expertise, and assistance provided by the SCBI (Supercomputing and Bioinformatics) center of the University of Málaga.
Appendix A. Asymptotic consistency

Proposition 1. Let S = {(x_i, α′_i, β′_i), ∀i ∈ N} be an i.i.d. sample of size N and suppose that there exists a linear relationship between α′ and β′ > 0 given by α′/β′ = a^T x + ξ, with ξ being a zero-mean noise independent of x, α′ and β′, and that the expectations E[α′], E[β′] and E[α′x] are all finite. Then, it almost surely holds in the limit N → ∞ that the optimizer of problem (A.1), with W ⊂ R^p being a compact set containing a, is attained at w_γ = a.
The true expectation problem associated with the sample average approximation (A.1) is given by: where Q is the joint probability law governing the random parameters β′ and α′ and the feature vector X. We first show that a is the unique solution to problem (A.2). To this end, we note that the lower-level problem (A.2b) renders the decision mapping q(x; w_γ) = min{max{w_γ^T x / 2, q̲}, q̄} for almost all x ∈ X, which is a continuous function of w_γ. Now let Q_X be the probability measure of X. Consider the following optimization problem, which is a relaxation of (A.2):

min_{q(x)∈[q̲,q̄], ∀x∈X} ∫_{X×R_+×R} (β′ q^2(x) − α′ q(x)) Q(dx, dβ′, dα′) = min_{q(x)∈[q̲,q̄], ∀x∈X} ∫_X (E[β′|x] q^2(x) − E[α′|x] q(x)) Q_X(dx).

The inner pointwise minimum results in the optimal decision rule q*(x) = min{max{E[α′|x] / (2E[β′|x]), q̲}, q̄} for almost all x ∈ X. Therefore, since w_γ = a is feasible in the true expectation problem (A.2), it is also an optimal solution to this problem. Furthermore, this solution is unique if there exists a subset of X with measure greater than zero such that q̲ < E[α′|x] / (2E[β′|x]) < q̄. In addition, note that all the samples in S are i.i.d. and that β′q^2(x) − α′q(x) is dominated by the function max{β′q̲^2 − α′q̲, β′q̄^2 − α′q̄, α′^2/(4β′)}, which is integrable because the expectations E[α′], E[β′] and E[α′x] are all finite. Indeed, since α′/β′ = a^T x + ξ by assumption, we have that E[α′^2/(4β′)] = (a^T/4) E[α′x]. Therefore, by invoking Theorems 5.3 and 7.48 in [40], the minimizer of the sample average approximation problem (A.1) converges to a almost surely as the sample size N grows to infinity.