Asynchronous Lagrangian scenario decomposition

We present a distributed asynchronous algorithm for solving two-stage stochastic mixed-integer programs (SMIP) using scenario decomposition, aimed at industrial-scale instances of the stochastic unit commitment (SUC) problem. The algorithm is motivated by large differences in run times observed among scenario subproblems of SUC instances, which can result in inefficient use of distributed computing resources by synchronous parallel algorithms. Our algorithm performs dual iterations asynchronously using a block-coordinate subgradient descent method that allows block-coordinate updates to be performed with delayed information, while candidate primal solutions are recovered from the solutions of scenario subproblems using heuristics. We present a high performance computing implementation of the asynchronous algorithm, detailing the operations performed by each parallel process and the communication mechanisms among them. We conduct numerical experiments using SUC instances of the Western Electricity Coordinating Council system with up to 1000 scenarios and of the Central Western European system with up to 120 scenarios. We also conduct numerical experiments on generic SMIP instances from the SIPLIB library (DCAP and SSLP). The results demonstrate the general applicability of the proposed algorithm and its ability to solve industrial-scale SUC instances within operationally acceptable time frames. Moreover, we find that an equivalent synchronous parallel algorithm would leave cores idle up to 80.4% of the time on our realistic test instances, an observation which underscores the need for designing asynchronous optimization schemes in order to fully exploit distributed computing in real-world applications.


Motivation
The unit commitment problem is a classical problem in the short-term scheduling of electric power systems. Unit commitment deals with deciding which generating units will supply energy to a power system over a certain time horizon, so as to minimize the operation cost while respecting the technical constraints of the power system. The problem is usually formulated as a mixed integer linear program (MILP) and it is solved on a daily basis by power system operators worldwide.
In recent years, the integration of significant shares of renewable energy resources into electric power systems has motivated the industry and academic community to investigate efficient approaches for scheduling power systems in the presence of uncertainty (see [50] and references therein for a recent survey on the subject). Stochastic unit commitment (SUC) is a widely studied approach to incorporate uncertainty in this scheduling problem. SUC can be formulated as the two-stage stochastic mixed integer program (SMIP) (1)-(4), where i indexes scenarios, u is the vector of non-anticipative decision variables, U ⊂ R n is a bounded convex set, v i are local copies of non-anticipative variables at each scenario, w i are recourse variables of each scenario, D i ⊂ R n+m i , i = 1, . . ., N are bounded non-convex sets and constraints (4) model non-anticipativity constraints.
We assume that the problem satisfies P v i (D i ) = P v j (D j ) for any i, j ∈ {1, . . ., N }, where P v i is the projection of D i on the coordinates of v i . We later refer to this property as weak relatively complete recourse. Typically, v i corresponds to commitment variables of thermal generators (binary decisions), but it can also include production variables of inflexible generators (continuous decisions). The vector w i includes commitment variables of fast generators, production variables of all generators and flows over the network (mixed integer decisions). D i describes the feasible operation domain for scenario i in terms of production constraints (minimum stable level, maximum capacity, maximum ramp rates, minimum up/down times) and power grid constraints (power balance, power flow equations, flow limits). U corresponds to a convex relaxation of the production constraints for variables included in u, U ⊇ P v i (D i ) for an arbitrarily chosen i ∈ {1, . . ., N }. See Appendix A for a detailed description of the SUC model. Our aim is to solve problem (1)-(4) for real power systems, within the time limits imposed by daily operations. This differentiates SUC from other applications of stochastic programming in that the typical scale of realistic SUC instances (see [18,42,53]), as measured by the number of variables and the number of constraints required to describe U and D i , is orders of magnitude larger than that of commonly used SMIP test instances [4]. Furthermore, the computational effort required for optimizing over D i can vary significantly from one scenario to another, as well as for the same scenario with slight modifications. Therefore, a parallel implementation of a serial decomposition scheme may perform inefficiently in practice.
In order to overcome these challenges, in this paper we propose an asynchronous distributed algorithm for solving (1)-(4), and we present a high performance computing implementation of the algorithm which is used for solving SUC instances of two industrial-scale systems and the largest instances in the SIPLIB collection [4].

Relevant literature
Stochastic unit commitment was initially proposed in the seminal work of Takriti et al. [51] and Carpentier et al. [15] as a methodology for coping with demand uncertainty in power systems. The scope for application of SUC has, since then, been extended to renewable energy forecast uncertainty and component failures (referred to as contingencies in the power engineering literature), among other sources of uncertainty in power system operations.
SUC studies commonly use decomposition methods for handling the scale of the problem, which increases linearly with the number of scenarios. Takriti et al. [51] use Lagrangian relaxation of non-anticipativity constraints (i.e. scenario decomposition), and a progressive hedging heuristic to obtain non-anticipative solutions. Carpentier et al. [15] relax the problem over generators using an augmented Lagrangian. They maximize the dual function with a proximal point method and recover primal solutions by allowing demand shedding at a quadratic penalty. Shiina and Birge [48] use a column generation approach over production schedules, decomposing over generators and solving the slave production scheduling problems using dynamic programming. Cerisola et al. [16] compare scenario decomposition with variants of Benders decomposition and stress the importance of finding good initial solutions. Additional decomposition approaches for SUC can be found in [50].
SUC instances in the literature have often been limited to test systems which fall short of industrial scale instances. In order to advance towards more realistic instances, several recent studies exploit distributed computing alongside decomposition methods. Papavasiliou et al. [42] implement scenario decomposition in an HPC cluster to solve instances of the Western Electricity Coordinating Council (WECC) system with up to 1000 scenarios in at most 24 h within an optimality gap of 1-2.5%. Cheung et al. [18] decompose the problem by scenarios and use progressive hedging on a multi-processor workstation and on an HPC cluster to solve instances of the WECC system with up to 100 scenarios in at most 25 min within an optimality gap of 1.5-2.5%. Kim and Zavala [28] propose an interior point cutting-plane algorithm to handle dual iterations in scenario decomposition and solve SUC instances based on the IEEE 118-bus test system with up to 32 scenarios, using an HPC cluster, in 6 h within an optimality gap of 0.01%.
Other recent methods, which do not exploit distributed computing explicitly, have also been used to solve large SUC instances. Among them, Schulze et al. [47] use a stabilized Dantzig-Wolfe decomposition to solve multi-stage SUC instances of the British system with up to 50 scenarios in 2 h within an optimality gap of 0.1%. van Ackooij and Malick [53] use primal-dual decomposition along with bundle methods to solve SUC instances of the French system with up to 250 scenarios within an optimality gap of 1%; however, no solution times are reported.
As our research is focused on making it practical to solve industrial-scale SUC instances, (i) we present a method capable of solving SUC instances faster than the state-of-the-art, (ii) we release all the industrial-scale test instances used in this study and (iii) we provide the time it takes to solve them with the proposed method.These three elements are not found simultaneously in any of the aforementioned studies.
In the broader SMIP class, to which SUC belongs, scenario decomposition has inspired several different decomposition algorithms. Carøe and Schultz [14] propose a scenario decomposition method where non-anticipativity constraints are gradually enforced through a branch-and-bound algorithm. Lubin et al. [31] extend the previous method to a parallel computing setting by solving the dual problems at each node of the branch-and-bound tree using an interior point solver that exploits the structure of the problem. Oliveira et al. [40], and references therein, use scenario decomposition and solve the dual problem using bundle methods. Ahmed [1,2] proposes an alternative approach for stochastic integer programs where solutions to scenario subproblems are directly used for primal recovery while, at the same time, these solutions are separated from subproblems in order to improve the bound of the relaxation. Ryan et al. [46] develop several improvements over Ahmed's original algorithm, including a distributed asynchronous implementation.
Distributed computing has also found applications in stochastic programming outside the realm of scenario decomposition methods. Lubin et al. [30] propose a distributed memory simplex algorithm for stochastic linear programs. Munguía et al. [34] propose a branch-and-bound algorithm for stochastic mixed integer programs in which each node of the tree is solved using Lubin's method. Moritsch et al. [33] propose a nested asynchronous decomposition algorithm for multistage stochastic linear programs. Chaturapruek et al. [17] propose and prove optimal convergence for asynchronous stochastic gradient descent methods on unconstrained stochastic programs with strongly convex and differentiable objective functions.
A crucial aspect of scenario decomposition is the method used to optimize the dual function, which is separable, convex and non-differentiable. Certain specialized methods allow exploiting the separable structure of the dual function in a distributed computing infrastructure. Nedić et al. [35][36][37] analyze incremental subgradient algorithms, in which each update is made along the direction of the subgradient of a part of the objective function. Coordinate descent methods are a different approach to exploit the structure of minimization problems for which it is cheaper to compute the gradient with respect to a subset of variables (coordinates) than it is to compute the full gradient of the objective. Wright [55] provides a recent survey on coordinate descent methods. Nesterov [39] provides worst-case complexity results for randomized coordinate descent methods for smooth optimization. Fercoq and Richtárik [23] propose a parallel synchronous coordinate descent method for minimizing non-differentiable simple composite functions. The authors use Nesterov's smoothing technique [38] to obtain a smooth approximation of the non-decomposable part of the objective and perform a line search separately on each coordinate at each iteration, as proposed by Tseng [52]. Fischer and Helmberg [24] propose an asynchronous distributed bundle method for non-differentiable convex optimization, where at each step a worker greedily selects a subset of variables, blocks them from being accessed by the other workers, and performs a proximal bundle iteration on the selected variables.

Contributions
Stochastic unit commitment, despite its attractiveness, has failed to become an industry standard for several reasons, including the difficulty of solving the resulting mathematical programs in an operationally acceptable time frame. In the present paper, we aim at overcoming this challenge by developing an asynchronous scenario decomposition scheme based on distributed computing. The main innovation of the developed scheme is that we optimize the dual function using an asynchronous block-coordinate subgradient algorithm. Our algorithm does not require the differentiability or strong convexity assumptions that are commonly used in the literature [17,29,39,55]; it differs from the algorithm of Fercoq and Richtárik [23] in that we perform iterations asynchronously and without the need for a line search, which would be prohibitive in our context; and it requires less serial coordination overhead than the asynchronous bundle method [24]. We provide convergence guarantees for the proposed algorithm based on previous results for stochastic subgradient methods [21] and incremental subgradient methods [35][36][37].
We also perform primal recovery asynchronously and in parallel to the dual iterations, either by recovering solutions from scenario subproblems, as proposed in [1,2], or by using recombination heuristics, similar to those proposed in [14]. The proposed asynchronous algorithm allows us to solve limited-size SUC instances faster than the state-of-the-art [18] and to solve SUC instances for systems larger than the state-of-the-art [53] within operationally acceptable time frames.
Even though the proposed algorithm is inspired by and tailored to SUC, the proposed framework for asynchronous dual decomposition can be applied in other contexts, such as temporal or spatial decomposition of unit commitment.
In order to make our numerical experiments replicable, we release all SUC instances used in the present study [9] (WECC and CWE, see Sect. 6) along with the source code of our implementation [10].

Notation and paper organization
Throughout this document we use boldface to denote vectors, lowercase letters to denote variables and uppercase letters to denote parameters or sets. Additionally, we use partial indexation of vectors to keep notation simple, i.e.
The rest of the paper is organized as follows. Section 2 introduces the stochastic unit commitment problem and its scenario decomposition in a stylized fashion. Section 3 presents the asynchronous distributed block-coordinate subgradient method and provides convergence results for the dual iterations. Section 4 describes our primal recovery heuristics. Section 5 describes the HPC implementation of the dual algorithm, where we also cover aspects of communications and load balancing. Section 6 presents the numerical results for the WECC and Central Western European (CWE) systems. Finally, Sect. 7 presents the conclusions of the study and points to directions of future research.

Scenario decomposition in stochastic programming
Problem (1)-(4) is a general formulation for two-stage mixed integer stochastic programs with finitely many scenarios and bounded feasible sets [32].
Scenario decomposition schemes for solving (1)-(4) work by solving the dual problem (5) and generating primal solutions based on the solution to subproblems (6) and (7). The objective of problem (5) has a separable structure. Moreover, by evaluating the component functions f 0 and f i , i = 1, . . ., N at a certain x, we obtain a subgradient of the objective at x. These two properties motivate the use of subgradient algorithms for solving (5) and evaluating the component functions in parallel in order to speed up the algorithm [42]. A third important property of problem (5) should be carefully considered, namely, the differences in evaluation times between component functions.
In our context, the evaluation of f i at a certain x i , i = 1, . . ., N , requires the solution of a large mixed integer linear problem, while the evaluation of f 0 at a certain x requires solving a medium-size linear program; hence the evaluation of f 0 requires only a small fraction of the time that it takes to evaluate f i for any i. In addition, for a given x and two component functions f i and f j , the evaluation times of f i and f j can be dramatically different (we have observed differences of more than 7500%). These differences in evaluation time can also arise for the same component function evaluated at two different iterates. Altogether, these differences can render synchronous parallel algorithms ineffective because the time between iterations is limited by the component function which requires the greatest time to be evaluated. The main aim of the present work is to overcome this limitation.
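To make this effect concrete, the following toy simulation (hypothetical log-normal run times, not measurements from the paper's instances) estimates the fraction of core time a synchronous scheme would waste waiting for the slowest subproblem in each round:

```python
import random

random.seed(0)
N, ROUNDS = 16, 50   # hypothetical: 16 scenario subproblems, 50 dual iterations

busy = idle = 0.0
for _ in range(ROUNDS):
    # heavy-tailed run times mimic the large per-scenario variability
    times = [random.lognormvariate(0.0, 1.0) for _ in range(N)]
    wall = max(times)                # a synchronous round ends with the slowest core
    busy += sum(times)
    idle += N * wall - sum(times)    # everyone else waits for the slowest subproblem

frac = idle / (busy + idle)
print(f"estimated idle fraction: {frac:.1%}")
```

Even this crude model produces substantial idle fractions, which is the inefficiency the asynchronous scheme is designed to remove.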

Asynchronous distributed block-coordinate subgradient method
In this section we present a minimization method that exploits the special structure of (5) in order to effectively harness distributed computing. The proposed method has been inspired by previous work on incremental subgradient algorithms by Nedić et al. [35][36][37].
We replace the non-decomposable component function f 0 by a smooth approximation f μ 0 , defined in Eq. (8), as done in [23].
We show in Sect. 3.1 that smoothness of the non-decomposable part of the objective is necessary for the convergence of our method. Given μ > 0 and a certain u 0 ∈ U, f μ 0 is a convex differentiable function and its gradient has a Lipschitz constant L μ 0 [38, Thm. 1]. We then focus on solving problem (9), where X is a convex set onto which it is easy to project.
The solution of problem (9) will be an approximate solution to problem (5) [38]. Note that problem (9), like problem (5), is a non-differentiable convex optimization problem.
For ease of presentation, we first introduce a serial variant of the proposed method.We then generalize it to an asynchronous distributed setting.Proofs for all results presented in this section are provided in Appendix B.

Serial method
Let us consider a block version of the randomized coordinate descent method [39] for minimizing f . At iteration k, with current iterate x k , we select uniformly at random a block j(k) from {1, . . ., N } and compute the next iterate x k+1 using update rule (10). Here, λ k is the step size, g( j, x) ∈ ∂ f j (x) and I j is a matrix that maps ∇ f μ 0 to its components relevant to block j, with 1 n j ×n j corresponding to the identity matrix. Simply put, rule (10) updates the multipliers by moving them in the opposite direction to a subgradient of the objective function, restricted to the coordinates related to scenario j(k).
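A minimal serial sketch of this scheme, under illustrative assumptions (`grad_f0_mu` and `subgrad_f` are invented stand-ins for the smoothed first-order oracle and the scenario subproblem oracles; the toy problem data is not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 4, 3                  # toy sizes: 4 scenario blocks, each of dimension 3
x = np.zeros((N, n))         # x[j] holds the multiplier block of scenario j

def grad_f0_mu(x):           # illustrative stand-in for the smoothed oracle's gradient
    return x.mean(axis=0, keepdims=True).repeat(N, axis=0)

def subgrad_f(j, xj):        # illustrative stand-in for the scenario-j subgradient oracle
    return np.sign(xj - j)

for k in range(200):
    j = int(rng.integers(N))            # select block j(k) uniformly at random
    lam = 1.0 / (k + 1)                 # diminishing, non-summable, square-summable
    # rule (10): move block j opposite to its piece of a subgradient of the objective
    x[j] -= lam * (grad_f0_mu(x)[j] + subgrad_f(j, x[j]))
```

Only the coordinates of block j(k) change at each iteration, which is what later makes the asynchronous extension natural.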
Although f is a non-differentiable function, update rule (10) is equivalent to the update rule of the stochastic subgradient method, as shown in the following proposition.
Proposition 1 Let J be a discrete uniform random variable on the set {1, . . ., N }. The expected direction of update rule (10) coincides with the direction of a subgradient of f at x k .
Note that smoothness of f μ 0 is a necessary condition for Proposition 1. This can be understood intuitively by observing that the expectation in Proposition 1 constructs the expected update direction by concatenating subgradients of f restricted to blocks of coordinates, which are obtained from different calls to a first-order oracle. The concatenation of blocks of coordinates of different subgradients of a non-separable function (e.g. Σ N j=1 I T j I j g j , where g j ∈ ∂ f 0 (x), j = 1, . . ., N ) is not necessarily a subgradient of the function (see Proposition 4, in Appendix B), whereas the concatenation of blocks of coordinates of the gradient of a differentiable function results in the gradient of that function. The stochastic subgradient method and its convergence have been well studied in the literature; see for instance [21, Thm. 2]. By extension, with an appropriate selection of the step size λ k (diminishing, nonsummable, square summable), the method defined by update rule (10) will converge to an optimal solution with probability 1.
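The intuition can be checked on a hypothetical two-dimensional example (not from the paper): for f(x) = |x1 + x2| at x = 0, both (1, 1) and (-1, -1) are subgradients, but concatenating the first coordinate of one with the second coordinate of the other gives (1, -1), which violates the subgradient inequality:

```python
def f(x1, x2):
    return abs(x1 + x2)

# subgradient inequality at x = (0, 0): f(y) >= f(0) + g . y for all y
def is_subgradient(g, samples):
    return all(f(y1, y2) >= g[0] * y1 + g[1] * y2 for y1, y2 in samples)

samples = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (1.0, -1.0)]

print(is_subgradient((1.0, 1.0), samples))    # a genuine subgradient: True
print(is_subgradient((1.0, -1.0), samples))   # concatenated blocks: False
```

The check at the point y = (1, -1) is the one that fails: f(1, -1) = 0, while the concatenated vector predicts a lower bound of 2.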
Note that update rule (10) requires computing the gradient of f μ 0 and a subgradient of f j(k) at each iteration k. This makes the method more expensive than the subgradient method when we consider an entire pass over all scenarios, i.e. N iterations of the method, which is the minimum number of iterations required to update every block of variables. Nevertheless, since the cost of computing the gradient of f μ 0 is almost negligible compared to the cost of computing a subgradient of f i for any i, the extra cost is not noticeable in practice.

Asynchronous distributed method
The idea behind the asynchronous distributed method is essentially the same as in the serial method presented in the previous subsection, with the exception that subgradients of component functions are computed in parallel to the updates. Specifically, following [37], we use the computation model presented in Fig. 1. In this computation model, only the Updating system can increase iteration counters, and updates are performed serially using the information provided by the Subgradient computation system. The serial nature of the Updating system allows us to describe the algorithm using a single iteration counter k.
Note that subgradients communicated by the Subgradient computation system to the Updating system might have been computed at previous iterates. In other words, the subgradient information available to the Updating system might have delays. We denote these delays by l(k), the total number of updates since the last evaluation of the gradient of f μ 0 , at iteration k; and ( j, k), the number of updates to block j since the last computation of its subgradient, at iteration k. Considering delays, we propose update rule (11) for the randomized block-coordinate subgradient method. Here, j(k) is selected uniformly at random from {1, . . ., N }. This update rule is an extension of (10), where we allow for delays in the subgradient information.
Fig. 1 Computation model for the asynchronous distributed method. At each iteration k, the Updating system communicates the current iterate x k to the Subgradient computation system. In turn, the Subgradient computation system communicates back the last computed subgradient for each component function. The parallelism of this scheme resides within the Subgradient computation system
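The delayed-information update (11) can be emulated serially with a cache of stale subgradients (a hypothetical sketch: the smoothed term involving f μ 0 is omitted for brevity, `subgrad_f` is an invented oracle, and the refresh probability stands in for workers that lag behind the Updating system):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 4, 3
x = np.zeros((N, n))

def subgrad_f(j, xj):                  # illustrative scenario oracle
    return np.sign(xj - j)

# last subgradient communicated for each block, possibly computed at an old iterate
cache = [subgrad_f(j, x[j]) for j in range(N)]

for k in range(200):
    j = int(rng.integers(N))           # block j(k), chosen uniformly at random
    lam = 1.0 / (k + 1)
    x[j] -= lam * cache[j]             # rule (11): update using delayed information
    if rng.random() < 0.5:             # the worker for block j only sometimes catches up
        cache[j] = subgrad_f(j, x[j])
```

Each cache entry carries an implicit delay: the number of updates applied to its block since it was last refreshed, which is exactly the quantity the convergence analysis must control.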
The presence of delays implies that Proposition 1 is no longer valid for update rule (11). Nevertheless, in order to prove convergence of the method defined by rule (11), we can use essentially the same idea as in the previous subsection. That is, we show that the expected update direction coincides with the direction of an approximate subgradient of the objective function and that the error in the approximate subgradient vanishes as the iterations advance. In this work, approximate subgradients are defined as follows.
Definition 1 (Approximate subgradient [45]) Let f be any convex function and let x ∈ Dom( f ). A vector g is called an ε-subgradient, or approximate subgradient, of f at x if there exists ε > 0 such that f (y) ≥ f (x) + g T (y − x) − ε for all y ∈ Dom( f ).
Our analysis of the iterates of the asynchronous method is based on the Supermartingale Convergence Theorem [11].
Assumption 3 (Diminishing-bounded stepsize) The stepsize λ k might be a function of x k , x k−1 , . . ., x 0 , but not of the block coordinate to be updated j(k). Furthermore, the sequence {λ k } is bounded above and below by a deterministic sequence {γ k }, where Ǧ, Ĝ, r , q are positive constants.
For SMIP instances, Assumption 1 is ensured by the boundedness of the sets U and D i , i = 1, . . ., N of the primal problem (1)-(4), while Assumption 2 will hold as long as the evaluation time of all component functions is finite. The selection of a stepsize which is consistent with Assumption 3 is discussed in Sect. 3.3.
The following lemma conveys the key idea of our analysis and the rest of the proof follows almost directly from it.
Lemma 1 Let Assumptions 1 and 2 hold. Additionally, let J be a discrete uniform random variable on the set {1, . . ., N } and F k = {x k , x k−1 , . . ., x 0 }. Then, the expected direction of update rule (11) coincides with the direction of an ε-subgradient of f at x k .
Lemma 1 shows that the update direction of rule (11) should, on average, be close to a subgradient of the objective and that the error in the subgradient is bounded by the sum of the last L stepsizes; hence, by choosing a stepsize consistent with Assumption 3, this error will vanish as k grows. In the following, Proposition 2 gives an estimate of the progress of the asynchronous method at each iteration, based on the result of Lemma 1, while Proposition 3 presents a straightforward consequence of Assumption 3.

Proposition 2 Let Assumptions 1, 2 and 3 hold and let
Then, for the sequence {x k } generated by update rule (11), we have that
Proposition 3 Let Assumption 3 hold. Then, we have
where for notational convenience we let λ −l = λ 0 for all l ∈ N.
Finally, in the following theorem, we apply the Supermartingale Convergence Theorem (Theorem 2, Appendix B) to the results of Propositions 2 and 3 in order to prove the convergence of the asynchronous method to an optimal solution.
Theorem 1 Let Assumptions 1, 2 and 3 hold, and assume further that the optimal solution set X * is nonempty.Then the sequence {x k } generated by the randomized method converges to some optimal solution with probability 1.
The result presented in Theorem 1 extends the state-of-the-art by providing a convergence guarantee for the asynchronous block-coordinate descent method for non-differentiable optimization problems with the structure of problem (9) without the need for a line search on x j(k) at each iteration [23,52].
An important remark is that the asynchronous incremental method proposed in [37] is guaranteed to converge to an optimal solution for both problem (5) and problem (9), while also exploiting their decomposable structure to a certain extent. We utilize block-coordinate descent instead of incremental methods for two reasons:
- The choice between incremental and block-coordinate subgradient methods can have significant implications for the magnitude of delays whenever we are minimizing a problem with the structure of problem (9), where a gradient of f μ 0 is much easier to evaluate than a subgradient of f i for any i = 1, . . ., N , and X = R n (unconstrained optimization). The incremental subgradient method at iteration k selects at random a component function, j(k) ∈ {0, 1, . . ., N }, and performs an update following the direction of the computed subgradient for the selected component function. Whenever j(k) ≥ 1, the algorithm will update block-coordinate j(k), which will cause the current gradient of f μ 0 and the subgradient of f j(k) to gain one unit of delay. On the other hand, every time j(k) = 0, the algorithm will update the entire vector x, adding one unit of delay to all the subgradient information available to the Updating system. This effect, which is unavoidable for the problem structure analyzed in [37], is undesirable because errors in the update direction depend directly on the magnitude of the delays. The block-coordinate subgradient method at iteration k updates only the coordinates of block j(k) ∈ {1, . . ., N }, causing the available gradient of f μ 0 and subgradient of f j(k) to gain a unit of delay, but leaving unaffected the rest of the subgradient information available to the Updating system. Moreover, new gradients of f μ 0 can be computed rapidly as iterations advance, which allows us to maintain small delays throughout the solution process.
- For SMIP instances and, in general, for Lagrangian relaxation of constraints linking duplicated variables, as the dual multiplier x approaches an optimal value, u and v i become increasingly similar for all i. As a consequence, the gradient of f μ 0 will tend to point in the opposite direction to the subgradient of f i for any i = 1, . . ., N , thereby making the incremental method susceptible to oscillations in x.

Stepsize selection and function value estimation
Although Assumption 3 might seem to restrict the stepsize to a diminishing series of the type 1/k q , it also allows us to use a stepsize similar to the dynamic stepsize proposed by Polyak for the subgradient method [43]. In order to use the original Polyak stepsize, we need to know the objective value at the current iterate f (x k ), the optimal value f * , and the norm of the subgradient at the current iterate ‖g(x k )‖, none of which are available for the asynchronous method. Instead, we construct a new stepsize which uses estimates for each of the aforementioned quantities. An estimate of the current objective, which is also an upper bound on the objective of the primal problem (1)-(4), can be obtained at the cost of evaluating f 0 as follows:
On the other hand, an estimate ḡ k of the subgradient norm can be computed using the last known subgradients for the component functions:
Here, σ is a small positive constant intended to prevent ḡ k = 0 (note that, due to the delays, ḡ k = 0 does not imply that x k is optimal). An underestimate of the optimal value, L B k , can be obtained from a feasible solution to the primal problem, which is computed as described in Sect. 4. We assume that this is a strict underestimate, i.e. θ ≤ f * − L B k for some θ > 0. In other words, we assume that strong duality is never attained for realistic SUC instances. Using these estimates, we propose the following dynamic stepsize:
where ξ is a positive constant and p k = p 0 /(1 + rk) q , with p 0 , r positive constants and 1/2 < q ≤ 1. The goal of ξ is to prevent the method from taking long steps whenever the underestimate of the optimal value is loose. At the same time, we scale our estimate of the Polyak stepsize by the decreasing sequence p k . We do this in order to ensure that the resulting stepsize λ k , defined in Eq. (14), respects Assumption 3. This can be verified by observing that λ k satisfies the following inequality:
Therefore, by Theorem 1, the asynchronous algorithm using the proposed stepsize (14) will converge with probability 1 to an optimal solution.
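Under the assumption that (14) combines the three estimates as a capped, scaled Polyak step (the exact expression is given by Eq. (14) in the paper and is not reproduced here), a sketch might look as follows; all parameter values are illustrative:

```python
def stepsize(k, ub_k, lb_k, gnorm_k, p0=0.5, r=0.02, q=1.0, xi=2.0, sigma=1e-8):
    """One hypothetical reading of stepsize (14): a Polyak-style estimate
    (ub_k - lb_k) / gnorm_k**2, capped by xi and scaled by the decreasing
    sequence p_k = p0 / (1 + r*k)**q.  All parameter values are illustrative."""
    p_k = p0 / (1.0 + r * k) ** q
    polyak = (ub_k - lb_k) / max(gnorm_k ** 2, sigma)
    return p_k * min(xi, polyak)

print(stepsize(0, ub_k=105.0, lb_k=100.0, gnorm_k=2.0))   # prints 0.625
```

The cap plays the role of ξ (loose lower bounds cannot produce arbitrarily long steps), while the factor p_k forces the overall sequence to diminish as Assumption 3 requires.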
A diminishing stepsize of type λ k = p 0 /(1 + rk) q is also guaranteed to achieve convergence under Assumptions 1-3. However, for such a stepsize to work effectively, it is necessary to determine a 'good' initial stepsize p 0 in absolute terms. The process of selecting a 'good' initial stepsize can require several trial-and-error runs of the algorithm for every instance to be solved, which would not be possible in an industrial implementation with a strict time limitation.
By contrast, in order to set the parameters for the proposed stepsize (14), we only need to decide the proportion p k of the Polyak stepsize that we would like to have at two different iteration counts (e.g. p 0 = 0.5 and p 50·N = 0.25) and the rate at which we would like to decrease p k (e.g. decreasing at a rate 1/k). Given that these parameters are scale-free, in the sense that they do not depend on the absolute value of the objective function, there is no need for trial-and-error runs when using this stepsize rule in industrial-size applications.
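For example, with q = 1 and a hypothetical N = 100, the rate r implied by the two targets p 0 = 0.5 and p 50·N = 0.25 follows from inverting p k = p 0 /(1 + rk) q :

```python
N = 100                             # hypothetical number of scenarios
p0, q = 0.5, 1.0                    # initial proportion; decrease rate 1/k
target_k, target_p = 50 * N, 0.25   # desired proportion after 50 passes over the scenarios

# invert p_k = p0 / (1 + r*k)**q for r
r = ((p0 / target_p) ** (1.0 / q) - 1.0) / target_k
print(r)                            # 1 / (50 N) = 0.0002
```

Because only the ratio p 0 /target p enters the formula, the resulting parameters are indeed scale-free with respect to the objective value.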

Primal recovery
Primal recovery is an essential component of any Lagrangian relaxation method aiming at solving the original problem. Although there exist exact methods for recovering primal solutions in the case of linear programs [7], these methods do not extend to the mixed integer case. Therefore, primal recovery generally relies on heuristics.
In the case of the SMIP instances in this paper, we exploit the weak relatively complete recourse of the problem, i.e. that P v i (D i ) = P v j (D j ) for any i, j ∈ {1, . . ., N }. In other words, any solution v to a scenario subproblem can be used as a candidate non-anticipative first-stage solution. We can then compute the second-stage cost h i (v) by solving second-stage problems with fixed v i = v:
Thus, we obtain a complete primal solution to (1)-(4) and a lower bound on the objective of (5) [1,2].
Recovering one feasible non-anticipative solution at every dual iteration would require the solution of N 2 second-stage MILPs for every dual pass over the data. For medium to large scenario sets, the computational requirements of primal recovery can thus easily exceed those of the dual algorithm. If dual iterations and primal recovery are performed concurrently, primal candidates need to enter a queue for evaluation, which will typically grow as dual iterations advance (assuming similar resources are allocated to dual iterations and primal recovery). Within this context, we test three rules for determining the order in which primal candidates in the queue are evaluated:

- First-in-first-out (FIFO).
- Random order (RND), motivated by the possibility that a good solution might appear anywhere in the sequence of solutions to scenario subproblems.
- Last-in-first-out (LIFO). This approach accounts for the fact that, as dual iterations advance, solutions to scenario subproblems tend to become almost non-anticipative (v i ≈ v j , i, j = 1, . . ., N ). Therefore, scenario subproblem solutions from later iterations could achieve better overall performance.
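A minimal sketch of the three ordering rules over a candidate queue (function names are ours):

```python
import random

def next_candidate(queue, rule, rng=None):
    """Pop the next primal candidate to evaluate from the queue
    (oldest entries first in the list), according to one of the three
    ordering rules described above."""
    if rule == "FIFO":
        return queue.pop(0)                      # oldest candidate first
    if rule == "LIFO":
        return queue.pop()                       # most recent candidate first
    if rule == "RND":
        rng = rng if rng is not None else random
        return queue.pop(rng.randrange(len(queue)))
    raise ValueError("unknown rule: " + rule)
```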
A different approach towards recovering primal solutions is to create primal candidates as they are required, by combining the solutions to different scenario subproblems. Carøe and Schultz [14] propose creating primal candidates by first averaging the solutions to all scenario subproblems and then rounding the result using a heuristic. We use a variant of this idea, combined with importance sampling (IS), to create primal solutions. Let V = P v i (D i ) for an arbitrarily chosen i ∈ {1, . . ., N }. Our recovery heuristic then proceeds as follows:

1. Associate to each scenario i = 1, . . ., N a probability ρ i proportional to an estimate of its importance.
2. Pick a sample of the scenarios of size M < N , using ρ i as the probability of sampling scenario i.
3. Average the current scenario subproblem solutions v i associated with the sampled scenarios {i(m), m = 1, . . ., M} in order to generate an average ṽ.
4. Generate a primal candidate ū by projecting the average of step 3 onto V.

In general, problem (15) takes the form of a modified scenario subproblem, where n I denotes the number of integer first-stage variables. Problem (15) then reduces to a quadratic version of problem (6), which is smaller and easier to solve than a scenario subproblem.
The proposed heuristic allows us to create primal candidates that combine the characteristics that make a solution optimal for a representative subset of the scenarios while, at the same time, generating different candidates after each dual iteration if necessary. The latter would not be possible if we were to follow [14], since the average of the solutions to all scenario subproblems does not change significantly from one coordinate descent iteration to the next.
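The four steps of the heuristic can be sketched as follows. This is a simplified illustration: the exact projection (15) is replaced here by rounding the integer coordinates, whereas the implementation solves a quadratic variant of the scenario subproblem, and the importance estimates are taken as given:

```python
import random

def is_candidate(solutions, importance, M, n_int, rng=None):
    """Sketch of the IS recovery heuristic. solutions maps scenario -> current
    first-stage vector; importance maps scenario -> importance estimate.
    The first n_int coordinates are the integer first-stage variables."""
    rng = rng if rng is not None else random.Random()
    scenarios = list(solutions)
    total = sum(importance[i] for i in scenarios)
    weights = [importance[i] / total for i in scenarios]
    # step 2: sample M scenarios with probability proportional to importance
    sample = rng.choices(scenarios, weights=weights, k=M)
    # step 3: average the sampled first-stage solutions
    dim = len(next(iter(solutions.values())))
    avg = [sum(solutions[i][j] for i in sample) / M for j in range(dim)]
    # step 4 (surrogate projection): round the integer coordinates
    return [float(round(x)) if j < n_int else x for j, x in enumerate(avg)]
```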

High performance computing implementation
Section 3 provides convergence guarantees for the asynchronous dual optimization algorithm, based on the conceptual distributed computation model of Fig. 1, while Sect. 4 proposes two primal recovery schemes that can be parallelized. This section specifies the actual implementation of the algorithm, that is, the different processes running in parallel and the information exchanged between them.
We implement the algorithm using the Master/Slave design presented in Fig. 2. The Master coordinates the work of all Slaves, dynamically assigning tasks (solving optimization problems) to them as the algorithm progresses. Slaves, on the other hand, limit themselves to performing the tasks demanded by the Master, without a view of the global progress of the algorithm.
In contrast to the conceptual computation model presented in Fig. 1, in the actual implementation there is no clear separation between the Updating system and the Subgradient computation system. The Updating system is contained within the Master. The Subgradient computation system is split between the Master and the Slaves that are currently evaluating f μ 0 or f i , i = 1, . . ., N . This part of the process is shown to the left of the Master in Fig. 2.
Primal recovery is performed concurrently with dual iterations, using a portion of the Slaves. This part of the process is shown to the right of the Master in Fig. 2. Primal recovery evaluates the second-stage cost of primal candidates and, when using the IS heuristic, it also creates new candidates by projecting averaged first-stage solutions onto the first-stage feasible set.
Performing dual iterations alongside primal recovery enables the algorithm to continuously compute upper bounds (dual function evaluations) and lower bounds (primal recovery) on the optimal value. This allows us to establish a natural termination criterion, U B − L B ≤ ε. We can also terminate the algorithm at a given wall time, while still returning the incumbent solution and an optimality gap. These termination guarantees permit the deployment of our asynchronous algorithm in settings with hard limits on execution time, such as day-ahead unit commitment.
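A minimal sketch of the resulting termination test (names are ours; ε and the wall-time limit are configuration inputs):

```python
def should_stop(ub, lb, eps, elapsed, wall_limit):
    """Termination test: natural criterion UB - LB <= eps, or a hard
    wall-time limit; the incumbent and its gap are returned either way."""
    gap_closed = ub is not None and lb is not None and ub - lb <= eps
    return gap_closed or elapsed >= wall_limit
```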
Fig. 2 Execution snapshot of the asynchronous distributed algorithm for stochastic unit commitment. Each box corresponds to a process and each dashed line corresponds to information that is exchanged between processes. The Master assigns tasks to each of the Slaves dynamically. Not all types of tasks need to be present at all times, and there might be several Slaves engaged in the same task, but over different data.

In the following, we detail the internal layout of the Master and Slave processes, and how they interact with each other. The implementation is based on the SMPS file format for stochastic programs [26]. In particular, it uses the concepts of the CORE problem, the time indexation of variables and constraints (TIME file), and the specification of scenarios by their differences from the CORE problem (STOCH file). In order to maintain a small memory footprint, which is critical for solving large SUC instances, only the CORE problem and the TIME indices are kept in memory, while the information in the STOCH file is loaded from the hard drive as needed and purged after it has been used.

Slave
The Slave process, presented in Fig. 3, starts by partially reading the instance to be solved (steps 1-2, Fig. 3): it loads the CORE problem, loads all the information in the TIME file (time stages of variables and constraints) and gathers information about the organization of the STOCH file (metadata), e.g. a list of the scenarios and where in the STOCH file they are located.
The reading process respects the classification of constraints within the CORE file. In particular, it differentiates between normal constraints, delayed constraints (i.e. constraints that are necessary for feasibility but are unlikely to be binding, also known as lazy constraints) and model constraints (i.e. constraints redundant at the optimal MIP solution, also known as user cuts). Current commercial MILP solvers can take advantage of this classification of constraints to speed up the solution process. Note that, in order to read files with this constraint differentiation, the TIME file must explicitly declare the time index of rows and columns [26].
After every process finishes the reading step (step 3), the control flow is organized around a loop within which the Slave receives a task from the Master, executes it, and communicates back the result. Subproblems in all tasks are formulated either by modifying the CORE problem, as in steps 6, 10 and 22, or by taking a subset of the constraints of the CORE problem, as in steps 14 and 18. The transformation of steps 6, 10 and 22 uses the metadata to avoid parsing unnecessary parts of the STOCH file.
There are 5 types of tasks, as well as 1 termination signal, that the Slave can receive from the Master. Among these, dual scenario corresponds to evaluating a certain component j ∈ {1, . . ., N } of the dual objective for certain multipliers, primal projection corresponds to projecting an averaged candidate onto the feasible set of first-stage decisions [see Eq. (15)], and second stage scenario corresponds to solving a recourse problem for a given first-stage decision.
The dual f 0 task involves two actions: computing the gradient of f μ 0 and computing an upper bound. These actions are merged because they involve solving very similar mathematical programs, (6) and (8), neither of which has a strict requirement on the frequency with which it must be solved (contrary to primal projection, which must be solved whenever a new primal candidate is needed). The task uses four pieces of data. The first two, involving the sum of the current multipliers, are used to compute the gradient of f μ 0 . The other two, the sum of certain multipliers y, ∑ i y i , and the sum of the scenario component functions of the dual evaluated at y, ∑ i f i ( y i ), are used to obtain an upper bound on the optimal value of the original program by evaluating f 0 ( y) and letting f ( y) = f 0 ( y) + ∑ i f i ( y i ), as indicated in Eq. (13).

The approx dual scenario task has the same objective as the dual scenario task, with the difference that the former solves only a relaxation of the scenario subproblem. Two types of relaxation are considered: (i) the linear programming (LP) relaxation, applicable to problems where the sets D i are mixed-integer polyhedral sets, and (ii) the period relaxation, in which each period of the scenario subproblem is solved independently. The latter is applicable to problems where the sets D i have a multi-period structure, as is the case for SUC (see Appendix A). The period subproblem is constructed using the indexation of variables and constraints in the TIME file. These relaxations provide cheap subgradient estimates and upper bounds on the component functions, and they are used in the initialization procedure described in Sect. 5.3.

Master
The Master process, presented in Figs. 4 and 5, starts in the same manner as the Slave, by allocating memory for variables and reading the necessary information from the CORE, TIME and STOCH files (steps 1-2 in Fig. 4). Note that the Master does not require further information regarding the subproblems, as it is not used directly for dual iterations or primal recovery.
After reaching the synchronization point, step 3, the Master uses the Slaves to perform the initialization procedure, step 4, which provides initial values for the subgradients and upper bounds. This step is followed by the launch of the initial batch of tasks of the algorithm, in steps 5 and 6. We update as many blocks as possible, launching the corresponding subgradient evaluation tasks. If there are free Slaves after updating all blocks, we use them to perform primal recovery tasks, so that all Slaves are assigned a task. The Master keeps track of the task assigned to each Slave.
The Master then enters its main loop, which receives the result of a task from Slave s (step 7), processes it (steps 8-16) and assigns a new task to Slave s (steps 21-28). The processing procedure depends on the type of task. For dual f 0 , we simply overwrite the gradient of f 0 and update the upper bound, while for primal projection we add the new candidate to the list of primal candidates.

The processing of dual scenario tasks requires checking whether the received result contains new information relative to what is already available to the Master before overwriting it (step 11). This is necessary because, as the updates of x are performed at random, there might be two or more Slaves evaluating the same component function i, each for a different x i . If the evaluation for an older x i finishes later, it should not overwrite the subgradient information available to the Master, since this would only introduce more delays in the subgradient information. Irrespective of its delay, the result is used in step 12 as a primal candidate as long as the primal recovery heuristic is not set to importance sampling.

Second stage scenario tasks, on the other hand, return the second-stage cost of a given first-stage candidate solution at a certain scenario (step 14). If the first-stage solution has been evaluated for all scenarios, it is used in step 15 to update the lower bound on the primal objective. Note that we do not assume any order in the evaluation of first-stage candidate solutions, allowing primal recovery to be performed asynchronously on several candidates at the same time.
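The staleness check of step 11 can be sketched by stamping each evaluation with the update count of the multipliers it used; the following is a simplified model of the Master's bookkeeping (names are ours):

```python
def incorporate_dual_result(state, scenario, eval_stamp, subgrad, f_val):
    """A dual scenario result is kept only if it was evaluated at multipliers
    at least as recent as those of the result currently stored for that
    scenario; otherwise it is discarded as stale."""
    if eval_stamp < state["stamp"].get(scenario, -1):
        return False            # stale result: a newer evaluation is stored
    state["stamp"][scenario] = eval_stamp
    state["subgrad"][scenario] = subgrad
    state["f"][scenario] = f_val
    return True
```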
Once the processing phase is complete, at the top of Fig. 5, the algorithm checks whether the termination criterion is met and terminates the execution of the Master and Slaves if that is the case. If the termination criterion is not met, the algorithm proceeds to decide which task to assign to Slave s. This decision is made according to the following criteria, in decreasing order of importance:

1. Use at most N concurrent processes for dual scenario tasks.
2. Maintain the proportion of Slaves engaged in dual iterations as close as possible to the value of the configuration parameter DualShare.
3. A dual f 0 task must be executed every certain number of dual scenario tasks.
4. If using the IS primal recovery heuristic, a primal recovery task must be executed whenever the number of elements left in primal_tasks is small enough to risk primal_tasks becoming empty before a new candidate is generated.
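Criteria 1-2 can be sketched as follows (criteria 3-4, which interleave dual f 0 and primal recovery tasks, are omitted for brevity; names are ours):

```python
def choose_task(n_dual_busy, n_primal_busy, N, dual_share):
    """Assign a task to a freed Slave: never exceed N concurrent dual
    scenario tasks, and keep the fraction of Slaves doing dual work
    close to the DualShare configuration parameter."""
    total_busy = n_dual_busy + n_primal_busy + 1   # +1: the Slave to assign
    if n_dual_busy < N and n_dual_busy / total_busy < dual_share:
        return "dual scenario"
    return "second stage scenario"
```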
Before sending the chosen task to Slave s, certain preprocessing steps are required. In the case of a dual scenario task, a block coordinate descent update is performed on a random scenario j and, only then, a dual scenario task for scenario j with the new multipliers is assigned to Slave s. This action (steps 22-23) corresponds to the updating system of Fig. 2.
Before assigning a second stage scenario task, on the other hand, it is necessary to check whether there are pending primal tasks to be executed. If this is not the case, then, in step 25, new tasks are created using the next candidate, which is chosen according to the order specified by the RecoveryType parameter (FIFO, RND, LIFO, IS) from the list of pending candidates, primal_cand, and removed from it. If the RecoveryType configuration parameter is set to IS, then the most recent candidate in the list is chosen.
The assigned task is then sent to Slave s and the Master returns to the beginning of its main loop, where it will wait for the next result.

Initialization
The main objective of the initialization subroutine, executed by the Master at step 4 (Fig. 4), is to obtain cheap estimates of the subgradients of component functions, upper bounds on the component function values at the initial multipliers x 0 and, optionally, an initial set of primal candidates.
The initialization proceeds as follows. First, subgradient estimates and upper bounds are computed by executing approx dual scenario tasks on the Slaves, for all scenarios, with the initial multipliers. Once the results for all scenarios have been collected, a dual f 0 task is executed to obtain a gradient for f μ 0 and a valid upper bound on the primal objective.
If the selected primal recovery heuristic is not IS, subgradient estimates can be used as primal candidates (because of the projection onto V, step 8, Fig. 3) and evaluated at step 6 of Fig. 4 (if S > N ).On the other hand, if the selected primal recovery heuristic is IS, then at step 6 of Fig. 4 the next primal task would correspond to primal projection, with averaged candidates generated by the Master on the fly.
Without these subgradient estimates, the first round of updates (step 5, Fig. 4) would not modify the multipliers, and the computation of the first upper bound would be delayed until all component functions have been evaluated, that is, at least until the slowest of all component functions is accurately evaluated. Considering that differences in the evaluation times of component functions observed in real instances can be as high as 7500%, the lack of an initial upper bound can significantly delay termination of the algorithm, even when a good primal candidate and lower bound are already available.
Note that primal recovery also benefits from the initialization.The reason for this is that, without the initial set of candidates, primal recovery would be delayed until the results from the accurate evaluation of component functions f i are returned to the Master.

Numerical results
We implement the proposed asynchronous algorithm as described in the previous section in C, using Xpress (through its C API) [22] for solving all mathematical programs and MPI [25] for handling communications between processes.
We test the proposed algorithm on SUC instances of the WECC [42] and CWE [8] systems, and on generic SMIP instances from the SIPLIB collection [4]. Sizes and solution times of scenario subproblems of the SUC instances are presented in Table 1, where it can be observed that the instances used in the present study are at least as large as, or have scenario subproblems at least as difficult to solve (in terms of solution time) as, the models in the literature. We conduct these numerical experiments using multiple high-performance computing clusters, the technical specifications of which are presented in Appendix C.

Western Electricity Coordinating Council system instances
The WECC system instance [42] is composed of 130 thermal generators, 182 buses and 319 branches.It features multiarea renewable production with hourly resolution over a 24 h horizon for 8 representative day types, one weekday and one weekend day per season.The number of scenarios ranges from 10 to 1000, and each scenario is associated with different renewable production profiles and contingencies.
Numerical experiments on the WECC system were run on the Cab cluster, hosted at the Lawrence Livermore National Laboratory. We configure Xpress to solve the root node of all subproblems using the barrier algorithm, limit the number of Xpress threads to 1, and set the termination gap of subproblems to 1%. For all WECC instances, we use 1 core per process, so that S = #Cores − 1, and we limit the run time of the algorithm to 2 h. (The subproblem solution times reported in Table 1 for WECC [42] correspond to a winter weekend instance with 100 scenarios, while those for CWE [8] correspond to a spring weekday instance with 120 scenarios.) The solution times of the WECC instances are summarized in Table 2 for different configurations of the asynchronous algorithm. Regarding stepsizes, 'Dim. 1/k' corresponds to a stepsize of the type 1/k, and 'Polyak' corresponds to the Polyak stepsize defined in Eq. (14), where the diminishing part is set to decrease from 0.5 to 0.25 in 50N iterations with q = 1. Primal solution recovery methods correspond to the four methods described in Sect. 4. The modification of the DualShare parameter for the 1000-scenario instances is motivated by the findings presented in Sect. 6.4.
From the perspective of dual optimization, Polyak stepsizes outperform 1/k stepsizes when considering a 1% termination criterion.In all our instances and test runs, Polyak stepsizes provided better results without the need for tuning.
For primal recovery, we observe that the three methods that recover solutions directly from solutions to scenario subproblems (FIFO, RND, LIFO) exhibit similar performance on the 10-scenario instances. RND and LIFO outperform FIFO on instances with 100 scenarios. The IS heuristic outperforms its counterparts on all instances.
The best configuration of Table 2, which is highlighted in boldface, outperforms the run time reported in [42], in which 1000-scenario instances (23.1 million variables, 35.9 million constraints and 3.1 million integers) were solved in up to 24 h using 1000 cores. Instead, the best performing algorithm in this paper solves the same instance in less than 2 h using 256 cores. Similar speedups with respect to [42] are observed for instances with fewer scenarios. The progressive hedging method of Cheung et al. [18] solves instances of the WECC with up to 100 scenarios to within 1.5-2.5% of optimality within 25 min. In comparison, the proposed algorithm solves instances of similar difficulty in terms of subproblem solution time (see Table 1) to 1% optimality in at most 18.7 min and to 2% optimality in less than 5 min. We perform a direct comparison between progressive hedging [18] and the proposed asynchronous algorithm using a subset of the SUC WECC instances. Details of this comparison are presented in Appendix D. We observe that the proposed method is 4.35 times faster than the progressive hedging method of [18], though the latter tends to produce better quality solutions.
Detailed solution statistics per day type for the best configuration are reported in Table 8, in Appendix E. For ease of replicability, in Table 9, which is also included in Appendix E, we report the solution statistics for the 10-scenario instances obtained using a 4-core laptop.

Central Western European system instances
The CWE system instance is based on [8], with the following adaptations: (i) we include the commitment of nuclear units (binary variables) and (ii) the selection of a set point for CHP power plants (continuous variables) as first-stage decisions of SUC. The system consists of 656 thermal generators, 679 buses and 1037 branches [53]; see also Table 1. In Table 3, statistics for configurations that failed to achieve the target optimality gaps within the wall-time limit for one or more day types are not reported and are denoted with a dash.
Numerical experiments on the CWE system were run on the Cab cluster of the Lawrence Livermore National Laboratory. We configure Xpress to solve the root node of all subproblems using the barrier algorithm, limit the number of Xpress threads to 2, set the MIP time limit to 1 h and 30 min, and set the termination gap of subproblems to 1%. Each process uses 2 cores (1 core per Xpress thread), hence S = #Cores/2 − 1. We set the maximum run time of the algorithm to 6 h. Table 3 presents the summarized solution statistics of the proposed algorithm on the CWE instances. Detailed solution statistics per day type are reported in Table 10 of Appendix E.
Solution times are larger than those observed for the WECC; nevertheless, they remain within operationally acceptable time frames for day-ahead scheduling (at most 2 h and 34 min are required for obtaining a solution within 1% of optimality). This increase in overall solution time with respect to the WECC instances results mostly from the time required for solving dual scenario subproblems which, as shown in Table 1, is two orders of magnitude larger than for the WECC.
These results largely outperform reported solution times for instances of the CWE system with up to 16 scenarios [44] (also obtained from [8]), on the same computing environment (Cab cluster). In these references, scenario decomposition along with an optimal grouping approach achieved solutions within 2% of optimality in 15 h.

SIPLIB instances
SIPLIB [4] is a collection of SMIPs, provided in SMPS format [27], for testing and developing numerical methods for solving SMIPs. Two-stage SMIP instances in the SIPLIB collection have variable counts ranging from tens up to a few thousand variables, and each scenario subproblem has far fewer constraints than the industrial-size SUC instances considered in the previous sections of this paper. The SIPLIB problems include from 2 up to 2000 scenarios. We conduct numerical experiments using the largest two-stage SMIP instances in the SIPLIB collection. Table 4 presents the solution times and solution quality statistics for the DCAP and SSLP instances. Here, k/N corresponds to the total number of coordinate descent iterations normalized by the number of scenarios. For consistency with the rest of the paper, lower bounds L B and upper bounds U B refer to bounds on the objective values of the respective problems, which are posed as maximization problems.
We observe that the proposed asynchronous approach fails to achieve the desired 1% optimality gap for 4 of the 12 instances. These are highlighted in boldface in Table 4. From the exact solutions to these instances [3], obtained using the specialized branch-and-bound algorithm described in [5], we know that for dcap233_500, dcap332_300 and dcap332_500 this is because our heuristics failed to detect a primal candidate with an acceptable objective function value, whereas for dcap332_200 the proposed asynchronous algorithm failed to derive a tight upper bound. All other DCAP instances were solved to the desired tolerance. Regarding the SSLP instances, we arrive at a different observation. While we were able to achieve the desired optimality tolerance and solve problems with up to 2000 scenarios, in terms of solution time the proposed approach is outperformed by an order of magnitude by the integer L-shaped method presented in [6].
These results validate the correctness and show the general applicability of our asynchronous algorithm. At the same time, they indicate that, for small-scale problems, enumerative techniques (such as those presented in [5,6]) should be preferred to our asynchronous algorithm. On the other hand, for large-scale problems, the proposed asynchronous algorithm should be preferred, as demonstrated in the previous sections.

Sensitivity of solution times to the allocation of resources
The DualShare configuration parameter determines how distributed computing resources are allocated between tasks related to dual iterations and tasks related to primal solution recovery. This allocation can impact the overall performance of the algorithm significantly, especially when parallel computing resources are limited.
Table 5 shows how variations in DualShare affect run time, with all other parameters held constant. We select the WECC spring weekday instance with 100 scenarios to perform this test, because it is a medium-size instance with the median solution time among WECC instances with 100 scenarios.
The configuration using the IS primal recovery heuristic (first 5 rows) exhibits a significant improvement in solution time as we increase DualShare. For the LIFO heuristic (6th and 7th rows), on the other hand, we observe only a minor improvement in solution time with the increase of DualShare. This shows that averaging scenario subproblem solutions can indeed generate better candidates than simply using scenario subproblem solutions directly, allowing us to avoid spending computing power on evaluating the performance of low-quality primal candidates. This effect was not observed in the 10- and 100-scenario instances of Table 2 because we used almost twice as many cores as scenarios, allowing FIFO, LIFO and RND to carry out a very large number of candidate evaluations.
Our implementation gives us the freedom to change DualShare during the solution of an instance, allocating fewer or more resources to dual tasks in earlier iterations. This can be beneficial, since the subgradient method achieves fast improvements during the first iterations but becomes slow as it approaches the optimal solution. Therefore, by starting with a high DualShare and gradually decreasing it, we can take advantage of the rapid bound improvement during the first iterations and use more computing power in later stages in order to recover better primal solutions. Rows 8 and 9 present the results of applying this idea. The improvements with respect to maintaining a fixed resource allocation, as in all other rows of the table, are modest because of a combination of factors. During the first iterations, the Polyak stepsize cannot be computed accurately due to the lack of a good lower bound (see Eq. (14)) and the updates tend to overshoot. During the last iterations, on the other hand, the IS heuristic encounters difficulties in finding new primal candidates, because the frequency at which we obtain solutions to scenario subproblems decreases with DualShare.

Parallel computing performance
The performance of a parallel algorithm can be measured against various metrics. We focus on (i) how the proposed algorithm compares to a synchronous algorithm, i.e. an algorithm that uses a synchronous method to carry out dual iterations, and (ii) how effectively the proposed algorithm scales with the number of cores.
Solving the instances used in this work with a fully synchronous algorithm may require excessive use of computational resources [41]. In order to avoid using additional computing time for this comparison, we compare the proposed algorithm to a synchronous algorithm in terms of the idle time of Slaves, which can be estimated using the solution times of subproblems; this information is already available to us from the experiments of the previous sections. We consider a synchronous algorithm that solves the dual problem with the subgradient method using min{N , DualShare × S} Slaves (rounded to the closest integer, if necessary), and that performs primal recovery synchronously using the remaining Slaves. We refer the reader to Appendix F for a complete description of the procedure that we employ for estimating idle times.
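A simplified version of this estimate, assuming each synchronous round waits for its slowest subproblem before the next round starts, could look as follows (the complete procedure is described in Appendix F; names are ours):

```python
def sync_idle_fraction(rounds, n_workers):
    """Rough idle-time estimate for a synchronous scheme: in each round every
    worker receives at most one task, and all workers wait for the slowest
    task before the next round starts."""
    busy = 0.0
    wall = 0.0
    for task_times in rounds:
        assert len(task_times) <= n_workers
        wall += max(task_times)   # the round lasts as long as its slowest task
        busy += sum(task_times)
    return 1.0 - busy / (wall * n_workers)
```

For example, one round with task times (1, 1, 4) on 3 workers lasts 4 time units while only 6 of the 12 available worker-time units are used, i.e. 50% idle time.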
Table 6 presents the estimated idle times of Slaves for the described synchronous algorithm. Note that idle times generally increase with the number of cores, since with fewer cores tasks can be stacked. In order to obtain a measure of the variation of idle times with the number of cores, Table 6 includes estimates for the synchronous algorithm with (i) the same number of cores as the asynchronous algorithm, (ii) half the number of cores of the asynchronous algorithm, and (iii) 1 + 2N cores (1 Master, 2N Slaves).
It can be directly observed that synchronous schemes tend to underutilize parallel computing infrastructure, leaving the cores corresponding to Slaves idle up to 80.4% of the time. For the same problem, the asynchronous algorithm achieves almost zero idle time (the percentage of time dedicated to solving mathematical programs never falls below 97%; see Tables 8 and 10 for details). Decreasing or increasing the number of cores does not change this observation for the synchronous scheme.

In order to determine how well the proposed algorithm scales with the number of cores, we numerically estimate its parallel efficiency (i.e. speedup divided by the number of cores) by running the asynchronous algorithm multiple times with different core counts. The estimated efficiency is presented in Fig. 6, where wall times are obtained by averaging 4 runs for each core count presented in the plot. Our baseline is the serial execution of the asynchronous algorithm, i.e. the method of Sect. 3.1, with delays in the gradient of the smooth part of the objective and interleaving subgradient evaluation with primal recovery.
For a small number of cores, low efficiencies result from the existence of the Master process, which is not necessary in the serial execution. For instance, in Fig. 6 with 4 cores, the asynchronous algorithm launches the Master and 3 Slave processes; even under ideal conditions, this limits the parallel efficiency to 75%. When the number of cores is between 0.1N and 1.3N , the algorithm achieves efficiency values above 90%, peaking at 99.0% around 0.3N . This regime is characterized by very small lags in the dual algorithm, hence close to sequential performance, and an effective performance of the primal recovery heuristic. For more than 1.3N cores, the efficiency decreases as the IS heuristic experiences difficulties in finding new primal candidates.

Conclusions
We propose an asynchronous dual decomposition algorithm for stochastic unit commitment, in which dual iterations are performed using a block-coordinate subgradient method, for which we provide convergence guarantees.We also propose primal recovery heuristics and present a high performance computing implementation of the algorithm.The algorithm is able to solve all instances of WECC and CWE within operationally acceptable time frames and presents parallel efficiency above 90% when using between 0.1N and 1.3N slaves.
We find that synchronous algorithms dramatically underutilize high performance computing infrastructure, resulting in core idle times of up to 80.4%. This finding underscores the need for designing asynchronous algorithms in order to tackle industrial-scale unit commitment problems.
Future extensions of the present work will focus on the application of the developed asynchronous decomposition framework to (i) multi-stage stochastic unit commitment and (ii) detailed deterministic unit commitment problems over large interconnected power systems. The latter class of problems includes the exciting prospect of integrating the optimization of transmission and distribution systems through convex relaxations of AC power flow constraints [13].
Acknowledgements The authors would like to thank Dr. Vitor de Matos (Plan4 Engenharia) for suggesting the idea of evaluating primal solutions in reverse order (LIFO), and Dr. Bernard Knueven (Sandia National Laboratories) for his help in setting up the parameters of progressive hedging for the benchmark presented in Appendix D. The authors would also like to thank the Fair Isaac Corporation (FICO) for providing licenses for Xpress; the Lawrence Livermore National Laboratory for granting access and computing time at the Sierra, Cab and Quartz clusters; and the Consortium des Équipements de Calcul Intensif (CÉCI) for granting access and computing time at the Lemaitre3 cluster. CÉCI is funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under Grant No. 2.5020.11 and by the Walloon Region.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

The SUC problem aims at capturing the scheduling problem faced by power system operators in day-ahead planning. The power system has a fleet of thermal generators divided into 3 groups: (i) static generators (nuclear and combined heat-and-power units), (ii) slow generators (coal and combined-cycle units) and (iii) fast generators (gas and diesel turbines). At day ahead, before knowing the net demand of the system and possible outages, system operators must decide (i) the ON/OFF status and a production band for static generators and (ii) the hourly ON/OFF status of slow generators. In real time, after the actual net demand of the system and outages are realized, system operators must decide (i) the quarterly production of static generators within the day-ahead band, (ii) the quarterly production of slow generators, and (iii) the hourly ON/OFF status and quarterly production of fast generators, in order to match the net demand at minimum cost.
We formulate SUC using as a basis the generic two-stage SMIP formulation (1)-(4), repeated here for convenience: max_{u,v,w} Σ_{i∈I} … In the case of SUC, the blocks of first- and second-stage variables correspond to the following. In words, u is the vector of ON/OFF and startup decisions of slow generators, and ON/OFF and target production decisions of static generators. v_i is the local copy of u for scenario i. w_i is the vector of ON/OFF and startup decisions of fast generators, real-time production decisions, and power flow decisions under scenario i. Domain D_i enforces the operational limits and physics of the power system under scenario i. Operational limits of slow or fast generator g under scenario i are described by P_{g,i}, defined in Eq. (16).
The constraints correspond to startup, minimum up time, minimum down time, production bounds, maximum ramp-up and maximum ramp-down constraints.
Operational limits of static generator g at scenario i are described by P^{STAT}_{g,i}, defined in Eq. (17).
The constraints correspond to limits for target production, ON/OFF limits for actual production, and lower and upper band limits for actual production. Note that both the ON/OFF status and the target production for static generators are not indexed by time period, meaning that they are constant throughout the SUC horizon.
Using the previous definitions for P_{g,i} and P^{STAT}_{g,i}, we can describe D_i as follows. Here, we enforce the operational limits of all generators, the DC power flow equations [49], and active power balance at each bus. Note that we introduce demand and production shedding variables in all power balance equations, which ensures that the SUC formulation has relatively complete recourse. We define U as the linear relaxation of the first-stage constraints in D_i, i.e. those constraints involving only the first-stage variables u. Therefore, we can describe U as in Eq. (19), which includes only bounds and minimum up- and down-time restrictions.
Finally, the part of the objective function associated with scenario i corresponds to variable production costs, fixed and startup costs for slow and fast generators, fixed costs for static generators, and penalty terms for demand shedding. All cost components are weighted by the probability of scenario i, π_i.

B Proofs
Proof of Proposition 1 The claim follows by inspection.

Proposition 4 Let f : X ⊆ R^n → R be a non-differentiable convex function, let P be a partition of {1, …, n}, and let I_p, p ∈ P, be the matrix that maps x ∈ R^n into its p coordinates (i.e. I_p x contains only the p coordinates of x). For any x ∈ X, define D_P(x). Then, ∂f(x) ⊆ D_P(x). Furthermore, the inclusion can hold strictly, and g ∈ D_P(x) \ ∂f(x) might not be an approximate subgradient of f at x.

Proof
It is trivial to observe that the inclusion holds. We proceed to prove the remaining claims via an example. Consider the function f of the example, which proves that the inclusion can hold strictly. Also, note that there is no ε > 0 such that the approximate subgradient inequality holds for all y, since for any ε the inequality is violated at a suitable y. Therefore, [−1 1]^T is not an approximate subgradient of f at 0, which concludes the proof.
Proof of Lemma 1 Note the form of the expected direction of update rule (11). The first term in the right-hand side of (20) can be expanded as follows: in the second line we use convexity, in the third line we use the Cauchy-Schwarz inequality, and in the fourth line we use the definition of the Lipschitz constant of the gradient of f_{μ_0}. Following a similar reasoning, each of the terms under the sum on the right-hand side of (20) can be expanded; in the second line we use Assumption 1. Furthermore, Assumptions 1 and 2 allow us to bound the difference between current and delayed iterates using the stepsize assumptions. Using the previous expressions, we arrive at relation (21), which concludes the proof.

Proof of Proposition 2 Let h be as defined in the statement and let J be a discrete uniform random variable on the set {1, …, N}. Then for all k = 1, …, ∞ and y ∈ X, the stated chain of inequalities holds, where in the second line we use the nonexpansive property of the projection. Therefore, the corresponding bound holds for the expectation conditioned on F_k. Finally, by substituting E[h_J | F_k] in the last relation using (21) (Lemma 1), we obtain the desired inequality.

Proof of Proposition 3
The first two inequalities follow directly from Assumption 3. For the third inequality, we expand as above; for the last inequality in the expansion, we use the fact that {γ_k} is non-increasing. The same reasoning applies to the fourth inequality of the present proposition.

Theorem 2 (Supermartingale Convergence Theorem) Let X_t, Y_t and Z_t, t = 0, 1, 2, …, be three sequences of random variables and let F_t, t = 0, 1, 2, …, be sets of random variables such that F_t ⊂ F_{t+1} for all t. Suppose that: (a) the random variables X_t, Y_t and Z_t are nonnegative, and are functions of the random variables in F_t; (b) for each t, we have E[X_{t+1} | F_t] ≤ X_t − Y_t + Z_t; and (c) with probability 1, Σ_{t=0}^∞ Z_t < ∞. Then, Σ_{t=0}^∞ Y_t < ∞, and the sequence X_t converges to a non-negative random variable X with probability 1.

Proof of Theorem 1
From Proposition 2, with y = x*, we obtain the relation below. Using Proposition 3 and the Supermartingale Convergence Theorem (Theorem 2), with probability 1 and for each x* ∈ X*, Eq. (22) holds and the sequence ‖x^k − x*‖² converges to a random variable. The rest of the proof follows exactly the proof of [35, Prop. 3.4] (see also [21, Thm. 2]) and is repeated here only for the sake of making the material self-contained.
For each x* ∈ X*, let Ω_{x*} denote the set of all sample paths for which Eq. (22) holds and ‖x^k − x*‖² converges. By convexity of f, the set X* is convex, so there exist vectors v_0, v_1, …, v_p ∈ X* that span the smallest affine set containing X*, and are such that v_j − v_0, j = 1, …, p, are linearly independent.
The intersection Ω̄ = ∩_{j=0}^{p} Ω_{v_j} has probability 1, and for each sample path in Ω̄, the sequences ‖x^k − v_j‖², j = 0, …, p, converge. Thus, with probability 1, {x^k} is bounded, and therefore it has limit points. Furthermore, for each sample path in Ω̄, by Eq. (22) and the continuity of f, {x^k} has at least one limit point that belongs to X*. For any sample path in Ω̄, let x̄ and x̂ be two limit points of {x^k} such that x̄ ∈ X*. Because the sequences ‖x^k − v_j‖², j = 0, …, p, converge, we must have ‖x̄ − v_j‖² = ‖x̂ − v_j‖², for all j = 0, 1, …, p.
Moreover, since x̄ ∈ X*, the preceding relation can hold only for x̂ = x̄, by convexity of X* and the choice of the vectors v_j. Hence, for each sample path in Ω̄, the sequence {x^k} has a unique limit point in X*, implying that {x^k} converges to some optimal solution with probability 1.

C Technical specifications of computation platforms used for the numerical experiments
This section lists the technical specifications of the computers and high performance computing clusters used in the numerical experiments presented in this paper. Different clusters were used as they were available to the authors.

One could accelerate the proposed asynchronous algorithm by allowing it to use more than N parallel workers. For instance, considering an estimated parallel efficiency of 70% at 2N parallel workers (see Fig. 6), making 2N parallel workers available to the asynchronous algorithm would make it 40% faster than it is with N parallel workers. By contrast, progressive hedging cannot make use of more than N parallel workers, even if they are made available to the algorithm. Overall, these results demonstrate the ability of the proposed asynchronous approach to solve SUC instances several times faster than the state of the art, while providing optimality guarantees. At the same time, the good quality of the solutions provided by progressive hedging suggests directions for the future development of better primal recovery heuristics.

E Detailed results
This section presents solution statistics per instance for the best configurations found. Table 8 presents solution statistics for WECC instances using the Cab supercomputer at LLNL. Table 9 presents solution statistics for WECC instances using a personal laptop equipped with an Intel Core i7 (4 cores, 8 threads) and 16 GB of memory. Table 10 presents solution statistics for CWE instances using Cab.

In order to estimate the idle times of slaves in a synchronous algorithm, we first need to specify the exact algorithm for which we are performing the estimation. We consider a synchronous parallel algorithm for SMIPs based on the Lagrangian decomposition presented in Sect. 2. This hypothetical synchronous algorithm uses the subgradient method to minimize the dual function, as in [42]. Additionally, the synchronous algorithm uses one of the primal recovery heuristics presented in Sect. 4 to detect high-quality primal candidates. Let 1 + S be the number of parallel workers available for executing the synchronous algorithm over an SMIP with N scenarios. Then, 1 worker will be used as Master, DS = min{N, round(Dual Share × S)} workers will be used as Dual Slaves, and PS = S − DS workers will be used as Primal Slaves.
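The split of workers between Master, Dual Slaves and Primal Slaves can be sketched as follows (the function name is ours; the Dual Share parameter is as in the text):

```python
def allocate_workers(n_scenarios: int, n_slaves: int, dual_share: float):
    """Split S slaves into Dual Slaves and Primal Slaves.

    One additional worker (not counted in n_slaves) acts as Master, for
    1 + S workers in total. DS = min{N, round(Dual Share * S)}, PS = S - DS.
    """
    dual_slaves = min(n_scenarios, round(dual_share * n_slaves))
    primal_slaves = n_slaves - dual_slaves
    return dual_slaves, primal_slaves

# With N = 100 scenarios, S = 95 slaves and Dual Share = 0.75:
print(allocate_workers(100, 95, 0.75))  # (71, 24)
# The number of Dual Slaves never exceeds the number of scenarios:
print(allocate_workers(10, 95, 0.75))   # (10, 85)
```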
At each iteration k of the subgradient method, the Master uses the Dual Slaves to evaluate f_0(x^k) and f_i(x^k_i) for all i = 1, …, N, and their respective subgradients. The Master then has at its disposal DS Dual Slaves for performing N + 1 > DS evaluations. The Master performs the required function evaluations using a single queue of dual tasks, leaving the evaluation of f_0(x^k) at the end of the queue (f_0 is much easier to evaluate than any f_i). At the beginning of the iteration, as all Dual Slaves have no assigned task, the Master simultaneously assigns the first DS tasks of the dual queue to the Dual Slaves. The next task in the dual queue is then assigned to the first Dual Slave to finish its current task, until all tasks are completed, and the Master can perform the subgradient update in order to obtain x^{k+1}.
Concurrently with carrying out subgradient iterations, the Master uses the Primal Slaves to evaluate the primal candidates recovered from the scenario subproblems. This evaluation is carried out sequentially, in the sense that the Master evaluates a candidate only after completing the evaluation of the previous candidate. However, due to the abundance of primal candidates, we do not impose any locking mechanism between subgradient iterations and primal recovery. That is, the synchronous algorithm can proceed to iteration l > k even if there are primal candidates derived from iteration k pending evaluation. The evaluation of the p-th primal candidate v_p requires evaluating h_i(v_p) for all i = 1, …, N. If PS < N, these evaluations are carried out using a single queue of primal tasks, which operates in a fashion analogous to the dual queue. Once the evaluation of the p-th candidate is complete, the Master moves on to the evaluation of the (p + 1)-th candidate.
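The single-queue rule used by both the dual and the primal queues ("assign the next task to the first slave that finishes") is ordinary list scheduling. A minimal sketch, with names of our choosing, that also computes the resulting idle fraction of slaves:

```python
import heapq

def wall_time(task_times, n_slaves):
    """Wall time when tasks are drawn from a single queue and each task
    goes to the first slave that becomes available (list scheduling)."""
    finish = [0.0] * n_slaves     # current finish time of each slave
    heapq.heapify(finish)
    for t in task_times:
        earliest = heapq.heappop(finish)   # first slave to free up
        heapq.heappush(finish, earliest + t)
    return max(finish)

def idle_fraction(task_times, n_slaves):
    """Fraction of total slave-time spent idle while draining the queue."""
    w = wall_time(task_times, n_slaves)
    return 1.0 - sum(task_times) / (n_slaves * w)

# One long scenario subproblem forces the other slave to sit idle:
print(wall_time([10.0, 1.0, 1.0, 1.0], n_slaves=2))      # 10.0
print(idle_fraction([10.0, 1.0, 1.0, 1.0], n_slaves=2))  # 0.35
```

This toy example mirrors the observation motivating the paper: when scenario subproblem run times are very heterogeneous, a synchronous barrier at the end of the queue leaves slaves idle for a large share of the wall time.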
The synchronous algorithm terminates once the upper bound (provided by the subgradient method) and the lower bound (provided by primal recovery) are sufficiently close, or when a certain number of subgradient iterations have been completed.
By design, the synchronous algorithm described above has smaller or equal idle times for slaves than the synchronous decomposition methods present in the literature [18,28,42], for the same number of parallel processes. This is because of the unlocked execution of dual iterations and primal recovery, which contrasts with the methods in the literature. The latter methods recover primal solutions after each dual or progressive hedging iteration, or after a predetermined number of such iterations. The estimated idle times of slaves for the synchronous algorithm are, therefore, optimistic approximations for the methods in the literature.

F.2 Estimation of synchronous idle time using execution data of the asynchronous algorithm
The idle times of slaves in the synchronous algorithm depend on only two factors: (i) the solution times of the subproblems (either dual subproblems, f_0, f_i, or second-stage cost evaluations, h_i) and (ii) the order of these subproblems in their respective queues. We estimate the idle times of slaves by considering both factors as follows.
We estimate the proportion of idle time for the synchronous algorithm using the recorded solution times of subproblems in the asynchronous algorithm presented in this paper as proxies for the solution times of the synchronous algorithm. Specifically, for each subproblem s solved by the asynchronous algorithm we record (i) its class C_s ∈ {Dual, Primal}, (ii) its description D_s, with D_s ∈ {0, 1, …, N} for dual subproblems (indicating which dual function f_i, i = 0, …, N, was evaluated) and D_s ∈ {1, …, P} × {1, …, N} for primal subproblems (indicating which function h_i(v_p) was evaluated), and (iii) its evaluation time T_s. These quantities are organized in an ordered list of tuples L = {(C_s, D_s, T_s), s = 1, 2, …}, where indices s are assigned according to the termination timestamp of each task. We can think of L as the log of the asynchronous algorithm.
The solution times of second-stage cost evaluations can be directly obtained from L. The solution times of the dual subproblems for the reference synchronous algorithm are not readily available, because the asynchronous algorithm does not require evaluating all component functions over an iteration or a number of iterations. We assume, then, that the list L contains at least K > 1 evaluations of each component function f_i, and that the evaluation time of f_i in the k-th iteration of the synchronous algorithm, where k < K, can be approximated by T_{s(k)}, where s(k) indicates the k-th occurrence of (Dual, i, •) in the list L. Thus, from the execution of the asynchronous algorithm we obtain estimates of the solution times of subproblems over K dual iterations and P primal candidate evaluations of the synchronous algorithm.
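A sketch of how the per-iteration dual solution times might be extracted from the log L described above (the list-of-tuples representation and the function name are our assumptions, not the paper's code):

```python
from collections import defaultdict

def dual_times_per_iteration(log, n_scenarios, n_iterations):
    """Approximate the time to evaluate f_i at synchronous iteration k by
    the k-th recorded evaluation of f_i in the asynchronous log
    L = [(C_s, D_s, T_s), ...], ordered by task termination time."""
    occurrences = defaultdict(list)
    for cls, desc, t in log:
        if cls == "Dual":                 # desc in {0, 1, ..., N}
            occurrences[desc].append(t)
    # iteration k uses the k-th occurrence of every component function
    return [[occurrences[i][k] for i in range(n_scenarios + 1)]
            for k in range(n_iterations)]

log = [("Dual", 0, 1.0), ("Dual", 1, 2.0), ("Primal", (1, 1), 5.0),
       ("Dual", 0, 1.5), ("Dual", 1, 2.5)]
print(dual_times_per_iteration(log, n_scenarios=1, n_iterations=2))
# [[1.0, 2.0], [1.5, 2.5]]
```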
In order to account for the unknown order in which these subproblems may appear in the dual and primal queues of the reference synchronous algorithm, we conduct Monte Carlo simulations of wall time for dual subproblems and second-stage cost evaluations over their respective slaves. Note that we can simulate the wall time of dual subproblems separately from that of second-stage cost evaluations, because these two processes are not locked in the synchronous algorithm.
Let W_S(T, R) be the wall time for executing the set T of tasks using S slaves, where tasks are launched as slaves become available, in the order indicated by the permutation R. Then we can express the expected wall time W_k for the k-th dual iteration, with respect to all permutations of {1, …, N} taken with equal probability, as E_R[W_k] = E_R[W_DS({f_1, …, f_N, f_0}, (R; N + 1))]. We estimate this quantity for each dual iteration k = 1, …, K using 1000 different Monte Carlo samples of permutations, obtaining estimates Ŵ_k. Assuming independence of the ordering
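The Monte Carlo estimate of E_R[W_k] can be sketched as below; wall_time implements W_S(T, R) for the first-free-slave rule, and the cheap f_0 evaluation is always appended at the end of the queue, as described above. Names and the sample count argument are ours.

```python
import heapq
import random

def wall_time(task_times, n_slaves):
    """W_S(T, R): wall time for running the tasks in the given order on
    n_slaves slaves, each task going to the first slave to free up."""
    finish = [0.0] * n_slaves
    heapq.heapify(finish)
    for t in task_times:
        heapq.heappush(finish, heapq.heappop(finish) + t)
    return max(finish)

def expected_dual_wall_time(times_f, time_f0, n_dual_slaves,
                            n_samples=1000, seed=0):
    """Monte Carlo estimate of E_R[W_DS({f_1, ..., f_N, f_0}, (R; N+1))]:
    the N scenario subproblems are queued in a uniformly random order R,
    with the (cheap) evaluation of f_0 always placed last."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        order = list(times_f)   # recorded solution times of f_1, ..., f_N
        rng.shuffle(order)      # random permutation R
        total += wall_time(order + [time_f0], n_dual_slaves)
    return total / n_samples

# Two identical scenario subproblems on two Dual Slaves: every permutation
# gives wall time 3.0 for f_1, f_2, plus 1.0 for f_0 on the first free slave.
print(expected_dual_wall_time([3.0, 3.0], 1.0, 2, n_samples=10))  # 4.0
```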

Fig. 3
Fig. 3 Control flow of the Slave process. Continuous lines show the flow of the program, while dashed lines indicate exchange of information with other processes. Italics denote constants and parameters. The iteration counter k and the candidate index t are dropped because they are not relevant within the Slave

Fig. 5
Fig. 5 Control flow of the Master process (cont.)

Fig. 6
Fig. 6 Parallel efficiency plot of the asynchronous algorithm. Plot drawn using WECC, spring weekdays, 100 scenarios, Dual Share 0.75, Polyak stepsize and IS primal recovery, and 1 process per core (S = #Cores − 1). Wall times are obtained by averaging 4 runs for each core count presented in the plot

Fig. 4
Fig. 4 Control flow of the Master process

Table 1
Scenario subproblem sizes and solution times for different instances used in the present study (highlighted in boldface) and in the literature

Table 2
Solution time statistics for WECC instances, over 8 representative day types

Table 3
The problem is solved for 8 representative day types, using 30, 60 and 120 scenarios of renewable production. For all CWE instances, ramp rate constraints within each hour are declared as delayed constraints. Each scenario subproblem of the CWE instances is almost one order of magnitude larger, in terms of matrix size, than the scenario subproblems of the largest instance considered in the SUC literature

Table 4
Solution time statistics for the largest SIPLIB instances. The SIPLIB instances originate from two test sets: (i) the DCAP test set includes instances with mixed-integer first-stage variables and binary second-stage variables, and (ii) the SSLP test set includes instances with binary first-stage variables and mixed-binary second-stage variables. These experiments are run on the Lemaitre3 cluster hosted at the Université catholique de Louvain, with Xpress used for solving the root node with dual simplex and a termination gap of 0.1% for subproblems.

Table 5
Variation of solution time with the Dual Share parameter setting, which varies smoothly between its start value (k = 0) and its value after 200 dual passes over the data (k = 200N). Results in the table correspond to WECC, spring weekdays, for 100 scenarios. The instance is solved using 96 cores and a Polyak stepsize. Statistics are obtained over 4 runs for each configuration

Table 6
Estimated idle time of slaves when solving SUC using a parallel synchronous algorithm. Average and maximum over 8 day types. Column headers: System, N, # Cores, Dual share, Synchronous idle time (%), avg. (max.)

Nomenclature:
ξ_{n,i,t} Renewable supply at bus n, Monte Carlo sample i, period t
π_i Probability of Monte Carlo sample i
Commitment and startup, fast generator g, Monte Carlo sample i, hour τ
p_{g,i,t} Production of generator g, Monte Carlo sample i, period t
f_{l,i,t} Flow through line l, Monte Carlo sample i, period t
θ_{n,i,t} Voltage angle at bus n, Monte Carlo sample i, period t
e_{n,i,t} Load shedding at bus n, Monte Carlo sample i, period t
o_{n,i,t} Production shedding at bus n, Monte Carlo sample i, period t

Table 7
Progressive hedging results for WECC instances. Optimality gaps for progressive hedging are computed against the upper bound of our asynchronous algorithm

Table 8
Solution statistics for WECC.Results are obtained using the best configuration in Table 2: Polyak stepsize and IS primal recovery

Table 9
Solution statistics for the 10-scenario instance of the WECC system, when running on a 4-core laptop

Table 2 :
Polyak stepsize and IS primal recovery. Results are obtained using the same configuration as in Table 3: Polyak stepsize and IS primal recovery