Gradient-based optimisation of the conditional-value-at-risk using the multi-level Monte Carlo method

In this work, we tackle the problem of minimising the Conditional-Value-at-Risk (CVaR) of output quantities of complex differential models with random input data, using gradient-based approaches in combination with the Multi-Level Monte Carlo (MLMC) method. In particular, we consider the framework of multi-level Monte Carlo for parametric expectations and propose modifications of the MLMC estimator, error estimation procedure, and adaptive MLMC parameter selection to ensure the estimation of the CVaR and sensitivities for a given design with a prescribed accuracy. We then propose combining the MLMC framework with an alternating inexact minimisation-gradient descent algorithm, for which we prove exponential convergence in the optimisation iterations under the assumptions of strong convexity and Lipschitz continuity of the gradient of the objective function. We demonstrate the performance of our approach on two numerical examples of practical relevance, which exhibit the same optimal asymptotic cost-tolerance behaviour as standard MLMC methods for fixed-design computations of output expectations.


Introduction
Optimisation algorithms play an important role across various scientific and engineering fields as valuable design tools. The key goal of optimisation is to find the best values of certain parameters (design variables) of a model, typically a differential model such as a Partial Differential Equation (PDE), used to predict the behaviour of a certain system, such that a desired output Quantity of Interest (QoI) of the model is optimised. Such differential models usually also include various other input parameters besides the design variables, which may or may not be fully characterised. There is an increasing interest in the computational science and engineering community in treating such parameters as random variables to reflect their uncertainty, either due to a lack of knowledge or to some intrinsic variability. As a result, the output QoI being optimised also becomes a random variable. Naively optimising the system for only one particular value of the inputs (e.g., the nominal value) can lead to a design that is not robust enough to the uncertainties in the system. A classical example is that of civil engineering structures designed against structural loads for moderate wind conditions, which are then unable to withstand local wind gusts or storms.
The field of PDE-constrained Optimization Under Uncertainty (OUU) seeks to characterise the randomness of the output QoI of the PDE using summary statistics such as moments, quantiles, etc., and to optimise the summary statistic instead of the QoI directly. In particular, in risk-averse PDE-constrained optimisation, one aims at favouring designs with acceptable performance also in extreme conditions. In this case, the summary statistic, often called a risk-measure, should quantify the importance that is given in the design process to unfavourable scenarios. The reader is referred to [2] for a comprehensive review of several risk-measures and their main properties. An important class of risk-measures is that of coherent risk-measures [3,4], which exhibit favourable properties such as monotonicity and convexity.
The sensitivities of the CVaR involve expectations of discontinuous functions of the QoI, whose direct MLMC estimation can compromise the optimal complexity of the method. By numerically differentiating in θ instead, the approach in [14] ameliorates this issue and preserves the same optimal complexity behaviour of the MLMC method as predicted for estimating E[Q(z)]. Lastly, since the MLMC estimator proposed in [1,14] automatically provides an approximation Ĵ(·, z) of the function θ ↦ J(θ, z) at a given design z, we propose in this work to use an optimisation algorithm in which, at each iteration, gradient steps are taken only in the design variable z, whereas exact optimisation in θ is performed using the surrogate Ĵ(·, z). Such an algorithm, introduced in [19], was applied in combination with the Monte Carlo estimation of a regularised version of the CVaR in [20].
The main contributions of this work are as follows. We propose novel expressions for the sensitivity of the objective function defined in Eq. (3) in terms of parametric expectations, thus allowing us to use and extend the framework in [14] to build cost-optimal adaptive MLMC estimators for those sensitivities with error control. We then propose to use MLMC sensitivity estimators within an Alternating Minimisation-Gradient Descent (AMGD) algorithm, analogous to the one proposed in [19,20], where gradient steps are taken in the design variable z, whereas exact optimisation is performed in θ using an MLMC-constructed surrogate Ĵ of J. The accuracy of the surrogate and of the sensitivity estimation is increased over the optimisation iterations and is set proportional to the gradient norm. Following closely the analysis in [20], we prove a convergence result for our algorithm under the assumption that the objective function J(θ, z) is strongly convex with Lipschitz-continuous gradients.
The structure of this paper is as follows. We present the problem formulation in Section 2, for a problem of penalised CVaR minimisation of the form in Eq. (3). The novel expression for the gradients in terms of the parametric expectations is also presented in Section 2. In Section 3, we propose the AMGD algorithm with inexact gradient and objective function estimation and demonstrate its convergence. Section 4 discusses the novel MLMC estimators, error estimation procedure, and adaptive Continuation MLMC (CMLMC)-type hierarchy selection for the gradients of J(θ, z). In addition, it presents a final CMLMC-AMGD algorithm. Lastly, in Section 5, we demonstrate the above optimisation algorithm and MLMC procedure on two problems of interest. The first is the two-dimensional FitzHugh-Nagumo oscillator, typically used to model oscillatory phenomena in excitable media. The second is a more applied problem of pollutant transport modelling. We demonstrate that the procedure proposed in this work performs well and reflects the theoretical results presented in Sections 2 and 3.

Problem formulation
Let (Ω, F, P) denote a complete probability space, ω ∈ Ω an elementary random event and z ∈ R^d the vector of design variables. We denote by Q(z, ω) ∈ R the random QoI, typically a functional of the solution to an underlying differential model with random input ω and design z. We are interested in minimising the CVaR c_τ(z) of the random variable Q(z, ·) over the designs z ∈ R^d, as indicated in Eq. (2), following the formulation presented in [5]. To this end, we first introduce the following assumptions on the random variable Q(z, ·).

Assumption 1. For any z ∈ R^d:
(i) Q(z, ·) is a random variable in L^p(Ω, R) for some p ∈ [1, ∞).
(ii) The measure of Q(z, ·) admits a probability density function, i.e., the measure of Q(z, ·) is free of atoms. We denote by Γ the subset of random variables in L^p(Ω, R) that are free of atoms, and hence, Q(z, ·) ∈ Γ ⊂ L^p(Ω, R).
(iii) There exists a positive random variable K, possibly dependent on z, such that E[K] < ∞ and |Q(z + Δz, ·) − Q(z, ·)| ≤ K ‖Δz‖_{l²} almost surely, for any Δz ∈ R^d close enough to 0 (restated here from [15,16]).
(iv) For almost every ω ∈ Ω, the mapping z ↦ Q(z, ω) is differentiable in R^d and the corresponding vector of partial derivatives Q_z(z, ·) = [Q_{z_1}(z, ·), ..., Q_{z_d}(z, ·)]^T of Q with respect to the components z_k of z, k ∈ {1, ..., d}, is a random variable in L^p(Ω, R^d).
To quantify the tails of Q(z, ·), we first define the τ-VaR q_τ(z), alternatively known as the τ-quantile, of significance τ ∈ (0, 1) as follows:

q_τ(z) := inf{θ ∈ R : P[Q(z, ·) ≤ θ] ≥ τ}.

The τ-CVaR c_τ(z) is defined as the expected value of Q(z, ·) in the tail above and including the τ-VaR q_τ(z):

c_τ(z) := E[Q(z, ·) | Q(z, ·) ≥ q_τ(z)].

As was described in Section 1, [5] proposed that c_τ(z) could be written in the form in Eq. (1) for a random variable Q(z, ·) satisfying Assumption 1.(ii).
In this work, we extensively use the concept of parametric expectations. In particular, let us introduce the function (parametric expectation) Φ : R × R^d → R as:

Φ(θ; z) := E[φ(θ, Q(z, ·))],

with φ : R × R → R given by:

φ(θ, Q) := θ + max(Q − θ, 0)/(1 − τ).

The introduction of the parametric expectation Φ has the advantage that the τ-VaR q_τ(z) and the τ-CVaR c_τ(z) of any significance τ can be obtained by simple post-processing of Φ as:

q_τ(z) = argmin_{θ ∈ R} Φ(θ; z), c_τ(z) = min_{θ ∈ R} Φ(θ; z),

where q_τ(z) is taken as the smallest minimiser. The framework of parametric expectations allows us to write the penalised CVaR minimisation problem in Eq. (2) as a combined minimisation over θ and z as in Eq. (3). The problem is restated below for reference:

min_{(θ, z) ∈ R × R^d} J(θ, z) := Φ(θ; z) + κ‖z − z_ref‖²_{l²}. (10)

For the remainder of this work, we address the challenge of solving problem (10). The combined objective function J(θ, z) has several properties that, when combined with the properties of Q in Assumption 1, have useful implications for gradient-based optimisation techniques. We first discuss the differentiability of J(θ, z).
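For concreteness, the post-processing step above can be sketched in a few lines of Python. This is a minimal single-level Monte Carlo illustration of the identities for q_τ(z) and c_τ(z); the lognormal QoI, the grid bounds and the sample size are illustrative stand-ins, not the estimators used in this work.

```python
import numpy as np

def phi(theta, q, tau):
    """Rockafellar-Uryasev integrand phi(theta, Q) = theta + max(Q - theta, 0)/(1 - tau)."""
    return theta + np.maximum(q - theta, 0.0) / (1.0 - tau)

def var_cvar_from_samples(q_samples, tau, theta_grid):
    """Estimate Phi on a theta-grid by Monte Carlo and post-process for VaR/CVaR."""
    Phi = np.array([phi(t, q_samples, tau).mean() for t in theta_grid])
    j = np.argmin(Phi)            # smallest minimiser of Phi approximates the tau-VaR
    return theta_grid[j], Phi[j]  # (VaR estimate, CVaR estimate = min of Phi)

# Illustrative usage with a synthetic QoI standing in for a differential-model output
rng = np.random.default_rng(0)
q = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)
theta = np.linspace(np.quantile(q, 0.5), np.quantile(q, 0.999), 400)
q_var, c_var = var_cvar_from_samples(q, tau=0.7, theta_grid=theta)
print(f"VaR_0.7 ~ {q_var:.3f}, CVaR_0.7 ~ {c_var:.3f}")
```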
Theorem 2.1 below gives a result on the Fréchet differentiability of the CVaR.

Theorem 2.1. Let L(X, Y) denote the space of bounded linear operators between the normed vector spaces X and Y. We define the function R : R × L^p(Ω, R) → R as follows:

R(θ, Q) := θ + E[max(Q − θ, 0)]/(1 − τ).

Then, R(θ, Q) is jointly Fréchet differentiable in R × Γ, with Fréchet derivative DR(θ, Q) ∈ L(R × L^p(Ω, R), R) at the point (θ, Q) ∈ R × Γ in the direction (δθ, δQ) ∈ R × L^p(Ω, R) given by:

DR(θ, Q)(δθ, δQ) = (1 − P[Q ≥ θ]/(1 − τ)) δθ + E[1_{{Q ≥ θ}} δQ]/(1 − τ).

Proof. The reader is referred to Appendix A for the proof.
This result, combined with Assumption 1 on Q, leads immediately to the differentiability of J (θ, z).
A direct implication of Corollary 2.1 is that the partial derivatives J_z(θ, z) = [J_{z_1}(θ, z), ..., J_{z_d}(θ, z)]^T and J_θ(θ, z) exist and are given by the following expressions:

J_θ(θ, z) = 1 − E[1_{{Q(z,·) ≥ θ}}]/(1 − τ), (14)
J_{z_k}(θ, z) = E[1_{{Q(z,·) ≥ θ}} Q_{z_k}(z, ·)]/(1 − τ) + 2κ(z_k − z_{ref,k}), k ∈ {1, ..., d}. (15)

One of the main contributions of this work is the estimation of the sensitivities in Eqs. (14) and (15) using MLMC estimators. However, as discussed in Section 1, using MLMC to directly estimate the expectations in Eqs. (14) and (15) may result in compromised or non-optimal MLMC performance. The reader is referred to [14,1] for a detailed discussion on the topic. To ameliorate this issue, we propose the following alternative formulation of the gradients in terms of parametric expectations:

J_θ(θ, z) = Φ'(θ; z), (16)
J_z(θ, z) = Ψ'(θ; z) + 2κ(z − z_ref). (17)

The superscript prime of the parametric expectations in Eq. (16) and Eq. (17) denotes the derivative computed with respect to θ. In addition to Φ(θ; z), we have introduced the parametric expectation Ψ(θ; z) ∈ R^d, with components Ψ_k(θ; z) := E[ψ(θ, Q(z, ·), Q_{z_k}(z, ·))], and the function ψ(θ, Q, Q_{z_k}) ∈ R, where z_k and Q_{z_k} denote the k-th components of z and Q_z respectively, k ∈ {1, ..., d}. The differentiability of Ψ(θ; z) in θ follows by the same arguments as Theorem 2.1 and Corollary 2.1, under Assumption 1. It was shown in [14] that since φ and ψ are Lipschitz continuous in their arguments, the corresponding MLMC estimators no longer suffer from the compromised performance due to discontinuities. The idea is then to build MLMC estimators Φ̂(·; z) and Ψ̂(·; z) for the whole functions θ ↦ Φ(θ; z) and θ ↦ Ψ(θ; z) respectively on a suitably chosen interval Θ ⊂ R, and then approximate J_θ and J_z as Ĵ_θ(θ, z) = Φ̂'(θ; z) and Ĵ_z(θ, z) = Ψ̂'(θ; z) + 2κ(z − z_ref). As a by-product of this approach for estimating sensitivities, we construct an approximation θ ∈ Θ ↦ Ĵ(θ, z) = Φ̂(θ; z) + κ‖z − z_ref‖²_{l²} of the objective function itself for all θ ∈ Θ, at a given design z ∈ R^d. This allows us to consider an optimisation problem in which exact minimisation in θ is performed at each iteration using the surrogate Ĵ, and gradient steps are performed only in z using the approximate gradient Ĵ_z. Notice that the gradient approximation in z is inconsistent with the surrogate model Ĵ, i.e., Ĵ_z ≠ ∂_z Ĵ, in contrast to Ĵ_θ. We will detail this approach in the next section.
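The following short calculation sketches why Eqs. (16) and (17) hold, using the Rockafellar-Uryasev form of φ recalled above; the explicit form and sign convention of ψ below are an assumption consistent with Eqs. (14)-(17), not necessarily the convention of the original derivation.

```latex
% Derivative identities behind Eqs. (16)-(17), with the assumed convention
% \psi(\theta, Q, Q_{z_k}) = -\, Q_{z_k} \,(Q - \theta)^+ / (1 - \tau).
\begin{align*}
\Phi'(\theta; z) &= \partial_\theta\, \mathbb{E}\!\left[\theta + \frac{(Q - \theta)^+}{1 - \tau}\right]
  = 1 - \frac{1}{1 - \tau}\, \mathbb{E}\!\left[\mathbb{1}_{\{Q \ge \theta\}}\right] = J_\theta(\theta, z), \\
\Psi_k'(\theta; z) &= \partial_\theta\, \mathbb{E}\!\left[-\frac{Q_{z_k}\,(Q - \theta)^+}{1 - \tau}\right]
  = \frac{1}{1 - \tau}\, \mathbb{E}\!\left[\mathbb{1}_{\{Q \ge \theta\}}\, Q_{z_k}\right]
  = J_{z_k}(\theta, z) - 2\kappa\,(z_k - z_{\mathrm{ref},k}).
\end{align*}
```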

Gradient based optimisation algorithm
In this section, we present a gradient-based iterative procedure to find a local minimiser (θ*, z*) of the OUU problem in Eq. (10), should it exist. The broad goal of a gradient-based algorithm is to define the iterates (θ_j, z_j), j ∈ N, such that lim_{j→∞}(θ_j, z_j) = (θ*, z*), where the iterates are computed using gradient information. Motivated by our interest in using MLMC estimators based on parametric expectations to estimate the objective function and its sensitivities, we consider in this section the general situation in which, at each iteration j of the gradient-based algorithm, we build an approximation Ĵ^{(j)}(θ, z), θ ∈ Θ, of the objective function at the design z ∈ R^d on a suitably chosen interval Θ ⊂ R (which may depend on j, although to ease the notation, we do not highlight such dependence), as well as approximations Ĵ_θ^{(j)} and Ĵ_z^{(j)} of the partial derivatives, where the approximation Ĵ_z^{(j)} may not coincide with the z-derivative of Ĵ^{(j)}. The approximations Ĵ^{(j)}, Ĵ_θ^{(j)} and Ĵ_z^{(j)} may be random, as will be the case for MLMC estimators. We then propose the following variation of the standard gradient descent algorithm, starting from an initial design z_0:

θ_j = argmin_{θ ∈ Θ} Ĵ^{(j)}(θ, z_j),
z_{j+1} = z_j − α Ĵ_z^{(j)}(θ_j, z_j), (21)

where α > 0 denotes a step size parameter. We note that according to the procedure in [14], the interval Θ can be freely selected and, hence, we can ensure that θ_j always belongs to the interior of Θ, so that Ĵ_θ^{(j)}(θ_j, z_j) = 0. In Theorem 3.1 in Section 3.1, we show that the iterates (θ_j, z_j) converge exponentially fast in the iteration counter j towards (θ*, z*) under additional assumptions on the objective function J and its approximations Ĵ^{(j)}. The results of Theorem 3.1, specifically the implications of Eq. (24) introduced there, demonstrate that exponential convergence of the iterates z_j and θ_j in j can be obtained if the gradient approximation is accurate up to a tolerance that is a fraction η of the gradient magnitude at the previous iteration.
The step size is selected sufficiently small, and remains fixed over all optimisation iterations, although variable step sizes and line search methods could easily be added. The algorithm is terminated once the gradient magnitude drops to a specified fraction of its initial value. For convenience in the following, we introduce the notation w = (θ, z), J_w = (J_θ, J_z) and Ĵ_w^{(j)} = (Ĵ_θ^{(j)}, Ĵ_z^{(j)}).
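A compact sketch of the resulting iteration is given below; build_surrogate is a hypothetical user-supplied routine standing in for the MLMC machinery of Section 4, and the sketch is meant to fix ideas rather than reproduce Algorithm 1 verbatim.

```python
import numpy as np

def amgd(build_surrogate, z0, alpha, r_min, max_iter=200):
    """Alternating minimisation-gradient descent sketch.

    build_surrogate(z) returns
      theta_grid : discretisation of the interval Theta,
      J_hat      : surrogate values J^(j)(theta, z) on the grid,
      Jz_hat     : (len(theta_grid), d) gradient approximations J_z^(j)(theta, z).
    """
    z = np.asarray(z0, dtype=float)
    g0_norm = None
    for _ in range(max_iter):
        theta_grid, J_hat, Jz_hat = build_surrogate(z)
        i = np.argmin(J_hat)           # exact minimisation in theta on the surrogate
        theta, g = theta_grid[i], Jz_hat[i]
        g_norm = np.linalg.norm(g)     # gradient step taken only in the design variable z
        g0_norm = g0_norm or g_norm
        if g_norm <= r_min * g0_norm:  # stop once the gradient ratio is small enough
            break
        z = z - alpha * g
    return theta, z
```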

Convergence analysis
For the interested reader, we present a self-contained convergence analysis of the iterates (θ_j, z_j) in Theorem 3.1, under additional assumptions on J and Ĵ^{(j)}, based on the analysis presented in [20]. The key differences between the two analyses are related to the fact that the algorithm studied here is an AMGD algorithm instead of a pure gradient descent algorithm. We first note that the objective function J(θ, z) is convex under the additional assumption that Q(z, ·) is almost surely convex in z [5, Theorem 10]. When combined with the assumption that J → ∞ when ‖z‖_{l²}, |θ| → ∞, this ensures that a minimiser of J(θ, z) exists in R × R^d. However, we require additional assumptions on the objective function J to prove exponential convergence of the iterates θ_j and z_j towards such a minimiser; namely, Assumptions 2 and 3 below on strong convexity and on the Lipschitz continuity of the gradients, respectively. An immediate implication of Assumption 2 is that there exists a unique minimiser (θ*, z*) ∈ R × R^d for the OUU problem in Eq. (10), such that J_z(θ*, z*) = J_θ(θ*, z*) = 0.
In what follows, we denote by E_j[·] the expectation conditional on all of the random variables used to define z_j (i.e., conditioned on the past up to iteration j), and by ⟨·, ·⟩ the l² inner product. Readers interested in the implementation details of Algorithm 1 and its relation to the MLMC method can proceed directly to Section 4.

Assumption 2. The objective function J is µ-strongly convex, i.e., there exists µ > 0 such that, for all w_a, w_b ∈ R × R^d,

J(w_a) ≥ J(w_b) + ⟨J_w(w_b), w_a − w_b⟩ + (µ/2)‖w_a − w_b‖²_{l²},

or, equivalently:

⟨J_w(w_a) − J_w(w_b), w_a − w_b⟩ ≥ µ‖w_a − w_b‖²_{l²}.

Assumption 3. The objective function J has Lipschitz continuous gradients, i.e., there exists L > 0 such that, for all w_a, w_b ∈ R × R^d:

‖J_w(w_a) − J_w(w_b)‖_{l²} ≤ L‖w_a − w_b‖_{l²}.

Lemma 3.1. Let J satisfy Assumptions 2 and 3. Then we have that, for 0 < α ≤ 1/L and all w ∈ R × R^d,

‖w − α J_w(w) − w*‖²_{l²} ≤ (1 − αµ)‖w − w*‖²_{l²}.

Theorem 3.1. Let J satisfy Assumptions 2 and 3, and let Ĵ^{(j)} : Θ × R^d → R satisfy the following condition:

sup_{θ ∈ Θ} ‖Ĵ_w^{(j)}(θ, z_j) − J_w(θ, z_j)‖_{l²} ≤ η ‖J_w(θ_{j−1}, z_j)‖_{l²}, (24)

for some η > 0, where (θ_{j−1}, z_j) is the j-th iterate produced by Algorithm 1 with step size α satisfying 0 < α ≤ 1/L and αµ ≤ 1. Then, the following result holds true:

E[‖w_j − w*‖²_{l²}] ≤ C_1 ξ^j,

for some constants C_1 > 0 and 0 < ξ < 1.
Proof. From the definition of the iterate z_{j+1} in Eq. (21), we have:

‖w_{j+1} − w*‖²_{l²} = ‖w_j − w*‖²_{l²} + T̂_1 + T̂_2 + T̂_3,

where T̂_1 = α²‖Ĵ_w^{(j)}(w_j)‖²_{l²}, T̂_2 = −2α⟨J_w(w_j), w_j − w*⟩ and T̂_3 = −2α⟨Ĵ_w^{(j)}(w_j) − J_w(w_j), w_j − w*⟩. The term T̂_1 can be bounded using the accuracy condition in Eq. (24); the term T̂_2 can be bounded using Lemma 3.1; and the term T̂_3 can be bounded using the Cauchy-Schwarz inequality together with Eq. (24). Combining the bounds for T̂_1, T̂_2 and T̂_3, and utilising Lemma 3.1 once again, we obtain, for 0 < α ≤ 1/L and αµ ≤ 1, the simplified bound

E_j[‖w_{j+1} − w*‖²_{l²}] ≤ (1 − C_1 + C_2)‖w_j − w*‖²_{l²},

where we have defined the constants C_1 = αµ − ηL² + ηL and C_2 = (η² + η)L² + ηL. We note that the leading constant on the right-hand side is less than 1 as long as C_1 > C_2, which holds true for η < √(1 + αµ/L²) − 1. This in turn ensures contraction in the norm ‖z‖²_{l²} + C_1|θ|² on the space R^d × R. This completes the proof.

Remark 1. We note that although the accuracy condition in Eq. (24) is stated in the L^∞-norm for all θ, the proof of Theorem 3.1 uses this property only at θ_j. This condition is required since we do not know the quantile θ_j a priori, and we seek to use the parametric expectation framework from [14] to estimate it; [14] requires that the error in the approximations Ĵ^{(j)} be controlled at all θ in order to estimate θ_j accurately.

Remark 2.
In practical applications, it is difficult to determine whether Assumptions 2 and 3 are satisfied, since both are strongly dependent on the properties of the random QoI Q(z, ·). These assumptions require stronger properties on Q(z, ·) and its Probability Density Function (PDF) than those presented in Assumption 1; for example, that the PDF remains both upper bounded and lower bounded away from zero for all designs z, and that the random variable Q(z, ·) is bounded, i.e., Q(z, ·) ∈ L ∞ (Ω, R).

Gradient estimation and error control using MLMC methods
We note that the key assumption in the proof of Theorem 3.1 is Eq. (24); namely, that the gradient approximation is accurate up to a tolerance that is proportional to the magnitude of the true gradient. As stated earlier in Section 1, we are interested in utilising the framework of MLMC estimators for parametric expectations developed in [14] for the accurate estimation of the objective function J (the CVaR risk-measure) and its gradient.
Expressing the gradients J_z and J_θ in terms of the first derivatives of the parametric expectations Φ(θ; z) and Ψ(θ; z) as in Eqs. (16) and (17), and estimating the latter using MLMC estimators, offers several key advantages. The first advantage was already seen in Section 3; namely, that Ĵ_z^{(j)} and Ĵ_θ^{(j)} can be estimated for all θ for a given design z in one shot. Secondly, as was demonstrated in [14], the level-wise differences for the MLMC estimator of Φ(θ; z), analogous to the level-wise differences corresponding to the classical MLMC estimator of E[Q], decay at the same rate in the levels l as the differences Q_l − Q_{l−1}, in an appropriately selected norm over θ ∈ R. This ensures that if cost-optimal MLMC behaviour can be achieved for estimating E[Q], then it can be achieved also for MLMC estimators of Φ(θ; z) and Ψ(θ; z), using a practically computable number of samples. The last key advantage is that, using the mechanism in [14], one can select the parameters of the MLMC estimator such that a prescribed tolerance can be attained on the MLMC approximation error on Φ and Ψ. By prescribing a tolerance proportional to the gradient magnitude, one can estimate the gradient using MLMC estimators that respect the condition in Eq. (24), as required by Algorithm 1.
Although the procedure used in this work to estimate Φ accurately is identical to the one described in [14], some important modifications are required to use the same procedure for accurately estimating Ψ. We present in this section the modifications of the work developed in [14] that are required for the accurate estimation of Ψ, and consequently the gradients J θ and J z , using the MLMC method.

MLMC estimator for the gradients
We begin by recalling that the parametric expectation Ψ is defined as in Eq. (18). The proposed MLMC method relies on a sequence of approximations {Q_l(z)}_{l=0}^{L} of Q(z) of increasing accuracy and cost, and the construction of the estimator of each component Ψ_k, k ∈ {1, ..., d}, follows the same construction as that for Φ in [14]. The first step is to estimate Ψ_k(θ_r; z), r ∈ {1, ..., n}, on a set of n equidistant points θ = {θ_1, ..., θ_n} such that Θ = [θ_1, θ_n], by a standard MLMC estimator Ψ̂_{L,k}(θ_r; z), which reads:

Ψ̂_{L,k}(θ_r; z) = Σ_{l=0}^{L} (1/N_l) Σ_{i=1}^{N_l} [ψ(θ_r, Q_l^{(i,l)}, Q_{z_k,l}^{(i,l)}) − ψ(θ_r, Q_{l−1}^{(i,l)}, Q_{z_k,l−1}^{(i,l)})], (44)

with the convention that the level-(−1) term vanishes, and where Q_l^{(i,l)} and Q_{l−1}^{(i,l)} are correlated realisations of Q_l(z) and Q_{l−1}(z), respectively, typically obtained by solving the underlying differential problem on meshes with discretisation parameters h_l and h_{l−1}, driven by the same realisation ω^{(i,l)} of the random parameters for the fixed design z. Likewise, Q_{z_k,l}^{(i,l)} and Q_{z_k,l−1}^{(i,l)} denote the corresponding sensitivities of Q_l and Q_{l−1} with respect to z_k. {N_l}_{l=0}^{L} is a decreasing sequence of sample sizes. The MLMC hierarchy is hence defined by three parameters; namely, the number of interpolation points n, the number of levels L and the level-wise sample sizes N_l.
We finally construct an MLMC estimator Ψ̂_{L,k} of the whole function Ψ_k(·; z) : Θ → R by interpolating over the pointwise estimates as below:

Ψ̂_{L,k}(·; z) = S_n[Ψ̂_{L,k}(θ; z)],

where S_n denotes a uniform cubic spline interpolation operator and Ψ̂_{L,k}(θ; z) denotes the set of pointwise MLMC estimates in Eq. (44), that is, Ψ̂_{L,k}(θ; z) = {Ψ̂_{L,k}(θ_1; z), Ψ̂_{L,k}(θ_2; z), ..., Ψ̂_{L,k}(θ_n; z)}. An estimate of the first derivative of Ψ_k in θ is then obtained by computing the derivative of the resultant interpolated function, for each component Ψ̂_{L,k}:

Ψ̂'_{L,k}(θ; z) = (d/dθ) S_n[Ψ̂_{L,k}(θ; z)](θ).
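A minimal sketch of the pointwise estimator in Eq. (44) and of the spline differentiation step follows; samplers is a hypothetical list of level-wise routines returning correlated samples, and the form of psi matches the convention assumed in Section 2.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def psi(theta, q, qz, tau):
    # Assumed convention from Section 2: psi = -qz * max(q - theta, 0) / (1 - tau)
    return -qz * np.maximum(q - theta, 0.0) / (1.0 - tau)

def mlmc_psi_derivative(samplers, N, theta_grid, tau):
    """MLMC estimate of Psi_k on an increasing theta-grid, then d/dtheta via a spline.

    samplers[l](n) returns n correlated samples (q_l, qz_l, q_lm1, qz_lm1) of the QoI
    and its z_k-sensitivity on levels l and l-1 (the l-1 entries are ignored at l = 0).
    """
    Psi = np.zeros_like(theta_grid)
    for l, n_l in enumerate(N):
        q_l, qz_l, q_lm1, qz_lm1 = samplers[l](n_l)
        for r, th in enumerate(theta_grid):
            corr = psi(th, q_l, qz_l, tau).mean()
            if l > 0:                      # telescoping correction; level 0 is plain MC
                corr -= psi(th, q_lm1, qz_lm1, tau).mean()
            Psi[r] += corr
    spline = CubicSpline(theta_grid, Psi)  # cubic spline through the pointwise estimates
    return spline.derivative()             # callable estimate of Psi_k' on Theta
```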

Estimation of the Mean Squared Error (MSE) of the gradient
Since we have assumed that the gradient estimate Ĵ_w is a random vector in L^p(Ω, R^{d+1}) with p ≥ 2, we propose to quantify the error on the gradient in an MSE sense as follows:

MSE[Ĵ_w^{(j)}(·, z_j)] := E[ sup_{θ ∈ Θ} |J_θ(θ, z_j) − Ĵ_θ^{(j)}(θ, z_j)|² + Σ_{k=1}^{d} sup_{θ ∈ Θ} |J_{z,k}(θ, z_j) − Ĵ_{z,k}^{(j)}(θ, z_j)|² ], (47)

where J_{z,k} and Ĵ_{z,k}^{(j)} denote the k-th components of J_z and Ĵ_z^{(j)}. We now present a result relating MSE[Ĵ_w^{(j)}(·, z_j)] to the MSEs of the MLMC estimators Φ̂_L and Ψ̂_L.

Proposition 4.1. Let Φ̂_L(·; z_j) and Ψ̂_L(·; z_j) denote the MLMC estimators of Φ(·; z_j) and Ψ(·; z_j) as defined in [14] and Eq. (44) respectively. Let Ĵ_w(·, z_j) be the approximation to the true gradient J_w(·, z_j) computed using the estimates Φ̂_L(·; z_j) and Ψ̂_L(·; z_j) at the j-th optimisation iteration. Let Ψ_k and Ψ̂_{L,k} denote the k-th components of Ψ and Ψ̂_L respectively, for k ∈ {1, ..., d}. Let the MSEs on Φ̂_L and Ψ̂_{L,k} be defined as follows:

MSE[Φ̂_L] := E[ sup_{θ ∈ Θ} |Φ'(θ; z_j) − Φ̂'_L(θ; z_j)|² ], (48)
MSE[Ψ̂_{L,k}] := E[ sup_{θ ∈ Θ} |Ψ'_k(θ; z_j) − Ψ̂'_{L,k}(θ; z_j)|² ], (49)

for the design z_j ∈ R^d. Then, we have that:

MSE[Ĵ_w^{(j)}(·, z_j)] = MSE[Φ̂_L] + Σ_{k=1}^{d} MSE[Ψ̂_{L,k}]. (50)

Proof. We first note that the componentwise errors on the gradient coincide with the errors on the derivatives of the parametric expectations, since the penalty term 2κ(z_k − z_{ref,k}) is computed exactly. Adding together each of the contributions and taking the expectation on both sides, we obtain the result.

As was described earlier in this section, we seek to use the error estimation and adaptivity procedure described in [14] to accurately estimate Φ and Ψ_k, and consequently, to accurately estimate the gradient J_w. From Eq. (50), it is evident that if one can control the MSE of Φ̂_L and Ψ̂_{L,k} in an L^∞ sense, one can control the MSE on the gradient J_w as defined in Eq. (47). Specifically, the MSE of the gradient is equal to a simple sum of the MSEs of the parametric expectations. Eq. (50) hence allows us to use the work of [14] to accurately calibrate MLMC estimators for the parametric expectations Φ and Ψ such that the resultant gradient estimate is accurate up to a prescribed tolerance.
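The componentwise identity underlying the proof can be made explicit as follows, using the gradient expressions of Section 2 (as reconstructed there), in which the penalty term is computed exactly:

```latex
% The penalty term cancels componentwise, so only parametric-expectation errors remain:
\begin{align*}
\hat{J}_\theta(\theta, z_j) - J_\theta(\theta, z_j)
  &= \hat{\Phi}_L'(\theta; z_j) - \Phi'(\theta; z_j), \\
\hat{J}_{z,k}(\theta, z_j) - J_{z,k}(\theta, z_j)
  &= \bigl(\hat{\Psi}_{L,k}'(\theta; z_j) + 2\kappa (z_{j,k} - z_{\mathrm{ref},k})\bigr)
   - \bigl(\Psi_k'(\theta; z_j) + 2\kappa (z_{j,k} - z_{\mathrm{ref},k})\bigr) \\
  &= \hat{\Psi}_{L,k}'(\theta; z_j) - \Psi_k'(\theta; z_j).
\end{align*}
% Taking the supremum over theta, squaring, summing over the components and taking
% expectations yields Eq. (50).
```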

Modified error estimation procedure
Since the error estimation procedure is independent of the design z, in the following we drop the explicit dependence of Φ and Ψ on z, with the dependence being implied. We recall that the error estimation procedure for estimating MSE[Φ̂_L] is identical to that presented in [14]. The procedure for estimating MSE[Ψ̂_L], however, has several modifications relative to the procedure for Φ̂_L, which we detail in this section.
We recall that MSE[Ψ̂_L] was defined in Eq. (49). Proceeding similarly as in [14], we can bound MSE[Ψ̂_L] by a sum of three contributions per component, where ê_i^{Ψ_k}, ê_b^{Ψ_k} and ê_s^{Ψ_k} denote error estimators that estimate the error due to interpolation, the error due to the approximation of the QoI (i.e., bias error), and the error due to finite sampling (i.e., statistical error), respectively, on Ψ̂_{L,k}. The reader is referred to [14] for a detailed discussion of each of the three error components, as well as of their corresponding estimators.
The procedure for estimating the interpolation and bias errors requires the accurate estimation of θ-derivatives of the function Ψ_{l,k}(θ) = E[ψ(θ, Q_l, Q_{z_k,l})]. Although the true function Ψ_{l,k} is smooth, replacing the true probability density with the empirical probability density corresponding to a Monte Carlo estimator implies that the right-hand side becomes a linear combination of piecewise linear functions of θ. The first derivative of such a function is piecewise constant, and higher-order derivatives do not exist at the points of discontinuity and are zero elsewhere. An MLMC hierarchy designed based on estimates obtained in this manner would lead to non-optimal complexity behaviour. In [14, Section 3.2], a Kernel Density Estimation (KDE) based procedure was described to ameliorate this issue. Although the error estimation procedure is broadly the same for estimating Ψ as for Φ, an important distinction arises with respect to this KDE procedure, which we detail below.
Since the issue chiefly relates to the regularity of the empirical Monte Carlo probability density, we propose the use of a KDE-based smoothing technique; namely, we replace the true joint density p_l of (Q_l, Q_{z_k,l}) with a KDE-smoothed joint probability density p_l^{kde}, which consists of a linear combination of two-dimensional kernels composed of products of two one-dimensional Gaussian kernels centred on each of the N_l fine samples {(Q_l^{(i,l)}, Q_{z_k,l}^{(i,l)})}_{i=1}^{N_l}. Here, K_{δ_l}(·, µ) denotes a Gaussian kernel with mean µ and bandwidth parameter δ_l > 0, which is selected according to Scott's rule [21] for the realisations {Q_l^{(i,l)}}_{i=1}^{N_l} and controls the "width" of the kernel. A closed-form expression can be computed for the integral in Eq. (59), leading to the KDE-smoothed approximation E_{l,k}^{kde}[ψ(θ, ·, ·)] of Ψ_k. According to the procedure in [14], the interpolation error requires the estimation of the quantity Ψ_k^{(4)}, for which we use the KDE estimator described above. To this end, we first select a level L/2 from the MLMC hierarchy; this choice of level is to ensure that Ψ̂_{L/2,k} is sufficiently close to Ψ_k, and that N_{L/2} is large enough for the KDE procedure to produce accurate estimates. We then construct the KDE approximation Υ_{L/2,k}(θ) := E^{kde}_{L/2,k}[ψ(θ, ·, ·)]. The fourth derivative Υ_k^{(4)} is then constructed using a second-order central finite difference scheme on a uniform grid on Θ with n points, and its norm is evaluated on the same grid. For the bias error on Ψ̂_{L,k}, we are required to estimate the corresponding quantity for the level-wise differences. Replacing the expectation by a Monte Carlo estimator leads to the same regularity issue as described earlier in this section. To smooth the empirical Monte Carlo density, we propose the use of a KDE-smoothed approximation p_{l,l−1}^{kde} to the true density p_{l,l−1} of (Q_l, Q_{z_k,l}, Q_{l−1}, Q_{z_k,l−1}), consisting of products of four one-dimensional Gaussian kernels. The expectation in Eq. (61) can then be replaced by the KDE-smoothed expectation in Eq. (66), which can in turn be used in the bias error estimation procedure outlined in [14]. Lastly, the procedure for the statistical error follows the idea of bootstrapping developed in [14] identically, without modification.
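A sketch of the KDE-smoothed estimate and of the finite-difference evaluation of the fourth derivative is given below. Since ψ is linear in Q_{z_k}, the kernel in that coordinate integrates out to the sample value, so only the kernel in the Q-coordinate appears; the closed form used is that of E[(X − θ)^+] for Gaussian X, and the sign convention of ψ is the one assumed in Section 2.

```python
import numpy as np
from scipy.stats import norm

def kde_smoothed_psi(theta_grid, q, qz, tau):
    """KDE-smoothed estimate of Psi_k(theta) = E[psi(theta, Q, Q_{z_k})] on a grid."""
    n = q.size
    delta = q.std(ddof=1) * n ** (-1.0 / 5.0)  # Scott's rule bandwidth in one dimension
    s = (q[None, :] - theta_grid[:, None]) / delta
    # Closed form E[(X - theta)^+] = (m - theta)*cdf(s) + delta*pdf(s) for X ~ N(m, delta^2)
    expected_plus = (q[None, :] - theta_grid[:, None]) * norm.cdf(s) + delta * norm.pdf(s)
    return -(qz[None, :] * expected_plus).mean(axis=1) / (1.0 - tau)

def fourth_derivative(f_vals, h):
    """Second-order central finite differences for the fourth theta-derivative."""
    return (f_vals[:-4] - 4 * f_vals[1:-3] + 6 * f_vals[2:-2]
            - 4 * f_vals[3:-1] + f_vals[4:]) / h ** 4
```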

Adaptive hierarchy selection procedure and CMLMC-gradient descent algorithm
We discuss in this section how to select the parameters of the MLMC hierarchy; namely, the number of interpolation points n, the level-wise sample sizes N_l and the number of levels L. The aim is to select these parameters such that a prescribed tolerance can be attained on the gradient estimate Ĵ_w^{(j)}. In what follows, we drop the dependence on z for notational simplicity, with the dependence being implied. We propose here a minor variation of the framework presented in [14, Section 5]. An adaptive strategy was proposed therein for the selection of the hierarchy parameters n, L and N_l for any statistic s_τ, the MSE of whose estimator ŝ_τ could be bounded by a linear combination of the MSEs of Φ̂_L and its derivatives. We first note that the same hierarchy adaptivity procedure extends trivially to any linear combination of MSEs of Φ̂_L, Ψ̂_{L,k}, and their derivatives. Specifically, this includes the case of the MSE on the gradient Ĵ_w^{(j)} in Eq. (50). In addition, each of the MSEs on the parametric expectations in Eq. (50) can be split into its three contributions (squared interpolation, bias and statistical errors), similarly to Eq. (54), leading to an error estimator for MSE[Ĵ_w^{(j)}(w)] in terms of the component estimators, where ê_i^Φ, ê_b^Φ and ê_s^Φ denote the interpolation, bias and statistical error estimators corresponding to MSE[Φ̂_L], and similarly for the Ψ̂_{L,k}. Once in this form, the procedure described in [14] for adapting the hierarchy parameters n, L and N_l for linear combinations of MSEs can be extended trivially to the current case, when combined with the modifications proposed in Section 4.3.
Lastly, we comment that the above adaptive procedure is carried out within the framework of the CMLMC algorithm presented in [14]. The CMLMC algorithm works by first simulating a small "screening" hierarchy with relatively few samples and levels. The algorithm then adapts the hierarchy parameters with respect to a decreasing set of tolerances, of which the target tolerance is the final one. The optimal parameters for a given tolerance in the sequence are computed based on estimates obtained from the optimal hierarchy for the previous tolerance, or the initial "screening" hierarchy. In this way, the MLMC estimator becomes robust to large variations in the estimates produced by an initial screening hierarchy.
We now possess all the ingredients required to tailor Algorithm 1 to the specific case in which an MLMC procedure is combined with a CMLMC algorithm to estimate the gradient up to a prescribed tolerance. The algorithm is detailed below, and differs from Algorithm 1 in that the first estimate of the gradient is computed based on a screening hierarchy, and in that successive gradients are computed such that the MSE on the gradient satisfies a tolerance equal to a fraction of the gradient magnitude from the previous iteration; namely, the right-hand side of Eq. (24) is estimated using Ĵ_w^{(j−1)}(w_{j−1}). Another key difference to note is that, in contrast to the CMLMC algorithm described in [14], the screening hierarchy used to compute first estimates for the design z_j is the optimal hierarchy used to accurately estimate the gradient for the design z_{j−1}. In addition, the gradient at the first design point z_0 is estimated using an initial small fixed hierarchy.
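The overall loop can be sketched as follows, with estimate_gradient standing in, hypothetically, for the CMLMC driver of Section 4:

```python
import numpy as np

def cmlmc_amgd(estimate_gradient, z0, alpha, eta, r_min, hierarchy0, max_iter=100):
    """CMLMC-AMGD sketch: gradient tolerances set to a fraction of the gradient norm.

    estimate_gradient(z, tol, hierarchy) returns (theta, grad, hierarchy): the exact
    minimiser in theta over the surrogate, the gradient approximation at (theta, z)
    with MSE below tol**2, and the optimal hierarchy calibrated by the CMLMC.
    """
    z, hierarchy = np.asarray(z0, dtype=float), hierarchy0
    # First gradient estimate computed from the small initial "screening" hierarchy
    theta, grad, hierarchy = estimate_gradient(z, tol=np.inf, hierarchy=hierarchy)
    g0 = np.linalg.norm(grad)
    for _ in range(max_iter):
        g = np.linalg.norm(grad)
        if g <= r_min * g0:                 # stop on the gradient-ratio criterion
            break
        z = z - alpha * grad
        # Tolerance: a fraction eta of the previous gradient magnitude (cf. Eq. (24));
        # the previous optimal hierarchy warm-starts the CMLMC screening phase.
        theta, grad, hierarchy = estimate_gradient(z, tol=eta * g, hierarchy=hierarchy)
    return theta, z
```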

FitzHugh-Nagumo oscillator
To demonstrate the optimisation framework, we use the FitzHugh-Nagumo system described in [22] and [23]. The FitzHugh-Nagumo model is a two-dimensional simplification of the Hodgkin-Huxley model introduced by [24], which was originally proposed in the field of neuroscience to model the phenomenon of spiking neurons. The dynamical equations read as follows:

v̇(t) = v − v³/3 − w + I, ẇ(t) = ζ(v + a − b w),

where [v(t), w(t)]^T ∈ R² denotes the state variables and a, b, ζ and I denote system parameters. Fig. 1 shows a phase-space plot containing the v- and w-nullclines for a nominal value of the system parameters. The oscillator enters a limit cycle for parameter values such that the intersection of the two nullclines lies in the interval v ∈ [−1, 1], indicated by the black lines. If the intersection lies outside this interval, then the oscillator eventually reaches the intersection and remains at a constant value of v and w. Although initially proposed to model neuron behaviour, the FitzHugh-Nagumo model has seen widespread use in modelling wave phenomena in excitable media. Examples include blood coagulation [25,26] and cardio-electrophysiological phenomena [27], wherein the optimal control of the model plays an important role in the application. The reader is referred to [28] for an overview of existing work on the modelling applications and optimal control of the FitzHugh-Nagumo system. In this work, we study the forced FitzHugh-Nagumo system, obtained by adding a white-noise forcing to the dynamics above, with design vector z = [a, b, ζ, I]^T. The system is discretised in time with the Euler-Maruyama scheme of Appendix B on a uniform grid with N_{T,l} steps of size ∆t_l at level l, yielding the discrete states v_n^l and w_n^l, where ξ_{1,n}^l and ξ_{2,n}^l are independently drawn realisations of standard normal random variables entering the scheme. The quantity of interest that we study is a time average over the trajectory, whose level-l approximation Q_l is computed on the same temporal grid.
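A minimal Euler-Maruyama simulation of the forced system is sketched below; the noise amplitude, the initial state and the particular time-averaged QoI are illustrative assumptions, not the exact choices of this section.

```python
import numpy as np

def fhn_qoi(z, T=40.0, n_steps=4000, sigma=0.1, rng=None):
    """Level-l style simulation of the forced FitzHugh-Nagumo system, z = [a, b, zeta, I]."""
    rng = rng or np.random.default_rng()
    a, b, zeta, I = z
    dt = T / n_steps
    v, w, qoi = -1.0, 1.0, 0.0           # illustrative initial state
    for _ in range(n_steps):
        xi1, xi2 = rng.standard_normal(2)
        dv = v - v ** 3 / 3.0 - w + I    # deterministic FitzHugh-Nagumo drift
        dw = zeta * (v + a - b * w)
        v += dt * dv + sigma * np.sqrt(dt) * xi1
        w += dt * dw + sigma * np.sqrt(dt) * xi2
        qoi += dt * v ** 2               # time average of v^2 (assumed QoI)
    return qoi / T

# One realisation at the nominal design; in the MLMC setting, coarse and fine
# levels would share the same Brownian increments to remain correlated.
q_l = fhn_qoi([0.7, 0.8, 0.08, 1.0])
```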
To compute the sensitivities Q_{z,l}, we utilise the method of adjoints. We consider the adjoint variables λ_n^l and ν_n^l corresponding to v_n^l and w_n^l, n ∈ {1, ..., N_{T,l}}, respectively; the adjoint equation is solved backwards in time, and the reader is referred to Appendix B for the details of the derivation. Once the adjoint equation is solved, the approximation Q_{z,l} of the sensitivities Q_z at level l can be obtained by accumulating, over the time steps, products of the state and adjoint trajectories (for instance, the sensitivity with respect to b involves terms of the form ∆t_l ζ w_n^l ν_{n+1}^l). To demonstrate the performance of Algorithm 2, we assess the performance of its two components individually; firstly, the performance of the CMLMC algorithm, the error estimation procedure and the adaptive strategy described in Section 4 for accurately estimating the gradient for a given design, and secondly, the gradient-based optimisation procedure described in Algorithm 2. We first assess the performance of the CMLMC algorithm and adaptive strategy. We remark that the solution of the forward and adjoint problems, as well as the CMLMC procedure, are implemented within the XMC software library [29], which we use for the simulations presented herein.
We seek to accurately estimate the gradient J_w(·, z_0) using the estimator Ĵ_w(·, z_0), where z_0 = [0.7, 0.8, 0.08, 1.0] and we set τ = 0.70 for the significance of the CVaR. The gradient and gradient error are estimated using the MLMC procedure described in Section 4. To assess the reliability of the error bound derived in Proposition 4.1, we run a reliability study wherein we adapt the parameters of the MLMC hierarchy to attain a prescribed tolerance on MSE[Ĵ_w(·, z_0)]. We run the MLMC algorithm 20 times for each tolerance tested and compare the estimated error to the true error obtained using a reference gradient computed using a Monte Carlo estimator with 2 × 10^5 samples and 2 × 10^4 time steps. Specifically, we are interested in assessing the tightness of the inequality in Eq. (68).
The resultant plot is shown in Fig. 2a. Three errors are plotted in Fig. 2a; namely, the true error on the gradient, defined in the L^∞ sense, corresponding to the term on the leftmost side of Eq. (68); the square root of the MSE estimate on the gradient, produced by the optimally calibrated MLMC hierarchy, corresponding to the term on the rightmost side of Eq. (68); and the true error on the gradient evaluated at the point (θ_0, z_0), where θ_0 corresponds to the 70%-VaR for the design z_0. The true errors are computed with respect to a reference solution computed using 2 × 10^5 samples and 2 × 10^4 time steps. As can be seen from the figure, the MSE estimator provides a tight bound on the true error on the parametric expectations. However, the true error on the gradient in the L^∞ sense is much larger than the true pointwise error. This is a natural consequence of using the L^∞-norm over the entire interval Θ to define the MSE, as compared to using the pointwise error. Controlling the MSE in an L^∞ sense, as defined in Eq. (47), is necessitated by the accuracy condition in Eq. (24), in order to ensure exponential convergence of Algorithm 2. Fig. 2b shows the complexity behaviour of the MLMC estimator calibrated using the CMLMC algorithm. We compute the cost required to obtain the final optimal hierarchy for a given tolerance ε² on MSE[Ĵ_w(·, z_0)]. As can be seen from the figure, the cost grows as ε^{-2}, which is the theoretically predicted best-case performance for the MLMC estimator. For comparison, we also plot the estimated cost of a comparable Monte Carlo estimator, as well as the expected cost growth rate for the case of the first-order time discretisation used here. The Monte Carlo reference cost is computed as described in [14].
We now examine the performance of the gradient descent algorithm proposed in Section 3. We are interested in solving the minimisation problem given in Eq. (10), with τ = 0.7. We utilise the framework of Algorithm 2, with a tolerance of 0.01 on the gradient ratio. This implies that we stop the algorithm once the gradient magnitude has dropped to 1/100th of its initial magnitude. As an initial guess, we begin with the design z_0 = [0.7, 0.8, 0.08, 1.0]. We also set z_ref = [0.7, 0.8, 0.08, 1.0]. We combine the above with the CMLMC algorithm detailed in [14] and in Section 4, with η = 0.2 for the relative error on the gradient.
We plot in Fig. 3a the value of the objective function for different iterations of the optimisation algorithm. We observe exponential convergence in the number of iterations towards the final value, as predicted by Theorem 3.1, although we cannot guarantee that the hypotheses of Theorem 3.1 are satisfied for this problem. Fig. 3b shows the value of the gradient ratio r for different iterations of the optimisation algorithm. We observe that the gradient also decreases exponentially. Lastly, we plot in Fig. 3c the Cumulative Distribution Function (CDF) of the output QoI Q(z_j, ·) computed at different iterations of the optimisation algorithm, as well as the predicted VaR and CVaR values. We observe that the CDF, the VaR and the CVaR all move to the left over the optimisation iterations. Fig. 4a shows the optimal hierarchy produced by the CMLMC algorithm at each iteration of the optimisation. We observe that since the tolerance supplied to the CMLMC algorithm is a fraction of the gradient magnitude, the optimally tuned hierarchy becomes larger for later iterations of the optimisation. In addition, Fig. 4b shows the cumulative cost required for the optimisation algorithm to reach a given gradient magnitude. The cumulative cost at a given optimisation iteration is defined as the sum of the costs of all optimal hierarchies up to the current optimisation iteration; specifically, the cumulative cost is computed as Σ_i Σ_{l=0}^{L_i} N_l^{(i)} Cost(Q_l), where {N_l^{(i)}}_{l=0}^{L_i} denote the optimal level-wise sample sizes for the i-th optimisation iteration and Cost(Q_l) denotes the average cost of simulating one sample of Q_l. This cost is plotted versus the gradient magnitude. We observe that after an initial pre-asymptotic regime, the cumulative cost grows as ‖Ĵ_w^{(j)}(w_j)‖^{-2}_{l²}, a rate commensurate with the use of an optimally tuned MLMC hierarchy at each iteration, tuned to obtain a tolerance proportional to the gradient magnitude.

Pollutant transport problem
We now apply the methodology to a more applied problem of practical relevance. A problem of pollutant transport is studied, where the concentration of pollutant in a domain is modelled using a steady reaction-diffusion-advection equation. We consider a square domain D = (0, 1) × (0, 1), with boundary ∂D := Γ_d ∪ Γ_n, where Γ_d := {0} × (0, 1) and Γ_n := ∂D \ Γ_d. We denote by u : D × R^9 × Ω → R the concentration, governed by

−ε∆u + V · ∇u + B(x, z) u = f in D,

subject to a Dirichlet condition on Γ_d and a Neumann condition on Γ_n, where ε > 0 denotes a viscosity parameter. V(x, ω) is a random divergence-free velocity field depending on the uniformly distributed random variables a ∼ U[4.95, 5.05] and b ∼ U[3.95, 4.05] and on the components x_1 and x_2 of x. The source f(x) is the sum of five Gaussian source terms, where the values of s_i, µ_i and σ_i are given in Table 1. The sink term B(x, z) is defined in terms of kernels of width σ = 0.05 centred at the locations p_k = (0.25i, 0.25j), i, j ∈ {1, 2, 3}, k = 3(i − 1) + j, with amplitudes given by the components z_k of z ∈ R^9. We are interested in studying the distribution of the random QoI Q, a functional of the concentration u, with κ_s = 10^4. The problem is implemented using the FEniCS finite element software [30]. The domain is discretised using a uniform triangular mesh with piecewise linear finite elements. The resultant linear system is solved using a sparse direct solver [31,32]. The number of elements per side of the square domain varies as 32 × 2^{l/2}, l ∈ {0, 1, ..., L}, leading to a mesh size h_l that varies as h_l = h_0 × 2^{−l/2}. An in-built automatic differentiation module within the FEniCS library is used to compute the sensitivities of the QoI with respect to the design parameters. Once again, the XMC software library [29] is used to implement the CMLMC procedure.
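A sketch of a level-l forward solve in legacy FEniCS is given below; the velocity, source and sink expressions are single-term illustrative stand-ins for the random fields described above, and the variational form corresponds to the advection-diffusion-reaction equation as reconstructed here.

```python
from dolfin import *  # legacy FEniCS (dolfin); a hypothetical sketch of the forward solve

l = 2                                     # level: 32 * 2^(l/2) cells per side of the square
n_cells = int(32 * 2 ** (l / 2))
mesh = UnitSquareMesh(n_cells, n_cells)
V = FunctionSpace(mesh, "CG", 1)          # piecewise linear finite elements

eps = Constant(0.05)                      # viscosity parameter (illustrative value)
vel = Expression(("a*x[1]*(1.0 - x[1])", "0.0"), a=5.0, degree=2)   # stand-in velocity
f = Expression("exp(-(pow(x[0]-0.6,2) + pow(x[1]-0.6,2))/0.01)", degree=2)  # one source
B = Expression("zk*exp(-(pow(x[0]-0.25,2) + pow(x[1]-0.25,2))/(2.0*0.05*0.05))",
               zk=0.1, degree=2)          # one controlled sink with amplitude z_k

u, v = TrialFunction(V), TestFunction(V)
a_form = eps * inner(grad(u), grad(v)) * dx + dot(vel, grad(u)) * v * dx + B * u * v * dx
L_form = f * v * dx
bc = DirichletBC(V, Constant(0.0), "near(x[0], 0.0)")   # Dirichlet on Gamma_d = {0} x (0,1)

uh = Function(V)
solve(a_form == L_form, uh, bc)           # Neumann condition on Gamma_n enters naturally
```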
Similar to Section 5.1, we seek to examine both parts of the optimisation algorithm; namely, the CMLMC and the gradient-based OUU algorithm. For the CMLMC, we seek to accurately estimate Ĵ_w(·, z_0), where z_0 = [0.1]^9, such that MSE[Ĵ_w(·, z_0)] satisfies a prescribed tolerance. Fig. 5 shows the results of reliability and complexity studies conducted for the above parameters, similar to those conducted for the FitzHugh-Nagumo system in Section 5.1. For studying the reliability of the error estimators, we conduct 20 independent CMLMC simulations for a given tolerance. For each simulation, we plot three errors; namely, the true L^∞ error on the gradient, the square root of the MSE estimate produced by our error estimation procedure described in Section 4, and the true pointwise error on the gradient, computed by evaluating the parametric expectations for the gradient Ĵ_w(·, z_0) at θ_0, the VaR corresponding to the design z_0. The reference value of the gradient is computed by first running 20 simulations for a tolerance that is half of the finest tested tolerance, and averaging over the gradient estimates produced by these simulations. Similar to before, we find that although our novel error estimators provide a tight bound on the true L^∞ error of the gradient, the L^∞ error on the gradient is significantly larger than the error on the gradient evaluated at θ_0. Fig. 5b presents the complexity results of the CMLMC algorithm. The cost to compute the optimal hierarchy for a given tolerance ε² on MSE[Ĵ_w^{(j)}(·, z_0)] is plotted versus the tolerance, for each of the 20 CMLMC simulations at a given tolerance, in addition to their sample average value. In addition, the theoretical cost growth rate of a comparable Monte Carlo estimator is shown, as well as the estimated cost of the estimator, for reference and comparison. The Monte Carlo reference cost is computed as described in [14]. As can be seen from the figure, the complexity follows the theoretically predicted rate ε^{-2}.

For the OUU, we wish to minimise an objective function of the form in Eq. (10), with z_ref = 0 and significance τ = 0.7. This implies that we seek to minimise the CVaR while also minimising the amplitude of the controlled sinks. We utilise Algorithm 2, starting from a design z_0 = [0.1]^9, and halt the optimisation once a gradient ratio of r = 0.08 has been achieved. In Fig. 6, we show the source field f(x), the control field B(x, z*) and the solution u(x, z*, ω) at the mean conditions of the random velocity parameters, at the optimal control z* obtained by solving problem (10).

Figure 6: Source, control and solution fields for the pollutant transport problem at the mean velocity conditions, for x ∈ D.

Fig. 7a shows the decay of the objective function towards its final value. We once again observe exponential convergence in the optimisation counter j, as predicted by Theorem 3.1. In addition, we plot in Fig. 7b the gradient ratio for different iterations of the optimisation, which also decreases exponentially in the iteration counter j. Fig. 7c shows the CDF of the output QoI Q(z_j, ·) for different iterations j of the optimisation algorithm, along with the estimated VaR and CVaR. The CDF, the VaR and the CVaR all move left as before in Section 5.1, which translates to moving the right tail of the distribution as far as possible to the left. Fig. 8a shows the optimal hierarchy produced by the CMLMC algorithm at each iteration of the optimisation for a given tolerance. Similar to before, we observe that the optimally tuned hierarchy increases in size for later optimisation iterations, since the tolerance supplied to the CMLMC is a fraction of the gradient magnitude. Fig. 8b shows the cumulative cost, as defined in Section 5.1, for a given gradient magnitude. We observe once again that the cumulative cost grows as ‖Ĵ_w^{(j)}(w_j)‖^{-2}_{l²} after an initial pre-asymptotic regime, as is to be expected for the use of an optimally tuned MLMC hierarchy at each iteration, tuned to obtain a tolerance proportional to ‖Ĵ_w^{(j)}(w_j)‖_{l²}.

Conclusions
The aim of this work was to tackle the challenge of minimising the CVaR of a random QoI, typically the output of a differential model with random inputs, over a suitable design space, using gradient-based optimisation techniques. A main challenge in utilising gradient-based techniques was the differentiability of the CVaR in terms of the design variables. A differentiability result was presented in Section 2, which was a generalisation of the one presented in [15], showing that gradient-based algorithms could still be used to directly minimise the CVaR without requiring smoothing. The expression for the sensitivities of the CVaR with respect to design parameters required the computation of expectations of discontinuous functions of the QoI; namely, the indicator function. Estimating this expectation naively using MLMC estimators could become impractically expensive, and possibly result in non-optimal complexity behaviour of the corresponding MLMC estimator. A similar issue was discussed and tackled in [14], and an alternative was proposed using the framework of parametric expectations. We presented a modified expression for the sensitivities of the CVaR, based on derivatives of parametric expectations, thereby allowing us to use the work in [14]. Based on this modification, we also presented a novel optimisation algorithm consisting of an alternating minimisation-gradient procedure. We demonstrated a theoretical result that, under additional assumptions on the combined objective function in Eq. (10), the novel algorithm would achieve exponential convergence of the design iterates towards the optimal design in the optimisation iterations.
To enable the use of the work in [14], we presented modifications of the MLMC estimator, the error estimation procedure and adaptive hierarchy selection procedure specific to computing the sensitivities of the CVaR. Namely, a relation was derived between the MSE of the sensitivities and the MSE of the parametric expectations in Section 4. In addition, a modification of the KDE smoothing procedure presented in [14] was presented, specific to CVaR minimisation. The combination of the MSE relation and KDE modification allowed us to trivially extend the error estimation and hierarchy adaptivity procedure of [14] to the current application. Lastly, a minor modification of the CMLMC procedure of [14] was presented in Algorithm 2, wherein the CMLMC was restarted from the optimal hierarchy of the previous design iterate.
The combination of gradient-based optimisation and MLMC estimation of the sensitivities of the CVaR was tested on two problems of practical relevance; namely, the FitzHugh-Nagumo oscillator and a more applied advection-reaction-diffusion problem used to model pollutant transport. In both cases, it was observed that the novel error estimation procedure provided tight bounds on the MSE of the gradient as defined in Eq. (47). In addition, the CMLMC algorithm was shown to produce the best-case complexity behaviour for the MLMC estimators of the sensitivities. The OUU algorithm was shown to converge exponentially in the optimisation iterations, while also preserving the best-case MLMC cost complexity.
The numerical examples considered in this work demonstrated that the AMGD procedure performs well for the cases presented here. However, one may wish to improve on the performance of the algorithm by considering alternatives to the AMGD algorithm. Such variations could, for example, include higher-order optimisation methods such as the Newton method. It remains to be seen whether higher-order methods can directly be used with objective functions of the type in problem (10), as well as whether the framework of parametric expectations can be combined with such an algorithm. We plan to explore such questions in future works.

Acknowledgments
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 800898.

Appendices
A Proof of Theorem 2.1

To prove Theorem 2.1 on the Fréchet differentiability of the objective function J(θ, z), we first prove an important result in Lemma A.1. We recall that Γ ⊂ L^p(Ω, R) is the set of L^p-integrable random variables whose measures are atom-free.
Lemma A.1. Consider random variables Y ∈ Γ ⊂ L^p(Ω, R) and δY ∈ L^p(Ω, R). Then the limits in Eqs. (85) and (86) hold as ‖δY‖_{L^p} → 0.

Proof. We begin with the proof of Eq. (85), since the proof of Eq. (86) follows from identical arguments. We make use of the following Markov-type bound: for any X ∈ L^p(Ω, R), ε > 0 and p ≥ 0,

P[|X| ≥ ε] ≤ ‖X‖^p_{L^p} / ε^p.

Setting ε = ‖X‖^β_{L^p} for some β ∈ [0, 1), we have that

P[|X| ≥ ‖X‖^β_{L^p}] ≤ ‖X‖^γ_{L^p}, where γ := p(1 − β).

We rewrite the term within the limit in Eq. (85) as a sum of two terms. The first term can be bounded using the above inequality; due to dominated convergence, we can pass the limit into the expectation, and the resulting limit vanishes since Y ∈ Γ is atom-free. The second term can be bounded using a Hölder inequality, and hence it also goes to zero with the application of the limit, thus concluding the proof of Eq. (85). The proof of Eq. (86) follows from identical arguments.
We now use the above result to present a proof of Theorem 2.1. We note that the function R(θ, Q) is a composition of two functions. We define the functions l_1 : Γ → R and l_2 : R × Γ → Γ as

l_1(Y) := E[max(Y, 0)], l_2(θ, Q) := Q − θ,

so that R(θ, Q) = θ + l_1 ∘ l_2(θ, Q)/(1 − τ).
Hence, to show that R is Fréchet differentiable, it suffices to show that each of the functions l_1 and l_2 is Fréchet differentiable. It is straightforward to see that l_2 is Fréchet differentiable (being linear and bounded), with Fréchet derivative Dl_2(θ, Q) in the direction (δθ, δQ) ∈ R × L^p(Ω, R) given by:

Dl_2(θ, Q)(δθ, δQ) = δQ − δθ.

The Fréchet derivative of l_1, however, requires some consideration. We argue that the Fréchet derivative of l_1 exists at any point Y ∈ Γ and is given by Dl_1(Y)(δY) = E[1_{{Y ≥ 0}} δY]. To prove this statement, we must verify the following limit:

lim_{‖δY‖_{L^p} → 0} |l_1(Y + δY) − l_1(Y) − E[1_{{Y ≥ 0}} δY]| / ‖δY‖_{L^p} = 0.

To show the above, we begin by rewriting the numerator as a sum of three terms T_1, T_2 and T_3. The term T_1 can be rewritten and bounded using the atom-free property of Y, while the terms T_2 and T_3 can be bounded using the inequalities established above. Inserting these bounds and applying the limit using Lemma A.1, we conclude that the limit above is indeed zero. This concludes the proof.

B Adjoint of first-order ODE with additive noise
We present here the derivation of the adjoints for a first-order Ordinary Differential Equation (ODE) with white-noise forcing, for an objective function containing the CVaR of a time-averaged quantity of the trajectory. Let (Ω, F, P) be a complete probability space, ω ∈ Ω denote an elementary random event, and z ∈ R^d the set of design variables. Let u(t, z, ω) ∈ U ⊂ R^{N_u} be the state vector at time t ∈ [0, T] for a given random input ω and design z. The state vector u is governed by the following ODE with additive noise:

u̇(t, z, ω) = g(u, z) + τ Ẇ(t, ω) over (0, T], u(0, z, ω) = u_0,

where g : U × R^d → R^{N_u}, and W : [0, T] × Ω → R^{N_u} is an N_u-dimensional standard Wiener process. We discretise the problem on a uniform temporal grid where the interval [0, T] is divided into N ∈ N segments of step size ∆t = T/N, T := {t_n := n∆t : n = 0, ..., N}. The ODE is discretised using the Euler-Maruyama scheme, which reads as follows:

u^{n+1} = u^n + ∆t g(u^n, z) + τ √∆t ξ^n,

where u^n denotes the approximation to u(t_n, z, ω), and ξ^n ∈ R^{N_u} are N_u-dimensional random vectors whose components are independent identically distributed standard normal variables. We are interested in computing the statistics of time-averages of functions of the trajectory.
where we have used the subscript notation for partial derivatives.
We have in our case that u_z^0 = 0. To remove the terms dependent on u_z^n, we set

λ^n = λ^{n+1}(1 + ∆t g_u^n) + (1_{{Q_h ≥ θ}} / ((1 − τ)T)) ∆t f_u^n, n = 1, ..., N − 1, (122)

which gives the adjoint equations, solved backwards in time. It is noteworthy that, since Eq. (122) is linear, it can be solved for {λ^n} without the factor involving (1 − τ)T, with the sensitivities rescaled accordingly; we then have that

J_z(θ, z) = E[ Σ_{n=0}^{N−1} ∆t λ^{n+1} g_z^n ].
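For completeness, the derivation can be summarised through the discrete Lagrangian below, assuming Q_h = (1/T) Σ_n ∆t f(u^n) and the objective contribution E[φ(θ, Q_h)] from Section 2; the sign convention for the multipliers is a choice.

```latex
% Discrete Lagrangian for the Euler-Maruyama scheme (sketch):
\begin{align*}
\mathcal{L} = \mathbb{E}\!\left[\varphi(\theta, Q_h)\right]
  - \mathbb{E}\!\left[\sum_{n=0}^{N-1} (\lambda^{n+1})^{T}
    \bigl(u^{n+1} - u^{n} - \Delta t\, g(u^{n}, z) - \tau \sqrt{\Delta t}\, \xi^{n}\bigr)\right].
\end{align*}
% Differentiating in z and collecting the terms that multiply u^n_z yields the backward
% recursion (122); the remaining terms give
% J_z(theta, z) = E[ sum_{n=0}^{N-1} Delta t (lambda^{n+1})^T g^n_z ].
```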