GRADIENT ESTIMATES FOR GAUSSIAN DISTRIBUTION FUNCTIONS: APPLICATION TO PROBABILISTICALLY CONSTRAINED OPTIMIZATION PROBLEMS

Abstract. We provide lower estimates for the norm of gradients of Gaussian distribution functions and apply the results obtained to a special class of probabilistically constrained optimization problems. In particular, it is shown how the precision of computing gradients in such problems can be controlled by the precision of function values for Gaussian distribution functions. Moreover, a sensitivity result for optimal values with respect to perturbations of the underlying random vector is derived. It is shown that the so-called maximal increasing slope of the optimal value with respect to the Kolmogorov distance between the original and perturbed distributions can be estimated explicitly from the input data of the problem.


1. Introduction. Optimization problems with probabilistic constraints play an important role in engineering and operations research when decisions have to be taken under uncertainty affecting the given system of constraints. A typical class of such problems is given in the form

    min { c^T x | P(Ax ≥ ξ) ≥ p }.   (1)

Here, x ∈ R^n is a decision vector whose linear costs c^T x are to be minimized subject to a probabilistic constraint. In this constraint, ξ is some s-dimensional random vector defined on a probability space (Ω, A, P) and A is a matrix of order (s, n). In most cases, decisions x have to be taken before realizations of the random vector ξ are observed. Then, the random inequality system Ax ≥ ξ cannot serve as a constraint for a well-defined optimization problem: whatever decision x is taken, it may turn out to be infeasible under a future realization of ξ (in particular for unbounded distributions). Therefore, it is reasonable to declare x feasible if the random constraint Ax ≥ ξ is satisfied under x with a probability not smaller than a given level p ∈ [0, 1]. For introductory texts on probabilistically constrained optimization problems and their applications we refer to the monographs [14,15,17].
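For intuition, the probabilistic constraint in (1) can be evaluated numerically. The following sketch (function name illustrative) makes the simplifying assumption, not imposed in (1), that the components of ξ are independent Gaussians, so that P(Ax ≥ ξ) factorizes into one-dimensional distribution function values:

```python
from statistics import NormalDist

def constraint_probability(A, x, mu, sigma):
    """P(Ax >= xi) under the simplifying assumption that the components
    of xi are independent with xi_i ~ N(mu_i, sigma_i^2), so that the
    probability is the product of one-dimensional cdf values."""
    prob = 1.0
    for i, row in enumerate(A):
        y = sum(a * xj for a, xj in zip(row, x))  # i-th component of Ax
        prob *= NormalDist(mu[i], sigma[i]).cdf(y)
    return prob

# Feasibility check of a candidate decision x for level p:
A = [[1.0, 0.0], [0.0, 1.0]]
mu, sigma = [0.0, 0.0], [1.0, 1.0]
x, p = [2.0, 2.0], 0.9
feasible = constraint_probability(A, x, mu, sigma) >= p
```

In the correlated case the product above must be replaced by a genuine multivariate distribution function value, which is the computational issue discussed below.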

RENÉ HENRION
For the numerical solution of (1), but also for theoretical issues like structure and stability, it is essential to be able to calculate not only the probability function but also its gradient. Note that, using the cumulative distribution function of ξ defined as F_ξ(y) := P(y ≥ ξ), one may rewrite (2) as the function F_ξ(Ax), thus giving problem (1) the equivalent form

    min { c^T x | F_ξ(Ax) ≥ p }.   (3)
Evidently, for calculating values and gradients of F_ξ(Ax) it is sufficient to be able to calculate values and gradients of F_ξ. As far as values are concerned, this can be done in moderate dimension s at fairly satisfactory precision for many prominent multivariate distributions, e.g., the Gaussian, t-, Dirichlet, Gamma or Exponential distribution (see, e.g., [1,5,12,19,20]). As far as gradients are concerned, several approaches have been proposed which allow, based on certain representations as surface or volume integrals, to approximate partial derivatives of probability functions via Monte Carlo and related techniques (e.g., [4,11,13,21]). For a few special distributions (like the Gaussian or Dirichlet) it is possible to establish an explicit relation between partial derivatives and function values of F_ξ by means of an analytical formula. One may benefit from such a comfortable situation in several respects:
• The same algorithm providing (approximate) values of F_ξ can be employed for calculating (approximate) values of ∇F_ξ.
• Higher order derivatives of F_ξ can be obtained as well by recursively reducing their computation in an exact way to the computation of values of F_ξ.
• The precision of ∇F_ξ (and of higher order derivatives) can be controlled by that of F_ξ itself.
The purpose of this paper is to show for the multivariate Gaussian distribution how such a reduction formula can be employed in order to estimate (independently of the concrete argument) the precision of ∇F_ξ by that of F_ξ (Section 3) and, moreover, in order to derive sensitivity estimates for optimal values of problem (3) when perturbing the underlying distribution (Section 4). The latter issue is of interest, for instance, when replacing the theoretical, typically unknown distribution of ξ by some empirical approximation. Both topics to be addressed rely essentially on the derivation of lower estimates for the norm of gradients of Gaussian distribution functions (Section 2).
The notation used is standard. ‖·‖ and ‖·‖_∞ denote the Euclidean and the maximum norm, respectively. The symbol ξ ∼ N(µ, Σ) expresses the fact that the random vector ξ has a multivariate Gaussian distribution with mean vector µ and covariance matrix Σ.
2. Lower estimates for the norm of gradients of Gaussian distribution functions. We commence this section by recalling a well-known gradient formula for Gaussian distribution functions. Without loss of generality, we may confine ourselves to the case of standard Gaussian distributions, i.e., where the mean vector equals zero and all variances (diagonal elements of the covariance matrix) are equal to one. Indeed, due to well-known transformation laws for Gaussian distributions, a general Gaussian distribution function F_ξ with ξ ∼ N(µ, Σ), as it occurs, for instance, in (3), can be rewritten as

    F_ξ(z) = Φ_R(D^{−1/2}(z − µ)),   (4)

where Φ_R is the distribution function of a random vector distributed according to N(0, R), with R being the correlation matrix associated with Σ. Moreover, D denotes the diagonal part of Σ (i.e., D is the diagonal matrix composed of the variances of ξ). As a consequence, the probabilistically constrained optimization problem (1) can be equivalently rewritten via (3) and (4) as

    min { c^T x | Φ_R(D^{−1/2}(Ax − µ)) ≥ p }.   (5)

In the present and in the following section, we shall assume our original problem (1) to be given in the form of (5). Consequently, in the vein of the previous discussion, the whole interest focuses on the gradient ∇Φ_R of standard normal distribution functions. For these reasons, let now Φ_R be the distribution function of an s-dimensional Gaussian random vector distributed according to N(0, R), with R = (r_{i,j}), i, j = 1,…,s, and r_{i,i} = 1 for i = 1,…,s. It is well known (see, e.g., [14], p. 204) that Φ_R is a smooth function and that its partial derivatives calculate as

    ∂Φ_R/∂z_i (z) = f(z_i) · Φ_{R̃^(i)}(z̄_1,…,z̄_{i−1}, z̄_{i+1},…,z̄_s),   (6)

where f is the density of the 1-dimensional standard Gaussian distribution, the transformed arguments are given by

    z̄_j := (z_j − r_{i,j} z_i)(1 − r_{i,j}²)^{−1/2}   (j ≠ i),   (7)

and the correlation matrix R̃^(i) of order (s − 1, s − 1) has entries

    r̃^(i)_{j,k} = (r_{j,k} − r_{i,j} r_{i,k}) · ((1 − r_{i,j}²)(1 − r_{i,k}²))^{−1/2}   (j, k ≠ i).   (8)

In the numerical solution of chance-constrained optimization problems, for instance when applying a supporting hyperplane method as in [18], one has to calculate gradients ∇Φ_R(z) at arguments z satisfying Φ_R(z) = p, where p ∈ [0, 1] is some given probability level, typically close to 1.
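The reduction can be sketched in code: each partial derivative of Φ_R is a one-dimensional density value times an (s−1)-dimensional Gaussian distribution function with transformed arguments and correlation matrix. The sketch below assumes scipy's `multivariate_normal.cdf` (itself only an approximate, quadrature-based routine) is acceptable as the cdf oracle; the function name is illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def grad_std_gaussian_cdf(z, R):
    """Gradient of Phi_R at z via the reduction formula: the i-th partial
    derivative equals f(z_i) times an (s-1)-dimensional Gaussian cdf
    evaluated at transformed arguments (formulas (6)-(8) in the text)."""
    z = np.asarray(z, dtype=float)
    R = np.asarray(R, dtype=float)
    s = z.size
    grad = np.empty(s)
    for i in range(s):
        j = np.array([k for k in range(s) if k != i])
        d = np.sqrt(1.0 - R[i, j] ** 2)                   # sqrt(1 - r_{i,j}^2)
        zbar = (z[j] - R[i, j] * z[i]) / d                # transformed arguments
        # reduced correlation matrix with entries (r_jk - r_ij r_ik)/(d_j d_k)
        Rt = (R[np.ix_(j, j)] - np.outer(R[i, j], R[i, j])) / np.outer(d, d)
        np.fill_diagonal(Rt, 1.0)
        grad[i] = norm.pdf(z[i]) * multivariate_normal.cdf(
            zbar, mean=np.zeros(s - 1), cov=Rt)
    return grad
```

In the independent case R = I_s the routine reduces to the product formula f(z_i)·∏_{j≠i} F(z_j), which provides a cheap consistency check.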
Formula (6) permits an analytical reduction of gradients of Gaussian distribution functions to function values of the same type of distribution functions, albeit with different parameters. As mentioned in the introduction, this fact has important consequences for numerics and sensitivity analysis in probabilistically constrained optimization problems. In this section, as a preparation of the following ones, we are interested in lower estimates for the norm of gradients of Gaussian distribution functions based on formula (6). More precisely, we are interested in estimates that depend only on the data (R, p, s) of the problem but not on the specific argument z satisfying Φ_R(z) = p. Let us first consider the simple case of independently distributed components, i.e., R = I_s:

Proposition 2.1. Let p ≥ 0.5 and consider any point z such that Φ_{I_s}(z) = p. Then,

    ‖∇Φ_{I_s}(z)‖_∞ ≥ f(q_{p^{1/s}}) · p^{1−1/s},

where f and q_α denote the density and the α-quantile of the 1-dimensional standard Gaussian distribution, respectively.
Proof. With R = I_s one derives from (8) that R̃^(i) = I_{s−1} for all i = 1,…,s. Similarly, (7) yields that z̄_j = z_j for all j ≠ i. Denoting by F the cumulative 1-dimensional standard Gaussian distribution function, it follows that, for all i = 1,…,s,

    ∂Φ_{I_s}/∂z_i (z) = f(z_i) · ∏_{j≠i} F(z_j) = p · f(z_i)/F(z_i).

As this estimate holds true for all i = 1,…,s, we get from (8) that

    ‖∇Φ_{I_s}(z)‖_∞ = p · max_{i=1,…,s} f(z_i)/F(z_i).   (9)

Since F is increasing as a distribution function and f is decreasing for p ≥ 0.5, one derives that f/F is decreasing. Therefore, with τ := min_{i=1,…,s} z_i,

    ‖∇Φ_{I_s}(z)‖_∞ = p · f(τ)/F(τ).

We claim that τ ≤ q_{p^{1/s}}. Indeed, otherwise z_i > q_{p^{1/s}} for all i = 1,…,s, whence the contradiction

    p = ∏_{i=1}^s F(z_i) > (p^{1/s})^s = p.

Now, exploiting once more the fact that f/F is decreasing, we arrive at the desired lower estimate

    ‖∇Φ_{I_s}(z)‖_∞ ≥ p · f(q_{p^{1/s}})/F(q_{p^{1/s}}) = f(q_{p^{1/s}}) · p^{1−1/s}.

The correlated case is less obvious, and meaningful lower estimates for the norm of the gradient depending only on the data (R, p, s) seem to be very hard to derive for general correlation matrices. In the following, we identify a special class of correlation matrices for which such estimates are possible. To this aim, we associate with each correlation matrix R the parameter

    ∆(R) := ρ_min − ρ_1 ρ_2,

where ρ_1, ρ_2, ρ_min denote the largest, second largest and smallest nondiagonal elements of R.

Definition 2.2. A correlation matrix R is called amenable if ∆(R) ≥ 0.
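Before turning to the correlated case: in the independent case all quantities in Proposition 2.1 reduce to one-dimensional densities, distribution functions and quantiles, so the estimate can be checked numerically with standard-library tools alone. A small sketch (names illustrative):

```python
from statistics import NormalDist
import math

N = NormalDist()  # standard Gaussian: density f, cdf F, quantile q
f, F, q = N.pdf, N.cdf, N.inv_cdf

def grad_inf_norm_indep(z):
    """||grad Phi_{I_s}(z)||_inf for Phi_{I_s}(z) = prod_i F(z_i),
    using d/dz_i = f(z_i) * prod_{j != i} F(z_j)."""
    prod = math.prod(F(t) for t in z)
    return max(f(t) * prod / F(t) for t in z)

s, p = 3, 0.9
bound = f(q(p ** (1 / s))) * p ** (1 - 1 / s)

# The bound is attained at the equal-component point of the level set...
z_eq = [q(p ** (1 / s))] * s
# ...and holds at an asymmetric point of the same level set
# (the third component is chosen so that the cdf values multiply to p).
z_asym = [q(0.99), q(0.95), q(p / (0.99 * 0.95))]
```

At `z_eq` the gradient norm equals the bound exactly, which shows that the estimate of Proposition 2.1 cannot be improved without using more information on z.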
We give examples of two subclasses of amenable correlation matrices:

Example 2.3. Let t ∈ [0, 1] be arbitrary. Assume that all nondiagonal entries of R lie in the interval [t², t]. Then, R is amenable. Indeed, by assumption, ρ_min ≥ t² and ρ_1, ρ_2 ≤ t, whence ∆(R) ≥ 0. For example, R is amenable if all nondiagonal correlation coefficients lie between 0.25 and 0.5.

Example 2.4. Assume that all, or all but one, nondiagonal entries of R are constant and nonnegative. Then, R is amenable. Indeed, by assumption, ρ_min = ρ_2, whence ∆(R) = ρ_2(1 − ρ_1) ≥ 0. For example, R is amenable if one nondiagonal correlation coefficient equals 0.9 while all the others are equal to 0.1.
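The amenability criterion ∆(R) = ρ_min − ρ_1ρ_2 ≥ 0, as used in the two examples above, can be checked mechanically (function names illustrative; ρ_1 and ρ_2 are taken from distinct positions of the upper triangle, so a symmetric pair counts once):

```python
def delta(R):
    """Delta(R) = rho_min - rho_1 * rho_2, where rho_1, rho_2, rho_min are
    the largest, second largest and smallest nondiagonal entries of the
    correlation matrix R (given as a list of rows)."""
    n = len(R)
    off = sorted((R[i][j] for i in range(n) for j in range(i + 1, n)),
                 reverse=True)
    return off[-1] - off[0] * off[1]

def is_amenable(R):
    return delta(R) >= 0

# Example 2.3: all nondiagonal entries in [t^2, t] for t = 0.5
R1 = [[1.0, 0.30, 0.45], [0.30, 1.0, 0.25], [0.45, 0.25, 1.0]]
# Example 2.4: one coefficient 0.9, all others equal to 0.1
R2 = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
# Not amenable: Delta = 0.1 - 0.8 * 0.5 < 0
R3 = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.5], [0.1, 0.5, 1.0]]
print(is_amenable(R1), is_amenable(R2), is_amenable(R3))  # True True False
```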
Proposition 2.5. If R is amenable and positive definite, then R ≥ 0 (componentwise).
Proof. If not R ≥ 0, then ρ_min < 0. If ρ_1 and ρ_2 have a common sign, one arrives at the contradiction ∆(R) = ρ_min − ρ_1ρ_2 < 0. Otherwise, it follows that ρ_2 < 0 ≤ ρ_1 < 1, whence

    ρ_1 ρ_2 > ρ_2 ≥ ρ_min,

where the first strict inequality relies on the fact that a correlation matrix cannot be positive definite if it contains an off-diagonal element equal to 1. From this chain of inequalities, we derive again the contradiction ∆(R) = ρ_min − ρ_1ρ_2 < 0.

The technical meaning of amenability is clarified in the following result:

Lemma 2.6. If R is amenable and positive definite, then R̃^(i) ≥ I_{s−1} (componentwise) for all i = 1,…,s, with R̃^(i) defined in (8).

Proof. The diagonal entries of R̃^(i) equal 1 by (8). Let now r̃^(i)_{j,k} be a nondiagonal entry of R̃^(i) and let j′, k′ be the indices in R corresponding to j, k via (8); it follows that j′ ≠ k′, j′ ≠ i and k′ ≠ i. Consequently, r_{j′,k′}, r_{j′,i} and r_{k′,i} are off-diagonal elements of R. By definition, it holds that r_{j′,k′} ≥ ρ_min. Due to Proposition 2.5, one has that r_{j′,i} ≥ 0 and r_{k′,i} ≥ 0. Moreover, by j′ ≠ k′, r_{j′,i} and r_{k′,i} are different elements of R. It follows that r_{j′,i} r_{k′,i} ≤ ρ_1 ρ_2, whence, by amenability, r_{j′,k′} − r_{j′,i} r_{k′,i} ≥ ∆(R) ≥ 0. Therefore, r̃^(i)_{j,k} ≥ 0 owing to (8). Summarizing, R̃^(i) ≥ I_{s−1}.
In order to prepare a result for the class of amenable correlation matrices, we define for each i = 1,…,s a correlation matrix Σ^(i) of order (i, i) having constant nondiagonal entries equal to ρ_1. Note that Σ^(i) is positive semi-definite (as required for a correlation matrix) if ρ_1 ≥ 0 (which is automatic if R is amenable). From here we derive for each i = 2,…,s generalized quantiles of order i as the unique values τ^(i) satisfying the equality

    Φ_{Σ^(i)}(τ^(i),…,τ^(i)) = p.   (10)

Here, uniqueness of the τ^(i) follows from the fact that a regular Gaussian distribution function is strongly increasing for arguments strictly increasing in each component. Note that in the result of the following theorem the norm of the gradient is estimated by an expression depending exclusively on the original data (R, p, s) but not on the specific argument z.
Theorem 2.7. Let R be amenable and positive definite, let p ≥ 0.5, and consider any point z such that Φ_R(z) = p. Then

    ‖∇Φ_R(z)‖_∞ ≥ f(τ) · ∏_{j=2}^{s} F( τ^(j)(1 − ρ_1)(1 − ρ_1²)^{−1/2} ),

where f and F refer to the density and distribution function, respectively, of the one-dimensional standard Gaussian distribution, the τ^(j) are defined in (10), and τ is the unique solution of the equation F^s(τ) = p.

Proof. Define indices i_1,…,i_s in such a way that z_{i_1} ≤ ··· ≤ z_{i_s}. Since distribution functions are always dominated by their marginal distribution functions, it follows that

    F(z_i) ≥ Φ_R(z) = p ≥ 0.5, whence z_i ≥ q_p ≥ 0 (i = 1,…,s).   (13)

According to (7),

    z̄_j = (z_{i_j} − r_{i_1,i_j} z_{i_1})(1 − r_{i_1,i_j}²)^{−1/2}   (j = 2,…,s).

Recall that R ≥ 0 by Proposition 2.5. Now, (13) implies that

    z̄_j ≥ z_{i_j}(1 − r_{i_1,i_j})(1 − r_{i_1,i_j}²)^{−1/2}   (j = 2,…,s).

Obviously, the correlation coefficients occurring in the above two expressions are off-diagonal, hence smaller than or equal to ρ_1. The function t ↦ (1 − t)(1 − t²)^{−1/2} being decreasing, it follows that

    z̄_j ≥ z_{i_j}(1 − ρ_1)(1 − ρ_1²)^{−1/2}   (j = 2,…,s).   (14)

Now, let j ∈ {1,…,s} be arbitrarily given. Assume that z_{i_j} < τ^(j) with τ^(j) defined by (10). Then, by the chosen ordering,

    z_{i_l} ≤ z_{i_j} < τ^(j)   (l = 1,…,j).   (15)

Denote by R̄ the correlation matrix induced by the components i_1,…,i_j (i.e., R̄ is obtained from R by selecting the rows and columns corresponding to these indices).
We exploit once more that distribution functions are dominated by their marginal distribution functions of any order and obtain that

    p = Φ_R(z) ≤ Φ_R̄(z_{i_1},…,z_{i_j}) < Φ_R̄(τ^(j),…,τ^(j)),   (16)

where the strict inequality is a consequence of (15). Since R̄ ≤ Σ^(j) (componentwise) by definition of Σ^(j), the well-known Slepian inequality yields that

    Φ_R̄(τ^(j),…,τ^(j)) ≤ Φ_{Σ^(j)}(τ^(j),…,τ^(j)) = p.

This is a contradiction with (16), whence

    z_{i_j} ≥ τ^(j)   (j = 2,…,s).   (17)

Combining this with (14), we have that

    z̄_j ≥ τ^(j)(1 − ρ_1)(1 − ρ_1²)^{−1/2}   (j = 2,…,s).

Now, consider the matrix R̃^(i_1) defined in (8). Since R̃^(i_1) ≥ I_{s−1} (componentwise) by Lemma 2.6, applying first Slepian's inequality again, and then taking into account (14) and (17), we arrive at

    Φ_{R̃^(i_1)}(z̄_2,…,z̄_s) ≥ ∏_{j=2}^{s} F(z̄_j) ≥ ∏_{j=2}^{s} F( τ^(j)(1 − ρ_1)(1 − ρ_1²)^{−1/2} ).   (18)

On the other hand, the standard Gaussian density being decreasing for positive arguments, (13) leads to

    ‖∇Φ_R(z)‖_∞ ≥ ∂Φ_R/∂z_{i_1}(z) = f(z_{i_1}) · Φ_{R̃^(i_1)}(z̄_2,…,z̄_s).

Next we claim that there exists some index i* such that z_{i*} ≤ τ, where τ is defined in the assertion of our theorem. Indeed, otherwise z_i > τ for all i, and by Slepian's inequality (with R ≥ I_s componentwise) we end up at the contradiction

    p = Φ_R(z) ≥ ∏_{i=1}^{s} F(z_i) > F^s(τ) = p.

Since z_i ≥ 0 for all i (see (13)) and f is decreasing for nonnegative arguments, one derives that f(z_{i_1}) ≥ f(τ), so that the assertion of the Theorem now follows from (18).

3. Precision of gradients of Gaussian distribution functions.
Although formula (6) establishes an analytical relation between partial derivatives and function values of Gaussian distribution functions, one has to take into account for numerical applications that function values themselves are already imprecise and can only be approximated by methods as described in [1,5,20]. Thanks to relation (6), one is not forced to further approximate partial derivatives by means of finite differences of function values that are already imprecise. Rather, one may employ the same method for the computation of partial derivatives and thus avoid the additional imprecision caused by another type of approximation. So, intuitively it is clear that in view of (6) the error of gradients can be controlled by that of function values. Note that by the error of gradients we do not understand the norm of the absolute difference between theoretical and computed gradient, but rather the norm of the difference between the normalized theoretical and computed gradients:

    ‖ ∇Φ_R(z)/‖∇Φ_R(z)‖ − ∇Φ_R(z)_comp/‖∇Φ_R(z)_comp‖ ‖

(note that by the computed gradient ∇Φ_R(z)_comp we understand the gradient obtained via the analytical relation (6) from computed function values of Φ_{R̃}). The reason for this definition of error is that in a numerical context, the direction of a gradient is much more important than its norm. For instance, if gradients are used in a cutting plane method, the Walkup-Wets isometry theorem implies that the truncated Hausdorff distance between the cuts (halfspaces) defined by the gradients at a given point equals the truncated Hausdorff distance between the gradients themselves, which is exactly the error defined above. In other words, the error of gradients can also be interpreted as the error of the cuts induced by them.
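This error measure can be sketched in a few lines (illustrative name; plain Euclidean norms):

```python
import math

def normalized_gradient_error(g_true, g_comp):
    """Euclidean distance between the *normalized* theoretical and
    computed gradients: || g/||g|| - g_c/||g_c|| ||."""
    nt = math.sqrt(sum(t * t for t in g_true))
    nc = math.sqrt(sum(t * t for t in g_comp))
    return math.sqrt(sum((a / nt - b / nc) ** 2
                         for a, b in zip(g_true, g_comp)))

# Rescaling the computed gradient leaves the error at (numerically) zero,
# since only the direction enters ...
same_direction = normalized_gradient_error([1.0, 2.0], [2.0, 4.0])
# ... while a change of direction is fully picked up.
rotated = normalized_gradient_error([1.0, 0.0], [0.0, 1.0])
```

The first value is zero up to rounding although the computed gradient is twice as long as the theoretical one; the second equals √2, the maximal possible deviation for orthogonal directions.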
A first simple consequence of the triangle inequality yields the relation

    ‖ ∇Φ_R(z)/‖∇Φ_R(z)‖ − ∇Φ_R(z)_comp/‖∇Φ_R(z)_comp‖ ‖ ≤ 2 ‖∇Φ_R(z) − ∇Φ_R(z)_comp‖ / ‖∇Φ_R(z)‖.   (19)

Using (6) and denoting by 'fv-err' the maximum absolute precision for the computation of function values of Gaussian distribution functions (predetermined by the user), one may continue via (6) as

    ‖∇Φ_R(z) − ∇Φ_R(z)_comp‖_∞ ≤ max_{i=1,…,s} f(z_i) · fv-err.

Now, we may invoke Proposition 2.1 in order to derive the following first estimate for the error of gradients in terms of the error of function values in the case of independent components:

Corollary 3.1. Consider an arbitrary point z such that Φ_{I_s}(z) = p. Then,

    ‖ ∇Φ_{I_s}(z)/‖∇Φ_{I_s}(z)‖_∞ − ∇Φ_{I_s}(z)_comp/‖∇Φ_{I_s}(z)_comp‖_∞ ‖_∞ ≤ 2 · fv-err / p.

Proof. One may continue (9) as

    ‖∇Φ_{I_s}(z)‖_∞ = p · max_{i=1,…,s} f(z_i)/F(z_i) ≥ p · max_{i=1,…,s} f(z_i)

because of F(z_i) ≤ 1. Now, the assertion follows from combining the previous relation with (19).
Due to p ≈ 1 in typical applications of chance constrained programming, the result of Corollary 3.1 yields that the error of the gradient is at most approximately double the chosen maximum error of function values in the case of independent components. Unfortunately, this efficient estimate is of limited practical use because in the independent case both the function values and the gradients can be computed almost exactly: the calculus reduces to a multiplication of values of one-dimensional Gaussian distribution functions, which is almost free of error. In the more interesting case of correlated components, we obtain the following Corollary by directly combining (18) with (19):

Corollary 3.2. Consider an arbitrary point z such that Φ_R(z) = p. Then, under the assumptions of Theorem 2.7, it holds that

    ‖ ∇Φ_R(z)/‖∇Φ_R(z)‖_∞ − ∇Φ_R(z)_comp/‖∇Φ_R(z)_comp‖_∞ ‖_∞ ≤ 2 · fv-err · f(q_p) / ( f(τ) · ∏_{j=2}^{s} F( τ^(j)(1 − ρ_1)(1 − ρ_1²)^{−1/2} ) ).
As an illustration, we consider the following example:

4. Sensitivity of optimal values of probabilistically constrained optimization problems. When solving a probabilistically constrained optimization problem like (1), one is usually confronted with the fact that the underlying random vector ξ and its distribution P∘ξ^{−1} are not exactly known. Often, however, they may be approximated, for instance on the basis of historical observations, by some other random vector η, and then instead of the original problem (1) one solves the optimization problem

    min { c^T x | P(Ax ≥ η) ≥ p }.   (21)

This (inevitable) approximation step immediately raises the question of stability of the problem, in particular of sensitivity of its solution (set) and of its optimal value. For simplicity, let us confine ourselves in the following to the sensitivity of optimal values. Denote for an arbitrary random vector η the optimal value of problem (21) by

    ϕ(η) := inf { c^T x | P(Ax ≥ η) ≥ p }   (22)

(note that an infimum occurs here because (21) does not necessarily have a solution). In particular, ϕ(ξ) is the optimal value of the original problem (1). The stability of probabilistically constrained optimization problems has been analyzed in a fairly complete fashion in a series of papers [6,7,8,9]. As pointed out in [16], stability of any stochastic optimization problem requires the choice of an appropriate probability metric adapted to the concrete problem structure and measuring the deviation between the underlying theoretical random vector ξ and its approximation η. An appropriate choice in the context of our problem (1) (and its perturbation (21)) is the following discrepancy (see, e.g., [9]):

    ∆(ξ, η) := sup_{z ∈ R^s} |P(ξ ≤ z) − P(η ≤ z)| = sup_{z ∈ R^s} |F_ξ(z) − F_η(z)|,   (23)

where the second representation, adapted to the equivalent formulation (3) of the original problem, refers to the distribution functions of the respective random vectors. This discrepancy is well known as the Kolmogorov distance between the distributions P∘ξ^{−1} and P∘η^{−1} of ξ and η. Note that the term distance relates just to the distributions but not to the random vectors themselves: ∆(ξ, η) = 0 implies that P∘ξ^{−1} = P∘η^{−1} (i.e., ξ ∼ η) but not necessarily ξ = η.
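For the practically important case where the perturbation η carries an empirical distribution of n sample points, the Kolmogorov distance ∆(ξ, η) can be computed exactly in one dimension: the supremum of the difference between a continuous cdf and a step function is attained at the jump points of the latter. A stdlib sketch (names illustrative, 1-dimensional only):

```python
from statistics import NormalDist

def kolmogorov_distance_1d(samples, dist=NormalDist()):
    """sup_z |F_xi(z) - F_eta(z)| between a 1-d Gaussian cdf F_xi and
    the empirical cdf F_eta of the given samples, evaluated exactly by
    checking both one-sided gaps at every jump point of F_eta."""
    xs = sorted(samples)
    n = len(xs)
    sup = 0.0
    for i, x in enumerate(xs, start=1):
        Fx = dist.cdf(x)
        sup = max(sup, abs(i / n - Fx), abs((i - 1) / n - Fx))
    return sup

# Empirical cdf of three points against the standard normal:
d = kolmogorov_distance_1d([-1.0, 0.0, 1.0])
```

In dimension s > 1 the supremum in (23) would instead have to be approximated over a grid or via the sample points of η.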
Using the discrepancy (23), the following result for our problem can be derived by invoking a more general result from [7] (Theorem 1):

Theorem 4.1. Let problem (1) be defined by a Gaussian random vector ξ and have a nonempty and bounded solution set. Moreover, assume that there exists a point x* such that P(Ax* ≥ ξ) > p. Then, there exist constants L, δ > 0 such that

    |ϕ(η) − ϕ(ξ)| ≤ L · ∆(ξ, η)

for all random vectors η with ∆(ξ, η) < δ.
Note that all assumptions of this Theorem refer to the original problem, and no assumptions are made on the perturbed problem. In particular, the perturbed random vector η does not have to have a nice distribution like the Gaussian one (implying differentiability and convexity properties) but could follow, for instance, a discrete distribution coming from an empirical approximation of ξ. Although the result of the Theorem is nice in that it identifies a Lipschitz-like behavior of optimal values under perturbation, its practical use is limited by the fact that the constant L is hard to quantify explicitly from the data of the original problem. The following result is, to the best of our knowledge, the first attempt to do so in a concrete setting. In order to keep its presentation sufficiently simple, we confine ourselves to the weaker property of upper Lipschitz continuity of optimal values.

Proposition 4.2. In problem (1), let ξ be an s-dimensional Gaussian random vector distributed according to ξ ∼ N(µ, Σ). Assume that the problem has a solution. Let R be the correlation matrix associated with Σ. Finally, let κ > 0 be given such that

    κ ≤ ‖∇Φ_R(y)‖_∞ for all y satisfying Φ_R(y) = p,

where Φ_R is the distribution function according to N(0, R) (see (4)).
Observe that the result of this proposition makes a statement on the sensitivity of optimal values in problem (1) that depends on the norm ‖c‖ of the linear objective functional, although the solution of the problem itself is not affected by this norm but just by the direction of c. In order to arrive at a sensitivity result independent of this norm, one might either suppose, without loss of generality, that ‖c‖ = 1 in problem (1), or alternatively pass to the sensitivity of relative (with respect to the solution) optimal values.
It is interesting to observe a close relation of the statement of Proposition 4.2 with the so-called slope of functions in metric spaces as introduced in [2]. More precisely, for a metric space (M, d), a function h : M → R and a point x ∈ M, the slope of h at x is defined as

    |∇h|(x) := lim sup_{y→x, y≠x} [h(x) − h(y)]^+ / d(x, y).   (30)

The notation is justified by the fact that in a differentiable setting |∇h(x)| would correspond to the norm of the gradient of h at x. Of course, in a differentiable setting one could replace h by −h without changing the norm of the gradient. In the nondifferentiable case, however, it makes a big difference to do so. Then, (30) corresponds to a maximal local decrease of h at x. Therefore, |∇h|(x) is also sometimes referred to as the maximal decreasing slope and denoted by |∇⁻h|(x). This concept plays an important role in the theory of gradient flows or in stability (metric regularity, error bounds) of set-valued analysis (e.g., [3,10]). On the other hand, as in the case of our sensitivity analysis of optimal values, it is often important to also have information about the maximal increasing slope, which would be defined in a natural way as

    |∇⁺h|(x) := lim sup_{y→x, y≠x} [h(y) − h(x)]^+ / d(x, y).

Now, disregarding the negligible fact that (23) defines just a semi-metric but not a metric on the set of s-dimensional random vectors, it is natural to introduce the maximal increasing slope for the optimal value function ϕ at ξ as

    |∇⁺ϕ|(ξ) := lim sup_{∆(ξ,η)→0, ξ≠η} [ϕ(η) − ϕ(ξ)]^+ / ∆(ξ, η).
Then, we derive the following result as an immediate consequence of Proposition 4.2:

Now, in order to extract explicit quantitative information from Theorem 4.3, it is essential to have a concrete value of κ which can be directly computed from the input data of the problem. But, recalling from the statement of Proposition 4.2 that κ is a lower estimate for the (maximum) norm of the gradient ∇Φ_R(y) at points y satisfying Φ_R(y) = p, it is exactly the results from Section 2 (Proposition 2.1 and Theorem 2.7) which provide us with such a desired concrete value of κ. Combination with Theorem 4.3 yields the following two corollaries.
Corollary 4.4. In problem (1), let ξ be an s-dimensional Gaussian random vector distributed according to ξ ∼ N(µ, Σ), where Σ is diagonal (i.e., the components of ξ are independent). Assume that the problem has a solution. Then, where f and q_α denote the density and the α-quantile of the 1-dimensional standard Gaussian distribution, respectively.
Corollary 4.5. In problem (1), let ξ be an s-dimensional Gaussian random vector distributed according to ξ ∼ N(µ, Σ), where the correlation matrix R associated with Σ is amenable in the sense of Definition 2.2. Assume that the problem has a solution. Then, where the meaning of f, F, ρ_1, τ and τ^(j) is as in Theorem 2.7.
The following example illustrates Corollary 4.4:
A similar example could be constructed to illustrate Corollary 4.5 as well.