Nonparametric estimation of a maximum of quantiles

Abstract: A simulation model of a complex system is considered, the outcome of which is described by m(p, X), where p is a parameter of the system, X is a random input of the system and m is a real-valued function. The maximum (with respect to p) of the quantiles of m(p, X) is estimated. The quantiles of m(p, X) of a given level are estimated for various values of p from an order statistic of values m(p_i, X_i), where X, X_1, X_2, ... are independent and identically distributed and where p_i is close to p, and the maximal quantile is estimated by the maximum of these quantile estimates. Under assumptions on the smoothness of the function describing the dependence of the quantiles on the parameter p, the rate of convergence of this estimate is analyzed. The finite sample size behavior of the estimate is illustrated by simulated data and by an application to a problem in mechanical engineering.


Introduction
We consider a simulation model of a complex system described by

Y = m(p, X),

where p is a parameter of the system from some compact subset P of R^l, X is an R^d-valued random variable with known distribution and m : P × R^d → R is a given (measurable) function. Let

G_p(y) = P{m(p, X) ≤ y}

be the cumulative distribution function (cdf) of Y for a given value of p. For α ∈ (0, 1) let

q_{p,α} = min{y ∈ R : G_p(y) ≥ α}

be the α-quantile of m(p, X). Using at most n function evaluations m(p_i, x_i) of m for arbitrarily chosen values (p_1, x_1), ..., (p_n, x_n) ∈ P × R^d, we are interested in estimating the maximal α-quantile

sup_{p∈P} q_{p,α}.

This problem is motivated by research carried out at the Collaborative Research Centre SFB 805 at the Technische Universität Darmstadt, which focuses on uncertainty in load-carrying systems such as truss structures. The truss structure under consideration is a typical system for load distribution in mechanical engineering (cf. Figure 1). It consists of ten nodes and 24 equal beams with circular cross sections and fixed boundary conditions. Four equal tetrahedron structures may be assembled into modules. The truss support is statically determined at nodes 2, 3 and 4. A single force F_z in z-direction acts at node 1 and two symmetric moments M act at nodes 6 and 7 as shown in Figure 1. Due to the non-symmetric loading, the maximum equivalent stress σ_v (cf. e.g., [9]) is not equal in every beam. It is influenced not only by the non-symmetric loading via F_z and M, but also by the beams' material parameter, Young's modulus E, as well as by the geometric parameters radius r and length l. We assume that from the manufacturing process of the truss structure we know the corresponding range [a, b] for each beam parameter E, r and l. The parameter vector p = (E, r, l) describes the parameters of all 24 equal beams within the truss structure.
The occurring random force F_z is normally distributed and the moments M have a truncated normal distribution restricted to the positive half-axis. A numerical finite element model is built that calculates the maximum equivalent stresses Y = σ_{v,1}, Y = σ_{v,2} and Y = σ_{v,3} in beams 1, 2 and 3, respectively, which are the members with the highest equivalent stresses in this specific truss structure. So our parameter set is P = [a_E, b_E] × [a_r, b_r] × [a_l, b_l]. The function m : P × R^d → R describes the physical model of the truss structure and Y = m(p, X) is the occurring maximum equivalent stress in beam 1, 2 and 3, respectively, of the truss structure with load vector X = (F_z, M).
We are interested in quantifying the uncertainty in the random quantity Y = m(p, X), which we characterize by determining an interval which contains the value of Y with high probability. Such an interval can be determined by computing quantiles of Y for α close to one (leading to the right end point of the interval) and α close to zero (leading to the left end point of the interval). However, we do not know the exact value of p ∈ P. Hence, instead of computing the right end point of the interval by a quantile of Y , we compute for each p ∈ P the α-quantile q p,α of m(p, X) and use sup p∈P q p,α for α close to one as the right end point of the interval. Similarly, we can construct the left end point of the interval by computing inf p∈P q p,α for α close to zero.
In the sequel we introduce the estimates of these quantities and investigate their properties. In order to simplify the notation we consider only sup p∈P q p,α . The other value can be estimated similarly.
For p ∈ P we estimate the α-quantile of G p by the α-quantile of an empirical cumulative distribution function corresponding to data points m(p i , X i ), where p i is close to p and X, X 1 , X 2 , . . . are independent and identically distributed random variables. Then we estimate the maximum quantile by the maximum of these quantile estimates. Under suitable assumptions on the smoothness of the function describing the dependency of the quantiles on the parameter we analyze the asymptotic rate of convergence of this estimate. The finite sample size behavior of the estimate is illustrated using simulated data.
In our proofs the main trick is to relate the error of our quantile estimates to the behavior of the order statistics of a fixed sequence of independent random variables uniformly distributed on (0, 1).
The estimation problem in this paper is the estimation of a quantile function in a fixed design regression problem. Usually this kind of problem is studied in the literature in the context of random design regression, see, e.g., [22,12] and the literature cited therein. In the context of random design regression, rate of convergence results for local averaging estimates of conditional quantiles have been derived in [19] and [1]. Results concerning estimates based on support vector machines can be found in [16]. The problem of estimating the maximal quantile was not considered in the papers mentioned above.
Throughout the paper we use the following notation: N, R and R_+ are the sets of positive integers, real numbers and nonnegative real numbers, respectively. For z ∈ R we denote the smallest integer greater than or equal to z by ⌈z⌉ and the largest integer less than or equal to z by ⌊z⌋. For a set A its indicator function (which takes the value one on A and zero outside of A) is denoted by I_A. ‖x‖_∞ = max{|x^(1)|, ..., |x^(d)|} is the supremum norm of a vector x = (x^(1), ..., x^(d))^T ∈ R^d. For z_1, ..., z_n ∈ R the corresponding order statistics are denoted by z_{1:n}, ..., z_{n:n}, i.e., z_{1:n}, ..., z_{n:n} is a permutation of z_1, ..., z_n satisfying z_{1:n} ≤ ... ≤ z_{n:n}. The order statistics of real-valued random variables are defined analogously. If Z_n are real-valued random variables and δ_n > 0 are positive real numbers, we write Z_n = O_P(δ_n) if

lim_{c→∞} limsup_{n→∞} P{|Z_n| > c · δ_n} = 0.

The definition of our estimate of the maximum quantile is given in Section 2, the main result is formulated in Section 3 and proven in Section 5. The finite sample size behavior of the estimate is illustrated in Section 4 using simulated data.

Definition of the estimate
In order to simplify the notation, we assume in the sequel for all of our theoretical considerations that the set of parameters is given by P = [0, 1] l .
For p ∈ P we estimate the cumulative distribution function by the empirical cumulative distribution function based on all those m(p_i, X_i) where the distance between p_i and p in supremum norm is at most h_n. Here h_n > 0 is a parameter of the estimate. In order to define this estimate, let K be the naive kernel K(u) = I_{[−1,1]^l}(u) and set

Ĝ_p(y) = ( Σ_{i=1}^n K((p − p_i)/h_n) · I_{{m(p_i, X_i) ≤ y}} ) / ( Σ_{i=1}^n K((p − p_i)/h_n) ).

Next, the quantile is estimated by the plug-in estimate

q̂_{p,α} = min{y ∈ R : Ĝ_p(y) ≥ α}.

Among other conditional quantile estimates we mention the optimal quantization approach of Charlier, Paindaveine and Saracco [3] and the machine learning approaches of Takeuchi et al. [20] and Meinshausen [13]. Finally, we estimate the maximal quantile sup_{p∈P} q_{p,α} by the maximum of the estimated quantiles q̂_{p,α} for p ∈ {p_1, ..., p_n}, i.e., by

M̂_n = max_{i=1,...,n} q̂_{p_i,α}.    (2.1)
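The estimate M̂_n can be sketched in a few lines of code. The following is a minimal Python sketch, not the implementation used in the paper; the function `m`, the grid of design points and all parameter names are illustrative:

```python
import numpy as np

def max_quantile_estimate(m, p_grid, X, alpha, h):
    """Sketch of the estimate of Section 2: for each design point p_j, build
    the empirical cdf from all m(p_i, X_i) with ||p_i - p_j||_inf <= h and
    take its alpha-quantile; return the maximum over all design points."""
    y = np.array([m(p_i, x_i) for p_i, x_i in zip(p_grid, X)])
    q_hat = np.empty(len(p_grid))
    for j, p in enumerate(p_grid):
        mask = np.max(np.abs(p_grid - p), axis=1) <= h   # supremum-norm window
        window = np.sort(y[mask])
        N = len(window)
        # smallest y-value whose empirical cdf is >= alpha
        q_hat[j] = window[int(np.ceil(N * alpha)) - 1]
    return q_hat.max()

# toy check with known answer: m(p, x) = x with X ~ N(0, 1),
# so q_{p,1/2} = 0 for every p and the maximal median is 0
rng = np.random.default_rng(0)
n = 2000
p_grid = np.linspace(0.0, 1.0, n).reshape(-1, 1)   # equidistant design in [0, 1]
X = rng.standard_normal((n, 1))
est = max_quantile_estimate(lambda p, x: x[0], p_grid, X, alpha=0.5, h=0.2)
```

The quadratic loop over design points is the most direct translation of the definition; for large n one would sort the design points once and use sliding windows instead.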

Main results
Our main result is the following bound on the error of our maximal quantile estimate.
Theorem 3.1. Let α ∈ (0, 1), let P = [0, 1]^l and let m : P × R^d → R be a measurable function. Let X be an R^d-valued random variable and, for p ∈ P, let G_p be the cumulative distribution function of m(p, X). Let p_1, ..., p_n be equidistantly distributed design points in P and let X_1, X_2, ... be independent copies of X. Assume that G_p is continuous for all p ∈ P and that there exist δ > 0 and c_1 > 0 such that, for all p ∈ P, the derivative of G_p exists and is continuous on the interval [q_{p,α} − δ, q_{p,α} + δ] and is greater than c_1 there. Let the estimate M̂_n be defined as in (2.1) for some h_n > 0 satisfying h_n → 0 (n → ∞) and n · h_n^l → ∞ (n → ∞). Then

|M̂_n − sup_{p∈P} q_{p,α}| = O_P( √(log n / (n · h_n^l)) + sup_{p_1,p_2 ∈ P: ‖p_1−p_2‖_∞ ≤ h_n} |q_{p_1,α} − q_{p_2,α}| ).

If we impose a smoothness condition on the function describing the dependence of the quantiles on the parameter p, we can derive from the above theorem the following rate of convergence result.

Corollary 1. Assume that the assumptions of Theorem 3.1 hold and that, in addition, q_{p,α} is (r, C)-smooth as a function of p ∈ P for some r ∈ (0, 1] and C > 0, i.e.,

|q_{p_1,α} − q_{p_2,α}| ≤ C · ‖p_1 − p_2‖_∞^r for all p_1, p_2 ∈ P.

Choose

h_n = (log n / n)^{1/(2r+l)}

and define the estimate M̂_n as in Section 2. Then

|M̂_n − sup_{p∈P} q_{p,α}| = O_P( (log n / n)^{r/(2r+l)} ).

Proof. Since q_{p,α} is (r, C)-smooth as a function of p ∈ P, we can conclude from Theorem 3.1

|M̂_n − sup_{p∈P} q_{p,α}| = O_P( √(log n / (n · h_n^l)) + C · h_n^r ).

The definition of h_n implies the result.
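The bandwidth in Corollary 1 is chosen so that the two error terms of Theorem 3.1 are of the same order; the short balancing check, in the notation of the corollary, reads:

```latex
\text{With } h_n = \left(\tfrac{\log n}{n}\right)^{1/(2r+l)}:\qquad
n\, h_n^l = n^{2r/(2r+l)} (\log n)^{l/(2r+l)},
\qquad\text{hence}\qquad
\sqrt{\frac{\log n}{n\, h_n^l}}
  = \left(\frac{\log n}{n}\right)^{r/(2r+l)}
\quad\text{and}\quad
C\, h_n^r = C \left(\frac{\log n}{n}\right)^{r/(2r+l)}.
```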
Remark 1. We would like to put our results in perspective with the results obtained in [1] and [19]. Our result is slightly different from the result which could have been obtained by combining the union bound with a previously established conditional quantile estimation error bound. The √(log n/(n · h_n^l)) term is the same estimation error term between the quantile estimate and a kernel surrogate of the true quantile as in [19], up to a log n factor coming from the use of the union bound. The sup term may be thought of as an approximation term, but is slightly different from the classical distance between the true quantile and its kernel surrogate. Both choices of approximation terms would have led to the result of Corollary 1; however, the proof given in Section 5 is more self-contained and easily deals with multidimensional covariables.
Remark 2. Let the assumptions of Theorem 3.1 hold. Define a partitioning estimate of the quantile function p ↦ q_{p,α} by

q̄_{p,α} = q̂_{p_i,α},

where, for p ∈ P, p_i is the center of the cube in the partition of P which contains p. Then, as in the proof of Theorem 3.1 (cf. second inequality in (5.5) and Lemma 2), one can show

sup_{p∈P} |q̄_{p,α} − q_{p,α}| = O_P( √(log n / (n · h_n^l)) + sup_{p_1,p_2 ∈ P: ‖p_1−p_2‖_∞ ≤ h_n} |q_{p_1,α} − q_{p_2,α}| ).

In particular, under the assumptions of Corollary 1 we have the following bound on the supremum norm error of the above quantile function estimate:

sup_{p∈P} |q̄_{p,α} − q_{p,α}| = O_P( (log n / n)^{r/(2r+l)} ).

Remark 3. a) Up to multiplicative constants, Theorem 3.1 and Corollary 1 also hold when the naive kernel is taken as the indicator function of the unit ball of any norm on R^l, because of norm equivalence.
b) As pointed out by a referee, the above results can easily be generalized to a more general setting in which one does not observe values of X: instead of a function m defining the random variables via Y = m(p, X), there exists a Markov kernel Π from P to R.
Remark 4. The rate of convergence in Corollary 1 is (up to some logarithmic factor) the same as the optimal minimax rate of convergence for estimation of an (r, C)-smooth function on a compact subset of R^l in supremum norm derived in [18].
Remark 5. In any application of the above estimate we have to select the bandwidth h_n in a data-driven way. We suggest choosing the bandwidth optimally in view of the estimation of G_p(y) by Ĝ_p(y) (p ∈ P) for a suitably chosen y ∈ R. To do this, we propose to use a version of the well-known technique of sample splitting from nonparametric regression (cf. e.g., Chapter 7 in Györfi et al. [11]). More precisely, let us assume that we have available n additional random variables X̄_1, ..., X̄_n such that X, X_1, ..., X_n, X̄_1, ..., X̄_n are independent and identically distributed, and that we observe m(p_i, X̄_i) for these random variables. We choose y as the α-quantile of the empirical cumulative distribution function corresponding to m(p_1, X_1), ..., m(p_n, X_n), and choose h_n by minimizing

(1/n) · Σ_{i=1}^n ( I_{{m(p_i, X̄_i) ≤ y}} − Ĝ_{p_i}(y) )².

In the next section we will investigate the performance of this data-driven way of choosing the bandwidth using simulated data.
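The sample-splitting selection of Remark 5 can be sketched as follows. This is a hypothetical illustration: the squared-error criterion below is one plausible reading of the remark, and all names are invented for the sketch:

```python
import numpy as np

def quantile_of(values, alpha):
    """Empirical alpha-quantile: smallest y with empirical cdf >= alpha."""
    v = np.sort(values)
    return v[int(np.ceil(len(v) * alpha)) - 1]

def select_bandwidth(m, p_grid, X, X_extra, alpha, candidates):
    """Pick h from `candidates` by sample splitting: judge the cdf fit of
    G_hat_{p_i}(y) against indicators built from the extra sample X_extra."""
    y_main = np.array([m(p, x) for p, x in zip(p_grid, X)])
    y_extra = np.array([m(p, x) for p, x in zip(p_grid, X_extra)])
    y = quantile_of(y_main, alpha)          # level at which the fit is judged
    best_h, best_err = candidates[0], np.inf
    for h in candidates:
        err = 0.0
        for j, p in enumerate(p_grid):
            mask = np.max(np.abs(p_grid - p), axis=1) <= h
            G_hat = np.mean(y_main[mask] <= y)          # G_hat_{p_j}(y)
            err += (float(y_extra[j] <= y) - G_hat) ** 2
        if err < best_err:                  # dividing by n would not change the argmin
            best_h, best_err = h, err
    return best_h

# hypothetical demo: m(p, x) = sin(2*pi*p) + x with standard normal x
rng = np.random.default_rng(0)
n = 500
p_grid = np.linspace(0.0, 1.0, n).reshape(-1, 1)
X = rng.standard_normal((n, 1))
X_extra = rng.standard_normal((n, 1))
h_star = select_bandwidth(lambda p, x: np.sin(2 * np.pi * p[0]) + x[0],
                          p_grid, X, X_extra, 0.95,
                          candidates=[0.01, 0.05, 0.1, 0.2, 0.4])
```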
Remark 6. Corollary 1 provides an answer to Question 4 in Stone [18], by showing that the rate n^{−r/(2r+l)} is achievable (up to a logarithmic factor) for estimation of the median.

Application to simulated data
In this section we illustrate the finite sample size behavior of our estimate (particularly with regard to the data-driven choice of the bandwidth in Remark 5) with the aid of simulated data. In all simulations we choose the sample size for the estimate of the maximal quantile as n = 20,000, where we use half of the data to choose the bandwidth of the estimate from a finite set of bandwidths as described in Remark 5. In each simulation we compute the absolute difference between the true value of the maximal quantile and, first, the estimate with the data-driven choice of the bandwidth and, second, the estimate with sample size n/2 applied with the bandwidth value that leads to the minimal error for the given data set. The level of the quantile is chosen as α = 0.95. We repeat each simulation 100 times and report the mean values and the standard deviations of the resulting 100 error values for the two estimates.
In our first example we choose X as a standard normally distributed random variable, P = [0, 1] and define m(p, X) = 0.5 · sin(2 · π · p) + X.
Hence m(p, X) is normally distributed with mean 0.5 · sin(2 · π · p) and variance 1. In this case the α = 0.95 quantile of m(p, X) is given by q_{p,α} ≈ 0.5 · sin(2 · π · p) + 1.64. In this example we choose the bandwidth from the set {0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4}. In Figure 2 we show a typical simulation for the two quantile estimates using the data-driven bandwidth and the optimal bandwidth, respectively. For our data-driven choice of the bandwidth the mean value (standard deviation) of the error is approximately 0.044 (0.034), as compared to 0.017 (0.015) for the optimal bandwidth choice, which is not applicable in practice since it depends on the unknown maximal value. Hence, the mean absolute error of our estimate using the data-driven bandwidth is only about 2.5 times larger than that of the estimate using the optimal bandwidth.
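Since the true maximal quantile in this first example is known (sup_p q_{p,0.95} = 0.5 + z_{0.95}), the whole procedure can be run end to end. The following sketch uses a fixed bandwidth h = 0.1 and a smaller sample size than the paper, instead of the data-driven bandwidth choice:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, h, n = 0.95, 0.1, 2000        # smaller n than the paper's 20,000
p = np.linspace(0.0, 1.0, n)         # equidistant design points in P = [0, 1]
y = 0.5 * np.sin(2 * np.pi * p) + rng.standard_normal(n)   # m(p_i, X_i)

# local empirical 0.95-quantile around each design point, then maximize
q_hat = np.empty(n)
for j in range(n):
    window = np.sort(y[np.abs(p - p[j]) <= h])
    q_hat[j] = window[int(np.ceil(len(window) * alpha)) - 1]
M_hat = q_hat.max()

true_max = 0.5 + 1.6449              # sup_p q_{p,0.95} = 0.5 + z_{0.95}
```

With these settings the estimate typically lands within a few tenths of the true value; the deviation shrinks as n grows and h is tuned.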
In our second example we again choose a random variable X with a standard normal distribution and P = [0, 1], but this time we define m by m(p, x) = exp(p + x), which implies that m(p, X) is lognormally distributed and its logarithm has mean p and variance 1. Again, the α-quantiles are known and can be computed with a statistics package. The bandwidth is chosen from the same set as in the first example. In Figure 3 we show a typical simulation for the two quantile estimates using the data-driven bandwidth and the optimal bandwidth, respectively. In this example the mean value (standard deviation) of the error is approximately 0.827 (0.517) for the data-driven bandwidth, as compared to 0.344 (0.285) for the optimal bandwidth; hence the mean error of the data-driven bandwidth choice is within a factor of 2.5 of that of the optimal (and in practice unavailable) bandwidth choice.
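In this second example the quantiles are even available in closed form: the α-quantile of a lognormal distribution whose logarithm has mean p and variance 1 is exp(p + z_α), so the maximal 0.95-quantile over P = [0, 1] is exp(1 + z_{0.95}). A quick numpy sanity check of this closed form (the constant 1.6449 is the standard normal 0.95-quantile, rounded):

```python
import numpy as np

rng = np.random.default_rng(0)
z95 = 1.6449                                    # 0.95-quantile of N(0, 1)
true_max = np.exp(1.0 + z95)                    # sup_p q_{p,0.95}, attained at p = 1

# Monte Carlo check of the closed form at p = 1:
# the empirical 0.95-quantile of exp(1 + Z), Z ~ N(0, 1), should be close
sample = np.exp(1.0 + rng.standard_normal(100_000))
mc = np.sort(sample)[int(np.ceil(0.95 * 100_000)) - 1]
```

The scale of this quantity (around 14) explains why the absolute errors reported for this example are an order of magnitude larger than in the first one.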
In our third example we choose X = (X^(1), X^(2), X^(3))^T, where X^(1), X^(2) and X^(3) are independent standard normally distributed random variables, P = [−0.5, 0.5]^3 and

m(p, x) = (p^(1) + x^(1))² + (p^(2) + x^(2))² + (p^(3) + x^(3))².

In this case m(p, X) is a non-central chi-square random variable with 3 degrees of freedom and non-centrality parameter (p^(1))² + (p^(2))² + (p^(3))². Again, the α-quantiles are known and can be computed with a statistics package. The bandwidth is chosen from the same set as above. In this example the mean value (standard deviation) of the error is approximately 0.399 (0.376) in the case of our data-driven bandwidth, as compared to 0.207 (0.144) in the case of the optimal bandwidth; hence the mean error of the data-driven bandwidth choice is within a factor of approximately 2 of that of the optimal (not available in practice) bandwidth choice.

Finally, we apply our newly proposed estimation method to the application described in Section 1. The numerical simulation model of the truss structure in Figure 1 is obtained by finite elements with the commercial software ANSYS. Each of the 24 beams is modeled with the same E, r and l parameters. Timoshenko beam theory is applied to all beams by using BEAM189 elements in ANSYS. The 10 connecting nodes are modeled as rigid links using MPC184 elements. Deflection is constrained at node 2 in x-, y- and z-directions, at node 3 in y- and z-directions and at node 4 in z-direction such that the truss support is statically determined.
The beam parameters E, r and l are assumed to lie in the intervals given in Table 1. Random loading is applied to node 1 as a single force F_z in the z-direction and two symmetric moments with magnitude M that act around axes in the x-y-plane at angles of +60° and −60° from the x-direction at nodes 6 and 7, respectively. The load F_z is normally distributed with mean value µ_{F_z} and standard deviation σ_{F_z}. The moments M have a truncated normal distribution restricted to the positive half-axis, where the underlying normal distribution has mean value µ_M = 0 and standard deviation σ_M, see Table 2.
To obtain the maximum equivalent stresses σ_{v,1}, σ_{v,2} and σ_{v,3}, a static analysis is carried out for each parameter set and load set. Two independent random load sets were generated for each combination of 21 values of each parameter distributed equidistantly in the parameter intervals. In total, 2 · 21³ deterministic calculations were carried out. Application of the newly proposed estimate of the maximum 0.95-quantile to this data results in predicted maximum stresses of 7.15 · 10⁶ N/m², 17.83 · 10⁶ N/m² and 17.83 · 10⁶ N/m² in the three considered beams, respectively. In contrast, if we ignore the uncertainty in the parameters and simply compute 0.95-quantiles using the mean values of these parameters and a data set of 21³ corresponding values, we instead obtain 6.77 · 10⁶ N/m², 16.71 · 10⁶ N/m² and 16.71 · 10⁶ N/m². As a result we get values that are 5.6%, 6.7% and 6.7% lower, respectively.

Auxiliary results
In this subsection we formulate and prove two lemmas which we will need in the proof of Theorem 3.1. Our first lemma shows that if we mix coupled samples from several different distributions, the empirical quantile of the mixed sample lies with probability one between the minimal and the maximal empirical quantiles of the individual samples. Here we use the fact that we do not change the joint distribution of our samples if we assume that the random variables are defined by applying the inverse of the cumulative distribution function to uniformly distributed random variables.
More precisely, our setting is the following: Let M ∈ N be the number of mixed samples, and for k ∈ {1, ..., M} let H^(k) : (0, 1) → R be a monotonically increasing function (which we will choose later as the inverse function of a cumulative distribution function). Let N ∈ N, let u_1, ..., u_N ∈ (0, 1) and denote by u_{1:N}, ..., u_{N:N} the corresponding order statistics. We define M different coupled samples by

z_i^{(k)} = H^{(k)}(u_i)   (i = 1, ..., N)

for k ∈ {1, ..., M}, and we mix them by choosing

z_i ∈ { z_i^{(1)}, ..., z_i^{(M)} }   (i = 1, ..., N).

Lemma 1. Let Ĝ^{(1)}, ..., Ĝ^{(M)} and Ĝ be the empirical cumulative distribution functions corresponding to z_1^{(k)}, ..., z_N^{(k)} (k = 1, ..., M) and to z_1, ..., z_N, respectively, and let

q̂_α^{(k)} = min{y ∈ R : Ĝ^{(k)}(y) ≥ α}   and   q̂_α = min{y ∈ R : Ĝ(y) ≥ α}

be the corresponding empirical α-quantiles. Then

min_{k=1,...,M} q̂_α^{(k)} ≤ q̂_α ≤ max_{k=1,...,M} q̂_α^{(k)}.

Proof. Since each H^{(k)} is monotonically increasing, the order statistics of the k-th sample satisfy z_{i:N}^{(k)} = H^{(k)}(u_{i:N}), and hence q̂_α^{(k)} = H^{(k)}(u_{⌈N·α⌉:N}). Set u* = u_{⌈N·α⌉:N}. For every i with u_i ≤ u* we have z_i ≤ max_k H^{(k)}(u*), and there are at least ⌈N·α⌉ such indices, so z_{⌈N·α⌉:N} ≤ max_k q̂_α^{(k)}. Conversely, every z_i smaller than min_k H^{(k)}(u*) must satisfy u_i < u*, and there are at most ⌈N·α⌉ − 1 such indices, so z_{⌈N·α⌉:N} ≥ min_k q̂_α^{(k)}. Since z_{⌈N·α⌉:N} = q̂_α this completes the proof.
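The sandwich property of Lemma 1 is easy to check numerically; the following sketch uses three arbitrary monotone functions H^(k) and a random mixing rule (all choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, alpha = 500, 3, 0.8
u = rng.uniform(size=N)                    # shared uniforms u_1, ..., u_N

# three monotonically increasing functions H^(k) (arbitrary choices)
H = [lambda t: t, lambda t: t ** 2, lambda t: np.log(1.0 + t)]
samples = [h(u) for h in H]                # coupled samples z^(k)_i = H^(k)(u_i)

def emp_quantile(z, alpha):
    """Smallest y such that the empirical cdf of z is >= alpha."""
    zs = np.sort(z)
    return zs[int(np.ceil(len(z) * alpha)) - 1]

# mix: for each index i pick one of the M coupled values
choice = rng.integers(0, M, size=N)
mixed = np.array([samples[choice[i]][i] for i in range(N)])

q_mixed = emp_quantile(mixed, alpha)
q_indiv = [emp_quantile(z, alpha) for z in samples]
```

Because the samples share the same uniforms, the sandwich holds for every realization, not just on average; rerunning with any other seed or mixing rule leaves the inequality intact.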
Our second lemma relates the error of the quantile estimate to the behavior of order statistics corresponding to some fixed sequence of independent random variables distributed uniformly on (0, 1).

Lemma 2.
Let p ∈ P be fixed and let the estimate q̂_{p,α} be defined as in Section 2. Assume that there exists a constant c_1 > 0 such that for some δ > 0 we have that G_p̃ is continuous on R and continuously differentiable on [q_{p̃,α} − δ, q_{p̃,α} + δ] with derivative bounded from below by c_1, for all p̃ ∈ P satisfying ‖p̃ − p‖_∞ ≤ δ. Let

N = Σ_{i=1}^n I_{{‖p_i − p‖_∞ ≤ h_n}}

be the number of parameters p_i which have supremum norm distance at most h_n to p. Let U_1, ..., U_N be independent random variables distributed uniformly on (0, 1) and denote their order statistics by U_{1:N}, ..., U_{N:N}. Then we have for any 0 < ε < δ satisfying δ ≥ h_n:

P{ |q̂_{p,α} − q_{p,α}| > ε + sup_{p_1,p_2 ∈ P: ‖p_1−p_2‖_∞ ≤ h_n} |q_{p_1,α} − q_{p_2,α}| } ≤ P{ |U_{⌈N·α⌉:N} − α| ≥ c_1 · ε }.

Proof. For p̃ ∈ P let G_p̃^{−1}(u) = min{y ∈ R : G_p̃(y) ≥ u} (u ∈ (0, 1)) be the generalized inverse of G_p̃. Then it is well known that G_p̃^{−1}(U_1) has the same distribution as m(p̃, X_1). Since Ĝ_p is by definition (observe that K is the naive kernel) the empirical cumulative distribution function of m(p_{i_1}, X_{i_1}), ..., m(p_{i_N}, X_{i_N}) for suitably chosen pairwise distinct i_1, ..., i_N ∈ {1, ..., n}, we see that it has the same distribution as the empirical cumulative distribution function Ĝ_N of G_{p_{i_1}}^{−1}(U_1), ..., G_{p_{i_N}}^{−1}(U_N). Consequently, q̂_{p,α} has the same distribution as the quantile estimate q̂_α^U corresponding to Ĝ_N, which implies for any η > 0

P{ |q̂_{p,α} − q_{p,α}| > η } = P{ |q̂_α^U − q_{p,α}| > η }.

Let Ĝ_N^{(k)} be the empirical cumulative distribution function of G_{p_{i_k}}^{−1}(U_1), ..., G_{p_{i_k}}^{−1}(U_N) and let q̂_α^{(k)} = G_{p_{i_k}}^{−1}(U_{⌈N·α⌉:N}) be the corresponding quantile estimate (k = 1, ..., N). Application of Lemma 1 yields for any η > 0

P{ |q̂_α^U − q_{p,α}| > η } ≤ P{ max_{k=1,...,N} |q̂_α^{(k)} − q_{p,α}| > η }.

Since ‖p_{i_k} − p‖_∞ ≤ h_n we have |q_{p_{i_k},α} − q_{p,α}| ≤ sup_{‖p_1−p_2‖_∞ ≤ h_n} |q_{p_1,α} − q_{p_2,α}|. Summarizing the above results we see that we have shown

P{ |q̂_{p,α} − q_{p,α}| > ε + sup_{‖p_1−p_2‖_∞ ≤ h_n} |q_{p_1,α} − q_{p_2,α}| } ≤ P{ max_{k=1,...,N} |G_{p_{i_k}}^{−1}(U_{⌈N·α⌉:N}) − q_{p_{i_k},α}| > ε }.

By the assumptions of Lemma 2, on [q_{p_{i_k},α} − δ, q_{p_{i_k},α} + δ] the derivative of G_{p_{i_k}} exists and is bounded from below by c_1 > 0. Consequently, on [α − c_1 · δ, α + c_1 · δ] the inverse function of G_{p_{i_k}} is continuously differentiable with derivative bounded from above by 1/c_1.
So in case that |U_{⌈N·α⌉:N} − α| < c_1 · δ we can conclude from the mean value theorem that

|G_{p_{i_k}}^{−1}(U_{⌈N·α⌉:N}) − q_{p_{i_k},α}| ≤ (1/c_1) · |U_{⌈N·α⌉:N} − α|   (k = 1, ..., N),

which implies the assertion of the lemma. Finally, note that

sup_{p_1,p_2 ∈ P: ‖p_1−p_2‖_∞ ≤ 1/n^{1/l}} |q_{p_1,α} − q_{p_2,α}| ≤ sup_{p_1,p_2 ∈ P: ‖p_1−p_2‖_∞ ≤ h_n} |q_{p_1,α} − q_{p_2,α}|    (5.3)

follows from n^{−1/l} ≤ h_n for n sufficiently large, which is implied by n · h_n^l → ∞ (n → ∞).