Fighting the Curse of Sparsity: Probabilistic Sensitivity Measures From Cumulative Distribution Functions

Quantitative models support investigators in several risk analysis applications. The calculation of sensitivity measures is an integral part of this analysis. However, it becomes a computationally challenging task, especially when the number of model inputs is large and the model output is spread over orders of magnitude. We introduce and test a new method for the estimation of global sensitivity measures. The new method relies on the intuition of exploiting the empirical cumulative distribution function of the simulator output. This choice allows the estimators of global sensitivity measures to be based on numbers between 0 and 1, thus fighting the curse of sparsity. For density-based sensitivity measures, we devise an approach based on moving averages that bypasses kernel-density estimation. We compare the new method to approaches for calculating popular risk analysis global sensitivity measures as well as to approaches for computing dependence measures gathering increasing interest in the machine learning and statistics literature (the Hilbert-Schmidt independence criterion and distance covariance). The comparison also involves the number of operations needed to obtain the estimates, an aspect often neglected in global sensitivity studies. We let the estimators undergo several tests, first with the wing-weight test case, then with a computationally challenging code with up to k = 30,000 inputs, and finally with the traditional Level E benchmark code.


INTRODUCTION
In risk-informed decision making, risk analysts and decision makers increasingly benefit from the use of quantitative risk assessment models (Apostolakis, 2004). Applications range from the probabilistic risk assessment of nuclear waste disposals (Helton, 1994; Helton, Hansen, & Swift, 2014; Helton & Marietta, 2000; Iman, Helton, & Campbell, 1978) and of nuclear power plants (Breeding, Helton, Gorham, & Harper, 1992; Helton & Breeding, 1993; Iman & Hora, 1990; NRC, 1990) to food safety (Patil & Frey, 2004). An important piece of information, often sought in risk assessment, is the importance of the uncertain inputs. This information allows the analyst to identify areas where additional modeling efforts are needed and to prioritize the collection of further data and information. Using the terminology of Saltelli (2002b), we are in a factor prioritization setting. An analyst obtains this information using either local, screening, or global sensitivity analysis methods. Local techniques comprise methods such as Tornado Diagrams (Eschenbach, 1992) and partial derivatives (Helton, 1993). Employing a local method, the analyst identifies the key drivers of variability around a point or in a limited region of the parameter space. Screening methods comprise techniques such as the method of Morris (Morris, 1991), sequential bifurcation (Bettonvil & Kleijnen, 1997; Kleijnen, 2017), and permutation importance (we refer to Wei, Lu, & Song, 2015, for greater details). Employing a screening method, the analyst aims to identify the least relevant inputs and to provide a qualitative indication of the most important ones, with a low number of simulator evaluations. Global methods comprise techniques such as variance-based sensitivity measures (Saltelli & Tarantola, 2002), moment-independent sensitivity indices (Borgonovo, 2006), and the value of information (Felli & Hazen, 1998; Oakley, 2010).
Employing a global sensitivity method, the analyst aims at identifying the key drivers of uncertainty quantitatively, while thoroughly exploring the simulator input space.
While best practices recommend the use of global sensitivity methods in the presence of uncertainty (Helton, 1993; Helton & Davis, 2002; Patil & Frey, 2004), the estimation of global sensitivity measures can become a challenging task. Past literature has identified and introduced methods to fight the curse of dimensionality, which affects estimation when simulators have a large number of inputs. Sampling-based methods, in the terminology of Helton and Davis (2002), or one-sample approaches, in the terminology of Strong, Oakley, and Chilcott (2012) and Strong and Oakley (2013), are estimation methods that reduce the relevance of dimensionality.
However, the curse of dimensionality might not be the only computational challenge. Numerical values of the output spanning orders of magnitude may impair convergence at reasonable sample sizes, independently of the dimensionality of the model. We call this effect the curse of sparsity. Although not explicitly isolated from the curse of dimensionality, this issue was recognized early on in the risk analysis literature: "The scaling problem most often can be overcome by performing uncertainty importance calculations based on a logarithmic scale for the top-event frequencies. The log scale produces a reliable ordering of the uncertainty importance for the events, and expresses the results in terms of log-based risk. However, the log-based uncertainty importance calculations do not readily translate back to a linear scale" (Iman & Hora, 1990, p. 402). These observations evidence two facts. On the one hand, transformations can help reduce the effects of the curse of sparsity. On the other hand, transformations induce interpretation issues, because results are valid on the transformed scale and not on the original scale. The use of transformations has been popular in the risk analysis literature since seminal works such as Iman and Conover (1979) and Saltelli and Sobol' (1995). However, it has recently been noted that the interpretation problems might be overcome if the analyst employs a sensitivity measure that is transformation-invariant (Borgonovo, Tarantola, Plischke, & Morris, 2014). Nonetheless, transformation invariance per se is not sufficient to overcome the curse of sparsity. In fact, some transformation-invariant global sensitivity measures (for instance, sensitivity measures based on the Kullback-Leibler entropy or on the L1-norm between densities) require the estimation of a probability density function (pdf). Density estimation often relies on kernel smoothing, whose numerical performance is affected by sparsity.
Thus, convergence might be impaired even if the global sensitivity measure under estimation is transformation-invariant.
We propose and evaluate a new method for computing global sensitivity measures that builds on this literature. The new method is based on the intuition of rewriting estimators as functions of the empirical (marginal and conditional) distribution function of the model output. In this way, the numerical elaborations are restricted to real numbers in the [0, 1] interval, counteracting the curse of sparsity. The estimation is performed within a given-data (one-sample) framework, thus keeping the estimation cost independent of the number of model inputs.
Before considering the numerical aspects of the new method, we establish the relationship among the L1-norm between pdfs and the Kolmogorov-Smirnov and Kuiper metrics. These three metrics are the basis of three global sensitivity measures used in previous risk assessment studies, namely, the measures δ, β^KS, and β^Ku (Borgonovo, 2006; Wei et al., 2015). We find that if the marginal and all conditional model output distributions are unimodal, the Kuiper and L1-norm distances are equivalent. Then, the analyst can bypass density estimation, using directly a sensitivity measure based on cumulative distribution functions (cdfs). However, these conditions on the marginal and conditional distributions are not verified in general. We therefore adapt the new method to the estimation of density-based sensitivity measures. The intuition is to employ a moving average to replace kernel smoothing. We study the resulting estimators from two viewpoints. First, we discuss their consistency in Appendix C. Second, we address their algorithmic properties. In fact, besides the global importance measures discussed above, other dependence measures coming from machine learning are becoming of interest in risk assessment studies, in particular, distance covariance (Székely & Rizzo, 2005) and the Hilbert-Schmidt independence criterion (HSIC) (Da Veiga, 2015; De Lozzo & Marrel, 2016, 2017). These importance measures can be computed from given data and thus, nominally, at the same number of model runs as the method we are studying. The difference, however, lies in the number of operations (algorithmic cost) performed once the input-output sample is available. We then study the algorithmic cost of the present method and compare it with the algorithmic cost of the above-mentioned dependence measures.
A series of challenging numerical experiments is carried out. The first experiments involve the simulator developed to study the weight of the wing of a light aircraft, recently studied in Jiménez Rugama and Gilquin (2018). The simulator is smooth, fast to run, and low-dimensional: all estimators work well. Then, we consider a simulator in which the inputs and the output are Cauchy distributed and the curse of sparsity appears. This synthetic case has the same behavior as the Level E geosphere transport code, the reference code for sensitivity analysis, which is investigated next. Level E features an output spanning orders of magnitude. There is no curse of dimensionality, but the curse of sparsity starts playing a role. The last experiments involve a computationally demanding, albeit analytically known, model. The goal is to identify the 10 most important inputs out of 30,000 input parameters, with a parameterization that makes the output variance close to the numerical range of the double-precision floating-point representation (about 1.79 · 10^308). This test case combines sparsity with dimensionality. For all experiments, we also test the new approach against previous methods as well as against estimators of the above-mentioned dependence measures, performing the analysis with and without output transformation to investigate whether rescaling is essential to reach convergence. In all cases, the results show that the newly proposed estimators achieve convergence at reasonable sample sizes.

A CONCISE REVIEW ON GLOBAL SENSITIVITY MEASURES
This section provides a concise review of global sensitivity analysis. The literature is vaster than what can be covered here. For broad overviews, we refer the reader to the monographs of Saltelli, Ratto, Tarantola, and Campolongo (2012) and Borgonovo (2017) and to the Handbook of Uncertainty Quantification (Ghanem, Higdon, & Owhadi, 2017) for comprehensive discussions. Among global methods, regression-based methods have been among the first to be applied in risk analysis for factor prioritization (Helton, 1993; Saltelli & Marivoet, 1990). Reviews are offered in Helton, Johnson, Sallaberry, and Storlie (2006); Storlie and Helton (2008); and Storlie, Swiler, Helton, and Sallaberry (2009). Works such as Helton (1993, 1994, 1999); Helton and Davis (2002, 2003); Frey and Patil (2002); and Helton and Sallaberry (2009) illustrate their application. The main sensitivity indicators of regression-based methods are the standardized regression coefficients and Pearson's correlation coefficient. To introduce them, let X = (X_1, X_2, . . . , X_k) denote the vector of inputs and Y = g(X) the input-output mapping, where k is the number of input factors. X is a random vector on a probability space (X, B(X), P), where B(X) is the Borel σ-algebra and P the probability measure that reflects the analyst's state of knowledge about the factors. The model output becomes a random variable Y whose distribution is induced by g(·). The corresponding probability space is (Y, B(Y), P_Y). We denote by F_Y(y) and f_Y(y) the model output marginal cdf and density, respectively. Conditional distributions, cdfs, and densities are denoted by P_{Y|X_i=x_i}, F_{Y|X_i=x_i}(y), and f_{Y|X_i=x_i}(y), respectively. If the input-output mapping g can be accurately fitted by a linear regression model, i.e., Y ≈ b_0 + Σ_{i=1}^k b_i X_i, then natural sensitivity measures are the standardized regression coefficients SRC_i = b_i σ_i / σ_Y (Helton, 1993; Kleijnen & Helton, 1999a, 1999b), where b_i are the linear regression coefficients, σ_i are the standard deviations of X_i, and σ_Y is the standard deviation of the model output Y.
Under independence of all random inputs X_i, the standardized regression coefficients coincide with the Pearson product moment correlation coefficients (Pearson, 1901), ρ_i = cov(Y, X_i) / (σ_i σ_Y), where cov(Y, X_i) is the covariance between Y and X_i (Table I lists the sensitivity measures used in this work). The quantities ρ_i and SRC_i are usually interpreted as measures of the strength of the linear relationship between two random variables. However, when the linear regression fit is poor, the explanatory power of a linear model assumption is weak and confidence in the ranking induced by regression-based techniques diminishes (Campolongo & Saltelli, 1997). Two remedies are available. The first is to resort to rank transformations (Conover & Iman, 1976; Iman & Conover, 1979). The standardized rank regression coefficients and the Spearman rank correlation coefficient (Spearman, 1904) then become natural global sensitivity measures. The second remedy is to apply sensitivity methods that rely less on the regression fit.
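To fix ideas, the following sketch estimates the standardized regression coefficients by least squares and compares them with Pearson's correlation coefficients. The linear test model, the Gaussian inputs, and all variable names are our own illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 3
X = rng.normal(size=(n, k))                       # independent inputs (assumption)
Y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)

# Fit Y ~ b0 + sum_i b_i X_i by least squares, then rescale each b_i
# by sigma_i / sigma_Y to obtain the standardized regression coefficients.
A = np.column_stack([np.ones(n), X])
b = np.linalg.lstsq(A, Y, rcond=None)[0]
src = b[1:] * X.std(axis=0, ddof=1) / Y.std(ddof=1)

# With independent inputs, the SRCs agree with Pearson's correlation coefficients.
rho = np.array([np.corrcoef(X[:, i], Y)[0, 1] for i in range(k)])
```

With this additive test model, both indicators rank X_1 first, X_2 second, and X_3 as irrelevant, and the two sets of values coincide up to sampling noise.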
Researchers have successfully investigated the use of variance-based sensitivity measures (Iman & Hora, 1990; Saltelli, 2002a). One writes

η_i = V[E[Y | X_i]] / V[Y], (3)

where E[Y | X_i] and V[Y | X_i] are the conditional output expectation and the conditional output variance given the input X_i. The conditional expectation given X_i is the nonparametric regression curve of Y on X_i. The quantity in (3) is the expected reduction in model output variance achieved by learning the true value of X_i. This quantity coincides with the Pearson correlation ratio (Pearson, 1905) and with the first-order variance-based sensitivity index (Homma & Saltelli, 1996). Several strategies have been developed over the years to efficiently estimate variance-based sensitivity measures. Among others, we recall the designs based on the Fourier Amplitude Sensitivity Test (FAST) (Saltelli, Tarantola, & Chan, 1999) and on Polynomial Chaos Expansions (Le Gratiet, Marelli, & Sudret, 2017; Sudret, 2008). Recently, Borgonovo, Hazen, and Plischke (2016) show that several global sensitivity measures can be defined through a common rationale. Consider an operator ζ : P × P → R of the form ζ(P_Y, P_{Y|X_i=x_i}), where P is the set of all probability distributions on (Y, B(Y)) and ζ(·, ·) is a generalized form of distance (thus, a metric or a divergence) between two distributions in the sense of Glick (1975). ζ(·, ·) is called an inner operator. For consistency, it is assumed that ζ(P, P) = 0 for any P ∈ P. The value of ζ(P_Y, P_{Y|X_i=x_i}) depends on the value attained by X_i. Therefore, the distance ζ(P_Y, P_{Y|X_i}) is a random function of X_i. Taking the expectation with respect to the marginal distribution of X_i, we obtain the quantity

ξ_i = E_{X_i}[ζ(P_Y, P_{Y|X_i})]. (4)

This quantity is the global sensitivity measure of X_i based on the inner operator ζ(·, ·). Several probabilistic sensitivity measures used in risk analysis are encompassed by (4). For instance, by setting ζ(P_Y, P_{Y|X_i=x_i}) = (E[Y] − E[Y | X_i = x_i])² / V[Y] and averaging, we obtain the first-order variance-based sensitivity measure η_i of (3).
Similarly, setting

ζ(P_Y, P_{Y|X_i=x_i}) = (1/2) ∫ |f_Y(y) − f_{Y|X_i=x_i}(y)| dy (6)

and averaging, we obtain the δ-importance measure

δ_i = (1/2) E_{X_i}[∫ |f_Y(y) − f_{Y|X_i}(y)| dy]. (7)

In addition, the global sensitivity measures β^KS_i = E_{X_i}[sup_y |F_Y(y) − F_{Y|X_i}(y)|] and β^Ku_i = E_{X_i}[sup_y (F_{Y|X_i}(y) − F_Y(y)) + sup_y (F_Y(y) − F_{Y|X_i}(y))] are the expected separations between the conditional and unconditional model output cdfs obtained, respectively, using the Kolmogorov-Smirnov and the Kuiper distances (Borgonovo et al., 2014). The three distance-based sensitivity measures (δ_i, β^KS_i, β^Ku_i) share the following properties: (1) normalization, as they take values between 0 and 1; (2) nullity implies independence; and (3) monotonic transformation invariance, i.e., the sensitivity measure computed for z(Y) coincides with the one computed for Y, where z : Y → R is a monotonic function of the model output Y.
Nullity implies independence is also listed as number 5 in Rényi's axioms for measures of statistical dependence (Rényi, 1959). It allows an analyst to conclude that a null value of the sensitivity measure implies that X_i and Y are independent random variables (Da Veiga, 2015). The sensitivity measures SRC_i, ρ_i, and η_i do not possess the nullity-implies-independence property.
Monotonic transformation invariance helps analysts in fighting the curse of sparsity. As shown in Conover and Iman (1981); Iman and Hora (1990); and Saltelli and Sobol' (1995), transformations may improve numerical efficiency in the estimation of global sensitivity measures. However, transformations open the question of reinterpreting results back on the original scale (Iman & Hora, 1990). Moreover, transformations may induce ranking changes as a consequence of the change in the input-output structure. If rank modifications occur, then the analyst needs to choose whether to trust the ranking before or after the transformation. It then becomes important to understand whether the transformed or the original output is the quantity of interest for the model user. Conversely, these problems are avoided if the sensitivity measure is transformation-invariant.
Recently, further ways of identifying key drivers of uncertainty have been explored in machine learning (Da Veiga, 2015). A first method relies on distance covariance and distance correlation (Székely, Rizzo, & Bakirov, 2007). We refer to Lyons (2013) and Sejdinovic, Sriperumbudur, Gretton, and Fukumizu (2013) for further readings on theoretical aspects underlying these dependence measures. In Appendix D, we report some additional mathematical details useful to clarify the calculations carried out in the subsequent numerical experiments of our work. Distance covariance quantifies the degree of statistical dependence between Y and X_i via pairwise distances of their realizations. In particular, one considers the random variables X_i and Y and their independent replicates X_i', X_i'' and Y', Y''. Then, their distance covariance is calculated from the expression (Lyons, 2013; Sejdinovic et al., 2013):

dCov²(X_i, Y) = E[|X_i − X_i'| |Y − Y'|] + E[|X_i − X_i'|] E[|Y − Y'|] − 2 E[|X_i − X_i'| |Y − Y''|].

In this work, we will use the normalized version of distance covariance, which is known as distance correlation; see Appendix D for details. Distance covariance and distance correlation are part of the so-called energy statistics (Székely & Rizzo, 2017), and are a topical research subject.
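The sample version of this quantity replaces the expectations with double-centered pairwise distance matrices (Székely, Rizzo, & Bakirov, 2007). A minimal sketch for one-dimensional samples, with function names of our own choosing:

```python
import numpy as np

def distance_covariance(x, y):
    """Sample distance covariance between two 1-d samples (Székely et al., 2007)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])        # pairwise distances of x
    b = np.abs(y[:, None] - y[None, :])        # pairwise distances of y
    # Double centering: subtract row and column means, add the grand mean.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    # The averaged product is provably nonnegative; clip guards float error.
    return np.sqrt(np.maximum(np.mean(A * B), 0.0))

def distance_correlation(x, y):
    """Normalized distance covariance (distance correlation)."""
    dxy = distance_covariance(x, y) ** 2
    dxx = distance_covariance(x, x) ** 2
    dyy = distance_covariance(y, y) ** 2
    return np.sqrt(dxy / np.sqrt(dxx * dyy)) if dxx * dyy > 0 else 0.0
```

For a perfectly linear relationship the sample distance correlation equals 1, while for independent samples it is close to 0 (up to a positive finite-sample bias). The quadratic memory footprint of the pairwise matrices is the cost discussed later in the article.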
The second method relies on the HSIC as a sensitivity measure. Lyons (2013), Sejdinovic et al. (2013), and Da Veiga (2015) show that this criterion is closely related to distance correlation. In particular, Sejdinovic et al. (2013) prove that when a Euclidean distance is used in the HSIC, the square of the distance correlation and the HSIC are proportional. However, with the traditional use of a Gaussian kernel, the HSIC and distance correlation are not generally equivalent sensitivity measures.

ESTIMATION: THE GIVEN-DATA APPROACH
The total cost for estimating a probabilistic sensitivity measure is the sum of two components: the cost associated with the generation of the sample (C_Model) and the cost associated with the calculation of the global sensitivity measure from the sample (C_Estimator). The first component is strictly related to the ability to run the model and is measured in terms of model runs. The second component is related to the number of operations required by the estimator and is measured in number of computer operations. When the running time of the model is high, C_Model usually dominates. However, if two sensitivity measures can be estimated from the same sample, then the lower C_Estimator, the faster the analysis. In this section, we shall focus on the first component, C_Model. The algorithmic cost, C_Estimator, is discussed at the end of Section 4.2.
The global sensitivity measures comprised by the common rationale in (4) are associated with a cost C_Model^Brute-Force = k · n_ext · n_int, where n_ext points are sampled from the marginal distribution of X_i, i = 1, . . . , k, and n_int points are required to compute the inner statistic conditional on the n_ext realizations of X_i. The required number of model evaluations rapidly becomes prohibitive, and several works have addressed ways of reducing the computational burden. For variance-based sensitivity measures, the works (Homma & Saltelli, 1996; Saltelli, 2002a; Saltelli et al., 2010; Sobol', 1990) have reduced the computational cost to C_Model = (k + 2) · n, i.e., k + 2 evaluations of a basic sample block, for estimating first and total effects. 1 These works use the Jansen-Saltelli-Sobol' design, also known as pick-and-freeze sampling (Gamboa, Janon, Klein, Lagnoux, & Prieur, 2016). Recently, Owen (2013) introduces a specific design that improves the estimation of variance-based sensitivity measures when their values are small. The random balance design (Tarantola, Gatelli, & Mara, 2006), a variant of the FAST method (Cukier, Fortuin, Shuler, Petschek, & Schaibly, 1973), requires C_Model = n evaluations for computing first-order effects. All these approaches are specific and require that the sample is generated from the computer code following a precise scheme.
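A minimal sketch of the pick-and-freeze design at cost (k + 2) · n runs may help fix ideas. The estimator forms below are one common choice from this literature (the first-order form of Saltelli et al., 2010, and a Jansen-type total-effect form); the test function and all names are ours:

```python
import numpy as np

def sobol_pick_and_freeze(g, k, n, rng):
    """First-order and total effects at (k + 2) * n model runs."""
    A = rng.random((n, k))                      # base sample block
    B = rng.random((n, k))                      # second, independent block
    yA, yB = g(A), g(B)
    var = np.var(np.concatenate([yA, yB]), ddof=1)
    S1, ST = np.empty(k), np.empty(k)
    for i in range(k):
        ABi = A.copy()
        ABi[:, i] = B[:, i]                     # "freeze" all columns but i
        yABi = g(ABi)
        S1[i] = np.mean(yB * (yABi - yA)) / var        # first-order effect
        ST[i] = 0.5 * np.mean((yA - yABi) ** 2) / var  # total effect
    return S1, ST
```

For an additive test model such as Y = X_1 + 2·X_2 with independent uniform inputs, the first-order and total effects coincide and equal 0.2 and 0.8.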
A given-data or one-sample approach, instead, allows the computation of global sensitivity measures directly from a sample (X, Y), with C_Model^Given-Data = n, where n is the sample size. The key intuition lies in replacing the point-conditional probability P_{Y|X_i=x_i} with the class-conditional probability P_{Y|X_i∈C_{m,i}}, where C_{m,i} is an element of a partition of the support of X_i. (We recall that, in general, a partition of a set is a collection of disjoint subsets whose union is the set itself.) Specifically, in our case, one considers the realizations of Y whose associated realizations of X_i fall into C_{m,i}. Then,

η̂_i = Σ_{m=1}^{M} (n_{m,i}/n) (ȳ_{m,i} − ȳ)² / σ̂²_Y

is Pearson's 1905 given-data estimator of η_i in Equation (3), where ȳ_{m,i} is the mean of the outputs in the mth class, ȳ and σ̂²_Y are the sample mean and variance of the output, and the weights n_{m,i}/n are estimates of the probability of C_{m,i} under X_i.
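The class-conditional construction can be sketched in a few lines. The bin count M and the test functions below are our own illustrative choices:

```python
import numpy as np

def eta_given_data(x, y, M=32):
    """Pearson's (1905) correlation ratio from a single (x, y) sample:
    partition the support of X into M equally likely classes and compute
    the weighted variance of the within-class output means."""
    n = len(y)
    order = np.argsort(x)                  # sort outputs by the input value
    y_sorted = y[order]
    bins = np.array_split(y_sorted, M)     # n_m ~ n / M realizations per class
    ybar = y.mean()
    between = sum(len(b) / n * (b.mean() - ybar) ** 2 for b in bins)
    return between / y.var()
```

For a deterministic relationship such as Y = X², the estimate approaches 1 (up to a small within-bin bias), while for independent inputs and outputs it is close to 0.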
More in general, points in a bin follow the conditional distribution F_{Y|X_i∈C_{m,i}}. If Y is absolutely continuous, we have the corresponding conditional densities f_{Y|X_i∈C_{m,i}}. These densities are visualized as the red lines in Fig. 1. Thus, from the realizations in the bins it is possible to obtain empirical estimates of the cdf F_{Y|X_i∈C_{m,i}} or of the corresponding density f_{Y|X_i∈C_{m,i}}. This partitioning idea carries over directly to other measures in the common rationale. To illustrate, for the δ-importance in (7) the estimator takes the form

δ̂_i = (1/2) Σ_{m=1}^{M} (n_{m,i}/n) ∫ |f̂_Y(y) − f̂_{Y|X_i∈C_{m,i}}(y)| dy, (12)

where M is the number of partition classes, which depends on the sample size n, and f̂_Y, f̂_{Y|X_i∈C_{m,i}} are empirical estimates of the unconditional and conditional simulator output densities. In Equation (12), a crucial role is played by the estimator of the conditional and unconditional densities. If kernel smoothing is chosen, this choice in itself involves the selection of kernel functions and bandwidths; see Appendix E for greater details. It is well known (Sheather, 2004) that the values of the kernel-density bandwidths are of critical importance and that optimal choices are related to the roughness of the (unknown) pdfs f_Y and f_{Y|X_i∈C_{m,i}}. When the model output is sparse, this part of the estimator may fail to produce reasonable results at limited sample sizes. Then, bypassing density estimation might be advantageous.

ESTIMATION USING EMPIRICAL CDFS
This section is divided into two parts. In the first part, we state some results that help us in reformulating estimators of global sensitivity measures based on empirical cdfs. In the second, we introduce the new method.

A Preliminary Result
In Fig. 2, a bimodal distribution (dark/red) is compared with a unimodal one (light/green). The Kolmogorov-Smirnov distance is determined by the maximum distance between two cdfs. This distance equals the length of the larger of the vertical black bars in graphs (A) or (B). The Kuiper distance is determined by considering the maximum difference in both directions; hence it is the sum of the two bars. However, we observe that the separations in both the Kuiper and Kolmogorov-Smirnov distances can also be interpreted in terms of the areas between the corresponding pdfs. This is illustrated in graphs (C) and (D) of Fig. 2. In graph (C), regions I and III are bounded above by the red pdf and below by the green pdf. Region II is bounded by the green pdf above and the red pdf below. Graph (D) shows the pdf difference: regions I and III are above the 0-line, while region II is below. In this example, the Kolmogorov-Smirnov distance is equal to the area of region III. The Kuiper distance is the area of region II or, alternatively, the combined area of regions I and III. The inner operator of the δ sensitivity measure (L1-norm, (6)) equals half the sum of the areas of the three regions. It is also worth observing that the locations of the extremes of the difference between the cdfs coincide with the locations of the intersections of the pdfs. While the Kolmogorov-Smirnov and Kuiper metrics refer to the global extreme values of the difference between the conditional and unconditional cdfs, the L1-norm also takes all local extreme values into account (Davies & Kovac, 2004). Extending these insights to the corresponding global sensitivity measures, we then have the following proposition; see Appendix A for the proof.

Proposition 1. If the marginal and all conditional model output distributions are unimodal, then δ_i = β^Ku_i.

The estimation of β^Ku_i from empirical cdfs is computationally convenient (Borgonovo et al., 2014). Thus, Proposition 1 can be turned into a computational advantage: one estimates β^Ku_i instead of δ_i if all distributions are unimodal.
However, in general the analyst may not know a priori that the conditional and marginal distributions are all unimodal (for instance, this is not the case for the densities in Fig. 1). Moreover, the reason for the computational advantage associated with β^Ku_i has not been fully investigated. We argue that the better convergence properties are due to the fact that β^Ku_i works directly on the cdfs, thus embedding a convenient numerical transformation of the model output data. We investigate these aspects further in the next subsection.

CDF Estimators
The generic given-data estimator of any global sensitivity measure comprised within the framework of (4) can be expressed as a function of empirical cdfs as follows (Borgonovo et al., 2016):

ξ̂_i = Σ_{m=1}^{M(n)} (n_{m,i}/n) ζ(F̂_Y, F̂_{Y|X_i∈C_{m,i}}), (13)

where the number of partition classes M(n) depends on the sample size n. The assumptions under which the estimators in Equation (13) are consistent are discussed in Borgonovo et al. (2016) and summarized in theorem 1 therein. Specifically, it is required that: (i) M(n) is monotonically increasing in n and such that lim_{n→∞} n/M(n) = ∞; (ii) the inner statistic ζ(·, ·) is a continuous function of its arguments; and (iii) ζ(·, ·) is Riemann-Stieltjes integrable with respect to the marginal distribution of X_i. In the proof, a key role is played by the fact that the empirical cdfs F̂_Y and F̂_{Y|X_i∈C_{m,i}} are consistent, that is, by the fact that F̂_Y and F̂_{Y|X_i∈C_{m,i}} tend pointwise to F_Y and F_{Y|X_i∈C_{m,i}} as the sample size n, and therefore the size of the partition M(n), increases.
Then, the first step for calculating any estimator in the form of (13) is to obtain consistent cdf estimators. A canonical candidate for this purpose is the empirical cdf of Y. Given n realizations (y_j), j = 1, 2, . . . , n, of the random variable Y, the empirical cdf of Y is defined by counting the number of realizations below y,

F̂_Y(y) = #{j : y_j ≤ y} / n, (14)

where #A denotes the number of elements in the set A. For the conditional random variable Y|X_i ∈ C_{m,i}, we obtain the subsample Y_{m,i} = {y_j | x_{ji} ∈ C_{m,i}} containing the n_{m,i} realizations of the output for which the associated input value of interest is contained in the mth class of the partition. Then, the conditional empirical cdf of Y given that X_i ∈ C_{m,i} is defined by

F̂_{Y|X_i∈C_{m,i}}(y) = #{y_j ∈ Y_{m,i} : y_j ≤ y} / n_{m,i}. (15)

Setting C_{m,i} equal to the interval between the (m − 1)/M(n) and m/M(n) empirical quantiles of X_i, we have classes or bin intervals that all satisfy n_{m,i}/n ≈ 1/M(n), i.e., each strip contributes the same weight in (13). With this choice of the partition, the estimator in Equation (13) becomes

ξ̂_i = (1/M(n)) Σ_{m=1}^{M(n)} ζ(F̂_Y, F̂_{Y|X_i∈C_{m,i}}), (16)

which we call the cdf-based estimator. As shown in the proof of theorem 1 in Borgonovo et al. (2016), the canonical empirical-cdf estimators are consistent by the law of large numbers, and thus the cdf-based estimator in Equation (16) is consistent. The literature has intensively investigated the problem of obtaining smooth estimates of empirical cdfs. For instance, Berg and Harris (2008) use a diffusion-based approach, while Veraverbeke, Gijbels, and Omelka (2014) use local linear regression estimates. One can benefit from these advances to refine the estimators. Recently, Ben Abdellah, L'Ecuyer, Owen, and Puchhammer (2018) show that convergence in the estimation of the distribution occurs not only when the sample is generated through Monte Carlo, but also through quasi-Monte Carlo.
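As a sketch of the cdf-based estimator (16) for the Kolmogorov-Smirnov and Kuiper inner operators: one sorts the sample by the output, tracks bin membership through the ranks of the input, and takes extrema of the cdf differences. All implementation choices below (the bin count, the rank-based bookkeeping, the function name) are ours, not the paper's Algorithm 1:

```python
import numpy as np

def betaKS_betaKu(x, y, M=32):
    """Given-data estimates of the Kolmogorov-Smirnov and Kuiper sensitivity
    measures using empirical cdfs on M equally likely bins of X."""
    n = len(y)
    order = np.argsort(y)                        # sort by output: F_Y(y_j) = j / n
    x_sorted = x[order]
    ranks_x = np.argsort(np.argsort(x_sorted))   # ranks of x, aligned with sorted y
    F_Y = np.arange(1, n + 1) / n
    edges = np.linspace(0, n, M + 1)             # equally likely rank bins
    bKS = bKu = 0.0
    for m in range(M):
        member = (ranks_x >= edges[m]) & (ranks_x < edges[m + 1])
        n_m = member.sum()
        F_cond = np.cumsum(member) / n_m         # conditional empirical cdf
        d = F_cond - F_Y                         # cdf difference on the y-grid
        bKS += n_m / n * np.max(np.abs(d))       # Kolmogorov-Smirnov separation
        bKu += n_m / n * (np.max(d) - np.min(d)) # Kuiper separation
    return bKS, bKu
```

All arithmetic involves only numbers in [0, 1], which is the point of the cdf-based formulation: the output values themselves enter only through their ranks.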
Let us now come to the numerical implementation. Proposition 1 suggests that a common problem in estimating δ_i, β^KS_i, and β^Ku_i is to find the critical points (local extrema) of the function F̂_{Y|X_i∈C_{m,i}}(y) − F̂_Y(y), confirming the intuition in Liu and Homma (2009). For simplicity, in the following discussion we assume that the output sample is without ties and reordered so that y_j < y_{j+1}, j = 1, . . . , n − 1, with the associated inputs rearranged accordingly. The computation is simplified by the fact that in this case F̂_Y(y_j) is simply given by j/n. It is also useful to introduce the notion of a subsample run.

Definition 1. A subsample run is a maximal sequence of adjacent output values
{y_j, y_{j+1}, . . . , y_{j+r−1}} ⊂ Y_{m,i} for which the associated input values x_{·i} are all contained in C_{m,i}.
We can detect whether y_j belongs to a subsample run by considering the difference between F̂_{Y|X_i∈C_{m,i}} and F̂_Y at two consecutive realizations of y. In fact, it holds:

[F̂_{Y|X_i∈C_{m,i}}(y_{j+1}) − F̂_Y(y_{j+1})] − [F̂_{Y|X_i∈C_{m,i}}(y_j) − F̂_Y(y_j)] = 1/n_{m,i} − 1/n if y_{j+1} ∈ Y_{m,i}, and −1/n otherwise. (17)

Hence, the difference in (17) increases in jumps of 1/n_{m,i} − 1/n > 0 within each subsample run, while it decreases outside the run in smaller steps of −1/n. If the partition bins are chosen to be equally likely, then 1/n_{m,i} − 1/n ≈ (M(n) − 1)/n, which shows how the partition size impacts the cdf differences. Regarding β^KS_i and β^Ku_i, a straightforward application of the cdf-based estimator (16) yields consistent estimators, as the conditions of theorem 1 in Borgonovo et al. (2016) are satisfied. The insight added by (17) is that local extrema of the cdf differences are located at the beginning and at the end of subsample runs. This fact can be exploited to speed up the search for the global extrema required by the Kolmogorov-Smirnov and the Kuiper metrics.
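The jump structure in (17) is easy to verify numerically. The following sketch builds the cdf difference for one randomly generated membership vector (our own synthetic setup) and checks the two step sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
y = np.sort(rng.normal(size=n))       # sorted output sample, no ties
member = rng.random(n) < 0.25         # membership in one class C_{m,i}
n_m = member.sum()

# Difference between the conditional and unconditional empirical cdfs,
# evaluated at the sorted output values.
d = np.cumsum(member) / n_m - np.arange(1, n + 1) / n
steps = np.diff(d)
# Inside a subsample run the difference jumps up by 1/n_m - 1/n,
# outside it drifts down by -1/n, exactly as Equation (17) states.
up = steps[member[1:]]
down = steps[~member[1:]]
```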
Regarding δ_i, the construction of estimators based on Equation (16) is more elaborate. In the remainder of this section, we summarize the main steps and the intuition. Appendix B reports the technical details and Appendix C discusses the consistency of the estimators. The first step is to link the estimation of a density to the estimation of a cdf via Scheffé's theorem (Borgonovo, 2006; Devroye & Györfi, 1985; Scheffé, 1947). This theorem implies that it is equivalent to count (and sum) all the areas between the conditional and unconditional cdfs, or to count twice only the areas where the conditional cdf is greater (smaller) than the unconditional cdf. So, the problem boils down to determining the subset where f_{Y|X_i∈C_{m,i}}(y) is greater than f_Y(y). Second, the δ importance measure requires that Y is an absolutely continuous random variable. In this case, one observes that F̂_{Y|X_i∈C_{m,i}} is not a good approximation of F_{Y|X_i∈C_{m,i}}. The left graph in Fig. 3 offers a visual illustration, using the same data as Fig. 1. Note the jiggling in the left graph. The reason is that F̂_{Y|X_i∈C_{m,i}} is discontinuous, as Equation (17) suggests, while F_{Y|X_i∈C_{m,i}} is continuous because of the absolute continuity of Y. Here, the literature offers two main alternatives: kernel smoothing or moving averages. To avoid kernel smoothing, we apply a moving-average approach. The right graph in Fig. 3 displays the effect of processing F̂_{Y|X_i∈C_{m,i}} with a moving-average operator. The operation considerably reduces the jiggling, while the local and global extrema are still present. In general, smoothing has the advantage of reducing variance, but at the cost of introducing bias. At any finite sample size, despite the smoothing, the determination of the extreme values remains an error-prone process. Therefore, we need to take additional precautions, defining positive and negative subsample runs. These are technical aspects that we discuss in Appendix B.
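A naive sketch of the moving-average route is below. It smooths the empirical cdf difference and applies Scheffé's theorem by summing the positive increments (the total rise) of the smoothed difference. Unlike the paper's Algorithm 1 and Appendix B, it does not detect positive and negative subsample runs explicitly, and the window width w is an assumption of ours that must grow with n to keep the upward bias of the positive-variation sum under control:

```python
import numpy as np

def delta_given_data(x, y, M=16, w=None):
    """Naive given-data delta estimate: moving-average smoothing of the
    empirical cdf differences, then Scheffe's theorem via positive increments."""
    n = len(y)
    w = n // 20 if w is None else w              # smoothing window (assumption)
    order = np.argsort(y)
    ranks_x = np.argsort(np.argsort(x[order]))   # input ranks, aligned with sorted y
    F_Y = np.arange(1, n + 1) / n
    edges = np.linspace(0, n, M + 1)
    delta = 0.0
    for m in range(M):
        member = (ranks_x >= edges[m]) & (ranks_x < edges[m + 1])
        n_m = member.sum()
        d = np.cumsum(member) / n_m - F_Y        # empirical cdf difference
        d_smooth = np.convolve(d, np.ones(w) / w, mode="same")
        # Scheffe's theorem: the inner L1 statistic equals the total rise
        # of the cdf difference, i.e., the sum of its positive increments.
        inc = np.diff(d_smooth)
        delta += n_m / n * inc[inc > 0].sum()
    return delta
```

For a perfectly dependent pair the estimate approaches the theoretical value 1 − 1/M, while for independent samples it stays near 0; the residual positive bias under independence is the price of replacing exact run detection with plain positive variation.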
All in all, one obtains the estimator in Equation (18), where b̂_{m,t} and â_{m,t} are properly selected values of Y in each partition, and F̃ denotes the smoothed cdfs.
The pseudo-code of Algorithm 1 summarizes the given-data estimators discussed thus far. It is based on an efficient way to estimate conditional cdfs and computes the sensitivity measures β^{KS}, β^{Ku}, and δ from two input vectors, x and y, of length n.
Implementation-wise, the selection of the conditional subsample, with respect to input values between the (m − 1)/M quantile and the m/M quantile, uses the "is member of the subsample" information, which is coded into a true/false vector interpreted as integers 1 and 0. Because the vector z is a presorted copy of x, the conditional cdf is given by a scaled version of the cumulative sum of the membership vector, while the unconditional cdf is a linear ramp. Thus, the operations to be performed are differences of cumulative sums, which are used to determine the points of local extrema. Finally, the algorithm returns the estimates of the sensitivity measures by averaging over all partition bins.
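The description above can be condensed into a small given-data sketch for β^{KS} and β^{Ku}. This is a simplified reading of Algorithm 1, not the paper's implementation: the function name, the equally likely bin construction via quantiles, and the handling of the last bin edge are our assumptions.

```python
import numpy as np

def betas_given_data(x, y, M=32):
    """Sketch of cdf-based estimation of beta_KS and beta_Ku:
    partition x into M equally likely bins; within each bin, take
    extrema of the conditional-minus-unconditional cdf difference."""
    n = len(y)
    order = np.argsort(y)                    # sort once on the output
    x_sorted = x[order]
    F_unc = np.arange(1, n + 1) / n          # unconditional cdf: a linear ramp
    edges = np.quantile(x, np.linspace(0, 1, M + 1))
    edges[-1] = np.inf                       # include the maximum in the last bin
    ks = ku = 0.0
    for m in range(M):
        member = (x_sorted >= edges[m]) & (x_sorted < edges[m + 1])
        n_m = member.sum()
        if n_m == 0:
            continue
        d = np.cumsum(member) / n_m - F_unc  # difference of cumulative sums
        ks += (n_m / n) * np.abs(d).max()    # Kolmogorov-Smirnov separation
        ku += (n_m / n) * (max(d.max(), 0.0) + max(-d.min(), 0.0))  # Kuiper
    return ks, ku

rng = np.random.default_rng(2)
n = 20000
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
y = 5 * x1 + 0.1 * x2 + 0.05 * rng.normal(size=n)

ks1, ku1 = betas_given_data(x1, y)
ks2, ku2 = betas_given_data(x2, y)
assert 0.0 <= ks2 < ks1 <= 1.0               # x1 dominates the output
```

Note that every quantity manipulated inside the loop lives on the [0, 1] scale, which is exactly the property the paper exploits against sparsity.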
We now analyze the algorithmic cost of the new estimation method. The run-time of Algorithm 1 is essentially driven by sorting the available data with respect to the output and with respect to all inputs, which yields k + 1 sorting operations on data of size n. Each sort can be performed with O(n log n) operations (Knuth, 1997), so that O((k + 1)n log n) operations are needed in total. Moreover, the memory requirements of the new estimator are approximately linear in the sample size. We note that for kernel-based approaches like DCov or HSIC, as presented in Appendix D, matrices of distances between all data pairs have to be formed, both for the inputs and for the outputs. Hence, the implementation implies, in principle, quadratic memory requirements, O(n^2) per matrix, when evaluating the kernel on all pairs of data. Furthermore, HSIC with Gaussian kernels needs bandwidth estimation, which adds further complexity.
With respect to previous works, we also note that Algorithm 1 sorts the output, in contrast to the scatterplot-partitioning idea of Section 3 and the algorithms discussed in Plischke (2012) and Plischke, Borgonovo, and Smith (2013), where the sorting is on the realizations of X_i and the output realizations are then reordered accordingly.
Finally, we recall that counting the number of runs from the conditional subsample in the ordered output sample is at the basis of the Wald-Wolfowitz run test statistic (Wald & Wolfowitz, 1940). Hence, taking the largest cdf difference within each run combines ideas from the Kolmogorov-Smirnov test and the Wald-Wolfowitz run test.

NUMERICAL EXPERIMENTS
We start with a premise on partition selection. As mentioned in Strong and Oakley (2013) and Borgonovo, Hazen, and Plischke (2016), there is a trade-off between the number of realizations to allocate to a partition and the number of partitions. In fact, the lower the number of partitions, the more accurately we can estimate the inner statistic, but a low number of partitions may fail to capture the behavior of ζ_i(P_Y, P_{Y|X_i}) as a function of X_i. The trade-off is well detailed in Strong and Oakley (2013) and Borgonovo et al. (2016). While for large sample sizes the estimates become insensitive to the partition size, for sample sizes below n = 2,000 the partition selection strategy becomes relevant. The partition size M in the following experiments is linked to the sample size n via M = min{n/64, 32}, guaranteeing a minimal subsample size of 64 realizations when one selects equally likely partition bins.
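The rule above is a one-liner in code; the integer floor and the lower bound of one bin for very small samples are our reading, not stated in the text:

```python
def partition_size(n):
    """Partition-size rule from the text: M = min{n/64, 32},
    guaranteeing at least 64 realizations per equally likely bin."""
    return max(1, min(n // 64, 32))

assert partition_size(512) == 8
assert partition_size(2048) == 32
assert partition_size(10**6) == 32   # M saturates at 32 bins
```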

When Everything Works
We consider first the wing-weight simulator, a numerical code recently studied in the context of the estimation of Sobol' sensitivity indices in Jiménez Rugama and Gilquin (2018). The simulator computes the weight of a light-aircraft wing as a function of 10 uncertain design parameters. The MATLAB file and implementation details are available from the library of Surjanovic and Bingham (2019). Following the distributional assignment in Jiménez Rugama and Gilquin (2018, table 9, p. 735), we generate a sequence of samples of increasing size, up to n = 65,536, for the estimation of variance-based sensitivity measures (the size is the same as in Jiménez Rugama and Gilquin, 2018). We utilize the same samples and distributional assignments to estimate, besides first-order sensitivity measures, correlation coefficients, the β^{KS}, β^{Ku}, and δ sensitivity measures, as well as HSIC and distance correlation. For β^{KS}, β^{Ku}, and δ, we compare the estimators proposed in this work against the estimators of Plischke et al. (2013) and Borgonovo et al. (2014).
The graphs in Fig. 4 report the sample sizes on the horizontal axis and the corresponding estimates on the vertical axis. The upper left panel reports the squared correlation coefficients, the lower left the variance-based first-order indices. The three upper center panels are obtained by Algorithm 1 applied directly to the sample data, while the lower panels use the estimators in Plischke et al. (2013). The rightmost panels display the results for HSIC (upper graph) and distance correlation (lower graph). Fig. 4 shows that all estimates rapidly converge and that the ranking of the most important inputs is correctly reported by all estimators already at the smallest tested sample size (n = 512), with convergence in the estimates obtained for n ≥ 1,024. For this simulator, a linear regression surface with just additive terms would fit well, capturing over 90% of the output variance and signaling a mild behavior of the input-output mapping. Thus, in this case, the newly introduced estimators do not display an advantage over previously introduced estimators. (Note that the horizontal axes of distance correlation and HSIC stop at about n = 7,000. The large memory requirements make it impossible to obtain numerical estimates for larger sample sizes. However, all relevant inputs are already identified by these sensitivity measures at smaller sample sizes.)

When Sparsity Jumps In
Let us now consider what happens when the output is sparse, that is, when it ranges over orders of magnitude, a situation often encountered in risk analysis applications (see Iman & Hora, 1990). To show that sparsity issues can emerge independently of the simulator dimensionality, we start with a low-dimensional test case. Consider Y = X_1^{−1} + X_2, with X_1 Cauchy(0, 3)-distributed and X_2 independently Cauchy(1, 0.5)-distributed. Then, Y is Cauchy(1, 5/6)-distributed, and all the conditional distributions are Cauchy as well. Therefore, it is possible to obtain the values of the moment-independent sensitivity measures analytically. Specifically, we have δ_1 = 0.31 and δ_2 = 0.46. Because the distributions are all unimodal, by Proposition 1 we expect δ_i = β_i^{Ku}. Fig. 5 shows the estimates of the sensitivity measures addressed thus far. For variance-based sensitivity measures, we observe that the larger the sample size, the more numerical issues appear. Furthermore, the kernel-density-based estimator of δ of Plischke et al. (2013) fails (lower row) to produce consistent estimates. Hence, from the estimators in the lower row of Fig. 5 one would not recover the theoretically expected identity between δ_i and β_i^{Ku}. This identity is, instead, recovered by the estimators in the upper row.
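The distributional claim can be checked by simulation. The sketch below (the inverse-cdf sampler helper is ours) verifies that the sample median and interquartile range of Y match those of a Cauchy(1, 5/6) distribution, using the facts that 1/X_1 is Cauchy(0, 1/3) and that independent Cauchy variables add their scales.

```python
import numpy as np

def cauchy(loc, scale, size, rng):
    """Cauchy sampler via the inverse cdf (an illustrative helper)."""
    u = rng.uniform(size=size)
    return loc + scale * np.tan(np.pi * (u - 0.5))

rng = np.random.default_rng(3)
n = 100_000
x1 = cauchy(0.0, 3.0, n, rng)
x2 = cauchy(1.0, 0.5, n, rng)
y = 1.0 / x1 + x2

# 1/X1 ~ Cauchy(0, 1/3); the independent sum is Cauchy(1, 1/3 + 1/2) = Cauchy(1, 5/6)
q25, q50, q75 = np.quantile(y, [0.25, 0.5, 0.75])
assert abs(q50 - 1.0) < 0.05            # median of Cauchy(1, 5/6) is 1
assert abs((q75 - q25) - 5 / 3) < 0.1   # IQR of a Cauchy is twice its scale
```

Quantiles, rather than moments, are used deliberately: the sample mean of Cauchy data does not converge, which is the very point of the test case.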
The previous test case is based on Cauchy-distributed random variables, which have a mode and a median, but no mean. Thus, techniques that rely on the first moment do not converge. The test case may seem artificial, but it captures a problem that occurs with one of the most widely used test cases in sensitivity analysis, Level E. The Level E geosphere transport model has been widely studied in sensitivity analysis since Saltelli et al. (1999) and Saltelli and Tarantola (2002). The simulator computes the annual radiation dose to humans resulting from leakage from a hypothetical underground disposal site for nuclear waste, over a time horizon spanning from 2 × 10^4 to 2 × 10^9 years into the future. We analyze the output at timestep t = 300,000. The simulator features 12 uncertain input parameters, whose distributions were assigned based on expert opinion in OECD (1989) and have been used consistently in all subsequent studies. We refer to Saltelli and Tarantola (2002) for the description. Issues with this model can be spotted from the slow convergence of the first-order effects. Moreover, the ranking of the first two parameters is not consistent over all measures. A neat separation of the sensitivity measures of the minor contributors is only possible for large sample sizes, n ≥ 8,192 (Fig. 6).
As the output spans orders of magnitude, the analyst usually employs an output transformation. We therefore apply a logarithmic transformation of the output. Zero values and stray negative values are mapped to the smallest positive number representable in floating-point arithmetic before applying the transformation. As Fig. 7 shows, the numerical convergence issues are now resolved, but the transformation has changed the ranking of the inputs.
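In code, the transformation amounts to clipping before taking the logarithm. A minimal sketch, assuming the smallest positive normal double from NumPy's finfo (the paper does not specify whether subnormal numbers are meant):

```python
import numpy as np

def safe_log10(y):
    """Map zeros and stray negative values to the smallest positive
    normal double before taking logs, as described for the Level E output."""
    tiny = np.finfo(float).tiny   # smallest positive normal float64
    return np.log10(np.where(y > 0, y, tiny))

y = np.array([1e-30, 0.0, -1e-12, 1e5])
t = safe_log10(y)
assert np.all(np.isfinite(t))     # no -inf or nan after the transformation
```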
Overall, the analysis shows that, for the dose at t = 300,000 years, the estimation of global sensitivity measures is challenging. The analysis with and without transformations, however, permits the identification of two key drivers of uncertainty. The stream flow rate X_12 is a stochastic variable whose properties cannot be influenced by technical design considerations. Thus, a risk manager knows that she/he cannot intervene on this parameter by changing the design.
The fact that this parameter is a key driver of uncertainty then means that additional information on the parameter will reduce uncertainty about the output, but will not be informative about possible improvements of the performance of the waste repository. The second important parameter, X_4, the water velocity in the first geosphere layer, refers to characteristics of the host rock formation. Thus, becoming more specific on this aspect of the repository design has the potential to reduce the problem uncertainty.

When Dimensionality Adds To Sparsity
Now, if sparsity of the output is paired with a large number of input parameters, then classical screening methods, for which the sample design depends on the number of inputs, are not applicable. Given-data techniques, however, may still extract information from a sample whose size is smaller than the number of dimensions. We consider the product of standard lognormals, Y = ∏_{i=1}^{10} X_i^{32} · ∏_{i=11}^{d} X_i, with X_i ∼ logN(0, 1) i.i.d. and d = 30,000. Additional tests were performed with d = 30, d = 300, and d = 3,000, but these offer qualitatively the same results and are not reported here. In order to deal with over- and underflow, infinite values in the simulator output have been replaced by the largest representable positive floating-point number and zeros by the smallest representable positive floating-point number. Fig. 8 shows the results of computing different sensitivity measures using alternative algorithms. Estimation methods not derived from cdf estimation do not work reliably on this data set. This is a consequence of the sparsity of the output. The log-transformed output is studied in Fig. 9. Note that the simulator output becomes linear after the transformation. As the sparsity, and therefore also the variance, of the model output is now notably reduced, all methods produce results.
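The test function and the over/underflow handling can be sketched as follows. We use a much smaller sample (n = 64, our choice, to keep memory modest) and compute the product in log space before exponentiating, so that over- and underflow show up as inf and 0 exactly as described:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 64, 30_000
X = rng.lognormal(mean=0.0, sigma=1.0, size=(n, d))

# Y = prod_{i<=10} X_i^32 * prod_{i>10} X_i, evaluated in log space first.
log_y = 32 * np.log(X[:, :10]).sum(axis=1) + np.log(X[:, 10:]).sum(axis=1)
with np.errstate(over="ignore", under="ignore"):
    y = np.exp(log_y)            # may overflow to inf or underflow to 0

# Replace inf by the largest and 0 by the smallest representable positive double.
fi = np.finfo(float)
y = np.where(np.isinf(y), fi.max, y)
y = np.where(y == 0.0, fi.tiny, y)
assert np.all(np.isfinite(y)) and np.all(y > 0)
```

Since log Y is a sum of some 30,000 standard normals (ten of them scaled by 32), its standard deviation is around 200, far beyond the ≈709 range of exp in double precision, which is exactly how the sparsity arises.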
Throughout, we have considered a sample of size n, with the intuition that this is the maximum sample size that the time/resource budget allows. However, Figs. 4-9 suggest an iterative version of the estimation procedure. One starts with a trial budget n_0 of model runs and then increases it to n_1, n_2, ..., monitoring the convergence of the estimates. Once the difference between two subsequent estimates is small enough, the analyst can stop the process. The application of an automated sequential approach is part of future research.
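The iterative scheme can be sketched with a budget-doubling loop; here estimate() is a stand-in for any given-data sensitivity estimator, and the tolerance and initial budget are illustrative choices of ours:

```python
import numpy as np

def estimate(sample):
    """Stand-in for any given-data sensitivity estimate (here: a mean)."""
    return float(np.mean(sample))

rng = np.random.default_rng(5)
pool = rng.normal(size=1 << 16)        # pretend: pre-generated model runs

n, prev, tol = 512, None, 1e-2         # trial budget n0 and stopping tolerance
while n <= len(pool):
    est = estimate(pool[:n])
    if prev is not None and abs(est - prev) < tol:
        break                          # two subsequent estimates agree: stop
    prev, n = est, 2 * n

assert n >= 1024                       # the budget was increased at least once
```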

CONCLUSIONS
The computation of global importance measures is an important part of quantitative risk assessment. However, such computation can be extremely challenging. This work contributes to improving the efficiency and lowering the computational cost associated with the estimation of global sensitivity measures.
First, the investigation contributes to raising awareness that estimation challenges do not come solely from dimensionality, but also from sparsity. Thus, our findings add to the risk analysis literature on the use of transformations originated by the work of Iman and Hora (1990). We have proposed and analyzed a new computation method that relies on rewriting the estimators of global sensitivity measures using the cdf of the output. It is based on the given-data principle, and thus it reduces the effect of the curse of dimensionality. Moreover, the new approach allows the manipulation of numbers on the [0, 1] scale. As such, it works seamlessly for the estimation of any global sensitivity measure based on the cdf of the output, such as β^{KS} and β^{Ku}. For density-based sensitivity measures, we have proposed a moving-average approach that bypasses kernel-density estimation. We have studied the resulting estimator, proving its convergence. We have also discussed the algorithmic cost of the new implementation and compared it to the algorithmic cost of dependence measures such as distance covariance and HSIC.
Because the new method allows the manipulation of numbers on the [0, 1] scale, it becomes a potential remedy to the curse of sparsity. We have tested this assertion through a series of experiments of increasing complexity, from the computationally friendly wing-weight simulator to a 30,000-input case study in which the model output variance reaches the numerical range of the floating-point representation of numbers in the computer. For each experiment, we have performed tests with and without transformations. Results show that the estimates converge at reasonable sample sizes for all the examples, even without the use of transformations.
There are some key recommendations for a risk analyst emerging from our investigation. First, sparsity may affect the performance of global sensitivity estimators as much as dimensionality and, in any case, sparsity acts independently of dimensionality. The use of transformations may not directly solve the curse of sparsity. Not only do transformations introduce interpretation issues, but a transformation may not be numerically effective, especially if the estimators rely on kernel smoothing. Moreover, sparsity does not impact estimators of alternative global sensitivity measures in the same way; a recommendation is, then, that the analyst employs an ensemble of global sensitivity measures. Relying on a single indicator may lead to unreliable conclusions due to numerical estimation issues. In this respect, the case studies have shown that employing the new method simultaneously with other global sensitivity measures allows the analyst to obtain solid insights on the key drivers of uncertainty while keeping computational burden under control.
Avenues for future research are the comparison, in high-dimensional settings, of the global estimators discussed in this work with screening techniques such as the method of Morris or sequential bifurcation, as well as the implementation of a sequential version of the method. Moreover, the addition of further cdf-based distances, such as the Cramér-von Mises distance, to the global sensitivity measure portfolio will also be pursued.

ACKNOWLEDGMENTS
We wish to thank the editors Prof. Tony Cox and Prof. Seth Guikema for their editorial efforts. We are also really grateful to the three anonymous reviewers for several perceptive and constructive suggestions.
Open access funding enabled and organized by Projekt DEAL.

APPENDIX A: PROOFS
Suppose that there are two functions y_0, y_1 : X_i → Y such that f_{Y|X_i}(y) − f_Y(y) = 0 at y = y_0(x_i) and y = y_1(x_i), and is nonvanishing for all other y. These zeros correspond to the minimum (≤ 0) and the maximum (≥ 0) of the cdf difference F_{Y|X_i} − F_Y.
Now recall Scheffé's theorem (Devroye & Györfi, 1985; Scheffé, 1947), which states that, given two probability density functions (pdfs) f_1 and f_2, ∫ |f_1(y) − f_2(y)| dy = 2 ∫_{f_1 > f_2} (f_1(y) − f_2(y)) dy. Hence, for the intervals B(x_i) = [min{y_0(x_i), y_1(x_i)}, max{y_0(x_i), y_1(x_i)}], by (A2), δ_i can be expressed through the cdf differences over B(x_i). If there is only one nonvanishing extremum for all x_i, then F_{Y|X_i} − F_Y contains no sign change and therefore β_i^{Ku} = β_i^{KS}. If g(·) is a function depending on its sole scalar argument x_i, then we have, for y_1 < g(x_i) < y_2, that Ku(x_i) ≥ |1 − F_Y(y_2) − (0 − F_Y(y_1))| → 1 as y_1 → y_2. If g(·) is monotonically increasing, then by transformation invariance we have β^{KS} = ∫_0^1 max{u, 1 − u} du = 3/4.

APPENDIX B: ESTIMATION OF BORGONOVO'S δ FROM CDFS: TECHNICAL DETAILS
By Scheffé's theorem (Devroye & Györfi, 1985; Scheffé, 1947), one obtains (B1), where B_m^+ denotes the set on which the conditional density exceeds the unconditional one. With a slight notational abuse, (B1) can also be written in terms of cdfs, yielding (B2). Equation (B2) has a geometric interpretation that can be used for the estimation of δ. Returning to Fig. 2, observe that, in association with (B2), it is equivalent to count (and sum) all the areas between the conditional and unconditional cdfs, or to count twice only the areas where the conditional cdf is greater (smaller) than the unconditional cdf. So, the problem boils down to determining the subset B_m^+ (Liu & Homma, 2009). This set is a union of intervals whose endpoints are critical points (local extrema) of the cdf difference (Borgonovo, Castaings, & Tarantola, 2011; Liu & Homma, 2009). Now, the subset B_m^+ has to be determined from the sample and the conditional subsample. For this task, however, F̂_{Y|X_i∈C_{m,i}} is not a good approximation of F_{Y|X_i∈C_{m,i}}, as the former is discontinuous (see Equation (17)), while the latter is continuous because of the absolute continuity of Y. We therefore suggest a simple form of data smoothing based on moving averages. For the size of the moving-average window, note that the chance of selecting an output realization that is contained in the subsample is, on average, 1/M (where M is the number of partitions, as above). Hence, one out of M realizations is from the subsample and is therefore responsible for a jump in F̂_{Y|X_i∈C_{m,i}}. Therefore, when applying a moving average to F̂_{Y|X_i∈C_{m,i}}, the minimal window size is equal to the partition size, as then a single jump contribution is balanced out. In our experience, a larger (±3M) moving window is better suited to capture the variability. For finite sample sizes, despite the smoothing, the determination of the extreme values approximating the set B_m^+ is still an error-prone process, as local extreme values appear and vanish depending on the smoothing parameters, and it is not clear whether they are numerical artifacts or ground truth.
Therefore, we consider the union of intervals B̂_m^+ determined by positive subsample runs; a negative run is defined in a similar way. The maximum distances between the cdfs within each of the runs are then combined to form the estimate of the δ measure, leading to Equation (18).

APPENDIX C: MOVING-AVERAGE ESTIMATOR CONSISTENCY
In this section, we provide greater detail on the convergence properties of the cdf-based estimators proposed here. Consider first that X_i is an absolutely continuous random variable on X_i ⊂ R. Then, for every point x_i^0 ∈ X_i there exists a sequence of intervals in the partitions such that {x_i^0} = ∩_n C_{m(n),i}, m ≤ M(n). The notation M(n) makes explicit the dependence of the partition classes on the sample size, and we assume that the function M(n) satisfies the assumptions of theorem 1 in Borgonovo et al. (2016). The estimation error can then be split into two terms by the triangle inequality. For the first term, the absolute continuity of X_i yields convergence to F_{Y|X_i=x_i^0} by partition refinement. For the second term, F̂_Y(y) → F_Y(y) by the law of large numbers, as also shown in the proof of theorem 1 of Borgonovo et al. (2016). Therefore, the estimators are asymptotically consistent. Now, consider the averaged version of the distance between the conditional and the unconditional empirical cdfs. For this estimator, we need the additional assumption that Y is also absolutely continuous. We have seen above that F̂_Y(y) → F_Y(y) and F̂_{Y|X_i} → F_{Y|X_i} as n increases. We also have that |j_n| ≤ 3M(n)/n → 0 as n → ∞. Then, we need to be reassured that F̂_Y^{−1}(F̂_Y(y) + j_n) → y as n increases, where F̂_Y^{−1} is the generalized inverse, i.e., F̂_Y^{−1}(u) = inf{y ∈ Y : F̂_Y(y) ≥ u}. For the term F̂_Y(y) + j_n we register F̂_Y(y) + j_n → F_Y(y) as n tends to infinity. Then, we need that F̂_Y^{−1} → F_Y^{−1}. For this, the absolute continuity of Y comes into play. In fact, if Y is absolutely continuous, then F_Y is differentiable and strictly increasing. Therefore, we always have a nonzero derivative at any point, and the quantile function is defined for any value of y. Under these conditions, the empirical quantile function tends to the true one (see van der Vaart, 2000, chapter 21).

APPENDIX D: DETAILS ON DISTANCE CORRELATION AND HILBERT-SCHMIDT INDEPENDENCE CRITERION (HSIC)
If Z is an ℓ-dimensional random vector and Y is a random variable, the distance covariance is defined by taking a weighted difference between (possibly multidimensional) characteristic functions, dcov(Z, Y)^2 = C ∫_{R^ℓ} ∫_R ||s||^{−1−ℓ} |t|^{−2} |E[e^{i s^T Z + i t Y}] − E[e^{i s^T Z}] E[e^{i t Y}]|^2 ds dt, where C is a normalizing constant related to the volume of a unit hyperball. The weights are chosen in such a way that the measure is invariant with respect to rotations of the ℓ-dimensional space of Z. Analogously to the standard correlation in (2), the distance correlation is given by d(Z, Y) = dcov(Z, Y) / √(dvar(Z) · dvar(Y)), where dvar(Z) = dcov(Z, Z).
Under the condition that the first moments of Z and Y are finite, dcov(Z, Y) = 0 if and only if Z and Y are independent; i.e., it satisfies the nullity-implies-independence property. The distance correlation d is also invariant under affine linear transformations of Z and Y. Estimators that avoid the use of characteristic functions work on the trace product of the difference matrices A = (||z_{i,·} − z_{j,·}||)_{i,j=1,...,n} and B = (|y_i − y_j|)_{i,j=1,...,n}, for realizations z_{i,·}, z_{j,·} ∈ R^ℓ and y_i, y_j ∈ R,

dcov^2(Z, Y) = 1/(n − 1)^2 · trace( A (I − n^{−1} 1 1^T) B (I − n^{−1} 1 1^T) ),   (D3)

where I is the n × n identity matrix, 1 is a vector of ones, and n is the sample size. In Székely et al. (2007), the entries of the distance matrices A and B are centered with respect to column, row, and overall averages; however, due to the properties of the trace product, (D3) needs only the row averages, n^{−1} A 1 and n^{−1} B 1. The HSIC replaces the difference matrices in (D3) with appropriate kernels from reproducing kernel Hilbert spaces, i.e., A^{HSIC}_{i,j} = K_Z(z_{i,·}, z_{j,·}) and B^{HSIC}_{i,j} = K_Y(y_i, y_j) (see Gretton, Bousquet, Smola, & Schölkopf, 2005, for further details). Then, analogously to (D3), HSIC can be computed via

HSIC = 1/(n − 1)^2 · trace( A^{HSIC} (I − n^{−1} 1 1^T) B^{HSIC} (I − n^{−1} 1 1^T) ).
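Equation (D3) translates directly into a few lines of code. The sketch below (function names are ours) uses the symmetric centering matrix H = I − (1/n) 1 1^T on both sides, which is equivalent to the row/column/grand-mean centering mentioned above, and clips tiny negative round-off before the square root:

```python
import numpy as np

def dcov2(z, y):
    """Squared sample distance covariance via the trace product (D3),
    for scalar z and y (pairwise absolute differences as distances)."""
    n = len(y)
    A = np.abs(z[:, None] - z[None, :])   # pairwise distance matrix of z
    B = np.abs(y[:, None] - y[None, :])
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix I - (1/n) 11^T
    return max(np.trace(A @ H @ B @ H) / (n - 1) ** 2, 0.0)

def dcor(z, y):
    """Distance correlation: dcov normalized by the distance variances."""
    den = np.sqrt(dcov2(z, z) * dcov2(y, y))
    return np.sqrt(dcov2(z, y) / den) if den > 0 else 0.0

rng = np.random.default_rng(6)
n = 300
x = rng.uniform(-1, 1, size=n)
noise = rng.uniform(-1, 1, size=n)
y = x**2                                  # uncorrelated with x, yet dependent

assert dcor(x, y) > dcor(noise, y)        # dcor detects nonlinear dependence
```

Swapping the raw distance matrices for kernel evaluations (e.g., a Gaussian kernel) turns the same trace product into the HSIC estimator; note the O(n^2) memory for A and B, which is the limitation discussed in the main text.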
For using distance correlation or HSIC as sensitivity measures, Z is given by a single input factor X_i or by a group of factors of interest, (X_{i_1}, ..., X_{i_ℓ}).