1 Introduction

Determining feature importance is a crucial task in machine learning and statistical investigations. In machine learning, it is an integral part of post-hoc explainability (Murdoch et al. 2019; Fisher et al. 2019), where it helps us understand the degree to which a model relies on the available features. This understanding has two main objectives (Genuer et al. 2010; Bénard et al. 2022). The first objective is dimensionality reduction, which involves screening out features that do not contribute to the model’s predictions. The second objective is identifying the features that are most important for further modeling efforts or data collection by domain experts.

Over the years, several feature importance measures have been developed to perform this task. On the one hand, in machine learning, a particularly important family is represented by Breiman’s permutation importance measures (Breiman 2001). Breiman originally defines them based on the notion of mean decrease accuracy (MDA) (Bénard et al. 2022). The intuition is as follows: A given machine learning model (e.g., a random forest) is fitted to a feature-target dataset, yielding a given predictive accuracy. The values of a specific feature are then permuted to break its relation to the target. The predictive accuracy is then reassessed for this perturbed dataset. The difference between the new (possibly degraded) and the original accuracies provides us with an indication of the importance of the feature. However, Bénard et al. (2022, Proposition 2) show that there is no consensus on the exact mathematical formulation of the mean decrease accuracy and prove that alternative software implementations yield different values. On the other hand, in statistics, a central role is played by measures of statistical association. Several indicators have been developed over the years: from the original Pearson linear correlation coefficient (Pearson 1895), to the new correlation coefficient of Chatterjee (2021). In this family, a significant role is played by the so-called total order sensitivity effects (Homma and Saltelli 1996)—total effects for short.

Total effects are defined as the difference between the variance of the target and the portion that remains after all features have been fixed with the exception of the feature of interest. Homma and Saltelli (1996) show that, when features are independent, the total effect of a given feature equals the sum of all terms in the ANOVA expansion associated with that feature. Also, we can calculate total effects as half the expected squared difference between the values of the model output at two points that differ only in the value of the feature of interest. The new point can be obtained by a simple permutation of the values of the features in the dataset. This formulation is known as Jansen’s estimation method (Jansen et al. 1994; Jansen 1999) and has inspired the so-called pick-and-freeze designs. In the class of pick-and-freeze estimators, the one proposed by Janon et al. (2014) is proven to be asymptotically efficient.

Hart and Gremaud (2018) show that even in the case of dependent features, total effects retain an interpretation from a relative error perspective. Under a squared loss function, a total index is the expected loss increase for approximating the input–output mapping with a function that does not contain the terms associated with the feature of interest. Also, Bénard et al. (2022) show that total effects are closely related to Breiman’s mean decrease accuracy. In particular, Bénard et al. (2022, Proposition 2) show that the different software implementations do not converge to the total effects but to a quantity whose bias increases with dependence, and is potentially amplified by interactions. They propose corrections so that the calculation of Breiman’s mean decrease accuracy indeed converges to a total effect when the machine learning model is a random forest.

The presence of statistical dependence complicates the calculation of total effects (we refer to Da Veiga et al. 2021, Ch. 5 for a thorough account). First, the interpretation in terms of the correspondence with the sum of terms in the ANOVA decomposition is lost. Second, the use of a Jansen-type estimator is no longer straightforward. In fact, while under independence the new points can be obtained with unrestricted permutations, the presence of dependencies challenges such a procedure. The problem is similar to the one signalled by Hooker et al. (2021) in the machine learning literature: unrestricted permutations may make the new points fall in regions that are far from where the data lie, forcing the machine learning model to extrapolate. These difficulties are compounded when features are not only dependent but also constrained on non-Cartesian supports. Constraints arise in applications when physical or business reasons require features to be located in certain regions. Here, an unrestricted permutation could lead to a feature that falls outside a given constraint, making the evaluation of the model not only at risk of extrapolation, but also meaningless. Furthermore, if constraints give rise to disconnected feature domains, they make the functional ANOVA expansion ill-defined (Owen and Prieur 2017). Kucherenko et al. (2017) propose a numerical approach based on the combination of rejection sampling and quadrature for the calculation of variance-based indices, with a focus on numerical aspects. However, a statistical analysis of possible estimators with constrained inputs is missing.

Our goal is to address the estimation of total effects with both dependent and constrained features, considering numerical as well as theoretical aspects. Here we assume, as typically encountered in sensitivity analysis, that the input distribution is known. In Sect. 6 we discuss alternatives for standard machine learning settings, where only a data sample is available to estimate feature importance.

We proceed as follows. First, we extend the estimator of Jansen (1999) to the case of dependent inputs. We show that it is still possible to estimate total effects under input dependence using a Jansen-like approach if the new value of the feature is obtained under conditional independence. We then propose a generalized winding stairs design based on the Knothe-Rosenblatt transform that can be used in association with a vast family of input dependencies. However, while this design conceptually pushes the boundaries of available methods for dependent inputs, it becomes impractical when inputs are constrained.

We then introduce a new estimator of total indices by applying a weighting factor (called density quotient) to the extended Jansen’s estimator. We show that the density quotient can be reinterpreted as a block-copula density, that vanishes when inputs are outside the constraints and that becomes unity when inputs are independent. We then formulate a U-statistic version of the estimator and obtain a central limit theorem. However, the new U-statistic estimator turns out to be computationally expensive as it requires the evaluation of the model at \(n^2\) points. We then propose an alternative estimator based on a single permutation that reduces the computational burden. We consider first the simplest estimator, with the permutation given by a one-shift in the coordinate of interest, and prove a central limit theorem for this estimator. We then extend the result to general derangements in the coordinate of interest. To further reduce the computational burden, we also introduce a nearest-neighbour estimator that makes the estimation cost independent of the number of features.

We derive analytical expressions for the estimators in the case of linear models and Gaussian inputs. We then challenge the estimators on test cases of increasing complexity, starting with Cartesian domains with dependent features, moving to connected non-Cartesian domains (e.g., circle and simplex), and then to disconnected non-Cartesian domains (such as non-overlapping triangles and Sierpinski gaskets). We also provide comparison tests with machine learning approaches and finally conclude with an application to a realistic simulator, the flood example of de Rocquigny (2006).

2 Total effects: bridging old and new

In this section, we review total effects from a fresh perspective. In Sect. 2.1, we discuss the covariance representation of total effects and establish a link with an early result by Fréchet. In Sect. 2.2, we highlight the role of conditional independence in the estimation of total effects with independent as well as dependent inputs. In Sect. 2.3, we propose a new estimator for total effects with dependent inputs, based on winding stairs and the Knothe-Rosenblatt transformation, for the case in which input dependence can be expressed via a Gaussian copula.

2.1 A Fréchet perspective

Let us consider a reference probability space \((\Omega ,{\mathcal {B}}(\Omega ), {{\,\textrm{Pr}\,}})\), where \({\mathcal {B}}(\Omega )\) is a Borel \(\sigma \)-algebra. Let also X, Y be random variables on \((\Omega ,{\mathcal {B}} (\Omega ),{{\,\textrm{Pr}\,}})\), with supports \({\mathcal {X}}\), \({\mathcal {Y}}\). We let \(X=(X_{1},X_{2},\dots ,X_{d})\) be a d-dimensional random vector in \({\mathbb {R}} ^{d}\), so that \({\mathcal {X}}\subseteq {{\mathbb {R}}}^{d}\) and consider a univariate Y, with \({\mathcal {Y}}\subseteq {{\mathbb {R}}}\). For the moment, we make the further assumption that the support of X is Cartesian, that is \( {\mathcal {X}}={\mathcal {X}}_{1}\times {\mathcal {X}}_{2}\times \dots \times {\mathcal {X}}_{d}\), where \({\mathcal {X}}_{i}\) is the support of \(X_{i}\), \( i=1,2,\dots ,d\). We denote the cumulative distribution function and probability density functions of X and Y by \(F_{X}(x)\), \(f_{X}(x)\) and \(F_{Y}(y)\), \(f_{Y}(y)\), respectively. For notational simplicity, we regard X and Y as continuous in the remainder. We suppose that Y has finite second moment, \({\mathbb {V}}[Y]<\infty \). Let \(u\subseteq [d]=\{1,\ldots ,d\}\), e.g., \(u=\{i_{1},i_{2},\ldots ,i_{k}\}\) with \(k\le d\). Let \(x_{u}\) correspond to the |u|-dimensional vector whose components are indexed by u, and \(x_{-u}\) the \((d-|u|)\)-dimensional vector whose components are indexed by the complement of u, \(-u=[d]\setminus u\). For a single factor i, we have \(u=\{i\}\) and we write the all-but-one set \( [d]\setminus \{i\}\) as \(-i\). The total effect of \(X_{u}\) is defined as (Homma and Saltelli 1996)

$$\begin{aligned} \tau _{u}={{\,\mathrm{\mathbb {E}}\,}}[{{\,\mathrm{\mathbb {V}}\,}}[Y|X_{-u}]]={{\,\mathrm{\mathbb {V}}\,}}[Y]-{{\,\mathrm{\mathbb {V}}\,}}[{{\,\mathrm{\mathbb {E}}\,}}[Y|X_{-u}]]. \end{aligned}$$
(1)

The literature has also introduced the normalized total effect as \( T_{u}=\tau _{u}/{{\,\mathrm{\mathbb {V}}\,}}[Y]\). Using an argument of Fréchet (1934), we find the following useful equalities.

Lemma 1

Let \(Y^{\prime }\) be a replicate of Y, conditionally independent given \(X_{-u}\), i.e., Y and \(Y'\) have the same distribution and satisfy \({{\,\textrm{Pr}\,}}(Y, Y^{\prime }|X_{-u})={{\,\textrm{Pr}\,}}(Y|X_{-u}){{\,\textrm{Pr}\,}}(Y^{\prime }|X_{-u})\). Then,

$$\begin{aligned} \tau _{u}={{\,\mathrm{\mathbb {V}}\,}}[Y]-{{\,\textrm{cov}\,}}(Y,Y^{\prime })=\tfrac{1}{2}{{\,\mathrm{\mathbb {E}}\,}}\left[ \left( Y-Y^{\prime }\right) ^{2}\right] . \end{aligned}$$
(2)

Proof

The proof is postponed to “Appendix A.1”. \(\square \)

The first equality in (2) substitutes the possibly high-dimensional nonlinear regression \({{\,\mathrm{\mathbb {E}}\,}}[Y|X_{-u}]\) in (1) with a covariance operation. When replacing the regression surface by \(Y^{\prime }\), the error term \(Y^{\prime }-{{\,\mathrm{\mathbb {E}}\,}}[Y|X_{-u}]\) is uncorrelated with Y, because \({{\,\textrm{cov}\,}}(Y,Y^{\prime }-{{\,\mathrm{\mathbb {E}}\,}}[Y|X_{-u}])=0\). The second equality can be interpreted as a generalization of Jansen’s equality for total effects (Jansen et al. 1994; Jansen 1999) that does not require feature independence.

In simulation and machine learning, Y is often a function of X, \(Y=g(X)\), \(g:{\mathcal {X}}\rightarrow {\mathbb {R}}\). Suppose that g is square integrable, and that it can be decomposed as

$$\begin{aligned} g(x)=g_{0}+\sum _{u\in 2^{[d]},u\ne \emptyset }g_{u}(x_{u}), \end{aligned}$$
(3)

with \(2^{[d]}\) the power set of [d], \(g_{0}={{\,\mathrm{\mathbb {E}}\,}}[Y]\), and \(g_{u}(x_{u})= {{\,\mathrm{\mathbb {E}}\,}}[Y|X_u=x_u]-\sum _{v\subset u}g_{v}(x_{v})\). Under input independence, we can expand \({{\,\mathrm{\mathbb {V}}\,}}[Y]\) via the well-known functional ANOVA decomposition (Efron and Stein 1981; Sobol’ 1993; Oakley and O’Hagan 2004; Sun et al. 2021)

$$\begin{aligned} {{\,\mathrm{\mathbb {V}}\,}}[Y]=\sum _{u\in 2^{[d]},u\ne \emptyset }{{\,\mathrm{\mathbb {V}}\,}}[g_u(X_u)] \end{aligned}$$
(4)

with \({{\,\mathrm{\mathbb {V}}\,}}[g_u(X_u)]\) the variance of \(g_{u}(X_{u})\) in (3). In Homma and Saltelli (1996) and Saltelli and Tarantola (2002), the total effect of input \(X_{j}\) is defined as the sum of all terms on the right-hand side of (4) that contain index j, and it is shown that

$$\begin{aligned} \begin{aligned} \tau _{u}&={{\,\mathrm{\mathbb {E}}\,}}[{{\,\mathrm{\mathbb {V}}\,}}[Y|X_{-u}]]={{\,\mathrm{\mathbb {V}}\,}}[Y]-{{\,\mathrm{\mathbb {V}}\,}}[{{\,\mathrm{\mathbb {E}}\,}}[Y|X_{-u}]]\\ {}&=\sum _{v\in 2^{[d]},v\cap u\ne \emptyset }{{\,\mathrm{\mathbb {V}}\,}}[g_v(X_v)]. \end{aligned} \end{aligned}$$
(5)

However, this identity does not hold if features are statistically dependent. Under dependence, \(\tau _{u}\) remains defined as in (1) and enjoys an interpretation in terms of the \(L^{2}\) approximation error, as established in Hart and Gremaud (2018). In an argument similar to that of Rabitz and Alış (1999), Hart and Gremaud (2018) observe that the space \( L^{2}\) can be decomposed into a direct sum \(L^{2}({\mathcal {X}})=M_{-u}\oplus M_{-u}^{\perp }\), where \(M_{-u}\) contains all \(L^{2}\) functions which solely depend on \(x_{-u}\) and \(M_{-u}^{\perp }\) is its orthogonal complement. In general, for dependent features, \(M_{-u}^{\perp }\ne M_{u}\). Then we can write \(g(x)=g_{0}+g_{-u}(x_{-u})+g_{-u}^{\perp }(x)\) with \(g_{-u}\in M_{-u}\) and \(g_{-u}^{\perp }\in M_{-u}^{\perp }\). If we ask how accurately \(g(x)-g_{0}\) can be approximated without the features in \(x_{u}\), then the answer is

$$\begin{aligned} \begin{aligned} \left\Vert g-g_0-g_{-u}\right\Vert _{L^{2}}^{2}=\left\Vert g_{-u}^{\perp }\right\Vert _{L^{2}}^{2} \\ = \left\Vert g-g_0\right\Vert _{L^{2}}^{2}-\left\Vert g_{-u}\right\Vert _{L^{2}}^{2}. \end{aligned} \end{aligned}$$
(6)

If we consider the \(L^{2}\) norm weighted with the density of X, then we regain (1) from (6), because \(g_{-u}(x_{-u})= {{\,\mathrm{\mathbb {E}}\,}}[g(X)-g_0| X_{-u}=x_{-u}]\).

Conditional independence plays a central role in deriving (2): it is this property that enables one to replace \({{\,\mathrm{\mathbb {E}}\,}}[Y|X_{-u}]\) by \(Y^\prime \) in (2). We show that it also plays a central role in estimating \(\tau _u\) under input dependence via winding stairs and pick-and-freeze designs.

2.2 Conditional independence and total effect estimation

The gold standard for obtaining estimates for total effects under input independence is the Sobol’ method, i.e., a pick-and-freeze design paired with Jansen’s estimator (Jansen 1999). Let X, \(X'\) be d-dimensional input vectors. For \(u \subseteq [d] \), we use the notation \(X'_u:X_{-u}\) to denote the d-dimensional vector whose components indexed by u are taken from \(X'\) and whose components indexed by \(-u\) are taken from X. Now, let \(X^\prime _u\) be a replicate of \(X_u\), independent of \(X_u\) conditionally on \(X_{-u}\). Then, \((X^\prime _u:X_{-u})\) and X are identically distributed and \(Y=g(X)\) and \(Y'=g(X^\prime _u:X_{-u})\) are identically distributed and conditionally independent given \(X_{-u}\). The second equality in Eq. (2) can then be rewritten as

$$\begin{aligned} \tau _{u} =\tfrac{1}{2}{{\,\mathrm{\mathbb {E}}\,}}\left[ \left( g(X)-g(X^\prime _u:X_{-u})\right) ^2\right] . \end{aligned}$$
(7)

By Lemma 1, Eq. (7) is true even under feature dependence. However, independence makes the design of an estimator for \(\tau _u\) straightforward. One generates two independent samples of size n from the input distribution. Let us denote them by \(X^A=(X^{A,i})_{i=1,\dots ,n}\) and \(X^B=(X^{B,i})_{i=1,\dots ,n}\). The columns of these sample matrix blocks are recombined, copying input realizations for factor(s) \(j\in u\) from the second sample (B) into the first sample (A) to form pick-and-freeze input sample blocks \(X^B_u:X^A_{-u}\). The model is then evaluated to obtain the output samples \(Y^A=g(X^A)\) and \(Y_u^{BA}=g(X^B_u:X^A_{- u})\). Combining them via Jansen’s equality, we obtain the estimator

$$\begin{aligned} \begin{aligned} \widehat{\tau }^{PF }_{u}&=\dfrac{1}{2n}\sum _{i=1}^n \left( g(X^{A,i})-g(X^{B,i}_{u}:X^{A,i}_{-u})\right) ^2 \\ {}&=\dfrac{1}{2n}\sum _{i=1}^n \left( Y^{A,i}-Y_u^{BA,i}\right) ^2. \end{aligned} \end{aligned}$$
(8)

After the introduction of this design in Sobol’ (1993) and Homma and Saltelli (1996), works such as Saltelli et al. (2000), Sobol’ et al. (2007), Gatelli et al. (2009), Gamboa et al. (2016) and Prieur and Tarantola (2017) have developed it further, refining several aspects. Most of these works rely on the independence assumption, while we remove it in the remainder of this section.
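For concreteness, the following is a minimal sketch of the pick-and-freeze/Jansen estimator (8) under input independence, written in Python/NumPy (the numerical experiments in Sect. 8 rely on a separate MatLab implementation); the toy model g, the standard normal marginals and the sample size are illustrative choices and not part of the original design.

```python
# A minimal sketch of the pick-and-freeze/Jansen estimator (8) under input
# independence; the toy model g, the standard normal marginals and the
# sample size are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 10_000

def g(x):                                  # toy model with an interaction term
    return x[:, 0] + 2.0 * x[:, 1] + x[:, 0] * x[:, 2]

XA = rng.standard_normal((n, d))           # sample block A
XB = rng.standard_normal((n, d))           # independent sample block B
YA = g(XA)

tau_hat = np.empty(d)
for j in range(d):
    X_mix = XA.copy()
    X_mix[:, j] = XB[:, j]                 # pick X_j from B, freeze X_{-j} from A
    tau_hat[j] = 0.5 * np.mean((YA - g(X_mix)) ** 2)   # Jansen's estimator (8)

print(tau_hat, tau_hat / YA.var())         # unnormalized and normalized total effects
```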

Under conditional independence, not only can the conditional probability measure be written in product form, but the joint density also factors into the product of two terms; see “Appendix A.1” for further details. As a direct consequence for Jansen’s estimator, when considering the joint distribution of \(X_u\), \(X'_u\) and \(X_{-u}\), we obtain the following result.

Proposition 2

Let \(X^\prime \) be a replicate of X, conditionally independent given \(X_{-u}\). Let \(Y=g(X_u:X_{-u})\) and \(Y'=g(X'_u:X_{-u})\); then Y is a replicate of \(Y'\), conditionally independent given \(X_{-u}\). Written in density terms, we find two interchangeable representations of the total effect,

$$\begin{aligned} \begin{aligned} \tau _u=&\frac{1}{2} \int _{\mathbb {R}^{d+|u|}} \left( g(x'_u:x_{-u})-g(x_u:x_{-u})\right) ^2 \cdot \\ {}&f_{{u}|{-u}}(x_{u}|x_{-u}) f(x'_{u}:x_{-u}) dx'_{u} dx_{u} dx_{-u}\\ =&\frac{1}{2} \int _{\mathbb {R}^{d+|u|}} \left( g(x_u:x_{-u})-g(x'_u:x_{-u})\right) ^2 \cdot \\ {}&f_{{u}|{-u}}(x'_{u}|x_{-u}) f(x_{u}:x_{-u}) dx'_{u} dx_{u} dx_{-u}. \end{aligned} \end{aligned}$$
(9)

The second term in (9) is the numerator of Eq. (2.11) in (Kucherenko et al. 2012, p. 939). Equation (9) is an essential ingredient for the estimation of total Sobol’ indices under feature dependence. In the next section, we exploit Equation (9) to create a generalized design. In Sect. 3, we use it for the definition of total effect estimators in the presence of constrained (i.e., non-Cartesian) input domains.

2.3 Winding stairs for dependent inputs with Gaussian Copula

We propose a new estimation strategy that combines Proposition 2 with the Knothe–Rosenblatt transformation (Knothe 1957; Rosenblatt 1952). This transformation is employed in the works of Mara and Tarantola (2012), Mara et al. (2015) and Li and Rabitz (2017) in association with the challenging task of computing variance-based sensitivity indices in the presence of dependent features. Formally, the Knothe–Rosenblatt transformation implies the following equations:

$$\begin{aligned} \begin{aligned} U_1&=F_1(X_1), \\ U_2&=F_{2|1}(X_2|X_1), \\ {}&\dots \\ U_d&=F_{d|1,\dots ,d-1}(X_d|X_1,\dots ,X_{d-1}), \end{aligned} \end{aligned}$$
(10)

where \(F_1(X_1)\) is the marginal cumulative distribution function of \(X_1\), \(F_{2|1}(X_2|X_1)\) the conditional distribution function of \(X_2\) given \(X_1\), and so on. The transformation maps the dependent set of features \({\textbf{X}}\) into a set of variables U that are independent and uniformly distributed in the unit hypercube. Considering the mapping from U to Y, one can then apply the theory and algorithms of the functional ANOVA expansion under independence and calculate global sensitivity indices on the independent coordinates in U. However, the transformation has two main drawbacks. First, it is not unique: there are as many transformations as there are orderings of the features, and the values of the global sensitivity measures and the corresponding ranking depend on the chosen order. Second, the results hold for the transformed variables U, and reinterpreting them back on the original features X is not straightforward.

We propose a way of using the Knothe–Rosenblatt transformation for calculating total indices that avoids the dependence of the ranking on the feature ordering and allows us to remain within the original feature space. The key is to combine two facts. The first is that, by Proposition 2, the total effect \(\tau _j\) is associated with the conditional density \(f_{j|-j}\). The second is that this density coincides with the one appearing in the last line of (10). Then, if this last conditional distribution is available from the transformation, we can draw realizations of an independent standard uniform random variable and apply the inverse transformation \(t_j: u \mapsto x_j=F^{-1}_{j|-j}(u|X_{-j}=x_{-j})\). This transformation is, indeed, the inverse of the last line in (10) (up to a re-ordering of the input variables): we obtain an \(X_j\) that is conditional on all the remaining features.

We exploit this fact to introduce a generalized winding stairs total effect estimator for the case of dependent features. The term winding stairs originates with Jansen et al. (1994); see also Chan et al. (2000) and Owen and Hoyt (2021) for further reviews. Assume that \(X^{(0)}\) is a random copy of X, and U is a random vector of d independent standard uniformly distributed random variables, independent of \(X^{(0)}\). In the classical winding stairs design, under independence, the \(j\text {th}\) column in the feature sample matrix is replaced by an independent copy of \(X_j\). Under dependence, using Lemma 1, we can cyclically replace the \(j\text {th}\) entry in the input vector with a conditionally independent one. A way to obtain this conditionally independent sample is by a Knothe–Rosenblatt transformation of the following form: in the \(j\text {th}\) step, the \(j\text {th}\) component of \(X^{(j-1)}\) is altered via

$$\begin{aligned} X^{(j)}_\ell = {\left\{ \begin{array}{ll} F^{-1}_{j|-j}\left( U_j\Bigg |X_{-j}^{(j-1)}\right) , &{}\text {for } \ell =j, \\ X_\ell ^{(j-1)}, &{}\text {otherwise}. \end{array}\right. } \end{aligned}$$
(11)

for \(\ell ,j=1,2,\dots ,d\). When sampling, we obtain blocks of the type \(X^{(j)}=(X^{(j)}_j:X^{(j-1)}_{-j})\) for \(j=1,\dots ,d\). By construction, \(Y^{(j)}=g(X^{(j)})\) is a replicate of \(Y^{(j-1)}\) conditionally independent given \(X_{-j}\). By Lemma 1, we obtain the following winding stairs total effect expression:

$$\begin{aligned} \tau _j^\text {WS}=\frac{1}{2}{{\,\mathrm{\mathbb {E}}\,}}\left[ \left( Y^{(j)}-Y^{(j-1)}\right) ^2\right] , \quad j\in [d]. \end{aligned}$$
(12)

The associated estimator is a variant of (8). As observed in Goda (2021), the winding stairs estimator is a sample average, so that the empirical variance of \(\frac{1}{2}\left( Y^{(j)}-Y^{(j-1)}\right) ^2\) can be used to approximate the variance of the estimator \({\hat{\tau }}_j^\text {WS}\) of (12).

The computational cost associated with the calculation of a global sensitivity measure is expressed in terms of the required number of model evaluations. For a winding stairs design, the cost is \(n(d+1)\), where n is the sample size and d is the input dimensionality (see the first row in Table 1, which reports the computational cost of all the estimators discussed in this work). The cost corresponds to the fact that for each of the n sampled locations we vary the inputs one-at-a-time.

Table 1 Computational costs for the estimation of all d total effects with constrained inputs, block sample size n

In general, closed-form expressions for the Rosenblatt transformation are unavailable. However, analytical formulas exist when the input dependence is modeled via Gaussian copulas. Specifically, let \(\Psi _j\), \(j=1,\dots ,d\) be transformations from the marginal distribution of each input into the standard normal distribution and let \(Z_j\) be a standard normal random variable independent of X. Then there exist linear combinations such that the random vectors \( \begin{bmatrix}\Psi _1(X_1)&\dots&\Psi _d(X_d)\end{bmatrix}\) and

$$\begin{aligned} \begin{bmatrix} \Psi _1(X_1)&\!\dots \!&\gamma _j^{(j)}Z_j\!+\!\sum \limits _{\ell \ne j}\gamma _\ell ^{(j)}\Psi _\ell (X_\ell )&\!\dots \!&\Psi _d(X_d) \end{bmatrix} \end{aligned}$$
(13)

are identically \({\mathcal {N}}(0,\Sigma )\) distributed and conditionally independent given \(X_{-j}\). These linear combinations can be extracted from a Cholesky decomposition of a reordered covariance matrix where the \(j\text {th}\) row/column is moved to the last position (see the proof of Theorem 14 in “Appendix A.5” for the computation of the coefficients \(\gamma _\ell ^{(j)}\) in the linear combination in (13)). Hence (11) becomes

$$\begin{aligned} \begin{aligned} X^{(j)}_j&\!=\! \Psi ^{-1}_j\!\left( \!\gamma _j^{(j)}Z_j\!+\!\sum _{\ell \ne j}\gamma _\ell ^{(j)}\Psi _\ell \left( X^{(j-1)}_\ell \right) \! \right) \\ {}&=t_j(\Phi (Z_j)|X_{-j}^{(j-1)}), \end{aligned} \end{aligned}$$
(14)

where \(\Phi \) is the standard normal cumulative distribution function. As noted above, the computational cost of this design remains \(n(d+1)\) model evaluations.
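To make the construction concrete, the following is a minimal sketch of the generalized winding stairs design (11)–(12) for jointly Gaussian inputs. The conditional draw of \(X_j\) given \(X_{-j}\) is computed from the standard Gaussian conditioning formulas (cf. (32)), which is equivalent to using the Cholesky-based coefficients of (14); the covariance matrix, the toy model and the sample size are illustrative assumptions.

```python
# A minimal sketch of the generalized winding stairs design (11)-(12) for
# jointly Gaussian inputs; the conditional draw of X_j given X_{-j} uses the
# standard Gaussian conditioning formulas. Covariance matrix, toy model and
# sample size are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.6],
                  [0.0, 1.6, 4.0]])            # sigma = 2, rho = 0.8
d = 3
g = lambda x: float(np.sum(x))                 # toy linear model Y = X1 + X2 + X3

def conditional_draw(x, j):
    """Draw X_j from its conditional Gaussian distribution given X_{-j} = x_{-j}."""
    idx = [k for k in range(d) if k != j]
    S_inv = np.linalg.inv(Sigma[np.ix_(idx, idx)])
    m = mu[j] + Sigma[j, idx] @ S_inv @ (x[idx] - mu[idx])
    v = Sigma[j, j] - Sigma[j, idx] @ S_inv @ Sigma[idx, j]
    return m + np.sqrt(v) * rng.standard_normal()

n = 5000
X0 = rng.multivariate_normal(mu, Sigma, size=n)
tau_ws = np.zeros(d)
for i in range(n):
    x, y_prev = X0[i].copy(), g(X0[i])
    for j in range(d):                         # one-at-a-time conditional replacements
        x[j] = conditional_draw(x, j)
        y_new = g(x)
        tau_ws[j] += 0.5 * (y_new - y_prev) ** 2   # Jansen-type term of Eq. (12)
        y_prev = y_new
tau_ws /= n
print(tau_ws)                                  # analytical values: 1.00, 0.36, 1.44
```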

This design generalizes the winding stairs approach to dependence structures that include the broad family of Gaussian copulas. However, for cases in which the Knothe–Rosenblatt transformation is not available, the generalized winding stairs design becomes impractical. This happens as soon as the features do not live on a Cartesian support. In the next sections, we therefore introduce alternative approaches that generalize Lemma 1 to more complex dependence structures.

3 Total effects under feature dependence via reweighting

Our purpose in this section is to introduce an estimation strategy that allows us to relax the traditional condition of a Cartesian domain, that is, we allow for \({\mathcal {X}}\ne {\mathcal {X}}_1\times {\mathcal {X}}_2\times \dots \times {\mathcal {X}}_d\). We assume that the features are distributed with a joint density f such that \(f(x)>0\) if \(x\in {\mathcal {X}}\) and \(f(x)=0\) if \(x\notin {\mathcal {X}}\). In order to use an estimation strategy with a classical pick-and-freeze design, we start with the following definition.

Definition 3

Let \(X'\) be an independent copy of X. We call the function

$$\begin{aligned} \iota _u(X',X)=\frac{f(X'_{u}:X_{-u})}{f_{u}(X'_{u})f_{-u}(X_{-u})} \end{aligned}$$
(15)

the density quotient of X on \({\mathcal {X}}\) for the feature list u.

By Proposition 2, the density quotient in (15) satisfies

$$\begin{aligned} \begin{aligned} \iota _{u}(X',X)&=\dfrac{f_{u|-u}(X'_{u}|X_{-u})}{f_u(X'_{u})} =\dfrac{f_{-u|u}(X_{-u}|X'_{u})}{f_{-u}(X_{-u})}\\ {}&= \iota _{-u}(X,X'). \end{aligned} \end{aligned}$$

To illustrate, for a Cartesian domain and independent inputs, we have \(\iota _{u}(X',X)=1\). Also, we have a compact expression for the case in which the dependence between two features can be expressed via a Gaussian copula.

Example 4

Under a bivariate Gaussian copula, the density quotient for pairwise dependence can be obtained in a compact form as follows. Let \(X_i\) and \(X_j\) be two random variables with a rank correlation of \(\varrho \). Setting \(u_i=F_i(x_i)\) and \(u_j=F_j(x_j)\), Joe (2014, Sect. 4.3.1) derives the density of the bivariate Gaussian copula as

$$\begin{aligned} \iota _i(u_i,u_j;\varrho )=\frac{\phi (\Phi ^{-1}(u_i),\Phi ^{-1}(u_j);\varrho )}{\phi (\Phi ^{-1}(u_i))\phi (\Phi ^{-1}(u_j))}\nonumber \\ = \frac{1}{ \sqrt{1-\varrho ^2}}e^{\frac{-\varrho }{2(1-\varrho ^2)} \left( \varrho (z_i^2+z_j^2)-2z_iz_j\right) }, \end{aligned}$$
(16)

where one uses the transformation \(z_i=\Phi ^{-1}(u_i)=\Phi ^{-1}(F_i(x_i))\). The right hand side in (16) is the density quotient for a bivariate Gaussian copula.
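A minimal sketch of (16) is given below; the evaluation points and the correlation value are illustrative.

```python
# A minimal sketch of the bivariate Gaussian copula density quotient (16);
# the evaluation points and the correlation value are illustrative.
import numpy as np
from scipy.stats import norm

def density_quotient(u_i, u_j, rho):
    """iota_i(u_i, u_j; rho) of Eq. (16), with u = F(x) the marginal ranks."""
    z_i, z_j = norm.ppf(u_i), norm.ppf(u_j)
    return (np.exp(-rho / (2.0 * (1.0 - rho**2)) * (rho * (z_i**2 + z_j**2) - 2.0 * z_i * z_j))
            / np.sqrt(1.0 - rho**2))

print(density_quotient(0.3, 0.8, 0.5))   # discordant pair: quotient below 1
print(density_quotient(0.3, 0.8, 0.0))   # independence: quotient equals 1
```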

Proposition 5

Let X and \(X'\) be i.i.d. random vectors. The following equality holds:

$$\begin{aligned} \tau _u=\tfrac{1}{2}{{\,\mathrm{\mathbb {E}}\,}}\left[ \iota _u(X', X) \left( g(X_u:X_{-u})-g(X'_u:X_{-u})\right) ^2\right] . \end{aligned}$$
(17)

Given n independent copies \(X^i\) of X, \(i=1,2,\dots ,n\), an unbiased estimator of \(\tau _u\) is

$$\begin{aligned} \widehat{\tau }_{u,n}^U = \frac{1}{2n(n-1)}\sum _{i=1}^n \sum _{j\ne i} \frac{f(X^j_{u}:X^i_{-u})}{f_{u}(X^j_{u})f_{-u}(X^i_{-u})} \cdot \nonumber \\ \left( g(X^i_{u}:X^i_{-u}) - g(X^j_{u}:X^i_{-u})\right) ^2. \end{aligned}$$
(18)

Proof

See “Appendix A.2” for the proofs of all results stated in this section. \(\square \)

Equation (18) combines a brute-force double-loop sample design with Jansen’s estimator, modified by an importance-sampling weight. Jansen’s approach would yield an estimator based on a pick-and-freeze design. However, a pure pick-and-freeze sample would imply generating the data from the product of their marginal distributions, ignoring the probabilistic dependence generated by the presence of constraints. The density quotient in Proposition 5 introduces a correction factor that allows one to generate the data from the (correct) joint distribution. Moreover, the density quotient allows us to consider non-Cartesian input domains, as it vanishes for points outside the region where the inputs are defined. If features are independent, we regain Jansen’s classical estimator, because the density quotient is identically equal to one.
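For illustration, the following minimal sketch implements the mix-and-reweight estimator (18) for two inputs whose dependence is described by a bivariate Gaussian copula with standard normal marginals; the toy model, the correlation value and the sample size are illustrative assumptions.

```python
# A minimal sketch of the mix-and-reweight U-statistic estimator (18) for
# d = 2 jointly Gaussian inputs; model, correlation and sample size are
# illustrative choices.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(2)
rho = 0.6
Sigma = np.array([[1.0, rho], [rho, 1.0]])
joint = multivariate_normal(mean=[0.0, 0.0], cov=Sigma)

def g(x):                      # toy model
    return x[..., 0] + x[..., 1] ** 2

def iota(x_mix):
    """Density quotient (15): joint density over the product of the marginals."""
    return joint.pdf(x_mix) / (norm.pdf(x_mix[..., 0]) * norm.pdf(x_mix[..., 1]))

n = 300
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)
Y = g(X)

tau_hat = np.zeros(2)
for j in range(2):
    acc = 0.0
    for i in range(n):
        x_mix = np.tile(X[i], (n, 1))        # freeze X_{-j} from observation i
        x_mix[:, j] = X[:, j]                # mix in X_j from all other observations
        w = iota(x_mix)
        acc += np.sum(np.delete(w * (Y[i] - g(x_mix)) ** 2, i))   # drop the i = j term
    tau_hat[j] = acc / (2 * n * (n - 1))
print(tau_hat)
```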

Lemma 6

The mix-and-reweight estimator (18) of Proposition 5 is a U-statistic of order 2.

We call the design associated with Proposition 5 mix-and-reweight. This design is associated with a computational cost of \(dn^2-nd+n\) evaluations (Table 1, second row). The cost depends quadratically on n, while the cost of the winding stairs design depends linearly on it. We can derive a central limit theorem for the mix-and-reweight estimator from the general theory of U-statistics (Hoeffding 1948).

Lemma 7

Let \(\delta _1={{\,\mathrm{\mathbb {V}}\,}}[{{\,\mathrm{\mathbb {E}}\,}}[\Phi ^s(W_1,W_2)|W_1]]\) and \(\delta _2={{\,\mathrm{\mathbb {V}}\,}}[\Phi ^s(W_1,W_2)]\) with \(\Phi ^s\) defined by (51) and (52). Assume that \(\delta _2< + \infty \). Then

$$\begin{aligned} {{\,\mathrm{\mathbb {V}}\,}}[\widehat{\tau }_{u,n}^U]= \frac{2}{n(n-1)}\left( 2(n-2)\delta _1 + \delta _2\right) =\frac{4}{n}\delta _1 +O(n^{-2}). \end{aligned}$$
(19)

If \(\delta _1\ne 0\), then the U-statistic is non-degenerate and \(\sqrt{n}(\widehat{\tau }_{u,n}^U-\tau _u)\rightarrow {\mathcal {N}}(0,4 \, \delta _1)\).

Equation (19) in Lemma 7 provides us with the asymptotic variance of the mix-and-reweight estimator. Empirically, one way to calculate this variance is to make use of the plug-in jackknife estimator of Helmers (1991) (see also Bose and Chatterjee 2018, p. 106), defined as

$$\begin{aligned} \widehat{{{\,\mathrm{\mathbb {V}}\,}}}[\widehat{\tau }_{u,n}^U]= \frac{4}{n}\frac{n-1}{(n-2)^2} \sum \limits _{i=1}^n \left( \frac{1}{n-1}\sum \limits _{j\ne i} \Phi ^s(X_u^i,X_{-u}^j)- \widehat{\tau }_{u,n}^U\right) ^2. \end{aligned}$$
(20)

Alternatively, we can derive a bootstrap distribution of the estimator \(\widehat{\tau }_{u,n}^U\) from the sample of \(\Phi ^s(X_u^i,X_{-u}^j)\).

There are earlier accounts of the use of reweighting techniques in sensitivity analysis. Let us mention the estimation of first-order and total Sobol’ indices in Sparkman et al. (2016), in which an already available sample of simulations is reweighted with importance sampling. Moreover, in Badea and Bolado (2008, Sect. 5.4), the authors discuss reweighting and rejection techniques to measure the potential impact of small changes in the input probability distribution on the output mean. In Kucherenko et al. (2017), a rejection technique to handle non-Cartesian input domains is implemented.

4 Derangement and shift estimators

The mix-and-reweight estimator of Lemma 7 possesses the clear theoretical advantages associated with the notion of U-statistics. However, the associated estimation cost may turn into a notable disadvantage in practical applications. To reduce the cost for estimating \(\tau _u\) under input constraints, we propose two new estimators based on derangements and shifts. The intuition here is to take the difference between a given realization and another one (for instance randomly picked) instead of taking the differences against all other realizations.

We proceed as follows. Given a sample of size n, we introduce the cyclic shift-by-one of \(\{1,\ldots , n\}\) defined by \(s_n(i)=i+1\) for \(i<n\), and \(s_n(n)=1\). We also define the acyclic shift-by-one by \(s_{n-1}^a(i)=i+1\) for \(i<n-1\), and \(s_{n-1}^a(n-1)=n\). Here, \(s_{n-1}^a(\cdot )\) is a fixpoint-free map from \(\{1, \ldots , n-1\}\) to \(\{2, \ldots , n\}\). Based on this idea, we introduce the shift-and-reweight total effect estimator for input j given by

$$\begin{aligned} \begin{aligned} \widehat{\tau }_{j,n}^S&= \frac{1}{2n}\sum _{i=1}^n \iota _j(X^{s_n(i)},X^i) \cdot \\ {}&\left( g(X^{i}_{j}:X^{i}_{-j})-g(X^{s_n(i)}_{j}:X^{i}_{-j})\right) ^2. \end{aligned} \end{aligned}$$
(21)

In the next result, we prove a central limit theorem for this estimator.

Theorem 8

Let X be a sample of size n subject to the joint density f. Then, \( \widehat{\tau }_{j,n}^S\) is an unbiased estimator of \( \tau _{j}\). In addition, assume that \(\sigma _j^2={{\,\mathrm{\mathbb {V}}\,}}[V_j^1]+2{{\,\textrm{cov}\,}}(V_j^1,V_j^2)< + \infty \) with

$$\begin{aligned} V_j^i=\frac{1}{2} \iota _j(X^{s_n(i)},X^i) \left( g(X^{i}_{j}:X^{i}_{-j})-g(X^{s_n(i)}_{j}:X^{i}_{-j})\right) ^2. \end{aligned}$$

Then \(\widehat{\tau }_{j,n}^S\) defined by (21) satisfies the following central limit theorem:

$$\begin{aligned} \sqrt{n}\left( \widehat{\tau }_{j,n}^S-\tau _j\right) \xrightarrow [n\rightarrow + \infty ]{}{\mathcal {N}}(0,\sigma ^2_{j}). \end{aligned}$$
(22)

Proof

See “Appendix A.3” for the proofs of results stated in this section. \(\square \)

Theorem 8 holds for the unnormalized version of the total index estimator. For the normalized version, let us define the shift-and-reweight normalized total effect estimator as

$$\begin{aligned} {\widehat{T}}_{j,n}^S= \dfrac{\widehat{\tau }_{j,n}^S}{{\hat{S}}_{Y,n}}, \end{aligned}$$
(23)

where \({\hat{S}}_{Y,n}\) is the empirical n-sample variance of Y. A central limit theorem for the normalized estimator in (23) can be deduced using the so-called \(\delta \)-method.

Corollary 9

Let X be a sample of size n subject to the joint density f. Assume \({{\,\mathrm{\mathbb {V}}\,}}\left[ Y\right] <+ \infty \) and \(\sigma ^2_j < + \infty \), with \(\sigma _j^2\) defined in Theorem 8. Then we have:

$$\begin{aligned} \sqrt{n}\left( {\widehat{T}}_{j,n}^S-T_j\right) \xrightarrow [n\rightarrow + \infty ]{}{\mathcal {N}}\left( 0,{\sigma }_{norm ,j}^2\right) \end{aligned}$$

where \({\sigma }_{norm ,j}^2=\rho _j^T\Sigma _j \rho _j\) with \(\rho _j=\frac{1}{{{\,\mathrm{\mathbb {V}}\,}}[Y]}\left[ 1,2T_j{{\,\mathrm{\mathbb {E}}\,}}[Y],-T_j\right] ^T\) and

$$\begin{aligned} \Sigma _j = \begin{pmatrix} {{\,\mathrm{\mathbb {V}}\,}}[V_j^1] &{} {{\,\textrm{cov}\,}}(V_j^1,Y^1) &{} {{\,\textrm{cov}\,}}(V_j^1,(Y^1)^2)\\ *&{} {{\,\mathrm{\mathbb {V}}\,}}(Y^1) &{} {{\,\textrm{cov}\,}}(Y^1,(Y^1)^2)\\ *&{} *&{}{{\,\mathrm{\mathbb {V}}\,}}[(Y^1)^2] \end{pmatrix}\cdot \end{aligned}$$

Theorem 8 deals with a cyclic shift-by-one strategy. However, the analyst may consider more general shifts or permutations. To define the corresponding estimator, we proceed as follows. Let \(\left( \pi _n\right) _{n\ge 1}\) be a sequence of derangements (fixpoint-free permutations) of \(\{1, \ldots , n\}\). Then the derange-and-reweight total effect estimator for input j is defined as

$$\begin{aligned} \widehat{\tau }^{D}_{j,n}= \frac{1}{2n}\sum _{i=1}^n \iota _j(X^{\pi _n(i)},X^i) \left( g(X^{i}_{j}:X^{i}_{-j})-g(X^{\pi _n(i)}_{j}:X^{i}_{-j})\right) ^2. \end{aligned}$$
(24)

Theorem 10

Let X be a sample of size n subject to the joint density f. Then, the derange-and-reweight estimator in (24) is unbiased. In addition, suppose that there exists \(\delta >0\) such that \({{\,\mathrm{\mathbb {E}}\,}}\left[ |V_{j}^1|^{2+\delta }\right] < + \infty \) with \(V_j^1\) defined as in Theorem 8, and that \(m_n^{1+\delta }n^{-\nicefrac {\delta }{2}} \rightarrow 0\) as \(n \rightarrow + \infty \), with \(m_n\) the number of cycles of \(\pi _n\). Then we have the following central limit theorem:

$$\begin{aligned} \sqrt{n}\left( \widehat{\tau }^{D}_{j,n}-\tau _j\right) \xrightarrow [n\rightarrow + \infty ]{}{\mathcal {N}}(0,\sigma ^2_{j}) \end{aligned}$$
(25)

where \(\sigma _j^2={{\,\mathrm{\mathbb {V}}\,}}[V_j^1]+2{{\,\textrm{cov}\,}}(V_j^1,V_j^2)\) with

$$\begin{aligned} V_j^i=\frac{1}{2} \frac{ f_{j,-j}(X^{i+1}_{j}:X^{i}_{-j})}{f_{j}(X^{i+1}_{j})f_{-j}(X^{i}_{-j})}\cdot \\ \left( g(X^{i}_{j}:X^{i}_{-j})-g(X^{i+1}_{j}:X^{i}_{-j})\right) ^2. \end{aligned}$$

Theorem 10 proves a central limit result for the derange-and-reweight estimator (24).

Remark 11

The assumption \(m_n^{1+\delta }n^{-\nicefrac {\delta }{2}} \rightarrow 0\) as \(n \rightarrow + \infty \) does not seem too restrictive. Indeed, a permutation \(\pi _n\) of \(\{1,\ldots , n\}\) decomposes into cycles, and a classical result in combinatorics states that a uniformly random permutation has \(1+\frac{1}{2}+\dots +\frac{1}{n}\) cycles in expectation; this harmonic sum is approximately \(\log (n)\).

Remark 12

It is possible to compute confidence intervals by block-bootstrapping. It is important to implement bootstrapping with blocks, in order to preserve the 1-dependence structure (see, e.g., Lahiri (2003) and references therein).

The shift-and-reweight and the derange-and-reweight estimators are associated with a computational cost equal to \(n(d+1)\), the same as the winding stairs approach (Table 1, rows three and four).
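As an illustration, the following minimal sketch implements the shift-and-reweight estimator (21) together with a block-bootstrap confidence interval in the spirit of Remark 12; the bivariate Gaussian setup, the toy model, the block length and the sample size are illustrative assumptions.

```python
# A minimal sketch of the shift-and-reweight estimator (21) with a circular
# block-bootstrap confidence interval (Remark 12); setup, model, block length
# and sample size are illustrative choices.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(3)
rho = 0.6
Sigma = np.array([[1.0, rho], [rho, 1.0]])
joint = multivariate_normal(mean=[0.0, 0.0], cov=Sigma)
g = lambda x: x[..., 0] + x[..., 1] ** 2                                      # toy model
iota = lambda x: joint.pdf(x) / (norm.pdf(x[..., 0]) * norm.pdf(x[..., 1]))   # Eq. (15)

n = 2000
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)
Y = g(X)

def shift_and_reweight(j):
    """Shift-and-reweight estimator (21): returns the estimate and the terms V_j^i."""
    shift = np.roll(np.arange(n), -1)          # cyclic shift-by-one s_n
    X_mix = X.copy()
    X_mix[:, j] = X[shift, j]                  # pick X_j from the shifted sample
    V = 0.5 * iota(X_mix) * (Y - g(X_mix)) ** 2
    return V.mean(), V

def block_bootstrap_ci(V, n_boot=1000, block_len=10, alpha=0.05):
    """Circular block bootstrap of the mean of the 1-dependent terms V."""
    n_blocks = int(np.ceil(n / block_len))
    means = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n, size=n_blocks)
        idx = (starts[:, None] + np.arange(block_len)).ravel() % n
        means[b] = V[idx[:n]].mean()
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

tau_hat_1, V1 = shift_and_reweight(j=0)
print(tau_hat_1, block_bootstrap_ci(V1))
```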

5 Nearest-neighbour estimation

The designs we have introduced thus far (listed in rows 1 to 4 of Table 1) are based on Jansen’s intuition. We observe that, in a simulation setting, the costs in Table 1 are an upper bound: whenever the density quotient vanishes, there is no need to evaluate the model at those coordinates and we can save computational time. However, the estimators built on these designs are not given-data estimators. Indeed, given a design X and the evaluations of the model at X, the computation of all the estimators listed in rows 1 to 4 of Table 1 requires the simulator to be run at new design points (e.g., for the shift estimator, at points of the form \(X_j^{s_n(i)}:X_{-j}^i\), \(i=1, \ldots , n\)). This is the reason why, while the costs of the most efficient estimators are linear in the sample size, they depend on the number of features (d) and thus can be exposed to the curse of dimensionality. We propose below a nearest-neighbour approach to remove the dependence of the cost on the input space dimension d. The construction is as follows. Consider two observations x and \(x^\prime \), and consider the recombined point \((x^\prime _j:x_{-j})\). The first step is the evaluation of the density quotient \(\iota _j(x',x)\) at this point. If the density quotient is non-negligible, then we select, among the available data, the point which is closest to \((x^\prime _j:x_{-j})\) with respect to the Euclidean metric. More precisely, the point is the solution of

$$\begin{aligned} k^*=\mathop {\textrm{argmin}}\limits _{k\in [n]}\left\{ \left\Vert x_{k,j}-x'_{j}\right\Vert ^2_2+ \left\Vert x_{k,-j}-x_{i,-j}\right\Vert ^2_2\right\} . \end{aligned}$$
(26)

Hence, we use \(g(x'_{j}:x_{-j})\approx g(x_{k^*})\) in (18). Otherwise, if the quotient is small, we set the contribution of \(\iota _j(x',x)\times (g(x'_j:x_{-j})-g(x))^2\) equal to zero. This step can be implemented by setting a threshold on the value of the density quotient and considering as negligible all values of the density quotient below the threshold.

All the required information for applying this design is contained in a sample generated by a once-through pass of a Monte Carlo simulation.

The nearest-neighbour approach serves here as a metamodel, predicting model outputs for the mixed input sample. This use of the nearest-neighbour intuition differs from that in Broto et al. (2020) and Plischke et al. (2022), where nearest neighbours are used to select a conditional stratum (see Devroye et al. 2013, 2018 for theoretical results). At this stage, also due to the presence of the density quotient threshold, we do not furnish any theoretical results for our estimator based on (26). We therefore evaluate this strategy empirically, in a series of experiments in which we compare the performance of this design with that of the other estimators in Table 1.
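The following minimal sketch illustrates the nearest-neighbour strategy on a bivariate Gaussian example; the cyclic pairing of the observations, the toy model and the threshold value are illustrative assumptions.

```python
# A minimal sketch of the nearest-neighbour variant of Sect. 5: the model
# output at the mixed point is replaced by the output at the closest design
# point, Eq. (26). Inputs, model, pairing and threshold are illustrative.
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(4)
rho = 0.6
Sigma = np.array([[1.0, rho], [rho, 1.0]])
joint = multivariate_normal(mean=[0.0, 0.0], cov=Sigma)
g = lambda x: x[..., 0] + x[..., 1] ** 2                                      # toy model
iota = lambda x: joint.pdf(x) / (norm.pdf(x[..., 0]) * norm.pdf(x[..., 1]))   # Eq. (15)

n = 5000
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)
Y = g(X)                                   # single once-through Monte Carlo pass
tree = cKDTree(X)                          # k-d tree on the available design points

def nn_total_effect(j, threshold=1e-3):
    pair = np.roll(np.arange(n), -1)       # cyclic pairing of observations, as in (21)
    X_mix = X.copy()
    X_mix[:, j] = X[pair, j]               # recombined points (x'_j : x_{-j})
    w = iota(X_mix)
    contrib = np.zeros(n)
    keep = w > threshold                   # negligible density quotients contribute zero
    _, k_star = tree.query(X_mix[keep])    # nearest available point, Eq. (26)
    contrib[keep] = 0.5 * w[keep] * (Y[keep] - Y[k_star]) ** 2
    return contrib.mean()

print([nn_total_effect(j) for j in range(2)])
```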

6 The link between total indices and Breiman’s permutation importance with feature constraints

There is a close link between \( {\text {MDA}}_j\) and \(\tau _j\), visible in Hooker et al. (2021, Theorem 2) and discussed in great detail in Bénard et al. (2022). In this section, we extend the relationship to the case in which inputs are constrained. Consider the problem of training an input–output mapping of the type \( g(X;\theta )\), \(g:{\mathcal {X}}\times {\mathbb {R}}^{q}\rightarrow {\mathbb {R}}\), where \(\theta \) is a q-dimensional vector of auxiliary parameters. Let \( {\mathcal {L}}(Y,g(X;\theta ))\), \({\mathcal {L}}:{\mathbb {R}}\times {\mathbb {R}}\rightarrow {\mathbb {R}}\), be a loss function. The training problem can be defined as finding

$$\begin{aligned} \begin{array}{c} \theta ^{*}=\mathop {\textrm{argmin}}\limits _{\theta }{{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {L}}(Y,g(X;\theta ))]. \end{array} \end{aligned}$$
(27)

We let \({{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {L}}(Y,g(X;\theta ^{*}))]\) denote the nominal (and minimal) expected loss function for the machine learning problem. Then, Breiman’s importance of feature \(X_{j}\) is defined as

$$\begin{aligned} \text {MDA}_{j}={{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {L}}(Y,g(X_{j}^{\prime }:X_{-j};\theta ^{*}))]-{{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {L}}(Y,g(X;\theta ^{*}))], \end{aligned}$$
(28)

where \({{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {L}}(Y,g(X_{j}^{\prime }:X_{-j};\theta ^{*}))]\) is the expected loss that we incur if feature \(X_{j}\) is permuted (using the model trained on the original data). The intuition is that if the machine learning model relies heavily on \(X_{j}\) for its predictions, then the loss in predictive accuracy should be high and consequently the difference between the nominal loss and the loss after permutation should be significant. After feature permutation we expect a decrease in model prediction accuracy, so that the expected loss after feature permutation is larger than the nominal expected loss, yielding \(\text {MDA}_{j}\ge 0\).

Williamson et al. (2021, 2023) propose the following quantity as a model-agnostic variable importance measure:

$$\begin{aligned} \text {MDA}^W_{j}={{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {L}}(Y,g_j(X_{-j};\theta ^{**}))] -{{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {L}}(Y,g(X;\theta ^{*}))], \end{aligned}$$
(29)

where \(g_j(X_{-j};\theta ^{**})\) is the retrained predictor after feature \(X_j\) has been eliminated from the dataset and \(g(X;\theta ^{*})\) is the model trained with all features in the dataset. The variable importance measure in (29) is called dropped variable importance in Hooker et al. (2021) and leave-one-covariate-out (LOCO) in Lei et al. (2018). Notice that, if the model coefficient of determination \(R^2\) is taken as a performance measure in (29), then \(\text {MDA}^W_{j}\) is the total index of \(X_j\). The main difference between the approach to variable importance in (29) and the feature importance in (28) is that the latter permutes the feature and uses the same model, while in the former one deletes the feature from the dataset and then uses a newly retrained model. The advantage of the remove-and-retrain approach is that it handles dependence and constraints. However, when retraining is expensive, the approach might become computationally heavy. The computational issues related to retraining are also noted in Lundberg and Lee (2017), who propose alternatives to avoid retraining in their introduction of the SHAP variable importance measure. Breiman’s approach of permuting features and using the same model may then be computationally convenient. However, Breiman’s strategy relies on unrestricted permutations (Hooker et al. 2021), which could lead to violating the constraints.

Of relevance to us are also Eqs. (3.1) and (6.2) of Fisher et al. (2019). Fisher et al. (2019) formulate Breiman’s importance in terms of model reliance, as the ratio of the expected loss after permutation to the expected loss before permutation. Using our notation, the definition of model reliance in their Eq. (6.2) would read

$$\begin{aligned} \text {MR}_{j}=\dfrac{{{\,\mathrm{\mathbb {E}}\,}}[\iota _{j}(X^\prime ,X){\mathcal {L}}(Y,g(X_j^\prime :X_{-j};\theta ^*))]}{{{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {L}}(Y,g(X;\theta ^*))]}. \end{aligned}$$
(30)

Rewritten in difference terms, (30) yields a dependence-aware version of (28)

$$\begin{aligned} \text {MDA}^\text {FR}_{j}={{\,\mathrm{\mathbb {E}}\,}}[\iota _{j}(X^\prime ,X){\mathcal {L}}(Y,g(X_j^\prime :X_{-j};\theta ^*))]-{{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {L}}(Y,g(X;\theta ^*))]. \end{aligned}$$
(31)

Proposition 13

Consider \(\text {MDA}^\text {FR}_{j}\) in (31). If \({\mathcal {L}}(Y,g(X))\) is a squared loss function and \( g(X;\theta ^{*})\) is a perfect predictor, then \(\text {MDA}_{j}^\text {FR}=\tau _{j}\).

Proof

The proof is postponed to “Appendix A.4”. \(\square \)

Proposition 13 then suggests that, also under feature constraints, total indices can be reinterpreted in terms of mean decrease of a model’s accuracy.
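As an illustration of (31), the following minimal sketch computes a density-quotient-weighted permutation importance under a squared loss; the data-generating process, the simple least-squares predictor and the use of a single random permutation (as a Monte Carlo version of the expectation over \(X'\)) are illustrative assumptions.

```python
# A minimal sketch of the dependence-aware permutation importance (31) with a
# squared loss: the permuted-feature loss is reweighted by the density
# quotient. Data, predictor and the single random permutation are illustrative.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(5)
rho = 0.6
Sigma = np.array([[1.0, rho], [rho, 1.0]])
joint = multivariate_normal(mean=[0.0, 0.0], cov=Sigma)
iota = lambda x: joint.pdf(x) / (norm.pdf(x[..., 0]) * norm.pdf(x[..., 1]))   # Eq. (15)

n = 5000
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)
Y = X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)      # noisy toy data

# Train a simple predictor (here: linear least squares on an enriched basis).
A = np.c_[np.ones(n), X, X[:, 1] ** 2]
theta, *_ = np.linalg.lstsq(A, Y, rcond=None)
predict = lambda Z: np.c_[np.ones(len(Z)), Z, Z[:, 1] ** 2] @ theta

def mda_fr(j):
    """MDA^FR_j of Eq. (31) with squared loss and one random permutation."""
    nominal = np.mean((Y - predict(X)) ** 2)       # E[L(Y, g(X; theta*))]
    X_perm = X.copy()
    X_perm[:, j] = X[rng.permutation(n), j]        # permute feature j only
    permuted = np.mean(iota(X_perm) * (Y - predict(X_perm)) ** 2)
    return permuted - nominal

print([mda_fr(j) for j in range(2)])
```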

Another popular alternative from the machine learning literature for measuring variable importance is the model-X knockoff approach of Candès et al. (2018). Model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is unknown, but the input probability distribution is known, or at least approximated. The approach relies on the construction of knockoff variables, as well as on the choice of feature statistics that allow false discovery rate control. With complex input dependence structures and complex input–output mappings, these tasks are non-trivial.

7 Analytical total effects for linear models with Gaussian features

In this section, we derive analytical expressions for the values of total effects to be used as benchmarks in numerical experiments. In doing so, we also derive a formula for the density quotient in the case of Gaussian copulas.

Let us consider a linear mapping between Y and X, \(Y=\beta ^0 +\beta ^T X\), \(X\in \mathbb {R}^{d}\), \(\beta ^0 \in {\mathbb {R}}\), \(\beta \in {\mathbb {R}}^d\), and let the features be normally distributed with mean \(\mu \) and variance-covariance matrix \(\Sigma \). Under this assumption, all conditional distributions are Gaussian and all conditional expectations are linear. The pair (X, Y) is then also Gaussian with augmented covariance matrix, \(\Sigma '=\left( {\begin{matrix}\Sigma &{} \Sigma \beta \\ \beta ^T \Sigma &{} \beta ^T \Sigma \beta \end{matrix}}\right) \).

Let m be a positive integer. Let u, v, w be pairwise disjoint index sets from [m]. For any m-dimensional multivariate normal distribution \(Z\sim {\mathcal {N}}(\mu ,\Gamma )\), the conditional distribution of \(Z_{u+v}\) given \(Z_w=z_w\) is

$$\begin{aligned} {\mathcal {N}}\left( \mu _{u+v}+\Gamma _{{u+v},w}\Gamma _{w,w}^{-1}(z_w-\mu _w),\; \Gamma _{u+v,u+v}-\Gamma _{{u+v},w}\Gamma _{w,w}^{-1}\Gamma _{w,{u+v}}\right) . \end{aligned}$$
(32)

Here, the conditional covariance matrix in (32) is given in the form of a Schur complement. We assume that the submatrix selected by w is invertible.

For a multivariate Gaussian distribution, the conditional independence \(u \perp \!\!\!\perp v|w\) holds if and only if \(\Gamma _{u,v}=\Gamma _{u,w}\Gamma _{w,w}^{-1}\Gamma _{w,v}\), as in this case the conditional covariance matrix in (32) is block-diagonal, i.e.

$$\begin{aligned} \begin{bmatrix} \Gamma _{u,u} &{} \Gamma _{u,v} \\ \Gamma _{v,u} &{} \Gamma _{v,v} \end{bmatrix} - \begin{bmatrix} \Gamma _{u,w} \\ \Gamma _{v,w} \end{bmatrix} \Gamma _{w,w}^{-1} \begin{bmatrix} \Gamma _{w,u}&\Gamma _{w,v} \end{bmatrix} = \begin{bmatrix} \Gamma _{u,u} -\Gamma _{u,w}\Gamma _{w,w}^{-1} \Gamma _{w,u} &{} 0\\ 0 &{} \Gamma _{v,v}-\Gamma _{v,w}\Gamma _{w,w}^{-1} \Gamma _{w,v} \end{bmatrix}. \end{aligned}$$

Theorem 14

Given \(Y=\beta ^0 +\beta ^T X\), \(X\sim {\mathcal {N}}(\mu ,\Sigma )\), \(X\in \mathbb {R}^d\), the unnormalized main and total effects are given by

$$\begin{aligned} S_j&= \beta ^T \left( \frac{\Sigma _{[d],j} \Sigma _{[d],j}^T}{\Sigma _{j,j}}\right) \beta , \end{aligned}$$
(33)
$$\begin{aligned} T_j&=\beta _j^2 \frac{\det (\Sigma )}{\det (\Sigma _{-j,-j})}. \end{aligned}$$
(34)

The output variance is \({{\,\mathrm{\mathbb {V}}\,}}[Y]=\beta ^T \Sigma \beta \).

Proof

The proof is postponed to “Appendix A.5”. \(\square \)

Alternative computations are offered in Mara and Tarantola (2012) for the case \(d=3\). These results can also be retrieved from the proof of (Owen and Prieur 2017, Theorem 4.1).
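For reference, the following minimal sketch evaluates the analytical expressions (33) and (34); the coefficient vector and the covariance matrix follow the parameterization of Sect. 8.1 (\(\sigma =2\), \(\varrho =0.8\)) and are otherwise illustrative.

```python
# A minimal sketch of the analytical main and total effects (33)-(34) of
# Theorem 14 for a linear model with Gaussian inputs; beta and Sigma are
# illustrative (Sect. 8.1 values with sigma = 2, rho = 0.8).
import numpy as np

def linear_gaussian_effects(beta, Sigma):
    d = len(beta)
    S = np.array([(beta @ Sigma[:, j]) ** 2 / Sigma[j, j] for j in range(d)])      # Eq. (33)
    T = np.array([beta[j] ** 2 * np.linalg.det(Sigma)
                  / np.linalg.det(np.delete(np.delete(Sigma, j, 0), j, 1))
                  for j in range(d)])                                              # Eq. (34)
    return S, T, beta @ Sigma @ beta                                               # and V[Y]

beta = np.array([1.0, 1.0, 1.0])
Sigma = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.6],
                  [0.0, 1.6, 4.0]])
S, T, var_y = linear_gaussian_effects(beta, Sigma)
print(S, T, var_y)   # S = (1.00, 6.76, 7.84), T = (1.00, 0.36, 1.44), V[Y] = 9.2
```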

Fig. 1 Total effects (unnormalized) for the linear model, including confidence bounds (gray areas): Winding stairs (panels A and E), U-statistics estimator (panels B and F), shift-and-reweight (panels C and D), derange-and-reweight (panels G and H). The analytical total effects are represented by solid lines

The proof of Theorem 14 (see “Appendix A.5”) provides analytical formulas for main and total effects for linear models with Gaussian features. For the mix-and-reweight and the derange-and-reweight approaches, we obtain the density quotient

$$\begin{aligned} \iota _u(x',x)=\sqrt{\frac{\det (\Sigma _u)\det (\Sigma _{-u})}{\det (\Sigma )}}\, e^{-\frac{1}{2}((x'_u:x_{-u})-\mu )^T(\Sigma ^{-1}-(\Sigma _u^{-1}:\Sigma _{-u}^{-1})) ((x'_u:x_{-u})-\mu )}, \end{aligned}$$
(35)

where \((\cdot :\cdot )\) is the out-of-order composition of vectors and block diagonal matrices. The expression for the density quotient readily generalizes to the case in which the factors are distributed with generic marginals correlated via a Gaussian copula. In the bivariate case, setting \(\mu =0\) and \(\Sigma =\left( {\begin{matrix}1&{} \varrho \\ \varrho &{} 1\end{matrix}}\right) \) in (35) yields (16).

8 Experiments

In this section, we study the numerical implementation of the designs discussed in this work, using examples with constraints of growing complexity. The experiments are organized as follows. In Sect. 8.1, we start with linear models and correlated inputs to test the consistency of all estimators by comparison with the analytical benchmarks developed in Sect. 7. In Sect. 8.2, we provide a detailed study of reweighting estimators on the Ishigami function, a popular test case for sensitivity analysis. In Sect. 8.3, we study the case of inputs constrained to a circle, where total Sobol’ indices are estimated with nearest-neighbour and double-loop approaches because a winding stairs approach becomes infeasible. Section 8.4 reports results for the case in which inputs are constrained on a simplex, based on the case study in Gilquin et al. (2015). In Sect. 8.5, inputs are constrained to a non-connected domain (two separate triangles). Section 8.6 reports results for the constraint represented by Sierpinski gaskets. Finally, Section 8.8 reports results for a realistic simulator. All experiments are run on a personal computer with an Intel(R) i7-3770 CPU at 3.40GHz and 8GB RAM, using MatLab R2022a. We rely on the MatLab k-d-tree implementation for the nearest-neighbour search. In all the experiments we implemented the nearest-neighbour approach (see Sect. 5) with the threshold set equal to zero.

Because the methods are associated with different computational costs (third column in Table 1), to make experiments comparable, we fix the overall budget of the experiment and then calculate the corresponding sample size n. To illustrate, given a budget of B = 10,000 model runs, for a \(d=3\) variable model we have sample sizes respectively of \(n^\text {winding stairs}=2500\) for the generalized winding stairs and the shift(derange)-and-reweight designs, and \(n^\text {U-statistic}=58\) for the U-statistic design, while the entire budget is available for the nearest-neighbour design \(n^\text {nearest-neighbour}=B=10000\).
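The following minimal sketch shows how the basic sample sizes are derived from a common budget via the cost formulas of Table 1; the integer rounding is our own choice.

```python
# A minimal sketch of deriving the basic sample sizes from a common budget B
# using the cost formulas of Table 1 for d features; rounding is ours.
import math

def sample_sizes(budget, d):
    n_ws = budget // (d + 1)               # winding stairs / shift / derange: n(d+1)
    # U-statistic: d*n^2 - n*d + n = budget, i.e. d*n^2 + (1-d)*n - budget = 0
    n_u = int(((d - 1) + math.sqrt((d - 1) ** 2 + 4 * d * budget)) / (2 * d))
    n_nn = budget                          # nearest neighbour uses the whole budget
    return n_ws, n_u, n_nn

print(sample_sizes(10_000, 3))             # (2500, 58, 10000), as in the text above
print(sample_sizes(1_680, 3))              # (420, 24, 1680), used in Sect. 8.1
```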

8.1 Linear model with normal and correlated inputs

In this section, we consider a parameterization of a linear model with correlated inputs, for which analytical expressions are discussed in Sect. 7 (Theorem 14). Our goal is to test the performance of all estimators in Table 1. We use as a test case the model discussed in Kucherenko et al. (2012). One sets \(Y =g( X_1, X_2, X_3)=X_1+X_2+X_3\) with \(X\sim {\mathcal {N}}(0,\Sigma )\), and covariance matrix \(\Sigma =\left( {\begin{matrix}1 &{} 0 &{} 0 \\ 0&{} 1 &{} \varrho \sigma \\ 0 &{} \varrho \sigma &{} \sigma ^2\end{matrix}}\right) \). In the experiments, we set \(\sigma =2\) and assess the effect of increasing correlations, letting \(\varrho \) vary in \([-1, 1]\). We impose a computational budget restriction, fixing \(B=1680\), and derive the corresponding basic sample sizes n from Table 1. With \(d=3\) input factors, we find \(n^{\text {U-statistic}}=24\), \(n^{\text {Winding Stairs}}=n^{\text {Reweighting}}=420\), and \(n^{\text {Nearest-Neighbour}}=1680\). We perform the calculations with this given budget B for \(\varrho \in \{-0.9,-0.75,-0.6,\dots ,0.6,0.75,0.9\}\). We also test the effect of changing the sample generator, comparing crude Monte Carlo (MC) and randomized Quasi-Monte Carlo (RQMC). We randomize the sequence generation process employing the MatLab Sobol’ scrambler discussed in Owen (1997) and Matoušek (1998). Figure 1 shows the results.

Each of the panels in Fig. 1 reports the values of the total indices \(\tau _1,\tau _2\), and \(\tau _3\), respectively. The horizontal axis reports the values of the correlation coefficient. The analytical values are displayed as continuous lines, and the estimates as dashed lines. Confidence intervals are displayed as shaded areas around the point estimates. To compute them, for the U-statistics approach, we use the asymptotic normality result of Lemma 7 together with plug-in estimates for the estimator variance. For the winding stairs approach we follow Goda (2021). For the shift-and-reweight approach (panels C and D) and for the derange-and-reweight approach (panels G and H) we use the upper and lower \(2.5\%\) quantiles from a block-bootstrap. We also used a normal approximation from the asymptotic results of Theorem 8 or Theorem 10 to compute confidence intervals. The results were similar and are therefore not reported here.

Figure 1 shows that the point estimates from QMC designs (using scrambled sequences) generally perform better than those from a plain Monte Carlo design, while the confidence bounds are comparable. Preliminary tests showed that the shift-and-reweight approach does not work well with a QMC design. We therefore replaced the shift with a random derangement, and a derange-and-reweight approach was used for panels G and H. Considering the mean squared error across different simulations, one can conclude that, for this example, a QMC design is an advantage for the U-statistic and winding stairs methods, while the impact is less clear for the derange-and-reweight methods (both with re-evaluations and with nearest-neighbour approximations of the mixed sample).

In this example, investing the computational budget into one large sample and using a nearest-neighbour metamodeling approach offers a good performance, only to be beaten by the winding stairs QMC design which, as already remarked, is not necessarily available when general dependence is present in the form of constraints on the input features (these visual findings are corroborated by the corresponding mean squared errors, see Table 2).

Table 2 Averaged mean squared errors (with respect to \(\varrho \)) of the different methods and sampling designs for the Linear Model using 20 repetitions

8.2 The Ishigami function with correlations under a Gaussian copula

We further test the proposed designs on a well-known test case in simulation experiments (Kucherenko et al. 2012). The domain is still Cartesian; the input–output mapping, however, presents interactions. Our goal is to obtain additional insights on the performance of the estimators used in the previous case study, and especially to test their behavior with respect to the random sample generator. The input–output mapping is \(Y=g(x_1,x_2,x_3)=\sin (x_1)+7\sin ^2(x_2)+0.1x_3^4\sin (x_1)\) with \(X_i\) uniformly distributed on \([-\pi , \pi ]\). As in (Kucherenko et al. 2012, Sect. 7.3), we introduce a statistical dependence between \(X_1\) and \(X_3\) via a pairwise Gaussian copula. We let the rank correlation coefficient \(\rho (X_1,X_3)\) range from \(-0.9\) to \(0.9\). There is no closed-form solution for the total indices at a generic value of \(\rho (X_1,X_3)\). However, in the uncorrelated case \(\rho (X_1,X_3)=0\), the total indices are analytically known, with values \(T_1=0.56\), \(T_2=0.44\) and \(T_3=0.24\), respectively.

For the winding stairs estimator, we use the inverse Knothe–Rosenblatt transformation detailed in Example 4. For the reweight estimators, we implement the rank correlation using (14). We fix a budget of about \(B=8200\) simulations. This yields a basic sample size of \(n=2048\) for the generalized winding stairs and the reweighting approaches, of \(n=53\) for the U-statistic, and of \(n=8192\) points for the nearest-neighbour estimator. We generate the sample first with crude MC and then with QMC.
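As an illustration of how such a pairwise Gaussian copula can be imposed in practice, the following minimal sketch (our own Python illustration, not the authors' code; we read the rank correlation as Spearman's \(\rho \), an assumption on our part) maps a correlated Gaussian pair to uniforms on \([-\pi ,\pi ]\), choosing the latent Pearson correlation so that the Spearman rank correlation matches the target value.

```python
# Minimal sketch (our own illustration): Ishigami inputs with a pairwise
# Gaussian copula between X1 and X3, targeting a Spearman rank correlation.
import numpy as np
from scipy.stats import norm

def ishigami_sample(n, rho_s, seed=0):
    """(X1, X2, X3) uniform on [-pi, pi], Gaussian copula between X1 and X3."""
    rng = np.random.default_rng(seed)
    # Latent Pearson correlation giving Spearman's rho_s: rho_s = (6/pi) arcsin(r/2).
    r = 2.0 * np.sin(np.pi * rho_s / 6.0)
    z = rng.multivariate_normal([0.0, 0.0], [[1.0, r], [r, 1.0]], size=n)
    u1, u3 = norm.cdf(z[:, 0]), norm.cdf(z[:, 1])
    u2 = rng.uniform(size=n)
    return -np.pi + 2.0 * np.pi * np.column_stack([u1, u2, u3])

def ishigami(x):
    return np.sin(x[:, 0]) + 7.0 * np.sin(x[:, 1])**2 + 0.1 * x[:, 2]**4 * np.sin(x[:, 0])

x = ishigami_sample(2048, rho_s=0.5)
y = ishigami(x)
```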

Fig. 2

Ishigami function with rank correlation: normalized total (\(T_i\)) indices depending on the rank correlation \(\rho ^*(X_1,X_3)\). Total budget: 8200 model evaluations. Upper row: crude Monte Carlo sampling, lower row: Quasi Monte Carlo sampling

The panels in Fig. 2 show the estimates of the normalized total effects \(T_1\), \(T_2\), \(T_3\) as a function of the correlation between \(X_1\) and \(X_3\) for the winding stairs, mix-and-reweight, derange-and-reweight and nearest-neighbour designs, respectively. In the upper row, we use crude MC to generate the input sample, while we use QMC in the lower row. The panels in the first row show that, when using crude MC, the winding stairs (first panel) and derange-and-reweight (third panel) designs yield comparable estimates. However, the mix-and-reweight estimator (second panel) yields highly unstable estimates, while the nearest-neighbour design yields more stable estimates. In the absence of correlations, at \(\rho (X_1,X_3)=0\), the estimates for all designs are close to the analytical values.

The panels in the second row show that, when using QMC, the winding stairs and U-statistic estimates exhibit increased regularity, while the methods based on a random derangement do not. This may be due to the random derangements breaking the properties of the low-discrepancy sequence. Despite the greater regularity of the U-statistic estimates as a function of \(\rho (X_1,X_3)\), these estimates are upward biased for \(T_1\) and, most notably, for \(T_2\), compared to the other approaches. A reason may be that, at \(n=53\), the QMC Sobol' sequence does not populate a Latin hypercube and the projections on the marginals are not uniform. To fill a Latin hypercube, we would need to increase the basic sample size to the next power of 2, \(n=64\). However, these additional eleven points in the sample block would propagate into a new budget of \(B=12160\), an increase of nearly \(50\%\) in computational cost. Lastly, the rightmost panels show that the nearest-neighbour estimator performs similarly with both sample generation methods.

8.3 Features constrained on a circle

In the previous two test cases, the input domain was the Cartesian product of the individual input supports and the dependence structure was determined by probabilistic correlations. In this section, we continue the investigation of the performance of the estimators with an example in which the dependence structure cannot be modeled by a Gaussian copula. Specifically, we consider inputs constrained to lie within a circle. We consider the two-dimensional input–output mapping

$$\begin{aligned} Y=g(X_1,X_2)=(X_1-1)\cdot (X_2-1), \end{aligned}$$
(36)

with \(X_1\) and \(X_2\) uniformly distributed within a circle of radius \(\pi \) centered at the origin.

Fig. 3

Uniform inputs constrained within a circle (left), model response (right)

This geometry rules out the application of a design based on the Knothe–Rosenblatt transform for calculating total effects. In fact, we would need to find the quantile function corresponding to the marginal cdf of \(X_1\),

$$\begin{aligned} x\mapsto \frac{2}{\pi R^2} \left( \frac{x}{2}\sqrt{\max \{0,R^2-x^2\}}+\frac{R^2}{2} \arcsin \left( \min \left\{ 1,\max \left\{ -1,\frac{x}{R}\right\} \right\} \right) +\frac{\pi R^2}{4}\right) , \end{aligned}$$

and plug this into the conditional quantile function of \(x_2\), \((u,x)\mapsto 2(u-\frac{1}{2})\sqrt{R^2-x^2}\). Even for this simple geometry, this is a daunting task. Conversely, in this case we can find the density quotient analytically. Assuming a uniform unconstrained density \(f_X(x)\) on the square of side 2R enclosing the circle of radius R, the joint density is \(x\mapsto \frac{\mathbbm {1}_C(x)f_X(x)}{\int _C f_X(x) dx}\) and the marginal densities can be obtained accordingly. For the present test case, where the circle of radius R is centered at the origin, the density quotient is

$$\begin{aligned} \iota _1(x_1,x_2)=\iota _2(x_2,x_1)= \frac{\pi R^2}{4} \cdot \frac{\mathbbm {1}\{x_1^2+x_2^2\le R^2\}}{\sqrt{\max \left\{ 0, R^2-x_1^2\right\} }\cdot \sqrt{\max \left\{ 0, R^2-x_2^2\right\} }}. \end{aligned}$$
(37)

Because the output density can be computed analytically, a calculation with symbolic software (Mathcad Prime 8 in our case) yields the unnormalized total effects \(\tau _1=\tau _2=6.53\); the equality follows from the symmetry of the problem. We use the experiments to investigate the rate of convergence of the nearest-neighbour and derange-and-reweight estimators. We also test the effect of the random number generator and use both Monte Carlo and quasi-Monte Carlo alternatives in our experiments. Because the shift-and-reweight estimator performed poorly with quasi-Monte Carlo sampling, we do not consider it in these experiments. Figure 3 shows an input sample of size \(n=1024\) (left panel) and the model response (right panel). For randomized QMC, two approaches are used: the scrambled sequence of Matoušek (1998) and Owen (1997) and a “mingled” Halton sequence with linear scrambling (Bayousef and Mascagni 2019).
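For concreteness, the rejection sampler for the disk and the density quotient of Eq. (37) can be sketched as follows (our own Python illustration, not the authors' implementation).

```python
# Sketch (our own illustration): uniform sampling within the disk of radius R
# by rejection from the enclosing square, and the density quotient of Eq. (37).
import numpy as np

R = np.pi

def sample_disk(n, seed=0):
    """Uniform sample in the disk of radius R, by rejection from the enclosing square."""
    rng = np.random.default_rng(seed)
    pts = np.empty((0, 2))
    while len(pts) < n:
        cand = rng.uniform(-R, R, size=(2 * n, 2))
        pts = np.vstack([pts, cand[np.sum(cand**2, axis=1) <= R**2]])
    return pts[:n]

def iota(x1, x2, R=R):
    """Density quotient iota_1(x1, x2) = iota_2(x2, x1) of Eq. (37), for equal-shape arrays."""
    inside = x1**2 + x2**2 <= R**2
    denom = np.sqrt(np.maximum(0.0, R**2 - x1**2)) * np.sqrt(np.maximum(0.0, R**2 - x2**2))
    out = np.zeros_like(denom)
    ok = inside & (denom > 0)
    out[ok] = (np.pi * R**2 / 4.0) / denom[ok]
    return out

x = sample_disk(1024)
w = iota(x[:, 0], x[:, 1])                 # reweighting factors
y = (x[:, 0] - 1.0) * (x[:, 1] - 1.0)      # model output, Eq. (36)
```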

The mean squared errors, averaged over all factors, follow an \(O(n^{-1})\) convergence rate for all sampling strategies, as seen in Fig. 4, where the diagonal of the dashed triangle indicates the \(n^{-1}\) rate.

Fig. 4

Mean squared errors for the unnormalized total effects over both factors with alternative sample generators, depending on the sample size (number of model evaluations) using basic sample sizes from \(2^8\) to \(2^{20}\)

Fig. 5

An input constraint in form of a simplex condition

The estimators therefore converge at the rate \(O(n^{-\nicefrac {1}{2}})\) in root mean squared error, as for standard Monte Carlo estimation.

We observe a similar convergence rate across the alternative generators. A reason may be that the shifting strategy interferes with the regularity of the QMC structure, reducing the advantage of a QMC generator in this context. Furthermore, a constraint may introduce discontinuities so that the integrand is no longer of bounded variation in the sense of Hardy and Krause, in which case the Koksma–Hlawka theorem does not provide an improved convergence rate.

8.4 Features constrained on a simplex

We have seen in the previous sections that the shift-and-reweight estimator does not combine well with quasi-Monte Carlo sampling. We propose here a preprocessing step consisting of randomly permuting the order of the quasi-Monte Carlo realizations. With this preprocessing, the mean squared error of the shift-and-reweight estimator combined with QMC remains consistent with the Monte Carlo approximation error, and the estimator regains a performance similar to that of the derange-and-reweight estimator.
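A minimal sketch of this preprocessing (assuming SciPy's scrambled Sobol' generator; not the authors' code) reads as follows: the low-discrepancy sample itself is kept, and only the order of its realizations is randomly permuted before it enters the shift-and-reweight pairing.

```python
# Sketch of the preprocessing: randomly permute the order of the QMC realizations.
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(0)
u = qmc.Sobol(d=3, scramble=True, seed=0).random(1024)  # scrambled Sobol' sample
u_preprocessed = rng.permutation(u, axis=0)             # random row permutation
```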

Fig. 6

MSE at increasing sample sizes for total effect estimation for models with a simplex constraint on the inputs

We perform a series of tests for a setting in which it is still possible to obtain the total indices analytically via symbolic calculation (we used Maple). More precisely, we consider features constrained on a simplex:

$$\begin{aligned} \{ (x_1,\dots ,x_d)\in [0,1]^d: x_{d-1}\le x_d \}. \end{aligned}$$
(38)

Figure 5 provides a visualization of the constraint for the case \(d=3\). The density quotient between all pairs of inputs is constant and equal to one, except for the last two features: \(\iota (x_{d-1},x_d)= \frac{\mathbbm {1}\{x_{d-1}\le x_d\}}{2(1-x_{d-1})x_d}\).

As numerical test cases, we consider two models (with \(d=3\) and \(d=4\)) introduced in Gilquin et al. (2015), where the authors tested a new space-filling sampling strategy for estimating grouped Sobol' indices with constrained inputs (see Jacques et al. 2006), as well as the well-known Sobol' g-function. These models are described hereafter, together with the corresponding analytical values of the total Sobol' indices under the simplex constraint given by (38):

  • \(g(x)=-x_1+x_1x_2-x_1x_2x_3+x_1x_2x_3x_4\) with \(T=(0.6300,\ 0.4861,\ 0.0064,\ 0.0064)\);

  • \(g(x_1,x_2,x_3)=x_1x_2-x_3\) with \(T=(0.125,\ 0.125,\ 0.375)\);

  • Sobol’ g-function with parameter vector \(a=(0,1,3,6)\), such that \(T=(0.7659,\,0.2357,\,0.0522,\,0.0172)\).

The numerical experiments aim to compare estimates obtained with the shift-and-reweight estimator on data generated either with Monte Carlo sampling or with quasi-Monte Carlo sampling (here a scrambled Sobol' sequence) with preprocessing. The preprocessing consists of a random permutation of the sample realizations. We report the mean squared errors (averaged over 20 replicates) in Fig. 6. For comparison, the figure also reports the results of the nearest-neighbour estimator with crude Monte Carlo sampling. The results demonstrate that the preprocessing restores the performance of the shift-and-reweight estimator also with quasi-Monte Carlo data generation. However, there is no advantage in combining the shift-and-reweight estimator with quasi-Monte Carlo sampling rather than with crude Monte Carlo. Moreover, in two out of three examples the nearest-neighbour approach shows the best performance, while for the Sobol' g-function the three estimators perform similarly, with a slight edge for the shift-and-reweight estimator.

8.5 Features constrained on two disconnected triangles

In this section, we further challenge the estimators with a test case in which the inputs lie on a disconnected domain. We take the input–output mapping \(g(x_1,x_2)=(x_1-1)\cdot (x_2-1)\) and let the features lie in the 2-dimensional region \({\mathcal {X}}=\{(x_1,x_2)\in [0,1]^2: x_2 \le \frac{1}{2}-x_1 \vee x_2 \ge 1-\frac{1}{4} x_1 \}\) (Fig. 7). We assign a uniform density \(f_{12}=4\) within the triangles, which vanishes outside them. The marginal and conditional densities of the inputs, as well as the density quotients, can be computed analytically. From this knowledge, the total indices can be obtained analytically and are equal to \(T_1=0.037\) and \(T_2=0.27\) (the total variance is 0.117). In the next series of numerical tests, we employ random sampling with rejection, starting from a uniform distribution on the unit square. Figure 7 visualizes the input sample generated with crude Monte Carlo: after sampling 2000 points, about \(\tfrac{1}{4}\) of the points remain in the domain.

Fig. 7

An input domain in the form of two separate triangles. About 500 points remain in the domain after rejection

We perform a series of experiments at increasing sample sizes, from \(n=474\) to \(n=3522\), as reported in Table 3 (due to rejection sampling, the sample sizes do not form a regular progression). We then apply the shift-and-reweight, derange-and-reweight and nearest-neighbour estimators. Because the first two estimators produce similar results, we only display the shift-and-reweight estimates. The estimator variances in Table 3 are computed from a plug-in estimator as per Lemma 7.

Table 3 A product model on the support formed by two separate triangles, normalized total effects, shift-and-reweight method of Theorem 8

The values in Table 3 show that the estimators exhibit similar accuracy, with the variance of the estimates decreasing as the sample size grows. However, the nearest-neighbour estimator comes at a much lower computational cost. We believe this good performance is enabled by the low dimension of the input space, and we challenge it in later experiments (Sect. 8.7).

8.6 Features constrained on the Sierpinski gasket

In this section, we consider experiments in which the structure of the feature support is progressively complicated. We start with a Cartesian support and remove parts of the support to move towards a disconnected structure, until we reach a Sierpinski gasket, which contains holes and is not star-shaped. Figure 8 provides a visualization of the regions.

Fig. 8

Feature constraints: left one-corner, center two-corner, right Sierpinski with depth \(k=5\)

The second and third domains are created by cutting corners off the unit square. The former (left panel of Fig. 8) consists of all points in \([0,1]^2\) with the exclusion of the pairs \((x_{1},x_{2})\) such that \(x_{1}+x_{2}>\nicefrac {3}{2} \); the latter (center panel) also excludes the points with \(x_{1}+x_{2}< \nicefrac {1}{2}\). The fourth domain (right panel) approximates a fractal structure, the Sierpinski gasket, by excluding all realizations \((x_{1},x_{2})\) that satisfy the following three conditions for some \(k\in \{1,2,\dots ,5\}\):

  • \({{\,\textrm{mod}\,}}(2^{k-1}(x_1+x_2),2)>1 \),

  • \({{\,\textrm{mod}\,}}(2^{k-1}x_1,2)\le 1\), and

  • \({{\,\textrm{mod}\,}}(2^{k-1}x_2,2)\le 1\),

where \(x\mapsto {{\,\textrm{mod}\,}}(x,2)=x-2\left\lfloor \frac{x}{2}\right\rfloor \) denotes the remainder after division by 2.
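For concreteness, the rejection tests for the three constrained domains can be sketched as follows (our own Python rendering of the conditions above, not the authors' code).

```python
# Sketch (our own rendering of the constraints): rejection tests for the
# cut-corner domains and the depth-5 Sierpinski approximation.
import numpy as np

def keep_one_corner(x1, x2):
    return x1 + x2 <= 1.5

def keep_two_corners(x1, x2):
    return (x1 + x2 >= 0.5) & (x1 + x2 <= 1.5)

def keep_sierpinski(x1, x2, depth=5):
    """Reject a point if, for some k = 1, ..., depth, all three conditions hold."""
    keep = np.ones_like(x1, dtype=bool)
    for k in range(1, depth + 1):
        s = 2.0 ** (k - 1)
        removed = (np.mod(s * (x1 + x2), 2) > 1) \
                  & (np.mod(s * x1, 2) <= 1) & (np.mod(s * x2, 2) <= 1)
        keep &= ~removed
    return keep

rng = np.random.default_rng(0)
x = rng.uniform(size=(50_000, 2))
x_kept = x[keep_sierpinski(x[:, 0], x[:, 1])]   # roughly (1/2)*(3/4)^4 of the points remain
```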

We consider a mapping of the form \(Y=X_{1}+\beta X_{2}\), with \(\beta \) varying in \(\{-2,-1,0,1,2\}\) for the experiments. With this test case we aim to unveil some of the subtleties that appear when the input domain becomes increasingly disconnected.

It is possible to derive the expression of the density quotient semi-analytically. Let \(x_1,x_2,x\in [0,1]\). For the cut-one-corner constraint, the joint density is defined by

$$\begin{aligned} f_{12}^{OC}(x_{1},x_{2})=\frac{7}{8}\cdot \mathbbm {1}\left\{ x_{1}+x_{2}\le \frac{3 }{2}\right\} \end{aligned}$$
(39)

and marginal densities by

$$\begin{aligned} f_{1}^{OC}(x)=f_{2}^{OC}(x)=\frac{7}{8}\cdot \left( 1-\left( x-\frac{1}{2} \right) ^{+}\right) . \end{aligned}$$
(40)

For the cut-two-corners domain, the joint density is given by

$$\begin{aligned} f_{12}^{TC}(x_{1},x_{2})=\frac{3}{4}\cdot \mathbbm {1}\left\{ \frac{1}{2}\le x_{1}+x_{2}\le \frac{3}{2}\right\} \end{aligned}$$
(41)

and marginal densities by

$$\begin{aligned} f_{1}^{TC}(x)=f_{2}^{TC}(x)=\frac{3}{4}\cdot \left( 1-\left|x-\frac{1}{2}\right| \right) . \end{aligned}$$
(42)

The rejection rate is present in both the marginal distributions and the joint one, so that the inverse of the rejection rate is a multiplicative constant in the density quotient. For the Sierpinski gasket, the joint distribution is the product of indicator functions of the complements of the Sierpinski sets listed above. From these distributions, we can calculate the density quotient by marginal integration.

The shapes of the constraints rule out an approach based on the winding stairs design. We are then left with the reweighting (shift or derange) and nearest-neighbour approaches. To proceed with the numerical experiments, we use a rejection method. We generate data using crude Monte Carlo sampling and reject feature realizations that do not satisfy the constraints. For the cut-one-corner and cut-two-corners domains, we start with a sample of size \(n=10240\) on the \([0,1]^2\) full Cartesian domain. After rejection of the realizations falling outside the domain, we are left with samples of sizes \(n=8938\) and \(n=7690\), respectively, for the cut-one-corner and cut-two-corners domains. These numbers are in line with the theoretical acceptance rates of \(\nicefrac {7}{8}\) and \(\nicefrac {3}{4}\), respectively. For the Sierpinski gasket, the first step in the rejection process retains half of the observations, while each further step retains \(\nicefrac {3}{4}\) of the observations. Hence after five iterations, we reach a sample size of \(n\approx \frac{1}{2}\cdot \left( \frac{3}{4}\right) ^{4}n_{0}\), where \(n_0\) is the starting sample size. In this experiment, we start with an initial sample size of \(n_0=50000\) and, after rejection, we are left with \(n=7932\) observations.

Fig. 9

Sierpinski marginal density derivation: evolution of the marginal integral with rejections

Figure 9 shows the empirical marginal densities of the features. Iteration \(k=1\) removes \(\nicefrac {1}{2}\) of the mass, and each further iteration \(k>1\) removes \(\nicefrac {1}{4}\) of the remaining mass. Rescaling these curves yields the marginal probability densities of both factors, which are piecewise linear. The two-corner design and the Sierpinski gasket both satisfy \({{\,\textrm{corr}\,}}(X_1,X_2)=-0.5\) and have roughly the same input variances. In a linear model, only the input variances and covariances enter into the computation of total effects; thus, we expect similar values of the normalized total indices for the two-corner and the Sierpinski gasket cases.

Table 4 Estimates of the normalized total effects for the linear model \(Y=X_1+\beta X_2\) (S&R: shift-and-reweight, NN: nearest-neighbour method)

Table 4 reports the results obtained with the shift-and-reweight and the nearest-neighbour approaches. The resulting values indicate only small differences in the estimates. Similar values are also obtained with the derange-and-reweight estimator (not reported). Overall, the estimators exhibit a similar performance. The results in Table 4 can then be used to obtain some further indications about the behavior of total indices under constraints. For the fully connected domain, it is \(T_{1}+T_{2}=1\), as expected for a linearly additive model with independent inputs. However, we cannot expect this equality to hold when the inputs are constrained, as constraints make the inputs statistically dependent. For instance, for the two-corners domain and \(\beta =-2\) we find \(T_{1}+T_{2}=0.56\), while for \(\beta =2\), we find \(T_{1}+T_{2}=1.26\). Similar values are obtained for the Sierpinski gasket. Results for the case \(\beta =0\) are also interesting. The total effect of \(X_2\) is zero which correctly asserts that \(X_2\) is inactive. However, the fact that the total effect of \(X_1\) is less than 1 is due to input dependence, and must not be erroneously interpreted as the presence of an interaction in the model.

8.7 Comparison tests with machine learning approaches

The previous experiments have focused on a computer modeling setting. Indeed, by construction, the estimators we discussed are well suited for a setting in which the model and data distributions are known to the analyst. However, we have seen in Sect. 6 that there are links between our estimators and feature importance measures of the machine learning literature. In the present section, we present experiments aimed at shedding initial light on the behaviour of our estimators in comparison with the remove-and-retrain estimator discussed by Hooker et al. (2021) and Williamson et al. (2021, 2023). We also aim to address the performance of our estimators for problems of larger dimensionality than in the previous sections.

As a first test case, we consider the following version of the Bratley et al. (1992) function defined as:

$$\begin{aligned} g(x)=\sum _{i=1}^d (-1)^i \prod _{j=i}^d x_{j}, \end{aligned}$$
(43)

with inputs constrained on the simplex defined by (38) with \(d=10\). For this test case, the values of the total indices can again be obtained analytically with symbolic calculations in Maple. We start with a sample of size 100,000 generated using crude Monte Carlo with independent inputs and employ the rejection method to implement the simplex constraint. For the remove-and-retrain estimator, we proceed as follows. We use an artificial neural network (with 10 hidden layers) as the machine learning model. The accuracy of the neural network is high, with a coefficient of determination \(R^2\) close to unity. Following the algorithm, we then remove each feature in turn, retrain the neural network and measure the difference in \(R^2\); the resulting value is an estimate of the total index of the corresponding feature. The analytical values, as well as the results obtained with the derange-and-reweight, nearest-neighbour and remove-and-retrain methods, are reported in Table 5.
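For reference, the remove-and-retrain procedure just described can be sketched as follows (a hypothetical implementation using scikit-learn's MLPRegressor as a stand-in for the neural network; the layer widths, the train/test split and all hyperparameters are our own illustrative assumptions).

```python
# Hedged sketch of the remove-and-retrain procedure: fit the full model, then
# remove each feature in turn, retrain, and record the drop in R^2.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def remove_and_retrain(X, y, make_model, test_size=0.25, seed=0):
    """Normalized total-index estimates as R^2 drops after removal and retraining."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
    r2_full = r2_score(y_te, make_model().fit(X_tr, y_tr).predict(X_te))
    drops = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        keep = [j for j in range(X.shape[1]) if j != i]
        reduced = make_model().fit(X_tr[:, keep], y_tr)
        drops[i] = r2_full - r2_score(y_te, reduced.predict(X_te[:, keep]))
    return drops

# Ten hidden layers as in the text; the layer width (64) is an assumption of ours.
make_model = lambda: MLPRegressor(hidden_layer_sizes=(64,) * 10, max_iter=2000, random_state=0)
# T_hat = remove_and_retrain(X, y, make_model)   # X, y: constrained sample and outputs
```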

Table 5 Performance results for the 10-dimensional model described in (43) (Ana.: analytical results, D&R: derange and reweight, NN: derange with nearest-neighbour, MM: metamodel fit using a feed-forward neural network with 10 hidden layers)

Let us now analyze the results. For the derange-and-reweight estimator (D&R), we observe estimates very close to the analytical values and an overall estimation time of 28 s (last column). The estimates obtained with the nearest-neighbour design (NN) are distorted. Moreover, the estimation is notably time-consuming because of the nearest-neighbour search. This deteriorated performance shows that the increased dimensionality negatively impacts the estimator. Finally, the estimates of the remove-and-retrain estimator (MM) are accurate, although negative values are obtained for features \(X_1\), \(X_2\) and \(X_3\), whose analytical values are close to zero. Overall, the analysis takes 323 s, due to the retraining of the machine-learning model.

We then report results for a second test case, namely,

$$\begin{aligned} y=\sin \left( \pi \sum _{i=1}^{d} x_i\right) , \end{aligned}$$
(44)

with inputs still constrained on the simplex defined by (38) with \(d=10\). By construction, the total indices of the first eight features are equal, as are those of the last two. Analytically, we find \(T_1=T_2=\dots =T_8=0.595\) and \(T_9=T_{10}=0.332\).

To study the performance of the estimators, we generate a first sample of size \(n=2,000\) and, after rejecting the realizations violating the constraint, we are left with about half of the realizations (987). Results are presented in Table 6.

Table 6 Performance for the 10-dimensional model described in (44) (Ana.: analytical results, D&R: derange and reweight, NN: derange with nearest-neighbour, MM: metamodel fit using a feed-forward neural network with 10 hidden layers)

The derange-and-reweight approach (D&R) provides estimates close to the analytical values, while the nearest-neighbour estimates (NN) are again distorted: they are equal to about half of the analytical values for all inputs. However, they still signal that \(X_9\) and \(X_{10}\) are less important than the other eight features. The remove-and-retrain estimator (MM) shows a poor performance in this case, with several negative estimates, and fails to indicate the true feature importance. We believe the reason is the poor fit of the machine-learning model: the neural network achieves an \(R^2\) of about \(7\%\), signaling that the emulator does not capture the input–output mapping in this test case.

In conclusion, the results of these two experiments advise against the nearest-neighbour design as the input space dimension increases, because it produces biased estimates. The main advantage of the remove-and-retrain estimator is that it can be applied both in a simulation and in a purely data-driven context, because it does not require knowledge of the density quotients. However, it has to be used with caution, because its performance strongly depends on having an accurate machine-learning model, whose training time might make the estimation computationally expensive.

8.8 Application: the flood model of De Rocquigny (2006)

In this section, we apply the estimators to a realistic example, the flood model in de Rocquigny (2006) and Chastaing et al. (2012). The model calculates the maximum annual overflow, given eight input features (Table 7).

Table 7 Flood model feature list
Fig. 10

Sensitivity results for the flood model (error bars represent the \(95\%\) confidence band)

We assume the same dependence structure in the input features as in Chastaing et al. (2012). The correlation between the pair of features (1,2) is set to 0.5, and the correlations between pairs (3,4) and (7,8) to 0.3 each, via pairwise Gaussian copulas. The density quotients for each pair are given in (16).
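For a pair of features linked by a Gaussian copula with parameter \(\rho \), the density quotient coincides with the Gaussian copula density evaluated at the marginal cdf values. The following sketch illustrates this standard identity (our own illustration; we assume, without verifying it here, that this is what the expression in (16) evaluates).

```python
# Sketch (our own illustration of the standard Gaussian-copula identity): the
# density quotient of a pair linked by a Gaussian copula with parameter rho is
# the copula density evaluated at the marginal cdf values.
import numpy as np
from scipy.stats import norm

def gaussian_copula_density(u, v, rho):
    """c_rho(u, v) = phi_rho(z_u, z_v) / (phi(z_u) * phi(z_v))."""
    zu, zv = norm.ppf(u), norm.ppf(v)
    return np.exp((2 * rho * zu * zv - rho**2 * (zu**2 + zv**2)) / (2 * (1 - rho**2))) \
           / np.sqrt(1 - rho**2)

def density_quotient(xi, xj, cdf_i, cdf_j, rho):
    """iota(x_i, x_j) = f_ij / (f_i * f_j) for a pairwise Gaussian copula."""
    return gaussian_copula_density(cdf_i(xi), cdf_j(xj), rho)
```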

We calculate total indices with the derange-and-reweight approach of Theorem 8 and with the nearest-neighbour approach, fixing a budget of \(B=9000\) model evaluations. The basic sample size is then \(n=1000\) Monte Carlo realizations. The correlation structure in the basic sample is implemented via the Iman–Conover method (Iman and Conover 1982; Mainik 2015). Main effects are estimated from this basic sample using a discrete cosine transformation (Plischke 2012) with 8 harmonics. A jackknife is used to derive confidence bounds. For the nearest-neighbour approach, a Monte Carlo sample of size \(n=9000\) (same budget) is used. Because of the different scales of the inputs, we standardize the input sample using the empirical standard deviations as scaling factors. Figure 10 displays the results in the form of a barplot, reporting the main and total effects for each feature, with error bands on top of the bars. The error bands for the main effects and for the shift/derangement approach are calculated as 1.96 times the square root of the plug-in variance estimates, using a normal approximation for a \(95\%\) confidence band. Regarding the feature ranking, Feature 6 (\(H_d\)) is identified as the most important, followed by Features 1 (Q), 3 (\(Z_v\)), 2 (\(K_s\)) and 7 (\(C_b\)); the remaining features play a minor role. This result is in accordance with the findings in Chastaing et al. (2012). The main effects and the total effects estimated with the derange-and-reweight approach are close to one another. Because Feature 6 is stochastically independent of the remaining features, this closeness signals that \(H_d\) is not involved in relevant interactions. In contrast, for Features 1 and 2 the main and total effects differ noticeably, with the total effects larger than the main effects. Also, if ranked according to main effects, Feature 3 would be second most important, switching places with Feature 1. Overall, the derange-and-reweight approach, which evaluates the model at the mixed realizations, shows results comparable to those of the nearest-neighbour approach. However, the nearest-neighbour approach seems to exhibit a bias for the least important inputs: by the theoretical results of Devroye et al. (2018), for dimensions larger than four the nearest-neighbour estimator bias dominates the estimator variance. One insight gained from this application is to use both a derangement and a nearest-neighbour approach, when possible, so that the estimates of the two methods can be cross-checked, the former being less affected by bias as dimensionality increases.

9 Final remarks

Estimating total effects under feature dependence and constraints is a challenging task. We have proposed a set of estimators that accommodate increasingly complex constraints. For all estimators we have addressed both theoretical and numerical aspects. We have first analyzed the performance of a winding stairs approach that relies on a Knothe–Rosenblatt transformation for the case of dependent features. While the estimator accommodates a broad family of input dependences, it becomes inapplicable in the presence of non-Cartesian domains.

We have then studied estimators based on pairing Jansen’s design with a reweighting factor. We have formulated a U-statistic estimator, for which a central limit theorem is immediately derived. To abate the computational burden, we have proposed two alternatives based on shifts and derangements, for which we have proven central limit theorems. We have also considered a nearest-neighbour approach for which, however, theoretical results seem out of reach. We have tested the behavior of the estimators through numerous experiments with feature constraints of increasing complexity.

We have also studied the connection of these approaches with the calculation of feature importance measures in the machine-learning literature. On the theoretical side, we have derived the link between total indices under constraints and Breiman's permutation feature importance measures. We have compared our approach with the remove-and-retrain method of Williamson et al. (2023), whose importance measures are, in fact, total indices. We have seen that our method and remove-and-retrain produce similar results when the machine learning model captures the simulation response well. However, when the performance of the machine learning model is poor, the remove-and-retrain method fails to estimate the total Sobol' indices. This investigation is, however, preliminary, and future research is needed to reach more definitive results. From this perspective, future research aims at further studying the performance of the method in a data-driven context. We expect some challenges to emerge. On the one hand, the nearest-neighbour approach is expected to suffer from the curse of dimensionality, as our experiments have shown. On the other hand, estimation accuracy is expected to depend on the accurate calculation of density ratios, which are instead known in the simulation context. A starting point here is the work of Fisher et al. (2019), where density ratios are approximated.