Is VARS more intuitive and efficient than Sobol' indices?

The Variogram Analysis of Response Surfaces (VARS) has been proposed by Razavi and Gupta as a new comprehensive framework in sensitivity analysis. According to these authors, VARS provides a more intuitive notion of sensitivity and it is much more computationally efficient than Sobol' indices. Here we review these arguments and critically compare the performance of VARS-TO, for total-order index, against the total-order Jansen estimator. We argue that, unlike classic variance-based methods, VARS lacks a clear definition of what an "important" factor is, and show that the alleged computational superiority of VARS does not withstand scrutiny. We conclude that while VARS enriches the spectrum of existing methods for sensitivity analysis, especially for a diagnostic use of mathematical models, it complements rather than substitutes classic estimators used in variance-based sensitivity analysis.


Introduction
Sensitivity analysis (SA) explores how uncertainty in the output of a model (numerical or otherwise) can be apportioned to different sources of uncertainty in the model input [1]. SA is especially needed when complex models, which often formalize partially known processes and include non-linear relations, are used to guide policies in the real world. This is generally the case of models in the Water Sciences domain, e.g. on crop water requirements, water availability under climate change, flood forecasting, surface runoff or precipitation and evaporation processes [2][3][4][5]. The uncertainties in these models might be either parametric (i.e. exact values for parameters might be unknown, there might be errors in the measurement) or structural (i.e. lack of knowledge on the underlying processes, multiple ways of modeling the same phenomenon), and their combined effect on the model output should be understood to guarantee a robust inference for policy-making. In this context, SA jointly with uncertainty analysis is regarded as an unavoidable step to ensure the quality of the modeling process [6][7][8][9][10][11].
In SA, as in all fields of computational research, different strategies and methods compete to establish themselves as "good", "recommended" or "best" practices. While variance-based methods and Sobol' indices are considered to belong to the class of recommended methods [12,13], other approaches have been proposed to complement or overcome their limitations (i.e. entropy-based methods [14], the δ measure [15], the Kuiper metric [16], or the PAWN index [17,18]). One of the most recent competitors is the Variogram Analysis of Response Surfaces (VARS), proposed by Razavi & Gupta [19,20]. According to Google Scholar and as of September 4 2020, the two foundational VARS papers have been cited a total of 84 times, evidencing their uptake by the modeling community. VARS seems to have been especially embraced by hydrologists and water scientists [21][22][23][24].
Razavi & Gupta [19,20] report that VARS outperforms Sobol' indices in two main aspects:
1. It provides a more intuitive assessment of sensitivities and of the importance of model inputs in determining the model output.
2. It computes the total-order effect with a much higher computational efficiency (up to two orders of magnitude).
In the present work we explore these results and benchmark VARS against one of the best estimators of Sobol' indices, that of Jansen [25]. Before engaging in the discussion, we briefly recall hereafter some useful formulae needed to understand the two approaches.

Sobol' indices
The apparatus of variance-based sensitivity indices, described by Sobol' [26] and extended by Homma & Saltelli [27], is currently considered the recommended practice in SA [13]. For a model of k factors y = f(x), x = (x_1, x_2, ..., x_k) ∈ R^k, the first-order sensitivity index S_i can be written as

$$S_i = \frac{V_{x_i}\left[E_{x_{\sim i}}(y \mid x_i)\right]}{V(y)} \quad (1)$$

The inner mean in Equation 1 is taken over all-factors-but-x_i (x_{∼i}), while the outer variance is taken over x_i. V(y) is the unconditional variance of the output variable y. When the factors are independent, S_i can be defined as a first-order term in the variance decomposition of y:

$$V(y) = \sum_i V_i + \sum_i \sum_{j>i} V_{ij} + \ldots + V_{1,2,\ldots,k}\,, \qquad S_i = \frac{V_i}{V(y)} \quad (2)$$

S_i lends itself to be expressed in plain English as the fractional reduction in the variance of y which would be obtained on average if x_i could be fixed. This is because E_{x_i}[V_{x_{∼i}}(y|x_i)] is the average variance that would be left after fixing x_i to a given value in its uncertainty range, so that V_{x_i}[E_{x_{∼i}}(y|x_i)] must be the average reduction in variance, given the decomposition

$$V(y) = V_{x_i}\left[E_{x_{\sim i}}(y \mid x_i)\right] + E_{x_i}\left[V_{x_{\sim i}}(y \mid x_i)\right] \quad (3)$$

While V_{x_{∼i}}(y|x_i) can be greater than V(y) for specific values of x_i, E_{x_i}[V_{x_{∼i}}(y|x_i)] is always smaller than V(y) as per Equation 3.
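The decomposition in Equation 3 can be checked numerically. The sketch below (a minimal brute-force Monte Carlo illustration of ours, with a toy function and sample sizes of our own choice) estimates the two conditional terms by a double loop and verifies that they sum to V(y):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x1, x2):
    # toy non-additive test function (our own choice, not from the paper)
    return x1 + x2**2 + x1 * x2

n_outer, n_inner = 2000, 2000
x1 = rng.uniform(size=n_outer)

# for each fixed value of x1, estimate the conditional mean and variance over x2
cond_mean = np.empty(n_outer)
cond_var = np.empty(n_outer)
for j, v in enumerate(x1):
    y = f(v, rng.uniform(size=n_inner))
    cond_mean[j] = y.mean()
    cond_var[j] = y.var()

V_E = cond_mean.var()  # V_{x1}[E(y|x1)], the numerator of S_1
E_V = cond_var.mean()  # E_{x1}[V(y|x1)], the variance left on average after fixing x1
V_y = f(rng.uniform(size=10**6), rng.uniform(size=10**6)).var()

S_1 = V_E / V_y  # about 0.48 analytically for this toy function
```

The two terms sum to the unconditional variance up to Monte Carlo noise, which is exactly the reason why V_{x_i}[E_{x_∼i}(y|x_i)] can be read as an average variance reduction.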
Another useful variance-based measure is the total-order index T_i [27], which measures the first-order effect of a model input jointly with its interactions up to the k-th order:

$$T_i = \frac{E_{x_{\sim i}}\left[V_{x_i}(y \mid x_{\sim i})\right]}{V(y)} = 1 - \frac{V_{x_{\sim i}}\left[E_{x_i}(y \mid x_{\sim i})\right]}{V(y)} \quad (4)$$

The index is called "total" because it includes all terms in the variance decomposition (see Equation 2) whose index set contains i. For instance, for a model with three factors, T_1 = S_1 + S_{1,2} + S_{1,3} + S_{1,2,3}, and likewise for T_2 and T_3. The meaning of T_i is the fraction of variance that would remain on average if x_i were left to vary over its uncertainty range while all other factors are fixed. Note that the theory of variance-based measures is flexible enough to accommodate "group" or "set" sensitivities. These are simply the first-order effect of a set of factors: if u is the set of factors (x_1, x_2), then S_u = S_1 + S_2 + S_{1,2}.
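The total-order index can be estimated with the Jansen [25] formula used throughout this paper, T_i ≈ (1/2N) Σ_j (f(A)_j − f(A_B^(i))_j)² / V(y). Below is a minimal sketch (ours; the toy function is our own choice, and the A/B/A_B^(i) matrix names anticipate the sampling-design notation used later in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def model(X):
    # toy function with an x1-x2 interaction; x3 has no effect (our own choice)
    return X[:, 0] + X[:, 1] ** 2 + X[:, 0] * X[:, 1]

N, k = 2**14, 3
A = rng.uniform(size=(N, k))
B = rng.uniform(size=(N, k))
y_A = model(A)
V_y = np.concatenate([y_A, model(B)]).var()

T = np.empty(k)
for i in range(k):
    AB_i = A.copy()
    AB_i[:, i] = B[:, i]  # all columns from A except the i-th, taken from B
    # Jansen estimator: T_i = E[(y_A - y_AB_i)^2] / (2 V(y))
    T[i] = np.mean((y_A - model(AB_i)) ** 2) / (2 * V_y)
```

Since x_3 plays no role in the toy function, its estimated total-order index is exactly zero, which is the "factors fixing" reading of T_i discussed below.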

VARS
VARS is based on variogram analysis to characterize the spatial structure and variability of a given model output across the input space [19,20]. Let us again consider a function of k factors y = f(x), x = (x_1, x_2, ..., x_k) ∈ R^k. If x_A and x_B are two generic points separated by a distance h, then the variogram γ(h) is calculated as

$$\gamma(h) = \frac{1}{2} V\left[y(x_A) - y(x_B)\right] \quad (5)$$

and the covariogram C(h) as

$$C(h) = COV\left[y(x_A), y(x_B)\right] \quad (6)$$

Given that V[y(x_A)] = V[y(x_B)] = V(y), it follows that

$$\gamma(h) = V(y) - C(h) \quad (7)$$

where V and COV are the variance and covariance operators respectively, and γ(h) is defined by the interval h separating x_A and x_B. The practical formula for computing a multidimensional variogram is

$$\gamma(h) = \frac{1}{2N(h)} \sum_{(A,B)} \left[y(x_A) - y(x_B)\right]^2 \quad (8)$$

where the sum is extended to all N(h) couples of points separated by h, and where the term E²[y(x_A) − y(x_B)] in the expression of the variance as the expectation of the square minus the square of the expectation, V(.) = E(.²) − E²(.), is assumed to be zero. Razavi & Gupta [19,20] suggest some integral measures based on the variogram γ, i.e. the integrated variogram Γ(H_i):

$$\Gamma(H_i) = \int_0^{H_i} \gamma(h)\, dh \quad (9)$$

and recommend the use of IVARS_10, IVARS_30, and IVARS_50 (computed for H equal to 10%, 30%, and 50% of the factor range respectively) to explore larger fractions of the variation space of the function, with IVARS_50 corresponding to the entire interval (in variogram analysis, the maximum meaningful range is one half of the factor range [28]).
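These quantities translate into code directly. The sketch below (a schematic illustration of ours, with a one-dimensional test function of our own choice, not the authors' implementation) computes the empirical variogram from all pairs of grid points a distance h apart and integrates it, trapezoid-style, to obtain an IVARS-type measure:

```python
import numpy as np

def empirical_variogram(y, dx, h):
    """Half the mean squared difference over all pairs of grid points a
    distance h apart, for y sampled on a regular grid with spacing dx."""
    lag = int(round(h / dx))
    d = y[lag:] - y[:-lag]
    return 0.5 * np.mean(d ** 2)

def integrated_variogram(y, dx, H):
    """Gamma(H) = integral of gamma(h) from 0 to H, trapezoidal rule,
    with gamma(0) = 0 by definition."""
    hs = np.arange(dx, H + dx / 2, dx)
    g = np.array([0.0] + [empirical_variogram(y, dx, h) for h in hs])
    return dx * (g[:-1] + g[1:]).sum() / 2

dx = 0.001
x = np.arange(0.0, 1.0 + dx / 2, dx)
y = np.sin(2 * np.pi * x)  # our own one-dimensional test function

gamma_025 = empirical_variogram(y, dx, 0.25)
ivars_50 = integrated_variogram(y, dx, 0.5)  # integral over half the factor range
```

For this periodic function the variogram at very short lags is close to zero and grows with h, which is the scale-dependence that the IVARS measures summarize.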
Of important practical use, as we shall see, is the directional variogram along one of the axes of the factors' space, which is computed on all couples of points spaced h_i apart along the x_i axis, with all other factors kept fixed at a point x*_{∼i}:

$$\gamma_{x^*_{\sim i}}(h_i) = \frac{1}{2} V\left[y(x_1, \ldots, x_i + h_i, \ldots, x_k) - y(x_1, \ldots, x_i, \ldots, x_k)\right] \quad (10)$$

with the corresponding directional covariogram

$$C_{x^*_{\sim i}}(h_i) = COV\left[y(x_1, \ldots, x_i + h_i, \ldots, x_k),\; y(x_1, \ldots, x_i, \ldots, x_k)\right] \quad (11)$$

so that, in analogy with Equation 7,

$$\gamma_{x^*_{\sim i}}(h_i) = V_{x_i}\left(y \mid x^*_{\sim i}\right) - C_{x^*_{\sim i}}(h_i) \quad (12)$$

Note that the difference in parentheses in Equation 10 is what Saltelli et al. [29] call a step along the x_i direction, which can also be used to compute the total sensitivity index T_i.
The equivalent of Equation 8 for the case of the unidirectional variogram γ(h_i) is

$$\gamma_{x^*_{\sim i}}(h_i) = \frac{1}{2N(h_i)} \sum_{(A,B)} \left[y(x_A) - y(x_B)\right]^2 \quad (13)$$

where x*_{∼i} is a fixed point in the space of non-x_i and the sum runs over all N(h_i) couples of points spaced h_i apart along the x_i axis. In order for VARS to compute the total-order index T_i (labeled as VARS-TO by Razavi & Gupta [19]), the authors suggest taking the mean value across the factors' space on both sides of Equation 13, thus obtaining

$$E_{x^*_{\sim i}}\left[\gamma_{x^*_{\sim i}}(h_i)\right] = E_{x^*_{\sim i}}\left[V_{x_i}(y \mid x^*_{\sim i})\right] - E_{x^*_{\sim i}}\left[C_{x^*_{\sim i}}(h_i)\right] \quad (14)$$

which can also be written as

$$E_{x^*_{\sim i}}\left[\gamma_{x^*_{\sim i}}(h_i)\right] + E_{x^*_{\sim i}}\left[C_{x^*_{\sim i}}(h_i)\right] = E_{x^*_{\sim i}}\left[V_{x_i}(y \mid x^*_{\sim i})\right] \quad (15)$$

and therefore

$$T_i = \frac{E_{x^*_{\sim i}}\left[\gamma_{x^*_{\sim i}}(h_i)\right] + E_{x^*_{\sim i}}\left[C_{x^*_{\sim i}}(h_i)\right]}{V(y)} \quad (16)$$
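The STAR-VARS logic behind Equation 16 can be sketched as follows (a deliberately simplified implementation of ours, not the authors' code: only pairs one step apart enter the variogram and covariogram, so the estimates are rougher than in the full scheme, and the toy function is our own choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def model(X):
    # toy non-additive function (our own choice, not one of the paper's models)
    return X[:, 0] + X[:, 1] ** 2 + X[:, 0] * X[:, 1]

k, n_star, h = 2, 500, 0.05
grid = np.arange(h / 2, 1, h)            # cross-section points along one axis
centers = rng.uniform(size=(n_star, k))  # the star centers
V_y = model(rng.uniform(size=(10**5, k))).var()

T = np.empty(k)
for i in range(k):
    gam = np.empty(n_star)
    cov = np.empty(n_star)
    for v, c in enumerate(centers):
        X = np.tile(c, (grid.size, 1))
        X[:, i] = grid                   # vary x_i, keep x_~i fixed at the center
        y = model(X)
        d = y[1:] - y[:-1]               # pairs a distance h apart
        gam[v] = 0.5 * np.mean(d ** 2)        # directional variogram at lag h
        cov[v] = np.cov(y[1:], y[:-1])[0, 1]  # directional covariogram at lag h
    # Equation 16: T_i = (E[gamma(h_i)] + E[C(h_i)]) / V(y)
    T[i] = (gam.mean() + cov.mean()) / V_y
```

Averaging variogram plus covariogram over the star centers recovers, up to discretization bias, the mean conditional variance in Equation 15, and hence an estimate of T_i.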

The issue of intuitivity and importance
Razavi & Gupta [19] benchmark Sobol' indices and VARS against several functions with different response structures. The fact that integrated variogram measures such as IVARS 10 , IVARS 30 and IVARS 50 are able to differentiate sensitivities as a function of scale H while Sobol' indices do not is taken as proof of the limitations of the latter. According to the authors, this endows VARS with a more intuitive appraisal of sensitivities.
Razavi & Gupta [19] construct their case using several functions, which we reproduce hereafter. In Fig. 1a, Sobol' indices do not differentiate f_3 from f_1, whereas VARS points towards f_3 as the most sensitive function. In Fig. 1b, variance-based methods equate f_1 with f_2 because they have identical variance. According to Razavi & Gupta [19, p. 428], this "runs counter to our intuitive notion of sensitivity" given the multimodality of f_2. If VARS is used, f_2 is identified as more sensitive than f_1. In Fig. 1c, Sobol' indices do not detect the periodicities of f_2, which Razavi & Gupta suggest might be important in evaluating the impact of a factor from the perspective of model calibration. In Fig. 1d, variance-based methods regard f_2 as more sensitive than f_1. Razavi & Gupta [19, p. 433] argue that this is "contrary to intuition" because the effect of f_1 is more complex (bi-modal). IVARS_10 and IVARS_30, in contrast, characterize f_1 as more sensitive than f_2.
It is apparent that for Razavi & Gupta [19] a sensitivity measure should be able to appraise the function structure. Our impression is that this perception of sensitivity is relevant to specific contexts, i.e. a diagnostic setting in which one is interested in the topology of a given function. However, the key lies in the definition of "importance" pointed to by VARS. In which sense is f 2 more important than f 1 in Fig. 1b, or f 3 more important than f 1 and f 2 in Fig. 1a? If SA is used in an information quality setting [30], when the aim is to determine which factor has the highest potential to reduce the uncertainty in the inference (i.e. how much is gained by discovering the true value of an uncertain factor), these functions might be regarded as equally sensitive. The same applies to Fig. 1d: given that f 2 changes more decidedly over the interval range than f 1 , a larger reduction in uncertainty can be achieved by learning first about f 2 than about f 1 .
Given that SA quantifies the relative influence of each model input on the model output, the concept of sensitivity is ultimately linked to that of "importance". This is why it should be clear what we mean when we say that a model input is "important", or that a model output is very sensitive to a given model input. Variance-based methods meet this requirement by linking SA to statistical theory via ANOVA [31], thus defining SA as "the study of how the variance in the model output is apportioned to different sources of uncertainty in the model input" [12]. Variance-based measures such as Sobol' indices are well defined and associated with clear settings [32]:
1. Factors prioritization: the aim is to identify the single factor that, if determined (i.e., fixed to its true but unknown value), would lead to the greatest reduction in the variance of the output. This is met by the first-order sensitivity index (S_i).
2. Variance reduction: the aim is to identify the sets of factors (couples, triplets, and so on) leading to the reduction of the output variance below a given threshold, and this by fixing the smallest number of factors. This is achieved by using set (group) sensitivity indices.
3. Factors fixing: the objective is to identify factors which can be fixed anywhere in their range of variation without affecting the variance of the output. This is met by the total-order sensitivity index (T i ).
Variance-based methods clearly resolve what is meant by the "importance" of a factor. However, this is not as apparent in the case of VARS: if a decision needs to be taken based on the inference provided by a model, which of the variogram-based measures (IVARS_10, IVARS_30, IVARS_50, VARS-TO) should be used to characterize the factors' importance? And what does "importance" mean for VARS? Razavi & Gupta [19]'s statement that VARS is more "intuitive" than Sobol' indices is open to debate: intuition is in the eyes of the beholder, while solid criteria underpin the methodological quality of Sobol' indices.
This discussion leads to another aspect listed by Razavi & Gupta [19, p. 423] as a motivation for developing VARS, namely an "ambiguous characterization of sensitivity": "(...) different SA methods are based in different philosophies and theoretical definitions of sensitivity. The absence of a unique definition for sensitivity can result in different, even conflicting, assessments of the underlying sensitivities for a given problem." We argue that the source of ambiguity in sensitivity analysis is not the lack of a unifying theory, or the fact that many sensitivity measures are available, but the definition of "importance". Unless the analyst stipulates what she means when she says that a variable is important, different methods can be thrown at the model, resulting in different orderings of importance of the input variables, whereby the analyst could be tempted to cherry-pick the method most conforming to her own bias. By linking the definition of importance to clear settings, Sobol' indices resolve this quandary clearly and transparently, and provide end-users with a plain-English description of the results. This comes in handy when the receiver (customer) of the analysis is not another practitioner.
The expedient of producing functions where Sobol' indices look "wrong" is quite common. This approach was also taken by Liu et al. [14] and Pianosi & Wagener [17] with Liu's highly-skewed function y = x_1/x_2, where x_1 ∼ χ²(10) and x_2 ∼ χ²(13.978) (Fig. 2). The reader might wonder why one of the degrees of freedom is expressed with a two-digit precision and the other with a five-digit one. The reason is that, with these crisp numbers, T_1 and T_2 are identical and equal to 0.5462, while inspection of Fig. 2b should convince us that x_1 is more important than x_2 by reason of its longer tail. The Liu function is thus what Lakatos [34] would have called a monster example, designed on purpose to invalidate variance-based methods. However, based on the definition of "importance" underlying Sobol' indices, the fact that the two inputs are equally influential appears totally reasonable. We conclude by stating that, rather than hinting at what should or should not be intuitive, a sensitivity index should pin down its definition of importance in unambiguous terms.
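The near-equality of the two total-order indices of the Liu function can be verified numerically. The following sketch (ours) estimates T_1 and T_2 with the Jansen estimator by sampling the chi-square inputs directly; both estimates should come out close to the reported 0.5462:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2**15

def liu(x1, x2):
    # Liu et al.'s highly-skewed function
    return x1 / x2

# A, B and AB_i matrices sampled directly from the two chi-square distributions
A = np.column_stack([rng.chisquare(10, N), rng.chisquare(13.978, N)])
B = np.column_stack([rng.chisquare(10, N), rng.chisquare(13.978, N)])

y_A = liu(A[:, 0], A[:, 1])
V_y = np.concatenate([y_A, liu(B[:, 0], B[:, 1])]).var()

T = []
for i in range(2):
    AB_i = A.copy()
    AB_i[:, i] = B[:, i]  # all columns from A except the i-th, taken from B
    T.append(np.mean((y_A - liu(AB_i[:, 0], AB_i[:, 1])) ** 2) / (2 * V_y))
```

The skewness of the function makes the Monte Carlo estimates noisier than for well-behaved functions, but T_1 and T_2 converge to the same value.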

The issue of efficiency
Razavi & Gupta [19,20] claim that VARS-TO is much more computationally efficient than the Sobol' total-order index (up to two orders of magnitude). They make their case with three different models:
1. The six-dimensional response surface displayed in Fig. 1d, which is a purely additive model. VARS-TO accurately ranks total-order indices with just 60 simulations, beating Sobol' indices at > 6,000 simulations [19, pp. 435-436].
2. The HYMOD rainfall-runoff model.
3. The MESH land surface-hydrology model.
Do these examples truly prove that VARS-TO is between 20 and 100 times more efficient than Sobol' indices?

The case of the six-dimensional response surface model
To properly answer this question in the case of the six-dimensional model presented in Fig. 1d, we should first focus on the sampling design of VARS and Sobol' indices.
The computation of VARS relies on stars and is referred to as STAR-VARS by Razavi & Gupta [20]: the analyst first randomly selects N_star points across the factor space, i.e. via random numbers, Latin Hypercube Sampling (LHS) or Sobol' Quasi-Random Numbers (QRN). These are the "star centers" and their location can be denoted as s_v = (s_{v1}, ..., s_{vi}, ..., s_{vk}), where v = 1, 2, ..., N_star. Then, for each star center, a cross-section of points equally spaced Δh apart needs to be generated for each of the k factors, including and passing through the star center (Fig. 3, left-side plot). The cross-section for factor i is produced by fixing s_{v∼i} and varying s_i. Finally, for each factor, all pairs of points with h values of Δh, 2Δh, 3Δh and so on are extracted. The total computational cost of this design is N_t = N_star[k(1/Δh − 1) + 1]. Sobol' indices also rely on a star-based sampling strategy: they require a (N, 2k) base sample matrix, designed via LHS or QRN, in which the leftmost k columns are allocated to an A matrix and the rightmost k columns to a B matrix. Then, k extra (N, k) matrices A_B^(i) are created, where all columns come from A except the i-th, which comes from B. This design creates stars with centers in A and points a step away in each x_i direction (Fig. 3, right-side plot). The cost of this design for T_i is N_t = N(k + 1), where N is the row dimension of the base sample matrix.
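The two cost formulas can be compared directly. A small sketch (ours; the parameter values are illustrative, except that N = 128 reproduces the N_t = 896 used in the next subsection):

```python
def star_vars_cost(n_star: int, k: int, dh: float) -> int:
    """Total runs for STAR-VARS: each star has k cross-sections of
    (1/dh - 1) points each, plus the star center itself."""
    return int(round(n_star * (k * (1 / dh - 1) + 1)))

def sobol_cost(n: int, k: int) -> int:
    """Total runs for the A / A_B^(i) design used by the Jansen estimator."""
    return n * (k + 1)

cost_vars = star_vars_cost(10, 6, 0.1)  # 10 stars, k = 6, step 0.1 -> 550 runs
cost_sobol = sobol_cost(128, 6)         # base sample of 128 rows -> 896 runs
```

For a fixed budget, the two designs simply trade the number of stars against the number of base-sample rows; neither formula by itself says anything about accuracy.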
When the function or the model under study is fully additive, as in the six-dimensional surface model mentioned above (Fig. 1d), the computation of VARS-TO can be done with a single cross-section in the space of x*_{∼i} for each model input. VARS-TO thus becomes a first-order index de facto, as one model input varies while all the others remain constant. The natural term of comparison is thus the Sobol' first-order index, and not the total-order index. In that sense, for any function which behaves additively in at least one factor, i.e. F = f(x_i) + g(x_{∼i}), the first-order effect S_i can be computed very easily, since

$$S_i = \frac{V_{x_i}\left[f(x_i)\right]}{V(y)} \quad (17)$$

i.e. S_i is only a function of x_i and hence it can be computed with a single trajectory along x_i, irrespective of its position in x_{∼i}. We provide the proof in the Supplementary Materials.
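A numerical illustration of Equation 17 (our own sketch, with a toy function of our own choice): for a function additive in x_1, the variance of the output along a single cross-section in x_1, divided by the total variance, already yields S_1, and the cross-section variance does not depend on where the trajectory sits in the non-x_1 space.

```python
import numpy as np

rng = np.random.default_rng(4)

def F(X):
    # additive in x1: F = f(x1) + g(x2, x3) (our own toy function)
    return np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + X[:, 1] * X[:, 2]

x1 = np.linspace(0, 1, 1001)

def trajectory(c2, c3):
    # a single cross-section: x1 varies, (x2, x3) fixed at an arbitrary point
    return np.column_stack([x1, np.full_like(x1, c2), np.full_like(x1, c3)])

V_f = F(trajectory(0.37, 0.81)).var()  # = V_{x1}[f(x1)], Equation 17 numerator
V_y = F(rng.uniform(size=(10**6, 3))).var()
S_1 = V_f / V_y

# any other cross-section gives the same variance, because g only shifts
# the trajectory by a constant:
V_f_bis = F(trajectory(0.05, 0.99)).var()
```

This is why a single trajectory suffices for an additive model, and why 60 STAR-VARS runs on Fig. 1d should be compared against a first-order, not a total-order, Sobol' design.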
We used Equation 17 to compute S_i for the six-dimensional model, aiming at replicating the results of Razavi & Gupta [19, see their Fig. 6]. We observed that Sobol' indices accurately rank all model inputs at N_t = 896 (Fig. 4), contrasting with the N_t > 6,000 reported by Razavi & Gupta [19]. This example suggests that VARS-TO is indeed more efficient than first-order Sobol' indices when the model is fully additive, but less so than claimed by the authors.

The case of the HYMOD and MESH models
What happens when the model is non-additive? Then a single trajectory is not enough, and several cross-sections in the space of x*_{∼i} should be drawn to fully explore the hypercube. This is required in the case of the HYMOD and MESH models. But does the higher accuracy of VARS-TO in these settings truly evidence its superiority over Sobol' indices? In other words: can the results obtained from these two specific examples be extrapolated to conclude that VARS-TO is generally better than Sobol' indices? Puy et al. [38] recently showed that, once the benchmark settings are randomized (i.e. the functional form of the model, the sampling method, the model dimensionality, the total number of model runs, the fraction of active second and third-order effects, the distribution of the model inputs and the performance measure), VARS-TO loses much of its purported computational superiority: it only very slightly outperforms the Sobol' estimators of Jansen [25] and Janon/Monod [39,40] when there are serious constraints on the number of model runs that can be allocated to each model input (i.e. 2-10). At larger sample sizes, the performances of VARS-TO, Jansen and Janon/Monod are exactly the same [38]. In order to obtain a more comprehensive view of the performance of VARS-TO against Sobol' indices, we have reproduced the work by Puy et al. [38], this time also treating the design parameters of VARS-TO as uncertain, since, as shown in [43] for PAWN, the uncertainty in the design parameters of a sensitivity index might contribute appreciably to its volatility.
We compared the performance of VARS-TO and Jansen by treating the main benchmark settings listed in Table 1 as uncertain parameters described by probability distributions. We created a (2^12, k) sample matrix using Sobol' quasi-random numbers [44,45], in which each row was a sample point and each column an uncertain parameter. For v = 1, 2, ..., 2^12 rows, we computed VARS-TO and the Jansen total-order index according to the specifications set by N_star_v, h_v, ..., δ_v. The final model output was r_v, the correlation coefficient between the indices estimated by VARS-TO and Jansen (T̂_iv) and the "true" indices (T_iv), computed with a large sample size (N = 2^12) and the Jansen [25] estimator. The larger r_v, the better the estimation of the "true" sensitivity indices by VARS-TO or Jansen. We argue that this approach allows us to examine the accuracy of VARS-TO in a more comprehensive way given the enormous range of sensitivity problems that it can explore [38,41] (potentially more than 3 billion scenarios in this case). If VARS-TO beats Sobol' unequivocally, as asserted by Razavi & Gupta [19,20], its computational advantage should emerge here as well. The Supplementary Materials thoroughly detail the rationale and the execution of the experiment.
Fig. 5a shows that both Jansen and VARS-TO are very accurate, as the empirical distribution of r is heavily skewed towards 1. If anything, Jansen seems to outperform VARS-TO overall due to its slightly narrower distribution (95% CI 0.93-0.99, median of 0.98 for Jansen; 95% CI 0.87-0.99, median of 0.95 for VARS-TO). This is also apparent in Fig. 5b, with VARS-TO presenting more simulations with red/orange colors (approx. r ≤ 0.85). A closer look at the performance of both approaches reveals that Jansen maintains a higher median accuracy at higher dimensions (Fig. 6a), whereas VARS-TO confirms its slightly higher efficiency only when the number of runs that can be allocated per model input (N_t/k) is considerably constrained (< 50 in this case, Fig. 6b) [38]. VARS-TO also displays a larger volatility at N_t/k < 100 (Fig. 6b), suggesting that Jansen might become more stable in a larger number of sensitivity problems if the number of model runs per input is increased. These results rest on solid grounds, as the number of simulations for which we have computed the median N_t/k is almost identical for Jansen and VARS-TO (Fig. 6c). Overall, this shows that both estimators have a very similar efficiency and reliability.
We also computed Sobol' indices to assess which uncertain parameter most influences the performance of VARS-TO (Fig. 7). We observed that c. 30% of the variance in its performance is driven by the underlying probability distribution of the model inputs (φ), which appears as the most influential parameter. The other parameters are important through interactions, especially the functional form of the model (ε), the sampling method (τ), the model dimensionality (k) and the performance measure selected (δ), in that order. The proportion of second and third-order effects (k_2, k_3) does not have any effect, which means that VARS-TO is very robust against non-additivities. Compared to Jansen, VARS-TO significantly underperformed when the model inputs were normally distributed (e.g. when φ = 2, Figs. S5, 8). We observed that this was caused by high-order interactions between the sampling design of VARS-TO (Fig. 3, left side) and at least five different uncertain parameters: N_star, h, k, φ and τ.
To understand these interactions, let us first assume that we use random numbers (τ = 1) to sample our star centers, a strategy contemplated by Razavi et al. [42, Table 1]. The star centers are s_v = (s_{v1}, ..., s_{vi}, ..., s_{vk}), where v = 1, 2, ..., N_star. The higher N_star and k, the higher the chances that a value at the boundary of (0, 1) is included in s_v. Given that VARS-TO requires fixing s_{v∼i} while varying s_i at a step defined by h, this value at the periphery of (0, 1) will be repeated in [(1/h) − 1](k − 1) coordinates, which can be manifold if k is high and h is low. Once the model inputs are transformed into a normal distribution, it will turn into an extreme value and will disrupt both the model output and the computation of VARS-TO for the x_{∼i} parameters.
Let us now assume that we do not use random numbers to sample the star centers, but Sobol' Quasi-Random Number (QRN) sequences (τ = 2), also contemplated by Razavi et al. [42, Table 1] as a sampling strategy to compute VARS-TO. Although the design of QRN makes the sampling of star centers at the very periphery of (0, 1) very unlikely, cross-sections can indeed sample the boundary of the domain: for instance, if h = 0.1 and s_vi = 0.5, the cross-section of the x_i parameter will be the vector x_i = (0.1, 0.2, ..., s_vi, ..., 1). This will cause VARS-TO to crash, as a uniform value of 1 becomes infinity under a normal distribution. Even if the STAR-VARS algorithm is modified to prevent 1 from being sampled (e.g. by replacing 1 with 0.999, as we did in this study), some cross-sections will still sample values very close to 1 by design, especially if h is set at a small value. These values become extreme values under a normal distribution, disrupting again the computation of the model output and of the VARS-TO index (in this case, for the x_i parameter). We believe that this explains the high-order interactions involving N_star, h, k, φ and τ, which are non-negligible (Fig. 7). The effect of N_star and h on VARS-TO was not explored by Puy et al. [38] or Becker [41], who documented a slightly higher performance of VARS-TO against Jansen and Janon/Monod. Our results indicate that VARS-TO loses this marginal edge once these internal uncertainties, jointly with the uncertainties in the benchmark settings, are considered in the computations. Even if the use of VARS-TO is restricted to non-normal distributions (φ ≠ 2), its performance would still be slightly outdone by Jansen (95% CI 0.96-0.99, median of 0.99 for Jansen; 95% CI 0.94-0.99, median of 0.97 for VARS-TO).
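The mechanism can be seen in a two-line computation (our sketch, using scipy's inverse normal CDF for the quantile transform):

```python
import numpy as np
from scipy.stats import norm

# with h = 0.1 and a star center at s_vi = 0.5, the cross-section along x_i
# samples the boundary of the unit interval...
cross_section = np.linspace(0.1, 1.0, 10)

# ...and the quantile (inverse-CDF) transform to a normal distribution
# turns the boundary value into infinity, disrupting the model output:
z = norm.ppf(cross_section)  # z[-1] is inf

# clipping the boundary (e.g. replacing 1 with 0.999, as done in this study)
# still yields extreme values when h is small:
z_clip = norm.ppf(0.999)     # about 3.09 standard deviations
```

Any model evaluated at such points receives an infinite or extreme input, which contaminates the variogram and covariogram means entering VARS-TO.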

Conclusions
We have reviewed the Variogram Analysis of Response Surfaces (VARS), a new framework for sensitivity analysis developed by Razavi & Gupta [19,20] in Water Resources Research. We have specifically focused on the two aspects that, according to Razavi & Gupta [19,20], make VARS outperform Sobol' indices: its more intuitive appraisal of sensitivities and of the importance of model inputs, and its 20-100 times higher computational efficiency.
The claim that VARS is more intuitive than Sobol' indices can hardly be settled, as it is ultimately a matter of personal taste, disciplinary orientation and the objective of the modeling activity: a geographer working in a diagnostic model setting might indeed prefer VARS's approach to the model structure. We argue, however, that Sobol' indices provide a clearer description of what an "important" model input is, given their connection to statistical theory and ANOVA. The use of Sobol' first-order or total-order indices is associated with clear research settings, and their meaning can be easily conveyed to stakeholders or non-specialists, which adds to their transparency. VARS, in contrast, allows the analyst to zoom into the structure of the model output and assess its dependency on the model inputs through the integrated variograms IVARS_10, IVARS_30 and IVARS_50, as well as through the variance-based total-order effect VARS-TO. But what is their definition of importance? How useful is it for a stakeholder to know that a parameter is "important" under IVARS_10 and not as much under IVARS_30, for instance? Which IVARS measure should she finally rely on before making a decision for policy-making? If the answer is the summary measure VARS-TO, then it is unclear how VARS advances Sobol' indices given the reliance of VARS-TO on variance and covariance terms.
The purportedly much higher efficiency of VARS-TO is contentious. The observation that it is more than 100 times more efficient than Sobol' total-order indices rests on an exercise conducted with a fully additive model. In this context, VARS-TO is a first-order index and its performance should be compared against a first-order Sobol' estimator. We have shown that if Equation 17 is used to compute S_i, VARS-TO is actually "just" 15 times more efficient than Sobol' first-order indices. The advantage of VARS-TO over Sobol' first-order indices is nonetheless still remarkable, and suggests that VARS-TO should be the sensitivity measure of choice if 1) the model under analysis is known to be fully additive before its execution, and 2) computational efficiency is a priority. However, these conditions are unlikely to apply to models in the Earth and Environmental sciences, either because they encompass multiplicative terms and exponentials or because their mathematical complexity prevents the analyst from knowing their behavior before running the simulations.
The assertion that VARS-TO is at least 20 times more efficient than the Sobol' total-order index is not confirmed by our results. VARS-TO only very slightly outperforms Jansen when the number of model runs per model input is very small. However, it comes second to Jansen at increasing dimensionalities and in overall performance. Such results have been obtained after randomizing the benchmark settings, thus creating a set of sensitivity problems much wider than those represented by the HYMOD and MESH models, and by simultaneously examining the internal uncertainties of VARS-TO (N star , h). Its sampling design makes it especially vulnerable to the high-order interactions between the sampling method (τ ), the number of stars (N star ), the function dimensionality (k), the distance between pairs (h) and the underlying distribution of the model inputs (φ), especially if they follow a normal distribution.
VARS nonetheless represents a relevant addition to the family of sensitivity analysis methods, with the additional merit of having been developed to appraise the response surface of a model. Furthermore, the conceptual framework of VARS comes with a software described as "next-generation" by Razavi [46]. Time will tell whether VARS ends up unseating Sobol' indices as the recommended best practice in SA.

Acknowledgements
This work has been funded by the European Commission (Marie Skłodowska-Curie Global Fellowship, grant number 792178 to A.P.).

Code availability
Fully documented R code to reproduce the results of this paper is available at https://github.com/arnaldpuy/VARS_paper.
The following proves that if a function is of the additive type, its first-order index S_i can be computed with a single trajectory along the x_i axis, irrespective of its position in the space of x_{∼i}.
Proof: If the function is of the type F = f(x_i) + g(x_{∼i}) and we wish to compute S_i = V_{x_i}(E_{x_{∼i}}(F|x_i))/V(F), the numerator V_{x_i}(E_{x_{∼i}}(F|x_i)) can be written as

$$V_{x_i}\left(E_{x_{\sim i}}(F \mid x_i)\right) = E_{x_i}\left[E_{x_{\sim i}}(F \mid x_i)^2\right] - E_{x_i}\left[E_{x_{\sim i}}(F \mid x_i)\right]^2 \quad (7)$$

The inner mean E_{x_{∼i}}(F|x_i) can be written in terms of f and g:

$$E_{x_{\sim i}}(F \mid x_i) = f(x_i) + \hat{g} \quad (8)$$

where we indicate as ĝ the average of g(x_{∼i}). We now compute the two terms in Equation 7; the first term is

$$E_{x_i}\left[(f(x_i) + \hat{g})^2\right] = E_{x_i}\left[f(x_i)^2\right] + 2\hat{g}\, E_{x_i}\left[f(x_i)\right] + \hat{g}^2 \quad (9)$$

and the second term is

$$E_{x_i}\left[f(x_i) + \hat{g}\right]^2 = E_{x_i}\left[f(x_i)\right]^2 + 2\hat{g}\, E_{x_i}\left[f(x_i)\right] + \hat{g}^2 \quad (10)$$

Substituting Equations 9-10 into 7 leads to

$$V_{x_i}\left(E_{x_{\sim i}}(F \mid x_i)\right) = E_{x_i}\left[f(x_i)^2\right] - E_{x_i}\left[f(x_i)\right]^2 = V_{x_i}\left(f(x_i)\right) \quad (11)$$

which proves that if the function F is additive in factor x_i, its sensitivity index only depends upon f.
All functions are covered by the general formula corresponding to g_3(x_3), given in Equation 12. To give an example, g_1(x_1) corresponds to Equation 12 when f is set to zero, while for g_2(x_2) c is zero. Thus we only offer the analytic expression for g_3(x_3), which is obtained from Equation 11.

2.2 The case of the HYMOD and MESH models

The rationale of our experiment
Here we describe the conceptual and methodological approach behind our assessment of the performance of VARS-TO [1,2] and the Jansen [3] estimator. We basically followed Becker [4] and Puy et al. [5], to which we direct the reader for more detailed information. The main rationale is the following:
1. The performance of sensitivity indices is usually benchmarked with a handful of test functions and across different sample sizes [1,2,6-9]. However, there are other factors that might also condition the behavior of sensitivity indices besides the model and the number of allocated runs:
(a) The sampling method used to create the sample matrix (i.e. random numbers, quasi-random numbers) [10].
(b) The underlying probability distribution of the model inputs.
(c) The dimensionality of the model.
(d) Its degree of non-additivity.
(e) The performance measure used (i.e. whether we assess how well the estimated indices approach the "true" indices, how well the estimator ranks the parameters, or how well the estimator ranks the most important parameters only) [4,5].
(f) The design parameters (i.e. the number of stars N star or the value of h for VARS-TO).
2. By treating these factors as uncertain parameters we can check the accuracy of sensitivity indices in a comparatively much larger range of sensitivity settings. This provides a much more thorough picture of their behavior and a clearer assessment of their advantages and limitations [5]. In a sense, this approach simulates what happens when all the forking paths leading to the computation of a sensitivity index are walked at once [13,14].
3. To follow this approach, we described these uncertain parameters with probability distributions based on the available literature (Fig. S1):
(a) N_star was defined based on the range of star centers used by Razavi & Gupta [1,2] in their examples.
(b) h was defined based on the values used by Razavi & Gupta [1,2].
(c) k was defined to explore the behavior of VARS-TO and Jansen at both small and medium dimensionalities. For k > 50, see Puy et al. [5].
(d) ε sets the seed of the random numbers required to define the test function. Our test function follows the metafunction approach of Becker [4]: it allows test functions to be generated by randomly combining p univariate functions in a multivariate function of dimension k. Our metafunction includes cubic, discontinuous, exponential, inverse, linear, non-monotonic, periodic, quadratic and trigonometric functions, as well as a no-effect function (Fig. S2). See Becker [4] and Puy et al. [5] for more details.
(e) τ was described to check the performance of VARS-TO and Jansen when the base sample matrix is constructed with random numbers (τ = 1) or with Sobol' [15,16] Quasi-Random Numbers (τ = 2).
(f) φ aimed at observing how the estimators behave under uniform, normal, skewed and random distributions (Fig. S3).
(g) k 2 was described to randomly activate between 50% and 100% of the existing pairwise interactions of the test function.
(h) k_3 was described to randomly activate between 30% and 100% of the three-wise interactions of the test function.
(i) δ checked how the estimators behave when the performance measure varies. Let T_i and T̂_i be the "true" and the estimated sensitivity indices. We checked how well T̂_i correlated with T_i (δ = 1), how well the ranks of T̂_i correlated with those of T_i (δ = 2), and how well the most important ranks of T̂_i correlated with those of T_i (δ = 3).
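The three performance measures can be written compactly. A sketch (ours) of δ = 1, 2, 3, assuming Pearson correlation on the indices, on their ranks, and on an indicator of the top-ranked parameters respectively (the exact definition of the third measure follows Becker [4] only loosely here, and the cutoff `top` is our own illustrative choice):

```python
import numpy as np
from scipy.stats import rankdata

def performance(T_true, T_hat, delta, top=4):
    """Correlation-based measures between true and estimated indices."""
    T_true, T_hat = np.asarray(T_true), np.asarray(T_hat)
    if delta == 1:                # delta = 1: correlation of the indices
        return np.corrcoef(T_true, T_hat)[0, 1]
    r_true, r_hat = rankdata(-T_true), rankdata(-T_hat)
    if delta == 2:                # delta = 2: correlation of the ranks
        return np.corrcoef(r_true, r_hat)[0, 1]
    # delta = 3: only the 'top' most important parameters count; all
    # remaining parameters are lumped together at a common rank
    r_true = np.where(r_true <= top, r_true, top + 1)
    r_hat = np.where(r_hat <= top, r_hat, top + 1)
    return np.corrcoef(r_true, r_hat)[0, 1]

T_true = [0.50, 0.30, 0.15, 0.05, 0.0, 0.0]   # hypothetical "true" indices
T_hat = [0.48, 0.33, 0.10, 0.06, 0.02, 0.01]  # hypothetical estimates
```

A small estimation error in a low-ranked parameter barely moves δ = 3, while it can lower δ = 1, which is the point of examining several performance measures.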

The simulations
1. We created a (2^12, 2k) sample matrix using Sobol' Quasi-Random Numbers, where the k leftmost columns are allocated to an A matrix and the k rightmost columns to a B matrix. In these matrices, each row is a sample point and each column an uncertain parameter, distributed as in Fig. S1. We then added k (2^12, k) A_B^(i) matrices, where all columns come from A except the i-th, which comes from B.
2. For v = 1, 2, ..., 2^12 rows, the algorithm proceeds as follows:
(a) It creates three sample matrices: i. A STAR-VARS matrix for VARS-TO, sized according to N_star_v, h_v and k_v. ii. A base sample matrix for the Jansen estimator, with its row dimension set so that the total number of runs matches that of the STAR-VARS matrix as closely as possible. This forced all comparisons between VARS-TO and Jansen to be conducted on an identical or almost identical total number of runs; Fig. S4 shows that the total number of model runs allocated to each estimator in each simulation differed mostly by 1-10 model runs, and that the largest difference was of 25 model runs only. iii. A large sample matrix to compute T_iv, with N_t = 2^12(k_v + 1).
(b) It transforms the model inputs of all three matrices into the probability distribution set by φ v .
(c) It applies the metafunction to all three matrices simultaneously, with its functional form, degree of active and third-order effects defined by v , k 2v and k 3v respectively.
(d) It computes the estimated sensitivity indicesT iv for VARS-TO and Jansen, and the "true" sensitivity indices T iv from the large sample matrix (see above).
(e) It assesses the performance of VARS-TO and Jansen according to δ_v: if δ = 1, it checks how well T̂_iv correlates with T_iv; if δ = 2, how well the ranks of T̂_iv correlate with those of T_iv; and if δ = 3, how well the ranks of the most important parameters of T̂_iv correlate with those of T_iv. The resulting model output is the correlation coefficient r_v.
Figure S5: Scatterplots of the model inputs against the model output.