Decoding with confidence: Statistical control on decoder maps

In brain imaging, decoding is widely used to infer relationships between brain and cognition, or to craft brain-imaging biomarkers of pathologies. Yet, standard decoding procedures do not come with statistical guarantees, and thus do not give confidence bounds to interpret the pattern maps that they produce. Indeed, in whole-brain decoding settings, the number of explanatory variables is much greater than the number of samples, hence classical statistical inference methodology cannot be applied. Specifically, the standard practice that consists in thresholding decoding maps is not a correct inference procedure. We contribute a new statistical-testing framework for this type of inference. To overcome the statistical inefficiency of voxel-level control, we generalize the Family Wise Error Rate (FWER) to account for a spatial tolerance δ, introducing the δ-Family Wise Error Rate (δ-FWER). Then, we present a decoding procedure that can control the δ-FWER: the Ensemble of Clustered Desparsified Lasso (EnCluDL), a procedure for multivariate statistical inference on high-dimensional structured data. We evaluate the statistical properties of EnCluDL with a thorough empirical study, along with three alternative procedures including decoder map thresholding. We show that EnCluDL exhibits the best recovery properties while ensuring the expected statistical control.


Introduction
Predicting behavior or disease status from brain images is an important analytical approach for imaging neurosciences, as it provides an effective evaluation of the information carried by brain images. Machine learning tools, mostly supervised learning, are indeed used on brain images to infer cognitive states (Haynes and Rees, 2006; Norman et al., 2006) or to perform diagnosis or prognosis (Demirci et al., 2008; Fan et al., 2008). Brain images are obtained from MRI or PET imaging, or even from volume-based activity reconstructions derived from EEG or MEG. They are used to predict a target outcome: binary (e.g., two-condition tasks), discrete (e.g., multiple-condition tasks) or continuous (e.g., age). The decoding models used for such predictions are most often linear models, characterized by a weight map that can be represented as a brain image (Mourao-Miranda et al., 2005; Varoquaux and Thirion, 2014).
Besides the prediction accuracy achieved, this estimated weight map is crucial to assess the information captured by the model. Typically, the produced weight maps are used to identify discriminative patterns (Gramfort et al., 2013; Haxby et al., 2001; Mourao-Miranda et al., 2005) and support reverse inferences (Poldrack, 2011; Schwartz et al., 2013). Hence, a natural way to recover predictive regions from their weight maps is to threshold these maps (e.g., Lee et al., 2010; Mourao-Miranda et al., 2005; Rehme et al., 2015; Sato et al., 2013). However, this approach is problematic for two reasons: there exists no clear way to choose the threshold as a function of a desired significance level, and it is unclear whether such a thresholded map is still an accurate predictor of the outcome. Solutions that bypass the arbitrary threshold choice have been proposed, such as Recursive Feature Elimination (RFE) (De Martino et al., 2008), but the produced maps still lack statistical guarantees.
In this work, we show that the natural procedure consisting in thresholding standard decoders, such as SVR, is not a relevant solution. In this respect, we consider two thresholding strategies: one that keeps extreme weights, and another that computes the threshold by performing a permutation test. Unlike RFE, these two thresholding strategies can be derived from statistical-testing considerations; yet, their statistical properties are not assumption-free. We also consider decoders that provide confidence intervals around the estimated weight map. As detailed in the next section, these approaches also face severe challenges in terms of statistical power and computational tractability. They have to rely on algorithmic shortcuts, approximations and hypotheses that are more or less problematic in practice.
Hence, for all methods considered, the control of false detections is only achieved within a certain theoretical framework, and given a series of assumptions that are not easily checked. It is thus fundamental to analyze their statistical behavior with an extensive empirical study. We present here a set of experiments assessing the accuracy of the error rate control and support recovery on real and semi-synthetic brain-imaging data.
Additionally, to achieve a reasonable compromise between error control and power, we introduce a new type of error control adapted to imaging problems. The proposed quantity is a generalization of the Family Wise Error Rate (FWER) (Hochberg and Tamhane, 1987) including a spatial tolerance parametrized by a distance δ. We call it the δ-FWER.
In Section 2, we provide useful background, discuss the statistical guarantees that we aim at for pattern maps, and make the theoretical and practical inference challenges explicit. In Section 3, we provide a definition of the δ-FWER along with a geometrical interpretation of this quantity. We also describe several statistical inference methods producing statistical maps that reflect the significance of the conditional association of brain regions with a target, while controlling the FWER or δ-FWER. Section 4 and Section 5 follow with extensive experiments on simulations and large-scale fMRI datasets that study the behavior of the benchmarked solutions regarding false positive control and recovery.

Context: decoding-map recovery
In this section, we first review a result due to Weichwald et al. (2015) about the complementarity of univariate and multivariate inference, then we present the statistical guarantees that we aim at for brain-wide decoding maps, and lastly we formalize the problem of statistical inference on such maps.

Complementarity of univariate and multivariate inference
Statistical inference in neuroimaging can be performed using mass univariate modeling, i.e., fitting brain activity maps from an outcome (leading to encoding models), or by predicting an outcome from brain maps using multivariate modeling (leading to decoding models). The complementarity of univariate and multivariate analyses has been demonstrated in Weichwald et al. (2015). Specifically, they argued: "We showed that only encoding models in a stimulus-based setting support unambiguous causal statements. This result appears to imply that decoding models, despite their gaining popularity in neuroimaging, are of little value for investigating the neural causes of cognition. In the following, we argue that this is not the case. Specifically, we show that by combining encoding and decoding models, we gain insights into causal structure that are not possible by investigating each type of model individually." This statement clearly implies that inference tools are needed for multivariate analysis. The present work is thus fully dedicated to multivariate inference. We simply provide some univariate inference results for reference, given that they address different yet complementary questions.

Statistical control with spatial tolerance
In decoding, the signals from voxels are used concurrently to predict an outcome. Given that they display high correlations, trying to identify the effect of each covariate (voxel) is not possible. Precise voxel-level control may also not be necessary: current brain models are rather specified at a regional scale, see e.g., Glasser et al. (2016). Additionally, when controlling a statistical error, detecting a voxel adjacent to a truly predictive region is less problematic than detecting a false positive far from such a region. These two facts argue in favor of incorporating a spatial tolerance in the sought statistical control, echoing similar efforts in standard analysis (Bowring et al., 2019; Da Mota et al., 2014; Smith and Nichols, 2009). Hence, we introduce a generalization of the Family Wise Error Rate (FWER) (Hochberg and Tamhane, 1987): the δ-FWER. This generalization is related to the extensions of the False Discovery Rate (FDR) (Benjamini and Hochberg, 1995) proposed by Nguyen et al. (2019) and Gimenez and Zou (2019), called δ-FDR and local-FDR, respectively.

Formal problem setting
Notation. For clarity, we use bold lowercase for vectors and bold uppercase for matrices. For p ∈ ℕ, we write [p] for the set {1, …, p}. For a vector v, v_j refers to its j-th coordinate. For a matrix A, A_{i,j} refers to the element in the i-th row and j-th column.
Formalizing the decoding problem. The target (outcome to decode) is observed in n samples and denoted by y ∈ ℝ^n (y can be binary, discrete or continuous). The brain volume is discretized into p voxels. The corresponding voxel signals are also referred to as explanatory variables, covariates or features. We denote by X ∈ ℝ^{n×p} the matrix containing (column-wise) the covariates {x_1, …, x_p}. We assume that, for all i ∈ [n], the samples (y_i, X_{i,.}) are i.i.d. Then, further assuming a linear dependency between the covariates and the response, the generative model is as follows:

y = X w* + ε ,  (1)

where w* ∈ ℝ^p is the true weight map and ε ∈ ℝ^n is the noise vector. In the present study, we assume for simplicity that the noise is Gaussian, i.e., ε ∼ 𝒩(0, σ² I_n), but extension to sub-Gaussian noise is possible.

High dimensionality and structure of the data. Given X and y, a standard procedure computes an estimate ŵ of w*. Getting statistical guarantees on w*_j, j ∈ [p], means assessing with some degree of uncertainty that w*_j is non-zero, or equivalently, giving a confidence interval for w*_j. This is hard in high dimension and when short- and long-range correlations are present in the data. Indeed, for brain imaging data, n is typically hundreds (or less), whereas p may amount to hundreds of thousands. In addition, voxel signals are highly correlated, which makes model identification harder due to multicollinearity and ill-posedness. Theoretical studies, e.g., Wainwright (2009), have revealed that in such settings there is no hope of recovering the predictive regions completely and accurately.
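To make the setting concrete, here is a minimal simulation of the generative model (1); the dimensions are toy values (real fMRI data have p in the hundreds of thousands), and the script is an illustration rather than part of our pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2000                    # toy sizes; real data: n ~ 1e2-1e3, p ~ 1e5
X = rng.standard_normal((n, p))     # design matrix of voxel signals
w_star = np.zeros(p)                # true weight map w*, sparse
w_star[:10] = 1.0                   # a small block of predictive voxels
sigma = 0.2
eps = sigma * rng.standard_normal(n)
y = X @ w_star + eps                # generative model (1): y = X w* + eps
```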

Current practices: thresholding decoding maps
Uniform threshold. Probably the most natural procedure used to recover discriminative patterns is to threshold decoders with high prediction performance; a popular choice is the linear SVM/SVR decoder (Pereira et al., 2009; Rizk-Jackson et al., 2011). Thresholding decoder maps at a uniform value, i.e., with the same threshold for all weights, is probably the most common practice in neuroimaging, the threshold value being generally arbitrary: "naked-eye criteria". It is not thought of as a statistical operation, and is sometimes left to the reader, who is presented unthresholded maps and yet told to interpret only the salient features of these maps.
Permutation testing can also be used to derive a uniform threshold with explicit guarantees. The classical Westfall-Young permutation test procedure (Westfall and Young, 1993) is well known in the univariate context to control the FWER (Anderson, 2001), but its application to multivariate testing is not as straightforward. Here, instead of considering the usual t-statistics, the permutation test uses the linear SVR weights. An estimated weight map must be computed for the original problem and for several permuted problems before performing the Westfall-Young procedure; this method is detailed in Section 3.3.
Under some assumptions (see Section 3.2 and Section 3.3 ) that are more or less problematic in practice, the uniform thresholding strategies might recover the predictive patterns with FWER control. However, we will see that these naive strategies are not satisfactory in practice.
Non-uniform threshold. Another method, proposed by Gaonkar and Davatzikos (2012) and specifically designed for neuroimaging settings, relies on an analytic approximation of a permutation test performed over a linear SVM/SVR estimator. This method computes confidence intervals around the weights of the proposed estimator. Then, under some assumptions (see Section 3.4) that are not always met in practice, this procedure controls the FWER. It is almost equivalent to thresholding the SVR weights with a non-uniform threshold, i.e., a threshold specific to each weight. We refer to it as Adaptive Permutation Threshold SVR (Ada-SVR) from now on.

Building decoders designed for statistical control
Dimension reduction by voxel grouping. A computationally attractive solution to alleviate high dimensionality is to leverage the data structure and group adjacent (and correlated) voxels, producing a closely related, yet compressed version of the original problem. In decoding, the grouping of voxels via spatially-constrained clustering algorithms has already been used to reduce the problem dimension (Varoquaux et al., 2012; Wang et al., 2015). Specifically, groups of contiguous voxels can be replaced by the average signal they carry, reducing the dimensionality while improving the conditioning of the estimation problem. However, such a compression introduces a bias, as the patterns are constrained by the shapes of the clusters. This bias is problematic, as there is no unique grouping or clustering of the voxels: many different groupings capture the signal equally accurately. One way to mitigate this bias is to use an aggregation of models (Breiman, 1996; Zhou, 2012) obtained from several voxel groupings. Varoquaux et al. (2012) implemented this idea by computing different groupings from different random subsamples of the full data sample. The corresponding procedure yields decoders with more stable maps as well as better prediction accuracy. In this subsampling spirit, random subspace methods (Ho, 1998) also improve the prediction accuracy with more stable solutions, but in this case the subsampling is performed on the raw features. More recently, Fast Regularized Ensembles of Models (FReM) (Hoyos-Idrobo et al., 2018) has combined clustering and ensembling to reduce the variance of the weight map, while ensuring high prediction accuracy. Yet, FReM weight maps do not enjoy statistical guarantees. The clustering step is illustrated below.
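As an illustration of the voxel-grouping step, the sketch below uses scikit-learn's Ward feature agglomeration (one standard choice of spatially-constrained clustering) with a lattice connectivity graph, so that only spatially adjacent voxels are merged; the grid shape, cluster count and random data are placeholders, not the settings of our experiments:

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.feature_extraction.image import grid_to_graph

nx, ny, nz = 10, 10, 10             # toy 3D voxel grid
n, p = 100, nx * ny * nz
X = np.random.randn(n, p)           # flattened voxel signals

# Connectivity restricted to the voxel lattice keeps clusters contiguous.
connectivity = grid_to_graph(nx, ny, nz)
ward = FeatureAgglomeration(n_clusters=50, connectivity=connectivity,
                            linkage="ward")
X_reduced = ward.fit_transform(X)   # shape (n, 50): mean signal per cluster
```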
High-dimensional statistics tools. A variety of procedures have been proposed to produce p-value maps (maps of the p-values associated with every covariate) for linear models in high dimension (Bühlmann, 2013; Javanmard and Montanari, 2014; Meinshausen et al., 2009; Wasserman and Roeder, 2009; Zhang and Zhang, 2014). Yet, they are not directly applicable to brain-imaging settings, as the dimensionality is too high. Based on a comparative review of those procedures (Dezeure et al., 2015), we have focused on the so-called Desparsified Lasso (DL), introduced in Zhang and Zhang (2014) and thoroughly analyzed by van de Geer et al. (2014). Roughly, the Desparsified Lasso can be seen as a Lasso-type (Tibshirani, 1996) extension of least squares to high-dimensional settings, producing weight maps with a well-controlled statistical distribution.
However, when the number of features is much greater than the number of samples, the Desparsified Lasso lacks statistical power (Chevalier et al., 2018) and its computational cost becomes prohibitive. Indeed, solving the Desparsified Lasso entails solving p + 1 Lasso problems with a design matrix X ∈ ℝ^{n×p}. Using the standard coordinate descent implementation (Friedman et al., 2007), the computation time is 𝒪(p² n T), with T the number of epochs used to solve the Lasso. However, when p is of the order of a few thousand and n of a few hundred, the Desparsified Lasso remains feasible with modest computer resources. In this context, the recently proposed Ensemble of Clustered Desparsified Lasso (EnCluDL) (Chevalier et al., 2018) combines three steps: a clustering procedure that reduces the problem dimension while preserving the data structure, the Desparsified Lasso procedure, which is tractable on the compressed problem, and an ensembling method introduced by Meinshausen et al. (2009) that aggregates several solutions of the compressed problem. This method, summarized in Section 3.5, follows a scheme similar to FReM, but the inference and ensembling procedures are different since it aims at producing p-value maps with statistical properties. Indeed, under some assumptions (see Section 3.5), it can be shown that EnCluDL controls the δ-FWER at the desired nominal level.
Finally, Knockoff filters (Barber and Candès, 2015; Candès et al., 2018), extended to work on images by Nguyen et al. (2019), are also an appealing procedure, though they can only control the FDR (Barber and Candès, 2015) or a relaxed version of the FWER (Janson and Su, 2016) that is incompatible with our spatial control, the δ-FWER detailed below. In this study, following the previous work of Chevalier et al. (2018), we focus on FWER and δ-FWER control. We thus defer the extension of EnCluDL to FDR-controlling procedures and the benchmarking against alternatives to future work.

δ-family wise error rate (δ-FWER)
In this section, we introduce a new way of controlling false detections that is well suited to neuroimaging settings, as it incorporates spatial tolerance.

True support under linear model assumption. When considering multivariate inference, the support S ⊂ [p] is the set of covariates that are non-independent of y conditionally to the other covariates. The rest of the voxels form the null region, i.e., N = [p] ∖ S. Formally, S is the unique set that verifies:

∀ j ∈ [p],  j ∉ S  ⟺  x_j ⟂⟂ y | {x_k : k ∈ [p] ∖ {j}} ,

where the sign ⟂⟂ denotes independence. Under the linear assumption made in (1), S becomes simply the set of non-zero weights and N the set of zero weights:

S = {j ∈ [p] : w*_j ≠ 0}  and  N = {j ∈ [p] : w*_j = 0} .

δ-neighborhood. The variables x_1, x_2, …, x_p can also be characterized by the spatial proximity of their underlying voxels in brain space: given δ ≥ 0, a voxel j ∈ [p] is in the δ-neighborhood of a voxel (or of a set of voxels) if their distance is less than δ.

δ-null region. For δ ≥ 0, we denote by S^(δ) the δ-dilation of the support S, i.e., the set of voxels in S or in its δ-neighborhood. By definition, S ⊂ S^(δ). We denote by N^(−δ) the δ-erosion (inverse operation of the δ-dilation) of the null region N, implying that N^(−δ) ⊂ N. From the definition of S^(δ) we have immediately:

N^(−δ) = [p] ∖ S^(δ) .

We refer to N^(−δ) as the δ-null region. As shown in Fig. 1, we interpret the δ-null region as the subset of the covariates that are at a distance greater than δ from the support covariates. We also give a practical example of the δ-null region in the case of real fMRI data in appendix in Fig. A.14.

Fig. 1. Spatial tolerance to false discoveries. Left: example of a 2D weight map; small squares represent voxels. The map is sparse. Right: representation of the δ-null region for the associated map with δ = 2. The covariates in the δ-null region are "far" from non-null covariates; discoveries in this area are highly undesired. Discovering a null covariate "close" to a non-null covariate is tolerated.

Fig. 2. Ensemble of Clustered Desparsified Lasso (EnCluDL) algorithm. The EnCluDL algorithm combines three algorithmic steps: a clustering (or parcellation) procedure applied to images, the Desparsified Lasso procedure (statistical inference) to derive statistical maps, and an ensembling method that synthesizes several statistical maps. In the first step, clusterings of voxels are generated using random subsamples of the original sample. Then, for each grouping-based data reduction, a statistical inference procedure is run, resulting in z-score maps (or p-value maps). Finally, these maps are ensembled into a final z-score map using an aggregation method that preserves statistical properties.
δ-Family Wise Error Rate (δ-FWER). If we have an estimate Ŝ ⊂ [p] of the support, we recall that the Family Wise Error Rate (FWER) is defined as the probability of making a false detection (Hochberg and Tamhane, 1987):

FWER = ℙ( Ŝ ∩ N ≠ ∅ ) .

Similarly, given δ ≥ 0, we define the δ-FWER to be

δ-FWER = ℙ( Ŝ ∩ N^(−δ) ≠ ∅ ) ,

i.e., the probability of making a detection at distance more than δ from the true support. The δ-FWER control is thus weaker than the FWER control, except when δ = 0 or when the true support is empty (i.e., N = [p]), in which case the δ-FWER coincides with the classical FWER.
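In practice, the δ-null region can be computed by morphological dilation of the support on the voxel grid. The sketch below (with distances counted in voxels along the grid with 6-connectivity, a simplifying assumption) shows how the δ-FWER would be estimated empirically over repeated simulations:

```python
import numpy as np
from scipy.ndimage import binary_dilation, generate_binary_structure

def delta_null_region(support, delta):
    """N^(-delta): complement of the delta-dilation S^(delta) of the support.

    `support` is a 3D boolean array; `delta` is the tolerance in voxels.
    """
    struct = generate_binary_structure(rank=3, connectivity=1)
    s_delta = binary_dilation(support, structure=struct, iterations=delta)
    return ~s_delta

def delta_error(support, detected, delta):
    """True if at least one detection lies in the delta-null region."""
    return bool(np.any(detected & delta_null_region(support, delta)))

# The empirical delta-FWER is the fraction of simulations for which
# delta_error(...) returns True; control means this fraction <= alpha.
```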

Thresholded SVR (Thr-SVR)
In this section, we introduce Thresholded SVR (Thr-SVR), a procedure that thresholds the estimated SVR weight map uniformly, keeping extreme weights; this method corresponds to the most standard and simple approach to recover predictive patterns. The first step is to derive the SVR weights ŵ^SVR. Then, assuming that the estimated weights of the null region are sampled from a given distribution centered on 0, the corresponding standard deviation σ_SVR can be approximated with the following estimator:

σ̂_SVR = ( (1/p) Σ_{j=1}^{p} (ŵ_j^SVR)² )^{1/2} .

We could also consider other estimators to approximate this quantity (e.g., Schwartzman et al., 2009), but the former is simple and at worst biased upward when the support is not empty. Now, assuming a Gaussian distribution for the SVR weights in the null region, i.e., for j ∈ N:

ŵ_j^SVR ∼ 𝒩(0, σ²_SVR) ,

we can produce (corrected) p-values by applying a Bonferroni correction. The produced p-values are at worst conservative under the two assumptions discussed in Section 6. In this procedure, the regression method considered is a linear SVR, but similar results were obtained with other procedures (e.g., Ridge regression).
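The following sketch implements Thr-SVR under the two assumptions above (centered null weights with a shared standard deviation, estimated from the full weight map as in our reconstruction of the estimator, and Gaussian null weights); it is illustrative rather than a reference implementation:

```python
import numpy as np
from scipy import stats
from sklearn.svm import LinearSVR

def thr_svr_pvalues(X, y):
    """Bonferroni-corrected p-values from uniformly thresholded SVR weights."""
    w = LinearSVR().fit(X, y).coef_
    # Null-weight std estimated from all weights: at worst biased upward
    # (hence conservative) when the support is non-empty.
    sigma = np.sqrt(np.mean(w ** 2))
    pvals = 2 * stats.norm.sf(np.abs(w) / sigma)   # two-sided Gaussian test
    return np.minimum(pvals * X.shape[1], 1.0)     # Bonferroni correction
```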

Permutation Test SVR (Perm-SVR)
Now, we introduce another uniform thresholding strategy for SVR weights based upon a permutation test procedure. To derive corrected p-values from a permutation test, we first regress the design matrix against the response vector using a linear SVR to obtain an estimate ŵ^SVR of the weight map, as in the Thr-SVR procedure. Then, randomly permuting the response vector B times and regressing the design matrix against each permuted response by a linear SVR, we obtain B maps (ŵ^{SVR,(b)})_{b∈[B]}. We can now apply the Westfall-Young step-down maxT adjusted p-values algorithm (Westfall and Young, 1993, p. 116-117), taking the raw SVR weights instead of the usual t-statistics, to derive the corrected p-values. A sufficient assumption to ensure the validity of the p-values is the pivotality of the SVR weights. Keeping the corrected p-values that are less than a given significance level, equal to 10% in this study, this procedure is equivalent to thresholding the SVR weight map. We call this procedure Permutation Test SVR (Perm-SVR). The only difference between Perm-SVR and the Thr-SVR procedure is the way of computing the threshold. To perform the permutation test procedure, we took B = 1000 permutations.
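A minimal sketch of the idea follows, using the single-step maxT variant for brevity (the step-down version of Westfall and Young, used in our experiments, refines it but is longer to write); refitting the SVR for each permutation is the dominant cost:

```python
import numpy as np
from sklearn.svm import LinearSVR

def perm_svr_pvalues(X, y, n_perm=1000, seed=0):
    """Corrected p-values from a maxT permutation test on SVR weights."""
    rng = np.random.default_rng(seed)
    w_abs = np.abs(LinearSVR().fit(X, y).coef_)
    # Null distribution of the maximal absolute weight under permuted y.
    max_null = np.empty(n_perm)
    for b in range(n_perm):
        w_perm = LinearSVR().fit(X, rng.permutation(y)).coef_
        max_null[b] = np.abs(w_perm).max()
    # Corrected p-value: fraction of permutations whose max exceeds |w_j|.
    exceed = (max_null[None, :] >= w_abs[:, None]).sum(axis=1)
    return (1 + exceed) / (n_perm + 1)
```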

Adaptive permutation threshold SVR (Ada-SVR)
Here, we introduce Adaptive Permutation Threshold SVR (Ada-SVR), a statistical inference procedure that produces a weight map and confidence intervals around it; it is also almost equivalent to thresholding the SVR weights non-uniformly. Ada-SVR was first presented by Gaonkar and Davatzikos (2012). First, the authors derived an estimated weight map ŵ^APT linearly related to the target by approximating the hard-margin SVM formulation; their estimator is given by the following equation:

ŵ^APT = C y ,  (9)

where y is the target variable and C ∈ ℝ^{p×n} only depends on the design matrix X:

C = X^⊤ [ (XX^⊤)^{−1} − ( (XX^⊤)^{−1} 1 1^⊤ (XX^⊤)^{−1} ) / ( 1^⊤ (XX^⊤)^{−1} 1 ) ] ,

where 1 ∈ ℝ^n is a vector of ones. The approximation made by (9) is notably valid under the assumption that all the data samples are support vectors, which might hold at least if n ≪ p. Then, if y is standardized and if n is large enough (so that the central limit theorem holds), one expects that, under the null hypothesis for the j-th covariate:

ŵ_j^APT ∼ 𝒩( 0, Σ_{i=1}^{n} C_{j,i}² ) .  (11)

From (11), p-values can be computed and corrected by applying a Bonferroni correction (multiplying the raw p-values by a factor p).
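Below is a sketch of our reading of this analytic approximation; the closed form for C follows the equation above (itself a reconstruction from Gaonkar and Davatzikos, 2012), and y is assumed standardized:

```python
import numpy as np
from scipy import stats

def ada_svr_pvalues(X, y):
    """Bonferroni-corrected p-values for the analytic SVM approximation."""
    n, p = X.shape
    K_inv = np.linalg.inv(X @ X.T)          # (X X^T)^{-1}; assumes n << p
    ones = np.ones((n, 1))
    # Remove the intercept direction, as in the equation above.
    M = K_inv - (K_inv @ ones @ ones.T @ K_inv) / (ones.T @ K_inv @ ones)
    C = X.T @ M                              # (p, n): weights are w = C y
    w = C @ y
    sigma = np.linalg.norm(C, axis=1)        # null std of each w_j, cf. (11)
    pvals = 2 * stats.norm.sf(np.abs(w) / sigma)
    return np.minimum(pvals * p, 1.0)
```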

Ensemble of Clustered Desparsified Lasso Algorithm (EnCluDL)
Ensemble of Clustered Desparsified Lasso (EnCluDL) is a multivariate statistical inference procedure designed for spatial data; it was first introduced by Chevalier et al. (2018). As illustrated in Fig. 2, EnCluDL relies on three steps: a spatially-constrained clustering algorithm for reducing the problem dimension, a statistical inference procedure for deriving statistical maps, and an ensembling method for aggregating the statistical maps.

Statistical inference with Desparsified Lasso. Desparsified Lasso (DL) is a statistical inference procedure that can be viewed as a generalization of least-squares-based inference to high dimension under sparsity assumptions. It was proposed and thoroughly analyzed by Zhang and Zhang (2014) and van de Geer et al. (2014). This estimator produces p-values on linear model parameters even when the number of parameters is (reasonably) greater than the number of samples. A technical description of the Desparsified Lasso is available in Section A.1. In the neuroimaging context, the initial parameters are related to the voxels, which number of the order of one hundred thousand, while the number of samples is almost always lower than one thousand. In such settings the Desparsified Lasso is inefficient due to a lack of statistical power, hence dimension reduction is required.

Clustering. As argued in Section 1, while performing dimension reduction, we aim at keeping the spatial structure of the data and avoiding the mixing of voxels "far" from each other. This is achieved with a data-driven parcellation obtained from a spatially constrained clustering algorithm, following the conclusions of Varoquaux et al. (2012) and Thirion et al. (2014). Another interesting aspect of this dimension reduction method is its denoising property (Hoyos-Idrobo et al., 2018), since it produces averages from groups of noisy voxels. Note that this choice ultimately calls for a spatial tolerance on the statistical control, i.e., considering the δ-FWER instead of the standard FWER. Through the clustering, the p voxels are grouped into C clusters, where C ≪ p. Then, the Desparsified Lasso is directly applied to the compressed problem in order to produce corrected p-values. Notably, corrected p-values are obtained from the initial p-values by applying a Bonferroni correction (Dunn, 1961) with a factor C ≪ p. Following the terminology in Chevalier et al. (2018), we refer to this procedure as Clustered Desparsified Lasso (CluDL). CluDL however suffers from high variance (Chevalier et al., 2018), as it depends on an arbitrary grouping choice. This can be alleviated by ensembling techniques, as described next.

Ensembling. Hoyos-Idrobo et al. (2018) and Varoquaux et al. (2012) have shown that randomizing the grouping choice and adding an ensembling step to aggregate several solutions can stabilize the overall procedure. Additionally, Chevalier et al. (2018) have highlighted that the ensembling step is also beneficial in terms of support recovery. To perform B groupings of the covariates, we train the parcellation algorithm with B different random subsamples of the original data sample. Then, thanks to the CluDL procedure, we obtain B statistical maps that are aggregated into one through an ensembling procedure. The ensembling procedure we considered is adapted from Meinshausen et al. (2009) and described in appendix in Section A.2. We refer to the full inference algorithm as Ensemble of Clustered Desparsified Lasso (EnCluDL). A sketch of the full pipeline is given below.
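The sketch below summarizes the three steps; as a stand-in for the Desparsified Lasso we use classical OLS inference on the compressed problem (valid only because C < n there), so this is a simplified illustration of the pipeline, not the implementation used in our experiments (which is available in our hidimstat package):

```python
import numpy as np
from scipy import stats
from sklearn.cluster import FeatureAgglomeration

def ols_pvalues(Xc, y):
    """Stand-in for Desparsified Lasso inference: OLS t-tests.

    Feasible here only because the compressed problem has C < n columns.
    """
    n, C = Xc.shape
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    sigma2 = resid @ resid / (n - C)
    var_beta = sigma2 * np.diag(np.linalg.pinv(Xc.T @ Xc))
    t = beta / np.sqrt(var_beta)
    return 2 * stats.t.sf(np.abs(t), df=n - C)

def encludl_pvalues(X, y, connectivity, n_clusters=500, n_maps=25, seed=0):
    """Schematic EnCluDL: randomized clustering, inference, ensembling."""
    rng = np.random.default_rng(seed)
    pval_maps = []
    for _ in range(n_maps):
        # 1) Clustering learned on a random subsample to vary the grouping.
        idx = rng.choice(len(y), size=int(0.7 * len(y)), replace=False)
        ward = FeatureAgglomeration(n_clusters=n_clusters,
                                    connectivity=connectivity).fit(X[idx])
        X_reduced = ward.transform(X)
        # 2) Inference on the compressed problem, Bonferroni factor C.
        pvals = np.minimum(ols_pvalues(X_reduced, y) * n_clusters, 1.0)
        # 3) Broadcast cluster p-values back to the voxels of each cluster.
        pval_maps.append(pvals[ward.labels_])
    # Ensembling: quantile aggregation of Meinshausen et al. (2009),
    # written here for the median quantile (twice the per-voxel median).
    return np.minimum(2 * np.median(np.stack(pval_maps), axis=0), 1.0)
```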
Under hypotheses ensuring the Desparsified Lasso statistical properties, notably sparsity and smoothness of the true weight map and i.i.d. data samples, EnCluDL gives statistical guarantees: namely, it controls the δ-FWER.

Choosing δ for δ-FWER control. Theoretically, the minimal spatial tolerance δ that guarantees a control of the δ-FWER with EnCluDL is the largest parcel diameter. However, in practice, we aggregate many statistical maps obtained from different choices of voxel grouping; the required spatial tolerance is then reduced to the average cluster radius. The value of δ for which we observe the δ-FWER control thus varies approximately linearly with the cubic root of the average number of voxels per cluster. In standard fMRI settings, we propose the following formula for δ:

δ₀ = ( p / (2C) )^{1/3} ,  (12)

the ratio p/C being the average number of voxels per cluster; δ₀ is a distance in voxel-size units. Note that this formula is an estimate of the average cluster radius that assumes that all clusters have an identical cubic shape. In practice, this formula tends to underestimate the average cluster radius, but was suitable in all our experiments. In Section A.6, we study empirically the distribution of the cluster radius as a function of the number of clusters, and compare it with δ₀.
Additionally, note that when the setting is particularly favorable for inference (e.g., in high-SNR or large-sample regimes), the choice of δ given by (12) might be slightly too liberal. To address these specific cases, we propose a more refined formula to estimate δ in appendix in Section A.5.

EnCluDL hyperparameters.
The number of clusters C is a crucial hyperparameter of EnCluDL. Generally, a suitable C depends on intrinsic physical properties of the problem and on the targeted spatial tolerance δ. Decreasing C increases the statistical power while reducing the spatial precision. In the neuroimaging context, taking C = 500 is a fair default value achieving a suitable trade-off between spatial precision and statistical power when the number of samples is a few hundred. With this choice, the spatial tolerance should be close to δ = 10 mm when working with masked fMRI data.
As a more adaptive approach, we recommend tuning C according to n, e.g., C ∈ [n/2, n]. This choice should still ensure the δ-FWER control with δ given by (12) (or its corrected version, see appendix Section A.5) and is justified in Section 4.5.
The parameter B, the number of CluDL solutions to be aggregated, is discussed in Section 3.5. The larger B, the more stable the solution, yet the heavier the computational cost. In our experiments, we have set B = 25 (see Hoyos-Idrobo et al. (2018) for a more complete discussion of this parameter).

Empirical analysis of data structure assumptions for EnCluDL. The core part of EnCluDL consists in applying the Desparsified Lasso to a clustered version of the original problem. As disclaimed in van de Geer et al. (2014), some technical hypotheses on the structure of the design matrix, i.e., of the reduced data, are necessary to produce valid confidence intervals on the parameters with the Desparsified Lasso. Roughly, it is necessary that the features are "not too much correlated". In appendix in Section A.3, we show in a simple setting that as long as the correlation between two predictive features is less than 0.8, it is possible to recover both features. However, when the correlation between features is more than 0.9, only one of the two features can be identified.
In Section A.4, we show that in standard fMRI datasets neighboring voxels can have a correlation greater than 0.9. Thus, applying the Desparsified Lasso at the voxel level certainly leads to many false negatives. However, since the Desparsified Lasso is applied to the clustered problem, we have to consider the correlation between clusters instead. In Section A.4, we show on HCP data that such inter-cluster correlation is almost always lower than 0.8 and always lower than 0.85. This means that the data structure assumptions for EnCluDL are sustainable. Additionally, the fact that EnCluDL aggregates several CluDL solutions increases the tolerance to inter-cluster correlation.

A complementary univariate solution
Given the complementarity of univariate and multivariate inference noted previously, we add to our study a univariate inference method, namely univariate permuted OLS (Univ-OLS). This method does not test the same null hypothesis as the other methods: it tests whether or not a voxel is marginally associated with the target. Hence, while it should not be benchmarked against the other methods, we propose to consider jointly the results obtained by the marginal and conditional analyses, as advocated by Weichwald et al. (2015).
The Univ-OLS method is based on the generalized linear model (GLM) ( Friston et al., 1994 ). For every voxel we compute a t-statistic by applying the OLS procedure on the linear model that associates each voxel with the target. Subsequently, we also derive the permuted t-statistic distribution by performing the OLS on permuted data. Finally, to obtain corrected p-values, we use the standard maxT procedure ( Westfall and Young, 1993 ). Note that, for this method, we have used the permuted_ols function implemented in the Nilearn python package ( Abraham et al., 2014 ) with 1000 permutations.
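For completeness, this is how the marginal analysis can be run with Nilearn (toy arrays here; the exact return format of permuted_ols may vary across Nilearn versions):

```python
import numpy as np
from nilearn.mass_univariate import permuted_ols

n, p = 100, 2000
X = np.random.randn(n, p)     # masked voxel signals, one column per voxel
y = np.random.randn(n, 1)     # behavioral target, shape (n, 1)

# One OLS fit per voxel; the max-t statistic over voxels is calibrated by
# permutation (maxT), yielding FWER-corrected p-values.
neg_log_pvals, t_scores, h0 = permuted_ols(
    y, X, n_perm=1000, two_sided_test=True, random_state=0)
pvals_corrected = 10.0 ** -neg_log_pvals.ravel()
```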

Data
To validate empirically the statistical guarantees of the four algorithms described in Section 3 (Thr-SVR, Perm-SVR, Ada-SVR and EnCluDL), we perform several experiments on resting-state fMRI and task fMRI data. We also show some results for Univ-OLS to highlight the complementarity of univariate and multivariate analyses, in particular when studying predictive patterns on real data. We focus on three datasets: HCP900 resting-state fMRI, HCP900 task fMRI and RSVP task fMRI.

HCP900 resting-state fMRI data. The HCP900 resting-state fMRI dataset (Van Essen et al., 2012) contains 4 runs of 15 min resting-state recordings with a 0.76 s repetition time (corresponding to 1200 frames per run) for 796 subjects. We use the MNI-resampled images provided in the HCP900 release. For this dataset the number of samples is n = 1200 (only one run is used) and the number of voxels is p = 156 374 after gray-matter masking (the spatial resolution being 2 mm isotropic).

HCP900 task fMRI data. We also use the HCP900 task-evoked fMRI dataset (Van Essen et al., 2012), in which we take the masked 2 mm z-maps of the 796 subjects from 6 tasks to solve 7 binary classification problems: emotion (emotional face vs shape outline), gambling (reward vs loss), language (story vs math), motor hand (left vs right hand), motor foot (left vs right foot), relational (relational vs match) and social (mental interaction vs random interaction). We consider the fixed-effect maps for each outcome (or condition), yielding one image per subject per condition (which corresponds to two images per subject for each classification problem). Then, for each problem, the number of samples available is n = 1592 (= 2 × 796) and the number of voxels is p = 156 374 after gray-matter masking.

Unmasked RSVP task fMRI data. We also use activation maps obtained from a rapid serial visual presentation (RSVP) task of the individual brain charting dataset (Pinho et al., 2018), augmented with 9 additional subjects performing the same task under the same experimental procedures and scanning parameters. No masking is used for this dataset, so that out-of-brain voxels are not withdrawn during preprocessing. We consider the unmasked 3 mm-resolution statistical z-maps of the 6 sessions of the 21 subjects for a reading task with 6 different contrasts that have been grouped into 2 classes: language (words, simple sentences, complex sentences) vs pseudo-language (consonant strings, pseudo-word lists, jabberwocky). The images are all registered to MNI space and per-condition effects are estimated with the Nistats v0.0.1 library (Abraham et al., 2014). For this dataset the number of samples available is n = 756 (21 subjects × 6 runs × 6 images per run) and the number of voxels is p = 173 628 (unmasked images resampled at 3 mm resolution). We run the inter-subject experiment described in Section 4.4 with this dataset.

Statistical control on semi-simulated data
A first series of experiments studies whether the four different methods exhibit the expected δ-FWER control and are competitive in terms of support recovery, as measured with the precision-recall curve. To do so, we have to construct the true weight map w*. We generate "semi-simulated" data: we generate signals from estimates on real data. To avoid circularity in the definition of the ground truth, we use two different tasks: one to build w* and another one to define X.

Building a reference weight map from the HCP900 motor hand dataset. To construct an underlying weight map, we use the motor hand (MH) task of the HCP900 task fMRI dataset described in Section 4.1. Specifically, we build a design matrix X_MH ∈ ℝ^{n×p} from the motor hand task z-maps of all subjects, associated with a binary target y_MH. To obtain an initial weight map ŵ^SVC_MH, we regress X_MH against y_MH by fitting a linear Support Vector Classifier (SVC) (Cortes and Vapnik, 1995). From ŵ^SVC_MH we only keep the 10% most extreme values, ensuring that the connected groups of non-zero-weight voxels have a minimal size of 1 cm³ by removing small clusters. We chose this map (represented in Fig. 3 and Fig. 4) to be the true weight map w* ∈ ℝ^p for all the simulated experiments.

Simulating responses with the HCP900 emotion dataset.
We then take X to be the set of z-maps from the emotion task of the HCP900 task fMRI dataset described in Section 4.1. To generate a continuous response vector y, we draw a Gaussian random noise vector ε ∼ 𝒩(0, σ² I_n) and use the linear model introduced in (1), where σ = 0.2 is set to reach SNR_y = 10, with SNR_y given by:

SNR_y = ‖X w*‖² / ‖ε‖² .

The way we simulate y is summarized in Fig. 3.

Fig. 3. To generate the response for a given sample, we multiply the corresponding brain activation map by the true weight map and add Gaussian noise with fixed variance. To highlight the predictive regions, we circle them in pink for positive coefficients and in light blue for negative coefficients. As an illustration, we take four different data samples with negative or positive output values.

Quantification of error control and detection accuracy. To obtain representative results, we run the procedures described in Section 3 for 100 different response vectors generated from different random samples of subjects and different draws of ε. We let the number of samples vary from n = 50 (25 random subjects taken among the 796) to n = 1200 (600 subjects), the number of voxels being p = 156 374. For each simulation, we record the empirical δ-FWER and the precision-recall curves. Importantly, we do not recommend running such an analysis with n < 100, since the estimation problem is hard and statistical guarantees are only asymptotic.

Heavy-tailed version of the semi-simulated experiment. In the above experiment the noise is Gaussian; we therefore also benchmark the inference procedures with Laplace and Student noise to assess the impact of the noise distribution.

Binary version of the semi-simulated experiment. In the main experiment the response vector is continuous; we therefore also benchmark the inference procedures with a binary response. For that, we simply take as response vector the signs of the continuous y generated as in the previous paragraph.

Univ-OLS solves another inference problem. Univariate methods do not compete with multivariate methods, as they do not test the same null hypotheses. However, for pedagogical purposes, we show that Univ-OLS-based FWER control is not valid in the multivariate analysis setup.
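The calibration of the noise level can be written in a few lines; the snippet below fixes the realized SNR_y exactly by rescaling the noise, under the SNR definition reconstructed above:

```python
import numpy as np

def simulate_response(X, w_star, snr_y=10.0, seed=0):
    """Draw y = X w* + eps with the noise rescaled to reach the target
    SNR_y, with SNR_y = ||X w*||^2 / ||eps||^2 (the definition used above)."""
    rng = np.random.default_rng(seed)
    signal = X @ w_star
    eps = rng.standard_normal(X.shape[0])
    eps *= np.linalg.norm(signal) / (np.sqrt(snr_y) * np.linalg.norm(eps))
    return signal + eps
```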

Statistical control under the global null with i.i.d. data
In this experiment, we test whether the procedures control the FWER under a global null model. EnCluDL only controls the δ-FWER theoretically but, when the true weight vector w* is null, the δ-FWER and the classical FWER are identical. Hence, all procedures should control the FWER. Here, we consider the tasks of the HCP900 task fMRI dataset described in Section 4.1, keeping all the subjects (n = 1592). Then, to get a noise-only response, we (uniformly) randomly permute the original response vector. As in Section 4.2, the i.i.d. hypothesis is legitimate, since the data correspond to z-maps of different subjects. For each task, we draw 100 different permutations of the response and check whether the different methods enforce the chosen nominal FWER of 10%. To illustrate the importance of checking the underlying assumptions, we describe in appendix in Section A.8 an additional experiment showing that the FWER (or δ-FWER) is no longer controlled when working with an autocorrelated response vector, breaking the i.i.d. hypothesis. This experiment is adapted from Eklund et al. (2016).
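The global-null check amounts to a short loop; `inference_fn` stands for any of the benchmarked procedures returning corrected p-values (a hypothetical interface used only for illustration):

```python
import numpy as np

def empirical_fwer(inference_fn, X, y, n_draws=100, alpha=0.10, seed=0):
    """Fraction of permuted-response draws with at least one detection.

    Under the global null (permuted y), any detection is a false positive,
    so this fraction estimates the FWER, expected below alpha = 10%.
    """
    rng = np.random.default_rng(seed)
    n_errors = 0
    for _ in range(n_draws):
        pvals = inference_fn(X, rng.permutation(y))  # corrected p-values
        n_errors += bool(np.any(pvals < alpha))
    return n_errors / n_draws
```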

Statistical control of out-of-brain detections
In this experiment we test the four procedures on an unmasked task fMRI dataset to verify that no spurious detection is made outside of the brain, up to the allowed error rate. Indeed, the non-null coefficients of the weight vector w* should all be contained in the brain, since there is no informative signal in out-of-brain voxels. To do so, we take the unmasked RSVP task fMRI dataset described in Section 4.1 (with a design matrix containing n = 756 unmasked z-maps). Then, we report how frequently voxels are detected outside the brain volume. For the sake of completeness, we also check the non-occurrence of out-of-brain detections with Univ-OLS.

Insights on the choice of number of clusters
In this experiment, we assess empirically the impact of C, the number of clusters used in the EnCluDL algorithm. We use the same generative method as in Section 4.2 to produce an experiment with known ground truth. Then, we run the EnCluDL algorithm varying the number of clusters from C = 200 to C = 1000. We also vary the number of samples n from 100 to 1200. As in Section 4.2, we run the experiment for 100 different response vectors and report aggregated results. We report two statistics: the empirical δ-FWER and the AUC of the precision-recall curve for every value of C and n.

Face validity on HCP dataset
In this experiment, we consider the output of the procedures in terms of brain regions that are conditionally associated with the task performed by the subjects. Similarly to Section 4.3, we consider the tasks of the HCP900 task fMRI dataset described in Section 4.1, keeping this time the true response vector. We run all the procedures on every task and report the statistical maps thresholded such that the FWER < 10% or the δ-FWER < 10% (for EnCluDL). For this, we use all the available samples (n = 1592). We also include Univ-OLS to compare the discriminative patterns obtained with univariate inference.

Fig. 4. Qualitative comparison of the model solutions. Here, we show the solutions (z-maps) given by the four inference procedures, for a single random draw of the noise vector in the experiment described in Section 4.2. The weight maps are thresholded such that δ-FWER < 10% theoretically. We can observe that none of the methods yields false discoveries, but the Ensemble of Clustered Desparsified Lasso (EnCluDL) procedure is the most powerful, followed by Adaptive Permutation Threshold SVR (Ada-SVR).

Fig. 5. δ-FWER control and precision-recall curve on semi-simulated data (known ground truth). Left: The results of the experiment described in Section 4.2 show that the permutation test (Perm-SVR) and Ensemble of Clustered Desparsified Lasso (EnCluDL) are the only procedures that correctly control the δ-FWER at the nominal level (10%). This is not the case for the Adaptive Permutation Threshold SVR (Ada-SVR) and Thresholded SVR (Thr-SVR) procedures. Right: For the same experiment, EnCluDL has the best performance in terms of precision-recall curve. For n = 400 and a 90% precision, EnCluDL obtains a recall of 23% and Ada-SVR a recall of 16%. Thr-SVR and Perm-SVR share the same precision-recall curve and were not able to reach 90% precision.

Prediction performance
Although it is not the purpose of this study, we also checked the prediction performance of the decoders produced by each method. Since Thr-SVR and Perm-SVR rely on the same predictive function, there are three different decoders: SVR, Ada-SVR and EnCluDL. To perform this experiment, we consider the tasks of the HCP900 task fMRI dataset described in Section 4.1. We run all the procedures on every task using a sample size n = 400, keeping the rest of the samples to test the trained model. For each task and each method, we take 100 different random subsamples to produce the results. This experiment being a side study, we give the results in appendix in Section A.12.

Results
In this section, after setting the value of the tolerance parameter δ for the different datasets, we present the experimental results.

Estimating in HCP and RSVP datasets
In all the experiments, unless specified otherwise, we run EnCluDL with the default choice C = 500. Applying (12), we obtain a tolerance parameter of δ_HCP = 5.4 voxels for HCP900 and δ_RSVP = 5.6 voxels for RSVP, corresponding to δ_HCP = 12 mm and δ_RSVP = 18 mm respectively after rounding up. In Fig. A.14 in appendix, we display the spatial tolerance of 6 voxels in the case of HCP data.
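For reference, the numerical application of (12) as reconstructed above reads:

```python
def delta_voxels(p, n_clusters):
    """Average-cluster-radius estimate of (12), in voxel-size units.

    This follows the formula as reconstructed above, delta0 = (p/(2C))^(1/3).
    """
    return (p / (2 * n_clusters)) ** (1 / 3)

print(round(delta_voxels(156_374, 500), 1))  # 5.4 voxels for HCP900 (2 mm grid)
print(round(delta_voxels(173_628, 500), 1))  # 5.6 voxels for RSVP (3 mm grid)
```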

Statistical control with known ground truth
Here, we describe the results obtained from the experiment described in Section 4.2.

Qualitative comparison of the model solutions. In Fig. 4, we present a qualitative comparison of the model solutions when n = 400. None of the methods yields false discoveries for the chosen threshold, taken such that δ-FWER < 10%. EnCluDL recovers more active regions than the other procedures, which makes it the most powerful procedure, followed by Ada-SVR. The other two procedures do not discover the expected patterns. The results displayed are obtained for a single random draw of the noise vector, but similar results hold for different draws.

δ-FWER control.

Fig. 6. FWER control under the global null with i.i.d. data. The results of the experiment with i.i.d. data under the global null, described in Section 4.3, show that only the Thresholded SVR (Thr-SVR) fails to control the FWER empirically in this context. EnCluDL makes no detection: it is a conservative approach, as one could expect from theory.
In this experiment, we check whether Thr-SVR, Perm-SVR, Ada-SVR and EnCluDL control the δ-FWER at the targeted nominal level (here 10%). Fig. 5 shows that the Perm-SVR and EnCluDL procedures control the δ-FWER for all sample sizes, since their empirical δ-FWER remains below the targeted nominal level, whereas Thr-SVR and Ada-SVR fail to control the δ-FWER in every setting. In particular, the empirical δ-FWER for Ada-SVR is above the targeted nominal level for n ≥ 800. This might occur because the approximation made by (9) is valid only if n remains "sufficiently low" (Gaonkar and Davatzikos, 2012). Thr-SVR fails to control the δ-FWER empirically for any value of n. This might be due to the two assumptions made in Section 3.2 not being satisfied; it is indeed unlikely that the SVR weights of the null region follow the same distribution. We further discuss this point in Section 6. Concerning EnCluDL, one can notice that the empirical δ-FWER is slightly larger for n = 1200; this effect is explained in appendix in Section A.5 and Section A.6. We report additional results, notably heavy-tailed and binary versions of the experiment, in appendix in Section A.10. These lead to the same statistical behavior as observed here.

Precision-recall.
In this experiment, we also evaluate the recovery properties of the four methods by comparing the precision-recall curves for different values of n. Fig. 5 shows that EnCluDL has the best precision-recall curve for n = 400. We recall that the perfect precision-recall curve is reached if the precision is equal to 1 for any value of recall between 0 and 1. Similar results were obtained for the other sample sizes tested (appendix Fig. A.17). Indeed, when n = 400, for a 90% precision, EnCluDL gives a recall of 23% and Ada-SVR a recall of 16%. Thr-SVR and Perm-SVR share the same precision-recall curve, since they both produce p-values arranged in the reverse order of the absolute SVR weights. These thresholding methods were not able to reach 90% precision; their recovery properties are much weaker.

Statistical control under the global null with i.i.d. data

FWER control under the global null (permuted response).
Here, we summarize the results of the experiment testing the control of the FWER in a global null setting (Section 4.3). Fig. 6 shows that, when samples are i.i.d., all the procedures control the FWER, except Thr-SVR. EnCluDL is even conservative, since the empirical FWER remains at 0 for all the different tasks tested. This result is not surprising, since at least two steps of the EnCluDL procedure are conservative: the Bonferroni correction and the ensembling of the p-value maps.

Face validity (original response).
Additionally, we run the procedures with the original (not permuted) response vector to check whether the methods can recover predictive patterns; this corresponds to the experiment described in Section 4.6. We plot the results for the first two tasks (emotion and gambling) in Fig. 7; see appendix Fig. A.23 for the five other tasks. Qualitatively, EnCluDL recovers the most plausible predictive patterns, while Ada-SVR sometimes makes dubious discoveries: patterns are too wide and implausible. The two other methods exhibit very weak statistical power.
Comparing EnCluDL and Univ-OLS solutions, we see that the discovered patterns are not a subset of each other. This result was expected given the arguments in Weichwald et al. (2015) : the advantage of combining the two paradigms is to get more insight on the causal nature of the relation between the voxel signals and the target.

Statistical control of out-of-brain discoveries
We now report the results from the unmasked RSVP task data experiment (Section 4.4). Here, we check whether out-of-brain detections are made. In Fig. 8, the z-score maps are thresholded such that the FWER (for Perm-SVR, Thr-SVR and Ada-SVR) or the δ-FWER (for EnCluDL) is at most 10% for δ = 6 voxels (or 18 mm). We observe that Ada-SVR makes some out-of-brain discoveries: it does not control the FWER empirically. Thr-SVR and Perm-SVR do not yield spurious detections, but very few detections are made, hence these methods have low statistical power. EnCluDL does not make any out-of-brain detection and outlines predictive regions in the temporal lobe and Broca's area, as expected for a reading task. Finally, Univ-OLS does not make any spurious detection either; it only makes detections in the temporal lobe.

Insights on choosing the number of clusters
Here, we report the results of the experiment on semi-simulated task-fMRI data (Section 4.5) studying the impact of C (the number of clusters) on the δ-FWER control and the recovery properties of EnCluDL for various sample sizes. These results are obtained with 100 repetitions for every sample and cluster size. In Fig. 9, we notice that a lower C leads to improved recovery, according to the area under the precision-recall curves, for δ = 6 voxels (or 12 mm). However, when the number of clusters is lower, the average cluster radius increases and exceeds the spatial tolerance δ, leading to inflated error rates (cf. Section A.6). More precisely, the δ-FWER is controlled when C ≥ 500. Note that for C < 500, it is possible to control the δ-FWER, even when C is small, provided a larger spatial tolerance δ > 6 voxels. To compute the required δ, one can use (12). Besides, we observe that the recovery score of EnCluDL improves when n increases, as expected. We also notice that the empirical δ-FWER increases with n. To explain this effect, we first recall that theoretically the δ-FWER is controlled for δ equal to the largest cluster diameter, which is likely to be too large in practice. In this study, we have taken δ equal to δ₀, which is slightly smaller than the average radius of the clusters (cf. Section A.6), since in practice this choice ensures the δ-FWER control. However, when the setting is particularly favorable for inference (e.g., in high-SNR or large-sample regimes), some false discoveries can be made at a distance greater than the average radius from the support. The choice of δ is further discussed in Section 3.5 and in appendix in Section A.5. Additionally, we can notice from Fig. 9 that for a fixed C/n ratio the recovery capability is stable (see also appendix Section A.9). Then, as discussed in Section 3.5, we advise taking C of the same order as n (e.g., C ∈ [n/2, n]) when the goal is to recover most of the predictive regions without strict requirements on the accuracy of their shapes, since the value of δ given by (12) might not be small with regard to the predictive regions themselves.

Discussion
Decoding models are fundamental for causal interpretation of the implication of brain regions in an outcome of interest, mental process or disease status (Weichwald et al., 2015). They produce weight maps that are needed to support this type of inference (Poldrack, 2011; Varoquaux et al., 2018). These weight maps capture how brain regions relate to the outcome, conditional on the other regions, which is a key difference with respect to standard brain mapping based on mass univariate models. However, the weight maps produced by common decoders come without statistical guarantees. Indeed, decoders optimize the quality of their prediction, but give no control on conditional feature importance. Such control is difficult due to the large number of covariates (voxels) as well as the severe multicollinearity: voxel-level inference is untenable. On the other hand, given the spatial structure of the data, a spatial tolerance in the statistical control is natural, as in the Gaussian random field theory used in standard analysis (Nichols, 2012).
Our first contribution is to formalize this spatial statistical control by introducing the δ-FWER, a control of false discoveries up to a spatial slack δ. This definition uncovers a fundamental trade-off between accuracy in the localization of the brain structures involved and statistical power: here we deliberately degrade spatial accuracy, acknowledging current concerns about statistical power in neuroimaging studies (Button et al., 2013; Noble et al., 2019).
Our second contribution is to study empirically the statistical control of four procedures computing decoding maps, ranging from thresholding procedures applied to SVR weights to a dedicated decoding procedure, EnCluDL. Experiments show that the Thr-SVR procedure, which thresholds SVR weights, fails to achieve useful statistical control. Exact permutation testing yields the expected statistical control, but with very poor statistical power in all the experimental settings we have studied. On the other hand, Adaptive Permutation Threshold SVR (Ada-SVR) (Gaonkar and Davatzikos, 2012) does not control the FWER as it should, though it exhibits a fair precision-recall curve in our semi-simulated experiments. This shows how difficult it is to identify a statistically valid threshold for SVR weight maps. This is due to the fact that, under the null hypothesis, estimated weights are not distributed according to a fixed distribution, notably because of the dependency structure of the data; more precisely, the variance of these distributions differs across weights. Hence, thresholding linear decoders (SVR, logistic regression) based on their estimated weight amplitudes is not a principled approach to control false discoveries.

Fig. 7. Estimated predictive patterns on a standard task fMRI dataset. Here, we plot the results for the emotion and gambling tasks of the experiment described in Section 4.6, thresholding the statistical maps such that the δ-FWER stays lower than 10% for δ = 12 mm. Qualitatively, EnCluDL discovers the most plausible patterns; Ada-SVR sometimes makes dubious discoveries, with patterns that are too wide and implausible, while the two other methods exhibit very weak statistical power. Univariate analysis results obtained with Univ-OLS clearly provide distinct information about the relationship between the voxel signals and the outcome. The results of the five other tasks are available in Fig. A.23.
EnCluDL uses a different decoding procedure to estimate the weight maps (Chevalier et al., 2018) and, as a result, comes with theoretical statistical guarantees: it controls the δ-FWER for a predetermined tolerance parameter δ equal to the largest diameter of the clusters, assuming that the observed samples are i.i.d. and that the weight maps are homogeneous and sparse. The experiments show that, indeed, for i.i.d. scenarios, EnCluDL controls the δ-FWER for δ equal to the average radius of the clusters. However, in some very high SNR or high sample size regimes, it might be necessary to take δ larger than the average radius (see Section A.5). In practice, our choice of δ is conservative, and with current fMRI datasets, δ-FWER control holds for smaller δ, even in relatively large cohorts (n = 1200).
In our experiments, the spatial tolerance δ is around 1 cm. Given that the definition of spatial location is blurred by inter-subject variability in group studies, this tolerance does not seem problematic. The method can thus be used for inference in cognitive neuroscience and in population studies in psychiatry, neurology or epidemiology.
In addition, EnCluDL exhibits the best support recovery performance in the proposed semi-simulated experiments with fMRI data, but also finds patterns with good face validity in the more qualitative experiments plotted in Fig. 7. On the other hand, we also notice that EnCluDL tends to be over-conservative. Taking into account the difficulty of the problem and the fact that the convergence results are only asymptotic, we do not recommend using EnCluDL with n < 100.
In the present study, we have assumed that confounding-variable effects have been removed during fMRI data preprocessing. However, it is still possible to include an additional confounding variable among the covariates before performing the inference. With regard to EnCluDL, we note that confounding variables should be handled separately from the clustered brain features.
Although it is not the main purpose of this study, we also checked the prediction performance of the decoders produced by each method. It is important to note that EnCluDL has been designed for the recovery of conditional statistical associations, not for prediction. In practice, the prediction performance is almost the same for SVR and Ada-SVR, and is slightly better than that of EnCluDL (see Fig. A.24 ). For prediction purposes, we recommend using Fast Regularized Ensembles of Models (FReM) ( Hoyos-Idrobo et al., 2018 ), a stable and computationally efficient decoder with state-of-the-art prediction performance.

Fig. 8. Unmasked task-fMRI experiment. The results of the unmasked task-fMRI experiment, described in Section 4.4 , show that EnCluDL, Thresholded SVR (Thr-SVR) and the permutation test (Perm-SVR) do not return out-of-brain discoveries, while the Adaptive Permutation Threshold SVR (Ada-SVR) does. Here z-score maps are thresholded such that the δ-FWER is at most 10% for δ = 6 voxels (18 mm). Thr-SVR and Perm-SVR do not yield spurious detections, but they make very few detections, hence these methods have low statistical power. EnCluDL does not make any spurious detection; rather, it makes detections in the temporal lobe and Broca's area, which are expected for a reading task. Univ-OLS does not make any out-of-brain detection either, but returns significant associations in the temporal lobe.

Fig. 9. Influence of the number of clusters q on δ-FWER control and the recovery properties of EnCluDL. The results of the experiment described in Section 4.5 show the impact of q on the δ-FWER control and the recovery score of EnCluDL. When q ≥ 500, clusters are smaller, hence the δ-FWER is controlled for δ = 12 mm (and potentially lower values of δ), since all the empirical δ-FWERs are lower than the 10% nominal rate. Conversely, when q < 500, clusters are wider and the spatial tolerance is overcome by the model inaccuracy, hence the δ-FWER is not controlled for δ = 12 mm; it remains controlled for higher values of δ. Concerning the recovery properties, we see that reducing the number of clusters improves the precision-recall curves. Thus, the more spatial uncertainty is tolerated, the better the recovery properties EnCluDL offers.
For pedagogical purposes, we have also considered a dataset where cross-sample independence is violated due to serial correlation, reproducing an experiment of Eklund et al. (2016) . The ensuing loss of statistical control underlines the importance of the i.i.d. hypothesis. Hence, EnCluDL should not be used to make inferences from an intra-subject dataset recorded over a single session. With these warnings in mind, we think that EnCluDL can be used safely in a neuroimaging context. Our code, implemented in Python 3, can be found at https://github.com/ja-che/hidimstat along with some examples.
We have not considered the method proposed by Nguyen et al. (2019) based on Knockoff filters ( Barber and Candès, 2015; Candès et al., 2018 ), which nevertheless appears to be an appealing procedure, because it can only control the FDR. In this study we have focused on δ-FWER control, and hence defer the analysis of FDR-controlling procedures to future work. Also, we have not benchmarked post-selection inference procedures ( Berk et al., 2013; Lee et al., 2016 ), as we found them challenging to run in high-dimensional settings and prone to numerical underflows.
Our empirical results clearly show that standard thresholding procedures, including classical permutation tests, are not reliable for inferring region importance from decoder maps, due to the high number of covariates. Since, in neuroimaging studies, these maps are used to provide evidence on the brain regions that support an outcome, it is crucial to use a procedure with statistical control on the brain maps. Our study shows that EnCluDL provides such control.
Ethics statement. No experiments on living beings were performed for this study; hence, IRB approval was not necessary. The data we used were acquired in original studies that had received approval from the original institutions' IRBs. All data were used in accordance with the respective usage guidelines.

Acknowledgments
Data were provided in part by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research, and by the McDonnell Center for Systems Neuroscience at Washington University. This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 826421, and has been supported by Labex DigiCosme (project ANR-11-LABEX-0045-DIGICOSME) operated by ANR as part of the program "Investissement d'Avenir" Idex Paris-Saclay (ANR-11-IDEX-0003-02), as well as by the ANR-17-CE23-0011 Fast-Big project.

A1. Desparsified Lasso
Additional notation. For a matrix $A$, $A_{i,\cdot}$ refers to the $i$-th row and $A_{\cdot,j}$ to the $j$-th column; $A_{i,j}$ refers to the element $(i,j)$, and $A_{(-j)}$ refers to the matrix $A$ deprived of its $j$-th column. $A^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of $A$.

Small-dimension insight. The Desparsified Lasso procedure, introduced by Zhang and Zhang (2014), extends the Ordinary Least Squares (OLS) procedure to $n < p$ cases. Let us first recall the standard OLS framework ($n > p$). Starting from model (1), let us define $z_j \in \mathbb{R}^n$ the residual of the OLS regression of $X_{\cdot,j}$ versus $X_{(-j)}$, given by:

$$z_j = X_{\cdot,j} - X_{(-j)} \hat{\gamma}_{(-j)} , \quad (14)$$

where $\hat{\gamma}_{(-j)}$ refers to the coefficient estimate of the OLS regression of $X_{\cdot,j}$ versus $X_{(-j)}$. In particular, $z_j^{\top} X_{\cdot,k} = 0$ for all $k \in [p] \setminus \{j\}$. In this setting, we also have the following result:

$$\hat{\beta}^{\mathrm{OLS}}_j = \frac{z_j^{\top} y}{z_j^{\top} X_{\cdot,j}} , \quad (15)$$

where $\hat{\beta}^{\mathrm{OLS}}$ is the parameter vector estimate obtained from the OLS regression of $y$ against $X$.
Desparsified Lasso. When $n < p$, it is not possible to construct a nonzero vector family $\{z_j, j \in [p]\}$ ( i.e., a family verifying $z_j \neq 0$ for all $j \in [p]$ ) such that $z_j^{\top} X_{\cdot,k} = 0$ for all $k \neq j$. The idea proposed by Zhang and Zhang (2014) is to construct a family $\{z_j, j \in [p]\}$ that plays the same role as the residuals of the OLS regressions in (14), while relaxing (slightly) the constraint $z_j^{\top} X_{\cdot,k} = 0$. To do so, instead of computing $\{z_j, j \in [p]\}$ by OLS regression, they proposed to take $z_j$ as the residual of the Lasso regression of $X_{\cdot,j}$ against $X_{(-j)}$. Then, from (1), one can derive the following:

$$z_j^{\top} y = z_j^{\top} X_{\cdot,j} \, \beta^*_j + z_j^{\top} \varepsilon + \sum_{k \neq j} z_j^{\top} X_{\cdot,k} \, \beta^*_k . \quad (16)$$

Noticing that the second term in (16) is a noise term and plugging an initial estimator $\hat{\beta}^{(\mathrm{init})}$ of $\beta^*$ into the third term (a standard choice being the Lasso), they propose the following estimator:

$$\hat{\beta}^{\mathrm{DL}}_j = \frac{z_j^{\top} y - \sum_{k \neq j} z_j^{\top} X_{\cdot,k} \, \hat{\beta}^{(\mathrm{init})}_k}{z_j^{\top} X_{\cdot,j}} . \quad (17)$$

Here, one can notice that (17) generalizes (15) to $n < p$. Then, from (16) and (17) one can derive:

$$\hat{\beta}^{\mathrm{DL}}_j - \beta^*_j = \frac{z_j^{\top} \varepsilon}{z_j^{\top} X_{\cdot,j}} + \sum_{k \neq j} \frac{z_j^{\top} X_{\cdot,k}}{z_j^{\top} X_{\cdot,j}} \left( \beta^*_k - \hat{\beta}^{(\mathrm{init})}_k \right) . \quad (18)$$

This yields:

$$\hat{\beta}^{\mathrm{DL}}_j - \beta^*_j = \frac{z_j^{\top} \varepsilon}{z_j^{\top} X_{\cdot,j}} + \Delta_j , \quad (19)$$

where:

$$\Delta_j = \sum_{k \neq j} \frac{z_j^{\top} X_{\cdot,k}}{z_j^{\top} X_{\cdot,j}} \left( \beta^*_k - \hat{\beta}^{(\mathrm{init})}_k \right) . \quad (20)$$

Asymptotically, and under some sparsity assumptions (one can refer to Dezeure et al. (2015) for more details), one can neglect the last term $\Delta_j$ and obtain:

$$\hat{\beta}^{\mathrm{DL}}_j - \beta^*_j \approx \frac{z_j^{\top} \varepsilon}{z_j^{\top} X_{\cdot,j}} \sim \mathcal{N}\!\left( 0, \; \sigma^2 \, \frac{\|z_j\|_2^2}{(z_j^{\top} X_{\cdot,j})^2} \right) . \quad (21)$$

From (21), one can compute the confidence intervals and p-values of the coefficients of the estimated weight map. Note that similar estimators have been derived in parallel in Javanmard and Montanari (2014).
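The construction can be summarized in a short self-contained Python sketch (ours; the regularization levels and the residual-based noise estimate are illustrative simplifications, and the hidimstat package mentioned above provides a proper implementation):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

def desparsified_lasso(X, y, alpha_nodewise=0.01, alpha_init=0.01):
    """Minimal Desparsified Lasso following Eqs. (14)-(21); X, y centered."""
    n, p = X.shape
    # Initial Lasso estimate of beta*, plugged into Eq. (17).
    beta_init = Lasso(alpha=alpha_init, fit_intercept=False).fit(X, y).coef_
    sigma = np.linalg.norm(y - X @ beta_init) / np.sqrt(n)  # crude noise level
    beta_dl = np.empty(p)
    z_scores = np.empty(p)
    for j in range(p):
        X_mj = np.delete(X, j, axis=1)
        # z_j: residual of the nodewise Lasso regression of X[:, j] on X_(-j).
        gamma_j = Lasso(alpha=alpha_nodewise,
                        fit_intercept=False).fit(X_mj, X[:, j]).coef_
        z_j = X[:, j] - X_mj @ gamma_j
        denom = z_j @ X[:, j]
        # De-biasing step (Eq. (17)): subtract the other covariates' share.
        beta_dl[j] = (z_j @ y - z_j @ X_mj @ np.delete(beta_init, j)) / denom
        # Gaussian approximation of Eq. (21) gives a z-score per coefficient.
        z_scores[j] = beta_dl[j] * abs(denom) / (sigma * np.linalg.norm(z_j))
    p_values = 2 * stats.norm.sf(np.abs(z_scores))
    return beta_dl, z_scores, p_values
```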

A2. Adaptive quantile aggregation of p-values
For the $j$-th voxel, we have a vector $(p_j^{(b)})_{b \in [B]}$ of p-values, with one p-value computed for each of the $B$ clusterings. The final p-value of the $j$-th feature is then given by the adaptive quantile aggregation proposed by Meinshausen et al. (2009):

$$p_j = \min\left\{ 1, \; (1 - \log \gamma_{\min}) \inf_{\gamma \in (\gamma_{\min}, 1)} q_{\gamma}\!\left( \left\{ p_j^{(b)} / \gamma \; : \; b \in [B] \right\} \right) \right\} ,$$

where $q_{\gamma}(\cdot)$ denotes the empirical $\gamma$-quantile, and where we have taken $\gamma_{\min} = 0.20$ in our experiments. Taking a value of $\gamma_{\min}$ that is not too small ( e.g., $\gamma_{\min} \geq 0.20$ ) ensures that the discovered sources have received small p-values many times ( e.g., for at least $B/5$ different choices of clustering).
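A minimal sketch of this aggregation rule, approximating the infimum over γ by a grid search:

```python
import numpy as np

def adaptive_quantile_aggregation(p_values, gamma_min=0.20, n_grid=50):
    """Aggregate the B p-values of one voxel (one per clustering)."""
    p_values = np.asarray(p_values, dtype=float)
    gammas = np.linspace(gamma_min, 1.0, n_grid)
    # Empirical gamma-quantile of {p_b / gamma}, for each gamma in the grid.
    q_gamma = np.array([min(1.0, np.quantile(p_values / g, g)) for g in gammas])
    # Optimizing over gamma costs the (1 - log gamma_min) correction factor.
    return min(1.0, (1.0 - np.log(gamma_min)) * q_gamma.min())

# A voxel that received small p-values in most of B = 25 clusterings.
rng = np.random.default_rng(0)
p_b = np.concatenate([rng.uniform(0, 0.001, 20), rng.uniform(0, 1, 5)])
print(adaptive_quantile_aggregation(p_b))  # small aggregated p-value
```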

A3. Empirical analysis of data structure impact
In this section, we propose two simulations to gain more insight into the assumptions about data structure that are necessary for Desparsified Lasso and EnCluDL to have power. More precisely, we investigate up to which level of correlation two correlated predictive features (having non-zero weight) are both identified. Indeed, when two predictive features are highly correlated, there is a risk that the inference procedure only detects one of the two.
The first simulation has a modest data dimension, which corresponds to that of data after clustering; we use it to analyze the behavior of Desparsified Lasso. The second simulation has a 2D structure with larger data dimension; it introduces a short- and long-range correlation structure and is used to study EnCluDL.

Fig. A.10. Impact of correlation when trying to identify two correlated features. Left: the correlation between the first two features is set to ρ, while the other features are uncorrelated. The higher ρ, the harder it is to identify each of the two correlated features. For ρ = 1.0, it is impossible, while for ρ = 0.8, the identification of both features is successful. Right: quantitative summary of the simulations. When the correlation increases, the minimum z-score of the first two features ( "correlated features" ) decreases ( 90% confidence intervals also displayed). The correlation between the two following features ( "control features" ) remains equal to zero, thus the minimum z-score of these features is used as a control value that should not vary significantly. We also plot the z-score of a random non-predictive feature ( "random null feature" ). We observe that for a correlation lower than 0.8 the deviation is limited and it is possible to identify the two correlated variables. For a correlation larger than 0.9 the deviation is massive and it becomes impossible to recover the two correlated variables.

Fig. A.11. Impact of correlation when trying to identify two correlated regions. Left: true weight map, and z-scores estimated by Desparsified Lasso, CluDL and EnCluDL, obtained for ρ = 0.9. Desparsified Lasso cannot handle the extreme short-range correlation that occurs within each predictive region and only identifies one feature in each. CluDL and EnCluDL benefit from the clustering, as they identify all the features of every predictive region. We can also observe that EnCluDL improves upon CluDL thanks to the smoothing effect produced by ensembling. Focusing on the EnCluDL solution, we can see that the z-score of the upper left active region is a bit lower than for the other active regions. This is due to the high correlation between the upper left and bottom right regions. Right: summary of the results of the second simulation. When the correlation increases, the minimum z-score within the correlated active regions decreases. The minimum z-score between the two uncorrelated regions is used as a control. We also plot the z-score of a random non-predictive feature; we notice that, due to the ensembling step of EnCluDL, the empirical confidence intervals are much thinner than in Fig. A.10. We observe that for a correlation lower than 0.8 the deviation is limited and it is possible to identify the two correlated predictive regions. For a correlation larger than 0.9 the deviation becomes large and recovering the two correlated regions becomes impossible.
First simulation: approximating the clustered data setting.
In this simulation we set n = 100 and p = 500. We construct the design matrix X such that the features are normally distributed and the first two features have a correlation equal to a parameter ρ, while all the other features are independent. The weight vector β* is such that β*_j = 1 for 1 ≤ j ≤ 10 and β*_j = 0 otherwise. We also set σ = 1, giving approximately SNR = 12, close to the SNR estimated in real fMRI datasets.
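The following sketch reproduces this generative design (the seed and the exact construction of the correlated pair are illustrative choices of ours):

```python
import numpy as np

def first_simulation(n=100, p=500, rho=0.8, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # Give features 0 and 1 a correlation of rho; all others stay independent.
    X[:, 1] = rho * X[:, 0] + np.sqrt(1.0 - rho ** 2) * X[:, 1]
    beta_star = np.zeros(p)
    beta_star[:10] = 1.0  # support: the first ten features
    y = X @ beta_star + sigma * rng.standard_normal(n)
    return X, y, beta_star

X, y, beta_star = first_simulation(rho=0.9)
print(np.corrcoef(X[:, 0], X[:, 1])[0, 1])  # close to 0.9
```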
To check the ability of Desparsified Lasso to identify two correlated features, we compare the smallest z-score of the first two features ( "correlated features" ) with the smallest z-score of the two following features ( "control features" ) for different values of ρ ∈ (0, 1). While the minimum z-score of the control features should not vary significantly and serves as a control value, the minimum z-score of the two correlated features should decrease towards 0 as ρ increases to 1. We also look at the z-score of a random non-predictive feature ( "random null feature" ) to get insight about the z-score threshold value required to declare a feature significant.

Fig. A.12. Left: correlation histogram of the fMRI data at the voxel level. The correlation between two random voxels is quite low, a typical value being around 0.1. However, when looking at neighboring voxels, we observe that the correlation is often higher than 0.9. This exhibits the short- and long-range correlation structure, but also suggests that raw Desparsified Lasso would not be adapted to this setting. Right: correlation histogram of the clustered data for q = 500. The correlation between two random clusters is around 0.3, while the correlation between two neighboring clusters is around 0.7 and almost always below 0.8. Thanks to clustering, highly correlated voxels are aggregated into groups, and Desparsified Lasso is adapted to this setting.

Fig. A.13. Comparing δ₀ with the distribution of the cluster radius as a function of q. By taking a larger number of clusters, we decrease the size of the clusters; the statistical control is thus valid for a smaller spatial tolerance. Comparing the distribution of the cluster radius with the recommended choice of spatial tolerance parameter δ₀, we observe that δ₀ is a bit lower than the empirical average cluster radius. Finally, we observe that a few clusters are much wider than the others; this may occasionally lead to false discoveries far from the support in high-SNR scenarios.
First simulation results. In Fig. A.10, we give the results for the first simulation. When the correlation of the two correlated features increases, their identification using the Desparsified Lasso procedure becomes harder. In this experiment, we observe that below a correlation of 0.8, Desparsified Lasso can accurately identify the two correlated variables. However, above a correlation of 0.9, Desparsified Lasso might fail to recover both correlated variables.
Second simulation: 2D data structure. The second simulation has a 2D data structure. It aims at approximating the short- and long-range correlation structure that can be observed in fMRI data (see Section A.4 ). The feature space considered is a square with edge length L = 40, hence p = L² = 1600 features, and we took n = 100 samples. To construct β*, we define a 2D weight map β̃* of size L × L with four active regions, then we flatten β̃* into a vector β* of size p. Each active region is a small square of width h = 4, leading to a support of size 4 × h² = 64. The four active regions are located in the corners of the weight map; the true weight map is represented in Fig. A.11. To construct the design matrix X, we first build a 3D array X̃ by drawing random normal vectors of size p that are spatially smoothed with a 2D Gaussian filter (the smoothing is only applied in the feature space for each sample independently; the samples are not mixed and remain independent). We flatten these maps to go from X̃ of size n × L × L to X of size n × p. The spatial smoothing enforces a 2D structure on the data. Then, we further modify X such that (i) all the features of an active region are perfectly correlated and (ii) two of the four active regions are correlated at a given value ρ ∈ (0, 1), the two other active regions being unmodified (hence uncorrelated). The first transformation aims at showing that the clustering is useful to handle the short-range correlation that might be very high in fMRI data (see Section A.4 ). The second transformation aims at testing whether EnCluDL can recover two correlated predictive regions; this is notably desirable in the case of long-range correlation ( e.g., two contralateral brain regions). The two uncorrelated regions are used to provide control values. With these transformations we obtain the design matrix X. As shown in Fig. A.11, the two correlated active regions are located in the upper left and bottom right corners, while the other two are uncorrelated. Finally, we also set σ = 10, to obtain approximately SNR = 4.

Fig. A.14. Expanding HCP maps by 6 voxels. The black-colored voxels represent the positive weights of the reference map constructed in Section 4.2. The red-colored voxels are the δ-dilation of the previous map with δ = 6 voxels, i.e., the tolerance we have taken in all experiments. Then, the δ-FWER controls the false discoveries made outside of the colored voxels (see also Section 3.1 ).
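A self-contained sketch of this 2D generative design follows (the smoothing bandwidth, the standardization of the representative region signals, and the seed are illustrative choices of ours; the two distant representative signals are only approximately independent, so the achieved cross-region correlation is approximately ρ):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def region_indices(L, r0, c0, h):
    # Flattened (row-major) indices of an h x h square at top-left (r0, c0).
    rows, cols = np.meshgrid(range(r0, r0 + h), range(c0, c0 + h), indexing="ij")
    return (rows * L + cols).ravel()

def second_simulation(L=40, n=100, h=4, rho=0.9, sigma_noise=10.0, seed=0):
    rng = np.random.default_rng(seed)
    # Smooth each sample independently to create short-range 2D correlations.
    X = np.stack([gaussian_filter(rng.standard_normal((L, L)), sigma=2.0)
                  for _ in range(n)]).reshape(n, L * L)
    corners = [(0, 0), (0, L - h), (L - h, 0), (L - h, L - h)]
    regions = [region_indices(L, r0, c0, h) for r0, c0 in corners]
    # (i) Perfect within-region correlation: every feature of a region is a
    # copy of one standardized representative signal.
    reps = []
    for idx in regions:
        rep = X[:, idx[0]]
        rep = (rep - rep.mean()) / rep.std()
        X[:, idx] = rep[:, None]
        reps.append(rep)
    # (ii) Long-range correlation: upper-left and bottom-right regions are
    # correlated at (approximately) level rho; the two others are untouched.
    X[:, regions[3]] = (rho * reps[0]
                        + np.sqrt(1.0 - rho ** 2) * reps[3])[:, None]
    beta_star = np.zeros(L * L)
    for idx in regions:
        beta_star[idx] = 1.0
    y = X @ beta_star + sigma_noise * rng.standard_normal(n)
    return X, y, beta_star
```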
To check the ability of EnCluDL to identify two correlated regions, we compare the smallest z-score of the features that belong to one of the correlated regions with the smallest z-score of the features that belong to the uncorrelated active regions; we analyze the results for several values of ρ ∈ (0, 1). To understand the effect of the clustering and ensembling, we compare the Desparsified Lasso, CluDL and EnCluDL solutions qualitatively. Since the features that belong to the same active region are perfectly correlated, we expect Desparsified Lasso to identify at best only one feature per region. We also report the z-score of a random non-predictive feature.
Second simulation results. In Fig. A.11, we give the results for the second simulation. Clustering turns out to be crucial to produce a valid statistical inference solution in the presence of extreme short-range correlation. Additionally, we show that when the correlation of the two correlated active regions increases, their identification using EnCluDL becomes harder. In this experiment, we observe that below a correlation of 0.8, EnCluDL can accurately identify the two correlated regions. However, above a correlation of 0.9, EnCluDL generally fails to recover the two correlated regions.

A4. fMRI data structure
In Section A.3, we have shown that one may encounter multicollinearity issues. It is thus necessary to analyze the correlation structure of actual fMRI data.
In Fig. A.12, we study the correlation observed in the HCP900 emotion-task data. Considering the correlation between random voxels, then between neighboring voxels, we see that the correlation is much higher in the latter case. Notably, the median correlation between two random voxels is 0.1, while the median correlation between two neighboring voxels is above 0.8, and often larger than 0.9. We have shown in Section A.3 that Desparsified Lasso may fail to detect two features when they are so strongly correlated.
Fig. A.12 also shows the correlation histograms obtained after clustering the data. For example, taking q = 500 clusters, the median correlation between two random clusters is 0.3, while it is 0.7 for two neighboring clusters. The inter-cluster correlation always remains below 0.85 and almost always below 0.8. In practice, we have shown in Section A.3 that Desparsified Lasso can handle scenarios where features have a correlation lower than 0.8.

Fig. A.15. Statistical control under the global null with autocorrelated data. The results of the experiment with correlated data under the global null, described in Section A.8, show that, when the data are temporally autocorrelated, all the procedures fail to control the FWER. Indeed, for all the fictitious block response paradigms, the empirical FWER exceeds the targeted nominal level of 10% for every procedure. This result is not surprising, as the procedures control the δ-FWER under the hypothesis that the samples are i.i.d.; this is not the case for the block or event response paradigms. However, when the fictitious response breaks the temporal dependency (binary or Gaussian random responses), the i.i.d. hypothesis is met and the FWER is empirically well controlled, except for the Thr-SVR procedure.

Fig. A.16. Influence of the n/q ratio on the precision-recall AUC. The results of the experiment described in Section 4.5 show that the precision-recall AUC depends almost linearly on log(n/q), except when q is critically low, creating very wide clusters and deteriorating the precision-recall curve. This limit depends on the physical properties of the problem; here, q should not be lower than 100. Keeping this limit in mind, we advise taking q ∈ [n/2, n] to recover most of the predictive regions.
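This diagnostic is easy to reproduce on synthetic smoothed data standing in for fMRI signals; the grid size, smoothing bandwidth and number of random pairs below are illustrative choices of ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
n, L = 200, 40
X = np.stack([gaussian_filter(rng.standard_normal((L, L)), sigma=2.0)
              for _ in range(n)]).reshape(n, L * L)
corr = np.corrcoef(X, rowvar=False)  # (p, p) feature correlation matrix

# Horizontally neighboring pixels: flattened indices (r*L + c, r*L + c + 1).
neighbors = np.array([(r * L + c, r * L + c + 1)
                      for r in range(L) for c in range(L - 1)])
neighbor_corr = corr[neighbors[:, 0], neighbors[:, 1]]

# Random pixel pairs (distinct indices).
pairs = rng.integers(0, L * L, size=(2000, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]
random_corr = corr[pairs[:, 0], pairs[:, 1]]

print("median neighbor correlation:", np.median(neighbor_corr))
print("median random-pair correlation:", np.median(random_corr))
```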

A5. Estimating the δ for which EnCluDL controls the δ-FWER
In Sec. 3.5, we recommend using, in regular brain imaging settings, the spatial tolerance δ₀ given by (12), δ₀ being a distance in voxel units close to the average radius of the clusters used in EnCluDL. However, when the setting is particularly favorable for inference, i.e., if n/log(p) is large or σ is small, the choice of δ given by (12) may be over-optimistic and this formula may need to be corrected. We have found empirically that a suitable multiplicative correction factor C > 0 can be derived; it depends on σ, the standard deviation of the noise ε. In practice σ has to be estimated; in the fMRI datasets we studied, estimates of std(ε) were close to 0.1. However, given the heuristic derivation of this quantity and the uncertainty about the value of σ, we do not recommend correcting δ₀ with a factor lower than 1, as this could lead to a dramatic underestimation of the valid δ. The final formula to compute a δ such that δ-FWER control is ensured is then:

$$\delta = \max(1, C) \, \delta_0 . \quad (23)$$

Note that the formulas given by (12) and even (23) are not bulletproof, but rather give reasonable estimates of δ.

A6. Cluster size analysis
In Section 3.5, we proposed a formula to compute a valid spatial tolerance parameter δ₀. In Fig. A.13, we show that δ₀ is close to, but slightly lower than, the average cluster radius. One can also notice that when taking a larger number of clusters, the clusters are smaller; as a consequence, the statistical control is valid for a lower spatial tolerance. Finally, looking at the shape of the cluster radius distribution, we observe that there are only a few large clusters. In general δ₀ is a suitable choice; however, when the setting is particularly favorable for inference, the mixing effect produced by ensembling might not be sufficient, and voxels far from the support (further than δ₀) might be discovered. This effect can be explained by the detection of large clusters that overlap both the support and the null region.
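As an illustration, the following sketch (our own, using scikit-learn's spatially constrained Ward clustering) computes the cluster radius distribution on a 2D grid; taking the average radius as a candidate δ₀ mirrors the spirit of the recommendation above, though it does not reproduce Eq. (12):

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.feature_extraction.image import grid_to_graph

rng = np.random.default_rng(0)
n, L, q = 100, 40, 200
X = rng.standard_normal((n, L * L))
# Ward clustering of the features, constrained to be spatially contiguous.
ward = FeatureAgglomeration(n_clusters=q,
                            connectivity=grid_to_graph(L, L)).fit(X)

coords = np.indices((L, L)).reshape(2, -1).T.astype(float)  # pixel coordinates
radii = np.array([
    np.linalg.norm(coords[ward.labels_ == c]
                   - coords[ward.labels_ == c].mean(axis=0), axis=1).max()
    for c in range(q)])
print("average cluster radius (candidate delta_0):", radii.mean())
print("largest cluster radius:", radii.max())
```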
A7. Illustrating spatial tolerance on real brain geometry

In Fig. A.14, we display a brain pattern with its spatial tolerance in the case of the HCP data.

A8. Statistical control under the global null with autocorrelated data
Experiment. In this experiment, we study how the different procedures control the FWER when the data are temporally autocorrelated, hence violating the i.i.d. assumption. Notably, this is the case when the data correspond to the fMRI signal recordings of one given subject during an acquisition. We consider data from the HCP900 resting-state fMRI dataset described in Section 4.1 with full samples (n = 1200); the design matrix X contains the 15-min fMRI signal records. As in Eklund et al. (2016), we construct y such that it corresponds to two activity paradigms, block or event responses, with several frequencies: 10 s on/off, 20 s on/off, 30 s on/off, 2 s-activation/6 s-rest, 4 s-activation/8 s-rest. Thus, y is temporally autocorrelated. In these simulations β* = 0, so the δ-FWER and the classical FWER are identical. To better assess the impact of correlation, we also generate y as an i.i.d. (uncorrelated) Bernoulli or standard Gaussian random variable (here again β* = 0), breaking spurious correlations between X and y. These two cases enable us to check whether the procedures still control the FWER at the targeted nominal level on this dataset under the i.i.d. hypothesis. For each kind of response, we repeat the experiment 100 times, using data from 100 different subjects.

Results. We now report the results of the experiment. In Fig. A.15, we observe that for all the fictitious block response paradigms, for every procedure, the empirical FWER exceeds the targeted nominal level (10%), as one would expect. This result is not surprising, since independence across samples is a key assumption for valid statistical inference with any of the four procedures. Notably, concerning EnCluDL, Desparsified Lasso needs the i.i.d. hypothesis ( van de Geer et al., 2014; Zhang and Zhang, 2014 ) to produce valid confidence intervals or p-values. This assumption is not verified for the block or event response paradigms, due to the temporal dependency in the data. However, when the target y is i.i.d., i.e., without temporal dependency (Bernoulli or Gaussian random responses), the FWER is controlled (except for Thr-SVR). Indeed, the model is then no longer confounded by the correlation structure underlying the data.
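For concreteness, the sketch below constructs such fictitious targets (assuming the HCP repetition time of 0.72 s; any hemodynamic convolution used in the original experiment is omitted) and contrasts their lag-1 autocorrelation:

```python
import numpy as np

def block_response(n_scans, tr=0.72, on_off_s=10.0):
    """Alternating on/off box-car paradigm sampled at the scan times."""
    t = np.arange(n_scans) * tr
    return ((t // on_off_s) % 2).astype(float)

def lag1_autocorr(y):
    y = y - y.mean()
    return float(y[:-1] @ y[1:] / (y @ y))

rng = np.random.default_rng(0)
n_scans = 1200
y_block = block_response(n_scans, on_off_s=10.0)         # autocorrelated
y_bernoulli = rng.integers(0, 2, n_scans).astype(float)  # i.i.d.
y_gaussian = rng.standard_normal(n_scans)                # i.i.d.
print(lag1_autocorr(y_block))      # close to 1: strong serial correlation
print(lag1_autocorr(y_bernoulli))  # close to 0
```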

A9. Influence of the n/q ratio on the recovery properties of EnCluDL
When using EnCluDL, the number of clusters q is an arbitrary parameter. We proposed a default choice in Section 4.5; yet, intuitively, q should adapt to the amount of data available: larger sample sizes lead to better estimation, allowing more refined localization, hence a higher q. In Fig. A.16, we show on semi-simulated data that for q ∈ [n/2, n], with n/q fixed, the precision-recall AUC on real data does not depend on n, suggesting to choose q proportional to n.

A10. Statistical control with known ground truth: additional plots
In this section, we provide additional experimental results to assess the detection accuracy of the multivariate estimators, complementing the results in Section 4.2. Fig. A.17 shows additional precision-recall curves, obtained for different values of n: these different settings preserve the relative performance of the methods, while a larger n results in better curves. However, we do not recommend running such analyses with n < 100, since the estimation problem is hard and the statistical guarantees only hold in the asymptotic regime. Fig. A.18 and Fig. A.19 display the performance of the methods in terms of δ-FWER control and precision-recall curves on semi-simulated data where y is binary. This induces a violation of the EnCluDL model that reduces its performance in terms of precision-recall; yet, unlike Ada-SVR, it still controls the δ-FWER accurately. In Fig. A.20 and Fig. A.21, we plot the same quantities when the noise is simulated from a Laplace distribution.

Fig. A.20. δ-FWER control and precision-recall curves on semi-simulated data with continuous response vector and Laplace noise. The results of the experiment described in Section 4.2 with Laplace noise are similar to those presented in Fig. 6 for Gaussian noise.

A11. Face validity on HCP dataset
In Fig. A.23, we plot the results for five tasks taken from the HCP dataset, in addition to the two described in Section 4.6. For all methods, the statistical maps are thresholded such that the δ-FWER stays lower than 10% for δ = 12 mm. Qualitatively, EnCluDL discovers the most plausible patterns; Ada-SVR often makes dubious discoveries, its patterns being too wide and implausible, while the two other methods exhibit very weak statistical power. As discussed in the main text, Univ-OLS provides complementary results that highlight marginal associations between the data and the target.

A12. Prediction performance
In this section, we give results on the prediction performance of the methods. In Fig. A.24, we plot the results of the experiment described in Section 4.7. We notice that the classification error rate is almost the same for SVR (the weight map underlying Thr-SVR and Perm-SVR) and Ada-SVR; their prediction performance is slightly better than that of EnCluDL. Hence, we do not recommend using EnCluDL to achieve state-of-the-art prediction accuracy, but only for statistical inference purposes.

Fig. A.22. δ-FWER control and precision-recall curves on semi-simulated data with continuous response vector, including a univariate method. These results show that the FWER control guaranteed by Univ-OLS for univariate inference does not match the control granted by EnCluDL in the conditional paradigm. This is due to the fact that the null hypotheses being tested are not the same.

Fig. A.23. Estimated predictive patterns on five additional HCP tasks. Here, we plot the results for five tasks of the experiment described in Section 4.6, thresholding the statistical maps such that the δ-FWER stays lower than 10% for δ = 12 mm. Qualitatively, EnCluDL discovers the most plausible patterns; Ada-SVR often makes dubious discoveries, its patterns being too wide and implausible, while the two other methods exhibit very weak statistical power. As discussed before, Univ-OLS provides complementary results that display marginal associations between the voxel signals and the target. The results for the emotion and gambling tasks are available in Fig. 7.

Fig. A.24. Prediction performance. Here we plot the results of the experiment described in Section 4.7. The classification error rate is almost the same for SVR and Ada-SVR; their prediction performance is slightly better than that of EnCluDL. Hence, we do not recommend using EnCluDL to achieve state-of-the-art prediction accuracy, but only for statistical inference purposes. For all tasks, the "chance" classification error rate is 50%.