How to control for confounds in decoding analyses of neuroimaging data

Over the past decade, multivariate pattern analyses and especially decoding analyses have become a popular alternative to traditional mass-univariate analyses in neuroimaging research. However, a fundamental limitation of decoding analyses is that the source of information driving the decoder is ambiguous, which becomes problematic when the to-be-decoded variable is confounded by variables that are not of primary interest. In this study, we use a comprehensive set of simulations and analyses of empirical data to evaluate two techniques that were previously proposed and used to control for confounding variables in decoding analyses: counterbalancing and confound regression. For our empirical analyses, we attempt to decode gender from structural MRI data when controlling for the confound ‘brain size’. We show that both methods introduce strong biases in decoding performance: counterbalancing leads to better performance than expected (i.e., positive bias), which we show in our simulations is due to the subsampling process that tends to remove samples that are hard to classify; confound regression, on the other hand, leads to worse performance than expected (i.e., negative bias), even resulting in significant below-chance performance in some scenarios. In our simulations, we show that below-chance accuracy can be predicted by the variance of the distribution of correlations between the features and the target. Importantly, we show that this negative bias disappears in both the empirical analyses and simulations when the confound regression procedure performed in every fold of the cross-validation routine, yielding plausible model performance. From these results, we conclude that foldwise confound regression is the only method that appropriately controls for confounds, which thus can be used to gain more insight into the exact source(s) of information driving one’s decoding analysis. HIGHLIGHTS The interpretation of decoding models is ambiguous when dealing with confounds; We evaluate two methods, counterbalancing and confound regression, in their ability to control for confounds; We find that counterbalancing leads to positive bias because it removes hard-to-classify samples; We find that confound regression leads to negative bias, because it yields data with less signal than expected by chance; Our simulations demonstrate a tight relationship between model performance in decoding analyses and the sample distribution of the correlation coefficient; We show that the negative bias observed in confound regression can be remedied by cross-validating the confound regression procedure;


INTRODUCTION
In the past decade, multivariate pattern analyses (MVPA) have emerged as a popular alternative to traditional univariate analyses of neuroimaging data  . The defining feature of MVPA is that it considers patterns of brain activation instead of single units of activation (i.e., voxels in MRI, sensors in MEG/EEG). One of the most-often used MVPA methods is "decoding", in which machine learning algorithms are applied to neuroimaging data directly to decode a particular stimulus, task, or psychometric feature. For example, decoding analyses have been used to successfully decode different experimental conditions within subjects, such as object category from fMRI activity patterns  and working memory representations from EEG data (LaRocque,  , as well between-subject factors such as Alzheimer's disease (vs. healthy controls) from structural MRI data  and major depressive disorder (vs. healthy controls) from resting-state functional connectivity ) . One reason for the popularity of MVPA methods, and especially decoding, is that they appear to be more sensitive than traditional mass-univariate methods in detecting effects of interest, which is often attributed to the ability of MVPA to pick up spatially distributed multidimensional representations while univariate methods, by definition, cannot  .
In the past years, however, MVPA's apparent superior sensitivity and ability to pick up distributed representations has been criticized for a number of reasons, both statistical  and more conceptual  in nature. For the purposes of the current study, we focus on the specific criticism forwarded by  , who argue that decoding analyses are inherently "representationally ambiguous". This representational ambiguity arises when the classes of the to-be-decoded variable systematically vary in more than one source of information (see also  . The current study aims to investigate how decoding analyses can be made more interpretable by reducing representational ambiguity. To illustrate the problem of representational ambiguity, consider, for example, the scenario in which a researcher aims to decode gender (male/female) from structural MRI with the aim to contribute to the understanding of gender differencesan endeavour that generated considerable interest and controversy within the scientific community (Chekroud, Ward, Fausto-Sterling, 2016; . By performing a decoding analysis on the MRI data, the researcher hopes to capture meaningful patterns of variation in the data of male and female participants that are predictive of the participant's gender. From the literature, we know that gender dimorphism in the brain is manifested in two major ways  . First, there is a global difference between male and female brains: men have on average about 15% larger intracranial volume than women, which falls in the range of gender differences in height (8.2%) and weight (18.7%;  . Second, brains of men and women are known to differ locally : some specific brain areas are found to be larger in women than in men (e.g., in superior and middle temporal cortex;  and vice versa (e.g., in frontomedial cortex;  . One could argue that, given that one is interested in explaining behavioral or mental gender differences, global (i.e., absolute) differences are relatively uninformative, as it reflects the fact than male bodies are in general larger than female bodies  . As such, our hypothetical researcher is likely primarily interested in the local sources of variation in the neuroanatomy of male and female brains. Now, suppose that the researcher is able to decode gender from the MRI data significantly above chance, it remains unclear on which source of information the decoder is capitalizing: the (arguably meaningful) local difference in brain structure or the (in the context of this question arguably uninteresting) global difference in brain size? In other words, the data contains more than one source of information that may be used to predict gender. In the current study, we aim to evaluate methods that improve the interpretability of decoding analyses through controlling for these "uninteresting" sources of information.
Partitioning effects into true signal and confounded signal Are multiple sources of information necessarily problematic? And what makes a source of information interesting or uninteresting? First of all, this depends on the particular goal of the researcher employing the decoding analysis. In principle, multiple sources of information in the data does not pose a problem if a researcher is only interested in accurate prediction and not necessarily interpretability of the model  . In brain-computer interfaces (BCI), for example, accurate prediction is arguably relatively more important than interpretability , i.e., knowing which sources of information are driving the decoder. Similarly, if the researcher from our gender-decoding example is only interested in accurately predicting gender regardless of model interpretability, representational ambiguity is not a problem . As such, whether 1 representational ambiguity in decoding analyses is a problem thus depends on the specific goal of the researcher .
In most scientific applications of decoding analyses, however, model interpretability is important, because researchers are often interested in the relative contributions of different sources of information in their models. Specifically, in most decoding analyses, researchers often (implicitly) assume that the decoder is only using information in the neuroimaging data that is related to the variable that is being decoded . In this scenario, representational ambiguity (i.e., the presence of multiple sources of information) is problematic as it violates the (implicit) assumption that the decoded variable is the only source of information driving the decoder. Another way conceptualize the problem of representational ambiguity is that, using the aforementioned example, (global) brain size is confounding the decoding analysis of gender. Here, we define a confound as a variable that is not of primary interest, which correlates with the to-be-decoded variable (the target) and is encoded in the imaging data .

2
To give an example of confounding variables in the context of decoding, suppose one is interested in building a classifier that is able to predict whether subjects are suffering from schizophrenia or not based on the subjects' gray matter data. Here, the variable "schizophrenia-or-not" is the variable of interest, which is assumed to be encoded in the neuroimaging data (i.e., the gray matter) and can thus be decoded. However, there are multiple factors known to covary with schizophrenia, like gender (i.e., men are more often afflicted than women;  and substance abuse  , which are also known to affect gray matter  . As such, the variables gender and substance abuse can be considered confounds according to our definition, because they are both correlated with the target (schizophrenia or not) and are known to be encoded in the neuroimaging data (i.e., the effect of these variables is represented in the gray matter data). Now, if one is able to classify schizophrenia with above-chance accuracy from gray matter data, one cannot be sure which source of information within the data is picked up by the decoder: the actual neural representation of schizophrenia or the neural representation of gender or substance abuse? If one is interested in more than mere accureate prediction of schizophrenia, then this ambiguity due to confounding sources of information becomes problematic.
Importantly, as our definition suggests, what is or is not regarded as a confound is relativeit depends on whether the researchers deems it of (primary) interest or not. In the aforementioned hypothetical schizophrenia decoding study, for example, one may equally well define the severity of substance abuse as the to-be-decoded variable, in which the variable "schizophrenia-or-not" becomes the confounding variable. In other words, one researcher's signal is another researcher's confound. Regardless, if decoding analyses of neuroimaging data are affected by confounds, the data thus contain two types of information: the "true signal" (i.e., variance in the data related to the target, but unrelated to the confound) and the "confounded signal" (i.e., variance in the data related to the target that is also related to the confound; see Figure 1 ). In other words, representational ambiguity arises due to the presence of both true signal and confounded signal and, thus, controlling for confounds (by removing the confounded signal) provides a crucial methodological step forward in improving the interpretability of decoding analyses. In the following section, we will review the previously proposed methods to control for confounds. Figure 1 . Visualization of how variance in brain data ( ) can partitioned into "true signal" and "confounded X signal", depending on the correlation structure between the brain data ( ), the confound ( ), and the target ( ).
X C y Overlapping circles indicate a non-zero (squared) correlation between the two variables.

Methods for confound control
In decoding analyses, one aims to predict a certain target variable from patterns of neuroimaging data. In this section, we will review some common methods for controlling confounds in the context of decoding analyses, some of which are supplemented with a mathematical formalization; for consistency and readability, we define the notation we will use in Table 1 .    . For the correlations, we assume that and thus that the correlations in the table reduce to a scalar. P = 1 A priori counterbalancing Ideally, one would prevent confounding influences as much as possible before the acquisition of the neuroimaging data . One common way (in both traditional 3 'activation-based' and decoding analyses) to prevent confounding is to make sure that potential confounding variables are counterbalanced in the experimental design  . In the gender/brain size example described earlier, counterbalancing would entail ensuring that brain size does not differ significantly between men and women (i.e., given that men on average have larger brains than women, this would entail including only men with relatively small brains and women with relatively large brains) . 4 Formally, in decoding analyses, a design is counterbalanced when the confound and the C target are statistically independent. In practice, this often means that the sample is chosen y so that there is no significant correlation coefficient between and (although this does C y not necessarily imply that and are actually independent). To illustrate the process of C y counterbalancing, let's consider another hypothetical experiment: suppose one wants to set up an fMRI experiment in which the goal is to decode abstract object category (e.g. faces vs. houses) from the corresponding fMRI patterns (cf.  , while controlling for the potential confounding influence of low-level or mid-level stimulus features, like luminance, spatial frequency, or texture  . Proper counterbalancing would entail making sure that the images used for this particular experiments have similar values for these low-level features across object categories (again see Görgen et al., 2017 , for details). Thus, in this example, low-level stimulus features should be counterbalanced with respect to object category, such that above chance decoding of object category cannot be attributed to differences in low-level stimulus features (i.e., the confounds).
A priori counterbalancing of potential confounds is, however, not always feasible. For one, the exact measurement of a potentially confounding variable may be unknown until data acquisition. For example, brain size of a participant is only known after data collection. Similarly, Todd and colleagues (2013) found that their decoding analysis of rule representations was confounded by reaction times corresponding to the to-be-decoded trials. Another example of a "data-driven" confound is motion during data acquisition (important in, for example, decoding of clinical populations, like ADHD;  . In addition, counterbalancing confounds a priori may be challenging because of 3 In the context of behavioral data, a priori counterbalancing is often called 'matching' or a employing a 'case-control design'  . 4 Note that the counterbalancing process is the same for both univariate (activation-based) studies and decoding studies, but the direction of analysis is reversed in univariate (e.g., gender → brain) and decoding studies (e.g., brain → gender). As such, in univariate studies the confound (e.g., brain size) is counterbalanced with respect to the predictor ( s ) (e.g., gender) while in decoding studies the confound (e.g., brain size) is counterbalanced with respect to the target (e.g., gender). limited clinical populations, in which researchers do not have the luxury of maintaining a counterbalanced sample due to small sample sizes. Lastly, researchers may simply discover confounds after data acquisition.
Given that a priori counterbalancing is not possible or undesirable in many situations, it is paramount to explore the possibilities of controlling for confounding variables after data acquisition for the sake of model interpretability. From the neuroimaging literature, we identified several methods that aim to control for confounds in decoding analyses, which we discuss below in turn.

Include confounds in model
One perhaps intuitive method to control for confounds in decoding analyses is to include (or, technically, column-wise concatenate) the confounds to the set of predictors (i.e., the neuroimaging data, ; see, e.g.,  . This intuition may stem from the X analogous situation in univariate (activation-based) analyses of neuroimaging data, in which confounding variables are similarly controlled for by including them in the design-matrix together with the stimulus/task regressors. For example, in univariate analyses of functional MRI, movement of the participant is often controlled for by including motion estimates in the design matrix of first-level analyses  ; in EEG, some control for activity due to eye-movements by including activity measured by concurrent electro-oculography as covariates in the design-matrix  . Usually, the general linear model is then used to estimate each predictor's influence on the neuroimaging data. Importantly, the parameter estimates are often interpreted as reflecting the unique contribution of each predictor variable, independent from the influence of the confound.
In the context of decoding, however, this arguably intuitive method to control for confounds by including them as predictors is problematic. This is because this method neglects the fact that these two types of analyses perform inference on different statistics: univariate (activation-based) analyses usually focus on parameter estimates ( ), while decoding β analyses focus on model performance (usually measured as explained variance, R 2 , or classification accuracy;  . While including confounds in the model effectively controls for the parameter values of predictors-of-interest , this method does not 5 control the value for model fit . Model performance statistics (e.g., R 2 , classification accuracy, etc.) alone cannot disentangle the contribution of different sources of information as it only represents a single summary statistic of model fit  , which will only increase when adding more predictors to the model (especially if these predictors are correlated to the target, as is the case with confounds). One might, then, argue that additionally inspecting parameter values of decoding models may help in disambiguating different sources of information  . However, it has been shown that the weight and direction of those parameters cannot reliably be mapped to specific sources of information, i.e., as being task-or confound-related (e.g., features with large weights may 5 Parameter estimates only reflect unique variance when ordinary least squares is used to find the model parameters. Other (regularized) linear models, such as ridge regression or LASSO, are not guaranteed to yield parameters with unique variance. be completely uncorrelated to the target variable;  . As such, it does not make sense to include confounds to the set of predictors when the goal is to disambiguate the different sources of information in decoding analyses.

Control for confounds during pattern estimation
Another method that has been used by some decoding studies on functional MRI data deals with confounds in the initial procedure of estimating activity patterns of the to-be-decoded events  . In this method, an initial first-level (univariate) analysis models the fMRI time series ( ) as a function of both predictors-of-interest ( ) and s X the confounds ( ), often using the GLM : Then, only the estimated parameters ( , or normalized parameters, such as t -values or β z -values) corresponding to the predictors-of-interest ( ) are used as activity estimates (i.e., β x the used for predicting the target ) in the subsequent decoding analyses. This method X y thus takes advantage of the shared variance partitioning in the pattern estimation step to control for potential confounding influences. However, while elegant in principle, this method is not applicable in between-subject decoding studies (e.g. clinical decoding studies; e.g.,  , in which confounding variables are defined across subjects, or in electrophysiology studies, in which activity patterns do not have to be estimated in a first-level model , thus limiting the applicability of this method. 7 Post-hoc counterbalancing of confounds When a priori counterbalancing is not possible, some have argued that post-hoc counterbalancing might control for the influence of confounds (Rao et al., 2017, p. 24, 38). In this method, given that there is some sample correlation between the target and confound ( ) in the entire dataset, one takes a subset of samples in which there is no empirical = 0 r Cy / relation between the confound and the target anymore (e.g., when ). In other words, r Cy ≈ 0 post-hoc counterbalancing is a way to decorrelate the confound and the target through subsampling the data. Then, subsequent decoding analysis on the subsampled data can only capitalize on true signal, as there is no confounded signal anymore after subsampling (see Figure 2 ). While intuitive in principle, we are not aware whether this method has been evaluated before and whether it yields unbiased performance estimates. 6 Note that X and C , here, refer to (usually HRF-convolved) predictors of the time series signal ( s ) for a single voxel. In the rest of the article, X and C refer to features that are defined across samples (not time). 7 Note that, technically, one could use the "Control for confounds during pattern estimation" method in electrophysiology, by first fitting a univariate model explaining the neuroimaging data ( X j for j = 1, …, K ) as a function of both the target ( y ) and the confound ( C ) and subsequently only using the parameter estimates of the target-predictor ( β y ) as patterns in the subsequent decoding analysis; this is, in fact, equivalent to the "confound regression" technique, discussed in the next section.

Confound regression
The last and perhaps most common method to control for confounds is by removing the variance of the confound (i.e., the confounded signal) from the neuroimaging data directly Rao, Monteiro, Mourao-Miranda, & Alzheimer's Disease Initiative, 2017;) -a process we refer to as confound regression (also known as "image correction";  . In this method, a (usually linear) regression model is fit on each feature in the neuroimaging data (i.e., a single voxel or sensor) with the confound(s) as predictor(s). Thus, each feature in the neuroimaging data is modelled as j X a linear function of the confounding variable(s), : C (2) β ε X j = C + We can estimate the parameter(s) for feature using, for example, ordinary least β j X j squares as follows (but for an example using a different model, see  : Then, to remove the variance of (or "regress out") the confound from the neuroimaging data, we can subtract the variance in the data associated with confound ( ) from the original β Cˆj data: In which represents the neuroimaging feature from which all variance of the X j, corr X j confound is removed (including the variance shared with , i.e., the confounded signal; see y Figure 2 ). When subsequently applying a decoding analysis on this corrected data, one can be sure that the decoder is not capitalizing on signal that is correlated to the confound, which thus improves interpretability of the decoding analysis.
Confound regression has been applied in several studies in the context of decoding . Todd 8 and colleagues (2013) were, as far as the current authors are aware, the first to use this method to control for a confound (in their case, reaction time) that was shown to correlate with their target variable (rule A vs. rule B). Notably, they both regressed out reaction time from the first-level time series data (similar to the "Control for confounds during pattern-estimation" method) and regressed out reaction time from the trial-by-trial activity estimates (i.e., confound regression as described in this section). They showed that controlling for reaction time in this way completely eliminated the above-chance decoding performance in a substantial amount of voxels that was observed when not controlling for reaction time. Similarly,  observe a substantial drop in classification accuracy when controlling for scanner-site in the decoding analysis of Huntington's disease, but only when scanner-site and disease status were actually correlated. Lastly, Rao and colleagues (2017) found that, in contrast to Kostro et al. and Todd et al., confound regression yielded similar (or slightly lower, but still significant) performance compared to the model without confound control, but it should be noted that this study used a regression model (instead of a classification model) and evaluated confound control in the specific situation when the training set is confounded, but the test-set is not . In sum, while confound 9 regression has been used before, it has yielded variable results, possibly due to slightly different approaches and differences in the correlation between the confounding variable and the target.

Current study
In summary, multiple methods have been proposed to deal with confounds in decoding analyses. Often, these methods have specific assumptions about the nature or format of the data (like "A priori counterbalancing" and "Confound control during pattern estimation"), differ in their objective (e.g., prediction vs. interpretation , like in "Include confounds in the model"), or have yielded variable results (like "Confound regression"). Therefore, given that we are specifically interested in disambiguating decoding analyses, the current study evaluates the two methods that are applicable in most contexts: post-hoc counterbalancing and confound regression. In addition to these two methods, we propose a third method-a modified version of confound regression-which we show yields plausible, unbiased, and interpretable results.
To test whether these methods are able to effectively control for confounds and whether they yield unbiased results, we apply them both to empirical data and simulated data in which the ground truth with respect to the signal in the data (i.e., the proportion of true signal and confounded signal) is known. For our empirical data, we enact the previously mentioned hypothetical study in which participant gender is decoded from structural MRI data. We use a large dataset ( ) with structural MRI data and try to predict subjects' gender 17 N = 2 (male/female) from the respective gray matter patterns while controlling for the confound of "brain size" using the aforementioned methods, which we compare to a baseline model in which confounds are not controlled for. Given the previously reported high correlations between brain size and gender  , we expect that successfully controlling for brain size yields lower decoding performance than using uncorrected data, but not below chance level. Note that higher decoding performance after controlling for confounds is theoretically possible when the correlation between the confound and the target is sufficiently low to cause suppressor effects (see Figure 1 in  . However, because our confound, brain size, is known to correlate strongly with our target, gender (approx. r = 0.63;  , classical suppression effects are unlikely and thus we expect lower model performance after controlling for brain size. However, our results indicate that both counterbalancing and confound regression lead to unexpected results: counterbalancing fails to reduce model performance while confound regression consistently yields very low model performance up to the point of significant below-chance accuracy. In subsequent simulations, we show that both methods lead to biased results: counterbalancing yields inflated model performance (i.e., positive bias) because subsampling selectively selects a subset of samples in which features correlate more strongly with the target variable, suggesting (indirect) circularity in the analysis  . Furthermore, our simulations show that negative bias (including significant below-chance classification) after confound regression on the entire dataset is due to reducing the signal below what is expected by chance  , which we show is related to and can be predicted by the standard deviation of the empirical distribution of correlations between the features in the data and the target. We propose a minor but crucial addition to the confound regression procedure, in which we cross-validate the confound regression models (which we call "Foldwise Confound Regression"), which solves the below-chance accuracy issue and yields plausible model performance in both our empirical and simulated data.

Data
For the empirical analyses, we used voxel-based morphometry (VBM) data based on T1-weighted scans and tract-based spatial statistics (TBSS) data based on diffusion tensor images from 217 participants (122 women, 95 men), acquired with a Philips Achieva 3T MRI-scanner and a 32-channel head coil at the Spinoza Centre for Neuroimaging (Amsterdam, The Netherlands).

VBM acquisition & analysis
The T1-weighted scans with a voxel size of 1.0 × 1.0 × 1.0 mm were acquired using 3D fast field echo (TR: 8.1 ms, TE: 3.7 ms, flip angle: 8°, FOV: 240 × 188 mm, 220 slices). We used "FSL-VBM" protocol  from the FSL software package (version 5.0.9;  using default and recommended parameters (including non-linear registration to standard space). The resulting VBM-maps were spatially smoothed using 3 millimeter (FWHM) gaussian kernel. Subsequently, we organized the data in the standard pattern-analysis format of a 2D ( ) array of shape 217 (subjects) × 412473 (non-zero N × K voxels).

TBSS acquisition & analysis
Diffusion tensor images with a voxel size of 2.0 × 2.0 × 2.0 mm were acquired using a spin-echo echo-planar imaging (SE-EPI) protocol (TR: 7476 ms, TE: 86 ms, flip angle: 90°, FOV: 224 × 224 mm, 60 slices), which acquired a single b=0 (non-diffusion-weighted) image and 32 (diffusion-weighted) b=1000 images. All volumes were corrected for eddy-currents and motion (using the fsl command 'eddy_correct') and the non-diffusion-weighted image was skullstripped (using FSL-BET with the fractional intensity threshold set to 0.3) to create a mask that was subsequently used in the fractional anisotropy (FA) estimation. The FA-images resulting from the diffusion tensor fitting procedure were subsequently processed by FSL's tract-based spatial statistics (TBSS) pipeline  , in which we used the pipeline's recommended parameters (i.e., non-linear registration to FSL's 1 millimeter FA image, construction of mean FA-image and skeletonized mean FA-image based on the data from all subjects, and a threshold of 0.2 for the skeletonized FA-mask). Subsequently, we organized the resulting skeletonized FA-maps into a 2D ( ) array of shape 217 N × K (subjects) × 128340 (non-zero voxels).

Brain size estimation
To calculate the values for our confound, global brain size, we calculated for each subject separately the total number of nonzero voxels in the gray matter and white matter map resulting from the segmentation step in the FSL-VBM pipeline (using FSL's segmentation algorithm "FAST";  . The number of non-zero voxels from the gray matter map was used as the confound for the VBM-based analyses and the number of non-zero voxels from the white matter map was used as the confound for the TBSS-based analyses. Note that brain size estimated from total white matter volume and total gray matter volume correlated strongly, .

Data and code availability
In the Github repository corresponding to this article ( https://github.com/lukassnoek/MVCA ), we included a script to download the data (the 4D VBM and TBSS nifti-images as well as the non-zero 2D samples × features arrays). The repository also contains detailed Jupyter notebooks with the annotated empirical analyses and simulations reported in this article.

Decoding pipeline
All empirical analyses and simulations use a common decoding pipeline, which is implemented using functionality from the scikit-learn Python package for machine learning  . This pipeline includes univariate feature selection (based on a prespecified amount of voxels with highest univariate difference in terms of the ANOVA F -statistic), feature-scaling (ensuring zero mean and unit standard deviation for each feature), and a support vector classifier (SVC) with a linear kernel, fixed regularization parameter ( ), and sample weights set to be inversely proportional to C = 1 class frequency (to account for class imbalance). In our empirical analyses, we evaluate model performance for different amounts of voxels (as selected by the univariate feature selection). We report model performance as the F 1 score, which is insensitive to class imbalance (which, in addition to adjusted sample weights, prevents the classifier to learn the relative probabilities of target classes instead of representative information in the features). The F 1 score has an expected chance level of 0.5. Statistical significance was calculated using non-parametric permutation tests as implemented in scikit-learn with 1000 permutations  .

Evaluated methods for confound control Counterbalancing
We implement post-hoc counterbalancing in two steps. First, to quantify the relation between the confound and the target in our dataset, we estimate the point-biserial correlation coefficient between the confound, (brain size), and the target, (gender) across the entire C y dataset (all samples ). Because of both sampling noise and measurement 1, ... , N i = noise, sample correlation coefficients vary around the population correlation coefficient and are thus improbable to be 0 exactly . Therefore, in the next step, we subsample the data 10 until the correlation coefficient between and becomes non-significant at some C y significance threshold : α In our analyses, we use an of 0.1. α 11 Then, given that the subsampled dataset is counterbalanced with respect to the confound, a random stratified K-fold cross-validation scheme is repeatedly initialized until a scheme is found in which all splits are counterbalanced as well (cf.  . This particular counterbalanced cross-validation scheme is subsequently used to cross-validate the MVPA pipeline. We implemented this counterbalancing method as a scikit-learn -style cross-validator class, available from the aforementioned Github repository (in the counterbalance.py module).

Confound regression
In our empirical analyses and simulations, we tested two different versions of confound regression, which we call "whole-dataset confound regression" (WDCR) and "foldwise confound regression" (FwCR). In WDCR, we regress out the confounds from the predictors from the entire dataset at once , i.e., before entering the iterative cross-validated MVPA pipeline (the approach taken by  . Note that we can do this for all K voxels at once using the closed-form OLS solution, in which we first estimate the parameters : In which is an array in which the first column contains an intercept and the second C N × 2 column contains the confound brain size. Accordingly, is an array. We then β C 2 × K remove the variance associated with the confound from our neuroimaging data as follows: Now, is an array with the same shape as the original array, but with the variance X corr X related to the confound removed (i.e., is residualized with regard to ). C X C In our proposed cross-validated version of confound regression (which was mentioned but not evaluated by Rao et al., 2017 , p. 25), "FwCR", we similarly regress out the confounds from the neuroimaging data, yet instead of estimating on the entire dataset, we estimate β C this within each fold of training data ( ):  (Snoek, 2017).

Simulations
In addition to the empirical evaluation of counterbalancing and confound regression in the gender decoding example, we ran three simulations: one generic simulation to evaluate the three confound control methods on synthetic data (" generic simulation "), one simulation to specifically investigate positive bias observed after counterbalancing (" counterbalancing follow-up simulation "), and one simulation to specifically investigate the negative bias after WDCR and to demonstrate FwCR solves this negative bias (" WDCR/FwCR follow-up simulation "). We discuss the implementation of these simulations below.

Generic simulation
In this simulation, we evaluate how the three methods for confound control behave on data with a prespecified correlation between the confound and the target, , and different r Cy amounts of "confounded signal" (i.e., the explained variance in driven by shared variance y between and ). These simulations allow us to have full control over (and knowledge of) X C the influence of the signal and confound in the data, and thereby help diagnosing the issues associated with counterbalancing and confound control (which are investigated in detail in the method-specific simulations).
Specifically, in the generic simulation, we generate hypothetical data sets holding the correlation coefficient between and constant, while varying the amount of true signal C y and confounded signal. We operationalize true signal as the the squared semipartial Pearson correlation between and each feature in , controlled for . As such, we will y X C refer to this term as signal R 2 : Similarly, we operationalize the confounded signal as the shared explained variance of by y each feature of and . This term, which we will refer to as confound R 2 , is calculated as X C follows: In the simulations reported and shown in the main article, we use , which .65 r Cy = 0 corresponds to the observed correlation between brain size and gender in our dataset. To create data with this prespecified structure, within our simulations we generated (1) a data-matrix of shape , (2) a target variable of shape , and (3) a confound X N × K y N × 1 variable of shape . For the simulations reported in this manuscript, we generate C N × P data with , , and (i.e., a single confound variable). We generate as a 00 N = 2 K = 5 P = 1 y categorical variable with binary values, , with equal class probabilities (i.e., 50%), 0, 1} y ∈ { given that most decoding studies focus on binary classification. We generate as a C continuous random variable drawn from a standard normal distribution. We generate each feature as a linear combination of and plus Gaussian noise. Thus, for each X j y C represents the weight given to and represents the weight given to in β y y β C C the generation of the feature , and is the normal distribution with mean and X j (a, ) N b a standard deviation . Then, we partition the explained variance of into signal R 2 and b y confound R 2 . If one or both of these values are off by more than 0.01 from the desired values, the generative parameters β y and β C are adjusted after which is generated again, X j which is iterated until the data contain the desired "true signal" and "confounded signal". Here, we evaluate the different methods for confound control for two different values for signal R 2 (0.004, representing plausible null data , and 0.1, representing a plausible true 12 effect) and a range of confound R 2 values (in steps of 0.05: 0.00, 0.05, 0.10, …, 0.35). This simulation is iterated 10 times (with different partitions of the folds) for robustness. Importantly, the specific scenario in which confound R 2 equals 0, which represents data which quantitatively does not contain any confounded signal ( ), will serve as ignal R r yX 2 = S 2 "reference model performance" to which we can compare the efficacy and bias of the confound control methods.
After the data have been generated, a baseline model (no confound control) and the three methods outlined above (counterbalancing, WDCR, and FwCR) are applied to the simulated data using the standard pipeline described in the " Decoding pipeline " section (but without univariate feature selection) and compared to the "reference performance".

Counterbalancing follow-up simulation
To further investigate the positive bias after counterbalancing, we simulate a multivariate normal dataset with three variables, which reflect our data ( ), target ( ), and confound ( X y C ), with 1000 samples ( ) and a single feature ( ). We iterate the process 1000 times N K = 1 and subsequently use the dataset which yields the largest (positive) difference between model performance after counterbalancing versus no confound control; in other words, we use the dataset in which the counterbalancing issue is most apparent. While not necessarily representative of typical (neuroimaging) datasets, this process allows us to clearly explain and visualize what causes the positive bias after counterbalancing the data.
To generate data from a multivariate normal distribution, we specified the variance-covariance matrix to have unit variance for all variables, so that covariances can be interpreted as correlations. The covariances in the matrix were generated as pairwise correlations ( ), each sampled from a uniform distribution with range , r , r r yX Cy CX . We generate data using such pre-specified correlation structure because the − .65, 0.65 [ 0 ] relative improvement in model performance after counterbalancing does not appear to occur when generating completely random (normal) data. Moreover, we restrict the range of the uniform distribution from which the pairwise correlations are drawn to because − .65, 0.65 [ 0 ] a larger range may result in the covariance matrix not being positive-semidefinite. After generating the three variables, we binarize the target variable ( ) using a mean-split ( y y = 0 if , otherwise) to frame the analysis as a classification problem rather than a y < ȳ y = 1 regression problem.
We then subsample the selected dataset using on our post-hoc counterbalancing algorithm and subsequently run the decoding pipeline (without univariate feature selection) on the subsampled ("retained") data in a 10-fold cross-validation scheme. Notably, we cross-validate our fitted pipeline not only to the left-out retained data, but also to the data that did not survive the subsampling procedure (the "rejected" data; see Figure 3 ). Across the 10 folds, we keep track of two statistics of both the retained and rejected samples: (1) whether the samples were predicted correctly and (2) the signed distance to the decision boundary. Negative distances in binary classification (in simple binary classification with ) reflect a prediction of the sample as , while positive distances reflect a y ∈ {0, 1} y = 0 prediction of the sample as . Here, however, we want to count the distance of samples y = 1 that are on the "incorrect" side of the decision boundary as negative distances, while counting the distance of samples that are on the "correct" side of the decision boundary as positive distances. To this end, we use a "re-coded" version of the target variable ( if − y * = 1 , otherwise) and multiply it with the distance. As such, we calculate the signed y = 0 y * = 1 distance from the decision boundary ( ) for any sample as: in which refers to the feature weights (coefficients) and refers to the intercept term. w b Then, differences in these two statistics (proportion correctly classified and distance to boundary) between the retained and rejected samples allows us to evaluate if counterbalancing yields unbiased estimates (i.e., better cross-validated model performance on the retained data than on the rejected data would confirm positive bias, as it indicates that subsampling tends to reject hard-to-classify samples). We apply this analysis similarly to the empirical data (separately for the different values of ) to show that the effect of K counterbalancing as demonstrated using simulated data also occurs in the empirical data.

WDCR/FwCR follow-up simulation
To demonstrate the problem with WDCR (and to demonstrate that FwCR solves this), we perform two follow-up simulations: the first follow-up simulation shows that below-chance accuracy depends on the distribution of feature-target correlations ( ; see Jamalabadi et r yX al., 2016 , for a similar argument) and the second follow-up simulation to show that WDCR artificially narrows this distribution, leading to below chance accuracy, which is exacerbated by increasing number of features ( ) and higher correlations between the target and K confound ( ). In the first simulation, we simulate random null-data (drawn from a standard r Cy normal distribution) with 100 samples ( ) and 200 features ( ), as well as a binary target N K feature ( ). We then calculate the cross-validated accuracy using the standard 0, } y ∈ { 1 pipeline (without univariate feature selection) described in the "Decoding pipeline" section; we iterate this process 500 times. Then, we show that the variance of the cross-validated accuracy is accurately predicted by the standard deviation (i.e., "width") of the distribution of correlations between the features and the target ( with ), which we will denote r yX j , ..., K j = 1 by . Importantly, we show that below-chance accuracy likely occurs when the d(r ) s yX standard deviation of the feature-target correlation distribution is lower than the standard deviation of the sampling distribution of the Pearson correlation coefficient parameterized with the same number of samples ( ) and the same effect (i.e., , because we 00 N = 2 ρ = 0 simulated random null-data). The sampling distribution of the Pearson correlation coefficient is described by Kendall & Stuart (1973). When (as in our simulations), the equation is ρ = 0 as follows: Then, in a second simulation, we similarly simulate null-data as in the previous simulation, but now we also generate a continuous confound ( ) with a variable correlation with the C target ( . Before subjecting the data to the decoding 0.0, 0.1, 0.2, ... , .9, 1.0}) r Cy ∈ { 0 pipeline, we regress out the confound from the data (i.e., WDCR). We do this for different amount of features ( ). Then, we apply FwCR on the 1, 5, 10, 50, 100, 500, 1000} K ∈ { simulated data as well for comparison.

Influence of brain size
Before evaluating the different methods for confound control, we determined whether brain size is truly a confound given our proposed definition ("a variable that is not of primary interest, correlates with the target and is encoded in the neuroimaging data"). We evaluated the relationship between the target and the confound in two ways. First, we calculated the (point-biserial) correlation between gender and brain size, which was significant for both the estimation based on white matter, , and the estimation based on grey (216) 645, p .001 r = .
< 0 Second, as recommended by  , who argue that the potential influence of confounds can be discovered by running a classification analysis using the confound as the (single) feature predicting the target, we ran our decoding pipeline (without univariate feature selection) using brain size as a single feature to predict gender. This analysis yielded a mean classification performance ( F 1 score) of 0.78 ( ) when using brain size D 10 S = . estimated from white matter and 0.81 ( ) when using brain size estimated from gray D 9 S = . matter, which are both significant with (see Figure 4A ). .001 p < 0 To estimate whether brain size is encoded in the neuroimaging data, we compared the distribution of bivariate correlation coefficients (of each voxel with brain size) with the sampling distribution of correlation coefficients when and (see section ρ = 0 17 N = 2 " WDCR/FwCR follow-up simulation " for details). Under the null hypothesis that there are no correlations between brain size and voxel intensities, each individual correlation coefficient between a voxel and the confound can be regarded as an independent sample with 17 N = 2 (ignoring correlations between voxels for simplicity). Because is very large for both the K VBM and TBSS data, the empirical distribution of correlation coefficients should, under the null hypothesis, approach the analytic distribution of correlation coefficients parametrized by and . Contrarily, the density plots in Figure 4B clearly show that the observed 17 N = 2 ρ = 0 correlation coefficients distribution do not follow the sampling distribution (with both an increase in variance and a shift of the mode), indicating that at least some of the correlation coefficients between voxel intensities and brain size are extremely unlikely under the null hypothesis. Figure 4. A) Model performance when using brain size to predict gender for both brain-size estimated from grey matter (left) and from white matter (right). Points in yellow depict individual F 1 scores per fold in the 10-fold cross-validation scheme. Whiskers of the box plot are 1.5x the interquartile range. B) Distributions of observed correlations between brain size and voxels ( ), overlayed with the analytic sampling distribution of correlation r XC coefficients when and , for both the VBM data (left) and TBSS data (right). ρ = 0 17 N = 2 Baseline model: no confound control In our baseline model on the empirical data, for different amounts of voxels, we predicted gender from structural MRI data (VBM and TBSS) without controlling for brain size (see Figure 5 ). The results show significant above chance performance of the MVPA pipeline based on both the VBM-data and the TBSS-data. All performance scores averaged across folds were significant ( ). .001 p < 0 Figure 5 . Baseline scores using the VBM (left) and TBSS (right) data without any confound control. Scores reflect the average F 1 score across 10 folds; error bars reflect 95% confidence intervals across 1000 bootstraps. The dashed black line reflect theoretical chance-level performance and the dashed orange line reflects the average model performance when only brain size is used as a predictor for reference; * indicates significant performance above chance with p < 0.001.
These above-chance baseline performance estimates replicate previous studies on gender decoding from MRI data  and will serve as a baseline estimate of model performance to which the confound control methods will be compared. In the next three subsections, we test the three discussed methods to control for confounds: post-hoc counterbalancing, whole-dataset confound regression (WDCR), and foldwise confound regression (FwCR).

Counterbalancing Empirical results
In order to decorrelate brain size and gender (i.e., ), our subsampling algorithm .1 r Cy > 0 selected 117 samples in the VBM data (i.e., a reduction in samples of 46.1%) and 131 samples in the TBSS data (i.e., a reduction of 39.6%). The model performance for different values of (number of voxels) are shown in Figure 6 . Contrary to our expectations, the K predictive accuracy of our decoding pipeline is, for a substantial part of the results (especially in the VBM data), higher after counterbalancing than before counterbalancing. This is particularly surprising in light of the large reductions in sample size, which normally results in a hit in power and thus lower model performance. Figure 6 . Model performance after counterbalancing (green) versus the baseline performance (blue) for both the VBM (left) and TBSS (right) data. Performance reflects the average F 1 score across 10 folds; error bars reflect 95% confidence intervals across 1000 bootstraps. The dashed black line reflect theoretical chance-level performance (0.5) and the dashed orange line reflects the average model performance when only brain size is used as a predictor; * indicates significant performance ( p < 0.001) above chance.
One could argue that this increase in model performance after counterbalancing can be explained by the possibility that the subsampling and counterbalancing just leads to the selection of different features using univariate feature selection compared to the baseline model. In other words, the increase in model performance may be caused by the feature selection function to select "better" voxels (i.e., containing better or more "robust" signal), which lead to higher model performance regardless of the reduction in sample size. However, this does not explain the similar scores for counterbalancing and the baseline model when using all voxels (the data points at in Figure 6 ), as it is voxels .. (all) K = . expected that successful confound control would yield a lower model performance given the high correlation between the target and the confound and the confound and the neuroimaging data (see " Influence of brain size " section).
Another possibility for the relative increase in performance of the model based on the counterbalanced data versus the baseline model is that counterbalancing increases the signal in the data. Indeed, counterbalancing appears to increase the (absolute) correlations between the data and the target ( ), which is visualized in Figure 7 , suggesting an r yX increase in signal.

Figure 7 . Density plots of the correlations between the target and voxels across all voxels before (blue) and after (green) subsampling for both the VBM and TBSS data
As can be seen in Figure 6 and Figure 7 , counterbalancing increases the correlations between the target and neuroimaging data, which goes against the intuition that removing the influence of a confound that is highly correlated to the target will reduce decoding performance. To further investigate these issues, we first replicate this effect of counterbalancing on simulated data in our " generic simulation " section and then elucidate the mechanism behind this phenomenon in the " counterbalancing follow-up simulation ".

Generic simulation
In the generic simulation, we simulated data in which we varied the strength of confound R 2 and signal R 2 , after which we applied the three confound control methods to the data. The results from this generic simulation shows that counterbalancing maintains chance-level model performance when there is almost no signal (i.e., signal R 2 = 0.004; Figure 8 , left graph, green line). However, when there is some signal (i.e., signal R 2 = 0.1; Figure 8 , right graph), we see that counterbalancing yields similar or even higher scores than the baseline model, replicating the effects observed in the empirical analyses. Figure 8. Results from the different confound control methods on simulated data without any experimental effect (signal R 2 = 0.004; left graph) and with some experimental effect (signal R 2 = 0.1; right graph) for different values of confound R 2 . The orange line represents the average performance (± 1 SD ) when confound R 2 = 0, which serves as a "reference performance" for when there is no confounded signal in the data. For both graphs, the correlation between the target and the confound, , is fixed at 0.65. The results from the WDCR and FwCR r yC methods are explained later in the paper.
As is apparent from Figure 8, the counterbalanced data seems to yield better performance than the baseline model only for relatively low confound R 2 values (confound R 2 < 0.15). As suggested by our findings in the empirical data (see Figure 7 ), we hypothesized that the observed improvement in model performance after counterbalancing is caused by the increase in correlations between the target and neuroimaging data. In support of this hypothesis, we find that the increase in correlations after counterbalancing ( ) is indeed strongly correlated to the difference in performance r Δ yX = r yX af ter − r yX bef ore between the counterbalancing model and the baseline model, (see (79 922, p .001 r = . < 0 Figure 9 ), which seems to be inversely related to confound R 2 ( color-coded in Figure 9 ). While this relationship in Figure 9 might be statistically interesting, it does not tell us why counterbalancing tends to increase the correlations between neuroimaging data and target and even outperforms the baseline model when confound R 2 is low and there is signal present. More importantly, it does not tell us whether the counterbalancing procedure uncovers signal that is truly related to the target-in which case the confound acted as a suppressor before counterbalancing-or inflates performance estimates and thereby introduces positive bias. Therefore, in the next section we report and discuss results from a follow-up simulation that intuitively shows why counterbalancing leads to an increase in performance, and that this increase is in fact a positive bias.

Counterbalancing follow-up simulation
In this follow-up simulation, we aim to visualize the scenario in which counterbalancing leads to a clearly better performance than in case of no confound control. As such, we iteratively generate data with a correlation structure that we know leads to a large difference between counterbalancing and the baseline model (i.e., data with a low confound R 2 ) and choose the variables ( ) that yielded the largest difference for our visualization (see the , y, C X " Counterbalancing follow-up " section in the Methods for details).
The data that yielded the largest difference (i.e., a performance increase from 0.589 to 0.801, 26%) are visualized in Figure 10 . The samples are plotted against the value of their feature ( , on the x-axis) and the value of their confound ( , on the y-axis), both before X C subsampling (upper scatter plot) and after subsampling (lower scatter plot). From visual inspection, it appears that the samples rejected by the subsampling procedure (i.e., the samples with the red border) tend to lie relatively close to (or on the "wrong" side of) the decision boundary (i.e., the dashed black line). In other words, subsampling tends to reject samples that are harder to classify based on the data (here, the single feature of ). The X density plots in Figure 10 show the same effect in a different way: while the difference in the mode of the distribution of the confound ( ) is reduced after subsampling (i.e., the density C plots parallel to the y-axis), the difference in the mode of the distribution of the data ( ) is X actually increased after subsampling (i.e., the density plots parallel to the x-axis).
We quantified this effect due to subsampling by comparing the signed distance from the decision boundary (i.e., the dashed line in the upper scatter plot) between the retained samples and the rejected (subsampled) samples, in which a larger distance from the decision boundary reflects a higher confidence of the classifier's prediction (see Figure 3 for a visualization of this method). Indeed, we find that samples that are removed by subsampling lie significantly closer to (or on the "wrong" side of) the decision boundary ( ) than samples that are retained after subsampling ( 358, SD 619 M = − . = . ), as indicated by a independent samples t -test, . 506, SD 580 M = .
(998) 2.32, .001 t = 2 p < 0 Also (which follows from the previous observation), samples that are been removed by subsampling are more often classified incorrectly (75% incorrect) than the samples that would have been retained by subsampling (20% incorrect), as indicated by a chi-squared test, . 270.29, p 0.001 χ 2 = < Figure 10. Both scatterplots visualize the relationship between the data ( with , on the x-axis), the X K = 1 confound ( , on the y-axis) and the target ( ). Dots with a red border in the upper scatterplot are samples that C y are rejected in the subsampling process; the lower scatterplot visualizes the data without these rejected samples. The dashed black lines in the scatterplot represent the decision boundary of the SVM classifier; the color of the background shows how samples in that area are classified (a blue background means a prediction of and a y = 0 green background means a prediction of ). The density plots parallel to the y-axis depict the distribution of y = 1 the confound ( ) for the samples in which (blue) and in which (green). The density plots parallel to C y = 0 y = 1 x-axis depict the distribution of the data ( ) for the samples in which (blue) and in which (green), X y = 0 y = 1 from which it is clear that subsampling preferentially removes samples close to the decision boundary.
To show that the same effect (i.e., removing samples that tend to be hard to classify) occurs in the empirical data after counterbalancing, we apply the same analysis of comparing the model performance and distance-to-boundary between the retained and rejected samples to the empirical data. Indeed, across all different amounts of voxels ( ), the retained samples K were classified correctly significantly more often ( Figure 11A ) and had a significantly larger distance to the classification boundary ( Figure 11B ) than the rejected samples. This demonstrates that the same effect of counterbalancing as shown in the simulated data (i.e., the removal of hard-to-classify samples) likely underlies the increased model performance of the counterbalanced data relative to the baseline model in the empirical data.
In summary, the removal of a subset of observations to correct for the influence of a confound can induce significant bias by removing samples that are harder to classify using the available data ( ). The bias itself can be subtle (e.g., in our empirical results, the X predictive performance falls in a realistic range of predictive performances), and could remain undetected when present. Therefore, we believe that counterbalancing through subsampling the data is an inappropriate method to control for confounds. Figure 11. A) The proportion of samples classified correctly, separately for the "retained" samples (blue line) and "rejected" samples (green line); the dashed line represents chance level (0.5). B) The average distance to the classification boundary for the retained and rejected samples; the dashed line represents the decision boundary, with values below the line representing samples on the "wrong" side of the boundary (and vice versa). * = p < 0.05, ** = p < 0.01, *** p < 0.001.

Whole-dataset confound regression (WDCR)
Empirical results In addition to counterbalancing, we evaluated the efficacy of 'whole-dataset confound regression', i.e. regressing out the confound from each feature separately using all samples from the dataset, to control for the of confounds. Compared to the baseline-model, WDCR yielded a strong decrease in performance, even dropping (significantly) below chance for all TBSS-analyses and a subset of the VBM-analyses (see Figure 12 ). Figure 12 . Model performance after WDCR (orange) versus the baseline performance (blue) for both the VBM (left) and TBSS (right) data. Performance reflects the average F 1 score across 10 folds; error bars reflect 95% confidence intervals across 1000 bootstraps. The dashed black line reflect theoretical chance-level performance (0.5) and the dashed orange line reflects the average model performance when only brain size is used as a predictor; * indicates significant performance (p < 0.001) above/below chance. This strong (and implausible) reduction in model performance after WDCR is investigated in more detail in the next two sections on the results from the simulations.

Generic simulation
In fact, the results from the generic simulation (see Figure 8 ) show that WDCR accurately corrects for the confound in both the case of data without signal (i.e., when signal R 2 = 0.004) and in case of some signal (i.e., when signal R 2 = 0.1), as evident from the fact that the performance after WDCR is similar to the reference performance. This result (i.e., plausible performance after confound control) stands in contrast to the results from the empirical analyses, which is why we ran a follow-up simulation to investigate this specific issue, as is reported in the next section.

WDCR follow-up simulation
Inspired by the work of  on below-chance accuracy in decoding analyses, we ran several follow-up analyses to get insight into why WDCR leads to below-chance model performance. As  show, below-chance model performance occurs when the data contains little signal, which they operationalize using Cohen's (which is proportional to ). In our first follow-up simulation, we sought to refine δ r yX the explanation of the cause of below-chance model performance by linking it to the observed standard deviation of the empirical distribution of correlations between the data ( X ) and the target ( ). To do so, we simulate random data ( ) and a binary target ( y X 0, 1} y ∈ { ) and estimated (per fold) the cross-validated classification accuracy using the standard pipeline described in the methods section. We repeated this process 500 times, yielding 500 data sets. The expected average predictive accuracy for each dataset is 0.5, but this varies randomly across folds and iterations. We hypothesized that this variance can be explained by the standard deviation ("width") of the initial feature-target correlation distribution, : d(r ) s yX narrower distributions may yield relatively lower cross-validated classification accuracy than relatively wider feature-target correlation distributions. Indeed, we find that the initial standard deviation of this distribution is significantly correlated with the cross-validated accuracy, ( Figure 13A ). Importantly, we find that this relationship .73, p 0.001 r(499) = 0 < holds for different values of (see Supplementary Figure 1 ), for different sizes of the N test-set (see Supplementary Figure 2 ), and for different sizes of (see Supplementary K Figure 3 ). This observation, then, begs the question: why do narrower-than-chance correlation distributions lead to below-chance accuray? In follow-up analyses of the data in Figure 13A , we found that relatively narrow distributions of feature-target correlations induce what is known in the machine learning literature as "dataset shift", which describes the phenomenon of a change in feature-target relationship between the train and test set ) . An example of "dataset shift" is observing a positive correlation of 0.1 between a particular feature and the target in the train set while observing a negative correlation of -0.1 for this same feature is the test set. We quantified the effect induced by "dataset shift" as the average (across features ) difference between the train-set feature-target correlation ( ) , ... , K j = 1 r X , y train train and test-set feature-target correlation ( ), weighted by coefficients assigned to the r X , y test test 13 feature by the classifier ( ). Formally, we estimate dataset shift ( ) as follows: w dŝ Indeed, the correlation between this particular operationalization of "dataset shift" and accuracy across simulations is highly significant, r (499) = -0.934 ( Figure 13B ).
Having established the relation between the standard deviation of the initial feature-target correlation distribution and (below-chance) accuracy, we followed up our simulation by investigating specifically the effect of WDCR on the standard deviation of the correlation distribution. We investigated this by simulating data with different strengths of the correlation between the confound and the target ( ) and the number of features ( ). In Figure 14A , it r Cy K is clear that, while the expected chance level is 0.5 in all cases, model performance quickly drops below chance for increasing correlations between the target and the confound and for increasing numbers of features, and even leading to a model performance of 0% correct when the the confound is perfectly correlated with the target and when using 1000 features. Figure 14C shows that, indeed, higher values lead to smaller widths of the correlation r Cy distribution, which is shown in Figure 14D to yields relatively lower accuracy scores. In summary, our simulations show that below-chance accuracy is accurately predicted by the standard deviation (i.e., "width") of the distribution of empirical feature-target correlations and that WDCR reduces this standard deviation, which explain why the empirical analyses yielded below-chance model performance (especially for larger amounts of voxels).

Foldwise confound regression (FwCR) Empirical results
As the results from the empirical analyses and simulations suggest, WDCR appear problematic because of the partitioning of the dataset into a separate train-set and test-set after confound regression. As such, our proposed foldwise confound regression (FwCR) suggest to move the confound regression procedure inside the cross-validation loop and thus also cross-validating this step. As expected, compared to the baseline model (i.e., no confound control), the results from the empirical analyses using FwCR show reduced (but not below-chance) model performance for both VBM and TBSS and all different number of voxels (see Figure 15 ). All performance estimates (for all amounts of voxels and in both VBM and TBSS) are significant ( ). .001 p < 0 Figure 15 . Model performance after FwCR (pink) versus the baseline performance (blue) for both the VBM (left) and TBSS (right) data. Performance reflects the average F 1 score across 10 folds; error bars reflect 95% confidence intervals across 1000 bootstraps. The dashed black line reflect theoretical chance-level performance (0.5) and the dashed orange line reflects the average model performance when only brain size is used as a predictor. * indicates significant performance (p < 0.001) above chance.
Generic simulation Similar to WDCR, FwCR yields plausible and unbiased model performance (see Figure 8 , pink line).

FwCR follow-up simulation
When applied to the simulated null-data as described in the " WDCR follow-up simulation " section, FwCR yields model performance scores at chance level across all levels of the confound-target correlation and different amount of features (see Figure 14B ).

Summary methods for confound control
In this section, we have shown the effects of different method to control confounds (counterbalancing, WDCR, and FwCR) on empirical MRI data and simulated data (see Figure 16 ). Counterbalancing is clearly unable to correctly control for confounding influences, which is putatively caused by indirect circularity in the analysis process due to subsampling. Confound regression shows an expected drop in model performance (but not below chance-level), given that the confound regression step is properly cross-validated (i.e., the FwCR version).

DISCUSSION
Decoding has become a popular alternative to univariate analyses of neuroimaging data. This analysis approach, however, inherently suffers from ambiguity in the source of information picked up by the decoder  . One way to improve the interpretation of decoding results is to control for alternative sources of information (i.e., other than the target-of-interest) that might drive decoding. Effectively controlling for these alternative sources of information, or confounds , helps in disambiguating decoding models. In this article, we reviewed and tested two generic, broadly applicable methods that aim to control for confounds in decoding analyses: (post-hoc) counterbalancing and confound regression. Additionally, we proposed a third method (a modification of traditional confound regression) that, unlike the other two methods, effectively and in an unbiased way controls for confounds.
In both the empirical data and simulations, we found that neither (post-hoc) counterbalancing nor (whole-dataset) confound regression yielded plausible and unbiased model performance estimates. First, we found that post-hoc counterbalancing leads to optimistic (i.e., positively biased) model performance estimates, which is caused by removing samples that are hard to classify during the subsampling process. Because this subsampling process is applied to the entire dataset at once (i.e., it is not cross-validated), it can be seen as a form of indirect circular analysis  , in which the data itself is used to inform analysis decisions, and is thus bound to yield optimistic generalization estimates. Second, our initial evaluation of confound regression, which was applied on the entire dataset ("WDCR"), yielded pessimistic (i.e., negatively biased) and even significantly below-chance model performance estimates. Extending previous research  , we show that this negative bias occurs when the "signal" in the data (operationalized as the width of the feature-target correlation distribution) is lower than would be expected by chance, which we link to the sampling distribution of the Pearson correlation coefficient. Importantly, we show that WDCR systematically narrows the width of the correlation distribution -and thus leads to lower model performance -which is exacerbated by higher correlations by the target and the confound and more features.
To overcome the negative bias observed in WDCR, we propose to cross-validate the confound regression procedure (which we call "Foldwise Confound Regression", FwCR). We show that this method yields plausible model performance in the empirical analyses (i.e., significantly above-chance model performance) and nearly unbiased model performance in the simulations for different datasets varying in the amount of features ( ) and the strength K of the confound ( ). As FwCR can be used with any type of data (electrophysiology, MRI, r Cy and even behavioral data), it represents a universal method to disambiguate decoding analyses . 14 Relevance and consequences for previous and future research We believe our results have implications for not only post-hoc counterbalancing, but counterbalancing in general. In both behavioral research  and neuroimaging research  , a priori counterbalancing (or 'matching') is common to avoid confound effects. However, as we show in the current study, this may unintentionally remove samples that are harder to predict, especially when there is little shared variance between the confound and the other predictors (i.e., when there is low confound R 2 ). Because, conceptually, this represents a form of circular analysis, counterbalancing -regardless of whether it is applied a priori or post-hoc -is bound to yield overly optimistic model performance estimates. Another way of interpreting this finding, more in terms of classical frequentist statistics, is that the model performance estimates are based on a sample that is not representative of the population (see also  . As a result, out-of-sample predictive performance drops significantly, in our case even to chance level.
In contrast to post-hoc counterbalancing, confound regression in its uncross-validated form (i.e., WDCR) has been applied widely in the context of decoding analyses  . Indeed, the first study that systematically investigated the effect of confounds in decoding analyses  used WDCR to account for the confounding effect of reaction times (RT) on decoding of rule representations and found that WDCR completely eliminated the effect found when not controlling for RT. This observation, however, can potentially be explained by the negative bias induced by WDCR. This possible explanation is corroborated by a follow-up study that similarly looked into RT confounding decoding of rule representations  , who did not use WDCR but accounted for RT confounding by including it as a covariate during the pattern estimation procedure (see " Control for confounds during pattern estimation " section), which in contrast to the study by Todd et al. yielded widespread significant decoding. Moreover, while not specifically investigated here, we expect a similar negative bias to occur when a confound is removed from a continuous target variable using 14 While FwCR offers an unbiased method to control for confounds in decoding analyses, it is limited to confounding effects that are linearly encoded in the neuroimaging data (because, here, we use OLS to regress out the confound from the data). Theoretically, if one would use a non-linear model (such as radial-basis SVMs or decision trees) in a decoding analysis, this model could potentially pick up confounded signal that is non-linearly related to the target variable. However, this issue could be mitigated by using non-linear models to regress out the confound from the data (as done in, e.g.,  . WDCR -which may offer an explanation for the null-finding of  , who fail to decode personality characteristics from resting-state fMRI. Interestingly, a technique related to confound regression is recently gaining popularity in studies using representational similarity analysis (RSA) to analyze neural data  . In the context of RSA , the explained variance in the neural data is 15 often partitioned into different (model-based) feature sets (i.e., sources of information), which allows one to draw conclusions about the unique influence of each source of information (see, e.g.,  . Specifically, variance partitioning in RSA is done by removing the variance from the representational dissimilarity matrix (RDM) based on the feature set that needs to be controlled for. Notably, the variance of the RDMs that are not of interest can be removed from only the neural RDM  or both from the neural RDM and the RDM-of-interest  . While the context is different, the underlying technique is identical to confound regression as described and evaluated in this article. Importantly, the studies employing this variance partitioning technique  similarly report plausible model performances after confound regression (i.e., relatively lower but not below-chance performance), corroborating our results with (foldwise) confound regression. Note that the distinction between WDCR and FwCR in the context of most RSA studies (including the aforementioned studies) is irrelevant, as representational similarity analyses are not commonly cross-validated. However, recently, some have proposed to use cross-validated distance measures (such as the cross-validated Mahalanobis distance;  in representational similarity analyses, which could suffer from negative bias when combined with (not cross-validated) variance partitioning similar to what we observed with WDCR in the context of decoding analyses.
The importance of proper confound control is highlighted by the empirical question we address. Without any prediction pipeline optimization, we were able to predict gender with a model performance up to approximately 0.85 without confound control. This is in line with reports from various other studies  . However, this predictive performance is driven by a mixture two sources of information: global and local differences in brain structure. With confound control, however, we show that predictive performance using only local differences lies around 70%-a substantial drop in performance. Especially because the remaining predictive performance is lower than predictive performance using only brain size, we argue that proper confound control may lead one to draw significantly different conclusions about the differences in brain structure between men and women. For the debate on sexual dimorphism, it is thus extremely important to take global brain size into account in the context of decoding analyses (as has been similarly recommended for mass-univariate analyses;  . In general, we believe that the contributions of the current study are twofold. First and foremost, it provides a systematic evaluation of two widely applicable methods to control for confounds and shows that only a single method ("foldwise confound regression") yields plausible and unbiased results. The results from this evaluation hopefully prevents researchers from using counterbalancing and (whole-dataset) confound regression, which we show may introduce (unintended) biases. Moreover, we open-sourced all analyses and preprocessed data ( https://github.com/lukassnoek/MVCA ) and provide a simple implementation for foldwise confound regression that interfaces with the popular scikit-learn package in Python. Second, we believe that this study improves understanding of the elusive phenomenon of below-chance accuracy (building on previous work by  . In general, we hope that this study helps researchers in gaining more insight into their decoding analyses by providing a method that disentangles the contributions of different sources of information that may by encoded in their data. Figure 1. These plots show that the relationship between the standard deviation of the empirical feature-target correlation distribution, , and accuracy holds d(r ) s yX for different samples sizes (i.e., values for ). Note that the predicted accuracy based on N the standard deviation expected from the sampling distribution is at 0.5 for every plot. The data were generated in the same manner as reported in the " WDCR follow-up " section. Figure 2. These plots show that the relationship between the standard deviation of the empirical feature-target correlation distribution, , and accuracy also d(r ) s yX holds for sizes of the test-set (replicating results from   . Note that the predicted accuracy based on is again at 0.5 for every plot. d(r ) s yX

Supplementary
The data were generated in the same manner as reported in the " WDCR follow-up " section. Figure 3. These plots show that the relationship between the standard deviation of the empirical feature-target correlation distribution, , and accuracy also d(r ) s yX holds for different amounts of features ( ). Note that the predicted accuracy based on K is approximately at 0.5 for every plot. The data were generated in the same manner d(r ) s yX as reported in the " WDCR follow-up " section.