Cross-validation and permutations in MVPA: Validity of permutation strategies and power of cross-validation schemes

Multi-Voxel Pattern Analysis (MVPA) is a well-established tool to disclose weak, distributed effects in brain activity patterns. The generalization ability is assessed by testing the learning model on new, unseen data. However, when limited data is available, the decoding success is estimated using cross-validation. There is general consensus on assessing statistical significance of cross-validated accuracy with non-parametric permutation tests. In this work we focus on the false positive control of different permutation strategies and on the statistical power of different cross-validation schemes. With simulations, we show that estimating the entire cross-validation error on each permuted dataset is the only statistically valid permutation strategy. Furthermore, using both simulations and real data from the HCP WU-Minn 3T fMRI dataset, we show that, among the different cross-validation schemes, a repeated split-half cross-validation is the most powerful, despite achieving slightly lower classification accuracy when compared to other schemes. Our findings provide additional insights into the optimization of the experimental design for MVPA, highlighting the benefits of having many short runs.


Introduction
Multi-Voxel Pattern Analysis (MVPA) has become a widely established tool to analyze imaging data, being particularly suited to disclose weak, distributed effects in brain activity patterns, which would otherwise go undetected using traditional univariate statistical tests (Formisano et al., 2008; Haynes, 2015; Pereira et al., 2009). MVPA has been extensively used to infer brain states from imaging data (e.g. functional Magnetic Resonance Imaging (fMRI), electroencephalography (EEG)). Given the typical dimensions of the problems, i.e. with relatively limited samples available and possibly a large number of features, classification is often performed using discriminative models. These models try to find a "rule" to separate the examples in two or more classes, without focusing on the statistical characterization of each class (which is instead done by generative models, see Bishop, 2006, 1.5.4). The rule estimated by discriminative models is often validated on new, unseen data, in order to avoid over-interpretation of the learning data (overfitting). When the amount of data is large, the data are usually divided into separate training and testing sets; when limited data are available, the decoding performance is instead estimated with cross-validation (Fig. 1). The statistical significance of the cross-validated accuracy is often assessed with a binomial test; however, the violation of the binomial independence assumptions makes the binomial test invalid. In a valid test, the probability of Type-I error (incorrectly rejecting the Null Hypothesis) is smaller than or equal to the chosen nominal level of significance α. The use of invalid tests results in an inflation of false positives, which in turn harms the reliability and reproducibility of the findings; see for instance (Eklund et al., 2016) on the issue of false-positive inflation in fMRI cluster-based multiple comparison correction. It has been shown in Bengio and Grandvalet (2004) that the errors obtained on different cross-validation iterations are correlated, due to the partial overlap of the training data; the binomial model ignores such correlation and can underestimate the variance of the overall error. This underestimation can, in turn, result in an overconfident statement over a finding, resulting in an invalid test characterized by an increased chance of false positives, as shown in Combrisson and Jerbi (2015); Noirhomme et al. (2014).

Fig. 1.
Example of repeated cross-validation with 8 indivisible sets (units) and 4-fold cross-validation. Each row contains a cross-validation iteration, training data in blue, testing in red. A whole repetition of cross-validation consists of 4 iterations (i.e. training-testing combinations). When multiple repetitions are used, the partitions are created using a different random assignment of units to the cross-validation partitions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
To control Type-I errors, the most employed approach is to run a permutation test (Golland et al., 2005; Ojala and Garriga, 2010), which empirically characterizes the null distribution by repeatedly evaluating the classifier on resampled datasets. Permutation tests are also needed in population studies, when cross-validation is used. In this scenario, decoding can be carried out across subjects or within subjects, with different aims and limitations, see e.g. (Wang et al., 2019). When decoding is conducted across subjects, the decoding performance can be validated in a straightforward manner with a permutation test. When cross-validation analysis is conducted instead within each subject, a correct estimate of the p-value of single subjects, which can be obtained with a permutation test, is still needed to perform valid inference on the population, see the prevalence-based testing introduced in Allefeld et al. (2016).
A permutation test on cross-validated MVPA compares the decoding error (or accuracy) with its reference distribution under the null hypothesis of independence between the samples and the class membership. To this end, several resampled datasets are generated under the null hypothesis by permuting the data or the labels (see Section 2.1.2 for more details), evaluating the cross-validation error on each resample. A p-value can then be estimated as the probability of obtaining, on the resamples, a statistic as extreme as, or more extreme than, the observed one (Ernst, 2004). If the p-value is lower than the chosen significance level α, the null hypothesis is rejected. Following (Golland et al., 2005), the only theoretically justified way of obtaining a permutation distribution is to use on each resample the same cross-validation estimation used on the original data. This implies that permutation procedures that do not adhere to this principle, such as generating different resamples for the training and testing partitions in cross-validation, are not theoretically guaranteed to be valid, and simulations need to be conducted to ensure they do not result in a severe false positive inflation.
Another important aspect is choosing the cross-validation strategy that maximizes sensitivity, i.e. the probability of correctly rejecting the Null Hypothesis when it is false, also referred to as power, calculated as one minus the Type II error rate. High power also has important consequences on the positive predictive value (PPV), the post-study probability of a claimed finding being true, which can be surprisingly poor in low-power situations (see, for instance, Ioannidis, 2005). In the context of cross-validated MVPA, the choice of the number of cross-validation partitions and the assignment of data to them can have an impact on power. In fact, randomness in selecting the partitions for cross-validation has an effect on the variability of the classifier performance, which in turn can lead to sensitivity loss. It has therefore been suggested to repeat the whole cross-validation scheme several times (Lemm et al., 2011; Varoquaux et al., 2017) in order to obtain more stable estimates. In fMRI MVPA studies this has further implications for the experimental design: in those cases where the partitioning is based on runs (see the concept of unit in Section 2.1.1 and Fig. 1), performance stabilization by means of cross-validation repetition is more effective in the presence of many runs, as the number of possible training-testing partitions grows with the number of runs.
In the remainder of this work, we will examine the statistical validity of different permutation schemes and compare the power of different cross-validation schemes. We will first introduce the problem more formally in Section 2.1 and then illustrate the analyses on simulations and real data (Sections 2.2, 2.3 and 3). The Matlab implementation of all the analyses can be found at https://github.com/GiancarloValente/MVPACrossValidationAndPermutations.

Problem definition
In the following section we will discuss some aspects of cross-validated MVPA ( Section 2.1.1 ) and permutation methods ( Section 2.1.2 ) needed to deal with validity and power. A definition of the terms used in this work is provided in Table 1 .

Cross-validated MVPA
In the first step of MVPA, feature extraction, the fMRI time series observed in each subject across the different runs are summarized in samples, each of them associated with one of the experimental conditions (classes). This operation can be done in several ways, depending on the experimental design and the goal of the experiment. Individual time points could be considered as samples; however, two popular choices are single-trial estimation, based either on a single-trial GLM (Formisano et al., 2008) or on more sophisticated strategies to account for short Inter Stimulus Intervals (ISI) (Mumford et al., 2012), and extracting one value per run per condition with a run-wise GLM (Haynes, 2015). Once the features are extracted, the dataset of each subject consists of multiple samples, each with its associated class, divided into several runs.

Table 1. Glossary.
In k-fold cross-validation the samples are partitioned into k disjoint sets. A learning model is trained on k − 1 partitions and tested on the remaining one, and all the combinations of training and testing are used, for a total of k separate iterations. In this way, in each iteration the training and the testing are done on separate data, and the total amount of errors across all the testing datasets provides an estimate of the out-of-sample generalization. It has been shown that cross-validation is a pessimistic estimator of the generalization error, since in k-fold the model is trained only on a portion of the data ((k − 1)/k of the total), with the bias being the strongest for k = 2 (split-half) and reasonably small for k = 5 or k = 10, see (Kohavi, 1995). The limit case of k equal to the number of samples is often referred to as leave-one-out cross-validation; it has the smallest bias among the cross-validation procedures, albeit with a large variance which makes it rather unstable when compared with 5-fold or 10-fold cross-validation (Kohavi, 1995).
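As a minimal illustration, the procedure can be sketched in Matlab as follows (the naming is ours, not the paper's code; trainAndTest stands in for any classifier routine that returns the number of errors on the test set):

```matlab
% Minimal sketch of k-fold cross-validation over samples.
% X: nSamples x nVoxels data; y: nSamples x 1 class labels;
% K: number of folds; trainAndTest: user-supplied classifier function.
function err = kfoldCV(X, y, K, trainAndTest)
    n    = numel(y);
    fold = mod(randperm(n), K) + 1;       % random sample-to-fold map
    nErr = 0;
    for k = 1:K                           % one iteration per fold
        isTest = (fold == k);
        nErr = nErr + trainAndTest(X(~isTest, :), y(~isTest), ...
                                   X(isTest, :),  y(isTest));
    end
    err = nErr / n;                       % pooled out-of-sample error
end
```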
Cross-validation can overestimate the out-of-sample generalization if the training and testing data are not independent. In within-subject MVPA, this problem arises when samples retrieved from the same fMRI trial are used in both the training and testing partitions (Lemm et al., 2011). Moreover, an overestimation can occur when performing within-run cross-validation using single-trial estimates in an experiment with relatively short ISI, see (Mumford et al., 2014). Similarly, when conducting MVPA across different subjects, confounding effects such as differences in scanner homogeneity across subjects come into play and alter the statistical properties of each subject randomly. From a hierarchical model perspective, this random effect across subjects introduces a correlation among the estimated samples within each subject, see the intra-cluster correlation coefficient (Eldridge et al., 2009).
To avoid overestimation of the generalization, it has to be ensured that training and testing data are independent. In the remainder of the work we will consider the smallest "indivisible" sets of data, referred to as elementary units, or simply units, and make sure that the data within a unit do not end up in two or more cross-validation partitions, therefore cross-validating at the unit level and guaranteeing independence between training and testing data. The partitioning of data in units depends on the analysis considered. In within-subject analyses, a unit could consist, for instance, of all the single-trial estimates obtained in a single run, and cross-validating at the unit level leads to across-run MVPA. In across-subject MVPA, a unit consists of all the samples belonging to a specific subject.
Notably, in all the cases but the leave-one-unit-out, there are multiple ways of assigning units to the different partitions; a single repetition of cross-validation relies on one random assignment of units across partitions, while, when cross-validation is repeated, a different random assignment is used in each repetition. Fig. 1 illustrates the assignment of units to different partitions within each repetition of cross-validation. When the number of partitions and units is small, it is possible to exhaustively cover all the training-testing combinations: with 10 units and 5-fold cross-validation there are (10 choose 2) = 45 unique combinations (each iteration being identified by its two test units). When the numbers are larger it is preferable to repeat the entire cross-validation a fixed number of times: with 20 units and 2 partitions there are (20 choose 10) = 184756 unique combinations. A related approach, hold-out cross-validation, consists of selecting a fixed percentage of the data (say 80%) for training and the remaining for testing, and repeating the procedure several times. This procedure is very similar to repeated cross-validation, and we will therefore not consider it in the remainder of the work. A sketch of the repeated, unit-level procedure is given below.
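The following sketch extends the hypothetical kfoldCV above to the unit-level, repeated variant (again, all names are ours):

```matlab
% Repeated cross-validation at the unit level: each repetition draws a
% fresh random unit-to-fold assignment, and whole units are kept together.
% unit: nSamples x 1 unit membership; R: number of repetitions.
function errs = repeatedUnitCV(X, y, unit, K, R, trainAndTest)
    units = unique(unit);
    errs  = zeros(R, 1);
    for r = 1:R
        foldOfUnit = mod(randperm(numel(units)), K) + 1;
        nErr = 0;
        for k = 1:K
            isTest = ismember(unit, units(foldOfUnit == k));
            nErr = nErr + trainAndTest(X(~isTest, :), y(~isTest), ...
                                       X(isTest, :),  y(isTest));
        end
        errs(r) = nErr / numel(y);       % pooled error of this repetition
    end
end
% The counts quoted in the text: nchoosek(10, 2) = 45, nchoosek(20, 10) = 184756.
```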

Permutation and randomization tests
In a permutation test, the cross-validation error is estimated on resampled data, obtained under the null hypothesis that there is no difference between the observations. If all the possible resamples are considered, an exact p-value can be obtained, while using randomly selected resamples results in a Monte Carlo estimation.
There are several ways of obtaining resamples, based on different assumptions. Following (Scheffé, 1959), permutation tests can be used when there is a symmetry in the joint distribution of the observations, that is, when there exist permutations of the observations that leave such distribution unchanged. Such symmetry is present if we assume that the samples observed for a given class are a random sample from the population of brain patterns associated with that class, or if we assume that a randomized design was used, where in different realizations of the experiment a sample could have been associated with either of the classes. In (Ernst, 2004) these two approaches are described as population and randomization models respectively, and the resampling methods are called permutation and randomization tests respectively (although it is not uncommon to see both referred to as permutation tests, Scheffé (1959), or permutation methods, Ernst (2004)). In permutation tests, samples are assumed random and resamples are obtained by permuting the samples while keeping the class labels fixed, while in randomization tests the design is assumed to be random and the resamples are constructed by resampling class membership, i.e. considering different experimental randomizations, keeping the samples constant. The two approaches result in the same procedure if the experimental design is completely randomized, while they differ when the design is not fully randomized.
When the symmetry assumption of the population model is violated, that is, when the data are not exchangeable under the null hypothesis, a permutation test is not justified. This violation could be caused, for instance, by the presence of autocorrelated noise or by the use of a short ISI within a run, combined with single-trial estimates. Please note that correlated samples can still be exchangeable (for instance, features extracted with a run-wise GLM are correlated but still exchangeable within each run). Randomization tests, on the other hand, can still be used if there is randomness in the design that can be exploited to generate new assignments of class membership to the samples.
In fMRI data analysis, the violation of exchangeability has been dealt with in different ways. In the context of linear models, restricting the permutations in a way that reflects the block structure of the dataset has been advocated in Wang et al. (2014) and Winkler et al. (2014), while in Raz et al. (2003) the authors show that a permutation test on the estimated amplitude of event-related fMRI trials is valid assuming that the design is fully randomized. A recent work (Maris, 2019) puts explicit emphasis on the use of "randomization" tests, rather than "permutation" tests, as a valid approach to non-parametric testing in linear models in biological modeling experiments. Based on these considerations, we assume, in the remainder of the work, that a correct strategy is employed to generate resamples, and we will focus on the interaction between cross-validation and non-parametric hypothesis testing. In the simulations (unless otherwise stated) and in the real data used in this work, the exchangeability assumptions are met within each unit, and we will therefore refer simply to "permutations" (rather than randomizations or unrestricted permutations), computed in each unit separately. However, we recommend employing randomization tests (or restricted permutations) and refer to the relevant works (Maris, 2019; Winkler et al., 2014), should the MVPA be done using single-trial estimates obtained with a design with short ISI that is not completely randomized.

Cross-validation permutation scheme
In Fig. 2 we illustrate four possible choices of permutations for a cross-validation across runs. In the left panel we have three runs with 6 examples each (coded with different gray values), and consider a leave-run-out cross-validation scheme (training runs highlighted in blue, testing runs in red), where each column depicts one of the three cross-validation iterations. Please note that the examples within each run are presented in an alternate design for illustrative purposes only. In a real experiment a different presentation scheme would be used in each run.
In the right panel we show four different resampling strategies (all conducted within each run) that could be employed. In the top left, the labels are reassigned within each run and then the whole cross-validation procedure is performed, whereas in the other three panels the permutations are performed independently within each cross-validation iteration, resampling both the training and the testing data (top right), only the training data (bottom left) or only the testing data (bottom right). As discussed in the introduction, the cross-validation estimator should be used on each resample in its entirety. If this is not done, it is unclear whether the result is a negligible or a dramatic increase in false positive probability. We considered three widely employed MVPA toolboxes, namely PyMVPA (Hanke et al., 2009), CoSMoMVPA (Oosterhof et al., 2016) and PRoNTo (Schrouff et al., 2013). In the last two toolboxes, the cross-validated error is estimated on each resample, i.e. permuting before performing cross-validation, while the PyMVPA documentation explains that the software can perform the "within cross-validation" permutation scheme (permuting both training and testing within each iteration), but that this is not valid and that only the training dataset should be permuted within each cross-validation iteration. A study showed that resampling the dataset before running cross-validation results in the broadest null distribution (Etzel and Braver, 2013), but to the best of our knowledge the false positive rate of the different strategies has not been examined.
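In Matlab-like terms, the valid scheme (top left in Fig. 2) amounts to the following sketch, reusing the hypothetical repeatedUnitCV above; the key point is that the permuted labels yPerm are drawn once per resample and kept fixed across all iterations and repetitions:

```matlab
% Valid permutation strategy: permute labels within each unit ONCE per
% resample, then run the entire (possibly repeated) cross-validation.
nPerm = 1000;
permErr = zeros(nPerm, 1);
for p = 1:nPerm
    yPerm = y;
    for u = unique(unit)'                        % within-unit shuffles
        idx = find(unit == u);
        yPerm(idx) = y(idx(randperm(numel(idx))));
    end
    permErr(p) = mean(repeatedUnitCV(X, yPerm, unit, K, R, trainAndTest));
end
```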

Type I error rate of different permutation strategies

Simulations
To examine the validity of the different strategies we considered simulated data under H0, i.e. where no information that would allow decoding was present. The simulations used here can be used to investigate both within-subject decoding (where each unit is an fMRI run) and across-subject decoding (where a unit is a single subject). We considered 20 units with a total of 80 samples (40 samples per class), evenly distributed across all the units, and 100 voxels. The data were generated from a multivariate normal distribution with identity covariance matrix. For each dataset, we considered several cross-validation schemes, namely split-half, five-fold, ten-fold and leave-one-unit-out; all the cross-validations were repeated 20 times, when possible (leave-one-unit-out can only be performed once). For each cross-validation scheme, we validated the observed error by running 1000 permutations with the four permutation schemes described in Fig. 2, and determined whether the null hypothesis of decoding at chance could be rejected. Among the possible learning algorithms, we considered a Support Vector Machine (SVM) classifier (Boser et al., 1992), using the implementation in the LIBSVM software package (Chang and Lin, 2011) with a pre-computed kernel to speed up the permutations, a Gaussian Naïve Bayes (GNB), with a fast parallel implementation for the permutations based on (Ontivero-Ortega et al., 2017), and an L2-regularized logistic regression (L2LR), as implemented in the package LIBLINEAR (Fan et al., 2008). The whole procedure (generate the data, run the different cross-validation schemes and for each scheme run 1000 permutations using the four permutation strategies) was repeated 1000 times, to provide an empirical estimate of the Type I error rate (i.e. how often the null hypothesis would be rejected when the decoding is actually at chance). All the simulations were run in MATLAB (http://www.mathworks.com) using the distributed computing toolbox.
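For each dataset and scheme, the Monte Carlo p-value can then be obtained with the standard formula that counts the observed statistic among the resamples (Ernst, 2004); in the notation of the sketches above (obsErr being the observed cross-validation error):

```matlab
% Monte Carlo p-value: probability of a permutation error as small as,
% or smaller than, the observed error (observed value included in the count).
pval   = (1 + sum(permErr <= obsErr)) / (1 + nPerm);
reject = pval <= alpha;                  % e.g. alpha = 0.05
```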

Power of different cross-validation schemes

Simulations
To determine power differences between the different cross-validation schemes, we considered a similar setup as in the previous section. In this case, we considered, in addition to Gaussian noise, a univariate difference between the two classes in 20% of the voxels; additionally, we modeled unit-to-unit (e.g. run-to-run or subject-to-subject) variability by drawing the noise variance of each unit from a lognormal distribution with μ = 0 and σ = 0, 0.3 and 0.5; when σ = 0, the variance was identical across the different units, while with σ = 0.3 the variance was mainly in [0.55, 1.8] (95% quantile range) and for σ = 0.5 it was in [0.37, 2.66] (95% quantile range). The range of the univariate difference between classes was determined based on an initial small-scale simulation, in order to cover both low and high power situations. For each difference value, we generated a dataset and ran all the cross-validation schemes as in Section 2.2, namely split-half, five-fold, ten-fold and leave-one-unit-out cross-validation, with 20 repetitions, using again SVM, GNB and L2LR as learning algorithms. In these simulations, we used the only valid permutation scheme (before cross-validation) and ran 1000 permutations. The whole procedure was repeated 1000 times for each difference, and the power was calculated as the proportion of datasets where the Null Hypothesis was correctly rejected. For a specific power level, we furthermore considered the effect of increasing the number of cross-validation repetitions on power, repeating the cross-validation up to 100 times.
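A sketch of this generative process, under our reading of the setup (the value of d is purely illustrative):

```matlab
% H1 data generation: Gaussian noise, a univariate class difference d in
% the first 20% of the voxels, and a per-unit noise variance drawn from a
% lognormal(0, sigma) distribution.
nUnits = 20; perUnit = 4; nVox = 100; d = 0.5; sigma = 0.3;
X = []; y = []; unit = [];
for u = 1:nUnits
    v  = exp(sigma * randn);                            % lognormal unit variance
    Xu = sqrt(v) * randn(perUnit, nVox);
    yu = [ones(perUnit/2, 1); 2 * ones(perUnit/2, 1)];
    Xu(yu == 2, 1:nVox/5) = Xu(yu == 2, 1:nVox/5) + d;  % class effect, 20% of voxels
    X = [X; Xu]; y = [y; yu]; unit = [unit; u * ones(perUnit, 1)];
end
```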

Fig. 2.
Schematics of different permutation strategies in combination with cross-validation. On the left, a schematic of the labels of a two-class MVPA analysis with three runs, each with 5 samples per class (denoted with different grayscale values). A 3-fold (leave-run-out) cross-validation is used, and each column represents one of the three cross-validation iterations, training data in blue, testing data in red. On the right side: different permutation schemes, with n permutations. If the permutation takes place before a full run of cross-validation (top left), in each iteration the labels are the same. Permutations can also be done within each cross-validation iteration (top right); the bottom panels describe a within cross-validation strategy limited to training or testing data only. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Real data
In order to show power differences with real fMRI data, a single dataset is of limited reach; in fact, in order to make a statement about power, which is a probabilistic measure, an empirical estimate needs to be obtained by performing the same analysis over many datasets. To perform such an analysis we used a large dataset from the Human Connectome Project (Van Essen et al., 2013), namely the freely available 3 Tesla WU-Minn dataset described in Barch et al. (2013), focusing on the 889 subjects that completed the whole battery of tasks. We focused on the motor task, where the subjects performed, in a blocked design, movements of the right or left hand or foot and of the tongue. The data of each subject consisted of two runs, in each of which two repetitions of each task were performed, for a total of 4 repetitions per subject. For each subject we considered the results of the first-level within-subject GLM obtained with FEAT (FMRIB's Expert Analysis Tool, Woolrich et al., 2001) and we used the grayordinate-based results based on multimodal surface matching (MSM) (see Robinson et al., 2018; Robinson et al., 2014), smoothed with a 2 mm kernel and restricted to cortical vertices. We had therefore one activation map per condition per subject and considered across-subject decoding of two of the five conditions, namely right hand versus right foot. The task was designed to provide a robust localizer of basic motor functions, and therefore, not unexpectedly, the decoding performance was almost perfect (i.e. error close to zero for any cross-validation choice) even when only a limited number of subjects (e.g. 10) and all the cortical vertices were considered. In order to be able to modulate the decoding performance from a low power (high error rate) to a high power (low error rate) scenario, we resorted to the following strategy (a sketch of steps 3-5 is given after this list):

1. Choose 9 subjects at random; on those subjects compare the two conditions using a univariate two-sided t-test and select the vertices with a p-value larger than 0.8 (therefore, the least discriminative vertices).
2. Choose at random v vertices among the ones selected in the previous step.
3. Assign randomly the remaining 880 subjects to 44 batches of 20 subjects each, considering only the vertices selected in the previous step.
4. Within each batch, perform cross-validation as in Section 2.3.1, namely using split-half, five-fold, ten-fold and leave-one-subject-out cross-validation, with 20 repetitions, when possible, validating each scheme with 1000 permutations. Note that the cross-validation is done across subjects.
5. Determine the average error rate and how many batches were significant at a given α-level.

We considered three levels for the selection of vertices, v = 10, 100, 1000 (resulting in low, medium and high power, respectively), and repeated the whole procedure 100 times for each value of v, changing in every iteration the subjects used to select the vertices and the batch assignment.
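A hypothetical sketch of steps 3-5 (all names are ours; subjIdx holds the indices of the 880 remaining subjects, and batchPValue stands in for the cross-validation plus 1000-permutation test of step 4):

```matlab
% Random batch assignment and counting of significant batches.
nBatches = 44; batchSize = 20; alpha = 0.05;
order = subjIdx(randperm(numel(subjIdx)));       % random batch assignment
p = zeros(nBatches, 1);
for b = 1:nBatches
    batch = order((b-1)*batchSize + (1:batchSize));
    p(b)  = batchPValue(batch);                  % permutation p-value per batch
end
powerEstimate = mean(p <= alpha);                % fraction of significant batches
```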

Type I error rate of permutation strategies
The results of the simulations under H0 (no decoding information present) are shown for an SVM classifier in Fig. 3, for cross-validation without repetition (top panel) and for 20 repetitions of cross-validation (bottom panel), for α = 0.01, 0.05 and 0.1 (columns one, two and three respectively). For each cross-validation scheme and permutation strategy, we estimated the Type I error rate as the average rejection rate of the null hypothesis across the 1000 datasets. Additionally, we computed its 95% confidence intervals using the Clopper-Pearson model. Each row shows the results obtained with a given cross-validation scheme. A comparison between the confidence intervals of the estimated Type I error rates and the nominal α (black line) indicates that, as expected theoretically, resampling before performing cross-validation results in a valid test, regardless of the cross-validation scheme used. Importantly, all the other strategies result in an inflation of the false positive rate. The increase is in the range of 50% to 100% when cross-validation is used once, while it is much more pronounced when the cross-validation is repeated. In the Supplementary material we show the results of GNB and L2LR, which convey the same message, together with the error covariance matrix obtained with a valid permutation scheme (before cross-validation) and with an invalid one (within cross-validation), to illustrate the effect of ignoring the correlation terms induced by cross-validation.
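A Clopper-Pearson interval of this kind can be computed from the beta quantile function; a minimal sketch, assuming a hypothetical logical vector rejected over the 1000 datasets (betainv is in the Statistics and Machine Learning Toolbox):

```matlab
% Clopper-Pearson 95% confidence interval for the estimated Type I error
% rate, from x rejections out of n simulated datasets. By convention,
% lo = 0 when x = 0 and hi = 1 when x = n.
n = numel(rejected); x = sum(rejected);
lo = betainv(0.025, x,     n - x + 1);
hi = betainv(0.975, x + 1, n - x);
```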
We also performed an additional simulation where the synthetic data exhibited different degrees of within-unit correlation in single-trial estimates, ranging from low to extreme, and considered both completely randomized and non-completely randomized designs, the latter with either a blocked or an alternating event-related structure (see Appendix A for a detailed description). We resampled the datasets before running cross-validation and considered both an unrestricted permutation test (exchanging freely all the samples within a unit) and a randomization test (or restricted permutation test). The randomization test in the non-completely randomized designs was possible since, both in the blocked and in the event-related design, the starting condition was chosen at random for each unit, allowing many possible randomizations (two to the power of the number of units). Confirming what was shown in (Maris, 2019; Winkler et al., 2014), we observed (Fig. 4) that restricted permutations or randomization tests are valid in all the cases, whereas an unrestricted permutation test is valid in the case of a completely randomized design, while with a non-completely randomized design it is either invalid or too conservative if the examples are non-exchangeable. Please refer to Appendix A for more details on the simulations.

Power of different cross-validation schemes

Simulations
The results of the simulations using an intermediate noise variance across runs (σ = 0.3) with SVM are shown in Figs. 5 and 6, while the results for other variance levels and algorithms (GNB and L2LR) are shown in the Supplementary Information. Fig. 5 shows the errors obtained with different cross-validation schemes, as a function of the difference between the classes. The dashed lines refer to the cross-validation without repetition, whereas the solid lines refer to repeated cross-validation. The worst performing scheme, in terms of average error, is the split-half cross-validation, while the other schemes are close to each other in terms of performance. This is readily explained by the fact that split-half is the scheme that uses the least training data, resulting in the simplest (i.e. most regularized) model with an upward bias in the error, when compared to other schemes that use more training data. Additionally, repeating cross-validation does not change the average decoding performance (across the 1000 repetitions), which is again in line with the machine learning literature.

Fig. 6 shows the power of the different cross-validation schemes, with similar color and line coding as in Fig. 5; the shading of each curve covers the 95% confidence interval, based on a binomial model. As expected from the literature, the results suggest that repeating cross-validation brings a tangible gain in power; in fact, repeating cross-validation yields a more stable estimate of the CV error (see Varoquaux et al., 2017), which in turn reduces the probability of obtaining an error close to chance when the true error is actually lower than chance. More surprisingly, the most powerful cross-validation scheme is the repeated split-half CV, followed by the five-fold and the ten-fold cross-validation schemes, reversing the ranking observed for cross-validation without repetition (where the split-half is the worst). It is important to note that the highest power is achieved with a method (repeated split-half) that results in the highest error rate (that is, the worst average performance); in other words, whereas the performance of five- and ten-fold CV is better in terms of decoding accuracy, it is nonetheless more difficult to exclude that it could be attributed to chance, when compared with repeated split-half. This is due to the fact that the null distribution is sharper in a repeated split-half, when compared with the other schemes. We illustrate in Appendix B the error distributions under the Null and Alternative hypotheses obtained on simulated data and show how both distributions are indeed sharper for repeated split-half, when compared to other schemes. Furthermore, in the Supplementary Information, we analyze in detail the contribution of the different variance components in the four cross-validation schemes.

In Fig. 7 we show the effect of cross-validation repetition on power. We considered an intermediate value for the difference, repeated CV 100 times, and displayed the obtained power as a function of the number of cross-validation repetitions. With as few as two to three repetitions, the power of the different schemes is similar, and from around 10 repetitions the repeated split-half outperforms the other schemes. The obtained curves suggest that at 20 repetitions all the power curves stabilize, although the curves for repeated five-fold and repeated ten-fold CV stabilize earlier than the split-half. The reason for this is combinatorial, since many more combinations are possible for repeated split-half than for the other schemes.
Importantly, the difference in power between repeated split-half and the other schemes, and the number of repetitions needed, depend on the number of samples and units considered, and simulations should be run for a different experimental setup. In the Supplementary Material we show additional curves for different numbers of units.

Real data
The results of the decoding with different cross-validation schemes on the motor experiment of the HCP dataset are shown in Fig. 8, where the average error rates (across the 44 batches and the 100 iterations) are shown for each cross-validation scheme without repetition (blue bars) and with 20 repetitions (red bars) for the three scenarios, i.e. low, medium and high power, obtained selecting vertices with v = 10, 100 and 1000 respectively. The results clearly indicate, in line with the simulations and with theoretical considerations, that the average error estimate is the same whether cross-validation is repeated or not, and that, on average, leave-one-out cross-validation has the lowest error, while split-half has the highest error. The power with α = 0.05 over the 44 batches, averaged across 100 random batch selections, is shown in Fig. 9, with a similar color scheme as the errors. Two clear messages can be taken from these results: first, the repetition of cross-validation results in an overall increase in power, regardless of the scheme considered. Repeating cross-validation stabilizes the error estimate and results in a power gain, as shown with the simulations. Second, when cross-validation is not repeated, the most powerful approach is the leave-one-out and the worst is the split-half; on the other hand, with 20 repetitions of cross-validation the ranking is reversed, with the split-half being the most powerful, followed by five-fold and ten-fold cross-validation. This reflects closely what was shown with the simulations, and is particularly evident in the medium power range (scenario II), where the power of the split-half scheme doubles when cross-validation is repeated 20 times.

Fig. 8.
HCP WU-Minn 3T motor dataset: average error across 100 repetitions of the right hand versus right foot decoding across subjects with 20-subject batches, as a function of different cross-validation schemes. In the three scenarios the number of voxels considered was varied so that the decoding performance ranged from weak (Scenario I, 10 voxels, left panel) to high (Scenario III, 1000 voxels) performance. Blue bars: average error for one repetition of cross-validation; red bars: average error for 20 repetitions of cross-validation. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 9.
HCP WU-Minn 3T motor dataset: average power across 100 repetitions of the right hand versus right foot decoding across subjects with 20-subject batches, as a function of different cross-validation schemes, using the same scenarios as in Fig. 8. Blue bars: average power for one repetition of cross-validation; red bars: average power for 20 repetitions of cross-validation. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
In Fig. 10, we show the estimated power in each of the 100 iterations of each scenario, together with all the pairwise comparisons between cross-validation schemes. The three colors indicate the three different scenarios (blue: low power, red: intermediate power, yellow: high power), and in each element of the upper triangular matrix we compare the power obtained using two different schemes on the same data. These results clearly demonstrate that repeated split-half outperforms all the other schemes in terms of detection power, with repeated five-fold being the second best.

Discussion and conclusion
In this work we addressed two aspects of the statistical characterization of decoding performance in neuroimaging studies, namely the validity of different permutation strategies within cross-validation and the power of different cross-validation schemes. We have confirmed with simulations that the only valid permutation strategy is to estimate the cross-validation error on each resampled dataset, by first modifying the association between data and labels, and then estimating the cross-validation error keeping that association fixed across all the cross-validation iterations and possible repetitions. Not doing so results in an inflation of the Type I error (which is exacerbated when cross-validation is repeated), severely hampering the conclusions made based on such a test.
A theoretical insight into why the other resampling schemes result in an inflation of false positives can be gained from (Bengio and Grandvalet, 2004), where the authors describe the error covariance matrix across all samples in terms of blocks and decompose the variance of the cross-validation error (i.e. the sum of all the elements of the covariance matrix) into three components. The first component is the variance of errors for each test data point (main diagonal of the covariance matrix); the other two stem from the use of cross-validation: the first arises from the fact that all the samples in a partition are tested using the same model (which changes per test partition), while the second arises from the overlap present in the training data of different partitions (this overlap is more pronounced when more partitions are used). When conducting a permutation test, if the resampling takes place in both the training and testing datasets, or only in one of them, at each cross-validation iteration, the cross-validation related terms in the variance decomposition are ignored, since it is implicitly assumed that the data across different iterations are independent. In other words, these resampling schemes implicitly assume that one or both of the cross-validation related variance components described above are zero. This underestimation results in a sharper (i.e. lower-variance) null distribution, and therefore in overconfident statements and invalid tests. On the other hand, when the data/labels association is kept constant across the different cross-validation iterations, the cross-validation related variance components are retained also in the estimation under H0, resulting in a more realistic empirical null distribution. When considering several repetitions of cross-validation similar considerations apply, and ignoring the cross-validation related variance components can have an even more conspicuous effect since the covariance matrix is now much larger. A detailed empirical analysis of the covariance matrix of the errors (under H0) when cross-validation is repeated is presented in the Supplementary Information.
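In our paraphrase of the notation of Bengio and Grandvalet (2004), with n test errors in total, m errors per testing block, σ² the variance of a single error, ω the within-block error covariance and γ the between-block covariance, the three components read:

Var(CV error) = σ²/n + ((m − 1)/n)·ω + ((n − m)/n)·γ

Resampling independently within each iteration amounts to setting ω and/or γ to zero in the null distribution, which is exactly the underestimation described above.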
In this work we have also compared different cross-validation schemes in terms of power, both on artificial data and on real data from the 3T WU-Minn Connectome dataset. The main finding was that the most powerful strategy is to perform repeated split-half cross-validation. When cross-validation is performed only once (and the choice of which units go into which partition is random), an increase in the number of partitions (from two in the split-half scheme up to the number of units in the leave-one-unit-out scheme) results in an increase of overall performance (lower average error) and power (the Null Hypothesis is rejected more often, when it is false). When cross-validation is repeated (by randomly assigning units to different partitions) the average error is not affected, but a decrease in the number of partitions results in an increase in power, with repeated split-half being the best, followed by five- and ten-fold cross-validation. We observed this behavior on simulated data using SVM, GNB and L2-regularized logistic regression, and on the real data using SVM, and showed that performing around 20 repetitions is enough to obtain this effect. A crucial point here is that repeated split-half has a higher decoding error when compared with leave-one-unit-out cross-validation; nonetheless it achieves higher power and should therefore be preferred when a decoder is used in the context of hypothesis testing. This is not entirely new in the literature, since it has been shown in Jamalabadi et al. (2016) that, in the case of a single feature (voxel) or highly correlated features, a split-half is more powerful than other cross-validation techniques when using linear classifiers. We show that this holds true also in the case of uncorrelated features, provided that cross-validation is repeated several times. One possible explanation for this behavior is that, referring again to Bengio and Grandvalet (2004), the variance term related to the overlap between the different training datasets is minimized in repeated split-half cross-validation: in this case, the training datasets within a cross-validation iteration are disjoint, while the overlap between any two training datasets in two different cross-validation repetitions is on average 25%; in five-fold, instead, two training datasets within a cross-validation iteration overlap by 60%, and the overlap between two training datasets in different cross-validation repetitions is on average 64% (which can be readily derived from the properties of the hypergeometric distribution). Therefore, although the average error for repeated split-half is higher than for repeated five-fold, the estimate has lower variance, and the scheme therefore has higher power. We illustrate this aspect in Appendix B, where we compare error distributions under the Null and Alternative hypotheses, showing how the error distribution in repeated split-half is sharper when compared with other schemes.
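These overlap percentages can be checked with the expectation of the hypergeometric distribution: for two training sets A and B drawn at random from N samples,

E|A ∩ B| = |A||B| / N.

For repeated split-half, two training sets from different repetitions have |A| = |B| = N/2, hence an expected overlap of N/4, i.e. 25% of the data; within a single five-fold repetition, two training sets of size 4N/5 share exactly 3N/5 (60%), and across repetitions the expected overlap is (4N/5)²/N = 16N/25 (64%).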
In this work we also discuss the behavior of permutation tests when the data are non-exchangeable, which could be a consequence of, for instance, an overly short inter-stimulus interval or of noise autocorrelation. In Appendix A we show that, in the presence of dependency in the data, a permutation test that simply exchanges the labels of trials at random can result in an invalid test when the design is fixed. We furthermore show how a restricted permutation test (Winkler et al., 2014) or a randomization test (Maris, 2019) results in a valid test even when the different examples are highly correlated and therefore not exchangeable, and we therefore suggest resorting to one of these two strategies when the design is not completely randomized and the samples are not exchangeable under the null hypothesis.
Our findings furthermore provide some suggestions on the experimental design of fMRI experiments to be analyzed with MVPA, both in terms of validity and power. If a given number N of examples has to be partitioned in U units (runs or subjects), then it is preferable to have many units, so that repeated split-half can be employed to maximize power. Consider for instance N = 20. Based on our simulations, the worst strategy would be to choose U = 2 units (for instance, two runs with 10 examples each), since only one repetition of split-half can be performed. On the other hand, if U = 10 (10 runs with 2 examples each), then a repeated split-half would be possible. There are many more considerations to take into account when partitioning the allocated experimental time across multiple runs, including alignment issues and noise heterogeneity across runs. In our simulations we showed that even in rather extreme noise heterogeneity situations, a repeated split-half performs better than the other schemes. Having many short runs has also been advocated in Coutanche and Thompson-Schill (2012), where a comparison between decoding using "long" and "short" runs in 4-fold cross-validation was conducted. An additional point in favor of many short runs is related to the validity of the unrestricted permutations. When the examples within a unit are not exchangeable, a restricted permutation or a randomization test should be used. Even in the case of an alternating blocked design, simply switching the conditions with each other at random within each run would be enough to provide a valid test (a sketch is given below). With R runs, there are 2^R possible switches; to have about 1000 possible permutations, 10 runs should be considered. When this is not possible, introducing relatively long blocks of rest within a run could help in that respect. For instance, in Valente et al. (2019) we considered mini-blocks of fast event-related presentation (3 repetitions per condition), and considered a permutation test where the two conditions within each mini-block were switched at random across the mini-blocks.
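A minimal sketch of such a run-wise switch, with class labels coded 1 and 2 and names as in the earlier sketches:

```matlab
% Run-wise randomization: flip the two condition labels within each run
% with probability 0.5 (e.g. AAABBB -> BBBAAA), giving 2^R possible
% resamples for R runs. yRand is then used for one full cross-validation.
yRand = y;
for u = unique(unit)'
    if rand < 0.5
        idx = (unit == u);
        yRand(idx) = 3 - yRand(idx);   % swap classes 1 and 2 in this run
    end
end
```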
All the power analyses conducted here refer to Region of Interest (ROI) decoding errors; it is reasonable to assume that the same considerations apply to searchlight decoding, where the decoding is conducted at multiple locations across the brain. However, we have not examined the effect of the different cross-validation schemes on the correlation between adjacent searchlights, which in turn could have an important effect on the multiple comparison correction strategy and ultimately on power; it therefore remains to be tested empirically whether the conclusions drawn here apply to searchlight studies as well. An interesting question is whether the findings presented here generalize to other cross-validated measures used in different analysis approaches (such as RSA or encoding models). We believe that the suggestions we make on the choice of permutation strategy (permuting before running one or more iterations of cross-validation) apply also here, since the proposed permutation strategy is the only one that fully accounts for the effect of re-using data across different partitions. It remains unclear, however, whether the power considerations we made based on MVPA simulations would also apply to RSA or encoding models, and this too remains to be tested empirically before a strong statement can be made.

Data and code availability statement
The Matlab implementation of all the simulations and analyses can be found at: https://github.com/GiancarloValente/MVPACrossValidationAndPermutations. The publicly available 3T WU-Minn dataset can be requested here: https://db.humanconnectome.org/app/template/Login.vm.


Appendix A

If the permutation was valid, then we would expect the Type I error rate to be lower than or equal to α. If it exceeds this value, the test is invalid; if it is below, the test is conservative. We then considered a randomization test, or a restricted permutation test (only for scenarios two and three, since in the first scenario a permutation and a randomization test would result in the same set of label reassignments), by switching at random in each run the labels of the two classes (for instance, if in a run the class labels were AAABBB, we would switch them with probability 0.5 with BBBAAA). Again, we considered the Type I error rate, based on 2000 dataset generations.

The results are shown in Fig. 4, where each row gives the estimated Type I error rate for a different imposed correlation structure on the data, with the top row containing independent data and the bottom row highly correlated data. The results clearly indicate that a randomization (restricted permutation) test is valid, whereas in high correlation situations an unrestricted permutation approach may result in an overconfident estimation (in the block design the test is invalid) or a too conservative estimation (in the alternating design).

Appendix B. Distributions of error rates obtained with different cross-validation schemes under the null and the alternative hypothesis
In this section we examine the distributions of error rates obtained with different cross-validation schemes, using the simulations shown in Section 3.2.1. We considered an intermediate effect (the fifth out of the seven tested, the first being the weakest), such that the difference in power could be highlighted the most. We considered the errors observed on 1000 datasets and built a histogram to represent the error distribution (under H1). When cross-validation was repeated, we computed the error rate as the ratio between the total amount of errors and the total amount of predictions (across all the repetitions). The null distribution was built considering a large set of permutations. In both cases, the histograms were smoothed with a kernel density estimator, as implemented by the Matlab function ksdensity (see the sketch below).
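A minimal sketch of this smoothing step (variable names are ours; permErr and errH1 hold the error rates under permutation and over the 1000 simulated datasets, respectively):

```matlab
% Kernel-smoothed H0 and H1 error distributions, as used for Fig. B.11.
pts = linspace(0, 1, 200);          % evaluation grid for the densities
fH0 = ksdensity(permErr, pts);      % null distribution (permutations)
fH1 = ksdensity(errH1, pts);        % distribution under H1
plot(pts, fH0, '--', pts, fH1, '-');
```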
The results are depicted in Fig. B.11. In the top panel the distributions under H1 (solid line) and H0 (dashed line) obtained with a single repetition of cross-validation are compared, while the bottom panel shows the different distributions when cross-validation is repeated 20 times. The lower 5% quantile of each null distribution (the rejection threshold at α = 0.05) is depicted with a dash-dotted vertical line; in the case of a single iteration the quantiles of the different cross-validation schemes overlap, while they differ when cross-validation is repeated.
Whereas in the case of a single repetition of cross-validation the distributions under both H0 and H1 are relatively similar, when cross-validation is repeated we observe that the distributions of errors obtained with repeated split-half, both under H0 and H1, are sharper than those of the other schemes. This has, in turn, clear effects on power, since with repeated split-half it is more likely that the null hypothesis is rejected. The difference in null distributions is also reflected in these quantiles: whereas for a repeated split-half an error rate of 0.42 would result in a significant decoding, the same error rate would not be enough to reject the null hypothesis in the case of repeated five-fold or ten-fold cross-validation. This counteracts the higher average error rate which is usually achieved with repeated split-half, and provides an illustration of why, despite having a higher error, the repeated split-half is a more powerful procedure when compared with other schemes.