Valid and powerful second-level group statistics for decoding accuracy: Information prevalence inference using the i-th order statistic (i-test)

In functional magnetic resonance imaging (fMRI) decoding studies using pattern classification, a second-level group statistical test is typically performed after first-level decoding analyses for individual participants. In the second-level test, the mean decoding accuracy across participants is often tested against the chance-level accuracy (for example, with a one-sample Student's t-test) to check whether information about the label, such as the experimental condition or cognitive content, is included in brain activation. Meanwhile, Allefeld et al. (2016) highlighted that significant results of such tests only indicate that "there are some people in the population whose fMRI data carry information about the experimental condition." Therefore, such tests fail to show whether the effect is typical in the population. Based on this argument, they proposed an alternative method implementing prevalence inference. In the present study, that method is extended into a novel statistical test called the "information prevalence inference using the i-th order statistic" (i-test). The i-test has a higher statistical power than the method proposed by Allefeld et al. (2016) and provides an inference regarding the typical effect in the population. In the i-test, the i-th lowest sample decoding accuracy (the i-th order statistic) is compared to the null distribution to verify whether the proportion of higher-than-chance decoding accuracy in the population (information prevalence) is higher than a threshold. Hence, a significant result of the i-test is interpreted as showing that a majority of the population has information about the label in the brain. Theoretical details of the i-test are provided, its high statistical power is demonstrated by numerical calculation, and its application to fMRI decoding data is demonstrated.


Introduction
An ever-increasing number of functional magnetic resonance imaging (fMRI) studies use multi-voxel pattern classification analysis, in which the cognitive content (= label), such as visual input categories (e.g., Spiridon and Kanwisher, 2002; Haxby et al., 2014; Nishida and Nishimoto, 2018), types of movement (e.g., Gallivan et al., 2011; Nambu et al., 2015), and the effector of movement (e.g., Hirose et al., 2015; Hirose et al., 2018), is predicted (decoded) from fMRI signal patterns. The prediction accuracy (decoding accuracy; D-Acc) is considered an index of the label information in the brain. The D-Acc is often compared to the chance-level accuracy, which is the theoretical expectation of the D-Acc without information about the label (e.g., 50% for binary classifications). When the D-Acc exceeds the chance level, the result is interpreted as indicating that information about the label is represented in the brain.
Abbreviations: D-Acc, decoding accuracy; fMRI, functional magnetic resonance imaging; i-test, information prevalence inference using the i-th order statistic; i-test-one, i-test with i = 1; i-test-unif-bino, i-test with the assumption of a uniform distribution of true parameters and a binomial distribution of D-Acc.
E-mail address: satoshi.hirose@gmail.com

However, the expected D-Acc of each person in the population cannot be lower than the chance level (Allefeld et al., 2016). Thus, even a very small part of the population (e.g., one in a million) with higher-than-chance D-Acc results in a population mean higher than the chance level. Therefore, significant results of these tests can only indicate that "there are some people in the population whose fMRI data carry information about the experimental condition" (Allefeld et al., 2016), but fail to show whether the effect is typical in the population. The above-mentioned problem is inevitable for statistical tests of the population mean. A possible solution is to assess the population information prevalence (i.e., the proportion of the population having label information in their brain activity), in line with other methods for evaluating population proportions in second-level group statistics, such as the dynamic causal model (Stephan et al., 2009), the second-level random effect of univariate analysis (Rosenblatt et al., 2014), and conjunction analysis (Friston et al., 1999). Accordingly, a second-level group statistical test on D-Acc was proposed by Allefeld et al. (2016), in which the population information prevalence is targeted instead of the population mean. In this method, the null hypothesis is that the proportion of the population having label information is not larger than a predetermined prevalence threshold (e.g., 0.5). Thus, rejection of the null hypothesis means that a proportion larger than the prevalence threshold (e.g., more than half) of the population has label information.
The method proposed by Allefeld et al. (2016) theoretically solved the above issue by implementing the population information prevalence and provided a meaningful inference regarding the typical effect. However, the method may have the disadvantage of low statistical power (Section 3). Because the lowest D-Acc (minimum order statistic) among the participants is used as the test statistic, the method may fail to capture the population characteristics in the presence of a single low D-Acc, leading to low statistical power.
In the present study, I propose an extension of the previous method. The idea is to generalize the method of Allefeld et al. (2016) to second- or higher-order statistics (the i-th lowest D-Acc for i = 2, 3, …). See Section 2.3.4 regarding the relationship between the method of Allefeld et al. (2016) and that proposed in this paper. The proposed method is called the "information prevalence inference using the i-th order statistic", or i-test.
The paper is organized as follows. The proposed statistical test is explained in Section 2 . The advantage of the method in terms of a high statistical power is demonstrated by providing numerical results for the artificial data in Section 3 . An application of the method using a real fMRI dataset is demonstrated in Section 4 . Finally, the results are summarized, and future directions of the study are discussed in Section 5.

Information prevalence inference using the i -th order statistic ( i -test)
In this section, Section 2.1 clarifies the statistical model, defines the problem, and outlines the i-test calculation. Section 2.2 explains the theoretical details of the i-test. Section 2.3 introduces essential theoretical characteristics of the i-test, including its relationship with the previous method. Finally, a practical procedure for performing the i-test is explained in Section 2.4. Section 2.4 is self-contained and involves a minimum number of equations, so that readers may skip Sections 2.2 and 2.3 to familiarize themselves with the practical procedure before turning to the theoretical details.

Notations and problem definition
Figure 1A displays the statistical model for the population and the experimental result (sample D-Acc in an experiment). The population Ω is composed of two subgroups, Ω+ and Ω−. People in Ω+ have label information in their brain activity and, thus, their D-Acc expectations are higher than the chance level, while people in Ω− do not have label information and their D-Acc expectations are at the chance level. A randomly chosen person with an index k from the population belongs to Ω+ (k ∈ Ω+) with a probability γ, or otherwise belongs to Ω− (k ∈ Ω−). An experiment with N participants constitutes a two-step random sampling. First, N participants are independently and randomly sampled from the population (random sampling of participants). Each sampled participant is associated with a D-Acc probability distribution (probability mass function p(a)). Second, the experimental results (sample D-Acc âk) are randomly sampled from the distribution for each participant (random sampling of D-Acc).

[Figure 1. A) Statistical model: the population Ω with its subgroups Ω+ and Ω−, and the experiment expressed as a two-step random sampling of participants and of D-Acc. B) Experimental results and estimated D-Acc distribution: the i-th order statistic (i-th lowest sample D-Acc in the experiment, â(i)) is identified (an example with i = 3 is shown); then P̃(a < â(i) | ∈ Ω−), the probability that a participant without label information has D-Acc lower than â(i), is estimated (green area). C) Derivation target: using P̃(a < â(i) | ∈ Ω−), it is verified whether P(a(i) ≥ â(i) | H0: γ ≤ γ0) (green area) is lower than α.]
Note that in this paper, a lowercase p denotes a probability mass function for discrete variables or a probability density function for continuous variables, while an uppercase P stands for probabilities. The objective of the i-test is to verify whether γ (= P(∈ Ω+)) is larger than the predetermined threshold γ0 from the participants' experimental results (âk; k = 1 … N), with a significance threshold α.
To perform the i-test, we first identify the i-th order statistic â(i), which is the i-th lowest D-Acc observed in the experiment (Figure 1B). Then, P̃(a < â(i) | ∈ Ω−), which is the probability that a participant without label information has D-Acc lower than â(i) (green area in Figure 1B), is estimated. The estimation can be performed by assuming a parametric distribution for D-Acc, such as the binomial distribution (Section 2.4), or by an empirical procedure, such as a permutation test (Supplementary Material S3.1). Note that the i-test is performed without estimating the D-Acc distribution for Ω+. After estimating P̃(a < â(i) | ∈ Ω−), we check whether P(a(i) ≥ â(i) | H0: γ ≤ γ0), which is the probability that the i-th order statistic a(i) is not lower than the observed value â(i) under the null hypothesis (green area in Figure 1C), is smaller than α. If it is smaller than α, we conclude that the observed â(i) is so high that it is unlikely (with a probability smaller than the significance threshold α) to have occurred under the null hypothesis. Thus, the null hypothesis is rejected and one may conclude that γ is higher than the predetermined threshold γ0 at the significance threshold α. The notations are tabulated in Table 1.
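As a sketch of the estimation step under the binomial assumption (a Python illustration with hypothetical helper names, not the paper's MATLAB code), the probability that a participant without label information has D-Acc below an observed order statistic can be computed from the binomial CDF at the chance level:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def p_below_under_chance(a_i, n_trial, p_chance=0.5):
    """Estimate P~(a < a^(i) | in Omega-) under the binomial assumption:
    D-Acc = (#correct) / n_trial with #correct ~ Binomial(n_trial, p_chance)."""
    k_i = round(a_i * n_trial)  # correct-trial count of the order statistic
    return binom_cdf(k_i - 1, n_trial, p_chance)

# e.g., an observed i-th lowest D-Acc of 0.62 over 100 binary trials
print(p_below_under_chance(0.62, 100))
```

The higher this probability, the less plausible the observed order statistic is under the no-information null.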

Theoretical details
As noted above, the objective of the i-test is to verify whether γ is larger than the predetermined threshold γ0. Thus, the alternative hypothesis H1: γ > γ0 is set, and the null hypothesis is its negation, H0: γ ≤ γ0. The i-test formulation is conducted in four steps (Sections 2.2.1-2.2.4). First, P(a < â(i) | γ), which is the probability that a participant has D-Acc lower than â(i) for a given γ, is formulated by Eq. (2.1). Second, P(a(i) ≥ â(i) | γ), which is the probability that the i-th lowest D-Acc a(i) is not lower than the observed one â(i) for a given γ, is formulated by Eq. (2.2). Third, the lower bound of P(a < â(i) | H0: γ ≤ γ0), which is the probability that a participant has D-Acc lower than â(i) under the null hypothesis, is formulated by Eq. (2.5). Finally, the upper bound of P(a(i) ≥ â(i) | H0: γ ≤ γ0), which is the probability that the i-th lowest D-Acc a(i) is not lower than â(i) under the null hypothesis, is formulated by Eq. (2.8).

Derivation of P(a < â(i) | γ)
Considering that a participant belongs to Ω+ with the probability γ and otherwise belongs to Ω−, the probability that a participant has D-Acc lower than â(i) is formulated as:

P(a < â(i) | γ) = γ·P(a < â(i) | ∈ Ω+) + (1 − γ)·P(a < â(i) | ∈ Ω−). (2.1)

Note that it is assumed that the D-Acc are identically distributed among participants, i.e., P(a < â(i) | ∈ Ω−) and P(a < â(i) | ∈ Ω+) are independent of k, and, therefore, P(a < â(i) | γ) is independent of k.

Derivation of P(a(i) ≥ â(i) | γ)
Then, the probability that the i-th lowest D-Acc is not lower than â(i) for a given γ is derived as follows:

P(a(i) ≥ â(i) | γ) = Σ_{j=0}^{i−1} C(N, j)·P(a < â(i) | γ)^j·(1 − P(a < â(i) | γ))^(N−j). (2.2)

This is because the condition of the i-th lowest D-Acc not being lower than â(i) is identical to the condition of fewer than i of the N participants having D-Acc lower than â(i), whose probability is formulated with the binomial cumulative distribution function.
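A minimal sketch of Eq. (2.2) (Python; the function name is mine):

```python
from math import comb

def p_order_stat_at_least(i, N, p_below):
    """Eq. (2.2): probability that the i-th lowest of N D-Accs is not lower
    than the observed value, i.e., that fewer than i participants fall below
    it, where each falls below independently with probability p_below."""
    return sum(comb(N, j) * p_below**j * (1 - p_below)**(N - j) for j in range(i))

# e.g., for the minimum statistic (i = 1) with N = 10 and p_below = 0.5,
# this reduces to (1 - 0.5)**10
print(p_order_stat_at_least(1, 10, 0.5))
```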

Lower bound of P(a < â(i) | H0) and upper bound of P(a(i) ≥ â(i) | H0)
Following the basic nature of probability, i.e., P(a < â(i) | ∈ Ω+) ≥ 0, a lower bound of P(a < â(i) | γ) can be easily derived from Eq. (2.1) as

P(a < â(i) | γ) ≥ (1 − γ)·P(a < â(i) | ∈ Ω−). (2.3)

Consequently, a lower bound of the probability under the null hypothesis H0: γ ≤ γ0 is

P(a < â(i) | H0: γ ≤ γ0) ≥ (1 − γ0)·P(a < â(i) | ∈ Ω−). (2.4)

I define P_L to equal the right side of Eq. (2.4), i.e.,

P_L = (1 − γ0)·P(a < â(i) | ∈ Ω−), (2.5)

and Eq. (2.4) may be expressed as

P(a < â(i) | H0: γ ≤ γ0) ≥ P_L. (2.6)

Substituting this lower bound into Eq. (2.2), whose right side decreases as P(a < â(i) | γ) increases (Eq. (2.6)), we obtain the following upper bound:

P(a(i) ≥ â(i) | H0: γ ≤ γ0) ≤ Σ_{j=0}^{i−1} C(N, j)·P_L^j·(1 − P_L)^(N−j). (2.7)

I define P_U to equal the right side of Eq. (2.7), i.e.,

P_U = Σ_{j=0}^{i−1} C(N, j)·P_L^j·(1 − P_L)^(N−j), (2.8)

and Eq. (2.7) may be expressed as

P(a(i) ≥ â(i) | H0: γ ≤ γ0) ≤ P_U. (2.9)

i-test calculation
The i-test calculation is conducted as follows. First, P̃(a < â(i) | ∈ Ω−) is estimated by assuming a parametric distribution for D-Acc or by an empirical procedure. Then, P̃_L is calculated with Eq. (2.5), and, consequently, P̃_U is calculated with Eq. (2.8). Assuming that the estimation is correct, the expression α > P̃_U ≥ P(a(i) ≥ â(i) | H0: γ ≤ γ0) follows from Eq. (2.9) when P̃_U is smaller than the statistical threshold α. Therefore, it is confirmed that α > P(a(i) ≥ â(i) | H0: γ ≤ γ0), and we may conclude that the observed i-th order statistic â(i) is so high that it is unlikely (with a probability smaller than α) to have occurred under the null hypothesis. Therefore, we reject the null hypothesis and accept H1: γ > γ0, i.e., a proportion larger than γ0 of the population has label information in their brain activity.
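The whole calculation can be sketched as follows (a Python illustration under the binomial assumption with chance level 0.5 and toy data; function names are mine, not the paper's MATLAB implementation):

```python
from math import comb

def i_test(accs, i, n_trial, gamma0=0.5, alpha=0.05, p_chance=0.5):
    """Sketch of the full i-test under the binomial assumption.
    Returns the upper bound P~_U and whether the result is significant."""
    a_i = sorted(accs)[i - 1]              # i-th order statistic
    k_i = round(a_i * n_trial)
    # P~(a < a^(i) | in Omega-): binomial CDF below the order statistic
    p_tilde = sum(comb(n_trial, j) * p_chance**j * (1 - p_chance)**(n_trial - j)
                  for j in range(k_i))
    p_L = (1 - gamma0) * p_tilde           # Eq. (2.5)
    N = len(accs)
    p_U = sum(comb(N, j) * p_L**j * (1 - p_L)**(N - j)
              for j in range(i))           # Eq. (2.8)
    return p_U, p_U < alpha

# Toy data: 8 participants, 100 binary trials each (i = 2 satisfies Eq. (2.10))
accs = [0.55, 0.63, 0.64, 0.66, 0.68, 0.70, 0.72, 0.75]
p_U, sig = i_test(accs, i=2, n_trial=100)
print(p_U, sig)
```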

Parameter constraint
The i-test requires three predetermined parameters: α (threshold for statistical significance), γ0 (threshold for the information prevalence), and i (rank of the order statistic). Together with the number of participants N, they should satisfy the following inequality:

Σ_{j=0}^{i−1} C(N, j)·(1 − γ0)^j·γ0^(N−j) < α. (2.10)

This is because P_L is bounded above by 1 − γ0 from Eq. (2.5) and P(a < â(i) | ∈ Ω−) ≤ 1. Therefore, given that P_U decreases as P_L increases (Eq. (2.8)), one may find that P_U is bounded below as follows:

P_U ≥ Σ_{j=0}^{i−1} C(N, j)·(1 − γ0)^j·γ0^(N−j). (2.11)

Herein, Eq. (2.10) must be satisfied; otherwise, P_U is never lower than α and the i-test never reports a significant result.
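The largest admissible rank under this constraint can be found by direct evaluation (Python sketch; `i_max` is a hypothetical helper name):

```python
from math import comb

def i_max(N, gamma0=0.5, alpha=0.05):
    """Largest rank i satisfying Eq. (2.10); returns 0 if no rank qualifies
    (the i-test cannot be applied with so few participants)."""
    best = 0
    for i in range(1, N + 1):
        # lower bound of P_U: binomial CDF at i - 1 with success prob. 1 - gamma0
        bound = sum(comb(N, j) * (1 - gamma0)**j * gamma0**(N - j) for j in range(i))
        if bound < alpha:
            best = i
        else:
            break  # the bound only grows with i
    return best

print(i_max(50))  # N = 50, gamma0 = 0.5, alpha = 0.05 -> 19
print(i_max(4))   # fewer than 5 participants: no valid i -> 0
```

With N = 50, γ0 = 0.5, and α = 0.05 this yields i_max = 19, matching the range of i examined in Section 3.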

Expectation of the statistical power
The statistical power is expressed as P(Significant | H1: γ > γ0), which is the probability that the i-test reports a significant result under the alternative hypothesis. Its expectation under a prior over the true parameters can be formulated as:

E[Power] = ∫_Θ ∫_{γ0 < γ ≤ 1} P(Significant | γ, θ)·p(γ, θ) dγ dθ, (2.12)

where θ is the parameter other than γ that determines the true distribution of D-Acc, Θ is the parameter space, p(γ, θ) is the joint prior distribution, and P(Significant | γ, θ) is the probability that the i-test reports a significant result for given γ and θ. This is used for the selection of the parameter i (see Section 2.4.2). Refer to Appendix B for an example of the statistical power calculation and the formulation of the selection of i.

Control of the false alarm rate
It is analytically guaranteed that the false alarm rate P(Significant | H0: γ ≤ γ0), which is the probability that the i-test reports a significant result under the null hypothesis, does not exceed the statistical threshold α when the estimation of P̃(a | ∈ Ω−) is correct.

Relation with the previous method in Allefeld et al. (2016)
With i = 1, the test statistic is the minimum D-Acc among participants, and the i-test reduces to the method referred to as the "permutation-based information prevalence inference using the minimum statistic" proposed by Allefeld et al. (2016). Hereafter, this method is called the i-test-one.
Practical procedure of the i-test

As an example of the i-test implementation, the i-test-unif-bino (i-test with the assumption of a uniform distribution of true parameters and a binomial distribution of D-Acc) is presented, which can be applied within a realistic computational time (< 5 seconds for N ≤ 100, Ntrial ≤ 1000 with MATLAB 2018 on an iMac Pro, 64 GB memory, 10-core 3 GHz processor). Under the binomial distribution assumption, the D-Acc distribution p(a) is determined by two unknown parameters: the true information prevalence γ, and p_correct+, which is the probability of correct decoding in a trial for participants with label information. These parameters are assumed to follow a uniform prior distribution. Pseudocode of the procedures is presented in Appendix C. Refer to Supplementary Material S3 for other implementations.

Setting the α and γ0 threshold parameters
As the first step of the i-test, the two threshold parameters, α (the threshold for statistical significance) and γ0 (the threshold for the information prevalence), should be set. These parameters explicitly appear in the concluding statement of the test. Namely, with a significant result of the i-test, one may conclude that a proportion larger than γ0 of the population has label information in their brain activity, with a false-alarm rate less than α. Therefore, users may select these values based on their purpose.
As the primary choice, α = 0.05 and γ0 = 0.5 are suggested, because α = 0.05 is the most commonly used value in neuroscience studies, while γ0 = 0.5 may be intuitively acceptable because a significant result leads us to conclude that a majority (more than half) of the population has label information, and at present this is a standard value in existing statistical methods on the population prevalence (e.g., Friston et al., 1999; see also the Discussion in Allefeld et al., 2016).

Determination of i (the rank of the order statistic)
After the two threshold parameters are set, the upper limit of i (i_max) is determined from Eq. (2.10). The selection of i among i = 1, 2, … i_max does not affect the concluding statement, but it does affect the statistical power (see Section 3). Therefore, the optimal i that provides the maximal expected statistical power (Eq. (2.12)) is suggested.
The above-mentioned assumptions (a binomial distribution of D-Acc and uniform priors on γ and p_correct+) enable numerical calculation of the optimal i that maximizes the expected statistical power (refer to Appendix B.2 for the full derivation).
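A coarse numerical sketch of this selection (Python; a crude midpoint-grid average standing in for the exact expected-power calculation of Appendix B.2; all function names are mine):

```python
from math import comb

def bcdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def p_significant(i, N, n_trial, gamma, p_plus, gamma0=0.5, alpha=0.05):
    """P(Significant) for given true (gamma, p_correct+), binomial model,
    chance level 0.5, assuming correct estimation (P~ = P)."""
    def p_U(k):  # upper bound when the order statistic equals k / n_trial
        p_L = (1 - gamma0) * bcdf(k - 1, n_trial, 0.5)
        return sum(comb(N, j) * p_L**j * (1 - p_L)**(N - j) for j in range(i))
    k_T = next(k for k in range(n_trial + 2) if p_U(k) < alpha)  # threshold count
    q = gamma * bcdf(k_T - 1, n_trial, p_plus) + (1 - gamma) * bcdf(k_T - 1, n_trial, 0.5)
    return sum(comb(N, j) * q**j * (1 - q)**(N - j) for j in range(i))

def optimal_i(N, n_trial, gamma0=0.5, alpha=0.05, grid=10):
    """Rank i maximizing the power averaged over a coarse midpoint grid
    standing in for the uniform priors on gamma and p_correct+."""
    gammas = [gamma0 + (1 - gamma0) * (g + 0.5) / grid for g in range(grid)]
    p_pluses = [0.5 + 0.5 * (g + 0.5) / grid for g in range(grid)]
    best_i, best_power, i = 1, -1.0, 1
    # only ranks satisfying the constraint of Eq. (2.10) are considered
    while sum(comb(N, j) * (1 - gamma0)**j * gamma0**(N - j) for j in range(i)) < alpha:
        power = sum(p_significant(i, N, n_trial, g, p)
                    for g in gammas for p in p_pluses) / grid**2
        if power > best_power:
            best_i, best_power = i, power
        i += 1
    return best_i

print(optimal_i(20, 100, grid=4))
```

Because of the coarse grid, this sketch only approximates the exact optimum.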

Identification of â(i) and estimation of P̃(a < â(i) | ∈ Ω−)
Next, â(i) is identified from the experimental results, and P̃(a < â(i) | ∈ Ω−) is estimated. Following the binomial assumption, this estimation can be performed on the basis of the known parameters: the number of trials (Ntrial), the sample order statistic (â(i)), and the probability of correct decoding in a trial for participants without label information (p_correct−), which is the chance level (e.g., 0.5 for binary classification).

Numerical calculation in artificial experiments
The i-test was designed to improve the statistical power compared to the i-test-one. The results of statistical power calculations in synthetic situations are presented below to demonstrate the advantage of the i-test (see the full derivation in Appendix A). Note that the numerical calculation results reported in this section were replicated in simulations with the same parameters (Supplementary Material S1.1). Furthermore, the results were replicated with different prevalence threshold values (γ0 = 0.1, 0.3, and 0.7; Supplementary Material S2).

Definition of artificial data
Let us consider artificial experiments where each participant performed Ntrial trials with a binary choice (chance level: 0.5). Each participant had label information (∈ Ω+) with probability γ, or did not have label information (∈ Ω−) otherwise. In a trial, the decoder for participants with label information (∈ Ω+) predicted the label correctly with probability p_correct+ independently across trials, while for participants without label information (∈ Ω−) the label was predicted randomly and independently. Therefore, the D-Acc for ∈ Ω+ and ∈ Ω− followed binomial distributions, i.e., p(a | ∈ Ω+) = BPDF(Ntrial·a, Ntrial, p_correct+) and p(a | ∈ Ω−) = BPDF(Ntrial·a, Ntrial, p_correct−), where p_correct− is the chance level, i.e., 0.5. Note that decoding accuracies do not always follow the binomial distribution in real fMRI decoding, because the trials are not completely independent (refer to the Discussion and Supplementary Material S1.2).
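The generative model above can be sketched as follows (Python; `simulate_daccs` is a hypothetical helper name):

```python
import random

def simulate_daccs(N, n_trial, gamma, p_plus, p_minus=0.5, rng=random):
    """Draw sample D-Accs from the two-step model: each participant carries
    label information with probability gamma; trial outcomes are Bernoulli."""
    accs = []
    for _ in range(N):
        p = p_plus if rng.random() < gamma else p_minus   # sampling of participants
        correct = sum(rng.random() < p for _ in range(n_trial))  # sampling of D-Acc
        accs.append(correct / n_trial)
    return accs

random.seed(0)
accs = simulate_daccs(N=50, n_trial=100, gamma=0.8, p_plus=0.7)
print(min(accs), max(accs))
```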
The threshold for statistical significance α was fixed at 0.05 and the prevalence threshold γ0 was fixed at 0.5. The numerical calculation of P(Significant), which is the probability that the i-test reports a significant result, was performed for all possible i (i = 1 … i_max) with all combinations of the following parameters: Ntrial = 10, 100, 1000; N = 5, 6, … 100; γ from 0 to 1 with 0.01 increments; and p_correct+ from 0.51 to 1 with 0.01 increments.
N < 5 was omitted because the i-test cannot be applied for any i due to the constraint of Eq. (2.10). The i-test calculation was made with the correct estimation of P̃(a < â(i) | ∈ Ω−), i.e., P̃ = P. Hence, P(Significant) indicates the statistical power at γ > γ0 and the false alarm rate at γ ≤ γ0.

Extending i-test-one to i -test can improve the statistical power
First, the relationship between γ and the statistical power with Ntrial = 100 and N = 50 was illustrated (Figure 2A-E for i = 1, 5, 10, 15, 19; Supplementary GIF 1 for i = 1 … 19). As demonstrated in Figure 2A, the i-test-one provided a high statistical power only in a restricted area near γ = 1. In particular, the statistical power of the i-test-one exceeded 0.8 (Cohen, 2013) only at γ ≥ 0.97, indicating that the i-test-one could report a significant result with sufficient probability only when almost all people in the population have label information, even though the prevalence threshold was γ0 = 0.5. The area expanded downward (toward smaller γ) as i increased (Figure 2B-E), indicating that the i-test with larger i can report a significant result with high probability at smaller γ.
In the implementation of the i-test-unif-bino, we used the optimal i, which maximizes the expectation of the statistical power under the assumption of a binomial distribution of D-Acc and a uniform distribution of the parameters of the true D-Acc distribution (Section 2.4.2). With this implementation, the optimal i was 15 for this set of parameters (Figure 2D). As observed, the statistical power improved in a large area (red area in Figure 2F) at γ = 0.6-0.9 and p_correct+ > 0.6, while it degraded in a smaller area (blue area) at γ > 0.9 and p_correct+ < 0.6.
The same comparison was performed for the combinations of Ntrial = 10, 100, 1000 and N = 5, 6 … 100. The statistical power of the i-test-one exceeded 0.8 in all cases only at γ ≥ 0.96. In contrast, the statistical power of the i-test-unif-bino exceeded 0.8 with smaller γ in many cases (see Figure 2G, H, and I for excerpted results, and Supplementary GIFs 2-4 for all results). Exceptions were observed when i = 1 was selected for the i-test-unif-bino, in which case the i-test-unif-bino was identical to the i-test-one.
As expected from the analytical analysis (Section 2.3.3), the false alarm rate was below the statistical threshold α for all tested parameters with any i at Ntrial = 10, 100, 1000; N = 5, 6 … 100; γ from 0 to 0.5 with 0.01 increments; and p_correct+ from 0.51 to 1 with 0.01 increments (not shown).
In summary, the extension from the i-test-one to the i-test (implemented as the i-test-unif-bino) improved the statistical power, making it possible to report a significant result at smaller γ, while the power was reduced in the smaller range around γ = 1 with relatively low p_correct+.

Application to empirical fMRI data
An example of the i-test application to a real fMRI dataset (Hirose et al., 2015) is presented below.

Experiment and analysis procedures
For the experimental details, please refer to the original paper. Briefly, 8 participants (N = 8) performed a finger-tapping task using either the right index or middle finger for 1.5 s. Each participant underwent 10 sessions, and each session consisted of 20 trials comprising 10 index-finger trials and 10 middle-finger trials in a random order. Trials were separated by inter-trial intervals (ITIs) of 6 s plus 0.5-s instruction periods just before the trials. The finger was predicted from the preprocessed fMRI volumes measured in the time range of 2-6 s after the end of each trial. The classifier was trained with the sparse logistic regression algorithm (Yamashita et al., 2008) on voxel activities in the whole brain (33,270 ± 1,570 voxels). Classification accuracy was evaluated by a hold-out validation procedure, in which the classifiers were trained using 100 trials from the odd sessions (training dataset), and 100 trials from the even sessions (test dataset) were used to evaluate D-Acc. D-Acc was calculated as the number of test trials with correct predictions divided by the number of all test trials. The preprocessing and first-level decoding analyses for each participant were performed using the Multi-Voxel Pattern Classification Toolbox (https://github.com/satoshi-hirose/MVPC_toolbox).
Then, the i-test-unif-bino and the i-test-one were applied to the experimental results. For both tests, P̃(a < â(i) | ∈ Ω−) was estimated based on the binomial assumption, and the thresholds were set as α = 0.05 and γ0 = 0.5.
Finally, P̃_U was calculated and compared with α (Section 2.4.4). Substituting the estimate into Eq. (2.5) and Eq. (2.8), we found P̃_U = 0.035. Therefore, P̃_U was lower than the statistical threshold α = 0.05 and the i-test reported a significant result.
Although the true γ was unknown, the results may suggest a statistical power improvement of the i-test compared with the i-test-one. In the experiment, 7 of the 8 participants had D-Acc above the upper 1% point (0.62) of the D-Acc distribution for Ω−, i.e., BPDF(100·a, 100, 0.5). Under this condition, the i-test reported a significant result, while the i-test-one did not. This agrees with the results of the previous section demonstrating the statistical power improvement obtained by extending the i-test-one to the i-test.

Discussion
I proposed a novel second-level group statistical test for decoding accuracy, named the "i-test". The i-test targets the population information prevalence, i.e., the proportion of the population that has higher-than-chance D-Acc. Therefore, a significant result of the i-test supports an inference about the typical effect: that brain activation contains label information in a proportion larger than γ0 of the population. This contrasts with the fact that a significant result of a statistical test on the population mean of the D-Acc, such as the t-test, can only support the inference that "there are some people in the population whose fMRI data carry information about the experimental condition" (Allefeld et al., 2016).
The i-test is an extension of the test for the population prevalence proposed by Allefeld et al. (2016) (i-test-one), in which the minimum statistic is necessarily used. By using a higher-order statistic, the i-test can improve the statistical power (Section 3).

Limitation of the statistical power evaluation in the current study
My numerical evaluation of the statistical power (Section 3) was limited to the situation where the true D-Acc follows the binomial distribution and the distribution can be correctly estimated. In real situations, the D-Acc does not always follow the binomial distribution because of the lack of independence among trials, and the exact distribution is unknown. In particular, when the D-Acc is evaluated with cross-validation, the distribution is known to differ from the binomial distribution (Combrisson and Jerbi, 2015; Supplementary Material S1.2.2.1). In Supplementary Material S1.2, I present preliminary simulation evidence that the i-test provides a high statistical power for the non-binomial cross-validated D-Acc (CV-D-Acc). Admittedly, the D-Acc distribution in empirical neuroscience experiments can differ more from the binomial distribution than the simulated CV-D-Acc, because dependence and differences between trials may arise not only from the cross-validation procedure, but also from empirical issues in the experiment, such as drift of participants' arousal level.
An important issue is that the i -test could not perfectly control the false alarm rate when the estimated D-Acc distribution did not match the true distribution (Supplementary Material S1.2.2.3). Further theoretical studies and accumulation of empirical evidence are needed for evaluation of the i -test's robustness to violations of the distribution assumptions in real situations.

Implementation of i -test
In the present study, a mathematically simple implementation of the i-test with the assumption of a binomial distribution for D-Acc (i-test-unif-bino) was proposed. The computational merit is that such a D-Acc distribution is fully determined by only two unknown parameters, γ and p_correct+. However, as mentioned above, D-Acc does not always follow the binomial distribution in empirical situations. Following this concern, another possible variation is to use a nonparametric empirical estimation of the distribution, e.g., estimation of P̃(a | ∈ Ω−) using a permutation test, as proposed by Allefeld et al. (2016) for the i-test-one. As an example of such implementations, a permutation-based i-test (i-test-unif-perm) is introduced in Supplementary Material S3.1.
The other assumption used in the i-test-unif-bino is the noninformative (uniform) prior distribution of γ and p_correct+, which does not require any prior knowledge. Another idea is to use the experimental results to estimate the prior distribution of the parameters γ and p_correct+, which could lead to the selection of a better-optimized i. From this viewpoint, I propose another implementation of the i-test, in which maximum likelihood point estimates of γ and p_correct+ are used (i-test-ml-bino; Supplementary Material S3.2).
Further studies are needed to evaluate the efficacy, including the robustness discussed in Section 5.1 , of these alternative implementations, as well as other possibilities, e.g., i-test-ml-perm .

Conclusion
A novel second-level group statistical test for decoding accuracy, named the "i-test", was proposed. I provided the theoretical derivation, an empirical implementation, and evidence that this test can improve the statistical power compared with the i-test-one proposed by Allefeld et al. (2016). The advantages of the i-test comprise a mathematically guaranteed and meaningful population inference and a high statistical power.
Although the i-test was introduced as a statistical method for the D-Acc, it is also applicable to other "information-like" measures, including continuous variables such as the Mahalanobis distance (Kriegeskorte et al., 2006; Nili et al., 2014), the linear discriminant t (Nili et al., 2014), or the pattern distinctness D (Allefeld and Haynes, 2014). This study may provide a robust second-level group statistical test for information-based neuroimaging studies.


Declaration of Competing Interest
None.

Credit authorship contribution statement

Satoshi Hirose: Conceptualization, Methodology, Software, Resources, Writing - original draft, Visualization, Funding acquisition.

Acknowledgments
I would like to acknowledge Dr. Atsushi Yokoi, Center for Information and Neural Networks, Advanced ICT Research Institute, National Institute of Information and Communications Technology, Dr. Isao Nambu, Department of Electrical, Electronics and Information Engineering, Nagaoka University of Technology, Dr. Matthew de Brecht, Graduate School of Human and Environmental Studies, Kyoto University, and Dr. Carsten Allefeld, Department of Psychology, City, University of London, for helpful suggestions. This work was supported by a Grant-in-Aid for Young Scientists (B) (JSPS KAKENHI 16K16649) to the author.

Data and code availability statement
The participants of the experiment ( Section 4 ) were explicitly informed that "the neural data measured in the experiment will not be released in any form " when they provided informed consent. Therefore, it is impossible to publish the neuronal data.
All analyses in this study can be replicated by MATLAB program codes available at https://github.com/satoshi-hirose/i-test/releases, which include the result of the decoding analysis (decoding accuracy) for each participant obtained from the experiment. I think this is sufficient because this study focuses on the second-level group statistical test using the decoding accuracies of the individual participants.

Appendix A: Numerical calculation of the probability that the i-test reports a significant result
I describe the calculation of P(Significant), which is the probability that the i-test reports a significant result, given the predetermined parameters (γ0, α, and i), the number of participants (N), the estimated distribution of the D-Acc without label information (P̃(a | k ∈ Ω−)), and the true distribution (P(a | k ∈ Ω−), P(a | k ∈ Ω+), and γ). The outline of the calculation is as follows. First, a threshold T of the i-th order statistic is introduced such that the i-test reports a significant result if and only if a_(i) > T (Section A.1). Then, P(a_(i) > T) is derived (Section A.2).

A.2. Calculation of the probability that i-th order statistic is higher than T
The probability of a significant result is defined as

P(Significant) = P(the i-test reports a significant result).    (A.3)

From the equivalence between the significance criterion of the i-test (Section A.1) and a_(i) > T, it follows that

P(Significant) = P(a_(i) > T).    (A.4)

To calculate P(a_(i) > T), first, the probability that the D-Acc of a participant does not exceed T is formulated as follows:

P(a_k ≤ T) = γ P(a_k ≤ T | k ∈ Ω+) + (1 − γ) P(a_k ≤ T | k ∈ Ω−).    (A.5)

Then, by noting that the condition that the i-th order statistic is higher than T is identical to the condition that fewer than i of the N participants have D-Acc not exceeding T, we can formulate P(a_(i) > T) as follows:

P(a_(i) > T) = Σ_{j=0}^{i−1} (N choose j) P(a_k ≤ T)^j (1 − P(a_k ≤ T))^{N−j} = BCDF(i − 1, N, P(a_k ≤ T)).    (A.6)

Therefore, P(Significant) is derived as follows:

P(Significant) = BCDF(i − 1, N, P(a_k ≤ T)).    (A.7)

In summary, P(Significant) is numerically calculated with the following steps: T is calculated from P̃(a | k ∈ Ω−), γ0, α, N, and i. Then, P(a_k ≤ T) is calculated with Eq. (A.5), the value of T, and the true distribution (P(a | k ∈ Ω+), P(a | k ∈ Ω−), and γ). Finally, P(Significant) is calculated with Eq. (A.7).
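
Under binomial assumptions for the D-Acc distributions (as in Appendix B), the chain of Eqs. (A.5)-(A.7) reduces to a few binomial CDF evaluations. The released implementation is in MATLAB; the following Python sketch is only illustrative, and the parameter values in the example (T, p_plus, p_minus, n_trial) are hypothetical placeholders, not those of the paper.

```python
import math

def binom_cdf(k, n, p):
    """BCDF(k, n, p): probability that a Binomial(n, p) variable is <= k."""
    k = math.floor(k)
    if k < 0:
        return 0.0
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def p_significant(i, n_subj, gamma, T, p_plus, p_minus, n_trial):
    """P(Significant) for the i-test via Eqs. (A.5)-(A.7).

    i       -- order of the order statistic used by the i-test
    n_subj  -- number of participants N
    gamma   -- true information prevalence
    T       -- threshold of the i-th order statistic (from Appendix A.1)
    p_plus  -- per-trial accuracy with label information (p_correct+)
    p_minus -- chance-level per-trial accuracy (p_correct-)
    n_trial -- number of trials per participant
    """
    # Eq. (A.5): P(a_k <= T), mixing the with- and without-information
    # binomial D-Acc distributions with weight gamma.
    p_below = (gamma * binom_cdf(T * n_trial, n_trial, p_plus)
               + (1 - gamma) * binom_cdf(T * n_trial, n_trial, p_minus))
    # Eqs. (A.6)-(A.7): a_(i) > T iff fewer than i participants have
    # D-Acc not exceeding T, i.e. BCDF(i - 1, N, P(a_k <= T)).
    return binom_cdf(i - 1, n_subj, p_below)

# Hypothetical example: with T fixed, P(Significant) grows with the true
# prevalence gamma, mirroring the example of Section A.2.
high = p_significant(19, 50, 0.8, 0.53, 0.7, 0.5, 100)
low = p_significant(19, 50, 0.4, 0.53, 0.7, 0.5, 100)
```

The threshold T is taken as given here; computing it requires the procedure of Section A.1.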
With the same set of parameters and i = 19 (= i_max), T increased (T = 0.530) and, therefore, P(a_k ≤ T), which is the probability that the D-Acc of a participant does not exceed T, increased up to 0.195. However, at i = 19, the i-test reports a significant result whenever fewer than 19 of the 50 participants have D-Acc not exceeding T. Consequently, the i-test reported a significant result with a probability of almost 100% (P(Significant) = 0.998).
The results provide an intuitive understanding of the improvement in statistical power. For the i-test-one, although the probability that the D-Acc of a participant exceeds T (P(a_k > T) = 1 − P(a_k ≤ T)) was close to one (1 − 0.0242), the probability that the D-Acc of all 50 participants is higher than T was only 29%. As a result, the i-test-one failed to report a significant result with a probability larger than 70%, although the true γ (0.8) was much higher than the prevalence threshold (γ0 = 0.5). In contrast, for the i-test with i = 19, although P(a_k > T) was lower (0.805), the i-test can report a significant result when the D-Acc of more than 31 of the 50 participants is higher than T and, therefore, reported a significant result with a probability of almost 100%.

Appendix B: Selection of optimal i
For the selection of the parameter i, the optimal value (i_opt) is determined as the value that maximizes the expected statistical power for a given i (Power(i)).

B.1. General formulation
For the calculation of Power(i), first, the set of parameters of the true distribution is defined as θ, such that the true distributions of the D-Acc both with and without label information (P(a | k ∈ Ω+) and P(a | k ∈ Ω−)) are uniquely determined at fixed θ. Consequently, the true distribution can be expressed as

P(a | k ∈ Ω+, θ) and P(a | k ∈ Ω−, θ).    (B.1)

Then, the expected statistical power Power(i) is defined as

Power(i) = ∫_Θ ∫_{γ0}^{1} P(Significant | i, γ, θ) P(γ, θ) dγ dθ.    (B.2)

Here, Θ is the parameter space of θ, and P(γ, θ) is the joint prior distribution. P(Significant | i, γ, θ) can be numerically calculated with the procedure proposed in Appendix A. The integration range for γ is (γ0, 1], where the alternative hypothesis is true.
Then, i_opt is defined as the value of i that maximizes the expected statistical power:

i_opt = argmax_i Power(i).    (B.3)

B.2. Calculation of i_opt for the i-test-unif-bino
In the implementation of the i-test-unif-bino, a binomial distribution of the D-Acc is assumed, so that θ includes one unknown parameter (p_correct+), and all other parameters determining the distribution (p_correct− and N_trial) are known. As for the prior distribution P(γ, p_correct+), a noninformative prior (uniform distribution) is used. Under these assumptions, Eq. (B.2) can be approximated by the double sum

Power(i) ∝ Σ_γ Σ_{p_correct+} P(Significant | i, γ, p_correct+).    (B.6)

The integration range of p_correct+ is (p_correct−, 1]. The summation is taken over γ from γ0 + h to 1 with increment h, and over p_correct+ from p_correct− + h to 1 with increment h, where h is the precision parameter. With the additional assumption that P̃(a | k ∈ Ω−) is identical to P(a | k ∈ Ω−), P(Significant | i, γ, p_correct+) can be numerically calculated (Appendix A). Therefore, i_opt^unif-bino can be numerically calculated.
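
The grid approximation of Eq. (B.6) and the search for the maximizing i can be sketched as follows. This Python sketch (the released code is MATLAB) assumes the threshold T has already been computed for each candidate i by the procedure of Appendix A.1; the values in `thresholds`, and all other parameter values, are hypothetical placeholders. The sum is normalized to a grid mean, which is proportional to Eq. (B.6) and so leaves the argmax unchanged.

```python
import math

def binom_cdf(k, n, p):
    """BCDF(k, n, p): probability that a Binomial(n, p) variable is <= k."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(math.floor(k) + 1))

def p_significant(i, n_subj, gamma, T, p_plus, p_minus, n_trial):
    """P(Significant | i, gamma, p_correct+) via Eqs. (A.5)-(A.7)."""
    p_below = (gamma * binom_cdf(T * n_trial, n_trial, p_plus)
               + (1 - gamma) * binom_cdf(T * n_trial, n_trial, p_minus))
    return binom_cdf(i - 1, n_subj, p_below)

def power(i, T, n_subj, gamma0, p_minus, n_trial, h):
    """Eq. (B.6): average P(Significant) over the (gamma, p_correct+) grid,
    gamma in (gamma0, 1] and p_correct+ in (p_correct-, 1], step h."""
    n_g = round((1 - gamma0) / h)
    n_p = round((1 - p_minus) / h)
    total = 0.0
    for a in range(1, n_g + 1):
        for b in range(1, n_p + 1):
            total += p_significant(i, n_subj, gamma0 + a * h, T,
                                   p_minus + b * h, p_minus, n_trial)
    return total / (n_g * n_p)

# Hypothetical thresholds T(i) for each candidate i (Appendix A.1 not shown).
thresholds = {1: 0.600, 2: 0.625, 3: 0.650, 4: 0.675, 5: 0.700, 6: 0.725}
powers = {i: power(i, T, n_subj=20, gamma0=0.5, p_minus=0.5,
                   n_trial=40, h=0.1) for i, T in thresholds.items()}
i_opt = max(powers, key=powers.get)
```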

Appendix C: Implementation of the i-test-unif-bino
Figure C.1 shows the pseudocode of the i-test-unif-bino implementation.

C.1. Inputs
The i-test-unif-bino requires the following six inputs: the experimental result {a_k}, the number of trials N_trial, the chance-level D-Acc p_correct−, the prevalence threshold γ0, the significance threshold α, and the precision parameter h.
C.2. Calculation of i_opt^unif-bino
The calculation of Power(i) is performed in the following two steps. The first step is the calculation of T (Lines 4-6; Appendix A.1). This is done by performing the i-test (Eqs. (A.1) and (A.2)) for each possible value of the order statistic, with the assumption of a binomial distribution for P̃(a | k ∈ Ω−) (Assumption 4, see below), and identifying the largest value that does not lead to a significant result. The second step is the calculation of the marginal statistical power (Lines 7-9; Appendix A.2 and Appendix B.2). Assuming that p_correct+ and γ are uniformly distributed (Assumption 1), it is approximated by the sum of the probabilities that the i-test reports a significant result for each combination of p_correct+ and γ (Eq. (B.6); Line 7). The probability for each combination of p_correct+ and γ is calculated with Eqs. (A.5) (Line 8) and (A.7) (Line 9), together with the calculated value of T. A binomial distribution for the true D-Acc distribution both with and without label information is assumed in the calculation (Assumptions 2 and 3).

C.3. Performing i-test
Finally, the i-test with i = i_opt^unif-bino is applied to the experimental results (Eqs. (A.1) and (A.2), Lines 12-15), with the assumption of a binomial distribution for the estimation of the D-Acc distribution without label information (Assumption 4).

C.5. Error handling
The i-test-unif-bino yields an error when there is no i that satisfies BCDF(i − 1, N, 1 − γ0) < α (Line 2), indicating that the i-test cannot be applied to the inputs because of the constraint imposed by Eq. (2.10). Another error can occur when no threshold T satisfying the significance criterion exists for particular values of i (Line 4). This indicates that the i-test never reports a significant result with that value of i. In this case, such values of i should be omitted in the search for i_opt^unif-bino.
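
The first error condition depends only on N, γ0, and α, so it can be checked before any data are collected. A minimal Python sketch of that check (function and variable names are mine, not those of the released MATLAB code):

```python
import math

def binom_cdf(k, n, p):
    """BCDF(k, n, p): probability that a Binomial(n, p) variable is <= k."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def admissible_orders(n_subj, gamma0, alpha):
    """Values of i for which the i-test can in principle reach significance,
    i.e. those satisfying BCDF(i - 1, N, 1 - gamma0) < alpha (Eq. (2.10)).
    An empty list corresponds to the error raised at Line 2."""
    return [i for i in range(1, n_subj + 1)
            if binom_cdf(i - 1, n_subj, 1 - gamma0) < alpha]
```

Because BCDF(i − 1, N, 1 − γ0) increases with i, the admissible values form a contiguous range 1, ..., i_max. For example, with N = 50, γ0 = 0.5, and α = 0.05 this gives i_max = 19, matching the i_max of the example in Appendix A, whereas with only N = 4 participants the list is empty and the test cannot be applied.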

Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.neuroimage.2021.118456 .