Removing the effects of the site in brain imaging machine-learning – Measurement and extendable benchmark

Multisite machine-learning neuroimaging studies, such as those conducted by the ENIGMA Consortium, need to remove the differences between sites to avoid effects of the site (EoS) that may prevent or fraudulently help the creation of prediction models, leading to impoverished or inflated prediction accuracy. Unfortunately, we have shown earlier that current Methods Aiming to Remove the EoS (MAREoS, e.g., ComBat) cannot remove complex EoS (e.g., including interactions between regions). And complex EoS may bias the accuracy. To overcome this hurdle, groups worldwide are developing novel MAREoS. However, we cannot assess their effectiveness because EoS may either inflate or shrink the accuracy, and MAREoS may both remove the EoS and degrade the data. In this work, we propose a strategy to measure the effectiveness of a MAREoS in removing different types of EoS. FOR MAREOS DEVELOPERS, we provide two multisite MRI datasets with only simple true effects (i.e., detectable by most machine-learning algorithms) and two with only simple EoS (i.e., removable by most MAREoS). First, they should use these datasets to fit machine-learning algorithms after applying the MAREoS. Second, they should use the formulas we provide to calculate the relative accuracy change associated with the MAREoS in each dataset and derive an EoS-removal effectiveness statistic. We also offer similar datasets and formulas for complex true effects and EoS that include first-order interactions. FOR MACHINE-LEARNING RESEARCHERS, we provide an extendable benchmark website to show: a) the types of EoS they should remove for each given machine-learning algorithm and b) the effectiveness of each MAREoS for removing each type of EoS. Relevantly, a MAREoS only able to remove the simple EoS may suffice for simple machine-learning algorithms, whereas more complex algorithms need a MAREoS that can remove more complex EoS. 
For instance, ComBat removes all simple EoS as needed for predictions based on simple lasso algorithms, but it leaves residual complex EoS that may bias the predictions based on standard support vector machine algorithms.


Introduction
Magnetic resonance imaging (MRI) researchers often pool data from different sites to achieve more statistical power to detect true differences. Although consortia such as ENIGMA use harmonized protocols (Thompson et al., 2014), there are still differences due to varying scanning devices and acquisition sequence parameters. These differences may introduce effects of the site (EoS) that bias the analyses (Solanes et al., 2021).
For instance, imagine we conduct a two-site MRI study to investigate whether we may use baseline MRI to predict the subsequent response to a medication. Imagine also that, by chance, 80% of patients in site A respond to the drug, whereas only 20% in site B. Finally, imagine that site A's MRI device makes the images very bright and site B's device very dark. With these settings, a machine-learning algorithm could predict whether a patient will respond or not, exclusively using the difference in images' brightness between the two MRI devices. In other words, the machine-learning model would predict that patients with bright images will respond, whereas patients with dark images will not. And the machine-learning model would be pretty successful: it would show 80% accuracy! However, this accuracy would be false, inflated, artifactual, exclusively based on an EoS. The balanced accuracy (the average of sensitivity and specificity) separately calculated for each site would be just 50%, like tossing a coin.
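The arithmetic of this two-site example can be checked with a short Python sketch. The sample size (100 patients per site) and the function name are assumptions made only for illustration; the "model" uses nothing but the site.

```python
# Toy version of the two-site example: the prediction relies only on
# image brightness, i.e., on which site a patient comes from.

def balanced_accuracy(truth, pred):
    """Average of sensitivity and specificity (as fractions)."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    tn = sum(1 for t, p in zip(truth, pred) if not t and not p)
    n_pos = sum(1 for t in truth if t)
    n_neg = len(truth) - n_pos
    return (tp / n_pos + tn / n_neg) / 2

# Hypothetical sample sizes: 100 patients per site.
site_a = [True] * 80 + [False] * 20   # 80% responders, bright images
site_b = [True] * 20 + [False] * 80   # 20% responders, dark images

pred_a = [True] * len(site_a)    # bright image -> predicted responder
pred_b = [False] * len(site_b)   # dark image -> predicted non-responder

truth = site_a + site_b
pred = pred_a + pred_b
overall_accuracy = sum(t == p for t, p in zip(truth, pred)) / len(truth)

bac_site_a = balanced_accuracy(site_a, pred_a)
bac_site_b = balanced_accuracy(site_b, pred_b)

print(overall_accuracy, bac_site_a, bac_site_b)
```

Within each site the model predicts a single class, so its sensitivity and specificity are 100% and 0% (or vice versa), which is why the per-site balanced accuracy falls to chance level.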
Due to the potentially high biases introduced by EoS, researchers worldwide are developing novel Methods Aiming to Remove the EoS (MAREoS). A common and long-established MAREoS is covarying for the site in the linear model, preferably coded as a random-effects factor (i.e., a mixed-effects analysis) (Favre et al., 2019). Another common MAREoS is ComBat (Johnson et al., 2007), a batch adjustment method developed for genomics data. Several groups have recently adapted this MAREoS to MRI datasets (Fortin et al., 2018; Radua et al., 2020).
However, we have shown previously that current MAREoS do not entirely remove all differences between sites. Worryingly, these differences may either inflate or shrink the accuracy. In other words, machine-learning algorithms may either use the remaining EoS "fraudulently", thus inflating accuracy rates, or fail to detect true effects due to the noise associated with EoS, thus shrinking accuracy rates (Solanes et al., 2021). While all MAREoS can remove simple additive differences, we are not aware of a MAREoS able to remove complex EoS, such as discrepancies in covariance (i.e., the interaction between brain regions). To avoid reporting biased accuracies, we have provided formulas and an R package to unbiasedly estimate the multisite-corrected accuracy in the presence of residual EoS (Solanes et al., 2021). This package may be helpful to ensure that the EoS do not bias the reported accuracy. However, the goal of the community should be to develop a novel MAREoS able to remove complex EoS entirely.
Unfortunately, MAREoS developers may face a paradox. To our knowledge, there is no straightforward way to measure the EoS-removal effectiveness. For example, in data with EoS and true effects, a MAREoS may yield higher accuracy than another MAREoS for two opposite reasons. It may either reduce the noise associated with EoS (improving the detection of true effects) or fail to remove the EoS (leading to higher accuracy inflation). On the other hand, in data with only EoS and no true effects, a MAREoS may yield a lower accuracy than another MAREoS for two opposite reasons again. It may either remove the EoS better (minimizing the accuracy inflation) or degrade the data more (worsening the detection of true effects).
To overcome this problem, we designed an approach to objectively measure how much of the EoS a given MAREoS removes and how much it degrades the data. Furthermore, we also provide: a) datasets to conduct these measurements and b) a benchmark website that lets machine-learning researchers readily identify the most appropriate MAREoS for their situation.

Methods
The strategy presented in this paper builds on the study of the change in accuracy associated with a MAREoS. This accuracy change has opposite meanings depending on whether the dataset has only EoS (i.e., no true effects) or only true effects (i.e., no EoS). In datasets with only true effects, an accuracy decrease should only be due to data degradation, a side effect of the MAREoS. Conversely, an accuracy decrease in datasets with only EoS should be due to a correct EoS-removal (plus some potential data degradation). We have noted above that accuracy increases are possible in datasets with both true effects and EoS since the noise associated with EoS may shrink the accuracy ( Solanes et al., 2021 ). However, to simplify the following calculations, we tried to avoid datasets mixing true effects and EoS.
We first describe the datasets and the specific machine-learning algorithm that MAREoS developers should apply so that differences between MAREoS depend only on the MAREoS (and not on the datasets or machine-learning algorithms). Afterward, we present the formulas to measure the effectiveness of a MAREoS. Finally, we show an example with the Johnson-Fortin-Radua version of the ComBat MAREoS (Fortin et al., 2018; Johnson et al., 2007; Radua et al., 2020) (script available at http://enigma.ini.usc.edu/protocols/statistical-protocols/). Readers only interested in the strategy may directly read the section about measuring the EoS-removal effectiveness of a MAREoS.

Description of the datasets and the machine-learning algorithm
Each simulated dataset includes the baseline MRI data (cortical thickness, cortical surface area, or subcortical volumes) from ∼1000 patients from 8 scanner sites, two baseline clinical covariates, and the subsequent responses to a given treatment. The simulated studies would aim to predict the response to the treatment (response vs. no response) from the baseline MRI data ( Figure 1 ). The latter follow normal distributions like those returned by FreeSurfer ( Radua et al., 2020 ) and have linear relationships with two simulated clinical covariates. In this section, we first describe the datasets (along with the machine-learning algorithm) to familiarize developers with them. Afterward, we briefly report how we created the MRI data for interested readers.
Two datasets have only simple EoS (i.e., neither true effects nor complex EoS). The lack of true effects means no relationship between the MRI data and the response. Therefore, machine-learning algorithms should not predict the response. Accuracy should be around 50%, like tossing a coin. However, there are substantial simple differences across sites in response probability and MRI data (e.g., the cortex is systematically measured thicker in some devices). Most machine-learning algorithms may use these simple EoS to "fraudulently" predict the response, inflating the accuracy. These simple EoS should be removable by most MAREoS.
Two other datasets have only simple true effects (i.e., neither EoS nor complex true effects). Thus, there are significant simple relationships between MRI data and the response to treatment (e.g., responders have thicker cortices). Therefore, most machine-learning algorithms should predict the response with > 50% accuracy.
To predict the treatment response using the brain imaging data, MAREoS developers should conduct a ten-fold cross-validation using our specific fold distribution. Within each fold, they should fit a lasso algorithm in which the variable to predict is the response to the treatment (coded as a binary variable), and the predictors are the MRI data. For instance, in R, we could use the "glmnet" library (Friedman et al., 2010). Developers may download the specific R scripts from https://www.imardgroup.com/mareos-benchmark/. We chose the simple lasso because we assumed it can only detect simple effects. We reasoned that it is a kind of linear model, and linear models cannot detect complex effects (e.g., interactions or unknown others) unless they are specifically modeled.
Figure 1. Location of the cortical regions and subcortical nuclei whose thickness, surface area, or volume we provide in the datasets.

We also created datasets with complex true effects or EoS, including first-order interactions between two brain regions or nuclei. To assess the effectiveness of a MAREoS in removing first-order interaction-based complex EoS, we propose using a lasso algorithm with a design matrix that includes the first-order interactions. We chose the lasso with first-order interactions algorithm because, again, we assumed that, being a linear model, it can only detect simple or first-order interaction-based effects.
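A design matrix "that includes the first-order interactions" can be pictured as the original columns plus the product of every pair of columns. The following Python sketch illustrates that idea only; the function name and the ROI names are hypothetical, and the authors' R scripts may build the matrix differently (e.g., after centering or scaling).

```python
from itertools import combinations

def add_first_order_interactions(rows, names):
    """Append the product of every pair of columns to each row."""
    pairs = list(combinations(range(len(names)), 2))
    new_names = list(names) + [f"{names[a]}:{names[b]}" for a, b in pairs]
    new_rows = [list(row) + [row[a] * row[b] for a, b in pairs] for row in rows]
    return new_rows, new_names

# Hypothetical ROI names and values, for illustration only.
names = ["roi_a", "roi_b", "roi_c"]
rows = [[2.0, 3.0, 4.0], [1.0, 0.5, 2.0]]
expanded, expanded_names = add_first_order_interactions(rows, names)

print(expanded_names)
print(expanded[0])
```

With p regions, this adds p(p-1)/2 interaction columns, so a sparse method such as the lasso is a natural fit for the enlarged matrix.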
We encourage other researchers to describe other complex EoS, create the respective datasets, and add them to the MAREoS benchmark website.

Creation of the datasets
For the interested reader, we briefly report how we created each of these MRI datasets. We first generated normally distributed random data for each FreeSurfer region/nucleus, with means and standard deviations similar to real data (Radua et al., 2020). Then, to create simple EoS, we added differences between sites:

$$y^{*}_{r,i,j} = \delta_{r,i} \, y_{r,i,j} + \gamma_{r,i}$$

where $y^{*}_{r,i,j}$ is the value including the EoS, $y_{r,i,j}$ is the cortical thickness, cortical surface area, or subcortical volume of the $r$th ROI from the $j$th individual of the $i$th site, and $\delta_{r,i}$ and $\gamma_{r,i}$ are the multiplicative and additive EoS of the $i$th site in the $r$th ROI. We set both $\delta_{r,i}$ and $\gamma_{r,i}$ to follow normal distributions across the regions of a site, and $\delta_{\cdot,i}$ and $\gamma_{\cdot,i}$ to follow normal distributions across the sites. For further information about normally distributed multiplicative and additive effects, please see Radua et al. (2020). To create interactions between ROIs, we swapped (between patients) the cortical thickness, cortical surface area, or subcortical volume of an ROI of a site to create positive or negative correlations with another ROI. For instance, imagine a site with only five patients where we aim to create a positive correlation between ROIs A and B. If the patient values in ROI A were [12,13,14,15,16], and the patient values in ROI B were [6,9,10,8,7], the correlation would be nearly null (r = 0.1). However, after swapping patient values 10 and 7 in ROI B (i.e., [6,9,7,8,10]), the correlation would be 0.7. Finally, we added some value to the responders to create true effects. After conducting these transformations, we added the effects of the covariates (adding some value multiplied by the covariate), truncated the resulting values to avoid outliers, and rescaled the data to be like FreeSurfer again. We created many datasets, but we chose some that effectively only showed EoS or only showed true effects and were varied in features and balanced accuracy (BAC).
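The five-patient swapping example can be verified with a few lines of Python. This is only a check of the numbers quoted above, not the authors' dataset-generation code; the helper name is illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

roi_a = [12, 13, 14, 15, 16]
roi_b = [6, 9, 10, 8, 7]
r_before = pearson(roi_a, roi_b)

# Swap the values 10 and 7 (third and fifth patients) in ROI B.
roi_b_swapped = roi_b.copy()
roi_b_swapped[2], roi_b_swapped[4] = roi_b_swapped[4], roi_b_swapped[2]
r_after = pearson(roi_a, roi_b_swapped)

print(round(r_before, 2), round(r_after, 2))
```

Because swapping only reorders values within ROI B, the mean and standard deviation of each ROI stay unchanged; only the relationship between the two ROIs changes.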
To know which only showed EoS or only showed true effects, we used a logistic regression model to predict the response and calculated the accuracy using both standard formulas and the "multisite.accuracy" R package, which corrects for the site (Solanes et al., 2021). We considered "datasets with only EoS" those with ∼50% multisite accuracy, even if they showed high raw accuracy. Similarly, we considered "datasets with only true effects" those that showed similar (high) raw and multisite accuracies (Solanes et al., 2021).

MAREoS: Method Aiming to Remove the EoS. (a) Average RACs and EoS-removal effectiveness are limited to 0-100%. These numbers may differ slightly from those reported at https://www.imardgroup.com/mareos-benchmark/ because the latter are based on a parallel collection of datasets for which the variable "response" is not public.

Measurement of the EoS-removal effectiveness of a MAREoS for simple EoS
As detailed above, each dataset contains baseline multisite MRI data from patients and the subsequent responses to a given treatment. First, separately for each dataset and within a ten-fold cross-validation scheme, the developers must use the training subset to fit and apply a MAREoS to remove the EoS, find and remove the linear effects of two clinical covariates, fit the simple machine-learning algorithm, and use all these models to predict whether patients in the test subset respond to treatment. Second, again separately for each dataset, the developers must calculate the predictions' sensitivity, specificity, and balanced accuracy (BAC). The sensitivity is the percentage of responders correctly predicted to respond. The specificity is the percentage of non-responders correctly predicted not to respond. The BAC is the average of sensitivity and specificity. Relevantly, the developers must calculate the BAC using these basic formulas. In other words, they cannot correct for the site with the "multisite.accuracy" R package (Solanes et al., 2021) that we would otherwise recommend. The reason is that we need uncorrected accuracies to measure the EoS-removal effectiveness of a MAREoS. Developers may download the specific R scripts to conduct all these steps from https://www.imardgroup.com/mareos-benchmark/. Afterward, they must perform the following calculations to measure the EoS-removal effectiveness of the MAREoS.
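The basic (uncorrected) formulas for sensitivity, specificity, and BAC can be sketched in Python as follows. The function name and the toy predictions are illustrative assumptions; the R scripts on the benchmark website remain the reference implementation.

```python
def sensitivity_specificity_bac(truth, pred):
    """truth and pred are lists of 0/1 (1 = responder); returns percentages."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in truth if t == 1)
    n_neg = len(truth) - n_pos
    sensitivity = 100.0 * tp / n_pos   # responders correctly predicted
    specificity = 100.0 * tn / n_neg   # non-responders correctly predicted
    return sensitivity, specificity, (sensitivity + specificity) / 2

# Hypothetical pooled test-fold predictions for ten individuals.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
sens, spec, bac = sensitivity_specificity_bac(truth, pred)

print(sens, spec, bac)
```

Note that no site information enters the calculation, in line with the requirement that the BAC be left uncorrected for the site.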
The first calculation, also performed separately for each dataset, is the relative accuracy change (RAC):

$$\mathrm{RAC} = \frac{\mathrm{BAC}_{\text{without MAREoS}} - \mathrm{BAC}_{\text{with MAREoS}}}{\mathrm{BAC}_{\text{without MAREoS}} - 50\%}$$

To illustrate the idea, Table 1 shows the RAC calculations in the "Simple EoS #1" and "Simple true effects #1" datasets using a standard support vector machine (SVM) algorithm and ComBat (see details later). In the dataset with simple EoS, BAC was 74% when we fitted the SVM without applying any MAREoS. Using a MAREoS, the (EoS-inflated) BAC decreased to 50%. Then, RAC [simple EoS] in this dataset would be 100% (i.e., the accuracy is 100% closer to 50%). We could naively interpret that the MAREoS has reduced the bias by 100%. However, in the dataset with simple true effects, BAC decreased from 73.05% to 73.02% when using a MAREoS due to an undesirable potential side effect: data degradation. Then, RAC [simple true effects] in this dataset would be 0.14% (i.e., due to data degradation, the accuracy is 0.14% closer to 50%).
At a theoretical level, it might be interesting to note that the formula of the RAC would also work for accuracy increases, which are possible in datasets with both true effects and EoS. In such datasets, the EoS might prevent the machine-learning algorithm from fully detecting the true effects. For instance, BAC could be 70% before a MAREoS and 75% after the MAREoS, so that the RAC would be -25%, meaning that the accuracy is now 25% farther from 50%. However, this RAC would be of little use because we would know neither the amount of EoS removed nor whether there was also data degradation.
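The worked numbers above are consistent with reading the RAC as the drop in BAC divided by the distance of the uncorrected BAC from chance (50%). A minimal Python sketch under that assumption (the function name and the `chance` parameter are illustrative):

```python
def relative_accuracy_change(bac_without, bac_with, chance=50.0):
    """RAC in percent: how much closer to chance the BAC is after the MAREoS."""
    return 100.0 * (bac_without - bac_with) / (bac_without - chance)

# Dataset with only simple EoS: 74% -> 50%.
rac_eos = relative_accuracy_change(74.0, 50.0)

# Dataset with only simple true effects: 73.05% -> 73.02%.
# The rounded BACs give a value close to the 0.14% quoted in the text,
# which may rest on unrounded BACs.
rac_true = relative_accuracy_change(73.05, 73.02)

# Theoretical accuracy increase: 70% -> 75%.
rac_increase = relative_accuracy_change(70.0, 75.0)

print(rac_eos, round(rac_true, 2), rac_increase)
```

The function is undefined when the uncorrected BAC is exactly 50%, since there is then no accuracy above chance to lose.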
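Turning to the measurement of the EoS-removal effectiveness, the second calculation consists of adjusting the average RAC [simple EoS] (i.e., the naïve bias reduction) with the average RAC [simple true effects] (i.e., due to data degradation) to derive the EoS-removal effectiveness:

$$\text{EoS-removal effectiveness} = \frac{\overline{\mathrm{RAC}}_{[\text{simple EoS}]} - \overline{\mathrm{RAC}}_{[\text{simple true effects}]}}{100\% - \overline{\mathrm{RAC}}_{[\text{simple true effects}]}}$$

Limiting the average RACs and the EoS-removal effectiveness to 0-100% may be sensible.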
Going back to Table 1 , if a MAREoS shows RAC [simple EoS] = 100% (a naïve 100% reduction in bias) and RAC [simple true effects] = 0.14% (a 0.14% decrease due to data degradation), then the simple EoS-removal effectiveness was 100%. In the datasets with only simple EoS, we may assume that the MAREoS would first remove simple EoS, reducing the accuracy. And afterward, it would degrade the data leading to (minimally) decreasing the remaining accuracy ( Figure 2 , A1 ). Or vice versa, we may assume that the MAREoS would first degrade the data, (minimally) decreasing the accuracy. And afterward, it would remove simple EoS, reducing the remaining accuracy ( Figure 2 , A2 ). Data degradation may seem negligible in this example, but it might be relevant in others.
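The two-step reasoning above (remove the EoS and then degrade what remains, or vice versa) implies that the remaining accuracy is the product of what survives each step, whatever the order. A minimal Python sketch under that reading, assuming the effectiveness is (average RAC [EoS] - average RAC [true effects]) / (100% - average RAC [true effects]) with all quantities limited to 0-100%; the function names and the guard for total degradation are illustrative assumptions, not the authors' R code:

```python
def clamp(x, lo=0.0, hi=100.0):
    """Limit a percentage to the 0-100% range."""
    return max(lo, min(hi, x))

def eos_removal_effectiveness(racs_eos, racs_true):
    """Adjust the naive bias reduction for data degradation (all in percent)."""
    avg_eos = clamp(sum(racs_eos) / len(racs_eos))
    avg_true = clamp(sum(racs_true) / len(racs_true))
    if avg_true >= 100.0:
        # Assumed guard: with total degradation nothing can be attributed
        # to EoS removal.
        return 0.0
    return clamp(100.0 * (avg_eos - avg_true) / (100.0 - avg_true))

# Example from the text: RAC [simple EoS] = 100%, RAC [simple true effects] = 0.14%.
effectiveness_simple = eos_removal_effectiveness([100.0], [0.14])

# A MAREoS that does not reduce the EoS-inflated accuracy but slightly
# degrades the data (RACs of -5% and -13% vs. 3% and 5%).
effectiveness_none = eos_removal_effectiveness([-5.0, -13.0], [3.0, 5.0])

print(effectiveness_simple, effectiveness_none)
```

Clamping the averages before combining them keeps a slightly negative or above-100% RAC, which can arise by chance, from producing an effectiveness outside 0-100%.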

Measurement of the EoS-removal effectiveness of a MAREoS for complex EoS
The overall strategy for measuring how well a MAREoS removes complex EoS is the same as for measuring how well a MAREoS removes simple EoS. However, datasets with complex EoS may also include simple true effects or simple EoS, and the MAREoS may remove both simple and complex EoS. Therefore, we may wish to subtract the part of the BAC attributable to simple effects as follows:

$$\mathrm{BAC}_{[\text{complex,corrected}]} = \mathrm{BAC}_{[\text{complex}]} - \left(\mathrm{BAC}_{[\text{simple}]} - 50\%\right)$$

where BAC [complex] is the BAC obtained with the complex machine-learning algorithm (e.g., a lasso with first-order interactions), BAC [simple] is the BAC obtained with the simple machine-learning algorithm (i.e., the lasso without interactions), and BAC [complex,corrected] is the BAC [complex] after "subtracting" the simple effects. One way to see this subtraction from a different perspective is to decompose the accuracy of the complex machine-learning algorithm. Imagine that we have a sample of 100 patients, half responders. Suppose we predict randomly, simply tossing a coin. In that case, we will guess correctly by chance half of the time, so we expect to predict about 50 individuals correctly. Now imagine that a simple machine-learning algorithm correctly predicts 70 individuals. We can decompose this number as 50 + 20, with the 50 corresponding to the number of individuals that we can correctly predict tossing a coin and the 20 corresponding to the extra accuracy provided by the simple effects detected by the simple machine-learning algorithm. Finally, imagine that a complex machine-learning algorithm correctly predicts 85 individuals. Again, we can decompose this number as 50 + 20 + 15, with the 20 corresponding to the extra accuracy provided by the simple effects detected by the complex machine-learning algorithm and the 15 corresponding to the extra accuracy offered by the complex effects beyond the simple effects.
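The subtraction of simple effects can be written as a one-line function. The sketch below, with an illustrative function name, applies it to the 100-patient decomposition above and to the 80% and 63% BACs reported later for "Interaction EoS #1" without MAREoS.

```python
def corrected_complex_bac(bac_complex, bac_simple, chance=50.0):
    """Remove from the complex-algorithm BAC the part explained by simple effects."""
    return bac_complex - (bac_simple - chance)

# Decomposition example: 85 = 50 (chance) + 20 (simple) + 15 (complex).
decomposition = corrected_complex_bac(85.0, 70.0)

# "Interaction EoS #1" without MAREoS: 80% with interactions, 63% without.
interaction_eos_1 = corrected_complex_bac(80.0, 63.0)

print(decomposition, interaction_eos_1)
```

The corrected BAC keeps the 50% chance level as its baseline, so it can be fed directly into the RAC calculation.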

Example: ComBat
To exemplify how to measure the EoS-removal effectiveness of a MAREoS, we applied ComBat to the public datasets provided. We then conducted the calculations needed to measure the EoS-removal effectiveness.
First, we downloaded the public datasets from https://www.imardgroup.com/mareos-benchmark/. Each dataset is a table with the following columns: the identification of the simulated individual, the MRI data (cortical thickness, cortical surface area, or subcortical volumes), the site, the values of the two clinical covariates, and the distribution in folds. We also downloaded the Johnson-Fortin-Radua version of the ComBat MAREoS (Fortin et al., 2018; Johnson et al., 2007; Radua et al., 2020) from https://enigma.ini.usc.edu/protocols/statistical-protocols/. The following analyses refer to one dataset. For fold 1, we defined the training subset as the individuals in folds 2-10 and the test subset as the individuals in fold 1. In the training subset: a) we fitted the ComBat model with the function "combat_fit"; b) we removed the EoS according to the ComBat model with the function "combat_apply"; c) we fitted regressions to estimate the linear effects of the two clinical covariates (a separate linear regression for each brain region); d) we removed the linear effects of the clinical covariates according to these linear regressions; and e) we fitted the lasso algorithm (without interactions when assessing simple effects, or with first-order interactions when assessing complex effects including first-order interactions). Afterward, in the test subset: a) we removed the EoS according to the ComBat model (fitted with the training subset) with the function "combat_apply"; b) we removed the linear effects of the clinical covariates according to the linear regressions (fitted with the training subset); and c) we applied the lasso algorithm (fitted with the training subset) to predict the individual responses. After repeating the same procedure for folds 2-10, we had predicted the response in all individuals. We then proceeded to calculate the sensitivity, specificity, and BAC. Finally, we combined the BACs with and without ComBat to calculate the RAC.
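The key point of this procedure is that every model is fitted with the training subset only and then applied to both subsets. The Python sketch below illustrates that pattern with a deliberately simple stand-in for the MAREoS (subtracting per-site training means of a single feature). It is not ComBat, it omits the covariate regressions and the lasso, and all names and data are hypothetical.

```python
from collections import defaultdict

def fit_site_means(rows):
    """Stand-in for fitting a MAREoS: per-site mean of one feature (NOT ComBat)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r["site"]] += r["value"]
        counts[r["site"]] += 1
    return {site: sums[site] / counts[site] for site in sums}

def apply_site_means(rows, site_means):
    """Stand-in for applying a fitted MAREoS: subtract the training site mean."""
    return [{**r, "value": r["value"] - site_means[r["site"]]} for r in rows]

def cross_validated_harmonization(rows, n_folds=10):
    """Yield (fold, train, test) with the site model fitted on training data only."""
    for fold in range(1, n_folds + 1):
        train = [r for r in rows if r["fold"] != fold]
        test = [r for r in rows if r["fold"] == fold]
        model = fit_site_means(train)  # never sees the test subset
        yield fold, apply_site_means(train, model), apply_site_means(test, model)

# Tiny hypothetical dataset: two sites, two folds, one MRI feature.
rows = [
    {"id": 1, "site": "A", "fold": 1, "value": 3.0},
    {"id": 2, "site": "A", "fold": 2, "value": 5.0},
    {"id": 3, "site": "B", "fold": 1, "value": 1.0},
    {"id": 4, "site": "B", "fold": 2, "value": 2.0},
]
results = list(cross_validated_harmonization(rows, n_folds=2))
fold, train, test = results[0]

print(fold, [r["value"] for r in train], [r["value"] for r in test])
```

Fitting on the training subset alone avoids leaking information from the test individuals into the harmonization, which would otherwise bias the estimated BAC.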
We provide the R scripts to conduct such calculations at https://www.imardgroup.com/mareos-benchmark/. After conducting these analyses for each "Simple" dataset (Table 1), we had a RAC for each of the "Simple EoS" datasets (100% and 107%), which we averaged (and limited to 0-100%) to obtain an average RAC [simple EoS] of 100%. Similarly, we had a RAC for each of the "Simple true effects" datasets (0% and 0%), which we averaged to obtain an average RAC [simple true effects] of 0%. Finally, we calculated the EoS-removal effectiveness (100%). Therefore, we should conclude that ComBat entirely removes the bias related to simple EoS (100%) and has negligible data degradation (0%).
The results were nearly identical when we used a lasso algorithm without interactions for the "Interaction" datasets (Table 2). The RACs for the "Interaction EoS" datasets were 100% and 100% (average RAC [simple EoS] = 100%), and the RACs for the "Interaction true effects" datasets were -1% and -4% (average RAC [simple true effects] = 0%), leading to EoS-removal effectiveness = 100%. Therefore, we should conclude again that ComBat entirely removes the bias related to simple EoS (100%) and has negligible data degradation (0%). These datasets had complex effects (e.g., the multiplication of pairs of brain regions differed between groups). However, these effects are not detectable by the lasso algorithm without interactions; thus, they did not influence these calculations.
The results were very different when we analyzed the same "Interaction" datasets using a lasso algorithm with first-order interactions (Table 3). First, all BACs were substantially higher (e.g., 80% instead of 63% for "Interaction EoS #1" without MAREoS). This increase is because this machine-learning algorithm could detect the first-order interactions present in these datasets. However, we must highlight here that, as we saw in the previous paragraph, these datasets also included simple effects, which we had to subtract before specifically studying the complex effects. For instance, for "Interaction EoS #1" without MAREoS, BAC was 80%, but we subtracted 13% (i.e., the BAC of the simple algorithm, 63%, minus 50%), so the corrected BAC was 80% - 13% = 67%. Once we corrected all BACs, we proceeded as before. The RACs for the "Interaction EoS" datasets were -5% and -13% (average RAC [complex EoS] = 0%), and the RACs for the "Interaction true effects" datasets were 3% and 5% (average RAC [complex true effects] = 4%), leading to EoS-removal effectiveness = 0%. Therefore, we should conclude that ComBat does not remove the bias related to complex EoS (0%) and may show minor data degradation (4%).

Other machine-learning algorithms
We repeated the above calculations with standard random forest (Liaw and Wiener, 2002), support vector machine (Meyer et al., 2021), and Gaussian processes (Karatzoglou et al., 2004) algorithms to provide insights on the use of MAREoS with these machine-learning algorithms. We used the default options of the "randomForest", "e1071", and "kernlab" R packages, which involve radial basis function kernels for support vector machine and Gaussian processes. We again provide the R code to conduct such calculations at https://www.imardgroup.com/mareos-benchmark/.
With the "simple" datasets, BACs, RACs, and simple EoS-removal effectiveness were similar when using lasso, random forest, support vector machine, or Gaussian processes algorithms (Table 4). There were differences between algorithms, but they were small and likely due to chance.
The analysis of the "interaction" datasets showed that the random forest (and, to a lesser extent, the support vector machine) algorithm detects the complex effects in these datasets (Table 5). In contrast, the Gaussian processes algorithm only detects some. We thus repeated the calculations of complex EoS-removal effectiveness for the random forest and support vector machine algorithms. Table 6 shows that random forest and support vector machine algorithms yielded RAC [complex EoS] substantially different from the 0% calculated using the lasso algorithm with first-order interactions. As introduced earlier, we assumed that lasso with first-order interactions could detect only one type of complex EoS: those based on first-order interactions. Thus, a RAC [complex EoS] of 0% meant that ComBat does not remove these complex EoS. However, RAC [complex EoS] were 34-52% for random forest and support vector machine algorithms. Therefore, we should conclude that random forest and support vector machine algorithms detect a mixture of complex EoS, some of which ComBat can remove and others it cannot.

Benchmark website
For machine-learning researchers, the website (https://www.imardgroup.com/mareos-benchmark/) includes information about the types of effects detectable by different machine-learning algorithms and the effectiveness of MAREoS in removing different types of EoS. See Figure 3 for a diagram of the steps to choose an appropriate MAREoS for a specific study. Note that the numbers on the website may be slightly different from those in the manuscript because the former are based on a parallel collection of datasets for which the variable "response" is not public. We created the latter datasets to keep objective ranks of the effectiveness of the different MAREoS for different types of EoS.

Table 3
Example of measuring the "complex effects of the site (EoS) including interactions"-removal effectiveness for ComBat using the "interaction" datasets. Average RACs and EoS-removal effectiveness are limited to 0-100%. These numbers may differ slightly from those reported at https://www.imardgroup.com/mareos-benchmark/ because the latter are based on a parallel collection of datasets for which the variable "response" is not public.

Table 4
Alternative measurement with standard random forests (RF), support vector machine (SVM), and Gaussian processes (GP) algorithms of the simple effects of the site (EoS)-removal effectiveness for ComBat using the "simple" datasets. MAREoS: Method Aiming to Remove the EoS. (a) Average RACs and EoS-removal effectiveness are limited to 0-100%. These numbers may differ slightly from those reported at https://www.imardgroup.com/mareos-benchmark/ because the latter are based on a parallel collection of datasets for which the variable "response" is not public.

Table 5
Detection of complex effects including interactions by standard random forests (RF), support vector machine (SVM), and Gaussian processes (GP) algorithms in the "interaction true effects" datasets.
Algorithms: lasso with first-order interactions (LFOI), random forests (RF), support vector machine (SVM), and Gaussian processes (GP). (a) These numbers may differ slightly from those reported at https://www.imardgroup.com/mareos-benchmark/ because the latter are based on a parallel collection of datasets for which the variable "response" is not public.
Table 6
Alternative measurement with random forests (RF) and standard support vector machine (SVM) algorithms of the "complex effects of the site (EoS) including interactions"-removal effectiveness for ComBat using the "interaction" datasets. Average RACs and EoS-removal effectiveness are limited to 0-100%. These numbers may differ slightly from those reported at https://www.imardgroup.com/mareos-benchmark/ because the latter are based on a parallel collection of datasets for which the variable "response" is not public.

Figure 3.
Steps to choose an appropriate MAREoS for a specific study (for machine-learning researchers).

Developers wanting to add a MAREoS to the website should download these datasets and conduct calculations analogous to the ones described in the example, except for not fitting/applying the simple machine-learning algorithm because the response to the treatment is non-public. Instead, they must save the pre-processed MRI data of the training and test subsets of each fold, along with an identification of these sets (e.g., "train_fold1", "test_fold1", "train_fold2", etcetera). For instance, in the first fold of the cross-validation, users should: a) find the EoS and the linear effects of the covariates using individuals labeled to be in folds 2 to 10; b) remove these effects from these individuals and save the resulting data with the set identification "train_fold1"; and c) remove these effects from individuals labeled to be in fold 1 and save the resulting data with the set identification "test_fold1". See Figure 4 for a diagram of the steps to add a MAREoS to the extendable benchmark website.
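The mapping between fold labels and set identifications can be sketched as follows in Python. Only the label pattern ("train_fold1", "test_fold1", ...) comes from the text; the function name, the use of row indices, and the three-fold toy example are assumptions, and the exact file format expected by the website is not specified here.

```python
def set_identifications(folds, n_folds=10):
    """Map each set label to the row indices of the individuals it contains."""
    sets = {}
    for k in range(1, n_folds + 1):
        # Training subset of fold k: everyone NOT labeled as fold k.
        sets[f"train_fold{k}"] = [i for i, f in enumerate(folds) if f != k]
        # Test subset of fold k: everyone labeled as fold k.
        sets[f"test_fold{k}"] = [i for i, f in enumerate(folds) if f == k]
    return sets

# Hypothetical fold labels for six individuals and three folds.
folds = [1, 2, 3, 1, 2, 3]
sets = set_identifications(folds, n_folds=3)

print(sorted(sets))
print(sets["train_fold1"], sets["test_fold1"])
```

Each individual therefore appears once as a test case and in the training subset of every other fold, so the pre-processed data must be saved separately for each set.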
Developers wanting to add a new type of EoS to the website may contact us directly. We also welcome researchers and developers wishing to add the effectiveness of an already investigated MAREoS using an alternative machine-learning algorithm.

Discussion
This work presents a strategy to measure the EoS-removal effectiveness of a MAREoS in multisite machine-learning studies. We provide datasets with only simple true effects, datasets with only simple EoS, datasets with complex true effects, datasets with complex EoS, and formulas to measure the EoS-removal effectiveness from the BAC obtained when fitting prediction models in these datasets. We also provide a benchmark website to rank the EoS-removal effectiveness of the different MAREoS and a list of the types of EoS that may bias accuracy for different machine-learning algorithms. For instance, we report that ComBat removes all simple EoS as needed for predictions based on simple lasso algorithms while it leaves residual complex EoS that may bias the predictions based on standard support vector machine algorithms. In other words, the extendable benchmark website provides the types of EoS that researchers should remove for a given machine-learning algorithm and the effectiveness of each MAREoS for removing each type of EoS.
The most important limitation of the present work is that it encompasses only one type of complex EoS: those due to first-order interactions between brain regions. Other complex EoS could potentially derive from higher-order interactions or other regional relationships. However, we believe that the investigation of new types of complex EoS, along with the (challenging) development of methods to measure them (e.g., creating specific datasets), should be the work of future studies. We created an extendable benchmark website for this reason. Another potential limitation of this work is that we only created binary outcomes (response vs. no response). We chose this distribution for the simplicity of its definition of accuracy (percentage of correct predictions). The definitions of accuracy in other distributions may be less straightforward. For instance, for continuous outcomes, there may be several metrics (absolute or squared difference between observed and predicted, correlation between observed and predicted, etcetera). However, MAREoS remove differences between sites independently of the outcomes; thus, the benchmarking should be similar using binary or other outcomes.
We want to finish by highlighting other exciting approaches to handling EoS. We believe that, when possible, researchers should use them along with our approach to provide richer complementary insights. For instance, a small set of individuals may volunteer to be scanned in the different devices used in a multisite MRI study (Kurokawa et al., 2021; Noble et al., 2017; Tanaka et al., 2021; Tong et al., 2020). The data from these individuals, known as "traveling subjects", have two valuable characteristics. First, they are real data and thus very likely have hidden features that our datasets may not have. Indeed, some data indicate that the traveling-subject approach outperforms ComBat. Second, these data allow an excellent study of the differences between MRI devices. In this scenario, differences between sites unrelated to MRI devices should be negligible. Together, these characteristics enable the development of promising deep learning-based MAREoS (Tian et al., 2022). However, the traveling-subject approach also has its drawbacks. For instance, it can only be done prospectively (i.e., it is not helpful for mega-studies based on previously acquired data, such as those in the ENIGMA consortium (Dima et al., 2022)). Also, it may be costly (it requires traveling), so the set of subjects is usually tiny, though new projects such as the BMB-HBM aim to overcome this hurdle.

Disclosures
Dr. Vieta has received grants and served as a consultant, advisor, or CME speaker for the following entities (work unrelated to the topic of this manuscript): AB-Biotics, Abbott, Allergan, Angelini, Dainippon Sumitomo Pharma, Galenica, Janssen, Lundbeck, Novartis, Otsuka, Sage, Sanofi-Aventis, and Takeda. Dr. Llufriu has received compensation for consulting services and speaker honoraria from Biogen Idec, Novartis, TEVA, Genzyme, Sanofi, and Merck.

Data and code availability statement
The data and scripts used in this study are available at https://www.imardgroup.com/mareos-benchmark/