Harmonization of resting-state functional MRI data across multiple imaging sites via the separation of site differences into sampling bias and measurement bias

When collecting large amounts of neuroimaging data associated with psychiatric disorders, images must be acquired from multiple sites because of the limited capacity of a single site. However, site differences represent a barrier when acquiring multisite neuroimaging data. We utilized a traveling-subject dataset in conjunction with a multisite, multidisorder dataset to demonstrate that site differences are composed of biological sampling bias and engineering measurement bias. The effects of both bias types on resting-state functional MRI connectivity, measured as pairwise correlations, were greater than or equal to those of psychiatric disorders. Furthermore, our findings indicated that each site can sample only from a subpopulation of participants. This result suggests that it is essential to collect neuroimaging data from as many sites as possible to appropriately estimate the distribution of the grand population. Finally, we developed a novel harmonization method that removed only the measurement bias by using a traveling-subject dataset, reducing the measurement bias by 29% and improving the signal-to-noise ratio by 40%. Our results provide fundamental knowledge regarding site effects, which is important for future research using multisite, multidisorder resting-state functional MRI data.

[https://bicr.atr.jp/decnefpro/]) [3][4][5]. When collecting large amounts of data associated with psychiatric disorders, it is necessary to acquire images from multiple sites because it is nearly impossible for a single site to collect a large amount of neuroimaging data from many participants (Connectomes Related to Human Disease [CRHD] [https://www.humanconnectome.org/disease-studies], Autism Brain Imaging Data Exchange [ABIDE], and SRPBS) [2,[6][7][8]. In 2013, the Japan Agency for Medical Research and Development (AMED) organized the Decoded Neurofeedback (DecNef) Project. The project determined a unified imaging protocol on 28 February 2014 (https://bicr.atr.jp/rsfmri-protocol-2/) and has collected multisite resting-state functional magnetic resonance imaging (rs-fMRI) data using 14 scanners across eight research institutes for the last 5 y. The collected dataset encompasses 2,239 participants and five disorders and is publicly shared through the SRPBS multisite multidisorder database (https://bicr-resource.atr.jp/decnefpro/). This project has enabled the identification of resting-state functional connectivity MRI (rs-fcMRI)-based biomarkers of several psychiatric disorders that can be generalized to completely independent cohorts [2,[8][9][10]. However, a multisite dataset with multiple disorders raises difficult problems that are not present in a single site-based dataset (e.g., HCP and UK Biobank).
That is, our experience with the SRPBS database demonstrated difficulties arising from differences in scanner type, imaging protocol, and patient demographics [10][11][12][13], even when a unified protocol was determined. Moreover, unpredictable differences in participant populations can often exist between sites. Therefore, researchers must work with heterogeneous neuroimaging data. In particular, site differences represent a barrier when extracting disease factors by applying machine-learning techniques to such heterogeneous data [14] because disease factors tend to be confounded with site factors [2,8,[10][11][12][13]15]. This confounding occurs because a single site (or hospital) is apt to sample only a few types of psychiatric disorders (e.g., primarily schizophrenia [SCZ] from site A and autism spectrum disorder [ASD] from site B). Although robust generalization across sites could be possible as long as the pattern of the disease factors is sufficiently different from the pattern due to the site differences [11], this depends on the dataset and the type of disease. To properly manage such heterogeneous data, it is important to harmonize them between sites [16][17][18][19]. Moreover, a deeper understanding of these site differences is essential for efficient harmonization of the data.
Site differences consist of two types of bias: engineering bias (measurement bias) and biological bias (sampling bias). Measurement bias includes differences in the properties of MRI scanners, such as imaging variables, field strength, MRI manufacturer, and scanner model, whereas sampling bias refers to differences in participant groups between sites. In this study, we used the word "bias" to indicate a systematic shift from the global population at a given site or with given imaging variables. Previous studies have investigated the effect of measurement bias on resting-state functional connectivity by using a traveling-subject design [20], wherein multiple participants travel to multiple sites to assess measurement bias [7]. By contrast, researchers to date have only speculated about sampling bias. For example, differences in the clinical characteristics of patients examined at different sites are presumed to underlie the stagnant accuracy of certain biomarkers, even after combining data from multiple sites [12]. Furthermore, to the best of our knowledge, no study has mathematically defined sampling bias or quantitatively analyzed its effect size, likely because decomposing site differences into measurement bias and sampling bias is a complex process. To achieve this aim, we combined a separate traveling-subject rs-fMRI dataset with the SRPBS multidisorder dataset. Simultaneous analysis of these datasets enabled us to divide site differences into measurement bias and sampling bias and to quantitatively compare their effect sizes on resting-state functional connectivity with those of psychiatric disorders.
Furthermore, our detailed analysis of measurement and sampling biases enabled us to investigate the origin of each bias in multisite datasets for the first time. For measurement bias, we quantitatively compared the magnitudes of the effects of different imaging variables, fMRI manufacturers, and numbers of channels per coil in each fMRI scanner. We further examined two alternative hypotheses regarding the mechanisms underlying sampling bias. The first hypothesis assumes that each site samples subjects from a common population; in this situation, sampling bias occurs because of the random sampling of subjects, which results in incidental differences in patients' characteristics among the sites. The second hypothesis assumes that each site samples subjects from a different subpopulation; in this situation, sampling bias occurs because of sampling from subpopulations with different characteristics. For example, assume multiple sites plan to collect data from the same population of patients with major depressive disorder (MDD). Subtypes of MDD exist within the population, such as atypical depression and melancholic depression [21,22]; therefore, one subpopulation may contain a large proportion of patients with atypical depression, whereas another subpopulation may contain a large proportion of patients with melancholic depression. Thus, atypical depression may be more frequent among patients at site A, whereas melancholic depression may be more frequent among patients at site B. The basic protocol for collecting large-scale datasets differs between these two hypotheses, so it is necessary to determine which hypothesis most appropriately reflects the characteristics of the SRPBS dataset. Under the first hypothesis, one would simply need to collect data from a large number of individuals, even with a small number of sites. Under the second, a larger number of sites would be required to obtain truly representative data.
To overcome the limitations associated with site differences, we developed a novel harmonization method that enabled us to subtract only the measurement bias by using a traveling-subject dataset. We investigated the extent to which our proposed method could reduce measurement bias and improve the signal-to-noise ratio, and we compared its performance with those of other commonly used harmonization methods. All data utilized in this study can be downloaded publicly from the DecNef Project Brain Data Repository at https://bicr-resource.atr.jp/decnefpro/.
SRPBS multidisorder dataset. This dataset included patients with five different disorders and healthy controls (HCs) who were examined at nine sites belonging to eight research institutions. A total of 805 participants were included: 482 HCs from nine sites, 161 patients with MDD from five sites, 49 patients with ASD from one site, 65 patients with obsessive-compulsive disorder (OCD) from one site, and 48 patients with SCZ from three sites (Table 1). The rs-fMRI data were acquired using a unified imaging protocol at all but three sites (Table 2; https://bicr.atr.jp/rs-fmri-protocol-2/). Site differences in this dataset included both measurement and sampling biases (Fig 1A). For bias estimation, we only used data obtained using the unified protocol. (Patients with OCD were not scanned using this unified protocol; therefore, a disorder factor could not be estimated for OCD.)

Traveling-subject dataset. We acquired a traveling-subject dataset to estimate measurement bias across sites in the SRPBS dataset. Nine healthy participants (all men; age range: 24-32 y; mean age: 27 ± 2.6 y) were scanned at each of 12 sites, which included the nine sites in the SRPBS dataset, producing a total of 411 scan sessions (see "Participants" in the Methods section). Although we attempted to acquire this dataset using the same imaging protocol as that of the SRPBS multidisorder dataset, there were some differences in the imaging protocol across sites because of limitations in variable settings or the scanning conventions of each site (S1 Table). There were two phase-encoding directions (posterior to anterior [P→A] and anterior to posterior [A→P]), three MRI manufacturers (Siemens, GE, and Philips), four numbers of channels per coil (8, 12, 24, and 32), and seven scanner types (TimTrio, Verio, Skyra, Spectra, MR750W, SignaHDxt, and Achieva). Site differences in this dataset included only measurement bias because the same nine participants were scanned across the 12 sites (Fig 1B).

Calculation of rs-fMRI functional connectivity
We computed region of interest (ROI)-based pairwise correlations as a measure of functional connectivity. For each participant, the temporal correlations of rs-fMRI blood-oxygen-level dependent (BOLD) signals between pairs of ROIs were computed after averaging the voxelwise BOLD signals in each ROI. Several candidate measures of functional connectivity exist, such as the tangent method and partial correlation [11,23]; however, we used Pearson's correlation coefficients because they have been the most commonly used measure in previous studies. Functional connectivity was defined based on a functional brain atlas consisting of 268 nodes (regions) covering the whole brain, which has been widely utilized in previous studies [20,[24][25][26]. The Fisher's z-transformed Pearson's correlation coefficients between the preprocessed BOLD signal time courses of each possible pair of nodes were calculated and used to construct 268 × 268 symmetrical connectivity matrices in which each element represents a connection strength, or edge, between two nodes. We used the 35,778 connectivity values [(268 × 267)/2] of the lower triangular part of the connectivity matrix. To briefly investigate any site effect on functional connectivity, we applied a one-way ANOVA with Site (9 sites) as a factor to the functional connections in the SRPBS multidisorder dataset and recorded the number of significant differences between sites. We set the threshold to p < 0.05 after Bonferroni correction. As a result, >30% of all connections (11,888/35,778) were significantly different between sites.
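The pipeline above (ROI-averaged BOLD signals → Pearson correlation → Fisher z-transform → lower-triangle extraction) can be sketched as follows. This is a minimal illustration with synthetic data, not the authors' preprocessing code; `connectivity_vector` is a hypothetical helper name.

```python
import numpy as np

def connectivity_vector(bold, n_rois=268):
    """Lower-triangular Fisher z-transformed correlations.

    bold: (n_timepoints, n_rois) array of ROI-averaged BOLD signals.
    Returns a vector of n_rois * (n_rois - 1) / 2 connectivity values.
    """
    r = np.corrcoef(bold.T)                     # (n_rois, n_rois) Pearson matrix
    np.clip(r, -0.999999, 0.999999, out=r)      # guard arctanh against |r| = 1
    z = np.arctanh(r)                           # Fisher z-transform
    rows, cols = np.tril_indices(n_rois, k=-1)  # strict lower triangle
    return z[rows, cols]

# A 268-node atlas yields (268 x 267) / 2 = 35,778 edges.
rng = np.random.default_rng(0)
vec = connectivity_vector(rng.standard_normal((150, 268)))
print(vec.shape)  # (35778,)
```

The clipping step is a small practical safeguard: self-correlations are excluded by the strict lower triangle, but near-perfect correlations between distinct ROIs would otherwise send `arctanh` toward infinity.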

Bias estimation
To quantitatively investigate the site differences in the rs-fcMRI data, we identified measurement biases, sampling biases, and disorder factors. We defined the measurement bias for each site as the deviation of the connectivity value for each functional connection from its average across all sites. We assumed that the sampling biases of HCs and patients with psychiatric disorders differed from one another. Therefore, we calculated the sampling biases for each site separately for HCs and for patients with each disorder. Disorder factors were defined as deviations from the HC values. Sampling biases were estimated for patients with MDD and SCZ because only these patients were sampled at multiple sites. Disorder factors were estimated for MDD, SCZ, and ASD because patients with OCD were not scanned using the unified protocol. It is difficult to separate site differences into measurement and sampling biases using the SRPBS multidisorder dataset alone because these two types of bias covaried across sites: different samples (participants) were scanned using different variables (scanners and imaging protocols). In contrast, the traveling-subject dataset included only measurement bias because the participants were fixed. By combining the traveling-subject dataset with the SRPBS multidisorder dataset, we could simultaneously estimate measurement bias and sampling bias as separate factors. We utilized a constrained linear regression model to assess the effects of both types of bias and disorder factors on functional connectivity, as follows. In the regression model for the SRPBS multidisorder dataset, the connectivity values of each participant were composed of the sum of the average connectivity values across all participants and all sites at baseline, measurement bias, sampling bias, and disorder factors.
The combined effect of participant factors (individual differences) and scan-to-scan variations was regarded as noise. In the regression model for the traveling-subject dataset, the connectivity values of each participant for a specific scan were composed of the sum of the average connectivity values across all participants and all sites, participant factors, and measurement bias. Scan-to-scan variation was regarded as noise. For each participant, we defined the participant factor as the deviation of connectivity values from the average across all participants. We estimated all biases and factors by simultaneously fitting the aforementioned two regression models to the functional connectivity values of the two datasets. For this regression analysis, we used data from participants scanned using the unified imaging protocol in the SRPBS multidisorder dataset and from all participants in the traveling-subject dataset. In summary, each bias and each factor was estimated as a vector whose dimension equals the number of connectivity values (35,778). The vectors included in our further analyses are those for the measurement bias at 12 sites, the sampling bias of HCs at six sites, the sampling bias of patients with MDD at three sites, the sampling bias of patients with SCZ at three sites, the participant factors of nine traveling subjects, and the disorder factors for MDD, SCZ, and ASD. Note that because patients with ASD were scanned at only one site, we could not estimate the sampling bias for ASD; thus, the sampling bias was included in the disorder factor of ASD.
For each connectivity, the regression model can be written as follows:

Connectivity = x_m^T m + x_s_hc^T s_hc + x_s_mdd^T s_mdd + x_s_scz^T s_scz + x_d^T d + x_p^T p + const + e,

where m represents the measurement bias (12 sites × 1), s_hc represents the sampling bias of HCs (6 sites × 1), s_mdd represents the sampling bias of patients with MDD (3 sites × 1), s_scz represents the sampling bias of patients with SCZ (3 sites × 1), d represents the disorder factor (3 × 1), p represents the participant factor (9 traveling subjects × 1), const represents the average functional connectivity value across all participants from all sites, and e ∼ N(0, γ⁻¹) represents noise. Here, x_m, x_s_hc, x_s_mdd, x_s_scz, x_d, and x_p are vectors represented by 1-of-K binary coding (more details are reported in "Estimation of biases and factors" in the Methods section). To eliminate the uncertainty of the constant term, we estimated the measurement bias and each sampling bias by imposing constraints such that their averages across sites would be 0.
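A minimal sketch of how such a constrained regression can be set up, assuming (for illustration only) a reduced model with just a constant term, a measurement bias, and a participant factor fitted to a fully crossed traveling-subject design. The sum-to-zero constraints are imposed softly by appending weighted constraint rows to the design matrix; `fit_biases` and all variable names are hypothetical.

```python
import numpy as np

def fit_biases(y, site_idx, subj_idx, n_sites, n_subj, w=1e3):
    """Fit y ~ const + m[site] + p[subj] by least squares, with soft
    sum-to-zero constraints on the site and participant terms."""
    n = len(y)
    X = np.zeros((n, 1 + n_sites + n_subj))
    X[:, 0] = 1.0                                    # constant term
    X[np.arange(n), 1 + site_idx] = 1.0              # 1-of-K site coding
    X[np.arange(n), 1 + n_sites + subj_idx] = 1.0    # 1-of-K subject coding
    # Constraint rows: sum(m) = 0 and sum(p) = 0 (heavily weighted)
    c_m = np.r_[0.0, np.ones(n_sites), np.zeros(n_subj)] * w
    c_p = np.r_[0.0, np.zeros(n_sites), np.ones(n_subj)] * w
    beta, *_ = np.linalg.lstsq(np.vstack([X, c_m, c_p]),
                               np.r_[y, 0.0, 0.0], rcond=None)
    return beta[0], beta[1:1 + n_sites], beta[1 + n_sites:]

# Fully crossed toy design: 9 subjects scanned at 12 sites, no noise.
rng = np.random.default_rng(0)
n_sites, n_subj = 12, 9
m_true = rng.standard_normal(n_sites); m_true -= m_true.mean()
p_true = rng.standard_normal(n_subj); p_true -= p_true.mean()
site_idx, subj_idx = np.divmod(np.arange(n_sites * n_subj), n_subj)
y = 0.5 + m_true[site_idx] + p_true[subj_idx]
const, m_hat, p_hat = fit_biases(y, site_idx, subj_idx, n_sites, n_subj)
```

Without the constraint rows the design is rank deficient (the constant column equals the sum of the site columns and of the subject columns); the two constraints make the solution unique, which is the role the paper's zero-average constraints play.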

Quantification of site differences
To quantitatively evaluate the magnitude of the effects of measurement and sampling biases on functional connectivity, we compared the magnitudes of both types of bias (m, s_hc, s_mdd, and s_scz) with the magnitudes of the psychiatric disorder factors (d) and participant factors (p). For this purpose, we investigated the magnitude distribution of both biases, as well as the effects of psychiatric disorders and participant factors on functional connectivity, over all 35,778 elements of each 35,778-dimensional vector (see S1 Text, S1A and S1B Fig). To quantitatively summarize the magnitude of the effect of each factor, we calculated the first, second, and third statistical moments of each distribution (Fig 2A). Based on the mean values and the cube roots of the third moments, all distributions could be approximated as bilaterally symmetric with a mean of zero. Thus, distributions with larger square roots of the second moments (standard deviations) affect more connectivities with larger effect sizes (Fig 2B). The standard deviation was largest for the participant factor (0.0662), followed by those for the measurement bias (0.0411), the SCZ factor (0.0377), the MDD factor (0.0328), the ASD factor (0.0297), the sampling bias of HCs (0.0267), the sampling bias of patients with SCZ (0.0217), and the sampling bias of patients with MDD (0.0214). To compare the sizes of the standard deviations of the magnitude distributions between factors and biases, we evaluated the variance of each distribution. All pairs of variances were analyzed using Ansari-Bradley tests.
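The Ansari-Bradley dispersion comparison can be illustrated on synthetic magnitude distributions; the standard deviations 0.0411 and 0.0297 are borrowed from the text purely as example scales, and the sample size is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical magnitude distributions over connectivity elements:
measurement_bias = rng.normal(0, 0.0411, size=2000)  # wider dispersion
disorder_factor = rng.normal(0, 0.0297, size=2000)   # narrower dispersion

# The Ansari-Bradley test compares the scales (dispersions) of two samples
res = stats.ansari(measurement_bias, disorder_factor)
print(res.statistic, res.pvalue)  # a small p-value -> dispersions differ
```

The test is rank-based, so it compares dispersions without assuming Gaussian data, which suits heavy-tailed effect-size distributions.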
Our findings indicated that the variances of the magnitude distributions of 10 of the 12 measurement biases were significantly larger than that of the MDD factor; the variances of the magnitude distributions of seven of the 12 measurement biases were significantly larger than that of the SCZ factor; and the variances of the magnitude distributions of all measurement biases were significantly larger than that of the ASD factor (S6 Table). The largest variance of a magnitude distribution among the sampling biases was significantly larger than that of the MDD factor (S7 Table). The variances of the magnitude distributions of all participant factors were significantly larger than those of all measurement biases (9 participant factors × 12 measurement biases = 108 pairs; W*: mean = -59.80, maximum = -116.81, minimum = -3.69; p-value after Bonferroni correction: maximum = 0.011, minimum = 0, n = 35,778). The standard deviation of the magnitude distribution of the participant factor was approximately twice those of the SCZ, MDD, and ASD factors. To investigate the similarity of the patterns of effects on functional connectivity between the measurement biases and the disorder factors, we next calculated Pearson's correlation coefficients between the 12 measurement biases and the three disorder factors. We found a significant correlation (mean = 0.13 ± 0.08 [1 SD]; one-sample t test applied to the absolute correlation values: t = 9.26, p < 1.0 × 10^-10, df = 35), and the maximum value was |r| = 0.31 (between the MDD factor and the measurement bias of Showa University [SWA]). This result indicates that the patterns of measurement bias on functional connectivity were not sufficiently different from the patterns of the disorder factors in our dataset. Furthermore, to quantitatively verify the magnitude relationships among factors, we calculated and compared contribution sizes to determine the extent to which each bias type and factor explains the variance of the data in our linear model (Fig 2C).
After fitting the model, the b-th connectivity of subject a can be written as follows:

Connectivity_ab = x_m^T m_b + x_s_hc^T s_hc_b + x_s_mdd^T s_mdd_b + x_s_scz^T s_scz_b + x_d^T d_b + x_p^T p_b + const_b + e_ab.

For example, the contribution size of measurement bias (i.e., the first term) in this model was calculated as the mean squared value of that term across subjects and connectivities, normalized by the number of components of the factor:

Contribution size_m = (1 / (N_m × N × N_s)) Σ_a Σ_b (x_m^T m_b)²,

in which N_m represents the number of components for each factor, N represents the number of connectivities, N_s represents the number of subjects, and Contribution size_m represents the magnitude of the contribution size of measurement bias. This formula was used to assess the contribution sizes of the individual factors. The results were consistent with the analysis of the standard deviations (Fig 2A, middle). These results indicated that the effect size of the measurement bias on functional connectivity is smaller than that of the participant factor but mostly larger than those of the disorder factors, which suggests that measurement bias represents a serious limitation in research regarding psychiatric disorders. Furthermore, the effect sizes of the sampling biases were comparable to those of the disorder factors. This finding indicates that sampling bias also represents a major limitation in psychiatric research. In addition, the effect size of the participant factor was much greater than those of the SCZ, MDD, and ASD factors. Such relationships make the development of rs-fcMRI-based classifiers of psychiatric or developmental disorders very challenging: if the disorder factor and site factor are confounded in functional connections, then to develop robust and generalizable classifiers across multiple sites, one must select disorder-specific and site-independent abnormal functional connections [2,[8][9][10]15].

Brain regions contributing most to biases and associated factors
To evaluate the spatial distribution of the two types of bias and all factors across the whole brain, we utilized a previously described visualization method [27] to project connectivity information onto anatomical ROIs. First, we quantified the effect of a bias or factor on each functional connectivity as the median of its absolute values across sites or participants. Thus, we obtained 35,778 values, each of which was associated with one connectivity and represented the effect of a bias or factor on that connectivity. Next, we summarized these effects for each ROI by averaging the values of all connectivities connected with the ROI (see "Spatial characteristics of measurement bias, sampling bias, and each factor in the brain" in the Methods section). The averaged value represents the extent to which the ROI contributes to the effect of a bias or factor. By repeating this procedure for each ROI and color-coding each ROI by its averaged value, we were able to visualize the relative contribution of individual ROIs to each bias or factor across the whole brain (Fig 3). Consistent with the findings of previous studies, the effect of the participant factor was large for several ROIs in the cerebral cortex, especially in the prefrontal cortex, but small in the cerebellum and visual cortex [24]. The effect of measurement bias was large in inferior brain regions, where functional images are differentially distorted depending on the phase-encoding direction [28,29]. Connections involving the medial dorsal nucleus of the thalamus were also heavily affected by the MDD, SCZ, and ASD factors. Effects of the MDD factor were observed in the dorsomedial prefrontal cortex and superior temporal gyrus, in which abnormalities have also been reported in previous studies [22,30,31].
Effects of the SCZ factor were observed in the left inferior parietal lobule, bilateral anterior cingulate cortices, and left middle frontal gyrus, in which abnormalities have been reported in previous studies [32][33][34]. Effects of the ASD factor were observed in the putamen, the medial prefrontal cortex, and the right middle temporal gyrus, in which abnormalities have also been reported in previous studies [10,11,35]. The effect of sampling bias for HCs was large in the inferior parietal lobule and the precuneus, both of which are involved in the default-mode network, as well as in the middle frontal gyrus. Sampling bias for disorders was large in the medial dorsal nucleus of the thalamus, left dorsolateral prefrontal cortex, dorsomedial prefrontal cortex, and cerebellum for MDD [22], and in the prefrontal cortex, cuneus, and cerebellum for SCZ [33].
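The projection step described above (averaging a per-connection effect vector over all connections touching each ROI, then normalizing) can be sketched as follows; `roi_contributions` is a hypothetical helper, and the z-score normalization mirrors the within-factor normalization used for the figure.

```python
import numpy as np

def roi_contributions(effect, n_rois=268):
    """Project a per-connection effect vector back onto ROIs.

    effect: vector of (e.g., median absolute) effects, one per connection,
            ordered as the lower triangle of an n_rois x n_rois matrix.
    Returns a z-scored contribution value per ROI.
    """
    mat = np.zeros((n_rois, n_rois))
    rows, cols = np.tril_indices(n_rois, k=-1)
    mat[rows, cols] = effect
    mat += mat.T                              # symmetrize; diagonal stays zero
    per_roi = mat.sum(axis=0) / (n_rois - 1)  # mean over the 267 edges per ROI
    return (per_roi - per_roi.mean()) / per_roi.std()

rng = np.random.default_rng(0)
z = roi_contributions(rng.random(35778))
print(z.shape)  # (268,)
```

Each ROI's score is the mean effect over its 267 incident edges, so an ROI is highlighted when many of its connections are strongly affected, not when a single edge is extreme.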

Characteristics of measurement bias
We next investigated the characteristics of measurement bias. We first examined whether similarities among the estimated measurement bias vectors for the 12 included sites reflect certain properties of the MRI scanners, such as phase-encoding direction, MRI manufacturer, coil type, and scanner type. We used hierarchical clustering analysis to discover clusters of similar patterns of measurement bias, with correlation as the distance metric. This method has previously been used to distinguish subtypes of MDD based on rs-fcMRI data [22]. The measurement biases of the 12 sites were divided into phase-encoding direction clusters at the first level (Fig 4A). They were divided into fMRI manufacturer clusters at the second level and further divided into coil type clusters, followed by scanner model clusters. Furthermore, we quantitatively verified the magnitude relationships among these factors by using the same model to assess the contribution of each factor (Fig 4B; see "Quantification of site differences" in the Results section or "Analysis of contribution size" in the Methods section). The contribution size was largest for the phase-encoding direction (0.0391), followed by the contribution sizes for fMRI manufacturer (0.0318), coil type (0.0239), and scanner model (0.0152). These findings indicate that the main factor influencing measurement bias is the difference in phase-encoding direction, followed by fMRI manufacturer, coil type, and scanner model, in that order.
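Hierarchical clustering with a correlation distance, as used above, can be sketched on synthetic "measurement bias" vectors in which two groups of sites share distinct underlying patterns (a stand-in for the two phase-encoding directions); all sizes and noise levels are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Hypothetical measurement-bias vectors for 12 sites (sites x connections);
# two shared patterns plus site-specific noise.
pattern_pa, pattern_ap = rng.standard_normal((2, 500))
bias = np.vstack(
    [pattern_pa + 0.3 * rng.standard_normal(500) for _ in range(6)]
    + [pattern_ap + 0.3 * rng.standard_normal(500) for _ in range(6)])

# Agglomerative clustering with correlation distance (1 - Pearson r)
Z = linkage(bias, method="average", metric="correlation")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the first six sites share one label, the last six the other
```

The correlation metric is what makes the dendrogram group sites by the *pattern* of their bias across connections rather than by its overall magnitude.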

Sampling bias results from sampling from different subpopulations
We investigated two alternative models of the mechanisms underlying sampling bias. In the single-population model, which assumes that participants are sampled from a common population (Fig 5A), the functional connectivity values of each participant were generated from a Gaussian distribution with a mean of 0 and a variance of ξ² for each connectivity value. The functional connectivity vector for participant j at site k can then be described as

x_jk = ε_jk,  ε_jk ∼ N(0, ξ²).

In the different-subpopulation model, which assumes that sampling bias occurs partly because participants are sampled from different subpopulations at each site (Fig 5B), the functional connectivity values of each participant were generated from a different independent Gaussian distribution, with a mean of β_k and a variance of ξ², depending on the population of each site. In this situation, the functional connectivity vector for participant j at site k can be described as

x_jk = β_k + ε_jk,  ε_jk ∼ N(0, ξ²).

Here, we assume that the population mean β_k is sampled from an independent Gaussian distribution with a mean of 0 and a variance of σ².
It is necessary to determine which model is more suitable for collecting big data across multiple sites: If the former model is correct, then the data can be used to represent a population by increasing the number of participants, even if the number of sites is small. If the latter model is correct, data should be collected from many sites, as a single site does not represent the true grand population distribution, even with a very large sample size.
For each model, we first investigated how the number of participants at each site determined the effect of sampling bias on functional connectivity. We measured the magnitude of the effect based on the variance of the sampling bias across functional connectivity (see the "Quantification of site differences" section). We used the variance instead of the standard deviation to simplify the statistical analysis, although essentially nothing depends on which value is used. Each model implies a different relationship between the number of participants and the variance of the sampling bias. Therefore, we investigated which model best represents the actual relationships in our data by comparing the corrected Akaike information criterion (AICc) [36,37] and the Bayesian information criterion (BIC). Moreover, we performed leave-one-site-out cross-validation evaluations of predictive performance, in which all but one site was used to construct the model and the variance of the sampling bias was predicted for the remaining site. We then compared the predictive performances of the two models. Our results indicated that the different-subpopulation model provided a better fit for our data than the single-population model (Fig 5C; different-subpopulation model: AICc = -108.80). Furthermore, the predictive performance was significantly higher for the different-subpopulation model than for the single-population model (one-tailed Wilcoxon signed-rank test applied to absolute errors: Z = 1.67, p = 0.0469, n = 6; Fig 5D and 5E). This result indicates that sampling bias is caused not only by random sampling from a single grand population, depending on the number of participants at each site, but also by sampling from different subpopulations.
Sampling biases thus represent a major limitation in attempting to estimate a true single distribution of HC or patient data based on measurements obtained from a finite number of sites and participants.
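The model comparison above can be sketched as follows, fitting the two implied relationships between per-site sample size and sampling-bias variance (single population: var = ξ²/n_k; different subpopulations: var = σ² + ξ²/n_k) to synthetic data generated from the different-subpopulation model. All numbers are illustrative, not the study's values.

```python
import numpy as np

def aicc(rss, k, n):
    """AICc for a Gaussian model, computed from the residual sum of squares."""
    aic = n * np.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)

rng = np.random.default_rng(3)
n_k = np.array([20.0, 35, 50, 80, 120, 160])            # participants per site
xi2, sigma2 = 0.5, 0.02                                  # illustrative variances
var_obs = sigma2 + xi2 / n_k + rng.normal(0, 0.002, 6)   # subpopulation-style data

# Single-population model: var = xi^2 / n_k (one parameter, no intercept)
X_a = (1.0 / n_k)[:, None]
_, rss_a, *_ = np.linalg.lstsq(X_a, var_obs, rcond=None)
# Different-subpopulation model: var = sigma^2 + xi^2 / n_k (two parameters)
X_b = np.c_[np.ones(6), 1.0 / n_k]
_, rss_b, *_ = np.linalg.lstsq(X_b, var_obs, rcond=None)

aicc_single = aicc(rss_a[0], k=1, n=6)
aicc_subpop = aicc(rss_b[0], k=2, n=6)
print(aicc_single, aicc_subpop)
```

The key diagnostic is the intercept: under the single-population model the sampling-bias variance must vanish as n_k grows, whereas a nonzero σ² floor favors the different-subpopulation model even after the AICc penalty for the extra parameter.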

Visualization of the effect of harmonization
Since our results indicated that the patterns of measurement bias on functional connectivity were not sufficiently different from the patterns of the disorder factors, harmonization is needed to properly remove the measurement bias. We therefore developed a novel harmonization method that enabled us to subtract measurement bias alone by using the traveling-subject dataset. Using a constrained linear regression model, we estimated measurement bias separately from sampling bias (see "Bias estimation" in the Methods section). Thus, we could remove the measurement bias from the SRPBS multidisorder dataset (i.e., the traveling-subject method; see "Traveling-subject harmonization" in the Methods section).
To visualize the site differences and disorder effects in the SRPBS multidisorder dataset while maintaining its quantitative properties, we first performed an unsupervised dimension reduction of the raw rs-fcMRI data using principal component analysis (PCA). All participant data in the SRPBS multidisorder dataset were plotted on the two axes defined by the first two principal components (PCs) (Fig 6A, small, light-colored symbols). The first two PCs explained approximately 6% of the variance in the whole dataset (Fig 6B; 3.5% and 2.5% for the first and second PC, respectively). Dark-colored markers indicate the averages of the projected data across HCs at each site and within each psychiatric disorder in the subspace spanned by the two components. In the raw data, there was a clear separation of the Hiroshima University Hospital (HUH) site along PC1, which explained most of the variance in the data. To visualize the effects of the harmonization process, we plotted the data after subtracting only the measurement bias from the SRPBS multidisorder dataset (Fig 6C). In Fig 6C, the differences among sites represent the sampling bias. Relative to the raw data before harmonization, the HUH site moved much closer to the origin (i.e., the grand average) and showed no marked separation from the other sites. This result indicates that the separation of the HUH site observed in Fig 6A was caused by measurement bias, which was removed by the harmonization. Furthermore, harmonization was effective in distinguishing patients from HCs scanned at the same site. Since patients with ASD were scanned only at the SWA site, the averages for these patients (▲) and the HCs (blue •) scanned at this site were projected to nearly identical positions (Fig 6A). However, the two symbols are clearly separated from one another in Fig 6C.
The effect of a psychiatric disorder (ASD) could not be observed in the first two PCs without harmonization but became detectable following the removal of measurement bias. Finally, to visualize the measurement bias in the SRPBS multidisorder dataset, we plotted the data after subtracting only the sampling bias from the SRPBS multidisorder dataset (Fig 6D). Relative to the harmonized data results, the HUH site showed marked separation from the other sites, which was similar to the raw data (Fig 6A).
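As a sketch of this visualization step, the PCA projection and group averaging can be reproduced with plain NumPy; the matrix sizes and three-site labeling below are toy stand-ins, not the SRPBS data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the SRPBS matrix: rows = participants, columns = connectivity values.
# Real data would be roughly 800 x 35,778; here 60 x 500 keeps the sketch fast.
X = rng.standard_normal((60, 500))

# Center across participants, then project onto the first two principal components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                    # participant coordinates on PC1/PC2
explained = S**2 / np.sum(S**2)           # variance ratio per component

# Site averages (the dark markers in Fig 6) are just group means of the scores.
site_labels = rng.integers(0, 3, size=60)
site_means = np.stack([scores[site_labels == k].mean(axis=0) for k in range(3)])
print(scores.shape, site_means.shape, explained[:2])
```

Plotting the light-colored participant scores together with the dark site/disorder means then reproduces the layout of Fig 6A, 6C, and 6D.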

Quantification of the effect of traveling-subject harmonization
To correct differences among sites, three harmonization methods are commonly used: (1) the ComBat method [16,17,19,38], a batch-effect correction tool widely used in genomics; (2) a generalized linear model (GLM) method, in which site differences are estimated without adjusting for biological covariates (diagnosis) [16,18,22]; and (3) an adjusted GLM method, in which site differences are estimated while adjusting for biological covariates [16,18] (see "Harmonization procedures" in the Methods section). However, all of these methods estimate the site difference without separating it into measurement and sampling biases and subtract the entire site difference from the data. Existing harmonization methods therefore risk eliminating not only biologically meaningless measurement bias but also biologically meaningful sampling bias. Here, we tested whether the traveling-subject harmonization method indeed removes only the measurement bias and whether the existing harmonization methods remove the measurement and sampling biases simultaneously. Specifically, we performed 2-fold cross-validation evaluations in which the SRPBS multidisorder dataset was partitioned into two equal-sized subsamples (fold1 and fold2 data) with the same proportions of sites. Between these two subsamples, the measurement bias is common but the sampling bias differs, because the scanners are the same while the participants differ. We estimated the measurement bias (or, for the existing methods, the site difference comprising both measurement and sampling bias) by applying the harmonization methods to the fold1 data and subtracting it from the fold2 data. Next, we estimated the measurement bias remaining in the fold2 data.
For the existing harmonization methods, if the site difference estimated using the fold1 data contained only the measurement bias, then the measurement bias estimated in the fold2 data after subtracting the site difference should be smaller than without this subtraction (raw). To estimate measurement bias and sampling bias separately in both subsamples while avoiding information leakage, we also divided the traveling-subject dataset into two equal-sized subsamples with the same proportions of sites and subjects. We concatenated one subsample of the traveling-subject dataset with the fold1 data to estimate the measurement bias for the traveling-subject method (estimating dataset) and concatenated the other subsample with the fold2 data for testing (testing dataset). That is, in the traveling-subject harmonization method, we estimated the measurement bias using the estimating dataset and removed it from the testing dataset. By contrast, in the other harmonization methods, we estimated the site difference using the fold1 data alone (without the traveling-subject subsample) and removed the site difference from the testing dataset. We then estimated the measurement bias remaining in the testing dataset and evaluated the standard deviation of its magnitude distribution, calculated as described in the "Quantification of site differences" section. To verify whether important information, such as the participant and disorder factors, remained in the testing dataset, we also estimated these factors and calculated, as signal-to-noise ratios, the ratio of the standard deviation of the magnitude distribution of each participant and disorder factor to that of the measurement bias. This procedure was then repeated with the estimating and testing datasets exchanged (see "Twofold cross-validation evaluation procedure" in the Methods section).
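The logic of this evaluation can be illustrated with a small simulation, assuming (for the sketch only) that site differences reduce to a per-site additive offset: a site effect estimated on one fold transfers to the other fold only insofar as it reflects measurement bias shared by both folds:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sites, per_site, n_conn = 4, 40, 100

# Simulate connectivity with a fixed per-site measurement offset plus participant noise.
meas_bias = rng.standard_normal((n_sites, n_conn))
site = np.repeat(np.arange(n_sites), per_site)
X = meas_bias[site] + 0.5 * rng.standard_normal((n_sites * per_site, n_conn))

# 2-fold split with matched site proportions: even/odd participants within each site.
fold1 = np.arange(len(site)) % 2 == 0
fold2 = ~fold1

# "Site difference" estimated on fold1 as each site's mean deviation from the grand mean.
est = np.stack([X[fold1 & (site == k)].mean(axis=0) for k in range(n_sites)])
est -= est.mean(axis=0)

def site_effect_sd(data, labels):
    """SD of per-site mean deviations: a proxy for the residual site difference."""
    means = np.stack([data[labels == k].mean(axis=0) for k in range(n_sites)])
    return (means - means.mean(axis=0)).std()

# Subtracting the fold1 estimate shrinks the site effect in fold2 because the
# measurement bias is shared across folds (the scanners are the same).
before = site_effect_sd(X[fold2], site[fold2])
after = site_effect_sd(X[fold2] - est[site[fold2]], site[fold2])
assert after < before
```

In the real evaluation the residual site effect is the measurement bias re-estimated on the testing dataset, but the fold structure is the same.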
The traveling-subject harmonization method achieved the largest reduction in measurement bias (29% versus 3% in the second-best ComBat method). Moreover, improvements in the signal-to-noise ratios were also highest for our method, both for the participant factor (41% versus 3% in the ComBat method) and for the disorder factor (39% versus 3% in the ComBat method). These results indicate that the traveling-subject harmonization method removed measurement bias and improved the signal-to-noise ratios.

Discussion
In the present study, by acquiring a separate traveling-subject dataset alongside the SRPBS multidisorder dataset, we estimated measurement and sampling biases separately for multiple sites and compared them with the magnitude of the disorder factors. Furthermore, we investigated the origin of each bias in multisite datasets. Finally, to overcome the problem of site differences, we developed a novel harmonization method that enabled us to subtract the measurement bias by using a traveling-subject dataset; this method reduced measurement bias by 29% and improved signal-to-noise ratios by 40%.
Previous studies have focused on measurement bias and compared its magnitude with that of the participant factor using a traveling-subject design in a finger-tapping task fMRI study [39] and in rs-fMRI [20]. These studies revealed that the magnitude of measurement bias is smaller than that of the participant factor. Although we obtained the same result, the novelty of this study lies in separately estimating measurement and sampling biases and then comparing them with the magnitude of the disorder factors. Our findings indicated that measurement bias exerted significantly greater effects than disorder factors, whereas sampling bias was comparable with (or even larger than) the disorder effects (Fig 2). Although our effect-size analysis was univariate, the multivariate perspective is also important because biomarker construction is based on multivariate patterns of functional connectivity. From this viewpoint, if the effect of the measurement bias were orthogonal to that of the psychiatric disorders, robust generalization across sites might be possible, as previous research has suggested [11]. However, the orthogonality between the pattern of the disease factors and that of the measurement bias depends on the dataset and the type of disease. Because our results indicated that the pattern of the measurement bias was not sufficiently different from the patterns of the disorder factors, harmonization was necessary to properly subtract the measurement bias. This finding is important for future studies that collect rs-fMRI data from multiple sites and for the construction of biomarkers of psychiatric disorders based on multisite data in the clinical field. Our results indicate that site differences must be considered when investigating disorder factors using multisite rs-fMRI data. However, we did not control for variations in disease stage and treatment in our dataset.
Although controlling for such heterogeneity may increase the effect size of disorder factors, such control is not feasible when collecting big data from multiple sites. Therefore, it is important to appropriately remove measurement bias from heterogeneous patient data to identify relatively small disorder effects. This issue is essential for investigating the relationships among different psychiatric disorders because disease factors can be confounded by site differences. As previously mentioned, it is common for a single site to sample only a few types of psychiatric disorders (e.g., SCZ only at site A and ASD only at site B). This issue is also essential for constructing biomarkers of psychiatric disorders, because classification of subjects could otherwise be achieved by exploiting nonbiological site differences. In this situation, it is critical to dissociate disease factors from site differences. This dissociation can be accomplished by subtracting only the measurement bias, which is estimated using the traveling-subject dataset.
Our results indicated that measurement bias is primarily influenced by differences in the phase-encoding direction, followed by differences in fMRI manufacturer, coil type, and scanner model (Fig 4). These results are consistent with our finding of large measurement biases in the inferior brain regions (Fig 3), the functional imaging of which is known to be influenced by the phase-encoding direction [28,29]. Previous studies have reported that the effect caused by the difference in the phase-encoding direction can be corrected using the field map obtained at the time of imaging [28,40,41,42]. The field map was acquired in parts of the traveling-subject dataset; therefore, we investigated the effectiveness of field map correction by comparing the effect size of the measurement bias and the participant factor between functional images with and without field map correction (see S2 Text). Our prediction was as follows: if field map correction is effective, the effect of measurement bias will decrease, whereas that of the participant factor will increase following field map correction. Field map correction using SPM12 (http://www.fil.ion.ucl.ac.uk/spm/software/spm12) reduced the effect of measurement bias in the inferior brain regions (whole brain: 3% reduction in the standard deviation of the magnitude distribution of the measurement bias) and increased the effect of the participant factor in the whole brain (3% increase in the standard deviation of the magnitude distribution of the participant factor; S2A and S2B Fig). However, the effect of measurement bias remained large in inferior brain regions (S2A Fig), and hierarchical clustering analysis revealed that the clusters of the phase-encoding direction remained dominant (S2C Fig). These results indicate that, even with field map correction, it is largely impossible to remove the influence of differences in phase-encoding direction on functional connectivity.
Thus, harmonization methods are still necessary to remove the effect of these differences and other measurement-related factors. However, some distortion correction methods have been developed, such as the top-up method and symmetric normalization [43,44], and further studies are required to verify the efficacy of these methods.
Our data supported the different-population model rather than the single-population model (Fig 5), indicating that sampling bias is caused by sampling from different subpopulations at different sites. These findings further suggest that, during big data collection, it is better to sample participants from many imaging sites than to sample many participants from a few imaging sites. These results could be obtained only by combining the SRPBS multidisorder database with a traveling-subject dataset (https://bicr.atr.jp/decnefpro/). To the best of our knowledge, the present study is the first to demonstrate the presence of sampling bias in rs-fcMRI data, the mechanisms underlying this sampling bias, and the effect size of sampling bias on resting-state functional connectivity, which was comparable to that of psychiatric disorders. We analyzed sampling bias among HCs only because the number of sites was too small to conduct an analysis of patients with psychiatric diseases.
We developed a novel harmonization method using a traveling-subject dataset (i.e., the traveling-subject method), which was then compared with existing harmonization methods. Our results demonstrated that the traveling-subject method outperformed both the conventional GLM-based harmonization methods and the ComBat method. The traveling-subject method reduced measurement bias by 29%, compared with 3% for the second-best ComBat method, and improved the signal-to-noise ratios by 40%, compared with 3% for the second-best ComBat method. This result indicates that the traveling-subject dataset helps us to properly estimate measurement bias and harmonize rs-fMRI data across imaging sites toward a wide range of final applications. As one example of such applications, we constructed biomarkers for psychiatric disorders based on rs-fcMRI data, which distinguish between HCs and patients, and a regression model to predict participants' age based on rs-fcMRI data using the SRPBS multidisorder dataset (see "Classifiers for MDD and SCZ, based on the four harmonization methods" in S5 Text and "Regression models of participant age based on the four harmonization methods" in S6 Text). We quantitatively evaluated each harmonization method by its generalization performance on an independent validation dataset that was not included in the SRPBS multidisorder dataset. Although the ComBat method achieved the highest performance for the MDD classifier and the age regression model, it was inferior to the raw method for the SCZ classifier. By contrast, the traveling-subject harmonization method always improved the generalization performance compared with no harmonization. This result also indicates that the pattern of the measurement bias on functional connectivity was not sufficiently different from the patterns of the disorder factors in our dataset.
These results indicate that the traveling-subject dataset also helps with constructing a prediction model based on multisite rs-fMRI data. Future work should improve the traveling-subject method by incorporating a hierarchical model, such as ComBat.
The present study possesses some limitations of note. The accuracy of measurement bias estimation may be improved by further expanding the traveling-subject dataset, either by increasing the number of traveling participants or the number of sessions per site. However, as mentioned in a previous traveling-subject study [20], it is costly and time-consuming to ensure that numerous participants travel to every site involved in big database projects. Thus, the cost-performance tradeoff must be evaluated in practical settings. The numbers of traveling participants and MRI sites used in this study (9 and 12, respectively) were larger than those used in a previous study (8 and 8, respectively) [20], and the number of total sessions in this study (411) was more than three times larger than that in the previous study (128) [20]. Furthermore, although we estimated the measurement bias for each connectivity separately, hierarchical models of the brain (e.g., ComBat) may be more appropriate for improving the estimates of measurement bias. Regarding the number of sites in the data with psychiatric disorders, we believe that the uniqueness of our study lies in its combination of multiple disorders and multiple sites with traveling-subject data rather than in the number of sites for a single disorder. For example, although ABIDE [6,11] collected data from patients with ASD at 17 sites, it differs substantially from our study because it neither used a unified protocol for data collection nor included a traveling-subject dataset. In this study, we collected data using a unified protocol, with HCs from six sites, patients with MDD from three sites, patients with ASD from one site, patients with SCZ from three sites, patients with OCD from one site, and a traveling-subject dataset from 12 sites.
These datasets enabled us to compare the magnitude of the effect between site differences (measurement or sampling bias) and multiple disorder factors, which is the key point of our study. To the best of our knowledge, such a multisite, multidisorder rs-fMRI dataset has not existed so far.
In summary, by acquiring a separate traveling-subject dataset and the SRPBS multidisorder database, we revealed that site differences are composed of biological sampling bias and engineering measurement bias. The effect sizes of these biases on functional connectivity were greater than or equal to those of psychiatric disorders, and the pattern of the measurement bias was not sufficiently different from the patterns of the disorder factors, highlighting the importance of controlling for site differences when investigating psychiatric disorders. Furthermore, using the traveling-subject dataset, we developed a novel traveling-subject method that harmonizes the measurement bias only, by separating the sampling bias from the site differences. Our findings verified that the traveling-subject method outperformed conventional GLM-based harmonization methods and the ComBat method. These results suggest that a traveling-subject dataset can help to harmonize rs-fMRI data across imaging sites.

Ethics statement
All participants in all datasets provided written informed consent. All recruitment procedures and experimental protocols were approved by the institutional review boards of the principal investigators' respective institutions.

Participants
We used two rs-fMRI datasets for all analyses: (1) the SRPBS multidisorder dataset, which encompasses multiple psychiatric disorders, and (2) a traveling-subject dataset. The SRPBS multidisorder dataset contains data for 805 participants (482 HCs from nine sites, 161 patients with MDD from five sites, 49 patients with ASD from one site, 65 patients with OCD from one site, and 48 patients with SCZ from three sites) (Table 1). Data were acquired using a Siemens TimTrio scanner at ATR (ATT), a Siemens Verio scanner at ATR (ATV), a Siemens Verio scanner at the Center of Innovation in Hiroshima University (COI), a GE SignaHDxt scanner at HUH, a Siemens Spectra scanner at Hiroshima Kajikawa Hospital (HKH), a Philips Achieva scanner at KPM, a Siemens Verio scanner at SWA, a Siemens TimTrio scanner at Kyoto University (KUT), and a GE MR750W scanner at UTO. Each participant underwent a single rs-fMRI session of 5-10 min. The rs-fMRI data were acquired using a unified imaging protocol at all but three sites (Table 2; https://bicr.atr.jp/rs-fmri-protocol-2/). During the rs-fMRI scans, participants were instructed as follows, except at one site: "Please relax. Don't sleep. Fixate on the central crosshair mark and do not think about specific things." At the remaining site, participants were instructed to close their eyes rather than fixate on a central crosshair.
In the traveling-subject dataset, nine healthy participants (all male; age range 24-32 y; mean age 27 ± 2.6 y) were scanned at each of 12 sites in the SRPBS consortium, producing a total of 411 scan sessions. Data were acquired at the sites included in the SRPBS multidisorder database (i.e., ATT, ATV, COI, HUH, HKH, KPM, SWA, KUT, and UTO) and three additional sites: Kyoto University (KUS; Siemens Skyra) and Yaesu Clinic 1 and 2 (YC1 and YC2; Philips Achieva) (S1 Table). Each participant underwent three rs-fMRI sessions of 10 min each at nine sites, two sessions of 10 min each at two sites (HKH and HUH), and five cycles (morning, afternoon, next day, next week, and next month) consisting of three 10-min sessions each at a single site (ATT). In the latter situation, one participant underwent four rather than five cycles at the ATT site because of poor physical condition. Thus, a total of 411 sessions were conducted [8 participants × (3 × 9 + 2 × 2 + 5 × 3 × 1) + 1 participant × (3 × 9 + 2 × 2 + 4 × 3 × 1)]. During each rs-fMRI session, participants were instructed to maintain a focus on a fixation point at the center of a screen, remain still and awake, and to think about nothing in particular. For sites that could not use a screen in conjunction with fMRI (HKH and KUS), a seal indicating the fixation point was placed on the inside wall of the MRI gantry. Although we attempted to ensure imaging was performed using the same variables at all sites, there were two phase-encoding directions (P→A and A→P), three MRI manufacturers (Siemens, GE, and Philips), four different numbers of channels per coil (8, 12, 24, and 32), and seven scanner types (TimTrio, Verio, Skyra, Spectra, MR750W, SignaHDxt, and Achieva) (S1 Table).

Preprocessing and calculation of the resting-state functional connectivity matrix
The rs-fMRI data were preprocessed using SPM8 implemented in MATLAB (R2016b; Mathworks, Natick, MA, USA). The first 10 s of data were discarded to allow for T1 equilibration. Preprocessing steps included slice-timing correction, realignment, coregistration, segmentation of T1-weighted structural images, normalization to Montreal Neurological Institute (MNI) space, and spatial smoothing with an isotropic Gaussian kernel of 6 mm full-width at half-maximum. For the analysis of connectivity matrices, ROIs were delineated according to a 268-node gray matter atlas developed to cluster maximally similar voxels [26]. The BOLD signal time courses were extracted from these 268 ROIs. To remove several sources of spurious variance, we used linear regression with 36 regression parameters [45], including six motion parameters and average signals over the whole brain, white matter, and cerebrospinal fluid. Derivatives and quadratic terms were also included for all parameters. A temporal band-pass filter was applied to the time series using a first-order Butterworth filter with a pass band between 0.01 Hz and 0.08 Hz to restrict the analysis to low-frequency fluctuations, which are characteristic of rs-fMRI BOLD activity [45]. Furthermore, to reduce spurious changes in functional connectivity because of head motion, we calculated framewise displacement (FD) and removed volumes with FD > 0.5 mm, as proposed in a previous study [46]. The FD represents head motion between two consecutive volumes as a scalar quantity (the summation of absolute displacements in translation and rotation). Using the aforementioned threshold, 5.4% ± 10.6% of volumes (mean [approximately 13 volumes] ± 1 SD) were removed per 10 min of rs-fMRI scanning (240 volumes) in the traveling-subject dataset, and 6.2% ± 13.2% of volumes were removed per rs-fMRI session in the SRPBS multidisorder dataset.
If the number of volumes removed after scrubbing exceeded the average + 3 SD across participants in each dataset, the participants or sessions were excluded from the analysis. As a result, 14 sessions were removed from the traveling-subject dataset, and 20 participants were removed from the SRPBS multidisorder dataset. Furthermore, we excluded participants for whom we could not calculate functional connectivity at all 35,778 connections, primarily because of the lack of BOLD signals within an ROI. As a result, 99 participants were further removed from the SRPBS multidisorder dataset.
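A minimal sketch of the FD computation and scrubbing step, assuming the common convention of converting rotations to millimeters on a 50-mm sphere (per the cited FD definition [46]); the motion series below is synthetic:

```python
import numpy as np

def framewise_displacement(motion, radius=50.0):
    """FD per volume: sum of absolute frame-to-frame displacements over the six
    realignment parameters (3 translations in mm, 3 rotations in rad converted
    to mm of arc length on a sphere of the given radius)."""
    motion = np.asarray(motion, float).copy()
    motion[:, 3:] *= radius                       # radians -> mm arc length
    diffs = np.abs(np.diff(motion, axis=0))
    return np.concatenate([[0.0], diffs.sum(axis=1)])

rng = np.random.default_rng(2)
trans = np.cumsum(0.05 * rng.standard_normal((240, 3)), axis=0)   # mm, random walk
rot = np.cumsum(0.001 * rng.standard_normal((240, 3)), axis=0)    # rad, random walk
motion = np.hstack([trans, rot])                  # toy realignment parameter series

fd = framewise_displacement(motion)
keep = fd <= 0.5                                  # scrub volumes with FD > 0.5 mm
print(f"removed {100 * (1 - keep.mean()):.1f}% of volumes")
```

Participant-level exclusion then reduces to comparing each participant's removed-volume count against the dataset mean + 3 SD.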

Estimation of biases and factors
The participant factor (p), measurement bias (m), sampling biases (s^hc, s^mdd, s^scz), and psychiatric disorder factor (d) were estimated by fitting the regression model to the functional connectivity values of all participants from the SRPBS multidisorder dataset and the traveling-subject dataset. In this instance, vectors are denoted by lowercase bold letters (e.g., m), and all vectors are assumed to be column vectors. Components of vectors are denoted by subscripts such as m_k. To represent participant characteristics, we used a 1-of-K binary coding scheme in which the target vector (e.g., x_m) for a measurement bias m belonging to site k is a binary vector with all elements equal to zero except for element k, which equals 1. If a participant does not belong to any class, the target vector is a vector with all elements equal to zero. A superscript T denotes the transposition of a matrix or vector, such that x^T represents a row vector. For each connectivity, the regression model can be written as

Connectivity = x_m^T m + x_shc^T s^hc + x_smdd^T s^mdd + x_sscz^T s^scz + x_d^T d + x_p^T p + const + e,  such that Σ_{j=1}^{9} p_j = 0,

in which m represents the measurement bias (12 sites × 1), s^hc represents the sampling bias of HCs (6 sites × 1), s^mdd represents the sampling bias of patients with MDD (3 sites × 1), s^scz represents the sampling bias of patients with SCZ (3 sites × 1), d represents the disorder factor (3 × 1), p represents the participant factor (9 traveling subjects × 1), const represents the average functional connectivity value across all participants from all sites, and e ∼ N(0, γ⁻¹) represents noise. For each functional connectivity value, we estimated the respective parameters using least squares regression with L2 regularization, because the design matrix of the regression model is rank deficient (see S3 Text). We used the "quadprog" function in MATLAB (R2016b) for the estimation.
When regularization was not applied, we observed spurious anticorrelation between the measurement bias and the sampling bias for HCs and spurious correlation between the sampling bias for HCs and the sampling bias for patients with psychiatric disorders (S3A Fig, left). These spurious correlations were also observed in the permutation data in which there were no associations between the site label and data (S3A Fig, right). This finding suggests that the spurious correlations were caused by the rank-deficient property of the design matrix. We tuned the hyperparameter lambda to minimize the absolute mean of these spurious correlations (S3C Fig, left).
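A minimal NumPy sketch of the constrained, regularized fit, reduced to the measurement-bias and participant terms only (the full model above also includes the sampling-bias and disorder terms and was fitted with MATLAB's quadprog); the sum-to-zero constraint on the participant factor is folded directly into the coding:

```python
import numpy as np

rng = np.random.default_rng(3)
n_sites, n_subj = 4, 6

# Toy traveling-subject design: every subject scanned at every site, one connectivity.
site = np.tile(np.arange(n_sites), n_subj)
subj = np.repeat(np.arange(n_subj), n_sites)
y = rng.standard_normal(n_sites * n_subj)

# 1-of-K site columns, plus sum-to-zero participant coding (the last subject's
# effect is -sum of the others), which builds the constraint sum_j p_j = 0
# into the design matrix instead of handling it as a separate constraint.
Xm = np.eye(n_sites)[site]
P = np.eye(n_subj)[subj][:, :-1]
P[subj == n_subj - 1] = -1.0
X = np.hstack([Xm, P, np.ones((len(y), 1))])   # [measurement bias | participants | const]

# Ridge (L2-regularized least squares), mirroring the regularized fit in the paper.
lam = 0.1
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
m_hat, p_hat = beta[:n_sites], beta[n_sites:-1]
p_full = np.append(p_hat, -p_hat.sum())        # recover all subject effects; sums to zero
assert abs(p_full.sum()) < 1e-9
```

In the full model, one such fit is run per connectivity (35,778 times), with λ tuned as described above to suppress the spurious correlations between bias estimates.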

Analysis of contribution size
To quantitatively verify the magnitude relationship among factors, we calculated and compared contribution sizes to determine the extent to which each bias type and factor explains the variance of the data in our linear model (Fig 2C). After fitting the model, the b-th connectivity of subject a can be written as

Connectivity_ab = x_m^T m_b + x_shc^T s_b^hc + x_smdd^T s_b^mdd + x_sscz^T s_b^scz + x_d^T d_b + x_p^T p_b + const_b + e_ab.

For example, the contribution size of the measurement bias (i.e., the first term) in this model was calculated as

Contribution size_m = (1 / N_m) · (1 / (N · N_s)) · Σ_a Σ_b (x_m^T m_b)²,

in which N_m represents the number of components for each factor, N represents the number of connectivities, N_s represents the number of subjects, and Contribution size_m represents the magnitude of the contribution size of the measurement bias. The same formula was used to assess the contribution sizes of the individual factors composing the measurement bias (phase-encoding direction, scanner, coil, and fMRI manufacturer; Fig 4B). We decomposed the measurement bias into these factors and then estimated the relevant parameters, with the other parameters fixed at their previously estimated values.
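A toy sketch of the contribution-size computation; normalizing the mean squared term by the number of components N_m follows our reading of the formula above, and the fitted values here are random stand-ins rather than real model estimates:

```python
import numpy as np

rng = np.random.default_rng(4)
n_subj, n_conn, n_sites = 30, 200, 5

site = rng.integers(0, n_sites, n_subj)
m = rng.standard_normal((n_sites, n_conn))           # fitted measurement-bias term per site
resid = 0.3 * rng.standard_normal((n_subj, n_conn))  # fitted residuals e_ab

# Contribution size of the measurement-bias term: its mean squared value over all
# subjects (a) and connectivities (b), normalized by the number of components N_m.
term = m[site]                                       # x_m^T m evaluated for every subject
contribution_m = (term**2).sum() / (n_sites * n_subj * n_conn)
contribution_e = (resid**2).sum() / (n_subj * n_conn)
assert contribution_m > contribution_e               # bias dominates in this toy setup
```

The same computation, repeated per factor, yields the bars compared in Fig 2C.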

Spatial characteristics of measurement bias, sampling bias, and each factor in the brain
To evaluate the spatial characteristics of each type of bias and each factor in the brain, we calculated the magnitude of the effect on each ROI. First, for each bias and the participant factor, we calculated the median absolute value of the effect on each functional connection across sites or participants; for the disorder factors, we calculated the absolute value of the effect on each connection. Uppercase bold letters (e.g., M) denote the resulting vectors over the number of functional connections:

M = median_k(|m_k|), S^hc = median_k(|s_k^hc|), S^mdd = median_k(|s_k^mdd|), S^scz = median_k(|s_k^scz|), D_2 = |d_2|, D_3 = |d_3|, P = median_j(|p_j|).

We next calculated the magnitude of the effect on each ROI as the average effect over all of its connections with the other ROIs (excluding itself).
Effect_on_ROI_n = (1 / (N_ROI − 1)) · Σ_{v ≠ n} Effect_on_connectivity_{n,v},

in which N_ROI represents the number of ROIs, Effect_on_ROI_n represents the magnitude of the effect on the n-th ROI, and Effect_on_connectivity_{n,v} represents the magnitude of the effect on the connectivity between the n-th and v-th ROIs.
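The ROI-level aggregation amounts to averaging each row of the connection-wise effect matrix while excluding the diagonal; a NumPy sketch with random effect magnitudes standing in for the median-absolute-bias values:

```python
import numpy as np

n_roi = 268
rng = np.random.default_rng(5)

# Symmetric matrix of per-connection effect magnitudes (e.g., median |m_k| per
# connection), with a zero diagonal since self-connections are excluded.
effect = np.abs(rng.standard_normal((n_roi, n_roi)))
effect = (effect + effect.T) / 2
np.fill_diagonal(effect, 0.0)

# Effect on each ROI = mean effect over its N_ROI - 1 connections to other ROIs.
effect_on_roi = effect.sum(axis=1) / (n_roi - 1)
assert effect_on_roi.shape == (n_roi,)
```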

Hierarchical clustering analysis for measurement bias
We calculated Pearson's correlation coefficients between the measurement biases m_k (N × 1, where N is the number of functional connections) for each site k and performed a hierarchical clustering analysis based on the correlation coefficients across measurement biases. To visualize the dendrogram (Fig 4A), we used the "dendrogram," "linkage," and "optimalleaforder" functions in MATLAB (R2016b).
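An approximately equivalent analysis in Python, using SciPy in place of the MATLAB functions; average linkage and a 1 − r correlation distance are assumptions here (the linkage criterion is not stated above), and the bias vectors are random stand-ins:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, optimal_leaf_ordering
from scipy.spatial.distance import squareform

rng = np.random.default_rng(6)
n_sites, n_conn = 12, 500
biases = rng.standard_normal((n_sites, n_conn))  # one measurement-bias vector per site

# Pearson correlation across sites, converted to a distance (1 - r) for clustering.
r = np.corrcoef(biases)
dist = squareform(1 - r, checks=False)           # condensed distance vector
Z = optimal_leaf_ordering(linkage(dist, method="average"), dist)
order = dendrogram(Z, no_plot=True)["leaves"]    # site order along the dendrogram
print(order)
```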

Comparison of models for sampling bias
We investigated whether sampling bias is caused by differences in the number of participants among imaging sites or by sampling from different populations among imaging sites. We constructed two models and investigated which model best explains the sampling bias. In the single-population model, we assumed that participants were sampled from a single population across imaging sites. In the different-population model, we assumed that participants were sampled from different populations among imaging sites. We first derived how the number of participants at each site affects the variance of the sampling bias across connectivity values, as follows.

In the single-population model, we assumed that the functional connectivity values of each participant were generated from an independent Gaussian distribution with a mean of 0 and a variance of ξ² for each connectivity value. The functional connectivity vector for participant j at site k can then be described as

c_kj ∼ N(0, ξ²I).

Let c_k be the vector of functional connectivity at site k averaged across participants. In this model, c_k represents the sampling bias and can be described as

c_k = (1/N_k) Σ_{j=1}^{N_k} c_kj,

in which N_k represents the number of participants at site k. The variance across the functional connectivity values of c_k is

v_k = (1/N) ‖c_k − (1/N)(1^T c_k) 1‖², with E[v_k] = (ξ²/N_k)(1 − 1/N),

in which 1 represents the N × 1 vector of ones and I represents the N × N identity matrix.
Since N equals 35,778 and 1/35,778 is sufficiently smaller than 1, the expected value of the variance across functional connectivity values for the sampling bias can be approximated as

E[v_k] ≈ ξ²/N_k.

In the different-population model, we assumed that the functional connectivity values of each participant were generated from an independent Gaussian distribution with a mean of β_k and a variance of ξ², depending on the population of each site. In this situation, the functional connectivity vector for participant j at site k can be described as

c_kj ∼ N(β_k, ξ²I).

Here, we assume that the population mean β_k is itself sampled from an independent Gaussian distribution with a mean of 0 and a variance of σ²; that is,

β_k ∼ N(0, σ²I).

The vector of functional connectivity for site k averaged across participants can then be described as

c_k = β_k + (1/N_k) Σ_{j=1}^{N_k} ε_kj,  with ε_kj ∼ N(0, ξ²I),

and the variance across the functional connectivity values of c_k satisfies

E[v_k] ≈ ξ²/N_k + σ².

In summary, the variance of the sampling bias across functional connectivity values in each model is expressed in terms of the number of participants at a given site as follows:

single-population model: y_k = −x_k + 2 log₁₀ ξ,
different-population model: y_k = log₁₀(ξ² · 10^(−x_k) + σ²),

in which y_k = log₁₀(v_k), v_k represents the variance across functional connectivity values for s_k^hc, s_k^hc represents the sampling bias of HCs at site k (N × 1, where N is the number of functional connections), x_k = log₁₀(N_k), and N_k represents the number of participants at site k. We estimated the parameters ξ and σ using the MATLAB optimization function "fminunc." To simplify statistical analyses, the sampling bias was estimated from functional connectivity values in which the average across all participants was set to zero.
We aimed to determine which model provided the best explanation of the sampling bias in our data by calculating the AICc (under the assumption of a Gaussian distribution), which is suited to small-sample data [36,37], as well as the BIC:

AICc = n log(RSS/n) + 2q + 2q(q + 1)/(n − q − 1),
BIC = n log(RSS/n) + q log(n),

in which n is the number of sites, RSS = Σ_k (y_k − log₁₀(v̂_k))² is the residual sum of squares, v̂_k represents the estimated variance, and q represents the number of parameters in each model (1 or 2).
To investigate prediction performance, we used leave-one-site-out cross-validation in which we estimated the parameters ξ and σ using data from five sites. The variance of sampling bias was predicted based on the number of participants at the remaining site. This procedure was repeated to predict variance values for sampling bias at all six sites. We then calculated the absolute errors between predicted and actual variances for all sites.
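The two-model fit and comparison can be sketched with SciPy's minimize standing in for fminunc; the AICc and BIC expressions follow the standard Gaussian-residual forms, and the six-site data below are synthetic, generated from the different-population model (ξ = 1, σ² = 0.05):

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: log10 variance of sampling bias (y) vs log10 site size (x) at six sites.
x = np.log10(np.array([20, 35, 50, 80, 120, 200]))
y = np.log10(1.0 / 10**x + 0.05)

def rss_single(params):                       # y_k = -x_k + 2 log10(xi)
    (log_xi,) = params
    return np.sum((y - (-x + 2 * log_xi))**2)

def rss_diff(params):                         # y_k = log10(xi^2 10^-x_k + sigma^2)
    log_xi, log_sig = params
    pred = np.log10(10**(2 * log_xi - x) + 10**(2 * log_sig))
    return np.sum((y - pred)**2)

def aicc_bic(rss, q, n=len(x)):
    """Gaussian-residual AICc and BIC with q free parameters at n sites."""
    aic = n * np.log(rss / n) + 2 * q
    return aic + 2 * q * (q + 1) / (n - q - 1), n * np.log(rss / n) + q * np.log(n)

r1 = minimize(rss_single, [0.0]).fun          # single-population model (q = 1)
r2 = minimize(rss_diff, [0.0, -1.0]).fun      # different-population model (q = 2)
print(aicc_bic(r1, 1), aicc_bic(r2, 2))
```

Leave-one-site-out prediction then repeats the same fit on five sites and evaluates the absolute error of the predicted variance at the held-out site.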

Harmonization procedures
We compared four different harmonization methods for the removal of site differences, as described in the main text.
Traveling-subject harmonization. Measurement biases were estimated by fitting the regression model to the combined SRPBS multidisorder and traveling-subject datasets in the same way as in the "Estimation of biases and factors" section. For each connectivity, the regression model can be written as

Connectivity = x_m^T m + x_shc^T s^hc + x_smdd^T s^mdd + x_sscz^T s^scz + x_d^T d + x_p^T p + const + e.

Measurement biases were then removed by subtracting the estimated measurement biases; that is, the harmonized functional connectivity values were set as

Connectivity_harmonized = Connectivity − x_m^T m̂,

in which m̂ represents the estimated measurement bias.

GLM harmonization. The GLM harmonization method adjusts the functional connectivity values for site differences using a GLM. Site differences were estimated by fitting a regression model that included only the site labels to the SRPBS multidisorder dataset, and the estimated site differences were then subtracted from the data.

Visualization of site differences and disorder effects. All participant data in the SRPBS multidisorder dataset were plotted on two axes consisting of the first two PCs (Fig 6A, small, light-colored symbols). The first two PCs explained approximately 6% of the variance in the whole dataset (Fig 6B; 3.5% and 2.5% for the first and second PC, respectively). Dark-colored markers indicate the averages of the projected data across HCs at each site and the average within each psychiatric disorder in the subspace spanned by the two components. To visualize whether most of the variation in the SRPBS multidisorder dataset was still associated with imaging site after harmonization, we performed a PCA of the functional connectivity values in the harmonized SRPBS multidisorder dataset (Fig 6C), using the traveling-subject method for harmonization. Finally, to visualize the measurement bias in the SRPBS multidisorder dataset, we performed a PCA of the functional connectivity values in the SRPBS multidisorder data after subtracting only the sampling bias (Fig 6D).
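The difference between the two subtraction schemes can be sketched as follows, with random stand-ins for the estimated biases: traveling-subject harmonization removes only each site's estimated measurement bias, while GLM-style harmonization removes the full per-site mean deviation, taking any sampling bias with it:

```python
import numpy as np

rng = np.random.default_rng(7)
n_sites, per_site, n_conn = 5, 10, 300

site = np.repeat(np.arange(n_sites), per_site)
m_hat = rng.standard_normal((n_sites, n_conn))   # estimated measurement bias per site
raw = m_hat[site] + rng.standard_normal((n_sites * per_site, n_conn))

# Traveling-subject harmonization: subtract only the site's estimated measurement bias.
harmonized = raw - m_hat[site]

# GLM-style harmonization subtracts the whole per-site mean deviation instead,
# which removes sampling bias (true participant differences) along with measurement bias.
site_means = np.stack([raw[site == k].mean(axis=0) for k in range(n_sites)])
glm = raw - (site_means - site_means.mean(axis=0))[site]
assert harmonized.shape == raw.shape and glm.shape == raw.shape
```

After GLM-style subtraction, every site's mean coincides with the grand mean by construction, which is precisely why these methods cannot preserve sampling bias.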