Harmonization of multi-site MRS data with ComBat

Magnetic resonance spectroscopy (MRS) is a non-invasive neuroimaging technique used to measure brain chem- istry in vivo and has been used to study the healthy brain as well as neuropathology in numerous neurological disorders. The number of multi-site studies using MRS are increasing; however, non-biological variability in- troduced during data collection across multiple sites, such as diﬀerences in scanner vendors and site-speciﬁc acquisition implementations for MRS, can obscure detection of biological eﬀects of interest. ComBat is a data harmonization technique that can remove non-biological sources of variance in multisite studies. It has been validated for use with structural and functional MRI metrics but not for MRS measured metabolites. This study investigated the validity of using ComBat to harmonize MRS metabolites for vendor and site diﬀerences. Analyses were performed using data acquired across 20 sites and included edited MRS for GABA + ( N = 218) and macromolecule-suppressed GABA data ( N = 209), as well as standard PRESS data to quantify NAA, creatine, choline, and glutamate ( N = 190). ComBat harmonization successfully mitigated vendor and site diﬀerences for all metabolites of interest. Moreover, signiﬁcant associations were detected between sex and choline levels and between age and glutamate and GABA + levels that were not detectable prior to harmonization, conﬁrming the importance of removing site and vendor eﬀects in multi-site data. In conclusion, ComBat harmonization can be successfully applied to MRS data in multi-site MRS studies.


Introduction
Magnetic resonance spectroscopy (MRS) enables non-invasive in vivo detection of brain metabolite levels. Thus, it is a useful technique to assess brain chemistry in healthy brain function, as well as changes in brain chemistry in neurological and neuropsychiatric disorders.
As with all neuroimaging modalities, multi-site MRS studies are becoming increasingly common to increase participant recruitment and statistical power to detect group differences, longitudinal changes, and relationships among brain chemistry and neuropsychological outcomes. Multi-site studies are particularly advantageous for assessing changes in brain chemistry, however, differences in MRS sequence implementations across vendors and sites can complicate data aggregation. For example, MRS sequence parameters such as pulse timings and pulse profiles differ between vendors, and these differences can lead to dif-Abbreviations: MM-sup, macromolecule suppressed; GABA, -aminobutyric acid; GABA+, -aminobutyric acid + macromolecules; tNAA, total n-acetyl aspartate; tCHO, total choline; tCr, total creatine; Glx, glutamate + glutamine; Myo-Ins, Myo-inositol; fGM, fraction of gray matter; fWM, fraction of white matter; MRS, magnetic resonance spectroscopy; PRESS, point-resolved spectroscopy; MEGA-PRESS, Mescher-Garwood point-resolved spectroscopy. biological variability that may obscure the detection of biological effects of interest.
Indeed, Mikkelsen et al. showed significant effects of both vendor and site on GABA + and a significant effect of site on macromoleculesuppressed (MM-sup) GABA levels both when referencing to creatine ( Mikkelsen et al., 2017 ) and water ( Mikkelsen et al., 2019 ). Pova ž an et al. (2020) found similar effects of vendor and site on multiple metabolites quantified from PRESS data and referenced to creatine. However, none of these studies found significant effects of age on metabolite levels (though Pova ž an et al. did find a relationship between metabolite levels and sex, similar to other findings in the literature e.g. Hadel et al., 2013 ). This is perhaps surprising, as multiple studies have shown relationships between metabolite levels and age (see Cichocka and Bere ś (2018) for a systematic review on metabolite changes and Porges et al. (2021) for a meta-analysis on GABA + changes throughout the lifespan). The absence of age effects may therefore be due to confounding effects of vendor and/or site ( Fortin et al., 2017 ).
Data harmonization methods aim to remove unwanted variability caused by site or vendor differences while preserving true biological variability within the measures. In particular, ComBat, an empirical Bayesian method for data harmonization (originally developed for harmonization of gene expression data ( Johnson et al., 2007 )) has shown greater effectiveness at removing this unwanted variance in diffusion tensor imaging (DTI) data than other methods of harmonization, including global scaling and functional normalisation ( Fortin et al., 2017 ). In addition to DTI data, ComBat has been successfully applied to remove variance due to non-biological variables of interest in both cortical thickness ( Fortin et al., 2018 ) and functional connectivity data ( Yu et al., 2018 ). Further, the removal of non-biological variance using ComBat increased statistical power for detecting differences in brain structure between individuals with schizophrenia and healthy individuals compared to a random effects meta-analysis ( Radua et al., 2020 ).
Despite the promising utility of ComBat for harmonization of multisite structural and functional MRI data, it is yet to be validated with MRS data. Whilst imaging data is typically collected for the whole brain with multiple data points per subject, MRS data is often acquired from a single voxel, and therefore a single data point per subject. It is not known whether this reduction from multivoxel data to a single data point per subject reduces the effectiveness of ComBat harmonization. The goal of this manuscript was to evaluate the use of ComBat harmonization on single voxel MRS data in healthy adults. We additionally compared the performance of ComBat with removal of site variance using linear regression, an alternative, simpler approach to account for inter-site variation. Linear models were used to measure the variance contributed by vendor and site to metabolite levels both prior to and after harmonization. Additionally, to examine whether ComBat harmonization maintains true biological effects, age-and sex-related variance in metabolite levels was investigated both prior to and after harmonization.

Dataset
Data were obtained from the NITRIC Big GABA repository ( www.nitrc.org/projects/biggaba/ ), which consists of single voxel MRS data acquired on 3T MRI scanners from the three major vendors, General Electric (GE), Philips, and Siemens (see Mikkelsen et al., 2017 for details on data acquisition).
Scanning was conducted in accordance with ethical standards set by the institutional review board (IRB) at each site, including the sharing of anonymized data. Briefly, data were collected from participants aged 18-35 from a 3 ×3 ×3 cm 3 voxel in the parietal lobe using a standard GABA + edited sequence (TR/TE = 2000/68 ms, 320 averages, 15 ms editing pulses at 1.9 and 7.46 ppm), a MM-sup GABA edited sequence (TR/TE = 2000/80 ms for GE and Philips, TR/TE = 2000/68 ms for Siemens, 320 averages, 20 ms editing pulses at 1.9 and 1.5 ppm), and a standard PRESS sequence (TR/TE = 2000/35 ms, 64 averages). MEGA-PRESS data (both GABA + and MM-sup GABA) were included from 20 sites and PRESS data were included from 19 sites ( Table 1 ).

Data quantification
MEGA-PRESS data (both GABA + and MM-sup GABA) were preprocessed and quantified with Gannet 3.2 ( Edden et al., 2014 ) using the software's automated analysis pipeline specified in the commands 'Gan-netLoad' and 'GannetFit'. This procedure includes coil combination, removal of bad averages, and frequency and phase correction with spectral registration and modelling of the 3.0 ppm GABA + or MM-sup GABA peak with a 5 parameter Gaussian function, and modelling the unsuppressed water peak with a 7 parameter Voigt function.
Data quality was assessed by visual inspection by trained personnel. Data were quantified as molal units and corrected for partial volume effects using the equation specified in Near et al. (2020) . MEGA-PRESS data quantification included the -correction, which assumes twice as much GABA in grey matter as in white matter ( Harris et al., 2015 ). A secondary, confirmatory analysis for all metabolites (except tCr) was performed using creatine-referenced data (denoted as metabolite/tCr).

Residualizing
Site effects were estimated for each metabolite using linear regression implemented in R version 4.0.3 ( R Core Team (2020) https://www.R-project.org/ ) using the package "stats " version 4.1.2. The following regression model was used: m = i + s . Where m represents the metabolite level, i represents the intercept and represents the coefficient associated with site effects (s). Metabolite residual values were calculated by subtracting the values predicted by the linear regression model from the actual metabolite values. The residual values were used in further analyses.

ComBat
Data were harmonized using the "neuroComBat " function version 1.0.5 (available at https://github.com/Jfortin1/ComBat Harmonization/tree/master/R ). ComBat harmonization corrects for one covariate while maintaining the variance from other covariates by estimating an empirical statistical distribution for each parameter. To do this, ComBat uses information from each data point to estimate the scanner and site-specific variance for each feature (metabolite level) separately, based on the assumption that all data points share the same common distribution. The small sample sizes at each site preclude statistical testing to confirm this assumption, however visual Table 1 Summary of site related parameters and demographics, including: data acquisition parameters; number of data sets included in the analysis and the number of datasets available in the repository for each site (denoted as number included/number available); sex proportion and mean age of participants for each site. inspection of the box plots in Fig. 1 suggest the raw unharmonized data conforms to this assumption. Empirical Bayes is then typically applied to pool this information across features to improve this estimation. However, as the distribution of individual metabolites are not expected to be related, we applied ComBat to each metabolite separately and therefore did not need to use empirical Bayes, as recommended when the number of features was smaller than the number of participants ( Fortin et al., 2017 ). It is recommended to include a matrix of biological covariates of interest to aid in preserving these effects. Therefore, in the initial analysis this matrix was included (the biological covariates of interest were grey matter fraction (fGM/(fGM + fWM)), age and sex). A secondary analysis was performed without this matrix to ensure the inclusion of this matrix did not introduce spurious biological effects.

Statistical analysis
We followed the linear model procedure specified in Mikkelsen et al. (2017) to evaluate the presence of significant non-biological (vendor and site) and biological (grey matter fraction, age, and sex) effects in the data. If vendor or site effects were significant in the data prior to harmonization, but not following harmonization, we considered the harmonization procedure successful. This approach to test for site differences before and after ComBat harmonization to validate the efficacy of ComBat in removing inter-site variation is consistent with validation of ComBat in other neuroimaging literature ( Fortin et al., 2017 ;Orlhac et al., 2018 ;Yu et al., 2018 ). This approach was applied to the original data, residualized data, and ComBat harmonized data for comparison.
Vendor and site level effects on each metabolite were assessed separately using a three-level unconditional linear mixed-effects model, implemented using the R package "lme4 " version 1.1-26 ( Bates et al., 2015 ), with vendor and site included as random effects. To investigate the presence of biological effects in the three data sets (original, residualized, and ComBat harmonized), secondary multilevel analyses were per-formed to test for effects of grey-matter fraction, age, and sex on metabolite levels (included as fixed effects). Goodness of fit was calculated as a log-likelihood statistic. To assess the significance of each effect, likelihood ratio tests were performed by comparing a model with the effect of interest to a second model without the effect of interest. A p-value ≤ 0.05 was considered the threshold for significance. If an effect was significant, it was retained in the next assessed model for that metabolite; if not significant, it was removed. As in Mikkelsen et al. (2017) , effects were assessed in the following order: vendor, site, grey matter fraction, age, sex. Scripts for the linear model procedure can be found at https://github.com/HarrisBrainLab/lm4hz .

Water-referenced data
In the original data, vendor effects were significant for all metabolites and site effects were significant for all metabolites except MM-sup GABA. Given that both vendor and site were significantly associated with metabolite levels, and vendor variance was expected to overlap with site variance, with site being the lowest source of variance (i.e., site variance will likely capture vendor variance), data were harmonized by site. Following both residualizing and ComBat harmonization by site, neither vendor nor site effects were significantly associated with any metabolite. However, whilst ComBat harmonized metabolite levels still remained meaningful, residualizing produced metabolite levels centred around zero. See Fig. 1 for data visualisation prior to and after harmonization, Table 2 for summaries of metabolite effects pre-and postharmonization, and supplementary material for model values.

Creatine-referenced data
In the original data, vendor, but not site, was significantly associated with all creatine-referenced metabolites. Following residualizing,   neither vendor nor site effects were significantly associated with any metabolite. As only vendor effects were present in the original data, ComBat harmonization was initially applied by vendor. However, the effect of vendor remained significant in the model, therefore data were harmonized by site. After harmonization by site, no vendor or site effects were significant ( Table 3 , see supplementary material for model values). As with water-referenced data, ComBat harmonization produced meaningful metabolite ratios, whereas residualization did not (i.e. the residualization process produced negative metabolite ratios).
To ensure that inclusion of the matrix of biological covariates did not introduce spurious biological variability, the ComBat harmonization procedure was conducted with and without the matrix included. Both approaches yielded similar results and therefore results are not reported.

Evaluation of sex effects
The relationship between tCho (referenced to water) and sex was assessed within each site and vendor. This was done for two reasons: firstly, to confirm that ComBat harmonization enabled detection of true biological effects, and secondly, ensure that the harmonization procedure itself did not introduce false biological variability, tCho (referenced to water) was chosen because levels were only significantly associated with sex following harmonization (and not prior to) ( Fig. 3 ). As sex was significantly associated with tCho/tCr prior to and following harmonization, this supports the interpretation that harmonization revealed a true effect.
In the original data, when analysing individual sites, only site P9 showed significant sex differences in tCho (males > females). Similar sex-related differences were apparent within several other sites but did not reach statistical significance. When analysing vendor averaged data, a sex effect was only statistically significant for Philips scanners. When averaging all data, there was no significant effect of sex.
Following residualizing, only site P9 showed significant sex differences in tCho (males > females). Vendor averaged data showed a significant effect of sex in Philips data only (males > females). When averaging all data, there was a significant effect of sex (males > females).
Following ComBat harmonization, no individual sites showed significant effects of sex. As with residualized data, vendor averaged data showed a significant effect of sex in Philips data only (males > females). When averaging all data, there was a significant effect of sex (males > females) ( Fig. 3 ).

Discussion
Here we show that ComBat harmonization effectively removes site and scanner variance in MRS data. We also show that removing this variance increased our ability to detect associations of sex and age with metabolite levels that were masked prior to harmonization, confirming the utility of ComBat harmonization.

Both vendor and site significantly effect metabolite levels
In the original data, linear modelling showed vendor and site to be significantly associated with metabolite levels when referencing to water, in agreement with previous studies of water-referenced GABA + ( Mikkelsen et al., 2019 ). Water-referenced MM-sup GABA levels were associated with vendor, but not site. This contrasts with Mikkelsen et al. (2019) , who found a significant effect of site but not vendor. The minor difference in echo time on the Siemens scanners vs the GE and Philips scanners could have resulted in a large amount of vendor variance, which may absorb or overshadow the site variance. Additionally, there was a large amount of site variance within the Philips data. As the Siemens and GE data has less variance between the sites, this site variance may have been attributed to vendor. No other studies have investigated these associations in water-referenced PRESS data for the quantification of tNAA, tCr, tCho, Glx and Myo-Ins.
Interestingly, in the original data, vendor, but not site, was significantly associated with all metabolite levels when referencing to creatine. This is in contrast to Mikkelsen et al. (2017) , who found that both vendor and site were significantly associated with GABA + /tCr levels and that site only was significantly associated with MM-sup GABA/tCr levels. Additionally, Pova ž an et al. (2020) found significant effects of site on tNAA/tCr, tCho/tCr, Glx/tCr and Myo-Ins/tCr, with effects of vendor on tCho/tCr only. The previous studies included data from additional sites that were not available in the repository, which may have increased the site or vendor specific effects, and may explain the differences between the previous and current findings. Further, the current study used a generic pulse sequence to model the basis set (to emphasise effects of vendor in the data), whereas Pova ž an et al. (2020) used vendor specific acquisition pulse waveforms to minimise vendor differences, which may have also caused differences in site or vendor specific effects.

Harmonization successfully removes site and vendor effects
Following residualizing and ComBat harmonization by site, neither site nor vendor was associated with metabolite levels, whether referencing to water or to tCr, indicating both methods successfully removed this non-biological variance. However, ComBat harmonized values remain meaningful, whereas residualizing centres values on zero, and therefore can only be compared with residuals from the same model. After Com-Bat harmonization, metabolite values are still consistent with literature values.
Interestingly, ComBat harmonization by vendor did not completely remove variability across different vendors. This could result from the nested nature of vendor within site, and variance from the two variables will be highly overlapped. We therefore recommend harmonization by the lowest source of variance, in this case site.
For this study, MRS data were harmonized after quantification and tissue correction. Other imaging modalities allow for harmonization at various stages of processing. For example, Yu et al. (2018) applied harmonization to functional connectivity matrices before calculating network measures. The processing stage at which harmonization is applied may alter the results.
Additionally, ComBat handles one "batch effect " at a time, in that data input into the model is expected to be similarly affected by site and scanner. This may not be the case for different metabolites. For example, generally Siemens produced higher values than Philips for all metabolites, except for Myo-Ino. We therefore recommend harmonizing each metabolite separately as done in the present study. However, this reduction in data features provided to the model may hinder the accuracy of estimating site and vendor variance.

Harmonization reveals previously masked biological effects
Harmonization of MRS data revealed biological effects that were previously not detected, even when controlling for vendor and site in the model, indicating that (1) harmonization does remove non-biological variance and allows detection of biological effects and (2) controlling for these factors in statistical models is not enough to remove unwanted variance.
Sex differences in tCho/tCr were seen both prior to and following harmonization, but sex differences in water-referenced tCho were only detected following data harmonization. Specifically, tCho/tCr and tCho referenced to water were significantly higher in males as compared to females. During the ComBat harmonization procedure, inclusion of a matrix of biological variables is recommended to preserve biological variability. To ensure inclusion of this matrix did not introduce spurious biological variability, the harmonization procedure was conducted with and without the matrix included, with no change in results. Therefore, we are confident inclusion of the matrix did not introduce spurious effects. Furthermore, this effect was not detected even when including site and vendor in the linear model as covariates, indicating that controlling for variables like site and vendor does not remove all the variance associated with these factors, and highlighting the usefulness of data harmonization. tCho is generally considered to be a marker of cell density, and changes in tCho are thought to reflect alterations in membrane turnover (i.e. increase in membrane synthesis or breakdown) ( Rae, 2014 ). However, evidence suggests that choline levels may also relate to acetylcholine function ( Bell et al., 2018 ;Lindner et al., 2017 ). Our findings of higher tCho levels in males compared to females are in line with evidence from previous studies ( Hadel et al., 2013 ;Pova ž an et al., 2020 ) and may be due to the effects of estrogen. Craig et al. (2007) showed that tCho levels in women increased following 8 weeks of ovarian suppression, which significantly reduced estradiol levels. Thus, the higher level of tCho in males, who should have lower estrogen levels, is unsurprising.
Both residualizing and ComBat harmonization successfully removed site and vendor variance to reveal sex differences in tCho, and therefore both could be considered successful harmonization techniques. However, the two methods also resulted in the detection of different biological effects. Following residualizing, sex was significantly associated with tCr and tNAA/tCr. tCr plays a role in brain energy homeostasis and tNAA is considered a neuronal marker ( Rae, 2014 ). Evidence for sex differences in these metabolites is limited. Neither Endres et al. (2016) nor Hadel et al. (2013) found sex differences in tCr levels in the frontal lobe. Ostojic et al. (2011) found sex differences in NAA/tCr in the frontal cortex, however Safriel et al. (2005) found no sex differences. As little support is available for either of the current findings, they could be spurious effects introduced by the residualization procedure (see Fortin et al. (2017) for details on how harmonization procedures can introduce spurious effects). Additionally, while these effects were statistically significant, the confidence intervals approached zero, indicating these results may be uncertain.
Following ComBat harmonization, we found GABA + levels increased with grey matter fraction, consistent with the literature ( Bhattacharyya et al., 2011 ;Harris et al., 2015 ;Zhu et al., 2011 ). We also found Glx/tCr and GABA + /tCr declined with increasing age, consistent with previous research (see Cichocka and Bere ś , 2018 for a systematic review on metabolite changes and Porges et al., 2021 for a meta-analysis on GABA + changes throughout the lifespan). Notably, prior adult studies examined age affects across a broader age range. This larger range facilitates the detection of smaller effects that are subtle in more restricted age ranges as in the current study (18)(19)(20)(21)(22)(23)(24)(25)(26)(27)(28)(29)(30)(31)(32)(33)(34)(35). Indeed, the regression coefficients in the current study (-0.15 for Glx and -0.17 for GABA + ) indicate a small effect size, which requires the increased power that is accomplished by the large number of participants included in this study. Additionally, neither of these effects were detectable prior to harmonization, even when including vendor and site as covariates in the model, again highlighting the impact of non-biological variables and the benefits of data harmonization. Moreover, though not statistically significant in the residualized data, these effects were in a similar direction (GABA + /tCR ß= -0.10, p = 0.12; Glx/tCr ß= -0.13, p = 0.08), providing further support that ComBat identified true effects. We are limited in the conclusions that can be drawn from this data due to a lack of ground truth regarding the relationship between MRS metabolites and biological variability, and therefore encourage further investigation of harmonization techniques on other multi-site MRS datasets.

Alternative harmonization methods
This manuscript focused on ComBat based on its popularity in the literature, extensive validation, and ease of implementation; the Com-Bat script is freely available on multiple coding platforms. Additionally, it requires no extra data (such as phantom data, or a travelling subject) and can be applied to datasets with only a single data point per subject, as is the case with MRS data. Alternative harmonization methods that have been applied to data from other imaging modalities include intensity harmonization techniques (e.g., histogram matching ( Shah et al., 2011 )), voxel-wise harmonization (e.g., RAVEL ( Fortin et al., 2016 )), and machine learning based methods (e.g., Deep Learning ( Dewey et al., 2019 )). Histogram matching involves aligning intensity histograms to a reference histogram ( Mali et al., 2021 ), and is therefore not suitable for MRS data. RAVEL uses voxels from a control region (CSF) to estimate site differences ( Pinto et al., 2020 ) and subsequently cannot be applied to MRS data. Metabolite concentrations in the CSF are negligible, and therefore there is no suitable control region. For machine leaning approaches to be rigorously implemented, training data from the same subjects on all scanners of interest would be required prior to harmonization ( Dewey et al., 2019 ). While less rigorous machine learning approaches may not require multiple data sets from the same subjects, independent training data would still be necessary for widespread validation and use, and is unlikely to be feasible for most studies.

Limitations and future studies
As the primary aim of this manuscript was to investigate ComBat harmonization, we are cautious not to over-interpret our findings regarding the biological effects as this was not an objective in the study design. This is partially highlighted in the large range of confidence intervals obtained across our results, which further suggests cautious interpretation until future studies can more precisely characterize these effects. We chose not to correct for multiple comparisons to increase our sensitivity to any effects and allow a comparison pre and post harmonization, but this may increase the chance of detecting spurious effects. Therefore, we recommend future study specifically exploring these effects by matching males and females by age or including other biologically relevant assessments (such as blood or saliva) to understand the significance or origin of these effects.
We chose not to include biological covariates in our residualization procedure to avoid the introduction of spurious effects. Fortin et al. (2018) showed that including age in their residualization procedure increased the association between median cortical thickness and age following residualization to a level similar to that of ComBat. However, in their sample, age was confounded with site. In the present study, the sites were relatively well balanced regarding the biological effects of interest and therefore we would expect inclusion of biological covariates to have less of an impact. Indeed, including the matrix of biological effects in the ComBat procedure did not change the resulting outcome.
Prior studies validating ComBat have often been performed on large samples obtained from few sites, whereas this study used data with small samples obtained at many sites. The minimum number of subjects needed per site has been estimated as 6 ( Maikusa et al., 2021 ), and the harmonization method successfully removed site and scanner variance, therefore we are confident that, despite the small sample size per site, ComBat was able to successfully identify non-biological variance. However, we also encourage investigation of the effects of sample size per site on ComBat harmonization.

Conclusions
In conclusion, we show that ComBat is an effective approach to harmonize MRS data (both PRESS and GABA-edited MEGA-PRESS) and successfully removes non-biological variance due to site and vendor differences. Further, we confirm the utility of harmonization by demonstrating that non-biological variance in MRS data can mask biological effects of interest and, upon harmonization of data, expected biological effects of interest can be detected. We therefore recommend the use of ComBat harmonization in multi-site MRS studies.

Data and code availability statement
Data were obtained from the NITRIC Big GABA repository ( www.nitrc.org/projects/biggaba/ ). Scanning was conducted in accordance with ethical standards set by the institutional review board (IRB) at each site, including the sharing of anonymized data. Scripts for the linear model procedure can be found at https://github.com/ HarrisBrainLab/lm4hz .

Declaration of Competing Interest
The authors declare no competing interests. and supported by the Hotchkiss Brain Institute and the Alberta Children's Hospital Research Institute, University of Calgary. Data collection was undertaken thanks to NIH grant EB016089. ADH holds a Canada Research Chair in Magnetic Resonance Spectroscopy in Brain Injury and KOY holds a Ronald and Irene Ward Chair in Pediatric Brain Injury. The funders had no involvement in study design, in the collection, analysis, and interpretation of data, in writing the report, or in the decision to submit the paper for publication.

Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi: 10.1016/j.neuroimage.2022.119330 .