Inter-scanner reproducibility of brain volumetry: influence of automated brain segmentation software

The inter-scanner reproducibility of brain volumetry is important in multi-site neuroimaging studies, where the reliability of automated brain segmentation (ABS) tools plays an important role. This study aimed to evaluate the influence of ABS tools on the consistency and reproducibility of the quantified brain volumetry from different scanners. We included fifteen healthy volunteers who were scanned with 3D isotropic brain T1-weighted sequence on three different 3.0 Tesla MRI scanners (GE, Siemens and Philips). For each individual, the time span between image acquisitions on different scanners was limited to 1 h. All the T1-weighted images were processed with FreeSurfer v6.0, FSL v5.0 and AccuBrain® with default settings to obtain volumetry of brain tissues (e.g. gray matter) and substructures (e.g. basal ganglia structures) if available. Coefficient of variation (CV) was calculated to test inter-scanner variability in brain volumetry of various structures as quantified by these ABS tools. The mean inter-scanner CV values per brain structure among three MRI scanners ranged from 6.946 to 12.29% (mean, 9.577%) for FreeSurfer, 7.245 to 20.98% (mean, 12.60%) for FSL and 1.348 to 8.800% (mean value, 3.546%) for AccuBrain®. In addition, AccuBrain® and FreeSurfer achieved the lowest mean values of region-specific CV between GE and Siemens scanners (from 0.818 to 5.958% for AccuBrain®, and from 0.903 to 7.977% for FreeSurfer), while FSL-FIRST had the lowest mean values of region-specific CV between GE and Philips scanners (from 2.603 to 16.310%). AccuBrain® also had the lowest mean values of region-specific CV between Siemens and Philips scanners (from 1.138 to 6.615%). There is a large discrepancy in the inter-scanner reproducibility of brain volumetry when using different processing software. Image acquisition protocols and selection of ABS tool for brain volumetry quantification have impact on the robustness of results in multi-site studies.

include manual segmentation, semiautomatic segmentation and automatic brain segmentation (ABS) [2]. Both manual and semiautomatic segmentations require manual delineation of brain regions, which are unavoidably susceptible to intra-and inter-rater inconsistency [2,3]. In contrast, ABS is hand-free and thus more resistant to inter-rater variability. Regarding the diseases related to abnormal brain morphometry, it provides a more effective and objective pipeline to yield reproducible quantifications of brain volumetry, which can facilitate to make accurate diagnosis, monitor disease progression and evaluate the prognosis [1].
In the recent decade, there have been dramatically more and more multi-site clinical studies as it becomes easier to obtain large data from multiple partners worldwide regarding the patient population in question [4]. In such background, the time-saving and objective ABS tools play a key role in large-scale multi-site brain morphometry studies based on MR images [5]. In fact, the accuracy and reproducibility of ABS tools (i.e. segmentation software) can greatly affect the evaluation of subtle brain morphometry changes [6]. It is not possible to make a correct diagnostic or treatment decision if the applied ABS tools produce inconsistent results of brain volumetry. Therefore, it is important to evaluate the variations of the quantified brain volumetry from different ABS software (for example, by testing their reproducibility on multiple scanners) before application in clinical practice.
To focus on the performance of ABS software and minimize the influences of other possible factors, some studies used standard datasets to evaluate the reproducibility of various image segmentation and volumetry software (e.g., SPM, FSL, Freesurfer) [2,7]. However, in addition to segmentation methods, there are many other factors that affect the quantified brain volumetric measures, such as imaging parameters, scanner manufacturer, subject positioning and hydration status, as well as image artifacts [5,8,9]. The existing studies also suffer from limitations in different aspects, for example: (1) only a small number of brain structures are considered [10,11]; (2) only one ABS software is tested without comparison of performance with other ABS software [1,5]; and (3) only a small sample is used for performance evaluation which cannot exclude the effect of interactions between scanners and subjects [1,12].
To this end, this study aimed to evaluate the interscanner reproducibility of brain volumetry quantified by different ABS software in a more comprehensive way that can be generalized to clinical practice. We compared three ABS software, i.e. Freesurfer [13], FIRST toolbox in FSL [14] and AccuBrain ® (BrainNow Medical Technology Ltd.) [15], in terms of their quantification performance in automatic brain volumetry. The accuracy and reliability of Freesurfer and FSL have been tested previously [1,2,6]. All the above segmentation tools can automatically segment and quantify multiple brain structures. FreeSurfer implements a complex image processing pipeline to segment a lot of anatomical structures and measure their volumes [13]. FIRST in FSL is a modelbased segmentation tool that enables segmentations of fifteen subcortical structures, such as thalamus, caudate, putamen and so on. AccuBrain ® is a cloud-based tool of automated brain volumetry. In this study, we compared the coefficients of variation [1] of the quantified brain volumetry of these tools in inter-scanner acquisitions to test their reproducibility and reliability.

Subjects and imaging protocol
Fifteen healthy volunteers (5 males and 10 females, mean age: 25.1 ± 0.59 years old) were enrolled in this study. The inclusion criteria in our study were: (a) no medical history of central neural system disease or psychiatric disorder; (b) Mini-Mental State Examination (MMSE) score within the normal range (27-30); (c) normal in physical examination of the central nervous system; (d) no medical treatments that may result in brain volumetric changes (e.g. steroid treatment) during the whole period of MRI acquisitions.
All the subjects were scanned using 3D sagittal isotropic brain T1-weighted sequences on three different 3.0 Tesla (T) MRI scanners, including GE Discovery MR750, Siemens Skyra and Philips Ingenia CX within 1 day. To avoid time-related brain structural volume changes, the time span of acquiring T1-weighted images on three different MR scanners for each subject was limited to 1 h. The details about the MRI scanners and the imaging protocols as conventionally used in clinic [1,16] are listed in Table 1.

Image processing
Visual assessment was performed on the obtained T1-weighted scans to confirm that there were no severe common artifacts (e.g. motion artifact and metal artifact), brain lesions or brain atrophy, which may lead to inaccurate volumetric estimations from the images. Subsequently, all the 3D T1-weighted MR images were processed using FreeSurfer v6.0, FSL v5.0 and AccuBrain ® .
1 FreeSurfer (http://surfe r.nmr.mgh.harva rd.edu/) is an atlas-based open-source software for processing and analyzing structural brain MRI images with no human intervention. The atlas that contains brain anatomy information is used as a reference for the segmentation of new MRI images [3]. Labels of brain regions from the atlas are modulated by affine transformations to fit target images [2]. FreeSurfer encompasses template registration and segmentation, and it can measure not only the volumes of many anatomical structures [13] but also other brain structural features such as cortical thickness, surface area, intensity and curvature. In this study, the images were processed using "recon-all" script provided by Free-Surfer, and a summary of volumetry of multiple brain structures were calculated. 2 FIRST (https ://fsl.fmrib .ox.ac.uk/fsl/fslwi ki/FIRST ) is provided as part of the FSL software distribution. It is a model-based segmentation tool. The models are created from manually labelled and segmented MRI images which are offered by the Center for Morphometric Analysis. These labels are parameterized as surface meshes and modelled as a point distribution model. Here, we used the "run_first_all" command of FSL-FIRST to calculate the brain volumetry of the provided fifteen subcortical structures. 3 AccuBrain ® is a cloud-based tool for automatic brain quantification [15]. After uploading the DICOM files on the website, a report including brain volumetry and a summary of anatomy information will be provided. AccuBrain ® employs multi-atlas image registration-based segmentation procedure. It uses a large atlas pool which is consisted of hundreds of brain MR images obtained from different scanners. Based on similarity measures, it selects a batch of most similar brains from the atlas pool to segment the subject image.
To perform a fair comparison of the quantification results in a way as similar as in clinical practice, we used the default settings of all these tools without any specific preference in parameter selection [2].

Reproducibility analysis
In order to test inter-scanner variability of brain volumetry, we measured the coefficient of variation (CV) of the quantified volumetric data based on the MRI acquisitions from different scanners. With a specific quantification tool for a certain brain region, the CV value was first calculated for each subject to measure the variability of brain regional volumes from acquisitions of the three scanners (GE, SIEMENS and PHILIPS). In detail, it is calculated as the proportion of standard deviation (SD) to the mean of volumetric measures from different scanners, which can also be expressed as a formula: CV = σ/m × 100% , where σ is the standard deviation and m is the arithmetic mean of the region-specific volumetric results of a single subject among different acquisitions. For example, if we would like to quantify the inter-scanner variability of the volumetric data of left hippocampus as measured by FreeSurfer (Additional file 1) for a single subject, we need to calculate the mean and SD of the three quantification results (from three scanners respectively) and subsequently the CV (i.e. SD/mean). In this way, we got a CV of the three scanners when quantifying left hippocampus with FreeSurfer. Similarly, we can calculate the CV of left hippocampus volume for this subject when using FSL-FIRST or AccuBrain ® for quantification. Finally, the CV values obtained from specific quantification tools can be compared in a cohort-level and for the volumetric measures of other brain substructures. Figure 1 is the flow chart of analysis method of CV of left hippocampus.
Due to the limited sample size, we utilized a non-parametric test, i.e. the Wilcoxon signed-rank test, to investigate the pair-wise between-group differences regarding the CV values of different ABS tools.
The quantified brain structures with their volumetric measures from different ABS tools were listed in Additional file 2 for reference. Of note, FSL-FIRST only quantified subcortical regions and thus the volumetric measures of WM, GM, and ventricular structures (e.g. lateral ventricle) were not available in FSL-FIRST. The CV values of the brain volumetric measures quantified from different ABS tools and the pair-wise comparisons of the CV values among these software were shown in Table 2.
The mean inter-scanner CV values among three different MRI scanners ranged from 6.946% (GM) to 12.29% (right pallidum) with a mean value of 9.577% for Free-Surfer, and 7.245% (left-pallidum) to 20.98% (rightamygdala) with a mean value of 12.60% for FSL-FIRST. In comparison, the CV values of AccuBrain ® were much smaller, ranging from 1.348% (WM) to 8.800% (left hippocampus) with a mean value of 3.546% (Table 2). Comparing FreeSurfer and FSL-FIRST, the CV values of different brain regions were generally similar, except for three regions where the FreeSurfer performed better (i.e. left and right amygdala, right accumbens, p < 0.05) and one region where FSL-FIRST performed better (i.e. left pallidum, p = 0.018). Regarding AccuBrain ® , it achieved significantly smaller inter-scanner CV values than FSL-FIRST and FreeSurfer in almost all the regions that were tested, except for left and right hippocampus, where no significant difference of CV values was found among these three software.
We further investigated the inter-scanner variability in each pair of scanners (GE vs. Philips, GE vs. Siemens, Philips vs. Siemens) as shown in Table 3. When using FreeSurfer and AccuBrain ® for automated brain volumetry, the variability between GE and Siemens scanners was the least among the comparisons of all the tested regions. When applying FSL-FIRST for quantification, the interscanner variability between GE and Philips was the least. In addition, AccuBrain ® also achieved the lowest variability of brain volumentry between Siemens and Philips scanners compared to FreeSurfer and FSL-FIRST.

Discussion
In multi-site neuroimaging studies, it is important to examine the inter-scanner reproducibility of volumetry data acquired from different MRI scanners before further statistical analysis with the integrated data. To this aim, MRI images of fifteen healthy subjects acquired multiple times from different MRI scanners were collected for scanner-related comparison and three structural brain MRI analysis software (FreeSurfer, FSL-FIRST and AccuBrain ® ) were selected to test software-related differences in measurements of brain volumetry. The segmentation accuracies of the three software have been evaluated and compared in many literatures [13]. As the segmentation accuracy of different structures is highly dependent on the anatomical definition of structures in a specific software, the comprehensive comparison of region-specific segmentation accuracy among the different software is out of the scope of this study. Our major objective is to investigate the reproducibility of brain volumetry in inter-scanner acquisitions and to test the influence of quantification software selection on inter-scanner reproducibility of brain volumetry.
In this study, AccuBrain ® presented less inter-scanner variability than FreeSurfer and FSL-FIRST according to the comparison of their CV values of brain volumetry. These findings might result from the superior performance of AccuBrain ® due to its large atlas pool, which consists of template images from a wide range of MRI scanners for knowledge transfer. Although FreeSurfer also employs atlas-based segmentation, it uses only one specific atlas (including one MRI template with labeled atlas) for knowledge transfer, which may influence its performance in inter-scanner reproducibility. Furthermore, several brain substructures (e.g. hippocampus, amygdala, pallidum and accumbens) had relatively higher CVs than other structures in the tested ABS tools, while brain tissues with larger volume (e.g. WM and GM) presented much smaller CV values ( Table 2). This finding may result from the relative volume of the tested brain structures or tissues, where the misclassified voxels from segmentation may have larger impact on the CV values if the volume of the structure is small. The secondary cause may be the differences in boundary definition and tissue contrast. One of the most important features that triggers brain MRI segmentation is brain tissue intensity [3,15], and the fuzzy boundary and lower contrast of background are more likely to cause tissue misclassification.
In addition, we found that the variabilities of the quantified brain volumetry between each pair of scanners (GE vs. Philips, GE vs. Siemens, Philips vs. Siemens) were quite different when different ABS tools were used (Table 3). When using AccuBrain ® or FreeSurfer as the quantification tool, the inter-scanner variability of GE and Siemens scanners was the lowest compared with the other pairs of scanners, and when using FSL-FIRST, the inter-scanner variability between GE and Philips scanners was the lowest. In view of the segmentation algorithm, both AccuBrain ® and FreeSurfer employ atlas-based segmentation method, while FSL-FIRST uses model-based segmentation method. The performance of atlas-based segmentation depends on the matching of the intensity in template image and that in the image to be segmented, while model-based segmentation relies more on fitting a prior model for the image to be segmented. In fact, the images acquired from GE and Siemens scanners are more similar in terms of intensity level and image contrast than the other pairwise comparison of scanners, which may also serve as a reason for the better reproducibility of the data from GE and Siemens scanners with AccuBrain ® and FreeSurfer. In contrast, FSL-FIRST, which is less affected by intensity level, does not follow the similar trend of pairwise inter-scanner variability in brain volumetry as identified by AccuBrain ® and FreeSurfer. In fact, FSL-FIRST presented the highest CV values among all the pair-wise inter-scanner comparisons, indicating its inferior inter-scanner reproducibility. Regarding the applications of the three segmentation tools, they all have their own superiorities. For example, although FreeSurfer takes the longest time to process one dataset, it supports not only quantification of subcortical brain volumetry, but also cortical parcellation and quantification. FSL-FIRST tool also enables surface-based morphometry analysis for the subcortical structures in addition to quantification of brain volumetry. As this paper mainly discussed about the reproducibility of brain volumetric quantification as affected by ABS tools, the comparison regarding different functions of the mentioned ABS tools is out of the scope of this study.
Of note, if the CVs (that indicate inter-scanner variability in brain volumetric quantification) are relatively higher when involving comparisons with a specific scanner, it does not necessarily imply that this scanner is inferior to the others, as the contrast and intensity level can be changed by modulating imaging parameters [15]. Although segmentation algorithm is the primary factor that influences inter-scanner reproducibility, the effect of the pulse sequence selected for a specific scanner cannot be underestimated, since it also has a large impact on the quantification results of brain volumetry. The misclassification rates can be reduced by a suitable and proper choice of pulse sequences [17], and the CV values obtained in our study may be reduced by adjustments of image acquisition parameters, which warrants further validations in the future.
Segmentation and quantification of specific brain regions are common tasks in the study of neurological disorders such as movement disorders [18], Alzheimer's disease [19] and epilepsy [20]. Disease progression is often reported using annualized rate of tissue volume loss, which may be very small [2]. Therefore, highly reproducible measurements are important to detect and monitor brain volumetric changes at multiple time points. Routine use of brain morphology analysis in clinical nursing needs reliable and reproducible measurements, because radiologists often give advice on treatment decisions according to brain volumetric changes [2]. High reproducibility is also necessary for detecting the subtle yet important changes of brain disease, especially in multi-site researches. The change of interest cannot be studied if the inter-scanner reproducibility of brain volume has large discrepancy [21,22]. In such background, the proper selection of brain segmentation software is a critical step in computeraided diagnosis and measurement [3]. In addition, choosing same scanner manufacturer, field strength, head coil, magnetic gradient [23], and pulse sequence [9] is helpful to improve inter-scanner reproducibility.
There are some limitations of this study that need to be considered. First, the results of our study were grounded on the examinations of young healthy volunteers. Therefore, the variability of brain volumetry in a cohort with severe brain atrophy and/or with brain lesions remain unclear. The accuracy of ABS tools might decrease when brain anatomic segmentation is performed in patients with demyelinating lesions (e.g. multiple sclerosis), masslike lesions (e.g. tumors) [24] or brain atrophy. In this respect, further studies with focus on the reproducibility of ABS tools in brain volumetry should expand the cohort to be tested from healthy individuals to individuals with brain lesions and/or atrophy. Second, as the primary goal of this study was to test inter-scanner reproducibility in a way as in clinical practice, the applied imaging parameters in this study were all daily used in clinic without any additional modulation, and the software parameters were set as default without specific preference in parameter selection [2]. However, it has been reported that

Table 2 Coefficient of variation (CV) for inter-scanner volumetric measurements among GE, Siemens and Philips
The mean values of region-specific CV (i.e. inter-scanner variability in brain volumetry of the three scanners) of the examined 15 subjects are displayed for each ABS tool, and the associated pairwise comparison results (i.e. p values) of the coefficients of variation among the ABS tools are also provided GM gray matter, N.A. not available, WM white matter, VentralDC ventral diencephalon, L Left, R Right The average CV of FSL-FIRST for different brain substructures is the mean over the structures available for quantification in FSL-FIRST (i.e. hippocampus, amygdala, thalamus, caudate, putamen, pallidum and accumbens) appropriate adjustments of image acquisition parameters can help achieve better reproducibility of brain volumetry [25]. Therefore, future efforts should also aim to investigate the optimal imaging parameters and protocols to further improve the inter-scanner reproducibility in multicenter studies.

Conclusion
In conclusion, this study demonstrated that automatic brain segmentation tool has a considerable impact on the inter-scanner reproducibility in quantification of brain volumetry. The results of this study may facilitate neuroimage data sharing and integration in multi-site research, where the selection of an appropriate automated brain quantification tool serve as a prerequisite to obtain reliable and meaningful findings.

Table 3 Coefficient of variation (CV) for inter-scanner volumetric measurement between each pair of scanners
The mean values of region-specific CV (i.e. inter-scanner variability in brain volumetry of GE