Multisite test–retest reliability and compatibility of brain metrics derived from FreeSurfer versions 7.1, 6.0, and 5.3

Abstract Automatic neuroimaging processing tools provide convenient and systematic methods for extracting features from brain magnetic resonance imaging scans. One tool, FreeSurfer, provides an easy‐to‐use pipeline to extract cortical and subcortical morphometric measures. There have been over 25 stable releases of FreeSurfer, with different versions used across published works. The reliability and compatibility of regional morphometric metrics derived from the most recent version releases have yet to be empirically assessed. Here, we used test–retest data from three public data sets to determine within‐version reliability and between‐version compatibility across 42 regional outputs from FreeSurfer versions 7.1, 6.0, and 5.3. Cortical thickness from v7.1 was less compatible with that of older versions, particularly along the cingulate gyrus, where the lowest version compatibility was observed (intraclass correlation coefficient 0.37–0.61). Surface areas of the temporal pole, frontal pole, and medial orbitofrontal cortex also showed low to moderate version compatibility. We confirm low compatibility between v6.0 and v5.3 for pallidum and putamen volumes, while those from v7.1 were compatible with v6.0. Replication in an independent sample showed largely similar results for measures of surface area and subcortical volumes, but lower overall regional thickness reliability and compatibility. Batch effect correction may adjust for some inter‐version effects when most sites are run with one version, but results vary when more sites are run with different versions. Age associations in a quality controlled independent sample (N = 106) revealed version differences in results of downstream statistical analysis. We provide a reference to highlight the regional metrics that may yield recent version‐related inconsistencies in published findings. An interactive viewer is provided at http://data.brainescience.org/Freesurfer_Reliability/.


| INTRODUCTION
The reproducibility of research findings in the biological sciences has recently come to light as a major problem, particularly for the neuroimaging-heavy fields of the psychological and neurological sciences (Boekel et al., 2015; Bowring et al., 2019; Button et al., 2013; Hodge et al., 2020; Poldrack et al., 2020). Studies on major depressive disorder (MDD), for example, have pointed out inconsistencies in results as well as difficulties in drawing comparisons due to analytical and study design variability (Beijers et al., 2019; Dichter et al., 2015; Fonseka et al., 2018; Kang & Cho, 2020; Müller et al., 2017; Stuhrmann et al., 2011). In one study, using a more heterogeneous sample and rigorous statistical testing, Dinga et al. (2019) were unable to replicate the statistical significance of the MDD biotypes previously defined in the literature. Inconsistent results investigating neuroimaging traits and diseases have also been found in studies of insomnia (Spiegelhalder et al., 2015) and mild traumatic brain injury (mTBI). A meta-analysis of 14 reports of working memory in mTBI showed mixed functional magnetic resonance imaging (MRI) findings of hyperactivity and hypoactivity, with some studies even reporting both (Bryer et al., 2013). Neuroimaging offers mechanistic insights into the variability that leads to risk for brain dysfunction, yet these findings must be replicable in order to extend the use of MRI-derived biomarkers to a clinical setting.
It is important to understand how and why these discrepancies occur, so that we can better understand why certain findings are, or are not, reproducible. For example, studies may be underpowered, or the variable of interest might have different effects across populations. Experimental results can also be affected by methodological factors such as the type of data collection (Han et al., 2006; Jovicich et al., 2009; Yan et al., 2020), data processing and analysis (Bennett & Miller, 2013; Botvinik-Nezer et al., 2020; Carp, 2012; Lindquist, 2020), tool version and selection (Bigler et al., 2020; Dickie et al., 2017; Gronenschild et al., 2012; Meijerman et al., 2018; Perlaki et al., 2017; Tustison et al., 2014; Zavaliangos-Petropulu et al., 2022), and even operating system environments (Glatard et al., 2015). The presence of pathological tissue has also been reported to cause systematic errors in segmentation output (Dadar et al., 2021). If sample population and methodology differ, it can be difficult to tease apart the main source of the discrepant findings.
Recent efforts in the neuroimaging community have heightened awareness and partially addressed concerns surrounding reproducibility. Guides and tools for enhancing reproducibility have been published in an effort to promote Open Science. Open science aims to provide transparency into research studies to better understand the data collected, the code implemented and software used, the analysis performed, and the full scope of results, including null findings (Gorgolewski et al., 2015; Gorgolewski & Poldrack, 2016; Kennedy et al., 2019; Nichols et al., 2017; Poldrack & Gorgolewski, 2017; Vicente-Saez & Martinez-Fuentes, 2018; Zuo et al., 2014). These efforts often include detailed documentation and containerization of analytical software to ensure consistency of software version, and even operating system to the extent possible, should the study be replicated. Other efforts, such as the Consortium for Reliability and Reproducibility (CoRR), emphasize reliability and reproducibility in neuroimaging. This is demonstrated by their open-source test-retest data sets, which help facilitate reliability and reproducibility assessments in both structural and functional MRI (Zuo et al., 2014).
Compared to sample size, reliability and reproducibility metrics are often overlooked, but it is important to note that reliability is a key determinant of statistical power (Zuo et al., 2019). Large consortia, such as the Enhancing NeuroImaging Genetics through Meta-Analysis (ENIGMA) Consortium, have also addressed issues of low power and varying data processing pipelines by conducting large-scale harmonized meta- and mega-analyses across international data sets. Analytical protocols are proposed and approved by the community in advance; they are then distributed and made readily available. These protocols also include data quality control (QC) guidelines to improve analytic consistency across heterogeneous data sets and populations.
Large, publicly available, and densely phenotyped data sets that use these protocols have recently become a powerful resource that has advanced the field of neuroscience (Horien et al., 2021). Studies like the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the UK Biobank collect data from thousands to tens of thousands of individuals (Littlejohns et al., 2020; Weiner et al., 2015), with some collecting longitudinal data that spans well over a decade (Weiner et al., 2017).
Automatic segmentation tools are widely used on such data sets and have allowed for tens to hundreds of thousands of scans to be conveniently processed, thus enabling neuroimaging traits to be used in a wide range of clinical and epidemiological studies. However, these tools do not come without challenges and limitations.
Data processed with updated versions of this software are continuously released (http://adni.loni.usc.edu/2021/), leaving researchers questioning which version is most reliable and whether data and results from work that used prior versions are compatible with those of later releases. If the detected effects depend on the software version used, then that variability could threaten the reproducibility of published research and compromise clinical translation.
However, these version updates are often needed to keep up with the many advancements made in the neuroimaging field. For example, version updates may include added options or tools to work with higher resolution images, or more computationally efficient image-processing pipelines (e.g., the use of GPUs for processing). As newer software releases are made available, we often lack information on whether new results will be consistent with prior findings, and on the overall impact of a software upgrade. To understand sources of study variability, it is important to understand how version upgrades may impact outcome measures.
One such automatic feature extraction and quantification tool that is widely used in neuroimaging is FreeSurfer (Fischl, 2012). FreeSurfer is a structural MRI processing suite that allows researchers to obtain brain parcellations and metrics from just a single T1-weighted image.
Running the software involves just one command, but the process itself is quite extensive: the single image undergoes over 30 stepwise processing stages (https://surfer.nmr.mgh.harvard.edu/fswiki/recon-all). Notably, more than 60 research papers have been published detailing FreeSurfer's algorithms and workflows (https://www.zotero.org/freesurfer/collections/F5C8FNX8). The overall processing steps include: image preprocessing, brain extraction, gray and white matter segmentation, reconstruction of the white matter and pial surfaces, labeling of cortical and subcortical regions, and a spherical nonlinear registration of the cortical surface using a stereotaxic atlas, allowing for a more accurate alignment of gyral and sulcal landmarks.
Users can then extract features, such as cortical thickness (defined as the distance between the white matter and pial surfaces), surface area (the area of all the triangles on the mesh representing the white matter surface), and cortical and subcortical volumes, measured in cubic millimeters (Fischl, 2012).

A subset of 106 neurologically normal individuals was selected at random from the UK Biobank (Miller et al., 2016) to test age-association outcome differences between versions. This included 56 females with a mean (standard deviation) age of 62.3 (7.2) years and 50 males with a mean (standard deviation) age of 61.2 (7.7) years.
Ages ranged from 46 to 78 years. In this case, being neurologically normal was defined based on a set of exclusion criteria.

TABLE 1 Cohort demographics and scan parameters for the test-retest data sets analyzed. HCP is a family-based data set including up to four individuals per family, so we limited our ICC investigations to one randomly chosen individual per family. *Indicates the maximum duration between any two consecutive scans; the maximum duration between the baseline scan and the final retest is 40 days.

| Statistics and quality control
ICCs were calculated using the psych library in R (https://CRAN.R-project.org/package=psych). The following three compatibility comparisons were evaluated: v7.1 versus v6.0, v7.1 versus v5.3, and v6.0 versus v5.3. Only the first time points from the test-retest data were selected for these comparisons. ICC2 was used to compute between-version compatibility measures, as it accounts for systematic errors, using the following formula:

$$\mathrm{ICC2} = \frac{BMS - EMS}{BMS + (k-1)\,EMS + k\,(JMS - EMS)/n_0}$$

where BMS is the between-targets mean square, EMS is the residual mean square, k is the number of judges, JMS is the between-judges mean square, and $n_0$ is the number of targets (in our context, the judges correspond to the different software versions used to compute the measures).
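As a concrete illustration, the sketch below shows how the single-rater ICC2 used here, along with the ICC3 used for within-version reliability (described next), can be obtained from the psych library. The thickness matrix is simulated, hypothetical data standing in for one regional metric, with software versions as the "judges" in the columns.

```r
library(psych)

# Hypothetical data: rows = subjects (targets), columns = judges
# (software versions for compatibility; repeated sessions for reliability)
set.seed(1)
true_thickness <- rnorm(30, mean = 3.2, sd = 0.25)
thickness <- cbind(
  v5.3 = true_thickness + rnorm(30, sd = 0.05),
  v6.0 = true_thickness + rnorm(30, sd = 0.05),
  v7.1 = true_thickness + rnorm(30, sd = 0.05)
)

res <- ICC(thickness)   # computes all six Shrout & Fleiss ICC variants
subset(res$results, type %in% c("ICC2", "ICC3"))  # single-rater ICC2 and ICC3
```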
ICC3 was used to measure within-version reliability using the following formula:

$$\mathrm{ICC3} = \frac{BMS - EMS}{BMS + (k-1)\,EMS}$$

where BMS is the between-targets mean square, EMS is the residual mean square, and k is the number of judges. ICCs were computed for each site, and a weighted average across sites was also computed.

To test if FreeSurfer version affects population-level findings in studies of modest sample size, age associations were performed in a cross-sectional subset of the UK Biobank using linear regressions. Sex was used as a covariate; ICV was added as a covariate for subcortical volumes. In that same subset, detailed QC was performed using the ENIGMA QC protocol (http://enigma.ini.usc.edu/protocols/imaging-protocols/) to test differences in regional fail rates across the versions.
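The sketch below illustrates the form of these regression models under the covariate choices just described; the data frame, variable names, and effect sizes are simulated and purely illustrative.

```r
# Hypothetical subject-level data mimicking the UK Biobank subset (N = 106)
set.seed(1)
df <- data.frame(
  age = runif(106, 46, 78),
  sex = rbinom(106, 1, 0.5),
  icv = rnorm(106, mean = 1.5e6, sd = 1.5e5)
)
df$thickness <- 3.0 - 0.01 * df$age + rnorm(106, sd = 0.10)         # cortical metric
df$volume    <- 1700 - 5 * df$age + 5e-4 * df$icv + rnorm(106, 50)  # subcortical metric

fit_thick <- lm(thickness ~ age + sex, data = df)        # sex as covariate
fit_vol   <- lm(volume ~ age + sex + icv, data = df)     # ICV added for volumes
summary(fit_thick)$coefficients["age", ]  # estimate, SE, t-statistic, p-value
```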
Then, 54 subjects were assigned to rater #1 and 52 to rater #2. Each rater QC'ed the same subset across all three versions. Rater #3 then reviewed all QC fails for consistency. All subcortical QC was performed by rater #3 where a fail constitutes any notable overestimation or underestimation of volume for any structure. Age associations were also performed in this QC'ed subset, where subjects were excluded if the QC of any ROI was inconsistent across versions. If subjects had consistent regional fails, they were kept in the analysis, but those regions were excluded. While many studies of such sample size may perform manual segmentation corrections, there is no way to ensure consistent manual editing across the outputs of all software versions. We therefore opted to exclude QC fails to ensure our reported differences were due to changes in software version.
For each set of regressions within a version, statistical significance was determined after controlling the false discovery rate (FDR) at q < 0.05 across 234 measures, which included all bilateral, unilateral, and full-brain measures. FDR-corrected p-values (Benjamini & Hochberg, 1995) and z-statistics were plotted on brain surfaces for comparison. All values, including uncorrected p-values, are tabulated on our web viewer. Dice coefficients (Dice, 1945) were also calculated in the UK Biobank subset to assess the extent of spatial overlap of ROIs across versions, for all regions in the DK atlas.
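The sketch below illustrates both computations on hypothetical inputs: the Benjamini-Hochberg procedure via R's built-in p.adjust, and a Dice coefficient between two binary ROI masks (the dice helper is our own illustration, not a FreeSurfer function).

```r
# 1. Benjamini-Hochberg FDR across the 234 measures tested per version
set.seed(1)
p_values <- runif(234)                      # stand-in for regression p-values
p_fdr    <- p.adjust(p_values, method = "BH")
sum(p_fdr < 0.05)                           # number of measures surviving q < 0.05

# 2. Dice coefficient: spatial overlap of the same ROI label from two
#    versions, given voxel-wise logical masks of equal dimensions
dice <- function(mask_a, mask_b) {
  2 * sum(mask_a & mask_b) / (sum(mask_a) + sum(mask_b))
}

# Toy 3D masks sharing half their voxels
a <- array(FALSE, c(4, 4, 4)); a[1:2, , ] <- TRUE
b <- array(FALSE, c(4, 4, 4)); b[2:3, , ] <- TRUE
dice(a, b)  # 0.5
```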

| Replication analysis
To ensure replicability of our results, we calculated reliability and compatibility measures on the Hangzhou Normal University (HNU) cohort.
This data set is a valuable resource for assessing reliability with its 10-session test-retest design (Zuo et al., 2014): 30 participants were each scanned 10 times, all within 40 days of their baseline scan, with a mean of 3.7 days between consecutive scans (see Table 1 for more details). Upon visual inspection of the FreeSurfer outputs, we excluded three subjects (subject IDs: 25434, 25440, 25438) due to an error in brain extraction that cut off a superior segment of the brain in at least one of each subject's sessions. Out of 300 scans, this error was observed three times in v5.3, two times in v6.0, and four times in v7.1. Radar plots of ICCs are available in the supplementary materials (Figures S7 and S8).

| ComBat analysis
To test the effects of batch correction, we performed an additional set of age associations on all the test-retest data sets, harmonizing for site using ComBat (Fortin et al., 2018). We limited our analysis to the first time point with a maximum age of 35 years, as the KKI data set had only a few participants older than this. Here, we used all subjects from HNU, given that no errors were observed in baseline scans. We compared harmonized v7.1 results to a mixture of v7.1 and other versions, where we changed the version of one (75% v7.1) or two data sets (50% v7.1) to v5.3 or v6.0. Differences in significant bilateral z-statistics before and after FDR correction are available in the supplementary materials (Figures S11-S14).
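A minimal sketch of this harmonization step is shown below, assuming the neuroCombat R package that accompanies Fortin et al. (2018) and its dat.combat output field; the site labels, covariates, and data are simulated for illustration only.

```r
library(neuroCombat)  # github.com/Jfortin1/neuroCombat_Rpackage (assumed)

set.seed(1)
n    <- 90
site <- rep(c("A", "B", "C"), each = 30)   # hypothetical site/batch labels
age  <- runif(n, 20, 35)
sex  <- rbinom(n, 1, 0.5)
offset <- c(A = 0, B = 0.3, C = -0.3)[site]

# 10 hypothetical regional metrics (features x subjects) with an additive
# site effect that ComBat should remove while preserving the age effect
dat <- matrix(rnorm(10 * n, sd = 0.1), nrow = 10) +
       matrix(0.02 * age + offset, nrow = 10, ncol = n, byrow = TRUE)

mod <- model.matrix(~ age + sex)   # biological covariates to preserve
harmonized <- neuroCombat(dat = dat, batch = site, mod = mod)$dat.combat
# Age associations are then re-run on `harmonized` as on the raw metrics.
```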

| RESULTS
The full set of our reliability, compatibility, and association results is available through an interactive 3D brain viewer at http://data.brainescience.org/Freesurfer_Reliability/. Cohort-specific ICCs and associated statistics are also available in the supplementary material (Figures S1-S8, Tables S3-S6).
Replication analysis using the HNU data set showed the compatibility of surface area and subcortical volumes to be largely in line with our main analysis (Figure S7d,g). Whereas our main analysis showed the largest discrepancies between cortical thickness measures from v7.1 and the previous versions, mostly in the cingulate regions, our replication analysis showed lower compatibility more broadly between cortical thickness from v5.3 and that from the newer versions (Figure S7a).

| Within-version reliability
The meta-analyzed scan-rescan reliability for all cortical and subcortical metrics within each of FreeSurfer v5.3, v6.0, and v7.1 are shown in Figure 3. All versions showed high reliability for average bilateral, left hemispheric, and right hemispheric cortical thickness (ICC > 0.90).
Regional bilateral metrics with the lowest thickness ICCs (though still considered moderate to good) included the temporal pole. The HNU replication sample showed widespread and lower reliability overall, across all versions. Regions classified as having moderate to poor reliability in HNU included the temporal pole, insula, entorhinal, inferior temporal, rostral anterior cingulate, medial orbitofrontal, and lateral orbitofrontal cortices (Figure S8a). There were also substantial compatibility issues between v7.1 and v5.3 in regional cortical thickness, surface area, and subcortical volumes.

| Quality control and population-level analysis
Thickness measures with low compatibility between v7.1 and v5.3 were the same as those between v7.1 and v6.0, although the replication data set showed more similarities between these comparisons.

In studies of posttraumatic stress disorder (PTSD), the anterior midcingulate, and in some cases the posterior cingulate, show, on average, lower thickness in individuals with PTSD compared to healthy controls (Hinojosa et al., 2019). Subregions of the cingulate cortex have also been associated with age-related cognitive performance. In "SuperAgers," adults over the age of 80 years whose episodic memory is resistant to age-related decline, a preservation of anterior cingulate thickness is observed (de Godoy et al., 2021; Gefen et al., 2015; Harrison et al., 2012; Harrison et al., 2018; Sun et al., 2016). Many of these studies were performed using versions of FreeSurfer that precede v7.1, so possible replication issues in future studies may be partially explained by the version incompatibility described in this work. Although we tested within a very narrow age range, and more extensive evaluation may be needed, we find that batch correction methods may adjust for some of these version effects.

The entorhinal cortex plays an important role in mediating information transfer between the hippocampus and the rest of the brain (Coutureau & Di Scala, 2009; Garcia & Buffalo, 2020). Measurements of its thickness are widely assessed in Alzheimer's disease, as it is one of the first regions to be impacted by the disease process (Braak & Braak, 1991), and researchers have found associations between its thickness and markers of amyloid and tau (Thaker et al., 2017). Entorhinal thickness is often a feature of interest in models designed to predict progressive cognitive decline, due to its early vulnerability and role in the prodromal stages of Alzheimer's disease. Although v7.1 may have a more anatomically accurate segmentation, we advise caution when comparing the performance of predictive models that use earlier releases of FreeSurfer for deriving this metric.
Compatibility issues between v7.1 and older versions were less frequent for surface area and did not occur in the same regions as for cortical thickness. This could be due to the relative independence of these measures: surface area is calculated as the sum of the areas of all the triangles on the white matter surface, and the large area covered by many regions makes them more robust to slight variations in vertex counts.
On the other hand, cortical thickness is measured as the distance between the vertices of the white matter and pial triangulated surfaces, and the cortex is often only 2-4 mm thick, a span of only two to four voxels; slight variability in partial voluming may therefore have a more dramatic effect on cortical thickness. Yet, as thickness is averaged across an entire region, a slight variation in the number of vertices on the surface will have little effect on the averaged cortical thickness estimates.
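To make the surface-area computation concrete, the sketch below sums triangle areas over a mesh via per-triangle cross products. This is a generic illustration of the calculation described above, not FreeSurfer's implementation, and triangle_mesh_area is a hypothetical helper.

```r
triangle_mesh_area <- function(vertices, faces) {
  # vertices: n x 3 matrix of coordinates (mm); faces: m x 3 matrix of
  # 1-based vertex indices, one triangle per row
  p1 <- vertices[faces[, 1], , drop = FALSE]
  p2 <- vertices[faces[, 2], , drop = FALSE]
  p3 <- vertices[faces[, 3], , drop = FALSE]
  e1 <- p2 - p1
  e2 <- p3 - p1
  # Cross product per triangle; half of its norm is that triangle's area
  cr <- cbind(e1[, 2] * e2[, 3] - e1[, 3] * e2[, 2],
              e1[, 3] * e2[, 1] - e1[, 1] * e2[, 3],
              e1[, 1] * e2[, 2] - e1[, 2] * e2[, 1])
  sum(0.5 * sqrt(rowSums(cr^2)))
}

# Toy example: a single unit right triangle has area 0.5
v <- rbind(c(0, 0, 0), c(1, 0, 0), c(0, 1, 0))
f <- matrix(c(1, 2, 3), nrow = 1)
triangle_mesh_area(v, f)  # 0.5
```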
The independence of these measures has also been established in relation to their genetic associations (Grasby et al., 2020; Winkler et al., 2010), overall suggesting that our results are not unexpected.

Subcortical volumes are another set of FreeSurfer-derived metrics of major interest to neuroimaging researchers (Ohi et al., 2020; Satizabal et al., 2019). Efforts have been made to provide references for the normative subcortical volume changes that occur as a result of aging (Bethlehem et al., 2022; Coupé et al., 2017; Dima et al., 2022; Miletić et al., 2022; Narvacan et al., 2017; Potvin et al., 2016).

One limitation of our study was that no higher-resolution or postmortem ground truth data were available to determine which FreeSurfer version best represents true anatomical structure. However, given that many of these measures have been widely studied in relation to age, even in the absence of postmortem or higher resolution data (Fischl, 2012; Frangou et al., 2022; Salat et al., 2004), we instead assessed age associations to gauge the downstream consequences of version differences. Version-related differences in FreeSurfer metrics between cases and controls have been assessed by Filip et al. (2022). In their work, Filip and colleagues assessed group differences in nine preselected cortical and subcortical volumes between patients with type 1 diabetes and controls across the latest FreeSurfer versions. They found that the statistical significance of group differences depended on version; notably, analyses run using v7.1 metrics did not replicate the results of older versions. Our compatibility findings highlight specifically the regions for which effects differ. Our work also highlights the dampened effects that might be expected with v7.1, suggesting larger sample sizes might be needed to find similar effects than what might be expected from power calculations using v5.3 results.

We also performed QC of regional parcellations to rule out any spurious associations with gross mis-segmentations. One example worth noting is that v7.1 and v6.0 may be better able to handle images with lower quality and/or motion, as evidenced by one subject in our UK Biobank subset who failed in v5.3 but not in the newer versions (Figure 4b). This could be due to the improved error handling of the Talairach registration: if one registration fails, v7.1 and v6.0 will try an older atlas. Another example involved the left middle temporal gyrus, which is often susceptible to underestimation due to spillage/overestimation of the banks of the superior temporal sulcus into that gyrus. This occurred at approximately the same rate across versions. When associating the thickness of both the left banks of the superior temporal sulcus and the middle temporal gyrus with age before QC, all versions revealed significant associations for both regions.
After removing subjects encountering this issue, although the direction of the effects stayed the same, neither region was associated with age in any of the versions. While this may be due to reduced sample size and study power, it is also possible that findings in these regions may not represent true anatomical structure, and may instead be due to common segmentation errors.

It is also worth noting that our results are based solely on the DK atlas (Desikan et al., 2006) and may not translate to other atlases. We chose the DK atlas as it consists of a set of coarse regions defined by anatomical landmarks that can be reasonably quality controlled. Most other atlases, while possibly more precise, define finer parcellations based on cortical function, connectivity, topography, myelin, or a combination thereof (Glasser et al., 2016; Schaefer et al., 2018). Visual QC by region may not be readily possible when cortical parcellations are finer and there are over 100 regions in each hemisphere, so version performance of segmentation accuracy may be more difficult to compare.

Our data sets were exclusively from adults without major neurological abnormalities, so our findings may not necessarily generalize to cohorts of young children, adolescents, or individuals with significant brain abnormalities. Finally, we recognize that repeatability is an important metric, and differences in repeatability may be explained, in part, by differences in the operating systems used. While Tustison et al. (2014) found good repeatability for FreeSurfer v5.3, users of multiple workstations should exercise caution when pooling data run on various machines, as differences in floating point precision may affect reproducibility of these measures (Glatard et al., 2015). Containerization packages, such as Docker or Singularity (Matelsky et al., 2018), help mitigate differences in environment along with differences in version, which are quantified in this manuscript.
Overall, we find generally high within-version reliability across most versions and data sets, and many advantages to using FreeSurfer v7.1 over older versions for adult neuroimaging studies. However, considerable differences are observed when analyzing between-version compatibility for regional cortical thickness, surface area, and subcortical volumes. It is important to consider these compatibility differences when pooling data or statistical inferences across software versions, and when comparing findings across published works, especially for regions with lower compatibility. Understanding these differences may help researchers make informed decisions on study design and provide insight into reproducibility issues.