A controlled comparison of thickness, volume and surface areas from multiple cortical parcellation packages

Background Cortical parcellation is an essential neuroimaging tool for identifying and characterizing morphometric and connectivity brain changes occurring with age and disease. A variety of software packages have been developed for parcellating the brain’s cortical surface into a variable number of regions but interpackage differences can undermine reproducibility. Using a ground truth dataset (Edinburgh_NIH10), we investigated such differences for grey matter thickness (GMth), grey matter volume (GMvol) and white matter surface area (WMsa) for the superior frontal gyrus (SFG), supramarginal gyrus (SMG), and cingulate gyrus (CG) from 4 parcellation protocols as implemented in the FreeSurfer, BrainSuite, and BrainGyrusMapping (BGM) software packages. Results Corresponding gyral definitions and morphometry approaches were not identical across the packages. As expected, there were differences in the bordering landmarks of each gyrus as well as in the manner in which variability was addressed. Rostral and caudal SFG and SMG boundaries differed, and in the event of a double CG occurrence, its upper fold was not always addressed. This led to a knock-on effect that was visible at the neighbouring gyri (e.g., knock-on effect at the SFG following CG definition) as well as gyral morphometric measurements of the affected gyri. Statistical analysis showed that the most consistent approaches were FreeSurfer’s Desikan-Killiany-Tourville (DKT) protocol for GMth and BrainGyrusMapping for GMvol. Package consistency varied for WMsa, depending on the region of interest. Conclusions Given the significance and implications that a parcellation protocol will have on the classification, and sometimes treatment, of subjects, it is essential to select the protocol which accurately represents their regions of interest and corresponding morphometrics, while embracing cortical variability. Electronic supplementary material The online version of this article (10.1186/s12859-019-2609-8) contains supplementary material, which is available to authorized users.


Background
Various magnetic resonance imaging (MRI) tools have been developed to characterise the changes that the human brain undergoes over the course of a lifetime. One way to characterize such changes is through surfacebased modelling packages. Following the initial phase of pre-processing, the packages divide the brain into layers and parcels using a range of algorithms and atlases. Parcel morphometry is then interpreted through several metrics such as cortical thickness, or grey matter thickness (GM th [1]), grey matter volume (GM vol [2,3]), white matter surface area (WM sa , [1]), sulcal length and depth [4], gyrification index [5,6], and fractal dimensionality [7].
Morphometric analysis software tools are powerful techniques with multiple applications. Given their ability to examine critical cortical regions, they have proven essential for the identification of maturational changes (e.g. [8][9][10] and biomarkers of disease (e.g., application in multiple sclerosis [11]; autism spectrum disorder [12]; schizophrenia [13]; Alzheimer's disease [14], amnestic and non-amnestic mild cognitive impairment [15] to only name a few). From a computational perspective, these tools show good repeatability (although OS variations can be an issue due to underlying libraries, see e.g., [16]) and reliability of measurements for the same individuals (e.g., [17]). From an anatomical perspective, some morphometric measurements have been validated against post-mortem analyses (for instance, Cardinale et al., [18] showed a good agreement between FreeSurfer cortical thickness estimations and histological measurements), whilst parcellation per se is typically assessed visually by experts, in comparison or not to manually prepared data (e.g., [19]). In our previous work, we investigated critical differences between popular brain image analysis tools with focus on their cortical parcellation protocols [20]. We identified a lack of details in terms of the reference populations used, inconsistencies in gyral border definitions, and uncertainties with variability considerations. We concluded with an emphasis on the need for such details due to the direct influences that the derived parcels would have on any consequent analysis. Here we present a controlled comparison between FreeSurfer, BrainSuite and BrainGyrusMapping to quantify how differences in algorithms and protocols led to differences in parcel metrics, in comparison to ground truth data [21].
The subject data, including their T1 and T2-weighted volumes, are publically available in the Edinburgh Data-Share repository [22] organized in Brain Imaging Data Structure (BIDS [23]).

Data acquisition
All subjects were scanned at the Brain Research Imaging Centre, Edinburgh (UK) in a 1.5 T scanner (General Electric, Milwaukee, WI, USA). A coronal high resolution 3D T1-weighted (FSGE, 1*1.3*1 mm voxel size, TE 4.01 ms TR 9.8 ms flip angle 8°), an axial T2-weighted (SE, 1*1*2 mm voxel size, TE 104.9 ms TR 1320 ms flip angle 8°), and a T2 FLAIR volume were acquired for each subject, and reviewed by a consultant radiologist ensuring their good health. Additional details can be found in [21].

Materials
We chose 3 existing software packages to analyse the raw T1w data of each of the 10 subjects: FreeSurfer [24][25][26], BrainSuite [3], and BrainGyrusMapping [2]. A Linux version of FreeSurfer version 6.0 (freesurfer-Linux-centos6_x86_64-stable-pub-v6.0.0-2beb96c) was downloaded onto the department's server and run using the default recon-all command, which allowed us to compare their older Desikan-Killiany protocol [27] to its updated version, the Desikan-Killiany-Tourville protocol [19]. BrainSuite version 13a (build#1744, built with Qt 4.8.4 on Sept 112,013) was installed and run on a Windows 7, 64-bit operating system with 16G RAM, using the BrainSuite GUI. We used the default Cortical Surface Extraction Sequence, while refining the sulcal curves for accuracy. A BrainGyrusMapping (BGM, v 11.0.3888 beta = v 1.0) command-line tool was provided by Canon Medical Research Europe 1 and installed on the same Windows 7 system. This latter tool is a multi-atlas segmentation tool, originally built and validated using the data from the Medical Image Computing and Computer Assisted Intervention (MICCAI) 2012 challenge on multi-atlas labelling [2]. We selected the maximum number of atlases, 28, to be used by this tool rather than the default number, 7. All tools aside from BGM are freely available to the public. BGM's parcellation protocol is freely available as well [28]. We additionally ran each tool 3 times on the same platform to assess its repeatability.
The results from these tools were compared to those of our morphometrics tool, Masks2Metrics [29,30], which we ran on the same data with corresponding consistent ground truth. Briefly, the T1 and T2 images were combined to enhance grey-white matter borders and parcels drawn manually using a detailed protocol which accounted for all known anatomical variability (see [21] for details and validation). Using this ground truth allowed to conduct a controlled comparison by measuringdeviations from it for each package. The ground truth here acts as a reference frame, to compare one software against another, and as such agreement or disagreement with its border definition is irrelevant.

Parcels, metrics and statistical analysis Package parcels
The cortical parcellation protocols, and in turn the derived parcels, differed across the 3 packages. We assessed parcels generated by FreeSurfer's 2 latest and most suitable protocols for cortical analysis: the Desikan-Killiany (DK, [27]) and the Desikan-Killiany-Tourville (DKT, [19]) protocols. The DKT protocol was introduced in version 5.3 as an improvement on the DK protocol, offering better parcellation accuracy, clarity and consistency. BrainSuite parcellations are based on an adaptation of the LONI curve protocol [31], whereas the BrainGyrusMapping We focused our package analysis on 3 regions per subject hemisphere: the superior frontal gyrus (SFG) of the frontal lobe, the supramarginal gyrus (SMG) of the parietal lobe, and the cingulate gyrus (CG) of the cingulate cortex. These gyri were chosen on the basis that they are situated in different lobes, undergo structural changes with ageing [32] and dementia [33][34][35][36][37], and exhibit gender differences [32,38,39]. As the parcellation protocols differed, it was necessary at times to combine some parcels to produce comparable regions. Table 2 details the parcels we combined in each software package.

Reference parcels
The 10 subjects' corresponding ground truth SFG, SMG and CG parcels which we compared to the packagederived parcels were manually segmented as described in [21]. This study's source data and derivatives, including the left and right gyral parcels, are available in the Edinburgh DataShare repository [22].

Metrics and statistical analysis
Various metrics are automatically calculated by each of the tools. We chose the 3 most popular and relevant ones for our ageing population: grey matter thickness (GM th , e.g., [32-34, 40, 41]), grey matter volume (GM vol , e.g., [41,42]), and white matter surface area (WM sa , e.g., [41,42]). Both FreeSurfer and BrainSuite calculate these 3 metrics whilst BrainGyrusMapping provides GM vol only. Several parcels were combined to form a region of interest depending on the region and package considered (Table 2). Metrics for such regions were derived by combining the original parcels' metrics. For the case of GM th , this meant averaging individual parcel metrics,   Statistical analyses consisted of (i) descriptive statistics (medians and 95% Bayesian highest density intervals (HDIs) for each metric, region of interest (ROI), and hemisphere and (ii) a percentile bootstrap between packages on relative median differences. Here the ground truth values are subtracted from each measure, and those measures are then compared across packages. This enables us to compare packages relative to a common reference. The percentile bootstrap was adjusted for multiple comparisons per metric (i.e. all measurements for each hemisphere/ROI included in a single procedure to maintain the type 1 error at 5% [43]). The raw data (tsv files) and the Matlab script we wrote to perform the data analysis are available in the Edinburgh DataShare repository [44].

Results
Repeatability was observed for all packages, with identical results generated for each of the 3 runs (see tsv files of the Edinburgh DataShare repository [44]). Parcellation influences were also evident visually. We highlighted them using screenshots taken from various angles (see Additional file 1). We identified 6 double CG occurrences in this dataset: 4 in the left hemisphere (subjects 1, 5, 6 and 8) and 2 in the right hemisphere (subjects 6 and 10).

Cortical volumes
Gray matter volumes automatically computed with the different packages were comparable, with overlapping confidence intervals (Fig. 1, Table 3) Compared to our ground truth, automated packages' median volumes were all significantly higher for the SMG and all slightly larger for the SFG although not significantly different (overlap of confidence intervals). This difference in SFG is reflected by the smaller estimates seen for the neighbouring CG parcel (non-overlap of confidence intervals for FreeSurfer and BrainSuite, but not BGM).
The comparison of relative median differences is shown in Table 4. Re-expressed in ground truth unit,  Figure S1q-t). For subject 3 who has single CG occurrences, large relative SFG volumes are observed with BrainSuite because of differences in its medial, lateral and anterior borders compared to the remaining packages (indicated by arrows in (see Additional file 1: Figure S5 and S9)). Of interest, FreeSurfer DKT generates smaller relative volumes than DK for all CG scenarios (Fig. 1) because DKT accounts better than DK for double cingulate gyri, although imperfectly (Additional file 1: Figure S1, S2, S5, and S6). Furthermore, DKT's relative SFG volumes are larger than DK's for all subjects even when they are adjoining double CGs. Although the SFG in such cases loses its medial-most fold to the CG, with the DKT protocol the SFG is larger both anteriorly and posteriorly (i.e., lengthwise to include the majority of the frontal pole) as well as laterally, into the middle frontal gyrus, due to its revised border definitions [19]. This is evident pictorially in Additional file 1: Figure S1, S2, S5, S6, S9, S10, S11, and S12.

Cortical thickness
Cortical thickness measurements computed following FreeSurfer's two parcellation routes were very similar to the ground truth (overlap of 95% HDI) while BrainSuite show significantly higher estimate than all other packages (just under double those of the other methods) along with higher dispersion (Fig. 2, Table 5). All packages were, however, still in agreement with the reported post-mortem values taken at the lateral (3.5 mm), medial (2.7 mm) and overall (2.5 mm) cortical surfaces [45].
Relative to the ground truth, BrainSuite showed a significant difference to both FreeSurfer outputs (DK and DKT) for all ROIs ( Table 6). Examination of differences per subject (Fig. 2) revealed little difference between DK and DKT, yet large differences between them and Brain-Suite, as well as across subjects within BrainSuite. This is explained (i) by the fact that thickness is not expected to change at the borders of parcels, and therefore differences in volume between DK and DKT do not translate into differences in thickness and (ii) BrainSuite combines grey and white matter thicknesses rather than just grey matter (see Discussion).

Surface area
The packages' SFG and SMG surface area metrics were generally larger than the ground truth, whereas their CG metrics were generally smaller (Fig. 3, Table 7).
Relative to the ground truth, all SMG measurements were significantly different to one another in both hemispheres (Table 8). Significant differences existed between DKT and the remaining methods for all ROIs except for the left SFG when compared to BrainSuite). As with the relative cortical volumes, the largest relative surface areas were generally in the subjects with the double CG occurrence at both the CG and the affected SFG because larger gyral volumes are expected to have larger surface areas. Once again, DKT generated smaller relative volumes than DK for all CG scenarios as it accounted better than DK of both single and double gyri (see Additional file 1: Figure S1, S2, S5, and S6). Unlike other subjects, subject 5's left SMG surface area with Brain-Suite is relatively larger than its equivalent in the remaining protocols. This is also evident pictorially (see Additional file 1: Figure S3q-t) which demonstrates a wider BrainSuite SMG, terminating caudally, like DK, at the second segment of the caudal superior temporal sulcus rather than at the first segment as with DKT and BrainGyrusMapping.

Discussion
The parcellation protocol we followed while segmenting the ground truth parcels enabled us to consistently identify and address any visible anatomical variability (see Additional file 1, [21]). Because of this, the parcels' shapes varied greatly across the cohort, creating large dispersions in the ground truth volumes (Fig. 1) and surface areas (Fig. 3). Using this as a reference frame to compare packages allowed thus to highlight how each package deals with these natural variations. The main contributor to variability in the CG and SFG is the cingulate sulcus [46] which can have a single or double  occurrence (and therefore a double CG occurrence), branches, as well as discontinuities, all of which are interpreted differently by each package. Given that it defines the dividing landmark between the CG and SFG, both gyri are highly variable, as are their volumes and surface areas. The SMG is also highly variable across the cohort, mainly due to its posterior border, as is its segmentation across the packages. The size of our dataset and the use of 1.5 T MRI images are of course a limitation. There are variations which depends on age (in adults) that would be better captured with a larger sample capturing a wider range of age and higher resolution images. This is particularly true for gyrification (the process and the extent of folding) which varies with age [5] and can thus impact on the identification of anatomical branches and borders. The current dataset was nevertheless variable enough to highlight issues in automated packages. For what is reported here, i.e. that the differences observed mainly stem from how anatomical variability in additional gyri and branching is handled, aging or higher resolution imaging has no impact. For instance, the presence/absence of double gyri is observed once the brain is fully formed and does not change across adulthood and is observed even with coarse image resolution.
With volume being (in theory) a product of thickness and surface area, and the thicknesses being generally stable for each package, larger surface areas are expected to accompany larger volumes, and vice versa and this is what we saw. We also observed that the inability to fully    (Fig. 4). Although our work highlights differences between parcellation protocols, it is most likely that the corresponding outputs of image analysis tools in fact vary due to a combination of factors, and not just the parcellation phase. One step prior to parcellation in automated and semi-automated tools is the pre-processing phase. In FreeSurfer, for example, amongst other things, that phase is used to derive white and grey matter masks [1]. These are consequently split in the processing stage, as per a parcellation protocol, to form parcels. Such mask effects were not investigated in this manuscript although it could be contributing to differences, especially for thickness. Package inconsistency across sites (e.g., [47]) and operating systems (e.g., [16]) is another aspect to consider, although was not a contributing factor to our study as each package was run on only one computer and one operating system. Finally, and most relevant here, differences in algorithms can also account for observed differences. Volume is simply derived by counting the number of voxels in each parcel and thus directly reflects differences in parcellation protocols. Cortical thickness however is specific to grey matter in FreeSurfer, while in BrainSuite it refers to that of the gyrus, all the way down to the fundus, therefore capturing the combined grey and white matter thicknesses [31]. The combination of parcel definition and using the sulcal fundus to mark the border of a gyrus also explains inconsistencies in surface area measurements.

Conclusions
We previously investigated package differences in terms of their parcellation protocol definitions, raising awareness of the associated uncertainties stemming from the well-reported anatomical variability that they are likely to encounter [20]. In our present work, we quantify the effects of these uncertainties through a healthy middle-aged dataset and manually-derived ground truth data with associated morphometrics. We show that multi-atlas parcellation (BGM) is the most accurate method and therefore encourage more research and usage of such tools. Explicit definition of the method used to compute thickness and surface area is another major factor, and since multi-atlas methods are currently limited to volume, we recommend using FeeeSurfer's DKT approach with manual editing to derive grey matter thickness and white matter surface area. Mark Bastin: for the 10-subject data set which we analysed in this study [22]. Corne Hoogendoorn: for advising on the BrainGyrusMapping software.

Funding
Data acquisition and preparation was funded by NIH grant R01 EB004155. Data preparation, collection, analysis, interpretation and writing was funded by SINAPSE-SPIRIT (a Scottish Funding Council HR09021 grant), the Tony Watson Scholarship and Canon Medical Research Europe.

Availability of data and materials
The datasets analysed during the current study are available in the Edinburgh Datashare repository, at https://doi.org/10.7488/ds/2239 [22]. The Matlab (https://uk.mathworks.com/products/matlab.html, R2016a) code (.m file) we wrote for the statistical analysis as well as the datasets (tsv files) generated during the current study are available in the Edinburgh Datashare repository at https://doi.org/10.7488/ds/2376 [44]. To derive the ground truth metrics for each of the subjects' ground truth gyri of interest, we ran our software, Masks2Metrics (M2M), version 1.0 [29,48] freely available to all users under the GNU General Public License. The latest version of the software is available on GitHub [30]. All data generated or analysed during this study are included in this published article and its additional files.
Authors' contributions SM co-wrote the Matlab code, processed and interpreted the data, and wrote and revised the manuscript. CP co-wrote the Matlab code, assisted with the statistical analysis, and revised the manuscript. All authors read and approved the final manuscript.

Ethics approval and consent to participate
The local ethics committee (Lothian Research Ethics Committee (LREC) 4-05/ S1104/45) approved the study and informed consent was obtained from each subject.

Consent for publication
We confirm that we obtained consent to publish from the participant to report individual patient data.