Optimization and comparative evaluation of nonlinear deformation algorithms for atlas-based segmentation of DBS target nuclei

Nonlinear registration of individual brain MRI scans to standard brain templates is common practice in neuroimaging and multiple registration algorithms have been developed and refined over the last 20 years. However, little has been done to quantitatively compare the available algorithms and much of that work has exclusively focused on cortical structures given their importance in the fMRI literature. In contrast, for clinical applications such as functional neurosurgery and deep brain stimulation (DBS), proper alignment of subcortical structures between template and individual space is important. This allows for atlas-based segmentations of anatomical DBS targets such as the subthalamic nucleus (STN) and internal pallidum (GPi). Here, we systematically evaluated the performance of six modern and established algorithms on subcortical normalization and segmentation results by calculating over 11,000 nonlinear warps in over 100 subjects. For each algorithm, we evaluated its performance using T1-or T2-weighted acquisitions alone or a combination of T1-, T2-and PD-weighted acquisitions in parallel. Furthermore, we present optimized parameters for the best performing algorithms. We tested each algorithm on two datasets, a state-of-the-art MRI cohort of young subjects and a cohort of subjects age- and MR-quality-matched to a typical DBS Parkinson's Disease cohort. Our final pipeline is able to segment DBS targets with precision comparable to manual expert segmentations in both cohorts. Although the present study focuses on the two prominent DBS targets, STN and GPi, these methods may extend to other small subcortical structures like thalamic nuclei or the nucleus accumbens.


Introduction
Translating single-subject imaging into common space such as the MNI space is a core concept in the field of brain mapping (Evans et al., 2012). It allows for comparison of structural and functional MRI findings across brains, cohorts and research centers. Increasingly these techniques are being applied in clinical contexts including stroke (Fox et al., 2014) and deep brain stimulation (DBS) (Horn et al., 2017a(Horn et al., , 2017b. In the latter, subcortical atlases allow for delineation of DBS targets such as the subthalamic nucleus (STN) and internal part of the globus pallidus (GPi) that are imperfectly resolved on standard clinical MRI . By transforming single-subject data into a common reference frame such as MNI space, one can leverage the considerable brain imaging data available in public repositories (Horn et al., 2017a;Klein et al., 2009;Crivello et al., 2002). Example subcortical atlases include the probabilistic ATAG atlas , the 7T based "ultra-high field atlas for DBS planning" (Wang et al., 2016) and the DISTAL DBS atlas . Similarly, connectomic resources such as normative structural and functional connectomes in MNI space are now available (Horn and Blankenburg, 2016;Horn et al., 2014) and have been used to explore the functional and structural networks targeted by DBS (Horn et al., 2017b(Horn et al., , 2017a. Critical to this approach is the accurate transformation of single-subject imaging into the atlas space (or vice versa).
In the context of DBS, we often want to determine the spatial relationship between the implanted electrode and its specific target. This is also important for studies that assess electrode localizations across patients to identify spatial correlates of treatment outcomes (Horn et al., 2017b;Neumann et al., 2017;Israel and Bergman, 2016;Welter et al., 2014;Tisch et al., 2007). Another application for transforming atlas information to single-subject imaging is the segmentation of target nuclei on preoperative MRI acquisitions for DBS surgical planning.
Such patient-to-template or template-to-patient registrations are termed "atlas-based segmentation" and are important because manual segmentations of DBS related structures is highly time consuming, requires expert anatomical knowledge (Forstmann et al., 2017a, b;Zwirner et al., 2016;Visser et al., 2016b;Chakravarty et al., 2013) and may not be straightforward on clinical MRI data given insufficient signal-to-noise ratio or resolution. Thus, the automated segmentation of STN and GP has been a field of vital and ongoing work with innovative contributions by multiple research groups (Garz on et al., 2017;Visser et al., 2016bVisser et al., , 2016aChakravarty et al., 2013;Haegelen et al., 2012;Helms et al., 2009;Lim et al., 2013;Villegas et al., 2009). In addition, many more general methods facilitating imaging co-registration have been developed over the last 20 years, primarily within the field of brain imaging (Ashburner, 2007;Ashburner and Friston, 2011;Andersson et al., 2010;Avants et al., 2010;Sch€ onecker et al., 2009).
Despite the many developments in this field, comparative studies evaluating the alignment of subcortical structures in image registration are lacking, perhaps due to the focus on cortical structures in support of the fMRI literature (Glasser et al., 2016). In 2009, Klein and colleagues published an influential study comparing algorithm performance for cortical structures (Klein et al., 2009). However, only a few, coarsely defined subcortical structures were assessed. Moreover, since 2009 the algorithms studied have been improved substantially.
Here, we evaluated and optimized the preset parameters of six commonly used deformation algorithms. Multiple parameters were evaluated for most algorithms. Each algorithm, except for one, was evaluated for results of both multi-and mono-spectral datasets. Specifically, we assessed the performance of each algorithm using T1-or T2weighted imaging alone versus a combination of T1-, T2-and Proton Density weighted MRI. In total, we estimated over 11,000 deformations from 103 manually labeled subject brains to template space. Results were compared to manual expert segmentations. Accuracy of normalizations was estimated in a high-quality dataset consisting of healthy subjects acquired in the Human Connectome Project. These MR scans exhibit a high signal-to-noise ratio and were acquired on specialized hardware, representing an as-good-as-it-gets example dataset for 3T data. Second, we estimated results on data from the IXI project which resembles MR image quality commonly acquired on 1.5 or 3T magnets. These MRIs represent a more typical example of what is acquired in routine clinical practice or when the focus is not on specialized structural imaging. In addition, the IXI data set includes T1, T2 and PD acquisitions, allowing us to assess the performance of multi-vs. monospectral normalizations.
The present study was done with a particular focus on the two most common DBS targets, STN and GPi, though a translation to other small subcortical structures such as thalamic nuclei or the nucleus accumbens may potentially be inferred. To facilitate the broadest utility of these results to the scientific community, we have distributed all parameters and Tissue Probability Maps (TPM) and include practical information on the required computational processing times required for each algorithm presented.

Subjects and data
Two datasets were analyzed, a high quality 3T data set in young subjects ("Young Cohort"; YC from here on; image resolution 0.7 mm 3 isometric; age range 22-35 yrs, two subjects 36 þ yrs; mean age 28,7 yrs, std ¼ 3,42 yrs; F:M ¼ 49:24) and a data set resembling clinically acquired MRIs for DBS patients ("Pseudo-Clinical"; PC; image resolution 0.94 Â 0.94 Â 1mm 3 ; age range 55-70 yrs, mean age 62,9 yrs, std ¼ 4,2 yrs; F:M ¼ 17:13; for an exact list of YC-and PC-subjects please see section "Lists of segmented subjects" in supplementary material). 3T data was obtained from The Human Connectome Project (HCP) (https:// db.humanconnectome.org/, 'WU-Minn HCP Data -900 subjects'; Fischl, 2012;Jenkinson et al., 2002;Marcus, 2018;Van Essen et al., 2012). 73 unrelated subjects with sufficient image contrast for manual segmentation were selected. For each subject, both structural image modalities, T1-weighted (T1w) and T2-weighted (T2w) images were selected. Minimal preprocessing steps had been done within the HCP standard preprocessing workflow and included correction of gradient nonlinearity and field map distortion, removal of spatial artifacts, within-subject cross-modal registration and aligning the subject's native volume header to MNI space using a rigid body transform estimated by the FLIRT tool (FSL 5.0, Jenkinson et al., 2012). For details on the HCP preprocessing pipeline please see (Glasser et al., 2013).
To address how accurate automated segmentation algorithms would perform on clinical data (sometimes acquired on 1.5T with low signal-tonoise ratio and including older patients), a "typical clinical" dataset was obtained from the IXI project (http://brain-development.org/ixidataset/). A sub-cohort of 30 subjects aged between 55 and 70 years (age-matching a typical age of Parkinson's disease (PD) patients undergoing DBS surgery), whose imaging was free of gross motion artifact were randomly selected for this study. These subjects were scanned on either 1.5T or 3T MR hardware and provided T1w, T2w and Proton-Densityweighted images. In contrast to the YC, these images were neither coregistered cross-modally within-subject nor rigidly aligned to MNI space. Cross-modal, within-subject registration was performed using SPM12.
Additional details on the quality of the data used and a list of specific subjects is available in the supplementary materials.

Manual segmentation
In total, left and right GPi and left and right STN of 103 brains were manually segmented (73 YC and 30 PC). Manual segmentations were performed following a specified and published protocol  in 3DSlicer (https://www.slicer.org/) (see also "Detailed Segmentation Protocol" in supplementary material). Manual segmentations were carried out by either one of two raters (S.E. and F.F.) with overall 41% of HCP and 33% of IXI images segmented by both raters to estimate inter-rater reliability (see supplementary materials for details). The anatomical outlines of the segmented structures were defined with reference to neuroanatomical atlases and in-house acquired high resolution post-mortem 7T MRI scans that showed the target structures in detail (Edlow et al., 2018;Ding et al., 2016;Mai et al., 2015;Massey et al., 2012;Naidich et al., 2009).

Measures of inter-rater agreement
Inter-rater agreement was determined for each dataset by calculating both Dice coefficient (Dice, 1945) and the mean absolute surface distance (SD) (Wang et al., 2016) which measures the mean euclidean distance between two surfaces. All analyses were performed using MATLAB 2015b (The MathWorks, Inc., Natick, Massachusetts, United States).

Normalization and automatic segmentation
The MNI ICBM 152 NLIN 2009b template was used (Fonov et al., 2011). Native subject volumes were nonlinearly deformed into template space (normalization) using established algorithms included in the Lead-DBS v2.1.0 software package (www.lead-dbs.org; (Horn and Kühn, 2015)). Using these algorithms, deformation fields that project patient volumes into template space were estimated, inverted and applied to a precise definition of target structures (STN and GPi) in the DISTAL atlas in template space . This process will subsequently be referred to as "atlas-based segmentation".
The DISTAL atlas consists of a precise manual segmentation of the subthalamic nucleus and internal pallidum. This segmentation was performed on a high definition MNI template (ICBM 2009bnlin asym, Fonov et al., 2011 that is available in multiple modalities. These were used to generate an automatic pre-segmentation that was used alongside each individual spectra (T1, T2, etc.) to segment the structures of interest.
To assess the accuracy of each normalization, the atlas-based (automatic) segmentations were compared to manual segmentations of the corresponding nucleus in native space (Fig. 1). Agreement between atlasbased and manual segmentations was again assessed by calculating the Dice coefficient and mean surface distance. Nucleus volumes (in mm 3 ) for automatic and manual segmentations were correlated by correlating the sum of voxel volumes of the respective nuclei.
For ANTs we evaluated ANTs SyN (Avants et al., 2010) and ANTs BSplineSyN (Tustison, 2013), each with multiple different presets and with uni-versus multimodal datasets (T1 vs. T1 & T2 vs. T2-weighted MRI). We evaluated different degrees of gradient steps, shrinking factors, penalties for high frequency image deformations, convergence parameters, winsorizing and smoothing sigmas to produce three levels of allowable deformation field strengthtermed low, mid and high variance. For the winning method (ANTs SyN with the preset "Low Variance"), we additionally evaluated another step of subcortical refine similar to the one described in (Sch€ onecker et al., 2015). Here, as a last step after the global non-linear alignment of the whole brain, a subsequent warp was applied guided by a subcortical mask that focused on the basal ganglia. The final optimized parameters for registration of subcortical structures evaluated in the present study is the ANTs-based Fig. 1. Study workflow. A) First, native subject images for the Young (upper row) and Pseudo-Clinical (lower row) Cohort were normalized into MNI template space in which the DBS target regions STN and GPi had been precisely defined . B) The deformation fields estimated during normalization were inverted and applied to the template STN and GPi. This results in atlas-based (automatic) segmentations (orange labels) in patient native space which can then be compared to manual segmentations, considered ground truth. C) Shows the overlap for one subject of each cohort (YC and PC) between atlas-based (orange labels) and manual (green labels) segmentation. D) These results were then compared to inter-rater overlap. Green and blue labels indicate manual labels from different raters in the same subject. SyN approach with subcortical refinement termed "Effective (Low Variance)" which is now the default normalization method of Lead-DBS software v.2.1.6 (Horn et al., 2018). Specific parameters of this method and the principal competing presets are published online (https://github. com/leaddbs/leaddbs/blob/master/ext_libs/ANTs/presets/). The mask used for the final subcortical refinement step is also supplied with Lead-DBS software and is shown in Fig. S1 relative to the T1 ICBM 2009b nlin asym template.
For the SPM methods we evaluated the accuracy of subcortical image registration using different tissue priors. The SPM methods use Tissue Probability Maps (TPMs) as priors for their image segmentation and subsequent registration with the segmented template image. Critical to this method is a proper definition of subcortical target structures in both subject and template TPM. We tested the results of two other TPMs against the original SPM TPM which was derived from the same IXI dataset (http://brain-development.org/ixi-dataset/) that formed our clinically age-and SNR-matched dataset, the PC (http://www.fil.ion.ucl. ac.uk/spm/software/spm12/SPM12_Release_Notes.pdf). The second TPM tested was the one created by (Lorio et al., 2016). To create the third TPM, we adopted the approach described in (Lorio et al., 2016) and created a specialized template TPM with a refined definition of our subcortical target structures. Using the Lorio-template as a starting point, we used SPM New Segment to segment our target template. Subsequently, we used these segmentation as priors for SPM New Segment and our evaluated two cohorts. This TPM is currently implemented in Lead-DBS and can be downloaded at www.lead-dbs.org.
For the final, linear image registration algorithm, we used ANTs linear as registration method with the default input parameters. To improve the accuracy of this method we did, however, adapt the two-step approach described in (Sch€ onecker et al., 2009). The first step computes a global alignment between the individual and the target brain. The second step focuses on the subcortex using a mask that focuses on the basal ganglia and brainstem (Fig. S1).
ANTs and SPM methods were the most accurate registration methods in a previous comparative study that focused on cortical structures (Klein et al., 2009). These two pipelines also allow for multispectral warps, which in theory may help compensate for deficiencies of individual MRI sequences. For example, the STN is not visible on T1, which is often the volume with best spatial resolution, so a combined warp that uses information from both T1-and T2-weighted volumes appears sensible. The best performing ANTs preset with an additional step of subcortical refine similar to the one described in (Sch€ onecker et al., 2015) and the best performing TPM (for the SPM methods) were then subjected to detailed comparison and assessment.

Comparison of results
Similarity measures between manual and automatic segmentations for each method were analyzed using one-way ANOVA analyses. Pairwise multiple comparison post-hoc tests between each pair of approaches were estimated using Tukey's Honest Significant Difference procedure (as is default in Matlab's anova1 and multcompare functions). . For ANTs presets, the variance indicates the level of image distortions allowed. ANTs presets evaluated from left to right were: 1. BSplineSyN high variance; 2. BSplineSyN mid variance; 3. BSplineSyN low variance; 4. SyN high variance; 5. SyN mid variance; 6. SyN low variance; 7. SyN low variance with the additional step of subcortical refinement. The last preset yielded the best overall outcome. Of all evaluated datasets a multi-modal non-linear deformation (T1 & T2) yielded the best outcome, especially for the winning preset no. 7 (SyN low variance with the additional step of subcortical refinement).

Results
For each algorithm tested, we compared the accuracy of the atlasbased automated segmentation compared to the manually segmented target by means of a Dice coefficient. We first assessed the performance of two ANTs nonlinear warping algorithms (Syn and BSplineSyN) each with different degrees of gradient steps, shrinking factors, penalties for high frequency image deformations, convergence parameters, winsorizing and smoothing sigmas to produce three levels of allowable deformation field strengthtermed low, mid and high variance. The performance of each algorithm is shown in Fig. 2 by means of violin plots showing the distribution of Dice coefficients. The best performing algorithm was ANTs Syn Low Variance with subcortical refinement, using a T1 þ T2weighted multimodal imaging data set. This ANTs Syn preset penalizes high-frequency image deformations and includes an additional final normalization step using a subcortical mask.
We turned next to the SPM methods, New Segment, SHOOT and DARTEL. The SPM methods use Tissue Probability Maps (TPMs) as priors for the image segmentation and subsequent registration with the segmented template image. Critical to this method is a proper definition of subcortical target structures in both subject and template TPM. We evaluated the performance of the original SPM TPM (which was derived from the IXI dataset), against the TPM described by Lorio (Lorio et al., 2016) and a novel TPM implemented within Lead-DBS. Fig. 3 shows the automated segmentation performance of SPM New Segment Results using each of the three TPMs. For the YC, we see a slight improvement of image registration accuracy for the Lead-DBS TPM. In contrast, in the PC the original SPM TPM performed slightly better than the Lead-DBS template, perhaps owing to it having been developed on the IXI dataset from which the PC cohort was selected. The Lead-DBS TPM was used in the subsequent main analysis.
We then compared these optimized algorithms to a subcortically focused linear coregistration and FNIRT as implemented in FSL. The comparative performance of all methods is summarized in Table 2aþb and Fig. 4aþb. Results are shown separately for YC ( Fig. 4a and Table 2a) and PC ( Fig. 4b and Table 2b). Of all the methods compared, SPM Segment and ANTs SyN performed best yielding an accurate segmentation with Dice coefficients typical of inter-rater reliability. . Representative examples of the two best and two worst performing subjects for each algorithm and each data set are shown in Fig. S2a and Fig. S2b. Fig. 5aþb and Table 3aþb show a more detailed analysis for the two best performing methods, SPM Segment and ANTs SyN. Here, in addition to the Dice coefficient, the mean surface distance (SD) as well as volume correlations between atlas-based and manual segmentation for each subject are shown. Results are displayed for STN and GPi separately. Fig. 5a and Table 3a show the results for the YC, while Fig. 5b and Table 3b show the same results for the PC. Notably, the Dice coefficients and SD achieved with both methods, SPM Segment and ANTs SyN, almost reached inter-rater level accuracy for the YC. For the PC there was no significant difference in accuracy between inter-rater and atlas-based segmentations.
Specifically, in the YC for the STN, inter-rater agreement by means of SD was not significantly better than ANTs SyN (median SD 0.41 mm vs. 0.38 mm) indicating a very high normalization accuracy for this method in the subcortical region of the STN (Fig. 5a and Table 3a). For the GPi,  SPM Segment performed slightly better than ANTs SyN showing no significant difference between inter-rater agreement by means of Dice or SD (p > 0.1). In fact, for the mean SD of GPi segmentations no method proved significantly different than inter-rater mean SD (p > 0.8). To further compare the results of atlas-based and manual segmentation in the YC, we compared the volumes of the segmented nuclei (right panels in Fig. 5aþb). For both methods and nuclei we see a small but significant correlation between manual and automated segmentation volumes.  Table 2b.
Atlas-based segmentations by SPM Segment and ANTs SyN are reasonable in size (mean volume STN and GPi by ANTs SyN: 148 mm 3 and 406 mm 3 ; SPM Segment: 119 mm 3 and 377 mm 3 ). For additional details of volumes resulting from atlas-based segmentations compared to volumes of manual segmentations in the YC and PC please see lower sections of Table 3aþb. The median Dice coefficient for inter-rater agreement in the YC for subjects segmented by both raters (S.E and F. F.) was 0.76 AE 0.09 (standard deviation) for STN and 0.80 AE 0.05 for GPi. We found no significant effect of hemisphere on inter-rater Dice coefficients (p > 0.1). Median surface distance (SD) was 0.41 mm AE 0.20 for STN and 0.45 mm AE 0.14 for GPi, corresponding to less than the width of one voxel, again without significant effect of hemisphere (p > 0.1). Inter-rater agreement was lower in the PC. Dice coefficients for the PC were 0.66 AE 0.10 STN and 0.65 AE 0.06 for GPi. Median SD was 0.64 mm AE 0.25 for STN and 0.82 mm AE 0.16 for GPi. Again, no significant effect of hemisphere was found in any of these metrics (p > 0.2). Tables 3a and 3b presents inter-rater results in detail.
In the YC, segmentation mean volumes of GPi were 359 mm 3 AE 50 (standard deviation), (range 231-473 mm 3 ) and of STN 120 mm 3 AE 20 (range 68-168 mm 3 ). For volumes of the GPi, there was no significant difference between left and right hemisphere (p ¼ 0.49) and volumes correlated significantly between hemispheres (R ¼ 0.605, p < 0.001). As previously reported in the literature (Shen and Gao, 2009;Nowinski et al., 2005), a significant difference between volumes of left and right STN was found (p ¼ 0.031; mean volume for left STN ¼ 122 mm 3 , std ¼ 22, mean volume for right STN ¼ 118 mm 3 , std ¼ 18). Segmented volumes of GPi and STN within the same subject also correlated significantly (R ¼ 0.326, p < 0.001). In the PC, both STN and GPi volumes were significantly smaller compared to the YC (p < 0.05) which is likely attributable to a combination of lower data resolution and lower subcortical contrast (mean volumes GPi: 324 mm 3 AE 62, range 157-434 mm 3 ; STN: 112 mm 3 AE 19, range 74-154 mm 3 ). Again, volumes correlated significantly across hemispheres (STN: R ¼ 0.525, p < 0.005; GPi: R ¼ 0.376, p < 0.05). However, no significant difference in volume was found between hemispheres (p ¼ 0.07 and p ¼ 0.67 respectively). Furthermore, no significant correlation could be found for segmented volumes of GPi and STN within the same subject (R ¼ 0.132, p ¼ 0.315).
The manual segmentation of the PC dataset was more challenging due to the lower subcortical contrast and image resolution. This is reflected in the lower Dice coefficient between the manual segmentations of YC and PC (median Dice for both nuclei 0.78 and 0.66 respectively). Nevertheless, automated segmentation performed well ( Fig. 3b and Table 3b). For the PC, the performance of both methods (SPM Segment and ANTs SyN) was statistically indistinguishable from inter-rater agreement for Dice coefficient and SD for both nuclei (inter-rater median Dice STN and GPi: 0.66 and 0.65; median SD STN and GPi in mm: 0.64 and 0.82). SPM Segment performed slightly better for SD than ANTs SyN (median SD STN and GPi Segment vs. ANTs SyN in mm: 0.52 and 0.73 vs. 0.61 and 0.75), whereas ANTs SyN performed slightly but significantly better as measured by Dice coefficient (median Dice STN and GPi for SPM Segment vs. ANTs SyN: 0.62 and 0.68 vs. 0.67 and 0.71). For additional details please see Table 3b. Due to a lower number of segmented brains in the PC versus the YC cohorts (30 vs. 73 brains), we have lower statistical power. Only SPM Segment for the GPi showed a statistically significant correlation between manual and automatic segmented volumes. This may be due to the reduced SNR of the PC-dataset and increased variability of manual segmentation volumes. Still, atlas-based segmentation volumes in the PC cohort remain reasonable (mean volume STN and GPi by ANTs SyN: 118 and 380 mm 3 ; SPM Segment: 112 and 357 mm 3 ).

Computational time per method
Computational time and memory requirements are important practical considerations when implementing these methods in a research or clinical workflow. Table 4 presents an estimate of required computation time for each method based on commonly available computational resources. The MacBook Pro (left row) had the following specifications: Memory 16 GB 1867 MHz DDR3; Processor 2,9 GHz Intel ® Core™ i5. The desktop PC (right row) was better equipped: Memory 64 GB; Processor Intel ® Core™ i7-7700K CPU @ 4.20 Ghz Â 8. The computations are based on a test subjects with multimodal input data consisting of a T1and T2-weighted MRI. SPM Segment proved to be the fastest evaluated algorithm. ANTs SyN with the additional refinement of subcortical alignment was substantially slower. The processing times were primarily dependent on CPU throughput and were not limited by RAM in either of these configurations.

Discussion
We draw three main conclusions from this study. First, we evaluated six commonly used nonlinear deformation algorithms for human brain MRI and identified two methods that perform superiorly. We optimized their parameters and underlying data for alignment of subcortical structures with high precision. Second, we report that the accuracy of these optimized atlas-based segmentations are similar to inter-rater accuracy of expert manual raters, especially for data with lower signal to noise ratio as is common in clinical practice. These results also allow us to quantify the amount of error introduced when comparing subcortical resultssuch as DBS electrode placementsacross patients in standard space. Third, we demonstrate that multimodal image registration is more accurate than unimodal image registration for subcortical structures. This motivates the preoperative acquisition of multispectral data in patients undergoing functional neurosurgery.
We compared six established, openly available nonlinear deformation algorithms from four software suites with a focus on subcortical structures. Similar to prior studies focused on cortical segmentations, this was done by computing the overlap of automated segmentations with manual segmentations (Klein et al., 2009) and comparing accuracy of this overlap to the inter-rater accuracy of two independent manual raters. Dice coefficients of inter-rater YC-STN segmentations (mean 0.75) indicate a good inter-rater agreement and are comparable with inter-and intra-rater STN segmentations reported by other groups (Garz on et al., 2017;Lorio et al., 2016;Wang et al., 2016;de Hollander et al., 2014;Keuken et al., 2014). For the GPi the mean Dice coefficient was 0.79 which is also comparable to previously reported results (Makowski et al., 2017;Lorio et al., 2016;Wang et al., 2016;Keuken et al., 2014). Both results indicate sufficient overlap to establish a valid ground truth from which to assess automated segmentation.
Similar to prior results in the cortical domain, subcortically optimized ANTs and SPM implementations performed the best. However, there were some notable differences. First, within SPM, the oldest of the evaluated methods, the Unified Segmentation algorithm (Ashburner and Friston, 2005) yielded slightly superior results in comparison to newer implementations DARTEL and SHOOT (Ashburner, 2007;Ashburner and Friston, 2011) although these outperformed the Unified Segmentation approach in previous studies with a cortical focus (Klein et al., 2009). There are three potential explanations. First, DARTEL and SHOOT were created and optimized for precision in the cortex (Ashburner and Friston, 2011) and have primarily been utilized for fMRI. Second, the algorithms build upon subject-specific Tissue Probability Maps (TPMs) that result from the preceding Unified Segmentation step. They may only improve results if a proper definition of target structures in both subject and template TPM is given. In contrast to cortical gyri, for small subcortical structures this may not always be the case. To account for this, we did adopt (Lorio et al., 2016), as well as create a specialized template TPM with refined definition of these structures which did improve results (Fig. 3). Third, DARTEL and SHOOT were designed to perform group-wise registrations. Here, they were evaluated for pair-wise registrations since i) this represents a more suitable use-case for clinical neuroimaging and ii) in a prior study, group-wise and pair-wise results of DARTEL were reported not to differ (Klein et al., 2009). A second (caption on next page) S. Ewert et al. NeuroImage 184 (2019) 586-598 difference was that the ANTs based SyN approach outperformed the BSplineSyN approach. Again, BSplineSyN is the newer algorithm that had outperformed SyN for cortical normalization when it was introduced (Tustison, 2013). In summary, models with low variance in general outperformed models with high variance. One reason for this may be that small structures in the subcortex do not exhibit high image contrast and best results may be achieved by aligning surrounding anatomical structures instead of the nuclei themselves.

The choice of normalization technique is important
To date, there is no accepted standard for performing normalization of small subcortical nuclei. Fig. 6 demonstrates how variable normalization results can be depending on the method chosen. Here, we show the result of three example methods (ANTs SyN, SPM SHOOT and ANTs BSplineSyN) that were applied to multi-modally warp a single patient volume to MNI space. The patient had bilateral electrode implanted in the GPi. The ground truth of the electrode position can be seen on the native patient image (Fig. 6A). Three-dimensional reconstructions of DBS Fig. 5. a): Normalization accuracy for the two best performing methods SPM Segment and ANTs SyN for the Young Cohort. The two best performing methods SPM Segment and ANTs SyN are displayed and compared to inter-rater agreement for the YC. The left half (red) shows results for the STN, the right half (blue) for the GPi. Measurement of agreement are shown as Dice coefficient and surface distance (SD) (left half of both panels). Correlation of volumes between atlas-based and manual segmentations are shown on the right hand side of each panel. For detailed numerical results please see Table 3a. b) Normalization accuracy for the two best performing methods SPM Segment and ANTs SyN for the Pseudo-Clinical Cohort. The two best performing methods SPM Segment and ANTs SyN are displayed and compared to inter-rater agreement for the PC. Both methods showed statistically equivalent or better agreement between atlas-based and manual segmentations compared to the inter-rater comparison. Results for the STN are shown in red (left half) and for GPi in blue (right half). For both STN and GPi the left upper left graph shows the overlap between atlas-based and manual segmentations or inter-rater manual segmentations by means of Dice coefficient; the lower left graph shows the mean surface distance between those segmentations (for numeric values please see Table 3b); the upper and lower right graphs show a correlation of volumes for each subject between atlas-based and manual segmentations. Table 3 a) Dice coefficient, median and mean surface distance (SD) and volumes for the YC. For SD of STN segmentations, ANTs SyN shows a lower median value than the inter-rater agreement. For the GPi there was no significant difference between either method and inter-rater values. For the Dice values, except for SPM Segment in the GPi, inter-rater results were slightly but significantly higher than both evaluated methods. Inter-rater volumes as well as volumes from automated segmentation are comparable in size with ANTs tending to result in slightly bigger volumes.  Table 3 b) Dice coefficient, median and mean surface distance (SD) and volumes for the PC. This dataset was more difficult to segment for the manual raters due to lower SNR and resolution. Inter-rater reliability measured by Dice value or SD was lower compared to YC. Automated atlas-based segmentation by ANTs SyN and SPM Segment performed similarly to inter-rater results. Again, volumes from automated segmentation result in similar sizes as inter-rater volumes.  electrodes, as shown in B-D, may form a false sense of certainty when presented in isolation without the raw data. This is especially true for fully-automated platforms and highlights the importance of incorporating reports on registration accuracy or an interface for manual confirmation (Husch et al., 2018a(Husch et al., , 2018bD'Haese et al., 2010). However, depending on the chosen method, results differed meaningfully. The exemplary results of Fig. 6 underline the relevance of our findings, the importance of choosing the right method, and emphasize the importance of manual performance checks even for our best performing algorithms.
No registration pair in the sample analyzed here failed in terms of a gross misalignment between subject and template. However, there was variability in registration quality, in particular with the three-step linear approach. The reason for generally high robustness could be that HCP subject data had been linearly prealigned to the MNI space by the HCP team and subjects off the IXI dataset were scanned with rough alignment to the AC/PC line. The two best and two worst performing automated segmentation results for each data set and method are shown in Supplementary Figs. S2a and S2b. The preoperative patient imaging consisted of T1 and T2-weighted MRI images that were normalized multimodally to MNI space using three different methods. The electrodes were localized in native patient space using the Lead-DBS toolbox. The deformation field estimated during the normalization was subsequently applied to electrode coordinates in native space to visualize the electrodes in MNI space and assess their spatial relationship to the GPi. B) Normalization results for ANTs SyN with the preset "Low Variance with Subcortical Refine,". C) for ANTs BSplineSyN with preset "Low Variance," and D) for SPM SHOOT. For panels B, C and D the left column shows the normalized T2-weighted patient image with boundaries of the anatomical structures of the MNI template (red wires) overlaid to demonstrate fit of patient image to MNI template. Middle and right columns show a reconstruction and 3D visualization of the warped patient electrodes in MNI space. Normalization results deteriorate from rows B to D, with the third method showing the electrodes far off their actual anatomical target of the GPi. Comparing the 3D reconstruction in MNI space to the position of the electrode artifacts in the native patient image, the first method reflects the ground truth most accurately.

Results of atlas-based segmentations compared with other automated segmentation methods
Others have proposed alternative strategies of automatic segmentation of DBS structures (Garz on et al., 2017;Visser et al., 2016bVisser et al., , 2016aXiao et al., 2014;Chakravarty et al., 2013;Haegelen et al., 2012). In a recent paper, Garz on et al. use a new automated segmentation method for the midbrain target regions STN, SN and RN. Their algorithm constructs a map of spatial priors based on a training set of manual labels nonlinearly registered to the midbrain. For the STN, they reported Dice scores of 0.66 for 3T QSM-, 0.57 for R2*-and 0.59 for FLAIR-weighted MRI 0.59.
Visser et al. recently introduced a new method called MIST (Multimodal Image Segmentation Tool) which uses multiple image modalities simultaneously while using one single reference segmentation for the initialization. On 7T MRI data they reached median Dice scores for their automated segmentation of around 0.60 for the STN (Visser et al., 2016b). On a different 1.5T clinical dataset they reached Dice scores for the whole globus pallidus of around 0.75 and median mean mesh distances (compare to our mean surface distances) of around 0.8 mm (Visser et al., 2016a).
Chakravarty et al. presented a multi-atlas-based approach that was extended by the use of an automatically generated set of templates derived from different brains (Chakravarty et al., 2013). This "template layer" introduces an additional level of redundant registrations, in a first step warping the atlas brain(s) to all template brains and in a second step all of the template brains to the individual, to-be-segmented brain. This results in multiple versions of automatic segmentations in the individual brain that are then combined in a final segmentation. In theory this will distribute and thus eliminate errors from single warps due to registration and resampling errors. Using this method they reported a Dice coefficient of 0.75 for the whole globus pallidus.
In the young cohort for our best-performing method (ANTs SyN with preset "Low Variance with Subcortical Refine") we were able to reach a mean Dice score for STN and GPi of 0.72 and 0.75 respectively and a mean SD of 0.38 mm and 0.49 mm. This approach performs similarly to the best of the previously reported methods (Husch et al., 2018b;Garz on et al., 2017;Visser et al., 2016bVisser et al., , 2016aChakravarty et al., 2013;Haegelen et al., 2012). In fact, the results for mean SD (0.44 mm for both nuclei) are below the actual image resolution of the datasets (0.7 mm), consistent with a high level of accuracy.

Extension of methods to other structures
Here we have focused on the two most common DBS targets, STN and GPi, though this method may also be applicable to other small subcortical structures such as thalamic nuclei or the nucleus accumbens. The borders of these structures lack distinct MRI contrasts which may reduce the ability to accurately align them to template space. Still, our results show that the STN can still be segmented well even when only using T1weighted MRI as an input, where STN contrast is very poor. This is accomplished by registering surrounding structures with sufficient image contrast as accurately as possible. Given the phylogenetic age of subcortical structures, displacement across subjects is likely to be less than it is for cortical folds. For the thalamus specifically, the methods proposed here could potentially adjust for alterations in the overall size and shape of the thalamus, as well as potentially align intrathalamic structures like the internal medullary lamina which are discernible on MRI. Still, absent histology, we lack for a gold standard to segment the thalamus into its subnuclei, which limits the ability to explicitly assess the performance of automated segmentation algorithms for these structures.

Volumes of segmented STN and GPi
We found a marginal but significant difference in volumes of the segmented STNs across all YC subjects with a slightly higher volume for the left STN (mean volume left STN ¼ 122 mm 3 , right STN ¼ 118 mm 3 ). This left-right asymmetry has previously been reported in larger cohorts (e.g., 120 subjects in Gao, 2009 and168 STNs in Nowinski et al., 2005). For the PC we could not replicate this finding, perhaps due to a lower number of segmented subjects and resulting reduced statistical power to discriminate such small absolute volume differences. Our STN volumes are in line with other histology studies with volumes ranging from 119 to 139 mm 3 for the STN (Zwirner et al., 2016;Mai et al., 2015;Hardman et al., 2002). In addition to the high inter-rater overlap, these findings give us additional confidence in the quality of our manual segmentations.

Limitations
The study has several limitations. There is no definitive test to assess whether an algorithm is performing optimally. As such, we cannot rule out that further optimizations of ANTs, SPM or other algorithms (e.g., FNIRT) could outperform the results reported here. We therefore are careful not to claim that the performance ranking reported here could not be altered by, for example, further adapting the preset parameters of the FSL method for better subcortical alignment. Second, we used datasets of healthy subjects for this study, although we included an age-and qualitymatched cohort to resemble a typical Parkinson's disease. Nevertheless, we have not directly assessed the performance of these algorithms in a clinical cohort. Furthermore, here, we focused on routinely available acquisitions (T1-, T2-, PD-) instead of specialized basal ganglia sequences such as QSM or FGATIR that are increasingly utilized in DBS surgical planning. Of note, however, our final results still performed competitively with respect to results of a recent study that used these sequences for automated segmentation (Garz on et al., 2017).
Lastly, the present results show how well automatic segmentation reproduces manual segmentations of basal ganglia structures as visible on MRI. Although these results suggest that modern automatic procedures are (nearly) able to reproduce the work of expert humans, there remains a gap between the MRI-defined structure and the true boundary of the structures as demonstrated on histology. The size of the subthalamic nucleus and pallidum are both larger on histology than on MRI (Dormont et al., 2004;Richter et al., 2004;Sch€ afer et al., 2012;de Hollander et al., 2014;Massey et al., 2012). For the STN, much of the MRI imaging contrast derives from iron deposition which is not uniform within the nucleus. Rather, it exhibits a gradient that decays from anterior to posterior similar to the gradient in cellular density (de Hollander et al., 2014;Marani et al., 2008). As we did not have histologic data for the subjects studied here, we can not address these discrepancies directly. It is possible that ongoing technical refinements in MRI including quantitative susceptibility mapping (Alkemade et al., 2017) and high-field MRI imaging (Forstmann et al., 2017a, b) may narrow this gap in the future.

Conclusions
In conclusion, we have presented a detailed, validated approach to i) select nonlinear warping algorithms including their parameters and presets and ii) justify the use of atlas-based segmentations for automated definition of DBS targets. We see that several future innovations could further improve on these results. For one, incorporation of the aforementioned specialized basal ganglia sequences (QSM and FGATIR) may further improve accuracy and robustness.
We have released our 103-brain manually segmented dataset as an evaluation tool that can be used to assess the performance of future iterations of image registration algorithms. We plan to integrate this dataset into the open source toolbox Lead-DBS, of which the code is accessible through Github (https://github.com/leaddbs/leaddbs/). The optimal pipeline evaluated in the present study is included as the default preset in Lead-DBS. Code to perform evaluations of new methods similar to the present study will be included as an addition to the toolbox. We hope that other groups will build on our results to further optimize approaches to atlas-based segmentation.

Funding
This research was supported in part by NINDS grant K23NS099380 and an American Academy of Neurology/American Brain Foundation Clinical Research Training Fellowship to TMH. The project was further supported by the German Research Council, DFG grant KFO247 and SPP-2041 (AAK).

Conflicts of interest
All authors report no conflicts of interest.