Deep‐learning based fully automatic segmentation of the globus pallidus interna and externa using ultra‐high 7 Tesla MRI

Abstract Deep brain stimulation (DBS) surgery has been shown to dramatically improve the quality of life for patients with various motor dysfunctions, such as those afflicted with Parkinson's disease (PD), dystonia, and essential tremor (ET), by relieving motor symptoms associated with such pathologies. The success of DBS procedures is directly related to the proper placement of the electrodes, which requires the ability to accurately detect and identify relevant target structures within the subcortical basal ganglia region. In particular, accurate and reliable segmentation of the globus pallidus (GP) interna is of great interest for DBS surgery for PD and dystonia. In this study, we present a deep-learning based neural network, which we term GP-net, for the automatic segmentation of both the external and internal segments of the globus pallidus. High resolution 7 Tesla images from 101 subjects were used in this study; GP-net is trained on a cohort of 58 subjects, containing patients with movement disorders as well as healthy control subjects. GP-net performs 3D inference in a patient-specific manner, alleviating the need for atlas-based segmentation. GP-net was extensively validated, both quantitatively and qualitatively, over 43 test subjects including patients with movement disorders and healthy control subjects, and is shown to consistently produce improved segmentation results compared with state-of-the-art atlas-based segmentations. We also demonstrate a postoperative lead location assessment with respect to a segmented globus pallidus obtained by GP-net.


| INTRODUCTION
In the past several decades, deep brain stimulation (DBS) therapy has shown clear clinical efficacy in the mediation of symptomatic motor behavior associated with Parkinson's disease (PD), essential tremor (ET), dystonia, and other conditions (Benabid et al., 1987; Deuschl et al., 2006; Hariz et al., 2008; Mueller et al., 2008; Obeso et al., 2001; Volkmann et al., 2012). One of the most prominent DBS targets for PD and dystonia is the globus pallidus (GP). The GP is divided into two compartments, the internal segment (GPi) and the external segment (GPe), of which the former is typically the actual target for electrode placement. The GPe and GPi are separated by a thin layer called the internal medullary lamina (Lozano & Hutchinson, 2002; Patriat et al., 2018). Several past studies have reported that lesions applied to the GPi led to improvement in motor function (Baron et al., 1996; Obeso et al., 2001; Vitek et al., 2003). The application of lesions (pallidotomy), however, carries the nonreversible risk of applying the lesion outside of the intended target. DBS surgery, on the other hand, has emerged as an alternative with similar benefits, whose application can be reversed or even stopped if erroneously applied to the wrong region (Benabid et al., 1987; Obeso et al., 2001). Recent studies have shown that accurate placement of the DBS electrode within the sensorimotor region of the target (e.g., subthalamic nucleus [STN] or GPi) is directly correlated with the success of the DBS procedure and a reduction of adverse effects (Ellis et al., 2008; Marks et al., 2009; Paek et al., 2013; Patel et al., 2015; Richardson et al., 2009; Rolston et al., 2016; Welter et al., 2014). Correct anatomical target identification is characterized not only by the target's center of mass, but also by its boundaries (Kim et al., 2019). Thus, precise identification of both GPe and GPi and their lamina boundary is of great importance.
A fully automated segmentation process of the GP (both the internal and external segments) has several clear advantages, among which are accurate and fast inference. From a clinical point of view, an automated process has the potential to streamline clinical workflow and increase patient throughput, both in preoperative surgery planning and postoperative assessment of the DBS lead location with respect to the target. Such a process can also eliminate human bias associated with the segmentation process, and provide more accurate and consistent segmentation results.
Since some anatomical structures are not easily identified or visualized on standard clinical images (e.g., from 1.5 or 3 T MRI scanners), a common approach to localize brain structures, and in particular those located in the basal ganglia, is to rely on an atlas (Horn et al., 2019; Horn & Kühn, 2015). An atlas provides an average location of brain structures, often based on multiple inputs, such as different MRI scans and histology, merged from numerous subjects.
For example, the atlas of (Chakravarty et al., 2006) was derived from histological data, while a 3 T based multimodal subcortical atlas was built from MRIs of PD patients (Xiao et al., 2017); other atlases have combined high resolution multimodal MRIs and structural connectivity data as well. Atlases can be deterministic (Xiao et al., 2017), that is, each voxel corresponds to a single brain structure, or probabilistic, where each voxel is associated with a vector of probabilities indicating how likely that voxel is to belong to different brain structures. For example, a probabilistic approach has been used to map DBS electrode locations onto the Montreal Neurological Institute (MNI) space.
Atlases, which are typically defined in a normalized space, have shown great importance in retrospective population studies (Horn, Neumann, et al., 2017; Horn, Reich, et al., 2017; Kim et al., 2019). However, recent works have shown that variability exists in the size and shape of deep brain structures between different subjects (Abosch et al., 2010; Duchin et al., 2018; Lenglet et al., 2012; Patriat et al., 2018). This inherent interpatient variability is not fully captured by atlas-based segmentations, as they provide a single, averaged shape of brain structures, be it deterministic or probabilistic. Accounting for interpatient variability is typically done by registering the atlas to the specific patient anatomy (i.e., via an MRI scan), although this approach often cannot account for the discrepancy between the template and the individual patient brain, and is also affected by registration errors (Dadar et al., 2018; Ewert et al., 2018; Kim et al., 2019). Several other approaches have been introduced in recent years. However, these approaches either segment the GP as an entire structure and do not distinguish between the GPe and the GPi (Manjón & Coupé, 2016; Visser et al., 2016), or still rely on registration to a template for segmenting the GPe/GPi and other structures (Bazin et al., 2020). These drawbacks motivate the development of a truly patient-specific segmentation technique for GPe/GPi structures.
In recent years, the fields of image analysis and computer vision have undergone a monumental and profound change with the introduction of deep-learning (DL), and in particular deep fully convolutional neural networks (CNNs; Krizhevsky et al., 2012;Lecun et al., 2015). These types of networks consist of many aggregated layers of convolution filters of various sizes and nonlinear elementwise activation functions, such as the rectified linear unit (ReLU). In many computer vision tasks (e.g., segmentation/classification), CNNs are trained end-to-end in a fully supervised manner, over pairs of input images and (often manual) delineations or labels. Such trained CNNs have shown state-of-the-art performance in many tasks, such as semantic segmentation (Long et al., 2015), image classification (Krizhevsky et al., 2012;Zhong et al., 2015), and image registration Dalca et al., 2018) to name a few.
As opposed to iterative algorithms, CNNs perform inference in a single forward pass, without any need for time-consuming iterations. Moreover, since CNNs are composed of convolution operations and elementwise nonlinearities, they can be implemented very efficiently, which leads to fast execution times during inference.
In this study, we present GP-net, a fully convolutional deep neural network for the efficient and accurate 3D segmentation of both the GPe and GPi. GP-net is based on a variant of one of the most prominent deep network architectures, the U-net (Ronneberger et al., 2015), which exploits skip connections in order to prevent loss of contextual information at multiple image scales. In particular, we exploit the attention gated (AG) U-net proposed in (Schlemper et al., 2019), which has previously shown improved segmentation performance in medical imaging. Attention mechanisms are able to automatically learn to focus, or direct attention, without additional supervision. This ability allows AGs to highlight salient features in the input images or intermediate feature maps during inference time, and attention mechanisms have been applied successfully in numerous machine learning disciplines, such as natural language processing and machine vision (Bahdanau et al., 2015; Wang & Shen, 2018). For example, in (Schlemper et al., 2019) it was shown that the added AGs improve model sensitivity and accuracy in medical computerized tomography (CT) and ultrasound segmentation, by suppressing feature activations in irrelevant areas.
In addition, we augment the attention gated U-net with the recently introduced deformable convolutions (Dai et al., 2017;Pominova et al., 2019), by replacing some of the intermediate 3D convolution layers in the network with 3D deformable convolutions. Classical convolution filters rely on convolving the learned kernel with input which lies on a regular Cartesian grid. In (Dai et al., 2017) it was shown that by learning the grid sample offsets, instead of using a fixed grid, improved performance in vision-based tasks such as classification and segmentation can be achieved. In that sense, deformable convolutions can be thought of as another form of an attention mechanism.
Recent advances in ultra-high field MRI machines and acquisition protocols have allowed 7 T imaging methods to directly visualize and identify small subcortical deep brain structures. Structures such as the STN or the lamina border between the GPe and GPi cannot be clearly visualized in lower field MRI machines, but can be more clearly identified with the use of 7 T due to improved contrast and resolution (Abosch et al., 2010; Patriat et al., 2018). Ultra-high field MR has already been used for deep brain structure identification; for example, it has been shown that direct visualization of the STN is possible, and (Kim et al., 2019) showed that by relying on 7 T manual delineations and machine learning techniques, direct segmentation of the STN on 3 T images is possible, with 7 T accuracy and precision. Atlas-based approaches have also exploited the benefits of 7 T imaging and its superior contrast to construct state-of-the-art atlases.
One such example is the study of Keuken et al. (2014), which constructed an atlas relying on multi-contrast 7 T acquisitions to analyze the anatomical variability of subcortical structures.
In this study, we rely on acquired T2 volumes from an ultra-high field 7 T MRI machine, specifically tailored to visualize the basal ganglia region. The training cohort consists of movement disorder patients as well as healthy control subjects, such that for each subject manual delineations of both left and right GPe and GPi are obtained by experts to train the network end-to-end. An overall illustration of the proposed process is given in Figure 1. In the next sections, we outline the mechanism behind GP-net, provide extensive experimental validation and finish with a discussion and concluding remarks.

| Overview
GP-net is trained end-to-end in a fully supervised manner. Manual 3D delineations performed by domain experts for both the GPe and GPi are extracted per each patient's 7 T T2 scan. Thus, the network is fed with pairs of 7 T T2 volumes and corresponding 3D manual segmentations in the training procedure.
To evaluate the performance of GP-net, as well as the quality of its patient-specific segmentations, the resulting automatic delineations are compared against manual GP segmentations, performed on 7 T T2 scans by the same group of experts, from a test set which was excluded from the training phase of the network. The following metrics against the manual delineations are compared: dice score, center of mass Euclidean distance, volume and mean surface differences. In addition, we provide an extensive quantitative comparison between GP-net and four state-of-the-art atlases; GP-net shows significant improvement over existing atlas-based techniques. A test-retest experiment was performed to assess the consistency of GP-net in segmenting the same structure of the same patient over several different scans acquired over the course of days. Moreover, to evaluate the clinical utility of the proposed method, we compared the postsurgery DBS lead location as determined based on the manual segmentation and on the automatic GP-net segmentations.

FIGURE 1 Proposed GP-net for the automatic and patient-specific segmentation of the GPe and GPi from acquired 7 T T2 volumes. GP-net is an attention gated U-net, trained end-to-end from pairs of 7 T T2 scans and manual 3D delineations of both GPe and GPi. To increase training size, each training pair is augmented by flipping the images along the sagittal plane. The resulting network outputs are the automatically segmented GPe (orange) and GPi (blue) 3D volumes.

| Scanning protocol
Patients were scanned on a 7 T MRI scanner (Magnetom 7 T Siemens, Erlangen, Germany) using our previous published protocols (Abosch et al., 2010;Duchin et al., 2018). The scanner was equipped with SC72 gradients capable of 70 mT/m and a 200 T/m/s slew rate using a 32-element head array coil (Nova Medical, Inc., Burlington, MA). On the day of scanning, the patients were instructed to take their usual medication in order to optimize patient comfort and minimize motion.
Whenever patient head size enabled enough space in the coil, dielectric pads were utilized in order to enhance signal in the temporal regions (Teeuwisse et al., 2012). The scan protocol consists of a T1-weighted whole brain scan (0.6 mm isotropic) and a T2-weighted axial slab covering from the top of the thalamus to the bottom of the substantia nigra with 0.39 × 0.39 × 1 mm³ resolution. The T1-weighted scan was used only for atlas-based registration when comparing GP-net with the different atlases, and was not used in the network's training, inference, or validation phases (details are given in Sections 2.6 and 3).

| Database and preprocessing
A cohort of 101 subjects, including 24 healthy controls and 77 movement disorder (PD and ET) patients participated in this study. All subjects were scanned on the 7 T scanner; patients were scanned prior to their DBS surgeries. Even though ET patients do not typically undergo GPi-DBS, we also chose to include their imaging data in our dataset.
Out of the 101 participants, images from 58 participants were used for training and 43 for testing; demographic details are given in Tables 1 and 2. In addition, one subject was scanned three times on 2 days with the same scanning protocol to assess the network's stability. For each subject in the cohort, manual delineations of both the left and right GPe and GPi were obtained by three independent experts from the scanned T2 volumes. Final manual delineations for training and testing per each subject were obtained by a consensus between all experts.
To increase the number of training pairs (T2 images and manual delineations), each training pair was mirrored along the sagittal plane ( Figure 1). Thus, the number of training pairs was doubled and provided more training examples of spatially translated GP structures.
The T2 volumes and manual segmentations were resampled to an isotropic grid of 0.39 × 0.39 × 0.39 mm³ with a nearest-neighbor interpolation kernel prior to training and inference. All of the quantitative analysis is performed on the resampled grid. This study was approved by the Institutional Review Board at the University of Minnesota and all participants gave their informed consent.
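The sagittal mirroring augmentation described above can be sketched as follows. This is a minimal NumPy illustration; the array axis that spans left-to-right depends on the volume orientation, so the `sagittal_axis` default here is an assumption, not the authors' implementation.

```python
import numpy as np

def mirror_augment(t2, labels, sagittal_axis=0):
    """Double the training set by flipping each T2 volume and its
    label map across the sagittal plane (left-right mirror).
    `sagittal_axis` is the array axis spanning left-to-right
    (assumed here; depends on the volume's orientation)."""
    t2_flip = np.flip(t2, axis=sagittal_axis).copy()
    labels_flip = np.flip(labels, axis=sagittal_axis).copy()
    return [(t2, labels), (t2_flip, labels_flip)]
```

Since the GPe/GPi labels are not lateralized in this scheme (left and right structures share a class), no label remapping is needed after the flip.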

| Network architecture
As previously mentioned, GP-net is a fully convolutional 3D deep neural network. Its base architecture consists of an attention gated U-net, in which some of the inner 3D convolution layers were replaced by 3D deformable convolutions. The basic U-net architecture is composed of two paths, encoder and decoder paths, which consist of aggregated layers of convolutions, max-pooling, and ReLU activations.
Each encoder layer shrinks its input size by a factor of 2. Thus, at the end of the encoder stage, the feature dimensions are shrunk by a factor of 2^m, where m is the number of encoder stages (assuming isotropic max-pooling in all input dimensions). The output of this stage is then fed into the decoder, which consists of the same number of m layers; each layer is built with convolution layers and upsampling by a factor of 2. The final output of the network has the same dimensions as the input. In this work, m = 4. Each encoder stage is composed of two consecutive blocks of 3D convolution (same parameters in both blocks), 3D batch normalization, and ReLU activation, followed by a max pooling layer of size 2 × 2 × 2 voxels. Deformable convolution layers consist of a standard 3D convolution kernel to estimate the offsets, a deformable convolution layer, followed by another standard 3D convolution (all convolutions of the same kernel size and padding). Detailed parameters of the convolution layers are given in Table 3. In addition, as illustrated in Figure 1, each layer in the encoder stage is directly connected to its corresponding decoder stage, traditionally using skip connections. Skip connections allow more efficient gradient flow through the network during the training stage and prevent loss of contextual information at multiple image scales. Following (Schlemper et al., 2019), in this study the encoder stages are connected to the corresponding decoder stages via attention gates. A detailed description of the attention gate architecture, as well as the overall network architecture, is given in (Schlemper et al., 2019). GP-net has three attention gates, for the second, third, and fourth layers of the decoder stage. Each block in the decoder stage is composed of a 3D convolution kernel, followed by an upsampling operator by a factor of 2 × 2 × 2.
At each decoder level, the resulting attention signal and the corresponding upsampled feature map from the previous, lower decoder stage are concatenated and convolved with a 3D convolution filter of kernel size 1, with no padding and no dilation. Deformable convolutions in the decoder stage are structured the same way as deformable convolutions in the encoder stage. Between the encoder and decoder there is another stage (the central stage), whose output is used as the gating signal for the fourth AG.
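As a rough illustration of the additive attention gating of (Schlemper et al., 2019), the attention coefficient for a single spatial location can be sketched in NumPy as below. This is a toy, one-location version: the actual network applies these projections convolutionally over 3D feature maps, and all weight names here are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, W_x, W_g, psi):
    """Additive attention for one spatial location.
    x   : skip-connection feature vector from the encoder
    g   : gating feature vector from the coarser decoder stage
    W_x, W_g, psi : learned projections (hypothetical shapes)
    Returns x scaled by an attention coefficient in (0, 1)."""
    q = np.maximum(W_x @ x + W_g @ g, 0.0)  # ReLU(W_x x + W_g g)
    alpha = sigmoid(psi @ q)                # scalar gate in (0, 1)
    return alpha * x
```

Because alpha lies in (0, 1), the gate can only attenuate the skip features, which is how irrelevant activations get suppressed before concatenation in the decoder.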

| Training loss function
GP-net is trained end-to-end using pairs of T2 volumes and corresponding manual delineations. The training loss function used in this study is a combination of several loss functions, detailed below.
Tversky loss (Salehi et al., 2017): In the case of binary segmentation (e.g., the network's output and a manual delineation), the Tversky index between two groups A and B is written as

TI(A, B) = tp / (tp + α·fp + β·fn),

where tp stands for true positives (correctly classified voxels), fp stands for false positives (wrongly classified voxels), fn stands for false negatives (wrongly misclassified voxels), and α and β are corresponding weights. The Tversky loss is taken as 1 − TI, as we wish to minimize the loss function through gradient descent. The Tversky loss was reintroduced and utilized in the context of deep-learning based segmentation (Kim et al., 2020; Salehi et al., 2017) as an efficient tool for handling imbalanced class labels, since the parameters α and β control the relative weights of false positives and false negatives. In this study, we choose these parameters according to Table 4.
The Tversky index can be considered a generalization of the dice index (Dice, 1945); indeed, taking α = β = 0.5 recovers the dice index. Additionally, and similarly to (Kim et al., 2019), to minimize possible overlap between two different classes in the segmentation, we penalize (minimize) the dice score between each class prediction and the labels of every other class. This term is weighted by a factor of 0.01 relative to the Tversky loss.
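A minimal NumPy sketch of the Tversky loss defined above, for binary masks; the smoothing constant `eps` is an assumption added for numerical stability and is not stated in the text.

```python
import numpy as np

def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-6):
    """1 - Tversky index between a binary prediction and target.
    With alpha = beta = 0.5 this reduces to 1 - dice."""
    pred = pred.astype(float)
    target = target.astype(float)
    tp = np.sum(pred * target)           # true positives
    fp = np.sum(pred * (1.0 - target))   # false positives
    fn = np.sum((1.0 - pred) * target)   # false negatives
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)
```

Setting alpha = beta = 0.5 makes the denominator tp + (fp + fn)/2, so the index coincides with the dice coefficient, as noted above.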
Hausdorff distance (Karimi & Salcudean, 2020): the Hausdorff distance measures the largest distance between two given contours. Thus, minimizing the Hausdorff distance can be thought of as minimizing the worst case, or largest outlier, distance between the network's segmentation and the manual segmentation, which is indicative of the largest segmentation error. Given two point sets X and Y, the one-sided Hausdorff distance is defined as (Karimi & Salcudean, 2020; Rockafellar & Wets, 2009)

hd(X, Y) = max over x ∈ X of ( min over y ∈ Y of ‖x − y‖₂ ),

and the bidirectional Hausdorff distance is given by

HD(X, Y) = max( hd(X, Y), hd(Y, X) ).

Since the Hausdorff metric is highly nondifferentiable, we require a differentiable proxy in order to use back-propagation. We rely on the estimator given in equation (8) of (Karimi & Salcudean, 2020). This estimator is a smooth approximation of the Hausdorff distance, which allows back-propagation using gradient descent. In practice, we found it most efficient to minimize the Hausdorff distance for the entire GP (left and right sides together). The initial weight relative to the Tversky loss is 0.00001 and increases by a factor of 5 every 50 epochs.
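For reference, the exact (nondifferentiable) bidirectional Hausdorff distance from the definitions above can be computed directly; note that this is the plain definition, not the smooth training-time estimator of (Karimi & Salcudean, 2020).

```python
import numpy as np

def hausdorff(X, Y):
    """Bidirectional Hausdorff distance between point sets
    X (n, d) and Y (m, d): HD = max(hd(X, Y), hd(Y, X)),
    where hd(X, Y) = max_x min_y ||x - y||_2."""
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    hd_xy = D.min(axis=1).max()  # one-sided hd(X, Y)
    hd_yx = D.min(axis=0).max()  # one-sided hd(Y, X)
    return max(hd_xy, hd_yx)
```

The full pairwise distance matrix makes this O(n·m) in memory, which is fine for contours but motivates the distance-transform-based approximations used during training.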
We train the network for 94 epochs using stochastic gradient descent with a learning rate of 0.0001 and a momentum factor of 0.9.
Batch size is 1. GP-net was implemented in Python 3.6 with PyTorch 1.4 and trained on a single Nvidia V100 GPU with 32 GB of memory.
It takes 3 days to train GP-net on this GPU (done only once).
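The Hausdorff-term weight schedule described above (initial weight 0.00001, multiplied by 5 every 50 epochs) amounts to a simple step function; the function name here is ours, not from the original implementation.

```python
def hausdorff_weight(epoch, w0=1e-5, factor=5.0, step=50):
    """Weight of the Hausdorff loss term relative to the Tversky
    loss: starts at w0 and is multiplied by `factor` every `step`
    epochs."""
    return w0 * factor ** (epoch // step)
```

Over the 94 training epochs reported, the weight therefore takes only two values: 1e-5 for epochs 0-49 and 5e-5 from epoch 50 onward.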

| Validation
To evaluate GP-net, we performed an extensive quantitative analysis of its performance over 43 subjects. These subjects were not included in the training cohort and were only used for inference and evaluation of the network's performance.
GP-net is compared to the manual segmentation performed by domain experts and against four publicly available state-of-the-art atlases to quantify its performance and validate its reliability. Since GP-net is patient-specific and operates on the patient's (isotropically resampled) T2 volumetric scan, to perform a fair comparison we first register the atlases to the same T2 space. This registration process starts with a registration of the T1 MNI ICBM2009b (3 T) template to a 0.39 mm isotropic MNI template (3 T) using the Advanced Normalization Toolbox (ANTs) (Avants et al., 2011), via the antsApplyTransforms command and the LanczosWindowedSinc interpolation kernel. Next, the registered template is registered to the patient-specific (0.39 mm isotropically resampled) T1 scan (7 T). This is done via a combination of the FLIRT and FNIRT modules from the FSL toolbox (Jenkinson et al., 2012) (implemented via HCP pipelines), using the ApplyWarp command and a spline interpolation kernel. The final registration stage registers the patient's T1 scan to his/her (isotropically resampled) T2 scan (7 T) using ANTs with a B-spline interpolation kernel and linear registration. The same transformations are applied to the atlases with a nearest-neighbor interpolation kernel. All registrations were verified visually.
We note that some of the atlases, such as the DISTAL (Ewert et al., 2018) and CIT168 (Pauli et al., 2018), are probabilistic, while GP-net provides a deterministic segmentation map. To make a fair comparison, we have thresholded these atlases with a value of 0.001, meaning that for each class (GPe/GPi), voxels with values below 0.1% probability are zeroed out, while voxels with higher probability are given a value of 1 (GPe) or 2 (GPi). This threshold value was validated visually for each atlas and compared with the T2 ICBM2009b template. We also tried different values (such as 0.01), which yielded similar quantitative results. All registrations applied to the atlases were applied after the thresholding of the probabilistic atlases, in order to make a fair comparison with the nonprobabilistic segmentations of GP-net.
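The thresholding step can be sketched in NumPy as below. The convention of letting GPi take precedence where both probability maps exceed the threshold is our assumption; the text does not specify how overlapping supra-threshold voxels are resolved.

```python
import numpy as np

def binarize_atlas(prob_gpe, prob_gpi, thr=0.001):
    """Turn probabilistic GPe/GPi maps into a deterministic label
    map: 0 = background, 1 = GPe, 2 = GPi (GPi overrides voxels
    where both maps exceed the threshold -- an assumed convention)."""
    seg = np.zeros(prob_gpe.shape, dtype=np.uint8)
    seg[prob_gpe >= thr] = 1
    seg[prob_gpi >= thr] = 2
    return seg
```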
On a final note, in some segmentation cases, we observe that the output of GP-net might contain small segmented regions ("islands") which are clearly not related to either the GPe or GPi. These regions are easily removed automatically with a small postprocessing step which removes all segmentation regions except for the largest four (left and right GPe and GPi) through the use of the connected components algorithm (Rosenfeld & Pfaltz, 1966).
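The island-removal postprocessing can be sketched with SciPy's connected-components labeling, keeping only the k largest components; the exact connectivity structure used by the authors is not specified, so the SciPy default is assumed here.

```python
import numpy as np
from scipy import ndimage

def keep_largest_components(mask, k=4):
    """Zero out all connected components of a label mask except
    the k largest (for GP-net, k=4: left/right GPe and GPi)."""
    labeled, n = ndimage.label(mask > 0)
    if n <= k:
        return mask
    sizes = np.bincount(labeled.ravel())[1:]   # size of each component
    keep = np.argsort(sizes)[-k:] + 1          # labels of the k largest
    return np.where(np.isin(labeled, keep), mask, 0)
```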

| Metrics and statistics
We use the following metrics to compare the performance of the different segmentation techniques: dice score, center of mass (CoM) Euclidean distance, mean surface distance (MSD), and volume. Dice, CoM, and MSD are calculated against the manual GPe and GPi segmentations. The CoM distance describes how well the segmented structure is localized in space (i.e., inside the brain), while the dice, MSD, and volume measurements describe how well the shape of the structure is captured by the different segmentation techniques.
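Two of these metrics can be sketched in NumPy; the 0.39 mm voxel size follows the isotropically resampled grid described earlier (isotropic voxels are assumed in the conversion).

```python
import numpy as np

def dice_score(a, b):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def com_distance_mm(a, b, voxel_mm=0.39):
    """Euclidean distance between the centers of mass of two
    binary masks, converted from voxels to millimeters."""
    ca = np.array(np.nonzero(a)).mean(axis=1)
    cb = np.array(np.nonzero(b)).mean(axis=1)
    return float(np.linalg.norm(ca - cb) * voxel_mm)
```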
A one-way analysis of variance (ANOVA) was calculated for each metric, followed by a multiple comparison correction and post hoc tests with Tukey's honest significant difference to determine statistical significance between the different methods.
The matrices (Figures 2 and 3) indicate the statistical significance between each pair of methods. Each cell in the matrices corresponds to the p value that the method written in its corresponding row is statistically different from the method written in its corresponding column. A blue cell indicates a p value lower than 0.1%, a green cell indicates a p value lower than 5%, and a red cell indicates a p value higher than 5%. An additional metric which we utilize to compare between the different segmentations is the precision versus recall rate of each method.

FIGURE 2 (a) Dice score comparison between GP-net and the different atlases for both GPe (blue) and GPi (orange)

| RESULTS
In this section, we present a detailed quantitative and qualitative analysis of the performance of GP-net, compared against state-of-the-art atlases and the experts' segmentations, as well as a stability test for GP-net. The atlases, however, exhibit lower dice scores than GP-net and higher distribution variance. GP-net is shown to consistently produce higher dice scores, without much variation between the different age groups.
Consistency of performance, as demonstrated for the proposed GP-net, is critical for DBS and for real-world deployment.
Lastly, we compare the precision versus recall rates of each of the methods. Figure

| Stability test
We further test the stability of the network to assess GP-net's systematic behavior. We acquired three independent and repeated 7 T scans of a single healthy control subject over a period of 2 days and performed inference using GP-net. Figure 5 presents the results for the GPe and GPi, respectively. These ranges are well inside the deviations reported in the literature for easier-to-compute brain characteristics, such as total volume. This analysis demonstrates GP-net's ability to produce reliable and consistent segmentations, not only between different patients, as was previously shown, but also between different scans of the same subject over time. Since GP-net performs inference directly on the isotropically resampled grid (0.39 mm isotropic), its output is smooth, as can be seen in the 3D reconstruction, as opposed to the manual delineation (first row), which was segmented on the original 0.39 × 0.39 × 1 mm³ grid. To make a fair comparison, the atlas-based segmentations were registered from ICBM2009b to the isotropically resampled T2 grid through a 0.39 mm isotropic MNI template. However, even though we use the same grid spacing, these segmentation results appear more pixelated. Table 6 summarizes the different metrics for this example.

| Qualitative examples
The second unique example is illustrated in Figure 9. In this example, two PD patients (panels a and b, respectively) had to be rescanned, as the first T2 scan was noticeably blurry. For each patient a second, free of motion ("sharp") image was acquired to allow accurate and reliable manual delineation of both the GPe and GPi. In each of the panels a and b, each row corresponds to a 3D reconstruction, a

| DISCUSSION
Alterations in neuronal activity in the internal segment of the GP have been shown to be correlated with motor symptoms of PD. For example, animal models of PD have shown a characteristic increase in neuronal activity in both the STN and the GPi (Obeso et al., 2001; Wichmann et al., 1994). Lesions applied to these regions have shown striking improvement in motor function. Moreover, the creation of lesions in the GPi of PD patients has been reported to improve contralateral dyskinesia and provide moderate antiparkinsonian benefits (Baron et al., 1996, 2000; Obeso et al., 2001; Vitek et al., 2003). However, adverse effects which may be caused by lesions cannot be averted once surgery is performed (Kringelbach et al.).

FIGURE 8 GPe/GPi segmentation of a PD patient with irregular blood vessels. Upper row, left to right: 3D reconstruction of the manual segmentation of the GPe (green) and GPi (yellow). Blood vessels are in red (smoothed only for visualization purposes; no other smoothing was applied to any structure). Different panels correspond to selected axial T2 slices, going from the inferior side to the superior side of the brain. Middle row: GP-net reconstruction (GPe in orange and GPi in blue). Bottom row: segmentation based on the DISTAL atlas (Ewert et al., 2018). Other atlases produced similar results to the DISTAL atlas and were thus omitted for brevity. The blue arrow points in the superior direction, the green arrow in the anterior direction, and the red arrow in the right direction.

GP-net is a deep-learning based segmentation technique specifically tailored to an accurate and robust segmentation of both the GPe and GPi. Although GP-net was described here in the context of DBS surgery, all clinical procedures which require presurgery GP trajectory planning, such as DBS surgery and magnetic resonance guided focused ultrasound (Ebani et al., 2020; Miller et al., 2020; Zaaroor et al., 2018), can benefit from this method. In this study, we have utilized recent advances in ultra-high field MRI scanners and acquisition protocols and trained the network in an end-to-end manner on pairs of 7 T T2 acquisitions and manual delineations produced by domain experts.

FIGURE 9 Robustness of GP-net segmentation under motion conditions. Panels (a) and (b) correspond to two different PD patients for which the first scan suffered significant motion blur. The 3D reconstruction column shows 3D reconstructions of both the GPe (green in the manual segmentation and orange in the two GP-net reconstructions) and GPi (yellow in the manual segmentation and blue in the two GP-net reconstructions). No smoothing was applied at any stage to the reconstructions or manual delineations. The axial view column presents a selected T2 axial slice of the brain; apparent motion blurring can be seen in the lowest panel. The axial zoom column illustrates the same slice, zoomed in, with superimposed outlines of the manual delineation and the corresponding segmentations of GP-net from the sharp image (second row) and from the blurred image (third row). The rightmost column provides the metric values.
GP-net is based on several key components: it is a U-net structure which relies on 3D convolutions (Goodfellow et al., 2016) as well as the recently introduced 3D deformable convolutions. It also relies on data augmentation to effectively increase the size of the training set.
By mirroring each T2 scan (and the corresponding manual delineation) across the midsagittal plane, that is, swapping left and right, we effectively double the amount of training data. The trained network is able to produce fast (a few seconds per subject), accurate, and reliable GPe and GPi segmentations from new 7 T T2 images.
The results presented in this study lead to two key observations.
First, for all the metrics considered in this study, the deep-learning based GP-net was found to be superior to all the atlas-based segmentations tested. GP-net exhibited improved average Dice scores (above 0.8), indicating its ability to capture the shape of the structure more accurately. A higher mean value along with reduced variance indicates that not only does GP-net perform better on average, it is also more stable and has far fewer outlier segmentations. This conclusion is also supported by the p value matrices below panel (a) of Figure 2: the first row clearly indicates that GP-net's results differ statistically from the atlas-based segmentations. On the other hand, some atlas-based segmentations do not significantly differ from one another (e.g., Ewert et al., 2018 and Pauli et al., 2018), which is indicative of similar performance. Different atlases are constructed from different datasets, each with its own possible bias (e.g., an atlas based on PD patients versus an atlas based on healthy subjects), modalities, and reconstruction techniques, which may account for the variation in performance between them.
Moreover, the average CoM difference, an indication of how well the structure can be localized in the brain, was measured to be on the order of ≈0.7 mm, which corresponds to less than two voxels on the resampled grid (0.39 mm isotropic). This CoM localization error (with respect to the manual delineations) is below the slice thickness of the acquired 7 T T2 volumes (1 mm). The average mean surface distance (MSD) for both GPe and GPi was measured to be ≈0.4 mm, on the order of a single voxel on the resampled T2 grid.
These numerical results represent a significant improvement over the atlas-based registrations, which achieve average Dice scores of about 0.45-0.66 and an average CoM error larger than 1.6 mm, corresponding to four voxels on the resampled grid. The average MSD for the atlas-based approaches is between 0.74 and 1.32 mm, higher than the MSD reported for GP-net.
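For concreteness, the Dice and CoM metrics discussed above can be computed from two binary masks as sketched below. This is a generic illustration on a toy example, not the paper's evaluation code; the MSD is computed analogously from the masks' boundary voxels (e.g., via distance transforms) and is omitted here for brevity.

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks (1.0 = perfect agreement)."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def com_distance(a, b, voxel_mm=0.39):
    """Euclidean distance between the masks' centers of mass, in mm,
    assuming the isotropic 0.39 mm resampled grid mentioned in the text."""
    ca = np.array(np.nonzero(a)).mean(axis=1)
    cb = np.array(np.nonzero(b)).mean(axis=1)
    return float(np.linalg.norm(ca - cb)) * voxel_mm

# Toy example: two 4x4x4 boxes shifted by one voxel along one axis
a = np.zeros((10, 10, 10), bool); a[2:6, 2:6, 2:6] = True
b = np.zeros((10, 10, 10), bool); b[3:7, 2:6, 2:6] = True
print(round(dice(a, b), 2))          # 0.75
print(round(com_distance(a, b), 2))  # 0.39: a one-voxel CoM shift
```

On this toy pair, a single-voxel shift already costs a quarter of the Dice score, which illustrates why sub-voxel CoM errors matter for small structures such as the GPi.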
In many cases, there exists a trade-off between the precision and recall rates. As exemplified in Figure 4, the precision rate of the atlas-based segmentation of Xiao et al. (2017) is notably higher than its recall rate. No such trade-off is apparent for GP-net. The other atlas-based segmentations do not exhibit this trade-off either, but their rates are clearly lower than those of GP-net. GP-net exhibits both the highest precision and recall rates, both approaching the maximum value of 1, which further validates its superior performance over the atlas-based segmentations.

Variability between patients has been previously reported (e.g., Kim et al., 2019; Lenglet et al., 2012; Patriat et al., 2018). This intrinsic variability between subjects motivates the need for patient-specific care, and for suitably tailored algorithms to address this need. Standard atlases, which are typically defined in a normalized space, have shown great value in retrospective population studies (Horn, Neumann, et al., 2017; Horn, Reich, et al., 2017; Kim et al., 2019).

One of the key advantages of the proposed method is that GP-net relies solely on 7 T T2 scans to perform its inference for visualization of the GPe/GPi. No registrations are involved in its segmentation process; GP-net therefore performs segmentation directly in the patient's coordinate system. Atlas-based segmentations, on the other hand, are given in a normalized space (i.e., MNI space) and must be registered into the patient's unique space prior to any trajectory planning. They are thus greatly affected by the registration process, which can adversely affect the outcomes of DBS surgery. Unfortunately, registration errors cannot be modeled easily, are typically unpredictable, and often have large variance (Kim et al., 2019).
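The precision/recall trade-off mentioned above can be made concrete with voxel-wise definitions on binary masks. The following is a generic sketch on a toy example, not the paper's evaluation code:

```python
import numpy as np

def precision_recall(pred, truth):
    """Voxel-wise precision and recall of a predicted binary mask
    against a ground-truth mask."""
    tp = np.logical_and(pred, truth).sum()
    return tp / pred.sum(), tp / truth.sum()

# Toy example: an over-segmenting prediction fully covers the truth,
# so recall is perfect while precision drops -- the trade-off at work.
truth = np.zeros((10, 10, 10), bool); truth[2:6, 2:6, 2:6] = True
pred = np.zeros((10, 10, 10), bool);  pred[2:8, 2:6, 2:6] = True
prec, rec = precision_recall(pred, truth)
print(round(prec, 3), rec)  # 0.667 1.0
```

A method without the trade-off, such as GP-net in Figure 4, keeps both rates high simultaneously rather than trading one for the other.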
Moreover, the registration process often involves several successive registrations, and the errors tend to accumulate with each step. Due to these factors, registrations often must be verified manually, as was done in this study for the atlas-based registrations, which dramatically prolongs the processing time per patient.
Although the atlas-based registrations and final segmentations were performed on the same grid as GP-net, they appear more pixelated, since no spatial filtering of the acquired T2 volume is performed to infer the segmentations. In contrast, GP-net operates directly on the 0.39 mm isotropically resampled grid of the T2 volume and, through its 3D convolution filters, produces smooth segmentations and higher quality detection of both parts of the GP.
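A resampling step of the kind referred to above (bringing an anisotropic acquisition onto the 0.39 mm isotropic grid) might look as follows. This is a hedged sketch using `scipy.ndimage.zoom`, not the paper's preprocessing pipeline; the toy spacing values are assumptions.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_isotropic(volume, spacing_mm, target_mm=0.39, order=1):
    """Resample a volume onto an isotropic grid (0.39 mm, as in the text).

    spacing_mm gives the input voxel spacing per axis. Linear
    interpolation (order=1) suits intensity images; label maps should
    use order=0 (nearest neighbor) so label values are not blended."""
    factors = [s / target_mm for s in spacing_mm]
    return zoom(volume, factors, order=order)

# Toy volume with 1.0 x 0.39 x 0.39 mm voxels (hypothetical spacing)
vol = np.random.rand(16, 16, 16).astype(np.float32)
iso = resample_isotropic(vol, spacing_mm=(1.0, 0.39, 0.39))
print(iso.shape)  # (41, 16, 16): the 1 mm axis is upsampled by ~2.56
```

Running the segmentation network directly on this grid is what lets the 3D convolution filters produce sub-millimeter, smooth boundaries.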
Since atlas-based segmentations rely on a single (often average) depiction of the target structure, such a process cannot truly account for interpatient variability in a patient-specific manner, as clearly exemplified by the segmentation results depicted in Figure 8. The middle row of Figure 8 shows that GP-net is able to segment the GP while accounting for the blood vessels, even though it was not trained on such irregular cases. GP-net accommodates this irregular vessel shape and still produces clear GP segmentations that align with the underlying 7 T scan anatomy and are relatively close to the manual segmentation. The atlas-based segmentation presented in the lower row of Figure 8, on the other hand, is misaligned with the patient's anatomy; Table 6 further validates and quantifies this misalignment. This example indicates the true potential of deep-learning based approaches such as GP-net as patient-specific automatic segmentation techniques, even for irregular and unique cases and for nonideal scanning conditions (Figure 9), without any additional training.
As mentioned before, the internal and external segments of the GP are separated by a thin lamina layer. When using microelectrode recordings (MER) during DBS surgery, the lamina layer is often characterized by the absence of somatodendritic action potentials, which characterize both the GPe and GPi, each with its own characteristic firing pattern (Baron et al., 1996, 2000; Lozano & Hutchinson, 2002).

Different parameters, such as the acquisition sequence, resolution, and field strength, may also affect the performance of GP-net, since at this stage GP-net was trained only on 7 T T2 data acquired with the protocol described in this manuscript. Thus, GP-net is currently not optimized for inference on other modalities and/or data acquired at different field strengths (see, for example, Figure S2). This is in fact a common limitation of many deep-learning based architectures, and achieving increased robustness in the face of changing datasets (e.g., varying signal-to-noise ratios, resolution, etc.) is a matter of ongoing research in the machine learning community (often denoted domain shift or domain transfer). Extending this study to operate on standard clinical images (1.5-3 T) is the subject of future work. Additionally, deep-learning frameworks currently lack interpretability. In some cases the inference may result in suboptimal performance (e.g., in panel (a) of Figure 2, two GPe Dice scores for GP-net are below 0.6), and it is not always easy to understand why the network behaved as it did. Fortunately, as was statistically verified in this study, such occurrences appear to be rare. Exploiting standard CNN visualization tools can potentially educate the user on the internal workings of GP-net.
The clinical potential of GP-net is clearly exemplified in Figure 7, which presents an excellent match between the implanted DBS electrode and both the manual and GP-net segmentations of the GPe/GPi for a representative PD patient. GP-net will give the clinical team a much better understanding of the correlation between lead locations and outcomes, and might help reduce the extra operative time invested in planning the surgical procedure. For DBS surgery, both the center of mass of the DBS target and the accurate identification of its borders are of great importance. This figure, supported by the reported CoM distance and MSD results, shows that an accurate depiction of the DBS target achieves this goal. Accurate identification can also contribute to reliable and accurate DBS lead placement, which has been associated with improved clinical outcomes for leads placed in the STN (Richardson et al., 2009). This often results in reduced programming time in the clinic for optimizing symptom reduction, and in an overall improvement in the patient's quality of life (Hell et al., 2019). GP-net is currently being used under a research protocol to assist with DBS surgery preplanning and postoperative lead location assessment based on 7 T T2 scans at the University of Minnesota Medical School.
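A postoperative lead location assessment of the kind mentioned above ultimately reduces to asking whether a contact coordinate falls inside the segmented structure. The following is a deliberately simplified sketch, not the study's assessment tool: it assumes the contact position is already expressed in the mask's space with an isotropic 0.39 mm spacing and the origin at voxel (0, 0, 0), whereas a real pipeline would apply the scan's full affine transform.

```python
import numpy as np

def contact_in_structure(contact_mm, mask, voxel_mm=0.39):
    """Check whether a DBS contact coordinate (in mm) lies inside a
    segmented structure given as a binary voxel mask."""
    idx = tuple(int(round(c / voxel_mm)) for c in contact_mm)
    # Coordinates outside the volume cannot be inside the structure
    if any(i < 0 or i >= s for i, s in zip(idx, mask.shape)):
        return False
    return bool(mask[idx])

gpi = np.zeros((20, 20, 20), bool)  # toy "GPi" mask
gpi[8:12, 8:12, 8:12] = True
print(contact_in_structure((3.9, 3.9, 3.9), gpi))  # True: voxel (10, 10, 10)
print(contact_in_structure((0.0, 0.0, 0.0), gpi))  # False: outside the mask
```

Repeating this check per contact, or measuring each contact's distance to the structure boundary, yields the kind of lead-versus-target report discussed in the text.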

| CONCLUSIONS AND FUTURE STUDY
In this study, we presented GP-net, a deep-learning based neural network for the segmentation of both the GPe and GPi. GP-net produces accurate and reliable segmentations in a fully automated manner from 7 T T2 MR acquisitions, both for healthy subjects and for PD and ET patients, across a wide age range. The network is trained end-to-end on pairs of acquired 7 T T2 scans and corresponding manual delineations of both GPe and GPi. We have shown, both qualitatively and quantitatively, that GP-net outperforms state-of-the-art atlas-based segmentations and produces stable, consistent, high quality patient-specific segmentations, while reducing potential biases.
GP-net is tailored for segmentation of the GP (both internal and external segments), but it can be extended to segment additional subcortical structures of interest for DBS surgery, such as the STN, red nucleus, and substantia nigra. This extension is currently being investigated in our group. With the ability to reliably segment all DBS-related targets, we further plan to investigate the relationship between patients' anatomical properties (such as structure volume), as ascertained from MRI, and their clinical characteristics. For example, a correlation between STN volumes and Unified Parkinson's Disease Rating Scale (UPDRS) III scores has previously been reported.
With the incorporation of advanced DBS presurgical targeting and postsurgical lead localization tools and software, 7 T MRI based approaches, whether for training or for deployment, have great potential to become clinical standards, especially now that 7 T MRI is FDA approved for standard clinical applications. In this scenario, fully automated segmentation software may prove very advantageous, providing an accurate, fast, easy, and reliable visualization tool that contributes to an improved surgical procedure and patient experience.

ACKNOWLEDGMENTS
This study was funded by the following National Institutes of Health grants: R01 NS081118, R01 NS113746, P50 NS098753, P30 NS076408, and P41 EB027061. Additional support by the NSF (GS) and the Department of Defense (GS) is also acknowledged.

AUTHOR CONTRIBUTIONS
Oren Solomon designed and wrote the algorithm, performed the analysis and wrote the manuscript. Tara Palnitkar, Remi Patriat, and Henry Braun acquired the 7 T MRI data and performed the manual segmentations and participated in the writing of the manuscript. Joshua Aman, Michael C. Park, and Jerrold Vitek provided clinical insight, performed the manual segmentations as well as participated in the writing of the manuscript. Guillermo Sapiro participated in designing the algorithm and writing of the manuscript. Noam Harel participated in designing the algorithm, performing the analysis, acquiring the 7 T MRI data, obtaining the manual segmentations and writing the manuscript.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.