Combined Atlas and Convolutional Neural Network-Based Segmentation of the Hippocampus from MRI According to the ADNI Harmonized Protocol

Hippocampus atrophy is an early structural feature that can be measured from magnetic resonance imaging (MRI) to improve the diagnosis of neurological diseases. An accurate and robust standardized hippocampus segmentation method is required for reliable atrophy assessment. The aim of this work was to develop and evaluate an automatic segmentation tool (DeepHarp) for hippocampus delineation according to the ADNI harmonized hippocampal protocol (HarP). DeepHarp utilizes a two-step process. First, the approximate location of the hippocampus is identified in T1-weighted MRI datasets using an atlas-based approach, which is used to crop the images to a region-of-interest (ROI) containing the hippocampus. In the second step, a convolutional neural network trained using datasets with corresponding manual hippocampus annotations is used to segment the hippocampus from the cropped ROI. The proposed method was developed and validated using 107 datasets with manually segmented hippocampi according to the ADNI-HarP standard as well as 114 multi-center datasets of patients with Alzheimer’s disease, mild cognitive impairment, cerebrovascular disease, and healthy controls. Twenty-three independent datasets manually segmented according to the ADNI-HarP protocol were used for testing to assess the accuracy, while an independent test-retest dataset was used to assess precision. The proposed DeepHarp method achieved a mean Dice similarity score of 0.88, which was significantly better than four other established hippocampus segmentation methods used for comparison. At the same time, the proposed method also achieved a high test-retest precision (mean Dice score: 0.95). In conclusion, DeepHarp can automatically segment the hippocampus from T1-weighted MRI datasets according to the ADNI-HarP protocol with high accuracy and robustness, which can aid atrophy measurements in a variety of pathologies.


Introduction
Dementia has devastating effects on patients as well as their families and caregivers [1]. It is now becoming apparent that for new treatments to be effective, patients will have to receive preventative or disease-modifying drugs in the earliest, and possibly even presymptomatic, stages [2]. Pathological global and regional cerebral atrophy reflects neuronal cell loss that can be measured from serially acquired MRI scans even before the onset of cognitive impairment has been identified. For this reason, it has been suggested as a potential early identifiable precursor of cognitive decline [3]. Within this context, longitudinal hippocampus volume changes (atrophy rates) have been found especially valuable for the identification of subjects at risk of imminent cognitive decline [4,5]. Hippocampus atrophy can also be used to track disease progression and has the potential to serve as a study outcome measure in pre-dementia enrichment clinical trials [6,7]. Hippocampus atrophy has been found to be a reliable biomarker to identify subjects in the early stages of Alzheimer's disease (AD) and to predict the time to progression to AD in subjects with mild cognitive impairment (MCI) who have amyloid-positive results [8,9]. Cross-sectional population-based studies have also shown that hippocampus volume and hippocampus atrophy rate can predict future cognitive decline in cognitively normal subjects 5-10 years before symptoms develop [8][9][10].
Accurate hippocampus volume measurement requires a robust and accurate segmentation tool. A growing body of research exists aiming to improve the accuracy of hippocampus segmentation for early diagnosis of dementia [11][12][13]. Hippocampus segmentation methods that incorporate population knowledge such as multi-atlas and machine learning techniques perform better compared to other established methods [12,14,15]. However, the accuracy of multi-atlas segmentation techniques is known to be affected by multiple factors such as the similarity of the atlas to the dataset to be segmented, the number of atlases in the library, the inter-operator variability of the manual segmentation atlas library, and the label fusion and registration algorithms used for segmentation. Moreover, high computational power is required for time efficient processing [12].
Segmentation methods that use machine learning algorithms, such as support vector machines, require many high-quality labelled datasets and a priori feature extraction [16]. Novel deep learning models have been shown to provide more generalizable results compared to traditional machine learning techniques for many segmentation tasks, but also require large training sets for optimal results [13,14]. Neuroanatomical segmentation using deep learning models has achieved significantly higher accuracies and reduced processing times compared to the conventional segmentation algorithms used in the FSL [17] and FreeSurfer [13,14,18] packages. For instance, QuickNat [14] used automatically generated segmentations of a large number of datasets for pre-training of a convolutional neural network (CNN) and a small number of manually segmented datasets for fine-tuning, demonstrating a high level of accuracy. Another recently described technique, HippoDeep [13], used a large dataset of segmented hippocampi (>20,000) extracted by FreeSurfer [19] for pre-training of the CNN within a predefined bounding box containing both hippocampi, and fine-tuned parameters with a single manually labelled brain. Similar methodologies have been adopted by other groups [20]. Generally, most previously described deep learning methods for hippocampus segmentation include a pre-training step using large datasets and fine-tuning of the CNN model using a small set of manually segmented datasets [13,14].
Even though CNN-based hippocampus segmentation methods yield better results compared to traditional segmentation methods, they are often biased by the quality of the ground truth hippocampus labels used for training the CNN. Recently, the Alzheimer's Disease Neuroimaging Initiative (ADNI) defined the harmonized hippocampal protocol (HarP) for the manual segmentation of the hippocampus from structural MRI, which aims to reduce inter-observer differences [21,22]. ADNI-HarP is an international effort aiming to harmonize hippocampus segmentation by defining a standard protocol that results in hippocampus segmentations that can be compared between different sites and clinical trials. ADNI made 135 T1-weighted MRI images publicly available that were manually segmented in the anterior commissure-posterior commissure plane following the ADNI-HarP guidelines. The protocol has high inter- and intra-rater agreement compared to other protocols not following the ADNI-HarP segmentation guidelines by clearly specifying anatomical landmarks for precise hippocampus segmentation [23].
According to the protocol, the cut line between the amygdala and hippocampus at the rostral level can be recognized by segmenting grey matter below the alveus and at the caudal level by separating the last grey matter mass inferomedial to the lateral ventricles. The protocol suggests using 2D orthogonal views (axial, sagittal, and coronal planes) for optimal tracings. This recommendation relates to the similarity of tissue intensity of the surrounding tissues, in which case visualization in the different orthogonal views can help to assess the continuity of the structures, aiding segmentation, especially at the medial boundary of the hippocampus, where the grey matter is difficult to delineate based on voxel intensity information alone. The ADNI-HarP protocol was agreed upon by a panel of experts from North America and Europe and is accepted as the standard systematic method for hippocampal manual segmentation [21,22].
The ground truth data include the brain MRI datasets and the corresponding manual hippocampus segmentations by ADNI-HarP experts. Their segmentations have more than 90% agreement [22].
To date, the majority of hippocampal segmentation tools developed either utilize atlas-based segmentation methods or deep learning-based algorithms, in particular CNNs [24][25][26][27], but given the novelty of ADNI-HarP do not produce segmentations following these guidelines. Therefore, the aim of this study was to develop a highly accurate atlas- and CNN-based hippocampus segmentation tool for MRI data, called DeepHarp, that generates ADNI-HarP standardized hippocampus segmentations. The CNN model was partially trained, fully validated, and tested using these data to ensure the conformity of the model with the expert manual tracing criteria developed by the ADNI collaborators.
Briefly described, the main contributions of this work are: (1) harnessing the combined advantage of atlas- and CNN-based segmentation, producing superior accuracies and test-retest precision when compared with other established atlas-based techniques (FSL, Freesurfer) and deep learning approaches (QuickNat, HippoDeep); (2) incorporating MRI data and corresponding hippocampus segmentations from patients with multiple brain disorders acquired with different MRI scanners and different field strengths (1.5 and 3.0 T) to improve the validity, robustness, and general applicability of DeepHarp; and (3) designing an automatic method for hippocampus segmentation following the ADNI-HarP standardization.

Datasets
The T1-weighted MRI datasets included in this study were acquired using various 3T and 1.5T MRI scanners and were collected from four sources: Sunnybrook Health Sciences Centre, Ontario, Canada (n = 50; datasets included healthy subjects and patients with various pathologies such as AD, AD with cerebral vascular disease, and MCI) [12], Alzheimer Disease Neuroimaging Initiative Harmonised Protocol database (ADNI-HarP; n = 130) [22], the Open Access Series of Imaging Studies (OASIS) Dementia Study (n = 14) [28], and a free online repository of healthy subjects and patients with epilepsy (n = 50) [11].
ADNI was launched in 2003 as a public-private partnership led by Principal Investigator, Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessments can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). For up-to-date information, see www.adni-info.org.
Finally, a freely available test-retest dataset of healthy subjects [29] was used for comparing the robustness of the proposed method with previously described hippocampus segmentation methods. From this dataset, 40 T1-weighted datasets from a single subject were used for this study, which were acquired in 20 sessions spanning 31 days using the ADNI-recommended protocol.

Pre-Processing
The image datasets were acquired on various scanners using distinct imaging parameters. Thus, the image resolution and signal intensities were not directly comparable between the datasets used to train the CNN. To overcome this limitation, the datasets were pre-processed by normalizing the intensity values to a mean of zero with a standard deviation of one and resampling to a 1 × 1 × 1 mm³ resolution with cubic interpolation.
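The intensity normalization described above can be sketched as follows. This is a minimal illustration operating on a flat list of voxel values; the actual tool works on full 3D volumes, and the resampling step (cubic interpolation to 1 mm³) is not shown. The helper name `zscore_normalize` is hypothetical and not part of the published pipeline.

```python
import math

def zscore_normalize(intensities):
    """Normalize voxel intensities to zero mean and unit standard deviation.

    `intensities` is a flat list of voxel values; a real implementation
    would operate on the full 3D volume (e.g., a NumPy array).
    """
    n = len(intensities)
    mean = sum(intensities) / n
    var = sum((v - mean) ** 2 for v in intensities) / n
    std = math.sqrt(var)
    return [(v - mean) / std for v in intensities]

normalized = zscore_normalize([10.0, 20.0, 30.0, 40.0])
```

After this step, intensity distributions from different scanners become directly comparable inputs for the CNN.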

ROI Identification
The pre-processing step of the proposed hippocampus segmentation method aims to detect a volume-of-interest (VOI) that contains the hippocampus within the 3D MRI datasets (see Figures 1 and 2). A safety margin of two voxels in all three dimensions was used. This VOI margin helps to restrict the CNN training and testing to the medial temporal lobe rather than the whole image, which also improves class balance. To this end, the Montreal Neurological Institute (MNI) brain atlas was first registered to the brain of each subject using an affine transformation followed by a non-linear registration step using cross-correlation as the similarity metric and spline interpolation implemented in the NiftyReg package [30]. The non-linear transformation was then applied to a probabilistic map of the hippocampus from the Harvard-Oxford subcortical structural atlas [30]. The registered probabilistic map was binarized (every non-zero voxel was set to 1) and dilated by two voxels to ensure that the final VOI included all parts of the hippocampus. The dilated hippocampus segmentation was then used to crop the MRI with the corresponding bounding box.
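The cropping step can be sketched as below. Note this is a simplification: it approximates the morphological dilation of the registered atlas mask by expanding an axis-aligned bounding box by the two-voxel safety margin, and `hippocampus_bounding_box` is a hypothetical helper, not the tool's actual code.

```python
def hippocampus_bounding_box(mask_voxels, margin=2, shape=None):
    """Compute an axis-aligned bounding box around all non-zero mask voxels,
    expanded by a safety margin (in voxels) along each axis.

    `mask_voxels` is an iterable of (x, y, z) coordinates where the
    registered probabilistic atlas map is non-zero; `shape` optionally
    clamps the box to the image dimensions.
    """
    xs, ys, zs = zip(*mask_voxels)
    lo = [min(xs) - margin, min(ys) - margin, min(zs) - margin]
    hi = [max(xs) + margin, max(ys) + margin, max(zs) + margin]
    if shape is not None:
        lo = [max(0, v) for v in lo]
        hi = [min(s - 1, v) for s, v in zip(shape, hi)]
    return tuple(lo), tuple(hi)

lo, hi = hippocampus_bounding_box([(10, 20, 30), (15, 25, 32)], margin=2)
# lo == (8, 18, 28), hi == (17, 27, 34)
```

The MRI volume is then cropped to this box, so the CNN only ever sees the medial temporal lobe region.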

CNN Algorithm
The CNN architecture (Figure 3) described below aims to classify each voxel within the dilated hippocampus mask (green in Figure 2) as either background or foreground. All voxels outside the dilated hippocampal mask but within the bounding box are defined as background.
The fully-connected DeepMedic CNN architecture, originally designed for brain lesion segmentation and previously described in detail elsewhere [31], was optimized for hippocampus segmentation in this work. The DeepMedic framework is open source and publicly available on github.com. Despite its original development for brain lesion segmentation, it can serve as a general framework for the easy creation and training of neural networks for the processing of medical images. DeepMedic employs a dual-pathway architecture that enables the integration of local and larger contextual information, whereby one pathway uses the image at the original resolution while the second pathway uses a low-resolution version of the image. Since the larger contextual information is not needed due to the registration-based identification of a region-of-interest containing the hippocampus, DeepHarp only uses the high-resolution pathway for hippocampus segmentation. This high-resolution pathway is essentially a feedforward network that was configured with nine hidden layers using 3 × 3 × 3 kernels and a classification layer, which uses 1D kernels. Briefly described, the model used convolution layers followed by rectified linear unit (ReLU) activation functions and no max-pooling, to preserve the image information [32]. Multiple configurations were evaluated to fine-tune the base architecture of the CNN based on parameters previously described in the literature [33]. More precisely, the number of hidden layers, learning rate, cost function, and shape of the network (pyramid) were tuned manually using reference values from the literature.
In the final setup, the number of feature maps started at 20 and increased by increments of five in each layer up to a maximum of 60 in the final layer. For the training stage, a batch size of 20 and 10 epochs were used to train the network by minimizing the cross-entropy loss function. Small 3 × 3 × 3 kernels were applied, which reduces the number of network parameters [31]. A funnel-shaped network was used because it has been shown to lead to superior performance [33] compared to networks with other shapes using the same number of parameters. The learning rate hyperparameter was initially set to 0.006 and steadily diminished until the validation accuracy plateaued. The L1 and L2 regularization terms were set to 0.000001 and 0.0001, respectively. Dropout was used to reduce the risk of model overfitting [31].
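The architecture choices above imply a fixed receptive field and feature-map schedule, which can be derived with a few lines of illustrative arithmetic (this sketch is not DeepMedic code; `pathway_config` is a hypothetical helper). With stride-1 3 × 3 × 3 convolutions and no pooling, each layer grows the receptive field by two voxels per axis.

```python
def pathway_config(n_layers=9, kernel=3, fmaps_start=20, fmaps_step=5):
    """Derive the receptive field and per-layer feature-map counts for a
    stride-1 convolutional pathway without pooling.

    Each kernel x kernel x kernel convolution grows the receptive field
    by (kernel - 1) voxels per axis; feature maps increase by
    `fmaps_step` per hidden layer.
    """
    receptive_field = 1 + n_layers * (kernel - 1)
    feature_maps = [fmaps_start + i * fmaps_step for i in range(n_layers)]
    return receptive_field, feature_maps

rf, fmaps = pathway_config()
# rf == 19 (a 19^3-voxel receptive field); fmaps == [20, 25, ..., 60]
```

This shows why nine hidden layers suffice once the atlas step has cropped the image: a 19³-voxel context around each voxel covers the local hippocampal anatomy within the VOI.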

Training of DeepHarp
ADNI-HarP segmentations are specifically harmonized at the most posterior, anterior, and medial anatomical boundaries [22]. Hippocampus segmentations that were previously generated using non-ADNI-HarP protocols differed on average by more than one voxel in all orthogonal planes [22]. The hippocampus is a small structure with a large surface-area-to-volume ratio, so that even a one-voxel boundary error can introduce considerable volumetric differences.
The non-ADNI-HarP data from the Sunnybrook, OASIS, and epilepsy datasets used in this work were segmented using different manual labeling protocols, which varied between sites. However, these manual labels encompassed most of the hippocampal anatomy included in the ADNI-HarP protocol. Thus, it was assumed that these datasets still provide useful information for training the CNN-based segmentation model. The ADNI-HarP data used in this study comprised 130 MRI brain scans, of which 78 were used for training together with all non-ADNI-HarP datasets (see Table 1). During training, the network parameters were adjusted based on a validation set of 29 ADNI-HarP brain MRI scans not included in the training set. After training, the model was tested using 23 ADNI-HarP brain MRI scans that were not part of the training or validation sets; this independent test set was never used for model training or optimization and therefore evaluates the performance of the model on unseen data. Thus, DeepHarp was trained using anatomical segmentation patterns from a combination of the local protocols and the ADNI-HarP protocol. To improve the generalizability of the CNN model, all training data were augmented by Gaussian blurring using standard deviations of 0.5 and 1.0 (Figure 4).
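The Gaussian-blur augmentation can be sketched as follows. This minimal example only constructs a normalized 1D kernel; applying it separably along each axis approximates the 3D blur, and `gaussian_kernel_1d` is a hypothetical helper rather than the code used in the study.

```python
import math

def gaussian_kernel_1d(sigma, radius=None):
    """Build a normalized 1D Gaussian kernel for a given standard deviation.

    Convolving a volume with this kernel along each of the three axes
    (separable convolution) approximates a 3D Gaussian blur, as used to
    augment the training data (sigma = 0.5 and 1.0).
    """
    if radius is None:
        radius = max(1, int(3 * sigma))  # cover ~3 standard deviations
    weights = [math.exp(-0.5 * (x / sigma) ** 2)
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

kernel = gaussian_kernel_1d(0.5)
```

Blurred copies of each training volume expose the CNN to varying effective resolutions, which helps it generalize across scanners.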

Evaluation of the DeepHarp Model
For validation purposes, the hippocampus segmentation results obtained by the DeepHarp model were compared to the segmentation results generated using FSL [17] and Freesurfer [15]. We additionally compared our results to the recently reported CNN-based segmentation methods QuickNat [14] and HippoDeep [13]. The Dice similarity metric was used as a quantitative overlap measurement for segmentation quality assessment. Dice values close to 1.0 indicate a perfect consensus with the de facto ground-truth labels (i.e., expert manual segmentations), while values close to 0 indicate no consensus. Twenty-three test datasets (20% of the ADNI-HarP data, which were not used for training or validation) were used for accuracy testing.
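The Dice similarity metric used throughout the evaluation can be computed as below; this is a standard formulation (Dice = 2|A ∩ B| / (|A| + |B|)) on binary masks represented as sets of foreground voxel coordinates, not the evaluation script used in the study.

```python
def dice_score(seg_a, seg_b):
    """Dice similarity coefficient between two binary segmentations,
    each given as a set of foreground voxel coordinates:
    Dice = 2 * |A intersect B| / (|A| + |B|)."""
    a, b = set(seg_a), set(seg_b)
    if not a and not b:
        return 1.0  # both empty: perfect agreement by convention
    return 2.0 * len(a & b) / (len(a) + len(b))

# one shared voxel out of two per mask -> 2*1 / (2+2) = 0.5
score = dice_score({(1, 1, 1), (1, 1, 2)}, {(1, 1, 1), (1, 1, 3)})
```

Identical masks yield 1.0 and disjoint masks yield 0.0, which is why values near 1.0 indicate consensus with the manual ground truth.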
The precision of the proposed segmentation method was evaluated using two independent MRI scans of the same subject available for each scan session (20 sessions in total). For a quantitative evaluation, the second MRI of each session was registered to the first image acquisition of the same session using a rigid transformation with a normalized cross-correlation cost function and linear interpolation. After this step, the previously described pre-processing steps and DeepHarp method were applied to each of the co-registered images for hippocampus segmentation. The two segmentations from each session were compared using the Dice coefficient, and the average Dice coefficient was computed across all sessions. As all scan pairs were acquired from the same subject within one month, minimal volume changes were expected over this interval, and the Dice similarity metric was hypothesized to be close to 1.0. We additionally performed hippocampus segmentation on these 20 scan pairs using the FSL [17], Freesurfer [15], QuickNat [14], and HippoDeep [13] methods for comparison with the proposed DeepHarp method.

Implementation Details
The training of the CNN model was performed using an NVIDIA GTX 1060 GPU with the CUDA 9 toolkit, cuDNN 7.5, and TensorFlow 1.12.1. All pre-processing steps were performed using the Visualization Toolkit (VTK) and Python 2.7. All registrations were performed using the NiftyReg package [30]. Brain extraction and bias field correction were done using the brain extraction tool (BET), part of the FSL package. All statistical analyses were performed using the open-source R and RStudio software packages. The trained model, pre-processing scripts, and instructions for using the tool are available at https://github.com/snobakht/DeepHarP_Hippocampus/tree/main/DeepHarP.


Results
The quantitative evaluation of the generated segmentations showed that the proposed DeepHarp model achieved an average Dice score of 0.88 ± 0.02 on the 23 test datasets (46 hippocampi segmented from ADNI-HarP data). No significant difference was observed between the Dice similarity metrics of the left and right hippocampi. The quantitative comparison to the four other established hippocampus segmentation techniques (i.e., FSL, QuickNat, Freesurfer, and HippoDeep) showed that DeepHarp achieved significantly higher Dice scores (Wilcoxon rank-sum test, p < 0.001), indicating a high level of accuracy (Figure 5). Figure 6 shows a selected segmentation for a subject generated by the proposed method and the other four methods. Overall, the three deep learning methods that were evaluated (i.e., DeepHarp, QuickNat, and HippoDeep) generated segmentations that were more similar to the ground truth segmentations compared to Freesurfer and FSL, which is in line with the quantitative results. Table 2 shows the comparisons and p-values.

Precision
The test-retest dataset was used for comparing the robustness of the proposed DeepHarp method with FSL, QuickNat, Freesurfer, and HippoDeep. The test-retest data were acquired using the same scanner and acquisition parameters and should present only minor volumetric differences, since the two datasets compared for each day were acquired within the same imaging session, but with repositioning of the participant between the two scans. Figure 7 shows the performance of the five segmentation techniques by means of the Dice score. The quantitative evaluation revealed an average Dice score of 0.95 ± 0.01, with no significant difference between right and left hippocampus segmentations for the proposed DeepHarp method. Figure 8 shows the performance of the five techniques based on the absolute volume difference in milliliters. DeepHarp measured an average difference of 0.08 ± 0.057 mL and 0.05 ± 0.036 mL between pairs of test-retest data for the left hippocampus and right hippocampus, respectively.
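The absolute volume differences reported above follow directly from foreground voxel counts: after resampling to 1 × 1 × 1 mm, each voxel is 1 mm³ = 0.001 mL. The sketch below illustrates this conversion with hypothetical voxel counts (the helper names are illustrative, not from the published code).

```python
def volume_ml(n_voxels, voxel_mm3=1.0):
    """Hippocampal volume in milliliters from a foreground voxel count
    (1 mm^3 == 0.001 mL after resampling to 1 x 1 x 1 mm)."""
    return n_voxels * voxel_mm3 / 1000.0

def abs_volume_difference_ml(n_voxels_scan1, n_voxels_scan2):
    """Absolute test-retest volume difference in mL between two scans."""
    return abs(volume_ml(n_voxels_scan1) - volume_ml(n_voxels_scan2))

# a hypothetical 80-voxel discrepancy corresponds to 0.08 mL
diff = abs_volume_difference_ml(3200, 3120)
```

Averaging this difference over all 20 session pairs yields the per-side precision figures reported for each method.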

Figure 6. Comparison of segmentations for different tools for a selected patient and slice. It can be seen that Freesurfer and FSL show more false positives than other tools at the head of hippocampus. Freesurfer also has more false negatives at the tail than other tools.

The statistical evaluation of the Dice scores and absolute volume differences revealed that DeepHarp is significantly more robust than FSL for the left hippocampus and than Freesurfer for the right hippocampus (Wilcoxon rank-sum test, p < 0.003 and p < 0.01, respectively).
HippoDeep is significantly more robust than DeepHarp (Wilcoxon rank-sum test, largest p < 0.03) based on Dice scores, while no statistically significant difference was found between HippoDeep and DeepHarp (Wilcoxon rank-sum test, p < 0.1) based on volume differences. DeepHarp and QuickNat have no significant difference based on Dice scores, while QuickNat is statistically more robust than DeepHarp on left hippocampus measurements, but not for the right hippocampus. Overall, DeepHarp, QuickNat, and HippoDeep are comparable in robustness based on the two quantitative metrics, Dice scores and absolute volume differences. Table 3 shows the detailed data and p-values. The initial development and evaluation of DeepHarp was performed using full noncropped images without atlas segmentation. This took more than three days to complete for every round of training. Utilization of an atlas-based approach significantly improved the training time and results. Therefore, every training trial on the full set of data took approximately 6h to complete. This relatively short training time allowed for many rounds of trial and errors to identify the best parameters and hyperparameters. Excluding the atlas-based segmentation, DeepHarp segments the hippocampus in an MRI dataset in about 6 s. Adding the atlas-based segmentation can increase this time to 10 min, depending on whether a GPU is available.
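The two precision metrics used above, the Dice score and the absolute volume difference in milliliters, can be computed directly from a pair of binary segmentation masks. A minimal numpy sketch (the masks and the voxel size are toy illustration values, not study data):

```python
import numpy as np

def dice_score(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum())

def abs_volume_diff_ml(mask_a, mask_b, voxel_volume_mm3=1.0):
    """Absolute volume difference in milliliters (1 mL = 1000 mm^3)."""
    diff_mm3 = abs(int(mask_a.sum()) - int(mask_b.sum())) * voxel_volume_mm3
    return diff_mm3 / 1000.0

# Toy test-retest pair: two cuboid "segmentations" in a small volume,
# with the retest mask shifted by one voxel to mimic repositioning
test = np.zeros((32, 32, 32), dtype=bool)
retest = np.zeros_like(test)
test[8:20, 8:20, 8:20] = True
retest[9:21, 8:20, 8:20] = True

print(round(dice_score(test, retest), 3))   # → 0.917
print(abs_volume_diff_ml(test, retest))     # → 0.0 (same size, only shifted)
```

Note that the two metrics capture different failure modes: a shifted but equally sized segmentation lowers the Dice score while leaving the volume difference at zero, which is why both are reported.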

Discussion
This study shows that combining a traditional atlas-based segmentation method with an optimized CNN architecture leads to improved segmentation accuracy compared to CNN- and atlas-based segmentation algorithms alone. Providing prior knowledge derived by an atlas-based method to the CNN limits the search area and increases the foreground-to-background ratio, which likely contributed to the high accuracy of DeepHarp. The DeepHarp method achieved a mean Dice similarity score of 0.88, which was significantly better than the four other established hippocampus segmentation methods used for comparison. At the same time, the proposed method also achieved a high test-retest precision (mean Dice score: 0.95). Overall, DeepHarp achieves higher accuracy and test-retest precision than commonly used segmentation tools such as FSL and Freesurfer, and achieves more accurate results compared to other deep learning methods [13,14].
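The cropping step that produces this improved foreground-to-background ratio can be illustrated with a simple bounding-box crop around an atlas-derived mask plus a safety margin (a sketch with synthetic data; the margin and image dimensions are illustrative assumptions, not the values used in DeepHarp):

```python
import numpy as np

def crop_to_roi(image, atlas_mask, margin=8):
    """Crop an image to the bounding box of an atlas mask plus a margin."""
    coords = np.argwhere(atlas_mask)
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, image.shape)
    slices = tuple(slice(l, h) for l, h in zip(lo, hi))
    return image[slices], slices

# Synthetic example: a small "hippocampus" mask inside a large T1-like volume
volume = np.random.rand(192, 224, 192)
atlas_mask = np.zeros(volume.shape, dtype=bool)
atlas_mask[90:120, 100:140, 80:110] = True

roi, slices = crop_to_roi(volume, atlas_mask)
ratio_full = atlas_mask.sum() / atlas_mask.size
ratio_roi = atlas_mask[slices].sum() / roi.size
print(roi.shape)               # far smaller than the full volume
print(ratio_roi > ratio_full)  # foreground fraction increases after cropping
```

The margin keeps voxels just outside the approximate atlas mask available to the CNN, so inaccuracies of the coarse localization do not clip the true hippocampus boundary.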
Automation of the ADNI-HarP protocol is necessary to increase the efficiency, accuracy, and standardization of hippocampus segmentations within larger studies. Therefore, a major contribution of this work is that the proposed DeepHarp technique achieves highly accurate and precise segmentation results following the ADNI-HarP standard. The proposed model demonstrated high accuracy and precision across different scanners, image resolutions, and pathologies. Due to the limited number of manually segmented datasets from the original ADNI-HarP validation sample, we utilized datasets that were manually segmented by medical experts using validated protocols with inclusive anatomical definitions. To ensure that the final model segments the hippocampus according to the ADNI-HarP guidelines, the trained CNN model was additionally fine-tuned using an independent validation sample selected from the ADNI-HarP dataset, and then tested using another independent ADNI-HarP sample.
Another major advantage of the proposed method is its simplicity, as it only requires an atlas registration and subsequent application of the trained CNN. Although training a CNN can be time-consuming, the application of a trained network is usually quite fast on modern hardware, especially when applied to a small ROI.
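As a sketch, this two-step inference could be wired together as below; `register_atlas` and `cnn_segment` are hypothetical stand-ins (a fixed toy mask and a simple threshold) for the actual atlas registration and trained CNN, which this illustration does not reproduce:

```python
import numpy as np

def register_atlas(image):
    """Placeholder: a real atlas registration would return an approximate
    hippocampus mask here (e.g., via non-linear registration)."""
    mask = np.zeros(image.shape, dtype=bool)
    mask[10:20, 10:20, 10:20] = True  # fixed toy location for illustration
    return mask

def roi_slices(mask, margin=4):
    """Bounding box of the mask plus a safety margin."""
    coords = np.argwhere(mask)
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, mask.shape)
    return tuple(slice(l, h) for l, h in zip(lo, hi))

def cnn_segment(roi):
    """Placeholder for the trained CNN: a simple threshold stands in."""
    return roi > roi.mean()

def segment_hippocampus(image):
    # Step 1: coarse localization via atlas registration, then crop
    atlas_mask = register_atlas(image)
    slices = roi_slices(atlas_mask)
    # Step 2: fine segmentation of the cropped ROI with the CNN
    roi_seg = cnn_segment(image[slices])
    # Paste the ROI segmentation back into full-image space
    out = np.zeros(image.shape, dtype=bool)
    out[slices] = roi_seg
    return out

seg = segment_hippocampus(np.random.rand(64, 64, 64))
print(seg.shape)  # segmentation returned in the original image space
```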
Excluding the atlas-based segmentation, DeepHarp segments the hippocampus in an MRI dataset in about 6 s. The efficient DeepMedic architecture as well as cropping the images using the atlas-based segmentation are key components enabling the fast inference time of DeepHarp. However, the atlas registration can take up to 10 min depending on the implementation and the computational resources available. The other two deep learning methods used in this work for comparison (QuickNat and HippoDeep) require less than a minute for a single dataset, while Freesurfer and FSL are considerably slower. However, it should be highlighted that no clinical decision requires fast hippocampus segmentation from MRI; what matters most is a reliable, robust, and accurate segmentation for use as an image-based biomarker in clinical decision making and clinical studies.
From a technical perspective, the main contribution of this work is the development and evaluation of an advanced method for hippocampus segmentation from T1-weighted MRI datasets according to the ADNI-HarP protocol, combining atlas-based and CNN-based segmentation methods. These two techniques are the two most commonly used general approaches for hippocampus segmentation [13,14,24–27]. However, to the best of our knowledge, DeepHarp is the first method to actually combine these two general approaches.
However, there are still many interesting avenues to further improve the accuracy of DeepHarp. For example, a recent study by Dill et al. [26] showed that using metadata such as a patient's age, sex, and clinical status can improve the Dice coefficient of an atlas-based hippocampus segmentation by 5%. Thus, it might be interesting to also include such metadata in DeepHarp. Another study by Zhu et al. [25] proposed metric learning for improved atlas-based hippocampus segmentation. The authors of that study developed a novel label fusion technique based on a distance metric in which similar image patches are kept close together and dissimilar patches are grouped far from each other. Their method dynamically determines the distance metric rather than using a predefined distance metric as done in previous methods. When applied to ADNI data, their method achieved a statistically significant improvement in hippocampus segmentation accuracy. For this reason, it might be beneficial to use this advanced atlas-based segmentation method instead of the rather simple atlas-based segmentation method used in DeepHarp. However, due to the subsequent application of a CNN in DeepHarp, the actual benefit might be insignificant. A study by Liu et al. [27] proposed a multi-modal convolutional network that identifies hippocampus features using a multi-task CNN and then uses a dense network to exploit these features for a classification task (AD vs. normal control subjects). Their model achieved 87% accuracy for classifying AD, MCI, and normal control subjects. Thus, it might be interesting to also implement a subsequent disease classification in DeepHarp. Another study by Sun et al. [24] describes a dual convolutional neural network for hippocampus segmentation, which uses a unique loss function in a v-net structure [34] and achieves 90% accuracy, which could also prove beneficial for DeepHarp.
DeepHarp is based on the DeepMedic architecture [31], which has been used for various segmentation tasks and generally performs well for many biomedical image segmentation problems. In this study, baseline comparisons to four other established models described in the literature were performed, showing that the proposed model is more accurate than, and as reliable as, previously described methods. By now, there are hundreds of different CNN architectures described in the literature (e.g., U-Nets [35], MASK-RCNN [36], VGG16 [37], VGG19 [38], ResNet [39], and AlexNet [40]). It is likely that at least one of these models would perform better than the proposed model based on the DeepMedic architecture, if only because of overfitting or random chance. However, testing every single CNN base model for hippocampus segmentation is virtually impossible and outside the scope of this paper. The primary goal of this work was to show the effectiveness of combining atlas-based segmentation with a deep learning model for hippocampus segmentation without requiring a massive amount of data. Within this context, it is likely that another CNN method would achieve similar results to the DeepMedic architecture when combined with an atlas-based segmentation approach. Thus, this work should be considered a proof-of-principle, showing the effectiveness of combined atlas- and CNN-based segmentation models that are potentially translatable to many other segmentation problems.
In our study, the selection of data used for training DeepHarp was very diverse, as demonstrated by the inclusion of MRI scans from patients with multiple brain disorders (dementia, epilepsy, and normal controls) and dementia subtypes (MCI, AD, VCI) [11,12,21–23,28]. Although previous studies describing CNN-based hippocampus segmentation methods have used diverse datasets, the data selection used for training DeepHarp is considerably more diverse due to its acquisition at multiple clinical sites using MRI scanners from different vendors with varying field strengths. The use of such datasets and the combined application of atlas- and CNN-based methods are likely the key reasons why DeepHarp is very accurate and robust. This is an important feature to consider, since a major drawback of deep learning models is their inability to perform well if the distribution of test images differs from that of the training set [41]. The quantitative comparison with four published segmentation methods shows that the proposed method outperformed them with respect to accuracy while achieving comparable robustness.
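The pairwise method comparisons in this work rely on the Wilcoxon rank-sum test. A self-contained sketch using the normal approximation (scipy.stats.ranksums provides the same test off the shelf; the Dice scores below are fabricated illustration values, not the study's measurements):

```python
import numpy as np
from math import erf, sqrt

def rank_sum_test(x, y):
    """Wilcoxon rank-sum test, two-sided, normal approximation
    (no tie correction; ties are unlikely for continuous Dice scores)."""
    combined = np.concatenate([x, y])
    ranks = combined.argsort().argsort() + 1.0  # 1-based ranks
    n1, n2 = len(x), len(y)
    w = ranks[:n1].sum()                        # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return z, p

# Hypothetical per-subject Dice scores for two methods (illustration only)
rng = np.random.default_rng(0)
dice_a = rng.normal(0.88, 0.02, size=23)
dice_b = rng.normal(0.84, 0.03, size=23)

z, p = rank_sum_test(dice_a, dice_b)
print(0.0 <= p <= 1.0)  # a small p would indicate a significant difference
```

A rank-based test is a sensible choice here because per-subject Dice scores are bounded and often non-normally distributed, so a t-test's assumptions may not hold.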
There are several limitations to our proposed method that should be addressed in future work. First, the non-linear registration used in the pre-processing stage can be time consuming. Future work should investigate whether an affine registration might be sufficient for this purpose. Second, the CNN generates segmentations using T1-weighted data only. Whether the proposed method can be applied or re-trained for other MRI sequences or multi-modal MRI datasets requires further assessment and is beyond the scope of this study. Third, although our method yielded high accuracy, the reliability of DeepHarp was significantly lower than that of the HippoDeep method. This is likely due to the considerably larger number of training sets used for HippoDeep (>20,000). Including auxiliary data (generated using automated tools) for post-training of DeepHarp may improve robustness, but likely at the price of reduced accuracy. In this context, robustness is a measure of segmentation consistency across different time points. In other words, for this specific analysis, we mostly investigated whether the methods make the same mistakes for the same patients or whether the segmentation errors occur at different locations when segmenting the same structure in the same patient in images acquired at multiple time points. This analysis shows that all methods make similar mistakes across different imaging time points, while DeepHarp achieves the highest segmentation accuracy among the compared methods. Thus, it is important to consider the accuracy levels found when comparing the robustness between models. Taken together, the results suggest that DeepHarp is both precise and accurate.
From a clinical perspective, DeepHarp's improved hippocampus segmentation accuracy is important, as it could help to establish an image-based biomarker for the early detection of Alzheimer's disease that can be used in prospective clinical testing of the effects of disease-modifying treatments on rates of hippocampus atrophy as a secondary outcome measure.

Conclusions
Our findings suggest that DeepHarp can automatically segment the hippocampus from T1-weighted MRI datasets according to the ADNI-HarP protocol with higher accuracy than four other state-of-the-art methods, while also achieving high precision.