Automated Olfactory Bulb Segmentation on High Resolutional T2-Weighted MRI

The neuroimage analysis community has neglected the automated segmentation of the olfactory bulb (OB) despite its crucial role in olfactory function. The lack of an automatic processing method for the OB can be explained by its challenging properties. Nonetheless, recent advances in MRI acquisition techniques and resolution have allowed raters to generate more reliable manual annotations. Furthermore, the high accuracy of deep learning methods for solving semantic segmentation problems provides us with an option to reliably assess even small structures. In this work, we introduce a novel, fast, and fully automated deep learning pipeline to accurately segment OB tissue on sub-millimeter T2-weighted (T2w) whole-brain MR images. To this end, we designed a three-stage pipeline: (1) localization of a region containing both OBs using FastSurferCNN, (2) segmentation of OB tissue within the localized region through four independent AttFastSurferCNN models, a novel deep learning architecture with a self-attention mechanism to improve modeling of contextual information, and (3) ensemble of the predicted label maps. The OB pipeline exhibits high performance in terms of boundary delineation, OB localization, and volume estimation across a wide range of ages in 203 participants of the Rhineland Study. Moreover, it also generalizes to scans of an independent dataset never encountered during training, the Human Connectome Project (HCP), with different acquisition parameters and demographics, evaluated in 30 cases at both the native 0.7 mm HCP resolution and the default 0.8 mm pipeline resolution. We extensively validated our pipeline not only with respect to segmentation accuracy but also with respect to known OB volume effects, where it can sensitively replicate age effects.


Motivation
Over the past decades, there has been increasing awareness of odor function not only as a quality of life indicator [1] but also as a potential biomarker in population studies. Olfactory dysfunction is among the earliest signs of many neurodegenerative disorders, including Alzheimer's and Parkinson's disease [2,3,4]. Therefore, it is of major interest to gain insights into the anatomical basis of the olfactory pathway in vivo.
New developments in magnetic resonance imaging (MRI) (e.g. field strength, accelerated acquisition schemes, etc.) have allowed the acquisition of high-resolutional (High-Res) MR images, providing an option for reliable assessment of odor-related brain structures, including the olfactory bulb (OB). The OB is considered the most important relay station in the odor pathway, integrating peripheral and central olfactory information. Moreover, OB volume has been associated with olfactory dysfunction in clinical settings [5,6]. However, compared to its central counterparts, i.e. the prefrontal cortex, hippocampus, and insular cortex [7,8], the OB remains relatively poorly studied, especially in the general population. One reason for that could be the lack of a fully automated segmentation tool for this structure.
Currently, the gold standard for measuring OB volumes is the manual segmentation of T2-weighted (T2w) images, a very expensive and time-consuming process that greatly relies on the raters' expertise. Thus, especially for large population-based studies, automatic segmentation methods are required. However, achieving good accuracy on this small structure is challenging due to its inherent properties: (i) low contrast on T1w scans, (ii) low boundary contrast on T2w images (partial volume effects), (iii) high sensitivity to noise due to its proximity to the nostrils (e.g. breathing artefacts), (iv) not being visible in all subjects [9], and (v) high dependence on age [6,10,11]. So far, those limitations have impeded the wide implementation of any automatic or semi-automatic techniques. Therefore, the introduction of an accurate automated method for segmenting the OB is of significant clinical and research interest.

* Correspondence to: Martin Reuter (martin.reuter [at] dzne.de).

Olfactory Bulb Segmentation
Although many studies have analyzed the OB, accurate automatic processing methods for this structure are lacking; it has been overlooked by many of the standard neuroimage processing frameworks, such as FreeSurfer [12], BrainSuite [13], SPM [14], ANTs [15], and FSL [16]. To date, manual delineation is still the predominant approach for accurate quantification of OB volumes. Most groups approximate OB volumes from 1.5T T2w MR scans with a relatively low resolution (1.5 mm to 2 mm isotropic) [6,10,11,17]. Recent studies [9,18] on 3T high-resolutional T2w MRI have focused on developing semi-automatic techniques to reduce the manual annotation workload but cannot automatically segment the OB. Concurrently to our work, Noothout et al. [19] proposed an automatic pipeline using fully convolutional neural networks (F-CNNs) to segment the OB on coronal T2w images with an in-plane resolution of 0.47 mm × 0.47 mm and 1 mm slice thickness. While this method, which is not publicly available at this time, shows promising results on a small dataset (n=21), it is reported to be sensitive to motion artefacts and unseen scenarios (i.e. cases with no apparent OB).
Recently, supervised learning using F-CNNs [20,21] has become the preferred standard in the medical computer vision community for solving semantic segmentation problems when sufficient training data is available [19,22,23,24,25,26,27,28,29,30]. F-CNNs often outperform other traditional methods, as they can learn intrinsic features and integrate global context to resolve local ambiguities in an end-to-end fashion. The most frequently employed network layout for semantic segmentation is the encoder-decoder architecture, i.e. the UNet [25].
The accuracy of this architecture, however, decreases when segmenting smaller structures [24,28,31]. This can be due to their more complex shapes (i.e. thinner, irregular boundaries) and visual appearance characteristics in medical images (i.e. less visible and partly occluded). Nonetheless, some of the fault can be attributed to the encoder-decoder layout, as it can lead to a redundant use of information and insufficient encoding of global contextual information [32,33]. An accurate understanding of the spatial context is of tremendous importance when segmenting smaller structures, as local representation differences between pixels/voxels of the same structure introduce inter-class inconsistencies and affect the recognition accuracy [32]. To solve this issue, attention modules have been introduced to improve the understanding of long-range dependencies, not only for semantic segmentation [28,32,33] but also for other computer vision tasks [34,35,36,37].
In this work, we modify our FastSurferCNN [30] for whole-brain segmentation to focus on the OB. To improve FastSurferCNN's performance for small structures, we suitably included the self-attention mechanism proposed in [34] into FastSurferCNN; the new deep-learning architecture is termed AttFastSurferCNN. AttFastSurferCNN promotes attention to spatial information by improving the modeling of local and global-range dependencies. Overall, to segment the OB on high-resolutional T2w whole-brain MRI in a fully automatic fashion, we introduce a deep learning pipeline consisting of three stages:
1. Localization of a region of interest (ROI) containing the OBs of both hemispheres using a semantic segmentation approach by implementing FastSurferCNN; we use the centroid of the predicted region as a center point for cropping a localized volume.
2. Segmentation of OB tissue within the localized volume through four AttFastSurferCNN models with different training conditions (four data-splits and data initializations).
3. An ensemble stage where the previously generated label maps are averaged and view-aggregated to form a consensual final segmentation.
The presented networks were trained with manual annotations of 357 T2w scans from the Rhineland Study, an ongoing large population-based cohort study [38,39]. We extensively validated the quality of the individual stages of the pipeline through assessment of segmentation accuracy in an independent unseen heterogeneous in-house dataset (n = 203).

Manual annotations of the left and right OB were performed by an experienced rater on (unprocessed) T2w images using Freeview (a visualization tool of FreeSurfer [12,40]). The OB is defined as a mostly almond- or spindle-shaped structure symmetrically located at the base of the forebrain [41], as seen in Figure 1, which can be demarcated based on the surrounding cerebrospinal fluid and the underlying cribriform plate. The abrupt changes in diameter at the beginning of the olfactory tract in the axial and sagittal views were used as a posterior ending landmark [42,43].
In addition, to avoid bias, labeling was blind to participant metadata, e.g. outcomes of the olfactory function and demographics.
For the localization task, we solve a semantic segmentation problem with the goal to segment the forebrain region containing the OBs from both hemispheres (referred to as "region of interest (ROI)"). For the ROI label generation, a distance map f(x, y, z) is computed and a binary cutoff at f(x, y, z)/max f(x, y, z) ≥ 0.8 separates ROI and background. The resulting distance maps and labels are illustrated in Figure 1.
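The binary cutoff described above can be sketched as follows. This is a minimal illustration, not the paper's exact label-generation code: we assume the smooth map f is obtained by Gaussian-filtering a binary OB mask, and the sigma value is a hypothetical choice.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_roi_label(ob_mask: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Generate an ROI label from a binary OB mask.

    A smoothed map f(x, y, z) is thresholded at 80% of its maximum to
    separate ROI and background, mirroring the cutoff in the text.
    The Gaussian smoothing and sigma are assumptions for illustration.
    """
    f = gaussian_filter(ob_mask.astype(np.float32), sigma=sigma)
    return (f / f.max() >= 0.8).astype(np.uint8)

# toy volume with both "bulbs" marked as foreground
vol = np.zeros((32, 32, 32), dtype=np.uint8)
vol[14:18, 10:13, 14:18] = 1  # left OB
vol[14:18, 19:22, 14:18] = 1  # right OB
roi = make_roi_label(vol, sigma=2.0)
```

The smoothing spreads the foreground outward, so the resulting ROI forms a compact region around both bulbs rather than two tight masks.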

Olfactory Bulb Pipeline
Our proposed deep learning method is aimed at segmenting the OB on high-resolutional T2w whole-brain MRI. This task presents the challenge of a high class imbalance between foreground and background (≈ 1:10^6).
A reduction in the spatial size of the input can partially alleviate the problem by cropping the background and by focusing the background information on relevant regions in close proximity to the OBs. This, furthermore, reduces computational and memory requirements during training and inference. Following this direction, we designed a fully automated pipeline for OB tissue segmentation as depicted in Figure 2.
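The centroid-based cropping used to reduce the spatial size of the input can be sketched as follows. The crop dimensions are a hypothetical choice; the paper does not state the exact crop size in this excerpt.

```python
import numpy as np

def crop_around_centroid(volume: np.ndarray, mask: np.ndarray,
                         size: tuple = (64, 96, 64)) -> np.ndarray:
    """Crop a fixed-size sub-volume centered on the centroid of a
    predicted ROI mask, clamped so the crop stays inside the volume."""
    centroid = np.round(np.mean(np.argwhere(mask > 0), axis=0)).astype(int)
    starts = [max(0, min(c - s // 2, dim - s))
              for c, s, dim in zip(centroid, size, volume.shape)]
    sl = tuple(slice(st, st + s) for st, s in zip(starts, size))
    return volume[sl]

# toy example: whole-brain scan with a localized ROI prediction
scan = np.random.rand(128, 128, 128).astype(np.float32)
mask = np.zeros_like(scan, dtype=np.uint8)
mask[60:70, 40:50, 80:90] = 1
crop = crop_around_centroid(scan, mask, size=(64, 96, 64))
```

Clamping the start indices guarantees a constant output shape even when the centroid lies near the image border.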
The proposed pipeline consists of three stages: (1) ROI localization, (2) OB tissue segmentation, and (3) an ensemble of the predictions, which avoids having to select a single best model and also reduces variance due to network initialization. Furthermore, since the right and left OB were combined as one structure during segmentation, they are split retrospectively in an independent post-processing step.

Region of Interest (ROI) Localization Network -FastSurferCNN
To localize the ROI as a semantic segmentation task, we employ FastSurferCNN [30], as it outperformed other commonly used encoder-decoder architectures, i.e. SDNet [44] and QuickNat [22], on the difficult task of whole-brain segmentation [45,46]. The maxout activation induces competition between feature maps by computing the maximum at each spatial location, thus improving the feature selectivity [47] and boosting the learning of fine-grained structures [29,31]. Furthermore, FastSurferCNN utilizes a multi-slice input approach by stacking preceding, current, and succeeding slices for segmenting only the middle slice, which in turn increases the spatial information aggregation in a 2D network by improving the local neighborhood awareness [30].
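The maxout activation described above reduces to an element-wise maximum over competing feature maps; a minimal sketch:

```python
import torch

def maxout(feature_maps: list) -> torch.Tensor:
    """Maxout activation: element-wise maximum over competing feature
    maps at every spatial location, promoting feature selectivity."""
    return torch.max(torch.stack(feature_maps, dim=0), dim=0).values

# two competing 2x2 feature maps
a = torch.tensor([[1.0, 5.0], [3.0, 2.0]])
b = torch.tensor([[4.0, 0.0], [2.0, 6.0]])
out = maxout([a, b])
```

Only the strongest response at each location survives, which is what drives the competition between the stacked maps.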
In this work, we slightly modified FastSurferCNN by adjusting the view-aggregation step to a normal unweighted average. Since the ROI label is not lateralized, there is no need to increase attention to any particular anatomical view.

OB Segmentation Network -AttFastSurferCNN
To accurately segment the OB, we introduce AttFastSurferCNN, a new deep learning architecture that boosts the attention to spatial information. We implemented AttFastSurferCNN by suitably including the self-attention mechanism proposed by [34] into FastSurferCNN [30]. The self-attention module was included after each competitive dense block (CDB), as shown in Figure 4, thus improving the modeling of contextual information.
Furthermore, in order to take full advantage of the multi-scale attention maps [32,33] and to prevent information loss from the unpooling layers [31], we replaced the maxout activation units between the finer feature maps.

The implemented self-attention layer is illustrated in Figure 4. Following [34], the attention map S ∈ R^{N×N} (with N = H × W) is defined as

s_{j,i} = exp(q_i · k_j) / Σ_{i=1}^{N} exp(q_i · k_j),

where q and k are query and key features obtained from 1 × 1 convolutions, and s_{j,i} indicates the extent to which the i-th position impacts the j-th position. Before applying S, the F_CDB ∈ R^{C×H×W} features are fed into a 1 × 1 convolutional layer and a new feature map F_c ∈ R^{C×H×W} is generated and reshaped to R^{C×N}. Afterwards, a matrix multiplication is performed between the transpose of S and F_c, and the result is reshaped to the original size R^{C×H×W}. Finally, the self-attention output (F_att) is formulated as

F_att = α (F_c S^T) + F_CDB,

where α is a learnable scalar parameter initialized with 0. The introduction of α allows the network to first focus on the local information, which is an easier task, and gradually increase the importance of non-local dependencies, which is a harder task [34]. We additionally normalize F_att, thus guaranteeing a normalized input to the following CDB blocks. A normalized input improves convergence [48] and increases the exploratory span of the created sub-networks when using a maxout activation [47]. In summary, the implemented spatial attention module improves the modelling of local and global-range dependencies, which in turn increases semantic consistency.
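A positional self-attention layer in the style of [34] can be sketched in PyTorch as follows. This is a minimal sketch, not the paper's exact implementation: the channel-reduction factor of 8 for the query/key features follows [34] and is an assumption here, and the subsequent normalization of F_att mentioned in the text is omitted.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Positional self-attention (SAGAN-style): each output position is
    a learnable mix of its local features and an attention-weighted sum
    over all other positions."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)  # 1x1 conv producing F_c
        self.alpha = nn.Parameter(torch.zeros(1))  # starts fully local

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        q = self.query(x).view(b, -1, n)                  # B x C/8 x N
        k = self.key(x).view(b, -1, n)                    # B x C/8 x N
        energy = torch.bmm(q.transpose(1, 2), k)          # B x N x N, q_i . k_j
        s = torch.softmax(energy, dim=1)                  # normalize over positions i
        f_c = self.value(x).view(b, c, n)                 # B x C x N
        out = torch.bmm(f_c, s).view(b, c, h, w)          # attention-weighted features
        return self.alpha * out + x                       # F_att = alpha * out + F_CDB
```

Because alpha is initialized to zero, the layer is an identity at the start of training and gradually learns how much non-local context to mix in.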
In brief, AttFastSurferCNN is a multi-network approach of three 2D F-CNNs operating on different anatomical views (coronal, sagittal, and axial). All three F-CNNs contain the self-attention layers following the aforementioned layout (Figure 4). Within AttFastSurferCNN, the CDB blocks maintain the configuration from Section 2.2.1, except for the 5 × 5 convolutions, which are modified to a smaller kernel size of 3 × 3. Furthermore, the multi-slice input approach from FastSurferCNN [30] is maintained and a stack of three consecutive slices is passed as input. In the following section, the ensemble of the different segmentation predictions is explained in detail.

OB Segmentation Ensemble
One widely used method to select the optimal model among CNNs trained on different data-splits is cross-validation.
Cross-validation jointly evaluates performance on the different data-splits, and the model with the maximal test-set performance is selected as the winner. This approach, however, can limit generalizability, as the data-splits used for training the best performer can be biased towards the selected test-set. Recently, the combination of different CNN model outputs has been shown to improve prediction performance and reduce the CNNs' intrinsic variance [49].
As a consequence, we propose to ensemble the predictions of four AttFastSurferCNN models: the training data was divided into four data-splits balanced for age and sex, and the data-splits were treated in a leave-one-out fashion. Finally, the ensemble is constructed by an unweighted average, as the outputs of models with comparable performance are merged [49,50,51,52]. Intuitively, the proposed ensemble approach can be seen as four different raters with similar experience taught by the same instructor, where the consensus among the raters gives the final decision. It is important to note that in our specific approach the final ensemble prediction is created by averaging twelve different models, as each AttFastSurferCNN contains three 2D F-CNNs for the three different anatomical views (axial, coronal, and sagittal). Therefore, our ensemble approach also includes the advantages of view-aggregation, where a voxel prediction is regularized by considering spatial information from multiple views [22,29,30].
We furthermore analyzed the impact of the ensemble approach by comparing it directly with the standalone data-split models.
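The unweighted ensemble described above can be sketched as an average of per-model softmax probability maps followed by an argmax; a minimal sketch (in the paper, twelve maps would be averaged: four data-splits times three views):

```python
import torch

def ensemble_prediction(prob_maps: list) -> torch.Tensor:
    """Unweighted average of per-model softmax probability maps,
    followed by an argmax over the class dimension (assumed at dim 1)
    to obtain the consensus label map."""
    mean_probs = torch.stack(prob_maps, dim=0).mean(dim=0)
    return mean_probs.argmax(dim=1)

# three toy "models" (B x L x H x W) disagree on a single voxel;
# averaging resolves the disagreement by majority of probability mass
p1 = torch.tensor([[[[0.9]], [[0.1]]]])  # votes class 0
p2 = torch.tensor([[[[0.2]], [[0.8]]]])  # votes class 1
p3 = torch.tensor([[[[0.3]], [[0.7]]]])  # votes class 1
labels = ensemble_prediction([p1, p2, p3])
```

Averaging probabilities (rather than hard labels) lets confident models outvote uncertain ones, which is what makes the unweighted mean a reasonable merge rule for models of comparable performance.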

Model Learning
All F-CNN models for localization and segmentation were implemented in PyTorch [53] using a Docker container [54]. Independent models for axial, coronal, and sagittal views were trained for 40 epochs with a batch size of 16 using two NVIDIA Tesla V100 GPUs with 32 GB RAM, and an Adam optimizer [55] with a step decay scheduler that decreases the learning rate (lr) by 95% every 5 epochs (initial lr = 0.01, constant weight decay = 10^-4 [56], betas = (0.9, 0.999), eps = 10^-8). The networks were trained by optimizing a loss function composed of focal loss [36] and dice loss [26]. The focal loss addresses the class imbalance by modifying the standard cross-entropy loss such that lower importance is given to well-classified pixels. The dice loss, on the other hand, is more robust to data imbalance [57], as it is based on the Dice score, an overlap similarity index that reflects both size and localization agreement. Therefore, our composed loss function is formulated as

L = L_Dice + L_Focal = 1 - (2 Σ_x p_l(x) g_l(x)) / (Σ_x p_l(x)^2 + Σ_x g_l(x)^2) - Σ_x w(x) (1 - p_l(x))^γ log(p_l(x)),

where p_l(x) is the predicted probability of pixel x belonging to class l, and g_l(x) is the pixel's ground truth class.
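The composed focal + dice loss can be sketched in PyTorch as follows. This is a minimal sketch, not the exact training code: the boundary weight scheme w(x) from [22] is replaced by uniform weights for brevity.

```python
import torch
import torch.nn.functional as F

def composed_loss(logits: torch.Tensor, target: torch.Tensor,
                  gamma: float = 2.0, eps: float = 1e-8) -> torch.Tensor:
    """Composed focal + soft dice loss (uniform pixel weights).

    logits: B x L x H x W raw class scores; target: B x H x W labels.
    """
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes=logits.shape[1])
    onehot = onehot.permute(0, 3, 1, 2).float()

    # focal term: down-weights well-classified pixels via (1 - p_t)^gamma
    p_t = (probs * onehot).sum(dim=1).clamp_min(eps)
    focal = -((1.0 - p_t) ** gamma * p_t.log()).mean()

    # soft dice term, averaged over classes
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = (probs ** 2).sum(dim=(0, 2, 3)) + (onehot ** 2).sum(dim=(0, 2, 3))
    dice = 1.0 - (2.0 * inter / denom.clamp_min(eps)).mean()

    return focal + dice
```

A perfect, confident prediction drives both terms toward zero, while the focal term concentrates the gradient on hard pixels, which is the intended remedy for the foreground/background imbalance.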
For the weighted focal loss, γ was set to 2 and the pixel weight scheme (w(x)) proposed by [22] was used to improve segmentation performance along anatomical boundaries. We additionally included online data augmentation to address two challenges: 1) spatial variations due to head position and image cropping, and 2) intensity inhomogeneities due to scan parameters and movement artefacts (e.g. eye movement and breathing). The first problem was tackled by applying random spatial transformations (translation, rotation, and global scaling) to the input images.
It is important to note that, for the segmentation models, spatial augmentations were applied to the full image before cropping. Note, care was taken to preserve the image contrast between the augmented versions.
For the training and testing of our pipeline, data from the first 572 participants from the Rhineland Study with a T2w scan was used (referred to as "in-house dataset").
All 572 MRI scans were manually annotated following Section 2.1. During the creation of the in-house dataset, a group of 12 subjects was separated into another subset (referred to as "no-OB dataset") as these cases were flagged with no visible OB. Subjects without an apparent OB had been reported previously [9]. Consequently, the no-OB cases were used to evaluate the automated method's robustness to an unseen extreme scenario.

Evaluation Metrics
For assessing the segmentation similarity between the predicted label maps and the ground truth, we computed metrics aimed at evaluating different properties: spatial overlap, spatial distance, and volume similarity. We first assessed the spatial overlap, as it provides both size and localization consensus, by computing the Dice similarity coefficient (Dice), a common metric for validating semantic segmentation performance. Let G (ground truth) and P (prediction) denote binary label maps; the Dice similarity coefficient is expressed as

Dice(G, P) = 2|G ∩ P| / (|G| + |P|),

where |G| and |P| represent the number of elements in each label map, and |G ∩ P| the number of common elements. Therefore, the Dice ranges from 0 to 1, and a higher Dice represents a better agreement. However, Dice scores can be drastically affected by small spatial shifts when evaluating small and elongated structures such as the OB [64,24].
Spatial distance-based metrics such as Hausdorff Distance (HD) are widely used for assessing performance in small structures as they evaluate the quality of segmentation boundaries. In this work, we used the Average Hausdorff Distance (AVD), an HD variation less sensitive to outliers.
AVD is defined as

AVD(G, P) = max( (1/|G|) Σ_{g∈G} min_{p∈P} d(g, p), (1/|P|) Σ_{p∈P} min_{g∈G} d(p, g) ),

where d is the Euclidean distance. In contrast to the Dice, AVD is a dissimilarity measurement, so a smaller AVD indicates a better boundary delineation, with a value of zero being the minimum (perfect alignment). Furthermore, as OB volumes are usually the desired marker for downstream analysis, we computed a volume-based metric, the volume similarity (VS) [64], defined as

VS(G, P) = 1 - ||G| - |P|| / (|G| + |P|).

While VS is similar to Dice, it does not take the segmentation overlap into account and can reach its maximum value even when the overlap is zero. Consequently, VS is not used for the localization stage and is replaced with the localization distance (R), a metric more suitable to assess the accuracy of the centroid coordinate created in this stage.
Let p and g be the centroid coordinates of the predicted and ground truth label maps, respectively. The localization distance (R) is calculated as

R = ||p - g||_2.

Similar to AVD, a smaller distance indicates improved localization accuracy. Finally, to benchmark the performance of the various F-CNN models, we first ranked the models for each metric individually and then computed an overall rank as the geometric mean of the models' rankings.
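The metric definitions above translate directly into code; a minimal sketch of Dice, VS, and the localization distance R on binary masks (AVD is omitted for brevity, as it additionally requires surface distance computations):

```python
import numpy as np

def dice(g: np.ndarray, p: np.ndarray) -> float:
    """Dice similarity: 2|G ∩ P| / (|G| + |P|)."""
    inter = np.logical_and(g, p).sum()
    return 2.0 * inter / (g.sum() + p.sum())

def volume_similarity(g: np.ndarray, p: np.ndarray) -> float:
    """VS = 1 - ||G| - |P|| / (|G| + |P|); ignores spatial overlap."""
    return 1.0 - abs(int(g.sum()) - int(p.sum())) / (g.sum() + p.sum())

def localization_distance(g: np.ndarray, p: np.ndarray) -> float:
    """Euclidean distance between the centroids of both label maps."""
    cg = np.mean(np.argwhere(g), axis=0)
    cp = np.mean(np.argwhere(p), axis=0)
    return float(np.linalg.norm(cg - cp))
```

Note how a mask shifted by one voxel keeps VS = 1 (identical volumes) while Dice drops, which is exactly why both metrics are reported.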

Experiments and Results
In addition, we assessed the generalizability of our method to different population demographics on the publicly available HCP dataset [63]. A summary of the data needed for each of the experiments is presented in Table 1.

Manual Annotation Reproducibility (E1)
To the best of our knowledge, there is no automatic method for detecting and delineating the OB. Therefore, manual annotations are considered the gold standard.
As our approach is based on supervised learning, its performance is limited by the quality of the manual annotations. As a consequence, to assess the consistency of the labels created by our main rater, we conducted intra-rater and inter-rater variability experiments.
Fifty random subjects from the in-house dataset were selected. Afterwards, the cases were manually annotated twice (see Section 2.1): once by our main rater, who had already segmented the cases, and once by a second rater trained by our main rater. To remove bias and avoid overestimating performance, raters were blind to the scans' identification; furthermore, the main rater's second segmentations were done with a time gap of two months, and the scans used for training the second rater were not included in the experiment. We assessed intra-rater variability by computing the similarity between the two sets of segmentations of the main rater. Inter-rater variability was estimated by comparing the segmentation agreement between the main rater's first annotations and the second rater's annotations.
In Figure 6, we present the similarity scores for total OB (left and right combined) in the fifty subjects used for this experiment, as well as significance level indicators (paired two-sided Wilcoxon signed-rank test [65]).

Figure 6: Segmentation similarity scores for total OB comparing intra-rater vs. inter-rater variability, as well as significance level indicators (paired two-sided Wilcoxon signed-rank test). Significance: *** p < 0.001.
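The paired comparison of intra- vs. inter-rater scores can be sketched with SciPy as follows; the per-subject Dice values below are synthetic stand-ins for illustration, not the study's data.

```python
import numpy as np
from scipy.stats import wilcoxon

# hypothetical per-subject Dice scores for the fifty subjects
rng = np.random.default_rng(0)
intra = rng.normal(0.90, 0.02, size=50)  # intra-rater agreement
inter = rng.normal(0.85, 0.03, size=50)  # inter-rater agreement

# paired two-sided Wilcoxon signed-rank test, as used for Figure 6
stat, p_value = wilcoxon(intra, inter, alternative="two-sided")
```

The test is paired (same fifty subjects under both conditions) and non-parametric, so it makes no normality assumption about the Dice distributions.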

Pipeline Performance (E2)
In this section, we benchmarked and evaluated the accuracy of each stage of the pipeline on a completely separate unseen test-set. All implemented networks were trained using the scheme described in Section 2.2.4, and the data-splits introduced in Section 2.3 were treated in a leave-one-out fashion (e.g. model 1: splits 2, 3, and 4 were used for training, and split 1 was used for validation).

ROI Localization
For evaluating the ability of FastSurferCNN to localize the OB ROI in a down-sampled whole-brain image, we compared all trained FastSurferCNN models using the similarity metrics (see Appendix Figure 1).

OB Tissue Segmentation
To show a proof of concept for our proposed AttFastSurferCNN on the more difficult task of OB tissue segmentation, we benchmarked our network against state-of-the-art segmentation 2D F-CNNs used for neuroimaging, such as FastSurferCNN [30], UNet [25], and QuickNat [22].

Ensemble
In this section, we tested our ensemble approach of combining the output of four AttFastSurferCNN models against each individual AttFastSurferCNN trained in the previous section. We observed that all standalone models have comparable results in the three similarity metrics (Dice, VS, and AVD), as shown in Table 3, and that the ensemble further improves the segmentation along boundaries, as illustrated in Figure 8. Significance: *** p < 0.001, ** p < 0.01, * p < 0.05.

Age and Sex Effects Sensitivity (E3)
OB volumes obtained from manual segmentations of T2w images have been shown to be negatively correlated with age [6,10,11]. Therefore, any automated method that intends to detect this small structure should be able to replicate these effects. As a consequence, we evaluated the sensitivity of our proposed pipeline in replicating ground truth age dependencies in the unseen in-house test-set using a linear model (OBV ∼ age + sex + eTIV). All statistical analyses were performed in R [67], and eTIV estimations were computed using FreeSurfer [12,40,68].
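The linear model OBV ∼ age + sex + eTIV can be sketched as an ordinary least-squares fit; the study used R, but an equivalent numpy sketch with entirely synthetic data (all coefficients below are hypothetical) looks like this:

```python
import numpy as np

# hypothetical data: OB volume (mm^3), age (years), sex (0/1), eTIV (cm^3)
rng = np.random.default_rng(1)
n = 200
age = rng.uniform(30, 95, n)
sex = rng.integers(0, 2, n).astype(float)
etiv = rng.normal(1500, 120, n)
obv = 120 - 0.4 * age + 0.02 * etiv + rng.normal(0, 5, n)  # synthetic ground truth

# ordinary least squares for OBV ~ age + sex + eTIV (intercept first)
X = np.column_stack([np.ones(n), age, sex, etiv])
beta, *_ = np.linalg.lstsq(X, obv, rcond=None)
age_effect = beta[1]  # negative slope = volume decline with age
```

With the synthetic negative age coefficient built in, the fitted slope recovers the expected direction of the age effect, mirroring the sensitivity analysis described in the text.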
All predicted OB volumes significantly decreased with age as can be seen in Table 4, which in turn follows the behavior of the manual data and other studies [6,10,11].
We found an improvement in the modeling (R²) of the age effects with AttFastSurferCNN compared to the ground truth and the other comparative networks. Finally, we did not find a sex difference for any of the models, and, as expected, the inferred OB volumes are positively associated with eTIV (see Table 4).

No Apparent Olfactory Bulb (E4)
As the proposed pipeline is to be deployed as a post-processing OB analysis pipeline for the T2w MRI of the Rhineland Study, it should be robust to cases without an apparent OB, which, based on the size of our in-house dataset, occur with an approximate prevalence of 2%. In this section, we processed the 12 flagged cases with no apparent OB and evaluated the OB volume estimates. Note that all cases used for training our AttFastSurferCNN have a visible OB.
The automated method agreed with the main rater in 50% of these cases, as illustrated in Figure 9 B) and shown in Appendix Figure 2. Of the remaining cases, three had a total predicted volume smaller than 2.5 mm³ and the other three between 7 mm³ and 10.2 mm³.

Generalizability (E6)
The lack of MR hardware heterogeneity (i.e. scanners, field strength, and acquisition parameters) in our training set can limit the ability of the neural network to generalize to unseen T2w images acquired under different conditions. In order to quantify the robustness of our pipeline, we tested it on 30 subjects of the HCP dataset, acquired at a different resolution of 0.7 mm isotropic. In addition to sequence differences, HCP images are de-faced.
In order to analyze our method at the native 0.7 mm HCP resolution as well as at the default 0.8 mm network resolution, we constructed manual annotations twice per subject, once for each resolution.
We evaluated the pipeline at both resolutions, including the default 0.8 mm setting, which is the recommended approach.
As expected, the overall performance on HCP data is slightly lower than the results obtained on our in-house dataset (see Section 3.2.2). The HCP dataset, however, consists of de-faced scans (never encountered during training) from a younger age distribution, and was acquired with different acquisition parameters. Qualitative results for both the in-house and the HCP dataset can be found in Figure 11.

Discussion
In this work, we established, validated, and implemented a novel deep learning pipeline to segment and quantify the olfactory bulb on high resolutional T2-weighted MR scans. The proposed pipeline is fully automatic and can analyze a 3D volume in less than a minute in an end-to-end fashion, even though it implements a three-stage design.
The use of deep learning components for localizing and segmenting the OB enables the pipeline to accurately and quickly quantify the OB volume, providing a robust and reliable solution for assessing OB volumes in a large cohort study.
Segmenting the OB in T2w scans is a challenging task due to its size, sensitivity to artefacts, age effects, and visibility on MR images (partial volume effects). Despite all these challenges, we demonstrate the feasibility of segmenting the OB on high resolutional isotropic T2w MR images. Our main rater's manual annotations exhibit a high intra-rater reliability in terms of boundary delineation, OB localization, and volume estimation. Furthermore, we verified the reproducibility of our labeling protocol with an inter-rater reliability similar to the one reported for other manually annotated medical datasets [24,29]. We cannot directly compare the segmentation performance with other studies that manually labeled the OB on T2w MR images, as they only report the volume difference for repeated measurements by a single observer or across observers [6,18,70,71]. Nonetheless, the volume similarity for both inter- and intra-rater variability yields comparable or even better results than the OB studies mentioned above.

As demonstrated on the Rhineland data, the proposed pipeline successfully identifies the OB on a T2w scan, as seen in Figure 11 A) to E). The pipeline also replicates the negative correlation of OB volumes with age reported in previous studies [6,10,11] and also visible in our manual annotations. We, furthermore, detected no sex difference after accounting for head size; however, estimates from AttFastSurferCNN and all comparative networks are positively correlated with head size, a result that is also detected in the manual segmentations, as expected, but with a lower significance and magnitude. All automated methods show a stronger and less variable eTIV effect across subjects (see Table 4), explaining the significance discrepancy. The difference in effect magnitudes can be attributed to the F-CNNs' ability to learn consistent information across subjects, exhibiting stability to random noise and thus generating smoother segmentations than manual raters.
Furthermore, our proposed pipeline efficiently handles cases without an apparent OB by not segmenting the structure at all or by segmenting only a few voxels (< 10 mm³), as seen in Figure 9 B), C), and D). Additionally, the sequence stability dataset demonstrates a good agreement of volume estimates between sequences. It must be noted that the difference in volume estimates includes not only potential variances of the processing pipelines but also variances of the acquisitions. Scans with a resolution different from the ones presented in this work can also be analyzed by running the pipeline with the default behaviour (resampling inputs to 0.8 mm) or by processing inputs directly at the native image resolution, if it is close to 0.8 mm isotropic. In these cases, however, it is highly recommended that the segmentation quality is assessed by the user. Generally, since the pipeline is based on deep learning, the model can easily be fine-tuned to another desired resolution by retraining or by more aggressive scaling augmentation techniques.
In conclusion, we have developed a fully automated post-processing pipeline for OB segmentation on submillimeter T2-weighted MRI based on advanced deep learning methods. To the best of our knowledge, the presented pipeline is the first to accurately segment the OB in a large cohort and is meticulously validated not only against segmentation accuracy but also with respect to known OB volume effects (e.g. age).

Acknowledgment
We would like to thank the Rhineland Study group for supporting the data acquisition and management. This work was supported by DZNE institutional funds, the Fed-

Appendix Figure 1: Similarity metrics scores for ROI localization comparing all trained FastSurferCNN models. Models were ranked ascendingly by individual metrics (box-plot color) and the overall rank (geometric mean of the metric rankings). We show significance level indicators of the paired Wilcoxon signed-rank test comparing FastSurferCNN-4 (M4, model with best overall rank) against the other FastSurferCNNs (M1, M2, M3). Significance: *** p < 0.001, ** p < 0.01, * p < 0.05, ns: p ≥ 0.05.

Appendix Figure 3: Scatterplots of OB volume estimates and segmentation similarity metrics on the in-house test-set, as well as the Pearson correlation coefficient and linear regression. We observed that segmentation performance decreased with OB size, especially in subjects with a total OB volume smaller than 20 mm³.

Appendix Figure 4: Sagittal and coronal T2-weighted MR images and predictions from the localization stage (purple) on two cases from the Rhineland Study. A-B) Subjects excluded from the volume estimates sequence stability analysis (E5) due to severe motion artefact. Nonetheless, the localization stage can still detect a region containing both OBs.