Fighting the scanner effect in brain MRI segmentation with a progressive level-of-detail network trained on multi-site data

Many clinical and research studies of the human brain require accurate structural MRI segmentation. While traditional atlas-based methods can be applied to volumes from any acquisition site, recent deep learning algorithms ensure very high accuracy only when tested on data from the same sites exploited in training (i.e., internal data). The performance degradation experienced on external data (i.e., unseen volumes from unseen sites) is due to the inter-site variability in intensity distributions induced by different MR scanner models, acquisition parameters, and unique artefacts. To mitigate this site-dependency, often referred to as the scanner effect, we propose LOD-Brain, a 3D convolutional neural network with progressive levels-of-detail (LOD) able to segment brain data from any site. Coarser network levels are responsible for learning a robust anatomical prior useful for identifying brain structures and their locations, while finer levels refine the model to handle site-specific intensity distributions and anatomical variations. We ensure robustness across sites by training the model on an unprecedentedly rich dataset aggregating data from open repositories: almost 27,000 T1w volumes from around 160 acquisition sites, at 1.5-3T, from a population spanning 8 to 90 years of age. Extensive tests demonstrate that LOD-Brain produces state-of-the-art results, with no significant difference in performance between internal and external sites, and robustness to challenging anatomical variations. Its portability paves the way for large-scale application across different healthcare institutions, patient populations, and imaging technology manufacturers. Code, model, and demo are available at the project website.


Introduction
Brain structure segmentation in magnetic resonance imaging (MRI) plays a pivotal role in both research and clinical routines for assessing and monitoring brain morphology, volumetry, and connectivity, in both normal and pathophysiological conditions. As more and more studies analyse data derived from thousands of MRI brain scans [Bethlehem et al., 2022], there is a growing need for tools able to perform automatic, fast, and reliable segmentation of brain structures, with benefits for downstream research and clinical studies in terms of accuracy, statistical power, and reproducibility of findings.
Well-established segmentation methods in neuroimaging, such as FreeSurfer [Fischl, 2012] and FSL [Jenkinson et al., 2012], exploit one or more atlases, i.e., reference volumes and their trusted manual segmentations: first the target is registered to the reference volume, then the anatomical prior knowledge from the manual segmentation is transferred to the target volume [Yaakub et al., 2020]. Although computationally expensive and slow, these methods easily adapt to images from different scanners or acquired by means of different sequences.
Recently, deep learning (DL) methods applied to automatic brain MRI segmentation [Akkus et al., 2017], such as DeepNat [Wachinger et al., 2018], QuickNat [Roy et al., 2019], and CEREBRUM [Bontempi et al., 2020, Svanera et al., 2021], have made remarkable progress in competing with the reliability offered by atlas-based segmentation methods. However, most DL methods usually include, for both training and testing, only MRI volumes collected from a single centre or a few centres with almost homogeneous characteristics in terms of image statistics, acquisition parameters, and artefacts. Consequently, when challenged on external data, i.e., unseen volumes from unseen sites, DL methods face the so-called scanner effect: a drop in performance caused by the data variability originating from different MRI acquisition sites. This mismatch between the distributions of internal and external data, which is common in MRI (see, e.g., the competitions in [Sun et al., 2021, Campello et al., 2021]), is a problem more broadly known as distribution shift [Wiles et al., 2021]. Some researchers in brain segmentation propose to tackle it by applying aggressive data augmentation [Zhao et al., 2019] or harmonisation [Beer et al., 2020], by using domain adaptation or randomisation [Billot et al., 2021], or by generating synthetic data with the needed variations [Shin et al., 2018]. Despite achieving good robustness on a wide range of MRI contrasts and resolutions, these approaches still show limitations in matching the statistics of real data distributions, struggling with morphological variability and atypical scanner artefacts.
To handle inter-site diversity, none of the existing DL solutions builds on the idea of generating the equivalent of an anatomical brain prior, for example by exploiting volumes from multiple sites. Given the current availability of open datasets, a concrete opportunity for improving model portability is in fact training a model directly on out-of-the-scanner data coming from multiple sites, so as to cover different vendors, resolutions, slice thicknesses, participant demographics, and pathological conditions. Previous approaches to multi-site learning for segmentation in different medical imaging domains show, on the one hand, that these methods help generalisation on external data. On the other hand, they often perform worse on internal data (i.e., unseen volumes from sites included in the training set) [Styner et al., 2002]. This apparently contradictory situation has also been observed in other medical image analysis tasks [Zech et al., 2018], reinforcing the idea that effective learning from multiple sources is highly challenging and can introduce unexpected performance drops.
To exploit the informative richness carried by such multi-source data, dedicated architectural solutions must be designed. An effective method should be able to integrate the anatomical knowledge acquired from a large number of volumes into a robust anatomical brain prior. Additionally, it should handle the high degree of variability that characterises data from different sites and scanner vendors.

Main contributions
We here present LOD-Brain, a progressive level-of-detail network for training a robust brain MRI segmentation model from a huge variety of multi-site and multi-vendor data. The LOD-Brain architecture is organised in multiple levels of detail (LOD), as shown in Fig. 1. Each level is a convolutional neural network (CNN) which processes 3D brain data at a different scale, obtained by progressively down-sampling the input volume. Thanks to the rich variability of brain samples coming from 70 datasets from different MRI acquisition sites, the proposed architecture learns, at lower levels, a robust brain anatomical prior. Concurrently, higher levels handle site-specific intensity distributions and scanner artefacts. Through inter-level connections between networks and a bottom-up training procedure, the architecture integrates contributions from all levels to produce an accurate and fast segmentation. LOD-Brain shows outstanding generalisation capabilities, as it performs better than other state-of-the-art solutions on almost every novel site, with no need for retraining or fine-tuning, and with no relevant performance offset between internal and external sites. Furthermore, it proves to be general and robust across sites against different population demographics, anatomical challenges, clinical conditions, and technical specifications (e.g., field strength, manufacturer).
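The inter-level information flow described above can be sketched as follows. This is a minimal numpy illustration, not the actual implementation: the real levels are full U-nets and the down/up-sampling operators are learnt (strided and transposed convolutions), which we replace here with fixed average pooling and nearest-neighbour upsampling; only the summation-based inter-level connection follows the text.

```python
import numpy as np

def avg_pool3d(vol, d):
    # Down-sample a 3D volume by factor d with average pooling
    # (stand-in for the learnt strided convolutions).
    x, y, z = (s // d for s in vol.shape)
    return vol[:x*d, :y*d, :z*d].reshape(x, d, y, d, z, d).mean(axis=(1, 3, 5))

def upsample3d(vol, d):
    # Nearest-neighbour up-sampling by factor d along each axis
    # (stand-in for the learnt transposed convolutions).
    return vol.repeat(d, axis=0).repeat(d, axis=1).repeat(d, axis=2)

def lod_forward(vol, coarse_net, fine_net, d=4):
    # The lower level sees a heavily down-sampled copy and learns the
    # site-independent anatomical prior.
    coarse_out = coarse_net(avg_pool3d(vol, d))
    # Inter-level connection: the coarse spatial context is up-sampled
    # and merged by summation, then the upper level refines at full scale.
    return fine_net(vol + upsample3d(coarse_out, d))
```

With identity placeholders for the two networks, the sketch shows how the coarse context re-enters the fine level at the input resolution before refinement.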
As an open-source tool, LOD-Brain can be used off-the-shelf on unseen scans from novel sites. Segmentation masks are returned very quickly (a few seconds on a GPU) thanks to a reduced number of model parameters (300K) compared to other state-of-the-art solutions. To maximise research reproducibility and state-of-the-art comparisons, we adopt for testing the MICCAI anatomical structure labels proposed in [Mendrik et al., 2015], using FreeSurfer [Fischl, 2012] segmentation masks as silver ground-truth (i.e., a ground-truth with errors). However, as we release both the model and the code at the project website, LOD-Brain can be retrained from scratch to deal with any set of structures and labels obtained by any manual or automatic software. A working demo is also available here.

Related work
Atlas- or multi-atlas-based methods, such as FreeSurfer [Fischl, 2012] or FSL [Jenkinson et al., 2012], are still largely adopted for brain MRI segmentation [Cabezas et al., 2011]. Although the required registration procedure usually provides good alignment between volumes, it takes hours of processing for each scan [Klein et al., 2017], thus imposing barriers on groups with limited computational capabilities in the case of large-scale studies [Bethlehem et al., 2022]. Furthermore, atlas-based strategies are hardly effective on data with abnormalities, either in terms of anatomy or intensity distributions, requiring manual intervention to fix automatic errors.
In recent years, deep learning (DL) techniques have deeply impacted medical imaging [Litjens et al., 2017] and image segmentation tools [Isensee et al., 2021]. Regarding the brain, the first DL-based methods were limited in handling the 3D nature of MRI data, as they processed single 2D slices only. QuickNAT [Roy et al., 2019] tries to overcome the drawbacks of 2D segmentation by aggregating the predictions of three different 2D slice-based encoder-decoder models, one per canonical slicing plane (axial, sagittal, and coronal), and combining the three results to obtain the segmentation. FastSurferCNN [Henschel et al., 2020] applies the same 2D approach, training the network on a sequence of neighbouring 2D slices instead of a single slice. To reduce the loss of 3D context and minimise inter-slice artefacts, methods processing 3D patches and aggregating the resulting sub-volumes are proposed in [Dolz et al., 2019, Wachinger et al., 2018]. However, all these tools exploit only local 3D spatial information, while global spatial clues, such as the absolute and relative positions of different brain structures, are disregarded, hindering any possible learning of anatomical priors. Other ensemble approaches based on multiple CNNs processing different overlapping brain sub-volumes, such as AssemblyNet [Coupé et al., 2020] or SLANT [Huo et al., 2019], achieve whole-brain segmentation at the cost of an explosion in parameter cardinality. To avoid these drawbacks, typical of the tiling process on 2D or 3D patches [Reina et al., 2020], the CEREBRUM tools represent a fully 3D solution to brain MRI segmentation for 3T [Bontempi et al., 2020] and 7T scans [Svanera et al., 2021]. However, like other DL methods trained on single-site MRIs, they do not perform well on volumes from unseen sites, as they require training from scratch, or fine-tuning, for each new target distribution [Svanera et al., 2021].
Data harmonisation strategies, when oriented to an explicit removal of site-related effects in multi-site data [Pomponio et al., 2020], constitute a valid strategy to partially alleviate the unwanted performance drop due to the scanner effect. To mitigate inter-site differences, Beer et al. propose in [Beer et al., 2020] a longitudinal version of the ComBat method: an empirical Bayesian approach which applies a multivariate linear mixed-effects regression to account for both biological variables and the scanner. The model adjusts for additive and multiplicative effects by calculating a site-specific scaling factor. A joint normalising function across multiple datasets is instead learnt by Delisle et al. in [Delisle et al., 2021] by means of two fully-convolutional 3D CNNs: the first normalises image intensities across multiple datasets, while the second optimises images for a downstream segmentation task. Although harmonisation algorithms mitigate scanner-specific effects, they do not always preserve the inter-subject biological variability of each site, and are sometimes sensitive to changes in pre-processing steps [Cetin-Karayumak et al., 2020].
Closely related to harmonisation, domain adaptation methods try to adapt segmentation networks trained on a source domain to produce correct outputs also on samples from a target domain. As an example, DeepHarmony [Dewey et al., 2019] exploits a fully-convolutional CNN architecture to map brain scans of a subject from a source acquisition protocol to a target one. However, DeepHarmony cannot be extended to more than two sites, since it relies on learning a protocol-to-protocol mapping.
SynthSeg [Billot et al., 2021] is an effective adaptation method which, starting from a full domain randomisation of the training set, segments brain MRI scans of any contrast and resolution without retraining or fine-tuning. As traditional data augmentation has limited ability to emulate real variations, SynthSeg is trained with synthetic scans obtained by leveraging a generative model with fully randomised parameters (intensity, shape, etc.). Despite its high accuracy, peculiar scanner artefacts and the absence of alignment parameters in the image header can still cause segmentation errors.
Far from applying full domain randomisation, Zhao et al. [Zhao et al., 2019] propose an alternative, but still aggressive, augmentation solution. This approach first learns independent spatial and appearance transform models to capture the variations in a dataset of brain scans. Then, it uses these transform models to synthesise a dataset of labelled examples starting from a single selected scan. The synthesised dataset is eventually used to train a supervised network, which significantly improves over previous methods for one-shot biomedical image segmentation, but with unclear outcomes in the presence of larger labelled training sets. Other synthetic approaches adopt generative adversarial networks (GANs) to create synthetic abnormal MRI images with brain tumours, so as to improve tumour and brain segmentation [Shin et al., 2018]. Although synthetic methods increase generalisation, aggressive augmentations are not always a solution for coping with distinct scanners and protocols, especially when they fail to improve model performance.
The first multi-site attempt at obtaining a model robust to the scanner effect is described in [Liu et al., 2020] in the domain of prostate segmentation. The authors first perform feature normalisation for each site separately, and then extract more generalisable representations from multi-site data through a novel learning paradigm. Other works that adopt deep learning techniques to cope with multi-site variability can be found in [Rundo et al., 2019], again for prostate segmentation, and in [Dou et al., 2020] for multi-organ segmentation from unpaired CT and MRI. However, most of these approaches perform well on internal subjects but require additional external images for an adaptation step (e.g., see [Karani et al., 2018]) to adequately cope with testing data obtained using different imaging protocols or scanners. The need for such pre-processing steps confirms that efficiently handling multi-site data is still an open challenge, and that the development of models able to jointly handle structure segmentation and site adaptation is highly needed. Learning directly from out-of-the-scanner MRI brain volumes (i.e., with no atlas-based pre-alignment) from multiple sites, with no fine-tuning nor adaptation steps, is an option that has remained unexplored until now, despite the recent availability of a large number of open brain data repositories.

Brain MRI multi-site data
To address the huge brain MRI variability in intensity statistics and scanning artefacts, we collect almost 27,000 brain T1-weighted volumes of both healthy and clinical subjects, mainly scanned with MPRAGE/MP2RAGE sequences, and released in 79 databases covering approximately 160 sites worldwide. We first aggregated data from well-known open repositories, such as HCP, ABCD, OASIS, and datasets contained in the INDI project, including NKI-RS, IXI, ABIDE, and ADHD. Then we added datasets from open platforms such as OpenNeuro, OSF, neuGRID, and NIMH, avoiding paid repositories such as UK Biobank. Other public datasets included are Mindboggle101, AOMIC, and IBSR. Apart from the Glasgow data, all repositories are available without fees, to maximise the reproducibility of this work. A full data table is provided on the project website.
In Figure 2, we present the composition of the dataset, its cardinalities and features, the quality assessment process performed with MRIQC [Esteban et al., 2017], and details on training and testing splits. The 26,169 volumes that passed the MRIQC quality control undergo defacing first, followed by simple pre-processing steps before being fed to the neural network: FreeSurfer's mri_convert to reorient volumes to the LIA (left, inferior, anterior) reference space, padding to 256³ voxels, and z-scoring.
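A minimal sketch of the last two pre-processing steps (padding and z-scoring) could look like the following; the reorientation to LIA is performed beforehand with FreeSurfer's mri_convert, and the centred placement of the volume inside the padded cube is our assumption, as the text does not specify it.

```python
import numpy as np

def pad_and_zscore(vol, target=256):
    # Zero-pad the (already LIA-reoriented) volume to target^3 voxels,
    # placing it at the centre of the padded cube (an assumption).
    out = np.zeros((target,) * 3, dtype=np.float32)
    off = [(target - s) // 2 for s in vol.shape]
    out[off[0]:off[0] + vol.shape[0],
        off[1]:off[1] + vol.shape[1],
        off[2]:off[2] + vol.shape[2]] = vol
    # Z-score the intensities (here over the whole padded cube; whether
    # background voxels are excluded is not specified in the text).
    return (out - out.mean()) / (out.std() + 1e-8)
```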

Data split and labelling
Out of the 79 datasets, 70 are considered internal (INT), while 7 are left out for testing only (EXT). The 2 remaining sets are used for specific analyses: SIMON [Duchesne et al., 2019] contains scans of a single healthy individual who participated in a multi-centre study; the last is a dataset of five patients with only one brain hemisphere from Kliemann et al. [Kliemann et al., 2019]. As validated in Section 5.1.1, the model used for testing is trained on a randomised selection of 1,049 volumes from internal data (15 volumes for each dataset, except one contributing 14 volumes as it does not have enough data). This allows us to obtain a training set that is balanced in terms of dataset representativeness and has an appropriate total number of volumes for the learning task. The 77 datasets used for testing (70 INT and 7 EXT) include a total of 24,996 volumes (15,841 INT and 9,155 EXT). Since only 10% of the datasets include more than 80% of the testing volumes, we select up to 200 volumes per dataset to avoid biases and guarantee balanced results, ending up with a total of 5,956 testing volumes (5,360 INT and 596 EXT). The validation set, used for hyperparameter selection, includes 117 volumes from 72 datasets (91 INT and 26 EXT).
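The per-dataset balanced draw described above can be sketched as follows (function and variable names are ours, for illustration only):

```python
import random

def balanced_sample(datasets, per_site=15, seed=0):
    # datasets: dict mapping dataset name -> list of volume identifiers.
    # Draw up to `per_site` volumes from each dataset, so that every
    # site is (almost) equally represented in the training set.
    rng = random.Random(seed)
    picks = []
    for name in sorted(datasets):
        vols = datasets[name]
        k = min(per_site, len(vols))  # one dataset can only contribute 14
        picks.extend(rng.sample(vols, k))
    return picks
```

With 70 internal datasets of at least 15 volumes each (one with only 14), such a draw yields the 1,049-volume training set used in the paper.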
As no manual segmentations (gold standard) are available for most volumes, training adopts a weakly supervised learning strategy, exploiting segmentation labels obtained with FreeSurfer [Fischl, 2012] as a silver-standard ground-truth (GT), similarly to what is proposed in [Bontempi et al., 2020]. The only dataset with semi-manual labels, i.e., MindBoggle (FreeSurfer plus manual corrections), is exploited in validation and testing. The manual segmentations provided for IBSR and MALC2012 were discarded and replaced with FreeSurfer outputs because of their low quality. The quality of the FreeSurfer GT masks is highly variable. In particular, out of the seven external datasets (testing only), four present an acceptable GT (covering a total of 32 sites), while the other three show low-quality GT segmentations as they include clinical scans. Low-quality GT masks are usually produced from low-quality T1w volumes; while they are not used for training, since we do not want to compromise the model's learning ability, they are still used for testing to explore the model's capabilities and limitations.

Training follows a bottom-up approach: after convergence, LOD L is frozen, and inter-level connections ensure that the 3D spatial context learnt at the lower level is embedded and propagated to LOD L−1 and, from there, to higher levels of the architecture. The process is iteratively repeated through the upper levels until the top one, i.e., LOD 1, which processes the input data at the fullest scale, refining the segmentation masks at the finest detail and accounting for site-specific intensity distributions.
The loss L adopted by LOD-Brain mixes a per-channel Dice term L_dice and a cross-entropy term L_CE:

L = L_dice + λ L_CE,

with λ balancing the two components. In particular, L_CE is:

L_CE = -(1/|V|) Σ_{v∈V} Σ_{c∈C} y_{v,c} log F_{v,c},

where V and C are the sets of voxels and classes, respectively, y is the GT mask, and F is the network output. Conversely, L_dice is:

L_dice = 1 - (1/|C|) Σ_{c∈C} [ 2 Σ_{v∈V} y_{v,c} F_{v,c} / (Σ_{v∈V} y_{v,c} + Σ_{v∈V} F_{v,c}) ].

Hyperparameter selection, network design, and the choice of parameters L, λ, d, etc. are described in Section 5.1.4.
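A direct numpy transcription of the loss, for softmax outputs flattened to a (voxels × classes) matrix, could be the following. The combination L = L_dice + λ·L_CE is chosen so that λ = 0 yields a pure Dice loss, consistent with Section 5.1; the epsilon terms are ours, added for numerical stability.

```python
import numpy as np

def mixed_loss(y, f, lam=0.5, eps=1e-8):
    # y: one-hot ground truth, f: softmax output, both of shape (|V|, |C|).
    # Cross-entropy term L_CE, averaged over voxels.
    ce = -np.mean(np.sum(y * np.log(f + eps), axis=1))
    # Per-channel Dice term L_dice, averaged over classes.
    inter = (y * f).sum(axis=0)
    dice = 1.0 - np.mean(2.0 * inter / (y.sum(axis=0) + f.sum(axis=0) + eps))
    return dice + lam * ce
```

A perfect prediction drives both terms to zero, while a uniform (uninformative) prediction is penalised by both.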
The architecture which emerges as the best-performing one during the experiments is presented in Fig. 3. The network is made up of three basic 3D convolutional blocks. The first addresses feature learning: it is composed of a 3 × 3 × 3 convolution layer followed by normalisation and non-linear activation, all repeated multiple times, ending with a dropout layer. The other two blocks perform down-sampling and up-sampling, with strided convolutions and transposed convolutions, respectively, both followed by non-linear activations. These layers allow the network to learn optimal up/down-sampling strategies and process the different extracted feature hierarchies. Moreover, skip connections and inter-level connections are implemented with summation nodes, as summation was shown to offer a better trade-off between segmentation accuracy and parameter count than concatenation [Milletari et al., 2016].
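The shape arithmetic behind the down- and up-sampling blocks can be checked with the standard (transposed) convolution formulas; the kernel sizes and paddings below are illustrative assumptions, not values taken from the paper.

```python
def conv_out(s, k=3, stride=1, pad=1):
    # Output size of one spatial axis after a convolution.
    return (s + 2 * pad - k) // stride + 1

def tconv_out(s, k=2, stride=2, pad=0):
    # Output size after a transposed convolution (inverts a strided conv).
    return (s - 1) * stride - 2 * pad + k
```

A stride-2 convolution halves each axis and the matching transposed convolution restores it, which is what makes summation-based skip connections (rather than concatenation) shape-compatible along the encoder-decoder path.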

Data augmentation
Instead of applying a pre-selected set of common data augmentations, we follow an ad-hoc procedure to verify the usefulness of each augmentation in advance. In the first step, we create a pool of realistic transformations belonging to three categories: geometrical transformations, noise distortions, and artefact introduction. In the first category, in addition to classical operations such as flip, rotation, and translation, we also introduce grid distortion. The second category accounts for a comprehensive set of noises: salt-and-pepper, Gaussian, Gamma, and contrast noise.
The last transformation family focuses on mimicking MRI artefacts like ghosting and MR field inhomogeneity, as described in [Svanera et al., 2021]. In the second step, we test which transformations are beneficial for increasing model performance. Validation is done by applying each transformation to the validation-set volumes (with increasing transformation parameters), and then computing the performance of a model trained without any data augmentation. If the model is already robust to a specific transformation (i.e., there is no performance gap when testing a volume with and without the transformation), that transformation is not considered further. Otherwise, when the training set is not rich enough (i.e., whenever transforming the input data introduces a performance drop), the transformation is considered suitable for augmentation, since it introduces a realistic alteration of the input volumes that the model is not yet able to handle. Table 1 reports details on the selected augmentations only, showing probabilities of application and parameters justified by the experiments detailed in Section 5.1.3.
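The screening logic of this second step reduces to a per-transformation comparison; the tolerance threshold below is our assumption, as the text relies only on the measured performance gap.

```python
def select_augmentations(dice_under_transform, baseline_dice, tol=0.005):
    # dice_under_transform: name -> mean validation Dice of the model
    # trained WITHOUT augmentation, evaluated on transformed volumes.
    # A transformation is kept for training augmentation only if it
    # degrades the unaugmented model by more than `tol`.
    return [name for name, dice in dice_under_transform.items()
            if baseline_dice - dice > tol]
```

For example, a transformation the model already tolerates (no Dice drop) is discarded, while one causing a clear drop is selected.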

Results and Discussion
The experimental assessment of our multi-site model is structured as follows. The first set of experiments aims to justify the choice of the adopted model. Next, we test the robustness and generality of LOD-Brain on different types of data (internal and external datasets, and data with marked anatomical variations), and the invariance of the model against different types of bias. To study the effect of the number of training sites, for an increasing number of datasets i (up to 4, 8, 16, 32, 64, and 70), we retrain LOD-Brain with 1,049 volumes selected from i datasets randomly chosen among those with enough samples. As shown in Fig. 4b, as the number of sites increases, the performance gap between internal and external testing data progressively decreases, until it fades. Therefore, unless otherwise specified, we set to 70 the number of datasets used to train LOD-Brain.

Parameter selection
Parameter selection includes investigations regarding data processing, network architecture, and training. All results are computed on the validation set by evaluating their statistical significance and, where no significance is found, by preferring the models with the fewest parameters.
Regarding data, we evaluate the most advantageous type of data normalisation, and we attempt to train with a larger training set (almost 3k samples). However, since this unbalances the data, we observe a drop in performance.
Regarding the network, we test different design choices for its architecture, e.g., the number of levels L, the convolutional block (plain or residual), layer normalisation (batch or group), etc. As a result, the LOD network implemented for testing is configured with L = 2 levels and a down-sampling factor of d = 4, as shown in Fig. 3. It is relevant to note that the two levels resulting from the ablation study somehow recall the approach extensively used in the past for brain segmentation: atlas-based registration first, followed by voxel-level segmentation. Similarly, here the coarser level learns a robust brain prior which replaces the registration step in identifying brain structure locations, while the finest level handles site-specific intensity distributions and artefacts. The entire procedure may also resemble the steps of manual segmentation, in which the human expert first zooms out to identify the major anatomical structures, and then zooms in, refining structures until the task is complete at the finest level.
With respect to training choices, we compare, among others, different loss functions (best with λ = 0, i.e., pure Dice loss) and find that refining the entire unfrozen network is detrimental, thus confirming that the brain prior learnt at LOD 2 is robust, and that joint fine-tuning with a higher level would negatively affect its site-independent brain representation.

Data augmentation
After selecting the useful transformations as in Section 4.2, we augment the validation set (117 volumes) and test the two models trained with and without augmentation. Fig. 6 reports the comparison as a function of the augmentation parameters for four significant transformations.

Implementation details
Training optimisation is done using Adam [Kingma and Ba, 2014]; training lasts 50 epochs for LOD 2 and 30 for LOD 1, with an initial learning rate of 5e-4, reduced by 1/4 on plateau. ReLU is applied as the non-linear activation in both the encoder and the decoder. For better regularisation, each convolutional block performs group normalisation, and the dropout rate is 0.05. Training lasts 3 days on a workstation with Nvidia Quadro RTX 8000 GPUs, using Weights & Biases for experiment tracking.
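The reduce-on-plateau schedule could be sketched as follows; the patience value is an assumption, since the text only specifies the initial rate (5e-4) and the 1/4 reduction factor.

```python
def reduce_on_plateau(lr, val_history, factor=0.25, patience=5):
    # Quarter the learning rate when the validation loss has not
    # improved over the last `patience` epochs (patience is assumed).
    if len(val_history) > patience and \
            min(val_history[-patience:]) >= min(val_history[:-patience]):
        return lr * factor
    return lr
```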

Robustness and generalisability
In this series of experiments, we test LOD-Brain on a variety of scenarios to assess its robustness and capabilities.

Accuracy across datasets
Fig. 7a reports the segmentation performance for each of the 77 datasets used in this study. The overall accuracy (mean: 0.928, std: 0.017) proves the robustness of the method, showing similar results on both internal and external sites. The performance obtained on low-quality GT datasets (in grey in Fig. 7a) is explained by the presence of several scans with head-movement artefacts due to the participant populations (e.g., elderly people with dementia in EDSD, and children aged 7.5-12.9 years in ABIDE Stanford), which impair FreeSurfer segmentation.

Multi-site versus single-site models
To validate the need for multi-site data, we compare the generalisation abilities of multi-site (MS) training with those of single-site (SS) models, by testing both (MS vs. SS) on the same internal (INT) and external data (EXT).
For each of the 4 datasets, we train a SS model with 1,049 volumes, and we test its segmentation accuracy on both internal data (i.e., all remaining volumes from the same site) and external sites (i.e., all left-out volumes from the other 3 datasets). A significant drop between INT and EXT performance, due to the scanner effect, is observed in Fig. 8a.
As for multi-site training, we test our model trained with 1,049 volumes from 70 datasets on the same INT datasets used in the single-site case. To test on external data, we train 4 additional MS models on 69 datasets, considering the left-out dataset (one among AOMIC, Glasgow, FCP BGSP, and FCP RocklandSample) as EXT data. As both experiments in Fig. 8a and Fig. 8b use the same testing sets, we observe that models trained on multiple sites almost reach on EXT data the same performance that SS models obtain on unseen volumes from their own training sites, while exhibiting a far superior generalisation ability (i.e., a non-significant performance difference between INT and EXT data in Fig. 8b).

SIMON dataset
As segmentation performance can vary due to both scanner intensity distributions and variability in participants' anatomy, here we attempt to disentangle the two components. We therefore test our model on a left-out dataset where, in the context of a multi-centre study [Duchesne et al., 2019], the same healthy individual (SIMON) was repeatedly scanned at many different sites, so that performance variations can be attributed to the scanner rather than to anatomy.

Robustness to challenging anatomical variations
To test the robustness of the learnt anatomical prior, we test our model in a challenging scenario: a dataset of five individuals who had undergone surgical removal of one hemisphere [Kliemann et al., 2019]. In Fig. 9a, we show the visual results obtained by FreeSurfer (second row) and by our method (third row). While FreeSurfer (and atlas-based methods in general) fails to generalise to such severe anatomical singularities, often inferring non-existent structures, LOD-Brain reliably segments such cases, proving a high level of robustness to anatomical variations. In Fig. 9b, we report activation maps for three subjects of this dataset, coming from different levels (i.e., layers) of the network. Skull stripping and cortex extraction are coarsely performed already in LOD 2 (bottom layer), and the information is then propagated to the upper levels of the network. This result gives an intuition of how the coarse level acts as a prior, guiding LOD 1 towards a finer segmentation.

Invariance to bias
To investigate the fairness of our segmentation model, we assess LOD-Brain for potential bias with respect to demographic characteristics, such as sex and age, and technical characteristics of the scanner, including scanner model, vendor, magnet strength, and slice thickness. Despite the training-data imbalance for some of these characteristics (see Figs. 2 g/h/i), on the test set of 5,956 volumes we observe no salient differences in Dice performance between groups. Results are reported on the project website.
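The group-wise check behind this analysis amounts to stratifying the per-volume Dice scores by each metadata attribute (function and variable names are ours, for illustration only):

```python
import numpy as np

def dice_by_group(dice_scores, group_labels):
    # Stratify per-volume Dice scores by a metadata attribute
    # (e.g. vendor, sex, or field strength); returns mean Dice and
    # group size, to be compared across groups for salient gaps.
    summary = {}
    for g in sorted(set(group_labels)):
        sel = [d for d, lab in zip(dice_scores, group_labels) if lab == g]
        summary[g] = (float(np.mean(sel)), len(sel))
    return summary
```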

Method comparisons
A comparative assessment of our method against state-of-the-art techniques is proposed here in terms of both brain segmentation performance and model complexity. The considered benchmark methods are: QuickNat [Roy et al., 2019], SynthSeg [Billot et al., 2021], 3D-UNet [Çiçek et al., 2016], CEREBRUM [Bontempi et al., 2020], and FastSurferCNN [Henschel et al., 2020]. Fig. 10a shows the results obtained on the whole testing set, grouped by segmented brain structure. Fig. 10b focuses instead on the comparative performance of the different methods on external datasets only. The obtained results highlight LOD-Brain as one of the most competitive methods on all brain labels, as it yields the best scores on almost all target structures and on the majority of external datasets with acceptable GT. The number of parameters of each model is also reported, highlighting LOD-Brain (337,719 parameters only) as the best overall model in terms of performance-to-complexity ratio. It is relevant to note the high performance achieved on ABCD, even though it includes volumes from 32 different scanners, previously skull-stripped and aligned to the MNI152 reference space (a common situation in the field).

Qualitative comparison
Last, in Fig. 11 we show a qualitative comparison performed on the 12 worst numerical results obtained with LOD-Brain (one per dataset; blue dots in Fig. 7a). We display FreeSurfer segmentation masks in the first row, and LOD-Brain in the second, with segmentation masks overlaid on the corresponding T1w. Despite the numerical results, which use FreeSurfer's masks as reference, the segmentation boundaries returned by LOD-Brain show fewer errors and are much smoother than those produced by FreeSurfer.

Conclusion
We here introduce LOD-Brain, a progressive level-of-detail network for training a robust brain MRI segmentation model. At lower levels, the network learns a strong brain prior useful to spatially identify 3D brain structures; concurrently, at higher levels, it handles site-specific and anatomical peculiarities. Results are remarkable in terms of consistency across scanners and sites, and robust to very challenging anatomical variations. The proposed architecture, alongside the richness of the training dataset, leads to an automatic, fast, reliable, and off-the-shelf tool for brain MRI segmentation. Code, model, and demo are available at the project website.

Figure 1: LOD-Brain is a level-of-detail (LOD) network, where each LOD is a U-net which processes 3D multi-site brain data at a different scale. Lower levels learn a coarse and site-independent brain representation, while higher ones incorporate the learnt spatial context and refine segmentation masks at finer scales. Examples of outputs (grey matter renderings) at different LODs are shown in blue at the bottom.

Figure 2: Multi-site dataset: we collect and analyse with MRIQC [Esteban et al., 2017] almost 27,000 volumes originating from around 160 different sites (26,169 volumes after the quality check). (a) A visualisation by t-SNE [Van der Maaten and Hinton, 2008] of the 68 MRIQC features (one colour per dataset). Note that one dataset (e.g., IXI, in yellow) may contain volumes from more than one site or acquired with different scanners, and thus separate into clusters in the t-SNE space. (b) Dataset cardinalities. (c) Details on data quality assessment and (d) pre-processing. From (e) to (i), different demographic features and scanner properties are reported.
The labelling strategy follows the 7 classes adopted by the MRBrainS challenge [Mendrik et al., 2015]: grey matter, white matter, cerebrospinal fluid, ventricles, cerebellum, brainstem, and basal ganglia. Such labelling maximises the possibility of comparison with other state-of-the-art methods, and covers most clinical and research studies and applications. However, there are no limitations in selecting different brain structures and related labels for retraining LOD-Brain.

Methods

Architecture: a 3D level-of-detail network
LOD-Brain is a progressive level-of-detail 3D network designed for brain MRI segmentation. As shown in the general scheme in Fig. 1, each level of LOD-Brain is a U-net [Çiçek et al., 2016] which processes the input MRI volume (of initial dimensions 256³) at a different scale, obtained by successively down-sampling the volume by a factor d along each coordinate axis. The lowest network level, LOD_L, is in charge of learning a robust anatomical prior. Since down-sampling input volumes removes high-frequency details and smooths individual differences, LOD_L learns a coarse representation of brain structures, and their mutual locations, which is less dependent on the scan site.
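The coarse input fed to the lowest level can be illustrated with a simple average-pooling sketch; the `downsample` helper below is ours, assuming cubic volumes. With the configuration of Fig. 3 (L = 2, d = 4), a 256³ input is reduced to 64³:

```python
import numpy as np

def downsample(vol, d):
    """Average-pool a cubic volume by factor d along each axis,
    mimicking the coarse input fed to the lower LOD levels."""
    s = vol.shape[0] // d
    return vol[:s * d, :s * d, :s * d].reshape(s, d, s, d, s, d).mean(axis=(1, 3, 5))

vol = np.ones((256, 256, 256), dtype=np.float32)  # stand-in for a T1w volume
coarse = downsample(vol, 4)
print(coarse.shape)  # → (64, 64, 64)
```

Average pooling is one reasonable choice here; the released model may use a different resampling scheme.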

Figure 3: LOD-Brain architecture selected for the experiments on the brain MRI segmentation task (L = 2, d = 4).

Figure 5: Ablation study. Performance of models trained with different architectural options is shown with respect to the best model (on the zero x-axis). Results (Dice coefficient differences) are computed on the validation set (those marked with * are statistically significant according to a t-test with Bonferroni correction).

Figure 6: Data augmentation: performance of models trained with versus without augmentation for four transformations (i.e., blur, ghosting, Gaussian noise, and salt-and-pepper noise).
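As an illustration of one of these transformations, a minimal salt-and-pepper corruption in NumPy; the function name and the noise probability below are ours and not the exact settings of Table 1:

```python
import numpy as np

def salt_and_pepper(vol, prob=0.01, rng=None):
    """Set a random fraction `prob` of voxels to the volume's minimum
    (pepper) or maximum (salt) intensity."""
    rng = rng or np.random.default_rng(0)
    out = vol.copy()
    mask = rng.random(vol.shape)
    out[mask < prob / 2] = vol.min()       # pepper
    out[mask > 1 - prob / 2] = vol.max()   # salt
    return out

vol = np.random.default_rng(1).random((64, 64, 64), dtype=np.float32)
noisy = salt_and_pepper(vol, prob=0.02)
print(noisy.shape == vol.shape)  # → True
```

Applying such corruptions only at training time encourages robustness to scanner-specific artefacts without altering the segmentation targets.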

Figure 9: Results on individuals who had undergone surgical removal of one hemisphere [Kliemann et al., 2019]. (a) Inferences for 5 subjects are shown for LOD-Brain and FreeSurfer. (b) Activation maps for 3 subjects at different LODs, i.e., layers in the network.

Table 1: Details on selected augmentation methods.
In Fig. 5, we present the most relevant results of the ablation study carried out to select model parameters.

Figure 7: (a) Dice coefficient on 5,956 testing volumes (77 datasets), displayed per dataset: INT (green), EXT with good GT (red), and with low-quality GT (grey). Segmentation masks of the worst numerical result (blue dots) are further displayed in Fig. 11. (b) Accuracy on the SIMON (Single Individual volunteer for Multiple Observations across Networks) dataset (EXT) [Duchesne et al., 2019], comprising 94 volumes acquired with 15 different scanner models by 3 major MR vendors (in different colours).