FastSurferVINN: Building resolution-independence into deep learning segmentation methods—A solution for HighRes brain MRI

Leading neuroimaging studies have pushed 3T MRI acquisition resolutions below 1.0 mm for improved structure definition and morphometry. Yet, only few, time-intensive automated image analysis pipelines have been validated for high-resolution (HiRes) settings. Efficient deep learning approaches, on the other hand, rarely support more than one fixed resolution (usually 1.0 mm). Furthermore, the lack of a standard submillimeter resolution as well as limited availability of diverse HiRes data with sufficient coverage of scanner, age, diseases, or genetic variance poses additional, unsolved challenges for training HiRes networks. Incorporating resolution-independence into deep learning-based segmentation, i.e., the ability to segment images at their native resolution across a range of different voxel sizes, promises to overcome these challenges, yet no such approach currently exists. We now fill this gap by introducing a Voxel-size Independent Neural Network (VINN) for resolution-independent segmentation tasks and present FastSurferVINN, which (i) establishes and implements resolution-independence for deep learning as the first method simultaneously supporting 0.7–1.0 mm whole brain segmentation, (ii) significantly outperforms state-of-the-art methods across resolutions, and (iii) mitigates the data imbalance problem present in HiRes datasets. Overall, internal resolution-independence mutually benefits both HiRes and 1.0 mm MRI segmentation. With our rigorously validated FastSurferVINN we distribute a rapid tool for morphometric neuroimage analysis. The VINN architecture, furthermore, represents an efficient resolution-independent segmentation method for wider application.


Introduction
While neuroimaging pipelines have benefited substantially from the standardization of Magnetic Resonance Imaging (MRI) at 1.0 mm, the resulting fixed-resolution paradigm now hinders transition to high-resolution (HiRes) MRI. With the hope of advancing quantification of structural detail, increasing explanatory power, and improving our understanding of the brain in health and disease (Glasser et al., 2013;Mellerio et al., 2014;Solano-Castiella et al., 2011;Stankiewicz et al., 2011;Wattjes et al., 2006;Zaretskaya et al., 2018), leading large-cohort neuroimaging studies have started to acquire structural MRI at 3T field strength and 0.7-0.9 mm resolutions (see Section 2.1). However, the lack of reference segmentations and limited diversity of HiRes MRI datasets (e.g. regarding scanner, disease, genetic variation) lead to substantial limitations for bias-free method development and validation. Additionally, since no de-facto standard resolution exists for HiRes imaging, neuroimaging tools introducing HiRes processing (Bazin et al., 2014;Gaser and Dahnke, 2016;Huntenburg et al., 2018;Yushkevich et al., 2014;Zaretskaya et al., 2018) have to provide resolution-independence instead of following the fixed-resolution paradigm. Although Convolutional Neural Networks (CNNs) deliver convincing performance under the fixed-resolution paradigm Coupé et al., 2020;Henschel et al., 2020;Huo et al., 2019;Ito et al., 2019;McClure et al., 2019;Mehta et al., 2017;Roy et al., 2019;Sun et al., 2019;Wachinger et al., 2018), no methodological solution leverages explicit knowledge of the native image resolution, consequently limiting all output segmentations to one pre-defined voxel size and potentially ignoring important structural detail specifically for submillimeter scans. By introducing Voxel-size Independent Neural Networks (VINNs), we now leverage the diversity of widely available 1.0 mm MRIs and enrich the model with details derived from HiRes MRI, achieving not only resolutionindependence but improving segmentation performance across resolutions.
For resolution-independent deep learning, we establish two core requirements: 1. Nativeresolution segmentation: the network's input and output for training and crucially inference should be at the native resolution, to avoid any external resampling. 2. Resolutionindependence: the network should be able to learn and predict from images at a range of different resolutions. Segmentation for HiRes images additionally aims to improve the quality of fine structures (e.g. narrow sulci, gyri and white matter (WM) details). To achieve high-quality whole brain segmentation and avoid training biases, a neural network should generalize to various resolutions (seen and unseen during training) and datasets with different characteristics (e.g. with respect to scanners, demographics, diseases, genetic variation) ideally sharing and transferring knowledge between resolutions.
While some traditional neuroimaging pipelines fulfill these requirements (Bazin et al., 2014;Gaser and Dahnke, 2016;Huntenburg et al., 2018;Yushkevich et al., 2014;Zaretskaya et al., 2018), no related work directly addresses these challenges with deep learning. Fig. 1 illustrates two adaptations to form baseline solutions: A. dedicated fixed-resolution networks and B. a single resolution-ignorant CNN that accepts multiple resolutions through training. Training one dedicated network per resolution (A.) trades the potential of a larger, more diverse training corpus for compatibility with the native resolution raising bias and generalization limitations (there are, for example, currently no HiRes neurodegeneration datasets). However, a fixed-resolution network trained and evaluated on its native resolution represents an upper bound for achievable performance under same-size training datasets. On the other hand, if one network is naïvely trained on multiple resolutions (B.), each convolutional layer has to learn to generalize across scales as the network cannot easily differentiate between different voxel sizes (i.e. resolution-ignorant), which allocates network capacity to this task. To support this process and reduce potential bias from missing or unbalanced resolution data, one can add external scale augmentation (+exSA) by resampling individual images and reference labels during training. This, however, induces information loss and interpolation artefacts, e.g. from lossy nearest-neighbour (NN) interpolation of discrete label maps. Since neither approach (A. or B.) has been implemented or compared for submillimeter whole brain segmentation, we introduce respective baseline models utilizing our proposed, optimized micro-architecture in all models for a fair comparison.
To overcome limitations of both approaches (A. and B.), we propose a VINN with innovations including its micro-architecture and the addition of a HiRes loss. The core contribution, however, is the network-integrated resolution-normalization to support native segmentation at various voxel sizes. In fact, any UNet-based architecture (Ronneberger et al., 2015) is, by design, a multi-scale approach where pooling operations represent fixedfactor integer down-and up-scale transitions (usually by the scale factor 2). Our networkintegrated resolution-normalization in VINN now replaces this fixed scale transition with a flexible re-scaling for the first and last scale transitions. This has the advantage of placing our interpolation operation at a position where information loss naturally occurs (down-or up-scaling via pooling). As illustrated in Fig. 1C., we retain compatibility with a range of resolutions for input and output (MultiRes) by shifting the interpolation into the architecture itself. At the same time, we leverage the lower variance of perceived size differences in the inner normalized resolution blocks (FixedRes). This has the advantages of (i) retaining important image information at the native resolution in the MultiRes blocks, (ii) interpolating multi-dimensional continuous feature maps rather than discrete labels or single slice images hence avoiding lossy NN interpolation and extending contextual neighbourhood information during the resolution-normalization, and (iii) disentangling perceived voxel versus actual structure size differences inside the VINN.
Especially the last point may have been underappreciated so far: Due to the nature of convolutional layers, CNNs 'perceive' distances and thus structure sizes by number of voxels rather than millimeters. Therefore, the original voxel size impacts perceived distances for baseline architectures, requiring them to cover larger size variety during training and inference in the multi-resolution context. The VINN, on the other hand, transitions from a voxel-based distance context in the MultiRes blocks to a standardized distance context in FixedRes via the network-integrated resolution-normalization and, therefore, releases network capacity in the inner layers for other tasks.

Contributions
All in all, our VINN, for the first time, effectively addresses the challenges associated with HiRes MRI (i.e. multiple resolutions and reduced variety in datasets) in a single framework. We extensively test variations of the network architectures and demonstrate that inherent biases introduced by the unavailability of certain HiRes scans are reduced by transferring information across resolutions via our network-integrated resolution-normalization.
Specifically, we show that our VINN i.
segments 3T brain MRI at their native resolutions (0.7-1.0 mm) into 95 regions in less than 1 min on the GPU and generalizes robustly to unseen resolutions within and beyond the training corpus, ii.
significantly outperforms state-of-the-art scale augmentations as well as fixedresolution models with respect to segmentation accuracy and generalizability with an optimized architecture, and iii. improves accuracy by combining and leveraging both the increased structural information from submillimeter 3T brain scans and the generalizability to intensity, scanner, disease, and other variations from standard 1.0 mm MRIs.

High-resolution MRI
In neuroimaging, spatial resolution is of great importance as the available voxel size directly dictates the degree to which fine-scale subcortical and cortical structures, specifically narrow gyri and sulci, can be resolved in an MRI. In particular, the diversity of tissue types in a single voxel, which heavily depends on its size, influences the signal intensity known as the partial volume effect (PVE). A strong partial volume effect (PVE) specifically complicates the delineation of tissue borders on the voxel grid (Glasser et al., 2013;Zaretskaya et al., 2018). HiRes images offer finer sampling of the underlying information, thereby directly reduce PVE and enable more accurate segmentations, improved volume-based measurements, morphometry, surface placement, and derived thickness measures (see Fig.  2) (Glasser et al., 2013;Luesebrink et al., 2013;Zaretskaya et al., 2018).
It is, therefore, not surprising that HiRes MRIs are becoming increasingly popular within the neuroimaging community. While most established large-scale neuroimaging studies acquire data at 1.0 mm, the current and next generation studies are shifting to submillimeter resolutions (e.g. Human Connectome Project (HCP) (Glasser et al., 2013), Rhineland Study (RS) (Breteler et al., 2014), Autism Brain Imaging Data Exchange II (ABIDE-II) (Di Martino et al., 2017), TRACK-PD (Wolters et al., 2020)). However, to date the consolidated superset of publicly available 3T HiRes neuroimages is sparse, unbalanced, and heterogeneous with respect to available submillimeter resolutions (e.g. HCP: 0.7 and 0.8, ABIDE-II: 0.7-0.9, RS: 0.8). Especially the limited data variety at each specific resolution poses a real challenge for data-driven computational methods, which translates into limited compatibility with scanners, age spans, and especially disease groups. Moreover, unbalanced training data can easily lead to the introduction of biases into the model.
While suffering from stronger PVEs and ensuing detrimental effects on segmentation accuracy, the available collection of standardized 1.0 mm images, on the other hand, is large and diverse. Consequently, a wide coverage of age-groups, diseases, genetic variants, and scanners can be retrieved from the rich reservoir of openly available MRI data sources (e.g. OpenfMRI database (Poldrack et al., 2013;Poldrack and Gorgolewski, 2014), OpenNeuro (Markiewicz et al., 2021), NITRC-IR 2 ). Furthermore, manual reference labels for validation are exclusively published openly for 1.0 mm (Klein and Tourville, 2012). To address dataset sparsity and bias at submillimeter resolutions, we believe data-driven computational methods require built-in resolution-independence. Only a single model that spans across the available resolutions can simultaneously provide benefits for both HiRes and standard 1.0 mm image analysis.

Automated analysis of high-resolution images
Traditionally, common neuroimaging pipelines have been developed and optimized for 1.0 mm voxels (Fischl et al., 2002;Friston et al., 2007;Gaser and Dahnke, 2016;Jenkinson et al., 2012;Zhang et al., 2001) representing the de-facto standard for years. FreeSurfer offers a validated HiRes stream (Zaretskaya et al., 2018), which provides sub-segmentation of the cortex into 31 structures per hemisphere (DKTatlas) (Klein and Tourville, 2012). However, one problem here is the extended processing time caused by the cubic voxel increase -a common issue limiting applicability of traditional tools to large cohort studies.
Despite their success for brain segmentation and other applications no work has introduced deep learning for submillimeter whole brain segmentation. Since CNNs require a large and diverse number of volume and segmentation pairs for effective training, the missing availability of diverse training data hinders this resolution transition. Additionally, the cubic relationship between resolution increase and GPU memory requirements puts memory hungry 3D architectures at a disadvantage, e.g. factor of 2.92 = (1/0.7) 3 memory demand for a 1 to 0.7 mm reduction of voxel sizes.
Here, multiple potentially cross-linked pathways are dedicated to specific down-scaled or cropped versions of the original image. The same principle is employed in other scale-aware networks (Chen et al., 2016;Huang et al., 2021;Li et al., 2019) by implementing more trans-scale connections or combinations of different dilation and kernel sizes. To further avoid sub-optimal equal weighting of different scale information, the networks often include attention mechanisms (Chen et al., 2016;Qin et al., 2018;Xu et al., 2021;Yang and Peng, 2018;Zheng et al., 2021). In practice, these methods are only compatible with the discrete input and sub-level resolutions they were trained on, leaving subvoxel scaling more common in neuroimaging data (0.7/0.8/1) unexploited. In fact, their intend is to explicitly integrate information from multiple image scales rather than achieve multi-resolution compatibility. Critically, in neuroimaging settings the true scale is known (image resolution) while scaleaware architectures assume this knowledge is not available. In VINN we inject this explicit knowledge into the network at the spatial resolution normalization step.

Resolution-independence in deep learning
Built-in resolution-independence in CNNs has not been described for brain segmentation nor -to our knowledge -for any other segmentation tasks. Approaches such as (Billot et al., 2020;Iglesias et al., 2021) pre-sample input images (with associated reliability maps) to a common resolution (here 1.0 mm) and provide outputs there, which makes them inherently fixed-resolution techniques. While they can provide 1.0 mm segmentations (and even images) for lower resolutional clinical scans via heavy augmentation, they can neither profit from submillimeter details, nor provide native HiRes segmentations. A transfer to higher resolutions would require retraining with a fixed submillimeter training set. This is, however, problematic as (i) no standard submillimeter resolution exists, meaning one would either need to train multiple versions or focus on the highest available resolution to retain input resolution-independence, (ii) HiRes datasets, as mentioned before, demonstrate low subject variance (disease, age, genetics) making this approach susceptible to training biases, and (iii) upscaling the 3D-UNet architecture to 0.7 mm would require 2.9-times as much GPU memory surpassing memory limits.
Different from segmentation networks, state-of-the-art super-resolution networks aim specifically at reconstructing a (fixed resolution) HiRes image based on a low-resolution (LowRes) input, also requiring ground truth at the higher resolution. They often rely on pre-sampling, i.e. using interpolation-based up-sampling methods to initialize the desired output grid and fine-tune the features in the following network (Dong et al., 2016;Kim et al., 2016a;2016b). Using augmentation, these networks are trained to restore images coming from various lower resolutions via a single model, similar to the discussed segmentation strategy (Fig. 3B.). While showing better performance than dedicated single-scale models (fixed input and fixed output resolution), pre-sampling significantly increases the computational complexity and cost of the architectures due to the increase in image size.
To overcome this limitation, interpolation-based post-sampling has been introduced recently (Alao et al., 2021;Shen et al., 2021). Here, the up-sampling step is shifted towards the end of the network architecture and performs the interpolation in the latent space. Interestingly, this post-sampling approach is as effective or even superior to the pre-sampling methods, while maintaining versatility with respect to the chosen output scale (Alao et al., 2021;Shen et al., 2021). All these super-resolution architectures, in addition to not being aimed at segmentation, differ from our approach by transitioning once, from the input resolution to an output resolution. In contrast, we insert two latent-space interpolation blocks transitioning both ways between the native and the inner resolution.
More generally, multi-source domain adaptations (MSDAs) leverage available data with different underlying distributions (i.e. resolutions in our case). This field of research focuses on enhancing the generalization ability of a model by transferring knowledge between resource-full source domains (i.e. LowRes data) to a sparsely represented target domain (i.e. HiRes data) by transforming either the features in the latent space or images on a pixel-level.
In contrast to our resolution-normalization, the latent space transformations in MSDAs are based on optimizing a discrepancy (Guo et al., 2018;Peng et al., 2019;Zhu et al., 2019) or adversarial loss Wang et al., 2019;Xu et al., 2018). These latent space alignments are, however, often insufficient for segmentation tasks due to their focus on high-level information only (Zhao et al., 2019). To circumvent this problem, intermediate domain generators have been proposed and successfully applied for semantic segmentation (Hoffman et al., 2018;Russo et al., 2019;Zhao et al., 2019). Here, a pixel-level alignment between source and target domain is learned through Generative adversarial networks. However, these MSDA methods are limited to a single target distribution. Extension to multi-targets is a relatively unexplored area with only a few published methods for classification tasks so far Gholami et al., 2020;Jin et al., 2020;Liu et al., 2020;Peng et al., 2019;Roy et al., 2021;Yang et al., 2020).
Spatial transformers (Jaderberg et al., 2015) represent another technology relying on internal interpolations as an important building block. Here, spatial invariance (registration) is targeted via a learnable affine transformation inside the network (localisation network). After the source feature map coordinates are computed (grid calculator), intensity values for each target pixel are determined via bi-linear interpolation (sampler). Spatial transformers hence attempt to implicitly learn data representations in a resolution-ignorant way. While our approach shares grid calculation and interpolation within the network with spatial transformers, our training approach is different: Instead of a localisation network, we directly determine the sampling-grid based on the input scale factors, i.e. the ratio between the native input resolution and the desired normalized inner resolution. Hence, we explicitly integrate knowledge about the image resolutions into the architecture. As a result, computational complexity is reduced while still achieving desired resolution-independence.

Datasets
The following three submillimeter MRI datasets were selected for training, testing, and validation of FastSurferVINN. An extended list of the used 1.0 mm datasets can be found in the Section A.3 including a tabulated overview of all used datasets (Table 3).
When not specifically mentioned otherwise, all sets are balanced for gender, age, and study. Participants of the individual studies gave informed consent in accordance with the Institutional Review Board at each of the participating sites. Complete ethic statements are available at the respective study webpages.
ABIDE-II-The Autism Brain Imaging Data Exchange II Di Martino et al. (2017) contains cross-sectional data and focuses on autism spectrum disorders. The dataset contains 1114 subjects from 19 different institutions in total and is accessible online 3 . The 3D magnetization prepared rapid gradient echo (MPRAGE) sequence, or a vendor specific variant, was used to acquire all data using 3T GE, Philips, and Siemens scanners. Voxel resolutions are not standardized across sites although most scans are acquired at 1.0 mm. In this work we use only the ETH_1 sub-cohort, which provides 0.9 mm HiRes scans acquired on a 3T Philips Achieva with a repetition time (TR) of 3 ms, an echo time (TE) of 3.90 ms, an inversion time (TI) of 1.15 ms, and flip angle of 8°. Since HiRes datasets often exclusively feature Siemens scanners, a subset of 25 HiRes scans from ABIDE-II ETH_1 covering an age range of 20-31 years serves as an independent test set to evaluate generalizability to this unseen Philips scanner and 0.9 mm resolution.  Fischl et al. (2002). In particular, the FreeSurfer v7.1.1 HiRes stream (Zaretskaya et al., 2018) was used to generate the desired parcellations following the "Desikan-Killiany-Tourville" (DKT) protocol atlas (Desikan et al., 2006;Klein and Tourville, 2012). We follow the same label mapping approach as in FastSurferCNN (Henschel et al., 2020). In short, identical cortical regions on the left and right hemisphere are joined into one class unless they are in close proximity to each other reducing the total number of labels from 95 (DKT without corpus callosum segmentations) to 78 during network training. The affiliation to the left or right hemisphere are restored in the final prediction by estimating the closest WM centroid (left or right hemisphere) to each label cluster. A list of all segmentation labels is provided in the Appendix (see Table 5). In accordance with the FreeSurfer HiRes stream, all MRI brain volumes are conformed to a uniform coordinate orientation at their respective isotropic voxel resolutions and robustly normalized to unsigned characters (0… 255), i.e. the trained network does not depend on skull stripping or bias-field removal.

Voxel size independent neural network
As mentioned above, inside a standard UNet (Ronneberger et al., 2015), fixed transitions between scales (i.e. resolutions) are performed via down-and up-sampling operations (e.g. pooling and unpooling, see Fig. 3), naturally leading to a reduction in information.
Theoretically, any scale-transition operation is replaceable with an alternative sampling strategy as long as it still allows gradients to flow effectively through the network. In FastSurferVINN, we use this concept to enforce voxel size independence by changing the first level transition in the encoder and the last in the decoder to a flexible interpolation step (i.e. network-integrated resolution-normalization, see Fig. 3). Thus, variable transitions between resolutions without restrictions to pre-defined fixed voxel sizes become possible, both during training and inference.
3.2.1. Network-integrated resolution-normalization-Similar to spatial transformers (Jaderberg et al., 2015), the interpolation-based scale transition can be divided into two parts: (i) calculation of the sampling coordinates (grid generator) and (ii) interpolation operation (sampler) to retrieve the spatially transferred output map.
The sampling coordinates are calculated based on the scale factor SF ∈ ℝ + -the quotient of the resolution information of the inner normalized scale Res inner (a tuneable hyperparameter set to 1.0 mm throughout our experiments) and the input image Res native . For optimal interpolation, SF is slightly adjusted to ensure integer feature map dimensions. In the first transition step (Fig. 3, pre-IDB to IDB), the output feature map V ∈ ℝ H inner × W inner × C is produced by sampling the input feature map U ∈ ℝ H native × W native × C . in the final transition 6 http://surfer.nmr.mgh.harvard.edu/. step (Fig. 3, CDB to post-CDB), this process is reversed effectively by using the inverse scale factor 1/SF. The interpolation itself is performed by applying a sampling kernel to the input map U to retrieve the value at a particular pixel in the output map V. Different interpolation strategies can be defined based on the selection of the sampling kernel. Theoretically, any kernel with definable (sub-)gradients is applicable. Here, we evaluate the bi-linear, bi-cubic, area, and integer sampling kernels (=NN interpolation). The sampling is identical for each channel, hence, conserving the spatial consistency.

Network architecture modifications-
The proposed interpolation strategy can, in general, be included in any CNN equipped with pooling-based scale transitions.
Here, due to its success in neuroanatomical segmentation, the principal network design is based on FastSurferCNN (Henschel et al., 2020) -a UNet-type network with a series of four competitive dense blocks (CDB) in the encoder and decoder arm separated by a CDB bottleneck layer. In FastSurferVINN (FastSurferVINN) one additional CDB layer is added ( Fig. 3 to each arm, i.e. the pre-IDB and post-CDB).

CDB design:
In FastSurferCNN (Henschel et al., 2020), a CDB is formed by repetitions of the basic composite function consisting of a 5 × 5 convolution, followed by batchnormalization (BN) and a probabilistic rectified linear unit (pReLU) activation function. In this work we optimize the architecture design slightly by replacing each 5 × 5 convolution kernel with two 3 × 3 kernels (see Fig. 3). This keeps the effective receptive field size within each block identical to FastSurferCNN while reducing parameter load. We implement this change in FastSurferVINN and also, for better comparability, in an updated FastSurferCNN version denoted by: FasturferCNN*. An ablation study detailing the changes from FastSurferCNN to FasturferCNN* is included in the appendix (Section A.2). As in FastSurferCNN, feature competition is achieved by using maxout (Goodfellow et al., 2013) instead of concatenations (Jégou et al., 2017) in the local skip connections. In order to guarantee normalized inputs to the maxout activation, the feature map stacking operation is always performed after the BN (see position of maxout in CDB design in Fig. 3).

Pre-IDB:
The additional encoder block in FastSurferVINN (see Fig. 3, pre-IDB) transfers image intensity information from the native image to the latent space and encodes voxel size-dependent information before the internal interpolation step. In contrast to the described CDB, the raw inputs are normalized by first passing them through a BN-Conv-BN combination before adhering to the original composite function scheme (Conv-BN-pReLU) (see Fig. 3, (pre-)IDB).

Post-CDB:
Akin to the pre-IDB, an additional CDB block in the decoder is used to merge the non-interpolated feature information returned from the pre-IDB skip connection and the upsampled feature maps from the network-integrated resolution-normalization step. Both maps are combined via a concatenation operation and then fed to a standard CDB block (see Fig. 3, (post-)CDB). After the final 1 × 1 convolution a softmax operation returns the desired class probabilities.

High-res network modifications
To improve segmentation accuracy of detailed structural features, we explore two network modifications, namely a loss-function weighting scheme and an adaptive attention mechanism.
Loss function-The network is trained with a weighted composite loss function of logistic loss and Dice loss (Roy et al., 2017). With p l,i (x) the estimated probability of pixel i to belong to class l and the corresponding ground truth probability y, the loss function can be formulated as Localized weights-Here, ω median freq. represents median frequency balancing and ω gradient boundary refinement through a 2D gradient vector (Roy et al., 2017). We now extend (ωi by two weighting terms (ω GM and ω WM/Sulci ) to improve segmentation quality in highly convoluted areas of the cortex that are better represented in submillimeter scans. The WM strand and deep sulci mask (ω WM/Sulci ) emphasizes thin WM strands and narrow sulci, and is defined by the voxels added through a binary closing operation on the gray matter (GM) labels (left side of Fig. 4). The outer gray matter mask (ω GM ) accentuates pixels at the boundary of the cortex and is defined by the voxels lost during brain mask erosion (right side of Fig. 4). Overall, ω GM and ω WM/Sulci aim to adjust the underlying decision boundary to closely match the target segmentation in PVE-affected locations by assigning higher weights to narrow WM strands, deep sulci, and tissue boundaries with emphasis on the border between cortex and cerebrospinal fluid (CSF).
Adaptive attention module-Context-driven learnable attention mechanisms have been used to automatically select optimal scale or filter sizes for specific image regions and can boost segmentation accuracy of differently sized structures within an image . As a reference, we therefore evaluate the addition of attention in the pre-encoder and post-decoder. The generation of the activation map follows the method proposed in . A detailed description is given in the Appendix (Section A.1). In short, learned activation maps from the attention module are used to dynamically weight each feature response generated by the sequence of convolutions within the CDB. This online weight calculation introduces non-linearities outside the activation function into the CDB.

View aggregation
In order to account for the inherent 3D geometry of the brain, we adopt the same view aggregation scheme as in Henschel et al. (2020) for all evaluated models: one F-CNN per anatomical plane is trained and the resulting probability maps are aggregated through a weighted average. Due to missing lateralization in the sagittal view, the number of classes is effectively reduced from 78 to 50 and the weight of the sagittal predictions is reduced by one half compared to the other two views. Inherently, and similar to ensemble learning, the view aggregation combines final soft predictions boosting segmentation accuracy.

Augmentations
External scaling augmentation for CNNs (exSA)-The current state-of-the-art method to introduce resolution invariance into neural networks is extensive scale augmentation (see Fig. 1B). Therefore, we contrast our proposed network-integrated resolution-normalization against this approach. We use random linear transforms with scaling parameters sampled from a uniform distribution of the predefined range 0.8 to 1.15 to augment images during the training phase. Every minibatch hence consist of a potentially scaled MRI (bi-linear interpolation) and a corresponding equally re-scaled label map (NN sampling). To disentangle resolution-independence strategies from micro-architecture changes, we use the same CDB and IDB implementations as for FastSurferVINN described above, with the exception of the resolution-normalization.
Internal scaling augmentation for VINNs (inSA)-In order to increase the robustness of the latent space interpolation, we augment the scale factor SF introduced by the VINN with a parameter α, so SF = Res inner /Res native + α during the network-integrated resolution-normalization. This effectively introduces small resolution variations within the grid sampling procedure. The values for α are sampled from a Gaussian distribution with parameters sigma=0.1 and mean=0. Overall, this modification can be interpreted as an internal scale augmentation randomly resizing the feature maps in the latent space (as opposed to externally augmenting the native images).

Evaluation metrics
We use the Dice Similarity Coefficient (DSC) and Average Surface Distance (ASD) to compare different network architectures and modifications against each other, and estimate similarity of the predictions to a number of previously unseen scans with respect to FreeSurfer and manual labels as a reference. Both are standard metrics to evaluate segmentation performance.
Here, the DSC Dice (1945); Sørensen (1948) is defined as twice the intersection of ground truth and prediction divided by the sum of their cardinalities and multiplied by 100. A larger DSC represents better overlap between the segmentations with a maximum value of 100 for perfect agreement. The ASD measures the average distance (in mm) between all points x ∈ Y, x′ ∈ P on the outer surface of the ground truth (Y) and the prediction (P). It is defined as with distance d(x, P) = min x ′) ∈P ∥x − x′∥ 2 representing the minimum of the Euclidean norm. In contrast to the DSC, a smaller ASD indicates better capture of the segmentation boundaries with a value of zero being the minimum (perfect match). Within each section, improvements in segmentation performance are confirmed by statistical testing (Wilcoxon signed-rank test (Wilcoxon, 1945) after Benjamini-Hochberg correction (Benjamini and Hochberg, 1995) for multiple testing) referred to as corrected p throughout the paper.

Training setup
Training dataset-Due to the nature of the experiments performed in this paper, the dataset composition varies between individual sections. An overview of the different trainingsets is given in the Appendix (Table 4). All representative datasets are balanced with regard to gender and age. Empty slices were filtered from the volumes. All directly compared networks are trained under the same conditions unless stated otherwise. Experimental setups demanding different datasets for training are separated by vertical white lines between bar plots and/or indicated in the figure legend. We additionally discuss this in the corresponding sections. In the first ablative evaluations of FastSurferVINN (Sections 4.1 and 4.2), 120 representative subjects are selected for training (60 1.0 mm and 60 submillimeter subjects, see Table 4: Mix), leaving on average 155 single view planes per subject and a total training size of at least 23k images per network. To determine the generalization performance of FastSurferVINN across resolutions, the training set is changed such that the submillimeter scans are of the same resolution and study (see Table  4: No 0.7 mm and No 0.8 mm). Similarly, the fixed-resolution networks are trained with 60 or 120 0.8 mm scans (see Table 4: Only 0.8 mm). For the LowRes version 120 1.0 mm scans are used (see Table 4: Only 1.0 mm). In Big-FastSurferVINN, the 1.0 mm component is extended to 1255 scans while keeping the same number (60) of HiRes scans as in the original training set (see Table 4: Mix (Big)).
Training parameters-Independent models for the coronal, axial, and sagittal plane are implemented in PyTorch (Paszke et al., 2017) and trained for 70 epochs using one NVIDIA V100 GPU with 32 GB RAM. The modified adam optimizer (Loshchilov and Hutter, 2019) is used with a learning rate set to 0.001. A cosine annealing schedule (Loshchilov and Hutter, 2017) adapts the learning rate during training where the number of epochs between two warm restarts is initially set to 10 and subsequently increased by a factor of two. The momentum parameter is fixed at 0.95 to compensate for the relatively small mini batch size of 16 images. For maximum fairness, all networks presented within this paper have been trained under equal hardware and hyper-parameter settings.

Results
We group the presentation of results into two blocks: 1. ablative architecture improvements to determine the best performing multi-resolution architecture (Sections 4.1 and 4.2), and 2. performance analysis to comprehensively characterize the advantages of our VINN on a wider variety of datasets, resolutions, scanners, and variations of the training corpus (Sections 4.3-4.5). Following best practice in data-science, we utilize completely separate datasets during the evaluations: the validation set Table 4: Validation (for 1.), and various test sets Table 4: Testing (for 2.). This avoids data-leakage and overfitting, i.e., it ensures that training, architectural design decisions, and final testing cannot influence each other, which could lead to overly optimistic results.

Scaling augmentation versus network-integrated resolution-normalization in FastSurferVINN
The central contribution of this paper is the design and evaluation of a Voxel-size Independent Neural Network for (sub)millimeter whole brain segmentation. Here, we compare segmentation performance of our FastSurferVINN, which avoids interpolation of label maps, with several approaches that rely on traditional scaling data augmentation. Each subsequent improvement in segmentation performance is confirmed by statistical testing (corrected p < 0.05).
Firstly, the original FastSurferCNN CDBs consecutively perform two 5 × 5 convolution operations followed by a final 1 × 1 convolution (Section 3.2.2, FastSurferCNN in Performance on the subcortical structures is, however, slightly reduced. Interestingly, VINN, which avoids label map interpolation al-together, shows a positive effect across structures compared to FastSurferCNN* with and without scaling augmentation (Fig. 5, VINN). Specifically, combination of VINN with our internal scaling augmentation (VINN + inSA, referred to as FastSurferVINN) strengthens segmentation performance significantly. The DSC increases to 89.05 for the subcortical and 88.93 for the cortical structures representing the best value across the compared architectures. Similarly, the ASD is significantly reduced with a distance of 0.293 mm for the subcortical and 0.226 mm for the cortical structures (Fig. 5, FastSurferVINN). Finally, addition of external scaling augmentation negatively impacts segmentation performance of VINN (Fig. 5, VINN + inSA + exSA). On average, performance on the subcortical structures is reduced to the level of FastSurferCNN* + exSA, while the cortical structures still represent the second best result for both, DSC and ASD. Overall, FastSurferVINN outperforms all traditional scale augmentation approaches by a significant margin (corrected p < 10 −7 ).
As our network-integrated resolution-normalization directly operates on continuous 2D feature maps in the latent space, various sampling kernels can be incorporated. Here, we evaluate the effect of four different interpolation strategies, namely NN, area, bi-cubic, and bi-linear. As visible in Fig. 6, with the exception of NN changing the sampling kernel does not significantly affect segmentation performance. The NN interpolation (Fig. 6, first column) overall reduces performance by 2% for DSC. The change on the ASD is more severe with a decrease in performance by 13% for the subcortical and 7% for the cortical structures. The bi-linear kernel (Fig. 6, last column) shows the best performance overall with a 89.05 DSC and an ASD of 0.293 mm for the subcortical structures and a 88.93 DSC and an ASD of 0.226 mm for the cortical structures. Given that the bi-linear interpolation is computationally favourable to bi-cubic, we henceforth keep it as the sampling kernel of choice.

HiRes specific adjustments
The higher resolution in submillimeter scans reduces PVEs and offers a potential to optimize segmentation performance (see Fig. 2). Overall, the changes in tissue and border assignment might, however, not be accurately captured during network training as they represent marginal changes in the loss compared to the whole brain volume. In order to focus on the PVE affected regions (specifically small WM strands and deep sulci), we test two different modifications. First, a HiRes loss function (HiRes Loss) up-weights information from the areas in question. Second, attention mechanisms are introduced to enable automatic refocusing on important information during network training. Here, we equip FastSurferVINN separately with both modifications and evaluate the change in segmentation performance with respect to the DSC and ASD.
Training FastSurferVINN with the new HiRes loss function (see Fig. 7, + HiRes Loss, right bar) significantly improves segmentation performance on the cortical structures (corrected p < 10 −5 ), while maintaining high accuracy on the subcortical structures (no significant change consistent with expectations). A final DSC of 89.3 and an ASD of 0.209 mm is achieved for the cortical structures. Further addition of attention, while simultaneously adjusting the number of feature maps to keep the total number of network parameters of FastSurferVINN constant, does not lead to a significant change in segmentation performance (see Fig. 7, + attention, middle bar). Due to the overall significant improvement on DSC and ASD, the HiRes Loss modification is included in all further comparisons.

Generalizability
After selection of the best architectures for both, VINN and CNN, we now perform a detailed evaluation in a broad selection of test scenarios (see Table 4: Testing). To highlight the advantages of our VINN, we compare FastSurferVINN with FastSurferCNN* + exSA. The latter is improved over the state-of-the-art FastSurferCNN by architectural updates. The difference between FastSurferCNN* + exSA and FastSurfer-VINN is exclusively the substitution of external scale augmentation (exSA, Section 3.5) with network-integrated resolution-normalization (see Section 3.2.1) and internal scale augmentation (inSA, Section 3.5). Here, the DSC differs by 1.1 and 2.1% for the subcortical and cortical DSC and around 11% for the ASD. This again reflects the better cross-resolution generalization performance of FastSurferVINN already outlined in the previous section. ADNI and OASIS1 also benefit strongly from FastSurferVINN, specifically for the cortical structures. Here, the DSC and ASD are improved by around 1.66% and 12.1%, respectively. Overall, the cortical structures benefit stronger from the internal resolution-normalization in FastSurferVINN.

Unseen resolutions-
A core aspect of FastSurferVINN is the implicit compatibility with a variety of resolutions independent of their explicit presence in the training corpus. In order to investigate the generalization capacity of FastSurferVINN in contrast to scale augmentation approaches, we evaluate the inter-and extrapolation capabilities of the trained networks based on the segmentation performance for three resolutions purposefully excluded during network training.
To this end, we specifically drop either (i) all 30 0.8 mm, or (ii) all 30 0.7 mm scans from the training corpus (see Table 4: No 0.7 mm and No 0.8 mm). In order to ensure comparability with respect to the total number of data points and balance between HiRes and LowRes scans, an equal number of subjects from the other respective HiRes datasets are added. In addition, we evaluate performance using the original mixed training set on 25 subjects from ABIDE-II, representing an unseen scanner type (Philips) and resolution (0.9 mm).
Segmentation performance on the 0.7 mm scans reflects the network's resolution extrapolation capabilities (training corpus consists of 0.8 mm and 1.0 mm scans only).
Here, FastSurferVINN reaches a DSC of 86.49 and 87.50 for the subcortical and cortical structures, respectively, representing a significant increase compared to the traditional scale augmentation approach. The improvements on the ASD are even more pronounced with We face the limitation that no publicly available 3T datasets exists at finer resolutions than 0.7 mm. In order to further explore the extrapolation capabilities of FastSurferVINN in comparison to traditional scale augmentation (FastSurferCNN* + exSA), we therefore evaluate segmentation performance at lower resolutions vastly outside the training range (1.4 mm and 1.6 mm). Note, that FreeSurfer segmentations are not available at lower native resolutions either, as images at resolutions coarser than 1.0 mm are upsampled to a millimeter voxel resolution in an initial conversion step (Fischl, 2012). Therefore, we downsample the high-resolution FreeSurfer segmentations of HiRes test sets (HCP and RS) by a factor of 2 along each axis using majority voting to generate "ground truth". The intensity images used for inference were resampled using cubic interpolation.
In Fig. 10  We expect a reduction of performance at low resolutions due to stronger PVE and an 8-fold information reduction.

Comparison to manual reference-Due
to the limited availability of manual labels, FreeSurfer segmentations have been used as the reference for comparison so far. In order to account for potential biases we now also evaluate DSC and ASD for the manually edited cortical regions on 78 subjects from the 1.0 mm Mindboggle101 dataset (Klein and Tourville, 2012). The subcortical segmentations available for a subset of 20 subjects within this cohort are used for subcortical evaluations.
On the subcortical structures (see Fig. 11, left part of left plot), FastSurferVINN (orange bar) and traditional scale augmentation (dark green bar) perform equally well with respect to both, DSC (80.26) and ASD (0.616 mm). The difference between the two methods is more pronounced on the cortical structures (Fig. 11, right side of left plot). FastSurferVINN reaches a final DSC of 81.89 and an ASD of 0.471 mm), representing a significant improvement compared to the external scale augmentation approach (corrected p < 0.005).
These results again indicate that the cortical structures specifically benefit from the internal interpolation approach of FastSurferVINN.
In order to evaluate the benefit of FastSurferVINN for submillimeter scans, we compare segmentation accuracy on three structures (WM, GM and hippocampus) on an in-house set, as no manual HiRes full brain segmentations are publicly available. Note, while we previously reported averages across the 45 cortical and 33 subcortical structures listed in Table 5 for clarity, we now calculate performance measures specifically for these three individual structures. To generate manual annotations, a trained expert corrected FreeSurfer generated segmentations on six cases from the RS. The results are depicted in the left part of Fig. 11

Fixed-resolution networks versus FastSurferVINN
In order to analyse the advantage of the multi-resolution training approach over a fixedresolution network, we compare FastSurferVINN with networks trained on (i) 60 0.8 mm subjects, (ii) 120 0.8 mm subjects, and (iii) 120 1.0 mm subjects (see Table 4: Only 0.8 mm and Only 1.0 mm). Inherently, this analysis highlights how the interplay of HiRes information from submillimeter 3T scans and dataset-variations from LowRes MRIs can mutually benefit segmentation performance on the different resolutions. Overall, the analysis of Fig. 12 highlights three important properties.
First, increasing the number of data samples has a strong effect on segmentation performance. The fixed-resolution network trained with 60 0.8 mm scans (left plot; light green bar) reaches an average DSC of 89.28 and ASD of 0.257 mm for RS data (same cohort as in the training set). Doubling the number of cases to 120, significantly improves both measures (corrected p < 10 −9 ) with a final DSC of 89.85 and ASD of 0.245 mm (FastSurferVINN, orange bar). Interestingly, FastSurferVINN trained with 60 HiRes (30 0.7 mm and 30 0.8 mm) and 60 LowRes scans (1.0 mm) reaches the same accuracy, as if these scans were actually all HiRes scans from the RS cohort (120 0.8 mm, dark green bar). Benefits from the sample size increase are, therefore, independent of the data resolution. Advantageously, the additional 90 HiRes cases at 0.8 mm may easily be integrated into the FastSurferVINN training dataset, hence, increasing the training corpus with another expected performance gain.
Second, fixed-resolution networks perform well on the cohort they are trained on, but lack generalization capability to other datasets (left plot, HCPL). When comparing networks trained with equally sized datasets (120 HiRes RS fixed-resolution net (dark green bar) and FastSurferVINN (orange bar)) on the 0.8 mm HCPL cohort, FastSurferVINN clearly outperforms the fixed-resolution network with an increase in DSC to 88.99 and a decrease of the ASD to 0.326 mm, representing a significant improvement (corrected p < 0.005). The gap in segmentation accuracy between the RS and HCP cohort with respect to DSC is halved for FastSurferVINN compared to the fixed-resolution approach and a similar reduction is visible for the difference in ASD. Third, multi-resolution training benefits 1.0 mm datasets as well. In the right part of Fig. 12, we compare FastSurferVINN (orange bar) and two fixed-resolution 1.0 mm networks -the original FastSurferCNN (CNN, gray bar) as well as the optimized FastSurferCNN* (CNN*, dark green bar) -with DSC (top) and ASD (bottom) metrics. FastSurferCNN is included to highlight the cumulative performance gain of FastSurferVINN (architectural optimization already included in CNN* and voxel-size independence via resolution normalization). The training sets of all three networks contain the exact same subjects, with the only difference being the MRIs resolution: We train the fixed-resolution networks exclusively with 1.0 mm images (native or downsampled from HiRes), while FastSurferVINN uses all images at their native resolution (1.0 mm, 0.8 mm, or 0.7 mm). Note, since the FreeSurfer-based label maps are obtainable at either resolution, we circumvent resampling with NN. Interestingly, the HiRes information from submillimeter scans boosts performance for the 1.0 mm scans. FastSurferVINN reaches a DSC of 87.62 and an ASD of 0.296 mm representing a significant improvement in DSC and ASD compared to the fixed-resolution networks (corrected p < 0.001). Compared to the original FastSurferCNN, on average an improvement of 1.46% DSC and 10.06% ASD (0.93 to 2.37% and 8.3 to 12.7%, respectively, across datasets) can be achieved with FastSurferVINN.
Overall, the inherent voxel size independence of FastSurferVINN and resulting multiresolution training option is highly beneficial to both, submillimeter and 1.0 mm scans.

Big-FastSurferVINN
Based on the observed performance gain with a larger training corpus, we evaluate whether ASD and DSC can be further improved across resolutions by expanding the training set from 120 to 1315 cases when exclusively adding 1.0 mm scans.
As illustrated in Fig. 13, expansion of the training corpus (n = 1315. yellow bar) leads to a noticeable performance gain. Specifically the 1.0 mm dataset benefits from the sample increase with a final DSC of 90.26 and an ASD of 0.231 mm. On average, a 2.95% increase in the DSC and 16.26% decrease in the ASD can be observed for the 1.0 mm scans compared to the smaller training set (n = 120, orange bar), representing a significant performance gain (corrected p < 10 −20 ). Additionally, ASD and DSC are significantly improved for the 0.7 mm, 0.8 mm and most strongly for the 0.9 mm scans (p < 10 −5 , Wilcoxon signed-rank test). Here, DSC and ASD improve by 3.77% and 24.08%, respectively. Note, that the 0.9 mm dataset is exclusively present in the testing corpus and thus completely new to the networks. A significant resolution bias in the training set, hence, does not decrease, but rather elevate performance across 1.0 mm and submillimeter scans.

Discussion
In this paper, we present the first multi-resolution deep learning tool for accurate and efficient (sub)millimeter 3T MRI whole brain segmentation. FastSurferVINN addresses the two main difficulties associated with HiRes MRIs, namely limited availability and diversity specifically for certain groups (e.g. neurodegeneration or scanner types) as well as resolution non-uniformity. Applicability of deep learning approaches is generally restricted to domains were enough training data exists (traditionally 1.0 mm scans). FastSurferVINNs' networkintegrated resolution-normalization provides independence to the input voxel grid during both, training and inference, and favourably extends processing spans to resolutions not explicitly included in the training corpus. Further, both, LowRes and HiRes scans, can be included during training and, in turn, benefit from each others favourable properties, i.e. coverage and size versus better representation of detailed structures.
The current state-of-the-art to approximate voxel size independence in networks is data augmentation. Here, transformations with randomly sampled scale parameters are applied to the images (both intensity and label map) in the native space. In FastSurferVINN, we shift this step into the latent space by replacing the first scale transition from a pooling/ unpooling operation to interpolation-based down/upsampling. This network-integrated resolution-normalization interpolates the continuous feature maps on both the encoder and decoder arms and avoids alteration of the underlying ground truth labels entirely.
When optimizing the network architecture, we first show that the best results are achieved with a 3 × 3 kernel and bilinear interpolation. Also, our HiRes loss which focuses segmentation performance on PVE-affected structures (deep sulci, thin white matter strands and GM-CSF boundary) improves DSC and ASD specifically for the cortical structures. Interestingly, the improvement is consistent across resolutions, indicating that the structural information learned from the HiRes scans is effectively transferred to the 1.0 mm scans. Finally, we evaluate the introduction of attention mechanisms into the architecture. Overall, performance does not improve if the total number of parameters of the network with and without attention is controlled, indicating that network capacity (i.e. number of learnable parameters) rather than adaptive attention is the important factor here.
We demonstrate FastSurferVINN's superior performance compared to state-of-the-art data augmentation for the fast and detailed segmentation of whole brain MRI. Our networkintegrated resolution-normalization in combination with internal scale augmentation outperforms traditional scale augmentation in terms of accuracy by a significant margin both with respect to FreeSurfer and a manual standard. Across nine different datasets and four resolutions, including both defaced (HCP) and full-head images, our network achieves the highest DSC compared to FreeSurfer as a reference (88.38 on average), as well as the lowest ASD (0.283 mm on average). In addition, FastSurferVINN achieves the best results on the manually labeled 1.0 mm Mindboggle101 dataset with a DSC of 81.89 and an AVG HD of 0.471 mm on the cortical structures. Correspondingly, WM, GM, and hippocampus are better segmented with FastSurferVINN on six manually corrected scans from the submillimeter RS dataset.
One possible explanation for the consistently improved segmentation metrics with our VINN compared to traditional scale augmentation is the circumvention of label interpolation. For discrete labels, NN or majority voting have to be applied. These kernels are prone to remove important structural details and create jagged segmentation maps (Allebach, 2005;Parker et al., 1983;Schaum, 1993;Thevenaz, 2009). This is underlined by the observation that segmentation performance deteriorates for FastSurferVINN when external scale augmentation is added. Further, analysis of the interpolation methods during the network-integrated resolution-normalization also highlights the negative impact of the NN interpolation kernels. While all other methods (area, bi-linear, and bi-cubic) performed equally well, performance drops for NN kernels on average by approximately 2% DSC and 10% ASD. Overall, NN seems to have a systematic negative effect on segmentation performance and should, hence, be avoided wherever possible.
In addition, our resolution-normalization interpolates a feature map of vectors instead of a single scalar image slice. As each vector describes a neighborhood in the input image, the available contextual information is extended before the interpolation step and may subsequently support a smoother transition between resolutions. Further, our networkintegrated resolution-normalization reduces the range of perceived anatomical size variation in the inner blocks, thereby liberating resources to focus on structural details at a common metric scale (structures have similar sizes after normalization). Finally, a skip-connection transfers the non-interpolated feature maps from the pre-IDB into the post-CDB, potentially improving ambiguous label borders with details available at the native resolution. The differentiation between native voxel-scale and normalized inner scale may specifically be helpful in settings with large inter-subject size similarities (e.g. head size is relatively stable). As such, our network-integrated resolution-normalization approach is not limited to the neuroimaging context, but can be expected to benefit segmentation performance in other domains.
FastSurferVINN further demonstrates excellent generalization performance with respect to inter-and extrapolation, achieving the best results for a variety of unseen submillimeter resolutions excluded from the training corpus. Especially more detailed cortical structures targeted with HiRes acquisition show a significant DSC increase on the 0.7 mm, 0.8 mm and 0.9 mm scans, respectively. Similarly, ASD improved by 19.7%, 10% and 11% highlighting specifically the improved extrapolation capabilities of FastSurferVINN on the left-out 0.7 mm dataset component compared to traditional scale augmentation. Given the lack of a de-facto standard in HiRes datasets, consistent performance towards unseen resolutions is an important property. Specifically, new releases such as the upcoming HCP disease studies 7 may benefit from a flexible and validated method to process scans with good accuracy at their native submillimeter resolution. Conveniently, FastSurferVINN avoids lossy image downsampling, as well as time intensive re-training, and simplifies re-validation. One future aspect to investigate, is the extend to which this generalization is effective. Unfortunately, with increasing resolutions, evaluation becomes virtually impossible due to (i) unavailability of manual segmentations and (ii) limited support by existing neuroimaging pipelines (i.e. validation is difficult here as well due to point (i)). Our substitute evaluation on the downsampled, lower resolution images underlines that FastSurferVINN robustly generalizes beyond the tested resolution range (see Section 4.3.2, Fig. 10). Specifically, differences to traditional data augmentation are magnified with a more than 16.8% increase in DSC for the 1.6 mm scans. The improved generalizability of FastSurferVINN to under-or unrepresented resolutions especially indicates superior performance in real applications. Furthermore, FastSurferVINN was trained and evaluated on 3T scans only. Adoption to ultra high-field MRI capable of producing resolutions below 0.5 mm is an interesting future avenue, as direct transferability is limited due to the strong signal, contrast, and noise differences (van der Kolk et al., 2013;van der Zwaag et al., 2016).
Overall, FastSurferVINN allows processing of both LowRes and HiRes data without reducing precision on either set. On the contrary, we demonstrated that FastSurferVINN even outperforms a fixed-resolution 1.0 mm network by a significant margin. As the only difference in the training corpus is the increased image resolution of 60 subjects, the gain in accuracy seems to be predominantly motivated by the structural details provided by the submillimeter scans.
Further, the intensity and demographical variety added by the 1.0 mm scans promoted generalization performance to the HCPL component when FastSurferVINN was compared to a fixed-resolution RS HiRes network. This is of special interest in longitudinal settings, where training on images from only the first time point may introduce a bias towards healthier or younger brains, which can be mitigated by introducing appropriate LowRes out-of-study cases. Similarly, potential future variations of acquisition parameters within studies may negatively affect performance of fixed-resolution networks and, hence, would require their retraining.
Comparison to fixed-resolution networks revealed another promising application of FastSurferVINN. As shown in Section 4.4, FastSurferVINN's segmentation performance improves across resolutions even if only additional 1.0 mm scans are included. Surprisingly, performance is dominated by the training corpus size and not significantly influenced by its heterogeneous resolution. On RS, the performance was comparable for FastSurferVINN when training with a mixed resolution corpus (120 total, with only 30 0.8 mm RS cases) as opposed to a fixed-resolution network trained exclusively on 120 0.8 mm RS scans (see Section 4.4). This can be exploited in settings where training cases are scarce at a specific resolution and new acquisitions difficult (e.g. manual segmentations or custom sub-resolution acquisition protocols). FastSurfer-VINN's resolution flexibility is optimally suited to integrate a breadth of already existing resources, reducing the amount of newly acquired data necessary to achieve good performance.
Finally, the results of Big-FastSurferVINN indicate that an imbalance between 1.0 mm and submillimeter data distribution does not introduce a resolution bias. On the contrary, increasing the 1.0 mm component in Big-FastSurferVINN by a factor 20 improved segmentation performance across all resolutions (0.7 to 1.0 mm). Specifically, the 0.9 mm dataset (ABIDE-II) benefits from the training set extension with an 3.67% increase in DSC and 22.11% decrease in ASD. Unlike HCP and RS, the 0.9 mm ABIDE-II scans were acquired on a Philips scanner, which is present in a larger proportion in the extended training set (30% Philip scanners compared to 1.6%). However, no submillimeter Philips scan was included at any point during training. The improved performance on ABIDE-II, therefore, highlights the potential for FastSurferVINN to actively reduce scanner-biases through inclusion of 1.0 mm MRIs. Exploring an expected alleviation of age-or diseasebiases in submillimeter datasets with FastSurferVINN presents an interesting direction for future work.
Overall, we introduce a fast, voxel size independent neural network that scales well to large datasets and enables seamless integration of a variety of resolutions during both, training and inference. Thereby, FastSurferVINN offers the potential to improve generalization performance to future HiRes datasets without retraining, reduce potentially existing dataset biases, and curtail necessary labour and time intensive manual labeling efforts.

A1. Attention
A context-driven learnable attention mechanism following the method proposed in  is evaluated as a reference. This modular mechanism can be inserted into any CDB and offers variable placement across the network architecture. Here, we investigate the addition of attention in the pre-IDB and post-CDB (see Section 3.2.2) as they are predominantly dealing with feature maps at the native image resolution. In short, a sequence of two convolutions (H 1 a : 3 × 3 kernel, half the number of filters as the input; H 2 a : 1 × 1 kernel, one more filter than convolutions in the dense blocks) is followed by a final elementwise softmax function normalizing the activations, denoted as λ 0 , λ 1 , λ 2 and λ 3 , to a sum of 1. The activation map λ 0 is used for weighting the input features representing an effective receptive field of 1. Each voxel is, therefore, assigned a different attention value driven by the image context. In contrast to Qin et al. (2018), the feature responses at different scales (c 1 , c 2 and X a ) are generated by a sequence of convolutions within the CDB (H a 1 , H a 2 and H a 3 ). Each convolution increases the receptive field size by (k 1 − 1 × k 2 − 1) with (k 1 × k 2 ) representing the kernel size. The calculated attention maps are then used to weight the local skip connections which combine the outputs of each convolution within these blocks. This is mathematically formulated in Eqs. (A.1a)-(A.1c).

A2. Ablative optimization of FastSurferCNN
Ablative evaluation of the CNN base architecture optimizations are summarized in Tables  1 and 2. Training and validation are based on the multi-resolution datasets listed in Table  4 (Mix). The original FastSurferCNN CDBs consecutively perform two 5 × 5 convolution operations followed by a final 1 × 1 convolution (Section 3.2.2, first row in Tables 1 and  2). In total, an average DSC of 88.63 and 87.09 and a ASD of 0.317 mm and 0.283 mm is reached for the subcortical and cortical structures, respectively. Changing the kernel size to 3 × 3 while keeping the effective receptive field size of 9 × 9 per CDB constant (Tables  1 and 2, 3 × 3, 64 F) leads to a significant improvement in DSC (p < 0. 01, Wilcoxonsigned rank test). Increasing the number of filters per layer from 64 to 71 preserves the original number of trainable parameters (approximately 1.85 × 10 6 ). This change leads to a slight improvement on the subcortical and cortical structures. For comparability to our final VINN, we also add the pre-IDB and post-CDB blocks (Tables 1 and 2, FastSurferCNN*). This addition merely assures comparability between the augmentation and interpolation approach. As visible in Tables 1 and 2, this change improves segmentation accuracy further with a final DSC of 88.85 and 88.01 and an ASD of 0.307 mm and 0.257 mm. The optimized FastSurferCNN* architecture is used as the baseline for all comparisons (i.e. augmentation and interpolation).  FastSurferCNN optimization. Average surface distance evaluation.

ABIDE I
The Autism Brain Imaging Data Exchange I Di Martino et al. (2013) is a cross-sectional study involving 17 international sites and focuses on autism spectrum disorders. It contains data for 1112 individuals between 7 and 64 years of age. Scanner and sequence parameters vary depending on the site and can be accessed on the ABIDE website (https:// fcon_1000.projects.nitrc.org/indi/abide/abide_I.html). All MRI data were acquired using 3 Tesla scanners (either Philips, Siemens or GE). 20 cases from the ABIDE-I were used for testing, 68 for training Big-FastSurferVINN.

ADNI
The Alzheimer's Disease Neuroimaging Initiative (Mueller et al., 2005) was launched in 2003 as a public-private partnership, led by principal investigator Michael W. Weiner, MD. The ADNI database has > 2000 participants and is available online at http:// adni.loni.usc.edu. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment and early Alzheimer's disease (see www.adni-info.org for up-to-date information). Data were acquired at a resolution of 1.0x1.0x1.2 mm with 1.5T and 3T-MRIs scanners from the three largest MRI vendors (GE, Philips and Siemens) using an MP-RAGE sequence. Scanner parameters are optimized for the different vendors (see Jack et al. (2008) for details). 15 cases from ADNI where used for training, 8 for validation and 215 for training Big-FastSurferVINN. 40 different cases were used for assessing accuracy and generalizability in the final testset.

IXI
This data collection provides 600 MRIs from normal, healthy subjects. The data has been collected at three different sites in London, UK on one GE (1.5T) and two Philips (1.5T and 3T) scanners and is available online (https://brain-development.org/ixi-dataset/) under Creative Commons License BY-NC-ND 3.0 (https://creativecommons.org/licenses/bysa/3.0/legalcode). Detailed sequence parameters are available at the given URL. 43 scans from IXI were used for testing and 400 for training Big-FastSurferVINN.

LA5c
The cross-sectional UCLA Consortium for Neuropsychiatric Phenomics LA5c Study (Poldrack et al., 2016) includes 142 individuals diagnosed with a neuropsychiatric or neurodevelopmental disorder (schizophrenia, bipolar disorder, ADHD) and 130 normal controls (ages 21-50). All participants were scanned on a 3T Siemens Trio at a single-center. T1-weighted MP-RAGE images were acquired with field of view of 250, 256x256 matrix, and 176 1.0 mm sagittal partitions. An TI of 1.1 s, TE of 3.5-3.3 ms, TR of 2.53 s and flip angle of 7° was used for all scans. This data was obtained from the OpenfMRI database (https://openfmri.org/dataset/ds000030/). Its accession number is ds000030. 16 cases from LA5c were used for training, 9 for validation, 203 for training Big-FastSurferVINN, and 15 for the final testset. Summary of datasets used for training, validation and testing.  Composition of training, validation and testing set used throughout the paper. The asterisk (*) indicates submillimeter datasets are resampled to 1.0 mm.

MIRIAD
The Minimal Interval Resonance Imaging in Alzheimer's Disease (Malone et al., 2013) is a publicly available longitudinal study with focus on neurodegeneration (see http:// miriad.drc.ion.ucl.ac.uk/). MRIs of 23 elderly controls and 46 Alzheimer's diseased patients (ages 55+) were acquired at a single center with a 1.5T Signa MRI scanner (GE Medical systems), using an inversion recovery prepared fast spoiled gradient recalled sequence, field of view of 24 cm, 256 × 256 matrix, 124 1.5 mm coronal partitions (voxel size 0.9 ×0.9 × 1. 5), TR 15 ms, TE 5.4 ms, flip angle 15°, and TI 650 ms. 7 cases from MIRIAD were used for training and validation, respectively, and 23 for training Big-FastSurferVINN.

Mind-brain-body
The MPI Leipzig Mind-Brain-Body cohort (A mind-brain, 2019; Mendes et al., 2019) consists of 321 healthy participants between 20 and 77 years of age. MRIs were acquired on a Siemens Verio 3T with a weighted T1 Magnetization-Prepared 2 Rapid Acquisition Gradient Echoes (MP2RAGE) protocol and sagittal acquisition orientation, one 3D volume with 176 slices, TR of 5000 ms, TE of 2.92 ms, TI1 of 700 ms, TI2 of 2500 ms, flip angle 1 or 4 degrees, flip angle 2 of 5 degrees, echo spacing of 6.9 ms, 1.0 mm isotropic voxel size and field of view of 256 mm. This data was obtained from the OpenfMRI database. Its accession number is ds000221. 195 MRIs were used for training Big-FastSurferVINN.
Oasis-1 Marcus et al.(2007) and Oasis-2 Marcus et al.(2010) The Open Access Series of Imaging Studies 1 and 2, are publicly available (https://www.oasis-brains.org/) crosssectional (Oasis-1) and longitudinal (Oasis-1) studies covering non-demented and demented individuals with very mild to moderate Alzheimer's disease. All subjects were scanned at a single-center using either a 1.5T Vision or a 3T TIM Trio Siemens Scanner in sagittal orientation with a voxel resolution of 1.0x1.0x1.25 mm. For acquisition, a MP-RAGE sequence with a TR of 9.7 ms, TE of 4.0 ms, flip angle of 10° and TI of 20 ms was used. Oasis-1 includes 416 subject between ages 18 to 96, while the longitudinal Oasis-2 focuses on older adults (150 subjects at age 60+). 14 cases from Oasis-1 and 8 from Oasis-2 were used for training, 11 and 5 for validation, 79 and 65 for training Big-FastSurferVINN and 35 and 17 for final testing.
Participants of the individual studies gave informed consent in accordance with the Institutional Review Board at each of the participating sites. Complete ethic statements are available at the respective study webpages.

A4. Tabulated results
In Table 6 , 7, 8, 9, 10, 11, 12, 13,14,15, 16. we provide numeric values for DSC and ASD metrics for Figs. 5,6,7,8,9,10,11,12,13. A note on reproducibility: Authors wishing to compare their methods to FastSurferVINN and FastSurferCNN are strongly encouraged to download our code from our repository (github.com/DeepMI/FastSurfer). For reproducibility and fair comparison, we provide a reproducibility guide with training and evaluation instructions for FastSurferVINN and FastSurferCNN on github. Particularly, we recommend re-running evaluations to avoid inconsistencies in non-identical training and evaluation datasets (subjects) as well as reference segmentation.  Compiled results of Fig. 6 -Influence of sampling kernels in the network-integrated resolution-normalization of FastSurferVINN. Comparison of nearest-neighbour (NN), area, bi-cubic and bi-linear sampling kernels. Dice Similarity Coefficient (DSC) and average surface distance (ASD) are reported for the subcortical and cortical structures.    Compiled results of Fig. 9 -Generalization of FastSurferCNN* + exSA (external scaling augmentation) and FastSurferVINN to unseen resolutions purposefully excluded from the training set (0.7 mm, 0.8 mm and 0.9 mm). Dice Similarity Coefficient (DSC) and average surface distance (ASD) are reported for the subcortical and cortical structures. Note, that comparisons between the different data resolutions are not possible due to differing training sets.  Compiled results of Fig. 10   Compiled results of Fig. 11   Compiled results of Fig. 11   Compiled results of Fig

Table 16
Compiled results of Fig. 13  Resolution-independence in deep learning networks: A. Dedicated fixed-resolution convolutional neural networks (CNNs) only work on the resolution they are trained on and are limited by the availability and quality of corresponding datasets. B. One single resolution-ignorant CNN can learn to segment multiple resolutions by training on a diverse dataset. External scale augmentation (+exSA, B. left) simulates resolutions with few or no training cases by resampling the image and the reference segmentation map. Here, however, lossy interpolation and resulting artefacts, especially from nearest-neighbour interpolation of discrete label maps, may result in a loss of structural details and sub-optimal performance. C. Our voxel size independent neural network (VINN) avoids interpolation of the images and discrete labels by integrating the interpolation step into the network architecture. Further, the explicit transition from the native resolution to a normalized internal resolution facilitates an understanding of the difference between image features (MultiRes blocks with distances measured in voxels) and anatomical features (FixedRes inner blocks with normalized distances).  Image resolution affects the detail of discrete segmentation label maps and derived measures such as surface models and thickness. A. The low-resolution image is less detailed and causes partial volume effects (PVEs) by accumulating signals across tissue boundaries into larger voxels, whereas B. High-resolution images and derived segmentations allow more precise region delineation and capture details, e.g. for improved shape or thickness analysis (white arrows).  Voxel size independence in FastSurferVINN. Flexible transitions between resolutions become possible by replacement of (un)pooling with our network-integrated resolutionnormalization (green) after the first encoder (pre-IDB) and before the last decoder block (post-CDB). Scale transitions between the other competitive dense blocks (CDB) remain as standard MaxPool and UnPool operations. Each CDB is composed of four sequences of parametric rectified linear unit (PReLU), convolution (Conv) and batch normalization (BN).
In the first two encoder blocks ((pre)-IDB), the PReLU is replaced with a BN to normalize the inputs.  HiRes weight mask generation. Left: The difference between the gray matter label map (blue) and its closure (gray) produces the deep sulci and WM strand mask. Right: The difference between the original (blue) and eroded (gray) brain mask produces the outer gray matter mask.  Ablative optimization of FastSurferCNN and comparison to FastSurfer-VINN. FastSurferCNN (light green) is optimized through a switch to 3 × 3 kernels (FastSurferCNN*, green). Addition of data augmentation (external scaling augmentation, FastSurferCNN* + exSA, dark green) improves performance further. VINN equipped with internal scaling augmentation (inSA) (FastSurferVINN, orange) outperforms all other models on both subcortical (left) and cortical (right) structures with respect to Dice Similarity Coefficient (DSC, top) and average surface distance (ASD, bottom). Further addition of external scaling augmentation negatively affects performance (VINN + inSA + exSA). Segmentation results with FastSurferVINN are significantly better compared to all other models (corrected p < 10 −7 ).  Effect of sampling kernels on network-integrated resolution-normalization. Comparison of nearest-neighbour (NN, purple), area (light violet), bi-cubic (light yellow), and bi-linear (orange) sampling kernels with respect to the Dice Similarity Coefficient (DSC, top) and the average surface distance (ASD, bottom) for subcortical (left) and cortical (right) structures. Segmentation performance with NN is significantly worse than all other interpolation strategies (corrected p < 10 −13 ). Area, bi-cubic and bi-linear give equivalent results.  Adaptation of the original loss function through addition of HiRes weights focusing on areas strongly effected by PVEs (HiRes Loss, right bar) significantly improves segmentation performance on the cortical structures (compared to FastSurferVINN with original loss, left bar). Addition of attention (middle bar) does not lead to a significant improvement compared to the baseline. Dice Similarity Coefficient (DSC, top) and average surface distance (ASD, bottom) are shown for subcortical (left) and cortical (right) structures. Cortical structures are significantly better segmented with the HiRes Loss (corrected p < 10 −5 ). No significant change was detected on the subcortical structures..  Superior generalization performance of FastSurferVINN to resolutions vastly outside the training domain (1.4 mm, 1.6 mm). FastSurferVINN (orange) outperforms scaleaugmentation (FastSurferCNN* + exSA, green) highlighting its extrapolation capabilities.
Results are significantly better with respect to Dice Similarity Coefficient (DSC, top) and average surface distance (ASD, bottom) across subcortical (left) and cortical structures (right) (corrected p < 0.001 for 1.4 mm and p < 10 −13 for 1.6 mm).    Flexible-versus fixed-resolution networks. FastSurferVINN (orange) is comparable to, or outperforms, all fixed-resolution networks (green) (left plot: 0.8 mm from RS only, right plots: 1.0 mm) with respect to the Dice Similarity Coefficient (DSC, top) and average surface distance (ASD, bottom). On the submillimeter scans (left plots) generalization to an unseen dataset (HCPL) is significantly improved. Results are consistently better for the 1.0 mm scans (right plot). To highlight cumulative VINN and architectural optimizations, we also compare with the state-of-the-art FastSurferCNN (gray, without optimizations from Sections 4.1 and 4.2, which are already included in CNN*). We retrain this 1 mm fixedresolution network ensuring equal training datasets and conditions.  Big-FastSurferVINN trained with approximately 20 times more 1.0 mm scans (n = 1315, yellow) than the original version (n = 120, orange) raises segmentation performance across resolutions. Dice Similarity Coefficient (DSC, top) and average surface distance (ASD, bottom) improve on the submillimeter (0.7 mm-0.9 mm) as well as 1.0 mm scans.