Brain extraction on MRI scans in presence of diffuse glioma: Multi-institutional performance evaluation of deep learning methods and robust modality-agnostic training

Brain extraction, or skull-stripping, is an essential pre-processing step in neuro-imaging that has a direct impact on the quality of all subsequent processing and analyses steps. It is also a key requirement in multi-institutional collaborations to comply with privacy-preserving regulations. Existing automated methods, including Deep Learning (DL) based methods that have obtained state-of-the-art results in recent years, have primarily targeted brain extraction without considering pathologically-affected brains. Accordingly, they perform sub-optimally when applied on magnetic resonance imaging (MRI) brain scans with apparent pathologies such as brain tumors. Furthermore, existing methods focus on using only T1-weighted MRI scans, even though multi-parametric MRI (mpMRI) scans are routinely acquired for patients with suspected brain tumors. In this study, we present a comprehensive performance evaluation of recent deep learning architectures for brain extraction, training models on mpMRI scans of pathologically-affected brains, with a particular focus on seeking a practically-applicable, low computational footprint approach, generalizable across multiple institutions, further facilitating collaborations. We identified a large retrospective multi-institutional dataset of n = 3340 mpMRI brain tumor scans, with manually-inspected and approved gold-standard segmentations, acquired during standard clinical practice under varying acquisition protocols, both from private institutional data and public (TCIA) collections. To facilitate optimal utilization of rich mpMRI data, we further introduce and evaluate a novel “modality-agnostic training” technique that can be applied using any available modality, without need for model retraining. Our results indicate that the modality-agnostic approach1 obtains accurate results, providing a generic and practical tool for brain extraction on scans with brain tumors.


Introduction
Brain extraction, also known as skull-stripping, describes the process of removing the skull and non-brain tissues (e.g., neck fat) from brain magnetic resonance imaging (MRI) scans. It is a crucial step for pre-processing neuro-imaging datasets and has an immediate bearing on all subsequent investigative procedures. It is also a necessary processing step in most studies for compliance with privacy-preserving regulations, such as the Health Insurance Portability and Accountability Act of 1996 (HIPPA) and the General Data Protection Regulation of 2016 (GDPR). The effects of brain extraction on various analyses have been reported in the literature, including brain tumor segmentation (Menze et al., 2015;Bakas et al., 2018), lesion segmentation (Winzeck et al., 2018), cerebral hemisphere segmentation (Zhao et al., 2010), methods for surgical interventions (Leote et al., 2018;Nadkarni et al., 2015), neurodegeneration (Gitler et al., 2017), devising radiation therapy (Kim et al., 2019), image registration (Woods et al., 1993;Klein et al., 2010), predicting Alzheimer's disease (Frisoni et al., 2010), multiple sclerosis (Radue et al., 2015), estimation of cortical thickness (Haidar and Soul, 2006;MacDonald et al., 2000), and cortical surface reconstruction (Dale et al., 1999;Tosun et al., 2006).
Manual delineation of the brain (Souza et al., 2018) is very laborious and time-consuming, resulting in inter-and intra-rater variations and affecting reproducibility in future applications. Multiple automated methods have been developed over the years to overcome these shortcomings (Smith, 2002;Segonne et al., 2004;Eskildsen et al., 2012;Doshi et al., 2013;Shattuck et al., 2001;Iglesias et al., 2011). In recent years, DL methods, and particularly Convolutional Neural Networks (CNNs), have obtained state-of-the-art results in multiple problems of image segmentation.
Importantly, most existing computational methods for brain extraction were developed for and evaluated on brain scans without apparent pathologically-affected regions such as tumors or lesions, and they typically use only the T1-weighted MRI scans. Scans with brain tumors may present an important challenge for supervised machine learning algorithms that exclusively train on scans of healthy subjects, causing them to fail, particularly in the proximity of the tumor regions. Suboptimal brain extraction on these scans may result in parts of the skull and neck, which usually have tumor-like radiographic characteristics, being included on the final brain mask, or tumor regions being considered as non-brain, hence in both cases adversely affecting consecutive analyses. Example snapshots of such problematic brain extraction using the brain extraction tool (BET) (Smith, 2002) are shown in Fig. 1.
Another important challenge of brain extraction on multi-institutional, multi-parametric MRI (mpMRI) scans with tumors is the complexity and variability of the imaging data. Brain tumor scans acquired during standard clinical practice or even research studies and clinical trials, include a lot of sources of variation affecting their intrinsic radiographic characteristics/appearance, which may not be perceivable by visual observation but even the sub-visual differences affect further computational analyses. Examples of sources of variation include, but are not limited to imaging modalities, quality of the scans, scan acquisition parameters, e.g., repetition time (TR), echo time (TE), flip angle, slice thickness. Hence, robustness of algorithms to imaging variations and their generalizability to available imaging modalities are important requirements for achieving widespread application in clinical and research settings.
Finally, a third important requirement for the field is to have tools that are easy to deploy and apply, and with a low computational footprint. In methodological research studies, obtaining the highest performance is often the main goal, undermining other concerns such as complexity, running time and portability of a new method. However, if the goal is clinical translation and widespread usage, methods should be designed for being used by a variety of users with different hardware and software requirements, as well as competency levels.
In this study, we sought to provide a tool that addresses the specific challenges of brain extraction on scans with brain tumors, with a particular focus on coming up with a practical approach that is applicable and generalizable across multiple institutions and multiple modalities, further facilitating collaborations. Towards this aim, we performed a comprehensive performance evaluation of recently established DL architectures for semantic segmentation. We evaluated 5 widely used DL network architectures, using multiple datasets and various cross-validation strategies. Evaluation data included mpMRI brain tumor scans acquired during standard clinical practice under varying acquisition protocols from 3 private institutions, as well as publicly available data from The Cancer Imaging Archive (TCIA) (Clark et al., 2013;Scarpace et al., 2016;Pedano et al., 2016;Bakas et al., 2017aBakas et al., , 2017bBakas et al., , 2017c, with corresponding gold standard brain masks for all scans. DL architectures were trained using different combinations of input image modalities, including training on each single modality individually, multi-modality training and ensembles of models trained on single modalities. Importantly, we introduced a novel "modality-agnostic training" technique to enable brain extraction on new scans using any available modality, without the need to retrain the model. The proposed modality-agnostic training, which brings noticeable advantages for widespread application, obtained promising results in our comparative evaluations against models trained using other top-performing input modality combinations.

Data
mpMRI scans are routinely acquired across institutions for patients with suspected brain tumors. Scanning protocols, i.e., scanner models, acquired modalities, and scan parameters, vary across institutions, as well as depending on the time of scan with respect to patient's treatment history. In this study, we included four structural modalities that are commonly acquired at baseline pre-operative time-point: native (T1) and post-contrast T1-weighted (T1Gd), native T2-weighted (T2), and T2-weighted Fluid Attenuated Inversion Recovery (Flair) MRI scans.
We identified 3; 220 retrospective multi-institutional mpMRI scans from both private and public collections. The private retrospective collections included 2; 520 MRI scans from the hospitals of a) the University of Pennsylvania (UPenn, n = 1; 812), b) Thomas Jefferson University (TJU, n = 608), c) MD Anderson Cancer Center (MDA, n = 100). The public data are available through TCIA (Clark et al., 2013) and comprise of the scan collections of a) The Cancer Genome Atlas Glioblastoma (TCGA-GBM, n = 328)  b) The Cancer Genome Atlas Lower Grade Glioma (TCGA-LGG, n = 372) (Pedano et al., 2016), with their corresponding brain masks as provided through the International Brain Tumor Segmentation (BraTS) challenge (Menze et al., 2015;Bakas et al., 2017aBakas et al., , 2017bBakas et al., , 2017cBakas et al., , 2018. Notably, the bra in masks used in the BraTS challenge were prepared primarily focusing on the tumor area, rather than the overall brain. Accordingly, we performed an additional manual inspection on the BraTS dataset and we excluded scans with brain masks that were missing parts of the brain. The final dataset included 3; 220 mpMRI scans from 805 subjects, with the 4 structural modalities previously mentioned for each subject (Table 1). The ground truth brain masks of all the included scans were obtained through semi-automatic annotation using the ITK-SNAP (itksnap.org) software (Yushkevich et al., 2006;Yushkevich et al., 2019) and approved by a board-certified neuro-radiologist (M.B.) with more than 10 years of experience working with gliomas.
Importantly, this multi-institutional data collection was highly heterogeneous with a wide range of variability in scan quality, slice thickness of different modalities, scanning parameters, as well as pre-processing of initial scans, e.g (Table 1). Briefly, the UPenn data included T1 scans with high axial resolution. MDA data had similar characteristics, and accordingly it was used to evaluate the generalizability of other similar institutional data. All scans in TJU data had lower resolution compared to other datasets.
Examples of scans for each institution are shown in Fig. 2, showcasing the specific challenges of each dataset.

Pre-processing
Since the scans included in this study were heterogeneously obtained from different scanners and acquisition protocols, they all underwent the same pre-processing protocol to make image dimensions and voxel sizes uniform across studies and modalities. Specifically, all DICOM scans were converted to the NIfTI (Cox et al., 2004) file format and then, following the well-accepted 2 pre-processing protocol of the BraTS challenge (Menze et al., 2015;Bakas et al., 2017aBakas et al., , 2017bBakas et al., , 2017cBakas et al., 2018), the T1Gd scan of each patient was rigidly registered and resampled to an isotropic resolution of 1 mm 3 based on a common anatomical atlas, namely SRI (Rohlfing et al., 2010). The remaining scans (i.e., T1, T2, Flair) of each patient were then rigidly co-registered to this resampled T1Gd scan. The Greedy registration algorithm (Yushkevich et al., 2016) was used for all registrations. In order to accommodate hardware limitations two additional steps were considered. First, all scans were zero-padded with 3 and 2 slices on the top and bottom of the axial direction, ensuring that the image size on each dimension is factorized by 2, i.e., all scans converted from the SRI image size of 240×240×155 to 240 × 240 × 160 voxels. Subsequently, all scans were down-sampled to an image size of 128 × 128 × 128, converting their SRI isotropic voxel resolution (1:0 × 1:0 × 1:0mm 3 ) to anisotropic (1:875× 1:875× 1:25mm 3 ). These scans were finally used for training all the DL architectures. An overview of the pre-processing applied in this study is shown in Fig. 3.

DL network architectures
Multiple DL network architectures were tested in our comparative performance evaluation. Our selection was mainly motivated by their wide application on other segmentation tasks. The specific architectures included here comprise a) 2D U-Net (Ronneberger et al., 2015;Doshi et al., 2019), b) 3D U-Net (Çiçek et al., 1606;Milletari et al., 1606), c) Fully Convolutional Network (FCN) (Long et al., 1411), d) DeepMedic (Kamnitsas et al., 1603), and e) 3D U-Net with Residual connections (3D-ResU-Net) (Drozdzal et al., 1608;He et al., 1512). Below, we provide a very brief overview of each architecture, with references for more detailed descriptions, and we describe the specific parameters and network design that was used in our experimental evaluations. Notably, to address hardware limitations on memory utilization and keep the requirements to < 12GB, we utilized Instance Normalization (Ulyanov et al., 1607) instead of Batch Normalization (Ioffe and Szegedy, 1502) for training the 3D network architectures. (Ronneberger et al., 2015) was originally proposed in 2015 and its numerous variations have been used successfully in various segmentation tasks delivering promising results. In this study we include a heavily involved variation of a 2D U-Net (Doshi et al., 2019), including residual connections and an inception module. We refer to this architecture as "2D-Res-Inc". Specifically, the network architecture consists of an encoding path and a corresponding decoding path, as in U-Net (Ronneberger et al., 2015), followed by a voxel-wise, multi-class soft-max classifier to produce class probabilities for each voxel independently. A major modification in this implementation is the use of branched convolutions, adapted from the Inception-ResNet-v2 architecture (Szegedy et al., 2017). This module ensures that each branch learns a different representation of the input features by learning both shallow and deeper features, and allows the subsequent layer to abstract features from different scales simultaneously. This property can be extremely useful when dealing with segmentation tasks for more complex or heterogeneous structures. (Çiçek et al., 1606;Milletari et al., 1606). Here we utilize a 3D U-Net (Çiçek et al., 1606) with a customized image input size of 128 128 × 128 × 128. Furthermore, to accommodate hardware limitations, we use 16 base filters, compared to the original implementation that included 64. The pattern for a single convolution block was given as 3D Convolution operation (3 × 3 × 3) followed by Leaky ReLU (LReLU) and Instance Normalization with running statistics during learning. Adam optimizer was used with a learning rate of 0.01 over 25 epochs. The number of epochs was determined according to the amount of improvement observed. Notably, we evaluated the basic 3D-U-Net architecture in terms of the initial numbers of filters and their relation to the segmentation quality. Details of this analysis, showing the trade off between hardware requirements and performance, can be found in the supplementary section.

3D U-Net with residual connections-
We extended 3D U-Net, by applying residual connections improving the backpropagation process (Drozdzal et al., 1608;He et al., 1512). Our 3D-Res-U-Net implementation follows the principle of the 3D U-Net architecture, while further including skip connections in every convolution block, differently from other methods that use a 3D UNet. A schematic that illustrates these operations is shown in Fig. 4. Similar to 3D-U-Net above, we used Adam optimizer with a learning rate of 0.01 over 25 epochs, and the number of epochs was determined according to the amount of improvement observed.

FCN-
The FCN architecture (Long et al., 1411), which was introduced in 2017, captures hierarchical features, imaging patterns, and spatial information of each input image with receptive fields. FCN is considered less computationally expensive than other architectures, due to not having an "expensive" decoding part as U-Net, and hence provides faster coarse segmentation of various problems (Litjens et al., 2017). Here, we apply a 3D FCN. Similarly to other architectures, the Adam optimizer was used with a learning rate of 0.01 over 25 epochs. Number of epochs was determined according to the amount of observed improvement. (Kamnitsas et al., 1603) was originally proposed in 2015, as the first 3D architecture, when it emerged as a winner of the ISchemic LEsion Segmentation (ISLES) challenge (Winzeck et al., 2018). The standard DeepMedic architecture, as provided in its GitHub repository 3 is a 3D CNN with a depth of 11-layers, and a double pathway to provide sufficient context and detail in resolution. In our evaluation, we applied the original version of DeepMedic 4 with the default parameters provided, and we applied a hole-filling algorithm as a post-processing step.

Experimental design
Considering the complexity of the task, i.e., the combinatorial explosion of experiments using multiple sites, train/test combinations, input modalities and network architectures, we followed a multi-step experimental design, focusing on each step on a specific objective, as described below.
All hyper-parameters stayed consistent throughout all the experiments. Each of the applied architectures needed different time for convergence, based on their individual parameters. Table 2 shows the time each architecture run until convergence during training on a compute node equipped with an NVidia P100 GPU, with 12 GB memory.

Comparative evaluation of DL network architectures and input data
combinations-On this first step of our evaluations, we performed an exhaustive comparison of various network architectures and input image combinations. Specifically, models of all architectures were trained solely on data from a single institution (i.e., UPenn) and then inferred on datasets of multiple institutions (i.e., UPenn, TJU, and MDA). During the training phase, a hold-out cohort of 180 mpMRI scans from 45 UPenn subjects were used for internal algorithmic validation and convergence evaluation. Each method was trained using as input either single individual scans or combined mpMRI scans, as described in Section 2.5. A total of 7 input image combinations configurations were evaluated: a) four combinations of individual modalities (i.e., T1-T1, T1Gd-T1Gd, T2-T2, Flair-Flair), b) ensemble of these individual modalities based on majority voting, and c) two mpMRI combination, including the combination of T1 and T2 (Multi-2), as well as the combination of all four individual modalities (Multi-4).
In our attempt to get a more complete picture, we have also compared these DL architectures with BET and FreeSurfer, which can be considered two of the most widely used traditional approaches for brain extraction. For this comparison we only utilized the T1 scans that BET and FreeSurfer can utilize. Furthermore, to showcase how the disease can affects the performance of DL algorithms we have trained a DL model on scans from 50 healthy subjects (2D-Res-Inc-H) acquired for different studies and applied it in the same brain tumor scans we used for this step of the evaluation. The ground truth labels of the healthy brain scans were generated from different raters and confirmed for obvious errors by an expert data scientist with experience working in neuroimage analysis for more than 9 years.

2.4.2.
Modality-agnostic training-On the second step, we evaluated the performance of a "modality-agnostic training technique", which we developed aiming to provide a tool that is widely applicable to datasets with different MRI modality combinations. A similar concept was also recently presented by (Isensee et al., 2019), which shows the relevance of the proposed work. The modality-agnostic models were constructed by using different modalities as independent input samples during the training, such that a single model learns the mapping between any single image modality type and the target brain mask segmentation, without necessarily being informed about the type of modality provided as input. The modality-agnostic training was implemented and validated using architectures selected based on the performance evaluation of step 1.

Performance effect of training on diverse data-
The size and diversity of the training dataset are major factors affecting the performance and generalizability of DL models. In order to evaluate the effect of multi-site training on segmentation accuracy we performed additional experiments using models trained exclusively on data from a single institution (UPenn only) against models trained on multi-site data, i.e., UPenn, TJU, and MDA. For both cases, we used the large multi-institutional TCIA dataset for testing, thus applying the final models on a highly heterogeneous and previously unseen dataset, which allowed us to evaluate the generalizability of the proposed DL models.
2.5. Input data combination strategies for training 2.5.1. Training on individual modalities-Following the current literature on brain extraction methodologies, where T1 is solely used, we trained every architecture on T1. Additionally we trained models on each other available individual MRI modality, to compare the potential contribution of each modality beyond T1. In the remainder of this manuscript we refer to these models using the notation "modality-modality" to indicate that a single modality was used for both training and inference. For example we refer to a 3D U-Net trained and inferred on T1, as T1-T1 3D-U-Net.
We should note that the performance of models trained on individual modalities was also evaluated in ensemble configurations.

2.5.2.
Training on multiple mpMRI modalities-To take advantage of the richness of the routine mpMRI acquisitions for brain tumor patients, we also considered training methods on multiple modalities, instead of training on single modality individually. We limited these experiments to two main configurations: a) combining T1 and T2, that we refer to as "Multi-2", and b) combining all 4 structural mpMRI modalities together that refer to as "Multi-4". The first configuration (Multi-2) is selected as it was previously shown that T2 contributes to improved skull-stripping performance (Bakas et al., 2017a) and can be applied on cases where gadolinium is not considered, i.e., in non-tumor cases.

Modality-agnostic training-
We propose a novel approach that provides a "hybrid" alternative to single-modality and multi-modality training techniques, aiming to optimally use multi-modal imaging data, while not necessarily requiring existence of a predefined set of modalities. Specifically, in this configuration, a DL model is trained using all available modalities for a subject. However, in contrast to the multi-modality approach where the combination of all modalities from a subject is used as a single data sample, we feed each modality to the network as an independent data sample. This process attempts to contribute on forcing the network to learn the brain shape prior, instead of texture priors that CNNs usually learn (Geirhos et al., 1811). In this way, the model allows making an inference from a single modality, while being trained on multi-modal data. Importantly, during the inference, the model does not need to know which modality was provided as the input scan. The most important practical benefit of this approach is its robustness to missing data, i.e., it can be applied directly even if a specific modality is missing or is not usable due to low scan quality.

Evaluation metrics
Following the literature on semantic segmentation we use the following metrics to quantitatively evaluate the performance of the trained methods. We have further utilized the metrics of sensitivity and specificity in the supplementary material.

Sørensen-dice similarity coefficient-The
Sørensen-Dice Similarity Coefficient (Dice) is commonly used to evaluate and report on the performance of semantic segmentation. Dice measures the extent of spatial overlap, while taking into account the intersection between the predicted masks (PM) and the provided ground truth (GT), hence handles over-and under-segmentation. Dice can be mathematically defined as: where it would range between 0 and 100, with 0 describing no overlap and 100 perfect agreement.

Hausdorff-While
the Dice score is the most commonly used metric for comparing two segmentation masks, it is not sensitive to local differences, as it represents a global measure of overlap. For the specific problem of brain extraction, local differences may be very important for properly assessing the quality of the segmentation. Accordingly, we calculated a complementary metric, the Hausdorf f distance, which measures the maximum distance of a point set to the nearest point in another (Rockafellar and Wets, 2005): Thakur et al. Page 9 We specifically calculated the 95th percentile of the Hausdorf f distance between the contours of the two segmentation masks, a robust measure of the maximum distance between the segmentations.

Comparative evaluation of network architectures with BET and FreeSurfer using T1 image as input
Considering the general approach in the field, we first compared different architectures, as well as BET (Smith, 2002) and FreeSurfer (Segonne et al., 2004), two of the most widely used brain extraction tools, using only the T1 scan for brain extraction (T1-T1 models). Dice scores obtained by different methods on 3 datasets are shown in Fig. 5. The DL-based algorithms (regardless of the architecture) significantly outperformed BET, with a single exception for the UPenn dataset. We also note that FreeSurfer is superior to BET, FCN, and 3D-UNet, but is still outperformed by the other architectures.
Furthermore, based on the results shown in Fig. 5, we can confirm with certainty that the presence of diffuse gliomas in MRI scans indeed affect the performance of DL models trained on healthy brain scans. Specifically, we note that the best performing DL model for brain extraction in T1 brain tumor scans (2D-Res-Inc - Fig. 5) ends up with the lowest performance among the DL architectures when trained on healthy brain T1 scans (2D-Res-Inc-H - Fig. 5), with its Dice showing a decrease of at least 6% across the data from the multiple institutions.

Comparative evaluation of network architectures using different input data combinations
Plots of average Dice and Hausdorf f metrics for scans from the 3 different institutions obtained using each model are shown in Figs. 6-7 and S. Tables 3-4 (average Sensitivity and Specificity metrics are shown in Supplementary Figs. 13 and 14).
We observed that the 2D-ResInc model was consistently outperforming other models, although marginally, on all datasets, with the exception of the "T2-T2" and the "Ensemble model". DeepMedic and 3D-Res-U-Net models obtained comparable, but slightly lower, performance. On the other hand, 3D-U-Net, and FCN showed lower performance, with FCN obtaining the lowest accuracy for most input configurations.
From all input data configurations, "T1-T1" and "Multi-4" models obtained higher performance, with consistent results across datasets and network architectures, although there are few exceptions. Single modality training using modalities other than T1 resulted in lower performance, while the "Ensemble" models did not improve the results beyond the performance of the "T1-T1" model. "Multi-4" training showed improved performance, obtaining the best results comparable to the "T1-T1" model.
Based on the quantitative results of these experiments, we conclude that the 3D-Res-U-Net, DeepMedic, and 2D-ResInc are the three best performing approaches and perform comparably well. The statistical analysis indicated that 2D-ResInc is outperforming 3D-ResU-Net in most cases with the exception of the "T2-T2" model where 2D-ResInc fails dramatically. Furthermore, between 3D-Res-U-Net and DeepMedic we note that 3D-Res-U-Net is better in most models, with consistently the smallest differences in Dice between them when compared to the other architectures. Since a major goal of our analysis is also the practicality and footprint of the method both during training and inference, we considered the relatively simpler 3D-Res-U-Net and DeepMedic architectures over the 2D-ResInc for the follow up evaluations.

Modality-agnostic training
Comparative evaluations involved the 3D-Res-U-Net and the Deep-Medic models trained using the modality-agnostic technique, compared against the 3D-Res-U-Net and the DeepMedic using the best performing input combinations in the previous step, i.e., "T1-T1", "Multi-2", and "Multi-4" models. Similar to step 1, these models were trained on the same training dataset from UPenn and inference and evaluation was performed using the same three testing datasets. Note that the modality-agnostic model was only trained once using all single modalities sequentially as input. The model was then applied on each modality independently, and the performance was evaluated for each modality separately.
Comparative results for different models are shown in Figs. 8-9 and S.Tables 3-4, where 3D-Res-U-Net shows a consistent superior performance compared to DeepMedic. The modality-agnostic model performed comparable to "T1-T1" and "Multi-4" models on the UPenn dataset. However, when applied for inference on data from other institutions, i.e., TJU and MDA, we observed gradually inferior performance. Consistently with previous results, the "Ensemble" models did not improve accuracy beyond the performance of the independent modality-agnostic models.

Performance effect of training on diverse data
We further investigated the effects of training on multi-site data using the combination of UPenn, TJU, and MDA as our training dataset. Trained "T1-T1" and the modality agnostic T1 ("M-A-T1") models were then evaluated on the independent multi-institutional TCIA dataset (Fig. 10). We observed that training on diverse multi-institutional data (i.e., Fig. 10 -"UPenn + TJU + MDA") resulted in improved performance, compared to training on data from a single institution (i.e., Fig. 10 -"UPenn"). Importantly, the model trained with the modality agnostic technique performed comparably with the "T1-T1" model using the multiinstitutional training data, suggesting the potential of this model as a widely applicable generic tool for segmentation of brain images with tumors.

Discussion
In this study, we conducted a comprehensive performance evaluation of widely-used DL methods for the task of brain extraction. We selected multiple recent architectures established in the domain of 3D semantic segmentation, while also considering low computational footprint and out-of-the-box implementation availability. We further introduced a novel modality agnostic training technique, preferable to models trained using individual MRI modalities, considering the benefits of its flexibility to availability of different input image modality combinations. The accompanying source code can be found at: https://github.com/CBICA/BrainMaGe.
Many automated methods have previously been developed for brain extraction (Smith, 2002;Segonne et al., 2004;Eskildsen et al., 2012;Doshi et al., 2013;Shattuck et al., 2001;Iglesias et al., 2011). In recent years, several DL-based approaches have also shown promising results in brain extraction (Hwang et al., 2019). Even though these methods are shown to be widely successful on healthy subjects, they tend to be less accurate when evaluated on brain MRI scans with brain tumors. Furthermore, their performance can vary significantly when applied to images from different sites, scanners, and imaging acquisition protocols (Iglesias et al., 2011). Here we evaluated specific DL architectures, considering the availability of their implementation and the corresponding computational footprint. We note that the selected 3D-Res-U--Net and DeepMedic DL methods (as well as others that are not selected but performing similarly, such as the 2D-ResInc) outperformed standard brain extraction methods, such as BET and FreeSurfer. In terms of comparing 3D-ResU-Net, DeepMedic, and 2D-ResInc, we think that the use of branched convolutions with varying depths in the ResInc module makes the 2D-ResInc perform better than the others as it allows the network to simultaneously learn both shallow and deep representations of the region of interest. However, this use of branched convolutions with varying depths makes the 2D-ResInc substantially slower to train and infer when compared to 3D-ResU-Net, as indicated in Table  2. When further comparing the performance of 3D-ResU-Net with DeepMedic, it seems that the residual connections and the larger receptive field (including the whole 3D scan) is what makes the 3D-ResU-Net outperform DeepMedic. Notably, although DeepMedic is the first state of the art 3D CNN, its patch-based approach seems to be the reason for the resulting "holes" in the output predictions (primarily near the ventricles) that make it require a postprocessing step as recommended in its original publication (Kamnitsas et al., 1603). This post-processing step is what then renders DeepMedic the second slowest approach following 2D-Res-Inc, with average inference time equal to 17s per scan.
We assessed a total of 7 modality configurations, including both independent MRI scans and combinations of mpMRI brain tumor scans. Amongst these configurations, our results are in line with the existing literature on brain extraction, where the T1-weighted scan is typically used for brain extraction. These results indicate that utilization of the T1 modality ("T1-T1") achieves the best performance also for scans with brain tumors. However, we should also note that ground-truth labels are generally delineated using the T1 scans, thus creating a potential bias. Ensemble segmentation via majority voting did not contribute in improving performance. The only other model showing comparable performance was the one trained on all 4 modalities, i.e., "Multi-4". However, between these two alternatives, the "T1-T1" model would be preferred owing to the requirement of only one modality, T1, instead of requiring the co-existence of four modalities to train the "Multi-4" model. Poor and good illustrative examples are shown in supplementary figures 15-16. Poor segmentations are randomly chosen from the data of each institution separately, across all algorithms and after setting a threshold on the Dice score as Thr < 80%. Similarly, good segmentations were randomly chosen using a threshold on Dice score as Thr > 98%.
We introduced an alternative novel modality-agnostic brain extraction approach, which is not dependent on any particular modality and may be trained on any available modality. There are at least two reasons why training the algorithm and running it on a limited number of modalities is essential in clinical applications. First, some clinical MRI brain protocols do not include all four modalities. For example, at our institution, the GammaKnife protocol consists only of T2-FLAIR and T1-Gd scans, and hence skull stripping would have to be performed using at most two modalities. The other reason pertains to processing speed during inference, where brain extraction and further analysis has to be performed real time (typically before neuroradiologists open those cases from their worklist). Therefore the time needed for pre-and post-processing is directly affecting the radiological reading, as the more modalities included for processing, the more time it will take to perform the final task and also the more error prone the complete pipeline will be. The results of models trained on data from a single institution and applied on data of other independent institutions (i.e,. TJU and MDA) yield a consistently comparable performance between the proposed modalityagnostic approach and the "T1-T1" and "Multi-4" models. To rigorously validate the proposed modality-agnostic method on diverse multi-institutional data, we chose to first train two different models using i) data from a single institution (UPenn) and ii) multiinstitutional data (UPenn + TJU + MDA), and then infer them on multi-institutional diverse data from TCIA, i.e., from institutions not included in the training cohort. The results (Fig.  10) indicate that introducing knowledge from multiple institutions improves the performance of the trained models. Moreover, introducing a brain shape prior through the modality agnostic approach can handle any modality scan improving the performance (Fig. 10) and increasing the applicability of the method. Thus, the benefit of not requiring a specific modality (for the aforementioned reasons) enables us to appraise greater preference for the modality-agnostic training process compared to the "T1-T1".
While aiming on a practical brain extraction tool, generalizable across multiple institutions and multiple modalities, to facilitate the current paradigm for multi-institutional collaborations through data sharing, the current study has three major limitations: a) did not use any defaced/semi-skull-stripped data, b) consideration of only pre-operative scans, and c) consideration of only brain glioblastomas. Specifically, our models were trained on images prior to any defacing, which is a common practice mandated by many institutions for data sharing purposes, in order to comply with privacy regulations. We therefore identified another independent retrospective cohort of 120 scans from Washington University in St. Louis (WashU) that have followed the common practice of data sharing after defacing. We attempted to evaluate our current models on defaced data and we noted an inferior performance  when compared to the existing results. However, our modality agnostic approach shows promise for a potential solution that can be used as a harmonized pre-processing step giving rise to various paradigm shift approaches for multi-institutional collaborations, such as incremental learning (Li and Hoiem) and federated learning (Sheller et al., 1810). We think that the intrinsic shape prior is what makes the modality agnostic performance being more robust to missing skull. Second, the current study evaluated all models on pre-operative pathologies, and did not consider any post-resection tumor scans. Although the transition from pre-operative to post-operative scans might seem straightforward, there may be issues introduced due to the presence of cavities. Another limitation of our study is the use of subjects with only one type of pathology, i.e., glioblastoma. While the results on one type of pathology may not represent the entire spectrum of brain diseases, we hypothesize that the modality agnostic approach could be effective when on scans with other benign or malignant pathologies, such as traumatic brain injuries, intracranial infections, meningiomas, low grade gliomas, brain metastases, and more.

Conclusion
Brain extraction is a key component of computer algorithms that are poised to change the practice of clinical neuroradiology. It may facilitate real time automatic brain lesion detection and segmentation for clinical interpretation. It is therefore critical that this component be not only fast, but also have the flexibility to accommodate different MRI protocols, even with a limited number of sequences, and different hardware platforms. Our study demonstrates that DL brain extraction algorithms have the potential to satisfy those requirements.
In this study, using extensive validation protocols, we have shown that DL methods generally out-perform conventional brain extraction methods, such as BET, and FreeSurfer. Though accuracy is important, practical constraints dictate that the importance of the trained model's generalizability. Towards this end we introduced a modality-agnostic training method rather than one that needs a specific set or combination of modalities, which forces the model to learn the spatial relationships between the structures in the brain and its shape, as opposed to texture, and thereby overriding the need for a particular modality. In addition, we have also proved that such a model will have comparable (and in most cases better) accuracy to other DL methods while keeping minimal computational and logistical requirements. Finally, we have publicly released the accompanying source code at: https:// github.com/CBICA/BrainMaGe.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Fig. 1.
Example 2D tri-planar sections of mpMRI brain tumor scans after using the brain extraction tool (BET) (Smith, 2002) and followed by manual revisions. Example 2D tri-planar sections of T1Gd MRI scans from UPenn, MDA, and TJU datasets, respectively in each row, illustrating the high variability between datasets. Note the lower resolution of the TJU scans emphasizing the resampling interpolation. Overview of the complete framework applied in this study leading to results for further analyses.

Fig. 4.
Residual Connection in Encoder/Decoder block for our 3D-Res-U-Net implementation.  Quantitative evaluation of various DL network architectures compared with BET and FreeSurfer, using the T1 MRI brain tumor scans.   Evaluation results (Dice) for the selected 3D-Res-U-Net and DeepMedic on the Modality-Agnostic training process. Results also include training on the best results of Fig. 6 for comparison purposes. Mean Dice of model inference on publicly-available multi-institutional data from TCIA.
Diverse data contribute to performance improvements. "M-A" training process performs comparably with "T1-T1".  Evaluation results of the selected 3D-Res-U-Net on the Modality-Agnostic training process tested on unseen defaced data from an independent institution (WashU). Average Dice and Hausdorf f 95 metrics are shown in the left and right columns, respectively. Results also include the "T1-T1" and the "Multi-4" models for comparison purposes.    Summary of data included in our comparative evaluations.