SynthStrip: skull-stripping for any brain image

The removal of non-brain signal from magnetic resonance imaging (MRI) data, known as skull-stripping, is an integral component of many neuroimage analysis streams. Despite their abundance, popular classical skull-stripping methods are usually tailored to images with specific acquisition properties, namely near-isotropic resolution and T1-weighted (T1w) MRI contrast, which are prevalent in research settings. As a result, existing tools tend to adapt poorly to other image types, such as stacks of thick slices acquired with fast spin-echo (FSE) MRI that are common in the clinic. While learning-based approaches for brain extraction have gained traction in recent years, these methods face a similar burden, as they are only effective for image types seen during the training procedure. To achieve robust skull-stripping across a landscape of imaging protocols, we introduce SynthStrip, a rapid, learning-based brain-extraction tool. By leveraging anatomical segmentations to generate an entirely synthetic training dataset with anatomies, intensity distributions, and artifacts that far exceed the realistic range of medical images, SynthStrip learns to successfully generalize to a variety of real acquired brain images, removing the need for training data with target contrasts. We demonstrate the efficacy of SynthStrip for a diverse set of image acquisitions and resolutions across subject populations, ranging from newborn to adult. We show substantial improvements in accuracy over popular skull-stripping baselines – all with a single trained model. Our method and labeled evaluation data are available at https://w3id.org/synthstrip.


Introduction
Skull-stripping, also known as brain extraction, involves the removal of non-brain tissue signal from magnetic resonance imaging (MRI) data. This process is useful for anonymizing brain scans and a fundamental component of many neuroimage analysis pipelines, such as FreeSurfer (Fischl, 2012), FSL (Jenkinson et al., 2012), AFNI (Cox, 1996), and ANTs (Avants et al., 2011). These packages include tools that typically require brain-extracted input images and might perform inaccurately, or even fail, without removal of irrelevant and distracting tissue. One such class of algorithms that benefits from this systematic tissue extraction is image registration, a core element of atlas-based segmentation and other analyses. Nonlinear registration (Ashburner, 2007;Avants et al., 2008;Rueckert et al., 1999;Vercauteren et al., 2009) estimates local deformations between pairs of images, and these algorithms tend to produce more accurate estimates when they can focus entirely on the anatomy of interest (Klein et al., 2009;Ou et al., 2014). Similarly, skull-stripping increases the reliability of linear registration (Cox and Jesmanowicz, 1999;Friston et al., 1995;Hoffmann et al., 2015;Jenkinson and Smith, 2001;Jiang et al., 1995;Modat et al., 2014;Reuter et al., 2010) by excluding anatomy that deforms non-rigidly, such as the eyes, jaw, and tongue (Andrade et al., 2018;Fein et al., 2006;Fischmeister et al., 2013;Hoffmann et al., 2020).
Classical skull-stripping techniques are well-explored and widespread, but popular methods are often tailored to images with specific modalities or acquisition properties. Most commonly, these methods focus on three-dimensional (3D) T1-weighted (T1w) MRI scans acquired with MPRAGE sequences (van der Kouwe et al., 2008;Marques et al., 2010;Mugler and Brookeman, 1990), which are ubiquitous in neuroimaging research. While some skull-stripping tools accommodate additional contrasts, these methods are ultimately limited to a predefined set of viable image types and do not properly adapt to inputs outside this set. For example, skull-stripping tools developed for near-isotropic, adult brain images may perform poorly when applied to infant subjects or clinical scans with thick slices, such as stacks of 2D fast spin-echo (FSE) acquisitions.
When a suitable brain extraction method is not available for a particular scan type, a common workaround involves skull-stripping a compatible image of the same subject and computing a co-registration to propagate the extracted brain mask to the target image of interest (Iglesias et al., 2011). Unfortunately, an accurate intra-subject alignment can require significant manual tuning because the target image still includes extra-cerebral matter that may impede linear registration quality (Reuter et al., 2010). Crucially, this procedure also requires the existence of an additional, strip-able image, often a high-resolution isotropic T1w or T2-weighted (T2w) scan, which is rare, for example, in clinical screening protocols, introducing a barrier to the clinical adoption of analysis tools.
While classical algorithms for skull-stripping are limited by their assumptions about the spatial features and intensity distributions in the input images, supervised deep-learning approaches, which leverage convolutional neural networks (CNNs), can, in principle, learn to extract a region of interest from any image type given sufficient anatomical contrast and resolution. In practice, these networks achieve high accuracy for data types observed during training, but their performance often deteriorates on images with characteristics unseen during training (Hendrycks et al., 2021;Hoffmann et al., 2021b;Jog et al., 2019;Karani et al., 2018). In consequence, robust, supervised learning-based approaches depend on the availability of a representative training dataset that contains accurate ground-truth annotations and exposes the network to a landscape of image types. While numerous public datasets provide access to widely used MRI acquisitions for which target brain masks can be easily derived with classical methods, curating a diverse training dataset with uncommon sequences and sufficient anatomical variability is a challenging task that requires substantial human effort. As a result, current deep-learning skull-stripping methods are trained with few different data types and deliver state-of-the-art results only for particular subsets of image characteristics (Hwang et al., 2019;Kleesiek et al., 2016;Salehi et al., 2017).
Recently, a novel learning strategy alleviates the requirement for representative acquired training data by optimizing networks with a wide array of synthetic images, each generated directly from a precomputed label map (Billot et al., 2020;Hoffmann et al., 2021b). This synthesis scheme enables networks to accurately carry out tasks on any image type at evaluation-time without ever sampling real target acquisitions during training, and it has been effectively employed for segmentation (Billot et al., 2020) and deformable image registration (Hoffmann et al., 2021b). To build on deep-learning methods for brain extraction while addressing their shortcomings, we adapt the synthesis technique and introduce SynthStrip, a flexible brain-extraction tool that can be deployed universally on a variety of brain images. By exposing a CNN to an arbitrary and deliberately unrealistic range of anatomies, contrasts, and artifacts, we obtain a model that is agnostic to acquisition specifics, as it never samples any real data during training. Consequently, this scheme enables SynthStrip to extract the brain from a wide array of neuroimaging data types, and we demonstrate its viability and improvement over popular baselines using a varied test set that spans both research scans and clinical exams (Fig. 1). The test set includes T1w, T2w, T2w fluid attenuated inversion recovery (T2-FLAIR), and proton-density (PDw) contrasts as well as clinical FSE scans with slices and high in-plane resolution, and lowresolution EPI, ranging across age and pathology. We demonstrate the ability of SynthStrip to generalize beyond structural MRI, to MR angiography (MRA), diffusion-weighted imaging (DWI), fluorodeoxyglucose positron emission tomography (FDG-PET), and even computed tomography (CT). We make our validation set publicly available to promote further development and evaluation of brain-extraction tools.

Related work
In this section, we briefly review the automated brain-extraction techniques that we use as baseline methods. We include both classical and deep-learning baselines introduced over the last two decades, focusing in particular on those with high efficacy and popularity in the research domain. For an exhaustive overview of skull-stripping methods, see Fatima et al., 2020.

Classical skull-stripping
Classical, or traditional, algorithms that remove non-brain image signal vary substantially in their implementation (Cox, 1996;Eskildsen et al., 2012;Iglesias et al., 2011;Roy et al., 2017;Ségonne et al., 2004;Shattuck et al., 2001;Smith, 2002). One common class of approaches leverages a deformable mesh model to reconstruct a smooth boundary of the brain matter surface. The widely-used Brain Extraction Tool (BET; Smith, 2002), distributed as part of the FSL package (Jenkinson et al., 2012), utilizes this technique by initializing a spherical mesh at the barycenter of the brain and projecting mesh vertices outwards to model the brain border. Since BET uses locally adaptive intensity thresholds to distinguish brain and non-brain voxels, it generalizes to a variety of contrasts, such as T1w, T2w, and PDw. To prevent surface leaks beyond the brain boundary, 3dSkullStrip, a component of AFNI (Cox, 1996), extends the BET strategy by considering information on the surface exterior, accounting for eyes, ventricles, and skull.
The popular hybrid approach (Ségonne et al., 2004) available in FreeSurfer also leverages a deformable surface paradigm, combing it with a watershed algorithm and statistical atlas to improve robustness. First, the watershed establishes an estimate of the white-matter mask, which is then refined to the brain boundary using a surface mesh expansion. A probabilistic atlas of intensity distributions helps prevent outliers during mesh fitting, and erroneous brain mask voxels are removed during post-processing via a graph cuts algorithm (Greig et al., 1989;Sadananthan et al., 2010) that thresholds the cerebrospinal fluid (CSF). While effective, this technique is optimized only for images with T1w contrast, since it relies on the underlying assumption that white matter is surrounded by darker gray matter and CSF. Another hybrid approach, ROBEX (Iglesias et al., 2011), exploits a joint generative-discriminative model. A Random Forest classification (Breiman, 2001) detects the brain contour, which is used to fit a point-distribution model to the brain target. The skull-stripping tool BEaST (Eskildsen et al., 2012) builds on patch-based, non-local segmentation techniques (Coupé et al., 2010;Roy et al., 2017) and assigns a label to each voxel by comparing its local neighborhood to patches in a reference set with prior labels. With the exception of BET and 3dSkullStrip, all of these tools were specifically developed for T1w images.

Deep-learning approaches
Innovations in deep-learning have gained popularity as methodological building blocks for an array of tasks in medical image analysis, including skull-stripping. Various learningbased extraction methods have been proposed, demonstrating accuracy and speed that often out-perform their classical counterparts. These models are optimized in a supervised fashion, using a set of acquired training images with corresponding ground-truth brain masks, derived through classical methods or manual segmentation. An early, cross-contrast approach, Deep MRI Brain Extraction (DMBE) (Kleesiek et al., 2016), trains a 3D CNN on combinations of T1w, T2w, and FLAIR contrasts and matches the accuracy of classical baselines for several datasets, including clinical scans with brain tumors. Conversely, Auto-Net (Salehi et al., 2017) introduces two separate 2.5D architectures that skull-strip volumes by individually segmenting sagittal, coronal, and transverse views of same image and fusing the predictions with an auto-context algorithm (Tu and Bai, 2009). The first architecture leverages convolutions on single-resolution voxel-wise patches, while the second utilizes a scale-space U-Net architecture (Ronneberger et al., 2015) to predict the brain mask. Auto-Net is effective for both adult and neonatal brain scans but only trained with T1w images. CONSNet (Lucena et al., 2019) similarly leverages a 2D U-Net, applied across image slices in each plane, to strip 3D T1w images. More recently, implementations using full 3D U-Nets (Hsu et al., 2020;Hwang et al., 2019) have robustly matched or exceeded start-of-the-art brain-extraction performance.

Contribution
SynthStrip builds on a solid foundation laid by prior studies of deep-learning algorithms for brain extraction, enabling us to choose among network architectures well suited for this particular task. We emphasize that our goal is not to compare or make claims on the optimality of specific architectures -the discussed algorithms may perform equally well. Instead, our focus is on exploiting a novel training strategy using synthetic data only, to build an easy-to-use skull-stripping tool that alleviates the requirement of expanding the training set and re-optimizing network weights every time a new image type is to be supported.

Method
To predict robust brain masks for an array of real image types, we train a deep convolutional neural network on a vast landscape of images synthesized with a deliberately unrealistic range of anatomies, acquisition parameters, and artifacts. From a dataset D of precomputed, whole-head segmentations with brain and non-brain tissue labels, we sample a segmentation s ∈ Ds ∈ D at each optimization step and use it to generate a gray-scale head scan x with randomized acquisition characteristics. In effect, this paradigm synthesizes a stream of training images used to optimize a SynthStrip network g θ , with trainable parameters θ, in a supervised fashion: where y is the predicted brain mask, y is the target brain mask derived by merging the brain labels of s, and ℒ is the loss function that measures similarity between y and y.

Synthesis
Building from previous work (Billot et al., 2020;Hoffmann et al., 2021b), we use a generative model to synthesize a stream of random images with substantial anatomical and intensity variation, as exhibited in Fig. 2. At each training step, parameters that dictate synthesis components are randomly sampled from predetermined ranges and probability distributions explicitly defined in Table 1. We emphasize that while the generated scans can appear implausible, these training images do not need to be realistic in order for the SynthStrip model to accurately generalize to real images at test-time.
To generate a gray-scale image x from a whole-head anatomical segmentation s, we first create spatial variability to subject the network to a landscape of possible head positions and anatomical irregularities. This is accomplished by manipulating s with a spatial transformation t, composed of an affine transform (with random translation, scaling, and rotation) and a nonlinear deformation. The deformation is generated by sampling random 3D displacement vectors from a normal distribution, with random scale, at an arbitrarily low image resolution. This random displacement field is vector-integrated, using five scaling and squaring steps to encourage a diffeomorphic warp (Arsigny et al., 2006;Dalca et al., 2019), and tri-linearly resampled to match the resolution of s. After applying the randomized transform, the resulting segmentation s t serves as the basis for deriving the image x and target brain mask y, which is obtained by merging the labels of s t into brain and non-brain classes.
To compute x, we consider a Bayesian model of MR contrast, which assumes that the voxel intensity of each tissue type in the image can be represented by a single Gaussian distribution. Reversing this generalization, we assign a random distribution of tissue intensity to every anatomical label in s t and use this artificial mixture model to attain an image with arbitrary contrast by replacing each label voxel in s t with a random value drawn from its corresponding intensity distribution. Following the synthesis, we aim to simulate various artifacts and geometric properties that might exist across modality and acquisition type. First, we corrupt the image with a spatially varying intensity bias field, generated by resizing a low-resolution image sampled from a normal distribution with zero mean. The corrupted image is computed by an element-wise multiplication with the voxel-wise exponential of the bias field. Second, we perform gamma augmentation by globally exponentiating all voxels with a single value exp(γ), where γ is a normally sampled parameter. Lastly, to account for scans with a partial field of view (FOV) and varied resolution, we randomly crop the image content and down-sample along an indiscriminate set of axes. Before down-sampling by an arbitrary factor r, we simulate partial-volume effects by blurring the image using a Gaussian kernel with standard deviation σ = r/4. The image cropping and down-sampling components are applied with a 50% probability rate during synthesis.

Loss
We optimize g θ using a loss function ℒ that measures the similarity between predicted and target brain masks. Unless otherwise stated, we employ a loss ℒ = ℒ sdt that encourages the network to predict a signed distance transform (SDT) d representing the minimum distance (in mm) to the skull boundary at each voxel. Distances are positive within the brain and negative outside, facilitating the extraction of a binary brain mask y from d at test-time by simple thresholding. The training paradigm is outlined in Fig. 3. During training, an exact target Euclidean SDT d is computed from the target brain mask y, and the similarity between d and d is measured by their mean squared difference (MSE). To concentrate optimization gradients to pertinent regions of the image during training, d is banded such that voxel distances d i do not surpass a discrete threshold t, and all voxels that exceed the distance t are down-weighted in the MSE computation by a factor b. Therefore, where i represents a voxel in the spatial image domain P, t = 5 mm and b = 0.1 in our experiments, optimally determined via a grid search.
As a complimentary analysis, we compare the distance-based loss ℒ sdt against a soft Dice loss (Dice, 1945;Milletari et al., 2016), which is commonly used to optimize image segmentation models and quantifies volume overlap for pairs of labels. We define the loss ℒ dice as where y j and y j represent brain label maps, y k and y k represent non-brain label maps, and ⊙ and ⊕ represent voxel-wise multiplication and addition, respectively. While ℒ sdt and ℒ dice both result in effective skull-stripping networks, we favor the distance loss ℒ sdt due to its smoothing effect on the outline of the predicted brain mask, as demonstrated in Experiment 4.4.

Implementation
We implement g θ using a 3D U-Net convolutional architecture, with down-sampling (encoder) and up-sampling (decoder) components that facilitate the integration of features across large spatial regions. The U-Net comprises seven resolution levels, which each include two convolutional operations with leaky ReLU activations (parameter α = 0.2) and filter numbers defined in Fig. 3. Down-sampling is achieved through max-pooling, and skip-connections are formed by concatenating the outputs of each encoder level with the inputs of the decoder level with corresponding resolution. In models using ℒ = ℒ sdt , one final, single-feature convolutional layer with linear activation outputs the predicted SDT d. In models optimized with ℒ = ℒ dice , the final layer is a two-feature convolution, with softmax activation, that outputs a probabilistic segmentation representing non-brain and brain regions.
We train SynthStrip using the Adam optimizer (Kingma and Ba, 2014) with a batch size of one and an initial learning rate of 10 −4 . This rate is reduced by a factor of two after every 20,000 optimization steps without a decrease in validation loss. At test-time, all inputs to the model are internally conformed to 1-mm isotropic voxel size using trilinear interpolation, and intensities are scaled between 0 and 1. The U-Net outputs are resampled such that the final brain mask is computed in the original input space. We implement SynthStrip in Python, using the open-source PyTorch (Paszke et al., 2019) and Neurite (Dalca et al., 2018) libraries, and make our tool and associated code available in the open-source FreeSurfer package (https://w3id.org/synthstrip). All experiments are conducted using Intel Xeon Silver 4214R CPUs and Nvidia RTX 8000 GPUs.

Data
In our experiments, we employ a small training dataset of adult and infant brain segmentations and a separate, larger dataset of acquired images for validation and testing that spans across age, health, resolution, and imaging modality. All data are 3D images, acquired either directly or as stacks of 2D MRI slices.

Datasets:
We compose a set of 80 training subjects, each with whole-head tissue segmentations, from the following three cohorts: 40 adult subjects from the Buckner40 dataset (Fischl et al., 2002), 30 locally scanned adult subjects from the Human Connectome Aging Project (HCP-A) (Bookheimer et al., 2019;Harms et al., 2018), and 10 infant subjects born full-term, scanned at Boston Children's Hospital at ages between 0 and 18 months (de Macedo Rodrigues et al., 2015).
Processing: To compute anatomical segmentations of individual cerebral regions, adult and infant T1w scans are processed with SAM-SEG (Puonti et al., 2016) and the Infant FreeSurfer reconstruction pipeline (Zöllei et al., 2020), respectively. In order to build complete segmentation maps for robust whole-head image synthesis, we also generate six coarse labels of extra-cerebral tissue using a simple intensity-based labeling strategy with thresholds that mark label intensity boundaries. Considering only non-zero voxels without brain labels, we fit threshold values to each image by maximizing the similarity in number of voxels for each extra-cerebral label. These extra-cerebral labels do not necessarily represent or differentiate meaningful anatomical structures -their purpose is to provide intensity and spatial variability to synthesized regions outside the brain.
In total, the training segmentations contain 46 individual anatomical labels, with 40 brainspecific labels (including CSF), that we merge into the target brain mask y. All training segmentations are fit to a 256 3 image shape with 1-mm isotropic resolution. We emphasize that this geometric preprocessing is not required at test-time.

Evaluation data
Datasets: Our evaluation data comprise 620 images, split into validation and test subsets of sizes 22 and 598, respectively. We gather these images across seven public datasets, with makeup, resolution, and validation splits outlined in Table 2. The IXI 1 dataset features a range of MRI contrasts and modalities, including T1w and T2w as well as PDw, MRA, and DWI. To simplify the DWI evaluation, a single diffusion direction is randomly extracted from each acquisition. The FSM subset (Greve et al., 2021) is derived from in-house data using standard acquisitions as well as quantitative T1 maps (qT1). In-house, pseudo-continuous ASL (PCASL) scans are acquired as stacks of 2D-EPI slices with low resolution and a small FOV that often crops the ventral brain region (Dai et al., 2008). The QIN (Clark et al., 2013;Mamonov and Kalpathy-Cramer, 2016;Prah et al., 2015) dataset comprises precontrast, clinical stacks of thick image slices from patients with newly diagnosed glioblastoma. We also include a subset of the infant T1w image dataset, using subjects held-out from training. Lastly, to evaluate the ability of SynthStrip to adapt to imaging modalities beyond MR, we gather a test cohort of brain CT and FDG-PET scans from the CERMEP-IDB-MRXFDG (CIM) database (Mérida et al., 2021).

Ground-truth masks:
For each image in the evaluation dataset, we derive a reference brain mask using the following labelling strategy. Since every evaluation subject includes a corresponding T1w image, we generate brain masks for these scans using each classical baseline method evaluated in our analysis. Then, an "average" brain mask is computed for each subject by extracting the majority label value at every voxel. We refine the average masks manually before propagating the masks by rigidly aligning each subject's T1w scan to the remaining image types with a robust registration approach (Reuter et al., 2010). Poor alignments are further refined by hand. We make the reference dataset available online to facilitate future development of skull-stripping techniques, including the original images if permitted by their respective licenses.

Experiments
We analyze the performance of SynthStrip on diverse whole-head images and compare its 3D skull-stripping accuracy to classical and deep-learning baseline tools.

Baselines:
We select a group of skull-stripping baselines based on their popularity, determined by citation count, and effectiveness, as shown in prior work (Fatima et al., 2020;Iglesias et al., 2011). As classical baselines, we choose ROBEX 1.1, BET from FSL 6.0.4, 3dSkull-Strip (3DSS) from AFNI 21.0.21, BEaST 1.15, and the FreeSurfer 7.2 watershed algorithm (FSW). Unfortunately, many top-cited, learning based approaches do not make their models available, even upon request to the authors. A notable exception is Deep MRI Brain Extraction (DMBE), which we therefore include. Default parameters are used for each method except BET, for which the -R option is provided for more accurate brain center estimation. All inputs to FSW and DMBE are re-sampled to 1-mm isotropic voxel sizes to accommodate the expected input resolution for these methods.

Metrics:
We evaluate the similarity between computed and ground-truth brain masks by measuring their Dice overlap, mean and maximum (Hausdorff) surface distances, and percent difference in total volume. Baseline scores are compared to SynthStrip with a paired sample t-test. Sensitivity and specificity, which measure the percent of true positive and true negative brain labels, respectively, provide further insight into the properties of the computed brain masks.

Skull-stripping accuracy-
We assess the broad skull-stripping capability of a SynthStrip model trained using images synthesized from the label maps outlined in Section 3.4.1. We compare the accuracy of our method to each of the baselines across the test set of real brain images defined in Section 3.4.2. Method runtime is compared for the FSM dataset.
The comparison demonstrates SynthStrip's accurate and robust brain extraction, which substantially outperforms baseline methods (Tables 3, 4 and Supplementary Tables S1, S2). For every evaluation metric, brain masks predicted by SynthStrip yield significantly better scores than baseline masks (p < 0.05) for the vast majority of datasets. Importantly, no baseline method significantly outperforms SynthStrip on any dataset. As shown in Fig. 4, SynthStrip achieves the highest Dice score and lowest mean surface distance for more than 80% of all test images, in stark contrast to the next best performing method, BET, which yields the top result for less than 10% of images. The superior performance of SynthStrip persists even when considering only T1w, near-isotropic, adult-brain images, which all of the baselines are tuned for. Across this particular subset of 127 T1w images from the IXI, FSM, and ASL datasets, SynthStrip achieves the best mean Dice, surface distance, Hausdorff distance, and volume difference (Fig. 5), and it consistently extracts the brain with high specificity and sensitivity, while other methods tend to under-perform in either of those metrics due to tendencies to substantially over-or under-label the brain. When considering the remaining non-T1w, thick-slice, and infant image types, SynthStrip's predominance is similarly substantial (Fig. 6). For FSM T1w data, our method runs on the CPU in less than one minute (Table 5), trailing the fastest two baselines, BET and FSW, by approximately 17 seconds on average. On the GPU, SynthStrip runs substantially faster, requiring only 1.8 ± 0. 2 seconds.

Qualitative brain-mask analysis-
Across the evaluation set, skull-stripping errors in SynthStrip predictions are uncommon and typically involve minimally oversegmenting the brain mask by including thin regions of extra-cerebral matter near the dorsal cortex or pockets of tissue around the eye sockets, as shown in Fig. 8. Considering only the N images for which SynthStrip does not achieve the best score in Fig. 4, on average, SynthStrip lags behind the best-performing baseline by only −0.53 ± 0.54 Dice percentage points (N = 111) and (0. 20 ± 0.18) mm mean surface distance (N = 94).
The top performing baseline method is ROBEX, which yields high-quality brain extraction across many of the test datasets, with the notable exception of the qT1 cohort. ROBEX produces spatially plausible brain masks and evades drastic failure modes that exist in other base-lines, similarly to SynthStrip. However, despite its generally good performance, ROBEX has a tendency to include pockets of tissue surrounding the eyes and remove regions of cortical gray matter near the superior surface (Figs. 8 and S1).
BET and 3DSS also perform effective brain extraction across image types, but tend to fail dramatically for outlier cases. For example, BET locates the brain boundary with considerable precision when successful. However, for some image subsets, especially those with abundant non-brain matter, such as FSM, BET often includes large regions of inferior skull as well as facial and neck tissue in the brain mask. While 3DSS largely avoids such gross mislabeling, it tends to produce skull-strips that leak into neck tissue or, conversely, remove small regions of the cortical surface.
BEaST and FSW perform well for near-isotropic T1w images, such as those in the IXI, FSM, and ASL datasets. But since they are heavily optimized for the assumed spatial and intensity features of this acquisition type, they generally perform poorly or even fail completely for other contrasts. Common error modes of FSW involve the failure to remove bits of skull or inferior non-brain matter, in contrast to BEaST, which is susceptible to removing critical regions of the cortex.
The learning-based method DMBE yields suitable brain masks for near-isotropic image types with T1w contrast but frequently leaves substantial, unconnected components of non-brain matter. While DMBE extracts the brain tissue border as opposed to CSF, our analysis shows that the predominant contributor to the discrepancy between DMBE and ground-truth brain masks is the inclusion of neck and facial tissue (Figs. 8 and S1). DMBE model inference is slow, consuming more than a half hour to skull-strip a standard image.

Variability across time-series data-
We analyze the consistency of SynthStrip brain masks across time-domain data by assessing the differences between diffusionencoded directions acquired in the same session. For each subject in the DWI dataset, we affinely align and skull-strip all of the 16 diffusion-encoded frames in a common, average space (Reuter et al., 2010). We compute the number of discordant voxels across brain masks for a given method, defining discordant voxels (DV) as voxel locations with labels that differ in the time domain. We report the percent of DV relative to the brain mask volume, determined by the number of voxels labeled as brain in any frame. In this particular analysis, we only consider ROBEX, BET, and 3DSS as baselines since they generalize to DWI acquisitions. As shown in Fig. 7, SynthStrip demonstrates a high level of intra-subject consistency, as it predicts brain masks with substantially lower % DV across DWI directions than the baselines (p < 10 −12 ). Since the % DV metric considers voxels labeled as brain for any direction, a single mask with gross mislabeling will substantially increase the metric value, as is the case with ROBEX, which over-segments the brain for only a few directions per subject.

Loss comparison-During
our experimentation, we find that training SynthStrip models using a traditional soft Dice loss yields comparable results to those trained with an SDT-based loss for nearly every metric. However, despite similar global accuracy, we observe that models trained with ℒ dice predict brain masks characterized by relatively noisy and rough boundaries, as illustrated in Fig. 7. The high variability at the edge of the brain mask is emphasized by a 6.4 ± 3.2 mm increase in maximum surface distance when using ℒ dice compared to ℒ sdt . We further quantify this discrepancy in brain-mask smoothness by computing the percent of exposed boundary voxels (EBV) that neighbor more non-brain labels than brain labels. Brain masks with noisier boundaries will exhibit larger EBV due to an increased mask surface area and number of sporadic border voxels. We perform this evaluation using the FSM data subset of 132 images with isotropic voxel size. Models trained with ℒ dice predict masks with 4.5× higher EBV than models trained with ℒ sdt . We hypothesize that as the network learns to estimate an SDT, it is encouraged to focus more on the boundary of mask, rather than the label as a whole, resulting in a smoother prediction of the brain border.

Discussion
We present SynthStrip, a learning-based, universal brain-extraction tool trained on diverse synthetic images. Subjected to training data that far exceeds the realistic range of medical images, the model learns to generalize across imaging modalities, anatomical variability, and acquisition schemes.

Baseline comparison
SynthStrip significantly improves upon baseline skull-stripping accuracy for nearly every image cohort tested, and the few exceptions to this improvement involve data subsets for which SynthStrip matches baseline performance. This predominance is in part due to the ability of SynthStrip to generalize across a wide variety of image types as well as its proclivity to avoid substantial mislabeling. In particular, varying specific acquisition characteristics during synthesis promotes network robustness to such characteristics across a range of protocols. For example, simulating partial-volume effects with blurring and randomizing the resolution enable SynthStrip to accurately generalize to clinical thick slice acquisitions and those with large voxel sizes. By learning robust, large-scale spatial features of representative brain masks, the model consistently predicts masks of realistic and expected shape. Baseline techniques, on the other hand, often rely on weak spatial priors and are therefore prone to over-or under-segment brain tissue when confronted with image features that are unexpected or unaccounted for (Figs. 8 and S1).
ROBEX's consistent performance across contrasts and modality is somewhat unexpected since the discriminative edge detector is trained only for T1w scans. We hypothesize that the coupled shape model is able to compensate for any intensity bias encoded in the discriminative detector. The T1w-specific approaches BEAST and FSW could be effective for other MRI contrasts if provided known intensity priors of the brain matter. However, this work would require substantial human effort as it needs to be repeated for every new image type. The substantial, unconnected components of non-brain matter frequently left by DMBE are likely a byproduct of its convolutional architecture, which does not leverage multiple resolution levels to gather spatial features across large distances.

Use for brain-specific registration
Consistent brain extraction across different images from the same subject is critical for accurate analysis of time-series acquisitions. For example, diffusion (Holdsworth et al., 2012;Jones and Leemans, 2011) and functional MRI analyses (Ashburner, 2009;Jenkinson et al., 2002) depend on within-subject registration of individual frames acquired across time to undo the effect of any head motion during the scan. Unfortunately, anatomical structures that deform non-rigidly between frames, such as the neck or tongue, can hamper brain-registration accuracy and thus impinge on downstream results. While this effect can be accounted for by first removing non-brain tissue from each frame to achieve brain-specific registration (Andrade et al., 2018;Fischmeister et al., 2013), it requires consistent brain extraction across frames (Andrade et al., 2018;Fein et al., 2006;Fischmeister et al., 2013;Hoffmann et al., 2020). SynthStrip's high within-subject consistency despite substantial contrast differences across the diffusion encoding demonstrates its potential for regularizing retrospective motion correction of time-series data.

Model and data availability
Even as learning-based methods in neuroimaging analysis continue to grow in popularity, developers of deep-learning skull-stripping tools are sometimes disinclined to provide easyto-use distributions of their work. Out of the three promising methods discussed in this work, only DMBE makes its models and code publicly available for use. In contrast, we make SynthStrip available as a universal, cross-platform command-line utility, distributed both as a standalone and as a built-in FreeSurfer tool. To facilitate further development and testing of robust skull-stripping tools, we also make our evaluation data and ground-truth labels available at https://w3id.org/synthstrip.

Future work
While SynthStrip facilitates state-of-the-art brain extraction, we aim to extend the tissueextraction strategy to other applications both within and beyond neuroimaging. One such application is fetal head extraction from in-utero fetal MRI scans. Due to excessive motion, fetal MRI is limited to the acquisition of sub-second 2D slices. However, stacks of several slices are needed to cover the anatomy of interest, and while their inplane resolution is typically of the order of 1 mm × 1 mm, views across slices are hampered by slice thicknesses of 4-6 mm and between-slice motion (Hoffmann et al., 2021a). To enable full 3D views of the fetal brain, post-processing tools for super-resolution reconstruction have emerged, that aim to reconstruct a high-quality volume of isotropic resolution from a number of slice stacks acquired at different angles (Ebner et al., 2020;Kainz et al., 2015;Rousseau et al., 2006). Yet, these methods hinge on successful brain extraction which is challenging due to frequent artifacts and because the relatively small brain first needs to be localized within a wide FOV encompassing the maternal anatomy (Gaudfernau et al., 2021). In addition, substantially fewer public fetal datasets are available for training in comparison to vast public adult brain datasets. This presents an ideal problem to be addressed with SynthStrip, as our approach synthesizes an endless stream of training data from only a handful of label maps.

Conclusion
The removal of non-brain signal from neuroimaging data is a fundamental first step for many quantitative analyses and its accuracy has a direct impact on downstream results. However, popular skull-stripping utilities are typically tailored to isotropic T1w scans and tend to fail, sometimes catastrophically, on images with other MRI contrasts or stack-ofslices acquisitions that are common in the clinic. We propose SynthStrip, a flexible tool that produces highly accurate brain masks across a landscape of imaging paradigms with widely varying contrast and resolution. We implement our method by leveraging anatomical label maps to synthesize a broad set of training images, optimizing a robust convolutional neural network that is agnostic to MRI contrasts and acquisition schemes.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material. Examples of SynthStrip brain extractions (bottom) for a wide range of image acquisitions and modalities (top). Powered by a strategy for synthesizing diverse training data, SynthStrip learns to skull-strip brain images of any type.

Fig. 2.
Samples of synthetic images used for SynthStrip training. To encourage the network to generalize, we synthesize images that far exceed the realistic range of whole-brain acquisitions. In this figure, each brain image is generated from the same label map. In practice, we use label maps from several different subjects. SynthStrip training framework. At every optimization step, we sample a randomly transformed brain segmentation s t , from which we synthesize a gray-scale image x with arbitrary contrast. The skull-stripping 3D U-Net receives x as input and predicts a thresholded signed distance transform (SDT) d representing the distance of each voxel to the skull boundary. The U-Net consists of skip-connected, multi-resolution convolutional layers illustrated by gray bars, with their number of output filters indicated below. We train SynthStrip in a supervised fashion, maximizing the similarity between d and the ground-truth SDT d within a ribbon of set distance around the brain and derived directly from the segmentation labels of s t . SynthStrip accuracy compared to baseline methods, across all images in the test set. Images are sorted by the score of the top performing skull-stripping method. Each dot represents a single brain mask derived with a particular tool, and each column of dots represents the scores obtained for a single image across tools. See Supplementary Fig. S2  SynthStrip and baseline skull-stripping performance for near-isotropic, T1w adult MR brain images. Median scores are represented by black dots. For all metrics except sensitivity and specificity, SynthStrip yields optimal brain masks. The high specificity achieved by ROBEX and BEaST comes at the cost of substantial under-segmentation of the brain mask, as indicated by their low sensitivity scores. The inverse is true for FSW, which tends to substantially over-segment the brain. Black dots indicate median scores. Considering all non-T1w, thick-slice, and infant images in the evaluation set, SynthStrip surpasses baseline accuracy by a wide margin. In this figure, we include only baselines that generalize to acquisition protocols and modalities beyond the common structural T1w MRI scans. Black dots indicate median scores. A: SynthStrip variability across time-series data, measured by percent of discordant voxel locations (DV) across diffusion-encoded directions, relative to the brain mask volume. The ROBEX median % DV extends beyond the chart axis, as indicated by the black arrow. B: Effect of SDT-and Dice-based loss functions during training. A SynthStrip model trained using ℒ sdt predicts substantially smoother brain masks (boundaries indicated in orange) than a model trained with ℒ dice , resulting in considerably lower maximum surface distance (MD) to ground truth masks and percent of exposed boundary voxels (EBV). Representative skull-stripping errors for SynthStrip and baseline methods. White arrows indicate over-labeling of the brain mask, while orange arrows indicate removal of brain matter. SynthStrip errors are uncommon and typically involve including small regions of dura or other extracerebral tissue in the brain mask, if they occur.. Uniform hyperparameter sampling ranges used for synthesizing a training image from a source segmentation map. The specific values were chosen by visual inspection of the generated images to produce a landscape of image contrasts, anatomies, and acquisition characteristics that far exceed the realistic range of medical images. We sample fields with isotropic voxels of the indicated side length, where SD abbreviates standard deviation.   Table 3 SynthStrip and baseline method accuracy across datasets, as measured by the mean surface distance (± SD) between computed and ground-truth binary brain masks.  Table 4 Skull-stripping accuracy across datasets, as measured by the mean Dice overlap (± SD) between computed and ground-truth binary brain masks.  Table 5 Average single-threaded CPU runtime (± SD) for T1w images from the FSM dataset. In addition to BET and FSW, SynthStrip is one of only three skull-stripping methods that consistently runs in under one minute.