Accurate Learning with Few Atlases (ALFA): an algorithm for MRI neonatal brain extraction and comparison with 11 publicly available methods

Accurate whole-brain segmentation, or brain extraction, of magnetic resonance imaging (MRI) is a critical first step in most neuroimage analysis pipelines. The majority of brain extraction algorithms have been developed and evaluated for adult data and their validity for neonatal brain extraction, which presents age-specific challenges for this task, has not been established. We developed a novel method for brain extraction of multi-modal neonatal brain MR images, named ALFA (Accurate Learning with Few Atlases). The method uses a new sparsity-based atlas selection strategy that requires a very limited number of atlases ‘uniformly’ distributed in the low-dimensional data space, combined with a machine learning based label fusion technique. The performance of the method for brain extraction from multi-modal data of 50 newborns is evaluated and compared with results obtained using eleven publicly available brain extraction methods. ALFA outperformed the eleven compared methods providing robust and accurate brain extraction results across different modalities. As ALFA can learn from partially labelled datasets, it can be used to segment large-scale datasets efficiently. ALFA could also be applied to other imaging modalities and other stages across the life course.

Scientific RepoRts | 6:23470 | DOI: 10.1038/srep23470 watershed segmentation with a deformable-surface model, in which the statistics of the surface curvature and the distance of the surface to the centre of gravity are used to detect and correct inaccuracies in brain extraction.
Learning-based approaches use a set of training data to segment a target or test image. A popular learning-based technique for brain MRI is multi-atlas segmentation [28][29][30][31] , where multiple manually-segmented example images, called atlases, are registered to a target image, and deformed atlas segmentations are combined using label fusion (such as Majority Vote (MV) 28,31 , STAPLE 32 or Shape-based averaging (SBA) 33 ; for review see Iglesias and Sabuncu) 34 . The advantage of multi-atlas segmentation methods is that the effect of registration error is minimised by label fusion, which combines the results from all registered atlases into a consensus solution, and this produces very accurate segmentations 34 . In the context of brain extraction: Leung et al. 35 used non-rigid image registration to register best-matched atlas images to the target subject, and the deformed labels were fused using a shape-based averaging technique 33 ; Heckemann et al. 36 used an iterative refinement approach to propagate labels from multiple atlases to a given target image using image registration; Doshi et al. 37 used a set of atlas images (selected using K-means) with non-rigid image registration, and a weighted vote strategy was used for label fusion; and Eskildsen et al. 38 proposed a method in which the label of each voxel in the target image is determined by labels of a number of similar patches in the atlas image library. In addition, Brainwash (BW) 39 uses nonlinear registration from the Automatic Registration Toolbox (ART) with majority vote; and ROBEX 40 combines a discriminative random forest classifier with a generative point distribution model.
The neonatal brain presents specific challenges to brain extraction algorithms because of: marked intra-and inter-variation in head size and shape in early life; movement artefact; rapid changes in tissue contrast associated with myelination, decreases in brain water, and changes in tissue density; and low contrast to noise ratio between grey matter (GM) and white matter (WM). Most of the methods described above were optimised and evaluated on adult data and their validity for neonatal brain extraction has not been established.
Yamaguchi et al. 41 proposed a method for skull stripping of neonatal MRI, which estimates intensity distributions using a priori knowledge based Bayesian classification with Gaussian mixture model, and then a fuzzy rule-based active surface model is used to segment the outer surface of the whole brain. Also, Mahapatra 42 proposed a neonatal skull stripping technique using prior shape information within a graph cut framework. Recently, Shi et al. 43 developed a framework for brain extraction of paediatric subjects which uses two freely available brain extraction algorithms (BET and BSE) in the form of a meta-algorithm 44 to produce multiple brain extractions, and a level-set based label fusion is used to combine the multiple candidate extractions together with a closed smooth surface. The methods proposed by Yamaguchi et al. 41 and Shi et al. 43 rely on accurate detection of brain boundaries and have the risk of failing if the algorithm cannot successfully detect the brain boundaries. Also, Mahapatra 42 and Shi et al. 43 evaluated their methods on T2-weighted (T2w) scans only and their performance on other modalities such as T1-weighetd (T1w) is unknown.
In this article, we present a new method for neonatal whole-brain segmentation from MRI called ALFA (Accurate Learning with Few Atlases), within a multi-atlas segmentation strategy. A typical multi-atlas framework consists of three main components: atlas selection, image registration and label fusion. The proposed method differs from current multi-atlas approaches in the following ways. First, in the atlas selection step, most multi-atlas techniques use a strategy whereby a number of most similar atlas images for each target image is selected 45 . While these strategies can achieve high levels of accuracy, they may be computationally demanding, and lack the scalability to large and growing databases due to limited availability of the large number of manually labelled images on which they depend. In contrast, ALFA eliminates the need for target-specific training data by selecting atlases that are 'uniformly' distributed in the low-dimensional data space. This approach also provides information from a range of atlas images, and this benefits learning based label fusion techniques by providing complementary information to the fusion algorithm.
Second, ALFA uses a machine learning voxel-wise classification where a class label for a given testing voxel is determined based on its high-dimensional feature representation. In addition to voxel intensities which are utilised by most of label fusion approaches, we incorporate more information into the features, such as gradient-based features. Figure 1 shows an outline of the proposed method. We evaluate the method using neonatal T1w and T2w datasets and compare its performance, defined as the agreement between the automatic segmentation and the reference segmentation, with eleven publicly available brain extraction methods that are a representation of a range of learning and non-learning techniques. Validity of reference segmentations. Ground truth accuracy of reference masks was evaluated by an expert and corrected, when necessary, by a trained rater. The mean (SD) Dice coefficient between corrected and uncorrected segmentations was 89.13 (0.67)%, while the mean (SD) Hausdorff distance was 7.23 (0.96) mm.

MRI
To evaluate the reliability of the reference brain masks, we manually segmented the MR images from 10 randomly chosen subjects. The mean (SD) of the Dice coefficient and Hausdorff distance between the reference and manual segmentations of the first rater were 98.61 (0.25)% and 4.94 (1.75) mm, respectively. The mean (SD) of the Dice Coefficient and Hausdorff distance between the reference and manual segmentations of the second rater were 98.03 (0.29)% and 6.62 (1.17) mm, respectively. The inter-rater agreement between the two raters was 98.40 (0.37)%.
Comparison with other methods and across modalities. The proposed method ALFA was evaluated in comparison with eleven publicly available methods that include non-learning-and learning-based methods: [1] Table 1 shows means and standard deviations (SD) of the evaluation metrics for both modalities. Figure 4 shows sample outputs, i.e. the case with median Dice coefficient, from each method. For presented ALFA results, k = 3 for both image sequences.   Evaluating the feature importance and classifier performance. We used two main categories of features: intensity features and gradient-based features. Figure 7 shows that intensity features alone provided higher accuracy than gradient-based features. However, combining both categories yielded higher accuracy than each individual category (P < 0.001). We tested two different linear classification techniques: Linear Discriminant Analysis (LDA) and Naïve Bayes (NB) demonstrated equivalent performance, with both providing a very high accuracy. In addition, using a set of two training atlases that are selected with UAS strategy provides greater accuracy than twenty atlases selected using the MSAS strategy.

D (%) H (mm) SEN (%) SPE (%) D (%) H (mm) SEN (%) SPE (%)
Volume measurement. To evaluate the utility of ALFA for extracting whole brain volume from T1w and T2w datasets, we measured agreement between volumes derived from ALFA with reference values for both modalities. Figure 9 shows that ALFA provides a level of agreement that is likely to be acceptable for most clinical and experimental applications. There was no statistically significant difference between mean brain volumes estimated from T1w and T2w datasets (mean difference = 4.12 ml, P = 0.25); the difference observed may reflect differences in the masks created from the original templates.   for neck and eye cleanup) takes ~8 minutes. LABEL takes ~3 minutes to complete a single brain extraction. As BW, MASS, MV, STAPLE, SBA and ALFA are multi-atlas-based methods, the computation time of a single extraction is a combination of two processes: registration and fusion. A single registration of BW or BEaST, takes ~3 minutes; a single registration of MASS, based on DRAMMS registration framework 46 , takes ~20 minutes; and a single registration of MV, STAPLE, SBA or ALFA takes ~5 minutes (less than a minute based on an free-form registration using graphic processing unit 47 ). The fusion time for all the multi-atlas based approaches (including ALFA) takes less than a minute.

Discussion
In this article, we propose a new method (Accurate Labeling with Few atlases, ALFA) for brain extraction of neonatal MRI and demonstrate that it provides robust and accurate results for T1w and T2w neonatal data. The method belongs to the multi-atlas family where a number of training atlases are used to train a voxel-wise local classifier. The atlas selection strategy of ALFA has a crucial role because the use of a number of atlases that are 'uniformly' distributed in the low-dimensional data space provides information from a range of images and this benefits the classification process. The method contrasts with atlas selection strategies that select the most similar atlases to the test subject and hence provide less complementary information to the algorithm 45 . Also, the most similar atlas selection strategy is best suitable for large databases of images where for each subject, a large number of similar subjects (k ≥ 20) exists 35,45 . With ALFA, atlases with relatively large anatomical variability could be selected but this does not represent a problem because the algorithm requires alignment of global brain boundaries and not local structures. While alternative approaches for image registration with large anatomical variation could be used 19,48 , this would be at the expense of computation time.
As ALFA employs a sparsity-based technique to select a set of representative atlases from the target dataset, it eliminates the need for target-specific training data; quite similar to MASS 37 . However MASS uses K-means  to cluster the images, with subsequent selection of a number of images from each cluster, and K-means can fail when clusters of arbitrary shapes are present in the data because of sub-optimal selection of representative images and neglect of some clusters 49 . It is worth mentioning that there are other sparsity-and label-propagation-based techniques of interest that were applied to a range of medical image segmentation problems such as prostate segmentation from CT images 50 , hippocampus labeling in adult MRI 51,52 , and brain tissue segmentation and structural parcellation 53 .
In our leave-one-out cross-validation, learning-based approaches outperformed the non-learning-based methods. 3DSS, BET and BSE performed less well in extracting the neonatal brain compared to their established performance on adult data 40 . LABEL, which was designed and evaluated for paediatric and neonatal data, provided an acceptable accuracy on T2w (average Dice coefficient of 93.54%), however it did not perform well with respect to other methods on T1w data (average Dice coefficient of 45.62%). MASS outperformed MV, STAPLE, SBA (which are considered the benchmark for learning-based approaches); however ALFA provided accurate and robust results across modalities compared to MASS as well as the benchmark methods. It is notable that MASS crashed in eleven T1w cases (more details in Methods), and it takes ~20 minutes for a single registration. As the learning-based approaches are trained using the same set of selected atlases, the performance difference between the methods is a function of the accuracy of the registration algorithm used and/or the label fusion strategy (comparison of different registration approaches and label fusion schemes can be found in Klein et al. 54 , and Iglesias and Sabuncu 34 ).
ROBEX is a special case in our comparison since it combines generative and discriminative approaches. It is similar to ALFA in that it uses voxel-wise classification to refine the voxels at brain boundaries, but the major difference between the two is that ROBEX uses an adult template as standard space for training the voxel-wise classifier, and where the target subject is supposed to be aligned. This limits the flexibility of ROBEX to work with different imaging modalities and young populations. In contrast to ROBEX, ALFA just needs a small number of manually labelled images from the population under study to provide very accurate results. Typically, 2-5 training images are sufficient, however this need might increase depending on the morphological variation within the population under study. Another important difference is that ROBEX uses a global classifier which uses the voxel coordinates as features (beside other features), but ALFA uses a local classifier which is trained by information from the neighbouring voxels so it is less susceptible to classification errors.
Regarding the performance of the compared methods across modalities, the eleven methods provided better performance on T2w images compared with T1w images. This might be because the T2w images have better contrast than the T1w images and hence the brain boundaries can be detected more accurately on T2w images comparing to T1w. Also the better contrast on T2w images means that the registration process for learning-based methods is more accurate. It is worth mentioning also that evaluating the performance of the proposed method on different datasets was not performed because the main idea behind this work is to be able to provide accurate segmentation results using a very small number of within-study training images (which is not a labour intensive process), instead of the commonly used strategy of selecting training images from an external library.
We used a semi-automatic approach (automatic segmentations that were manually edited by a rater) to generate the reference brain masks. We chose this approach partly because of its accuracy in a recent evaluation of automatic neonatal brain segmentation algorithms 55 , and partly because it is more time-efficient than mask generation from scratch. It is possible that ALFA (in common with all other learning-based methods) may have an advantage over non-learning-based methods in the comparison because the reference segmentations were generated via a learning-based framework. However, any advantage conferred to learning-based methods is likely to be minimal for the following reasons. First, validation of the reference masks against a subset of manually delineated masks showed a very high agreement between reference and manually delineated masks. Second, ALFA and the learning-based methods show variable accuracies as the false positive rate and false negative rate maps of the learning-based methods show errors in various anatomical regions. This suggests that there is still inconsistency between the segmentations from learning based methods (including ALFA) and reference segmentations.
A possible limitation is that we tuned the parameters of 3DSS, BET and BSE based on previous experience 1,8,14,20 , and the suggestions from the authors of the methods, but it is possible that an expert user may be able to optimise parameters to produce improved results. Also, MV, STAPLE, SBA, BEaST, and BW might yield better results using an increased number of training atlases and/or a different atlas selection strategy. However, our intention was to design a method that can provide accurate results using a relatively small number of training data, and this formed the basis of the comparison study. It is worth mentioning that engines such as Segmentation Validation Engine 56 would be ideal for evaluating the performance of the different methods for adult brain data.
To conclude, we present a novel method for extracting neonatal brain MRI that is robust and provides accurate and consistent results across modalities, which is useful because T1w and T2w data enable different yet complementary inferences about developmental processes. As ALFA can learn from partially labelled datasets, it can be used to segment large-scale datasets efficiently. Although ALFA was implemented and evaluated on neonatal MR images, the idea is generic and could be applied to other imaging modalities and other stages of the life course. ALFA is available to the research community at http://brainsquare.org.

Ethical Statement. Ethical approval was obtained from the National Research Ethics Service (South East
Scotland Research Ethics Committee), and informed written parental consent was obtained. The methods were carried out in accordance with the approved guidelines. Preprocessing. Images were corrected for intensity inhomogeneity using the N4 method 58 , and reconstructed to isotropic voxel size (1 × 1 × 1 mm 3 ) using windowed sinc interpolation. Reference brain masks and atlas library construction. The reference brain masks of the atlas library that was used for training, validation and method comparison was created using the following approach. First, all the images from the dataset were nonlinearly aligned to the 40 weeks PMA template from the 4D atlas constructed in Serag et al. 14 . Then, an Expectation-Maximization framework for brain tissue segmentation (defined as white matter, grey matter and cerebrospinal fluid) was used, where the priors were propagated using prior probabilities provided by the 4D atlas. Finally, brain masks were deformed to the subjects' native space. Generated masks were inspected for accuracy by a radiologist experienced in neonatal brain MRI, and edited by a trained rater, when necessary.
To evaluate the reliability of the reference brain masks, an independent rater segmented the MR images from 10 randomly chosen subjects (5 T1w and 5 T2w) using ITK-SNAP (http://itksnap.org) to separate brain (grey and white matter, and cerebrospinal fluid) and non-brain voxels (such as skull, eye and optic nerve). Similarly, to assess the inter-rater variability, a different rater delineated the brains from the same 10 images.

Atlas selection.
In this work, we use a sparsity-based technique to select a number of representative atlas images that capture population variability by determining a subset of n-dimensional samples that are 'uniformly' distributed in the low-dimensional data space.
be a set of training images from N subjects. To select a subset S of k images where k ≤ N (optimally, k ≪ N), the atlas selection algorithm works as follows. First, all images from the training dataset are linearly registered (12 degrees of freedom) to the 40 weeks PMA template from the 4D atlas 14 , which is the closest age-matched template to the mean age of the subjects in the training dataset, and image intensities are normalised using the method described by Nyul and Udupa 59 . Then, all N aligned images are considered as candidates for the subset of selected atlases. The closest image to the mean of the dataset is included as the first subset image. Let us refer to it as S 1 . The consecutive images are selected sequentially, based on the distances to the images already assigned to the subset. The distance from the i-th to the j-th image, d(i, j) is defined as: Figure 10. Illustration of the atlas selection principle. The brain images represent five chosen atlases, and the colour codes represent the order these atlases were chosen.
Scientific RepoRts | 6:23470 | DOI: 10.1038/srep23470 i j n are data vectors obtained by concatenating the voxel, v, values of X i , Y j , respectively. The main steps of the proposed atlas selection algorithm are presented in Algorithm 1, and Fig. 10 shows an illustration of the atlas selection principle.
It is worth mentioning that the proposed atlas selection strategy was inspired by the Kennard-Stone algorithm 60 , yet different in the way it is initialised. The Kennard-Stone algorithm begins by finding the two images which are farthest apart, however the proposed algorithm begins by finding the closest image to the mean of the dataset.
Image registration. Image registration was carried out in two steps: first, a linear transformation was estimated using affine registration (12 degrees of freedom); second, a nonlinear registration step was carried out using the result of the affine registration as the initial transformation. The registration scheme is based on free-form deformations (FFD) 47,61 with normalised mutual information as the similarity metric 62 . The nonlinear registration was carried out in a coarse-to-fine manner with successive control point spacing of 20 mm, 10 mm, and 5 mm. All registration steps were carried out using the open-source image registration toolkit NiftyReg (https://sourceforge. net/projects/niftyreg), using default settings.

Algorithm 1. Uniform atlas selection algorithm
Set x to represent the vector obtained by concatenating the voxel, v, values of an image X; to be the vector which represents the mean μ of the dataset; Set m to represent the number of currently selected images; Increase m by 1; end Label fusion. Machine learning was used to assign a label to each voxel in the target image. The method is based on training a local classifier for each voxel. In addition to voxel intensities, which are utilised by most of label fusion approaches, we incorporate information from gradient-based features. Typically, each voxel v at the location (x, y, z) is converted to a five-dimensional feature vector where I is the grey scale intensity value, I x , I y and I z are the absolute norms of the first order derivatives with respect to x, y and z, and the gradient magnitude r is defined as The image derivatives are calculated through the filter [− 1 0 1] T . The vector in equation (2) represents the testing sample. The training samples come from the deformed atlas images where feature vectors are extracted in the deformed atlas images from the 26-adjacent voxels, which means that the number of training samples per voxel is equal to k × 26.
Finally, linear discrimination techniques [such as Naïve Bayes (NB) and Linear Discriminant Analysis (LDA)] were used to classify the target image voxels into brain or non-brain.
Compared methods, parameter selection and software considerations. The compared methods are listed in Table 2, and the parameter setting were determined as follows: 3DSS 26 . The parameters used for 3DSS: -shrink_fac_bot_lim 0.65,-shrink_fac 0.72 (suggested by the AFNI team). No problems encountered during running the software. BET 24 . The parameters used for BET: -f 0.5, -g − 0.1, -R, -B. These parameters were set based on our experience using BET in previous neonatal studies 1,8,14,20 . No problems encountered during running the software. BSE 25 . The parameters used for BSE: -d 20, -r 2, -s 0.8, -n 3, -p, -trim. We reached these setting by trying the interactive version of the software and experiment with the parameters there (as suggested by the BSE authors). No problems encountered during running the software.   MASS 37 . k = 5 atlases were selected as in Doshi et al. 37 . Several crashes were encountered for T1w and T2w cases, and in the first run of the software only 10 T2w and 6 T1w were complete. By sharing the issue with the authors, we were advised to re-run the software as the issue might be because one or more registration jobs have failed to finish or was killed prematurely. After listening to the authors advice and re-running the software several times on the failed subjects, the whole T2W cases were complete (after 3 additional runs), however 11 T1w cases were just very stubborn to complete (even after all the re-runs we tried).
BEaST 38 . k = 20 atlases were selected as suggested in Eskildsen et al. 38 . Neonatal reference brain scans were used inside of the BEaST framework, replacing adult training dataset. No problems encountered during running the software.
Validation framework. A leave-one-out cross-validation procedure was performed for the 50 subjects.
Each subject in turn was left out as a testing sample and the remaining 49 subjects were used as the training dataset where a subset of k atlases is selected using Algorithm 1. Agreement between the automatically segmented brain mask A and the reference mask M was evaluated using two complementary overlap metrics: Dice coefficient. The Dice coefficient D 63 measures the extent of spatial overlap between two binary images. It ranges between 0 (no overlap) and 1 (perfect agreement). The Dice values are obtained using equation (3) and expressed as a percentage.
Hausdorff distance. The Hausdorff distance H is generally used to measure the spatial consistency of the overlap between the two binary images by measuring the maximum surface-to-surface distance between the two images. It is given by where TP is the true positive, TN is the true negative, FP is the false positive, FN is the false negative, and |·| is the number of elements in a set. Similar to the Dice values, both sensitivity and specificity are expressed as a percentage.
Localisation of segmentation error. Projection maps were generated for false positive and false negative voxels, which enables the localisation of segmentation errors. First, the false positive/negative maps were aligned to one coordinate space, before averaging and projecting onto axial, coronal and axial orientations. For each method, and each modality, the projection maps give insight into the spatial locations where major false positive/ negative voxels exist, and hence it can be used to compare the performance of the evaluated methods against each other and across modalities.

Statistical analyses.
To test for differences between the results of the methods, t-tests were used for normally distributed data, and Mann Whitney U was used to compare non-normal distributions (Shapiro-Wilk normality test was used). P-values < 0.05 were considered significant after controlling for Type I error using false discovery rate (FDR). The effect of feature importance, classifier performance and atlas selection was evaluated based on T2w images using the Dice coefficient. Note that to evaluate the effect of atlas selection strategy on the performance, the most similar atlas selection strategy (MSAS) was implemented as explained in the atlas selection section with one difference. This difference is that instead of aligning all images to a template, all images were aligned to the test image and the distance between warped training images and the test image was calculated using equation (1). Then, the most similar k atlases were selected.
Agreement between whole brain volumes extracted from T1w and T2w images using ALFA compared to the reference segmentation was investigated using Bland-Altman methods.