Fast and Sequence-Adaptive Whole-Brain Segmentation Using Parametric Bayesian Modeling

Quantitative analysis of magnetic resonance imaging (MRI) scans of the brain requires accurate automated segmentation of anatomical structures. A desirable feature for such segmentation methods is to be robust against changes in acquisition platform and imaging protocol. In this paper we validate the performance of a segmentation algorithm designed to meet these requirements, building upon generative parametric models previously used in tissue classi ﬁ cation. The method is tested on four different datasets acquired with different scanners, ﬁ eld strengths and pulse sequences, demonstrating comparable accuracy to state-of-the-art methods on T1-weighted scans while being one to two orders of magnitude faster. The proposed algorithm is also shown to be robust against small training datasets, and readily handles images with different MRI contrast as well as multi-contrast data. & 2016 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
So-called whole-brain segmentation techniques aim to automatically label a multitude of cortical and subcortical regions from brain MRI scans.Recent years have seen tremendous advances in this field, enabling, for the first time, fine-grained comparisons of regional brain morphometry between large groups of subjects.Current state-of-the-art whole-brain segmentation algorithms are typically based on supervised models of image appearance in T1weighted scans, in which the relationship between intensities and neuroanatomical labels is learned from a set of manually annotated training images.
This approach suffers from two fundamental limitations.First, segmentation performance often degrades when the algorithms are applied to T1-weighted data acquired on different scanner platforms or using different imaging sequences, due to subtle changes in the obtained image contrast (Han and Fischl, 2007;Roy et al., 2013).And second, the exclusive focus on only T1-weighted images hinders the ultimate translation of whole-brain segmentation techniques into clinical practice, where they hold great potential to support personalized treatment of patients suffering from brain diseases.This is because clinical imaging uses additional MRI contrast mechanisms to show clinically relevant information, including T2-weighted or fluid attenuated inversion recovery (FLAIR) images that are much more sensitive to certain pathologies than T1-weighted scans (e.g., white matter lesions or brain tumors).Although incorporating models of lesions into whole-brain segmentation techniques is an open problem in itself, a first necessary step towards bringing these techniques into clinical practice is to make them capable of handling the multicontrast images that are acquired in standard clinical routine.
In this article, we present and validate the performance of a fast, sequence-independent whole-brain segmentation algorithm.The method, which is based on a mesh-based computational atlas combined with a Gaussian appearance model, yields segmentation accuracies comparable to the state of the art; automatically adapts to different MRI contrasts (even if multimodal); requires only a small amount of training data; and achieves computational times comparable to those of the fastest algorithms in the field (Zikic et al., 2014;Ta et al., 2014).
Because many distinct brain structures have similar intensity characteristics in MRI, these methods were typically built around detailed probabilistic models of the expected shape and relative positioning of different brain regions, using surface-based (Kelemen et al., 1998;Pizer et al., 2003;Patenaude et al., 2011;Cootes et al., 1998) or volumetric (Fischl et al., 2002;Pohl et al., 2006b) models.These anatomical models were then combined with supervised models of appearance to encode the typical intensity characteristics of the relevant structures in the training data, often using Gaussian models for either the intensity of individual voxels (Fischl et al., 2002;Pohl et al., 2006b) or for entire regional intensity profiles (Kelemen et al., 1998;Pizer et al., 2003;Patenaude et al., 2011;Cootes et al., 1998).The segmentation problem was then formulated in a Bayesian setting, in which segmentations were sought that satisfy both the shape and appearance constraints.
More recently, non-parametric methods1 have gained increasing attention in the field of whole-brain segmentation, mostly in the form of multi-atlas label fusion (Rohfling et al., 2004a;Heckemann et al., 2006;Isgum et al., 2009;Artaechevarria et al., 2009;Sabuncu et al., 2010;Rohfling et al., 2004b;Wang et al., 2013;Manjón et al., 2011;Rousseau et al., 2011;Tong and Wolz, 2013;Wu et al., 2014;Asman and Landman, 2013;Zikic et al., 2014;Iglesias and Sabuncu, 2015).In these methods, each of the manually annotated training scans is first deformed onto the target image using an image registration algorithm.Then, the resulting deformation fields are used to warp the manual annotations, which are subsequently fused into a final consensus segmentation.Although early methods used a simple majority voting rule (Rohfling et al., 2004a;Heckemann et al., 2006), recent developments have concentrated on exploiting local intensity information to guide the atlas fusion process.This is particularly helpful in cortical areas, for which accurate inter-subject registration is challenging (Sabuncu et al., 2010;Ledig et al., 2012).Label fusion methods have been shown to yield very accurate whole-brain segmentations (Landman and Warfield, 2012), but their accuracy comes at the expense of a high computational cost as a result of the multiple non-linear registrations that are required.Efforts to alleviate this issue include a local search using entire image patches, such that much faster linear registrations can be used (Manjón et al., 2011;Ta et al., 2014), as well as using rich contextual features so that only a single non-linear warp is needed (Zikic et al., 2014).

Existing methods that handle changes in MRI contrast
With the exception of simple majority voting (Rohfling et al., 2004a;Heckemann et al., 2006), all the methods reviewed above use supervised intensity models, in the sense that they explicitly exploit the specific image contrast properties of the dataset used for training.This poses limitations on their ability to segment images that were acquired with different scanners or imaging sequences than the training scans.
A generic way of making such methods work across imaging platforms is histogram matching (also known as intensity normalization), in which the intensity profiles of new images are altered so as to resemble those of the images used for training (Nyúl et al., 2000;Roy et al., 2013).However, histogram matching can only be used when the training and target data have been acquired with the same type of MRI sequence (e.g., T1-weighted), and it does not completely cancel the negative effects that intensity mismatches have on segmentation accuracy (Roy et al., 2013).
Another approach is to have the training dataset include images that are representative of all the scanners and protocols that are expected to be encountered in practice.However, this approach quickly becomes impractical due to the large number of possible combinations of MRI hardware and acquisition parameters.The situation is exacerbated for clinical data, due to the lack of standardized protocols to acquire multi-contrast MRI data across clinical imaging centers.
In contrast synthesis (Roy et al., 2013), the original scan is not directly segmented, but rather used to generate a new scan with the desired intensity profile, which is then segmented instead.The premise of this technique is that a database of scans acquired with both the source and target contrast is available, so that the relationship between the two can be learned (Iglesias et al., 2013a;Roy et al., 2013).This approach makes it unnecessary to manually annotate additional training data for each new set-up that is considereda considerable advantage given that a manual wholebrain segmentation often takes several days per scan (Fischl et al., 2002).However, it still requires that additional example subjects are scanned with both the source and target scanner and protocol, which is not always practical.
Finally, a more fundamental way to address the problem is to perform whole-brain segmentation in the space of intrinsic MRI tissue parameters (Fischl et al., 2004b).However, this requires the usage of specific MRI sequences for which a physical forward model is available, which are not widely implemented on MRI scanning platforms, and particularly not on clinical systems.

Contribution: validation of a fast, sequence-adaptive wholebrain segmentation algorithm
In contrast to the aforementioned approaches to whole-brain segmentation, which rely on supervised models of the specific intensity profiles seen in the training data, in this paper we validate an unsupervised approach that automatically learns appropriate intensity models from the images being analyzed.At the core of the method is an intensity clustering algorithm (a Gaussian mixture model) that derives its independence of specific image contrast properties by simply grouping together voxels with similar intensities.This approach is well-established for the purpose of tissue classification (aimed at extracting the white matter, gray matter and cerebrospinal fluid) where it is typically augmented with models of MRI imaging artifacts (Wells et al., 1996a;Van Leemput et al., 1999a;Ashburner and Friston, 2005) and spatial models such as probabilistic atlases (Ashburner and Friston, 1997;Van Leemput et al., 1999a;Ashburner and Friston, 2005) or Markov random fields (Van Leemput et al., 1999b;Zhang et al., 2001).
Here we validate a method for whole-brain segmentation that is rooted in this type of approach, building on prior work from our group including a proof-of-concept demonstration in whole-brain segmentation (Van Leemput, 2009), as well as the automated segmentation methods for hippocampal subfields (Iglesias et al., 2015a) and subregions of the brainstem (Iglesias et al., 2015b) that are distributed with the FreeSurfer software package (Fischl et al., 2002).The method we validate here uses a mesh-based probabilistic atlas to provide whole-brain segmentation accuracy at the level of the state of the art, both within and across scanner platforms and pulse sequences.Unlike many other techniques, the method does not need any preprocessing such as skull stripping, bias field correction or intensity normalization.Furthermore, because the method is parametric, only a single non-linear registration (of the atlas to the target image) is required, yielding a very fast overall computational footprint.
An early version of this work, with a preliminary validation, was presented in Puonti et al. (2013).The current article adds a more detailed explanation of our modeling approach, quantitative comparisons with additional state-of-the-art label fusion algorithms, and more extensive experimentsparticularly regarding test-retest reliability, segmentation of multi-contrast and non-T1contrast data, and the sensitivity of the method to the size of the training dataset.In order to estimate l from D, i.e., to compute automated segmentations, we use a generative modeling approach: a forward probabilistic model of MRI images is defined, and subsequently "inverted" to obtain the segmentation.The model consists of two parts: a prior and a likelihood.The prior is a probability distribution over segmentations ( ) p l that encodes prior knowledge on human neuroanatomy.The likelihood is a probability distribution over image intensities that is conditioned on the segmentation ( | ) p D l , which models the imaging process through which a certain segmentation yields the observed MRI scan.This type of model is generative because it provides a mechanism to generate data through the forward model: in our case, we could generate a random brain MRI scan by first sampling the prior to obtain a segmentation, and then sampling the likelihood conditioned on the resulting segmentation.

Modeling framework
Within this framework, the posterior distribution of image segmentations given an input brain MRI scan is given by Bayes' rule: Maximizing Eq. ( 1) with respect to l then yields the maximum a posteriori (MAP) estimate of the segmentation.
In the rest of this Section, we will describe in depth the prior (Section 2.1) and likelihood (Section 2.2); we will propose an inference algorithm to approximately maximize Eq. (1) (Section 2.3); and finally we will describe the details of the implementation of this algorithm (Section 2.4).

Prior
For the prior ( ) p l we use a generalization of the probabilistic brain atlases often used in brain MRI segmentation (Ashburner and Friston, 1997;Van Leemput et al., 1999b, 1999a, 2001;Zijdenbos et al., 2002;Fischl et al., 2002;Ashburner and Friston, 2005;Prastawa et al., 2005;Pohl et al., 2006b;D'Agostino et al., 2006;Awate et al., 2006;Bouix et al., 2007).This model, detailed in Van Leemput (2009), is based on a deformable tetrahedral mesh, the properties of which are learned automatically from a set of manual example segmentations made on MRI scans of training subjects.Each of the vertices of the mesh has an associated set of label probabilities specifying how frequently each of the K labels occurs at the vertex.The resolution of the mesh is locally adaptive, being sparse in large uniform regions and dense around the structure borders.This automatically introduces a locally varying amount of spatial blurring in the resulting atlas, aiming to avoid over-fitting of the model to the available training samples (Van Leemput, 2009).During training, the topology of the mesh and the position of its vertices in atlas space (henceforth "reference position") is computed along with the label probabilities in a nonlinear, group-wise registration of the labeled training data.An example of the resulting probabilistic brain atlas, computed from manual parcellations in 20 subjects, is displayed in its reference position in Fig. 1; note the irregularity in the shapes and sizes of the tetrahedra.
The positions of the mesh nodes x can change according to their prior distribution ( ) where T and x ref denote the number of tetrahedra and the re- ference position of the mesh, respectively; ϕ ( ) x x , t ref is a penalty for deforming tetrahedron t from its reference to its actual position; and β > 0 is a scalar that controls the global stiffness of the mesh.We use the penalty term proposed in Ashburner et al. (2000), which goes to infinity when the Jacobian determinant of the deformation approaches zero.This choice prevents the mesh from tearing or folding onto itself, thus preserving its topology.
Given a deformed mesh with node positions x, the probability ( | ) p k x i of observing label k at a voxel i is obtained by barycentric interpolation of the label probabilities at the vertices of the tetrahedron containing the voxel.Moreover, we assume conditional independence of the labels of the different voxels given the mesh node positions, such that ( ) ( ) The expression for the prior distribution over segmentations is finally: The likelihood ( | ) p D l models the relationship between seg- mentation labels and image intensities.For this purpose, we associate a mixture of Gaussian distributions with each label (Ashburner and Friston, 2005), and assume that the bias field imaging artifact typically seen in MRI can be modeled as a multiplicative and spatially smooth effect (Wells et al., 1996a).For computational reasons, we use log-transformed image intensities in D, and model the bias field as a linear combination of spatially smooth basis functions that is added to the local voxel intensities (Van Leemput et al., 1999a).
Specifically, letting θ denote all bias field and Gaussian mixture parameters, with uniform prior Σ Here, G k is the number of Gaussian distributions in the mixture associated with label k; and μ k g , , Σ k g , , and w k g , are the mean, covariance matrix, and weight of component ). Furthermore, where P denotes the number of bias field basis functions, ϕ p i is the basis function p evaluated at voxel i, and c n holds the bias field coefficients for MRI contrast n.
The entire forward model is summarized in Table 1.

Inference
Using the model described above, the MAP segmentation for a given MRI scan is obtained by maximizing Eq. ( 1) with respect to l: which is intractable due to the integrals over the parameters x and θ that appear in the expressions for ( ) p l (Eq. ( 4) and ( | ) p D l (Eq.( 5)), respectively.This difficulty can be side-stepped if the posterior distribution of the model parameters in light of the data is heavily peaked around its mode: ^^( ) x which no longer involves intractable integrals.The resulting inference algorithm then involves two distinct phases, detailed below: first, computing the point estimates by maximizing Eq. ( 8); and subsequently computing the segmentation by maximizing Eq. ( 9) with respect to l.
Computation of point estimates.Applying Bayes' rule to Eq. ( 8), we obtain: Taking the logarithm, we can rewrite the problem as the maximization of the following objective function: We solve this problem with a coordinate ascent scheme, in which the mesh node positions x and likelihood parameters θ are iteratively updated, by alternately optimizing one while keeping the other fixed.
To optimize the mesh node positions x with fixed θ, we use a standard conjugate gradient optimizer (Shewchuk, 1994).To optimize the likelihood parameters θ with fixed x, we use a gen- eralized expectation-maximization (GEM) algorithm (Dempster et al., 1977) similar to the one proposed in Van Leemput et al. (1999a).In particular, the GEM optimization involves iteratively computing the following soft assignments of each voxel to each of the Gaussian distributions, based on the current parameter estimates: ( ) and subsequently updating the parameters accordingly: Table 1 Equations for the forward probabilistic model of MRI brain scans.
It can be shown that this process is guaranteed to increase the objective function of Eq. ( 10) with respect to θ in each GEM iteration (Dempster et al., 1977;Van Leemput et al., 1999a).
Computation of the final segmentation.Given the point estimates of the model parameters, the conditional posterior distribution of the segmentation l factorizes over voxels: The optimal segmentation for each voxel is therefore given by: In practice, we have found that modeling substructures with similar intensity properties (e.g., all white matter structures) with the same Gaussian mixture model improves the robustness of the algorithm while giving faster execution times.Letting f denote a set of structures that share the same mixture model, this is accomplished by altering the GEM update equations for the Gaussian mixture parameters as follows: The details of which structures share the same mixture models will be given in Section 3.3.
To initialize the algorithm, we first affinely align the atlas to the target image using the registration method described in D' Agostino et al. (2004), which uses atlas probabilitiesrather than an intensity templateto drive the registration process.After the initial registration we mask out non-brain tissues by excluding voxels that have a prior probability lower than 0.01 of belonging to any of the brain structures.The image intensities are then log-transformed to accommodate the additive bias field that is employed (cf.Section 2.2).For the bias field modeling, we use the lowest frequency components of the 3D discrete cosine transform (DCT) as basis functions (for the number of components see Section 3.3).
The subsequent optimization is done at two resolution levels.In the first level, the atlas probabilities are smoothed using a Gaussian kernel with a standard deviation of 2.0 mm in order to fit large scale mesh deformations.No smoothing is used in the second level, which refines the registration on a smaller scale.
The stopping criteria for the different components of the algorithm are as follows: the likelihood parameters θ are updated until the relative change in the objective function (Eq.( 10)) falls under 10 À 5 ; the mesh node positions are updated until the maximum deformation across vertices falls under 10 À 3 mm; and the GEM and conjugate gradient optimizers are iteratively interleaved until the decrease in the cost function falls under 10 À 6 .
The algorithm is implemented in Matlab except for the computationally demanding optimization of the mesh node positions, which is implemented in C þ þ, and which involves computing the mesh node deformation prior ( ) p x (Eq. ( 2)) the interpolated prior probabilities ( | ) p l x (Eq. ( 3) and the gradient of the objective function (Eq.( 10)) with respect to the mesh node positions.

Experiments
In this section, we first describe the brain MRI datasets used in this study (Section 3.1).Then, we outline four methods that our algorithm is benchmarked against (Section 3.2).Next, we detail how the free parameters of each method are set (Section 3.3).Finally, we describe the setups for four different experiments in which the different methods are tested (Section 3.4).

MRI data
In the experiments, we use five different sets of scans: one exclusively for training the segmentation methods, and the other four for testing the performance on unseen data.For training, we use a dataset of 39 T1-weighted MRI scans and corresponding expert segmentations.The expert segmentations were obtained using a validated semi-automated protocol developed at the Center for Morphometric Analysis (CMA), MGH, Boston (Caviness et al., 1989(Caviness et al., , 1996;;Kennedy et al., 1989).All raters had to pass tests measuring intra-and inter-rater reliability before they were allowed to perform segmentations.The resulting training data consists of 28 healthy subjects and 11 subjects with questionable or probable Alzheimer's disease with ages ranging from under 30 years old to over 60 years old (Sabuncu et al., 2010).The scans were acquired on a 1.5T Siemens Vision scanner using an MPRAGE sequence with parameters: TR ¼ 9.7 ms, TE ¼4 ms, TI ¼20 ms, flip angle ¼10°and voxel size ¼1.0Â 1.0 Â 1.5 mm 3 (128 sagittal slices), where the scan parameters were empirically optimized for gray-white matter contrast (Buckner et al., 2004).This is the same dataset used for training in the publicly available software package FreeSurfer (Fischl et al., 2002).An example scan and a corresponding manual segmentation are shown in Fig. 1.
For testing, we use four different datasets acquired on scanners from different manufacturers, with different field strengths and pulse sequences.For three of the datasets, including a total of 35 subjects, we have access to expert manual segmentations, enabling quantitative comparisons of automated segmentation accuracy.All these manual segmentations were performed using the same protocol as was used for the training data.The fourth test dataset consists of 40 subjects scanned at two time points; it does not have expert segmentations but will be used to assess testretest reliability instead.Below we provide details on each of these four test datasets.
The first test dataset consists of T1-weighted scans of 13 individuals with age and disease status matching those of the training dataset, acquired on a 1.5T Siemens Sonata scanner with the same sequence and parameters as the training data (Han and Fischl, 2007).Given the similarity with the training data (vendor, field strength, pulse sequence), we will refer to this dataset as the "intra-scanner dataset".An example scan and a corresponding manual segmentation are shown in Fig. 2.
The second test dataset consists of T1-weighted scans of 14 individuals with age and disease status matching those of the training dataset, acquired on a 1.5T GE Signa Scanner using an SPGR sequence with parameters: TR ¼35 ms, TE ¼ 5 ms, flip angle ¼45°and = × × voxel size 0.9375 0.9375 1.5 mm 3 (124 cor- onal slices) (Han and Fischl, 2007).This dataset will be referred to as the "cross-scanner dataset".An example scan and a corresponding manual segmentation are shown in Fig. 3.
The third test dataset consists of multi-echo FLASH scans from 8 healthy subjects acquired on a 1.5T Siemens Sonata scanner.The acquisition parameters were: TR ¼20 ms, TE ¼ min, flip angle ¼ 3°, °5 , °20 and °30 , and voxel size ¼1.0 mm 3 isotropic (Fischl et al.,  2004b; Iglesias et al., 2012).The different flip angles correspond to different contrast properties, with the smallest angle having contrast similar to proton density (PD) weighting and the largest one having a contrast similar to T1-weighting.These data will be referred to as the "multi-echo dataset".A sample slice from this dataset, with flip angles °30 and °3 , is shown in Fig. 4. The fourth and final test dataset consists of 40 healthy subjects scanned at two different time points at different facilities, with scan intervals ranging from 2 days to six months, amounting to a total of 80 T1-and T2-weighted scans for the whole dataset (Holmes et al., 2012).The scans were all acquired with 3T Siemens Tim Trio scanners using identical multi-echo MPRAGE sequences for the T1 and 3D T2-SPACE sequences for the T2, with voxel size¼1.2Â 1.2 Â 1.2 mm 3 .Note that the acquisition protocol was highly optimized for speed, with a total acquisition time for both scans of under 5 minutes.This dataset will be referred to as the "test-retest dataset".One of the scans had to be excluded because of motion artifacts.Moreover, some of the T2-weighted scans have minor artifacts not present in the T1-weighted scans.These scans were however included in the experiments.Manual segmentations were not available for this dataset; however, these scans are still useful in test-retest experiments quantifying the differences between the two time points.Ideally, as all the subjects are healthy, the biological variations should be small and the segmentations between the two time points should be identical.An example of the T1-and T2-weighted scans is shown in Fig. 5.

Benchmark methods
In order to gauge the performance of the proposed algorithm with respect to the state of the art in brain MRI segmentation, we compare its performance against four representative methods:  BrainFuse2 (Sabuncu et al., 2010) is a multi-atlas segmentation method that uses an intensity-based label fusion approach to merge a set of propagated training labelings into a final segmentation of a target scan.More specifically, it assumes a generative model in which the joint intensity-label space is modeled with a Parzen density estimator (using a logOddsbased kernel (Pohl et al., 2006a) for the labels, and a Gaussian kernel for the intensities); with an optional Markov random field prior enforcing spatial consistency.Segmentation is carried out through Bayesian inference, effectively giving more weight to atlases that have locally similar intensities to the target scan.In the publicly available implementation, the Markov random field prior is not includedhowever it does not yield a significant increase in segmentation accuracy (Sabuncu et al., 2010).For computing the registrations between the training and target subjects, BrainFuse employs asymmetric bidirectional registrations based on an efficient Demons-style algorithm that uses a one parameter sub-group of diffeomorphisms combined with a sum of squared intensity differences (SSD) similarity measure (Sabuncu et al., 2010).The freely available implementation of BrainFuse is optimized to work with data that has been preprocessed (skull-stripped, bias field corrected, intensity normalized and re-sampled to a 1mm3 grid) with FreeSurfer.
In the experiments, we follow these preprocessing requirements.The free parameters of the registration method are set to the values reported in Sabuncu et al. (2010), where the authors cross-validated the parameter values on the same training dataset that we use in this study.
PICSL MALF 3 (Wang et al., 2013) assumes that the segmentation errors of the propagated training labelings can be correlated, as opposed to BrainFuse, in which independence of the errors of the different labelings is assumed.PICSL MALF formulates a weighted voting problem in terms of trying to minimize the expectation of the labeling error, i.e., the error between the fused labels and the true segmentation in every voxel.To achieve this, it approximates the expected pairwise joint label differences between the training scans and the target scan using intensity similarity information.The intensity similarities are computed within a patch around each voxel.The patch intensities are normalized to have zero mean and a constant norm, making the similarity measure robust against linear intensity change, which is often enough to correct for small differences in MRI contrast.Moreover, PICSL MALF also performs a local search to try to find the voxel that is most similar to the corresponding target image voxel patch-wise.This can be interpreted as additional refinement of the pre-computed pairwise registrations.For computing the initial pair-wise registrations between the training and target subjects PICSL MALF uses ANTs/SyN4 (Avants et al., 2008), which is a diffeomorphic registration algorithm.We follow the implementation details that were used in the implementation of PICSL MALF that won the MICCAI 2012 Grand Challenge on Multi-Atlas Labeling (Landman and Warfield, 2012).Specifically, for computing the pair-wise registrations, we use the cross-correlation (CC) similarity metric, which adapts naturally to situations where locally varying intensities occur (Avants et al., 2008); and we set the registration parameters to the values reported in Landman and Warfield (2012).The authors use no specific preprocessing steps such as bias field correction; however, the ANTs/SyN registration algorithm has been shown to be robust to quite severe bias field effects when the CC similarity metric is used (Avants et al., 2008).We note that the PICSL MALF software also provides a post-processing procedure to correct systematic segmentation errors based on corrective learning (Wang et al., 2011); however since this is an independent module that is equally applicable to the other benchmark methods as well it was not used in this study.
FreeSurfer5 (Fischl et al., 2002) is based on a statistical atlas of neuroanatomy, along with an intensity atlas in which a Gaussian distribution is associated with each voxel and class.The parameters of these Gaussians are estimated in a supervised fashion from training data.The model is completed by a Markov random field model that ensures spatial smoothness of the segmentation, which is computed as the MAP estimate in a Bayesian framework.We note that FreeSurfer was trained on the same training data that we are using in this study, which makes direct comparison with our approach and the multi-atlas methods feasible.
Majority Voting (Rohfling et al., 2004a;Heckemann et al., 2006) is a simple multi-atlas segmentation method, where the propagated training labelings are fused into a final segmentation by picking, in each voxel, the most frequent label across the propagated labelings.We include this method as a reference against which we can compare the performance of the more sophisticated label fusion approaches.For our implementation of majority voting, we use the same pair-wise registrations as for PICSL MALF.
These benchmark methods cover a wide spectrum of modern brain MRI segmentation algorithms.Majority voting, BrainFuse and PICSL MALF represent multi-atlas segmentation, which is arguably the most popular segmentation paradigm at the moment.Moreover, they are non-parametric methods, whereas our method and FreeSurfer represent parametric approaches.

Cross-validation experiments on training data for parameter tuning
The free parameters of the different methods are determined using the 39-subject training dataset as follows: Proposed Algorithm.We use 20 randomly picked subjects out of the available 39 to build our probabilistic atlas.Only 20 subject are chosen, because the atlas building process is very computationally expensive (several weeks to build an atlas with 20 subjects) and the results show that the segmentation performance does not increase any further when more subjects are added (see Section 4.3).The remaining 19 subjects are used to find suitable values for the free parameters in our algorithm: the global stiffness of the mesh β, the number of bias field basis functions P, the groups of structures f that share the same GMM parameters, and the number of mixture components associated with each structure group.The parameters are tuned based on a visual inspection of the automatic segmentations in the 19 training subjects.The chosen values for the mesh stiffness and number of bias field basis functions are: β = 0.1 and P ¼5 per dimension, amounting to a total of = = P 5 125 3 basis functions in 3D.The choice of which sets of structures share the Gaussian mixture parameters, as well as the number of Gaussians for each mixture, is summarized in Table 2.
BrainFuse.We use the optimal parameters listed in the original publication (Sabuncu et al., 2010); this choice is appropriate because the authors cross-validate the parameter values on the same training dataset as used in this study.
PICSL MALF.For this method we need to determine the optimal values for the patch radius over which the intensity similarity is calculated; a constant controlling the inverse distance function which maps the intensity difference to the joint error; and the size of the local search window (Wang et al., 2013).For this purpose, we randomly select 10 subjects as test data and use the remaining 29 subjects as training data, and perform a cross-validation grid search using similarity patch radii of 0, 1, 2, 3 s and inverse mapping constants of β = [ ] 0.5, 1, 1.5, 3, 6 .As a measure of goodness we use the mean Dice overlap score 6 (which is the main performance metric used in the experiments below) over the structures listed in Section 3.4 below.The resulting optimal values are: r p ¼1, r s ¼ 2 and β = 3.
FreeSurfer.We use the standard processing pipeline with default parameters.No cross-validation needs to be performed as FreeSurfer is trained on the same training dataset (using all 39 subjects) we use in this study.
Majority Voting.Given the pre-computed registrations, majority voting has no parameters to tune.

Experimental setup
We perform a comprehensive evaluation consisting of four sets of experiments: I.In a first experiment, we use models trained on the training dataset to segment the scans from the intra-scanner and the cross-scanner datasets, comparing each method's segmentations with the corresponding manual annotations using the Dice overlap score.This experiment enables us not only to compare the performance of the different methods, but also to assess how much their performance degrades when the image intensity properties of the training and test datasets are not matched.II.In a second experiment, we evaluate the computational efficiency of the various methods on the intra-and inter-scanner datasets.We compute the running time of the different algorithms on a cluster where each node has two quad-core Xeon 5472 3.0 GHz CPUs and 32 GB of RAM; we only use one core in the experiments in order to make fair comparisons, even though all the algorithms can potentially be parallelized.We also record the execution time of a multi-threaded implementation of our method, using 8 cores on a computer with 8 dual-cores with 3.4 GHz CPU and 64 GB of RAM.This setup represents a realistic scenario that enables us to compare the running time of our algorithm with those reported by other studies in the literature.III.In a third experiment, we study the effect of the number of training subjects on the segmentation performance.To achieve accurate segmentations, a representative training set is needed to capture all the structural variation one might see within the subjects to be segmented (Aljabar and Heckemann, 2009).However, some algorithms require less training data than others to approach their asymptotic performance, which represents a saving in manual labeling effort.We therefore randomly pick 5 sets of 5, 10 and 15 subjects from the training data, and re-evaluate the segmentation performance of the proposed method, BrainFuse, PICSL MALF and majority voting on the intra-and cross-scanner datasets.IV.In a final experiment, we evaluate the ability of the proposed algorithm to segment non-T1-contrast and multi-contrast MR scans using the multi-echo and the test-retest datasets.Given a training set consisting only of T1-weighted scans, using multi-contrast or non-T1-contrast information is out of reach for the four specific benchmark methods we compare against in this article, although we note that several multi-atlas label fusion techniques exist that could potentially be used in this context (cf.discussion in Section 5).For the multi-echo dataset we first run the proposed method using only the T1-weighted images (i.e., flip angle °30 ), then only the PD- weighted images (i.e., flip angle °3 ), and finally using both the T1-and PD-weighted images simultaneously.The resulting automated segmentations are then compared to the expert segmentations using Dice scores.For the test-retest dataset, we first segment the two time points using only the T1weighted images, and subsequently using both T1-and T2weighted images together.Because no manual segmentations are available for this dataset, we use absolute symmetrized percent change (ASPC) (Reuter et al., 2012) to quantify the differences in the automatic segmentations between the two time points.This metric is defined as the absolute value of the difference in volume, normalized by the mean volume: where V 1 ,V 2 are the volumes at the two time points.Ideally this number should be small, as the subjects are all healthy and the time between the scans is not so long.
We report the Dice scores and the ASPC on a representative subset of 23 relevant structures that is also used in other studies (e.g., Fischl et al., 2002;Sabuncu et al., 2010): left and right cerebral white matter (WM), cerebellum white matter (CWM), cerebral cortex (CT), cerebellum cortex (CCT), lateral ventricle (LV), hippocampus (HP), thalamus (TH), putamen (PU), pallidum (PA), caudate Table 2 Details of the parameter sharing between structure classes.The groups of structures that share their Gaussian mixture parameters are shown in the first column, and the corresponding amount of Gaussians in the mixture in the second column.

Structures with shared parameters Number of Gaussians
Non-brain tissues 3 M , where l A and l M are the automatic and manual segmentations respectively and |•| is the cardinality of a set.
(CA), amygdala (AM) and brain stem (BS).We will refer to these structures as the "regions of interest" (ROIs); note that for clarity of presentation we report the average Dice score of the left and right hemisphere for all structures except for the brain stem.

Intra-scanner and cross-scanner segmentation performance
The Dice scores between the manual and automated segmentations of the ROIs, obtained using the different methods, are shown for the intra-scanner dataset in Fig. 6 (top).Table 3 (first column) summarizes the scores in average over the ROIs and subjects, and reports statistically significant differences between the methods.The significance testing was done using paired, twosided t-tests, by stacking the individual Dice scores in each ROI and subject for a given method.Corresponding scores and significant differences for each ROI separately are reported in Supplementary material, Table 1.All of the methods perform well on the intra-scanner dataset, which was expected, as the contrast properties of the training data are identical to those of this dataset.The multiatlas segmentation methods achieve the highest mean scores, with PICSL MALF being the best method for this dataset.Majority voting also obtains a very high mean score despite its simple fusion strategy.This is likely due to the accurate ANTs/SyN registration framework, which has been shown to perform very well on intra-scanner data (Klein et al., 2009).We note that each of the benchmark methods is specifically trained for this type of data, whereas the proposed method is not.
For the cross-scanner data, where the contrast properties of the target data are different from the training data, the ROI Dice scores are shown in Fig. 6 (bottom) and the mean scores over the ROIs and subjects in Table 3 (third column).Corresponding scores and significant differences for each ROI separately are reported in Supplementary material, Table 2. Compared to the intra-scanner dataset, the overall segmentation accuracy of all methods decreases, which is likely due to the lower intrinsic image contrast of the SPGR pulse sequence as noted in Han and Fischl (2007) and as also visible from Fig. 3.In this dataset, the proposed method Fig. 6.The Dice scores of the different methods for the intra-scanner (top) and cross-scanner (bottom) data.The proposed method ¼ green, BrainFuse ¼ blue, PICSL MAL-F ¼magenta, FreeSurfer¼ red and Majority Voting¼ black.Additional results, obtained by preprocessing the input data using the FreeSurfer pipeline, are also shown (filled boxes with broken lines).On each box, the central horizontal line is the median, the circle is the mean, and the edges of the box are the 25th and 75th percentiles.Data points falling outside of the range covered by scaling the box four times are considered outliers, and are plotted individually.The whiskers extend to the most extreme data points that are not considered outliers.See Section 3.4 for the acronyms.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)O. Puonti et al. / NeuroImage 143 (2016) 235-249 achieves the highest mean score, demonstrating its robustness against changes in contrast.Although FreeSurfer explicitly encodes the contrast properties of the training scans, its performs relatively well on this un-matched data; this can be explained by its in-built renormalization procedure for T1 acquisitions, which applies a multi-linear atlas-image registration and a histogram matching step to update the class-conditional densities for each structure (Han and Fischl, 2007).In contrast, the label fusion methods, which directly rely on the image intensities of the training data in their registration and fusion steps, are clearly affected by the changes in the MRI contrast.The pair-wise registrations are especially more challenging for this dataset, leading to misregistrations that are the principal error source in multi-atlas segmentation.
The segmentation accuracy of BrainFuse, which uses Free-Surfer-preprocessed images, varies between different structures.In general it seems to perform well on some of the larger structures (LV, BS), whereas the performance is not so good on some of the smaller structures (AM, HP, PA).This is likely explained by the choice of registration algorithm and especially the SSD similarity measure, which is not invariant against small intensity changes.Although the PICSL MALF and majority voting methods use the more robust CC similarity measure in the ANTs/SyN registration framework, there are some subjects in the cross-scanner dataset for which computing the registrations without preprocessing is very difficult, resulting in the segmentation outliers shown in Fig. 6.We note that although majority voting does not rely on intensity information when fusing the labels, its usage of CC in the registration step indirectly assumes that the training and target scans have similar properties (linear local intensity transformation).Compared to PICSL MALF, its simple fusion rule makes majority voting much more dependent on the quality of the pair-wise registrations, as the effect of poorly registered subjects can not be downplayed.
In order to further analyze the relative performance of the various methods without the influence of outlier subjects in majority voting and PICSL MALF, we performed an additional, post hoc analysis with the explicit aim of avoiding ANTs/SyN registration failures in the cross-scanner data.For this purpose, we re-ran majority voting and PICSL MALF, as well as the proposed method, on data that had been preprocessed with the FreeSurfer pipeline (which includes skull-stripping, bias field correction, intensity normalization, and re-sampling to a 1mm 3 grid), although we note that this preprocessing is not part of the default implementations of these algorithms.
The resulting average Dice scores for the intra-and the crossscanner data are shown (in italics) in the second and the fourth column of Table 3 note that BrainFuse already depends on FreeSurfer so that all five methods effectively use the same preprocessing pipeline in this scenario.The Dice scores obtained this way for each ROI individually are also displayed in Fig. 6.Comparing the results obtained with and without FreeSurfer preprocessing, it can be seen that the additional preprocessing effectively avoids ANTs/SyN registration failures in the crossscanner data, resulting in a strong performance for majority voting with only a relatively minor improvement for the more advanced label fusion of PICSL MALF, which obtains the strongest overall segmentation accuracy.Unlike in the intra-scanner data, however, majority voting no longer outperforms the proposed method in the cross-scanner data even though all its pair-wise registrations are successful.It can also be seen that the proposed method does not benefit from FreeSurfer preprocessing in either the intrascanner or the cross-scanner data, an indirect demonstration of its intrinsic bias field correction and skull stripping performance.
Finally, Table 3 and Table 4 of the Supplementary material list the Dice scores and significant differences for each ROI separately for the intra-and cross-scanner data preprocessed with FreeSurfer.It can be seen that, although after preprocessing PICSL MALF outperforms other techniques in seven structures on the crossscanner data, the proposed technique remains the best method for four other structures, especially those in cortical areas (WM, CT, CWM).
A limitation of the comparisons presented in this Section is that the proposed method uses an atlas built from 20 randomly selected subjects, potentially introducing a bias when comparing to benchmark methods that use all 39 subjects without selection as training set.However, as shown in Section 4.3, PICSL MALF, BrainFuse and majority voting all benefit from using all the available training subjects compared to random subsets of various sizes, whereas the proposed method saturates around 10 subjects with very little further gains from larger training sets.

Running time
The approximate mean computation time for a single scan using the different methods is shown in Table 4.The proposed method is approximately 7 times faster than FreeSurfer, 12 times faster than BrainFuse and 100 times faster than PICSL MALF and majority voting.
In general, the parametric methods (i.e., FreeSurfer and the proposed method) are significantly faster than the label fusion approaches.This is because only a single non-linear registration is needed, as opposed to the multiple pair-wise registrations used in the non-parametric methods.Moreover, in PICSL MALF the local search is especially time consuming with large search windows.Compared with FreeSurfer, which is also parametric, our method is faster due to the sparse encoding of the mesh prior.Encoding this sparsity is computationally expensive, but needs to be done only once (in an offline fashion).Furthermore, in the proposed approach, no special post or preprocessing of the target scans is needed.
In its multi-threaded setup, the proposed method has an execution time of 23.5 min per scan on average.The fastest wholebrain segmentation method to our knowledge is presented in Zikic Mean Dice scores of the different methods over the ROIs for the intra-scanner (first column) and cross-scanner (third column) datasets.Additional results, obtained by preprocessing the input data using the FreeSurfer pipeline, are also shown (in italics, second and fourth columns).The superscript lists the methods that obtain significantly lower scores compared to a given method.The significance was tested using a paired two-sided t-test with a 5% significance level.2014) with execution times in the range of 5 to 13 minutes; however this method is not designed to handle image contrast differences.

Effect of the number of training subjects
Fig. 7 shows the effect on each method's Dice scores, averaged across all ROIs, of training on randomly selected subsets of the entire training pool, both for the intra-scanner and the crossscanner datasets.In order to compare the different methods' performance without the influence of gross registration failures in majority voting and PICSL MALF, results obtained after preprocessing the data with FreeSurfer are also provided.The figure shows that adding more training subjects generally yields more accurate segmentations for all methods, but that the proposed method reaches its maximum performance faster than the multiatlas methods: Already with 10 training subjects the segmentation accuracy of the proposed method is above 99% of its maximal performance in all experiments, regardless of the specific subjects included in the training set.This is especially useful for populations where expert segmentations are expensive or difficult to obtain, such as infants.The fact that the performance of the proposed method is not more dependent on the specific subjects included in the training set is likely due to the atlas construction process that explicitly avoids over-fitting to training data (Van Leemput, 2009), yielding sparser tetrahedral meshes (and therefore blurrier probabilistic atlases) when fewer training subjects are available.This effect is illustrated in Table 5, where the average number of mesh vertices for the 5, 10 and 15 training subject groups are reported.
The effect of FreeSurfer preprocessing appears to be minimal for the proposed method across the different training set sizes, showing a similar performance in both the intra-scanner and the cross-scanner data compared to when no preprocessing is applied.In contrast, preprocessing is crucial for both majority voting and PICSL MALF in the cross-scanner setting, as ANTs/SyN registration failures otherwise severely compromise segmentation performance.Compared to the other multi-atlas methods working on the same (i.e., preprocessed) data, as well as the proposed method (with or without preprocessing), BrainFuse appears to be much more sensitive to small training datasets, both in terms of the average Dice scores that it obtains as well as its sensitivity to the specific random subjects that are used for training.

Multi-contrast performance
Fig. 8 shows the Dice scores of the proposed method on the multi-echo dataset, for various combinations of single-(T1weighted only or PD-weighted only) and multi-contrast (T1-and PD-weighted simultaneously) input data.The results are very similar between T1-weighted only and multi-contrast input data, whereas using the PD-weighted contrast alone often yields reduced performance.This indicates that the PD-weighted contrast does not add much useful information to the T1-weighted scan when healthy brains are segmented.Example segmentations of the multi-echo dataset using T1-weighted only and multi-contrast scans are shown in Fig. 9.
The volume differences between the two time points in the 39 subjects of the T1/T2 test-retest dataset are shown in Fig. 10.In general, they are quite similar and small for both single-(only T1) and multi-contrast (both T1 and T2) segmentations, with the median ASPC in the 1-2% range.There are some larger differences   especially in the thalamus and pallidumwhen using multicontrast data.This appears to be mostly due to imaging artifacts in the T2-scans, an example of which is shown in Fig. 11.We note that this dataset has the lowest resolution of all the datasets we tested the method on, and therefore is affected the most by partial volume segmentation errors.
In order to put the ASPC test/retest results of Fig. 10 in perspective, we also report the ASPC scores for the benchmark methods when applied to the T1-weighted scans of the two time points.Because of the heavy computational burden of some of the methods (e.g., PICSL MALF occupies a CPU core for more than six days per scan, cf.Table 4), we only report the results on 10 randomly chosen subjects (20 scans in total) out of the available 39.
The benchmark methods' ASPC scores are shown in Fig. 12, along with those obtained with the proposed method on the same subjects (both T1-only and multi-contrast).The figure shows that the proposed method, PICSL MALF and majority voting perform most reliably across the time points, while BrainFuse and Free-Surfer have more variance in their segmentations.As discussed before, the weaker performance of BrainFuse compared to the other label fusion methods is likely a combination of the chosen registration framework and sub-optimal similarity measure used for the registrations; the reasons for FreeSurfer's weaker performance are not immediately clear.On the selected 10 subjects we did not observe problems with the pair-wise registrations when using the ANTs/SyN registration framework, leading to a robust performance of the PICSL MALF and majority voting methods in this experiment.

Discussion and conclusion
In this paper we have validated a whole-brain segmentation method that builds upon the parametric, unsupervised intensity clustering models commonly used in tissue classification.We have demonstrated that these type of models are capable of achieving state-of-the-art segmentation performance, while being very fast, adaptive to changes in tissue contrast, and able to handle multicontrast data.We emphasize that the exact same algorithm was used for all datasets in this paper, without any parameter retuning or configuration changes, demonstrating the robustness of the approach.
Our experiments indicate that, in the general cross-scanner scenario, the proposed method yields a robust segmentation performance on par with the very best competitors, while being orders of magnitude faster and without requiring any form of preprocessing.The method's accuracy is outperformed only when the image intensities of the training and test data are perfectly matched; however we believe this scenario will seldom occur in practice because manual whole-brain segmentation is so timeconsuming (e.g., taking hundreds of days for the training data used in this paper) that the available training data will seldom be acquired on the exact same imaging system as the images being segmented.Since the method we have validated here combines Gaussian mixture modeling with MRI bias field correction and probabilistic atlas deformation, it is closely related to the unified segmentation framework described in Ashburner and Friston (2005); however only basic tissue classification on T1-weighted images was attempted in that work.A related method based on fuzzy c-means clustering and a topological atlas was described in Bazin and Pham (2008), but that only segmented a handful of structures, and relied on the availability of pre-defined centroid initializations for each type of MRI sequence the method is expected to encounter.
An early attempt at whole-brain segmentation using a deformable probabilistic atlas combined with unsupervised intensity clustering was described in Babalola et al. (2009); however, the atlas registration was performed independently of the segmentation process, using relatively coarse deformations, and the resulting segmentation performance was found to trail that of label fusion methods.Subsequent methods showing better performance (Ledig et al., 2012a(Ledig et al., , 2015;;Makropoulos et al., 2014;Iglesias et al., 2013b;Tang et al., 2013) have used the non-parametric paradigm instead, where a probabilistic atlas is computed in the space of the target scan, i.e., after warping each of the training scans onto the target image using pairwise registration.We note that well-known majority voting methods (Rohfling et al., 2004a;Heckemann et al., 2006) using mutual information (Maes et al., 1997;Wells et al., 1996b;Studholme et al., 1999) as registration criterion also implicitly combine the non-parametric paradigm (multi-atlas label fusion) with unsupervised intensity clustering, since mutual information-based registration can be understood as jointly estimating registration parameters and class-conditional densities (Roche et al., 2000).In general, however, such non-parametric approaches are computationally much more expensive than the parametric method we evaluated here.Fig. 12.The ASPC scores of the different methods for 10 randomly chosen subjects from the test-retest dataset.The performance of the proposed method when using only T1-weighted data in dark green and when using both T1-and T2-weighted scans in light green, BrainFuse in blue, PICSL MALF in magenta, FreeSurfer in red and Majority Voting in black.The box plots are drawn in the same way as explained in Fig. 6. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) In the current paper, we only analyzed images of healthy subjects, and our experiments on multi-contrast images showed no benefit in terms of segmentation accuracy compared to when only T1-weighted scans are used.However, the ability to seamlessly handle multi-contrast data becomes essential when analyzing diseased populations, since many brain lesions are much better visualized in T2-weighted and FLAIR scans than in T1-weighted contrast.In future work we will therefore include models of pathologies in the proposed framework, enabling simultaneous whole-brain segmentation and pathology detection (Puonti and Van Leemput, 2016).
The proposed method has been evaluated on a set of structures in which the cerebral cortex was considered a single structure, without attempting to further parcellate it into neuroanatomical subregions.However, we note that the volumetric white matter segmentations generated by the method can be used to build and label cortical surface models using FreeSurfer (Dale et al., 1999;Fischl et al., 2004a).Exploring this direction remains as future work.
The segmentation software used in this paper, including the source code, the sparse probabilistic atlases and the code to build such atlases from training data, will be made publicly available.
collecting the intensities in a multi-contrast brain MRI scan with I voxels, where the vector = in voxel i for each of the available N contrasts.Furthermore, let = ( … ) of K possible segmentation labels assigned to voxel i.

Fig. 1 .
Fig. 1.Left: T1-weighted scan from the training data.Center: corresponding manual segmentation.Right: atlas mesh built from 20 randomly selected subjects from the training data.

Fig. 2 .
Fig. 2. On the left an example slice from the intra-scanner dataset and on the right a corresponding manual segmentation.

Fig. 7 .
Fig. 7. Mean Dice scores over the ROIs for the intra-scanner (left) and the cross-scanner (right) data when the different methods are trained using randomly picked subsets of only 5, 10 and 15 training subjects.The error bars correspond to the lowest and highest obtained mean Dice score across the random subsets.The score obtained when all subjects in the training pool are used is also shown for reference (fourth bar of each method).The proposed method (P) is shown in green, BrainFuse (BF) in blue, PICSL MALF (PM) in magenta and majority voting (MV) in black.Additional results, obtained by preprocessing the input data using the FreeSurfer pipeline, are also shown (filled bars with broken lines).(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 8 .
Fig. 8. Dice scores for the multi-echo dataset.Performance on T1-weighted data is shown in dark green, on PD-weighted data in orange, and on multi-contrast input data in light green.The box plots are drawn in the same way as explained in Fig. 6. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 10 .
Fig. 10.The ASPC scores for the test-retest dataset.Volume differences between the time points on multi-contrast input data is shown in light green, and on T1weighted data only in dark green.The box plots are drawn in the same way as explained in Fig. 6.The outlier marked by an arrow is the one shown in Fig. 11.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 11 .
Fig. 11.An example of an outlier subject marked by the arrow in Fig. 10.Top row from left to right: a T1-weighted scan with no visible artifacts, a T2-weighted scan with a line-like artifact in the pallidum and thalamus area marked by red arrows, and an automated segmentation of pallidum and thalamus showing the segmentation error caused by the artifact.The bottom row shows zoomed figures of the affected area, highlighting vertical lines in the T2-scan that cause jagged borders in the automatic segmentation, resulting in a poor ASPC score for this subject.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 4
Mean computational time for the different methods (single core).For label fusion methods the computation times for registration (Reg.) and label fusion (Fusion) are listed separately.

Table 5
Average number of vertices in the proposed atlas mesh for different numbers of training subjects.