An algorithm for optimal fusion of atlases with different labeling protocols

,


Introduction
Automatic segmentation of brain structures from MRI data makes it possible to carry out neuroimaging studies at larger scales than manual tracings would, since the latter are very time consuming to make.Moreover, automatic segmentation methods are also more repeatable and reliable than their manual counterparts.Brain MRI segmentation has been used in a number of applications, such as tractography [1], surgical planning [2] and studies of aging [3], brain development [4] and pathologies like Alzheimer's disease [5].
One family of supervised segmentation techniques that has become popular in brain MRI is multi-atlas segmentation [6].In conventional atlas-based segmentation, the grayscale image of the atlas is nonlinearly registered to the space of the test scan, and the resulting transform is then used to warp the corresponding labels, which provide an estimate of the segmentation.Since a single atlas is not sufficient to cover the whole spectrum of variability within a population, multi-atlas segmentation has emerged as a natural extension.Using multiple atlases, this family of techniques produces more accurate segmentations [7] by: (1) independently registering several atlases to the test scan; (2) using the resulting transforms to deform the corresponding label images; and (3) combining the registered label maps into a single estimate of the segmentation with a label fusion algorithm.Multi-atlas segmentation is becoming widespread for three reasons.First, the maturity of registration algorithms (e.g., ANTs/SyN [8], Elastix [9]) enable multi-atlas techniques to achieve very high performance.Second, the public availability of such methods makes multi-atlas segmentation easy to implement.And third, the relative computational cost associated with nonlinearly registering the atlases is quickly diminishing thanks to the rapid increase in processing power of computers.
The choice of label fusion method is critical for the performance of multi-atlas segmentation.Early algorithms include best atlas selection [6] and majority voting [10].The former estimates the segmentation as the labels of the atlas that is most similar to the test scan after registration.In this context, similarity can be measured with the same metrics that are typically used in image registration, such as cross-correlation, mutual information or sum of squared differences.Majority voting, on the other hand, operates at the voxel level by locally assigning the most frequent deformed atlas label at each spatial location -without considering the image intensity information.The performance of majority voting can be increased by an atlas selection process, in which only the deformed atlases that are most similar to the target scan are considered in the fusion [11,12] Later fusion methods compute the segmentation as a weighted combination of the labels of the registered atlases such that higher weights are given to more similar atlases.The weights can be global [13] or local [14,15,16,17].Sabuncu et al. [17] have shown that many of these multi-atlas methods can be written within a unified generative model.Another popular label fusion approach is STAPLE [18] and its variants [19,20,21,22]; while this method was originally developed to combine multiple manual segmentations from different human raters, it is increasingly being applied in the context of multi-atlas label fusion.
All the aforementioned label fusion algorithms assume that all structures of interest are labeled in all atlases, which is a rather limiting constraint.Eliminating this requirement would have several practical implications: • It would enable us to combine training scans from different datasets even if they have different sets of annotated structures.In turn, this would make it possible to take advantage of the increasing amount of heterogeneously labeled MRI data that are publicly available.
• It would also enable us to segment structures that are not included in any of the datasets, but defined as the intersection of labels.For instance, the intersection of the lateral postcentral region and the cerebral gray matter would define the primary somatosensory cortex.
• It would allow for the fusion of segmentations from different modalities with different field of views and resolution.For instance, it would be possible to combine standard resolution brain MRI (1 mm resolution) with high-resolution MRI with limited field of view or even histology or optical coherence tomography data.
• It would be useful if one were to manually relabel a subset of atlases to include finer structures in the annotations.For example, in a large dataset with the hippocampi already labeled, an expert rater can additionally delineate the hipocampal subfields -which is extremely difficult and time consuming -in just a few cases.Traditional label fusion methods would only be able to use these few scans in the segmentation, having to disregard the information in all the scans in which the subfields are not labeled.
Despite the practical implications that a label fusion algorithm which allows for heterogeneously labeled atlases would have, this direction remains largely unexplored in the literature.To the best of our knowledge, only a particular case of label fusion with heterogeneous labels has been considered so far: the situation in which some of the labels are missing in some of the atlases.To tackle this problem, Landman et al. [23,24,25] propose an ad-hoc solution by modifying the STAPLE framework such that unlabeled voxels are ignored and the confusion matrix entries corresponding to the missing structures are fixed.Commowick et al. [26] propose ameliorating the effect of missing labels by adding a prior on the confusion matrices to the STAPLE algorithm that, when a label is missing, encourages higher a transition probability from that label to the background.However, such an approach treats as background all the voxels that have not been labeled with one of the foreground labels.
In this study, we present a family of probabilistic models for label fusion that make it possible to use atlases that have been annotated with different protocols.In our models, the atlases are assumed to have a hidden "fine" segmentation with all the structures present in the training data -including those defined by intersections of labels.The actual observed labels are assumed to have been obtained by collapsing groups of hidden fine labels into more general, coarse labels.
The contribution of this study is twofold: i.We use probabilistic models of label fusion to extend three popular methods (majority voting, semi-locally weighted fusion and STAPLE) to the scenario of heterogeneously labeled atlases.
ii.We propose a new generative model for label fusion that can overcome the limitations of these generalizations -the inability to produce meaningful posteriors and to exploit the similarities between the atlases -and show that it outperforms the generalizations in experiments with four datasets.
The rest of this paper is organized as follows.In Section 2, we describe the general framework for label fusion with heterogeneously labeled atlases, propose the generalizations of the different methods, identify their disadvantages, and present a new fusion algorithm to address their shortcomings.In Section 3, we assess the performance of the different algorithms with experiments on four different datasets.Finally, Section 4 concludes the paper.

Methods
In this section, we first introduce the general framework and define the variables that we will use throughout the paper (Section 2.1).Then, we present the generalizations of majority voting, semi-locally weighted voting and STAPLE (Sections 2.2-2.4), and identify their weaknesses in Section 2.5.Finally, we introduce a label fusion method that addresses these shortcomings in Section 2.6.

General framework
Throughout the remainder of this paper, we will assume that a test scan consisting of J voxels is to be segmented.We will use y = {y j , j = 1,…, J} to refer to the image intensities, and s = {s j , j = 1,…, J} to refer to its hidden, underlying segmentation.Let us also assume that a set of N atlases has been pre-registered to a test scan with a non-linear algorithm.Let {i n } (where i n = {i nj , j = 1,…, J}) be the observed image intensities of the N registered atlases, and let {l n } (where l n = {l nj , j = 1,…, J}) be the corresponding discrete labels, defined at the finest detail level.Their values range from 1 to L, the total number of fine labels.
These deformed labels {l n } are not directly observed; instead, we have access to a different set of coarse labels {c n } (where c n = {c nj , j = 1,…, J}), which correspond to the actual manual delineations.The coarse labels {c n } are obtained by collapsing the fine labels {l n } into different groups of labels by means of a set of N deterministic, protocol-specific functions: c nj = f n (l nj ).A protocol function could, for instance, collapse the hippocampal sub-fields into a single hippocampal label.Having a separate f n for each atlas enables us to combine different labeling protocols.Different protocol functions can collapse the same fine label into different coarse labels; for instance, orbital cortex could be collapsed into the cerebral cortex by one protocol and into the frontal lobe by another.
We will now generalize three existing models -majority voting, locally weighted voting and STAPLE -to the scenario with collapsed labels.In all cases, the original algorithm (i.e., without heterogeneously labeled atlases) is recovered when all protocol functions are bijective (i.e., no labels are collapsed).

Generalization of majority voting
As explained in [17], majority voting can be seen as the most likely labeling in a probabilistic model in which the segmentation s j is sampled randomly from one of the N atlases, as indexed by a hidden discrete field m = {m j , j = 1,…, J}.Specifically, the value of the field at a certain voxel m j ∈{1,…, N} indexes from which atlas we take the label at voxel j, i.e, s j = l m j j .The field m follows a flat prior, i.e., p(m) ∝ 1.It is straightforward to show that, after marginalizing over m, the distribution of the segmentation at a given location is equal to its relative frequency within the propagated labels of the atlases.Therefore, the most most likely segmentation is the mode of the propagated labels.
The graphical model for generalized majority voting is shown in Figure 1, and the corresponding equations in Table 1; note that we have assumed a flat prior over the hidden labels.The main difference with the original model is that the fine labels of the atlases are not available; instead, we observe their coarse labels -but still want to compute the segmentation at the fine level.To find the most likely segmentation within this model, we can operate at each spatial location j independently -since the model factorizes over voxels.
The problem to solve is ŝ j = argmax s j p(s j |c.j ), where c. j denotes all the coarse atlas labels at voxel j.The probability p(s j |c.j ) is given by: (1) (2) where δ[•] is Kronecker's delta.The interpretation of Equation 2 is simple: at each voxel j, every atlas equally spreads its vote over all the fine labels that are compatible with the coarse label c nj .The segmentation is just the label that maximizes Equation 2 with respect to s j .

Generalization of semi-locally weighted voting
Here we generalize the model from [17] to the scenario with collapsed labels.The graphical model is shown in Figure 2, and the corresponding equations in Table 2.The model for the labels is the same as for majority voting; however, now there is a Markov random field (MRF) prior with a predefined constant β on the membership field m, in order to encourage spatial patterns of labels observed in the training data.In addition to the segmentation, the membership field now also generates the intensities of the test scan at each voxel by selecting atlas m j , taking its intensity i m j j , and corrupting it with Gaussian noise with variance σ 2 .The variance is independent of the spatial location.In [17], it is set to a predefined value; here, we estimate it from the data instead assuming a flat prior for it.The segmentation is formulated as: ( where we have used the mode approximation for the integral.The point estimate σ 2 of the variance is: (5) We now describe algorithms to compute the most likely variance σ 2 (Section 2.3.1) and subsequently the most likely segmentation(Section 2.3.2).

Computation of the most likely variance σ 2-
We first note that, according to the model in Figure 2, the variance σ 2 is independent of the coarse labels {c n } when the segmentation s is unknown.Therefore, the problem in Equation 5 can be rewritten as follows: (6) Equation 6 requires marginalizing over m, which leads to an intractable sum due to the MRF prior.Instead, we will use the variational expectation maximization (VEM) algorithm to estimate an approximate solution.Rather than optimizing Equation 6 directly, we maximize a lower bound J: where KL denotes the Kullback-Leibler divergence and H represents the entropy of a random variable.The inequality J(q(m, σ 2 )) ≤ log p(y|σ 2 , {i n }) holds because the KL divergence is non-negative.The distribution q(m) represents an approximation to the posterior distribution of m given the observed intensities and the variance.This distribution is optimized over a class of restricted functions.The standard approximation, known as mean field approximation, is that q factorizes over voxels: , where q j is a categorical distribution over the indices of the atlases at voxel j.
VEM alternates between an expectation (E) step and a maximization (M) step.In the E step, we maximize the bound J with respect to q(m).In the M step, we maximize J with respect to the model parameters -in this case, the variance σ 2 .In the E step, it is convenient to work with Equation 7: maximizing J is equivalent to minimizing the KL divergence, which yields the following update: (9) which can be solved with fixed point iterations, normalizing q j after each step.
In the M step, it is more convenient to work with Equation 8, since the entropy term can be disregarded.The maximization yields the following update: (10) The VEM algorithm typically converges in a few (5-6) iterations.Note that, if we set β = 0 in the model, we recover the standard EM algorithm [27].

Computation of the most likely segmentation
ŝ-Given σ 2, computing the segmentation ŝ still requires evaluating an intractable sum over m.However, since q(m) minimizes the KL divergence with p(m|y, σ 2 , {i n }), we can approximate the problem in Equation 4 with: Therefore, the optimal segmentation can be computed voxel by voxel as: which is almost identical to the right hand side of Equation 1, with the difference that the term p(m j ) has been replaced by q j (m j ).It is straightforward to show that the approximate posterior label probabilities are given by: (11) This expression is similar to the equation for the label posteriors of generalized majority voting (Equation 2).The difference is that the constant term 1/N has been replaced by the approximate membership posteriors q j .This term depends on the image intensities, such that the contribution is higher for the atlases that are semi-locally more similar to the novel scan.The vote of each atlas is still spread equally among the fine labels they are compatible with the coarse label at each voxel.

Generalization of STAPLE
The generative model of STAPLE is as follows: the hidden segmentation is generated by a prior , such that p(s j ) is a categorical distribution that reflects the prior frequencies of the classes.In our scenario, we used a flat prior (i.e., p(s j ) ∝ 1); preliminary experiments showed that the prior used in the original STAPLE algorithm -in which the frequencies are proportional to the volumes of the structures -considerably decreased the segmentation accuracy in small structures.Given the segmentation, the (deformed) atlas labels are assumed to be independent corrupted observations of the hidden ground truth segmentation.Each atlas has a corresponding confusion matrix through which its labels are generated, whereas the atlas intensities are disregarded in the fusion.The STAPLE algorithm first computes point estimates for the confusion matrices in light of the observed data, and the uses them to estimate the segmentation.
The graphical model for generalized STAPLE is shown in Figure 3a.Each atlas is characterized by a confusion matrix Ω n .The element corresponds to the probability that the true segmentation s is observed as label l by atlas n.The fine labels of the atlases are collapsed into the coarse, observed labels through the protocol functions f n .For our purposes, it is actually more convenient to work with a more compact version of this model, as shown in Figure 3b.We assume that the collapsed labels are generated directly by confusion matrices corresponds to the probability that the true segmentation s is observed as coarse label c by atlas n.The relation between Ω n and Θ n is simple: . Note that the matrices {Θ n } are not necessarily square, whereas the {Ω n } are.Moreover, the matrices {Θ n } will in general have different number of rows.
As in the original STAPLE algorithm, we compute the maximum likelihood estimates of the confusion matrices as follows: (12) Equation 12 can be iteratively optimized with EM.In the E step, we compute , a soft classification of voxel j: (13) In the M step, we update the confusion matrices: (14) Once the EM algorithm has converged, the approximate label posteriors are given by: and the discrete segmentation is just:

Shortcomings of the generalized methods
The presented generalizations of majority voting, semi-locally weighted voting and STAPLE suffer from several limitations.Generalized majority voting inherits from its parent the inability to exploit the information in the deformed atlas intensities.In addition, this method is unable to produce realistic label posteriors (soft segmentations) in many common multi-protocol scenarios due to the spreading of the votes across all the compatible fine labels at each location.For example, let's consider the problem of fusing the segmentation in Figure 4a -which includes the cortex as a whole and most subcortical structures -with the segmentation in Figure 4b -which includes the hippocampal subfields.Since the background label of the subfield data is compatible with all non-hippocampal labels, this scan would contribute a flat map of non-zero probability for all non-hippocampal structures all over the image domain.Even though the hard segmentation (Figure 4c) might still be meaningful, the label posteriors given by the method (Figures 4d and 4e) are not realistic.For instance, the posterior for the cortex is close to 0.5 (rather than 1) around this structure in Figure 4d, and the same thing happens with the amygdala in Figure 4e.
Soft segmentations are desirable for two reasons: first, they are useful to estimate the uncertainty in the boundary locations; and second, they allow to compute more accurate estimates of the volumes of the different structures using expectations.If V l is the volume of structure l in voxels [28]: where is the posterior label probability at voxel j, given by Equation 2(majority voting), 11 (semi-locally weighted voting) or 15 (STAPLE).If we computed volumes from the posteriors in Figure 4d or 4e, the estimates would clearly not be accurate.
In contrast to generalized majority voting, generalized semi-locally weighted voting uses the image intensities to estimate the contributions from the different atlases; however, it still suffers from the shortcoming that it spreads the votes across all the fine labels and cannot generate meaningful posteriors.STAPLE, on the other hand, can produce realistic posteriors thanks to the nature of its generative model, but it shares with majority voting the limitation that it does not consider the image intensities in the fusion.Moreover, STAPLE was originally conceived as a method to merge manual segmentations, and generally performs poorly in the label fusion step of multi-atlas segmentation, where it is often outperformed even by majority voting (see for instance [29]).
Even though some of these shortcomings could be addressed by newer versions of the algorithms (e.g., an extension of STAPLE using image intensities was presented in [20]), none of the generalized methods supports exchange of information between the atlases during the fusion.In the standard case where all atlases have all possible labels, this is not necessary, since there is nothing they can learn from one another.However, in the multiprotocol scenario, the exchange of information between atlases with missing labels can improve the segmentation.
For instance, let us assume that we have three registered atlases {i 1 , c 1 }, {i 2 ,c 2 }, {i 3 , c 3 } and that, at a given voxel j, their deformed coarse labels are c 1j = c′, c 2j = c″ and c 3j = c‴, respectively.Let us further assume that c′ and c″ correspond to fine labels l′ and l″ without any ambiguity, i.e., f 1 (l) = c′ ⇔ l = l′ and f 2 (l) = c″ ⇔ l = l″, but that c‴ is the result of collapsing l′ and l″, i.e., f 3 (l′) = f 3 (l″) = c‴.In that case, if i 3j ≈ i 1j , we would expect the hidden fine label l 3j to be l′, whereas if i 3j ≈ i 2j , we would expect it to be l″ instead.
To address the described issues, we present in Section 2.6 below a label fusion method specifically designed for multi-protocol scenarios.

Proposed fusion method
Figure 5 shows the graphical model of the proposed framework, which we coin "multiprotocol label fusion" (MPLF).Table 3 displays the corresponding equations.Essentially, we are assuming that both the registered atlases and the test scan are generated by a (latent) statistical atlas of labels (α ℒ ) and intensities (α ℐ ) in the space of the test scan.The assumption that the atlases were generated by the same statistical atlas is what allows the model to integrate the information across the atlases.Specifically, the vector represents the a priori probability of observing the different labels (at the fine detail level) at voxel j.The vector stores, for each fine label, the mean of a Gaussian that models the distribution of intensities at voxel j conditioned on that label.All the Gaussians share a predefined variance σ 2 .The voxels are assumed to be independent of one another.As in the previous sections, the fine labels of the atlases are hidden, and we only have access to corresponding coarse segmentations {c n }.
The model is completed with priors for (α ℒ ) and (α ℐ ), which we assume factorize over voxels as well.We use conjugate priors: for , we assume a Dirichlet distribution with concentration parameter 1 + ε, i.e., we assume that we have ε prior observations for each class at each voxel.For we assume a Gaussian distribution with mean μ 0 and variance σ 2 /ε, which is equivalent to having ε priors observations with sample mean μ 0 .In practice, ε is small and the only objective of these priors is to ensure the numerical stability of the algorithm.
Exact inference within this model would require marginalizing over the model parameters, i.e., the statistical atlas ((α ℒ ), (α ℐ )), which leads to an intractable integral.We use the assumption that the posterior distribution of the model parameters is heavily peaked to approximate: (16) where the point estimates α Î and α L are given by: (17) To compute the point estimates, it is convenient to notice that the test image can be considered an extra atlas, such that: In other words, the test image is an additional atlas (index N + 1) with a constant coarse label 1, which is compatible with all the labels at the fine level.Then, we can use Bayes's rule to rewrite the problem in Equation 17as: (18) where To solve the optimization problem of Equation 18, we use an EM algorithm: we iteratively build a lower bound to the objective function of Equation 18that touches it at the current estimate of ((α ℒ ), (α ℐ )) (E step), and then optimize this bound with respect to ((α ℒ ), (α ℐ )) (M step).
In the E step, we make a soft assignment of each label l at the fine level of detail to each voxel j in each atlas n: (19) Note that, if atlas n is labeled at the fine level already, then (f n ) −1 (c nj ) is unique and .These soft assignments are used to form the lower bound: (20) where α L and α Ĩ are the current estimates of the parameters.
The M step updates can be derived as: Once the algorithm has converged, we can substitute the point estimates α L and α Î back into Equation 16and use Bayes's rule to compute the segmentation.It is straightforward to show that the approximate posterior label probabilities of the voxels -which are independent of each other -are given by: (23) Therefore, the optimal discrete segmentation is: (24) and the expectation of the volume of label l is (in voxels):

MRI data
We used four different datasets of manually labeled T1 MRI scans in this study (see sample slices in Figure 6): • FreeSurfer dataset: 39 T1-weighted, 1 mm isotropic scans with 36 cortical and subcortical labels (see delineation protocol in [30]).The cerebral cortex and white matter are considered single entities.We note that these are the subjects that were used to train the probabilistic atlas in FreeSurfer [31].
• Brainstem dataset: ten T1-weighted, 1 mm isotropic scans with manual labels for the medulla oblongata, pons and midbrain, i.e., the substructures of the brainstem.The delineation protocol is described in [32].
• Winterburn dataset: five 0.6 mm isotropic scans 1 with annotations of the hippocampal subfields -subiculum, CA1, CA23, CA4 and molecular layer.The acquisition and manual delineation of the data are described in [33].The dataset includes T1-weighted and T2-weighted scans; only the T1-weighted volumes were used here.
• Hammers dataset 2 : 20 T1-weighted, 1 mm isotropic scans with 67 labels of cortical and subcortical structures.The labels of the cortical structures do not separate white from gray matter, and there is a single cerebellar label, which groups its gray and white matter.Further details on the dataset and on the manual labeling protocol can be found in [34] and [35].
The scans from all four datasets have fields of view covering the whole brain.Additional details on the acquisition can be obtained from the corresponding publications.

Definition of fine labels and protocol functions
In the fusion, we defined a set of 148 labels given by all the possible intersections of regions defined in the four datasets; note that this process generates more regions than the total number of unique labels in the four datasets (102).Specifically, we defined two labels for each lobe in the Hammers dataset, one for the corresponding white matter and one for the corresponding cortex, based on their intersections with the gray and white matter in the FreeSurfer dataset.We also defined a number of labels to cope with differences in labeling protocols of the same structure; even though the protocols are in general very similar to each other, there are two exceptions.First, the Hammers hippocampus does not include the tail, while the FreeSurfer and Winterburn protocols do.To cope with this, we split up each of the five hippocampal subfields into an anterior (head/body) and a posterior (tail) region; this yields a total of 20 hippocampal labels -10 per side.Second, the midbrain of the Brainstem dataset coincides with the superior part of the brainstem in the Hammers dataset, but extends further in the superior direction than the brainstem in FreeSurfer.To model this difference, we split the midbrain into an inferior and a superior part; the latter is further split into left and right.The mid-brain label in the Brainstem dataset and the brainstem in the Hammers dataset include all three regions, while the FreeSurfer dataset considers the inferior region part of the brainstem and the superior regions part of the left and right diencephala, respectively.
Given these region definitions, we specified four unique protocol functions f (one per dataset, as illustrated in Figure 7): • FreeSurfer dataset: the protocol collapses all cortical structures into two generic cortex labels (left and right), all white matter structures (including the left and right corpus callosum, only defined in the Hammers dataset) into two white matter regions, all the hippocampal subfields into whole hippocampi, and the brainstem labels into a whole brainstem region -except for the superior midbrain regions, which are collapsed together with the diencephala.
• Brainstem dataset: the protocol collapses all the non-brainstem regions into a single background label.
• Winterburn dataset: the protocol collapses all the non-hippocampal regions into a generic background label.
• Hammers dataset: the protocol collapses the white and gray matter labels of each lobe into a single label, all the brainstem and hippocampal regions into two generic labels, and the white matter and cortex of the cerebellum into a single structure.

Experimental setup
All 74 brain scans were resampled to 1 mm isotropic resolution skull stripped bias field corrected and intensity normalized with FreeSurfer [31].The intensity normalization is necessary because it enables us to directly compare image intensities (even across datasets), which is a critical assumption in semi-locally weighted fusion and MPLF.The scans were then pairwise registered with the software package ANTS [8]).We used the default parameters for the deformation model (SyN[0.25])number of iterations (30 × 90 × 20) and cost function (neighborhood cross correlation).However we increased the radius of the window of the cost function from 4 to 6 and the regularization parameter from 3 to 4; these values better coped with the differences in intensity profile between datasets that cannot be completely eliminated by the intensity normalization step.
The four proposed label fusion schemes were used to segment the scans in a leave-one-out manner where the test subject was left out from the atlas set {i n , l n }.The parameter setting and initialization was as follows.In semi-locally weighted fusion we initialized σ 2 = 100 and set β = 0.75 -as in [17].In STAPLE we initialized each column of each Θ n such that 0.95 of the probability mass is equally distributed among the coarse labels that are compatible with the fine label corresponding to that column; the remaining 0.05 is equally spread across the non-compatible coarse labels.In MPLF we set ε = 10 −6 , μ 0 = 65 (which is the typical intensity of gray matter after normalization) and σ 2 = 100 (again inspired by [17]).
We evaluated the label fusion methods in two different ways: directly with Dice scores and indirectly through an aging experiment.First we computed Dice scores between the manual delineations and the protocol-transformed automated segmentations produced by the different label fusion methods, i.e., we compared c tj with f t (ŝ j ) (where we use the subscript t to refer to the test scan) for all four datasets.In addition we also evaluated the performance of MPLF against the number of training scans.In this experiment we used the FreeSurfer dataset which has the most scans and therefore provides the largest range for the analysis.We computed segmentations of the FreeSurfer scans using N fs other randomly selected FreeSurfer scans ("fs" stands for FreeSurfer) and a pool of (randomly selected) scans from the other datasets such that the proportion of atlases from each of the datasets in the pool was approximately constant (see Table 4).This ensures that the performance depends on the number of atlases, rather than the proportion of scans from the FreeSurfer dataset in the pool of training images.The experiment was repeated 10 times for each N fs with different random selections of scans except for the cases N fs = 1, where we used all 38 left-out scans in the dataset, and N fs = 38, where there is only one possible combination scans (since we are using all the available atlases).
In a second set of experiments, we evaluated MPLF indirectly by analyzing the effect of age on the median thickness of each cortical region, as well as on the volume of the different subcortical structures.This analysis was carried at the fine level of label detail, so we measured volumes and thicknesses of structures that were not defined in any of the training datasets.Both the thicknesses and the volumes were computed from the soft segmentations: the volumes were computed with Equation 25, and the thicknesses were estimated from the label posteriors (Equation 23) using the algorithm described in [36].This technique estimates the thickness at a point of interest by minimizing the line integral over the probability map of the gray matter on line segments centered at that point.For each cortical region, we took the median (a robust estimate) of the thicknesses given by this method for the voxels belonging to that region, as estimated by the hard segmentation from Equation 24.Note that both the estimation of the volume and of the cortical thickness from soft segmentations require faithful posteriors; probability maps like the ones showed in Figure 4c or 4c would yield unrealistic estimates.
The aging experiment was performed on the FreeSurfer dataset, which has the most subjects and the widest age range (53.3±23.3years).For each subcortical brain structure, we first fitted a generalized linear model (GLM) predicting its volume as a linear combination of the age of the subject, his intracranial volume (as estimated by FreeSurfer) and a bias.Then, for each cortical structure, we fitted a GLM predicting its median thickness as a linear combination of the age of the subject and a bias.Finally, a statistical t-test was used to assess whether the coefficient related to age in each GLM was significantly different from zero.In order to increase the power of the analysis in the subcortical structures, we left-right averaged their volumes -for the median cortical thickness this is not as advantageous, since it is a robust estimate already.5 shows the mean Dice scores across the structures defined within each dataset.Generalized STAPLE produces very variable results, yielding excellent segmentation for some structures but poor outputs for others (e.g., brainstem and Winterburn datasets).On average, it outperforms generalized majority voting by 2% Dice.Generalized semi-locally weighted voting takes advantage of the image intensities of the deformed atlases to yield an average Dice 1% higher than that of generalized STAPLE.MPFL clearly outperforms all the other methods by communicating information between the atlases: its average Dice is 3% higher than that of the second best method (generalized semi-locally weighted voting).Figure 8 shows the mean Dice score produced by MPFL in the FreeSurfer dataset as a function of the number of training scans.

Direct validation: Dice scores-Table
The plot shows that MPFL only requires 3 atlases to yield Dice scores similar to those of generalized semi-locally weighted voting with 38 atlases.The performance of MPFL saturates at approximately 30 scans.
Figure 9 shows box plots for the Dice scores between the manual and automated segmentations for each structure in the four datasets.In the FreeSurfer dataset, generalized majority voting performs satisfactorily for most structures except for the cortex, which is difficult to register.Semi-local weighting provides a boost in the performance for some of the structures, particularly the caudate and the cortex, thanks to the use of image intensity information in the fusion.Generalized STAPLE produces very variable results: for some structures it outperforms semi-locally weighted fusion (diencephalon, amygdala, pallidum, putamen), but for others it produces rather poor results (most notably the cortex).MPLF successfully combines all the training data to produce the highest Dice scores for every structure other than the cerebellum.
In absolute terms, the results from MPLF in subcortical structures are slightly worse than those reported in the literature by state-of-the-art fusion algorithms working in singleprotocol settings (e.g., [17]).This is not caused by shortcomings of our algorithm, but rather by the fact that it produces a segmentation that "averages" subtle differences in labeling protocols that are not explicitly modeled in our fusion framework.Therefore, the automatic segmentation does not necessarily agree perfectly with any of the individual protocols, and is penalized when computing Dice scores against the ground truth segmentations of the individual datasets.For example, this is the reason -in addition to differences in registrations -for the discrepancy between our results for generalized majority voting in the Hammers dataset and the results reported in [10] using the same brain scans.
In cortical structures, MPLF shares the limitation of other label fusion algorithms (and registration based segmentation methods in general), that volumetric registration of the cerebral cortex is extremely difficult.The cortical segmentation could possibly be improved by replacing ANTs with a registration algorithm specifically designed for the cortex, e.g., [37].
The same trends that were observed in the FreeSurfer dataset apply to the Brainstem, Winterburn and Hammers datasets.Semi-locally weighted voting provides a small improvement over majority voting, STAPLE yields very variable results (rather poor, in some cases) and MPLF outperforms the other three for all the structures of interest, other than CA1 in the hippocampus and the cerebellum (and some lobes/gyri) in the Hammers dataset.In absolute terms, the Dice scores are high in the Brainstem dataset -though difficult to place in a global context due to the lack of midbrain, pons and medulla segmentation algorithms in the literature.On the other hand, they are low for the hippocampal subfields, due to their thin shapes (all but CA4) and the insufficient resolution to segment them accurately.In the Hammers dataset, Dice scores of the subcortical structures are one notch below the results from the FreeSurfer dataset; this is possibly due to the fact that the averaging in labeling protocols is skewed towards the FreeSurfer data due to the larger presence in the training datset -twice as many.
Figure 10 shows segmentations and 3D renderings of a sample test scan from the FreeSurfer dataset.The automated segmentations in Figure 10c-f are much richer than the manual labels in Figure 10b.For instance, the parcellated pial and white matter surfaces (Figure 10g-h) could not have been generated using any of the training datasets independently).Since they do not consider image intensities in the fusion, the generalizations of majority voting and STAPLE do not correctly segment the cortex, which is difficult to register.They also oversegment structures such as the pallidum.Generalized semi-locally weighted fusion ameliorates these problems through the use of intensities -particularly the cortical segmentation.However, it is outperformed by MPLF, which produces more accurate segmentations for the thalamus, pallidum, putamen and choroid plexus -while providing meaningful estimates of the label posteriors.

Indirect validation: aging study-
The association between morphometric measurements and age (Table 6) shows strong consistency with prior work using other procedures.For example, we found strong associations between age and cortical thickness in frontal lobe regions including the superior frontal and precentral gyrus, weaker association with medial temporal regions, and strong reductions in volume of the thalamus with more moderate reductions in volume of the hippocampus [3,38].Even though the resolution of the scans was not sufficient to clearly distinguish among hippocampal subfields, we observed a larger effect of aging on CA1 and CA4-DG, consistent with prior work [39].The results on the brainstrem also showed strong agreement with prior studies: the volume of the midbrain steadily decreases with age, the pons is spared, and the medulla suffers from minimal atrophy [40].

Conclusion and discussion
In this paper, we have generalized three popular label fusion methods to scenarios where the atlases have been manually traced with different protocols.We have discussed the limitations of the generalized methods and proposed MPLF, an alternative algorithm that was shown to outperform them: the average Dice score on four datasets improved between 3% and 6% with respect to the generalizations.The core of both the generalized methods and MPLF is the definition of protocol functions that group sets of fine labels (hidden) into coarser labels (observed).Adding the coarse labels to the corresponding graphical models, the generalization of the three existing methods to the multi-protocol scenario is straightforward.
The extensions of majority voting, semi-locally weighted fusion and STAPLE are easy to implement and require very few changes with respect to the original algorithms.On the other hand, MPLF is computationally more expensive, since it requires optimizing a function with the EM algorithm at each voxel.In any case, the computational cost of the fusion (approximately one hour in our experiments) is small compared with the cost of nonlinearly registering the atlases (approximately one and a half hours per atlas, tens of hours in total), and the algorithm can be easily parallelized if necessary -since the voxels of the input scan are processed independently.
Future work will follow three directions.First, we will generalize newer, more sophisticated label fusion algorithms and compare them with MPLF.In particular, it will be interesting to consider extensions of STAPLE that support soft labels and spatially varying performance parameters.Second, we will consider the possibility of placing smoothness priors on the intensities and label probabilities of the statistical atlas in MPLF, as well as on the segmentation of the test scan.Even though the automated segmentations were accurate and smooth in our experiments, smoothness constraints might be important when the number of atlases is not as high as in this study.And third, we will generalize MPLF to cross-modality scenarios, which will introduce the capability to handle microscopic images (e.g., BigBrain3 [41]) or optical coherence tomography [42], in order to model with very fine detail brain areas that are not visible with MRI.
As the amount of publicly available, heterogeneously labeled data continues to grow, we believe that segmentation methods that can cope with different protocols -such as the one we have described-will become increasingly important.

Highlights
Label fusion methods only use a small subset of available manually labeled atlases.This is due to differences in the definition/identity of the labels in each protocol.
We extend majority voting, STAPLE, and weighted fusion to the multi-protocol setting.
We demonstrate the issues with these generalized methods.
We propose a novel label fusion method that outperforms benchmarks.Graphical model for generalized semi-locally weighted voting.Shaded variables are observed.Graphical models for MPLF, the proposed fusion method.Shaded variables are observed.4).The Dice scores are averaged over all structures and ten random selections of scans (except for N fs = 1, where we used all 38 left-out scans, and N fs = 38, where there is only one possible combination of atlases).Box plots for the Dice scores corresponding to the four datasets.The central mark is the median, the box spans from the first to the third quartile, and the whiskers span the extreme data points not considered outliers (which are marked with red crosses).Equations for the probabilistic model of generalized majority voting Equations for the generative model of generalized semi-locally weighted voting.ℋ j represents the neighborhood of voxel j.Table 3 Equations for the generative model of Figure 5. Dir[•] represents the Dirichlet distribution, and 1 is the all one vector.

Figure 1 .
Figure 1.Graphical model for generalized majority voting.Shaded variables are observed.

Figure 4 .
Figure 4. Example to illustrate the shortcomings of generalized majority voting.(a) Labels for a registered atlas with labels for the whole cortex and for a number of subcortical structures.(b) Labels for a registered atlas with labels for the hippocampal subfields.(c) Fusion of (a) and (b) with generalized majority voting.(d) Posterior probability map for the cortex; note that the values are close to 0.5 (rather than 1) in the cortex, and that a large amount of probability mass is distributed all across the image in non-cortical areas (other than the hippocampus).(e) Posterior probability map for the amygdala; similar observations as for (d) can be made.

Figure 6 .
Figure 6.Sample sagittal slices of the four datasets used in this study, with manual annotations overlaid: (a) FreeSurfer, (b) Brainstem, (c) Winterburn, (d) Hammers dataset.We only show a region of interest around the labels in (b,c) -the scans cover the whole brain.

Figure 7 .
Figure 7. Protocol functions: (a) sagittal slice of a segmentation at the fine label level; (b) effect of applying protocol function corresponding to FreeSurfer datast; (c) Brainstem dataset; (d) Winterburn dataset; (e) Hammers dataset.It is the goal of this study to produce segmentations such as (a) by merging segmentations like (b-e).

Figure 8 .
Figure 8.Mean Dice score (in %) for the FreeSurfer dataset as a function of the number of training scans of the FreeSurfer dataset N fs .The pool of training scans also includes atlases from the other three datasets, such that the ratio of scans from the different datasets is approximately constant (see Table4).The Dice scores are averaged over all structures and ten random selections of scans (except for N fs = 1, where we used all 38 left-out scans, and N fs = 38, where there is only one possible combination of atlases).

Figure 10 .
Figure 10.(a) Sagittal slice of a scan from the FreeSurfer dataset.(b) Corresponding manual segmentation.(c) Automated segmentation with generalized majority voting.(d) Generalized semi-locally weighted segmentation.(e) Generalized STAPLE.(f) MPLF.(g) 3D rendering of the pial surface using MPLF.(h) Rendering of the white matter surface.

Table 4
Number of atlases from each dataset in the training pool for the experiment testing the performance of MPLF against the number of training scans.

Table 5
Mean Dice score (in %) across all structures within each dataset, as well as for all structures from all datasets combined."Maj.vot." represents generalized majority voting, "SL weight."represents generalized semilocally weighted fusion, and "G-STAPLE" represents generalized STAPLE.

Table 6
Effect of age: p-values (given as -log 10 p) for t-test on significance of slope in the generalized linear model.(a) Volume of subcortical structures (left-right averaged).(b) Cortical thickness.Significant p values (below 0.05 after Bonferroni correction) are in bold.The threshold for significance at α = 0.05 is, after Bonferroni correction, − log 10 p ≈ 2.65.