A Generative Probabilistic Model and Discriminative Extensions for Brain Lesion Segmentation— With Application to Tumor and Stroke

We introduce a generative probabilistic model for segmentation of brain lesions in multi-dimensional images that generalizes the EM segmenter, a common approach for modelling brain images using Gaussian mixtures and a probabilistic tissue atlas that employs expectation-maximization (EM), to estimate the label map for a new image. Our model augments the probabilistic atlas of the healthy tissues with a latent atlas of the lesion. We derive an estimation algorithm with closed-form EM update equations. The method extracts a latent atlas prior distribution and the lesion posterior distributions jointly from the image data. It delineates lesion areas individually in each channel, allowing for differences in lesion appearance across modalities, an important feature of many brain tumor imaging sequences. We also propose discriminative model extensions to map the output of the generative model to arbitrary labels with semantic and biological meaning, such as “tumor core” or “fluid-filled structure”, but without a one-to-one correspondence to the hypo- or hyper-intense lesion areas identified by the generative model. We test the approach in two image sets: the publicly available BRATS set of glioma patient scans, and multimodal brain images of patients with acute and subacute ischemic stroke. We find the generative model that has been designed for tumor lesions to generalize well to stroke images, and the extended discriminative -discriminative model to be one of the top ranking methods in the BRATS evaluation.


I. INTRODUCTION
G LIOMAS are the most frequent primary brain tumors.
They originate from glial cells and grow by infiltrating the surrounding tissue. The more aggressive form of this disease is referred to as "high-grade" glioma. The tumor grows fast and patients often have survival times of two years or less, calling for immediate treatment after diagnosis. The slower growing "low-grade" disease comes with a life expectancy of five years or more, allowing the aggressive treatment to be delayed. Extensive neuroimaging protocols are used before and after treatment, mapping different tissue contrasts to evaluate the progression of the disease or the success of a chosen treatment strategy. As evaluations are often repeated every few months, large longitudinal datasets with multiple modalities are generated for these patients even in routine clinical practice. In spite of the need for accurate information to guide decision making process for an treatment, these image series are primarily evaluated using qualitative criteria-indicating, for example, the presence of characteristic hyper-intense intensity changes in contrast-enhanced T1 MRI-or relying on quantitative measures that are as basic as calculating the largest tumor diameter that can be recorded in a set of axial images.
While an automated and reproducible quantification of tumor structures in multimodal 3D and 4D volumes is highly desirable, it remains difficult. Glioma is an infiltratively growing tumor with diffuse boundaries and lesion areas are only defined through intensity changes relative to surrounding normal tissues. As a consequence, the outlines of tumor structures cannot This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/ be easily delineated-even manual segmentations by expert raters show a significant variability [1]-and common MR intensity normalization strategies fail in the presence of extended lesions. Tumor structures show a significant amount of variation in size, shape, and localization, precluding the use of related mathematical priors. Moreover, the mass effect induced by the growing lesion may lead to displacements of the normal brain tissues, as well as a resection cavity that is present after treatment, limits the reliability of prior knowledge available for the healthy parts of the brain. Finally, a large variety of imaging modalities can be used for mapping tumor-related tissue changes, providing different types of biological information, such as differences in tissue water (T2-MRI, FLAIR-MRI), enhancement of contrast agents (post-Gadolinium T1-MRI), water diffusion (DTI), blood perfusion (ASL-, DSC-, DCE-MRI), or relative concentrations of selected metabolites (MRSI). A segmentation algorithm must adjust to any of these, without having to collect large training sets, a common limitation for many data-driven learning methods.

A. Related Prior Work
Brain tumor segmentation has been the focus of recent research, most of which is dealing with glioma [2], [3]. Few methods have been developed for less frequent and less aggressive tumors [4]- [7]. Tumor segmentation methods often borrow ideas from other brain tissue and other brain lesion segmentation methods that have achieved a considerable accuracy [8]. Brain lesions resulting from traumatic brain injuries [9], [10] and stroke [11], [12] are similar to glioma lesions in terms of size and multimodal intensity patterns, but have attracted little attention so far. Most automated algorithms for brain lesion segmentation rely on either generative or discriminative probabilistic models at the core of their processing pipeline. Many encode prior knowledge about spatial regularity and tumor structures, and some offer longitudinal extensions for 4D image volumes to exploit longitudinal image sets that are becoming increasingly available [13], [14].
Generative probabilistic models of spatial tissue distribution and appearance have enjoyed popularity for tissue classification as they exhibit good generalization performance [15]- [17]. Encoding spatial prior knowledge for a lesion, however, is challenging. Tumors may be modeled as outliers relative to the expected shape [18], [19] or to the image signal of healthy tissues [16], [20]. In [16], for example, a criterion for detecting outliers is used to generate a tumor prior in a subsequent EM segmentation that treats the tumor as an additional tissue class. Alternatively, the spatial prior for the tumor can be derived from the appearance of tumor-specific markers, such as Gadolinium enhancements [21], [22], or from using tumor growth models to infer the most likely localization of tumor structures for a given set of patient images [23]. All these segmentation methods rely on registration to align images and the spatial prior. As a result, joint registration and tumor segmentation [17], [24] and joint registration and estimations of tumor displacement [25] have been investigated, as well as the direct evaluation of the deformation field for the purpose of identifying the tumor region [7], [26].
Discriminative probabilistic models directly learn the differences between the appearance of the lesion and other tissues from the data. Although they require substantial amounts of training data to be robust to artifacts and variations in intensity and shape, they have been applied successfully to tumor segmentation tasks [27]- [31]. Discriminative approaches proposed for tumor segmentation typically employ dense, voxel-wise features from anatomical maps [32] or image intensities, such as local intensity differences [33], [34] or intensity profiles, that are used as input to inference algorithms such as support vector machines [35], decision trees ensembles [32], [36], [37], or deep learning approaches [38], [39]. All methods require the imaging protocol to be exactly the same in the training set and in the novel images to be segmented. Since local intensity variation that is characteristic of MRI is not estimated during the segmentation process, as in most generative mixture models, calibration of the image intensities becomes necessary which is already a difficult task in the absence of lesions [40]- [42].
Advantageous properties of generative and discriminative probabilistic models have been combined for a number of applications in medical imaging: generative approaches can be used for model-driven dimensional reduction to form a low-dimensional basis for a subsequent discriminative method, for example, in whole brain classification of Alzheimer's patients [43]. Vice versa, a discriminative model may serve as a filter to constrain the search space for employing complex generative models in a subsequent step, for example, when fitting biophysical metabolic models to MRSI signals [44], or when fusing evidence across different anatomical regions in the analysis of contrast-enhancing structures [45]. Other approaches improve the output of a discriminative classification of brain scans by adding prior knowledge on the location of subcortical structures [46] or the skull shape [47] through generative models. The latter approach for skull stripping showed superior robustness in particular on images of glioma patients [48]. To the best of our knowledge no generative-discriminative model has been used for tumor analysis so far, although the advantages of employing a secondary discriminative classifier on the probabilistic output of a first level discriminative classifier, which considers prior knowledge on expected anatomical structures of the brain, has been demonstrated in [32].
Spatial regularity and spatial arrangement of the 3D tumor sub-structure is used in most generative and discriminative segmentation techniques, often in a postprocessing step and with extensions along the temporal dimension for longitudinal tasks: Local regularity of tissue labels can be encoded via boundary modeling within generative [16], [49] and discriminative methods [27], [28], [50], [49], or by using Markov random field priors [30], [31], [51]. Conditional random fields help to impose structures on the adjacency of specific labels and, hence, impose constraints on the wider spatial context of a pixel [29], [35]. 4D extensions enforce spatial contiguity along the time dimension either in an undirected fashion [52], or in a directed one when imposing monotonic growth constraints on the segmented tumor lesion acting as a non-parametric growth model [13], [53], [14]. While all these segmentation models act locally, more or less at the pixel level, other approaches consider prior knowledge about the global location of tumor determines the label of the normal healthy tissue. The latent atlas determines the channelspecific presence of tumor . Normal tissue label , tumor labels , and intensity distribution parameters jointly determine the multimodal image observations . Observed (known) quantities are shaded. The segmentation algorithms aims to estimate , along with the segmentation of healthy tissue .
structures. They learn, for example, the relative spatial arrangement of tumor structures such as tumor core, edema, or enhancing active components, through hierarchical models of super-voxel clusters [54], [34], or by relating image patterns with phenomenological tumor growth models that are adapted to the patient [25].

B. Contributions
In this paper we address three different aspects of multimodal brain lesion segmentation, extending preliminary work we presented earlier in [55]- [57]: • We propose a new generative probabilistic model for channel-specific tumor segmentation in multi-dimensional images. The model shares information about the spatial location of the lesion among channels while making full use of the highly specific multimodal, i.e., multivariate, signal of the healthy tissue classes for segmenting normal tissues in the brain. In addition to the tissue type, the model includes a latent variable for each voxel encoding the probability of observing a tumor at that voxel, similar to [49], [50]. The probabilistic model formalizes qualitative biological knowledge about hyper-and hypo-intensities of lesion structures in different channels. Our approach extends the general EM segmentation algorithm [58], [59] using probabilistic tissue atlases [60], [15], [61] for situations when specific spatial structures cannot be described sufficiently through population priors. • We illustrate the excellent generalization performance of the generative segmentation algorithm by applying it to MR images of patient with ischemic stroke, which-to the best of our knowledge-is one of the first automated segmentation algorithms for this major neurological disease. • We extend the generative model to a joint generative-discriminative method that compensates for some of the shortcomings of both the generative and the discriminative modeling approach. This strategy enables us to predict clinically meaningful tumor tissue labels and not just the channel-specific hyper-and hypo-intensities returned by the generative model. The discriminative classifier uses the output of the generative model, which improves its robustness against intensity variation and changes in imaging sequences. This generative-discriminative model defines the state-of-the-art on the public BRATS benchmark data set [1]. In the following we introduce the probabilistic model (Section II), derive the segmentation algorithm and additional biological constraints, and we describe the discriminative model extensions (Section III). We evaluate the properties and performance of the generative and the generative-discriminative methods on a public glioma dataset (Sections IV and Section V, respectively), including an experiment on the transfer of the generative model to images from stroke patients. We conclude with a discussion of the results and of future research directions (Section VI).

II. A GENERATIVE BRAIN LESION SEGMENTATION MODEL
Generative models consider prior information about the structure of the observed data and exploit such information to estimate latent structure from new data. The EM segmenter, for example, models the image of a healthy brain through three tissue classes [60], [15], [61]. It encodes their approximate spatial distribution through a population atlas generated by aligning a larger set of reference scans, segmenting them manually, and averaging the frequency of each tissue class in a given voxel within the chosen reference frame. Moreover, it assumes that all voxels of a tissue class have about the same image intensity which is modeled through a Gaussian distribution. This segmentation method, whose parameters can be estimated very efficiently through the expectation maximization (EM) procedure, treats image intensities as nuisance parameters which makes it robust in the presence of the characteristic variability of the intensity distributions of MR images. Moreover, since the method formalizes the image content explicitly through the probabilistic model, it can be combined with other parametric transformations, for example, for registration [62] or bias field correction [15], and account for the related changes in the observed data. Generative models with tissue atlases used as spatial priors are at the heart of most advanced image segmentation models in neuroimaging [63], [64].
Population atlases cannot be generated for tumors as their location and extensions vary significantly across patients. Still, the tumor location is similar in different MR images of the same patient and a patient-specific atlas of the lesion class could be generated. Segmentation and atlas building can be performed simultaneously, in a joint estimation procedure [50]. Here, the key idea is to model the lesions through a separate latent atlas class. Combined with the standard population atlas of the normal tissues and the standard EM segmentation framework, this extends the EM segmenter to multimodal or longitudinal images of patients with a brain lesion. The generative model is illustrated in Fig. 1.
A. The Probabilistic Generative Model 1) Normal Healthy Tissue Classes: We model the normal healthy tissue label of voxel in the healthy part of the brain in the bottom row it shows the probabilistic tissue atlases used in the analysis, and the patient-specific tumor prior inferred from the segmentations in the different channels. The right panel shows three voxels with different labels in T1-, T1c-and FLAIR-MRI for a high-grade glioma patient. In voxel '1' all three channels show the characteristic image intensity of gray matter (G). In voxel '2' white matter (W) is visible in the first two channels, but the third channel contains a tumor-induced change (T), here due to edema or infiltration. In voxel '3' all channels exhibit gray values characteristic of tumor: a hypo-intense signal in T1, a hyper-intense gadolinium uptake in T1c-indicating the most active regions of tumor growth-and a hyper-intense signal in the FLAIR image. The initial tissue class remains unknown. Both and are to be estimated. Inference is done by introducing a transition process-with 'latent' prior ( Fig. 1)which is assumed to have induced the channel-specific tissue changes implied by in the tumor label vector .
using a spatially varying probabilistic atlas, or prior that is constructed from prior examples. At each voxel , this atlas indicates the probability of the tissue label to be equal to tissue class ( Fig. 1, blue). The probability of observing tissue label at voxel is modeled through a categorical distribution (1) where for all and for all , . The tissue label is shared among all channels at voxel . In our experiments we assume , representing gray matter (G), white matter (W) and cerebrospinal fluid (CSF), as illustrated in Fig. 2.
2) Tumor Class: We model the tumor label using a spatially varying latent probabilistic atlas [49], [50] that is specific to the given patient ( Fig. 1, red). At each voxel , this atlas contains a scalar parameter that defines the probability of observing a tumor at that voxel, forming the 3D parameter volume . Parameter is unknown and is estimated as part of the segmentation process. We define a latent tumor label that indicates the presence or absence of tumor-induced changes in channel at voxel , and model it as a Bernoulli random variable with parameter . We form a binary tumor label vector (where indicates the transpose of the vector) of the tumor labels in all channels, that describes tumor presence in voxel with probability (2) Here, we assume tumor occurrence to be independent from the type of the underlying healthy tissue. We will introduce conditional dependencies between the underlying tissue class and the likelihood of observing tumor-induced intensity modifications in Section II-C.

3) Observation Model:
The image observations are generated by Gaussian intensity distributions for each of the tissue classes and the channels, with mean and variance , respectively ( Fig. 1, purple). In the tumor (i.e., if ), the normal observations are replaced by intensities from another set of channel-specific Gaussian distributions with mean and variance representing the tumor class. Letting denote the set of mean and variance parameters for normal tissue and tumor classes, and denote the vector of the intensity observations at voxel , we form the data likelihood: where is the Gaussian distribution with mean and variance .
4) Joint Model: Finally, the joint probability of the atlas, the latent tumor class, and the observed variables is the product of the components defined in (1)-(3): We let denote the set of the image volumes, denote the corresponding volumes of binary tumor labels, denote the tissue labels, and denote the parameter volume. We obtain the joint probability over all voxels by forming , assuming that all voxels represent independent observations of the model.

B. Maximum Likelihood Parameter Estimation
We derive an expectation-maximization scheme that jointly estimates the model parameters and the posterior distri-bution of tissue labels and tumor labels . We start by seeking maximum likelihood estimates of the model parameters : and (7) (8) where label vector indicates tumor in all channels with , and normal tissue for all other channels. As an example with three channels, illustrated in Fig. 2 (voxel 2), suppose and indicating tumor in channel 3 and image intensities relating to white matter in channels 1 and 2. This results in the tissue label vector .

1) E-Step:
In the E-Step of the algorithm, making use of given estimates of the model parameters , we compute the posterior probability of all tissue label vectors . Expanding (4), we use and that are corresponding to to simplify notation: (9) and for all . Using the tissue label vectors, we can calculate the probability that tumor is visible in channel of voxel by summing over all the configurations for which (or equivalently ): (10) where is the Kronecker delta that is equal to 1 for and 0 otherwise. In the same way we can estimate the probability for the healthy tissue classes : (11) where indicates that one or more of the channels of label vector contain .
2) M-Step: In the M-Step of the algorithm, we update the parameter estimates using closed-form update expressions that guarantee increasingly better estimates of the model parameters [65]. The updates are intuitive: the latent tumor prior is an average of the corresponding posterior probability estimates (12) and the intensity parameters and are set to the weighted statistics of the data for the healthy tissues (13) Similarly, for the parameters of the tumor class , we obtain We alternate between updating the parameters and the computation of the posteriors until convergence, which is typically reached after 10-15 iterations.

C. Enforcing Additional Biological Constraints
Expectation-maximization is a local optimizer. To overcome problems with initialization, we enforce desired properties of the solution by replacing the exact computation with an approximate solution that satisfied additional constraints. These constraints represent our prior knowledge about tumor structure, shape or growth behaviour 1 .

1) Conditional Dependencies on Tumor Occurrence:
A possible limitation in the generalization of our probabilistic model is the dimensionality of tissue label vector that has possible combinations in (9) and, hence, the computational demands and memory requirements that grow exponentially with the number of channels in multimodal data sets. To this end, we may want to impose prior knowledge on and by only considering label vectors that are biologically plausible. First, instead of assuming independence between tissue class and tumor occurrence, we assume conditional dependencies, such as for all . We impose this dependency by removing, in this example, all tumor label vectors containing both CSF and tumor from the list of vectors that are included in (9). Second, we can impose constraints on the co-occurrence of tumor-specific changes in the different image modalities (rather than assuming independence here as well), and exclude additional tumor label vectors. We consider, for example, that the edema visible in T2 will always coincide with the edema visible in FLAIR, or that lesions visible in T1 and T1c are always contained within lesions that are visible in T2 and FLAIR.
Together, these constraints reduce the total number of label vectors to be computed in (9), for a standard glioma imaging sequences with T1c, T1, T2, and FLAIR, from to as few as ten vectors: three healthy vectors with (corresponding to , , and ); edema with tumor-induced chances visible in FLAIR in the forth channel (with and ); edema with tumor-induced changes visible in both FLAIR and in T2 (with and ); the non-enhancing tumor core with changes in T1, T2, FLAIR, but without hyper-intensities in T1c ( and ); the enhancing tumor core with hyper-intensities in T1c and additional changes in all other channels ( ).

2) Hyper-and Hypo-Intense Tumor Structures:
During the iterations of the EM algorithm we enforce that tumor voxels are hyper-or hypo-intense with respect to the current average image intensity of the white matter tissue (hypo-intense for T1, hyper-intense for T1c, T2, FLAIR). Similar to [51], we modify the probability that tumor is visible in channel of voxel by comparing the observed image intensity with the previously estimated prior to calculating updates for parameters (16). We set the probability to zero if the intensity does not align with our expectations: if , otherwise (17) For hypo-intensity constraints we modify the posterior probability updates in the same way, using as a criterion. 3) Spatial Regularity of the Tumor Prior: Little spatial context is used in the basic model, as we assume the tissue class in each voxel to be independent from the class labels of other voxels. Atlas encourages spatially continuous classification for the healthy tissue classes by imposing similar priors in neighboring voxels. To encourage spatial regularity of the tumor labels, we extend the latent atlas to include a Markov random field (MRF) prior: (18) Here, denotes the set of the six nearest neighbors of voxel , and is a parameter governing how similar the tumor labels tend to be at the neighboring voxels. When , there is no interaction between voxels and the model reduces to the one described in Section II. By applying a mean-field approximation [66], we derive an efficient approximate algorithm. We let (19) denote the currently estimated "soft" count of neighbors that contain tumor in channel . The mean-field approximation implies (20) where , replacing the previously defined (9), using a channel-specific as a modification of that features the desired spatial regularity.

III. DISCRIMINATIVE EXTENSIONS
High-level context at the organ or lesion level, as well as regional information, is not considered in the segmentation process of the generative model. Although we use different constraints to incorporate local neighbourhood information, the generative model treats each voxel as an independent observation and estimates class labels only from very local information. To evaluate global patterns, such as the presence of characteristic artifacts or tumor sub-structures of specific diagnostic interest, we present two alternative discriminative probabilistic methods that make use of both local and non-local image information. The first one, acting at the regional level, is improving the output of the generative model and maintaining its hyper-and hypo-intense lesion maps, while the second one, acting at the voxel level, is transforming the generative model output to any given set of biological tumor labels.

A. The Probabilistic Discriminative Model
We employ an algorithm that predicts the probability of label for a given observation which is described by feature vector derived from the segmentations of the generative model. We seek to address two slightly different problems. In the first task, class labels indicate whether a segmented region is a result of a characteristic artifact rather than of tumor-induced tissue changes, essentially indicating false positive regions in the segmentations of the generative algorithm that should be removed from the output. In the second task, class labels represent dense, voxel-wise labels with a semantic interpretation, for example structural attributes of the tumor that do not coincide with the hyper-and hyper-intense segmentations in the different channels, but labels such as "necrosis", or "non-enhancing core". We test both cases in the experimental evaluation, using on channel-wise tumor probabilities and on normalized intensities to derive input features for the discriminative algorithms.
To model relations between and for observation , we choose random forests, an ensemble of randomized decision trees [67]. We use the random forest classifier as it is capable of handling irrelevant predictors and, to some degree, label noise. During training each tree uses a different set of samples . It consists of randomly sampled observations that only contain features from a random subspace of dimensionality , where is the number of features. We learn an ensemble of different discriminative classifiers, indexed by , that can be applied to new observations during testing, with each tree predicting the membership . When averaging over all predictions that we obtain for the individual observation, we obtain an estimate of . We choose logistic regression trees as discriminative base classifiers for our ensemble, as the resulting oblique random forests perform multivariate splits at each node and are, hence, better capable of dealing with the correlated predictors derived from a multimodal image data set [68]. For both discriminative approaches we use an ensemble with decision trees.

B. A Discriminative Approach Acting at the Regional Level
As many characteristic artifacts have, at the pixel level, a multimodal image intensity patterns that is similar to the one of a lesion, we design a discriminative probabilistic method that is postprocessing and "filtering" the basic output of the generative model. In addition to the pixel-wise intensity pattern, it evaluates regional statistics of each connected tumor area, such as volume, location, shape, signal intensities. It replaces commonly used postprocessing routines for quality control that evaluate hand-crafted rules on lesion size or shape and location by a discriminative probabilistic model, similar to [44].

1) Features and Labels:
The discriminative classifier acts at the regional level to remove those lesion areas from the output of the generative model that are not associated with tumor, but that stem from arbitrary biological or imaging variation of the voxel intensities. To this end we identify all isolated regions in the binary tumor map of the FLAIR image (containing voxels with ). We choose FLAIR since it is the most inclusive image modality. As artifacts may be connected to lesion areas, we over-segment larger structures using a watershed algorithm, subdividing regions with connections that are less than 5 mm in diameter to reduce the number mixed regions containing both tumor pixels and artifact patterns. For each individual region we calculate a feature vector that includes volume, surface area, surface-to-volume ratio, as well as regional statistics that are minimum, maximum, mean and median of the normalized image intensities in the four channels. We scale the image intensities for each channels linearly to match the distribution of intensities in a reference data set. We also determine the absolute and the relative number of voxels with within region , i.e., the volume of the active tumor. We calculate the linear dimensions of the region in axial, sagittal, and transversal direction, the maximal ratio between these three values indicating eccentricity, and the relative location of the region with respect to the center of the brain mask, as well as minimum, maximum, mean and median distances of the regions's voxels from the skull, as a measure of centrality within the brain. We then determine the total number of FLAIR lesions for the given patient and assign this number as another feature to each lesion, together with its individual rank with respect to volume both in absolute numbers and as a normalized rank within . Overall, we construct features for each region (Fig. 6). To assign labels to each region, we inspect them visually and assign those overlapping well with a tumor area to the true positive "tumor" class , all other to the false positive "artifact" class . When labeling regions in the BRATS training data set (Section V), all regions labeled as true positives have at least 30% overlap with the "whole tumor" annotation of the experts.

C. A Discriminative Approach Acting at the Voxel Level
The generative model returns a set of probabilistic maps indicating the presence of hypo-or hyper-intense modifications of the tissue. In most applications and imaging protocols, however, it is necessary to localize arbitrary tumor structures-with biological interpretations and clinically relevant semantic labels, such as "edema", "active tumor" or "necrotic core". These structures do not correspond one-by-one to the hypo-and hyper-intense lesions, but have to be inferred by evaluating spatial context and tumor structure as well. We use the probabilistic output of the generative model, together with few secondary features that are derived from the same probabilistic maps and image intensity features, as input to a classifier predicting the posterior probability of the desired semantic labels. The discriminative classifier evaluates local and non-local features to map the output of the generative model to semantic tumor structure and to infer the most likely label for each given voxel, similar to [32].

1) Features and Labels:
To predict a dense set of semantic labels we extract the following set of features for each voxel : the tissue prior probabilities for the tissue classes ; the tumor probability for all channels , and the image intensities after they have been scaled linearly to the intensities of a reference data set . From these data we derive two types of features. First, we construct the differences of local image intensities or probabilities for all three types of input features , , . This feature captures the difference between the image intensity or probability of voxel and the corresponding image intensity or probability of another voxel . For every voxel in our volume we calculate these differences for 20 different directions, with spatial offsets in between 3 mm to 3 cm, i.e., distances that correspond to the extension of most relevant tumor structures. To reduce noise the subtracted values of are extracted after smoothing the image intensities locally around voxel (using a Gaussian kernel with 3 mm standard deviation). We calculate differences between tumor or tissue probability at a given voxel and the probability of the same location on the contralateral side. Second, we evaluate the geodesic distance between voxel and specific image features that are of particular interest in the analysis. The path is constrained to areas that are most likely gray matter, white matter or tumor as predicted by the generative model. More specifically, we use the distance of of voxel to the boundary of the the brain tissue (the interface of white and gray matter with CSF), the distance to the boundary of the T2 lesion representing the approximate location of the edema. This latter distance is calculated independently for voxels outside and inside the edema. In the same way, we calculate and representing the inner and outer distance to the next T1c hyper-intensity. We calculate the number of voxels that are labeled as "edema" or "active tumor" among the direct neighbours of voxel , and determine the x-y-z location of the voxel in the co-registered NMI space . Overall, we construct image features for each voxel and, when adapted to the BRATS training data set (Section V), five labels as provided by clinical experts.

IV. EXPERIMENT 1: PROPERTIES AND PERFORMANCE OF THE GENERATIVE MODEL
In a first experiment, we evaluate the relevance of different components and parameters of the probabilistic model, compare it with related generative approaches,and evaluate the performance on the public BRATS glioma dataset, and test the generalization in a transfer to a related application dealing with stroke lesion segmentation.

1) Glioma Data:
We use the public BRATS 2012-2013 dataset that provides a total of 45 annotated multimodal glioma image volumes [1]. Training datasets consist of 10/20 low/high-grade cases with native T1, Gadolinium-enhanced T1 (T1c), T2 and FLAIR MR image volumes. The test dataset contains no labels, but can be evaluated by uploading image segmentations to a server; it includes 4/11 low-grade/high-grade cases. Experts have delineated tumor edema, Gadolinium-enhancing "active" core, non-enhancing solid core, cystic/necrotic core. We co-register the probabilistic MNI tissue atlas of SPM99 with the T1 image of each dataset using the FSL software, and sampled to isotropic voxel resolution. We perform a bias field correction using a polynomial spline model (degree 3) together with a multivariate tissue segmentation using an EM segmenter that is robust against lesions 2 [51]. Image intensities of each channel in each volume are scaled linearly to match the histogram of a reference.
2) Stroke Data: Images are acquired in patients with acute and subacute ischemic stroke. About half of the 18 datasets comprise T1, T2, T1c and FLAIR images in patients in the sub-acute phase, acquired about one or two days after the event; another half comprises T1, T1c, T2 base diffusion and mean diffusivity (MD) images acquired in acute stroke patients within the first few hours after the onset of symptoms. For both groups the imaging sequences return tissue contrasts of normal tissues and lesion areas that are similar to hyper-and hypo-intensities expected in glioma sequences; stroke lesions are characterised here by T1 hypo-, T1c hyper-, T2 hyper-and Fig. 4. Exemplary BRATS test sets, with results for generative and generative discriminative models. Shown are axial views through the tumor center for T1, T1c, T2 and FLAIR image (columns from left to right) and the segmented hypo-or hyper-intense areas (red and cyan). Regions outlined in red have been identified as "true positive" regions by the regional discriminative classifier and the resulting tumor labels are shown in column 5 with edema (bright gray) and active tumor region (white). Column 6 shows results of the voxel-wise generative-discriminative classifier, and column 7 the expert's annotation. Gray and white matter segmentations displayed in the last three columns are obtained by the generative model. FLAIR /MD hyper-intense changes. For the quantitative evaluation of the algorithm, we delineate the lesion in every 10th axial, sagittal, and coronal slice, in each of the four modalities. In addition, we annotate about 10% of the 2D slices twice to estimate variability. We register the probabilistic atlas and perform a model based bias field correction as for the glioma data. Image intensities are scaled to the same reference as for the glioma cases.
3) Evaluation: To measure segmentation performance in the experiments with this dataset, we combine the set of four tumor labels (edema and the three tumor core subtypes) to one binary "complete lesion" label map. We compare this map with the hyper-intense lesion as segmented in T2 and FLAIR. Separately, we compare the "enhancing core" label map with the hyper-intense lesion as segmented in T1c MRI. Quantitatively, we calculate volume overlap between expert annotation and predicted segmentation using the Dice score . We compute Dice scores for whole brain when testing performances on the BRATS data set. We also calculate Dice score within a 3 cm distance from the lesion to measure local differences in lesion segmentation rather than in global detection performances.

B. Model Properties and Evaluation on the BRATS Data set 1) Comparison of Generative Modelling Approaches:
We compare the proposed generative model against related generative tissue segmentations models and evaluate the relevance of individual components of our approach on the BRATS training data set. We calculate Dice scores in the area containing the lesion and the 3 cm margin. Fig. 3 A illustrates the benefit of the proposed multivariate tumor and tissue segmentation over a univariate segmentations that treat tumor voxels as intensity outliers similar to Van Leemput's EM segmentation approach for white matter lesion [51]. On the given data this baseline approach leads to a high number of false positives, either requiring stronger spatial regularization or a more adaptive tuning of the outlier threshold. Fig. 3B reports the benefit of enforcing intensity constraints within the proposed generative model. While the benefit for the large hyper-intense regions visible in T2 and FLAIR is minor, the difference for segmenting the enhancing tumor core visible in T1c in high-grade patients is striking: the constraint disambiguates tumor-related hypo-intensities-similar to those visible in native T1, for example, from edema-from hyper-intensities induced by the contrast agent in the active rim. Fig. 3C reports a comparison between our approach and Prastawa's classic tumor EM segmentation approach [22] that models lesions as an additional class with a "flat" global atlas prior. We test different values for the tumor prior in (2), evaluating result for . We find that every channel and every segmentation task has a different optimal . However, each optimally tuned generative model with flat prior is still outperformed by the proposed generative model.
2) Enforcing Spatial Regularity: Our model has a single parameter that has to be set which is the regularization parameter coupling segmentations of neighbouring voxels. Based on our previous study [55], we performed all experiments reported in Fig. 3 with weak spatial regularization . To confirm these preliminary results we test different regularization settings with , now also evaluating channel-specific performance in the lesion area (Fig. 7 in the online Supplementary Materials). We find a strong regularization to be optimal for the large hyper-intense lesions in FLAIR , suppressing small spurious structures, while little or no regularization is best for the hardly visible hypo-intense structures visible in T1 . Both T2 and T1c are rather insensitive to regularization. We find the previous value of to work well, but choose for both low-and high-grade tumors in further experiments, somewhat better echoing the relevance of FLAIR.
3) Evaluation on the BRATS Test set: We apply our segmentation algorithm to the BRATS test sets that have been used for the comparison of twenty glioma segmentation methods in the BRATS evaluation [1]. We identify the segmentations in FLAIR with the "whole tumor" region of the BRATS evaluation pro-tocol, and the T1c enhancing regions with the "active tumor" region. We evaluate two sets of segmentations: segmentations that are obtained by thresholding the corresponding probabilities at 0.5, and the same segmentations after removing all regions that are smaller than in the FLAIR volume. This latter postprocessing approach was motivated by our observation that smaller regions correspond to false positives in almost all cases. We calculate Dice scores for the whole brain. Table I reports Dice scores for the BRATS test sets with results of about .60 for the whole tumor and about .50 for the active tumor region ('raw'). As visible from Fig. 4, results are heavily affected by false positive regions that have intensity profiles similar to those of the tumor lesions. Applying the basic, size-based postprocessing rule improves results in most cases ('postproc.'). Most false positives are spatially separated from the real lesion and when calculating Dice scores from a region that contains the FLAIR lesion and a 3 cm margin only, results improve drastically to average values of for the whole tumor to and for the active region (not shown in the table) which aligns well with results obtained for T2 and T1c on the training set (Fig. 3).

C. Generalization Performance and Transfer to the Stroke Data set
We test the generalization performance of the generative model by using it for delineating ischemic stroke lesions that are similar in terms of lesion size and clinical image information. We apply the generative model as optimized for the BRATS dataset to the stroke images. As a single modification we allow T1c lesions to be both inside the FLAIR and T2 enhancing area and outside, as bleeding (which leads to the T1c hyper-intensities) may not coincide with the local edema. Stroke images contain cases with both active and chronic lesions with significantly different lesion patterns.
Although datasets, imaging protocol, and even major acquisition parameters differ, we obtain results that are comparable to the tumor data. We calculate segmentations accuracies in the lesion area and observe good agreement between manual delineation and automatic segmentation in all four modalities (Fig. 5). We also observe false positives at the white matter-gray matter interfaces, similar to those we observed for the glioma tests data (Fig. 4). Most false positive regions are disconnected from the lesion and could be removed with little user interaction or postprocessing. Inter-rater differences and performance of the algorithm are comparable to those from the glioma test set, with Dice scores close to .80 for segmenting the edema and around .50-.60 for T1c enhancing structures (Table I). Results on the stroke data underline the versatility of the generative lesion segmentation model and its good generalization performance not only across different imaging sequences, but also across applications. To the best of our knowledge this is one of the first attempts to automatically segment ischemic stroke lesion in multimodal images using a generative model.

V. EXPERIMENT 2: PROPERTIES AND PERFORMANCES OF THE GENERATIVE-DISCRIMINATIVE MODEL EXTENSIONS
Results of the generative model show its robustness and accuracy for delineating lesion structures. Still, it also shows to be sensitive to artifacts that cannot be recognized by evaluating the multimodal intensity pattern at the voxel level, and hypoand hyper-intense structures can only be matched with selected tumor labels. To this end, we evaluate the two discriminative modeling strategies that are considering non-local features as input and arbitrary labels as output. We first evaluate model properties on the BRATS training set and then compare performances to results of other state-of-the art tumor segmentation algorithms on the BRATS test set.

A. Relevant Features and Information Used by the Discriminative Models
The random forest classifier handles learning tasks with small sample sizes in high dimensional spaces by only relying on few "strong" variables and ignoring irrelevant features [69]. Still, in order to understand the information used when modeling the class probabilities, we can visualize the importance of the input features used. To this end we evaluate the relevance of the individual features using Breiman's feature permutation test [67] that compares the test error with the error obtained after the values of a given feature have been randomly permuted throughout all test samples. The resulting decrease in test accuracy, or increase in test error, indicates how relevant the chosen feature is for the overall classification task. Repeated for each feature of all trees in the decision forest, this measure helps to rank the features and to compare the relevance as shown in Fig. 6. In our test we augment the dataset with a random feature (random samples from a Gaussian distribution with mean 0 and standard deviation of 1) to establish a lower baseline of the relevance score. For each feature we compare the distribution of changes in classification error against the changes of this random feature in a paired Cox-Wilcoxon test. We analyze feature relevance in a cross-validation on the BRATS training set.
Results for the first discriminative model acting at the regional level are shown in Fig. 6. We find plausible features to be relevant: the relative location of the region with respect to the center of the brain (indicated as in the figure), the surface-to-volume ratio , the total number of lesions visible for the given patient , the ratio of segmented voxels in T1c , and some descriptors of image intensities, such as the minimum in FLAIR , the maximum, median and average of the T2 intensities , as well as the maximum in T1c and the minimum in T1 .
For the second discriminative model acting at the pixel level we find about 80% of the features to be relevant, with with some variation across the different classification tasks. The features that rank highest in all tests are those we derived from the probability maps of the generative model: the total number of local edema or active tumor voxels, the geodesic distance to the nearest edema or active tumor pixels, but also the relative anatomical location in the MNI space, and selected image intensities and intensity differences (such as the intensity values of T1 and FLAIR for edema and T1c for active core, and local differences in the T1 image intensities). Fig. 4 displays nine exemplary image volumes of the BRATS test set. Shown are the raw probability maps of the generative model (red and cyan; columns 1-4), those regions that are selected by the regional discriminative model (cyan) and the derived tumor segmentation (column 5), as well as the output of the voxel-level tumor classifier (column 6), together with an expert annotation (column 7).

B. Performance on the BRATS Test set
Quantitative results are reported in Table I, and we find both discriminative models to improve results over those derived from the "raw" probability maps of the generative model. With few exceptions most "false positive" artifact regions are removed (Fig. 4). The voxel-level model shows to be more accurate than the regional-level model, also correcting for "false negative" areas in the center of the tumor (rows 1, 3, 6, and 7). In addition to whole tumor and active tumor areas, the second discriminative model is also predicting the location of necrotic and fluid filled structures, as well as the "tumor core" label (with a Dice score of .58; segmentations not shown in the figure). Sensitivities and specificities for this latter model are balanced with sensitivities of .75/.58/.63 for the three tumor regions (whole tumor/tumor core/active tumor) and a specificities of .86/.71/.56.
The BRATS challenge also allows us to compare the two generative-discriminative modeling approaches with eighteen other state of the art methods including inter-active ones, and we reproduce results of the challenge in Fig. 8 in the online Supplementary Materials of this manuscript. The generative model with discriminative post-processing at the regional level (indicated by Menze (G)) performs comparable to most other approaches in terms of Dice score and robust Hausdorff distance for "whole tumor" and "active tumor". However, it cannot model the "tumor core" segmentation task as this structure does not have a direct correspondence to any of the segmented hyper-and hypo-intensity regions and, hence, does does not provide competitive results for this tumor sub-structure. The voxel-level generative-discriminative approach (indicated by Menze (D)) is able to predict "tumor core" labels. It ranks first among the twenty evaluated methods in terms of average Hausdorff distances for both "tumor core" and in "whole tumor" segmentation, and it is the second best automatic method for the "active tumor" segmentation. In the evaluation of the average Dice scores it is second best for "whole tumor", it is ranking third among the automated methods for the "tumor core" task, and its result are statistically indistinguishable from the inter-rater variation for "active tumor". Most notably, the voxel-level generative-discriminative approach is outperforming all discriminative models that are similar in terms of random forest classifier and feature design [37], [32], [2], [34], but that do not rely on the input features derived from the probability maps of the generative model.

VI. SUMMARY AND CONCLUSIONS
In this paper, we extend the atlas-based EM segmenter by a latent atlas class that represents the probability of transition from any of the "healthy" tissues to a "lesion" class. In practice, the latent atlas serves as an adaptive prior that couples the probability of observing tumor-induced intensity changes across different imaging channels for the same voxel. Using the standard brain atlas for healthy tissues together with the highly specific multi-channel information provides us with segmentations of the healthy tissues surrounding the tumor, and enables us to automatically segment the images. The proposed generative algorithm produces outlines of the tumor-induced changes for each channel which makes it independent of the multimodal imaging protocol. We complement the basic probabilistic model with a discriminative model and test two different modeling strategies, both of them addressing shortcomings of the generative model, and find the resulting discriminative-generative model to define the state of the art in tumor segmentation on the BRATS data set [1].
The proposed generative algorithm generalizes the probabilistic model of the standard EM segmenter. As such, it can be improved by combining registration and segmentation [62], or by integrating empirical or physical bias-field correction models [15], [70]. The generative segmentation algorithm that we optimized for glioma images exhibits a good level of generalization when applied to multimodal images from patients with ischemic stroke. The method is likely to also work well for traumatic brain injury with similar hypo-and hyper-intensity patterns, and it can also be adapted to multimodal segmentation tasks beyond the brain. It may be interesting to evaluate relations to multi-channels approaches that do not rely on multiple physical channels, but high-dimensional sets of features extracted from one or few physical images [71]. Analyzing feature relevance indicated that the location of a voxel or region within the MNI space helped in removing false positives, as most of them appeared at white matter-gray matter interfaces or in areas that are often subject to B-field inhomogeneities. Extensions of the generative model may use a location prior that lowers the expectation of tumor occurrences in these areas. Moreover, preliminary findings suggest that results may improve by using non-Gaussian intensity models for the lesion classes.
Some tumor structures-such as necrotic or cystic regions, or the solid tumor core-cannot easily be associated with local channel-specific intensity modifications, but are rather identified based on the wider spatial context and their relation with other tumor compartments. We addressed the segmentation of such secondary structures by combining our generative model with discriminative model extensions evaluating additional non-local features. As an alternative, relations between visible tumor structures can be enforced locally using MRF as proposed by [35], or in a non-local fashing following the hierarchical approach following [54]. Future work may also aim at integrating image segmentation with tumor growth models enforcing spatial or temporal relations as in [53], [14]. Tumor growth models-often described through partial differential equations [72] -offer a formal description of the lesion evolution, and could be used to describe the propagation of channel-specific tumor outlines in longitudinal series [73], as well as a shape and location prior for various tumor structures [23]. This could also promote a deeper integration of underlying functional models of disease progression and formation of image patterns in the modalities that are used to monitor this process [74].
To support the further use and analysis of our generative segmentation algorithm, we make an implementation available in Python from http://ibbm.in.tum.de, also illustrating its use on reference data from the BRATS challenge.