Automatic Segmentation of MS Lesions Using a Contextual Model for the MICCAI Grand Challenge

Automatically segmenting subcortical structures in brain images has the potential to greatly accelerate drug trials and population studies of disease. Here we propose an automatic subcortical segmentation algorithm using the auto-context model. Unlike many segmentation algorithms that separately compute a shape prior and an image appearance model, we develop a machine-learning framework that learns a unified appearance and context model. To test the method, specificity and sensitivity measurements were obtained on a standardized dataset provided by the competition organizers. Our overall score of 77 appears competitive with the other entries, whose overall scores were in the range of 50 to 90.


Introduction
Segmentation of subcortical structures on brain MRI is vital for many clinical and neuroscientific studies. In many studies of brain development or disease, subcortical structures must be segmented in large populations of patients and healthy controls to quantify disease progression over time, to detect factors influencing structural change, and to measure treatment response. In brain MRI, MS lesions are structures of great neurological interest, but they are difficult to segment automatically. 3D medical image segmentation has been intensively studied. Most approaches fall into two main categories: those that design strong shape models [1,14,6] and those that rely more on strong appearance models (i.e., based on image intensities) or discriminative models [2,8]. Atlas-based, shape-driven, and other segmentation methods were recently compared in a caudate benchmark test [13]. Despite the progress made, no approach is yet widely used, due to (1) slow computation, (2) unsatisfactory results, or (3) poor generalization capability.
In object and scene understanding, it has been increasingly realized that context information plays a vital role [5]. Medical images contain complex patterns, including features such as textures (homogeneous, inhomogeneous, and structured), which are also influenced by acquisition protocols. The concept of context covers intra-object consistency (different parts of the same structure) and inter-object configurations (e.g., the expected symmetry of left and right hemisphere structures). Here we integrate appearance and context information in a seamless way by automatically incorporating a large number of features through iterative procedures. The resulting algorithm has almost identical training and testing procedures, and it segments images rapidly by avoiding an explicit energy minimization step. We train and test our model on a dataset provided by the MICCAI Grand Challenge II workshop for segmenting MS lesions. For a validation of the presented method on hippocampal and caudate segmentation, please refer to [4].

Problem
The goal of a subcortical image segmenter is to label each voxel as belonging to a specific region of interest (ROI), such as an MS lesion. Let X = (x_1, ..., x_N) be a vector encompassing all N voxels in each manually labeled training image and Y = (y_1, ..., y_N) be the corresponding labeling, with y_i ∈ {1, ..., K} representing one of K labels (for lesion segmentation, this reduces to a two-class problem). In a Bayesian formulation, we seek the labeling Y* maximizing the posterior, Y* = argmax_Y p(Y|X) = argmax_Y p(X|Y) p(Y), where p(X|Y) is the likelihood and p(Y) is the prior distribution on the labeling Y. However, this task is very difficult. Traditionally, many "bottom-up" computer vision approaches (such as SVMs using local features [8]) work hard on directly learning the classification p(Y|X) without encoding rich shape and context information in p(Y), whereas many "top-down" approaches, such as deformable models, active surfaces, or atlas-deformation methods, impose a strong prior distribution on the global geometry and allowed spatial relations and learn a likelihood p(X|Y) under simplified assumptions. Due to the intrinsic difficulty of learning the complex p(X|Y) and p(Y), and of searching for the Y* maximizing the posterior, these approaches have achieved limited success.
Instead, we attempt to model p(Y |X) directly by iteratively learning the marginal distribution p(y i |X) for each voxel i.The appearance and context features are selected and fused by the learning algorithm automatically.

A traditional classifier can learn a classification model based on local image patches, which we now call

P(0)_k(i) = p(0)(y_i = k | X(N_i)),

where P(0)_k(i) is the posterior marginal for label k at each voxel i learned by a classifier (e.g., boosting or SVM), and X(N_i) is the image patch centered at voxel i. We construct a new training set in which the classification maps P(0)(i) are centered at each voxel i. We train a new classifier, not only on the features from the image patch X(N_i), but also on the probability patch, P(0)(N_i), of a large number of context voxels. These voxels may be either near or very far from i. It is up to the learning algorithm to select and fuse the important supporting context voxels, together with features about image appearance. For our purposes, the feature pool consisted of 18,099 features, including intensity, position, and neighborhood features. The neighborhood features were mean filters, standard deviation filters, curvature filters, and gradients of sizes 1x1x1 to 3x3x3, and Haar filters of various shapes from size 2x2x2 to 7x7x7. Our AdaBoost weak learners were decision stumps on both the image map and the probability map. Once a new classifier is learned, the algorithm repeats the same procedure until it converges, iteratively updating the marginal distribution to approach the true posterior p(y_i | X). In fact, even the first classifier is trained the same way as the others, by giving it a probability map with a uniform distribution. Since the uniform distribution is not informative at all, context features are not selected by the first classifier. In some applications, e.g., medical image segmentation, the positions of the anatomical structures are roughly known after registration to a standard atlas space. One can then provide a probability map of the structure (based on how often it occurs at each voxel) as the initial P(0).
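The feature construction described above, appearance features from the image patch plus posterior values sampled at possibly distant context voxels, can be sketched as follows. This is an illustrative helper, not the paper's code; the actual pool contained 18,099 features of many filter types.

```python
import numpy as np

def autocontext_features(image, prob_map, voxel, offsets):
    """Feature vector for one voxel: the local intensity plus posterior
    values sampled at (possibly far-away) context offsets.  Hypothetical
    sketch of the auto-context idea; filters are omitted for brevity."""
    z, y, x = voxel
    feats = [image[z, y, x]]                       # appearance: raw intensity
    for dz, dy, dx in offsets:                     # context: probability map
        feats.append(prob_map[z + dz, y + dy, x + dx])
    return np.array(feats)

# Toy volume with one bright voxel and a uniform initial map P(0).
img = np.zeros((8, 8, 8))
img[4, 4, 4] = 1.0
prob = np.full((8, 8, 8), 0.5)
f = autocontext_features(img, prob, (4, 4, 4),
                         [(0, 0, 0), (2, 0, 0), (-2, 0, 0)])
```

Because the initial map is uniform, the context entries carry no information here, which is exactly why the first classifier ignores them.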
Given a set of training images together with their label maps, S = {(Y_j, X_j), j = 1..m}:
For each image X_j, construct an initial probability map P(0)_j with a distribution (possibly uniform) over all the labels.
For t = 1, ..., T:
• Train a classifier on both image and context features extracted from X_j(N_i) and P(t-1)_j(N_i).
• Use the trained classifier to compute new classification maps P(t)_j.
The algorithm outputs a sequence of trained classifiers for p(t)(y_i | X(N_i), P(t-1)(N_i)). We can prove that at each iteration, ACM decreases the error ε_t. Writing the error of one example i at time t-1 as P(t-1)(i)(y_i) and at time t as p(t)(y_i | X(N_i), P(t-1)(N_i)), we can use log-likelihoods to formulate the error over all examples as

ε_t = -Σ_i log p(t)(y_i | X(N_i), P(t-1)(N_i)).
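The iterative procedure above can be sketched with stand-in learners. The `toy_train` and `toy_predict` functions below are illustrative placeholders for the AdaBoost/PBT classifier, not the paper's implementation; the point is only the shape of the loop, in which each round consumes the maps produced by the previous one.

```python
import numpy as np

def train_auto_context(images, labels, train_classifier, predict, T=3):
    """Auto-context training loop (sketch).  Round t trains a classifier
    on image features plus the previous probability maps P(t-1), then
    recomputes the maps P(t) with the newly trained classifier."""
    prob_maps = [np.full_like(im, 0.5) for im in images]   # uniform P(0)
    classifiers = []
    for t in range(T):
        clf = train_classifier(images, prob_maps, labels)
        prob_maps = [predict(clf, im, pm) for im, pm in zip(images, prob_maps)]
        classifiers.append(clf)
    return classifiers, prob_maps

# Toy stand-ins: the "classifier" is a fixed intensity threshold, and
# prediction blends the appearance cue with the previous context map.
def toy_train(images, probs, labels):
    return 0.5                                   # hypothetical learned threshold

def toy_predict(thr, image, prob):
    return 0.5 * prob + 0.5 * (image > thr)

imgs = [np.array([[0.0, 1.0], [1.0, 0.0]])]
labs = [np.array([[0, 1], [1, 0]])]
clfs, maps = train_auto_context(imgs, labs, toy_train, toy_predict, T=3)
```

After three rounds the toy maps have moved from the uninformative 0.5 toward the labels, mirroring the monotone error decrease argued for in the text.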
First, note that the new classifier can trivially reproduce the previous map, i.e., output p(t)(y_i | X(N_i), P(t-1)(N_i)) = P(t-1)(i)(y_i) by selecting only the context feature at voxel i itself, which makes ε_t = ε_{t-1}. Boosting (or any other effective discriminative classifier) chooses weak learners that minimize ε_t, so it does at least as well as this trivial choice and fails only if no error-reducing weak learner exists; therefore, if AdaBoost completes, ε_t ≤ ε_{t-1}.

Preprocessing and Postprocessing
Before any segmentation model was created, two preprocessing steps were performed. First, the data was downsampled to 256x256x256 using nearest-neighbor interpolation; this was done to drastically speed up processing time. Second, the data was run through a bias field corrector (BFC) [9] to remove the intensity bias introduced by the scanner. The only postprocessing was a nearest-neighbor upsample to return the mask to the original 512x512x512 space.
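For an integer scale factor, the nearest-neighbor resampling steps can be sketched as below. This is a minimal illustration, assuming a power-of-two factor; the paper does not specify the tool used beyond nearest-neighbor interpolation, and bias field correction is omitted.

```python
import numpy as np

def nn_downsample(vol, factor=2):
    """Nearest-neighbour downsampling by an integer factor (sketch of the
    512 -> 256 preprocessing step): keep every factor-th voxel per axis."""
    return vol[::factor, ::factor, ::factor]

def nn_upsample(mask, factor=2):
    """Nearest-neighbour upsampling back to the original grid (sketch of
    the postprocessing step): replicate each voxel factor times per axis."""
    for axis in range(mask.ndim):
        mask = np.repeat(mask, factor, axis=axis)
    return mask

vol = np.arange(4 ** 3, dtype=float).reshape(4, 4, 4)
small = nn_downsample(vol)       # shape (2, 2, 2)
restored = nn_upsample(small)    # back to shape (4, 4, 4)
```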

Feature Pool and Weak Learner Description
When running AdaBoost, one must first define a pool of potentially informative features to draw from. First, we defined an initial shape prior (P(0)_j) based on the LogOdds formulation of Pohl et al. [7]. Next, we drew the same set of features from each of six channels (T1, T2, FLAIR, DTI-MD, DTI-FA, and the prior), including intensity, mean filters, standard deviation filters, curvature filters, and Haar filters of various shapes. For each of the neighborhood-based features, the neighborhood ranged from 1x1x1 to 7x7x7. Additionally, we used features based on the x, y, z position of each voxel.
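Two of the neighborhood feature types, the mean and standard-deviation filters, can be sketched per channel as follows. This is an assumed formulation (the standard deviation computed from two box filters via Var = E[x^2] - E[x]^2), not the paper's exact code.

```python
import numpy as np
from scipy import ndimage

def neighborhood_features(channel, size=3):
    """Mean- and standard-deviation-filter responses over a cubic
    neighborhood, computed for one channel (T1, T2, FLAIR, DTI-MD,
    DTI-FA, or the prior).  Sketch: std via two box-filter passes."""
    mean = ndimage.uniform_filter(channel, size=size)
    mean_sq = ndimage.uniform_filter(channel ** 2, size=size)
    std = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))  # clamp rounding noise
    return mean, std

chan = np.random.rand(8, 8, 8)
m, s = neighborhood_features(chan, size=3)
```

In the interior of the volume, each voxel of `m` is exactly the mean of its 3x3x3 neighborhood; the paper's pool extends this idea to sizes up to 7x7x7 plus curvature and Haar filters.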
A crucial part of AdaBoost is the definition of the weak learners.In our implementation, our weak learners are decision stumps.Therefore, each weak learner consists of a feature, a threshold, and a boolean stating whether positive examples are below or above that threshold.
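A decision stump of the kind described, one feature, one threshold, one polarity bit, can be fit to weighted examples roughly as follows. This is a brute-force illustrative sketch, not the paper's implementation.

```python
import numpy as np

def fit_stump(feature, labels, weights):
    """Fit a decision stump on one feature column: choose the threshold
    and polarity with the lowest weighted error.  labels are in {-1, +1};
    polarity +1 predicts positive above the threshold, -1 below it."""
    order = np.argsort(feature)
    f, y, w = feature[order], labels[order], weights[order]
    best = (np.inf, None, 1)  # (weighted error, threshold, polarity)
    # Candidate thresholds: midpoints between sorted values, plus ends.
    thresholds = np.concatenate(([f[0] - 1], (f[:-1] + f[1:]) / 2, [f[-1] + 1]))
    for thr in thresholds:
        for polarity in (+1, -1):
            pred = np.where(polarity * (f - thr) > 0, 1, -1)
            err = w[pred != y].sum()
            if err < best[0]:
                best = (err, thr, polarity)
    return best

feat = np.array([0.1, 0.2, 0.8, 0.9])
ys = np.array([-1, -1, 1, 1])
ws = np.full(4, 0.25)
err, thr, pol = fit_stump(feat, ys, ws)
```

On this separable toy data the stump achieves zero weighted error with a threshold between the two classes; inside AdaBoost, the weights would be re-weighted toward misclassified examples after each round.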
It is also interesting to note which features were selected as most important. For our model, the first ten features chosen (those that contributed most to the decision rule) are summarized in Table 1. The FLAIR image appears most often, which is to be expected, as FLAIR provides the best delineation of MS lesions. However, it is not the only modality present. The prior channel represents the LogOdds shape prior, which is also expected to be informative because it is a combination of all the training masks.

Table 1: The first ten features selected by AdaBoost. These contributed most to the classification rule.

Probabilistic Boosting Tree
In order to increase the classification ability of our algorithm, instead of using a single AdaBoost classifier during each iteration of ACM, we used a probabilistic boosting tree (PBT) [10]. The PBT has been shown to be an effective medical image classifier [11]. Due to time constraints, we used a very small tree, with a depth of only 1. Finally, we set ε (as defined in [10]) to 0.1.
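The role of ε at the root of a depth-1 PBT can be illustrated with the routing rule below. This is a hypothetical sketch of the idea in [10], where confidently classified samples descend one branch while ambiguous samples (posterior within ε of 0.5) are passed to both children; it is not the paper's code.

```python
import numpy as np

def pbt_root_split(posteriors, eps=0.1):
    """Route samples at the root of a depth-1 PBT by the root
    classifier's posterior q = p(y=+1|x): confidently negative samples
    go to the left child, confidently positive to the right, and
    ambiguous ones (|q - 0.5| < eps) to both.  Illustrative sketch."""
    right = posteriors > 0.5 - eps   # positive or ambiguous
    left = posteriors < 0.5 + eps    # negative or ambiguous
    return left, right

q = np.array([0.05, 0.45, 0.55, 0.95])   # root posteriors for 4 samples
left, right = pbt_root_split(q, eps=0.1)
```

With ε = 0.1, only the two near-boundary samples are duplicated into both subtrees, which is what keeps a shallow tree cheap while still refining hard cases.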

Randomizations
Due to the large size of our dataset and the large number of available features, we must scale the training set down to a manageable size. To do this, at each run of AdaBoost we chose a random set of 100,000 examples to train on and a random set of 5,000 features to comprise the feature pool. Since we run AdaBoost many times (we used an ACM depth of 6, for a total of 18 AdaBoost runs), these randomizations are acceptable: each run draws a different sample, so across runs the ensemble still sees most of the data and feature pool.
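The per-run subsampling amounts to two draws without replacement, as sketched below; the counts match the paper's setup (100,000 examples and 5,000 of the 18,099 features), while the total example count and seed are arbitrary for illustration.

```python
import numpy as np

def subsample(n_examples, n_features, n_ex=100_000, n_feat=5_000, seed=0):
    """Pick a random subset of training examples and a random feature
    pool for one AdaBoost run (sampling without replacement)."""
    rng = np.random.default_rng(seed)
    ex_idx = rng.choice(n_examples, size=min(n_ex, n_examples), replace=False)
    feat_idx = rng.choice(n_features, size=min(n_feat, n_features), replace=False)
    return ex_idx, feat_idx

# e.g. one run over a million candidate voxels and the 18,099 features
ex, ft = subsample(1_000_000, 18_099)
```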

Results
Fig. 2 presents our results from the competition. For this competition, we chose to train two models: one for the UNC dataset alone, and one for the CHB dataset alone.
For most images, a reasonable segmentation is found and a good score is achieved. There are only a few outliers that our system did not segment well: cases 7, 9, and 10 in the UNC test set scored the poorest. Most other cases do very well; in fact, no overall score in the CHB dataset is below 71.

Conclusions
Our method finds a reasonable segmentation on most images, and the results in Fig. 2 show that for most cases the segmentation is relatively good. At the time of writing we do not know the scores of the other groups, only that the range of total scores is between 50 and 90; our score of 77 seems very acceptable, especially considering that this system was originally designed to segment subcortical structures such as the hippocampus and caudate.

Figure 1: The training procedures of the auto-context algorithm.

Figure 2: The competition results for our segmentation algorithm. For an overview of the meaning of these metrics, refer to last year's competition [12].