MR Brain Segmentation using Decision Trees



Introduction
The segmentation of magnetic resonance images (MRI) of the whole head into the primary cerebral tissues of cerebrospinal fluid (CSF), gray matter (GM), and white matter (WM) has been one of the core challenges of the neuroimaging community for the past twenty years. The majority of existing solutions are conceived as a pipeline, with several preprocessing steps used to isolate the cerebrum before it is segmented. These include inhomogeneity correction, the most well known being N3 [16], followed or preceded by skull stripping (see Table 1 in [15] for a recent overview), and then either an image intensity standardization step or the segmentation task itself. The segmentation approaches that have been employed for this three-class problem include: Gaussian-distribution-based methods such as Expectation Maximization Segmentation (EMS) [17], unified segmentation [1], and FMRIB's Automated Segmentation Tool (FAST) [18]; fuzzy c-means (FCM) approaches such as FANTASM [12] and several others [7,8,13]; and, more recently, Rician-based distributions [14]. Newer methods have tended to include one of these distributions at their core while incorporating statistical [1] and topology [3] atlases to improve their accuracy.
These approaches assume that well-behaved parametric distributions can approximate all given data, regardless of the patient's pathology. In this work we explore a distribution-free model that can provide rapid tissue segmentations. We have chosen random decision forests [10,4], which provide a model-free framework that can learn a complicated distribution that would otherwise be poorly approximated by a fixed distributional choice. Our method uses existing software tools to isolate the cerebrum in the whole-head MRI, by removing the skull [6] and the cerebellum [3]. We then use a decision tree ensemble to generate a hard classification of the tissues in the cerebrum.

Method
Our algorithm uses T1-w and FLAIR images which have been co-registered and bias corrected. We use {I_t^(T), I_t^(F), I_t^(C)}, t = 1, ..., 5, to denote the t-th training subject's images, which correspond to the T1-w, FLAIR, and manual segmentation images respectively. The class image I_t^(C) has labels 1, 2, 3, which are CSF, GM, and WM respectively. The training images also contain white matter lesions (WML), which have the appearance of GM in I_t^(T), though we wish to segment them as WM.

Preprocessing Training Data
Fig. 1 provides a flowchart of our algorithm. The training images are skull stripped and manually labeled using the contour segmentation objects (CSO) tool in MeVisLab. The T1-w images (I_t^(T)) are linearly scaled so that their mean WM intensity is 1000; the mean WM intensity is found by fitting a three-class Gaussian Mixture Model (GMM) to the intensity histogram. The FLAIR images (I_t^(F)) are linearly scaled so that the mode of the WM intensities is 1000; the WM mode is obtained from the intensity histogram after smoothing with a kernel density estimator.
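The two normalization steps can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that WM corresponds to the brightest of the three GMM components on T1-w, that the dominant smoothed-histogram mode on FLAIR is the WM mode, and the function names are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.mixture import GaussianMixture

def scale_t1_to_wm_mean(t1, target=1000.0):
    """Fit a three-class GMM to the brain intensities and linearly scale the
    image so the brightest component's mean (assumed WM) lands at `target`."""
    vox = t1[t1 > 0].reshape(-1, 1)
    gmm = GaussianMixture(n_components=3, random_state=0).fit(vox)
    wm_mean = float(gmm.means_.max())  # assumption: WM is brightest on T1-w
    return t1 * (target / wm_mean)

def scale_flair_to_wm_mode(flair, target=1000.0):
    """Smooth the intensity histogram with a kernel density estimator and
    linearly scale so its dominant mode (assumed WM) lands at `target`."""
    vox = flair[flair > 0]
    kde = gaussian_kde(vox)
    grid = np.linspace(vox.min(), vox.max(), 512)
    mode = grid[np.argmax(kde(grid))]
    return flair * (target / mode)
```

In practice the brightest-component assumption holds for skull-stripped T1-w brain images, but the component-to-tissue assignment should be verified on each cohort.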

Training and Prediction
At each voxel i of the t-th training subject, patches of size p × q × r are defined on I_t^(T) and I_t^(F); each patch resides in a d-dimensional space, where d = pqr. The two patches are concatenated to form a 2d × 1 vector x_{i,t}, which acts as the feature vector for the i-th voxel, with a corresponding label taken from the i-th voxel of I_t^(C), denoted y_{i,t}. We consider the components of the x_{i,t} as attributes, with the dependent variables being the y_{i,t}. We can then construct a training pair (x_{i,t}, y_{i,t}) for each voxel i in each training subject t. Using all the available data, i.e. all the voxels in all five subjects, leads to an unbalanced training set, as the tissue classes are not represented equally; care is therefore taken to ensure equal proportions of each class in the training data.
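The feature construction and class balancing can be sketched as below. The 3 × 3 × 3 patch size is an assumption for illustration (the text leaves p, q, r unspecified), and both function names are hypothetical.

```python
import numpy as np

def extract_features(t1, flair, patch=(3, 3, 3)):
    """Concatenate a p*q*r patch from each modality into a 2d-vector per voxel,
    where d = p*q*r. Edge-padding keeps border voxels usable."""
    p, q, r = patch
    pads = [(p // 2, p // 2), (q // 2, q // 2), (r // 2, r // 2)]
    t1p = np.pad(t1, pads, mode="edge")
    flp = np.pad(flair, pads, mode="edge")
    feats = []
    for i in range(t1.shape[0]):
        for j in range(t1.shape[1]):
            for k in range(t1.shape[2]):
                a = t1p[i:i + p, j:j + q, k:k + r].ravel()
                b = flp[i:i + p, j:j + q, k:k + r].ravel()
                feats.append(np.concatenate([a, b]))
    return np.asarray(feats)  # shape: (n_voxels, 2 * p * q * r)

def balance_classes(X, y, rng=None):
    """Subsample without replacement so every tissue label appears equally often."""
    rng = np.random.default_rng(rng)
    n = min(int(np.sum(y == c)) for c in np.unique(y))
    idx = np.concatenate([rng.choice(np.where(y == c)[0], n, replace=False)
                          for c in np.unique(y)])
    return X[idx], y[idx]
```

The triple loop is written for clarity; a production version would vectorize the patch extraction (e.g. with a sliding-window view).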
We pursue a classification tree solution, which enables us to directly use the training algorithm described in [5] to train a bagged ensemble of decision trees. A single decision tree partitions our 2d-dimensional space by splitting different dimensions using a learned threshold. During training, one third of the attributes are randomly considered for each split, and the one that best minimizes the Gini impurity criterion, after deciding a threshold, is chosen as the dimension to split upon. A single decision tree is a weak learner and in general has higher error, so we use a bagged ensemble of decision trees, which reduces error through bootstrap aggregation. The ensemble consists of n trees, each learned from a bootstrapped dataset created by randomly sampling with replacement from the whole training dataset N times, where N is the number of samples in the entire training data. We limit the depth of each tree by fixing the number of samples accumulated at a leaf to five, thus preventing over-fitting. Prediction is done by passing a test feature vector through each tree, traversing the nodes by observing the splitting criterion and threshold at each node until a leaf node is reached. The predicted label is calculated by voting among the training data vectors present in the leaf. The training data consists of ∼10^6 samples from the five training subjects; with training done in parallel, we can create a trained ensemble of decision trees in eight to ten minutes on an 8-core, 2.73 GHz machine, while prediction on a new unseen data set takes less than two minutes on the same machine.
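A configuration matching the settings above (bootstrap aggregation, one third of the attributes per split, Gini impurity, five samples per leaf) can be sketched with scikit-learn's `RandomForestClassifier`; note this is a stand-in for the training algorithm of [5], not the authors' implementation, and scikit-learn aggregates per-tree class probabilities rather than pooling raw leaf votes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_ensemble(X, y, n_trees=100, seed=0):
    """Bagged decision trees: each tree is grown on a bootstrap sample,
    considers a third of the attributes at every split, splits on the Gini
    impurity criterion, and stops once a leaf accumulates ~5 samples."""
    clf = RandomForestClassifier(
        n_estimators=n_trees,
        max_features=1 / 3,       # one third of the 2d attributes per split
        criterion="gini",
        min_samples_leaf=5,       # limits depth, preventing over-fitting
        bootstrap=True,           # sample N times with replacement per tree
        n_jobs=-1,                # train the trees in parallel
        random_state=seed,
    )
    return clf.fit(X, y)
```

Prediction is then simply `clf.predict(X_test)`, which passes each feature vector down every tree and combines the leaf votes.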

Results
We perform three experiments to demonstrate the practicality of this new segmentation method. The first is a leave-one-out cross-validation on the training data, the second is an analysis of 12 additional subjects from the same cohort as the training data, and the third is a study of the accuracy of the estimated CSF/GM and GM/WM boundaries using manually picked landmarks. The training and test data consist of T1-w and FLAIR images, both with a resolution of 0.958 × 0.958 × 3.0 mm, with the manual segmentation conducted in the same space. Our landmark cohort is made up of five healthy subjects (3 females) with a mean age of 39.4 years (range: 30-49) from Landman et al. [11], with the T1-w and FLAIR images having an isotropic resolution of 1.1 mm³. Two raters (Raters A and B) placed 10 landmark points on the inner and outer boundaries of the cortex in each of 21 coarsely selected regions, resulting in each rater picking 210 landmarks per surface for each of the five subjects.

Cross-Validation
In each round of our cross-validation experiment, we removed a single data set from the training sample of five subjects and trained our decision trees as described in Sec. 2 on the four remaining data sets. The trained decision tree ensemble was then tested on the held-out data, with evaluation on the three classes using the Dice score, the 95% Hausdorff distance, and the absolute volume difference. The results are reported in Table 1; see Babalola et al. [2] and Dubuisson et al. [9] for an explanation of the metrics used. To provide a baseline for comparison, we computed the same metrics after using FreeSurfer to segment the data, also reported in Table 1. Fig. 2 shows three orientations of a training data set with the T1-w, FLAIR, manual segmentation, and the result of our algorithm. The red arrow in Fig. 2 denotes a region in the midsagittal plane where our algorithm appears to make a more sensible decision than the human rater by leaving a clear separation between the hemispheres.
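For reference, the three evaluation metrics can be sketched on binary tissue masks as follows. This follows one common symmetric-surface-distance formulation of the 95% Hausdorff distance (in voxel units); the exact definitions used in the paper are those of [2] and [9].

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, binary_erosion

def dice(a, b):
    """Dice overlap between two binary masks: 2|A∩B| / (|A|+|B|)."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def abs_vol_diff(a, b, brain_vol):
    """Absolute volume difference as a percentage of total brain volume."""
    return 100.0 * abs(int(a.sum()) - int(b.sum())) / brain_vol

def hd95(a, b):
    """Symmetric 95th-percentile surface distance between two masks (voxels)."""
    def surface(m):
        m = m.astype(bool)
        return m & ~binary_erosion(m)   # outer shell of the mask
    sa, sb = surface(a), surface(b)
    da = distance_transform_edt(~sb)[sa]  # A-surface voxels -> nearest B surface
    db = distance_transform_edt(~sa)[sb]  # B-surface voxels -> nearest A surface
    return np.percentile(np.concatenate([da, db]), 95)
```

Multiplying the distance transforms by the voxel spacing would convert `hd95` to millimeters.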

Test Data
For evaluation on the test data, we trained our decision trees on all five subjects in the training data and then used this trained ensemble to predict the segmentation of the test data. The results are shown in Table 2, and some example segmentations for the test data are given in Fig. 3.

Landmark Validation
The same trained ensemble that we used on the twelve test subjects was applied to our landmark cohort. With each landmark representing either the CSF/GM or GM/WM interface, we computed the shortest distance from each landmark to the corresponding boundary as defined by our voxel-based segmentation. For comparison to the state of the art, we also ran FreeSurfer on each of the landmark data sets and computed the shortest distance between each landmark and the appropriate surface generated by FreeSurfer. The results are shown in Table 3.

Discussion
Our method performs well on all three metrics with respect to GM and WM segmentation. In comparison to the hard segmentation generated by FreeSurfer on the training data, we are clearly much better on all three metrics. Our inferior results for CSF segmentation on the test data are in large part due to skull stripping differences between the training and test subjects; this is best evidenced by the 95% Hausdorff distance and the absolute volume difference, which show a very large discrepancy in volume and in the distance between mislabeled voxels for CSF, as our CSF volume extends outside the CSF volume labeled by the manual experts. Our landmark data provide further confirmation that our estimates of the WM/GM and GM/CSF boundaries are close to the state of the art, even though they are defined only at the voxel level, rather than at the sub-voxel level as in surface generation software tools.
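The landmark-to-boundary distance used in the landmark validation can be sketched as below: the boundary of a voxel-level tissue mask is taken as its outer shell, and each landmark is matched to its nearest shell voxel. This is a voxel-level stand-in (the FreeSurfer comparison instead measures distance to a reconstructed surface), and both function names are hypothetical.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial import cKDTree

def boundary_coords(mask, spacing=(1.0, 1.0, 1.0)):
    """Physical coordinates (mm) of the outer shell of a binary tissue mask."""
    mask = mask.astype(bool)
    shell = mask & ~binary_erosion(mask)
    return np.argwhere(shell) * np.asarray(spacing)

def landmark_distances(landmarks, mask, spacing=(1.0, 1.0, 1.0)):
    """Shortest Euclidean distance from each landmark (mm) to the mask boundary."""
    tree = cKDTree(boundary_coords(mask, spacing))
    dists, _ = tree.query(np.asarray(landmarks, dtype=float))
    return dists
```

The `spacing` argument maps voxel indices to physical coordinates; landmarks are assumed to already be expressed in the same physical space as the segmentation.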

Fig. 1 .
Fig. 1. An overview of our algorithm. The input training data are converted into patches, which are fed into our random forest training. This outputs a learned ensemble of decision trees, which is used on the test data to predict a subject's segmentation.

Fig. 2 .
Fig. 2. Each row shows a specific orientation from a training data set. From left to right, the columns are: T1-w, FLAIR, manual segmentation, and the result of our algorithm.

Fig. 3 .
Fig. 3. The top row shows a sagittal view comparison of the skull stripping on training and test data. The bottom row shows an axial view of the T1-w, FLAIR, and our segmentation of the same test subject.

Table 1 .
Cross-validation results on the five training subjects, performed by training on four data sets and evaluating on the fifth. We report the Dice score, 95% Hausdorff distance (HD), and the absolute volume difference (Abs. Vol. Diff.) as a percentage of total brain volume. More details about the computation of these metrics are available from Babalola et al. [2] and Dubuisson et al. [9]. For comparison purposes, we include the results of FreeSurfer on the same data.

Table 2 .
Results on the 12 test subjects. We report the Dice score, 95% Hausdorff distance (HD), and the absolute volume difference (Abs. Vol. Diff.) as a percentage of total brain volume. More details about the computation of these metrics are available from Babalola et al. [2] and Dubuisson et al. [9]. AIS denotes all internal structures.

Table 3 .
Landmark results based on five subjects with 420 manually picked landmarks, with 210 landmarks on each of the inner and outer surfaces, by two raters.