Ensemble segmentation for GBM brain tumors on MR images using confidence-based averaging

Purpose: Ensemble segmentation methods combine the segmentation results of individual methods into a final one, with the goal of achieving greater robustness and accuracy. The goal of this study was to develop an ensemble segmentation framework for glioblastoma multiforme tumors on single-channel T1w postcontrast magnetic resonance images. Methods: Three base methods were evaluated in the framework: fuzzy connectedness, GrowCut, and voxel classification using support vector machine. A confidence map averaging (CMA) method was used as the ensemble rule. Results: The performance was evaluated on a comprehensive dataset of 46 cases including different tumor appearances. The accuracy of the segmentation result was evaluated using the F1-measure between the semiautomated segmentation result and the ground truth. Conclusions: The results showed that the CMA ensemble result statistically approximates the best segmentation result of all the base methods for each case. © 2013 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution 3.0 Unported License. [http://dx.doi.org/10.1118/1.4817475]


INTRODUCTION
Glioblastoma multiforme (GBM), a World Health Organization (WHO) grade IV astrocytoma, is the most common human brain tumor, comprising about 12%-15% of all primary central nervous system (CNS) tumors and accounting for about 50%-60% of all astrocytomas. 1 Survival for patients with glioblastoma, although individually variable, averages 14 months after diagnosis. 2 Clinical trials are investigating effective treatments for GBM brain tumors, and imaging is playing an important role. Contrast-enhanced tumor size change on serial imaging studies, measured using 1D and 2D diameters, is used as a surrogate endpoint. Computer-aided volumetric methods are also under investigation, which can be more effective than diameters when the tumor contains a nonenhanced core or has an irregular shape. In clinical studies, manual contouring has been used to segment tumors on MR images. For example, in a recent clinical study correlating methylated-DNA-protein-cysteine methyltransferase (MGMT) promoter methylation with imaging features of GBM tumors, Drabycz et al. 3 used manual contouring for GBM brain tumor segmentation. An accurate and robust automated segmentation system would facilitate quantitative analysis in clinical studies. In Fig. 1, we show a 2D slice of a T1-weighted postcontrast magnetic resonance (MR) image presenting an enhancing GBM brain tumor with the outline of the active tumor region.
Automatic GBM brain tumor segmentation is a challenging task, since brain tumors are heterogeneous and highly variable in size, location, shape, and appearance. They also often deform adjacent structures in the brain. Some artifacts of MR imaging also increase the difficulty of tumor segmentation. Imperfection of the RF pulses and the location of RF coils may introduce nonuniformity in MR images. This study focuses on recurrent GBM brain tumors that develop after surgery, many of which contain a cavity, and the enhancing portions can vary in shape, for example, ring-shape, bloblike shape, or multiple components attached to the cavity or dispersed into the brain tissue (see Fig. 2). Furthermore, when patients are scanned at multiple centers, with different scanners and contrast agent injection protocols, the image intensity contrast can vary greatly. These factors make GBM brain tumor segmentation a very challenging problem in a clinical setting, and there is a lack of studies evaluating GBM brain tumor segmentation methods in a large clinical dataset. Computer-based brain tumor segmentation has remained largely experimental work. Many efforts have exploited MRI's multidimensional data capability through multispectral analysis. 4-9 There are generally several categories of techniques: knowledge-based, clustering, voxel-based classification, level set, and graph-based techniques.
Knowledge-based segmentation systems typically use a brain atlas to provide prior information. Fletcher-Heath et al. 10 applied a knowledge-based system to segment nonenhancing tumors. Prastawa et al. 11 applied outlier detection to find abnormal regions, applied k-means clustering (k = 2) to separate tumor and edema, and then a region competition method using level-sets to add a smoothness constraint. In the study, they used T1-weighted precontrast and T2-weighted images, without contrast injection. However, in clinical trials, tumor definition is based on T1-weighted postcontrast images. They reported that the intrareader variability could be as low as 59.4%.
Among the clustering techniques, fuzzy clustering methods are the approach most widely employed across all tumor types. Fuzzy C-means (FCM) clustering is used frequently, since it does not require training data. Phillip et al. 4 were the first to apply FCM clustering to GBM brain tumor segmentation, and correlated the segmentation with tumor histology. The limitation of the study is that it did not include a quantitative validation of the method. Beevi and Sathik 12 applied an efficient denoising algorithm before FCM and incorporated spatial probability to deal with the sensitivity to noise. The limitation of the study is that the method was validated on one clinical brain MR scan with unknown tumor type. Khotanlou et al. 13 performed symmetry analysis and fuzzy clustering to initialize the segmentation, and combined a deformable model and spatial relations to refine it. It was not clear whether the method was evaluated on GBM tumors, and it would be interesting to evaluate the method on images from GBM clinical trials. Aside from FCM, Ahmed and Mohamad 14 performed k-means clustering combined with anisotropic diffusion denoising and evaluated it on one MR scan. Liu et al. 9 developed a semiautomated system using the fuzzy-connectedness method and evaluated the overall volume accuracies for 20 patients. The method requires additional steps to remove attached brain structures. Clark et al. 5 used a knowledge-based system including five stages, using T1-weighted, T2-weighted, and PD-weighted image intensity. In each stage, various heuristic parameters are applied. The performance is reported as a correspondence ratio that ranges from 0.43 to 0.85 in 16 scans from seven patients. They used 17 slices from three patients to set up the heuristic parameters. It is not clear how difficult and practical it would be to set universal parameter values in the setting of a large clinical trial, considering the variability of GBM tumors.
Voxel-based supervised classification methods have been investigated by a number of researchers. 15 Vinitski et al. 16 developed a system using a k-nearest neighbor classifier (kNN) to segment multiple sclerosis (MS) lesions and brain tumors from a limited number of patients. Validation with more tumor cases is needed to apply the method in clinical trials. Jolesz and co-workers 17 developed an adaptive template-moderated (ATM) classification algorithm (ATS) which incorporated a brain atlas to include spatial anatomical [...] 20 to GBM tumor segmentation. The system used the difference between T1w pre and postcontrast images to develop tumor and edema priors, and formed a Gaussian mixture model framework solved by the expectation-maximization (EM) technique. The performance is reported as an overlap ratio of 0.49-0.92 from five patients. The system was extended for GBM tumor segmentation by adding the tumor and edema classes. 21 One limitation of the study is that they did not provide a prior in the model for necrosis, cyst, or cavity. It is common for GBM tumors to have necrosis or a surgical cavity, especially in recurrent GBM. Another limitation of the study is that the simplified geometric model for tumor shape cannot cope with tumors that have complex appearance and poorly defined boundaries. Zhang et al. 22 used baseline images for training and follow-up images for testing. The method was tested on five scans of one tumor case. The application is limited since the GBM tumors on the baseline images still need to be manually contoured. Schmidt et al. 23 developed alignment-based features including a spatial prior, symmetry, intensity, and multiscale texture. The dataset included ten patients with one cavitated tumor from two sites. They reported an average overlap of 0.732. However, performance for the active tumor volume is not clear. Lee et al. 24 applied a discriminative random fields (DRFs) model with a support vector machine (SVM) and reported performance of 0.53-0.89 overlap ratio for 12 scans from seven patients. The weakness of the study is that they used patient-specific training, which means training and testing voxels are from the same patient, and manual contouring is still needed for each patient. Ayachi and Amor 25 applied an SVM, using nine slices from each tumor for training and the rest of the slices from the same patient for testing, and reported a 0.82 true positive rate for four cases. However, with patient-specific training, manual contouring is still needed. Zhang et al. 26 applied a multikernel SVM, and again the limitation of the study is the need for patient-specific training.
Level set and graph-based methods have also been explored for brain tumor segmentation. Ho et al. 6 ran a level set algorithm on probabilities derived from a T1w pre and postcontrast difference image. They reported an 80%-90% overlap ratio on three tumors of blob-like shape. However, it is not clear how the method performs for irregular tumor shapes. Popuri et al. 27 extracted a clustered feature set, integrated it into a level set framework, and used a Dirichlet prior to exclude the surrounding tissues. They showed success in differentiating tumor from normal tissue by incorporating shape information; however, it is not clear how it performs for GBM tumors, which usually have irregular shapes. Taheri et al. 28 used a threshold-based speed function for level-set function evolution. Corso et al. 7,29 developed a segmentation by weighted aggregation (SWA) algorithm based on the graph shift algorithm for GBM brain tumor segmentation. Dube et al. 8 incorporated texture features into the SWA framework and applied it to GBM brain tumor segmentation on one-channel MRI using T1-weighted postcontrast MRI.
The study achieved 70% accuracy for the majority of the cases; however, the failure cases will need to be addressed before it is ready for the clinic. Recently, features other than intensity were studied, including gray-level co-occurrence matrix (GLCM) features, 30 discrete cosine transform (DCT) features, 31 and the Gabor wavelet filter. 32
In summary, most of the literature reports the use of multichannel MR to segment GBM tumors, while segmentation on a single-channel MR has only been reported infrequently. 8 Although multichannel MR sequences are useful in differentiating brain tissues and disease, they are usually acquired at low resolutions, with slice gaps, and images from different sequences are often not aligned. Images can be realigned to a reference series, but the resliced image series can suffer from lower resolution along the slice axis as well as slice gaps. It is now possible to perform high-resolution 3D imaging using various contrast mechanisms (T1w, T2w, FLAIR) with identical image parameters for each image set on modern MR scanners. However, even same-resolution T2w or FLAIR scans are not acquired at the same time, so registration is still needed to align them, which might be a source of errors. Because of the additional scan time, acquiring high-resolution T2w and FLAIR images is not standard of care in current clinical practice. Segmentation on single-channel T1 postcontrast isotropic data is potentially important in determining tumor volume for therapeutic response assessment in clinical trials.
Most of the papers used a small dataset of fewer than ten cases to evaluate their methods. It is not clear whether the techniques can handle the more difficult and irregular GBM tumors that inevitably arise in larger clinical datasets.
There is also limited investigation of irregular recurrent GBM tumors. Tumor recurrence can happen around the surgical cavity or at a distant site, and can show a diffuse pattern under anti-VEGF drugs. These factors increase the difficulty of recurrent GBM tumor segmentation compared to newly diagnosed GBM tumors.
The contribution of this study is to investigate an ensemble approach to GBM tumor segmentation that combines results from three individual general-purpose segmentation algorithms, aiming to achieve high accuracy in GBM tumor segmentation.
There has been active research on combining multiple segmentation results. In the field of supervised learning, Kittler et al. 33 summarized the different schemes for combining results from multiple classifiers. In the field of unsupervised clustering, Ghaemi et al. 34 performed a survey of methods in clustering ensembles. For image segmentation applications, Grady 35 and Wattuya et al. 36 developed an algorithm to combine multiple segmentation results of natural images using the random walker method. Rohlfing et al. 37 studied atlas-based segmentation of biomedical images. They proposed to estimate the performances of the base classifiers and combine their respective outputs by weighting them according to their estimated performance. This method is realized as a multiclass extension of an EM algorithm for ground truth estimation from a binary classification based on decisions of multiple experts. 38 Aljabar et al. 39,40 applied the majority voting rule 33 to combine segmentation results from atlas-based segmentation and presented a thorough evaluation on brain MR images. Ensemble segmentation showed its potential in these applications, and we apply it here to the segmentation of GBM brain tumors.
In this study, we propose an ensemble technique, applied to semiautomated GBM brain tumor segmentation on T1w postcontrast volumetric MR images, and evaluate the performance on a dataset with 46 tumor cases from a clinical trial research database. There are two steps involved. The first step is to generate input segmentation candidates from different algorithms. Three general-purpose segmentation methods were applied to generate input segmentations: fuzzy connectedness, 9 GrowCut, 41 and voxel classification using support vector machines (SVM). 42 The second step is to combine them to generate a final result. The ensemble scheme was confidence map averaging (CMA). The CMA method was adopted based on the assumption that the majority of the base methods are correct and that errors from each method are independent, so that they will be averaged out in the ensemble result. To our knowledge, we are the first to investigate ensemble segmentation for GBM tumor segmentation on single-channel MR images (T1w postcontrast), and to evaluate base methods and their ensemble on a relatively large dataset of 46 GBM tumors including different types of GBM tumor appearance patterns, in comparison to datasets of 5-20 cases in the prior literature.

2.A. Input segmentations
We explored three algorithms as base methods including two semiautomated methods and one learning-based technique: fuzzy connectedness, GrowCut, and voxel classification using SVM. The fuzzy connectedness method was selected because it was reported to work well for semiautomated GBM brain tumor segmentation by Liu et al. 9 The GrowCut (GC) method was chosen due to its simple user interaction mechanism, straightforward implementation, and promising performance in our pilot study. 43 For these two semiautomated methods, user input seeds are provided in the tumor and background regions. SVM classification was chosen as a general-purpose method and adapted to this specific application by learning from examples.

2.A.1. Fuzzy connectedness
The fuzzy connectedness (FC) segmentation framework assigns fuzzy affinities to the target object during classification, to capture global "hanging togetherness" of voxels. The first step of the algorithm involves computing an "affinity" map, a local fuzzy relation, which quantifies the connectedness of any pixel pair in the original image; the second step calculates the "fuzzy connectedness," the global fuzzy relation with one specific (designated) pixel belonging to the object of interest.
We implemented the algorithm following Liu et al.'s work 9 since it has been previously applied to the GBM brain tumor segmentation task. The affinity between any two voxels c and d, denoted by μ_k(c, d), is computed from two functions, h_1 and h_2, of the voxel intensity values f(c) and f(d) at c and d, respectively; the functional forms of h_1 and h_2 follow Liu et al. 9
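The two steps of the FC method (local affinity, then global connectedness to a seed) can be illustrated with a short sketch. It is not the implementation used in this study: the Gaussian affinity below is a form commonly seen in the fuzzy connectedness literature and is assumed here only for illustration, as are the equal weights w1 and w2, the parameters m2 and s2, the face-connected neighborhood, and all function names. Connectedness to the seed is taken as the strength of the strongest path, where the strength of a path is its weakest affinity.

```python
import heapq
import numpy as np

def affinity(fc, fd, m1, s1, m2, s2, w1=0.5, w2=0.5):
    """Assumed Gaussian affinity between two adjacent voxel intensities."""
    h1 = np.exp(-0.5 * ((0.5 * (fc + fd) - m1) / s1) ** 2)  # object-feature term
    h2 = np.exp(-0.5 * ((abs(fc - fd) - m2) / s2) ** 2)     # homogeneity term
    return w1 * h1 + w2 * h2

def fuzzy_connectedness(image, seed, m1, s1, m2=0.0, s2=10.0):
    """Max-min path strength from `seed` to every voxel (Dijkstra-like search)."""
    conn = np.zeros(image.shape, dtype=float)
    conn[seed] = 1.0
    heap = [(-1.0, seed)]  # max-heap emulated with negated strengths
    while heap:
        neg_strength, p = heapq.heappop(heap)
        strength = -neg_strength
        if strength < conn[p]:
            continue  # stale heap entry, a stronger path was already found
        for axis in range(image.ndim):
            for step in (-1, 1):
                q = list(p)
                q[axis] += step
                if not 0 <= q[axis] < image.shape[axis]:
                    continue
                q = tuple(q)
                link = float(affinity(image[p], image[q], m1, s1, m2, s2))
                candidate = min(strength, link)  # a path is as strong as its weakest link
                if candidate > conn[q]:
                    conn[q] = candidate
                    heapq.heappush(heap, (-candidate, q))
    return conn

# Toy 2D example: a bright 2x2 "tumor" next to dark background.
image = np.array([[100.0, 100.0, 10.0],
                  [100.0, 100.0, 10.0],
                  [10.0, 10.0, 10.0]])
conn = fuzzy_connectedness(image, seed=(0, 0), m1=100.0, s1=10.0)
```

In this toy image the bright voxels stay fully connected to the seed, while every path to a dark voxel has to cross a weak bright-to-dark link, so the dark voxels receive a connectedness close to zero.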

2.A.2. GrowCut
The GC method 41 is based on cellular automata theory. Formally, a cellular automaton (CA) is a triple (S, N, δ), where S is the state set, N is the neighborhood, and δ: S^N → S is the local transition function, where S^N indicates the states of the neighborhood cells at a given time, while S is the state of the central cell at the next time step. In the GC method, the cells correspond to image voxels, and the cell state S = (C, l, θ) for each voxel consists of the image feature vector C, which is intensity in this study, the label l indicating the category to which the voxel belongs, and the strength θ in the continuous range [0, 1] indicating the confidence in the current labeling.
The GC method uses CA theory to interactively label the image volume using user-supplied seeds. The user starts the segmentation by supplying seed points comprising both tumor and background voxels; the seeds' labels are set to the respective category labels, while their strength is set to 1. This sets the initial state of the cellular automaton. Strengths for unlabeled cells are set to 0. In each iteration t, each cell tries to "attack" the neighboring voxels by calculating the local intensity similarity; accordingly, the label map and the strength map are updated until convergence. The algorithm converges to a stable configuration, where cell states no longer change. The pseudo code for the GC algorithm is shown in Fig. 3, where N(p) is the 26-neighbor system of a voxel p in 3D, and g is a monotonically decreasing function bounded within [0, 1].
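The update rule described above can be sketched as follows. This is an illustrative sketch rather than the code used in the study: it assumes g(x) = 1 - x / max|C|, the choice proposed in the original GrowCut paper, lets the strongest attacker win in each iteration, and treats label 0 as "unlabeled". The function name and the toy image are hypothetical. The neighborhood is built from all offsets in {-1, 0, 1}, which gives 8 neighbors in 2D and 26 in 3D.

```python
import itertools
import numpy as np

def growcut(image, seed_labels, max_iter=200):
    """GrowCut sketch: seed_labels holds 0 for unlabeled voxels, >0 for seeds."""
    image = image.astype(float)
    labels = seed_labels.copy()
    strength = (labels != 0).astype(float)      # seeds start at 1, others at 0
    max_diff = image.max() - image.min()
    if max_diff == 0:
        max_diff = 1.0                          # avoid division by zero on flat images
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=image.ndim) if any(o)]
    for _ in range(max_iter):
        img_pad = np.pad(image, 1, mode="edge")
        lab_pad = np.pad(labels, 1, mode="constant")    # outside cells: label 0
        str_pad = np.pad(strength, 1, mode="constant")  # outside cells: strength 0
        new_labels = labels.copy()
        new_strength = strength.copy()
        for off in offsets:
            # View of the neighbor q = p + off for every voxel p.
            sl = tuple(slice(1 + o, 1 + o + n) for o, n in zip(off, image.shape))
            g = 1.0 - np.abs(image - img_pad[sl]) / max_diff  # monotonically decreasing in |C_p - C_q|
            attack = g * str_pad[sl]
            wins = attack > new_strength        # neighbor q conquers cell p
            new_labels[wins] = lab_pad[sl][wins]
            new_strength[wins] = attack[wins]
        if np.array_equal(new_labels, labels) and np.array_equal(new_strength, strength):
            break                               # stable configuration reached
        labels, strength = new_labels, new_strength
    return labels, strength

# Toy example: dark background (left) and bright tumor (right), one seed each.
image = np.tile(np.array([10.0, 10.0, 10.0, 200.0, 200.0, 200.0]), (3, 1))
seeds = np.zeros(image.shape, dtype=int)
seeds[1, 0] = 2   # background seed
seeds[1, 5] = 1   # tumor seed
labels, strength = growcut(image, seeds)
```

The returned strength map is the quantity that is later rescaled into a confidence map for the ensemble.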

2.A.3. Voxel classification using SVM
The support vector machine (SVM) 42 is a supervised learning algorithm. It constructs a separating hyperplane in a multidimensional feature space that maximizes the margin between two classes. To calculate the margin, two parallel hyperplanes are constructed, one on each side of the separating hyperplane, which are "pushed up against" the samples from the two classes. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the neighboring data points of both classes, since in general the larger the margin the lower the generalization error of the classifier. Given a set of n labeled data points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where y_i = ±1, SVM searches for an optimal separating hyperplane ⟨w, x⟩ + b = 0, where w ∈ R^n, x ∈ R^n, and b ∈ R.
During the classifier training, voxels from manually contoured tumors are used as positive (tumor) examples, and an equal number of voxels sampled outside the tumor are used as negative (background) examples. For each training sample, a set of imaging features is calculated: intensity, gradient magnitude, first-order Gaussian derivatives (in three directions), second-order Gaussian derivatives (six in total), and the three eigenvalues of the Hessian matrix. These 14 features are calculated at three different scales: 1, 2, and 4 pixels. In total, the feature vector contains 14 × 3 = 42 features derived from the images.
To apply the voxel classification to the test scan, the set of 42 features is calculated for each voxel and input to the trained classifier, which outputs for each voxel a score ranging from 0 to 1 that indicates how likely the voxel is to belong to a tumor.
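As a rough illustration of how the 42-dimensional feature vector can be assembled, the sketch below computes the listed features with separable Gaussian derivative filters using NumPy only. Several details are assumptions not stated in the text: the scale is interpreted as the Gaussian standard deviation in voxels, "intensity" is taken as the Gaussian-smoothed intensity at each scale, the kernels are truncated at four standard deviations, and borders are handled by edge replication. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma, order):
    """Sampled 1D Gaussian (order 0) or its first/second derivative."""
    radius = int(4 * sigma + 0.5)
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    if order == 1:
        return -x / sigma ** 2 * g
    if order == 2:
        return (x ** 2 / sigma ** 4 - 1 / sigma ** 2) * g
    return g

def gaussian_derivative(volume, sigma, orders):
    """Separable Gaussian derivative; orders gives the derivative order per axis."""
    out = volume.astype(float)
    for axis, order in enumerate(orders):
        k = gaussian_kernel(sigma, order)
        pad = len(k) // 2
        out = np.apply_along_axis(
            lambda v: np.convolve(np.pad(v, pad, mode="edge"), k, mode="valid"),
            axis, out)
    return out

def voxel_features(volume, scales=(1.0, 2.0, 4.0)):
    """Return an array of shape volume.shape + (14 * len(scales),)."""
    first_orders = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    second_orders = {(0, 0): (2, 0, 0), (0, 1): (1, 1, 0), (0, 2): (1, 0, 1),
                     (1, 1): (0, 2, 0), (1, 2): (0, 1, 1), (2, 2): (0, 0, 2)}
    feats = []
    for sigma in scales:
        smooth = gaussian_derivative(volume, sigma, (0, 0, 0))       # 1 feature
        first = [gaussian_derivative(volume, sigma, o) for o in first_orders]  # 3
        grad_mag = np.sqrt(sum(d ** 2 for d in first))               # 1
        hessian = np.empty(volume.shape + (3, 3))
        second = []
        for (i, j), o in second_orders.items():                      # 6
            d = gaussian_derivative(volume, sigma, o)
            hessian[..., i, j] = hessian[..., j, i] = d
            second.append(d)
        eig = np.linalg.eigvalsh(hessian)                            # 3, ascending
        feats += [smooth, grad_mag] + first + second + [eig[..., k] for k in range(3)]
    return np.stack(feats, axis=-1)

# Tiny random volume, only to check the feature count.
volume = np.random.default_rng(0).random((6, 6, 6))
features = voxel_features(volume)
```

Each voxel's 42-element row of the returned array would then be passed to the trained classifier.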

2.B. Combining input segmentations by confidence map averaging
A voxelized confidence map (CM) is generated for each base segmentation method. For the SVM method, the output score map was used as the CM. For the GC method, a strength map is generated by the algorithm, and we transform the strength map into a confidence map by linearly rescaling the foreground strength to [0.5, 1] and the background strength to [0, 0.5]. For the fuzzy connectedness method, the membership value is linearly rescaled to [0, 1] as the CM. The three base methods (N = 3) are combined by confidence map averaging (CMA); the output of the ensemble is the average of the three confidence maps generated by the three base methods, weighting each of the three methods equally. Figure 4 shows the confidence maps from the three individual methods for one tumor. The CMA result is later thresholded to obtain the binary segmentation.
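The rescaling and averaging steps can be written compactly. The sketch below is illustrative only: the function names and toy values are hypothetical, the direction of the background rescaling (a strong background label maps to a confidence near 0) is an assumption, and a fixed threshold of 0.5 stands in for the adaptive threshold used in the experiments.

```python
import numpy as np

def growcut_confidence(strength, is_foreground):
    """Map GC strength to [0.5, 1] for foreground and [0, 0.5] for background."""
    strength = np.asarray(strength, dtype=float)
    return np.where(is_foreground, 0.5 + 0.5 * strength, 0.5 - 0.5 * strength)

def rescale_unit(values):
    """Linearly rescale values (e.g., FC membership) to [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return (values - values.min()) / span if span > 0 else np.zeros_like(values)

def confidence_map_average(confidence_maps):
    """Equal-weight voxelwise average of the base confidence maps."""
    return np.mean(np.stack(confidence_maps), axis=0)

# Three toy voxels (values are made up for illustration).
svm_score = np.array([0.9, 0.7, 0.2])
gc_conf = growcut_confidence([1.0, 0.2, 1.0], np.array([True, False, False]))
fc_conf = rescale_unit([0.8, 0.5, 0.2])
cma = confidence_map_average([svm_score, gc_conf, fc_conf])
segmentation = cma > 0.5  # fixed threshold, assumed here for simplicity
```

In the second toy voxel, GC weakly labels the voxel as background (confidence 0.4), but the averaged confidence still exceeds 0.5 because the other two methods lean toward tumor, which is the kind of error averaging the ensemble relies on.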

EXPERIMENTS
This study used 46 GBM tumor cases from 45 patients, drawn from a 60-subject multisite research database. Fifteen patients were excluded due to either lack of available manual gold standard tumor contours, anisotropy of voxel size, or variation in image resolution.
The imaging protocol for the T1w sequences was 3D volumetric acquisition in the axial plane using the fast spoiled gradient echo (FSPGR) sequence or the magnetization-prepared rapid gradient-echo (MP-RAGE) sequence with 1 mm slice thickness, 0.9 mm by 0.9 mm pixel size, and 256 × 256 in-plane resolution.
The ground truth for the segmentation was manually contoured by a board-certified neuroradiologist with ten years of experience, with the facilitation of a semiautomated segmentation tool 44 from an in-house software system QIWS (quantitative imaging work station). The brain volume was preprocessed to remove nonbrain matter and obtain consistent image intensities across all subjects for the given MR channel by the following steps: (1) skull-stripping using FSL; 45 (2) B1 field correction and intensity normalization using Freesurfer 46 to standardize the intensity of MR images acquired from different medical centers.
In order to reduce the processing time, we applied the algorithms in a predefined volume of interest (VOI). For each 3D MR volume, the user visually identified the start and end slice of the tumor, and provided manual seeds on the center slice of the tumor to initialize the GC method. With this information, the VOI can then be generated. First, the bounding box of the input seeds on the tumor center slice is extended 25 mm along each in-plane direction to enclose the whole tumor; then, the bounding box is extended in the z-direction to the start and end slice to obtain the VOI. Calculation time is thereby reduced by applying the segmentation framework only within the VOI instead of the whole brain volume.
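The VOI construction amounts to a small amount of index arithmetic. The sketch below is a minimal illustration under stated assumptions: the volume is indexed as (slice, row, column), seeds are given as (row, column) pairs on the center slice, the 0.9 mm in-plane pixel size from the imaging protocol is used to convert the 25 mm margin to voxels, and the box is clipped to the image bounds. The function name and the example coordinates are hypothetical.

```python
import math

def build_voi(seeds_rc, start_slice, end_slice, shape_zyx,
              pixel_mm=0.9, margin_mm=25.0):
    """Return (z, row, col) slices covering the seeds plus an in-plane margin."""
    margin = math.ceil(margin_mm / pixel_mm)       # 25 mm -> 28 voxels at 0.9 mm
    rows = [r for r, _ in seeds_rc]
    cols = [c for _, c in seeds_rc]
    # Extend the seed bounding box in-plane and clip to the image.
    r0 = max(min(rows) - margin, 0)
    r1 = min(max(rows) + margin, shape_zyx[1] - 1)
    c0 = max(min(cols) - margin, 0)
    c1 = min(max(cols) + margin, shape_zyx[2] - 1)
    # Extend along z to the user-identified start and end slices (inclusive).
    return (slice(start_slice, end_slice + 1),
            slice(r0, r1 + 1),
            slice(c0, c1 + 1))

voi = build_voi([(100, 120), (110, 130)], start_slice=40, end_slice=60,
                shape_zyx=(160, 256, 256))
```

With a NumPy volume, `volume[voi]` would then extract the subvolume on which the base methods are run.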
We applied the proposed framework with the following parameter setup. For GC, users provided 3-5 seed points in the foreground and background structures, respectively, on the center slice of the tumor. The FC algorithm used the same foreground seeds, and the parameters s1, m1, a1, and a2 were chosen as described in Sec. 2. The SVM voxel classification did not utilize the seeds. The SVM was trained for each leave-one-tumor-out iteration, resulting in 45 runs.
To obtain the binary segmentation results, the outputs of the SVM, fuzzy connectedness, and the CMA ensemble were thresholded adaptively using the Otsu method. 44 The binary segmentations from SVM and CMA were further processed by a connected component analysis to remove speckle noise, involving (1) removal of components smaller than 27 voxels and (2) removal of components including background seeds but no foreground seeds. The accuracy of the segmentation result was evaluated using the F1-measure (ranging from 0 to 1) 47 between the semiautomated segmentation result and the ground truth.
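For reference, the thresholding and evaluation steps can be sketched in a few lines. The code below is a generic histogram-based Otsu threshold and a voxelwise F1-measure, F1 = 2TP / (2TP + FP + FN); it is a sketch under stated assumptions, not the software used in the study. The number of histogram bins, the function names, and the toy inputs are assumptions, and the connected component cleanup is not included.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Threshold that maximizes the between-class variance of the histogram."""
    hist, edges = np.histogram(np.ravel(values), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)                    # voxels at or below each bin
    w1 = w0[-1] - w0                        # voxels above each bin
    cum_sum = np.cumsum(hist * centers)
    valid = (w0 > 0) & (w1 > 0)
    if not valid.any():
        return centers[0]                   # degenerate case: single occupied bin
    mu0 = cum_sum[valid] / w0[valid]
    mu1 = (cum_sum[-1] - cum_sum[valid]) / w1[valid]
    between = w0[valid] * w1[valid] * (mu0 - mu1) ** 2
    return centers[valid][np.argmax(between)]

def f1_measure(segmentation, ground_truth):
    """Voxelwise F1-measure between two binary masks."""
    seg = np.asarray(segmentation, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    tp = np.sum(seg & gt)
    fp = np.sum(seg & ~gt)
    fn = np.sum(~seg & gt)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 1.0  # two empty masks agree fully

# Toy confidence values (made up): five background-like, five tumor-like voxels.
confidence = np.array([0.1] * 5 + [0.9] * 5)
threshold = otsu_threshold(confidence)
binary = confidence > threshold
score = f1_measure([1, 1, 0, 0], [1, 0, 1, 0])
```

Because the F1-measure counts every missed boundary voxel as a false negative, small boundary disagreements reduce the score even when the bulk of the tumor is captured.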

RESULTS
We calculated the F1-measure for all 46 GBM tumors to evaluate the accuracy of the segmentation results against the ground truth, and to compare the three base methods and our ensemble method. We present the F1-measure plot for all 46 cases in Fig. 5.
First, comparing the three base methods, Fig. 5 shows that no single base algorithm performed better than the other two in all 46 cases. GC performed best for 34 cases out of 46, while FC and SVM performed best for seven and six cases out of 46, respectively. The box plots of the F1-measures are shown in Fig. 6 and summary statistics are provided in Table I. A paired t-test was run to compare the three base methods; the results are shown in Table II, indicating that GC and SVM are significantly better than the FC method.
Second, comparing the ensemble with the three base methods, the ensemble method was close to the best base result for the majority of cases, although the best base method varied from case to case. For each case, we took the best segmentation result among the three base methods, called it the best individual result, and compared it with all other methods. The paired t-test shows that there is no significant difference between the best individual result and the ensemble result, while the best individual result is significantly better than all three base methods, as shown in Table III. The box plots and summary statistics of the best individual F1-measures are shown in Fig. 6 and Table I. The ensemble method improved the F1-measure by approximately 0.04 (0.04 ± 0.02) compared to the best individual accuracy for 11 cases (no. 4,8,11,12,13,14,18,36,41,43,45), shown in Fig. 5. Two main reasons for the improvement are observed. One is that when the tumor is inhomogeneously enhanced, the ensemble method detected more tumor components than each base method. The other is that the necrosis was often falsely included as a part of a tumor by the GC and FC methods but correctly removed by the ensemble method. Figure 9 shows one example (index no. 12).
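The construction of the best individual result and the paired comparison can be made concrete with a small sketch. The F1 values below are hypothetical numbers chosen only to show the mechanics; they are not results from this study, and the variable and function names are likewise illustrative. The sketch takes the per-case maximum over the base methods and computes the paired t statistic, t = mean(d) / (sd(d) / sqrt(n)), on the per-case differences d.

```python
import numpy as np

# Hypothetical per-case F1 values for three cases (not data from the study).
f1 = {
    "FC": np.array([0.60, 0.70, 0.50]),
    "GC": np.array([0.80, 0.65, 0.70]),
    "SVM": np.array([0.75, 0.72, 0.60]),
}
ensemble = np.array([0.79, 0.73, 0.69])

stacked = np.stack(list(f1.values()))           # shape: (methods, cases)
best_individual = stacked.max(axis=0)           # per-case oracle over base methods
best_method = [list(f1)[i] for i in stacked.argmax(axis=0)]

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    t = d.mean() / (d.std(ddof=1) / np.sqrt(d.size))
    return t, d.size - 1

t_stat, dof = paired_t(best_individual, ensemble)
```

Note that the best individual result is an oracle: it requires the ground truth to pick the winning method per case, so it serves as a reference for the ensemble and cannot itself be used as a segmentation method.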
The ensemble method performs similarly (0.0006 ± 0.01) to the best individual result for 21 cases (no. 1,2,3,5,6,7,15,16,17,19,21,27,29,30,31,32,34,35,39,42,46). Two main observations may explain this result. One is that a single method (GC) performs relatively well when the tumor appears as a well-enhanced single component, as shown in Fig. 10 with index no. 31, while the other two methods do not provide much additional value to the CMA method. The other is that in some cases the CMA includes not only more true positive voxels than the base methods but also more false positive voxels, resulting in no overall improvement.
The ensemble method exhibited promising results in a subgroup of multifocal tumors. Multifocal tumors are those with more than one lesion site, as defined by intervening areas of normal brain signal, including or excluding the primary site, all with a well-defined or mostly well-defined border. Figure 8 shows one example. 48 Cases 43-46 in Fig. 5 belong to this subgroup, and a zoomed-in version is shown in Fig. 7. For multifocal tumors, GC missed unconnected tumor pieces and SVM included all tumor pieces, and our ensemble method improved the performance over the GC method by 0.08 ± 0.01, over the SVM method by 0.04 ± 0.04, and over the FC method by 0.26 ± 0.25. In general, the F1-measure for all 46 cases is lower than 0.9 for all methods, because partial-volume voxels tended to be missed by the automated methods. Thus, the F1-measure is reduced even when the segmentation result appears accurate by visual inspection.

DISCUSSION
In this study, we proposed an ensemble framework for the application of GBM brain tumor segmentation on high-resolution T1w postcontrast MR images. Rather than a highly customized method for this specific application, the proposed ensemble method can combine existing general-purpose segmentation algorithms to achieve greater consistency in performance.
Our study shows that ensemble segmentation has the potential to approximate the best result of the base methods for each case. In spite of the power of the existing general-purpose segmentation methods, no single segmentation method could beat all the others in solving the challenging problem of GBM tumor segmentation with large variation of tumor appearance. Our tests showed that the ensemble segmentation is not significantly different from the best individual result (p > 0.05), even though the best individual result is significantly better than all the base methods. In the future, with properly selected base methods which are good at segmenting different types of tumor appearances, an appropriate ensemble method may sustain the accuracy of the best "performer" for different tumor appearances and achieve an overall improvement over the base methods.
To our knowledge, we are the first to investigate and evaluate ensemble segmentation for GBM tumors on a relatively large dataset of 46 GBM tumors, in comparison to datasets of 5-20 cases in the literature. In the challenge of brain tumor segmentation (BRATS) at MICCAI2013, a training set of 30 patients was provided, including both GBM and low grade gliomas. It is necessary to evaluate GBM tumor segmentation over a large dataset including a variety of tumor presentations, because the appearance of GBM tumors on the images can vary substantially. GBM brain tumor segmentation is a challenging problem due to tumor heterogeneity, inhomogeneous intensity profiles, variable shapes and sizes, and recurrence patterns postsurgery. For example, there may or may not be necrosis/cavity/cyst present in the tumor core; the tumor recurrence may occur in the primary site or at a distant site; the tumor may show a vivid enhancement or diffuse pattern; and the tumor could have a blob shape or an irregular shape. Thus, it is crucial to evaluate the segmentation method on a large clinical dataset. Liu et al. 9 and Corso et al. 7 are, to date, the only two studies in the literature that evaluated their systems on a dataset of 20 cases. In our study, we included 46 tumors, including cases with all the clinical variability mentioned above. This study is thus significant in elucidating the range of tumor types to be addressed and thereby suggests that an ensemble approach may be appropriate.
To our knowledge, we are the first to investigate ensemble segmentation on single-channel MR images (T1w postcontrast) for GBM brain tumor segmentation. Most previous studies developed fully automated segmentation using multichannel MR images (T1w, T2w, FLAIR, etc.), and we found only one publication, in which Dube et al. 8 performed a preliminary study of fully automatic segmentation on this task using a dataset of seven patients. In the setting of GBM tumor clinical trials, radiologists manually contour contrast-enhanced tumors on single-channel T1w postcontrast images to measure tumor size change. Therefore, semiautomated segmentation on a single-channel T1w MR volume is relevant in a drug trial that uses radiographic response as a surrogate endpoint. We compared the performance of different base segmentation algorithms on the application of GBM brain tumor segmentation. In the literature, many algorithms have been proposed as general-purpose segmentation methods; however, it is difficult to compare their performance since they were applied to different datasets. In this study, we evaluated three base algorithms on the same dataset, which serves as a reference to compare their performances and makes a useful contribution to the segmentation of GBM brain tumors.
Our study provides a potential general-purpose segmentation framework, even though our ensemble method was tested for the specific application of GBM tumor segmentation. In the context of tumor drug clinical trials where radiographic response is used as a surrogate endpoint, imaging core labs need a general-purpose method for medical image segmentation, and the ensemble framework is a potential solution. This is because imaging core labs collect and process data from different trials with different diseases and image modalities (CT, MRI, PET, etc.). It is tedious work for radiologists to manually contour the tumors, but it is expensive and inefficient to design a specific segmentation algorithm for each application. Therefore, a general-purpose segmentation framework is attractive. However, medical image segmentation is not a trivial task due to the nature of medical image acquisitions and the heterogeneity of human diseases. An ensemble framework can take advantage of different segmentation algorithms. Its potential to serve as a general-purpose segmentation framework can be further studied and evaluated in other applications in the future to test the generalizability of the methods.
There are a couple of limitations in the present study design. One of them is the use of a single expert reading as the ground truth. Interobserver variation is a limiting factor in GBM tumor segmentation because the tumor is infiltrative and its boundary can be controversial; the variation increases greatly in a postoperative setting, and the whole dataset in this study consists of postoperative scans. Meanwhile, segmentation of GBM tumors is still exploratory, and in the current literature on GBM tumor segmentation the majority of the studies use a single reader for the evaluation. The reasons are that, first, it is not easy to obtain a large dataset of GBM tumors, and second, it is hard to collect results from multiple readers. Many publications did not specify how the ground truth was obtained, while in this paper the ground truth was produced by a neuroradiologist with ten years of experience in a real phase II drug trial. The other limitation is that the number of input seeds for the FC and GC methods was not strictly controlled among all cases. As an interactive method, GC is sensitive to user interactions. The amount of user interaction in this study was 5-9 seeds on the center slice of the tumor, enough to get an acceptable segmentation result. Thus, the GC segmentation result is not the best that GC can achieve. The ensemble results might be even better if enough seeds were provided to refine the GC results.
Future work could involve improvement of the CMA ensemble method. One possibility is to assign different weights to each base segmentation algorithm. Currently, we weighted all algorithms equally. Another possibility is to include additional base methods to explore whether more base methods can improve the segmentation performance.
In summary, we compared three base segmentation methods and evaluated the ensemble method on a clinical dataset of 46 GBM cases, and found that ensemble segmentation statistically approximates the best individual result (p > 0.05), and this provides motivation to investigate base methods that are good at segmenting tumors with different appearances. An ensemble method may then sustain the accuracy from the "best performer" for the various tumor appearances and obtain an overall improvement over the base methods.