Online resource for validation of brain segmentation methods
Introduction
The development of computational approaches for medical image analysis requires appropriate validation methodologies to assess their performance. Validation allows algorithm developers to understand how existing algorithms could be improved and to compare their own methods with those of others, and it allows users of algorithms to understand the characteristics of the algorithms they may select for a particular data processing stream. Validation is especially important in medical imaging because these algorithms may ultimately be used in decisions that affect patient treatment. Even algorithms that are not used directly in clinical application may be used as part of processing sequences for studies that have implications for drug trials, public policy, or even legal arguments.
The importance of validation in medical image processing is well-recognized, and much effort has been devoted to developing tools and methods for it. One of the early standards used in medical image validation was the Shepp–Logan phantom, which was developed for computed tomography (CT) (Shepp and Logan, 1974). This phantom was constructed from a set of 10 overlapping ellipses that produced a 2D map of attenuation values, with each ellipse contributing a different attenuation constant. The image shares basic shape properties with a scan of a human brain, yet is described completely by a table with 6 numbers per ellipse. More recently in the MRI brain imaging community, datasets made publicly available via the Internet have had a large impact. The Internet Brain Segmentation Repository (IBSR)1 provides several datasets that have been delineated using manually-guided processes. The IBSR website also provides similarity metrics computed on the results produced by several algorithms, thereby giving developers of new algorithms a basis for comparison. Another dataset that has emerged as a standard in brain imaging is the BrainWeb digital phantom2, which was constructed based on a high-resolution image from one individual (Collins et al., 1998). The BrainWeb site provides several versions of this scan, at different resolutions and noise levels and with various degrees of RF nonuniformity. The site was later extended to provide templates produced from 20 additional individuals (Aubert-Broche et al., 2006).
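The 6-numbers-per-ellipse parameterization can be sketched in a few lines of code. The ellipse table below is illustrative only (the published Shepp–Logan phantom specifies its own 10 rows of values): each row holds the center, semi-axes, rotation angle, and attenuation contribution of one ellipse, and the image is the sum of the contributions covering each pixel.

```python
import numpy as np

# Illustrative ellipse table: (x0, y0, a, b, theta_deg, intensity).
# These rows are made up for demonstration; the true Shepp-Logan
# phantom uses 10 ellipses with its own published parameter values.
ELLIPSES = [
    (0.0,  0.0,  0.69, 0.92,   0.0,  1.0),   # outer "skull" ellipse
    (0.0, -0.02, 0.66, 0.87,   0.0, -0.8),   # inner "brain" region
    (0.22, 0.0,  0.11, 0.31, -18.0,  0.2),   # small tilted inclusion
]

def render_phantom(n=256, ellipses=ELLIPSES):
    """Render an n x n phantom by summing each ellipse's intensity
    over the pixels it covers, on the square [-1, 1] x [-1, 1]."""
    y, x = np.mgrid[-1:1:n * 1j, -1:1:n * 1j]
    img = np.zeros((n, n))
    for x0, y0, a, b, theta, val in ellipses:
        t = np.deg2rad(theta)
        # Rotate coordinates into the ellipse's own frame.
        xr = (x - x0) * np.cos(t) + (y - y0) * np.sin(t)
        yr = -(x - x0) * np.sin(t) + (y - y0) * np.cos(t)
        img[(xr / a) ** 2 + (yr / b) ** 2 <= 1.0] += val
    return img
```

Because the phantom is defined analytically, any resolution can be rendered from the same small table, which is what made it so convenient as an early validation standard.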
Both the IBSR and BrainWeb datasets have been used in several publications (e.g., Rajapakse and Kruggel, 1998, Pham and Prince, 1999, Zeng et al., 1999, Zhang et al., 2001, Shattuck et al., 2001, Marroquin et al., 2002, Tohka et al., 2004, Tohka et al., 2007, Bazin and Pham, 2007) and have played a key role in the validation and improvement of image processing algorithms. One benefit of this has been that users can compare their results on the data with those that have been published previously. However, there may be cases where results are not directly comparable across publications. For example, studies may use different subsets of the available data.
There have been several studies performed to evaluate multiple medical image processing approaches. These have included registration comparisons, such as an evaluation of 4 different approaches performed by Strother et al. (1994). West et al. (1997) performed a large-scale study in which numerous registration algorithms were compared; the developers of each algorithm were invited to perform the registration of the test data, and their results were then evaluated independently. Arnold et al. (2001) performed an evaluation of 6 bias field correction algorithms, also with some interaction with the developers of the individual algorithms. Boesen et al. (2004) evaluated 4 different skull-stripping methods. That work also introduced a web-based evaluation service, termed Brain Extraction Evaluation (BEE)3, through which users could download the set of 15 test MRI volumes used in the paper, submit their segmented data for evaluation, and have the results e-mailed to them. The Biomedical Informatics Research Network (BIRN) sponsored an additional evaluation of skull-stripping algorithms, in which 4 publicly available methods were applied to 16 data volumes (Fennema-Notestine et al., 2006). For that study, the authors of each algorithm were invited to participate by suggesting parameters based on their own tests on practice data (3 of the 4 elected to participate). The data processing was then performed without further input from the developers.
At the 2007 conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), two segmentation competitions were held as part of a workshop (van Ginneken et al., 2007). Participants were able to segment test data for one of two problems: caudate segmentation or liver segmentation. For both segmentation problems, a training set and a test set were provided to interested participants prior to the conference. A second test set was provided on the day of the workshop and had to be segmented within 3 h after it was made available. The results were ranked using a set of metrics. The life of this competition extends beyond the conference, as two websites4 have been established to provide access to the results from the competition. These sites also allow existing and new participants to submit new results for evaluation. Additional segmentation competitions were held during the 2008 MICCAI conference.5
We make a few observations based on some of the validation studies that have been performed. First, since researchers often struggle with identifying ways to validate their methods, the public availability of reference datasets is of clear benefit to the medical imaging community. Second, when researchers publish descriptions and comparisons of their own algorithms, the algorithm they are publishing typically performs the best in their evaluation. This could indicate an intentional bias on the part of the group performing the comparison; however, other possible explanations also exist. Researchers are unlikely to publish methods that do not perform as well as the state of the art, as this would reduce their own motivation to publish as well as reduce the likelihood of acceptance by their peers during the review process. Additionally, researchers are most familiar with their own algorithms, and thus may not be sufficiently experienced with the existing methods to tune them appropriately to the test data. Third, metrics that are published in separate publications are not always comparable, as the testing procedures and assumptions may differ even when the same assessment metrics and data are used. Finally, evaluation methods such as the ones performed by BIRN or MICCAI, where each algorithm developer was allowed to participate in the parameter selection process or in the data processing, are likely to provide an opportunity for each algorithm tested to perform well. We note that these results may differ from what an actual end-user of the algorithm may experience in practice.
In this paper, we introduce a web-based resource that provides automatic evaluation of segmentation results, specifically for the problem of identifying brain versus non-brain in T1-weighted MRI. Skull-stripping is often one of the earliest stages in computational analysis of neuroimaging, and it can have an impact on downstream processing such as intersubject registration or voxel-based morphometry (Acosta-Cabronero et al., 2008). In spite of the many programs developed for skull-stripping, neuroimaging investigators often resort to manual cleanup or completely manual removal of non-brain tissues.
The online resource that we have developed works as follows. Registered users may download a set of 40 T1-weighted whole-head MRI volumes that have been manually labeled, process the data according to their method of choice, and then upload the segmented data to the web server. An application on the server then computes a series of metrics and presents these to the user through a series of navigable webpages. The results are also archived on the server and made available there, so that results from multiple methods can be compared. To examine the utility of our new framework, we used it to perform an examination of 3 popular skull-stripping methods: Hybrid Watershed (Ségonne et al., 2004), Brain Extraction Tool (Smith, 2002), and our own Brain Surface Extractor (Shattuck et al., 2001).
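This excerpt does not enumerate the exact metrics the server computes, so the sketch below shows standard overlap measures commonly used to score a submitted brain mask against a manually labeled ground truth: the Dice and Jaccard coefficients plus false positive/negative rates, assuming the masks arrive as boolean arrays of identical shape.

```python
import numpy as np

def overlap_metrics(submitted, truth):
    """Standard overlap measures between two binary brain masks.

    submitted, truth: arrays of identical shape; nonzero/True = brain.
    Returns (dice, jaccard, false_positive_rate, false_negative_rate),
    with the error rates expressed relative to the ground-truth volume.
    """
    submitted = np.asarray(submitted).astype(bool)
    truth = np.asarray(truth).astype(bool)
    inter = np.logical_and(submitted, truth).sum()
    dice = 2.0 * inter / (submitted.sum() + truth.sum())
    jaccard = inter / np.logical_or(submitted, truth).sum()
    false_pos = np.logical_and(submitted, ~truth).sum() / truth.sum()
    false_neg = np.logical_and(~submitted, truth).sum() / truth.sum()
    return dice, jaccard, false_pos, false_neg
```

Reporting error rates alongside Dice is useful because two masks with the same Dice score can fail in different ways: one by including too much non-brain tissue, the other by cutting into brain.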
Section snippets
Validation data set
We produced a ground truth data set based on MRI volumes from 40 normal research volunteers. For each subject MRI, we generated a manually edited brain mask volume, which labeled each voxel as being brain or non-brain, and a structure label volume, which identified 56 anatomical areas in the MRI on a voxel-by-voxel basis (see Table 2 for a list of structures that were labeled). Additionally, our validation data set included a nonlinear mapping from the native scan space of each subject volume
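The two label volumes described above are related in a simple way: if the structure label volume uses 0 for background/non-brain and positive integer codes for the 56 anatomical areas (an assumption for this sketch), a binary brain mask can be derived from it by treating any labeled voxel as brain.

```python
import numpy as np

def brain_mask_from_labels(labels):
    """Collapse a structure label volume (assumed 0 = non-brain,
    positive codes = anatomical regions) into a binary brain mask."""
    return np.asarray(labels) > 0

# Hypothetical single-slice example with two labeled structures:
labels = np.array([[0,  0,  7],
                   [0, 12, 12],
                   [0,  0,  0]])
mask = brain_mask_from_labels(labels)
```

In practice the manually edited brain mask may be maintained separately from the structure labels, so a check like this is also a cheap consistency test between the two volumes.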
Implementation and computation time
The webserver implementation was developed as described above. The time required to upload a set of brain masks varied depending on the size of the compressed segmentation results and the Internet connection used. Typically, the compressed segmentation files required 4 MB or less of storage. The upload process required less than 1 min over a residential broadband connection; on a local gigabit network within our laboratory, the upload required less than 1 s. The entire server-side processing
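The small compressed sizes reported above are plausible because binary masks consist of long runs of identical voxels and compress extremely well. A quick check using Python's standard gzip module on a synthetic mask (a stand-in for real data):

```python
import gzip

import numpy as np

# Synthetic stand-in for a binary brain mask, one byte per voxel;
# a real mask from the resource would be loaded from its image file.
mask = np.zeros((128, 128, 128), dtype=np.uint8)
mask[32:96, 32:96, 32:96] = 1  # a solid block standing in for "brain"

raw = mask.tobytes()
compressed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```

Real masks have more irregular boundaries than this block and compress somewhat less, but the run-length structure still dominates, which is consistent with whole-head segmentations fitting in a few megabytes.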
Discussion
We have presented a new online resource for performing validation studies of skull-stripping algorithms using a set of 40 manually labeled MRIs. These results were computed in only a few minutes and archived on the server, providing a convenient and repeatable mechanism to evaluate and compare different methods. We applied our framework to evaluate 3 existing skull-stripping algorithms, BET, BSE, and HWA. Our results indicated that with proper parameter selection, all 3 could produce very good
Acknowledgements
This work is funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant U54 RR021813. Information on the National Centers for Biomedical Computing can be obtained from http://nihroadmap.nih.gov/bioinformatics. Funding was also provided under NIH Grants P41-RR013642 (PI: AW Toga) and R01-MH060374 (PI: RM Bilder).
References (36)
- et al. The impact of skull-stripping and radio-frequency bias correction on grey-matter segmentation for voxel-based morphometry. NeuroImage (2008).
- et al. Qualitative and quantitative evaluation of six algorithms for correcting intensity nonuniformity effects. NeuroImage (2001).
- et al. Quantitative comparison of four brain extraction algorithms. NeuroImage (2004).
- et al. Segmentation of MR images with intensity inhomogeneities. Image Vis. Comput. (1998).
- et al. Putting our heads together: a consensus approach to brain/non-brain segmentation in T1-weighted MR volumes. NeuroImage (2004).
- et al. The LONI pipeline processing environment. NeuroImage (2003).
- et al. A meta-algorithm for brain extraction in MRI. NeuroImage (2004).
- et al. A hybrid approach to the skull stripping problem in MRI. NeuroImage (2004).
- et al. Magnetic resonance image tissue classification using a partial volume model. NeuroImage (2001).
- et al. Construction of a 3D probabilistic atlas of human cortical structures. NeuroImage (2008).
- Localizing age-related changes in brain structure between childhood and adolescence using statistical parametric mapping. NeuroImage.
- Fast and robust parameter estimation for statistical partial volume models in brain MRI. NeuroImage.
- Twenty new digital brain phantoms for creation of validation image data bases. IEEE Trans. Med. Imag.
- Topology-preserving tissue classification of magnetic resonance brain images. IEEE Trans. Med. Imag.
- Design and construction of a realistic digital brain phantom. IEEE Trans. Med. Imag.
- Measures of the amount of ecologic association between species. J. Ecol.
- Quantitative evaluation of automated skull-stripping methods applied to contemporary and legacy images: effects of diagnosis, bias correction, and slice location. Hum. Brain Mapp.
Cited by (140)
- Privacy preserving image registration (2024, Medical Image Analysis)
- A dual-branch hybrid dilated CNN model for the AI-assisted segmentation of meningiomas in MR images (2022, Computers in Biology and Medicine)
- Post-acquisition processing confounds in brain volumetric quantification of white matter hyperintensities (2019, Journal of Neuroscience Methods)
- Optimization-based neutrosophic set for medical image processing (2019, Neutrosophic Set in Medical Image Analysis)