
NeuroImage

Volume 45, Issue 2, 1 April 2009, Pages 431-439

Online resource for validation of brain segmentation methods

https://doi.org/10.1016/j.neuroimage.2008.10.066

Abstract

One key issue that must be addressed during the development of image segmentation algorithms is the accuracy of the results they produce. Algorithm developers require this information so that they can identify where methods need to be improved and how new developments compare with existing ones. Users of algorithms also need to understand the characteristics of algorithms when they select and apply them to their neuroimaging analysis applications. Many metrics have been proposed to characterize error and success rates in segmentation, and several datasets have also been made public for evaluation. Still, the methodologies used in analyzing and reporting these results vary from study to study, so even when studies use the same metrics their numerical results may not be directly comparable. To address this problem, we developed a web-based resource for evaluating the performance of skull-stripping in T1-weighted MRI. The resource provides both the data to be segmented and an online application that performs a validation study on the data. Users may download the test dataset, segment it using whichever method they wish to assess, and upload their segmentation results to the server. The server computes a series of metrics, displays a detailed report of the validation results, and archives these for future browsing and analysis. We applied this framework to the evaluation of 3 popular skull-stripping algorithms — the Brain Extraction Tool [Smith, S.M., 2002. Fast robust automated brain extraction. Hum. Brain Mapp. 17 (3), 143–155 (Nov)], the Hybrid Watershed Algorithm [Ségonne, F., Dale, A.M., Busa, E., Glessner, M., Salat, D., Hahn, H.K., Fischl, B., 2004. A hybrid approach to the skull stripping problem in MRI. NeuroImage 22 (3), 1060–1075 (Jul)], and the Brain Surface Extractor [Shattuck, D.W., Sandor-Leahy, S.R., Schaper, K.A., Rottenberg, D.A., Leahy, R.M., 2001. Magnetic resonance image tissue classification using a partial volume model. NeuroImage 13 (5), 856–876 (May)] under several different program settings. Our results show that with proper parameter selection, all 3 algorithms can achieve satisfactory skull-stripping on the test data.

Introduction

The development of computational approaches for medical image analysis requires appropriate validation methodologies to assess their performance. This validation is important so that algorithm developers can understand the ways in which existing algorithms could be improved, so that they can compare their own methods with those of others, and so that users of algorithms can understand the characteristics of algorithms that they may select for a particular data processing stream. Validation is of particular importance in medical imaging because these algorithms may ultimately be used in decisions that affect patient treatment. Even algorithms that are not used directly in clinical application may be used as part of processing sequences for studies that have implications for drug trials, public policy, or even legal arguments.

The importance of validation in medical image processing is well-recognized, and much effort has been devoted to developing tools and methods for it. One of the early standards used in medical image validation was the Shepp–Logan phantom, which was developed for computed tomography (CT) (Shepp and Logan, 1974). This phantom was constructed from a set of 10 overlapping ellipses that produced a 2D map of attenuation values, with each ellipse contributing a different attenuation constant. This image shares basic shape properties with those of a scan of a human brain, yet is described completely by a table of 6 numbers per ellipse. More recently in the MRI brain imaging community, datasets made publicly available via the Internet have had a large impact. The Internet Brain Segmentation Repository (IBSR)1 provides several datasets that have been delineated using manually-guided processes. The IBSR website also provides similarity metrics computed on the results produced by several algorithms, thereby providing developers of new algorithms with a basis for comparison. Another dataset that has emerged as a standard in brain imaging is the BrainWeb digital phantom2, which was constructed based on a high-resolution image from one individual (Collins et al., 1998). The BrainWeb site provides several versions of this scan at different resolutions and noise levels, and with various degrees of RF nonuniformity. The site was later extended to provide templates produced from 20 additional individuals (Aubert-Broche et al., 2006).
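To make the phantom construction concrete, the following Python sketch rasterizes a 2D attenuation map from a table of ellipse parameters in the same spirit: each row holds the 6 numbers describing one ellipse (center, axes, rotation, attenuation), and overlapping ellipses sum their contributions. The parameter values below are illustrative placeholders, not the published Shepp–Logan table.

```python
import numpy as np

# Each row: (x0, y0, a, b, theta_deg, mu) -- the "6 numbers per ellipse".
# Illustrative values only; not the published Shepp-Logan parameters.
ELLIPSES = [
    ( 0.00,  0.000, 0.69, 0.92,   0.0,  1.0),   # outer boundary ("skull")
    ( 0.00, -0.018, 0.66, 0.87,   0.0, -0.8),   # interior ("brain")
    ( 0.22,  0.000, 0.11, 0.31, -18.0, -0.2),   # right "ventricle"
    (-0.22,  0.000, 0.16, 0.41,  18.0, -0.2),   # left "ventricle"
]

def phantom(n=256, ellipses=ELLIPSES):
    """Rasterize a 2D attenuation map from a table of ellipse parameters."""
    y, x = np.mgrid[-1:1:n * 1j, -1:1:n * 1j]
    image = np.zeros((n, n))
    for x0, y0, a, b, theta_deg, mu in ellipses:
        t = np.deg2rad(theta_deg)
        xr = (x - x0) * np.cos(t) + (y - y0) * np.sin(t)     # rotate into the ellipse frame
        yr = -(x - x0) * np.sin(t) + (y - y0) * np.cos(t)
        inside = (xr / a) ** 2 + (yr / b) ** 2 <= 1.0
        image[inside] += mu                                   # overlapping ellipses add attenuation
    return image
```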

Both the IBSR and BrainWeb datasets have been used in several publications (e.g., Rajapakse and Kruggel, 1998, Pham and Prince, 1999, Zeng et al., 1999, Zhang et al., 2001, Shattuck et al., 2001, Marroquin et al., 2002, Tohka et al., 2004, Tohka et al., 2007, Bazin and Pham, 2007) and have played a key role in the validation and improvement of image processing algorithms. One benefit of this has been that users can compare their results on the data with those that have been published previously. However, there may be cases where results are not directly comparable across publications. For example, studies may use different subsets of the available data.

There have been several studies performed to evaluate multiple medical image processing approaches. These have included registration comparisons, such as an evaluation of 4 different approaches performed by Strother et al. (1994). West et al. (1997) performed a large-scale study in which numerous registration algorithms were compared. The developers of each algorithm were invited to perform the registration of the test data, and their results were then evaluated independently. Arnold et al. (2001) performed an evaluation of 6 bias field correction algorithms, also with some interaction with the developers of the individual algorithms. Boesen et al. (2004) evaluated 4 different skull-stripping methods. That work also introduced a web-based evaluation service, termed Brain Extraction Evaluation (BEE)3, through which users could download the set of 15 test MRI volumes used in the paper, submit their segmented data for evaluation, and have their results e-mailed to them. The Biomedical Informatics Research Network (BIRN) sponsored an additional evaluation of skull-stripping algorithms, in which 4 publicly available methods were applied to 16 data volumes (Fennema-Notestine et al., 2006). For that study, the authors of each algorithm were invited to participate by suggesting parameters based on their own tests on practice data (3 of the 4 elected to participate). The data processing was then performed without further input from the developers.

At the 2007 conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), two segmentation competitions were held as part of a workshop (van Ginneken et al., 2007). Participants were able to segment test data for one of two problems: caudate segmentation or liver segmentation. For both segmentation problems, a training set and a test set were provided to interested participants prior to the conference. A second test set was provided on the day of the workshop and had to be segmented within 3 h after it was made available. The results were ranked using a set of metrics. The life of this competition extends beyond the conference, as two websites4 have been established to provide access to the results from the competition. These sites also allow existing and new participants to submit new results for evaluation. Additional segmentation competitions were held during the 2008 MICCAI conference.5

We make a few observations based on some of the validation studies that have been performed. First, since researchers often struggle with identifying ways to validate their methods, the public availability of reference datasets is of clear benefit to the medical imaging community. Second, when researchers publish descriptions and comparisons of their own algorithms, the algorithm they are publishing typically performs the best in their evaluation. This could indicate an intentional bias on the part of the group performing the comparison; however, other possible explanations also exist. Researchers are unlikely to publish methods that do not perform as well as the state of the art, as this would reduce their own motivation to publish as well as reduce the likelihood of acceptance by their peers during the review process. Additionally, researchers are most familiar with their own algorithms, and thus may not be sufficiently experienced with the existing methods to tune them appropriately to the test data. Third, metrics that are published in separate publications are not always comparable, as the testing procedures and assumptions may differ even when the same assessment metrics and data are used. Finally, evaluation methods such as the ones performed by BIRN or MICCAI, where each algorithm developer was allowed to participate in the parameter selection process or in the data processing, are likely to provide an opportunity for each algorithm tested to perform well. We note that these results may differ from what an actual end-user of the algorithm may experience in practice.

In this paper, we introduce a web-based resource that provides automatic evaluation of segmentation results, specifically for the problem of identifying brain versus non-brain tissue in T1-weighted MRI. Skull-stripping is often one of the earliest stages in the computational analysis of neuroimaging data, and it can have an impact on downstream processing such as intersubject registration or voxel-based morphometry (Acosta-Cabronero et al., 2008). In spite of the many programs developed for skull-stripping, neuroimaging investigators often resort to manual cleanup or completely manual removal of non-brain tissues.

The online resource that we have developed works as follows. Registered users may download a set of 40 T1-weighted whole-head MRI volumes that have been manually labeled, process the data using their method of choice, and then upload the segmented data to the web server. An application on the server then computes a series of metrics and presents these to the user through a series of navigable webpages. These results are also archived on the server so that results from multiple methods can be browsed and compared. To examine the utility of our new framework, we used it to perform an examination of 3 popular skull-stripping methods — Hybrid Watershed (Ségonne et al., 2004), Brain Extraction Tool (Smith, 2002), and our own Brain Surface Extractor (Shattuck et al., 2001).
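As an illustration of the kind of overlap measure such a server-side evaluation can report, the sketch below compares a submitted brain mask with the reference mask voxel-by-voxel and computes the Dice coefficient (Dice, 1945) along with two other common overlap measures. It assumes the masks are stored as NIfTI volumes readable with nibabel; the actual metrics and file handling used by the resource may differ.

```python
import numpy as np
import nibabel as nib  # assumed NIfTI reader; not mandated by the online resource

def overlap_metrics(test_path, truth_path):
    """Voxel-wise comparison of a submitted brain mask against the reference mask."""
    test = nib.load(test_path).get_fdata() > 0
    truth = nib.load(truth_path).get_fdata() > 0
    tp = np.count_nonzero(test & truth)      # brain voxels identified by both masks
    fp = np.count_nonzero(test & ~truth)     # non-brain voxels labeled as brain
    fn = np.count_nonzero(~test & truth)     # brain voxels the method missed
    return {
        "dice": 2.0 * tp / (2.0 * tp + fp + fn),   # Dice (1945) similarity coefficient
        "jaccard": tp / (tp + fp + fn),
        "sensitivity": tp / (tp + fn),
    }

# Hypothetical filenames, for illustration:
# print(overlap_metrics("subject01.mask.nii.gz", "subject01.brainmask.nii.gz"))
```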

Section snippets

Validation data set

We produced a ground truth data set based on MRI volumes from 40 normal research volunteers. For each subject MRI, we generated a manually edited brain mask volume, which labeled each voxel as being brain or non-brain, and a structure label volume, which identified 56 anatomical areas in the MRI on a voxel-by-voxel basis (see Table 2 for a list of structures that were labeled). Additionally, our validation data set included a nonlinear mapping from the native scan space of each subject volume
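For readers working with the downloaded data, the following minimal sketch shows one way to inspect a manually edited brain mask volume, for example to report total brain volume. The filename is hypothetical, and the assumption that the masks are read as NIfTI files via nibabel is ours, not a statement about the distributed file format.

```python
import numpy as np
import nibabel as nib  # assumed reader; the distributed file format is not restated here

# Hypothetical filename for one subject's manually edited brain mask.
mask_img = nib.load("subject01.brainmask.nii.gz")
mask = mask_img.get_fdata() > 0                               # True where a voxel is labeled brain
voxel_mm3 = float(np.prod(mask_img.header.get_zooms()[:3]))   # voxel volume in mm^3 from the header
print("brain voxels:", int(mask.sum()))
print("brain volume (cm^3):", mask.sum() * voxel_mm3 / 1000.0)
```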

Implementation and computation time

The webserver implementation was developed as described above. The time required to upload a set of brain masks varied depending on the size of the compressed segmentation results and the Internet connection used. Typically, the compressed segmentation files required 4 MB or less of storage. The upload process required less than 1 min over a residential broadband connection; on a local gigabit network within our laboratory, the upload required less than 1 s. The entire server-side processing

Discussion

We have presented a new online resource for performing validation studies of skull-stripping algorithms using a set of 40 manually labeled MRIs. These results were computed in only a few minutes and archived on the server, providing a convenient and repeatable mechanism to evaluate and compare different methods. We applied our framework to evaluate 3 existing skull-stripping algorithms, BET, BSE, and HWA. Our results indicated that with proper parameter selection, all 3 could produce very good

Acknowledgements

This work is funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant U54 RR021813. Information on the National Centers for Biomedical Computing can be obtained from http://nihroadmap.nih.gov/bioinformatics. Funding was also provided under NIH Grants P41-RR013642 (PI: AW Toga) and R01-MH060374 (PI: RM Bilder).

References (36)

  • Sowell, E.R., et al., 1999. Localizing age-related changes in brain structure between childhood and adolescence using statistical parametric mapping. NeuroImage.
  • Tohka, J., et al., 2004. Fast and robust parameter estimation for statistical partial volume models in brain MRI. NeuroImage.
  • Aubert-Broche, B., et al., 2006. Twenty new digital brain phantoms for creation of validation image data bases. IEEE Trans. Med. Imag.
  • Bazin, P.-L., et al., 2007. Topology-preserving tissue classification of magnetic resonance brain images. IEEE Trans. Med. Imag.
  • Collins, D.L., et al., 1998. Design and construction of a realistic digital brain phantom. IEEE Trans. Med. Imag.
  • Dice, L.R., 1945. Measures of the amount of ecologic association between species. J. Ecol.
  • Evans, A.C., Collins, D.L., Mills, S.R., Brown, E.D., Kelly, R.L., Peters, T.M., 1993. 3D statistical neuroanatomical...
  • Fennema-Notestine, C., et al., 2006. Quantitative evaluation of automated skull-stripping methods applied to contemporary and legacy images: effects of diagnosis, bias correction, and slice location. Hum. Brain Mapp.