Performance of five research-domain automated WM lesion segmentation methods in a multi-center MS study

doi:10.1016/j.neuroimage.2017.09.011

NeuroImage

Volume 163, December 2017, Pages 106-114

https://doi.org/10.1016/j.neuroimage.2017.09.011

Highlights

• Much-needed study on quantitative evaluation and objective comparison of WM lesion segmentation methods.
• Using different scanners and different MR protocols in a real-life setting similar to phase-III trials and everyday clinical practice.
• The methods perform almost equally well whether parameter tuning is performed using data from the same center or not.

Abstract

Background and Purpose

In vivo identification of white matter lesions plays a key role in the evaluation of patients with multiple sclerosis (MS). Automated lesion segmentation methods have been developed to substitute manual outlining, but evidence of their performance in multi-center investigations is lacking. In this work, five research-domain automated segmentation methods were evaluated using a multi-center MS dataset.

Methods

Seventy MS patients (median EDSS of 2.0 [range 0.0–6.5]) were included from a six-center dataset of the MAGNIMS Study Group (www.magnims.eu) which included 2D FLAIR and 3D T1 images with manual lesion segmentation as a reference. Automated lesion segmentations were produced using five algorithms: Cascade; Lesion Segmentation Toolbox (LST) with both the Lesion growth algorithm (LGA) and the Lesion prediction algorithm (LPA); Lesion-Topology preserving Anatomical Segmentation (Lesion-TOADS); and k-Nearest Neighbor with Tissue Type Priors (kNN-TTP). Main software parameters were optimized using a training set (N = 18), and formal testing was performed on the remaining patients (N = 52). To evaluate volumetric agreement with the reference segmentations, the intraclass correlation coefficient (ICC) as well as the mean difference in lesion volumes between the automated and reference segmentations were calculated. The Similarity Index (SI), False Positive (FP) volumes and False Negative (FN) volumes were used to examine spatial agreement. All analyses were repeated using a leave-one-center-out design, in which the center of interest was excluded from the training phase, to evaluate the performance of each method on an ‘unseen’ center.
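The spatial agreement measures named in the Methods can be illustrated with a short sketch. It assumes that SI is the Dice similarity coefficient (consistent with the reference list citing Dice, 1945) and that FP and FN volumes are voxel counts scaled by the voxel volume. The function name `spatial_agreement` and the `voxel_volume_ml` parameter are hypothetical and do not come from the paper.

```python
import numpy as np


def spatial_agreement(auto_mask, ref_mask, voxel_volume_ml):
    """Compare a binary automated lesion mask with a binary reference mask.

    Assumption: SI is computed as the Dice coefficient, 2*TP / (2*TP + FP + FN).
    voxel_volume_ml is the volume of one voxel in mL (hypothetical parameter).
    """
    auto = np.asarray(auto_mask, dtype=bool)
    ref = np.asarray(ref_mask, dtype=bool)
    if auto.shape != ref.shape:
        raise ValueError("masks must have the same shape")

    tp = np.count_nonzero(auto & ref)    # lesion voxels found by both
    fp = np.count_nonzero(auto & ~ref)   # automated-only voxels
    fn = np.count_nonzero(~auto & ref)   # reference-only voxels

    denom = 2 * tp + fp + fn
    si = 2 * tp / denom if denom else float("nan")  # undefined if both masks are empty
    return {
        "si": si,
        "fp_ml": fp * voxel_volume_ml,
        "fn_ml": fn * voxel_volume_ml,
    }
```

Reporting FP and FN volumes next to SI helps separate over-segmentation from under-segmentation, which a single overlap score cannot show.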

Results

Compared to the reference mean lesion volume (4.85 ± 7.29 mL), the methods displayed a mean difference of 1.60 ± 4.83 (Cascade), 2.31 ± 7.66 (LGA), 0.44 ± 4.68 (LPA), 1.76 ± 4.17 (Lesion-TOADS) and −1.39 ± 4.10 mL (kNN-TTP). The ICCs were 0.755, 0.713, 0.851, 0.806 and 0.723, respectively. Spatial agreement with reference segmentations was higher for LPA (SI = 0.37 ± 0.23), Lesion-TOADS (SI = 0.35 ± 0.18) and kNN-TTP (SI = 0.44 ± 0.14) than for Cascade (SI = 0.26 ± 0.17) or LGA (SI = 0.31 ± 0.23). All methods showed highly similar results when used on data from a center not used in software parameter optimization.
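The volumetric agreement figures above (mean difference ± SD and ICC) can be reproduced in principle with a sketch like the following. The abstract does not state which ICC form was used; the two-way random-effects, absolute-agreement, single-measure form (often written ICC(2,1)) is an assumption here, and the function names are hypothetical.

```python
import numpy as np


def volume_difference(auto_ml, ref_ml):
    """Mean and sample SD of (automated - reference) lesion volumes in mL."""
    diff = np.asarray(auto_ml, dtype=float) - np.asarray(ref_ml, dtype=float)
    return diff.mean(), diff.std(ddof=1)


def icc_absolute_agreement(auto_ml, ref_ml):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    Assumption: this is one common ICC definition, not necessarily the one
    used in the study.
    """
    data = np.column_stack([auto_ml, ref_ml]).astype(float)
    n, k = data.shape                      # n subjects, k = 2 "raters"
    grand = data.mean()

    ss_rows = k * np.sum((data.mean(axis=1) - grand) ** 2)  # between subjects
    ss_cols = n * np.sum((data.mean(axis=0) - grand) ** 2)  # between methods
    ss_err = np.sum((data - grand) ** 2) - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

An absolute-agreement ICC penalizes a systematic offset between automated and reference volumes, whereas a consistency ICC would not; this is why the choice of form matters when comparing methods that over- or under-estimate lesion volume.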

Conclusion

The performance of the methods in this multi-center MS dataset was moderate, but appeared to be robust even with new datasets from centers not included in training the automated methods.

Introduction

Multiple sclerosis (MS) is an inflammatory and neurodegenerative disease of the central nervous system, with inflammatory white matter (WM) lesions as a prominent pathological hallmark (Benedict and Bobholz, 2007, Lucchinetti et al., 2000). In vivo visualization of lesions by means of MRI plays a crucial role in the diagnosis and study of MS. Moreover, several clinical trials have used WM lesion volume as a (secondary) study outcome (Calabresi et al., 2014, Kappos et al., 2010, Polman et al., 2011, Radue et al., 2015).

For clinical and research purposes, delineation of WM lesions in MS is either performed manually or with a semiautomatic tool. These approaches, however, are labor-intensive and suffer from considerable inter- and intra-rater variability (Grimaud et al., 1996, Paty et al., 1986). To overcome these problems, automated WM lesion segmentation methods have been developed in the last decade (Mortazavi et al., 2012). However, these methods are not routinely applied in research, clinical trials or individual patient care. One important hurdle is the lack of comparative data reporting the accuracy and robustness of these methods when using data obtained from different centers. Evidence of their performance in multi-center investigations is lacking.

The aim of this study was twofold: first, to evaluate the performance of research-domain automated WM lesion segmentation methods in a multi-center MS dataset with diverging scanners and protocols; and second, to investigate how these methods perform on data from a new center (using other centers for training). We selected five algorithms for automated segmentation: Cascade (Damangir et al., 2012, Damangir et al., 2016); Lesion growth algorithm (LGA) (Schmidt et al., 2012) and Lesion prediction algorithm (LPA) (Schmidt, 2017), both from the Lesion Segmentation Toolbox (LST) (Schmidt et al., 2012); Lesion-Topology-preserving Anatomical Segmentation (Lesion-TOADS) (Shiee et al., 2010); and k-Nearest Neighbor with Tissue Type Priors (kNN-TTP) (Steenwijk et al., 2013).
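The second aim, testing on a center that contributed no training data, corresponds to a leave-one-center-out split. A minimal sketch of how such folds could be generated is given below; the function name and the center labels are hypothetical, and the parameter optimization step itself is not shown because the snippets do not describe it.

```python
from collections import defaultdict


def leave_one_center_out(center_labels):
    """Yield (held_out_center, train_indices, test_indices) for each center.

    center_labels[i] is the acquisition center of patient i (hypothetical input).
    Patients from the held-out center are never part of the training indices.
    """
    by_center = defaultdict(list)
    for idx, center in enumerate(center_labels):
        by_center[center].append(idx)

    for held_out in sorted(by_center):
        train = sorted(
            i
            for center, indices in by_center.items()
            if center != held_out
            for i in indices
        )
        yield held_out, train, list(by_center[held_out])


# Example with made-up center labels for five patients.
for center, train_idx, test_idx in leave_one_center_out(["A", "A", "B", "C", "B"]):
    print(center, train_idx, test_idx)
```

Splitting by center rather than by patient keeps scanner- and protocol-specific characteristics of the test center out of parameter tuning, which is the point of the 'unseen' center experiment.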

Section snippets

Subjects

The data for this study were drawn from a multi-center MS dataset that was collected by the MAGNIMS Study Group (www.magnims.eu) as described previously (Ropele et al., 2014). For the analyses described in the current paper, we selected the patients with a 2D FLAIR acquisition, and we excluded three patients with co-morbidity (vascular disease, glioblastoma, surgical removal of part of the brain) that could interfere with the automated lesion segmentation and one patient whose data were

Results

An overview of the optimal configurations derived in the training phase of both experiments is provided in Table 4. Fig. 2 displays a typical example of a FLAIR image, the corresponding manual reference segmentation and the corresponding automated segmentation results.

The voxelwise intra-rater variability was SI = 0.73 ± 0.11 (mean ± SD) when comparing the first and second segmentations and 0.75 ± 0.11 when comparing the segmentations on the first and second marking of the lesions.

Discussion

In this study we directly compared five research-domain automated WM lesion segmentation methods in a multi-center MS dataset to obtain quantitative results on their volumetric and spatial performance. Accurate and robust segmentation of WM lesions would be beneficial for clinical trials in which lesion volumes are used as a (secondary) study outcome and for studies that aim to measure GM atrophy accurately (Amiri et al., 2017, Rocca et al., 2017). Our results show

Acknowledgements:

Aurélie Ruet was supported by an ECTRIMS research fellowship. Iris D. Kilsdonk was supported by a grant provided by the Noaber Foundation (Lunteren, The Netherlands). Adriaan Versteeg, Ronald A. van Schijndel, Keith S. Cover, Soheil Damangir and Giovanni B. Frisoni were partly funded by neuGRID4you (www.neuGRID4you.eu), a European Community FP7 project (grant agreement 283562). Olga Ciccarelli and Frederik Barkhof were supported by the National Institute for Health Research (NIHR) University

References (31)

F. Admiraal-Behloul et al. Fully automatic segmentation of white matter hyperintensities in MR images of the elderly. NeuroImage (2005)
P. Anbeek et al. Probabilistic segmentation of white matter lesions in MR imaging. NeuroImage (2004)
P. Anbeek et al. Probabilistic segmentation of brain tissue in MR imaging. NeuroImage (2005)
P.A. Calabresi et al. Safety and efficacy of fingolimod in patients with relapsing-remitting multiple sclerosis (FREEDOMS II): a double-blind, randomised, placebo-controlled, phase 3 trial. Lancet Neurol. (2014)
S. Damangir et al. Multispectral MRI segmentation of age related white matter changes using a cascade of support vector machines. J. Neurol. Sci. (2012)
J. Grimaud et al. Quantification of MRI lesion load in multiple sclerosis: a comparison of three computer-assisted techniques. Magn. Reson. Imaging (1996)
R. Khayati et al. Fully automatic segmentation of multiple sclerosis lesions in brain MR FLAIR images using adaptive mixtures method and Markov random field model. Comput. Biol. Med. (2008)
P. Schmidt et al. An automated tool for detection of FLAIR-hyperintense white-matter lesions in Multiple Sclerosis. NeuroImage (2012)
N. Shiee et al. A topology-preserving approach to the segmentation of brain images with multiple sclerosis lesions. NeuroImage (2010)
M.D. Steenwijk et al. Accurate white matter lesion segmentation by k nearest neighbor classification with tissue type priors (kNN-TTPs). NeuroImage Clin. (2013)
H. Amiri et al. Urgent Challenges in Quantification and Interpretation of Grey Matter Atrophy in Multiple Sclerosis (2017)
R. Benedict et al. Multiple sclerosis. Semin. Neurol. (2007)
S. Damangir et al. Magnetic Resonance Materials in Physics, Biology and Medicine (2016)
L.R. Dice. Measures of the amount of ecologic association between species. Ecology (1945)
L. Kappos et al. A placebo-controlled trial of oral fingolimod in relapsing multiple sclerosis. N. Engl. J. Med. (2010)

Cited by (27)

How far MS lesion detection and segmentation are integrated into the clinical workflow? A systematic review
2023, NeuroImage: Clinical
Introduction: Over the past few years, the deep learning community has developed and validated a plethora of tools for lesion detection and segmentation in Multiple Sclerosis (MS). However, there is an important gap between validating models technically and clinically. To this end, a six-step framework necessary for the development, validation, and integration of quantitative tools in the clinic was recently proposed under the name of the Quantitative Neuroradiology Initiative (QNI).
Aims: Investigate to what extent automatic tools in MS fulfill the QNI framework necessary to integrate automated detection and segmentation into the clinical neuroradiology workflow.
Methods: Adopting the systematic Cochrane literature review methodology, we screened and summarised published scientific articles that perform automatic MS lesions detection and segmentation. We categorised the retrieved studies based on their degree of fulfillment of QNI’s six-steps, which include a tool’s technical assessment, clinical validation, and integration.
Results: We found 156 studies; 146/156 (94%) fulfilled the first QNI step, 155/156 (99%) the second, 8/156 (5%) the third, 3/156 (2%) the fourth, 5/156 (3%) the fifth and only one the sixth.
Conclusions: To date, little has been done to evaluate the clinical performance and the integration in the clinical workflow of available methods for MS lesion detection/segmentation. In addition, the socio-economic effects and the impact on patients’ management of such tools remain almost unexplored.
Segmentation of white matter lesions in multicentre FLAIR MRI
2021, Neuroimage: Reports
White matter lesions (WML) in the brain are thought to be related to ischemic processes, demyelination, and axonal degeneration. The presence of WML predicts cognitive decline, dementia, stroke, and death. Lesion progression increases these risks, making WML significant clinical biomarkers for investigation. To analyze WML objectively, consistently, and efficiently, automated WML segmentation methods for neurological MRI have been the focus of extensive research efforts. There have been many unsupervised and traditional machine learning methods proposed over the years. Recently, deep learning architectures have been utilized for WML segmentation with promising results. In this work, we evaluate seven WML segmentation tools for multicentre fluid attenuated inversion recovery (FLAIR) MRI. Two traditional methods were evaluated, one unsupervised method and the other a traditional machine learning approach. The traditional methods were compared to five deep learning-based approaches. FLAIR MRI have the advantage of highlighting WML robustly and are used routinely in neurological workflows. Automated WML segmentation tools for FLAIR MRI could optimize clinical workflows and improve patient care. The WML segmentation algorithms were evaluated on a multicentre, multi-disease FLAIR MRI database acquired with varying scanners and protocols. In total, 252 imaging volumes (~13 K image slices) with annotations, from 5 multicentre datasets (33 imaging centres), were used to train, validate and test the WML segmentation methods. Two clinical datasets, which include dementia and vascular disease pathologies, and three open-source datasets were used. To examine clinical utility of each algorithm and establish proof of effectiveness, algorithms were evaluated over several dimensions related to accuracy, generalizability, and robustness to pathology.
This work presents a framework for evaluating the efficacy of WML segmentation algorithms for improved reliability, patient safety and clinical trials. Of all methods, SC U-Net was found to be the best algorithm for WML segmentation in terms of highest Dice similarity coefficient (DSC) over most dimensions (mean DSC = 0.71 over all volumes). Deep learning methods outperformed traditional methods, especially in lower lesion loads, but were not able to generalize across all disease categories or datasets.
Intracranial volume segmentation for neurodegenerative populations using multicentre FLAIR MRI
2021, Neuroimage: Reports
Intracranial volume (ICV) segmentation, also known as brain extraction or skull-stripping, is a critical preprocessing step in analytical pipelines for studying neurodegenerative diseases in magnetic resonance imaging (MRI). While the fluid-attenuated inversion recovery (FLAIR) MRI modality has emerged as an important sequence for analyzing cerebrovascular and neurodegenerative disease, most existing automated ICV segmentation methods have been developed for T1-weighted or multi-modal inputs. Additionally, many methods have been designed using single centre data of healthy subjects and encounter difficulties using images with varying acquisition parameters and neurodegenerative pathology. In this work, we develop and evaluate 2 traditional and 8 deep learning algorithms for ICV segmentation in FLAIR MRI. Training and testing were completed on 175 volumes (8317 images) from 2 dementia and 1 vascular disease cohort. A human phantom FLAIR MRI dataset from a repeatedly scanned, healthy individual was also utilized for reliability analysis. Images were acquired from 47 imaging centres with varying scanners and parameters. To measure and compare performance, we present a novel framework for evaluating the effectiveness of computer generated segmentations on multicentre datasets. The evaluation framework includes assessments of algorithm accuracy, generalization capabilities, robustness to pathology and spatial location, and volumetric measurement reliability – all important dimensions for establishing proof of effectiveness (a prerequisite to clinical translation). The top performing method was a multiple resolution U-Net (MultiResUNet), which achieved a mean Dice similarity coefficient greater than 98% and was robust across pathology levels and spatial locations. Our results confirm a FLAIR-based ICV analytical pipeline can alone be utilized for large-scale neurodegenerative disease research.
The presented evaluation framework can be deployed by other researchers to assess the viability of tools proposed for automated analysis of diverse, clinical MRI datasets.
Accuracy and reproducibility of automated white matter hyperintensities segmentation with lesion segmentation tool: A European multi-site 3T study
2021, Magnetic Resonance Imaging
Citation Excerpt:
In particular, our main results are summarized as follows: (i) LPA and LGA show a good volumetric accuracy, but LPA performed overall better than LGA; (ii) the LGA and LPA's spatial accuracy increases with the amount of WMHs; (iii) volumetric reproducibility reveals that LST longitudinal pipeline steeply reduces the reproducibility error; (iv) spatial reproducibility of the longitudinal pipeline applied to LGA and LPA outputs was optimal. Compared to De Sitter and colleagues [40] in a cohort of 52 MS patients (mean WMH volume = 4.85 mL), we found a slightly better volumetric accuracy comparing both LPA (mean volume difference = 0.45 mL) and LGA (mean volume difference = 2.88 mL) using SPM12, or other tools that they have tested such as Cascade [41,42] (mean volume difference = 0.67 mL), Lesion-Topology preserving Anatomical Segmentation (Lesion-TOADS) [43] (mean volume difference = 2.18 mL) or using k-Nearest Neighbor with Tissue Type Priors (kNN-TTP) [44] (mean volume difference = −1.46 mL). Similar results have been reported by Egger and colleagues [45] for LGA SPM8 (median volume difference = 0.68 mL), LGA SPM12 (median volume difference = 0.93 mL), LPA SPM12 (median volume difference = 0.85 mL).
Brain vascular damage accumulates in aging and often manifests as white matter hyperintensities (WMHs) on MRI. Despite increased interest in automated methods to segment WMHs, a gold standard has not been achieved and their longitudinal reproducibility has been poorly investigated. The aim of the present work is to evaluate the accuracy and reproducibility of two freely available segmentation algorithms. A harmonized MRI protocol was implemented in 3T-scanners across 13 European sites, each scanning five volunteers twice (test-retest) using 2D-FLAIR. Automated segmentation was performed using Lesion segmentation tool algorithms (LST): the Lesion growth algorithm (LGA) in SPM8 and 12 and the Lesion prediction algorithm (LPA). To assess reproducibility, we applied the LST longitudinal pipeline to the LGA and LPA outputs for both the test and retest scans. We evaluated volumetric and spatial accuracy comparing LGA and LPA with manual tracing, and for reproducibility the test versus retest. Median volume difference between automated WMH and manual segmentations (mL) was −0.22[IQR = 0.50] for LGA-SPM8, −0.12[0.57] for LGA-SPM12, −0.09[0.53] for LPA, while the spatial accuracy (Dice Coefficient) was 0.29[0.31], 0.33[0.26] and 0.41[0.23], respectively. The reproducibility analysis showed a median reproducibility error of 20%[IQR = 41] for LGA-SPM8, 14% [31] for LGA-SPM12 and 10% [27] with the LPA cross-sectional pipeline. Applying the LST longitudinal pipeline, the reproducibility errors were considerably reduced (LGA: 0%[IQR = 0], p < 0.001; LPA: 0% [3], p < 0.001) compared to those derived using the cross-sectional algorithms. The DC using the longitudinal pipeline was excellent (median = 1) for LGA [IQR = 0] and LPA [0.02]. LST algorithms showed moderate accuracy and good reproducibility. Therefore, they can be used as reliable cross-sectional and longitudinal tools in multi-site studies.
ExploreASL: An image processing pipeline for multi-center ASL perfusion MRI studies
2020, NeuroImage
Citation Excerpt:
LST detects outliers in the FLAIR WM intensity distribution and assesses their likelihood of being WMH (Schmidt et al., 2012). While ExploreASL offers the option of both LST lesion growing and lesion prediction algorithms, the default is set to the latter, which has been shown to be more robust (de Sitter et al., 2017a). This WMH correction described here is only performed when FLAIR images are available.
Arterial spin labeling (ASL) has undergone significant development since its inception, with a focus on improving standardization and reproducibility of its acquisition and quantification. In a community-wide effort towards robust and reproducible clinical ASL image processing, we developed the software package ExploreASL, allowing standardized analyses across centers and scanners.
The procedures used in ExploreASL capitalize on published image processing advancements and address the challenges of multi-center datasets with scanner-specific processing and artifact reduction to limit patient exclusion. ExploreASL is self-contained, written in MATLAB, based on Statistical Parametric Mapping (SPM), and runs on multiple operating systems. To facilitate collaboration and data-exchange, the toolbox follows several standards and recommendations for data structure, provenance, and best analysis practice.
ExploreASL was iteratively refined and tested in the analysis of >10,000 ASL scans using different pulse-sequences in a variety of clinical populations, resulting in four processing modules: Import, Structural, ASL, and Population that perform tasks, respectively, for data curation, structural and ASL image processing and quality control, and finally preparing the results for statistical analyses on both single-subject and group level. We illustrate ExploreASL processing results from three cohorts: perinatally HIV-infected children, healthy adults, and elderly at risk for neurodegenerative disease. We show the reproducibility for each cohort when processed at different centers with different operating systems and MATLAB versions, and its effects on the quantification of gray matter cerebral blood flow.
ExploreASL facilitates the standardization of image processing and quality control, allowing the pooling of cohorts which may increase statistical power and discover between-group perfusion differences. Ultimately, this workflow may advance ASL for wider adoption in clinical studies, trials, and practice.
Multimodal Image Analysis for Assessing Multiple Sclerosis and Future Prospects Powered by Artificial Intelligence
2020, Seminars in Ultrasound, CT and MRI
Citation Excerpt:
Specifically, the segmentation is done based on a multichannel input rather than a single imaging modality. In such an intensity-weighing scheme, the assignment of each voxel to WM or lesions is optimized to the lesion boundary.24,25 MIPAV (https://mipav.cit.nih.gov/pubwiki/index.php/Using_MIPAV_Algorithms) is a well-recognized standalone software that includes features of image enhancement, morphologic operations, surface plotter (ie, 3D plot display of intensities in images), region growing, volume rendering, brain extraction, image statistics, and many other processes.
The purpose of this paper is to serve as a template for greater understanding for the practicing radiologist about key steps to perform multimodality computer analysis of MRI images, specifically in multiple sclerosis patients. With this understanding, radiologists will be better equipped about how best to process and analyze MRI imaging data and obtain accurate quantitative information for MS patient evaluation. A secondary intent of this article is to improve radiologist understanding of how artificial intelligence will be employed in the future for better patient stratification, and for evaluation of response to therapy in both clinical care and drug trials.
