Automated Segmentation of Hyperintense Regions in FLAIR MRI Using Deep Learning

We present a deep convolutional neural network application based on autoencoders aimed at segmentation of increased signal regions in fluid-attenuated inversion recovery magnetic resonance imaging images. The convolutional autoencoders were trained on the publicly available Brain Tumor Image Segmentation Benchmark (BRATS) data set, and the accuracy was evaluated on a data set where 3 expert segmentations were available. The simultaneous truth and performance level estimation (STAPLE) algorithm was used to provide the ground truth for comparison, and Dice coefficient, Jaccard coefficient, true positive fraction, and false negative fraction were calculated. The proposed technique was within the interobserver variability with respect to Dice, Jaccard, and true positive fraction. The developed method can be used to produce automatic segmentations of tumor regions corresponding to signal-increased fluid-attenuated inversion recovery regions.


INTRODUCTION
Glioma is the most common primary brain tumor type in adults and has seen little improvement in treatment effectiveness despite being an area of active research. One of the characteristics of gliomas is that it infiltrates the surrounding brain (1). Among gliomas, glioblastomas are the most frequent, are of high grade and aggressive, with median survival time of 14 months (2). Magnetic resonance imaging (MRI) is the most commonly used modality to assess response to therapy because of its superior soft tissue contrast compared with other imaging modalities (3).
Currently, there is significant interest in extracting quantitative imaging biomarkers from MRI including anatomical and functional images that could lead to better patient diagnosis and follow-up. These imaging biomarkers range from texture features to volumetric measurements extracted to predict molecular biomarkers and overall or progression-free survival. The common requirement of all these approaches is the segmentation of tumor from normal brain (4). Automating tumor segmentation could significantly improve treatment planning and follow-up (5), as manual delineation is a time-consuming task because of either artifacts or magnetic field inhomogeneities and it also introduces variability that might reduce the reliability of the image-derived biomarkers.
Brain tumor segmentation is a challenging task, with many researchers and competitors focusing on creating and evaluating newly developed algorithms. In 2012, the Brain Tumor Image Segmentation Benchmark (BRATS) competition (3,5) was established as part of the MICCAI (International Conference on Medical Image Computing and Computer Assisted Intervention) conference, and since then, it has been the "gold standard" for brain segmentation algorithm testing. The brain tumor segmentation algorithms commonly described in the literature usually exploit classical image analysis techniques or pattern recognition techniques (6)(7)(8) with the more recent approaches using deep convolutional neural networks (9)(10)(11)(12)(13)(14)(15)(16).
Each MRI series (image type) reveals different information about the tumor. For instance, T1-weighted (T1w) images after contrast acquisitions reveal information regarding the enhancing part of the tumor, whereas fluid-attenuated inversion recovery (FLAIR) acquisitions capture the edema part of the tumor.
Lesion size in FLAIR images is an important clinical parameter for patient assessment and follow-up. Manual estimation of the volume of the lesions in FLAIR images is time-consuming and highly user-dependent.
Autoencoders have recently been gaining attention for their ability to perform segmentation tasks in medical images (17)(18)(19). One advantage of autoencoders against other deep learning approaches is the use of decoders that enables estimation of features suitable for pixel-wise classification (19).
The aim of this paper is to focus on accurate quantification of the abnormal signal areas in the FLAIR acquisitions in patients with glioma. For the purpose of this study, we use convolutional autoencoders trained on the publicly available BRATS data set and evaluate the accuracy on a data set in which 3 expert segmentations were available. Interobserver variability was also measured.

Data Set
We used both the BRATS 2015 competition data set (N ϭ 186) and a locally collected glioma data set (N ϭ 135). The BRATS scans included 54 low-grade (grades 2-3) and 132 high-grade (grade 4) gliomas. For each tumor, the ground truth was available from the BRATS Web site. All the scans were subjected (by the BRATS authors) to the following preprocessing steps: skull stripping, coregistration to the same anatomical template, and interpolation to 1-mm 3 voxel resolution.
The locally collected data set consisted of 135 preoperative scans from 135 patients for which signal intensity-increased regions were delineated by 3 experienced image analysts, each with Ͼ8 years of experience (user 1, 14 years; user 2, 18 years; and user 3, 12 years). These regions were delineated using MRIcron (20). This data set was used to evaluate the segmentation accuracy of the proposed algorithm (referred to as validation data set).

Preprocessing
For the locally collected data set, skull stripping was performed using an atlas-based technique. This step was not necessary for the BRATS data set, as skull stripping was already performed.
For both the BRATS and local data sets, we applied N4 bias correction (21) followed by Nyúl intensity standardization (22). The N4 bias correction step was used to correct intensity inhomogeneity and artifacts present in MRI acquisitions consisting of low-frequency signals that affect their intensity levels. One challenge with MRI segmentation is the lack of a standard image intensities. The goal of the Nyúl method is to transform the image histograms so that they match the mean histogram determined through training. The algorithms seek to match histograms at certain percentiles. Figure 1 captures the main idea of an autoencoder and its application to image segmentation. The primary concept is that the autoencoder learns how to reconstruct the segmented desired output (namely, the segmentation mask). The encoder layer consists of 7 convolutional layers. The convolutions are used to produce the feature maps. In addition, a rectified-linear nonlinearity is applied followed by max-pooling with a 2 ϫ 2 window. The resulting output is subsampled by a factor of 2. Maxpooling achieves translation invariance, accounting for small spatial shifts. The decoder component consists of a hierarchy of decoders, one corresponding to each encoder. Of these, the appropriate decoders use the max-pooling indices received from the corresponding encoder to perform nonlinear upsampling of their input feature maps. This allows for improved boundary delineation (19). The high decoder output is forwarded to a trainable softmax classifier, which independently classifies each pixel. The number of input channels is the number of classes (in our case, tumor or no tumor) and the output of the sigmoid classifier is a 2-channel image of probabilities. The predicted segmentation corresponds to the class with maximum probability at each pixel.

Autoencoders
The autoencoder was trained in 100 (out of 132) patients with high-grade glioma included in the BRATS data set and tested on the 32 remaining patients. The training set was used to compute the autoencoder's parameters (such as learning rate), and the smaller one was used as a testing data set to ensure that no overfitting occurred. The pixels with P Ͼ 0.5 were assigned to the abnormal FLAIR category. All small regions consisting of Յ10 voxels were removed.
From the extracted region of interest (ROI), a total of 600 000 ROIs were randomly selected. For the purpose of training, ROIs of size 10 ϫ 10 were evaluated ( Figure 2).

Segmentation Evaluation
We applied the simultaneous truth and performance level estimation (STAPLE) (23) method to the local data sets to obtain a probabilistic estimate of the actual ground-truth segmentation incorporating the segmentation data from multiple observers. The STAPLE method assessed the 3 tracings to construct the ground-truth segmentation. To evaluate the proposed algorithm, the following segmentation metrics were calculated: the Dice Coefficient (ie, the similarity index), the Jaccard coefficient (ie, the overlap ratio), and the false positive and false negative fractions. The Dice and Jaccard coefficient values range between 0 and 1, with values closer to 1 indicating closer similarity. Box plot distributions were also calculated. To assess the segmentation results obtained from all the methods considered, the 2 test was performed. For each of the quantitative measures considered in this study, the following 7 first-order statistics were estimated: mean, standard deviation (SD), maximum value (Max), minimum value (Min), median, first quartile (Q1), and third quartile (Q3).

RESULTS
Comparison between the proposed method and the 3 manual segmentations available against the STAPLE algorithm is shown in Table 1. The statistical differences between the methods is presented in Table 2. Compared with STAPLE, user 1 achieved the best results with respect to all the metrics used. User 3 was the worst performing among all users. The proposed system achieved a Jaccard coefficient of 0.758 and ranked third (within the interobserver agreement) with respect to all measures considered. No statistically significant difference was observed between the proposed system results and those from users 1 and 2 besides the false positive fraction (FPF) metric. User 3 was significantly different for all the metrics besides the true positive fraction metric.
There is high interobserver variability for the manual delineation task compared with the STAPLE-derived ground truth, particularly when comparing user 3 to users 1 and 2. In terms of the FPF index, our method performs the worst. As depicted in Figure 6, this can be seen in the false positive segmented regions mostly belonging to the skull region. Figure 3 captures the probabilistic output generated from the autoencoder for an input image. Figures 4 and 5 depict representative examples of the algorithm output and its comparison with the 3 users and the STAPLE segmentation. Figure 5 depicts results on 4 different subjects from the validation data set. In cases 1 and 2, we can observe differences between users 1 and 2 and the way they create the data set. In cases 3 and 4, we can observe undersegmentation resulting from the proposed algorithm. Figure 6 captures false positives observed when using our algorithm. The majority of them can be attributed to undersegmentation errors of the skull stripping algorithm (failure to remove all nonbrain tissues). The false positive segmentation errors are more pronounced in case 2.

DISCUSSION
In this paper, we present an approach based on convolutional autoencoders that segment regions of increased signal intensity on FLAIR images. We used a publicly available data set to train and test the autoencoders, whereas a different data set originating from our institution was used for validation. The validation data set had all the areas of interest manually delineated by 3 image analysts, and STAPLE used these to create a ground-truth labeling. Our approach was able to produce segmentation results that fell within the interobserver variability with no statistical difference versus 2/3 users compared with the STAPLES output in terms of the Jaccard, Dice, and true positive fraction measures.
Our method was the worst in terms of the FPF metric ( Figure  6). This can be attributed to false positive segmentations occurring in areas where the skull-stripping algorithm failed. The data provided from the BRATS competition were those of skull stripping; thus, when we trained our autoencoder, it was unable to learn the skull patterns. This was problematic, as the locally available data set had some skull-stripping failures where the skull was not always removed. To account for this issue, a morphological operation was applied, which eliminated the small false positive regions.
The results of this study indicate that there is substantial interobserver variability for the manual delineation of the abnormal signal on FLAIR images. User 3 had the largest variation compared with the ground-truth variation from users 1 and 2. The challenge of reproducibly segmenting the increased signal regions has been highlighted recently where a Dice coefficient of 0.69 was reported for total tumor volume segmentation in the BRATS data set (24). Akkus et al. (25) reported a 10% variability between 2 different manual segmentations for tumor boundaries on T2-weighted images. Part of the task complexity is that    . Probabilistic output of the proposed system for a case in our validation data set. As depicted, tumor areas appear to be brighter than surrounding tissue (Right panel).

Segmentation of Hyperintense Regions in FLAIR MRI
hyperintensity can be due to nontumor disease (eg, vascular disease) and different users may draw different boundaries because of this. An automated algorithm may help standardize the segmentations and make the information that can be extracted more valuable. Several proposed systems in the literature evaluate their performance against the BRATS data set. In this paper, BRATS was used to train an algorithm, and the performance was evaluated on a locally created data set where the ground truth was based on 3 users (compared with BRATS where the ground truth corresponds to 1 user). We chose to use BRATS as the training data set because the acquisition protocol of the data was more variable than our local data set. Training the algorithm on a less variable data set would likely have been less successful in generalizing to the BRATS data.
Recently   Recently, a publicly available software tool called BraTumIA (24) for tumor segmentation was used as a means of calculating a set of proposed quantitative measures revealing a good correlation between manual and automatic quantification of the features. Compared with the proposed approach, BraTumIA requires precontrast T1w, postcontrast T1w, T2weighted and FLAIR images, making direct comparison with the proposed system challenging.
One advantage of the proposed technique against traditional machine learning algorithms is that the features are learned directly from the images without the need of special feature-extraction techniques. The proposed tool was developed using Python (version 2.7.11, Python Software Foundation, http://www.python.org). The autoencoder was developed with the Keras Python Library (https://github.com/fchollet/keras) and the training and execution of the code were done on an NVIDIA K2 grid card (NVIDIA Inc., Santa Clara, California). The algorithm needed ϳ1 hour to segment a 256 ϫ 256 ϫ 150 data set.
One significant source of error in our approach was because of poor skull stripping. The BRATS data had human-confirmed accurate skull stripping; thus, the algorithm had this included in its model. However, we used only fully automated methods, and so the human cleanup of skull stripping was absent and resulted in many nonbrain regions being labeled as tumorous. Our approach could benefit from a more robust skull-stripping algorithm.

CONCLUSION
The proposed automated system is indistinguishable from expert-derived segmentations in its ability to perform glioma segmentation. This approach will be useful for alleviating the inherent variability of human-derived tumor delineation, thereby improving the reproducibility of image-derived biomarkers.