Automated detection and quantification of brain metastases on clinical MRI data using artificial neural networks

Abstract

Background
Reliable detection and precise volumetric quantification of brain metastases (BM) on MRI are essential for guiding treatment decisions. Here we evaluate the potential of artificial neural networks (ANN) for automated detection and quantification of BM.

Methods
A consecutive series of 308 patients with BM was used for developing an ANN (with a 4:1 split for training/testing) for automated volumetric assessment of contrast-enhancing tumors (CE) and non-enhancing FLAIR signal abnormality including edema (NEE). An independent consecutive series of 30 patients was used for external testing. Performance was assessed case-wise for CE and NEE and lesion-wise for CE using the case-wise/lesion-wise DICE-coefficient (C/L-DICE), positive predictive value (L-PPV) and sensitivity (C/L-Sensitivity).

Results
The performance of detecting CE lesions on the validation dataset was not significantly affected when evaluating different volumetric thresholds (0.001–0.2 cm3; P = .2028). The median L-DICE and median C-DICE for CE lesions were 0.78 (IQR = 0.6–0.91) and 0.90 (IQR = 0.85–0.94) in the institutional as well as 0.79 (IQR = 0.67–0.82) and 0.84 (IQR = 0.76–0.89) in the external test dataset. The corresponding median L-Sensitivity and median L-PPV were 0.81 (IQR = 0.63–0.92) and 0.79 (IQR = 0.63–0.93) in the institutional test dataset, as compared to 0.85 (IQR = 0.76–0.94) and 0.76 (IQR = 0.68–0.88) in the external test dataset. The median C-DICE for NEE was 0.96 (IQR = 0.92–0.97) in the institutional test dataset as compared to 0.85 (IQR = 0.72–0.91) in the external test dataset.

Conclusion
The developed ANN-based algorithm (publicly available at www.github.com/NeuroAI-HD/HD-BM) allows reliable detection and precise volumetric quantification of CE and NEE compartments in patients with BM.

• High performance on heterogeneous MRI data and brain metastases lesions of small sizes.
• Publicly available artificial neural network based brain metastases segmentation algorithm.
About 25-45% of all patients with extracranial malignant primary tumors develop brain metastases (BM). 1,2 Despite multimodal treatment, the life expectancy of patients who develop BM remains poor, with a median survival of 2-18 months. 2,3 In this context, determining exact endpoints of treatment effectiveness plays a key role in neuro-oncology. One of the essential criteria for precisely assessing the efficacy of a new therapy for brain tumors is the growth dynamics determined by magnetic resonance imaging (MRI), based mainly on manual measurements of target lesions according to the Response Assessment in Neuro-Oncology Brain Metastases (RANO-BM) criteria. 4 Although manual measurement of the largest diameter as prescribed by the RANO-BM criteria allows easy and widespread adoption in clinical practice, previous studies have shown that volumetric measurement may provide a more reliable and accurate metric. 5-7 The clinical potential of volumetric measurements and the possibility of automating this laborious analysis through artificial neural networks (ANN) have primarily been demonstrated in the setting of primary brain tumors, 8-16 whereas only a limited number of studies have investigated these approaches in the setting of BM. 17-28 Prior studies that evaluated the performance of ANN for the detection and/or segmentation of BM have shown promising results but have also been limited by a relatively high number of false positive (FP) results (ranging from 1.5 to 20 per case) 17,20,24-26 and relatively poor performance in detecting smaller BM (a high number of false negatives, with reported F1-scores in the range of 0.76-0.85). 19,21,23 Moreover, available studies so far focus only on segmenting the contrast-enhancing (CE) lesions of BM and do not quantify the surrounding non-enhancing FLAIR signal abnormality/edema (NEE), which may be particularly important in the context of evaluating post-treatment changes during follow-up of BM.
Here, we evaluated the potential of a state-of-the-art ANN based on the self-configuring nnU-Net method 29 for automated detection and quantification of CE lesions and NEE in BM using MRI data from a large institutional dataset for training, validation and testing. We evaluated the detection and segmentation performance of the developed ANN on a case- and lesion-wise basis and analyzed the dependence of these metrics on the size of BM. Moreover, we applied the ANN to an independent external dataset, thereby enabling evaluation of the generalization of the model across multisite data.

Datasets
The retrospective analysis of imaging data was approved by the local ethics committee of the Medical Faculty of the University of Heidelberg and informed consent was waived. The following datasets were used for the present study:

Institutional Dataset
To develop, train and test an ANN for automated interpretation of MRI data in a clinical setting, we collected MRI data (n = 308) of adult patients (mean age 61 ± 11 years; 163 female) with BM from several primary cancers, who underwent standardized MRI examination for radiation treatment planning at Heidelberg University Hospital between 04/2011 and 04/2018. We included the last MRI scan prior to the start of radiation therapy. No exclusions were made based on primary tumor histology or the time point of the MRI exam (initial diagnosis of BM, early post-operative, or follow-up), with the goal of exposing the ANN to as many different appearances of BM on MRI as possible and thus enabling it to learn a broad range of clinical scenarios. The institutional MRI dataset was divided into a training/validation and a test dataset with a ratio of 4:1. Specifically, the institutional training/validation dataset consisted of 246/308 (80%) patients and the institutional test dataset consisted of 62/308 (20%) patients.

Importance of the Study
Treatment efficacy according to the Response Assessment in Neuro-Oncology Brain Metastases (RANO-BM) criteria is highly dependent on tumor growth dynamics, whose assessment relies on accurately detecting brain metastasis (BM) instances and correctly estimating their volumetric extent. Because this task is difficult and time-intensive, artificial neural network (ANN)-based methods have been proposed to automate it, first for brain tumors and more recently also for BM. This study expands on previous work to more challenging clinical settings with data from varying stages of treatment and improves performance for small BM instances.

Neuro-Oncology Advances

External Dataset
Another cohort of 30 adult patients (mean age 58 ± 11 years; 15 female) with lung cancer and at least one BM, who underwent routine MRI scans at the Heidelberg Thoracic Clinic between 06/2013 and 08/2019, was used to assess the generalizability of the developed method. This dataset consisted of MRI data at the time point of first occurrence of BM in the course of the disease.

Image Acquisition
MRI exams of the institutional dataset were acquired with a 3-T MRI system (Magnetom Verio, Skyra or Trio TIM; Siemens Healthineers), except for a single exam in the training set, which was acquired at a field strength of 1.5 T (Magnetom Avanto; Siemens Healthineers). All MRI exams of the external test dataset were acquired with a 1.5-T MRI system (Magnetom Avanto; Siemens Healthineers). MRI scans from all datasets were acquired according to an established protocol and included T1-weighted images before and after administration of gadolinium contrast agent as well as FLAIR images (a detailed description of the acquisition parameters is given in the Supplement).

Image Preprocessing
The MRI data were processed as described in Kickingereder et al. 8 Briefly, this included deep-learning based brain extraction using HD-BET, 30 image co-registration, and calculation of T1-subtraction maps (T1-sub). Subsequently, ground-truth segmentation of the BM was performed using ITK-SNAP (www.itksnap.org), as described in Kickingereder et al., 8 by IP, a radiologist in training with 5 years of experience, and subsequently checked by PV, a board-certified neuroradiologist with 10 years of experience. Any discrepancies were resolved through consensus discussion. Specifically, CE lesions (on the T1-sub images or, in case of artifacts on T1-sub, with additional support of T1-weighted post-contrast images) as well as the associated NEE (excluding the contrast-enhancing and necrotic portions of the BM, resection cavities and obvious leukoaraiosis) were selected using a region-growing segmentation algorithm.

Artificial Neural Network
The architecture of the developed ANN (termed HD-BM) was based on the BraTS 2020 winning, 31 self-configuring nnU-Net method, 29 which itself is based on the U-Net, 32 which has been shown to perform excellently in brain tumor segmentation in the context of a large-scale multi-institutional study. 8 During training, the model received all input modalities of each training sample and was taught to reproduce the provided reference annotation. We followed the original, state-of-the-art nnU-Net training regime closely by training an ensemble of five models on our institutional training dataset by means of five-fold cross-validation. This splits the dataset into five partially overlapping training subsets and five mutually exclusive validation subsets. Consequently, each of the images contained in the institutional training dataset was used for validation exactly once, allowing us to report validation metrics for our training cohort. Both test datasets remained untouched until model development was completed. Only then was the final model configuration used to generate predictions, which were subsequently used in the performance analysis. By developing additional models, we also investigated how performance is affected when the model receives only the T1-weighted images after gadolinium contrast agent and the FLAIR images, a configuration we subsequently refer to as "Slim". A detailed description of the applied ANN architecture and a discussion of the Slim configuration are available in the Supplement.
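The five-fold cross-validation scheme described above can be sketched as follows. This is only an illustrative sketch and not the nnU-Net implementation, which generates its own splits; the function name `five_fold_splits`, the case identifiers and the random seed are assumptions made for this example.

```python
import random


def five_fold_splits(case_ids, n_folds=5, seed=0):
    """Return one (train_ids, val_ids) pair per fold.

    Validation subsets are mutually exclusive; training subsets overlap.
    """
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    # Deal cases round-robin so that fold sizes differ by at most one.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        val = folds[k]
        train = [c for j, fold in enumerate(folds) if j != k for c in fold]
        splits.append((train, val))
    return splits


# Hypothetical identifiers for the 246 training/validation cases.
cases = [f"case_{i:03d}" for i in range(246)]
splits = five_fold_splits(cases)
val_sizes = [len(val) for _, val in splits]
print(val_sizes)
```

Because every case appears in exactly one validation subset, validation metrics can be pooled across the five models to cover the whole training cohort, which is what allows validation results to be reported for all 246 cases.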

Statistical Analysis and Evaluation Metrics
The performance of HD-BM for detecting and segmenting BM in both datasets was assessed case-wise for CE and NEE and lesion-wise for CE using the case-wise/lesion-wise DICE-coefficient (C/L-DICE) and sensitivity (C/L-Sensitivity). In addition, for CE lesions we calculated the lesion-wise positive predictive value (L-PPV) and the F1-score alongside the lesion-wise sensitivity (L-Sensitivity). For volume agreement we report the concordance correlation coefficient (CCC), lesion-wise for CE lesions and case-wise for the NEE parts of BM. To evaluate the detection and segmentation performance between and within the respective datasets, we performed the Wilcoxon test and Spearman correlation. P < .05 was considered significant. The statistical analyses were performed using R version 4.0.3 (https://www.r-project.org) and Python version 3.9.7 (http://www.python.org). More information regarding the statistical analysis is provided in the Supplement.
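For orientation, the metrics named above can be written out using their standard textbook definitions. The sketch below is not the evaluation code used in the study: the function names are hypothetical, masks are represented as sets of voxel coordinates for brevity, and the convention of returning 1.0 for two empty masks is an assumption.

```python
from statistics import mean


def dice(ref_voxels, pred_voxels):
    """DICE coefficient between two masks given as sets of voxel coordinates."""
    ref, pred = set(ref_voxels), set(pred_voxels)
    if not ref and not pred:
        return 1.0  # assumed convention when both masks are empty
    return 2 * len(ref & pred) / (len(ref) + len(pred))


def detection_scores(tp, fp, fn):
    """Sensitivity, PPV and F1-score from lesion-wise detection counts."""
    nan = float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else nan
    ppv = tp / (tp + fp) if tp + fp else nan
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else nan
    return sensitivity, ppv, f1


def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient for paired volumes."""
    mx, my = mean(x), mean(y)
    n = len(x)
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


# Toy example with made-up voxels, counts and volumes.
ref = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred = {(0, 0, 1), (0, 1, 1), (0, 1, 2)}
example_dice = dice(ref, pred)  # 2 * 2 / (4 + 3)
example_scores = detection_scores(tp=8, fp=2, fn=2)
example_ccc = concordance_ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Unlike a plain correlation coefficient, the CCC penalizes a systematic offset between predicted and reference volumes: in the toy example the two series are perfectly correlated, yet the constant shift of 1 lowers the CCC to 4/7.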

as compared to the external dataset with only one case out of 30 (3%) (P = .0019).
The types of primary cancers were balanced between the institutional training/validation and test dataset (P = .256), with the most common entities being lung and breast cancer. In contrast, the composition of primary cancers in the external test dataset was different and consisted exclusively of lung cancer patients, thereby reflecting the treatment focus of the Heidelberg Thoracic Clinic from which the external test dataset originated.

[...] Figure S2). By filtering instances < 0.006 cm3 the F1-score increased from mean 0.86 ± 0.19 to its maximum value of mean 0.87 ± 0.19. However, since this increase was non-significant (P = .203), no volumetric threshold was applied for subsequent analyses.

[...] Figure S3). Similarly, the volume of the NEE part of BM also significantly influenced the segmentation performance (C-DICE) of the NEE part of BM on a case-wise level (Spearman's r = .642 with P < .001 in the institutional test dataset and Spearman's r = .697 with P < .001 in the external test dataset) (Supplementary Figure S3). Consequently, the significantly lower volumes of individual CE lesions and of the NEE part of BM in the external test dataset as compared to the institutional test dataset likely explain the relative performance drop of HD-BM in the external test dataset.
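The volumetric threshold analysis mentioned above can be illustrated with a small sketch. It assumes that only predicted instances are filtered by volume and that each prediction has already been matched to at most one reference lesion; the source does not spell out these details, and the function name, data layout and all numbers below are made up.

```python
def f1_after_volume_filter(predictions, n_reference, min_volume_cm3):
    """F1-score after discarding predicted instances below a volume threshold.

    predictions: list of (volume_cm3, matched_reference_id or None).
    n_reference: number of reference lesions in the case.
    """
    kept = [p for p in predictions if p[0] >= min_volume_cm3]
    matched = {ref_id for _, ref_id in kept if ref_id is not None}
    tp = len(matched)
    fp = sum(1 for _, ref_id in kept if ref_id is None)
    fn = n_reference - tp  # a filtered true positive becomes a false negative
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else float("nan")


# Made-up case: four reference lesions, one of which ("r4") was missed.
predictions = [
    (0.500, "r1"),
    (0.120, "r2"),
    (0.004, None),   # tiny false positive
    (0.003, "r3"),   # tiny true positive
    (0.050, None),   # larger false positive
]
sweep = {
    t: f1_after_volume_filter(predictions, n_reference=4, min_volume_cm3=t)
    for t in (0.0, 0.001, 0.006, 0.2)
}
for threshold, f1 in sweep.items():
    print(f"threshold {threshold} cm3: F1 = {f1:.3f}")
```

In this toy case, raising the threshold removes a small false positive but also a small true positive, so the F1-score does not improve. This is the trade-off that a threshold sweep is meant to expose.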

Detection and Segmentation Performance of HD-BM in the Test Datasets
Analysis

Full vs. Slim Configuration
Our Slim configuration of HD-BM performed slightly worse across the key metrics, as is to be expected with less information available. A detailed discussion and interpretation is provided in Supplement S4.

Public Implementation of HD-BM
A public implementation of HD-BM is provided as open-source through www.github.com/NeuroAI-HD/HD-BM.

Discussion
The application of AI for automatic image processing in neuro-oncology has shown enormous potential to improve diagnostic and therapeutic decision-making processes. 8,33 In this paper, we created HD-BM, an ANN-based algorithm for automated volumetric quantification of BM, and evaluated its performance in two test datasets: the institutional test dataset with n = 62 MRIs of patients with BM (n = 384) from different primary malignancies (n = 8) and the external test dataset with 30 patients with BM (n = 155) from lung cancer. HD-BM performed well for both automated detection and volumetric quantification of BM, with high agreement with the radiologist-annotated ground truth while producing ≤ 1 FP/scan. In contrast to previous studies, HD-BM did not focus only on CE lesions but allowed precise differentiation between CE lesions and the surrounding NEE parts of BM, which may be particularly important in the context of post-treatment changes during follow-up of BM. 4 Moreover, in contrast to prior studies (most with thresholds ranging from 0.003 to 4 cm3), 19-23 we found no need to apply a volumetric threshold to maximize lesion detection performance, thereby highlighting the robustness of HD-BM even for small lesions.

Table legend: CE, contrast-enhancing tumors; NEE, non-enhancing FLAIR signal abnormality/edema; PPV, positive predictive value. The "Slim" model refers to a configuration in which HD-BM is trained only with T1-weighted images after gadolinium contrast agent and FLAIR images; it is discussed in detail in the Supplement. All values are medians with interquartile ranges (IQR), except for one F1-score reported as mean with standard deviation (SD). Comparison of the respective datasets on the basis of the necessary input sequences: full model (T1-weighted images before and after gadolinium contrast agent, FLAIR images and T1-subtraction map) and "Slim" model (T1-weighted images after gadolinium contrast agent and FLAIR images). Group differences were evaluated with the Wilcoxon test.
A direct comparison of the performance of our algorithm with other works was possible only to a limited extent because of the different underlying data and metrics. HD-BM achieved a high F1-score (> 0.93 in both test sets); previous studies reported F1-scores of 0.76-0.85. 19,21,23 This can be attributed to the fact that in our analysis a DICE-score of > 0.1 sufficed for a lesion to be considered a true positive, in conjunction with the fact that our method had high detection performance even for small-volume lesions. Moreover, our method produced a lower number of FP/scan (0.87 in the institutional and 0.2 in the external test dataset) compared to about 1.5-20 FP/scan in the literature. 17 [...] increased the likelihood of FP, our method achieved a good performance despite the small size of the BM without resulting in more FP or lower sensitivity. We obtained good L-Sensitivity in detecting BM in both test sets (0.81 and 0.85 in the institutional and independent test sets, respectively), which is well in line with previous studies reporting sensitivities of 0.70-0.96. 17,19,20,22-27 Zhang et al. 27 achieved the highest sensitivity of 0.96, but reported more than twenty times as many FP/scan as HD-BM. This also applied to other works with higher sensitivity, which, however, also featured about eight times (7.8 FP/scan) 20 and two times (1.5 FP/scan) 25 as many FP/scan as HD-BM.
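The true-positive criterion mentioned above (a DICE-score of > 0.1 between a predicted and a reference lesion) can be sketched as a simple matching routine. The greedy one-to-one matching, the function names and the toy lesions are assumptions made for illustration; the exact matching procedure of the study is described in its Supplement and may differ. Lesions are assumed to be non-empty sets of voxel indices obtained beforehand, for example by connected-component labeling.

```python
def dice(a, b):
    """DICE coefficient between two non-empty sets of voxel indices."""
    return 2 * len(a & b) / (len(a) + len(b))


def match_lesions(ref_lesions, pred_lesions, min_dice=0.1):
    """Count lesion-wise TP, FP and FN with a DICE > min_dice criterion.

    Greedy one-to-one matching (an assumption): each reference lesion takes
    the best-overlapping prediction that has not been used yet.
    """
    used = set()
    tp = 0
    for ref in ref_lesions:
        best_idx, best_dice = None, 0.0
        for i, pred in enumerate(pred_lesions):
            if i in used:
                continue
            d = dice(ref, pred)
            if d > best_dice:
                best_idx, best_dice = i, d
        if best_idx is not None and best_dice > min_dice:
            used.add(best_idx)
            tp += 1
    fn = len(ref_lesions) - tp
    fp = len(pred_lesions) - len(used)
    return tp, fp, fn


# Toy scan with voxels flattened to integer indices (made-up data).
ref_lesions = [set(range(0, 10)), set(range(20, 22)), set(range(40, 45))]
pred_lesions = [set(range(2, 10)), set(range(21, 23)), set(range(60, 63))]
counts = match_lesions(ref_lesions, pred_lesions)
strict_counts = match_lesions(ref_lesions, pred_lesions, min_dice=0.6)
print(counts, strict_counts)
```

With the lenient criterion, the two-voxel lesion that overlaps its prediction in a single voxel (DICE 0.5) still counts as detected, whereas a stricter cutoff of 0.6 turns it into one false negative plus one false positive. This shows how the choice of criterion affects reported F1-scores and FP/scan, and why comparisons across studies with different criteria are limited.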

A recent study by Park et al. 26 reported a high sensitivity of 0.931 together with a low rate of 0.59 FP/scan. They developed multiple methods, the best of which used a combination of 3D black blood and 3D gradient echo (GRE) imaging techniques, whereas their model based only on the 3D GRE sequences (like ours) reached a sensitivity of 0.768, which is slightly lower than the L-Sensitivity in our test sets.
HD-BM exhibited both high detection performance and few FP per case, despite the challenging dataset containing multiple low-volume lesions, and performed well on our independent test dataset, indicating high robustness and potential generalizability of our method. We are also confident that HD-BM can be transferred to clinical conditions, since the algorithm performed well on heterogeneous data with a broad range of appearances of BM on MRI, including complex post-treatment alterations such as post-operative bleeding. The L-DICE segmentation performance of HD-BM (0.78 and 0.79 in the institutional and external test sets, respectively) was in line with previous studies (0.6-0.82). 17,19-23,26 As expected, on a case-by-case basis our approach showed a better result, with a median C-DICE of 0.90 in the institutional test set, which is comparable to the results in larger primary brain tumors. 8 We observed lower L-DICE than C-DICE values, which can be expected because many patients have multiple lesions of different volumes: when calculating the C-DICE, larger lesions influence the metric more than smaller lesions because they contribute a greater number of true positive, false negative and FP voxels. Furthermore, the L-DICE of low-volume lesions tends to be lower, as also shown by Bousabarah et al., 19 since the ratio of border voxels to internal voxels increases, leading to a more difficult segmentation problem.

Our study has some limitations. First, we acknowledge the retrospective design of the study. Although HD-BM performed well on both internal and external test sets, further multicentric validation and refinement may be required to verify its generalizability to images from different scanners and vendors and thus enable future clinical applicability. In this context, it will also be necessary to specifically evaluate the performance of HD-BM for longitudinal tracking of BM and response assessment in individual patients. Second, HD-BM requires multiparametric MRI data, which limits the applicability of our method if one of the four required sequences is missing. To mitigate this, previous studies have shown that missing MRI sequences may be synthesized using generative adversarial networks. 34,35 Consequently, this may enable the use of HD-BM even with incomplete and heterogeneous sequence protocols.
In conclusion, our results highlight the capability of ANN for reliable detection and precise volumetric quantification of CE and NEE compartments in patients with BM, thereby supporting the assessment of BM disease burden and progression. A public implementation of HD-BM is available through www.github.com/NeuroAI-HD/HD-BM.

Supplementary Material
Supplementary material is available at Neuro-Oncology Advances online.