External Validation and Retraining of DeepBleed: The First Open-Source 3D Deep Learning Network for the Segmentation of Spontaneous Intracerebral and Intraventricular Hemorrhage

Background: The objective of this study was to assess the performance of the first publicly available automated 3D segmentation for spontaneous intracerebral hemorrhage (ICH) based on a 3D neural network before and after retraining. Methods: We performed an independent validation of this model using a multicenter retrospective cohort. Performance metrics were evaluated using the dice score (DSC), sensitivity, and positive predictive value (PPV). We retrained the original model (OM) and assessed the performance via an external validation design. A multivariate linear regression model was used to identify independent variables associated with the model's performance. Agreement in volumetric measurements and segmentations was evaluated using Pearson's correlation coefficient (r) and the intraclass correlation coefficient (ICC), respectively. Results: With 1040 patients, the OM had a median DSC, sensitivity, and PPV of 0.84, 0.79, and 0.93, compared to those of 0.83, 0.80, and 0.91 in the retrained model (RM). However, the median DSC for infratentorial ICH was relatively low and improved significantly after retraining (p < 0.001). ICH volume and location were significantly associated with the DSC (p < 0.05). The agreement between volumetric measurements (r > 0.90, p < 0.001) and segmentations (ICC ≥ 0.9, p < 0.001) was excellent. Conclusion: The model demonstrated good generalization in an external validation cohort. Location-specific variances improved significantly after retraining. External validation and retraining are important steps to consider before applying deep learning models in new clinical settings.


Introduction
Spontaneous intracerebral hemorrhage (ICH) is a major cause of morbidity and mortality worldwide despite accounting for only up to 27% of all stroke types [1][2][3]. The prognosis after ICH is particularly affected by the ICH volume in addition to its location, the presence of intraventricular hemorrhage (IVH), and acute hematoma expansion (HE) [4]. Thus, instruments for accurate ICH and IVH quantification upon neuroimaging are crucial to guide further patient management and to inform future clinical trials [5][6][7][8]. The ABC/2 method has remained a clinically well-established formula for manually estimating ICH volume [9] despite consistent reports of underestimation or overestimation in large and irregular bleedings. Semiautomatic measurements of ICH and IVH are equally limited, as they are labor-intensive and time-consuming [10]. Novel deep learning-based models have the potential to quantify ICH and IVH volumes rapidly and accurately in a fully automated approach and are therefore in high demand [11]. The DeepBleed network presented by Sharrock et al. is the first publicly available 3D neural network for the segmentation of ICH and IVH [12]. Despite being trained and internally validated on a large dataset from the MISTIE II and III trial series [13,14], its performance in an independent cohort has not yet been described. This step is of particular importance for imaging-based segmentation networks, as they have increasingly shown inconsistent performance on external datasets [15]. Therefore, it is important that clinicians are aware of the quality assessment steps that need to be taken before the local implementation of these models. In particular, the performance of DeepBleed in the detection and segmentation of infratentorial and small ICH remains undetermined, as these two subsets were excluded from the MISTIE trials [13,14,16].
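The ABC/2 method mentioned above estimates hematoma volume from three caliper measurements: the largest axial hemorrhage diameter (A), the diameter perpendicular to A on the same slice (B), and the craniocaudal extent (C). A minimal sketch with illustrative values (not taken from this study):

```python
def abc2_volume_ml(a_cm: float, b_cm: float, c_cm: float) -> float:
    """Estimate ICH volume in mL via the ABC/2 method.

    a_cm: largest axial diameter (cm)
    b_cm: diameter perpendicular to A on the same slice (cm)
    c_cm: craniocaudal extent (cm); 1 cm^3 corresponds to 1 mL
    """
    return (a_cm * b_cm * c_cm) / 2.0

# Example: a 5 x 4 x 3 cm hematoma yields an estimate of 30 mL.
print(abc2_volume_ml(5.0, 4.0, 3.0))  # 30.0
```

The ellipsoid approximation underlying ABC/2 is what drives the under- and overestimation reported for large and irregular bleedings.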
The aim of this study was to evaluate the generalizability and further improve the robustness of the proposed DeepBleed network by assessing its performance before and after retraining. We hypothesized that the DeepBleed network would accurately detect and segment ICH and IVH regardless of location and size. To test this, three steps were performed. First, we externally validated the original DeepBleed model (OM) in an independent multicenter cohort. Second, we retrained the model (RM) to test the effect on the validation accuracy through an internal validation design. Third, we compared the interrater reliabilities between the OM and RM networks and independent human raters. This study serves as a use case illustrating how the generalizability of deep learning models may be addressed via local retraining.

Study Population
This retrospective study was approved by the local ethics committees (Charité Berlin, Germany (protocol number EA1/035/20), University Medical Center Hamburg, Germany (protocol number WF-054/19), and IRCCS Mondino Foundation, Pavia, Italy (protocol number 20190099462)). Written informed consent was waived by the institutional review boards. All study protocols and procedures were conducted in accordance with the Declaration of Helsinki. The study included patients aged ≥18 years who were diagnosed with primary spontaneous ICH upon noncontrast computed tomography (NECT) between January 2017 and June 2020. Patients with multiple ICH, artifacts, an external ventricular drain (EVD) or any other type of surgical procedure, or secondary hemorrhage following head trauma, ischemic infarction, neoplastic mass lesions, ruptured cerebral aneurysms, or vascular malformations were excluded from the study, as presented in Figure 1.

Image Acquisition and Manual Segmentation
Participating sites acquired NECT images according to their local imaging protocols. De-identified and pseudonymized imaging data were retrieved from the local picture archiving and communication system (PACS) servers and converted into the Digital Imaging and Communications in Medicine (DICOM) format according to local guidelines. DICOM data were then converted to the Neuroimaging Informatics Technology Initiative (NIfTI) format for further imaging analysis. Images were analyzed for the presence of IVH and the ICH location by one experienced neuroradiologist (J.N., with 5 years of experience in ICH imaging research). Supratentorial bleedings in cortical and subcortical locations were classified as lobar, and hemorrhages involving the thalamus, basal ganglia, internal capsule, and deep periventricular white matter were classified as deep [17]. Infratentorial bleedings were classified as involving the brainstem, pons, or cerebellum [18]. Ground truth (GT) masks of ICH and IVH were manually segmented on CT scans by two experienced raters (with 3 and 5 years of experience in ICH imaging research, respectively), supervised by one neuroradiologist (J.N.) who inspected each ICH and IVH mask for segmentation quality and corrected it if necessary. Segmentation of the ICH and IVH was performed using ITK-SNAP software version 3.8.0 (Penn Image Computing and Science Laboratory, Philadelphia, PA, USA) [19]. For 60 patients, two expert raters (J.N. and F.M., each with 5 years of experience in ICH imaging research) segmented the test set six months apart to calculate inter-reader agreement. The shuffle function from the NumPy random module was applied to the subject ID list [20]. All readers independently analyzed and segmented images in a random order while blinded to all demographic data and were not involved in the clinical care or assessment of the enrolled patients.
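The randomization of the reading order described above can be reproduced with NumPy's shuffle; the subject ID format and seed below are illustrative choices, not taken from the study (which used `numpy.random.shuffle` directly):

```python
import numpy as np

# Hypothetical subject IDs; the study shuffled its own ID list.
subject_ids = [f"subj_{i:03d}" for i in range(60)]

rng = np.random.default_rng(seed=42)  # seed chosen here for reproducibility
shuffled = np.array(subject_ids)
rng.shuffle(shuffled)  # in-place permutation, analogous to numpy.random.shuffle

# Raters then read the cases in this randomized order.
print(shuffled[:5])
```

Shuffling the ID list once and fixing that order for all raters ensures each reader sees the same randomized sequence while remaining blinded to case context.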

Preprocessing and Postprocessing
The preprocessing comprised two steps as described in the original study (Figure 2): brain extraction and coregistration. Briefly, after setting CT intensities lower than 0 or higher than 100 to zero, brain extraction was performed with the FSL Brain Extraction Tool (BET) [21], setting the fractional intensity parameter (-f flag) to 0.01. Rigid coregistration was performed with ANTs [22] using a 1.5 mm³ isotropic CT template [23]. The resulting transformation was applied to the GT masks as well. During postprocessing, a threshold of 0.6 was applied to the resulting probability maps, setting the higher values to 1 and the others to 0. Finally, inverse coregistration of the resulting mask was performed using the inverse transformation matrix.
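The intensity clipping and probability-map thresholding steps can be expressed directly in NumPy; a minimal sketch follows. The BET and ANTs calls are external command-line tools and are only indicated in comments; a strict "greater than" comparison is assumed at the 0.6 threshold.

```python
import numpy as np

def clip_ct_intensities(volume: np.ndarray, lo: float = 0.0, hi: float = 100.0) -> np.ndarray:
    """Set CT intensities below `lo` or above `hi` to zero (preprocessing step)."""
    out = volume.copy()
    out[(out < lo) | (out > hi)] = 0.0
    return out

def binarize_probability_map(prob: np.ndarray, threshold: float = 0.6) -> np.ndarray:
    """Postprocessing: values above the threshold become 1, the rest 0."""
    return (prob > threshold).astype(np.uint8)

# External steps (not runnable here; illustrative command forms):
#   bet input.nii.gz brain.nii.gz -f 0.01   # FSL BET, fractional intensity 0.01
#   antsRegistration ...                    # rigid, to a 1.5 mm isotropic CT template

ct = np.array([[-10.0, 50.0], [120.0, 80.0]])
print(clip_ct_intensities(ct))          # [[ 0. 50.] [ 0. 80.]]
prob = np.array([0.2, 0.61, 0.95])
print(binarize_probability_map(prob))   # [0 1 1]
```

The inverse transformation matrix from the rigid registration is then applied to the binarized mask to map it back into native space.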

Model Retraining
In addition to the general exclusion criteria described above, the following additional exclusion criteria were applied to the training cohort in accordance with those of the MISTIE studies [13,14]: symptom onset > 24 h prior to the admission CT or an unknown time of symptom onset, as well as an admission ICH volume of > 30 mL. As described in the original study, adaptive moment estimation (Adam) was used as the optimizer [24], the dice similarity coefficient (DSC) was used as the loss function [25], and 100 randomly selected subjects formed the training cohort, whereas 20 formed the validation cohort. After testing various combinations, a learning rate of 1 × 10−4 and a batch size of 3 were chosen. The training dataset was initially shuffled; validation was performed every five epochs, and training was stopped if performance did not improve after the 10th epoch.
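The dice-based loss used for training can be sketched as a soft (differentiable) Dice formulation; this is a generic version in NumPy, not the exact DeepBleed implementation:

```python
import numpy as np

def soft_dice_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """1 - Dice similarity coefficient between a predicted probability map
    and a binary ground-truth mask (generic soft Dice formulation).

    `eps` avoids division by zero when both volumes are empty.
    """
    intersection = np.sum(pred * target)
    denom = np.sum(pred) + np.sum(target)
    dice = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - dice

# A perfect prediction gives a loss near 0; a fully disjoint one near 1.
mask = np.array([0, 1, 1, 0], dtype=float)
print(round(soft_dice_loss(mask, mask), 6))  # 0.0
```

Because the loss is one minus an overlap ratio, minimizing it with Adam directly maximizes the DSC metric reported in the validation.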

Model Testing
For the model testing, preprocessing and postprocessing as described above were performed on the OM and RM test dataset. Contrary to the training dataset, the test dataset was defined according to the original criteria of our study, with no additional exclusion criteria applied. This decision was made because we were interested in evaluating the performance of DeepBleed on small hemorrhages. After gantry tilt and unequal slice spacing were corrected, DICOM data were converted to NIfTI. Python pipelines were used for brain extraction and coregistration, DeepBleed was then used to predict intracerebral hemorrhage (ICH) and intraventricular hemorrhage (IVH), and in the final step the predictions were inversely transformed from the template registration back into native space.


Statistical Analysis
Statistics were conducted using GraphPad Prism v9.0.2 (GraphPad Software, Inc., San Diego, CA, USA) [28] and R (the R Project for Statistical Computing, Vienna, Austria) with tidyverse [29]. Various metrics were used to evaluate the DeepBleed performance and to compare the segmentations of our RM with those of the OM, including the DSC, sensitivity, positive predictive value (PPV), and volume measurements. t-tests were used to compare the DSC, sensitivity, and PPV distributions between the OM and RM; based on the central limit theorem, the t-test assumptions were fulfilled. To determine factors influencing segmentation performance, a multivariate linear regression model was fitted with the DSC as the dependent variable and ICH location, ICH volume, presence of IVH, and site as independent variables. Pairwise correlations among the volumes measured by each of the three segmentation methods (GT masks, OM, and RM DeepBleed networks) were assessed using the Pearson correlation coefficient (r). Agreement in the DSC between the two raters and the OM and RM DeepBleed networks was assessed using the intraclass correlation coefficient (ICC). Moreover, a repeated-measures ANOVA was performed with pairwise t-tests as post hoc tests. The homogeneity of variance assumption for ANOVA was evaluated using the Levene test, and Cohen's d effect size was determined as well. A p-value < 0.05 was considered significant. Bonferroni adjustment was applied where necessary; adjusted p-values are indicated as padj-values.
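From binary masks, the reported overlap metrics reduce to counts of true positives, false positives, and false negatives. A minimal sketch (toy masks for illustration; nonempty masks assumed to avoid division by zero):

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """DSC, sensitivity, and PPV for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)    # voxels correctly labeled as hemorrhage
    fp = np.sum(pred & ~gt)   # voxels wrongly labeled as hemorrhage
    fn = np.sum(~pred & gt)   # hemorrhage voxels that were missed
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "ppv": tp / (tp + fp),
    }

gt = np.array([1, 1, 1, 0, 0])
pred = np.array([1, 1, 0, 1, 0])
print(segmentation_metrics(pred, gt))  # all three metrics equal 2/3 here
```

In the study, these per-patient metrics were then summarized as medians with 95% confidence intervals and compared between the OM and RM.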

Demographics and Characteristics of the Study Cohort
The manual review of the images led to the elimination of 54 patients due to exclusion criteria and manual segmentation errors. The final dataset comprised 1040 patients with NECT scans and respective masks, with n = 100 patients in training, n = 20 in validation, and n = 920 in testing. The mean age was 69.6 (SD 14.2) years. The median NIHSS and GCS scores were 7.5 (IQR 12) and 13 (IQR 7), respectively. Imaging was performed within a median time from symptom onset of 4.3 h (IQR 13.6). In total, 519 patients (49.9%) presented with ICH and IVH with a mean volume of 74.9 (SD 41.6) mL. A total of 521 patients (50.1%) presented with ICH only with a mean volume of 41.5 (SD 32.8) mL. No significant differences in demographic characteristics were found between training, validation, and test subjects, as presented in Table 1.

Model Retraining and Testing
With an Nvidia RTX 3090 GPU, the model was trained for 810 epochs in 16 h. Illustrative examples of segmentations are displayed in Figure 3. Model performance metrics of the OM and RM derived in the test set are presented in detail in Table 2, with additional DSC metrics illustrated in Figure 4A. Metrics comprise the DSC, sensitivity, and positive predictive values (PPVs) and are given as medians with 95% confidence intervals. The metrics of the original (OM) and retrained weights (RM) were compared using t-tests with adjusted p-values (95% CI = 95% confidence interval; padj-value = adjusted p-value; t1 = paired t-test between OM and RM for the specified metric; ns = not significant).

Analysis of Factors Influencing the Model Performance
The results from the multivariate linear models are summarized in Table 3, and the results of the univariate analysis are presented in Supplementary Table S2. Overall model performance was negatively influenced by ICH location. However, deep ICH was the only location not significantly associated with a loss in DSC performance in either the OM or the RM network (p > 0.05). While the slope coefficients for lobar ICH remained relatively similar in the OM and RM (−0.04, SD 0.01 and −0.06, SD 0.01), the negative effect of brainstem and cerebellar ICH decreased from −0.20 (SD 0.03) and −0.32 (SD 0.02) in the OM to −0.18 (SD 0.03) and −0.08 (SD 0.02) in the RM (p < 0.001). An increase in ICH volume had a strong positive effect on the DSC in both the OM and the RM network. The presence of IVH and the data's originating site had no significant effect on the DSC in the OM or RM. The associations of ICH location and volume with the DSC are illustrated in Figure 4B,C. Figure 5A shows the correlation between the GT masks and DeepBleed's automatic volume predictions with the OM and RM. Overall, strong correlations were observed among the three segmentation methods (r > 0.9, p < 0.001); the correlation between the two DeepBleed models (OM and RM) was highest, whereas their correlations with the GT volumes were lowest (r = 0.92 for the OM and r = 0.94 for the RM). The mean volumes (±SD) of the GT, OM, and RM were 43.2 (±42.6), 36.0 (±35.9), and 36.2 (±35.9) mL. The median volumes of the three methods are displayed in Figure 5B. The repeated-measures ANOVA showed no significant difference between the volume estimation of the GT masks and the automatic volume estimations of the DeepBleed OM and RM (F = 0.45, p > 0.05). Significant agreements were found between the DSCs of manual segmentations by two expert raters and those of the OM and RM DeepBleed networks with the GT masks (ICC = 0.90 and ICC = 0.94, p < 0.0001), as presented in Figure 6A,B.
The repeated-measures ANOVA showed a significant effect of the segmentation source (OM, RM, and human raters) on the DSC (F = 14.38, p < 0.0001). The post hoc tests showed significant differences between the OM and RM (t = 2.78, padj < 0.05, d = 0.4, small), between the OM and both raters (t = −5.11 and t = −5.37, padj < 0.001, d = 0.7, moderate for both), and between the RM and both raters (t = −4.9 and t = −5.3, padj < 0.001, d = 0.7, moderate for both). No significant difference was found between the two raters (t = −1.57, p > 0.05, d = 0.2, small).

Discussion
Our study confirmed the external validity of the first open-source 3D deep learning network for the automatic detection and segmentation of spontaneous ICH with the presence of IVH upon CT. Furthermore, we illustrated the importance of local retraining for a specific setting to increase the applicability of neural network models for ICH segmentation purposes.
In brief, the OM showed overall good results in our multicenter cohort during the validation process. However, location-specific performance metrics were comparatively low for infratentorial ICH lesions and improved significantly after retraining. In particular, performance metrics in cerebellar ICH improved with a 1.7-fold increase in the DSC, and the negative effect of cerebellar ICH on the DSC in our linear model improved with a slope increase of 25% after retraining. In comparison, the model performance in supratentorial ICH was overall good for both the OM and RM, with deep ICH demonstrating the best and most stable performance metrics. Nonetheless, even lobar ICH improved slightly after model retraining, despite initially demonstrating the second-best performance metrics in the OM. Additionally, DSC performance was negatively associated with lobar ICH lesions in our linear regression model. Lobar ICH may demonstrate irregular margins and internal density heterogeneities upon imaging [30]. These imaging phenomena have especially been associated with the use of oral anticoagulants; patients on novel oral anticoagulants (NOACs) were, in turn, excluded from the MISTIE III trial [14,31,32]. As DeepBleed adopts a binary prediction approach at the voxel level using a predetermined threshold, these nuanced voxel differences might be missed [12]. In comparison, another deep learning network, nnU-Net, utilizes the softmax output, as demonstrated by Zhao et al. for ICH segmentation [33][34][35]. However, we believe that these differences have only a minor impact on the general segmentation performance, as shown by the overall good DSC during the validation process. A key strength of the network was that the DSC metrics were independent of the participating site's dataset as well as the presence of IVH.
The level of heterogeneity in the developmental cohort of the DeepBleed network may directly relate to the high generalizability of the original model in our external validation cohort. In brief, the original DeepBleed network was multicenter-curated with data from the MISTIE II and III trial series, which were conducted at 78 sites in North America, Europe, Australia, and Asia and included over 500 patients [13,14]. In comparison, most previous ICH segmentation models were single-center-curated and thus require even more generalization testing ahead of clinical implementation at other sites [35][36][37][38][39]. Secondly, the DeepBleed model employs a dice-based loss function and the Adam optimizer, enabling the easy combination of various datasets and augmentation options and making it straightforward to share the trained models as open source in order to adapt the network to specific settings [12].
We also observed some limitations in the DeepBleed network. We found that a volume increase had a significantly positive effect on the DSC. This finding is consistent with previous studies showing a positive correlation between lesion size and the DSC in other segmentation networks [40,41]. Therefore, DeepBleed's limited performance in small ICH appears to be a general limitation of deep learning-based segmentation networks rather than a methodological limitation specific to DeepBleed, whose training data included only supratentorial ICH with an absolute volume greater than 30 mL according to the inclusion criteria of the MISTIE III trial [14]. Considering other performance metrics, the absolute volume differences described by the DeepBleed authors were only small despite the variations in the DSC [12] and were thereby within the range of volume errors of 2 to 5 mL observed in similar studies [11,12,42]. In line with this, our post hoc analysis showed a high correlation between the manual and automatic volume predictions, with an underestimation of 5 mL in the automated approach, while the automatic segmentations had a statistically lower agreement in terms of the DSC compared to human expert raters [43,44]. The latter findings might have stronger clinical implications than the conclusions obtained from the DSC metrics, as the ICH volume is of great relevance in clinical practice [13,14,16,45,46,47].
The DSC evaluates the quality of alignment, i.e., the overlap between the predicted and the GT segmentation [48]. Finally, our results are limited to an external validation cohort of ICH patients who presented within 24 h of symptom onset; hence, the performance on follow-up CT scans beyond this time interval may differ.
Our study has two main implications. First, our results illustrate that the generalizability of neural networks may be restricted even when the development and validation cohorts have strong similarities in terms of patient population and healthcare context. Second, more extensive retraining may be required to improve performance at a new site when generalizability is poor. From a clinical point of view, the open-source RM could support decision making for surgical interventions and aid outcome prediction [18,49] in both supra- and infratentorial ICH [17,18] with IVH [50,51].
To conclude, the DeepBleed network demonstrated good generalization in an external validation cohort of patients diagnosed with spontaneous ICH on CT, and retraining significantly improved location-specific variances. Volumetric analysis showed strong agreement with the manual segmentations of expert raters, although the automatic segmentations agreed with the ground truth masks statistically less closely than did the human expert raters. The code and RM weights have been made available online [52,53]. Our study illustrates the importance of local retraining for a specific setting to increase the applicability of neural network models for segmentation purposes in spontaneous ICH patients. ICH clinicians and decision makers may take this into account when considering applying externally designed neural network models to their local settings.