A comparison of performance between a deep learning model with residents for localization and classification of intracranial hemorrhage

Intracranial hemorrhage (ICH) from traumatic brain injury (TBI) requires prompt radiological investigation and recognition by physicians. Computed tomography (CT) is the investigation of choice for TBI and has become increasingly utilized amid a shortage of trained radiology personnel. Deep learning models are anticipated to be a promising solution for generating timely and accurate radiology reports. Our study examines the diagnostic performance of a deep learning model in the detection, localization, and classification of traumatic ICH and compares it with that of radiology, emergency medicine, and neurosurgery residents. Our results demonstrate that the deep learning model achieves high accuracy (0.89) and outperforms the residents in sensitivity (0.82), but still lags behind them in specificity (0.90). Overall, our study suggests that the deep learning model may serve as a potential screening tool to aid the interpretation of head CT scans in traumatic brain injury patients.


Material and methods
Development of the deep learning model. The deep learning model we investigated in this study was proposed in our prior published study 22. The model is a variation of the DeepMedic 23 model that segments SDH, EDH, and IPH on a CT scan. Its architecture consists of four parallel pathways that process the input at different resolutions and two fully connected layers. It takes as input a 2-channel volume extracted from the subdural and bone windows of a brain CT scan. All voxels in a CT scan were processed by the model to generate a segmentation result in which the value of each pixel is the class label of the hemorrhage type at that location. After the segmentation results were produced, SDH and EDH regions with a major axis of less than 5 mm were considered noise and removed from the segmentation result. The final results were drawn on the input CT scan by assigning a different color to each hemorrhage type. The model is shared via https://github.com/RadiologyCMU/Hemorrhage-DeepMedic. In this study, the deep learning model was applied without any fine-tuning, and we adopted the same data pre-processing used in our previous work. In addition, a new dataset, different from the dataset used to train the model in 22, served as the test dataset. Its samples were randomly selected from patients who were not in the training dataset; for patients who received multiple scans, only the first scan was included.
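The small-region filtering step can be sketched as follows. The function name, the use of SciPy's connected-component labelling, and the moment-based major-axis estimate are our illustrative assumptions, not the published implementation; the in-plane resolution of 0.4473 mm per pixel is taken from the Study cohort section.

```python
import numpy as np
from scipy import ndimage

PIXEL_MM = 0.4473   # assumed in-plane resolution (mm/pixel)
MIN_AXIS_MM = 5.0   # SDH/EDH regions with a smaller major axis are treated as noise

def remove_small_regions(mask):
    """Drop connected components whose major axis is below MIN_AXIS_MM.

    `mask` is a 2-D boolean array for one hemorrhage class on one slice.
    The major axis is estimated from the second-order moments of each
    component (4 * sqrt of the largest eigenvalue of the coordinate
    covariance), analogous to scikit-image's regionprops.
    """
    labels, n = ndimage.label(mask)
    cleaned = np.zeros_like(mask)
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        coords = np.stack([ys, xs], axis=1).astype(float)
        cov = np.cov(coords, rowvar=False) if len(coords) > 1 else np.zeros((2, 2))
        major_px = 4.0 * np.sqrt(np.max(np.linalg.eigvalsh(cov)))
        if major_px * PIXEL_MM >= MIN_AXIS_MM:
            cleaned[labels == i] = True
    return cleaned
```

A thin 30 × 2-pixel strip (roughly 15 mm long) survives this filter, while an isolated 3 × 3 speck (under 2 mm) is discarded.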
Study cohort. Non-contrast head CT scans of adult patients aged 15 years or older suspected of head injury/ trauma as the initial presentation during emergency department (ED) visits at Maharaj Nakorn Chiang Mai Hospital from January 1, 2014, to December 31, 2014, were included. Exclusion criteria included: (1) Follow-up CT studies of patients with known recent TBI; (2) Studies of patients with recent neurosurgical intervention; (3) Severe artifacts degrading study quality such as motion artifacts and metallic artifacts.
Brain CT scans were acquired with CT equipment from either of two manufacturers (Toshiba Aquilion 16 or Siemens SOMATOM Definition). Each slice was stored as a 512 × 512-pixel DICOM image, with a typical in-plane (x, y) resolution of 0.4473 mm per pixel.
The number of image slices per patient varied between 80 and 115, depending on factors such as the size of the patient's head, with a fixed slice spacing of 1.5 mm. With these criteria, a total of 300 head CT studies from different subjects were finally included: 166 studies were categorized in the intracranial hemorrhage (ICH) group, while 134 were classified in the non-ICH group. After thorough review and annotation by consensus of two experienced neuroradiologists, the 300 head CT studies yielded 171 IPH lesions (27.01%), 356 SDH lesions (56.24%), and 106 EDH lesions (16.75%). Detailed data describing the locations of the lesions are shown in Table 1.
Classification of ICH by deep learning model. The deep learning model was then used to identify negative and positive studies. In the positive studies, the subtypes of ICH of concern were classified and segmented. The model segmented the area of ICH with different color maps at specific locations, indicating the different subtypes and locations of ICH. Correct segmentation meant coloring true ICH in the expected locations, as in Fig. 1. Segmentations at the true location of an ICH but with an incorrect color (i.e., the wrong subtype) were considered false interpretations.
Classification of ICH by residents. Four radiology residents (three junior and one senior) and four non-radiology residents (two senior emergency medicine residents and two senior neurosurgery residents) were recruited for the study. These three specialties were chosen because their residents are the most likely to make the initial CT interpretation in the emergency department. Blinded to the original CT reports, all eight residents independently interpreted all head CT scans solely from the CT series consisting of 1.5-mm thick slices in the axial plane. The residents could manually adjust the window width and level for each scan during interpretation.
A record form was created consisting of multiple-choice checkboxes listing the possible locations of each ICH subtype.

Statistical analysis. In most cases, there was more than one ICH subtype and/or multiple lesions of the same subtype. A lesion was counted as "detected" only when the algorithm or resident correctly identified both its subtype and its location; if any ICH remained undetected or was misidentified by either subtype or location, it was counted as "missed". Imaging of ICH subtypes in certain locations is presented in Fig. 3. We evaluated the performance of the algorithm and the residents using statistical metrics, including accuracy, sensitivity, and specificity, computed with the Python package scikit-learn. A difference was considered significant when p < 0.05.
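As an illustration of how these location-level metrics follow from binary per-location labels, the sketch below computes accuracy, sensitivity, and specificity from the four confusion-matrix counts. The function name and the 0/1 label encoding are our assumptions; the study used scikit-learn, whose confusion-matrix utilities yield the same counts.

```python
def location_level_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity over binary per-location labels.

    y_true / y_pred: sequences of 0/1, one entry per (study, subtype,
    location) cell; 1 means an ICH of that subtype is present (or reported)
    at that location.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity
```

A missed lesion (false negative) lowers sensitivity, while an overcall at a lesion-free location (false positive) lowers specificity; both lower accuracy.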

Results
We evaluated the performance of the deep learning model and compared it to that of the eight resident trainees in the classification and localization of ICHs based on the individual locations occupied by specific types of ICH. The accuracy, sensitivity, and specificity of the algorithm and residents are displayed in Table 2. In terms of ICH detection and localization, the model achieved an accuracy of 0.89, with a sensitivity of 0.82 and a specificity of 0.90. Regarding the detection and localization of each ICH subtype, the deep learning model was the most sensitive in detecting SDH (sensitivity = 0.85); the sensitivities for detecting IPH and EDH were 0.83 and 0.72, respectively. The radiology residents performed similarly to the model in terms of sensitivity for each ICH subtype: 0.80 for IPH, 0.71 for EDH, and 0.73 for SDH.

Table 2. Location-level performance of the algorithm, three junior radiology residents, a senior radiology resident, two emergency medicine residents, and two neurosurgery residents, presented as accuracy, sensitivity, and specificity. Abbreviations: ICH, intracranial hemorrhage; IPH, intraparenchymal hemorrhage; EDH, epidural hemorrhage; SDH, subdural hemorrhage.

In several cases, the deep learning model detected subtle SDH that could not be detected by any of the residents. Most of these cases were studies containing thin SDHs either along the tentorium cerebelli or the cerebral convexities. Some of these cases had various hemorrhagic subtypes, resulting in small hemorrhages being overlooked. Three subtle EDHs were missed by radiology residents, and five subtle EDHs were missed by non-radiology residents; however, all of these EDHs were picked up by the algorithm. These cases are illustrated in Figs. 4 and 5. Only two cases of small EDHs were missed by the deep learning model but were detected by all residents (Fig. 6).
The specificities for ICH detection, overall and by subtype, were relatively high for both the deep learning model and the residents. The overall specificity for ICH detection by the algorithm was 0.90, while the specificities for ICH detection by radiology and non-radiology residents were both 0.99. Specificity values for each subtype of ICH varied between 0.82 and 0.90 for the deep learning model, and between 0.97 and 0.99 for the residents (Table 2). There were many cases in which the deep learning model overdiagnosed, usually involving basal ganglia calcification, beam-hardening artifacts, dense cortical veins, and dural venous sinuses being interpreted as hemorrhage. Some of these studies are presented in Fig. 7.

Discussion
Our results demonstrate non-inferior diagnostic accuracy in ICH detection for the deep learning model compared to the residents (p > 0.05). In terms of sensitivity, the model yielded noticeably higher overall sensitivity values for ICH detection across nearly all subtypes compared to the residents (p < 0.05). Nonetheless, the specificity of the model still falls behind that of the residents.
In our study, we determined the accuracy of the algorithm by evaluating the frequency of correct and incorrect ICH detections at each location. The deep learning model achieved high accuracy for overall ICH detection and for IPH, EDH, and SDH detection (0.89, 0.93, 0.91, and 0.82, respectively). The diagnostic performance of deep learning models reported in prior literature was variable, with accuracy values ranging from 0.70 to 0.94 13. Similar studies based on convolutional neural networks yielded approximate accuracy values from 0.81 to 0.90 20,24. Few studies have compared the diagnostic performance between algorithms and trainee residents. Ye et al. demonstrated superior performance of a deep learning neural network and concluded that their algorithm was fast and accurate, indicating its potential role in assisting junior radiology residents and reducing misinterpretation of head CT scans 15. However, they primarily focused on ICH detection and classification, while our study also stresses the importance of identifying both the subtype and the correct location.
Although our deep learning model demonstrated high specificity, it still lags behind that of the residents. A review of the lesion segmentations by the algorithm showed that many non-ICH findings had been misinterpreted as ICH. This is possibly because identification of ICH requires the region to be of higher attenuation, or higher Hounsfield unit (HU), than the surrounding normal brain parenchyma. Other findings with high HU, such as basal ganglia calcification, beam-hardening artifacts, dense cortical veins, and dural venous sinuses, were misclassified as hemorrhage. This is concordant with prior results in which common AI overcalls were calcification and beam-hardening artifacts 25,26. While radiologists and clinicians gain experience in identifying actual ICHs, AI analysis software must likewise be trained to recognize these ICH mimickers.
In terms of sensitivity, the deep learning model yielded noticeably higher overall sensitivity values for ICH detection across nearly all subtypes compared to the residents (p < 0.05). Moreover, the deep learning model could detect subtle hemorrhages missed by residents, such as thin SDH and small EDH. In their study of interpretative errors in radiology, Waite et al. suggest that perceptual errors account for 60%-80% of radiological errors 27. Since perception and detection are the initial phases of image interpretation, an error in this phase can abruptly terminate the diagnostic process and result in a mistaken (false-negative) diagnosis. Perceptual errors in radiology have been linked to a variety of causes, including interpreter fatigue and the increased pace of interpretation 28. The error rate can also change depending on the time of day the interpretation is given: long and overnight shifts are associated with increased rates of inaccuracy, according to previous studies 29,30. More crucially, in approximately 1% of cases, the diagnostic error results in incorrect or inadequate patient management 31,32. This reinforces the vital role of a deep learning model as a potential screening tool in emergency cases requiring rapid ICH detection, or as an assistant for trainee residents generating emergency CT reports. Similar work has been done using an active automated tool to detect acute intracranial conditions, including hemorrhage, reprioritize studies, and notify radiologists if ICH was identified as present; these tools resulted in significant reductions in diagnostic waiting time with high detection sensitivity 33,34.

Figure 6. The CT scan of the brain shows a small left frontal convexity EDH (white arrow) in a narrowed window setting (a) and an associated skull fracture in the bone window setting (b). The post-processing image reveals that the EDH has not been annotated by the algorithm (c).
The appropriate sensitivity-specificity tradeoff is likely to differ depending on the operator's role in the care pathway. If the primary role of the operator, whether radiologist, emergency physician, or AI, is to screen, inform, or refer for appropriate interventions, then, given the serious and urgent nature of TBI and the importance of timely surgical intervention for some of its subtypes, it may be better to set a high sensitivity and allow more false positives (a lower positive predictive value). As a reference, the American College of Surgeons Committee on Trauma has set the national benchmark for field triage at ≥ 95% sensitivity and ≥ 50% specificity 35. However, if the primary role of the operator, whether radiologist, neurosurgeon, or AI, is to decide on definitive treatment, i.e., whether or not to perform neurosurgery, it would be better to set a high specificity.
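The screening-oriented operating point described above can be illustrated with a small sketch: given a tunable confidence score, one picks the most specific threshold that still meets a sensitivity floor such as the 95% triage benchmark. The function name, the continuous score, and the toy data are illustrative assumptions; the model evaluated in this study emits discrete voxel labels rather than a tunable score.

```python
import numpy as np

def threshold_for_sensitivity(y_true, scores, min_sens=0.95):
    """Return (threshold, sensitivity, specificity) for the highest score
    threshold whose sensitivity meets min_sens.

    Among thresholds satisfying the sensitivity floor, the highest one
    predicts positive least often and therefore gives the best attainable
    specificity, matching a screening-oriented tradeoff.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    for t in sorted(set(scores.tolist()), reverse=True):
        pred = scores >= t            # predict ICH when score >= threshold
        sens = pred[y_true == 1].mean()
        if sens >= min_sens:
            spec = (~pred)[y_true == 0].mean()
            return t, sens, spec
    return None  # no threshold reaches the requested sensitivity
```

A treatment-oriented operator would instead search from low thresholds upward for a specificity floor, accepting the loss in sensitivity.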
Following this argument, the outcomes of this study suggest that the deep learning model may be a useful screening tool for the detection and localization of ICH on CT scans in cases of traumatic brain injury in the majority of emergency departments, in particular where there may not be 24-hour coverage by trained radiologists. However, confirmation by trained personnel is required before definitive surgical treatment plans can be made. This is particularly the case in low- and middle-income settings, where emergency physicians and neurosurgeons frequently evaluate emergency CT scans without support from trained radiologists.
There are three main limitations to our study. First, the performance evaluation scores detected ICH according to crude locations, which sometimes represent a wide and less specific area within the cranium. For example, a hematoma in the cerebral hemisphere could be in the frontal, parietal, temporal, or occipital lobe, and multiple separate hematomas may be present in different locations within the same hemisphere. If only one hematoma is segmented or identified in the setting of multiple discrete hematomas confined within the recording area (e.g., a hemisphere), all hematomas will erroneously be considered detected even though the remaining hematomas have been missed. However, unlike a previous study 15, ours is one of the few that has attempted to match the type of ICH and its precise location, with performance comparisons to trainee residents, rather than simply detecting the ICH subtype alone. The second limitation is the sample size; with only 300 head CT studies, the study may not have the same level of statistical power in defining the exact diagnostic performance of the deep learning model as some other studies in the literature 16,17,36. In addition, all subjects were 15 years of age or older; in future studies, it would be useful to expand the dataset to include all age groups. Lastly, we did not validate the deep learning model with an external dataset (e.g., from another hospital or geographical area). Samples in the test dataset were collected from the same hospital and CT machines as the training dataset. This limitation needs to be addressed in future validation.

Conclusion
In conclusion, our study is one of the first to validate the role of a deep learning model for ICH detection and localization by comparing its diagnostic accuracy with that of radiology, emergency medicine, and neurosurgery residents. Based on the results, our study highlights the potential use of AI as an intracranial hemorrhage screening tool in traumatic brain injury patients. However, its slightly lower specificity and its tendency to misinterpret some benign high-attenuation findings as hemorrhage remain issues to be addressed. Further model training with a larger dataset is expected to improve the overall capability of our deep learning model in a real clinical setting.

Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.