Deep Learning Network with Spatial Attention Module for Detecting Acute Bilirubin Encephalopathy in Newborns Based on Multimodal MRI

Background: Acute bilirubin encephalopathy (ABE) is a significant cause of neonatal mortality and disability. Early detection and treatment of ABE can prevent the further development of ABE and its long-term complications. Due to the limited classification ability of single-modal magnetic resonance imaging (MRI), this study aimed to validate the classification performance of a new deep learning model based on multimodal MRI images. Additionally, the study evaluated the effect of a spatial attention module (SAM) on improving the model’s diagnostic performance in distinguishing ABE. Methods: This study enrolled a total of 97 neonates diagnosed with ABE and 80 neonates diagnosed with hyperbilirubinemia (HB, non-ABE). Each patient underwent three types of multimodal imaging, which included T1-weighted imaging (T1WI), T2-weighted imaging (T2WI), and an apparent diffusion coefficient (ADC) map. A multimodal MRI classification model based on the ResNet18 network with spatial attention modules was built to distinguish ABE from non-ABE. All combinations of the three types of images were used as inputs to test the model’s classification performance, and we also analyzed the prediction performance of models with SAMs through comparative experiments. Results: The results indicated that the diagnostic performance of the multimodal image combination was better than any single-modal image, and the combination of T1WI and T2WI achieved the best classification performance (accuracy = 0.808 ± 0.069, area under the curve = 0.808 ± 0.057). The ADC images performed the worst among the three modalities’ images. Adding spatial attention modules significantly improved the model’s classification performance. Conclusion: Our experiment showed that a multimodal image classification network with spatial attention modules significantly improved the accuracy of ABE classification.


Introduction
Neonatal jaundice, also known as neonatal hyperbilirubinemia, is a prevalent condition in newborns, identified by yellowing of the skin and whites of the eyes caused by the accumulation of bilirubin in the bloodstream. Bilirubin, a yellow pigment formed during the breakdown of red blood cells, is regulated by the liver in healthy individuals; however, in newborns, the liver may not be fully developed, allowing bilirubin to accumulate in the bloodstream and cause jaundice. Although neonatal jaundice is generally harmless and resolves on its own within a few weeks, elevated levels of bilirubin can cross the blood-brain barrier and cause the death of brain cells, resulting in acute bilirubin encephalopathy (ABE) [1,2]. If left untreated, ABE can progress to a severe condition known as kernicterus, leading to possible neurological damage or even death. Thus, monitoring newborns with ABE and seeking medical attention if symptoms persist or worsen are crucial. In addition, previous deep learning approaches to this problem often used the same network weights across different modalities, which was not ideal for multimodal data and hindered the network's ability to learn distinct features from each modality.
The spatial attention module in a convolutional neural network (CNN) is designed to selectively focus on certain regions within an image while downplaying or ignoring others [14]. This module can improve the performance of the CNN by allowing it to better recognize and classify objects within an image. The spatial attention module works by using a set of weights to assign importance values to different parts of the input image. These weights are learned during training and are based on the features that are most relevant to the task at hand. By adjusting these weights, the CNN can focus its attention more closely on the key areas of an image, such as the face of a person or the lesion area of the brain.
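As a concrete illustration of the mechanism described above, a CBAM-style spatial attention map can be computed by pooling the feature map across channels, convolving the pooled maps, and applying a sigmoid. The following is a minimal NumPy sketch under simplified assumptions (the convolution kernel would normally be a 7 × 7 learned layer inside a CNN framework, not a hand-rolled loop):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feature_map, conv_weight, conv_bias=0.0):
    """CBAM-style spatial attention on a (C, H, W) feature map.

    conv_weight: (2, k, k) kernel applied to the stacked channel-wise
    average- and max-pooled maps, with 'same' padding.
    """
    c, h, w = feature_map.shape
    avg_map = feature_map.mean(axis=0)          # (H, W)
    max_map = feature_map.max(axis=0)           # (H, W)
    pooled = np.stack([avg_map, max_map])       # (2, H, W)

    k = conv_weight.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    logits = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            logits[i, j] = np.sum(padded[:, i:i + k, j:j + k] * conv_weight) + conv_bias

    attn = sigmoid(logits)                      # per-pixel weights in (0, 1)
    return feature_map * attn                   # broadcast over all channels
```

The attention weights multiply every channel at each spatial location, so regions with high learned importance pass through nearly unchanged while others are suppressed.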
Multimodal MRI-based deep learning models have emerged as a promising approach for medical image analysis due to their ability to integrate information from multiple modalities. Recent works in this area have achieved remarkable success in various applications, including the detection of abnormalities and diseases in brain imaging. Zhang et al. proposed a deep-learning-based method for the automated detection of enlarged perivascular spaces (EPVS) in brain MRI images [15]. The proposed model was trained on a large dataset of MRI images and achieved a high accuracy in detecting EPVS, outperforming other existing methods. Another related work by Guo et al. proposed a method for glioma subtype classification using multiple MRI modalities and a decision fusion strategy to improve accuracy [16]. These works on multimodal MRI-based deep learning models have demonstrated promising results in medical image analysis, particularly in the detection, classification, and segmentation of brain abnormalities and diseases.
Therefore, in this study, we built a multimodal MRI image classification network based on ResNet18 that can differentiate ABE from the non-ABE control group (hyperbilirubinemia, HB). For each modality, we used a separate ResNet18 backbone to extract features and then concatenated the features before the fully connected layer. Additionally, we introduced spatial attention modules into the ResNet18 network blocks to further enhance the classification performance of the model. Finally, we investigated the influence of different combinations of modalities on the classification results.

Study Subjects
The data were collected during routine clinical examinations at the Affiliated Children's Hospital of Jiangnan University in 2020-2022, and all research protocols were approved by the Clinical Research Ethics Committee. We recruited 177 newborns with high bilirubin levels for this study, of which 97 were diagnosed with ABE and 80 were diagnosed with non-ABE. Experienced pediatricians confirmed the diagnosis of all subjects based on the clinical examination results and the bilirubin-induced neurologic dysfunction (BIND) score, which is a scale used to evaluate the severity of ABE. The scores range from 1 to 9, with 1-3 indicating mild, 4-6 indicating moderate, and 7-9 indicating severe ABE [17]. Neonates without ABE did not have the corresponding clinical neurological symptoms.
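The BIND bands described above map directly to severity labels; a trivial helper (hypothetical, for illustration only) makes the thresholds explicit:

```python
def bind_severity(score: int) -> str:
    """Map a BIND score (1-9) to the severity bands used in this study."""
    if not 1 <= score <= 9:
        raise ValueError("BIND score must be between 1 and 9")
    if score <= 3:
        return "mild"
    if score <= 6:
        return "moderate"
    return "severe"
```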

Image Pre-Processing
We applied image pre-processing in the following steps: (1) skull stripping (FSL v7.0: SynthStrip, https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/, accessed on 15 January 2023) [18,19], and (2) normalizing the image intensity to a range of 0-1 and resizing the images to 224 × 224 (Figure 1). To improve the computational efficiency, we selected four contiguous slices around the GP from each modality of the T1, T2, and ADC images as input for the models. We performed all pre-processing steps using Python and FSL on Ubuntu 20.0. The acquisition parameters included: flip angle, 90°; matrix size, 164 × 168; field of view, 224 × 230 mm; and b value, 1000 s/mm². All images underwent manual inspection by pediatricians to ensure that the image quality met the requirements for subsequent data analysis.
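The intensity normalization and slice-selection steps can be sketched as follows. This is a simplified illustration: skull stripping was done with FSL, resizing to 224 × 224 would use an image library's interpolation, and the GP slice index (gp_index) is a hypothetical input identified beforehand:

```python
import numpy as np

def preprocess(volume, gp_index):
    """Min-max normalize a (S, H, W) volume to [0, 1] and select four
    contiguous slices around the globus pallidus (GP)."""
    v = volume.astype(np.float64)
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)   # intensity to [0, 1]
    start = min(max(0, gp_index - 2), v.shape[0] - 4)  # clamp to volume bounds
    return v[start:start + 4]                          # shape (4, H, W)
```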



Deep Learning Framework and Spatial Attention Module
We used ResNet18 as the backbone to build a multimodal image classification network (Figure 2), where ResNet18 was used for image feature extraction [20]. Subsequently, we fused the multimodal features and constructed a fully connected layer to distinguish between ABE and non-ABE patients. We used transfer learning methods to initialize the model parameters effectively and improve the training performance. To counteract the issues of limited training subjects and overfitting during the training process, we used data augmentation methods, including randomly translating images horizontally and vertically by −60 to 60 pixels, rotating images by −60 to +60 degrees, and scaling images 0.8 to 1.2 times.
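The late-fusion design described above (one backbone per modality, feature concatenation, then a fully connected layer) can be sketched with stand-in components. This is a minimal NumPy illustration, not the trained model: the "backbone" here is a hypothetical linear projection in place of ResNet18, chosen only to make the fusion step concrete:

```python
import numpy as np

def extract_features(volume, weight):
    """Stand-in for a per-modality ResNet18 backbone: maps a
    (4, H, W) input to a 512-dim feature vector via global average
    pooling followed by a (hypothetical) linear projection."""
    pooled = volume.mean(axis=(1, 2))        # (4,) one value per slice
    return pooled @ weight                    # (512,)

def late_fusion_logit(volumes, backbones, fc_weight, fc_bias):
    """Extract features per modality, concatenate them, and apply a
    fully connected layer to produce the binary ABE/non-ABE logit."""
    feats = [extract_features(v, w) for v, w in zip(volumes, backbones)]
    fused = np.concatenate(feats)             # (512 * n_modalities,)
    return float(fused @ fc_weight + fc_bias)
```

Because each modality has its own backbone weights, the network can learn modality-specific features before fusion, which is the design choice motivated in the Introduction.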
Figure 2. Deep learning architecture of the multimodal ABE prediction model. ResNet18 was used as the feature extractor for the T1-weighted, T2-weighted, and ADC images. The MRI input images were selected to include four consecutive slices where the GP was located and were cropped to a size of 224 × 224 × 4. The outputs of the feature extractors for each modality were concatenated, and a fully connected layer produced the ABE prediction.
In this paper, we introduced spatial attention modules (SAMs) into the residual network blocks and analyzed the effect of the spatial attention modules on the model classification performance. The detailed structure of the attention modules and their integration with ResNet18 are illustrated in Figures 3 and 4, respectively [14].

Model Evaluation
Five-fold cross-validation was used to evaluate the generalization ability of the model, and various metrics such as the classification accuracy, the area under the curve (AUC), sensitivity, specificity, recall, and F1 score were used to evaluate the model's classification performance. The performance metrics are presented as the mean ± standard deviation from the five-fold cross-validation.
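The metrics listed above all derive from the binary confusion matrix; a small sketch (treating ABE as the positive class) makes the definitions explicit:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics used in this study from a
    binary confusion matrix (ABE = positive class)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)          # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "recall": sensitivity, "f1": f1}
```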
To verify the classification performance of different combinations of modal images, the experiment mainly employed the following strategies: (1) single-modality data, using T1, T2, and ADC separately as model inputs; (2) dual-modality data, using T1 + T2, T1 + ADC, and T2 + ADC separately as model inputs; (3) triple-modality data, using T1 + T2 + ADC as the model input. For the multimodality inputs, the model first extracted features from each modality separately and then fused the features before finally conducting classification. To evaluate the effect of the SAMs on improving the classification performance, we conducted comparative experiments separately with models that have SAMs and models that do not have SAMs.
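The seven input configurations enumerated above (three single-, three dual-, one triple-modality) can be generated programmatically, and the five-fold split can be sketched as follows (a round-robin fold assignment shown only for illustration; the actual split strategy is not specified beyond "five-fold"):

```python
from itertools import combinations

MODALITIES = ["T1", "T2", "ADC"]
# All seven modality combinations tested in the experiments.
combos = [c for r in (1, 2, 3) for c in combinations(MODALITIES, r)]

def five_fold_indices(n_subjects, n_folds=5):
    """Assign subjects to folds round-robin; fold k serves as the
    held-out test set while the remaining folds are used for training."""
    return [list(range(k, n_subjects, n_folds)) for k in range(n_folds)]
```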
ImageNet-based pre-trained weight files were downloaded from the PyTorch website (https://download.pytorch.org/models/resnet18-5c106cde.pth, accessed on 13 January 2023) and used to initialize the weights of the feature extraction module in the model [21]. The training-related hyperparameters were set as follows: an initial learning rate of 0.0001, a maximum of 140 iterations, and a minibatch size of 64. The Adam algorithm was used for model training [22]. The experiments were developed on Windows 11 using Python 3.10.
Table 1 shows the demographic and clinical characteristics of the patients enrolled in this study, including their gender, weight, and age. Differences in the gender distribution between groups were evaluated using the chi-square test; the result showed that there were no significant differences between the ABE and non-ABE groups (p = 0.15 > 0.05). As the other clinical features did not meet the assumption of normality based on the Kolmogorov-Smirnov test, we used the nonparametric Mann-Whitney test to evaluate differences between groups. Significant differences in age were found between the ABE and non-ABE groups, with a p-value of less than 0.05.
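The Adam algorithm used for model training updates each parameter with bias-corrected first and second moment estimates; a minimal single-parameter NumPy sketch of the standard update rule (not the authors' exact training loop) is:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moment estimates.
    t is the 1-based step counter; lr matches the study's initial
    learning rate of 0.0001."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean of grads)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (uncentered var)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```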

Results
The trained models were evaluated on their ability to distinguish between ABE and non-ABE using single-modality and multimodality MRI data. The model without a spatial attention module achieved a classification accuracy of 0.666, 0.745, and 0.583 on the T1, T2, and ADC images, respectively; the model with a spatial attention module achieved 0.674, 0.768, and 0.576. In the single-modality experiments, the T2-weighted images achieved the best classification performance in the models both with and without attention modules, whereas the ADC images performed the worst, showing the lowest classification accuracy and AUC.
By comparing the results with and without spatial attention modules, we found that the SAMs improved the overall classification performance of the model, especially on the T2 images. In the multimodality experiments, the combination of T1 and T2 achieved the best classification accuracy, AUC, sensitivity, and F1 score in the models both with and without attention modules. Among the dual-modality inputs, the classification accuracy of T1 + ADC and T2 + ADC was lower than that of the corresponding single modalities, indicating that the ADC images did not contribute to improving the model's classification performance. The overall performance of the models with attention modules was better than that of those without, particularly in terms of specificity. Figure 5 shows the ROC curves of the model tests with different combinations of MRI data. Combining multimodal images helped to improve the AUC of the model, and the combination of T1 and T2 achieved the best AUC in ResNet18 both with and without SAMs (with SAMs, 0.832; without SAMs, 0.828). In contrast, combining the ADC images with the other modalities was not conducive to improving the AUC and in fact decreased the model's classification performance.
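The AUC values reported above can be read as the probability that a randomly chosen ABE case is ranked above a randomly chosen non-ABE case; a minimal rank-based computation (with hypothetical labels and scores) might look like:

```python
def auc_score(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    (ABE) receives a higher score than a randomly chosen negative,
    counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```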

Figure 5. ROC curves for distinguishing ABE from non-ABE using different modal MRI features. The ROC curves were obtained by testing the single-modality data (T1, T2, and ADC) and the multimodality data (T1 + ADC, T2 + ADC, T1 + T2, and T1 + T2 + ADC) using ResNet18 with and without SAMs. (a) ROC curves based on ResNet18 without SAMs. (b) ROC curves based on ResNet18 with SAMs.


Discussion
In this study, we proposed a deep learning network based on multimodal MRI images to distinguish between ABE and non-ABE. We validated whether multimodal images had superior ABE diagnosis performance compared to single-modality images and compared the classification performance of ResNet18 models with spatial attention modules and without spatial attention modules. Our experimental results showed that multimodal image fusion improved the ABE prediction compared to single-modality T1-weighted, T2-weighted, and ADC images, and the inclusion of a spatial attention module helped to improve the overall classification performance of the model, particularly in terms of specificity.
The results of the single-modality image experiments showed that T2-weighted images had the best classification performance, followed by T1-weighted and ADC images. The ADC images performed the worst among the three modalities' images, which was also confirmed in the multimodality image experiments. This finding may be because ADC images did not show significant differences between the ABE and non-ABE groups in our dataset [23].
Acute bilirubin encephalopathy is typically characterized by a high symmetric signal intensity in T1-weighted images in the GP, subthalamic nucleus (SN), and hippocampus regions [7]; however, in chronic bilirubin encephalopathy, the high signal intensity in T2-weighted images is more pronounced in the GP and SN compared to T1-weighted images [24,25]. Our experimental results showed that the classification performance of the T2-weighted images was better than that of the T1-weighted images. Furthermore, among all combinations of modal images, the T1- and T2-weighted image combination achieved the best classification accuracy, AUC, sensitivity, and specificity.
Spatial attention modules have been shown to be effective in improving the performance of various models in image classification and object recognition [14]. We introduced spatial attention modules into the residual blocks and adjusted their weights through training so that the model could focus its attention on key areas of the image, such as the high-signal area of the globus pallidus. Our results showed that the SAMs improved the overall classification performance of the model, especially in terms of specificity, compared to the control experiment.
Despite the promising results obtained in this study, there are some limitations that need to be addressed in future research. One of the main limitations is the sample size, which was relatively small and drawn from a single source. In order to increase the generalizability of the model, future studies should include larger and more diverse samples from multiple sources. Another limitation of our study is that the 2D-ResNet model design did not fully utilize the 3D information available in the images. To address this limitation, future research could consider using more advanced models such as 3D convolutional neural networks (3D-CNNs) that can effectively capture the spatial information in volumetric data. Additionally, incorporating other modalities of MRI data such as MRS, perfusion magnetic resonance imaging, and clinical information could further improve the diagnostic accuracy of the model. This could be achieved through the use of cross-modality attention modules, which allow for the fusion of information across modalities. Finally, in order to improve the interpretability of the model and facilitate its adoption by clinicians, future research could explore the use of explainable AI techniques such as Transformers [26][27][28]. These models have been shown to be effective at generating interpretable representations of medical images and may help to improve the diagnostic capabilities of the model. Overall, these advancements in methodology hold great promise for enhancing the accuracy and clinical utility of MRI-based diagnosis of ABE.

Conclusions
In this study, we developed a network framework for multimodal MRI image classification using ResNet18 as the backbone. Our results demonstrate that the accuracy of ABE classification can be significantly improved by utilizing multimodal image combinations, particularly the T1 + T2 combination (accuracy = 0.763 ± 0.029), compared to using single-modality images. Moreover, we incorporated a spatial attention module into the residual blocks, further enhancing the classification performance, with the highest accuracy achieved using the T1 + T2 combination (accuracy = 0.808 ± 0.069). This finding suggests that a multimodal classification network with a SAM is a promising approach for the clinical diagnosis of ABE. Future research can explore the integration of more advanced MRI techniques and larger datasets to further validate the effectiveness of our approach.