A review of deep learning-based information fusion techniques for multimodal medical image classification

Multimodal medical imaging plays a pivotal role in clinical diagnosis and research, as it combines information from various imaging modalities to provide a more comprehensive understanding of the underlying pathology. Recently, deep learning-based multimodal fusion techniques have emerged as powerful tools for improving medical image classification. This review offers a thorough analysis of the developments in deep learning-based multimodal fusion for medical classification tasks. We explore the complementary relationships among prevalent clinical modalities and outline three main fusion schemes for multimodal classification networks: input fusion, intermediate fusion (encompassing single-level fusion, hierarchical fusion, and attention-based fusion), and output fusion. By evaluating the performance of these fusion techniques, we provide insight into the suitability of different network architectures for various multimodal fusion scenarios and application domains. Furthermore, we delve into challenges related to network architecture selection, handling incomplete multimodal data, and the potential limitations of multimodal fusion. Finally, we spotlight the promising future of Transformer-based multimodal fusion techniques and give recommendations for future research in this rapidly evolving field.


Context
In recent years, the field of medical image analysis has seen a surge in efforts to apply deep learning-based methods to the classification of various diseases, notably related to the brain (Kong et al., 2022; Zhang et al., 2022a; Liu et al., 2018), breasts (Qian et al., 2020; Dalmis et al., 2019; Qian et al., 2021), prostate (Le et al., 2017; Yang et al., 2017; Mehrtash et al., 2017) and eyes (Li et al., 2022c; Yoo et al., 2022; Huang et al., 2022). The ability to accurately classify and diagnose diseases from medical images has the potential to revolutionize healthcare by improving diagnostic accuracy, reducing human error, and enabling more personalized treatment planning. This trend has highlighted the need for robust and efficient methods for analyzing medical images across multiple imaging modalities.
With advances in medical image acquisition systems, many new imaging modalities have been developed to diagnose patients (Muhammad et al., 2021; Azam et al., 2022; Hermessi et al., 2021), resulting in larger and more diverse datasets. A single imaging modality often does not provide all the information needed to ensure an accurate clinical diagnosis. Therefore, clinicians increasingly base their diagnoses on images obtained from a variety of sources: combining abundant information allows clinical decisions to be made with more confidence. Following this trend, and to improve diagnosis results, artificial intelligence-based classification models are increasingly being developed by combining data from multiple sources to take advantage of both redundancies and complementarities across modalities.
Several surveys have been conducted in recent years to analyze the trends in the application of multimodality in various fields (Ramachandram and Taylor, 2017; Baltrušaitis et al., 2018), among which the field of medicine is gaining a great deal of attention. In medicine, several such survey papers focus on specific image analysis tasks: image fusion (Azam et al., 2022), image synthesis (Xie et al., 2022), image segmentation (Zhou et al., 2019a), or image registration (El-Gamal et al., 2016). However, medical image classification has never been addressed in a comprehensive manner. A few surveys target specific fields such as neurology (Shoeibi et al., 2022) or oncology (Lipkova et al., 2022), but they do not provide a comprehensive discussion of how multimodal fusion may be applied to other fields. To fill this gap, we propose to review deep learning-based information fusion techniques for multimodal medical classification across all medical fields. We restrict this analysis to classification, given the sufficient coverage of other analysis tasks (Azam et al., 2022; Xie et al., 2022; Zhou et al., 2019a; El-Gamal et al., 2016). We include in the scope of classification methods any method assigning class labels, possibly with probabilities, to a patient or a region of interest in the patient, regardless of the application (diagnosis, prognosis, risk estimation, etc.). Throughout our review, we summarize and discuss the advantages and disadvantages of various information fusion methods that can be applied to various organs and imaging modalities. As the first review to examine the use of deep learning in multimodal medical classification, this paper aims to guide future investigations into medical diagnosis using multiple imaging modalities.

Traditional methods
Information fusion not based on deep learning strategies, relying instead on traditional image processing and machine learning, has been reviewed in a previous survey (Kline et al., 2022). We summarize hereafter the main developments and highlight the benefits of non-deep learning-based information fusion.
Input fusion is the most commonly used strategy among traditional methods. It fuses images from various modalities into structured data and can be grouped into different categories depending on the fusion domain: spatial fusion (El-Gamal et al., 2016; Stokking et al., 2001; Bhatnagar et al., 2013; He et al., 2010; Bashir et al., 2019), frequency fusion (Princess et al., 2014; Parmar and Kher, 2012; Sadjadi, 2005; Das and Kundu, 2013; Liu et al., 2010; Xi et al., 2017) and sparse representation (Zhang et al., 2018; Zhu et al., 2019; Liu et al., 2015). In spatial fusion, multimodal images are combined at the pixel level, but this approach often leads to spectral degradation (Mishra and Palkar, 2015) and color distortion (Bhat and Koundal, 2021). Frequency fusion, which involves transforming the input image into the frequency domain, is more complex and results in limited spatial resolution (Sharma et al., 2020). Sparse representation, on the other hand, can be sensitive to registration errors and lacks attention to details (Bhat and Koundal, 2021).
Other strategies include intermediate and output fusion, which do not require registration of the input images. Intermediate fusion involves extracting features from different imaging modalities, concatenating them, and feeding them into a classifier, generally a support vector machine (SVM), for diagnosis (Lee et al., 2019b; Tang et al., 2020; Quellec et al., 2010). This approach requires extensive testing and rich domain knowledge for feature extraction and selection. Output fusion, on the other hand, involves stacking the results of unimodal models and combining them (Lalousis et al., 2021). While this approach circumvents the need for early integration, it presents its own set of challenges. Individual models in output fusion may be heavily influenced by their respective modality-specific idiosyncrasies, potentially introducing biases into the final combined output. Moreover, if these unimodal models yield correlated or redundant information, the utility of stacking them diminishes, as it might not deliver significant additional value.
Traditional methods typically involve complex preprocessing steps paired with relatively simple model structures. Such a combination frequently leads to information loss during feature extraction, thereby complicating efforts to fully leverage the synergies between various imaging modalities.
Besides requiring domain knowledge, these traditional multimodal fusion approaches do not fully utilize the complementarity between multimodal features. These limitations highlight the need for more advanced techniques, such as deep learning-based multimodal fusion methods, able to overcome the challenges faced by traditional methods. Deep learning network architectures offer complex models that can explore more possibilities for multimodal fusion. Furthermore, various end-to-end models significantly reduce the amount of domain knowledge required for diagnosis purposes, albeit at the cost of interpretability (Salahuddin et al., 2022).

Development trends
Recognizing the potential of deep learning-based methods for multimodal medical image classification, researchers have increasingly focused on this area. In order to obtain more accurate diagnoses, multimodal medical image analysis has also become a growing trend. Fig. 2 shows the number of publications about multimodal medical classification each year, queried on February 27, 2024, on PubMed. As illustrated by the figure, the number of papers increased yearly from 2016 to 2023, indicating that multimodal medical classification tasks based on deep learning have gained greater attention in recent years. Furthermore, we report the number of publications on different organs in multimodal diagnosis tasks in Fig. 3. We found that brain-related publications currently account for a substantial portion of multimodal studies. This is due to the disclosure of many large multimodal image datasets on the brain. On the other hand, few studies were conducted on other organs, except whenever a public dataset was released. This finding motivated us to focus the review on studies performed on public datasets from various organs. One advantage is to allow direct quantitative comparisons between methods.

Paper selection
In our initial literature search, we identified a total of 14 public multimodal image datasets. These datasets are detailed in Sect. 2.2, with a summary presented in Tab. 1. The methodology for finalizing the list of papers for this review was as follows: (1) for each of the 14 datasets, we conducted a search on PubMed for publications that mentioned the dataset name, coupled with any of the following terms: (multimodality), (multimodal), (multi-modal), (multiparametric), or (multi-parametric); (2) we then concatenated the 14 resulting lists; (3) based on the abstracts, we handpicked articles that addressed multimodal information fusion through deep learning methods.
Notably, there are gaps in the availability of public multimodal datasets focused on classification tasks for certain organs, namely the breast, lung, prostate, kidneys, larynx, heart, and liver, even though they are frequently discussed in the multimodal medical image analysis literature. To ensure a comprehensive review, we expanded our scope to include 19 pertinent articles that target these organs but utilize private datasets. This brought our final tally to 114 publications.

Highlights
Through our examination of the deep learning-based multimodal image classification literature (overview presented in Fig. 1), we propose in this paper an updated taxonomy for multimodal information fusion. As discussed in Sect. 1.2 and other surveys (Ramachandram and Taylor, 2017; Muhammad et al., 2021; Boulahia et al., 2021), multimodal fusion methods are traditionally classified as input fusion, intermediate fusion or output fusion, based on the stage of information fusion in the classification pipeline, as in Fig. 4(a). Note that some publications refer to input fusion as early fusion, while intermediate fusion may be considered as feature-level fusion, and output fusion is equivalent to decision-level fusion or late fusion (Boulahia et al., 2021; Ramachandram and Taylor, 2017; Li et al., 2022c). Our analysis points to intermediate fusion as the prevailing category at present. To grant readers a more in-depth understanding of multimodal deep learning networks, we further segment intermediate fusion into single-level fusion, hierarchical fusion, and attention-based fusion, as illustrated in Fig. 4(b). The proposed taxonomy is detailed and discussed in Sect. 4.1: it covers the majority of the current multimodal classification network architectures, providing insight into their stages and styles of information fusion.
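To give a concrete flavor of the attention-based intermediate fusion category, the following minimal numpy sketch weights per-modality feature vectors by softmax-normalized attention scores. It is a toy illustration under our own assumptions (the feature vectors and scores are hypothetical, and the scores are fixed here rather than produced by a learned attention network), not the method of any surveyed paper.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fusion(features, scores):
    """Fuse per-modality feature vectors with softmax-normalized weights.

    In a real network, `scores` would be produced by a small learned
    attention module; fixed values are used here for illustration.
    """
    weights = softmax(np.asarray(scores, dtype=float))
    fused = sum(w * f for w, f in zip(weights, features))
    return fused, weights

# Hypothetical feature vectors extracted from two MRI sequences
f_t1 = np.array([1.0, 0.0, 2.0])
f_t2 = np.array([0.0, 1.0, 1.0])
fused, weights = attention_fusion([f_t1, f_t2], scores=[2.0, 0.0])
# The higher-scoring modality (T1 here) dominates the fused representation
```

The fused vector keeps the dimensionality of the unimodal features, which is why attention-based fusion slots naturally between the backbone and the final classifier.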
In this paper, we present the following contributions: (1) Identify the process of medical multimodal classification.
The methodological approach of deep learning-based multimodal classification can be divided into four steps: data processing, deep learning network, multimodal information fusion, and the final classification algorithm. Crucially, the classification of multimodal information fusion methods hinges on the sequential positioning of these stages.
(2) Propose network architectures for generic multimodal fusion classification tasks.
In order to address medical classification tasks involving different organs and imaging modalities, we summarize five strategies of multimodal fusion: input fusion, single-level fusion, hierarchical fusion, attention-based fusion, and output fusion. These fusion methods can be applied to any multimodal classification problem in medicine, allowing for greater flexibility and potential for improved results.
(3) Present the prevailing challenges and predict future trends. While it is apparent that multimodal fusion is still in its early stages, our paper analyzes specific challenges tied to this domain and predicts future trends in the field.
The remainder of the paper is organized as follows: Sect. 2 describes commonly used multimodal data for medical multimodal classification tasks and their publicly available datasets. Sect. 3 describes the multimodal medical image classification task process mentioned in contribution 1. A review of papers implementing each of the five fusion strategies of contribution 2 is presented in Sect. 4. The purpose of Sect. 5 is to discuss the existing problems and to make predictions regarding future fusion methods, in line with contribution 3. Finally, Sect. 6 contains our concluding comments. A list of frequently used abbreviations throughout the paper is shown in Tab. 6.

Imaging modalities
For medical diagnosis purposes, each imaging modality has its own characteristics and information. Different medical imaging modalities use different frequency bands of the electromagnetic spectrum in order to screen and diagnose different medical conditions in the human body (Azam et al., 2022). There are different wavelengths and frequencies associated with each imaging modality, as well as different characteristics (structure, function, etc.) (Singh et al., 2012). Furthermore, medical imaging modalities can be classified as invasive or non-invasive. Invasive methods involve inserting an object into the body through an incision or needle injection in order to examine an organ, while non-invasive methods utilize some form of radiation or sound (Azam et al., 2022). Table 2 shows some modalities that appear in multimodal medical image datasets.

Table 2. Modalities appearing in multimodal medical image datasets:

Modality | Organs | Invasiveness | Description
Magnetic Resonance Imaging (MRI) | — | Non-Invasive | In addition to high spatial resolution and exquisite soft tissue contrast, MRI can also display dynamic physiologic changes in three dimensions (Plewes and Kucharczyk, 2012).
Positron Emission Tomography (PET) | Brain, Prostate, Breast, etc. | Invasive | PET provides information about the organs' activity, as well as their sugar use as energy (Bailey et al., 2005).
Computed Tomography (CT) | Lung, Bone, Oral, etc. | Non-Invasive (harmful) | CT is an excellent tool for detecting lesions affecting bone, joints, or soft tissues (Buzug, 2011).
Ultrasound (US) | — | Non-Invasive | In addition to showing the activity and function of certain organs in the body, US can also identify whether a tissue or organ contains fluid or gas (Leighton, 2007).
Optical Coherence Tomography (OCT) | Eye, Heart | Non-Invasive | Biological tissues can be visualized in high resolution with OCT scanning in two-dimensional or three-dimensional modes (Huang et al., 1991).
Dermatoscope (Dsc) | Skin | Non-Invasive | Dsc allows better visualization of subsurface structures and improved identification of skin diseases (MacKie et al., 2002).
Neurology and neurosurgery frequently use MRI. Different MRI images can be obtained by changing the factors affecting the magnetic resonance (MR) signal, and these different images are referred to as sequences. Depending on the sequence used, the appearance of tumors may vary, and it is essential to use multiple sequences to accurately determine tumor location and size (Pai et al., 2020). T1-weighted (T1) and T2-weighted (T2) MRIs are the most common MRI sequences. Tomographic anatomical maps can be observed with the T1 sequence, and the T2 sequence clearly shows the location and size of the lesion (Lindig et al., 2018). The Fluid Attenuated Inversion Recovery (Flair) sequence provides better visualization of the area around the tumor site, making it easier to detect the tumor's boundaries (Hecht et al., 2001). Furthermore, contrast-enhanced T1-weighted (T1c) sequences can be used to detect intra-tumor conditions and distinguish tumors from non-tumorigenic lesions (Kuban et al., 2003). T2 and Flair are suitable for detecting tumors with peritumoral edema, while T1 and T1c are suitable for detecting tumors without peritumoral edema (Zhou et al., 2019a).
Diffusion-weighted imaging (DWI) is another useful sequence, designed to detect the random movements of water protons. The DWI sequence is therefore a highly sensitive method for detecting acute strokes (Preston, 2006). An increased apparent diffusion coefficient (ADC) value with a lower DWI signal could reveal the fast diffusion of water molecules (Shen et al., 2011). In addition to using multiple sequences, co-diagnosis using structural MRI (sMRI) and functional MRI (fMRI) is becoming increasingly popular (Akhavan Aghdam et al., 2018; Liu et al., 2022). fMRI measures the small changes in blood flow that occur with brain activity. This test can be used to determine which parts of the brain are performing critical functions and to assess the effects of strokes and other diseases on the brain (Bandettini, 2012).
The combination of PET and MRI, or PET and CT, has been recognized as a valuable method for screening and diagnosing various diseases (Calhoun and Sui, 2016; Liu et al., 2017; Huang et al., 2019; Xu et al., 2022; Andrearczyk et al., 2022). The PET scan is preceded by the administration of a radioactive agent to the patient. This allows doctors to determine the metabolic processes in which the brain tissue is involved (Bailey et al., 2005). Compared to other imaging methods such as CT and MRI, PET has a high sensitivity and can detect lesions even when MRI/CT does not yet show abnormalities. PET also has high specificity, making it possible to determine whether a tumor is malignant based on its metabolism at the time of MRI/CT detection (Muehllehner and Karp, 2006). However, because PET scans lack information about organ anatomy, they should be conducted in conjunction with CT/MRI scans (Akhavan Aghdam et al., 2018). Indeed, the combination of PET and MRI/CT scans provides structural and functional information related to various diseases, improving the effectiveness of diagnosis. Fig. 5 shows images of PET, CT, and MRI, as well as several sequences of MRI.
Availability, low cost, and safety make ultrasonography the most widely used clinical diagnostic tool, with applications ranging from breast cancer diagnosis to cervical lymph node detection. Conventional B-mode imaging is used to examine abnormal masses in tissues, Color Doppler imaging shows the distribution of blood vessels within tissues (Zwiebel and Pellerito, 2005), while Strain Elastography (SE) is a qualitative technique that provides information on the relative stiffness between one tissue and another. For example, the combined use of conventional B-mode imaging and Color Doppler is common in identifying cervical lymph nodes (Abdelgawad et al., 2020), diagnosing breast cancer (Qian et al., 2020, 2021), and so forth (Lu et al., 2010; Schelling et al., 1997, 2000). Fig. 6 shows US images of conventional B-mode, Color Doppler, and Strain Elastography.
In the diagnosis of ophthalmic diseases, color fundus photography (CFP) and OCT are the two most cost-effective methods (Li et al., 2022c). These imaging modalities provide prominent biomarkers that can be used to identify glaucoma suspects, such as the vertical cup-to-disc ratio (vCDR) on fundus images and the retinal nerve fiber layer (RNFL) thickness on an OCT image. A more accurate and reliable diagnosis, compared to a single modality, is often achieved by taking both screenings in clinical practice (Wu et al., 2022). Fig. 7 shows images of CFP and OCT.
In the diagnosis of skin cancer, a combination of dermoscopic and clinical images is often used (Tang et al., 2022). The clinical image is captured using a digital camera and shows the visualized features in different views and lighting conditions. On the other hand, dermoscopic images provide a clear view of the skin's subsurface structures and are obtained using a specific skin imaging technique in contact with the skin (Ge et al., 2017). Fig. 8 shows examples of dermoscopic and clinical images.
In addition to multimodal image combinations, clinical information regarding the patient's medical history and symptoms can significantly contribute to the diagnosis of the disease. These data may contain implicit features that may improve the model's classification performance. Electronic Health Records (EHR) are commonly used to detect brain diseases by integrating image analysis features (Prabhu et al., 2022; Venugopalan et al., 2021). Similarly, skin cancer detection also relies heavily on metadata (Tang et al., 2022; Yap et al., 2018).

Multimodal image datasets
In multimodal medical diagnosis, multimodal datasets are particularly valuable for testing various networks and developing fusion methods. However, the privacy and cost of medical images often make obtaining more comprehensive multimodal datasets challenging for researchers. Fortunately, there are several freely available multimodal datasets. These datasets provide information regarding the diagnosis of diseases at various locations in the body, as well as the analysis of various multimodal combinations. These datasets are expected to contribute to the analysis of fusion methods and serve as a foundation for the future development of multimodal fusion methods.
Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multi-center longitudinal study to discover clinical, imaging, genetic, and biochemical biomarkers for Alzheimer's disease (AD). ADNI has three stages: ADNI 1 included 400 subjects diagnosed with mild cognitive impairment (MCI), 200 subjects with early AD, and 200 elderly control subjects (Petersen et al., 2010); ADNI 2 added new participant groups: 150 elderly controls, 100 early MCI (EMCI) subjects, 150 late mild cognitive impairment (LMCI) subjects, and 150 mild AD patients (Beckett et al., 2015); ADNI 3 added hundreds of new MCI subjects, mild AD subjects, and elderly controls (Weiner et al., 2017). The MRI Brain Tumor Segmentation (BraTS) challenge has been held since 2012 and currently includes classification tasks in addition to tumor segmentation (Menze et al., 2014). Each subject has four MRI modalities (T1, T1c, T2, and T2 FLAIR), human annotation of tumor segmentation, and tumor grade. The Cancer Imaging Archive (TCIA) is a large-scale public database containing medical images of common tumors (lung cancer, prostate cancer, etc.) and corresponding clinical information (treatment protocol, genetics, pathology, etc.)
(Clark et al., 2013). Open Access Series of Imaging Studies (OASIS) seeks to make neuroimaging datasets freely accessible to the scientific community (Marcus et al., 2010). OASIS-3 contains 755 cognitively normal adults and 622 individuals at various stages of cognitive decline, ranging in age from 42 to 95 years (LaMontagne et al., 2019). The Seven-point Criteria Evaluation Database (SPC) provides a database for evaluating computerized image-based prediction of the 7-point malignancy checklist for skin lesions. The dataset contains more than 2000 clinical and dermoscopy color images and structured metadata for training and evaluating computer-aided diagnosis systems (Kawahara et al., 2018). As part of the Cancer Genome Atlas (TCGA), an internationally recognized cancer genomics project, more than 20,000 primary cancer samples and matched normal samples were molecularly characterized (Weinstein et al., 2013; Tomczak et al., 2015). The Autism Brain Imaging Data Exchange (ABIDE) initiative now includes two large-scale collections, ABIDE I and ABIDE II, whose ultimate goal is to facilitate discovery science and comparative analysis across samples. ABIDE I contains 1112 datasets, including 539 from individuals with autism spectrum disorder (ASD) and 573 from typical controls (ages 7-64 years, median 14.7 years across groups) (Di Martino et al., 2014). ABIDE II contains 1114 datasets from 521 individuals with ASD and 593 controls (age range: 5-64 years) (Di Martino et al., 2017). The ADHD-200 Sample is a grassroots initiative that aims to improve scientific understanding of the neural basis of ADHD through the implementation of open data sharing and discovery-based research methods (ADHD-200 Consortium, 2012). The Center for Biomedical Research Excellence (COBRE) provides raw anatomical and functional magnetic resonance imaging data from 72 patients with schizophrenia and 75 healthy controls (ages ranging from 18 to 65 in each group) (Calhoun et al., 2012). The Glaucoma Grading from Multimodality Images (GAMMA) Challenge is
intended to facilitate the development of fundus- and OCT-based glaucoma grading (Wu et al., 2023a). GAMMA contains 2D fundus images and 3D OCT images of 300 patients. Computational Precision Medicine: Radiology-Pathology Challenge on Brain Tumor Classification 2020 (CPM-RadPath) is a brain tumor classification challenge. There are 221 cases in the training dataset, each with a paired radiology and digital pathology image. Within the 221 cases, there are 54 cases of lower-grade astrocytoma (IDH-mutant), 34 cases of oligodendroglioma (1p/19q-codeleted), and 133 cases of glioblastoma and diffuse astrocytic glioma with molecular features of glioblastoma (IDH-wildtype) (Hsu et al., 2022; Kurc et al., 2020). The CPM-RadPath 2020 challenge also contains 35 validation and 73 testing cases. Each patient has multiple MRI sequences: T1, post-contrast T1-weighted (T1Gd), T2, and FLAIR. ISIT-UMR is a dataset for the classification of gastrointestinal lesions in regular colonoscopy. The dataset consists of 76 polyps, with white light and narrow-band imaging (NBI) videos of each polyp (Mesejo et al., 2016). The MRNet dataset consists of 1,370 knee MRI exams performed at Stanford University Medical Center between January 1, 2001, and December 31, 2012. There were 1,104 (80.6%) abnormal exams in the dataset, with 319 anterior cruciate ligament (ACL) tears and 508 meniscal tears (Bien et al., 2018). The CTU-UHB Intrapartum Cardiotocography database contains 552 cardiotocography (CTG) recordings collected by the Czech Technical University (CTU) in Prague and the University Hospital in Brno (UHB). Each recording includes a fetal heart rate (FHR) time series as well as a uterine contraction (UC) signal (Chudacek et al., 2014).
The previously mentioned datasets provide valuable resources for developing and testing multimodal fusion methods. They contain images of different medical modalities of the same patient, as well as images of different patients. Access to these datasets is available upon request and at no cost. In this review, we summarize the fusion methods presented in 53 articles that use ADNI, 11 articles that use TCIA, 7 articles that use BraTS (2015, 2017, 2019, and 2021 editions), 7 articles that use OASIS, 4 articles that use COBRE, 4 articles that use SPC, 4 articles that use ABIDE, 3 articles that use ADHD-200, 2 articles that use CPM-RadPath (2020 edition), 2 articles that use GAMMA, 2 articles that use MRNet, 1 article that uses TCGA, 1 article that uses CTU-UHB, and 1 article that uses ISIT-UMR. As mentioned earlier, 19 papers discussed in this review are not based on public datasets.

Multimodal classification pipeline
Multimodal fusion of biomedical data using deep learning remains an evolving field. The terminology used to describe fusion methods often varies between publications, leading to ambiguity. For instance, terms like input, intermediate, and output fusion are commonplace, but their interpretations may differ. To bring clarity and standardization to the multimodal classification area, we adopt the five-stage pipeline proposed in Sleeman IV et al. (2022), referenced in Tab. 3. This pipeline offers a structured approach to encapsulate all medical multimodal classification tasks. Within this section, we elucidate each of these stages, detailing their definitions and the methodologies for their implementation. Subsequently, based on the sequence and structure of the information fusion stage paired with the deep learning (DL) backbone stage, we categorize multimodal fusion techniques into five distinct strategies in Section 4.
To further improve the performance of these models, data augmentation techniques play an essential role in the preprocessing pipeline.For example, data augmentation helps prevent overfitting (Wang et al., 2018) using methods like random cropping, flipping, and rotation during training.In addition, increasing the training dataset's diversity improves the model's generalization capabilities.
Considering the large volumes of data generated by multimodal medical images, it is noteworthy that only a small fraction is relevant to diagnosing diseases. Therefore, feature selection emerges as a crucial pre-processing step, aiming to reduce data dimensionality while retaining pertinent information. Common feature selection methods include manual selection (Zou et al., 2017; Shi et al., 2017; Kim and Lee, 2018) and Principal Component Analysis (PCA) (Li et al., 2015; El-Sappagh et al., 2020; Zhou et al., 2021b).
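The PCA reduction step can be sketched in a few lines of numpy via the singular value decomposition; the sample counts and feature dimensions below are hypothetical placeholders:

```python
import numpy as np

def pca_reduce(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project samples onto their top principal components via SVD."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # scores in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))    # e.g., 50 patients, 20 extracted features
Z = pca_reduce(X, n_components=5)
```

The retained components are mutually uncorrelated and ordered by explained variance, so the classifier downstream works on a compact, denoised representation.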
Another critical aspect of pre-processing multimodal medical images is image registration. It involves aligning images from different modalities (e.g., MRI, CT, and PET) into a common coordinate system, enabling the accurate matching of corresponding anatomical structures across image types (Azam et al., 2021; El-Gamal et al., 2016). Such alignment facilitates comprehensive data analysis and becomes particularly critical for input-level fusion, where combining complementary information from different modalities requires proper alignment (Li et al., 2022c). Image registration in this context presents several challenges. A significant one is the lack of ample training datasets for supervised deep learning. Another is defining accurate similarity measures, especially with the varied appearance of different modalities. The registration process can be further complicated when trying to align images from different patients, or even the same patient over time, due to factors like changes in anatomy and metabolic processes.

Table 3. The five stages of the multimodal classification pipeline:
- Data preprocessing: the initial step of the classification task, performing operations such as registration, denoising, and data augmentation on the raw data.
- DL backbone: extraction of high-dimensional features from the data by the deep learning network structure.
- Information fusion: fusion of multimodal data/features by different methods.
- Final classifier: the final stage of generating classification results from multimodal data.
- Model evaluation: different metrics used to evaluate the performance of multimodal models.
In tackling these challenges, deep similarity metrics have shown promise, especially in traditional frameworks. While multimodal registration has seen advancements, direct transformation prediction lags behind, especially compared to single-modality methods. One innovative solution is to use Generative Adversarial Networks (GANs) to make multimodal images more consistent.

Information fusion
A key component of multimodal image classification is information fusion. Based on the level at which information is fused, information fusion can be divided into input fusion, intermediate fusion, and output fusion. There are two ways to achieve fusion (Sleeman IV et al., 2022), namely concatenation and merge. Concatenation stacks data from different modalities into a single tensor for the next step. Merge involves more complex calculations, such as adding data from different modalities, and the final result is a smaller amount of data. Fig. 9 illustrates the two types of fusion. Our study focuses on the fusion of different medical imaging modalities, and in Sect. 4, we examine the different fusion methods in greater detail.
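The difference between the two mechanisms can be shown in a toy numpy example (the feature vectors are hypothetical; real pipelines operate on tensors produced by the DL backbones):

```python
import numpy as np

# Hypothetical feature vectors extracted from two modalities of one patient
f_mri = np.array([0.2, 0.7, 0.1, 0.9])
f_pet = np.array([0.5, 0.1, 0.3, 0.4])

# Concatenation: stack the modalities; dimensionality grows with each modality
concatenated = np.concatenate([f_mri, f_pet])   # shape (8,)

# Merge: combine element-wise (here, by addition); dimensionality is preserved
merged = f_mri + f_pet                          # shape (4,)
```

Concatenation preserves every modality-specific value at the cost of a wider downstream layer, while merge keeps the representation compact but forces the modalities to share a common feature space.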

Deep learning backbone
DL backbones are used to extract high-dimensional features of modalities during the classification process. Over recent years, several high-performing network architectures have emerged, including AlexNet (Krizhevsky et al., 2017), VGG (Simonyan and Zisserman, 2014), GoogLeNet (Szegedy et al., 2015), ResNet (He et al., 2015), DenseNet (Huang et al., 2016), AE (Suk and Shen, 2013; Suk et al., 2015; Yan et al., 2021), ViT (Xing et al., 2022b), and others, providing state-of-the-art classification performance. A summary of the common DL architectures is presented in Tab. 4. DL has developed rapidly due to several factors, including the development of hardware devices like graphics processing units (GPUs) and tensor processing units (TPUs), which have greatly improved the training speed of DL networks. Additionally, publicly available datasets such as ImageNet (Deng et al., 2009) have facilitated the training and testing of various models. Furthermore, DL is capable of learning advanced features directly from data without requiring extensive expertise or prior experience, making it easily adaptable across various domains.
In input fusion, a single backbone can extract features from the fused modalities. However, in other fusion schemes, such as intermediate or output fusion, multiple DL backbones may be used to extract features from different modalities. In current multimodal fusion research, Convolutional Neural Networks (CNNs) are the preferred choice of the majority of researchers due to their effectiveness in extracting features from medical images. Many pre-trained models have already been tested on large datasets, making them suitable for use in medical imaging research. In the articles analyzed, CNNs were used in 65 articles, Fully Connected Neural Networks (FCNNs) in 10, Auto-Encoders (AEs) in 8, and Transformers in 6.

Final classifier
Multimodal classification employs a final classifier to generate the classification results from multimodal features or from multiple independent classification results, depending on the employed fusion scheme. In DL networks, the Fully Connected (FC) layer (Akhavan Aghdam et al., 2018; Qin et al., 2020; Liu et al., 2022; He et al., 2021; Li et al., 2022c; El Habib Daho et al., 2023) is often used as the final classifier. Other methods, such as SVM (Li et al., 2015; Suk et al., 2014), Random Forest (Dalmis et al., 2019), and score merging (Wang et al., 2022; Hu et al., 2020), can also be used as final classifiers.

Evaluation metrics
Evaluation metrics for multimodal fusion tasks are similar to those used in unimodal classification tasks. Commonly used indicators for assessing the performance of multimodal fusion methods and DL networks in medical classification tasks include True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). These counts can be used to calculate several performance metrics, such as sensitivity, specificity, accuracy, precision, and F1 score, among others. Additionally, AUC and Cohen's Kappa are commonly used to evaluate medical classification tasks; Kappa is computed as (p0 - pe) / (1 - pe), where p0 is the accuracy and pe is the sum of the products of the actual and predicted numbers corresponding to each category, divided by the square of the total number of samples.
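A minimal sketch (illustrative only) of how these metrics follow from the confusion counts; the Kappa shown is the binary-case chance-corrected agreement:

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive common metrics from binary confusion counts."""
    n = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)                 # a.k.a. recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n                     # this is p0
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # chance agreement pe: products of actual and predicted counts
    # per class, divided by the squared number of samples
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - pe) / (1 - pe)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "precision": precision,
            "f1": f1, "kappa": kappa}

m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
# e.g. sensitivity 0.8, specificity 0.9, accuracy 0.85, kappa 0.7
```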

Architecture Description
Fully Connected Neural Network (FCNN)
FCNNs are the most traditional deep neural networks. Every neuron in a layer is connected to every neuron in the layer below it (Goodfellow et al., 2016).

Convolutional Neural Network (CNN)
CNNs can model spatial structures, such as images or volumes. Convolutional kernels model local information by sliding over the input data (Goodfellow et al., 2016).

Autoencoders (AE)
By compressing and reconstructing the input data, AEs learn a low-dimensional encoding. They can be built from different types of layers, such as convolutional and fully connected layers (Ballard, 1987).

Transformer
The Transformer is a model built on a multi-headed attention mechanism; feature extraction is based solely on attention (Vaswani et al., 2017a).

Information fusion taxonomy for multimodal image classification
The positions of pre-processing and the final classifier are fixed in the multimodal classification process. Based on the number and ordering of DL backbones and the information fusion step, multimodal DL network architectures can be categorized into five types: input fusion, single-level fusion, hierarchical fusion, attention-based fusion, and output fusion, as shown in Fig. 10. As explained hereafter, single-level, hierarchical, and attention-based fusion are sub-categories of intermediate fusion. These categories describe how the network processes and combines the input modalities to produce classification results.
(1) Input fusion, also referred to as input-level fusion, places the information fusion phase before the DL backbone. Concatenation and merge are the two methods of information fusion. In the concatenation method, data from the different modalities are used as different channels of the input. In the merge approach, data is fused at the pixel or voxel level, and the merged data is then fed into the network.
(2) Single-level fusion performs information fusion after the DL backbone but before the final classifier. In single-level fusion, the features extracted by the DL backbones are fused only once, at some point before the classifier is applied. It can be divided into two types, Classic Fusion and Network Fusion, depending on the network structure. In Classic Fusion, high-dimensional features are extracted from the different modalities using different DL backbones and then merged or concatenated. This is the most common network structure in intermediate fusion, hence the name Classic. Fig. 12 illustrates the process diagram of classic fusion. In Network Fusion, the intermediate features of the different modalities are first extracted using DL backbones, followed by the extraction of high-level features of the fused modalities using additional DL backbones. Fig. 13 shows the process diagram of network single-level fusion.
(3) Hierarchical fusion is an improvement over single-level fusion. In this approach, DL backbones extract features from the data of the different modalities, and features from each level are then fused at the network level by concatenation or merging. Additional feature fusion is performed after the DL backbones. This allows more complex feature combinations to be learned, improving classification accuracy. The process diagram for hierarchical fusion is shown in Fig. 14.
(4) The emergence of Transformers has led to the development of attention-based fusion as a new network architecture. Through its unique DL backbone, this architecture extracts features and implements feature fusion based on the attention relationships between different modalities. Fig. 15 illustrates the process of attention-based fusion. A more detailed analysis of this network architecture is presented in Sect. 4.5.
(5) Output fusion, also known as decision-level fusion or late fusion, uses DL backbones to extract high-dimensional features from the different modalities. The extracted features are then used to generate separate classification results for each modality, which are combined using a fusion technique, such as majority voting or averaging, to produce a final classification result. The process diagram for output fusion is depicted in Fig. 16.
Recent years have seen a growing trend toward the use of deep learning networks in multimodal fusion research. Fig. 1 illustrates the distribution of the five fusion strategies within the scope of this study. In contrast to traditional methods, single-level fusion is the most commonly used method in DL multimodal fusion, followed by input and output fusion. Hierarchical fusion and attention-based fusion are also gaining attention and present great potential for research. These more recent fusion methods offer more complex ways of combining modalities, enabling deep learning networks to learn more powerful representations of multimodal data.

Input fusion networks
Input fusion combines data from multiple modalities into a single feature tensor that is fed into the deep neural network as input. Input fusion typically involves modalities with similar structures, making implementation relatively straightforward. Some modalities can be acquired together during clinical imaging (e.g., CT and PET). In many cases, these modalities have the same voxels and spacing after data processing, making it easy to obtain registered multimodal data. Furthermore, the majority of input fusion tasks do not require re-modeling; only the input part of the unimodal model needs to be modified to achieve multimodality. Fusion can be accomplished in three ways: concatenating multimodal medical images, merging them at the pixel or voxel level, or extracting high-dimensional features from the multimodal images and then fusing those features.
(1) The registered multimodal data are fed into the DL classifier as different input channels to obtain classification results, which is the most common input fusion approach; Fig. 17 illustrates this architecture. For instance, one study (2023) concatenated manually segmented multiparametric MRI images (PEI, DWI) into a CNN network. Despite the ease of implementing this fusion architecture, it has some limitations with regard to the modal data requirements. In particular, the registration quality of the different modal data can influence the classification results. Moreover, this approach is not suitable for fusing heterogeneous data, such as 3D medical images and 1D clinical records, which have different characteristics and dimensions.
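A schematic sketch of this channel-wise concatenation on toy 2-D grids (purely illustrative; a real pipeline would use a tensor library and 3-D volumes):

```python
def stack_channels(*modalities):
    """Input fusion by channel stacking: each registered modality
    becomes one input channel for a single DL backbone."""
    h, w = len(modalities[0]), len(modalities[0][0])
    for m in modalities:
        # input fusion requires registered images of identical size
        assert len(m) == h and all(len(row) == w for row in m)
    return list(modalities)        # shape: (channels, H, W)

mri = [[1, 2], [3, 4]]             # toy 2x2 "image", modality A
pet = [[5, 6], [7, 8]]             # toy 2x2 "image", modality B
fused_input = stack_channels(mri, pet)   # 2-channel input
```

The size assertion reflects the limitation noted above: unregistered or differently shaped modalities cannot be fused at the input level this way.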
(2) Merging images is another input fusion method in addition to concatenation. The various image modalities are fused at the pixel or voxel level to create a new fused image that is used for classification (Song et al., 2021; Kong et al., 2022; Rallabandi and Seetharaman, 2023). Song et al. (2021) proposed an effective multimodal image fusion method for Alzheimer's disease diagnosis using MRI and PET. Through registration and mask coding, they fused gray matter (GM) and 18-fluorodeoxyglucose positron emission tomography (FDG-PET) images to create a new imaging modality called "GM-PET". In the resulting composite image, the GM area is clearly highlighted, allowing AD diagnosis while maintaining both the contour and metabolic characteristics of the subject's brain tissue. They then fed the fused images to a CNN for classification. In the research of Kong et al. (2022), the GM region cropped from the MRI image is mapped onto the PET image, resulting in the fusion of PET and MRI data. In addition to providing anatomical and metabolic information about the brain, the fused modality allows the viewer to focus on the main features of the brain by reducing visual noise. Rallabandi and Seetharaman (2023) employed a fusion approach integrating MRI and PET images for the diagnosis of Alzheimer's disease. The fusion process applied the two-dimensional Fourier and discrete wavelet transforms (DWT) to combine MRI and PET images; the fused MR-PET image was then reconstructed using the inverse Fourier and DWT methods. The benefit of fused images is that they contain a wealth of medical information, but generating them often requires extensive prior medical knowledge.
(3) Some studies have performed input fusion after extracting features from the multimodal images instead of fusing the medical images directly (Li et al., 2015; Liu et al., 2014b). Li et al. (2015) used PCA to extract features from MRI, PET, and cerebrospinal fluid (CSF) and then concatenated these features into a Restricted Boltzmann Machine (RBM) network for the diagnosis of Alzheimer's disease. Liu et al. (2014b) manually extracted features from MRI and PET and then used a stacked auto-encoder (SAE) to classify the concatenated multimodal features in order to diagnose Alzheimer's disease. Extracting features and then combining them can address multimodal heterogeneity. However, PCA-based or manual feature extraction requires prior knowledge and does not fully utilize the image information.
In input fusion, the fused data passes through a single feature-extraction branch, and this network architecture design significantly reduces the number of network parameters and the deployment difficulty. However, because the data are fused at the input level, the complementary information from the different modalities is not utilized to the fullest extent possible.

Single-level fusion networks
The single-level fusion process uses different DL backbones to extract features from the different modalities separately, followed by an information fusion step before the final decision. Based on the position of information fusion within the network architecture, it can be divided into classic fusion structures and network fusion structures.
(1) The most common single-level fusion architecture extracts features from the multimodal data using different branches, then fuses these features and feeds them to the final classifier (Suk and Shen, 2013; Suk et al., 2014, 2015; Xu et al., 2016; Zou et al., 2017; Yang et al., 2017; Joo et al., 2021; Ye et al., 2017; Punjabi et al., 2019; Yap et al., 2018; Rahaman et al., 2021; Xiong et al., 2022; Qin et al., 2020; Liu et al., 2023b; Kollias et al., 2023; Kadri et al., 2023; Saponaro et al., 2024). A schematic diagram of this network architecture is shown in Fig. 18. After preprocessing the data, Zou et al. (2017) extracted low-level 3D features from fMRI and sMRI to automatically classify Attention Deficit Hyperactivity Disorder (ADHD). Once the features are concatenated, softmax classifiers are used to differentiate ADHD cases from typically developing children (TDC). To diagnose breast cancer, Joo et al. (2021) fused MRI (T1, T2) and clinical information. Two 3D ResNet-50 networks were used to extract features from contrast-enhanced T1 subtraction MR images and T2 MR images, while an FC layer processed the clinical inputs. For the prediction of pathological complete response, the outputs of each 3D ResNet-50 and the FC layer were concatenated, and a final FC layer with a sigmoid activation function was used. Likewise, Yap et al. (2018) employed ResNet and FC layers to extract features from DSC, clinical images, and metadata, then applied FC layers for skin lesion classification. Aside from these methods of concatenating modal features, complex computations can also be used to merge features. Xiong et al. (2022) used visual field (VF) and OCT data for the diagnosis of glaucoma. VFNet and OCTNet were used to extract features from the VF and OCT modalities, respectively. A weighted average was used to obtain an aggregated representation from the bimodal features using an attention module. Each modal feature was assigned a weight using a fully connected layer, followed by a sigmoid function producing a scalar value (0-1) indicating the feature's relative contribution to the aggregate representation. A global average pooling layer was also used to aggregate all features. The glaucoma diagnosis results were predicted using three FC layers and a softmax layer. For CT and PET modalities, Qin et al. (2020) extracted features using CNN networks, merged the features using gated multimodal units (GMU), and classified lung cancer using FC layers. Unlike the widely used concatenation operation, the GMU learns intermediate representations of the multimodal features using hidden structures and gate controls, enabling the prediction layer to assign weights more effectively to intrinsically associated features.
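A minimal pure-Python sketch of the classic structure (the "backbones" and weights here are hypothetical stand-ins, not any published model): each branch maps its modality to a feature vector, the vectors are concatenated, and a single linear classifier scores the fused representation.

```python
def branch_a(x):
    """Stand-in for a DL backbone on modality A."""
    return [sum(x), max(x)]

def branch_b(x):
    """Stand-in for a DL backbone on modality B."""
    return [min(x), len(x)]

def final_classifier(features, weights, bias=0.0):
    """A linear score followed by a hard threshold (binary case)."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 1 if score > 0 else 0

# classic single-level fusion: extract per branch, then concatenate
fused = branch_a([1, 2, 3]) + branch_b([4, 5])   # [6, 3, 4, 2]
label = final_classifier(fused, weights=[0.1, 0.2, -0.5, 0.3])
```

The key point is that fusion happens exactly once, on the high-dimensional features, just before the classifier.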
(2) The network fusion variant of single-level fusion can be described in two stages. The first stage extracts single-level features separately from the different modalities using DL backbones; the second stage performs information fusion by utilizing an additional DL backbone to extract high-level features from the fused features (Shi et al., 2017; Zhou et al., 2017; Cheng and Liu, 2017; Kim and Lee, 2018; Rahaman et al., 2022; Jin et al., 2022; Leng et al., 2023; Lu et al., 2024). Lastly, the extracted high-level features are used for the final classification. Fig. 19 illustrates a typical network fusion architecture. Cheng and Liu (2017) used cascaded CNNs for the multimodal fusion of MRI and PET to diagnose Alzheimer's disease. They proposed a 2D CNN to combine the multimodal features and make the final classification. After the 3D CNN output features are flattened to one dimension, the 1D feature vectors of MRI and PET are combined to produce a two-dimensional feature map for 2D CNN analysis. Kim and Lee (2018) developed a multimodal architecture for combining MRI, PET, and CSF features. Each modality's individual high-level feature representation is computed using a stacked sparse extreme learning machine autoencoder (sELM-AE). Another stacked sELM-AE obtains the joint features from the high-level MRI, PET, and CSF features, and a kernel-based extreme learning machine classifies the joint feature representation. With multimodal neuroimaging and genetic data, Zhou et al. (2019b) proposed a three-stage deep feature learning DNN framework for Alzheimer's disease classification, in which each modality's latent representation is learned in successive stages before fusion. Rahaman et al. (2022) classified schizophrenia using sMRI, fMRI, and single nucleotide polymorphisms. The latent representations of the static functional network connectivity (sFNC), sMRI, and single nucleotide polymorphism (SNP) data are learned using an autoencoder, a multi-layered perceptron, and a bi-directional long short-term memory (LSTM) network, respectively. A Multimodal Bottleneck Attention Module fuses the embeddings and sends the combined embeddings to a variational autoencoder for encoding, followed by a softmax layer for classification.
The single-level fusion method is currently used to merge multiple medical modalities for classification tasks and can be applied to the fusion of various medical modalities. The method does not require a specific data format, as it extracts features from the modalities using different branches and fuses them at the level of high-dimensional features. In this regard, single-level fusion is a suitable solution for unregistered data or data of different dimensions. However, because information fusion occurs only at the end of the network architecture, single-level fusion is not capable of jointly analyzing low-dimensional features.

Hierarchical fusion networks
Hierarchical fusion extends single-level fusion in order to further exploit the complementary information between multimodal data. The hierarchical fusion process involves the fusion of features of different dimensions and the classification of these jointly represented features (Mahmood et al., 2018; Zhang and Shi, 2020; Zhou et al., 2021b; He et al., 2021; Li et al., 2022c; Wu et al., 2023b; Omeroglu et al., 2023; Xu et al., 2023; Tu et al., 2024; Miao et al., 2024; Xu et al., 2024). There are two ways to implement hierarchical fusion: by using additional branches for multimodal feature fusion, or by using fusion blocks to join features from different modalities.
(1) The common hierarchical fusion architecture extracts the different modalities via different branches while simultaneously combining multimodal features of different dimensions via another parallel branch. Finally, the high-dimensional features from the fusion branch and from each modal branch are combined for classification.
(2) Hierarchical fusion can also be structured differently, by extracting features using different branches and fusing them at different dimensions using fusion blocks, whose outputs are then returned to each modality branch for further fusion. This network design can reduce the number of model parameters while fusing features at multiple levels. One work (2021) proposed a multimodal MRI hierarchical-order multimodal interaction fusion network (HOMIF) to diagnose gliomas. The framework contains two branch networks, one per modality, several multimodal interaction modules with different scales and orderings, diverse learning constraints, and a predictive subnet. Each branch network has three CNN blocks with multiscale inputs and an arm with diverse high-order multimodal interaction (HOMI) modules to integrate and interact deeply with the multiscale features. Multi-level feature fusion allows hierarchical fusion to explore more fully the complex and complementary information between modalities. Learning the synergy of multimodal data while maintaining the features of each modality improves the model's classification performance (Zhang and Shi, 2020). However, as it involves the fusion of low-dimensional features, the registration of the multimodal data may affect the classification performance of hierarchical fusion.
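The multi-level idea can be caricatured in a few lines (purely illustrative operators, not the HOMIF architecture): per-modality features are computed at two levels, a parallel fusion path combines them at each level, and the classifier would see both the modality-specific and the fused high-level features.

```python
def level1(x):
    """Toy low-level feature extractor for one modality."""
    return [v * 2 for v in x]

def level2(feat):
    """Toy high-level feature built on top of level-1 output."""
    return [sum(feat)]

def fuse(fa, fb):
    """Element-wise merge at a given level."""
    return [a + b for a, b in zip(fa, fb)]

def hierarchical_features(xa, xb):
    la, lb = level1(xa), level1(xb)          # level-1 per modality
    f1 = fuse(la, lb)                        # level-1 fusion
    ha, hb = level2(la), level2(lb)          # level-2 per modality
    f2 = fuse(level2(f1), fuse(ha, hb))      # level-2 fusion reuses f1
    return ha + hb + f2                      # joint representation

feats = hierarchical_features([1, 2], [3, 4])   # [6, 14, 40]
```

Unlike single-level fusion, the fused path here participates at every level, so low-dimensional features are also analyzed jointly.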

Attention-based fusion networks
As attention mechanisms (Vaswani et al., 2017b) have been proposed and developed, more and more studies have begun to incorporate them into network architectures. Some of the network architectures mentioned above also include attention mechanisms to enhance model performance. Zhang and Shi (2020) added attention modules to re-weight the modal features. Xing et al. (2022b) used a vision transformer (ViT) to extract the modal features and fuse them. These studies, however, only operate on individual modalities and do not use the attention mechanism for multimodal interactions. Recently, some studies have used the attention mechanism itself to extract and combine features (Dai et al., 2021; Qiu et al., 2023; Zhang et al., 2022b; Liu et al., 2023a; Li et al., 2022b; Dai et al., 2022; Zuo et al., 2023; Chen et al., 2023; Gao et al., 2023; Bi et al., 2023; Wei and Ji, 2024). We call this network architecture attention-based fusion; it is distinct from the previous fusion architectures.
In their study, Dai et al. (2021) proposed TransMed, which combines a CNN and a transformer to capture low-level features and high-level cross-modality relationships. TransMed first sends the multimodal images to the CNN, where they are processed into sequences; transformers then learn the relationships between them and predict the final result. TransMed is more efficient and accurate than existing multimodal fusion methods because it effectively models the global features of multimodal images.
Qiu et al. (2023) designed a novel fusion module, Attention-based Hierarchical Multimodal Fusion (AHM-Fusion). The system includes both an early feature guidance module and a late feature fusion module, capturing deep interaction information between the different multimodal features. In the early stage of feature aggregation, the early feature guidance module captures multimodal interactions. To obtain classification results, a late feature fusion module based on attention mechanisms is used; by cascading double attention layers in this module, the deep interaction information is further captured. A gating-based attention mechanism then decreases the impact of insignificant features in each modality. Research is increasingly incorporating attention mechanisms, particularly Transformer structures, into multimodal classification tasks. While performing cross-modal attention computation, a multi-level fusion of multimodal features is achieved. Furthermore, the Transformer structure is well suited for joining modalities of different dimensions. Nevertheless, Transformer research in medical tasks is still in its infancy, and studies tend to focus on solving particular problems, making it difficult to derive a general multimodal classification architecture. A further important point is that the success of Transformers relies on pre-training on large datasets, and the number of samples in medical datasets is often not sufficient to train a Transformer well. As a result, it is recommended that Transformers and CNNs be used together in a hybrid fashion.
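The core operation behind these networks can be sketched as scaled dot-product cross-attention (the general mechanism of Vaswani et al. (2017b), not any specific fusion network): token features from one modality act as queries over key/value features of another, yielding fused tokens.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Each query token (modality A) attends over key/value tokens
    (modality B); output tokens are attention-weighted B features."""
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        fused.append([sum(wi * v[j] for wi, v in zip(w, values))
                      for j in range(len(values[0]))])
    return fused

mri_tokens = [[1.0, 0.0]]               # queries from modality A
pet_tokens = [[1.0, 0.0], [0.0, 1.0]]   # keys = values, modality B
out = cross_attention(mri_tokens, pet_tokens, pet_tokens)
```

The query here attends more strongly to the PET token it is most similar to, illustrating how attention weights realize the cross-modal interaction described above.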

Output fusion networks
In output fusion, each modality uses a separate DL backbone to extract features and make decisions, and the results are merged into one final decision. Fig. 22 shows a typical network architecture for output fusion. The final classifier of decision fusion can be realized by simple operations (Moon et al., 2020; Guo et al., 2022) such as voting, weighting, and averaging, or by classifiers (Abdolmaleki and Abadeh, 2020; Fang et al., 2020; Yoo et al., 2022; Kwon et al., 2022; Qiu et al., 2022) such as SVM, extreme gradient boosting (XGBoost), adaptive boosting (AdaBoost), Categorical Boosting (CatBoost), Decision Tree, and K-nearest neighbor (KNN). Moon et al. (2020) used unweighted averaging, weighted averaging, weighted voting, and stacking to fuse the classification results from different ultrasound (US) modalities to identify breast tumors. Guo et al. (2022) applied a linear weighting module to assemble the predicted probabilities of pre-trained models based on 4 MRI modalities for the classification of gliomas. To diagnose early glottic cancer, Kwon et al. (2022) used decision trees to combine the classification results from sound data and image data. Abdolmaleki and Abadeh (2020) used SVM, KNN, and linear discriminant analysis (LDA) to fuse the classification results of fMRI and sMRI to diagnose ADHD.
The output fusion process involves combining unimodal results from different modalities.As a result, it is relatively easy to implement and generally does not require additional training.It is, however, difficult to exploit the complementary information between different modalities because there is no feature fusion.Furthermore, output fusion may not improve classification performance if there are large differences in classification performance between different modalities.
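The voting and averaging schemes above amount to a few lines (an illustrative sketch; per-modality classifiers are assumed to have already produced labels or class probabilities):

```python
from collections import Counter

def majority_vote(labels):
    """Majority voting over per-modality predicted labels."""
    return Counter(labels).most_common(1)[0][0]

def average_probs(prob_lists):
    """Average per-modality class-probability vectors, then argmax."""
    n = len(prob_lists)
    avg = [sum(p[i] for p in prob_lists) / n
           for i in range(len(prob_lists[0]))]
    return max(range(len(avg)), key=avg.__getitem__)

# e.g. three modality-specific classifiers on a binary task
vote = majority_vote(["benign", "malignant", "malignant"])
avg_cls = average_probs([[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]])
```

Note that the two rules can disagree: here two of the three modalities vote "malignant", while the averaged probabilities, dominated by one confident branch, favor class 0. This illustrates the caveat that large performance gaps between modalities can undermine output fusion.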

Which fusion method is the best?
The choice of a fusion method is crucial when dealing with multimodal medical classification problems. Fortunately, many fusion architectures have been evaluated on the same dataset, ADNI. Based on the quantitative results reported by the authors, comparisons between fusion architectures are possible to some extent. We consider studies performed on ADNI where MRI and PET are used to diagnose Alzheimer's disease. There are three stages in the progression of Alzheimer's disease: normal cognition (NC), mild cognitive impairment (MCI), and Alzheimer's disease (AD). Although MCI does not significantly interfere with daily activities, a high risk of progression to AD has been consistently demonstrated in patients with MCI (Dubois et al., 2007). MCI subjects can be divided into MCI converters (cMCI) and MCI non-converters (ncMCI) to predict the transition risk of MCI. Tab. 5 reports the results of the different fusion methods obtained by their authors for different classification tasks. The experiments were not replicated: when comparing these results, it should be noted that each paper relies on a different subset of patients, although the numbers of subjects were similar.
In general, we believe that deep multi-level fusion can better exploit the synergy of multimodal data to produce better classification results. This is further supported by the results in Tab. 5. Compared with input fusion, single-level fusion performs a more robust feature fusion, which improves the overall accuracy of intermediate fusion. Hierarchical fusion, which utilizes multi-level feature fusion, did not significantly improve the performance of binary classification but performed well for four-class classification. Generally, a complex model does not improve performance much when applied to a simple classification task; the more complex the network, the better it is at solving complex classification problems. When the number of categories increases from two to four, the classification accuracy of hierarchical fusion improves significantly. Last but not least, we note that output fusion achieves excellent results on NC versus AD classification, thanks to the pre-training of the different modal branches. With output fusion, DL backbones can be pre-trained on a large number of unimodal datasets and then fine-tuned on the multimodal datasets. Similar results are reported on other datasets. On ABIDE, sMRI and fMRI were combined to diagnose autism spectrum disorders: the hierarchical fusion (Liu et al., 2022) result of 87.2% was better than the input fusion (Akhavan Aghdam et al., 2018) result of 65.5%. Rahaman et al. (2022) used the COBRE dataset for the diagnosis of schizophrenia, and the accuracy of input fusion, output fusion, and single-level fusion was 70%, 78%, and 95%, respectively. On the GAMMA dataset, Li et al. (2022c) achieved 63% accuracy with input fusion, 72% with single-level fusion, and 80% with hierarchical fusion for glaucoma diagnosis.
It is difficult to determine a unified solution for the wide variety of multimodal medical image fusion tasks. Nevertheless, we can draw some preliminary conclusions from the above analysis. For medical modalities with similar structures, modal registration is easier, so input fusion, single-level fusion, and hierarchical fusion are all network structures worth investigating. Generally, single-level fusion and hierarchical fusion fuse deeper features, which improves classification performance. When the data vary widely in structure or dimensionality, single-level fusion and attention-based fusion are preferable, as they can handle a wide range of modal feature fusion scenarios. Lastly, if large unimodal datasets are available for each modality in the multimodal data, output fusion will perform well.
In addition to using a single multimodal fusion method, multiple fusion methods can be combined (Tang et al., 2022; Li et al., 2023; Hu et al., 2020; Wang et al., 2022). Tang et al. (2022) achieved the classification of skin lesions using a combination of single-level fusion and output fusion. For the classification of diabetic retinopathy, Li et al. (2023) utilized different configurations of OCT Angiography data; their approach combined hierarchical fusion for registered modalities with late fusion for unregistered modalities. To improve the diagnosis of breast cancer, Hu et al. (2020) fused multi-parametric MRI data at three levels: input, feature (intermediate), and decision (output). Combining different fusion methods can accumulate their advantages, allowing data to be fused from various perspectives and improving classification performance to some extent. It is a promising strategy for multimodal medical classification.

How to find the best architecture?
During our investigation of multimodal approaches, we found that researchers need not only many tests to compare the various fusion methods but also a large number of hyperparameter tests to determine the best network architecture for each fusion method. Conducting these extensive tests takes a great deal of time and labor. Many recent studies have applied Neural Architecture Search (NAS) to multimodal networks, integrating various fusion techniques to determine the best architecture for a given dataset. These methods have been widely used for diagnosing dementia (Chatzianastasis et al., 2023), glioma segmentation (Wang, 2020), multimodal action recognition (Pérez-Rúa et al., 2019), visual question answering (Yu et al., 2020), multimodal damage identification (Singh and Nair, 2022), multimodal gesture recognition (Yin et al., 2022), etc. We believe that this approach has the potential to be explored in the future for multimodal classification in medicine.

How to manage incomplete multimodal data?
The problem of modality incompleteness is one of the most pressing challenges in multimodal medical research. The high cost and potentially harmful effects of medical imaging may lead many patients to refuse being scanned with multiple imaging modalities for clinical diagnosis (Pan et al., 2020). In the ADNI dataset, all subjects have MRI data, but only about half of the subjects have PET scans (Pan et al., 2020). The most common approach to the problem is to discard the subjects with incomplete modalities (Liu et al., 2014b; Calhoun and Sui, 2016; Suk et al., 2014; Shi et al., 2017), but this reduces the number of trainable subjects for the deep learning model, degrading classification performance. Another option is to estimate the features of the missing modalities (Donders et al., 2006; Sterne et al., 2009).

Table 5. Comparison of the results of different fusion methods on the ADNI dataset. In the multi-class tasks, 3 classes is NC vs. MCI vs. AD and 4 classes is NC vs. ncMCI vs. cMCI vs. AD. Unit: %.
Generative Adversarial Networks (GANs) (Goodfellow et al., 2020) are generative models that can produce data of one modality from another modality (Goodfellow et al., 2014). With the development of GANs, more and more fields are using this technology to generate images. The modality incompleteness problem has recently been addressed with GANs in many studies (Lin et al., 2021; Zhang et al., 2022a; Pan et al., 2018, 2020, 2021; Jin et al., 2022; Gao et al., 2023; Tu et al., 2024): the GAN generates the missing data, and the generated data is then used for multimodal classification. This significantly increases the number of subjects in the dataset and improves the model's classification performance, making it an effective solution when dealing with multimodal incompleteness. GAN-based solutions are currently the most promising.
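A minimal sketch of this idea, assuming a pix2pix-style conditional setup (an adversarial term plus an L1 reconstruction term), is shown below. The tiny generator and discriminator, the MRI-to-PET direction, and the loss weight are all illustrative assumptions, not the architectures used in the cited studies.

```python
import torch
import torch.nn as nn

# Toy image-to-image generator (e.g., MRI -> PET) and patch discriminator.
G = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1), nn.Tanh())
D = nn.Sequential(
    nn.Conv2d(2, 16, 3, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(16, 1, 3, padding=1))  # patch-wise real/fake scores

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
adv, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

mri = torch.randn(4, 1, 32, 32)          # available modality
pet = torch.rand(4, 1, 32, 32) * 2 - 1   # paired target modality

# --- discriminator step: real pairs vs. generated pairs ---
fake_pet = G(mri).detach()
d_real = D(torch.cat([mri, pet], dim=1))
d_fake = D(torch.cat([mri, fake_pet], dim=1))
loss_d = adv(d_real, torch.ones_like(d_real)) + adv(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# --- generator step: fool D while staying close to the paired target ---
fake_pet = G(mri)
d_fake = D(torch.cat([mri, fake_pet], dim=1))
loss_g = adv(d_fake, torch.ones_like(d_fake)) + 100.0 * l1(fake_pet, pet)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Once trained, `G(mri)` would stand in for the missing PET scan of a subject, so the full multimodal classifier can be applied to the completed dataset.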
Recent research has sought to confront the difficulties of unbalanced datasets and incomplete modalities through dedicated fusion network designs and training strategies. The aim is to optimize the utilization of the available data while mitigating the bias introduced by independently generating missing modalities. Gravina et al. (2024) introduced a Multi-Input-Multi-Output 3D CNN for the assessment of dementia severity, tailored for scenarios with incomplete multimodal brain MRI and PET data. In alignment with our hierarchical fusion architecture, they incorporated a fusion branch named PAIRED-NET during feature extraction, employing distinct CNN branches for MRI and PET, each capable of producing independent outputs. During training, the parameters of all three branches are updated simultaneously when the patient sample contains the full modality set; when a modality is missing, only the parameters of the branch corresponding to the available modality are updated. This methodology allows the network to be trained on the entire dataset and to classify instances with a single missing modality at test time. Liu et al. (2023a) introduced a cascaded Multi-Modal Mixing Transformer (3MT) for the classification of Alzheimer's disease with incomplete data. The architecture comprises a sequence of Cascaded Modality Transformers (CMTs), each incorporating features from a specific modality; at the end of the sequence, a more informed class prediction is obtained by aggregating the extracted multimodal features. When data are missing, the CMTs corresponding to the absent modalities receive zero embeddings, indicating "not available" to the model. This training approach equips the model with prior knowledge for handling diverse missing-data scenarios.
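The zero-embedding idea can be sketched very simply: a per-sample modality mask replaces the embedding of an absent modality with zeros before fusion, which also prevents gradients from reaching that branch for those samples. This is a generic illustration of the mechanism, not the 3MT or PAIRED-NET code; all sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class MaskedFusion(nn.Module):
    """Toy fusion head with a per-sample modality mask: a missing
    modality is replaced by a zero embedding ('not available')."""
    def __init__(self, dim=16, n_classes=3):
        super().__init__()
        self.enc_mri = nn.Linear(64, dim)
        self.enc_pet = nn.Linear(64, dim)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, mri, pet, pet_present):
        f_mri = self.enc_mri(mri)
        # Zero out PET embeddings for subjects without a PET scan; this
        # also stops gradients from updating the PET branch for them.
        f_pet = self.enc_pet(pet) * pet_present.view(-1, 1).float()
        return self.head(torch.cat([f_mri, f_pet], dim=1))

net = MaskedFusion()
mri, pet = torch.randn(4, 64), torch.randn(4, 64)
mask = torch.tensor([1, 0, 1, 0])  # subjects 2 and 4 lack PET
out = net(mri, pet, mask)
```

Training on the full cohort then proceeds as usual, with complete and incomplete subjects mixed in the same batches.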

Does multimodal fusion always perform better?
Multimodal data not only contain complementary information but may also contain a great deal of redundant information. One study found that multimodal fusion did not enhance classification performance (Khagi and Kwon, 2020). In one sense, this relates to the design of the network; in another, multimodal fusion may not improve classification performance if the information in the modalities is relatively similar or if a particular modality does not accurately characterize the target class. Before starting a multimodal fusion project and gathering multimodal data, it is advisable to conduct a redundancy analysis (Salvador et al., 2019).
Alternatively, Narazani et al. (2022) questioned the multimodal diagnostic objectives. Clinical studies primarily aim to determine the type of dementia, whereas DL studies have focused on only one type, AD. Multimodal fusion studies have performed well in classifying NC versus AD, but the clinical goal is the classification of AD at multiple levels. In their tests, multimodal fusion networks did not improve multilevel classification. Multimodal fusion classification studies should therefore be conducted in conjunction with clinical needs.

Can we advantageously combine multimodal image classification with other tasks?
We should point out that input fusion (Pereira et al., 2016; Isensee et al., 2018, 2019; Cui et al., 2018; Kamnitsas et al., 2017), intermediate fusion (Dolz et al., 2018, 2019; Chen et al., 2018; Andrade-Miranda et al., 2022; Li et al., 2022a), and output fusion (Kamnitsas et al., 2018; Aygün et al., 2018) methods can also be applied to medical image segmentation, medical image fusion, and similar tasks. A notable trend is to incorporate the fusion network into the feature extraction process, enabling the creation of multi-task multimodal networks. Cheng et al. (2022) proposed an end-to-end multi-task learning network for simultaneous glioma segmentation and IDH genotyping, based on sharing the spatial and global feature representations extracted by a hybrid CNN-Transformer encoder. The performance of both classification and segmentation can be enhanced through such a joint network.
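The shared-encoder pattern behind such multi-task networks can be sketched as follows. This is a generic toy illustration of a joint segmentation-and-classification objective, not the Cheng et al. (2022) architecture; the layer sizes, loss weighting, and two-channel input-fused volume are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    """Toy shared encoder with a segmentation head and a classification
    head; the joint loss lets each task regularize the other."""
    def __init__(self, in_ch=2, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.seg_head = nn.Conv2d(16, 1, 1)  # per-pixel mask logits
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, n_classes))

    def forward(self, x):
        feats = self.encoder(x)          # shared feature representation
        return self.seg_head(feats), self.cls_head(feats)

net = MultiTaskNet()
x = torch.randn(4, 2, 32, 32)            # input-fused two-modality slices
seg_logits, cls_logits = net(x)
# Joint objective: sum of the two task losses (weights would be tuned).
loss = (F.binary_cross_entropy_with_logits(
            seg_logits, torch.rand(4, 1, 32, 32).round())
        + F.cross_entropy(cls_logits, torch.randint(0, 2, (4,))))
```

Backpropagating `loss` updates the shared encoder with gradients from both tasks, which is the mechanism by which each task can improve the other.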

Can we advantageously combine images with contextual data in a classification pipeline?
In this survey, our main focus was on integrating data from various imaging modalities. However, out of the 114 reviewed papers, 27 incorporate non-image contextual information in various forms, often structured demographic and clinical metadata extracted from electronic health records (Huang et al., 2020; Zhou et al., 2021a; Yan et al., 2021; Venugopalan et al., 2021; Prabhu et al., 2022). Clinical metadata includes the results of physical tests, like visual acuity or refraction for chronic central serous chorioretinopathy diagnosis (Yoo et al., 2022), chemical tests, like Pap or HPV for cervical dysplasia diagnosis (Xu et al., 2016), and cognitive tests, like the executive functioning (ADNI-EF) and memory (ADNI-MEM) tests for Alzheimer's disease diagnosis (Lee et al., 2019a). Sometimes, images are combined with more complex contextual data like voice signals (Kwon et al., 2022), free-form text (Zhang et al., 2022b), or genomic data (Rahaman et al., 2021; Venugopalan et al., 2021; Rahaman et al., 2022; Qiu et al., 2023). Image and contextual metadata have no geometrical relationship, so input and hierarchical fusion are not relevant in this scenario. Earlier solutions thus relied on single-level intermediate fusion or output fusion, while recent studies tend to use attention-based intermediate fusion (Zhang et al., 2022b; Qiu et al., 2023; Chen et al., 2023). One challenge with contextual data is that they are often incomplete; the reader is referred to Section 5.3 for this challenge. Nevertheless, all these studies report increased classification performance when using contextual data.
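A minimal sketch of single-level intermediate fusion of an image with tabular clinical metadata is given below; since the two inputs share no spatial grid, fusion happens on flat feature vectors. The encoders, feature sizes, and eight-feature metadata vector are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImageTabularFusion(nn.Module):
    """Toy single-level intermediate fusion of an image with structured
    clinical metadata (age, test scores, ...)."""
    def __init__(self, n_meta=8, n_classes=2):
        super().__init__()
        self.img_enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.meta_enc = nn.Sequential(nn.Linear(n_meta, 16), nn.ReLU())
        self.head = nn.Linear(16 + 16, n_classes)

    def forward(self, image, meta):
        # No geometric relationship between the two inputs, so fusion is
        # done on flat feature vectors rather than on the image grid.
        return self.head(torch.cat(
            [self.img_enc(image), self.meta_enc(meta)], dim=1))

net = ImageTabularFusion()
out = net(torch.randn(4, 1, 32, 32), torch.randn(4, 8))
```

Attention-based variants replace the plain concatenation with a learned weighting of the two feature vectors before the classifier.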

Trends for the future
As network structures evolve and hardware devices become increasingly available, there has been growing interest in multimodality research. From Table 5, it is evident that the rapid development of multimodal fusion has led to significant improvements in classification results. In fact, multimodal networks based on deep learning exhibit greater potential for development than unimodal networks.
The Transformer is one of the most popular network architectures, and Transformer-based multimodal fusion has developed rapidly in the past two years. In particular, for vision-language tasks, Transformers can handle the fusion of images, language, and text very effectively. Based on research conducted in different fields, we classify Transformer-based multimodal fusion networks into self-attention Transformers (Akbari et al., 2021; Nagrani et al., 2021; Shi et al., 2022; Li et al., 2021; Pashevich et al., 2021; Appalaraju et al., 2021; Steitz et al., 2022; Wu and Mebane Jr, 2022) and cross-attention Transformers (Lu et al., 2019; Chen et al., 2021a; Tan and Bansal, 2019; Zhu and Yang, 2020; Ramesh et al., 2021; Rahman et al., 2021; Chen et al., 2021a,b; Li et al., 2022d), as shown in Fig. 23. After features are extracted with encoders, self-attention Transformers concatenate the features from the different modalities and compute the attention relationships within the fused sequence using Transformer blocks. Alternatively, cross-attention Transformers compute the attention relationships between the different modalities to achieve information fusion. Nowadays, these two architectures are the most popular multimodal fusion networks.
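The distinction between the two schemes can be made concrete with `nn.MultiheadAttention`: cross-attention queries one modality's tokens with the other's, while self-attention fusion runs ordinary attention over the concatenated token sequence. The token counts and embedding size below are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

dim, heads = 32, 4
tok_a = torch.randn(2, 10, dim)   # e.g., patch tokens from MRI
tok_b = torch.randn(2, 12, dim)   # e.g., patch tokens from PET

# Cross-attention fusion: query from one modality, key/value from the
# other, so each modality attends to the other's features.
attn_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
attn_ba = nn.MultiheadAttention(dim, heads, batch_first=True)
fused_a, _ = attn_ab(tok_a, tok_b, tok_b)   # A enriched with B
fused_b, _ = attn_ba(tok_b, tok_a, tok_a)   # B enriched with A

# Self-attention fusion: concatenate the token sequences and run
# ordinary self-attention over the joint sequence.
self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
joint = torch.cat([tok_a, tok_b], dim=1)
fused_joint, _ = self_attn(joint, joint, joint)
```

In a full network these layers would sit inside Transformer blocks with residual connections and feed-forward sublayers; the sketch isolates only the attention pattern that distinguishes the two families.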
Compared to CNNs, Transformers have the advantage of efficiently capturing long-range relationships between sequences. In medical images, most visual representations are ordered due to the similarity of human organs; medical images thus contain more sequence-relationship information than natural images (Dai et al., 2021). This suggests that Transformer-based multimodal medical image fusion is a promising approach, and the two network architectures above are worth exploring. While recent medical research has employed analogous structures (Gao et al., 2023; Zuo et al., 2023; Bi et al., 2023), broader investigations and validations are needed to extend the applicability of these findings.
In addition to these developments, the field is also witnessing an emerging trend: the exploration of representation learning for multimodal data using techniques such as pretext tasks or contrastive learning. These methods aim to learn robust and transferable representations that can be applied to downstream classification tasks or trained in parallel with the classification task (Wei et al., 2022; Mohit Prabhushankar et al., 2022; Cai et al., 2022; Gutiérrez et al., 2022; Xing et al., 2022a; Taleb et al., 2022; Hager et al., 2023). This field is new and many research questions are emerging. How can an aligned representation be learned across modalities? Can we learn one representation space for the different modalities? The research community is actively moving in this direction to address the challenges associated with representation learning.
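The core of the contrastive approach is a bidirectional InfoNCE-style loss over paired embeddings, as popularized by ConVIRT and CLIP: matched pairs are pulled together, all other pairs in the batch are pushed apart. The sketch below assumes pre-computed embeddings and an arbitrary temperature; it illustrates the loss shape, not any cited paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def bidirectional_info_nce(img_emb, txt_emb, temperature=0.1):
    """ConVIRT/CLIP-style bidirectional contrastive loss: the i-th image
    is matched with the i-th text; all other batch pairs are negatives."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature   # cosine similarities
    targets = torch.arange(img_emb.size(0))        # diagonal = positives
    # Image-to-text and text-to-image directions, averaged.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

loss = bidirectional_info_nce(torch.randn(8, 64), torch.randn(8, 64))
```

Minimizing this loss on many paired samples is what drives the two encoders toward a shared, aligned representation space, which is exactly the open question raised above.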
In response to the challenge posed by the limited scale of medical datasets and the laborious nature of manual labeling, some investigations have harnessed information from image-text pairs available on the web. This approach facilitates the construction of transfer-learning multimodal models, which have demonstrated notable efficacy in subsequent fine-tuning tasks and zero-shot classification. Zhang et al. (2022c) introduced ConVIRT, an unsupervised methodology for acquiring medical visual representations from naturally paired images and text. Their approach contrasts image representations with paired text representations through a bidirectional objective, surpassing alternative methods across various downstream medical classification tasks. Drawing inspiration from the Contrastive Language-Image Pre-training (CLIP) approach (Radford et al., 2021), Huang et al. (2023) undertook fine-tuning specifically tailored for medical applications, resulting in Pathology Language-Image Pretraining (PLIP). Exploiting numerous de-identified images and abundant textual data disseminated by clinicians on public platforms like Twitter, PLIP is a multimodal, unsupervised, Transformer-based transfer-learning model. It effectively classifies new pathological images across four external datasets, exhibiting state-of-the-art performance. These initiatives highlight the immense potential of publicly shared medical information as a resource for developing multimodal medical AI systems, thereby enhancing the landscape of medical diagnosis.
In support of these advancements and needs, the TorchMultimodal library has been created by Meta Research as a framework, built on PyTorch, for training state-of-the-art multimodal multi-task models at scale. This library is the result of concerted community efforts, reflecting the growing focus on multimodal tasks and methodologies, as well as representation learning for multimodal data. Furthermore, in line with the increasing enthusiasm for multimodal learning, libraries such as Transformers by Hugging Face offer robust resources for constructing and training such models. The library provides pretrained models that can handle various modalities, such as text and images, simplifying tasks like image categorization or answering questions with multimodal data. This emphasis on multimodal functionality underscores the growing demand for models that can comprehend and analyze diverse types of data. Although these libraries are not exclusively designed for the medical domain, they can serve as a beneficial resource for researchers aiming to create and distribute code for multimodal classification of medical images.
As illustrated in Tab. 7, very few multimodal image classification papers in the medical field are associated with public code.We hope TorchMultimodal, Transformers or similar libraries will facilitate code sharing.

Conclusion
In this paper, we conducted a comprehensive review of the development of deep learning-based multimodal medical classification over the past few years. We examined the complementary relationships among several common clinical modalities and delved into five types of architectures for deep learning multimodal classification networks: input fusion, single-level fusion, hierarchical fusion, attention-based fusion, and output fusion. Our study covered a wide range of multimodal fusion scenarios in medical classification and the application domains for which the different network architectures are most suitable.
Additionally, we discussed emerging trends and challenges in the field, including the exploration of representation learning techniques and the development of dedicated frameworks like TorchMultimodal, which provide efficient tools for training state-of-the-art multimodal multi-task models at scale. In particular, we highlighted the advantages of Transformer-based multimodal fusion architectures in medical imaging applications, where sequence relationships are more prevalent, demonstrating the potential of these architectures for advancing multimodal medical classification.
Looking forward, we encourage the research community to continue investigating novel fusion techniques, optimization methods, and network architectures to further enhance the performance of multimodal classification tasks. Developing interpretable models, addressing data imbalance and scarcity, and exploring unsupervised and semi-supervised learning approaches are other areas worth investigating. Additionally, we recommend that future research focus on the application of multimodal fusion in emerging areas such as genomics, proteomics, and patient-centered care, where the integration of diverse data types can potentially lead to significant improvements in diagnostic and therapeutic outcomes.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1. Overview and proportion of deep learning-based information fusion techniques for multimodal medical image classification presented in this paper.

Fig. 2 .
Fig. 2. Number of publications on medical multimodal image classification. Per-year statistics obtained using PubMed from 2016 to 2023.

Fig. 3 .
Fig. 3. Number of publications dealing with medical multimodal image classification on human organs, from 2016 to 2023. Tags: organ, number of publications, percentage.

Fig. 4 .
Fig. 4. (a) Unimodal classification task flow and different types of multimodal fusion based on the level at which they perform information fusion. (b) Information fusion networks for the three types of multimodal fusion, inputs to information fusion, and the implementation of information fusion.

Fig. 9 .
Fig. 9. Two types of fusion. Orange and green: data of different modalities. Blue: the output fused data.
images are used as inputs for the DL classifier. The process diagram for input fusion is shown in Fig. 11.
Fig. 17 illustrates this typical input fusion network architecture, used in the research of Aldoj et al. (2020); Lin et al. (2021); Zong et al. (2020); Zhou et al. (2023). Aldoj et al. (2020) proposed a semi-automatic method for the classification of prostate cancer without feature selection. Several combinations of 3D volumes (e.g., ADC, DWI, and T2) are utilized as inputs of the CNN

Fig. 17 .
Fig. 17. Schematic diagram of the network architecture for input fusion. Information fusion method: Concatenation (Inputs).
network. Each sequence is considered an input channel; the output is the classification of significant versus nonsignificant lesions. Lin et al. (2021) employed MRI and PET to diagnose Alzheimer's disease. PET and MRI are used as two channels of the input of the CNN classification network, based on an ROI crop model to learn a classifier and fuse different features from MRI and PET. Zong et al. (2020) concatenated T2, ADC and DWI for tumor foci classification using an end-to-end CNN network. In order to diagnose triple-negative breast cancer, Zhou et al. (

Fig. 19 .
Fig. 19. Schematic diagram of the network architecture for single-level network fusion. Information fusion method: Merge (Network).

Fig. 20 .
Fig. 20. Schematic diagram of the network architecture for hierarchical fusion. Information fusion method: Merge (Network) and Concatenation (Classic).
Zhou et al. (2021b) utilized three sparse-response Deep Belief Network (DBN) branches to extract features from PET/MRI modalities, fuse them, and then employed an Extreme Learning Machine (ELM) to classify the fused features for brain diseases. Zhang and Shi (2020) used a deep multi-modal fusion network (DMFNet) to fuse PET and MRI data for the diagnosis of Alzheimer's disease. Three branches are present in DMFNet, two of which extract features from the MRI and PET scans, respectively. A channel attention model is used to extract the features from each branch and merge the reweighted feature maps. In the third branch, the fused features are further extracted. Li et al. (2022c) used a three-branch CNN network to combine 2D fundus images with 3D OCT images in order to classify glaucoma and diabetic retinopathy. The fusion of 2D and 3D data features at the fusion points was achieved by changing the dimensionality of the data features using a transformation layer.
Fig. 21 illustrates another typical network architecture, utilized by Gao et al. (2021) and He et al. (2021). To classify brain diseases, Gao et al. (2021) proposed a pathwise transfer deep convolution network that gradually learns and combines the multi-level and multimodal features of MRI and PET. The pathwise transfer blocks are designed to fully utilize complementary information from the different imaging modalities; they communicate information across PET and MRI, which helps to improve the classification model's performance. He et al. (

Fig. 21 .
Fig. 21. Schematic diagram of another network architecture for hierarchical fusion. Information fusion method: Merge (Network) and Concatenation (Classic).
Zhang et al. (2022b) proposed a multimodal Medical Information Fusion (MMIF) framework that combines the Category Constrained-Parallel ViT framework (CCPViT) and the multimodal Representation Alignment Network (MRAN) as backbones, enabling the modeling of images and texts as unimodal features as well as cross-modal features. CCPViT is proposed as a tool for learning key features of the different modalities and for solving unaligned multimodal tasks. Then, in MRAN, cross-attention is used to cascade encoded images and decoded texts to explore deep-level interactive representations of the cross-modal data, assisting with modal alignment and identifying abnormalities. MMIF is an image-text foundation model that can yield much higher-precision classification than unimodal models. The Multimodal Mixing Transformer (3MT) was presented by Liu et al. (2023a) as a novel disease classification technique. They tested it for Alzheimer's disease classification based on neuroimaging data, gender, age, and the Mini-Mental State Examination (MMSE). Multimodal information is incorporated through a Cascaded Modality Transformers architecture with cross-attention. Different embedding layers are used to obtain the Key (K) and Value (V) from imaging features and clinical data; K and V are then fed into a cross-attention layer with a latent code known as the Query (Q). 3MT allows mixing an unlimited number of modalities and formats with full data utilization. Zuo et al. (2023) introduced a novel Swapping Bi-Attention Mechanism (SBM) for the diagnosis of Alzheimer's disease through the amalgamation of structural-functional brain images. The proposed model capitalizes on the Transformer's bi-attention mechanism to explore the mutually beneficial information inherent in both structural and functional images. Historically, Transformers had been investigated solely within the confines of single-modality brain regions, neglecting the potential of complementary information across modalities. SBM, however, implements token exchange between the two modalities and adaptive fusion of intermediate features while the Transformer extracts features from the diverse modalities. This process facilitates the collaborative exchange of information embedded in the bimodal images and enhances the transparency of feature alignment and fusion, resulting in a more lucid understanding of the influence of the different modalities on feature extraction and multimodal fusion.

Fig. 22 .
Fig. 22. Schematic diagram of the network architecture for output fusion. Information fusion method: Merge (Outputs).

Table 1 .
A list of multimodal image datasets. The list is sorted by the number of publications on PubMed (keywords: dataset name AND 'multimodal'). Details about the imaging modalities are given in Tab. 2.

Table 4 .
Some common architectures of deep neural networks. Different architectures are more suitable for different types of data.

Table 6 :
List of Terms.

Table 7 :
List of publications for different fusion networks.