Multi-objective optimization determines when, which and how to fuse deep networks: An application to predict COVID-19 outcomes

The COVID-19 pandemic has caused millions of cases and deaths, and the AI-related scientific community, after being involved in detecting COVID-19 signs in medical images, is now directing its efforts towards the development of methods that can predict the progression of the disease. This task is multimodal by its very nature and, recently, baseline results achieved on the publicly available AIforCOVID dataset have shown that chest X-ray scans and clinical information are useful to identify patients at risk of severe outcomes. While deep learning has shown superior performance in several medical fields, in most cases it considers unimodal data only. In this respect, when, which and how to fuse the different modalities is an open challenge in multimodal deep learning. To cope with these three questions, here we present a novel approach optimizing the setup of a multimodal end-to-end model. It exploits Pareto multi-objective optimization working with a performance metric and the diversity score of multiple candidate unimodal neural networks to be fused. We test our method on the AIforCOVID dataset, attaining state-of-the-art results, not only outperforming the baseline performance but also being robust to external validation. Moreover, exploiting XAI algorithms, we identify a hierarchy among the modalities and we extract the features' intra-modality importance, increasing the trust in the predictions made by the model.

selected from a large pool of many. Indeed, such a previous contribution offered a baseline fusion paradigm where the multimodal features are exploited by early fusion or by baseline joint learning working with two fixed network architectures for CXR images and clinical data. In particular, the proposed optimized multimodal fusion exploits an end-to-end framework relying on a combination of an evaluation score and a diversity score. Second, our framework goes hand-in-hand with XAI algorithms, which have not yet been investigated in this application, permitting the discovery of the modalities' hierarchy and of the features' intra-modality importance [9]. Third, this contribution not only performs experiments according to the procedure presented in [4], but it also works with 283 new patients now available within the AIforCOVID repository, which we used as an external validation set.
The manuscript is organized as follows. Section 2 introduces the state-of-the-art of both multimodal fusion methods and DL applications for COVID-19 stratification. Section 3 illustrates the dataset on which the experiments were performed. Section 4 describes the novel fusion method presented in this work and the XAI algorithms used for interpretation. Section 5 introduces the experimental configuration in the context of COVID-19 prognosis prediction, whilst the corresponding results and explanations are offered in Section 6. Finally, Section 7 provides concluding remarks.

Background
With the spread of the epidemic, there has been a growth of interest in employing novel AI methods to fight the pandemic [10]. The vast literature ranges from the use of traditional statistical models to the more complex DL networks [11]. As it is out of the scope of this manuscript to mention the entire literature, here we give an overview of the state-of-the-art of DL applications predicting the disease's outcome, focusing on those that used supervised MDL models and exploited a joint fusion between medical imaging and clinical data. Since the MDL method presented in this manuscript fits well with XAI algorithms, we also give an overview of this emerging field of study.

Multimodal Deep Learning
Multimodal data is composed of data originated by different sources observing the same phenomenon. The objective of MDL is to exploit the data in a complementary manner to satisfy the undertaken learning task [12]: instead of using manually designed or handcrafted modality-specific features, via DL we can automatically learn and extract an embedded representation for each modality. Furthermore, here we focus on supervised multimodal fusion, which joins information from different modalities for a classification or regression task. The literature agrees that understanding when, which and how modalities should be fused is the main challenge in MDL [1,5].
To answer the first question (when), Baltrušaitis et al. divide multimodal fusion into model-agnostic and model-based methods [1]. The former are feature-based or decision-based, also named early and late fusion, respectively. Early fusion consists in integrating the raw multimodal data into a single feature vector which is used as input for the DL model. In its simplest form, early fusion concatenates the multimodal features, which can lead to redundancy of information. Oppositely, late fusion uses multiple models trained on separate modalities, whose decisions are joined via an aggregation function or a meta-learner, aiming to improve on the performance of each unimodal model [13]. Because model-agnostic fusions can suppress intra- or inter-modality interactions, in recent years researchers have mainly focused on model-based methods that allow intermediate fusion, also referred to as joint fusion, to occur directly inside the models. DL models transform raw inputs into higher-level representations through the alternation of linear and non-linear operations. With an MDL model, we obtain these representations for each modality, and they are fused into a hidden layer creating a joint multimodal representation. Since DL allows an end-to-end training of all the multimodal representation components and of the fusion component, the training phase essentially consists in learning hierarchical representations from raw data, which give rise to an optimized intermediate-level fusion representation [14]. This is why joint fusion can be considered an evolution of early fusion. Furthermore, inside a joint model, to understand when the unimodal architectures should be joined and to find the fusion structure, the majority of the available work follows a meticulous handcrafted approach. For instance, Neverova et al. adopted a single fusion layer from which they implement a gradual fusion strategy to automatically comprehend when to fuse the modalities [15].
Let us now focus on the second challenge, which aims to understand which modalities should be fused and with which models. With the increasing number of high-performing deep architectures, selecting the one best suited for each modality is a hard process. The choice of which modality to fuse is usually based on intuition and on a trial-and-error approach that usually first fuses similar modalities and then tries to integrate disparate ones [1]. Furthermore, prior investigations on the data, as well as feature selection techniques, should allow us to find redundant information across the modalities [16]. Shifting the attention to the choice of which neural networks should be fused, the analysis of the literature shows that the deep architectures are usually chosen by analyzing the single modalities [4,6], and no optimal multimodal architecture selection is undertaken.
Focusing on the third question, the literature offers various techniques studying how to obtain an embedded representation that helps the unimodal models in the desired tasks. In joint fusion, the state-of-the-art adopts two main methodologies [17,18]. The first is a simple but effective approach that concatenates the vectors extracted from a given layer of each unimodal model [17]. The next hidden layer combines the different modalities by computing

z = f(∑_{i=1}^{m} w_i v_i)    (1)

where v_i and w_i are the encoding and the weights of the model of the i-th modality, f is the activation function and m stands for the number of modalities involved. The second approach consists of a multiplicative method computing the outer product ⊗ (or Kronecker product) of all the m modality-specific feature vectors, extracted from a given hidden layer, exploiting the intra- and inter-modality relations [18]. The layer output is:

z = ⊗_{i=1}^{m} [v_i; 1]    (2)

where a 1 is appended to each encoding v_i so that all unimodal and multimodal combinations of interactions are represented. A schematic visualization of both joint fusion methods is shown in Figure 1. Once the fusion occurs, the multimodal encodings are then fed to a fully connected (FC) classifier. Many studies, such as [19,20,21], dealt with determining an automatic methodology on how and when to fuse modalities, but none has focused on finding an optimized way of choosing which model architectures should be fused for each modality. Therefore, we hereby propose a strategy to find the best combination of models of the different modalities.

Figure 1: a) concatenation method [17]; b) multiplicative method [18]. The two methods train one model a_i on each modality D_i and fuse the extracted encodings v_i, which are then fed to a FC classifier.
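As a concrete illustration of the two joint fusion operations described above, the NumPy sketch below implements both on toy encodings; the function names, vector sizes and weight matrices are illustrative and not taken from the paper.

```python
import numpy as np

def concat_fusion(encodings, weights, f=np.tanh):
    # Concatenation-style fusion: z = f(sum_i W_i v_i), one weight
    # matrix per modality and a shared activation f
    return f(sum(W @ v for W, v in zip(weights, encodings)))

def multiplicative_fusion(encodings):
    # Kronecker product of [v_i; 1] over the m modalities, so every
    # unimodal and cross-modal interaction term appears in the output
    z = np.array([1.0])
    for v in encodings:
        z = np.kron(z, np.append(v, 1.0))
    return z

v1, v2 = np.array([0.2, 0.5]), np.array([0.1, 0.9, 0.4])
W1, W2 = np.ones((4, 2)), np.ones((4, 3))
print(concat_fusion([v1, v2], [W1, W2]).shape)   # (4,)
print(multiplicative_fusion([v1, v2]).shape)     # (2+1)*(3+1) -> (12,)
```

Note how the multiplicative output grows as the product of the (augmented) encoding sizes, which is why it captures all interaction terms at the price of dimensionality.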

DL and COVID-19
The application of AI to COVID-19 data analysis has been directed towards medical imaging and clinical data, addressing two main issues: i) the detection of patients suffering from COVID-19 pneumonia among those who are healthy or affected by different types of pneumonia and, ii) the prediction of the outcome.
In the former case, a large number of publications have been produced, as reviewed in [22]. Many of them leverage the open-source CXR COVID-19 data collection [23], or specific clinical datasets. Moreover, there are also papers that use CT scans, as reviewed in [24], or that use other imaging modalities [22]. Instead, in the prediction scenario, few works have investigated the stratification of the disease to distinguish between patients with mild and severe outcomes. In this respect, the rest of this Subsection overviews such papers since outcome prediction is the focus of our work.
Considering the publications which work with CXR scans only, Signoroni et al. designed an end-to-end convolutional neural network (CNN) architecture for predicting a multi-regional score conveying the degree of lung compromise in COVID-19 patients [25].
Focusing on the publications using clinical data only, Zhu et al. presented a DL algorithm, whose architecture is composed of 6 FC dense layers with ReLU as activation function, which identifies the top clinical variable predictors and derives a risk stratification score system to help clinicians triage COVID-19 patients [26]. Their model was trained on a private dataset with 181 patients and 56 clinical variables, obtaining an AUC equal to 0.968. Similarly, Al-Najjar and Al-Rousan built a neural network with one hidden layer to predict the status of recovered and deceased COVID-19 patients in South Korea [27]. They used seven different clinical variables and obtained an accuracy equal to 0.966 on a private dataset composed of 1308 patients.
As interpreting medical data is a multimodal process by its very nature, we now present the four works that use both clinical data and medical images, noticing that three of them use CT images [28,29,30] and only one makes use of CXR scans [4]. Fang et al. introduced a DL-based method using 53 clinical features and quantitative CT data to detect mild patients with a potential malignant progression [28]. The authors use a multi-layer perceptron (MLP) on the clinical features, creating an embedding that is concatenated with the flattened CT data and then fed to a long short-term memory (LSTM) network and a FC network, achieving an accuracy of 0.792 on a cohort of 199 patients belonging to a private dataset. Next, Ning et al. used both CT images and 130 clinical features to discriminate between negative, mild and severe COVID-19 cases via a DL framework based on the VGG-16 and a 7-layer FC network [29]. This network was trained on a public dataset containing 1521 patients, and it obtained an accuracy equal to 0.811. Furthermore, Fang et al. developed an early-warning system with DL techniques to predict COVID-19 malignant progression leveraging CT scans and the clinical data of the patients [30]. The authors use a 3D ResNet and an MLP to encode the chest CT scans and the 61 clinical features, respectively, whose features are concatenated and fed into an LSTM and several FC networks to make the prediction, obtaining an accuracy equal to 0.877 on a private dataset with 1040 patients.
Let us now turn the attention to the only work that, to the best of our knowledge, uses CXR images and clinical data for COVID-19 prognosis [4]. In this paper, the authors not only made publicly available a dataset of 1103 patients designed for the mild and severe stratification task, but they also presented three different multimodal AI approaches that can be used as baseline performance for future improvement. In the first learning approach, also referred to as the handcrafted approach (HC), the authors employed first-order and texture features computed from the images, which are mined together with the clinical data to feed a supervised learner, such as Support Vector Machines and Random Forests. The second approach is named hybrid (HYB) and it mixes automatically extracted features computed by a pre-trained CNN with the clinical data, which are then fed to a classifier as before. The last approach is an end-to-end approach (ETE) exploiting together the clinical data and the raw CXR by using a multi-input CNN. The best results are obtained by the HYB approach, and they are reported and discussed in Section 6.
This analysis of the literature shows that the development of AI-based models predicting the outcomes of COVID-19 patients still deserves further research efforts, since in the multimodal scenario only simple methods are employed, with no attention to understanding when, which and how the neural network architectures should be used to find the appropriate fusion. We decided to use [4] as our starting point, given the importance of their multimodal data, and we will show that our proposed MDL framework improves the performance available at the state-of-the-art.

XAI
The major disadvantage of neural network approaches is their lack of interpretability [31]. It is hard to understand what the predictions rely on, and which modalities and features hold an important role [32]. In general, an XAI algorithm is one that produces information making a model's functioning clear or easy to understand. This is why the field of XAI has grown in the last few years, with the development of many different methods suited for different tasks, as described in the overview offered by Holzinger et al. in [33]. The literature makes a clear distinction between models that are interpretable by design (transparent models) and those that can be explained by means of external XAI techniques (post-hoc explainability). Given the complex structure of DL algorithms, post-hoc model-specific XAI methods have been developed. Focusing our attention on tabular and image data, i.e., those used in this work, the literature offers algorithms presenting feature relevance explanations specific to each of these modalities taken alone. For example, according to [31], Deep Taylor decomposition [34] and DeepLIFT [35] are the most used XAI models specific for multi-layer neural networks working on tabular data, Equalizer [36] and Grad-CAM [37] are suited for CNNs working on image data, whereas Integrated Gradients can be used with both network typologies [38].
The literature is well advanced in models that work on a single modality [31], but it lacks research for MDL. In this respect, the multimodal method we present brings along the possibility to apply and combine any unimodal XAI algorithms in a way to understand the contribution that each modality, and each piece of data within a modality, has in the prediction.

Materials
We performed the experiments on the AIforCOVID imaging archive [4], since it was the only publicly available multimodal dataset on COVID-19 stratification at the time this research was carried out. It includes 820 CXR scans and clinical data collected from six different hospitals at the time of hospitalization of symptomatic patients positive to the SARS-CoV-2 infection at the RT-PCR test. The 34 available clinical parameters are listed in Table A.1, and for further details the interested readers can refer to [4].
On the basis of the clinical outcome, each patient was assigned to the mild or severe group, where the mild group consists of the patients who were sent back to domiciliary isolation or were hospitalized without any ventilatory support, whilst the severe group is composed of patients who required non-invasive ventilation support or intensive care, or who deceased.
Recently, a second release of the dataset was made available; it is composed of 283 additional patients, whose data were collected in two new centers. Here, we use these data as an external validation set so as to have further proof that the proposed framework is robust on never-seen data.

Multimodal Learning
To predict the severity outcome of COVID-19 patients exploiting both the pixel information of their CXR scans and their clinical information, we propose a novel supervised end-to-end joint fusion method, which maximizes the advantages brought by all the given modalities. It first looks for the best combination of models, which are then joined and trained to carry out the given classification task. This is visually summarized in Figure 2, which is organized in four main blocks that are now detailed, one per paragraph.

Notation. Let us first introduce the application matrix Θ ∈ Mat_{m×n}, where each element is defined as follows:

θ_ij = 1 if the j-th model processes data of the i-th modality, 0 otherwise    (3)

Hence, θ_ij represents the use of the j-th model for the i-th modality, with i ∈ {1, …, m} and j ∈ {1, …, n}. Furthermore, let us denote with a_j any neural network architecture we would like to use; on this basis, when θ_ij = 1, a_j^i denotes that the architecture a_j is trained on data from the i-th modality. The set of networks that will be combined is denoted by

Γ_I = {a_j^i : (i, j) ∈ I}    (4)

where I contains a subset of all possible pairs of models and modalities:

I ⊆ {(i, j) : θ_ij = 1}    (5)

It is worth noting that the framework now introduced goes beyond the state-of-the-art of multimodal joint networks, which usually use only one model for each modality [6]. On the contrary, we consider the chance of having several models per modality, a situation that researchers and practitioners often meet in practice. In these cases, they usually first heuristically determine the best model per modality in a pool of many, i.e., a particular Γ_I, and they combine such "best" models in a multimodal learner [31]. However, this approach neglects any possible interaction between models and modalities, so that the combination of the best unimodal models does not necessarily reflect the best multimodal architecture.
Furthermore, the general assumption that a multimodal model should be composed of a single model per modality is also an oversimplification.
Optimization. To go beyond the naive use of Γ_I, we deem it more useful to work with all the modalities and all the models together. The first step is to detect which subset of models should be combined to get the best multimodal architecture providing the best multimodal data representation. This subset is determined by solving a multi-objective maximization problem that leverages two scores computed on a validation set for each Γ_I [39,40,41]. The first score is any measure derived from the confusion matrix, shortly referred to as E(Γ_I), whilst the second is a measure of the diversity between the unimodal learning models, denoted as D(Γ_I). To find the optimal combination of models from Γ_I exploiting at best all the modalities, denoted by Γ*, the algorithm works as follows. Using a cross-validation approach, we first apply each model a_j^i, trained on the corresponding unimodal dataset D_i, to a validation set. Then, we compute the average of both E(Γ_I) and D(Γ_I) among the cross-validation folds for all the possible Γ_I, finally finding the combination Γ* which maximizes both metrics. This means that we look for the Γ_I that, on the one side, returns the best classification performance and, on the other side, reduces the incidence and effect of coincident errors among its members, thus considering possible relationships between models and modalities.
It is worth noting that such a process corresponds to solving a multi-objective optimization problem P = (F, C), where F = {f_1, …, f_a} is the set of objective functions to be minimized and C is the set of constraints. A point x satisfying all the constraints in C is said to be an admissible point for P. Given two admissible points x_1 and x_2 for P, x_1 dominates x_2 if f_k(x_1) ≤ f_k(x_2) for every k = 1, …, a, with strict inequality for at least one k. An admissible point x* is a Pareto optimum for P if it is not dominated by any other admissible point x for P, i.e., solving P means minimizing all the functions in F under the constraints C. Based on [42], the point x* is a Pareto optimum if there exist multipliers λ ≥ 0, not all zero, and µ ≥ 0 satisfying the following system:

∑_{k=1}^{a} λ_k ∇f_k(x*) + ∑_j µ_j ∇c_j(x*) = 0,   µ_j c_j(x*) = 0 for every j    (6)

In our framework, Γ* is the Pareto optimum maximizing both scores, related to the confusion matrix and to the diversity, respectively. This implies that E(Γ_I) and D(Γ_I) induce the two objective functions (a = 2). Observing that both E(Γ_I) and D(Γ_I) range in [0, 1], and under the hypothesis that E(Γ_I) = 1 in case of perfect outcome prediction, maximizing them is equivalent to writing the following minimization function:

min_{Γ_I} (1 − E(Γ_I), 1 − D(Γ_I))    (8)

Following the Kuhn-Tucker theorem [43], solving the minimization of Equation 8 is equivalent to solving the system:

(µ*_1 − λ*_1) Ė(Γ*) + (µ*_2 − λ*_2) Ḋ(Γ*) = 0,   µ*_1 (E(Γ*) − 1) = 0,   µ*_2 (D(Γ*) − 1) = 0    (9)

where Ė(Γ_I) and Ḋ(Γ_I) denote the first derivatives of E(Γ_I) and D(Γ_I). Thus, Γ* is a Pareto optimum for our optimization problem when the quintuple (λ*_1, λ*_2, µ*_1, µ*_2, Γ*) satisfies the system with λ*_1, λ*_2, µ*_1, µ*_2 ≥ 0 and (λ*_1, λ*_2) ≠ (0, 0). Readers interested in understanding how to solve the system in Equation 9 can refer to [44].
In practice, with this optimization approach we want to find the combination of models that, when fused together, maximizes both an evaluation and a diversity metric, with the additional property that, if a modality is useless for the given classification task, all the networks working on that modality may be discarded.
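The selection step can be sketched as a search for non-dominated candidates: given an evaluation score E and a diversity score D for each candidate combination Γ_I on the validation set, keep those that no other candidate beats on both scores. The combination names and score values below are hypothetical.

```python
def pareto_front(candidates):
    """Return the names of the non-dominated candidates, where each
    candidate is a (name, E, D) triple: a candidate is dominated when
    another one is at least as good on both scores and strictly better
    on at least one."""
    front = []
    for name, e, d in candidates:
        dominated = any(e2 >= e and d2 >= d and (e2 > e or d2 > d)
                        for _, e2, d2 in candidates)
        if not dominated:
            front.append(name)
    return front

# hypothetical combinations Gamma_I with (accuracy, diversity) scores
cands = [("G1", 0.78, 0.40), ("G2", 0.75, 0.55), ("G3", 0.74, 0.50)]
print(pareto_front(cands))  # G3 is dominated by G2
```

In this toy example G1 wins on accuracy and G2 on diversity, so both sit on the Pareto front, while G3 is dropped; a final tie-breaking rule (or the joint training itself) picks Γ* among the front members.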
Joint-Late Fusion. So far we have presented an approach to find the optimal multimodal combination of models, i.e., which modalities and related models should be included in the multimodal architecture, but we have not yet defined how and when to actually join the models a_j^i in Γ_I. To simplify the notation, we now omit the superscript i, denoting the learners included in Γ_I as a_j, with j = 1, …, |Γ_I|. Each of these models provides a classification vector y_j = [y_j1, …, y_jp, …, y_jc]^T, where c is the number of classes and p is the generic p-th output neuron. Each entry y_jp is defined using the parameterized softmax function:

y_jp = exp(k z_p^(j)) / ∑_{t=1}^{c} exp(k z_t^(j))    (10)

where k > 0, and z_p^(j) denotes the p-th value of the output layer neuron internal activity for the network a_j; a similar definition holds for z_t^(j). Note that for k = 1 each element represents an estimate of the classification posterior probability for each class of each model.
On these premises, to join the models we define a soft shared representation given by [y_1^T, …, y_|Γ_I|^T]^T, which corresponds to concatenating the classification vectors, an approach similar to concatenating the hidden layers as in [17]. Hence, this shared representation vector is composed of c · |Γ_I| elements. Furthermore, if k → ∞, y_j becomes a binary vector with only one element equal to one, allowing us to get a crisp shared representation containing the binary outputs, which provides a direct interpretation of the embedded layer.
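A minimal NumPy sketch of the parameterized softmax and of the resulting shared representation; the real sub-networks are deep models, while here the raw output activities z are given directly as toy values.

```python
import numpy as np

def param_softmax(z, k=1.0):
    # y_p = exp(k*z_p) / sum_t exp(k*z_t); k = 1 gives posterior estimates,
    # a large k approaches a one-hot (crisp) vector
    e = np.exp(k * (z - z.max()))  # shift by max for numerical stability
    return e / e.sum()

def shared_representation(outputs, k=1.0):
    # concatenate the |Gamma_I| classification vectors: c * |Gamma_I| values
    return np.concatenate([param_softmax(z, k) for z in outputs])

z1, z2 = np.array([2.0, -1.0]), np.array([0.3, 0.7])
print(shared_representation([z1, z2], k=1.0))   # soft representation
print(shared_representation([z1, z2], k=50.0))  # nearly crisp: ~[1, 0, 0, 1]
```

Raising k sharpens each vector towards its argmax, which is how the crisp variant emerges from the same formula.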
Classification Layer. This shared representation, being either soft or crisp, is then passed to a FC neural network, with c neurons in the last layer, that performs the intermediate fusion. Note that here we exploit the advantages of both joint and late fusion, since the classifications of the sub-networks are aggregated in an end-to-end manner using back-propagation during the training process, which minimizes the overall loss function. Indeed, this scheme avoids combining features belonging to different spaces, while it combines the deep neural networks into a common classification space; it is shortly referred to as joint-late fusion in the following.

XAI
The nature of the joint-late fusion approach makes interpretability simple and effective. First, the weights w connecting the embedded classification vector to the first layer of the FC network provide useful information, because they reveal the importance of each class for each model and, consequently, the importance of each modality. The higher the value of w_i, with i = 1, …, n, the higher the contribution to the combined classification coming from the i-th model. Furthermore, as more models per modality are allowed, we can also understand which models contribute more to the final classification of each class by looking at the weight proportions within each modality.
It is worth noting that any XAI algorithm, whether model-agnostic or model-specific, can be applied to our MDL architecture, because it is composed of multiple models which can be interpreted one by one. For instance, if one of the modalities uses tabular data and the chosen models are FC neural networks, any post-hoc model-specific XAI algorithm for multi-layer neural networks can be applied to the single models to understand the importance of each tabular variable [34,35]. Moreover, if another modality consists of images and the models used are CNNs, we can use a post-hoc model-specific XAI algorithm to deduce the contribution of each image pixel to the classification [36,37]. Going forward, there are also XAI algorithms that work regardless of the modality used: when applied to CNNs or FC networks, they reveal the contributions to the classification of each image pixel and of each tabular variable, respectively [38]. Since the model we propose is composed of multiple neural networks (both FC networks and CNNs, as presented in Section 5), we can apply all the before-mentioned XAI algorithms to the corresponding networks and sub-networks.
Given the fact that in our framework we can have multiple models working on the same modality, we now introduce the concept of weighted XAI: it combines the outputs of the XAI algorithms specific to a modality with the weights coming from the embedded classification vector. Since these weights let us understand the contribution of each model, we combine the different XAI importances via a weighted sum to obtain a single interpretation of the modality. Hence, for each feature, the output e_w of the weighted XAI importance is:

e_w = ∑_{i=1}^{n} w_i e_i    (11)

where each contribution e_i extracted from the n unimodal models is weighted by the corresponding layer weight w_i and summed. For example, if we are working on the image modality and we apply Grad-CAM [37] to all the corresponding CNNs, we obtain a feature map for each model. By computing the dot product between the maps and the weight vector, we get a single feature map explaining which pixels contributed more or less to that specific classification. The same considerations hold for tabular data.
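A minimal sketch of the weighted XAI combination, assuming the per-model attribution maps (e.g., from Grad-CAM) and the layer weights are already available; the 2×2 maps below are toy values, not real attributions.

```python
import numpy as np

def weighted_xai(importances, weights):
    # e_w = sum_i w_i * e_i: fuse the per-model attribution maps of one
    # modality using the weights learned on the embedded classification vector
    maps = np.stack(importances)                 # shape (n, ...)
    w = np.asarray(weights, dtype=float)         # shape (n,)
    return np.tensordot(w, maps, axes=1)         # weighted sum over models

# hypothetical Grad-CAM maps of two CNNs on the same CXR (2x2 for brevity)
m1 = np.array([[1.0, 0.0], [0.0, 1.0]])
m2 = np.array([[0.0, 1.0], [1.0, 0.0]])
print(weighted_xai([m1, m2], [0.8, 0.2]))
```

The same call works unchanged for per-variable importance vectors of the tabular modality, since the weighted sum is agnostic to the attribution shape.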

Experimental Configuration
In this Section we introduce the different experiments we performed. Furthermore, to avoid any bias when comparing the baseline results of [4], which released the AIforCOVID dataset, with those achieved by the learning framework presented here, we applied the same pre-processing procedure and validation approach, shortly described in Subsections 5.1 and 5.5, respectively.

Data Pre-processing
We use the 34 clinical descriptors available with the AIforCOVID repository, which are not direct indicators of the prognosis and are listed in the appendix in Table A.1. Missing data were imputed using the mean and the mode for continuous and categorical variables, respectively. Finally, to have all the features in the same range, a min-max scaler was applied along the variables.
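A NumPy sketch of this imputation and scaling step; the helper name and the toy column layout are illustrative, not the authors' code.

```python
import numpy as np

def impute_and_scale(X, categorical_cols):
    # mean-impute continuous columns, mode-impute categorical ones,
    # then min-max scale every column to [0, 1]
    X = X.astype(float)
    for j in range(X.shape[1]):
        col = X[:, j]
        miss = np.isnan(col)
        if j in categorical_cols:
            vals, counts = np.unique(col[~miss], return_counts=True)
            fill = vals[np.argmax(counts)]   # mode of the observed values
        else:
            fill = col[~miss].mean()         # mean of the observed values
        col[miss] = fill
        lo, hi = col.min(), col.max()
        X[:, j] = (col - lo) / (hi - lo) if hi > lo else 0.0
    return X

# two variables (continuous, categorical) over three patients, with missing values
X = np.array([[1.0, 0.0], [np.nan, 1.0], [3.0, np.nan]])
print(impute_and_scale(X, categorical_cols={1}))
```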
In the case of CXR scans, we extracted the segmentation mask of the lungs using a U-Net [45] pre-trained on two non-COVID-19 datasets [46,47]. The mask was used to extract the minimum square bounding box containing both lungs, which is then resized to 224×224 and normalized with a min-max scaler bringing the pixel values between 0 and 1. As already mentioned, the interested readers can refer to [4] for further details.
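The cropping and normalization steps can be sketched as follows, assuming the lung mask has already been produced by the U-Net; the nearest-neighbour resize stands in for a proper image-library resize, and all names are illustrative.

```python
import numpy as np

def crop_and_normalize(img, mask, size=224):
    # minimum square box containing the lung mask, nearest-neighbour
    # resize to size x size, then min-max scaling of the pixels to [0, 1]
    ys, xs = np.nonzero(mask)
    y0, x0 = ys.min(), xs.min()
    side = max(ys.max() - y0, xs.max() - x0) + 1
    crop = img[y0:y0 + side, x0:x0 + side].astype(float)
    idx = (np.arange(size) * crop.shape[0] // size).astype(int)
    jdx = (np.arange(size) * crop.shape[1] // size).astype(int)
    out = crop[np.ix_(idx, jdx)]
    return (out - out.min()) / (out.max() - out.min() + 1e-8)

# toy 10x10 "scan" with a rectangular lung mask, resized to 4x4 for brevity
img = np.arange(100, dtype=float).reshape(10, 10)
mask = np.zeros((10, 10)); mask[2:8, 3:7] = 1
out = crop_and_normalize(img, mask, size=4)
print(out.shape)  # (4, 4)
```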


Classification
As introduced in Section 4, the first step toward obtaining the optimal network Γ* is to train and evaluate different models using the data of the two modalities, i.e., CXR scans and clinical data.
For the image modality, we worked with 30 different CNNs belonging to 8 different main architecture families. In all cases, the weights were initialized with the values pre-trained on the ImageNet dataset [56]; we also changed the output layer dimension to 2 neurons, one for each class.
In the case of clinical information, which is tabular data, we adopted 4 MLPs that differ in depth and width. We opted for such architectures since these feedforward networks are able to learn a low-dimensional representation before being fused with the other modality [57]. In particular, the models' hidden layers have the following organizations:
• MLP-1: it has 3 hidden layers with 64, 64, 32 neurons, respectively;
• MLP-2: it has 5 hidden layers with 64, …
A ReLU activation function is applied on all layers, since it learns several times faster than regular sigmoid activation functions [58], despite its non-differentiability at zero. Straightforwardly, the input and output layers consist of 34 and 2 neurons, respectively. Turning our attention to the matrix Θ, it is composed of 24 columns (n = 24 being the number of architectures) and two rows (m = 2 being the number of modalities).
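As a sketch of the MLP-1 layout with ReLU hidden activations; the weights are random and untrained, purely for shape illustration, while the real models are trained PyTorch networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, sizes=(34, 64, 64, 32, 2)):
    # MLP-1 layout: 34 inputs, hidden layers of 64, 64, 32 ReLU units,
    # and 2 output neurons (mild vs. severe)
    for i, (fan_in, fan_out) in enumerate(zip(sizes, sizes[1:])):
        W = rng.normal(0, np.sqrt(2 / fan_in), (fan_in, fan_out))  # He init
        x = x @ W
        if i < len(sizes) - 2:          # ReLU on all hidden layers only
            x = np.maximum(x, 0.0)
    return x

print(mlp_forward(rng.normal(size=(1, 34))).shape)  # (1, 2)
```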
Furthermore, as the two objective functions E(Γ_I) and D(Γ_I) we adopt the accuracy (Acc) and the correlation coefficient (ρ), respectively. Note that the accuracy is computed on the models belonging to each Γ_I combined in late fusion via the majority voting aggregation function. We opted for late-fusion optimization to avoid training all the ∑_{h=2}^{n} C(n, h) possible end-to-end ensemble models, where C(n, h) denotes the number of combinations of h models out of n. In this way, we only train the n single models, apply the optimization, and conclude with one final embedded training of Γ*.
Majority voting consists of finding the most common classification c among all the classifications c_i produced by the |Γ_I| unimodal trained models:

c = Mo(c_1, …, c_|Γ_I|)    (12)

where Mo indicates the mode of a set and |Γ_I| is the number of models in Γ_I. The correlation coefficient ρ between two classifiers is instead defined as:

ρ_ij = (n_11 n_00 − n_01 n_10) / √((n_11 + n_10)(n_01 + n_00)(n_11 + n_01)(n_10 + n_00))    (13)

where n_00 is the number of wrong classifications made by both models, n_11 is the number of correct classifications made by both models, and n_10 and n_01 are the numbers of instances on which the classifications of the two models differ. Since ρ is a pairwise measure, the overall diversity in Γ_I is given by averaging over all pairs of models:

D(Γ_I) = 1 − 2 / (|Γ_I|(|Γ_I| − 1)) ∑_{i<j} ρ_ij    (14)

We opted for these evaluation and diversity metrics because they focus on both positively and negatively classified observations, and on correctly and incorrectly classified instances, respectively.

Going beyond the single models, after determining Γ*, its models are joined via the joint-late fusion method introduced in Section 4. As mentioned there, the classification space can be built adopting either a soft or a crisp shared representation. This permits us to investigate two learners, denoted as JLF-S and JLF-C, respectively. We also investigate what happens by varying the final FC layer lying after the shared representation. To this end, we tested two different FC layers: the first does not have any hidden layer but directly connects the joint vector to the output classification layer using a single perceptron, and it is denoted with the ending 1. The second consists of a single hidden layer of 4 neurons connected to the output classification layer, and it is denoted with the ending 2. The former permits us to investigate linear combinations among the classifications of the networks, whilst the latter investigates non-linear combinations. In summary, we investigate four set-ups of the joint-late approach we present, i.e., JLF-S-1, JLF-S-2, JLF-C-1 and JLF-C-2.
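Both the aggregation and the pairwise diversity measure are easy to sketch; the helper names and the toy predictions below are illustrative.

```python
import numpy as np

def majority_vote(preds):
    # mode of the |Gamma_I| unimodal predictions, computed per sample
    preds = np.asarray(preds)
    return np.array([np.bincount(col).argmax() for col in preds.T])

def rho(a, b, y):
    # correlation between two classifiers from their joint right/wrong counts
    ca, cb = (a == y), (b == y)
    n11 = np.sum(ca & cb)    # both correct
    n00 = np.sum(~ca & ~cb)  # both wrong
    n10 = np.sum(ca & ~cb)   # only the first correct
    n01 = np.sum(~ca & cb)   # only the second correct
    den = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n01 * n10) / den if den else 0.0

preds = [[0, 1, 1, 1],   # model 1's labels on four samples
         [0, 0, 0, 1],   # model 2
         [0, 1, 0, 1]]   # model 3
print(majority_vote(preds))  # per-sample mode
```

A low (or negative) ρ between two members means their errors rarely coincide, which is exactly the behaviour the diversity term rewards.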

Training Configuration
For both the single-model and the multimodal scenarios, we adopt the training procedure described in [4], so that any variation in performance is due only to the enhancements brought by the method and not to any other factor. To prevent overfitting of the CNNs, we randomly applied the following image transformations, each with a probability equal to 0.3: horizontal or vertical shift (−20 ≤ pixels ≤ 20), random zoom (0.9 ≤ factor ≤ 1.1), vertical flip, random rotation (−15° ≤ angle ≤ 15°), and elastic transform (20 ≤ α ≤ 40, σ = 7). Both the CNNs and the MLPs were trained using the cross-entropy loss, minimized by the Adam optimizer with an initial learning rate of 0.001, which is scheduled to be reduced by an order of magnitude every time the minimum validation loss does not improve for 10 consecutive epochs. To prevent overtraining and overfitting we fix the maximum number of epochs to 300, with an early stopping of 25 epochs on the validation loss. We do not perform any preliminary optimization of the networks' hyperparameters because Arcuri and Fraser empirically reported that fine-tuning the learner parameters provides performance that does not statistically differ from that attained using the default values [59], a claim also confirmed by the experiments presented in [4] on the AIforCOVID dataset. Furthermore, we do not pre-train the networks on another CXR dataset: indeed, although this would help the models learn modality-specific feature representations [60], the results available on the AIforCOVID dataset show that this practice does not introduce any significant performance improvement [4].
All the experiments were run with a batch size of 32 on an NVIDIA TESLA V100 GPU with 16 GB of memory, using PyTorch as the main DL library.
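As a minimal, framework-agnostic sketch of the schedule just described (the class name and structure are ours, not the authors'), the learning-rate decay and early-stopping logic can be written as:

```python
class EarlyStoppingLR:
    """Mirror of the paper's schedule: reduce lr by an order of magnitude
    after 10 stagnant epochs, stop after 25 stagnant epochs, cap at 300."""

    def __init__(self, lr=1e-3, lr_patience=10, stop_patience=25, max_epochs=300):
        self.lr = lr
        self.best = float("inf")
        self.stagnant = 0
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.max_epochs = max_epochs
        self.epoch = 0

    def step(self, val_loss):
        """Call once per epoch; returns True while training should continue."""
        self.epoch += 1
        if val_loss < self.best:
            self.best = val_loss
            self.stagnant = 0
        else:
            self.stagnant += 1
            if self.stagnant % self.lr_patience == 0:
                self.lr *= 0.1  # reduce by one order of magnitude
        return self.stagnant < self.stop_patience and self.epoch < self.max_epochs
```

In PyTorch the same behavior is usually obtained with `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=10)` plus a manual early-stopping counter.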

Competitors
To verify that the end-to-end process actually brings a performance enhancement, and to check that the joint network is not simply learning a late fusion via FC layers applied to the classifications of the different networks, we froze the joint networks up to the joint classification vector and trained only the last FC classifier, obtaining a late fusion method. We would expect that, if there is an actual benefit in the joint end-to-end fusion, these performances would be lower. We denote these experiments as LF-S-1, LF-S-2, LF-C-1 and LF-C-2, where the S and C encodings refer to the soft and the crisp embedded classification vectors, and 1 and 2 refer to the type of FC layers employed, as before.
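As an illustration of this freezing strategy, the sketch below builds a hypothetical joint network whose backbones are frozen up to the joint classification vector, leaving only the final FC classifier (the "-1" variant) trainable; the tiny linear layers stand in for the real CNNs and MLPs, whose architectures are not reproduced here:

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, backbones, n_classes=2):
        super().__init__()
        self.backbones = nn.ModuleList(backbones)
        for p in self.backbones.parameters():
            p.requires_grad = False  # freeze everything up to the joint vector
        # "-1" set-up: a single linear layer from the joint vector to the output
        self.fc = nn.Linear(len(backbones) * n_classes, n_classes)

    def forward(self, *inputs):
        # concatenate each backbone's classification vector, then classify
        joint = torch.cat([b(x) for b, x in zip(self.backbones, inputs)], dim=1)
        return self.fc(joint)

# Tiny linear "backbones" standing in for the real unimodal networks.
model = LateFusionHead([nn.Linear(8, 2), nn.Linear(4, 2)])
trainable = [p for p in model.parameters() if p.requires_grad]
```

Only `model.fc` receives gradient updates, so training this head is exactly a late fusion learned on top of fixed unimodal outputs.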
We also compare the results of the joint-late fusion with those attained by the majority-voting late fusion applied to the models included in Γ*, denoted as LF-MV. Since majority voting is the aggregation function used in the optimization phase, this comparison permits us not only to investigate whether the joint-late training is beneficial, but also to compare our proposal against a late fusion technique, addressing the question of when the models should be fused. Furthermore, we compare our method with an early fusion method, denoted as EF, which uses a Support Vector Machine for the classification, receiving as input the concatenation of the features extracted from the last hidden layers of the networks in Γ*. We also tested the concatenation (denoted as JF-C) and the multiplication (denoted as JF-M) methods introduced in Section 2.1: the comparison of these approaches against the JLF solutions lets us investigate how the single models should be aggregated and at which level the fusion should occur. For the sake of completeness, note that the JLF approach intrinsically studies which learners should be fused and, in the next Section 6, we also compare the results against the use of a single network.
To validate the usefulness of both the evaluation and the diversity metric in finding Γ*, we adjust the minimization function of Equation 8 by considering only one of the two metrics at a time. Taking the best performing JLF model, we obtain JLF_E and JLF_D, which denote Γ* optimized only on E and only on D, respectively.
Finally, as natural competitors, we also compare our results with those presented in the work introducing the AIforCOVID dataset [4], i.e., the HC, HYB and ETE methods summarized in Section 2.
Note that for all the experiments where joint training is performed (i.e., JLF-C-1, JLF-C-2, JLF-S-1, JLF-S-2, JF-C, JF-M, JLF_E, JLF_D), the weights are initialized with those of the individually trained networks.

Validation Approach
The experiments were run in 10-fold stratified cross-validation (CV) and leave-one-center-out cross-validation (LOCO), following the same experimental procedure described in [4], thus ensuring a fair comparison between the approaches. In each CV fold the training, validation and test sets contain 70%, 20% and 10% of the samples, respectively.
With the LOCO validation we study how the models generalize to different data sources: indeed, in each fold the test set contains all the samples belonging to one center only, while the instances of the other six centers are assigned to the training and validation sets.
With reference to [4], here we add a further layer of external validation (EV), run on an external dataset. To this goal, we exploit the second release of the AIforCOVID dataset, which contains data collected from 283 patients belonging to two other centers, as already mentioned in Section 3.

Performance metrics and statistical assessments
The accuracy, the sensitivity and the specificity are the evaluation metrics used to assess the performance, as in [4].
Furthermore, we apply a one-way ANOVA among the different groups of models and, to assess statistical significance, we use the pairwise Tukey test with a Bonferroni p-value correction at α = 0.05. Henceforth, we refer to a statistically significant difference whenever this test is satisfied.
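A sketch of this statistical pipeline, with two caveats: the per-fold accuracies below are invented, and we substitute Bonferroni-corrected pairwise t-tests for the paper's Tukey test, so only the overall logic (an omnibus ANOVA followed by corrected pairwise comparisons) matches the text:

```python
from itertools import combinations
from scipy.stats import f_oneway, ttest_ind

# Hypothetical per-fold accuracies for three of the compared methods.
groups = {
    "JLF-C-1": [0.78, 0.77, 0.79, 0.78, 0.76],
    "LF-MV":   [0.70, 0.71, 0.69, 0.70, 0.72],
    "EF":      [0.74, 0.73, 0.75, 0.74, 0.72],
}

F, p = f_oneway(*groups.values())          # omnibus one-way ANOVA
pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)                  # Bonferroni-corrected threshold
significant = {
    pair: ttest_ind(groups[pair[0]], groups[pair[1]]).pvalue < alpha
    for pair in pairs
}
```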

Results and Discussions
In this section we first show the results of the aforementioned experiments for the classification task; then we verify whether the model interprets the data correctly, via the XAI methods introduced in Section 4.

Classification Task
Before presenting the numeric results, let us report that the Pareto optimum Γ* is composed of three CNNs and one MLP, namely GoogLeNet, VGG13-BN, ResNeXt50-32x4d, and MLP-2. It is worth noting that the selected models belong to different families, suggesting that each extracts different information to satisfy the desired classification. Moreover, since at least one model per modality is selected, we understand that the two modalities provide useful and distinct information for the prognosis task. Furthermore, we also ran the multi-objective optimization fusing the learners in each Γ_I by end-to-end training rather than by late fusion, to investigate whether any differences exist². The results reveal that both provide the same Γ*, confirming that performing the optimization in late fusion rather than in an end-to-end fashion is a viable alternative that strongly reduces the computation time, since the former does not require any additional training.

Table 1: Performance (Acc, TPR, TNR) of all the learners. The models with the highest accuracy are highlighted for each validation scenario (CV, LOCO, EV).
The quantitative results achieved by all the learners presented in Section 5 are shown in Table 1, which reports the average accuracy (Acc), sensitivity (TPR) and specificity (TNR), with the respective standard deviations, for the CV, LOCO and EV scenarios. The results reveal that the four joint-late methods, i.e., the first four rows in Table 1, outperform the other techniques, and the performance differences against the other multimodal approaches are statistically significant in terms of accuracy, sensitivity and specificity.

Table 2: Performance (Acc, TPR, TNR) for each validation scenario (CV, LOCO, EV) of the single-modality (CLI, IMG) models part of Γ* in JLF.

In particular, the significant differences between the JLF and the LF-MV, LF-S and LF-C methods demonstrate that the end-to-end training of the method is effective and provides better results than the late fusion of the networks in the ensemble. As mentioned in Section 5.4, this result shows when the models in the ensemble should be fused.
Let us now focus on the comparison between the four JLF methods and the other two approaches providing an embedded multimodal representation, namely the JF-C and JF-M competitors described in Section 2. We find that the JLF performance is statistically different in terms of accuracy, sensitivity and specificity with respect to that attained by the JF-C and JF-M techniques. This finding points out how the individual learners should be combined to provide a useful embedded multimodal representation.
We also found that the performance of the JLF models is statistically different from that presented in [4] (i.e., HC, HYB and ETE), which constitutes the current baseline for the AIforCOVID dataset.
Turning our attention to the comparison between the four set-ups of the JLF approach, the results reveal that no statistically significant differences exist among their performance, suggesting that there is no preferred technique among them, although JLF-C-1 provides the best accuracy in both the CV and LOCO experiments.
We are now interested in assessing to what extent our optimization algorithm effectively detects Γ* as the best choice of which learners should be combined among all the possible Γ_I. To this end, we randomly picked 100 different Γ_I that are joined using the JLF-C-1 technique, for the reason mentioned a few lines above. Next, we compared their performance against the four JLF results in Table 1, which are achieved using Γ*, and we found that they are statistically significantly lower.
We now focus on the unimodal scenario to investigate the performance of JLF when a modality is missing: to this end, Table 2 shows the results of the single models in Γ* when trained on a single modality. By performing the statistical assessments we find no significant change with respect to the unimodal results presented in [4], as could be expected. Interested readers can refer to Table B.2, which reports the results attained by all the unimodal CNNs and MLPs considered. Nevertheless, GoogLeNet, VGG13-BN and ResNeXt50 actually rank first, second and sixth among the CNNs, respectively, whilst MLP-2 ranks first among the MLPs. Comparing the four JLF learners with these unimodal networks, we find that the former always provide statistically higher performance than the latter, confirming the importance of the fusion between the modalities.
Let us now concentrate on a single modality. Since Γ* is composed of three CNNs working on the image modality, we perform an ablation test removing the network using clinical data (i.e., MLP-2). In this case, JLF-C-1 achieves an average accuracy equal to 76.09 ± 0.29, 75.68 ± 0.31 and 74.86 ± 0.27 in CV, LOCO and EV, respectively. Since these values are statistically lower than those of JLF, we deem that joining multiple networks is beneficial. Furthermore, note that the opposite ablation test, i.e., removing the networks using image data, corresponds to considering MLP-2 only, whose performance is already shown in Table 2.

As a further observation, let us now focus on the different performance achieved in 10-fold CV, LOCO and EV. To ease the comparison, Figure 3 shows the distribution of the performance of the different methods across all the experimental scenarios. Observing the results of the JLF methods, we notice that our method is robust and generalizable, since its performance drop in LOCO and EV is limited, while all the other methods suffer a larger decrease.
The main limitation of this method is its computational cost, given that all the unimodal models and the resulting joint model need to be trained; as the number of models and modalities increases, the complexity grows accordingly. One may argue that it would be more practical to spend time and resources optimizing a single model per modality, but Table 2 shows that some unimodal networks perform poorly, which would lead to a sub-optimal solution. The goal of our method is not to find the best performing single model but to find the optimal fusion from a pool of classifiers, making it very useful in real-world scenarios where it is hard to know a priori which models are best suited for a specific task. Furthermore, our method agrees with the no free lunch theorem [61], which states that the computational cost of finding a solution, averaged over all problems in a class, is the same for any solution method.
Explainability. To open the black box of the multimodal deep approach presented in this work, we apply the XAI algorithms introduced in Section 4 to the model with the highest performance, i.e., JLF-C-1.
To this end, we first extract the weights connecting the classification vector to the next hidden layer of the FC network; second, we compute the mean relative intensities of such weights attained in CV, obtaining the following distribution: 17%, 26%, 16%, and 41% for GoogLeNet, VGG13-BN, ResNeXt50, and MLP-2, respectively. This information not only reveals the importance of every single model for the final classification, but it also explains the inter- and intra-modality hierarchy. In particular, we notice that the image modality has more importance than the clinical one, since the relative vector weights are 59% and 41%, respectively. Even if there is a discrepancy between the number of models per modality, it is interesting to notice that MLP-2 has the highest importance among the single neural networks, but the combination of the three CNNs gives higher relative importance to the image modality. Going deeper, we can see that there is also an intra-modality hierarchy among the image-based models: by computing the relative weight importances of GoogLeNet, VGG13-BN, and ResNeXt50, we obtain 29%, 44%, and 27%, respectively. To enable physicians to explore and understand data-driven DL-based systems, we also work on XAI algorithms for the single models. For each model composing the joint fusion, we apply XAI algorithms suited to the specific modality. For example, Figure 4 shows the results of the integrated gradients algorithm [38] applied to MLP-2, displaying the importance of the clinical features for the classification. Observing the results on the test sets, we notice that the most important features are the difficulty in breathing and the oxygen percentage in blood. Indeed, the presence of difficulty in breathing and lower values of the oxygen percentage in blood are indications of severe COVID-19 cases, as the medical literature confirms [62].
Going forward, we can use the aforementioned relative weights of the classification vector for a specific modality to combine the results coming from an XAI algorithm. Again in Figure 4, we show the feature maps extracted from GoogLeNet, VGG13-BN and ResNeXt50 by applying Grad-CAM [37]; by combining them via Equation 11 we obtain a single resulting map for the modality. In this way we understand how the different components working on the image modality interpret the pixel values, both individually and as a whole. Many more XAI algorithms could be applied following this methodology, but to prevent redundancy we limit ourselves to these examples. We did not perform any evaluation of the XAI methods, since this is beyond the scope of the presented method.
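A plausible reading of this weighted combination of saliency maps can be sketched as follows; since Equation 11 is not reproduced in this section, the exact form, and in particular the final min-max rescaling, is our assumption:

```python
import numpy as np

# Relative intra-modality weights reported above for the three image networks
# (GoogLeNet 29%, VGG13-BN 44%, ResNeXt50 27%).
weights = np.array([0.29, 0.44, 0.27])

def combine_maps(maps, w):
    """Weighted combination of per-network saliency maps (e.g., Grad-CAM)
    into a single modality-level map, rescaled to [0, 1]."""
    stacked = np.stack(maps)                  # shape: (n_models, H, W)
    fused = np.tensordot(w, stacked, axes=1)  # weighted sum over the models
    span = fused.max() - fused.min()
    return (fused - fused.min()) / (span + 1e-8)
```

In practice each element of `maps` would be a Grad-CAM heatmap resized to the CXR resolution; the weights make the more influential networks dominate the fused explanation.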

Conclusions
In this manuscript we have presented an approach to build an optimized multimodal end-to-end model exploiting a performance metric and the diversity of different neural networks. It addresses open issues in MDL related to when, which and how the modalities and the related learners must be joined to provide an effective embedded representation and satisfactory performance. Given the impact of the COVID-19 pandemic, we applied our method to predict patients' prognosis, a topic that has attracted recent research after the large efforts towards the detection of COVID-19 signs in medical images [22]. In this scenario, the use of different modalities capturing different aspects of the disease progression should be useful and, for this reason, we used the AIforCOVID dataset [4], which is the only publicly available dataset containing clinical data and CXR scans. Our proposal algorithmically determines which deep architectures for each modality should be fused, by exploiting Pareto optimization. Furthermore, here we have investigated how and when the fusion of the modalities should occur: we find that concatenating the classification vectors of each learner, being either soft or crisp labels, is a viable solution. In terms of performance, our method attains state-of-the-art results, not only outperforming the baseline performance reported in [4] but also being robust to external validation. Moreover, our method fits well with many XAI algorithms, which allow us to figure out a hierarchy among the modalities and to extract the intra-modality importance of the different features by combining the results of various networks.
Our future work is directed towards five main goals. First, we plan to study a method to minimize the computational cost, so as to extend the number of possible architectures per modality in the case of end-to-end training. Second, since the presented method is not specific to medical data, further investigation on multimodal datasets from other fields is needed. Third, to demonstrate the generalizability of the proposed method, future work will also consider datasets composed of more modalities. It is worth noting that several optimization frameworks have proven very practical in healthcare, such as monarch butterfly optimization [63], the earthworm optimization algorithm [64], elephant herding optimization [65], moth search [66], the slime mould algorithm [67], hunger games search [68], the Runge Kutta optimizer [69], the colony predation algorithm [70], and Harris hawks optimization [71], as well as those based on data [72,73] and graph representations [74,75]. On this basis, our fourth direction will be to investigate and compare different optimization methodologies. Lastly, as medical data often come from different institutions, it would also be interesting to explore the deployment of our approach in a federated learning framework.