RadioPathomics: Multimodal Learning in Non-Small Cell Lung Cancer for Adaptive Radiotherapy

The current cancer treatment practice collects multimodal data, such as radiology images, histopathology slides, genomics and clinical data. The importance of these data sources taken individually has fostered the recent raise of radiomics and pathomics, i.e. the extraction of quantitative features from radiology and histopathology images routinely collected to predict clinical outcomes or to guide clinical decisions using artificial intelligence algorithms. Nevertheless, how to combine them into a single multimodal framework is still an open issue. In this work we therefore develop a multimodal late fusion approach that combines hand-crafted features computed from radiomics, pathomics and clinical data to predict radiation therapy treatment outcomes for non-small-cell lung cancer patients. Within this context, we investigate eight different late fusion rules (i.e. product, maximum, minimum, mean, decision template, Dempster-Shafer, majority voting, and confidence rule) and two patient-wise aggregation rules leveraging the richness of information given by computer tomography images and whole-slide scans. The experiments in leave-one-patient-out cross-validation on an in-house cohort of 33 patients show that the proposed multimodal paradigm with an AUC equal to $90.9\%$ outperforms each unimodal approach, suggesting that data integration can advance precision medicine. As a further contribution, we also compare the hand-crafted representations with features automatically computed by deep networks, and the late fusion paradigm with early fusion, another popular multimodal approach. In both cases, the experiments show that the proposed multimodal approach provides the best results.


Introduction
Nowadays, lung cancer is worldwide recognised as the most common type of cancer and one of the most frequent causes of tumour death despite the recent increase in the number of treatment options [1]. The current clinical decision-making process relies on multiple data sources to improve the detection and the classification as well as the prognosis of the tumours, such as radiology-based data (e.g. X-ray, CT scan, ultrasound, MRI and metabolic imaging), digital pathology slides, genome profiling and clinical data. Such a variety of modalities catch different clinical aspects of the cancer disease and can help clinicians to pursue the paradigm of precision medicine, i.e., tailoring the treatment to the specific patient. Indeed, the wide variety of complementary quantitative bio-markers extracted from the various modalities can lead to more accurate diagnosis and more efficient treatment plans.
In the last decades, the artificial intelligence (AI) community has directed large efforts towards the detection and the classification of tumours using one or more modalities. However, only in recent years we have assisted to a growing interest directed to the disease outcome prediction using different data modalities. In this respect, the emerging areas of research are: • Genomics: it is an interdisciplinary field of science that focuses on genomes, highlighting the role of human genetic variation in disease diagnosis, prognosis, and treatment response. However, genomics biomarkers still have limitations that hinder the possibility to collect such data in clinical routine due to their complexity and still high cost [2].
• Radiomics: it is based on the extraction of quantitative features from radiology images routinely collected in order to predict clinical outcomes or guide clinical decisions using AI algorithms [3,4].
• Pathomics: it refers to the combination of digital pathology, omic science and AI to extract embedded information in digitised high-resolution whole-slide images of tissue biopsy sections to obtain quantitative bio-markers [5].
Given the growing availability of public oncological datasets containing paired samples from different modalities, in the last few years researchers started to take into account the multimodal learning paradigm. Multimodal learning relies on the integration of heterogeneous data from multiple sources into a single machine learning framework. Although several works use genomics, radiomics or pathomics data alone, there are still few works that aim to fuse these modalities together [6][7][8][9]. In [6] the authors proposed a multimodal radiomics model for preoperative prediction of lymphatic vascular infiltration in rectal cancer fusing features extracted from multiple radiologybased data sources (MR and CT). Early results demonstrate performance improvement over the single modalities taken individually. In [7] the authors proposed a multimodal deep learning method for non-small cell lung cancer (NSCLC) survival analysis fusing CT images in combination with clinical data. The preliminary results show that the proposed multimodal model improves the analysis of survival in NSCLC patients compared to the current state-of-the-art which only works with clinical data. In [8] the authors proposed a deep multimodal fusion framework merging together histopathological images and genomics features to predict survival outcomes in patients with glioma or clear cell renal cell carcinoma. Preliminary results show that the proposed multimodal fusion paradigm leads to an improvement over the current state-of-the-art in predicting survival outcomes when using each modality independently. In [9] the authors proposed a deep model merging together radiology scans, molecular profiling, histopathology slides and clinical factors to predict the overall survival of glioma patients. Early results demonstrate that it significantly outperforms the best performing unimodal model.
The literature reported so far shows that few works have explored a multimodal approach, reporting performance improvement. Furthermore, despite the importance of radiomics and pathomics taken individually, to the best of our knowledge only one work to date has combined them together into a single machine learning framework [9]. Hence, we present here another investigation that combines radiomics, pathomics and clinical data together into a single multimodal late fusion scheme to predict radiation therapy treatment outcomes for NSCLC patients. In this work, we do not take into account genomics data, because in clinical practice, pending the results of ongoing studies, in patients with locally advanced NSCLC considered for chemoradiation treatment, knowledge of oncogene-dependent characteristics does not change the therapeutic strategy. The late fusion scheme permits us to combine uncorrelated data flows that vary significantly in terms of dimensionality and sampling rates, as in our case. To summarise, the AI-related novelties of this work are as follows: • We proposed a multimodal late fusion scheme taking into account features extracted from radiomics, pathomics and clinical data.
• The proposed approach shows as the simultaneous fusion of the three modalities leads to an improvement over the models fed with the stand-alone data flows.
• We compare the proposed multimodal late fusion approach with the early fusion scheme and show how this latter has lower performance.
• We also fed the multimodal learning scheme with deep features extracted from a pre-trained neural network fine-tuned with the modalities covered by this work. Although, early results show as the deep features lead to performance deterioration, also in this case the combination of the three modalities leads to an improvement of the performances.
The rest of this manuscript is organised as follows: section 2 presents a short overview of the multimodal learning framework and its applications to oncology. Section 3 introduces the materials, overviewing the multimodal data sources available. Section 4 presents the proposed multimodal learning framework, whilst section 5 and section 6 describe the experimental setup and the results respectively. Finally, section 7 provides concluding remarks.

Background
In this section, we first overview the various architectures in the multimodal learning framework, and then we summarise the current state-of-the-art on multimodalbased learning on oncology (section 2.2).

Multimodal learning
Multimodal learning involves the integration of heterogeneous data from multiple sources extracted from the observation of the same phenomena or problem. Hence, the use of multimodal data sources allows the extraction of a complementary, more robust and richer data representation, with the aim of improving performance compared to the use of a stand-alone modality. Although there is not any formal proof, this intuition has brought interesting results in many applications, medical imaging included [10].
Multimodal data integration can be performed at different levels using three types of fusion: early, joint, and late fusion [11,12], respectively (Figure 1), as now described.
Early fusion, also known as data-level fusion or representation learning, concerns the integration of raw inputs from multiple data source modalities into a single feature vector before passing it into a single machine learning model ( Figure 1, left panel). In the early fusion, raw input data can be merged into an embedded space according to different policies, such as simple concatenation, addition, pooling, or applying a gated unit [13]. Although the promising results achieved in several applications, how to manage one or more missing modalities, how to handle different sampling rates and/or the time-synchronicity between multiple data sources, and the possible redundancies occurring while generating very large embedded spaces are the main issues of this multimodal approach.
With respect to joint and late fusion, these two combination schemes work by aggregating different classification models. Joint fusion, also known as intermediatelevel fusion or hybrid fusion, concerns the combination of the extracted intermediate feature vectors from trained neural networks, one per modality, into an abstract fusion layer, also known as a shared representation layer (Figure 1, central panel). Then, this combined feature vector feeds a final classification model whose loss is backpropagated to the feature extracting neural networks during training. Since the loss is back-propagated during the training process, this fusion scheme improves the feature representation at each iteration leading to better multimodal embedded feature spaces. Although joint learning is a very flexible framework, its main issue concerns the design of the architecture in terms of how, when, and which modalities can be fused [12].
Let us now delve into late fusion approaches, as our proposed approach works at this level. In the early-to mid-2000s, late-or decision-level fusion has received considerable interest from the machine learning community due to its potential to improve the performance of stand-alone classifiers. Late fusion concerns the training of independent systems, one per modality, which are then combined by an aggregation function to reduce individual error rates ( Figure 1, right panel). This aggregation function takes as input the unimodal decision values provided by the different classifiers that are combined according to a fusion rule (e.g. minimum, maximum, mean, majority vote, etc.). There is a consensus that the key to the success of late fusion is that it builds a mixture of diverse classifiers [14], providing different and complementary points of view to the ensemble. Definitely, the late fusion approach is a well-suited multimodal strategy when input modalities are significantly uncorrelated and they vary significantly in terms of data dimensionality and sampling rates [12]. These are the major reasons that led us to explore this multimodal framework, which will be also experimentally compared against early fusion in section 6.

Multimodal oncology
Nowadays, the current clinical practice for cancer treatment requires collecting multimodal data for each patient, such as radiological images (e.g. X-ray, CT scan, ultrasound, MRI and metabolic imaging), histopathology slides, genomics and clinical data. Such a variety of modalities describes different clinical aspects of cancer disease and can provide a wide range of complementary bio-markers leading to more accurate diagnosis and more efficient treatment plans. Although there are several works in the current state-of-the-art dealing with the detection, classification and prognostic task taking the aforementioned single modalities individually [15][16][17][18], there are still few works that aim to fuse these modalities together. Hence, in recent years, researchers focused their efforts on the fusion of these modalities into a single machine learning framework [6][7][8][9].
In [6] the authors proposed a novel multimodal radiomics model for preoperative prediction of lymphatic vascular infiltration (LVI) in rectal cancer based on handcrafted features extracted from magnetic resonance (MR) and computed tomography (CT) modalities. The authors validated their method on a retrospective cohort of 94 patients with histologically confirmed rectal cancer. The early results show as the multimodal (MR/CT) radiomics models can serve as an effective visual prognostic tool for predicting LVI in rectal cancer. It demonstrated the great potential of preoperative prediction to improve treatment decisions over the stand-alone modalities.
In [7] the authors proposed a multimodal deep learning method for NSCLC survival analysis leveraging CT images in combination with clinical data. The authors validated their framework using data from The Cancer Imaging Archive (TCIA), which contains paired samples of CT scans and clinical data for 422 NSCLC patients [19]. The preliminary results show that there is a relationship between prognostic information and radiomics images. In addition, the proposed multimodal model improves the analysis of survival in NSCLC patients compared to the current state-of-the-art which only works with clinical data.
In [8] the authors proposed a deep multimodal fusion framework for the end-toend multimodal fusion of histopathological images and genomics features (mutations, CNV, mRNAseq) for survival outcome prediction. This work implements the Kronecker product to model pairwise feature interactions across modalities and controls the expressiveness of each modality through a gating-based attention mechanism. The authors validated their framework using glioma and clear cell renal cell carcinoma datasets from The Cancer Genome Atlas (TCGA), which contains paired samples of whole-slide images of hematoxylin-and-eosin-stained specimens, genotype, and transcriptome data for 769 patients [20]. Based on a 15-fold cross-validation, preliminary results show that the proposed multimodal fusion paradigm leads to an improvement over the current state-of-the-art in predicting survival outcomes when using each modality independently.
In [9] the authors proposed a deep model merging together radiology scans, molecular profiling, histopathology slides and clinical factors to predict the overall survival of glioma patients. The authors validated their framework by collecting data from the TCIA repository, which contains paired samples of whole-slide images of hematoxylinand-eosin-stained specimens, MRI scans, DNA sequencing data, and clinical variables for 176 glioma patients. Early results show that their model, with a median C-index of 0.788 ± 0.067, significantly outperforms the best performing unimodal model, which has a median C-index equal to 0.718 ± 0.064. Furthermore, the proposed model successfully stratifies patients into clinical subgroups based on overall survival, adding further granularity to clinical prognostic classification and molecular subtyping.
Despite the importance of radiomics and pathomics data taken individually, to the best of our knowledge, at the time this work was written, only one study has combined them into a single machine learning framework for the outcome prediction of radiation therapy treatment [9]. Hence, in this paper, we propose the second attempt to combine radiomics, pathomics and clinical data into a single model to predict outcomes for NSCLC patients. As mentioned in section 1, since in the current clinical practice the knowledge of oncogene-dependent characteristics does not change the therapeutic strategy, in this work we do not consider genomics data

Materials
In this work we used an in-house cohort of 33 patients with Locally-Advanced stage III NSCLC, who were enrolled from November 2012 to July 2014 and treated with concurrent chemoradiation at a radical dose with an adaptive approach. The adaptive protocol was approved by Ethical Committee Campus Bio-Medico University on 30 October 2012 and registered at ClinicalTrials.gov on 12 July 2018 with Identifier NCT03583723 after an initial exploratory phase. Enrolled patients underwent a clinical evaluation after chemoradiation treatment and were classified into two groups according to target reduction: (i) adaptive, i.e. patients who achieved a reduction in tumour volume, assessed by two radiation oncologists on weekly chest CT simulations, leading to the implementation of a new treatment plan with which the patient would continue radiation therapy (adaptive approach); (ii) not-adaptive, i.e. patients who did not achieve target shrinkage and continued the chemoradiation with standard treatment. The a priori probability of this patients' cohort consists of 11 and 22 adaptive and not-adaptive patients, respectively.
For this patient cohort we collected heterogeneous data including histological slides, CT scans, as well as clinical data, therefore forming the following unimodal data streams (i.e. pathomics, radiomics and semantics) in the multimodal learning framework investigated in this study: • Pathomics modality: This modality includes samples generated from biopsy slides of lung cancer tissue, stained with haematoxylin and eosin (HE). HE (haematoxylin/eosin) tumour tissue slides were reviewed by a pathologist to confirm sample adequacy. Slides were digitised (APERIO CS2 Leica Biosystems or NanoZoomer 2.0 RT Hamamatsu) at 20x magnification. The digitised slides were loaded and segmented on QuPath. Regions of interest (ROIs), also called crops in the following, were manually defined by lung pathology experts to identify tumour areas avoiding histological artefacts, macrophage clusters and inflammations, fibrosis and necrosis. A total of 1113 tumour areas were manu-  ally segmented for the 33 tissue samples, one per patient.
• Radiomics modality: This modality includes initial CT scans collected prior to the start of concomitant chemoradiation therapy treatment. The CT scans consisted of single layer spiral computerised tomography -Siemens Somatom Emotion. Acquisition parameters were 140 Kv, 80 mAs, and 3 mm for slice thick. The scans were pre-processed applying a lung filter (kernel B70) and a mediastinum filter (kernel B31). The characteristics investigated in this work and presented in section 4.2 were extracted from 3D ROIs given by the Clinical Target Volume (CTV), manually countered by expert radiation oncologists. The CTV is the volume containing the Gross Tumour Volume (GTV), i.e. the macroscopically demonstrable disease, and therefore, with a probability considered relevant for therapy, the microscopic disease at the subclinical level. It is worth noting that in [21] the authors showed that CTV should be preferred to GTV when computing radiomics features. In total, this modality contains 39 manually contoured CTVs for the 33 patients. It is worth noting that the number of CTVs exceeds the number of patients as multiple tumours can occur in a patient.
• Semantic modality: Two experienced radiation oncologists independently reviewed all CT scans and scored each tumour for four semantic imaging features, divided into tumour staging scores (T, N and tumour stage), and histological evaluation. They also added the age and sex of the patients. Each radiation oncologist blindly assigned staging scores, and, in case of disagreement, they reviewed the CT scans together and any discrepancies were resolved through discussion until consensus was reached.
As can be seen from these descriptions, the data sources used are highly heterogeneous and are uncorrelated unimodal flows. Thus, as we previously mentioned in section 1, this motivated the choice of using the late fusion approach as a multimodal approach. Figure 2 shows four examples of both crops extracted by the pathologists from the high-resolution whole-slide images which contain the selected tumour area of interest, and CTVs extracted by expert oncological radiotherapist by CT scans weekly collected during the radiation therapy treatment. Moreover, the first three rows of Table 3.1 summarise the a priori sample distribution for each of the three different modalities. Let M be a multiplexer that allows us switching between two patient aggregation modes, A 1 and A 2 , respectively. If mode A 1 is active, mode A 2 is deactivated by a control signal passing through the logic NOT port, and vice versa.

Methods
This section introduces the proposed fusion framework to handle the binary classification task introduced before. It is composed of four main blocks shown in Figure 3 identified by the bars at its bottom and presented in the following. First, a preprocessing phase is applied to the different unimodal flows (section 4.1). This stage uniforms the data, increases the dimensionality of unimodal flows, and encodes categorical features into numerical ones. The second step, presented in section 4.2, extracts the features from both images belonging to the pathomics and the radiomics flows. The third step consists in patient aggregation, i.e. we merge each instance of the same patient of a single modality to get a single label for each patient (section 4.3). This is necessary so that the following data fusion step can work on consistent samples, i.e. one sample per patient, and not on single histologicals (patches or CTs' slices). Section 4.4 presents the eight fusion rules we investigated that belong to three different paradigms, this offering a view of how different fusion techniques can fuse three sources of information. On the one side, they are the product, maximum, minimum, mean, decision template and Dempster-Shafer, and all of them are based on the decision profile, i.e., a matrix organising the output of l different soft classifiers in a multiple classier framework, with l = 3 in our case. On the other side, the other two are the majority voting rule that works with the crisp labels and the confidence rule, which is a classifier selection technique.

Pre-Processing
This section presents the pre-processing applied to each unimodal flow which differs for each modality due to the heterogeneity of the data.
In the case of pathomics images, we applied a patch extraction operation to the original crops manually contoured by expert pathologists on the high-resolution wholeslide images. The patch extraction phase was performed by a sliding window with a size equal to 100 × 100 and a stride equal to 60, both chosen empirically. This step permits us to increase the cardinality of the available images by exploiting the variability typical of different regions of the same image. Furthermore, it also provides homogeneous images, as the original ones are characterised by a wide size variability. After this operation, we empirically removed the extracted patches with more than 20% of pixels belonging to the background to keep only the most informative images. In the end, the new repository of pathomics patches is composed of 53550 instances.
Let us now focus on the pre-processing for radiomics. As already reported in section 3, the radiomics modality consists of CTVs manually contoured by expert radiation oncologists on CT scans. To increase the dimensionality of this modality, we decomposed the CTVs, i.e. the volumes containing the macroscopically demonstrable tumour mass and the microscopic disease at the sub-clinical level, into their component slices. Thus we passed from having few 3D CTVs per patient to multiple 2D slices per patient, for a total of 928 slices.
Finally, in the case of semantic information, in order to have numerical features, we applied an ordinal encoding for T, N and stadium features, and one-hot encoding for sex and diagnosis features as no ordinal relation exists for these latter categorical variables.
The second part of Table 3.1 shows the a priori distribution of instances obtained after the pre-processing.

Features Extraction
This section describes the features extraction stage implemented for both the pathomics and the radiomics unimodal flows. It is straightforward that this step is not applied to the semantic unimodal flow, since it includes patients' medical data already processed. Note also that the features computed for each modality were selected to optimise each unimodal flow, but this is out of the scope of this work and, for the sake of brevity, we do not present this phase here. Nevertheless, the starting feature set consists of 2D intensity and texture features well established in the medical image processing scenario [22] and, specifically, in radiomics [23] and digital pathology [24]. They are statistical features extracted from the first-order image histogram, and several descriptors extracted from the results provided by both the Grey Level Co-Occurrence Matrix (GLCM) and the Local Binary Pattern (LBP) operators. Please note that we also investigated the use of deep learning, as we will discuss in section 5.

Pathomics
Among the descriptors mentioned before, histopathological images are represented by measures derived from the GLCM, computed from each patch 1 . The GLCM is a texture feature used to analyze the spatial distribution of grey levels within a 2D image at the microscale [25]. Hence, given a n × m image I, and being l the number of bits used to represent it, a GLCM is a square matrix of size 2 l where each entry (g i , g j ) represents the number of times a pixel with intensity g i is separated from another pixel with intensity g j by spatial offsets ∆x and ∆y, respectively, in the vertical and horizontal directions: where g i and g j are the pixel values ∈ 0, 2 l − 1 , x and y are the spatial positions in the image I.
The GLCM can also be parameterised in terms of δ and θ instead of the spatial offsets (∆x, ∆y). The former represents the relative distance in pixels between two points in I, while the latter is the relative orientation between two points in I. Here, taking into account preliminary experiments and findings in related fields [26][27][28][29], for pathomics we used δ = 1 and θ ∈ [0°, 45°, 90°, 135°], so that each patch has four GLCMs.
From each of these matrices we extracted six Haralick descriptors [25], i.e. contrast, dissimilarity, homogeneity, energy, correlation and angular second moment, listed in depth in Appendix B. Their concatenation provides 24 textural descriptors per patch.

Radiomics
For the radiomics modality we used the same features as presented in our previous work [30]. Hence, for each slice that makes up the 3D ROI extracted from the CT scans by the radiologists, we computed 12 statistics features and 104 textural features.
Statistical features consist of the moments up to the fourth-order of the first-order image histogram, i.e., the mean, the standard deviation, the skewness and the kurtosis. Furthermore, the picture of grey-level distribution is also grasped by the histogram width, the energy, the entropy, the value of the histogram absolute maximum and the corresponding grey-level value, the energy around such maximum, the number of relative maxima in the histogram and their energy. These statistics are listed in depth in Appendix B.
Texture features are derived from the GLCM and from the LBP. The former is parameterised by a unit distance δ between pixels and an orientation θ ∈ [0°, 45°, 90°, 135°] and we extracted six second-order statistical features as the pathomics flow.
The latter is an operator that describes the local texture of an image by assigning each pixel an integer label according to its local circular neighbourhoods of P points located on the circumference of radius R centred at pixel (x c , y c ) [31]. In this work, we used the LBP riu2 P,R operator, which is an extension of the original operator making it invariant to both local monotonic greyscale variations and rotation [32]: where LBP P,R (x c , y c ) = P −1 p=0 sgn (g p − g c ) 2 p is the original LBP operator, sgn (·) is the sign function, g c and g p are the values of the grey intensity of the central pixel (x c , y c ) and of the p-th pixel in the considered neighbourhood, respectively, and where U (·) is the uniformity degree of a pattern: Then, this operator identifies P + 1 uniform patterns with decimal code in the range [0, P ], whilst all non-uniform patterns are grouped into a single code with an integer value of P + 1. Hence, LBP riu2 P,R identifies a total of P + 2 different decimal codes. Finally, once we have applied this operator to each pixel in the image, we can compute a histogram of the LBP decimal codes' occurrences.
In this work, we empirically parameterised R with a unit distance and we set P equal to 8. Finally, the same 12 statistical features reported above on the top of this section (i.e. mean, standard deviation, skewness, kurtosis, etc.) are then computed from the histogram of LBP distribution.

Patient Aggregation Rule
As mentioned above, each patient is composed of several samples both for the pathomics and the radiomics flows. For the former, the samples correspond to the patches extracted from the crops contoured by the pathologists, whilst for the latter, the samples correspond to the various slices included in the segmented CT VOIs.
For this reason, in order to have a single classification per patient and consistent sample fusion, a samples' patient-wise aggregation is necessary. In this work, we used two different patient-wise aggregation rules, denoted as A 1 and A 2 in Figure 3, respectively.
The former is applied before the classification step, and it averages out each component of the feature vector x ∈ n belonging to the same patient p: where X p is the set of feature vectors computed from all the samples of the same patient p (i.e. histopathology patches or CT slices), and N p is its cardinality. The latter works after the classification process, and it averages the soft labels of all the instances of a patient. Formally, given a classification problem with C class labels and L unimodal flows 2 , and assuming one classifier per modality, let denotes the set of classifiers. Hence, given x, a soft classifier outputs a C-dimensional vector given by where d i,j (x) ∈ [0, 1] is the soft label and it represents the degree of support provided by classifier D i for the hypothesis that x comes from the class ω j . On this premise, the A2 patient-wise aggregation rule is defined by: where, thus, d p i,j represents the average soft label per class computed over all the instances of the i-th modality of the same patient (i.e. x ∈ X p ).

Late Fusion Rules
This section introduces the late fusion rules that merge the multimodal information extracted from the different unimodal flows.
Using the notation already introduced, in a multimodality framework we organise the outputs returned by the L unimodal classifiers into a patient-wise decision profile DP p , defined by the following matrix: where µ i,j is computed according to A1 or A2 aggregation rule. This implies that, using A1, µ i,j = d i,j (x p ), whereas using A2 we have µ i,j = d p i,j . Thus, the patientwise data is projected into a new feature space with dimension L × C and this new representation combining the unimodal classification stages is depicted by the symbol in Figure 3.
The fusion methods calculate the support χ j for the class ω j by applying some mathematical procedure described below on the DP p representation and, using the maximum membership rule, we then assign the patient p to the class ω s if: In this work, to compute χ j we apply eight late fusion techniques, represented with the tag LF t , with t = 1, . . . , 8, in Figure 3. They include four fusion rules computing the support for the j-th class independently of the support of the other classes: 1. Product rule (LF 1 ): it computes the support χ j for the class ω j as: 2. Max rule (LF 2 ): it computes the support χ j for the class ω j as: 3. Min rule (LF 3 ): it computes the support χ j for the class ω j as: 4. Mean rule (LF 4 ): it computes the support χ j for the class ω j as: We also investigated others two rules computing the class supports comparing the entire DP p feature space with the decision templates (DTs) of each class. DTs-based methods have been found to be among the best combination techniques and show stable performance over a range of experimental settings [33]. They are: 1. Decision Templates, DTs (LF 5 ): its use was proposed in [33] and consists of calculating C DTs, one per class, that capture the pattern of each. The decision template DT i for class ω i is the centroid of class ω i in the training L × C feature space DP p and it is calculated as follows: where N i is the number of patients belonging to the class ω i .
Finally, the p-th patient's support degree χ i for the class ω i is computed by measuring the similarity between the current DP p and DT i : where dt i k,j is the k, j-th entry in the i-th decision template DT i .

2.
Dempster-Shafer rule (LF 6 ): it is still based on the use of DTs. The p-th patient's proximity Φ p between the output of the i-th classifier D p i and DT i j is defined as [34]: where DT i j denotes the i-th row of decision template for the class ω j , D p i denotes the output of the i-th classifier on the p-th patient and · is any matrix norm. Then, the final support degree for the j-th class is: where K is a scaling factor.
For the sake of completeness, we also investigated other two rules working with different paradigms.
On the one side, we use the majority voting rule (LF 7 ) that works with crisp label outputs of each modality by assigning the patient p the class label ω s that is most represented among those returned by the L unimodal classifiers. Formally: On the other side, we also applied the confidence rule (LF 8 ), which assigns patient p the class label ω s : s = arg max j DP p q , for j = 1, . . . , C which corresponds to the q-th unimodal classifier output with the largest degree of support: where DP p i denotes the i-th modality whose classifier output is represented by row the i-th of DP p .

Experimental setup
Here we introduce the experimental setup adopted, presenting in section 5.1 the set of experiments carried out, and in section 5.2 the validation adapted as well as the evaluation metrics used.

Mean Decision Template
Dempster Shafer

Majority Vote Confidence
Features Mean

Set of experiments
The first set of experiments consists of evaluating the late fusion paradigm through all the different combinations of the fusion and aggregation rules, LF x and A y , respectively, for a total of 16 combinations since x ∈ {1, . . . , 8} and y ∈ {1, 2}, which are summarised in Table 4.1. Furthermore, these 16 experiments were performed for all combinations of modalities, i.e. Pathomics + Radiomics + Semantic (P + R + S), Pathomics + Semantic (P + S), Radiomics + Semantic (R + S), and, finally, Pathomics + Radiomics (P + R), for a total of 64 experiments.
Then, we compared the late fusion approach with an early fusion framework. Concerning this last approach, in this work we considered two early fusion rules: a simple approach in which the different modalities are concatenated without any processing on the feature space, and a concatenation given by the application of the Kronecker product, as presented in [8]. Indeed, the latter rule was chosen to bring out a correlation of the different modalities in the various combinations of them. For the sake of consistency of samples, in the early fusion paradigm we only applied the A1 aggregation rule, i.e. samples' patient-wise aggregation is performed by averaging each component of the feature vectors belonging to the same patient.
In all the experiments, we used the same learning paradigm in the classifier blocks of Figure 3, in which is a Random Forest [35] with entropy as a function to measure the quality of a split, whilst, for all the other parameters, we used the default values provided by the Scikit-learn framework [36], without any fine-tuning. Indeed, it was empirically observed in [37] that in many cases the use of tuned parameters cannot significantly outperform the default values of a classifier suggested in the literature, as also confirmed in other works [38][39][40].
Finally, we compared the discrimination strength of hand-crafted features against deep features in the late fusion framework. For this comparison, the 64 experiments described above were also performed using deep features extracted from both pathomics and radiomics modalities using the ResNet-18 [41] and GoogLeNet [42] networks respectively, pre-trained on ImageNet dataset. The choice of using ImageNet as a pre-training tool is motivated by the fact that this dataset provides enough rich image detail of different objects and targets, and therefore we believe that these pre-trained network feature extraction capabilities can be transferred to both pathomics and radiomics tasks. For each patient we trained the CNNs with a transfer learning process performed with all samples from the other patients for 20 epochs, as suggested by our previous work [43]. Furthermore, given the reduced amount of training samples, during the training we froze the weights for all the layers except the ones of the new final fully connected layer. Straightforwardly in this last layer we removed the original

Aggregation Rules
Modalities Pathomics Radiomics Semantic .711 .731 - 1000 neurons, which are replaced by two softmax neurons with random weights. These experiments were performed using the PyTorch framework [44].

Evaluation methods
We tested all the proposed approaches with a leave-one-patient-out (LOPO) crossvalidation paradigm so that we performed a number of runs equal to the number of patients. Therefore, in each run the test set consisted of all samples belonging to one patient, whereas all the others were allocated to the training set.
The patient-wise performances were computed averaging the Area under the ROC curve (AUC) of each run, where, as a reminder, the positive and negative classes correspond to adaptive and non-adaptive patients, respectively. It is worth recalling that AUC is a figure of merit widely adopted in the medical community to characterise the performance of a prediction model. Furthermore, to compare the results we also applied some statistical tests that will be introduced hereinafter, and formally presented in Appendix A.

Results and discussion
This section presents and analyses the results in several directions, starting from the raw outputs reported in Table 6.1 and Table 6.2. The former reports the scores attained by each single modality, eventually using one aggregation rule to combine the features describing the samples. The radiomics modality with the use of the feature mean as patient aggregation rule is the best unimodal flow with an AUC equal to 0.870. The latter shows the performance attained by the pairwise fusion approaches and by the trimodal combination, for all the 16 fusion rules. With an AUC equal to 0.909, the best results are achieved by the multimodal triplet P +R+S, and the pairwise combinations R + S and P + R, with the following fusion rules respectively: A 1 + LF 6 , A 1 + LF 5 and A 1 + LF 2 . Hence, all of them are given by the use of the feature mean as patient aggregation rule at feature level followed by the Dempster-Shafer, Decision Template and Maximum as fusion rule, respectively.
To discuss these results, the rest of this section deepens the results in three directions. First, we present the results provided by the late fusion approaches introduced in section 4.4 and schematically depicted in Figure 3 (section 6.1). Second, in section 6.3 we compare late fusion and early fusion approaches. Third, in section 6.3 we show a comparison of hand-crafted and deep features. Figure 4: Radar chart showing the performance in terms of AUC of the unimodal and multimodal approaches, where P stands for Pathomics, R for Radiomics, and S for Semantic. The filled circle represents the flow with the highest rank, whilst blank circles represent unimodal or multimodal approaches with statistically different performances from the best approach according to Friedman test with the Iman-Davenport amendment followed by the pairwise Bonferroni-Dunn post-hoc test (p < 0.1).   (Figure 3) and varying the modalities combination. Filled circles represent the fusion rule combination with the highest rank, whilst blank circles represent models with statistically different performances from the best model according to the Wilcoxon signed-rank test (p < 0.1).

Late fusion results
The contribution of this experiment is three-fold, so it permits us to answer the following three questions: 1. What is the best combination of modes? 2. Within the multimodal combination, which is the unimodal mode contributing more to the best performance? 3. What is the best fusion rule?
Let us now explore each of these questions. In all the cases we will introduce figures that offer a high-level synthesis of the huge amount of results provided by all the experiments.
Best multimodal combination. Figure 4 shows a radar chart plotting the performances in terms of AUC of the various unimodal and multimodal approaches. As mentioned above and summarised in Table 4.1 we have a total of 16 different rules, given by the different combinations of the aggregation rules (A y ) and the late fusion rules (LF x ). So for each of these rule combinations we rank each approach so that the one with the highest performance receives a score of 7, whilst the one with the lowest performance gets a score of 1. At the end of this iterative process the rank of each stream is given by the sum of the ranks received for each of the 16 experiments mentioned above. We then normalise for the maximum rank achievable. In the figure, we adopt a filled circle to mark the flow with the highest rank, while the blank circles denote those approaches with a lower rank, whose performances are statistically different from the best one according to the Friedman test with the Iman-Davenport amendment followed by the Bonferroni-Dunn pairwise post-hoc test (p < 0.1). Furthermore, we do not report any circle when the rank of a flow is lower than the best one and the corresponding performance are not statistically different.
Under these premises, in this figure the lengths of the spokes show how the multimodal approaches generally perform better than the unimodal ones, as the latter always differ from the best combination in a statistically significant way. Furthermore, as you can see from the filled circle, the trimodal combination given by P + R + S is the best approach and it significantly differs from all unimodal approaches (i.e. P , R, S) P +S. Although in a different clinical context, similar considerations about the inherent superiority of multimodal approaches over the unimodal flows were obtained from [9,45]. Indeed in [9] and [45], the overall survival analyses performed for glioma and lung cancer, respectively, show how the multimodal combination significantly outperforms the best performing unimodal flows. Furthermore, as in our work, in both studies the triplet multimodal combination emerged as the best approach.
Most informative unimodal approach. Let us now rank the unimodal approaches in order to understand which flow is the most informative in terms of AUC. The ranks are computed as described before. Indeed, given a fusion rule, the multimodal approaches are ranked so that the flow with the best performance gets rank 4, as we have four different multimodal approaches. Then, all the unimodal approaches that make up the multimodal flow get the same rank as the multimodal one. So, for instance, if the case P + R + S multimodal flow got rank 3, the unimodal approaches of pathomics, radiomics and semantic all get the rank 3. This operation is repeated for all 16 different rule combinations and the final ranks are updated by accumulation and, finally, normalised for the maximum rank achievable.
Hence, with a rank equal to 0.667, the radiomics approach is the most informative unimodal flow, outperforming both the pathomics and semantic flows which got a rank equal to 0.549 and 0.535, respectively. Similar considerations were obtained from [9] where, although in a different clinical context, the overall survival analysis of glioma dealing with the radiomics unimodal flow emerges as the most informative modality in terms of Cox Loss.
Best fusion rule. Figure 5 shows a radar chart plotting the performance in terms of AUC of the various fusion rules varying the way we combine the modalities. Let us remember that, as mentioned above and summarised in Table 4.1, the rules represented in the figure are the combination of the aggregation and late fusion rules, A y and LF x , respectively. For each multimodal combination, we ranked each fusion rule in terms of AUC so that the worst rules receives rank 1, whilst the best receives rank 16. Filled circles in the figure represent the fusion rule combination with the highest rank, whilst blank circles represent models with statistically different performances from the best fusion rule according to the Wilcoxon signed-rank test (p < 0.1). Note that here we used such test rather than Friedman's method since, for each late fusion rule, we compared the four different multimodal combinations, a number limiting the application of Friedman's test. Furthermore, we do not report any circle when the rank of a flow is lower than the best one and the corresponding performance are not statistically different.
The chart shows that the patient aggregation rule named as score mean aggregation (A 2 ) performs generally worst than the feature mean rule (A 1 ). On the other hand, if we focus on the late fusion rules, we observe that they perform almost equally well on all multimodal combinations.
Moreover, the best combinations of rules depend on the modality combination considered. For the P + R + S combination, the best rule combination is denoted as A 1 + LF 6 , which is therefore given by the use of the feature mean as patient aggregation rule at feature level followed by the Dempster Shafer as fusion rule working on the outputs of each classifier. For the P + S combination, the best rules are denoted as A 1 + LF 4 and A 1 + LF 7 , which are given by the combination of the feature mean as patient aggregation rule and, respectively, the mean and majority vote as fusion rule. For the R + S combination, the best rule combination is denoted as A 1 + LF 5 , which is therefore given by the use of the feature mean followed by the Decision Template as fusion rule. For the P + R combination, the best rule combination is denoted as A 1 + LF 2 , which is therefore given by the use of the feature mean followed by the maximum as fusion rule. Globally, the best fusion rules are A 1 + LF 4 and A 1 + LF 7 , since on average they ranked highest across all modality combinations. It suggests that these rules are well adapted to different situations, generalising successfully across different types of datasets, which, in turn, are characterised by different data types, sizes and dimensionalities.

Simple Concatenation Kronecker
Pathomics + Radiomics + Semantic 5-2-1 8-0-0 Pathomics + Semantic 8-0-0 8-0-0 Radiomics + Semantic 4-1-3 7-0-1 Pathomics + Radiomics 5-0-3 7-1-0 Table 6.3: Exhaustive comparison of late and early multimodal learning for the various modality combinations in terms of AUC. Each cell shows the amount of win-tie-loss of a combination of the corresponding multimodal combination handled in the late fusion paradigm. For each modalities combination, the cells are highlighted in grey when the late fusion approaches are significantly better than the early fusion rules according to the one-tailed sign test (p < 0.05). Table 6.3 shows an exhaustive comparison of late and early multimodal learning for the various modality combinations in terms of AUC. As we already mentioned, with the early fusion paradigm we only tested the A 1 aggregation rule for the sake of consistency of samples. Each cell reports the amount of win-tie-loss of the corresponding multimodal flow handled in the late fusion paradigm. Since early fusion only handles the A 1 aggregation rule, the comparison was restricted to only the 8 late fusion rules that use this method of aggregation. Given the number of patient data, we statistically validated this comparison with the sign test, a simple but powerful statistical test. In Table 6.3, for each modalities combination, the grey cells highlight when the late fusion approaches are significantly better than early ones according to the one-tailed sign test (p < 0.05). The table shows that the late fusion paradigm almost always outperforms the early fusion paradigm for the performance metric considered. This suggests us that there is a certain difficulty in the data-level fusion process when the data are significantly uncorrelated, and have such a different nature and dimensionality, as in the medical task we are dealing with in this work.

Hand-crafted vs Deep Features
This experiment compares hand-crafted and deep features in the same multimodal late fusion framework ( Figure 2). Hence, for the two descriptor groups we have the same number of experiments, i.e. 64 as discussed in section 5.1.
We summarised the comparison in Table 6.4, which offers an exhaustive comparison between the modality combinations in terms of AUC. Each cell shows the amount of win-tie-loss of a pair of modality combinations indexed by row and column, respectively performed using hand-crafted and deep features. For instance, the second cell in row 1, with indexes (1, 2), counts the wins, ties and losses obtained by the modalities triplet P + R + S performed with hand-crafted features against the modalities pair P + S performed with deep features. Since for each combination we have a total of 16 different combinations of the fusion and aggregation rules, LF x and A y respectively, the total amount of scores for each cell is equal to 16.  for Semantic. Each cell shows the amount of win-tie-loss of a combination in a row compared with a combination in a column, respectively performed using hand-crafted and deep features. The cells highlighted in grey represent the modality combination significantly better than another according to the one-tailed sign test (p < 0.05).
Again, we statistically validated this comparison with the sign test. In Table 6.4 the grey cells represent the modality combination significantly better than another according to the one-tailed sign test (p < 0.05). From the table we can see how the hand-crafted features perform better than the deep features. The reason for this result can be found in the low dimensionality of the dataset, as could be expected. Indeed, this limits the ability of the deep neural networks to fully express their power of abstraction, generalisation and discrimination.

Conclusion
Nowadays, lung cancer is worldwide recognised as the most common type of cancer and one of the most frequent causes of tumour death despite the recent increase in the number of treatment options. The current clinical decision-making process relies on multiple data sources to improve the prognostic outcomes of radiotherapy treatment, such as radiology-based data, digital pathology slides, genome profiling and clinical data. Despite the importance of these data sources, to the best of our knowledge only one work to date has combined them together into a single deep learning framework fed with a large amount of data collected from public repositories [9]. Although deep learning networks offer a high generalisation ability, they require a large amount of data that are not always available in every clinical context. Hence, in this work, we have presented a multimodal late fusion framework combing radiomics, pathomics and clinical data to predict radiation therapy treatment outcomes for NSCLC patients. We fed the proposed framework with hand-crafted features extracted from the aforementioned data sources. Here, we explored the combinations of eight different late fusion rules (i.e. product, maximum, minimum, mean, decision template, Dempster-Shafer, majority voting, and confidence rule) with two samples' patient-wise aggregation rules (i.e. feature mean and score mean) implemented to have a single classification per patient and consistent sample fusion, for a total a 64 experiments. The contribution of these experiments is three-fold, as they allow us to find the best combination of modalities, the most informative unimodal flow and, finally, the best fusion rule. From these experiments emerges that the trimodal combination given by pathomics, radiomics and semantic data (i.e. P + R + S) is the best approach and it significantly differs from all unimodal approaches and the pairwise combination P + S. Furthermore, the radiomics approach is the most informative unimodal flow, outperforming both the pathomics and semantic flows. Finally, this set of experiments shows that the best fusion rules are A 1 + LF 4 and A 1 + LF 7 , which are given by the combination of the feature mean as patient aggregation rule and, respectively, the mean and majority vote as fusion rule, since on average they ranked highest across all modality combinations. This suggests us that these rules are well suited to different situations, successfully generalising across different types of modalities.
As another contribution of this work, we compared the late fusion framework with an early one. Concerning the latter, we explored two early fusion rules: a simple concatenation in which the different modalities are concatenated without any processing on the feature space, and a concatenation processed by the application of the Kronecker product. The preliminary results show that the early approach leads to a deterioration in performance due to their lack of capacity to handle highly heterogeneous data such as in the medical task dealt with in this work.
As a final contribution, we fed the multimodal late fusion scheme also with deep features extracted from pre-trained deep neural networks fine-tuned with the standalone modalities. As could be expected due to the limited size of the patients' cohort, the results show as the deep features lead to a general performance deterioration. Nevertheless, also in this case the combination of the three modalities leads to an improvement in performance.
Finally, turning to the clinical side, the take-home message emerging from this work is that the multimodal learning framework leads to a significant improvement of a learning system in terms of performance. Indeed, in this work the simultaneous fusion of the three modalities given by P + R + S is the best approach and it significantly differs from all the models fed with the stand-alone data flows. Although in a different clinical context, similar considerations about the inherent superiority of multimodal approaches over the unimodal ones were obtained from [9,45] The following sections describe the statistical tests we used in this work to validate and compare our approaches.

Friedman test with Iman and Davenport amendment
The Friedman test with amendment by Iman and Davenport is a non-parametric statistical test used to compare multiple models. For each of the 16 different combinations of fusion rules (A y + LF x ), each flow is ranked so that the best receives rank 1, whilst the worst receives rank 7. Tied ranks are shared equally as explained above. The test statistic with the amendment proposed by Iman and Davenport is the following: where R j = 1 N N i=1 r j i is the average rank of the j-th flow and r j i is the rank of the j-th flow when considering the i-th fusion rule, where i = 1, . . . , N and j = 1, . . . , M (N = 16, M = 7). Once the F -statistic has been computed, we can carry out the test comparing it with the critical value for the chosen level of significance. If it is greater than this value we can reject the null hypothesis H 0 and accept that there is a difference between the flows.
Bonferroni-Dunn post-hoc test. If H 0 is rejected, Bonferroni-Dunn post-hoc test is applied to find exactly where the differences are: where R 1 is the average rank of the best flow and R j is the average rank of j-th flow. Two flows are statistically different if the obtained p-value from this z-value is smaller than α M −1 , where α is the desired level of significance.

Wilcoxon signed rank test
The Wilcoxon signed-rank test is a non-parametric statistical test that tests if two models are statistically different. Given the error estimates of two models for the N folds of the LOPO validation paradigm, the test computes the differences of these errors d i . Then it ranks the absolute values of the differences |d i | so that the smallest value receives rank 1, whilst the largest one receives rank N . If there is a tie, all the ranks are shared so that the total sum stays 1 + 2 + · · · + N . Subsequently, it splits the ranks into positive and negative according to the sign of d i and calculates the following amounts: Finally, the test computes the following statistic: which is approximately distributed as a normal distribution and where T = min (R + , R − ).
Once the z-statistic has been computed, we can carry out the test comparing it with the critical value for the chosen level of significance. If it is greater than this value we can reject the null hypothesis and state that the two methods have statistically different performances.

Sign test
The sign test is simply performed counting wins, ties and losses, with or without statistical significance, of each method pair. This test is based on the intuition that if two methods are equivalent, each one will perform better than the other one on approximately N/2 of the tests. Hence, following the binomial distribution, we can claim that the first method is significantly better than the second one if its amount of wins is greater than N/2 + 1.96 N/2, at a level of significance of 0.05.

Appendix B. Statistical descriptors
The following are the statistical measures extracted from the intensity histogram and used in this work to provide a synthetic description of both the LBP and the grey level histogram: where l is the number of grey levels in the image I and p (r k ) is the number of pixels with a grey level equal to r k . Given instead a grey-level co-occurrence matrix G extracted from image I, the following are the formal definition of Haralick features used in this work: • Contrast: • Dissimilarity: where G i,j denotes the i, j-th entry of G, l is the number of grey levels in the image I, g i and g j denote two grey level values ∈ 0, 2 l − 1 , and, finally, µ i denotes the mean value of the one-dimensional marginal distributions of G.