MSFN: a multi-omics stacked fusion network for breast cancer survival prediction

Introduction: Developing effective survival prediction models is critical to breast cancer prognosis. With the widespread use of next-generation sequencing technologies, numerous studies have focused on survival prediction. However, previous methods predominantly relied on single-omics data, and survival prediction from multi-omics data remains a significant challenge. Methods: In this study, considering the similarity of patients and the relevance of multi-omics data, we propose a novel multi-omics stacked fusion network (MSFN) based on a stacking strategy to predict the survival of breast cancer patients. MSFN first constructs a patient similarity network (PSN) and employs a residual graph convolutional network (ResGCN) to obtain correlative prognostic information from the PSN. Simultaneously, it employs convolutional neural networks (CNNs) to obtain specificity prognostic information from the multi-omics data. Finally, MSFN stacks the prognostic information from these networks and feeds it into AdaboostRF for survival prediction. Results: Experimental results demonstrated that our method outperformed several state-of-the-art methods, and its predictions were biologically validated by Kaplan-Meier and t-SNE analyses.


Introduction
According to the Global Cancer Statistics 2020, 2.26 million new cases of breast cancer were diagnosed in 2020, and breast cancer ranked fifth among all cancers in deaths (Sung et al., 2021). Breast cancer has become the most prevalent cancer in the world (Arnold et al., 2022). Survival prediction is an essential part of cancer prognosis. It aims to predict the survival risk of cancer patients and provide recommendations to pathologists and doctors for treatment (Hagerty et al., 2005). Accurate and reliable survival prediction can provide doctors with scientific guidance and improve patients' survival rates. More importantly, survival prediction tools can help formulate reasonable treatment strategies for patients, avoid unnecessary pain caused by over-treatment, and improve patients' quality of life. Meanwhile, they reduce the burden on doctors and avoid the waste of medical resources (Deepa and Gunavathi, 2022). Therefore, developing accurate and reliable survival prediction methods is vital for the treatment and prognosis of breast cancer.
With the widespread application of next-generation sequencing technologies and the accumulation of medical data on cancers, many survival prediction methods have been developed, including (i) statistical survival analysis methods and (ii) machine learning-based methods. Statistical survival analysis methods such as CoxPH and the log-rank test use survival data and a few covariates to predict patient survival (Michaelson et al., 2002). However, these methods are difficult to model with and are not suitable for analyzing large amounts of data. Machine learning-based methods effectively address these limitations. Algorithms such as support vector machines (SVM), random forests (RF), and logistic regression (LR) extract prognostic features from large amounts of cancer data to predict survival (Xu et al., 2012). However, machine learning-based methods require researchers to perform laborious and complex feature engineering.
In recent years, deep learning methods have provided scientists with powerful tools for extracting high-quality prognostic information from massive omics data and have been proven effective in survival prediction (LeCun et al., 2015; Deepa and Gunavathi, 2022). For instance, Ching et al. (2018) developed Cox-nnet, a neural network model with a Cox regression layer, to predict survival using RNA-Seq data. Katzman et al. (2018) proposed DeepSurv, which combines a DNN with Cox-PH, to predict patient survival and the effect of covariates on survival risk. However, the human genome is extremely complex, and various factors influence cancer pathogenesis (Lujambio and Lowe, 2012). Multi-omics data contain a wealth of information, providing an unprecedented opportunity to investigate cancer occurrence and progression from multiple perspectives (Arjmand et al., 2022). But the deep learning survival prediction methods described above are inapplicable to multi-omics data. Therefore, deep learning methods based on multi-omics data have risen to prominence in survival prediction (Herrmann et al., 2021; Kang et al., 2022).
One line of survival prediction research predicts patients' survival risk (survival rate) based on their survival time and survival status. For example, Cheerla et al. introduced a Cox loss function into a deep learning model to fuse clinical data, gene expression data, microRNA expression data, and whole slide images (WSIs) to predict the survival rate of patients across 20 cancers (Cheerla and Gevaert, 2019). Li et al. (2022) proposed HFBSurv, which predicts patient survival by employing a factorized bilinear model to fuse gene expression, CNV, and pathology image features step by step. Another line of research classifies cancer patients into long-term and short-term survivors. For instance, Sun et al. proposed MDNNMD, a survival prediction model that integrates clinical, CNV, and gene expression data of breast cancer by fusing three DNNs with different weights (Sun et al., 2018). AMDN extracts prognostic features from clinical and gene expression data using NMF matrix decomposition combined with attention mechanisms to predict breast cancer survival (Chen et al., 2019). Subsequently, Arya et al. proposed a stacked integration model, Stacked RF, to overcome the limitation that MDNNMD requires manual adjustment of the fusion weights (Arya and Saha, 2020). In follow-up research, they introduced a gated attention mechanism into Stacked RF to enhance the prediction performance, named SiGaAtCNN (Arya and Saha, 2021). However, previous survival prediction studies based on multi-omics data focus on extracting prognostic features from the various omics data, rather than on patient similarity and the correlation of multi-omics data.
To address these issues in the classification of long-term and short-term survival, we propose a novel multi-omics stacked fusion network (MSFN) for breast cancer survival prediction. First, we construct a patient similarity network using multi-omics data. Then, we employ ResGCN to obtain the similarity information of patients and the correlation information of multi-omics data. Simultaneously, we construct CNNs for each omics data type to obtain the specificity information. Finally, we stack the prognostic information from the hidden layers of these networks and utilize AdaboostRF for survival prediction. The strength of MSFN is that it comprehensively considers the specificity information of multi-omics data, the similarity information of patients, and the correlation information of multi-omics data, stacking this information to achieve more accurate and reliable survival prediction. The contributions of this work are summarized as follows:
• We propose a novel multi-omics stacked fusion network framework that comprehensively obtains survival-related information from multi-omics data for survival prediction.
• We integrate multi-omics data with similarity network fusion (SNF), which fully utilizes the similarity between patients and the correlation of multi-omics data to generate a comprehensive patient similarity network.
• We use ResGCN to extract the prognostic information of the patient similarity network, leveraging its residual connections to achieve a deeper network structure while effectively alleviating gradient vanishing.

Datasets and preprocessing
To investigate the performance of our method, we conducted comprehensive and rigorous experiments on the BRCA multi-omics dataset from TCGA (The Cancer Genome Atlas). We obtained this dataset from the UCSC Xena platform (http://xena.ucsc.edu/) and removed samples and features with more than 20% missing values. 1048 patient samples were finally selected; each sample contained clinical, gene expression, CNV, and survival data. We chose these data types because clinical, gene expression, and CNV data are highly associated with cancer occurrence and progression, and they have been used extensively in previous survival prediction studies (Shlien and Malkin, 2009; Li et al., 2017; Kalafi et al., 2019). Then, we divided patients into long-term and short-term survivors using a threshold of 5-year survival, with long-term survivors labeled as 1 and short-term survivors labeled as 0. The overview of the dataset is shown in Table 1.

Description                  Value
Threshold (years)            5
Total patients               1048
Long-time survivors          248
Short-time survivors         800
Average survival (months)    42.34
For the clinical data, we first removed "not reported" entries and features and samples with more than 20% missing values. Then, we removed irrelevant text descriptions, markers, and years from the clinical data. Subsequently, following the data processing procedures of Sun et al. (2018), Arya and Saha (2020), and Arya and Saha (2021), we screened clinical features such as age, tumor size, and TNM stage, and performed label encoding and binarization on the categorical features. Finally, we obtained 33 clinical features. Since there were no missing values in the gene expression and CNV data, we only estimated missing values for the clinical data. Specifically, we divided the 33 clinical features into 24 discrete-valued features and 9 continuous-valued features. For continuous features, we used the k-nearest neighbor (KNN) algorithm for imputation and then normalized them using min-max normalization with the range set to [0, 1] (Troyanskaya et al., 2001; Patro and Sahu, 2015). For discrete features, we used mode imputation (García-Laencina et al., 2015). For gene expression data, we also applied min-max normalization with the range set to [0, 1]. For CNV data, we directly used the discretized raw data. The gene expression and CNV data for each patient have 60,488 and 19,729 features, respectively. This high dimensionality leads to the "curse of dimensionality", which negatively affects the performance of deep learning methods (Berisha et al., 2021). Therefore, we used the renowned mRMR algorithm for feature selection (Peng et al., 2005). Then, we searched for the optimal number of gene expression and CNV features in steps of 100 (Arya and Saha, 2020; Arya and Saha, 2021). Finally, we selected 400 gene expression features, 200 CNV features, and all 33 clinical features as model inputs, as shown in Table 2.
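The imputation and normalization steps above can be sketched as follows. This is an illustrative numpy sketch, not the paper's exact pipeline: the toy matrix, the choice of k, and the simple mean-over-complete-neighbors imputation are assumptions for demonstration.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to [0, 1], as applied to gene expression
    and continuous clinical features."""
    X = np.asarray(X, dtype=float)
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)  # guard against constant columns
    return (X - mn) / rng

def knn_impute(X, k=3):
    """Fill each NaN with the mean of that column over the k nearest
    complete rows (Euclidean distance on the observed columns)."""
    X = np.asarray(X, dtype=float).copy()
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        obs = ~miss
        cand = []  # (distance to row i, row index) over complete rows
        for j in range(X.shape[0]):
            if j == i or np.isnan(X[j]).any():
                continue
            cand.append((np.linalg.norm(X[i, obs] - X[j, obs]), j))
        nearest = [j for _, j in sorted(cand)[:k]]
        X[i, miss] = X[nearest][:, miss].mean(axis=0)
    return X

# Toy matrix: 4 patients x 2 continuous clinical features, one value missing
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [1.1, 2.1]])
X_filled = knn_impute(X, k=2)
X_scaled = min_max_normalize(X_filled)
```

In practice, scikit-learn's `KNNImputer` and `MinMaxScaler` provide the same operations, and mode imputation for the discrete features can be done with `SimpleImputer(strategy="most_frequent")`.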

Methods
The proposed MSFN consists of three components. In the first component, MSFN constructs a patient similarity network using SNF and employs ResGCN to obtain the similarity information of patients and the correlation information of multi-omics data. In the second component, MSFN constructs CNNs for each omics data type to obtain the specificity prognostic information. The last component extracts and stacks the prognostic information of ResGCN and the CNNs and feeds it into AdaboostRF for survival prediction. The framework of MSFN is briefly shown in Figure 1. The implementation of our method is available at https://github.com/AckerMuse/MSFN.

Construction of patient similarity network
To construct the patient similarity network, we employ similarity network fusion (SNF) (Wang et al., 2014). SNF integrates the clinical, CNV, and gene expression data to generate a comprehensive patient similarity network that fully leverages the similarities between patients and the correlations among multi-omics data. Assume there are n patients, each with m types of omics data. We denote the patient similarity network as a graph G(V, E), where V represents the set of patients, i.e., $\{x_1, x_2, x_3, \ldots, x_n\}$, and the edges E correspond to the similarity relations between vertices $v \in V$. The weights of these edges are represented by an $n \times n$ similarity matrix W, computed by Eq. 1:

$W(i,j) = \exp\left(-\dfrac{\vartheta^2(x_i, x_j)}{\lambda\,\varepsilon_{i,j}}\right)$  (1)

where λ is a hyperparameter, $\vartheta(x_i, x_j)$ is the Euclidean distance between patients $x_i$ and $x_j$, and $\varepsilon_{i,j}$ is used to eliminate the scaling problem (Wang et al., 2014). The similarity matrix is then normalized by Eq. 2:

$P(i,j) = \begin{cases} \dfrac{W(i,j)}{2\sum_{k \neq i} W(i,k)}, & j \neq i \\ \dfrac{1}{2}, & j = i \end{cases}$  (2)

Suppose $N_i$ is the set of neighbor nodes of $x_i$. We can calculate the local similarity matrix L of the single-omics data by Eq. 3:

$L(i,j) = \begin{cases} \dfrac{W(i,j)}{\sum_{k \in N_i} W(i,k)}, & j \in N_i \\ 0, & \text{otherwise} \end{cases}$  (3)

Let $P^{(v)}_t$ denote the normalized similarity matrix of the v-th omics data ($0 < v \leq m$) in the t-th iteration. $P^{(v)}_t$ is updated according to Eq. 4:

$P^{(v)}_{t+1} = L^{(v)} \times \left(\dfrac{\sum_{k \neq v} P^{(k)}_t}{m-1}\right) \times \left(L^{(v)}\right)^{\mathsf{T}}$  (4)

where $L^{(v)}$ denotes the local similarity matrix of the v-th omics data. Through continuous iterative fusion, SNF ultimately generates a patient similarity network containing correlation information from all omics data. In this work, the patient similarity network is combined with ResGCN for cancer survival prediction.
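A simplified numpy sketch of the SNF procedure (Eqs 1–4) follows. The neighborhood size k, the scaling parameter λ, the exact form of ε, the iteration count, and the toy data are illustrative assumptions; Wang et al. (2014) give the full algorithm.

```python
import numpy as np

def affinity(X, k=3, lam=0.5):
    """Eq. 1: W(i,j) = exp(-d^2(x_i, x_j) / (lam * eps_ij)); eps_ij is
    built from each sample's mean distance to its k nearest neighbours."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)  # skip self
    eps = (knn_mean[:, None] + knn_mean[None, :] + d) / 3.0
    return np.exp(-d ** 2 / (lam * eps + 1e-12))

def normalize_p(W):
    """Eq. 2: full kernel; each row's off-diagonal mass sums to 1/2."""
    P = W / (2.0 * (W.sum(axis=1, keepdims=True) - np.diag(W)[:, None]))
    np.fill_diagonal(P, 0.5)
    return P

def local_l(W, k=3):
    """Eq. 3: sparse kernel keeping only the k nearest neighbours N_i."""
    L = np.zeros_like(W)
    for i in range(W.shape[0]):
        nbrs = np.argsort(-W[i])
        nbrs = nbrs[nbrs != i][:k]
        L[i, nbrs] = W[i, nbrs] / W[i, nbrs].sum()
    return L

def snf(views, k=3, t=10):
    """Eq. 4: each view's status matrix diffuses through the average of
    the other views; the fused network is the mean of the final matrices."""
    Ps = [normalize_p(W) for W in views]
    Ls = [local_l(W, k) for W in views]
    m = len(views)
    for _ in range(t):
        Ps = [Ls[v] @ (sum(Ps[u] for u in range(m) if u != v) / (m - 1)) @ Ls[v].T
              for v in range(m)]
        Ps = [normalize_p(P) for P in Ps]
    return sum(Ps) / m

# Two noisy views of the same 8 patients
rng = np.random.default_rng(42)
base = rng.standard_normal((8, 5))
W1 = affinity(base + 0.1 * rng.standard_normal((8, 5)))
W2 = affinity(base + 0.1 * rng.standard_normal((8, 5)))
fused = snf([W1, W2], k=3, t=5)
```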

Similarity and correlation feature extraction by ResGCN
Since the patient similarity network constructed by SNF is graph-structured data, we employ ResGCN to extract survival prediction features from it (Li et al., 2019). ResGCN modifies the data transmission mechanism of graph neural networks to mitigate the gradient vanishing problem and overcome the limitation that plain graph neural networks cannot be made deep. As shown in Figure 2, ResGCN takes the feature matrix of the multi-omics data and the patient similarity network as input and, after the residual graph convolution operations, outputs the feature matrix of each node. The propagation mechanism can first be represented as Eq. 5:

$G(N+1) = \sigma\big(f(A, G(N))\,W(N)\big)$, with $f(A, G(N)) = A\,G(N)$  (5)

where G(N) is the output of the N-th layer, W(N) is the weight matrix of the N-th layer, f(·) denotes the graph convolution operation, σ(·) denotes the nonlinear activation function, and A is the adjacency matrix of the patient similarity network. However, this propagation mechanism only considers the feature vectors of neighboring nodes and ignores the nodes themselves. Therefore, a self-connection is added to A, defined as $\hat{A} = A + E$, where E denotes the identity matrix. Moreover, to avoid changes in the scale of the feature vectors during the multiplication, the symmetric normalization $\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$ is applied, where $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$. The propagation mechanism is therefore redefined as Eq. 6:

$G(N+1) = \sigma\big(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}\,G(N)\,W(N)\big)$  (6)
Theoretically, deeper networks possess greater learning capability than shallow neural networks to capture feature representations from more complex data (Bianchini and Scarselli, 2014). Furthermore, deeper neural networks are typically able to achieve outstanding performance with relatively little training data. These properties are particularly significant for multi-omics data, which are often complex and limited in sample size (Picard et al., 2021; Zhang et al., 2021). ResGCN uses residual connections to improve the information flow in the network, alleviating the gradient vanishing problem and allowing ResGCN to build deep networks (Li et al., 2019). Thus, the new propagation mechanism can be defined as Eq. 7:

$G(N+1) = G(N+1)_{res} + G(N) = f\big(G(N), W(N)\big) + G(N)$  (7)

After G(N) is transformed by f, a vertex-wise addition is performed to obtain G(N+1): the residual mapping f takes the patient similarity network representation as input and outputs a residual graph representation $G(N+1)_{res}$ for the next layer. After several layers of residual convolution, the fusion and MLP modules fuse the data processed by the residual blocks and output the prediction results. Then, we extract features representing patient similarity information and multi-omics correlation information from the fusion layer of the trained ResGCN. In summary, ResGCN mitigates gradient vanishing through its residual connection mechanism, which improves data transmission in the network and allows a deeper architecture to fit complex multi-omics data and obtain survival prediction information.
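A minimal numpy sketch of one such layer (Eqs 6–7) is given below, assuming ReLU as the activation σ and a square weight matrix so the residual addition is shape-compatible; the random graph and feature sizes are illustrative:

```python
import numpy as np

def normalize_adj(A):
    """Eq. 6 preprocessing: add self-loops (A_hat = A + E), then apply
    the symmetric normalisation D^{-1/2} A_hat D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def resgcn_layer(A_norm, G, W):
    """Eq. 7: G(N+1) = sigma(A_norm @ G(N) @ W(N)) + G(N), i.e. one graph
    convolution followed by the vertex-wise residual addition."""
    return np.maximum(A_norm @ G @ W, 0.0) + G  # ReLU assumed as sigma

# Toy patient similarity graph: 5 patients, 4-dimensional node features
rng = np.random.default_rng(0)
A = np.triu((rng.random((5, 5)) > 0.5).astype(float), 1)
A = A + A.T                                  # symmetric, no self-loops
A_norm = normalize_adj(A)
G = rng.standard_normal((5, 4))
W = 0.1 * rng.standard_normal((4, 4))        # square: residual shapes match
out = resgcn_layer(A_norm, G, W)
```

With a zero weight matrix the layer reduces to the identity, which is exactly the property that lets residual stacks grow deep without degrading the signal.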

Feature extraction by CNN
To obtain specificity features for each omics data type, we construct a CNN for each omics data type. Each CNN consists of an input layer, a convolutional layer, a fully connected layer, and an output layer, as shown in Figure 3. After the omics data are fed into the CNN, the convolutional layer performs a convolution operation to generate the feature map, with padding added to control the feature map size. Subsequently, a flattening operation maps the output of the convolutional layer to a fully connected layer containing 150 units for survival prediction. In addition, the Glorot initialization technique is used to initialize the convolutional kernels (Glorot and Bengio, 2010). We also applied dropout and L2 regularization to prevent overfitting during training (Cortes et al., 2012; Poernomo and Kang, 2018). Finally, we extract the specificity features representing each omics data type from the fully connected layers of the three trained CNNs.
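The forward pass of one such per-omics CNN can be sketched in numpy. The kernel size, the "same" zero padding, and the random weights are illustrative assumptions; the paper's trained networks additionally use dropout, L2 regularization, and Glorot initialization.

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D convolution with zero padding so the feature map keeps the
    input length (the padding mentioned in the text)."""
    k = len(kernel)
    xp = np.pad(x, k // 2)
    return np.array([xp[i:i + k] @ kernel for i in range(len(x))])

def omics_cnn_forward(x, kernel, W_fc, b_fc):
    """Conv -> ReLU -> flatten -> 150-unit fully connected layer; the
    activations h are the specificity features later stacked by MSFN."""
    fmap = np.maximum(conv1d_same(x, kernel), 0.0)
    return np.maximum(fmap @ W_fc + b_fc, 0.0)

# One patient's gene-expression vector (400 selected features)
rng = np.random.default_rng(1)
x = rng.standard_normal(400)
kernel = 0.1 * rng.standard_normal(5)          # kernel size is illustrative
W_fc = 0.05 * rng.standard_normal((400, 150))
h = omics_cnn_forward(x, kernel, W_fc, np.zeros(150))
```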

Stack integration and survival prediction
Stacking the hidden-layer features of deep learning networks is an effective strategy for integrating multi-omics data for survival prediction (Arya and Saha, 2020; Arya and Saha, 2021). It allows flexible integration of feature representations from different neural network models, combining correlation and specificity prognostic information. Moreover, this strategy allows integration after each neural network module has reached its optimal performance, rather than requiring all modules to be trained simultaneously. We stack the hidden-layer features extracted from ResGCN and the three CNNs according to Eq. 8:

$F_{stacked} = F_{PSN} \oplus F_{Clin} \oplus F_{Expr} \oplus F_{CNV}$  (8)
where $F_{PSN}$ represents the feature representation obtained from the ResGCN hidden layer; $F_{Clin}$, $F_{Expr}$, and $F_{CNV}$ represent the feature representations obtained from the CNN hidden layers of the respective omics data; $F_{stacked}$ is the resulting stacked feature representation; and ⊕ is the matrix concatenation operation. Then, following Yifan et al. (2021) and Arya and Saha (2020), we used $F_{stacked}$ to train the AdaBoostRF classifier for the final breast cancer survival prediction.

FIGURE 2
Feature extraction process by ResGCN.

FIGURE 3
Feature extraction process by CNN.
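The stacking and classification step can be sketched with scikit-learn. The feature dimensions, the synthetic features, and the AdaBoost/random-forest hyperparameters below are assumptions for illustration; only the concatenation (Eq. 8) and the AdaBoost-over-random-forest classifier follow the text.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Hypothetical hidden-layer features for 200 patients (dimensions assumed)
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)                 # 1 = long-term, 0 = short-term
F_psn  = rng.standard_normal((n, 64)) + y[:, None]  # ResGCN features
F_clin = rng.standard_normal((n, 150))    # per-omics CNN features
F_expr = rng.standard_normal((n, 150))
F_cnv  = rng.standard_normal((n, 150))

# Eq. 8: F_stacked = F_PSN (+) F_Clin (+) F_Expr (+) F_CNV
F_stacked = np.concatenate([F_psn, F_clin, F_expr, F_cnv], axis=1)

# AdaBoost over random-forest base learners ("AdaboostRF")
clf = AdaBoostClassifier(RandomForestClassifier(n_estimators=20, random_state=0),
                         n_estimators=5, random_state=0)
clf.fit(F_stacked, y)
train_acc = clf.score(F_stacked, y)
```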

Evaluation metrics and experiment settings
To comprehensively evaluate our model, we use the area under the ROC curve (AUC), accuracy, precision, recall, F1-score, and the Matthews correlation coefficient (MCC) as performance evaluation metrics (Goutte and Gaussier, 2005; Huang and Ling, 2005; Chicco and Jurman, 2020). These metrics are defined in Eqs 9–14:

$\mathrm{AUC} = \dfrac{\sum_{i=1}^{P}\sum_{j=1}^{N} I(p_i > n_j)}{P \times N}$  (9)

where P is the number of positive samples, N is the number of negative samples, $p_i$ is the prediction score of the i-th positive sample, $n_j$ is the prediction score of the j-th negative sample, and I(·) is the indicator function.

$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$  (10)

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$  (11)

$\mathrm{Recall} = \dfrac{TP}{TP + FN}$  (12)

$F1 = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$  (13)

$\mathrm{MCC} = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$  (14)

where TP, FP, TN, and FN represent the true positives, false positives, true negatives, and false negatives in the confusion matrix, respectively.
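These metrics can be computed directly from prediction scores and the confusion-matrix counts; a small numpy sketch follows (counting AUC ties as 0.5 is an assumption, as the text leaves them unspecified):

```python
import numpy as np

def auc_score(pos_scores, neg_scores):
    """Eq. 9: fraction of (positive, negative) pairs ranked correctly."""
    p = np.asarray(pos_scores)[:, None]
    n = np.asarray(neg_scores)[None, :]
    return ((p > n).sum() + 0.5 * (p == n).sum()) / (p.size * n.size)

def confusion_metrics(y_true, y_pred):
    """Eqs 10-14 from the confusion-matrix counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc  = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec  = tp / (tp + fn)
    f1   = 2 * prec * rec / (prec + rec)
    mcc  = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, prec, rec, f1, mcc

# Toy labels and predictions (tp=3, tn=3, fp=1, fn=1)
acc, prec, rec, f1, mcc = confusion_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                            [1, 1, 0, 0, 0, 1, 1, 0])
```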
To overcome the variance problem caused by the limited sample size and sample imbalance, we used 10-fold cross-validation to evaluate the performance of MSFN (Rodriguez et al., 2009; Jiang and Wang, 2017). The 1048 patients were divided into 10 subsets; in each round, 9 subsets were combined as the training set while the remaining subset was used as the test set. The final performance was the average of the model's performance over the 10 test folds. MSFN was implemented in PyTorch. The experiments were executed on a PC with a 2.90 GHz Intel Core i7-10700 processor and an NVIDIA GeForce RTX 3070 GPU.
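The fold construction can be sketched as follows (the shuffling seed is an arbitrary assumption, and the paper does not state whether the folds are stratified):

```python
import numpy as np

def kfold_splits(n, k=10, seed=0):
    """Shuffle the n sample indices and yield (train, test) index pairs
    for k-fold cross-validation."""
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

splits = list(kfold_splits(1048, k=10))
fold_sizes = [len(test) for _, test in splits]
# each round: fit the model on `train`, evaluate on `test`, average scores
```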

Comparison with previous studies
To demonstrate the effectiveness of MSFN, we uniformly used 10-fold cross-validation to evaluate it against several machine learning-based and deep learning-based methods. Specifically, we selected three widely used machine learning models as baselines: LR (logistic regression) (Jefferson et al., 1997), RF (random forest) (Nguyen et al., 2013), and SVM (support vector machine) (Xu et al., 2012). We then compared MSFN with five current state-of-the-art deep learning-based models, briefly described below:
• MDNNMD (Sun et al., 2018): MDNNMD is a multimodal deep neural network method that integrates clinical, CNV, and gene expression data of breast cancer by fusing three DNNs with different weights.
• SiGaAtCNN RF (Arya and Saha, 2021): SiGaAtCNN RF introduces a gated attention mechanism into stacked CNNs, stacking their hidden-layer features of the gated attention CNNs for training RF to predict breast cancer survival.
• PregGAN (Zhang et al., 2022): PregGAN is a CGAN-based survival prediction method. It generates high-quality pseudo-samples from limited samples for reliable survival prediction.
• Heterogeneous stacked RF (Jadoon et al., 2023): Heterogeneous stacked RF is a heterogeneous ensemble classification model that integrates CNNs and DNNs to predict breast cancer patient survival.
The prediction results are shown in Table 3. MSFN achieves an AUC of 0.9787 and an accuracy of 0.991, outperforming the other methods; the other evaluation metrics also show clear improvements. Specifically, MSFN achieves superior prediction performance compared to SiGaAtCNN RF, Stacked RF, Heterogeneous stacked RF, and MDNNMD, because the patient similarity information and the multi-omics correlation information from the patient similarity network provide more comprehensive and richer prognostic information for survival prediction. MSFN also achieves a significant performance improvement over the traditional machine learning methods and PregGAN, which integrate multi-omics data directly. This demonstrates the effectiveness and superiority of the stacked integration strategy over direct data integration in multi-omics fusion.

Performance comparison of different survival cohorts
To further validate the prediction performance of MSFN, we compared its performance across different survival cohorts. We used 10-fold cross-validation for the experiments and display the results in Figure 4. It is evident that MSFN delivers better prediction performance in both the long- and short-survival cohorts, and the gap between the prediction performance of the two cohorts is very small. This is attributable to MSFN incorporating the prediction information from different deep learning modules, considering both correlation and specificity prognostic information.

Ablation study
We verify how the different modules of MSFN affect its performance through an ablation study with three variants: i) MSFN/-CNNs: MSFN without the CNN modules; ii) MSFN/-ResGCN: MSFN without the ResGCN module; iii) MSFN/-RF: MSFN without AdaboostRF, with the prediction results output by an MLP. We compared MSFN with the variants described above. As can be seen from Table 4, both MSFN/-ResGCN and MSFN/-CNNs perform worse than MSFN. Such results can be attributed to the incomplete prediction features obtained by MSFN/-ResGCN and MSFN/-CNNs, which also reflects the importance of integrating prognostic information. Furthermore, MSFN performs better than MSFN/-RF. This is because AdaboostRF is an ensemble machine learning algorithm with better feature learning ability than a simple MLP, and it deals effectively with the complex feature representations of multi-omics data.

Effect of multi-omics data
To validate the effect of multi-omics data, we constructed MSFN using each omics data type separately and compared these models with MSFN constructed using multi-omics data. As shown in Table 5, Clin, CNV, and Expr represent MSFN constructed with clinical, CNV, or gene expression data, respectively. The accuracy and AUC reach maxima of only 0.919 and 0.954 when using single-omics data. MSFN achieves the best performance with multi-omics data, with all evaluation metrics significantly better than those of the single-omics models. This indicates that MSFN can obtain comprehensive prognostic information from multi-omics data and significantly improve prediction performance.

Effect of ResGCN layers
To explore the effect of the number of ResGCN layers on model performance, we evaluated MSFN while varying the number of layers. As can be seen in Figure 5, the performance of MSFN gradually improves as the number of ResGCN layers increases, and several metrics reach their maxima when the number of layers is set to 2. This demonstrates that a deep ResGCN built with residual connections can properly fit the multi-omics data, improving the performance of the entire model. However, as the number of layers increases further, all metrics fluctuate and gradually decrease. This may be because a ResGCN with too many layers makes the model structure too complex, leading to overfitting during training.

Effect of the number of features
To explore the effect of the number of features on model performance, we conducted experiments using the incremental method from previous studies. Specifically, we employed the mRMR algorithm to select the top 500 features from the CNV and gene expression data. Then, we searched with a step size of 100 to evaluate the performance of MSFN under different combinations of feature numbers (Sun et al., 2018; Arya and Saha, 2020). Since only 33 features were available in the clinical data, we used all of them. The final results are presented in Table 6. The model's performance gradually improves as the number of features increases, and MSFN achieves the best accuracy, AUC, and recall when the numbers of CNV and gene expression features are set to 200 and 400, respectively. However, as the number of features increases further, the model's performance first remains relatively stable and then gradually decreases. This shows that too many features inevitably introduce noisy information, reducing the model's focus on valuable features and degrading performance. Consequently, we selected the top 200 CNV features and top 400 gene expression features, along with all 33 clinical features, as model inputs.

FIGURE 6
The Kaplan-Meier curves for MSFN.

FIGURE 7
The t-SNE plot of stacked feature classification via MSFN.

Survival analysis
To further validate the survival prediction performance of MSFN, we performed survival analyses on its classification results. We plotted Kaplan-Meier curves to evaluate the performance of MSFN in predicting long-term and short-term survivors (Rich et al., 2010), as illustrated in Figure 6. The Kaplan-Meier survival curves demonstrated a statistically significant difference (p-value < 10^-26) between the long-term and short-term survivors predicted by MSFN. This result shows that MSFN effectively distinguishes between long-term and short-term survivors.
To validate the predictive ability of the stacked feature representations in MSFN, we used the t-SNE algorithm to visualize the prediction results of the stacked feature representations. t-SNE maps data into a low-dimensional space by minimizing the difference between the conditional probabilities (similarities) in the high- and low-dimensional spaces (Van der Maaten and Hinton, 2008; Wattenberg et al., 2016). The visualization result is shown in Figure 7; a clear demarcation between the two groups at around 25 on dimension 1 indicates the excellent survival prediction ability of the hidden-layer features extracted by MSFN.
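The Kaplan-Meier estimate underlying Figure 6 can be sketched in a few lines of numpy; the toy survival times, censoring indicators, and two-group split are illustrative assumptions:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate: at each distinct time t the survival
    probability is multiplied by (1 - d_t / n_t), where d_t is the number
    of events at t and n_t the number still at risk."""
    order = np.argsort(times)
    times = np.asarray(times)[order]
    events = np.asarray(events)[order]
    s, curve = 1.0, []
    for t in np.unique(times):
        at_risk = np.sum(times >= t)
        deaths = np.sum((times == t) & (events == 1))
        if deaths:
            s *= 1.0 - deaths / at_risk
        curve.append((float(t), s))
    return curve

# Hypothetical predicted groups: months of follow-up, 1 = death observed
short_term = kaplan_meier([5, 10, 12, 20], [1, 1, 1, 1])
long_term  = kaplan_meier([60, 70, 80, 90], [0, 1, 0, 0])
```

In practice, the `lifelines` package's `KaplanMeierFitter` and a log-rank test are the standard tools for producing such curves and the associated p-value.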

Conclusion
Breast cancer is the most prevalent cancer worldwide and poses a major threat to women's health. Survival prediction can avoid the suffering caused by over-treatment and the waste of medical resources, which is significant for cancer treatment and prognosis. In this study, we propose a novel multi-omics stacked fusion network (MSFN) for breast cancer survival prediction. MSFN integrates the patient similarity, correlation, and specificity information of multi-omics data, providing more comprehensive insight for survival prediction and effectively enhancing prediction ability. First, MSFN constructs a patient similarity network and obtains patient similarity information and multi-omics correlation information through ResGCN. Meanwhile, MSFN obtains the specificity information of multi-omics data through CNNs. Finally, MSFN uses the stacking strategy to integrate the prognostic information and predicts patient survival with AdaboostRF. Experiments on the TCGA breast cancer dataset showed that MSFN outperformed state-of-the-art methods in survival prediction. In future work, we will focus on survival regression. Furthermore, we will explore the interpretability of survival prediction models to understand their decision-making processes and interpret their results.

• Stacked RF (Arya and Saha, 2020): Stacked RF is a CNN-based cancer survival prediction method. It trains RF to predict breast cancer survival by stacking the hidden-layer feature representations of three CNNs.

FIGURE 5
Impact of different ResGCN layers on the model performance.

TABLE 1
Overview of the dataset.

TABLE 2
Feature selection.

TABLE 3
Performance comparison of MSFN and comparison methods.Bold values represent the highest performance of the model on this metric in the experiment.

TABLE 5
Performance comparison of different omics data.Bold values represent the highest performance of the model on this metric in the experiment.

TABLE 4
Performance comparison between different variants of MSFN.Bold values represent the highest performance of the model on this metric in the experiment.

TABLE 6
The results of incremental feature number selection.