Identification of circRNA‐disease associations via multi‐model fusion and ensemble learning

Abstract Circular RNA (circRNA) is a common non-coding RNA that plays an important role in the diagnosis and therapy of human diseases, and computational prediction of circRNA-disease associations can provide a new route to better clinical diagnosis. In this article, we propose a novel ensemble learning-based method for circRNA-disease association prediction, named ELCDA. First, a heterogeneous association network is constructed by collecting multiple kinds of information on circRNAs and diseases, and multiple similarity measures are adopted. Then, metapath-based, matrix factorization-based and GraphSAGE-based models are used to extract node features from different views, and the final comprehensive features of circRNAs and diseases are obtained via ensemble learning. Finally, a soft voting ensemble strategy integrates the predicted results of all classifiers. The performance of ELCDA is evaluated by fivefold cross-validation and compared with other state-of-the-art methods; the experimental results show that ELCDA outperforms them. Furthermore, three common diseases are used as case studies, which also demonstrate that ELCDA is an effective method for predicting circRNA-disease associations.


| INTRODUCTION
Circular RNAs (circRNAs) are a class of non-coding RNA with a closed structure, and accumulating evidence indicates that circRNAs play an important role in biological processes such as the genetic aetiology of complex human diseases. 1 circRNA was first observed in the cytoplasm of eukaryotic cells. 2 In the past, research on circRNA was limited by technology, but with the rapid development of high-throughput sequencing in recent years, the number of known circRNAs has grown exponentially and multiple circRNA databases have been established: circBase, 3 which collects information on circRNAs from multiple species; circBank, 4 a comprehensive database of more than 140,000 annotated human circRNAs, built by further analysing and processing the human data in circBase, which also provides miRNA-circRNA interactions; circNet, 5 an updated database for exploring circular RNA regulatory networks in cancers; circFunBase, 6 a web-accessible database that provides a high-quality functional circRNA resource; exoRBase, 7 which provides comprehensive annotation and expression landscapes of circRNAs; and circR2Disease, 8,9 circRNADisease 10 and circ2Disease v2.0, 11 which are manually curated databases of experimentally supported human circRNAs related to diseases.
However, identifying potential circRNA-disease associations via wet-lab experiments is time-consuming and costly, which urges researchers to explore effective computational methods based on known associations and biological information. 12 These methods can be roughly divided into two categories: traditional machine learning-based methods and deep learning-based methods.
Traditional machine learning-based methods usually treat association prediction as a binary classification problem. Fan et al. 13 proposed a method (KATZHCDA) based on the KATZ measure for predicting unknown circRNA-disease associations; however, the network structure has a significant impact on model performance. Zhao et al. 14 proposed a computational method, IBNPKATZ, which is also based on the KATZ measure and relies heavily on the network structure. Yan et al. 15 developed a method (DWNN-RLS) based on Regularized Least Squares (RLS) with a Kronecker product kernel and Decreasing Weight K-Nearest Neighbour (DWNN); because of the Kronecker product computation, it is not suitable for large-scale datasets. Wei et al. 16 proposed a computational method (icircDA-MF) based on Matrix Factorization (MF) that introduces gene information. Zhao et al. 17 proposed a method based on locality-constrained linear coding, but the calculation of the circRNA and disease similarity matrices introduces some bias. Peng et al. 18 proposed a method (RNMFLP) combining Robust Nonnegative Matrix Factorization (RNMF) and Label Propagation (LP). Wang et al. 19 developed a method (KNN-NMF) using weighted K nearest neighbours to reduce the impact of false-negative associations on prediction performance; however, the similarity networks of the above three models depend only on topology information and ignore biological attribute information. Zhang et al. 20 predicted associations via metapath2vec++ and matrix factorization; metapath2vec++ requires metapaths to be specified in advance and is inefficient for large-scale networks. Ding et al. 21 predicted associations using a variational graph auto-encoder with matrix factorization, where the variational auto-encoder assumes that the latent variable follows a simple Gaussian distribution, which limits the expressiveness of the learned embeddings. Zhang et al. 22 proposed a method (ICDMOE) for predicting circRNA-disease associations through a multi-objective evolutionary algorithm, but feature interactions are not considered in the model.
Deep learning-based methods usually learn feature embeddings of circRNAs and diseases with neural networks. Wang et al. 23 developed a method (GCNCDA) based on multi-similarity fusion and Fast learning with Graph Convolutional Network (FastGCN), where GCN is sensitive to graph structure and has limited generalization capability. Bian et al. 24 proposed a method (GATCDA) to predict circRNA-disease associations based on a graph attention network; the performance of the model depends heavily on the attention mechanism, which requires careful tuning and optimization. Zheng et al. 25 developed a method (iCDA-CGR) based on chaos game representation to identify circRNA-disease associations, where the chaos game needs a large number of iterations to obtain expressive representations. Ji et al. 26 proposed a method (GATNNCDA) that combines a Graph Attention Network (GAT) and a multi-layer neural network to infer disease-related circRNAs, but the circRNA similarity network is highly dependent on the circRNA-disease network. Wang et al. 27 predicted unknown associations based on GraRep; GraRep was proposed only for homogeneous graphs, so its performance on heterogeneous graphs is limited. Chen et al. 28 proposed a method based on a signed heterogeneous graph network; because of its computational complexity, it is not suitable for large-scale graphs. Chen et al. 29 proposed a method (RGCNCDA) based on a Relational Graph Convolutional Network (RGCN) that incorporates microRNA (miRNA) to improve prediction performance; however, RGCN mainly focuses on local information and ignores global information. Guo et al. 30 proposed a method (THGNCDA) using a graph neural network with attention to learn the importance of each neighbour, but the model complexity is relatively high.
We propose a novel Ensemble Learning-based CircRNA-Disease Association prediction method (ELCDA for short) in this work. First, a heterogeneous network is constructed and multiple similarities are calculated from different views. Then, MAGNN (metapath aggregated graph neural network), 31 CMF 32 and GraphSAGE 33 are used to obtain comprehensive representations of circRNAs and diseases. Finally, the embeddings obtained by the different models are fed into different classifiers, and a soft voting strategy fuses the classification results into the final prediction.
In summary, the main contributions of this study are as follows:
1. A 3-layer heterogeneous network is constructed among circRNAs, miRNAs and diseases, and 4 different similarity measurements are calculated from multiple views;
2. The metapath-based feature extractor is mainly used to capture global information, GraphSAGE is used to obtain local, nonlinear features, and linear information is obtained via MF; the comprehensive representation is obtained by integrating these features;
3. Multiple classifiers are used, and an ensemble learning method is then adopted to obtain the final predicted results.

| Problem description
The network of circRNA-disease associations can be considered a bipartite network. Assuming there are $m$ circRNAs and $n$ diseases in the network, the nodes can be denoted as two sets $T_C = \{c_1, c_2, \cdots, c_m\}$ and $T_D = \{d_1, d_2, \cdots, d_n\}$, and there are three types of edges between nodes, denoted as $E = \{e_{cc}, e_{cd}, e_{dd}\}$, where $e_{cc}$ and $e_{dd}$ are the similarities between circRNAs and between diseases, respectively, and $e_{cd}$ is the association between a circRNA and a disease: if circRNA $c$ is related to disease $d$, $e_{cd} = 1$; otherwise, $e_{cd} = 0$. The goal of our study is to reconstruct the adjacency matrix between circRNAs and diseases so that it is as similar as possible to the original adjacency matrix; values greater than 0 in the reconstructed matrix indicate that the corresponding circRNAs and diseases may be associated. As shown in Figure 1, the black solid lines and black dashed lines represent the known and predicted associations, respectively.

| Materials
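In this paper, we collect information on circRNAs, miRNAs and diseases. The known associations among circRNAs, miRNAs and diseases are downloaded from circBank, circR2Disease V2.0 and HMDD V3.2; 34 after data preprocessing, we obtain a dataset containing 2223 circRNAs, 996 miRNAs and 199 diseases, the details of which are shown in Table 1. Based on the assumption that similar circRNAs tend to be related to similar diseases, several kinds of information are introduced to calculate the similarity matrices of circRNAs and diseases. In circRNA space, the expression profile similarity and functional similarity are used to build the circRNA similarity network; in disease space, the semantic similarity and Gaussian interaction profile (GIP) kernel similarity are used to construct the disease similarity network. Furthermore, PathSim and HeteSim are used here. The details are as follows.

Definition 1. Heterogeneous graph. 35 A graph can be denoted as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. $\Gamma_v$ and $\Gamma_e$ are the sets of node types and edge types, respectively, with two mappings $\phi_v: V \to \Gamma_v$ and $\phi_e: E \to \Gamma_e$; if $|\Gamma_v| + |\Gamma_e| > 2$, $G$ is a heterogeneous graph.

Definition 2. Metapath. 36 A metapath $P$ is a special path that connects two entities in the form $o_1 \xrightarrow{R_1} o_2 \xrightarrow{R_2} \cdots \xrightarrow{R_{q-1}} o_q$, where $R = R_1 \circ R_2 \circ \cdots \circ R_{q-1}$ is the composite relation between start node $o_1$ and target node $o_q$, and $q$ is the length of the path.

Definition 3. PathSim. 36 Given a symmetric metapath $P$, the PathSim between two objects $x$ and $y$ of the same type is defined as follows:

$$s(x, y) = \frac{2 \times |\{p_{x \to y} : p_{x \to y} \in P\}|}{|\{p_{x \to x} : p_{x \to x} \in P\}| + |\{p_{y \to y} : p_{y \to y} \in P\}|}$$

where $p_{x \to y}$ is a metapath instance from $x$ to $y$.

Definition 4. HeteSim. 37 Given a relevance path $P$ corresponding to the relation $R$ defined above, the HeteSim between two objects $x$ and $y$ is:

$$HeteSim(x, y \mid R_1 \circ \cdots \circ R_{q-1}) = \frac{1}{|O(x \mid R_1)|\,|I(y \mid R_{q-1})|} \sum_{i=1}^{|O(x \mid R_1)|} \sum_{j=1}^{|I(y \mid R_{q-1})|} HeteSim\left(O_i(x \mid R_1), I_j(y \mid R_{q-1}) \mid R_2 \circ \cdots \circ R_{q-2}\right)$$

where $O(x \mid R_1)$ is the set of out-neighbours of $x$ under relation $R_1$ and $I(y \mid R_{q-1})$ is the set of in-neighbours of $y$ under relation $R_{q-1}$. The HeteSim can be further simplified into the following form:

$$HeteSim(x, y \mid P) = \frac{T_{P_L}(x, :)\, T_{P_R^{-1}}(y, :)^{\top}}{\lVert T_{P_L}(x, :) \rVert_2 \, \lVert T_{P_R^{-1}}(y, :) \rVert_2}$$

FIGURE 1
The circRNA-disease association prediction problem.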
Assuming the middle node between $x$ and $y$ via path $P$ is $mid$, we can split $P$ into $P_L = (x \cdots mid)$ and $P_R = (mid \cdots y)$. $T$ is the transition probability matrix, which can be calculated by row normalization as $T_{XY}(i, j) = A_{XY}(i, j) / \sum_{k} A_{XY}(i, k)$, where $A_{XY}$ is the adjacency matrix between node types $X$ and $Y$.
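The two measures can be illustrated on the symmetric metapath $P = CDC$ with a few lines of Python. The sketch below is not the authors' implementation: the adjacency matrix `A_CD` is a hypothetical toy network, PathSim counts path instances through $A A^{\top}$, and HeteSim is assumed to be the cosine of the row-normalized transition rows, which is one common reading of the simplified form.

```python
import math

# Hypothetical toy circRNA-disease adjacency matrix (rows: c1..c4, columns: d1..d3).
A_CD = [
    [1, 0, 0],  # c1
    [1, 1, 0],  # c2
    [0, 0, 1],  # c3
    [1, 1, 0],  # c4
]


def row_normalize(matrix):
    """Turn an adjacency matrix into a transition matrix (non-empty rows sum to 1)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([v / total if total else 0.0 for v in row])
    return result


def path_count(adj, x, y):
    """Number of C-D-C path instances between circRNAs x and y, i.e. (A A^T)[x][y]."""
    return sum(a * b for a, b in zip(adj[x], adj[y]))


def pathsim(adj, x, y):
    """PathSim: twice the x->y path count over the sum of the two self path counts."""
    denom = path_count(adj, x, x) + path_count(adj, y, y)
    return 2 * path_count(adj, x, y) / denom if denom else 0.0


def hetesim(adj, x, y):
    """Assumed normalized HeteSim for C-D-C: cosine of rows x and y of T_CD."""
    t_cd = row_normalize(adj)
    dot = sum(a * b for a, b in zip(t_cd[x], t_cd[y]))
    norm_x = math.sqrt(sum(a * a for a in t_cd[x]))
    norm_y = math.sqrt(sum(b * b for b in t_cd[y]))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


c2, c4 = 1, 3  # zero-based indices of c2 and c4
print(path_count(A_CD, c2, c4), pathsim(A_CD, c2, c4), hetesim(A_CD, c2, c4))
```

In this toy network $c_2$ and $c_4$ share both of their diseases, so there are two $CDC$ path instances between them and both scores reach their maximum; pairs with no shared disease score 0 under both measures.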
As shown in Figure 3 Then the HeteSim score between c 2 and c 4 is:

| Disease semantic similarity
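In the MeSH database, each disease can be expressed as a directed acyclic graph (DAG). A disease $d_i$ can be represented as $DAG_{d_i} = (d_i, T_{d_i}, E_{d_i})$, where $T_{d_i}$ is the set of ancestor nodes of $d_i$ (including $d_i$ itself) and $E_{d_i}$ is the set of corresponding edges. The semantic contribution of a disease $d_t$ in $DAG_{d_i}$ is calculated by

$$D_{d_i}(d_t) = \begin{cases} 1, & d_t = d_i \\ \max\{\Delta \cdot D_{d_i}(d_t') \mid d_t' \in \text{children of } d_t\}, & d_t \neq d_i \end{cases}$$

where $\Delta$ is the semantic contribution factor (following previous studies, we set $\Delta = 0.5$ here). The semantic value of disease $d_i$ is then defined as

$$DV(d_i) = \sum_{d_t \in T_{d_i}} D_{d_i}(d_t)$$

and the semantic similarity between $d_i$ and $d_j$ is defined as

$$SS(d_i, d_j) = \frac{\sum_{d_t \in T_{d_i} \cap T_{d_j}} \left( D_{d_i}(d_t) + D_{d_j}(d_t) \right)}{DV(d_i) + DV(d_j)}$$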
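As a rough illustration of DAG-based semantic similarity, the sketch below assumes the widely used formulation in which a disease contributes 1 to itself, each step up to an ancestor multiplies the contribution by $\Delta$, and the similarity is the shared contribution divided by the sum of the two semantic values. The `PARENTS` map is a hypothetical MeSH-like fragment, not data from the paper.

```python
# Hypothetical child -> parents map standing in for a small MeSH fragment.
PARENTS = {
    "hepatocellular carcinoma": ["liver neoplasms"],
    "liver neoplasms": ["neoplasms"],
    "lung neoplasms": ["neoplasms"],
    "neoplasms": [],
}


def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of every ancestor (and the disease itself) to `disease`."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, []):
                value = delta * contrib[node]
                # Keep the largest contribution over all paths to this ancestor.
                if value > contrib.get(parent, 0.0):
                    contrib[parent] = value
                    next_frontier.append(parent)
        frontier = next_frontier
    return contrib


def semantic_similarity(d_i, d_j, parents, delta=0.5):
    """Shared semantic contribution divided by the sum of both semantic values."""
    c_i = semantic_contributions(d_i, parents, delta)
    c_j = semantic_contributions(d_j, parents, delta)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))


print(semantic_similarity("hepatocellular carcinoma", "lung neoplasms", PARENTS))
```

Two diseases that only share a distant ancestor get a low score, while a disease and its direct parent share most of their DAG and score much higher.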

| Disease Gaussian Interaction Profile (GIP) kernel similarity
The GIP kernel similarity is widely used to measure similarities among biomolecules. From the HMDD v3.2 database, we obtain the miRNA-disease association matrix $MD$, each column of which can be considered the interaction profile of a disease. Given two diseases $d_i$ and $d_j$, the GIP kernel similarity between them is calculated as follows:

$$GIP(d_i, d_j) = \exp\left(-\gamma_d \lVert MD(:, i) - MD(:, j) \rVert^2\right), \quad \gamma_d = 1 \Big/ \left( \frac{1}{n} \sum_{k=1}^{n} \lVert MD(:, k) \rVert^2 \right)$$

where $MD(:, i)$ is the $i$-th column of $MD$ and $n$ is the number of diseases.
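A minimal sketch of the GIP kernel follows, assuming the usual form $\exp(-\gamma_d \lVert \cdot \rVert^2)$ with the bandwidth $\gamma_d$ set to the reciprocal of the mean squared column norm. The matrix `MD` is a hypothetical toy example, and the function name `gip_kernel` is illustrative only.

```python
import math

# Hypothetical miRNA-disease association matrix (rows: miRNAs, columns: diseases).
MD = [
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
]


def gip_kernel(assoc):
    """GIP kernel similarity between the columns (disease profiles) of `assoc`."""
    n_rows, n_cols = len(assoc), len(assoc[0])
    profiles = [[assoc[r][j] for r in range(n_rows)] for j in range(n_cols)]
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n_cols
    # Assumed bandwidth: reciprocal of the mean squared norm of the profiles.
    gamma = 1.0 / mean_sq_norm if mean_sq_norm else 0.0
    kernel = []
    for p_i in profiles:
        row = []
        for p_j in profiles:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p_i, p_j))
            row.append(math.exp(-gamma * sq_dist))
        kernel.append(row)
    return kernel


K = gip_kernel(MD)
for row in K:
    print([round(v, 4) for v in row])
```

The resulting matrix is symmetric with ones on the diagonal, and diseases with more similar miRNA interaction profiles receive values closer to 1.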
The disease similarity matrix $SD_{bio}$ is obtained by combining the semantic similarity and the GIP kernel similarity.

FIGURE 2
The flowchart of ELCDA.

| CircRNA expression profile similarity
exoRBase integrates RNA expression profile information based on normalized RNA-seq data. The expression profile of circRNA $c_i$ can be expressed as $f_i = (f_{i1}, f_{i2}, \cdots, f_{ih})$, and the Spearman correlation coefficient is used to measure the similarity between circRNAs $c_i$ and $c_j$:

$$\rho(c_i, c_j) = 1 - \frac{6 \sum_{k=1}^{h} d_k^2}{h(h^2 - 1)}$$

where $d_k = f_{ik} - f_{jk}$ is the difference of ranks and $h$ is the dimension of the feature vector.
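The rank-difference form of the Spearman coefficient can be sketched as below. The sketch assumes there are no tied expression values (ties would need average ranks), and the expression profiles are hypothetical.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(f_i, f_j):
    """Spearman correlation via 1 - 6 * sum(d_k^2) / (h * (h^2 - 1))."""
    h = len(f_i)
    r_i, r_j = ranks(f_i), ranks(f_j)
    sum_sq = sum((a - b) ** 2 for a, b in zip(r_i, r_j))
    return 1 - 6 * sum_sq / (h * (h * h - 1))


# Hypothetical expression profiles of two circRNAs over h = 4 samples.
profile_a = [0.1, 0.5, 0.3, 0.9]
profile_b = [2.0, 1.0, 3.0, 4.0]
print(spearman(profile_a, profile_b))
```

Because only ranks enter the formula, the coefficient is insensitive to the scale of the expression values, which is convenient when profiles come from differently normalized samples.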

| circRNA functional similarity
After obtaining the disease similarity matrix $SD$, we can define the circRNA functional similarity on top of it. Finally, by combining the expression profile similarity and the functional similarity, we obtain the circRNA similarity $SC_{bio}$.

| Integrated similarity for circRNAs and diseases
The disease similarity and circRNA similarity are calculated as follows:

| Metapath-based feature extractor
As shown in Figure 3(A), we select eight different metapaths on the circRNA-miRNA-disease heterogeneous network. Because the numbers of circRNAs, miRNAs and diseases differ, we apply a node type-specific linear transformation layer to project the different types of nodes into the same vector space, that is, $h_c = W_c^{\top} S_c$ and $h_d = W_d^{\top} S_d$, where $W_c \in \mathbb{R}^{m \times l}$ and $W_d \in \mathbb{R}^{n \times l}$ are the weight matrices, $S_c \in \mathbb{R}^{m}$ and $S_d \in \mathbb{R}^{n}$ are the original feature vectors of the different types of nodes, and $l$ is the dimension of the vector space. Here, we use the integrated similarity matrices of circRNAs and diseases as the original features: $S_c$ is the $c$-th row of $SC$ and $S_d$ is the $d$-th row of $SD$.
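The effect of the type-specific projection can be seen in a small sketch: feature vectors of different lengths (one per node type) are mapped to vectors of the same length $l$. The weights and features below are hypothetical fixed numbers; in a trained model the weight matrices would be learned, and the exact layer used by the authors may differ.

```python
def project(weights, features):
    """Return W^T s: maps a length-k feature vector to an l-dimensional vector."""
    l = len(weights[0])
    return [
        sum(weights[r][c] * features[r] for r in range(len(features)))
        for c in range(l)
    ]


# Hypothetical sizes: m = 3 circRNAs, n = 2 diseases, shared dimension l = 2.
W_c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # m x l, circRNA-specific weights
W_d = [[0.5, 0.5], [1.0, -1.0]]             # n x l, disease-specific weights

s_c = [1.0, 0.5, 0.2]  # one row of the circRNA similarity matrix (length m)
s_d = [1.0, 0.4]       # one row of the disease similarity matrix (length n)

h_c = project(W_c, s_c)
h_d = project(W_d, s_d)
print(h_c, h_d)  # both vectors now have length l
```

After this step circRNA and disease nodes live in the same space, so features along a metapath instance can be combined regardless of node type.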
A metapath instance encoder is introduced here to transform the features of all nodes along an instance into a single vector, where $p(o_1, o_t)$ is a metapath instance connecting entities $o_1$ and $o_t$.
Then, a multi-head attention mechanism is used to aggregate the instances under the same metapath; the goal is to learn the weight of each instance, and the weighted sum of all instances is taken as the features of the node.
The attention mechanism is also used to aggregate the information of different metapaths, where $v_i$, $i = 1, 2$, corresponds to circRNA and disease nodes, $|\Gamma_{v_i}|$ is the number of nodes in $\Gamma_{v_i}$, and $p_i$ denotes the metapath instances related to node type $i$.
In the objective function of the metapath-based feature extractor, $N$ is the number of samples.

| Matrix factorization-based feature extractor
Matrix factorization (MF) projects the features of circRNAs and diseases onto the same low-dimensional vector space. As shown in Figure 3(B), the goal of MF is to minimize a reconstruction objective function. An indicator matrix $W$ is introduced here: if there is a known association between a circRNA-disease pair, $W_{ij} = 1$; otherwise, $W_{ij} = 0$. In the resulting objective function, $H$ is the adjacency matrix of the circRNA-disease association network, $C$ and $D$ are the latent feature matrices of circRNAs and diseases, the regularization coefficients are trade-off parameters, $\lVert \cdot \rVert_F^2$ is the square of the Frobenius norm, and $SC$ and $SD$ are the similarity matrices of circRNAs and diseases. The alternating direction multiplier update rule is used for optimization.
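To make the role of the indicator matrix concrete, the sketch below fits only the masked reconstruction term $\lVert W \odot (H - C D^{\top}) \rVert_F^2$ with an $L_2$ penalty. It is a simplified stand-in, not the model described above: the similarity regularizers on $SC$ and $SD$ are omitted, plain stochastic gradient descent replaces the alternating update rule, and `masked_mf`, `score` and all hyperparameter values are hypothetical.

```python
import random


def score(C, D, i, j):
    """Predicted association score: inner product of the latent vectors."""
    return sum(c * d for c, d in zip(C[i], D[j]))


def masked_mf(H, W, rank=2, lam=0.01, lr=0.05, epochs=500, seed=0):
    """Fit H ~ C D^T on entries where W == 1, using SGD with L2 regularization."""
    rng = random.Random(seed)
    m, n = len(H), len(H[0])
    C = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(m)]
    D = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n)]
    for _ in range(epochs):
        for i in range(m):
            for j in range(n):
                if W[i][j] == 0:
                    continue  # entries outside the indicator do not contribute
                err = H[i][j] - score(C, D, i, j)
                for k in range(rank):
                    c_ik, d_jk = C[i][k], D[j][k]
                    C[i][k] += lr * (err * d_jk - lam * c_ik)
                    D[j][k] += lr * (err * c_ik - lam * d_jk)
    return C, D


# Hypothetical 3 x 3 association matrix; W marks the known associations.
H = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]
W = [row[:] for row in H]
C, D = masked_mf(H, W)
print(round(score(C, D, 0, 0), 3), round(score(C, D, 2, 2), 3))
```

With only the masked term, nothing ties the latent vectors to biological similarity, so scores for unknown pairs are driven purely by shared factors; this is where the similarity terms of the full objective add information.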

| GraphSAGE-based feature extractor
Traditional graph convolutional networks (GCNs) update the node representations of the whole graph in each iteration. When the graph is large, this training strategy is time-consuming and may even become infeasible, which prompted researchers to introduce the idea of mini-batches into GCN algorithms; as a result, the GraphSAGE algorithm was proposed.
The details of the GraphSAGE algorithm can be summarized as follows: 1. Neighbour sampling: unlike traditional GCN algorithms, GraphSAGE updates the representation of the target node using the information of a fixed-size sample of its neighbours. Specifically, if the number of neighbours is greater than the pre-defined number of samples, under-sampling is used; conversely, if the number of neighbours is less than the pre-defined number of samples, over-sampling (resampling with replacement) is used, as shown in Figure 3(C).
2. Aggregation: for simplicity, the mean aggregator is used in this study, that is, $h_v^{k} = \sigma\left(W^{k} \cdot \mathrm{MEAN}\left(\{h_v^{k-1}\} \cup \{h_u^{k-1}, \forall u \in N(v)\}\right)\right)$, where $N(v)$ is the sampled neighbourhood of node $v$, $W^{k}$ is the weight matrix of layer $k$, $\sigma$ is a nonlinear activation function, and $h_v^0$ is the original feature representation of node $v$, given by the similarity matrices of circRNAs and diseases.
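The two steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the graph `ADJ` and features `FEATURES` are hypothetical, and the learned weight matrix and nonlinearity of a real GraphSAGE layer are left out so that only fixed-size sampling and mean aggregation remain.

```python
import random


def sample_neighbours(neighbours, k, rng):
    """Return exactly k neighbours: under-sample if there are too many,
    over-sample with replacement if there are too few."""
    if not neighbours:
        return []
    if len(neighbours) >= k:
        return rng.sample(neighbours, k)                 # without replacement
    return [rng.choice(neighbours) for _ in range(k)]    # with replacement


def mean_aggregate(node, features, adj, k, rng):
    """Mean of the node's own features and those of its sampled neighbours."""
    sampled = sample_neighbours(adj[node], k, rng)
    vectors = [features[node]] + [features[u] for u in sampled]
    dim = len(features[node])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


# Hypothetical graph: node 0 has many neighbours, node 2 has a single one.
ADJ = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
FEATURES = {
    0: [1.0, 0.0], 1: [0.0, 2.0], 2: [0.0, 1.0],
    3: [2.0, 2.0], 4: [4.0, 0.0], 5: [0.0, 4.0],
}

rng = random.Random(0)
print(sample_neighbours(ADJ[0], 3, rng))        # three distinct neighbours
print(mean_aggregate(2, FEATURES, ADJ, 3, rng))
```

Fixing the sample size keeps the cost of each update independent of node degree, which is what makes mini-batch training on large graphs practical.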

| Model fusion via ensemble learning
To obtain optimal performance, ensemble learning is used here. In this study, several classic classifiers are chosen: support vector machine (SVM), 38 random forest (RF), 39 extreme gradient boosting (XGBoost), 40 light gradient boosting machine (LightGBM) 41 and Gaussian naïve Bayes (GaussianNB), 42 where RF is a variant of bagging, and XGBoost and LightGBM are boosting algorithms.
After obtaining the classification results from the different models and classifiers, a soft voting strategy is used to obtain the final predicted result for each circRNA-disease pair (Algorithm 1).
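Soft voting averages the predicted probabilities of the individual classifiers instead of counting their hard labels. The sketch below assumes equal weights and a 0.5 decision threshold, neither of which is stated in the text, and uses hypothetical probabilities; `hard_vote` is included only for contrast.

```python
def soft_vote(prob_lists, weights=None):
    """Weighted average of per-classifier positive-class probabilities."""
    weights = weights or [1.0] * len(prob_lists)
    total = sum(weights)
    n_samples = len(prob_lists[0])
    return [
        sum(w * probs[s] for w, probs in zip(weights, prob_lists)) / total
        for s in range(n_samples)
    ]


def hard_vote(prob_lists, threshold=0.5):
    """Majority vote over the thresholded labels of each classifier."""
    n_models = len(prob_lists)
    n_samples = len(prob_lists[0])
    return [
        int(2 * sum(probs[s] >= threshold for probs in prob_lists) > n_models)
        for s in range(n_samples)
    ]


# Hypothetical positive-class probabilities from three classifiers on three pairs.
probabilities = [
    [0.90, 0.20, 0.90],
    [0.80, 0.40, 0.45],
    [0.70, 0.30, 0.45],
]

fused = soft_vote(probabilities)
soft_labels = [int(p >= 0.5) for p in fused]
print(fused, soft_labels, hard_vote(probabilities))
```

On the third pair one confident classifier outweighs two marginally negative ones under soft voting, whereas hard voting discards that confidence information.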

| Evaluation metrics
To evaluate the performance of our model, we compare the proposed model with other state-of-the-art methods under fivefold cross-validation (5-cv). Specifically, the known circRNA-disease associations in circR2Disease v2.0 are taken as positive samples, and we randomly select the same number of negative samples, which gives a balanced data set with 5940 samples. The evaluation indicators include AUC (the area under the ROC curve), AUPR (the area under the precision-recall curve), Accuracy, Recall and F1-score. Since we treat association prediction as a binary classification problem, the indicators are defined in terms of the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).
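The threshold-based indicators follow directly from the confusion counts: Accuracy = (TP + TN) / (TP + TN + FP + FN), Recall = TP / (TP + FN), Precision = TP / (TP + FP) and F1-score = 2 × Precision × Recall / (Precision + Recall). A small sketch with hypothetical labels (AUC and AUPR are not covered here, since they need ranked scores rather than hard labels):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, Recall, Precision and F1-score from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Hypothetical balanced test fold: three positive and three negative pairs.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(classification_metrics(y_true, y_pred))
```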

| Parameters analysis
In this section, we analyse two main parameters of ELCDA: the number of attention heads $K$ in MAGNN and the aggregator used in GraphSAGE. The results are shown in Figure 4; ELCDA obtains the best performance when $K$ is 8 and the aggregator is mean.

| Comparison with other methods
In this paper, we compare ELCDA with seven other state-of-the-art (SOTA) methods.
Steps 2 and 3 of Algorithm 1: 2. Using the selected classifiers to obtain the predicted results; 3. Using the soft voting strategy to obtain the final predicted results.

FIGURE 4
AUCs with different parameter combinations.
Without such balancing, the results under AUPR and some other evaluation metrics are less than ideal. Other indicators are listed in Table 3, where the bold values are the maximums; from these results we can also observe that ELCDA is superior to the other SOTA methods in most cases.

| Ablation studies
We use three feature extractors and various classifiers in this paper. Ablation studies are conducted to illustrate the effectiveness of the different modules, and the results are shown in Figure 6. The proposed model obtains the best performance on the different metrics. SVM is a common and basic classifier, but it is not applicable to cases with a lot of missing data; RF is a common bagging classifier that performs well in most cases, but it may overfit in noisy classification problems; XGBoost and LightGBM are variants of the gradient boosting decision tree (GBDT) algorithm, which are faster and more robust, but they do not consider that the optimal solution is a synthesis of all features; GaussianNB is a classifier based on naïve Bayes, which is extremely fast but performs poorly on large data sets. Voting is a classical ensemble learning strategy, and compared with hard voting, soft voting can achieve higher classification performance.

| Case studies
Hepatocellular carcinoma (HCC), breast cancer (BC) and lung cancer (LC) are used to demonstrate the effectiveness of the proposed model.
At present, the global incidence of liver cancer is on the rise. It is estimated that by 2025, the annual number of liver cancer cases will exceed one million; HCC is the most common type of liver cancer, accounting for about 90% of cases. 48 HCC is a prototypical inflammation-associated cancer and the most common type of primary liver cancer in adults. Most patients have no symptoms in the early stages. HCC is a primary cancer that originates from hepatocytes in extensively hardened liver tissue. LC is one of the most common and deadliest cancers worldwide; there are two main types of lung cancer, non-small cell lung cancer and small cell lung cancer, of which the former is more common.

FIGURE 5
The ROC curves (left) and PR curves (right) under 5-cv on different models.

TABLE 3
The performance results of methods.

| CONCLUSIONS
As research deepens, an increasing body of evidence suggests that circRNAs play a crucial role in the occurrence and development of complex human diseases and can be regarded as biomarkers for diagnosis, treatment and prognosis.
More and more studies apply computational models to experimentally verified circRNA-disease associations, but most existing methods ignore the information carried by the heterogeneous network and the intermediate nodes (such as miRNAs). The proposed model still has shortcomings, which can be addressed in future work. First, the heterogeneous network can be constructed with different links; that is, other nodes, such as genes, can be introduced into the model, and circRNA-gene, gene-disease and gene-gene associations can then be used to further enrich the biological information of circRNAs and diseases. Second, with more associations introduced, more kinds of metapaths can be selected, which will make the model more effective and robust.

Furthermore, we analyse the frequency distribution of each type of association, as shown in Table 2: (A) the number of circRNA-related diseases; (B) the number of disease-related circRNAs; (C) the number of circRNA-related miRNAs; (D) the number of miRNA-related circRNAs; (E) the number of miRNA-related diseases; (F) the number of disease-related miRNAs. It can be seen that most circRNAs (about 80%) are related to only one disease, which demonstrates that the adjacency matrix of the circRNA-disease heterogeneous network is very sparse. The overview of the proposed model is shown in Figure 2; it consists mainly of three modules: heterogeneous network construction, feature extraction and association prediction. Specifically, the high-quality and sub-structural features of nodes are obtained via the metapath-based feature extractor, the low-level and linear features are obtained via the matrix factorization (MF)-based feature extractor, and the local and nonlinear features are obtained via the GraphSAGE-based feature extractor; ensemble learning is then used to fuse them and obtain the classification results for unknown associations.
The details of calculating the PathSim and HeteSim scores between $c_2$ and $c_4$ are as follows. There are 2 path instances under path $P = CDC$ between $c_2$ and $c_4$.
1. PathSim
2. HeteSim: first, split $P$ into $P_L = CD$ and $P_R = DC$; the adjacency matrices of $P_L$ and $P_R$ are denoted as $A$ and $A^{\top}$, respectively, and the transition matrices $T_{CD}$ and $T_{DC}$ are obtained via row normalization.

FIGURE 3
An example of PathSim and HeteSim.

The compared SOTA methods are: KATZHCDA (2018), 13 which predicts unknown associations based on the KATZ measure; CD-LNLP (2019), 43 which predicts circRNA-disease associations via linear neighbour label propagation; KATZCPDA (2019), 44 which builds on the original KATZHCDA model and takes the impact of proteins into account when predicting associations between circRNAs and diseases; icircDA-MF (2020), 16 which predicts potential disease-associated circRNAs based on matrix factorization and then updates the circRNA-disease interaction profiles with neighbour interaction profiles to correct false-negative associations; DMFCDA (2021), 45 which uses deep matrix factorization to improve the prediction of circRNA-disease associations; GMNN2CD (2022), 46 which uses variational inference and graph Markov neural networks to predict circRNA-disease associations; and AGAEMDA (2023), 47 which predicts unknown associations via a node-level attention graph auto-encoder.

The ROC and PR curves are shown in Figure 5, from which we can observe that the proposed ELCDA has the best performance under both AUC and AUPR, achieving 0.9289 and 0.9239 under 5-cv and outperforming all selected SOTA methods. Specifically, the AUPR values of KATZHCDA, CD-LNLP, KATZCPDA, icircDA-MF, DMFCDA and GMNN2CD are significantly lower than those of AGAEMDA and ELCDA, because the former methods did not adopt any sample balancing strategy. The ratio of positive to negative samples is close to 1:150, which indicates that the dataset used in this paper is extremely imbalanced; neglecting data set balancing and preprocessing may lead to poor results.

$$\mathrm{Accuracy} = \frac{TP + TN}{TN + TP + FN + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad F1\text{-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (23)$$

ALGORITHM 1 Ensemble Learning based CircRNA-Disease Association prediction (ELCDA)
Input: circRNA-disease association matrix CD; circRNA-miRNA association matrix CM; miRNA-disease association matrix MD; circRNA and disease similarity matrices SC, SD; the dimension of vector space l; the number of heads in the metapath-based feature extractor K; the trade-off parameter in the MF-based feature extractor λ.
1. Training the metapath-based, MF-based and GraphSAGE-based feature extractors, obtaining the embeddings of nodes, denoted as $F_1$, $F_2$ and $F_3$, respectively, and concatenating them as the final representations of nodes.

Furthermore, taking HCC as an example, as shown in Figure 7, circITCH can act as a sponge of hsa-miR-184, hsa-miR-224-5p, hsa-miR-20b-5p and hsa-miR-421. This illustrates one of the mechanisms of action of circRNAs: a circRNA can function as a miRNA sponge by binding to miRNAs and preventing their interactions with target mRNAs, thereby affecting the occurrence and development of complex human diseases; thus, circITCH may have an inhibitory effect on HCC. 49 Researchers can work on the downstream mRNA and explore the potential role of a circRNA by screening for upstream circRNAs and identifying the corresponding circRNA-miRNA-mRNA pathway. Indeed, a growing body of research focuses on the interactions among non-coding RNAs (ncRNAs) and other biological entities to better understand their regulatory mechanisms in diseases, for instance, circRNA-miRNA, 50,51 miRNA-lncRNA 52,53 and metabolite-disease 54,55 interaction prediction. By investigating the interactions among ncRNAs, we can gain a better understanding of their roles in cellular regulation and disease development. This knowledge can provide new insights and potential therapeutic targets for disease diagnosis and treatment. However, many aspects still require further research to unravel the regulatory networks and mechanisms of ncRNAs.

To address the drawbacks of existing methods, which ignore intermediate nodes such as miRNAs, in this paper we propose an ensemble learning-based model, ELCDA, for predicting circRNA-disease associations. Compared with previous models, HeteSim and PathSim are introduced here to enhance the model's ability to extract information from the heterogeneous network, and circRNA-miRNA and miRNA-disease associations are included, which not only provide more detailed biological information but also expand the variety of nodes. As the number of node types increases, more kinds of metapaths can be defined. In addition, this study adopts GraphSAGE-based and MF-based feature extractors to obtain comprehensive representations of nodes, and a soft voting strategy is used to get the final predicted results. The results of numerous experiments indicate that ELCDA outperforms most SOTA models.

TABLE 5
Predicted top-20 circRNAs related to BC.

In the DAG of disease $d_i$, $T_{d_i}$ is the set of ancestor nodes of $d_i$ (including $d_i$ itself) and $E_{d_i}$ is the set of corresponding edges, from which the semantic contribution of a disease $d_t$ in the DAG of $d_i$ is calculated.

TABLE 2
Frequency distribution of each type of association.