BRWSP: Predicting circRNA-Disease Associations Based on Biased Random Walk to Search Paths on a Multiple Heterogeneous Network



Introduction
circRNAs are a special type of endogenous noncoding RNAs (ncRNAs), which are widely expressed in various organisms. The discovery of circRNAs dates back to the 1970s. Sanger et al. [1] first observed circRNAs while studying plant viruses by electron microscopy. Over the following decades, circRNAs were gradually found in different species and cells, such as yeast [2], zebrafish [3], and mouse [4]. Because of their low abundance and the lack of known function, circRNAs received little attention for a very long time.
Complex diseases seriously threaten human health [16][17][18]. Therefore, studies on complex diseases have been a hot topic in the field of medicine and bioinformatics [19,20]. As more and more biological functions of circRNAs have been revealed, substantial evidence has indicated that circRNAs play an important role in the emergence and development of complex diseases. According to Liu et al. [21], circRNAs are versatile and can function as microRNA (miRNA) sponges [5,13] and protein sponges [22,23]. For example, circSMARCA5 [24] and circCFH [25] have been found to be expressed in a glioma-specific pattern and may be used as tumor biomarkers. CircNFIX [26] and circNT5E [27] have been found that they

Motivations
(1) Ba-Alawi et al. [38] used a depth-first search algorithm to traverse all simple paths between a specific drug and a specific target protein and then aggregated the scores of these paths to infer drug-target interactions. This algorithm was then extended to identify miRNA-disease associations [39], lncRNA-disease associations [40], circRNA-disease associations [32], and microbe-disease associations [41], and it obtained satisfactory performance. However, this algorithm needs to search for all paths between a specific circRNA and a specific disease. If the network is very large, this type of algorithm cannot handle it well.
Therefore, this type of algorithm cannot be readily extended to a multiple heterogeneous network constructed from many different types of biological networks. Inspired by [42], a biased random walk is proposed to search paths. Compared with the depth-first search algorithm, it chooses paths according to probabilities (see Figure 1(c)). Therefore, if the probability of one path is much smaller than that of the other paths, the walker is very unlikely to select it when choosing the next step.
(2) Recently, many methods [32][33][34] have been proposed to identify circRNA-disease associations based on a heterogeneous network. However, these methods use few types of biological data and depend heavily on the known circRNA-disease associations, which leads to insufficient analysis of circRNA-disease associations from a variety of biological perspectives. Therefore, gene similarity networks and gene-disease associations are introduced to build a multiple heterogeneous network that also contains the circRNA coexpression network, circRNA-disease associations, and the disease similarity network.
The CircR2Disease database contains 725 circRNA-disease associations consisting of 661 circRNAs and 100 diseases. To ensure the accuracy of the data, we only extract circRNAs with circBase IDs and gene symbols. Finally, 427 circRNA-disease associations remain, consisting of 372 circRNAs, 330 gene symbols, and 77 diseases.

Disease Semantic Similarity.
The similarity between diseases can be calculated by a directed acyclic graph (DAG). Firstly, we search the Disease Ontology database (http://www.disease-ontology.org/) [43] for the DOIDs corresponding to the 77 diseases extracted in Section 2.2.1. After deleting diseases without a DOID, the dataset contains 55 diseases with DOIDs, 291 circRNAs, 261 gene symbols, and 340 circRNA-disease associations. Based on disease ontology, Yu et al. [44] created the DOSE package of R, which can calculate disease semantic similarity with the doSim function based on Wang's method [45]. In this study, we adopt the DOSE package to calculate disease semantic similarity.

circRNA Expression Profile.
To calculate the circRNA coexpression similarity network, the circRNA expression profile is downloaded from the exoRBase database (http://www.exorbase.org/) [9]. After converting exor_circ_ID to circBase ID, we eliminate the circRNAs without an expression profile among the 291 circRNAs. The final data contain the expression profiles of 154 circRNAs on 90 samples and 192 circRNA-disease associations consisting of 154 circRNAs (corresponding to 140 gene symbols) and 48 diseases (shown in Figure 2).

Gene-Disease Associations.
To detect associations between the 48 diseases and the 140 genes (corresponding to circRNAs), we download the integrated gene-disease associations from the human_disease_textmining_full.tsv file of the DISEASE database [46]. A confidence score is given to evaluate associations in this database. To ensure the reliability of the data, we only select the gene-disease associations whose confidence score is greater than or equal to 2, according to previous research [47]. In total, among the 48 diseases and 140 gene symbols, we obtain 80 gene-disease associations consisting of 29 diseases and 34 genes.
Besides, we also extract genes associated with the 48 diseases mentioned above from the DISEASE database [46] and the DisGeNET database [48]. Similarly, for the human_disease_experiments_full.csv file of the DISEASE database [46], we only extract gene-disease associations with a confidence score greater than or equal to 2. For the DisGeNET database, the gene-disease associations are extracted from the curated_gene_disease_associations.tsv.gz file. Finally, among the 48 diseases mentioned above, 2193 disease-gene associations are extracted, which contain 37 diseases and 1607 disease-related genes.

Constructing Multiple Heterogeneous Network.
In this paper, we extract 140 gene symbols (corresponding to circRNAs) from CircR2Disease. Based on these gene symbols, a gene similarity network is constructed by mapping gene products to GO annotations [49]. Genes are annotated by cellular component (CC), molecular function (MF), and biological process (BP). Herein, we use the biological process (BP) annotations to measure gene semantic similarity, which has been shown to yield better performance in previous work [50]. Finally, the adjacency matrix GS is used to represent the gene similarity network, and the value GS(i, j) represents the functional similarity between gene i and gene j, which can be calculated by the geneSim function in the GOSemSim package of R [49].
The adjacency matrix CD is constructed to represent circRNA-disease associations, and CD(i, j) is equal to 1 when circRNA c(i) is associated with disease d(j); similarly, the adjacency matrices CG and GD are used to describe circRNA-gene interactions and gene-disease associations, respectively. Besides, we employ the adjacency matrix DS to describe disease semantic similarity, in which DS(i, j) indicates the semantic similarity between disease d(i) and disease d(j). For the circRNA coexpression similarity CS, CS(i, j) represents the similarity between circRNA c(i) and circRNA c(j), which is calculated as the Pearson correlation coefficient of their circRNA expression profiles.
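As an illustration of the coexpression step, the following minimal sketch computes a Pearson correlation matrix from an expression table. The function name `coexpression_similarity`, the array layout (one row per circRNA, one column per sample), and the handling of constant profiles are assumptions made for this example; the text does not state how negative or undefined correlations are treated.

```python
import numpy as np

def coexpression_similarity(expr):
    """Return the Pearson correlation matrix CS for an expression table.

    expr: array of shape (n_circRNAs, n_samples); rows are circRNAs.
    """
    expr = np.asarray(expr, dtype=float)
    # np.corrcoef treats each row as one variable, so CS[i, j] is the
    # Pearson correlation between the profiles of circRNA i and circRNA j.
    with np.errstate(invalid="ignore", divide="ignore"):
        cs = np.corrcoef(expr)
    # A constant profile has zero variance and yields NaN; treating it as
    # zero similarity is an assumption of this sketch, not of the paper.
    cs = np.nan_to_num(cs, nan=0.0)
    np.fill_diagonal(cs, 1.0)
    return cs

# Toy example: three circRNAs measured on four samples.
profiles = [[1.0, 2.0, 3.0, 4.0],
            [2.0, 4.0, 6.0, 8.0],
            [4.0, 3.0, 2.0, 1.0]]
CS = coexpression_similarity(profiles)
```

In the toy example the first two profiles are perfectly correlated and the third is perfectly anti-correlated with both.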
When predicting circRNA-disease associations, the performance of the algorithm largely depends on the known circRNA-disease associations. However, the known circRNA-disease associations are still limited, which affects the accuracy of the algorithm. To solve this problem, we calculate an initial score for circRNA-disease associations based on the gene-disease associations.
The initial score of the association between circRNA i and disease k is given by equation (1), where g_i is the gene corresponding to circRNA c(i) and dg_j^k represents a gene associated with disease d(k). geneSim(g_i, dg_j^k) represents the semantic similarity between gene g_i and gene dg_j^k calculated by the GOSemSim package of R [49]; InitialScore(i, k) represents the initial score of the association between circRNA c(i) and disease d(k). If CD(i, k) is equal to 0, InitialScore(i, k) is assigned to CD(i, k) as its new value.

Complexity
Next, a multiple heterogeneous network H is constructed from the circRNA coexpression network, the disease similarity network, the gene functional similarity network, and their association information. In its representation, CG^T, CD^T, and DS^T are the transposed matrices of CG, CD, and DS, respectively. To avoid the biases caused by larger values in the multiple heterogeneous network, H is used to construct a normalized multiple heterogeneous network NMH.
The strategy for selecting the next node is defined by the transition probability of selecting node x in the next biased random walk, where the currently visited node and the last visited node are v and t, respectively. Nei(v) and Nei(t) represent the neighbourhoods of v and t, respectively. If the parameter q is assigned a larger value, the nodes of Path are highly interconnected and belong to the same communities or similar network clusters (similar to the BFS algorithm). Otherwise, the nodes of Path describe a macroview of the neighbourhood more exactly (similar to the DFS algorithm). In other words, the strategies of DFS and BFS can be integrated by adjusting the value of q. Finally, each neighbour of v obtains a probability of being visited in the next biased random walk. A roulette selection algorithm, a simple random choice based on probability, is employed to select the next node from the neighbourhood of v according to these probabilities. The selected node is then added to the corresponding Path. If k is equal to 1, the next node is randomly selected from the neighbourhood of the last node based on the neighbours' probabilities.
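The network assembly and the roulette selection can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the block layout of H (circRNAs, then genes, then diseases, with GD^T in the lower row), the row normalization by node degree, and the names `build_network`, `row_normalize`, and `roulette_select` are all hypothetical. The q-dependent bias is deliberately left out because its formula is not available in the text.

```python
import numpy as np

def build_network(CS, GS, DS, CG, CD, GD):
    """Stack the similarity and association matrices into one block matrix.

    Assumed node order: circRNAs, genes, diseases.
    """
    CS, GS, DS, CG, CD, GD = (np.asarray(m, dtype=float)
                              for m in (CS, GS, DS, CG, CD, GD))
    return np.block([
        [CS,   CG,   CD],
        [CG.T, GS,   GD],
        [CD.T, GD.T, DS],
    ])

def row_normalize(H):
    """Divide each row by its degree (row sum); an assumed normalization."""
    H = np.asarray(H, dtype=float)
    degree = H.sum(axis=1, keepdims=True)
    # Rows of isolated nodes stay all-zero instead of dividing by zero.
    return np.divide(H, degree, out=np.zeros_like(H), where=degree > 0)

def roulette_select(probabilities, rng):
    """Pick an index with chance proportional to its probability."""
    p = np.asarray(probabilities, dtype=float)
    total = p.sum()
    if total <= 0:
        return None  # no neighbour can be selected
    cumulative = np.cumsum(p / total)
    spin = rng.random()  # uniform in [0, 1)
    index = int(np.searchsorted(cumulative, spin, side="right"))
    return min(index, len(p) - 1)  # guard against rounding at the top end
```

A walker at node v would call `roulette_select` on row v of the normalized matrix, after that row has been reweighted by whatever bias the method prescribes.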
When the biased random walk searches paths between circRNA c(i) and disease d(j), a path from c(i) to d(j) is saved if its length is less than or equal to L. Otherwise, the current biased walk fails to find a corresponding path. To find more possible paths between circRNA c(i) and disease d(j), we repeat the above steps maxiter times. Therefore, after the biased random walk, we obtain many paths from circRNA c(i) to disease d(j).
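The repeated search can be sketched as below. This is a simplified illustration: it uses a plain weight-proportional random walk in place of the biased transition probability, it forbids revisiting nodes, and it keeps each distinct path once. All three choices, as well as the name `search_paths`, are assumptions made for this example.

```python
import numpy as np

def search_paths(W, source, target, L, maxiter, seed=0):
    """Collect distinct paths of length <= L from source to target.

    W is a nonnegative weight matrix; W[u, v] > 0 means u and v are linked.
    """
    W = np.asarray(W, dtype=float)
    rng = np.random.default_rng(seed)
    found = set()
    for _ in range(maxiter):
        path = [source]
        while len(path) - 1 < L:           # path length counts edges
            weights = W[path[-1]].copy()
            weights[path] = 0.0            # assumed: no node is revisited
            total = weights.sum()
            if total <= 0:
                break                      # dead end, this walk fails
            nxt = int(rng.choice(len(weights), p=weights / total))
            path.append(nxt)
            if nxt == target:
                found.add(tuple(path))     # save the successful path
                break
    return sorted(found)

# Toy triangle graph: node 0 plays the circRNA, node 2 the disease.
triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
paths = search_paths(triangle, source=0, target=2, L=2, maxiter=200)
```

With L = 2 the walk can reach node 2 either directly or through node 1; with L = 1 only the direct edge qualifies.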

Evaluation Metrics.
In this paper, leave-one-out cross-validation (LOOCV) is used to analyse the performance of BRWSP in predicting circRNA-disease associations. According to the results of LOOCV, the receiver operating characteristic (ROC) curve is plotted and the area under the ROC curve (AUC) is calculated as the evaluation criterion.
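For reference, the AUC can be computed directly from the predicted scores of positive and negative samples without plotting the curve. The sketch below uses the standard pairwise (Mann-Whitney) formulation, in which the AUC is the fraction of positive-negative pairs ranked correctly, with ties counted as one half; the function name `auc_from_scores` is a name chosen for this example.

```python
def auc_from_scores(pos_scores, neg_scores):
    """Area under the ROC curve from positive and negative sample scores."""
    if not pos_scores or not neg_scores:
        raise ValueError("both positive and negative scores are required")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half a correct pair
    return wins / (len(pos_scores) * len(neg_scores))
```

The double loop is quadratic in the number of samples, which is acceptable for the small balanced sample sets described in this section.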

When predicting circRNAs associated with disease k, the positive samples are the circRNAs known to be associated with disease k. Reliable negative samples are also required for evaluation. However, there is no prior information about negative samples (non-disease-related circRNAs). One option is to regard all unknown circRNAs as negative samples, but this approach has two disadvantages. Firstly, there is currently no evidence that the unknown circRNAs are related or unrelated to diseases. Secondly, this approach leads to a class-imbalance problem, since the number of known circRNAs is much smaller than the number of unknown circRNAs. This phenomenon has also been widely discussed in identifying disease-related genes, miRNAs, and lncRNAs [47][51][52][53]. Therefore, it is not scientifically sound to regard all unknown circRNAs as negative samples. To overcome these problems and extract reliable negative samples, we first calculate the initial scores of the associations between all circRNAs and disease k according to equation (1) and arrange them in ascending order.
Negative samples are then taken from the front of this ascending order, and their number is equal to the number of positive samples. If all initial scores are equal to 0, we randomly select the negative samples from the circRNAs whose association with disease k is unknown, again so that the number of negative samples is equal to the number of positive samples. Finally, we obtain the predicted scores for all positive and negative samples.
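The negative-sample selection for one disease can be sketched as follows. The function name `select_negatives`, the boolean mask of known associations, and the restriction of candidates to circRNAs without a known association are assumptions made for this example.

```python
import numpy as np

def select_negatives(initial_scores, known_mask, seed=0):
    """Return indices of circRNAs used as negative samples for one disease.

    initial_scores: initial score of every circRNA for the disease.
    known_mask: True where the circRNA is a known (positive) association.
    """
    scores = np.asarray(initial_scores, dtype=float)
    known = np.asarray(known_mask, dtype=bool)
    candidates = np.flatnonzero(~known)          # circRNAs of unknown status
    n_neg = min(int(known.sum()), len(candidates))
    if np.all(scores == 0):
        # No ranking information: fall back to a random balanced draw.
        rng = np.random.default_rng(seed)
        return rng.choice(candidates, size=n_neg, replace=False)
    # Ascending order: the lowest initial scores come first.
    order = candidates[np.argsort(scores[candidates], kind="stable")]
    return order[:n_neg]

# Toy example: circRNAs 0 and 4 are known positives for the disease.
negatives = select_negatives([0.9, 0.1, 0.5, 0.0, 0.7],
                             [True, False, False, False, True])
```

In the toy example the two unknown circRNAs with the lowest initial scores (indices 3 and 1) are chosen, which balances the two known positives.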

The Effect of Gene Network.
One of the highlights of our paper is that the gene similarity network is utilized to construct a multiple heterogeneous network with circRNA coexpression similarity network, disease semantic similarity, and associations among them.In this section, we analyse its impact on predicting circRNA-disease associations.In other words, we run our algorithm on a heterogeneous network (constructed by circRNA coexpression similarity network, disease semantic similarity, initial score, and their association information) and a multiple heterogeneous network (constructed by circRNA coexpression similarity network, gene similarity network, disease semantic similarity, initial score, and their association information).
Figure 8 shows that our algorithm performs better on the multiple heterogeneous network (Mul_Het_Net) than on the heterogeneous network (Het_Net).
The only difference between Mul_Het_Net and Het_Net is that Mul_Het_Net introduces the gene similarity network. Therefore, the introduction of the gene similarity network helps to identify circRNA-disease associations.

Case Study.
To further demonstrate the effectiveness of BRWSP (L = 3, q = 0.12, maxiter = 300, and α = 1) in predicting new circRNA-disease associations, a case study is performed for colorectal cancer, which is associated with 13 circRNAs (shown in Table 1). In the experiment, the 13 circRNAs associated with colorectal cancer are assigned as training data and the other circRNAs act as candidate samples. At the end of the prediction, we rank the scores of the candidate samples in descending order and select the top 20 candidate samples (circRNAs). The literature mining method and the interaction network method are used to analyse associations between them and colorectal cancer. The result of the literature validation method is shown in Table 2. In the fourth column of Table 2, if there is literature indicating that the gene corresponding to the circRNA is associated with colorectal cancer, the entry is set to the PMID of that literature; otherwise, it is set to "-". Table 2 shows that 12 literature studies support our results.
The interaction network method shows whether the host gene of a circRNA interacts with disease genes in the PPI network and the Pathway network. If the host gene of a predicted circRNA interacts with disease genes, the predicted circRNA is likely to be associated with the corresponding disease. Genes associated with colorectal cancer are extracted from the DISEASE database [46] and the DisGeNET database [48]; the protein-protein interaction (PPI) network and the Pathway network are extracted from [55]. Then, we extract the interactions between genes associated with colorectal cancer and genes corresponding to the top 20 circRNAs in the PPI network and the Pathway network. The final analysis result is shown in Figure 9. We can observe that 11 genes corresponding to circRNAs interact with colorectal cancer genes. The gene POLD1 is not only a colorectal cancer gene but is also associated with hsa_circ_0052012. In addition, three sets of connected graphs are formed by the predicted circRNAs, the host genes of the predicted circRNAs, and the colorectal cancer genes. The first connected graph contains hsa_circ_0067531, hsa_circ_0002362, hsa_circ_0091894, hsa_circ_0000893, hsa_circ_0052012,

Conclusion
In this study, we propose a novel path-weighted computational method, named BRWSP, to predict circRNA-disease associations. The highlights of BRWSP are the construction of a multiple heterogeneous network and the use of a biased random walk strategy to search paths between circRNAs and diseases. Firstly, BRWSP constructs a multiple heterogeneous network from the circRNA similarity network, the gene similarity network, the disease similarity network, and their associations, which makes it possible to analyse circRNA-disease associations from different biological perspectives. Secondly, the biased random walk is employed to search paths, which can eliminate some low-probability paths. Experimental results show that BRWSP achieves satisfactory performance compared with other algorithms. Although BRWSP can effectively predict circRNA-disease associations, it still has several shortcomings. Firstly, we only use a small number of circRNA-disease associations and do not consider circRNAs without a gene symbol, circBase ID, or expression profile information. Besides, BRWSP has four parameters (maxiter, q, L, and α). Therefore, selecting optimal parameters in different situations remains a challenge. These limitations will motivate our further research in future work.

Figure 1: The framework of BRWSP. (a) Some original data are downloaded from corresponding databases. (b) A multiple heterogeneous network is constructed. (c) Biased random walk algorithm runs on a heterogeneous network to find paths between a specific circRNA and a specific disease.
Score Based on Paths.
circRNA c(i) and disease d(j) are likely to be associated with each other if many paths with higher weight and shorter length are found between them. Therefore, an exponential decay function for circRNA c(i) and disease d(j) is used to give more support to paths with high weight and short length, as follows:
$$\mathrm{score}(c(i), d(j)) = \sum_{Path_i \in AllPath} \Bigl( \prod_{e} W_e(Path_i) \Bigr)^{\alpha \cdot \mathrm{len}(Path_i)}$$
where score(c(i), d(j)) represents the predicted association score between circRNA c(i) and disease d(j). AllPath = {Path_1, Path_2, ..., Path_n} represents all paths we have found between circRNA c(i) and disease d(j), where Path_i represents the ith path found. W_e(Path_i) represents the weight of the eth edge in Path_i, len(Path_i) is the length of Path_i, and the parameter α represents a decay factor.
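A small numerical sketch of the path score is given below. It assumes the score of each path is the product of its edge weights raised to the power α times the path length, summed over all paths found, which is the form used by the path-based methods this section builds on; the function name `path_score` and the representation of a path as a list of edge weights are hypothetical.

```python
import math

def path_score(paths_edge_weights, alpha):
    """Sum of (product of edge weights) ** (alpha * path length).

    paths_edge_weights: one list of edge weights per path found.
    alpha: decay factor; a larger alpha penalizes long or weak paths more.
    """
    total = 0.0
    for weights in paths_edge_weights:
        length = len(weights)               # path length = number of edges
        total += math.prod(weights) ** (alpha * length)
    return total

# One direct edge of weight 0.5 and one two-edge path with weights 0.5, 0.5.
example = path_score([[0.5], [0.5, 0.5]], alpha=1.0)
```

Here the direct path contributes 0.5 and the two-edge path contributes 0.25 squared, so longer paths with weights below 1 decay quickly.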

Figure 3: The framework of materials and preprocessing.

Figure 8: The effect of BRWSP on different networks.

Figure 9: The interaction network method validated the top 20 results. Red nodes represent the circRNAs; pink nodes represent genes corresponding to circRNAs; dark turquoise nodes represent colorectal cancer genes; blue edges represent circRNA-gene associations; cyan edges represent protein-protein interactions; green edges represent Pathway associations.

D is a degree matrix of H [42]. The overall framework of BRWSP is depicted in Figure 3.

Biased Random Walk to Search Paths.
In [42], DFS can reach more different types of nodes because it explores a network as deeply as possible, whereas breadth-first search (BFS) explores the neighbourhood of the source node. Inspired by this, a biased random walk algorithm is designed to search paths between circRNAs and diseases, which combines the advantages of DFS and BFS by adjusting a parameter of BRWSP (explained as follows). Formally, let Path = {p_1, p_2, ..., p_{L+1}} represent one path between circRNA p_1 and disease p_{L+1}. In this Path, p_i represents a node (circRNA or disease) of Path and L represents the length of Path. Let c_k indicate the node accessed by the kth biased random walk.

Table 1: The 13 known circRNAs for colorectal cancer in the CircR2Disease database.

Table 2: Literature validation of the top 20 results.