A Novel Approach for Predicting Disease-lncRNA Associations Based on the Distance Correlation Set and Information of the miRNAs

Accumulating experimental studies have indicated that many long noncoding RNAs (lncRNAs) play important roles in various biological processes and are associated with many complex human diseases. Developing powerful computational models that predict associations between lncRNAs and diseases from heterogeneous biological datasets is therefore important. However, few approaches calculate and analyze lncRNA-disease associations on the basis of information about miRNAs. In this article, a new computational method based on the distance correlation set is developed to predict lncRNA-disease associations (DCSLDA). Compared with existing state-of-the-art methods, the major novelty of DCSLDA lies in the introduction of the lncRNA-miRNA-disease network and the distance correlation set; thus DCSLDA can be applied to predict potential lncRNA-disease associations without requiring any known disease-lncRNA associations. Simulation results show that DCSLDA improves on previous models, with a reliable AUC of 0.8517 in leave-one-out cross-validation. Furthermore, when DCSLDA is used to prioritize candidate lncRNAs for three important cancers, 17 of the predicted associations in the first 0.5% of forecast results are verified by independent studies and biological experiments. Hence, it is anticipated that DCSLDA could be a useful addition to the biomedical research field.


Introduction
For a long time, RNA was considered to be merely transcriptional noise and an intermediary between a DNA sequence and its encoded protein [1,2]. However, sequence analyses have shown that more than 98% of the human genome does not encode protein sequences [3]. Furthermore, a growing number of studies based on biological experiments have indicated that ncRNAs play important roles in numerous critical biological processes such as chromosome dosage compensation, epigenetic regulation, and cell growth [4]. In particular, lncRNAs, an important class of ncRNAs longer than 200 nucleotides [5], have been found to be associated with a wide range of human diseases, such as breast cancer [6], colorectal cancer [7], lung cancer [8], and cardiovascular diseases [9]. Hence, the search for novel disease-lncRNA associations has attracted many researchers and is considered one of the most active topics in disease and lncRNA research. The identification of disease-lncRNA associations can not only accelerate the understanding of complex human disease mechanisms at the lncRNA level but also support biomarker identification for human disease diagnosis, treatment, and prevention [10]. So far, many studies have generated a large amount of lncRNA-related biological data on sequence, expression, function, and so on [11][12][13]. However, compared with the rapidly increasing number of newly discovered lncRNAs, only a few known lncRNA-disease associations have been reported. Hence, efficient computational approaches for predicting potential lncRNA-disease associations are urgently needed, although developing them is challenging. In recent years, some computational methods have been proposed to predict novel lncRNA-disease associations; by calculating the association probability of lncRNA-disease pairs, they can significantly decrease the time and cost of biological experiments.
For example, Chen G et al. presented the first prediction method (genomic locus based) and also constructed an lncRNA-disease association database [14]. Liang et al. proposed a genetic mediator and key regulator model to unveil the subtle relationships between lncRNAs and lung cancer. Liu et al. developed a computational framework that combines human lncRNA expression profiles, gene expression profiles, and human disease-associated gene data, and applied this framework to available human long intergenic noncoding RNA (lincRNA) expression data. Chen et al. developed a semi-supervised learning method based on the framework of Laplacian Regularized Least Squares, LRLSLDA, to infer potential lncRNA-disease associations; it did not need negative samples and obtained a reliable AUC of 0.7760 in leave-one-out cross-validation [15]. In 2014, Sun et al. constructed an lncRNA functional similarity network and applied random walk with restart (RWR) to infer potential lncRNA-disease associations [16]. In the same year, Li et al. presented a bioinformatics method based on genomic location to predict the lncRNAs associated with vascular disease [17]. Then, Zhao et al. developed a computational method based on the naïve Bayesian classifier to identify cancer-related lncRNAs by integrating genome, regulome, and transcriptome data [18]. In 2015, Zhou et al. proposed a novel rank-based method named RWRHLDA to prioritize candidate lncRNA-disease associations by integrating a miRNA-associated lncRNA-lncRNA crosstalk network, a disease-disease similarity network, and a known lncRNA-disease association network into a heterogeneous network and implementing a random walk with restart on the newly generated heterogeneous network [19].
Nowadays, despite the advent of many biological datasets, such as LncRNADisease [14], lncRNAdb [20], and NONCODE [13], the number of known lncRNA-disease associations is still very limited. In 2015, Chen developed a method named HGLDA, based on the information of miRNA [21], which predicted lncRNA-disease associations by integrating disease-miRNA associations with lncRNA-miRNA interactions and did not rely on known lncRNA-disease associations. Different from HGLDA, in this article, on the basis of experimentally reported miRNA-disease associations collected from the HMDD database [22] and miRNA-lncRNA associations collected from the starBase database [23], a novel model based on the distance correlation set is developed to predict potential lncRNA-disease associations by integrating known lncRNA-miRNA associations and known miRNA-disease associations. Compared with HGLDA, the advantage of DCSLDA lies in the introduction of the similarity of disease pairs and lncRNA pairs and of the distance correlation set. In addition, to optimize the prediction performance of DCSLDA, new methods for calculating the similarity of disease-disease pairs and lncRNA-lncRNA pairs are developed. Finally, to evaluate the prediction performance of DCSLDA, LOOCV is implemented on the basis of the known lncRNA-disease associations and the known lncRNA-cancer associations separately; simulation results demonstrate that DCSLDA is superior to the state-of-the-art methods and achieves a reliable AUC of 0.8517 in LOOCV when the pregiven threshold parameter is set to 6. Additionally, to further evaluate the prediction performance of DCSLDA, case studies of breast cancer, colorectal cancer, and lung cancer are implemented; among the first 0.5% of predictive results, 9, 6, and 2 predicted potential associations, respectively, are confirmed by recent experimental reports.
Hence, considering its prediction performance, DCSLDA can become a useful and efficient computational tool for biomedical research.

Materials and Methods
2.1. Disease-miRNA Associations. We downloaded known disease-miRNA associations from the Human MicroRNA Disease Database (HMDD) in July 2017 (see Supplementary file 1), which included 10381 experimentally verified disease-miRNA associations (covering 572 miRNAs and 383 diseases). After merging miRNAs that produce the same mature miRNA and eliminating duplicate data, we obtained dataset1, which includes 5430 disease-miRNA associations (covering 383 human diseases and 495 miRNAs). Let $D$ be the number of different diseases and $M_1$ be the number of different miRNAs collected from dataset1, let $S_D = \{d_1, d_2, \ldots, d_D\}$ represent the set of these diseases, and let $S_{M_1} = \{m_{D+1}, m_{D+2}, \ldots, m_{D+M_1}\}$ represent the set of these $M_1$ miRNAs; then for any given $d_i \in S_D$ and $m_j \in S_{M_1}$, we can define the Association Strong Correlation (ASC1) between $d_i$ and $m_j$ as follows:

$$\mathrm{ASC1}(d_i, m_j) = \begin{cases} 1, & \text{if } d_i \text{ is associated with } m_j \text{ in dataset1}, \\ 0, & \text{otherwise}. \end{cases} \tag{1}$$
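As a minimal sketch of how a binary association matrix of this kind could be assembled, the following Python snippet builds a disease-by-miRNA 0/1 matrix from a list of association pairs and collapses duplicates. The function name `build_association_matrix` and the toy records are illustrative assumptions; they are not taken from HMDD or from the authors' code.

```python
def build_association_matrix(pairs):
    """Return (diseases, mirnas, matrix) for a list of (disease, miRNA) pairs.

    matrix[i][j] is 1 if diseases[i] is associated with mirnas[j], else 0.
    """
    diseases = sorted({d for d, _ in pairs})
    mirnas = sorted({m for _, m in pairs})
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(mirnas)}
    matrix = [[0] * len(mirnas) for _ in diseases]
    for d, m in pairs:
        # Repeated records simply set the same cell to 1 again.
        matrix[d_index[d]][m_index[m]] = 1
    return diseases, mirnas, matrix


# Toy records (hypothetical, for illustration only).
toy_pairs = [
    ("disease_A", "miR-x"),
    ("disease_A", "miR-y"),
    ("disease_B", "miR-y"),
    ("disease_B", "miR-y"),  # duplicate record
]
diseases, mirnas, asc1 = build_association_matrix(toy_pairs)
for name, row in zip(diseases, asc1):
    print(name, row)
```

Storing the associations as a matrix rather than as a list of pairs makes the later network and similarity computations simple row and column operations.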

lncRNA-Disease Associations.
In order to evaluate the performance of DCSLDA, the most recent lncRNA-disease associations were downloaded from the LncRNADisease database, which integrated more than 1000 lncRNA-disease entries and 475 lncRNA interaction entries, including 321 lncRNAs and 221 diseases from ∼500 publications. After removing duplicate associations and those lncRNA-disease associations involving diseases or lncRNAs not contained in dataset1 or dataset2, 203 high-quality lncRNA-disease associations were finally obtained (see Supplementary file 3).

Disease Functional Similarity Based on miRNAs.
For calculating the functional similarity between diseases, we introduced the concept of a social network. In a social network, the similarity between any two nodes can be calculated by comparing and integrating the similarities of the nodes associated with them. In this section, based on the assumption that similar diseases tend to show a similar interaction and noninteraction pattern with the miRNAs, we calculated the disease similarity in the disease-miRNA interactive network. As illustrated in Figure 1, the calculation procedure of disease functional similarity based on miRNAs includes 3 steps. First, we constructed the miRNA-disease interactive network from known miRNA-disease associations (dataset1), whose topology can be abstracted as an undirected graph $G_1 = (V_1, E_1)$, where $V_1 = S_D \cup S_{M_1} = \{d_1, d_2, \ldots, d_D, m_{D+1}, m_{D+2}, \ldots, m_{D+M_1}\}$ is the set of vertices (the $D$ diseases and $M_1$ miRNAs of dataset1), $E_1$ is the set of edges, and, for any two nodes $a, b \in V_1$, there is an edge between $a$ and $b$ in $E_1$ if and only if $a \in S_D$, $b \in S_{M_1}$, and $\mathrm{ASC1}(a, b) = 1$. Second, since different miRNAs in dataset1 may relate to different numbers of diseases, it is not suitable to assign the same contribution value to different miRNAs. Hence, we defined a contribution value $C(m_k)$ for each miRNA $m_k$. Finally, we defined the functional similarity between diseases $d_i$ and $d_j$ by integrating the miRNAs related to $d_i$, $d_j$, or both of them as follows:

$$\mathrm{FSD}(d_i, d_j) = \frac{\exp\left(\sum_{m_k \in T(d_i) \cap T(d_j)} C(m_k)\right)}{N(d_i) + N(d_j) - |T(d_i) \cap T(d_j)|},$$

where FSD is the disease functional similarity matrix calculated based on miRNAs, $T(d_i)$ is the set of miRNAs linked to $d_i$ in $E_1$, and $N(d_i)$ and $N(d_j)$ are the numbers of $d_i$-related edges and $d_j$-related edges in $E_1$, respectively. As an example, in Figure 1, $\mathrm{FSD}(d_1, d_2) = \exp(C(m_1) + C(m_3) + C(m_4))/(4 + 5 - 3)$.
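The Figure 1 example can be traced with a short Python sketch. It follows the form of the worked example, exp of the summed contributions of the shared miRNAs divided by (4 + 5 − 3). The contribution function used here, the inverse of the number of diseases a miRNA is linked to, is only an assumption made so that the code runs; it should not be read as the definition used in DCSLDA. The names `contribution` and `functional_similarity` and the toy neighbourhoods are likewise hypothetical.

```python
import math


def contribution(mirna, disease_to_mirnas):
    """ASSUMPTION: weight a miRNA by 1 / (number of diseases linked to it)."""
    degree = sum(1 for linked in disease_to_mirnas.values() if mirna in linked)
    return 1.0 / degree


def functional_similarity(a, b, disease_to_mirnas):
    """exp(sum of contributions of shared miRNAs) / size of the joint neighbourhood."""
    shared = disease_to_mirnas[a] & disease_to_mirnas[b]
    numerator = math.exp(
        sum(contribution(m, disease_to_mirnas) for m in sorted(shared))
    )
    denominator = (
        len(disease_to_mirnas[a]) + len(disease_to_mirnas[b]) - len(shared)
    )
    return numerator / denominator


# Toy neighbourhoods mirroring the example: d1 has 4 miRNAs, d2 has 5,
# and they share m1, m3, and m4.
toy = {
    "d1": {"m1", "m2", "m3", "m4"},
    "d2": {"m1", "m3", "m4", "m5", "m6"},
}
fsd_12 = functional_similarity("d1", "d2", toy)
print(round(fsd_12, 4))
```

With these toy sets the denominator is 4 + 5 − 3 = 6, as in the example; the numerator depends entirely on the assumed contribution function.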

lncRNA Functional Similarity Based on miRNAs.
Based on the assumption that similar lncRNAs tend to show a similar interaction and noninteraction pattern with the miRNAs, we can calculate the lncRNA similarity in the lncRNA-miRNA interactive network. Similar to the calculation procedure of disease functional similarity, first, we constructed the lncRNA-miRNA interactive network from known lncRNA-miRNA associations (dataset2), whose topology can be abstracted as an undirected graph $G_2 = (V_2, E_2)$, where $V_2 = S_L \cup S_{M_2} = \{l_1, l_2, \ldots, l_L, m_{L+1}, m_{L+2}, \ldots, m_{L+M_2}\}$ is the set of vertices (the $L$ lncRNAs and $M_2$ miRNAs of dataset2), $E_2$ is the set of edges, and, for any two nodes $a, b \in V_2$, there is an edge between $a$ and $b$ in $E_2$ if and only if $a \in S_{M_2}$, $b \in S_L$, and $\mathrm{ASC2}(a, b) = 1$. Then, considering the number of lncRNA-miRNA associations, we defined a contribution value $C(m_k)$ for each miRNA $m_k$ in $G_2$. Additionally, we defined the functional similarity between lncRNAs $l_i$ and $l_j$ by integrating the miRNAs related to $l_i$, $l_j$, or both of them as follows:

$$\mathrm{FSL}(l_i, l_j) = \frac{\exp\left(\sum_{m_k \in T(l_i) \cap T(l_j)} C(m_k)\right)}{N(l_i) + N(l_j) - |T(l_i) \cap T(l_j)|},$$

where FSL is the lncRNA functional similarity matrix calculated based on miRNAs, $T(l_i)$ is the set of miRNAs linked to $l_i$ in $E_2$, and $N(l_i)$ and $N(l_j)$ are the numbers of $l_i$-related edges and $l_j$-related edges in $E_2$, respectively.

Method for Predicting Potential Association between lncRNAs and Diseases.
Based on the assumptions that similar diseases tend to show a similar interaction and noninteraction pattern with the miRNAs and that similar lncRNAs tend to show a similar interaction and noninteraction pattern with the miRNAs, we proposed a novel model, DCSLDA, based on miRNAs and the distance correlation set to predict potential disease-lncRNA associations. As illustrated in Figure 2, the procedure of DCSLDA consists of the following 6 major steps.
Step 1 (construction of the disease-miRNA-lncRNA interaction network). On the basis of the above descriptions and letting $S_M = S_{M_1} \cap S_{M_2}$, where $S_{M_1}$ and $S_{M_2}$ are the miRNA sets of dataset1 and dataset2, we can construct a disease-miRNA-lncRNA interaction network based on dataset1 and dataset2, whose topology can be abstracted to an undirected graph $G_3 = (V_3, E_3)$, where $V_3 = S_D \cup S_M \cup S_L$ is the vertex set (with $S_D$ the disease set and $S_L$ the lncRNA set) and $E_3$ is the edge set of $G_3$. For all $d_i \in S_D$, $m_j \in S_M$, and $l_k \in S_L$, there is an edge between $l_k$ and $m_j$ in $E_3$ if and only if the lncRNA $l_k$ is related to the miRNA $m_j$. Moreover, there is an edge between $m_j$ and $d_i$ in $E_3$ if and only if the miRNA $m_j$ is related to the disease $d_i$. Then, for any given $a, b \in V_3$, we can define the ASC3 between $a$ and $b$ as follows:

$$\mathrm{ASC3}(a, b) = \begin{cases} 1, & \text{if there is an edge between } a \text{ and } b \text{ in } E_3, \\ 0, & \text{otherwise}. \end{cases}$$

In addition, although we did not use any known disease-lncRNA associations, the diseases and lncRNAs can still be linked by integrating the edges between disease nodes and miRNA nodes and the edges between miRNA nodes and lncRNA nodes in $G_3$.
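The remark that diseases and lncRNAs become linked through miRNA nodes can be illustrated with a small Python sketch. Multiplying a disease-by-miRNA 0/1 matrix by a miRNA-by-lncRNA 0/1 matrix counts the miRNAs shared by each disease-lncRNA pair, that is, the number of two-edge paths between them. This only illustrates the connectivity argument under toy data; it is not the distance correlation set computation of DCSLDA, and the helper `matmul` and all matrices are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


# Toy 0/1 matrices restricted to the shared miRNAs (hypothetical values).
disease_mirna = [  # 2 diseases x 3 miRNAs
    [1, 1, 0],
    [0, 1, 1],
]
mirna_lncrna = [  # 3 miRNAs x 2 lncRNAs
    [1, 0],
    [1, 1],
    [0, 0],
]

# two_hop[i][k] = number of miRNAs linked to both disease i and lncRNA k.
two_hop = matmul(disease_mirna, mirna_lncrna)
print(two_hop)
```

A nonzero entry means the disease and the lncRNA are connected through at least one miRNA, even though no direct disease-lncRNA edge was supplied.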
Step 5 (estimation of association degree between a pair of nodes in the disease-miRNA-lncRNA interactive network).
Based on (13), $C_{23}$ is an $M \times L$ matrix, $C_{31}$ is an $L \times D$ matrix, $C_{32}$ is an $L \times M$ matrix, and $C_{33}$ is an $L \times L$ matrix, where $D$, $M$, and $L$ are the numbers of diseases, miRNAs, and lncRNAs, respectively. It can be easily inferred that the matrix $C_{13}$ gives our prediction results, providing the association probability between each disease and lncRNA. Moreover, we can introduce disease functional similarity and lncRNA functional similarity for $C_{13}$ as follows:

$$\mathrm{FAD} = \mathrm{FSD} \times C_{13} \times \mathrm{FSL}, \tag{14}$$

where the entry $\mathrm{FAD}(i, j)$ in row $i$ and column $j$ reflects the probability that the lncRNA $l_j$ is related to the disease $d_i$.
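The final integration step is a triple matrix product, which the following Python sketch traces on toy data. It assumes a disease-by-disease similarity matrix, a disease-by-lncRNA score matrix, and an lncRNA-by-lncRNA similarity matrix are already available; all numerical values and the helpers `matmul` and `identity` are hypothetical and serve only to show the shapes involved.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def identity(n):
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Toy inputs: 2 diseases and 3 lncRNAs (hypothetical values).
fsd = [[1.0, 0.5],
       [0.5, 1.0]]          # disease x disease similarity
c13 = [[0.2, 0.0, 0.4],
       [0.0, 0.6, 0.0]]     # disease x lncRNA scores
fsl = identity(3)           # lncRNA x lncRNA similarity (identity for clarity)

# Scores are smoothed across similar diseases (left factor)
# and similar lncRNAs (right factor).
fad = matmul(matmul(fsd, c13), fsl)
for row in fad:
    print([round(v, 3) for v in row])
```

When both similarity matrices are identity matrices the product returns the score matrix unchanged, which corresponds to the setting without similarity information.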

Results and Case Studies
To evaluate the prediction performance of DCSLDA, we first implemented LOOCV (leave-one-out cross-validation) to compare DCSLDA with HGLDA [21] based on the lncRNA-disease association dataset downloaded from the LncRNADisease database [14]. Next, LOOCV was implemented to further evaluate the prediction performance of DCSLDA based on the known experimentally verified lncRNA-cancer associations. Then, the effects of the disease functional similarity and the lncRNA functional similarity on the prediction performance of DCSLDA were analyzed. Finally, experimental results on the prediction of associations between lncRNAs and three cancers are listed (see Table 1), and DCSLDA and HGLDA are compared according to the rankings of these new disease-related lncRNAs in the case studies of the three cancers (see Table 2).

Performance Evaluation of Potential Disease-lncRNA Association Prediction.
Using the lncRNA-disease association dataset downloaded from the LncRNADisease database, DCSLDA and HGLDA were each applied in the framework of LOOCV. In LOOCV, each known lncRNA-disease association was left out in turn as the test sample, and we evaluated how well this association ranked relative to the candidate samples. Here, the candidate samples comprised all potential lncRNA-disease pairs without confirmed associations. After DCSLDA was run, the rank of each left-out test sample relative to the candidate samples was obtained, and test samples with a prediction rank higher than the given threshold were considered successfully predicted. We could then obtain the corresponding true positive rates (TPR, sensitivity) and false positive rates (FPR, 1-specificity) by setting different thresholds. Here, sensitivity refers to the percentage of test samples predicted with ranks higher than the given threshold, and specificity was computed as the percentage of negative samples with ranks lower than the threshold. The receiver-operating characteristic (ROC) curve was drawn by plotting TPR versus FPR at different thresholds, and the area under the ROC curve (AUC) was calculated to evaluate the prediction performance of DCSLDA. An AUC value of 1 represents perfect prediction, while an AUC value of 0.5 indicates purely random performance.
The results of the performance comparison between DCSLDA and HGLDA are shown in Figure 4. Since HGLDA predicts lncRNA-disease associations without relying on known disease-lncRNA associations, it was selected for performance comparison with DCSLDA. DCSLDA achieved an AUC of 0.8517 in the framework of LOOCV, which is higher than the AUC of 0.7621 achieved by HGLDA [21]. These simulation results indicate that DCSLDA improved on the performance of HGLDA by 0.0896 in terms of AUC and demonstrate the performance superiority of DCSLDA.
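The ranking-based evaluation can be sketched in a few lines of Python. The sketch assumes that the prediction scores do not change when a known association is left out, which is plausible for a method that never uses known disease-lncRNA associations as input but is an assumption here; it also thresholds on score rather than on rank, which traces the same ROC curve when there are no ties. The function names and toy scores are hypothetical.

```python
def loocv_ranks(scores, known):
    """Rank of each left-out known pair among the candidate pairs (1 = best)."""
    candidates = [s for pair, s in scores.items() if pair not in known]
    return {
        pair: 1 + sum(1 for s in candidates if s > scores[pair])
        for pair in known
    }


def roc_points(scores, known):
    """(FPR, TPR) points obtained by sweeping a threshold over the scores."""
    pos = [scores[p] for p in known]
    neg = [s for p, s in scores.items() if p not in known]
    points = [(0.0, 0.0)]
    for t in sorted(set(scores.values()), reverse=True):
        tpr = sum(1 for s in pos if s >= t) / len(pos)
        fpr = sum(1 for s in neg if s >= t) / len(neg)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy scores for 2 diseases x 3 lncRNAs (hypothetical values).
scores = {
    ("d1", "l1"): 0.9, ("d1", "l2"): 0.8, ("d1", "l3"): 0.3,
    ("d2", "l1"): 0.7, ("d2", "l2"): 0.1, ("d2", "l3"): 0.6,
}
known = {("d1", "l1"), ("d2", "l3")}  # confirmed associations (toy)

ranks = loocv_ranks(scores, known)
area = auc(roc_points(scores, known))
print(ranks, area)
```

In this toy case one known pair outranks every candidate and the other outranks two of the four candidates, so six of the eight positive-candidate comparisons are won.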

Performance Evaluation of Potential lncRNA-Cancer Association Prediction.
Cancer has become one of the leading causes of death in humans [24,25], and its incidence is high in both developed and developing countries. Therefore, to further evaluate the prediction performance of DCSLDA, LOOCV was implemented on the basis of 117 lncRNA-cancer associations collected from the LncRNADisease dataset, and the simulation results are illustrated in Figure 5.
Figure 5 shows that DCSLDA achieved an AUC of 0.9015 in the framework of LOOCV when the distance threshold parameter r is set to 6, which indicates that DCSLDA has reliable predictive performance for cancers and is therefore a precise and highly efficient method for lncRNA-disease association prediction.

Effects of the Disease Functional Similarity and lncRNA Functional Similarity.
In formula (14), we defined $\mathrm{FAD} = \mathrm{FSD} \times C_{13} \times \mathrm{FSL}$. In this section, we analyze the effects of the disease similarity matrix FSD and the lncRNA similarity matrix FSL by comparing the prediction performance of DCSLDA in the framework of LOOCV when letting $\mathrm{FAD} = C_{13}$ and $\mathrm{FAD} = \mathrm{FSD} \times C_{13} \times \mathrm{FSL}$, respectively. The simulation results are illustrated in Figure 6. DCSLDA achieved an AUC of 0.8517 when the matrices FSD and FSL were considered, but only 0.8352 when letting $\mathrm{FAD} = C_{13}$. These results indicate that the prediction performance of DCSLDA is improved by introducing the similarity matrices FSD and FSL. Moreover, as shown in Table 1, DCSLDA was applied to three important kinds of cancer (breast cancer, colorectal cancer, and lung cancer). As a result, 17 predicted lncRNA-disease pairs with high predicted values were publicly released to benefit biological experimental validation.

Case Studies.
DCSLDA can predict all potential relationships between the diseases and lncRNAs in dataset1 and dataset2 simultaneously, and potential associations with high predicted values can be publicly released to benefit biological experimental validation. It is anticipated that these potential disease-lncRNA associations that significantly share common miRNAs could be validated by biological experiments (see Table 1).
Worldwide, breast cancer is the most prevalent cancer in women and a major public health problem. Several studies have focused on this disease, but more are needed, especially at the genetic and molecular levels [26,27]. Therefore, it is necessary to predict breast cancer-related lncRNAs and identify lncRNA biomarkers. DCSLDA was implemented to prioritize candidate lncRNAs for breast cancer. Among the first 0.5% of predictive results, nine breast cancer-related lncRNAs have been confirmed by recent experimental literature (see Table 1). For example, KCNQ1OT1, MALAT1, XIST, and NEAT1 are experimentally confirmed breast cancer-related lncRNAs, which were ranked 2nd, 11th, 12th, and 19th, respectively, in the list predicted by DCSLDA. KCNQ1OT1 had significantly higher expression levels in invasive breast carcinoma and was induced by estrogen in estrogen receptor-alpha expressing breast cancer cells [28]. 17β-Estradiol treatment affects the proliferation, migration, and invasion of breast tumor or nontumor cells in an ERα-independent but dose-dependent way by decreasing the MALAT1 RNA level [29]. XIST expression is significantly reduced in breast cancer cell lines and breast cancer samples [30]. Breast cancer patients with a high level of NEAT1 expression show a low survival rate [31].
Colorectal cancer (CRC) is a leading cause of cancer deaths worldwide. One of the fundamental processes driving the initiation and progression of CRC is the accumulation of a variety of genetic and epigenetic changes in colon epithelial cells, and colorectal cancer is usually caused by a combination of such factors [32,33]. In particular, lncRNAs have been demonstrated to play a critical role in the development and progression of colon cancer [34]. Six colorectal cancer-related lncRNAs are listed in Table 1. For example, Tanaka K et al. showed that loss of imprinting of KCNQ1OT1 can be considered a useful marker for the diagnosis of colorectal cancer because of its frequent occurrence in colorectal cancer samples [35]. The findings of Ji Q et al. implied that MALAT1 might be a potential predictor of tumor metastasis and prognosis [36]. Furthermore, the interaction between MALAT1 and SFPQ could be a novel therapeutic target for CRC. Lassmann S et al. showed that a change in the expression level of XIST, or DNA amplification of XIST, is associated with colorectal cancer [37].
Over the past 30 years, the morbidity and mortality of lung cancer have been increasing, and it now has the highest incidence and mortality of any cancer across the world [38]. Owing to the lack of early diagnosis and effective treatment, its five-year survival rate is around 10%, which seriously endangers human health. More and more evidence has shown that lncRNAs play a critical role in the treatment of lung cancers. Among the first 0.5% of predictive results, two predicted lncRNAs have been confirmed by published experimental literature [39]. According to this literature, MALAT1 has been shown to be highly associated with metastasis of lung cancer and to promote lung cancer cell motility by regulating motility-related gene expression [40,41]. The long noncoding RNA XIST acts as an oncogene in non-small cell lung cancer by epigenetically repressing KLF2 expression [42].
In addition, DCSLDA and HGLDA were compared according to the rankings of these disease-related lncRNAs in the case studies of breast cancer, colorectal cancer, and lung cancer (see Table 2). After ranking the results predicted by HGLDA and by our method from best to worst, we took the intersection of the disease-lncRNA relationships predicted by HGLDA and the first 0.5% of the results predicted by our method and listed the lncRNAs in this intersection that relate to breast cancer, colorectal cancer, and lung cancer in Table 2. DCSLDA improved on the prediction ability of HGLDA, giving higher ranks to these new disease-related lncRNAs.
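The selection used for this comparison can be sketched as follows in Python: rank one method's predictions, keep the top fraction, intersect with the pairs scored by the other method, and compare the ranks of the shared pairs. The rounding rule for the cut-off (ceiling, at least one pair), the function names, and the toy scores are all assumptions for illustration; the toy call uses a fraction of 0.5 for readability, whereas the comparison above uses 0.005.

```python
import math


def ranked_pairs(scores):
    """Pairs sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def top_fraction(scores, fraction):
    """Highest-scoring pairs making up the given fraction of all pairs.

    ASSUMPTION: the cut-off is rounded up and is never smaller than one pair.
    """
    k = max(1, math.ceil(len(scores) * fraction))
    return ranked_pairs(scores)[:k]


def rank_of(pair, scores):
    """1-based rank of a pair within one method's predictions."""
    return ranked_pairs(scores).index(pair) + 1


# Toy scores from two methods (hypothetical values and names).
ours = {
    ("d1", "lnc_a"): 0.9, ("d1", "lnc_b"): 0.7,
    ("d2", "lnc_a"): 0.4, ("d2", "lnc_b"): 0.1,
}
other = {("d1", "lnc_a"): 0.3, ("d2", "lnc_a"): 0.8, ("d1", "lnc_b"): 0.2}

shared = [p for p in top_fraction(ours, 0.5) if p in other]
comparison = {p: (rank_of(p, ours), rank_of(p, other)) for p in shared}
print(comparison)
```

Each entry pairs the rank under the first method with the rank under the second, so a smaller first number means the first method places that association higher.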

Discussion and Conclusions
In recent years, many studies have generated an enormous amount of biological data related to lncRNAs. Accumulating evidence shows that lncRNAs play a very important role in biological functions, and the study of lncRNA-disease association prediction is of great significance. However, there are few computational models for predicting potential disease-lncRNA associations based on the information of miRNAs. To utilize the wealth of disease-miRNA, miRNA-lncRNA, and disease-lncRNA association data collected from three datasets and from recently published experimental literature, in this article, the novel model DCSLDA was developed to predict potential disease-lncRNA associations. We first calculated the distance correlation set of each node based on the disease-miRNA-lncRNA interactive network and then integrated disease functional similarity and lncRNA functional similarity into DCSLDA. The important difference from previous computational models is that DCSLDA does not rely on any known disease-lncRNA associations and predicts disease-lncRNA associations based only on the disease-miRNA-lncRNA interactive network. In order to evaluate the prediction performance of DCSLDA, LOOCV was implemented based on known disease-lncRNA and cancer-related lncRNA associations downloaded from the LncRNADisease database. Case studies were further implemented for three important cancers (breast cancer, colorectal cancer, and lung cancer) based on recently published experimental literature. The simulation results show that DCSLDA achieves reliable prediction performance and is superior to the state-of-the-art methods. Hence, it is anticipated that DCSLDA could play an important role in future biomedical research.
Disease functional similarity plays an important role in disease-related molecular function research. Functional associations between disease-related genes are often used to identify pairs of similar diseases from different perspectives. Calculating lncRNA functional similarity could benefit lncRNA function inference and disease-related lncRNA prioritization. Therefore, based on the two assumptions that (1) similar diseases tend to show a similar interaction and noninteraction pattern with the miRNAs and (2) similar lncRNAs tend to show a similar interaction and noninteraction pattern with the miRNAs, DCSLDA was developed to predict potential disease-related lncRNAs by integrating lncRNA functional similarity and disease functional similarity. Simulation results indicated that the prediction performance of DCSLDA is improved by incorporating disease similarity and lncRNA similarity.
However, our method also has some limitations. Firstly, DCSLDA measures the correlations between lncRNAs and investigated diseases by integrating walks of different lengths in an lncRNA-miRNA-disease network, which is constructed by combining the known disease-miRNA network, miRNA-lncRNA network, and disease similarity network. The value of the distance threshold parameter r is an important factor in DCSLDA, and how to select this parameter has not yet been resolved. Secondly, although DCSLDA does not rely on any known experimentally verified lncRNA-disease relationships, its performance was not very satisfactory compared with that of several existing methods. In the future, we will further integrate data on diseases and lncRNAs that do not rely on the lncRNA-disease interactive network, disease-miRNA interactive network, or miRNA-lncRNA interactive network, which may help to address these problems. Finally, introducing more reliable measures of disease similarity and lncRNA similarity and developing a more reliable similarity integration method would improve the performance of DCSLDA. In particular, disease similarity and lncRNA similarity in this model rely entirely on known disease-miRNA and miRNA-lncRNA associations. The performance of DCSLDA would be further improved if sequence similarity of lncRNAs and semantic similarity of diseases were introduced.