lncRNA-Disease Association Prediction Based On Weight Matrix And Projection Score


 Background: With the development of medical science, lncRNA, originally regarded as transcriptional noise, has been found to participate in a variety of biological activities. A growing number of studies show that lncRNA is involved in various human diseases, such as gastric cancer, prostate cancer, and lung cancer. However, identifying lncRNA-disease associations through biological experiments alone is costly in manpower and material resources and yields few results. It is therefore important to develop effective computational models for predicting lncRNA-disease associations. Results: In this paper, we propose LDAP-WMPS, a new lncRNA-disease association prediction model based on weight distribution and projection score. Building on existing work on disease semantic similarity, the integrated lncRNA similarity matrix and the integrated disease similarity matrix are calculated from the disease semantic similarity and the known association data. On this basis, the weight algorithm is combined with the improved projection algorithm to predict lncRNA-disease associations from known lncRNA-miRNA and miRNA-disease associations. The simulation results show that, under the LOOCV framework, LDAP-WMPS reaches an AUC of 0.8822, which is better than the latest results. Case studies of adenocarcinoma and colorectal cancer show that LDAP-WMPS can effectively infer lncRNA-disease associations. Conclusions: The simulation results show that LDAP-WMPS has good prediction performance and is an important supplement to research on lncRNA-disease association prediction when lncRNA-disease association data are unavailable. Keywords: lncRNA-miRNA association, miRNA-disease association, disease semantic similarity, integrated lncRNA similarity, integrated disease similarity, weight allocation algorithm, projection score.

exists in all organisms, and RNA plays a regulatory role in gene expression [2], virus infection [3,4], the immune system [5], etc., thus bringing biological research into a new stage. Since then, research on ncRNA has gradually increased, and long non-coding RNA (lncRNA) has become one of the hot topics. A long non-coding RNA is a non-coding RNA whose length exceeds 200 nucleotides. In previous studies, it was considered to be noise generated in the process of transcription [6,7]. Nowadays, lncRNA has been found to be involved in all aspects of the cell life cycle, including transcription [8], cell differentiation [9], cell transport [10], apoptosis [11], metabolic processes [12] and so on. Moreover, lncRNA has also been found to be associated with various human diseases [13], including leukemia [14,15], diabetes [16,17], prostate cancer [18,19], lung cancer [20,21], colon cancer [22,23], cardiovascular disease [24,25] and so on. lncRNA participates in diseases through abnormal sequence and spatial structure, abnormal expression levels and abnormal interactions with binding proteins, thus affecting human health [26,27]. Therefore, linking lncRNA with diseases can enable the early detection of diseases, the targeted treatment of diseases, and a systematic understanding of the etiological characteristics of complex diseases. Because the relationship between lncRNA and diseases is complex, biological experiments on lncRNA cost a great deal of money and time, and computer-aided experiments have become an effective research method. Computer-aided experiments can effectively predict the associations between lncRNA and complex diseases, and the predictions are verified against the data sets in open lncRNA databases. The prediction of lncRNA-disease associations is of great significance in biology, medicine and other fields.
In the field of biology, computer-aided experiments can reduce the cost of experiments and improve their success rate; in the field of medicine, computer-aided experiments can help researchers identify lncRNAs related to various diseases and understand the pathogenesis of diseases at the molecular level, so as to effectively prevent and treat diseases. So far, the prediction models put forward by various researchers can be divided into two categories. The first category relies only on miRNA-disease association information or lncRNA-disease association information. Specifically, miRNA-disease associations are predicted from known miRNA-disease associations, and lncRNA-disease associations are predicted from known lncRNA-disease associations. For example, Guang et al. proposed a label propagation model with linear neighborhood similarity, called LPLNS, to predict potential associations between lncRNA and disease [28]. Based on disease semantic similarity and lncRNA-disease association information, Guang et al. developed the NCPLDA model to predict potential associations between lncRNA and diseases through network consistency [29]. Gu et al. proposed a method to infer the pairwise functional similarity and functional network of human miRNA based on the RNA of disease relationship structure, so as to infer new potential functions of miRNA or related diseases [30]. The second category integrates multiple kinds of data: it collects biological data on lncRNA, miRNA, protein, disease and so on, and integrates these data into a matrix or heterogeneous network to infer potential relationships between lncRNA and disease. For example, Yu and Wang et al. developed the NBCLDA model, which integrates a variety of biological data to construct a new tripartite network, including miRNA-disease, miRNA-lncRNA and lncRNA-disease associations and interactions.
Then, a quadruple network is constructed and a naive Bayesian classifier is applied for prediction [31]. Chen et al. proposed a new prediction model called LRLSLDA, which fuses the known phenome-lncRNAome network, disease similarity network and lncRNA similarity network using Laplacian regularized least squares [32]. Yu et al. proposed a novel model, CFNBC, which combines collaborative filtering with naive Bayes and infers potential lncRNA-disease associations by calculating the association score between lncRNA and disease [33]. Lu et al. developed a computational model called SIMCLDA using inductive matrix completion. Its principle is to complete the missing lncRNA-disease interactions based on known interactions, lncRNA similarity data and disease similarity data [34].
However, most lncRNA-disease association prediction methods require known lncRNA-disease associations, and such known associations are quite rare. To solve this problem, this paper proposes LDAP-WMPS, a lncRNA-disease association prediction model based on weight matrix and projection score. The model uses the relatively complete lncRNA-miRNA association data and miRNA-disease association data to predict lncRNA-disease associations. The integrated lncRNA similarity matrix and the integrated disease similarity matrix are established by fusing several methods of calculating similarity for lncRNAs and for diseases. On this basis, the weight algorithm is improved and applied to the lncRNA-miRNA-disease triple network. Based on this network, a new method for calculating the lncRNA-disease weight matrix is proposed. Combined with the improved projection algorithm, the lncRNA-miRNA associations and miRNA-disease associations are used to predict lncRNA-disease associations. The simulation results show that, under the LOOCV framework, LDAP-WMPS reaches an AUC of 0.8822, which is better than the latest results. Taking adenocarcinoma and colorectal cancer as examples, we show that LDAP-WMPS can effectively infer the relationship between lncRNA and disease.

Performance evaluation
We evaluated the performance of the LDAP-WMPS model using leave-one-out cross validation (LOOCV) and compared the results with those of other prediction models under LOOCV. In the LOOCV experiment, for each disease j in the disease data set, we successively remove one lncRNA that is known to be associated with disease j; this removed association serves as the test sample. By comparing the association scores calculated for the test samples with a given threshold, we obtain the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). To obtain the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) for intuitive evaluation, the true positive rate (TPR) and false positive rate (FPR) were calculated:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

The ROC curve was drawn by plotting TPR against FPR, and the AUC was calculated.
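The threshold sweep described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `roc_auc` and the toy scores are hypothetical, a sample is assumed to be predicted positive when its score is at or above the threshold, and the area is computed with the trapezoidal rule.

```python
import numpy as np

def roc_auc(scores, labels):
    """Return (fpr, tpr, auc) by sweeping a threshold over the scores.

    scores: predicted association scores (higher = more likely associated)
    labels: 1 for a known association (positive), 0 otherwise
    Assumes at least one positive and one negative sample.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = int(labels.sum())
    n_neg = int(labels.size - n_pos)

    # Start at (0, 0): threshold above every score, nothing predicted positive.
    tpr, fpr = [0.0], [0.0]
    # Distinct scores from high to low serve as the thresholds.
    for t in np.unique(scores)[::-1]:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        tpr.append(tp / n_pos)  # TPR = TP / (TP + FN)
        fpr.append(fp / n_neg)  # FPR = FP / (FP + TN)

    fpr = np.array(fpr)
    tpr = np.array(tpr)
    # Trapezoidal rule over consecutive ROC points.
    auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc

# Toy example with made-up scores.
fpr, tpr, auc = roc_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(auc)
```

In a LOOCV setting, the scores passed to such a function would be those obtained for the held-out associations together with the unlabeled candidate pairs; how the candidates are pooled is a design choice not specified here.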

Comparison with other advanced models
To demonstrate the effectiveness of the LDAP-WMPS model, we compare it with three other advanced models. The ROC curves and AUC values are obtained by applying the four models to the same dataset. LDAP-WMPS is slightly better than the other methods in terms of the ROC curve, with an AUC of 0.8822. The highest AUCs of the CFNBC [33] and NBCLDA [31] models were 0.8576 and 0.8521, respectively. The results show that our method is slightly better than CFNBC. The results are shown in Table 1.

Analysis of parameters
In this model, we introduce a weighting parameter whose range is [0,1]. When the parameter equals 0, only the disease projection score is used for the final score calculation; when it equals 1, only the lncRNA projection score is used. The results are shown in Figure 3 and Figure 4. When the parameter equals 0.52, the AUC reaches its highest value of 0.8822. To further demonstrate the effectiveness of our lncRNA-disease weight matrix, we evaluated the model with and without the weight matrix; the results are shown in Figure 5. The weight matrix clearly improves the prediction ability of the model.

Case studies
A tumor is a new growth formed by the proliferation of local tissue cells under the action of various oncogenic factors; because this growth mostly appears as a space-occupying mass, it is also known as a neoplasm. According to their cellular characteristics and the degree of harm to the body, tumors are divided into benign and malignant tumors: benign tumors can be removed by surgery and will not metastasize or relapse; malignant tumors, commonly called cancer, metastasize easily, are difficult to cure by surgery, and may still recur after treatment [35]. To further demonstrate the practicability of LDAP-WMPS in lncRNA-disease association prediction, we studied adenocarcinoma and colorectal cancer. The top 20 lncRNAs predicted by LDAP-WMPS for adenocarcinoma and colorectal cancer are shown in Table 2 and Table 3. Colorectal cancer is a common cancer type with high incidence and mortality worldwide. In 2018 alone, the number of new cases reached nearly 2 million, and the number of deaths was nearly 900 thousand. Some data show that in the United States, about 5.2% of men and 4.8% of women are at risk of colorectal cancer, and the mortality caused by colorectal cancer is close to 33% [36]. Many studies have shown that lncRNA is closely related to colorectal cancer.
In our prediction results, 12 of the top 20 lncRNAs associated with colorectal cancer have been confirmed by the medical literature, for example: lncRNA XIST expedites metastasis and modulates epithelial-mesenchymal transition in colorectal cancer [37]; lncRNA SNHG16 promotes colorectal cancer cell proliferation, migration, and epithelial-mesenchymal transition through miR-124-3p/MCP-1 [38]; lncRNA MALAT1 promotes colorectal cancer malignancy by increasing DCP1A expression and miR203 downregulation [39]; lncRNA HCG18 promotes the growth and invasion of colorectal cancer cells through sponging miR-1271 and upregulating MTDH [40]; lncRNA FGD5-AS1 promotes colorectal cancer cell proliferation, migration, and invasion through upregulating CDCA7 via sponging miR-302e [41]; lncRNA TUG1 mediates 5-fluorouracil resistance by acting as a ceRNA of miR-197-3p in colorectal cancer [42]. Adenocarcinoma is a kind of lung cancer. It is the least related to smoking, accounting for 40% of primary adenocarcinoma. It is often located in the peripheral part of the lung, but can also involve the pleura, with the formation of an associated scar ring and pleural effusion. Because of the invasive growth of adenocarcinoma, extensive resection should be performed. The rate of lymph node metastasis of adenocarcinoma is high, reaching 36%-47%. It relapses easily and has a poor prognosis. Lin Guoji reported 68 cases of adenocarcinoma.
The 5-year and 10-year cure rates were 43.9% and 29.0%, respectively [43]. In our prediction results, 14 of the top 20 lncRNAs associated with adenocarcinoma have been confirmed by the medical literature, for example: lncRNA XIST promotes cisplatin resistance in human lung adenocarcinoma cells via the let-7i/BAG-1 axis [44]; lncRNA MALAT1 promotes gastric adenocarcinoma through the miR-181a-5p/AKT3 axis [45]; lncRNA CTB-89H12.4 regulates PTEN expression in prostate cancer [46]; lncRNA HCG18 acts as an oncogene in lung adenocarcinoma and enhances lung adenocarcinoma progression by targeting the miR-34a-5p/HMMR axis [47]; lncRNA SNHG16 promotes cell proliferation and invasion in lung adenocarcinoma via sponging let-7a-5p [48].

Discussion
Exploring the relationship between lncRNA and diseases is not only of great significance for the treatment of diseases, but also helps to understand the human body. Using artificial intelligence to mine existing medical data can both improve the utilization of data and speed up the development of intelligent medicine. In this study, we propose a computational model, LDAP-WMPS. In this model, we propose a weight allocation algorithm based on the lncRNA-miRNA-disease triple network and, on this basis, a method for calculating lncRNA-disease association weights. We then combine the lncRNA-disease weight matrix with the improved projection algorithm to calculate the association score between each lncRNA and each disease. Compared with the other three models, LDAP-WMPS is slightly better in AUC. In the case studies, 14 of the top 20 lncRNAs predicted for adenocarcinoma and 12 of the top 20 predicted for colorectal cancer have been confirmed, which also supports the reliability of LDAP-WMPS. In addition, our model predicts lncRNA-disease associations from lncRNA-miRNA associations and miRNA-disease associations. Because it uses the relatively complete lncRNA-miRNA and miRNA-disease association data sets, it avoids the shortage of known lncRNA-disease association data that limits methods relying on such data.

Conclusion
In this manuscript, we propose a new computational model, LDAP-WMPS. Our main contributions are as follows: 1. We propose an integrated lncRNA similarity calculation method and an integrated disease similarity calculation method. 2. We propose a weight assignment algorithm for the lncRNA-miRNA-disease triple network. 3. Based on the weight distribution of the lncRNA-miRNA-disease triple network, we propose a method for calculating lncRNA-disease weights. 4. We improve the existing consistency projection scoring formula. 5. LDAP-WMPS can predict lncRNA-disease associations without relying on known lncRNA-disease association data.

Method

lncRNA-disease association data set, miRNA-disease association data set, and lncRNA-miRNA association data set
The known lncRNA-disease association dataset was downloaded from the MNDR v2.0 database (2017 Edition) [49]. The known miRNA-disease association dataset was downloaded from the HMDD database (2018 Edition) [50]. The known lncRNA-miRNA association dataset was downloaded from the starBase v2.0 database (2015 Edition) [51]. After data cleaning and name unification, we obtained three datasets.

Cosine similarity for diseases
The cosine similarity for diseases is calculated from the miRNA-disease adjacency matrix:

$$CD(i,j) = \frac{MD(:,i) \cdot MD(:,j)}{\|MD(:,i)\| \, \|MD(:,j)\|}$$

where $MD$ denotes the miRNA-disease adjacency matrix and $MD(:,i)$ is its i-th column vector, which represents the association feature of disease i.
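As a minimal sketch of this calculation, assuming the adjacency matrix is a binary NumPy array with miRNAs as rows and diseases as columns (the function name and the toy matrix are hypothetical, and setting the similarity to 0 for diseases with no known miRNA is an assumption made here to avoid division by zero):

```python
import numpy as np

def cosine_similarity_columns(adj):
    """Pairwise cosine similarity between the columns of an adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    norms = np.linalg.norm(adj, axis=0)   # Euclidean norm of each column
    dots = adj.T @ adj                    # dot product of every column pair
    denom = np.outer(norms, norms)
    sim = np.zeros_like(dots)
    # Divide only where both columns are non-zero; other entries stay 0.
    np.divide(dots, denom, out=sim, where=denom > 0)
    return sim

# Toy miRNA-disease adjacency matrix: 3 miRNAs (rows) x 3 diseases (columns).
toy_adj = np.array([[1, 0, 1],
                    [1, 1, 0],
                    [0, 1, 0]])
print(cosine_similarity_columns(toy_adj))
```

The same function applied to the transpose of a lncRNA-miRNA adjacency matrix would give a lncRNA-by-lncRNA similarity, if lncRNAs are stored as rows.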

Jaccard similarity for diseases
The calculation of similarity is an important part of gene association prediction. At present, most articles calculate similarity with the Gaussian interaction profile kernel. In contrast, we use Jaccard similarity. The Jaccard similarity for diseases is calculated from the miRNA-disease adjacency matrix:

$$JD(i,j) = \frac{|MD(:,i) \cap MD(:,j)|}{|MD(:,i) \cup MD(:,j)|}$$

where $MD(:,i)$ is the i-th column of the miRNA-disease adjacency matrix, interpreted as the set of miRNAs associated with disease i.
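A minimal sketch of the Jaccard calculation, under the same assumptions as before: a binary adjacency matrix with miRNAs as rows and diseases as columns, a hypothetical function name, and similarity set to 0 when neither disease has any associated miRNA.

```python
import numpy as np

def jaccard_similarity_columns(adj):
    """Pairwise Jaccard similarity between the columns of a binary matrix."""
    binary = (np.asarray(adj) > 0).astype(int)
    inter = binary.T @ binary                 # |A ∩ B| for every column pair
    counts = binary.sum(axis=0)               # |A| for every column
    union = counts[:, None] + counts[None, :] - inter   # |A ∪ B|
    sim = np.zeros(inter.shape, dtype=float)
    # Pairs with an empty union keep similarity 0.
    np.divide(inter, union, out=sim, where=union > 0)
    return sim

# Toy miRNA-disease adjacency matrix: 3 miRNAs (rows) x 3 diseases (columns).
toy_adj = np.array([[1, 0, 1],
                    [1, 1, 0],
                    [0, 1, 0]])
print(jaccard_similarity_columns(toy_adj))
```

Unlike cosine similarity, the Jaccard score depends only on which miRNAs are shared, not on vector length, so the two measures can rank disease pairs differently.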

Integrated disease semantic similarity matrix
Integrated disease semantic similarity DS and cosine similarity CD for diseases:

Integrated lncRNA similarity matrix
Integrated miRNA similarity MS and cosine similarity CL for lncRNA:

Establishment of lncRNA-disease weight matrix
The weight assignment algorithm [52] is often used in association prediction on lncRNA bipartite networks. Through the weight distribution, we can obtain the correlation score between lncRNAs and diseases. We further improved the algorithm and applied it to the lncRNA-miRNA-disease triple network, as shown in Figure 6. Taking L to M as an example, the first step is defined as: where m is the number of lncRNA types, n is the number of miRNA types, and e is the number [...] For lncRNA i, we calculated the potential association characteristics between the miRNAs related to lncRNA i, and for disease j, we also calculated the potential association characteristics between the miRNAs related to disease j.

Building LDAP-WMPS Prediction Model
The flow chart of the LDAP-WMPS model is shown in Figure 7. The model consists of three steps: the first step calculates the disease projection score; the second step calculates the lncRNA projection score; the third step fuses the disease projection score and the lncRNA projection score proportionally and then normalizes the result to obtain the prediction score matrix.
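The third step can be sketched as below. This is an illustration under stated assumptions rather than the paper's formula: the parameter name `alpha` is hypothetical, the fusion is assumed to be a convex combination in which 0 keeps only the disease projection score and 1 keeps only the lncRNA projection score (matching the behaviour described in the parameter analysis), the default 0.52 is the best value reported there, and min-max scaling is assumed because the normalization method is not specified.

```python
import numpy as np

def fuse_projection_scores(lnc_proj, dis_proj, alpha=0.52):
    """Fuse two projection score matrices and min-max normalize the result.

    lnc_proj, dis_proj: lncRNA-by-disease score matrices of equal shape.
    alpha: hypothetical fusion weight in [0, 1]; 1 uses only lnc_proj,
           0 uses only dis_proj.
    """
    lnc_proj = np.asarray(lnc_proj, dtype=float)
    dis_proj = np.asarray(dis_proj, dtype=float)
    if lnc_proj.shape != dis_proj.shape:
        raise ValueError("score matrices must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    fused = alpha * lnc_proj + (1.0 - alpha) * dis_proj

    # Min-max normalization (an assumption); a constant matrix maps to zeros.
    lo, hi = fused.min(), fused.max()
    if hi == lo:
        return np.zeros_like(fused)
    return (fused - lo) / (hi - lo)

# Toy projection scores for 2 lncRNAs x 2 diseases.
toy_lnc = [[0.0, 2.0], [4.0, 6.0]]
toy_dis = [[1.0, 1.0], [1.0, 1.0]]
print(fuse_projection_scores(toy_lnc, toy_dis, alpha=0.5))
```

Sweeping `alpha` over [0, 1] and recording the AUC for each value is one way to reproduce a parameter analysis of the kind reported in the results.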
The disease projection score is defined by the following formula:

Similarly, the more similar disease i is to disease j, the higher the value of

The final lncRNA-disease potential association prediction score matrix was formed by fusing the lncRNA projection score with the disease projection score, defined as:

$LDAP\text{-}WMPS(i,j)$ is the final association score between lncRNA i and disease j.

The performance of LDAP-WMPS and other models in terms of PR curves and AUPRs based on 407 known lncRNA-disease associations under the framework of LOOCV

Figure 5
Comparison of ROC curve calculated with weight matrix and ROC curve calculated without weight matrix

Figure 6
Flow chart of lncRNA-disease association weight matrix construction