NDAMDA: Network distance analysis for MiRNA‐disease association prediction

Abstract In recent years, microRNAs (miRNAs) have attracted increasing attention from researchers, as accumulating studies show that miRNAs play important roles in various basic biological processes and that dysregulation of miRNAs is connected with diverse human diseases, particularly cancers. However, experimental methods for identifying associations between miRNAs and diseases remain costly and laborious. In this study, we developed a computational method named Network Distance Analysis for MiRNA‐Disease Association prediction (NDAMDA), which could effectively predict potential miRNA‐disease associations. The highlight of this method was the use of not only the direct network distance between 2 miRNAs (diseases) but also their respective mean network distances to all other miRNAs (diseases) in the network. The model's reliable performance was confirmed by an AUC of 0.8920 in global leave‐one‐out cross‐validation (LOOCV), 0.8062 in local LOOCV and an average AUC of 0.8935 ± 0.0009 in fivefold cross‐validation. Moreover, we applied NDAMDA in 3 types of case studies to predict potential miRNAs related to breast neoplasms, lymphoma, oesophageal neoplasms, prostate neoplasms and hepatocellular carcinoma. Results showed that 86%, 72%, 86%, 86% and 84% of the top 50 predicted miRNAs, respectively, were supported by experimental association evidence. Therefore, NDAMDA is a reliable method for predicting disease‐related miRNAs.

In parallel with the considerable effort being made to identify novel miRNAs, the research community is also interested in predicting and validating miRNAs' associations with diseases. Using experimental methods to uncover such associations is typically costly and time-consuming. Fortunately, by taking advantage of the vast biological data available for miRNAs, computational methods can be an efficient complement to experimental studies. So far, existing computational methods can be broadly divided into 2 categories: (i) those constructing networks and applying the corresponding network-based algorithms and (ii) those utilizing machine learning.
Inspired by the idea that functionally similar miRNAs tend to be related with phenotypically similar diseases, Jiang et al 16 … 18 presented a method to identify miRNA-disease associations based on the assumption that a disease tends to be associated with miRNAs whose target genes also have associations with this disease. They carried out a random walk analysis on a protein-protein interaction (PPI) network, and the analysis took into account the global network distance measure and the functional links between miRNAs' targets and disease genes. However, the methods mentioned above had two common drawbacks: the high false-positive and false-negative rates in miRNA-target interactions and the incompleteness of the disease-gene association network.
To overcome such drawbacks, researchers developed computational models that did not rely on miRNA-target interactions. Chen et al 19 put forward RWRMDA, the first global network similarity-based model, to capture the associations between miRNAs and diseases. Although the model was based on random walk, which made full use of global network information, it was not applicable to new diseases without any known related miRNAs. Chen et al 20 proposed another global ranking model called WBSMDA, which utilized Gaussian interaction profile kernel similarity for diseases and miRNAs. As an upgrade to RWRMDA, WBSMDA could be implemented for diseases without any known related miRNAs. However, WBSMDA might be biased towards miRNAs with more known associated diseases, and its scores needed to be integrated more reasonably. Xuan et al 21 presented HDMP based on weighted k most similar neighbours and the miRNA functional similarity. For a specific disease, the relevance score of a miRNA was calculated by summing all subscores of the miRNA's k neighbours. The subscore of a neighbour was calculated by multiplying the functional similarity between the miRNA and the neighbour by the weight of the neighbour; the weight was assigned according to the neighbour's miRNA family or cluster. Members of the same miRNA family or cluster were assigned higher weights because they were usually transcribed together and were therefore more likely to be associated with similar diseases. However, this method also had some limitations: on the one hand, HDMP could not be applied to new diseases that did not have any known related miRNAs; on the other hand, HDMP did not make full use of global network similarity information. Pasquier et al 22 proposed a method named MiRAI, which represented distributional information on miRNAs and diseases in a high-dimensional vector space and reduced the dimensions with the help of singular value decomposition (SVD). The association score for a miRNA-disease pair was measured by the cosine similarity between the miRNA vector in the miRNA space and the disease vector in the disease space. However, the prediction accuracy of MiRAI was low because the model suffered from the data sparsity problem.
Besides, several computational models have adopted machine learning methods to uncover associations between miRNAs and diseases. Under the assumption that miRNAs involved in a specific tumour phenotype will exhibit aberrant regulation of their target genes, Xu et al 23 … association prediction. Based on this model, we could obtain not only new miRNA-disease associations but also their corresponding association types. Recently, Li et al 28 proposed MCMDA based on the observation that the miRNA-disease association matrix was low-rank. They filled the candidate samples without known associations with zero and then iteratively updated them with the predictive scores.
As mentioned above, the existing methods have different limitations. For example, miRNA-target interactions and disease-genes associations used in some methods are inaccurate or incomplete.
Furthermore, many methods could not be applied to diseases without any known related miRNAs, and many were constructed without optimal parameters. Therefore, new effective computational methods are urgently needed. Based on the assumption that functionally similar miRNAs tend to be associated with simi-

2.3 | Disease semantic similarity model 1

We described each disease as a directed acyclic graph (DAG) with the help of the disease MeSH descriptors downloaded from the National Library of Medicine (http://www.nlm.nih.gov). 33 Taking disease $d(i)$ as an example, $DAG(d(i)) = (d(i), T(d(i)), E(d(i)))$, where $T(d(i))$ was the node set consisting of node $d(i)$ itself and its ancestor nodes, and $E(d(i))$ was the corresponding edge set composed of the direct edges from parent nodes to child nodes. Therefore, summing all the contributions from ancestor diseases and disease $d(i)$ itself, we could calculate the semantic value of disease $d(i)$ as follows:

$$DV1(d(i)) = \sum_{t \in T(d(i))} D1_{d(i)}(t), \qquad D1_{d(i)}(t) = \begin{cases} 1 & \text{if } t = d(i) \\ \max\{\Delta \cdot D1_{d(i)}(t') \mid t' \in \text{children of } t\} & \text{if } t \neq d(i) \end{cases}$$

where $\Delta$ was the semantic contribution factor. The contribution of disease $d(i)$ to its own semantic value was defined as 1; the contribution decreased as the distance between $d(i)$ and other diseases increased. Therefore, disease terms in the same layer had the same contribution to the semantic value of disease $d(i)$. We reasoned that 2 diseases sharing a larger part of their DAGs had greater semantic similarity. Here, we defined the semantic similarity between $d(i)$ and $d(j)$ as follows:

$$SS1(d(i), d(j)) = \frac{\sum_{t \in T(d(i)) \cap T(d(j))} \left( D1_{d(i)}(t) + D1_{d(j)}(t) \right)}{DV1(d(i)) + DV1(d(j))}$$
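The model 1 computation can be illustrated with a minimal Python sketch. It assumes the commonly used DAG-based definition (a disease contributes 1 to itself, and each step up to an ancestor multiplies the contribution by Δ, keeping the maximum over paths). The toy hierarchy `PARENTS`, the function names and the value Δ = 0.5 are hypothetical choices for illustration and are not taken from the text.

```python
from typing import Dict, List

# Hypothetical toy hierarchy (not real MeSH data): disease -> direct parents.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "A": ["root"],
    "B": ["A"],
    "C": ["A"],
}


def contributions(disease: str, parents: Dict[str, List[str]],
                  delta: float = 0.5) -> Dict[str, float]:
    """Contribution of `disease` and each of its ancestors to its semantic value."""
    contrib = {disease: 1.0}  # a disease contributes 1 to itself
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            value = delta * contrib[node]
            # Keep the largest contribution over all paths to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)
    return contrib


def semantic_similarity(a: str, b: str, parents: Dict[str, List[str]],
                        delta: float = 0.5) -> float:
    """Shared-DAG contributions divided by the sum of both semantic values."""
    ca = contributions(a, parents, delta)
    cb = contributions(b, parents, delta)
    shared = ca.keys() & cb.keys()
    numerator = sum(ca[t] + cb[t] for t in shared)
    return numerator / (sum(ca.values()) + sum(cb.values()))
```

In this toy hierarchy, the siblings "B" and "C" share the ancestors "A" and "root", so their similarity is below 1, while the similarity of a disease with itself is exactly 1.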

| Disease semantic similarity model 2
Disease semantic similarity model 1 was unsatisfactory in that 2 diseases located in the same layer of $DAG(d(i))$ might appear in different numbers of disease DAGs. The disease that appeared in more DAGs was clearly less specific. Therefore, we developed disease semantic similarity model 2 to complement model 1. We defined the contribution of disease $t$ in $DAG(d(i))$ to the semantic value of disease $d(i)$ as follows:

$$D2_{d(i)}(t) = -\log\left( \frac{\text{the number of DAGs including } t}{\text{the number of diseases}} \right)$$

Based on the assumption that 2 diseases sharing a larger part of their DAGs have stronger semantic similarity, we summed all the contributions from ancestor diseases and the disease itself to determine the semantic value of disease $d(i)$ in a similar way to model 1:

$$DV2(d(i)) = \sum_{t \in T(d(i))} D2_{d(i)}(t)$$

The disease semantic similarity matrix SS2 was given by

$$SS2(d(i), d(j)) = \frac{\sum_{t \in T(d(i)) \cap T(d(j))} \left( D2_{d(i)}(t) + D2_{d(j)}(t) \right)}{DV2(d(i)) + DV2(d(j))}$$

where $DV2(d(i))$ and $DV2(d(j))$ were the semantic values of $d(i)$ and $d(j)$, respectively.
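As a rough illustration of model 2, the following Python sketch assumes the usual information-content form of the contribution, that is, the negative logarithm of the fraction of disease DAGs that contain a term. The toy hierarchy `PARENTS` and all function names are hypothetical and only serve the example.

```python
import math
from typing import Dict, List, Set

# Hypothetical toy hierarchy (not real MeSH data): disease -> direct parents.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "A": ["root"],
    "B": ["A"],
    "C": ["A"],
}


def dag_nodes(disease: str, parents: Dict[str, List[str]]) -> Set[str]:
    """The disease itself plus all of its ancestors."""
    nodes = {disease}
    stack = [disease]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in nodes:
                nodes.add(parent)
                stack.append(parent)
    return nodes


def model2_contributions(parents: Dict[str, List[str]]) -> Dict[str, float]:
    """-log(fraction of disease DAGs containing t): rarer terms weigh more."""
    dags = {d: dag_nodes(d, parents) for d in parents}
    n = len(dags)
    counts = {t: sum(t in nodes for nodes in dags.values()) for t in parents}
    return {t: -math.log(counts[t] / n) for t in parents}


def semantic_similarity2(a: str, b: str, parents: Dict[str, List[str]]) -> float:
    contrib = model2_contributions(parents)
    ta, tb = dag_nodes(a, parents), dag_nodes(b, parents)
    dv_a = sum(contrib[t] for t in ta)
    dv_b = sum(contrib[t] for t in tb)
    if dv_a + dv_b == 0:  # e.g. the root compared with itself in this toy data
        return 0.0
    # The contribution does not depend on the DAG, so shared terms count twice.
    return 2 * sum(contrib[t] for t in ta & tb) / (dv_a + dv_b)
```

Here "root" appears in every DAG and therefore contributes nothing, while "A", which appears in three of the four DAGs, contributes only a little to the similarity between "B" and "C".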

| Gaussian interaction profile kernel similarity for diseases
Using the topological information of the known miRNA-disease association network, we proposed Gaussian interaction profile kernel similarity for diseases based on the assumption that functionally similar miRNAs tend to be associated with similar diseases. Here, we used the vector $IP(d(i))$ to represent the interaction profile of disease $d(i)$; it was calculated from the association information between the disease and each miRNA, that is, the $i$th row of the adjacency matrix $Y$.
Then, the Gaussian kernel similarity between diseases $d(i)$ and $d(j)$ was defined based on their interaction profiles as follows:

$$KS_d(d(i), d(j)) = \exp\left( -\gamma_d \lVert IP(d(i)) - IP(d(j)) \rVert^2 \right)$$

where parameter $\gamma_d$ was used to control the kernel bandwidth and was calculated as follows:

$$\gamma_d = \gamma'_d \Big/ \left( \frac{1}{nd} \sum_{i=1}^{nd} \lVert IP(d(i)) \rVert^2 \right)$$

where $nd$ was the number of diseases and $\gamma'_d$ was the original bandwidth parameter.
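A minimal Python sketch of the Gaussian interaction profile kernel is given below. It assumes the standard form of this kernel, in which the bandwidth is an original bandwidth `gamma_prime` divided by the mean squared norm of the profiles; the default value of 1.0, the function name and the toy matrix `Y` are assumptions made for illustration.

```python
import math
from typing import List


def gip_kernel(profiles: List[List[float]],
               gamma_prime: float = 1.0) -> List[List[float]]:
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    n = len(profiles)
    # Normalise the bandwidth by the mean squared norm of the profiles.
    mean_sq_norm = sum(x * x for row in profiles for x in row) / n
    gamma = gamma_prime / mean_sq_norm

    def sq_dist(u: List[float], v: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [[math.exp(-gamma * sq_dist(u, v)) for v in profiles]
            for u in profiles]


# Toy adjacency matrix (hypothetical): one row per disease, one column per miRNA.
Y = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]
KS_d = gip_kernel(Y)
```

The same function yields the miRNA kernel when it is applied to the transposed matrix, so that each row is a miRNA profile. The sketch does not guard against an all-zero matrix, for which the mean squared norm would be zero.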

| Gaussian interaction profile kernel similarity for miRNAs
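The Gaussian interaction profile kernel similarity matrix for miRNAs could be calculated in a similar way:

$$KS_m(m(i), m(j)) = \exp\left( -\gamma_m \lVert IP(m(i)) - IP(m(j)) \rVert^2 \right), \qquad \gamma_m = \gamma'_m \Big/ \left( \frac{1}{nm} \sum_{i=1}^{nm} \lVert IP(m(i)) \rVert^2 \right)$$

where $KS_m(m(i), m(j))$ was the Gaussian interaction profile kernel similarity between miRNAs $m(i)$ and $m(j)$, and $nm$ was the number of miRNAs.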

| Integrated similarity for miRNAs and diseases
Here, the integrated miRNA similarity matrix $S_m$ and the integrated disease similarity matrix $S_d$ were constructed based on miRNA functional similarity, disease semantic similarity and Gaussian interaction profile kernel similarity. For miRNA pairs without functional similarity and disease pairs without semantic similarity, we used $KS_m$ and $KS_d$, respectively, to represent the similarity between them. For miRNA pairs that had functional similarity, we used FS; for disease pairs that had semantic similarity, we used the average of SS1 and SS2.
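The integration rule can be sketched in a few lines of Python. The encoding of a missing functional or semantic similarity as `None` is an assumption of this sketch (the text does not state how missing values are marked), and the small matrices are hypothetical.

```python
from typing import List, Optional

Matrix = List[List[Optional[float]]]


def average(ss1: Matrix, ss2: Matrix) -> Matrix:
    """Element-wise mean of SS1 and SS2; stays None where either is missing."""
    return [[None if a is None or b is None else (a + b) / 2
             for a, b in zip(r1, r2)]
            for r1, r2 in zip(ss1, ss2)]


def integrate(primary: Matrix, kernel: List[List[float]]) -> List[List[float]]:
    """Use the primary similarity where available, else the kernel similarity."""
    return [[k if p is None else p for p, k in zip(pr, kr)]
            for pr, kr in zip(primary, kernel)]


# Hypothetical 2 x 2 inputs; None marks a pair without that kind of similarity.
FS = [[1.0, None], [None, 1.0]]
KS_m = [[1.0, 0.3], [0.3, 1.0]]
S_m = integrate(FS, KS_m)

SS1 = [[1.0, 0.4], [0.4, 1.0]]
SS2 = [[1.0, 0.2], [0.2, 1.0]]
KS_d = [[1.0, 0.9], [0.9, 1.0]]
S_d = integrate(average(SS1, SS2), KS_d)
```

In this example the miRNA pair has no functional similarity, so `S_m` falls back to the kernel value, whereas the disease pair has semantic similarity, so `S_d` uses the mean of SS1 and SS2 and ignores the kernel value.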

| NDAMDA
We developed NDAMDA, which consisted of 3 steps: (I) network distance computation and adjustment; (II) calculation of the confidence; and (III) score conversion (See Figure 1).

| Network distance computation and adjustment for miRNAs
We could obtain the similarity between two miRNAs directly from our previous work; for example, we could extract the functional similarity between m(i) and m(j) as FS(i, j). The raw network distance between two miRNAs with a link in the network was then defined as D = 1/FS, such that a smaller D (shorter distance) would correspond to a higher functional similarity. For miRNAs without direct links, we used Gaussian interaction profile kernel similarity to fill in the missing values. In summary, the raw distance was determined as $D = 1/S_m$. To develop a comprehensive network, we considered both the distance between two miRNAs and their respective mean network distances to all other miRNAs, and defined the adjusted distance $Dm^{adj}_{ij}$ from these quantities.
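The distance step can be sketched in Python as follows. The raw distance follows D = 1/S_m as stated in the text. The exact adjustment formula is not given in this section, so the normalisation used below (dividing each raw distance by the geometric mean of the two miRNAs' mean distances to all other miRNAs) is only an assumption that matches the verbal description; the function names and the toy matrix are likewise hypothetical.

```python
import math
from typing import List


def raw_distance(similarity: List[List[float]]) -> List[List[float]]:
    """Raw distance D = 1 / S for distinct nodes (assumes S > 0); 0 on the diagonal."""
    n = len(similarity)
    return [[0.0 if i == j else 1.0 / similarity[i][j] for j in range(n)]
            for i in range(n)]


def adjusted_distance(raw: List[List[float]]) -> List[List[float]]:
    """ASSUMED adjustment: raw distance over the geometric mean of mean distances."""
    n = len(raw)
    # Mean distance of each node to all *other* nodes.
    mean = [sum(raw[i][j] for j in range(n) if j != i) / (n - 1)
            for i in range(n)]
    return [[raw[i][j] / math.sqrt(mean[i] * mean[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical integrated miRNA similarity matrix S_m for three miRNAs.
S_m = [
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.5],
    [0.25, 0.5, 1.0],
]
D_raw = raw_distance(S_m)
D_adj = adjusted_distance(D_raw)
```

Under this assumed normalisation, a pair of miRNAs that are both far from everything else is not penalised as heavily as the raw distance alone would suggest.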

| Calculation of the confidence in miRNAs
We reasoned that, for a specific disease in the network, a related miRNA was closer to other related miRNAs than to random miRNAs. Therefore, we introduced the confidence C(m(i)) in miRNA m(i), which was computed from R, the richness of the given disease (the total number of its known related miRNAs), and $Dm^{adj}_{ij}$, the adjusted network distance between m(i) and m(j). A larger C(m(i)) would suggest that the investigated miRNA had a relatively shorter distance (stronger functional interaction) to known related miRNAs than to random miRNAs; the miRNA was therefore more likely to be associated with the investigated disease.
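The idea behind the confidence can be illustrated with a short Python sketch. The exact formula is not available in this section, so the form used here (mean adjusted distance to all other miRNAs minus mean adjusted distance to the R known related miRNAs) is an assumption chosen only because it grows when a miRNA lies closer to the related miRNAs than to miRNAs in general. The function name and the toy distance matrix are hypothetical.

```python
from typing import List, Set


def confidence(adj: List[List[float]], related: Set[int], i: int) -> float:
    """ASSUMED confidence: mean distance to all others minus mean distance to related."""
    n = len(adj)
    others = [j for j in range(n) if j != i]
    rel = [j for j in related if j != i]  # exclude the miRNA itself
    if not rel:
        raise ValueError("the disease has no other known related miRNAs")
    mean_all = sum(adj[i][j] for j in others) / len(others)
    mean_related = sum(adj[i][j] for j in rel) / len(rel)
    return mean_all - mean_related


# Hypothetical adjusted distances: miRNAs 0-2 are mutually close, miRNA 3 is remote.
D_adj = [
    [0.0, 1.0, 1.0, 4.0],
    [1.0, 0.0, 1.0, 4.0],
    [1.0, 1.0, 0.0, 4.0],
    [4.0, 4.0, 4.0, 0.0],
]
RELATED = {0, 1}  # miRNAs already known to be related to the disease
```

With these toy values, candidate miRNA 2 sits next to the known related miRNAs and receives a positive confidence, whereas the remote miRNA 3 receives zero.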
2.12 | Calculation of the confidence to pick the diseases regulated by specific miRNA

Similarly, given a specific miRNA, we reasoned that a regulated disease had a stronger integrated similarity with other regulated diseases than with random diseases. Here, we introduced the confidence in d(i), C(d(i)), defined analogously. A larger C(d(i)) would suggest that the disease under investigation was more likely to be associated with the given miRNA.

| Score conversion
For a given disease, the confidences in different miRNAs could be compared with each other, with a higher confidence indicating a higher probability of being an associated miRNA. However, they could not be directly …

… MiRAI. 22 In LOOCV, each known association was used in turn as the validation sample and the remaining known associations were regarded as the training samples. The miRNA-disease pairs without any known association evidence were considered as candidate samples. The known miRNA-disease associations were obtained from the HMDD v2.0 database. 31 The association scores of all miRNA-disease pairs would be returned by NDAMDA. In global LOOCV, the score of the validation sample was compared with the scores of all the candidate samples, while in local LOOCV, the score was compared only with the scores of candidate samples for the investigated disease.
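The difference between global and local LOOCV can be sketched in Python as follows. The scorer `toy_predict` is a deliberately trivial placeholder and not NDAMDA; the matrix layout (rows are diseases, columns are miRNAs), the tie-handling rule (only strictly higher candidate scores push the rank down) and all names are assumptions of this sketch.

```python
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]


def toy_predict(train: Matrix) -> Matrix:
    """Placeholder scorer (NOT NDAMDA): a miRNA's score is its number of known diseases."""
    col_sums = [sum(row[m] for row in train) for m in range(len(train[0]))]
    return [[float(s) for s in col_sums] for _ in train]


def loocv_ranks(Y: Matrix, predict: Callable[[Matrix], Matrix]
                ) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """Return {(disease, miRNA): (global rank, local rank)} for each known pair."""
    # Candidate samples: pairs without any known association.
    candidates = [(d, m) for d, row in enumerate(Y)
                  for m, y in enumerate(row) if y == 0]
    ranks = {}
    for d, row in enumerate(Y):
        for m, y in enumerate(row):
            if y != 1:
                continue
            train = [list(r) for r in Y]
            train[d][m] = 0  # hide the validation sample, then re-score
            scores = predict(train)
            held = scores[d][m]
            global_rank = 1 + sum(scores[cd][cm] > held for cd, cm in candidates)
            local_rank = 1 + sum(scores[cd][cm] > held
                                 for cd, cm in candidates if cd == d)
            ranks[(d, m)] = (global_rank, local_rank)
    return ranks


# Hypothetical adjacency matrix: 2 diseases (rows) x 3 miRNAs (columns).
Y = [
    [1, 1, 0],
    [1, 0, 1],
]
RANKS = loocv_ranks(Y, toy_predict)
```

Sweeping a rank threshold over such ranks gives the true and false positive rates from which a ROC curve and its AUC can be computed.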
In fivefold cross-validation, the known miRNA-disease associations were randomly partitioned into 5 equally sized subsets. Each subset was retained as the validation set in turn, and the remain-

| Case studies
To demonstrate the sound prediction accuracy of our method, we further carried out 3 types of case studies on 5 important diseases.
In the first type of case studies, the top 10 and top 50 predicted miRNAs for the investigated diseases were validated using two other miRNA-disease databases, namely dbDEMC 29 and miR2Disease. 30 Breast cancer is the most commonly diagnosed cancer in females. With more than 1 million new cases every year, breast cancer is ranked as the second most frequent cancer type when considering both sexes together. 34 Incidence rates are high in most of the developed areas, and more than half of the cases are in industrialized countries. 35 It is the leading cause of death in females aged 20-59. 36 With the rapid development of high-throughput sequencing technologies, researchers have identified many miRNAs associated with breast cancer. For example, higher levels of circulating miR-122 specifically predicted metastatic recurrence in patients with stage II-III breast cancer. 37 Besides, miR-155 was up-regulated more than twofold in breast cancer compared with normal adjacent tissue (NAT), 38 while a decreased level of serum miR-155 was found after surgery and 4 cycles of chemotherapy. 39 Here, we implemented NDAMDA to identify potentially related miRNAs for breast neo- … Table 1).
Oesophageal cancer is the eighth most common cancer worldwide (accounting for about 500 000 new cases every year) and the sixth most common cause of death by cancers (with 400 000 deaths each year). 34 Moreover, cancer in the oesophagus is usually 3-4 times more common among males than females and has a very low … diseases. 43 Among the top 10 and top 50 potential oesophageal cancer-related miRNAs, 8 and 43 predicted miRNA-disease associations, respectively, were supported by database evidence (See Table 2).
Lymphoma begins in cells of the immune system and can be divided into two main categories: Hodgkin lymphoma and non-Hodgkin lymphoma, the latter accounting for 90 per cent of all lymphomas. 44,45 Hodgkin lymphoma can be identified by the presence of a type of cell called the Reed-Sternberg cell, whereas non-Hodgkin lymphoma comprises a large, diverse group of cancers of immune system cells. 46,47 Recently, researchers found that the expression of miR-21 in the plasma of patients with lymphoma correlated significantly with their serum LDH level and that the expression levels of miR-21, miR-155 and miR-210 in the plasma of patients with lymphoma were significantly higher. 48 It was also reported that miR-203, miR-218, miR-181a, miR-19a and miR-17 were associated with lymphoma: the former 3 miRNAs functioned as tumour suppressors, and the latter two were found to up-regulate oncogenes for lymphoma. 49 This finding coincided with the generally accepted idea that canine lymphoma is a common spontaneous tumour with great similarities to human lymphoma. 50 Similarly, we used dbDEMC and miR2Disease to validate the potentially associated miRNAs for lymphoma, and 9 of the top 10 and 36 of the top 50 candidate miRNAs were confirmed by the two databases (See Table 3).
In addition to the above 3 cancers, we used NDAMDA to prioritize candidate miRNAs for all diseases in HMDD v2.0 and the results are included in Table S1.
To assess the ability of NDAMDA to predict potentially related miRNAs for diseases without any known associated miRNAs, we carried out another case study on prostate cancer. Its associated miRNAs were removed from the training set, and the remaining known miRNA-disease associations were used to train NDAMDA. In this manner, potentially related miRNAs for prostate cancer were uncovered using only the information on miRNAs related to other diseases … Table 5). Taking … Therefore, NDAMDA would be a useful resource for researchers to discover associations between diseases and miRNAs.
In our work, the case studies were based on cancers. The hallmarks of cancer are one of the most widely acknowledged organizing principles for research on cancer, and currently, ten hallmarks have been identified to represent the acquired capabilities that distinguish cancer from normal tissue. 52 These hallmarks are (1) self-sufficiency in growth signals; (2) insensitivity to antigrowth signals; (3) evading apoptosis; (4) limitless replicative potential; (5) …

The disease's associated miRNAs were removed from the training set, and the remaining known miRNA-disease associations were used to train NDAMDA. Subsequently, the top 50 prediction outcomes were confirmed with HMDD v2.0, dbDEMC and miR2Disease. The first column records the top 1-25 related miRNAs. The second column records the top 26-50 related miRNAs.

… we would consider incorporating information on cancer hallmark-gene associations in our analysis and examine whether this information could enhance the accuracy of our algorithm.
The reliable performance of NDAMDA could be attributed to several factors as follows. Firstly, heterogeneous datasets (disease-miRNA associations from HMDD, miRNA functional similarity, disease semantic similarity and Gaussian interaction profile kernel similarity for diseases and miRNAs) were integrated to construct the informative network for prediction. Secondly, we used the adjusted network distance and the algorithm for calculating the confidence in a specific miRNA and the confidence in a specific disease. Finally, we used a score conversion procedure that considered the variation in the number of related miRNAs for different diseases.
Yet, NDAMDA still has limitations. Firstly, more known miRNA-disease associations are necessary for building a more accurate adjacency network and improving the performance of NDAMDA. Secondly, the model might be biased towards miRNAs with more known related diseases, as it was based on the assumption that functionally similar miRNAs are more likely to be connected with similar diseases. Thirdly, NDAMDA might not be applicable to diseases whose associated miRNAs tend to be distributed randomly in the network, and how to integrate the two scores into the final score in a more reasonable way should be studied in future.
Finally, although NDAMDA exhibited a commendable predictive performance with the currently available 5430 associations between 495 miRNAs and 383 diseases from HMDD v2.0, this association dataset was still limited; it contained a large amount of unlabelled data and only a very small amount (2.86%) of labelled data, which negatively affected the prediction accuracy. As experimental research continues, more miRNA-disease associations are expected to be biologically verified in future. With an improved association dataset, our model would be able to uncover disease-related miRNAs with even higher accuracy.

ACKNOWLEDGEMENTS
XC was supported by National Natural Science Foundation of China under Grant Nos. 61772531 and 11631014.