Prediction of MicroRNAs Associated with Human Diseases Based on Weighted k Most Similar Neighbors

Background: The identification of human disease-related microRNAs (disease miRNAs) is important for further investigating their involvement in the pathogenesis of diseases. More experimentally validated miRNA-disease associations have been accumulated recently. On the basis of these associations, it is essential to predict disease miRNAs for various human diseases. It is useful in providing reliable disease miRNA candidates for subsequent experimental studies.


Introduction
MicroRNAs (miRNAs) are a set of short (21-24 nt) non-coding RNAs that play important roles in gene regulation by targeting mRNAs for cleavage or translational repression [1,2]. MiRNAs are involved in many important biological processes, including cell differentiation, proliferation, and apoptosis [3]. Furthermore, accumulating evidence indicates that miRNAs are associated with various human diseases [4][5][6][7].
Identifying the relationship between miRNAs and diseases by experimental methods, such as microarray profiling and qRT-PCR, has proven successful. However, false positive microarray results can be caused by the different melting temperatures of miRNAs [8][9][10][11]. Furthermore, probe design greatly increases the experimental cost [12][13][14]. Therefore, computational methods that predict reliable disease-related miRNA candidates are a valuable complement to experimental studies [15][16][17][18][19][20][21][22]. So far, little work is available on predicting disease miRNAs.
First, it was shown that functionally related miRNAs tend to be associated with phenotypically similar diseases [18]. Jiang et al. constructed a miRNA network by establishing a functional relationship between two miRNAs based on their target genes. The target genes were predicted by the target prediction programs PITA [23] and TargetScan [24]. They integrated the miRNA network with a phenome network to infer potential miRNA-disease associations. In addition, Jiang et al. further improved the calculation of the concordance score between a miRNA and a given disease [19]. However, the high false positive rate of miRNA target predictions [25] restrains the efficacy of Jiang's methods.
Second, it was reported that if miRNAs are associated with a similar regulatory pattern in the same type of disease, their target genes may share common functional characteristics [20]. Based on these results, Li et al. prioritized the miRNAs for a specific disease by estimating the functional consistency score (FCS) between their target genes and the known target genes associated with the disease [21]. The target genes of these miRNAs were predicted by the target prediction programs miRanda [26], PicTar [27] and TargetScan. The FCS method was applied to 11 human diseases, including breast cancer and lung cancer. However, besides the high false positive rate of miRNA target predictions, the limited number of known disease-related target genes restricts the method's usage to these 11 diseases. For other important human diseases, such as heart failure, the method cannot provide prediction results.
Third, it is observed that miRNAs with similar functions are often associated with similar diseases and vice versa [8,20,28,29]. The functional similarity of two miRNAs is successfully estimated by the semantic similarity of their associated two groups of diseases [20]. On the basis of the calculated similarities, RWRMDA constructed a miRNA similarity network. The new miRNA-disease associations are predicted based on random walking on the network [22]. However, the association information between the miRNAs passed by a walker and a specific disease is overlooked. Furthermore, RWRMDA does not consider the characteristics of the members from miRNA family or cluster.
In this study, we improved the functional similarity estimation method developed in [20] by further considering the information content of disease terms and the phenotype similarity between diseases. Subsequently, the members of a miRNA family or cluster are assigned higher weight since they are more likely to be associated with similar diseases. Finally, we present an effective prediction algorithm based on weighted k most similar neighbors (HDMP). HDMP's prediction performance is evaluated by 5-fold cross validation and by another validation based on an updated dataset. The results indicate that HDMP achieves better performance than the existing methods.

Materials and Methods
Disease miRNAs prediction based on weighted k most similar neighbors
For a specific disease d, we refer to the experimentally validated miRNAs associated with d as the labeled miRNAs. The others, for which there is so far no evidence of an association with d, are referred to as the unlabeled miRNAs. As the unlabeled miRNAs may be associated with d, our goal is to rank them according to how likely they are to be associated with d. To achieve this goal, we assign each unlabeled miRNA u a relevance score Score(u). A greater Score(u) means a higher possibility that u is associated with d. We then rank all the unlabeled miRNAs according to their relevance scores and select the top ranked miRNAs as potential d-related candidates.
The process of predicting d-related candidates includes four steps, as shown in Figure 1. First, the functional similarity of any two miRNAs is calculated by incorporating the semantic similarity and the phenotype similarity between diseases, and then a symmetric functional similarity matrix is constructed. Second, the members of each miRNA family or cluster are assigned higher weight according to the miRNA-disease association information in the family or cluster. Third, the relevance score of each unlabeled miRNA is estimated by considering the functional similarities of its weighted k most similar neighbors and the distribution information of the labeled miRNAs in these neighbors. Fourth, all the unlabeled miRNAs are ranked by their relevance scores. The miRNAs with higher ranks are potential d-related candidates. Our proposed prediction method is referred to as HDMP.

MiRNA functional similarity measurement
The prediction performance of HDMP is highly dependent on accurate miRNA functional similarity measurement. It is observed that miRNAs with similar functions are often associated with similar diseases and vice versa [8,20,28,29]. Therefore, Wang et al. proposed to estimate the functional similarity of two miRNAs by measuring the semantic similarity of their associated diseases [20]. In this section we give a brief overview of Wang's measurement. In the next section, we point out its inadequacy and propose an improved estimation strategy.
Assume that DT_u and DT_v represent the groups of diseases associated with the miRNAs u and v, respectively. For example, DT_u = {liver neoplasms (LN), breast neoplasms (BN)} and DT_v = {pancreatic neoplasms (PN), breast neoplasms (BN)}. The similarity between DT_u and DT_v is taken as the functional similarity between u and v, denoted as Misim(u, v). As shown in Figure 2, Wang's measurement process contains three steps.
First, the semantic similarity of any two diseases, one from DT_u and one from DT_v, is calculated, such as SS(LN, PN). The two diseases LN (liver neoplasms) and PN (pancreatic neoplasms) are each represented by a directed acyclic graph (DAG), as shown in Figure 3. In the DAG of LN, 'liver neoplasms' in the 0th layer is the most specific disease term, and therefore its contribution to its own semantic value is defined as 1. Since 'digestive system neoplasms' in the 1st layer is a more general denomination, its contribution is multiplied by the semantic contribution factor (Δ = 0.5). Wang et al. defined the factor Δ to differentiate the semantic contribution values of disease terms in different layers. 'Neoplasms by site' in the 2nd layer is even more general than 'digestive system neoplasms', and its contribution is further factored as 0.5 × 0.5. Thus, the semantic value of LN is DV(LN) = 1.0 ('liver neoplasms') + 0.5 ('digestive system neoplasms') + 0.5 × 0.5 ('neoplasms by site') + 0.5 × 0.5 × 0.5 ('neoplasms') + 0.5 ('liver diseases') + 0.5 × 0.5 ('digestive system diseases') = 2.625. In the same way, the semantic value of PN is DV(PN) = 3.375. Suppose T_LN is the set of all ancestor nodes of 'liver neoplasms', including the node 'liver neoplasms' itself, and t represents a blue node shared by the two diseases, i.e., t ∈ T_LN ∩ T_PN. The sum of the semantic contributions of all the blue nodes in Figure 3A is ΣD(LN, t) = 0.5 ('digestive system neoplasms') + 0.5 × 0.5 ('neoplasms by site') + 0.5 × 0.5 × 0.5 ('neoplasms') + 0.5 × 0.5 ('digestive system diseases') = 1.125. Similarly, ΣD(PN, t) is obtained from the blue nodes in Figure 3B.
Third, the similarity between the two groups of diseases, DT_u and DT_v, is calculated and taken as the similarity of their associated miRNAs, i.e., the functional similarity of u and v. It is denoted as Misim(u, v); its definition follows [20].
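As a minimal sketch of the layer-based semantic value in the first step, the following Python snippet recomputes DV(LN) and the shared-node sum from the layer assignments given in the text. The dictionary and function names are illustrative and not part of the original method's code; the set of shared (blue) nodes is taken from the worked sum above.

```python
# Semantic contribution factor used in Wang's measurement.
DELTA = 0.5

# Layer of each term in the DAG of 'liver neoplasms' (LN), as listed in the text.
LN_LAYERS = {
    "liver neoplasms": 0,
    "digestive system neoplasms": 1,
    "liver diseases": 1,
    "neoplasms by site": 2,
    "digestive system diseases": 2,
    "neoplasms": 3,
}

# Terms that also occur in the DAG of 'pancreatic neoplasms' (the blue nodes).
SHARED_WITH_PN = {
    "digestive system neoplasms",
    "neoplasms by site",
    "neoplasms",
    "digestive system diseases",
}


def contribution(layer, delta=DELTA):
    """Semantic contribution of a term that sits `layer` steps above the disease."""
    return delta ** layer


def semantic_value(layers):
    """Sum of the contributions of all terms in a disease DAG."""
    return sum(contribution(layer) for layer in layers.values())


dv_ln = semantic_value(LN_LAYERS)
shared_ln = sum(contribution(LN_LAYERS[t]) for t in SHARED_WITH_PN)
print(dv_ln, shared_ln)  # 2.625 1.125
```

Storing the layer per term keeps the sketch independent of the exact edge structure of Figure 3, which is not reproduced here.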
Incorporating information content of disease terms and disease phenotype similarity
Wang's measurement has been proved successful in estimating the functional similarity of two miRNAs. However, we found an inadequacy in it. As shown in Figure 3, the farther a disease term is from the 0th layer, the more general the disease term is and the less semantic contribution it has. Wang et al. defined the semantic contribution of a disease term in the kth layer as 0.5^k. Thus, the disease terms in the same layer (e.g., 'digestive system neoplasms' and 'liver diseases' in the 1st layer in Figure 3A) have the same semantic contribution value (0.5).
However, we found that 'digestive system neoplasms' appears in 40 disease DAGs, such as the DAG of esophageal neoplasms and the DAG of liver neoplasms, whereas 'liver diseases' appears in 73 disease DAGs, such as the DAG of liver failure and the DAG of liver cirrhosis. The former is more specific than the latter since it appears in fewer DAGs, so its semantic contribution should be higher. Therefore, assigning the same contribution value to all disease terms of the same layer, as in Wang's measurement, is less accurate.
Intuitively, the more specific a disease term is, the more informative it is for calculating the functional similarity. Therefore, we use the information content of each disease term as its semantic contribution. In this way, a more specific disease term has a greater semantic contribution value. Given that the likelihood of a disease term t appearing in all the disease DAGs is denoted as p(t), the information content of t, IC(t), can be quantified as the negative log of the likelihood, i.e., IC(t) = −log[p(t)]. The information content of all the 4577 disease terms was calculated and is available at our web site. Thus, the semantic value of 'liver neoplasms', DV(LN), is updated as 10.160 (IC(liver neoplasms)) + 6.838 (IC(digestive system neoplasms)) + 4.453 (IC(neoplasms by site)) + 2.785 (IC(neoplasms)) + 6.116 (IC(liver diseases)) + 3.961 (IC(digestive system diseases)) = 34.313. In the same way, DV(PN) is 46.566. The semantic similarity SS(LN, PN) is then calculated from these updated values. A greater semantic similarity indicates that two diseases are more similar to each other.
In addition, the similarity of two diseases is closely related to their phenotypes. We obtained the similarity between any two of 5080 disease phenotypes from the literature [30]. The phenotype similarity between two diseases, such as A and B, is denoted as PS(A, B). To incorporate both the semantic similarity and the phenotype similarity, the similarity between A and B is defined as DS(A, B).
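The information-content-based semantic value can be checked with a short Python sketch. The IC values below are the ones quoted in the text for the terms in the DAG of 'liver neoplasms'; the function name and the choice of a base-2 logarithm are assumptions, since the text only states that IC(t) is the negative log of p(t).

```python
import math


def information_content(p, base=2.0):
    """IC(t) = -log(p(t)). The log base is an assumption (not stated in the text)."""
    return -math.log(p, base)


# IC values quoted in the text for the terms in the DAG of 'liver neoplasms'.
IC_LN = {
    "liver neoplasms": 10.160,
    "digestive system neoplasms": 6.838,
    "neoplasms by site": 4.453,
    "neoplasms": 2.785,
    "liver diseases": 6.116,
    "digestive system diseases": 3.961,
}

# The updated semantic value is the sum of the IC values over the DAG.
dv_ln_ic = sum(IC_LN.values())
print(round(dv_ln_ic, 3))  # 34.313
```

Note that 'digestive system neoplasms' now contributes more than 'liver diseases' (6.838 versus 6.116), although both sit in the same layer.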

Assignment of weight based on miRNA families or clusters
It was reported that the members of a miRNA family or cluster are more likely to be associated with similar diseases [8,20,31]. Therefore, these miRNAs are assigned higher weight, and the assignment strategy is described as follows.
Assignment of weight based on miRNA families. Homologous miRNAs are gathered into the same miRNA family by RFam [32]. The seed regions (normally nucleotides 2-8 from the 5′ end of the miRNA) of miRNA sequences of the same family are almost identical. Since the seed of a miRNA is commonly required to be perfectly complementary to the target mRNAs for cleavage or translational repression, the members of the same family likely regulate a common set of mRNA targets. Hence, it is more likely that they are associated with similar diseases [20,31]. Assume u is an unlabeled miRNA and v is one of its k most similar neighbors. Also, assume v is associated with disease d and that u and v belong to the same family. As far as v is concerned, u is then more likely to be associated with d, so v is assigned higher weight. Later, the weight is multiplied by the functional similarity between u and v to give the subscore of v, as detailed in section 'Calculation of the relevance scores of miRNA candidates'.
We downloaded the miRNA family information from the latest miRNA database, miRBase 19. The 474 miRNAs involved in the 4379 miRNA-disease associations cover 52 families. For the ith (1 ≤ i ≤ 52) family, the rate of d-related miRNAs accounting for its size is denoted as r_fi(d), i.e., the number of d-related miRNAs in the family divided by the family size. For instance, assume there are 10 miRNAs in the ith miRNA family and 6 of them are associated with d. Thus, we have r_fi(d) = 6/10 = 0.6. A greater r_fi(d) means that a larger share of the miRNAs in the family has been associated with d, so the remaining miRNAs are more likely to be associated with d. If two miRNAs do not belong to the same family, their weight with respect to d is set to 1. If two miRNAs belong to the ith family and some miRNAs of the family are associated with d, the weight of the miRNAs in the ith family with respect to d, w_fi(d), should be greater than 1; it is computed from r_fi(d) together with a factor a that adjusts the weight. To find a suitable value of a, values from 1 to 10 were tested by performing 5-fold cross validation. Figure S1A shows that HDMP achieved better prediction performance with a = 4 than with other values. Therefore, we set a to 4 in this study.
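The rate and weight computation can be illustrated with a small Python sketch. The rate follows the worked example above (6 of 10 family members associated with d). The weight formula itself did not survive in this text, so the form 1 + rate / a used below is an assumption, chosen only because it is greater than 1 whenever the rate is positive and because, for a rate of 0.6 and a = 4, it yields 1.15, the same weight value that appears in the Figure 1 walk-through. The function names are hypothetical.

```python
def family_rate(n_related, family_size):
    """Share of family members already associated with disease d."""
    return n_related / family_size


def family_weight(rate, a=4.0):
    """Weight of family members with respect to d.

    ASSUMPTION: the form 1 + rate / a is a stand-in for the weight
    definition, which is missing from the text; `a` is the adjusting factor.
    """
    return 1.0 + rate / a


rate = family_rate(6, 10)      # worked example from the text
weight = family_weight(rate)   # uses a = 4, the value selected in the study
print(rate, round(weight, 2))  # 0.6 1.15
```

Under this assumed form, a larger a pulls every weight closer to 1, which reduces the influence of family membership on the final score.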
For each family, we calculate the weight of its members for each disease involved in the family. The calculation process is illustrated by miRNA family 1. As shown in Figure 4, assume there are p families and family 1 is composed of 5 miRNAs, including miRNA 1, 2, 3, 4, and 5. Assigning weight for family 1 includes the following 3 steps.
1. All the diseases that are associated with the members of family 1 are collected to form the disease set S_1 = {d_1, d_2, d_3, d_4, d_5}.
2. We collect the miRNAs associated with each disease d_i (1 ≤ i ≤ 5). For instance, d_1 is associated with miRNA 1 and 3; d_2 is associated with miRNA 1, 2, 4, and 5; d_3 is associated with miRNA 2, 3, and 4; d_4 is associated with miRNA 2 and 3; d_5 is associated with miRNA 3 and 5.
3. The weight of the miRNAs in family 1 with respect to each disease d_i (1 ≤ i ≤ 5) is calculated. For instance, d_2 is associated with miRNA 1, 2, 4, and 5, and family 1 is composed of 5 miRNAs. Thus, d_2-related miRNAs account for four fifths of family 1, and r_f1(d_2) = 4/5 = 0.8.
Repeating the above 3 steps, the weight can be calculated for family 2, …, and family p, respectively.
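The three steps above can be sketched in Python for the family 1 example. The association data are transcribed from step 2 (inverted so that each miRNA maps to its diseases); the variable and function names are illustrative only.

```python
# Diseases associated with each member of family 1 (from the example in step 2).
FAMILY_1 = {
    "miRNA 1": {"d1", "d2"},
    "miRNA 2": {"d2", "d3", "d4"},
    "miRNA 3": {"d1", "d3", "d4", "d5"},
    "miRNA 4": {"d2", "d3"},
    "miRNA 5": {"d2", "d5"},
}


def disease_rates(family):
    """Return, for each disease linked to the family, the share of members linked to it."""
    size = len(family)
    # Step 1: collect every disease associated with at least one member.
    diseases = sorted(set().union(*family.values()))
    # Steps 2 and 3: count the members associated with each disease and divide by family size.
    return {d: sum(d in linked for linked in family.values()) / size for d in diseases}


rates = disease_rates(FAMILY_1)
print(rates["d2"])  # 0.8
```

The same function can be applied unchanged to every other family, or to a genomic cluster, since it only needs the member-to-disease mapping.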
Assignment of weight based on miRNA clusters. It has been reported that miRNAs are often found in genomic clusters [33]. Clustered miRNAs are usually transcribed together and are more likely to be associated with similar diseases [8,20]. Therefore, if the unlabeled miRNA u and one of its neighbors v belong to the same cluster, v is assigned higher weight.
We downloaded the chromosomal coordinates of human miRNAs from miRBase 19. Wang et al. confirmed that clustered miRNAs located within 20 kb of each other on the genome are more likely to be associated with similar diseases [20]. Therefore, we merged the miRNAs whose distances are within 20 kb into the same cluster. The 474 miRNAs involved in the 4379 miRNA-disease associations cover 58 clusters. For the ith (1 ≤ i ≤ 58) cluster, the rate of d-related miRNAs accounting for its size is denoted as r_gi(d), i.e., the number of d-related miRNAs in the cluster divided by the cluster size. A greater r_gi(d) means that a larger share of the miRNAs in the ith cluster has been associated with d, so the remaining miRNAs are more likely to be associated with d. The weight of two miRNAs that do not belong to the same cluster is set to 1. If two miRNAs belong to the ith cluster and some miRNAs of the cluster are associated with d, the weight of the miRNAs in the ith cluster with respect to d should be greater than 1; it is computed from r_gi(d) together with a factor b that adjusts the weight. Values of b from 1 to 10 were investigated experimentally. HDMP achieved the highest prediction performance when b is 4 (Figure S1B).
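Merging miRNAs into clusters by genomic distance can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: distance is measured between the start coordinates of adjacent miRNAs on the same chromosome, strand is ignored, and miRNAs are chained (a and c end up in the same cluster if each is within 20 kb of an intermediate b). The function name and the coordinate format are hypothetical.

```python
from collections import defaultdict


def cluster_mirnas(coords, max_gap=20_000):
    """Group miRNAs on the same chromosome whose neighbouring starts lie within max_gap bases.

    coords maps a miRNA name to (chromosome, start). Only groups with at
    least two members are returned as clusters.
    """
    by_chrom = defaultdict(list)
    for name, (chrom, start) in coords.items():
        by_chrom[chrom].append((start, name))

    clusters = []
    for chrom in sorted(by_chrom):
        current, prev_start = [], None
        for start, name in sorted(by_chrom[chrom]):
            # A gap larger than max_gap closes the running group.
            if current and start - prev_start > max_gap:
                if len(current) > 1:
                    clusters.append(current)
                current = []
            current.append(name)
            prev_start = start
        if len(current) > 1:
            clusters.append(current)
    return clusters


# Made-up coordinates, for illustration only.
example = {
    "mir-a": ("chr1", 1_000),
    "mir-b": ("chr1", 15_000),
    "mir-c": ("chr1", 34_000),
    "mir-d": ("chr1", 100_000),
    "mir-e": ("chr2", 5_000),
}
print(cluster_mirnas(example))  # [['mir-a', 'mir-b', 'mir-c']]
```

Whether the original clustering chains miRNAs in this way or requires all members to lie within a single 20 kb window is not stated in the text; the two choices can give different cluster counts.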

Calculation of the relevance scores of miRNA candidates
For a specific disease d, to reliably estimate the relevance score of an unlabeled miRNA u, its k most similar neighboring miRNAs are observed. Assume miRNA v is one of the k neighbors and it is associated with d. Since u and its neighbor v have higher functional similarity, they are more possibly associated with a group of similar diseases. Thus, as far as v is concerned, u is also possibly associated with d. The greater the functional similarity between u and v, the higher the possibility that u is associated with d. Therefore, the functional similarity is considered when estimating the relevance score of u.
We assign each of the k neighbors a subscore. The subscores of the k neighbors are accumulated as the relevance score of u. The 3 possible combinations of u and v are as follows.
1. MiRNA u and its neighbor v are not in the same miRNA family or cluster, and v is associated with d. For instance, as shown in the third part of Figure 1, miRNA 1 is an unlabeled miRNA. Since its neighbor 5 is associated with d, 1 is possibly associated with d. A greater functional similarity between 1 and 5, MS(1, 5), means that 1 is more likely to be associated with d. Thus, with respect to 1, the subscore of 5 is set to MS(1, 5) = 0.6.
2. MiRNA u and its neighbor v belong to the same miRNA family or cluster, and v is associated with d. In Figure 1, since the neighbor 20 is associated with d, 1 is possibly associated with d. Furthermore, as both 1 and 20 are in the ith family, they are more likely to be associated with d. Therefore, to assign the subscore of 20, we consider both the functional similarity between 1 and 20 and the weight of these two miRNAs in this family. The subscore of 20 is MS(1, 20) × w_fi(d) = 0.6 × 1.15 = 0.69. In addition, if two miRNAs belong to both a family and a cluster, the weight based on the family and the weight based on the cluster are both multiplied by their functional similarity to give the subscore.
3. There is no evidence that the neighbor v is associated with d. For instance, miRNA 2 is a neighbor of 1, and 2 is not associated with d. As far as 2 is concerned, there is very little possibility that 1 is associated with d. In this case, the subscore of 2 is set to 0.
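The three cases above can be condensed into one small function. This is a sketch with hypothetical names; the similarity 0.6 and the family weight 1.15 are the values from the Figure 1 example, while the similarity used for the third case is a made-up placeholder because the text does not give it.

```python
def subscore(similarity, neighbor_is_labeled, family_weight=1.0, cluster_weight=1.0):
    """Subscore contributed by one neighbor v of an unlabeled miRNA u.

    family_weight / cluster_weight stay at 1.0 when u and v do not share
    a family / cluster; both are applied when u and v share both.
    """
    if not neighbor_is_labeled:
        return 0.0  # case 3: no evidence that v is associated with d
    return similarity * family_weight * cluster_weight


case_1 = subscore(0.6, True)                      # neighbor 5, no shared family or cluster
case_2 = subscore(0.6, True, family_weight=1.15)  # neighbor 20, same family
case_3 = subscore(0.5, False)                     # neighbor 2, placeholder similarity
print(case_1, round(case_2, 2), case_3)  # 0.6 0.69 0.0
```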
The sum of the subscores of the k neighbors is calculated as the relevance score of u, as shown in Figure 1. For a specific disease d, assume the labeled miRNA set is Q = {q_1, q_2, …, q_m} and the unlabeled miRNA set is U = {u_1, u_2, …, u_n}. To determine how likely an unlabeled miRNA u (u ∈ U) is to be associated with d, the sum of the subscores of the weighted k neighbors most similar to u is calculated as its relevance score. A higher score means a more probable association between u and d. The algorithm for predicting d-related miRNA candidates is described in Figure 5.
The accumulation of the relevance score over the neighbors of an unlabeled miRNA depends on the parameter k. If k is too large, noisy data are included, which does not improve the prediction performance. If k is too small, there are not sufficient data to accurately estimate the relevance scores. Values of k from 1 to 50 were investigated experimentally. Figure S1C shows that HDMP achieved the highest prediction performance when k is 20.
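The overall ranking procedure can be sketched in Python as follows. This is a minimal illustration, not the algorithm of Figure 5 itself: the similarity values are made up, the neighbors of u are assumed to be drawn from all other miRNAs (labeled and unlabeled), and the optional `weight` callback stands in for the family and cluster weights. All function names are hypothetical.

```python
def build_similarity(pairs):
    """Turn {(a, b): similarity} into a symmetric nested dict."""
    sim = {}
    for (a, b), s in pairs.items():
        sim.setdefault(a, {})[b] = s
        sim.setdefault(b, {})[a] = s
    return sim


def relevance_score(u, sim, labeled, k, weight=None):
    """Sum the weighted similarities of the labeled miRNAs among the k nearest neighbors of u."""
    neighbors = sorted(sim[u], key=lambda v: sim[u][v], reverse=True)[:k]
    score = 0.0
    for v in neighbors:
        if v in labeled:  # unlabeled neighbors contribute a subscore of 0
            w = weight(u, v) if weight else 1.0
            score += sim[u][v] * w
    return score


def rank_unlabeled(sim, labeled, k, weight=None):
    """Rank all unlabeled miRNAs by decreasing relevance score."""
    scores = {u: relevance_score(u, sim, labeled, k, weight) for u in sim if u not in labeled}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: five miRNAs, two of them labeled for the disease of interest.
toy_sim = build_similarity({
    ("m1", "m2"): 0.9, ("m1", "m3"): 0.2, ("m1", "m4"): 0.6, ("m1", "m5"): 0.5,
    ("m2", "m3"): 0.3, ("m2", "m4"): 0.1, ("m2", "m5"): 0.2,
    ("m3", "m4"): 0.8, ("m3", "m5"): 0.7, ("m4", "m5"): 0.4,
})
ranking = rank_unlabeled(toy_sim, labeled={"m4", "m5"}, k=2)
print([name for name, _ in ranking])  # ['m3', 'm1', 'm2']
```

In this toy example m2 scores 0 because neither of its two nearest neighbors is labeled, which shows how the choice of k controls how much labeled evidence each candidate can see.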

Data preparation
The human miRNA-disease associations were downloaded from the human miRNA-disease database HMDD [29]. Two versions (November-2010 Version and September-2012 Version) of the HMDD associations were used in the experiments. Invalid miRNA-disease associations with incorrect disease names or miRNA names were filtered out. The correct disease names were downloaded from the National Library of Medicine (http://www.nlm.nih.gov/). The correct miRNA names were obtained from the latest miRNA database, miRBase 19 [34]. After filtering, the November-2010 Version contains 2076 associations between 338 miRNAs and 199 diseases, and the September-2012 Version contains 4379 associations between 474 miRNAs and 268 diseases. The similarity between any two of 5080 OMIM disease phenotypes was obtained from the literature [30]. Since OMIM disease names differ from those of MeSH, their mapping information was downloaded from the Comparative Toxicogenomics Database [35].

Prediction performance evaluation
To evaluate HDMP's ability to predict disease miRNAs, 5-fold cross validation was performed first. For a specific disease d, the labeled miRNAs of the September-2012 Version are randomly divided into 5 subsets, 4 of which are used as known information to predict candidates, while the left-out subset is used for testing. The d-related miRNA candidate pool consists of all the unlabeled miRNAs and the labeled miRNAs used for testing. The relevance score of each unlabeled miRNA and that of each labeled miRNA in the pool are calculated. All these miRNAs are ranked by their relevance scores. The higher the labeled miRNAs are ranked, the better the prediction performance is.
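The construction of the training sets and candidate pools for 5-fold cross validation can be sketched as below. The function name, the fixed random seed, and the round-robin assignment of shuffled miRNAs to folds are assumptions made for illustration; the text only states that the labeled miRNAs are divided randomly into 5 subsets.

```python
import random


def five_fold_pools(labeled, unlabeled, n_folds=5, seed=0):
    """Yield (train, test, pool) for each fold of a single disease.

    train: labeled miRNAs kept as known associations
    test:  labeled miRNAs hidden in this fold
    pool:  candidates to rank = all unlabeled miRNAs + the hidden ones
    """
    rng = random.Random(seed)
    shuffled = sorted(labeled)  # sort first so the split is reproducible
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for test in folds:
        train = set(shuffled) - set(test)
        pool = set(unlabeled) | set(test)
        yield train, set(test), pool


# Placeholder names, for illustration only.
labeled_example = {f"labeled-{i}" for i in range(10)}
unlabeled_example = {"cand-a", "cand-b", "cand-c"}
splits = list(five_fold_pools(labeled_example, unlabeled_example))
print(len(splits), [len(test) for _, test, _ in splits])  # 5 [2, 2, 2, 2, 2]
```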
If a labeled miRNA is ranked above a given threshold, HDMP is considered to have successfully predicted it. By varying the threshold, the true positive rate (sensitivity) and the false positive rate (1 − specificity) were calculated to obtain the receiver operating characteristic (ROC) curves. Sensitivity is the proportion of the labeled miRNAs in the pool that are successfully predicted. Specificity is the proportion of the unlabeled miRNAs that are ranked below the threshold. The area under the ROC curve (AUC) was calculated to quantify the prediction performance of HDMP. To obtain reliable evaluation results, we tested the 18 human diseases that are each associated with at least 60 miRNAs. As shown in Table 1, HDMP achieved the highest AUC with pancreatic neoplasms and the lowest AUC with lupus vulgaris. The average AUC value for the 18 diseases is 0.825.
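The threshold sweep described above can be sketched in Python. The scores are made-up placeholders, the function names are hypothetical, and tied scores are not treated specially (an assumption; the text does not say how ties are handled).

```python
def roc_points(scores, positives):
    """Sweep the rank threshold and return the (FPR, TPR) points of the ROC curve.

    scores:    dict miRNA -> relevance score for every miRNA in the pool
    positives: the held-out labeled miRNAs in the pool
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_pos = len(positives)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for mirna in ranked:  # lower the threshold by one rank at a time
        if mirna in positives:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


toy_scores = {"p1": 0.9, "n1": 0.8, "p2": 0.4, "n2": 0.3, "n3": 0.1}
toy_auc = auc(roc_points(toy_scores, positives={"p1", "p2"}))
print(round(toy_auc, 4))  # 0.8333
```

Here 5 of the 6 (labeled, unlabeled) pairs are ordered correctly, which matches the AUC of 5/6.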
In addition, the associations of November-2010 Version were used to construct HDMP. HDMP was applied to predict the set of associations added into HMDD between November-2010 and September-2012. The set of associations formed the updated dataset. The validation based on the dataset is called updated dataset validation. There are 9 diseases each of which is associated with at least 60 miRNAs in November-2010. The 9 diseases were tested to further evaluate the prediction performance of HDMP. As shown in Table 2, the highest AUC was obtained with pancreatic neoplasms, and the lowest AUC was obtained with hepatocellular carcinoma. The average AUC for the 9 diseases is 0.726.

Importance for improving miRNA similarity measurement and incorporating miRNA family or cluster
To validate the importance of improving the miRNA functional similarity measurement, two instances of HDMP were constructed, based on our measurement and on Wang's measurement respectively. 5-fold cross validation was performed to evaluate the performance of these two instances. The former achieved higher AUC values for all the 18 human diseases (Table S1). The minimum AUC increase is 1.6% for adenoviridae infections and the maximum is 3.7% for medulloblastoma. For the 18 diseases, the AUC is increased by 2.2% on average. This demonstrates that our measurement is effective in improving the prediction performance. In addition, the prediction instance based on Wang's measurement achieves decent performance, which further confirms that the prediction method based on weighted k neighbors contributes to the prediction accuracy on its own.
In addition, we constructed three prediction instances and listed their prediction results in Table S2. The first instance was constructed based only on the k most similar neighbors, without considering miRNA family and cluster. The second and third instances further incorporated miRNA family and miRNA cluster, respectively. For the 18 diseases, the average AUC of the second instance is 2.9% greater than that of the first one, and the average AUC of the third instance is 2.8% greater. This shows the importance of incorporating miRNA family and cluster when constructing an efficient prediction instance.

Comparison with FCS method, Jiang's method, and RWRMDA
We first compared HDMP with the FCS method proposed in [21]. The FCS method ranked the miRNA candidates based on the functional consistency between miRNA target genes and disease-related genes. While the FCS method ranked disease miRNA candidates for 11 diseases, our HDMP method ranked candidates for 18 diseases. 5-fold cross validation was performed on their 8 common diseases. In addition, there are 4 common diseases for the updated dataset validation. The miRNAs ranked by the FCS method were downloaded from its web site (http://bioinfo.hrbmu.edu.cn/CMP). The results over 5-fold cross validation (Table 1) and those over updated dataset validation (Table 2) show that HDMP is more accurate than the FCS method. The average AUC value for the 8 common diseases is increased by 19.5% and that for the 4 common diseases is increased by 13.5%. We measured the statistical significance of the difference in their AUCs by a paired t-test. The p-values are reported in Table 3. HDMP performs significantly better than the FCS method at the significance level 0.05.
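For readers unfamiliar with the paired t-test used here, the following Python sketch computes the test statistic from two lists of per-disease AUCs. The AUC values are placeholders chosen for easy arithmetic, not the values of Table 1 or Table 2, and the function name is hypothetical. Converting the statistic into a p-value requires the t distribution with n − 1 degrees of freedom (from a table or a statistics library), which is left out to keep the sketch dependency-free.

```python
import math
import statistics


def paired_t_statistic(aucs_a, aucs_b):
    """Paired t statistic for AUCs measured on the same diseases by two methods.

    Returns (t, degrees_of_freedom).
    """
    diffs = [a - b for a, b in zip(aucs_a, aucs_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Placeholder AUCs for three diseases (illustrative only).
method_a = [0.80, 0.85, 0.90]
method_b = [0.70, 0.65, 0.60]
t_stat, dof = paired_t_statistic(method_a, method_b)
print(round(t_stat, 3), dof)  # 3.464 2
```

The test is paired because both methods are evaluated on the same diseases, so only the per-disease differences enter the statistic.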
As mentioned before, the FCS method is dependent on the predicted miRNA target genes. However, it is difficult to obtain highly accurate target genes, even though the FCS method integrated the results of 3 target prediction programs to minimize false positives. HDMP is based on accurate measurement of miRNA functional similarity and on an effective prediction process that observes the weighted k most similar neighbors. Thus, HDMP achieves better prediction performance.
Second, we compared HDMP with Jiang's method presented in [18], where the potential miRNA-disease associations were inferred based on the human phenome-miRNAnome network. The other method by Jiang et al., presented in [19], cannot be compared since its source code and web service are unavailable. The disease description in Jiang's method comes from the Online Mendelian Inheritance in Man (OMIM) database [36]. Due to slight differences between the disease names of OMIM and those of MeSH, there are no exact correspondences for 6 of the 18 diseases for 5-fold cross validation. Also, there are no corresponding disease names for 2 of the 9 diseases for updated dataset validation. Consequently, we compared the results of HDMP and those of Jiang's method for their common diseases (Table 1 and Table 2). The p-values obtained by paired t-test are listed in Table 3. The prediction performance of HDMP is significantly better than that of Jiang's method. The average AUC value for the 12 common diseases is increased by 24.9% and that for the 7 common diseases is increased by 17.6%. Jiang et al. constructed the miRNAnome network based on the predicted miRNA targets. These targets were obtained by simply merging the results of 2 target prediction programs. Thus, the high false positive rate in the merged targets has a great effect on the performance of Jiang's method.
Third, RWRMDA was originally constructed by using the 1395 miRNA-disease associations in an earlier version of HMDD (September, 2009). Unfortunately, the source code of RWRMDA provided by its web site is currently not available. To compare with RWRMDA, we implemented RWRMDA and evaluated it by 5-fold cross validation and by updated dataset validation, respectively. The restart probability r of RWRMDA is set to 0.9, as suggested by the experiments in [22]. The p-values obtained by paired t-test are listed in Table 3. They indicate that HDMP performed significantly better than RWRMDA. The average AUC value for the 18 diseases over 5-fold cross validation is increased by 15.3% and that for the 9 diseases over updated dataset validation is increased by 9.2%. As mentioned before, RWRMDA predicts disease miRNAs by random walking on the miRNA similarity network. However, when a walker moves from a miRNA to one of its neighbors, RWRMDA overlooks whether that miRNA is associated with d or not, which limits how specifically it can predict d-related miRNAs. HDMP considers the k most similar neighbors and the distribution of the known d-related miRNAs among these neighbors. Furthermore, HDMP incorporates the weight information of miRNA family or cluster. Therefore, HDMP achieved better performance.
The ROC curves for 5-fold cross validation and those for updated dataset validation are shown in Figures 6 and 7, respectively. RWRMDA performed better than the FCS method and Jiang's method for most diseases. HDMP outperformed all the previous methods, which indicates that HDMP can successfully recover the known disease miRNAs. In addition, for all the prediction methods, the overall performance over 5-fold cross validation is better than that over updated dataset validation. The primary reason is that the number of labeled miRNAs in the training dataset is greater for the former than for the latter.
Case studies: prostatic neoplasms, breast neoplasms, and lung neoplasms
To further demonstrate the ability of HDMP to uncover potential disease-related miRNA candidates, we present case studies of prostatic neoplasms, breast neoplasms, and lung neoplasms. Many researchers have shown that miRNAs play critical roles in these three diseases. Due to space limitations, we only provide a comprehensive analysis of the prostatic neoplasms-related candidates.
HDMP predicts the candidates by using the miRNA-disease associations in the earlier version of HMDD (1 January 2012). The prostatic neoplasms-related miRNAs newly reported after 1 January 2012 are used to validate the predicted candidates. Furthermore, the miRNA-disease databases miR2Disease [37] and dbDEMC [38] are also used to confirm the candidates.
The top 50 candidates in the ranked list are illustrated in Table 4, and detailed in Table S3. First, during the period from January 2012 to September 2012, HMDD has been updated three times. There are 24 newly reported prostatic neoplasms-related miRNAs. 9 of 50 miRNAs are supported by the newly reported miRNAs. It indicates that HDMP can discover potentially important prostatic neoplasms-related miRNAs.
Second, miR2Disease is a manually curated database which provides a comprehensive resource of miRNA deregulation in various human diseases [37]. The current version of miR2Disease contains 3273 curated associations between 349 human miRNAs and 163 diseases. 17 of 50 miRNAs are included in miR2Disease. It indicates these miRNAs are deregulated in prostatic neoplasms, which confirms that they are really associated with prostatic neoplasms.
Third, 7 miRNAs are supported by the literature. Several studies confirm that 6 of the 7 miRNAs are significantly upregulated or downregulated in human prostatic neoplasms versus normal prostatic tissue [39][40][41][42][43]. The remaining miRNA is found to be upregulated or downregulated in metastatic prostate cancer xenografts, relative to their nonmetastatic counterparts [44]. HDMP successfully found these miRNAs due to their higher ranks.
Fourth, the database of differentially expressed miRNAs in human cancers, dbDEMC [38], was constructed to provide potential cancer-related miRNAs by in silico computing. The current version of dbDEMC contains 607 miRNAs which potentially have differential expression in 14 types of cancer, including prostatic cancer (malignant prostatic neoplasms). 33 of the 50 miRNAs are contained in dbDEMC. These miRNAs are identified as potentially upregulated or downregulated in prostatic cancer by significance analysis of microarrays. This shows that the 33 miRNAs are more likely to participate in prostatic cancer-related biological processes.
Last but not least, 7 miRNAs have higher ranks in the ranked lists of the FCS method, Jiang's method, and RWRMDA. Hsa-mir-429 is ranked No. 1 and No. 2 by Jiang's method and RWRMDA, respectively. Hsa-mir-142, hsa-mir-18a, and hsa-mir-20b have greater functional consistency scores (FCS) among their target genes and the known target genes associated with prostatic neoplasms. Hsa-mir-18a, hsa-mir-18b, hsa-mir-499, and hsa-mir-542 are ranked No. 15, 45, 42, and 25 by RWRMDA, respectively. This indirectly supports that the 7 miRNAs are likely to be associated with prostatic neoplasms. All of the above analyses indicate that the 50 miRNAs in Table 4 are potential prostatic neoplasms-related candidates.
Table 3. p-values obtained by paired t-testing the AUCs of HDMP and those of another prediction method.
In addition, the top 50 breast neoplasms-related candidates are listed in Table S4. Nineteen of the 50 miRNAs are confirmed to be associated with breast neoplasms by the newly reported miRNAs in HMDD. Eight miRNAs are validated by the database miR2Disease. Three miRNAs are reported in the literature to be deregulated in breast cancer [45][46][47]. dbDEMC identified 39 miRNAs as potentially upregulated or downregulated in breast cancer (malignant breast neoplasms). The genes-to-systems breast cancer database, G2SBC [48], is commonly used to assist the study of breast cancer. For 2 miRNAs, at least 16 of their top 100 predicted target genes are breast cancer-related genes, which indicates that these 2 miRNAs are likely to be associated with breast cancer. In addition, 2 miRNAs have higher ranks in the ranked lists of the FCS method and RWRMDA, which indirectly supports that they are potential breast neoplasms-related candidates.
The top 50 lung neoplasms-related candidates are listed in Table S5. Six of the 50 miRNAs are confirmed to be associated with lung neoplasms by HMDD. Twelve miRNAs are validated by the database miR2Disease. Eight miRNAs are reported in the literature to be upregulated or downregulated in lung cancer [49][50][51][52][53]. dbDEMC identified 31 miRNAs as potentially deregulated in lung cancer. Two miRNAs are ranked higher by the FCS method and RWRMDA. For only 2 miRNAs, we found no evidence confirming a potential association with lung neoplasms. All of the above results demonstrate that HDMP is powerful in predicting potential disease-related miRNA candidates.
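The evidence tallies reported in these case studies (for example, how many of the top 50 candidates appear in an HMDD update, in miR2Disease, or in dbDEMC) amount to set-overlap counts. The following is a minimal sketch of such a count, not the authors' code. The function names `tally_support` and `unconfirmed`, the source labels, and the miRNA identifiers are hypothetical placeholders, and the toy data are not taken from any of the databases.

```python
from typing import Dict, List, Set


def tally_support(ranked: List[str],
                  evidence: Dict[str, Set[str]],
                  top_n: int = 50) -> Dict[str, int]:
    """Count how many of the top_n ranked candidates occur in each evidence source."""
    top = ranked[:top_n]
    return {source: sum(1 for mirna in top if mirna in members)
            for source, members in evidence.items()}


def unconfirmed(ranked: List[str],
                evidence: Dict[str, Set[str]],
                top_n: int = 50) -> List[str]:
    """List the top_n candidates that no evidence source supports."""
    supported = set().union(*evidence.values())
    return [mirna for mirna in ranked[:top_n] if mirna not in supported]


# Toy example with placeholder identifiers (assumed, not real database content).
ranked = ["mir-a", "mir-b", "mir-c", "mir-d"]
evidence = {
    "HMDD_update": {"mir-a", "mir-c"},
    "miR2Disease": {"mir-a"},
    "dbDEMC": {"mir-b", "mir-c", "mir-z"},
}

print(tally_support(ranked, evidence, top_n=3))
print(unconfirmed(ranked, evidence, top_n=4))
```

A candidate may be counted under several sources at once, which is why the per-source counts in the case studies can sum to more than 50.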

Conclusions
A new prediction method based on weighted k most similar neighbors, HDMP, was developed for predicting disease miRNAs. We demonstrated that accurately measuring miRNA functional similarity, incorporating weight information based on miRNA family or cluster, and considering the distribution information of a specific disease are all important for achieving effective prediction results. A measurement strategy incorporating the information content of disease terms and the phenotype similarity between diseases was proposed to accurately estimate the functional similarity of two miRNAs. Members of a miRNA family or cluster are assigned higher weights according to their associations with a group of diseases. The functional similarity information and the distribution information of the disease d-related miRNAs among the k neighbors are incorporated to estimate the possibility that a miRNA is associated with d.
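The general idea of scoring a candidate from its weighted k most similar neighbors can be illustrated as follows. This is only a simplified sketch under stated assumptions and does not reproduce the exact HDMP scoring formula: the function name `knn_relevance`, the fixed `family_weight` of 2.0, the rule that a neighbor is up-weighted when it shares a family or cluster label with the candidate, and all toy values are hypothetical.

```python
from typing import Dict, Optional, Set


def knn_relevance(candidate: str,
                  similarity: Dict[str, Dict[str, float]],
                  disease_mirnas: Set[str],
                  family: Dict[str, Optional[str]],
                  k: int = 3,
                  family_weight: float = 2.0) -> float:
    """Score a candidate miRNA for a disease d from its k most similar neighbors.

    Only neighbors already known to be associated with d contribute.
    Each contribution is the functional similarity to the candidate,
    multiplied by family_weight when the neighbor shares the candidate's
    family/cluster label (an assumed weighting rule for illustration).
    """
    # Rank all other miRNAs by similarity to the candidate and keep the top k.
    neighbors = sorted(
        (m for m in similarity[candidate] if m != candidate),
        key=lambda m: similarity[candidate][m],
        reverse=True,
    )[:k]

    score = 0.0
    candidate_family = family.get(candidate)
    for m in neighbors:
        if m not in disease_mirnas:
            continue  # neighbor is not known to be associated with d
        same_group = candidate_family is not None and family.get(m) == candidate_family
        weight = family_weight if same_group else 1.0
        score += weight * similarity[candidate][m]
    return score


# Toy data with placeholder identifiers (assumed values).
similarity = {"mir-x": {"mir-a": 0.9, "mir-b": 0.8, "mir-c": 0.5, "mir-d": 0.1}}
disease_mirnas = {"mir-a", "mir-c", "mir-d"}
family = {"mir-x": "fam1", "mir-a": "fam1"}

score = knn_relevance("mir-x", similarity, disease_mirnas, family, k=3)
print(round(score, 3))
```

With k = 3, the neighbor mir-d falls outside the neighborhood and does not contribute, even though it is associated with the disease. Candidates would then be ranked by this score in descending order.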
HDMP has been compared with the existing prediction methods, including the FCS method, Jiang's method, and RWRMDA. Both the 5-fold cross validation results and the updated dataset validation results demonstrated that HDMP has significantly higher accuracy in recovering the known disease miRNAs. The case studies of prostatic neoplasms, breast neoplasms, and lung neoplasms further demonstrated the ability of HDMP to uncover potential disease-related candidates. HDMP can provide reliable disease-related miRNA candidates for experimental research, which facilitates future studies of miRNA involvement in the pathogenesis of diseases.

Supporting Information
Figure S1 Prediction performance affected by a value, b value, and k value.
Table S3 The top 50 prostatic neoplasms-related miRNA candidates in the ranked list. (1) 'literature' means that the literature supports that the miRNA is upregulated or downregulated in human prostatic neoplasms, as compared with normal prostatic tissue. (2) Based on analysis of the microarray data sets, a miRNA is considered to potentially have different expression levels in prostatic cancer when compared to normal tissues. Such miRNAs are labeled 'dbDEMC'. (3) 'HMDD' means that a miRNA is a newly reported prostatic neoplasms-related miRNA collected by the latest version of the human miRNA-disease database HMDD. (4) 'miR2Disease' means that a miRNA is included in the manually curated miRNA-disease association database, miR2Disease. (5) 'higher RWRMDA' means that a miRNA has a higher rank in the ranked list of RWRMDA. (6) 'higher FCS' means that a miRNA has a greater functional consistency score (FCS) among its target genes and the known target genes associated with prostatic neoplasms. (7) 'higher Jiang' means that a miRNA has a higher rank in the ranked list of Jiang's method. (DOC)
Table S4 The top 50 breast neoplasms-related miRNA candidates in the ranked list. (1) 'literature' means that the literature supports that the miRNA is upregulated or downregulated in human breast neoplasms, as compared with normal breast tissue. (2) Based on analysis of the microarray data sets, a miRNA is considered to potentially have different expression levels in breast cancer when compared to normal tissues. Such miRNAs are labeled 'dbDEMC'. (3) 'HMDD' means that a miRNA is a newly reported breast neoplasms-related miRNA collected by the latest version of the human miRNA-disease database HMDD. (4) 'miR2Disease' means that a miRNA is included in the manually curated miRNA-disease association database, miR2Disease. (5) G2SBC is a genes-to-systems breast cancer database, which is commonly used to assist the study of breast cancer. 'G2SBC' means that some of the top predicted target mRNAs of a miRNA are breast cancer-related genes. (6) 'higher RWRMDA' means that a miRNA has a higher rank in the ranked list of RWRMDA. (7) 'higher FCS' means that a miRNA has a greater functional consistency score (FCS) among its target genes and the known target genes associated with breast neoplasms. (DOC)
Table S5 The top 50 lung neoplasms-related miRNA candidates in the ranked list. (2) Based on analysis of the microarray data sets, a miRNA is considered to potentially have different expression levels in lung cancer when compared to normal tissues. Such miRNAs are labeled 'dbDEMC'. (3) 'HMDD' means that a miRNA is a newly reported lung neoplasms-related miRNA collected by the latest version of the human miRNA-disease database HMDD. (4) 'miR2Disease' means that a miRNA is included in the manually curated miRNA-disease association database, miR2Disease. (5) 'higher RWRMDA' means that a miRNA has a higher rank in the ranked list of RWRMDA. (6) 'higher FCS' means that a miRNA has a greater functional consistency score (FCS) among its target genes and the known target genes associated with lung neoplasms. (7) 'higher Jiang' means that a miRNA has a higher rank in the ranked list of Jiang's method. (8) 'unconfirmed' means that there is no evidence to confirm that a miRNA is potentially associated with lung neoplasms. (DOC)