A Filtering Method for Identification of Significant Target mRNAs of Coexpressed and Differentially Expressed MicroRNA Clusters

MicroRNA (miRNA) binding is primarily based on sequence, but structure-specific binding is also possible. Various prediction algorithms have been developed for predicting miRNA target genes; the results, however, have relatively high levels of false positives, and the degree of overlap between predicted targets from different methods is poor or null. We devised a new method for identifying significant miRNA target genes from an extensive list of predicted miRNA target gene relationships using hypergeometric distributions. We evaluated our method in statistical and semantic aspects using a common miRNA cluster from six solid tumors. Our method provides statistically and semantically significant miRNA target genes. Complementing target prediction algorithms with our proposed method may have a significant synergistic effect in finding and evaluating functional annotation and enrichment analysis for miRNA.


Introduction
MicroRNAs (miRNAs) are a class of small RNAs that regulate gene expression at the transcript level, protein level, or both [1][2][3][4]. miRNAs modulate gene activity and are aberrantly expressed in most types of cancers [5]. Due to their small size and stability, miRNAs can also be measured in biologic fluids such as plasma and serum and can serve as circulating biomarkers [6][7][8][9]. In spite of the continuous attempts to identify miRNAs and to elucidate their basic mechanisms of action, little is understood about their biological functions.
Because of the regulatory role of miRNAs and lack of direct functional annotation to miRNAs, functional enrichment methods for miRNAs rely on their target gene's functional annotations [10][11][12]. For instance, if the target genes of a specific miRNA are significantly enriched with a set of Gene Ontology (GO) terms, it is reasonable to infer that the miRNA is also involved in the same GO annotations. Several studies on miRNAs have used "predicted targetgenes' functional annotation-based" miRNA function prediction strategy [13,14]; these methods, however, are limited in that they do not consider the many-to-many-to-many tripartite network topology among miRNAs, target genes, and GO annotation [15][16][17]. In our previous work, we proposed three types of measures (miRNA-centric, target gene-centric, and target link-centric) and a novel index for calculating the functional enrichment of miRNA. Among the three measures, the miRNA-centric measurements showed the best performance [18]. We also found that the miRNA's intrinsic properties of multiplicity and cooperability may be correctly modeled by combined hypergeometric distributions.
Most of the miRNA-to-mRNA target links are estimated by prediction algorithms. However, these algorithms generate a relatively high level of false positives [19], and the degree of overlap between predicted targets from different methods is often poor or null [13,20]. Studies in this field have developed multiple databases with enormous amount of miRNA-to-target mRNA relationships computed using diverse algorithms [21], whereas only a few experimentally validated targets are available [22,23]. In light of this circumstance, there is an unmet need for a method for identifying a significant miRNA target from a copious amount of predicted miRNA-target mRNA pairs.
According to miRNA characteristics (multiplicity and cooperatively activities), we employed the hypergeometric distribution to identify significant miRNA target genes from the extensive list of miRNA target genes. We also evaluated the performance of our method in two aspects: statistical significance and functional enrichment.

Computational Methods.
To find significant target mRNAs from the input miRNA cluster, we first searched for target mRNAs from all miRNA members within the input cluster from the miRNA target database. For each targeted mRNAs, we then calculated the numbers of miRNAs that have target relationships (p i ) with the mRNA and those that do not (p j ) using the two-by-two contingency table. We also calculated the numbers of miRNAs not in the input miRNA cluster by dividing those that have a target relationship (p k ) with the mRNA and those that do not (p l ), as shown in Table 1. Functional enrichment was tested from this contingency table using a hypergeometric distribution. e hypergeometric distribution applies to sampling without replacement from a finite population whose elements can be classified into two mutually exclusive categories: has/does not have a target relationship.
We then calculated the adjusted p values using the Bonferroni correction. Finally, for evaluating our methods, 10,000 simulated mRNA sets of the same size were also randomly sampled from the target mRNAs of the input miRNA cluster.
Using hypergeometric distribution, we assumed that the coordinated function among miRNAs within a cluster is valid when these miRNAs are regulated or annotated by common factors such as same target mRNA, Gene Ontology, or pathway.

Data Set: miRNA Clusters.
We obtained an miRNA set created by Volinia et al. [24] that has differentially expressed sets of up-or downregulated miRNAs in six solid tumor samples. Among the miRNA clusters, we selected an miRNA cluster composed of 57 miRNAs by prediction analysis of microarray (PAM) in six solid tumor samples versus normal tissues. e complete list of 57 miRNAs is in Additional File 1.
e TarBase and MirTarBase databases provide experimentally validated miRNA-target interaction data and evidence level (strong and less strong) of each interaction. e mirDIP database provides in silicopredicted miRNA-target interaction data from six established target prediction algorithms and 12 miRNA prediction databases. GO annotation of miRNA-target-mRNA was obtained from the Entrez Gene database. We excluded GO associations with ND (no biological data) and NR (not recorded) evidence code. Detailed processes are provided in Figure 1.

Statistical and Semantic Evaluation Measurement.
To evaluate our methods, we compared the performance in terms of statistical significances between a significant 317 mRNA cluster and randomly simulated 10,000 clusters. Each randomly generated cluster had the same size as the significant mRNA set. GO functional enrichment analysis was performed for all mRNA sets using GO annotations retrieved from the NCBI Entrez Gene database. We filtered out 339 GO terms that were greater than 0.05. e resulting lists of 377 GO terms are shown in Additional File 2. To reduce the number of GO terms, enriched GO terms and p values were submitted to REduce and Visualize GO (REViGO).
We computed the average log p values of ranked GO term sets from functional enrichment analysis of the significant mRNA set and the randomly simulated 10,000 mRNA sets. ese average log p values were then used for comparing the performance. Functional enrichment was performed using GO annotations of mRNA from NCBI Entrez gene [25]. Average log p values of ranked GO terms were based on the general assumption that highly significant GO terms are more desirable because it means the members of the cluster are highly correlated to each other. For semantic evaluation of the significant mRNA set, we used REViGO, which is a web-based system that summarizes a list of GO terms by finding a representative subset of the terms using the semantic similarity-based clustering algorithm [26].

Statistical Significances.
For evaluating our methods, we compare performance in terms of statistical significances between a significant 317 mRNA cluster and randomly Journal of Healthcare Engineering simulated 10,000 clusters. Each randomly generated cluster has the same size with the significant mRNA cluster. All mRNA cluster performed GO functional enrichment analysis using GO annotation from the NCBI Entrez gene. Figure 2 shows the distributions of average log p values for the rank of GO terms which belong to the biological process category. e significant mRNA set is shown as the red dotted graph. Randomly simulated 10,000 clusters are shown as box plots. e significant mRNA set showed a higher average log p value than random clusters did, which indicated that the members of the cluster highly correlated and meaningfully composed.

Gene Ontology Analysis of Significant miRNA Target
Genes. Using the UniProt database as background and the default semantic measure (SimRel), our analysis clearly showed that biological processes associated with cancer metabolism, regulation of cell death and apoptotic process, and negative regulation of autophagy were significantly overrepresented. Figure 3 shows the REViGO scatter plot represented in a two-dimensional space derived by applying multidimensional scaling to a matrix of GO terms semantic similarities. e resulting lists of 339 GO terms along with their p values were further summarized by the REViGO reduction analysis tool that condenses the GO description by removing redundant terms. e remaining terms after the redundancy reduction were plotted in a two-dimensional space. Bubble color indicates the p value (legend in the upper right-hand corner): the two ends of the colors are red and blue, which represent lower and higher p values, respectively. Size indicates the

Discussion
Functional enrichment studies for miRNA expression are performed in three steps: (1) selecting differentially expressed miRNAs, (2) finding their target mRNA, and (3) carrying out analysis of mRNA set overrepresentation [27]. Functional enrichment studies for miRNAs are mostly based on the annotation of target mRNA; however, due to improvements in the miRNA target prediction algorithms, a large number of target mRNAs are predicted. Considering this, filtering out significant mRNAs using a stable statistical method is of great importance. In this study, we proposed a method for identifying the significant miRNA target mRNA from the miRNA cluster. e proposed method was verified by functional enrichment analysis of differentially expressed or coexpressed miRNA clusters.
Inaccurate functional enrichment methods are a hindrance in increasing clinical utility for miRNAs, such as miRNA-based biomarkers or predictors [28,29]. Several tools have been recently established for direct prediction of miRNA functions [10,30]; however, these methods do not consider the regulatory or indirect functions of miRNAs, such as regulation or inhibition of target genes [31]. e intrinsic properties of multiplicity and cooperative activities of miRNAs should be considered while annotating the miRNA function. miRGator v3.0 is a tool created considering these characteristics and allows the user to manually select miRNAs and target mRNAs [32]. However, such tools are only useful when the number of miRNA and mRNA pairs is small. e limitation of the proposed method is that the hypergeometric distribution has a significant effect when members belonging to an miRNA cluster are regulated by common factors such as the target mRNA, GO, and pathway. e proposed method constructs a target mRNA set with statistical significance by receiving miRNA clusters with similar expression characteristics. e assumption of hypergeometric is well suited to this problem because the cluster-received input already has similar characteristics. e miRNA target prediction algorithms were modified to generate more accurate results based on the expanding understanding of the molecular mechanism of miRNA regulation. Nevertheless, identifying significant target mRNAs from the numerous, uncurated miRNA target links remain as a problem. Our method is based on computationally identifying statistically significant mRNAs using predicted or experimentally validated target relationships. Complementing target prediction algorithms with our proposed method may have significant synergistic effects in finding and evaluating functional annotation and enrichment analysis for miRNA.

Conflicts of Interest
e authors declare that they have no conflicts of interest.

Authors' Contributions
Su Yeon Lee and Soo-Yong Shin have contributed equally to this work.