Identifying miRNA sponge modules using biclustering and regulatory scores

Background MicroRNA (miRNA) sponges with multiple tandem miRNA binding sequences can sequester miRNAs from their endogenous target mRNAs. Therefore, miRNA sponge acting as a decoy is extremely important for long-term loss-of-function studies both in vivo and in silico. Recently, a growing number of in silico methods have been used as an effective technique to generate hypotheses for in vivo methods for studying the biological functions and regulatory mechanisms of miRNA sponges. However, most existing in silico methods only focus on studying miRNA sponge interactions or networks in cancer, the module-level properties of miRNA sponges in cancer is still largely unknown. Results We propose a novel in silico method, called miRSM (miRNA Sponge Module) to infer miRNA sponge modules in breast cancer. We apply miRSM to the breast invasive carcinoma (BRCA) dataset provided by The Cancer Genome Altas (TCGA), and make functional validation of the computational results. We discover that most miRNA sponge interactions are module-conserved across two modules, and a minority of miRNA sponge interactions are module-specific, existing only in a single module. Through functional annotation and differential expression analysis, we also find that the modules discovered using miRSM are functional miRNA sponge modules associated with BRCA. Moreover, the module-specific miRNA sponge interactions among miRNA sponge modules may be involved in the progression and development of BRCA. Our experimental results show that miRSM is comparable to the benchmark methods in recovering experimentally confirmed miRNA sponge interactions, and miRSM outperforms the benchmark methods in identifying interactions that are related to breast cancer. Conclusions Altogether, the functional validation results demonstrate that miRSM is a promising method to identify miRNA sponge modules and interactions, and may provide new insights for understanding the roles of miRNA sponges in cancer progression and development. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1467-5) contains supplementary material, which is available to authorized users.


Background
MicroRNAs (miRNAs) are small (~22 nt), singlestranded, non-coding RNA molecules which are involved in the post-transcriptional regulation of gene expression. By binding to target mRNAs, miRNAs typically cause degradation and translation repression of mRNAs [1]. The fine-tuning of gene regulation by miRNAs in a wide range of biological processes and tumor progressions has attracted significant attentions to understand the biological functions and regulatory mechanisms of miRNAs.
Recently, the competing endogenous effect at the posttranscriptional level has shifted our understanding of miRNA regulatory mechanism. There are several types of RNAs acting as competing endogenous RNAs (ceRNAs) (also called miRNA sponges or miRNA decoys) to prevent miRNAs from binding their authentic targets. These miRNA sponges include protein-coding RNAs, long noncoding RNAs (lncRNAs), pseudogenes and circular RNAs (circRNAs) [2][3][4][5]. More and more miRNA sponges in different biological conditions have been identified by biological experiments [6]. These miRNA sponges interact with each other via shared miRNAs, and the crosstalks between them are formed to develop miRNA-mediated interaction or miRNA sponge interaction network. However, similar to miRNA target prediction, the identification of miRNA sponge interaction networks by using biological experiments is limited by their low efficiency, time consumption and high cost. Thus, a growing number of computational methods have been proposed to identify miRNA sponge interaction networks.
Existing computational methods for identifying miRNA sponge interaction networks can be divided into three categories [7]: (1) pair-wise correlation approach, (2) partial association approach, and (3) mathematical modelling approach. In the first category [8][9][10][11], each pair of interacting miRNA sponges in a network have a significant positive correlation or there is a significant difference in their correlation between two different conditions. The main limitation of these methods is that they don't consider the expression levels of the miRNAs shared by the two miRNA sponges when computing the correlation between the miRNA sponge pair. To address this limitation, methods of the second category [12][13][14] integrate the expression levels of the shared miRNAs of two miRNA sponges and calculate the partial association between them. These methods only use unweighted bipartite network consisting of putative miRNA-target interactions, but ignore the binding strengths between miRNAs and their targets. Moreover, some identified miRNA sponge interactions in the network are actually TF-target interactions or protein-protein interactions (PPIs), and should be removed. The third category [15][16][17][18] focuses on decribing a minimum or small miRNA sponge interaction network using different mathematical models. For each candidate miRNA sponge interaction, they would design a synthetic gene circuit to analyze the quantitative behavior of the miRNA sponge effect. The number of candidate miRNA sponge interactions is usually large. Since it is very time-consuming of designing many synthetic gene circuits for a large number of miRNA sponge interactions, these methods cannot be easily applied to study a larger miRNA sponge interaction network.
The identification of miRNA sponge interaction networks could provide a global view of studying the properties of miRNA sponges in cancer progression and development. Due to the modularity of cancer progression and development, it is also important to identify functional modules that involve miRNAs and miRNA sponges. Therefore, in this study, we present a novel computational method based on biclustering and regulatory scores to identify miRNA Sponge Modules (thus the proposed method is called miRSM).
We explore mRNA-related miRNA sponge modules by combining matched miRNA and mRNA expression data, and putative miRNA-target interactions. Instead of completely relying on putative miRNA-target interactions, we reconstruct miRNA-target interactions by considering both expression data and miRNA-target binding information. We use regulatory scores to infer miRNA-mRNA biclusters where a subset of mRNAs compete with each other to attract binding with a subset of miRNAs. We further identify miRNA sponge interactions in each miRNA-mRNA bicluster, and remove the candidate miRNA sponges which are not involved in any miRNA sponge interactions. The remaining candidate miRNA sponges and miRNAs in each bicluster are regarded as a miRNA sponge module.
The method is applied to the breast invasive carcinoma (BRCA) dataset provided by The Cancer Genome Altas (TCGA) to build miRNA sponge modules in BRCA. We discover that a few number of miRNA sponge interactions only exist in single module, and most miRNA sponge interactions are common across two modules. This result shows the module-conserved characteristic of miRNA sponge interactions across two different modules. Moreover, miRNA sponges of the modules are found to be biologically meaningful based on functional annotation and differential expression analysis. Through experimental validation using the thrid-party databases, some miRNA sponge interactions and miRNA-target interactions are experimentally validated. Finally, the comparison results show that miRSM performs better than or comparable to the other three existing methods (PC [8,9], SPPC [12], and Hermes [13]) in identifying miRNA sponge interactions.

Data sources
The matched miRNA and mRNA expression data of human BRCA are obtained from Paci et al. [12]. The dataset is generated with the platform of TCGA level 3 IlluminaHiSeq in 72 matched tumor and normal tissues. miRNAs and mRNAs with missing values in >50% samples are removed from the dataset. The remaining missing values are imputed using the k-nearest neighbours (KNN) algorithm of the R package impute. Furthermore, we remove the mRNAs without gene symbols and take the average expression values of replicate miRNAs and mRNAs. Therefore, we obtain 453 miRNAs and 11,157 mRNAs in the 72 matched samples.
The putative miRNA-target interactions are from TargetScan v7.1 [19]. We retain those miRNA-target interactions with context++ scores less than 0 in TargetScan. The context++ score for each miRNA-target interaction is the sum of the contributions of 14 robustly selected features [19]. As a result, we obtain 228,423 interactions (with negative context++ scores) between 402 mature miRNAs and 12,441 mRNAs. In this study, we choose two representative databases of experimentally validated human TF-target interactions and PPIs to illustrate the method. The first database HTRIdb [20] is a popular repository of experimentally verified human transcriptional regulation interactions, and we collect 51,871 TF-target interactions from it. The second database HPRD v9 (Human Protein Reference Database) [21] is a well-cited human protein reference database with high-quality PPIs, and we obtain 36,852 protein-protein interactions (PPIs) from it.
The experimentally validated miRNA-target interactions with strong evidence are from miRTarbase v6.1 [34]. The experimentally validated miRNA sponge interactions are retrived from [6,7], and miRSponge [35], the first manually curated miRNA sponge interactions database. We only extract experimentally validated mRNA-related miRNA sponge interactions for validations. (1) Data preparation. We firstly collect expression profiles and miRNA-target binding information. The expression profiles of miRNAs and mRNAs, and the context++ scores of miRNA-target interactions are regarded as input dataset of the next step. (2) Create miRNA-mRNA regulatory score matrix. We firstly use Pearson correlation method to calculate miRNA-mRNA correlation matrix, W, of the matched miRNA and mRNA expression data. Based on the putative miRNA-target binding information retrieved from TargetScan, we generate the miRNA-mRNA context++ score matrix, T. By combining the correlation matrix and the context++ score matrix, we create miRNA-mRNA regulatory score matrix, S. (3) Infer miRNA-mRNA biclusters. The miRNA-mRNA regulatory score matrix is regarded as the input matrix of the biclustering method. A subset of mRNAs exhibit similar behavior across a subset of miRNAs in each bicluster. (4) Identify miRNA sponge modules. Pearson correlation method is used to compute the correlations of all possible mRNA-mRNA pairs of each bicluster. We use the regulatory scores between miRNAs and mRNAs to reconstruct miRNA-target interactions, and a hypergeometric test is utilized to evaluate the significance of the sharing of miRNAs by each mRNA-mRNA pair. The mRNA-mRNA pairs with significant sharing of miRNAs (p-value <0.01) and significant positive correlation (p-value <0.01) are regarded as candidate miRNA sponges and their interactions as candidate miRNA sponge interactions. Moreover, we remove the candidate miRNA sponge interactions that are actually TF-target interactions or PPIs and the candidate miRNA sponges which are not involved in any miRNA sponge interactions are removed too. Finally, the remaining candidate miRNA sponges and miRNAs in each bicluster are considered as a miRNA sponge module.

Overview of miRSM
In the following, we will present the key steps in detail.

Calculating miRNA-mRNA regulatory scores
The regulatory scores denote the degree of regulation between miRNAs and mRNAs considering both their correlations and context++ scores. The correlations between miRNAs and mRNAs are based on the matched miRNA and mRNA expression data, and the context++ scores of miRNA-mRNA interactions are extracted from TargetScan v7.1 [19]. Let W be miRNA-mRNA correlation matrix, and T be miRNA-mRNA context++ score matrix. The miRNA-mRNA regulatory score matrix S is calculated as follows: In this study, we use the regulatory scores of miRNA-mRNA interactions to reconstruct putative miRNA-target interactions. We only consider negative values of the regulatory scores in S due to negative regulation of miRNAs. According to the empirical experiments, the negative correlation of two variables is around −0.3 under significant level of p-value <0.05. Thus, the default threshold s of regulatory scores is set to −0.3. That is to say, the miRNA-target interactions with regulatory scores equal to or less than s are regarded as reconstructed miRNA-target interactions. Identifying miRNA sponge modules Given the miRNA-mRNA regulatory score matrix with m rows in n columns, the biclustering method allows simultaneous clustering the rows (mRNAs) and columns (miRNAs) of the matrix. Here, a bicluster corresponds a module. For each bicluster, a subset of mRNAs exhibit similar behavior across a subset of miRNAs. To identify miRNA-mRNA biclusters, a biclustering method called BCPlaid [36] is used. The BCPlaid is an improved version of Plaid model [37]. The Plaid model estimates the normal expression level of each gene, then infers biclusters of genes that have similarly unusual expression levels across the biclustered samples. This feature makes it an attractive method for clustering expression data. To improve the computationally efficient for fitting the Plaid model, the BCPlaid is presented based on speedy individual differences clustering and uses binary least squares to update the cluster membership parameters.
After obtaining the biclusters, we calculate correlations of all mRNA-mRNA pairs of each bicluster. For a given mRNA-mRNA pair (mR 1 and mR 2 ), the significance p-value of the shared miRNAs by these two mRNAs is calculated in the following.
In the formula, N is the number of all miRNAs in the dataset, M and K represent the total numbers of miRNAs regulating mR 1 and mR 2 respectively, and x is the number of common miRNAs shared by mR 1 and mR 2 . The mRNA-mRNA pairs with significant sharing of miRNAs (p-value <0.01) and significant positive correlations (p-value <0.01) are regarded as candidate miRNA sponge interactions. We further remove the candidate miRNA sponge interactions that are actually TF-target interactions or PPIs and the candidate miRNA sponges which are not involved in any miRNA sponge interactions are removed too. Finally, all the reserved miRNA sponges and miRNAs in a bicluster are regarded as a miRNA sponge module.

miRNA sponge modules for BRCA
The default values of the tuning parameters a and b are set to 0.5, and the threshold s of regulatory scores is set to −0.3 for the reconstruction of miRNA-target interactions. As shown in Table 1, we identify four miRNA sponge modules. As illustrated in Fig. 2, there are many common miRNA sponge interactions (385,172) between Module 2 and Module 3, and almost all miRNA sponge interactions (37,948) in Module 4 exist in Module 1. However, there is no overlap of miRNA sponge interactions among the four modules. This result implies that most miRNA sponge interactions tend to be moduleconserved across two modules, and a small portion of miRNA sponge interactions are module-specific (i.e. only exist in a single module). The detail information of the four modules and module-specific miRNA sponge interactions can be seen in Additional file 1.
miRNA sponge modules are biologically meaningful As described previously (see the Data sources section), we collect a list of 428 BRCA miRNAs and 2949 BRCA genes. We also collect a list of 40 unique GO terms associated with 10 cancer hallmarks (Self Sufficiency in Growth Signals, Insensitivity to Antigrowth Signals, Evading Apoptosis, Limitless Replicative Potential, Sustained Angiogenesis, Tissue Invasion and Metastasis, Genome Instability and Mutation, Tumor Promoting Inflammation, Reprogramming Energy Metabolism, and Evading Immune Detection). Only five cancer hallmarks (Self Sufficiency in Growth Signals, Insensitivity to Antigrowth Signals, Evading Apoptosis, Tissue Invasion and Metastasis, and Genome Instability and Mutation) have related gene sets in more than half of the associated GO terms (details in Additional file 2). As a result, we have a list of 2224 unique genes associated with the five representative cancer hallmarks. The list of BRCA miRNAs, BRCA genes, and cancer hallmark genes can be seen in Additional file 3.
As shown in Fig. 3, the percentages of BRCA miRNAs, BRCA genes, and cancer hallmark genes are different due to the different components of each miRNA sponge module. Overall, 10.81% of miRNAs are BRCA miRNAs, (See figure on previous page.) Fig. 1 The pipeline of miRSM. We construct miRNA-mRNA correlation matrix using Pearson method, and miRNA-mRNA context++ score matrix using putative miRNA-target binding information. Next, miRNA-mRNA regulatory score matrix is inferred by combining miRNA-mRNA correlation and context++ score matrix. A biclustering method is then used to generate miRNA-mRNA biclusters. We identify miRNA sponge interactions in each miRNA-mRNA bicluster, and remove the candidate miRNA sponges which are not involved in any miRNA sponge interactions. The remaining candidate miRNA sponges and miRNAs in each bicluster are considered as a miRNA sponge module 21.81% of miRNA sponges are BRCA genes, and 13.40% of miRNA sponges are cancer hallmark genes in the identified miRNA sponge modules.
Since differentially expressed genes with abnormal expression are closely associated with the occurrence and development of cancer, we also perform differential expression analysis on the BRCA expression profiles using limma package [38] of Bioconductor. As a result, 278 miRNAs (adjusted p-value <0.01, adjusted by Benjamini & Hochberg method), and 5602 mRNAs (adjusted p-value <1E-04) are identified to be differentially expressed at significant level (details in Additional file 4). We find that the miRNA sponges in the four miRNA sponge modules are all differentially expressed mRNAs, and the percentages of  To uncover the biological machanism in BRCA, we further conduct functional annotation analysis of the miRNA sponges using GeneCodis [39] (the online tool at http:// genecodis.cnb.csic.es/). The top 5 enriched GO (Gene Ontology) [40] terms and KEGG (Kyoto Encyclopedia of Genes and Genomes) [41] pathways are listed in Table 2.
As shown in Table 2, most enriched GO biological processes and KEGG pathways are shared by Module 2 and Module 3, and there also exist common pathways between Module 1 and Module 4. This suggests that similar modules (Module 2 and Module 3, Module 1 and Module 4) with many overlaps of miRNA sponge interactions tend to have similar biological functions, and vice versa. Moreover, all modules have many enriched GO biological processes and KEGG pathways related to BRCA, such as Signal transduction (GO:0007165) [42], Cell cycle (GO:0007049, KEGG:04110) [43], and Pathways in cancer (KEGG:05200). Since the BRCA dataset is a cancer dataset, the result demonstrates that the discovered miRNA sponge modules are closely associated with the biological condition of the dataset. The module-specific miRNA sponge interactions among four modules are also significantly enriched in Signal transduction (GO:0007165) and Pathways in cancer (KEGG:05200). The result indicates that these module-specific miRNA sponge interactions may be involved in the progression and development of BRCA.
In summary, miRNA sponge modules are biologically significant, which may imply that the miRNA sponge modules discovered based on the BRCA dataset can indeed reveal the biological mechanism in BRCA. The detailed information of significant GO terms and KEGG pathways for the miRNA sponges in each module and module-specific interactions can be found in Additional file 5. Comparison with other existing methods in identifying miRNA sponge interactions

Validation of the interactions in the miRNA sponge modules
In this section, we compare the performance of miRSM with other existing methods in terms of the numbers of breast cancer-related miRNA sponge interactions and experimentally validated miRNA sponge interactions in the findings by the methods. We define that breast cancer-related miRNA sponge interactions are those in which the two interactive parties exist in the list of 2949 breast cancer genes (i.e. the breast cancer genes collected in the Data sources section). Since the mathematical modelling approaches in the third category are only applied to study a small number of miRNA sponge interactions, we don't compare miRSM with them in this study. Therefore, in this study, we select three typical methods from the first two categories (pair-wise correlation approach and partial association approach) for the comparison study. The first method is the Positive Correlation (PC) method [8,9], which is based on the positive correlation between each pair of interacting miRNA sponges. The second method is the Sensitivity Partial Pearson Correlation (SPPC) method [12], which uses partial correlations to estimate the contributed effect of common miRNAs on miRNA sponge interacting pairs. The third method is Hermes [13], which uses conditional mutual information to estimate partial associations between miRNA sponges.
To make a fair comparison, we use the same p-value cutoff (0.01) to calculate significance of the findings of shared miRNAs and the positive correlations of possible miRNA sponge interaction pairs. For the SPPC method, the cutoff of sensitivity correlation (the difference between Pearson Correlation and Partial Pearson Correlation) is set to 0.3, which is the value used in [12].
We compare the results of miRSM with three different parameter settings with those of the other 3 methods. As shown in Table 3, the numbers of validated miRNA sponge interactions for the three different parameter settings of miRSM are all 5, indicating a stable validation results of our method. In the case of the number of validated miRNA sponge interactions, our method performs better than SPPC and Hermes, but slightly worse than PC. However, our method generally performs better than the other three methods in the percentage of breast cancer-related miRNA sponge interactions.
Since PTEN-related miRNA sponge interactions are widely studied, we further focus on studying the overlap and differences between PTEN-related miRNA sponge interactions identified by miRSM_default, PC, SPPC, and Hermes. Figure 4 illustrates that different computational methods identify different sets of PTENrelated miRNA sponge interactions. Specifically, many PTEN-related miRNA sponge interactions are only inferred by miRSM.
Conclusions miRNA sponge effect is a novel type of gene regulation at the post-transcriptional level. The crosstalks between miRNA sponges involve many classes of RNAs, mainly including protein-coding and non-coding RNAs. Among different types of miRNA sponges (protein-coding RNAs, lncRNAs, pseudogenes, circRNAs, etc), the vast majority of them are protein-coding RNAs. Thus, we focus on mRNA-related miRNA sponge modules in this study. Identifying miRNA sponge interaction network using in silico methods is an emerging research field. The fundamental principle of the identification of miRNA sponge interactions using in silico methods are based on experimental evidence for miRNA sponges. The basic experimental evidence for miRNA sponges is that the overexpression of the putative miRNA sponges leads to increased expression of the competing RNAs, and vice versa. That is to say, miRNA sponge interaction pairs are positively correlated at expression level. Until now, an ubiquitous limitation of in silico methods assessing miRNA sponge interactions is that they are wholly dependent upon unweighted miRNA-target interactions at sequence level, and rarely take expression level into account. In fact, integrating both sequence level and expression level information lead to the discovery of more candidate miRNA sponge interaction pairs when exploring miRNA sponge interaction networks. In addition, an underlying problem of existing in silico methods is that they also regard other known gene regulatory interactions or molecular interactions (e.g. TF-target interactions and PPIs) as miRNA sponge interactions. Actually, these interactions are direct interactions rather than crosstalks between miRNA sponges. miRNA sponge interaction networks provide a global way to study the biological functions of miRNA sponges in cancer. Since modularity is an important feature of cancer progression and development, it is extremly necessary to investigate functional miRNA sponge modules associated with cancer from a local point of view. Therefore, in this paper, we propose miRSM to identify miRNA sponge modules. The method integrates data source from both sequence level and expression level, and uses regulatory scores to reconstruct miRNA-target interactions and infer miRNA-mRNA biclusters in which a subset of mRNAs compete with each other to bind with a subset of miRNAs. Moreover, we remove miRNA sponge interactions that are experimentally validated TF-target interactions or PPIs to improve the prediction of miRNA sponge modules. miRSM is a parametric method, i.e. the identified miRSM modules and the validation results are closely    Table 4, the threshold s is a negative value, and is associated with the number of candidate miRNA sponges. The smaller the value of s is, the less the number of miRNA sponge interactions is. The parameters a and b denote the contributions of expression data and sequence data to the identification of miRNA-target interactions, and the default values of a and b are the same. If a > b, the number of miRNA sponge interactions will increase, and vice versa. The comparison results show that miRSM performs better than or comparable to the other three existing methods (PC, SPPC, Hermes). Different methods have their own merits, leading to different sets of miRNA sponge interactions. The results focusing on PTEN-related miRNA sponge interactions show that miRSM can identify many different miRNA sponge interactions from the other three methods.
In summary, miRSM can be a promising method for identifying miRNA sponge modules, and hence provides new insights into the regulatory mechanisms and functions of miRNA sponges in different biological processes, including pathogenesis of cancers.