miRCancerdb: a database for correlation analysis between microRNA and gene expression in cancer

Objectives
microRNAs regulate the expression of target genes by specifically binding to their transcripts, which leads to translational inhibition or mRNA degradation. Gene regulation by microRNAs has been implicated in a wide range of physiological and pathological conditions. Here, we use public-access data and the available genomic annotations to pre-calculate the correlation between the expression of a large number of microRNAs and that of genes, at the mRNA and protein levels, in the context of cancer.
Results
Expression data of miRNAs, mRNAs and proteins in cancer patients from The Cancer Genome Atlas, along with TargetScan miRNA-target annotations, were used to calculate the expression correlations between miRNAs and features (mRNAs/proteins) in a number of cancer studies. We then packed the output of this analysis into a database and made it available through an interactive web application. miRCancerdb is an easy-to-use database for investigating the microRNA-dependent regulation of target genes involved in the development of cancer.

Introduction
miRNAs (microRNAs) are involved, among other functions, in the regulation of gene expression in a wide range of physiological and pathological conditions [1,2]. Identifying miRNAs and their gene targets is challenging in both bioinformatics and experimental biology. The sequence-based search for miRNA gene targets usually yields a high number of false positives [3]. To confirm these hits experimentally, biologists have to go through long lists of potential targets and use laborious laboratory techniques to isolate the meaningful functional associations between them and the miRNAs. The problem is exacerbated by the diverse expression profiles and functions of miRNAs across tissues and conditions [4].
One way to tackle the problem of miRNA target identification while maintaining a reasonably high throughput is to combine sequence-based search methods with high-throughput expression profiling such as microarrays and RNA sequencing [5,6]. The process starts by manipulating the expression of a particular miRNA in a certain condition, followed by surveying gene expression by arrays or sequencing. The correlation between the expression of the miRNA and that of its potential target genes is then calculated [7]. Genes whose sequences match the miRNA seed and whose expression correlates strongly with that of the miRNA are more likely to be true positives.
Here, we use public-access data and the available annotations to pre-calculate the correlation values between the expression of a large number of miRNAs and genes in the context of cancer. We used TCGA (The Cancer Genome Atlas) [8] to obtain miRNA, gene and protein expression profiles in 34 types of cancer and the TargetScan annotation database [9] to calculate the expression correlation for miRNAs and their targets at the mRNA and protein levels. We then packed the output of this analysis into a database and made it available in an interactive web application.

Data sources
Expression data of miRNAs, genes and proteins were obtained from TCGA using the R package RTCGA [10]. RTCGA provides the processed counts of these three genetic elements from patient samples, measured by miRNA sequencing, RNA sequencing and Reverse Phase Protein Arrays (RPPA), respectively. Human miRNA target annotations were obtained using the R package targetscan.Hs.eg.db [11]. miRNA miRBase IDs, gene Entrez IDs, official gene symbols and manufacturer antibody IDs were used as appropriate to connect the data from the different assays and sources.

Analysis pipeline
Expression data from TCGA and target annotations were first processed to a tidy format and the different identifiers were translated to common ones. miRNAs and features (mRNAs/proteins) were excluded when their expression values were less than 1. Pearson's correlations between the miRNA expression matrix and the gene or protein expression matrices were calculated in each of the TCGA studies (n = 34). The following equation was used, as implemented in the base R cor function, to calculate the Pearson's correlation (ρ) [12].

$$\rho_{X,Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y}$$

where X and Y are the expression values of a miRNA and a feature, E is the expectation, µ is the mean and σ is the standard deviation.
Missing values (NA) and correlation values less than 0.1 were omitted to bring the data to a manageable size. Correlation data for each feature in each study were then merged into a single table to build a database file. In addition, a target table listing each miRNA and its corresponding targets was added to the database file. Finally, a tidy table of miRNA expression profiles (counts) in TCGA studies was added to the database file. Figure 1a shows the workflow of the analysis and the corresponding output views.
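The correlation and filtering steps can be illustrated with a minimal R sketch on made-up data. This is not the pipeline's actual code: the toy matrices, the helper keep_expressed, the use of the mean expression for the "less than 1" rule and the use of the absolute value for the 0.1 cutoff are all assumptions made for illustration.

```r
# Toy data: rows are samples, columns are miRNAs or genes (all values are made up).
set.seed(1)
n_samples <- 20
mirna <- matrix(rpois(n_samples * 3, lambda = 50), nrow = n_samples,
                dimnames = list(NULL, c("mirna_a", "mirna_b", "mirna_c")))
gene <- matrix(rpois(n_samples * 4, lambda = 200), nrow = n_samples,
               dimnames = list(NULL, paste0("gene_", 1:4)))

# Hypothetical helper: keep a feature only if its mean expression is at least 1.
# The exact exclusion rule of the published pipeline is an assumption here.
keep_expressed <- function(m, min_value = 1) {
  m[, colMeans(m) >= min_value, drop = FALSE]
}
mirna <- keep_expressed(mirna)
gene <- keep_expressed(gene)

# Pearson's correlation of every miRNA (rows of the result) with every gene (columns).
cor_mat <- cor(mirna, gene, method = "pearson")

# Reshape to a long table, then drop missing and weak correlations.
# Filtering on the absolute value is an assumption.
cor_long <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
names(cor_long) <- c("mirna_base", "feature", "cor")
cor_long <- cor_long[!is.na(cor_long$cor) & abs(cor_long$cor) >= 0.1, ]
```

Passing two matrices to cor returns all pairwise column correlations at once, which avoids an explicit loop over miRNA-feature pairs.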

Database file
The database file is a sqlite3 instance. It contains four tables: cor_rnaseq for miRNA and mRNA expression correlations, cor_rppa for miRNA and protein expression correlations, targets for the miRNA-feature targets and profiles for the miRNA expression in TCGA studies.
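As a rough illustration of how such a file can be queried from R, the sketch below uses the DBI and RSQLite packages on an in-memory stand-in. Only the table name cor_rnaseq and the column names mirna_base, feature and ACC follow the text; the gene names, the values and the in-memory connection are invented for the example and do not reflect the real database contents.

```r
library(DBI)

# In-memory stand-in for the database file (assumption: the real file
# is produced by the local build and would be opened by its path instead).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical miniature of the cor_rnaseq table; all values are invented.
toy <- data.frame(mirna_base = c("hsa-let-7b", "hsa-let-7c"),
                  feature = c("GENE1", "GENE2"),
                  ACC = c(0.4, -0.3),
                  stringsAsFactors = FALSE)
dbWriteTable(con, "cor_rnaseq", toy)

# List the tables, then run a parameterised query for one miRNA.
tables <- dbListTables(con)
res <- dbGetQuery(
  con,
  "SELECT mirna_base, feature, ACC FROM cor_rnaseq WHERE mirna_base = ?",
  params = list("hsa-let-7b")
)

dbDisconnect(con)
```

The same query text would work from any other SQLite client, since nothing in it is specific to R.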

Web interface and data presentation
We built an interactive shiny application on top of the database to provide an easy-to-use user interface for miRCancerdb [13]. The web application is available at https://mahshaaban.shinyapps.io/miRCancerdb/. The application homepage consists of a main panel and a control panel. The control panel (Fig. 1b) is where the user can enter a search query and customize the results. The results are shown on the main panel in different representations/views (Table 1). In addition, the website provides getting started, how-to and about sections that give an overview and detailed user instructions.

Building and using the database locally
The source files to build and use miRCancerdb locally are available at https://github.com/MahShaaban/miRCancerdb. They consist of R scripts that build the database file and launch an interactive browser application locally. The build commands are wrapped in a makefile, so the local build can be triggered with a single command.
To do that, download the GitHub repository, navigate to the miRCancerdb directory and run make.

    git clone https://github.com/MahShaaban/miRCancerdb
    cd miRCancerdb
    make

To only run the browser application after the initial build, use the make target, run the R script from the command line or use it within RStudio.

    make launch_app
    R -e "shiny::runApp('app.R', launch.browser = TRUE)"

More details about building and using the database locally are available at the database GitHub repository.

Use case
To illustrate querying the database through the interactive web application, we provide an example of a simple query. Here, we used the query options in the control panel to enter the TCGA identifiers for the adrenocortical carcinoma (ACC) and bladder urothelial carcinoma (BLCA) studies, and the miRBase IDs hsa-let-7b and hsa-let-7c. All other options were left at their defaults. Figure 2 shows the outputs from the miRCancerdb views. miRCancerdb provides multiple views to output and represent the results of each query. In this case we show the results from only two: the dot and the profile views. Figure 2a shows the output of the query as a dot graph with the queried miRNAs on the x-axis and the correlated genes on the y-axis; the panels are faceted by the queried studies. Each dot represents a recorded correlation, with the size corresponding to its value and the color to its direction (positive or negative). Figure 2b is a point graph with the miRNAs on the x-axis and the log count in each study on the y-axis. Together, the dot and profile views show the correlated genes, the magnitude and direction of the correlations and the distribution of the counts of each miRNA in each of the queried studies. To reproduce the exact figures while using the database locally, one needs to build the database first as described above, then run the following code in an R session.
Fig. 1 A workflow of the analysis pipeline and the miRCancerdb application control panel. a The diagram represents the data sources, the analysis pipeline and the representation views. Briefly, expression profiles of three assays (miRNASeq, RNAseq and RPPA) from TCGA were obtained using the RTCGA package. miRNA gene targets were obtained using the targetscan.Hs.eg.db package. After tidying and filtering the data, Pearson's correlations of each miRNA with features (mRNA/protein) in each of the TCGA studies were calculated. Target features were then identified. Different types of views are available on the web interface to visualize different aspects of the query results. miRNASeq, miRNA sequencing; RNAseq, RNA sequencing; RPPA, Reverse Phase Protein Array. b The panel contains two main areas to enter and customize the query. Query options: to choose the kind of feature expression to base the calculation on, the TCGA study identifier(s) and the miRNA miRBase IDs. Subset options: to customize the query and limit the results. Options include using feature targets only, limiting the results to the top n features for each miRNA in each TCGA study, choosing a direction of correlation (positive, negative or both) and indicating an absolute minimum correlation to show.

Table 1

We start by loading the required libraries. Then, we connect to the SQLite database file and filter the table cor_rnaseq to get the correlations of the query of interest.

    # load the required libraries (tbl() on a database connection also needs dbplyr installed)
    library(DBI)
    library(dplyr)
    library(tidyr)

    # connect to the database (placeholder path: use the file produced by the local build)
    db <- dbConnect(RSQLite::SQLite(), 'path/to/database/file')

    # get data
    dat <- db %>%
      tbl('cor_rnaseq') %>%
      dplyr::select(mirna_base, feature, c('ACC', 'BLCA')) %>%
      filter(mirna_base %in% c('hsa-let-7b', 'hsa-let-7c')) %>%
      collect() %>%
      gather(study, cor, -mirna_base, -feature) %>%
      mutate(cor = as.numeric(cor)) %>%
      na.omit() %>%
      group_by(study, mirna_base) %>%
      arrange(desc(abs(cor))) %>%
      slice(1:5)

Plot the results as a dot plot (Fig. 2a).
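The dot view described for Fig. 2a could be drawn along the following lines with ggplot2. This is only a sketch under stated assumptions, not the application's plotting code: the data frame below is an invented stand-in for the query result, and the gene names and correlation values are hypothetical.

```r
library(ggplot2)

# Invented stand-in for the query result; column names follow the query above.
dat <- data.frame(
  study = rep(c("ACC", "BLCA"), each = 4),
  mirna_base = rep(c("hsa-let-7b", "hsa-let-7c"), times = 4),
  feature = rep(c("GENE1", "GENE2"), each = 2, times = 2),
  cor = c(0.5, -0.4, 0.3, 0.6, -0.2, 0.45, -0.55, 0.35)
)

# miRNAs on the x-axis, correlated features on the y-axis, one panel per study.
# Dot size encodes the magnitude of the correlation and color its direction.
p <- ggplot(dat, aes(x = mirna_base, y = feature,
                     size = abs(cor),
                     color = ifelse(cor > 0, "positive", "negative"))) +
  geom_point() +
  facet_wrap(~ study) +
  labs(x = "miRNA", y = "Feature",
       size = "Absolute correlation", color = "Direction")

print(p)
```

Mapping the absolute value to size and the sign to color keeps negative correlations visible instead of shrinking them to the smallest dots.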

Utility and discussion
Identifying the regulatory elements (transcription factors and miRNAs) in complex pathological conditions like cancer is an important step towards understanding their pathophysiological mechanisms [1,4]. These regulatory elements often work as a complex network and drive many processes such as cancer growth, differentiation and metastasis [2]. Cistrome Cancer is a database that provides information about the correlation and regulatory potential of transcription factors with respect to gene expression in cancer [14]. It is based on an integrative analysis of TCGA expression and public-access ChIP-seq (Chromatin Immunoprecipitation Sequencing) data. To our knowledge, no tools have been developed to address the similar involvement of miRNAs. miRCancerdb provides pre-calculated expression correlations between miRNAs and mRNAs/proteins in cancers. It is based on an integrative analysis of TCGA public-access data and TargetScan annotations. miRCancerdb can be applied to a wide variety of research questions. Biologists interested in studying certain miRNAs can use it for their initial surveys. Mainly, the database provides calculated correlation values of these elements with either gene or protein expression in a large number of cancer studies. In addition, it is possible to integrate the available sequence-based annotations to limit the highly correlated genes to predicted targets. Those who are interested in larger questions can also query the database to find all miRNAs involved in certain types of cancer and compare them to each other. This can be used to discover and compare regulation patterns in neoplastic conditions.
The database file was built with several points in mind. First, we filtered out the less useful information to bring the file to a manageable size. Second, the file is a stand-alone database which can be accessed programmatically using any SQLite client. Finally, the source code for building the database and launching the browser application is open source. On the one hand, this supports reproducibility and makes updating the database easy when needed. On the other, the scripts can be used to build the database on local machines and integrate it into different analysis pipelines.
To our knowledge, no available public database provides data-driven correlation/co-expression of miRNAs and genes in cancer. We developed an R package called cRegulome (unpublished) that interfaces with miRCancerdb programmatically from the R environment. In addition to the miRNA correlations, the package provides access to transcription factor-gene correlations from the Cistrome project [14]. miRCancer is another cancer association database of miRNAs, based on text mining of the literature [15].

Limitations
In this kind of integrative analysis based on public-access data, we are limited by the available data and annotations. The current miRNA annotations are far from complete. The plan is to keep up with updates from the two major data and annotation sources, namely the TCGA and the TargetScan projects, by rebuilding the database with each future update of the RTCGA and targetscan.Hs.eg.db Bioconductor packages.
In the current version of the database, we treated each TCGA study as one entity. Another plausible way is to consider the cancer heterogeneity and individual variations. This is to stratify the analysis with histological and/or clinical information about the samples. However, this would complicate the pipeline and increase the database size dramatically, so we might address this issue in a later version.
In addition, improving the user experience on the web application and providing graphics that can handle bigger queries is a priority. Finally, a similar analysis of non-cancer human tissues and cell lines using public-access data is needed. This would make a good tool for discovering patterns of regulation in many physiological conditions and would provide a comparison set for these patterns in cancers.

Authors' contributions
MA, HN and TL performed the analysis, built the database and the interactive application, and contributed to writing the manuscript. DRK contributed to writing and revised the manuscript. All authors read and approved the manuscript.