miR2Trait: an integrated resource for investigating miRNA-disease associations

MicroRNAs are key components of cellular regulatory networks, and breakdown in miRNA function causes cascading effects leading to pathophenotypes. A better understanding of the role of miRNAs in diseases is essential for human health. Here, we have devised a method for comprehensively mapping the associations between miRNAs and diseases by merging on a common key between two curated omics databases. The resulting bidirectional resource, miR2Trait, is more detailed than earlier catalogs, uncovers new relationships, and includes analytical utilities to interrogate and extract knowledge from these datasets. miR2Trait provides resources to compute the disease enrichment of a user-given set of miRNAs and analyze the miRNA profile of a specified diseasome. Reproducible examples demonstrating use-cases for each of these resource components are illustrated. Furthermore we used these tools to construct pairwise miRNA-miRNA and disease-disease enrichment networks, and identified 23 central miRNAs that could underlie major regulatory functions in the human genome. miR2Trait is available as an open-source command-line interface in Python3 (URL: https://github.com/miR2Trait) with a companion wiki documenting the scripts and data resources developed, under MIT license for commercial and non-commercial use. A minimal web-based implementation has been made available at https://sas.sastra.edu/pymir18. Supplementary information is available at: https://doi.org/10.6084/m9.figshare.8288825.v3.


INTRODUCTION
are key elements of post-transcriptional regulation in the genomic architecture of both prokaryotes and eukaryotes. They are short non-coding RNAs of about 18-25 nucleotides, first observed in C. elegans (Lee, Feinbaum & Ambros, 1993). In their canonical role in RNA silencing, the miRNA binds to the cognate mRNA and destabilizes it, thereby priming the transcript for degradation. A single miRNA is capable of silencing the expression of several genes by relaxing the specificity of hybridisation. MiRNAbased regulation plays crucial roles in health, and miRNA dysregulation is a common mechanism in the etiology of complex diseases, viz. cardio-vascular, auto-immune and neuro-degenerative diseases, and cancers (Min et al., 2009;Calin et al., 2004;Tai, 2011). About half of all miRNA genes are present in cancer-related genetic loci (Calin et al., 2004), and they are known to play key roles in tumorigenesis and cancer progression (Liu et al., 2018;Hayes, Peruzzi & Lawler, 2014;MacFarlane & Murphy, 2010). The naming of human miRNAs in the miRBase registry (https://www.mirbase.org) is related to the miRNA maturation pathway (Fig. 1). In brief, miRNA subtyping with numbers (−1, 2, . . . ) indicates identical miRNAs that derive from different genomic regions; miRNA subtyping with letters (a, b, . . . ) indicates nearly identical miRNAs that differ in only one or two positions; and miRNA designations -5p and -3p refer to the source arm of the miRNA duplex from which the mature miRNA is derived. It is the mature miRNA that binds to the mRNA and inhibits/regulates protein translation (MacFarlane & Murphy, 2010). It is noteworthy that the sequence responsible for miRNA-specificity of the mRNA is not translated, thus separating the protein-coding region of a gene from the miRNA-binding site in sequence space.
Existing databases of miRNA-disease associations involve tedious manual curation or are limited in scope and evolution. miR2Disease was one of the earliest catalogues of miRNA-disease associations, providing relationships between 349 miRNAs and 163 diseases (Jiang et al., 2009). HMDD was developed by Li et al. (2014) to provide a more comprehensive database of miRNA-disease associations. It used text mining followed by manual annotation for identifying miRNA-disease connections. The current version of HMDD, version 3.2, contained associations between 1206 miRNAs and 893 diseases, and included miRNA subtype information (Huang et al, 2019). Only one arm of the mature miRNA, indicating the -5p or -3p source, is predominant in vivo, a vital annotation that is missing in the HMDD databases. In this work, we have addressed the limitations of previous efforts, and designed an algorithm to uncover the spectrum of miRNA-disease associations. Our approach is based on integrating two expert-curated databases on a common key. The resulting resource, miR2Trait, is comprehensive and representative, uncovers novel relations, and provides assorted tools for investigating miRNA-disease associations. miR2Trait can be queried by both miRNA and disease, and is available as both a web server (https://sas.sastra.edu/pymir18) and command-line resource (https://github.com/miR2Trait/ or http://doi.org/10.5281/zenodo.7002878).

miRNA: disease mapping
Two expert curated databases were used as the source databases in the creation of miR2Trait: (1) miRTarbase, a database of experimentally validated miRNA-target interactions (MTIs), with >13,400 entries (Chou et al., 2018); and (2) DisGeNET, a widely used and standardized knowledge management platform with >1,000,000 gene-disease associations (Piñero et al., 2015). Both databases involved the mapping of genetic information, in one case to miRNAs, and in the other, to diseases. This provided a clue to relating miRNAs and disease via the bridge of genetic information (Fig. 2). In-house Python scripts were used to extract the gene:miRNA and the gene:disease mappings from the respective databases as dictionary data structures. The two dictionaries were merged on 'gene' to establish a dictionary of miRNA:disease mappings. Similarly, an inverse dictionary of disease:miRNA mappings was created.

Database creation
The dictionaries obtained above were used to populate relational tables in the construction of two databases: (i) miRNAs and their associated diseases, and (ii) diseases and their associated miRNAs. MySQL was used for encoding the databases.

Spectrum width calculation
The number of diseases with which an miRNA is associated is indicative of the breadth of its regulatory impact, which could flag master regulator miRNAs (or regulatory hubs). The disease spectrum width (DSW) of the ith miRNA is calculated after Qiu, Chen & Cui (2012): where d i isthe number of diseases associated with miRNA i and D N is the total number of diseases in the database. Conversely the number of miRNAs associated with a given disease could be indicative of a complex multifactorial pathology, and the miRNA spectrum width (MSW) of the jth disease is given by: where m j is the number of miRNAs associated with the jth disease and M N isthe total number of miRNAs in the database.

miRNA list enrichment analysis
Given a list of miRNAs that are collectively dys-regulated, it would be of interest to identify which diseases would be enriched (or more likely to occur). Statistical techniques to quantify such enrichment for genes as well as miRNAs exist (Li et al., 2018;Backes et al., 2016;Çorapçıoğlu & Oğul, 2015;Rivals et al., 2007). Here we implement an open-source tool to identify diseases in the miR2Trait database that are enriched for an input set of miRNAs. The method uses a hypergeometric test to quantify the enrichment of each disease in miR2Trait for the given set of miRNAs (in this case identical to the Fisher's exact test) (Fisher, 1922). The test follows the construction of the Fisher contingency table for each disease, as illustrated in Table 1. The statistical significance (p-value) of enrichment is computed using Python scipy routines (http://www.scipy.org/), and diseases that pass a user-specified adjusted p-value threshold (with an option to choose among three methods for multiple hypothesis correction) are returned sorted by significance. In addition to significance, both the web server and standalone tool calculate the effect size (log-fold) of each enrichment.

Diseasome analysis
Given an input set of diseases related in some way (called a diseasome), it would be of interest to find shared dysregulated miRNAs. We approached this problem with an abundance analysis of miRNAs in the diseasome, and identifying miRNAs that passed a user-specified count. Such a set of miRNAs could constitute master regulators acting on common pathways in the specified set of diseases. If we are interested in the miRNA spectra of umbrella disease terms such as 'carcinoma', instead of a specific set of diseases, this is also allowed. In this case, diseases containing the general keyword are first identified,

RESULTS
The mapping sizes extracted from the respective databases is given in Table 2. In summary, 2,599 miRNAs were mapped to 15,062 genes using miRTarbase, and 8,819 genes were mapped to 13,075 diseases using DisGeNET. Merging the two relations yielded a mapping between 2,595 miRNAs and 11,689 diseases (Table 2). This mapping laid the foundation for the miRNA:disease and disease:miRNA databases, which could be freely downloaded as csv files from aforementioned URLs. The density function of the mappings is shown in Fig. 3.
Using the miRNA:disease db, miRNAs with the top ten DSWs were identified (Table 3). Such miRNAs are potential master (or hub) regulators at the intersection of multiple pathways, indicatively contributing to one-fifth of all diseases. Using the disease:miRNA db, diseases with the top ten MSWs were identified (Table 4). Complex neurological   conditions sweep >90% of all miRNAs, reflecting multifactorial, far from understood aetiologies.
The miRNA Enrichment Analyzer computes both the odds-ratio size and significance p-value for a given set of miRNAs. The input could be variable in the number of miRNAs provided. We used ''hsa-miR-346 hsa-miR-26a-5p hsa-miR-7-5p hsa-miR-34a-5p'' as  a query with an adj. p-value (i.e., q-value) cutoff of 0.05. This returned a set of 4,121 significantly enriched diseases. As another example, we used a set of three random miRNAs: ''hsa-miR-2355-3p hsa-miR-214-3p hsa-miR-4801'' as a query. This returned 830 significantly enriched diseases ( Analysis of diseasomes for over-represented miRNAs could yield valuable hypotheses for further research, in terms of both pathophysiology and therapeutic options. In this line, we pursued the investigation of diseasomes related to cancer. Since the evidence for miRNA involvement in all stages of cancer including tumorigenesis, progression and metastasis is well-established, we probed miR2Trait for the keyword 'Neoplasm' diseasome, and performed an miRNA occurrence analysis. We also performed occurrence analysis for diseasomes identified using the keywords 'Carcinoma', 'Sarcoma', 'Lymphoma', and 'Leukemia'. Table 6 shows the results of these analyses for 25 miRNAs ranked by the 'Neoplasm' occurrence analysis. MiRNAs documented in the literature as playing key roles in various cancers appeared in the top miRNAs for each keyword (Sandoval-Bórquez et al., 2017;Miyamoto et al., 2016;Zhang et al., 2015;Wang et al., 2018;Cloonan et al., 2008;Xu et al., 2014;Robertson et al., 2014;Fabbri et al., 2016;Garzon et al., 2008;Marcucci et al.,  2008; Calin et al., 2005;Stefano et al., 2010;Jima et al., 2010). Databases devoted to miRNA associations in cancer have been developed (Yang et al., 2017;Sarver et al., 2018;Ahmed et al., 2018), and our work here would augment efforts in this direction. Expression of specific microRNAs has been recognized as a vital diagnostic/prognostic biomarker and therapeutic target in a variety of cancers (Blenkiron et al., 2007;Sempere et al., 2007;Michael et al., 2006;Iorio et al., 2005;Lowery et al., 2009;Lebanony et al., 2009;Ueda, 2010;Fridman et al., 2010;Gutiérrez et al., 2010).

Benchmarking
Dedicated miRNA omics databases have been developed to aid researchers unravel the role of miRNAs in biological processes (for e.g., see Ruepp et al., 2010;Keller et al., 2011). Computational methods have also been advanced for inferring miRNA-disease connections (Chen, Liu & Yan, 2012;Xuan et al., 2013;Qu et al., 2018) manually curated database of miRNA-disease associations based on text-mining. While miR2Trait contains the full mature miRNA information, HMDD does not provide subtype information such as -5p or -3p; 1a or 1b. The miRNA arm designation (5p or 3p) is a major ingredient of disease etiology, and key to naming miRNAs (for e.g., the ones with the top DSWs in Table 3). To compare the databases, it is therefore necessary to 'normalize' the miRNA nomenclature between the databases (by suppressing the subtype information in miR2Trait). This contracted the unique miRNAs in miR2Trait to 1,759 in number. Following this, the databases were compared, and the results are shown in Fig. 4. Even after contraction, miR2Trait appears more comprehensive in the catalog of miRNA-disease associations. miR2Trait-unique miRNAs are 2.5-fold more in number than HMDDv3.2 -unique miRNAs (884 vs 334). The unique miRNA complements suggest directions for research. Some of the diseases missed by miR2Trait could be those that involve deleterious mutations in the miRNA-binding region of the gene itself. Finally, the significant size of the overlap between the two databases increases confidence in the methods advanced here, which are also easily extensible with updates to the underlying primary databases. These observations further suggest the use of miR2Trait to complement HMDD.

Network analysis
The availability of two-way mappings between miRNAs and diseases with various analysis tools paves the way to varied investigations. One direction would be to identify pairs of miRNAs that lead to clusters of similar disease phenotypes. We addressed this question by using the miRNA enrichment analyzer discussed herein to compute the disease enrichment scores (2-tuples of BH-adjusted p-value and odds-ratio) of each miRNA-miRNA pair using Fisher's exact test. This process is iterated over the entire set of miRNA-miRNA pairs (2595*2595) to obtain an adjacency matrix of ∼3 million interactions. This matrix was used to construct the corresponding network made of significant edges weighted by the odds-ratio. The interactions were sorted by -log 10 p-value and the top 10,000 interactions were used to construct the miRNA-miRNA adjacency network, yielding 2,442 nodes (and 10,000 edges). The network was read into Cytoscape (Shannon et al., 2003), and centrality analysis was performed using CytoNCA (Tang et al., 2015). The weighted versions of the following seven centrality measures were used: betweenness, closeness, eigenvector, degree, network, information, and local average centralities. The top 100 miRNAs from each centrality measure were used to obtain the consensus (five or more measures) top miRNAs (Table 7; algorithm details and Python script given here: https://github.com/miR2Trait/miR2Trait/wiki/Top-Nodes-from-Adjacency-Networks).
Edge-percolation centrality of the network was also computed using cytoHubba to rank the miRNAs (Chin et al., 2014). The top 23 miRNAs contained 14 consensus miRNAs and the top 75 contained all but one of the consensus miRNAs. Such a consensus list could identify the critical regulatory miRNAs in human for further investigation. A disease-disease adjacency network was similarly constructed, and both the above adjacency networks and the scripts to perform the network construction and analysis are available from miR2Trait (https://github.com/miR2Trait).
In summary, we have developed miR2Trait, a comprehensive resource for investigating miRNA-disease associations, based on a simple yet surprisingly effective technique for fusing two curated primary databases. One outcome of the approach has been the enhanced Table 7 Consensus identification of key miRNAs in disease development. Seven centrality measures were used to identify the top 100 nodes in the miRNA-miRNA adjacency network and their consensus is determined based on the number of measures that agree upon an miRNA (agreement ≥ 5). The consensus miRNAs appear central even by the Edge-percolation centrality (EPC) measure. interpretability and transparency of the resource constituents, aspects that are lost in more complex methods to arrive at reliable miRNA-disease associations.

CONCLUSIONS
MicroRNAs play key roles in health and the development of disease by virtue of their crucial regulatory activities. To document miRNA-disease associations, we used a novel method integrating two curated and validated sources of miRNA-gene and gene-disease relations, and created vastly expanded databases of miRNA-disease and disease-miRNA associations. A set of tools to interrogate these databases and uncover novel findings has been developed. Taken together, miR2Trait provides a starting point for the investigation of the miRNA-ome in the context of disease. A web-server interface to all the functionalities has been developed and is available at https://sas.sastra.edu/pymir18/. The service allows for flexible querying using regular expression-based pattern matching. It includes a tool for miRNA-enrichment analysis that would enable the identification of statistically over-represented diseases in a user-given list of miRNAs. This would help formulate hypotheses of miRNA-mediated common mechanisms underlying multiple disease dysregulation pathologies. A number of examples have been discussed to illustrate the use-cases for the resource. The resource has also been benchmarked with the best alternative miRNA-disease database (HMDDv3.2). Further, the source code for the entire project is made freely available, and includes a plethora of tools for working with the resource, along with an extensive wiki. All the data developed in the project are freely available as supplementary information, including the miRNA-miRNA and disease-disease adjacency networks.