miRquant 2.0: an Expanded Tool for Accurate Annotation and Quantification of MicroRNAs and their isomiRs from Small RNA-Sequencing Data

Summary Small non-coding RNAs, in particular microRNAs, are critical for normal physiology and are candidate biomarkers, regulators, and therapeutic targets for a wide variety of diseases. There is an ever-growing interest in the comprehensive and accurate annotation of microRNAs across diverse cell types, conditions, species, and disease states. Highthroughput sequencing technology has emerged as the method of choice for profiling microRNAs. Specialized bioinformatic strategies are required to mine as much meaningful information as possible from the sequencing data to provide a comprehensive view of the microRNA landscape. Here we present miRquant 2.0, an expanded bioinformatics tool for accurate annotation and quantification of microRNAs and their isoforms (termed isomiRs) from small RNA-sequencing data. We anticipate that miRquant 2.0 will be useful for researchers interested not only in quantifying known microRNAs but also mining the rich well of additional information embedded in small RNA-sequencing data.


Introduction
Non-coding RNAs (ncRNAs) have long been recognized as important regulators of cellular function [1][2][3][4]. For example, ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs) are essential for protein synthesis in all living organisms. However, recent advances in nextgeneration sequencing technology have helped unveil the prevalence and abundance of a diverse array of intracellular and extracellular ncRNAs [5], [6]. There are roughly three main classes of ncRNAs, which are somewhat arbitrarily differentiated by length: small (<40 nucleotides), intermediate (between 40 and 200 nucleotides), and long (>200 nucleotides). Increasingly well-studied examples of ncRNA types within these classes are microRNAs (miRNAs), cajal body specific RNAs (caRNAs), and long, intergenic non-coding RNAs (lincRNAs), respectively. The small RNA category [7] especially has experienced an explosion in the number of different types of ncRNAs, including, though not limited to, piwiinteracting small RNAs (piRNAs), tRNA-derived small RNAs (tDRs), Y-RNA derived small RNAs (yDRs), endogenous small interfering RNAs (endo-siRNAs), trans-acting small interfering RNAs (tasiRNAs), QDE-2 interacting small RNAs (qiRNAs), and terminiassociated short RNAs (TASRs). The level of experimental support for the molecular and biological functions of these recently discovered small RNAs varies according to the type. For example, the roles of piRNAs [8] in germline development are much better understood than the proposed functions of yDRs [9] (also referred to as Y-RNA fragments) in tumor proliferation and metastasis. The most thoroughly studied class of small RNAs to date is miRNAs.
MicroRNAs are small RNAs (~18-24 nucleotides in length) that regulate intracellular gene expression across all eukaryotes [10]. Their roles in gene regulation are important in a wide variety of fundamental biological processes such as development, growth, and metabolism, as well as in many organ-specific functions. MicroRNAs often serve as fine tuners of gene expression to maintain intracellular and systemic homeostasis [11][12][13]. Defects in miRNAmediated gene regulation have been implicated in many disorders, including Mendelian conditions such as deafness [14], a plethora of different complex diseases [15] including cancer [16], [17], and host-pathogen interactions [18][19][20]. Recently, miRNAs have emerged as candidate circulating biomarkers [21] and potential therapeutic targets [22], [23] for several diseases, suggesting that the study of miRNAs will not only advance understanding of basic biology and evolution, but also may promote the development of novel clinical strategies for diagnosis, prognosis, and treatment of complex diseases.
The biological and biomedical relevance of miRNAs has led to an ever-growing interest in high-throughput, accurate annotation and quantification of miRNAs in functionally distinct organ systems and cell types across many different species. The first major revolution in this research area occurred in the early 2000s with the use of microarray technology to detect miRNAs [24], [25]. Within the past ten years though, next-generation sequencing technology has replaced microarrays as the standard approach for high-throughput miRNA detection. Small RNA sequencing (smRNA-seq) approaches ameliorate many of the limitations of microarrays; however, they are not without their own challenges [26]. Some of the issues lie within the experimental realm; for example, mitigating sample contamination [27] or preparing small RNA libraries in a manner that faithfully captures all of the small RNAs present in the sample without bias [26]. Other issues pertain to the bioinformatic strategies for extracting accurate information out of the raw smRNA-seq data.
In 2013 we developed a bioinformatic strategy [28], which we now call miRquant, for accurate detection of miRNAs from smRNA-seq data, with an emphasis on separately annotating and quantifying functionally-distinct isoforms of canonical miRNAs, termed isomiRs. We have since expanded the tool significantly to miRquant 2.0. In addition to the quantification of miRNA and isomiR expression levels, we now provide additional detailed information on the quality of the sequencing data, genome mapping statistics, abundance of other types of small RNAs such as tDRs [29] and yDRs [9], prevalence of post-transcriptional modifications such as A-to-I edits and 3' non-templated nucleotide additions, and correlation in miRNA profiles across multiple samples. Furthermore, miRquant 2.0 is now equipped to handle smRNA-seq data from human, mouse, and rat, and is currently being expanded to fruit fly. We make publicly available both the tool and a detailed tutorial for users to learn how to run the program and interpret the results. Below we provide an overview of the miRquant workflow and the diverse capabilities of the tool.

3' Adapter trimming
In smRNA-seq, the number of sequenced bases (generally ~50 on the HiSeq 2000 and more recent platforms) exceeds the length of the target RNA fragment. Therefore, part of the 3' adapter is sequenced and requires removal prior to genome alignment. To accomplish this, miRquant 2.0 utilizes the adapter trimming program Cutadapt (version 1.12) [30]. Specifically, miRquant 2.0 requires a minimum of ten nucleotides overlap between the adapter sequence and the 3'-end of the sequencing read with less than 10% errors in the alignment. Reads that do not meet this criterion, or are <14 nucleotides in length following trimming, are discarded. Conversely, reads that meet this criterion are subject to trimming and retained for further analysis.

Read alignment -tier 1
Alignment of reads to the reference genome is accomplished by a two-tier process. In the first-tier, Bowtie (version 1.1.0) [31] is used to align only those reads that match perfectly to the genome. If a read aligns equally well to multiple genomic loci, it is proportionally assigned to each of the loci. Following this mapping step, miRquant 2.0 assembles contigs or 'genomic windows' that contain >=1 perfectly mapped reads. If genomic windows are within 65 nucleotides of each other, these windows are merged into larger 'super genomic windows.' This step is intended to ensure that all reads potentially related to a single pre-miRNA (roughly 65 to 110 nucleotides in length) are contained within the same window.

Read alignment -tier 2
In the second-tier, the SHRiMP aligner (version 2.2.2) [32] is used to align reads, with mismatches allowed, only to the genomic windows generated in tier 1. The types, locations, and number of mismatches allowed are defined in accordance with known patterns of posttranscriptional editing of miRNAs, most notably, A-to-I editing and 3' non-templated nucleotide addition. For example, the number of mismatches allowed at the 3'-end of a miRNA-related read ranges from 0 to 3, depending on the length of the read (with read length as r; 0 if 14 <= r <= 15, 1 else if 16 <= r <= 19, 2 else if 20 <= r <= 23, 3 if r > 23). Similar to the Bowtie step, reads aligning equally well to multiple locations are proportionally assigned to each of those loci. The two-tier mapping process is intended to mitigate two common issues with bioinformatics analyses of smRNA-seq data: (1) Computational tractability -Running the SHRiMP aligner on the entire genome/transcriptome would be computationally extremely expensive, particularly for short reads such as those from smRNA-seq (2) Mapping ambiguity -Mapping short reads, particularly those for which several genomic mismatches are allowed in order to accommodate post-transcriptional editing of miRNAs, can lead to highly non-specific, ambiguous mapping. That is, some reads may map to numerous genomic locations because they happen to harbor similar enough stretches of sequence just by chance. Some of these loci may have little to do with miRNAs These two concerns are effectively ameliorated by restricting the mappable space for SHRiMP to only the genomic windows generated in tier 1.

Annotation of microRNAs and other small RNAs
All reads aligned by the two-tier process outlined above are evaluated for overlaps with annotation libraries for miRNAs, tRNAs, and Y-RNAs, which are derived from miRBase [33], GtRNAdb [34], and GENCODE [35], respectively. For miRNAs, each read overlapping a known miRNA locus is compared to the start position of the reference miRNA listed in miRBase. If the start position of the read is identical to that of the reference miRNA, the read contributes one count toward the reference miRNA. If the start position of the read differs from the reference start location, the read contributes one count to an isomiR that is named to represent that offset (e.g., if the read start position is 2 nucleotides upstream of the reported start position for miR-25-3p, the read would be called miR-25-3p_+_2).
For tRNAs, each read overlapping a known tRNA locus in GtRNAdb is named to include the chromosome of origin, tRNA number, amino acid with which it associates, location within the tRNA to which the read maps, total length of the tRNA, and strand (chr#:tRNA#:amino acid and codon:start position in tRNA:end position in tRNA:tRNA length:strand). More detailed annotation and quantification of tDRs is available through other programs, such as tDRmapper [29] or MINTbase [36]. The same naming scheme is used for yDRs (Y-RNA loci are defined according to GENCODE [35] annotations). Reads for which no overlap occurs with miRNA, tRNA, or Y-RNA loci are named according to the genomic location of the start position and the strand (chr#:location#:strand).

miRquant 2.0 output
A hallmark feature of miRquant 2.0 is the extensive information provided on miRNAs, isomiRs, and other small RNAs detected in the smRNA-seq data (Figure 1, 2). This information is reported in separate files and these are summarized below.

Mapping statistics
For each sample, miRquant 2.0 reports tabular data that summarizes the quality of the sequencing data and the overall mapping statistics (Figure 2A). From these tables, users can determine the number of total reads generated, as well as the number and percentage of total reads that were successfully trimmed, trimmed reads that were mapped by Bowtie and SHRiMP, mapped reads that pertain to miRNAs, tRNAs, or Y-RNAs. These data inform the user of potential quality issues with the sample (RNA degradation, adapter dimerization during library prep, poor sequencing) as well as the relative abundance of different types of small RNAs represented in the data. Only miRNAs above a specified expression threshold are included in the correlation analysis. An example is provided in Figure 2B in which miRNA profiles are being compared across two different conditions with three replicates each. The color in each of the squares in the heat map corresponds to the extent of miRNA expression correlation between two particular samples. In this example, the heat map and dendrogram show that the replicates are grouped appropriately by condition, and that the two conditions cluster separately.

Read length distributions
For each sample, miRquant 2.0 reports the length distributions of all trimmed reads ( Figure  2C). A peak at ~18-24 nucleotides likely represents mature miRNAs, a subclass of tDRs known as tRNA fragments (tRFs), and possibly yDRs. A peak at ~30-33 nucleotides likely represents a subclass of tDRs referred to as tRNA halves (tRHs), and possibly yDRs. An elevated signal at other read sizes could indicate other known or novel small RNAs, but could also be suggestive of degradation.

Normalized expression
Read count normalization is performed in two ways. For a given miRNA/isomiR/tDR/yDR in a sample, the number of corresponding reads is divided by the total number of mapped reads and multiplied by 1 million. This is referred to as the reads per million mapped (RPMM). For miRNAs/isomiRs, an alternative method of normalization is employed. Specifically, the number of corresponding reads is divided by the total number of reads mapped to all miRNA/isomiR loci and multiplied by 1 million (Equation 1). This is referred to as reads per million mapped to miRNAs (RPMMM). The latter approach is generally favored for miRNAs. Notably, the types, locations, and frequency of potential post-transcriptional editing events are reported for every annotated miRNA/isomiR/tDR/yDR in the sample. Edits are separated into two groups: internal edits and 3' non-templated additions. Reads per million mapped to miRNAs, where c i is the number of reads mapped to a specific miRNA or isomiR (precise 5'-start position) and c tot is the total number of reads mapped to all miRNAs and isomiRs.

Programs required to run miRquant 2.0
In order to run miRquant 2.0 successfully on smRNA-seq data, users must have installed the following programs:

Discussion
We have provided here a description of miRquant 2.0 -a tool for comprehensive annotation and quantification of miRNAs, isomiRs, and other small RNAs from smRNA-seq data. miRquant 2.0 also provides detailed information on the sequencing data quality, genome mapping statistics, relative abundance of different classes of small RNAs, prevalence of posttranscriptional modifications, and similarity of miRNA profiles across samples. We have successfully used a previous version of miRquant in several published studies involving both mouse and human smRNA-seq datasets [37][38][39][40]. We anticipate that version 2.0 will be even more useful for researchers interested not only in quantifying miRNAs, but also digging deeper into the rich well of additional information embedded in smRNA-seq data.