unitas: the universal tool for annotation of small RNAs

Background Next generation sequencing is a key technique in small RNA biology research that has led to the discovery of functionally different classes of small non-coding RNAs in the past years. However, reliable annotation of the extensive amounts of small non-coding RNA data produced by high-throughput sequencing is time-consuming and requires robust bioinformatics expertise. Moreover, existing tools have a number of shortcomings including a lack of sensitivity under certain conditions, limited number of supported species or detectable sub-classes of small RNAs. Results Here we introduce unitas, an out-of-the-box ready software for complete annotation of small RNA sequence datasets, supporting the wide range of species for which non-coding RNA reference sequences are available in the Ensembl databases (currently more than 800). unitas combines high quality annotation and numerous analysis features in a user-friendly manner. A complete annotation can be started with one simple shell command, making unitas particularly useful for researchers not having access to a bioinformatics facility. Noteworthy, the algorithms implemented in unitas are on par or even outperform comparable existing tools for small RNA annotation that map to publicly available ncRNA databases. Conclusions unitas brings together annotation and analysis features that hitherto required the installation of numerous different bioinformatics tools which can pose a challenge for the non-expert user. With this, unitas overcomes the problem of read normalization. Moreover, the high quality of sequence annotation and analysis, paired with the ease of use, make unitas a valuable tool for researchers in all fields connected to small RNA biology. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-4031-9) contains supplementary material, which is available to authorized users.


Background
Small non-coding (snc-) RNAs are important players in diverse cellular processes, often acting as guide molecules in transcriptional and post-transcriptional gene regulation [1][2][3]. Micro (mi-) RNAs, short interfering (si-) RNAs and Piwi-interacting (pi-) RNAs constitute their most prominent representatives but the number of described sncRNA classes continuously increases. Moreover, degradation products of larger RNA molecules such as rRNA or tRNA fragments further contribute to sequence heterogeneity of sncRNA transcriptomes [4,5]. As diverse as their source molecules are the places where sncRNAs can be found within an organism, ranging from nuclear and cytoplasmic localization inside a cell, to extracellular exosomes being released into diverse body fluids [6,7]. Studying the role of sncRNAs in diverse biological contexts typically involves high-throughput sequencing of sncRNAs derived from total RNA extracts. Subsequent disentangling of the complex composition of such sncRNA transcriptomes is one of the initial steps in sequence data processing and critical for all kinds of downstream analysis. As the use of high throughput sequencing technologies becomes more and more common, while this does not necessarily apply to bioinformatics knowhow, a robust and easy to use solution for reliable annotation of sncRNA sequence datasets is highly desirable.
So far, annotation of sncRNA sequence datasets is demanding for various reasons. On the technical side, existing tools cover particular aspects of sequence annotation (e.g. miRNA annotation) which means that complete annotation including all types of sncRNAs requires installation of a set of programs with different dependencies, some of which are restricted to specific operating systems. Illustrating the complexity of the task, a typical annotation process could include the following steps: i) 3′ adapter recognition with Minion [8] or DNApi [9], ii) adapter trimming with e.g. reaper [8] or cutadapt [10], iii) filtering of low complexity sequences with dustmasker [11] or Repeat-Soaker [12], iv) miRNA annotation with Chimira [13], v) annotation of tRNA-derived fragments with tDRmapper [14] or MINTmap [15], vi) annotation of other ncRNA or mRNA fragments with NCBI BLAST and, if applicable, vii) annotation of phased RNAs with PhaseTank [16] or viii) annotation of putative piRNAs by mapping sncRNA sequences to known piRNA producing loci [17].
However, when having established a local annotation pipeline it is almost impossible to correctly normalize the obtained results in case that a given sequence maps to different types of non-coding RNA. Even with a profound bioinformatics expertise, custom annotation is challenging due to the fact, that reference non-coding RNA sequences are stored at different online databases such as Ensembl database, miRBase, GtRNAdb and SILVA rRNA database. Further, mapping sncRNA sequences to reference sequences, once having gathered a complete collection, and subsequent parsing of the obtained results is bedeviled by, e.g., the presence of isomiRs or post-transcriptionally adenylated or uridylated miRNAs.
In order to facilitate and speed-up sncRNA annotation while making the obtained results comparable across different studies, we have developed unitas, a tool for sncRNA sequence annotation that requires not more than a computer with internet connection. Our aim is to provide a maximally convenient tool that runs with an absolute minimum of prerequisites on any popular operating system, making high-quality sequence annotation available for everyone. By providing complete annotation with one tool we intend to tackle the problem of normalization of multiple mapping sequences. In addition, we designed all annotation and analysis algorithms with the aim to overcome a number of limitations of existing tools, in order to make unitas the means of choice compared to a notional pipeline with state-of-the-art tools connected in series. The unitas source code and precompiled executable files are freely available at https://sourceforge.net/projects/unitas/ and http://www.smallrnagroup.uni-mainz.de/software.html.

General requirements
We provide precompiled standalone executable files of unitas for Linux, Mac and Windows systems. Unitas itself is written in Perl and designed to run with an absolute minimum of prerequisites, relying on Perl core modules, or modules which are part of widely used free Perl distributions such as Archive::Extract and LWP::Simple. Perl is commonly preinstalled on Linux and MacOS systems, where users can run the unitas Perl script without any further requirements. Windows users that prefer to run the Perl script rather than the executable file may have to install a free Perl distribution such as ActivePerl or Strawberry Perl. More detailed information and help is available in the unitas documentation. Since unitas uses publicly available online databases for sncRNA annotation, the program needs an internet connection when run for the first time. Later runs can use previously downloaded data. Input files can be sequence files in FASTA or FASTQ format (with or without 3′ adapter sequence), or alternatively map files in SAM or ELAND3 format. Some data analysis features are only available when using map files as input.

Reference sequence data management
Sequence annotation with unitas relies on publicly available reference sequences from Ensembl [18], miRBase [19], GtRNAdb [20], SILVA rRNA database [21] and piRNA cluster database [17] (Fig. 1). Currently, unitas supports 835 different species or strains for which information on ncRNAs is available at least in one of the Ensembl databases. Prior to annotation, unitas downloads a collection of latest reference sequences which are stored in a separate folder on the local machine for subsequent mapping. As availability of reference sequences is crucial, unitas is designed to address possible challenges that can occur during acquisition of that data. Since database URLs often change with new releases or updates of reference sequences, relying on URLs stored inside the programs source code would require frequent updates of the unitas software itself. Therefore, unitas connects to the Mainz University Server (MUS) and loads the latest list of URLs for downloading the required reference sequence data. However, in the event of these URL not being up to date (URLs are updated monthly), unitas ultimately downloads the required sequence data directly from MUS where the datasets are available via stable URLs and are synchronized regularly (Fig. 1).
By default, downloaded sequence datasets are used for subsequent unitas runs without anew downloading by default. Users can also download reference sequence data collections for any supported species at any time and use the downloaded data for later offline runs. The downloaded sequence data can be updated anytime.

Automated 3′ adapter recognition and trimming
Standard cloning protocols for small RNA library preparation prior to high throughput sequencing involve ligation of adapter molecules to both ends of a RNA molecule. During sequencing, the sequencing primer typically hybridizes to the 3′ end of the 5′ adapter, which means that the resulting sequence read starts with the original small RNA sequence and, given a sufficient read length, ends with the 3′ adapter sequence. However, there exist manifold commercially available 3′ adapters that can be used for library construction. In addition, these adapters can contain different index/barcode sequences. Finally, working groups may even use custom made adapter molecules. Since information on 3′ adapter sequences that were used to generate a small RNA dataset is not always available or at least difficult to find out, we integrated an adapter recognition and trimming module that can be applied using the option '-trim'.
Initially, unitas identifies the most frequently occurring sequence ignoring sequence read positions 1 to 22 (typical length for miRNAs). unitas adjusts the length of the motif, m, to be identified automatically according to the formula m = n -22 (as long as: 6 ≤ m ≤ 12) where n refers to the sequence read length. A first round of adapter trimming is then performed based on the identified motif allowing 2 mismatches for 12 nt motifs, 1 mismatch for motifs ≤11 nt and 0 mismatch for motifs ≤8 nt. If the original motif is not found within a given sequence read, unitas truncates the motif sequentially by one 3′ nt and checks for its occurrence at the very 3′ end of the sequence read until the motif is found or the motif length falls below 6 nt. Following this first round of adapter trimming, unitas checks the positional nucleotide composition of the trimmed sequence reads and will remove further 3′ nucleotide positions in case they exceed a specified nucleotide bias (default = 0.8). It is noteworthy that there may exist scenarios in which unitas will not detect the correct 3′ adapter sequences when using the default settings, particularly in cases with short library read length (≤35 nt) combined with a high amount of reads that share 3′ similarity such as, e.g., tRNA-derived fragments. In these special cases, adapter recognition can be improved by increasing the amount of 5′ positions to be ignored when searching for frequent sequence motifs (option: -trim_ignore_5p [n]).

Filtering low complexity reads
To filter out low complexity reads, unitas employs an advanced version of the duster algorithm from the NGS TOOLBOX [22]. By default, sequence reads with a length fraction f > 0.75 being composed of one repetitive sequence motif (default motif length = 1-5 nt) are rejected. Further, sequences with a length fraction being composed of only two specific nucleotides are also rejected.
miRNA annotation unitas performs miRNA annotation in several consecutive steps. Mature miRNA sequences and miRNA hairpin sequences are downloaded from miRBase and miRNAs are annotated in the following order: i) Canonical miRNAs of the species in question, ii) post-transcriptionally 3′-tailed canonical miRNAs of the species in question, iii) offset miRNAs of the species in question, iv) post-transcriptionally 3′-tailed offset miRNAs of the species in question. Subsequently, this procedure is repeated using miRNA sequence data from all other species included in miR-Base, which is particularly useful for those species with bad miRNA annotation status considering the fact that many miRNA sequences are widely conserved. Since the according output file comprises information on the source species of each matched miRNA gene, unitas users are able to assess the relevance of each match in a case-dependent manner. However, it is important to be aware that this approach will not identify new, unannotated lineagespecific miRNA genes, which can only be identified using de novo prediction tools. Nevertheless, accurate filtering of known miRNA sequences will be helpful for downstream de novo miRNA prediction. By default, the maximum number of allowed non-template 3′ nucleotides is 2 and the maximum number of allowed internal modifications is 1. In order to map sncRNA sequences to miRNA precursors (or to other ncRNA sequences in later annotation steps), unitas employs the mapping tool SeqMap [23] which not requires prior indexing of reference sequences and allows subsequent analysis of nontemplate 3′-nucleotides.

ncRNA/mRNA annotation
Following miRNA annotation, sequences that do not correspond to miRNAs are mapped to a species-specific collection of non-coding RNA and cDNA sequences downloaded from Ensembl database [18], Genomic tRNA database [20] and SILVA rRNA database [21]. Read counts of sequences that match different classes of reference sequences equally well are apportioned according to the simple equation: where c class refers to the read counts for ncRNA/cDNA class, while n is the total number of non-identical input sequences that map to this class and r i and h i refer to read counts and hits to different ncRNA/cDNA classes of input sequence i, respectively. During this process, special attention is payed to sequence reads matching tRNAs since different classes of functional tRNA derived fragments, so-called tRFs, have been described in the recent past [24][25][26][27][28][29][30]. unitas classifies these sequences into 5′ tRFs (5′ to D-loop), 5′ tR-halves (5′ to Anticodonloop), 3′ tRFs (TѱC-loop to 3′), 3′ CCA-tRFs (TѱC-loop to 3'CCA), 3′ tR-halves (Anticodon-loop to 3′), tRF-1 (3′ end of mature tRNA to oligo-T signal), tRNA-leader (sequence upstream of 5′ ends of mature tRNAs) and misc.-tRFs (miscellaneous tRFs). Worth mentioning, unitas relies on available ncRNA annotation and will not perform de novo prediction of ncRNA genes that e.g. encode tRNAs or rRNAs.

piRNA annotation
Considering the fact that piRNAs are highly diverse and virtually not conserved across different species, piRNA annotation based on sequence is challenging. However, many piRNAs originate from few genomic loci, many of which are annotated in the piRNA cluster database [17]. Providing that information on piRNA clusters is available for the species in question, sequences that were not annotated as (fragment of ) any other class of noncoding RNA are mapped to known piRNA producing loci of the respective species. Since almost every nucleotide position within a piRNA precursor transcript can give rise to the 5′ end of a mature piRNA, though there is certainly a bias for 5′-U, this procedure more reliably identifies putative piRNAs compared to the approach of directly mapping sequence reads to annotated piRNAs. Further evidence for the presence of genuine piRNAs can be obtained from sequence read length distribution and positional nucleotide composition which unitas outputs for each class of small RNAs separately. Providing that the input file provided by the user represents a map file, unitas can further screen the map file for the socalled ping-pong signature (using the option -pp), which refers to a bias for 10 nt 5′ overlaps of mapped sequence reads which arises from secondary piRNA biogenesis (ping-pong cycle) and indicates the presence of primary and secondary piRNAs. Screening for a ping-pong signature also includes calculation of a Z-score according to the method described by Zhang and coworkers [31].

phasiRNA annotation
The commonly applied method for identification of phased RNAs bases on calculation of a so-called phase score, P. After consolidation of mapped reads from both strands with an offset of 2 nt for minus strand mapped reads, P results from the following formula: in which n refers to the number of phase cycle positions occupied by at least one small RNA read within an eight-cycle window, and k refers to the total number of reads for all small RNAs with consolidated start coordinates in a given phase within an eight-cycle window [32]. Although the given formula yields higher P values with increasing k or n, the weighting between both factors, and finally the decision of which threshold to choose for P is rather arbitrary. We therefore decided to use a different method, which utilizes the binomial distribution to calculate the probability p to observe a defined number (or more) of phased reads within a given sliding window (default = 1 kb) according to the formula: in which j refers to the observed number of reads with length i in a specified phase, n refers to the total number of reads with length i and q is given by 1/i and refers to the probability of a read to be located in a given phase, assuming that a sequence read can map to any position within the sliding window with equal probability. As is the case for calculation of P, reads mapped to different strands are consolidated prior to calculation of p. If the p value of a locus under examination is below the critical value (default = 0.05, with strict Bonferroni correction based on the number of analyzed sliding windows), unitas applies further thresholds to reduce the rate of false positive predictions. By default, the fraction of phased RNAs has to be ≥50% of all mapped reads within a sliding window. Further, the phased reads must map to ≥5 different loci while not more than 90% of the phased reads must derive from one strand. Critical values for each of the mentioned parameters, including p and sliding window size can be adjusted by the user. Prediction of phasiRNAs requires map files (SAM or ELAND3) as input and can be performed with the option '-phasi [n]' where n refers to the length of the phased RNAs.

Results
We have tested unitas using a number of artificial datasets, real RNA-seq data and combinations of both. A detailed description of the datasets and the methods that were applied to generate them can be found in Additional file 1 (Supplementary Methods).

3′ adapter identification and trimming
The first steps in the analysis of small RNA data usually involve the removal of sequencing adapters from 3′ ends, for which numerous tools exist. However, this task becomes problematic if the adapter sequence is not known, e.g. if a dataset is deposited without the appropriate information. The number of programs for adapter prediction, in contrast to removal, is rather limited. The only published tools for this purpose are DNApi [9] and Minion from the Kraken package [8], which also contains Reaper for adapter trimming.
To test the efficiency of the 3′ adapter identification and trimming function of unitas, we processed ten randomly chosen datasets from the NCBI Sequence Read Archive and put the performance into comparison to the existing software. Both, unitas and DNApi, reliably predicted the correct adapter sequences in all cases, whereas Minion predicted a false adapter with a slightly deviated sequence for one of the ten libraries (SRA accession: SRR5130142), leading to a considerably reduced efficacy in subsequent read trimming by Reaper (Fig. 2a). Altogether, in eight instances unitas removed more adapter sequences than Reaper, hence resulting in higher quantities of trimmed reads that could be mapped perfectly to the corresponding genome (+ 9.7% on average, Additional file 2: Table S1).

Removal of low complexity sequences
The presence of low complexity reads can weaken biological signals within RNA-seq datasets and it has been demonstrated, that the correlation between RNA-seq and microarray gene expression data can be improved with strict filtering of sequences that map to genomic regions with low sequence complexity [12]. Since this method relies on the availability of a RepeatMasker annotation for the genome in question, we implemented a low complexity filter upstream of sequence annotation. We compared our filter with dustmasker, a popular tool for masking low complexity regions in DNA sequences, which is part of the NCBI blast + package [11]. We ran dustmasker on ten adapter-trimmed NGS sequence datasets (see above) with default settings and discarded sequence reads with more than 75% of bases being masked to produce results that are comparable to the results generated by unitas, which by default filters sequences being composed of more than 75% of repetitive sequence motifs.
First, we found that unitas generally filters more sequences ( Fig. 2b top), which taken by itself is certainly not indicative to favor one tool over the other since both algorithms can easily be adjusted to filter more or less sequences by changing the corresponding thresholds. Therefore, we in-depth analyzed the complexity of sequences filtered by both tools as well as those sequences that were filtered either by unitas or by dustmasker. We quantified the complexity of filtered sequences based on sequence entropy [33], Wootton-Federhen-complexity [34] and gzip (developed by Jean-loup Gailly and Mark Adler) compression ratio using the program SeqComplex, which was written by Juan Caballero. As expected, sequences filtered by both tools usually exhibit the lowest degree of entropy. Further, sequences filtered only by unitas exhibit a lower degree of entropy compared to sequences filtered only by dustmasker, in spite of the fact that unitas filters more sequences (Fig. 2b middle). According to Wootton-Federhen-complexity and gzip compression ratio, sequences filtered only by unitas also exhibit lower complexity compared to sequences filtered only by dustmasker and even compared to those sequences filtered by both tools (Additional file 3: Table S2).
We further wanted to check whether these rather theoretical assessments can be translated into a biological dimension. To this end we mapped the filtered sequences to the respective genomes and counted the number of genomic hits per sequence, assuming that the amount of information obtained by mapping a specific sequence decreases with a growing number of genomic hits. In line with the previous results, sequences filtered by both tools show the highest number of genomic hits, thus providing the lowest amount of information (Fig. 2b bottom). With one exception, sequences filtered exclusively by unitas map more frequently to the genome compared to sequences filtered exclusively by dustmasker (Fig. 2b bottom). Together, these results demonstrate that unitas filters sequences with low complexity in a more sensitive and more specific manner.

Annotation of miRNAs
Numerous programs for miRNA annotation in small RNA-seq data have been published in the past with varying focuses [35][36][37]. To compare the performance of unitas on this task we chose Chimira, which is a recent tool with a similar range of functions, primarily aiming at miRNA expression and modification analysis [13]. Chimira is a web-based system, accepting multiple input files in FASTA or FASTQ format at once and supporting 209 genomes so far. Input reads are mapped against Fig. 2 Adapter trimming and removal of low complexity reads. a Efficiency of 3′ adapter identification and trimming by unitas and Kraken tool kit. Asterisk marks the dataset for which adapter prediction by Minion failed. Trimmed reads were mapped to the corresponding genome without mismatches. b Removal and analysis of low complexity sequences filtered by unitas and dustmasker miRBase [19] hairpin sequences using BLASTn with two tolerated mismatches, identifying modifications at 3′ and 5′ ends, as well as internal substitutions (single nucleotide polymorphisms, SNPs) and ADAR-dependent editing events in the process. However, in order for an internal modification to be classified as a SNP, an arbitrary value of 70% is applied as a threshold for the ratio of modification counts to overall counts.
For a controlled comparison, we produced an artificial miRNA dataset based on human hairpin sequences from miRBase (release 21), incorporating internal modifications and 3′ tailings. Of overall 466,810 generated reads, unitas identified 99.9% as miRNAs, while Chimira detected only 85.8% of the original set. Moreover, unitas showed higher precision in assigning read counts to respective miRNA genes of origin than Chimira did, indicated by Pearson correlation coefficients of 0.9514 and 0.9146, respectively (Fig. 3a, Additional file 4: Table S3). Furthermore, unitas detected 3′ tailings and internal modifications more reliably, whereas Chimira barely showed the latter type, probably due to the considerably high (70%) threshold for internal modifications (Fig. 3b). It is noteworthy that the test dataset was designed to include all possible combinations of offset-, tailing-, and mismatch-scenarios without any weighting between canonical and non-canonical sequences. Consequently, the differences between unitas and Chimira annotations are typically less marked for real biological datasets (these observations are principally true for annotation of tRNA fragments as well, see below).
Subsequently, both tools were tested on published RNA-seq data obtained from HeLa cells (SRA accession: SRR029124) [38]. The resulting miRNA expression profiles are highly similar, with a Pearson correlation coefficient of 0.9752 (Fig. 3c). While unitas found 961,840 miRNA reads, Chimira called 960,880, which increased to 961,563 if the option '-split counts from paralogs' was selected. The amount of identified uridylation and adenylation events of 3′ ends were largely similar with a slight advantage on the side of unitas, but other tailing patterns were detected to a much lesser degree by Chimira (Fig. 3d). Analogous to the artificial test data, Chimira did not identify internal modifications to a comparable extent as unitas, apart from some amount of ADAR-dependent edits (A-to-G), which was also the most frequent modification detected by unitas.
Since both, unitas and similar computational approaches for miRNA annotation, rely on miRBase, it should be noted that the quality of database annotations in general vary among species, particularly those which are less well studied. Therefore, we point to existing tools designed specifically for the de novo prediction of miRNAs, such as CAP-miRSeq [35] and Oasis [36].

Annotation of tRNA-derived small RNAs
Currently, there are three major tools specific to the identification of tRNA-derived sncRNAs [39]. The tRFfinder of the tRF2Cancer web server package [40] is restricted to the analysis of human samples and considers sequences with lengths between 14 and 32 nt only. This, however, poses a limitation for the detection of longer tRNA-derived sncRNAs like tRNA-halves. For instance, 56% of tRNA-derived reads are larger than 32 nt in a sncRNA dataset of seminal exosomes (SRA accession: SRR1200712) [41].
The second tool, tDRmapper [14], is a command line based set of Perl scripts, which identifies tRNAderived RNAs (tDRs) from 14 to 40 nt in human and murine samples. Other species can be added manually according to a provided guide and with the help of Perl scripts that depend on bedtools. Notably, there are some features to the algorithm of tDRmapper that may hamper direct comparison with unitas. First, sequences with 100 reads or less are discarded. Subsequently, so-called primary tDRs are determined by the location, at which more than 50% of all reads mapping to a source tRNA are aligned. For example, the 5′-tRF type is assigned if more than 50% map at the 5′-end of the source tRNA. Moreover, tRFs are defined by length as being smaller than 28 nt and tRNA-halves as 28 nt or larger, regardless of alignment position. Further, a primary tDR is only specified if more than 66% of all reads mapping to the source tRNA map to any position of the considered tDR. Lastly, tDRs are quantified by 'relative abundance' , which is calculated by multiplying the percentage of tDR reads that map to its source tRNA and the proportion of reads on the area with the highest read coverage across the source-tRNA. Importantly, the resulting counts are not normalized, meaning there is no fractional assignment for multimapping reads. As the authors themselves point out, this approach may overestimate the relative abundance of a primary tDR.
Finally, another command line based tool called MIN-Tmap was recently developed for the profiling of tRNA fragments from human small RNA-seq data, emphasizing the profiling of both nuclear and mitochondrial tRNA fragments [15]. However, tRFs generated from trailer sequences (tRF-1) and 5′ leader-tRFs are excluded from analysis.
To test the efficiency and accuracy of unitas in the detection of tRNA-derived small RNAs, we produced an artificial dataset based on human tRNAs from the genomic tRNA database, incorporating one mismatch in 50% of sequences. Running with default settings, unitas assigned 92% of reads to tRNAs, which increased to 97% if miRNA detection was skipped (option '-no_miR'). For running tDRmapper on the test data, we disabled the rejection of sequences with less than 101 reads, since this would eliminate the entire input. Both, tDRmapper and MINTmap, deviated considerably from the original dataset in read shares assigned to different tRNAderived sRNA classes, in contrast to unitas (Fig. 4a). A direct comparison of read counts, however, was not possible due to the previously described quantification method of tDRmapper, which calculates so-called relative abundance. Further, read shares were most precisely assigned to source tRNAs by unitas, indicated by the highest Pearson correlation coefficient (0.9896) among the tested tools (Fig. 4b).
Additionally, unitas, tDRmapper and MINTmap were tested on RNA-seq data of exosomes from seminal fluid (SRR1200712) [41], using default settings. The differences in results between unitas (Fig. 4c) and tDRmapper (Fig. 4d) are largely due to the lack of fractional assignment for multi-mapping reads in the quantification approach of tDRmapper. Analysis by MINTmap yielded results that are largely similar to the output of unitas, but with overall slightly lower read counts and changed order of the source tRNAs with descending read coverage (Fig. 4e). Details on the test dataset and the annotation of tRFs from artificial and biological data with different tools are available in Additional file 5: Table S4 (A-G).

Annotation of phasiRNAs
We tested and compared phasiRNA annotation performance of unitas and PhaseTank, which is currently the only published tool for prediction of phased RNAs [16]. We used artificial datasets with known amounts of phased RNAs (Additional file 6: Table S5) as well as biological small RNA data from panicles of the two rice strains 93-11 and Nipponbare [42] to predict phased RNAs with unitas and PhaseTank using default parameters. All test datasets were collapsed to non-identical sequences, retaining information on read counts for each sequence in the FASTA header. Subsequently, the datasets were formatted to satisfy PhaseTank requirements (special format of FASTA headers) and used as input for PhaseTank (v.1.0) using default settings to search for 21 nt phased RNAs. Subsequently we searched for 24 nt phased RNAs with PhaseTank using the option '-size 24'. To generate input files for unitas, we mapped the test datasets to the human genome (GRCh38) with bowtie1, bowtie2 and STAR using settings that correspond to recommended and widely used settings for mapping of small RNAs with these tools and considering only perfect matches to be in line with PhaseTank default settings. The resulting SAM alignment files were used as input for unitas which was started twice with the option '-phasi 21' or '-phasi 24' , respectively, to search for 21 nt and 24 nt phased RNAs.
Using artificial datasets, we found that both tools perform equally well with those datasets that comprise exclusively phased RNAs or have rather low amounts of non-phased RNAs (Fig. 5a). However, PhaseTank drastically loses its sensitivity with an increasing amount of non-phased sequences within a dataset, while the sensitivity of unitas remains unaffected (Fig. 5a). When we assigned a read count value of 10 to each artificial phased RNA sequence, PhaseTank performs approximately as well as unitas, illustrating that sensitivity of PhaseTank not only depends on the number of phased sequences, but also on the number of reads per phased sequence. Consequently, PhaseTank will particularly miss those phasiRNA-producing loci that have a low sequence read coverage ( Fig. 5a and b). For neither tool we observed a namable issue with false positive predicted phasiRNAs when running both programs with datasets comprising no phased RNAs (Additional file 7: Table S6).
When searching for 21 nt phased RNAs in biological datasets we noted that unitas identifies slightly more pha-siRNAs compared to PhaseTank, while the number of identified clusters was identical ( Fig. 5c and d). Overall, the congruency between unitas and PhaseTank results is very high (Fig. 5d). However, when searching for 24 nt phased RNAs in the same datasets we observed remarkable differences with unitas identifying both more pha-siRNA sequences and more phasiRNA clusters (Fig. 5e and f). Considering that PhaseTank is less sensitive when the fraction of phased RNAs within a given dataset is low, these results are in line with the fact that the abundance of 24 nt phasiRNAs in rice panicles is several times lower compared to 21 nt phasiRNAs [42]. Accordingly, pha-siRNA clusters identified only by unitas have relatively low read coverage, while phasiRNA clusters identified only by PhaseTank were rejected by unitas because of a high strand bias (>95%) of mapped sequence reads.

Complete annotation of NGS datasets
To emphasize the broad range of possible applications of unitas on diverse sncRNA-seq datasets, we analyzed three exemplary libraries, which differ in origin and structure (Fig. 6). The sncRNA annotation output of unitas provides a general overview of the small RNA composition, as shown for a dataset produced from HeLa cancer line cells (SRA accession: SRR029124, Fig. 6a) [38]. In this library, the largest fraction of sncRNAs is constituted by miRNAs, which are of growing interest for cancer studies and clinical trials using miRNA profiling for patient diagnosis [43]. Apart from expression profiles and 3′ tailings, unitas offers a convenient description of miRNA modifications per position (Fig. 6b). For target recognition, complementarity of the seed region of a miRNA (positions 2-7) to its target is critical for downstream silencing efficacy. Beyond the seed region, a strong sequence conservation can also be observed at position 8 [44], and finally, miRNA sequences frequently start with a uridine which was found to promote miRNA loading on Argonaute proteins [45]. According to these functional aspects, the unitas output shows that in HeLa cells the first eight positions from the Fig. 4 Detection of tRNA-derived small RNAs by unitas, tDRmapper and MINTmap. a Read shares assigned to different tRNA-derived RNA classes of artificial test data by unitas and other tools compared to test set profile. Misc-tRFs (miscellaneous tRFs) are equivalent to internal tRFs. On test data, the elimination of sequences with less than 101 reads by tDRmapper was disabled. b Correlation of read proportions allocated to source tRNAs by unitas and other tools with original test data read shares per tRNA. c Analysis of RNA-seq data from seminal exosomes by unitas, d tDRmapper and e MINTmap 5′ end are rarely modified. Internal modifications occur predominantly at distinct positions downstream of the seed region, with A-to-G (A-to-I), known as ADAR edits [46], and G-to-T being the most common, followed by modifications leading to uridine or guanine incorporation (Fig. 6b).
Next, we analyzed a library generated from exosomes of human seminal fluid (SRA accession: SRR1200712) [41]. Notably, this dataset is particularly abundant in tRNAderived reads, while containing less miRNAs and other annotated reads (Fig. 6c). For the analysis of such sequences, unitas provides a summary of read counts for each tRNA gene, apportioned to classes of tRNA-derived sncRNAs. The majority of tRNA-derived reads in this library is represented by 5′ halves, originating mainly from the tRNAs Glu-CTC and Gly-GCC (Fig. 6d). It has been shown that fragments specifically derived from 5′ ends of tRNA-Gly-GCC in mouse epididymosomes repress genes by regulating the endogenous retroelement MERVL, whereas an RNA interference against the middle or 3′ end of tRNA-Gly-GCC and other tRNAs had no effect on these MERVL-dependent genes [27]. Generally, we found 5′ halves and 5′ tRFs to be the dominant classes, being especially associated with tRNAs exhibiting high read  (Fig. 6d). With decreasing coverage, however, tRNAs tend to give rise primarily to internal fragments and to a lesser extent to 3′ halves and 3′ tRFs. This suggests that internal fragments, which we describe as miscellaneous tRFs (misc-tRFs), might be rather characterized as random debris of degraded tRNAs than a class of tRNAderived small RNAs in its own right.
Lastly, we chose a sRNA dataset from macaque testis for analysis (SRA accession: SRR553581) [47]. As expected, the vast majority of testis-expressed sRNAs did not match any class of known non-coding RNA (Fig. 6e). In contrast to the former libraries, unitas found that the bulk of non-annotated reads (67.6%) maps to known piRNA producing loci. Besides the typical length profile (Fig. 6f ), these piRNA candidate sequences show a strong bias for uridine at 5′ ends (84.5%) which is typical for primary piRNAs being processed by the endonuclease Zucchini (PLD-6) [48,49]. Moreover, unitas attests a significant ping-pong signature (Z-score = 6.96), namely a high rate of 10 nt 5′ overlaps of sense and antisense reads, which is a hallmark of secondary piRNA biogenesis via the ping-pong cycle [50][51][52].

Discussion
Small RNA biology has become a major field in molecular biology research. In 2016, sequence data from 7271 Illumina sequencing runs with miRNA sequencing strategy, comprising more than 82 billion sequence reads, was uploaded to NCBI's Sequence Read Archive. Assuming total costs of 15 USD per 1 Million clean sequence reads, the total miRNA sequencing value for 2016 amounts to 1.2 million USD. Noteworthy, these numbers only refer to published sequence data and certainly only mirror the tip of the iceberg. Nevertheless, in light of these numbers, even a seemingly trivial improvement of adapter recognition and trimming by unitas yields a surplus value of more than 100,000 USD per year, considering the amount of additionally mapped sequence reads. However, although this is a benefit of unitas that can be descriptively quantified, it clearly reflects only a minor aspect of the overall value of unitas. Within the field of small RNA biology, miRNAs receive widespread attention owing to their pervasive contribution to gene regulatory processes [53]. However, it is not only mere miRNA expression, but also their post-transcriptional modification that vitally affects miRNA activity. Uridylation of miRNAs is thought to play a role in miRNA stability and possibly marks small RNAs for degradation [54,55]. Adenylation has recently been linked to clearance of maternal miRNAs in Drosophila eggs [56]. Further, internal modification events of miRNAs (or their precursors) can have wide implications for miRNA biogenesis and function [57][58][59][60]. It is therefore of immense importance, to accurately identify post-transcriptional editing events to gain a deeper understanding of miRNA-dependent regulatory processes (Additional file 8). As we have shown, unitas is more sensitive in detecting 3′ tailing events and much more sensitive in detecting internal modifications compared to existing tools. Importantly, unitas not only focuses on wellknown adenylation, uridylation and ADAR-dependent A-to-I editing, but also allows to detect all other types of modification events which can greatly facilitate the detection of yet unknown enzymatic editing activity in the future.
tRNA-derived small RNAs have been regarded as simple and non-functional degradation products for a long time. However, strong evidence for diverse functional roles in gene regulation, cancer biology, apoptosis and protein synthesis is mounting [24][25][26][27][28][29][30]. Since tRNAs and their precursor transcripts can be processed into functionally distinct types of tRFs, accurate attribution of tRNA-derived small RNAs to the different types of tRFs is important to make functional interpretations. In this regard, unitas shows higher precision than existing tools, while being also more sensitive in overall tRF detection. Notably, the recently published tool for tRF annotation MINTmap [15], which is more sensitive and accurate than the older tDRmapper [14] cannot identify some of the yet rather enigmatic tRFs which have their origin beyond the mature tRNA molecule, namely tRF-1 and 5′ leader-tRFs. Here, unitas can enable researches to elucidate possible functions of these cryptic RNAs by first of all spotting them in sncRNA transcriptomes.
Initially described as trans-acting siRNAs [61], plant specific phasiRNAs are well-characterized actors in posttranscriptional gene silencing [62]. In most cases, pha-siRNAs are 21 nt in length, but different pathways that produce 22 nt and 24 nt phasiRNAs have been described as well. Since the latter are by far less abundant, their detection in small RNA transcriptomes is challenging and the current approach underestimates the number of, e.g., phased 24 nt RNAs in small RNA datasets from rice panicles. In contrast, sensitivity of unitas depends far less on the amount of background reads, making it more suitable for the detection of particularly low abundance phasiRNAs and their source loci.

Conclusions
So far, accurate annotation of sncRNA required a large set of different software tools with a number of additional prerequisites. While the installation of these bioinformatics tools can pose a challenge for the non-expert user, unitas brings together all annotation and analysis features with an absolute minimum of further requirements. A complete annotation run is finished within a few minutes and can be started with one simple shell command, making its usage very convenient. By facilitating sncRNA annotation and providing in depth analyses that previously was not accessible for the non-expert user, we believe that unitas is a valuable tool for researchers in all fields connected to small RNA biology.