Bioinformatics resources for deciphering the biogenesis and action pathways of plant small RNAs

The next-generation sequencing (NGS) technology has revolutionized our previous understanding of the plant genomes, relying on its innate advantages, such as high throughput and deep sequencing depth. In addition to the protein-coding gene loci, massive transcription signals have been detected within intergenic or intragenic regions. Most of these signals belong to non-coding ones, considering their weak protein-coding potential. Generally, these transcripts could be divided into long non-coding RNAs and small non-coding RNAs (sRNAs) based on their sequence length. In addition to the well-known microRNAs (miRNAs), many plant endogenous sRNAs were collectively referred to as small interfering RNAs. However, an increasing number of unclassified sRNA species are being discovered by NGS. The high heterogeneity of these novel sRNAs greatly hampered the mechanistic studies, especially on the clear description of their biogenesis and action pathways. Fortunately, public databases, bioinformatics softwares and NGS datasets are increasingly available for plant sRNA research. Here, by summarizing these valuable resources, we proposed a general workflow to decipher the RDR (RNA-dependent RNA polymerase)- and DCL (Dicer-like)-dependent biogenesis pathways, and the Argonaute-mediated action modes (such as target cleavages and chromatin modifications) for specific sRNA species in plants. Taken together, we hope that by summarizing a list of the public resources, this work will facilitate the plant biologists to perform classification and functional characterization of the interesting sRNA species.


Introduction
Ten years after the accomplishment of the first plant genome project (Mozo et al., 1999), the advent of the next-generation sequencing (NGS) technology has uncovered an unprecedentedly intricate scene of genomewide transcription in plants (Varshney et al., 2009;Kelly and Leitch, 2011;Jain, 2012). In addition to the already annotated protein-coding genes, the fact is emerging that millions of the non-coding RNAs (ncRNAs) are transcribed from the intergenic or the intragenic regions (Jiang, 2015;Wendel et al., 2016). These non-coding transcripts could be roughly classified into the long non-coding RNAs (lncRNAs; > 200 nt) (Chekanova, 2015) and the small non-coding RNAs (sRNAs; < 200 nt) (Chen, 2009). Owing to the relatively short read length of NGS, the sRNAs were easier to be cloned at the beginning of the plant ncRNA research. Expectedly, the explosive sRNA world immediately became a hot research topic for the plant biologists. Notably, some of these small transcription "noises", which were once regarded as the degraded remnants, have been demonstrated to be generated through specific pathways and play essential roles in plant development (Chen, 2009).
One of the well studied sRNA species is microRNA (miRNA) (Jones-Rhoades et al., 2006;Voinnet, 2009). In plants, the transcription of most miRNA genes is driven by RNA polymerase II (Pol II), resulting in the production of the 5′ capped and 3′ poly(A) (polyadenylation)-tailed transcripts called primary microRNAs (pri-miRNAs). Relying on the highly complementary base pairing, stable hairpin-like structures could form within the specific regions of the pri-miRNAs. These local hairpin structures are the featured substrates of Dicer-like 1 (DCL1). Followed by the DCL1-mediated two-step cropping in the nucleus, the pri-miRNAs are sequentially processed into the secondary precursors named as precursor microRNAs (pre-miRNAs), and then into the miRNA/miRNA* duplexes. After exporting to the cytoplasm, the mature miR-NAs are selectively loaded into specific Argonaute (AGO)-centered protein complexes. In most cases, the miRNAs will be recruited by AGO1, although some exceptional cases have been reported for "miR172-AGO10" and "miR165/166-AGO10" in Arabidopsis (Arabidopsis thaliana), "miR168-AGO18" in rice (Oryza sativa), and "miR390-AGO7" in both plants (Fang and Qi, 2016). The AGO complex is guided by the recruited miRNA to bind onto a specific target transcript containing a region highly complementary to the miRNA. There are two major action modes of miRNA-guided gene silencing in plants. One is target cleavage which is considered as the most common mode (Voinnet, 2009), and the other is translational repression which has been observed in several studies (Chen, 2004;Gandikota et al., 2007;Li et al., 2013). Another class of plant sRNAs is collectively referred to as the small interfering RNAs (siRNAs), which could be further classified into heterochromatic small interfering RNAs (hc-siRNAs), trans-acting small interfering RNAs (ta-siRNAs), natural antisense transcript-derived small interfering RNAs (nat-siRNAs), and phased small interfering RNAs (phasiRNAs). Specifically, hc-siRNAs are encoded within the heterochromatic loci transcribed by RNA Pol IV. The single-stranded Pol IV transcripts are converted to double-stranded precursors through the RDR2 (RNA-dependent RNA polymerase 2)-dependent pathway. Then, the precursors are processed by DCL3 for the production of the 24-nt hc-siRNAs. The hc-siRNAs are incorporated into AGO4 to perform site-specific chromatin modifications (Xie et al., 2004;Qi et al., 2005;Henderson et al., 2006). In Arabidopsis, there are four TAS gene loci encoding ta-siRNAs. MiRNA-mediated cleavages (miR173 for TAS1 and TAS2, miR390 for TAS3, and miR828 for TAS4) of the primary TAS transcripts are the prerequisite for initiating ta-siRNA production. Through the RDR6dependent pathway, the cleaved TAS transcripts are converted to double-stranded precursors which will be subject to ta-siRNA processing by DCL4. Finally, most of the ta-siRNAs are loaded into AGO1 silencing complexes to guide target cleavages (Peragine et al., 2004;Vazquez et al., 2004;Allen et al., 2005;Gasciolli et al., 2005;Xie et al., 2005;Yoshikawa et al., 2005;Rajagopalan et al., 2006). For the nat-siRNAs, genome-wide studies in both Arabidopsis and rice showed that they were originated from the overlapping regions of the natural antisense transcript (NAT) pairs through the DCL1-dependent pathway or through the Pol IV-, RDR2-, and DCL3-dependent pathway . Moreover, in Borsani et al.'s study (2005), 21and 24-nt nat-siRNAs were demonstrated to be produced from a cis-NAT pair through the RDR6-and DCL1/2dependent pathway (Borsani et al., 2005). A major class of phasiRNAs was identified in the reproductive tissues of Gramineae species, such as rice (Johnson et al., 2009) and maize (Zea mays) (Zhai et al., 2015). Notably, in rice, the processing of 21-nt phasiRNAs was highly dependent on DCL4, while the processing of 24-nt ones required the activity of DCL3b (Song et al., 2012). Komiya and his colleagues (2014) reported that a portion of the DCL4dependent, 21-nt phasiRNAs preferentially associated with MEL1 (the ortholog of Arabidopsis AGO5), and these 5′ C-started phasiRNAs originated from hundreds of lincRNA (long intergenic non-coding RNA) loci. In addition to the above mentioned sRNAs, some non-canonical sRNA species have been discovered, such as the AGO4-associated long siRNAs of 25-nt in length (Zilberman et al., 2003), the DCL3-dependent, AGO4-associated, 24-nt miRNAs called long miRNAs (Wu et al., 2010), the Pol IV-and DCL2/3/4dependent, AGO2-associated double-strand break-induced sRNAs (Wei et al., 2012), the DCL-independent, AGO4associated, 20-to 60-nt siRNAs (Ye et al., 2016), and the intron-derived, DCL2/3/4-dependent siRNAs (Chen et al., 2011). For a clear summarization, Fig. 1a provides a brief framework of the biogenesis and action pathways of the plant sRNAs. However, all of the recent discoveries just witnessed the emergence of the unexpectedly huge and complicated RNA world. It is still far from thorough understanding of the biogenesis and action pathways of the enormous sRNA population.
Fortunately, the valuable public resources have become increasingly available for the mechanistic studies on the plant sRNAs. Here, by taking the two model plants Arabidopsis and rice as an example, we provided a list of the currently available resources to the plant biologists, including the public databases, the bioinformatics softwares and the NGS datasets. Notably, most of the bioinformatics softwares listed here are online tools with user-friendly interface. By proposing a workflow for analyzing the biogenesis and action pathways of the plant sRNAs, we made a clear description for the specific applications of different sequencing datasets and bioinformatics toolkits at each analytical step. Finally, we anticipate that this workflow along with the list could advance the efficiency of data analysis and interpretation, thus facilitating the experimental design for the functional studies on the plant sRNAs. Below, we will introduce the public resources step by step according to the workflow shown in Fig. 1b.

Genomic features and transcription
By using the BLAST tool provided by the plant genomic databases, such as TAIR (the Arabidopsis information resource) (Huala et al., 2001) for Arabidopsis and RGAP (rice genome annotation project) (Kawahara et al., 2013) or RAP-DB (the rice annotation project database) (Ohyanagi et al., 2006) for rice (Table 1), the genomic positions of the sRNA-coding loci could be obtained, facilitating the researchers to tell whether the sRNA loci are intergenic or intragenic. For mapping huge sRNA sequencing (sRNA-seq) datasets onto a plant genome, Bowtie should be selected as one of the powerful tools (Langmead et al., 2009). miRBase (the microRNA database) (Griffiths-Jones et al., 2006) and PLncDB (plant long non-coding RNA database) (Jin et al., 2013) are useful to check whether the sRNA is originated from a miRNA precursor or a lncRNA. Besides, ShortStack should be a useful tool to analyze the sRNA-seq data based on the available reference genomes (Axtell, 2013) (Table 2). It can output reports showing sRNA size distributions, repetitiveness, hairpin-association and phasing. One of its shortage is the requirement of bioinformatics experts for local installation and running.
Here, the workflow for analyzing the sRNA biogenesis and action pathways is proposed based on the scenario that the sRNAs are processed from their precursors transcribed from specific genomic loci (Fig. 1). If the sRNA precursor was experimentally cloned by using fine-scale methods such as RACE (rapid amplification of cDNA ends), the transcription boundary of the precursor-coding locus could be defined. In this case, the upstream region of user-defined length could be retrieved from the above mentioned genomic database, and be treated as the promoter region of this gene locus for cis-element analysis by using PlantCARE (a plant cis-acting regulatory element database) (Rombauts et al., 1999), PLACE (Plant cisacting regulatory DNA elements database) (Higo et al., 1998), or the newly updated tools JASPAR (Mathelier et al., 2016) and the MEME suite (Bailey et al., 2015) ( Table  2). The prediction results from these online tools might provide some valuable hints to infer the basic transcriptional features of this gene locus. For example, the coexistence of CAAT-box and TATA-box within the upstream region near to the transcription start site indicates the Pol II-drived transcription of the host gene (Lewin, 1990). Of course, we should acknowledge that the fine-scale cloning of the sRNA precursors is time consuming and laborious. The high-throughput solution is by analyzing the publicly available RNA sequencing (RNA-seq) data. Notably, distinct types of RNA-seq libraries were prepared with different purposes. For example, in a recent study, the poly(A)-tailed RNA-seq libraries were constructed for the detection of Pol II-dependent transcripts, while the rRNA-depleted total RNA-seq libraries were prepared for the identification of Pol IV-dependent transcripts (Li et al., 2015). After mapping such kind of RNA-seq data (Table  3) onto the plant genome by using a high-throughput The analysis is divided into five sections according to the step-by-step instructions in the main text, including "genomic features", "transcription", "precursors and processing", "sRNA action modes" and "functional studies" alignment tool, Bowtie 2 (Langmead and Salzberg, 2012) for example, the transcription boundaries of the sRNA precursor-coding loci could be delineated. Also based on the mapping result, the RNA polymerase dependence could be partially determined for the loci. Moreover, some of the sRNA-seq datasets, such as those originated from  (Table 3), could also be used to investigate the polymerase dependence of the sRNA-coding loci. Notably, compared to Bowtie, Bowtie 2 is particularly efficient for long read (up to hundreds of nucleotides in length) mapping. Thus, Bowtie 2 is recommended to be employed for RNA-seq data analysis, while Bowtie is more suitable for sRNA-seq read mapping as mentioned above.

Precursor formation and processing
As introduced above, there are two major forms of sRNA precursors that could be processed by DCLs, i.e. the long double-stranded RNA (dsRNA) precursors and the single-stranded RNAs with short internal double-stranded regions (Fig. 1). The former ones could be synthesized either through the RDR-dependent (such as the precursors of the hc-siRNAs or the ta-siRNAs) pathway or through the RDR-independent (such as the NATs) pathway. However, the later ones are unexceptionally Please see detailed descriptions of the datasets in the related references generated through the RDR-independent pathways (such as the pri-miRNAs and the sirtrons). Thus, distinct bioinformatics toolkits should be selected to identify the sRNA precursors belonging to the two different types, respectively. PlantNATsDB (plant natural antisense transcripts database)  provides the users with the genome-wide prediction results of both cis-and trans-NAT pairs of 70 plant species. Gene locus ID could be used as a query to see the possibility of this gene to form NAT pairs with other genes. Optionally, "batch download" could be selected to obtain the complete list of the predicted NAT pairs of a plant species. In the other way, the researchers could perform large-scale NAT prediction by using the program NATpipe (Yu et al., 2016a). The genome-independent feature of this software allows users to carry out NAT prediction solely based on the RNA-seq data. If the genomic information is available, then the predicted NATs could be classified into cis and trans ones. Additionally, if the sRNA-seq data is available, phase-distributed nat-siRNAs could be identified from the predicted NATs by using NATpipe.
RNAfold (Hofacker, 2003) and RNAshapes (Steffen et al., 2006) are both easy-to-use tools for local RNA secondary structure predictions. RNAfold is a web server allowing a query sequence of up to 10,000 nt in length, but its graphic outputs are difficult to be modified according to the users' requirements. RNAshapes is a locally installed program with a strict length limitation (up to~400 nt based our experience) of an input sequence. However, the outputs of RNAshapes could be graphically edited.
Recently, NGS-based, transcriptome-wide strategies have been developed to probe the RNA secondary structures, such as dsRNA sequencing (dsRNA-seq) (Kwok et al., 2015). The dsRNA-seq library is prepared by treating the total RNAs with the ribonuclease specific for the single-stranded RNAs, thus enabling researchers to detect the annealed region within an RNA sequence, or between two transcripts. Currently, the plant dsRNA-seq data is only available for Arabidopsis. These dsRNA-seq datasets were prepared not only from the wild type (WT) plants, but also from the nrpd1 and rdr6 mutants (Table 3). Thus, the Pol IV and RDR6 dependence of the dsRNA precursors could be interrogated by using these datasets. In addition to the dsRNA-seq data, sRNA-seq data of nrpd1, rdr2 and rdr6 should also be valuable to investigate the biogenesis pathways of the sRNA precursors (Table 3).
DCLs have been demonstrated to be widely implicated in the processing of diverse sRNA precursors in plants (Chen, 2009). Thus, by comparing to the sequencing data of WT, the public sRNA-seq data of the dcl mutants could be used to investigate the specific DCLmediated sRNA processing pathways.

sRNA action modes and network construction
According to the current understanding, target cleavages and chromatin modifications are the two major action modes of the plant sRNAs. And, these two distinct regulatory pathways are largely predetermined by the association of the sRNAs with specific AGO complexes (Fang and Qi, 2016). Thus, AGO enrichment analysis is necessary for functional studies on the plant sRNAs. To date, sequencing data of the AGO-associated sRNA populations has been reported by several research groups (Table 3). In Arabidopsis, AGO1-, AGO2-, AGO4-, AGO5-, AGO7-, AGO9-and AGO10-associated sRNA sequencing datasets are available (Mi et al., 2008;Montgomery et al., 2008;Havecker et al., 2010;Wang et al., 2011;Zhu et al., 2011). And in rice, AGO1-, AGO4-, MEL1-, AGO16-and AGO18-associated sRNA sequencing datasets have been published (Wu et al., 2009;Wu et al., 2010;Komiya et al., 2014). By comparing the level of a sRNA in a specific AGO complex to that in the total RNA extract, whether this sRNA is enriched in the AGO complex could be determined. The result could facilitate the researchers to deduce the action mode of this sRNA.
A large portion of the miRNAs and some of the siR-NAs such as the ta-siRNAs are incorporated into AGO1. These AGO1-associated sRNAs can recognize the highly complementary regions on the target transcripts, and inhibit gene expression through target cleavages. The high complementarity between the sRNA and its target forms an essential basis for the development of the target prediction tools. For plants, there are several user-friendly online tools for target prediction (Table 2), such as psRNATarget (a plant small RNA target analysis server) (Dai andZhao, 2011), Small RNA Target Prediction (Jones-Rhoades and, and TAPIR (target prediction for plant microRNAs) (Bonnet et al., 2010). Compared to TAPIR, the former two tools are more flexible for different users' purposes. By using psRNA-Target or Small RNA Target Prediction, the users can select one of the cDNA libraries provided by the tools, or can upload their own cDNA sequences for target prediction. However, TAPIR does not provide the pre-stored cDNA libraries for the users. Additionally, much more parameters are adjustable before performing analysis by using the former two tools. Thus, psRNATarget and Small RNA Target Prediction should be the efficient and easy-to-use tools for plant sRNA target prediction.
The 3′ cleavage remnants from the target transcripts are relatively stable in vivo, and could be detected by sequencing. This kind of high-throughput sequencing technology was called GMUCT (global mapping of uncapped and cleaved transcripts) (Willmann et al., 2014) or PARE (parallel analysis of RNA ends) (German et al., 2008;German et al., 2009), which is collectively referred to as degradome sequencing (degradome-seq) here. As listed in Table 3, there are many degradome-seq datasets available to perform large-scale validation of the predicted sRNA targets. For analyzing the degradome-seq data, comPARE (PARE validated miRNA targets) and sPARTA-Web (small RNA-PARE target analyzer) (Kakrana et al., 2014) might be the easy-to-use online tools for the wet-lab researchers ( Table 2). The difference between comPARE and sPARTA-Web is that the former was designed specifically for the miRNA target validation whereas the latter was developed for all of the sRNA target candidates. CleaveLand4 (Addo-Quaye et al., 2009) is also suitable for degradome-seq data-based validation of the sRNA targets. However, it is a Perl program, which requires extensive support from bioinformatics experts for local installation and running. Besides, our previously proposed workflow could also be referenced for degradome-seq data-based sRNA target validation (Meng et al., 2011b).
The AGO4-associated sRNAs, such as the hc-siRNAs, repress gene expression through chromatin modifications (Chen, 2009;Fang and Qi, 2016). By using BLAST or Bowtie, the genomic regions highly complementary to the AGO4-associated sRNAs could be identified with a genome-wide scale. Then, it will be interesting to investigate the chromatin status surrounding the complementary sites. Several epigenome databases are available for Arabidopsis, such as Arabidopsis epigenome maps , the SIGnAL Arabidopsis Methylome Mapping Tool , and the Arabidopsis epigenome data displayed in the UCSC Genome Browser Zhang et al., 2007;Stroud et al., 2013). In some databases, in addition to the WT data, the epigenomes of diverse mutants are also available, which might be valuable to inspect the sRNA-guided chromatin modification pathways in detail. Although the rice epigenome data was reported nearly ten years ago , and the database was established at that time, the web link is no long active. Fortunately, Plant Methylome DB provides researchers with the WT epigenomes of 40 species including Arabidopsis and rice.
"Target mimicry" was reported as a novel pathway for the regulation of the miRNA activities by the target mimics (Franco-Zorrilla et al., 2007). Although the online server TAPIR is not superior in sRNA target prediction, it provides a unique functional module for target mimic prediction (Bonnet et al., 2010). Besides, the recently established PceRBase (plant ceRNA database) stores the lists of the competing endogenous RNAs (similar to the target mimics) of 26 plant species for the users (Yuan et al., 2017).

Conclusions
In the present work, we proposed a general workflow for deciphering the biogenesis and action pathways of the plant sRNAs by using a series of publicly available resources. Most of the recently reported sRNA-seq and dsRNA-seq datasets of Arabidopsis and rice were summarized in Table 3, emphasizing their importance for elucidating the RDR-and DCL-dependent biogenesis pathways of the plant endogenous sRNAs. However, we should acknowledged that several useful toolkits have not been included in the list of softwares for plant small RNA research. For example, the UEA sRNA workbench, that is downloadable for local installation, provides an user-friendly platform for sRNA-seq data processing (Stocks et al., 2012). It contains several useful tools, such as "adaptor remover" and "Filter" for sRNA-seq data pre-treatment, "miRCat" and "hairpin annotation" for miRNA prediction, and the ta-siRNA prediction tool. Besides, as noticed in Tables 1 and 2, several valuable databases and bioinformatics tools, including Rice epigenome maps and PNRD, are currently terminated for unknown reason. We hope that these useful resources could be activated again for the plant biologists. Summarily, more research efforts, from both the bioinformaticians and the experimental practitioners, are anticipated to devote to the plant sRNA research.