Bioinformatics Analysis of Small RNAs in Pima (Gossypium barbadense L.)

Small RNAs (sRNAs) are ~20 to 24 nucleotide single-stranded RNAs that play crucial roles in regulation of gene expression. In plants, sRNAs are classified into microRNAs (miRNAs), repeat-associated siRNAs (ra-siRNAs), phased siRNAs (pha-siRNAs), cis and trans natural antisense transcript siRNAs (cis- and trans-nat siRNAs). Pima (Gossypium barbadense L.) is one of the most economically important fiber crops, producing the best and longest spinnable fiber. Although some miRNAs are profiled in Pima, little is known about siRNAs, the largest subclass of plant sRNAs. In order to profile these gene regulators in Pima, a comprehensive analysis of sRNAs was conducted by mining publicly available sRNA data, leading to identification of 678 miRNAs, 3,559,126 ra-siRNAs, 627 pha-siRNAs, 136,600 cis-nat siRNAs and 79,994 trans-nat siRNAs. The 678 miRNAs, belonging to 98 conserved and 402 lineage-specific families, were produced from 2,138 precursors, of which 297 arose from introns, exons, or intron/UTR-exon junctions of protein-coding genes. Ra-siRNAs were produced from various repeat loci, while most (97%) were yielded from retrotransposons, especially LTRs (long terminal repeats). The genes encoding auxin-signaling-related proteins, NBS-LRRs and transcription factors were major sources of pha-siRNAs, while two conserved TAS3 homologs were found as well. Most cis-NATs in Pima overlapped in enclosed and convergent orientations, while a few hybridized in divergent and coincided orientations. Most cis- and trans-nat siRNAs were produced from overlapping regions. Additionally, characteristics of length and the 5’-first nucleotide of each sRNA class were analyzed as well. Results in this study created a valuable molecular resource that would facilitate studies on mechanism of controlling gene expression.


Introduction
Small RNAs (sRNAs) are short, single-stranded RNAs that silence gene expression transcriptionally or post-transcriptionally [1][2][3], which is also known as RNA-induced silencing. sRNAs are produced from longer, double stranded RNA (dsRNA) precursors by DICER-LIKE proteins Additionally, miR828 and miR858 targeting GhMYB2D mRNA leads to generation of ta-siR-NAs, mediating fiber development [47]. In addition to miRNAs, TAS3-derived ta-siRNAs triggered by miR390 are found in Upland cotton as well [54]. The suppression of miR156/157 leads to the reduction of mature fiber length in Pima [48]. Compared to Upland cotton, apparently fewer miRNAs are identified in Pima [48,55], which produces the best and longest textile fiber, and siRNAs being the most abundant sRNA class in plants [34] still remain largely unknown. To profile these gene regulators in Pima, we analyzed two publicly available sRNA datasets from the root and fiber, resulting in identification of 5 major sRNA species. This is the first comprehensive analysis of Pima sRNAs and created a valuable molecular resource, which largely expands the scope of sRNAs in cotton and is very crucial for future studies on mechanism of regulating plant growth and development.

Methods and Materials
Small RNA datasets of G. barbadense Two sRNA datasets of Pima prepared from the root (GSM699076) and 10 DPA (days post anthesis) fiber (GSM634227) were downloaded from NCBI Gene Expression Omnibus (http:// www.ncbi.nlm.nih.gov/geo). The two datasets contain 15,599,325 and 7,764,225 reads (S1 Table), respectively. For further analysis, sequences 18 to 26 nt in length were retained, leaving a total of 9,518,811 unique sRNAs, and the total reads of each sRNA from the two datasets were recorded for expression-based filtration, using in-house Perl scripts.

Identification of conserved and lineage-specific miRNAs
To identify miRNAs, sRNAs with a total abundance !10 reads were submitted to a modified miREAP (http://sourceforge.net/projects/mireap/), in which 3 nucleotide shifts from putative miRNA loci are allowed, using the following parameters: the maximal distance between miR-NAs and miRNA Ã s (-d 450), the maximal copies on the genome (-u 5000), and defaults for others. The putative miRNAs obtained from miREAP were further filtered by strand bias (!0.8), which was calculated by dividing reads from sense strand by total reads, and abundant bias (!0.6), which was calculated by dividing total reads of top 3 sRNAs from miRNA loci by total reads [56,57]. Additionally, the remaining putative miRNAs were filtered by the secondary structures of their pre-miRNAs in which maximal four mismatches are allowed between miR-NAs and miRNA Ã s [57], using CentroidFold [58]. The identified miRNAs that are homologous to known plant miRNAs in miRBase version 21 (http://www.mirbase.org) with three or less substitutions are classified into conserved miRNAs (cs-miRNAs), while the others are considered as candidates of lineage-specific miRNAs (ls-miRNAs) [10].
To determine miRNAs yielded from protein-coding genes, cotton-derived transcripts were mapped to the cotton A2 and D5 genomes, using Blastn [59,60] with e-value < 10 -6 . Those transcripts, which were mapped to upstream and/or downstream of miRNA loci within 5000 nt with !96% identity, 3 mismatches and no indel (insertion-deletion), were selected for further analysis. The transcripts mapped to miRNA loci were annotated using Blastx [59,60] against plant-derived proteins deposited in NCBI (http://www.ncbi.nlm.nih.gov/) with a stringent e-value <10 -6 . miRNAs derived from protein-coding genes were further classified into 3 classes based on the location of their pre-miRNAs, intron-derived miRNAs processed from introns of protein-coding genes [17], exon-derived miRNAs cleaved from exons of protein coding genes, and junction-derived miRNAs yielded from exon-intron/UTR junctions [18].

Identification of ra-siRNAs
To determine ra-siRNAs, we extracted repeat DNAs from the cotton A2 and D5 genomes [43,44]. Given that the information of repeat loci in the cotton A2 genome is not publicly available, we identified repeat DNA loci in the A2 genome using RepeatMasker [61] with default parameters. All sRNAs matching repeat DNAs were considered as ra-siRNAs.

Identification of pha-siRNAs
In order to identify pha-siRNAs, sRNAs (excluding miRNAs and ra-siRNAs) were subjected to pha-siRNA analysis, using UEA Small RNA WorkBench [62] with p-value <10 -3 that employs the algorithm described previously [63]. The putative PYTs obtained from UEA Small RNA WorkBench [62] were further filtered by three extra characteristics: 1). the PYTs produced at least two pha-siRNAs; 2). at least a pha-siRNA had a total read count !5; 3). the ratio of pha-siRNAs/non-pha-siRNAs was !0.6. Analysis of miRNA initiators triggering the synthesis of pha-siRNAs was conducted, referring to the methodology described in early study [64] with minor modification. Briefly, putative initiators were predicted using psRNATarget with length for complementarity scoring (18) and expectation value (4.0). Then, the miRNAs that were predicted to flank the upstream or downstream of pha-siRNAs within 148 nt were selected for this analysis, allowing one nucleotide shift from the classic cleavage site [64].

Identification of cis-and trans-nat siRNAs
The cis-nat siRNAs were determined, referring to the methods described previously [65]. Briefly, cotton-derived transcripts were mapped to both cotton A2 and D5 genomes, and then pairs of transcripts derived from the same genomic loci with at least 50 nt overlap but from different orientations were chosen as cis-NATs. sRNAs produced from cis-NATs were considered as cisnat siRNA candidates.
After excluding cis-NATs, the remaining transcripts were used for analysis of trans-NATs. Putative trans-NATs were determined, using Blastn search for pairs of sequences that were complementary to each other with at least 100 nt overlap. The putative trans-NATs were then filtered by secondary structure using RNAcoFold in Vienna RNA package [66]. Those putative trans-NATs that can form at least 50 nt dsRNAs, in which 90% nucleotides were base-paired, were considered as trans-NATs, and sRNAs matched to trans-NATs were trans-nat siRNA candidates.

Conserved and lineage-specific miRNAs in Pima
An analysis of the Pima sRNA datasets (S1 Table), which contained 9,518,811sRNAs, was performed. This led to identification of a total of 678 miRNAs from 2,138 pre-miRNAs (S2 Table), of which 222 and 456 were assigned to 98 cs-miRNA and 402 ls-miRNA families, respectively.
A comparison of the 98 Pima cs-miRNA families across plants was conducted (Table 1). Twenty three (23%) cs-miRNA families were present in both Eudicotyledon and   Monocotyledon, with a subset also found in Embryophyta and/or Coniferophyta, suggesting that they have common ancient origins [67]. Notably, miR477 was a special family that was present in both Embryophyta and Eudicotyledons, but absent in Monocotyledon. Ten (10%) families were detected in only Eudicotyledons, indicating that they are Eudicotyledon-specific, and the other 64 families (65%) were restricted to Malvaceae, reflecting that they are poorly conserved. Among the 98 cs-miRNA families, 25 (26%) were detected from all three reference resources (Fig. 1A), 40 (41%) were shared between A2 and D5 genomes but not detected from cotton transcripts, while two were shared between either of two genomes and cotton transcripts. In addition, 5, 25 and 1 cs-miRNA families were uniquely detected from A2, D5 genome, and cotton transcripts, respectively. By contrast, only two ls-miRNA families (0.5%) were detected in all three reference resources (Fig. 1B), 80 (20%) were shared between A2 and D5 genome but not found in cotton transcripts, while 168 (42%), 144 (36%) and 4 (1%) were uniquely detected from A2, D5 genome, and cotton transcripts, respectively.
Pima miRNAs varied from 18 to 24 nt in length, with almost all miRNAs (99%) being 20 to 24 nt in length ( Fig. 2A). Most cs-miRNAs (60%) were 21 nt in length, whereas nearly half ls-miRNAs (46%) were 24 nt in length. In plants, miRNAs are predominantly 21 nt in length and regulate gene expression post-transcriptionally [10,68], but long miRNAs (24 nt in length) function to direct chromatin modification and consequently result in transcriptional repression of target loci [8]. Cs-miRNAs and ls-miRNAs displayed different length characteristics, which might be associated with their regulatory roles [8,69].
Analysis of the 5'first nucleotide revealed a higher frequency of A or U and a less frequency of C or G at the 5'end of Pima miRNAs (Fig. 2B). The 5'-first nucleotide is important for sorting sRNAs into different AGO clades, and changing 5'-first nucleotide of a miRNA can redirects it to a different AGO complex and alter its biological activity [5]. Thus, it could be inferred that the characteristic of the 5'-first nucleotide is also functionally important for miR-NAs in Pima.
Compared to previous studies in Pima and other cotton species [48][49][50]55], more conserved and lineage-specific miRNAs along with their pre-miRNAs are identified in Pima in this study, which largely expands cotton miRNA information. Consistent with previous study [67], most miRNAs are not conserved among plants, which are proposed to play species-specific roles in plant growth and development [70].

miRNAs derived from protein-coding genes
Typically, miRNAs in plants are produced from non-protein-coding genes [10,57], while recent studies show that miRNAs can also be produced from protein-coding genes in Arabidopsis and rice [71,72]. In the present study, 297 pre-miRNAs were identified to arise from introns, exons, or intron-exon junctions of protein-coding genes (S3 Table). These pre-miRNAs were able to form well-defined hairpin structures of plant miRNAs (Fig. 3A to 3D) [56,57]. Notably, a special pre-miRNA (MIRC47c) was observed, which could be processed from either an intron or exon of two different protein-coding genes from the same genomic loci (Fig. 3D), indicating a potential AS site.
In animals, the biogenesis of miRNAs derived from exon-intron junctions is negatively regulated through AS, which is competitive with Drosha cleavage [19]. In Arabidopsis, an AS event that occurs in the intron where MIR400 is located is specifically induced by heat stress, providing the evidence that AS acts as a regulatory mechanism linking miRNAs and environmental stress in plants [72].

Identification of ra-siRNAs in Pima
To identify ra-siRNAs, sRNAs (excluding miRNAs) were mapped to repeat DNA loci in both cotton A2 [43] and D5 [44] genomes. Consequently, 3,559,126 sRNAs were mapped to repeat DNA loci, which were considered as ra-siRNA candidates (S4 Table). These ra-siRNAs were further classified into 5 different types, according to their origin ( Table 2). A total of 3,447,580 ra-siRNAs arose from retrotransposons, of which 79% were derived from LTRs (long terminal repeats). Except for rRNA-derived ra-siRNAs, approximately 75% ra-siRNAs were uniquely detected in either A2 or D5 genome, indicating that ra-siRNAs are not well conserved between the two genomes.
The length distribution of major classes of ra-siRNAs was shown in Fig. 4A. The 24 nt ra-siRNAs were predominant, corresponding to 52-56% ra-siRNAs from DNA transposons, retrotransposons or simple repeats. By contrast, no obvious length bias was observed among rRNA-derived ra-siRNAs. Moreover, analysis of the 5' first nucleotide revealed a higher frequency of A than other residues at the 5'end of ra-siRNAs (Fig. 4B). These characteristics are consistent with that found in other plants [34,73].

PYTs and pha-siRNAs in Pima
Analysis of pha-siRNA was conducted using a combination of UEA Small RNA Workbench [62] and in-house Perl scripts as described in the Method section. As a result, 246 PYTs that produced 627 pha-siRNAs were obtained with significant p-values ranging from 0 to 10 -3 (Fig. 5A), indicating reliable capture of phasing phenomenon [63].
The PYTs were further annotated using either Blastx against NCBI nr protein database or Blastn against TAIR gene database. Consequently, 236 PYTs were derived from protein coding genes (S5 Table and Fig. 5B), 2 were from non-coding TAS3 homologs, and 8 had no hit in the databases. Genes encoding auxin response factors (ARFs) and auxin signaling F-box were the richest source of Pima pha-siRNAs, followed by those encoding NBS-LRR disease resistance proteins and transcription factors. In addition, PYTs were also produced from genes encoding various proteins or enzymes, such as pentatricopeptide repeats, Ca 2+ ATPase, Helicase, and peroxidase. Strikingly, a DCL1 isoform was able to produce pha-siRNAs (S6 Table), suggesting that the sRNA biogenesis machinery is subject to pha-siRNA regulation [38]. A subset of those PYT genes, which encode proteins, such as auxin signaling F-boxs, pentatricopeptide repeats, NBS-LRR and MYBs, et al., has been previously reported in other plants [23,38,74], indicating that these pha-siRNA pathways are likely conserved across plants. Furthermore, analysis of the 5'first nucleotide revealed a higher frequency of A and U and a lower occurrence of C and G at the 5' end of pha-siRNAs (Fig. 5C). This is consistent with characteristics of DCL products, such as miRNAs and ra-siRNAs.

Cis-nat siRNAs in Pima
Cis-nat siRNAs have been identified in various plants and are shown to regulate plant development, disease resistance and stress response [40,41,78]. However, no cis-nat siRNA is reported in Pima to date, although some cotton trans-NATs are deposited in plant NAT database,  PlantNATsDB (http://bis.zju.edu.cn/pnatdb/) [79]. In this study, 42,737 and 24,178 cis-NATs (S7 Table) were identified from cotton A2 and D5 genome, respectively. These cis-NATs yielded a total of 136,600 cis-nat siRNAs (S8 Table), representing 229,435 reads. The cis-NATs were further classified into 4 types in accordance to overlapping orientations: convergent (3'end overlap), divergent (5'end overlap), enclosed (one transcript encompassed the other transcript), and coincided (two transcripts of cis-NATs completely overlapped). Most cis-NATs hybridized in enclosed and convergent orientations, corresponding to 42% and 30% of the total cis-NATs, respectively, while only 10-16% overlapped in the divergent or coincided way (Fig. 7A). Compared to earlier studies [41,78], cis-NATs in Pima can form dsRNAs in the coincided orientation, a new overlapping means found in plants. In addition, nearly half cis-NATs in Pima overlapped in the enclosed way, differing from that the convergent orientation is the major overlapping means found previously [41,78], which reflects the overlapping means of cis-NATs might vary in different plant species or tissue types.

Trans-nat siRNAs in Pima
In this study, 1,250,581 trans-NATs were identified in Pima, of which 895,809 (72%) were detected to yield 79,994 trans-nat siRNAs (S9 Table). These trans-NATs overlapped in a wide length range from 50 to more than 1000 nt (Fig. 8A), while most (68%) had an overlapping region between 300 and 800 nt. The trans-NATs can form dsRNAs with a minimum free energy ranging from -1500 to -100 kcal.mol -1 (Fig. 8B), of which 85% ranged from -1200 to -400 kcal.mol -1 , indicating that these trans-NATs can form stable dsRNAs.
The 79,994 trans-nat siRNAs represented a total of 320,182 reads (Fig. 8C), and 91% reads were derived from overlapping regions. Trans-nat siRNAs ranged from 18 to 26 nt in length, with the 21 nt siRNAs being the most abundant (Fig. 8D). Most trans-nat siRNAs (62%) had an A or U as the 5'first nucleotide (Fig. 8E), and relatively fewer G and C appeared at the most 5'end.
To date, trans-NATs have been identified in a broad range of plants [79], such as Upland cotton and G. raimondii, while no trans-nat siRNA has been previously reported in Pima. In this study, we firstly identified a large number of trans-NATs in Pima, producing a significant number of trans-nat siRNAs. Consistent with the major length range of DCL products [10,81], most trans-nat siRNAs were 21 to 24 nt in length and had the characteristic of the 5'first nucleotide similar to other sRNAs in Pima, indicating that they are likely products of DCL cleavage. Like cis-nat siRNAs and in agreement with previous study [73], most trans-nat siRNA reads were produced from the overlapping regions, suggesting that a similar mechanism might govern the synthesis of NAT-derived siRNAs.

Conclusion
Through analysis of deep sequencing sRNA data of Pima, 5 major classes of sRNAs including miRNAs, ra-siRNAs, pha-siRNAs, cis-nat and trans-nat siRNAs along with their precursors were bioinformatically identified. This is the first comprehensive analysis of sRNAs in Pima and creates an important molecular resource for future studies on mechanism of regulating plant growth and development.
Supporting Information S1