Annotating and quantifying pri-miRNA transcripts using RNA-Seq data of wild type and serrate-1 globular stage embryos of Arabidopsis thaliana

The genome annotation for the model plant Arabidopsis thaliana does not include the primary transcripts from which MIRNAs are processed. Here we present and analyze the raw mRNA sequencing data from wild type and serrate-1 globular stage embryos of A. thaliana, ecotype Columbia. Because SERRATE is required for pri-miRNA processing, these precursors accumulate in serrate-1 mutants, facilitating their detection using standard RNA-Seq protocols. We first use the mapping of the RNA-Seq reads to the reference genome to annotate the potential primary transcripts of MIRNAs expressed in the embryo. We then quantify these pri-miRNAs in wild type and serrate-1 mutants. Finally, we use differential expression analysis to determine which are up-regulated in serrate-1 compared to wild type, to select the best candidates for bona fide pri-miRNAs expressed in the globular stage embryos. In addition, we analyze a previously published RNA-Seq dataset of wild type and dicer-like 1 mutant embryos at the globular stage [1]. Our data are interpreted and discussed in a separate article [2].

the globular stage [1]. Our data are interpreted and discussed in a separate article [2].
& Arabidopsis thaliana wt and se-1 embryos at the globular stage, with two biological replicates Data source location

Not applicable
Data accessibility Data is available as Supplementary file 1 and at NCBI GEO accession GSE100450

Value of the data
This is the first study to directly identify MIRNA genes expressed in early embryos of plants.
We provide an annotation file with 318 MIRNA gene models, including 77 predicted from the RNA-Seq data, that is useful for others interested in MIRNA gene regulation in Arabidopsis.
Our high-quality globular stage transcriptomes of wild type and serrate-1 embryos will be valuable for other studies of gene regulation in early embryogenesis.

Data
We generated RNA-Seq data for globular stage Arabidopsis thaliana embryos from two genotypes: serrate-1 (se-1) mutants, and wild type (wt), both in the Columbia ecotype. We then inferred MIRNA primary transcripts expressed at the globular stage by aligning RNA-Seq reads to the Arabidopsis genome, assembling and manually curating gene models, and analyzing differential expression. As an independent profile of MIRNA transcripts in embryos, we analyzed a previously published RNA sequencing experiment using wt Columbia and dicer-like 1 (dcl1-5) embryos [1]. We provide the raw data, predicted pri-miRNA gene models, quantification of all genes in both experiments, and differential expression results.

Experimental design, materials and methods
Two biological replicates of wt and se-1 of about 80 embryos each at the 32-64 cell (early to midglobular) stage were obtained, the RNA isolated, amplified and sequenced as described previously [2]. Illumina HiSeq 2000 sequencing yielded 101 nt paired-end reads with over 20 million reads per library (Table 1). Raw data files are available through the NCBI Gene Expression Omnibus (GEO, accession GSE100450). Raw data from the dcl1-5 RNA-Seq experiment was downloaded from GEO (accession GSE25404), consisting of a single replicate of dcl1-5 mutant and Col-0 (wt) early globular embryos with 36 nt reads obtained on an Illumina Genome Analyzer II [1].
The paired-end reads from the wt and se-1 libraries were mapped using HISAT2 [3] with default settings except for an intronic length suited for Arabidopsis (-max-intronlen 900). The default intronic length (-max-intronlen 500000), tuned for mammalian genomes, resulted in many reads falsely mapping across several genes. The 36 nt reads from the wt and dcl1-5 libraries were cleaned using cutadapt [4] v1.13 to remove adaptors, polyA and polyT sequences from the 3′ and 5′ ends (respectively), and low-quality bases and flanking Ns were trimmed from individual reads. Finally, any read shorter than 18 nt or with more than three internal Ns was discarded. The full parameters for cutadapt were: -q 6 -a ATCTCGTATGCCNNNNNNNNNNNNNNNNNNNNNNNN -a "A{36}" -g NNNNNNNNNNNNNNNNNNNNNNNNAGTCC-GACGATC -g "T{36}" -trim-n -max-n 3 -m 18.
The resulting cleaned reads were mapped using the Bowtie [5] short read aligner allowing up to 2 mismatches within a 25 nt seed sequence, and only uniquely mapped reads were retained (-l 25 -n 2 -m 1) ( Table 2). Both transcriptomes were mapped to the reference TAIR10 assembly of Arabidopsis thaliana, downloaded from The Arabidopsis Information Resource (ftp://ftp.arabidopsis.org/ home/tair/Sequences/whole_chromosomes/).
The Araport11 reference annotation from the Arabidopsis Information Portal [6] contains the coordinates of pre-miRNA hairpins, but does not contain information regarding the pri-miRNA transcripts. In order to predict genome wide pri-miRNA transcript coordinates, Cufflinks [7] was used to assemble and merge putative pri-miRNA transcripts from the se-1 and wt Col RNA-Seq read alignments. Cufflinks was first run for each library (-overlap-radius 1 -library-type frunstranded). The predictions were then merged (cuffmerge -s) using the TAIR10 genome assembly as reference. Out of the 325 miRNAs in the Araport11 annotation, 77 overlapped with a predicted Cufflinks gene model. All these predictions were manually verified, with only 4 of them requiring individual adjustments to better reflect the disposition of the reads from the RNA-Seq libraries, and to resolve overlapping conflicts. The main limitation of this approach is that pri-miRNAs can only be assembled if they are expressed in the sampled conditions (in this case, for early globular embryos).
Overlaps with, or even proximity to protein-coding genes can make it difficult to establish the appropriate gene model of a pri-miRNA. Due to this, the pri-miRNA predictions were divided into four groups: intergenic (G1), between 1-400 bp away from a protein-coding gene (G2), overlapping with a protein-coding gene (G3, divided into G3A if the overlap includes the pre-miRNA or G3B otherwise) and overlapping with a non-coding gene (G4); see Table 3. The 77 Cufflinks predictions were distributed amongst the pri-miRNA groups as follows: 28 in G1, 7 in G2, 0 in G3A, 37 in G3B and 5 in G4. There were 56 pre-miRNAs that overlapped with 54 protein-coding genes (G3A group). These pre-miRNAs were assigned to a pri-miRNA gene model identical to the overlapping protein-coding gene. In all other cases where no Cufflinks prediction was available, the pri-miRNA was kept the same as the pre-miRNA annotation from Araport11. A final annotation file with the newly predicted pri-miRNA gene models, in addition to all the gene models from Araport11, considering a total of 318 pri-miRNA genes and 27,562 protein-coding genes, was employed for the quantification of all the RNA-Seq libraries and is available as Supplementary file 1. Quantification of reads using this annotation file was performed in R with the function feature-Counts from the Rsubread package [8]. Multi-mapping reads were counted (countMultiMappin-gReads ¼TRUE) and only primary alignments were allowed (primaryOnly ¼TRUE). Additionally, reads were assigned to the feature with the largest number of overlapping bases (largestOverlap ¼TRUE) and a minimum mapping quality score of 10 was required (minMQS ¼ 10) for a read to be counted.
Finally, the edgeR [9] package was used to perform the differential expression analysis of both the se and dcl1 experiments, using the raw counts with no prior filtering. A tagwise dispersion was calculated for se, but since no replicates are available for the dcl1 experiment, the Biological Coefficient of Variation was fixed to 0.4, as recommended by the edgeR manual. To test for differential expression, quasi-likelihood F-tests and likelihood ratio tests were performed for the se and dcl1 experiments, respectively. In total, 6951 genes were upregulated in the se-1 mutant and 7138 were downregulated with an FDR o 0.05, and 125 genes were upregulated in dcl1-5 and 138 downregulated with an FDR o 0.05 (Table S3 from reference [2]).

Group Description
Number of pre-miRNAs in Araport11 Predicted pri-miRNA gene models Gene models from Araport11 Gene models with two pre-miRNAs  (except for the G3A category, where the gene models are taken from protein-coding genes). The average values of expression and log 2 FC for all genes, including pri-miRNAs and the median log 2 FC for each of the miRNA groups, is plotted in Fig. S2 from Ref. [2].
In se-1, 133 miRNAs have at least 4 accumulated reads from the libraries. Of these, 100 are differentially expressed (FDR o 0.05). For dcl1-5, 121 miRNAs have at least 4 accumulated reads from the libraries. A summary of the miRNAs that were detected in both the se-1 and dcl1-5 experiments is detailed in Table 4 and a summary of their differential expression behavior is shown in Tables 5 and 6. Table 5 Behavior of differentially expressed pri-miRNAs in se-1 and dcl1-5.