Dataset of the first transcriptome assembly of the tree crop “yerba mate” (Ilex paraguariensis) and systematic characterization of protein coding genes

This contribution contains data associated with the research article entitled “Exploring the genes of yerba mate (Ilex paraguariensis A. St.-Hil.) by NGS and de novo transcriptome assembly” (Debat et al., 2014) [1]. By means of a bioinformatic approach involving extensive NGS data analyses, we provide a resource encompassing the full transcriptome assembly of yerba mate, the first available reference for the genus Ilex L. This dataset (Supplementary files 1 and 2) consolidates the transcriptome-wide assembled sequences of I. paraguariensis with a comprehensive annotation of the protein coding genes of yerba mate via the integration of Arabidopsis thaliana databases. The generated data are pivotal for the characterization of agronomically relevant genes in the tree crop yerba mate, a non-model species, and in related taxa in Ilex. The raw sequencing data analyzed here are available at the DDBJ/ENA/GenBank (NCBI Resource Coordinators, 2016) [2] Sequence Read Archive (SRA) under the accession SRP043293, and the assembled sequences have been deposited at the Transcriptome Shotgun Assembly Sequence Database (TSA) under the accession GFHV00000000.


Specifications
Paired-end 100 nt. Raw reads were filtered and de novo assembled. Transcripts were submitted to in-house batch BlastX/tBlastn searches [3] using Arabidopsis as reference to characterize the protein coding genes of yerba mate and to find putative orthologous genes between the two species.
Data source location: Misiones, Argentina
Data accessibility: Data are within this article and at DDBJ/ENA/GenBank under the accessions SRP043293 and GFHV00000000

Value of the data
These data provide the full set of assembled transcriptome sequences of yerba mate, the first reference for the genus Ilex L.
The data are applicable to the characterization of agronomically important genes in yerba mate and related taxa in Ilex.
Access to the assembly and annotation data allows the scientific community to carry out additional analyses using their own approaches.

Data
The data shared with this data article comprise Supplementary files 1 and 2. Supplementary file 1 presents the transcriptome-wide assembled sequences of Ilex paraguariensis from SRA SRP043293 (FASTA). Supplementary file 2 contains the annotation of these assembled sequences via the integration of Arabidopsis thaliana protein databases (spreadsheet format).

Experimental design, materials and methods
Total RNA extracted from five leaf samples of the I. paraguariensis breeding line Pg538 (emerging, young, fully expanded, early senescent and late senescent stages) was pooled for high-throughput sequencing.
The complete raw sequencing data at SRA under the accession SRP043293 [1,2] were used to generate a full transcriptome assembly employing the Trinity 2.0.6 platform [4]. All raw sequenced reads were quality filtered and then de novo assembled, using optimal parameters of a k-mer word size of 25 and a group pairs distance of 500, into 44,840 transcripts (~180X coverage), which encompass ca. 31,694 genes and their respective isoforms (13,146) according to the Trinity output (Supplementary File 1, FASTA).
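The transcript, gene and isoform counts above can be tallied directly from the FASTA headers of an assembly. The following is a minimal sketch, assuming that each transcript identifier ends in an isoform suffix of the form `_i<number>` and that the text before this suffix identifies the gene. The function names and the toy identifiers are hypothetical and are not taken from Supplementary File 1. Under this reading, the reported figures are self-consistent: 31,694 genes plus 13,146 additional isoforms give 44,840 transcripts.

```python
import re
from collections import Counter

# Assumption: transcript IDs end in "_i<number>" (isoform index) and the
# text before that suffix identifies the gene.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_id(transcript_id: str) -> str:
    """Strip the trailing isoform index to obtain the gene identifier."""
    return ISOFORM_SUFFIX.sub("", transcript_id)


def summarize_headers(fasta_lines):
    """Count transcripts, genes and additional isoforms from FASTA lines."""
    per_gene = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token.
            transcript_id = line[1:].split()[0]
            per_gene[gene_id(transcript_id)] += 1
    transcripts = sum(per_gene.values())
    genes = len(per_gene)
    return {
        "transcripts": transcripts,
        "genes": genes,
        "additional_isoforms": transcripts - genes,
    }


# Toy input with made-up identifiers (not taken from the dataset).
example = [
    ">TR1|c0_g1_i1 len=300", "ATGC",
    ">TR1|c0_g1_i2 len=280", "ATGA",
    ">TR2|c0_g1_i1 len=500", "ATGG",
]
summary = summarize_headers(example)
print(summary)
```

For a real file, the lines of the FASTA file can be passed in place of the `example` list, for instance by iterating over an open file handle.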
For the first step of the annotation analysis, the whole genome information of the model species A. thaliana L. (TAIR10; http://www.arabidopsis.org) was downloaded to a local server system. Subsequently, the complete yerba mate translated transcriptome (269,040 sequences) was scanned by in-house [3; v.11.0.2] batch homology searches via BlastX (matrix Blosum62, word size 3, cut-off value of 1e−05) using the TAIR10 proteome (35,386 peptides of 27,416 gene models) as bait. In addition, tBlastn searches (matrix Blosum62, word size 3, cut-off value of 1e−05) were performed using the TAIR10 proteome as query and the complete yerba mate translated transcriptome as target. For both direct and reverse searches, the best hit strategy was applied.
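The figure of 269,040 translated sequences equals six times the 44,840 assembled transcripts, which is consistent with a six-frame translation (three forward and three reverse-complement reading frames), although the text does not state this explicitly. A minimal sketch of such a translation, assuming the standard genetic code and using hypothetical function names, could look as follows.

```python
from itertools import product

# Standard genetic code, built in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq: str) -> dict:
    """Map frame number (+1..+3, -1..-3) to its translated sequence."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


# Toy sequence, not taken from the dataset.
frames = six_frame("ATGGCCTAA")
for frame in (1, 2, 3, -1, -2, -3):
    print(frame, frames[frame])

# Six translated frames per transcript reproduce the reported total.
print(44840 * 6)
```

In practice BlastX and tBlastn perform this translation internally, so the sketch only illustrates where the count of translated sequences and the frame indicator in the result tables come from.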
The BlastX and tBlastn search results were organized in separate spreadsheets (Supplementary File 2, sheets 1 and 2, respectively) integrating several indicators, i.e. query name, subject name, e-value, bit-score, % query coverage, % pairwise identity, cumulated total alignment length and frame.
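The best hit strategy applied to both searches is not spelled out in detail. A minimal sketch, assuming that the best hit for each query is the alignment with the lowest e-value (ties broken by the highest bit-score) among hits at or below the 1e-05 cut-off, is given below. The `Hit` record, the `best_hits` function and the identifiers in the toy input are hypothetical and only mirror a subset of the indicators listed above.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment row with a subset of the tabulated indicators."""
    query: str
    subject: str
    evalue: float
    bitscore: float


def best_hits(hits, max_evalue=1e-5):
    """Keep one hit per query: lowest e-value, then highest bit-score.

    Assumption: this ranking is one common reading of "best hit"; the
    source does not define the exact criterion.
    """
    best = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue  # discard hits above the cut-off value
        current = best.get(hit.query)
        if current is None or (hit.evalue, -hit.bitscore) < (
            current.evalue,
            -current.bitscore,
        ):
            best[hit.query] = hit
    return best


# Toy rows with made-up identifiers and scores.
hits = [
    Hit("tx_1", "pep_A", 1e-30, 120.0),
    Hit("tx_1", "pep_B", 1e-50, 200.0),
    Hit("tx_2", "pep_C", 1e-3, 40.0),   # above the cut-off, dropped
    Hit("tx_3", "pep_D", 1e-20, 90.0),
    Hit("tx_3", "pep_E", 1e-20, 95.0),  # same e-value, higher bit-score
]
best = best_hits(hits)
for query, hit in sorted(best.items()):
    print(query, hit.subject, hit.evalue, hit.bitscore)
```

The same function can be applied to the tBlastn rows by treating the Arabidopsis peptide as the query and the yerba mate transcript as the subject.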
BlastX revealed 32,480 hits out of 44,840 transcripts of yerba mate (72.4%), comprising 21,370 genes (out of 31,694; 67.4%) and 11,110 isoforms (out of 13,146; 84.5%), which targeted 12,435 gene models of Arabidopsis (out of 27,416; 45.3%). The complete BlastX results displayed the following mean parameters: e-value of 4.42e−08, bit-score of 329.5, % query coverage of 73.7, % pairwise identity of 64.7 and cumulated total alignment length of 258.2 nucleotides. Mean parameters for each category considering e-value ranges are shown in Table 1. Roughly 40% of the annotated transcripts (0 ≤ e-value ≤ 1e−90; Fig. 1A) exceed those mean parameters. Most yerba mate transcripts were annotated according to direct frames (56.6%; Fig. 1B).
In addition, through the tBlastn approach, 30,476 sequences of the A. thaliana proteome database (out of 35,386; 86.1%), comprising 23,033 gene models (out of 27,416; 84.0%), found yerba mate hits. Those hits belong to 10,904 unique transcripts (24.3% of the yerba mate transcriptome) of 9,885 genes (out of 31,694; 31.2%). The complete tBlastn results displayed the following mean parameters: e-value of 3.09e−08, bit-score of 404.9, % query coverage of 79.5, % pairwise identity of 59.0 and cumulated total alignment length of 351.7 nucleotides. Mean parameters for each category considering e-value ranges are shown in Table 2. Nearly 46% of the annotated transcripts (0 ≤ e-value ≤ 1e−120; Fig. 2A) are above those mean parameters. Most yerba mate transcripts were annotated according to direct frames (55.4%; Fig. 2B).
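The proportions quoted for the BlastX and tBlastn searches can be recomputed from the counts given in the text. The short sketch below does this with a tolerance of 0.1 percentage points, which is an arbitrary choice made here to allow for rounding. The row labels are descriptive only.

```python
# (label, part, whole, percentage as reported in the text)
REPORTED = [
    ("BlastX: transcripts with hits", 32480, 44840, 72.4),
    ("BlastX: genes with hits", 21370, 31694, 67.4),
    ("BlastX: isoforms with hits", 11110, 13146, 84.5),
    ("BlastX: Arabidopsis gene models targeted", 12435, 27416, 45.3),
    ("tBlastn: Arabidopsis peptides with hits", 30476, 35386, 86.1),
    ("tBlastn: Arabidopsis gene models with hits", 23033, 27416, 84.0),
    ("tBlastn: yerba mate transcripts hit", 10904, 44840, 24.3),
    ("tBlastn: yerba mate genes hit", 9885, 31694, 31.2),
]


def check(rows, tolerance=0.1):
    """Recompute each percentage and flag it as consistent or not."""
    results = []
    for label, part, whole, reported in rows:
        computed = 100 * part / whole
        consistent = abs(computed - reported) <= tolerance
        results.append((label, round(computed, 2), consistent))
    return results


results = check(REPORTED)
for label, computed, consistent in results:
    status = "consistent" if consistent else "check"
    print(f"{label}: {computed}% ({status})")
```

The BlastX gene and isoform counts also add up to the total number of transcripts with hits (21,370 + 11,110 = 32,480).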
Finally, BlastX and tBlastn results were merged, curated and organized according to consecutive gene model names of Arabidopsis (Supplementary File 2, sheet 3) in order to detect reciprocal best hits (RBH) and ambiguous results between the two search approaches, in addition to putative gene [...]. The RBH strategy is useful to infer orthologous relationships among protein gene datasets [5]. However, to finally decide on the orthology of pair-wise aligned sequences, additional criteria should be considered, i.e. e-value, bit-score, % pairwise identity, cumulated total alignment length and visual inspection of the alignment [6]. Our analysis in yerba mate revealed 9,437 BlastX/tBlastn RBH pairs sensu stricto (equivalent Arabidopsis gene model peptide/yerba mate gene isoform), including 9,244 gene pairs out of 21,387 annotated genes (43.2%). Of those, 4,764 gene pairs and their respective 10,683 unique isoforms can be grouped as RBH sensu lato (equivalent Arabidopsis gene model/yerba mate gene). Another 437 yerba mate genes, unrelated to the RBH sensu stricto group, and their respective 1,292 unique isoforms can also be grouped as RBH sensu lato (see Supplementary File 2, sheet 3). RBH sensu stricto annotated pairs displayed the following mean parameters: e-value of 7.25e−09, bit-score of 485.9, % query coverage of 78.7, % pairwise identity of 66.6 and cumulated total alignment length of 368.0 nucleotides. Mean parameters for each category considering e-value ranges are shown in Table 3. Around 44% of the RBH sensu stricto annotated transcripts (0 ≤ e-value ≤ 1e−150) are above those mean parameters.
In sum, we performed an integrated high-throughput screening analysis based on a BlastX/tBlastn strategy and employing highly curated databases of A. thaliana, the most extensively studied plant. Our approach resulted in a comprehensive annotation of 21,387 yerba mate genes and the prediction of 9,874 orthologous genes between the two species.
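The reciprocal best hit criterion used above can be sketched in a few lines. The example assumes that the best hits of each search direction are already available as dictionaries (yerba mate transcript to best Arabidopsis peptide for BlastX, and Arabidopsis peptide to best yerba mate transcript for tBlastn). The function name and all identifiers are hypothetical, and the sketch covers only the pairwise (sensu stricto) case, not the gene-level (sensu lato) grouping or the additional orthology criteria.

```python
def reciprocal_best_hits(forward_best, reverse_best):
    """Return (transcript, peptide) pairs that are each other's best hit.

    forward_best: transcript -> best peptide (direct search, e.g. BlastX)
    reverse_best: peptide -> best transcript (reverse search, e.g. tBlastn)
    """
    return sorted(
        (transcript, peptide)
        for transcript, peptide in forward_best.items()
        if reverse_best.get(peptide) == transcript
    )


# Toy best-hit tables with made-up identifiers.
blastx_best = {"tx_1": "pep_A", "tx_2": "pep_B", "tx_3": "pep_B"}
tblastn_best = {"pep_A": "tx_1", "pep_B": "tx_3", "pep_C": "tx_2"}

rbh_pairs = reciprocal_best_hits(blastx_best, tblastn_best)
print(rbh_pairs)
```

In this toy case `tx_2` points to `pep_B`, but `pep_B` points back to `tx_3`, so only the pairs `tx_1`/`pep_A` and `tx_3`/`pep_B` are reciprocal.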
This Transcriptome Shotgun Assembly project has been deposited at DDBJ/ENA/GenBank under the accession GFHV00000000. The version described in this paper is the second version, GFHV02000000.