Data set for diet specific differential gene expression analysis in three Spodoptera moths

Examination of closely related species pairs is suggested for evolutionary comparisons of different degrees of polyphagy. We followed this approach with three taxa of lepidopteran herbivores, Spodoptera spp. (S. littoralis (SL) and the maize (C) and rice (R) strains of S. frugiperda (SF)), using RNA-seq analysis of midguts from 3rd instar larvae to examine differential metabolic responses after feeding on pinto bean based artificial diet vs. maize leaves. Paired-end (2×100 bp) Illumina HiSeq2500 sequencing resulted in a total of 24, 23, 24, and 21 million reads for the SF-C-Maize, SF-C-Pinto, SF-R-Maize, and SF-R-Pinto samples, and a total of 35 and 36 million reads for the SL-Maize and SL-Pinto samples, respectively. After quality control measures, a total of 62.2 million reads from SL and 71.7 million reads from SF were used for transcriptome assembly (TA). The resulting final de novo reference TA (backbone) for the SF taxa contained 37,985 contigs with an N50 contig size of 1030 bp and a maximum contig length of 17,093 bp, while for SL, 28,329 contigs were generated with an N50 contig size of 1980 bp and a maximum contig length of 18,267 bp. The data presented herein contain supporting information related to our research article, Roy et al. (2016), http://dx.doi.org/10.1016/j.ibmb.2016.02.006 [1].

Value of the data
The genetic information on Spodoptera taxa in public databases is scant. The gene expression data provide insights into the adaptive mechanisms of these closely related Spodoptera taxa with relatively little phylogenetic noise [2].
The data set contributes to further research on adaptive mechanisms in Spodoptera taxa at all levels.
The data could be used as a benchmark and for comparative or collaborative studies in other noctuid moths (Noctuidae, the largest family in the Lepidoptera).
These data further support our publication [1] and the use of transcriptomic technologies in non-model organisms.

Data
We provide data on the transcriptome responses in the larval guts of the three moth taxa when feeding on a semi-artificial diet (suitable for all three insect groups) and on maize leaves (the primary food of only the S. frugiperda C (Corn) strain). The raw data of our study on Spodoptera larval responses to different diets [1] can be accessed directly at the following URL: http://www.ebi.ac.uk/ena/data/view/PRJEB10473.
The analyzed data presented here comprise three types of data: 1) the assembled and annotated transcriptomes of S. littoralis and S. frugiperda 3rd instar larval midgut in two Excel files (plain text and numerical) (Supplemental Files 1 and 2); 2) quality control and annotation plots from Blast2GO and WEGO (Web Gene Ontology Annotation) in five figures (plain graphical) (Figs. 1 to 5); and 3) the differential gene expression (DGE) analysis in two sets of supplementary charts (graphical listings of numbers) (Supplementary charts 6.1-6.5 and 7.1-7.6).

Plant and insect
Detailed information on the plants and insect strains (including origin) can be found in the publication associated with this data set [1]. In brief, maize plants (Sweet Nugget F1) were grown in Alnarp at 23°C and 70% RH. The R strain of S. frugiperda (Moore Haven, Florida), the C strain of S. frugiperda (Santa Isabel, Puerto Rico) and S. littoralis (Egypt) were maintained in the lab on pinto bean based semi-artificial diet for several generations prior to the experiments.

RNA isolation
Total RNA was extracted from each of the larval midgut samples using a combined approach of Trizol-based extraction followed by purification using the RNeasy Mini Kit (Qiagen, Venlo, Netherlands) following the manufacturer's guidelines. DNA was eliminated with on-column RNase-free DNase (Qiagen). The integrity of the RNA was verified using an Agilent 2100 Bioanalyzer and an RNA 6000 Nano Kit (Agilent Technologies, Palo Alto, CA). The quantity of RNA was determined using a Nanodrop ND-1000 spectrophotometer. For sequencing, RNA from a total of twenty individuals was combined into one pool per sample (SF-C-Maize, SF-C-Pinto, SF-R-Maize, SF-R-Pinto, SL-Maize and SL-Pinto).

RNA seq data generation and assembly
Tissue-specific transcriptome sequencing of four different RNA samples for SF and two RNA samples for SL was carried out on an Illumina HiSeq2500 Genome Analyzer platform using paired-end (2 × 100 bp) read technology, with RNA fragmented to an average of 150 nucleotides, at the Max Planck Genome Centre Cologne (MPGCC). Sequencing resulted in a total of 24, 23, 24, and 21 million reads for the SF-C-Maize, SF-C-Pinto, SF-R-Maize, and SF-R-Pinto samples, and a total of 35 and 36 million reads for the SL-Maize and SL-Pinto samples, respectively. Quality control (QC) measures, which included filtering for high-quality reads based on the quality scores given in the fastq files, removal of reads containing primer/adaptor sequences and trimming of read length, were performed using CLC Genomics Workbench v6.5 (http://www.clcbio.com). The de novo transcriptome assembly was then carried out using the same CLC Genomics Workbench v6.5 software, and the presumed optimal consensus transcriptome was selected using the pre-optimized approach [3]. Any conflicts among the individual bases were resolved by selecting the base with the highest frequency. Contigs shorter than 250 bp were removed from the final analysis.
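The length filter and the N50 statistic reported for the assemblies can be illustrated with a minimal Python sketch. This is not the CLC Genomics Workbench procedure; the function names and the toy contig lengths are hypothetical and serve only to show how a 250 bp cut-off and an N50 value are typically computed.

```python
def filter_contigs(lengths, min_length=250):
    """Keep contig lengths at or above min_length (shorter contigs are dropped)."""
    return [n for n in lengths if n >= min_length]


def n50(lengths):
    """Return the N50: walking through contigs from longest to shortest,
    the length of the contig at which the running total first reaches
    half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for n in ordered:
        running += n
        if running >= half:
            return n
    raise ValueError("no contigs supplied")


# Hypothetical toy contig lengths in bp, not taken from the data set.
toy_lengths = [120, 240, 250, 400, 900, 1500, 3000]
kept = filter_contigs(toy_lengths)
toy_n50 = n50(kept)
print(kept, toy_n50)
```

In this toy case the two contigs below 250 bp are removed before the N50 is computed, so the statistic describes only the contigs that enter the final analysis.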
The de novo assembled transcriptomes were annotated using BLAST, Gene Ontology and InterProScan searches using BLAST2GO PRO v2.8.1 (www.blast2go.de) [4]. For BLASTX searches [5] against the non-redundant NCBI protein database (NR database), up to the 20 best NR hits per transcript were retained, with an E-value cut-off ≤ 10⁻¹ and a minimum match length of 15 amino acids to obtain the best homologue for the predicted short polypeptides. Annex [6] was used to optimize the GO term identification further by crossing the three GO categories (biological process, molecular function and cellular component) to search for name similarities, GO term relationships and enzyme relationships within metabolic pathways (Kyoto Encyclopedia of Genes and Genomes) [7,8].
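The hit-retention criteria for the BLASTX searches can be sketched as a simple filter. The sketch below is not the BLAST2GO implementation; it assumes that hits are available as hypothetical (query, subject, E-value, match length) tuples and that "best" means lowest E-value, which the text does not state explicitly.

```python
from collections import defaultdict


def select_hits(hits, max_evalue=0.1, min_match_len=15, max_hits=20):
    """Retain up to max_hits hits per query that pass both thresholds.

    hits: iterable of (query_id, subject_id, evalue, match_length_aa).
    Ranking by ascending E-value is an assumption made for illustration.
    """
    by_query = defaultdict(list)
    for query, subject, evalue, length in hits:
        if evalue <= max_evalue and length >= min_match_len:
            by_query[query].append((evalue, subject, length))
    # Sort each query's hits by E-value and keep the top max_hits.
    return {q: sorted(rows)[:max_hits] for q, rows in by_query.items()}


# Hypothetical toy hits, not taken from the data set.
toy_hits = [
    ("contig_1", "protA", 1e-30, 120),  # passes both thresholds
    ("contig_1", "protB", 0.5, 80),     # E-value above the cut-off
    ("contig_1", "protC", 1e-5, 12),    # match shorter than 15 aa
    ("contig_2", "protD", 0.05, 15),    # passes both thresholds
]
selected = select_hits(toy_hits)
print(selected)
```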
Finally, the GO enrichment analysis (level 2) was performed using the web tool WEGO [9] by plotting the GO information of only those contigs with an expression difference of more than fourfold and a minimum RPKM cut-off of 1 against the reference set (SL and SF whole transcriptome).
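The selection of contigs for the enrichment analysis can be expressed as a small predicate. This is a sketch under stated assumptions: it works on linear (not log2-transformed) RPKM values, and it applies the RPKM cut-off of 1 to both conditions being compared, which the text does not specify. The function name is hypothetical.

```python
import math


def is_enrichment_candidate(rpkm_a, rpkm_b, min_fold=4.0, min_rpkm=1.0):
    """True if the expression difference exceeds min_fold in either direction.

    Assumption: both RPKM values must reach min_rpkm, which also keeps
    the ratio well defined (no division by zero).
    """
    if rpkm_a < min_rpkm or rpkm_b < min_rpkm:
        return False
    return abs(math.log2(rpkm_a / rpkm_b)) > math.log2(min_fold)


# Hypothetical toy values, not taken from the data set.
print(is_enrichment_candidate(20.0, 4.0))   # fivefold higher
print(is_enrichment_candidate(8.0, 2.0))    # exactly fourfold
print(is_enrichment_candidate(0.5, 10.0))   # below the RPKM cut-off
```

Comparing on the log2 scale treats up- and down-regulation symmetrically, so a contig that is fivefold lower passes the same test as one that is fivefold higher.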

Re-mapping and digital gene expression analysis
Digital gene expression analysis was performed using QSeq software (DNAStar Inc.) to re-map the Illumina reads from all individual samples onto their respective transcriptome backbone reference and then count the sequences to estimate expression levels. Previously described parameters for read mapping and normalization were used [3]: n-mer length = 25; read assignment quality options requiring at least 25 bases and at least 90% of bases matching within each read for assignment to a specific contig; maximum number of hits for a read = 10; n-mer repeat settings determined automatically by the software; and default values for all other settings. Biases in the sequence datasets and different transcript sizes were corrected using the RPKM algorithm (reads per kilobase of transcript per million mapped reads; log2 transformed) to obtain accurate estimates for relative expression levels. To control for the effect of global normalization using the RPKM method, we also analyzed a number of highly conserved housekeeping genes that are frequently used as control genes for quantitative RT-PCR. These included several genes encoding ribosomal proteins (RpL5, RpL8, RpL7a, RpL15, RpL22, RpS3a, RpS5, RpS8, RpS15, RpS18, and RpS20), elongation factor 1-alpha and eukaryotic translation initiation factors 4b and 5a. The corresponding genes were evaluated for overall expression levels across samples and treatments and were found to display expression level differences (based on log2-transformed RPKM values) lower than 1.4-fold across samples, validating them as housekeeping genes with no diet- or strain-dependent differential expression. Because each sample and treatment group was sequenced as a single pool of 20 larvae (see RNA isolation), the effective sample size per group is n = 1. Hence, there is no statistical power for conventional hypothesis testing using this pooling approach. However, the pooling strategy reduces the influence of individual outliers.
Expression differences in the gene categories with less than 10 total identified transcripts were not considered for analysis (Supplementary charts 6.1-6.5).
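The RPKM normalization and the housekeeping-gene check described above can be sketched as follows. This is not the QSeq implementation; the function names and toy numbers are hypothetical, and the fold difference is computed here on linear RPKM values as the ratio of the highest to the lowest sample, which is one possible reading of the 1.4-fold criterion.

```python
import math


def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def max_fold_difference(rpkm_values):
    """Ratio of the highest to the lowest RPKM across samples (assumed measure)."""
    return max(rpkm_values) / min(rpkm_values)


# Hypothetical toy numbers, not taken from the data set:
# 500 reads on a 2000 bp contig in a sample with 20 million mapped reads.
value = rpkm(500, 2000, 20_000_000)
log2_value = math.log2(value)  # log2 transform, defined only for RPKM > 0

# Hypothetical RPKM values of one housekeeping gene across three samples.
housekeeping = [100.0, 120.0, 130.0]
stable = max_fold_difference(housekeeping) < 1.4
print(value, log2_value, stable)
```

Contigs with zero mapped reads have no finite log2 value; how such cases were handled is not described in the text, so the sketch leaves them out.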