De novo transcriptome assembly and data for the blue-winged teal (Spatula discors)

The blue-winged teal (Spatula discors) is a recreationally and ecologically important dabbling duck species in North America. Transcriptomic data for this species can support public and animal health studies because the blue-winged teal is a natural reservoir host for avian influenza, which can be a zoonotic disease of high concern. Ileum and bursa of Fabricius tissues were sampled from six captive-raised blue-winged teals, four of which were experimentally infected with low-pathogenic avian influenza virus H5N9. RNAseq data were generated from total RNA extracted from each tissue and pooled to create a de novo assembly of the transcriptome using Trinity. A total of 571,105 transcripts representing 449,956 unigenes were identified, and the transcriptome has been functionally annotated. This transcriptome will be useful for future blue-winged teal gene expression research, especially hypothesis-driven differential expression studies that aim to determine the driving forces of avian influenza host-pathogen interactions, spatial distribution, and transmission.


Specification
Subject: Biology
Specific subject area: Virology and Transcriptomics
Type of data: Transcriptome assembly, raw sequences
How data were acquired: High-throughput RNA sequencing by the University of Minnesota Genomics Center (UMGC) using an Illumina HiSeq 2500
Data format: Analyzed, raw
Experimental factors: Teal eggs were hatched and raised in captivity. Four birds were inoculated with low-pathogenic avian influenza virus; two birds were sham-inoculated.
Experimental features: Total RNA was extracted from three ileum and three bursa samples from six birds total. cDNA library preparation and sequencing were performed at UMGC. Bioinformatic analyses were completed by the authors.
Description of data collection: RNAseq data were generated from ileum and bursa tissue, then analyzed to create a de novo transcriptome using Trinity.

Value of the Data
• To our knowledge, this is the first published transcriptome assembly for the blue-winged teal. It will be broadly useful as a reference for future blue-winged teal research, ranging from breeding and nutrition studies to disease ecology, toxicology, and host-pathogen interaction studies, specifically those involving birds infected with avian influenza virus.
• These data will benefit ornithologists, wetland biologists, nutritionists, ecologists, veterinarians, and epidemiologists who are interested in studying transcriptomic processes in blue-winged teals and other closely related Anas species.
• The blue-winged teal transcriptome can be used as a reference for differential expression analysis of RNAseq data.

Data
Data described in this article originate from cDNA sequencing of two tissue types (ileum and bursa of Fabricius) from six blue-winged teals (Spatula discors); the resulting reads were assembled into a de novo transcriptome using Trinity [1]. We present quality assessment information on the raw sequencing data and assembled transcriptome, as well as functional annotations derived from several databases. Table 1 describes summary statistics of the transcriptome, Table 2 describes predicted open reading frames, Table 3 provides the number of occurrences for the most common gene ontology (GO) terms, and Table 4 provides counts of the number of transcripts associated with the 15 most frequently occurring Kyoto Encyclopedia of Genes and Genomes (KEGG) identifiers. Fig. 1 shows the distribution of transcript lengths and Fig. 2 is a histogram of the number of isoforms per assembled unigene.
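
The relationship between transcripts and unigenes summarized in Fig. 2 can be illustrated with a minimal Python sketch. It assumes transcript identifiers follow the usual Trinity naming pattern, in which the unigene name is followed by an isoform suffix such as _i1; the identifiers in the usage note are hypothetical and do not come from this assembly. For scale, 571,105 transcripts across 449,956 unigenes corresponds to roughly 1.27 isoforms per unigene on average.

```python
from collections import Counter


def gene_id(transcript_id):
    """Strip a trailing _i<N> isoform suffix (assumed Trinity-style naming)."""
    gene, sep, isoform = transcript_id.rpartition("_i")
    # Fall back to the full ID if it does not end in _i<digits>.
    return gene if sep and isoform.isdigit() else transcript_id


def isoforms_per_gene(transcript_ids):
    """Map each unigene to the number of transcripts (isoforms) assigned to it."""
    return Counter(gene_id(t) for t in transcript_ids)


def isoform_histogram(transcript_ids):
    """Map an isoform count to the number of unigenes with that many isoforms."""
    return Counter(isoforms_per_gene(transcript_ids).values())
```

For example, calling isoform_histogram on the hypothetical identifiers TRINITY_DN100_c0_g1_i1, TRINITY_DN100_c0_g1_i2, TRINITY_DN100_c0_g2_i1, and TRINITY_DN7_c0_g1_i1 groups them into three unigenes, two with a single isoform and one with two isoforms.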

Birds
Blue-winged teal eggs (n = 6) were collected from nests (one per nest) in uncultivated fields in Cando, North Dakota under U.S. Fish and Wildlife Scientific Collection permit #MB194270 and North Dakota Game and Fish Department License #GNF0363940. Birds were raised in captivity until they were 9-12 weeks of age.

Sample collection
Ileum intestinal tissue was collected from three birds (two males and one female), and the bursa of Fabricius (hereafter 'bursa') was collected from the other three (one male and two females). The two females from which the ileum was collected and the two males from which the bursa was collected were experimentally infected with virus and tested positive by RT-PCR targeting the matrix protein gene in viral RNA extracted from cloacal swabs [2]. The other two birds were sham-inoculated and tested negative for virus. All tissues were collected one day post-inoculation. Tissue samples were placed in RNAlater (Sigma-Aldrich, St. Louis, MO, USA) and stored at room temperature for 24 h, then removed from RNAlater and stored at −80 °C.

RNA extraction, cDNA library construction, and sequencing
Total RNA was extracted from each tissue using the Qiagen RNeasy kit according to the manufacturer's protocol (Cat. #74104, Qiagen, Inc., Hilden, Germany). RNA RIN scores averaged 9.0 and ranged from 7.4 to 9.8. A dual-indexed TruSeq stranded mRNA library was created from each RNA extract, and the libraries were gel size selected to have inserts of approximately 200 bp. All libraries were combined into a single pool and sequenced across 1.5 lanes of a HiSeq 2500 2 × 125-bp run using v4 chemistry, which generated approximately 320 M pass-filter reads for the pool, or approximately 53.3 M reads per sample. All expected barcodes were detected and well represented. Mean quality scores for all libraries were ≥ Q30.

Transcriptome analysis
Quality of raw reads was examined using FastQC (v. 0.11.7) [3]. Illumina adapters were removed and reads were quality filtered using Trimmomatic (v. 0.38) [4]. Quality filtering trimmed reads where the mean Phred score across a four-base sliding window fell below 15 and discarded reads shorter than 40 bp. Next, we filtered for potential rRNA contamination by using Bowtie2 [5] to align all reads to the SILVA small and large rRNA subunit databases [6,7]. Bowtie2 alignments were performed using the --very-sensitive-local preset, and all reads that did not align were retained for further analysis. A total of 296,155,317 paired reads remained after filtering was completed.
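
The sliding-window quality filter described above can be sketched in a few lines of Python. This is a simplified illustration of the idea under stated assumptions, not Trimmomatic's actual implementation: it assumes Phred+33 quality encoding, cuts the read at the start of the first four-base window whose mean quality falls below 15, and then applies the 40 bp minimum length. All function names are hypothetical.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(sequence, qualities, window=4, min_mean_quality=15):
    """Cut the read at the start of the first window whose mean quality is too low.

    Reads shorter than the window are returned unchanged.
    """
    for start in range(len(qualities) - window + 1):
        window_scores = qualities[start:start + window]
        if sum(window_scores) / window < min_mean_quality:
            return sequence[:start], qualities[:start]
    return sequence, qualities


def passes_length_filter(sequence, min_length=40):
    """Keep only reads that are at least min_length bases long after trimming."""
    return len(sequence) >= min_length
```

A read would be kept only if passes_length_filter returns True for the trimmed sequence; for paired-end data, a real pipeline also has to decide what to do when only one mate survives, which this sketch does not address.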

De novo assembly and annotation
De novo transcriptome assembly was completed using Trinity [1] with default options and a minimum contig length of 200 nt. Similar transcripts were clustered using CD-HIT-EST [8,9] and transcripts were quality filtered using TransRate [10]. The filtered transcriptome contained 571,105 transcripts associated with 449,956 unigenes. To assess how well the assembly represented the input data, reads were aligned back to the de novo assembly using Bowtie2, which gave a 98.5% realignment rate. To assess transcriptome completeness, BUSCO v3 [11,12] was used with the aves ortholog database, identifying 86.8% complete, 7.2% fragmented, and 6.0% missing single-copy orthologs. Next, TransDecoder (v. 2.1.0) [13] was used to identify candidate coding regions prior to transcriptome annotation using the Trinotate pipeline (v. 3.2.0) [14]. Alignments to the SwissProt database were performed using blastx for the entire transcriptome and blastp for the subset of putative coding regions, with a maximum e-value of 1e-5. Blastx annotations were recovered for 135,042 transcripts and blastp annotations for 75,404 transcripts. Trinotate was also used to perform further functional annotation, including GO terms (Table 3), designations from the KEGG [15] and eggNOG [16] databases, and protein domains from PFAM [17]. Finally, we used Kallisto [18] to quantify expression levels within the data used to construct the transcriptome, finding 130,077 transcripts with transcripts per million (TPM) values > 0.05 and 69,074 transcripts with > 1.0 TPM.
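
To make the TPM thresholds concrete, the following Python sketch shows how TPM is conventionally derived from estimated counts and effective lengths, and how transcripts above a threshold can be counted. It is an illustration only: Kallisto reports TPM directly, so this calculation would not normally be repeated by hand, and the function names and toy values in the usage note are hypothetical.

```python
def compute_tpm(est_counts, eff_lengths):
    """Convert per-transcript counts to TPM.

    Each count is divided by the transcript's effective length, and the
    resulting rates are rescaled so that they sum to one million.
    """
    rates = {t: est_counts[t] / eff_lengths[t] for t in est_counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in rates}
    return {t: rate / total * 1e6 for t, rate in rates.items()}


def count_above(tpm, threshold):
    """Number of transcripts whose TPM is strictly greater than threshold."""
    return sum(1 for value in tpm.values() if value > threshold)
```

With toy inputs of 100, 300, and 0 counts for three transcripts with effective lengths of 1000, 1500, and 500, the first two transcripts receive one third and two thirds of the million, and count_above with a threshold of 0.05 returns 2.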

Ethics Statement
All procedures were in accordance with the animal care and use protocol and approved by the Institutional Animal Care and Use Committee at MSU (IACUC AUF 12/16-211-00).