Genotyping-by-sequencing data of 272 crested wheatgrass (Agropyron cristatum) genotypes

Crested wheatgrass [Agropyron cristatum (L.) Gaertn.] is an important cool-season forage grass widely used for early spring grazing. However, genomic resources for this non-model plant are still lacking. Our goal was to generate the first set of next-generation sequencing data using the genotyping-by-sequencing technique. A total of 272 crested wheatgrass plants representing seven breeding lines, five cultivars and five geographically diverse accessions were sequenced with an Illumina MiSeq instrument. The sequence data were processed with several bioinformatics tools to generate contigs for diploid and tetraploid plants and SNPs for diploid plants. Together, these genomic resources form a basis for genomic studies of crested wheatgrass and other wheatgrass species. The raw reads were deposited in the Sequence Read Archive (SRA) database under NCBI accession SRP115373 (https://www.ncbi.nlm.nih.gov/sra?term=SRP115373), and the supplementary datasets are accessible in Figshare (10.6084/m9.figshare.5345092).


Value of the data
The first large set of next-generation sequencing datasets obtained from genomic DNA of 272 diploid and tetraploid crested wheatgrass plants. These plants represent seven breeding lines, five cultivars and five geographically diverse accessions.
These datasets can be utilized to enhance genetic and genomic studies, genetic diversity assessments, and marker-assisted breeding of crested wheatgrass.
The SNP datasets can be directly explored for the development of useful molecular markers to investigate genetic variability of crested wheatgrass and to facilitate the molecular breeding of this plant.

Data
Our sequencing efforts in crested wheatgrass generated two new sets of genomic data. The first set consists of 608 FASTQ files generated for 272 diploid and tetraploid crested wheatgrass plants representing seven breeding lines, five cultivars and five geographically diverse accessions (see Table 1). These sequences were obtained from genomic DNA using the genotyping-by-sequencing (GBS) technique on an Illumina MiSeq instrument, in seven runs with 250-bp paired-end reads. Plants from two accessions were sequenced twice as a control for quality assessment. Raw reads for all 17 accessions were deposited in NCBI's SRA database under accession number SRP115373 (https://www.ncbi.nlm.nih.gov/sra?term=SRP115373). The second set contains several metadata files generated from bioinformatics analysis of the sequence reads, including contigs for diploid and tetraploid plants and SNPs for diploid plants. These

Plant materials and DNA extraction
This study selected 17 crested wheatgrass accessions consisting of seven breeding lines, five cultivars and five geographically diverse accessions (Table 1). Five accessions are diploid and 12 are tetraploid. These accessions were acquired from the USDA plant germplasm system, Plant Gene Resources of Canada, and the joint forage breeding program of the University of Saskatchewan and Agriculture and Agri-Food Canada. Seeds were randomly selected from each accession and grown for six weeks in a greenhouse at the Saskatoon Research and Development Centre, Agriculture and Agri-Food Canada, under a 12 h photoperiod at 25 °C during the daytime. Young leaf tissues were collected from 16 randomly selected plants of each of the 17 accessions and stored at −80 °C prior to DNA extraction. For each of the 272 samples, DNA was extracted from 0.1 g of ground tissue following the protocol of the NucleoSpin® Plant II Kit (Macherey-Nagel, Bethlehem, PA, USA) and eluted in a 1.5-ml Eppendorf tube with Elution Buffer. DNA quality was assessed with a NanoDrop 8000 (Thermo Scientific) by comparing the absorbance at 260 and 280 nm. DNA samples were further quantified with the Quant-iT™ PicoGreen® dsDNA assay kit (Invitrogen) and subsequently diluted to 60 ng/μl with 1× TE buffer prior to sequencing analysis.
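The quality check and dilution step described above reduce to two small calculations, sketched here in Python. The function names and the example readings are hypothetical and are not taken from the dataset; the dilution uses the standard C1·V1 = C2·V2 relation, and only the 60 ng/μl target comes from the text.

```python
def a260_a280_ratio(a260: float, a280: float) -> float:
    """Return the A260/A280 absorbance ratio used as a DNA purity indicator."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return a260 / a280


def diluent_volume_ul(stock_ng_per_ul: float, stock_volume_ul: float,
                      target_ng_per_ul: float = 60.0) -> float:
    """Volume of 1x TE buffer (ul) to add so the sample reaches the target.

    From C1 * V1 = C2 * V2, the final volume is V2 = C1 * V1 / C2,
    and the buffer to add is V2 - V1.
    """
    if stock_ng_per_ul < target_ng_per_ul:
        raise ValueError("stock is already below the target concentration")
    final_volume_ul = stock_ng_per_ul * stock_volume_ul / target_ng_per_ul
    return final_volume_ul - stock_volume_ul


if __name__ == "__main__":
    # Hypothetical readings for one sample (illustration only).
    print(round(a260_a280_ratio(0.90, 0.50), 2))     # 1.8
    print(round(diluent_volume_ul(150.0, 10.0), 1))  # 15.0
```

In this example, 10 μl of a 150 ng/μl extract needs 15 μl of buffer to reach 60 ng/μl. A ratio near 1.8 is a common rule of thumb for clean DNA; the text does not state which threshold, if any, was applied.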

GBS library preparation and sequencing
Complexity-reduced, multiplexed GBS libraries were prepared following the published gd-GBS protocol [1]. In brief, each library preparation started with 200 ng of purified genomic DNA, which was digested with the restriction enzymes PstI + MspI. Custom 5′/3′ adapters were ligated to the inserts with T4 ligase following the standard product protocol. Ligated fragments were purified with the AMPure XP kit and subsequently amplified and indexed with Illumina TruSeq HT multiplexing primers. Six library pools were made, each consisting of 48 indexed samples (3 accessions × 16 individual plants). One extra library pool was included using 32 samples from two randomly selected accessions as a control. Prior to pooling of samples into a library, amplicon fragments from the TruSeq HT kits were size-selected on a Pippin instrument for an insert size between 250 and 450 bp, although the actual fragment size varied between 400 and 600 bp. Each pooled library was diluted to 6 pM and denatured together with 5% sequencing-ready Illumina PhiX Library Control, which serves as a calibration control for sequencing. Sequencing was performed at the Saskatoon Research and Development Centre using an Illumina MiSeq instrument with 250-bp paired-end reads. Seven MiSeq runs generated 608 FASTQ sequence files for 272 plants from 17 accessions. Note that 32 plants from two accessions were sequenced twice as a control for quality assessment.
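The reported file count can be checked with a short calculation. The sketch below assumes that each sequenced sample yields one forward and one reverse FASTQ file, which is the usual layout for paired-end MiSeq output but is not stated explicitly in the text. The function name is hypothetical.

```python
def expected_fastq_files(accessions: int = 17,
                         plants_per_accession: int = 16,
                         resequenced_plants: int = 32,
                         files_per_sample: int = 2) -> dict:
    """Count unique plants, sequenced samples and paired-end FASTQ files.

    files_per_sample = 2 is an assumption: one forward (R1) and one
    reverse (R2) file per sequenced sample.
    """
    unique_plants = accessions * plants_per_accession
    sequenced_samples = unique_plants + resequenced_plants
    return {
        "unique_plants": unique_plants,
        "sequenced_samples": sequenced_samples,
        "fastq_files": sequenced_samples * files_per_sample,
    }


if __name__ == "__main__":
    print(expected_fastq_files())
    # {'unique_plants': 272, 'sequenced_samples': 304, 'fastq_files': 608}
```

Under this assumption, 272 plants plus 32 re-sequenced plants give 304 sequenced samples and 608 files, matching the reported total.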

Contig assembly and sequence similarity analysis
Contigs were assembled with the protein-associated SNP prediction and genotyping pipeline paSNPg [2], which was developed specifically for non-model species and requires two inputs: the raw MiSeq sequence reads and the relevant plant Ensembl PEP package [3]. The Pep_database.tar.bz2 tarball was prepared following the paSNPg protocol by merging the PEP data of 44 plant species, and was supplied together with the paired-end sequence reads (FASTQ files) as the combined input for paSNPg. The default settings were adopted for k-mer size (100 bp) and minimal percentage of sample size (MPSS, 80%). Contigs were generated mainly through the Minia routine [4] implemented in the paSNPg pipeline. The analysis generated 6674 contigs for diploid crested wheatgrass plants under the default parameter settings of 75% identical match and 99% alignment length. Among those contigs, 768 (11.5%) were associated with exons of coding genes. A total of 7792 contigs were assembled for tetraploid crested wheatgrass plants, of which 809 (10.3%) were associated with exons of coding genes.
Contigs were also assembled separately for individual diploid or tetraploid accessions, using the same parameter settings as for the combined analysis of diploid or tetraploid plants. The outcomes are summarized in Table 2. Sequence similarity between the diploid- and tetraploid-based contigs was assessed with a Blastn search under an E-value cut-off of 1e-100 [5]. The parsed results revealed that 4477 (67%) diploid-based contigs matched 4461 (57%) tetraploid-based contigs.
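A minimal sketch of how such Blastn results could be parsed is given below. It assumes the search was run with tabular output (`-outfmt 6`), in which the first, second and eleventh columns hold the query ID, subject ID and E-value. The text does not state which output format or parsing script was actually used, and the function names here are hypothetical.

```python
from typing import Iterable, Set, Tuple


def matched_contigs(lines: Iterable[str],
                    max_evalue: float = 1e-100) -> Tuple[Set[str], Set[str]]:
    """Collect query and subject IDs with at least one hit at or below max_evalue.

    Expects BLAST tabular lines (outfmt 6): qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    queries: Set[str] = set()
    subjects: Set[str] = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            queries.add(fields[0])
            subjects.add(fields[1])
    return queries, subjects


def percent(part: int, total: int) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


if __name__ == "__main__":
    # Two made-up hits: only the first passes the 1e-100 cut-off.
    demo = [
        "dip_1\ttet_9\t99.1\t450\t4\t0\t1\t450\t1\t450\t1e-120\t800",
        "dip_2\ttet_3\t90.0\t120\t12\t0\t1\t120\t1\t120\t1e-30\t150",
    ]
    print(matched_contigs(demo))          # ({'dip_1'}, {'tet_9'})
    # Percentages recomputed from the counts reported in the text.
    print(round(percent(4477, 6674)))     # 67
    print(round(percent(4461, 7792)))     # 57
```

Counting unique query and subject IDs separately is what allows the two matched totals to differ, since several diploid-based contigs can hit the same tetraploid-based contig.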
An additional analysis assessed the gd-GBS empirical genome coverage (EgC) for each accession. The genome sizes of crested wheatgrass plants were estimated by flow cytometry, with the genome sizes of Triticum durum and Triticum aestivum as references. Average genome sizes of 6898 Mbp and 13,527 Mbp were obtained for diploid and tetraploid plants, respectively. The EgC values ranged from 0.044% to 0.120% with a mean of 0.078% for diploid plants, and from 0.015% to 0.030% with a mean of 0.021% for tetraploid plants (Table 2).
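The text does not give a formula for EgC. As an illustration only, the sketch below assumes that EgC is the number of base pairs covered by the assembled contigs expressed as a percentage of the estimated genome size. Both functions are hypothetical, and the back-calculated values hold only under that assumption.

```python
def empirical_genome_coverage(covered_bp: float, genome_size_mbp: float) -> float:
    """EgC in percent, ASSUMING EgC = 100 * covered bp / genome size in bp."""
    if genome_size_mbp <= 0:
        raise ValueError("genome size must be positive")
    return 100.0 * covered_bp / (genome_size_mbp * 1_000_000)


def covered_mbp(egc_percent: float, genome_size_mbp: float) -> float:
    """Invert the assumed definition: Mbp covered for a given EgC value."""
    return egc_percent / 100.0 * genome_size_mbp


if __name__ == "__main__":
    # Mean EgC values and average genome sizes reported in the text.
    print(round(covered_mbp(0.078, 6898), 2))    # 5.38 (diploid)
    print(round(covered_mbp(0.021, 13527), 2))   # 2.84 (tetraploid)
```

Under this reading, the mean EgC values would correspond to roughly 5.4 Mbp and 2.8 Mbp of covered sequence per diploid and tetraploid accession, respectively.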

Protein-associated SNP identification
SNP calling was performed only for diploid plants using the paSNPg pipeline, as it was developed specifically for diploid non-model species. Currently, there is no effective pipeline available for SNP calls from genomic sequences of tetraploid plants. A total of 11,854 nuclear SNPs were discovered in diploid plants, of which 1738 (14.7%) were associated with exons of coding genes. However, the numbers of nuclear SNPs and exon-associated SNPs without missing values across the diploid plants were smaller, at 1158 and 308, respectively.
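The distinction between all discovered SNPs and SNPs without missing values can be illustrated with a small filter. The sketch below assumes a simple in-memory genotype table in which a missing call is stored as `None`. The data layout, SNP names and function names are hypothetical and do not reflect the paSNPg output format.

```python
from typing import Dict, List, Optional

# SNP ID -> genotype calls, one per plant; None marks a missing call.
Genotypes = Dict[str, List[Optional[str]]]


def complete_snps(genotypes: Genotypes) -> List[str]:
    """Return IDs of SNPs that have a genotype call in every plant."""
    return [snp for snp, calls in genotypes.items()
            if all(call is not None for call in calls)]


def percent(part: int, total: int) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


if __name__ == "__main__":
    # Toy table with three SNPs scored in four plants (made-up values).
    toy: Genotypes = {
        "snp_1": ["AA", "AG", "GG", "AA"],
        "snp_2": ["CC", None, "CT", "CC"],
        "snp_3": ["TT", "TT", "TT", "TC"],
    }
    print(complete_snps(toy))                 # ['snp_1', 'snp_3']
    # Exon-associated share recomputed from the counts in the text.
    print(round(percent(1738, 11854), 1))     # 14.7
```

A strict no-missing-data filter of this kind discards any SNP with a single missing call, which is consistent with the drop from 11,854 SNPs to 1158.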