Genomic survey sequencing, development and characterization of single- and multi-locus genomic SSR markers of Elymus sibiricus L

Siberian wildrye (Elymus sibiricus L.) attracts considerable interest for grassland establishment and pasture recovery in the Qinghai-Tibet Plateau (QTP) due to its excellence in strong stress tolerance, high nutritional value and ease to cultivate. However, the lack of genomic information of E. sibiricus hampers its genetics study and breeding process. In this study, we performed a genome survey and developed a set of SSR markers for E. sibiricus based on Next-generation sequencing (NGS). We generated 469.17 Gb clean sequence which is 58.64× of the 6.86 Gb estimated genome size. We assembled a draft genome of 4.34 Gb which has 73.23% repetitive elements, a heterozygosity ratio of 0.01% and GC content of 45.68%. Based on the gnomic sequences we identified 67,833 SSR loci and from which four hundred were randomly selected to develop markers. Finally, 30 markers exhibited polymorphism between accessions and ten were identified as single-locus SSR. These newly developed markers along with previously reported 30 ones were applied to analyze genetic polymorphism among 27 wild E. sibiricus accessions. We found that single-locus SSRs are superior to multi-loci SSRs in effectiveness. This study provided insights into further whole genome sequencing of E. sibiricus in strategy selection. The novel developed SSR markers will facilitate genetics study and breeding for Elymus species.

breeding progress of this species compared to other Triticeae cereals such as wheat and barley.
E. sibiricus has an allotetraploid genome (StStHH, 2n = 4x = 28). The mean nuclear DNA content (C-value) of E. sibiricus was determined by flow cytometry (FCM) as being 2C = 16.61 pg, approximately twice as large as the possible diploid progenitors from genus Pseudoroegneria (StSt) and Hordeum (HH) [6]. This complexity and huge genome size pose a great challenge to whole genome sequencing of E. sibiricus. Genome survey using next-generation sequencing (NGS) with a given uniform sequencing depth is an alternative and cost-effective strategy in obtaining genome size, heterozygosity, GC and repeat contents. Furthermore, development of molecular markers based on genome survey sequencing and in silico analysis has become a practical tool for genetic study [7].
Among the various types of molecular markers, SSRs (simple sequence repeats) have many advantages including high polymorphism, codominant heredity, good reproducibility and extensive distribution in the genome [8]. These properties have proven to be of great interest for diverse genetic studies including genetic map and fingerprint construction, genetic diversity characterization, and molecular marker assisted breeding, etc. SSR markers can be developed using homology searches from either genomic libraries or transcriptome sequences and expressed sequence tags (ESTs) databases. Often, G-SSR (genomic DNA-derived SSRs) markers are considered to possess higher polymorphism than EST-SSR markers (EST-derived SSRs) due to conservation of the transcribed portions of genome. The ongoing development of nextgeneration sequencing (NGS) techniques e.g. genomic survey sequencing has economically allowed to access large amounts of genomic data, and can further identify genomic SSR loci by in silico searching SSR motifs in massive scaffold datasets [9][10][11][12].
In general, SSRs are believed to be locus-specific i.e. single-locus markers, therefore, only one or two bands (homozygotes or heterozygotes) was expectedly amplified with a single SSR primer pair. However, complex banding pattern (multiple loci) in addition to the expected ones (single-locus) was frequently obtained by a single SSR primer pair [13]. This may be explained by the fact that each of the markers targeted more than one homoeolocus due to the large genome size (especially for polyploid origin) as well as the high proportion of repetitive DNA in the genome of higher plants [9,13,14]. The multi-loci SSR markers brings many difficulties in the precise identification of genomic loci containing the specific genes of interest. For example, in the practical application of some polyploid species, the amplified products of multi-loci SSR by gel electrophoresis may be from multiple loci of multiple genomes, leading to problems such as error in genotyping and inaccurate calculation of diversity index [15]. On the contrary, compared to multi-loci SSRs, single-locus SSRs primers target a unique location in the genome and could provide more reliable scoring of genotypes in genetic study and breeding programs.
In this study genomic survey sequencing was applied in a E. sibiricus cultivar 'Chuancao No.2' using Illumina Hiseq X-ten platform. The first draft genome of E. sibiricus was constructed and some single-locus SSRs and multi-loci SSRs were developed based on genomic survey data. Furthermore, genetic diversity and structure of 27 wild E. sibiricus accessions were characterized using these new SSR markers plus previously published ones. Effectiveness of those markers was also compared and evaluated.

Genome sequencing and characterization
Fourteen 270-bp libraries with pair-end reads of E. sibiricus were constructed and sequenced. The randomly selected 10,000 pairs of reads were then analyzed by BLAST using the NCBI databases. The BLAST result showed that Triticum aestivum and Hordeum vulgare, as the closely-related species to Elymus, were the best matching species in all libraries. In addition, reads of each library aligned with chloroplast genome of E. sibiricus showed lower than 5% matching rate, which indicated that the libraries were established without contamination (data not shown).
From fourteen libraries with 270 bp insertion size, totally 469.17 Gb of clean data were produced using an Illumina HiSeq X-ten platform. The estimated genome size of E. sibiricus was approximately 6.86 Gb with a total sequencing depth of 58.64-fold (Table 1). High quality scores of the filtered sequences were calculated, and the percentages of Q20 and Q30 (the sequencing error rate 1 and 0.1% respectively) was greater than 95.81 and 90.50% respectively, indicating the high accuracy of the sequencing process (Table 1). Then all of the clean data were subjected to 25-mer (k = 25) frequency distribution analysis. It showed that the peak value of the k-mer depth distribution emerged at 54 (Fig. 1). The heterozygosity rate and the proportion of repeat sequence was calculated to be 0.01 and 73.23% respectively according to the k-mer curve distribution.
After de novo assembly using the SOAP de novo program, all of the clean reads produced a total of 4,841,088 contigs with an N50 length of 2510 bp, which were subsequently assembled into scaffolds ( Table 2). Among all those contigs, 683,040 contigs were longer than 500 bp, 352,851 contigs were longer than 1 kb and 7536 contigs were longer than 10 kb. Scaffolds larger than 100 bp were selected for further analysis. Totally 4,763,904 scaffolds were generated with an N50 of 2648, and the number of scaffolds longer than 500 bp, 1 kb and 10 kb were 684,597, 352,659 and 8040, respectively ( Table 2). The assembled draft genome was approximately 4.34 Gb, accounting for 63.27% of the estimated 6.86 Gb genome. The calculated GC content of the assembled genome was 45.68% (Table 2), that was consistent with the scatter plot graph built with scaffolds larger than 500 bp (Fig. 2). A total of 25,993 genes were annotated in the E. sibiricus draft genome with an average transcript length per gene of 2632.11 bp and an average coding sequence length of 737.36 bp. The predicted average exons number per gene was 4.72 and the average exon length per transcript was 311.32 bp (Table S1).

Cluster and STRUCTURE analysis
Genetic memberships of the 27 tested E. sibiricus accessions based on ESGA-SL and ESGA-ML datasets were acquired via STRUCTURE 2.3.4 software ( Fig. 5 and Fig.  S2). The result revealed an optimal K value of 2 (K = 2), implying the tested accessions belonged to two potential genetic memberships, which was consistent with the UPGMA dendrogram and principal coordinates analysis (PCoA) using ESGA-SL markers. The 27 accessions could be divided into two clusters (Cluster I and Cluster II, Figs. 5 and 6). Cluster I included all the Mongolia (MG) and Siberia (SI) accessions, and Cluster II included all eastern QTP originated ones. Thus, the wild germplasms from different regions could be notably identified through the UPGMA and PCoA analysis, which revealed the powerful discriminability for wild accessions based on ESGA-SL markers. However, in spite of distinct characterization of tested accessions from different regions characterized via PCoA analysis based on ESGA-ML markers (Fig. S3), the topological structure of UMPGA dendrogram was ambiguous and couldn't distinguish the 27 wild accessions clearly (Fig. S2).  Fig. 3 The most abundant motifs (red portion) corresponding to mono-to hexa-nucleotide repeats Furthermore, the similar result of UPGMA cluster and PCoA analysis based on ESGA-ML markers was also found in ESGS markers ( Fig. S4 and Fig. S5). Besides, compared to genomic-developed markers, the transcriptome-developed ES markers could not explain their structure membership well ( Fig. S6 and Fig. S7). This indicates that G-SSR markers has the superior discriminability than the EST-SSR markers for intraspecific diversity analysis.

The genetic diversity pattern of E. sibiricus based on different types of markers
Based on geographical origin, all tested wild accessions were divided into three geo-groups: MGL (Mongolia), SI (Siberia) and QTP (eastern Qinghai-Tibet Plateau). The observed heterozygosity (H o ) of each geo-group based on the ESGA-SL dataset was changing from 0.467 to 0.492 (Table 5). In general, the values of N a , N e (effective alleles number), I (Shannon information index), H e (expected heterozygosity) and PP (Percentage of polymorphic loci) of each geo-group calculated by ESGA-SL marker were higher than that of ESGA-ML (Table 5). AMOVA analysis was carried out based on both the ESGA-SL and ESGA-ML markers and coefficient of genetic differentiation (F st ) were calculated. The results showed that genetic variation of tested germplasms was mainly distributed among geo-groups with moderate genetic differentiation (F st = 0.553 for ESGA-SL and F st = 0.573 for ESGA-ML) ( Table 6).

Discussion
Characteristics of E. sibiricus draft genome    [20]. This may be caused by the relatively big estimated genome size (6.86 Gb), high repetitiveness (73.23%) of E. sibiricus and the short insertion size of library. The estimated genome size of E. sibiricus (6.86 Gb) was smaller than that of related allohexaploid Triticum aestivum (17 Gb) [21], while larger than that of many other important species in Gramineae, such as Hordeum vulgare (5.1 Gb) [22], Aegilops tauschii (4.5 Gb) [23], Triticum urartu (5.0 Gb) [24], Brachypodium distachyon (260 Mb) [25], Oryza sativa (466 Mb) [26], Sorghum bicolor (730 Mb) [27], Lolium perenne (2 Gb) [28] and Zea mays (2.3 Gb) [29]. The low level of heterozygosity for E. sibiricus (0.01%) obtained via the kmer analysis was probably caused by the self-pollinating mating system of E. sibiricus, and indicated its feasibility for genome sequencing. This is the first draft genome of E. sibiricus and it is useful in the molecular marker development and functional gene mining. This work also provided the basis for further whole-genome sequencing using larger insert libraries and new sequencing technique like the single-molecule real-time sequencing.

SSR marker development
SSR markers have been widely applied in genetic study and molecular breeding. Among all of the identified 293, 362 SSRs, the vast of SSRs (97.95%) belonged to mono-, di-and tri-nucleotide motifs, which was similar to the result of restriction site associated DNA-Seq (RAD) in E. nutans [30]. However, in the transcriptome sequencing study of E. sibiricus, the tri-nucleotide motifs had the largest number [31], which could be due to the difference between sequences in non-coding and coding regions. Typically, the coding regions has a higher percentage of trinucleotides due to the enrichment of triplet codons under selection pressure [32]. Usually, the most abundant tri-nucleotide motif in monocotyledon is CCG/ CGG [33], while in this study that is CTC/GAG. This could be the result of codon usage bias in different species [34]. The A/T rich tendency of SSRs in E. sibiricus was also consistent with the study of eukaryotes as reported [35]. The phenomenon that motif abundance decreased as the motif repeat number increased of each motif type was in accordance with the previous study [36]. For polyploid species, it's usually hard to distinguish alleles because of the reciprocal overlapping and uncertain allelism of these fragments [37], which is difficult for genotype scoring. In this case, single-locus SSR markers are considered as the best choice, and development of single-locus SSRs has been reported in barley, peanut and Luffa by genome survey [9,12,38]. In this study, 10 single-locus SSR markers were developed via the genome survey of E. sibiricus with great potential use in genetic variation study and linkage map construction.

Effectiveness comparison between single-and multi-loci markers
Genetic diversity of 27 wild E. sibiricus accessions was evaluated by 30 markers developed in this study and other 30 ones reported before. We found that the expected singlelocus SSRs screened by in silico analysis still exhibited multi-loci amplicons when separated by polyacrylamide gel. This may be caused by their non-conservatism of flanking sequences [9]. Finally, only 10 single-locus (ESGA-SL) markers and 20 multi-locus (ESGA-ML) markers were obtained in this study for genetic diversity analysis of 27 wild E. sibiricus accessions.
The average amplified alleles of the 10 ESGA-SL markers was 2.9, which was close to the allotetraploid species Arachis hypogaea (3.85) and Brassica napus (3.23) [9,37]. The PIC value of the 10 ESGA-SL markers varied from 0.069 to 0.595 with an average of 0.391, that indicated its abundant polymorphism and high application value [39]. There was no significant difference of PIC detected between ESGA-SL and other three marker systems (ESGA-ML, ES and ESGS), which may be caused by the different calculation criteria between single-locus and multi-loci marker or the limited amplification loci of single-locus markers. According to the Mann Whitney test, G-SSR (ESGA and ESGS markers) was more efficient and polymorphic than EST-SSR (ES markers) in view of PIC, MI and Rp, that may be driven by the more conservative flanking sequences of EST-SSR [40,41]. In addition, significantly (P < 0.05) higher PIC values of ESGA-ML markers vs. ESGS markers were calculated, which demonstrate the superiority of SSR markers development method by sequencing over traditional method.
The UPGMA and PCoA derived cluster analysis based on ESGA-SL markers divided the 27 wild E. sibiricus accessions into two groups, and the structure analysis based on Bayesian algorithm also revealed the same pattern. However, the other three types of multi-loci  markers exhibited inferior ability than ESGA-SL marker in revealing actual genetic relationships. One should note that all the genetic diversity parameters (N a , N e , I, H e and PP) of each geo-group calculated based on ESGA-SL markers were higher than that of ESGA-ML, which suggested that the single-locus marker reveals more accurate genetic information, so it is more suitable for further genetic analysis [37]. However, slightly higher pairwise F st values were observed among each geo-group based on ESGA-ML markers. Given that multi-loci SSRs possesses characteristic like multiple amplification sites in the genome location, a part of genetic information was unavoidably covered. The advantage of single-locus markers over multi-loci markers was manifested in this study, however, vast number of single-locus markers that covering the entire genome of E. sibiricus are required further be identified or developed. In this case, higher quality genome-wide sequencing and assembling for E. sibiricus are necessary.

Conclusions
In this study, the de novo whole genomic survey of E. sibiricus was performed and a 4.34 Gb reference genome sequence was obtained with 73.23% repetitive elements, 0.01% heterozygosity and 45.68% GC content. Totally 293,362 SSR markers were identified from the draft genome and 67,833 potential markers were screened by in silico analysis. Subsequently, ten single-locus (ESGA-SL) markers and 20 multi-locus (ESGA-ML) markers were verified by gel electrophoresis and exhibited polymorphism in 27 E. sibiricus accessions. The single-locus marker was proved more efficient and informativeness in genetic study than multi-loci marker. This survey of the genome and the developed SSR markers will facilitate further whole genome sequencing, molecular breeding and phylogenetic study of E. sibiricus and related Triticeae species.

Plant materials
E. sibiricus cultivar 'Chuancao No.2' provided by Sichuan Academy of Grassland Sciences (Chengdu, China) was adopted after identification as tetraploid by flow cytometry (Fig. S8) and planted in the growth chamber (25°C, 300 μmols·m 2 ·s − 1 , 16-h photoperiod). The total genomic DNA of 'Chuancao No.2' was isolated from fresh young and clean leaves using a DNA extraction kit (Tiangen, Beijing, China). DNA concentration and purity were checked on a BioPhotometer (Eppendorf, Germany) and the quality was detected by 1% agarose gel electrophoresis.

Library construction and Illumina sequencing
Fourteen genomic paired-end (PE) libraries with 270-bp insertions were prepared following the manufacturer's instructions and then sequenced on an Illumina HiSeq Xten platform. Clean reads were obtained abide by the following filtration and correction criterion [9]: less than 10% unidentified nucleotides (N); no more than 10 nt aligned to the adaptor, allowing for at most 10% mismatches; with at most 50% bases having a phred quality of < 5. Putative PCR duplicates generated during PCR amplification in the library construction process was excluded.
In addition, to investigate the potential contaminating effect, 10,000 pairs of clean reads were randomly selected and searched against the NCBI database using BLAST [42]. Finally, to evaluate the content of extra-nuclear DNA in the aforementioned fourteen libraries, BLAST was performed using SOAP [43] with the chloroplast genome of E. sibiricus (MK775250, 135,075 bp).

Genome assembly, annotation and guanine plus cytosine (GC) content analysis
The filtered high-quality sequences were assembled by SOAPdenovo2 [44] following the k-mer size = 54 with default parameters, then GC content was calculated. Identification of protein-coding region and gene prediction of the assembly were conducted through the homology-based prediction method by alignment to genomes of four related species, Triticum aestivum [21], Hordeum vulgare [45], Aegilops tauschii [23] and T. urartu [24], and E-value cutoff was set as 1e-5. The GeneWise software [46] was used to predict the exact gene structure of the corresponding genomic regions after removing redundancy. Finally, Trnascan-SE software [46] was applied to predict tRNA.

Identification and verification of SSRs
Repeat sequence annotation of the newly obtained genome sequence set of E. sibiricus was carried out by RepeatMasker [16] following the repeat sequence database of Gramineae [47]. Then PERL5 script microsatellite software (http://pgrc.ipk-gatersleben.de/misa/) was used to identify SSRs in the genomic DNA sequences. The recognition criteria are as follows: the number of single nucleotide repeats is 8 or more; the number of di-, tri-, tetra-, penta-and hexa-nucleotides repeats are all more than 5 [48]. The parameters setting of primer design was: 182 7 bp primer size, annealing temperature at 55-65°C, GC content at 30-70% and 100~300 bp final product length [17]. Using in silico analysis, the designed primers were mapped back onto the assembly sequence of 'Chuancao NO.2', and the SSR combined with only one site was regarded as potential single-locus SSR [49].
SSRs evaluation based on the genetic diversity of E. sibiricus germplasm Four hundred pairs of SSR markers were randomly selected for synthesis, then PCR and electrophoresis were performed for screening and validation. Primers with only 0-2 amplified bands were recognized as singlelocus SSR and those possessing polymorphism was named 'E. sibiricus genome assembly single locus' (ESGA-SL) marker. Analogously, primers with more than 2 amplified bands simultaneously polymorphic was called 'E. sibiricus genome assembly multi loci' (ESGA-ML) marker. In addition, 30 pairs of multi loci markers including fifteen G-SSR markers (ESGS [40]) developed based on magnetic bead enrichment, and fifteen pairs of EST-SSR markers (ES [31]) based on E. sibiricus transcriptome were also selected to amplified the same 27 E. sibiricus accessions DNA (Table S5, Supplementary file).
Amplified bands were recorded following genotypes (single locus marker) or 0/1 binary matrix (multi loci marker). Nei's genetic distance (GD) matrix among 27 accessions was calculated by Freetree [50] with 10,000 bootstrap value, and the UPGMA dendrogram was constructed then visualized in Figtree [50]. The principal coordinate analysis (PCoA) was performed via NTSYS v2.2 [51]. STRUCTURE v2.3.4 [52] was performed based on a Bayesian model to the illustration of genetic membership. The parameters were set to 50,000 burn-in and 100,000 Monte Carlo Markov chain (MCMC) with an admixture model. The STRUCTURE HARVESTER [53] was then applied to estimate the "optimum K". The hierarchical analysis of molecular variance (AMOVA) was carried out using GenAlEx [54] to calculate the number of alleles (N a ), effective alleles (N e ), Shannon diversity index (I), expected heterozygosity (H e ), polymorphic site proportion (PP) and other genetic diversity parameters. Finally, the observed heterozygosity (H o ) and gene flow (N m ) of single-locus SSRs were calculated.