EST-SSR marker development based on RNA-sequencing of E. sibiricus and its application for phylogenetic relationships analysis of seventeen Elymus species

Background Elymus L. is the largest genus in the tribe Triticeae Dumort., encompassing approximately 150 polyploid perennial species widely distributed in the temperate regions of the world. It is considered to be an important gene pool for improving cereal crops. However, a shortage of molecular marker limits the efficiency and accuracy of genetic breeding for Elymus species. High-throughput transcriptome sequencing data is essential for gene discovery and molecular marker development. Results We obtained the transcriptome dataset of E. sibiricus, the type species of the genus Elymus, and identified a total of 8871 putative EST-SSRs from 6685 unigenes. Trinucleotides were the dominant repeat motif (4760, 53.66%), followed by dinucleotides (1993, 22.47%) and mononucleotides (1876, 21.15%). The most dominant trinucleotide repeat motif was CCG/CGG (1119, 23.5%). Sequencing of PCR products showed that the sequenced alleles from different Elymus species were homologous to the original SSR locus from which the primer was designed. Different types of tri-repeats as abundant SSR motifs were observed in repeat regions. Two hundred EST-SSR primer pairs were designed and selected to amplify ten DNA samples of Elymus species. Eighty-seven pairs of primer (43.5%) generated clear and reproducible bands with expected size, and showed good transferability across different Elymus species. Finally, thirty primer pairs successfully amplified ninety-five accessions of seventeen Elymus species, and detected significant amounts of polymorphism. In general, hexaploid Elymus species with genomes StStHHYY had a relatively higher level of genetic diversity (H = 0.219, I = 0.330, %P = 63.7), while tetraploid Elymus species with genomes StStYY had low level of genetic diversity (H = 0.182, I = 0.272, %P = 50.4) in the study. The cluster analysis showed that all ninety-five accessions were clustered into three major clusters. The accessions were grouped mainly according to their genomic components and origins. Conclusions This study demonstrated that transcriptome sequencing is a fast and cost-effective approach to molecular marker development. These EST-SSR markers developed in this study are valuable tools for genetic diversity, evolutionary, and molecular breeding in E. sibiricus, and other Elymus species. Electronic supplementary material The online version of this article (10.1186/s12870-019-1825-8) contains supplementary material, which is available to authorized users.


Background
The tribe Triticeae Dumort., an economically important tribe in the grass family (Poaceae), contains three important cereal crops closely related to human life and civilization, namely, wheat (Triticum aestivum L.), barley (Hordeum vulgare L.) and rye (Secale cereale L.), as well as many economically valuable forage grasses, such as crested wheatgrass (Agropyron cristatum L.), bottlebrush grass (Hystrix patula), sheepgrass [Leymus chinensis (Trin.) Tzvel.], and Siberian wildrye (Elymus sibiricus L.) [1]. Elymus L. is the largest genus in the tribe Triticeae Dumort., encompassing approximately 150 polyploid perennial species widely distributed in the temperate regions of the world [2,3]. The diverse origins and ecological habitats may cause genetic variation among and within different Elymus species and populations that could be captured and used in plant breeding and improvement programs [4,5].
As one of the main mode of speciation, polyploidy (autopolyploidy and allopolyploidy) is a natural hybridization process [6]. It is well known that Elymus arose through hybridization between representatives of different genera [2,7]. Previous cytological studies suggested that five basic genomes (St, H, Y, P and W) donated by different diploid species constitute Elymus species [8]. Different genomic constitution and the long-term differentiation among different species have made Elymus form a pattern of reticulate evolution and possess rich genetic basis than their diploid parents [9,10]. The Elymus species thus are ideal materials for studying the formation mechanism and evolution pattern of polyploidy species. Furthermore, Asia is the main origin and genetic differentiation center of Elymus [11,12]. Many Elymus species like E. sibiricus and E. nutans occur naturally in Eastern Mongolia, the Himalayas, and Western and Northern China [13]. Previous studies mainly focused on the classification between Elymus and their relatives, the origin determination of the ancestor species including St, H and Y genome, and the evolution relationships between basic genomes based on cytogenetics and molecular sequences [12,[14][15][16]. Information on molecular phylogeny and genetic structure among different Elymus species is limited, but necessary for germplasm collection, conservation and utilization. The phylogenetic relationships and genetic components among different Elymus species are largely unknown because of the limited detection ability of specific or unique nuclear gene sequences [9,10,12,[17][18][19]. Genetic diversity analysis among phylogenetically related species is based on the development of transferable and orthologous molecular markers. Expressed sequence tag-derived simple sequence repeat markers (EST-SSRs) are the markers of choice, because they are abundant, co-dominant, highly polymorphic, and are easily transferable among phylogenetically related species [13]. In previous studies, three expressed sequence tag (EST) libraries were developed and annotated for Pseudoroegneria spicata, a mixture of both Elymus wawawaiensis and E. lanceolatus, and a Leymus cinereus × L. triticoides interspecific hybrid [20]. EST-SSR primers developed from three perennial diploid Triticeae species were used to produce amplicons in these three species, and EST-SSR primers derived from Thinopyrum bessarabicum and Th. elongatum had greater transferability to each other than those derived from the St-genome Pseudoroegneria spicata due to close relationship between J b and J e genomes [21]. Moreover, EST-SSR markers are critical for genetic relationship analysis, genetic mapping, and DNA fingerprinting for many crops and forage grasses [13,[22][23][24][25][26][27]. Mott et al. developed simple sequence repeat markers based on Elymus expressed sequence tag sequence, and used a subset of the 23 most polymorphic SSRs to analyze genetic diversity of seven North American Elymus, Pseudoroegneria and Pascopyrum species [28].
Recently, the advent of next-generation sequencing (NGS), especially de novo transcriptome sequencing, had provided a cost-effective approach to identify microsatellite loci [23,29,30]. You et al. identified 5278 SSRs in taro (Colocasia esculenta) transcriptome data, and finally used 62 polymorphic markers for taro genetic diversity study [31]. Our previous transcriptome sequencing in E. sibiricus have generated and identified 185,523 unigenes and more than 30,000 differentially expressed transcripts (DETs), which provided important genetic resources and sequence information for developing EST-SSR markers in this study [32].
The objectives of the study were to (i) develop EST-SSR markers from E. sibiricus transcriptome sequencing and verify the transferability among different Elymus species; (ii) to evaluate the genetic diversity and genetic relationships among 17 polyploidy Elymus species based on the developed EST-SSR markers; (iii) to elucidate the phylogenetic relationships and genetic differentiation between StH, StY and StHY genome in Elymus species. Characterization of the genetic components among and within Elymus genomes will contribute to the understanding of the origin and evolution mechanism in polyploidy species.

Frequency and distribution of EST-SSR markers
A total of 4,598,845 contigs were obtained after de novo assembly. The total number of unigenes with paired-end reads was 135,433. The total length of the unigenes was 97,059,505, with an average length of 716.66 bp and N50 value of 1269. Among the 135,433 unigenes, the length of 109,476 unigenes (80.83%) ranged from 200 to 1000 bp, the length of 21,630 (15.97%) ranged from 1000 to 3000 bp, and 4327 unigenes (3.20%) were more than 3000 bp. The length distribution of the unigenes was shown in Additional file 1: Figure S1. Of the 135,433 unigenes, 57,756 (42.65%) unigenes were successfully annotated in at least one database (Table 1). According to Nr annotation, most BLAST hits were detected from Aegilops tauschii (22.38%), followed by Hordeum vulgare (18.01%) and Triticum urartu (12.88%) (Additional file 2: Figure S2).

Development and transferability of novel EST-SSRs
Two hundred EST-SSR primer pairs were designed and selected to amplify ten selected DNA samples. Eighty-seven pairs of primer (43.5%) generated clear and reproducible bands with expected size, and showed good transferability across different Elymus species. Forty-four primers (22%) generated non-polymorphic bands. Eighteen primers (9%) produced bands with unexpected size, and the remaining fifty-one (25.5%) primers didn't produce any bands. Finally, a total of thirty polymorphic EST-SSRs were used for the genetic diversity analysis of 480 Elymus individual plants.

Verification of repeat motif types across different species
To determine the authenticity of EST-SSR primers, amplicons from 17 different Elymus species for two EST-SSRs were sequenced. In all of the cases, the sequenced alleles from different Elymus species were homologous to the original SSR locus from which the primer was designed. Marker polymorphisms among the 17 Elymus species were found due to variation in number of repeats of SSR motifs. According to the sequencing results of expected bands generated from primer c11036, E. sibiricus had five TAG repeats, the remaining sixteen species had two TAG repeats. For primer c69822, E. antiquus had eight GCA repeats, E. nutans, E. ciliaris and E. panormitanus had three GCA repeats, E. sibiricus had five GCA repeats, and the remaining twelve Elymus species had six GCA repeats (Figs. 2 and 3).

Genetic diversity analysis of StH, StHY, and StY genome combinations
Thirty EST-SSR primers generated 572 bands. The number of amplified bands ranged from 5 (c11036) to 25 (c66150, c67290 and c68713), with an average of 19.1 ( Table 3). The percentage of polymorphic bands among Elymus species was 100%. The expected heterozygosity (He) ranged from 0.73 to 0.95, with an average of 0.92. The observed heterozygosity (Ho) ranged from and 0.89 to 1.00, with an average of 0.98. The polymorphism information content (PIC) value ranged from 0.28 (c61134 and c69822) to 0.43 (c68713), with an average of 0.36.
In general, hexaploid Elymus species with genomes StStHHYY had a relatively higher level of genetic diversity (Na The cluster analysis showed that all ninety-five accessions were clustered into three major clusters (Fig. 4).   In addition, analysis of molecular variance (AMOVA) was used to evaluate variance components among and within different accessions, species and genomes. The results revealed that 10% of variation occurred among species, whereas 90% of genetic variation existed within species (90%) ( Table 5). Despite different genomes, more than 80% of total variance existed within Elymus accessions (86% for StH, 83% for StHY and 81% for StY, respectively), while less than 20% of genetic existed among accessions.
The Mantel test was used to investigate the correlations between genetic information and environmental factors, including geographic distance, latitude and altitude. The regression analysis with 9999 permutations showed a strong positive correlation between Nei's genetic distance and geographic distance (r = 0.2086, p < 0.01) (Fig. 6). There were no significant correlations between genetic diversity and latitude and altitude at the species level. A positive correlations were found between effective number of alleles (Ne), Nei's genetic diversity (H) and latitude (r =    (Fig. 7).

Genetic structure and genetic differentiation analysis
The genetic structure of 480 individuals from 95 Elymus accessions was analyzed using STRUCTURE software. Based on maximum likelihood and K (ΔK) values, the optimal number of groups was two. As shown in Fig. 8a, E. sibiricus (StH) accessions could be easily separated from other sixteen Elymus species with StY and StHY genomes. We further investigated the internal genetic structure of these Elymus species. E. sibiricus accessions were assigned to four subgroups. Eleven StY genomes species were assigned to the same group. The remaining accessions with StHY genomes were assigned to other groups. There was no obvious relationship between geographic origin and genetic structure in Elymus accessions. For example, twenty-three individuals of five accessions from Geo-2 geographical groups were clustered into two groups.
Based on results from STRUCTURE analysis, genetic components of each accession was presented using pie graph (Fig. 8b). Based on the different genetic components within different genomes, the highest genetic distance was found between StH and StY genome (0.0416) (Fig. 8c). While StHY genome had a similar genetic distance with StH (0.0253) and StY (0.0208). Besides, we further investigated the probable ancestor origin among and within seven E. sibiricus and seven E. nutans groups (Fig. 8d). Geographical groups: Geo-1, Geo-4, Geo-5 and Geo-7 had more than 50% of same genetic components each. Particularly, E. sibiricus accessions from Xingjiang Tianshan (Geo-5) group shared 76% of genetic components. For E. nutans, some geographic groups had similar ancestor origin, for example, Geo-8 and Geo-9 originating from Eastern Qinghai-Tibetan Plateau shared more than 65% of genetic components (65.2% for Geo-8 and 75.8% for Geo-9). Accessions from Russia (Geo-14) had different ancestor origin compared with other geographical groups.

Discussion
The development of EST-SSRs based on E. sibiricus transcriptome database In E. sibiricus, marker-assisted selection (MAS) and molecular breeding lag behind other forage species due to the lack of effective molecular markers systems. Simple sequence repeat (SSR) markers are considered to be one of the most important marker systems for plant genetic and breeding studies due to their high polymorphism, high abundance, co-dominance, and genome-wide distribution. Compared with genomic SSRs, expressed sequence tag-derived simple repeat markers (EST-SSRs) are easily transferred among related species owing to the regions being more evolutionary conserved than non-coding sequences. Some previous studies reported the development of expressed sequence tags (ESTs) and simple sequence repeat (SSR) markers for several grasses  [33]. These EST-SSR markers derived from RNA-Seq can be used for genetic diversity analysis, genetic linkage map construction, and marker-assisted selection breeding, etc. [25,26]. There have been few reports on transcriptome analysis about E. sibiricus, the type species of Elymus genus. In this study, among 135,433 assembled unigene sequences, 8871 potential EST-SSRs were identified. The EST-SSR frequency was 4.94%, which was higher than sheepgrass (4.38%) and rice (3.57%) [30,34]. The distribution density was one SSR per 6.20 kb, which was higher than   [30,33,35,36]. Some possible explanations for the difference of the SSRs frequency in expressed sequence tags could be the different genetic basis of various plant species, SSR search criteria, as well as the mining tools used. In many organisms, the extensive distribution of trinucleotide repeats in coding sequences is a sign of the effects of selection, indicating that these SSRs were selected against possible frameshift mutations [30]. In this study, nucleotide repeat from mono-to hexa-nucleotide were detected in 8871 potential SSRs from 135,433 unigenes. Tri-nucleotide repeats were the most abundant SSR motifs. Particularly, CCG/CGG was predominant trinucleotide repeats, followed by AAC/GTT, which was similar with previous reports in rice, barley, wheat, and sheepgrass [30,37]. Kantety et al. reported that the di-nucleotide repeat motifs existed in similar frequencies in ESTs from various cereal species such as GA/CT and GC/CG [38]. However, in our study, AG/CT (40.9%) was the most abundant di-nucleotide repeat motif in E. sibiricus, which was the same as in annual ryegrass [39]. Our results suggested that most frequent motif repeats might vary between forage grasses and other cereal species. In addition, these ETS-SSR markers developed from E. sibiricus transcriptome data showed good transferability across different Elymus species, suggesting these EST-SSR

Phylogenetic relationship of StH, StHY and StY genome combinations
Elymus is a diverse, geographically widespread allopolyploid genus which includes multiple distinct genomic combinations. Cytological studies suggest that five basic genomes, namely, the St, Y, H, P and W in various combinations constitute Elymus species [11]. Of the five basic genomes, the St genome derived from Pseudoroegneria is a fundamental genome that exists in all Elymus species [40]. The H, P and W genomes are derived from the genera Hordeum, Agropyron and Australopyrum of  (Table 4); (c) The genetic distance among the StH, StHY and StY genomes. At K = 8, the proportion of each genome was described by using the pies, of which the protruding sectors belonged to the genome itself; (d) The mean ancestry in each of the eight clusters among 14 geographic groups of E. sibiricus and E. nutans. The percentage of the largest proportion was showed in the graph Triticeae, respectively [12]. However, the accurate origin of the Y genome has not yet been identified, although this genome is present in the majority of the Asiatic Elymus species. Previous studies indicated that the StH genome Elymus species is allotetraploid that combines the genomes of Pseudoroegneria (St) and Hordeum (H) [2,10]. The heterologous hexaploid species, StHY, may undergo two hybridization events, the combination of the St and Y genomes formed tetraploid StY genome, the second hybridization event involved in the combination of the StY and H genome [12,18]. Genetic diversity analysis in different genomic combinations will facilitate the understanding of the evolution process and genetic differentiation among different species. The phylogenetic relationships of the StH, StY and StHY genome Elymus species have been reported by using molecular sequences [9,10,12,[17][18][19].  [42]. Hybridization and polyploidization are the major driving force in the diversity and evolution of the genus Elymus. Hybridizations between different ancestral diploid genera had formed the novel allopolyploid species [7,17]. As a major mechanism of evolution and speciation, polyploidy Elymus species form diverse genotypes and phenotypes to adapt to the different ecological niches (especially in high altitude and high latitude regions) by inducing genomic replication, gene expression, and increasing the complexity of regulatory networks [43,44]. Although these Elymus species have the different genome combinations, gene flow (Nm) existed between different species (16.07 for StH and StHY, 6.65 for StH and StY, and 11.72 for StHY and StY), suggesting that no strict reproductive barriers exist among the three genomes.

Conservation implications
Phenotypically and genetically diverse germplasm is a potentially valuable source for the improvement of the desired agronomic trait [13]. Wild germplasm could provide advantageous alleles like improved stress tolerance, forage quality, and higher yield for modifying currently used cultivars by hybridization and introgression. The collection and preservation of rich and specific germplasm resources of wild relative species will benefit the utilization of excellent traits and special resistance genes [1,4,5]. According to AMOVA analysis, larger genetic variation was found within Elymus species. This result was in agreement with previous genetic studies of Elymus species which found that the majority of variation was apportioned within populations or geographic regions [45]. Hence, a considerable amount of overall genetic variation of Elymus species could be captured when sampling a larger number of plants from Elymus population. Meanwhile, Mantel test indicated that a strong positive correlation between Nei's genetic distance and geographic distance (r = 0.2086, p < 0.01) was found for Elymus accessions, which suggested more genetic diversity and variation could be captured in the wide range of geographical regions. Based on our data, accessions from Qinghai-Tibetan Plateau and Tianshan mountain had high level of genetic diversity. Therefore, these wild accessions with rich genetic diversity could be used as important genetic resources for future Elymus breeding programs.

Conclusions
In

Development of EST-SSR markers derived from transcriptome of E. sibiricus
Our previous study had constructed cDNA libraries and sequenced abscission zone tissue samples of E. sibiricus based on next-generation sequencing (NGS) [32]. The obtained raw reads (NCBI SRA: SRX2617497) were preprocessed to filter adaptor sequences, low-quality sequences, and reads with quality less than Q30 using the FASTX toolkit. The Trinity program was employed to assemble the de novo transcriptome clean reads [46]. The assembled unigene sequences were directly identified in simple sequence repeat identification tool program (MicroSatellite), of which the parameters were set for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide motifs as a minimum repeat number of 12, 6, 5, 5, 4 and 4, respectively. The EST-SSRs were designed using Primer 3 (http://primer3.sourceforge.net) based on the MISA result.

DNA extraction and genotyping
Seeds of all accessions were germinated in a greenhouse (25/15°Cday/night temperature) until 8 weeks old. Young leaf tissues of 480 individuals were collected for genomic DNA extraction (sodium dodecyl sulfate, SDS methods) [13]. Each accession was represented by 1 to 9 individuals, with an average of 5.1 (details in Table 4).
The quantity and quality of DNA samples were determined using the NanoDrop ND1000 spectrophotometer (Thermo Scientific, Waltham, MA, USA) and agarose gel electrophoresis, then diluted to 25 ng/μL and stored at − 20°C. Two hundred EST-SSR primer pairs were randomly selected from 8871 potential SSRs and were synthesized by Shanghai Sangon Biological Engineering Technology (Shanghai, China). A total of 20 individual plants from 17 Elymus species were selected for primer screening. Each primer was amplified twice to check whether it produce clear and reproducible bands. Finally, EST-SSRs with high transferability, polymorphism and repeatability were used to genotype 480 individual plants. The PCR amplification and EST-SSR genotyping as well as eletrophoresis were carried out as described by Xie et al. [47].

Data analysis
The amplified bands were considered as present (1) and absent (0), and only clear and reproducible bands were considered. The expected heterozygosity (He), observed heterozygosity (Ho) and polymorphism information content (PIC) value were calculated as the previous methods [13,48]. The expected heterozygosity formula is as follows: He = 1-∑pi 2 , where pi is frequency of the ith allele. The number of heterozygotes is determined by direct count method. PIC was calculated for each primer according to the formula: PIC = 1p 2q 2 , where p is frequency of present band and q is frequency of absent band. Genetic diversity parameters including observed number of alleles (Na), effective number of alleles (Ne), Nei's genetic diversity (H), Shannon's information index (I) and the percentage of polymorphic loci (% P) were calculated by using POPGENE v 1.31 program (Edmonton, AB, Canada) [49]. A neighbor-joining (NJ) tree was displayed by using of MEGA v 5 software based on the operations supported in PowerMarker v 3.25, of which the probabilities for each node was assessed by bootstrap analysis using 1000 replicates [50,51]. A principal coordinates analysis (PCoA) was constructed in GenAlEx v 6.5 [52]. Correlations between pairwise genetic distance and adjusted pairwise geographic distance were calculated by GenAlEx v 6.5 based on the mantel test with 9999 permutations. Person relation analysis was used to test the correlations between genetic parameter (Ne, effective number of alleles and H, Nei's genetic diversity) and environmental factors (latitude and altitude). The analysis of molecular variance (AMOVA) was used to investigate the total genetic variation among genomes, within species and within populations using GenAlEx v 6.5. The program STRUCTURE v 2.3.4 was used to analyze the genetic structure of four hundred and eighty individuals using a Markov chain Monte Carlo (MCMC) algorithm. Assuming an admixture model sample and correlated allele frequencies, 20 independent runs were performed for all K values (ranged from 1 to 11), each with 10,000 MCMC interactions and 10,000 replications. The delta K method was employed to determine the optimal K value for all the data set [53].