Development of Chloroplast Genomic Resources in Chinese Yam (Dioscorea polystachya)

Chinese yam has been used both as a food and in traditional herbal medicine. Developing more effective genetic markers in this species is necessary to assess its genetic diversity and perform cultivar identification. In this study, new chloroplast genomic resources were developed using whole chloroplast genomes from six genotypes originating from different geographical locations. The Dioscorea polystachya chloroplast genome is a circular molecule consisting of two single-copy regions separated by a pair of inverted repeats. Comparative analyses of six D. polystachya chloroplast genomes revealed 141 single nucleotide polymorphisms (SNPs). Seventy simple sequence repeats (SSRs) were found in the six genotypes, including 24 polymorphic SSRs. Forty-three common indels and five small inversions were detected. Phylogenetic analysis based on the complete chloroplast genome provided the best resolution among the genotypes. Our evaluation of chloroplast genome resources among these genotypes led us to consider the complete chloroplast genome sequence of D. polystachya as a source of reliable and valuable molecular markers for revealing biogeographical structure and the extent of genetic variation in wild populations and for identifying different cultivars.


Introduction
Chinese yam (Dioscorea polystachya Turcz.) belongs to section Enantiophyllum in genus Dioscorea, which also includes economically important food yams of tropical origin such as D. alata (water yam) and D. rotundata (white guinea yam) [1]. It is allogamous with fleshy tuber, branched stems, papery to thinly leathery leaves, and its seeds are inserted near middle of capsule and winged all round [2]. Chinese yam originated in China and was domesticated in the Song Dynasty, dating back approximately 1000 years [3]. It has been used as a dietary food and as a traditional medicine for strengthening stomach function, alleviating anorexia, and treating diarrhea [4].
Nowadays, there are mainly 80 cultivars on the Chinese market [5]. For a long time, cultivated yams mainly rely on clonally propagated using vegetative propagation of tubers, which led to serious degradation [3]. Its production systems face the problem that the cultivars have the limited diversity during long-term vegetative reproduction [6]. Detailed analysis of the genetic diversity in this species is important, because an accurate assessment of the genetic structure and diversity of cultivated and wild yams can be invaluable in crop breeding for diverse applications [7]. For example, analysis of the genetic variability among cultivated and wild yams can facilitate understanding of the process of domestication followed by Chinese farmers to generate agricultural biodiversity. However, there is lack of adequate information on the diversity evaluation of Chinese yam. Providing the potential conservation approaches for sustainable use, thereby saving the genetic diversity of this species in nature, is important.

BioMed Research International
Molecular resources have recently been developed in Chinese yam. For example, random-amplified polymorphic DNA (RAPD), inter-simple sequence repeats (ISSR), intron sequence amplified polymorphism (ISAP), and sequence characterized amplified region (SCAR) markers have been used to examine the genetic relationships among different cultivars and identify the most popular cultivar [3,[8][9][10]. However, these markers have low diversity, stability, and reproducibility. The development of more effective genetic markers will be necessary to assess genetic diversity and identify cultivars.
Recently, the chloroplast genome has been developed with the availability of the next-generation sequencing [11]. The chloroplast genomes of higher plants harbor approximately 130 genes in a 120-160 kb sequence [12]. Chloroplast genomes usually have a circular structure consisting of two copies of the large inverted repeat (IR) region separated by small single-copy (SSC) and large single-copy (LSC) regions and exhibit highly conserved gene content and order [13]. The nucleotide substitution rate of chloroplast genes is lower than that of nuclear genes but higher than that of mitochondrial genes [14,15]. Most protein-coding genes (83 or 81 genes) have been used for phylogenetic analyses and have proven to be effective in resolving difficult phylogenetic relationships [16][17][18]. Noncoding regions are most likely to evolve faster than coding regions in the chloroplast genome, and, therefore, these mutation "hot spots" have been used to identify species and clarify relationships at lower taxonomic levels [19][20][21][22][23].
Chloroplast genomes are typically uniparentally inherited, which may greatly facilitate the use of chloroplast genome markers in plant population genetic studies [24]. Chloroplast genome markers, such as single nucleotide polymorphisms (SNPs) and simple sequence repeats (SSRs), have been used to monitor gene flow, population differentiation, and cytoplasmic diversity [25][26][27][28]. These chloroplast genome markers can also be applied to investigate domestication processes, such as the evolutionary history of Scutellaria baicalensis [29]. Another application of chloroplast genome markers is phylogeographical analysis, because the uniparental inheritance shows a clearer geographical structure than nuclear markers do [30]. The cultivars yam also is clonally propagated. Herein, we sequenced six wild D. polystachya genotypes from different geographical locations using the Illumina HiSeq platform. The first objective was to evaluate the intraspecific variation in this species, and the second objective was to obtain useful chloroplast molecular markers, including SNPs, SSRs, and indels, for evolutionary studies by comparing the chloroplast genomes. The genomic and marker resources developed in this study will not only reveal biogeographical structure and extensive population genetic variation in the wild populations of D. polystachya but also provide a molecular toolkit for cultivar identification.

Plant Materials, DNA Extraction, and Sequencing.
In total, six genotypes of D. polystachya were used (Table 1). Chinese yam was obtained from Hebei, Shandong, Henan, Beijing, Jiangsu, and Fujian, China, to represent the geographical distribution of this species. Voucher specimens were deposited in herbaria of the Institute of Chinese Materia Medica (CMMI), China Academy of Chinese Medical Sciences. Fresh leaves of each accession were immediately dried with silica gel prior to DNA extraction. Total genomic DNA was isolated from each individual plant using the mCTAB extraction protocol [31] and purified using the Wizard DNA Cleanup System (Promega, Madison, WI, USA). DNA samples were randomly fragmented into 400-600 bp lengths using an ultrasonicator. An Illumina paired-end DNA library with 500 bp insert size was constructed using a NEBNext5 Ultra6 DNA Library Prep Kit following the manufacturer's instructions. Paired-end sequencing (2 × 150 bp) was conducted on an Illumina HiSeq X Ten platform.

Assembly and Annotation.
The paired-end reads were qualitatively assessed and assembled using SPAdes 3.6.1 [32]. Chloroplast genome sequence contigs were selected from the initial assembly by performing a BLAST search using the Dioscorea elephantipes chloroplast genome sequence as a reference (GenBank accession number: EF380353). The selected contigs were assembled with Sequencher 5.4.5 (http://www .genecodes.com/). Gaps in the contigs were filled by PCR amplification and Sanger sequencing. The four junctions between the IRs and the SSC/LSC regions were checked by amplification with specific primers followed by Sanger sequencing [33]. The chloroplast genome annotation was performed with Plann [34] using the D. elephantipes reference sequence from GenBank. The chloroplast genome map was drawn using Genome Vx software [35].

Molecular Marker Development and Validation.
All sequenced D. polystachya chloroplast genomes were aligned using MAFFT v7 [36], assuming collinear genomes for the full alignment, and then adjusted manually using Se-Al 2.0 [37]. Variable and parsimony-informative base sites across the complete chloroplast genomes were calculated using MEGA 6.0 software [38].
Based on the aligned sequence matrix, the microstructural events were checked manually and were further divided into three categories: (i) SSR, (ii) non-SSR-related indels (common indels), and (iii) inverted sequences. Using the XSW genotype genome sequence as the standard reference, the size, location, and evolutionary direction of the microstructural events were counted. The proposed secondary structures of the inverted regions were analyzed using mfold software [39].

Phylogenetic Reconstruction.
Phylogenetic analysis was conducted using the chloroplast genome sequences of six genotypes of D. polystachya and four other Dioscorea species with available chloroplast genome sequences from GenBank (D. nipponica, D. villosa, D. zingiberensis, and D. elephantipes). Tacca chantrieri was used as an outgroup. Sequence alignments were carried out using MAFFT v7 [36] and then adjusted manually using Se-Al 2.0 [37]. We performed independent phylogenetic analyses using Bayesian inference (BI) and maximum likelihood (ML). RAxML version 8.0.20 was used for ML analyses with the GTR + G model. Node support values were determined with 500 rapid bootstrapping replicates. MrBayes 3.2.2 [40] was used to perform a BI analysis. The Markov chain Monte Carlo (MCMC) analysis was run for 2 × 5,000,000 generations. The average standard deviation of split frequencies remained below 0.01 after the fifty percent burn-in. The remaining trees were used to build a 50% majority-rule consensus tree.

Chloroplast Genome Sequencing, Characterization, and
Annotation. Using the Illumina HiSeq X Ten system, the total DNA from six genotypes of D. polystachya was sequenced to produce 47,638,574-70,997,840 paired-end raw reads (150 bp average read length) per genotype. After de novo and reference-guided assembly, the finished, highquality chloroplast genome sequences of these six genotypes of D. polystachya were obtained. The chloroplast genome sequences were deposited in GenBank ( Table 1). The D. polystachya chloroplast genomes ranged from 153,243 to 153,292 base pairs in length. The chloroplast genome can be divided into four regions: a pair of IR regions, a LSC region, and a SSC region. The overall GC content of the chloroplast genome was 37%, which is consistent with those of previously reported Dioscorea species [41]. The GC contents of the LSC and SSC regions were 34.8% and 30.9%, respectively, while that of the IR region was 42.9% (Table 1).
The D. polystachya chloroplast genome contained 18 intron-containing genes. Among them, sixteen genes had a single intron (ten protein-coding and six tRNA genes) and two genes (clpP and ycf3) contained two introns. The trnK-UUU gene had the largest intron, which contained the matK gene. The rps12 gene was trans-spliced, with the 5 end located in the LSC region and the duplicated 3 end in the IR region.

Numbers and Pattern of SNP Mutations.
The length of the alignment of the six chloroplast genomes was 153,497 bp. In total, 141 SNPs were detected, 84 of which were found in the LSC region, 7 in the IR region, and 43 in the SSC region (Table S1). A total of 134 of these SNPs were found in the IRs, 54 of which were in intergenic spacers, 70 in coding region, and 10 in intron regions. Twenty coding regions harbored SNPs; ycf1 had the highest number of SNPs (19), followed by rpoC2 (five), and rpoB (five). Five intron regions harbored SNPs (four in atpF, two in trnG and rpoC1, and one in trnV and rps16).
The pattern of SNP mutations is shown in Figure 2. A total of 88 transitions (Ts) and 53 transversions (Tv) were present, and the Tv to Ts ratio was 1 : 0.6, indicating a bias in favor of transitions. The most frequently occurring SNP mutations were from C to T and from G to A; mutations from C to G and from G to C exhibited the lowest frequency.

Microsatellites.
With MISA analysis, 66 SSR loci were detected in the D. polystachya chloroplast genome. These SSRs included 37 mononucleotide motifs, which ranged in length from 10 to 16 nucleotides, and 11 dinucleotide, 7 trinucleotide, 4 tetranucleotide, and 7 pentanucleotide SSRs ( Figure 3). Among the 48 mononucleotide and dinucleotide SSRs, 46 contained only A or T. Most SSRs were located in the noncoding portions of the LSC and SSC regions. After in silico comparative analysis, twenty-four SSR loci showed   polymorphisms among the six genotypes of D. polystachya ( Table 3). The clpP intron had the highest number of polymorphic SSRs (three), followed by matK-trnK and psbE-petL with two polymorphic SSRs. The other fifteen spacers and the rpl16 intron contained only one polymorphic SSR each (Table   S2). We designed primer pairs to amplify those SSRs and the other 42 SSR loci (Table S3).

Indels.
The indels involving SSR polymorphisms were filtered out of these analyses. We retrieved 44 common indels Intron-containing genes are marked by asterisks ( * ).    Length of inversion  Length of inverted repeat  FLW  TSW  YTW  XSW  NJW  MHW  trnK-matK  LSC  51  13  ----Inverted  -trnL intron  LSC  4 22  polystachya chloroplast genomes. The patterns were divided into six types as indicated by the six non-strand-specific base-substitution types (i.e., numbers of considered G to A and C to T sites for each respective set of associated mutation types). from the D. polystachya chloroplast genomes (Table S4). No indels were found in the coding regions. A total of 27 spacer regions harbored indels; the psbM-trnD and rbcL-accD spacer had the highest number of indels (three), followed by trnK-trnQ, psbI-trnS, trnS-trnG, petN-psbM, trnT-psbD, trnF-ndhJ, psbE-petL, and trnL-rpl32, all containing two indels. The other spacer regions contained only one indel (Table S4). Five indels were located in intronic regions, including the atpF (two indels) and clpP (three indels) introns. The sizes of the indels ranged from 1 to 28 bp, with one bp indels being the most common ( Figure 4). The largest indel, found in the atpF intron with a 28 bp length, was a deletion in the MHW genotype. The second longest, which was found in rbcL-accD, was an insertion in the YTW genotype. Finally, 13 insertion and 9 deletion indels were specific to the NJW genotype, 12 insertion and 5 deletion indels to YTW, one insertion in the psbZ-trnG region to XSW, and one insertion in trnL-rpl32 region and one deletion in atpF intron to MHW. Two deletions in petN-psbM and psbM-trnD independently occurred in the YTW and NJW genotypes.  (Table 4). Two inversions occurred in the LSC region and three in the SSC region. Inversions in the trnK-matK spacer, ndhA intron, and ndhD occurred in the NJW genotype, while an inversion in ccsA-trnL occurred in YTW. An inversion in the trnL intron occurred in the YTW and MHW genotypes.

Transversion
3.6. Phylogenetic Analysis. The phylogenetic position of D. polystachya within the genus Dioscorea was established using complete chloroplast genomes ( Figure 5). The chloroplast genome of Tacca chantrieri was used as the outgroup. The ML and BI trees reconstructed were congruent, and both phylogenetic trees had high support. The six Dioscorea species were grouped into two branches with 100% bootstrap support, and the NJW genotype was the earliest diverging lineage in D. polystachya. The XSW, TSW, and FLW genotypes formed a monophyletic clade.

Discussion
In this study, we obtained the chloroplast genomes of six D. polystachya genotypes using NGS methods, which provided important resources for the discovery of molecular markers. Understanding the genetic relationship of D. polystachya is vital to breeding programs and conservation strategies. The D. polystachya chloroplast genome exhibited a typical circular structure and was similar in genome size and GC content to the other published Dioscorea chloroplast genomes [41]. Using these chloroplast genome data, we were able to develop genetic resources, including SNPs, microsatellites (simple sequence repeats), indels, and small inversions, that constitute essential tools for studies of evolution, population genetics, and the origin of domestication in this species. This information will facilitate the establishment of an effective DNA-barcoding-based identification method and provide valuable markers to study the population genetics of D. polystachya.
Among the six genotypes examined, only 141 SNPs were detected. Despite the higher AT content in chloroplast genomes, AT to TA and GC to CG transversions were found to occur significantly less frequently among the four types of transversions ( Figure 2). This result clearly indicates a bias in chloroplast genome evolution. In general, most SNPs occurred in the noncoding regions of the D. polystachya plastid genomes, which may undergo less natural selection. However, no significant difference was present in the distribution of mutations among the genome regions (Table S1). Variations in mutation rates can be related to the function of genes. ycf1 had the highest number of SNPs (19) in the D. polystachya chloroplast genome, while atp, psa, and psb exhibited the lowest evolutionary rates (Table S1). The ycf1 gene is the second longest gene; it is essential for plant viability and encodes Tic214, a vital component of the Arabidopsis TIC complex [42]. The two parts of ycf1 in the  T  A  TA  AT  TC  GA  TAT  ATA  TTA  TCT  TTC  TAA  AGA  TATT  TTTA  AGAT  AATG  TATAT  ATATA  TATTT  TTAAT  TCTTG  AAATA TTTCA 0  Length of indels (dp) SSC region (ycf1a and ycf1b) were highly variable in flowering plants [19,43] and are suitable as markers for phylogeny and species identification [44].
Moreover, indels are another important class of genetic variation. A total of 43 common indels were identified in the D. polystachya chloroplast genomes, all in noncoding regions (Table S3). The indel sizes ranged from 1 to 28 bp. According to our results, the mutation rates of these indels were lower than those of nucleotide substitutions. Most indels were specific to individual genotypes, and many were informative for evolutionary studies. trnL-F, rbcL-accD, and trnS-trnG constitute the most frequently applied markers in plant molecular systematics and DNA barcoding [45][46][47]. As in previous reports, the variable regions psbM-trnD and rbcL-accD contained the most indels in these D. polystachya chloroplast genomes [19]. Adding indels to phylogenetic analyses significantly increases resolution and support compared to simple substitution-based matrices of chloroplast DNA sequences [48].
SSRs, which consist of tandemly repeated motifs of six base pairs (bp) or less, have become widely used as chloroplast genome markers due to their ability to generate highly informative DNA markers. The most common types are mononucleotide repeats, ranging in size from 10 to 15 nucleotides; the occurrence of di-, tri-, tetra-, penta-, and hexanucleotide repeats is less common [28]. After in silico comparative analysis, we identified 24 SSR loci showing polymorphisms, which may allow investigation of spontaneous gene flow among wild and domesticated D. polystachya and phylogeographical studies. Because chloroplast genome sequences are highly conserved, chloroplast genome SSRs are transferable across species; thus, these loci can likely be used in studies of other Dioscorea species [28].
In this study, we identified SNPs, indels, microsatellites, and small inversions in Chinese yam by comparative analyses of six chloroplast genomes. These resources will allow the identification of commercial cultivars of Chinese yam and the determination of their purity. Furthermore, chloroplast genomic resources are important for further studies of domestication, population genetics, and phylogenetic analysis, possibly in combination with other informative molecular markers from the mitochondrial and/or nuclear genomes. Table S1: the patterns of single nucleotide substitutions among the chloroplast genomes of six D. polystachya genotypes. Table S2: polymorphic SSR loci identified in the analyzed material. Positions, locations, types, and polymorphisms are shown. Table S3: primers for the other 42 SSR loci identified in the analyzed material. Table S4: polymorphic indels identified in the analyzed material. Indel events are reported for each of the six D. polystachya genotypes. (Supplementary Materials)