Comparative analysis of chloroplast DNA sequences of Codonopsis lanceolata and Platycodon grandiflorus and application in development of molecular markers

Codonopsis lanceolata and Platycodon grandiflorus (order Asterales) originate from East Asia. Despite the high commercial availability of C. lanceolata and P. grandiflorus, limited genetic research has been performed on these plants. We applied a target enrichment method to detect genetic diversity in C. lanceolata and P. grandiflorus and recovered their chloroplast genomes from total DNA sequence data. The recovered chloroplast DNAs (cpDNAs) were 61,154 bp (C. lanceolata) and 81,214 bp (P. grandiflorus) in length. Sixteen simple sequence repeats and 15 long repeat sequences were identified, which are potentially useful as markers in both plant species. We surveyed the phylogenetic relationships with increased resolution in 14 plant species, including 8 other species from the order Asterales and 4 from the order Apiales. In addition, we demonstrated the utility of the recovered chloroplast genomes by developing cpDNA markers that can verify the authenticity of food products at the DNA level of plant species.


Introduction
Asterales is an order of dicotyledonous flowering plant species and includes 11 families that evolved from one common ancestor. Many herbaceous plant species of Asterales are used as spices and traditional medicines. In previous phylogenetic research, Asterales have been included in the Campanulid clade, which is similar to the Euasterids II group of the Angiosperm Phylogeny Group III as one type of taxonomic classification (Bremer et al. 2002). The molecular evidence suggests that Asterales originated c. 100 Ma B.P. in the Cretaceous (Bremer and Gustafsson 1997).
In Korea, Codonopsis lanceolata, order Asterales, is known as Deodeok. Its roots are used in traditional food dishes. Platycodon grandiflorus, order Asterales, is a herbaceous flowering perennial plant, known as Doraji in Korea, and its roots are a recognized ingredient in salads and traditional cuisine. Both plants originate in East Asia and are used as natural and traditional medicines because of their functional compounds. In addition, Panax ginseng, order Apiales, has been utilized in China as a traditional medicine for the treatment of ailments related to the central nervous system and the endocrine and adrenocortical systems, and to control blood pressure and diabetes (Nah et al. 2007; Kim 2012). The complete chloroplast genome of Korean ginseng (Panax schinseng Nees) is 156,318 bp and that of North American ginseng (Panax quinquefolius) is 156,359 bp (Kim and Lee 2004; Han et al. 2015). Also, Zhao et al. (2015) reported a comparative analysis of chloroplast DNAs (cpDNAs) for four P. ginseng strains, Damaya, Ermaya, Gaolishen, and Yeshanshen, in which the complete chloroplast genome size (156,354 bp) of P. ginseng Damaya was equal to the total cpDNA length of P. ginseng Ermaya and Gaolishen, unlike that of P. ginseng Yeshanshen. Different minor allele sites in the large single copy and inverted repeat regions of the chloroplast genome were also identified, suggesting an inferred evolutionary relationship among these four P. ginseng strains. However, high-resolution studies of evolutionary relationships using numerous sequences have not been reported for C. lanceolata and P. grandiflorus chloroplast genomes.
Recently, the consumption of functional foods for improving health and wellness has increased. However, commercial fraud related to food products has become a concern because of differences in the cost of food materials. For example, the price of ginseng is generally higher than that of Deodeok or Doraji, so these cheaper materials are used in products that are marketed as genuine ginseng. Consumer concerns about adulterated ginseng food products, in which ginseng is mixed with cheaper materials such as Deodeok and Doraji, have therefore increased. However, it is difficult to discriminate the ingredients of ginseng products visually, especially in ginseng powders. Molecular markers can be used to determine the authenticity of food at the DNA level of plant species. For example, molecular markers derived from rpoB and rpoC2 successfully detected the cpDNA of rice grain flour in different mixed-flour samples using quantitative real-time PCR (qRT-PCR) (Hwang et al. 2015). Furthermore, Moon et al. (2016) distinguished individual species among five plants, rice, barley, adlay, wheat, and maize, using a cpDNA marker developed from rpoC2, which was used for detecting the cpDNA of each plant species in seven commercial mixed-flour products. Molecular markers are developed on the basis of sequence differences; however, the genetic information of C. lanceolata and P. grandiflorus mostly remains unknown, even though the chloroplast genome sequence of P. ginseng has been completed (Zhao et al. 2015).
Photosynthesis, which occurs in chloroplasts, is an important process for energy production in green plants. Chloroplasts contain a genome ranging in size from 120 to 170 kb (Clegg et al. 1994). The entire nucleotide sequence of cpDNA has been widely used for phylogenetic studies and molecular marker development because of its highly conserved genetic structure. Bremer et al. (2002) reported that three coding and three non-coding cpDNA markers provided improved resolution among the Asterids at higher taxonomic levels. Recently, the rapid development of next-generation sequencing techniques has provided a comprehensive methodology for monitoring genetic diversity in plant species. For example, the complete chloroplast genome sequences of P. ginseng obtained from DNA sequencing (DNA-seq) provided genome-wide genetic information, such as the phylogenetic relationships and structural diversity of cpDNA among various plant species (Zhao et al. 2015).
Here, we performed DNA-seq of C. lanceolata and P. grandiflorus and applied a target enrichment method for convenient cpDNA recovery from total DNA sequence data. Phylogenetic analysis was performed using 12 chloroplast genome sequences of Asterales, including C. lanceolata and P. grandiflorus, and 4 sequences of Apiales, including P. ginseng. To develop molecular markers for discriminating the three species, we selected two gene families, rpo and ndh, which had 191 and 20 segregating sites, respectively, and then designed gene-specific primers for quantitative real-time polymerase chain reaction (qRT-PCR). Finally, the specific primers were tested on total genomic DNA isolated from three plant species, C. lanceolata, P. grandiflorus, and P. ginseng.

Materials and methods
DNA sequencing of C. lanceolata and P. grandiflorus
Seeds of C. lanceolata and P. grandiflorus were obtained from the Gyeonggi-do Agricultural Research and Extension Services, and seedlings were grown in a growth chamber at 25°C for 4 weeks. Total DNA was isolated from the leaves of 2-week-old seedlings of C. lanceolata and P. grandiflorus using the CTAB method (Saghai-Maroof et al. 1984), and the extracted DNA was sequenced (DNA-seq) on an Illumina HiSeq2000 following the manufacturer's instructions.
Recovery of the chloroplast genome from the total DNA sequence data
To recover chloroplast genome sequences from DNA-seq data, we used a modified target enrichment method (Mandel et al. 2014). In summary, for quality control of raw data, the FASTQ files obtained from DNA-seq were filtered using the PRINSEQ tool with a minimum read length of 40, a minimum quality score of 15, a minimum 3′-end poly-N tail length of 10, and a maximum ambiguous base (N) percentage of 20 (http://prinseq.sourceforge.net/index.html). The filtered reads were annotated against the complete chloroplast genomes of ten plant species from the order Asterales obtained from NCBI (http://www.ncbi.nlm.nih.gov/) using a BLAST search, and the annotated reads identified as potential cpDNA sequences of C. lanceolata and P. grandiflorus were then collected using a PERL script. The collected cpDNA reads were used for de novo assembly with VelvetOptimiser (http://www.vicbioinformatics.com/software.velvetoptimiser.shtml). The assembled contigs were rearranged to produce a single chromosome sequence using the ABACAS tool (Assefa et al. 2009). The Campanula takesimana chloroplast genome was used as the reference sequence for contig assembly because, within Asterales, its cpDNA has higher genetic similarity to that of C. lanceolata and P. grandiflorus. The genes of the rearranged chloroplast genomes were annotated using the cpGAVAS tool with default parameters (Liu et al. 2012).
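The read-filtering thresholds described above can be illustrated with a short Python sketch. This is not PRINSEQ itself and does not reproduce its options or exact behavior; it is a simplified illustration under stated assumptions: Phred+33 quality encoding, the quality threshold applied to the mean read quality, and the poly-N tail trimmed before the other filters are applied. The function names and the example reads are hypothetical.

```python
def trim_poly_n_tail(seq, qual, min_tail=10):
    """Remove a 3'-end run of N bases if it is at least min_tail long."""
    stripped = seq.rstrip("Nn")
    if len(seq) - len(stripped) >= min_tail:
        return stripped, qual[:len(stripped)]
    return seq, qual


def passes_filters(seq, qual, min_len=40, min_mean_q=15, max_n_pct=20, offset=33):
    """Apply the length, mean-quality, and ambiguous-base thresholds."""
    if len(seq) < min_len:
        return False
    # Assumption: Phred+33 encoding and a threshold on the mean quality.
    mean_q = sum(ord(c) - offset for c in qual) / len(qual)
    if mean_q < min_mean_q:
        return False
    n_pct = 100.0 * seq.upper().count("N") / len(seq)
    return n_pct <= max_n_pct


def filter_reads(records):
    """Yield (name, seq, qual) tuples that survive trimming and filtering."""
    for name, seq, qual in records:
        seq, qual = trim_poly_n_tail(seq, qual)
        if passes_filters(seq, qual):
            yield name, seq, qual


# Hypothetical reads used only to exercise each rule.
reads = [
    ("good", "A" * 50, "I" * 50),               # Q40 at every base
    ("n_tail", "C" * 45 + "N" * 12, "I" * 57),  # tail trimmed, 45 bases kept
    ("short", "G" * 30, "I" * 30),              # below the 40-base minimum
    ("low_q", "T" * 50, "#" * 50),              # Q2 at every base
]
kept = [name for name, _, _ in filter_reads(reads)]
print(kept)
```

In this toy input only the first two reads are retained; the other two fail the length and quality thresholds, respectively.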

Identification of repeat sequences
Simple sequence repeats (SSRs) in C. lanceolata and P. grandiflorus chloroplast genomes were identified using a microsatellite identification tool (http://pgrc.ipk-gatersleben.de/misa/) with minimum repeat numbers of 10, 5, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotides, respectively. Also, the long repeat sequences of each chloroplast genome were detected using the REPuter tool with a minimum repeat length of 30 bp (Kurtz and Schleiermacher 1999). The genes containing SSRs and long repeat sequences were manually identified using a general feature format file obtained from cpGAVAS (Liu et al. 2012).
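The repeat-number thresholds above can be made concrete with a minimal Python sketch of an SSR search. It is not the microsatellite identification tool used in this study; it is a simplified regular-expression scan that applies the same minimum repeat numbers (10, 5, 4, 3, 3, and 3). The function names and the test sequence are hypothetical, and edge cases such as compound or interrupted repeats are not handled.

```python
import re

# Minimum repeat numbers for mono- to hexa-nucleotide motifs (from the text).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_ssrs(seq):
    """Return a sorted list of (1-based start, motif, repeat count)."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A motif of length k followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if not is_primitive(motif):
                continue  # e.g. "TT" is reported as the mononucleotide "T"
            hits.append((m.start() + 1, motif, len(m.group(0)) // k))
    return sorted(hits)


# Synthetic sequence containing three repeats, not real cpDNA.
demo = "GGC" + "T" * 13 + "CAGC" + "AT" * 7 + "GCC" + "GAA" * 4 + "C"
for start, motif, repeats in find_ssrs(demo):
    print(start, "(%s)%d" % (motif, repeats))
```

Skipping non-primitive motifs keeps a run such as thirteen T bases from being reported a second time as a di-, tri-, or tetranucleotide repeat.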

Codon usage and phylogenetic analysis
We determined the codon frequency and relative synonymous codon usage in C. lanceolata and P. grandiflorus using the SeqinR R package with default parameters (Charif et al. 2005). A total of 14 different chloroplast genomes obtained from NCBI, as well as those of C. lanceolata and P. grandiflorus, were used to construct a phylogenetic tree using the progressive alignment method of MAUVE software (Darling et al. 2004). The constructed tree was visualized with MEGA5 (Tamura et al. 2011), and Tajima's D of protein-encoding genes was calculated using MEGA5 (Tajima 1989).
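For readers unfamiliar with the Tajima test, the following Python sketch computes the statistic from an alignment using the standard formulas of Tajima (1989). It is an illustration only and is not the MEGA5 implementation; the assumptions are that the sequences are already aligned and of equal length, and that columns containing gaps or ambiguous bases are simply skipped. The function name and the toy alignments are hypothetical.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Return (S, k, D): segregating sites, mean pairwise differences, Tajima's D."""
    n = len(seqs)
    if n < 4:
        raise ValueError("at least 4 sequences are needed for a non-zero variance")
    if len(set(len(s) for s in seqs)) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: only columns made of unambiguous bases are analysed.
    columns = [col for col in zip(*(s.upper() for s in seqs))
               if set(col) <= set("ACGT")]
    seg_sites = sum(1 for col in columns if len(set(col)) > 1)
    n_pairs = n * (n - 1) / 2
    k = sum(sum(a != b for a, b in combinations(col, 2))
            for col in columns) / n_pairs
    if seg_sites == 0:
        return seg_sites, k, None  # D is undefined without polymorphism
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return seg_sites, k, (k - seg_sites / a1) / sqrt(variance)


# Toy alignments: two singleton sites versus two evenly split sites.
singletons = ["ACGT", "ACGA", "ACTT", "ACGT"]
balanced = ["ACGT", "ACGT", "ACTA", "ACTA"]
print(tajimas_d(singletons))
print(tajimas_d(balanced))
```

An excess of rare variants, as in the first toy alignment, gives a negative D, whereas variants at intermediate frequency, as in the second, give a positive D.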

Molecular marker development
To develop molecular markers for C. lanceolata, P. grandiflorus, and P. ginseng, we selected the ndhF and rpoA genes as candidate genes and designed allele-specific PCR primers on the basis of sequence differences (Fig. 2). Total DNA was extracted from 2-week-old seedling leaves of C. lanceolata, P. grandiflorus, and P. ginseng using the CTAB method (Saghai-Maroof et al. 1984) and then used for qRT-PCR. qRT-PCR was performed using a Real-Time System with a 20 µL reaction mixture containing 10 ng of template DNA, 10 µL of SYBR® Green TOPreal qPCR 2× PreMIX (Enzynomics™, Daejeon, Korea), and 10 pmol ndhF-targeted PCR primers. The qRT-PCR conditions were as follows: 10 min at 95°C, followed by 30 cycles of 10 s at 95°C, 30 s at 53-60°C, and 20 s at 72°C. The PCR products were loaded on a 2% agarose gel, and a photograph of the gel was taken using the Molecular Imager® Gel Doc™ XR+ System (Bio-Rad, Hercules, California, USA). The primer efficiency was determined using the formula of Bustin et al. (2009), and the slope value of the regression line was calculated by the linearity test (Ramakers et al. 2003).
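The linearity test and the efficiency calculation can be sketched in Python as follows. This is a minimal illustration and not the analysis software used in the study: it fits the quantification cycle (Cq) against log10 of the template amount by ordinary least squares and then applies the efficiency formula, %Efficiency = (10^(-1/slope) - 1) × 100. The function names, the dilution series, and the Cq values are hypothetical.

```python
from math import log10


def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


def primer_efficiency(dna_ng, cq_values):
    """Return (slope, r_squared, percent efficiency) for a dilution series."""
    log_conc = [log10(c) for c in dna_ng]
    slope, _, r_squared = linear_fit(log_conc, cq_values)
    efficiency = (10 ** (-1 / slope) - 1) * 100
    return slope, r_squared, efficiency


# Hypothetical tenfold dilution series (ng of template) and Cq values.
dna_ng = [10, 1, 0.1, 0.01]
cq = [17.0, 20.4, 23.8, 27.2]
slope, r_squared, efficiency = primer_efficiency(dna_ng, cq)
print(round(slope, 3), round(r_squared, 4), round(efficiency, 1))
```

A slope of about -3.32 corresponds to 100% efficiency, that is, a doubling of product in each cycle; shallower or steeper slopes move the efficiency above or below that value.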

Results and discussion
Recovery of C. lanceolata and P. grandiflorus chloroplast genomes from total DNA sequence data
To identify the genetic variation of cpDNAs between C. lanceolata and P. grandiflorus, we employed the modified target enrichment approach to determine the chloroplast genome sequences; this approach facilitates rapid collection of target sequences from large numbers of DNA sequences (Mandel et al. 2014). The recovered chloroplast genome of C. lanceolata was 61,154 bp in length, excluding a gap, and included 59 different genes, comprising 51 protein-coding genes, 7 tRNA genes, and 1 rRNA gene (Table 1). A total of 73 different genes, including 62 protein-coding genes and 11 tRNA genes, were found in the P. grandiflorus chloroplast genome of 81,214 bp (Table 1). There were 22 genes (7551 bp) in the intron fragment in C. lanceolata and 31 (11,598 bp) in P. grandiflorus.
There were 6256 codons within the 51 protein-coding genes of C. lanceolata and 11,020 within the 62 protein-coding genes of P. grandiflorus (Table 2). Leucine was the most frequently encoded amino acid in both chloroplast genomes, accounting for 838 amino acids (13.4%) in C. lanceolata and 1087 amino acids (9.9%) in P. grandiflorus. However, the most frequent codon encoding leucine differed between the two species, i.e., TTG (22.91% of leucine codons) in C. lanceolata and TTA (24.20% of leucine codons) in P. grandiflorus. The least frequently encoded amino acids were cysteine (113 amino acids, 1.81%) in C. lanceolata and methionine (172 amino acids, 1.56%) in P. grandiflorus. Overall, the tendency of codon frequencies for the different amino acids was similar between C. lanceolata and P. grandiflorus. The frequency tendencies of leucine and cysteine in the C. lanceolata and P. grandiflorus chloroplast genomes were similar to those in the banana and Omani lime chloroplast genomes (Su et al. 2014). Prosdocimi and Ortega (2007) suggested that leucine is a major component required to generate more stable DNA mutations in the 10,000 poli-codon sequences, indicating that the C. lanceolata and P. grandiflorus chloroplast genomes also exhibit evolutionary stability in terms of leucine codon usage. In addition, the frequencies of A or T at the third codon position were higher, suggesting a codon bias of genes toward A/T in C. lanceolata and P. grandiflorus.
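Relative synonymous codon usage (RSCU) compares the observed count of a codon with the count expected if all synonymous codons for that amino acid were used equally. The following Python sketch illustrates the calculation; it is not the SeqinR implementation used in this study, it assumes the standard codon-to-amino-acid assignments, and the function name and the short leucine-only test sequence are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard codon-to-amino-acid assignments in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(cds):
    """Return a dict mapping each codon to its RSCU value in a coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    values = {}
    for aa in set(AMINO_ACIDS):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        for codon in family:
            # Observed count divided by the count expected under equal usage.
            values[codon] = counts[codon] * len(family) / total if total else 0.0
    return values


# Hypothetical sequence with four leucine codons: TTA, TTA, TTG, CTT.
demo_rscu = rscu("TTATTATTGCTT")
print(demo_rscu["TTA"], demo_rscu["TTG"], demo_rscu["CTT"], demo_rscu["CTC"])
```

Values above 1 indicate codons used more often than expected under equal usage, which is how a preference such as TTG or TTA among the six leucine codons becomes visible.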
Repetitive sequences in C. lanceolata and P. grandiflorus chloroplast genomes
A total of 16 SSR loci were detected in the two chloroplast genomes, accounting for 86 bp (7 SSRs) in C. lanceolata and 106 bp (9 SSRs) in P. grandiflorus (Table 3). Most SSRs were located in intergenic regions, whereas four SSRs were identified in the coding regions of genes, i.e., (GAA)4 was located in psbC of C. lanceolata, and (T)13, (AT)7, and (GAA)4 were located in ycf5, rpoC1, and psbC of P. grandiflorus, respectively. In addition, 30 long repeat sequences over 30 bp in length were detected in C. lanceolata (6 long repeat sequences) and P. grandiflorus (24 long repeat sequences) (Table 4). The 15 long repeat sequences were located in the genic region of genes, e.g., the eight long repeat sequences for C. lanceolata were found within six different genes, including rps12, psbB, psaA, ndhH, ndhA, and ndhG. Furthermore, four genes, ndhA, ndhF, ndhH, and ndhG, were involved with seven long repeat sequences in the P. grandiflorus chloroplast genome. These repeat sequences might be useful in the development of molecular markers to determine the genetic diversity of C. lanceolata and P. grandiflorus populations. For example, Ccmp3, an SSR marker developed from cpDNA, was used to evaluate the genetic similarity of twenty-five varieties of Coffea arabica (Vieira et al. 2010), which supports the use of the SSR loci of C. lanceolata and P. grandiflorus as SSR markers.

Phylogenetic analysis
To study the evolutionary relationship of the C. lanceolata and P. grandiflorus chloroplast genomes, 16 plant species from the orders Asterales and Apiales, including both genomes, were analyzed, and a phylogenetic tree was constructed using the progressive alignment method of MAUVE software (Darling et al. 2004) (Fig. 1). The chloroplast genomes of the 16 plant species were clearly divided into two groups: the Apiales clade of 4 plants and the Asterales clade of 12 plants, including C. lanceolata and P. grandiflorus. The C. lanceolata and P. grandiflorus chloroplast genomes were remarkably similar and grouped with Trachelium caeruleum and Campanula takesimana. Similarly, C. lanceolata and P. grandiflorus exhibited a close relationship in a maximum parsimony tree constructed using the petD group II intron as a species-level marker (Borsch et al. 2009). In addition, the strict consensus sequences of atpB, atpB-rbcL, and atpF-H of C. lanceolata and P. grandiflorus were more similar to each other than to those of other plant species of Campanulaceae (Kim and Yoo 2012). Therefore, the phylogenetic tree strongly supports a close genetic relationship between the chloroplast genomes of C. lanceolata and P. grandiflorus.
To identify the sequence diversity of protein-coding genes between C. lanceolata and P. grandiflorus, we performed the Tajima's D test following multiple alignment (Table 5). The highest number of segregating sites was detected from the psa gene family, which also exhibited a negative Tajima

Table 5 note: m is the number of sequences; S is the number of segregating sites; $p_s = S/m$; $\Theta = p_s/a_1$, where $p_s$ is the number of polymorphic sites, $a_1 = 1 + 1/2 + 1/3 + \dots + 1/(n-1)$, and n is the number of sequences; $\pi$ is the nucleotide diversity; D is the Tajima test statistic

Fig. 1 Phylogenetic relationships of 14 plant species including C. lanceolata and P. grandiflorus. The phylogeny tree was constructed using MAUVE software with a progressive alignment method

Application of C. lanceolata and P. grandiflorus cpDNA for molecular marker development
In the present study, we used the genetic information of the C. lanceolata and P. grandiflorus chloroplast genomes for molecular marker development, and three cpDNA-based markers, named Co_ndhF, Pl_rpoA, and Pa_rpoA, were developed from the ndhF and rpoA genes using the linearity test of the qRT-PCR assay (Fig. 2). To detect the specific cpDNA fragment of each plant species, we designed gene-specific primer pairs for the three cpDNA markers from the ndhF and rpoA genes on the basis of sequence differences (Fig. 2A). For example, the Co_ndhF marker primers for C. lanceolata exhibited 12 single-nucleotide polymorphisms (SNPs) among the three plant species, and the numbers of SNPs on both primer sequences of each cpDNA marker were 2 loci for Pl_rpoA of P. grandiflorus and 3 loci for Pa_rpoA of P. ginseng. To confirm whether each cpDNA marker was amplified only from the targeted cpDNA of each plant species by regular PCR, we separated the PCR products by gel electrophoresis (Fig. 2B).
We identified the amplified cpDNA fragments of each marker from the three plant species, suggesting the availability of potential markers for discriminating the targeted cpDNA of the three plant species. To improve the precision of the cpDNA markers, we further analyzed the slope of the regression line and primer efficiency using qRT-PCR with three independent replicates (Fig. 2C) and found appropriate primer efficiencies (83-97%) and correlation coefficients over 0.99 for each cpDNA marker in the targeted plant species only, but not in the non-targeted plant species. In addition, the quantification cycle (Cq) values of each cpDNA marker consistently increased from 14 to 31 in the targeted plant species according to the concentration of the DNA template used in qRT-PCR, whereas irregular Cq values were detected in non-targeted plant species regardless of the amount of DNA. The slope of the regression provided a useful tool to measure the cpDNA content of C. lanceolata, P. grandiflorus, and P. ginseng in commercial food products. Similarly, the three cpDNA markers developed from rpoB and rpoC2 genes were successfully applied in the detection of various rice flours in commercial mixed-flour products using qRT-PCR (Hwang et al. 2015), suggesting the utility of three cpDNA markers for C. lanceolata. The accurate content estimation of food additives was possible using the slope of molecular markers detected by the linearity test. For example, de la Cruz et al. (2013) successfully estimated Brazil nut percentages in 19 commercial products by using a molecular marker developed in the 2S albumin DNA sequences of different nut plants.

Fig. 2 caption (partial): DNA concentrations (x axis) detected from total DNA of C. lanceolata, P. grandiflorus, and P. ginseng using the qRT-PCR assay. The colored dots represent different plant species. Primer efficiency was calculated using the following formula: $\%\,\mathrm{Efficiency} = \left[10^{-1/\mathrm{slope}} - 1\right] \times 100$. The slope value of regression line was determined using the linearity test

Appl Biol Chem (2017) 60(1):23-31
In conclusion, comparative analysis of chloroplast genomes is useful for phylogenetic studies and provides useful information for molecular marker development. Here, we attempted to recover the chloroplast genomes of C. lanceolata and P. grandiflorus from total DNA sequence data and obtained incomplete chloroplast genomes, including 59 genes for C. lanceolata and 73 genes for P. grandiflorus. We also identified repeat sequences in the cpDNA of each species for molecular marker development. In the phylogenetic tree, the chloroplast genome of C. lanceolata was closely linked to that of P. grandiflorus within Asterales, indicating a close evolutionary relationship between C. lanceolata and P. grandiflorus. In addition, we developed three cpDNA markers from the ndhF and rpoA genes based on the recovered chloroplast genome sequences for detecting the cpDNA of C. lanceolata, P. grandiflorus, and P. ginseng. The chloroplast genomes of C. lanceolata and P. grandiflorus recovered from total DNA sequence data will provide useful information to improve the phylogenetic resolution and efficiency of marker development, and the cpDNA markers developed in the present study are useful for distinguishing among C. lanceolata, P. grandiflorus, and P. ginseng in commercial mixed-flour products.