Comparative analysis and implications of the chloroplast genomes of three thistles (Carduus L., Asteraceae)

Background Carduus, commonly known as plumeless thistles, is a genus in the Asteraceae family that exhibits both medicinal value and invasive tendencies. However, the genomic data of Carduus (i.e., complete chloroplast genomes) have not been sequenced. Methods We sequenced and assembled the chloroplast genome (cpDNA) sequences of three Carduus species using the Illumina Miseq sequencing system and Geneious Prime. Phylogenetic relationships between Carduus and related taxa were reconstructed using Maximum Likelihood and Bayesian Inference analyses. In addition, we used a single nucleotide polymorphism (SNP) in the protein coding region of the matK gene to develop molecular markers to distinguish C. crispus from C. acanthoides and C. tenuiflorus. Results The cpDNA sequences of C. crispus, C. acanthoides, and C. tenuiflorus ranged from 152,342 bp to 152,617 bp in length. Comparative genomic analysis revealed high conservation in terms of gene content (including 80 protein-coding, 30 tRNA, and four rRNA genes) and gene order within the three focal species and members of subfamily Carduoideae. Despite their high similarity, the three species differed with respect to the number and content of repeats in the chloroplast genome. Additionally, eight hotspot regions, including psbI-trnS_GCU, trnE_UUC-rpoB, trnR_UCU-trnG_UCC, psbC-trnS_UGA, trnT_UGU-trnL_UAA, psbT-psbN, petD-rpoA, and rpl16-rps3, were identified in the study species. Phylogenetic analyses inferred from 78 protein-coding and non-coding regions indicated that Carduus is polyphyletic, suggesting the need for additional studies to reconstruct relationships between thistles and related taxa. Based on a SNP in matK, we successfully developed a molecular marker and protocol for distinguishing C. crispus from the other two focal species. Our study provides preliminary chloroplast genome data for further studies on plastid genome evolution, phylogeny, and development of species-level markers in Carduus.

In most angiosperms, the chloroplast genome (cpDNA) contains genes essential to photosynthesis (Sugiura, 1992). Genomic events (i.e., gene deletion, inversion, or duplication) in cpDNA may provide information about species' evolutionary history (Cosner, Raubeson & Jansen, 2004;Do & Kim, 2017;Haberle et al., 2008). For example, the Fabaceae includes clades that are characterised by large inversions and the loss of inverted repeat regions (Choi & Choi, 2017). Inversions have also been recorded in the cpDNA of Asteraceae (Kim, Choi & Jansen, 2005). Specifically, a large inversion comprising a 22.8 kb sequence occurred simultaneously with a small inversion of a 3.3 kb fragment; this event coincided with the split between major clades (excluding Barnadesioideae) in the evolution of Asteraceae. cpDNA data can also be used to develop molecular markers based on nucleotide polymorphisms (i.e., single nucleotide polymorphism (SNP) markers and microsatellite markers). Molecular authentication has been reported for various plant species, with a focus on invasive plants, endangered species, and taxa with potential medicinal value (Kim et al., 2012;Ishikawa, Sakaguchi & Ito, 2016;Luo et al., 2016;Park et al., 2016;Marochio et al., 2017;Han et al., 2018;Do et al., 2019). Specific regions of cpDNA have been identified for developing molecular markers in plants, including the commonlyused matK region (Poovitha et al., 2016;Vu et al., 2017). Among the Asteraceae, studies on molecular markers have been conducted for rubber dandelion (Taraxacum kok-saghyz LE Rodin), horseweed (Conyza sp.), Indian Chrysanthemum (Chrysanthemum indicum L.), the endemic herb Aster savatieri Makino, and the invasive plant Tithonia diversifolia (Hemsl.) A Gray (Ishikawa, Sakaguchi & Ito, 2016;Luo et al., 2016;Zhang et al., 2017;Marochio et al., 2017;Han et al., 2018). In addition to the development of these molecular markers, complete cpDNA sequences have been reported for various Asteraceae species (Kim, Choi & Jansen, 2005;Choi & Park, 2015;Wang et al., 2015a;Wang et al., 2015b;Yun, Gil & Kim, 2017;Liu et al., 2018;Ma, Sun & Zhao, 2018;Su et al., 2018). cpDNA sequences may be used to elucidate the phylogeny of angiosperms from the clade that is basal to monocots and eudicots (Angiosperm Phylogeny Group, 2016). Previous investigations into phylogenetic relationships among members of the Asteraceae have been conducted using a range of molecular data types, including rbcL, ndhF, matK, chloroplast DNA restriction sites, ITS sequence data, and nuclear loci (Jansen, Michaels & Palmer, 1991;Häffner & Hellwig, 1999;Fu et al., 2016;Mandel et al., 2019). However, the paucity of available sequence data may have resulted in ambiguous relationships between Carduus and related taxa (Häffner & Hellwig, 1999;Fu et al., 2016). In particular, ITS sequence data suggest that C. leptacanthus is sister to Cirsium and Notobasis, whereas another Carduus species is sister to Cirsium and Tyrimmus (Häffner & Hellwig, 1999). As such, clarification of relationships between Carduus and related species will require studies that include a larger number of Carduus species and different data types (i.e., chloroplast and mitochondrial genomes).
We used next-generation sequencing (NGS) to sequence and characterise the chloroplast genomes of Carduus crispus, C. acanthoides, and C. tenuiflorus, which exhibit both invasive tendencies and potential medical utility (particularly C. crispus). We then conducted comparative genomic analyses to explore genomic diversity among the three species with respect to highly variable regions, and the types and numbers of repeats. In addition, we reconstructed the formerly ambiguous relationship between Carduus and related taxa based on 78 protein-coding regions and non-coding sequences. Finally, we developed a specific molecular marker for C. crispus based on a SNP in the matK gene. This molecular marker provides useful information for managing C. crispus invasions, particularly with respect to the identification of immature (vegetative) individuals, which tend to be morphologically similar to other Carduus species (i.e., having winged stems with apical spines and spiny leaves). This molecular marker may also support positive identification of C. crispus for medical usage.

MATERIALS & METHODS
Taxon sampling, total DNA extraction, chloroplast genome assembly, and comparative analysis Leaves of C. crispus, C. acanthoides, and C. tenuiflorus were collected and dried in silica gel powder for NGS analysis. In addition, to test the efficiency of molecular markers (Table 1), leaves of 22 individuals of the three species were sampled from herbaria at the Korea National Arboretum (KNA), National Institute of Biological Resources (NIBR), New York Botanical Garden (NYBG), Carnegie Museum Herbarium (CM), and the United States National Herbarium (US). A modified cetyl trimethylammonium bromide (CTAB) protocol was used to extract total DNA from collected samples (Doyle & Doyle, 1987). High-quality DNA was used to conduct NGS using the Miseq platform and a Miseq Reagent Kit v3 (Illumina, Seoul, South Korea). Raw NGS data (2 × 300 bp paired-end reads) were cleaned by cutting adapter sequences, removing duplicate and chimeric reads, and trimming ends with > 0.05 probability of error per base. Cleaning was conducted using  Table S1), and gene loss and rearrangement were identified using MAUVE (Darling, Mau & Perna, 2010). In addition, Geneious Prime was used to calculate the pairwise identities of cpDNA sequences in the focal species. Small single repeats (SSRs) were analysed using Phobos embedded in Geneious Prime (Christoph, 2006(Christoph, -2010. Minimum repeat numbers were 10, 5, 4, and 3 for mono-, di-, tri-, and tetra-nucleotides, respectively. REPuter (Kurtz et al., 2001) was used to analyse the large repeat sequences (minimum length = 20 bp) in each species. To explore nucleotide diversity, 131 coding and non-coding regions were extracted from the complete cpDNA (Table S2). Following alignment using MUSCLE embedded in Geneious Prime (Edgar, 2004), aligned sequences were imported into DnaSP 6 (Rozas et al., 2017) for calculation of Pi values.

Phylogenetic analysis of Carduus and related taxa
A total of 78 protein-coding regions were extracted from the complete cpDNA of the focal species and other related taxa (Table S1). Sequences were aligned using MUSCLE embedded in Geneious Prime (Edgar, 2004). We used jModeltest (Posada, 2008) to find the best model for the aligned DNA sequences; GTR + I + R was selected as the most suitable model and was used in Maximum Likelihood (ML) and Bayesian Inference (BI) analyses. The ML analysis was conducted with the IQ-tree web server (http://iqtree.cibiv.univie.ac.at), using 1,000 bootstrap replications to calculate branch support values (Trifinopoulos et al., 2016). We used MrBayes v3.2 (Ronquist et al., 2012) for BI analyses. The Markov chain Monte Carlo (MCMC) analysis was run for 1,000,000 generations, and a tree was assembled every 1000 generations. A 25% burn-in setting was used for summarising trees. Figtree v4.0 (http://tree.bio.ed.ac.uk/software/figtree/) was used to visualise phylogenetic trees. Other datasets, including whole chloroplast genomes (excluding one IR region), non-coding regions of cpDNA, and hotspot regions derived from the cpDNA of Carduus species, were used in phylogenetic analysis in addition to protein-coding regions (Table S1). Analytical procedures for these additional datasets were identical to those used for the protein coding regions; however, we used the TVM+I+G model for the whole chloroplast genome (excluding one inverted repeat [IR] region) and all non-coding regions, and the TVM+G model for the hotspot regions dataset.

Comparative chloroplast genome analysis of the focal species
Differing numbers of reads were obtained from NGS data, resulting in varying cpDNA coverage rates among the three focal species (Table 2). Total cpDNA length differed among species, ranging from 152,342 bp to 152,617 bp, and included a large single copy (LSC), a small single copy (SSC), and two IR regions (Fig. 1). By contrast, all three species had identical numbers of protein coding (80), tRNA (30), and rRNA (4) genes ( Table 2, Table S3). The IR-LSC and IR-SSC junctions were located in the rps19 and ycf1 coding regions, respectively, but were longer in C. acanthoides (rps19 = 61 bp and ycf1 = 568 bp) than in the other two species (rps19 = 60 bp and ycf1 = 565 bp). In addition,

Notes.
The dashes (-) mean no data in this study. pairwise identity indicated that C. crispus is more similar to C. tenuiflorus (99.6%) than C. acanthoides (99.2%). Observations of nucleotide diversity indicated that 119 of 131 surveyed regions differ among the three focal species (Table S2, Fig. 2). Compared to coding regions, non-coding sequences had higher Pi values (Fig. 2). The highest Pi values were found in the psbC-trnS (0.0171) and psbH-petB (0.0161) regions. The highest value in coding regions was 0.00696 for ycf1 (Fig. 2). High nucleotide diversity regions (Pi values >0.008) included psbI-trnS_GCU,

Features of cpDNA repeats
Analysis of SSRs yielded 43 SSRs in C. crispus, 40 in C. tenuiflorus, and 31 in C. acanthoides (Table S4). SSRs occupied the same position in all three species and were mostly located in non-coding regions. Although four types of SSR (i.e., mono-, di-, tri-, and tetranucleotides) were identified, most SSRs were mononucleotides composed of A and T nucleotides (Table S4). All 31 SSRs found in C. acanthoides were also present in the other two species (Table S4). By contrast, C. crispus had three unique SSRs and shared nine SSRs with C. tenuiflorus. There were no unique SSRs in C. acanthoides or specific shared SSRs between C. acanthoides and either C. crispus or C. tenuiflorus. Among the focal species, 16 repeats were identified for both C. crispus and C. tenuiflorus, compared to 15 for C. acanthoides (Fig. 3A, Table S5). There were more repeats in coding regions than in non-coding areas, with the exception of C. acanthoides. Three types of repeats (i.e., forward, reverse, and palindrome) were identified in C. tenuiflorus and C. acanthoides; by contrast, only forward and palindrome repeats were found in C. crispus. Forward repeats were more abundant than reverse and palindrome repeats (Fig. 3B). Among recorded repeats, nine were common among all three species (Fig. 3C). Carduus acanthoides had four unique repeats, whereas C. tenuiflorus and C. crispus each had a single unique repeat. Carduus acanthoides shared one specific repeat with both C. crispus and C. tenuiflorus, whereas C. crispus and C. tenuiflorus shared five specific repeat regions (Fig. 3C).

Phylogenetic relationships between Carduus and related taxa
The ML and BI analyses, based on 78 protein-coding genes from Carduus and related taxa, yielded identical topologies (Fig. 4). In particular, both analyses confirmed the monophyly of Asteraceae subfamilies (i.e., Carduoideae, Chichorioideae, and Asteroideae). In contrast to the high support for Carduoideae and Cichorioideae (Bootstrap = 100/Posterior Probability = 1), low support was found for Asteroideae clades (Fig. 4). Notably, monophyly of the three Carduus species was not supported by either analysis. For example, C. acanthoides was sister to Silybum marianum, whereas C. crispus and C. tenuiflorus were sister to Cirsium arvense. Additional ML and BI analyses of full cpDNA sequences and non-coding regions suggested similar relationships (Fig. S4).

Multiplex PCR and specific markers for C. crispus
The results of multiplex PCR for the two groups of primer pairs yielded similar products, both of which were designed to identify C. crispus. In the first group, a 323 bp band was found in C. crispus, whereas a 421 bp band was identified in C. acanthoides and C. tenuiflorus (Fig. 5A). By contrast, the combination of matK_463F, matK_1162R, CD_SNP_F2, and CD_SNP_R2 yielded a longer PCR product for C. crispus in comparison with the other species (Fig. 5B). The designed primer pairs were specific to all C. crispus samples examined in this study (Figs. S5 and S6).

Conservatism of Carduus cpDNA
Chloroplast genomes are highly conserved in angiosperms with respect to gene content and order (Sugiura, 1992). This conservative tendency was observed in the three newly sequenced Carduus cpDNA genomes, compared to other Asteraceae members (Table 2). Other cpDNA sequences have revealed unique genomic events in Asteraceae. For example, the atpB gene, which encodes the CF1 ATPase beta subunit, is annotated as a pseudogene in Aster spathulifolius due to a deletion within the coding region (Choi and Park, 2015). Similarly, trnT_GGU was completely deleted or pseudogenised in the tribe Gnaphalieae (Lee et al., 2017). Duplication of trnF_GAA has been identified in Taraxacum (Salih et al., 2017). No comparable genomic events are present in Carduus or other members of subfamily Carduoideae (Table 2, Fig. S7). However, nucleotide diversity data pointed to potential regions for further study of phylogeny and population genetics, and the development of Carduus-specific molecular markers (Table S2, Fig. 2). The number of species we examined for this study was low relative to the approximately 32,000 known Asteraceae species. Therefore, additional studies that include the majority of Asteraceae species should be conducted to explore the overall evolutionary trends in the chloroplast genomes of this globally-distributed family. Chloroplast genomes provide useful molecular data for reconstructing phylogeny, exploring biogeography, and estimating divergence time in angiosperm lineages (Do, Kim & Kim, 2014;Nguyen, Kim & Kim, 2015;Kim, Kim & Kim, 2016;Kim & Kim, 2018). Repetitive sequences in the chloroplast genome provide useful information for studying genomic rearrangement and phylogeny (Cavalier-Smith, 2002;Nie et al., 2012;Yi et al., 2013;Kim & Kim, 2018). In addition, existing repeats might result in the accumulation of new repeats in cpDNA (Asano et al., 2004). One of the crucial molecular data types in cpDNA is SSR sequences. Other studies have used SSRs to develop specific markers for different species and to study the genetic diversity of angiosperms (Ishikawa, Sakaguchi & Ito, 2016;Luo et al., 2016;Marochio et al., 2017;Han et al., 2018). In this study, although cpDNA sequences were highly conserved, the three Carduus species were found to have different numbers of SSRs (Table S4). While we did not develop SSR markers or conduct population studies of Carduus, the SSR information we obtained may be useful in future studies on Carduus species. In addition to the repeats shared among the three species, C. crispus had three unique repetitive sequences (Table S4), which may be useful in population studies, phylogenetic analyses, and the development of additional molecular markers.

Uncertain relationships among Carduus
Phylogenetic analyses of the Asteraceae have identified ambiguous relationships between Carduus, Cirsium, and Silybum (Fu et al., 2016;Panero, 2016;Arnelas et al., 2018); for example, ITS data suggests that Carduus is polyphyletic (Häffner & Hellwig, 1999). Although three coding regions (matK, rbcL, and ndhF ) were used to reconstruct phylogenetic relationships, the position of Carduus remained unresolved (Fu et al., 2016). We used 78 protein-coding regions to clarify these relationships; however, the phylogeny of Carduus and related taxa remains unclear (Fig. 4). Specifically, C. acanthoides was found to be close to Silybum marianum whereas C. crispus and C. tenuiflorus form a clade with Cirsium arvense. While non-coding regions can be useful in reconstructing the phylogeny of lower taxa, we were unable to recover the monophyly of Carduus using data from non-coding regions, including the eight hotspot areas as well as the combined data from coding and non-coding regions (Fig. S4). These issues suggest the need for additional studies on the phylogeny of Carduus and other members of the subfamily Carduoideae using supplementary molecular data and morphology.

Implications of SNP data for developing molecular markers for Carduus
SNPs are useful in population studies due to its extremely abundant presence in the angiosperms genomes (Cui et al., 2017;Fischer et al., 2017;Pantoja et al., 2017), and are effective in phylogenetic analysis (Leaché & Oaks, 2017). In addition, various molecular markers have been developed for different angiosperm species based on SNP data from chloroplast genomes (Khlestkina & Salina, 2006;Wang et al., 2015a;Wang et al., 2015b;Hyun et al., 2019;Xia et al., 2019). We successfully developed a molecular marker, inferred from SNP data, to distinguish C. crispus from C. acanthoides and C. tenuiflorus (Fig. S3). Our marker demonstrates that nucleotide sequence variations can provide rapid molecular identification of C. crispus. We focused on C. crispus because it exhibits the characteristics of an invasive species (Dunn, 1976;Verloove, 2014;Jung et al., 2017), and may also have value for the treatment of obesity and cancer (Xie, Li & Jia, 2005;Davaakhuu, Sukhdolgor & Gereltu, 2010;Lee et al., 2011;Tunsag, Davaakhuu & Batsuren, 2011). Various DNA-based markers (i.e., inter-simple sequence repeats [ISSRs], sequence characterisation of amplified regions [SCARs], and SSRs) have been developed to authenticate medicinal plants to ensure safety and efficacy (Hao et al., 2010;Sarwat et al., 2012;Ganie et al., 2015;Ward, Gaskin & Wilson, 2008). Additionally, molecular data are useful for understanding invasion processes of alien plants (Ward et al., 2008). We developed a SNP-based molecular marker for C. crispus (Fig. 5, Figs. S5 and S6) that may be used to detect C. crispus invasions in their early stages, and to develop suitable management strategies. Our alignment results also identified specific SNPs for C. acanthoides and C. tenuiflorus, which may be used to create molecular markers for these species (Fig. S1). Although we only used the SNP in matK, use of the complete cpDNA sequences of Carduus will enable the mining of SNPs from other regions for developing molecular markers for C. crispus and related species.

CONCLUSIONS
In this study, we provided the first complete cpDNA sequences for Carduus species. Despite the absence of significant differences (i.e., inversions, deletions, and duplications) between the chloroplast genomes of Carduus and those of related taxa, the newly acquired cpDNA sequences have value as a resource in future studies of the evolution of the chloroplast genome in Carduoideae and Asteraceae. Additionally, the 78 protein-coding regions of the chloroplast genome revealed uncertainty regarding the position of Carduus within the subfamily Carduoideae, and suggested the need for additional studies to reconstruct relationships not only among thistles, but among other members of the Asteraceae as well. The methods and protocols used in developing molecular markers for C. crispus are easy to apply and may be useful as a standard method in other studies of Asteraceae species.