Highly variable chloroplast genome from two endangered Papaveraceae lithophytes Corydalis saxicola and C. tomentella

Background: Corydalis DC., the largest genus of Papaveraceae, is recognized as one of the most taxonomically challenging plant taxa. However, no complete chloroplast (cp) genome for this genus has been reported to date. Results: We sequenced four complete cp genomes of two closely related Corydalis species, C. saxicola and C. tomentella, compared these cp genomes with each other and with others from Papaveraceae, and analyzed the phylogenetic relationships based on the sequences of common CDS. The cp genomes are 189,029 to 190,247 bp in length and possess a quadripartite structure with two highly expanded inverted repeat (IR) regions (length: 41,955 to 42,350 bp). Comparison between the cp genomes of C. tomentella, C. saxicola and other Papaveraceae species revealed high variability in genome size, genome structure, gene content, and gene arrangement. Five NADH dehydrogenase-like genes, together with psaC, rpl32, ccsA and trnL-UAG, which are normally located in the SSC region, have migrated to the IRs, resulting in IR expansion and gene duplication. An inversion of up to 9 kb involving five genes (rpl23, ycf2, ycf15, trnI-CAU and trnL-CAA) was found within the IR regions. In addition, the accD gene was found to be absent. The ycf1 gene has shifted from the IR/SSC border to the SSC region as a single copy. Phylogenetic analysis showed that genus Corydalis is quite distantly related to the other genera of Papaveraceae, supporting recent proposals to establish a separate family, Fumariaceae. Conclusions: Our results provide a useful resource for classification of this taxonomically complicated genus, and will be valuable for understanding Papaveraceae evolutionary relationships.

A correct understanding of the relationships between different biological groups is the main focus of phylogenetic biology, the basis of taxonomy and naming, and a foundation for research in other branches of biology [5]. Compared with traditional molecular markers, cp genomes provide specific advantages for establishing plant phylogenetic relationships and for taxonomic research [6]. The length of cp genomes is usually 115-165 kb, a modest size that is easily sequenced, and an appropriate nucleotide substitution frequency has produced sufficient variation for analysis.
Relatively conserved gene sequences produce co-linearity among plant groups, and the evolution rates of coding and non-coding regions differ significantly, which makes cp genomes suitable for phylogenetic analysis at different taxonomic ranks [7]. Taxonomists have widely used cp genomes to study plant phylogenetics and have advocated the use of cp genomes as a super DNA barcode for species identification [6].
A large number of cp genomes have been sequenced, providing abundant data that can be used in plant phylogeny research to reveal the true evolutionary relationships between species more accurately and to resolve difficult phylogenetic relationships in complex taxa. Rorbert et al. clarified long-confusing phylogenetic relationships in rosids by comparing the whole cp genomes of 28 rosid species [8]. Zhang et al. analyzed plastomes from 130 species of 87 genera and offered new insights into deep phylogenetic relationships and the diversification history of Rosaceae [9]. Because they contain more informative sites, cp genomes have been successfully used as a "super barcode" to identify several taxonomically difficult species. Guo et al. found that the rpl32 gene in the Epimedium wushanense cp genome was deleted, and they successfully identified 11 Epimedium species by studying the evolutionary relationship of rpl32 genes [10]. Cui et al. analyzed eight cp genomes from Amomum to accurately identify Amomum villosum and related species [11]. Analysis of Lycium barbarum, L. chinense, and L. ruthenicum cp genomes showed that these three plants could be successfully identified by cp genomics [12]. With the reduced cost of sequencing and the development of bioinformatics technology, cp genome analysis will be extensively used in future studies of plant systematic relationships and taxonomy.
Corydalis is the largest genus in Papaveraceae [13]. There are more than 400 Corydalis species that are widely distributed in the North Temperate Zone [14]. Corydalis has extremely complex morphological variation because of typical reticulate evolution and intense differentiation during evolution [13], including extensive interspecific hybridization and gene introgression [13][14][15].
Taxonomic study of the genus on the basis of morphological characteristics and DNA barcoding has been very difficult and slow, and a complete classification system has not yet been established in Corydalis [16]. Consequently, it is considered to be one of the most taxonomically complex taxa.

In this study, high-throughput sequencing and comparative genomics were used to study the cp genomes of two important medicinal plants in genus Corydalis: C. saxicola and C. tomentella. Both species occupy a very special habitat (Fig. 1): they grow in dry cracks of limestone and are critically endangered. We sequenced four complete cp genomes from these two species, described their genomic characteristics, compared these genomes with other Papaveraceae cp genomes, and analyzed the phylogenetic relationships on the basis of common CDS in the cp genomes.

Results
Organization and features of C. tomentella and C. saxicola genomes
The complete C. tomentella genomes were 190,198-190,247 bp long and exhibited a typical angiosperm circular cp structure containing four regions: a large single-copy region (LSC: 96,530-96,701 bp), a small single-copy region (SSC: 9,636-9,664 bp), and a pair of inverted repeats (IR: 41,955-42,002 bp) (Fig. 1). The GC content of the genome and of each genomic region was also typical of angiosperm cp genomes. Specific lengths and contents are shown in Fig. 1 and Table 1. The lengths of the two complete C. saxicola genomes were 189,029 bp and 189,155 bp, slightly smaller than those of C. tomentella. The cp genome structure, size of each region, and GC content were similar between the two species (Table 1). There were 12 introns with a length of more than 700 bp, and the longest gene was trnK-UUU with a length of 2,478 bp. The gene features of the C. saxicola cp genome were similar to those of C. tomentella.
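As a minimal illustration of how per-region GC content such as that summarized in Table 1 can be computed, the sketch below slices a genome string by region coordinates. The function names, the toy sequence, and the coordinates are hypothetical and are not taken from this study.

```python
def gc_content(seq):
    """Return the GC percentage of a sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def region_gc(genome, regions):
    """Map each region name to its GC percentage.

    regions: {name: (start, end)} with 0-based, end-exclusive coordinates
    (hypothetical layout; real coordinates come from the annotation).
    """
    return {name: round(gc_content(genome[start:end]), 2)
            for name, (start, end) in regions.items()}


# Toy sequence standing in for a plastome; not real data.
toy_genome = "ATGC" * 5 + "GGCC" * 2 + "ATAT" * 3
toy_regions = {"LSC": (0, 20), "SSC": (20, 28), "IR": (28, 40)}
print(region_gc(toy_genome, toy_regions))
```

For a real plastome, the sequence and the LSC, SSC, and IR boundaries would come from the annotated GenBank record rather than from hand-written coordinates.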
The C. saxicola cp genome contained 120 genes, including 78 protein-coding genes, 38 tRNA genes, and four rRNA genes. Nineteen genes contained introns. The longest intron-containing gene in the C. saxicola cp genome was trnK-UUU, and its length was also 2,478 bp (Fig. 2, Table 2 and S1).
SSRs are short tandem repeats of 1-6 bp DNA sequences that are widely distributed throughout the cp genome [17]. In this study, CPGAVAS2 software was used to identify and classify SSRs with a length greater than or equal to 8 bp, and we analyzed the distribution and types of SSRs in the C. tomentella and C. saxicola cp genomes. A total of 172 SSRs were identified in the whole C. tomentella cp genome (using MHJ1 as an example), including 100 mono-, 34 di-, and one compound nucleotide SSRs. Among all SSR types, A and T were the most commonly used bases, and 116 SSRs in the C. tomentella cp genome had A, T, or AT repeat units.
In addition to SSRs, longer repeats (length ≥ 30 bp) such as forward repeats (F) and palindromic repeats (P), also called interspersed repeat sequences, were analyzed. In the C. tomentella cp genome, there were 112 interspersed repeat sequences, comprising 64 tandem repeats, 39 forward repeats, and 11 palindromic repeats (Table 3). A total of 132 long repeats were present in the C. saxicola cp genome, comprising 82 tandem repeats, 23 forward repeats, and 27 palindromic repeats (Table 3). Comparing the cp genomes of the two species, the C. saxicola genome had a greater total number of repeats than the C. tomentella cp genome, and the cp genome repeat content in both species was significantly higher than that of most species.
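The SSR screening described above (1-6 bp motifs, total length of at least 8 bp) can be sketched with regular expressions. This is an illustrative sketch only and is not the algorithm used by CPGAVAS2. The repeat-count thresholds in `MIN_REPEATS` are assumptions chosen so that every reported SSR is at least 8 bp long, and compound SSRs are not handled.

```python
import re

# Hypothetical minimum repeat counts per motif length (1-6 bp); each
# combination yields an SSR of at least 8 bp.
MIN_REPEATS = {1: 8, 2: 4, 3: 3, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    start is 0-based. Motifs that are themselves repeats of a shorter
    unit (e.g. "AA" or "ATAT") are skipped to avoid double counting.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in min_repeats.items():
        # Group 2 is the motif; \2 must follow at least (min_rep - 1) times.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(2)
            is_redundant = any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            )
            if is_redundant:
                continue
            hits.append((match.start(), motif, len(match.group(1)) // unit_len))
    return sorted(hits)


# Toy sequence (not real data): a poly-A run and an (AT)5 repeat.
toy = "GC" + "A" * 10 + "GCGC" + "AT" * 5 + "CCG"
print(find_ssrs(toy))
```

Counting the returned tuples by motif length gives the kind of mono-, di-, and tri-nucleotide tallies reported in the text, although the exact counts depend on the thresholds chosen.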
IR contraction and expansion
IR regions are the most conserved regions in the plant plastome, and contraction and expansion at their borders are regarded as the major causes of size variation [18][19]. We selected four phylogenetically close species (Papaver rhoeas, Papaver orientale, Papaver somniferum, and Coreanomecon hylomeconoides) and two model species (Nicotiana tabacum and Arabidopsis thaliana) as references for cp genome structure comparisons.

Comparative genomic analysis and genome sequence divergence
VISTA software was used to make multiple comparisons of the C. tomentella and C. saxicola cp genome sequences. The results show that intra-specific variation was small but that there were still some inter-specific differences (Fig. 4). The coding and non-coding regions of the C. saxicola samples were conserved. The coding regions of the C. tomentella samples were also conserved, but there were differences in several consecutive intergenic regions: rps12-clpP, clpP-psbB, and petB-psbH.
Between C. tomentella and C. saxicola, the most highly divergent regions were observed in both coding and intergenic regions, including rpl20, rrn23s, trnH-GUG, trnN-GUU, rps12-clpP, clpP-psbB, petB-psbH, and ycf1-ndhL. Morphological features and cluster analysis of DNA barcodes indicate that the two species are closely related and difficult to identify accurately. The cp genome differences between the two species therefore have potential for use as molecular markers for species authentication.
Comparisons with the N. tabacum outgroup and the Papaveraceae species P. rhoeas, P. orientale, P. somniferum, and C. hylomeconoides showed that C. tomentella and C. saxicola have distinct cp genome structures. The differences included genome size, number of genes, and genome structure (Fig. 5). First, the C. tomentella and C. saxicola cp genome sizes (189.1-190.2 kb) were larger than those of N. tabacum (155.9 kb) and P. somniferum (152.9 kb). Second, intergenic regions in the C. tomentella and C. saxicola cp genomes were longer than those in N. tabacum and P. somniferum, as seen, for example, in the intergenic regions psal/rpl32 (7 kb) in the IR region and rps12/clpP (5 kb) in the LSC region. Third, the C. tomentella and C. saxicola cp genome structures were significantly different from those of the other six species, showing large-scale gene duplication, relocation, and inversion, as well as changes in the number and arrangement of genes. Fourth, the C. tomentella and C. saxicola IR regions were highly expanded (41.9-42.5 kb). The ndhF, ndhD, ndhL, ndhG, ndhE, psaC, ccsA, trnL-UAG and rpl32 genes, usually located in the SSC region, migrated to the IR regions to become double-copy genes (Fig. 1). A few genes (rpl19 and rpl2) migrated from the IR region to the LSC region. In particular, in C. tomentella and C. saxicola, a large fragment (containing rpl23, trnI-CAU, ycf2, ycf15, and trnL-CAA) moved within the IR region. Gene migration increased the length of the IR region and decreased the length of the SSC region. Fifth, the LSC region was highly conserved, but the accD gene was lost and the position of the rbcL gene changed substantially.
In short, both the coding and non-coding regions of C. tomentella and C. saxicola cp genomes differ greatly from those of other Papaveraceae and tobacco.

Phylogenetic analysis of Papaveraceae
With Coptis chinensis and N. tabacum as outgroups, common protein coding sequences from 13 cp genome sequences were extracted from C. saxicola, C. tomentella, and six Papaveraceae species (P. [...]

[...] [40]. Conversely, it is rare for NDH genes to undergo large-scale duplication and augmentation, and the effects of the additional gene copies resulting from gene duplication on plant growth and development have rarely been discussed in previous research. The NDH complex participates in photosystem I (PSI) cyclic electron flow (CEF) and chlororespiration. NDH-dependent CEF provides an additional proton gradient (ΔpH) and ATP for CO2 assimilation and alleviates oxidative stress caused by stromal over-reduction under stress conditions [38][39]. The non-photochemical quenching ability of NDH deficient mutants decreased under mild drought [43]. NDH deficient mutants grow slowly at low humidity [42]. Under strong light, tobacco ndhB mutants were more susceptible to photobleaching [43]. Under heat stress conditions, NDH-mediated cyclic and chlororespiratory electron transport are accelerated, mitigating photo-oxidative damage and inhibition of CO2 assimilation caused by high temperature [44].
C. tomentella and C. saxicola mainly grow in dry cracks of limestone, a unique environment with little available soil and water [45] (Fig. 1). They have therefore long been subjected to extreme environmental conditions, such as high temperature, drought, and low light. In view of the functions of NDH genes in plant defense against various environmental stresses, the doubling of NDH genes that results from IR expansion could lead to overexpression of these doubled genes, which would be helpful for adaptation to harsh environmental conditions. The special structure of the C. tomentella and C. saxicola cp genomes thus provides a clue that could explain their robust adaptation to harsh environments.
The accD gene was absent in C. saxicola and C. tomentella cp genomes. Usually, gene content is highly conserved among photosynthetic angiosperm cp genomes [29,46], but in a very few plants, for example, legumes and Circaeasteraceae [40,47], a number of genes have been lost or pseudogenized. The loss of accD in the cp genome is mirrored in other plant taxa, such as grasses, Circaeasteraceae and Oleaceae [29,30,37]. The accD gene encodes an acetyl-CoA carboxylase subunit and is an important regulator of carbon flow entering the fatty acid biosynthesis pathway [48]. It is known to be essential for leaf development in angiosperms [49][50]. Recent research has shown that the accD gene present in the plastome of most angiosperms is functional [48,50].
Furthermore, several studies have shown that the accD gene has been transferred into the nucleus, and that the proteins it encodes are imported from the cytosol into the chloroplast with the help of a transit peptide [37][38][39][40][41][42][43][44][45][46][47][48][50][51]. Whether the C. tomentella and C. saxicola accD genes have been lost entirely or transferred to the nucleus, and what effects this has on development, are currently unknown.
Papaveraceae phylogenetic relationships and potential application of cp genomics in Corydalis
By accurately distinguishing two closely related species (C. saxicola and C. tomentella), cp genomes have demonstrated great potential for use as a super-barcode to discriminate Corydalis species. Genus Corydalis is considered to be one of the most taxonomically complex taxa [13]. It is extremely difficult to rely on morphological characteristics for Corydalis species identification, and single-locus DNA barcodes lack adequate variation in closely related taxa. Studies using short gene fragments and DNA barcodes showed that both nuclear genome (ITS/ITS2) sequences and cp genome (matK/rbcL/rps16) sequences produced unsatisfactory taxonomic identifications within Corydalis [14,45,52]. Cp genomes offer many advantages, including a moderate size and an appropriate frequency of nucleotide substitutions that can provide sufficient mutation sites [29], and have been successfully used in the identification of various taxa, such as the genera Epimedium [10], Fritillaria [53], Epipremnum [54], and Papaver [55]. In this study, C. saxicola and C. tomentella, two closely related species from Section Thalictrifoliae in Corydalis, clustered into two separate branches in the phylogenetic tree, which indicates that they can be accurately distinguished by cp genome analysis. In contrast, these two closely related species were not monophyletic in the phylogenetic analysis based on short DNA barcode sequences and could not be effectively distinguished. Recent barcoding studies have placed a greater emphasis on the use of whole-cp genome sequences, which are now more readily available as a consequence of improving sequencing technologies [3]. The demonstrated use of cp genomics in Corydalis species identification suggests that it has great potential for taxonomic identification in this genus.
Cp genomics also efficiently distinguished Papaveraceae genera. The evolution rates of coding and non-coding regions are significantly different in cp genomes, enabling cp genome use for systematic analysis of different phylogenetic ranks [7]. Genus Corydalis belongs to Papaveraceae Fumarioideae (Corydaleae), and the phylogenetic relationships between this genus and its relatives remain controversial [13]. Recent studies have tended to treat genus Corydalis and closely related genera as an independent family, Fumariaceae, because the morphological characteristics of this family constitute a unique evolutionary series [15][16][56][57][58]. In this study, a Papaveraceae phylogenetic tree built using common protein coding sequences shows that each genus is clustered into one branch. However, Corydalis species are clustered into a distinct clade that is quite distant from other Papaveraceae genera. Combined with the substantial differences in cp genome structures between Corydalis and other Papaveraceae genera, these results provide preliminary molecular evidence that supports placing Corydalis in a separate family, Fumariaceae. However, because few species were included in this study, it will be necessary to analyze additional representative species in further studies.

Materials, DNA extraction and sequencing
Plant materials were provided by the Chongqing Institute of Medicinal Plant Cultivation and identified by researcher Zhengyu Liu as C. tomentella Franch. and C. saxicola Bunting. We collected young leaves from selected plants that were vigorous, healthy, and disease-free. These leaves were wiped with 70% alcohol and repeatedly washed with sterile water before genomic DNA extraction. Total DNA was extracted using a Tiangen plant genomic DNA extraction kit, and the DNA quality and concentration were assessed using 1% agarose gel electrophoresis and a Nanodrop2000. Qualified total DNA was sent to Shanghai Biotech for sequencing on the Illumina HiSeq4000 platform.

Genome assembly and annotation
Following the method of Li et al. [4], cp genomes were assembled on a Linux system using BLAST+, capturereads.py, soapdenovo-127mer, and SSPACE-Basic-v2.0.pl [4,55]. The completed genomes were annotated using CPGAVAS2 [59], and the start and stop positions in the annotations were then revised in Apollo [60]. CPGAVAS2 was used to convert the revised GFF3 annotations into sqn format for NCBI submission. Sequin was used to check and correct unsatisfactory annotations in the sqn file, and the corrected results were submitted to the NCBI database [4]. Physical maps of the cp genomes were drawn by GenomeDRAW [61] using a GB format file exported from the sqn file by Sequin.
VISTA software was used to compare multiple cp genomes [68].
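To make the idea behind a sequence identity plot concrete, the sketch below computes percent identity in fixed windows along a pairwise alignment. It is a simplified illustration and not the VISTA implementation: it assumes the two sequences are already aligned (gaps written as "-"), and the function name and default window size are hypothetical.

```python
def windowed_identity(aln_a, aln_b, window=100, step=100):
    """Percent identity per window over two pre-aligned sequences.

    Returns a list of (window_start, percent_identity). A column counts
    as a match only if both symbols are equal and neither is a gap.
    Trailing columns that do not fill a whole window are ignored.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    a, b = aln_a.upper(), aln_b.upper()
    results = []
    for start in range(0, len(a) - window + 1, step):
        columns = zip(a[start:start + window], b[start:start + window])
        matches = sum(1 for x, y in columns if x == y and x != "-")
        results.append((start, 100.0 * matches / window))
    return results


# Toy alignment (not real data): one substitution, then a 2-column gap.
print(windowed_identity("ACGTACGTAC", "ACGTTCG--C", window=5, step=5))
```

Windows whose identity falls below a chosen cut-off (70% in the plots of this study) would then be the candidate divergent regions.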

Phylogenetic analysis
A total of 15 whole cp genome sequences were used in cluster analysis. Thirteen genomes were from Papaveraceae (four Corydalis genomes, five Papaver genomes, two Meconopsis genomes, one Macleaya genome, and one Coreanomecon genome), and the Coptis chinensis and Nicotiana tabacum genomes were included as outgroups. Of the Papaveraceae genomes, four were newly sequenced in this study and nine were downloaded from the NCBI database. Common protein coding sequences were extracted from the cp genome sequences [4], and multiple global alignments of the protein coding sequences were performed using the ClustalW module in MEGA6.06 software [62]. The maximum-likelihood (ML) tree was constructed on the basis of the alignments of common protein sequences and full-length cp genome sequences by MEGA6.06 software [62].
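The step of extracting common protein coding sequences can be sketched as a set intersection over per-genome gene lists. This is a minimal sketch under stated assumptions, not the script used in this study: the dictionary layout, function names, and toy sequences are hypothetical, and a real workflow would align each gene across species (for example with ClustalW) before concatenation.

```python
def common_genes(genomes):
    """Return the sorted gene names present in every genome.

    genomes: {species: {gene_name: cds_sequence}} (hypothetical layout).
    """
    gene_sets = [set(cds) for cds in genomes.values()]
    return sorted(set.intersection(*gene_sets)) if gene_sets else []


def concatenate_shared(genomes):
    """Concatenate shared CDS in a fixed gene order for each species."""
    shared = common_genes(genomes)
    return {species: "".join(cds[gene] for gene in shared)
            for species, cds in genomes.items()}


# Toy data (not real sequences): accD is missing from one genome,
# so it is excluded from the shared set.
toy = {
    "species_a": {"rbcL": "ATGTCA", "matK": "ATGGAA", "accD": "ATGACT"},
    "species_b": {"rbcL": "ATGTCT", "matK": "ATGGAG"},
}
print(common_genes(toy))
print(concatenate_shared(toy))
```

Using a fixed, sorted gene order keeps the concatenated sequences comparable column by column across species once the per-gene alignments are in place.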

Conclusions
In this study, we sequenced and characterized four complete cp genomes for C. tomentella and C.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
The collection of plant material complied with institutional and national guidelines. The field study was conducted in accordance with local legislation. Four plastomes sequenced in this study have been deposited in the National Center for Biotechnology Information (NCBI) genome database (https://www.ncbi.nlm.nih.gov/) (Accession numbers: see Table 1). All sequences used in phylogenetic analysis of Papaveraceae are available from NCBI (Accession numbers: see "Phylogenetic analysis of Papaveraceae" in the Results section).

Competing interests
The authors declare no conflict of interest.

Funding

Figure 1
The habitat of C. saxicola and C. tomentella. a Distant view of a steep cliff where C. saxicola grows; b close-up of C. saxicola; c close-up of C. tomentella. The yellow arrows indicate the Corydalis plants.

Figure 4
Sequence identity plot comparison of the C. tomentella and C. saxicola cp genomes. Gray arrows and thick black lines above the alignment indicate genes with their orientation and the position of the inverted repeats (IRs), respectively. A cut-off of 70% identity was used for the plots, and the Y-scale represents the percent identity ranging from 50 to 100%.

Figure 5
Sequence identity plot comparison of the cp genomes of C. tomentella, C. saxicola, Papaver somniferum, P. rhoeas, and Coreanomecon hylomeconoides. Gray arrows and thick black lines above the alignment indicate genes with their orientation and the position of the inverted repeats (IRs), respectively. A cut-off of 70% identity was used for the plots, and the Y-scale represents the percent identity ranging from 50 to 100%.

Figure 6
ML tree of C. saxicola, C. tomentella, and related species based on common protein coding sequences.
