Molecular evolution of chloroplast genomes in subfamily Zingiberoideae (Zingiberaceae)

Zingiberoideae is a large and diverse subfamily of the family Zingiberaceae. Four genera in subfamily Zingiberoideae each possess 50 or more species, including Globba (100), Hedychium (> 80), Kaempferia (50) and Zingiber (150). Despite the agricultural, medicinal and horticultural importance of these species, genomic resources and suitable molecular markers for them are currently sparse. Here, we have sequenced, assembled and analyzed ten complete chloroplast genomes from nine species of subfamily Zingiberoideae: Globba lancangensis, Globba marantina, Globba multiflora, Globba schomburgkii, Globba schomburgkii var. angustata, Hedychium coccineum, Hedychium neocarneum, Kaempferia rotunda ‘Red Leaf’, Kaempferia rotunda ‘Silver Diamonds’ and Zingiber recurvatum. These ten chloroplast genomes (size range 162,630–163,968 bp) possess typical quadripartite structures that consist of a large single copy (LSC, 87,172–88,632 bp), a small single copy (SSC, 15,393–15,917 bp) and a pair of inverted repeats (IRs, 29,673–29,833 bp). The genomes contain 111–113 different genes, including 79 protein coding genes, 28–30 tRNAs and 4 rRNA genes. The dynamics of the genome structures, gene contents, amino acid frequencies, codon usage patterns, RNA editing sites, simple sequence repeats and long repeats exhibit similarities, with slight differences observed among the ten genomes. Further comparative analysis of seventeen related Zingiberoideae species, 12 divergent hotspots are identified. Positive selection is observed in 14 protein coding genes, including accD, ccsA, ndhA, ndhB, psbJ, rbcL, rpl20, rpoC1, rpoC2, rps12, rps18, ycf1, ycf2 and ycf4. Phylogenetic analyses, based on the complete chloroplast-derived single-nucleotide polymorphism data, strongly support that Globba, Hedychium, and Curcuma I + “the Kaempferia clade” consisting of Curcuma II, Kaempferia and Zingiber, form a nested evolutionary relationship in subfamily Zingiberoideae. Our study provides detailed information on ten complete Zingiberoideae chloroplast genomes, representing a valuable resource for future studies that seek to understand the molecular evolutionary dynamics in family Zingiberaceae. The identified divergent hotspots can be used for development of molecular markers for phylogenetic inference and species identification among closely related species within four genera of Globba, Hedychium, Kaempferia and Zingiber in subfamily Zingiberoideae.

It is difficult to identify Zingiberaceae plants merely based on their vegetative parts and flowers [2,3,5]. First, the vegetative parts of Zingiberaceae plants are often very similar, which is not suitable for species identification [3,5]. Second, the structure of the flowers is complex, and the texture is weak, with some of them being as thin as cicada wings. Once pressed into dry specimens, it is difficult to know their original appearance, and they break and begin to decompose as soon as they are touched [5]. Past studies of Zingiberaceae have primarily concentrated on morphological classification and resources [1,4,5], ecology [4], cultivation and propagation [3,4], medicinal and ornamental uses [3][4][5][6][7], and phylogeny [2,[8][9][10][11]. Among these phylogenetic investigations, several studies used nuclear internal transcribed spacer (ITS) and traditional chloroplast matK/ trnK-matK/trnL-trnF data to explore the phylogenetic relationships within subfamily Zingiberoideae [2,8] and within the four genera of Globba, Hedychium, Kaempferia and Zingiber [2,[9][10][11]. These traditional chloroplast markers have successfully identified the patterns of the evolutionary relationships within the four genera but in general, have been limited in differentiating resolutions among these four genera. Compared to traditional chloroplast markers, complete chloroplast genomes provide high resolution for relationship reconstruction within the Alpinia, Amomum, Curcuma and Zingiber genera, which allows an exploration of their phylogenetic positions in family Zingiberaceae [12][13][14][15][16][17][18]. Moreover, chloroplast single nucleotide polymorphism (SNP)-based phylogenetic analyses improve the phylogenetic resolution within the Alpinia, Amomum, Curcuma, Hedychium, Kaempferia, Stahlianthus and Zingiber genera in family Zingiberaceae [19][20][21][22][23][24][25][26]. However, because of the lack of complete chloroplast genomic data for the genus Globba, no studies have focused on the structural or mutational dynamics of the chloroplast genomes among the four genera of Globba, Hedychium, Kaempferia and Zingiber in subfamily Zingiberoideae.
In this study, we completely sequenced ten Zingiberoideae chloroplast genomes (G. lancangensis, G. marantina, G. multiflora, G. schomburgkii, G. schomburgkii var. angustata, H. coccineum, H. neocarneum, K. rotunda 'Red Leaf ' , K. rotunda 'Silver Diamonds' and Z. recurvatum) and compared them to eight other published chloroplast genomes from three genera, Hedychium, Kaempferia and Zingiber, within subfamily Zingiberoideae (H. coronarium, H. spicatum, K. galanga, K. elegans, Z. montanum, Z. officinale, Z. spectabile and Z. zerumbet), which were downloaded from GenBank. The primary aims of this study were as follows: (1) to compare and analyze the structure features of ten sequenced chloroplast genomes from four genera, Globba, Hedychium, Kaempferia and Zingiber; (2) to determine the sequence variation and molecular evolution among all 18 chloroplast genomes from the four genera in subfamily Zingiberoideae; and (3) to reconstruct phylogenetic relationships to verify the four genera's relationships within subfamily Zingiberoideae.

Features of ten sequenced Zingiberoideae chloroplast genomes
All ten sequenced Zingiberoideae chloroplast genomes have a typical quadripartite structure containing one large single copy (LSC), one small single copy (SSC) and two inverted repeat regions (IRA and IRB) by OGDRAW [30] and CGView tool [31] (Fig. 1, Fig. S1, Table 1). The ten sequenced Zingiberoideae chloroplast genomes size ranges from 162,630 bp (K. rotunda 'Red Leaf ') to 163,968 bp (H. coccineum) ( Table 1, Table S1, Fig. S1). They display four junction regions, namely, a separate LSC region of 87,172-88,632 bp, an SSC region of 15,393-15,917 bp, and a pair of IRs (IRA and IRB) of 29,673-29,833 bp each (Fig. 1, Fig. S1, Table 1, Table S1). The size of the G. schomburgkii var. angustata chloroplast genome (163,432 bp) is the largest among the five sequenced Globba species, with 126 bp, 658 bp, 233 bp, and 107 bp longer than G. lancangensis, G. marantina, G. multiflora and G. schomburgkii, respectively (Table 1,  Table S1). The GC content of these 10 chloroplast genomes is very similar (35.73-36.18%) ( Table 1, Table  S1). Specifically, the GC content in the IR regions (41.02-41.15%) is higher than that in the SSC regions (28.83-29.60%) and LSC regions (33.35-34.02%) ( Table 1, Table  S1). Additionally, the GC content at the third codon position (28.49-34.05%) is lower than that at the first (41.20-44.90%) and second (34.15-38.89%) positions in the protein coding genes of these 10 chloroplast genomes (Table S1). The ten sequenced chloroplast genomes contain 140-141 predicted functional genes, which consist of 87 protein coding genes, 45-46 tRNA genes, and 8 rRNA genes (Tables 1, 2, Table S2). Among these genes, a total of 111-113 different genes are detected in our 10 sequenced chloroplast genomes, including 79 protein coding genes, 28-30 tRNA genes, and 4 rRNA genes (Tables 1, 2, Table S2). Among the different protein coding genes in our 10 sequenced chloroplast genomes, 61 genes are located in the LSC regions, 12 genes are located in the SSC regions, and 8-9 genes are duplicated in the IR regions (Table 1, Table S2). Furthermore, most of the Fig. 1 Chloroplast genome map of G. lancangensis (GenBank accession number: MT473704; the outermost three rings) and CGView comparison [31] of eighteen Zingiberoideae chloroplast genomes (the inter rings with different colors). Genes belonging to different functional groups are shown in different colors in the outermost first ring. Genes shown on the outside of the outermost first ring are transcribed counter-clockwise and on the inside clockwise.The gray arrowheads indicate the direction of the genes. The tRNA genes are indicated by one letter code of amino acids with anticodons. LSC, large single copy region; IR, inverted repeat; SSC, small single copy region. The outermost second ring with darker gray corresponds to GC content, whereas the outermost third ring with the lighter gray corresponds to AT content of G. lancangensis chloroplast genome by OGDRAW [30]. The innermost first black ring indicates the chloroplast genome size of G. lancangensis. The innermost second and third rings indicate GC content and GC skews deviations in chloroplast genome of G. lancangensis, respectively: GC skew + indicates G > C, and GC skew -indicates G < C. From innermost fourth color ring to outwards 21st ring in turn: G. Genes with introns protein coding genes and rRNAs in our 10 sequenced Zingiberoideae chloroplast genomes are similar, but there are slight differences. For instance, the chloroplast genomes of H. coccineum and H. neocarneum have the psbZ gene, while lhbA gene is missing in both genomes ( Table 2, Fig. S1). Additionally, for tRNAs, the chloroplast genome of G. schomburgkii has two copies of trnS-GCU and trnT-UGU , respectively, while the trnS-GGA and trnT-GGU are missing; the trnS-GCU and trnT-UGU exist as single copies in the remaining 9 sequenced Zingiberoideae chloroplast genomes ( Table 2, Fig. S1). As shown in Table 1, both G. lancangensis and K. rotunda 'Red Leaf ' contain 21 genes in two IR regions, including 9 protein coding genes (ndhB, ndhF, rpl2, rpl23, rps7, rps12, rps19, ycf1 and ycf2), 8 tRNA genes (trnA-UGC , trnH-GUG , trnI-CAU , trnI-GAU , trnL-CAA , trnN-GUU , trnR-ACG and trnV-GAC ), and all four rRNAs (rrn4.5, rrn5, rrn16 and rrn23). The other 8 chloroplast genomes contain 20 genes in the two IR regions, which is the same as the G. lancangensis and K. rotunda 'Red Leaf ' chloroplast genomes, with the exception of ndhF gene (Table 1, Table S2).
Additionally, ten sequenced chloroplast genomes had 532 long repeats that consisted of 192 forward repeats, 24 complement repeats, 59 reverse repeats, and 257 palindromic repeats (Fig. 4a, Table S9). Among the ten sequenced genomes, G. lancangensis had the smallest number (48), and H. neocarneum had the largest number of long repeats (70) (Fig. 4a, Table S9). The number of forward repeats varied between 12 (G. schomburgkii var. angustata) and 28 (H. coccineum), the number of complement repeats varied from 1 (G. lancangensis and K. rotunda 'Silver Diamonds') to 5 (G. schomburgkii var. angustata), the number of reverse repeats varied between 3 (H. coccineum and K. rotunda 'Silver Diamonds') and 9 (H. neocarneum), and the number of palindromic repeats  Table S9). There was no complement repeat in Z. recurvatum (Fig. 4a, Table S9). Long repeats with 30-39 bp were found to be the most common in the ten sequenced chloroplast genomes (Fig. 4b, Table S9). Long repeats with a length of 30 bp were the most common (148), and those with lengths of 31 bp and 32 bp were the second (74) and third (63), most common, respectively (Table S9). Collectively, the number, length and distribution of these long repeats varied from one species to another among the nine tested species in the current study.
Although the IR region of the 17 Zingiberoideae species' chloroplast genomes was highly conserved, structure variation was still found in the IR/SC boundary regions.
Within the 17 Zingiberoideae species, the rpl22 and rps19 genes were located in the boundaries of the LSC/ IRB regions, except for Z. spectabile, in which there were trnM and ycf2 genes, and the rpl22/rps19 gene was absent in the boundaries of the LSC/IRB regions (Fig. 5). There were 20-125 bp between rpl22 and the LSC/IRB borders within the rest of the 16 Zingiberoideae species, and the distance between rps19 and the LSC/IRB boundary ranged from 123 bp to 173 bp (Fig. 5).    5).
The SSC/IRA boundary was situated in the ycf1 coding region, which crossed into the IRA region in all 17 Zingiberoideae species. However, the length of ycf1 in the IRA region varied among the 17 Zingiberoideae species from 665 bp to 3987 bp, which indicated dynamic variation in the SSC/IRA boundaries (Fig. 5).
The rps19 and psbA genes were situated in the boundaries of the IRA/LSC regions in all 17 Zingiberoideae species, except for Z. spectabile, in which the trnH gene was at one end of the IRA region 256 bp away from the IRA/LSC border; for the rest of the 16 Zingiberoideae species, the distances between rps19 and the IRA/LSC border ranged from 123 bp to 173 bp (Fig. 5). For all 17 Zingiberoideae species, a 94-219 bp distance was observed between the psbA gene and the IRA/LSC border (Fig. 5).
Combing the results of mVISTA, CGView and DnaSP, 33 regions were extracted and constructed maximum   Fig. S2i, j, m).

Characterization of substitution rates and positive selection analyses
The nonsynonymous (Ka) and synonymous (Ks) nucleotide substitution rates of all 79 protein coding genes were analyzed across 17 Zingiberoideae species. Overall, the Ka/Ks ratios were less than 1.00 and invalid for most pairs comparison (95.37%) ( Table S11a). There were 49 protein coding genes with Ka/Ks ratios greater than 1.00 and p values less than 0.05 at nucleotide level, such as accD, psbJ, rbcL, rpl20, rps7, rps8, rps15, rps16, ycf1, ycf2 and so on (Table S11b). These data sets were so sophisticated and may generate false positives. To measure truly positive selection at the protein level for further, we used a Bayes empirical bayes (BEB) approach in PAML [35] to integrate over these uncertainties. The BEB method inferred that some amino acid sites of 14 protein coding genes were truly under positive selection with posterior probability greater than 0.95 (Table 4, Table S12). Among the 14 protein coding genes, rps12 showed the highest number of positive amino acids sites (40), followed by ycf1 (34) and ycf2 (20) (  (Table S12). Additionally, some amino acids positions were highly variable, such as accD (4, 9 and 299), ccsA (180), rbcL (449), rps18 (27) and ycf1 (215, 928 and 1452), which displayed three or more amino acid changes among all 17 Zingiberoideae species studied here (Table  S12).

Phylogenetic inference of subfamily Zingiberoideae
To examine the phylogenetic positions of 5 Globba species, 4 Hedychium species, 3 Kaempferia species and 5 Zingiber species, and their relationships within subfamily Zingiberoideae, Bayesian inference (BI) and ML trees were constructed based on the single nucleotide polymorphism (SNP) matrix from 59 chloroplast genomes (Fig. 8, Fig. S3). These chloroplast genomes included 16 of subfamily Alpinioideae, 40 of subfamily Zingiberoideae, and 3 species (Canna indica, Costus pulverulentus, and Costus viridis) as the outgroups. The topological structures of the BI and ML trees were consistent, and were divided into two subfamilies of Alpinioideae and Zingiberoideae with strong support (posterior probability = 1.00 for the BI tree and bootstrap value = 100% for the ML tree) (Fig. 8, Fig. S3). In the BI tree, the posterior probabilities of all nodes reached 1.00 (Fig. S3), which indicated that all nodes were strongly supported. In the ML tree, there were two genera (Amomum and Alpinia) in subfamily Alpinioideae, and six genera (Curcuma, Globba, Hedychium, Kaempferia, Stahlianthus, and Zingiber) in subfamily Zingiberoideae with moderate to strong support (bootstrap values = 83-100%) (Fig. 8).
Within subfamily Zingiberoideae, there were two clusters Curcuma I and Curcuma II in genus Curcuma: Curcuma II comprised only one species (C. flaviflora), while the rest of the Curcuma species were grouped in Curcuma I; Curcuma I also included genus Stahlianthus; for genus Globba, G. schomburgkii was first grouped with G. schomburgkii var. angustata with strong support (bootstrap value = 100%), and had a nested relationship with G. marantina, G. lancangensis, and G. multiflora (bootstrap values = 100%); for genus Hedychium, H. neocarneum and H. spicatum were first clustered together with strong support (bootstrap value = 90%), and then clustered with H. coccineum and H. coronarium with strong support (bootstrap values = 100%); for genus Kaempferia, there were two clusters with strong support (all bootstrap values = 100%), one of which included two forms of K. rotunda (K. rotunda 'Red Leaf ' and K. rotunda 'Silver Diamonds'), and the other included K. galanga and K. elegans; for genus Zingiber, Z. zerumbet, Z. spectabile, Z. montanum, Z. officinale and Z. recurvatum were clustered step-by-step with strong support (bootstrap values ≥99%) (Fig. 8). Meanwhile, Table 3 Evaluation of the identification capability of thirteen regions among four genera in subfamily Zingiberoideae

Chloroplast genome structure and sequence variation
In this study, 10 complete chloroplast genomes of 9 species from four genera of subfamily Zingiberoideae, namely, Globba, Hedychium, Kaempferia and Zingiber, were sequenced, assembled and applied for their comparative analyses with other related Zingiberoideae species [15,21,23,26]. The genome size (between 162,630 bp and 163,968 bp), genome quadripartite structure (one LSC, one SSC and two IR regions), GC content  (ranging from 35.73 to 36.18%), gene composition, most of the protein coding genes, tRNAs and rRNAs showed high similarities among the 10 sequenced chloroplast genomes, which had been observed in other Zingiberoideae chloroplast genomes [15,[21][22][23]26]. Although the chloroplast genomes of Zingiberoideae species were highly conserved, gene loss, intron loss and gene duplication occurred in current study, for example, both H. coccineum and H. neocarneum lost lhbA gene, suggesting that gene loss and insertion had occurred during the evolutionary process of H. coccineum and H. neocarneum.
Contraction and expansion at the borders of the IR regions of chloroplast genomes are considered to be important evolutionary events and may cause size variations, the origination of pseudogenes, gene duplication or the reduction of duplicate genes to single copies [41][42][43][44]. For instance, the IR region in Heimia myrtifolia is 25,643 bp long, which is shorter than the IR region of most Lagerstroemia species, indicating that the Lagerstroemia species differentiated later than H. myrtifolia in family Lythraceae [41]. After comparing the chloroplast genomes among 17 Zingiberoideae species, we found that the boundaries between the SSC and two IR regions were relatively conserved, and the distribution and locations of gene types in these regions were highly consistent. Compared with the other 16 Zingiberoideae species, the length of IR region in Z. spectabile was the smallest The tree was constructed with maximum likelihood analysis of SNP matrix using MEGA software [32]. The stability of each tree node was tested by bootstrap method with 1000 replicates. Bootstrap values ≧ 50% were indicated numbers next to the branches. The ten sequenced chloroplast genomes in this study were marked in bold (25,451 bp), mainly because the ycf2 gene located at the LSC/IRB boundary. Additionally, among studied 17 Zingiberoideae species, only two tRNAs of Z. spectabile were found at the LSC/IRB and LSC/IRA boundaries, respectively. Therefore, changes in the LSC/IR boundaries may be the main contributors to the contraction and expansion of IR regions in these Zingiberoideae species.

Chloroplast genome evolution in subfamily Zingiberoideae
To resolve the evolutionary history of Zingiberoideae species, it is necessary to analyze their adaptive evolution. The Ka/Ks ratio is very useful for measuring selection pressure at the protein level: if Ka/Ks > 1, the protein is considered to be positively selected; if Ka/Ks = l, the protein is neutral; and if Ka/Ks < 1, the protein is considered to have undergone purifying selection [50,51]. In this study, 14 genes with positive selection sites are identified in four genera of Globba, Hedychium, Kaempferia and Zingiber in subfamily Zingiberoideae. Among these genes containing amino acid positive sites, three genes of them encoding ribosome subunit proteins (rpl20, rps12 and rps18), are involved in chloroplast gene expression, which is essential for chloroplast biogenesis and function [52]. rps12 gene exhibits its variations in intragenic exon location and intron content, and has important effects on evolutionary rates and patterns of molecular evolution in fern [53]. Its spicing activity has been reported to impair photosynthesis and perturb development in Arabidopsis [52]. In our analyses, rps12 gene harbors the highest number (40) of positive amino acid sites within 17 Zingiberoideae species, suggesting that the rps12 gene may play important roles in Zingiberoideae species evolution and development.
Moreover, ten genes have also been identified with positive selection sites in current study, namely, accD, ccsA, ndhB, psbJ, rbcL, rpoC1, rpoC2, ycf1, ycf2 and ycf4. Recent studies have indicated that these ten genes with positive selection in some angiosperms may be very common phenomena [14, 26, 37, 45-47, 50, 51, 54]. For examples, accD, ndhB, ycf1 and ycf2 have been reported as positive selection in some Zingiberaceae species [14,26]; accD, ccsA, rbcL, rpoC1, rpoC2, ycf1, ycf2 and ycf4 have also been identified under positive selection in Orchidaceae [37]; accD, ccsA, psbJ, rbcL, ycf1 and ycf4 have also been identified under positive selection in Lythraceae [45,50]; accD, ccsA, rbcL, rpoC2, ycf1 and ycf2 have also been identified under positive selection in Euterpe [47]; accD, rbcL and ycf2, have also been identified under positive selection in Monsteroideae [46]; accD, ndhB and ycf2 have also been identified under positive selection in Pterocarpus [51]; and ycf2 has also been identified under positive selection in Pyrus [54]. Additionally, among these ten genes, we find that rbcL, ycf1 and ycf2 genes possess higher number (8,34,20, respectively) of positive amino acid sites within Zingiberoideae species. rbcL gene encodes large subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase (rubisco), and ycf1 and ycf2 genes encode unknown function proteins. Both rbcL and ycf1 genes are intensively used for species identification, phylogenetic, phylogeography, germplasm conservation and innovative utilization of many plants [55][56][57][58]. Lastly, there is one gene (ndhA) encoding subunit of NADH-plastoquinone oxidoreductase, which is found under positive selection with 5 amino acid sites. Among our analyzed species of Globba, Hedychium, Kaempferia and Zingiber here, they own diverse pseudostem heights and habitats; for instance, K. galanga spreads flat on ground, living in open areas, while G. schomburgkii, H. coronarium, and Z. officinale have pseudostems heights of 30-50 cm, 1-3 m, and 50-100 cm, respectively, native to forests [1,[3][4][5]. In other words, they live in diverse environment in their respective habitats, such as temperature, light and humidity, and keep high levels of plant diversity. We accept that our analyzed plants do not fully contain all of Zingiberoideae plants, and that they may exist genetic variations by themselves. Nonetheless, based on our results, we propose that positive selection and environmental heterogeneity may interconnect together to contribute to Zingiberoideae species evolution and adaption.

Phylogenetic analyses in subfamily Zingiberoideae
Over the past two decades, phylogenetic relationships within four genera of Globba, Hedychium, Kaempferia and Zingiber in subfamily Zingiberoideae, had been some ambiguous in previous phylogenies [2,[8][9][10][11], for examples, regarding Globba, nuclear ITS and chloroplast matK data had very low resolution or were generally lacking among Globba species [10]; and for Hedychium, nuclear ITS and chloroplast matK/trnL-trnF molecular data had strongly supported the monophyly of Hedychium, but its relationships with other genera had been poorly supported [2,8,9]. In this study, both BI and ML phylogenetic trees have demonstrated some congruence with previous phylogenies in subfamily Zingiberoideae, for instance, the monophyly of Hedychium [2,8,9]. Of course, our analyses also have strongly identified that Globba, Hedychium, and Curcuma I + "the Kaempferia clade" consisting of Curcuma II, Kaempferia and Zingiber, display a nested evolutionary relationship with high resolution in subfamily Zingiberoideae (Fig. 8, Fig.  S3).
Regarding Globba, Globba was classified into tribe Globbeae, which was one of the two tribes of subfamily Zingiberoideae in previous phylogenies [2,10]. In our analyses, the close evolutionary relationship between Globba and Hedychium as well as their genetic boundaries have been identified (Fig. 8, Fig. S3). Therefore, the taxonomic position of Globba requires some discussion. On the one hand, Globba owns the universal morphological characters of tribe Zingibereae, which has the parallel orientation of the plane of the distichy of the leafy shoots with respect to the rhizome, and the conspicuous and often well-developed lateral staminodes [2,3,10]. On the other hand, Hedychium is classified into tribe Zingibereae, and Globba is close to Hedychium with high resolution based on current phylogenetic results (Fig. 8,  Fig. S3). We confirm that our results do not completely resolve all of relationships among genera in two tribes of subfamily Zingiberoideae, and that our results do not sample a great deal of Globba species. In spite of all that, based on the results of our molecular phylogenies, we suggest a realignment of the tribe of Globba in subfamily Zingiberoideae: Globba is here transferred into tribe Zingibereae (Fig. 8, Fig. S3). Because of two important genera, Gagnepainia and Hemiorchis, were classified into tribe Globbeae in past phylogenies [2,10], we recommend that retaining the tribe Globbeae as previous recognized until future new evidence proves otherwise.

Conclusions
In this study, ten complete chloroplast genomes from nine Zingiberoideae species, namely, G. lancangensis, G. marantina, G. multiflora, G. schomburgkii, G. schomburgkii var. angustata, H. coccineum, H. neocarneum, K. rotunda 'Red Leaf ' , K. rotunda 'Silver Diamonds' and Z. recurvatum, have been sequenced, assembled and annotated for the first time. The structural characteristics of these ten chloroplast genomes are shown to be conservative, which are similar to those reported chloroplast genomes of Zingiberoideae species. Meanwhile, comparative analyses of 18 Zingiberoideae chloroplast genomes have generated 12 highly variable regions, which may be used as a potential source of molecular markers for phylogenetic analysis and species identification. Based on whole chloroplast-derived SNP data, phylogenetic relationships among four genera of Globba, Hedychium, Kaempferia and Zingiber in subfamily Zingiberoideae have been clearly resolved. In addition, at level of amino acid sites, 14 genes are under positive selection with high posterior probabilities, which may play important roles in Zingiberoideae species evolution and adaption to diverse environment. These results increase the genomic resources available for subfamily Zingiberoideae, which will be useful for future studies of evolution and phylogenetic.

Plant material sampling and chloroplast DNA extraction
We generated data on ten chloroplast genomes for nine species within the Zingiberoideae subfamily. Fresh and healthy leaves (G. lancangensis, G. marantina, G. multiflora, G. schomburgkii, G. schomburgkii var. angustata, H. coccineum, H. neocarneum, two forms of K. rotunda, namely K. rotunda 'Red Leaf ' and K. rotunda 'Silver Diamonds' , and Z. recurvatum) were collected from the resource garden of the environmental horticulture research institute (23°23′N, 113°26′E) at the Guangdong Academy of Agricultural Sciences, Guangzhou, China, and were immediately frozen in liquid nitrogen and stored at − 80 °C. The total genomic DNA was extracted from these leaves using the improved sucrose gradient centrifugation method [59]. The concentration and quantity of each isolated genomic DNA sample were determined with a NanoDrop 2000 micro spectrometer (Wilmington, DE, USA) and 1% agarose gel electrophoresis, respectively.

Chloroplast genome sequencing, assembly and annotation
For G. schomburgkii, two libraries with insert sizes of 300 bp and 10 kb were constructed after purification and then sequenced on an Illumina Hiseq X Ten instrument (Biozeron, Shanghai, China) and a PacBio Sequel platform (Biozeron, Shanghai, China), respectively. For the other species (G. lancangensis, G. marantina, G. multiflora, G. schomburgkii var. angustata, H. coccineum, H. neocarneum, K. rotunda 'Red Leaf ' , K. rotunda 'Silver Diamonds' , and Z. recurvatum), a library with insert sizes of 300 bp was constructed after purification for each species and then sequenced on an Illumina Hiseq X Ten instrument (Biozeron, Shanghai, China).
Complete chloroplast genomes were annotated using the online tool DOGMA (Dual Organellar Genome Annotator) [65] with default parameters, and then, it was checked manually. tRNAs and rRNAs were annotated by BLASTn searches on the NCBI website. A verification of tRNAs and rRNAs was performed using tRNAscanSE with default parameters [66] (Tables 1, 2; Tables S2, S3). Finally, the OGDRAW v.1.3.1 program was used with default settings to draw the circular chloroplast genome maps of the Zingiberoideae species and was manually edited [30] (Fig. 1; Fig. S1).

Prediction of codon usage and RNA editing sites
To examine the deviation in synonymous codon usage, the relative synonymous codon usage (RSCU) was calculated using MEGA v.7 [32] (Fig. 2a, Table S4). When the RSCU value > 1.00, this means that the use of a codon is more frequent than expected, and vice versa [39,43]. The clustered heat map of RSCU values of ten sequenced Zingiberoideae chloroplast genomes was conducted by R v.3.6.3 [33] (Fig. 2b, Table S5). To predict possible RNA editing sites in the ten sequenced chloroplast genomes, protein coding genes were used to predict potential RNA editing sites using the online program Predictive RNA Editor for Plants (PREP) suite (http:// prep. unl. edu/) with a cutoff value of 0.8 [67] (Table S6).

Analyses of SSRs and long repeats
MIcroSAtellite (MISA) (http:// pgrc. ipk-gater sleben. de/ misa/) was used to detect the location and number of SSRs of the ten sequenced chloroplast genomes with the settings as follows: ≥ 8 for mono-, ≥ 5 for di-, ≥ 4 for tri-, and ≥ 3 for tetra-, pena-, and hexa-nucleotide SSRs (Fig. 3, Tables S7, S8). The REPuter software was employed to identify long repeats such as forward, palindrome, reverse and complement repeats [68]. The criteria for determining long repeats were as follows: (1) a minimal repeat size of more than 30 bp; (2) a repeat identity of more than 90%; and (3) a hamming distance equal to 3 (Fig. 4, Table S9).

Comparative genomic analysis
To detect the contractions and expansions of the IRs in the chloroplast genomes of the four genera (Globba, Hedychium, Kaempferia and Zingiber), 18 whole genomes within subfamily Zingiberoideae were compared (Fig. 5). Using the annotated chloroplast genome of G. lancangensis as the reference, the mVISTA tool with the Shuffle-LAGAN mode [34] was used to make pairwise alignments among these 18 whole chloroplast genomes (Fig. 6). Additionally, the G. lancangensis chloroplast genome was compared to 17 chloroplast genomes of the four genera (Globba, Hedychium, Kaempferia and Zingiber) using CGView [31] (Fig. 1). GC distributions were measured based on GC skew using the following equation: GC skew = (G-C)/(G + C). The nucleotide variability (Pi) of the protein coding, intron and intergenic regions among these 18 chloroplast genomes within the Zingiberoideae subfamily was extracted and then calculated using DnaSP v.6 [69] (Fig. 7, Table S10). To identify the highly divergent regions, the protein coding regions and intergenic regions were extracted and then manually aligned using ClustalW in MEGA v.7 [32]. To obtain the molecular markers for differentiating the four genera in subfamily Zingiberoideae, a maximum likelihood (ML) analysis based on the nucleotide substitution model of Tamura-Nei was conducted using MEGA v.7 with 1000 replicates [32] (Table 3, Fig. S2).

Characterization of substitution rates and positive selection analyses
To calculate the nonsynonymous (Ka) and synonymous (Ks) substitution rates, the 79 protein coding genes of 18 Zingiberoideae chloroplast genomes, coming from four genera of Globba, Hedychium, Kaempferia and Zingiber, were extracted from their genomes, respectively. Then, the KaKs_Calculator with default parameters was used to calculate the rates of Ka, Ks and their ratio (Ka/Ks, denoted ω) for each gene of 18 Zingiberoideae chloroplast genomes by comparing pairwise [70]. A total of 11,835 Ka/Ks were obtained; the value could not be calculated if Ks = 0 or if the two aligned sequences existed as a perfect 100% match (Table S11). Next, to identify amino acid sites under the true occurrence of positive selection, program CODEML from PAML package v.4.8a [35] was used, with each corresponding protein coding sequence of chloroplast genome of G. lancangensis as the reference. A Bayes empirical bayes (BEB) approach [71] was then used to calculate posterior probabilities that a site came from the site class with ω > 1 by a site specific model 8 (β and ω). In the BEB analysis, posterior probability higher than 0.95 and 0.99 indicated sites that were under positive selection and strong positive selection, respectively ( Table 4, Table S12).

Phylogenetic tree analyses
To reconstruct the phylogenetic relationships and analyze the phylogenetic positions of 5 Globba species, 4 Hedychium species, 3 Kaempferia species and 5 Zingiber species within subfamily Zingiberoideae, 56 complete chloroplast genomes of family Zingiberaceae were used for analysis (Table S13). A total of 49 complete chloroplast genomes were downloaded from the Gen-Bank database. C. pulverulentus, C. viridis and C. indica were used as the outgroups. The phylogenetic trees were constructed using an SNP matrix from 56 chloroplast genomes of family Zingiberaceae and 3 chloroplast genomes of outgroups, using the chloroplast genome of G. lancangensis as the reference. Reliable SNP sites were obtained using a previously described method that utilized MUMmer and BLAT software [19,21,26,[72][73][74]. For each chloroplast genome, all SNP sites were connected in the same order to obtain a sequence in FASTA format. Each connected FASTA format sequence contained 11,346 SNP markers (Table S14). Multiple FASTA format sequence alignments were performed with ClustalW in MEGA v.7 [32]. The ML analysis of MEGA v.7 was used for the reconstruction of the ML phylogenetic tree with default settings including 1000 bootstrap replications along with the nucleotide substitution model of Tamura-Nei [32] (Fig. 8). Bootstrap values were categorized as strong (> 85%), moderate (70-85%), weak (50-70%), or poor (< 50%) [2]. Lastly, BI analysis was performed using MrBayes v.3.2 [75], using the substitution model GTR with running parameters: the Markov Chain Monte Carlo algorithm was applied for 2 million generations with four Markov chains and sampled of trees every 100 generations, then the first 10% of trees were discarded as burn-in. The software iTOL v.3.4.3 (http:// itol. embl. de/ itol. cgi) was used to edit and visualize the final BI tree (Fig. S3).