Comparative plastome analysis of Musaceae and new insights into phylogenetic relationships

Musaceae is an economically important family consisting of 70-80 species. Elucidation of the interspecific relationships of this family is essential for a more efficient conservation and utilization of genetic resources for banana improvement. However, the scarcity of herbarium specimens and quality molecular markers have limited our understanding of the phylogenetic relationships in wild species of Musaceae. Aiming at improving the phylogenetic resolution of Musaceae, we analyzed a comprehensive set of 49 plastomes for 48 species/subspecies representing all three genera of this family. Musaceae plastomes have a relatively well-conserved genomic size and gene content, with a full length ranging from 166,782 bp to 172,514 bp. Variations in the IR borders were found to show phylogenetic signals to a certain extent in Musa. Codon usage bias analysis showed different preferences for the same codon between species and three genera and a common preference for A/T-ending codons. Among the two genes detected under positive selection (dN/dS > 1), ycf2 was indicated under an intensive positive selection. The divergent hotspot analysis allowed the identification of four regions (ndhF-trnL, ndhF, matK-rps16, and accD) as specific DNA barcodes for Musaceae species. Bayesian and maximum likelihood phylogenetic analyses using full plastome resulted in nearly identical tree topologies with highly supported relationships between species. The monospecies genus Musella is sister to Ensete, and the genus Musa was divided into two large clades, which corresponded well to the basic number of n = x = 11 and n = x =10/9/7, respectively. Four subclades were divided within the genus Musa. A dating analysis covering the whole Zingiberales indicated that the divergence of Musaceae family originated in the Palaeocene (59.19 Ma), and the genus Musa diverged into two clades in the Eocene (50.70 Ma) and then started to diversify from the late Oligocene (29.92 Ma) to the late Miocene. Two lineages (Rhodochlamys and Australimusa) radiated recently in the Pliocene /Pleistocene periods. The plastome sequences performed well in resolving the phylogenetic relationships of Musaceae and generated new insights into its evolution. Plastome sequences provided valuable resources for population genetics and phylogenetics at lower taxon.

The genus Musa was established by Carolus Linnaeus in 1753 [7]. Cheesman [8] divided the genus into four sections: Australimusa and Callimusa with n = 10, Eumusa and Rhodochlamys with n = 11 chromosomes. Later, Argent [9] established the Musa sect. Ingentimusa based Sections Rhodochlamys and Eumusa are closely related, having bracts that are generally sulcate, glaucous and that become revolute on fading [8]. This contrasts with species of sections Australimusa and Callimusa, which have bracts that are smooth, polished on the outside, and that do not become revolute on fading. In contrast with the pendent inflorescences with dull-colored bracts and large plants (3 m or taller) in Eumusa, species of sect. Rhodochlamys are generally smaller in stature (less than 3 m), have erect inflorescences with brightly colored bracts. Species of sect. Callimusa are separated from those of sect. Australimusa by their unique seeds, which are cylindrical or barrel-shaped and possess a large apical chamber. Seeds of species of sect. Australimusa are subglobose or dorsiventrally compressed and possess a small apical chamber. These five sections proved to be very useful and have been widely accepted [8][9][10][11]. Since the molecular markers were applied in plant systematics, there are many related studies on the Musa section assessment. For example, Wong et al. [12] used AFLP to validate this classification system. Several phylogenetic studies have been published for the Musaceae, however, none of these five sections was recovered as monophyletic [5,6,[12][13][14][15][16][17][18]. Only two infrageneric clades corresponded well to the basic chromosome numbers (one clade with n = x = 11, the other with n = x = 10/9/7) [6,17]. Häkkinen [2] reappraised the five-section system by integrating molecular phylogenetic studies and proposed two infrageneric clades classification: sect. Musa and sect. Callimusa (referring as sect. Callimusa Cheesman emend Häkkinen). Sect. Rhodochlamys was synonymized with sect. Musa, sect. Australimusa and sect. Ingentimusa were treated as synonyms of sect. Callimusa [2].
Most edible banana cultivars are from hybridization between Musa acuminata Colla different subspecies or with M. balbisiana Colla [3] and these two species are both from the sect. Musa [2]. A well-resolved phylogeny of Musaceae is critical for the germplasm conservation of cultivated banana ancestors and their wild relatives. However, a well-resolved phylogeny of Musaceae has been still missing. The lack of herbarium specimens and quality molecular markers limited our understanding of the phylogenetic relationships of Musaceae species. Studies with broad taxonomic coverage usually employed limited gene fragments and reconstructed phylogeny containing polytomy and low-support branches [5,6,[17][18][19]. For instance, using plastid atpB-rbcL, rps16, trnL-F and nuclear ribosomal ITS, Li et al. [6] generated a phylogenetic tree with many polytomies though this study covered 36 species. Recently, Burgos-Hernandez et al. [18] used ITS, trnL-trnF and atpB-rbcL to conduct a biogeographic analysis of Musaceae and covered 37 species. Their resulting phylogeny also encompassed multiple low-support branches. In contrast, studies using multiple low copy nuclear genes or even whole-genome sequences on Musaceae phylogeny have in-depth gene coverage and strong internal support, but their taxonomic coverage was often sparse [20][21][22][23] since their sampled species did not even exceed 20. Thus, it is worthwile to investigate phylogenetic relationships of Musaceae in more detail with both expanded taxonomic coverage and gene sampling.
Genome skimming, an approach to sequence samples with shallow depth, is usually used to acquire the highcopy genomic fraction, such as plastome [24]. Many studies showed that the plastome significantly resolves phylogenetic relationships at lower taxonomic levels [25][26][27][28][29]. The plastome is maternally inherited without recombination in Musaceae [30]. They are generally comprised of four regions, namely the large single copy (LSC), the small single copy (SSC), and two inverted repeats (IRs, IRa, and IRb) [31]. Some highly variable regions in the plastome have been identified as "hotspots" and employed as useful molecular markers for phylogenetic studies [32,33]. In recent years, although some plastome sequences ofMusaceae have been reported [23,[34][35][36], most species studied concentrated on a few wild bananas cultivated at botanical gardens and did not propose a comprehensive plastome analysis for the Musaceae family. In this study, we used the genome skimming approach for the assembly of the plastomes of a large panel of Musaceae species. We analyzed their plastome (1) to investigate the plastome structure variations; (2) to identify highly variable regions; and (3) to reconstruct the phylogeny of the Musaceae, and (4) to assess the divergence time of the main clades.

Plastome features
We analysed the structure of 49 full plastomes covering 48 species/subspecies in the Musaceae (including 45 new plastome assemblies generated for this study) ( Table 1). The full-length variation of Musaceae and the genus Musa plastomes is approximately 5.7 kb (plastome length: 166,782-172,514 bp), with small variation in Ensete plastomes (163 bp, plastome length: 168,248-168,411 bp). All sequenced plastomes exhibited the typical quadripartite structure, composed of one LSC, one SSC, and two IRs (IRa and IRb) (Fig. 2). The overall GC content was nearly identical (36.5-37.1%) (Table 1). Individual plastome was annotated and followed by manual checking, resulting in a total of 113 genes, including 79 protein-coding genes, 30 transfer RNA (tRNA), and four ribosomal RNA (rRNA) genes (Fig. 2 Table S2). Among these 113 genes, 21 genes have two copies (within IR region), the remaining 92 have one single copy. Sixteen genes have one single intron, and two contain two, the left 95 genes have no intron (Table S2). The complete plastome alignment for the 48 Musaceae species illustrated that there was no genomic rearrangement (Fig. S1).

IR boundary comparative analysis
The IR/LSC and IR/SSC junctions of the 49 Musaceae species were compared to explore the IR expansion/contraction (Fig. S2). No noticeable expansion or contraction was found within the four Ensete species. Compared to Ensete species, the JLA and JLB of Musella lasiocarpa extended into gene rps19. Apparent differences in IR boundaries were observed among Musa species. The JSB of Musa gracilis withdraws to the spacer of ndhA1 and ndhF compared to other species from sect. Callimusa Cheesman emend Häkkinen, of which JSBs resided in ndhF (Fig. S2). On the contrary, the JSB of Musa balbisiana extended into the ndhF gene compared to other species in the sect. Musa. All those species from the sect. Callimusa Cheesman emend Häkkinen had only one copy of rps19 gene. In contrast, those species from the sect. Musa had one more copy of rps19, except Musa velutina. The four junctions between LSC/IRs and SSC/ IRs were confirmed with PCR-based sequencing. The assembly of the PCR product was mapped against the plastome that we generated previously and the mapping result was shown in Fig. S3. All of the IR borders could match the assemblies of PCR-based sequences.

Codon usage preference
Among the 49 Musaceae plastomes, the total codons (including stop codons) ranged from 28,770 in M. itinerans to 29,521 in M. yunnanensis (Table S3). The codon frequency was relatively similar across Musaceae species (Table S4). Only methionine (Met) and tryptophan (Trp) were encoded by a single codon among all 20 amino acids encoded by 64 codons (Fig. 3). The three most frequent codons were GAA-Glu, AUU-Ile, and AAA-Lys (Table S4). The most and least abundant amino acids were leucine (Leu) and cysteine (Cys), encoded by about 10% and 1% of codons, respectively (Table S4). The relative synonymous codon usage (RSCU) values of the same codon were very similar between all plastomes of Musaceae (Table S4). The two codons with the highest RSCU values were AGA-Arg and UUA-Leu. Codons ending in T or A had RSCU > 1. In contrast, codons with C or G in the third position mostly had RSCU < 1, indicating a significant preference for codons ending with T and A, which is generally observed in the angiosperm plastomes [37,38]. GC3 value is significantly higher than the GC2 in all Musaceae species, which supported this preference pattern (Table S3). Musa species exhibited higher usages in UUG, GUG, GAA, CGU, AGA, GGU, and GGA (Table S5).

Selective pressure analysis
Synonymous (dS) and nonsynonymous (dN) substitution rates, as well as dN/dS, were determined for the 79 coding sequences to estimate the selective pressure acting on them (Fig. S5, Table S11). The dN and dS ranged from 0 to 0.16, and 0 to 0.59, respectively. Among the 79 CDSs, ndhF and rpl32 showed relatively higher dS values (> 0.4), while accD and matK exhibited relatively higher dN values (> 0.1; Fig. S5, Table S11). For most genes (89.87%), dS was significantly greater than dN, resulting in a dN/dS value less than 0.5, suggesting a purifying selection. Two genes with relatively higher dN/dS value were identified (dN/dS > 1; ycf1, ycf2 valued as 1.16 and 4.44, respectively). The null model (dN/dS = 1) was performed for ycf1 and ycf2. The P value of Chi-square test for ycf2 was less than 0.05, indicating an intensive positive selection. P value of ycf1 was 0.4335, it suggested that ycf1 may not be in positive selection (Table 2).

Sequence variability and divergent hotspots identification
Nucleotide diversity (Pi) of the 49 Musaceae plastomes ranges from 0 to 0.03282, with an average of 0.00698 (Fig.  S6, Table S12). Among LSC, SSC, and IR regions, SSC and IR regions exhibit the highest and the lowest Pi value of 0.01671 and 0.00389, respectively (Table S12). Ten most variable regions with peak Pi values > 0.020 and alignment length over 600 bp were identified as divergent hotspots (Fig. S6, Table S12). The ndhF-trnL sequence had the highest Pi value (0.02470), followed by ndhF, matK-rps16, and accD (Table S12). These four hypervariable markers had more haplotypes (45 vs. 34) and higher resolution than the three universal DNA barcodes (matK, rbcL, and trnH-psbA) based on the ML tree (Fig. S7, Table S12). Moreover, based on the combination of the four most variable markers, many indels sites could be found within those pairwise species with the lowest K2P distance (Table S13). These indels increased the species identification rate for those closely related species.

Phylogenetic relationships
Our Maximum likelihood (ML) and Bayesian inference (BI) analyses generated a consistent phylogenetic tree supporting the same topological structure. The CDSs and the complete plastome dataset produced similar topology trees with only one discordance on the relationship between five species in sect. Callimusa (M. borneensis, M. barioensis, M. gracilis, M. salaccensis, M. lokok) (Fig. S8,  Fig. S9). The full plastome dataset provided a better-supported phylogeny than CDSs dataset because it possessed fewer branches with bootstrap support values of less than 90%. The monospecies genus Musella is sister to the Ensete (Fig. 5). The genus Musa was subdivided into two large clades, which corresponded to the Callimusa and Musa

Phylogenetic relationships of Musaceae
Compared to previous phylogenetic studies on Musaceae [5,6,17], this study is the first one to analyze Musaceae phylogenetic relationships with density sampling using plastome-scale sequences. The resulting tree is fully resolved with substantially increased support value for several branches across the Musaceae tree (Fig. 5). The sister relationship between the genus Musella and Ensete is reassured. The genus Musa is well-supported into two clades, corresponding to Häkkinen's two-section reappraisal as Sect. Musa and Sect. Callimusa Cheesman emend Häkkinen [2] that delineated the basic chromosome number of n = x = 11 and n = x =10/9/7, respectively. For the infrageneric classification in Musa, Cheesman [8] indicated that"the groups have deliberately been called sections rather than subgenera in an attempt to avoid the implication that they are of equal rank". Although there are significant morphological characters and chromosome number difference between both clades, following the suggestion of Cheesman [8], Häkkinen [2] classified both clades as sect. Musa and sect. Callimusa, respectively (Fig. 5). x = 11 is most reasonable original basic number in Zingiberales [39], with x = 10, 9,  splendida and M. lutea [41], but concentrating only on their morphological description. For this study, we could not access the material but it would likely help refining species delimitation and phylogenetic relationship within the subclade and between the two subclades.
The subclade II (with support value: 100/1.0) distributes in the Malayan Peninsula/Sumatra, Borneo, and Papua Guinea, with the species diversity center in Borneo. Notably, it includes M. beccarii (2n = 18) and the physically largest wild banana, M. ingens (2n = 14), whose chromosome numbers differ from the other species in the sect. Callimusa (2n = 20) (Fig. 5). M. ingens, the only species in sect. Ingentimusa, was treated as one section by Argent at 1976 [9] due to its seven pairs of chromosmes. M. ingens distributes in the tropical montane forests of New Guinea, Indonesia. Our study sampled more Australimusa species than earlier phylogenetic studies [6,17,18,23]  shapes, they are sympatric with other species in subclade II, and phylogenetically nested within subclade II. Therefore, in agreement with previous studies [6,17], we support the treatment of Häkkinen [2], that sect. Ingentimusa and sect. Australimusa should be reduced as the synonym of sect. Callimusa.
The Musa section is also subdivided into two subclades (subclades III and IV, both with support value: 100/1.0) with the species diversity center in Indo-Burma (Fig. 5). Subclade III includes banana wild relatives that share interesting features for crop improvement, such as M. balbisiana which is resistant to the harsh environment, M. itinerans immune to Foc. 4 [42], and M. basjoo the most cold-tolerant wild banana. M. balbisiana is one of the ancestors of the interspecific cultivated banana, no obvious close relatives were reported earlier [43]. Both Li et al. [6] and Janssens et al. [17]  . These species distribute from the eastern Himalayas region to South China, and grow from seasonal tropical forest to temperate forest, with drought and cold tolerance. Natural crossing between them is a relatively common event [44]. Therefore, these species can represent valuable genetic resources for banana breeding. However, as banana wild relatives, they were often neglected while more conservation and characterization is needed.
M. acuminata species, the main wild ancestor of cultivated banana, is included in the sister subclade (subclade IV, Fig. 5). M. acuminata is an extremely variable species with a wide geographical distribution from Burma through Malaysia to New Guinea, Queensland, Samoa and the Philippines [44]. Among the M. acuminata subspecies, M. a. ssp. burmannica is the earliest diversified, consistent with the previous studies covering four M. acuminata subspecies based on whole genomes [22] and 72 M. acuminata accessions using restriction-site-associated DNA sequencing data [45]. Consistently with previous studies [5,6,17], we found that M. acuminata clustered closely with four species from sect. Rhodochlamys, namely M. rubra, M. laterita, M. siamensis, and M. rosea. However, contrary to Janssens et al. [17], M. siamensis is not nested within M. acuminata subspecies, and is clustered with M. rubra. This result reinforces recent studies that claimed M. laterita and M. siamensis as a synonym of M. rubra [46,47]. Moreover, it is worth noting M. rubra and M. rosea were described based on the vouchers cultivated in the botanical garden, without evidence of their occurrence in the wild. The only wild population of M. rubra was reported in Manipur and Mizoram, NE India [46]. M. rosea, only collected in Angkor ruins in Cambodia, has long been a "lost species" [48]. The high plastome identity between these species and M. acuminata suggests that M. acuminata have provided their maternal material during hybridization. Various Eumusa × Rhodochlamys hybrids have been observed, which gave rise to considerable taxonomic confusion in poorly understood Rhodochlamys [44]. We, therefore, speculate that both species (M. rubra and M. rosea) are hybrids between Musa acuminata and species from sect. Rhodochlamys, but more studies are needed to verify their origin and species status.
Excluding Musa rubra, M. laterita, M. rosea, and M. siamensis, the other species from sect. Rhodochlamys formed one well-supported clade (support value: 100/1.0), with the common ancestor of M. acuminata. Although Rhodochlamys was morphologically characterized by the erect inflorescence and colorful bracts, this phylogenetic relationship suggests the separation of sect. Rhodochlamys from Eumusa was not clear-cut. Both Li et al. [6] and Janssens et al. [17] did not recover its monophyly due to the low resolution of few genes. This lineage experienced a recent (ca. 10.97 Ma) and rapid speciation (Figs. 5 and 6). Sect. Rhodoclamys species concentrate in the East Himalayas region, especially in the Assam-Burma mountain region. Reproductive isolation between Rhodochlamys species is slight [44]. Due to the difficult access for field investigation and rapid speciation, extending the sampling and employing more nuclear genes would provide further evidence for the evolutionary history of Rhodoclyamys species.

Divergence time estimation
Correct phylogeny and divergence-time estimation are essential for evolutionary history study. With a complete chloroplast gene set, we can choose suitable genes to facilitate and optimize divergence-time estimation. The crown node age of Musaceae (59.19 Ma, Fig. 6 [50]. Our study used more taxon sampling and DNA nucleotide to increase the divergence-time estimation accuracy. Among those studies for divergence-time estimation of Musaceae [17,18,20,49], two fossils (Spirematospermum chandlerae and Ensete oregonense) were often used: Ensete oregonense, confirmed to be part of Musaceae [51] and Spirematospermum chandlerae Friis is the oldest known fossil of the Zingiberales. This study selected one more fossil (Zingiberopsis attenuate) and one secondary calibration point compared to other related studies [17,18,20,49].
Our analyses suggest that main lineages within Musa diversified from the late Oligocene and accelerated at the late Miocene, and two lineages (Australimusa and most Rhodochlamys species) radiated very recently in the Pliocene /Pleistocene periods. As discussed in Burgos-Hernandez et al. [18], this time frame is consistent with the collision of India with Eurasia and the uplifts of the Qinghai-Tibetan Plateau (QTP). With the uplift of the QTP, the Asian monsoon was initiated in the late Oligocene, followed by several periods of strengthening in the Miocene (e.g., ~15 Ma & ~8 Ma) and a putative abrupt strengthening in the Pliocene/Pleistocene periods (~3 Ma) [52,53]. The intensification of amount and seasonality of precipitation in South East Asia may have produced higher rates of diversification for various biotic lineages [54], which may have led to the evolutionary diversification of Musa, as demonstrated in other species from the lower altitudes of SE Asia, i.e., Lepisorus [54], Pogostemon [55], and Primulina [56]. The recent diversification of Australimusa species in the Pliocene and Pleistocene coincides with rapid orogenesis in New Guinea [57]. The orogenesis of the Central Range in New Guinea was initiated in the late Miocene, but most of the mountain uplift probably occurred since 5 Ma [54]. As found in the sect. Petermannia in the genus Begonia [58], the recent radiation in the Australimusa may be jointly triggered by orogenesis and associated microallopatry.

Divergent IR borders and selective pressure analysis
Due to possessing many repetitive sequences, the size of IR regions could be variable, and their boundaries are in random dynamics in most plants [59,60]. The contraction/expansion of IR region could bring about gene loss/ addition [61,62]. This study found that the contraction/ expansion of IR region mainly existed in the boundaries of IR regions and LSC region, namely, JLA and JLB (Fig.  S2). The IR borders variation showed phylogenetic signal in Musa to a certain extent. According to these two boundaries, the genus Musa can be roughly divided into two groups, i.e., sect. Musa and sect. Callimusa Cheesman emend Häkkinen. The divergences of IR borders also led to the variation of gene composition in the genus Musa. Specifically, within sect. Musa, except for Musa velutina with a single copy of gene rps19, the remaining species contain two copies of gene rps19. Whereas all species of sect. Callimusa Cheesman emend Häkkinen harbors only one copy of rps19, reducing the gene content to 135 (Table 1, Table S2). In addition, M. coccinea lost one copy of the trnH gene. This result is congruent with previous investigations [23]. The different copy numbers of trnH and rps19 genes may hint at their gene substitution on nuclear and/or functional redundancy in the plastid [63].
Generally, variations in the synonymous mutation rate (dS) are likely to be affected by potential factors that could change the mutation rate, e.g., DNA repair. Nevertheless, the value of nonsynonymous mutation rate (dN) and dN/dS are impacted by the varied mutation rate and driven by selection regimes [64]. In our study, ycf2 and ycf1 were found with dN/dS value greater than 1 (Fig. S5, Table S11). The gene ycf2 was indicated under intensive positive selection. Huang et al. [65] suggested that ycf2 could be a useful DNA marker for estimating sequence variation and evolution in plants. Ycf2 is one of the largest genes encoding putative membrane protein [66,67] and was found to rapidly evolve in Fagopyrum [68], Ipomoea [69], Ophrys [70], Chrysosplenium [71], and Mimosoideae [72]. The extremely high dN/dS value (4.44) of ycf2 indicated that this gene is a valuable marker for the adaptive evolution study of Musaceae.

Divergent hotspots identification and molecular markers for Musaceae species
The mutations in the plastome are not universally randomly distributed along the sequence and are concentrated in certain regions referred to as the "hotspots" [73]. The highly variable hotspot regions could be used as markers to distinguish closely related species [74] and act as the taxon-specific DNA barcode. In this study, we identified ten highly variable regions (Fig. S6, Table S12). Among them, ycf1 has been recommended as the most promising chloroplast DNA barcodes for land plants [75] and was found to harbor the greatest number of informative sites in this study. The compound region ndhF-trnL, which proved to have the highest Pi value here, has been considered to be the best marker for molecular studies at a low taxonomic level [76][77][78]. However, both ycf1 and ndhF-trnL were less discriminatory when used alone since they could not provide enough haplotypes. The species identification analyses showed the better discriminatory power of the four most variable regions combined (ndhF-trnL, ndhF, matK-rps16, and accD) (Fig. S7). Therefore, we recommend these four regions to be the specific DNA barcodes for Musaceae species.

Conclusions
This study employed the genome-skimming approach and assembled the complete plastomes of 44 Musaceae species/subspecies, providing valuable genomic resources for this family. Based on the complete plastome analysis, the relationship within Musaceae was resolved with high branch support. In addition, the comparative analysis of plastomes revealed variable regions, which could be used as Musaceae-specific DNA markers. All the obtained genomic resources will contribute to future studies in species identification, population genetics, and germplasm conservation of Musaceae.

Taxon sampling, DNA extraction, and sequencing
The taxon sampling contains 49 accessions of Musaceae species/subspecies, representing four Ensete species (four accessions), 43 Musa species/subspecies (44 accessions), and one Musella species (one accession) (Table  S14). Among these 49 Musaceae plastomes, 45 plastomes of 44 species/subspecies representing two genera (Musa and Ensete) were generated by the current study. Due to the sample collection challenges, 22 of 37 species from sect. Callimusa Cheesman emend Häkkinen could not be included in this study. Fifteen plastomes from other eight families were downloaded from NCBI for analysis. Sixty-four plastomes were used in the current study (Table S14). For data quality consistency, we dropped the plastome of Musa textilis, which presents a distinct short plastome compared to other Musa species (GenBank accession number: NC_022926.1, length 161,347 bp). Total genomic DNA was extracted from silica-dried materials using CTAB protocol [79]. The quality and concentrations of the DNA were assessed using agarose gel electrophoresis and a Qubit 3.0 Fluorometer (Life Technologies). We constructed sequencing libraries using the TruePrep DNA Library Prep Kit V2 for Illumina (Vazyme, TD501). Library lengths were evaluated with the High Sensitivity NGS Fragment Analysis Kit (Advanced Analytical Technologies, Ankeny, IA) on the Fragment Analyzer (Advanced Analytical Technologies). Lengths of all libraries ranged from 300 to 450 bp and were pooled together at equimolar ratios. Libraries were subjected to 150 bp paired-end sequencing on an Illumina X Ten platform (BGI, Wuhan, China). On average, approximately 3 Gb of clean NGS data were obtained for each sample. All raw reads data were submitted into the Sequence Read Archive (SRA) under BioProject PRJNA530661.

Plastome assembly and annotation
Raw reads were trimmed, and adaptors were removed using Trimmomatric v. 0.36 [80]. The quality of filtered reads was assessed using FastQC (http:// www. bioin forma tics. babra ham. ac. uk/ proje cts/ fastqc) to assure adaptors and bases below PHRED 30 were removed. We employed NOVOPlasty v. 4.2.1 [81] for the assembly of plastomes by providing Musa balbisiana as the reference (GenBank accession number NC_028439), and all parameters were kept as default settings (see https:// github. com/ ndier ckx/ NOVOP lasty). To confirm the result reliability of the assembling, we also used the toolkit GetOrganelle [82] to assemble the plastomes, and the parameter settings followed the online manual (see https:// github. com/ Kingg erm/ GetOr ganel le). In rare cases, when NOVOPlasty and GetOrganelle failed to obtain a complete plastome, reads were mapped against the non-overlapping contigs from NOVOPlasty to extend their ends to close the gap in Geneious, performing with medium-low sensitivity for 100 iterations.
Two independent approaches were applied to annotate these 45 plastomes. Firstly, the annotation of the plastome sequences was performed with GeSeq [83], choosing the plastome of Musa acuminata ssp. malaccensis (HF677508) as the reference genome. In the meantime, ARAGORN was selected as a third party to annotate tRNA. Secondly, we use MAFFT v. 7.388 [84,85] to align and annotate these plastome sequences using the "Annotation Transfer" option with Musa itinerans (NC_035723) as a reference in Geneious. The annotation results from GeSeq and Geneious were subsequently compared and manually integrated. The plastome maps were drawn using OGDRAW [86]. Newly generated plastomes were submitted to GenBank (see Table S14 for accession numbers).

Comparative plastome analyses for 49 Musaceae plastomes
The boundaries between the four plastome regions, i.e., LSC/IRb (JLB), SSC/IRb (JSB), SSC/IRa (JSA), and LSC/IRa (JLA), were inspected with the online program IRscope [87]. According to the phylogeny generated in this study (Fig. 5), we chose 17 representative species for confirming the IR region expansion/contraction. The four junctions between LSC/IRs and SSC/ IRs of the 17 species were confirmed with PCR-based product sequencing. Target DNA regions were amplified in 25 µl reactions containing 10 ng (1 µl) template DNA, dNTP mixture 2 µl, 10 × LA PCR Buffer 2.5 µl, 0.5 µl of each primer, and 18.5 µl ddH 2 O. The primer pairs designed and used for PCR in this study were listed in Table S15. PCR products were bi-directionally sequenced by GENEWIZ Biotechnology Co., Ltd. (Suzhou, China). The sequences were submitted to the Science DB (available at https://www.https:// doi. org/ 10. 11922/ scien cedb. 01436), and the accession number were listed in Table S16.
Codon usage analysis for protein-coding genes (PCGs) was conducted in DnaSP v. 6.12.03 [88]. PCGs were extracted and concatenated in Geneious before being imported to DnaSP for analysis. The relative synonymous codon usage (RSCU) values were calculated to measure the usage bias of synonymous codons. Other three indices, including the effective number of codons (ENC), codon bias index (CBI), GC content of the synonymous second (GC2) and third codons positions (GC3), were also computed to assess the extent of the codon usage bias.
The online program REPuter [89] was used to detect short dispersed repeats (SDRs), with the parameters setting as follows: (1) Hamming distance of 3; (2) maximum computed repeats of 500; (3) minimum repeat size of 30 bp. Besides, tandem repeats (≥ 10 bp) were calculated with the online program Tandem Repeats finder (http:// tandem. bu. edu/ trf/ trf. html). Three alignment parameters, i.e., match, mismatch, and indel were kept as two, seven, and seven. The minimum alignment score was set to 80 and the maximum period size to 500. Simple sequence repeats (SSRs) were identified in MISA-web [90]. The minimum number of repetitions was set to 10, 5, 4, 3, 3, and 3 for mono, di-, tri-, tetra-, penta-and hexa-nucleotide repeats. The Maximum length of sequence between two SSRs to register as compound SSR was set 0. Mauve v1.1.1 [91], a plugin within Geneious, was applied to detect the genome rearrangements and inversions among 49 Musaceae plastomes.

Nucleotide substitution rate analysis
Seventy-nine coding sequences (CDSs) were individually extracted from 49 Musaceae plastomes and separately aligned using "Translation Align" tool in Geneious. Nonsynonymous (dN) and synonymous (dS) substitution rates and the ratio of nonsynonymous to synonymous rates (dN/dS) were calculated using CODEML option in PAML v.4.9 [92]. The phylogeny generated from CDSs dataset was used as the constraint tree. The parameters in CODEML control file were set as follow: (1) F3 × 4 model for codon frequencies; (2) "model = 0" for allowing a single dN/dS value to vary among branches; (3) "cleandata = 1" to remove gaps; (4) default settings for other parameters (as alternative model, "fix_omega = 0" and "omega = 2") [64]. For the potential positive selection gene, a null model (set "fix_omega = 1" and "omega = 1" in the control file) was additionally performed following Xiong et al. [93]. LRT were used to test model fit and a Chi-square test was conducted to calculate the P value.

Sequence divergence analysis
A sliding window analysis was conducted in DnaSP v. 6.12.03 [88] to locate genomic regions with a high frequency of variation. The alignment of 49 Musaceae plastomes was generated in MAFFT (with default settings) and used as the input file. The window length and step size were set to 600 bp and 200 bp, respectively. Those regions with nucleotide diversity (Pi) values higher than 0.020 and alignment length longer than 600 bp were extracted from the alignment and analyzed individually to estimate their characteristics. The pairwise distance was calculated using Kimura 2-parameter (K2P) distance in MEGA 7 [94]. Indel polymorphism analysis was conducted in DnaSP v. 6.12.03.

Phylogenetic analysis
For the phylogenetic analysis of Musaceae, two datasets (coding plastid sequences (CDSs) and the complete plastome sequence) were generated. A total of 49 Musaceae plastomes representing 48 species/subspecies were used, including 45 plastomes generated in this study and four downloaded from NCBI (Table S14). Three Alpinia species with plastome in GenBank were added as outgroup (Table S14). The 79 coding plastid sequences were combined, followed by multiple sequence alignment (MSA). For the complete plastome sequence dataset, the IRa was removed and served as inputs for MSA. All alignments were performed using MAFFT [95] and then manually checked in Geneious. We used Modeltest-NG 0.1.6 [96] to determine an optimal nucleotide substitution model under the corrected Akaike Information Criterion (AICc) for each dataset. All the ML analyses were performed in RAxML v8.2.12 [97] by assigning the GTRGAMMA model, and 1,000 rapid bootstrap replicates were run to evaluate the support values for each node. All the BI analyses were conducted in MrBayes v. 3.2.6 [98], and the best-fit models selected for CDSs dataset and the complete plastome sequence dataset were both GTR+I+G. Two MCMC runs were performed with five million generations and four chains, sampling every 5,000 generations and discarding the 25% as burn-in. For the CDSs dataset, best-fit partitioning scheme (Table S17) was determined by PartitionFinder 2 [99], and an additional ML analyse was performed using IQ-TREE [100] with 1000 ultrafast bootstraps [101].

Molecular clock dating
The divergence time of Musaceae was estimated using BEAST v2.6.4 [102]. To incorporate multiple fossil calibration points and reduce the bias imported from a single calibration point, the divergence time was estimated by including the whole Zingiberales. SortaDate [50] was used to choose genes suitable for divergence-time estimation. This package determines which gene trees are clock-like, have the least topological conflict with the species tree, and have informative branch lengths. The ML tree generated from the complete plastome sequence dataset was used as an input species tree. As the result of SortaDate, the final screened genes were ccsA, matK, ndhF, rpoC1, and rpoC2. We selected optimal nucleotide substitution models for each of the five genes using Modeltest-NG 0.1.6 [96] under the AICc. These were identified as GTR+G4 for ccsA, matK, rpoC1, rpoC2, and GTR+I+G4 for ndhF.
In BEAST, the newick ML tree of Zingiberales inferred from complete plastome sequences was used as a starting tree due to its more robust phylogenetic resolution. Clock models were linked, while site models were unlinked for each gene. The uncorrelated log-normal distribution relaxed molecular clock model was selected with the Yule model as the tree prior. MCMC run was set to 100 million generations, sampling every 10,000 generations. BEAST 2 output was assessed in Tracer 1.7.2 [103] to evaluate convergence and ensure an effective sample size for all parameters surpassing 200. TreeAnnotator v2.6.4 was used to annotate the maximum clade credibility tree after removing the first 20% of samples as burn-in.
Three fossil records and one secondary calibration point were used in this divergence time estimation. Spirematospermum chandlerae [104] was used to calibrate the crown age of order Zingiberales with a mean age of 83.5 Ma. Zingiberopsis attenuate [105] was applied as a mean age of 65 Ma for the crown node of the Zingiberaceae family. Then Ensete oregonense [106] was used to calibrate the crown age of Ensete and Musella clade with a mean age 43 Ma. Each fossil calibration point was assumed to follow a normal distribution with a standard deviation of 2 and an offset of 2, resulting in 81.6-89.4, 63.1-70.9, and 41.1-48.9 Ma 95% intervals, respectively. The secondary calibration point was generated based on previous studies on Monocots [107,108]. It was placed on the stem node of Zingiberales with a normal distribution as a mean age of 100 Ma and a broad standard deviation of 5 (95% intervals 90.2 -110 Ma).