Molecular Structure and Variation Characteristics of the Plastomes from Six Malus baccata (L.) Borkh. Individuals and Comparative Genomic Analysis with Other Malus Species

Malus baccata (L.) Borkh. is an important wild species of Malus. Its rich variation types and population history are not well understood. Chloroplast genome mining plays an active role in germplasm identification and genetic evolution. In this study, by assembly and annotation, six complete cp genome sequences, ranging in size from 160,083 to 160,295 bp, were obtained. The GC content of stable IR regions (42.7%) was significantly higher than that of full length (36.5%) and SC regions (LSC-34.2%, SSC-30.4%). Compared with other Malus species, it was found that there were more sites of polymorphisms and hotspots of variation in LSC and SSC regions, with high variation sites including trnR/UCU-atpA, trnT/UGU-trnL/UAA, ndhF-rpl32 and ccsA-ndhD. The intraspecific and interspecific collinearity was good, and no structural rearrangement was observed. A large number of repeating elements and different boundary expansions may be involved in shaping the cp genome size. Up to 77 or 78 coding genes were annotated in the cp genomes of M. baccata, and high frequency codons such as UUA (Leu), GCU (Ala) and AGA (Arg) were identified by relative synonymous codon usage analysis. Phylogeographic analysis showed that 12 individuals of M. baccata clustered into three different groups with complex structure, whereas variant xiaojinensis (M.H. Cheng & N.G. Jiang) was not closely related to M. baccata evolutionarily. The phylogenetic analysis suggested that two main clades of different M. baccata in the genus Malus were formed and that I and II diverged about 9.7 MYA. In conclusion, through cp genome assembly and comparison, the interspecific relationships and molecular variations of M. baccata were further elucidated, and the results of this study provide valuable information for the phylogenetic evolution and germplasm conservation of M. baccata and Malus.


Introduction
Malus baccata (L.) Borkh., belonging to the genus Malus, is considered a large group of wild species [1,2]. It is native to China and has a wide range of natural distributions, including Northeast China, North China and Southwest China [3,4]. M. baccata has good fertility and a certain degree of resistance, so it is used in the production of grafting seedling rootstock and apple breeding parents [5]. Due to the differences in ecological environment and the influence of species adaptability, M. baccata has accumulated abundant variation types, such as broad leaves, hanging branches, early flowering, large fruit and so on [6,7]. The high genetic diversity gives a better chance of survival and development for M. baccata and also allows important data to be gathered for evolutionary relationship analysis and conservation biology research [8].
In as cultivated species [7,9,10]. Because of self-incompatibility and extensive interspecific hybridization, species classification and phylogenetic identification of Malus are very complicated [11]. Molecular marker sequences based on nuclear and organelle genomes have been used to study population evolution [12][13][14]. The chloroplast genome is maternally inherited and highly conserved, which has obvious advantages in the study of intraspecific variation and interspecific evolution [15]. In 26 cp genome evolutionary trees, M. prunifolia, M. micromalus and M. baccata are more closely related than M. yunnanensis [16]. Based on the phylogenetic branches of chloroplast genomes, M. sieversii and M. sylvestris may be the ancestors of cultivated apple [17].
Considering that there are few studies on the intraspecific variation of M. baccata and interspecific relationship of Malus species, plastid genome evolution and matrilineal heredity characteristics of six M. baccata from different geographical sources were analyzed in this paper. We mainly try to explore the following questions: (1) What are their chloroplast genome characteristics and differences? (2) How are the repeated sequences and boundary junction points distributed in the cp genomes of M. baccata? (3) What are the nucleotide polymorphisms and variation sites of the cp genome of Malus genus and how do we further classify the structure between different individuals based on the data of the organelle variation in M. baccata? (4) What is the evolutionary relationship between M. baccata and Malus species based on different chloroplast genome sequence sets and other comparative genome analysis?

Sample Extraction and Genome Assembly
Six germplasms of M. baccata (S1~S6) from different regions of China (Table S1) were transplanted in Qingdao Apple Rootstock Research and Development Center, Qingdao Academy of Agricultural Sciences and subjected to standard growth and reproduction with careful water and fertilizer management. After leaf collection and DNA extraction (CTAB method), library construction (DNA fragmentation, end repair, A-tail addition, ligation adaptor and PCR enrichment) and DNA sequencing were conducted on the Illumina HiSeq X platform (PE 150 bp). For chloroplast genome assembly, the original reads were assembled with the GetOrganelle v1.7.5 program [18] with the parameters set to '-R 15, -k 21, 45, 65, 85, 105,121'. Then, the full-length chloroplast sequence was obtained by a manual check. The GenBank numbers of the chloroplast genomes from the 6 M. baccata germplasms were generated (OQ362999~OQ363004) by submitting the sequence to the NCBI database. In addition, the plastid genome data of different species of Malus and other genera were also downloaded from the GenBank database for comparative genomic analysis, including M. baccata
PGA (plastid genome annotator) was used for regional (LSC, SSC, IRA, IRB) division and gene (CDS, tRNA, rRNA) annotation of chloroplast genomes [23]. The length and boundary of the genes were examined manually. In addition, confusion regarding gene names was corrected. The accuracy of the annotation results was further confirmed in CPGAVAS2 [24] and CPGView (chloroplast genome viewer) [25]. Finally, the correct annotation files (genbank documents) were submitted to the OGDRAW website (https: //chlorobox.mpimp-golm.mpg.de/OGDraw.html, accessed on 4 January 2023) to draw organelle genome maps and complete the visualization of genomic characteristics [26].
Codon usage bias-relative synonymous codon usage (RSCU value) and other important indicators that reflect codon adaptability and usage patterns were compared and analyzed in CodonW (http://codonw.sourceforge.net/, accessed on 10 January 2023), with the parameters set to Option_4: Codon usage indices and Option_12: Select all.

Variation Features of Chloroplast Genomes
Nucleotide diversity of the whole chloroplast genome was calculated using the MAFFT strategy (progressive method, FFT-NS-2 algorithm) and DnaSP (DNA polymorphism analysis) [27]. The boundary patterns and junction sites of the four regions (LSC, IRB, SSC, IRA) of chloroplast genomes were demonstrated using the IRscope tool (https:// irscope.shinyapps.io/irapp/, accessed on 6 February 2023) [28]. The variation hotspot analysis of the chloroplast genome for M. baccata and other Malus species was carried out in the mVISTA system (https://genome.lbl.gov/vista/mvista/submit.shtml, accessed on 6 February 2023) [29]. M. baccata S1 was designated as the reference genome and the Shuffle-LAGAN program was chosen as the alignment option.

Chloroplast Genome Composition and Structure of M. baccata (L.)
After DNA sequencing (Illumina paired-end 150 bp) and assembly (the processes of starting from reads), six complete chloroplast genome sequences were obtained. The nonrepeating six plastid sequences were assigned registration numbers (OQ362999~OQ363004). The average base coverage produced during assembly was estimated to be between 367.3× and 429.3×, whereas kmer coverage values were about 112.4 to 131.7 (Table S1). The length of their chloroplast genome was between 160,083 bp (S1) and 160,295 bp (S4), which is close to that of the previously reported species (Table 1). Chloroplast DNA is composed of four regions (SC (LSC and SSC) and IRs (IRA and IRB)), which together form a circular molecular structure. By comparing the size of each part of the six sequences, it can be seen that IR regions were basically the same (26,353 or 26,354 bp), whereas the LSC regions of different individuals had greater changes than the SSC regions (Table 1). Overall, the stability and variability of sequence length play a role in shaping chloroplast genome size.
The GC content and GC skew reflect the DNA density and stability of the genome. Through calculation and comparison of the M. baccata chloroplast genomes, it can be found the GC content of the whole genome or each section was basically identical for the six sequences provided in this experiment and the other six sequences previously published in the NCBI (Table 1 and Figure S1). In the 12 samples, the GC content in the IR regions (42.7%) was significantly higher than that in the LSC (34.2%), and the lowest content was in the SSC (30.4%). In addition, Figure S1 also shows that they have similar GC skew. The above results indicate that chloroplast partitioning stability is very important for chloroplast structure.

Homology and Similarity of Chloroplast Genomes in M. baccata and Other Malus Species
The homology of species can be better understood by global genome comparison. With O2 as the reference genome, the six chloroplast genomes sequenced in this study showed high homology, which was not only reflected in the reverse repetition region but also in other regions. As shown in Figure 1, a large number of similar fragments are scattered throughout the chloroplast genome. At the same time, a comparison of 7 M. baccata and 10 other species of Malus showed that chloroplast genomes had good collinearity and no structural rearrangement events were detected ( Figure 1). Furthermore, the same phenomenon was found by comparing the chloroplast genomes of different species of Malus ( Figure S2). What is different is that the interspecific similarity is different in different species. For example, when both are compared with M. baccata S1, the similarity effect of M. toringoides is better than that of M. kansuensis.

Identification of Repeat Sequences
The repeats in the chloroplast genome mainly include the following categories: SSRs (simple sequence repeats), LTRs (long tandem repeats) and LSRs (large sequence repeats, dispersed repeats). A large number of repeating elements were identified in the chloroplast genomes of six M. baccata species (Table 2) and other different species of Malus based

Gene Recognition
Chloroplast genome annotation is indispensable for the study of gene function, and the objects identified mainly include the CDS (coding sequence), tRNA (transfer RNA)

Codon Usage Pattern of Chloroplast Genes
Codon usage bias reflects the differences in gene evolution and expression. Codon usage patterns can be well understood by comparing the ENC (effective number of codons), CAI (codon adaptation index), CBI (codon bias index), FOP (frequency of optimal codons) and RSCU (relative synonymous codon usage). The effective codon numbers of the six chloroplast genomes were about 49, indicating that they had a certain usage bias. In addition, they had similar CAI, CBI and FOP values ( Table 3). The RSCU values show that UUA (Leu), GCU (Ala) and AGA (Arg) are used relatively frequently, whereas UAC (Tyr), CUC (Leu), AGC (Ser), CUG (Leu) and GAC (Asp) are used less ( Figure 5). In addition, isoleucine favors AUU, serine favors UCU, and the stop codon favors UAA.  The examination of nucleotide variation regions in the genome is helpful for species

Nucleotide Polymorphism in Whole Chloroplast Genomes
The examination of nucleotide variation regions in the genome is helpful for species identification and marker development. Using the DnaSP tool, the nucleotide polymorphisms of 17 chloroplasts in the genus Malus, including M. baccata S1~S6, were calculated. Through sliding window analysis, it can be seen that the polymorphism of the whole genome is varied and that there are distinctions in different regions ( Figure 6). The regions with larger differences include trnG/UCC-trnR/UCU-atpA, trnT/UGU-trnL/UAA, rps16 (LSC region) and nhdD (SSC region), etc.

Chloroplast Genome Boundary Analysis
The extension and contraction of chloroplast genome boundaries reflects genomi evolution and shapes size variation. The conserved IR and the SC formed four connectio points, namely JLA (IRa-LSC), JLB (LSC-IRb), JSA (SSC-IRa) and JSB (IRb-SSC). In th chloroplast genomes, the rps19 gene in the JLB region was stable and consistent (LSC re gion-159 bp, IRb region-120 bp) (Figure 7). In JSB junction site, there was no ycf pseudogene for M. baccata S1~S6 and M. rockii, and the offset of  (Figure 7). In addition, the distance of trnH gene from th IRa side was not completely consistent in different Malus species.

Chloroplast Genome Boundary Analysis
The extension and contraction of chloroplast genome boundaries reflects genomic evolution and shapes size variation. The conserved IR and the SC formed four connection points, namely JLA (IRa-LSC), JLB (LSC-IRb), JSA (SSC-IRa) and JSB (IRb-SSC). In the chloroplast genomes, the rps19 gene in the JLB region was stable and consistent (LSC region-159 bp, IRb region-120 bp) (Figure 7). In JSB junction site, there was no ycf1 pseudogene for M. baccata S1~S6 and M. rockii, and the offset of ycf1 to the SSC region is also different for different species (M. honanensis-8 bp, M. mandshurica and M. prunifolia-9 bp). The rps19 gene in the JLA region was only annotated in M. hupehensis, M. prattii, M. toringoides and M. sieboldii (Figure 7). In addition, the distance of trnH gene from the IRa side was not completely consistent in different Malus species.

Hotspots of Variation in the Chloroplast Genome
The hotspots of variation in the chloroplast genome were analyzed on the mVISTA website. As shown in the Figure S3, in addition to several regions with high polymorphism variation mentioned above, there are also several mutation hotspots. It is worth noting that the variations were slight in the coding regions of the genes and most of the variations were located in the non-coding regions, especially in the intergene regions. Compared with the conservative IR regions, there were more variation hotspots in the SC regions, especially in the LSC region. For LSC, trnG-UCC-trnR-UCU, trnR-UCU-atpA, atpH-atpI, petN-psbM, trnT-GGU-psbD, trnT-UGU-trnL-UAA, ndhC-trnV-UAC and petB could be used as markers of high variation ( Figure S3). Additionally, ndhF-rpl32, ccsA-ndhD and ndhA in the SSC region could be selected as candidate regions for development markers.

Hotspots of Variation in the Chloroplast Genome
The hotspots of variation in the chloroplast genome were analyzed on the mVISTA website. As shown in the Figure S3, in addition to several regions with high polymorphism variation mentioned above, there are also several mutation hotspots. It is worth noting that the variations were slight in the coding regions of the genes and most of the variations were located in the non-coding regions, especially in the intergene regions. Compared with the conservative IR regions, there were more variation hotspots in the SC regions, especially in the LSC region. For LSC, trnG-UCC-trnR-UCU, trnR-UCU-atpA, atpH-atpI, petN-psbM, trnT-GGU-psbD, trnT-UGU-trnL-UAA, ndhC-trnV-UAC and petB could be used as markers of high variation ( Figure S3). Additionally, ndhF-rpl32, ccsA-ndhD and ndhA in the SSC region could be selected as candidate regions for development markers.

Phylogeographic Structure Based SNPs and INDELs
Based on the reference genome (MK896774), 12 chloroplast genomes of M. baccata were tested for variation to explore their relationship composition and lineage structure. A total of 78 SNPs and 109 INDELs were found and extracted for building the evolutionary trees and principal components. As can be seen from Figure 8A, the evolutionary tree based on SNPs is mainly divided into three main branches. The first branch includes S1, O6 and O5, the second branch includes O3, O2, O4, S4 and O1 and the third branch includes S2, S5, S3 and S6. This result is also well reflected in the PCA analysis ( Figure 8C). The tree topology structure ( Figure 8B) and principal component analysis ( Figure 8D) based on INDELs, in general, also shows a consistent and similar phenomenon to the results from the SNP analysis; however, under this clustering condition, the relationship between S1 and O6 was closer than that between S1 and O5 ( Figure 8B,D).

Phylogeographic Structure Based SNPs and INDELs
Based on the reference genome (MK896774), 12 chloroplast genomes of M. baccata were tested for variation to explore their relationship composition and lineage structure. A total of 78 SNPs and 109 INDELs were found and extracted for building the evolutionary trees and principal components. As can be seen from Figure 8A, the evolutionary tree based on SNPs is mainly divided into three main branches. The first branch includes S1, O6 and O5, the second branch includes O3, O2, O4, S4 and O1 and the third branch includes S2, S5, S3 and S6. This result is also well reflected in the PCA analysis ( Figure 8C). The tree topology structure ( Figure 8B) and principal component analysis ( Figure 8D) based on INDELs, in general, also shows a consistent and similar phenomenon to the results from the SNP analysis; however, under this clustering condition, the relationship between S1 and O6 was closer than that between S1 and O5 ( Figure 8B,D).

Haplotype Analysis
The dominant and shared haplotypes were obtained by haplotype analysis in order to understand the species gene exchange and evolution process. By comparing the different individuals of M. baccata and M. baccata var. xiaojinensis, it was found that they formed 15 haplotypes and that haplotype 5 was composed of two individuals (M. baccata O5 and O6). It can be clearly seen from Figure 9 that the four haplotypes (Hap11, 10, 8 and 7) were closely diffused and evolved and that variants Hap_12, 13, 14 and 15 (M. baccata var. xiaojinensis) were obviously far away from the M. baccata on the right (Figure 9), which is in accordance with their morphological and evolutionary characteristics.
The dominant and shared haplotypes were obtained by haplotype analysis in order to understand the species gene exchange and evolution process. By comparing the different individuals of M. baccata and M. baccata var. xiaojinensis, it was found that they formed 15 haplotypes and that haplotype 5 was composed of two individuals (M. baccata O5 and O6). It can be clearly seen from Figure 9 that the four haplotypes (Hap11, 10, 8 and 7) were closely diffused and evolved and that variants Hap_12, 13, 14 and 15 (M. baccata var. xiaojinensis) were obviously far away from the M. baccata on the right (Figure 9), which is in accordance with their morphological and evolutionary characteristics.

Phylogenetic Tree
Based on the whole chloroplast genome sequences (24 Malus and 1 Crataegus sample), the evolutionary relationships of Malus species were compared (C. maximowiczii was used as the outgroup to specify the tree root). As shown in Figure 10A, Malus forms a large clade in which M. honanensis was juxtaposed with other individuals, indicating that it was evolutionarily distant and relatively independent. M. baccata O5, O6 and S1, M. hupehensis, M. rockii, M. toringoides and M. baccata var. xiaojinensis X1, X2, X3 and X4 together form a branch (Ⅱ); the remaining individuals evolved into another group (Ⅰ). The evolutionary network (Ⅰ and Ⅱ) could still reflect the diversity of M. baccata. The branch of the BI (Bayesian inference) tree is similar to the ML tree, and the posterior probability of each node is high ( Figure 10B), indicating that the phylogenetic relationship can be trusted. In addition to the whole genome, the evolution of two single-copy genes (matK and rbcL) is also discussed ( Figure S4). Due to their moderate variation and practical detection, they played an important role in comparing interspecific relationships and species identification. NJ tree and ML tree building with matK coding sequences ( Figure S4A,C) also described similar topological relationships to the above. Additionally, four M. baccata var. xiaojinensis X1~X4, M. rockii and M. toringoides gathered together ( Figure S4A,C), which fully verifies the high unity of their origin distribution and geographical location (southwest region of China).

Phylogenetic Tree
Based on the whole chloroplast genome sequences (24 Malus and 1 Crataegus sample), the evolutionary relationships of Malus species were compared (C. maximowiczii was used as the outgroup to specify the tree root). As shown in Figure 10A, Malus forms a large clade in which M. honanensis was juxtaposed with other individuals, indicating that it was evolutionarily distant and relatively independent. M. baccata O5, O6 and S1, M. hupehensis, M. rockii, M. toringoides and M. baccata var. xiaojinensis X1, X2, X3 and X4 together form a branch (II); the remaining individuals evolved into another group (I). The evolutionary network (I and II) could still reflect the diversity of M. baccata. The branch of the BI (Bayesian inference) tree is similar to the ML tree, and the posterior probability of each node is high ( Figure 10B), indicating that the phylogenetic relationship can be trusted. In addition to the whole genome, the evolution of two single-copy genes (matK and rbcL) is also discussed ( Figure S4). Due to their moderate variation and practical detection, they played an important role in comparing interspecific relationships and species identification. NJ tree and ML tree building with matK coding sequences ( Figure S4A,C) also described similar topological relationships to the above. Additionally, four M. baccata var. xiaojinensis X1~X4, M. rockii and M. toringoides gathered together ( Figure S4A,C), which fully verifies the high unity of their origin distribution and geographical location (southwest region of China).
Based on Bayesian inference analysis and the divergence time estimation of chloroplast genes shared by different species (Table S3), it was found that the topology of the evolutionary tree ( Figure 11) was consistent with that of the chloroplast genome sequence, and the distribution of M. baccata in the genus Malus was in two evolutionary clades (I and II). Clade I and II diverged 9.7 million years ago (95% HPD: 4.23~16.09 MYA) and then the two clades continued to evolve. Clade I consists of nine M. baccata individuals (S2, S3, S4, S5, S6, O1, O2, O3 and O4); the three M. baccata members of II are S1, O5 and O6. In addition, it can be seen from Figure 11   Based on Bayesian inference analysis and the divergence time estimation of chlo plast genes shared by different species (Table S3), it was found that the topology of evolutionary tree (Figure 11) was consistent with that of the chloroplast genome sequen and the distribution of M. baccata in the genus Malus was in two evolutionary clades (Ⅰ Ⅱ). Clade Ⅰ and Ⅱ diverged 9.7 million years ago (95% HPD: 4.23~16.09 MYA) and then two clades continued to evolve. Clade Ⅰ consists of nine M. baccata individuals (S2, S3,

Discussion
Chloroplasts exist in the cytoplasm and are relatively independent and semi-autonomous with maternally inherited elements (cp genome) [41]. The chloroplast genome has a double-chained structure, and its size and composition are stable [42]. Due to the influence of gene flow and DNA mutation, the chloroplast genome carries good marker information that can be used as a powerful tool for the study of population evolution, species classification and genetic engineering [43].
As a very important genus in the Rosaceae family, Malus includes apple, rootstock and a large number of ornamental crabapple species that have rich germplasm resources and biological value [5]. However, due to extensive hybridization and complex morphology, species classification and germplasm identification of Malus are extremely difficult [11]. Since the development of DNA sequencing technology, chloroplast maps of Malus had been published successfully, including M. prattii [44], M. sieboldii [45], M. toringoides [46,47], etc. The identification and comparison of these sequences have promoted the systematic classification of Malus [48]. In this study, through the assembly, annotation, comparison and phylogeny of chloroplast genomes, we improved the similarity and diversity analysis of the plastid sequences of M. baccata and other Malus species and systematically described their phylogeographic structures and evolutionary relationships.
The full length of M. baccata cp genomes was between 160,083 (S1) and 160,295 (S4), which was similar to previously released versions (M. baccata O1~O6). Through sequence homology analysis, it can be seen that M. baccata has high similarity to other species of Malus (such as M. hupehensis, M. sieboldii and M. prunifolia), and no structural rearrangement was found in these genomes, indicating that the variation is relatively moderate. Through repeat sequence recognition, 93 to 96 SSRs were found in the M. baccata cp genome, of which the single nucleotide repeat type accounted for the largest proportion. In addition, dispersed repeats are abundant (57~75), different in different chloroplast genomes and play a role in the stability of cp genomes [49]. A total of 110~112 genes were

Discussion
Chloroplasts exist in the cytoplasm and are relatively independent and semi-autonomous with maternally inherited elements (cp genome) [41]. The chloroplast genome has a doublechained structure, and its size and composition are stable [42]. Due to the influence of gene flow and DNA mutation, the chloroplast genome carries good marker information that can be used as a powerful tool for the study of population evolution, species classification and genetic engineering [43].
As a very important genus in the Rosaceae family, Malus includes apple, rootstock and a large number of ornamental crabapple species that have rich germplasm resources and biological value [5]. However, due to extensive hybridization and complex morphology, species classification and germplasm identification of Malus are extremely difficult [11]. Since the development of DNA sequencing technology, chloroplast maps of Malus had been published successfully, including M. prattii [44], M. sieboldii [45], M. toringoides [46,47], etc. The identification and comparison of these sequences have promoted the systematic classification of Malus [48]. In this study, through the assembly, annotation, comparison and phylogeny of chloroplast genomes, we improved the similarity and diversity analysis of the plastid sequences of M. baccata and other Malus species and systematically described their phylogeographic structures and evolutionary relationships.
The full length of M. baccata cp genomes was between 160,083 (S1) and 160,295 (S4), which was similar to previously released versions (M. baccata O1~O6). Through sequence homology analysis, it can be seen that M. baccata has high similarity to other species of Malus (such as M. hupehensis, M. sieboldii and M. prunifolia), and no structural rearrangement was found in these genomes, indicating that the variation is relatively moderate. Through repeat sequence recognition, 93 to 96 SSRs were found in the M. baccata cp genome, of which the single nucleotide repeat type accounted for the largest proportion. In addition, dispersed repeats are abundant (57~75), different in different chloroplast genomes and play a role in the stability of cp genomes [49]. A total of 110~112 genes were identified in the M. baccata chloroplast based on homologous annotation, including 77~78 CDSs, 29~30 tRNAs and 4 rRNAs. Codon usage pattern analysis revealed that the chloroplast genes of M. baccata had a certain degree of bias, for example, GCU appeared more frequently in Ala (alanine) and the stop codon (TER) preferred to use UAA rather than UAG and UGA. Similar results have been supported in previous studies; they found that the RSCU values of UUA, GCU and AGA are the most preferred ones [50]. As components of the chloroplast genome, IR regions were more stable than SC regions [51]. This phenomenon is fully demonstrated both in the characterization of GC content and in the calculation of nucleic acid polymorphism and mutation hotspots at the whole-genome level. There were several hypervariable regions in the chloroplast genome of M. baccata and other Malus species (trnR-UCU-atpA, trnT/UGU-trnL/UAA, rps16 and nhdD), which provide the genetic basis for the development of interspecific molecular markers.
M. baccata is a wild species of Malus that is native to China. It is widely distributed in the north and southwest of China with abundant variation types and has important ecological and breeding significance [3,4,6,7]. Based on the variations of different M. baccata individuals, the phylogeographic relationship was mainly divided into three categories: (1) S1, O6 and O5; (2) O3, O2, O4, S4 and O1; and (3) S2, S5, S3 and S6. The phylogenetic tree and principal component analysis all support the result of structure differentiation. The geographical and systematic distribution of the species can be well explained by the analysis of haplotype variation. A past study found that two haplotypes of the apple were shared by several wild species [13]. By further comparing the evolution of M. baccata and its variant (M. baccata var. xiaojinensis), it was found that M. baccata var. xiaojinensis was far away from the main haplotype of M. baccata and there were differences in the clade. Previous studies have shown that M. baccata var. xiaojinensis is closely related to M. hupehensis in comparisons of 13 chloroplast genomes [52]. In our data, M. baccata var. xiaojinensis and M. toringoides in the evolutionary branch cluster together, indicating that they are similar in maternal inheritance. Due to the complexity of morphological characteristics and evolutionary history, more data are needed to explain the classification and attribution of M. baccata var. xiaojinensis.
In addition, the differentiation of M. baccata was characterized by maximum likelihood and Bayesian inference. The results showed that from the comparison of the whole cp genome and the shared cp genes, different members of M. baccata formed two large branches (I and II). In the phylogenetic and clustering analysis (SSR markers) of 798 Malus resources native to China, it was found that the population of M. baccata is dispersed in different evolutionary branches [6]. The date of divergence between the two main branches of M. baccata and M. honanensis dates back 18.39 MYA (95% HPD: 10.22~26.84 MYA), which is close to that previously reported using 47 chloroplast genomes (16.66~30.29 MYA) [15].
The assembly and comparative analysis of the chloroplast genomes of M. baccata are necessary to understand its genetic variation and phylogeographic relationship. Therefore, the results of this study can provide valuable references for species identification and breeding engineering of the Malus genus, so as to facilitate the work of germplasm resources and conservation biology.

Conclusions
Six complete chloroplast genomes of Malus baccata were obtained by whole-genome resequencing and sequence assembly, with whole lengths ranging from 160,083 to 160,295 bp. Compared with other Malus species, it was found that there were more sites of polymorphisms and hotspots of variation in LSC and SSC regions; however, IR regions were stable due to high GC content and sequence similarity. Through repeat sequence and highly variable region recognition, numerous mononucleotide SSRs were found in the M. baccata cp genome, and trnG-UCC-trnR-UCU, trnR-UCU-atpA, trnT-UGU-trnL-UAA, ndhC-trnV-UAC, ndhF-rpl32 and ccsA-ndhD could be selected as candidate markers for distinguishing between different species of Malus. According to the analysis of phylogeny and haplotype, M. baccata formed two main branches (I and II) after the differentiation of Malus genus, and the variant (M. baccata var. xiaojinensis) was different from that of M. baccata to some extent. The results of the diverging time estimates suggest that clade I and II diverged 9.7 MYA. The chloroplast genome and its variation data for M. baccata provided a good resource for the species classification of Malus.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/biom13060962/s1, Figure S1: GC content and skew in the chloroplast genomes of M. baccata; Figure S2: Similar sequences of cp genomes in M. baccata and other Malus species; Figure S3: Hotspots of variation in the chloroplast genome of Malus; Figure S4: Evolutionary analyses of M. baccata and different Malus species based on single-copy genes (matK and rbcL); Table S1: The sample information for the M. baccata germplasms used in this study and the assembly quality of the chloroplast genomes; Table S2: Names of chloroplast coding genes with introns and their genome distribution in M. baccata; Table S3: Chloroplast gene information shared by different species for Bayesian inference analysis and divergence time estimation.