Comparative analysis of chloroplast genomes of endangered heterostylous species Primula wilsonii and its closely related species

Abstract Primula, well known for its heterostyly, is the largest genus in the family Primulaceae with more than 500 species. The considerable species number has introduced a huge challenge for taxonomy. The phylogenetic relationships among Primula still maintain unresolved due to frequent hybridization and introgression between closely related species. In this study, we sequenced and assembled the complete chloroplast genomes of Primula wilsonii Dunn, which is a PSESP (plant species with extremely small populations), using Illumina sequencing and compared its genomic sequences with those of four related Primula species. The chloroplast genomes of Primula species were similar in the basic structure, gene order, and GC content. The detected 38 SSRs (simple sequence repeats) loci and 17 hypervariable regions had many similarities in P. wilsonii, P. anisodora, P. miyabeana, and P. poissonii, but showed a significant difference compared with those in P. secundiflora. Slight variations were observed among Primula chloroplast genomes, in consideration of the relatively stable patterns of IR (inverted repeats) contraction and expansion. Phylogenetic analysis based on chloroplast genomes and protein‐coding genes confirmed three major clades in Chinese Primula, but the infrageneric sections were not in accordance with morphological traits. The P. poissonii complex was confirmed here and P. anisodora was the most closely related species to P. wilsonii. Overall, the chloroplast genome sequences provided useful genetic and evolutionary information for phylogeny and population genetics on Chinese Primula species.


| INTRODUC TI ON
Primula is famous for its mating with the heterostyly system, high ornamental value, and richness species. This group comprises about 500 species, with more than 300 species distributed in China, particularly clustering in the Himalaya-Hengduan Mountains (Richards et al., 2003). Primula is reputed to be one of the great garden plant genera throughout the world owing to the widely grown and long flowering period (Richards, 2002;Yan et al., 2010). The taxonomic study of Primula has been revised many times according to morphological characters. The infrageneric system with a total of 31 sections of Primula was initially proposed by Smith and Fletcher (Smith & Fletcher, 1947). Later, a revised system with seven subgenera was proposed that considered some putative reticulate evolutionary relationships (Wendelbo, 1961). Then, a six-subgenera system was amended in Primula by Richards (Richards et al., 2003), whereas the subgenus concept was not adopted in the delimitation of the Chinese Primula: a total of 24 sections were delimited (Hu, 1990).
Moreover, many key diagnostic traits, such as the shape of calyx, are slightly different and nonquantitative. The frequent natural hybridization and gene introgression also confuse species boundaries in this genus (De Vos et al., 2014;Guggisberg et al., 2009;Ma et al., 2014;. Although an increasing number of phylogenetic constructions have been carried out previously and have greatly advanced our understanding of the evolution of Primula, the phylogenetic relationships within the genus Primula species have remained incompletely resolved. Primula wilsonii Dunn is a perennial herb in Sect. Proliferae Pax of Primula (Primulaceae). The most closely related species of P. wilsonii are P. miyabeana, P. poissonii, and P. anisodora, and these species are clustered in a well-supported clade based on phylogeny construction using rbcL + matK + ITS markers (Yan et al., 2015). P. wilsonii sporadically scatters in limited areas of Sichuan and Yunnan provinces, while the other widespread Primulas are common in the fields and gardens. Considering the limited distribution areas and small size of populations in the fields, P. wilsonii was identified as a PSESP (plant species with extremely small populations) and eagerly in need of protection (Sun et al., 2019).
Chloroplasts are the photosynthetic organelles in plant cells, with highly conserved genomes that are inherited maternally in major angiosperms (Daniell, 2007;Daniell et al., 2016;Moore et al., 2010). The quadripartite structure of the angiosperm chloroplast genome consists of a large single copy (LSC) region and a small single copy (SSC) region, separated by two inverted repeats (IRs) . Up till now, taking advantage of the development of high-throughput sequencing technology, more than 7000 chloroplast genomes of land plants have been publicly known (Kirill et al., 2021) since the publication of the first plastid genome sequences for Nicotiana tabacum and Marchantia polymorpha (Ohyama et al., 1986;Shinozaki et al., 1986). Additionally, the lack of recombination and slow evolutionary rate as compared with mitochondrial and nuclear genomes make it suitable for phylogeny at genus and family level, species barcoding, population genetics and the conservation of endangered species (Daniell et al., 2016;Henriquez et al., 2014;Nguyen et al., 2017;Palmer et al., 1988;Wambugu et al., 2015;Wang et al., 2016;Zhai et al., 2019). And the comparative analysis of chloroplast genomes among closely related species and phylogenetic analysis in Primula provided insights into the evolutionary history (Li, 2022;Ren et al., 2018;Xu et al., 2020). Therefore, we determine the complete chloroplast genomes of a PSESP species in this genus (Xie et al., 2021). In addition, we explore simple sequence repeats (SSRs) loci and identify highly variable regions by comparing the genome contents and structures between the PSESP species P. wilsonii and its widely spread closely related species. Here, we aim to: (1) investigate the molecular structures of the chloroplast genomes of P. wilsonii, (2) examine the variations of SSRs and highly divergent regions that could be used as specific DNA markers for P.
wilsonii and its close relatives, (3) evaluate the evolution of several Primula chloroplast genomes, and (4) facilitate the systematics of the Chinese Primula species.

| Plant materials, DNA extraction, and sequencing
The plant materials used in this study were collected from Wuxuhai, Sichuan Province, China (29.16 N,101.41 E). The voucher specimen (voucher accession number XYP202007016) was deposited in the Key Laboratory of Plant Resource and Biology at Huaibei Normal University. Total genomic DNA was extracted from silica-dried leaves with a modified cetyltrimethylammonium bromide (CTAB) method (Doyle & Doyle, 1987). The quality and quantity of the DNA samples were determined using agarose gel electrophoresis and the Qubit dsDNA BR assay (Life Technologies). The qualified PCR-amplified library was sequenced with the Illumina NovaSeq Tenplatform (Nanjing Genepioneer Biotechnologies Inc.).

| Genome assembly, annotation, and analysis
The low-quality reads were assessed and filtered using FastQC (https://www.bioin forma tics.babra ham.ac.uk/proje cts/fastq c/) and Trimmomatic v 0.36 to obtain high-quality clean data (Bolger et al., 2014). The chloroplast genome sequence of P. wilsonii was assembled by a de nova method using SPAdes Genome Assembler v3.10.1 (Bankevich et al., 2012). The k-mers were set to 55, 87 and 121 to achieve optimal assembly. The whole chloroplast genomes were assembled using high-coverage and long-sequenced contigs.

| Comparative plastome analysis
Gene distribution and the percentage of sequence identity were compared and visualized using mVISTA software (Frazer et al., 2004) with the LAGAN mode (Brudno et al., 2003) in chloroplast genomes of five Primula species, with P. secundiflora, P. poissonii, P. anisodora, and P. miyabeana selected as close relatives of P. wilsonii. All the chloroplast genomes were obtained from NCBI, except for P. wilsonii. The annotation of P. wilsonii served as a reference. Nucleotide variability values (Pi) were calculated using the same alignment. Nucleotide diversity was detected by sliding window analysis conducted in DnaSP v.6.11.01 with a step size of 200 bp and window length of 600 bp (Rozas et al., 2017).
The expansion or contraction of the inverted repeat (IR) regions in the chloroplast genomes of the five related Primula species were investigated and visualized using IRscope program (Amiryousefi et al., 2018).

| Phylogenetic analyses
To investigate the phylogenetic relationship in the sections of Chinese Primula and the resolution ability of chloroplast genomes, phylogenetic analysis was performed for the 60 species representing 20/24 sections of Chinese Primula based on two datasets: the complete chloroplast genomes and 66 shared protein-coding genes.

Two outgroup species (Andorsace bulleyana and Ardisia solanacea)
were sampled from a closely related genus Andorsace and a more distantly related genus Ardisia of the Primulaceae in the sense of phylogenetic relationships (Chen et al., 2016). The sequences were firstly aligned by MAFFT v7.307 (Katoh & Standley, 2013). jModel-Test 2.0 was used to find the best nucleotide-substitution model and was determined according to Akaike's information criterion (AIC) before phylogenetic construction (Darriba et al., 2012). Trees were then constructed using the maximum likelihood (ML) method by online RaxML BlackBox software (Stamatakis et al., 2008) and Bayesian inference (BI) method using MrBayes (Ronquist & Huelsenbeck, 2003). ML was implemented starting from random trees, using 1000 rapid bootstrap inferences with a General Time Reversible (GTR) nucleotide-substitution model as suggested. The BI analysis was performed on the condition that Markov chain Monte Carlo calculation is set to 2,000,000 generations and sampled every 1000 generations. The remaining 1500 trees were used to generate the consensus tree after the first 25% of the generations were discarded as a burn-in. The final phylogenetic results were viewed by using FigTree v.1.6.1.

| Chloroplast genomes features
Among the 23,651,814 paired-end clean reads generated by Illumina sequencing, 604,525 reads were mapped to the final assembly.
Based on these data, we obtained a high-quality plastome of P. wilsonii, with average coverage depth of 1249.46 ×. The assembled chloroplast genome of P. wilsonii was 151,677 bp in length, exhibiting a typical circular chloroplast structure like most angiosperms.
As shown in Figure 1, it contains a LSC region of 83,510 bp, a SSC region of 17,765 bp, and a pair of inverted repeats (IRa and IRb) of 25,201 bp. Overall, the GC content was 36.99%. We found an uneven distribution of GC content across the regions of the genomes, which were 34.89%, 30.18%, and 42.87% for the LSC, SSC, and IR regions, respectively. The chloroplast genome structures, the size of each region, and the GC contents of the five chloroplast genomes of Primula species were compared and are shown in Table 1. The genome features were nearly identical in the five Primula Chloroplast genomes.
Regardless of the duplicate genes, the chloroplast genome of P.
wilsonii contained 113 genes, including 79 protein-coding genes, 30 tRNA genes, and four rRNA genes. Among them, six protein-coding genes, seven tRNAs, and all the four rRNAs were completely duplicated within IRs ( Figure 1). The rps12 gene was found to be transspliced in chloroplast genomes. It had two duplicates in IRs and one of its exons is located in the LSC region ( Figure 1). A total of 18 genes, including 12 protein-coding genes and six tRNA genes, had introns.
It was noted that 16 genes among them had only one intron while the other two genes had two introns. The comparisons between the genome of P. wilsonii and the other Primula chloroplasts are shown in Table 1. The results indicated that the gene contents were similar among the Primula species, excluding the loss of the ycf15 gene in the P. wilsonii chloroplast genome. The chloroplast genomes of P. poissonii and P. wilsonii possessed the infA gene, whereas the others did not have it. Moreover, the trnG-GCC sequence only appeared in the chloroplast genome of P. wilsonii ( Table 2).

| Repeat sequences and SSRs analysis
Except for the largest repeat in each genome (IR regions), a total of 111 repeat pairs no more than 60 bp in length were identified in the five Primula chloroplast genomes. Only two types of repeat sequences, forward and palindromic repeats, were detected ( Figure 2). There were 22 (nine forward, 13 palindrome), 22 (nine forward, 13 palindrome), 27 (12 forward, 15 palindrome), 13 (13 forward), and 27 (12 forward, 15 palindrome) repeats in P. wilsonii, P. anisodora, P. miyabeana, P. poissonii, and P. secundiflora, respectively. Among them, the chloroplast genomes of P. miyabeana and P. secundiflora had the largest number of repeats, whereas P. poissonii had the fewest. The results indicated that forward and palindrome repeats occupied almost the same proportion. In these five genomes, the length of the repeats was mainly in the range of 30-39 bp, with a percentage of 61.26% (68 of 111 repeats), followed by 40-49 bp, contributing 29.73% (33 of 111 repeats).
A total of 38 microsatellites in the P. wilsonii chloroplast genome were detected, including 28 mono-, five di-, two tri-and three tetra-nucleotide repeats. In P. poissonii, 38 SSRs loci were detected, consisting of 27 mono-, six di-, two tri-and three tetra-nucleotide repeats, respectively (Table 3). Moreover, 41 SSRs loci were spotted in P. secundiflora, and the numbers of different types are listed in Table 3. Here, we found that the one-base SSRs loci of P. wilsonii, P. miyabeana and P. poissonii were only A/T repeats. The dinucleotide repeats were AT/TA in all the five Primula species. It is shown that the trinucleotide repeats were AAT/ATT in P. wilsonii, P. anisodora, P. miyabeana, and P. poissonii, but they were not present in P. secundiflora. The tetra-nucleotide repeats AAAT/ATTT were present in all the Primula species. However, the AAAG/CTTT and AATT repeats only appeared in P. secundiflora. Among these SSRs loci, 30 (78.95%) were in the LSC region, six (15.79%) were in SSC region, and two (5.26%) were in the IR region of the P. wilsonii chloroplast genome.

| Sequence divergence
The Primula chloroplast genomes exhibited moderate sequence divergence. Furthermore, the results showed that the sequence of coding regions and the two IR regions were significantly more conserved than that of LSC and SSC regions. In the coding regions, most   genes were relatively conserved, except for matK, rpl22, ndhF, and ycf1. In contrast, the intergenic regions were shown to be highly divergent ( Figure 3). Then, we found that the variation level of DNA polymorphism was from 0.00067 to 0.02633. The greatest differences among the five Primula species were located in the two SC regions, while IR regions were the least different. About 17 hypervariable regions were discovered with a nucleotide diversity (Pi) value that was greater than 0.01 (Figure 4).

| IR expansion and contraction analyses
The IR/SC junction regions showed slight differences in gene content and gene order. As shown in Figure 5, genes rps19/rpl2, ndhF, ycf1, and rpl2/trnH were present at the junction of the LSC/IRb, IRb/SSC, SSC/Ira, and IRa/LSC borders, respectively. The LSC/IRb boundary was located in the rps19 region, which crossed into the IRb region in all the five chloroplast genomes, resulting in a variable expansion (14-102 bp) of IRb region toward the rps19 gene.
The ndhF gene was entirely located in the SSC region in the chlo-

| Phylogenetic analysis
Our sampling represented 20 of 24 recognized sections of the genus Primula in China (Hu, 1990). The selected models for whole F I G U R E 2 Types and numbers of large repeat sequences in the chloroplast genomes of five Primula species. F and P represent forward and palindromic repeats, respectively.   (Figure 6 and Figure S1). The phylogenetic tree identified that all the samples in the genus Primula could be divided into three major clades ( Figure 6)

| The general characteristics of Primula chloroplast genomes
As with most angiosperms, the chloroplast genomes were conserved in Primula species, with similar GC content and typical quadripartite structures, including small and large single copy (SSC and LSC) regions separated by two inverted repeats (IRs) regions (Xin et al., 2019). However, gene loss was found here. The infA gene, which encodes translation initiation factor 1 (Wicke et al., 2011), was present in the chloroplast genome of P. poissonii and P. wilsonii, but was not present in the related Primula chloroplast genomes in our study. Additionally, these findings are consistent with the results of some Primula species and other groups in angiosperm chloroplast genomes in previous studies (Gichira et al., 2017;Ren et al., 2018). The ycf15 gene was absent in the chloroplast genomes of P. wilsonii. Similar events have also occurred in other angiosperm plastid genomes, such as Colchicum F I G U R E 3 Comparison of the complete chloroplast genome of five Primula species, with P. wilsonii as a reference. Gray arrows and thick black lines above the alignment indicate gene orientation. The y-axis represents the percentage identity ranging from 50% to 100%. The acronym CNS represents conserved noncoding sequences.
genus (Nguyen et al., 2015). ycf15 was located in the IR region and was highly conserved (Liu, 2018), and the function of the ycf15 gene remained unclear and needed to be further investigated. The non-presence of ycf15 we found here might be a valuable molecular marker to separate P. wilsonii from P. poissonii, which is morphologically similar to P. wilsonii. Both of the two species are perennial herbs with candelabra inflorescence and purple flowers, so some scholars have argued that P. wilsonii should be merged into P. poissonii or treated as a subspecies of P. poissonii. Here we suggest that the missing ycf15 gene in the P. wilsonii chloroplast genome could be extremely useful for distinguishing the two confusing species at the molecular level.

| The evolution of the chloroplast genomes in Primula
Inverted repeats regions are highly conserved in most angiosperm chloroplast genomes. However, the contraction and expansion of IR regions are not rare (Kim & Lee, 2004). In this study, gene orders at the boundaries of SC/IR regions were the same among these five chloroplast genomes of Primula. However, the accurate positions of the genes at the SC/IR borders were slightly varied, such as the genes rps19, ndhF, ycf1, rpl2, and trnH (Ren et al., 2018). In addition, some genes normally located in the SC region, such as ndhF, had moved to IR region due to the expansion of the IR region. It was reported that the chloroplast genomes' size, the LSC/SSC length, the gene numbers and pseudogene origination could vary among different species due to the expansion and contraction of IR regions (Menezes et al., 2018;Saina et al., 2018). Moreover, the loss of IR regions has been occasionally detected in some taxa (Yi et al., 2013). This could be the reason that the chloroplast genome size of P. miyabeana was the largest among the five Primula species with the longest IR region, and the chloroplast genome size of P. secundiflora was the smallest with the shortest IR region. Furthermore, a large number of studies also confirmed that the length of IR region greatly affected the chloroplast genome size (Ren et al., 2021;Sun et al., 2017).
More genomic resources are needed to deeply investigate the phylogeny, biogeography, genetics, and heterostyly evolution of Primula species on account of the ornamental and reproductive importance. In addition, considering that P. wilsonii is a PSESP, we need more genetic information for the conservation of germplasm resources. The numbers and distributions of repeat sequences, especially large repeats that are longer than 20 and 60 bp, may play important roles in the arrangement and recombination of the plastid genome (Guisinger et al., 2011;Pombert et al., 2006). A total of 111 repeats were detected in the five Primula chloroplast genomes. All the repeat sequences appeared to be shorter than 60 bp in length. These findings are consistent with the results in other Primula species (Ren et al., 2018;Xu et al., 2020), but not in agreement with the results of some other angiosperm plants, such as herbaceous Alpinia species  or woody Aquilaria species (Ren et al., 2021). Our study detected very high levels of polymorphism in the large repeat sequences among the five Primula species in terms of both the types and numbers. Therefore, these large repeats might be an informative source for developing genetic markers for population genetics and phylogenetic constructions of Primula (Hishamuddin et al., 2020). SSRs markers are valuable genetic resources for phylogenetic investigations, population genetics assessment, and species discrimination due to their abundant polymorphism and codominant inheritance (Gulzar et al., 2018;Ren et al., 2021). The SSRs markers detected here were mostly A/T mononucleotide repeats (28/38), similar to the results of other Primula species (Ren et al., 2018) and some other angiosperm species (Shen et al., 2020;Wang et al., 2021).

| Phylogenetic relationships of Chinese Primula
A total of 60 species representing 20 of 24 sections in Chinese Primula were sampled in our phylogenetic construction using chloroplast genome sequences based on ML and BI methods. Three major clades of Primula were detected with high internal support in this study, which was in accordance with previous studies (Liu, 2018;Xu et al., 2020;Yan et al., 2015). Several sections did not exhibit monophyletic taxa, such as Sects. Monocarpicae, Crystallophlomis, Obconicolisteri, Denticulata, and Proliferae, which were partly or entirely confirmed by the previous viewpoints (Liu, 2018;Xu et al., 2020;Yan et al., 2015). A decision on the monophyly of Sect.
Proliferae requires additional consideration. It has been treated as a monophyletic group based on the concatenation of ITS, matK and rbcL sequences Yan et al., 2015). However, the chloroplast transcripts and protein-coding sequences from chloroplast genomes analyses strengthen the assumption that Sects.
Amethyatina and Petiolares species are nested within Sect. Proliferae (Li, 2022;Liu, 2018). This assumption is additionally supported by the results based on the whole chloroplast genome and protein-coding sequences data analysis in our investigation. This is corroborated by morphological traits such as an umbel with multiple flowers, campanulate calyx, and globose capsule. The conflicting phylogenetic diagnoses of nuclear and chloroplast sequences are common in plants (Degna & Rosenberg, 2009). The adaptive radiation caused by heterostyly, polyploidization, and natural hybridization, or gene introgression might complicate the phylogenetic relationships under Primula (De Vos et al., 2014;Guggisberg et al., 2009;Ma et al., 2014;. Furthermore, chloroplast capture through hybridization may occur in these sections, which has been documented in plants and the genus Primula (Casazza et al., 2012;Yi et al., 2015). This would explain why quite a few sections in Primula did not belong to monophyletic group according to morphological characters.
Primula wilsonii, together with P. poissonii, P. anisodora, and P. miyabeana (endemic to Taiwan) formed to P. poissonii complex, which was one of the taxonomically challenging groups in Sect. Proliferae.
The close relationship of these species has been revealed in previous studies, and P. wilsonii was closest to P. miyabeana based on rbcL + matK + ITS sequences, with low support (Yan et al., 2015).
However, the closest relative species was P. anisodora with very high support based on chloroplast genomic sequences in this study.
Therefore, we suggest that the phylogenetic relationships between Primula species need to be further studied based on more genetic information, especially at the genomic level, and we may come to the conclusion that chloroplast genomes sequences could provide a valuable resource for phylogenetic constructing of Primula.

| CON CLUS IONS
This study compared the basic characteristics of the chloroplast genomes from five Chinese Primula species. We assessed the variation and IR boundaries evolution among these species. Furthermore, we

CO N FLI C T O F I NTE R E S T
The authors declare that there are no conflicts of interest regarding publication of this paper.

DATA AVA I L A B I L I T Y S TAT E M E N T
The original sequencing data have been submitted to the NCBI database and received GenBank accession numbers MW442886 (P. wilsonii). The data used in this study are already entirely in the public domain (https://www.ncbi.nlm.nih.gov). The GenBank accession numbers of all the species used for phylogenetic analysis are displayed in Figure 6 and Figure S1.