Evolutionary patterns of nucleotide substitution rates in plastid genomes of Quercus

Abstract Molecular evolution, including nucleotide substitutions, plays an important role in understanding the dynamics and mechanisms of species evolution. Here, we sequenced whole plastid genomes (plastomes) of Quercus fabri, Quercus semecarpifolia, Quercus engleriana, and Quercus phellos and compared them with 14 other Quercus plastomes to explore their evolutionary relationships using 67 shared protein‐coding sequences. While many previously identified evolutionary relationships were found, our findings do not support previous research which retrieve Quercus subg. Cerris sect. Ilex as a monophyletic group, with sect. Ilex found to be polyphyletic and composed of three strongly supported lineages inserted between sections Cerris and Cyclobalanposis. Compared with gymnosperms, Quercus plastomes showed higher evolutionary rates (Dn/Ds = 0.3793). Most protein‐coding genes experienced relaxed purifying selection, and the high Dn value (0.1927) indicated that gene functions adjusted to environmental changes effectively. Our findings suggest that gene interval regions play an important role in Quercus evolution. We detected greater variation in the intergenic regions (trnH‐psbA, trnK_UUU‐rps16, trnfM_CAU‐rps14, trnS_GCU‐trnG_GCC, and atpF‐atpH), intron losses (petB and petD), and pseudogene loss and degradation (ycf15). Additionally, the loss of some genes suggested the existence of gene exchanges between plastid and nuclear genomes, which affects the evolutionary rate of the former. However, the connective mechanism between these two genomes is still unclear.


| INTRODUC TI ON
Nucleotide substitution is a driving force of genome evolution, as the production of nonsynonymous substitutions may change protein functions, which may be fatal. Consequently, natural selection tends to delete these harmful mutations, resulting in most species being under negative selective pressure (Susann et al., 2013;Wang et al., 2018). Evolutionary rates of species may be affected by gene or protein function, selective pressure, population size, generation time, DNA-repair efficiency, species diversity, and individual species size (Bromham et al., 2015;Hao et al., 2018;Wang et al., 2015).
Selective pressure on functional genes has been found to be especially significant in determining rates of protein evolution (Minias et al., 2018;Xing & Lee, 2005). For example, selective pressures associated with habitats appear to have caused the rapid evolution of genes involved in cold response in Cardamine (Ometto et al., 2012).
Another example are parasitic plant species, which have greater evolutionary rates in their plastid, mitochondrial (mt), and nuclear genomes when compared with other plants as they are deemed to be under greater selective pressure (Bromham et al., 2013). With the rapid development of DNA sequencing technology, it is now possible to study the role of selective pressure in molecular evolution.
Plastids are semiautonomous organelles and include chloroplasts, leucoplasts, chromoplasts, and amyloplasts, which all develop from proplastids of an embryo and have an identical genome. Over the past few decades, plastomes have begun to be used extensively in studying species evolution, migration, identification, and classification (Jose et al., 2015;Moore et al., 2010). Earlier plastome studies mainly focused on comparisons of the absolute rates of evolution of some functional genes, and these results were based on different genes in different species, which may have led to statistical errors (Kenneth et al., 1987). Later, comparisons of evolutionary rates among plastid, mt, and nuclear genomes, as well as between gymnosperms and angiosperms, were conducted (Petersen et al., 2019;Smith et al., 2014). Modern sequencing technology now allows the rapid generation of genomic data, which is essential for a more accurate analysis of factors affecting molecular evolution.
The genus Quercus L. is native to the Northern Hemisphere and contains ~500 species and is widely distributed in Asia, northern Africa, Europe, and America (Aldrich & Cavender-Bares, 2011).
Infrageneric classification of Quercus is controversial mainly because convergent morphological evolution is commonplace within the genus (Denk et al., 2017) and many morphological differences are a result of introgression (Curtu et al., 2007;Moran et al., 2012). Species of Quercus (commonly referred to as oaks) are wind-pollinated and unable to discriminate among pollen from other species of the same section. Additionally, there is incomplete reproductive isolation among oaks, and as a result, phenotypes of progeny produced by interspecific hybridization are variable and difficult to discern (Williams et al., 2010).
Based on morphological characteristics, Chinese oaks have been classified into two subgenera: Quercus subg. Quercus and Q. subg.
Cyclobalanopsis (Oerst.) C.K. Schneid. (Zhou, 1992), or these have been considered as distinct genera in the Flora of China (Huang et al., 1999). More recently, phylogenetic research has been conducted on Quercus and Cyclobalanopsis Oerst. using nuclear DNA (nrDNA) and plastid DNA fragments (Denk et al., 2017) and, most recently, whole plastome, mt, and nuclear genome data (Hipp et al., 2020). In the most recent classification based on the nrDNA and plastid DNA (reviewed in Denk et al., 2017), Quercus is divided into two subgenera, Q. subg. Quercus and Q. subg. Cerris (Oerst.), corresponding to the New-and Old-World oaks, respectively, and with Cyclobalanopsis placed within Q. subg. Cerris Oerst.. This was corroborated by whole plastome, mt, and nuclear genome data (Hipp et al., 2020). However, relationships within the Old-World oaks of Q. subg. Cerris Oerst. are complex. Phylogenetic relationships established by nrDNA tended to differ from those based on plastid data, especially with regards taxa previously circumscribed in Q.
In our study, we sequenced the whole plastomes of Quercus fabri. Hence, Quercus semecarpifolia Smith, Quercus engleriana Seem, and Quercus phellos L. Q fabri, of Q. subg. Quercus sect. Quercus, is a deciduous forest species endemic to China. It is widely distributed south of the Yangtze river and grows on hills or mountains between 50 and 1,900 m a.s.l. Li, Li, Zang, et al., 2018).
Q. semecarpifolia and Q. engleriana, of Q. subg. Cerris sect. Ilex, are evergreen forest species endemic to China. Q. semecarpifolia has a narrow distribution, being found only in western China, while Q. engleriana has a comparatively large distribution, with its range covering the Chinese provinces Shanxi, Jiangxi, Fujian, Hunan, Hubei, Guangxi, Sichuan, Guizhou, Yunnan, and Tibet (Tang, 2015). Quercus phellos, of Q. subg. Quercus sect. Lobatae, is a North American deciduous red oak introduced to China and used extensively in landscape gardening .
All three Chinese endemic oak (Q. fabri, Q. semecarpifolia, and Q. engleriana) species are ecologically and economically important, and their physiology, genetic diversity, breeding, forest management, and food processing are under intense investigation (Wangda & Oshawa, 2006;Wei et al., 2021). Additionally, there are no native species of Q. sect. Lobatae in China, and no plastomes sequenced from this section prior to our study, with Q. phellos critical to our study for including section Lobatae characteristic. The whole plastomes of these four species, coupled with those of 14 other Quercus species retrieved from GenBank, belong to the two subgenera and five of the eight sections currently accepted, that is, Q. subg.
These species' distributional ranges and habitats largely differ and thus form ideal material for studying evolution of the Quercus genus.
The purposes of this study are to (a) generate a well-resolved phylogenetic framework based on plastomes; (b) analyze the sequence variation of coding and noncoding plastid gene regions and transfer of plastid genes to understand the diversification patterns of Quercus; and (c) analyze the selective pressure that acts on essential protein-coding genes to explore the variation and patterns in evolutionary rates of Quercus plastomes.
Genomic DNA was isolated using the modified cetyltrimethylammonium bromide method (Doyle, 1987). A 3 mg of leaf tissue was ground with the addition of 0.35 times the volume of absolute ethanol and inverted several times. Following this, 700 μl of preheated 2% CTAB extract was added to the solution, which was then kept at 65°C in a water bath for 30 min. This solution was then centrifuged at 12,000 r/min for 5 min, after which an equal volume of phenol/ chloroform/isoamyl alcohol (25:24:1) was added to the supernatant. This was then centrifuged at 12,000 r/min for 10 min, after which an equal volume of chloroform/isoamyl alcohol (24:1) was added to the supernatant. The supernatant was then transferred to an equal volume of prechilled isopropanol and let stand at −20°C for at least 40 min. The precipitate was then collected, 1 mol/L NaCl 400 μl was added for dissolution, then an equal volume of chloroform/isoamyl alcohol was added and the solution was centrifuged at 12,000 r/min for five minutes. Following this, a 1/2 volume of NaCl solution and 2-2.5 times volume of precooled absolute ethanol was added to the supernatant, followed by gentle shaking until white flocculent precipitate appeared. The precipitate was then collected and washed twice with 70% ethanol, before being placed on the clean bench and blow dried. Agarose gel electrophoresis and spectrophotometry (OD-1000; Shanghai Cytoeasy Biotech Co., Ltd., Shanghai, China) were used to determine DNA integrity and quality.
The DNA was fragmented with a Covaris sonication device and then the fragments were purified, end-repaired, and A-tailed.
Adapters were lighted through a 3′-thymine overhang. Next, the fragment size was selected by agarose gel electrophoresis, and the sequencing library was formed by PCR amplification. Finally, the constructed library was inspected. Sequencing was performed on an Illumina HiSeq 2500 platform (Illumina, Nanjing, China), yielding at least 8.18-10 GB of clean reads.
We selected the plastome sequence of Quercus aliena as a reference, and then employed reads to produce two-way extensions through overlap using NOVOPlasty software (Dierckxsens et al., 2017). When the assembly results were within the expected range, the overlap was greater than 200 bp, and the assembly formed a ring.

| Phylogenetic analyses
We used 67 shared protein-coding sequences of the 18 Quercus species and two outgroup species, Malus prunifolia (Willd.) Borkh. (NC_031163) and Ulmus gaussenii W. C. Cheng (NC_037840), to build the phylogenetic trees. Phylogenetic relationships were constructed using the BI analysis in Mrbayes v 3.2.6 (Huelsenbeck & Ronquist, 2001). First, the sequences were aligned in MAFFT (Kazutaka & Standley, 2013). Then, the multiple sequence alignment was visualized and manually adjusted using BioEdit (Hall, 1999). We used JmodelTest2 to evaluate the bestfitting models of nucleotide substitution (Darriba et al., 2012). GTR +G was selected as the best substitution model for BI analyses.

| Gene selective pressure analysis
The 79 shared protein-coding genes of the 18 Quercus species, which had sequence lengths longer than 300 bp were extracted and aligned separately using Mega 7.0 to analyze synonymous substitution rates (Ds) and nonsynonymous substitution rates (Dn) using the codeml program in PAML package version 4.7.1 under a one-ratio branch model (Yang & Yang, 2007). We constructed the tree topology using BI as the tree file. After removing outliers, a total of 64 effective genes were included in statistical analyses using SPSS 19.0 software (SPSS, Inc., Chicago, IL, USA). The Ds, Dn, and Dn/Ds ratio of functional gene functional groups were compared by ANOVA. A Mann-Whitney U test of two independent samples was used to determine whether there were significant differences in Ds, Dn, and Dn/Ds among the Quercus species and previously published values of gymnosperms (Wang et al., 2015) based on the calculated Pvalues obtained from the double-tail test (p < .05).

| Plastome assembly and characteristics
The Q. phellos complete plastome size was the largest with 161,331 bp (Table 1). The four oak species displayed a typical quadripartite structure, and we chose to describe the Q. fabri plastome, which included a pair of inverted repeats (IRs; 25,846 bp) separated by a single large copy (LSC; 90,563 bp) and a single small copy (SSC; 18,995 bp) region ( Figure 1; Table 1). The Q. fabri genome contained the most functional genes (about 136), including 89 protein-coding genes, 39 tRNA genes, and eight rRNA genes (Tables 1 and S1).
Introns occurred in 12 genes, rps16, atpF, rpoC1, ycf3, clpP, petB, petD, rpl16, rpl2, ndhB, ndhA, and rpl2 (Table S2). Of these, the rps12 gene was unique as it contained three exons, one of which occurred in the LSC region, while the other two exons occurred in the IRa and IRb regions. We found 26,756 codons in all the coding sequences (Table S3). Leucine was the most common amino acid, encoded by 10.84% (2,903) of the codons, while cysteine was the least frequent amino acid, encoded by 1.35% (331). The A-and Uending codons occurred at high frequencies, and all types of preferred synonymous codons (relative synonymous codon usage >1) ended with A or U.
There was no gene rearrangement among the 18 species during the evolution of Quercus ( Figure S1). We compared the details of

| Phylogenetic analysis
The phylogenetic relationships of the 18 Quercus species, constructed using 67 shared protein-coding sequences, were strongly supported and can be seen in Figure 2. Quercus was found to be divided into two major evolutionary clades.

| Deletion and transfer of plastid genes
The variation in plastome length among the 18 oak species studied was small, with plastomes ranging from 160,681bp (Q. sichourensis) to 161,355 bp (Q. tarokoensis; Table 1). Genes were consistently distributed among the different genomic regions across the 18 species.

| Comparative analysis of coding and noncoding regions
Using the mVISTA program ( Figure S4), we found that conserved noncoding sequences (containing intron sequences) had greater variation levels than exon sequences. The sequences of noncoding regions which had large variations included trnH_GUG-psbA, trnK_  ccsA-ndhD, and rpl32-trnL_UAG. We analyzed the coding and noncoding regions' sequences using a sliding window to present the observed variations using digitization. The sliding window analysis showed that protein-coding genes which had large pi values were: atpF, clpP, infA, ndhA, ndhD, ndhK, ndhH, psbC, rpl20, rpoC2, and rps16 ( Figure 4a). Among these genes, those with introns showed the greatest variation. The following genes had sequence variation in their gene interval regions: trnK_UUU-rps16, petA-psaJ, rbcL-accD, trnfM_CAU-rps14, and trnS_GUC-trnR_UCU ( Figure 4b). The sequence variation in protein-coding genes and gene interval regions reflected by the sliding window was the same as those determined using the mVISTA software. In general, we confirmed that the variations in noncoding sequences are greater than those of coding sequences through different analyses (Figure 4c).
To further explore the sequence variation characteristics of coding and noncoding regions, we counted the shared coding-protein genes of the 18 Quercus species studied. These species shared 79 protein-coding genes, 6 of which (ndhB, rpl2, rps12, rps7, ycf2, and rpl23) had duplicate copies. Of the 79 protein-coding genes, 31 had variable lengths (Figure 5a). We analyzed the lengths of these 31 protein-coding genes and found 12 had introns, with larger length variations than exons, such as atpF, clpP, ndhA, and rpl2. The intron lengths of the 31 protein-coding genes also differed. Additionally, Q. edithiae lost intron regions in petB, petD, rps12, rps16, and rpl16, which caused variation in intron lengths (Figure 5b). The sequence lengths of exons of the 31 protein-coding genes also varied, such as those of rps16, rps12, petD, petB, and rpl16 (Figure 5c). Protein-coding F I G U R E 2 Maximum-likelihood phylogenetic tree constructed using 67 shared protein-coding sequences of plastomes of 18 Quercus species. Numbers above the lines indicate the likelihood bootstrap values. Malus prunifolia and Ulmus gaussenii were used as the outgroups genes without introns were relatively conserved, with limited variation in gene lengths. Still, there were also several unique genes, such as rpl22, rpoc2, and ycf2, which had lengths that fluctuated greatly among the 18 species.

| Gene selective pressure analysis
We divided the 64 protein-coding genes into four major categories: Photosynthetic metabolism; Photosynthetic apparatus; Gene F I G U R E 3 Alignment of the ycf15 gene and protein sequences from the 18 Quercus species. a. Alignment of the ycf15 gene sequences; b. Alignment of the ycf15 protein sequences expression; Other (Table 2). After removing the "other" category, the data presented in Table 2 (Figure 6b). Of these, we found the variation of evolutionary rate within gene groups is still considerable, with some genes with unknown functions evolving faster than other genes ( Figure 6c). Compared with gymnosperms, Quercus has higher Dn (p < .05) and Dn/Ds (p > .05) ratios (Table S4).  Hipp et al., 2020) and subsequently underwent deep geographic separation of major clades within the first 15 Myrs after origin of the genus (Hipp et al., 2020). It is then hypothesized to have spread to a diverse range of environments in the later Cenozoic (Barrón et al., 2017), with China now forming the second-largest diversity center of Quercus (Oldfield & Eastwood, 2007). Their high diversity and remarkable introgression make Quercus a good study-system for understanding the link between the prevalence of adaptive gene flow and the evolutionary history of forest trees. Over the past five years, much research has focused on reexamining the oak phylogeny and investigating the history of diversification across the genus (Deng et al., 2018;Denk et al., 2017;Hipp et al., 2020;Jiang et al., 2019). The topology of our phylogenomic tree matches previous analyses based on plastid and nrDNA gene fragments and RAD-seq of whole plastid, mitochondrial, and nuclear genomes (reviewed in Denk et al., 2017;Hipp et al., 2020) for all sections except that of sect. Ilex. In previous analyses (Denk et al., 2017;Hipp et al., 2020), sect. Ilex was retrieved as a monophyletic group, sister to sect. Cerris. However, in our study, sect. Ilex was polyphyletic and divided into three strongly supported evolutionary clades placed inbetween sect. Cyclobalanopsis and sect. Cerris. This may be related to differential rates of introgression, which make it difficult for us to establish completely reliable phylogenomic relationships based on only one genome (plastome) and a relatively small sample size.

| D ISCUSS I ON
The focus of future phylogenomic studies of Quercus should be on incorporating a more comprehensive sampling of whole genomes, as well as including other molecular sampling approaches such as nuclear genome skimming, that covers all sections and morphological and biogeographic variability present in the genus. This should include whole plastomes and nrDNA from taxa belonging to Q. subg.
Compared with gymnosperm genera, Quercus has a much higher species diversity, and thus it is assumed a higher genetic diversity, albeit with extant species having a similar recent evolutionary age (Nagalingum et al., 2011). Thus, the question arises whether differences in the evolutionary patterns of plastomes affect their genetic diversity? Based on our results, we hypothesize that, during its early evolutionary history, Quercus had a considerable Dn value that en-  (Ducousso et al., 1993). Northern temperate deciduous forest species also have a higher pollen volume than that of earlier diverging evergreen oak species from warmer and drier climates, with the high pollen volume likely an adaptation evolved to cope with the relatively adverse damper cooler climate which inhibits anemophily (Cao & Zhou, 2002;Nagalingum et al., 2011). Additionally, most most studies on molecular evolution have focused on protein-coding sequences (Chen & Blanchette, 2007;Xie et al., 2018). In this study, the degree of variation in the noncoding sequences (gene spacing regions) was greater than that in the protein-coding sequences.
Insertion and deletion mutations lead to microstructural changes, which can be coded as evolutionary events during phylogenomic analyses. Even in the coding region, introns do not encode proteins, and 20%-68% of introns are species-specific. Over the past few hundred million years, introns have been gained and lost, although during the evolution of plants, introns have mainly been lost (Roy & Penny, 2007), while in algae abundant introns have been lost and gained recently (Turmel et al., 2017). In our study, we found that Quercus has 12 introns, with only Q. edithiae having lost intron regions in petB, petD, rps12, rps16, and rpl16. This could have occurred as a result from mistakes during gene annotation. However, this scenario could also represent patterns of evolution in Quercus. Differing mutation rate is the main cause of differentiations in evolutionary rate, and most mutation events generally occur in introns, untranslated regions (UTRS), and noncoding regulatory regions. Thus, the number of introns in a genome may reflect the evolutionary rate to a large extent. Species with slower evolutionary rates retain more introns than those with faster evolutionary rates.
Pseudogenes, that is, sequences lacking coding abilities, may also play essential roles in genome evolution .
Redundant DNA deletion events frequently occur in genomes, significantly reducing the pseudogene contents (Lafontaine & Dujon, 2010). Whether the pseudogene ycf encodes a protein has been controversial  and only recently was it confirmed that the basal groups of asterids have complete ycf15 gene structures and are assumed to have protein-coding functions (Curci et al., 2015). However, a small number of eudicots have ycf15 gene structures identified as pseudogenes. These were mainly found in the Ericales, Gentianales, Lamiales, Solanales, and Apiales, which TA B L E 2 Substitution and evaluation rates of Quercus plastid genes (standard deviation) using p < .05 as significant criterion and Melastomataceae, in which the ycf15 gene has undergone genetic degeneration, resulting in its reduction to only c. 50 bp in length (Choi & Park, 2015;Liu et al., 2013;Nie et al., 2012). Pseudogenes are considered to be less influenced by selective pressure during evolution as they can effectively reflect the molecular records of ancestors, which is very helpful in the study of molecular evolution. Interactions among the three genomes (nuclear, plastid, and mitochondrial) in plant cells may also affect the rate of evolution of Quercus plastomes. Gene transfers among these three genomes may lead to specific correlations between their evolutionary rates (Goremykin et al., 2009;Jansen et al., 2011;Millen et al., 2001). In rpl22, rpl23, rpl32, rpl33, rps16, accD, psaI, ycf4, ycf1, ycf2, and infA genes disappeared in some groups, with the disappearance frequency of infA gene being highest, although the infA gene appeared recently in Quercus (Millen et al., 2001). In this study, we identified some gene losses, such as psbB, rpl22, ycf4, and ycf15. The rpl22 gene occurs in the nuclear genome of Fagaceae plants and appeared approximately 34-37 Ma, which explains the gene exchange between plastid and nuclear genes (Jansen et al., 2011). Additionally, DNA replication and repair mechanisms in mitochondrial and plastid genomes have many similarities (Smith et al., 2014). In conclusion, we believe that these three genomes have related evolutionary rates, but the mechanism behind this is still unclear.

| CON CLUS IONS
The whole plastome phylogenomic relationships retrieved in our study mostly coincide with previous research (Denk et al., 2017;Hipp et al., 2020) in their support for sect. Quercus and sect. Lobatae of subg. Quercus, and sect. Cerris and sect. Cyclobalanopsis of subg.
Cerris. However, our whole plastome topology does not support findings from Denk et al. (2017) and Hipp et al. (2020) that sect. Ilex is a monophyletic group. The polyphyly of sect. Ilex is seen by how sect. Cerris and sect. Cyclobalanposis are inserted between three strongly supported lineages of sect. Ilex taxa. Before taxonomic recircumscription of these lineages can be done, phylogenomic studies incorporating a more comprehensive and varied molecular sampling are needed. This is especially pertinent as different genomes, and regions of the genome, can give largely varied interpretations of the evolutionary history of oaks (Hipp et al., 2020).
Based on the whole plastome sequences of 18 Quercus species, the evolutionary model of nucleotide substitution rates of typical plastomes found that most protein-coding genes have experienced relaxed purifying selection. Furthermore, the high Dn value indicated that these genes' functions have effectively adjusted to changes in the environment. Noncoding sequences were also found to have more significant variation, including variation in the interpretation of the intergenic region, loss of introns, and loss and degradation of pseudogenes, which are all manifestations of plastid evolution.
Additionally, transfers of individual genes between the plastid and nuclear genomes were identified. This allows us to speculate that the evolution of the oak plastome was correlated with that of the other two genomes (nuclear and mitochondrial) to some extent. Continued work focused on a comprehensive molecular sampling of all Quercus species, coupled with integrating new whole-genome sequencing technologies including those focused on the nuclear genome, will be fundamental in improving our understanding of the underlying mechanisms of evolution of both Quercus and organisms in general. Engineering Fund (KYCX18_0989). We also thank Xie Lei for his contribution to data processing.

CO N FLI C T O F I NTE R E S T
None declared.