Convergent molecular evolution of phosphoenolpyruvate carboxylase gene family in C4 and crassulacean acid metabolism plants

Phosphoenolpyruvate carboxylase (PEPC), the key enzyme in initial carbon fixation of the C4 and crassulacean acid metabolism (CAM) pathways, is thought to have undergone convergent adaptive changes that resulted in the convergent evolution of C4 and CAM photosynthesis in vascular plants. However, the integral evolutionary history and convergence of PEPC in plants remain poorly understood. In the present study, we identified the members of the PEPC gene family across green plants using seventeen genomic datasets, found ten conserved motifs and modeled the three-dimensional protein structures of 90 plant-type PEPC genes. After reconstructing the PEPC gene family tree and reconciling it with the species tree, we found that PEPC genes underwent 71 gene duplication events and 16 gene loss events, which might have resulted from whole-genome duplication events in plants. Based on the phylogenetic tree of the PEPC gene family, we detected four convergent evolution sites of PEPC in C4 species but none in CAM species. The PEPC gene family was ubiquitous and highly conserved in green plants. After originating from gene duplication of ancestral C3-PEPC, C4-PEPC isoforms underwent convergent molecular substitutions that might have facilitated the convergent evolution of C4 photosynthesis in angiosperms. However, there was no evidence for convergent molecular evolution of PEPC genes among CAM plants. Our findings help to understand the origin and convergent evolution of C4 and CAM plants and shed light on the adaptation of plants to dry, hot environments.


INTRODUCTION
Plant photosynthesis is the most important biochemical process on the planet: it provides energy, food and oxygen for the survival and reproduction of the vast majority of heterotrophic organisms, including humans (Fischer, Hemp & Johnson, 2016). Since its origin, oxygen levels have gradually risen and carbon dioxide (CO2) concentrations have decreased in the atmosphere (Foster, Royer & Lunt, 2017), which has increased photorespiration and led to bioenergy and carbohydrate losses (Schlüter & Weber, 2020). To obtain adequate CO2 and improve photosynthetic efficiency, plants developed several carbon-concentrating mechanisms (CCMs), such as C4 photosynthesis and crassulacean acid metabolism (CAM), which allowed them to adapt to the sudden decline of atmospheric CO2 approximately 350 million years ago (Mya) (Edwards, 2019; Heyduk et al., 2019). C4 photosynthesis, an example of convergent evolution, has evolved independently more than 60 times in angiosperms (Sage, Christin & Edwards, 2011; Sage, 2016), whereas CAM has evolved independently in approximately 37 families of vascular plants (Silvera et al., 2010; Winter et al., 2021). Recently, several studies using comparative genomics have provided new insights into the genetics and evolution of CCMs (Yang et al., 2017; Heyduk et al., 2019; Wai et al., 2019; Yang et al., 2019; Jaiswal et al., 2021). However, the molecular mechanisms underlying convergent evolution in CCMs remain poorly understood.
Interestingly, PEPC genes are essential for regulation of the circadian clock in CAM photosynthesis (Boxall et al., 2020) and share convergent amino acid changes in diverse CAM species (Yang et al., 2017). In C4 grasses, PEPC genes have also undergone parallel, adaptive genetic changes (Christin et al., 2007; Besnard et al., 2009; Moreno-Villena et al., 2018). Therefore, PEPC genes are crucial for elucidating the origin and convergent evolution of C4 and CAM photosynthesis. Previous studies examined the convergent evolution of the PEPC gene family in only a few C4 or CAM photosynthetic lineages, such as grasses (Christin et al., 2007; Besnard et al., 2009; Moreno-Villena et al., 2018) and angiosperms (Yang et al., 2017). These studies did not include Isoetes, the earliest-diverging lineage of CAM plants (Keeley, 1981), or fern CAM plants such as Platycerium (Rut et al., 2008), so the origin and evolution of the PEPC gene family could not be clearly elucidated (Deng et al., 2016). Furthermore, CAM photosynthesis occurs widely in the major clades of vascular plants, including pteridophytes (lycopods and ferns), gymnosperms and angiosperms, while C4 photosynthesis is restricted to angiosperms. Therefore, to understand the convergent molecular evolution of the PEPC gene family in the plant kingdom, species with genomic datasets that represent C3, C4 and CAM photosynthesis across the major lineages of plants should be sampled.
Fortunately, more and more plant genomes have been sequenced (https://www.plabipd.de/index.ep), which provides an excellent opportunity to resolve the origin and convergent evolution of C4 and CAM photosynthesis in plants (Heyduk et al., 2019; Yang et al., 2019; Gilman & Edwards, 2020). However, using hundreds of genomic datasets to analyze the convergent evolution of PEPC genes, which have multiple copies in most species, is a major challenge. In the present study, we therefore selected only 17 plant genomes, comprising C3, C4 and CAM species across algae, bryophytes, pteridophytes, gymnosperms and angiosperms, and in particular included Isoetes and Platycerium, which represent the earliest-diverging lineages of CAM plants. We then identified the PEPC genes from the 17 genomes and reconstructed the evolutionary history and molecular convergence of PEPC genes in C4 and CAM plants. Our study will help to elucidate the origin and evolution of C4 and CAM plants and shed light on the adaptation of plants to dry, hot environments.

Gene family identification
To identify members of the PEPC gene family, we created a local BLAST database with protein sequences from the 17 plant species and then performed BLASTP searches with default parameters using PEPC protein sequences from Arabidopsis (At1g53310, At1g68750, At2g42600 and At3g14940) as queries (Camacho et al., 2009). Furthermore, the Pfam seed alignment of the PEPcase domain (PF00311, http://pfam.xfam.org/) was used to build an HMMER profile, with which we searched for candidate PEPC genes in the 17 genomic datasets using HMMER v3.2.1 (Mistry et al., 2013). After combining the two PEPC gene sets, we searched for conserved domains in the Conserved Domain Database using Batch CD-Search with default parameters (Lu et al., 2020). Only genes with the PEPcase conserved domain were identified as reliable plant-type PEPC genes, whereas bacterial-type PEPC genes contained the PRK00009 domain.
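The combination and filtering steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the gene identifiers and the `cd_search_domains` mapping are hypothetical, and it assumes that "combining" the two gene sets means keeping the genes found by both searches, as the Results suggest.

```python
# Hypothetical hit sets; the gene IDs are invented for illustration only.
blastp_hits = {"geneA", "geneB", "geneC", "geneD"}
hmmer_hits = {"geneB", "geneC", "geneD"}

# Hypothetical CD-Search summary: gene ID -> conserved domains reported for it.
cd_search_domains = {
    "geneB": {"PEPcase"},
    "geneC": {"PRK00009"},
    "geneD": {"PEPcase superfamily"},
}


def classify_candidates(blastp_hits, hmmer_hits, cd_search_domains):
    """Keep genes found by both searches, then label them by conserved domain."""
    common = blastp_hits & hmmer_hits  # assumption: intersection of the two sets
    classes = {"PTPC": set(), "BTPC": set(), "other": set()}
    for gene in common:
        domains = cd_search_domains.get(gene, set())
        if "PEPcase" in domains:
            classes["PTPC"].add(gene)  # plant-type PEPC
        elif "PRK00009" in domains:
            classes["BTPC"].add(gene)  # bacterial-type PEPC
        else:
            classes["other"].add(gene)  # other PEPcase superfamily domains
    return classes


classes = classify_candidates(blastp_hits, hmmer_hits, cd_search_domains)
print(classes)
```

In this toy input, `geneA` is dropped because only one of the two searches recovered it.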

Prediction of conserved motifs and modeling of protein structure
After identifying reliable PEPC genes with the pipeline above, conserved motifs were predicted by MEME v5.1.0 (Bailey et al., 2006) with default parameters. Motif alignment was performed by MAST v5.1.0 with default parameters, and conserved motifs were visualized by TBtools v1.046 (Chen et al., 2020a). Homology modelling of the 3D structure of PEPC proteins was performed with SWISS-MODEL (Waterhouse et al., 2018), which integrates up-to-date protein sequence and structure databases as the source of structural templates. We selected the optimal protein model with the highest values of Global Model Quality Estimate (GMQE) and sequence identity, which indicated the highest reliability of the homology modelling. To detect convergence of 3D protein structure, we applied a simple test: a structure was considered convergent if it occurred only in CAM or C4 plants.
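The model selection rule above can be illustrated with a short sketch. The source does not state how GMQE and sequence identity were combined, so ranking by GMQE first and by identity second is an assumption; the scores and the template names other than 5vyj.1.A are invented for illustration.

```python
from typing import NamedTuple


class Model(NamedTuple):
    template: str        # template identifier
    gmqe: float          # Global Model Quality Estimate, between 0 and 1
    seq_identity: float  # percent identity between target and template


def pick_best_model(models):
    """Return the model with the highest GMQE, breaking ties by identity."""
    # Assumption: lexicographic ranking (GMQE first, then sequence identity).
    return max(models, key=lambda m: (m.gmqe, m.seq_identity))


# Hypothetical candidate models for one PTPC protein.
candidates = [
    Model("5vyj.1.A", 0.88, 85.2),
    Model("template_x", 0.88, 79.9),
    Model("template_y", 0.75, 90.1),
]

best = pick_best_model(candidates)
print(best.template)
```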

Phylogenetic reconstruction and gene tree-species tree reconciliation
To understand the evolution of the PEPC gene family, PEPC protein sequences were aligned with MAFFT v7.453 (Katoh & Standley, 2013) using the accurate L-INS-i method and a maximum of 1000 iterative refinements. Conserved blocks were selected by GBLOCKS 0.91b (Castresana, 2000) with the minimum block length set to five and allowed gap positions set to half. We then reconstructed the PEPC gene family tree with maximum likelihood using IQ-TREE v1.6.11 (Nguyen et al., 2015). The best-fit amino acid model, JTTDCMut+R5, was selected by ModelFinder (Kalyaanamoorthy et al., 2017) using the Bayesian information criterion. Ultrafast bootstrap support was calculated using 1000 replicates (Hoang et al., 2017). Phylogenetic reconciliation of the gene tree and species tree was performed by Treerecs (Comte et al., 2020) with default parameters. The species tree was based on a recent phylogenomic reconstruction of green plants (Initiative, 2019).
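As a rough sketch, the alignment, trimming and tree-building steps might be assembled as the command lines below. The authors' exact commands are not given in the source, so the flags reflect one reading of each tool's documentation for the settings described above, and the file names are hypothetical. The Treerecs step is omitted because only default parameters are reported.

```python
def build_pipeline_commands(proteins_fasta, alignment_fasta):
    """Return argument lists for the alignment, trimming and tree steps."""
    # MAFFT: --localpair with --maxiterate selects L-INS-i. MAFFT writes the
    # alignment to standard output, so the caller redirects it to alignment_fasta.
    mafft = ["mafft", "--localpair", "--maxiterate", "1000", proteins_fasta]
    # Gblocks: protein data (-t=p), minimum block length 5 (-b4=5),
    # allowed gap positions "with half" (-b5=h).
    gblocks = ["Gblocks", alignment_fasta, "-t=p", "-b4=5", "-b5=h"]
    # Gblocks writes the selected blocks next to its input with a "-gb" suffix.
    trimmed = alignment_fasta + "-gb"
    # IQ-TREE 1.6: -m MFP runs ModelFinder before the tree search, and
    # -bb 1000 requests 1000 ultrafast bootstrap replicates.
    iqtree = ["iqtree", "-s", trimmed, "-m", "MFP", "-bb", "1000"]
    return {"mafft": mafft, "gblocks": gblocks, "iqtree": iqtree}


# Hypothetical file names.
commands = build_pipeline_commands("pepc_proteins.faa", "pepc_aligned.faa")
for name, args in commands.items():
    print(name, " ".join(args))
```

Each list could be passed to `subprocess.run`, with the MAFFT output redirected to the alignment file before Gblocks is started.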

Convergent site detection
The convergent site definition of Rey et al. (2018), that ''a substitution is convergent if it occurred toward the same amino acid preference on every branch where the phenotype also changed toward the convergent phenotype'', was employed in this study. Because several amino acids with similar biochemical properties may have roughly the same fitness at a site (Rey et al., 2018), a convergent site need not carry exactly the same amino acid in all species with a convergent phenotype. In addition, only some PEPC gene copies are likely to be involved in CAM or C4 photosynthesis, and it is difficult to identify which copy is the CAM or C4 isoform. Therefore, we labeled putative convergent clades or gene copies on the phylogenetic tree with three kinds of gene combinations: (1) including all gene copies of species with convergent phenotypes (C4 or CAM), (2) only containing one clade within each convergent species and (3)

Identification of PEPC gene family
As the key carboxylase, PEPC genes are widely distributed in green plants (Table 1).
In the present study, we identified 264 homologous genes using BLASTP searching with Arabidopsis PEPC genes (At1g53310, At1g68750, At2g42600 and At3g14940) as queries and 179 homologous genes using the Pfam seed alignment of the PEPcase domain (PF00311) from 17 genomic datasets across green plants. After combination of the two gene sets, we obtained 179 common, homologous genes and then searched for conserved domains using the Conserved Domain Database: 109 genes contained conserved domains, of which 90 contained the plant-type PEPC (PTPC) gene domain (PEPcase) and 19 contained the bacterial-type PEPC (BTPC) gene domain (PRK00009); the remaining 70 genes contained other PEPcase superfamily domains (Tables 1, S1).

Conserved motifs and structures of PEPC gene family
We predicted ten conserved motifs from 109 PEPC proteins. Each motif was longer than 29 amino acids and was found in more than 104 of the 109 PEPC proteins (Table 2). Most of the amino acids were conserved across all motifs, and the linear order of these motifs, especially in PTPC genes, was identical across all green plants. Some motifs were repeated in various genes (Fig. 1). To test whether protein structures evolved convergently in CAM or C4 plants, we modeled 3D structures of 90 PTPC proteins using the SWISS-MODEL server (Fig. S1) and obtained three templates of protein structures (Table S2): 5vyj.1.A
Figure 1 The phylogenetic tree and ten conserved motifs of the PEPC gene family in plants. Asterisks in the maximum likelihood tree indicate bootstrap support values of 100; values lower than 60 are not shown in the gene family tree. Arrowheads indicate amino acid residues with experimentally proven function in Wang et al. (2016)
Li et al., 2018)

Gene duplication and loss
Here, we reconstructed a robust phylogenetic tree of the PEPC gene family with relatively adequate sampling that included C3, C4 and CAM plants; bootstrap support values were higher than 60 for most branches (Fig. 1). The evolutionary history of PTPC genes was independently reconstructed with maximum likelihood and reconciled with the species tree based on duplication-loss reconciliation (Fig. 3). The reconciliation tree showed that PTPC genes underwent at least 71 duplications and 16 losses in the evolutionary history of our sampled species.
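To make duplication-loss reconciliation concrete, the sketch below counts duplications and losses on a toy example using plain last-common-ancestor (LCA) mapping. This is a textbook simplification under stated assumptions, not the Treerecs implementation; the trees, species labels and gene names are hypothetical, and both trees are assumed to be binary and rooted.

```python
def index_species_tree(tree, depth=0, index=None):
    """Return (leaf set of this subtree, {clade leaf set: depth in the tree})."""
    if index is None:
        index = {}
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        left, _ = index_species_tree(tree[0], depth + 1, index)
        right, _ = index_species_tree(tree[1], depth + 1, index)
        leaves = left | right
    index[leaves] = depth
    return leaves, index


def lca(species, index):
    """Smallest species-tree clade that contains all the given species."""
    return min((clade for clade in index if species <= clade), key=len)


def reconcile(gene_tree, gene_to_species, index):
    """Return (mapped clade, duplications, losses) for a binary gene tree."""
    if isinstance(gene_tree, str):
        return frozenset([gene_to_species[gene_tree]]), 0, 0
    results = [reconcile(child, gene_to_species, index) for child in gene_tree]
    here = lca(results[0][0] | results[1][0], index)
    # A node is a duplication if it maps to the same clade as one of its children.
    is_dup = any(clade == here for clade, _, _ in results)
    dups = int(is_dup) + sum(d for _, d, _ in results)
    losses = sum(l for _, _, l in results)
    for clade, _, _ in results:
        skipped = index[clade] - index[here]  # species-tree edges crossed
        losses += skipped if is_dup else skipped - 1
    return here, dups, losses


# Toy example: species tree ((A,B),C) and a gene tree with two copies in A.
species_tree = (("A", "B"), "C")
gene_tree = (("A_1", "B_1"), ("A_2", "C_1"))
gene_to_species = {"A_1": "A", "B_1": "B", "A_2": "A", "C_1": "C"}

_, index = index_species_tree(species_tree)
_, duplications, losses = reconcile(gene_tree, gene_to_species, index)
print(duplications, losses)  # 1 duplication at the root, 2 losses
```

In the toy case, one duplication at the root yields two gene copies; the first copy is later lost in species C and the second in species B.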

Convergent evolution of PEPC gene family
To test whether convergent molecular evolution at the amino acid level occurred in PTPC proteins, we detected convergent sites in all phenotypically convergent clades of C4 and CAM photosynthesis with the Profile Change with One Change (PCOC) pipeline. The results showed two convergent shifts in PTPC proteins in CAM species and three convergent shifts in C4 species; the posterior probabilities (pp) for the PCOC model were greater than 0.9 at all convergent shifts (Fig. 4). In addition, we detected convergent evolution at sites in different gene groups, in which each species with the convergent phenotype retained one clade or one gene copy (one-to-one) (Fig. S2). Four identical convergent amino acid sites were discovered in one gene group (AhPEPC1.2/ZmPEPC2) in C4 species (Fig. 5). However, no identical convergent sites were found in the one-to-one gene groups of CAM species (Fig. S2).
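The notion of an identical convergent site in a one-to-one gene group can be illustrated with a simple alignment screen. This is not the PCOC method: it ignores the tree and ancestral states, so it cannot separate convergence from shared ancestry. It only shows the pattern being looked for. The sequence names and residues are hypothetical.

```python
def identical_convergent_sites(alignment, convergent_ids):
    """Return 1-based columns where every convergent sequence carries the same
    residue and no other sequence carries that residue."""
    convergent = [alignment[name] for name in convergent_ids]
    others = [seq for name, seq in alignment.items() if name not in convergent_ids]
    sites = []
    for col in range(len(convergent[0])):
        conv_residues = {seq[col] for seq in convergent}
        other_residues = {seq[col] for seq in others}
        if (
            len(conv_residues) == 1              # identical in convergent copies
            and "-" not in conv_residues         # gaps do not count
            and not conv_residues & other_residues  # absent from the other copies
        ):
            sites.append(col + 1)
    return sites


# Hypothetical one-to-one gene group: one copy per species, equal-length rows.
alignment = {
    "C4_species_a": "MASKA",
    "C4_species_b": "MTSKA",
    "C3_species_a": "MASRS",
    "C3_species_b": "MTSRS",
}

sites = identical_convergent_sites(alignment, {"C4_species_a", "C4_species_b"})
print(sites)  # columns 4 and 5 in this toy alignment
```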

DISCUSSION
PEPC gene family with different copies was conserved in plants
The PEPC gene family was ubiquitous in green plants, but its copy number differed among lineages. In the present study, BTPC genes were found in 11 of the 17 sampled species and had fewer copies than PTPC genes (Table 1). We did not detect any PTPC genes in Norway spruce (Picea abies), because its homologs lacked the conserved PEPcase domain. This might have resulted from pseudogenization and/or insertion of transposable elements in conifers (Nystedt et al., 2013), or from incomplete genome assembly, because the ten homologous proteins in Norway spruce (<510 aa) were shorter than the PTPCs of other species (∼900 aa). Therefore, the ten PEPC homologs in Norway spruce, which contained the PEPcase superfamily domain, probably performed PTPC-like physiological functions or were fragments of PTPC genes. Due to numerous WGDs (Soltis & Soltis, 2016), the PEPC gene family had relatively more copies in angiosperms, especially in maize, which contained 22 PTPC genes. Interestingly, moss (Physcomitrella patens) also retained 22 PTPC genes, but its sister clades, hornworts (Anthoceros angustus) and liverworts (Marchantia polymorpha), had only one PTPC gene each (Table 1). This extreme difference in gene content corresponded to different adaptation strategies for plant terrestrialization. Mosses underwent WGDs that increased gene-family complexity for coordinating multicellular growth and responding to dehydration (Rensing et al., 2008). In contrast, liverworts have ancient dimorphic sex chromosomes, which may have resulted in a lack of WGDs and reduced proliferation of regulatory genes (Bowman et al., 2017). The genome of A. angustus is notably simple and has obtained stress-response and metabolic pathway genes through horizontal gene transfer from bacteria or fungi, which probably assisted its survival in a terrestrial environment (Zhang et al., 2020).
PEPC genes displayed highly conserved amino acid sequences in all green plants. Here, all ten motifs were longer than 29 amino acids and were found in more than 104 of the 109 PEPC proteins (Table 2). This suggests that the PEPC genes were detected reliably, because ultra-conserved motifs indicate similar or identical functions (Bejerano et al., 2004). Additionally, the linear order of these motifs, especially in PTPC genes, was identical across all green plants (Fig. 1). All of our predicted motifs, which contained several functional loci such as the PEP binding site, Mg2+ binding site, HCO3− binding site, S/A 755 site and aspartate binding site, were also reported previously (Wang et al., 2016). However, motif 10 reported by Wang et al. (2016) was not detected in the present study and might be conserved only in angiosperms. These results clearly indicated that the PEPC gene family has been extremely conserved throughout its evolutionary history of more than 500 million years (My), since its origin from algae (Chollet, Vidal & O'Leary, 1996; Svensson, Bläsing & Westhoff, 2003; Darabi & Seddigh, 2018).

PEPC was convergent in C4 photosynthesis but not in CAM photosynthesis
According to the robust phylogenetic tree, the PEPC gene family consists of two major subfamilies, PTPC and BTPC, which is consistent with the predictions of the conserved domains (Fig. 1, Table S1). PTPC genes play critical roles in initial carbon fixation in C4 and CAM photosynthesis (Jiao & Chollet, 1991; Lepiniec et al., 1994; Nimmo, 2000; Deng et al., 2016). Therefore, the PTPC gene tree was independently reconstructed and reconciled with the species tree based on duplication-loss reconciliation (Fig. 3). The reconciliation tree showed that PTPC genes underwent at least 71 duplications and 16 losses in the evolutionary history of our sampled species, which indicated that PTPC genes arose multiple times through frequent duplication events (Fig. 3), potentially caused by WGD during the evolution of green plants, especially in angiosperms (Van de Peer, Mizrachi & Marchal, 2017). After gene duplication, plants could respond to new environments through neo-functionalization of gene copies (Russell, 2003; Cheng et al., 2018). Previous research assumed that PEPC isoforms in C4 and CAM species were duplicated from a non-photosynthetic PEPC gene that existed in ancestral C3 species (Svensson, Bläsing & Westhoff, 2003; Christin et al., 2014). Although our sampling was limited, our results showed no strong association between PEPC gene duplication and CAM/C4 evolution (Fig. 3). Similar results were found in a study of orchid PEPC genes, in which no correlation was found between the presence of CAM and gene duplication (Zhang et al., 2016). In other words, PEPC gene duplications may be important for the evolutionary origin of C4 and CAM photosynthesis, but they show no clear correlation with it. In the CAM pathway, post-translational regulation of PEPC might play a key role (Jiao & Chollet, 1991; O'Leary, Park & Plaxton William, 2011; Chen et al., 2020b).
To test whether convergent molecular evolution at the amino acid level occurred in PTPC proteins, we performed comprehensive detection of convergent sites in C4 and CAM photosynthesis using the PTPC gene tree of green plants with the Profile Change with One Change (PCOC) pipeline, which can detect not only convergent substitutions of amino acids but also convergent shifts that correspond to convergent phenotypic changes (Rey et al., 2018). The results showed two convergent shifts in PTPC proteins in CAM species and three convergent shifts in C4 species; the posterior probabilities (pp) for the PCOC model were greater than 0.9 at all convergent shifts (Fig. 4). However, identical convergent substitutions were not detected in clades from both photosynthetic pathways, which indicated that identical convergent molecular evolution at the amino acid level might not occur in all copies of PTPC proteins.
Different isoforms of the PEPC gene family might perform different functions. In addition to photosynthetic functions, PEPC genes also perform highly diverse non-photosynthetic functions, such as response to abiotic stress, fruit maturation, seed formation and germination (Lepiniec et al., 1994; Chollet, Vidal & O'Leary, 1996; O'Leary, Park & Plaxton William, 2011; Shi et al., 2015; Wang et al., 2016; Waseem & Ahmad, 2019; Zhao et al., 2019). Possibly only a few PEPC isoforms are involved in the convergent evolution of C4 and CAM photosynthesis. Therefore, we also detected convergent evolution at sites in different gene groups, in which each species with the convergent phenotype retained one clade or one gene copy (one-to-one) (Fig. S2). Interestingly, four identical convergent amino acid sites were discovered in one gene group (AhPEPC1.2/ZmPEPC2) in C4 species (Fig. 5), two of which were also reported in previous studies (Bläsing, Westhoff & Svensson, 2000; Christin et al., 2007; Besnard et al., 2009; Paulus, Schlieper & Groth, 2013). The convergent amino acid mutations in the active site Ala774 and the inhibitory site Arg884 were sufficient to switch the photosynthetic function from C3 to C4 activity (Paulus, Schlieper & Groth, 2013). Due to limited sampling, our results may overestimate the number of convergent sites in C4 plants, because a larger sample size may decrease the number of inferred molecular convergences (Thomas, Hahn & Hahn, 2017). Therefore, the two new convergent sites of the PEPC gene family in C4 species should be verified by further studies with adequate sampling. Several convergent sites reported in previous studies (Christin et al., 2007; Besnard et al., 2009; Christin et al., 2014; Moreno-Villena et al., 2018) were not detected in the present study, probably because these sites are convergent only in grasses.
Previously, Yang et al. (2017) reported a convergent evolution site of the PEPC gene in several CAM lineages of angiosperms, except Ananas comosus. However, when we detected convergent sites in the one-to-one gene groups of CAM species, no identical convergent sites were found (Fig. S2), which indicated that PEPC genes might not have identical convergent sites that resulted in photosynthetic conversion from the C3 to the CAM pathway (Wickell et al., 2021).

CONCLUSIONS
The PEPC gene family plays a crucial role in C4 and CAM photosynthesis and is considered to have contributed to the convergent evolution of these CCMs. In the present study, we detected the convergent amino acid sites of the PEPC gene family using relatively limited genomic datasets. In the evolutionary history of the PEPC gene family, gene duplication occurred frequently due to multiple WGD events, but no strong association was observed between PEPC gene duplication and CAM/C4 evolution. The 3D protein structures of the PEPC gene family were also not associated with C4 and CAM evolution. Additionally, four sites with convergent substitutions were detected in C4-PEPC isoforms, two of which were key functional positions for switching photosynthetic function from C3 to C4 activity. However, no convergent sites were detected in CAM-PEPC genes. Our results indicated that convergent molecular substitutions of PEPC genes played key roles in the origin and convergent evolution of C4 photosynthesis, whereas the convergent evolution of CAM photosynthesis may not have been caused by convergence at the amino acid level in PEPC proteins. However, our limited sampling may have affected the inference of molecular convergence. In future, the evolutionary trajectories of PEPC genes should be clarified with more genomic data that include more CAM and C4 species and their C3 relatives.