Genetic Diversity of C4 Photosynthesis Pathway Genes in Sorghum bicolor (L.)

C4 photosynthesis has evolved in over 60 different plant taxa and is an excellent example of convergent evolution. Plants using the C4 photosynthetic pathway have an efficiency advantage, particularly in hot and dry environments. They account for 23% of global primary production and include some of our most productive cereals. While previous genetic studies comparing phylogenetically related C3 and C4 species have elucidated the genetic diversity underpinning the C4 photosynthetic pathway, no previous studies have described the genetic diversity of the genes involved in this pathway within a C4 crop species. Enhanced understanding of the allelic diversity and selection signatures of genes in this pathway may present opportunities to improve photosynthetic efficiency, and ultimately yield, by exploiting natural variation. Here, we present the first genetic diversity survey of 8 known C4 gene families in an important C4 crop, Sorghum bicolor (L.) Moench, using sequence data of 48 genotypes covering wild and domesticated sorghum accessions. Average nucleotide diversity of C4 gene families varied more than 20-fold from the NADP-malate dehydrogenase (MDH) gene family (θπ = 0.2 × 10−3) to the pyruvate orthophosphate dikinase (PPDK) gene family (θπ = 5.21 × 10−3). Genetic diversity of C4 genes was reduced by 22.43% in cultivated sorghum compared to wild and weedy sorghum, indicating that the group of wild and weedy sorghum may constitute an untapped reservoir for alleles related to the C4 photosynthetic pathway. A SNP-level analysis identified purifying selection signals on C4 PPDK and carbonic anhydrase (CA) genes, and balancing selection signals on C4 PPDK-regulatory protein (RP) and phosphoenolpyruvate carboxylase (PEPC) genes. Allelic distribution of these C4 genes was consistent with selection signals detected. 
A better understanding of the genetic diversity of the C4 pathway in sorghum paves the way for mining natural allelic variation to improve photosynthesis.


Introduction
C4 photosynthesis has independently evolved in more than 60 different plant taxa [1]. The main driver for this convergent evolution is the tendency of ribulose-1,5-bisphosphate carboxylase (Rubisco), which catalyzes the net fixation of carbon dioxide (CO2), to also catalyze an unfavorable oxygenation reaction. This reaction produces toxic phosphoglycolate, which has to be converted to useful metabolites at a substantial cost in metabolic energy [2,3]. This wasteful use of CO2 is termed photorespiration. Photorespiration becomes a major constraint of photosynthesis in situations where CO2 to O2 ratios are [...]
[...] isoforms are homologous to C4 genes, but there was no evidence supporting their involvement in the NADP-ME photosynthetic pathway. Homology between these sorghum C4 genes and their non-C4 isoforms was further verified via a local BLAST strategy. Protein sequences of these 9 core C4 genes were extracted from the sorghum reference genome V3.1 and searched against the reference genome with BLAST. BLAST hits for each gene were filtered using the following criteria: E-value <−10, sequence identity >60%, and alignment length >80%. All hits for the same gene satisfying the criteria were plotted based on −log(E-value); only hits in the top −log(E-value) class were considered if they were clearly differentiated from the remaining hits, otherwise all hits were used.
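The hit-filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the `BlastHit` record and its field names are hypothetical, the E-value cutoff of 1e-10 is an assumption (the threshold is not legible in the text), and alignment length is assumed to be expressed as a percentage of the query length.

```python
import math
from dataclasses import dataclass


@dataclass
class BlastHit:
    # Hypothetical record; field names are illustrative only.
    subject_id: str
    evalue: float
    identity_pct: float       # percent identical residues
    aln_coverage_pct: float   # alignment length as percent of the query (assumed)


def neg_log10_evalue(hit, floor=1e-300):
    """-log10(E-value); the floor avoids log(0) for E-values reported as 0."""
    return -math.log10(max(hit.evalue, floor))


def filter_hits(hits, max_evalue=1e-10, min_identity=60.0, min_coverage=80.0):
    """Keep hits passing all three thresholds, ranked by -log10(E-value).

    max_evalue=1e-10 is an assumed value, not taken from the paper.
    """
    kept = [
        h for h in hits
        if h.evalue < max_evalue
        and h.identity_pct > min_identity
        and h.aln_coverage_pct > min_coverage
    ]
    return sorted(kept, key=neg_log10_evalue, reverse=True)
```

Ranking the retained hits by −log(E-value) makes it straightforward to inspect whether a clearly separated top class exists before deciding which hits to carry forward.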
Figure. Biosynthetic pathway of C4 photosynthesis (adapted from [40]). In the mesophyll cells, CO2 is converted to HCO3− by carbonic anhydrase (CA) and fixed into the four-carbon acid oxaloacetate (OAA) by phosphoenolpyruvate carboxylase (PEPC). Phosphorylation of PEPC is carried out by PEPC kinase (PPCK). The OAA generated by PEPC is then reduced to malate by NADP-malate dehydrogenase (NADP-MDH) or transaminated to aspartate. The resultant C4 acids, malate and aspartate, are transported to the bundle sheath and then decarboxylated in the vicinity of Rubisco to release CO2 and pyruvate. Pyruvate is transported back to the mesophyll cells, where PEP is regenerated by pyruvate orthophosphate dikinase (PPDK), while CO2 enters the Calvin-Benson-Bassham cycle and is fixed by ribulose-1,5-bisphosphate carboxylase (Rubisco). Activation and inactivation of PPDK are catalyzed by PPDK regulatory protein (PPDK-RP).

Plant Material and Genomic Data
Sequence data for the identified C4 genes were extracted from 48 accessions of Sorghum bicolor with high mapping depth (~22× per accession, ranging from 16× to 45×) reported in previous studies [24–26]. These 48 accessions represent all major cultivated sorghum races and some wild progenitors (Table S1).

Gene-Level Population Genetic Analyses
Population genetic parameters, including nucleotide diversity (θπ) [41], Tajima's D [42], and Watterson's estimator (θW) [43], were calculated for each of the 27 genes using the Bio::PopGen::Statistics module. FST [44], which measures population differentiation, was also calculated for each of the 27 genes using the Bio::PopGen::PopStats module [26]. The Bio::PopGen::IO module was used to read the input file, which was prepared with an in-house Perl script.
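For readers unfamiliar with these statistics, the sketch below computes θπ, θW, and Tajima's D from a small alignment using the standard textbook formulas. It is written in Python for illustration and is not the Bio::PopGen implementation used in the study; the function name `diversity_stats` is hypothetical, and the input is assumed to be gap-free aligned sequences with no missing calls.

```python
from collections import Counter
from math import sqrt


def diversity_stats(seqs):
    """Return per-site theta_pi, per-site theta_W, Tajima's D and S.

    `seqs` is a list of equal-length aligned sequences, one per sampled
    chromosome. Gaps and missing data are not handled (simplification).
    """
    n, length = len(seqs), len(seqs[0])
    if n < 4 or any(len(s) != length for s in seqs):
        raise ValueError("need at least 4 aligned sequences of equal length")

    n_pairs = n * (n - 1) / 2
    seg_sites = 0      # S, the number of segregating sites
    pair_diffs = 0.0   # differences summed over all sequence pairs
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            identical = sum(c * (c - 1) / 2 for c in counts.values())
            pair_diffs += n_pairs - identical

    k = pair_diffs / n_pairs                  # mean pairwise differences
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = seg_sites / a1                  # Watterson's estimator (per locus)

    tajima_d = None                           # undefined when S == 0
    if seg_sites > 0:
        b1 = (n + 1) / (3 * (n - 1))
        b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
        c1 = b1 - 1 / a1
        c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
        e1 = c1 / a1
        e2 = c2 / (a1 ** 2 + a2)
        variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
        tajima_d = (k - theta_w) / sqrt(variance)

    return {
        "S": seg_sites,
        "theta_pi": k / length,
        "theta_w": theta_w / length,
        "tajima_d": tajima_d,
    }
```

An excess of rare variants pulls θπ below θW and gives a negative D, whereas an excess of intermediate-frequency variants gives a positive D, which is why the sign of D features in the selection criteria.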
The criteria used in Mace et al. (2013) were employed to identify genes under purifying selection and balancing selection. Criteria for purifying selection were: (1) θπ and θW < 5% of the empirical distribution in the cultivated group; (2) FST between the cultivated sorghum group and the wild and weedy sorghum group > 95% of the population pairwise distribution; (3) Tajima's D < 0. Criteria for balancing selection were: (1) θπ and θW > 25% of the empirical distribution in the cultivated group; (2) FST between the cultivated sorghum group and the wild and weedy sorghum group < 90% of the population pairwise distribution; (3) Tajima's D > 5% of the empirical distribution.
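The gene-level criteria above can be expressed as a simple rule set. The sketch below is one possible reading, assuming that "< 5% of the empirical distribution" means a percentile rank below 5 in the genome-wide distribution; the function names, dictionary keys, and data layout are hypothetical and not taken from Mace et al. (2013).

```python
from bisect import bisect_left


def percentile_rank(value, distribution):
    """Percent of values in `distribution` that are strictly below `value`."""
    ordered = sorted(distribution)
    return 100.0 * bisect_left(ordered, value) / len(ordered)


def classify_gene(gene, genome):
    """Classify one gene as 'purifying', 'balancing' or 'none'.

    `gene` holds the statistics of the gene of interest; `genome` holds the
    genome-wide empirical distribution (a list) for each statistic.
    Thresholds follow the criteria quoted in the text.
    """
    pi_rank = percentile_rank(gene["theta_pi"], genome["theta_pi"])
    w_rank = percentile_rank(gene["theta_w"], genome["theta_w"])
    fst_rank = percentile_rank(gene["fst"], genome["fst"])
    d_rank = percentile_rank(gene["tajima_d"], genome["tajima_d"])

    if pi_rank < 5 and w_rank < 5 and fst_rank > 95 and gene["tajima_d"] < 0:
        return "purifying"
    if pi_rank > 25 and w_rank > 25 and fst_rank < 90 and d_rank > 5:
        return "balancing"
    return "none"
```

Because every threshold is relative to the genome-wide ranking, a gene is only flagged when it is an outlier against the genomic background rather than on the basis of absolute values.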

SNP-Level Identification of Selection Signature
Population genetic parameters, including θπ, Tajima's D, and FST between the cultivated sorghum group and the wild and weedy sorghum group, were computed for these 27 genes from CDS sequences in PopGenome, a population genomics package implemented in the R environment (http://cran.r-project.org/) [45]. Specifically, the commands diversity.stats, F_ST.stats, and neutrality.stats were called to calculate θπ, FST, and Tajima's D, respectively, for each single nucleotide polymorphism (SNP), using a sliding window of 1 bp with a 1-bp step size. Functional annotation of each SNP was conducted using the get.codons command. The fold decrease of θπ in the cultivated sorghum group relative to the wild and weedy sorghum group was calculated to represent reduction of diversity (RoD). The following criteria were adopted to identify sites with a signature of purifying selection: (1) a RoD greater than the average of neutral genes; (2) FST > 0; (3) Tajima's D < 0. The following criteria were adopted to identify sites with a signature of balancing selection: (1) an increase in diversity (IoD) in the cultivated group relative to the wild and weedy group; (2) FST > 0; (3) Tajima's D > 0.
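The SNP-level decision rules can be summarized in a few lines. The following Python sketch is illustrative only and does not reproduce the PopGenome workflow: the function names are hypothetical, RoD is assumed to be the ratio of wild and weedy θπ to cultivated θπ, and `neutral_rod` stands for the average RoD of neutral genes, whose value is not given in the text.

```python
import math


def reduction_of_diversity(pi_wild, pi_cult):
    """Fold decrease of theta_pi in the cultivated group (wild / cultivated)."""
    if pi_cult == 0:
        # Diversity fully lost in cultivated lines, or absent in both groups.
        return math.inf if pi_wild > 0 else math.nan
    return pi_wild / pi_cult


def classify_snp(pi_wild, pi_cult, fst, tajima_d, neutral_rod):
    """Apply the three-part criteria for each selection type to one site."""
    rod = reduction_of_diversity(pi_wild, pi_cult)
    if rod > neutral_rod and fst > 0 and tajima_d < 0:
        return "purifying"
    # RoD below 1 means diversity increased in the cultivated group (IoD).
    if rod < 1 and fst > 0 and tajima_d > 0:
        return "balancing"
    return "none"
```

Sites that are monomorphic in both groups yield an undefined RoD and fall through to "none", since any comparison with NaN evaluates to false.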

Phylogenetic and Haplotype Analysis
A phylogenetic tree was constructed from the CDS of all 27 genes in the C4 gene families using the neighbor-joining method with default settings (bootstrapped 100 times; support threshold, 50%) in Geneious 8.1.2 (https://www.geneious.com/, Biomatters Ltd., Auckland, New Zealand). Haplotype network analysis was conducted using the R packages ape [46] and pegas [47]. All 48 sorghum accessions were classified into four groups: cultivated, wild and weedy, guinea margaritiferum, and S. propinquum (Table S2).
Mixed trends were found when comparing C4 genes with non-C4 isoforms in each gene family, with the average overall genetic diversity of C4 genes being comparable to that of their non-C4 counterparts (Table 2). The C4 PPDK-RP gene (Sobic.002G324400) and the C4 NADP-MDH gene (Sobic.007G166300) had an overall θπ that was 161.76% and 79.85% higher than that of their non-C4 isoforms, respectively, whereas the θπ of the C4 PPDK gene (Sobic.009G132900) was 75.16% lower than that of the non-C4 PPDK isoform. Nucleotide diversity of C4 genes in the other gene families was within the range of variation of their non-C4 isoforms.
Genetic diversity across C4 gene families was significantly reduced during sorghum domestication (paired t-test, p-value < 0.05). Averaged across all C4 gene families, genetic diversity was reduced by 22.44% in the domesticated group compared with the wild and weedy group; when only the 9 core C4 genes were considered, the reduction was 22.98%. However, the reduction of genetic diversity during domestication in C4 genes was not significantly different from that in housekeeping genes (Table S2) (t-test, p-value > 0.05). Among the 27 genes, Sobic.003G292400, a non-C4 NADP-ME isoform, exhibited the most severe reduction in genetic diversity, with a reduction of 98.23%. The C4 version of that gene, the NADP-ME gene (Sobic.003G036200), showed the greatest loss of genetic diversity (51.89%) among the C4 genes, with an FST between the cultivated and the wild and weedy groups of 0.06 (Figure 2B). In contrast, another non-C4 isoform of NADP-ME (Sobic.009G069600), a non-C4 isoform of PPCK (Sobic.006G148300), and a non-C4 CA isoform (Sobic.003G234600) showed a more than 2-fold increase in genetic diversity in the cultivated group.
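The percentage reductions and the paired comparison reported above follow from simple arithmetic, sketched below in Python. The function names are hypothetical, and the way per-gene values were averaged in the study (mean of per-gene reductions versus a ratio of group means) is not stated, so this shows only the per-gene calculation and the paired t statistic.

```python
from math import sqrt
from statistics import mean, stdev


def percent_reduction(pi_wild, pi_cult):
    """Percent loss of theta_pi in cultivated relative to wild and weedy.

    Negative values indicate an increase in diversity in cultivated lines.
    """
    return 100.0 * (pi_wild - pi_cult) / pi_wild


def paired_t_statistic(wild, cult):
    """t statistic of a paired t-test on per-gene (wild, cultivated) values."""
    diffs = [w - c for w, c in zip(wild, cult)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```

Under this definition, a gene whose diversity more than doubles in the cultivated group has a percent reduction below −100.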

Identification of Selection Signals during Domestication across the 27 Genes
The selection signature of these C4 gene families was first investigated at the gene level. Based on the thresholds of genome-wide rankings described in Mace et al. (2013), only one gene (Sobic.001G326900, a non-C4 PPDK isoform) was identified as being under balancing selection, which maintains diversity of selected genes, during sorghum domestication, while no gene was identified as being under purifying selection, which reduces diversity of selected genes (Table 1). Subsequently, a higher-resolution detection of selection signatures was conducted at the SNP level using the CDS of the 27 genes. Among 521 SNPs across the 27 CDS, 176 were non-synonymous.
A total of 60 SNPs across 7 genes were identified as being under balancing selection, 7 of which were non-synonymous SNPs distributed across 2 genes (Table S4). The non-C4 PPDK (Sobic.001G326900) had 24 SNPs under balancing selection, including 5 non-synonymous SNPs, and additionally had an overall gene-level signature of balancing selection based on the previous analysis. Two C4 isoforms, PPDK-RP (Sobic.002G324400) and PEPC (Sobic.010G160700), were identified with 3 and 2 SNPs under balancing selection, respectively, although none of these were non-synonymous. Two non-C4 PEPC isoforms (Sobic.003G100600, Sobic.004G106900) were identified with SNPs under balancing selection, with Sobic.003G100600 having 21 such SNPs, including 2 non-synonymous SNPs. The other 2 genes with SNPs under balancing selection were a non-C4 CA isoform, Sobic.002G230100, and a non-C4 PPCK isoform, Sobic.004G219900.

Allelic Variation of Core C4 Genes under Selection in Sorghum
A phylogenetic tree was constructed using the CDS of these 27 genes to depict the genetic relationship of the 48 accessions (Figure S1). The inter- and intra-species distribution of private haplotypes of each gene is detailed in Table S5, with the majority (~90%) of the genes having private inter-species haplotypes from S. propinquum; e.g., 4 unique haplotypes were observed for the C4 isoform of PEPC, with the 2 S. propinquum accessions sharing a single private haplotype. To investigate allelic variation of the 4 core C4 genes with SNPs under selection in sorghum, haplotype networks were constructed using CDS SNPs. Based on 16 SNPs within the CDS of the PPDK gene (Sobic.009G132900), 8 haplotypes were identified. Five haplotypes were identified in the wild and weedy genotypes, 3 of which were private and 2 of which were maintained in cultivated sorghum; two new haplotypes arose in cultivated sorghum after domestication (Figure 3A). Ten haplotypes of one CA gene (Sobic.003G234200) were revealed using 33 SNPs, with 4 distinct haplotypes found in the wild and weedy genotypes. Two of the wild and weedy haplotypes were maintained in cultivated sorghum during domestication, and three new haplotypes arose after domestication (Figure 3B). The loss of wild and weedy haplotypes in cultivated sorghum in these two genes was consistent with the finding that they were under purifying selection.
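The haplotype bookkeeping described above (haplotypes retained from the wild and weedy group, lost in cultivated sorghum, or newly arisen) can be illustrated with a short Python sketch. It is not the ape/pegas workflow used in the study; the function names, group labels, and the representation of a haplotype as a string of CDS SNP alleles are assumptions made for illustration.

```python
from collections import defaultdict


def haplotype_groups(haplotype_of, group_of):
    """Map each distinct haplotype to the set of groups that carry it.

    `haplotype_of` maps accession -> string of alleles at the CDS SNPs;
    `group_of` maps accession -> group label (both hypothetical inputs).
    """
    carriers = defaultdict(set)
    for accession, haplotype in haplotype_of.items():
        carriers[haplotype].add(group_of[accession])
    return dict(carriers)


def summarize(carriers, wild="wild_weedy", cultivated="cultivated"):
    """Count haplotypes by their presence in the wild and cultivated groups."""
    summary = {"total": len(carriers), "maintained": 0,
               "wild_only": 0, "cultivated_only": 0}
    for groups in carriers.values():
        in_wild, in_cult = wild in groups, cultivated in groups
        if in_wild and in_cult:
            summary["maintained"] += 1        # kept through domestication
        elif in_wild:
            summary["wild_only"] += 1         # absent from cultivated lines
        elif in_cult:
            summary["cultivated_only"] += 1   # not observed in wild lines
    return summary
```

Haplotypes carried only by other groups, such as S. propinquum, contribute to the total but to none of the three wild-versus-cultivated categories.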
The PPDK-RP gene (Sobic.002G324400) had 22 SNPs in the CDS, from which 5 haplotypes were identified. Two haplotypes were found in the wild and weedy genotypes, with the main wild haplotype maintained and further diversifying into two new haplotypes in the cultivated group (Figure 3C). Based on 28 SNPs in the CDS of the C4 PEPC gene (Sobic.010G160700), 4 haplotypes were identified. Wild and weedy genotypes encompassed 3 haplotypes, all of which were maintained in cultivated sorghum (Figure 3D). S. propinquum had unique haplotypes across all 4 genes, while the Sorghum bicolor race guinea margaritiferum shared haplotypes with the wild and weedy genotypes in most cases, indicating a closer relationship with the wild and weedy group.
Figure 3 (caption, beginning not recovered): [...] Table S1. Color-coding is as follows: cultivated sorghum (red), wild and weedy genotypes (purple), Sorghum propinquum (blue), and Sorghum guinea margaritiferum (green). The size of the circles in the haplotype networks is proportional to the number of accessions with that haplotype. Branch length represents the genetic distance between two haplotypes.

Discussion
The evolution of C4 photosynthesis has been studied extensively at the cross-species level, with signals of adaptive evolution identified in key genes of the C4 pathway [28,34,48–50]. As the evolution of C4 photosynthesis is driven by environments characterized by low CO2 availability, such as hot and dry environments in which CO2 uptake is limited by stomatal closure, it is likely that within-species adaptive variation also exists. However, to our knowledge, studies of within-species allelic diversity and signatures of selection on key genes in the C4 pathway have not previously been undertaken.
Knowledge of existing natural variation and levels of genetic diversity is a prerequisite for the optimization of C4 photosynthesis. In this study, we performed the first investigation of the genetic diversity of C4 gene families within a C4 species using a collection of 48 sorghum lines. We focused on 9 C4 genes due to their reported key roles in C4 photosynthesis. Our collection represents all major cultivated sorghum races, landraces, and wild progenitors, and captures a good proportion of the genetic diversity within sorghum. Substantial variation in nucleotide diversity was observed among the 8 C4 gene families in sorghum, with the NADP-MDH gene family showing the least diversity and the PPDK gene family showing the greatest. The nine core C4 genes also exhibited varying degrees of genetic diversity, ranging from θπ values of 5.04 × 10−3 and 4.32 × 10−3 in PPDK-RP and rbcS to θπ values of 0.33 × 10−3 and 0.67 × 10−3 in NADP-MDH and NADP-ME. However, despite such low levels of diversity, non-synonymous SNPs were identified in both NADP-MDH and NADP-ME (Table 1). C4 PPDK was the only gene that did not contain a non-synonymous SNP, despite its fairly large size (gene size, 12,748 bp; CDS, 2847 bp), indicating that the function of this gene is highly conserved.
Cultivated sorghum was domesticated more than five thousand years ago in Africa [51–53]. This artificial selection process has morphologically and physiologically reshaped sorghum to better suit human needs, and has also resulted in a substantial genome-wide reduction of genetic diversity in cultivated sorghum compared with wild and weedy types [26,54,55]. In this study, a reduction of genetic diversity during sorghum domestication was also observed in the C4 gene families, indicating that wild sorghum, as a repository of genetic diversity, might harbor alleles useful for improving C4 photosynthesis.
However, the overall reduction in diversity of C4 gene families was not significantly different from the genome-wide average, indicating that these gene families have not been under particularly strong selection pressure. Similarly, none of the 9 core C4 genes showed a domestication signal at the gene level. The absence of large sequence variation at the gene level is also consistent with previous evolutionary studies suggesting that relatively minor changes to pre-existing regulatory networks and the use of pre-existing cis-elements were often sufficient to recruit genes into the C4 pathway [56,57]. The C4 isoform of the NADP-ME gene found in maize and sorghum is one such gene: it has been found to be activated for C4 photosynthesis via subtle changes to its promoter, while the rest of the gene is highly conserved [33]. This is consistent with the low diversity in this gene family observed in our study.
A further high-resolution investigation of domestication signatures at the SNP level revealed 2 C4 genes, PPDK (Sobic.009G132900) and CA (Sobic.003G234200), with SNPs under purifying selection, while 2 other C4 genes, PPDK-RP (Sobic.002G324400) and PEPC (Sobic.010G160700), were identified with SNPs under balancing selection. Previous studies have demonstrated that SNP-level analysis using less stringent criteria is superior to genome-wide ranking for capturing soft selection signals [54,58]. However, the higher sensitivity may come at the cost of a greater chance of false positives, and therefore requires cautious interpretation. Contrasting selection signals on genes from the same pathway within taxa, as found in this study, have also been reported previously in signal transduction pathways [59] and the starch biosynthesis pathway [60].
The C4 isoforms of PPDK and PEPC were also found to show signals of positive selection in a previous cross-species evolutionary study using orthologous groups from closely related C3 and C4 grass species, including sorghum [28]. PPDK and PPDK-RP regulate the regeneration of PEP and as such have a direct effect on CO2 assimilation rate [61], especially under cool temperatures [62,63]. However, it is thought that only minor changes to the enzyme properties of PPDK were sufficient to recruit it into the C4 pathway, and its residues and regions involved in catalysis are highly conserved in C4 species [64]. This may explain why only soft selection signals, detected at the SNP level, were found for the C4 isoform of the PPDK gene in our study.
PEPC is also regarded as a potential limiting step in the assimilation of CO2, and variation in its affinity for CO2/HCO3− among species has been documented [65–67]. CA is also critical to C4 photosynthesis, as it catalyzes the first step of the C4 pathway, converting CO2 to HCO3− [68]. This was shown in the C4 dicot Flaveria bidentis, where antisense plants with <10% of wild-type CA activity required high CO2 for growth and showed reduced CO2 assimilation rates [69,70]. Recent experiments showed that CA and PEPC become more limiting when stomata are partially closed, e.g., under water limitation [71]. The signal of soft purifying selection on PPDK and CA may suggest that the C4 pathway was indirectly improved during sorghum domestication. Even without photosynthetic rate being a direct selection target in breeding programs, a steady increase in leaf photosynthetic rate with year of cultivar release has been shown in other cereals, e.g., in Australian bread wheat [72]. The balancing selection signal on C4 PPDK-RP and PEPC may reflect adaptation to diverse environments, as both PPDK-RP and PEPC are associated with abiotic stress [73,74]. Interestingly, within the PPDK-RP and PPDK gene families, the non-C4 genes all showed selection signals contrasting with those of their C4 counterparts, with both non-C4 PPDK-RP isoforms (Sobic.002G324500, Sobic.002G324700) containing SNPs under purifying selection and the non-C4 PPDK (Sobic.001G326900) containing SNPs under balancing selection.
After domestication, sorghum was introduced from tropical to temperate areas and adapted to divergent local environments. New mutations also arose during this diversification process and played an important role in local adaptation. In the haplotype analysis, haplotypes unique to cultivated sorghum are likely to be young alleles that arose after domestication, while haplotypes unique to the wild progenitor indicate that some haplotypes were lost during sorghum domestication. Nevertheless, the loss of wild haplotypes of C4 genes in cultivated sorghum does not mean that these haplotypes are inferior in terms of photosynthetic efficiency, as photosynthesis was not specifically targeted during sorghum domestication [11]. On the contrary, bringing these wild haplotypes back into breeding programs after evaluating their functions may enrich breeders' toolkits for manipulating photosynthetic efficiency, ultimately contributing to yield improvements. C4 photosynthesis has been well studied over the past 50 years, and key components of this complex pathway have been identified following the advent of transgenic and sequencing technologies [9]. Understanding the genetic diversity of the key enzymes of the C4 pathway is an important step towards mining natural allelic variation for the improvement of photosynthesis. Further investigation of these allelic variants to link them with agronomic traits will provide new targets for sorghum improvement [75].
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/11/7/806/s1. Table S1: List of re-sequenced sorghum accessions and their racial and geographic origins.  Table S5: Inter- and intra-species distribution of private alleles across 27 genes from C4 gene families. Figure