Analysis of codon usage patterns of the chloroplast genome in Delphinium grandiflorum L. reveals a preference for AT-ending codons as a result of major selection constraints

Background Codon usage bias analysis is a suitable strategy for identifying the principal evolutionary driving forces in different organisms. Delphinium grandiflorum L. is a perennial herb with high economic value and typical biological characteristics. Evolutionary analysis of D. grandiflorum can provide a rich resource of genetic information for developing hybridization resources of the genus Delphinium. Methods Synonymous codon usage (SCU) and related indices of 51 coding sequences from the D. grandiflorum chloroplast (cp) genome were calculated using Codon W, Cups of EMBOSS, SPSS and Microsoft Excel. Multivariate statistical analysis combined by principal component analysis (PCA), correspondence analysis (COA), PR2-plot mapping analysis and ENC plot analysis was then conducted to explore the factors affecting the usage of synonymous codons. Results The SCU bias of D. grandiflorum was weak and codons preferred A/T ending. A SCU imbalance between A/T and G/C at the third base position was revealed by PR2-plot mapping analysis. A total of eight codons were identified as the optimal codons. The PCA and COA results indicated that base composition (GC content, GC3 content) and gene expression were important for SCU bias. A majority of genes were distributed below the expected curve from the ENC plot analysis and up the standard curve by neutrality plot analysis. Our results showed that with the exception of notable mutation pressure effects, the majority of genetic evolution in the D. grandiflorum cp genome might be driven by natural selection. Discussions Our results provide a theoretical foundation for elucidating the genetic architecture and mechanisms of D. grandiflorum, and contribute to enriching D. grandiflorum genetic resources.


INTRODUCTION
The codon is crucial in the process of genetic information transmission, and is the most fundamental step in biological activities (Powell & Moriyama, 1997;Chen et al., 2014). The accurate identification of codons encoding different amino acids is key to ensuring the correct expression of genetic information (Morton, Sorhannus & Fox, 2002;Sau et al., 2006). Most of the amino acids (except methionine (Met) and tryptophan (Trp)) are encoded by two to six synonymous codons (Guan et al., 2018). The choices of synonymous codons in different plant genomes are non-random, which is known as synonymous codon usage (SCU) bias (Wright, 1990). SCU bias reflects a mutation-selection balance, which can be affected by mutation pressure, natural selection, and genetic drift in a population (Bulmer, 1991;Eyre-Walker, 1991). Therefore, understanding the SCU bias can reveal the effects of long-term evolution on plant genomes.
The possible evolutionary forces based on codon usage patterns have been investigated in the genomes of numerous organisms. Generally, codon usage biases in microbes are driven by mutation pressure, such as in Xanthophyllomyces dendrorhous and Escherichia coli (Baeza et al., 2015;Boël et al., 2016). For invertebrate animals, codon usage bias is mainly driven by selection constraints, as exemplified in Bemisia tabaci and Hirudinaria manillensis (Sharma, Chakraborty & Uddin, 2014). Additionally, in plant species, codon usage bias seems to prefer a balance of mutation pressure and selection constraints (Zhang et al., 2018a;Zhang et al., 2018b;Zhang et al., 2018c;Liu et al., 2010). In the rice genome, the heterogeneity of codon usage patterns reflects a balance between a directional mutational bias and negative selection (Wang & Hickey, 2007). The codon usage bias in the Porphyra umbilicalis chloroplast (cp) genome is influenced by natural selection, mutation pressure, and nucleotide composition (Li et al., 2019). Moreover, codon usage bias results from Wang et al. (2018) indicated that translation selection has a more dominant role than mutation pressure in four cotton species. These studies indicate that complex evolutionary factors vary in different organisms, and analyzing codon usage bias can provide suitable strategies for identifying the principal driving forces. Delphinium grandiflorum L. (Ranunculaceae, Delphinium), a perennial herb with a blue flower, is mainly distributed in Mongolia, Siberia, and the Northwest of China (Chen et al., 2017). Owing to its high contents of two novel diterpenoid alkaloids, namely, grandiflodines A and B, D. grandiflorum is cultivated as a medicinal plant for toothache treatment and as a native pesticide (Zhang, Li & He, 2012). Furthermore, ovule culture is applied in D. grandiflorum to avoid hybrid embryos from aborting. For example, new interspecific hybrid plants (D. grandiflorum × D. nudicaule, D. grandiflorum × D. cardinal) are successfully selected with the intermediate flower color between the parents (Honda, Tsutsui & Hosokawa, 1999). Thus, D. grandiflorum is of great biological significance, and evolutionary analysis of D. grandiflorum can provide a rich resource of genetic information for developing hybridization resources for the genus Delphinium.
The chloroplast is a photosynthetic organelle in plant cells that plays crucial roles in photosynthesis and metabolite biosynthesis, for example, the synthesis of amino acids, starch, fatty acids, and pigments (Wicke et al., 2011). Compared to the mitochondrial genome and nuclear genome, the complete cp genome, which possesses many characteristics, including a small size, simple and highly conserved structure, single parental inheritance, and haploid nature, is widely applied in species identification, phylogenetic analysis, and adaptive evolutionary analysis (Raubeson et al., 2007). Codon usage in many plant species, such as Hemiptelea davidii, Haberlea rhodopensis, Medicago sativa, and so forth, has been investigated extensively based on the cp genome database (Liu et al., 2020;Ivanova et al., 2017;Tao et al., 2017). The cp genome of D. grandiflorum has been assembled and characterized using Illumina sequencing platform, it was 157,339 bp in length, which contained a pair of inverted repeated regions (52,304 bp), a large single copy region (88,098 bp) and a small single copy region (16,937 bp) (Duan et al., 2020). However, the SCU bias of D. grandiflorum cp genome has not been investigated.
In this study, we analyzed the codon bias and related indices of D. grandiflorum cp DNA, and then used multivariate statistical analysis to determine the general evolutionary driving factors. These results improve our understanding of the genetic architecture of D. grandiflorum, and also contribute to enriching the genetic resources and conservation of D. grandiflorum species.

Sequence data
A total of 117 genes were obtained from the D. grandiflorum cp genome (Genbank accession number: MN556604), and the sequence information is shown in Table S1 ( Duan et al., 2020). After filtering the repeated sequences and genes with sequence length <300 bp using an in-house Python script (Sanner, 1999), ORFfinder (http: //www.geneinfinity.org/sms/sms_orffinder.html) was used to distinguish and filter out non-coding regions of the remaining genes (Guan et al., 2018). Finally, a total of 51 qualified CDSs (complete coding sequence) were retained for subsequent analysis.

Codon usage bias and related indices analysis
A number of the codon usage indicators were estimated via the program codon W version 1.3 (https://sourceforge.net/projects/codonw/), including the relative synonymous codon usage value (RSCU), the effective number of codons (ENC), G + C content of the gene (GC), the frequency of the nucleotides G + C at the 3rd position of synonymous codons (GC 3s ), and the base compositions (A 3s , T 3s , G 3s , and C 3s ) (Zhang et al., 2018a;Zhang et al., 2018b;Zhang et al., 2018c). The RSCU value and ENC value were used together to describe codon usage patterns. The G+C content at the 1st, 2nd, 3rd of codons (GC 1 , GC 2 , GC 3 ) and the average GC content of the 1st and 2nd (GC 12 ) were determined by the Cusp function from EMBOSS (http://imed.med.ucm.es/cgi-bin/emboss.pl?_action=input&_app=cusp).

Identification of the optimal codon
According to the RSCU values, the synonymous codons with the highest frequencies, accompanied by the largest RSCU values, were identified (Yu et al., 2012). Using ENC analysis as a preference standard, the 51 sequences of D. grandiflorum were ordered, and 5% of the dataset with high bias (ENC value was less than 30) and low bias (ENC value was larger than 55) were selected (Cui et al., 2020). The sequences with high bias and low bias were recognized as highly and lowly expressed genes, respectively, as a result of codon bias and were positively correlated with gene expression level (Li et al., 2016). Highly expressed codons, were defined as those codons that occurred significantly more often in highly expressed genes relative to their frequency in lowly expressed genes, which was reflected by RSCU. The RSCU of each codon was calculated following the formula of RSCU = RSCU (high bias) -RSCU (low bias) (Wang et al., 2019). Finally, the optimal codon of the gene was speculated as the codon with both the highest RSCU value and the largest RSCU (Sharp & Li, 1986).

Multivariate statistical analysis
Principal component analysis (PCA) was used as a dimensionality reduction tool to reduce the data complexity in CodonW, with the principal components used to explore the codon usage variation among genes (Greenacre, 1984). PCA was performed on the RSCU values, the data were plotted in a 59-dimensional space of different axes, and the 59-dimensional space was based on the 59 triplet nucleotide codons (ATG encoding Met and TGG encoding Trp were excluded) (Gupta & Ghosh, 2001). Finally, the most prominent axes with important implications for codon usage variation were revealed (Choudhury, Uddin & Chakraborty, 2017). Correspondence analysis (COA) was used to compare two or more categories of variable data, and provide visual results for the major changes in the trends of codon usage and genes (Choudhury, Uddin & Chakraborty, 2017). The relationship between prominent axes and codons, prominent axes and GC content, and prominent axes and genes were visualized in scatter plots.
ENC-plot mapping analysis was employed to analyze and determine the crucial factors influencing the codon usage bias. The ENC plot reflects the relationship of the ENC values against the GC 3S values. The standard curve shows the optimal functional relation between ENC and GC 3s (Gupta, Bhattacharyya & Ghosh, 2004).
Neutrality plot mapping analysis was used to analyze the relationship of the GC 12 values and GC 3 values of all the genes. In the neutral graph, the value of GC 12 was used as a vertical coordinate, and the value of GC 3 was used as the horizontal axis (Wei et al., 2014).

Statistical analysis
Correlation analysis among many important indices was implemented in SPSS 16.0 software (SPSS Inc., Chicago, US) with the Spearman's test (two-tailed). The graphs were depicted in Microsoft EXCEL 2016 (Microsoft Corporation, Redmond, WA, US).

Nucleotide composition between codon positions in the D. grandiflorum cp genome
We identified 51 CDSs longer than 300 bp, and the average length of the CDSs was 1212.6 bp. In general, the four nucleotides were unevenly represented in the 51 CDSs. Thymine (T) was the most represented (31%), adenine (A) was the second-most represented (30%), cytosine (C) and guanine (G) were less represented (18% and 21%, respectively), and the average GC content of the CDSs was 39%. To better evaluate the nucleotide base composition in D. grandiflorum, we summarized the CDS numbers with different GC content levels, and all CDSs contained 30-46% GC content (Fig. 1A). We further divided the GC content range into three parts and analyzed the number of CDSs attributable to each part. The 35-40% part contained the most CDSs (the total number was 28), followed sequentially by the 30-35% and 40-46% intervals.
We also summed the GC content at different codon positions (1st, 2nd, and 3rd) in the CDSs. The composition at the 2nd codon position was similar to that of the overall nucleotide composition. The average GC content and the range between the upper and lower quartiles in the 1st codon position were the highest, and accordingly, the corresponding data in the 3rd codon position were the lowest (Fig. 1B).

The codon usage pattern of the D. grandiflorum cp genome
The amino acids number of 51 genes ranged between 101 and 2,138 with an average of 404. We identified a total of 61 synonymous codons (stop codons were excluded), among which 31 were more frequently represented with an RSCU value ≥ 1 ( Table 1). The codon TTA encoding Leu exhibited the highest RSCU value of 1.88. The above 31 codons with different end bases were divided into three classes, and the number of codons ending with T, A, and G was 16, 12, and 3, respectively, thus suggesting that the genes from the D. grandiflorum cp genome preferred codons with AT-endings, especially those ending with T. Moreover, we focused on the preferred and weak preferred codons, mainly emphasizing codons with extremely high (>1.5) and low RSCUs (<0.5). We found that codons such as ACT, TAT, CAA, and GGA were highly preferred and codons such as CTC, AAC, CTG, and CAG were less preferred in the CDSs. The two distinct patterns deviated from the neutral RSCU value of 1, further indicating that codons preferred an ending with A/T.
The average content of GC, GC 1 , GC 2 , and GC 3 of the CDSs from the D. grandiflorum was calculated ( Table 2). The ENC values of the different genes varied from 37.11 to 61.00, the average of which was 48.12, displaying different trends between the genes. Strong and weak SCU biases are typically distinguished by the ENC value with 35, and all of the ENC values of the genes in this study were greater than 35, suggesting a weak codon bias (Song et al., 2018).
A total of 18 codons with the largest RSCU value based on each amino acid were identified as high frequency synonymous codons (Table 1). Twenty-three codons were identified as highly expressed codons (Table 3). Eight codons with a high frequency as well as high expression, including GCT, GAT, TTT, ATT, AAA, TCT, ACT, and TAT, were identified as optimal codons, of which, seven ended with T and only one ended with A, further confirming that the codons ending with C and G were lacking preference in the D. grandiflorum cp genome.

PCA analysis
The 51 CDSs of the D. grandiflorum cp genome were analyzed using PCA analysis, and were distributed in 50 dimensional axes. The contribution of 40 axes was calculated, and

COA analysis
To determine how the codons ending with different bases were contributing toward codon usage variation in the major axes of Axis 1 and Axis 2, the location of codons ending with different bases was drawn using different color points between Axis 1 and Axis 2 by COA analysis (Fig. 2A). The codons with A/T ends were closer to Axis 1 and were more tightly clustered than the codons with G/C ends, suggesting that the base composition probably affected the SCU bias. In contrast, the genes with lower GC contents (30%-35%) were distributed along the side of Axis 2, and the genes with a relatively lower GC content were more concentrated than the genes with a higher GC content (Fig. 2B), implying that GC content might influence the SCU bias. In addition, considering the positions of different functional gene groups, and following the direction along Axis 1 and Axis 2, we also found that the different groups were distributed discretely, indicating that many other factors (i.e., natural selection) might play a role in SCU bias (Fig. 2C). In order to analyze the relationship of the important indices to the four main axes, correlation analysis was conducted to determine the central factors influencing codon usage bias (Table 4). The GC content showed an extremely negative correlation with Axis 2 (P <0.01), and the GC3s and GC3 contents also exhibited a significant negative correlation with Axis 2 and Axis 4.

PR2-plot mapping analysis
Using PR2 plot mapping analysis, the points in our plot fell among 0.39 to 0.59 on A 3 /(A 3 + T 3 ), and 0.26 to 0.82 G 3 /(G 3 + C 3 ) (Fig. 3A). The genes were clearly distributed unevenly     in the four quadrants centered on 0.5, with most points located under the horizontal centered line of 0.5 (in which the ratio of A 3 /(A 3 + T 3 ) <0.5) and a slightly greater number of points distributed on the right side of the vertical centered line of 0.5 (in which the ratio of G 3 /(G 3 + C 3 ) >0.5). These results indicated that the genes in D. grandiflorum preferred T and G, especially T at the third codon position. Furthermore, we performed PR2 plot analysis of four-codon amino acids, including alanine, glycine, proline, threonine, valine, arginine (CGA, CGU, CGG, and CGC), leucine (CUA, CUU, CUG, and CUC), and serine (UCA, UCU, UCG, and UCC) (Fig. 4). It was clear that PR2 violation was the rule rather than the exception, and the distribution pattern was unique for each of the eight amino acids. The average value of A 3 /(A 3 + T 3 ) and G 3 /(G 3 + C 3 ) from the eight amino acids weighted with codon numbers for each gene was 0.44 and 0.38, respectively, suggesting that the eight amino acids had a preference for T and C when the eight amino acids were combined. Therefore, the balance between A/T and G/C was disrupted in D. grandiflorum.

ENC plot analysis
An ENC plot was used to analyze the codon usage variation of the 51 CDSs in D. grandiflorum (Fig. 3B). Some genes were located on the standard curve toward the lower GC content region, for example, rps3 and rps4 from ribosomal proteins (SSU), ndhI and ndhK from NADH dehydrogenase, and so forth, which definitely originated from the extreme compositional constraints. However, a majority of the points were distributed away from the expected curve and were accompanied by a relatively concentrated distribution, suggesting that these genes should have additional codon usage biases, that are independent of compositional constraints. In addition, correlation analysis of the ENC and GC 3s values showed an extreme positive correlation (r = 0.548, P<0.01), suggesting that the base composition on the third position of the codons might play an important role in determining codon usage patterns.

Neutrality plot analysis
From the neutrality plot, the relationship of GC 12 and GC 3 was analyzed, and the degree of change in natural selection and mutation pressure was estimated (Fig. 3C). The ycf2 and cemA genes were located around the effected curve, while the remaining genes were above the standard curve. Using Pearson's correlation analysis, a weak correlation of all coding genes between GC 12 and GC 3 was found (r = 0.261).

DISCUSSION
The transition of genetic information from mRNA to protein relies on the formation of codons (Chakraborty et al., 2017). The basic characteristic of a genetic code is that an amino acid is often encoded by different codon combinations, known as synonymous codons (Baeza et al., 2015). The uneven usage of synonymous codons with the same amino acid is reflected by SCU bias, and the SCU bias differs among various species and genes (Karumathil et al., 2018). The possible causes of SCU bias have been investigated in the genomes of numerous living organisms, for example, in Zea mays, Arabidopsis thaliana, cotton, and so others (Liu et al., 2010;Wang et al., 2018;Qiu et al., 2011). In this study, 51 CDSs of the D. grandiflorum cp genome were selected to analyze the SCU bias, and the possible factors influencing SCU bias were inferred.

Unique codon usage pattern in the D. grandiflorum cp genome
The RSCU values reflect the codon usage pattern of different genes. The codon lacks bias when the RSCU value is less than 1 (Karumathil et al., 2018). In the D. grandiflorum cp genome, codons with the largest RSCU value based on each amino acid were suggested as high frequency codons. ENC reflects the degree of codon deviation from random selection and is an important index for reflecting the preference degree of the unequal use of synonymous codons (Gupta, Bhattacharyya & Ghosh, 2004). The range of ENC values is from 20 to 61, and the boundary value of ENC is 35. A value less than 35 represents strong codon preference, otherwise weak codon preference will occur (Song et al., 2018). In our study, the average ENC value of the codon genes was 48.12, implying a weak preference for SCU bias.
Our results indicated that the AT/GC nucleotide usage differed among the three positions of the codon, and these differences in base compositions might affect the total SCU bias in the D. grandiflorum cp genome. However, the overall SCU bias that we detected was low, which might be because the majority of codons were used during translation, and extreme SCU bias might only develop under particular conditions (Guan et al., 2018). In addition, we found that the genes from the D. grandiflorum cp genome showed a preference for AT-ending codons, particularly T-ending codons. Eight optimal codons further exhibited the similar patterns, seven of which ended with T, and one of which ended with A. PR2 is a rule of DNA base composition that endows A = T and G = C within a single strand when there is no any preference in mutation pressure and natural selection in both strands of DNA (Sueoka, 2001;Sueoka, 1995;Lobry, 1995). The present results showed that the distribution of genes with different ending bases was asymmetric and exhibited a preference for T-ending codons, and an apparent PR2 violation of the eight amino acids was further detected, thus revealing an SCU imbalance between A/T and G/C at the third base position. Our results were similar to those in other plant species. In the cotton genome, codons ending with T/A are preferred (Wang et al., 2018). A similar pattern was found in the codon usage of Elaeagnus angustifolia and Porphyra umbilicalis (Li et al., 2019;Wang et al., 2019). However, this phenomenon has not been observed in monocot species, for example, Z. mays, Oryza sativa, and Hordium vulgare (Liu et al., 2010;Wang & Hickey, 2007;Kawabe & Miyashita, 2003). The opposing patterns of codon ending bases might reflect the differences in differentiation between monocot and dicot plant species (Camiolo, Melito & Porceddu, 2015).

Base composition affects the SCU bias of the D. grandiflorum cp genome
PCA analysis is usually used to analyze genes located in a 59-dimensional space and relies on the RSCU values. PCA can extract considerable variations and concentrate them together, thus helping to determine the major factors influencing SCU bias (Wei et al., 2014). In the present study, four main axes reflecting variation were determined, and the major indices versus the four axes were analyzed by correlation analysis. The codons with A/T endings plotted on Axis 1 and Axis 2 and showed a more tightly clustered distribution, indicating that this base composition could explain the variation in codon use. The significant correlations of GC content, GC 3s and GC 3 content against the Axis 2 suggested that the base compositions as GC contents of the total and the third position of codons were valuable for SCU bias in the D. grandiflorum cp genome. However, Axis 1 and Axis 2 only explained 19.67% amount of the variation, and it appeared that the base composition had at most a partial influence on codon usage.

Natural selection plays a major role in the SCU bias of D. grandiflorum cp genome
Synonymous codons are uneven by their nature, the mutations of which often occur at the 3rd base of a codon (Comeron & Aguadé, 1998). If there is no external pressure, as in the case of random mutation or mutation pressure in a certain direction, there should be no change in the three different positions of each codon and the base content should be similar (Guan et al., 2018). Thus, the preference for AT ends caused by directional substitution implied that evolutionary factors of SCU bias from D. grandiflorum cp genome were indeed existed. Generally, mutation pressure acts on nucleotide composition bias through shuffling A/T and G/C pairs, selection constraints lead to codon bias through maximizing protein production efficiency in high expressed genes (Guan et al., 2018). In our study, A/T and G/C at the third base position were asymmetric by PR2 analysis, and the significant correlations of GC content, GC 3s and GC 3 content against the Axis 2 were found, which of them indicated that mutation pressure of base composition influenced SCU bias in the D. grandiflorum cp genome. However, Axis 2 only explained 8.96% amount of the variation, thus mutation pressure was not the determining factor shaping codon usage, other factors as well as natural selection might be more important than mutation pressure. ENC plot analysis and neutrality plot analysis are commonly combined to explore the two major evolutionary factors influencing codon usage in plant species (Wang & Hickey, 2007;Raubeson et al., 2007;Li et al., 2016). In order to determine whether natural selection was the main driving force affecting codon usage bias in the D. grandiflorum cp genome, we performed ENC plot analysis and neutrality plot analysis. ENC plot analysis is an important indicator that reflects the relationship of the two different indices (ENC value and GC 3s ), thus detecting the SCU variation among the genes (Wright, 1990). Wright concluded that the distribution comparison of genes and the standard curve could be indicative of some other factors, with the exception of mutation pressure. If the codon usage of a particular gene is under no selection, it should fall on the expected curve. In our study, it was observed that a few genes were positioned on the curve, which likely originated from the extreme mutation pressure. However, a majority of the points were lying well below the expected curve. This result suggested that a majority of genes in the D. grandiflorum cp genome had other SCU biases that were independent of mutation pressure, for example, natural selection. This hypothesis was largely supported by the neutrality plot mapping analysis. Neutrality plot analysis can effectively compare the effects of natural selection and mutation on codon usage bias (Sueoka, 1988). The low correlation between GC 12 and GC 3 , that is, the smaller regression coefficient of approximately 0, showed that the base composition of the three positions differ, and the GC content of the cp genome is highly conserved, indicating that natural selection was the most important determinant of codon usage patterns. Conversely it shows that codon usage patterns are evidently reliant on mutation pressure (Zhang et al., 2018a;Zhang et al., 2018b;Zhang et al., 2018c). In the neutral graph, no correlation was found between GC 3 and GC 12 , indicating a strong difference and that natural selection should be crucial for SCU bias in the D. grandiflorum cp genome. However, the signatures of selection constraints (positive, neutral, and negative) in D. grandiflorum cp genome could not be inferred for the lack of a reference sequence that is unaffected by selection, which need to be further detected in the following work.

CONCLUSIONS
This study systematically analyzed the codon usage pattern in the D. grandiflorum cp genome, and the factors affecting SCU bias were comprehensively explored. The SCU bias in the D. grandiflorum cp genome is weak, preferring A/T ending bases. Excepting the notable mutation pressure effects, the majority of genetic evolution in the D. grandiflorum cp genome may be driven by natural selection. These results are the first to provide a clear set of SCU patterns and explore the possible evolutionary forces acting on the D. grandiflorum cp genome.