Comparative analysis of codon usage patterns in chloroplast genomes of five Miscanthus species and related species

Miscanthus is not only a perennial fiber biomass crop, but also valuable breeding resource for its low-nutrient requirements, photosynthetic efficiency and strong adaptability to environment. In the present study, the codon usage patterns of five different Miscanthus plants and other two related species were systematically analyzed. The results indicated that the cp genomes of the seven representative species were preference to A/T bases and A/T-ending codons. In addition, 21 common high-frequency codons and 4–11 optimal codons were detected in the seven chloroplast genomes. The results of ENc-plot, PR2-plot and neutrality analysis revealed the codon usage patterns of the seven chloroplast genomes are influenced by multiple factors, in which nature selection is the main influencing factor. Comparative analysis of the codon usage frequencies between the seven representative species and four model organisms suggested that Arabidopsis thaliana, Populus trichocarpa and Saccharomyces cerevisiae could be considered as preferential appropriate exogenous expression receptors. These results might not only provide important reference information for evolutionary analysis, but also shed light on the way to improve the expression efficiency of exogenous gene in transgenic research based on codon optimization.


INTRODUCTION
Miscanthus Andersson (Poaceae) is a C4 photosynthetic plant, which have been widely investigated as a potential second-generation bio-energy crop (Barling et al., 2013). The genus Miscanthus includes approximately 20 species, which could be classified into Miscanthus clades and Triarrhena clades (Ge et al., 2017). China is the biological diversity center of Miscanthus species, of which Miscanthus lutarioriparius L.Liou (M. lutarioriparius), Miscanthus sinensis Andersson (M. sinensis), Miscanthus sacchariflorus (Maxim.) Nakai (M. sacchariflorus) and Miscanthus floridulus (Lab.) Warb. ex Schum. et Laut (M. floridulus) are the four most widely distributed species. In addition to being bioenergy plants, Miscanthus species possess extensive breeding values due to their extremely advantageous agricultural characteristics, such as high photosynthetic efficiency, cold tolerance and an extensive ability to adapt to environmental change (Vermerris, 2008). Currently, the research focus of Miscanthus species is to utilize it as a promising genetic resource (Clark et al., 2015;Zhang et al., 2013). The more diverse genetic resources available, the more likely it is that scientific research will be able to comprehend the adaptation, evolution and utilization of these significant economic crops.
Chloroplasts (cp) are key plastids involved in multifunctional processes of the plant cell, including photosynthesis, carbon fixation, starch storage, nitrogen metabolism, fatty acid and nucleic acid synthesis (Jarvis & López-Juez, 2013;Nielsen et al., 2016). Typically, cp genomes (cpDNAs) possess a small size, conserved gene content and large copy numbers, which have been extensively used as valuable source for evolution analysis and plastid engineering (Amiryousefi, Hyvönen & Poczai, 2018;Ravi et al., 2008;Yan et al., 2019). The lack of genomic resources of the Miscanthus species has hindered the adequate understanding of their diversity traits (Chae et al., 2014;Sheng et al., 2016) The low expression efficiency of an exogenous gene may limit the research progress on the functional studies of Miscanthus and their related species (Wu et al., 2021). Stemming from the rapid development of cp engineering, plasmid DNA have been transferred into the cpDNA of a variety of plants, such as Nicotiana tabacum, Manihot esculenta Crantz and Eruca sativa Mill (Havaux, Lütz & Grimm, 2003;Khodakovskaya et al., 2006;Kwak et al., 2019). Recently, the cpDNA of some Miscanthus species have been made available in the National Center for Biotechnology Information (NCBI) database (Sheng et al., 2021). These complete cpDNA sequences of Miscanthus species can be used for studying population genetics, evolution analysis and plastid engineering (Amiryousefi, Hyvönen & Poczai, 2018;Yan et al., 2019).
Codon usage bias refers to the variations on the usage frequencies of synonymous codons (Plotkin & Kudla, 2011). The pattern of codon usage could be caused by multifactors during the process of genome and gene evolution, including natural selection, compositional mutation mode, translational selection, gene length, tRNA abundance and mRNA secondary structure (Liu et al., 2004;Pop et al., 2014;Quax, Claassens & Söll, 2015;Tuller et al., 2010). The studies of codon preference can not only reveal the evolutionary rules between genes in a species or related species, but also improve the expression efficiency of exogenous sequences in transgenic research by codon optimization. Recently, the applicability of codon optimization in the cpDNA have been determined for many vascular plants, including Poaceae (Gramineae) (Zhang et al., 2012), Cinnamomum camphorn (L.) presl (Camphor tree) (Chen et al., 2017), Fragaria ×ananassa Duch (strawberry) (Cheng et al., 2017) and Solanum tuberosum L. (potato) (Zhang et al., 2018). However, the codon usage pattern of cpDNA in Miscanthus and related species has not been fully elucidated.
In this study, the codon usage patterns of the cpDNA of seven Miscanthus are most closely related to Sorghum bicolor (L.) Moench and Saccharum L (Sheng et al., 2021;Sheng et al., 2017). In addition, the codon usage bias of these seven species was compared with the other four model species including Populus trichocarpa Torr & Gray, Escherichia coli, Arabidopsis thaliana (L.) Heynh and Saccharomyces cerevisiae.
Miscanthus are not only potential bio-energy crops, but also forms an excellent breeding resources. So, it is of interest to understand the codon usage of Miscanthus for better utilization of Miscanthus and related resources as germplasm resources. Here, we revealed the codon usage patterns of the Miscanthus and related species and determined optimal codons for cpDNA genetic engineering. The results in the current study will not only provide insight into genetic evolution studies, but also provide a reference for selecting appropriate heterologous expression hosts to improve the gene expression of Miscanthus plants by optimizing codon. (2) the number of bases in each CDS must be the fold of three (3) the length of the sequence of CDS should be ≥300 bp (Wright, 1990;Zhang et al. 2007). After filtration, the CDS number, the base composition at the first/second/third site of codons (GC1/GC2/GC3) and average GC, as well as the total amino acids encoded by each CDS were calculated.

Analysis of relative synonymous codon usage (RSCU) and relative synonymous codon usage frequency (RFSC)
Relative synonymous codon usage value of a codon (the number of codon occurrences in a gene divided by the number of codon appearances expected under the same codon usage) is the ratio of its actual frequency of utilization to the expected usage frequency without bias. The RSCU was calculated as Eq. (1): where x ij represents the frequency of codon j encoding the i th amino acid, and n i represents the number of synonymous codon encoding the i th amino acid (Sharp & Li, 1986). If the RSCU value of a codon is equal to 1, the codon is used without bias, whereas a RSCU value greater than 1 reflects a significant codon usage bias (Sharp et al., 1993). The RFSC value refers to the proportion of the actually observed number of a codon in the number of all synonymous codons. The RFSC were calculated using Eq. (2): where x ij represents the frequency of codon j encoding for the I th amino acid. The highfrequency codon was screened based on the results of RFSC in all codon. The screening principles were as follows: the RFSC > 60% of one codon; or the RFSC of a codon exceeds the average frequency of synonymous codon by 0.5 times (Zhou et al., 2007).

Determination of optimal codons
The effective number of codons (ENC) can be applied to describe the extent of deviation of codon usage from the random selection, which reflects the degree of unbalanced use of synonymous codon in genes. The ENC value range from 20 (each amino acid uses only one synonymous codon) to 61 (Each synonymous codon is equally used), which is inversely proportional to the codon bias (Wright, 1990). The ENC value in each species was calculated by CodonW software and then 10% of the CDS with remarkably high and low expression levels were filtered out according to the ENC value. The RSCU of each codon was obtained from the sequence files of the high and low groups according to the cusp function of emboss (https://www.bioinformatics.nl/emboss-explorer/). Optimal codons were determined by RSCU method. Specifically, the average RSCU values of the two amino acid groups were computed and subtract subsequently ( RSCU). The codon will be identified as the optimal codon through comparing the high and low group of the same codon RSCU (>0.08) and RSCU value (high group > 1, low group < 1) (Romero, Zavala & Musto, 2000).

Comparative analysis of codon usage frequency
The ratio of codon usage frequency is one indicator of codon usage bias among species.
When the ratio is ≥2 or ≤0.5, it suggests that the codon bias difference between the two organisms is significant, whereas other values outside of this range represent a lack of significance (Pan et al., 2013).

Analysis of ENC-plot
The proportion of G and C content at the third position of a codon to the total number of gene bases are defined as GC3s. ENC-plot is plotted with ENC values as ordinate and GC3 value as abscissa, which can be used to analyze the codon usage characteristics of each gene and to explore the relevance between gene base component and codon preference (Wright, 1990). ENC values are located on or near the expected curve, when mutation pressure plays a key role in the formation of codon usage patterns. Conversely, when the use of a codon is constrained by natural selection, the ENC value will be well below the prospective curve (Wright, 1990).

PR2-plot analysis
In G3/(G3+C3) as the abscisic and A3/(A3+T3) as the ordinate graphic mapping (PR2plot), which is performed to explore the composition of the four bases at the third nucleotide position of each codon (Sueoka, 1999;Sueoka, 1995). The pattern of splashes around the central spot (A = T, C = G) indicate the extent and orientation of the base offset.

Analysis of Neutrality plot
Neutrality analysis is used to exploring the degree of impact between natural selection and mutation pressure on the mode of codon usage (Sueoka, 1988). GC12 indicates the mean GC content at the first and second nucleotide positions of the codon, while GC3 represents the GC content of the third site. GC content at the third nucleotide position of a codon was counted post eliminating the Codon Met (ATG) and Trp (TGG). GC3 was counted post the eliminating of the three stop codons (TAA, TAG and TGA) and three codons (ATT, ATC and ATA) of Ile (Sueoka, 1988). Both the GC12 and GC3 values of the seven cpDNAs were counted by Python scripts (https://github.com/shexuan/codon_analysis). If the gradient of the curve regression is 0, indicating that there is no impact of mutation pressure. Gradient 1 represents complete neutrality, which describes that codon usage preference is completely influenced by mutation pressure (Sueoka, 1988).

Correspondence analysis of codon usage
The variations of codon usage in the seven analyzed cpDNAs were investigated based on the correspondence analyses (COA) using CodonW (Anue et al., 2019). The usage patterns of 59 codons (exincluding Met, Trp and three termination codons) were compared and all genes can be embeded into a 59-dimension hyperspace, in which each dimension corresponds to the synonymous codon usage of the gene (Xiang et al., 2015). Therefore, the major trends (Axis 1) of these axes in the 59-dimensional hyperspace can be used to determine the maximum fraction of genetic variation, indicating the major sources of codon usage variation. In addition, according to the results of COA, the correlation index between Axis1 and codon usage exponent, including the GC content of codons, GC3s, codon adaptation index (CAI) and the total numbers of amino acids in the encoded polypeptide (L_aa) were computed by python scripts package (https://github.com/shexuan/codon_analysis). CAI value is widely applied to assess gene expression levels, ranging from 0 to 1. Specifically, the larger the CAI value is, the stronger the codon usage preference is, and vice versa (Sharp & Li, 1986).

Analysis of codon base composition
The  (Table 1). It was found that the contents of GC at all three sites and the average GC content (GC123) were all less than 0.5, which indicated the seven analyzed cpDNAs were prone to use A/T bases and A/T-ending codons (Table 1) (Table 1). Furthermore, the distribution trend of GC content was GC1 > GC2 > GC3, indicating that GC was not evenly distributed in the three positions of a codon of the seven assessed species. In summary, the codon usages of GC content in these seven cpDNAs were similar and were biased towards A/T bases.

RSCU and RFSC
The cp genomes of the seven Miscanthus and related species have 30 common codons (RSCU > 1) with 28 codons ending with the nucleotides A/T (93.3%) (Table S1) (Table S1). In addition, the maximum and the minimum RSCU values belonged to TTA and CTG which encode Leu, indicating the vitally positive bias. Furthermore, the pattern of codon usage were summarized in the seven Miscanthus and related species (Fig. 1). Specifically, the high-frequency codons of seven Miscanthus and related species possess strong common base and share a total of 21 high-frequency codons (Table S1). Furthemore, Saccharum spontaneum and Sorghum bicolor were determined to possess two more high-frequency codons, specifically CGT and TAA, than the other five Miscanthus species.

Determination of optimal codons
The ENC values of each CDS were ranked and 10% of genes from both ends were selected to establish high and low expression gene banks respectively. The RSCU values and RSCU values in the two expression libraries were calculated and are listed in Table S2. According to the values of RSCU, the optimal codons in the seven assessed species were determined as follows (Table 2).

Codon usage frequency
The codon usage frequencies of the seven cpDNAs were compared with four model species including Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana and Populus trichocarpa (Table S3). The Results of our analyses indicated that there is little divergence in the codon usage frequencies among the seven assessed plant species with Saccharomyces cerevisiae, Arabidopsis thaliana and Populus trichocarpa, have 9 to11 (accounting for 14.1%-17.2% of total codons), 13 to 14 (20.3%-21.9%), 12 to 13 (18.8%-20.3%) different codons, respectively (Table S3). However, the codon usage frequencies of the seven  species with Escherichia coli were comparatively higher (27 different codons). The results indicated that the codon frequency difference between Miscanthus species and Arabidopsis thaliana, Populus trichocarpa and Saccharomyces cerevisiae was the lowest, while was the largest with Escherichia coli. Based on above results, it was optimal to select Saccharomyces cerevisiae, Arabidopsis thaliana and Populus trichocarpa as heterologous expression hosts for Miscanthus and related species. Furthermore, the results indicated that TAG is a different termination codon in usage frequency when comparing the seven assessed plants with the four model species (Table S3).

ENC-plot
The ENC and GC3s of the seven analyzed cpDNAs were plotted. It can be seen from Fig. 2 that the ENC values of most genes were lower than expected values and lie below the standard curve. The results of ENC-plot analysis suggested that codon usage preference of the seven cpDNAs is mainly influenced by natural selection and other factors, while mutation pressure was determined to play only a minor role.

PR2-plot
PR2-plot is an efficient method to indicate the influence of mutation pressure by investigating the composition of A, T, C and G at the third nucleotide position of a codon. Our results revealed that the AT-bias is 0.  (Fig. 3). Therefore, T/G-bias was observed in all seven assessed species. When considered together, codon usage bias of A/T and G/C in the seven cp genomes was unbalanced, indicating that the base composition of the seven analyzed cpDNAs is not only influenced by mutation pressure, but also by natural selection.

Neutrality plot
The distribution range of GC12 (the mean GC content at the first and second nucleotide positions of a codon) and GC3 is relatively concentrated, in which the range of GC12 is = −0.014, r16 = −0.020, r17 = 0.063, r18 = −0.032, r19 = −0.032), which suggested mutation pressure only contributes a minor role in the codon usage preference. In addition, the regression coefficient (slope of neutrality plot) was 0.006 to 0.198, indicating that the correlation between GC12 and GC3 is not significant, and the composition of the first two bases may be different from the third base of the codon. These results demonstrated that the codon usage patterns of cp coding sequences in the seven species are mainly affected by natural selection.

Correspondence analysis (COA)
Correspondence analysis is used to explore the variations of codon usage among the analyzed cpDNAs. In the current study, RSCU-based COA was used to compare the usage patterns of 59 codons, which produced a series of orthogonal axes, reflecting the trend of change of codon usage in the seven Miscanhus and related plant species cpDNAs. The first four axes accounted for 36.2%, 38.2%, 38.5%, 38.3%, 36.8%, 40.7% and 40.9% of the overall changes, while the first axis proportion to 14.3%, 15.9%, 15.8%, 15.8%, 14.6%, 15.6% and 9.7% of the total variation in seven species respectively (Table 3). Axis 1, responsible for ∼10% of total variation, was the main source of variation, indicating that the codon usage was influenced by multiple factors. In addition, the relationship between axis 1 and axis 2 was visualized to explore the effects of GC content on codon usage bias (Fig. 5). Genes with different GC content are plotted as different colors, red with GC% < 45% and blue with 45% ≤ GC%< 60% (Fig. 5). In order to determine the factors leading to gene dispersion along axis 1 and axis 2, the correlation index were computed on axis 1 with CAI, CBI, Fop, GC3, GC and L_aa (Table 3). As can be seen from the results in Table 3, axis 1 for M. floridulus, M. sacchariflorus, M. sinensis, M. transmorrisonensis, Saccharum spontaneum and Sorghum bicolor possessed a remarkable correlation with GC3s (p ≤ 0.01), which indicated the base composition stemming from mutation pressure was the main factor impacting codon usage preference.

DISCUSSION
The study compared the codon usage patterns of the five Miscanthus species and two related species. The findings reported here will help to improve our understanding of evolution analysis and the optimization of codon components suitable for gene expression. During the evolutionary processes, specific codon usage patterns were obtained to adapt to the diverse factors including origin, evolution, natural selection and mutation pressure. In addition, analyzing the source of variation in genomic codon usage, the pattern of codon bias and the codon frequency could provide insights into optimization of the codons of heterologous genes and selection of appropriate heterologous expression hosts. Therefore, the research will be of great significance to the study of genetic engineering and genetic evolution. Our analysis of base composition of codons revealed that the CDSs of the seven Miscanthus and related cpDNAs analyzed tended to use an A/T codon, which was consistent with the results of Zhang et al. (2012) on the 23 Poaceae cpDNAs that this previous study analyzed. According to a previous study, the GC3 values of dicotyledonous plants is often less than 50% (codon use prefers A/T), which is different from monocotyledonous plants with high GC3 values (GC3 values >50%, showing that codon use prefers G/C) (Murray, Lotzer & Eberle, 1989). The results of RSCU value analysis showed an A/T codon usage bias in the cpDNAs of the seven analyzed species, which was consistent with the patterns in most higher plants (Shang et al., 2011). According to neutral evolution theory, the effects of mutation pressure and natural selection on the variation of the third base of codon are neutral or nearly neutral (Sharp et al., 1993). The study of Kawabe & Miyashita (2003) showed that when codon use is affected by natural selection, GC3 values tend to be distributed in a small range and there is no significant correlation between GC12    and GC3 (Kawabe & Miyashita, 2003). The neutrality plot in this study revealed a weak correlation between GC12 and GC3 and the composition of the first two bases were different from the third base of the codon, which demonstrated that the codon usage patterns of the seven cpDNAs analyzed are mainly influenced by natural selection. This result is consistent with the codon usage of cpDNAs of many species, such as Oryza sativa (Liu, Tan & Xue, 2003), Zea mays (maize) (Liu et al., 2010), Triticum aestivum (Liu & Xue, 2005) and Euphorbiaceae (Wang et al., 2020). In addition, combining the results of ENC-plot, PR2-plot and COA analysis suggested that the codon usage bias of the seven cpDNAs were affected by multiple factors, including mutation pressure, base composition and gene length, which the dominant influencing factor was natural selection. This results was consistent with the analysis in Poaceae (Zhang et al., 2012), Populus alba (Zhou et al., 2008) and Euphorbiaceae (Wang et al., 2020) . The cpDNAs of the seven Miscanthus and related plants assessed here are highly conserved and share a total of 21 high-frequency codons. In addition, 4 to 11 codons were determined to be the optimal codons in each species, while no common optimal codon was defined in the seven representative species. These results of high frequency codons and optimal codons are not only beneficial for codon optimization, but also promote further understanding of the relationship between gene expression and codon usage preference. In higher plants, the main obstacle to applying cp transformation to more species and especially, to other important crops is the limitation of available tissue culture systems and regeneration protocols (Ruf et al., 2001). Considering the differences of codon usage bias between the seven cpDNAs analyzed in this study and heterologous expression hosts, codon usage frequencies were analyzed to select the suitable host system. Based on the results, it was suggested to select Saccharomyces cerevisiae, Arabidopsis thaliana and Populus trichocarpa as heterologous expression hosts for Miscanthus and related crops, which possesse a little difference in codon usage frequency with the seven plants.
This study conducted a comprehensive comparative analysis on codon usage patterns at the cpDNA-wide level of seven Miscanthus and related species. These results will improve our understanding on evolution analysis, the selection of appropriate heterologous expression hosts and the optimization of codon components suitable for gene expression, finally provide a theoretical basis for building a stable and efficient gene expression system in Miscanthus or other crops.

CONCLUSIONS
The codon usage patterns of cp genomes of the five Miscanthus and two related species were compared and systematically analyzed for the first time. The results of codon usage bias and RSCU analysis indicated that the seven representative species prefer the use of A/T bases and A/T-ending codons. In addition, 21 common high-frequency codons and four to11 optimal codons were elevated in their frequency in the seven cpDNAs. Furthermore, the analysis of codon usage frequencies between the seven representative species and four model organisms suggested that Arabidopsis thaliana, Populus trichocarpa and Saccharomyces cerevisiae could be considered as appropriate heterologous expression hosts for Miscanthus and related species. Finally, when considered together, the results of ENC-plot, PR2-plot and neutrality analysis revealed that the codon usage pattern of the seven cpDNAs analyzed here were influenced by multiple factors, which the dominant influencing factor was natural selection. These results presented in this study potentially provide not only important reference information for evolutionary analysis, but also potentially provide additional important insights for the improvement in the expression efficiency of exogenously introduced genic sequences in transgenic research by codon optimization.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was financially supported by Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 19KJB180023) and Postdoctoral Science Foundation of China (2019M662719). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.