Introduction

Coronaviruses (CoVs), which belong to the family Coronaviridae, are enveloped viruses and contain the largest RNA genomes, with some reaching almost 30,000 nucleotides (Dimmock et al., 2002). They primarily infect the upper respiratory and gastrointestinal tracts of animals. Severe acute respiratory syndrome coronavirus (SARS-CoV), a newly emerged group 2 CoV, spread rapidly from Asia to North America and Europe with a high degree of transmissibility and mortality (Lew et al., 2003; Riley et al., 2003; Friman et al., 2008). In response to the SARS pandemic in 2003, many scientists have become interested in vaccine development against SARS-CoV. A phase I human study of a SARS DNA vaccine was reported by Martin and colleagues, showing immunogenicity with spike proteins of SARS-CoV in all subjects and neutralizing antibody responses in 8 of 10 subjects (Yang et al., 2005; Martin et al., 2008).

DNA vaccination represents a new strategy against highly pathogenic and infectious diseases (Ramakrishna et al., 2004; Martin et al., 2006, 2007; Wang et al., 2006a, 2006b; Catanzaro et al., 2007), and a DNA vaccine is usually produced and delivered in three successive steps. First, primers specific to the target regions of the viral genome are produced to generate cDNA fragments. Second, these cDNA fragments are inserted into a bacterial DNA vaccine plasmid, such as an Escherichia coli plasmid. Lastly, the prepared DNA vaccine is injected into the cells of the target organism, such as mouse, rabbit or human subjects, to produce one or more specific proteins by mimicking viral replication and protein production in the host. Because these proteins are recognized as foreign antigens in the target organism, they trigger immune responses (Sin and Weiner, 2000; Donnelly et al., 2003). Phase I clinical trials of DNA vaccines against West Nile virus, Ebola virus and human immunodeficiency virus type 1 (HIV-1) in healthy adults have already been performed (Martin et al., 2006, 2007; Catanzaro et al., 2007). According to Wang et al. (2006a, 2006b) and Ramakrishna et al. (2004), codon optimization of the Tat and envelope genes of HIV-1 as well as the hemagglutinin genes of influenza A virus resulted in better antigen expression and immunogenicity in model animals such as mouse and rabbit. In those studies, each target gene of HIV-1 and influenza A virus was changed to the preferred codons of the overall mammalian system to promote better expression of each encoded protein.

Synonymous codons encode the same amino acid in protein synthesis, and they are not used randomly, with some codons being used more frequently than others (Moriyama and Hartl, 1993; McInerney, 1998; Duret, 2002; Lynn et al., 2002; Kawabe and Miyashita, 2003; Singer and Hickey, 2003). Codon usage bias was found to mirror tRNA abundance in early studies using Bacillus subtilis and Caenorhabditis elegans genes (Shields and Sharp, 1987; Stenico et al., 1994). In prokaryotes, such as thermophilic bacteria, highly expressed genes shift their codon usage toward a more restricted set of preferred synonymous codons compared to less highly expressed genes within the genome (Lynn et al., 2002; Singer and Hickey, 2003). As for viral genomes, Gu et al. (2004) reported that the relative synonymous codon usage (RSCU) values of the order Nidovirales, including SARS-CoV, are virus-specific, and that translational selection and gene length may not affect the codon usage pattern in some viruses. However, Jenkins and Holmes (2003), who analyzed the extent of codon usage bias in the complete genomic coding regions of 50 genetically and ecologically diverse human RNA viruses using the effective number of codons (ENC) as a parameter, showed that the overall extent of codon usage bias was low and that there was little variation in bias between genes. More recently, Shackelton et al. (2006) reported a striking difference in CpG content between DNA viruses with large and small genomes: the majority of large-genome viruses showed the expected frequency of CpG, while most small-genome viruses had CpG contents far below the expected values. They suggested that these differences might be due mainly to differences in the viral replication and repair mechanisms, such as the use of cellular or viral replicative machinery.
In our previous studies, synonymous codon usage patterns among RNA viruses such as influenza A viruses and HIV-1s were analyzed after dividing the sequences into region, subtype, host or year-of-occurrence groups, with the expectation that there might be some correlations between the nucleotide patterns and the direction of viral variation on a codon basis (Ahn and Son, 2006, 2007; Ahn et al., 2006). Furthermore, van Hemert et al. (2007) reported that the recent evolution of astroviruses was associated with a switch in nucleotide composition and codon usage among non-human mammalian versus human/avian astroviruses. They suggested that evolutionary events within a virus family might be driven by forces operating at the level of synonymous substitutions, such as nucleotide composition, translational selection, and codon usage.

In this study, we hypothesized that the codon usage bias of viral genes might tend to mimic that of specific host genes that play a key role during the initial immune responses in their host species. C-type lectins, a superfamily of proteins containing C-type lectin domains (CTLDs), are a large group of extracellular Metazoan proteins with diverse functions (Zelensky and Gready, 2005). They usually provide Ca2+-dependent sugar-recognition activity and initiate various kinds of biological processes, such as adhesion, endocytosis, and pathogen neutralization (Drickamer and Dodd, 1999; Dodd and Drickamer, 2001). With respect to immune responses, C-type lectins are also known to perform an important function in dendritic cell (DC) immune regulation, which includes the triggering of inflammatory cytokines as well as the delivery of antigens to T cells to initiate the specific immune response (Cella et al., 1997, 1999). C-type lectin receptors in DCs have been determined to act as capture or attachment factors for influenza A virus (H5N1 subtype) or HIV-1 (Lambert et al., 2008; Wang et al., 2008), and SARS-CoV infection is also known to induce immune responses related to DC functions, such as a delayed activation of alpha interferon (Spiegel et al., 2006). In this study, we compared the synonymous codon usage patterns of the Coronavirus genus with those of the CTLD genes of human (Homo sapiens) and mouse (Mus musculus) to investigate the possible relationships between microbes and their host species on a codon basis.

Results

Principal component analysis using the % GC contents on the 1st, 2nd and 3rd codon positions

The first two principal factors of the % GC contents on each codon position from the nucleocapsid and spike genes of CoVs as well as the CTLDs of human and mouse were investigated using principal component analysis (Figure 1). The eigenvectors of each principal factor (PRIN1 and PRIN2) and the eigenvalue proportions (%) are presented in Table 1. Among the CoV genes, the first two principal components of the nucleocapsid genes accounted for 57.6% and 34.2% of the total variance of the data set, whereas those of the spike genes accounted for 60.8% and 27.8%, respectively (Figure 1A and 1B). The eigenvector compositions of those two genes, however, showed different patterns. The % GC content on the third codon position (GC3rd) among nucleocapsid genes showed a highly positive correlation (0.978) along the PRIN2 axis, whereas PRIN2 of the spike genes was strongly dependent on GC2nd (0.899) (Table 1). The former pattern also appeared when the two genes from CoVs and the CTLDs from human and mouse were analyzed together (Figure 1D, Table 1), whereas the CTLD genes alone revealed eigenvector patterns very similar to those of the spike genes of CoVs (Figure 1C, Table 1). The eigenvectors of PRIN1 in all cases showed positive correlations with GC1st and GC2nd, indicating that PRIN1 was mainly dependent on the non-synonymous codon usage patterns of each gene.

Figure 1

Principal component analysis of the % GC contents on the 1st, 2nd and 3rd codon positions. The first two factors from the principal component analysis (PRIN1 and PRIN2) are presented with each eigenvalue proportion. Nucleocapsid (A) and spike (B) coding genes of the Coronavirus genus were compared with CTLD genes of human (Homo sapiens) and mouse (Mus musculus) species (C, D). Family names of CTLDs are also presented with each plot. G1, Group 1 CoV; G2, Group 2 CoV; G3, Group 3 CoV; N, nucleocapsid gene of CoV; S, spike gene of CoV; homo, Homo sapiens; mus, Mus musculus.

Table 1 Eigenvectors and eigenvalues of the principal component analysis using the % GC contents on each codon position

In Figure 1, we categorized the nucleocapsid and spike genes by CoV group (groups 1, 2 and 3), presented as G1, G2 and G3, respectively. The nucleocapsid genes of both human and bat SARS-CoVs were located close to the G3 CoVs, such as infectious bronchitis virus, along PRIN1, but human SARS-CoVs displayed patterns similar to those of the G2 CoVs, such as bovine, murine and rat CoVs, along PRIN2 (Figure 1A). On the other hand, the spike genes of SARS-CoVs were located near the G2 murine hepatitis virus CoVs as well as G3 human CoV 229E (Figure 1B). Human CoV (HKU1), a G2 CoV, was located apart from the other G2 CoVs for both the nucleocapsid and spike genes. As for the CTLD genes, mouse and human genes spread broadly across the biplots and were not clearly separated from each other (Figure 1C). Along the PRIN1 axis, families 4Ms, 10As, 16A and 14A of the human CTLDs as well as 3B, 4F and 4G of the mouse CTLDs were positively biased, and those genes also showed % GC contents different from those of the CoV genes in Figure 1D.

Phylogenetic relationships among CTLD genes of human and mouse species

To compare the phylogenetic relationships among human and mouse CTLDs with the synonymous codon usage patterns, we constructed a phylogenetic tree using the Neighbor-Joining method with 1000 bootstrap replicates. All the genes grouped well into their CTLD families, and the human 7As, 1A, 14A, 2B and 16A CTLDs were located separately from the other human CTLD genes, showing closer relationships with mouse genes (Figure 2A). Among the mouse genes, the family 4A2, 4N, 4D and 4G CTLDs showed close relationships with human families 4As, 4Cs, and 4Ms. Family 10As of the human genes was located apart from the other families. With respect to the chromosomes from which each gene was transcribed, the mouse CTLDs were encoded on chromosomes 6, 8, 9 and 12, and the human genes on chromosomes 12, 14, 16, 17 and 19.

Figure 2

The results of the phylogenetic analysis using the CTLD genes of human (Homo sapiens) and mouse (Mus musculus) species (A), and the scatter plots of the correspondence analysis using the relative synonymous codon usage values of the nucleocapsid and spike genes of CoVs as well as the CTLDs of human and mouse (B). The phylogram was derived by the Neighbor-Joining method with bootstrap analysis of 1000 iterations, and bootstrap values (%) that are not 100% are shown as circled numbers at each node. The chromosome source of each CTLD is also presented in the right column of the tree. G1, Group 1 CoV; G2, Group 2 CoV; G3, Group 3 CoV; N, nucleocapsid gene of CoV; S, spike gene of CoV; CLEC, C-type lectin domain gene.

Synonymous codon usage analysis using the CA method

To investigate the synonymous codon usage patterns, we first parsed each nucleotide sequence into synonymous codon groups and then calculated the RSCU values for each sequence. After that, we assigned each kind of gene or species as rows, and the RSCU values of 59 codons as columns, for the CA. All the target sequences of CoVs, human and mouse species were analyzed together to compare the overall synonymous codon usage patterns (Figure 2B). First, the nucleocapsid and spike genes of CoVs showed opposite patterns along the first dimensional factor (Dim1) of the CA plots, and the human and mouse CTLDs were located on the same side as the nucleocapsid genes. Second, we performed linear regression analysis to identify which codon usage parameters most affect Dim1 and Dim2 of the CA result (Table 2). Dim1 showed significant correlations with all the codon usage parameters, namely GC1st, GC2nd, GC3rd and ENC. Among the % GC contents, Dim1 was strongly dependent on GC1st and GC2nd, with R² values of 0.781 and 0.671, respectively. As for Dim2, however, only GC3rd showed a positive correlation (R² = 0.500) among all the % GC contents.

Table 2 The results of the regression analysis between each dimensional factor of the correspondence analysis using RSCU values of CoVs and each codon pattern parameter. a, first dimensional factor of the correspondence analysis; b, second dimensional factor of the correspondence analysis; c, R² value of each linear regression analysis; d, parameter estimate resulting from the linear regression analysis; e, effective number of codons. *P < 0.0001

In Figure 2B, we also present the enlarged region of the CTLDs of both human and mouse species with the name of each family, member and transcript variant, if one exists (Figure 2B right). The CTLD genes that were located within or near the CA plots of CoVs were clustered as 'group 1', and they are shown in underlined italic characters in the phylogenetic tree (Figure 2A). Interestingly, the group 1 CTLDs of human were derived from chromosome 12, whereas those of mouse were from chromosomes 6 and 12.

The CTLDs were originally divided into seven groups based on their domain architecture, and seven new groups were added in the revised classification of 2002 (Drickamer, 1993; Drickamer and Fadden, 2002; Zelensky and Gready, 2005). Among the CTLD genes, human clec4C_1 and mouse clec14A showed very close relationships with the nucleocapsid genes of SARS-CoVs, whereas human clec4A_4, 14A and 10A_2 as well as mouse clec4A2, 4G and 4F were located far from the SARS-CoVs on both Dim1 and Dim2.

Comparison of RSCU values between SARS-CoVs and the most similar CTLD genes of human and mouse

In the CA result, human clec4C_1 and mouse clec14A showed very close relationships with the nucleocapsid genes of SARS-CoV (Figure 2B). We therefore compared the RSCU profiles of each gene to analyze the different patterns in more detail (Figure 3). The nucleocapsid and spike genes of human and bat SARS-CoVs (Figure 3A and 3B) were compared with two types of CTLD genes: human clec4C_1 and mouse clec14A, which were included in 'group 1' in Figure 2B, and human clec10A_2 and mouse clec4F, which were not included in group 1 (Figure 3C and 3D). First, the nucleocapsid genes of human and bat SARS-CoVs did not use the two synonymous codons for cysteine (CYS) or ACG for threonine (THR), and showed patterns similar to those of the spike genes in the alanine (ALA), asparagine (ASN), glutamine (GLN), glycine (GLY), proline (PRO), serine (SER) and threonine encoding codon groups (Figure 3A and 3B). In the phenylalanine (PHE) codons and the first three leucine (LEU) codons, however, the nucleocapsid and spike genes showed somewhat opposite patterns, and the spike genes showed more biased patterns in CCU and UCU, which encode proline and serine, respectively. Second, the human and mouse CTLD genes in Figure 3C and 3D shared RSCU profiles with SARS-CoVs in the alanine and proline encoding codon groups, but showed patterns different from SARS-CoVs in the glycine and phenylalanine codon groups. Human clec4C_1 and mouse clec14A showed RSCU profiles similar to those of the spike genes in the arginine and serine coding groups, and human clec4C_1 alone showed the same patterns as the nucleocapsid genes in the isoleucine (ILE) and serine encoding groups. Human clec10A_2 and mouse clec4F showed more biased patterns in leucine and serine encoding codons than the group 1 CLECs (Figure 3D).

Figure 3

The profiles of the relative synonymous codon usage are shown as vertical bar graphs. The nucleocapsid (A) and spike (B) genes of human and bat SARS-CoVs, as well as the human and mouse CTLD genes that were located near the nucleocapsid genes of SARS-CoVs (C) or far from those of SARS-CoVs (D) in the correspondence analysis in Figure 2B, are presented. GenBank accession numbers are presented in the legends. G1, Group 1 CoV; G2, Group 2 CoV; G3, Group 3 CoV; N, nucleocapsid gene of CoV; S, spike gene of CoV; CLEC, C-type lectin domain gene.

Discussion

Codon usage bias has been studied in various organisms ranging from viruses to eukaryotes, and codon usage in target viral genes has also been optimized to improve the efficacy of DNA vaccines (Ramakrishna et al., 2004; Shackelton et al., 2006; van Hemert et al., 2007; Wang et al., 2006a, 2006b). In our previous studies, synonymous codon usage among RNA viruses such as influenza A viruses and HIV-1s revealed specific biases for each region, subtype, host or year-of-occurrence group, suggesting that there might be some correlations between the codon usage patterns and viral variation on a codon basis (Ahn and Son, 2006, 2007; Ahn et al., 2006).

In this paper, we examined whether the % GC contents on the first (GC1st), second (GC2nd), and third (GC3rd) codon positions showed similar patterns among the same genes of viral species as well as the CTLDs of human and mouse (Figure 1). Of the two CoV genes, the nucleocapsid genes showed a highly positive eigenvector for GC3rd (0.978) along PRIN2, and this pattern was also observed when we compared all the target genes from CoVs, human and mouse together using principal component analysis (Table 1). Traditionally, the spike protein is known to define viral tropism through its receptor specificity and its membrane fusion activity during virus entry into cells, so it has been the major target of neutralizing antibodies in vaccine development (Gallagher and Buchmeier, 2001). Recently, however, the nucleocapsid has also been studied as a new viral target protein in the vaccine industry because of its good immunogenicity (Bode et al., 2003; Ye et al., 2007). In hepatitis C virus (HCV), the nucleocapsid protein is known to play an important role in immune evasion, including the inhibition of IFN-α-induced tyrosine phosphorylation and activation of STAT1 in hepatic cells (Bode et al., 2003). The nucleocapsid protein of SARS-CoV itself has become a potential candidate for DNA vaccine production because it plays a critical role in the viral infection process (Zhu et al., 2004; Zhao et al., 2007; Mark et al., 2008; Schulze et al., 2008). Ye et al. (2007) reported that the nucleocapsid gene of mouse hepatitis virus A59, a group 2 CoV, circumvented the effects of type I interferon. Furthermore, Okada and colleagues reported that mice vaccinated with the nucleocapsid protein of SARS-CoV showed T-cell immune responses, and Gao and colleagues reported that a SARS DNA vaccine encoding the nucleocapsid protein generated IFN-γ-producing T cells in rhesus monkeys (Gao et al., 2003; Okada et al., 2005).
Our findings demonstrated that GC3rd of the nucleocapsid genes showed a highly positive relationship with PRIN2 among CoV species, whereas the spike genes did not show any specific pattern related to GC3rd. This result implies that the nucleocapsid genes of CoVs might be more heavily affected than the spike genes by synonymous codon usage bias, which is usually determined by the nucleotide on the third codon position (Figure 1A and 1B).

To compare the synonymous codon usage patterns among the two genes of CoVs and the CTLDs of human and mouse in more detail, we calculated the RSCU values of all the target genes from CoVs, human and mouse, and then analyzed the distances among them using the CA (Figure 2B). As a result, the nucleocapsid genes of SARS-CoVs from both human and bat showed the most biased patterns (0.292) among CoVs along Dim2, which was significantly correlated with the GC3rd (R² = 0.50, P < 0.0001) of those genes in the linear regression test (Table 2), and the CTLDs of both human and mouse were broadly distributed in the first quadrant (Figure 2B). Interestingly, the group 1 CTLDs of human were derived from chromosome 12, and those of mouse were from chromosomes 6 and 12, whereas the other CTLDs were from chromosome 14, 16, 17 or 19 for human, and 6, 8 or 9 for mouse. Our finding suggests that there might be specific genomic regions or chromosomes that show a synonymous codon usage pattern more similar to that of antigenic viral genes. Recently, DNA vaccines have become an increasingly important part of vaccine development against many infectious viruses (Martin et al., 2006, 2007; Catanzaro et al., 2007), and the codon-optimization method, which switches the synonymous codons of viruses to those of their host organisms, has been reported to improve the immunogenicity of HIV-1 and influenza A virus (Ramakrishna et al., 2004; Wang et al., 2006a, 2006b). At present, the preferred codons of the overall mammalian system are used in the codon-optimization process, but we observed various synonymous codon biases even among the CTLD genes of both human and mouse species. Although those differences might be due to the chromosomal region from which each gene was transcribed, or to other factors, it is clear that the preferred codons of host organisms are more varied than we had thought.
Among the CTLDs of the human and mouse hosts, the group 1 genes were commonly transcribed from chromosome 6 or 12.

On the other hand, human CoV (HKU1), which is included in the group 2 CoVs, showed the most distinct synonymous codon usage biases in both % GC contents and RSCU patterns (Figures 1 and 2), which agrees with the results of Woo et al. (2005). They suggested that this might be because human CoV (HKU1) originated from a major recombination event and numerous minor recombination events among group 2 CoVs. In this study, the nucleocapsid genes of human CoV (HKU1) were found on the opposite side from the other group 2 CoVs along PRIN2 (Figure 1A), which showed a strong relationship with GC3rd in Table 1, and they also revealed RSCU patterns opposite to those of the other group 2 CoVs on Dim1 in Figure 2B.

In Figure 3, we compared the RSCU profiles of both the nucleocapsid and spike genes of human SARS-CoVs (AY290752, AY627048) with those of other genes: the nucleocapsid genes of bat SARS-CoVs (DQ071614, DQ0412043), the group 1 CLECs of human (NM_130441) and mouse (NM_025809), which were located closest to human SARS-CoV, and other CLECs of human (NM_006344) and mouse (NM_016751), which showed patterns distinct from those of CoVs. As a result, the nucleocapsid genes of both human and bat SARS-CoVs did not use the two synonymous codons for cysteine or ACG for threonine at all (Figure 3A), whereas other CoVs used them (data not shown). Among SARS-CoVs, the RSCU profiles differed somewhat between the nucleocapsid and spike genes, especially in the phenylalanine and leucine encoding codons, and the spike genes showed more biased patterns in U-ending codons such as CCU (RSCU = 2.34) and UCU (RSCU = 2.67) for proline and serine, respectively. In general, the RSCU value would be 1.00 if there were no codon usage bias. Human clec4C_1 and mouse clec14A showed profiles very similar to those of the spike genes, especially those of bat SARS-CoV, in the arginine coding group, with high RSCU values over 2.50 for AGA. Human clec10A_2 and mouse clec4F showed more biased patterns than the group 1 CLECs in GC-ending codons such as CUC, CUG and UCC for leucine and serine (Figure 3D).

In conclusion, our study demonstrated that the nucleocapsid genes of CoVs might be more heavily affected by synonymous codon usage than the spike genes, and that the CTLDs of human and mouse partially overlapped with the nucleocapsid genes of CoVs. Furthermore, we showed that the group 1 CTLDs of human were commonly derived from chromosome 12, and those of mouse from chromosomes 6 and 12. This suggests that there might be specific genomic regions or chromosomes that show a synonymous codon usage pattern more similar to that of viral genes. We also found similar results between CoV genes and other human or mouse genes in our preliminary work (data not shown). Our findings might be helpful for developing codon-optimization methods for DNA vaccines, and further study is necessary to determine whether there is a specific correlation between the codon usage patterns of coding sequences and the chromosomal locations from which they are transcribed in higher organisms.

Methods

Nucleotide sequences

The nucleocapsid (251 sequences) and spike (284 sequences) genes of the Coronavirus genus, including SARS-CoVs, were collected from the NCBI Taxonomy Browser (www.ncbi.nlm.nih.gov/Taxonomy/) in GenBank format. All the GenBank flat files were then parsed into categories such as accession number, species name, gene name and sequence length using Java code, in order to construct a local database that facilitated the subsequent computational work. For the human and mouse species, we collected the coding sequences of human (Homo sapiens) and mouse (Mus musculus) from the genome section of NCBI's FTP site (ftp.ncbi.nih.gov/genomes/), and likewise parsed them and constructed a local database. The gene names, abbreviations, sequence lengths and GenBank accession numbers of all CTLD genes used in this study are shown in Supplemental Data Table S1. Abnormal sequences, that is, those containing characters other than A, G, C or U or those whose length was not divisible by three (the length of a codon unit), were removed. The MySQL database management system was used to construct all the local databases on a Linux operating system.
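
The screening rule for abnormal sequences can be illustrated with a short sketch. This is not the Java code used in the study; it is a minimal Python illustration under stated assumptions. The function names and the (accession, sequence) record layout are hypothetical, and the conversion of T to U is an assumption made because GenBank records are commonly written in the DNA alphabet.

```python
VALID_BASES = set("AGCU")


def is_valid_coding_sequence(seq):
    """Return True if seq contains only A/G/C/U and its length is a multiple of three."""
    # Assumption: T is treated as U so that DNA-alphabet records are not rejected.
    seq = seq.upper().replace("T", "U")
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False
    return set(seq) <= VALID_BASES


def filter_sequences(records):
    """Keep only the (accession, sequence) pairs that pass the screening rule."""
    return [(acc, seq) for acc, seq in records if is_valid_coding_sequence(seq)]


# Hypothetical records, for illustration only.
example = [("seq1", "AUGGCUUAA"), ("seq2", "AUGGC"), ("seq3", "AUGNCUUAA")]
kept = filter_sequences(example)
```

In this example only the first record survives: the second has a length that is not divisible by three, and the third contains an unknown character.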

Principal component analysis of % guanine-cytosine contents data

Principal component analysis was performed using the % guanine-cytosine (GC) contents of the first (GC1st), second (GC2nd) and third (GC3rd) position of each codon, which were calculated for the nucleocapsid and spike coding genes of the Coronavirus genus as well as the CTLD genes of human and mouse species. All the screened target sequences were first extracted from our local database, and each coding sequence was then parsed into codon units. From the pool of codon units for each sequence, we calculated the % GC contents on the first, second and third codon positions. Java was used in all calculation processes, and the SAS 9.1 statistical program (Cary, 2004) was used for the principal component analysis.
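
As an illustration of the position-specific % GC calculation, the following Python sketch computes GC1st, GC2nd and GC3rd for one coding sequence. It is not the authors' Java implementation; the function name is hypothetical, and the sequence is assumed to have already passed the screening step (length divisible by three).

```python
def gc_by_codon_position(seq):
    """Return (GC1st, GC2nd, GC3rd) as percentages for a coding sequence."""
    seq = seq.upper().replace("T", "U")
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("sequence length must be a non-zero multiple of three")
    percentages = []
    for position in range(3):
        # Every third base, starting at the given codon position.
        bases = seq[position::3]
        gc_count = sum(1 for base in bases if base in "GC")
        percentages.append(100.0 * gc_count / len(bases))
    return tuple(percentages)


# Two codons, GGC and GCA: both first and second positions are G or C,
# and one of the two third positions is.
gc1, gc2, gc3 = gc_by_codon_position("GGCGCA")
```

The three values per sequence would then form one row of the input matrix for the principal component analysis.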

Phylogenetic analysis

Twenty-nine sequences of the CTLD genes from human and mouse species were used for multiple sequence alignment with the ClustalW ver. 1.83 program (Thompson et al., 1997) using default parameters, which set the DNA weight matrix to the IUB matrix and the gap opening and gap extension penalties to 15.0 and 6.66, respectively. The Neighbor-Joining method with 1000 bootstrap replicates was performed using the PAUP* ver. 4.0b program (Swofford, 1999).

Correspondence analysis

The correspondence analysis (CA) method was used to compare the RSCU values for the 59 codons using the SAS 9.1 statistical program (Cary, 2004). The RSCU value is the number of times that a particular codon is observed relative to the number of times that the codon would be observed in the absence of any codon usage bias. If there is no codon usage bias, the RSCU value is 1.00. The RSCU was calculated as

$$\mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}}$$

where X_ij is the frequency of occurrence of the jth codon for the ith amino acid, and n_i is the number of synonymous codons for the ith amino acid. Each gene is represented as a 59-dimensional vector, which excludes the start and stop codons and UGG, the codon for tryptophan, which has no synonyms (Sharp and Li, 1986). We assigned each kind of gene or species as rows, and the RSCU values of the 59 codons as columns, in the input data set for CA. The biplot graph from a CA includes the best two-dimensional representation of the data, along with the coordinates of the plotted points and a measure of the amount of information retained in each dimension. CA uses the chi-square statistic to standardize the frequency values, so the distance between two coordinates of the same type (two rows or two columns) indicates the chi-square distance (Hair et al., 1998). If this distance is large enough to be statistically meaningful, the coordinates in the output plot will be located far from the origin along the corresponding row or column direction, and they usually lie on opposite sides of a coordinate axis. The distance between two rows or between two columns represents the Euclidean distance (Gu et al., 2004; Perrière and Thioulouse, 2002), but the distance between a row coordinate and a column coordinate has no meaning.
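
The RSCU definition above (the observed count of a codon divided by the mean count over its synonymous codons) can be sketched in Python as follows. This is an illustration, not the authors' Java code. The names are hypothetical, the standard genetic code is assumed, and assigning 0.0 to all codons of an amino acid that does not occur in the sequence is an assumption of this sketch rather than something stated in the text.

```python
from collections import Counter, defaultdict

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Synonymous groups, excluding stop codons (*), AUG (M) and UGG (W): 59 codons.
SYNONYMS = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    if amino_acid not in "*MW":
        SYNONYMS[amino_acid].append(codon)


def rscu(seq):
    """Return a dict mapping each of the 59 codons to its RSCU value."""
    seq = seq.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for codons in SYNONYMS.values():
        total = sum(counts[c] for c in codons)
        n_i = len(codons)
        for c in codons:
            # Assumption: an unused amino acid gets 0.0 for all of its codons.
            values[c] = counts[c] * n_i / total if total else 0.0
    return values


# Four alanine codons: GCU three times and GCC once.
profile = rscu("GCUGCUGCUGCC")
```

For this toy sequence the mean count over the four alanine codons is 1, so GCU has an RSCU of 3.0, GCC of 1.0, and the unused GCA and GCG of 0.0. One such 59-value vector per gene forms a row of the CA input table.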

Other statistical analysis

Linear regression analysis was conducted to determine the correlations between the first two dimensional factors (Dim1 and Dim2) of the CA results and the % GC contents on each codon position, the effective number of codons (ENC) and the average hydrophilicities of the encoded proteins. ENC values are often used to measure the magnitude of codon bias; they range from 20, when only one codon is used for each amino acid, to 61, when all synonymous codons are used with equal frequency (Wright, 1990). We calculated the ENC value of each nucleotide sequence using Java code, and all these analyses were performed using the SAS 9.1 statistical program (Cary, 2004).
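
The ENC calculation can be sketched as below. This is a simplified Python illustration based on the general form of Wright's (1990) measure, not the authors' Java code. Several choices here are assumptions of this sketch: amino acids observed fewer than two times, or with a non-positive homozygosity estimate, are skipped; a missing isoleucine class is replaced by the mean of the two-fold and four-fold classes; and the result is capped at 61. All names are hypothetical.

```python
from collections import Counter, defaultdict

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SYNONYMS = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    if amino_acid not in "*MW":
        SYNONYMS[amino_acid].append(codon)


def homozygosity(counts, codons):
    """F = (n * sum(p^2) - 1) / (n - 1) for one amino acid, or None if unusable."""
    n = sum(counts[c] for c in codons)
    if n < 2:
        return None
    sum_sq = sum((counts[c] / n) ** 2 for c in codons)
    f = (n * sum_sq - 1.0) / (n - 1.0)
    return f if f > 0 else None  # assumption: skip non-positive estimates


def effective_number_of_codons(seq):
    """Return the ENC of a coding sequence (20 = extreme bias, 61 = no bias)."""
    seq = seq.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    by_class = defaultdict(list)
    for codons in SYNONYMS.values():
        f = homozygosity(counts, codons)
        if f is not None:
            by_class[len(codons)].append(f)
    mean_f = {k: sum(v) / len(v) for k, v in by_class.items()}
    # Assumption: estimate the three-fold class from its neighbours if absent.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2.0
    missing = [k for k in (2, 3, 4, 6) if k not in mean_f]
    if missing:
        raise ValueError("too few codons in degeneracy classes %s" % missing)
    # 9 two-fold, 1 three-fold, 5 four-fold and 3 six-fold amino acids, plus Met and Trp.
    enc = 2.0 + 9.0 / mean_f[2] + 1.0 / mean_f[3] + 5.0 / mean_f[4] + 3.0 / mean_f[6]
    return min(enc, 61.0)


# Extreme bias: each amino acid uses a single codon, observed twice.
biased = "".join(codons[0] * 2 for codons in SYNONYMS.values())
# No bias: every one of the 59 codons is used ten times.
uniform = "".join(c * 10 for codons in SYNONYMS.values() for c in codons)
```

With these two toy inputs the function returns the two ends of the range described above: 20 for the fully biased sequence and 61 for the uniform one.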