
Synonymous codon usage biases are associated with various biological factors, such as gene expression level, gene length, gene translation initiation signal, protein amino acid composition, protein structure, tRNA abundance, mutation frequency and patterns, and GC composition. Quantification of codon usage bias helps in understanding the evolution of living organisms. A pipeline for codon usage bias analyses within and across genomes is therefore in demand. Here we present the CodonO webserver, a user-friendly tool for codon usage bias analyses across and within genomes in real time. The webserver is available at http://www.sysbiology.org/CodonO.


INTRODUCTION
Within the standard genetic code, all amino acids except Met and Trp are coded by more than one codon; these are called synonymous codons. DNA sequence data from diverse organisms clearly show that synonymous codons for any amino acid are not used with equal frequency, and these biases are a consequence of natural selection during evolution. Extensive studies have shown that synonymous codon usage biases are associated with various biological factors, such as gene expression level, gene length, gene translation initiation signal, protein amino acid composition, protein structure, tRNA abundance, mutation frequency and patterns, and GC composition (1-11). Quantification of codon usage bias, especially at the genomic scale, helps in understanding the evolution of living organisms.
Many different approaches have been developed in the past few decades. These methods may be grouped into two categories: (i) methods based on the statistical distribution, such as the codon-usage preference bias measure (CPS) based on χ² (12) and scaled χ² analyses (13); (ii) methods using a group of gene sequences as reference, which can be 'optimal codons' [e.g. codon bias index (14)], a defined set of highly expressed genes [e.g. codon preference statistics (15) and codon adaptation index (16)], a defined gene class [e.g. Codon Bias (7)], or all genes in the entire genome [e.g. the Shannon Information Method (17)]. Most existing computational approaches are only suitable for comparing codon usage bias within a single genome. To overcome this limitation, we developed a new informatics method based on Shannon information theory, referred to as synonymous codon usage order (SCUO), which enables measurement of synonymous codon usage bias within and across genomes (3,12). A review and comparison of SCUO and currently available methods are detailed in Wan et al. (18). Several computational software packages or webservers, for instance, CodonW (http://bioweb.pasteur.fr/seqanal/interfaces/codonw.html) and JCAT (19), have been developed to measure the Codon Adaptation Index (CAI) for genes. JCAT also integrates intrinsic terminators and enzyme digestion sites into its analyses.
Codon usage analyses within and across genomes will facilitate the understanding of evolution and environmental adaptation of living organisms. GC composition has been shown to drive codon and amino-acid usage and thus to affect codon usage bias (20). It is therefore critical to study the correlation between GC composition and codon usage bias. Previously, we developed an analytical model to quantify synonymous codon usage bias by GC composition based on SCUO (11). However, it is still laborious to perform codon usage analyses within and across genomes, and to our knowledge no available tool is designed for these purposes. The CodonO webserver described here is a pipeline for codon usage bias analyses within and across genomic sequences as well as a tool for studying the correlation between codon usage bias and GC composition, especially for microbial species. Unlike the standalone CodonO we developed earlier (10,11,18), the CodonO webserver has the following additional functions: (i) besides allowing users to compare their own submissions, it connects to a genomic database and performs analyses in real time; (ii) it can be used to study the correlation between SCUO and GC composition; (iii) it performs statistical comparison of SCUO within and across genomes; (iv) besides SCUO values, it extracts and displays the codon usage frequency table as well as the gene attributes for each gene from the genomic database; and (v) it provides a user-friendly interface.

Synonymous codon usage order measurement
The CodonO webserver employs the synonymous codon usage order (SCUO) measurement to calculate synonymous codon usage biases. The details of the SCUO concept and method have been described previously (10,11,18). Briefly, we calculate the entropy of the i-th amino acid in a sequence,
$$H_i = -\sum_{j=1}^{n_i} p_{ij} \log_2 p_{ij},$$
where $p_{ij}$ is the normalized frequency of the j-th synonymous codon for the i-th amino acid, $n_i$ is the number of synonymous codons for the i-th amino acid, and j is the codon for the i-th amino acid, $1 \le j \le 6$ for leucine, $1 \le j \le 2$ for tyrosine, etc. If the synonymous codons for the i-th amino acid were used at random, one would expect a uniform distribution of them as representatives for the i-th amino acid. Thus, the maximum entropy for the i-th amino acid in each sequence is
$$H_i^{\max} = -\log_2 \frac{1}{n_i} = \log_2 n_i.$$
Thus, we can calculate SCUO for the i-th amino acid in each sequence:
$$O_i = \frac{H_i^{\max} - H_i}{H_i^{\max}}.$$
Then the average SCUO for each sequence can be represented as
$$\mathrm{SCUO} = \sum_{i} F_i O_i$$
to summarize the SCUO from each amino acid, where $F_i$ is the composition ratio of the i-th amino acid, i.e. the fraction of all counted codons in the sequence that code for the i-th amino acid.
The SCUO represents the synonymous codon usage bias for the entire sequence. Thus, $0 \le \mathrm{SCUO} \le 1$, and a larger SCUO denotes a higher codon usage bias in the sequence.
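As an illustration of the measurement described above, the following is a minimal Python sketch, not the CodonO implementation. It assumes the standard genetic code, base-2 logarithms, and an average weighted by the share of codons belonging to each amino acid; it skips Met, Trp and stop codons, and ignores amino acids absent from the sequence. The function name `scuo` and the table-building helpers are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group sense codons by amino acid (stop codons are excluded).
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        SYNONYMS[aa].append(codon)


def scuo(cds):
    """Return the SCUO of one coding sequence (0 = no bias, 1 = maximal bias)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3          # drop a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    per_amino_acid = []                        # (codon count, O_i) pairs
    for aa, codons in SYNONYMS.items():
        n_i = len(codons)
        if n_i < 2:                            # Met and Trp have no synonyms
            continue
        x = [counts[c] for c in codons]
        total = sum(x)
        if total == 0:                         # amino acid absent (assumption: skip)
            continue
        h = -sum((v / total) * math.log2(v / total) for v in x if v > 0)
        h_max = math.log2(n_i)
        per_amino_acid.append((total, (h_max - h) / h_max))

    grand_total = sum(t for t, _ in per_amino_acid)
    if grand_total == 0:
        return 0.0
    # Weight each amino acid's bias by its share of the counted codons.
    return sum(t / grand_total * o for t, o in per_amino_acid)
```

For example, a sequence that always uses the same leucine codon gives the maximum value, while one that uses all six leucine codons equally often gives a value of zero.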

Statistical methods
The CodonO webserver can perform codon usage bias analyses within genomes using Tukey statistical analysis (21) and across genomes using the Wilcoxon Two Sample Test (22). Tukey statistical analysis is a simple and powerful method for identifying outliers in a population, whether or not the population is normally distributed. We adapted the percentile calculation from the JMP method (SAS, Inc., Cary, NC, USA). For the q-th percentile, the rank is
$$R = \frac{q}{100}(n + 1),$$
where n is the number of data points; IR is the integer part of R while FR is the fractional part of R. Then, the q-th percentile is
$$P_q = x_{IR} + FR\,(x_{IR+1} - x_{IR}),$$
where $x_k$ is the k-th smallest SCUO value. The Tukey outliers are genes with SCUO values less than $Q1 - 1.5\,\mathrm{IQR}$ or greater than $Q3 + 1.5\,\mathrm{IQR}$, where IQR represents the interquartile range, i.e. the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of SCUO.
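The outlier rule can be sketched in a few lines of Python. This is an illustrative sketch rather than the server's code: it assumes the percentile is obtained by linear interpolation between the order statistics at rank (q/100)(n + 1), and it clamps ranks that fall outside the data to the smallest or largest value, which is an assumption of this sketch. The function names are hypothetical.

```python
import math


def rank_percentile(sorted_values, q):
    """q-th percentile using rank R = q/100 * (n + 1) with linear interpolation."""
    n = len(sorted_values)
    r = q / 100 * (n + 1)
    ir = math.floor(r)            # integer part of R
    fr = r - ir                   # fractional part of R
    if ir < 1:                    # rank below the first point (assumption: clamp)
        return sorted_values[0]
    if ir >= n:                   # rank at or above the last point (assumption: clamp)
        return sorted_values[-1]
    lower = sorted_values[ir - 1]  # x_IR, using 1-based ranks
    upper = sorted_values[ir]      # x_(IR+1)
    return lower + fr * (upper - lower)


def tukey_outliers(values):
    """Return (lower fence, upper fence, outliers) for a list of SCUO values."""
    ordered = sorted(values)
    q1 = rank_percentile(ordered, 25)
    q3 = rank_percentile(ordered, 75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return low, high, [v for v in values if v < low or v > high]
```

With the values 1 to 7 plus a single value of 100, the quartiles are 2.25 and 6.75, the fences are -4.5 and 13.5, and only the value 100 is flagged.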
The Wilcoxon Two Sample Test (22) is used to test the null hypothesis that the distributions of SCUO from two groups of sequences (e.g. genomes) are the same. It remains a sensitive test for comparing two groups even when their values are not normally distributed.
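A minimal sketch of a two-sample Wilcoxon rank-sum comparison is given below for illustration. It is not the server's implementation: it assumes the large-sample normal approximation with average ranks for ties, a tie-corrected variance, no continuity correction, and a two-sided P-value. The function names are hypothetical.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1            # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_rank_sum(a, b):
    """Return (W, z, two-sided p) comparing samples a and b."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = list(a) + list(b)
    ranks = average_ranks(pooled)
    w = sum(ranks[:n1])                        # rank sum of the first sample
    mean_w = n1 * (n + 1) / 2
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    var_w = n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1)))
    if var_w == 0:                             # all values identical
        return w, 0.0, 1.0
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided normal tail probability
    return w, z, p
```

Applied to the SCUO values of two genomes, a P-value below the chosen threshold (0.05 in this article) would indicate that the two SCUO populations differ. For small samples an exact test would be preferable to the normal approximation used here.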

Features
As shown in Figure 1, the CodonO server is directly connected to the GenBank genomic database and is updated from it daily. The user can define and select one or multiple genomes for analysis at the same time. Users can upload their own datasets as well. The underlying computations include synonymous codon usage order (SCUO) and GC composition measurements, and the latter include GC, GC1s, GC2s and GC3s, where GC is the overall GC composition, GC1s is the GC composition at the first site of a codon, GC2s is the GC composition at the second site of a codon, and GC3s is the GC composition at the third site of a codon. The results are plotted in a two-dimensional graph, in which users can visualize and compare the results. The webserver can display the results for multiple genomes in the same plot, so that users can analyse the two-dimensional differences (GC/GC1s/GC2s/GC3s versus SCUO) between genes within and across genomes (Figure 2A) (11). Generally, a very low or very high GC composition is associated with a large codon usage bias. It has been shown that codon usage bias in some bacteria and archaea is affected by GC composition and environmental conditions (e.g. temperature) (23). Users can therefore perform these types of analyses according to their own preferences.
As mentioned in the 'Statistical methods' section, the webserver can identify the outliers for a genome or a group of sequences based on Tukey statistical analysis (21). Users can select an outlier from the plot and find the associated information for each codon and the annotation of that specific gene (Figure 2B); the outliers are marked in a different color from the other members of the SCUO population. To compare across genomes, the CodonO webserver applies the Wilcoxon Two Sample Test (22) to test whether the SCUO populations of different genomes are the same. The P-values from the statistical comparison between genomes are listed in a table (Figure 2C), and a P-value less than 0.05 indicates a significant difference between the two SCUO populations compared.
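The four GC measurements defined above can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the server's code: each value is computed as the fraction of G or C among the unambiguous bases at the relevant codon positions, a trailing partial codon is dropped, and the function name `gc_composition` is hypothetical.

```python
def gc_composition(cds):
    """Return overall GC and per-codon-position GC fractions for one coding sequence."""
    cds = cds.upper().replace("U", "T")
    seq = cds[: len(cds) - len(cds) % 3]       # drop a trailing partial codon

    def gc_fraction(bases):
        # Ignore ambiguous symbols such as N (assumption of this sketch).
        valid = [b for b in bases if b in "ACGT"]
        if not valid:
            return 0.0
        return sum(b in "GC" for b in valid) / len(valid)

    return {
        "GC": gc_fraction(seq),                # all positions
        "GC1s": gc_fraction(seq[0::3]),        # first site of each codon
        "GC2s": gc_fraction(seq[1::3]),        # second site of each codon
        "GC3s": gc_fraction(seq[2::3]),        # third site of each codon
    }
```

For the sequence ATGGCCAAA (codons ATG, GCC, AAA), this gives GC = 4/9, GC1s = 1/3, GC2s = 1/3 and GC3s = 2/3. Pairing any of these values with the SCUO of the same gene yields one point of the two-dimensional plots described above.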

Implementation
The programs in this solution package are written in C/C++ or Java. The shell scripts are written in Korn shell in order to achieve high performance. GNUPlot is used for visualization. Cascading style sheets (CSS) are used for a consistent look across the pages, which also makes it possible to change the overall design simply by replacing the CSS definition file. PHP, which is itself written in C, is used for server-side scripting. To achieve high performance when computing at the genomic scale, we use a hash function or a binary tree, so that the codon usage analyses have a time complexity of O(n) or O(n log n), respectively. The webserver also includes special functions that address security and concurrency issues.

ACCESS
CodonO has been tested on Microsoft Internet Explorer, Netscape and Mozilla Firefox. Users need JavaScript enabled to obtain the full functionality of the CodonO server. The webserver is available at http://www.sysbiology.org/CodonO/. The webserver runs in real time. Users can compare a maximum of 16 genomes in a single comparative analysis.

CONCLUSIONS
In summary, the CodonO webserver has three major computational features for codon usage bias analyses: (i) it calculates the codon usage bias for one or more genomes; (ii) it compares and visualizes the correlation between codon usage bias and GC composition; (iii) it performs statistical analyses of codon usage bias within and across genomes. Thus, CodonO provides an efficient, user-friendly web service for codon usage bias analyses across and within genomes using SCUO in real time.