Differential bicodon usage in lowly and highly abundant proteins

Degeneracy in the genetic code implies that different codons can encode the same amino acid. Preferential usage of synonymous codons has been observed in all domains of life. There is much evidence suggesting that this bias has a major role in protein elongation rate, contributing to differential expression and to co-translational folding. In addition to codon usage bias, other preferences have been observed, such as those for particular codon pairs. In this paper, I report that codon pairs have significantly different usage frequencies when coding for either lowly or highly abundant proteins. These usage preferences cannot be explained by the usage frequency of the single codons. The statistical analysis of coding sequences of nine organisms reveals that in many cases bicodon preferences are shared between related organisms. Furthermore, it is observed that misfolding in the drug-transport protein encoded by the MDR1 gene is better explained by a large change in the pause propensity due to the synonymous bicodon variant than by a relatively small change in codon usage. These findings suggest that codon pair usage can be a more powerful framework to understand translation elongation rate and protein folding efficiency, and to improve protocols for optimizing heterologous gene expression.


INTRODUCTION
The central dogma of molecular biology establishes that the information that specifies which amino acid monomers will be added next during protein synthesis is coded in one or more nucleotide triplets known as codons (Watson et al., 2003). The genetic code establishes a set of rules that associate the 20 amino acids and a stop signal with 64 codons. This code is almost universal, with few exceptions (Jukes and Osawa, 1993). As there are more codons than encodable signals (amino acids and stop signal), the genetic code is considered degenerate. However, it is well known that synonymous codons are not used with [...]

In addition to the use of rare codons associated with scarce tRNA usage, other mechanisms exist that modulate the speed of translation or cause pauses. Among these, one can mention the blocking of ribosomal transit due to secondary structure elements in mRNAs (Nackley et al., 2006), and interactions of basic residues in the nascent polypeptides with the wall of the ribosomal exit tunnel (Gloge et al., 2014). However, in recent years emerging evidence has shown that translational rate could be encoded by a sequence longer than a triplet of nucleotides, in particular by bicodons (Guo et al., 2012). In this sense, a study encompassing 16 genomes has revealed that bicodons formed by two rare codons are frequently found in prokaryotes but rarely used in eukaryotes (Buchan et al., 2006). More recently, [...]). This fact has been used to produce synthetic viruses with attenuated virulence as a new strategy for vaccine development (Coleman et al., 2008).

Thus, coding sequences seem to carry more information than that strictly needed for specifying the linear sequence of amino acids in a protein. This additional information is linked with the overall synthesis rate of the associated protein, and the pauses required for it to acquire its correct native structure.
[...] one sample of the most abundant proteins, and another sample of the least abundant ones. When more than one isoform was present in the comprehensive data set, only one was included in the samples. It is important to point out that protein abundance correlates negatively with coding-sequence length in yeast (Coghlan and Wolfe, 2000). As PA distributions are generally biased, i.e., short proteins tend to be more abundant than longer ones (see Fig. 1), I selected two sets of 500 sequences, in which the sequence length distribution of both sets was similar. The procedure for sampling sequences with similar length [...] and σ are the mean length and standard deviation, respectively, of the target distribution) with a uniformly distributed random number r. If r_l > r, the sequence was added to the set of low PA sequences.

Then, I tested the second sequence in a similar manner and so on, until 500 sequences were selected.
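
The acceptance step described above can be sketched as follows. This is a minimal illustration, not the author's code: the expression for r_l is missing from the text, so the sketch assumes an unnormalized Gaussian weight in sequence length with mean `mu` and standard deviation `sigma`. The function name, the `records` layout and the fixed random seed are hypothetical.

```python
import math
import random


def sample_length_matched(records, mu, sigma, n=500, rng=None):
    """Pick up to n sequences whose lengths follow a target distribution.

    records: iterable of (sequence_id, length), already sorted by protein
             abundance (ascending for the low PA set, descending for high PA).
    mu, sigma: mean and standard deviation of the target length distribution.
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    selected = []
    for seq_id, length in records:
        # ASSUMPTION: acceptance weight is an unnormalized Gaussian in length;
        # the original definition of r_l is not available in the text.
        r_l = math.exp(-((length - mu) ** 2) / (2.0 * sigma ** 2))
        r = rng.random()  # uniform random number in [0, 1)
        if r_l > r:
            selected.append(seq_id)
            if len(selected) == n:
                break
    return selected
```

Under this assumption, a sequence whose length equals `mu` is always accepted, and sequences far from `mu` are almost always skipped, so both samples end up with similar length distributions.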

To select the high PA sequence set I performed the same procedure, but beginning from the highest PA

The pause propensity score, denoted by π, is defined as the difference between the relative synonymous bicodon usage computed over the low PA sequences (RSBU_L) and the one computed over the sequence sample associated with high PA (RSBU_H). Mathematically,

$\pi = \mathrm{RSBU}_L - \mathrm{RSBU}_H$ [...]

Here, $f^{X}_{ij}$ is the frequency of the bicodon $ij$ computed over the sequence sample $X$, $q_{ap}$ is the number of [...] where the * indicates that the sum is only over codon pairs $kl$ encoding the same amino acid pair encoded by the bicodon $ij$. From the observed and normalized expected bicodon counts recorded in a given sample $X$, I computed the residual scores $\chi^{2}_{Xij}$ for each codon pair as: [...] where $X$ indicates the sequence sample, i.e., $X = L$ for the low PA sample, or $X = H$ for the high PA sample. These residual scores can be used to assess whether or not the bias in a given codon pair can be explained by the bias in codons and amino acids.
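
As an illustration of the pause propensity score, the sketch below computes π as RSBU over the low PA sample minus RSBU over the high PA sample. The exact RSBU normalization is not recoverable from the text, so the code assumes an RSCU-style definition: a bicodon count divided by the mean count of all bicodons encoding the same amino acid pair. That normalization, and all function names, are assumptions and may differ from the author's definition.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def bicodon_counts(seqs):
    """Count adjacent in-frame codon pairs over a list of coding sequences."""
    counts = Counter()
    for s in seqs:
        codons = [s[i:i + 3] for i in range(0, len(s) - len(s) % 3, 3)]
        counts.update(a + b for a, b in zip(codons, codons[1:]))
    return counts


def rsbu(counts):
    """ASSUMED RSCU-style normalization: count / mean count of synonymous bicodons."""
    totals, sizes = Counter(), Counter()
    for a, b in product(CODE, repeat=2):
        pair = (CODE[a], CODE[b])
        totals[pair] += counts.get(a + b, 0)
        sizes[pair] += 1  # number of bicodons encoding this amino acid pair
    out = {}
    for a, b in product(CODE, repeat=2):
        pair = (CODE[a], CODE[b])
        total = totals[pair]
        out[a + b] = counts.get(a + b, 0) * sizes[pair] / total if total else 0.0
    return out


def pause_propensity(low_seqs, high_seqs):
    """pi = RSBU over the low PA sample minus RSBU over the high PA sample."""
    r_low = rsbu(bicodon_counts(low_seqs))
    r_high = rsbu(bicodon_counts(high_seqs))
    return {b: r_low[b] - r_high[b] for b in r_low}
```

With this sign convention, a positive π marks a bicodon used relatively more often in low PA sequences, and a negative π marks one used more often in high PA sequences. The absolute scale of π depends on the assumed normalization.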

In order to statistically assess the residual scores, I performed a random shuffling control. From each sample of sequences I generated a second random set of sequences by shuffling the order of codons (but preserving the stop codon at the end of each sequence). This procedure removed the codon correlation but not the codon usage. Then, I computed the residual score of bicodons associated with this random sample.

I repeated the above procedure 200 times and, finally, for each bicodon $ij$ I computed the mean value $\langle \chi^{2}_{ij} \rangle_{ran}$ and the associated standard deviation $SD_{ran}(\chi^{2}_{ij})$. Thus, the residual score of bicodon $ij$ can be expressed as the number of standard deviations from the mean, i.e., $(\chi^{2}_{ij} - \langle \chi^{2}_{ij} \rangle_{ran}) / SD_{ran}(\chi^{2}_{ij})$. This procedure was performed for the low and high PA samples of sequences independently. [...] for both samples (see Supplementary Fig. 6).
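
The shuffling control can be sketched as below. Because the formula for the residual score is not available in the text, the sketch takes the scoring step as a caller-supplied function, `score_fn`, a hypothetical placeholder that maps a list of sequences to a dictionary of per-bicodon scores. The function names, the fixed seed and the choice of the population standard deviation are also assumptions.

```python
import random
import statistics


def shuffle_codons(seq, rng):
    """Shuffle the codon order of one coding sequence, keeping the last codon (stop) in place."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    body, stop = codons[:-1], codons[-1]
    rng.shuffle(body)  # destroys codon-pair correlations, preserves codon usage
    return "".join(body) + stop


def shuffle_z_scores(seqs, score_fn, n_rounds=200, seed=0):
    """Express each observed score as standard deviations from the shuffled mean.

    score_fn is a placeholder: it must take a list of sequences and return
    a dict {bicodon: score}; the residual score itself is not defined here.
    """
    rng = random.Random(seed)
    observed = score_fn(seqs)
    null = {bicodon: [] for bicodon in observed}
    for _ in range(n_rounds):
        shuffled = [shuffle_codons(s, rng) for s in seqs]
        scores = score_fn(shuffled)
        for bicodon in null:
            null[bicodon].append(scores.get(bicodon, 0.0))
    z = {}
    for bicodon, values in null.items():
        mean = statistics.mean(values)
        sd = statistics.pstdev(values)  # population SD; the sample SD is an equally plausible choice
        z[bicodon] = (observed[bicodon] - mean) / sd if sd > 0 else float("nan")
    return z
```

Running `shuffle_z_scores` separately on the low PA and high PA samples mirrors the independent treatment of both samples described above.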

The above heat maps show that bicodon usage bias exists in lowly and highly abundant proteins, but [...] yeast (they have been indicated by * * in Supplementary Table S3).

The heat maps shown above are very useful to see some common features among organisms, but they

Regarding the inhibitor bicodons, I found that only five of them have a significantly different usage frequency in the low PA sample and only one in the high PA sample (see Table S3). I searched for bicodons with the same preferences shared by S. cerevisiae, C. elegans and D. melanogaster, by selecting those with p-value < 0.01 and $\chi^{2} \geq 3$ SD above the mean. Table 1 lists 16 shared bicodons with preference for low PA, while Table 2 lists 40 bicodons with preference for high PA sequences. It is noteworthy that, even though within each species the number of bicodons with preference for encoding high PA sequences (blue cells in Fig. 5) is smaller than the number of bicodons with preference for encoding low PA sequences (yellow cells in Fig. 5), the number of shared bicodons in Table 2 is substantially greater than the number listed in Table 1. This seems to indicate that bicodons with preference for high PA sequences are more conserved across these organisms than bicodons with preference for low PA sequences.
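
The selection of shared bicodons amounts to filtering each species by the two thresholds and intersecting the resulting sets. The sketch below assumes a hypothetical data layout in which each species maps a bicodon to a tuple (p-value, residual score in SD units, preference label "low" or "high"). The layout and the function names are illustrative only.

```python
def preferred_bicodons(stats, preference, p_max=0.01, z_min=3.0):
    """Bicodons of one species passing both thresholds for a given preference.

    stats: dict {bicodon: (p_value, z, pref)}, where z is the residual score in
           SD units above the shuffled mean and pref is "low" or "high".
    """
    return {
        bicodon
        for bicodon, (p_value, z, pref) in stats.items()
        if pref == preference and p_value < p_max and z >= z_min
    }


def shared_bicodons(per_species, preference, p_max=0.01, z_min=3.0):
    """Intersection of the selected bicodons across all species."""
    sets = [
        preferred_bicodons(stats, preference, p_max, z_min)
        for stats in per_species.values()
    ]
    return set.intersection(*sets) if sets else set()
```

Calling `shared_bicodons` once with `"low"` and once with `"high"` would yield lists analogous to Table 1 and Table 2, given per-species statistics in the assumed layout.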

Pause propensity score and cotranslational protein folding
The above statistical analysis is able to determine which bicodons are associated with lowly or highly [...] is also associated with a large change in the bicodon pause propensity score π. Specifically, bicodon ATCGTG has preference for low PA sequences with π = 0.1, while bicodon ATTGTG has preference [...] performance of the π score in these organisms. However, the residual score seems to be a more robust analysis.

In this paper, I considered that ribosomal pauses are encoded by bicodons, and examined the bicodon usage frequency in nine organisms. I found that some bicodons have an evident preferential usage in sequences that code for highly abundant proteins, while many others have preference for encoding proteins that are scarce. The latter bicodons can be understood as short sequences linked to translational pauses.

For many bicodons, the observed bias cannot be explained by codon usage.

It is worth noting that few bicodons with differential bicodon usage in low and high PA sequences [...]

[Figure residue: lists of codon axis labels omitted.]

Supplementary Fig. S13. The codon pairs whose preference for sequences with low or high PA cannot be explained by codon usage bias are outside the grey quadrant (i.e., $\chi^{2} \geq 3$ SD above the mean). Among these, it is possible to distinguish bicodons more frequently used in low PA sequences (red dots), or in high PA sequences (blue dots). Inside the quadrant, there are codon pairs with a significantly different usage frequency in low and high PA samples, but whose bias can be explained by codon usage bias (green dots). Codon pairs whose usage frequencies in low and high PA samples are not significantly different are indicated with black dots.