Codon stabilization coefficient as an index for prediction of codon bias and its potential applications

Different methods of mRNA half-life measurements are available, but genome-wide measurements of mRNA half-life in yeast showed a weak correlation between the methods. Moreover, when we compared mRNA half-life determined by these methods with other cellular measurements such as mRNA and protein abundance low correlation was found. To clarify this matter, we analyzed mRNA half-life datasets from nine different groups to determine the most accurate method of measurement. Since codon optimality is one of the significant determinants of mRNA stability, we used the codon stabilization coefficient (CSC) as a reference for mRNA half-life measurement accuracy. After CSC calculation for each dataset, we find strong positive correlations between the CSC from some datasets with other parameters that reflect codon optimality such as tRNA abundance and ribosome residence time. By the use of CSC parameter, we observed that most genes contain non-optimal codons and that codon bias exists toward optimized and non-optimized genes. We also observed that stretches of non-optimal are not randomly distributed since it causes impacts on translation.

2 Introduction 1 mRNA levels are determined by the rates of mRNA synthesis and degradation (Parker, 2012). Genome-2 wide mRNA half-life measurements showed poor correlation across different datasets and different 3 methods (and, in some cases, even between similar methods), classifying the same mRNA molecule as 4 stable and unstable. It makes the identification of stability sequence motifs and a global view of 5 transcription and translation dynamics in yeast a problematic task (Baudrimont et al., 2017;Harigaya and 6 Parker, 2016). Several features, such as mRNA secondary structure, sequence and structural elements 7 located within 5`and 3`UTR as well as the lengths of the transcripts can affect mRNA stability (Parker, 8 2012). Herrick and colleagues showed that the percentage of rare codons present in yeast unstable 9 mRNAs were significantly higher than in stable mRNAs (Herrick et al., 1990). Recent experiments on 10 bacteria, yeast, and metazoans indicate that codon optimality is a major determinant of mRNA 11 stability (Boël et al., 2016;Mishima and Tomari, 2016;Presnyak et al., 2015;Radhakrishnan et al., 2016). 12 The concept of codon optimality has been developed through the study of codon bias, which is the 13 frequency at which distinct synonymous codons are present within the genome (Sharp and Li, 1987). 14 Codon optimality is a term that incorporates the competition between tRNA supply and 15 demand (Pechmann and Frydman, 2012). In several species examined, the cellular tRNA concentrations 16 are proportional to tRNA gene copy numbers (Rocha and Rocha, 2004). Based on this observation, tRNA 17 adaptation index (tAI) calculates codon bias using the genome copy numbers of tRNAs (Dong et al., 1996;18 Sørensen et al., 1989;Tuller et al., 2010b). The tRNA adaptation index (tAI) reflects the efficiency of the 19 tRNA usage by the ribosome where optimal codons, which are decoded by high abundance tRNA species, 20 are translated more efficiently when compared to non-optimal codons (Reis, 2004). Recently, Coller and 21 colleagues showed a positive correlation between codon optimality and mRNA stability (Presnyak et al., 22 2015;Radhakrishnan et al., 2016). The authors measured the half-lives of thousands of genes and found 23 that some codons were enriched in the most stable mRNAs while other codons were enriched in the most 24 unstable mRNAs. Based on Pearson's correlation between the frequency of occurrence of each codon in 25 each mRNA and mRNA half-lives, the authors created the codon stabilization coefficient (CSC) (Presnyak 26 et al., 2012). We took advantage of the nucleotide resolution of ribosome profiling to see if the difference 1 occupancy at A site for each codon correlates with CSC calculated herein. In principle, if a codon is 2 optimal the ribosome occupancy of this codon must be shorter when compared with a non-optimal codon, 3 so we used 1/ribosome residence time to correlate with CSC ( Fig 2C). It is important to note that all 4 ribosome profiling data used herein were obtained in the absence of translational inhibitor, since this 5 treatment distorts ribosome profiles as initiation continue even though elongation is 6 blocked (Gerashchenko and Gladyshev, 2014;Hussmann et al., 2015;Requião et al., 2016;Weinberg et 7 al., 2016). We observed that the CSCs that most agree with three independent experiments measuring 8 codon occupancy time by ribosome profiling (Fang et al., 2018;Gardin et al., 2014;Weinberg et al., 2016) 9 were again the Coller, Becskei, Gresham and Cramer datasets (Spearman correlation coefficients (r) 10 ranging from 0.19 to 0.62). Very recently, it was showed that ribosomal A-site decoding rate impacts 11 normal mRNA decay in yeast reinforcing the correlation among CSC and codon occupancy (Hanson et al., 12 2018). The pattern observed for codon occupancy time by the ribosome was maintained when other 13 parameters such as tRNA abundance measured by microarray (Tuller et al., 2010) or RNAseq (Weinberg et 14 al., 2016) were used to correlate with CSC ( Fig 2C). We also calculated CSC using Spearman correlation 15 instead of Pearson correlation and similar results were found (data not shown). Different from the direct 16 comparison of mRNA half-lives and cellular parameters ( Figure S1) comparison with the datasets CSCs 17 gives us strong positive correlations, meaning that we are able to identify the datasets that are most 18 coherent with other parameters that are directly involved in codon adaptability and therefore have an 19 immediate effect in mRNA half-life. These data together suggest that four datasets, Coller, Becskei, 20 Gresham and Cramer, are the most reliable, since their CSCs agree with different experimental data, 21 while the others do not, moreover, the correlation among these studies showed (r) ranging from 0.12 to 22 0.75 ( Fig 1C) and the correlation among the CSC calculated from these studies ranged from 0.78 to 0.91 23 ( Fig 2C). 24 However, if one is interested to know the yeast genes mRNA half-life absolute values, Coller, Becskei, 25 Gresham and Cramer studies can give entirely different answers. In other words, a good correlation 26 between two studies can hide essential differences in the range of mRNA half-life absolute values 1 obtained by each research group. This is probably an inevitable consequence of using different 2 experimental methodologies. 3 4 Mean values of CSC computed for each gene. 5 Another way to estimate the biological significance of the CSC is the analysis of the correlation of the 6 mean values of CSC computed for each gene with other genome-wide measurements apart from mRNA 7 half-lives. When we measured the CSC average of individual genes (CSCg) using Coller's dataset, we 8 observed an almost perfect correlation with the frequency of optimal codons calculated by 9 Saccharomyces genome database (SGD) (Fig 3A). Agreement of CSC with SGD database further 10 strengthens the CSCs as an indicator of mRNA half-life measurements accuracy. Next, we did the same 11 analysis using the CSCs calculated from the other eight mRNA half-lives datasets and compared with the 12 frequency of optimal codons, protein abundance (Lahtvee et al., 2017), mRNA abundance (Yassour et al., 13 2009) and transcriptional elongation (Weinberg et al., 2016) (Fig 3B). We noticed that the CSCg 14 calculated from Coller's dataset was better correlated with all biological parameters analyzed when 15 compared to other CSCs obtained from other datasets ( Fig 3B). In that way, in the subsequent 16 experiments, we chose to continue this study only with Coller's CSC. 17 Next, we analyzed the distribution of all yeast genes taking in to account their CSCg. We can observe that 18 the majority of genes present a non-optimal CSCg (Fig 4A, black circles). A more precise analysis 19 revealed that two thirds of the CSCgs (66%) are non-optimal ( genes enriched in optimal-codons, we created a scramble yeast genome. We scrambled all codons, but we 23 maintained the protein sequence of all ORFs, as well as the proportion of all 64 codons from the full 24 genome. The CSCg distributions of scrambled genes showed that random codon choice creates a uniform 25 genome where most genes possess neutral CSCg (Fig 4A, blue squares). Interestingly, we also observed a 26 codon bias toward genes with extremely low CSCg. One possibility to explain this observation is that a 1 small fraction of genes concentrates the majority of optimal codons with the highest CSC scores and, 2 consequently, most genes use codons with lower CSC scores, increasing the chance to have genes with a 3 high amount of non-optimal codons. To address this point, we analyzed just the genes with CSCg lower 4 than 0 (3,258 genes), and we scrambled the codons of these genes. As observed in Figure 4B, the 5 frequency of genes with extremely low CSCg was ten times higher in the yeast genome when compared 6 with the scrambled genome. We concluded that codon bias exists as a force to select genes with a high 7 concentration of both optimal and non-optimal codons. 8 9 Highly expressed genes concentrate the optimal codons in yeast. 10 When we correlated the CSCg with the reads per kilobase per million (measured for each gene by 11 ribosome profiling(Heyer and Moore, 2016)), we observed a positive correlation (r = 0.63) ( Fig 5A). The 12 61 codons were then arbitrarily classified into three categories based on CSC. We considered codons with 13 CSC higher than 0.1 optimal-codons, while codons with CSC lower than -0.1 were non-optimal codons 14 ( Fig 5B). The codons with CSC in the range of -0.1 to 0.1 were considered neutral codons ( Fig 5B). 15 Using the criteria described above, we divided all yeast genes arbitrarily into ten groups ( Fig 5C). After k-16 means cluster we observed three main categories ( Fig 5C). The first category (cluster 1) was comprised of 17 genes composed by codons with optimal or neutral scores and virtually none of the non-optimal codons. 18 Even though less than 6% of the genes were found in this category, they were responsible for 19 approximately 70% of all transcripts in the cell ( Fig 5C). Genes characterized in the second category 20 (cluster 2) have a high abundance of neutral codons and represents 26% and 20% of the number of genes 21 and transcripts in the cell respectively ( Fig 5C). Finally, the third category (cluster 3) was the inverse of 22 the first, and virtually none of the optimal codons were found. Most genes (62%) fall in this category, but 23 they represent just 10% of all transcripts in the cell ( Fig 5C). Even though a small number of genes are 24 responsible for most of the transcripts, they are enriched with optimal codons, confirming the relationship 25 between codon optimality and higher levels of translation.

1
Correlation between CSCg and the copy number of proteins in yeast. 2 As described in Figure 3B, we observed a positive correlation between CSCg from Coller's dataset and 3 protein abundance. We analyzed this correlation with more details in Figure 6A (Breker et al., 2013). 4 Besides the Pearson correlation coefficients (r) of 0.66, we noticed the presence of outliers in the left 5 superior quarter of the plot (blue points, Fig 6A). All proteins with more than 10 5 copies/cell have CSCg 6 higher than 0. Nevertheless, some proteins with the highest CSCg (>0.1) are low abundant (less than 10 3 7 copies/cell). To explain this discrepancy, we compared the fold of induction/repression in more than 400 8 conditions of inliers vs outliers ( Fig 6B). It is clear that outliers are genes more inducible when compared 9 to inliers. It means that these genes possess high CSCg because in some moments it is essential for the 10 cells to produce high amounts of mRNA and proteins to deal with a new environmental or physiological 11 condition. To support this hypothesis, we calculated the CSCg of genes up-regulated in different stress 12 conditions (Fig 6C and 6D). We used as threshold 1.5 fold induction, as represented in Figure 6C when 13 the cells were exposed to DTT (Breker et al., 2013). We observed that, for all stress conditions analyzed, 14 the CSCg of induced genes were higher when compared with of the non-induced genes (Fig 6C), 15 confirming our hypothesis that high CSCg values are related to stable mRNA and abundant protein 16 expression, under certain conditions. 17

Distribution of codons in the yeast genome and its impact in translation. 19
Previous reports showed that non-optimal codons are concentrated in both 3'and 5'regions in several 20 genes (Tuller et al., 2010;. The biological explanation for this observation relays in the efficiency 21 that ribosome translates these codons (Tuller et al., 2010). The ribosome ramp hypothesis claims that 22 highly expressed genes possess, in their 3'mRNA regions, a subset of non-optimal codons, which would 23 be necessary to slow down translation, thus avoiding ribosome jams and minimizing the cost of protein 24 translation. However, the ribosome profiling data used to support the ribosome ramp hypothesis was 25 conducted in the presence of cycloheximide before cell lysis. As mentioned before, this protocol disrupts 26 the data, especially in the region nearby the ATG (Gerashchenko and Gladyshev, 2014;Hussmann et al., 1 2015;Weinberg et al., 2016). Using the CSC as a metric, we measured the occupancy of non-optimal 2 codons in the yeast genome and, as observed before by the use of tAI (Tuller et al., 2010), we observed an 3 enrichment of non-optimal codons from codon 3 until codon 10 ( Figure 7A). A more discrete enrichment 4 was also observed in the region close to the stop codon ( Figure 7A). 5 Recently, it was showed that the DEAD-box protein Dhh1p was a sensor of codon optimality that targets 6 mRNA for degradation (Radhakrishnan et al., 2016). In a series of elegant experiments, Coller and 7 colleagues showed that Dhh1p binds to ribosomes and modulates ribosome occupancy of mRNAs with 8 low codon optimality (Radhakrishnan et al., 2016). The authors proposed that Dhh1p binds to translating 9 ribosomes when elongation is slow and that the number of slow-moving ribosomes stimulates mRNA 10 decay. To prove that, a gene with optimal codons (PGK1) was modified with a non-endogenous stretch of 11 ten codons with exceptionally low CSC (TTA ATA GCG CGG CGG CGG CGG CGG GCG ACG). This 12 stretch was placed at increasing distance from the ATG. Interestingly, the half-live of PGK1 mRNAs 13 correlated inversely with the distance of which the stretch was placed, in other words, mRNAs with non-14 optimal codons concentrated close to ATG possess higher half-lives when compared to RNAs where 15 these stretches are further apart (Radhakrishnan et al., 2016). We asked if this observation could be 16 expanded to endogenous mRNAs. Using the CSC score we sum the value of each one of ten codons used 17 in Coller's constructs and we found a value of -1.28. To determine the frequency and the position of 18 stretches of non-optimal codons in yeast genome, we developed a program that is able to screen the 19 sequence of a given gene and calculate the sum of CSC of every 10 codons. We found 1,700 stretches 20 present in 1,200 genes with values equal or lower than -1.28. The distribution of these stretches was not 21 random, they were concentrated close to the ATG ( Figure 7B). We used the 1,200 genes with values 22 equal or lower than -1.28 in a window of ten codons and removed the genes with more than one stretch. 23 After that, we took just the genes where the half-life of mRNA was calculated by Coller's lab. The 24 remaining genes where aligned in a window of 200 codons in a crescent order of position of a stretch with 25 non-optimal codons ( Figure 7C). Next, the ribosome profiling of the same genes was analyzed ( Figure  26 7D). It is important to note that we used two independent ribosome profiling data (Koller and Gladyshev) 1 both performed in the absence of cycloheximide before cell lysis (Gerashchenko and Gladyshev, 2014;2 Pop et al., 2014). We clearly observed increased ribosome density associated with the stretch of non-3 optimal codons ( Figure 7D and 7F). We expected to see a high density of ribosomes before the stretch 4 with non-optimal codons ( Figure 7F). In fact, the ribosome density present in a window of 500 5 nucleotides before the stretch of non-optimal codons was higher when compared with the same window 6 of nucleotides after the stretch ( Figure 7D and 7F). Based on Coller's data the genes with non-optimal 7 stretch of codons closed to ATG should have higher half-life when compared to genes where the stretch is 8 further apart, but it was not the case ( Figure 7E), no correlation was observed ( Figure 7G). In other words, 9 the presence of non-optimal codons was sufficient to induce ribosome stalling ( Figure 7D) and to slow 10 ribosome movement before the non-optimal codons had been reached ( Figure 7F) but their position in 11 mRNA didn't influence the half-life of mRNA ( Figure 7E and 7G). 12 13

Conclusions 14
We conclude that the CSC is a useful parameter for estimate mRNA half-lives based on codon profile. 15 We suggest the use of CSC as a parameter to calculate the accuracy of wide measurement of mRNA 16 decay. Moreover, using CSC of each 61 codons as well the CSCg calculated for each yeast gene we were 17 able to understand important features of yeast genome such as; 1) most genes contain non-optimal 18 codons, 2) the codon bias exists toward optimized and non-optimized genes, 3) genes with optimal 19 codons usually are more translated and the exceptions are translated in stresses conditions, 4) stretches of 20 non-optimal codons are concentrated closed to the ATG in yeast transcripts but an explanation for this 21 observation still missing. 22 This paper Localization of non-optimal codons stretches This paper 2

Codon stabilization coefficient (CSC) and codon stabilization coefficient gene (CSCg) calculation 11
We calculated the CSC of each 61 codons as described previously (REF). To calculate the codon 12 stabilization coefficient in a given gene (CSCg), a program that can screen the coding sequence of a given Ribosome profiling data 16 To analyze the correlations between the ribosome density and the net charges, S. cerevisiae ribosome 1 profiling data were used (REF). The data were analyzed as described previously (REF). Briefly, the data 2 were downloaded from GEO, and the adaptors (CTGTAGGCACCATCAAT) were trimmed. The 3 trimmed FASTA sequences were aligned to S. cerevisiae ribosomal and noncoding RNA sequences to 4 remove the rRNA reads. The unaligned reads were aligned to the S. cerevisiae S288C genome, which is 5 deposited in the Saccharomyces genome database. First, any reads that mapped to multiple locations were 6 removed. Then, the reads were aligned to the S. cerevisiae coding sequence database, allowing two 7 mismatches per read. Genes with <50% of positions covered were eliminated. 8

Clustering Analysis 9
Clustering analyses were performed by Euclidean distance using the software Orange3. Clustering by k-10 means was performed with ten defined centers using the software Orange3. 11

Scrabble genome 12
A program that maintains the ORF length as well the amino-acid sequence generated by the scrabble 13 genome. The program scrabbled the codons at a given genome keeping the proportionality of the codons 14 of the input genome. 15

Localization of non-optimal codons stretches 16
We created a program that calculates the sum of CSC of each codon in a determined window for each 17 ORF. We defined a value of CSC of -1.28 in a window of ten codons that reflects the sum of CSC of the

Conceptualization
Spearman's correlation between CSCs for the 61 yeast codons from each of the datasets with different cellular parameters; tAI (Sharp and Li, 1987), 1/ribosome density at A site (Fang et al., 2018;Gardin et al., 2014;Weinberg et al., 2016) and tRNA abundance (Tuller et al., 2010;Weinberg et al., 2016). Values range from -0.72 (negative correlation, green panels) to 0.7 (positive correlation, red panels).    inset is the magnification of the panel B. It is clear that the outliers genes of panel A achieve, more often, higher levels of induction when compared to control (inliers). C. Induced proteins (blue dots) expressed 1.5 fold higher after cellular exposure to DTT when compared to non-stressed cells (Breker et al., 2013). D. The CSCg of genes induced 1.5 fold at different stresses (Breker et al., 2013) were compared to all genes (control). For all stresses analyzed the CSCg of genes overexpressed are higher when compared to the control. The numbers 0.005 and 0.010 represent the normalization of ribosome profiling analyzes. The normalization was performed by dividing the number of reads at each position by the total number of reads in each gene. F. Average ribosome footprint density of the genes presented in panel 7D aligned based on the stretch of non-optimal codons (position 0). The area under the curve of the 500 nucleotides before the non-optimal stretch of codons (-500 to 0, white bar) was compared with the area under the 500 nucleotides after the stretch (0 to 500, yellow bars). Two independent ribosome profiling studies were used (Gerashchenko and Gladyshev, 2014;Pop et al., 2014). G. No correlation of mRNA half-life measurement (Presnyak et al., 2015) with the position of the stretch with non-optimal codons (CSC ≤-1,28) was observed.