Impact of C‐terminal amino acid composition on protein expression in bacteria

Abstract The C‐terminal sequence of a protein influences processes such as the efficiency of translation termination and protein degradation. However, the general relationship between features of this C‐terminal sequence and levels of protein expression remains unknown. Here, we identified C‐terminal amino acid biases that are ubiquitous across the bacterial taxonomy (1,582 genomes). We showed that, at the C‐terminus, the frequency of positively charged amino acids (lysine, arginine) is higher, while the frequency of hydrophobic amino acids and threonine is lower. We then studied the impact of C‐terminal composition on protein levels in a library of Mycoplasma pneumoniae mutants, covering all possible combinations of the last two codons. We found that charged and polar residues, in particular lysine, led to higher expression, while hydrophobic and aromatic residues led to lower expression, with differences in protein levels of up to fourfold. We further showed that modulation of the protein degradation rate could be one of the main mechanisms driving these differences. Our results demonstrate that the identity of the last amino acids has a strong influence on protein expression levels.


Codon composition at the C-terminus of bacterial protein sequences shows higher (red) or lower (blue) frequency when compared to the frequency of the same codons in the bulk of the sequence (same color code for all panels). Significance of the biases was tested using Fisher's exact test with multiple-testing correction at a 5% false discovery rate.
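The test described in this caption can be illustrated with a minimal sketch, assuming a 2x2 contingency table per codon (codon vs. all other codons, C-terminal position vs. bulk of the sequence) and assuming the Benjamini-Hochberg procedure for the false discovery rate correction. The caption does not name the correction procedure, and the function names and counts below are hypothetical and serve only as illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so that ties with the observed table count.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts, NOT data from the study:
# codon -> (C-terminal count, C-terminal other, bulk count, bulk other)
toy_counts = {
    "AAA": (30, 70, 200, 1800),
    "ACC": (2, 98, 150, 1850),
    "GGT": (10, 90, 210, 1790),
}
codons = sorted(toy_counts)
pvals = [fisher_exact_two_sided(*toy_counts[c]) for c in codons]
qvals = benjamini_hochberg(pvals)
significant = [c for c, q in zip(codons, qvals) if q < 0.05]
```

A codon would be called biased here when its adjusted p-value falls below 0.05, which corresponds to the 5% false discovery rate stated in the caption.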

Figure S6
Figure S6. Biases in C-terminal protein sequence composition in the bacterial kingdom at the level of codon pairs. The composition of the last two codons at the C-terminus of bacterial protein sequences shows higher (red) or lower (blue) frequency when compared to the frequency of codon pairs in the bulk of the sequence. Significance of the biases was tested using Fisher's exact test with multiple-testing correction at a 5% false discovery rate.

Significance of the biases was tested using Fisher's exact test with multiple-testing correction at a 5% false discovery rate within each class. C-terminal codon biases in each stop codon context at the level of phyla, for a selection of codons that showed strong enrichment or depletion. Phyla were ordered following an approximate phylogenetic tree. (D) Same analysis in the bacterial kingdom, with codons classified into NNA codons and other codons, in the UGA stop codon context and in the † UGA context. The latter was defined as follows: within the UGA stop codon context, we further excluded genes for which the start codon of the downstream gene overlapped with the stop codon at nucleotide position -1, e.g. NNA-UGA where AUG is the downstream start codon. The distributions for NNA codons and other codons were compared using an independent t-test (p = 2.2e-08 for the UGA context, p = 0.12 for the † UGA context). Significance code: n.s. not significant for p ≥ 0.05, * for p < 0.05, ** for p < 0.01, *** for p < 0.001, **** for p < 1e-4.

Figure S14
Figure S14. Effect of the C-terminal codon pair on protein expression levels in the randomized C-terminal library with the weak promoter. The protein expression level readout is reported as the log10 of the DAM ratio relative to the average. Codon pairs that contained the GATC motif were filtered out (missing data, grey squares), for example CGA-TCT. Each codon pair is covered by far fewer reads than each codon in the analysis at the level of individual codons, so noise has a stronger influence on the estimation of the DAM ratio.
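The filtering and readout described in this caption can be sketched as follows. This is an illustration under stated assumptions rather than the analysis pipeline itself: the motif is searched only within the six nucleotides of the codon pair (flanking sequence is ignored), "relative to the average" is read as the ratio divided by the mean over the retained pairs, and the function names and DAM ratio values are hypothetical.

```python
from itertools import product
from math import log10

# All 64 DNA codons; the codon pair is treated as a 6-nucleotide string.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def contains_gatc(codon_a, codon_b):
    """True if the GATC motif occurs within the concatenated codon pair."""
    return "GATC" in (codon_a + codon_b)


def log10_relative_to_mean(dam_ratio_by_pair):
    """Drop GATC-containing pairs, then return log10(ratio / mean ratio).

    Assumption: the average is taken over the retained pairs only.
    """
    kept = {pair: ratio for pair, ratio in dam_ratio_by_pair.items()
            if not contains_gatc(*pair)}
    average = sum(kept.values()) / len(kept)
    return {pair: log10(ratio / average) for pair, ratio in kept.items()}


# Codon pairs that would appear as missing data (grey squares).
masked_pairs = [(a, b) for a, b in product(CODONS, repeat=2)
                if contains_gatc(a, b)]

# Hypothetical DAM ratios, NOT data from the study.
toy_ratios = {("AAA", "AAG"): 2.0, ("TTC", "TGG"): 0.5, ("CGA", "TCT"): 1.0}
readout = log10_relative_to_mean(toy_ratios)
```

In this sketch the pair CGA-TCT from the caption is removed because its concatenation CGATCT contains GATC, and the two remaining toy pairs are reported relative to their own mean.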

Figure S15
Figure S15. Effect of the C-terminal codon pair on protein expression levels in the randomized C-terminal library with the strong promoter. The protein expression level readout is reported as the log10 of the DAM ratio relative to the average. Codon pairs that contained the GATC motif were filtered out (missing data, grey squares), for example CGA-TCT. Each codon pair is covered by far fewer reads than each codon in the analysis at the level of individual codons, so noise has a stronger influence on the estimation of the DAM ratio.

Figure S16

Figure S18
Figure S18. Correlation of the protein expression levels of C-terminal variants between the ELM-seq assay and the luciferase assay. The expression levels measured in the ELM-seq experiment (DAM ratio), averaged over the weak and strong promoter libraries, are compared to the normalized luminescence measured in the luciferase assay, averaged over the three replicates.
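The comparison in this caption amounts to averaging each assay per variant and then correlating the two sets of averages. The sketch below assumes a Pearson correlation coefficient, which the caption does not specify. The variant labels, values, and function names are hypothetical.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, NOT data from the study.
# variant -> (DAM ratio with weak promoter, DAM ratio with strong promoter)
elm_seq = {
    "variant_1": (1.8, 1.6),
    "variant_2": (1.0, 1.1),
    "variant_3": (0.5, 0.6),
    "variant_4": (1.3, 1.0),
}
# variant -> normalized luminescence in three replicates
luciferase = {
    "variant_1": (1.5, 1.7, 1.6),
    "variant_2": (1.0, 0.9, 1.1),
    "variant_3": (0.6, 0.5, 0.7),
    "variant_4": (0.9, 1.2, 1.0),
}

variants = sorted(elm_seq)
elm_mean = [mean(elm_seq[v]) for v in variants]   # average of both libraries
luc_mean = [mean(luciferase[v]) for v in variants]  # average of replicates
r = pearson_r(elm_mean, luc_mean)
```

A rank-based coefficient such as Spearman's would be an equally plausible reading of the caption if the relationship between the two readouts is monotonic but not linear.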

Figure S19
Figure S19. Changes in C-terminal amino acid biases for taxonomic clades classified by the presence of a release factor 3 (RF3) homolog. We chose 114 clades at the family rank that contained at least 4 genomes in our database, and identified the species that contained the prfC homolog (RF3) based on EggNOG orthologous group assignment (OG id 05C8A). In total, 990 proteins were annotated as prfC homologs in our database. We then classified each clade based on the proportion of species that contained the prfC homolog, and compared the distribution of C-terminal amino acid biases between clades with prfC present in more than 50% of the species and clades with prfC present in less than 50% of the species. For each amino acid, the difference in the mean bias between the two classes was tested using an independent t-test.

Figure S20
Figure S20. Protein degradation assay, fit to an exponential decay. For each C-terminal variant, the normalized luminescence at time points 2 h, 4 h, 6 h and 8 h was fitted to an exponential decay, independently for each of the three replicates.
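A minimal sketch of such a fit, assuming the single-exponential model L(t) = L0 * exp(-k * t) and a least-squares fit on log-transformed luminescence. The caption does not state the fitting method, so the log-linear approach is an assumption. The function name and luminescence values are hypothetical.

```python
from math import exp, log

TIME_POINTS_H = [2.0, 4.0, 6.0, 8.0]  # time points from the caption, in hours


def fit_exponential_decay(times, values):
    """Fit values ~ l0 * exp(-k * t) by linear regression of ln(values) on t.

    Returns (l0, k), where k is the decay rate per hour.
    """
    ys = [log(v) for v in values]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return exp(intercept), -slope


# Hypothetical normalized luminescence for one variant, three replicates.
# These are NOT data from the study.
replicates = [
    [0.62, 0.37, 0.21, 0.13],
    [0.58, 0.35, 0.23, 0.12],
    [0.65, 0.40, 0.22, 0.14],
]

# Each replicate is fitted independently, as described in the caption.
fits = [fit_exponential_decay(TIME_POINTS_H, rep) for rep in replicates]
half_lives_h = [log(2) / k for _, k in fits]
```

The half-life ln(2) / k gives a more intuitive summary of each fit than the rate constant. A nonlinear least-squares fit on the untransformed values would weight the early time points more heavily and could give slightly different estimates.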

Figure S21
Figure S21. Sequencing scheme of the C-terminal ELM-seq DamID screen. (A) Construct design: the C-terminal amino acids are randomized in frame. (B) DamID: DNA-seq was performed by PCR amplification of the screen cassette (after digestion with GATC methylation-sensitive or methylation-insensitive enzymes). The custom PCR oligos included the Illumina sequencing, flow-cell binding and index sequences.