Quantification and discovery of sequence determinants of protein‐per‐mRNA amount in 29 human tissues

Abstract Despite their importance in determining protein abundance, a comprehensive catalogue of sequence features controlling protein‐to‐mRNA (PTR) ratios and a quantification of their effects are still lacking. Here, we quantified PTR ratios for 11,575 proteins across 29 human tissues using matched transcriptomes and proteomes. We estimated by regression the contribution of known sequence determinants of protein synthesis and degradation in addition to 45 mRNA and 3 protein sequence motifs that we found by association testing. While PTR ratios span more than 2 orders of magnitude, our integrative model predicts PTR ratios at a median precision of 3.2‐fold. A reporter assay provided functional support for two novel UTR motifs, and an immobilized mRNA affinity competition‐binding assay identified motif‐specific bound proteins for one motif. Moreover, our integrative model led to a new metric of codon optimality that captures the effects of codon frequency on protein synthesis and degradation. Altogether, this study shows that a large fraction of PTR ratio variation in human tissues can be predicted from sequence, and it identifies many new candidate post‐transcriptional regulatory elements.
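As a rough illustration of how a motif effect such as "12% decrease in PTR ratio" can be read off a regression on log-transformed PTR ratios, the following minimal sketch fits a single-motif linear model and converts the log10 coefficient into a percent change. It is not the integrative model of this study, which combines many sequence features; the function names `has_motif` and `motif_effect_percent` and the toy data are hypothetical and chosen only for the example.

```python
import numpy as np

def has_motif(utr: str, motif: str) -> bool:
    """Return True if the RNA motif occurs in the UTR (DNA 'T' is read as 'U')."""
    return motif in utr.upper().replace("T", "U")

def motif_effect_percent(log10_ptr, utrs, motif):
    """Ordinary least squares of log10(PTR) on motif presence (0/1).

    The slope is a difference in log10 PTR between genes with and without
    the motif; 10**slope is the fold change, reported here as a percentage.
    """
    x = np.array([has_motif(u, motif) for u in utrs], dtype=float)
    design = np.column_stack([np.ones_like(x), x])  # intercept + motif indicator
    beta, *_ = np.linalg.lstsq(design, np.asarray(log10_ptr, dtype=float), rcond=None)
    return (10.0 ** beta[1] - 1.0) * 100.0

# Toy data (invented for illustration, not taken from the study)
utrs = ["GGAACUUGG", "CCCGGG", "AACTTAA", "GGGCCC"]
log10_ptr = [3.9, 4.0, 3.9, 4.0]

fraction_with_motif = sum(has_motif(u, "AACUU") for u in utrs) / len(utrs)
effect = motif_effect_percent(log10_ptr, utrs, "AACUU")
```

In this toy case the slope is -0.1 in log10 units, so the motif is associated with a fold change of 10^-0.1, that is, a PTR ratio roughly 21% lower. A real analysis would fit all features jointly and per tissue.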


Description of 5'UTR motifs found de novo
• AACUU, present in the 5' UTR of 1,390 (12%) of the 11,575 investigated genes, was associated with a 12% decrease in PTR ratio only in brain (Figure 2E). According to current knowledge of RBP binding motifs, AACUU may be bound by serine/arginine-rich splicing factor 3 (SRSF3) with low probability (Qscore 0.002). AACUU sites were significantly evolutionarily conserved (P-value: 1.63e-16) compared to the background flanking regions (Appendix Figure S5A). The genes that contain the consensus sequence of AACUU were enriched in the biological process pri-miRNA transcription from RNA polymerase II promoter (Appendix Figure S6A) and in the molecular function transcription factor activity (Appendix Figure S5B).
• ACCUGC, present in 886 genes (8%), was associated with a 15% lower PTR ratio in stomach only (Figure 2E) and is similar to the consensus target sequence of the mRNA decay activator protein ZFP36 (Qscore 0.03). ACCUGC was significantly evolutionarily conserved (P-value: 3.79e-10, Appendix Figure S5B) and enriched for genes in the ion transport biological process (Appendix Figure S6C) whose products localize to the plasma membrane (Appendix Figure S6D).
• AGCAAC, present in 524 genes (5%), was on average associated with a 20% lower PTR ratio in ovary and prostate (Figure 2E) and was enriched for genes of cilium morphogenesis (Appendix Figure S6E), whose products localize to the cilium (Appendix Figure S6F).
• CCCACCC, present in 716 genes (6%), was significantly associated with a 20% higher PTR ratio in lymph node. CCCACCC matches the target motifs of PCBP1 (poly(rC)-binding protein 1, Qscore 1.0) and G3BP1 (Qscore 0.01). The two C triplets appear to be important in this motif because mismatches are rare in the genome (Figure 2F) and because these positions were more conserved than the central nucleotide and flanking nucleotides (P-value: 1.93e-07, Appendix Figure S5H).
• CGGAAG, present in 1,131 genes (10%), was significantly associated with a 14% higher PTR ratio in lung and urinary bladder (Figure 2E). CGGAAG matches the binding target of SRSF1. Consensus motif sites of CGGAAG were highly conserved through evolution (Appendix Figure S5K), and the genes containing these sites were enriched in specific biological processes (Appendix Figure S6N) and localized in certain cell components (Appendix Figure S6O).
• CUGUCCU, present in 323 genes (3%), on average associated with 41% PTR ratio in 18 tissues ( Figure 2E). The genes with the consensus sequence of this motif were enriched in multicellular organismal process (Appendix Figure S6P).
• UACAGG, present in 331 genes (3%), was associated with a 26% higher PTR ratio in stomach (Figure 2E) and matches the target site of SRSF6 (Qscore 0.03).
• UCGAC, present in 662 genes (6%), was associated with a 20% lower PTR ratio in duodenum and ovary (Figure 2E). The consensus sequence of UCGAC weakly matches the target motif of SRSF3 (Qscore 0.03), and the genes containing this motif were enriched in the axon regeneration process (Appendix Figure S6T).
Consensus motif sites of UUCCG were highly conserved through evolution (Appendix Figure S5X), and the genes containing the consensus sequence were significantly enriched in several biological processes such as cellular component biogenesis, RNA export from nucleus, organonitrogen compound biosynthesis and intracellular transport (Appendix Figure S6U) and were localized in membrane-bounded organelles (Appendix Figure S6V).
• ACACUA, present in 2,175 genes (19%), was on average associated with a 10% higher PTR ratio in adrenal gland, appendix, brain and fat (Figure 5A). ACACUA is the core sequence of the target motif of the RNA-binding protein QKI (Qscore 1.0), which is highly enriched in brain (Human Protein Atlas; Uhlen et al, 2015) and important for myelinization (Aberg et al, 2006), mRNA stability and protein translation (Teplova et al, 2013). Consistent with the function of QKI, the genes that contain the consensus sequence ACACUA were enriched in the signal transduction process (Appendix Figure S12A).
ACCAAA is the perfect match of the target motif of RBMX (Qscore 1.0) which plays several roles in the regulation of post-transcriptional processes (Kanhoush et al, 2010).
Consensus motif sites of ACCAAA in 3' UTR are highly conserved (P-value: 9.08e-112, Appendix Figure S11D), and the genes containing these sites are enriched in cell communication and signaling processes (Appendix Figure S12C) and encode intrinsic components of the membrane (Appendix Figure S12D).
• AUGAGAC, present in 805 genes (7%), associated with 24% higher PTR ratio in gallbladder. Genes with the consensus motif were enriched in cell communication process (Appendix Figure S12G).
• AUUUUUA, present in 4,119 (36%) genes, is another recovered well-known AU-rich element (ARE) (Chung et al, 1996). AUUUUUA was on average associated with a 5% higher PTR ratio in fifteen tissues (Figure 5A). There are several ARE-binding proteins, including HNRNPD, Hu proteins (ELAV family), ZFP36 and TIAL1, and the motif sites of this ARE are highly conserved through evolution (Appendix Figure S11G).
• CCAAAG, present in 3,992 (34%) genes, was associated with a 6% higher PTR ratio in fallopian tube, lymph node, small intestine and tonsil. No binding proteins have been reported for CCAAAG, but its large number of occurrences and its highly conserved motif sites in 3' UTR sequences (P-value: 1.08e-149, Appendix Figure S11I) suggest that it may be a key regulatory motif. The genes containing CCAAAG are enriched in the signaling process (Appendix Figure S12J) and localized in the membrane (Appendix Figure S12K).
• CCUGUA, present in 3,484 genes (30%), was on average associated with a 5% higher PTR ratio in seventeen tissues (Figure 5A). CCUGUA matches the recognition motif of the signal recognition particle 14 kDa protein SRP14 (Qscore 1.0). The genes containing the consensus sequence of CCUGUA are enriched in bone development (Appendix Figure S12L) and magnesium ion binding (Appendix Figure S12M).
• CGUGUGG, present in 380 genes (3%), associated with 38% higher PTR in esophagus ( Figure 5A). Consistent with the information content, the conservation of the first two nucleotides of this motif is much smaller compared to UGUGG (Appendix Figure S11K).
CUCAGG slightly matches the recognition motif of SRSF6 (Qscore 0.03). It is highly conserved through evolution (P-value 1.16e-136, Appendix Figure S11L) and the genes containing this motif in their 3' UTR sequences are enriched in small GTPase mediated signal transduction process (Appendix Figure S12N).

• GGAGCC, present in 3,140 genes (27%), was on average associated with a 4% lower PTR ratio in fat, kidney, lymph node, ovary and thyroid (Figure 5A). GGAGCC matches the recognition targets of three members of the heterogeneous nuclear ribonucleoprotein (hnRNP) family, namely HNRNPA1 (Qscore 1.0), HNRNPA2B1 (Qscore 1.0) and HNRNPA3 (Qscore 1.0). These RBPs have multiple roles, including mRNA stabilization and translational regulation (Geuens et al, 2016), and they are highly expressed in lymph node, ovary and thyroid gland, consistent with the tissues in which the effect is significant. Consensus motif sites of GGAGCC are highly conserved in 3' UTR (P-value 8.66e-106, Appendix Figure S11N), and the genes containing this consensus sequence are significantly enriched in various regulatory biological processes (Appendix Figure S12P) and localized in the cell periphery (Appendix Figure S12Q).
• GGCCCCUG, present in 571 genes (5%), was on average associated with a 23% higher PTR ratio in adrenal gland, brain, endometrium, heart, liver and thyroid (Figure 5A). The consensus sequence matches the target motif of SRSF2 (Qscore 0.02), and the motif sites of the consensus sequence are significantly conserved compared to flanking regions (P-value 2.54e-03, Appendix Figure S11O). Genes with the consensus GGCCCCUG sequence in their 3' UTR are enriched in regulation of signal transduction (Appendix Figure S12S).
• UAUGCA, present in 2,858 genes (25%), associated with 8% higher PTR ratio in appendix and colon ( Figure 5A). Genes having the consensus sequence were enriched in intracellular signal transduction process (Appendix Figure S12T) and metal ion transmembrane transporter activity (Appendix Figure S12V).
• UAUUUAU, another recovered AU-rich element, was present in 3,158 genes (27%) and was on average associated with a 10% lower PTR ratio in all tissues (Figure 5A). The consensus motif sites of UAUUUAU are highly conserved in 3' UTR (Appendix Figure S11R), and the genes containing this motif are enriched in various biological processes (Appendix Figure S12W) and especially localized in the Golgi apparatus (Appendix Figure S12Y).
• UGUAAAUA, present in 1,320 (11%) genes, was another recovered well-known motif, bound by the Pumilio family of proteins (Filipovska et al, 2011), which act as post-transcriptional repressors (Parisi & Lin, 2000). Consistent with the function of the bound proteins, this motif was on average associated with a 15% lower PTR ratio in twenty-two tissues (Figure 5A). Motif sites of UGUAAAUA were highly conserved through evolution (P-value

Figure S1: Distribution of the number of different major mRNA isoforms each gene has across 29 tissues. [Plot axes: number of major mRNA isoforms across 29 tissues (1 to 12); number of genes. Additional panel label: mRNA-protein major isoform match.]

Figure S3: Spearman's correlation coefficients between tissue-specific proteome and transcriptome with (y-axis, Materials and Methods) and without intronic reads normalization (x-axis).

[Gene ontology enrichment plots, panels A-F. Panel titles: Biological Process, Motif_UTR5_AACUU; Biological Process, Motif_UTR5_CUGUCCU; Biological Process, Motif_UTR5_UUCCG. Axis labels list the enriched gene ontology terms.]

[Plot axis label: Average protein half-life in 5 cell lines (min-log10).]

Figure S7: PTR-AI (2 fold codon frequency increase effect on PTR ratio) does not significantly correlate with human genomic codon frequencies.

Figure S8: Distribution of the effects of protein N-terminal residues (with respect to Alanine) on PTR ratio across tissues.

Figure S9: Distribution of median PTR ratio across 29 tissues for genes with and without the eukaryotic linear protein motifs. CLV_PKCS_FUR_1 (Furin (PACE) cleavage site), LIG_KEPE_1 (Sumoylation site), TRG_NLS_BIPARTITE_1 (classical bipartite nuclear localization signal), three classical monopartite nuclear localisation signals: TRG_NLS_MonoCore_2, TRG_NLS_MonoExtC_3, TRG_NLS_MonoExtN_4, and nuclear proteins (GO:0005634) in general. Four nuclear localization signals were associated with less median PTR ratio even though there is no significant PTR ratio difference between nuclear and non-nuclear proteins.

Biological Process Motif_UTR3_UUCUGAG
Figure S12: Gene ontology terms that are enriched for the sets of genes that contain consensus sequences of the de novo identified k-mers in 3' UTR that are predictive of PTR ratios.