Evolutionary origins and interactomes of human, young microproteins and small peptides translated from short open reading frames

Summary All species continuously evolve short open reading frames (sORFs) that can be templated for protein synthesis and may provide raw materials for evolutionary adaptation. We analyzed the evolutionary origins of 7,264 recently cataloged human sORFs and found that most were evolutionarily young and had emerged de novo. We additionally identified 221 previously missed sORFs potentially translated into peptides of up to 15 amino acids—all of which are smaller than the smallest human microprotein annotated to date. To investigate the bioactivity of sORF-encoded small peptides and young microproteins, we subjected 266 candidates to a mass-spectrometry-based interactome screen with motif resolution. Based on these interactomes and additional cellular assays, we can associate several candidates with mRNA splicing, translational regulation, and endocytosis. Our work provides insights into the evolutionary origins and interaction potential of young and small proteins, thereby helping to elucidate this underexplored territory of the human proteome.

In brief Sandmann et al. explore the evolution and interactomes of the smallest and evolutionarily youngest members of the human proteome. They identify previously unknown human short ORFencoded microproteins and show that conserved and young microproteins that emerged de novo in primates can engage with vital biological processes.

INTRODUCTION
Ribosome profiling (Ribo-seq) 1 has revealed the translation of thousands of short open reading frames (sORFs) in human cell lines and tissues, 2 which can result in the production of short proteins denoted as microproteins, micropeptides, or short ORF-encoded polypeptides (SEPs). 3 Many human microproteins have been selected for functional characterization based on high inter-species conservation 2,4 and a minimum amino acid (aa) size. However, evolutionarily young microproteins can have biological roles and have for instance been implicated in cell survival, 5-7 human brain development, 8 and cancer. [9][10][11] Similarly, a lower size cutoff for protein investigations seems unjustified: peptides as small as 11 aa can control morphogenetic development in insects 12-14 and muscle metabolism in humans, 15 respectively, and short bioactive peptides cleaved from bigger precursor proteins can function as peptide hormones. 16 These examples suggest that many more human sORFs encoding young microproteins, as well as very short peptides, may have yet unknown biological roles.
Here, we investigated these two underappreciated elements of the human proteome. To this end, we first analyzed the Resource conservation and evolutionary mechanisms of origin of 7,264 human-translated Ribo-seq ORFs included in a recently published GENCODE reference catalog. 2 We found that almost 90% were evolutionarily young, of which 4,101 emerged de novo from ancestral non-coding regions.
To gain first insights into the potential bioactivity of these young microproteins, we selected 45 microproteins of recent evolutionary origin from the set of 7,264 sORFs and performed a high-throughput protein interaction screen on peptide matrix (PRISMA). [17][18][19][20] For this approach, proteins are divided into short peptides that are synthesized onto cellulose membranes and incubated with a protein lysate. The protein interactome of each peptide is then analyzed by mass spectrometry (MS), making PRISMA well suited to detect interactions mediated by short linear motifs (SLiMs) within disordered protein regions, [17][18][19][20] which are common in microproteins. 21 Since the GENCODE Ribo-seq ORF catalog 2 introduced a lower length cutoff of 15 aa, we next analyzed previously published Ribo-seq datasets of human tissues 22,23 and performed (targeted) MS searches to identify and validate translated sORFs potentially encoding peptides below this cutoff (denoted sORFs 3-15 aa and peptides 3-15 aa , respectively). We identified 221 translated sORFs 3-15 aa , including 38 with endogenous peptide-level evidence, and showed that many of these are also translated in rodents and conserved across mammals. We designed an additional PRISMA screen that incorporated each of these 221 peptides 3-15 aa .
With our two PRISMA screens, we identified a number of recently evolved microproteins and very small peptides that interacted with proteins involved in cellular processes including splicing, translational regulation, and endocytosis. In line with these interactome results, we performed independent cellular assays that indicated that several candidates can modulate translation and endocytosis, respectively.
We here present evolutionary and interactome analyses of two underexplored parts of the human proteome. Our study can serve as a resource and blueprint to investigate the cellular roles of young and small human proteins that are being detected at a rapid rate but have been difficult to characterize.

RESULTS
Most human sORFs are young and have emerged de novo Recently, a community-driven effort supported by major gene and protein annotation projects (GENCODE-Ensembl, UniProt, HGNC, PeptideAtlas, and HUPO) produced a reference catalog of human Ribo-seq ORFs. 2 This catalog comprises 7,264 human sORFs longer than 15 aa (average length 43 aa) found in presumed long non-coding RNAs (lncRNAs), untranslated regions (UTRs), and alternative mRNA coding frames ( Figures 1A and  1B). We assessed the amino acid conservation of the putative microproteins encoded by these sORFs and compared our results with (1) a negative control set in form of length-matched sORFs sampled from UTRs and (2) a positive control set composed of 527 annotated proteins encoded by known short protein-coding sequences (sCDSs; length <100 aa, average length 76 aa) (STAR Methods).
We found that most of the putative 7,264 sORF-encoded microproteins (n = 6,506; 89.6%) lack significant protein homology across non-primate mammals and could therefore be classified as evolutionarily young ( Figures 1A-1D and S1A-S1C; Table S1; STAR Methods). The remaining 10.4% were conserved across non-primate mammals-a significantly higher fraction than observed in the negative control set (1.0%) ( Figure S1D). This indicated that the putative microproteins were more conserved than expected by chance. However, the fraction of conserved microproteins encoded by sORFs (10.4%) was still lower than what we observed for annotated sCDS (71.4% conserved) (Figures 1C and S1D; Table S1).
Since primatomorpha species (primates and colugos) are closely related, counterpart sequences of human sORFs are highly similar across their genomes, showing an average aa identity of 90.62% ( Figure 1A). Hence, sORF-encoded microprotein homologs in primatomorpha cannot be solely identified based on sequence similarity since unconstrained regions have not diverged enough. Therefore, we additionally evaluated the conservation of ORF structures across primatomorpha for all 6,506 young sORFs by reconstructing their ancestral sequences (A) Phylogenetic tree of the mammalian taxa comprising 120 mammalian species used for sORF genomic alignments (n = 7,264). sORFs were classified into lncRNA-ORFs (lncORFs), upstream ORFs (uORFs), upstream overlapping ORFs (uoORFs), internal ORFs (intORFs), and downstream ORFs (dORFs). For comparison, we included 527 sCDS. The heatmap displays the pairwise aa identity (%) of all sORFs and sCDSs (columns) across the 120 species' genomes (rows). (B) Numbers of evaluated sORFs and sCDS separated by ORF biotype. (C) Conservation scores (CSs) calculated across non-primate mammalian species. Dotted lines represent the CS cutoff of 8 (STAR Methods). sORFs and sCDS with (red) or without (light blue) significant protein sequence conservation are displayed below. (D) Dot plots displaying the average and 95% confidence interval of sORF, sCDS, and untranslated ORF truncation introduced by the most upstream stop codon in the aligned counterpart regions of the sequences. sORFs are divided by biotype and conservation of aa sequences. Internal sORFs (intORFs) were not considered due to additional constraints acting to preserve the frame of the sequence. (E) Top: total numbers of conserved (CS R 8) and young sORFs (CS < 8). Bottom: schematic of the classification of young sORFs (n = 6,506) based on conservation of ORF structures. We defined three levels of conservation: humans, old world monkeys, and primatomorpha. (F) Numbers of evolutionarily young sORFs per level of conservation of ORF structures. (G) Violin plots with the numbers of human 23 (left) and macaque 23 (right) brain Ribo-seq reads mapped to human brain translated ORFs (n = 830), by absence (light blue) or presence (dark blue) of conservation in macaque. Statistical differences were assessed by Wilcoxon signed-rank test. Horizontal bars represent the median values. ns, not significant. (H) Percentages of sORFs translated in the human brain with aligned counterpart regions translated in macaque. sORFs are divided by biotype and by the presence (dark blue) or absence (rlight blue) of conservation in macaques. (I) Schematic of modes of sORF evolution and numbers of young sORFs per category. Resource and inspecting the positional conservation of the start codon and the presence of an intact ORF (i.e., an ORF that is not truncated by premature stop codons) (Figures 1E, S1A, and S1E-S1J; STAR Methods). Within the subset of young sORFs, we found 4,914 (75.5%) with conserved structures across primatomorpha ( Figures 1E and 1F). There were 1,370 ORF structures that emerged during old-world monkey evolution (after the humanmacaque split), and 222 ORFs were human specific. The conservation of untranslated and translated ORF structures showed a similar phylogenetic distribution across primatomorpha lineages ( Figures 1D and S1D). However, we observed a significantly higher proportion of conserved ATG initiation codons in young sORFs compared with untranslated control sequences (63.30% ± 16.14% versus 35.83% ± 12.41%, Wilcoxon signed rank test, p value = 1.45 3 10 À21 , Figure S1F). These results suggest that there is significant selection pressure acting to preserve the initiation codons of young sORFs but not their frame structures.
We next compared published Ribo-seq data of human, macaque, and mouse brain samples and asked whether the absence of conserved ORF structures in non-human species led to a decrease in the levels of translation of the aligned counterpart sequences. Indeed, in comparison with regions with conserved ORF structures, we found that regions with nonconserved ORF structures exhibited lower levels of expression and ribosome occupancy, lower periodicity bias in ribosome footprints, and fewer actively translated ORFs ( Figures 1G, 1H, S1G, and S1H).
When we traced the genomic changes (i.e., DNA mutations) that led to the formation of the ORFs throughout primate evolution, we found that most young sORFs (63.0%) emerged de novo from ancestral non-CDSs of which 162 evolved in the human lineage ( Figure 1I; STAR Methods). Far less common was emergence through CDS duplication or fission of older protein-coding regions (3.3% and 0.5%, respectively) or through inserted endogenous retrovirus (ERV) elements and Alu repeats (1.3%). The mode of evolution could not be determined for the remainder of the investigated young sORFs (31.9%) ( Figure 1I).
Interactome profiling of microproteins translated from young sORFs with PRISMA Demonstrating that proteins engage in specific interactions with other proteins is a well-established step in gathering evidence toward putative protein functionality. 24 We applied PRISMA 17-20 to investigate the interactomes of young microproteins and the sequence motifs through which they mediate pro-tein-protein interactions (PPIs) (Figure 2A; STAR Methods). We selected 45 recently evolved microproteins with a median length of 53 aa based on their reproducible translation across human tissues and cell lines, 2,25-27 and for having protein-level evidence 2,5,6,22,28,29 Table S2; Data S1; STAR Methods). Nineteen out of 45 young microproteins were translated from presumed lncRNAs that had been associated with human diseases and six affected transcriptional profiles after microprotein overexpression or knockout based on previous publications 5,6 (Table S2; STAR Methods). We included 15 conserved microproteins with a median length of 65 aa for comparison, some with known interactomes (e.g., MRPL33, 30 MIEF1-MP [MP, microprotein] 31 ), as well as the well-characterized wild-type (WT) and mutated peptides of SOS1 and GLUT1 18,32 as positive controls (STAR Methods).
For PRISMA, each of the selected microproteins was divided into 15 aa long, overlapping peptides (tiles) with an offset of eight aa. This way, each aa per microprotein was represented in two subsequent tiles except for the first and last eight aa of each microprotein. Following this approach and including the control peptides, we analyzed 490 peptides in total (Figures 2A, S2A, and S2B; Table S3; STAR Methods). In order to determine specific versus non-specific interactions, we compared protein identifications in each peptide spot against all other peptide spots. The ratio of each log 2 -fold change to its standard error was computed for each detected protein, and the resulting p values were adjusted using the Benjamini and Hochberg method to control the false discovery rate (FDR). As performed previously, 18,32 we determined the significance cutoff based on known interactors of the SOS1 and GLUT1 control peptides (STAR Methods), retaining as many known interactors and as few non-reported interactors as possible. After applying this significance filter, we retained roughly 1.4% of all identified proteins as significant binders and detected up to 89% and 80% of known SOS1 and GLUT1 interactors (Figures 2C and 2D;  Table S3; STAR Methods). Additionally, we identified the expected sequence-specific interactome changes provoked by a single amino acid substitution between their WT and mutant versions ( Figure S2C) and recapitulated previously reported, endogenous interactors for conserved microproteins (including MRPL33 30 and MIEF1-MP 31 ; Figures 2E and S2M). These results demonstrate that PRISMA can identify sequence-specific and biologically relevant protein interactions.  (Table S2). Bottom: conserved (red) and young (blue) microproteins were sorted based on the highest interaction score (product of fold change and p value).  Figure S2N. (I) Percentage of essential proteins detected in the interactomes of conserved and young microproteins. No statistical differences were found among both groups (assessed by two-tailed Student's t test). The horizontal lines indicate 25%, 50%, and 75% quartiles, respectively. (J) Interaction scores for eleven young microproteins whose top interactor is an essential protein. Asterisks mark microprotein interactomes significantly enriched for essential proteins (assessed by Fisher's exact test, FDR < 0.05) (Table S3).

Resource
Young microproteins formed specific interactions with different classes of proteins (Figures 2B, bottom panel, 2F-2H,  and S2F; Table S3). For example, the mitochondrial proteins GOT2 and SLC25A33 stood out as the most significantly enriched interactors for the microproteins RP11-644F5.11-MP and RP11-464C19.3-MP ( Figures 2F and 2G), respectively, in line with their previously observed mitochondrial localization. 22 Between young and conserved microproteins, there were no apparent differences in the specificity, number, strength, or phylogenetic age of interactors ( Figures S2G-S2L). Notably, young and conserved microproteins both displayed an equal capacity to interact with proteins required for cell survival 33 (Figures 2H-2J; Table S3), and the interactomes of 11 young microproteins were enriched for essential proteins (Table S3). These results suggest that young proteins have the potential to engage in vital cellular processes.

SLiMs may drive microprotein-protein interactions
SLiMs 17-19 -three to ten aa-long stretches within intrinsically disordered regions of proteins 34 -might be important contributors to the interaction potential of microproteins since microproteins are less structured and enriched for such motifs. 21, 35 We identified putative disordered regions within the 60 investigated microproteins using the disorder prediction tool IUPred 36 and interrogated the eukaryotic linear motif (ELM) resource for functional sites in proteins 37 to annotate potential SLiMs (STAR Methods). Our predictions revealed 1,428 SLiMs in total, of which 429 are located within putative disordered regions of the microprotein candidates. Out of these 429, our PRISMA design covered 412 complete motifs. It should be noted that these included 87 proteolytic cleavage sites and 71 SLiMs whose binding capacity depends on additional requirements, e.g., posttranslational modifications (PTMs) or free C-terminal regions (Table S3) and are thus not likely to be validated by our PRISMA screen.
Based on known protein-SLiM interactions annotated in the ELM resource, 37 we searched for matches within interactomes of each microprotein tile, i.e., if a microprotein tile carrying a SLiM bound to an interactor that is known to bind this particular motif. In total, we detected 47 matching protein-SLiM interactions (Table S3; STAR Methods). Those included proline-rich sequences within the young microproteins PALLD-uORF-MP and RAB12-uoORF-MP that interacted with proteins containing Src homology 3 (SH3) domains involved in actin cytoskeleton organization and endocytosis ( Figures 3A and 3B), in a manner similar to the proline-rich SLiM of the SOS1 control peptide ( Figures 2C  and S2C). Furthermore, six young and three conserved microproteins interacted with kinases and harbored kinase phosphorylation and docking motifs ( Figures 3A and 3C). A general enrichment of kinases in interactomes of microprotein tiles carrying such motifs could not be observed (p value = 0.079, Fisher's exact test; STAR Methods). However, we detected phosphorylated tryptic peptides after MP overexpression for two of the nine microproteins (THAP7-uORF-MP and JHDM1D-AS1-MP) (Table S3; STAR Methods), indicating that the lack of PRISMAwide significance does not necessarily preclude the potential relevance of our observed kinase interactions.
We next focused on 190 interactors that were detected in pairs of neighboring, partly overlapping tiles within 39 microproteins ( Figure 3D; Table S3). Their repeated identification strengthened these interactions and the confidence that they were mediated by the overlapping sequence. For instance, two overlapping tiles of the young and nuclearly localized microprotein PVT1-MP each bound five serine/arginine-rich splicing factors (SRSF1, SRSF2, SRSF5, SRSF6, and SRSF7; enriched GO term: regulation of mRNA splicing, p adj = 0.011; CORUM: spliceosome, p adj = Figure 3. SLiMs may drive microprotein-protein interactions (A) Heatmap with fold changes of kinases and SH3-domain-containing proteins that interact with microproteins carrying a phosphorylation/kinase-docking motif or a proline-rich motif. (B) Peptide sequence and volcano plot with PRISMA results of a RAB12-uoORF-MP-derived peptide carrying a proline-rich motif (underlined). SH3-domaincontaining proteins are highlighted in red. (C) Peptide sequence and volcano plot with PRISMA results of the GAS5-MP-derived peptide carrying a phosphorylation motif (underlined). Kinases are highlighted in red. (D) Heatmap with fold changes of interactors detected in two overlapping peptides within one microprotein. Only microprotein tiles that share at least three interactors are plotted (Table S3). (E) Peptide sequences and volcano plots with PRISMA results of tile 2 and tile 3 of PVT1-MP. Splicing factors are highlighted in red. (F) Immunofluorescence stainings of FLAG-tagged PVT1-MP after overexpression in HeLa cells. Cell nuclei were stained with DAPI, mitochondria with anti-ATPIF1 antibody, and PVT1-MP-3xFLAG with anti-FLAG antibody. Scale bar represents 20 mm. (G) PLA in HeLa cells transfected with V5-tagged PVT1-MP and FLAG-tagged SRSF2. Red spots indicate PVT1-MP-V5 and SRSF2-FLAG interactions (additional images in Figure S3C). Cell nuclei were stained with DAPI. Controls: anti-FLAG single primary antibody only; anti-V5 single primary antibody only; both primary antibodies were omitted. As an additional control, the PLA was performed in untransfected HeLa cells ( Figure S3C). Scale bar represents 20 mm. (H) Peptide sequences and volcano plots with PRISMA results of tile 9 and tile 10 of LINC01128-MP. Tile 10 lacks the first amino acid of the clathrin box motif. Clathrins are highlighted in red. (I) Immunofluorescence stainings of FLAG-tagged LINC01128-MP after overexpression in HeLa cells. Cell nuclei were stained with DAPI, mitochondria with anti-ATPIF1 antibody, CLTC with anti-CLTC antibody, and LINC01128-MP-3xFLAG with anti-FLAG antibody. Scale bar represents 20 mm. (J) Representative images of fluorescently labeled transferrin (green) and EEA1 (red) detection in HeLa WT and LINC01128-MP knockout (KO) cells. Cell nuclei were stained with DAPI (gray) and EEA1 with anti-EEA1 antibody. Scale bar represents 10 mm. Images with lower magnification are shown in Figure S3H. (K) Beeswarm plot for quantification of transferrin and EEA1 co-localization in HeLa WT and LINC01128-MP KO cells using Manders' coefficient tM1. Each dot represents one analyzed cell. Per experiment, an average of 30 cells were quantified (n = 3). Statistical significance was determined using Student's t test.
(L) Volcano plot depicting significantly differentially expressed genes (in blue, À0.26 % log 2 (FC) R 0.26, p adj = 0.05) in RNA-seq data of wild-type versus LINC01128-MP KO cells. LINC01128 is highlighted in red and its transcript levels are not differentially expressed between wild-type and KO cells (p adj = 0.15); also see Figure S3H.

ll OPEN ACCESS
Resource 0.009), likely provoked by a stretch of three consecutive arginines within the sequence shared by both tiles (EGRRRAAS, Figures 3D-3F and S3A). This arginine stretch, which is located in a region of low complexity close to an RGG motif 38 (Table S3), resembles basic patches found in some human but mostly viral RNA-binding proteins. 38 As cell lysates were treated with benzonase before the PRISMA binding assay, these PPIs should be direct and not mediated by RNA. To support the PRISMA results, we tested SRSF2 and SRSF6 in overexpression studies and observed a partial co-localization with PVT1-MP with both splicing factors in HeLa cells ( Figure S3B). Moreover, we selected SRSF2 to perform an in situ proximity ligation assay (PLA). PLA is better suited to corroborate protein interactions than co-localization experiments as the PLA signal, provoked by hybridization of antibody-conjugated PLA probes, only occurs when the target proteins are in close proximity (<40 nm) to each other. 39 We observed a PLA signal in HeLa cells overexpressing V5-tagged PVT1-MP and FLAG-tagged SRSF2 but not in any of the negative controls ( Figures 3G and S3C). These results indicate that the interaction between PVT1-MP and SRSF2 detected with PRISMA can indeed occur in living cells.
We also found two overlapping tiles within the young LINC01128-MP to interact with multiple clathrin heavy-chain (CLTC and CLTCL1) and light-chain proteins (CLTA and CLTB) involved in vesicular trafficking and endocytosis. The binding to CLTC and CLTCL1 may be driven by a partially shared clathrin box motif (LLLLD, Figures 3D and 3H), which is usually present in endocytic cargo adaptor proteins (APs). 40,41 Within LINC01128-MP, the motif resides in a C-terminal extension that evolved in the human lineage through the loss of a stop codon ( Figure S3D). In line with the observed binding profile, LINC01128-MP localized to clathrin-rich foci ( Figures 3I and S3E), and the knockdown of the presumed LINC01128 lncRNA decreased endocytosis of epidermal growth factor (EGF) and transferrin in a previous genome-wide small interfering RNA (siRNA) screen performed by Collinet et al. 42 The microprotein had remained undetected at that time, and it was not investigated whether the effect could be attributed to the lncRNA molecule or the potential microprotein. To further investigate the role of LINC01128-MP in endocytosis and trafficking, we employed CRISPR-Cas9 to create a pool of cells carrying short insertions and deletions (indels) that lead to premature stop codons truncating the endogenous sORF encoding for LINC01128-MP ( Figure S3F; STAR Methods). When assessing transferrin trafficking in WT cells and the pool of modified cells (with 93% carrying premature STOP codons, Figure S3G), we observed a similar effect as in the study by Collinet and colleagues, 42 specifically a reduction of transferrin accumulation in early endosomes by approximately 30% (Figures 3J, 3K, and S3H; STAR Methods). RNA sequencing followed by differential expression analysis revealed that the sORF-truncating mutations did not significantly alter LINC01128 transcript levels (Figures 3L and S3I; STAR Methods) suggesting that the loss of the microprotein contributed to reduced transferrin accumulation. Although our approach reduced changes to the LINC001128 lncRNA to a minimum, we cannot completely exclude the impact of the short indels on potential RNA-mediated roles. Notably, genes differentially expressed upon microprotein truncation are enriched for extracellular matrix proteins and components of the plasma membrane.
These are involved in signaling receptor binding and growth factor activity, processes associated with endocytosis and vesicular trafficking 43,44 ( Figure S3J). Combined, these observations suggest a role for LINC01128-MP in the control of intracellular trafficking and endocytosis.
sORFs smaller than 16 aa (sORFs 3-15 aa ) are highly translated in multiple tissues and often conserved across mammals Next, we investigated the smallest potential peptides translated from independent ORFs in the human genome. For this purpose, we expanded the lower size cutoff of the currently cataloged Ribo-seq ORFs 2 to include ORFs below 16 aa in length (denoted sORFs 3-15 aa ). We reanalyzed Ribo-seq data obtained from 96 human samples of five different tissues (brain, testis, liver, kidney, and heart) 22,23 using RiboTaper 45 and ORFquant 46methods that leverage the 3-nt periodicity in ribosome footprints to newly identify translated ORFs. This resulted in the detection of 221 translated sORFs 3-15 aa (Figures 4A and 4B; Table S4) of which 69% were independently identified in at least two out of five analyzed tissues ( Figure 4C). We validated the translation of 182/221 (82%) sORFs 3-15 aa using a probabilistic algorithm for ORF detection (PRICE 47 ). We further retrieved a set of predicted translation initiation sites (TISs) from a deep learning transformer model trained on the sequence context of canonical TISs 48 and observed that TISs from sORFs 3-15 aa were significantly more likely to initiate translation than untranslated TISs from the same transcripts (p values ranging from 0.04 to 5.17 3 10 À82 ; Figure S4A; STAR Methods). The translation of the majority of these newly identified sORFs 3-15 aa was thus substantiated by independently developed frameworks that utilize ribosome footprint periodicity, probabilistic inference, and machine learning.
The vast majority of sORFs 3-15 aa (92%) were translated from 5 0 UTRs of protein-coding genes (uORFs) and detected at much higher levels than the downstream primary CDS (average fold change of 3.06 ± 1.09 per tissue ( Figures 4D and 4E). As the conservation of smaller sORFs is challenging to determine with approaches based on sequence homology, 49 we calculated the conservation of the sORF structures across 27 primates, 92 mammals, and three non-mammalian vertebrate species. We found that 181 out of 221 sORF 3-15 aa structures ($82%) were conserved across mammals, and for 36 cases, the conservation was extended to vertebrates ( Figures S4B-S4D; STAR Methods). In support of this, most were found to be translated in at least two out of four mouse tissues (61%; heart, brain, liver, and testis) and two rat tissues (76%; heart and liver) 22,23,50 ( Figure 4F; STAR Methods). We further validated the translation of a highly conserved peptide encoded by USP10-uORF in bird tissues (Figure S4C). Noticeably, sORF 3-15 aa structures displayed higher levels of species conservation compared with longer sORFs. This is possibly explained by their localization to 5 0 UTRs, as these regions contain conserved sequences that can impose additional evolutionary constraints on sORFs 3-15 aa located there, as well as the reduced likelihood of disrupting substitutions truncating very short sequences ( Figure S4G; STAR Methods).
To collect proteomic evidence for the peptides translated from the identified sORFs 3-15 aa , we searched the Human Proteome ll OPEN ACCESS

Resource
Map 51 resource as well as global proteome 52,53 and immunopeptidomic 6,54,55 datasets. For the peptide identifications of both shotgun and immunopeptidome data, we applied an FDR filter of <0.01 on the peptide level using the reverse-sequencebased target decoy approach implemented in MaxQuant. 56 The protein FDR filter was disabled, as had been performed previously for identifications of small proteins. 45,57 Moreover, the peptides 3-15 aa were included in the search database of the recent human HLA 2022-09 PeptideAtlas build 29,58 (STAR Methods), and analyzed with the PeptideAtlas build pipeline 29,58 and the Trans-Proteomic Pipeline 58 using a non-specific (no protease) search strategy. In total, this led to the endogenous identification of 27 peptides from public datasets, including 16 from the HLA PeptideAtlas build. Five peptides 3-15 aa were identified in multiple datasets with up to four different, unique peptides. Fourteen of the peptides 3-15 aa detected in immunopeptidomics datasets were supported by at least one MS peptide that was in silico predicted to be a high-affinity binder of the major histocompatibility complex I (MHC I) ( Figure S4F; Table S4). (H) Peptide sequence and chromatograms of fragment ions from synthetic and endogenous signature peptides of the SVIL-AS1-peptide 3-15 aa in K562 cells and the human heart. The star represents the oxidation of methionine. The dot product (dp) indicates the similarity to the matching spectrum of the synthetic peptide and ranges from 0 to 1 with higher scores indicating better similarities. We note that the detected peptide also matches an alternative microprotein isoform of SVIL-AS1 of 81 aa (Table S4).

Resource
To enhance the endogenous detection of these very small peptides, we additionally set up a targeted MS assay (parallel reaction monitoring [PRM]) in five human hearts and three cell lines (HEK239T, K562, and HeLa) (Figures 4G and 4H; Table S5; STAR Methods). Each identification was manually confirmed based on the analytical runs of synthetic peptide mixtures as well as the internal library-based fragment ranking, and only peptides detected in at least two biological replicates with a dot product of R0.7 were considered. This yielded evidence for 18 peptides 3-15 aa , increasing the peptide-level evidence to 38 out of 221 candidates (Tables S4 and S5).  Table S4). Hierarchical clustering of the enrichment values of all proteins identified in each peptide's pull down revealed several peptide features (length, hydrophobicity, and isoelectric point) that contributed to specific binding profiles ( Figure 5B).
After applying stringent filtering (STAR Methods), we detected on average 16 significant interaction partners for 166 out of 221 peptides 3-15 aa ( Figure S5H), several of which stood out for their particular interaction profiles (Figures 5C-5F and S5I). For instance, a peptide of only 5 aa translated from a highly conserved uORF within the MTMR3 gene (peptide sequence MLLWL) bound four clathrins (CTLA, CLTB, CLTC, and CLTCL1) as well as the clathrin assembly protein PICALM and the clathrin interactor CLINT1 (Figures 5C-5E). Interestingly, the interactome of this peptide resembled that of the GLUT1 mutant peptide carrying a di-leucine motif ( Figures 2D, S5F, and S5G), which is known to recruit clathrins via adapter proteins. 18 Furthermore, the interactomes of several peptides were enriched for proteins from distinct subcellular compartments, i.e., the nucleus ( Figure 5F, panels 1-3), suggesting an organelle-restricted function. A seven aa long peptide encoded by the uORF of STAT1 interacted specifically with four proteins regulating mitotic spindle assembly and cytokinesis ( Figure 5F, panel 4). These four proteins are not known to interact with each other, which excludes a secondary binder effect and supports direct interaction of the uORF-peptide with these proteins.

Peptide interactomes can predict modulators of cellular function
Out of the 221 small peptide interactomes, we identified 16 hydrophilic uORF-derived peptides 3-15 aa rich in arginine residues that predominantly interacted with components of the translational machinery ( Figures 6A-6C and S6A-S6F). In order to assess whether the presence of the translated uORF impacted downstream translation, we set up reporter assays for five of these candidates. To retain the natural sequence context of the uORFs, we inserted the candidates embedded within their endogenous 5 0 UTRs in front of the luciferase reporter. As controls for each candidate, we included ATG mutants (interrupted sORF translation) as well as arginine-to-alanine mutants (translated sORF with an altered sequence) ( Figure 6D, upper panel; STAR Methods). In four cases, the presence of the intact uORF significantly reduced reporter translation rates when compared against the respective uORF with mutated ATG ( Figure 6D). For two candidates (ASB15-uORF and PECAM1-uORF), the arginine-to-alanine mutants reversed the observed effect, i.e., downstream translation was not impacted ( Figure 6D). This indicates that the arginines are important for the repression of downstream translation of the main ORF. However, we cannot discern if the amino acid or the encoding nucleotides are responsible for the effect.
PRISMA also revealed four peptides 3-15 aa enriched for aromatic amino acids that bound APs involved in clathrin-mediated endocytosis ( Figures 6E-6G, S6G, and S6H). Two of these peptides 3-15 aa contain a tandem di-hydrophobic motif that was previously proposed as a non-classical AP-binding motif. 59 When assessing the impact of the AP-binding peptides 3-15 aa on endocytosis in vitro ( Figure S6I), we found that two out of three reduced the cellular uptake of a ligand by the endocytic receptor low-density lipoprotein receptor-related protein 2 (LRP2) in BN16 cells by 50% ( Figures 6H and 6I). This effect was akin to the reduction observed upon treatment with dynasore ( Figures 6H and 6I), a pharmacological inhibitor of clathrin-mediated endocytosis, but not upon treatment with a control peptide 3-15 aa that did not interact with APs in PRISMA ( Figure S6J). Based on these results, we hypothesized that the binding between these uORF-peptides and APs may inhibit clathrin recruitment during the formation of new vesicles and thereby reduce endocytosis ( Figure S6K).
In summary, our findings illustrate how PPIs detected by PRISMA can hint toward putative biological roles of previously unknown peptides, stimulating future efforts into the mechanistic roles of both evolutionarily young and very small peptides translated from short ORFs in the human genome.

DISCUSSION
Microproteins have gained increasing attention in recent years, but the biological significance of human evolutionarily young and very small microproteins has remained less well studied. 2,5,7,22,[60][61][62][63] We aimed to address this knowledge gap by investigating the evolutionary origins and interactomes of the putative microproteins and peptides encoded by young and very small ORFs.
To define the fraction of young human microproteins, we evaluated the conservation of the amino acid sequences of over  Resource 7,000 recently cataloged human sORFs across more than 90 mammalian species and found that most were not conserved to non-primate mammals. We further present a detailed resource describing the evolutionary age and mechanisms of origin of these sORFs across primatomorpha, which included the genomes of 27 different primate and colugo species. Since the genomes of these species are highly similar, standard sequence similarity search methods were not sufficient to reliably infer homology for constrained proteins and estimate their evolutionary ages. 64 We circumvented these limitations by assessing the conservation of translated sORF structures, i.e., we evaluated the conservation of the start codon and the presence of an intact ORF in the ancestor sequences of human young sORFs. We further estimated that 4,101 sORFs emerged de novo during primatomorpha evolution, including 162 human-specific ones. When translated into stable microproteins, this would increase the number of human-specific de novo microproteins reported previously 7 by an order of magnitude. This substantiates observations that sporadic protein evolution ''from scratch'' may occur at a higher rate than previously thought. 65-68 Of note, our evolutionary analyses are conservative because the presence of a homologous ORF sequence in other primate species does not indicate its expression per se. In primates, the extent of transcription and translation of young ORFs with conserved structures will require future studies that generate new pan-transcriptomes, -translatomes, and -proteomes from non-human primates.
In order to further investigate these novel microproteins, we employed PRISMA, which is highly suitable for the analysis of microprotein interactomes due to several technical advantages: (1) its high scalability allows the inclusion of hundreds of peptides which increases the statistical power to define significant interactions, (2) it does not rely on the ectopic expression of CDS vectors, (3) it does not require the addition of a tag (e.g., HA and FLAG), which is particularly problematic for small proteins, and (4) it does not depend on antibody-based affinity pull-downs that introduce antibody-specific background binding.
We show that young microproteins can bind proteins that are involved in diverse cellular processes, including proteins essential for cell survival. Moreover, our results illustrate how short sequence features such as SLiMs, which are prevalent within intrinsically disordered regions of microproteins, 21 may contribute to the ability of recently evolved microproteins to engage with the more conserved human proteome. This capacity to interact may be present as early as at, or shortly after their evolutionary origin, without the need to evolve larger and more complex three-dimensional structures. For example, and although we could not completely rule out potential RNA-mediated effects, LINC01128-MP KO experiments suggested that this microprotein plays a role in transferrin accumulation in early endosomes in vitro, potentially enabled through interactions of the microprotein's human-specific C terminus with endocytic proteins. This supports the idea that many young proteins can quickly become functional after they emerge de novo. 69,70 Further investigations are, however, necessary to delineate how LINC01128 may affect endocytosis.
Furthermore, we identified 221 novel small translated sORFs 3-15 aa by Ribo-seq in five human tissues. We demonstrated the translation of most of them by independently developed methodologies that exploit ribosome footprint periodicity (RiboTaper, 45 ORFquant, 46 and PRICE 47 ), machine learning-based inference of TISs transformer, 48 and evolutionary sequence alignments demonstrating conservation across mammals. The fact that the sORFs 3-15 aa were translated at high levels and the majority of their structures were conserved to rodents, as supported by Ribo-seq, might indicate biological relevance. Of note, we obtained putative peptide-level evidence for 38 out of 221 predicted very small peptides 3-15 aa with proteomics technologies. Each of these 38 peptides 3-15 aa was smaller than the smallest human peptide translated from an individual sORF annotated as protein-coding to date (MOTS-c 15 ; 16 aa). We would like to point out that MS identification of such small peptides is technologically challenging 4 and the possibility of false-positive identifications cannot be completely excluded.
Following the PRISMA results obtained for newly detected peptides 3-15 aa , we highlight a group of hydrophilic, argininerich uORF-encoded peptides 3-15 aa that bound components of the translational machinery. We observed in a luciferase reporter assay that four translated uORFs 3-15 aa reduced downstream translation, potentially through cis-mediated effects such as reinitiation inhibition 71 or ribosome stalling. 72 For two candidates, the translational inhibition of the downstream reporter seemed to depend on the peptide sequence and charge (i.e., presence of arginine residues). However, we cannot exclude that the (I) Results of the RAP endocytosis assay (five replicates per condition). Values were normalized to total protein content, and samples without RAP treatment were subtracted and then normalized to the treatment with RAP only (=100%). The PPARD-uORF-peptide, which did not bind APs, was included as a control (Figure S6J). The statistical significance was calculated using ANOVA and Tukey post hoc test.

OPEN ACCESS
Resource effect stems from the underlying nucleotide change. We continued to show that members of a separate class of novel peptides 3-15 aa could bind APs of the endocytic machinery and were capable of reducing endocytosis levels by 50% in vitro. As a means to control for a possible influence that the transactivating transcriptional activator (TAT) sequence attached to our candidate peptides might have on endocytosis, we included a control peptide that did not bind any endocytosis-related proteins in the PRISMA screen. This peptide did not influence endocytosis levels, indicating that the TAT-peptide per se does likely not (solely) impact the observed effect. We hypothesize that the candidate peptides 3-15 aa that reduce endocytosis might hinder clathrin recruitment and vesicle formation by competitive binding to APs. Interestingly, the protein encoded by the main CDS of one of these uORF-peptides-INSIG2-is known to mediate the feedback control of cholesterol synthesis. 73 This hints that the INSIG2-uORF could contribute to the impact of INSIG2 on circulating cholesterol levels by regulating endocytic uptake of cholesterol-rich LDL particles, possibly indicating a co-evolution of the uORF and the main protein. In light of the great potential that similarly sized synthetic peptides have shown as pharmacological compounds, 74 these peptides 3-15 aa may also be exploited therapeutically as modulators of endocytosis, or new inhibitors of translation, a mechanism commonly used in antibiotics. 75 In summary, our study provides new insights into the evolutionary origins and potential roles of evolutionarily young microproteins and very small peptides in humans. We describe motifresolution interactomes for hundreds of human microproteins and peptides, which may serve as a basis for candidatefocused, independent validation experiments. We anticipate that these insights will set the stage for future investigations of this underexplored part of the human proteome, which will be crucial for our understanding of protein evolution, adaptation, and human-specific biology.
Limitations of the study Although the interactions detected by PRISMA suggest that small peptides and young microproteins can take part in fundamental cellular processes, we recognize certain limitations of the assay. Longer microproteins need to be tiled into smaller segments, which leads to the loss of the natural protein context, and the impact of globular protein folds and domains on microprotein interactomes will be missed. Moreover, PRISMA employs synthetic peptides and does not recapitulate a peptide's stability, expression level, and concentration within cells. As a possible instability of the microproteins and peptides 3-15 aa would preclude them from building stable interactions in vivo, PRISMA can only yield information on possible interactions that can be used to derive functional hypotheses but need further validation. Furthermore, the use of a cell protein lysate may yield interactions that would otherwise not occur due to cellular compartmentalization, it is not possible to discriminate between direct and indirect interactions, and we lose the fraction of celltype-specific interactions that may occur in cell types other than the cell lysate used in this study. We highlight that PRISMA works with any cell or tissue lysate that is not suitable for conventional interactome approaches, such as hard-to-transfect cells (e.g., cardiomyocytes) or (human) disease-specific clinical samples. Lastly, we did not evaluate the impact of PTMs on microprotein interactomes, which will be of interest in future studies, particularly in regard to SLiMs that depend on PTMs. 41

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Norbert Hubner (nhuebner@mdc-berlin.de).

Materials availability
This study did not generate new unique reagents.
Data and code availability d All MS and RNA-seq data created in this study have been deposited online and are publicly available as of the date of publication. Microscopy data reported in this paper have been deposited to Mendeley Data and are publicly available as of the date of publication. The DOI is listed in the key resources table. This paper also analyzes existing, publicly available data. These accession numbers for the datasets are listed in the key resources table. d All original code has been deposited at GitHub as well as Zenodo and is publicly available as of the date of publication. The DOI is listed in the key resources table. d Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Retrieval of sORF and control sequences
We retrieved a set of 7,264 Ribo-seq ORFs longer than 15 aa (denoted sORFs) annotated as part of the Phase I GENCODE ORF annotation project (Table S1). 2 For our analyses, we combined sORFs annotated as downstream overlapping ORFs (doORFs) and downstream ORFs (dORFs), since doORFs are a rare category that represents only 0.8% of all sORFs. This resulted in five considered sORF biotypes: lncRNA ORFs (lncORFs, encoded by presumed long non-coding RNAs and processed transcripts), upstream ORFs (uORFs, encoded by 5 0 UTR sequences), upstream overlapping ORFs (uoORFs, encoded by 5 0 UTR sequences and partially overlapping an annotated CDS in an alternative frame), internal ORFs (intORFs, fully overlapping an annotated CDS in an alternative frame), and downstream ORFs (dORFs, encoded by 3 0 UTR sequences). In order to determine the significance of our findings in subsequent evolutionary analyses, we defined two additional control sets. Firstly, we selected a set of 527 annotated CDS sequences (Ensembl v.101 97 ) from genes in which all annotated protein isoforms were shorter than 100 amino acids (aa), selecting the longest CDS per gene and discarding incomplete CDSs without annotated start and/or stop codons (sCDS , Table S1). Secondly, we generated negative controls of untranslated ORF sequences. To this end we sampled length-matched sequences starting with ATG codons from non-coding regions of genes hosting uORFs and lncORFs and we translated them in silico. These untranslated regions did not overlap any annotated CDS or translated sORF sequence included in this study. We excluded genes hosting ORFs overlapping with conserved CDS sequences (uoORF, intORF, dORF), since the overlapping coding sequences can impose additional constraints on these ORFs. We generated a set of 2,068 and 2,914 lengthmatched untranslated regions in lncRNAs and 5 0 UTRs, respectively. This set covers 93.54% of the genes containing uORFs and lncORFs included in the analysis. For the rest of the cases, we could not extract any compatible sequence from the corresponding untranslated regions.

Whole-genome alignments across mammalian species
We downloaded a comparative genomics resource that comprises pre-built whole-genome nucleotide alignments across 120 mammalian species 98 to calculate the extent of conservation of all human ORFs across the mammalian lineage. Additionally, we downloaded Cons30 multiple alignments from UCSC 84 comprising 27 different primate species (non-primate species were removed from the alignment). For a set of ten species with high-quality genomes (rhesus macaque, mouse, cow, dog, horse, elephant, opossum, chicken, western clawed frog, zebrafish) we also retrieved whole-genome Liftover chains from UCSC. 84 We included chicken, western clawed frog, and zebrafish as evolutionary outgroups to find potential ORFs with vertebrate conservation that extended beyond the mammalian clade. Finally, for every ORF we designed a custom script to define local multiple alignments including the species where the region could be fully aligned, discarding partial or ambiguous alignments.

Protein sequence conservation of sORFs and sCDS
Standard homology-detection approaches are not adequate for discovering sORF homologues in full transcriptomes or genomes due to the extense search space and the short length of the query sequences. 49,99 Hence, we instead estimated the levels of protein similarity of sORF-and sCDS-encoded microproteins (>15 aa), as well as untranslated ORF controls, to evaluate the significance of the similarity across a reduced set of previously aligned counterpart regions in mammals extracted from whole-genome alignments. When at least one genome region was aligned to an ORF, we ran BLASTp (v.2.7.1) 79 against species-specific databases containing all aligned regions and extracted the E-value of the alignment of the ORF against each specific counterpart region. Next, we calculated a conservation score (CS) for each encoded microprotein, defined as the negative logarithmic value (-log 10 ) of the median E-values across all aligned species. Since each species-specific database contained variable numbers of counterpart regions due to differences in genome quality and divergence, we decided to use E-values instead of bit-scores as they are adjusted by the size of the sequence databases. However, pairwise alignment bit-scores computed by BLAST can also clearly distinguish between young and conserved sequences ( Figure S1C). Because of the high genome similarity across primate species, we limited this score to non-primate mammalian species, where unconstrained genome sequences show higher divergence. We therefore excluded all mammalian species that were classified as primates or colugos (primatomorpha). Phylogenetic reconstructions have shown that the genomes of colugo and primate species are quite related, hence colugos can be phylogenetically classified as a sister taxon to primates. 100 Unaligned species were not considered, so counterpart regions which evolved across different branches of mammalian evolution were evaluated for conservation.
Finally, we selected a CS significance cutoff R 8 for consistent amino acid sequence similarity conservation across non-primate mammals. We estimated a FDR < 0.01 by randomly extracting 10,000 size-matched sequences from untranslated regions of the ll OPEN ACCESS Resource same genes that host sORFs and calculating the distribution of CS scores for these sequences. Hence, amino acid sequences below the CS threshold are defined as 'conserved proteins' through mammalian evolution, as opposed to the rest of sequences that encode 'evolutionarily young' proteins without conserved protein homologues in non-primate mammals. Of the 7,264 cataloged sORFs and 527 sCDS, 758 sORF-encoded microproteins (10.43%) and 379 sCDS-encoded microproteins (71.37%) were conserved across mammals, including at least 375 sORF-encoded microproteins and 257 sCDS-encoded microproteins with significant conservation in some vertebrate species (translated orthologous regions of chicken, western clawed frog, and/or zebrafish with pairwise E-value < 10 -4 ). Conserved proteins were aligned to an average of 71 non-primate mammalian species, with 95% of ORF sequences aligned to 5 or more species. 69.3% and 59.7% of the non-conserved sORFs could be aligned to at least one and more than five nonprimate mammalian species, respectively.
Of note, the fraction of sORF-encoded microproteins with detectable homologues in non-primate mammals was higher for ORFs overlapping protein-coding sequences ($30-35% of uoORFs and intORFs) compared to other ORF biotypes ($4-5% of lncORFs, uORFs and dORFs) ( Figure S1D). Conserved sORF-encoded microproteins were longer than young ones (median length, 64 versus 37 amino acids, Figure S1A) and exhibited higher levels of protein similarity than both non-conserved sORF-encoded microproteins as well as untranslated ORF sequences from the same transcripts ( Figure S1B).
Furthermore, we searched for signatures of evolutionary protein-coding potential in the set of young sORFs, sCDS, and untranslated controls by running PhyloCSF 80 with default parameters. PhyloCSF scores were calculated across the retrieved multiple alignments for primates and for mammals. Young sORFs displayed similar PhyloCSF 80 scores to untranslated control sequences across primates and mammals, indicating that the codon sequences of young sORFs were not constrained at the protein-coding level ( Figure S1E).

Conservation of sORF structures
We firstly adapted a previously published method 101 and used PRANK v.170427 81 to reconstruct ancestral sequences based on the built Cons120 and Cons30 whole-genome multiple alignments. Next, we evaluated the conservation of young sORF, sCDS and untranslated ORF structures (structural conservation) across ancestral sequences in the primatomorpha lineage. An ORF structure was considered as conserved in an ancestral region if the ATG translation initiation site (TIS) was present in the same position or within an in-frame window 6 nt downstream of the human ATG position, and if R 70% of the sequence did not contain stop codons truncating the ORF. For the cases with different alignments in Cons120 and Cons30, we selected the one with the highest level of conservation. The considered lineages were humans (<6-8 Million years ago (Mya), age 0), old world monkeys ($35 Mya, age 1) and primatomorpha (including primates and the only two living colugo species; $88 Mya, age 2). Hence, these categories define the most distant ancestral primatomorpha sequence predicted to contain the full ORF structure. These ORFs could be fixed across all species from the lineage, they could be segregating and only be present in a subset of species from the lineage, or they might have evolved convergently in independent primate and non-primate lineages (although with highly divergent protein sequences).
The numbers of species assigned to each lineage and the genome releases used to map sequencing datasets are described in Table S1.

Expression and translation of sORFs in mammals
To determine if the absence of a conserved ORF structure in a specific species was supported by the absence of ribosome occupancy and/or translation in the same region, we downloaded human, macaque and mouse Ribo-seq data from brain tissue, three replicates each (ArrayExpress accession number E-MTAB-7247 23 ). Ribo-seq reads were trimmed for adapters, filtered to remove common rRNA, tRNA and mitochondrial RNA contaminants, and mapped to the human, macaque, or mouse genome using STAR v2.5.2b 82 with maximum of 2 mismatches. Next, we ran ORFquant 1.00 46 to call translated ORFs in the human brain and subsetted 830 translated sequences that were also found in the set of 7,264 cataloged sORFs. Subsequently, we extracted local sORF coordinates from non-human species genomes using LiftOver chain alignments and ran BEDTools v2.27.1 83 to quantify the number of reads overlapping these sequences. In both macaque and mouse species, the absence of conservation of sORF structures led to a strong significant drop in the number of mapped Ribo-seq reads (macaque: median of 79 reads vs 14 reads for conserved and nonconserved ORF structures, Wilcoxon signed rank test p-value = 9.11 x 10 -33 ; mouse: median of 97 reads vs 8 reads for conserved and non-conserved ORF structures; Wilcoxon signed rank test, p-value = 2.58 x 10 -61 ; Figures 1F; Figure S1G), while the levels of ribosome occupancy remained constant in human. We additionally applied a simple binomial probability test to determine how many of the macaque and mouse brain counterpart regions had significant Ribo-seq periodicity biases (p-value < 0.01), as previously done by Patraquim and colleagues. 102 In support of our method, counterpart regions of human brain ORFs without conserved structures displayed lower periodicity biases in macaque and mouse ( Figure S1H). However, we still found a group of ORFs without conserved structures but with significant periodicity in macaque (27%) and mouse (18%). Thus, a small proportion of the non-conserved ORFs are still translated into shorter ORFs (<70% of human length sequence) or use alternative translation initiation or splice sites.
Moreover, we ran ORFquant to call translated ORFs in the macaque brain and found that counterpart regions of human ORFs without conserved structures were depleted from actively translated ORFs in macaque (Test of equal proportions, p-value < 0.05; Figure 1G). In addition, we retrieved a unified set of translated ORFs from several mouse tissues 103 and determined how many of these ORFs overlapped regions aligned to human sORFs with different levels of conservation ( Figure S1I). As expected, mouse regions that could be aligned to evolutionarily young ORFs contained a very low proportion of translated ORFs (5.30%) compared to ll OPEN ACCESS Resource conserved sORFs present in mammals (39.66%). This indicated that, for human brain translated ORFs, the corresponding truncated ORF sequences were not actively translated in other non-human species.
We next estimated how many counterpart regions potentially containing cataloged sORFs were supported by RNA expression evidence in other species. Chimpanzee, macaque, mouse, cow, dog, horse, elephant and opossum transcriptomes and genomes were retrieved from Ensembl v.98. Chimpanzee and macaque RNA-seq data from brain, heart, liver and testis were downloaded from Gene Expression Omnibus with accession code GSE69241. 104 RNA-seq reads were trimmed for adapters and mapped to each corresponding genome using STAR v2.5.2b 82 with a maximum of 4 mismatches. Chimpanzee and rhesus macaque RNA-seq data from the four tissues were combined to assemble a species-specific reference-guided transcriptome with Stringtie v1.2.1 85 (parameters -M 0.5 -j 3 -p 4 -f 0.2). For each species, we then searched for annotated and/or assembled transcripts overlapping the previously generated LiftOver regions. We found that 90-93% and 74-92% of the counterpart regions of conserved sORFs overlapped transcripts expressed in primate and non-primate mammalian lineages, respectively. In contrast, evolutionarily young sORFs were less commonly expressed in rodents (60%), especially in more distant mammalian lineages, such as ferae (32%). This indicated that, in mammals, the presence of conservation is often linked to the presence of expression ( Figure S1J). A similar trend could be observed for the expression of primate-specific versus mammalian-and vertebrate-specific sORF 3-15 aa ( Figure S4B).

Mode of evolution of sORFs
To identify the mode of evolution of evolutionarily young sORFs, we ran BLASTp to inspect whether any of the ORF sequences displayed significant homology to any other annotated protein in the human, macaque, mouse, cow, dog, horse, elephant and opossum proteomes (Ensembl v.98). For sORFs with significant matches (E-value < 10 -4 ), we further assessed if the candidate orthologous protein was translated in the same aligned genomic region or a different one by analyzing the previously generated LiftOver coordinates for these species. Orthologs found in a different region were defined as 'CDS duplications', while the ones found in the same aligned genomic region were defined as 'CDS fissions', provided that both the sORF and the rest of the human protein exhibited homology to different parts of the same protein in the other species.
For the remainder of the sequences, we re-analyzed the generated multiple and LiftOver alignments to trace the sORFs back to their evolutionary origins. We classified an sORF as 'de novo' if the age of the aligned region predated the age of the ORF, hence being able to spot the mutations responsible for the birth of the ORF sequence. For the cases where the region could not be aligned beyond the lineages where the ORF structure was conserved, we downloaded transposable element annotations from RepeatMasker v.4.1.0 86 and inspected whether both the ORF and the region emerged as a result of the insertion of an endogenous retrovirus (EVR) or alu element (category 'EVR/Alu derived'). The mode of evolution of the remaining sORF sequences could not be correctly assessed with the available data, since the ancestral region evolved at the same time as the 'orphan' ORF sequence, and we classified these cases as 'not known'. A comparison of the main findings of our approach and the one recently developed by Vakirlis and colleagues 7 is available in the following section.

Comparison of the modes of evolution with a previous resource
We compared our resource on modes of evolution with a recent study from Vakirlis et al. that evaluated the modes of evolution of 715 ORFs structrures. 7 Vakirlis et al. used aligned sequences from the UCSC 100-way phylogenetic tree to reconstruct the ancestral sequences and evaluate frame conservation -independent of the presence or absence of an initiating ATG codon-and the expression of the loci in other species. This set of 715 ORFs was extracted from a published ORF dataset, 6 which was one of the seven datasets included in the GENCODE sORF resource that we included in our study. Of these 715 ORFs evaluated by Vakirlis et al., 452 overlapped our sORF list. The reasons why 263 ORFs were excluded from our dataset were diverse: some overlapped in-frame annotated coding sequences and pseudogenes, had a length under the selected cutoff (< 16 aa), or could not be fully mapped to annotated GENCODE transcripts. Vakirlis et al. found 155 out of 715 ORFs to have evolved de novo, of which 94 overlapped with our ORF set. We similarly classified 60 out of these 94 (63.82%) ORFs as de novo evolved, while for 31 cases we could not assign any mode of evolution ('not known'). Only for 3 out of 94 (3.19%) cases we found a duplication event that would disqualify the ORF as having emerged de novo as per our analysis. These are: (1) c10riboseqorf103 (RPARP-AS1_104210065_114aa), which is encoded by an antisense lncRNA and the ORF contains a partial duplicated region which overlaps with the 3 0 UTR of the sense gene c10orf95, (2) c19riboseqorf102 (ZNF585A_37701369_35aa), which is a uORF harboring a partially duplicated region of the local neighbor gene AC012309.1, and (3) cXnorep31 (ZNF81_47696378_89aa), which contains an exon derived from an AluSz element that is also integrated in the coding sequences of several primate genes. Moreover, Vakirlis et al. reported 7 de novo ORFs as being human specific. Three out of these 7 overlapped the ORF list evaluated in our study and we also classified them as emerged de novo, although we defined an older primatomorpha origin for one of the ORFs (MALAT1_65266767_39aa / c1riboseqorf84) and a human origin for two of them (PTPRF_43996733_27aa / c1riboseqorf84 and TTC9C_62496064_18aa / c11riboseqorf70). Our study additionally identified 160 human-specific de novo ORFs, including 147 cases that were not evaluated previously by Vakirlis et al. and likely therefore not yet reported as de novo human-specific ORFs. We classified an additional 13 ORFs as being human specific that Vakirlis et al. reported as being conserved in non-primate mammals. Although our results generally overlap very well, the observed discrepancies in the mode of evolution for some of the cases might be due to differences in the parameters for the computational methods (i.e., we require conservation of the initial ATG codon) and the selection of distinct genome-wide alignments with different numbers of primate, mammalian and vertebrate species.

Resource
Candidate selection for PRISMA of microproteins translated from recently evolved sORFs We selected 45 evolutionarily young microproteins (CS < 8, longer than 15 aa) to investigate their interactomes with PRISMA. This included 41 microproteins putatively encoded by cataloged Ribo-seq ORFs (32 lncORFs, 5 uORFs, 4 uoORFs) as well as 4 microproteins putatively encoded by novel sORFs outside annotated gene regions discovered by us in a recent study (3 lncORFs and 1 uORF). 22 We additionally included 15 microproteins translated from conserved mammalian sORFs to compare the interactomes of young and conserved proteins of similar protein sizes. Four out of the 15 selected conserved microproteins were translated from lncORFs, three from uORFs and eight were sCDS (annotated small proteins). Four of the conserved sCDS (MIEF1-MP, MTLN, NDUFB3 and MRPL33) evolved during vertebrate evolution and had been included in interactomics studies before. 30,31,105,106 The properties, conservation and mode of evolution of each candidate were calculated and reported in Table S2.
In addition to the levels of conservation, we based the selection of the PRISMA candidates on the likelihood of these sORFs to be translated into stable microproteins. Therefore, we collected evidence for sORF translation in multiple datasets, as well as the exogenous and endogenous detectability of their translation products (microproteins). Moreover, we considered the potential relevance of the sORF or the host gene in disease. Details on how we collected all this information are described below.

Ribo-seq datasets
We counted the number of Ribo-seq studies that supported the translation of candidate sORFs. To this end, we retrieved evidence from the study of Mudge and colleagues that collected data from seven Ribo-seq studies to create a first consolidated sORF catalog. 2

ORF databases
We investigated how many of the candidate sORFs were reported in three public ORF annotation databases: SmProt, 25 MetamORF 26 and sORF.org. 27 Nine were detected in one database, 12 in two and 16 in all three databases. In vitro translation evidence We searched for candidate sORFs that were able to produce detectable microproteins in coupled in vitro transcription:translation assays, as described and reported in our previous study, 22 and found 23 out of 24 candidates that were tested.

Antibody-based microprotein detection after overexpression
We surveyed our previous study 22 as well as three others 5,6,28 for sORFs that produced detectable microproteins after ectopic expression in cultured human cells. In all four studies, epitope-tagged microproteins were overexpressed in human cell lines and detected either via immunoblotting or immunofluorescence. Additionally, in the current study we overexpressed 31 microproteins with a C-terminal 3xFLAG-tag and detected protein expression of 26 candidates by immunofluorescence (details in STAR Methods section detection of overexpressed microproteins by immunofluorescence and Data S1). Endogenous mass spectrometric evidence We collected protein expression evidence of 15 candidate sORF-encoded microproteins from our previous study, based on targeted mass spectrometry (selected reaction monitoring, SRM) data, 22 and acquired evidence for four more candidates (PVT1-MP, MRPL33, NDUFB3 and MIEF1-MP) in a targeted mass spectrometry assay (parallel reaction monitoring, PRM) performed in this study (details in STAR Methods section parallel reaction monitoring (PRM) proteomics). We additionally searched for identified peptide spectrum matches (PSMs) that uniquely mapped to sORF-encoded microproteins in a previous meta-analysis of 16 published MS searches, including both shotgun MS and HLA peptidomics studies. 2 For 32 sORFs, we found at least one PSM mapping to their encoded amino acid sequences. We also retrieved evidence of sORFs protein expression from PeptideAtlas (2022-01). 29 For seven sORFs, we found evidence of PSMs uniquely mapping to their respective translated microproteins in HLA peptidomics datasets. However, for five out of seven cases this evidence was limited to single peptide identifications. While these peptides were not assigned to any other translated sequences in the human proteome, further manual curation will be required to validate the collected MS evidence, especially when only supported by a unique PSM. Candidate disease relevance and prior indications of potential microprotein function We collected indications for disease relevance to prioritize candidates. To this end, we took into account whether the microprotein host gene was a presumed lncRNA that had been implicated in disease, anticipating that the encoded, but previously missed microprotein might have contributed to the observed phenotypic effect. We interrogated the manually curated database EVLncRNAs 2.0, 107 which contains lncRNAs whose function and disease association was validated by low-throughput and targeted experiments. Similarly, we searched for sORFs experimentally interrogated in different functional assays to find candidates with reported biologically active roles. First, we used the output data from the functional assay on 553 ORFs from Prensner and colleagues 5 and found 11 ORFs whose microproteins were included in our PRISMA design. We extracted two functional scores based on CRISPR phenotype and transcriptional activity score (transcriptional perturbation after overexpression). An ORF scored positive if the CRISPR phenotype was significant in the original study (CRISPR phenotype = 1) and/or the transcriptional activity score was higher than or equal to 0.2. None of the 11 microproteins had an effect on cell growth, but five young (RP11-140K17.3-MP, DANCR-MP, PRR34-MP, SNGH8-MP, and SNHG6-MP) and three conserved microproteins (MTLN, MKKS-uORF-MP, and MIEF1-MP) altered cellular transcription profiles after overexpression. Second, we selected 771 sORFs translated in the dataset published by Chen and colleagues, 6 one of the studies selected for the GENCODE Ribo-seq ORF catalogue, and retrieved CRISPR and Perturb-seq scores from the same study for 10 ORFs encoding microproteins included in our PRISMA design. An ORF was positive if the CRISPR score was significant in the original study (p-value % 0.5) and/or the Perturb-seq assay found pathways affected significantly after knocking-out the ORF. Four conserved microproteins (RNF10-uORF-MP, TUG1-MP, MIEF1-MP and IFRD1-uORF-MP) influenced cell viability and one young microprotein (PPP1R15A-uORF-MP) and three conserved microproteins (MIEF1-MP, LINC00881-MP and IFRD1-uORF-MP) scored in the Perturb-seq assay. Combined, the merger of both studies yielded 13 ORFs whose microproteins were included in our PRISMA design that either affected cellular growth or transcription profiles in at least one of the two investigated studies, of which seven would fall into the 'young' category, and six would be considered conserved. Additionally, of the 221 peptides 3-15 aa , 23 had been previously interrogated by Chen and colleagues, while Prensner and colleagues only included candidates >15 aa. Of those peptides 3-15 aa included by Chen and colleagues, six scored positive, i.e., impacted cell viability.

Summary of selection criteria
The selection criteria for each candidate are summarized in Table S2. Briefly, 30 out of 45 young and 11 out of 15 conserved sORFs were found to be translated in at least two independent sORF studies. Exogenous protein-level support for encoded microproteins was obtained for 44 sORFs from either in vitro translation assays (n young = 21/45; n conserved = 3/15) or antibody-based detection after overexpression (n young = 39/45; n conserved = 13/15). Mass spectrometric detection supported endogenous production of 29 young and 12 conserved microproteins in human cell lines and tissues. Seven young and six conserved microproteins were translated from ORFs with hits in CRISPR and overexpression screens, and 22 were translated from lncRNAs implicated in disease, including cardiovascular disease (10) and cancer (17).

Detection of overexpressed microproteins by immunofluorescence
Synthetic gene fragments containing the codon-optimized coding sequence of candidate microproteins with a C-terminal 3xFLAG tag were synthesized and cloned into a customized plasmid for mammalian expression by Genewiz Europe (Leipzig, Germany; constructs available upon request). The 3xFLAG-tagged human microproteins were overexpressed in HeLa cells and visualized using immunofluorescence as described by us previously. 22 Human HeLa cells were grown on glass slides in 12-well plates for 24 h and transfected with plasmids encoding c-terminally 3xFLAG-tagged microproteins using Lipofectamine 2000 according to manufacturer's instructions. The plasmids used are available upon request. 24 h post transfection cells were fixed with 4% paraformaldehyde (PFA) for 10 min at room temperature (RT) and washed three times with ice cold phosphate-buffered saline (PBS). Cells were permeabilized and blocked for 1 h with 2.5% bovine albumin serum (BSA), 10% anti-goat serum (NGS) and 0.1% Triton X in PBS. After washing the cells, overexpressed microproteins were stained with anti-FLAG mouse monoclonal antibody (1:500 in PBS with 5% BSA, F1804, Sigma Aldrich) for 1 h at RT. Mitochondria (1:1000 rabbit anti-ATPIF1 in PBS with 5% BSA, #13268, Cell Signaling Technology, Danvers, MA, USA) and in the case of LINC01128-MP overexpression clathrin-coated vesicles (1:100 rabbit anti-Clathrin Heavy Chain (P1663) in PBS with 5% BSA, #2410, Cell Signaling) were co-stained in this step. Afterwards cells were washed and incubated with fluorochrome-labeled secondary antibodies (1:500 in PBS with 5% BSA, Alexa Fluor 488 anti-rabbit and Alexa Fluor 594 anti-mouse (Invitrogen, Carlsbad, CA, USA) for 30 minutes at RT. Cells were washed again, stained with 4-6-diamidino-2-phenylindole (NucBlue Fixed Cell ReadyProbes Reagent, R37606, Thermo Fisher) for 5 minutes at RT and mounted onto glass slides using ProLongTM Gold antifade reagent (Molecular Probes; InvitrogenTM). Images were taken with a LEICA SP8 confocal microscope using a 63x objective and analyzed using ImageJ (v1.53c). 108

PRISMA of microproteins translated from cataloged sORFs Experimental setup
We adapted the PRISMA assay from the PRISMA assay introduced previously by others and ourselves. 17-20 Each of the 60 putative microproteins was divided into 15 aa long, overlapping peptides (tiles) with an offset of eight aa. This resulted in 478 tiles (minimum of two, maximum of 21, average of eight tiles per microprotein). To evaluate if arbitrary peptides derived from untranslated RNA sequences would serve as a suitable control for our screen, we generated a set of arbitrary peptides which we translated in silico from non-coding regions (5 0 UTRs or lncRNA exons without translation in our Ribo-seq data). These sequences start with ATG codons and are located in the genes hosting the 271 microproteins and peptides included in the PRISMA analysis, with length-matched distributions. We only considered regions that were not covered by annotated coding sequences nor by sequences from the set of 7,264 sORFs >15 aa and 221 sORFs 3-15 aa analyzed in our manuscript. We found that the 45 young microproteins included in the PRISMA analysis present similar amino acid compositions compared to untranslated sequences, as observed in the PCA (Figure S2D). This is in line with our observation that most of the young microproteins recently emerged de novo from non-coding RNA sequences. Moreover, we generated 10,000 sets of shuffled sequences for each of the 45 young microproteins included in the PRISMA screen and compared the numbers of SLiMs predicted in these sequences. The numbers of SLiMs in young microproteins and shuffled sequences are not statistically different (978 SLiMs in young microproteins and an average of 948 SLiMs in shuffled sequences, p-value = 0.16, Figure S2E). Because of this, we expect the extent and specificity of the interactomes of untranslated sequences, shuffled sequences and young microproteins to be rather similar, indicating that arbitrary or random peptides would not serve as a helpful control. Instead, we included four characterized control peptides derived from the proteins SOS1 and GLUT1 that had been investigated in previous proteomic interaction screens. 18,32 In total, 490 peptides were spot-synthesized (SPOT synthesis technology) on three cellulose membranes (JPT Inc., Berlin, Germany), with the GLUT1 and SOS1 peptide controls present on each membrane. Each spot carries approximately five nmol of peptide covalently bound to the cellulose-ß-alanine-membrane. Peptides were acetylated at their free N-termini to enhance stability and to better recapitulate the uncharged nature of a protein backbone. All three biologically distinct membranes were ordered and processed in triplicates. Sequences of spotted tiles can be found in Table S3. The spotted peptides have been referred to as ''baits'', the proteins that are bound by these peptides as ''prey''.

Resource
Protein lysate preparation HEK293T/17 cells were grown in 14 cm dishes (Sarstedt) as described above. All following steps were performed on ice and only icecold buffers were used. Cells were washed with PBS, scraped, transferred into falcon tubes and centrifuged for 5 min at 1000 g. After an additional wash with PBS cell pellets were resuspended in lysis buffer (50 mM HEPES pH 7.6 at 4 C, 150 mM NaCl, 1 mM EGTA, 1 mM MgCl2, 10% Glycerol, 0.5% Nonidet P-40, 0.05% SDS, 0.25% sodium deoxycholate and cOmpleteÔ EDTA-free protease inhibitor (Roche) (0.7 mL per 14 cm dish) and incubated for 30 min on ice. Five mL (1250 U) of Benzonase (Merck) were added, followed by another 15 min incubation and 15 min centrifugation step at 20,000 g. The supernatant was transferred into a fresh tube and the protein concentration was determined with the PierceÔ BCA Protein Assay Kit (Thermofisher Scientific) following manufacturer's instructions. The protein concentration was adjusted to 5 mg/mL with lysis buffer. The protein extract was directly used for the PRISMA assay. Sample preparation for mass spectrometric analysis Membranes with spot-synthesized peptides were equilibrated at RT for 15 min in wash buffer (50 mM HEPES pH 7.6 at 4 C, 150 mM NaCl, 1 mM EGTA, 1 mM MgCl2, 10% Glycerol), blocked with 1mg/mL tRNA (Invitrogen; diluted in wash buffer) for 10 min and washed again twice with wash buffer for 5 min. Afterwards, membranes were incubated with HEK239T protein lysate (5 mg/mL) for 2 h at 4 C while shaking, followed by 3 washing steps with a wash buffer for 5 min at 4 C, and were dried for 15 min at RT.
Peptide spots were punched out using a 2 mm mouse ear puncher and transferred directly into 20 mL of urea sample buffer (6M Urea, 2M Thiourea, 10mM HEPES). Samples were reduced in 12 mM dithiothreitol (DTT) solution for 30 min at RT and alkylated in 40 mM chloroacetamide for 45 min at RT in the dark. To digest proteins bound to the spotted peptides, samples were diluted with 100 mL of 50mM ammonium bicarbonate (pH 8.5) buffer containing trypsin (Promega; 5 mg/mL) and LysC (Wako; 5 mAU/mL), and incubated overnight at RT. The proteolytic digestion was stopped by adding 4 mL 25% trifluoroacetic acid. Peptides were extracted and desalted using StageTip protocol. 109 LC-MS/MS Peptides were eluted using Buffer B (80% acetonitrile and 0.1% formic acid), organic solvent was evaporated using a speedvac (Eppendorf) and samples were diluted in Buffer A (3% acetonitrile and 0.1% formic acid). Peptides were separated on a 20 cm reversed-phase column (inner diameter 75 mm, packed with ReproSil-Pur C18-AQ 3 mm resin (Dr. Maisch GmbH)) using a 45 min gradient with 250 nl/min flow rate of increasing Buffer B concentration on a High Performance Liquid Chromatography (HPLC) system (ThermoScientific). Peptides were ionized using an electrospray ionization (ESI) source (ThermoScientific) and analyzed on an Orbitrap Fusion instrument (ThermoScientific). Precursor survey scans were performed at 120K resolution with a 2 3 10 5 ion count target. Dynamic exclusion for selected precursor ions was 30 s. MS/MS was performed with a 1.6 m/z isolation window, HCD fragmentation with normalized collision energy of 32, ion count target of 1x10 4 and maximum injection time of 300 ms. The instrument was operated in top speed mode with 3 s cycles. Replicates were measured in batches with a different run order for each batch. A blank run was placed after each analytical run.

Data analysis
The resulting raw files were analyzed using the MaxQuant software package 1.6.0.1. 56 The internal Andromeda search engine was used to search MS2 spectra against a decoy human UniProt database (Human.2019-07) and an in-house database containing PRISMA peptide and microprotein sequences. The search included variable modifications of methionine oxidation, N-terminal acetylation, deamidation (N and Q) and carbamidomethyl cysteine as fixed modification. The FDR was set to 1% for peptide and protein identifications. Unique and razor peptides were considered for quantification. Retention times were recalibrated based on the built-in nonlinear time-rescaling algorithm. MS2 identifications were transferred between runs with the ''match between runs'' function. The integrated LFQ quantitation algorithm was applied.
Following analyses were done using R v.3.6.1 87 and adapted from Meyer and colleagues 18 with slight modifications. The resulting text files were filtered to exclude reverse database hits, potential contaminants, and proteins only identified by site. Missing LFQvalues were imputed with random noise simulating the detection limit of the mass spectrometer. Imputed values were taken from a log normal distribution with 0.3x the standard deviation of the measured, logarithmized values, down-shifted by 1.8 standard deviations. By doing this, we obtained a distribution of quantitative values for each protein across samples. We excluded replicates with sample identifications of over two standard deviations away from its other replicates, as well as samples whose correlation value was over two standard deviations away from the other correlations between replicates. This led to the exclusion of two peptide spots from interactome analyses (Table S3). For determination of specific interactions, i.e., to separate specific binders from background, we compared protein identifications in each peptide spot against all other peptide spots excluding spots of the same microproteins using moderated t-tests (limma v3.40.6 110 ). Only proteins with at least two valid values for the peptide spot were considered. The resulting p-values were adjusted using Benjamini-Hochberg correction. Adjusted p-values and fold-changes (log 2 space) were plotted as volcano plots. To determine significance cutoffs, we used a graphical formula combining a fold-change and p-value cutoff 18,111 : À log 10 ðpÞ R c jxj À x0 with x: enrichment factor of a protein, p: p-value of adjusted moderated t-test, x 0 : fixed minimum enrichment, c: curvature parameter. The curvature parameter c determines the maximum acceptable p-value for a given enrichment x. The parameters c and x 0 can be optimized based on prior knowledge of known true and false positives. 32,111 Here, cutoffs were chosen according to known interaction partners of the SOS1 and GLUT1 control peptide. 111 This resulted in a cutoff of x0 = 3, c = 4 that was applied to all other peptide spots. Ultimately, the PRISMA approach enabled us to define significant interactors for ll OPEN ACCESS Resource synthesized peptides individually. The total interactome of one microprotein was defined as the summary of interactors detected in all synthesized peptides that were derived from that microprotein. Quality control by replicate measurement assessment Principal component analysis showed that no batch bias was identified between the replicate membranes (data not shown). Further, for all three membranes triplicate measurements of each spotted peptide correlated well with a median Pearson's R of 0.73, 0.72, and 0.87, respectively, and were significantly higher than the median of correlations of random triplets ( Figure S2A). This was similar to what was observed in a previous peptide array screen. 18 The number of identified proteins per peptide spot ranged from 159 to 2,380 and was comparable across membranes (median of 973, 1,012 and 1,011 IDs for membrane 1, 2 and 3, respectively) ( Figure S2B). Quality control by evaluation of assay control peptides The spotted SOS1 peptide is known to interact with SH3-domain containing proteins via its proline-rich motif, while motif disruption in the mutant peptide leads to the loss of this interaction capability (loss-of-interaction mutant). 18,32 In the case of GLUT1, a proline (P) to leucine (L) mutation creates a dileucine motif in the mutant peptide which mediates the binding to adaptor and clathrin proteins. 18 Interactions with clathrins do not occur in the wild type peptide and lead to aberrant endocytosis of GLUT1 in GLUT1 deficiencies (gain-of-interaction mutant). 18 Reassuringly, we only found clathrins to be significantly enriched in pulldowns of the GLUT1 mutant peptide but not of the wild type peptide (Figures 2D and S2C) and detected up to eight of the nine known SOS1 binding partners ( Figures 2C and S2C). As expected, mutation of the SOS1 proline-rich motif led to the loss of most SH3 domain proteins except for CD2AP, GRB2 and BIN1 ( Figure S2C).
Overall, the assay controls demonstrated the ability of our peptide array approach to detect biologically relevant PPIs as well as its specificity to the spotted peptide sequence. For 481 out of 488 analyzed peptides, we identified between 1 and 94 significant protein interactors (with a median of 14), resulting in 13 to 333 interactors per microprotein (with a median of 107 interactors) ( Figure 2B; Table S3). Interactors that were detected in the interactomes of multiple microproteins tended to have lower interaction scores (product of fold change and p-value) than interactors that were found in fewer or only one microprotein interactome. However, this is likely due to the comparison approach against all other peptides, which will penalize proteins that are detected in many peptide pulldown ( Figures 2B and S2F). Quality control based on bait identification for PRISMA of young microproteins As part of the PRISMA quality control we assessed if the spotted peptides (baits) were identified and enriched in the expected samples, i.e., in the samples that contained the part of the membrane the bait peptide was synthesized on. We reasoned that the high amount of synthesized peptides should lead to their identification by peptide-spectrum matches in case a suitable tryptic peptide is produced upon enzymatic digestion. Thus, we only considered bait identifications that were detected ''by MS/MS'' in at least two replicates, while hits derived ''by matching'' were excluded. Because of the short length of the baits (15 aa), the possibilities to produce tryptic peptides suitable for MS is rather limited and indeed we do not detect all, but 193 out of 480 unique baits. 108 of those (56%) were exclusively identified in the expected samples (data not shown). 79 baits were enriched in the expected sample, but also in other samples. However, for 71 of those the median LFQ intensity across all three replicates was higher in the correct samples compared to LFQ intensities in unexpected samples. In seven cases the bait was detected with a higher intensity in an unexpected sample than in the correct sample, and seven baits were only enriched in unexpected samples (data not shown). In total four samples were completely excluded from the analysis and 13 were flagged and only used to determine the interactome of the entire microprotein but not to investigate motif-driven interactions (Table S3).

Expression of microproteins and peptides 3-15 aa in HEK293 cells
Since we used a protein lysate from HEK293T cells for the PRISMA analysis, we assessed the expression level of the microprotein and peptide-encoding genes in this cell line. Therefore, we downloaded a public RNA sequencing dataset of HEK293T cells (NCBI Sequence Read Archive (SRA) accession codes SRR1107836 and SRR1107837 77 ) and calculated the counts per million (CPM) from the mapped and quantified raw reads. A gene was determined as expressed if the mean of the two runs per gene was above or equal to 1. We detected the genes of 39/60 microproteins (65%) ( Table S3) and 194/221 peptides 3-15 aa (88%) ( Table S4) as expressed in HEK293 cells.

Annotation and enrichment of essential proteins detected in microprotein interactomes
We downloaded a list of proteins that were shown to be essential for survival of human cells in a study by Blomen and colleagues. 33 We only included genes that affected cell viability in both tested cell lines (KMB7 and HAP1). This resulted in a list of 1,734 proteins based on which we annotated microprotein interactors as essential proteins. We calculated if specific microprotein interactomes were enriched for essential proteins using Fisher's exact test (Table S3).
Computational prediction of disordered regions and short linear motifs within microproteins and peptides Intrinsically unstructured (disordered) regions in peptide sequences were predicted using IUPred v.1.0 in the ''short'' disorder mode and disorder values were averaged over the sequence. 36 For the detection of short linear motifs (SLiMs), also called eukaryotic linear motifs (ELMs), the 'elm_classes.tsv' file (version 1.4; 15 January 2018) was downloaded from the ELM resource for functional sites in proteins. 37 We then filtered the peptide sequences for matches to any of the motifs falling in regions with an average disorder value R 0.5. Five of the 60 microproteins subjected to PRISMA did not harbor any predicted SLiMs, the remaining 55 microproteins contained 429 SLiMs within disordered regions. Of those, 412 were captured within 159 of the tiled peptides spotted for the PRISMA screen (Table S3). Moreover, 174 of 221 peptides 3-15 aa contained 514 SLiMs within disordered regions all captured in the PRISMA approach (Table S4).

Detection of protein domain-SLiM matches in PRISMA interactomes
We first annotated known protein domains for each microprotein and peptide interactor using the R packages ''ensembldb'' and ''EnsDb.Hsapiens.v86.'' 97 Next we extracted SLiMs that were known to bind the respective protein domains from a public resource published by Kumar and colleagues (http://elm.eu.org/; ''elm_interactiondomains.tsv''). 41 We reported a domain-SLiM match when a SLiM was present in a microprotein tile or peptide 3-15 aa that bound an interactor carrying the SLiM-binding protein domain. In total we detected 47 protein domain-SLiM matches within disordered regions of 34 microprotein tiles (Table S3) and 30 protein domain-SLiM matches within 18 peptides 3-15 aa (Table S4).

Kinase-enrichment by microproteins with kinase-related SLiMs
In total, 117 out of 481 microprotein tiles were predicted to carry one of 35 different kinase phosphorylation and docking motifs within putative disordered microprotein regions extracted from the ELM resource for functional sites in proteins. 37 We detected 158 microprotein-kinase interactions in the entire interactome screen, 17 of which were detected in the interactomes of fifteen tiles (derived from six young and three conserved microproteins) of the 117 tiles harboring a kinase phosphorylation or docking motif. Fisher's exact test revealed that kinases were not enriched in interactomes of microprotein tiles that carry a kinase phosphorylation or docking motif within disordered regions (p-value = 0.079, Fisher's exact test).  Table S3). Therefore, 3xFLAGtagged microproteins were overexpressed, immunoprecipitated and analyzed by MS as performed previously 22 with slight modifications. HEK293T/17 were seeded in triplicates on poly-D-Lysine (Sigma, Germany) coated 10 cm dishes and transfected with 28 mg plasmid-DNA of FLAG-tagged microproteins using TransFectin (BioRad, California) following manufacturer's instructions. Two days post transfection cells were washed twice with ice-cold phosphate-buffered saline (PBS), scraped in 1.5 mL ice-cold PBS and transferred into Eppendorf tubes. After centrifugation at 950 g for five min at 4 C, cell pellets were lysed in 200 mL lysis buffer (150 mM NaCl, 50 mM Tris pH 7.5, 1% IGPAL-CA-630, 2x Complete protease inhibitor without EDTA) for 30 min on ice. Lysates were centrifuged at 20,800 g for 15 min at 4 C and supernatants were added to 30 mL 50% antibody-coupled magnetic bead solution (M2-magnetic beads, Sigma, Germany) and 300 mL wash buffer 1 (150 mM NaCl, 50 mM Tris pH 7.5). Beads were washed 3x in 150 mL wash buffer 1 before usage. After incubating the samples for 2 h at 4 C in an overhead shaker, samples were washed once with 1 mL wash buffer 2 (150 mM NaCl, 50 mM Tris pH 7.5, 0.05% IGPAL-CA-630) and three times with wash buffer 1. Supernatants were removed and magnetic beads were frozen at 80 C until analyzed by mass spectrometry. Beads were resuspended in 20 mL urea buffer (6 Murea, 2 Mthiourea, 10 mM HEPES, pH 8.0), reduced for 30 min at 25C in 12 mM DTT solution, followed by alkylation in 40 mM chloroacetamide for 20 min in the dark at 25 C. Samples were first digested with 0.5 mg endopeptidase LysC (Wako, Osaka, Japan) for 4 h. After adding 80 mL 50 mM ammonium bicarbonate (pH 8.5) samples were digested with 1 mg sequence grade trypsin (Promega) overnight at 25 C. The peptide-containing supernatant was removed and collected into a fresh tube. Beads were washed twice with 50 mL 50mMammonium bicarbonate (pH 8.5) and the supernatants were pooled. Samples were acidified by adding 1 mL formic acid to stop the digestion.

Phosphorylation of kinase-binding microproteins with domain-SLiM matches
Peptides were extracted, desalted and diluted as described in the previous section for PRISMA. Peptides were separated on a reversed-phase column (20 cm fritless silica microcolumns with an inner diameter of 75 mm, packed with ReproSil-Pur C18-AQ 1.9 mm resin (Dr. Maisch GmbH)) using a 90 min gradient with a 250 nL/min flow rate of increasing Buffer B concentration (from 2% to 60%) on a High-Performance Liquid Chromatography (HPLC) system (Thermo Fisher Scientific) and ionized using an electrospray ionization (ESI) source (Thermo Fisher Scientific) and analyzed on an Thermo Q Exactive Plus instrument, which was run in data dependent mode selecting the top 10 most intense ions in the MS full scans, selecting ions from 350 to 2000 m/z, using 70 K resolution with a 3 3 10 6 ion count target and 50 ms injection time. Tandem MS was performed at a resolution of 17.5 K. The MS2 ion count target was set to 5 3 10 4 with a maximum injection time of 250 ms. Only precursors with charge state 2-6 were sampled for MS2. The dynamic exclusion duration was set to 30 s with a 10-ppm tolerance around the selected precursor and its isotopes.

Resource
Data were analyzed using MaxQuant v1.5.2.8. The internal Andromeda search engine was used to search MS2 spectra against a human UniProt database (HUMAN.2017-01) and an in-house bait protein sequence database containing forward and reverse sequences. The search included variable modifications of methionine oxidation, N-terminal acetylation and serine, threonine and tyrosine phosphorylation, and fixed modification of carbamidomethyl cysteine. Minimal peptide length was set to seven amino acids and a maximum of 3 missed cleavages was allowed. The FDR was set to 1% for peptide and protein identifications. Unique and razor peptides were considered for quantification. Retention times were recalibrated based on the built-in nonlinear time-rescaling algorithm. MS2 identifications were transferred between runs with the ''Match between runs'' option for biological replicates, in which the maximal retention time window was set to 0.7 min. We detected four phosphorylated tryptic peptides from three of the overexpressed microproteins (RP11-12K22.1-MP, JHDM1D-AS1-MP and THAP7-uORF-MP). Two of these peptides contained phosphorylation motifs (Table S3).'' Gene ontology analysis Gene ontology (GO) 113 enrichment on small peptide and microprotein interactomes identified with PRISMA was performed with gProfiler2 v0.2.0, 88 with default parameters. As a custom background, we used all identified proteins in the respective PRISMA screen.

Proximity ligation assay
In situ proximity ligation assay (PLA) was performed to corroborate the interaction between PVT1-MP and SRSF2 using the DuolinkÒ In Situ Proximity Ligation Assay Starter Kit (Red, Mouse/Rabbit, DUO92101-1KT, Sigma-Aldrich) according to manufacturer's instructions. V5-tagged PVT1-MP and FLAG-tagged SRSF2 were co-overexpressed in HeLa cells as described above for co-localization experiments. Following fixation with 4% paraformaldehyde (PFA) for 10 min at RT, cells were permeabilized in 0.1% Triton X in PBS for 1 h at RT, and blocked in DuolinkÒ blocking solution for 1 h at 37 C. PVT1-MP-V5 and SRSF2-FLAG were stained overnight at 4 C with anti-V5 rabbit monoclonal antibody (1:500 in DuolinkÒ antibody diluent, #13202, Cell Signaling Technology) and with anti-FLAG mouse monoclonal antibody (1:500 in DuolinkÒ antibody diluent, F1804, Sigma Aldrich), respectively. Cells were washed twice for 5 minutes in 1x DuolinkÒ Wash Buffer A at RT and incubated with PLUS and MINUS PLA probes (1:5 in the DuolinkÒ Antibody Diluent) for 1 h at 37 C. After washing cells 3x for 5 min in 1x DuolinkÒ Wash Buffer A at RT, cells were incubated for 30 min at 37 C with DuolinkÒ ligase in 1x ligation buffer and washed again in 1x DuolinkÒ Wash Buffer A at RT. For amplification, cells were incubated with DuolinkÒ polymerase (1:80) in 1x amplification buffer for 100 min at 37 C. Cells were washed 2x 10 min in 1x DuolinkÒ Wash Buffer B, 1x 1 min in 0.1x DuolinkÒ Wash Buffer B, and stained with 4-6-diamidino-2-phenylindole (NucBlue Fixed Cell ReadyProbes Reagent, R37606, Thermo Fisher) for 5 minutes at RT. As negative controls, transfected cells were stained with only anti-V5 rabbit monoclonal antibody, only anti-FLAG mouse monoclonal antibody, or no primary antibodies and PLA probes. Additionally, untransfected cells were stained with both primary antibodies and PLA probes. Images were taken with a LEICA SP8 confocal microscope using a 63x objective and analyzed using ImageJ (v1.53c). 108 Generation of LINC01128-MP knock-out cells and RNA-seq analysis A HeLa LINC01128-MP knock-out (KO) cell pool was generated by Synthego Inc. (Redwood City, CA) using CRISPR/Cas9 with GAUCCAAGGCAGGCACUCAA as guide RNA targeting the N-terminal region of the LINC01128-MP encoding sORF. Non-homologous end joining (NHEJ) following the Cas9-mediated double-strand break led to insertions and deletions (indels) that cause premature STOP codons and ultimately a disruption of the microprotein-encoding sORF ( Figure S3F).

Resource
At passage four, RNA-sequencing was performed for triplicates of wild type (WT) cells and KO cell pools to assess i) the proportion of indels within the targeted cell pool and ii) the potential impact the introduced indels have on the expression of LINC01128 transcripts. RNA-seq data was analyzed by trimming adapters and mapping each sample to the human genome (hg38, Ensembl v.98) using STAR v2.5.2b 82 with a maximum of 4 mismatches.
Our analyses revealed that 93.7% of all reads mapped to sequences carrying different indels that all lead to a frameshift resulting in premature STOP codons and ultimately disruption of the sORF encoding LINC01128-MP, while only 6.3% of reads mapped to the wild type sequence. A single adenine insertion was most common within the KO cell pools (69.4%), while a single guanine deletion, a 14 bp deletion and a 13 bp deletion contributed with 9%, 9.5% and 5.7%, respectively ( Figure S3G). From these data, we can estimate that at least 87.4% of cells carry a mutation on both alleles.
Differential RNA-seq expression analysis was performed using DESeq2 v1.26.0. 89 Differentially expressed genes were selected based on adjusted p-values (p < 0.05) and log fold changes (logFC < -0.18 or logFC > 0.18) and showed that LINC01128 transcript expression was not significantly altered through the introduced indels in comparison to wild type cells ( Figures 3L and S3I).

The role of LINC01128-MP in transferrin endocytosis
For studying the uptake of fluorescently labeled transferrin-Alexa 647 in HeLa WT and HeLa LINCO1128-MP KO cells, cells from the same batch of cells that had been analyzed by RNA-seq (see above) were washed in PBS and incubated in DMEM without FBS for 30 min at 37 C. Subsequently, 5 mg/mL transferrin-Alexa 647 in DMEM were added to the cells for another 10 min at 37 C. Cells were washed several times in PBS and fixed in 4% PFA for 8 min on ice. Standard immunocytochemical analysis was carried out by incubation of cells with primary mouse anti-EEA1 antibody (1:100; BD Transduction Laboratories). Bound primary antibody was visualized using secondary antiserum conjugated with Alexa Fluor 555 (1:500; Invitrogen). Alexa 647-conjugated transferrin was purchased from Invitrogen (T23366). Nuclei were counterstained with DAPI (1:8000; Roche). Image acquisitions were carried out with a Leica TCS SP8 confocal microscope using a 63x PL APO CS2 oil immersion objective (NA 1.4). Manders' coefficient tM1 was determined with the Coloc 2 plugin from ImageJ/Fiji. Student's t-test was applied using Graph Pad Prism 7. Three independent experiments were performed. Per experiment and cell line, an average of 30 cells were analyzed for quantification.

Detection of sORFs 3-15 aa in human tissues
The first GENCODE annotation set of Ribo-seq ORFs only included sequences with a minimum length of 16 amino acids. 2 To identify very small novel ORFs translated in human tissues, we applied our previously published approach to detect actively translated ORFs in human left ventricular heart tissue (European Genome-Phenome Archive (EGA) accession code EGAS00001003263). 22 We decided to use this dataset as discovery tissue due to the high sample size (n = 80) and the high average codon periodicity (average of 85.4% in-frame reads for read lengths of 29 base pairs). We modified our original method for ORF detection 22 and removed the minimum length cutoff for ORF assignment, leading to the identification of 287 new translated ORFs starting with ATG and with a size of 15 amino acids or shorter (denoted sORFs 3-15 aa ). 221 of these ORFs met our translation rate cutoff, which required high levels of translation (minimum number of 290 raw P-sites and a minimum of 70% in-frame P-sites), resulting in their selection for further analysis (Table S4). We next inspected the translation of these 221 sORFs 3-15 aa in other human tissues by analyzing four publicly available ribosome profiling datasets corresponding to kidney, (n = 6, EGA accession code EGAS00001003263 22 ), liver (n = 7, EGA accession code EGAS00001003263 22 and ArrayExpress accession code E-MTAB-7247 23 ), brain (n = 3, ArrayExpress accession code E-MTAB-7247 23 ) and testis (n = 3, ArrayExpress accession code E-MTAB-7247 23 ).
For comparison, we inspected the translation of 183 sORFs 3-15 aa with conserved structures in rodents. To this end, we retrieved datasets corresponding to mouse heart (n = 6, European Nucleotide Archive (ENA) accession code PRJEB29208 22 ), mouse liver (n = 3, ArrayExpress accession code E-MTAB-7247 23 ), mouse brain (n = 3, ArrayExpress accession code E-MTAB-7247 23 ), mouse testis (n = 3, ArrayExpress accession code E-MTAB-7247 23 ), rat heart (n = 30, ENA accession code PRJEB38096 50 ) and rat liver (n = 30, ENA accession code PRJEB38096 50 ). We additionally retrieved chicken brain Ribo-seq samples (n = 3, ArrayExpress accession code E-MTAB-7247 23 ) to inspect the translation of USP10-uORF, an sORF 3-15 aa with conserved structure and length in mammals and birds. Ribo-seq datasets were filtered and mapped following a similar approach as for the mammalian Ribo-seq datasets (details in STAR Methods section 'expression and translation of sORFs in mammals'). Using the mapped data from the described tissues, we called translated ORFs per sample running ORFquant 1.00 46 . Next, we pooled the datasets by tissue and we extracted P-site counts with RiboseQC. 90 In-frame P-site counts were quantified for each annotated CDS and sORF 3-15 aa . Finally, raw P-site counts were subjected to a normalization procedure (estimateSizeFactorsForMatrix; DESeq2 v1.26.0 89 ) and divided by the total number of codons in each sequence. Additionally, we calculated the PhyloCSF 80 scores with default parameters of all 221 sORFs 3-15 aa ( Figure S4E). PhyloCSF scores were calculated across the retrieved multiple alignments for primates and for mammals. We found that a total of 30 (13.57%) and 44 sORFs 3-15 aa (19.91%) have positive PhyloCSF scores in primates and mammals. It is important to note that PhyloCSF performs better for sequences >30 nucleotides and can fail to determine the constraints of very short sequences, 80,114 which is why the determined PhyloCSF results can not be 100% reliable.
Effect of genomic location and length in the conservation of sORF structures Compared to sORFs 3-15 aa , cataloged sORFs (Ribo-seq ORFs, longer than 15 aa) contained a higher proportion of lncORFs (30% vs 8%), and a lower proportion of uORFs (42% vs 92%). Therefore, we inspected whether this difference in the biotype proportions had an effect on the observed patterns of conservation between cataloged sORFs and novel sORFs 3-15 aa . We considered two main sORF biotypes: upstream ORFs (uORF, which contains 3,083 cataloged sORFs and 203 sORFs 3-15 aa ) and lncRNA ORFs (lncORF, which contains 2,208 cataloged sORFs and 17 sORFs 3-15 aa ). Next, we checked how many of the sORFs had conserved structures in five different species: mouse, rat, cow, horse and cat. An sORF structure was considered as conserved in the counterpart region of the compared species if the ATG translation initiation site (TIS) was present in the same position or within a window 6 nt down-stream of the human ATG position, and if R 70% of the sequence did not contain stop codons truncating the ORF. For comparison, we randomly sampled non-translated ORF sequences from the same regions harboring translated sORFs (5 0 UTRs for uORFs, lncRNA exons for lncORFs). We observed that both translated and non-translated lncRNA sequences were less highly conserved than 5 0 UTR sequences ( Figure S4G). UTR regions are enriched in promoters, secondary structures and binding motifs that are evolutionary constrained. Hence, we concluded that sORFs 3-15 aa are more likely to overlap these elements and be maintained over longer evolutionary times.
Secondly, we evaluated whether the intrinsic length differences between cataloged sORFs and sORFs 3-15 aa affected the different observed patterns of conservation between both groups. We consistently observed a continuum, in which the proportion of sORFs with conserved structures decreased with ORF length in the five analyzed species ( Figure S4G). For the set of untranslated random sequences extracted from 5 0 UTRs and lncRNAs, we also observed a similar negative trend between ORF length and conservation of ORF structures. This indicates that, even if not translated, disabling substitutions in very short ORFs (< 15 aa) are less likely to occur across evolutionary time.

Additional evidences of sORF 3-15 aa translation
We alternatively validated the translation of the set of 221 sORFs 3-15 aa in the same discovery set of 80 human left ventricular tissue Ribo-seq samples by running PRICE v1.0.3b. 47 This method uses an expectation-maximization algorithm to compute probabilistic inferences of codon activities. sORFs 3-15 aa with a p-value < 0.05 were defined as translated. Moreover, we retrieved a public resource containing the relative probabilities of each human transcript position to contain a translation initiation site (TIS). These probabilities were calculated using TIS Transformer, 48 a deep learning model based on information embedded in processed transcript sequences. Per each transcript containing a sORF 3-15 aa , we extracted and ranked the probabilities of all possible ATG triplets in the transcript sequence.
Detection of sORF-translated peptides 3-15 aa in public mass spectrometry data We collected protein expression evidence for translated sORFs 3-15 aa from previously published datasets. 6,51,52,54,55 Therefore, we re-analyzed the raw spectra using MaxQuant v1.6.10 56 with settings indicated in the respective dataset, added a false discovery rate (FDR) filter of <0.01 using the reverse-sequence based target decoy approach implemented in MaxQuant, 56 and disabled the protein FDR filter, as it was previously done for the identification of small proteins. 45,57 We added the peptide sequences together with the human UniProt database (HUMAN.2019-07) into the search space. We also searched the analyzed results of a published dataset of translated sORFs for our candidate peptides. 53 Moreover, we included the peptides 3-15 aa in the search database of the recent Human HLA 2022-09 PeptideAtlas build, in which the 51 million MS/MS spectra from 49 immunopeptide datasets deposited to ProteomeXchange 115 were processed with the PeptideAtlas build pipeline. 29,58 The reference database used was the comprehensive THISP level 3 database 116 plus the candidate peptide sequences. Processing was performed with MSFragger 91 and the Trans-Proteomic Pipeline 58 using a non-specific (no protease) search strategy and other parameters as appropriate for each of the 49 datasets. The Human HLA 2022-09 PeptideAtlas build contains 15 million peptide-spectrum matches (PSMs), among which 794 PSMs mapped to 16 of the short peptide sequences described herein. This PeptideAtlas build also included two of the individually analyzed studies. 6,51,52,54,55 All peptides detected in the respective datasets can be found in Table S4. Of note, only 116/221 peptides 3-15 aa are theoretically detectable with standard shotgun MS, because they produce tryptic peptides above six aa after enzymatic digest. To ensure a proper identification, peptides additionally require to be unique within the digested proteome, which further reduces the number of peptides able to be detected. 193/221 peptides are theoretically detectable in immunopeptidomics, since no digestion is required. To ensure the validity of peptide identifications, each identification was required to be unique within the proteome. This was assessed with the tool Proteomapper v.1.5 92 (http://www.peptideatlas.org/map/), which maps input sequences to the proteome, in order to exclude that the identified peptides stem from fragments of longer proteins. Our analysis showed that all identified peptides are either unique within the proteome or, if present within other proteins, cannot be cleaved into the identified peptides because of lacking tryptic restriction sites. However, four of the identified peptides could theoretically stem from semi-tryptic digestion of canonical proteins and four might be derived from longer alternative sORF isoforms encoded by the same gene that had been detected in previous ribosome profiling studies 2 (Table S4).

Resource PRM proteomics
The parallel reaction monitoring (PRM) analysis was performed similarly to the targeted mass spectrometry approach carried out by us previously, 22 except for some modifications. To select appropriate tryptic and LysC-proteolytic signature peptides, we first digested candidate microproteins and peptides 3-15 aa in silico with i) trypsin and ii) LysC using the online tool MS-Digest (http:// prospector.ucsf.edu). We selected tryptic or LysC-proteolytic peptides that were unique across the human proteome digested with the respective enzyme, had a minimum length of six aa and fell into a mass to charge (m/z) range of 10 -1850 m/z. Selected peptides were purchased as synthetic peptides of crude quality (JPT Inc., Berlin, Germany), which were resuspended in 20% acetonitrile (100 mM ammonium bicarbonate) and measured (1 pmol per peptide) on a Q-Exactive HF-X mass spectrometer (Thermo Fisher Scientific) using data dependent acquisition mode (DDA) with a mass resolution of 60,000 for the MS scans and 15,000 for the MS/MS scans, considering precursor ions with charge state 1 to 6. Precursor fragmentation efficiency was evaluated using normalized collision energies (NCE) ranging from 25 to 35. The recorded spectra were analyzed using MaxQuant v1.6.3.4 56 applying a custom-made database containing the predicted sequences, with carbamidomethylation of cysteines as fixed and oxidation of methionines as variable modification. Based on the observed precursor signal, retention time, charge state and MS/MS fragment information, a library based PRM method was developed using the Skyline software package v3.6 93 with the following settings: precursor charge state 1 to 4; ion charge state 1 and 2; ion types y, p, b, a, z; auto-selection of matching transitions enabled; ion match tolerance = 0.05 m/z; method match tolerance = 0.055 m/z. For each candidate the most abundant variant together with the corresponding fragment ions (five or more) were selected. In total, we included 42 tryptic and 16 LysC precursor targets for 24 microproteins, and 77 tryptic and 193 LysC precursor targets for 149 peptides 3-15 aa (Table S5). A subset of precursor targets did not generate fragmentation patterns suitable for PRM analysis and were therefore excluded. The final set consisted of 39 tryptic and 9 LysC precursor targets for 24 microproteins, and 73 tryptic and 155 LysC precursor targets for 129 peptides 3-15 aa . As positive controls, proteotypic peptides from high abundant housekeeping genes (tryptic peptides: GAPDH, ACTA, and HIST1H2; LysC peptides: GAPDH and H2BC1) and sCDS were included (tryptic peptides: PLN and MIEF1-MP; lysC peptides: NDUFB3, MTLN, MPRL33, and MIEF1-MP). Based on their retention time profile, targets were split into two (tryptic peptides) and three (LysC peptides) PRM inclusion lists. Parameters for the positive controls were added to each list. Analytical PRM measurements were performed on a High Performance Liquid Chromatography (HPLC) system (ThermoScientific) using a 98 min gradient of increasing Buffer B concentration (from 2% to 60%, 250 nl/min flow rate) coupled to an Q-Exactive HF-X mass spectrometer (Thermo Fisher Scientific). Instrument parameters were set to 30,000 resolution with 2e5 AGC target value, 100 ms maximum ion injection time, 30 min retention time window and adjusted normalized collision energy (NCE) values for each target.
Analytical PRM measurements were performed on human tissue and cell line samples. Therefore, pulverized human heart tissue of five patients and three biological replicates of PBS-washed cell pellets of HEK239T, HeLa and K562 cells were resuspended in lysis buffer (1% weight per volume (w/v) sodium deoxycholate (SDC), 10 mM DTT, 40 mM chloroacetamide (CAA), 1 mM ethylenediaminetetraacetic acid (EDTA), 100 mM Tris, pH 8.5) and boiled for 10 min at 95 C. After cooldown, samples were incubated with Benzonase (Merck Germany) to digest nucleic acids. For tryptic peptide samples, 10 mg of protein extract was digested with 0.2 mg sequence grade endopeptidase LysC (Wako, Osaka, Japan) and 0.2 mg sequence grade trypsin (Promega). LysC peptide samples were obtained by digesting 10 mg of protein extract with 0.2 mg sequence grade endopeptidase LysC only (Wako, Osaka, Japan). All digests were performed at 37 C for 16 hours. The digest was stopped by acidifying each sample to pH < 2.5 by adding 10% trifluoroacetic acid solution. After centrifugation to pellet insoluble material (14,000 rpm, 10 min), the peptides were extracted and desalted using the StageTip protocol. 109 Peptide samples were eluted from StageTips (80% acetonitrile, 0.1% formic acid), dried down, resolved in sample buffer (3% acetonitrile and 0.1% formic acid) and analyzed on the mass spectrometer as described above. For heart tissue samples, two technical replicates for each biological replicate were performed. Cell line samples were analyzed by one analytical run per replicate.
PRM data analysis was carried out using the Skyline software package v21.02. 93 Analytical runs of synthetic peptide mixtures as well as the internal library-based fragment ranking were used to manually confirm the peak assignment. Peptides detected in at least two biological replicates with a dot product of R 0.7 were considered robust identifications. For these peptides, also peaks with a dot product R 0.6 were reported in the remaining samples. For peaks passing the quality filter, the total peak area, retention time and dot product values were exported and are available for all technical and biological replicates (Table S5). We were able to detect four out of the six sCDS. The control proteins GAPDH, ACTA, HIST1H2, and H2BC1 were robustly identified in all replicates of the five heart samples and the three cell lines. We detected signature peptides for two microproteins and for 18 peptides 3-15 aa . Of the latter 18, two peptides could also stem from semi-tryptic digestion of canonical proteins and four might be also derived from longer alternative sORF isoforms encoded by the same gene (Table S4).
PRISMA of peptides translated from sORFs 3-15 aa Setup, Preparation, LC-MS/MS & Data analysis The PRISMA approach described above was used for the peptide dataset of 221 candidate peptides with minor adaptations. We decided to use the same PRISMA controls as for the microproteins after analysis of amino acid frequencies ( Figure S2C): four peptides derived from SOS1 and GLUT1 (SOS1wt, SOS1mt, GLUT1wt, GLUT1mt) that had been previously investigated. 18,32 All 221 candidate peptides as well as the four PRISMA controls were spot-synthesized onto a cellulose membrane (JPT Peptide Technologies Inc., Berlin, Germany). We were able to synthesize each peptide in full length due to their short size. The assay was performed in triplicates, i.e., three membranes with identical peptides. The sequences of all spotted peptides were identical to all identified peptides and are listed in Table S4 together with the sequences of the four control peptides. Preparation for MS (i.e., incubation of the membrane with HEK239T protein lysate, punching out the peptide spots, digesting the peptide spots with trypsin and LysC, preparation of the StageTips), LC-MS/MS (i.e., elution of the peptides, separation of the peptides on a HPLC system, ionization and analysis of the peptides on an Orbitrap Fusion instrument with above settings), as well as raw data analysis and filtering steps for identification of PPIs were performed as described in the above PRISMA screen for microproteins. Quality control by replicate measurement assessment No batch bias was identified between the replicate membranes ( Figures S5A and S5B) and the triplicates of the same peptide correlated with a mean correlation coefficient (Pearson's R) of 0.82, indicating good data reproducibility ( Figure S5C). Six samples were excluded at this step because their correlation or number of interactors deviated more than two standard deviations from the other two samples. Taken together, a total of 3,784 unique binding proteins (including 64 bait peptides) were identified, with a range of 191 to 2,711 proteins binding to the individual peptides (mean identifications per peptide: 1,376 proteins, Figure S5D). The wide range of binding proteins could be expected due to the exploratory nature of the screen: some peptides might not be biologically relevant and will therefore not display any binding capacity, which in turn adds greater weight to the ones that do bind other proteins specifically. In order to distinguish between specific (transient) interactions and unspecific background binding, each triplicate was analyzed with regard to all other peptide pull-downs (used as background) via label-free quantification. Quality control by evaluation of assay control peptides We used the known interaction partners of GLUT1 and SOS1 to determine the appropriate significance cutoff, 18,32 resulting in a cutoff of c=6 and x 0 =2.8, equal to two standard deviations ( Figures S5F and S5G). The expected binding behavior of GLUT1 and SOS1 is described above (PRISMA ''Quality control by evaluation of assay control peptides''). For GLUT1, reassuringly, we find all known binding partners enriched in the mutant but not in the wild type peptide. For SOS1, we find all previously reported binding partners in the wild type, and only 3/9 in the mutant. Additionally, we find four previously unreported proteins with SH3 domains in the wild type but not in the mutant. Our very specific assay controls demonstrate the ability to detect previously reported, biologically relevant proteins, as well as its specificity to the spotted peptide sequence. In total, we identify between zero to 212 proteins per peptide (mean identification of 11 proteins per peptide, Figure S5H). We found that the shortest peptides (below 6 aa) have slightly lower numbers of interaction partners overall. Quality control based on bait identification for PRISMA of peptides 3-15 aa Similar to what was described for PRISMA, we assessed if the spotted peptides (baits) were identified and enriched in the expected samples ( Figure S5E). We detected 64 baits in total. It was to be expected that not all bait peptides could be detected, as the short length of the peptides reduces the possibility to produce tryptic peptides suitable for MS. 29/64 baits were found exclusively in expected samples. 28 were found in multiple samples but with higher mean LFQ intensity across the three triplicates in the expected sample compared to the unexpected sample. One bait was found exclusively in an unexpected sample, but it was only identified ''by matching'' and not ''by MS/MS'' and also had a relatively low LFQ intensity. As explained above, the high amount of peptide synthesized on the membrane should lead to a robust identification by peptide-spectrum matches and hits identified only ''by matching'' are more likely false positives. Six baits were not significantly enriched in any sample.

Peptide properties
Peptide hydrophobicity values were calculated using the Krokhin model. 117 For prediction of the peptides' MHC presentation, we used the tool NetMHCpan-4.1 95 which predicts binding of peptides to MHC molecules of known sequence using a neural network. We selected the 12 indicated MHC supertype representatives for binding analysis. In total, 105 peptides were reported to be presented on one or more MHC supertype representatives. The results, indicating for how many MHC supertype representatives a peptide is predicted to be presented, are listed in Table S4.
Analysis of CORUM complexes in interactomes of peptides 3-15 aa CORUM complexes 118 were downloaded from http://mips.helmholtz-muenchen.de/corum/ (Human_CORUM_coreComplexes_ 2018-09-03_symbol.gmt). We annotated a complex when at least two proteins of a CORUM complex were present in the interactome of a specific peptide.
Luciferase assay for ribosome-binding peptides We identified 16 peptides that clustered together in the global hierarchical clustering, indicating that their interactomes were very similar. For the reporter assay, we selected five out of 16 peptides encoded by sORFs 3-15 aa within 5 0 UTRs. We created two mutants for each of the five candidates: one with a mutated start codon (ATG > ACG) that impedes translation of the uORF, and one where we mutated charged arginines to non-charged, inert alanines. We created this second control (i.e., the charge mutant) in order to assess the impact of the peptide's sequence and/or charge on its effect on downstream (luciferase) translation. The respective 5 0 UTRs were synthetized and cloned into the backbone in front of the Renilla luciferase using the NheI restriction site (psiCHECK-2, constructs available upon request). We then confirmed the plasmid sequences using Sanger Sequencing. For the reporter assay (Dual Glo Luciferase Assay System, Promega, E2920), human HeLa cells were grown in a white 96-well plate with a transparent bottom. All plasmids including internal controls were transfected using LipoJet Reagent. The assay was performed 48 hours after transfection, according ll OPEN ACCESS Resource human or animal tissue (STAR Methods sections ''experimental model and subject details'' and ''detection of sORFs3-15aa in mammalian tissues''), the number of sORFs (results text, Figures 1A, 1E, 1G, and S1), number of microproteins ( Figure S2K), number of experiments ( Figure 3K) and the number of sORFs 3-15 aa (Figures 4F and S4B), uORFs and lncORFs ( Figure S4G). Statistical parameters used to indicate differential expression were derived from DESeq2 (STAR Methods: section ''generation of LINC01128-MP knock-out cells and RNA-seq analysis''), or otherwise the type of statistical test (e.g., Mann-Whitney U test or t test) is annotated in the figure legend and indicated in the STAR Methods segment specific to each analysis. Unless stated otherwise, statistical analyses are two-sided tests performed using R. In the PRISMA analysis, samples were excluded that deviated more than two standard deviations away from its other replicates regarding number of sample identifications as well as samples whose correlation value was over two standard deviations away from the other correlations between replicates. False discovery rate (FDR) was estimated using the Benjamini & Hochberg method.