Deep transcriptome annotation suggests that small and large proteins encoded in the same genes often cooperate

Recent studies in eukaryotes have demonstrated the translation of alternative open reading frames (altORFs) in addition to annotated protein coding sequences (CDSs). We show that a large number of small proteins could in fact be coded by altORFs. The putative alternative proteins translated from altORFs have orthologs in many species and evolutionary patterns indicate that altORFs are particularly constrained in CDSs that evolve slowly. Thousands of predicted alternative proteins are detected in proteomic datasets by reanalysis using a database containing predicted alternative proteins. Protein domains and co-conservation analyses suggest a potential functional relationship between small and large proteins encoded in the same genes. This is illustrated with specific examples, including altMiD51, a 70 amino acid mitochondrial fission-promoting protein encoded in MiD51/Mief1/SMCR7L, a gene encoding an annotated protein promoting mitochondrial fission. Our results suggest that many coding genes code for more than one protein that are often functionally related.

generating a significant level of noise 13 . However, given that many small proteins have 63 12 reference proteins encoded in the same gene would be that they share protein domains. 249 We compared the functional annotations of the 585 alternative proteins with an InterPro 250 entry with the reference proteins expressed from the same genes. Strikingly, 89 of 110 251 altORFs coding for zinc finger proteins (Figure 11) are present in transcripts in which the 252 CDS also codes for a zinc finger protein. Overall, 138 alternative/reference protein pairs 253 share at least one InterPro entry and many pairs share more than one entry (Figure 12a). 254 The number of shared entries was much higher than expected by chance (Figure 12b, 255 p<0.0001). The correspondence between InterPro domains of alternative proteins and 256 their corresponding reference proteins coded by the same genes also indicates that even 257 when entries are not identical, the InterPro terms are functionally related (Figure 12c; 258  Recently, the interactome of 118 human zinc finger proteins was determined by affinity 264 purification followed by mass spectrometry 41 . This study provides a unique opportunity 265 to test if, in addition to possessing zinc finger domains, thus being functionally connected 266 some pairs of reference and alternative proteins coded by the same gene also interact. We 267 re-analyzed the MS data using our alternative protein sequence database to detect   (4) detection of thousands of alternative proteins in multiple existing proteomic datasets; 358 (5) correlated altORF-CDS conservation, but with overrepresentation of highly conserved 359 and fast-evolving altORFs; (6) overrepresentation of identical InterPro signatures 360 between alternative and reference proteins encoded in the same mRNAs; (7) several 361 thousand co-conserved paired alternative-reference proteins encoded in the same gene; 362 and (8)  Our results raise the question of the evolutionary origins of these altORFs. A first 388 possible mechanism involves the polymorphism of initiation and stop codons during 389 evolution 55,56 . For instance, the generation of an early stop codon in the 5'end of a CDS 390 could be followed by the evolution of another translation initiation site downstream, 391 creating a new independent ORF in the 3'UTR of the canonical gene. This mechanism of 392 altORF origin, reminiscent of gene fission, would at the same time produce a new altORF 393 that shares protein domains with the annotated CDS, as we observed for a substantial 394 fraction (24%) of the 585 alternative proteins with an InterPro entry. A second 395 mechanism would be de novo origin of ORFs, which would follow the well-established 396 models of gene evolution de novo 20,57,58 in which new ORFs are transcribed and 397 translated and have new functions or await the evolution of new functions by mutations. 398 The numerous altORFs with no detectable protein domains may have originated this way 399 from previously non-coding regions or in regions that completely overlap with CDS in 400 other reading frames. 401

402
Detection is an important challenge in the study of small proteins. A TIS detected by 403 ribosome profiling does not necessarily imply that the protein is expressed as a stable 404 molecule, and proteomic analyses more readily detect large proteins that generate several 405 peptides after enzymatic digestion. In addition, evolutionarily novel genes tend to be 406 poorly expressed, again reducing the probability of detection 20 . Here, we used a 407 combination of five search engines, thus increasing the confidence and sensitivity of hits 408 compared to single-search-engine processing 59,60 . This strategy led to the detection of 409 altORF TIS). An additional recent study was also included in our analysis 29  the length of the aligned region between the query and the match sequence equalled or 510 exceeded 50% of the length of the sequence, and when the bitscore reached a minimum 511 of 40 73 . Orthologs were detected by finding the mutually best scoring pairwise hits 512 (reciprocal best hits) between datasets A-B and B-A. The self-self runs were used to 513 identify paralogy relationships as described 72 . 514 515 Co-conservation analyses. For each orthologous alternative protein pair A-B between 516 two species, we evaluated the presence and the orthology of their corresponding 517 reference proteins A'-B' in the same species. In addition, the corresponding altORFs and 518 CDSs had to be present in the same gene. 519 In order to develop a null model to assess co-conservation of alternative proteins and 520 their reference pairs, we needed to establish a probability that any given orthologous 521 alternative protein would by chance occur encoded on the same transcript as its paired, 522 orthologous reference protein. Although altORFs might in theory shift among CDSs (and 523 indeed, a few examples have been observed), transposition events are expected to be 524 relatively rare; we thus used the probability that the orthologous alternative protein is 525 paired with any orthologous CDS for our null model. Because this probability is by 526 definition higher than the probability that the altORF occurs on the paired CDS, it is a 527 conservative estimate of co-conservation. We took two approaches to estimating this 528 percentage, and then used whichever was higher for each species pair, yielding an even 529 more conservative estimate. First, we assessed the percentage of orthologous reference 530 proteins under the null supposition that each orthologous alternative protein had an equal 531 probability of being paired with any reference protein, orthologous or not. Second, we 532 assessed the percentage of non-orthologous alternative proteins that were paired with 533 orthologous reference proteins. This would account for factors such as longer CDSs 534 having a higher probability of being orthologous and having a larger number of paired 535 altORFs. For example, between Homo sapiens and Mus musculus, we found that 22,304 536 of 54,498 reference proteins (40.9%) were orthologs. Of the 157,261 non-orthologous 537 alternative proteins, 106,987 (68%) were paired with an orthologous reference protein. Because 68% is greater than 40.9%, we used 68% as the probability for use in our null 539 model. Subsequently, our model strongly indicates co-conservation (Fig. 3 and Table 2; 540 p<10 -6 based on 1 million binomial simulations; highest observed random percentage 541 =69%, much lower than the observed 96% co-conservation). inside 20,814 CDSs, the average PhyloP score for wobble nucleotides within the altORF 552 region was compared to the average score for the complete CDS. To generate controls, 553 random regions in CDSs with a similar length distribution as altORFs were selected and 554 PhyloP scores for wobble nucleotides were extracted. We compared the differences 555 between altORF and CDS PhyloP scores (altORF PhyloP -CDS PhyloP) to those 556 generated based on random regions. We identified expected quantiles of the differences 557 ("DQ" column in the table), and compared these to the observed differences. Because 558 there was greater conservation of wobble nucleotide PhyloP scores within altORFs 559 regions located farther from the center of their respective genes (r = 0.08, p < 0.0001), 560 observed differences were adjusted using an 8-knot cubic basis spline of percent distance 561 from center. These observed differences were also adjusted for site-specific signals as 562 detected in the controls. 563 564 Human alternative protein classification and in silico functional annotation. 565

Repeat and transposable element annotation 566
RepeatMasker, a popular software to scan DNA sequences for identifying and classifying 567 repetitive elements (RRID:SCR_012954), was used to investigate the extent of altORFs 568 derived from transposable elements 74 . Version 3-3-0 was run with default settings. 569 Alternative protein analysis using InterProScan (RRID:SCR_005829) 570 InterProScan combines 15 different databases, most of which use Hidden Markov models 571 for signature identification 75 . Interpro merges the redundant predictions into a single 572 entry and provides a common annotation. A recent local version of InterProScan 5.14-573 53.0 was run using default parameters to scan for known protein domains in alternative 574 proteins. Gene ontology (GO) and pathway annotations were also reported if available 575 with -goterm and -pa options. Only protein signatures with an E-value  10 -3 were 576 considered. 577 We classified the reported InterPro hits as belonging to one or several of three clusters; 578 (1) alternative proteins with InterPro entries; (2) alternative proteins with signal peptides 579 (SP) and/or transmembrane domains (TM) predicted by at least two of the three SignalP, 580

PHOBIUS, TMHMM tools and (3) alternative proteins with other signatures. 581
The GO terms assigned to alternative proteins with InterPro entries were grouped and 582 categorised into 13 classes within the three ontologies (cellular component, biological 583 process, molecular function) using the CateGOrizer tool 76 (RRID:SCR_005737). 584 Each unique alternative protein with InterPro entries and its corresponding reference 585 protein (encoded in the same transcript) were retrieved from our InterProscan output. 586 Alternative and reference proteins without any InterPro entries were ignored. The overlap 587 in InterPro entries between alternative and reference proteins was estimated as follows. 588 We went through the list of alternative/reference protein pairs and counted the overlap in 589 the number of entries between the alternative and reference proteins as 590 100*intersection/union. All reference proteins and the corresponding alternative proteins 591 were combined together in each comparison so that all domains of all isoforms for a 592 given reference protein were considered in each comparison. The random distribution of 593 the number of alternative/reference protein pairs that share at least one InterPro entry was 594 computed by shuffling the alternative/reference protein pairs and calculating how many 595 share at least one InterPro entry. This procedure was repeated 1,000 times. Finally, we 596 compared the number and identity of shared InterPro entries in a two dimensional matrix 597 to illustrate which Interpro entries are shared. In many instances, including for zinc-finger 598 coding genes, InterPro entries in alternative/reference protein pairs tend to be related 599 when they are not identical.  PSMs matching reference target or decoy proteins were separated from those matching 629 alternative targets or decoys as previously described 83,84 . PSMs that matched both 630 reference and alternative proteins were automatically moved to the reference database 631 group. PSMs were then ranked according to their PeptideShaker score and filtered at 1% 632 FDR separately. Validated PSMs were selected to group proteins using proteoQC R tool 633 85 , and proteins were separately filtered again using a 1% FDR cut-off. 634 Only alternative proteins identified with at least one unique and specific peptide were 635 considered valid 59 . Any peptide matching both a canonical (annotated in Uniprot) and an 636 alternative protein was attributed to the canonical protein.  were tagged with Flag and HA tags, respectively. MiD51 GFP and altMiD51 GFP were also 671 cloned into pcDNA3.1 by Gibson assembly. For MiD51 GFP , a LAP tag 32 was inserted 672 between MiD51 and GFP. gBlocks were purchased from IDT. Human altDDIT3 mCherry 673 was cloned into pcDNA3.1 by Gibson assembly using coding sequence from transcript 674 variant 1 (NM_004083.5) and mCherry coding sequence from pLenti-myc-GLUT4-675 mCherry (Addgene plasmid # 64049). Human DDIT3 GFP was also cloned into pcDNA3.1 676 by Gibson assembly using CCDS8943 sequence. 677 For immunofluorescence, primary antibodies were diluted as follow: anti-Flag (Sigma, 678 Roche protease inhibitors per 50 mL of buffer). After 5 mins of lysis on ice, lysate was 704 sonicated twice at 11 % amplitude for 5 s with 3 minutes of cooling between sonication 705 cycles. Lysate was centrifuged, supernatant was isolated and protein content was assessed 706 using BCA assay (Pierce). GFP-Trap beads were conditioned with lysis buffer. 40 µL of 707 beads were added to 2 mg of proteins at a final concentration of 1 mg/mL. After 708 overnight immunoprecipitation, beads were centrifuged at 5000 rpm for 5 minutes and 709 supernatant was discarded. Beads were then washed three times with wash buffer (0.5 % 710 NP40, Tris-HCl 50 mM pH 7.5, NaCl 200 mM and two EDTA-free Roche protease 711 inhibitors per 50 mL of buffer) and supernatants were discarded. Immunoprecipitated 712 proteins were eluted from beads by adding 40 µL of Laemmli buffer and boiling at 95 °C 713 for 15 minutes. Eluate was split in halfs which were loaded onto 10 % SDS-PAGE gels to 714 allow western blotting of GFP and mCherry tagged proteins. 40 µg of initial lysates were 715 loaded into gels as inputs. 716 717 Mitochondrial localization, parameters and ROS production. Trypan blue quenching 718 experiment was performed as previously described 89 . ROS Inducer Pyocyanin at 100 µM. Fluorescence was monitored in real time. ROS 746 accumulation rate was measured between 1 to 3 hours following induction. After the 747 assay, total cellular protein content was measured using BCA protein assay reagent 748 (Pierce, Waltham, MA, USA) after lysis with RIPA buffer. Data were normalised for 749 initial fluorescence and protein concentration.   relative to the total number of reference-alternative protein pairs are indicated on the right 1039 (see Table 2 for details). We compared these differences to those generated based on five random regions in CDSs 1050 with a similar length as altORFs. Expected quantiles of the differences ("DQ" columns) 1051 were identified and compared to the observed differences. We show the absolute numbers 1052  Blue pixels indicate that these domains are not shared, white pixels indicate that they are 1193 shared once, and red that they are shared twice or more.    In order to compare the observed co-conservation to expected co-conservation, we used the more conservative of two expected values: either the percentage of all refProts (called here reference proteins) 2 that were defined as orthologous (column G), or the percentage of non-orthologous altProts (called here alternative proteins) that were paired with an orthologous refProt. Both of these methods are 3 themselves conservative, as they do not account for the conservation of the pairing. The larger of these values for each species was then used to generate 1 million random binomial distributions with 4 n=# of orthologous altProts; the maximum of these percentages is reported in column I. 4 2 These two proteins were not detected with unique peptides but with shared peptides. One protein only was counted in subsequent 5 analyses.
6 3 These five proteins were not detected with unique peptides but with shared peptides. One protein only was counted in subsequent 7 analyses.      While more than half of the human genome is composed of repeated sequences, only 9.83% or 18,003 altORFs are located inside these repeats (a), compared to 2,45% or 1,677 CDSs (b). AltORFs and CDSs are detected in non-LTR retrotransposons (LINEs, SINEs, SINE-VNTR-Alus), LTR repeats, DNA repeats, satellites and other repeats. Proportions were determined using RepeatMasker (version 3.3.0).   Figure 1g shows the percentage of unique altORFs with a kozak motif (15%), while the current Fig. shows the percentage of altORFs with a kozak motif relative to the total number of altORFs (14%).     Table 2 for details).  Differences between altORF and CDS PhyloP scores (altORF PhyloP -CDS PhyloP, y-axis) are plotted against PhyloPs for their respective CDSs (x-axis). We restricted the analysis to altORF-CDS pairs that were coconserved from humans to zebrafish. The plot contains 889 CDSs containing at least one fully nested altORF, paired with one of its altORFs selected at random (to avoid problems with statistical non-independence). PhyloPs for both altORFs and CDSs are based on 3rd codons in the CDS reading frame, calculated across 100 vertebrate species. We compared these differences to those generated based on five random regions in CDSs with a similar length as altORFs. Expected quantiles of the differences ("DQ" columns) were identified and compared to the observed differences. We show the absolute numbers ("n") and observed-to-expected ratios ("O/E") for each quantile. There are clearly substantial over-representations of extreme values (red signaling conservation DQ 0.95, and blue signaling accelerated evolution DQ 0.05) with 317 of 889 altORFs (35.7%). A random distribution would have implied a total of 10% (or 89) of altORFs in the extreme values. This suggests that 25.7% (35.7%-10%) of these 889 altORFs undergo specific selection different from random regions in their CDSs with a similar length distribution. This percentage is very similar to the 26.2% obtained from an analysis of altORFs without restriction based on co-conservation in vertebrates (see Figure 4-figure supplement 1), a total which would imply that there are about 4,458 altORFs fully nested in CDSs undergoing conserved or accelerated evolution relative to their CDSs. The plot contains all 20,814 CDSs containing at least one fully nested altORF, paired with one of its altORFs selected at random (to avoid problems with statistical non-independence). PhyloPs for both altORFs and CDSs are based on third codon positions in the CDS reading frame, calculated across 100 vertebrate species. We compared these differences to those generated based on five random regions in CDSs with a similar length as altORFs. Expected quantiles of the differences ("DQ" columns) were identified and compared to the observed differences. We show the absolute numbers ("n") and observed-to-expected ratios ("O/E") for each quantile.                   Pixels show the number of times entries co-occur in reference and alternative proteins. Blue pixels indicate that these domains are not shared, white pixels indicate that they are shared once, and red that they are shared twice or more.
Number of alternative/reference protein pairs with shared InterPro entries Number of random iterations The number of reference/alternative protein pairs that share domains (n = 49) is higher than expected by chance alone (p<0.001). The distribution of expected pairs sharing domains and the observed number are shown. This is the same analysis as the one presented in figure  12b, with the zinc finger domains taken out.    Mitochondrial function parameters were assessed in basal conditions (basal), in the presence of oligomycin to inhibit the ATP synthase (oxygen consumption that is ATP-linked), FCCP to uncouple the mitochondrial inner membrane and allow for maximum electron flux through the respiratory chain (maximal OCR), and antimycin A/rotenone to inhibit complex III (non-mitochondrial). The balance of the basal OCR comprises oxygen consumption due to proton leak and nonmitochondrial sources. The mitochondrial reserve capacity (maximal OCR-basal OCR) is an indicator of rapid adaptation to stress and metabolic changes. Mean values of replicates are plotted with error bars corresponding to the 95% confidence intervals. Statistical significance was estimated using a two-way ANOVA with Tukey's post-hoc test (**p = 0,004). (b) ROS production in mock and altMiD51-expressing cells. Cells were untreated, treated with a ROS inducer or a ROS inhibitor. Results represent the mean value out of three independent experiments, with error bars corresponding to the standard error of the mean (s.e.m.). Statistical significance was estimated using unpaired T-test. (c) ATP synthesis rate in mock and altMiD51-expressing cells. No significant differences in ATP production were observed between mock and altMiD51 transfected cells.
Results represent the mean of three independent experiments (8 technical replicates each). Error bars represent the standard error of the mean. At the end of the experiments, cells were collected and proteins analyzed by western blot with antibodies against the Flag tag (altMiD51) or actin, as indicated, to verify the expression of altMiD51. A representative western blot is shown on the right. Molecular weight markers are shown on the left (kDa).   HeLa cells were co-transfected with GFP and mCherry, or altDDIT3 GFP and DDIT3 mCherry , as indicated. Proteins were extracted and analyzed by western blot with antibodies, as indicated. Molecular weight markers are shown on the left (kDa). AltDDIT3 has a predicted molecular weight of 4.28 kDa and thus migrates at its expected molecular weight when tagged with GFP (~32 kDa). Representative experiment of two independent biological replicates.

Scatter plots of Pearson's Correlation Coefficient and Manders' Correlation Coefficient after
Costes' automatic threshold (p-value < 0.001, based on 1000 rounds of Costes' randomization colocalization analysis). M1 is the proportion of altDDIT3 GFP signal overlapping DDIT3 mCherry signal and M2 is the proportion of DDIT3 mCherry signal overlapping altDDIT3 GFP . Error bars represent the mean +/-SD of three independent experiments (28 cells).