N-terminal Proteomics and Ribosome Profiling Provide a Comprehensive View of the Alternative Translation Initiation Landscape in Mice and Men

Usage of presumed 5′UTR or downstream in-frame AUG codons, next to non-AUG codons, as translation start codons contributes to the diversity of a proteome, as protein isoforms harboring different N-terminal extensions or truncations can serve different functions. Recent ribosome profiling data revealed a highly underestimated occurrence of database nonannotated, and thus alternative, translation initiation sites (aTIS) at the mRNA level. N-terminomics data in addition showed that in higher eukaryotes around 20% of all identified protein N termini point to such aTIS, to incorrect assignments of the translation start codon, to translation initiation at near-cognate start codons, or to alternative splicing. We here report on more than 1700 unique alternative protein N termini identified at the proteome level in human and murine cellular proteomes. Customized databases, created using the translation initiation mapping obtained from ribosome profiling data, additionally demonstrate the use of initiator methionine decoded near-cognate start codons besides the existence of N-terminal extended protein variants at the level of the proteome. Various newly identified aTIS were confirmed by mutagenesis, and meta-analyses demonstrated that aTIS reside in strong Kozak-like motifs and are conserved among eukaryotes, hinting at a possible biological impact. Finally, TargetP analysis predicted that the usage of aTIS often results in altered subcellular localization patterns, providing a mechanism for functional diversification.

Eukaryotic protein-coding genes can give rise to multiple translation products of which the expression is regulated at multiple levels. In contrast to transcriptional regulation, translational regulation permits more immediate effects. Initiation, elongation, termination, and ribosome recycling constitute the different phases of the eukaryotic translation process, with translation initiation acting as the gate-keeping step through the successive steps of ternary complex recruitment, scanning, AUG codon selection, and ribosomal subunit joining. Overall, this process requires over 30 different proteins including the eukaryotic initiation factors (eIFs) (1). In eukaryotes, the translation start codon is typically found by ribosome scanning, referred to as the canonical mechanism of translation initiation. Here, the 43S pre-initiation complex (PIC), composed of the initiator Met-tRNA (Met-tRNAi) pre-loaded onto the small (40S) ribosomal subunit, binds near the 5′ end of the mRNA molecule in a fashion mediated by the m7G-cap structure and the eukaryotic initiation factors 4 (i.e. eIF4E, 4G, and 4A; jointly referred to as the eIF4F complex). This complex then starts to scan successive triplets of the 5′ untranslated region (5′UTR) in the 3′ direction until an AUG start codon or, alternatively, a near-cognate start codon enters the P (peptidyl) decoding site of the ribosome. Start codon recognition requires base-pairing with the anticodon loop of Met-tRNAi and triggers a scanning arrest and GTP hydrolysis of the eIF2-GTP-Met-tRNAi ternary complex, ultimately leading to the formation of the 48S initiation complex.
The latter is then followed by factor displacement, enabling the joining of the large (60S) subunit and assembly of the elongation-competent 80S initiation complex, which can now accommodate the second amino acid encoding aminoacyl-tRNA into the aminoacyl site (A-site) and thus formation of the first peptide bond in the process of translation elongation upon recruitment of translation elongation factors.
Secondary RNA structures might influence the processivity and efficiency of scanning and as such regulate translation initiation. mRNAs that contain secondary structures in their 5′UTR require ATP proportional to the degree of secondary structure (2) in addition to helicase activity to enhance 43S PIC binding and scanning.
Although the scanning mechanism of translation initiation is used by most mRNAs, an alternative manner of translation initiation of a specific subset of mRNAs is mediated by internal ribosomal entry sites (IRES).
Viruses use internal ribosomal entry as a mechanism of translation, engaging host cell ribosomes while bypassing the need for (a subset of) the limiting eIFs. Internal ribosomal entry sequences are typically long and highly structured elements that mimic the functions of eIFs while requiring trans-acting factors such as the polypyrimidine tract binding protein PTB or the La autoantigen (3). Several IRESes were also discovered in various cellular mRNAs expressed during apoptosis or mitosis or following cell stress, when cap-dependent translation is known to be impaired (4-5). Moreover, other specific mechanisms of translation initiation exist, such as the structural mRNA element driven, cap-dependent and IRES-like mechanism of histone H4 translation initiation, related to the fact that the noncanonical histone H4 mRNA features, such as its short 5′UTR, prevent conventional scanning and translation initiation (6).
Besides IRES, a second common type of alternative translation is leaky ribosomal scanning. Here, the sequence context surrounding the first encountered AUG is suboptimal, leading to leaky scanning and translational initiation at both this first AUG codon and additional downstream AUG codons (7).
Further, translation re-initiation after a short upstream ORF (uORF) is another common regulatory control mechanism of translation initiation (8-9). In fact, up to 50% of all mammalian genes encode mRNAs that have at least one short uORF residing upstream of the main protein-encoding ORF and that consists of about 30 codons on average (10). Here, some translation factors remain associated with the ribosome, thereby enabling scanning after translating the uORF and thus enabling re-initiation of translation at downstream sites.
Finally, 5′ mRNA leader sequence recapping can also give rise to alternative translates (11, 12), and thus contributes to the translational initiation landscape.
Cis-acting sequence elements steer recognition of the correct initiation codon to ensure the fidelity of translation initiation. Usually, this AUG triplet resides in an optimal context (i.e. gcc[A/G]ccAUGG(not T)), with a purine at position −3 and a guanine at position +4 relative to the A of the AUG codon, which is designated as +1 (7). Control of translation initiation codon recognition and thereby translation initiation can additionally be exerted through various trans-acting factors such as eIFs, where the conserved eIF1 acts as a key determinant. eIF1 mutations resulting in premature eIF1 dissociation were shown to increase initiation rates at near-cognate start codons (13), and eIF1 is thus key in maintaining the fidelity of initiation (14). Further, eIF1A, thought to occupy the A-site, regulates start codon selection in a dual fashion, as its N-terminal region decreases the initiation accuracy and promotes eIF1 dissociation at AUG codons, whereas its C-terminal region increases the stringency of start codon selection and promotes continued scanning at non-AUG codons (15). Further, eIF2 and eIF5 also help to ensure the fidelity of initiation codon selection. In general, phosphorylation of the alpha subunit of eIF2 (eIF2αP) is known to reduce translation initiation. Paradoxically, however, it induces translation of GCN4, a yeast transcriptional activator, by reducing translation initiation at four uORFs (16), thereby overcoming the inhibitory effect of these uORFs on re-initiation at the GCN4 ORF (17).
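The context rule described above can be illustrated with a minimal sketch that inspects the two key positions (−3 and +4) around a candidate start codon. The function name, the 0-based indexing convention, and the returned field names are assumptions made for illustration only; they are not part of the analysis pipeline used in this study.

```python
def kozak_context(transcript: str, start: int) -> dict:
    """Report the two key Kozak positions around a candidate start codon.

    `start` is the 0-based index of the first nucleotide of the start codon
    (position +1 in the text). Position -3 is therefore start - 3 and
    position +4 is start + 3.
    """
    # Work on an RNA alphabet so that DNA and RNA input behave the same.
    seq = transcript.upper().replace("T", "U")
    if start < 3 or start + 4 > len(seq):
        raise ValueError("flanking sequence too short to score the context")
    minus3 = seq[start - 3]
    plus4 = seq[start + 3]
    return {
        "codon": seq[start:start + 3],
        "purine_at_minus3": minus3 in "AG",   # A or G at position -3
        "g_at_plus4": plus4 == "G",           # guanine at position +4
    }


# The consensus gccAccAUGG satisfies both criteria.
example = kozak_context("GCCACCAUGGCG", 6)
```

A pyrimidine at −3, as in `kozak_context("GCCUCCATGACG", 6)`, would be reported with `purine_at_minus3` set to `False`, corresponding to what is referred to as a suboptimal context.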
When used in combination with initiation-specific translation inhibitors, ribosome profiling allows for the study of (alternative) translation (initiation) with subcodon or even single-nucleotide resolution, the latter referred to as Global Translation Initiation sequencing or GTI-seq (22, 33). As such, ribosome profiling provided a wealth of information on the mRNA engagement of (initiating) ribosomes and revealed the omnipresence of alternative translation initiation events in humans and mice, as nearly half the transcripts harbor multiple translation initiation sites or TIS in their sequence (22, 33). Besides a handful of cases for which alternative TIS selection leads to (functionally) distinct protein isoforms because of their N-terminal heterogeneity (i.e. protein stability (34), localization (35-39), function (40), etc.), the overall functional outcome of alternative mRNA engagement, the factors and mechanisms involved in TIS selection, and the overall outcome of expanding the proteome diversity remain largely elusive.
Upon ribosome emergence, nascent protein chains (i.e. 30 to 50 amino acid long protein N termini) can be subjected to various cotranslational modification events, including proteolysis (removal of the initiator methionine (iMet) by the MetAPs (methionine aminopeptidases) (41, 42)) and N-terminal (de)blocking modifications (N-terminal acetylation (Nt-acetylation) (43-46) or deformylation (47)); ubiquitous modifications in eukaryotes and prokaryotes respectively. 50% of all soluble yeast proteins and 80-90% of all soluble human proteins are modified by acetylation of the α-amino group of the amino-terminal residue (Nt-acetylation) (48-51). The utmost N-terminal amino acid is the major determining factor whether or not a given protein is Nt-acetylated and by which N-terminal acetyltransferase (NAT) this occurs (52), although some redundancy among the different NATs can be observed (50, 53). Because Nt-acetylation is considered to mainly occur cotranslationally (54), in vivo acetylated protein N termini can thus be considered as proxies of translation initiation, though read out at the proteome-wide level.
In this study, N-terminal proteomics was used to map the TIS landscape in human and mouse cells. Overall, more than 20% of all identified protein N termini point to aTIS and we report on more than 1700 unique alternative protein N termini next to the more than 4500 database annotated protein N termini identified in the proteomes of study, thereby linking about one-third of the uniquely identified protein N termini to alternative translation initiation events.
N-terminal COFRADIC and LC-MS/MS Analysis-N-terminal COFRADIC analyses were performed as described previously (58). To enable the assignment of in vivo Nt-acetylation events, all primary protein amines were blocked making use of a (stable isotopic encoded) N-hydroxysuccinimide ester at the protein level (i.e. NHS esters of 13C2D3 or D3 acetate).
LC-MS/MS analysis was performed as described previously (48, 50). The generated MS/MS peak lists were searched with Mascot using the Mascot Daemon interface (version 2.2.0, Matrix Science, Boston, MA). Searches were performed in the Swiss-Prot database with taxonomy set to human or mouse (UniProtKB/Swiss-Prot database version 2012_03, containing 20,254 human and 16,513 mouse entries (535,248 sequence entries in total)) or using custom databases (combination of UniProtKB/Swiss-Prot and Ribo-seq derived translation sequences (59)). 13C2D3- or D3-acetylation at lysines, carbamidomethylation of cysteine and methionine oxidation to methionine-sulfoxide were set as fixed modifications for the N-terminal COFRADIC analyses. Variable modifications were 13C2D3-acetylation or D3-acetylation and acetylation of protein N termini. Pyroglutamate formation of N-terminal glutamine was additionally set as a variable modification. Endoproteinase Arg-C/P (Arg-C specificity with arginine-proline cleavage allowed) was set as enzyme allowing no missed cleavages. The mass tolerance on the precursor ion was set to 10 ppm, 0.2 Da and 0.5 Da, and on fragment ions to 0.5 Da, 0.1 Da and 0.5 Da for the Orbitrap, Q-TOF Premier and Ion Trap analyses respectively. The peptide charge was set to 1+, 2+, 3+ and the instrument setting was put to ESI-TRAP for Orbitrap and Ion Trap analyses and to ESI-QUAD-TOF for Q-TOF Premier analyses. Only peptides that were ranked one and scored above the threshold score, set at 99% confidence, were withheld. The estimated false discovery rate by searching decoy databases (a shuffled version of the yeast Swiss-Prot database made by the DBToolkit algorithm (60)) was found to lie below 1.5% on the spectrum level. All annotated highest scoring spectra of the N-terminal peptides reported in supplemental Table S1 are provided as supplemental data.
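Because the precursor tolerance for the Orbitrap data is expressed in ppm whereas the other tolerances are absolute (Da), the following minimal sketch shows how a ppm tolerance translates into an absolute window. The function names are hypothetical and the example masses are arbitrary; the sketch does not reproduce how Mascot applies its tolerances internally.

```python
def ppm_window(mass: float, tol_ppm: float) -> tuple[float, float]:
    """Return the (low, high) window around `mass` for a ppm tolerance."""
    delta = mass * tol_ppm / 1e6  # 1 ppm is one millionth of the mass
    return (mass - delta, mass + delta)


def within_ppm(observed: float, theoretical: float, tol_ppm: float) -> bool:
    """True when the relative mass error does not exceed `tol_ppm`."""
    error_ppm = abs(observed - theoretical) / theoretical * 1e6
    return error_ppm <= tol_ppm


# At a mass of 1000, a 10 ppm tolerance corresponds to +/- 0.01.
low, high = ppm_window(1000.0, 10.0)
```

At this mass, the 10 ppm tolerance is thus 20 to 50 times narrower than the 0.2 Da and 0.5 Da tolerances used for the other instruments.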
Selection of N termini-From the mouse and human N-terminal data sets, N-terminally blocked peptides were selected and classified. The high confident TIS encompass: (1) all (partially) in vivo Nα-acetylated N termini and in vivo unmodified N termini of which the start position corresponded with a Swiss-Prot isoform, Ensembl and/or TrEMBL annotated TIS; (2) iMet processed or iMet retaining counterparts of in vivo Nα-acetylated N termini; (3) N termini matching TIS previously identified by ribosome profiling; (4) N termini annotated as dbTIS in (a) prior Swiss-Prot release(s); (5) N termini for which the iMet processed and/or iMet retaining orthologous N-terminal peptide (HomoloGene) was identified as being (partially) Nα-acetylated in vivo.
Sequence Logo Analysis-All experimentally observed alternative N termini were aligned based on their translation start codon. The N-terminal peptides lacking the initiator methionine (iMet) were preceded with the iMet to rule out codon shifts in the sequence logo creation. Afterward, all peptides were mapped to their coding sequence (Perl scripting using the Ensembl API). Sequence logos were created based on the aligned transcript sequences (12 bp upstream and 9 bp downstream) using WebLogo 3 (http://weblogo.threeplusone.com, (61)). Sequence logos were plotted using both the residue probability and information content in bits as measure. Sequence logos were created for both the dTIS and corresponding dbTIS flanking regions. Also, as an extra positive control to the dTIS sequence logos, 5000 randomly selected coding sequences corresponding to annotated translation initiation sites from CCDS (62) proteins were aligned for nucleotide context logo creation.
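The two measures plotted in the sequence logos can be illustrated with a minimal sketch that computes, per alignment column, the information content in bits as the maximal entropy (2 bits for four nucleotides) minus the observed Shannon entropy. This is a simplified illustration, not the WebLogo implementation: the small-sample correction that WebLogo applies is omitted, and the function name and toy alignment are assumptions.

```python
import math
from collections import Counter


def column_information(aligned: list[str], alphabet: str = "ACGT") -> list[float]:
    """Information content (bits) per column of equally long aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    max_bits = math.log2(len(alphabet))  # 2 bits for a 4-letter alphabet
    bits = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned)
        total = sum(counts[base] for base in alphabet)
        if total == 0:
            bits.append(0.0)  # no informative symbols in this column
            continue
        entropy = 0.0
        for base in alphabet:
            p = counts[base] / total  # residue probability
            if p > 0:
                entropy -= p * math.log2(p)
        bits.append(max_bits - entropy)
    return bits


# Toy alignment: an invariant ATG followed by a fully variable position.
toy_bits = column_information(["ATGG", "ATGC", "ATGA", "ATGT"])
```

In this toy alignment the invariant start codon positions reach the maximum of 2 bits, whereas the fully variable fourth position carries no information.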
Ribosome Profiling and Genome-wide Visualization-Raw sequencing reads of the mESC ribosome profiling data (22) were downloaded from the Gene Expression Omnibus (data set GSE30839). All reads from the control (cycloheximide treated, also referred to as CHX treated, sample GSM765292) and harringtonine treated (sample GSM765295) were remapped using bowtie (v.0.12.7) on the mouse genome (assembly version 37) using the protocol described (63). All HEK293 cell line GTI-seq data (27) was downloaded from the NCBI Sequence Read Archive (accessions: SRX172392, SRX172361, SRX172360, SRX172315) and processed similarly.
Genome-wide visualization of the experimental data, in combination with publicly available data, was accomplished using an in-house developed genome browser (H2G2, http://h2g2.ugent.be/biobix.html). Information tracks containing the ribosome profile/GTI-seq mappings of the CHX treated samples (generating a translation profile all over the coding mRNA) and harringtonine/lactimidomycin treated samples (translation profile accumulation at the TIS) are available (see (22) for more information). Furthermore, an information track is constructed showing the predicted translation products, based on the TIS-predictions from Ingolia et al. (22) and Lee et al. (27) (for respectively the mESC and HEK293 cell line sample) and the UCSC transcript annotation. The genomic locations of the N-terminal peptides identified by means of N-terminal COFRADIC are also visualized in the H2G2 browser. Other visualization tracks include genomic information from a local Ensembl (64) instance NCBIM37.66, transcript tracks holding the annotated TIS within the UniProtKB and Ensembl database and a conservation track based on the phastCons (65-66) conservation scores among others. More information on how to use the H2G2 browser can be found in supplemental File S1 and as indicated below.
Genomic Annotation of the Identified dbTIS and aTIS-All Swiss-Prot annotated N termini (dbTIS) and alternative N termini (aTIS) were mapped to their corresponding reference genome (GRCh37 for human and NCBIM37 for mouse) based on the UniProtKB/Swiss-Prot accession number (or alternatively the Swiss-Prot gene name) and the N-terminal peptide sequence (PeptideMapper script based on the BioMart (67) and Ensembl (68) API, version 66). The genomic locations of the experimental aTIS and dbTIS were made available as a visualization track in the H2G2 genome browser (see above). Two projects are made available (named TIS Human and TIS Mouse) using a public login (see supplemental File S1 for more details). These projects hold several visualization tracks (see The aTIS "GeneDigest" report lists all genes for which an alternative start site is reported, whereas the dbTIS "GeneDigest" report lists all genes of which a Swiss-Prot database annotated TIS has been identified. A third "GeneDigest" report lists extra translation start sites identified from the N-terminomics experiments searching a protein product database constructed based on ribosome profiling sequence information (22) (see above). Further, experimental Ribo-seq data (22) are also presented as custom tracks, allowing manual inspection of co-occurrence of N-terminal COFRADIC and Ribo-seq experimental evidence.
Conservation Analysis-To assess the evolutionary conservation potential of the identified d(b)TIS and their flanking sequences as compared with 5000 randomly chosen, BioMart (67) annotated CCDS translation initiation sites, their orthologous positions in various vertebrate genomes were extracted using phastCons (65)(66) and scored in a multiple sequence alignment. The phastCons program computes conservation scores based on a phylo-HMM, a type of probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next. The value plotted at each site is the posterior probability that the corresponding alignment column was "generated" by the conserved state of the phylo-HMM.
TargetP Analysis-To categorize the subcellular location of the proteins translated from their annotated versus their alternative N termini, a TargetP prediction (v1.1b, (69)) was performed. The location assignment is based on a predicted N-terminal presequence: a mitochondrial targeting peptide, or a secretory pathway signal peptide.
The mutagenized constructs were used as templates for in vitro coupled transcription/translation in a rabbit reticulocyte lysate system according to the manufacturer's instructions (IVTT; Promega, Madison, WI) to generate [35S]methionine labeled protein products. 5 μl of the translation reaction was diluted 10-fold in 10 mM Tris pH 8.0. NuPAGE® LDS Sample Buffer (Invitrogen) was added and the samples were heated for 10 min at 70°C. Samples were separated on 4-12% NuPAGE® Bis-Tris gradient gels (1.0 mm x 12 well) (Invitrogen) using MOPS Buffer. Subsequently, proteins were transferred onto a PVDF membrane, air-dried and exposed to a film suitable for radiographic detection (ECL Hyperfilms, Amersham Biosciences, Buckinghamshire, UK). Radiolabeled proteins were visualized by radiography.

Mapping of the Translation Initiation Landscape in Human and Mice Using N-terminomics Reveals Numerous Alternative Translation Initiation Events-In this study, a proteome-wide map of the translation initiation landscape in human and mouse was created by mass spectrometry assisted analysis of protein N termini isolated by N-terminal COFRADIC (70). A TIS compilation was made from previously generated N-terminal proteomics data ((50, 55-57) and unpublished data) derived from the proteomes of the human HeLa, HCT116, A-431, THP-1, K-562, and Jurkat cell lines in addition to primary B-cells as well as the mouse cell lines Mf4/4 and YAC-1 next to primary dendritic mouse cells. Here, prior to tryptic digestion, all primary amines, and thus free protein N termini, were mass tagged by acetylation using nonnatural, stable isotope encoded groups such as trideutero-acetate. In this way, in vivo Nt-acetylated and in vivo free N termini can be distinguished and the degree of Nt-acetylation determined (71). After tryptic digestion, all protein N termini will thus be blocked, whereas all other internal peptides will have a newly generated primary α-amine. Subsequently, N-terminal peptides are enriched for by means of a strong cation exchange (SCX) step at low pH and further segregated from remaining internal peptides using a diagonal chromatography strategy. Selected protein N termini are subsequently identified following LC-MS/MS analysis (72). Identified protein N termini were grouped by their TIS context. First, protein N termini with a Swiss-Prot database annotated protein start position (i.e. N termini starting at protein position one or two in the protein sequence) are referred to as database annotated TIS or dbTIS. Overall, 2879 human and 1771 mouse dbTIS-indicative N termini originating from 2723 and 1708 unique Swiss-Prot protein entries were identified (supplemental Table S1).
Second, based on the cotranslational nature of N-terminal acetylation of protein N termini (48) by the NATs, the near universal requirement of a Met-encoding initiator codon (iMet) and the cotranslational processing of iMet by methionine aminopeptidases (MetAPs), all in vivo free and/or Nt-acetylated peptides with start positions downstream of the database annotated TIS were grouped. In this way, 1231 human (1060 proteins) and 465 mouse (418 proteins) N termini hinted at protein N termini originating from the usage of in-frame, downstream TIS (dTIS), thus giving rise to N-terminally truncated protein isoforms. The N termini hinting at dTIS were further subdivided into two subcategories (see also Experimental Procedures): the high confident dTIS, encompassing all (partially) in vivo Nt-acetylated peptides among others, and the low confident dTIS. In vivo unmodified dTIS compliant with the rules of iMet-processing and Nt-acetylation (the latter for example considering that, without exception, (X)-Pro-starting N termini are unmodified (73)) were withheld as low confident dTIS, and this only when their protein start position did not overlap with any proteolytic cleavage event reported in public repositories (74-77) (i.e. protein signal processing sites or reported proteolytic cleavage sites after or before a Met residue). Finally, whenever Ribo-seq or ortholog mapping hinted at translation initiation events, low confident dTIS were recataloged as high confident dTIS (see below and supplemental Table S1).
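The kind of rule check referred to above can be sketched as follows. The sketch rests on two assumptions that should be stated explicitly: the set of small residues after which MetAPs generally remove the iMet (Ala, Cys, Gly, Pro, Ser, Thr, Val) reflects commonly reported MetAP specificity and is not taken from this study, and the function and field names are hypothetical. Only the (X)-Pro rule is taken from the text.

```python
# Residues with small side chains after which iMet removal is generally
# reported; this set is an assumption and is not listed in the text.
SMALL_RESIDUES = frozenset("ACGPSTV")


def mature_n_terminus(protein: str) -> dict:
    """Predict iMet processing for a sequence that starts at a candidate TIS."""
    seq = protein.upper()
    if len(seq) < 2 or seq[0] != "M":
        raise ValueError("expected at least two residues, starting with Met")
    imet_removed = seq[1] in SMALL_RESIDUES
    mature = seq[1:] if imet_removed else seq
    # Per the text, (X)-Pro-starting N termini are found unmodified, so
    # Nt-acetylation is not expected when Pro is the first or second residue.
    x_pro = mature[0] == "P" or (len(mature) > 1 and mature[1] == "P")
    return {
        "imet_removed": imet_removed,
        "n_terminal_residue": mature[0],
        "nt_acetylation_possible": not x_pro,
    }


processed = mature_n_terminus("MASK")   # iMet removed, Ala-starting
retained = mature_n_terminus("MKLV")    # iMet retained, Met-starting
```

An unmodified N terminus that conflicts with such rules (for instance a Lys-starting peptide directly preceded by Met) would be more readily explained by proteolysis than by translation initiation.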
For dbTIS, the discrepancy between the numbers of identified protein N termini and the actual proteins is only because of the observed incompleteness of iMet-processing (i.e. cases where both the iMet processed and unprocessed N termini were identified) and thus heterogeneity of the N-terminal protein ends, whereas in the case of dTIS, multiple dTIS were observed for several proteins.
Further, the identified N termini were grouped by mapping them to the first or a subsequent exon, with the former category hinting at alternative translation events by leaky scanning or re-initiation, whereas the latter, besides representing putative dTIS, might point to TIS originating from alternative splicing (Table I and supplemental Table S1). Overall, 1220 human dTIS (out of the 1231 Swiss-Prot N termini identified) and all 465 mouse dTIS N termini could be mapped onto their corresponding reference genome, and their confidence level is given in Table I and supplemental Table S1. Metadata related to the d(b)TIS identifications are made available as visualization tracks in the H2G2 genome browser (http://h2g2.ugent.be/biobix.html, see also supplementary information).
Of the dTIS N termini identified, 18% (n = 223) and 15% (n = 71) of the Swiss-Prot nonannotated human and mouse TIS mapped to Swiss-Prot isoform entries and/or indicative transcripts in TrEMBL and/or Ensembl, validating our selection strategy for identifying dTIS, as these have been experimentally proven to give rise to N-terminally truncated protein isoforms (supplemental Table S1). Overall, these numbers indicate that our data set is of high quality and thus holds numerous hitherto unreported dTIS, here discovered at the level of the proteome.
TIS Sequence Context Analyses-A survey of the sequence context flanking the dbTIS and dTIS of the Exon1 and Exon>1 categories using WebLogo (78) revealed a preference for the most crucial Kozak context elements, being a purine at posi… Fig. 1A and 1B). A detailed analysis of the human Swiss-Prot dbTIS for which we report an alternative start site located in the first exon (i.e. the dTIS exon1 category) additionally revealed an increased occurrence of suboptimal start codon contexts (with a pyrimidine in position −3 upstream of AUG instead of a purine (82)). As compared with the start codon contexts of all identified dbTIS, an increased suboptimal versus optimal measure of 35.2% versus 19.5% is observed (as deduced from input data used in Fig. 1 and S1).

TABLE I. … and mouse NCBIM37). The dTIS locations were detected throughout several mass spectrometry analyses and a compilation was extracted from the in-house ms-lims system (109) obtained from previously generated N-terminal proteomics data ((50, 55-57) and previously unpublished data) (supplemental Table S1). The overlap with Ensembl, Swiss-Prot or TrEMBL annotated TIS is also provided. For the Ensembl mapping an extra subdivision was done based on TIS location in the first (Exon1) or the consecutive exons (Exon>1). Also, a comparison was made with the TIS identified within two ribosome profiling studies on HEK293 and mESC cell lines (18, 19). See the "Selection of N termini" paragraph within the material and methods section for more explanation on the subdivision based on confidence level (either H or L). "(+ meta)" indicates that isoform, transcript, Ribo-seq and/or orthologous dTIS metadata is available for dTIS originally assigned as low confidence (See also supplemental …
According to the leaky scanning model (83), the 40S ribosomal subunits can miss an AUG codon in a suboptimal context and initiate translation at downstream AUG(s), which is corroborated by GTI-seq data showing that the strongest Kozak consensus sequence was observed in the gene group with no detectable dTIS but with dbTIS initiation, whereas this context was largely absent in the group of genes lacking a detectable translation initiation at dbTIS (33). The downstream flanking sites of these downstream start sites in both the dTIS exon1 and dTIS exon>1 categories were further inspected using the AUG_Hairpin software, enabling the prediction of downstream secondary structure influencing translation start site recognition (82, 84-86). Following the strategy of Kochetov et al., only those dTIS that show a stable stem-loop structure (E_tot < −20 kcal/mol) located between 13 and 19 nucleotides downstream from the start site were retained.

FIG. 1. A, Homo sapiens TIS WebLogos. The flanking sequences (12 bases upstream, 9 bases downstream) of the corresponding dbTIS for which a dTIS has been identified, the experimentally observed dTIS, located in exon 1 (n = 374), and the experimentally observed dTIS, located in subsequent exons (n = 791) are used to create WebLogos. B, Mus musculus TIS WebLogos. The flanking sequences (12 bases upstream, 9 bases downstream) of the corresponding dbTIS for which a dTIS has been identified, the experimentally observed dTIS, located in exon 1 (n = 197), and the experimentally observed dTIS, located in subsequent exons (n = 251) are used to create WebLogos. Both the probability and bits values are plotted. The discrepancy between the numbers given above and the numbers in Table I arises because splice site spanning flanking sequences were not used to create the WebLogos.
Average energies of eligible stem-loop structures (E_tot) were −32.2 kcal/mol and −32.6 kcal/mol for the dTIS with suboptimal and optimal start codon contexts respectively (also the distributions of E_tot values proved not to be significantly different according to a Kolmogorov-Smirnov two-sample test). Overall, the presence of the Kozak sequence context in all categories is further indicative of real TIS events.
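The stem-loop selection criterion and the subsequent averaging can be sketched as follows. The record layout, the identifiers and the energy values are hypothetical and serve only to illustrate the filter (E_tot < −20 kcal/mol, located 13 to 19 nucleotides downstream of the start site); the sketch does not call or reproduce the AUG_Hairpin software, and treating the distance bounds as inclusive is an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StemLoop:
    tis_id: str        # hypothetical identifier of the dTIS
    distance_nt: int   # distance of the stem-loop downstream of the start site
    e_tot: float       # predicted total energy in kcal/mol


def is_eligible(hairpin: StemLoop, max_energy: float = -20.0,
                min_dist: int = 13, max_dist: int = 19) -> bool:
    """Stable stem-loop within the 13-19 nt window (bounds assumed inclusive)."""
    return hairpin.e_tot < max_energy and min_dist <= hairpin.distance_nt <= max_dist


# Made-up predictions for illustration only.
predictions = [
    StemLoop("dTIS_a", 15, -31.0),   # retained
    StemLoop("dTIS_b", 15, -12.5),   # not stable enough
    StemLoop("dTIS_c", 25, -40.0),   # too far downstream
    StemLoop("dTIS_d", 19, -33.0),   # retained
]
eligible = [p for p in predictions if is_eligible(p)]
average_energy = mean(p.e_tot for p in eligible)
```

Averaging E_tot separately over the eligible dTIS with suboptimal and with optimal start codon contexts then gives group means of the kind compared in the text.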
Conservation Analysis-To assess the possibility of evolutionary conservation of the identified dTIS and their flanking sequences as compared with their corresponding dbTIS, the orthologous positions in various vertebrate genomes were extracted using phastCons (65, 66) and scored in a multiple sequence alignment, thereby generating a metagenic conservation plot (Fig. 2). Also, an analysis was made between the identified dTIS and a set of 5000 randomly chosen, BioMart (67) annotated complete CDS (CCDS) translation initiation sites (serving as a proxy for the global dbTIS landscape, supplemental Fig. S2). Only d(b)TIS locations were taken into account where the complete flanking region does not span any splice junction. In general, the phastCons score (between 0 and 1) gives a probability that each nucleotide belongs to a conserved element (see Material and Methods for detailed explanation). Overall, the human and mouse conservation plot indicated that the dTIS are highly conserved, with a mean conservation score of 0.97 (± 0.002, 95% confidence interval) and 0.97 (± 0.005) for respectively the Exon1 and Exon>1 groups, compared with the dbTIS with a mean conservation score of 0.96 (± 0.01). This indicates that the dTIS translation start sites are very well conserved within eukaryotic genomes, in analogy to what was previously reported by Bazykin et al. (87) using in silico predictive analyses and using in vivo GTI-seq experiments (33).
Further, the conservation scores of the dTIS flanking regions of both the Exon1 and Exon>1 groups are high, ranging from 0.9 to 1. Here, next to the translation start codon, other Kozak hallmarks such as the guanine at position +4 and purine at position −3 are well conserved. Also notable in the dTIS conservation plots, and expected given the higher coding potential of the first two nucleotides, is the slightly higher conservation of the first two nucleotides of the coding triplets in the translated sequence (88). This feature is most pronounced in the human dTIS exon>1 plot. As opposed to the dTIS plots, the flanking 5′ upstream sequences in the dbTIS plots score significantly lower, as these presumably contain untranslated sequence (UTR), in contrast to the 5′ upstream region of the dTIS that contains translated sequence encoding N-terminal protein extensions. No significant differences are obvious between the dTIS conservation plots of the Exon1 and Exon>1 groups.
Statistical testing was performed to assess the sequence conservation surrounding the Kozak motif and to increase confidence that the identified sites (dTIS) are genuine translation initiation sites. For that purpose we compiled a data set of decoy sites meeting the following criteria: (I) Consensus Coding Sequences (CCDS) were scanned for downstream Kozak sequence motifs [A/G]ccAUGG, (II) the identified Kozak sequence motif sites that overlap with dTIS identified in the N-terminal COFRADIC datasets reported in this study were discarded, (III) the ones showing an overlap with dTIS identified within the ribosome profiling experiments re-analyzed in this study (human (33) and mouse (22)) were also discarded. This group of decoy sites was compared with the different categories of TIS described in the study: database annotated TIS (dbTIS), downstream TIS located in exon1 (dTIS exon1) and downstream TIS located in further downstream exons (dTIS exon>1). The phastCons conservation scores at positions (−3, +1, +2, +3, +4), the most crucial Kozak context positions, were averaged for further calculation. A low p value (4.701e−12 and <2.2e−16 for respectively the human and mouse data sets) in a Welch one-way ANOVA is indicative of a difference among the four TIS groups (after testing for heteroscedasticity using Levene's test). In order to determine which particular group of TIS deviates the most, a Tukey's Honestly Significant Difference (Tukey-HSD) post hoc test (accounting for heteroscedasticity by using a heteroscedasticity-consistent covariance estimation) was performed, showing a clear difference in the sequence conservation surrounding the Kozak motif between the decoy group and the three other sets (dbTIS, dTIS exon1 and dTIS exon>1, p value < 0.001) at a 95% confidence level.
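The compilation of decoy sites (criteria I to III) can be sketched as a motif scan followed by the removal of known sites. This is a minimal sketch under stated assumptions: the function name, the 0-based coordinate convention and the example sequence are hypothetical, no reading-frame filter is applied because the text does not specify one, and the experimentally identified dTIS (COFRADIC and ribosome profiling) are simply passed in as one set of positions.

```python
import re

# [A/G]ccAUGG on a DNA alphabet; the lookahead also finds overlapping motifs.
KOZAK_MOTIF = re.compile(r"(?=[AG]CCATGG)")


def decoy_sites(cds: str, known_tis: frozenset = frozenset()) -> list[int]:
    """Return 0-based positions of the A of downstream Kozak-embedded ATGs.

    `known_tis` holds positions of experimentally identified dTIS, which are
    discarded (criteria II and III in the text).
    """
    seq = cds.upper().replace("U", "T")
    sites = []
    for match in KOZAK_MOTIF.finditer(seq):
        atg_position = match.start() + 3  # skip the [AG]CC prefix (-3 to -1)
        if atg_position not in known_tis:
            sites.append(atg_position)
    return sites


# Hypothetical coding sequence with two downstream Kozak motifs.
cds = "ATGGCCACCATGGTTTGCCATGGAA"
all_sites = decoy_sites(cds)
filtered = decoy_sites(cds, known_tis=frozenset({19}))
```

Because the motif requires three nucleotides upstream of the ATG, the annotated start codon at position 0 can never be reported, so only downstream sites are returned.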
Finally, to further analyze the degree of conservation of dTIS between our human and mouse data sets, the experimentally identified mouse and human dTIS were compared. In total, for 200 orthologous dTIS pairs, both the human and mouse N termini could be identified (i.e. for 43% of all mouse dTIS identified, the orthologous human N terminus could also be identified).
Of these, 29 human and 31 mouse dTIS were originally classified as low-confidence dTIS. Based on the MS/MS-based evidence of Nt-acetylation of these orthologous N termini, three human and nine mouse dTIS could now be re-cataloged under reliability class 1 (supplemental Table S1).
TargetP Analysis-To have a first approximation of the functional impact of alternative TIS usage, a TargetP analysis was performed predicting the subcellular location of the full-length proteins (i.e. proteins translated starting from their dbTIS) versus their identified N-terminally truncated counterparts (dTIS exon1 and dTIS exon>1) (89). Fig. 3 (upper and lower panes for human and mouse, respectively) shows that although only a small percentage of the dbTIS protein products is predicted to contain a mitochondrial targeting or signal peptide (most likely an underrepresentation, because in most cases signal or transit peptide maturation has occurred), a noticeable decrease of secreted or mitochondrial targeted proteins can be observed when assessing their N-terminally truncated counterparts (see Fig. 3 pie charts; chi-squared test of independence, p value < 2.2e-16 for both the human and mouse data sets). The bar plots within Fig. 3 give a more detailed view, making an extra subdivision based on (I) the reliability of the TargetP prediction (class 1 to 5) and (II) whether the dTIS is localized in exon1 or in an exon downstream of exon1 (exon>1). The more detailed bar plots also show a significant drop for the mitochondrial and secretory pathway localization categories ("M" and "S") independent of the reliability classes (1-5) in both the exon1 and exon>1 groups (green versus blue bars).
Overall, the TargetP output strengthens the idea that dTIS usage has an impact on protein subcellular localization (35,36,90), which was also hypothesized and computationally investigated by Cai et al. (91) and in fact proven for a variety of N-terminal protein isoforms generated by means of alternative translation initiation.
Ribosome Profiling Data Provide Independent Experimental Support for N-terminomics Data-Interestingly, of the TIS identified here, complementary Ribo-seq TIS profiling data are available for 861 of the 1755 transcript-matching mouse dbTIS (49%), 69 of the 465 mouse dTIS (15%), 1150 of the 2841 human dbTIS (40%), and 105 of the 1220 human dTIS (9%) (supplemental Table S1), thereby providing evidence that these represent genuine translation initiation sites in mouse and human transcripts (22), and thus that these N termini are representative of N-terminal protein variants.
As such, the experimental evidence obtained by ribosome profiling-assisted TIS identification categorizes 47 extra mouse dTIS and 66 extra human dTIS to reliability class 1, thereby increasing the percentage of validated dTIS to 27% (n = 125) and 22% (n = 272) in mouse and human, respectively (supplemental Table S1).
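As a quick consistency check, the percentages quoted for Ribo-seq support and for validated dTIS can be recomputed from the counts given in the text. The short Python sketch below only uses those counts; the dictionary layout and the helper name `percent` are illustrative choices.

```python
# (supported, total) counts as quoted in the text.
COUNTS = {
    "mouse dbTIS with Ribo-seq data": (861, 1755),
    "mouse dTIS with Ribo-seq data": (69, 465),
    "human dbTIS with Ribo-seq data": (1150, 2841),
    "human dTIS with Ribo-seq data": (105, 1220),
    "validated mouse dTIS": (125, 465),
    "validated human dTIS": (272, 1220),
}


def percent(part, total):
    """Percentage of `part` in `total`, rounded to the nearest integer."""
    return round(100 * part / total)


for label, (part, total) in COUNTS.items():
    print(f"{label}: {part}/{total} = {percent(part, total)}%")
```

The rounded values reproduce the reported 49%, 15%, 40%, 9%, 27%, and 22%.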
Overall, when taking into account the available isoform, transcript, Ribo-seq, and orthologous dTIS metadata, 24 extra mouse dTIS and 42 extra human dTIS originally assigned as low confidence are now classified as highly likely genuine dTIS, summing up to 73% (n = 900) and 64% (n = 298) of all identified human and mouse dTIS, respectively, having data that support their translation initiation potency (supplemental Table S1).
Because the translatome (i.e. the ORF delineation and the translation initiation landscape) can be specifically delineated using ribosome profiling data, usage of this type of data does not necessitate translation of transcripts into their three reading frames, hence decreasing the search space tremendously. Alongside, noncanonical codons serving as alternative initiation codons are revealed. These near-cognate translation initiation codons can be decoded either as the expected initiator methionine residue or, alternatively, as their cognate amino acid; for example, leucine-decoded CUG translation starts have been reported (92,93). For these reasons, customized sample-oriented and ribosome profiling-derived protein databases were created, which served as custom-made reference data sets for the proteomes of study (59).
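The two decoding options for a near-cognate start codon can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the database-building procedure of reference (59): the function name `translate_from`, its `initiator_met` flag, and the toy transcript are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_from(mrna, start, initiator_met=True):
    """Translate from a given TIS position until the first stop codon.

    initiator_met=True decodes the start codon as Met regardless of its
    sequence (e.g. CUG, GUG, ACG); False keeps the cognate amino acid.
    """
    seq = mrna.upper().replace("U", "T")
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    if protein and initiator_met:
        protein[0] = "M"
    return "".join(protein)


# Toy transcript (assumption): CUG start at position 2, followed by GCU AAA UGA.
toy = "CCCUGGCUAAAUGA"
print(translate_from(toy, 2))                       # MAK
print(translate_from(toy, 2, initiator_met=False))  # LAK
```

Because only the positions reported by ribosome profiling are translated, each transcript contributes a handful of candidate N termini rather than every ORF in three reading frames.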
In the mouse proteomes, besides the additional full-length protein N termini identified (i.e. N termini not enclosed in the Swiss-Prot database), 17 N termini indicating N-terminal protein extensions were identified (supplemental Table S1). Four N-terminal extensions were generated upon translation initiation at an AUG codon; in addition, 13 were produced by translation initiation at near-cognate start codons, namely GUG (6 N termini), CUG (4), and ACG (3), normally encoding Val, Leu, and Thr, respectively, but decoded to Met as evident from the iMet-retaining N termini identified. Besides, the N termini of a uORF and a dTIS protein product could be identified using GUG and CUG as start codon, respectively.
In human, 17, 4, and 2 N termini pointed to N-terminal protein extensions, N-terminally truncated protein products, and overlapping uORF protein products, respectively, 22 of which were generated upon translation initiation at a near-cognate start codon (supplemental Table S1). Besides, these database searches led to the identification of some additional dbTIS not contained in the Swiss-Prot database (supplemental Table S1).
TIS Mutagenesis Analyses Reveal that N-terminal Protein Isoforms of the Class dTIS exon1 are Generated by Means of Leaky Ribosomal Scanning-To further verify whether some of the alternative TIS products identified are raised by alternative translation initiation, in vitro translation studies of (mutagenized) dTIS holding coding sequences (CDS) flanked by (a part of) their presumed 5′UTR were performed using coupled in vitro transcription/translation assays. In general, mutation of AUG to CUG typically abolished or greatly diminished translation initiation at these sites. In all cases distinct protein bands corresponding to the short N-terminal protein isoform(s) (i.e. dTIS products), identified by proteomics means, and the database annotated variant could be determined (Fig. 4). Further, because the mutation of the canonical initiation site affected the production of the truncated isoform(s) and vice versa (i.e. some truncated N-terminal isoforms were only detected when the dbTIS were mutagenized), our results strongly indicate that the dTIS products are produced by alternative translation initiation via leaky ribosomal scanning at internal translation start sites. Besides, in some cases a deviation from the 5′ polarity of scanning could be observed for closely spaced AUG codons (up to 16-19 nt). As a result of the proposed reverse directionality of scanning (3′ to 5′), a lower initiation frequency at a 5′ proximal AUG could be observed in the presence of (a) nearby downstream AUG(s), suggestive of downstream nucleotides inferring a restricted relaxation to the forward directionality of scanning of the proximal AUG (94).
FIG. 3. TargetP analysis of the protein products generated by dbTIS versus dTIS usage. TargetP predicts both N-terminal mitochondrial targeting peptide (mTP) and signal peptides (SP) processing, respectively reflecting mitochondrial and secretory pathway localization. Both human (upper pane) and mouse (lower pane) are plotted. The pie charts depict the overall localization patterns of the dTIS and dbTIS translation products. The more detailed bar charts make an extra subdivision based on (I) the reliability of the TargetP prediction (class 1 to 5, where 1 indicates the strongest prediction) and (II) whether the dTIS is localized in exon1 or in an exon downstream exon1 (exon>1). The green and blue bars respectively correspond to N-terminal isoforms raised upon dbTIS and dTIS usage, dark and clear bars represent the Exon1 and Exon>1 group respectively. The x axis shows the predicted localizations ("M" stands for mitochondrial, "S" for secretory pathway) and the reliability of that prediction (class 1 to 5). The rightmost bars depict the combination of all reliability classes ("All M" and "All S"). The y axis corresponds to the total number of TIS events falling within the groups depicted in the x axis.
DISCUSSION
The most acknowledged mechanism of protein diversification in mammalian genomes is alternative splicing, where different mRNAs are derived from the same nascent transcript. Only recently, alternative translation initiation from a single mature transcript was recognized as an important and wide-spread mechanism of protein diversification, further highlighting the importance of gene functionality (22,33).
As previously postulated, targeted analysis of protein N termini is ideally suited to study N-terminal protein isoform diversity (35).
In this study, we report on more than 1700 alternative translation initiation events in mouse and human cell lines by applying stringent rules for mass spectrometric based identifications of N-terminal peptides. Besides, our detailed understanding of the specificity of the Nt-acetyltransferases and cotranslational processes in general assists in judging if such peptides indeed report protein translation events and thereby allow for the functional (re-)annotation of genomes. For a significant fraction of the here reported TIS, available metadata from transcripts, ribosome profiling, TIS sequence context and conservation analyses served as evidence that our N-terminal selection provides a very powerful strategy to map the TIS landscape in higher eukaryotes. In addition, and as is the case for the high mobility group proteins HMGB1, HMGB2, and HMGB3, next to the orthologs matching dTIS sites, corresponding dTIS sites in all three homologs could be identified (supplemental Table S1). These observations are further strengthened by the fact that previously reported dTIS products of the Insulin-like growth factor 2 mRNA-binding protein 2 (IGF2BP2) (95), Glucocorticoid receptor (GCR) (96), Insulin-degrading enzyme (97), Regulator of G-protein signaling 2 (98), and the BAG family molecular chaperone regulator 1 (99), of which the N-terminal isoform expression was shown to display a stage- and site-specific expression profile during mouse development, were also identified in this study (supplemental Table S1).
FIG. 4. In vitro translation of TIS-mutagenized constructs reveal the existence of the by proteomics identified N-terminal protein variants. Wild type and d(b)TIS mutagenized pOTB7 constructs encoding N-terminal variants of aminoacyl tRNA synthase complex-interacting multifunctional protein 2 (AIMP2), inhibitor of kappa light polypeptide gene enhancer in B-cells kinase gamma (NEMO), Zinc finger protein 296 (ZN296), splicing factor 3a subunit 3 (SF3A3), cytoplasmic aspartate-tRNA ligase (SYDC), cytokine receptor-like factor 3 (CRLF3), Nucleosome assembly protein 1-like 1 (NP1L1), and pyridoxamine 5′-phosphate oxidase (PNPO) were in vitro transcribed and translated. Following SDS-PAGE and electroblotting, radiolabeled proteins were visualized by radiography. Assignments of the precursor band corresponding to the database annotated protein sequences (black arrowhead) and protein products raised upon dTIS usage (orange and dashed arrowheads) were verified by mutating their respective initiator methionines. A gray arrow points to translation initiation events at dTIS which were not identified by proteomics means. An asterisk is indicative of an unspecific protein band produced in the control in vitro transcription and translation reaction (i.e. a reaction without input DNA). In each case the theoretical molecular weights of the identified N-terminal protein variants are indicated.
Ribo-seq and GTI-seq in mammalian cells revealed that only half of the TIS codons made use of AUG as the translation initiation codon. However, in the study of Lee et al. (33), TIS codon usage was shown to be distinct when residing in the presumed 5′UTR (uTIS) as opposed to the annotated CDS. When outside the dbTIS/dTIS reading frame, uTIS were mostly associated with short ORFs and were mostly non-AUG codons, whereas dTIS, typically encoding N-terminally truncated protein variants, predominantly made use of AUG codons. This observation is in line with our data and with the fact that, based on available Ribo-seq data sets (22,33), only 37 protein products raised upon translation initiation at near-cognate start codons, typically giving rise to N-terminal protein extensions, were identified in our mouse and human data sets. Further, a recent computational analysis of ribosome profiling data calculating the efficiencies of individual translation initiation sites revealed that, despite the high frequency of non-AUG translation initiation sites identified by means of ribosome profiling, the probability of initiation at non-AUG codons was considerably lower than at AUG codons (data presented by Pavel Baranov at the EMBO Conference Series "Protein Synthesis and Translational Control," Heidelberg, Germany 2013 (100)), likely explaining their general underrepresentation in the N-terminomics data sets presented here.
In addition to translation re-initiation for alternative translation initiation, GTI-seq demonstrated that leaky scanning was the major contributing mechanism leading to TIS selection: the strongest Kozak consensus sequences were observed in the gene group with dbTIS selection but no detectable dTIS, whereas dTIS selection was observed when the dbTIS resided in a weak or absent consensus sequence context. This enables an estimation of the leakiness of the first AUG codon and again confirms our proteome data.
Our TargetP analyses, as well as multiple other lines of evidence, indicate that alternative translation initiation can give rise to iso-functional though localization-specific N-terminal protein variants, making translation initiation a very attractive mechanism for regulating protein localization. This was previously reported, for example, for the p43 Component of the Multisynthetase Complex (37), where a mitochondrial targeting sequence is lost when translation initiation proceeds via a dTIS. In contrast, translation initiation at a dTIS in Flap endonuclease 1 (FEN-1) exposes a cryptic mitochondrial targeting signal (38). Both cases contribute to the increased complexity of the mitochondrial proteome (39).
Although translation initiation at dTIS found in close proximity to dbTIS is more likely to yield isofunctional and localization-nondistinct N-terminal isoforms, thereby providing a potential fail-safe mechanism for translation initiation to occur, it is noteworthy that the N-terminal identity of a protein is critically important in determining protein stability. More specifically, the N-end rule relates the regulation of the in vivo half-life of a protein to the identity of (the N-terminal modification of) its N-terminal residue. Therefore, the loss of even a single N-terminal amino acid or its modification can significantly impact protein stability (101) and influence protein localization (102) and protein complex formation (103), among others. Finally, our TargetP analysis shows a clearly noticeable change in localization in both the exon1 and exon>1 groups for both the human and mouse data sets (Fig. 3), indicating that the findings of altered subcellular localization/functional diversification also apply to the broader set of alternative translation events identified in this study.
Besides steering protein localization and protein stability, TIS selection can also regulate protein expression levels. For example, hypo- or hypermorphic point mutations introducing premature translation termination codons in the first exon can result in the production of truncated protein variants through translation initiation at in-frame methionines downstream of the nonsense mutation (104-106). Two such examples for which various dTIS sites were identified in this study include the nuclear factor kappa B (NF-κB) essential modulator (IKKγ/NEMO) and the NF-κB inhibitor IκBα (104-106) (Fig. 4 and supplemental Table S1). Premature translation termination codons in these genes, although leading to the residual production of a truncated variant sufficient for the nonlethality observed during development, have been shown to underlie specific cases of the genetic disorder anhidrotic ectodermal dysplasia with immune deficiency. The (presumed) dTIS of both reported truncated protein variants have here been identified as being in vivo Nt-acetylated (Ac-M 38 LHLPSEQGAPETLQR (NEMO) and Ac-M 37 KDEEYEQMVKELQEIR (IκBα)), indicating that, as demonstrated to be the case for wild type NEMO transfected cells, a (limited) translation initiation of IκBα at these alternative methionines can also occur under normal cellular conditions. The dTIS product of IκBα lacks the two amino-terminal IκB kinase (IKK) phosphorylation sites known to be essential for targeting IκBα for proteasomal degradation, and as a result the degradation-resistant variant acts as a dominant negative regulator of NF-κB activity. The mutation in IKKγ/NEMO, the scaffolding subunit of the IKK complex, gives rise to a truncated but functional variant that is produced in limited amounts, insufficient for the development of protective immune responses.
In vitro transcription/translation assays independently confirmed that the 44 kDa isoform of NEMO is the product of alternative translation initiation at the internal Met38. Here, close to the canonical start codon, other start codons are located downstream of the database annotated iMet1 (i.e. GxxAUGG and GxxAUGC), corresponding to the (surrounding) nucleotide motifs of methionines 13 and 38. The likelihoods of these AUG codons to act as translation initiation sites were estimated at 0.29 and 0.27 versus 0.54 for the first AUG codon (TxxAUGA) (translation initiation prediction at http://atgpr.dbcls.jp). As such, various mutagenized constructs were made (Fig. 4) in which the presumed initiator Met1, the internal Met13, and/or Met38 were mutagenized to monitor translation initiation at these sites. Expression from the coding sequence alone resulted in the production of two clearly distinct protein forms (Fig. 4, WT sample), of which the expression of the shorter isoform was lower compared with its full-length counterpart, indicative of leaky ribosome scanning. Further, mutagenesis of Met13 also resulted in two distinct bands, of which the higher MW band runs ~1 kDa lower compared with the highest MW band of the WT construct. This probably indicates that the higher MW band observed encompasses the products of translation at Met1 and Met13, an observation which is confirmed when Met13 or alternatively Met38 is mutated. The fact that translation initiation may occur at Met1 and Met13 is further supported by the observation that cellular expression of the WT and M38A mutant resulted in a somewhat smeared out precursor band (106) and that the TrEMBL and Ensembl databases hold preliminary entries of this Met13-initiated protein variant. The combined Met13/38 mutant seems to express the 44 kDa form besides residual translation initiation at the first noncanonical mutant CUG codon.
Although in vitro translation from the triple AUG to CUG mutant construct is significantly impaired, residual translation initiation (mainly at the near-cognate start codon decoded to Met38) can still be observed. Overall, we observed translation at three different AUG codons in NEMO, of which the translation product starting with Met38 was identified using N-terminal proteomics in the proteomes of K562, THP-1, and B-cells.
Further, ribosome profiling data where the utmost 5′ AUG triplet resides in a suboptimal context (absence of a purine at -3 and/or G at +4) suggested that more than one quarter of the human transcripts showed clear evidence of downstream translation initiation and thus could display a bi- or multicistronic behavior (33), meaning that these mRNAs could produce more than one polypeptide through leaky scanning. Many of these hypothetical cases would be expected to involve the production of small peptides or N-terminal truncated protein isoforms that have routinely been excluded from database sequence annotations.
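The notion of a suboptimal AUG context used here (no purine at position -3 and/or no G at position +4, with the A of AUG as +1) can be expressed as a small Python check. This is an illustrative sketch only: the function name `kozak_context` and the returned dictionary keys are assumptions, and the unspecified "xx" positions of the NEMO motifs are filled with arbitrary placeholder bases.

```python
PURINES = {"A", "G"}


def kozak_context(seq, atg_pos):
    """Describe the -3 and +4 context of the AUG starting at atg_pos (0-based)."""
    s = seq.upper().replace("U", "T")
    if s[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("no AUG at the given position")
    minus3 = s[atg_pos - 3] if atg_pos >= 3 else None
    plus4 = s[atg_pos + 3] if atg_pos + 3 < len(s) else None
    purine_at_minus3 = minus3 in PURINES
    g_at_plus4 = plus4 == "G"
    return {
        "purine_at_minus3": purine_at_minus3,
        "g_at_plus4": g_at_plus4,
        # suboptimal as defined in the text: either hallmark missing
        "suboptimal": not (purine_at_minus3 and g_at_plus4),
    }


# Motif shapes quoted for NEMO, with "CC" as placeholder for the "xx" bases.
for motif in ("GCCAUGG", "GCCAUGC", "TCCAUGA"):
    print(motif, kozak_context(motif, 3))
```

Under this definition only the GxxAUGG shape carries both hallmarks, whereas GxxAUGC and TxxAUGA are flagged as suboptimal.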
Despite the multitude of alternative TIS identified here, linking over 30% of the protein N termini to alternative translation initiation, for the majority of them their spectral counts (supplemental Table S1) hint at a generally lower translation efficiency and thus a likely lower concentration of protein products expressed from such secondary initiation codons. Nonetheless, potent functions of such possibly lower abundance dTIS protein products have been demonstrated, as in the case of the mitogenic osteogenic growth peptide (OGP) translated by leaky scanning from mammalian histone H4 mRNA (107). Strikingly, however, when TIS are relatively closely spaced (i.e. resulting in N-terminal protein isoforms differing by less than about six amino acids), spectral counts and our in vitro translation results hint at a more balanced expression, likely linking this to a proximity effect previously shown to modify the strict sequential constraint of regular leaky scanning into a more competitive feature (94). In this respect, it is important to note that such protein products can easily be overlooked by conventional detection technologies such as Western blotting despite their potent expression, as shown here by mutagenesis studies enabling visualization of the N-terminal AIMP2, JUNB, UCHL1, and CRLF3 isoforms generated by translation initiation at closely spaced AUG start codons (Figs. 4-6).
Further, the bioinformatics-assisted integration of positional proteomics and available ribosome profiling data enabled a more sensitive and comprehensive protein discovery, thereby enabling a global (re-)annotation of the translation initiation landscape (31,59). Interestingly, the high overlap of alternative translation products identified in this comprehensive study with those from a previous positional proteomics study that focused on the TIS landscape in mouse embryonic stem cells (mESC) further hints at their functional importance and conservation. In contrast, however, to the majority of translation products linked to alternative translation initiation as observed in ribosome profiling, the repertoire reported in this study is mainly confined to translation products raised upon downstream AUG usage besides some upstream noncognate codon usage, meaning that the products of uORF and out-of-frame translation generally remain undetected. Although the sensitivity of ribosome profiling to detect TIS sites and translation products remains unprecedented, the bioinformatics-oriented approaches used to assign TIS in ribosome profiling studies in some cases appear ineffective. For example, the dTIS in JUNB identified here clearly provided a discrete LTM and harringtonine signal in the human and mouse Ribo-seq data sets (22,33), yet remained undetected by the strict training algorithm used in the original Ingolia et al. study (Fig. 5). This, as well as the occurrence of cotranslational protein modification events and the recent finding that mRNA ribosome occupancy does not necessarily hint at effective translation (108), necessitates proteomics endeavors to identify the (mature) translation products. As such, we foresee that the complementary use of proteomics approaches and ribosome profiling will further assist in the comprehensive cataloguing of TIS ultimately leading to detectable and functional translation products.