Comparative Analysis of Mitochondrial N-Termini from Mouse, Human, and Yeast

The majority of mitochondrial proteins are encoded in the nuclear genome, translated in the cytoplasm, and directed to the mitochondria by an N-terminal presequence that is cleaved upon import. Recently, N-proteome catalogs have been generated for mitochondria from yeast and from human U937 cells. Here, we applied the subtiligase method to determine N-termini for 327 proteins in mitochondria isolated from mouse liver and kidney. Comparative analysis of mitochondrial N-termini from mouse, human, and yeast proteins shows that whereas presequences are poorly conserved at the sequence level, other presequence properties are highly conserved, including a length of ∼20–60 amino acids, a net charge of +3 to +6, and the presence of stabilizing amino acids at the N-terminus of mature proteins that follow the N-end rule from bacteria. As in yeast, ∼80% of mouse presequence cleavage sites match canonical motifs for three mitochondrial peptidases (MPP, Icp55, and Oct1), whereas the remainder do not match any known peptidase motifs. We show that mature mitochondrial proteins often exist with a spectrum of N-termini, consistent with a model of multiple cleavage events by MPP and Icp55. In addition to analysis of canonical targeting presequences, our N-terminal dataset allows the exploration of other cleavage events and provides support for polypeptide cleavage into two distinct enzymes (Hsd17b4), protein cleavages key for signaling (Oma1, Opa1, Htra2, Mavs, and Bcl2l13), and in several cases suggests novel protein isoforms (Scp2, Acadm, Adck3, Hsdl2, Dlst, and Ogdh). We present an integrated catalog of mammalian mitochondrial N-termini that can be used as a community resource to investigate individual proteins, to elucidate mechanisms of mammalian mitochondrial processing, and to allow researchers to engineer tags distal to the presequence cleavage site.

Mitochondria are ancient bacterium-derived organelles essential for eukaryotic life. A double membrane divides mitochondria into distinct compartments (matrix, inner membrane, intermembrane space, and outer membrane) that carry out specialized cellular processes, including oxidative phosphorylation, iron-sulfur cluster biogenesis, and myriad biosynthetic pathways. Mammalian mitochondria contain a tiny genome encoding 13 proteins, whereas the remaining ∼1200 proteins (1,2) are encoded in the nucleus and imported into the organelle. Although several different mechanisms can target proteins to the mitochondrion, the predominant mechanism is via an N-terminal presequence that directs import through the double membrane and is subsequently cleaved to produce the mature, functional protein (3).
To date, most of our understanding of mitochondrial protein import and processing has been elucidated using baker's yeast as a model system. As reviewed recently (3), the canonical import pathway involves an amphipathic α-helical N-terminal sequence that directs import through the TOMM-TIMM translocase complex spanning both membranes. This presequence is typically cleaved in the mitochondrial matrix by the mitochondrial processing peptidase (MPP, encoded by Mas1/Mas2 in yeast, Pmpca/Pmpcb in mouse, and PMPCA/PMPCB in human). Other matrix peptidases are known to subsequently cleave one amino acid (Icp55) or eight amino acids (Oct1). Cleavage by MPP typically occurs two amino acids C-terminal to an arginine (R-2); therefore, proteins cleaved by MPP+Icp55 typically have arginine at presequence position −3 (R-3), whereas proteins cleaved by MPP+Oct1 typically have arginine at position −10 (R-10). The inner membrane peptidase (Imp) is known to cleave a handful of proteins destined for the inner membrane or intermembrane space, although a specific motif is not known. Approximately 70% of yeast mitochondrial proteins contain an N-terminal presequence that is cleaved upon import and rapidly degraded (3,4). However, other mitochondrial import mechanisms exist that do not involve cleavage signals, especially for outer membrane or intermembrane space proteins.
Systematic identification of N-termini of all mitochondrial proteins promises to shed light on mitochondrial protein import/processing mechanisms. Two high-throughput studies have reported the mitochondrial N-proteomes in yeast and humans. In an elegant 2009 study, Vogtle et al. (4) isolated mitochondria from yeast and used combined fractional diagonal chromatography (COFRADIC) to specifically label and identify the N-termini. This method chemically modifies free N-termini of the internal peptides (after first protecting lysine side chains) and separates them from other peptides by chromatography. The N-terminal peptides are then identified using mass spectrometry. This study identified N-termini for 615 yeast mitochondrial proteins, including 451 with cleaved presequences exceeding 5 aa. Yeast presequences typically are 20–60 aa long and have a net charge between +3 and +6.
The study also showed that the N-termini of mature proteins typically begin with stabilizing amino acids such as serine and alanine, consistent with the N-end rule from bacteria (5). In 2015, Vaca Jacome et al. (6) reported the mitochondrial N-proteome in human U937 cells using trimethoxyphenyl phosphonium (TMPP) with C12 and C13 labeling. At pH 8.2, TMPP chemically modifies free N-termini but not lysine side chains, and then the N-termini are determined from peptides with TMPP doublets containing both C12 and C13. This study identified N-termini for 356 human genes, 179 of which are in the MitoCarta2.0 catalog of mitochondrial proteins (2), whereas the remainder may be co-purifying contaminants. These two high-throughput studies of mitochondrial N-termini are valuable resources for understanding individual mitochondrial proteins as well as protein import mechanisms.
In this study, we applied the subtiligase method (7,8) to systematically identify cleaved N-termini from mitochondria isolated from mouse tissues (Fig. 1). The Wells laboratory has pioneered this method for identifying N-termini and has successfully applied it to elucidate caspase proteolysis of hundreds of substrates (9, 10), identify circulating proteolytic signatures of chemotherapy-induced apoptosis (11,12), and investigate evolutionary conservation of caspase substrates (13). Briefly, this method uses an engineered peptide ligase called subtiligase to covalently label free N-termini with an ester probe, including a biotin molecule for biochemical enrichment and an α-aminobutyric acid (Abu) residue for unambiguous identification using LC-MS/MS. As subtiligase can label neither lysine side chains nor modified N-termini, this method is specific for processed N-termini, as most non-processed N-termini are co-translationally acetylated in the cytoplasm prior to mitochondrial import. One known bias of this method is that subtiligase, like many proteases, has a preference for N-termini with small amino acids, specifically serine, alanine, and glycine, although studies are under way to address this limitation (14). The subtiligase method offers positive (rather than negative) enrichment and unambiguous identification of free N-termini (Abu tag), and it does not require modification of lysine residues prior to labeling. Thus, in this study we chose the subtiligase method to specifically identify mitochondrial N-termini cleaved after import.
Here, we report on mitochondrial protein N-termini encoded by 327 mouse genes, and we compare properties of mouse presequences to those previously reported in yeast and human. This catalog of mitochondrial N-termini across species provides insight into the evolutionary conservation of mitochondrial targeting mechanisms and provides a resource to understand individual mitochondrial proteins.

EXPERIMENTAL PROCEDURES
Experimental Design and Statistical Rationale-Protein N-termini were examined via LC-MS/MS from mitochondria isolated from mouse liver (one biological replicate with three technical replicates, including two digested with trypsin and one with trypsin and separately LysC) and mitochondria isolated from mouse kidney (one biological replicate using trypsin digestion, no technical replicates). Because the proteins detected in kidney were a strict subset of liver data, this tissue was not followed up with additional replicates, and no comparisons between kidney and liver were performed. Three technical replicates of liver were selected to enable quality control and robustness metrics. Statistical methods were used in data analysis to assess enrichment of mitochondrial subcompartments (Fisher's exact test) and to compare conservation between presequences and mature protein sequences (two-tailed, heteroscedastic t test).
N-Terminal Labeling and Enrichment-Mitochondrial samples were reduced with 20 mM dithiothreitol, and 2.5% Triton X-100 was added. The TEVTest4B probe was added to the samples at 1.5 mM prior to incubating with 1 µM subtiligase for 2 h at room temperature. Tagged protein fragments were precipitated using acetonitrile, then denatured (8 M Gdn-HCl), reduced (2 mM tris(2-carboxyethyl)phosphine), and thiols alkylated (4 mM iodoacetamide), before ethanol precipitation. Biotinylated N-terminal peptides were then captured with neutravidin-agarose beads for 30 h. The beads were washed using 4 M Gdn-HCl, and the proteins were cleaved with trypsin, or trypsin and separately LysC, washed, and released from the beads using TEV protease. The tryptic peptides were fractionated into 11 fractions/sample using high pH reverse phase C18 chromatography and then desalted using C18 ZipTips (Millipore).
Mass Spectrometry and Data Analysis-LC-MS/MS was carried out by reverse phase LC interfaced with two different mass spectrometers. The first dataset (Velos) was acquired on an LTQ Orbitrap Velos (Thermo Fisher Scientific) mass spectrometer featuring a nanoflow HPLC (NanoAcquity UPLC system, Waters) equipped with a trap column (180 µm × 20 mm, 5-µm Symmetry C18, from Waters) and an analytical column (100 µm × 100 mm, 1.7-µm BEH130 C18, from Waters). Peptides were eluted with a linear gradient over 60 min from 2 to 30% acetonitrile in 0.1% formic acid. The second dataset (QExactive) was acquired on a QExactive Plus (Thermo Fisher Scientific) featuring a nanoflow HPLC (Dionex UltiMate 3000, Thermo Fisher Scientific) equipped with an analytical column (Acclaim PepMap RSLC, 75 µm × 150 mm, Thermo Fisher Scientific). One sample was analyzed from mouse kidney mitochondria (Velos with trypsin protease), and three samples were analyzed from mouse liver mitochondria (Velos with trypsin protease, QExactive with trypsin protease, and QExactive with trypsin and LysC proteases digested separately and then combined prior to injection). For data analysis, peptide sequences were assigned using the ProteinProspector (version 5.13.2) database search engine against the RefSeq mouse protein database (release 63, 27,886 proteins) (15) retaining all homologs. Search parameters included a precursor mass tolerance of 20 ppm, fragment ion mass tolerance of 20 ppm (and 6 ppm for the QExactive data), up to two missed trypsin cleavages, constant carbamidomethylation of Cys, variable modifications of N-terminal addition of Abu amino acid, acetylation of the protein N-terminus, and oxidation of methionine. The identified peptides were searched against a random decoy protein database to evaluate false-positive rates. The false discovery rate never exceeded 2%. Only peptides containing the N-terminal Abu tag were analyzed (1081 unique peptides mapping to 474 genes).
Peptides that mapped to multiple isoforms of the same Entrez gene were assigned to the longest matching RefSeq isoform. Peptides were excluded if they mapped to more than one Entrez gene (149 unique peptides) or if they mapped to non-MitoCarta2.0 genes (191 unique peptides), leading to a final dataset of 855 unique peptides supporting 470 unique cleavages in 327 MitoCarta2.0 genes. Data including peak lists are available at MS-Viewer with search keys xjcsyqrrvh (liver, Velos), zjsxrtb9zx (kidney, Velos), jv5pi22igw (liver, QExactive with trypsin), and 4mvtyr7jg1 (liver, QExactive trypsin and LysC). Raw files are available in the MassIVE repository: MSV000080206, MSV000080207, MSV000080208, and MSV000080209.
Mass Spectrometry Half-tryptic Peptide Search of Published MitoCarta2.0 Spectra-We re-analyzed existing mass spectra from our previous analysis of mitochondria isolated from 14 mouse tissues (1, 2). These 1.9 million high quality spectra from ion trap collision-induced dissociation LC-MS/MS generated using an LTQ-Orbitrap (Thermo Scientific, San Jose, CA) were recently searched against the mouse RefSeq protein database (release 63) using standard methods to identify all tryptic peptides or those matching the first peptide from the annotated protein start (727,613 matched spectra), as described in the MitoCarta2.0 resource (2). Here, we used Spectrum Mill version 6.0 prerelease (Agilent Technologies, Santa Clara, CA) to search the remaining 1.2 million previously unmatched spectra to identify half-tryptic peptides that had any residue at the N-terminus but lysine or arginine at the C-terminus. Spectrum Mill search parameters included electrospray ionization linear ion trap scoring parameters, mass tolerances of ±25 ppm for precursor and ±0.7 Da for product ions, 35% minimum matched peak intensity, trypsin nonspecific N-terminal enzyme specificity, fixed modification of carbamidomethylation at cysteine, and variable modifications of oxidized methionine with a precursor MH+ shift range of −18 to 64 Da. Spectrum Mill's autovalidation module was used with a fixed score threshold of 9.0 and a minimum peptide length of 7 aa to produce a draft list of candidate peptide spectrum matches containing 16,471 distinct peptides with a 2.9% peptide level false discovery rate based on forward versus reverse matches. We retained spectra that matched proteins identified by mass spectrometry in MitoCarta2.0 and mapped these to 8554 unique protein cleavages. We then retained only peptides that occurred N-terminal to the first observed tryptic peptide within the published MitoCarta2.0 dataset (457 unique cleavages, supplemental Table S1) with an identification FDR of 2.6%.
We further defined a high quality subset as cleavages detected in >1 tissue with a peptide score ≥13 (supplemental Table S1, FDR <0.9%). Comparisons with the subtiligase N-proteome dataset (Fig. 1B) were made at the gene level, between high quality half-tryptic peptides observed in any of the 14 tissues or with the subset observed from analysis of liver mitochondria.
Comparative Analyses of N-Termini-The mouse mitochondrial N-terminus subtiligase dataset (327 genes, 470 cleavages) was used to create a master set (327 cleavage sites defined as those with the largest number of supporting spectra per gene, using the number of half-tryptic spectra detected in 14 tissues to break ties) and a reference subset of the master set (119 cleavage sites with ≥4 supporting spectra). Multiple spectra supporting the same cleavage site may be derived from the following: (i) identical peptides detected in different experimental samples; (ii) different peptides starting at the same position (e.g. SLCHSDFR and SLCHSDFRK); and/or (iii) different modifications of the same peptide sequence. Human mitochondrial N-termini were downloaded from Vaca Jacome et al. (6). The original dataset (356 UniProt proteins and 425 cleavage sites) was filtered to exclude non-mitochondrial contaminants by retaining only proteins in human MitoCarta2.0 (179 proteins and 229 cleavage sites) and was then used to create a master set (179 cleavages defined as those with the largest number of supporting spectra per gene) and a reference subset (47 cleavage sites with ≥4 supporting spectra). Human UniProt proteins were mapped to NCBI Entrez genes via UniProt Knowledgebase (16) Mapper and mapped to mouse orthologs by best bidirectional hit (BlastP Expect <1e-3). Yeast mitochondrial N-termini were downloaded from Vogtle et al. (4) (615 genes, 1104 N-termini) along with tables containing the master set (279 cleavage sites) and a reference set with ≥6 supporting spectra (94 cleavage sites). Yeast proteins were mapped to 1:1 orthologs in mice (BlastP Expect <1e-3) as described previously (2). For cross-species comparisons, multiple alignments between orthologs were generated by MUSCLE (17), and cleavage sites from each species were mapped to coordinates within the multiple alignment. Note that all three N-proteomic datasets contained multiple cleavage sites per gene.
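The half-tryptic criterion described above (any residue at the N-terminus, lysine or arginine at the C-terminus) can be illustrated with a minimal sketch. The function name `is_half_tryptic`, the 0-based half-open coordinates, and the toy protein string are assumptions made for illustration; this is not the Spectrum Mill implementation, and it ignores details such as proline rules and missed cleavages.

```python
def is_half_tryptic(protein, start, end):
    """Return True if protein[start:end] has a tryptic C-terminus (K or R)
    but a non-tryptic N-terminus.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    A peptide starting at the annotated protein start is treated as having
    a regular N-terminus, since such peptides were covered by the earlier
    standard search.
    """
    peptide = protein[start:end]
    if not peptide:
        return False
    c_term_tryptic = peptide[-1] in "KR"
    n_term_tryptic = start == 0 or protein[start - 1] in "KR"
    return c_term_tryptic and not n_term_tryptic


# Hypothetical toy protein, for illustration only.
toy_protein = "MAAKSLCHSDFRKAA"
print(is_half_tryptic(toy_protein, 5, 12))  # "LCHSDFR", preceded by S
print(is_half_tryptic(toy_protein, 4, 12))  # "SLCHSDFR", preceded by K (fully tryptic)
```

In a real pipeline the length filter (minimum 7 aa) and score thresholds described above would be applied in addition to this positional check.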
Overlaps between species ( Fig. 1B and supplemental Fig. S4) were computed at the gene level (i.e. an exact match was counted if any cleavage site exactly matched any orthologous cleavage site).
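The master-set and reference-subset selection described above can be sketched as follows. The function names, the tuple layout `(gene, position, n_spectra)`, and the toy gene names are hypothetical; the sketch assumes each (gene, position) pair appears once and that remaining ties are resolved by input order, which the text does not specify.

```python
from collections import defaultdict


def build_master_set(cleavages, half_tryptic_counts=None):
    """Pick one cleavage site per gene: the site with the most supporting
    spectra, using half-tryptic spectral counts to break ties.

    cleavages: iterable of (gene, position, n_spectra) tuples.
    half_tryptic_counts: optional dict {(gene, position): n_spectra}.
    Returns a dict {gene: position}.
    """
    half_tryptic_counts = half_tryptic_counts or {}
    sites_by_gene = defaultdict(list)
    for gene, position, n_spectra in cleavages:
        sites_by_gene[gene].append((position, n_spectra))

    master = {}
    for gene, sites in sites_by_gene.items():
        best_position, _ = max(
            sites,
            key=lambda s: (s[1], half_tryptic_counts.get((gene, s[0]), 0)),
        )
        master[gene] = best_position
    return master


def reference_subset(cleavages, master, min_spectra=4):
    """Keep master-set sites supported by at least min_spectra spectra."""
    support = {(gene, pos): n for gene, pos, n in cleavages}
    return {g: p for g, p in master.items() if support[(g, p)] >= min_spectra}


# Hypothetical toy data, for illustration only.
toy = [("GeneA", 37, 2), ("GeneA", 38, 5), ("GeneB", 20, 3), ("GeneB", 21, 3)]
toy_half_tryptic = {("GeneB", 21): 10}
master = build_master_set(toy, toy_half_tryptic)
print(master)
print(reference_subset(toy, master))
```

The default threshold of 4 spectra mirrors the mouse and human reference subsets; the yeast reference set described above uses 6.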
Predictions of Mitochondrial Targeting Sequences-TargetP (18), TPpred3 (19), and MitoFates (20) were run on all MitoCarta2.0 proteins using default parameters. Comparisons with the mouse N-terminal subtiligase dataset were performed using the subtiligase master set (Fig. 3A). Very similar results were obtained if comparisons were performed at the gene level (i.e. an exact match was reported if any of the cleavage sites per gene were correctly predicted by the algorithm).
Evolutionary Conservation-Evolutionary conservation (Fig. 3B) was assessed using precomputed multiple alignments for orthologous groups defined by EggNOG (21) version 4.0. RefSeq proteins were mapped to Ensembl proteins (via Ensembl BioMart (22)), and the cleavage site for the subtiligase master set was mapped from RefSeq to Ensembl protein coordinates via MUSCLE (17) multiple alignment.
For each Ensembl protein, EggNOG multiple alignments were downloaded for all vertebrates (veNOG), metazoa (meNOG), and opisthokonts (opiNOG). Percent identity at each position in the precomputed multiple alignment was defined as the percent of homologs sharing the most commonly observed amino acid at that site. These per-position metrics were averaged across all presequence positions (defined by the mouse cleavage site in the subtiligase master set) or across all remaining positions.
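The per-position identity metric and its averaging over presequence versus remaining positions can be sketched as below. This is a simplified illustration under stated assumptions: gap characters are excluded from the denominator, alignment columns with a gap in the mouse row are skipped, and the cleavage position is the 1-based index of the first mature residue. None of these choices is specified in the text, and the function names are hypothetical.

```python
from collections import Counter


def column_identity(column, gap="-"):
    """Percent of sequences sharing the most common residue in a column.
    Gaps are ignored (an assumption of this sketch)."""
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    top_count = Counter(residues).most_common(1)[0][1]
    return 100.0 * top_count / len(residues)


def mean_identity_by_region(alignment, ref_index, cleavage_pos):
    """Average per-column identity over presequence and mature positions.

    alignment: list of equal-length aligned sequences.
    ref_index: row index of the reference (mouse) protein.
    cleavage_pos: 1-based position of the first mature residue in the
        ungapped reference sequence.
    Returns (mean presequence identity, mean mature identity).
    """
    ref = alignment[ref_index]
    presequence, mature = [], []
    residue_number = 0
    for col in range(len(ref)):
        if ref[col] == "-":
            continue  # no reference residue in this column
        residue_number += 1
        identity = column_identity([seq[col] for seq in alignment])
        target = presequence if residue_number < cleavage_pos else mature
        target.append(identity)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(presequence), mean(mature)


# Hypothetical toy alignment, for illustration only.
toy_alignment = ["MAAK", "MSAK", "MTAK"]
print(mean_identity_by_region(toy_alignment, ref_index=0, cleavage_pos=3))
```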
Amino Acid Frequency-The background frequency of all amino acids (Fig. 3F) was computed based on all MitoCarta2.0 mouse RefSeq proteins (mouse), all MitoCarta2.0 human RefSeq proteins (human), and all yeast proteins annotated as mitochondria in the Saccharomyces Genome Database (23) (yeast). The frequency of N-terminal amino acids (Fig. 3E) was based on analysis of the first amino acid after the cleavage site using the master sets from each of the three species, defined above.
Cleavage Site Motifs-Using the reference sets for each species, motifs were created for the sequences flanking the observed cleavage sites using WebLogo (24), excluding small sample correction. Subsets of the reference sets were used to investigate motifs for sequences matching R-2 (arginine in position 2 N-terminal to cleavage site), R-3 (arginine in the position 3 N-terminal to the cleavage site), or R-10 (arginine in the position 10 N-terminal to the cleavage site).
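A minimal sketch of how cleavage sites might be binned into the R-2, R-3, and R-10 classes is shown below. It assumes that classification is based on the arginine nearest to the cleavage site on its N-terminal side, and that the cleavage position is the 1-based index of the first mature residue; the function names and the toy sequence are hypothetical.

```python
MOTIFS = {2: "R-2", 3: "R-3", 10: "R-10"}


def nearest_upstream_arginine(sequence, cleavage_pos):
    """Return the offset (1, 2, 3, ...) of the arginine closest to the
    cleavage site on its N-terminal side, or None if there is none.

    cleavage_pos is the 1-based position of the first mature residue,
    so the presequence is sequence[:cleavage_pos - 1].
    """
    presequence = sequence[: cleavage_pos - 1]
    for offset in range(1, len(presequence) + 1):
        if presequence[-offset] == "R":
            return offset
    return None


def classify_cleavage_site(sequence, cleavage_pos):
    """Label a site R-2, R-3, or R-10 by its nearest upstream arginine;
    anything else is reported as 'other'."""
    offset = nearest_upstream_arginine(sequence, cleavage_pos)
    return MOTIFS.get(offset, "other")


# Hypothetical toy sequence, for illustration only.
toy_sequence = "MLSARSAAK"
print(classify_cleavage_site(toy_sequence, 7))  # arginine two residues upstream
print(classify_cleavage_site(toy_sequence, 8))  # arginine three residues upstream
```

Sites labeled "other" correspond to cleavages that match none of the three arginine-based motifs.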
Outlier Cleavages >100 Amino Acids-To identify well supported cleavages >100 aa from the annotated RefSeq protein start, we considered all cleavages with ≥4 spectra (from subtiligase combined with half-tryptic peptides occurring anywhere in the protein, not just before the first tryptic peptide, combined with N-terminal spectra from human orthologs). We excluded the five most highly expressed liver proteins (based on MitoCarta2.0 liver proteomics) as these showed many cleavages throughout the length of the protein that were not conserved between mouse and human (proteins Cps, Atp5b, Glud1, Pcx, and Cat). We note that the reported cleavage site refers to cleavage prior to the listed position (e.g. NP_032318:313 refers to cleavage between amino acids 312 and 313). We searched for shorter isoforms by manually inspecting full-length mRNA and ESTs for mouse and human homologs using the UCSC genome browser (25). We searched for nearby translation initiation sites in existing ribosome profiling data (26) using the GWIPS-viz genome browser (27). For all 17 proteins with outlier cleavage events, we predicted PFAM protein domains (28) using HMMER3 (29) and transmembrane helices using TMHMM (30).
Mitochondrial Subcompartment Analyses-For supplemental Fig.  S3, mouse MitoCarta2.0 proteins were assigned to subcompartments. Proteins were first classified as membrane (presence of transmembrane helix based on TMHMM (30) or NCBI gene ontology (GO) assignment to "mitochondrial inner membrane" or "mitochondrial membrane") or soluble. Membrane proteins were classified as inner membrane (if detected in the matrix via APEX (31) or by GO annotation) or outer membrane (GO annotation) or otherwise unclassified. Soluble proteins were classified as matrix (if detected in the matrix via APEX (31) or by GO annotation) or intermembrane space (GO annotation or presence of Cx9C motifs (32)) or otherwise unclassified.

Identification of Cleaved N-Termini from Mouse Mitochondria-To identify mitochondrial presequence cleavage sites, we used the subtiligase method to label free N-termini in mitochondria from mouse tissues (Fig. 1A). Briefly, we isolated mitochondria from fresh mouse liver and kidney and then incubated protein extracts with purified subtiligase protein and an ester probe with the following three components: a biotin molecule for biochemical enrichment; a TEV site; and an Abu residue. Subtiligase covalently links the ester probe to free N-termini. Labeled proteins were selected using NeutrAvidin beads and then digested with either trypsin or LysC to retain only the N-terminal peptide. The ester probe was cleaved at the TEV site, and the Abu-labeled N-terminal peptides were subjected to LC-MS/MS. Peptides were searched against the RefSeq mouse protein database and then filtered for unambiguous matches to MitoCarta2.0 mitochondrial proteins. Identified spectra were highly specific for mitochondrial proteins (supplemental Fig. S1). Only peptides containing the Abu signature were retained for downstream analysis, enabling unambiguous identification of N-termini. In total, three technical replicates were performed from the liver mitochondrial sample (two using trypsin and one using LysC protein digestion), and one replicate was performed from the kidney mitochondrial sample (trypsin protein digestion).
These experiments yielded identification of N-termini for 327 mitochondrial proteins in liver and 158 mitochondrial proteins in kidney, which were a strict subset of those identified in liver (Fig. 1A and supplemental Table S1). This set represents 28% of all MitoCarta proteins, 46% of MitoCarta proteins expressed in the liver, and 58% of mitochondrial matrix proteins expressed in the liver. The 327 identified proteins exhibit a bias toward abundant proteins, consistent with most MS studies (supplemental Fig. S2). As expected, there was strong enrichment for identification of matrix proteins (35% enrichment, p = 2e-10, Fisher's exact test, see supplemental Fig. S3A). Although many methods are biased against membrane proteins, surprisingly, we did not observe a depletion of known inner membrane proteins (supplemental Fig. S3A). As expected, there is depletion of outer membrane proteins that typically are not cleaved after mitochondrial import (33), as these are typically N-acetylated and thus not detectable via subtiligase (supplemental Fig. S3A). Our inventory of 327 mitochondrial genes with identified cleaved N-termini is more comprehensive than the published N-proteome study in humans (179 genes) and thus substantially improves knowledge of cleavage sites in mammalian mitochondria.
Assessing Sensitivity and Precision-We applied three complementary approaches to assess the quality of the subtiligase N-termini.
First, we compared our dataset to a control set of experimentally determined mitochondrial N-terminal cleavage sites from the literature. The control set included 15 mouse N-termini and 51 rat N-termini, typically determined via Edman degradation, based on literature curation within the UniProt knowledgebase (16) (Table S2). We mapped rat proteins to mouse orthologs via best bidirectional hits (expect <1e-3) and mapped orthologous cleavage sites via MUSCLE protein alignments. Our subtiligase dataset contained cleavage sites for 9/15 control mouse proteins and 32/51 rat proteins. Of these, concordant cleavage sites were observed in 7/9 mouse proteins and 29/32 rat proteins (within 2 aa in multiple alignment). Based on this control set, the subtiligase dataset shows 55% sensitivity (36/66) and 88% precision (36/41 matches out of all genes present in both sets). The limited sensitivity can largely be explained by control proteins not expressed in the selected tissues (33%) and by the lack of amenable N-terminal tryptic peptides at the control N-termini (67%) (see Table S2). The five non-concordant N-termini can all be explained by lack of amenable tryptic peptides at the control N-termini (Table S2). Together, the control data suggest 55% sensitivity and 88% precision of cleavage site identification.
Second, we compared our subtiligase N-termini to high-throughput N-terminal datasets from human (6) and yeast (4) mitochondria (Fig. 1B). Orthologous proteins were mapped between species, and orthologous cleavage sites were identified within MUSCLE protein alignments (supplemental Fig. S4). Of the 116 orthologs with N-termini detected in mouse and human studies, 84% had concordant cleavage sites, including 77% with exact matches and 7% within 2 aa in the multiple alignment (which thus may represent alignment artifacts). In contrast, there was very little overlap in cleavage site positions between mouse and yeast orthologs (29% within 2 aa in alignment), likely due to poor conservation of the presequences across the ∼1 billion years of evolution between these species and the resultant poor protein alignments. The high overlap in experimentally determined N-termini between mammalian species validates the quality of both the subtiligase and TMPP datasets in the mouse and human (6), respectively.
Third, we compared N-termini from our subtiligase dataset to potential cleavage sites that we mined from existing LC-MS/MS spectra of mitochondrial proteins from 14 mouse tissues (Fig. 1C and supplemental Fig. S5). Tissue datasets are particularly useful to increase sensitivity, because although ∼50% of mitochondrial proteins are ubiquitously expressed, mRNA and proteomics data show marked tissue specificity for the remainder (e.g. ketogenesis, urea cycle, and steroidogenesis pathways) (1,34). We previously isolated mitochondria from 14 mouse tissues, digested proteins using trypsin, performed LC-MS/MS, and identified mitochondrial proteins using typical database search criteria that matched tryptic peptides (or half-tryptic peptides beginning at the annotated protein start). We previously published these data as part of the MitoCarta (1) and MitoCarta2.0 (2) catalogs, which include MS/MS support for 1008 mitochondrial proteins and 712 specifically in liver mitochondria.
This previous search would fail to detect N-terminal peptides from cleaved proteins, as they would not match the in silico tryptic peptide database. Therefore, we now mined the unmatched MS/MS spectra to identify high confidence half-tryptic peptides with lysine or arginine at the C-terminus. To identify N-terminal half-tryptic peptides, we further required that half-tryptic peptides occur N-terminal to any tryptic peptide identified in the original MitoCarta2.0 search (see under "Experimental Procedures") and filtered for well supported cleavage sites, yielding 216 unique cleavage sites in 137 mitochondrial genes. Finally, we compared the MitoCarta half-tryptic N-terminal set to the subtiligase dataset. Of the 102 proteins present in both sets, 91 (89%) showed identical cleavage sites. This new half-tryptic peptide search from 14 tissues both supports the accuracy of the subtiligase N-terminal dataset and provides additional information about the presence of these cleavage sites across tissues.
Data Integration-We integrated data from all the above methods to create a resource of N-termini for mammalian mitochondrial proteins. This resource (supplemental Table S1) includes our mouse N-termini detected by subtiligase, the cleavage sites detected by half-tryptic peptides across 14 mouse tissues, human N-termini from the Vaca Jacome et al. study (6), and computational predictions using the following three algorithms: TargetP (18), TPpred3 (19), and MitoFates (20).
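The sensitivity and precision figures above follow directly from the counts given in the text. The short sketch below reproduces the arithmetic; the function name is hypothetical, and the definitions (sensitivity as concordant sites over all control sites, precision as concordant sites over control proteins detected in the subtiligase dataset) are those implied by the fractions 36/66 and 36/41.

```python
def sensitivity_precision(n_control, n_detected, n_concordant):
    """Return (sensitivity, precision) as fractions.

    n_control: control proteins with a literature cleavage site.
    n_detected: control proteins also present in the test dataset.
    n_concordant: detected proteins whose cleavage site agrees with the
        control (within the chosen tolerance).
    """
    return n_concordant / n_control, n_concordant / n_detected


# Counts taken from the text: mouse + rat control proteins.
n_control = 15 + 51      # 66 control N-termini
n_detected = 9 + 32      # 41 present in the subtiligase dataset
n_concordant = 7 + 29    # 36 concordant within 2 aa

sens, prec = sensitivity_precision(n_control, n_detected, n_concordant)
print(f"sensitivity = {sens:.1%}, precision = {prec:.1%}")
```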
Multiple Cleavage Sites per Protein-Our subtiligase N-terminal dataset, like the published datasets in human and yeast, detects multiple cleavage sites for some proteins-which are likely a mix of technical artifacts and true biological complexity. We detected a total of 704 cleavage sites in 327 genes, including genes with one (47%), two (27%), or more (26%) cleavage sites, with the most abundant proteins enriched for multiple cleavage sites (supplemental Fig. S2). As in other studies (4), we created a master set containing the most abundant cleavage site for each protein, based on the greatest number of supporting MS/MS spectra. As has been noted in other N-terminomics studies (14,35,36), we observe many proteins with "ladders" of N-termini at consecutive positions (supplemental Fig. S6), which might indicate either artifact (e.g. aminopeptidase processing during sample preparation despite the addition of protease inhibitors) or true biological complexity, for example reflecting the known cleavages by MPP and then secondarily by Icp55 or Oct1 homologs (3).
We highlight four examples to illustrate single and multiple cleavages in our subtiligase dataset (Fig. 2). Clybl is a protein with a single cleavage observed by the subtiligase method that is also supported by MitoCarta2.0 half-tryptic peptides and by an orthologous cleavage site in humans (6). This cleavage site is predicted by one of three prediction algorithms (TPpred3), and it matches the canonical cleavage site for MPP (R-2). In contrast, we detected four potential cleavage sites for Cox5a, all of which were also supported by MitoCarta2.0 half-tryptic peptides. In both datasets, the most abundant peptide indicates cleavage at position 38 (5 subtiligase spectra, 320 MitoCarta spectra), which matches the canonical MPPϩIcp55 cleavage site (R-3). However, there is also support for the MPP cleavage (position 37) as well as two downstream sites (position 40 and 41). Similarly, for protein Akr7a5, both subtiligase and MitoCarta2.0 half-tryptic peptides support cleavage at position 29 consistent with MPPϩIcp55 cleavage; however, three additional cleavages are detected (position 21, 28, and 39), all consistent with MPP cleavage. The protein Uqcrh has three potential cleavage sites, none of which match canonical MPP, Icp55, or Oct1 cleavage sites. These examples show broad consistency be-tween the most abundant cleavages detected in two independent mouse N-terminal datasets and may hint that some proteins exist with a spectrum of different N-termini.
To assess whether the observed N-terminal ladders (e.g. Cox5a) are likely to represent true biological complexity versus sample preparation artifact, we sought to determine whether such ladders were enriched in mitochondrial versus non-mitochondrial proteins (supplemental Fig. S6). We compared prevalence of cleavage ladders in mitochondrial proteins versus non-mitochondrial contaminants in our study as well as in mitochondrial studies in human (6) and yeast (4). We additionally analyzed N-terminomics data by the subtiligase method from human blood plasma (36) as well as six cell lines without enrichment for mitochondrial proteins compiled in DegraBase1.0 (14). In each study, ladders were enriched in mitochondria compared with non-mitochondrial proteins (supplemental Fig. S6). Mitochondrial proteins showed the highest prevalence of ladders, followed by secreted proteins in the human plasma, whereas cytoplasmic proteins showed substantially fewer ladders (supplemental Fig. S6A). These data, generated across different platforms and species, together show consistent enrichment of consecutive cleavages in mitochondrial proteins. Because ladders are derived from exoproteolysis, it suggests that more exoproteolysis occurs in the mitochondria relative to the cytosol, consistent with known exoproteolysis via Icp55/Xpnpep3. Overall, our data indicate that many mitochondrial proteins exist with a spectrum of different N-termini in vivo.
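One simple way to flag "ladders" is to look for runs of cleavage sites at consecutive positions within a protein. The sketch below assumes a ladder is any run of at least two consecutive positions; this threshold and the function name `find_ladders` are illustrative assumptions, not the definition used for supplemental Fig. S6. The example positions are the Cox5a and Akr7a5 sites reported above.

```python
def find_ladders(positions, min_length=2):
    """Group cleavage positions into runs of consecutive integers and
    return the runs with at least min_length members."""
    ladders = []
    run = []
    for pos in sorted(set(positions)):
        if run and pos == run[-1] + 1:
            run.append(pos)
        else:
            if len(run) >= min_length:
                ladders.append(run)
            run = [pos]
    if len(run) >= min_length:
        ladders.append(run)
    return ladders


# Cleavage positions reported in the text for two example proteins.
cox5a_sites = [37, 38, 40, 41]
akr7a5_sites = [21, 28, 29, 39]
print(find_ladders(cox5a_sites))
print(find_ladders(akr7a5_sites))
```

Comparing the fraction of proteins with at least one such run between mitochondrial and non-mitochondrial proteins would give a ladder-prevalence comparison of the kind described above.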
Properties of Presequences across Species-We next explored properties of mitochondrial presequences in our mouse subtiligase dataset compared with published data in human (6) and yeast (4) (Fig. 3). The presequence is computationally defined as the sequence between a protein's annotated start and the observed cleavage site from the master set. We note that experimental datasets are crucial for this comparison given the low accuracy of existing presequence prediction algorithms (Fig. 3A), as has been previously reported (4). We find that mouse presequences are only half as conserved as the mature protein sequences, based on analysis of precomputed multiple alignments of orthologous proteins across different taxonomic groups in the EggNOG database (21) (Fig. 3B). Despite the low conservation of presequences at the sequence level, other presequence properties are extremely conserved across evolution, including length (Fig. 3C) and net charge (Fig. 3D). In all three species, presequences are typically 20-60 aa long and have a net charge ranging from +3 to +6. Consistent with the Vogtle et al. (4) study from yeast, all three species show a preference for stable amino acids at the N-terminus of mature proteins, following the N-end rule from bacteria (Fig. 3, E and F). We note that the mouse subtiligase dataset shows a particular bias for N-terminal serine and glycine, possibly due to subtiligase's preferential labeling of proteins beginning with these specific residues (14); however, we note that very similar frequencies are observed in the mouse half-tryptic peptide dataset.
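The presequence definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `mature_start` is taken to be the 1-based index of the first residue of the mature protein, net charge is approximated as the count of Lys and Arg minus the count of Asp and Glu (histidine and terminal groups are ignored), and the example sequence is a made-up toy string, not a real protein. The function names are hypothetical.

```python
def presequence(precursor: str, mature_start: int) -> str:
    """Return the residues upstream of the observed mature N-terminus.

    Assumption: `mature_start` is the 1-based position of the first
    residue of the mature protein in the annotated precursor.
    """
    if not 1 <= mature_start <= len(precursor):
        raise ValueError("mature_start must lie within the precursor")
    return precursor[: mature_start - 1]


def net_charge(seq: str) -> int:
    """Approximate net charge as (K + R) - (D + E).

    Assumption: histidine and the terminal groups are ignored.
    """
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


# Toy precursor (hypothetical sequence, for illustration only)
toy_precursor = "MLSRAARKSLQDE"
toy_pre = presequence(toy_precursor, 9)  # residues 1-8
toy_length = len(toy_pre)
toy_charge = net_charge(toy_pre)
```

Applied to every protein in a cleavage-site catalog, these two values would give length and charge distributions of the kind summarized in Fig. 3, C and D.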
In summary, experimentally determined N-termini from mouse, human, and yeast mitochondrial proteins show that whereas mitochondrial targeting signals are not well conserved at the sequence level, other properties show striking conservation, including length, net charge, and presence of stabilizing amino acids at the N-terminus of mature proteins.
Cleavage Site Recognition Motifs-Next, we aimed to assess whether mammalian mitochondrial cleavage sites have the same recognition motifs as those determined previously in yeast studies (3,4). In yeast presequences, the predominant cleavage site motif is the presence of arginine upstream of the cleavage site at position R-2 (MPP), R-3 (MPP+Icp55), or R-10 (MPP+Oct1) (3,4,20). As in the published yeast study (4), we compiled a reference set of well supported cleavage sites in mouse and human (see under "Experimental Procedures"). Analysis of the position of the first arginine relative to the cleavage site (Fig. 4A) shows clear enrichment for the R-2, R-3, and R-10 motifs in all three species. However, in all three species, 15-25% of cleavage sites do not match any of these motifs (Fig. 4B), pointing to additional peptidases yet to be determined or pleiotropy of existing peptidases. Alignment of all cleavage sites showed broadly consistent patterns in each species, albeit with low information content (Fig. 4C, top row). Therefore, we also separately aligned cleavage sites matching R-2, R-3, or R-10 (Fig. 4C) and again observed consistency of recognition motifs across species. We note that the R-10 motif shows the least conservation between yeast and mice (Fig. 4C); specifically, the mouse motif shows a preference for serine at position P1′ and lacks the phenylalanine preference at position P8, consistent with a recent study using a synthetic peptide library for human MIPEP substrates (37).
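A simple way to assign a cleavage site to one of these motifs is sketched below in Python. It is an illustrative simplification rather than the analysis behind Fig. 4: the function name is hypothetical, `mature_start` is assumed to be the 1-based index of the first mature residue (P1′), and checking R-2 before R-3 before R-10 is a tie-breaking choice made only for this example.

```python
def classify_cleavage_motif(precursor: str, mature_start: int) -> str:
    """Label a cleavage site as "R-2", "R-3", "R-10", or "other".

    Assumptions (illustrative only):
      * `mature_start` is the 1-based index of the first mature residue (P1').
      * A site matches R-n if the residue n positions upstream of the
        cleaved bond (position Pn) is arginine.
      * When several motifs match, the shortest offset wins.
    """
    if not 1 <= mature_start <= len(precursor):
        raise ValueError("mature_start must lie within the precursor")
    for offset, label in ((2, "R-2"), (3, "R-3"), (10, "R-10")):
        # P1 sits at 0-based index mature_start - 2, so Pn sits at
        # 0-based index mature_start - 1 - n.
        idx = mature_start - 1 - offset
        if idx >= 0 and precursor[idx] == "R":
            return label
    return "other"


# Toy precursor (hypothetical): a single arginine at residue 8
toy = "MAAAAAARAS" + "T" * 10
labels = [classify_cleavage_motif(toy, start) for start in (10, 11, 18, 5)]
```

In the toy sequence, moving the mature N-terminus one residue downstream (from 10 to 11) turns an R-2 site into an R-3 site, which mirrors the one-residue offset between the MPP and MPP+Icp55 sites described for Cox5a.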
We then asked whether the substrates of secondary peptidases Icp55 and Oct1 are also conserved across different species (Fig. 4, D and E). Yeast secondary peptidases Icp55 and Oct1 have mouse orthologs, Xpnpep3 and Mipep, respectively; however, their substrates are not characterized in the mouse. We compiled the set of experimentally determined targets of Icp55 and Oct1 in yeast and identified the subset having 1:1 mouse orthologs with N-termini determined via subtiligase. Of the 13 yeast substrates of Icp55, 77% of mouse orthologs shared the same R-3 motif, supporting that these substrates are indeed strongly conserved across evolution (Fig. 4D). In striking contrast, 0/9 mouse orthologs to yeast Oct1 substrates had the canonical R-10 motif and instead were consistent with MPP or Icp55/Xpnpep3 cleavage (Fig. 4E). Thus, although the mouse data clearly show an R-10 motif, presumably the result of Mipep cleavage, the substrates of yeast Oct1 and mouse Mipep are likely to be quite distinct.
Given that the substrates of Oct1/Mipep are not conserved between yeast and mouse, we next sought to assess whether the putative Mipep cleavage sites (R-10) were conserved across animals. Thus, we selected 12 metazoan species and aligned orthologs of all mouse proteins with observed R-10 cleavage sites. We observed that the mouse P10 arginine is typically well conserved across animals in 11/12 cases, with one exception (Prodh) (supplemental Fig. S7). Moreover, in all five instances where human orthologs were present in the Vaca Jacome study (6), the human N-termini showed the same P10 arginine. Together, these comparative data suggest that whereas Oct1/Mipep substrates are not conserved between yeast and mouse (Fig. 4E), within animals the substrates appear well conserved (supplemental Fig. S7).
Investigation of Outlier Cleavage Sites-Although most observed N-termini are consistent with a cleaved presequence 20-60 aa long, we observed 17 proteins with well supported cleavage sites >100 aa from the annotated RefSeq protein start (Fig. 5) (see "Experimental Procedures"). We note that such distant cleavages would not have been identified in the yeast study, which mapped mass spectra to a database consisting only of the first 100 aa of all mitochondrial proteins. Investigation of these 17 proteins revealed four with existing experimental support for the orthologous cleavage site in other animals, including known cleavage of a polyprotein into two functional enzymes (Hsd17b4/NP_032318:313) (38), and three sensor proteases that are cleaved and activated in response to stress conditions (Htra2/NP_062726:134 (39), Oma1/NP_080185:140 (40), and Opa1/NP_598513:195 (41)). We additionally observe a novel cleavage site in Opa1 with evidence in liver, cerebellum, and spinal cord based on half-tryptic peptides (Opa1/NP_598513:184). For two mitochondrial outer membrane proteins known to be cleaved by cytosolic proteases (42)(43)(44), we report the first evidence of exact Among the remaining outlier cleavage events, we identified three genes with evidence of a shorter mRNA isoform not present in RefSeq but consistent with our observed N-termini (Acadm, Adck3/Coq8a, and Scp2; see supplemental Fig. S8). Such novel isoforms are further supported by nearby translation start sites detected by ribosome profiling experiments in mouse liver that used lactimidomycin to halt translation at the initiating codon (supplemental Fig. S8) (26).
Strikingly, we observed internal cleavages in two subunits of a key Krebs cycle enzyme complex, 2-oxoglutarate dehydrogenase: Ogdh (NP_001239211:112) and Dlst (NP_084501:202). Furthermore, we observed experimental support for a novel protein isoform of DLST in the Human Protein Atlas (45), where Western blots of liver extracts using a C-terminal targeting antibody show the following two bands: a 49-kDa band consistent with the full-length protein and a ∼26-kDa band consistent with a shorter isoform produced by the observed N-terminus. Thus, existing antibody validation experiments are consistent with one of our novel protein isoforms in liver.
Together, these 17 outlier cleavages highlight known polyprotein and regulated cleavages as well as novel mRNA isoforms, and suggest new protein isoforms for key metabolic enzymes. Further experiments are required to validate all these results and to decipher roles of shorter protein isoforms in mitochondrial pathways.
DISCUSSION
Here, we present the largest experimental catalog of N-termini for cleaved mammalian mitochondrial proteins. Our evolutionary analysis comparing cleavages in 327 mouse genes with published data from 179 human genes and 451 yeast genes shows that although mitochondrial presequences are poorly conserved at the sequence level, properties such as length and charge are extremely well conserved, as is the presence of stabilizing N-terminal amino acids consistent with the N-end rule (5). Mammalian presequences contain highly similar cleavage site motifs to those known in yeast, namely R-2 (MPP), R-3 (MPP+Icp55), and R-10 (MPP+Oct1); however, 15-25% of cleavage sites do not match any known peptidase motif. By integrating experimental data from mouse and human, as well as targeting signal predictions by three algorithms, we created a resource to enable investigations into individual mitochondrial proteins or protein classes (supplemental Table S1).
We note that our inventory of mitochondrial cleavage sites has several distinctions from similar resources determined using different technologies in human and yeast. Most importantly, our subtiligase method cannot label acetylated N-termini, and thus our dataset is nearly specific for cleaved mitochondrial proteins. Indeed, our master set of 327 mouse N-termini contains only six uncleaved RefSeq proteins (beginning with methionine at position 1), which may represent unacetylated N-termini. In contrast, the COFRADIC methodology applied to yeast mitochondria (4) identified both uncleaved and cleaved N-termini, and these authors estimated that 70% of mitochondrial proteins contain a cleaved presequence exceeding 5 aa. Both subtiligase and COFRADIC methods include a step to enrich either N-terminally labeled proteins or N-terminal peptides (positive enrichment for subtiligase and negative enrichment for COFRADIC), and this enrichment avoids interference from the unlabeled proteins or internal peptides and increases sensitivity for detecting N-termini by LC-MS. In comparison, the TMPP labeling used in the human study lacks an enrichment step, which might have contributed to the lower coverage (179 human MitoCarta proteins detected). We note that the subtiligase method utilizes an enzyme-based modification, whereas the other two methods rely on chemical modifications. Finally, as our analysis pipeline did not limit peptide identification to the first 100 aa of proteins, we detect 17 outlier cleavages >100 aa from the annotated protein start that confirm existing knowledge of polyprotein and signaling cleavage events, as well as identify new protein isoforms for key metabolic enzymes.
We estimate our dataset has 55% sensitivity for detecting true cleaved mitochondrial proteins. Increased sensitivity could be obtained in future studies by increasing the number of mouse tissues analyzed, which we estimate could identify an additional 33% of cleaved proteins. Moreover, additional proteases could be used to identify N-termini that lack amenable tryptic peptides. Although subtiligase methods are not compatible with Glu-C (which would cleave the TEV site), future studies could increase sensitivity using Asp-N or chymotrypsin proteases (which we estimate would each identify ∼60 additional proteins (supplemental Fig. S9)). We note that both subtiligase and other technologies have difficulty detecting low abundance proteins, but the subtiligase approach improves the chance of detecting low abundance proteins due to positive enrichment for N-termini (14).
Together, our experimental catalog of mouse mitochondrial N-termini, supporting evidence from half-tryptic peptides across 14 mouse tissues, and orthologous N-termini from published human data provide the most comprehensive catalog of mammalian cleavage sites. These data can help researchers studying individual proteins to determine where the functional processed protein begins, which can be important for protein biochemistry or in engineering N-terminal selectable tags distal to cleavage sites. Additionally, these data can help elucidate mechanisms of mitochondrial protein import and processing and identify regulated cleavages such as those we observed in known sensors (Htra2, Oma1, Opa1, Mavs, and Bcs2l13). Given the large fraction of cleaved proteins not matching known proteases, there are likely additional biological processing pathways yet to be elucidated.