Status of Large-scale Analysis of Post-translational Modifications by Mass Spectrometry*

Cellular function can be controlled through the gene expression program, but often protein post-translational modifications (PTMs) provide a more precise and elegant mechanism. Key functional roles of specific modification events, for instance during the cell cycle, have been known for decades, but only in the past 10 years has mass spectrometry (MS)-based proteomics begun to reveal the true extent of the PTM universe. In this overview for the special PTM issue of Molecular and Cellular Proteomics, we take stock of where MS-based proteomics stands in the large-scale analysis of protein modifications. For many PTMs, including phosphorylation, ubiquitination, glycosylation, and acetylation, tens of thousands of sites can now be confidently identified and localized in the sequence of the protein. The quantification of PTM levels between different cellular states is likewise established, with label-free methods showing particular promise. It is also becoming possible to determine the absolute occupancy or stoichiometry of PTM sites on a large scale. Powerful software for the bioinformatic analysis of thousands of PTM sites has been developed. However, a complete inventory of sites has not been established for any PTM, and this situation will persist into the foreseeable future. Furthermore, although PTM coverage by MS-based methods is impressive, it still needs to be improved, especially in tissues and in clinically relevant systems. The central challenge for the field is to develop streamlined methods for determining biological functions for the myriad of modifications now known to exist.

The human genome project has revealed a surprisingly small number of protein-coding genes: around 20,000, not many more than those of the lowly worm (1). In contrast, it has turned out that there is a bewildering number and diversity of RNA species, and biological functions for many of them are still not known. Proteins constitute the next level of the gene expression program, and here diversity comes in two forms: variants affecting the primary sequence of amino acids, and post-translational modifications (PTMs). Today primary sequence variation is mainly studied indirectly by means of deep RNA sequencing, although top-down and shotgun proteomics could provide more direct and quantitative results in the future (2). In contrast, the unbiased discovery and analysis of large numbers of PTM sites is the exclusive domain of MS-based proteomics, and great strides are being made in identifying the myriad possible protein modifications. The study of PTMs via MS now constitutes a large and vibrant field of research, as exemplified in the original and review papers in this special issue of Molecular and Cellular Proteomics. Here we provide a snapshot of where the field stands, what progress has already been made, and where we see the key challenges for the future.
PTM Analysis via MS-Mass spectrometry has long been used to map modifications on purified proteins; in fact, this was one of the first applications of MS in protein research (3). In contrast, proteomic investigations of PTMs (meaning their analysis on a large number of proteins, such as those in an organelle or in an entire cell) have taken off only since the beginning of the past decade (4-6). Although the obstacles to complete proteome analysis are finally being overcome (7), this is not the case for PTMs, because their analysis is more difficult both technologically and conceptually. There are several principal differences between proteome and PTM analysis. In proteome measurements, several peptides are generally available to characterize each protein, which makes the measurement much more robust, whereas in PTM measurements each peptide representing a modification site of interest needs to stand on its own. Modified peptides can be of lower abundance and more difficult to identify from their fragmentation spectra than nonmodified peptides. Furthermore, these tandem mass spectra should contain sufficient information to localize the modification with single amino acid resolution. Below we address the challenges of the PTM workflow in turn (Fig. 1).
Extraction, Solubilization, and Digestion-The sample preparation methods used for PTM analysis are generally the same as for standard shotgun proteomics. Exceptions are the PTM enrichment step and the strict requirement to inactivate enzymes responsible for adding or removing the PTM of interest. The extraction and solubilization of proteins from cell culture lysates is relatively straightforward, and a number of robust protocols exist. In contrast, more effort should be directed toward the efficient retrieval of proteins from tissues with high contents of extracellular matrix. As an example, cardiac muscle tissue is very rigid, and special homogenization technologies are needed to efficiently extract proteins for enrichment of their phosphopeptides (8). Chromatin-bound and transmembrane proteins are easily analyzed in proteome studies, but they are still challenging in PTM analysis, as the enrichment strategies might not be compatible with the specific buffers applied (9) and peptides from all parts of the proteins need to be generated.
As in expression proteomics, trypsin is the protease of choice for PTM analysis because of its high cleavage specificity (10) and the MS/MS-friendly properties of the resulting peptides. In some instances tryptic peptides might be too small or too large for LC-MS/MS analysis, a fact that needs to be taken into account when certain critical PTM sites need to be quantified in planned experiments. Likewise, some PTMs modify the solubility properties of the peptides significantly. For example, glycosylated or phosphorylated tryptic peptides can be too hydrophilic to be captured by conventional reversed-phase C18 material. Using proteases with different cleavage specificities such as chymotrypsin, Lys-N, or endoproteinase Glu-C (11,12) in addition to trypsin can help overcome some of these issues and should be more broadly considered. Developments in these areas are often underappreciated, but they are critical to the overall success of PTM-based proteomics.
Enrichment and Sensitivity in PTM Analysis-Regulatory PTMs, by definition, have substoichiometric occupancy of their target proteins, making it a major challenge to achieve sensitivity in MS analysis (Fig. 1). Although proteomics workflows have become much more sensitive overall, they still require a specific step in which the PTM of interest is enriched with respect to the unmodified proteins or peptides. There are now protocols with very high specificity for phosphorylated peptides, often leading to a phosphopeptide proportion of more than 90%. Enrichment efficiency can be around a factor of 100, making it possible to start with much more protein material than in proteome analysis and thereby boosting sensitivity (13). However, the fact that these starting amounts are much greater than what is needed for proteome analysis needs to be considered in PTM-based projects. Metal affinity chromatography, for instance using titanium dioxide (14), and/or anti-phosphotyrosine antibodies (15) are frequently employed. It is also possible to enrich with antibodies raised against specific phosphorylation motifs, but generally it is more attractive to analyze the entire phosphoproteome instead. With a few exceptions such as N-linked glycopeptide enrichment via lectins (16), the specificity and enrichment for other PTMs are much lower.
The difference that a good enrichment strategy can make is exemplified by the recent introduction of diglycine-specific antibodies to detect the remnant modification after tryptic digestion of ubiquitinated proteins (17,18). Before the development of these antibodies, target proteins of interest had to be tagged or the ubiquitin itself had to be used as a purification handle, whereas now tens of thousands of ubiquitination sites can be specifically enriched and analyzed in a generic workflow (19,20). For most PTMs, however, efficient enrichment strategies do not exist at all, making them very difficult to study other than on a protein-by-protein basis. The development of specific and highly enriching protocols for more classes of PTMs therefore remains an urgent task for the community. We recommend that these tasks be recognized as important scientific endeavors in their own right that should be supported by the academic funding system.
Efficiency of Fragmentation, Identification, and Localization-The MS community has worked on peptide fragmentation mechanisms for several decades, and the basic mechanisms are well understood and robustly implemented. One major development over the past few years has been the wide adoption of the "high-high" strategy (21), that is, high resolution at both MS and MS/MS levels, and this has been especially beneficial for PTMs.
In contrast to the identification of proteins, PTM analysis inherently has to rely on single peptide species, making it much more challenging. The same fragmentation methods, scoring schemes, and stringent false discovery rate controls can generally be used. However, apart from the unambiguous identification of the modified peptides themselves, the modified amino acids need to be localized in the sequence, which requires a separate "localization score." This is unproblematic for stable modifications, and it is our experience that the mean localization probability is typically greater than 99% in large data sets (22-28).
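To make the idea of a localization score concrete, the following is a minimal sketch in Python. It is not the scoring scheme of any particular search engine. It assumes that, for each candidate site, one has already counted how many theoretical fragment ions match the spectrum, and it converts those counts into probabilities through a binomial chance-match model. The function names, the chance-match probability `p_random`, and the example counts are all hypothetical.

```python
from math import comb, log10


def binomial_tail(n_matched, n_theoretical, p_random):
    """Probability that at least n_matched of n_theoretical ions match by chance."""
    return sum(
        comb(n_theoretical, k) * p_random**k * (1 - p_random) ** (n_theoretical - k)
        for k in range(n_matched, n_theoretical + 1)
    )


def localization_probabilities(matches_per_site, n_theoretical, p_random=0.04):
    """Turn per-site matched-ion counts into normalized localization probabilities.

    matches_per_site: dict mapping a candidate site label to its matched-ion count.
    p_random: assumed probability that a single theoretical ion matches by chance.
    """
    # Score each placement as -10*log10 of its chance-match probability.
    scores = {
        site: -10 * log10(binomial_tail(m, n_theoretical, p_random))
        for site, m in matches_per_site.items()
    }
    # Convert scores back to weights and normalize so that they sum to 1.
    weights = {site: 10 ** (s / 10) for site, s in scores.items()}
    total = sum(weights.values())
    return {site: w / total for site, w in weights.items()}


# Hypothetical example: two candidate sites on the same phosphopeptide.
probs = localization_probabilities({"S3": 12, "T7": 8}, n_theoretical=20)
for site, p in probs.items():
    print(site, round(p, 6))
```

Under these assumptions, a placement that explains clearly more fragment ions than its alternatives receives a probability close to 1, whereas placements that explain the spectrum equally well share the probability between them.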
Guidelines in the proteomics literature state that the annotated spectra of PTM-bearing peptides should be provided (29). We fully support these requirements and note that they are particularly needed in the general biological literature, where novel modifications are still reported without any or with only low-quality MS/MS evidence.
In MS-based PTM analysis, it is important to generate sufficient peptide fragmentation information for peptide identification and site localization. Among the different strategies that have been developed for this purpose, multistage activation in ion trap instruments (30), "beam-type" fragmentation on time-of-flight instruments (14,31), or the equivalent higher-energy collisional dissociation (HCD) implementation on Orbitrap instruments (32,33) provide a high chance that the relevant modification-specific fragment events will be generated (Fig. 1).
Although the techniques described above are almost universal, it can be difficult to localize phospho-serine in multiphosphorylated peptides, and very labile modifications such as O-GlcNAc might require the use of electron transfer dissociation (ETD) (34,35). Furthermore, the quantification of regulated changes in multiphosphorylated peptides remains challenging. Some PTMs can be very heterogeneous, as is the case for N-linked glycosylation. For these PTMs, the detailed determination of the structure of the PTM can be even more difficult than defining the modification site (36,37). Nevertheless, even the accurately measured fold changes of the modified peptides themselves are already very useful for planning biological follow-up experiments. Some protein modifications, such as intermediates in redox reactions, occur very transiently in the cell and can currently be studied only when they are trapped by stabilizing chemical reagents (38). We see a great opportunity for PTM proteomics in the large-scale analysis of these processes, which have so far been studied almost exclusively in in vitro reactions.
Completeness of the PTM Catalog: How Much Is Enough?-MS-based proteomics can now determine complete model proteomes, and this will soon be possible in mammalian systems as well (7). In contrast, there are no complete PTM catalogs so far. Not only are the MS-based efforts far from saturation for most PTMs, but it is difficult to even define what a complete PTM proteome would be. For instance, a given serine/threonine kinase may have some phosphorylation propensity toward a very large set of linear sequences, but this does not necessarily imply a biological function of these substrates in vivo. Conversely, biologically relevant substrates might become modified only under a very restricted set of circumstances (39) that were not part of the set of conditions used in the global PTM analysis. It appears that almost all proteins can be post-translationally modified. Extrapolating from current data, at least 70% of proteins are phosphorylated at some point (40,41), with similar proportions for ubiquitin (42) and lysine acetylation (43). The PTMs of several proteins or protein classes have been studied in very great depth, with histones being a prime example. Nearly all common modifications have been reported to occur on histones, and with the use of new enrichment strategies, fragmentation methods, and computational approaches, more are still being discovered. Histone PTMs also illustrate challenges to complete PTM analysis even on individual proteins, such as the difficulty of generating peptides of suitable length for MS, the detection of histone modifications that are restricted to few sites in the genome, and the interplay of modifications, in this case termed the histone code. Some of the proteins that are intensely studied, such as p53, have likewise been reported to have a plethora of modifications (44). However, many of these data were acquired via low-resolution and statistically uncontrolled methods and should therefore be treated with caution.
The same caveats apply to some large-scale PTM studies. Researchers should not be satisfied with the mere presence of a PTM in a database; they should also critically evaluate the associated data. Fortunately, there is increasing awareness of the need for software tools to evaluate data quality within and between studies.
In our opinion, a useful minimum depth in large-scale PTM analysis should cover the key functional sites in the process under investigation. A phosphoproteome of 10,000 sites will include the majority of the sites classically studied using phospho-specific antibodies. Reaching such a depth would have been a daunting prospect just a few years ago. Today this can be done in single LC-MS/MS runs, making it possible to follow multiple conditions and replicates.
Relative Quantification and Stoichiometry of PTMs-The quantification of PTMs is highly desirable because regulation in a biological context makes it much more likely that a site will in fact be functional. The PTM-bearing peptides are quantified in the same way as unmodified peptides. However, it is more challenging because it is based on single peptides, often of very low abundance. Peptide quantification by means of stable isotope labeling with amino acids in cell culture (SILAC) is the gold standard because it eliminates many workflow-induced sources of error (45,46). Nevertheless, many chemical labeling strategies have been successfully used as well, and they can be applied to any sample type (47,48). For PTM analysis, large amounts of starting materials typically need to be labeled. With reporter-ion-based quantification, such as tandem mass tags and iTRAQ (49,50), care needs to be taken that the observed regulation is due to the PTM-bearing peptide of interest, rather than co-fragmented precursors (51).
Recent improvements in data quality and in algorithms now make label-free quantification very attractive in the PTM field. In addition to being the most straightforward method operationally, label-free analysis allows direct comparison of MS signals between any number of samples, is applicable to any source material, and avoids reagent costs.
In addition to relative quantification of the modification site of interest, it is very useful to determine the site's occupancy or stoichiometry (i.e. the fraction of proteins that are modified). A large fractional stoichiometry, especially when combined with dynamic regulation, is a very good indication that the site may be functional. The converse does not necessarily hold true, because a modification may be located on a small pool of proteins that are temporally or spatially distinct. The cell cycle and DNA damage response are examples of this. In phosphoproteome analysis, stoichiometry has been determined for thousands of sites from the data themselves (40) or after deliberate dephosphorylation of half of the sample (52,53). In the future, this will be possible not only with stable isotope labeling with amino acids in cell culture, but also with label-free methods.
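As an illustration of how occupancy might be derived from quantitative data alone, the sketch below solves for site stoichiometry from three ratios between two cellular states: that of the modified peptide, that of its unmodified counterpart, and that of the protein. It rests on explicit assumptions that are ours for the purpose of the example: the site exists in only two forms (modified and unmodified), all three ratios are measured accurately, and the modified and unmodified peptide ratios differ. The function and parameter names are hypothetical.

```python
def site_occupancy(modified_ratio, unmodified_ratio, protein_ratio):
    """Estimate site occupancy in two states from three state2/state1 ratios.

    With P and U the amounts of modified and unmodified protein in state 1:
        state 2 modified   = x * P
        state 2 unmodified = y * U
        state 2 total      = z * (P + U)
    Mass balance, x*P + y*U = z*(P + U), gives P/(P + U) = (z - y) / (x - y).
    """
    x, y, z = modified_ratio, unmodified_ratio, protein_ratio
    if x == y:
        raise ValueError("occupancy is undetermined when both peptide ratios are equal")
    occupancy_state1 = (z - y) / (x - y)
    occupancy_state2 = occupancy_state1 * x / z
    return occupancy_state1, occupancy_state2


# Hypothetical example: the modified peptide goes up 3-fold, the unmodified
# peptide drops to half, and the protein level is unchanged.
occ1, occ2 = site_occupancy(3.0, 0.5, 1.0)
print(f"state 1: {occ1:.0%}, state 2: {occ2:.0%}")
```

The calculation is only informative when the site actually changes between the two states, because an unregulated site gives equal ratios for both peptide forms and leaves the occupancy undetermined.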
The value of stoichiometry information has been illustrated in the case of acetylation. Although this modification can readily be found on many abundant proteins, it appears to frequently occur at very low stoichiometries and to be of nonenzymatic origin (54). In contrast, certain classes of PTMs, such as N-linked glycosites, are mainly generated co-translationally and generally occur at very high stoichiometries.
Less Studied or Unknown Modifications-Almost all of the efforts in large-scale PTM mapping have been directed at the most prominent modifications, especially phosphorylation, ubiquitylation, glycosylation, and acetylation. However, more than 200 in vivo modifications are known, and there are even more peptide modifications that can be induced chemically, for instance during sample preparation. Although PTM identification normally requires specifying the modified amino acid in search software, unbiased methods for PTM analysis have also been developed (55,56). These call for the in silico addition of the mass difference between the modified and the unmodified peptide to each amino acid in turn. Although this works very well even on a proteomic scale, the lack of an enrichment step means that only the most abundant of such unknown modifications are likely to be found.
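The in silico placement of an unknown mass difference can be sketched as follows. This is a simplified illustration rather than the algorithm of the cited tools: it assumes singly charged b and y ions only, a fixed absolute mass tolerance, and a plain count of matched ions as the score. The function names, the tolerance, and the example peptide are hypothetical. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the 20 common amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mass(seq):
    """Neutral monoisotopic mass of the unmodified peptide."""
    return sum(MONO[aa] for aa in seq) + WATER


def fragment_ions(seq, delta, pos):
    """Singly charged b and y ion m/z values with delta added at 0-based pos."""
    masses = [MONO[aa] + (delta if i == pos else 0.0) for i, aa in enumerate(seq)]
    b_ions, y_ions = [], []
    running = PROTON
    for m in masses[:-1]:  # b ions: N-terminal prefixes
        running += m
        b_ions.append(running)
    running = WATER + PROTON
    for m in reversed(masses[1:]):  # y ions: C-terminal suffixes
        running += m
        y_ions.append(running)
    return b_ions + y_ions


def score_placements(seq, precursor_mass, peaks, tol=0.02):
    """Place the precursor mass difference on each residue in turn and count matches."""
    delta = precursor_mass - peptide_mass(seq)
    counts = []
    for pos in range(len(seq)):
        ions = fragment_ions(seq, delta, pos)
        counts.append(sum(any(abs(p - ion) <= tol for p in peaks) for ion in ions))
    return delta, counts


# Hypothetical example: simulate a spectrum with an unknown modification on serine.
seq = "AASPEK"
true_delta = 79.96633
peaks = fragment_ions(seq, true_delta, 2)
delta, counts = score_placements(seq, peptide_mass(seq) + true_delta, peaks)
print(round(delta, 4), counts)
```

In a real search the mass difference is not known in advance for any peptide, so this placement step would have to be repeated for every candidate sequence, which is one reason why such unbiased searches are computationally demanding.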
Novel modifications are still being described regularly, especially on lysines (57-59). In addition to demonstrating biologically relevant regulation of such novel modifications, it is very useful to determine their stoichiometry to provide a functional context.
Modifications that are large relative to the peptide can be difficult to study via conventional LC-MS/MS and might require the development of special methods. Poly(ADP-ribosyl)ation, a prominent cellular response to genotoxic stress, is an example of such a PTM, with the additional challenge that it is very labile in the fragmentation step (61,62). Other understudied PTMs include those on extracellular matrix proteins, which can be large, labile, and insoluble. Analyzing and biologically characterizing these classes of PTMs will occupy researchers for many years to come.
Bioinformatic Analysis-This was until recently seen as perhaps the largest bottleneck in MS-based proteomics (63). Fortunately, this area is now well developed in both expression proteomics and large-scale PTM analysis, and comprehensive and statistically rigorous software tools are readily available. Typical steps include motif analysis (64), Gene Ontology enrichment (65), pathway analysis (66), and analysis of protein-protein interactions (67). In the MaxQuant environment, the Perseus software allows extensive statistical and functional analysis of proteome and PTM data (68). The NetworKIN software (69,70) combines protein-protein information from STRING (67) with information on linear kinase motifs (62) to provide likely kinase-substrate relationships in large-scale data.
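The statistical core of an annotation enrichment step can be illustrated with a one-sided hypergeometric test. The sketch below is generic and does not reproduce the implementation of Perseus or any other tool named above. The function name and the example numbers are hypothetical, and a real analysis would additionally correct for testing many annotation terms.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric p-value, P(X >= k).

    N: proteins in the background (e.g. all proteins with identified PTM sites)
    K: background proteins carrying the annotation term
    n: proteins in the selected set (e.g. those with regulated sites)
    k: selected proteins carrying the annotation term
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Hypothetical example: 40 of 200 regulated proteins carry a term that
# annotates 300 of the 5000 background proteins.
print(enrichment_p_value(k=40, n=200, K=300, N=5000))
```

Choosing the background matters as much as the test itself: comparing regulated sites against all proteins detected in the same experiment, rather than against the whole genome, avoids scoring abundance-driven detection bias as enrichment.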
Bioinformatic analysis also typically includes comparison to one of the PTM databases. These are still in a state of flux with regard to the data that they accept and retain. Curators of the widely used UniProt database, for instance, have largely stopped the indiscriminate incorporation of all data from large-scale PTM projects. This is in contrast to the PhosphoSite database (71), which even includes unpublished data. An advantage of the latter approach is the higher degree of confidence in sites identified in multiple groups; however, it needs to be kept in mind that common misidentifications will also accumulate. Clearly, the community needs to establish more refined ways of integrating and presenting the PTM data that it generates. We envision that the usefulness of the PTM information in specialized and general databases will continue to improve as a result of the exponential increase in the data generated, as well as better data quality and reporting standards.
Throughput in PTM Analysis-So far global PTM analysis has been very challenging and resource demanding, with some of the large-scale studies taking months to complete. The proteomics community needs to develop ways to make PTM analysis much faster, just like the genomics community has done with next-generation sequencing technologies. Fortunately, current trends are very encouraging. Deep phosphoanalysis can now be done in a single day of measurement time. Single-shot approaches (72-74) already allow the analysis of more than 10,000 sites in a few hours (75), making it realistic to perform an entire project in a few days. Added advantages of single-shot approaches are that they tend to use less input material and are more robust because of the absence of fractionation steps.
Targeted approaches aim to identify and quantify a small number of peptides of interest in a short analysis time (76,77). In the context of PTMs, the targeted approach allows the measurement of key PTM sites in many conditions (78-80). However, this comes at the expense of the unbiased nature of shotgun proteomics, as new and unexpected sites cannot be discovered. Targeted analysis is conceptually straightforward and has therefore proven very attractive to nonspecialists. However, extensive method optimization and process control are required in order to achieve and maintain acceptable levels of specificity. This is especially true if more than a few sites are targeted and when low-resolution instrumentation is used.
The high throughput promised by the methods described above is a precondition for PTM analysis to make an impact in the clinic. This is an especially intriguing area of PTM-based proteomics, because many diseases deregulate signaling pathways, and because this may be more directly evidenced in the dynamic modifications of pathway members than in the expression levels of transcripts or proteins. For instance, cancer genome projects have revealed large numbers of mutations in the same pathway, each occurring in small numbers of cells or patients (81). PTM analysis on clinical samples would determine, in a straightforward way, whether a pathway is critically modulated in a given tumor and how this activity is affected by treatment. Currently the community is still at the stage of proof-of-principle investigations of these concepts. Much development of the PTM workflow and instrumentation will be required in order to reach the level of robustness and validation required for clinical application.
Functional Assignment of PTMs-The signaling community has grown up with the concept of very few but biologically highly important PTMs-for example, key sites controlling the cell cycle or a growth factor pathway. In contrast, MS-based proteomics provides tens of thousands of sites, raising the question of their biological relevance. Clearly, if properly controlled, these large-scale data sets can directly point to key regulatory sites (82,83,85). In these instances, proteomics functions as an initial screen in a cell-signaling project. More often, however, researchers are faced with the challenge of how to select a very small number of sites from large-scale data and how to perform functional follow-up on these (86). Obvious prioritization criteria include high identification and quantification accuracy, regulation of the site in the process of interest, and a reasonable stoichiometry. Often, MS-based proteomics itself can be used in follow-up, through measurement of additional conditions or with additional proteomics data such as interactions of the PTM in question. However, classical in vitro enzyme assays might not correspond to the situation in vivo, a fact that should be kept in mind when validating sites via these methods. Likewise, point mutations might not be sufficiently informative, especially in the case of lysine, which can be the target of several PTM types. Finally, classical approaches typically do not take account of the cooperative nature of PTM sites, which is becoming more and more appreciated. We believe that it is unrealistic to hope that established signaling assays can validate more than a small fraction of the PTM sites now being identified. Instead we envisage proteomics approaches themselves providing the answer to the current dilemma. The throughput and accuracy of MS-based proteomics now lead to the possibility of quantifying a PTM on a global scale while systematically perturbing it. 
As an early example, each of the kinases and phosphatases in the yeast proteome has been deleted, followed by measurement of the phosphoproteome (87).
Enzyme-Substrate Relationships-Although advanced workflows can now routinely be used to analyze tens of thousands of phosphorylation sites in single experiments, it remains difficult to assign the specific kinase or kinases that directly catalyze the modification of a phosphorylation site of interest. This is an important question that can partially be addressed bioinformatically via linear kinase motif analyses (69,88,89) or experimentally via pharmacological inhibition or knockdown of kinases involved in the cellular response under investigation. In the latter case, global quantitative phosphoproteomics can efficiently identify phosphorylation sites that are dependent on or downstream of specific kinases (24,90,91). Chemical proteomics strategies using analog-sensitive kinases (in which single site-directed mutagenesis of the so-called gatekeeper residue in the kinase domain renders the mutated kinase specific to a bulky ATP analog) can more directly determine kinase-substrate relationships (92-94). With the recent breakthroughs in genome editing technologies (95), we expect that analog-sensitive kinase approaches will in the future be applied widely to all susceptible kinases and analyzed via quantitative phosphoproteomics. Streamlined genome editing also holds great promise for the functional study of PTM networks and even individual PTM sites that are key determinants in cell fate decisions.
Fig. 2. Proteomics of PTMs would greatly benefit from a more complete repertoire of enrichment tools and reagents. With further development of mass spectrometric capabilities, however, it would be desirable to detect modified peptides without specific enrichment. This would make it possible to study many PTMs simultaneously and estimate their stoichiometry directly. Robust PTM analysis with high throughput could deliver patient classification and unique information on treatment efficacy. Finally, very deep and quantitatively accurate PTM analysis will provide a crucial basis for systems biology.
PTM Analysis in the Future-As discussed above, there are now robust MS-based proteomic workflows for the rigorous identification and quantification of very large numbers of PTMs. Enrichment protocols exist for the PTMs of greatest biological interest. Novel capabilities not generally available in classical approaches have been developed, with the quantification of site occupancy providing a prime example. Bioinformatic analysis of PTM data has become rigorous and streamlined, and it now constitutes an important foundation of systems biology.
The major challenges of the field lie in functional interpretation and follow-up. However, with increasing availability of the technology to signaling biologists, large-scale PTM quantification will often simply serve as an initial screen, followed by detailed study of key sites. How all these PTM data should be integrated in PTM databases and made available to the community is still an unsolved problem.
Many technological challenges remain (Fig. 2). PTM screens still need to be made faster, more sensitive, and more reproducible, and they need to cover an even greater dynamic range. Further development of the single-shot approach would combine the advantages of targeted approaches with those of shotgun approaches. With such advances, large-scale PTM analysis will be ready to make a direct impact in the clinic. The potential impact of such a development cannot be overemphasized, given the need for more precise therapies targeted at the molecular nature of the individual disease.
In the longer term, it is conceivable that dramatically improved mass spectrometric performance could allow the analysis of PTM-bearing peptides without prior enrichment (Fig. 2). This would have many benefits, including the ability to work with much less biological material and the comprehensive analysis of different PTMs at once, including the stoichiometry of each.
* The NNF Center for Protein Research is supported by a generous donation from the Novo Nordisk Foundation. This work was supported by the 7th Framework Program of the European Union (262067-PRIME-XS) and the research career program FSS Sapere Aude (J.V.O.).