Peptide Identification by Tandem Mass Spectrometry with Alternate Fragmentation Modes

The high-throughput nature of proteomics mass spectrometry is enabled by a productive combination of data acquisition protocols and the computational tools used to interpret the resulting spectra. One of the key components in mainstream protocols is the generation of tandem mass (MS/MS) spectra by peptide fragmentation using collision-induced dissociation, the approach currently used in the large majority of proteomics experiments to routinely identify hundreds to thousands of proteins from single mass spectrometry runs. Complementary to these, alternative peptide fragmentation methods such as electron capture/transfer dissociation and higher-energy collision dissociation have consistently achieved significant improvements in the identification of certain classes of peptides, proteins, and post-translational modifications. Recognizing these advantages, mass spectrometry instruments now conveniently support fine-tuned methods that automatically alternate between peptide fragmentation modes, either for different types of peptides or for acquisition of multiple MS/MS spectra from each peptide. Although these developments have the potential to substantially improve peptide identification, their routine application requires corresponding adjustments to the software tools and procedures used for automated downstream processing. This review discusses the computational implications of alternative and alternate modes of MS/MS peptide fragmentation and addresses some practical aspects of using such protocols for identification of peptides and post-translational modifications.

Technological and computational developments continue to expand the fundamental role of tandem mass spectrometry (MS 2) in high-throughput proteomics (1). Current protocols regularly identify thousands of proteins and post-translational modifications (PTMs (2)) per experiment and can deliver very high levels of reproducibility (3) (e.g. using targeted approaches). In the large majority of cases, these advances have been enabled by standard protocols that combine trypsin digestion with protein identification from collision-induced dissociation (CID) MS 2 spectra using database search tools (4). The efficiency and reliability of trypsin digestion remain one of the major reasons for the success of high-throughput proteomics. Protein digestion results in multiple peptides per protein and, in the limit, only one peptide needs to be significantly identified to identify the corresponding protein (5,6). Trypsin further contributes to spectrum identifiability in CID MS 2 by cleaving C-terminal to K/R (lysine/arginine) and thus yielding rich and characteristic MS 2 peptide fragmentation patterns (7,8) for a substantial subset of all peptides. Incorporating knowledge of these peptide fragmentation patterns into algorithms for MS 2 identification has led to a variety of database search software tools (9,10) that identify peptides by scoring matches between acquired MS 2 spectra and predicted spectra for peptide sequences extracted from a protein sequence database.
Despite the success of the popular trypsin/CID tandem mass spectrometry protocols, it is well known that this approach generally biases experiments toward the identification of certain types of peptides, such as doubly and triply charged peptide precursors from medium-length peptides (11-15). Historically this has been advantageous because CID generates identifiable spectra from these types of peptides (low charge, high m/z) with the greatest efficiency. But a disadvantage of relying upon a single specific protease like trypsin is that many digested peptides are too short, which can lead to incomplete coverage of the proteome. In the case of trypsin, the enzyme cleaves K/R-rich regions into peptides that are too short for reliable identification (about half of tryptic yeast peptides are ~6 residues long (15)). Alternative proteases can be used to obtain a wider distribution of peptides and increase coverage. For example, some have demonstrated the value of using nonspecific proteases (16-19), but these can also decrease experimental reproducibility. Another limiting factor is that nonspecific proteases can greatly increase the computational search space when matching MS 2 spectra to all possible peptides in the database. This not only increases the time required to search the database, but it can also decrease the sensitivity of spectrum identification at a given false discovery rate (FDR (6)). Others have had success using multiple specific proteases (e.g. LysC, AspN, GluC, and ArgC) in a more targeted approach (15,20,21). Compared with the use of a single specific protease, using multiple specific proteases allows a greater percentage of the proteome (94% for yeast) to be covered by at least one peptide suitable for mass spectrometry sequencing technology (15).
If a separate MS 2 run is executed for each enzymatic digestion, there is little increase in computational complexity because MS 2 spectra from each run can be separately matched to a set of peptides cleaved by a single specific protease.
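The effect of protease choice on peptide length and sequence coverage can be illustrated with a small in silico digestion. The following is only a sketch under simplifying assumptions that are not taken from the studies cited above: every protease is modeled as cleaving C-terminal to a fixed set of residues (so AspN, which cleaves N-terminal to D, is left out), missed cleavages and proline effects are ignored, and the 7 to 35 residue window for "identifiable" peptides is an arbitrary illustrative choice.

```python
# Simplified cleavage rules (assumption): cleave C-terminal to the listed residues.
CLEAVE_AFTER = {"trypsin": "KR", "LysC": "K", "ArgC": "R", "GluC": "E"}


def digest(protein, protease):
    """Return the peptides produced by complete digestion with one protease."""
    sites = CLEAVE_AFTER[protease]
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in sites:
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without a cleavage site
    return peptides


def covered_fraction(protein, proteases, min_len=7, max_len=35):
    """Fraction of residues covered by at least one peptide of usable length.

    Each protease is digested separately, mirroring one MS 2 run per digestion.
    The length window is a hypothetical stand-in for 'suitable for sequencing'.
    """
    covered = [False] * len(protein)
    for protease in proteases:
        start = 0
        for peptide in digest(protein, protease):
            end = start + len(peptide)
            if min_len <= len(peptide) <= max_len:
                covered[start:end] = [True] * len(peptide)
            start = end
    return sum(covered) / len(protein)


# Toy sequence with a K/R-rich N-terminus (invented for illustration).
toy = "AKRGGGGGGGK"
tryptic_only = covered_fraction(toy, ["trypsin"])
with_lysc = covered_fraction(toy, ["trypsin", "LysC"])
```

Because each digestion is handled on its own, the candidate peptide list for any one run stays as small as in a single-protease search, which is the point made above about computational complexity.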
A disadvantage of using multiple proteases is that they often yield peptides that are longer and contain one or more internal basic residues, which are poorly fragmented by CID (22). But alternative fragmentation strategies such as higher-energy collision dissociation (HCD (23), formerly also known as higher-energy C-trap dissociation) and electron transfer dissociation (ETD (24)) are known to improve identification of long, highly charged peptides, peptides containing basic residues, and peptides containing many or highly labile PTMs (25-28). A popular strategy has been to use Lys-C in combination with ETD because, compared with trypsin, Lys-C generates a larger portion of peptides amenable to efficient ETD fragmentation (15). The particular combination of ETD with Lys-N digestion yields peptide coverage complementary to trypsin and very simple fragmentation patterns that can aid in manual de novo peptide sequencing (29). Although the reduction of C-terminal ions in Lys-N ETD peptides reduces the difficulty of manual sequencing, it can also hinder automated approaches that use symmetry between N- and C-terminal ions. The complementarity of CID, HCD, and ETD dissociation strategies has been assessed in a variety of contexts (30-33), and the recognition that each improves identifications of different types of peptides underlies the decision tree (15, 33-35) approach for real-time selection of fragmentation mode(s) based on each precursor's m/z and charge. Alternatively, CID/ETD (36) or CID/ETD/HCD (31) alternating MS 2 acquisition for every precursor has also been shown to substantially improve peptide identification and enable otherwise difficult analysis of PTMs (37). Underlying the utility of all tandem mass spectrometry approaches to high-throughput proteomics is the need to calculate false discovery rates, usually estimated using the Target/Decoy approach (TDA (38,39)).
This approach is a key control of statistical significance in high-throughput proteomics but, as discussed below, requires careful adjustments before it can be meaningfully applied to the analysis of MS 2 data acquired with alternate peptide fragmentation modes.
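For readers less familiar with the Target/Decoy approach, a minimal sketch of the usual estimate follows. It assumes a concatenated target plus decoy search with equally sized databases and uses the simple decoys/targets ratio; individual tools differ in the exact estimator, and the function names and toy scores are hypothetical.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR among PSMs scoring at or above `threshold`.

    psms: iterable of (score, is_decoy) pairs, one per spectrum.
    Estimator (assumption): number of decoy hits / number of target hits.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def score_threshold_for_fdr(psms, max_fdr=0.01):
    """Return the lowest observed score whose estimated FDR is <= max_fdr, or None."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Invented toy scores: four target hits and one decoy hit.
toy_psms = [(50, False), (45, False), (40, False), (35, True), (30, False)]
strict = score_threshold_for_fdr(toy_psms, max_fdr=0.01)
loose = score_threshold_for_fdr(toy_psms, max_fdr=0.25)
```

The estimate is only meaningful when all PSM scores in the list are on a common scale, which is exactly what becomes questionable when CID, HCD, and ETD spectra are scored by different models.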
Peptide Fragmentation Modes-Although there are several comparisons of peptide fragmentation modes in terms of the resulting numbers of spectrum and peptide identifications, only some (14, 36, 40-42) attempt to characterize the observed differences in terms of their underlying MS 2 fragmentation statistics. These matter because the ability of a database search tool to identify MS 2 spectra depends on how well it models MS 2 fragmentation statistics and how it uses them to score Peptide Spectrum Matches (PSMs). The basic types of information captured in these peptide fragmentation models are illustrated here using spectra from a recent comparison (33) of how CID, HCD, and ETD MS 2 acquisition modes affect Mascot (10) identifications. In brief, a HeLa tryptic digest was analyzed in three separate Thermo LTQ Orbitrap runs, one for each of CID, HCD, and ETD MS 2 acquisition; survey scans were acquired in the Orbitrap with resolution 30,000 and MS 2 spectra were acquired in the Orbitrap at resolution 7500. Peptide identifications were obtained here using MS-GFDB as previously described (36) and resulted in 17,378 CID PSMs (out of 33,586 spectra), 21,246 HCD PSMs (out of 37,810 spectra), and 12,834 ETD PSMs (out of 25,734 spectra). As shown in Table I, all fragmentation modes achieved comparable spectrum identification rates of ~50% but also exhibit the expected correlations with precursor charge, where ETD tends to perform better than CID or HCD on higher charge states. These show that ETD tends to achieve the highest identification rates for precursors of charge 3 or higher yet also yields the lowest number of total identifications because of its slower scan rate (only ~two-thirds as many MS 2 spectra as in HCD acquisition). Longer acquisition time remains a disadvantage of ETD; however, recent work suggests that optimizing ETD acquisition parameters can significantly shorten scan times without loss in coverage (43). The complementarity of the fragmentation modes is further illustrated in Fig. 1 and supported by the observation that most peptides are not identified by at least one acquisition mode (see supplementary Table S2). This is partly because of the different scan rates and the stochastic nature of data-dependent MS 2 precursor selection but, as clearly illustrated for precursor charge 2 (for CID/HCD) and precursor charge 5 (for ETD), this also shows that different peptide fragmentation modes work best for different types of peptides. Of peptides that can be identified by all three acquisition modes, the breaks (observed cleavages along the peptide backbone, supported by either N- or C-terminal fragments) captured by ETD tend to complement those captured by CID and HCD, especially for precursors of charge 3 or higher. This can be seen in statistics from Fig. 2, which show how the union of observed peptide breaks increases by 24-72% from CID/HCD to CID/HCD/ETD for precursors of charge 3 or higher. Supplementary Table S1 details how many breaks are unique to every possible combination of fragmentation modes: ETD alone accounts for 19% of all possible peptide breaks in CID/HCD/ETD triplets, the intersection of breaks seen in CID and HCD accounts for 17%, and the intersection of breaks seen in CID, HCD, and ETD accounts for 30%.

TABLE I (columns: CID, HCD, ETD). Each column shows the number of identified PSMs and the corresponding fraction of identified spectra per precursor charge state. As expected, the identification rate for ETD spectra is higher for precursors with charge states 3 or higher, in contrast with CID/HCD, which tend to perform best for doubly charged precursors. Nevertheless, the currently slower ETD scan rate (an issue that may be resolved soon (43)) still leads to lower numbers of identified spectra with precursor charges 3 and 4 even though its identification rate is consistently higher than those of CID and HCD spectra; this tradeoff seems to disappear at precursor charge states 5 and higher, where ETD consistently identified more spectra and a higher percentage of precursors than either CID or HCD.

FIG. 1. Complementary fragmentation in CID and ETD for peptide TAAANAAAGAAENAFRAP. CID (top) and ETD (bottom) spectra were separately identified against this C-terminal tryptic peptide at 1% FDR. Enough ions were separately detected in each spectrum to identify the peptide (65% of breaks in CID, 53% in ETD). But combining the two yields full coverage of all possible breaks, thus giving higher confidence to breaks observed in both spectra and possibly enabling full-length de novo sequencing. See Fig. 2 and supplementary Table S1 for evidence of CID/ETD complementarity over all identified spectra.

FIG. 2. Ion statistics for alternative peptide fragmentation modes. (Left) Peptide MS 2 ion statistics for alternative fragmentation modes: the percentage of breaks observed by each ion type over all identified MS 2 spectra with precursor charge 2 or 3 for each fragmentation method. z° corresponds to peaks at offset +H from z ions (94). Ions were counted if observed peak masses were within 20 ppm of expected ion masses. The "noise" ion corresponded to offset b+0.5, which was counted to show the level of noise in each type of MS 2 spectra. (Right) Peptide break statistics for combinations of alternative fragmentation modes: peptide breaks were counted for all unique peptides identified by all three fragmentation modes. The six columns show the percentage of breaks detected by each fragmentation mode and combination of fragmentation modes per precursor charge state. In CID and HCD spectra, the presence of breaks was indicated by the presence of b or y ions. For ETD, c, z°, or z°+H ions indicated the presence of a break. Multiply charged ions (up to the spectrum's precursor charge) were also considered in each spectrum. Prior to this analysis, peak filtering was applied to all CID, HCD, and ETD spectra such that each peak was retained only if its intensity was ranked fifth or higher over all neighboring peaks in a ±56 Da radius. If a peptide was identified by more than one CID, HCD, or ETD spectrum, a single representative spectrum was randomly chosen for each fragmentation mode.
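The window-based peak filtering described in the Fig. 2 caption (keep a peak only if it ranks fifth or higher among its neighbors within ±56 Da) can be sketched as follows. This is one plausible reading of that description rather than the code used for the figure: ties are resolved in favor of keeping the peak, and the quadratic loop is chosen for clarity, not speed.

```python
def window_rank_filter(peaks, radius=56.0, max_rank=5):
    """Keep peaks ranked `max_rank` or better by intensity within +/- `radius` Da.

    peaks: list of (mz, intensity) tuples.
    A peak is kept when fewer than `max_rank` peaks in its window are strictly
    more intense (tie handling is an assumption of this sketch).
    """
    kept = []
    for mz, intensity in peaks:
        stronger = sum(
            1
            for other_mz, other_intensity in peaks
            if abs(other_mz - mz) <= radius and other_intensity > intensity
        )
        if stronger < max_rank:
            kept.append((mz, intensity))
    return kept


# Ten crowded peaks with increasing intensity plus one isolated weak peak (toy data).
crowded = [(100.0 + i, float(i)) for i in range(1, 11)]
isolated = [(500.0, 0.5)]
filtered = window_rank_filter(crowded + isolated)
```

Ranking within a local window rather than applying a global intensity cutoff keeps weak but locally prominent fragment peaks, such as the isolated peak in the toy data.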
The connection between the identifications in Table I and the underlying MS 2 peptide fragmentation statistics in Fig. 2 is easily established for MS-GFDB PSM scoring models. As previously described (36,44), MS-GFDB uses the same computational model to learn MS 2 fragmentation statistics for CID, HCD, and ETD and to weight different MS 2 fragment ion types based on their observed propensity in spectra from each fragmentation mode. The predominant CID, HCD, and ETD ion types and their relative propensities are shown in Fig. 2. As expected, the most prominent differences in peptide fragmentation pertain to the contrast between the dominant b/y-ions in CID/HCD spectra and c/z-ions in ETD spectra. These ions are the most important in peptide identification because they directly indicate peptide breaks; Fig. 2 compares the fractions of observed breaks on peptides identified by all three fragmentation modes. Not surprisingly, the per-precursor-charge fractions of observed breaks in CID, HCD, and ETD correlate very highly with the relative identification rates in Table I, as higher fractions of observed breaks almost always result in higher rates of identified spectra.
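The break statistics discussed above reduce to simple set operations once each spectrum has been converted into the set of backbone positions it supports. The sketch below uses the Fig. 1 peptide (18 residues, hence 17 possible breaks); the specific break positions assigned to CID and ETD are invented so that the counts reproduce the reported 65% and 53%, because the actual positions are not listed in the text.

```python
def break_coverage(observed, peptide_length):
    """Fraction of possible backbone breaks observed in each fragmentation mode.

    observed: dict mapping mode name -> set of break indices (1 .. length-1).
    """
    n_breaks = peptide_length - 1
    return {mode: len(breaks) / n_breaks for mode, breaks in observed.items()}


def union_gain(observed, base_modes, added_mode):
    """Relative increase in observed breaks when `added_mode` joins `base_modes`."""
    base = set().union(*(observed[mode] for mode in base_modes))
    if not base:
        raise ValueError("base modes observed no breaks; relative gain is undefined")
    combined = base | observed[added_mode]
    return (len(combined) - len(base)) / len(base)


peptide = "TAAANAAAGAAENAFRAP"
# Hypothetical break positions: 11 breaks for CID, 9 for ETD, together all 17.
observed = {"CID": set(range(1, 12)), "ETD": set(range(9, 18))}
coverage = break_coverage(observed, len(peptide))
gain = union_gain(observed, ["CID"], "ETD")
```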
The complementarity of different peptide fragmentation modes in yielding identifications for different classes of peptides was the underlying principle behind the Decision Tree (DT) acquisition mode (34,33) where MS 2 acquisition modes are selected in real time based on precursor m/z and charge. Decision tree parameters are usually set to maximize the resulting number of peptide identifications and thus implicitly encode instrument-specific tradeoffs between scan rates and the expected rate of success in post-acquisition spectrum identification. Because the latter is highly dependent on the type of mass spectrometry experiment (e.g. shotgun proteomics (34) versus phosphoproteomics (30, 45)) and on the success rates of the software tools chosen for spectrum identification, it is important to note that optimal decision tree parameters may vary between instrument models and methods and depend on the choice of software tools used for peptide identification.
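In code, a decision tree of this kind is little more than a lookup on precursor charge and m/z. The sketch below shows the shape of such a rule set; the cutoff values are placeholders for illustration only and, as noted above, would need to be tuned for each instrument, experiment type, and identification tool.

```python
# Placeholder cutoffs (assumption): for each charge, use ETD when the precursor
# m/z is below the cutoff and CID otherwise. These are not recommended values.
ETD_MZ_CUTOFF = {3: 650.0, 4: 900.0, 5: 950.0}


def select_mode(charge, mz, cutoffs=ETD_MZ_CUTOFF):
    """Choose a fragmentation mode for one precursor from its charge and m/z."""
    if charge <= 2:
        return "CID"  # low-charge precursors fragment well by CID
    if charge in cutoffs:
        return "ETD" if mz < cutoffs[charge] else "CID"
    return "ETD"  # charges above the table: highly charged precursors


choices = [select_mode(z, mz) for z, mz in [(2, 500.0), (3, 600.0), (3, 700.0), (6, 1200.0)]]
```

Keeping the cutoffs in a table rather than in nested conditionals makes it straightforward to re-optimize them against the number of identifications obtained with a given search tool.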
Peptide Identification-The complementarity of multiple peptide fragmentation modes can also be used to improve peptide identification by combining multiple spectra for the same peptides. One of the earliest such approaches was the utilization of MS/MS/MS (MS 3 ) acquisition for de novo peptide sequencing (46), later automated with heuristic (47) and optimal (48) algorithms. Similar applications to de novo peptide sequencing were also introduced at about the same time (49) for paired CID and Electron Capture Dissociation (ECD) spectrum acquisition from the same precursors, also later automated for CID/ECD pairs (50) and CID/ETD pairs (51,52). In all cases, automated de novo sequencing was improved by combining ions in the MS 2 /MS 3 or in the paired CID/ExD spectra using either combined peak intensities (47,49,50,52) or statistical scoring models (48,51). Current de novo applications of CID/HCD/ETD and MS 3 remain heavily dependent on manual interpretation with some assistance from automated methods (53).
Although database search tools developed for CID can be used for HCD and easily adapted to process ETD (by simply examining c/z° ion offsets instead of b/y), they would likely perform much worse than tools that are sensitive to unique features of HCD and ETD. Features of HCD that are not captured by CID scoring models include peaks in the low m/z range (including immonium ions), high fragment mass resolution (most CID spectra have low fragment mass resolution), and the presence of internal ions. Aside from c/z° ions, ETD spectra typically contain charge-reduced precursor peaks with high intensity, characteristic losses from charge-reduced precursors, and additional related ions at offsets ±H from c and z° ions. Thus, CID scoring models should be redesigned for HCD and ETD in order to be most effective. The difficulty of the adaptation depends on the algorithm being considered. For example, MS-GFDB (36) can be automatically retrained for new types of spectra with only 1000 PSMs from unique peptides (per precursor charge state). Most CID database search tools (4, 9, 54-59) have been extended to support identification of ETD MS 2 spectra, including OMSSA (60), ProteinProspector (61), MS-GFDB (36), PeaksDB (62), and Trans-Proteomic Pipeline (63). However, database search using alternate fragmentation modes remains a less explored approach because of two key issues: (1) how to best combine multiple spectra from the same precursor and (2) how to calculate experiment-wide false discovery rates when CID, HCD, and ETD PSM scores may not be directly comparable.
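The "simple" adaptation mentioned at the start of this passage, swapping b/y offsets for c/z° offsets, can be made concrete with a small fragment calculator. This is a sketch using standard monoisotopic masses; it covers only unmodified peptides and the primary ion series, and deliberately ignores the ETD-specific features listed above (charge-reduced precursors, ±H satellites), which is precisely why such a naive adaptation falls short. The 20 ppm default mirrors the tolerance quoted in the Fig. 2 caption.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
PROTON, WATER, AMMONIA, HYDROGEN = 1.007276, 18.010565, 17.026549, 1.007825


def fragment_mz(peptide, mode="CID", charge=1):
    """Map each break index to (N-terminal ion m/z, C-terminal ion m/z).

    mode "CID" (also used for HCD) gives b/y ions; mode "ETD" gives c and
    z-radical ions, i.e. b + NH3 and y - NH3 + H.
    """
    masses = [MONO[residue] for residue in peptide]
    total, prefix, ions = sum(masses), 0.0, {}
    for i in range(1, len(peptide)):
        prefix += masses[i - 1]
        n_term = prefix                  # neutral b-type fragment
        c_term = total - prefix + WATER  # neutral y-type fragment
        if mode == "ETD":
            n_term += AMMONIA
            c_term += HYDROGEN - AMMONIA
        ions[i] = (
            (n_term + charge * PROTON) / charge,
            (c_term + charge * PROTON) / charge,
        )
    return ions


def within_ppm(observed_mz, expected_mz, tol_ppm=20.0):
    """True when the observed peak lies within the ppm tolerance of the expected ion."""
    return abs(observed_mz - expected_mz) / expected_mz * 1e6 <= tol_ppm


cid_ions = fragment_mz("GK", mode="CID")
etd_ions = fragment_mz("GK", mode="ETD")
```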
Estimation of False Discovery Rates-As suggested by the MS 2 fragmentation statistics in Fig. 2, database search tools trained to identify CID spectra will not perform well if given ETD or CID/ETD merged peak lists (45). A more careful combination of CID/ECD MS 2 spectra for de novo sequencing was first described over a decade ago (49,50), and later approaches (36,51) have extended their statistical scoring models to incorporate CID- and ETD-specific ions into merged spectra that are then searched appropriately. In these cases PSM scores are directly comparable and standard FDR calculations (38) suffice to reveal substantial gains in peptide identification. An alternative to the creation of merged spectra is to separately search the multiple spectra from the same precursors and later merge the search results. This approach simplifies the reuse of existing database search tools for additional alternate fragmentation modes but complicates the FDR calculations because, for example, a score threshold of 40 may yield a 1% FDR for CID matches but a 4% FDR for ETD matches. One possible way to address such discrepancies is to derive statistical e-value models for each type of search (64,65) and use the resulting normalized values to combine search results, similarly to approaches devised to combine search results from multiple search tools (66). To avoid the need for score normalization, intersection-based approaches (40) accept an identification for a precursor only when the CID and ETD spectra yield matching identifications, but this is known (36) to lower the number of resulting identifications because a significant match is required in both spectra.
Union-based approaches are often also mentioned, where one imposes a 1% FDR on separate CID and ETD searches and reports the union of results; unfortunately this approach can result in a combined FDR higher than 1% because correct identifications from the separate searches tend to match the same peptides whereas incorrect identifications mostly differ and therefore accumulate. In practice, a simple strategy for 1% FDR estimation with k alternate acquisition modes would be to impose an FDR threshold of (1/k)% on each separate search, thus leading to merged results with accumulated false positives at ≤1% FDR. However, this is known to be a conservative strategy that is likely to be less sensitive than existing methods for combining search results from different search tools (59, 66-68), which can also be adapted for the identification of spectra from experiments using alternate fragmentation modes, including Decision Tree-based MS 2 acquisition protocols (34). Another possible approach to this issue is to combine the search scores prior to FDR calculations. This approach was used in the first extension to multispectrum database search by deriving combined Mascot/probability scores for MS 2 /MS 3 spectrum pairs (69). Such approaches (36,70) avoid the intersection/union difficulties by using combined scores to facilitate FDR calculations and allow identifications where one fragmentation mode results in a good spectrum even if the other mode results in a poor spectrum. It is expected that combining search engines for CID, HCD, and/or ETD also improves results, but it is important to combine these approaches with appropriate estimation of FDR (such as iProphet (68)).
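The arithmetic behind the union problem is worth spelling out. The sketch below computes a worst-case FDR for the union of separately thresholded searches under two stated assumptions: false identifications from different searches never coincide, and the number of distinct correct identifications in the union is supplied by the caller. All counts in the example are invented for illustration.

```python
def union_fdr(accepted_per_mode, per_mode_fdr, n_distinct_correct):
    """Worst-case FDR of the union of separately filtered search results.

    accepted_per_mode: number of identifications accepted in each search.
    per_mode_fdr: FDR threshold applied to every individual search.
    n_distinct_correct: distinct correct identifications in the union.
    Assumes false identifications never overlap between searches.
    """
    false_hits = sum(n * per_mode_fdr for n in accepted_per_mode)
    return false_hits / (n_distinct_correct + false_hits)


# Two modes, 1000 accepted identifications each, correct ones fully overlapping.
naive = union_fdr([1000, 1000], per_mode_fdr=0.01, n_distinct_correct=990)
# Same setting with the per-search threshold divided by k = 2.
corrected = union_fdr([1000, 1000], per_mode_fdr=0.005, n_distinct_correct=995)
```

In this toy setting the naive union roughly doubles the FDR (20 false among 1010 identifications), while the (1/k)% rule brings it back just under 1%. The rule is conservative because real correct identifications only partially overlap, so the true union FDR is usually lower than this bound.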
Post-translational Modifications-Identification of post-translational modifications stands to benefit substantially from MS 2 analysis with alternate fragmentation modes, especially those involving electron-based dissociation modes (71,72). One of the earlier such approaches (73) is the still-popular phosphoproteomics protocol (74) in which MS 3 acquisition is triggered by the dominant loss of phosphate from precursor ions observed in MS 2 spectra. Alternating MS 2 modes further improve identification of phosphorylated peptides (35,45,74) and, in addition, enable otherwise challenging experiments such as co-identification of glycans and peptides from glycosylated peptides using ETD (75) or alternating CID/ETD acquisition (37,76). The complementarity of CID and ETD fragmentation modes is especially useful for PTMs such as glycosylation because CID leads to preferential fragmentation of the more labile glycosidic bonds and generally poor peptide fragmentation, thus facilitating glycan identification but complicating peptide identification. Conversely, ETD fragmentation of glycosylated peptides tends to result in series of c- and z-ions much like with unmodified peptides and thus facilitates peptide identification and localization of the site of glycosylation.
Accurate localization of PTM sites is also an important area that stands to gain from alternate (30,32,74) and alternative (77) peptide fragmentation modes. As recently reviewed by Chalkley and Clauser (78), the problem of PTM site localization was first addressed with the AScore approach (79), which assigns a probabilistic score to a site assignment based on the number of observed ions distinguishing the top-scoring site assignment from the runner-up site assignment. More recent approaches have slightly adapted this concept to assign site assignment scores based on the difference of database search scores between the top and runner-up peptide-spectrum matches to each spectrum from a modified peptide. In all cases, the key factor determining the significance of site assignments is the presence or absence of MS 2 ions in between the possible sites of post-translational modification. Because it is clear from Fig. 2 that alternate fragmentation modes tend to increase observation of peptide breaks, it is expected that the use of such modes will result in different numbers and quality of site assignments (30,80), even though strategies for estimation of false-positive site assignments (False Localization Rates (78)) are still in their infancy. In addition to improving PTM site localization by increasing the numbers of observed b/y-ions, one especially interesting feature of HCD MS 2 acquisition is its generation of x-ions that are very specific indicators of phosphorylation sites (81). These ions are hypothesized to arise because phosphoric acid is a much better leaving group than water on serines and threonines; they were observed to be precise indicators of phosphorylation sites, though they were found in only 33% of all phosphorylated peptides.
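The earlier observation in this section, that site assignments hinge on ions falling between the candidate sites, can be expressed in a few lines. The sketch below is not an implementation of AScore or of any published scoring scheme; it only enumerates the backbone breaks that can discriminate two candidate sites and intersects them with the breaks actually observed, using invented positions.

```python
def site_determining_breaks(site_a, site_b):
    """Breaks that separate two candidate modification sites.

    Residues are numbered from 1; break i lies between residues i and i + 1.
    """
    low, high = sorted((site_a, site_b))
    return set(range(low, high))


def localization_evidence(site_a, site_b, observed_breaks):
    """Observed breaks that can distinguish between the two candidate sites."""
    return site_determining_breaks(site_a, site_b) & set(observed_breaks)


# Hypothetical peptide with candidate sites at residues 3 and 6.
cid_breaks = {1, 2, 8}
etd_breaks = {4, 5, 9}
cid_only = localization_evidence(3, 6, cid_breaks)
cid_plus_etd = localization_evidence(3, 6, cid_breaks | etd_breaks)
```

In this toy case the CID spectrum alone offers no site-determining evidence, whereas the combined CID/ETD breaks supply two discriminating positions, which is the mechanism by which alternate fragmentation is expected to change the number and quality of site assignments.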
Conclusions and Outlook-The substantial advantages of peptide identification with alternate fragmentation modes have been clearly demonstrated in a variety of contexts (31, 36, 37, 49, 51, 82-84) but their widespread adoption remains limited by two significant hurdles: (1) the scan rate tradeoff between increasing the chances of identifying each peptide versus just acquiring spectra for more distinct peptides and (2) the evolving but limited support of peptide identification tools for taking advantage of alternate fragmentation modes.
The scan rate tradeoff is a challenge that will continue to be addressed with technological developments such as those that brought us the current generation of mass spectrometry instruments, some of which are already able to generate tens of thousands of MS 2 spectra per hour (85). Nevertheless, alternate fragmentation has already been shown to be useful in key areas of potential therapeutic relevance (86-89) such as peptidomics (31) and the analysis of post-translational modifications, including glycosylation (37), phosphorylation identification and site localization (30,84), and histone modifications (83).
The challenge of limited software support is well on its way to resolving itself as more and more peptide identification tools (36, 41, 51, 61-63) add support for alternative MS 2 fragmentation modes. Still, too few tools support integrated analysis of multiple fragmentation modes (36). Although the concept of combining fragmentation modes is over a decade old (49), most existing scoring functions for peptide identification process CID, HCD, and/or ETD spectra individually. But possibly the biggest hurdle in the development of new software tools is the recurring limited public availability of mass spectrometry data. For example, only five out of over 120 published papers on CID/ETD have deposited their raw data (34,36,45,90,91) on Tranche/ProteomeCommons (92) or PeptideAtlas (93). This is a limitation that disproportionately affects the most novel and least common types of mass spectrometry experiments. We hope that the growing trend of making raw data publicly available will catch up with the dominant public availability of software tools and continue the self-reinforcing cycle toward better and more robust peptide identification strategies.