Inference and Validation of Protein Identifications

Discovery or shotgun proteomics has emerged as the most powerful technique to comprehensively map out a proteome. Reconstruction of protein identities from the raw mass spectrometric data constitutes a cornerstone of any shotgun proteomics workflow. The inherent uncertainty of mass spectrometric data and the complexity of a proteome render protein inference and the statistical validation of protein identifications nontrivial tasks that remain subjects of ongoing research. This review surveys the conceptual approaches to inferring and statistically validating protein identifications and discusses their implications for the scope of proteome exploration.

Protein Inference in Shotgun Proteomics-The shotgun proteomics approach enables biologists to identify thousands of proteins in mass spectrometric measurements of a single sample. This approach borrows from its namesake, the genome shotgun sequencing approach that reconstructs whole genomes from sequencing random DNA fragments (1). The shotgun proteomics approach operates at the level of protein fragments, i.e. peptides, to reconstruct the ensemble of proteins present in a biological sample (2). Both approaches implement a divide-and-conquer strategy commonly encountered in computer science, i.e. to solve a difficult task by breaking it down into many related easy tasks (3). The reconstruction of the difficult task's solution from those of the easy tasks is typically nontrivial. The convenient physico-chemical properties of peptides render the acquisition of informative data about short protein fragments an "easy" task. The destructive nature of the shotgun proteomics approach, however, shifts the challenge to the computational reconstruction of protein identities from these data.
Shotgun proteomics workflows comprise three main steps. First, proteins are biochemically extracted from a biological sample and then enzymatically digested to yield a complex ensemble of peptides. Protein and/or peptide ensembles are optionally further fractionated according to physical/chemical properties. Second, tandem mass spectrometry is used to sample and identify individual peptide species present in the resulting ensembles and to finally recover the set of proteins initially present in the biological sample. Mass spectrometric analysis of complex protein or peptide mixtures comprises a two-step scanning procedure that first registers the m/z ratios of all peptide species of a mixture, then selects, isolates and fragments one of these species and records the resulting fragment ion spectrum (4-6). Third, peptide fragment ion spectra define the data to perform inference, i.e. to infer the proteins initially present in the biological sample. Inference traditionally involves two steps, peptide spectrum matching and protein inference (7). Peptide spectrum matching refers to assigning each fragment ion spectrum a peptide sequence that best explains its signals. Protein inference reconstructs the protein composition from the peptide spectrum matches obtained in the first step. Recent, less widely used approaches blur the two-step setup, by either reconstructing proteins directly from the mass spectrometric data without generating peptide spectrum matches or by simultaneously matching peptides to spectra and inferring protein identities (8).
Peptide spectrum matching is a task that admits a fragment ion spectrum as input and that consists of finding the peptide sequence best matching to the input according to a suitable objective function (score) (9). The objective function encodes our understanding of the relation between a peptide and its fragment ion spectrum and is supposed to discriminate the peptide that gave rise to the input spectrum from all other peptides. It is nontrivial to find a good objective function because the fragmentation of peptides is only partially understood (10) and, furthermore, fragment ion spectra generated from complex peptide mixtures are noisy, i.e. the fragment signals are subject to statistical fluctuation (11) and convoluted with signals from moieties other than the enriched target peptide (12). Some work recently adopted objective functions that additionally account for peptide detectability. These extensions are based on expectations to observe a specific peptide in the biological sample considering prior knowledge about protein abundance distributions and peptide ionization properties (13,14). Most of the peptide spectrum matching approaches independently process each fragment ion spectrum. In a first step, a set of suitable candidate peptides is generated de novo (15)(16)(17)(18) or from a sequence database (9,19). Each candidate is scored against the fragment ion spectrum. The top scoring candidate peptide in conjunction with the fragment ion spectrum is reported as peptide spectrum match. Peptide spectrum matching has been extensively studied and reviewed in the past. For a more comprehensive overview please refer to e.g. (20).

SOLUTIONS
Protein Inference Approaches-Protein inference constitutes the second step after peptide spectrum matching and, in simple terms, typically takes the peptide spectrum matches as input and compiles a set of protein identifications that best represent the identified peptides. The protein inference task is specific to the shotgun proteomics setup (7). Enzymatic digestion of the proteins into peptides facilitates sample handling and dramatically enhances throughput. These benefits come at the cost of losing the information about which proteins gave rise to which of the identified peptides. For complex proteomes or mixtures of proteomes originating from various organisms (e.g. infectious diseases, microbial communities) peptide spectrum matches can map ambiguously to several protein entries, e.g. protein splice variants or highly conserved sequence stretches in orthologous proteins. Protein inference approaches aim to disambiguate these matches and have been implemented in various ways.
Different data input types and analysis procedures have been proposed for protein inference. Many approaches start off from a static list of peptide spectrum matches obtained from a database search (21)(22)(23)(24)(25)(26). Probabilistic approaches revisit the peptide spectrum matches and rescore these based on presence or absence of sibling matches pointing to the same protein (27)(28)(29)(30). Other approaches perform inference in a single step by jointly fitting a probabilistic model to establish peptide spectrum matches and protein identifications at the same time (8). To benefit from multiple database search engines, a recently proposed method performs protein inference from a list of nonredundant peptides (31). Spectral alignment approaches take a special position and start off from the raw mass spectrometrical data and de novo assemble (partial) protein sequences by aligning fragment ion spectra of overlapping peptides without resorting to sequence databases (32).
The main challenge in protein inference consists of dealing with peptide spectrum matches ambiguously mapping to several protein entries in the protein database. Each approach addresses this issue by defining different notions of a protein identification. A first class of protein inference approaches maps peptide spectrum matches back to a set of ambiguous protein entries that are either defined by a priori grouping protein isoforms or reporting one representative variant for each set of isoforms (21)(22)(23)(24)(25). This a priori grouping effectively disambiguates the protein database and therefore allows for unambiguously mapping peptide spectrum matches to the respective groups. This approach circumvents possible ambiguities related to isoform discrimination at the cost of not resolving these ambiguities even in case of sufficiently informative data. A second class of protein inference approaches defines protein groups a posteriori, i.e. groups that take into account the acquired spectral data. Specifically, each peptide identification is associated with its supported group of protein entries. The goal of these approaches is to summarize this list into a parsimonious, i.e. minimal list of protein groups that explains all peptide identifications (7). Probabilistic approaches assign each peptide identification to a protein entry (or group of indistinguishable proteins) with highest posterior probability (27,33,34). On the basis of predicted peptide detectabilities (35), Alves et al. have augmented this approach by scoring protein identifications with respect to expected though unobserved peptides (34,36). Other approaches formulate the parsimony constraint as a set cover problem (37,38), or as bipartite graph analysis (39).
These approaches represent each protein as the set of peptides that it can give rise to in a shotgun proteomics experiment and then seek to find a minimal list of proteins whose peptide sets comprise (cover) all peptides supported by the spectral data. A recent approach furthermore defines protein groups with richer hierarchical structure to better guide the user in disambiguating degenerate protein identifications (37). Given sufficiently discriminative data, this class of approaches is able to resolve apparent ambiguities related to proteins with shared peptide identifications. In addition to the application of one of the above protein inference approaches, it is common practice to exclude possibly unreliable protein identifications, such as single hit protein identifications. There has been considerable debate about whether such post-processing enhances protein inference (40,41). The approaches discussed above might miss protein identifications that are falsely discarded by the a priori grouping scheme or the parsimony constraint. Instead of disambiguating ambiguous peptide identifications, Farrah et al. report all proteins consistent with the spectral data (42). To be able to make statements about the occurrence of proteins in the biological sample, the authors of this study introduced the CEDAR scheme for protein identifications. This scheme defines a hierarchy of five protein identification types that are characterized by the ambiguity of their supporting peptide identifications. This approach allows the user to exploit a shotgun proteomic dataset while explicitly accounting for all protein identification ambiguities.
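To illustrate the set cover formulation of the parsimony constraint, the following minimal sketch applies the standard greedy set cover heuristic to a toy mapping from proteins to observed peptides. The function name `greedy_parsimony`, the protein labels and the peptide labels are hypothetical and chosen for illustration only; the greedy heuristic is a generic approximation and is not claimed to be the algorithm used by any of the cited tools, nor is it guaranteed to return a truly minimal list.

```python
from typing import Dict, List, Set


def greedy_parsimony(protein_to_peptides: Dict[str, Set[str]]) -> List[str]:
    """Greedily pick proteins until every observed peptide is covered."""
    # All peptides supported by the spectral data (union over all proteins).
    uncovered: Set[str] = set().union(*protein_to_peptides.values())
    selected: List[str] = []
    while uncovered:
        # Choose the protein explaining the most still-unexplained peptides;
        # iterating in sorted order breaks ties alphabetically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & uncovered),
        )
        selected.append(best)
        uncovered -= protein_to_peptides[best]
    return selected


# Toy example (hypothetical data): P2 is subsumed by P1, P4 by P3.
example = {
    "P1": {"pepA", "pepB", "pepC"},
    "P2": {"pepB", "pepC"},
    "P3": {"pepC", "pepD"},
    "P4": {"pepD"},
}
print(greedy_parsimony(example))  # ['P1', 'P3']
```

In this toy case the peptide pepD alone cannot distinguish P3 from P4; the sketch reports P3 only because of its alphabetical tie-breaking, which is exactly the kind of ambiguity that grouping schemes are meant to expose to the user.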
For the experimentalist it is difficult to choose an appropriate protein inference approach for a given application, given the many available protein inference variants. Although the criteria for this decision generally depend on the specific application scenario, a typical goal is to maximize the number of true protein identifications while keeping the number of spurious protein identifications low. Many of the developments discussed above aim at and provide empirical support for improving on this goal. However, general conclusions on protein inference performance are difficult due to the plethora of application scenarios. Ideally, the choice of a protein inference approach is guided by an application specific benchmark of a set of competing approaches with respect to their ability to achieve the designated goal (43). The following sections will address this issue by reviewing methods to count spurious protein identifications, factors influencing this count, and concluding remarks on how to report protein identifications in the light of these findings.

VALIDATION
False Discovery Rates for Protein Identifications-Protein identifications are not perfect. This observation is mainly related to the occurrence of spurious peptide spectrum matches. False positive peptide spectrum matches arise when the top-scoring candidate is not the source of the respective fragment ion spectrum. These events can mostly be attributed to flaws in the score related to the approximate encoding of the peptide fragmentation process and the lack of information in the fragment ion spectrum, e.g. in terms of missing fragment ions.
It is important to control the quality of peptide spectrum matches for both the compilation of identified peptides and their inferred proteins. Various statistical approaches have been devised to control different measures of peptide spectrum match uncertainty, the false discovery rate being the most useful one because it accounts for multiple testing (44,45). In the context of peptide spectrum matching, the false discovery rate corresponds to the expected fraction of false positive matches. Three routes can be pursued to estimate the false discovery rate for a set of peptide spectrum matches. First, the false discovery rate can be derived from p values associated with each peptide spectrum match that is considered significant (44,45). E-value calibration methods for score normalization allow us to apply this approach to data sets that have been analyzed with multiple search engines (46). This approach to false discovery rate estimation is valid as long as p values can be accurately computed (47). This requirement, however, is rarely met (48). Second, the false discovery rate can be estimated from the score distributions of true and false positive peptide spectrum matches (49). This mixture distribution has to be learned in an unsupervised scenario because it is not known for any match whether it is a true or a false positive. This task has been successfully implemented in e.g. PeptideProphet (49) by resorting to Expectation Maximization (50). Third, the target-decoy strategy has recently become very popular for estimating the peptide spectrum match false discovery rate (51). A decoy database with nonsense protein sequences is searched in addition to the (target) protein database of the studied organism. The number of peptide spectrum matches mapping to the decoy database serves as an estimate of the number of false positive matches.
If the decoy database is designed similar to the target database, then we expect the false positive matches to uniformly distribute across the target and decoy database. Elias et al. have shown that reversed, pseudo-reversed as well as scrambled databases serve equally well as decoy databases, particularly ensuring uniform distribution of false positive matches (52). Its simplicity and generic applicability make the target-decoy strategy an appealing alternative to estimate false discovery rate of peptide spectrum matches.
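The target-decoy estimate for peptide spectrum matches can be sketched in a few lines. The sketch below assumes that every match carries a score (higher is better) and a flag saying whether it maps to the decoy database, and that target and decoy databases are of equal size so that the decoy count directly estimates the number of false positive target matches. The function name `target_decoy_fdr` and the example scores are hypothetical; other conventions for the estimator exist, for instance for concatenated searches, and are not covered here.

```python
from typing import List, Tuple


def target_decoy_fdr(psms: List[Tuple[float, bool]], threshold: float) -> float:
    """Estimate the PSM false discovery rate at a score threshold.

    Each PSM is a (score, is_decoy) pair. The number of accepted decoy
    matches is used as the estimate of false positive target matches.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing to estimate
    return min(1.0, decoys / targets)


# Hypothetical scores: (score, maps_to_decoy_database)
psms = [
    (9.1, False), (8.7, False), (8.2, True), (7.9, False),
    (7.5, False), (6.0, True), (5.5, False),
]
print(target_decoy_fdr(psms, threshold=7.0))  # 1 decoy / 4 targets = 0.25
```

Lowering the threshold accepts more matches but typically also more decoys, which is how a score cutoff is tuned to a desired false discovery rate.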
Typically, protein identifications, instead of peptide spectrum matches, are the biologically relevant outcome of a shotgun proteomics study. Therefore it is highly desirable to control the quality of a shotgun proteomics study at the level of protein identifications. Statistical validation of protein identifications has long falsely been equated with statistical validation of peptide spectrum matches (Fig. 1). It turns out, however, that errors at the level of peptide spectrum matches propagate in a nontrivial fashion to the level of protein identifications (53). Therefore, the estimation of false discovery rates for protein identifications requires appropriate approaches differing from those for validation of peptide spectrum matches and is still a topic of ongoing research.
Several attempts have been made to control protein identification error rates. Many approaches estimate probabilities for a protein identification to be wrong from the respective probabilities of its constituting peptide spectrum matches (27,28,30,54). It turns out, however, that this kind of estimate is sensitive to the accuracy of the probability estimates for the individual peptide spectrum matches. Because these estimates are particularly difficult for peptide spectrum matches giving rise to single hit wonders in large data sets, these approaches do not scale well with data set size (53). Another approach estimates the number of incorrect protein identifications assuming that false positive peptide spectrum matches distribute according to a Poisson distribution across the protein database (25,29). Depending on the choice of different assumptions for single hit protein identifications, this strategy gives either more or less optimistic estimates for protein error rates. Naive target-decoy approaches estimate protein identification false discovery rates as described for peptide spectrum matches, i.e. by estimating the number of false positive protein identifications with the number of decoy identifications (26,40,54,55). It turns out that the number of decoy protein identifications is an estimate for "mixed" protein identifications, i.e. identifications that are both supported by correct as well as incorrect peptide spectrum matches. Because a single correct supporting peptide spectrum match renders a protein identification true, the number of "mixed" protein identifications cannot generally be equated with the number of false positive protein identifications. In fact, the number of false protein identifications is likely to be smaller than the number of "mixed" protein identifications. Consequently, naive target-decoy approaches turn out to yield overly pessimistic error rate estimates (53).
The Mayu approach adapts the target-decoy strategy to the protein inference task by means of a hypergeometric model that also accounts for the occurrence of "mixed" protein identifications.
The hypergeometric model formalizes and takes advantage of the observation that the statistics of the number of "mixed" protein identifications is analogous to a draw from an urn with two types of balls (e.g. black and white). In this analogy the first type of balls represents the protein entries for which there is correct support and the other type represents all other entries of the underlying protein database. Mayu has been shown to achieve accurate, independently validated protein identification false discovery rates for a range of diverse datasets differing in size, underlying proteome and experimental setting (53) and has been added as an additional feature to PeptideAtlas (42).
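The urn analogy can be made concrete with a deliberately simplified, expectation-only sketch. It is not the Mayu implementation. It assumes that the decoy protein count estimates the number of target entries hit by at least one incorrect peptide spectrum match, and that these entries are drawn uniformly without replacement from the database. The mean of the corresponding hypergeometric draw then gives the expected number of such entries that lack any correct support. All names (`expected_false_protein_ids`, `n_entries`, `n_target_ids`, `n_decoy_ids`) and the example numbers are hypothetical.

```python
def expected_false_protein_ids(n_entries: int, n_target_ids: int, n_decoy_ids: int) -> float:
    """Expected number of false positive protein identifications (urn sketch).

    With N entries, h target identifications and cf entries hit by incorrect
    PSMs (estimated by the decoy count), the hypergeometric mean gives
        fp = cf * (N - tp) / N,   where tp = h - fp.
    Solving this for fp yields fp = cf * (N - h) / (N - cf).
    """
    if not (0 <= n_target_ids <= n_entries and 0 <= n_decoy_ids < n_entries):
        raise ValueError("counts must be consistent with the database size")
    fp = n_decoy_ids * (n_entries - n_target_ids) / (n_entries - n_decoy_ids)
    return min(fp, float(n_target_ids))


def urn_protein_fdr(n_entries: int, n_target_ids: int, n_decoy_ids: int) -> float:
    """Protein FDR implied by the expectation above (0.0 if nothing identified)."""
    if n_target_ids == 0:
        return 0.0
    return expected_false_protein_ids(n_entries, n_target_ids, n_decoy_ids) / n_target_ids


# Hypothetical numbers: 10,000 entries, 2,000 target and 100 decoy identifications.
naive = 100 / 2000
corrected = urn_protein_fdr(10_000, 2_000, 100)
print(naive, corrected)  # the corrected estimate is below the naive one
```

Under these assumptions the corrected estimate falls below the naive decoy ratio whenever more target than decoy proteins are identified, in line with the observation that naive target-decoy estimates are too pessimistic.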
Current approaches to statistical validation of protein identifications assume wrong peptide spectrum matches as the single source of erroneous protein identifications. This assumption does not hold true in the context of complex proteomes featuring protein entries with overlapping sequences as for instance protein isoforms or splice variants. Protein inference approaches that assign ambiguous peptide spectrum matches to a single protein might suffer from events where correct peptide spectrum matches are associated to an incorrect protein identity. These events constitute an additional source of errors in the course of protein inference. To the best of our knowledge, there is still no published method to estimate the frequency of these subtle errors, thereby constituting a relevant and interesting target for future research. In the light of emerging targeted proteomics approaches like selected reaction monitoring (56) it is furthermore conceivable that reliable disambiguation of protein identities will be tackled by specifically providing additional informative experimental data.

PROTEIN INFERENCE IN PRACTICE
Data Set and Database Size Matter-The size of the database used for peptide spectrum matching and protein inference influences protein identification false discovery rates. At the level of peptide spectrum matching and for invariant filter criteria, larger protein databases contribute more confounding peptide sequences that lead to a larger number of false positive peptide spectrum matches. More stringent filter criteria are required to counteract this trend and to achieve an acceptable confidence level. More stringent filter criteria, however, come at the cost of increased false negative rates, i.e. an increased number of correct peptide spectrum matches achieving below threshold scores. Besides this effect, the size of protein databases additionally affects protein inference performance by another mechanism. This phenomenon can be seen by considering the behavior of an incorrect peptide spectrum match, randomly mapping to some entry of the protein database. The more entries the database comprises, the more likely the incorrect peptide spectrum match will map to a new, so far unsupported protein entry and thereby give rise to a false positive protein identification (Fig. 2). Taken together, these trends strongly advocate small protein databases that in particular exclude exceedingly rare protein entries.

FIG. 1. Overview of data analysis tasks in shotgun proteomics.
The inference tasks consist of assigning peptide sequences to fragment ion spectra (peptide spectrum matching) and assembly of peptide spectrum matches to protein identities (protein inference). The validation tasks consist of estimating confidence measures like false discovery rate (FDR) to the set of peptide spectrum matches and, separately, to the set of protein identifications. Solution of these tasks requires different task specific methods. Particularly, FDR estimation procedures for protein identifications differ from those for peptide spectrum matches.

Successful deep sequencing projects for various model organisms have achieved substantial proteome coverage by resorting to well curated protein databases featuring low redundancy (22)(23)(24)(57). These studies cover around 50% of the respective sequence databases, indicating a reasonable tradeoff between constraining the size of the protein database while retaining sufficient diversity for comprehensive discovery. These considerations are more intricate in proteogenomic projects that aim at genome annotation and discovery of novel gene models from shotgun proteomic data (58,59). The nature of these projects entails the use of large sequence databases that account for all possible protein coding regions of a genome. Proteogenomic studies for various model organisms resorted to six frame translated genomic databases and expressed sequence tag (EST) databases to achieve this goal (60-64). The number of peptides in such databases is in the order of billions and further grows by two orders of magnitude if single amino acid mutations are considered, too (58). Several strategies have been pursued to faithfully compress these databases. A simple heuristic consists of only considering open reading frames of at least average exon length.
More sophisticated lossless compression approaches involve the use of exon database graphs (65) and de Bruijn graph representations of EST databases (66). Two pass database search approaches combine the benefits of achieving low error rates (and computational efficiency) and comprehensive discovery by first confidently identifying data supported genomic regions and secondly mapping the fragment ion spectra to a subdatabase that comprises an enumeration of ab initio predicted gene models for this subset of regions (67,68). The applicability of EST databases for protein inference is further complicated because a single gene product can map to several sequence tags and the mapping in general is nontrivial (68,69). The choice of databases and compression strategies can be guided by a benchmark with respect to a useful optimality criterion, e.g. the number of protein identifications or gene model discoveries at a user defined protein false discovery rate (43).

FIG. 2. Error relation between peptide spectrum matches and protein identifications.
Impact of dataset and database size on discrepancy between false discovery rate of peptide spectrum matches and protein identifications. Protein database entries are represented as colored circles. True/false peptide spectrum matches (PSM) are depicted as green/red discs. True protein identifications (PID) are supported by at least one correct peptide spectrum match and tagged with a checkmark. The larger the data set or database size, the more pronounced the discrepancy of false discovery rates (FDR) at the level of peptide spectrum matches (PSM) and protein identifications (PID). For large datasets the apparent proteome coverage can deviate significantly from the coverage of true positive (tP) protein identifications.

The abbreviation used is: EST, expressed sequence tag.

Data-set size has an important influence on protein identification false discovery rates. This influence is related to the different behavior of true and false positive peptide spectrum matches. Typically, only a small fraction of proteins represented in the protein database are actually present, or at least present at a level that is within the dynamic range of the mass spectrometer, in the studied biological sample. Therefore, true peptide spectrum matches start to redundantly map to the same protein entries with growing dataset size. The rate of true new protein discoveries slows down with data-set size. False positive peptide spectrum matches do not feature this redundant behavior (or at least to a significantly lower magnitude) and thereby contribute to a constant rate of false new protein discoveries over a wide range of dataset sizes. These observations lead to the trend of protein false discovery rates growing with data-set size while keeping the false discovery rate for peptide spectrum matches fixed (Fig. 2). For large data sets acquired to map out complete proteomes, twenty-fold differences between these two types of false discovery rates have been observed (53). Because of this significant effect it is advisable to control the quality of a larger shotgun proteomics experiment at the level of protein identifications.
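The qualitative trend described here can be reproduced with a small back-of-the-envelope model. The sketch below is an idealization under stated assumptions, not a method from the cited literature: correct peptide spectrum matches map uniformly to the proteins actually present, incorrect matches map uniformly to all database entries, and an entry counts as a true identification as soon as it receives one correct match. The function name `expected_protein_fdr` and all parameter values are hypothetical, and the result is a ratio of expectations rather than an exact expected false discovery rate.

```python
def expected_protein_fdr(n_psms: float, psm_fdr: float, n_entries: int, n_present: int) -> float:
    """Approximate protein-level FDR for a fixed PSM-level FDR (toy model)."""
    n_true = n_psms * (1.0 - psm_fdr)   # correct PSMs
    n_false = n_psms * psm_fdr          # incorrect PSMs
    # Probability that a present protein receives at least one correct PSM.
    p_true_hit = 1.0 - (1.0 - 1.0 / n_present) ** n_true
    # Probability that any database entry receives at least one incorrect PSM.
    p_false_hit = 1.0 - (1.0 - 1.0 / n_entries) ** n_false
    true_ids = n_present * p_true_hit
    # False identifications: entries hit only by incorrect PSMs.
    false_ids = (
        (n_entries - n_present) * p_false_hit
        + n_present * (1.0 - p_true_hit) * p_false_hit
    )
    total = true_ids + false_ids
    return false_ids / total if total > 0 else 0.0


# Hypothetical setting: 20,000 entries, 4,000 proteins present, PSM FDR of 1%.
for n in (1e2, 1e4, 1e5, 1e6):
    print(int(n), round(expected_protein_fdr(n, 0.01, 20_000, 4_000), 4))
```

In this toy setting the protein-level estimate starts close to the peptide spectrum match false discovery rate for very small data sets and grows steadily as true identifications saturate, and it grows further when the number of database entries is increased at fixed data-set size.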
In the context of large shotgun proteomics projects aiming at extensive proteome coverage it is desirable to minimize the number of experiments not only to save resources but also to keep dataset size small and thereby enhance protein inference. Experiment design aims at minimizing the data-set size by identifying experiments that are expected to produce the most informative data, i.e. to most effectively explore a proteome. Four routes have been pursued to suggest informative experiments: (1) A priori simulations of shotgun proteomics experiments have been carried out to benchmark various fractionation schemes. Shotgun proteomics experiments have been modeled as consecutive fractionation steps at the protein and peptide level that uniformly distribute species into fractions with random omissions to account for sample losses along the course of the experiment. These simulations suggested that separation at the protein level results in more significant gains in proteome coverage than fractionation at the peptide level (70). (2) Directed mass spectrometry approaches exploit a small number of initial shotgun proteomics experiments to, first, identify informative MS1 precursor signals and, second, perform targeted experiments that specifically generate fragment ion spectra for the selected precursors (71)(72)(73). (3) A posteriori analysis of protein identification statistics has been exploited to design experiments that specifically enrich for underrepresented identification types, as for instance for short and basic proteins in the context of a Drosophila sequencing project (22). (4) Finally, proteome coverage prediction approaches lend themselves to determining which experiments to carry out, and how many times, in a multidimensional shotgun proteomics scenario to optimally improve proteome coverage (74,75). Application of these methods to design shotgun proteomics studies renders them more efficient and, as delineated above, also more informative and reliable.

GUIDELINES
Reporting Protein Identifications-Shotgun proteomics projects typically aim at comprehensively and precisely reconstructing the protein composition of the studied biological sample. Ideally, the list of reported protein identifications should be free of spurious identifications and exactly reflect the sample proteins. This goal is probably not achievable. Fixing the protein false discovery rate at a reasonably low level (e.g. 1%) and asking for the maximal number of protein identifications constitutes a reasonable alternative goal.
There has been substantial debate on guidelines for reporting protein identifications. Rigid guidelines like the general exclusion of single hit wonders are recurring suggestions in this context. These rigid guidelines predominantly aim at ensuring high quality of the reported identifications and at avoiding the inflation of identification lists with erroneous entries. However, these suggestions neglect the second part of the delineated aim, i.e. the aim of maximizing the number of identifications at a desired quality. In fact, recent studies provide evidence that retaining single hit wonders is advantageous because these still comprise many correct identifications (40,43,53). Besides these results on the specific rule of single hit wonder exclusion, focusing on error avoidance by means of rigid guidelines is generally prone to missing out on sophisticated protein inference approaches that internally deal with e.g. unreliable single hits and yet recover more protein identifications at the same quality, i.e. protein false discovery rate. These conceptual considerations motivate guidelines that simply require reporting the protein false discovery rate of a protein identification list and thereby leave the choice of protein inference approach to the experimentalist.

CONCLUSION
Protein inference is a task arising in shotgun proteomics that aims at mapping back peptide spectrum matches to entries in the underlying protein database. Because of its conceptual simplicity, breadth and depth, shotgun proteomics is likely to keep on playing a pivotal role in exploratory stages of proteomics projects. Protein inference will therefore keep proteomics researchers busy for a while, either as consumers or developers that tackle some of the still open and intricate validation issues.
It will furthermore be interesting to see how similar tasks will arise in newly emerging peptide centric mass spectrometry based proteomics technologies and to what extent we will be able to transfer the lessons learned in the shotgun proteomics scenario.