Peptide Identification by Database Search of Mixture Tandem Mass Spectra


In high-throughput proteomics the development of computational methods and the development of novel experimental strategies often rely on each other. In certain areas, mass spectrometry methods for data acquisition are ahead of the computational methods needed to interpret the resulting tandem mass spectra. In particular, although there are numerous situations in which a mixture tandem mass spectrum can contain fragment ions from two or more peptides, nearly all database search tools still assume that each tandem mass spectrum comes from one peptide. Common examples include mixture spectra from co-eluting peptides in complex samples, spectra generated from data-independent acquisition methods, and spectra from peptides with complex post-translational modifications. We propose a new database search tool (MixDB) that is able to identify mixture tandem mass spectra from more than one peptide. We show that peptides can be reliably identified with up to 95% accuracy from mixture spectra while considering only 0.01% of all possible peptide pairs (a speedup of four orders of magnitude). Comparison with current database search methods indicates that our approach has better or comparable sensitivity and precision at identifying single-peptide spectra while simultaneously being able to identify 38% more peptides from mixture spectra at significantly higher precision. Molecular & Cellular Proteomics 10: 10.1074/mcp.M111.010017, 1-11, 2011.
Over the past several years there have been substantial advances in the sensitivity of protein identification approaches thanks to technological developments in chromatography (1) and tandem mass spectrometry (2,3). In shotgun proteomics, researchers can routinely identify thousands of proteins from complex biological samples in a single experiment (4-6). Despite this rapid progress, challenging issues remain unsolved (7,8). One such challenge is that in any high-throughput tandem MS (MS/MS) experiment only a fraction of MS/MS spectra can be identified by current computational methods. Although many factors contribute to this low spectrum identification rate, recent studies suggest that one reason is the occurrence of co-eluting peptides. Several analyses showed that in any complex mixture (e.g. whole-cell lysate), peptides with similar masses and chromatographic properties occur quite frequently and give rise to mixture spectra generated from more than one peptide (9-12). Mixture spectra can confuse current computational methods because most mainstream approaches assume that each MS/MS spectrum comes from only one peptide. Houel et al. estimated that identification rates for mixture spectra can be half of what is expected for single-peptide spectra (12). Mixture spectra have also been explored by alternative data acquisition approaches designed to overcome the limitations of current mass spectrometry scan rates (13-16).
To alleviate the algorithmic bottleneck in the identification of mixture spectra, Zhang et al. described the first database search tool for mixture spectra, ProbIDtree (17), and showed that it is possible to identify co-eluting peptides from single MS/MS spectra. However, in that study not all peptides could be confidently identified, and in most cases ProbIDtree only identified the most prominent peptide in each spectrum. Recently, we also proposed a spectral library search tool, M-SPLIT, and showed that it is able to reliably and efficiently identify peptides from mixture spectra of co-eluting peptides in a yeast cell lysate (18). Despite the rapid growth of publicly available spectral libraries (19,20), methods based on spectral matching suffer from the limitation that peptides cannot be identified if they have not been observed before; thus database search methods are still the mainstream approach for peptide identification. Some database search tools approach the mixture spectra identification problem by reporting spectra with more than one significant single-peptide match and do not explicitly attempt to model the occurrence of fragment ions from two peptides in the same spectrum. False discovery rates (FDRs) are also left unadjusted (17,21,22), which may result in a higher than expected FDR for second IDs (e.g. when co-eluting peptides share a substantial number of fragment masses). Unlike previous approaches, our new database search tool, MixDB, uses a scoring model specifically designed for matching spectra against pairs of peptides and determines separate FDRs for the identification of single-peptide spectra and mixture spectra. Applying our method to a yeast lysate data set, we show that with efficient filtration techniques and a rigorous probabilistic scoring model, peptides can be reliably identified from mixture spectra while considering only a small fraction of all possible peptide pairs.

EXPERIMENTAL PROCEDURES
Problem Formulation-A mixture spectrum is defined as an MS/MS spectrum from two different peptides. Analogous to the identification of single-peptide MS/MS spectra by comparison against all peptides in a database of known protein sequences, our goal is to identify mixture spectra by comparison against all possible pairs of peptides in a given protein sequence database. More formally, we represent an MS/MS spectrum as a real-valued vector $S = (s_1, s_2, \ldots, s_n)$, where each element corresponds to the total peak intensity at a particular mass bin (bin size depends on instrument resolution). A mixture spectrum $M$ is modeled as $M = A + \alpha B$, where $A$ and $B$ are MS/MS spectra from two different peptides and $\alpha$, the mixture coefficient, indicates their relative abundance. Without loss of generality, we assume that $A$ and $B$ are scaled to the same magnitude and that $0 \le \alpha \le 1$ (i.e. $A$ always corresponds to the higher-abundance peptide). We define a peptide sequence as a vector $P = (p_1, p_2, \ldots, p_n)$, where $p_i$ is nonzero if there is at least one theoretical ion mass in the corresponding mass bin. A database $D$ is simply a set of peptides $D = \{P_1, P_2, \ldots, P_n\}$. We can now formulate the following computational problem:
Mixture Spectrum Identification Problem (MSIP)-
Input-A putative mixture spectrum $M$ and a sequence database $D$.
Output-A pair of peptides $P_1, P_2 \in D$ maximizing $\mathrm{PPSM}(M, P_1, P_2)$, where PPSM is a given Peptide/Peptide Spectrum Match scoring function that describes how well a pair of peptides $P_1, P_2$ matches the spectrum $M$.
Because there are currently no publicly available data with validated identifications of mixture MS/MS spectra, we created a data set of simulated mixture spectra to develop and benchmark our approach using the procedure described in (18). Briefly, representative MS/MS spectra of peptides with precursor charge 2 or 3 were selected from the NIST yeast spectral library (23; see footnote 2). All spectra were first scaled to the same total intensity by dividing each peak intensity by the sum of all peak intensities in the spectrum. Because in a mixture the two peptides will most likely be present at different abundances, mixture spectra were created by randomly selecting two spectra from the library and linearly combining them using $M = A + \alpha B$ with predefined mixture coefficients $\alpha \in \{0.1, 0.2, 0.3, 0.5, 1.0\}$. Because co-eluting peptides have similar precursor m/z, we further required that spectrum pairs have a precursor m/z difference of less than 3 Th. Noise was assumed to be additive in mixture spectra and was thus modeled by not removing unexplained/noise peaks from the single-peptide spectra used to simulate mixture spectra. To simulate signal suppression, the final spectra were filtered by keeping only the eight most intense peaks in every 50 Da window.
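The simulation procedure above can be illustrated with a minimal sketch. It is not the code used in this study: the function names are hypothetical, spectra are assumed to be already binned at 1 Da (so that 50 bins span a 50 Da window), and the precursor m/z constraint on spectrum pairs is omitted.

```python
def normalize(spectrum):
    """Scale a binned spectrum so that its intensities sum to 1."""
    total = sum(spectrum)
    return [x / total for x in spectrum] if total > 0 else [0.0] * len(spectrum)


def keep_top_peaks(spectrum, bins_per_window=50, top_n=8):
    """Zero out all but the top_n most intense bins in each window."""
    filtered = [0.0] * len(spectrum)
    for start in range(0, len(spectrum), bins_per_window):
        stop = min(start + bins_per_window, len(spectrum))
        # Bin indices of this window, most intense first.
        ranked = sorted(range(start, stop), key=lambda i: spectrum[i], reverse=True)
        for i in ranked[:top_n]:
            if spectrum[i] > 0:
                filtered[i] = spectrum[i]
    return filtered


def simulate_mixture(a, b, alpha, bins_per_window=50, top_n=8):
    """Return the filtered mixture M = A + alpha * B of two binned spectra."""
    if len(a) != len(b):
        raise ValueError("spectra must share the same binning")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    a_norm, b_norm = normalize(a), normalize(b)
    mixed = [x + alpha * y for x, y in zip(a_norm, b_norm)]
    return keep_top_peaks(mixed, bins_per_window, top_n)


# Toy example: four bins, one window, keep the two most intense peaks.
mixture = simulate_mixture([4, 0, 0, 0], [0, 3, 1, 0], alpha=0.5,
                           bins_per_window=4, top_n=2)
```

In the toy example the weakest peak of the lower-abundance spectrum is removed by the window filter, which is the signal-suppression effect the filtering step is meant to mimic.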
Filtration Strategy-Although the MSIP formulation is simple, the large number of possible peptide candidates in a sequence database ($10^4$-$10^5$ peptides in the yeast database after precursor mass filtering with 3 Da tolerance) makes searching all possible pairs of peptides a prohibitive approach: the resulting $\approx 10^{10}$ comparisons per query spectrum would mean that a typical data set of 15,000 spectra would take $\approx$50 days to process (based on average InsPecT runtimes, previously shown to be $\approx$100 times faster than SEQUEST (24)), even without considering the additional computational burden of scoring spectra against pairs of peptides. The explosion in the number of candidate matches per spectrum also dramatically increases the chances of false-positive matches. We propose a filtration strategy that avoids this quadratic penalty of searching all pairs by first scanning through the database to discard peptide candidates that are not likely to result in good mixture matches and then pairing up the remaining candidates to find the best-scoring peptide pairs.
Core to any database search method is a peptide spectrum match (PSM) scoring function that describes the quality of a match between a candidate peptide and a given spectrum. Conceptually, this scoring function typically rewards matches between observed peaks in the spectrum and theoretical fragment masses from the candidate peptide and imposes penalties for unexplained spectrum peaks or missing theoretical ion masses from the candidate peptide. When scanning the database to select promising peptide candidates, one needs to consider that a mixture spectrum $M$ derived from peptides $A$ and $B$ may have a poor PSM score to either peptide alone; that is, the presence of $B$ in the mixture spectrum typically results in many unmatched peaks between $M$ and $A$. We address this issue with the concept of the projected spectrum (18), in which the match between $M$ and $A$ only considers peaks in $M$ if there is a theoretical ion from $A$ with the same mass (within peak mass tolerance). More precisely, for a spectrum $M$ and a peptide $A$, let $M[i]$ denote the $i$th bin in spectrum $M$; the projection of $M$ onto $A$, $M^{p(A)}$, is defined as:
$$M^{p(A)}[i] = \begin{cases} M[i] & \text{if } a_i \neq 0 \\ 0 & \text{otherwise} \end{cases}$$
The intuition behind this operation is that because the peptide vector $A = (a_1, a_2, \ldots, a_n)$ contains all theoretical ion masses generated from this peptide, the projection essentially separates the peaks that are possibly generated from peptide $A$ from those that are noise or generated from peptide $B$. Given a spectrum $M$, the filtration stage then computes match scores between all candidate database peptides and their corresponding projected versions of $M$; the top-scoring matches are then selected for scoring of peptide-pair matches as described below. Because the projections extract peaks belonging to only one peptide from the query spectrum, the score between a projected spectrum and a peptide can be computed using any scoring function designed for single-peptide spectra; here we used the scoring function proposed by Kim et al. (25).
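As an illustration of the projection and of the filtration stage built on it, consider the following sketch. It assumes that spectra and peptides are already binned into vectors of equal length (so the peak mass tolerance is absorbed by the binning), and it substitutes a deliberately simple placeholder score, the summed projected intensity, for the scoring function of (25); all names are hypothetical.

```python
def project(spectrum, peptide_bins):
    """Keep M[i] where the peptide has a theoretical ion in bin i, else 0."""
    if len(spectrum) != len(peptide_bins):
        raise ValueError("spectrum and peptide must share the same binning")
    return [m if a else 0.0 for m, a in zip(spectrum, peptide_bins)]


def explained_intensity(projected, peptide_bins):
    """Placeholder PSM score (not the score of ref. 25): summed intensity."""
    return sum(projected)


def filter_candidates(spectrum, candidates, score_fn=explained_intensity, keep=500):
    """Rank candidate peptides by the score of their projected spectrum."""
    scored = [(score_fn(project(spectrum, bins), bins), name)
              for name, bins in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:keep]]


# Toy example: peptide A explains bins 0 and 2, B bin 3, C only an empty bin.
spectrum = [0.5, 0.0, 0.3, 0.2]
candidates = {"A": [1, 0, 1, 0], "B": [0, 0, 0, 1], "C": [0, 1, 0, 0]}
projected_a = project(spectrum, candidates["A"])
shortlist = filter_candidates(spectrum, candidates, keep=2)
```

Because each candidate is scored only against its own projection, peaks contributed by the other peptide in the mixture do not count against it, which is why both A and B survive the filter in the toy example.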
The efficiency of the projected-spectrum filter is determined by the highest (i.e. worst) rank of a correct peptide match of a mixture spectrum to the database $D$. Note that a correct match in $D$ is a match to one of the peptides generating $M$: single-peptide spectra have one correct match, whereas mixture spectra have two correct matches and thus two correct-match ranks. As shown in Fig. 1, the resulting ranks of correct matches indicate that the projected-spectrum filter is an efficient filter that, for over 96% of cases, retains both correct matches (i.e. maximum ranks) at ranks less than 500 from about 10,000 candidate peptides (yeast database, 3 Da precursor mass filtering). The left panel in Fig. 1 also shows that one of the correct matches, presumably the higher-abundance peptide in the mixture, almost always has rank less than 10 (i.e. minimum ranks). This means that for almost all cases one only needs to pair the top 10 candidates with the top 500 candidates to find the correct match. Using this strategy at most $10 \times 500 = 5000$ candidate pairs need to be considered, thus conferring a speedup of $\approx \frac{10{,}000 \times 10{,}000}{2 \times 5000} = 10^4$ compared with considering all $\approx 5 \times 10^7 = \frac{10{,}000 \times 10{,}000}{2}$ possible candidate peptide pairs.
Footnote 2: There were also spectra with charge 1 and 4 in the library, but because the number of spectra was not enough to train a statistical model for the scoring function those were not considered in this study.
Scoring Function for Mixture Spectrum Matches-Efficient filtration dramatically reduces the number of candidate peptide pairs that one needs to consider, but these candidate pairs must then be scored and ranked according to their relative likelihood of generating each putative mixture MS/MS spectrum. Although scoring a peptide against an MS/MS spectrum is a well-studied problem in proteomics, few scoring functions have been designed to handle more than one peptide (17). Here we describe a general probabilistic model for scoring Peptide/Peptide Spectrum Matches (PPSMs). First we briefly review the model for single-peptide matches (PSMs) (25) and then show how to extend the approach to PPSMs.
As described above, an MS/MS spectrum is represented as a vector of $n$ bins, each representing a mass interval of width $\delta$. A bin has value zero if there is no peak in the corresponding $\delta$-Dalton interval; otherwise it is nonzero. For experimental MS/MS spectra the raw intensity in each bin is first transformed into a peak intensity rank (ranked from most to least intense), whereas for a theoretical spectrum bin values indicate the ion type (e.g. b-ion or y-ion) that generates the peak. Hence we define an experimental spectrum $S = (s_1, s_2, \ldots, s_n)$ as a vector where $s_i \in R$ (peak ranks, always positive integers) and a peptide $P = (p_1, p_2, \ldots, p_n)$ as a vector where $p_i \in I$ (ion types). When multiple ion types fall into the same bin, we keep track of all the ion types associated with that particular bin. The probability of a peptide $P$ generating a spectrum $S$ is defined as:
$$\mathrm{Prob}(S \mid P) = \prod_{i=1}^{n} \mathrm{Prob}(s_i \mid p_i)$$
where $\mathrm{Prob}(x \mid y)$ is an arbitrary $|R| \times |I|$ matrix representing the probability that an ion type $y$ in the peptide generates a peak with rank $x$ in the spectrum. When there are multiple ions associated with a particular bin in peptide $P$ we choose the ion that maximizes $\mathrm{Prob}(s_i \mid p_i)$. Formally, if we denote by $p_{ij}$ each of the ion types associated with the $i$th bin in $P$, then $\mathrm{Prob}(s_i \mid p_i) = \max_j \mathrm{Prob}(s_i \mid p_{ij})$. Finally, the score of a peptide $P$ against a spectrum $S$ is defined as the ratio of the probability that $S$ is generated by the peptide $P$ to the probability that $S$ is generated by a peptide string of all zeros (i.e. all peaks interpreted as noise). We express this score as a sum of log odds ratios:
$$\mathrm{Score}(S, P) = \log \frac{\mathrm{Prob}(S \mid P)}{\mathrm{Prob}(S \mid 0)} = \sum_{i=1}^{n} \log \frac{\mathrm{Prob}(s_i \mid p_i)}{\mathrm{Prob}(s_i \mid 0)}$$
Because the additive contribution of each peak to the total peptide score is expressed by the log-ratio of the probabilities of observing a peak rank for a particular ion type versus noise, such peak-rank models are usually very robust to the presence of low-intensity noise peaks (see supplemental Fig. S1 for estimated rank distributions for noise and b/y ions).
Conversely, high-intensity noise peaks would cause the rank distributions for true ion peaks to shift toward worse ranks and, as expected, make them less distinguishable from high-intensity noise; this would be reflected in lower peptide scores through lower $\log\left(\frac{\mathrm{Prob}(s_i \mid p_i)}{\mathrm{Prob}(s_i \mid 0)}\right)$ scores for matched ion peaks. The values of $\mathrm{Prob}(s_i \mid p_i)$ can be learned from a training data set of annotated single-peptide spectra; similarly, the noise model $\mathrm{Prob}(s_i \mid 0)$ can be trained using the rank distribution of unassigned peaks in the same annotated single-peptide spectra. The learning is done separately for peptides of different precursor charge and length to account for their different fragmentation statistics (see (25) for full details of this model).
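The log-odds form of the single-peptide score can be sketched as follows. This is an illustration under stated assumptions, not the trained model of (25): the probability tables are made-up toy values, rank 0 is used here as a convention for "no peak in this bin", and bins without any theoretical ion contribute nothing because the peptide and noise hypotheses coincide there.

```python
import math


def psm_score(ranks, ion_types, prob_ion, prob_noise):
    """Sum of log(Prob(s_i | p_i) / Prob(s_i | 0)) over all bins.

    ranks[i]      peak rank observed in bin i (0 = no peak; assumed convention)
    ion_types[i]  list of theoretical ion types in bin i (empty = none)
    prob_ion      dict mapping (ion_type, rank) to a probability
    prob_noise    dict mapping rank to the noise probability Prob(rank | 0)
    """
    score = 0.0
    for rank, ions in zip(ranks, ion_types):
        if not ions:
            continue  # peptide predicts noise here: log(1) = 0
        # With several ion types in one bin, keep the most probable one.
        best = max(prob_ion[(ion, rank)] for ion in ions)
        score += math.log(best / prob_noise[rank])
    return score


# Toy probability tables (hypothetical values, for illustration only).
prob_ion = {("b", 0): 0.3, ("b", 1): 0.4, ("y", 0): 0.2, ("y", 1): 0.5}
prob_noise = {0: 0.9, 1: 0.05, 2: 0.05}
ranks = [1, 0, 2]
ion_types = [["b", "y"], ["b"], []]
score = psm_score(ranks, ion_types, prob_ion, prob_noise)
```

In the toy example the first bin rewards a matched top-ranked peak, the second bin penalizes a missing b-ion, and the third bin (an unexplained peak) leaves the score unchanged.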
Fig. 1. Cumulative distributions of minimum rank (left) and maximum rank (right) for correct matches of simulated mixture spectra to the yeast database. Candidate peptides in the database are first sorted according to decreasing score against the corresponding projected mixture spectra (the typical number of peptide candidates after precursor mass filtering is $\approx$10,000). The ranks of the correct matches are then determined. Correct matches are peptides in the database that correspond to one of the peptides used to generate the simulated mixture spectrum. Because each mixture spectrum has two correct matches we report both the minimum (i.e. best) and maximum (i.e. worst) rank of the two matches in the left and right panels, respectively. As shown, in more than 96% of cases, one of the correct matches has rank $\le 10$ (left) whereas the other correct match ranks $\le 500$ (right). Thus it is sufficient to pair the top 10 candidates with the top 500 candidates to find the correct pair. Using this strategy at most $10 \times 500$ peptide pairs need to be considered, resulting in a speedup of four orders of magnitude compared with considering all $\approx 5 \times 10^7$ possible peptide pairs.
In order to score mixture spectrum matches, we extend this model to pairs of peptides $(P_A, P_B)$. Without loss of generality we refer to the highest-abundance peptide in the pair as $P_A$. To represent a pair of peptides in our statistical framework we extend the set of possible ion types to distinguish ions from $P_A$ from those from $P_B$. If the original set of ion types $I$ contained $m$ elements, $I = \{i_1, i_2, \ldots, i_m\}$, the extended set of ion types now has $2m$ elements:
$$I_{extended} = \{i_1^A, i_2^A, \ldots, i_m^A, i_1^B, i_2^B, \ldots, i_m^B\}$$
where $i_j^A$ represents the ion type $i_j \in I$ if generated from peptide $P_A$ and $i_j^B$ represents the same ion type $i_j \in I$ if generated from peptide $P_B$. We represent a peptide pair in vector form by computing all theoretical ion masses from both peptides and placing the corresponding ion types in their respective mass bins (i.e. a peptide pair is represented as a vector $P = (p_1, p_2, \ldots, p_n)$ where $p_i \in I_{extended}$). To learn the parameters for mixture spectrum scoring models, we generated a data set of 100,000 simulated mixture spectra as described in the previous section with $\alpha$ uniformly selected from 0.1 to 1.0. As shown in supplemental Fig. S1, fragment ions from high-abundance and low-abundance peptides have quite different peak rank distributions. Scoring models for single-peptide spectra do not capture these characteristics of mixture spectra. In particular, low-abundance peptides in mixture spectra will have low scores because their peak rank distributions are closer to noise and their mixture spectra contain larger numbers of unmatched high-intensity peaks than single-peptide spectra. Therefore, a scoring model that explicitly models fragment ions from pairs of peptides is needed. Because peptides with different charge states and lengths have different fragmentation patterns (25), we divided library spectra into four categories according to their identified peptides. This separation results in a total of sixteen categories of mixture spectra, obtained by pairing spectra from each category with spectra from another category; the 100,000 simulated spectra were divided into 16 sets using this categorization. A separate scoring model was then trained for each type of mixture spectra. The considered ion types were: b, b(iso), b-H2O, b-NH3, y, y(iso), y-H2O, y-NH3, where b(iso) and y(iso) indicate the first isotopic peak of b and y ions, respectively. We consider doubly charged peaks for spectra from charge-two precursors and both doubly and triply charged peaks for spectra from charge-three precursors. Peak ranks are divided into bins as follows: (1) one rank per bin for ranks 1-20; (2) five ranks per bin for ranks 20-60; (3) ten ranks per bin for ranks 60-150; and one last bin for all peaks with rank 150 or higher.
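The rank binning described above can be written as a small lookup function. The stated ranges share their endpoints (20, 60, 150), so this sketch has to pick a convention: it assumes that ranks 1-20, 21-60, 61-150, and above 150 form the four groups. That boundary choice, and the function name, are assumptions made for illustration.

```python
def rank_bin(rank):
    """Map a peak intensity rank (1 = most intense) to a rank-bin index."""
    if rank < 1:
        raise ValueError("ranks start at 1")
    if rank <= 20:
        return rank - 1                    # bins 0..19, one rank per bin
    if rank <= 60:
        return 20 + (rank - 21) // 5       # bins 20..27, five ranks per bin
    if rank <= 150:
        return 28 + (rank - 61) // 10      # bins 28..36, ten ranks per bin
    return 37                              # single bin for all remaining ranks


NUM_RANK_BINS = 38
```

Under this convention each scoring model needs one probability per (ion type, rank bin) combination, so coarser bins at poor ranks keep the number of parameters small where individual ranks carry little information.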
Because our scoring function distinguishes between high-abundance and low-abundance peptides in mixture spectra, and during searching we do not know which candidate peptide is of higher abundance, we score a query spectrum against a candidate pair $(P_A, P_B)$ by comparing the observed spectrum against two theoretical spectra, one with $P_A$ and another with $P_B$ as the higher-abundance peptide; the higher score is the final PPSM score.
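A minimal sketch of this "score both orderings" step is given below. The helper names are hypothetical, and `toy_score` is a made-up stand-in for a trained mixture scoring model: it simply gives ions tagged as coming from the higher-abundance peptide twice the weight of the others.

```python
def extended_ions(high_bins, low_bins):
    """Tag ion types per bin: 'A' for the higher-abundance peptide, 'B' otherwise."""
    return [[(ion, "A") for ion in high] + [(ion, "B") for ion in low]
            for high, low in zip(high_bins, low_bins)]


def ppsm_score(ranks, pep1_bins, pep2_bins, score_fn):
    """Score both abundance orderings of a peptide pair and keep the better one."""
    return max(score_fn(ranks, extended_ions(pep1_bins, pep2_bins)),
               score_fn(ranks, extended_ions(pep2_bins, pep1_bins)))


def toy_score(ranks, ions_per_bin):
    """Stand-in scoring model (assumption): weight / rank for every matched ion."""
    weights = {"A": 2.0, "B": 1.0}
    total = 0.0
    for rank, ions in zip(ranks, ions_per_bin):
        if rank > 0:  # rank 0 means no peak in this bin (assumed convention)
            total += sum(weights[tag] / rank for _, tag in ions)
    return total


ranks = [1, 2]              # bin 0 holds the most intense peak
pep1 = [["b"], []]          # peptide 1 explains bin 0
pep2 = [[], ["y"]]          # peptide 2 explains bin 1
best = ppsm_score(ranks, pep1, pep2, toy_score)
```

Taking the maximum makes the result independent of the order in which the two candidates are supplied, so the search does not need to guess which peptide is more abundant.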
The performance of this scoring model is first evaluated using a set of simulated mixture spectra generated from a different data set than that used to generate the training data set. Using single-peptide spectra identified by both InsPecT and M-SPLIT in a previous study (18), we simulated mixture spectra with $\alpha$ = 1.0, 0.5, 0.3, 0.2, 0.1 as described above. The percentage of cases where the top peptide pair returned by MixDB is correct is shown in Table I. As expected, as $\alpha$ decreases the performance worsens because it becomes harder to identify the lower-abundance peptide. We also compared the performance of MixDB with M-SPLIT, our spectral library search tool previously shown to be efficient and robust in identifying mixture spectra. On average, M-SPLIT correctly identifies 15% more mixture spectra than MixDB. In general, spectral library search methods have two main advantages over database search methods: the relative intensities of different fragment ions are known in advance and the number of peptide candidates is smaller. In order to understand the relative importance of these two factors, we also evaluated MixDB when searching only against peptides with spectra in the NIST spectral library. As can be seen from Table I, with a reduced search space, MixDB has similar performance to M-SPLIT. We also compare MixDB with an iterative search strategy where one first identifies the highest-scoring peptide, removes all annotated peaks from the spectrum, and then uses the "residual" spectrum to search against the database a second time to identify the second peptide. As we can see in Table I, when $\alpha$ is high, the performance of the iterative method is comparable with that of the combined scoring function. However, as $\alpha$ becomes smaller, it is better to consider both peptides at the same time.
Classification of Database Search Matches-A database search of MS/MS spectra will always identify some top-scoring peptide or peptide pair for any given query spectrum, even if the true match is not in the database. To assess whether the top match is significant we use a two-stage classifier to distinguish true matches from false-positive matches. Because our goal is to build a general search tool that can identify both single-peptide and mixture spectra, we consider three possible outcomes when searching a given query spectrum $S$:
No-match: $S$ does not match any peptide in the database.
Single-peptide match: $S$ matches one peptide in the database.
Mixture match: $S$ matches a pair of peptides in the database.
Classification of the top matches is done using two Support Vector Machines (SVMs) (26). The first SVM distinguishes No-match cases from Single-peptide/Mixture matches and the second SVM distinguishes Single-peptide matches from Mixture matches (see Fig. 3). To build the SVMs we consider the PPSM score described above and several other features that have been found useful in distinguishing true matches from false positives in single-peptide spectra, namely:
1) Likelihood score for one peptide match: likelihood score while considering only matched peaks from one peptide.
2) Likelihood score divided by peptide length: score from (1) divided by the number of amino acids in the top candidate peptide.
3) Explained intensity: total intensity of annotated peaks divided by total intensity of the spectrum.
Features 2-6 are computed separately for each peptide in the top pair, resulting in a total of 16 feature inputs to the SVMs.
To train the SVM models, we constructed two negative control data sets. The No-match data set consists of 5000 mixture spectra where the peptides used to create the mixture spectra are deleted from the database. The Single-match data set consists of 2500 single-peptide spectra and 2500 mixture spectra where one peptide in the mixture is removed from the database. These two data sets were combined with another data set of 5000 mixture spectra (Mixture-match data set) and searched against the database. For all simulated mixture spectra, the mixture coefficient $\alpha$ was selected uniformly from 0.1 to 1. The top matches from each data set were used as training data for the SVM models. The training is carried out in two steps. In the first step, top matches from the No-match data set were treated as negative cases and correct top matches from the Single-match and Mixture-match data sets were used as positive cases. In the second step, correct top matches from the Mixture-match data set were used as positive cases whereas all top matches from the Single-match and No-match data sets were used as negative cases. The performance of the SVM models was assessed using 10-fold cross-validation (shown in Fig. 2).
Estimation of False Discovery Rates-False Discovery Rates (FDRs) were estimated by extending the standard Target-Decoy strategy for database search (20). Depending on whether each peptide in the top peptide pair comes from the Target or Decoy database, matches are divided into the following possibilities: TT, both peptides matched Target; TD, the most abundant peptide matched Target and the least abundant peptide matched Decoy; DT, the most abundant peptide matched Decoy and the least abundant peptide matched Target; and DD, both peptides matched Decoy. When searching mixture spectra there are two possible outcomes for identification: single-peptide matches and mixture matches, each with a separate corresponding FDR. The FDR for single-peptide matches is defined as:
$$\mathrm{FDR}_{match} = \frac{DT + DD}{TT + TD}$$
and the FDR for mixture-spectrum identification is defined as:
$$\mathrm{FDR}_{mixture} = \frac{TD}{TT}$$
(see Fig. 3). Note that DT and DD peptide pairs will never advance to the second SVM because they are rejected as false positives after $\mathrm{FDR}_{match}$; only TD and TT matches are considered by the second SVM as candidate mixture matches.
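The two FDR definitions translate directly into code. The sketch below is a simplification: it computes both rates from one list of category labels, whereas in the procedure described here the mixture FDR is evaluated only on matches that already passed the first SVM threshold. The function name and the example counts are hypothetical.

```python
def two_stage_fdr(labels):
    """Return (FDR_match, FDR_mixture) from labels 'TT', 'TD', 'DT', 'DD'."""
    counts = {key: 0 for key in ("TT", "TD", "DT", "DD")}
    for label in labels:
        counts[label] += 1
    # First peptide from Target: TT and TD; first peptide from Decoy: DT and DD.
    target_first = counts["TT"] + counts["TD"]
    fdr_match = (counts["DT"] + counts["DD"]) / target_first if target_first else 0.0
    # Second peptide from Decoy (TD) versus both peptides from Target (TT).
    fdr_mixture = counts["TD"] / counts["TT"] if counts["TT"] else 0.0
    return fdr_match, fdr_mixture


# Made-up counts for illustration.
labels = ["TT"] * 90 + ["TD"] * 10 + ["DT"] * 3 + ["DD"] * 2
fdr_match, fdr_mixture = two_stage_fdr(labels)
```

With these toy counts the first peptide looks reliable (5 decoy-first matches against 100 target-first matches), while the second peptide is much less so (10 TD against 90 TT), which is why the two rates are tracked separately.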

RESULTS
To illustrate the utility of our method in a typical scenario, we tested our database search method on an experimental Yeast data set (27), generously made publicly available in Tranche/ProteomeCommons (28) by researchers at Vanderbilt University. In brief, a tryptic digest of Saccharomyces cerevisiae was analyzed on an LTQ Orbitrap XL mass spectrometer (Thermo Fisher Scientific) and MS/MS spectra were acquired using a data-dependent scanning mode in which each full MS scan (m/z 300-2000) was acquired on the Orbitrap at resolution 60,000, followed by eight MS/MS scans collected on the LTQ (see (27) for full details). The data were analyzed using InsPecT (24), MixDB, and ProbIDtree (17) with 3 Da parent mass tolerance and 0.5 Da fragment mass tolerance against the SGD yeast protein database (ver. 5/8/2009); a 1% false discovery rate was enforced using a target and decoy strategy and the only modification allowed was carboxamidomethylation on cysteine. Although data collected with this survey scan setting typically feature very high mass accuracy, we allowed a large precursor mass tolerance in the searches to find possible MS/MS spectra from co-eluting peptides. In short, InsPecT is able to identify a total of 22,658 single-peptide spectra, MixDB is able to identify 23,930 single-peptide spectra plus 978 mixture spectra, and ProbIDtree is able to identify 19,840 single-peptide spectra plus 821 mixture spectra. Because the Yeast data set was acquired with high-accuracy survey scans, this information was further used to validate the annotations by requiring the presence of the theoretical monoisotopic m/z value of the identified peptides in the corresponding survey scans.
Because for low-abundance peptides the monoisotopic m/z may not be visible in the corresponding survey scan (observed in only $\approx$0.7% of all cases and only for the lower-abundance peptide in mixture spectra), in such cases we also check one preceding and one subsequent survey scan for the monoisotopic m/z. An annotation is considered correct if the theoretical precursor m/z is within 5 ppm of the observed m/z; the results are summarized in Table II. We first focus our attention on single-peptide cases. As seen in Table IIB, all three database search methods achieve similar precision of $\approx$97%. A more detailed comparison in Fig. 4 shows that 78.4% and 68.2% of MixDB's annotations overlap with those of InsPecT and ProbIDtree, respectively. Of the spectra for which both MixDB and InsPecT/ProbIDtree make an annotation, more than 96% have the same peptide ID, indicating that these independent methods are consistent. We further divide the spectra that are identified only by MixDB or only by InsPecT/ProbIDtree into two categories: those where the two methods identify the same peptide as the top hit and those where the two methods have a different top hit. Those in the "same-top-hit" category are more likely to be correct, because different scoring functions rank the same peptide as the top candidate. Considering these cases increases MixDB's overlap with InsPecT and ProbIDtree to 90% and 85%, respectively, further indicating very good agreement between these different methods. Overall, for identification of single-peptide spectra, all three database search methods have comparable accuracy whereas MixDB identifies ~6% more spectra than InsPecT and 21% more spectra than ProbIDtree. This shows that MixDB's performance in identifying single-peptide spectra was not diminished by trying to identify more than one peptide per spectrum.
Fig. 3. (Diagram: TT, TD, DT, and DD matches ordered by increasing SVM1 score.) Every query spectrum searched against the database is assigned to the peptide pair $(P_A, P_B)$ that best matches the spectrum. Depending on whether the matched peptides come from the Target or Decoy database, each match falls in one of four categories: Target/Target (TT), Target/Decoy (TD), Decoy/Target (DT), Decoy/Decoy (DD). All matches are ranked according to their SVM1 score to assess whether the most abundant peptide ($P_A$) in each paired match $(P_A, P_B)$ is significant. As with the standard Target/Decoy strategy, matches with $P_A$ from Target are considered positive matches and matches with $P_A$ from Decoy are considered negative matches. Therefore, the FDR for single-peptide matches is computed as $\mathrm{FDR}_{match} = \frac{DT + DD}{TD + TT}$ and the SVM1 score is thresholded to yield $\mathrm{FDR}_{match} < 0.01$. Matches with SVM1 score above the threshold are then ranked by their second SVM scores to evaluate whether the second peptide ($P_B$) in the top match $(P_A, P_B)$ is also significant. As before, matches with $P_B$ from the Target database are considered positive matches and those with $P_B$ from Decoy are negative matches. Thus the FDR for mixture matches is computed as $\mathrm{FDR}_{mixture} = \frac{TD}{TT}$ and SVM2 scores are also thresholded such that $\mathrm{FDR}_{mixture} < 0.01$.
For mixture spectra, MixDB identifies 978 spectra whereas ProbIDtree identifies a total of 821 spectra (see Table IIA). MixDB identified two peptides in each mixture spectrum, whereas ProbIDtree found 32 mixture spectra with more than two identified peptides. This indicates that even though more than two identifiable co-eluting peptides may appear in one MS/MS spectrum, this is relatively rare. Thus by limiting mixture spectra to two peptides per spectrum MixDB does not lose much sensitivity in peptide identification. Furthermore, MixDB compensates by identifying about 20% more mixture spectra than ProbIDtree (978 versus 821 mixture spectra). In addition, by limiting its search space MixDB is more accurate. As shown in Table IIA, MixDB achieves a precision of almost 96% whereas ProbIDtree has only $\approx$90% precision for mixture spectra. Two main reasons contribute to MixDB's higher precision. First, MixDB only searches up to two peptides per spectrum (ProbIDtree attempts to look for up to eight peptides per spectrum), thus theoretically it has a smaller search space than ProbIDtree. More importantly, MixDB applies different scores and FDRs for the first and second identified peptides in each MS/MS spectrum ($\mathrm{FDR}_{match}$ and $\mathrm{FDR}_{mixture}$, respectively). This is crucial because high-abundance peptides are likely to have very different match statistics (e.g. % explained intensity, % b/y ions present) than low-abundance peptides in mixture spectra. Because there are many more single-peptide spectra than mixture spectra in this data set, applying a single global FDR for the combined identification of both single-peptide and mixture spectra can lead to an underestimation of the FDR for mixture spectra. This is the case for ProbIDtree, which has an estimated overall precision of $\frac{90.1\% \times 821 + 97.8\% \times 19840}{19840 + 821} = 97.49\%$, or an overall FDR of 2.51%. However, the precision for mixture spectra is only 90.1%, bringing its FDR on mixture spectra to 9.9%, which is much higher than its FDR for single-peptide matches. In summary, we show that MixDB has both higher sensitivity and higher precision than ProbIDtree in identifying mixture spectra while having comparable performance to InsPecT in identifying single-peptide spectra.
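The effect of pooling single-peptide and mixture identifications under one global FDR can be reproduced with a count-weighted average. The sketch below only evaluates the expression given in the text with the precisions and spectrum counts reported for ProbIDtree; the function name is hypothetical.

```python
def pooled_precision(groups):
    """Count-weighted precision over (precision, number_of_spectra) groups."""
    total = sum(count for _, count in groups)
    if total == 0:
        raise ValueError("no spectra to pool")
    return sum(precision * count for precision, count in groups) / total


# ProbIDtree values quoted in the text: mixture spectra, then single-peptide spectra.
mixture_group = (0.901, 821)
single_group = (0.978, 19840)

overall_precision = pooled_precision([mixture_group, single_group])
overall_fdr = 1.0 - overall_precision
mixture_fdr = 1.0 - mixture_group[0]
```

Because the mixture group is roughly 24 times smaller than the single-peptide group, it barely moves the pooled value, so a global threshold can hide a mixture FDR that is several times larger than the overall one.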
Because spectral library methods are in general considered to be both more sensitive and more accurate in peptide identification than database search methods (29), we use identifications from spectral library searches to further validate the different database search methods. The Yeast data set was analyzed using an extension of the M-SPLIT algorithm (18) (see supplemental Methods) and SpectraST (30) with default parameters against the Yeast spectral library from NIST (ver. 5/4/2009) with a precursor mass tolerance of 3 Da; a 1% false discovery rate was enforced using the decoy library strategy described in (31). Because we do not allow PTMs in the database searches, we also removed all entries in the spectral library that contain PTMs. To evaluate the performance of M-SPLIT, we first compare it to SpectraST, the most popular publicly available method for peptide spectral library search. In short, both methods identified similar numbers of single-peptide spectra and are consistent with each other (see supplemental Table S1). Only in ~6% of the spectra do the two methods identify different peptides as top hits (see supplemental Fig. S2). Therefore, from here on we use results from M-SPLIT as a reference to evaluate results from different database search methods. As shown in Fig. 4, M-SPLIT misses about 3600 single-peptide spectra identified by either MixDB or InsPecT. However, in 85-90% of these spectra, the identified peptides are not in the spectral library, indicating that M-SPLIT has very high sensitivity; we can therefore use spectra identified by M-SPLIT as a reference and compare the relative sensitivity of database search methods. Fig. 4 and supplemental Figs. S2 and S3 show that MixDB, InsPecT, and ProbIDtree identify 68%, 61.3%, and 51.8% of the spectra identified by M-SPLIT, respectively. 
Among these shared annotations more than 97% have the same peptide annotation as M-SPLIT, again indicating that these database search methods have high precision for single-peptide spectra. If we look further into those spectra that are identified only by M-SPLIT and not by database search methods, we find that for 60%, 30%, and 11% of the cases MixDB, InsPecT, and ProbIDtree, respectively, identify the same top hit as M-SPLIT. Altogether these indicate that MixDB, InsPecT, and ProbIDtree have a sensitivity of 89%, 75%, and 58% for ranking the correct peptide as the top candidate. This implies that these database search methods are able to find the correct peptide for these spectra, but the current scoring functions/SVM models do not have enough discriminative power to distinguish experiment-wide true matches from false matches. Thus to keep the FDR low, database search methods have to discard some possibly valid but low-scoring matches. Next, we turn our attention to mixture spectra. Similar to the single-peptide case, most mixture spectra identified by database search methods but missed by M-SPLIT corresponded to peptides that did not have spectra in the spectral library (see Fig. 4). Taking the union of mixture spectra returned by M-SPLIT and MixDB as the total number of mixture spectra in this data set, we can estimate the false negative rate of M-SPLIT. This agrees with what we observed before (18) and again shows that M-SPLIT has high sensitivity and can also be used as a reference to compare database search methods for the identification of mixture spectra. Out of the 2567 mixture spectra identified by M-SPLIT, MixDB is able to identify 615 whereas ProbIDtree is able to identify 282. If we look at these cases closely, we find that about 2460 mixture spectra identified by M-SPLIT come from peptides with charge 2 and 3, the only possibilities considered by MixDB. This means MixDB has a sensitivity of 25% whereas ProbIDtree has a sensitivity of 12%. 
If the two peptides in mixture spectra are considered independent, and given that MixDB has a sensitivity of 60% for single-peptide spectra, we expect MixDB to identify 60% × 60% = 36% of all mixture spectra identified by M-SPLIT. This calculation assumes that the two peptides are present at similar abundance, which, as shown in the previous section, is the easiest scenario for identification of both peptides in mixture spectra. In practice, one peptide is present at lower abundance in most mixture spectra, and in the yeast data set the average mixture coefficient α estimated by M-SPLIT is only 0.3. Thus a sensitivity of 25% is a reasonable performance for MixDB. In addition, for a large fraction of the mixture spectra (1051/1953) that MixDB did not classify as mixture matches, it identified the same top peptide pair per spectrum as M-SPLIT, but again the current SVM model does not have enough discriminative power to separate true matches from false positives. Extending to mixture spectra the more sophisticated statistical models that have been proposed for single-peptide spectra (32) is likely to increase the sensitivity of database search methods.
In each pairwise comparison, spectra identified by both methods (in the intersection, shown in purple) were assigned to the same peptide in 96-97% of cases, indicating that the methods are consistent and the precision is in good agreement with our estimates. Spectra identified by one method but not the other are subdivided into two categories: cases where the two methods return the same peptide (peptide pair in the case of mixture matches) as the top hit but it was below the FDR threshold for one of the methods (shown in green) and cases where the two methods do not return the same peptide as the top hit (shown in black). In general MixDB has high overlap with other database search methods. For single-peptide spectra MixDB finds the same top peptide match as other methods in 85-90% of cases. 
When using spectra identified by spectral library search as a reference set, MixDB is able to identify 6-16% more single-peptide spectra and 38% more mixture spectra than current database search methods. Taken together, these results show that MixDB has better or comparable sensitivity and accuracy in identifying single-peptide spectra as well as significantly higher sensitivity and accuracy in the identification of mixture spectra.

(a) Numbers of identified spectra (single-peptide and mixture) and unique peptides are compared. (b) To allow for identification of co-eluting peptides, all searches were run using a 3 Da precursor mass tolerance. The accurate precursor mass information was then used a posteriori to estimate the precision of peptide identification by comparing the theoretical precursor m/z of peptides returned by each method with the observed precursor m/z values in the corresponding MS1 scan (isotopic profile). An identification is considered correct if the difference between theoretical and experimental precursor m/z values is less than 5 ppm. For mixture spectra the precision is slightly lower because the second peptide in the mixture is usually of low abundance (average
To estimate how the presence of more than one peptide affects current computational techniques for peptide identification, we performed a more detailed comparison between MixDB and InsPecT on the 2567 mixture spectra identified by M-SPLIT. As described above, the problem of peptide identification consists of two subtasks: (1) ranking the correct peptide as the top-scoring candidate per spectrum among all the other peptide candidates in the database and (2) experiment-wide discrimination of correct peptide matches from false but top-scoring matches. We evaluated how the presence of more than one peptide affected each of these tasks. First we compared whether InsPecT and MixDB are able to rank one of the peptides in the mixture as the top match. As shown in Fig. 5, InsPecT ranks one of the correct peptides at the top in 75% of cases. MixDB, on the other hand, achieves this in more than 95% of cases and further ranks both correct peptides at the top in 65% of cases. This shows that by relaxing the assumption that each MS/MS spectrum comes from only one peptide, we gain ≈20% in sensitivity for ranking the correct peptides in mixture spectra as top candidates. At a 1% false discovery rate, InsPecT classifies 53% of all mixture spectra as single-peptide matches, whereas MixDB classifies 68% of cases as matches, with 24% of all cases classified as mixture matches. Because each mixture spectrum contains information for two peptides and InsPecT identifies one peptide in 53% of cases, InsPecT recovers about 53% × 0.5 = 26.5% of all peptide information in mixture spectra. In contrast, MixDB identifies both peptides in 24% of cases and one peptide in another 43% of cases, resulting in a recovery rate of 24% + 43% × 0.5 = 46% of all peptide information contained in mixture spectra. 
Recall that MixDB and InsPecT have a sensitivity of 68% and 61.3%, respectively, for identifying single-peptide spectra. This means that the presence of co-eluting peptides does not interfere significantly with the ability of current database search methods to identify the most dominant peptide (only a 9% drop in sensitivity). As for MixDB, the presence of co-eluting peptides does not affect its ability to identify the most dominant peptide in the spectra, because for mixture spectra it also has a sensitivity of ~68%.
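The recovery-rate arithmetic above counts a spectrum with both peptides identified as fully recovered and a spectrum with only one peptide identified as half recovered. A minimal Python sketch of that bookkeeping follows; the function name `recovered_information` is hypothetical, and the fractions are the rounded values quoted in the text.

```python
def recovered_information(both_fraction: float, one_fraction: float) -> float:
    """Fraction of peptide information recovered from two-peptide mixture
    spectra: full credit when both peptides are identified, half credit
    when only one of the two is identified."""
    return both_fraction + 0.5 * one_fraction


# InsPecT reports at most one peptide per spectrum (53% of mixture spectra).
inspect_recovery = recovered_information(both_fraction=0.0, one_fraction=0.53)

# MixDB: both peptides in 24% of cases, one peptide in another 43%.
mixdb_recovery = recovered_information(both_fraction=0.24, one_fraction=0.43)

print(f"InsPecT: {inspect_recovery:.1%}")
print(f"MixDB:   {mixdb_recovery:.1%}")
```

With these inputs the values are 0.265 for InsPecT and 0.455 for MixDB, which the text reports as 26.5% and 46%.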

DISCUSSION
As increasingly complex samples are analyzed in high-throughput proteomic experiments (10) and new data acquisition protocols evolve (13,14,15,33), the occurrence and detectability of co-eluting peptides per MS/MS spectrum are likely to increase. The almost ubiquitous assumption that most mainstream computational methods make, namely that every MS/MS spectrum comes from one peptide, affects their ability to identify spectra from co-eluting peptides (12). Although in this study the effect is only moderate for InsPecT (a ≈10% decrease in sensitivity was observed), it is also worth noting that mixture spectra contain more information than single-peptide spectra. In our analysis of a yeast data set, M-SPLIT, MixDB, and ProbIDtree identified 5997, 5476, and 4420 unique peptides from 28417, 23930, and 19840 single-peptide spectra, respectively. However, they were able to identify 2394, 1128, and 820 unique peptides from 2567, 978, and 821 mixture spectra, thus revealing MixDB's rate of 1128/978 = 115% peptide IDs per mixture spectrum versus 5476/23930 = 23% peptide IDs per single-peptide spectrum. These results show that if dynamic range challenges can be addressed (e.g. by selectively isolating precursors of comparable abundance (33)), then protocols designed to generate mixture spectra have the potential to substantially improve peptide identifications while considerably decreasing the number of MS/MS scans required to obtain these identifications. In addition, because most second IDs from mixture spectra are of lower abundance, they are less likely to be sampled again by the instrument and may not even be detectable without a co-eluting peptide. Altogether, methods that consider more than one peptide per MS/MS spectrum can potentially double the sensitivity in recovering peptide information from mixture MS/MS spectra.
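The peptide-IDs-per-spectrum rates quoted above can be reproduced for all three tools from the counts given in the text. The following Python sketch is only a worked calculation; the dictionary layout and the function name `ids_per_spectrum` are illustrative choices.

```python
# (unique peptides, identified spectra) as reported in the text.
counts = {
    "M-SPLIT":    {"single": (5997, 28417), "mixture": (2394, 2567)},
    "MixDB":      {"single": (5476, 23930), "mixture": (1128, 978)},
    "ProbIDtree": {"single": (4420, 19840), "mixture": (820, 821)},
}


def ids_per_spectrum(unique_peptides: int, spectra: int) -> float:
    """Unique peptide identifications obtained per identified spectrum."""
    return unique_peptides / spectra


for tool, by_type in counts.items():
    single = ids_per_spectrum(*by_type["single"])
    mixture = ids_per_spectrum(*by_type["mixture"])
    print(f"{tool}: {single:.0%} per single-peptide spectrum, "
          f"{mixture:.0%} per mixture spectrum")
```

For MixDB this gives about 23% for single-peptide spectra and about 115% for mixture spectra, matching the figures in the text.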
However, this higher information content comes at a cost: the combinatorial explosion of searching for the best peptide pairs dramatically increases the search space for mixture spectra. Thus, effective filtration strategies and special consideration for controlling the FDR become crucial in achieving high precision in the identification of mixture spectra. As with the development of any mass spectrometry algorithm, a large data set of reliably identified spectra is crucial and is often hard to come by. In this study we addressed this issue with comprehensive simulations and by taking advantage of the public availability of the yeast spectral library (23) and experimental data (27). Spectral library search methods are in general considered to be more sensitive and accurate for peptide identification (29), and in this case the yeast spectral library is also comprehensive (i.e. only a small fraction of peptides identified by database search methods are not in the library). Thus we can use identifications from spectral library searches as the reference "truth" and comprehensively benchmark various aspects of different database search methods. We showed that MixDB has good sensitivity and high precision in identifying both single-peptide spectra and mixture spectra.
To study how the presence of co-eluting peptides affects database search results, mixture spectra identified by M-SPLIT were searched against the Yeast database using MixDB and InsPecT. InsPecT was able to rank one of the correct peptides as the top match in 75% of cases, whereas MixDB was able to rank one of the peptides as the top match in 95% of cases and ranked both correct peptides as the top match in 65% of cases, showing that MixDB gains 20% higher sensitivity in spectrum identification by relaxing the assumption that each spectrum comes from only one peptide. After imposing score thresholds at 1% FDR, InsPecT was able to classify 53% of cases as single-peptide matches, whereas MixDB was able to classify 24% of cases as mixture matches and an additional 43% of cases as single-peptide matches. Because each mixture spectrum contains information for two peptides, InsPecT recovers 53% × 0.5 = 27% and MixDB recovers 24% + 43% × 0.5 = 46% of the peptide information contained in mixture spectra.
The development of computational methods and the development of novel experimental strategies often rely on each other. For example, because mainstream computational approaches assume each MS/MS spectrum comes from one peptide, development of chromatography and mass spectrometry protocols has focused on making this assumption valid for most cases. However, with a new generation of algorithms that are able to identify mixture spectra, experimental protocols designed to generate mixture spectra may lead to many interesting applications. There are already several alternative data acquisition approaches that rely on mixture spectra to overcome the limitations of current instrument scan rates (13,14,15). Mixture spectra also arise from peptides that are covalently linked in the sample; examples include disulfide bridges (34), SUMOylated peptides (35), and peptides from cross-linking experiments (36,37). The identification of these cross-linked peptides can provide valuable information on protein structures and interactions. Thus solving the problem of peptide identification from mixture spectra represents an important step toward addressing related and emerging problems in proteomics.