Transferred Subgroup False Discovery Rate for Rare Post-translational Modifications Detected by Mass Spectrometry*

In shotgun proteomics, high-throughput mass spectrometry experiments and the subsequent data analysis produce thousands to millions of hypothetical peptide identifications. The common way to estimate the false discovery rate (FDR) of peptide identifications is the target-decoy database search strategy, which is efficient and accurate for large datasets. However, the legitimacy of the target-decoy strategy for protein-modification-centric studies has rarely been rigorously validated. It is often the case that a global FDR is estimated for all peptide identifications including both modified and unmodified peptides, but that only a subgroup of identifications with a certain type of modification is focused on. As revealed recently, the subgroup FDR of modified peptide identifications can differ dramatically from the global FDR at the same score threshold, and thus the former, when it is of interest, should be separately estimated. However, rare modifications often result in a very small number of modified peptide identifications, which makes the direct separate FDR estimation inaccurate because of the inadequate sample size. This paper presents a method called the transferred FDR for accurately estimating the FDR of an arbitrary number of modified peptide identifications. Through flexible use of the empirical data from a target-decoy database search, a theoretical relationship between the subgroup FDR and the global FDR is made computable. Through this relationship, the subgroup FDR can be predicted from the global FDR, allowing one to avoid an inaccurate direct estimation from a limited amount of data. The effectiveness of the method is demonstrated with both simulated and real mass spectra.

Post-translational modifications of proteins often play an essential role in the functions of proteins in cells (1). Abnormal modifications can change the properties of proteins, causing serious diseases (2). Because protein modifications are not directly encoded in the nucleotide sequences of organisms, they must be investigated at the protein level. In recent years, mass spectrometry technology has developed rapidly and has become the standard method for identifying proteins and their modifications in biological and clinical samples (3)(4)(5).
In shotgun proteomics experiments, proteins are digested into peptide mixtures that are then analyzed via high-throughput liquid chromatography-tandem mass spectrometry, resulting in thousands to millions of tandem mass spectra. To identify the peptide sequences and the modifications on them, the spectra are commonly searched against a protein sequence database (6 -8). During the database search, according to the variable modification types specified by the user, all forms of modified candidate peptides are enumerated. For each spectrum, candidate peptides (with possible modifications) from the database are scored according to the quality of their match to the input spectrum. However, for many reasons, the top-scored matches are not always correct peptide identifications, and therefore they must be filtered according to their identification scores (9). Finding an appropriate score threshold that gives the desired false discovery rate (FDR) 1 is a multiple hypothesis testing problem (10 -12).
At present, the common way to control the FDR of peptide identifications is an empirical approach called the targetdecoy search strategy (13). In this strategy, in addition to the target protein sequences, the mass spectra are also searched against the same number of decoy protein sequences (e.g. reverse sequences of the target proteins). Because an incorrect identification has an equal chance of being a match to the target sequences or to the decoy sequences, the number of decoy matches above a score threshold can be used as an estimate of the number of random target matches, and the FDR (of the target matches) can be simply estimated as the number of decoy matches divided by the number of target matches. The target-decoy method, although simple and effective, is applicable to large datasets only. When the number of matches being evaluated is very small, this method becomes inaccurate because of the inadequate sample size (13,14). Fortunately, for high-throughput proteomic mass spectrometry experiments, the number of mass spectra is always sufficiently large. Current efforts are mostly devoted to increasing the sensitivity of peptide identification at a given FDR by using various techniques such as machine learning (15).
When the purpose of an experiment is to search for protein modifications, the problem of FDR estimation becomes somewhat complex. In fact, the legality of the target-decoy method for modification-centric studies was not rigorously discussed until very recently (16). At present, for multiple reasons, the identifications of modified and unmodified peptides are usually combined in the search result, and a global FDR is estimated for them in combination, with only a subgroup of identifications with specific modifications being focused on. However, the FDR of modified peptides can be significantly or even extremely different from that of unmodified peptides at the same score threshold. There are three reasons for this fact. First, because the spectra of modified peptides can have their own features (e.g. insufficient fragmentation or neutral losses), they can have different score distributions from those of unmodified peptides. Second, because the proportions of modified and unmodified peptides in the protein sample are different, the prior probabilities of obtaining a correct identification are different for modified and unmodified peptides. Third, because the proportions of modified and unmodified candidate peptides in the search space are different, the prior probabilities of obtaining an incorrect identification are also different for modified and unmodified peptides. Therefore, the modified peptide identifications of interest should be extracted from the identification result and subjected to a separate FDR estimation, as pointed out recently (16 -18).
The difficulty of separate FDR estimations is highlighted when there are too few modified peptide identifications to allow an accurate estimation. Many protein modifications are present in low abundance in cells but play important biological functions. These rare modifications have very low chances of being detected by mass spectrometry. A crucial question is, if very few modifications are identified from a very large dataset of mass spectra, can they be regarded as correct identifications? There was no answer to this question in the past in terms of FDR control. The target-decoy strategy loses its efficacy in such cases. For example, imagine that we have 10 modified peptide identifications above a score threshold after a search and that all of them are matches to target protein sequences. Can we say that the FDR of these identifications is zero (0/10)? If we decrease the score threshold slightly in such a way that one more modified peptide identification is included but find that that peptide is unfortunately a match to the decoy sequence, then can we say that the FDR of the top 10 target identifications is 10% (1/10)? It is clear here that the inclusion or exclusion of the 11th decoy identification has a great influence on the FDR estimated via the common target-decoy strategy. In fact, according to a binomial model (14), the probability that there are one or more false identifications among the top 10 target matches is as high as 0.5, which means that the real proportion of false discoveries has a half-chance of being no less than 10% (1/10). The appropriate way to estimate the FDR of the 10 target identifications is to give an appropriate estimate of the expected number of false identifications among them, and, most important, this estimate must not be an integer (e.g. 0 or 1) but can be a real number between 0 and 1. Note that single-spectrum significance measures (e.g. p values) are not appropriate for multiple hypothesis testing, not to mention that they can hardly be accurately computed in mass spectrometry.
Separate FDR estimation for grouped multiple hypothesis testing is not new in statistics and bioinformatics. A typical example is the microarray data of mRNAs from different locations in an organism or from genes that are involved in different biological processes (19,20). Efron (21) recently proposed a method for robust separate FDR estimation for small subgroups in the empirical Bayes framework. The underlying principle of this method is that if we can find the quantitative relationship between the subgroup FDR and the global FDR, the former can be indirectly inferred from the latter instead of being estimated from a limited amount of data. The relationship given by Efron is quite general and makes no use of domain-specific information. Furthermore, it requires known conditional probabilities of null and non-null cases given the score threshold. These probabilities are, however, unavailable in the modified peptide identification problem.
This paper presents a dedicated method for accurate FDR estimation for rare protein modifications detected from largescale mass spectral data. This method is based on a theoretical relationship between the subgroup FDR of modified peptide identifications and the global FDR of all peptide identifications. To make the relationship computable, the component factors in it are replaced by or fitted from the empirical data of the target-decoy database search results. Most important, the probability that an incorrect identification is an assignment of a modified peptide is approximated by a linear function of the score threshold. By extrapolation, this probability can be reliably obtained for high-tail scores that are suitable as thresholds. The proposed method was validated on both simulated and real mass spectra. To the best of our knowledge, this study is the first effort toward reliable FDR control of rare protein modifications identified from mass spectra. (Note that the error rate control for modification site location is another complex problem (22, 23) and is not the aim of this paper.)

MATERIALS AND METHODS
Suppose that we have searched a set of mass spectra against a target-decoy protein database, with one or more types of variable modifications specified. After the database search, each spectrum was assigned a (possibly modified) peptide along with an identification score. We are now only interested in those identified peptides that carry a specific type of modification, which is denoted by the symbol k, and we aim to estimate the FDR of these modified peptide identifications according to a score threshold. The notations used for modeling are given in Table I.
Note that the definition of the FDR (as a posterior probability) in Table I is called the Bayesian FDR (21), which is equivalent to the original definition of the FDR (10) when there is a large amount of data that are independent and identically distributed. Using the Bayes rule, we have The whole dataset of the peptide identifications in practice is usually sufficiently large that FDR(x) can always be accurately estimated by using the target-decoy strategy or more complicated approaches (24,25). However, the number of identifications of k-modified peptides could be too small to support a separate, accurate estimation of FDR k (x). Fortunately, Equation 1 implies that if k ͑x͒ and ␥ k ͑x͒ are known, FDR k (x) can be indirectly inferred from FDR(x) instead of being estimated from limited data. Equation 1 is a bridge that connects the FDRs of modified and unmodified peptides. The following two subsections present methods for estimating k ͑x͒ and ␥ k ͑x͒, respectively.
Estimation of k ͑x͒-Here, k ͑x͒ is the probability that a true identification with a score greater than x is an assignment of a k-modified peptide. This probability is directly related to the proportion of spectra that are produced from k-modified peptides. This proportion is usually unknown because neither mass spectrometers nor experts can distinguish the spectra from differently modified peptides before they are identified. However, it is possible to estimate k ͑x͒ from the identification results. In fact, we can use the proportion of k-identified peptides among the target identifications as an estimate of k ͑x͒.
where N(x) is the number of target identifications with scores greater than x and N k (x) is the number of target identifications of modification k with scores greater than x. Equation 2 might appear to be strange because there is a term FDR k (x) in it that is unknown, and this is why we must estimate k ͑x͒. Fortunately, by replacing k ͑x͒ in Equation 1 with k ͑x͒ in Equation 2 we get and therefore This estimate leaves ␥ k ͑x͒ as the only unknown factor on the right-hand side of the equation.
Estimation of ␥ k ͑x͒-Here, ␥ k ͑x͒ is the probability that a spectrum will be identified as a peptide with modification k given that the identification is false and the score is greater than x. This probability is closely related to the proportion of candidate peptides with modification k in the search space, and it is reasonable to use this proportion as an estimate of ␥ k ͑x͒. However, the search space of candidate peptides is hidden from the users, and different search engines could use different implementations when generating candidate pep- An identification of a peptide carrying modification k, or a k-modified peptide for short x The score threshold used to filter identifications FDR(x) ϭ P(F|X Ͼ x): the global FDR of all (modified and unmodified) peptide identifications with scores greater than x FDR k (x) ϭ P(F|I k, X Ͼ x): the subgroup FDR of k-modified peptide identifications with scores greater than x k (x) ϭ P(I k |T, X Ͼ x): the probability that a spectrum is identified as a k-modified peptide given that the identification is true and the score is greater than x ␥ k (x) ϭ P(I k |F, X Ͼ x): the probability that a spectrum is identified as a k-modified peptide given that the identification is false and the score is greater than x tides. For example, search engines could have their own ways of limiting the maximal number of modifications on a candidate peptide to avoid the combinatorial explosion of the search space. Moreover, the proportion of candidate peptides is a constant that is independent of the identification score x, whereas ␥ k ͑x͒ might not be, as shown by Fig. 1. This paper employs a data-driven approach to the estimation of ␥ k ͑x͒. We know that in the target-decoy database search method, all of the matches to the decoy sequences are definitely false identifications. These false identifications constitute natural training data for estimating ␥ k ͑x͒. According to the observations on real data (see the examples in Fig. 1), ␥ k ͑x͒ can be approximated by a linear function of x, where a and b are coefficients to be determined. Given the results of a database search and an arbitrary value of x, calculating the proportion of k-modified peptides among decoy identifications with scores above x is straightforward. The value of x and the calculated proportion constitute a sample of the linear function. Changing the value of x generates a dataset of training samples from which the estimates of the two coefficients in Equation 7, denoted by â and b , can be readily obtained using least-squares regression. As a result, the final esti- This estimate is called the transferred FDR for k-modified peptides, indicating that it is derived from the global FDR rather than estimated completely from data.
When the score threshold x is large, the decoy k-modified peptides will be few, and their proportion will be unstable. As shown in Fig. 1, the major part of the observed data shows good linearity, whereas the high-score tail fluctuates seriously. This fluctuation would make the direct separate FDR estimation very inaccurate. For this reason, we cannot use the decoy identifications above the score threshold x to estimate FDR k (x) directly, as noted in the Introduction. In fact, it is better not to use large x values when fitting the function in Equation 7. The value of ␥ k ͑x͒ for large x (e.g. the value that is suitable to be the score threshold) should be extrapolated from the fitted function.
The values of a and b depend not only on the dataset, but also on the searched database and search parameters. Therefore, they should be estimated on the fly for each search. Moreover, from Fig. 1 we can see that the fitted value of a shows different tendencies for different search engines (i.e. negative for Mascot, positive for SEQUEST, and close to zero for pFind). This interesting phenomenon implies that Mascot seems to penalize modified peptides, SEQUEST seems to encourage modified peptides, and pFind seems to treat modified and unmodified peptides fairly.
In order to gain an intuitive idea of the methodology, let us consider an example. In this example, 15,100 spectra of known peptide identities, including 100 spectra of phosphorylated peptides, were searched against a target-decoy database (for more details about the data and the experiment, see the simulated data in "Results"). Above the score threshold x ϭ 37, there are 3249 target identifications (3205 unmodified and 44 phosphorylated peptide identifications) and 3 decoy identifications (all unmodified peptide identifications). Then, we have the global FDR estimated as FDR͑37͒ ϭ 3/3249 Ϸ 0.00092 and the separate phosphorylation FDR estimated as 0/44 ϭ 0. By fitting Equation 7 to all of the decoy identifications, we obtain â ϭ Ϫ0.01 and b ϭ 0.6957, and thus ␥ k ͑37͒ ϭ Ϫ0.01 ϫ 37 ϩ 0.6957 ϭ 0.3257. According to Equation 8, the transferred FDR is FDR k ͑37͒ ϭ ͑3249/44͒ ϫ 0.3257 ϫ 0.00092 Ϸ 0.0222. When we examined the 3249 target identifications with scores above 37, we found that 5 of them were false, including 1 phosphorylated peptide identification. Therefore, the actual false positive proportion (FDP) of the phosphorylated peptide identifications was 1/44 Ϸ 0.0227, which is very close to the estimated transferred FDR (i.e. 0.0222) but very different from the estimated global FDR (i.e. 0.00092) or separate phosphorylation FDR (i.e. 0). Note that the actual FDP of all identifications with scores above 37 was 5/3249 Ϸ 0.0015, which is close to the estimated global FDR.
In this study, the accuracy of FDR estimation was assessed by comparing the estimated FDR to the actual FDP. Obviously, a better comparison should be made between the estimated FDR and the actual FDR. However, the actual FDR is in general unknown, because it is by definition the expected FDP, which is not computable in the peptide identification problem. In proteomics, FDR and FDP are often meant to be the same thing, but it is important to keep in mind that they are conceptually different.

RESULTS
In order to validate the above algorithm using mass spectral data, we must be able to accurately judge the correctness of identified modifications so as to compare the estimated FDR to the actual FDP. However, the objective judgment is in general absent for complex protein samples extracted from biological organisms. Therefore, we relied on mass spectra generated from (i) theoretical simulation, (ii) synthesized peptides, and (iii) purified standard proteins.
Three methods for estimating the subgroup FDR of modified peptide identifications are compared: Global FDR: the FDR that is estimated by using the common target-decoy strategy on all peptide identifications, including modification-containing and modification-free ones.
Separate FDR: the FDR that is estimated by using the common target-decoy strategy solely on identifications with modifications of interest.
Transferred FDR: the FDR that is estimated by using the algorithm described in "Materials and Methods." Three types of modifications were tested: phosphorylation, carbamylation, and acetylation, with an emphasis on phosphorylation. These types of modifications might not be very rare in nature but are used here to validate the methodology. It is expected that the conclusions obtained with respect to them will apply to other types of modifications. In addition, it is worthwhile to note that phosphorylation was defined as a new type of modification (with a new name) with a 79.966331 Da mass shift on the amino acids S, T, and Y and without specification of neutral losses; thus, the search engines did not know it was phosphorylation. In all of the experiments, the most widely used software for peptide identification, Mascot (version 2.2) (7), was used to identify the spectra. Two other search engines, SEQUEST (version 2.7) (6) and pFind (version 2.6) (8,26,27), were additionally used to demonstrate the linearity of the ␥ k ͑x͒ function (Fig. 1). The data used in this paper and the program for FDR estimation are available upon request.
Simulated Data-The simulated data were part of the data used in Ref. 16. Five subsets of simulated spectra were used, each of them containing 10,000 mass spectra. The spectra in the first subset (S ph1,10k ) were predicted from peptides with phosphorylations on S, T, or Y. The spectra in the second subset (S car,10k ) were from peptides with carbamylations on peptide N termini. The spectra in the third subset (S ace,10k ) were from peptides with acetylations on protein N termini. The spectra in the fourth subset (S non,10k ) were from peptides without modifications. The fifth subset (S out,10k ) included extra spectra from unrelated, unmodified peptides. The protein database contained 100,000 target protein sequences and 100,000 decoy protein sequences. All of the sequences were randomly generated using a Markov chain model trained on the UniProt protein sequence database. The peptides in the first four spectrum subsets, S ph1,1k , S car,10k , S ace,10k , and S non,10k , were from the target protein sequences, whereas the peptides in the last spectrum subset, S out,10k , were from protein sequences that were not in the database. The peptides that were used for spectrum simulation had unique sequences within each subset and were assumed to carry two charges. Note that if not achieved via simulation, such largescale nonredundant modification spectra would have hardly been available.
In spectrum simulation, singly charged b-and y-type fragment ions were predicted, with Gaussian random noises added to their theoretical mass-to-charge ratio (m/z) values. The intensities of the fragment ions were randomly sampled, with a random proportion of them forced to be zero (to mimic the missing peaks). Additionally, a number of random peaks were added as noise. More details about the data and the database can be found in Ref. 16.
Phosphorylation, peptide N-terminal carbamylation, and protein N-terminal acetylation increase the number of candidate peptides by decreasing degrees. These modifications were first tested individually. In other words, only one of them was set as the variable modification parameter in a database search. For each type of modification, the spectra containing it were added in for searching in increasing numbers (n ϭ 1, 10, 100, 1000, 10,000) with the aim of varying the value of k ͑x͒. For each value of n, the error distributions of the three FDR estimation methods were determined by bootstrapping the spectra with 10,000 trials. In each trial, n spectra that were randomly sampled from the subset of spectra containing the current type of modification and 15,000 spectra that were randomly selected from the subsets containing no modifications (S non,10k and S out,10k ) were searched together. Not all of the spectra in S non,10k and S out,10k were used in order to avoid the possibly dominant influence of some special spectra on the result. Mascot was used to identify the spectra. The precursor and fragment mass matching tolerances were Ϯ3 Da and Ϯ0.5 Da, respectively, and trypsin was specified for in silico protein digestion with up to two missed cleavages allowed. After each database search, the modified peptide identifications were filtered according to their scores, and the FDR was estimated using the three methods. Different score thresholds were used for different methods so that the estimated FDR was not more than 1%, and at the same time, the number of identifications was maximized. Finally, the FDR (Յ1%) that was estimated via each method was compared with the corresponding actual FDP, and the FDR estimation error was calculated as the former minus the latter. For example, suppose that there are five modified peptide identifications returned above a score threshold in a search, and all of them are target matches, with one of them being a random match (therefore, the FDP is 1/5 ϭ 20%). Then, if we use the separate FDR, the estimated FDR for the five modified peptide identifications is 0/5 ϭ 0%. Therefore, the FDR estimation error for this case is 0 Ϫ 20% ϭ Ϫ20%. When the experiment is repeated 10,000 times by bootstrap, we obtain 10,000 FDR estimation errors, which constitute an empirical error distribution.
Tables II through IV compare the three FDR estimation methods for phosphorylation, carbamylation, and acetylation, respectively, in terms of the average numbers of all and false modified peptide identifications as well as the mean and standard deviation of the error distribution. As shown by Table  II, when the number (n) of phosphorylation spectra was small (mimicking the rareness of protein modifications e.g. n ϭ 1, 10, or 100), the FDR of phosphorylated peptide identifications was dramatically underestimated by the global FDR. The estimates given by the separate FDRs were better but still deviated greatly from the actual FDP for small n. The two methods performed better with increasing n and became satisfactorily accurate (a mean error of Ͻ1%) when n was sufficiently large (10,000 for the global FDR and 1000 for the separate FDR). However, for all of the tested values of n, the transferred FDR accurately estimated the FDR of phosphorylated peptide identifications. Moreover, the identification sensitivity was not sacrificed. In other words, significant numbers of phosphorylated peptides were identified with the transferred FDR. Although many more phosphorylated peptides were identified with the global FDR and the separate FDR, these two methods incurred out-of-control error rates.
On carbamylation, the transferred FDR also performed the best among the three methods, as shown by Table III. Again, the estimated global FDR and the separate FDR were on average significantly lower than the actual FDP of carbamylated peptides. With acetylation, the advantages of the transferred FDR were not as clear as with the former two modifications, as shown by Table IV. The transferred FDR performed similarly to the separate FDR. The degraded performance of the transferred FDR for acetylation was possibly due to the small proportion of acetylated peptides in the search space, which limited the accuracy of the estimated ␥ k ͑x͒. In addition, an interesting phenomenon observed is that when the number of acetylation spectra was large, the global FDR seriously overestimated the FDR of acetylated peptide identifications Notes: n, the number of searched spectra of phosphorylated peptides; Ave. # of I.D.s., average number of false/all identifications of phosphorylated peptides from the target database at 1% estimated FDR; Est. error, FDR estimation error (i.e. the estimated FDR minus the actual FDP); Mean and S.D., mean and standard deviation of the estimation errors as a percentage.
(the estimated FDR was ϳ1%, whereas the actual FDP was close to zero), and as a result, the number of acetylated peptide identifications was significantly less than with the other two methods. For example, when n ϭ 10,000, an average of ϳ6750 acetylated peptide identifications per trial were obtained with the separate FDR or the transferred FDR, whereas this number was only 5255 with the global FDR, which corresponds to a reduction of 22% in the sensitivity of acetylation detection.
The above experiments were conducted when the FDR control level was set at 1% and only one type of modification was specified in each search. To investigate whether the proposed algorithm worked for more extensive situations, two additional experiments were performed. In the first experiment, the FDR control level was set at 1%, 2%, 3%, 4%, and 5%, and phosphorylation was specified as the variable modification for searches. Fig. 2 depicts box plots of the FDR estimation error distributions; it shows that the transferred FDR was consistently the best among the three methods. Interestingly, the accuracy of the global FDR decreased dramatically with an increasing FDR control level. In the second experiment, phosphorylation, carbamylation, and acetylation were specified as the variable modifications in a single search, and three forms of phosphorylated peptides were considered for FDR control. Fig. 3 shows that the approximate linearity of ␥ k ͑x͒ with regard to the score threshold x held for multiply modified peptides. Table V gives the results of FDR estimation. It is clear that only the transferred FDR accurately estimated the FDRs of all forms of phosphorylated peptide identifications. Note that because every spectrum contained at most one type of modification, any phosphorylated peptide identifications with co-occurring carbamylations and/or acetylations were definitely false identifications.
Synthesized Peptide Data-The second dataset was from the iPRG 2012 study conducted by the Proteome Informatics Research Group of the Association of Biomolecular Resource Facilities. The goal of this study was to evaluate the data analysis capabilities of proteomics researchers in identifying post-translational modifications present at substoichiometric levels within a complex peptide-mixture background. The background of the sample was a proteome lysate of yeast. Synthesized tryptic peptides from non-yeast proteins were spiked into the tryptic yeast lysate. The spiked-in peptides carried a variety of modifications including phosphorylation. The peptide mixture was analyzed by an AB Sciex 5600 TripleTOF mass spectrometer (AB SCIEX, Concord, ON) interfaced with a Waters nanoAcquity UPLC system (Waters, Millford, MA), and a total of 18,009 tandem mass spectra were produced. Twenty-four participants analyzed the data and submitted their results, from which the organizers compiled a list of consensus spectra identified by three or more participants agreeing on peptide sequences. In the current study, 7877 spectra of unmodified yeast peptides and 179 spectra of spiked-in phosphorylated peptides were extracted from the consensus spectra for the experiments. Charges of two and three were assumed for spectra that had undetermined charge states. Other types of modifications were not considered because they had too few consensus spectra for statistical analysis. More details about the dataset and the study can be found at the Association of Biomolecular Resource Facilities website. The spectra were searched against the UniProt protein database catenated with the reverse sequences as decoys.
Note that the small database provided by the iPRG 2012 study was not appropriate for search here, because it would have resulted in correct identification of almost all of the consensus spectra, which would have made the FDR analysis impossible. The precursor and fragment mass matching tolerances were Ϯ20 ppm and Ϯ0.05 Da, respectively, and trypsin was specified for protein digestion. After the database search, only peptides of more than eight amino acids were reserved in order to avoid short peptides occurring in both forward and reverse sequences in the enormous UniProt database. Similar to the experiments on simulated data, the spectra were sampled for 10,000 trials to generate the error distributions of FDR estimation. In each trial, 30 phosphorylation spectra and 5000 modification-free spectra were randomly selected for search. The FDR control level was set at 1% for all of the FDR estimation methods, and the error was computed as the estimated FDR minus the actual FDP. Fig. 4 shows the error distributions obtained with the three methods. Similar to the results on simulated data, with the global FDR or the separate FDR, the FDR of phosphorylated peptide identifications was greatly underestimated. With the transferred FDR, the FDR was slightly overestimated (within 1%) in most cases and was underestimated by ϳ5% in a few cases. On average, 14 phosphorylated peptides were identified per trial with the transferred FDR. Obviously, the transferred FDR performed best among the three methods. When the FDR control level was set at other values (e.g. 5%), similar results were observed (data not shown).
Standard Protein Data-The third dataset was from the widely used Standard Protein Mix Database, which is a diverse mass spectrum dataset designed for testing peptide and protein identification software tools (28). The proteolytic peptides in a mixture of tryptic digests of 18 purified standard proteins were analyzed using different mass spectrometers and under various conditions. The data used in this paper were from an analysis of the third mixture (mix3) on a Thermo Scientific LTQ-FT mass spectrometer (Thermo Scientific, Bremen, Germany). Ten liquid chromatography-tandem mass spectrometry runs produced 40,376 tandem mass spectra. The raw data can be downloaded from the Seattle Proteome Center Public Data Repository. Although the proteins in this sample were standard proteins, a wide range of modifications were indeed detected, which might have been introduced by sample handling or from contaminant proteins (29,30). Specifically, there were a few phosphorylation spectra that were very appropriate for the testing purposes of this paper.
The protein sequence database against which the above data were searched included the sequences of the 18 standard proteins, 79 contaminant proteins, and 6758 yeast proteins. The precursor and fragment mass matching tolerances were Ϯ10 ppm and Ϯ0.5 Da, respectively, and trypsin was specified as the digestion enzyme. The yeast sequences were added as background. Any match to the yeast proteins was expected to be a false identification. In contrast, a match to the standard or contaminant proteins had a very small chance of being a false identification. This chance could be approximated based on the ratio of the number of theoretical peptides from the standard or contaminant proteins to the total number of theoretical peptides from all proteins. By calculation (trypsin digestion with up to two missed cleavage sites allowed), this ratio is 10,947/1,089,961 Ϸ 0.01004. Therefore, the number of false identifications above a score threshold can be computed as n 1 ϩ 0.01004n 2 , where n 1 is the number of matches to yeast proteins and n 2 is the number of matches to standard or contaminant proteins.
The FDR control level was set at 1% to 5%. At each level, 10,000 trials were run. In each trial, 25,000 spectra were randomly sampled for search. Table VI compares  phorylated peptide identifications. We can see that the global FDR was, again, too optimistic an estimate of the FDR of phosphorylated peptide identifications, which resulted in failed FDR control. In contrast, both the separate FDR and the transferred FDR successfully controlled the FDR of phosphorylated peptide identifications (note that overestimation is a successful control). More phosphorylated peptides were identified with the transferred FDR than with the separate FDR. The reason for this is that some decoy phosphorylated peptide identifications had even higher scores than the highest scored, target, but false phosphorylated peptide identification. Consequently, the correct target phosphorylated peptide identifications with lower scores than the decoy phosphorylated peptide identifications could not pass the score threshold set by the separate FDR. According to the transferred FDR, these high-score decoy matches occurred solely by chance and did not justify an equal number of false target matches at the given score threshold. For example, in one trial, the 21st and 22nd highest-scored phosphorylated pep-tide identifications were both matches to decoy sequences. When the FDR control level was set at 1%, only the first 20 highest-scored phosphorylated peptide identifications were accepted and the estimated separate FDR was 0/20 ϭ 0%. When the FDR control level was set at 5%, the 21st highest-scored phosphorylated peptide identifications passed the score threshold and the estimated separate FDR was 1/20 ϭ 5% (note that the number of reported phosphorylated peptide identifications was still 20, because the 21st highest-scored phosphorylated peptide identification was a decoy match and was discarded). CONCLUSIONS This paper presents a solution to the problem of accurate FDR estimation for rare protein modifications detected via high-throughput tandem mass spectrometry. Through flexible use of the empirical data from target-decoy database searches, a computable relationship is derived between the subgroup FDR of modified peptide identifications and the global FDR of all peptide identifications. Through this relation-  Notes: Ph., all phosphorylated peptide identifications; Ph. only, phosphorylated peptide identifications without co-occurrences of other types of modifications; Ph. and Ca./Ac, phosphorylated peptide identifications with co-occurring carbamylations and/or acetylations (for definitions of other abbreviations, refer to Table II). In this experiment, for each of 10,000 trials, 15,000 spectra containing no modifications, 50 containing phosphorylations, 50 containing carbamylations, and 50 containing acetylations were randomly selected for search. ship, the FDR of rare modifications can be transferred from the global FDR, which can be accurately estimated via existing methods, thereby avoiding an inaccurate direct estimation from inadequate data. Experimental results demonstrate that the FDR of modified peptide identifications can be successfully controlled with the transferred FDR without sacrificing sensitivity, whereas the global FDR and the direct separate FDR present large biases when the modifications of interest are rare. Finally, it is worthwhile to note that the proposed method is in principle adaptable to other small-subgroup FDR estimation problems in proteomics, such as the identification of peptides containing specific amino acids or cross-linked peptides, in which cases the mass spectra of such peptides could be rare in a dataset.