A Non-parametric Cutout Index for Robust Evaluation of Identified Proteins*

This paper proposes a novel, automated method for evaluating sets of proteins identified using mass spectrometry. The remaining peptide-spectrum match score distributions of protein sets are compared to an empirical absent peptide-spectrum match score distribution, and a Bayesian non-parametric method reminiscent of the Dirichlet process is presented to accurately perform this comparison. Thus, for a given protein set, the process computes the likelihood that the proteins identified are correctly identified. First, the method is used to evaluate protein sets chosen using different protein-level false discovery rate (FDR) thresholds, assigning each protein set a likelihood. The protein set assigned the highest likelihood is used to choose a non-arbitrary protein-level FDR threshold. Because the method can be used to evaluate any protein identification strategy (and is not limited to mere comparisons of different FDR thresholds), we subsequently use the method to compare and evaluate multiple simple methods for merging peptide evidence over replicate experiments. The general statistical approach can be applied to other types of data (e.g. RNA sequencing) and generalizes to multivariate problems.

Inferring the protein content from these fragment ion spectra is difficult, and statistical methods have been developed with that goal. Protein identification methods (5-8) rank proteins according to the probability of their being present in the sample. Complementary target-decoy methods evaluate the proteins identified by searching fragmentation spectra against proteins that might be present (targets) and proteins that are absent (decoys). An identified target protein counts as a correct identification (increasing the estimated sensitivity), whereas each identified decoy protein counts as an incorrect identification (lowering the estimated specificity).
Current target-decoy methods estimate the protein-level false discovery rate (FDR) for a set of identified proteins (9,10), as well as the sensitivity at a particular arbitrary FDR threshold (11); however, these methods have two main shortcomings.
First, current methods introduce strong statistical biases, which can be conservative (10) or optimistic (12) in different settings. These biases make current approaches unreliable for comparing different identification methods, because they implicitly favor methods that use similar assumptions. Automated evaluation tools that can be run without user-defined parameters are necessary in order to compare and improve existing analysis tools (13).
Second, existing evaluation methods do not produce a single quality measure; instead, they estimate both FDR and sensitivity (which is estimated using the "absolute sensitivity," which treats all targets as present and counts them as true identifications). For data sets with known protein contents (e.g. the protein standard data set considered), the absolute sensitivity is estimable; however, for more complex data sets with unknown contents, the measurement indicates the relative sensitivity. Even if one ignores statistical biases, there is currently no method for choosing a non-arbitrary FDR threshold, and it is currently not possible to decide which protein set is superior-one with a lower sensitivity and stricter FDR, or another with a higher sensitivity and less stringent FDR. The former is currently favored but might result in significant information loss. Arbitrary thresholds have significant effects: in the yeast data analyzed, 1% and 5% FDR thresholds, respectively, yielded 1289 and 1570 identified protein groups (grouping is discussed in the supplementary "Methods" section). Even with such a simple data set, this subtle change results in 281 more target identifications, of which unknown subsets of 66 (0.05 ϫ 1570 Ϫ 0.01 ϫ 1289 Ϸ 66) are expected to be false identifications and 215 are expected to be true identifications (281 Ϫ 66 ϭ 215).
Here we introduce the non-parametric cutout index (npCI), a novel, automated target-decoy method that can be used to compute a single robust and parameter-free quality measure for protein identifications. Our method does not require prior expertise in order for the user to select parameters or run the computation. The npCI employs target-decoy analysis at the PSM level, where its assumptions are more applicable (4). Rather than use assumptions to model PSM scores matching present proteins, our method remains agnostic to the characteristics of present proteins and analyzes PSMs not explained by the identified proteins. If the correct present set of proteins is known, then the distribution of remaining, unexplained PSM scores resembles the decoy distribution (14). We extend this idea and present a general graphical framework to evaluate a set of protein identifications by computing the likelihood that the remaining PSMs and decoy PSMs are drawn from the same distribution ( Fig. 1).
Existing non-parametric statistical tests evaluating the similarity between two collections of samples (e.g. Kolmogorov-Smirnov test, used in Ref. 14, and the Wilcoxon signed rank test) were inadequate because infrequent but significant outliers (e.g. high-scoring PSMs) are largely ignored by these methods. Likewise, information-theoretic measures, such as the Kullback-Leibler divergence, were inadequate because they require a prior on the smoothing parameter that weighs more smoothing and higher similarity against less smoothing and lower similarity (a problem reminiscent of the original compromise between sensitivity and FDR); without the application of such a prior, the optimal Kullback-Leibler divergence occurs with infinite smoothing, which will make any distributions equal, rendering them completely uninformative and thus making it impossible to distinguish one identified protein set from another. For these reasons, we derived a novel, Bayesian, non-parametric process to compute the likelihood that two continuous collections are drawn from the same distribution. It can be used to provide a robust and efficient evaluation of discoveries.

EXPERIMENTAL PROCEDURES
We applied the npCI to two problems for which there is no current automated and parameter-free solution. First, we estimated the number of proteins present in two previously published data sets analyzed using PeptideProphet: the UPS1 protein standard (15), and unfractionated yeast lysate (2). Protein standards like the UPS1, which contains 48 human proteins and common contaminants, often produce a crisp "elbow" (or localized point of diminishing returns) in the sensitivity-FDR relationship; we hypothesized that the npCI would select a protein set close to the sensitivity-FDR elbow. The yeast data were selected because we had no prior knowledge of the sample content. Complex mixtures produce a diffuse elbow in the sensitivity-FDR curve; thus, it is almost impossible to manually identify the point with the best sensitivity-to-FDR ratio.
Second, we evaluated simple "peptide rule" methods for merging peptide evidence from replicate experiments on HeLa lysate using Mascot.
Sample Preparation and Peptide Search-Existing Data-Both UPS1 (15) and yeast (2) data were searched using SEQUEST (16), scored using PeptideProphet (17) (as described elsewhere (7)), and ranked according to the highest-scoring associated peptide. The UPS1 data set used a database including human target proteins and reversed decoy proteins. The yeast data set used a yeast ORF target database and a shuffled decoy database. The minimum allowed peptide score was 0; peptides reported to have lower scores in the original publication were thresholded to 0.

HeLa Sample Preparation and LC-MS/MS-Cell Growth and
Harvesting of HeLa Cells-In brief, HeLa cells were propagated in Dulbecco's modified Eagle's-F12 medium supplemented with 10% fetal bovine serum. Upon achieving Ͼ 90% confluency, the growth media was aspirated from the 10-cm dish and the cells were washed three times with ice-cold phosphate-buffered saline (PBS). The cells were dislodged with non-enzymatic CellStripper, harvested following the addition of 10 ml PBS, and pelleted via centrifugation at 3000 ϫ g for 5 min at 4°C, after which the supernatant was removed.
Cell Lysis and Protein Extraction-One milliliter of TBSp (50 mM Tris, 150 mM NaCl, pH 7.4, supplemented with 1X Roche Complete protease inhibitors), 1% Triton X-100, and 0.5% SDS were added to each cell pellet. Cells were homogenized with 12 passages through a 27-gauge needle (1.25 inches long) and incubated on ice with gentle agitation for one hour. The homogenate was sedimented via ultracentrifugation at 100,000 ϫ g for 60 min at 4°C. The protein concentration of the supernatant was determined using the bicinchoninic acid assay (ThermoFisher Scientific, catalog number 23225). The protein concentration was adjusted to 2 mg/ml using TBSp. Protein Under the supposition that the identified protein set (blue) is present, all peptides matching those proteins (also blue) might be present and have an unknown score distribution. When the correct set of proteins is identified, the remaining peptides (i.e. those not matching any shaded proteins in this figure) have a score distribution resembling that of absent peptides. Thus, the similarity of the remaining peptide score distribution (red dashed line) to the absent peptide score distribution (black solid line) determines the quality of the identified proteins.
was reduced in 20 mM TCEP for 30 min at 56°C. After cooling to room temperature, proteins were alkylated with 1% acrylamide for 30 min.
Acetone Precipitation-Ice-cold 100% acetone (four sample volumes; 500 l total) was added to 125 l of sample (ϳ250 g of protein), vortexed briefly, and incubated at Ϫ20°C for three hours. The samples were then centrifuged at 20,000 ϫ g for 30 min at 4°C. The supernatants were carefully aspirated, and the pellets were allowed to air dry at room temperature.
Tryptic Digestion-For tryptic digestion, the sample was resuspended in 50 mM ammonium bicarbonate, pH 8.1. The sample was incubated with 2.5 g of trypsin for 16 h. Following the incubation, the reaction was acidified with formic acid to a final concentration of 0.1% and evaporated via vacuum centrifugation. To remove interfering substances prior to mass spectrometry, peptides were isolated with C18 spin columns per the manufacturer's instructions. Samples were again vacuum centrifuged to dryness and stored at Ϫ80°C until analysis. Prior to mass spectrometric analysis, the peptides were resuspended in sample loading buffer (5% formic acid, 5% acetonitrile, 90% water).
Mass Spectrometry-The peptides were subjected to fractionation using reversed-phase high performance liquid chromatography (Thermo Scientific, Waltham, MA, and Eksigent, Dublin, CA), and the gradient-eluted peptides were analyzed using a hyphenated linear trap quadrupole (LTQ) or Orbitrap Classic hybrid mass spectrometer (Thermo Scientific, Waltham, MA). The liquid chromatography columns (15 cm ϫ 100 m inner diameter) were packed in-house (Magic C18, 5 m, 100 Å beads, Michrom BioResources, Auburn, CA) into PicoTips (New Objective, Woburn, MA). Samples were analyzed with a 120-min linear gradient (0%-35% acetonitrile with 0.2% formic acid), and data were acquired in a data-dependent manner in which MS/MS fragmentation was performed using the six most intense peaks of every full MS scan. Ten technical replicates were run on each of the instruments (LTQ and Orbitrap Classic).
HeLa Mascot Search-MGF files were converted by selecting the 200 most intense peaks with msconvert. Spectra were searched against Human UniProt (Swiss-Prot). The database was appended with common contaminants (using the common Repository of Adventitious Proteins, or cRAP proteins, listed in the supplementary material). The database also included decoy entries either reversed using in-house software or shuffled with Mascot 2.3.02 (i.e. spectra were searched using target-decoy competition). These databases (regardless of reversal or shuffling) contained 71,614 proteins (including contaminants and decoys). Tryptic peptides with up to one missed cleavage were allowed. A Ϯ1.0 Da precursor mass tolerance was used for the LTQ, and a 10 ppm mass tolerance was used for the Orbitrap (using monoisotopic masses). Both instruments used a Ϯ0.8 Da fragment mass error tolerance. A search was performed using a reversed human decoy database and was repeated on the shuffled human decoy database in supplemental Figs. S1 and S2. Three variable modifications-deamidation (NQ), propionamide (C), and oxidation (M)-were permitted.
A Non-parametric Bayesian Approach to Evaluation-By design, absent target PSMs should closely resemble decoy PSMs (decoy PSMs are constructed to emulate absent PSMs); we exploited this fact to propose a graphical method for evaluating protein identifications by computing the likelihood that the remaining PSMs and decoy PSMs will be drawn from the same distribution (Fig. 1). By supposing that a certain protein set is present, the method marks PSMs that might depend on those proteins and which thus might be drawn from the uncharacterized distribution of potentially present PSM scores; however, if the correct set of proteins is identified, then the unmarked PSM scores (i.e. those not dependent upon the present protein set) should be drawn from the distribution of absent PSM scores, which resemble decoy scores. Thus, the identified set of proteins can be evaluated by computing the probability that the remaining (i.e. supposed absent) PSM scores are drawn from the same distribution as the decoy PSM scores.
As described above, existing measures for computing the probabilistic or information-theoretic similarity between the remaining and decoy PSM scores were inadequate for these data. For this reason, we derived a novel Bayesian measure of probabilistic divergence. This novel Bayesian method computes a likelihood that follows directly from the supposition that the remaining target PSM scores (denoted by D ͑absent͒ ) are drawn from the decoy score distribution (denoted by pdf D ͑decoy͒) and that the decoy PSM scores (denoted by D ͑decoy͒ ) are drawn from the remaining target score distribution (denoted by pdf D ͑absent͒).
Pr͑D ͑absent͒ and D ͑decoy͒ are drawn from the same where pdf denotes the probability density function. If PSM scores were restricted to two categories (low-scoring and high-scoring), then Pr͑D ͑absent͒ Խpdf D ͑decoy͒͒ and Pr͑D ͑decoy͒ Խpdf D ͑absent͒͒ could each be computed with a simple binomial; however, because there are an infinite number of possible PSM scores, we computed these probabilities using a novel method reminiscent of a Dirichlet process that can be thought of as the generalization of the binomial from two categories to an infinite number of categories. Our method requires no prior knowledge of the distribution families and has no free parameters; even the parameter specifying the degree of smoothing is chosen internally in an automated, probabilistic manner, without the use of a prior. This Bayesian method is described in greater detail in the supplementary "Methods" section.
As implemented in the python prototype, the method took less than a minute to process each data set from this manuscript (replicate data sets, such as the HeLa data, can be processed in parallel when enough cores/processors are available). The most computationally expensive task performed was the smoothing of each proposed set of remaining PSM scores (i.e. D ͑absent͒ ) into its corresponding probability density (i.e. pdf D ͑absent͒). Although the method is currently efficient in practice, it can be substantially sped up by the use of a compiled language for numerical processing and by exploiting linearity in the smoothing process to smooth in only the new PSM scores when recomputing pdf D ͑absent͒. The increased speed will permit more efficient computation of likelihood curves more highly resolved than those shown in Fig. 2.

RESULTS
Figs. 2A-2F apply the npCI to the UPS1 standard (first row) and yeast (second row). The first column ( Figs. 2A and 2D) compares the distribution of peptide scores from empirically absent peptides to the peptides that remain when too few, too many, and the most likely set of proteins are identified. Too few protein identifications yield remaining peptides enriched for high scores; conversely, too many protein identifications yield too few high-scoring remaining peptides. Both are implausible because they differ from the empirical distribution of absent peptide scores. The most likely protein set has remaining target peptides with a score distribution highly similar to that of decoys. The second column (Figs. 2B and 2E) shows the relationship between the likelihood and the number of identified protein groups. The third column (Figs. 2C and 2F) shows the relationship between the protein-level FDR and the number of identified target protein groups (which approximates the sensitivity).
We also used the npCI to evaluate methods for aggregating evidence from 10 replicates of HeLa lysate processed on two instruments (LTQ and Orbitrap Classic). We considered all protein rankings requiring an identified protein match of at least k peptides in at least n replicate experiments, where k and n are free parameters. We chose the most likely protein set from the most likely ranking (Fig. 3). DISCUSSION As hypothesized, the npCI chose a protein set on the elbow of the UPS1 FDR-sensitivity curve (the most likely point has a protein FDR of ϳ0%). This is remarkable, because the npCI does not use the protein-level target-decoy information in Fig.  2C. The yeast data had a less distinct elbow; nevertheless, the npCI found a distinct maximum likelihood (protein FDR of ϳ7%). Fig. 2G demonstrates the npCI's improved reproducibility relative to the two-peptide rule, with peptide q-values below 1%. For this data set, evaluation with npCI revealed that the ranking produced by using the one-peptide rule with a variable q-value (k ϭ 1, n ϭ 1) is superior to the rankings obtained with the other peptide rules, including the two-peptide rule (Fig. 3). Because the samples are identical, the protein inference results should be the same for both instruments; high-mass-accuracy instruments simply have a better tradeoff between sensitivity and specificity. Rather than dramati-FIG. 2. Using the npCI to evaluate protein identifications on the UPS1 protein standard (first row) and yeast extract (second row) and on 10 replicates of HeLa lysate (third row). For different numbers of identified proteins, A and D show the log peptide score distributions, B and E show the likelihood (relative to the most likely, indicated by an asterisk), and C and F show the protein FDR. G, Venn diagrams (to scale) showing the overlap in the human proteins from HeLa replicates identified with the npCI and the overlap produced by using the two-peptide rule with peptide q-values below 1%. cally lower the sensitivity for the low-mass-accuracy LTQ instrument, the npCI simultaneously optimizes both the sensitivity and the FDR and chooses a similar set of proteins by slightly increasing the FDR. These results are robust with shuffled and reversed decoy databases (supplemental Figs. S1 and S2) generated via Mascot.
These results also confirm the recent assertion that a strict peptide threshold of one peptide (k ϭ 1) might be superior to a more lenient peptide threshold of two peptides (k ϭ 2) (8) (the superiority is also reinforced by the approximate FDRsensitivity relationship shown in supplemental Fig. S2). It is also noteworthy that proteins identified in two replicates with one peptide each (n ϭ 2, k ϭ 1) yielded significantly higher likelihoods than proteins identified in a single replicate with two peptides (n ϭ 1, k ϭ 2). When specificity is paramount, corroborating evidence from replicate experiments might be superior to data from multiple peptides in the same experiment. This is intuitive: because of variations in protein length and peptide detectability (4), as well as sampling variation (18) (i.e. different peptides represented in the results from analyzing the same sample), the number of observed peptides can vary substantially among present proteins. Choosing k ϾϾ 1 requires a permissive peptide FDR to identify present proteins with few detectable peptides.
It is worth noting the inherently robust theoretical properties of the npCI approach. First, the npCI does not attempt to model the behavior or measurements of present peptides or proteins, for which most data sets have no grounded truth or empirical examples. Instead it compares the remaining target peptides to empirically absent peptides; for this reason, the npCI is very different from the scores computed and optimized by protein identification methods. This complementary approach is more robust because it makes fewer assumptions. More important, it avoids protein-level biases by using the target-decoy approach only at the peptide level. Second, the npCI models the score distributions nonparametrically by simply smoothing the density of observed examples, which allows the user to remain agnostic about the shape and family of the score distributions. The algorithm is completely automated and has no free parameters; it requires only the peptide search engine output for the experiment (or replicate experiments) and either ranked proteins or specific sets of proteins to evaluate. These ranked proteins can come from different identification methods. This manuscript evaluates proteins identified using different protein-level FDR thresholds and proteins identified using different families of peptide rules; however, the countless methods used to rank proteins could be similarly evaluated using our approach. The method can be applied as is to evaluate families of rankings produced by peptide-or PSM-level FDR cuttoffs, and it even can be used to choose free parameters from parametric methods (7, 19).
Robust metrics eschewing ad hoc parameters are crucial for the advancement of proteomics, as well as for general statistical analysis. We have described an automated, parameter-free method for evaluating discoveries; the method can be used with arbitrary peptide scores and rankings from any identification algorithm, and it can be run by a user without prior knowledge or expertise. Furthermore, the general statistical method behind the npCI can be trivially modified for application to other data, such as RNA sequencing and paired case-control analyses. The npCI method can easily be adapted to multivariate problems and applied to various (possibly correlated) types of evidence in a fully automated manner (e.g. RNA sequencing and LC-MS/MS on the same sample).