A High-Confidence Human Plasma Proteome Reference Set with Estimated Concentrations in PeptideAtlas*

Human blood plasma can be obtained relatively noninvasively and contains proteins from most, if not all, tissues of the body. Therefore, an extensive, quantitative catalog of plasma proteins is an important starting point for the discovery of disease biomarkers. In 2005, we showed that different proteomics measurements using different sample preparation and analysis techniques identify significantly different sets of proteins, and that a comprehensive plasma proteome can be compiled only by combining data from many different experiments. Applying advanced computational methods developed for the analysis and integration of very large and diverse data sets generated by tandem MS measurements of tryptic peptides, we have now compiled a high-confidence human plasma proteome reference set with well over twice the identified proteins of previous high-confidence sets. It includes a hierarchy of protein identifications at different levels of redundancy following a clearly defined scheme, which we propose as a standard that can be applied to any proteomics data set to facilitate cross-proteome analyses. Further, to aid in development of blood-based diagnostics using techniques such as selected reaction monitoring, we provide a rough estimate of protein concentrations using spectral counting. We identified 20,433 distinct peptides, from which we inferred a highly nonredundant set of 1929 protein sequences at a false discovery rate of 1%. We have made this resource available via PeptideAtlas, a large, multiorganism, publicly accessible compendium of peptides identified in tandem MS experiments conducted by laboratories around the world.

Blood plasma contains a combination of subproteomes derived from different tissues, and thus, it potentially provides a window into an individual's state of health. Therefore, a detailed analysis of the plasma proteome holds promise as a source of biomarkers that can be used for the diagnosis and staging of diseases, as well as for monitoring progression and response to therapy.
For many years, before the era of proteomics, the classic multivolume reference, The Plasma Proteins by Frank Putnam (1975-1989) (1), provided a foundation for studies of plasma proteins. In 2002, Anderson and Anderson (2) published a review of 289 plasma proteins studied by a wide variety of methods, and quantified primarily with immunoassays, providing an early plasma proteome reference set.
Subsequently, the widespread adoption of liquid chromatography-tandem MS (LC-MS/MS) techniques resulted in a rapid increase in plasma proteome-related data sets that needed to be similarly integrated to form a next-generation comprehensive human plasma proteome reference set. In 2002, the Human Proteome Organization (HUPO) launched Phase I of its Human Plasma Proteome Project (PPP) and provided reference specimens of serum and EDTA-, citrate-, and heparin-anticoagulated plasma to 55 laboratories. Eighteen laboratories contributed tandem MS findings and protein identifications, which were integrated by a collaborative process into a core data set of 3020 proteins from the International Protein Index (IPI) database (3) containing two or more identified peptides, plus filters for smaller, higher confidence lists (4,5). A stringent re-analysis of the PPP data, including adjustment for multiple comparisons, yielded 889 proteins (6).
Meanwhile, in 2004, Anderson et al. (7) published a compilation of 1175 nonredundant plasma proteins reported in the 2002 literature review and in three published experimental data sets (8-10). Only 46 were reported in all four sources, suggesting variability in the proteins detected by different methods, high false positive rates because of insufficiently stringent identification criteria, and nonuniform methods for assigning protein identifications. Shen et al. (11) reported 800 to 1682 proteins from human plasma, depending on the proteolytic enzymes used and the criteria applied for identification; Omenn et al. (4) re-analyzed those raw spectra with HUPO PPP-I search parameters and matched only 213 to the PPP-I core data set. Chan et al. reported 1444 unique proteins in serum using a multidimensional peptide separation strategy (12), of which 1019 mapped to IPI and 257 to the PPP-I core data set. These previous efforts highlight the challenges associated with accurately determining the number of proteins inferred from large proteomic data sets, and with comparing the proteins identified in different data sets.
In 2005, we used a uniform method based on the Trans-Proteomic Pipeline (13) to create the first Human Plasma PeptideAtlas (14), containing 28 LC-MS/MS data sets and over 1.9 million spectra. Using a PeptideProphet (15) probability threshold of p ≥ 0.90, 6929 peptides were identified at a peptide false discovery rate (FDR) of 12%, as estimated by PeptideProphet's data model, mapping to about 960 distinct proteins. Comparison of protein identifiers with those from studies cited above showed quite limited overlap.
From the 2005 Human Plasma PeptideAtlas, as well as the PPP-I collaboration, we concluded that different proteomics experiments using different samples, depletion, fractionation, sample preparation, and analysis techniques identify significantly different sets of proteins. We decided that a comprehensive plasma proteome could be compiled only by combining data from many diverse, high-quality experiments, and strove to collect as much such data as possible. The resulting 2007 Human Plasma PeptideAtlas (unpublished), encompassing 53 LC-MS/MS data sets, identified 27,801 distinct peptides (four times the number in the 2005 Atlas) and 2738 proteins.
In 2008, Schenk et al. (16) published a high-confidence set of 697 nonimmunoglobulin human plasma proteins based on measuring a single pooled sample on two high-end MS instruments after depletion, prefractionation, and protease inhibition, with stringent validation methods. This highly nonredundant set of proteins likely contains fewer false-positives than any previous MS-derived plasma proteome reference set.
The goal of the present work was to compile a larger human plasma proteome reference set of similar high confidence by creating a new release of the Human Plasma PeptideAtlas incorporating more data than in 2007 and interpreting the data using more stringent criteria. We searched raw data sets submitted to PeptideAtlas and performed peptide validation using a uniform pipeline (Fig. 1), compiled several sets of corresponding protein identifications at different clearly defined levels of redundancy (Fig. 2), and, using a spectral counting technique, provided a rough estimate of concentrations for a highly nonredundant set of protein sequences to guide blood-based diagnostic efforts such as doping using stable isotope-labeled synthetic reference peptides for selected reaction monitoring (SRM) experiments (Fig. 3). The result is a plasma proteome reference set (Fig. 4)

FIG. 1. Left: Search, analysis, and validation steps for each LC-MS/MS experiment. Spectra were searched against a spectral library or sequence database. The resulting PSMs were then processed using the TPP, including a new component, iProphet, to improve discrimination (see text for details). Right: The PeptideAtlas build process. ProteinProphet combines PSMs passing the FDR threshold for all experiments to create lists of distinct peptides, protein identifications, and protein groups. These data, along with supporting information such as consensus spectra, genome mappings, and proteotypic peptides, comprise a PeptideAtlas build.

FIG. 2. A, Protein identification sets at different levels of redundancy.
• Sequence-unique set: exhaustive set with exact duplicates removed.
• Peptide-set-unique set: a subset of the sequence-unique set within which no two protein sequences include the exact same set of identified peptides.
• Not subsumed set: peptide-set-unique set with subsumed protein sequences removed (those for which the identified peptides form a proper subset of the identified peptides for another protein sequence).
• Canonical set: a subset of the not subsumed set within which no protein sequence includes more than 80% of the peptides of any other member of the set. Protein sequences that are not subsumed, but not canonical are called possibly distinguished, because each has a peptide set that is close, but not identical, to that of a canonical protein sequence.
• Covering set: a minimal set of protein sequences that can explain all of the identified peptides.
B, Peptide-centric illustration of six protein sequences in a hypothetical ProteinProphet protein group, in order of descending ProteinProphet probability. Heavy lines represent protein chains (with invented identifiers); lighter lines represent observed peptides. Vertically aligned peptides are identical in sequence, and one instance of each is labeled with the letter of the highest probability protein to which it maps. A' is indistinguishable from A because it contains exactly the same set of observed peptides; both are equally likely to exist in the sample(s), but A is labeled canonical because its Swiss-Prot protein identifier is preferred. E is subsumed by A because its observed peptides form a subset of A's peptides; it is also subsumed by A', C, and D. Protein sequences B, C, and D are labeled possibly distinguished because the peptide set for each is slightly different from that of A. The three protein sequences with superscript C comprise the smallest subset of sequences sufficient to explain all the observed peptides in the group, and thus belong to the covering set.
supplemental Table S4, Supplemental Data), including 44 from Phase I PPP experiments, 13 from PPP Phase II, the Chan data set, and several from corporate research laboratories. Data from both plasma and serum samples, a variety of sample preparation techniques (depleted/not depleted, various fractionation schemata, use of protease inhibitors, N-linked glycocapture enrichment (22)), and analysis on a variety of instruments were included. All samples were digested with trypsin. Each data set consisted of between one and 38,252 LC-MS/MS runs (median 22) for a total of 48,789 LC-MS/MS runs. For analysis, we separated the data sets into two groups, glycocapture and nonglycocapture, and later combined the results.
The 69 data sets for the nonglycocapture samples were all selected from ion trap experiments because we wished to search them against an ion trap spectral library. Data were converted to mzXML (23) and searched with SpectraST version 4.0 (24) against a spectral library consisting of the NIST 3.0 human spectral library (261,777 consensus spectra) (25) plus one SpectraST-generated (26) decoy for each NIST spectrum. This library contains consensus spectra derived from actual identified spectra, some of which include missed cleavages and/or modifications. A precursor mass tolerance of 3.0 Th (thomson) was used. See supplemental Data for complete SpectraST parameters.
Analysis and Validation of Search Results-The search results for each experiment were processed using the Trans-Proteomic Pipeline (TPP) (13), as shown in Fig. 1, left (see supplemental Data for TPP parameters used). PeptideProphet (15) computed a probability for each peptide-spectrum match (PSM) for peptides of length seven or greater. iProphet (27) was applied to the PeptideProphet results to improve discrimination by modeling five additional properties of the data beyond those modeled by PeptideProphet, and adjusting peptide probabilities accordingly. The five models are number of sibling searches (rewards or penalizes identifications based on the output of multiple search engines, not applicable here), number of replicate spectra (models the assumption that precursor ions with multiple high probability identifications are more likely to be correct), number of sibling experiments (models the assumption that precursor ions observed in multiple experiments and matched to the same peptide sequence are more likely to be correct), number of sibling ions (rewards peptides identified by precursors with different charges), and number of sibling modifications (rewards peptides identified with different mass modifications).
RefreshParser mapped each PSM to a combined protein sequence database derived from Swiss-Prot 2010-04 including splice variants (28,29), IPI v3.71, Ensembl v57.37 (30), and cRAP v1.0 (31). In many cases, the exact same protein sequence is included in the combined database multiple times because it is contained in multiple databases and/or because the Ensembl database includes many duplicates. Each PSM was mapped to all protein sequences containing the PSM's peptide sequence; in many cases this resulted in a PSM mapping to multiple protein sequences that are duplicates, splice variants, or paralogs.
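The mapping step can be illustrated with a minimal sketch. It is not RefreshParser itself: the function names and the identifiers in the usage example are hypothetical, and plain substring matching is assumed (no special treatment of isobaric residues such as I/L).

```python
from collections import defaultdict


def map_peptides(peptides, database):
    """Map each peptide to every database entry whose sequence contains it.

    `database` maps protein identifier -> amino acid sequence.
    Plain substring matching is an assumption made for illustration.
    """
    mapping = defaultdict(list)
    for peptide in peptides:
        for protein_id, sequence in database.items():
            if peptide in sequence:
                mapping[peptide].append(protein_id)
    return dict(mapping)


def group_duplicates(database):
    """Return groups of identifiers that share exactly the same sequence."""
    by_sequence = defaultdict(list)
    for protein_id, sequence in database.items():
        by_sequence[sequence].append(protein_id)
    return [sorted(ids) for ids in by_sequence.values() if len(ids) > 1]


# Hypothetical entries: the same sequence under two identifiers, plus a variant.
database = {
    "SP_A": "MKTAYIAKQRQISFVK",
    "IPI_A": "MKTAYIAKQRQISFVK",
    "ENS_B": "MKTAYIAKQRELSFVK",
}
hits = map_peptides(["TAYIAK", "QISFVK"], database)
duplicates = group_duplicates(database)
```

Here one peptide maps to all three entries and the other only to the two duplicate entries, which is the kind of one-to-many mapping that the redundancy scheme described later has to resolve.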
FIG. 3. Plasma protein concentrations determined using immunoassay and antibody microarray analysis (40) versus normalized spectral counts from the Human Plasma Non-glyco PeptideAtlas, plotted on a log scale. Each small square represents a protein found in both sources. Hollow squares represent proteins that were excluded when drawing the trend line (either depleted (albumin) or fewer than four spectrum counts). The line segments above and below the trend line are fit to the standard deviation of the y axis values computed at intervals of 0.1 (log scale). The arrows on the left represent proteins with reported concentrations in (40) but no spectrum counts. The histogram at the right depicts an estimate of the completeness of the Human Plasma Non-glyco PeptideAtlas as a function of concentration, calculated as the number of points divided by the total number of points and arrows within each decade. See supplemental Fig. S2 for N-Glyco atlas.

For very large data sets, the FDR at the peptide level tends to be much larger than that at the PSM level, and, at the protein level, much larger still (32). Thus, in order to obtain a 1% decoy-estimated protein FDR for the final Human Plasma PeptideAtlas, a stringent PeptideProphet-estimated PSM FDR filter of 0.0002 (corresponding to probability cutoffs ranging from 0.9903 to 0.9998) was applied to each experiment.
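To illustrate why one PSM FDR filter corresponds to a different probability cutoff in each experiment, the sketch below derives a cutoff from a list of PSM probabilities. It assumes the usual model-based estimate, in which the expected number of false PSMs above a cutoff is the sum of (1 - p) over the accepted PSMs; the function name is hypothetical and this is not the TPP implementation.

```python
def probability_cutoff_for_fdr(probabilities, target_fdr):
    """Return (cutoff, n_accepted) for the largest set of top-ranked PSMs
    whose model-estimated FDR stays at or below `target_fdr`.

    Estimated FDR of the top n PSMs = sum(1 - p) / n (assumed estimator).
    Returns (None, 0) if even the best PSM exceeds the target.
    """
    ranked = sorted(probabilities, reverse=True)
    expected_false = 0.0
    cutoff, accepted = None, 0
    for n, p in enumerate(ranked, start=1):
        expected_false += 1.0 - p
        # The running mean of (1 - p) never decreases down the ranked list,
        # so the first violation ends the search.
        if expected_false / n > target_fdr:
            break
        cutoff, accepted = p, n
    return cutoff, accepted


# Toy probabilities for one experiment (illustrative values only).
cutoff, accepted = probability_cutoff_for_fdr(
    [0.9999, 0.9998, 0.999, 0.95, 0.5], target_fdr=0.001
)
```

An experiment with many high-probability PSMs can admit slightly lower probabilities at the same FDR than an experiment of poorer quality, which is consistent with the range of cutoffs reported above.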
ProteinProphet (33) was then run on each experiment, assigning to each distinct peptide the probability of its highest probability PSM, and further adjusting these probabilities using a number of sibling peptides model, which rewards peptides that map to proteins with many identified peptides.
The set of identified peptides for the HsSerum NCI Large Survey experiment (12) was found to contain many peptides that map to yeast but not human. Suspecting yeast contamination, we purged the peptide set for this experiment of all peptides that appear in the yeast genome.
Next, ProteinProphet was run again, this time combining the PSMs for all experiments, to assign probabilities to protein identifications and to group protein identifications with overlapping peptide sets. The PSMs passing threshold for all experiments and their corresponding distinct observed peptides and protein identifications were then compiled (Fig. 1, right) to form a Human Plasma Non-glyco PeptideAtlas build.
Classification of Protein Identifications-It is impossible to generate a definitive list of identified proteins because such a list depends on what is meant by "protein" and on what one considers sufficient evidence for the existence of a specific protein. Further, when the set of identified peptides mapping to the sequence of a protein is identical to, or a subset of, the set of peptides mapping to the sequence of another protein, it is quite possible that both proteins have been observed, but there is no way to determine this from the data.
To partially address this issue, we compiled several sets of protein identifications for this build at different levels of redundancy. For the purpose of this work, the redundancy of a set of protein identifications is the extent to which the set contains more sequences than necessary to reasonably explain all of the data. Note that redundancy is different from confidence and that the different redundancy levels do not correspond to different confidence levels. The four most useful protein sets are described below; the first two, exhaustive (most redundant) and canonical (least redundant), are used extensively throughout this report. All sets are summarized in Fig. 2, and a detailed explanation, including examples, is given in supplemental Data (Cedar). We have named this scheme Cedar to capture the somewhat tree-like pyramidal shape shown in Fig. 2A.
The exhaustive set includes any entry from the combined protein sequence database (Swiss-Prot 2010-04 + IPI v3.71 + Ensembl v57.37) to which any identified peptide maps. This highly redundant set includes multiple copies of identical sequences. To determine whether a protein corresponding to a particular identifier exists in the Human Plasma PeptideAtlas, one must check whether that identifier is in the exhaustive set. Assuming the identifier is in Swiss-Prot 2010-04, IPI v3.71, or Ensembl v57.37, its presence in the atlas' exhaustive set indicates that the protein sequence includes a peptide sequence in the atlas.
The canonical set is a highly nonredundant set of protein sequences explaining nearly all of the identified peptides and it serves as a proteome reference set. It includes the highest probability protein sequence from each ProteinProphet protein group, called the group representative. Swiss-Prot protein sequences are preferred for inclusion because of Swiss-Prot's comprehensive sequence documentation and curation, and because Swiss-Prot, a subset of UniProt, is now considered to contain one entry for each currently known human protein coding gene (34), with a total of 20,251 entries in the 2010-10 release, of which 13,329 have evidence at the protein level [www.uniprot.org]. When a protein group includes protein sequences for which the peptide set has less than 80% overlap with the group representative, we label those sequences canonical as well (see supplemental Data (Cedar) for algorithm and justification for 80% threshold). The size of the canonical set is a conservative estimate of the number of distinct proteins observed. It is important to understand that the label canonical is with respect to a particular data collection; a protein sequence that is identified in two atlas builds may be labeled canonical in one collection and something else in another.
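The labeling of sequences within one protein group can be sketched as follows. This is a simplification, not the Cedar algorithm given in the supplemental Data: each sequence is compared only with the group representative, overlap is assumed to mean the fraction of a sequence's own identified peptides that are shared with the representative, and the preference for Swiss-Prot identifiers is omitted. The function name and the identifiers in the example are hypothetical.

```python
def label_group(group, overlap_threshold=0.8):
    """Label the members of one protein group.

    `group` is a list of (protein_id, probability, peptide_set) tuples.
    The highest-probability member is taken as the group representative.
    """
    ranked = sorted(group, key=lambda entry: entry[1], reverse=True)
    rep_id, _, rep_peptides = ranked[0]
    labels = {rep_id: "canonical"}
    for protein_id, _, peptides in ranked[1:]:
        if peptides == rep_peptides:
            labels[protein_id] = "indistinguishable"
        elif peptides < rep_peptides:  # proper subset of the representative
            labels[protein_id] = "subsumed"
        else:
            # Assumed overlap measure: shared peptides / own peptides.
            shared = len(peptides & rep_peptides) / len(peptides)
            if shared < overlap_threshold:
                labels[protein_id] = "canonical"
            else:
                labels[protein_id] = "possibly distinguished"
    return labels


# Hypothetical group with invented identifiers and peptide names.
group = [
    ("A", 1.00, {"p1", "p2", "p3", "p4", "p5"}),
    ("A2", 0.99, {"p1", "p2", "p3", "p4", "p5"}),
    ("B", 0.90, {"p1", "p2", "p3", "p4", "p6"}),
    ("F", 0.80, {"p1", "p7", "p8"}),
    ("E", 0.50, {"p1", "p2"}),
]
labels = label_group(group)
```

In this toy group, B shares four of its five peptides with A and is labeled possibly distinguished, whereas F shares only one of three and is labeled canonical alongside A.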
FIG. 4. … supplemental Table S4. Height of dark bar = canonical protein sequences identified per experiment; total height (dark + light) = cumulative tally; width of bar = PSM count. See supplemental Fig. S5 for a similar graph of distinct peptides.

The possibly distinguished set includes protein sequences that have one or more peptides distinguishing them from all protein sequences in the canonical set, but with these peptides comprising fewer than 20% of the total number of identified peptides in each protein, making the case for independent existence less strong.
Finally, the covering set is a near-minimal set sufficient to explain all of the peptide identifications (see supplemental Data, (Cedar), for algorithm). This set consists of almost all of the canonical protein sequences plus some of the possibly distinguished protein sequences, and is usually somewhat larger than the canonical set. It is useful for assigning a "parent" protein identification to each identified peptide, as is necessary for estimating FDR using Mayu (32) or computing the empirical observability score described in Empirical Observability Score below.
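A near-minimal covering set can be approximated with a greedy set-cover heuristic, sketched below. The greedy heuristic is a swapped-in illustration, not the Cedar algorithm from the supplemental Data, and the function names and identifiers are hypothetical. The second function shows how a covering set yields one "parent" protein per peptide.

```python
def greedy_covering_set(protein_peptides):
    """Greedy approximation of a minimal set of proteins that explains
    every identified peptide. `protein_peptides` maps id -> set of peptides.
    """
    uncovered = set().union(*protein_peptides.values())
    chosen = []
    while uncovered:
        # Pick the protein explaining the most still-unexplained peptides;
        # iterating over sorted identifiers makes tie-breaking deterministic.
        best = max(
            sorted(protein_peptides),
            key=lambda pid: len(protein_peptides[pid] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_peptides[best]
    return chosen


def assign_parents(protein_peptides, covering):
    """Assign each peptide to the first covering-set protein containing it."""
    parents = {}
    for protein_id in covering:
        for peptide in protein_peptides[protein_id]:
            parents.setdefault(peptide, protein_id)
    return parents


# Hypothetical protein group (invented identifiers and peptide names).
protein_peptides = {
    "A": {"p1", "p2", "p3", "p4"},
    "B": {"p4", "p5"},
    "C": {"p5", "p6"},
    "E": {"p1", "p2"},
}
covering = greedy_covering_set(protein_peptides)
parents = assign_parents(protein_peptides, covering)
```

Greedy selection does not guarantee a true minimum, which fits the description of the covering set as near-minimal rather than minimal.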
See supplemental Data (Cedar) for settings to apply when using the PeptideAtlas web interface to obtain these protein sequence sets.
Analysis of N-linked Glycopeptide-enriched Samples-We then analyzed the 22 data sets from samples prepared using N-linked glycocapture enrichment. Our aim in including these samples was to detect low-abundance proteins, many of which are N-glycosylated. Sample preparation was as described in (35). Briefly, N-linked glycoproteins were conjugated to a solid support using hydrazide chemistry, proteins were digested with trypsin on the support, N-linked glycopeptides were optionally labeled with stable isotopes, and formerly N-linked glycosylated peptides were specifically released via peptide-N-glycosidase F (PNGase F), resulting in an N-linked glycopeptide-rich fraction, but with the glycans removed. Within this fraction, all asparagines (N) that had been glycosylated in the intact protein were now present as aspartic acid (D) residues. This fraction was analyzed via LC-MS/MS. We did not search against the NIST spectral library because it does not contain glycopeptide spectra; instead, data were searched with X!Tandem version 2009.10.01.1 (36) using a score plug-in implementing the COMET (k-score) function (13) against a target database consisting of IPI 3.54 (75,428 sequences) plus one decoy per target sequence generated by a random scrambling of each tryptic peptide in place. Peptides appearing in more than one target sequence were scrambled identically each time. The mass tolerance for precursor ions ranged from -2.1 to +4.1 Daltons. Modifications were allowed on cysteine (fixed, mass depending on modification used) and methionine (variable, oxidation). A maximum of two missed cleavages was allowed. A standard protocol (37) was employed so that D-[not P]-[S/T]-containing spectra could be matched against N-[not P]-[S/T]-containing database sequences.
Briefly, we substituted the letter B for N in all N-glycosite motifs in the database (B commonly denotes "N or D" but in this context denotes "N presumed to be glycosylated"), then searched with the mass of B fixed to the mass of D, allowing B to behave as D during the search. Instances of B were then converted back to N in the search results. See supplemental Data for complete X!Tandem parameters. It is important to note that, whereas this computational protocol allows identification of peptides containing the (possibly de-amidated) N-glyco motif, it does not confirm whether the site was indeed glycosylated in the sample.
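The database substitution described above can be sketched with a regular expression. The function names are hypothetical, and the search-engine side of the protocol (fixing the mass of B to that of D) is not shown. A lookahead is used so that overlapping motifs, such as two adjacent asparagines that each start a motif, are both marked.

```python
import re

# N followed by any residue except P, then S or T: the N-[not P]-[S/T] motif.
# The lookahead leaves the two following residues unconsumed.
GLYCOSITE = re.compile(r"N(?=[^P][ST])")


def mark_glycosites(sequence):
    """Replace the N of every N-[not P]-[S/T] motif with B."""
    return GLYCOSITE.sub("B", sequence)


def restore_asparagine(peptide):
    """Convert B back to N in a peptide reported by the search."""
    return peptide.replace("B", "N")


marked = mark_glycosites("MKNGSPNPTK")       # second N is followed by P
overlapping = mark_glycosites("ANNTSK")      # both N residues start a motif
restored = restore_asparagine(overlapping)
```

Because the substitution is driven only by the sequence motif, a marked site indicates a potential glycosylation site, in line with the caveat stated above.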
We then constructed a Human Plasma N-Glyco PeptideAtlas using the same methods as above, but with a PSM FDR threshold of 0.00002, yielding a protein-level FDR of 0.56%. We chose this threshold to achieve our goal of a 1% protein FDR after combining with the 0.86% FDR Non-glyco build described above. It was not practical to use identical FDRs for the component builds because even fine adjustments in the PSM FDR for a component build sometimes resulted in coarse changes in the protein FDR for the combined build.
Concentration Estimation-Spectral counting was applied to roughly estimate the absolute concentration of the group representative for each ProteinProphet protein group in each atlas. Spectral counting rests on the observation that the PSM count for a peptide correlates linearly with its molar concentration in the sample (38). We applied a simplification of the APEX method described by Lu and coworkers (39). For each protein sequence, $i$, identified in the Human Plasma Non-glyco PeptideAtlas, we begin with a ProteinProphet-adjusted count $SC_i$ of all PSMs that map to that protein sequence (ProteinProphet adjusts the actual PSM count downward according to the degeneracy of the peptide-protein mappings). $SC_i$ is then normalized by scaling it to the total number of available tryptic peptides. Specifically, we calculate a normalization factor, $NF_i$, by dividing the number of tryptic peptides of length seven or more resulting from an in silico digestion, $NTP_i$, by 25, which is very roughly the average number of tryptic peptides per protein sequence across the whole proteome, and then calculate the normalized spectrum count $NSC_i$ by dividing $SC_i$ by that factor:

$$NF_i = \frac{NTP_i}{25}, \qquad NSC_i = \frac{SC_i}{NF_i}$$

We calibrated the concentration scale to the published concentrations of individual proteins. In Fig. 3, we plot $NSC_i$ versus concentrations determined via immunoassay and antibody microarray in (40). In many cases, these concentrations reflect multiple isoforms and/or cross-reacting proteins.
Using the slope $S$ and y-intercept $K$ from this calibration plot, we then calculated an estimated concentration $C_i$ for each group representative protein sequence with $NSC_i \geq 4$ (smaller counts have been found unreliable for this purpose (41)) in the Human Plasma Non-glyco PeptideAtlas:

$$\log_{10} C_i = S \cdot \log_{10} NSC_i + K$$

Concentrations were converted to mass units (ng/ml) for storage in PeptideAtlas using molecular weights calculated from amino acid sequence.
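The normalization and calibration steps can be sketched as follows. Several points are assumptions made for illustration: the in silico digestion uses the common trypsin rule (cleave after K or R, but not before P), the calibration line is taken to be fit in log10-log10 space because the calibration plot is drawn on a log scale, and the slope and intercept used in the example are placeholders rather than the fitted values. All function names are hypothetical.

```python
import math
import re


def count_tryptic_peptides(sequence, min_length=7):
    """Count in silico tryptic peptides of at least `min_length` residues.

    Cleavage after K or R except before P (assumed rule, no missed cleavages).
    """
    fragments = re.split(r"(?<=[KR])(?!P)", sequence)
    return sum(1 for fragment in fragments if len(fragment) >= min_length)


def normalized_spectral_count(sc, ntp, avg_peptides=25):
    """NSC = SC / NF, with NF = NTP / 25 as described in the text."""
    if ntp == 0:
        raise ValueError("protein has no qualifying tryptic peptides")
    return sc / (ntp / avg_peptides)


def estimate_concentration(nsc, slope, intercept, min_nsc=4):
    """Estimate concentration from a log-log calibration line (assumed form).

    Returns None for counts below `min_nsc`, which are treated as unreliable.
    """
    if nsc < min_nsc:
        return None
    return 10 ** (slope * math.log10(nsc) + intercept)


# Toy sequence: three cleavage products, two of which reach seven residues.
ntp = count_tryptic_peptides("AAAAAAAKGGGGGGGRPLLLLLLKSSK")
nsc = normalized_spectral_count(sc=100, ntp=50)
# Placeholder calibration values, not the fitted S and K.
conc = estimate_concentration(nsc=100.0, slope=1.0, intercept=2.0)
```

Dividing by $NTP_i / 25$ keeps the normalized count on roughly the same scale as the raw count for a protein of average size, so that the threshold of four counts retains a comparable meaning across proteins.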
The distance of the standard deviation curve from the trend line at the center of each decade on the x axis (between $10^0$ and $10^1$, between $10^1$ and $10^2$, etc.) was recorded as an uncertainty factor for the normalized PSM counts in that decade, ranging from less than 5× at high concentrations to 13× at low concentrations. See supplemental Table S2 for complete listing.
To estimate concentrations in the N-Glyco Plasma PeptideAtlas, we adjusted the technique to account for the N-linked glycopeptide enrichment. About half of the distinct peptides in this atlas contain the N-glycosite motif (N-[not P]-[S/T]), indicating a potential N-linked glycosylation site. Thus, in calculating the normalization factor $NF_i$, we take into account both the total number of tryptic peptides in each protein sequence, $NTP_i$, and the number of peptides containing the N-glycosite motif, $NTGP_i$. The calibration plot is shown in supplemental Fig. S2. Also shown is a plot correlating the estimated concentrations in the Non-glyco and N-Glyco atlases for protein sequences appearing in both (supplemental Fig. S3).

Construction of Combined PeptideAtlas Plasma Build-Finally, we combined the PSMs and peptides from these two atlases to form a Human Plasma PeptideAtlas build that includes results from all 91 plasma (or serum) experiments, both nonglycocapture and glycocapture. We ran ProteinProphet on the combined set of experiments and created protein identification sets as described above. Estimated concentrations from the Non-glyco atlas were used for protein sequences with values in both contributing atlases.
False Discovery Rate-Mayu, a software tool for estimating false discovery rates of protein identifications in large-scale data sets (32), was applied to each component atlas and to the combined atlas to estimate the protein-level FDR. Mayu implements a refinement of the common decoy-counting approach, improving accuracy by taking into consideration the size of the data set, the number of tryptic peptides in each protein, and proteome coverage.
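For orientation, the common decoy-counting estimate that Mayu refines can be sketched as below. This is the naive approach, not Mayu: it ignores data set size, the number of tryptic peptides per protein, and proteome coverage. It assumes one decoy per target sequence, and the function name and the `DECOY_` identifier prefix are hypothetical.

```python
def decoy_counting_fdr(protein_ids, decoy_prefix="DECOY_"):
    """Naive protein FDR estimate: decoy identifications / target identifications.

    With one decoy per target, the number of decoy hits is taken as an
    estimate of the number of false target hits.
    """
    n_decoy = sum(1 for pid in protein_ids if pid.startswith(decoy_prefix))
    n_target = len(protein_ids) - n_decoy
    if n_target == 0:
        raise ValueError("no target identifications")
    return n_decoy / n_target


# Toy list: 200 target identifications and 2 decoy identifications.
identified = [f"P{i:05d}" for i in range(200)] + ["DECOY_P00007", "DECOY_P00113"]
fdr = decoy_counting_fdr(identified)
```

In very large data sets true identifications concentrate on a limited set of proteins while false ones spread across the whole database, which is one reason a refinement of this simple ratio is needed at the protein level.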
Manual Validation of Single-PSM Protein Identifications-Three hundred fifty-seven of the 1999 canonical protein identifications that emerged after combining the Non-glyco and N-Glyco builds were supported by only a single PSM (supplemental Table S5). We manually validated these, judging a PSM positively for each of the following: identifications to b- or y-ions or neutral losses for nearly all of the tallest peaks in the spectrum, at least one series of four or more consecutive highly abundant fragment ions of the same type (b or y, preferably y) and charge state, highly abundant fragments corresponding to cleavage N-terminal to proline and C-terminal to aspartic acid (42), no missed tryptic cleavages, fragments observed above the noise level for at least 50% of the expected ions, internal positively charged amino acids to account for precursor charges above +2, and N-terminal acetylation only for peptides at the N terminus of the protein. We discarded 70 PSMs that failed to fulfill these criteria to the extent that, in our opinion, they had a greater than 10% chance of being incorrect identifications. So that users can still view these 70 discarded identifications, they were not removed from the component (N-Glyco and Non-glyco) atlases.
Construction of Combined PeptideAtlas Plasma Build at 5% Protein FDR-We repeated the above atlas construction procedure to obtain a combined build with a protein FDR of ~5%, as follows. We applied a PSM FDR of 0.001 to the non-glyco data and a PSM FDR of 0.0007 to the glyco data, obtaining in each case a build with a Mayu protein FDR of 4.8%. These were combined to yield a "Human Plasma FDR 5% PeptideAtlas" build (actual Mayu protein FDR is 4.6%). Single-PSM identifications were not manually validated, and all that passed our computational criteria were retained in this build.
Empirical Observability Score-For all peptides in each atlas, we calculated an empirical observability score (43), defined as the ratio of the number of samples in which a given peptide is observed divided by the number of samples in which the parent protein sequence is observed. For example, if peptide X is seen in five different samples and its parent protein sequence is observed in 10 samples, the empirical observability score is 0.5.
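The score defined above can be computed as in the following minimal sketch. The function name and data layout are hypothetical; a protein is counted as observed in a sample when at least one of its assigned peptides is observed there, and each peptide is assumed to have a single parent protein.

```python
from collections import Counter


def empirical_observability(sample_peptides, parent_of):
    """Return peptide -> (samples with peptide) / (samples with parent protein).

    `sample_peptides` maps sample name -> set of observed peptides.
    `parent_of` maps peptide -> its single parent protein.
    """
    peptide_samples = Counter()
    protein_samples = Counter()
    for peptides in sample_peptides.values():
        # Sets guarantee each peptide and protein is counted once per sample.
        peptide_samples.update(peptides)
        protein_samples.update({parent_of[p] for p in peptides})
    return {
        peptide: count / protein_samples[parent_of[peptide]]
        for peptide, count in peptide_samples.items()
    }


# Peptide X is seen in 5 of the 10 samples in which its parent protein P is seen.
samples = {f"s{i}": ({"X", "Y"} if i < 5 else {"Y"}) for i in range(10)}
scores = empirical_observability(samples, {"X": "P", "Y": "P"})
```

This reproduces the worked example in the text: peptide X receives a score of 0.5, while a peptide seen in every sample containing its parent protein receives 1.0.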

RESULTS
Size, Confidence, and Completeness of Proteome Reference Set-The 2010 Human Plasma PeptideAtlas, constructed from 91 LC-MS/MS data sets, contains 1929 canonical protein sequences with an estimated protein FDR of <0.98% (Fig. 4 and supplemental Table S3). As described under "Experimental Procedures," the set of canonical protein sequences is a highly nonredundant protein sequence set with no protein sequence sharing more than 80% of its observed peptides with any other member of the set. This criterion may exclude closely related protein family members. The list of 1929 protein identifiers, along with estimated concentrations and number of supporting PSMs and distinct peptides, is given in supplemental Table S6.
Each canonical protein sequence in the Human Plasma PeptideAtlas is supported by between 1 and 521 distinct observed peptides (mean = 11, median = 3) and between 1 and 390,366 PSMs (mean = 1720, median = 10). Of the 1929 canonical protein sequences, 1642 are supported by more than one PSM, and 1313 are supported by more than one distinct peptide.
High Confidence Identifications-The previous Human Plasma PeptideAtlas contained 27,801 peptides mapping to 2738 nonredundant proteins (protein redundancy level corresponding roughly to that of the covering list for the 2010 atlas). The 2010 Human Plasma PeptideAtlas contains fewer identified peptides and protein sequences, but these fulfill much more stringent criteria. For lack of suitable methods, we could not accurately estimate the protein FDR of the 2007 build, but, because it was constructed using a very liberal PSM probability cutoff, its protein FDR is no doubt much higher than the 1% of the 2010 build. The high confidence level for the 2010 build, and the ability to estimate it, were accomplished by the inclusion of more data plus four methodological improvements:
1. Spectral Library Searching-Non-glyco query spectra were compared against consensus spectra derived from real spectra, rather than against theoretical spectra. This resulted in better discrimination between true and false identifications (24), giving a higher number of identifications at any given PSM FDR.
2. iProphet-A new component of the Trans-Proteomic Pipeline, iProphet (27), increased discrimination between true and false identifications in our atlas builds by modeling five additional properties of the data beyond those modeled by PeptideProphet (see Experimental Procedures).
3. PSM FDR Cutoff-For the 2007 build, we used a PSM probability cutoff of 0.9. Because experiments vary in the quality of their results, this uniform probability cutoff admitted a higher proportion of false PSMs for poor experiments than for high quality experiments. Therefore, here we instead used a PSM FDR threshold, adjusted to achieve a protein FDR of about 1% for the combined build. Corresponding probability cutoffs were one to three orders of magnitude more stringent than those for the 2007 build, admitting many fewer PSMs per experiment.
4. Decoy-estimated Protein FDR-By including decoys in our target database we were able to apply the recently developed tool Mayu to accurately estimate the protein FDR.
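The contrast in item 3 between a uniform probability cutoff and a per-experiment FDR threshold can be illustrated with a minimal sketch. It assumes that each PSM carries a posterior probability p of being correct and that the FDR of an accepted set is estimated as the mean of 1 - p over that set. This estimator, the function names, and the example probabilities are illustrative assumptions, not the atlas build code or its values.

```python
def estimated_fdr(probabilities, cutoff):
    """Model-based FDR of the PSMs with probability >= cutoff (assumed estimator)."""
    accepted = [p for p in probabilities if p >= cutoff]
    if not accepted:
        return 0.0
    return sum(1.0 - p for p in accepted) / len(accepted)


def probability_cutoff_for_fdr(probabilities, target_fdr):
    """Lowest probability cutoff whose accepted set stays at or below target_fdr.

    Returns None when not even the best-scoring PSM meets the target.
    """
    ranked = sorted(probabilities, reverse=True)
    expected_false = 0.0
    cutoff = None
    for n, p in enumerate(ranked, start=1):
        expected_false += 1.0 - p
        if expected_false / n > target_fdr:
            break  # the running FDR only grows as lower-scoring PSMs are added
        cutoff = p
    return cutoff


# Made-up PSM probabilities for a high quality and a poor experiment.
good = [0.999, 0.998, 0.995, 0.99, 0.95, 0.90]
poor = [0.99, 0.95, 0.92, 0.91, 0.90, 0.90]

# A uniform cutoff of 0.9 admits a larger share of false PSMs in the poor run.
print(estimated_fdr(good, 0.9), estimated_fdr(poor, 0.9))

# A fixed FDR target instead yields a stricter cutoff for the poor run.
print(probability_cutoff_for_fdr(good, 0.02), probability_cutoff_for_fdr(poor, 0.02))
```

In this toy example the same FDR target translates into a different probability cutoff for each experiment, which is the behavior the uniform 0.9 cutoff of the 2007 build lacked.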
Single-PSM Protein Identifications-Three hundred fifty-seven single-PSM protein identifications passed our rigorous computational pipeline. This subpopulation has a Mayu decoy-estimated protein FDR of 3.4%. Because decoy analysis may underestimate protein FDR (44) and because single-PSM protein identifications are especially in need of extra validation, we manually examined all 357 and discarded 70 that we believed had a greater than 10% chance of being false identifications (see details under "Experimental Procedures"). Assuming that the FDR decreased as a result, we state that the final protein FDR is <0.98%. Building a protein FDR 1% atlas excluding all single-PSM protein identifications would have included more multiple-PSM identifications, but fewer total protein identifications (see supplemental Data, Choice of Atlas Stringency Level, for analysis).
Estimated Concentrations-Although plasma protein concentration is dependent on the individual organism, its disease state, and its physiological status at time of sample collection, concentrations of relatively abundant proteins under relatively normal conditions generally do not vary more than an order of magnitude (45), and it is useful to have a rough estimate of normal protein concentration for purposes such as the spiking in of reference peptides for SRM or other targeted MS measurements. Spectral counting has been established as a reliable method for both relative (38,41) and absolute (39,46) quantification of proteins based on LC-MS/MS data. Comparison of raw spectral counts has previously been used for relative quantification between plasma samples (47). Here, following a simplification of the APEX method of Lu and coworkers (39), we obtain absolute quantification by normalizing spectral counts to adjust for the number of observable tryptic peptides per protein and by calibrating to previously measured protein concentrations.
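The two steps named above, normalization by the number of observable tryptic peptides and calibration against previously measured concentrations, can be sketched as follows. The sketch assumes a least-squares line fitted in log-log space as the calibration model; that model, the function names, and all numbers are illustrative assumptions rather than the procedure or values used for the atlas.

```python
import math


def normalized_count(psm_count, n_observable_peptides):
    """Spectral count per observable tryptic peptide."""
    return psm_count / n_observable_peptides


def fit_log_linear(calibration):
    """Least-squares fit of log10(concentration) against log10(normalized count).

    calibration: list of (normalized_count, known_concentration) pairs.
    Returns (slope, intercept).
    """
    xs = [math.log10(count) for count, _ in calibration]
    ys = [math.log10(conc) for _, conc in calibration]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_concentration(psm_count, n_observable_peptides, slope, intercept):
    """Estimated concentration, in the units of the calibration values."""
    x = math.log10(normalized_count(psm_count, n_observable_peptides))
    return 10 ** (slope * x + intercept)


# Made-up calibration proteins: (normalized count, known concentration in ng/ml).
calibration = [(1.0, 100.0), (10.0, 1000.0), (100.0, 10000.0)]
slope, intercept = fit_log_linear(calibration)

# A hypothetical protein with 500 PSMs and 50 observable tryptic peptides.
print(estimate_concentration(500, 50, slope, intercept))
```

Working in log space keeps proteins that differ by several orders of magnitude in abundance from being dominated by the most abundant calibration points.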
The estimated concentrations are rough estimates and should not be mistaken as accurate quantitative values. Above 1 μg/ml, they are generally accurate within one to two orders of magnitude. Sixty-eight canonical proteins not used for spectral counting calibration appear in the Hortin et al. 2008 review of abundant plasma proteins (48); the estimated concentrations for 51% of these proteins are within a factor of 10 of the mean of the concentration range reported by Hortin et al., and 94% are within a factor of 100. Of course, there are considerable uncertainties about these previously published measurements as well, because of the nature of immunoassays and antibody specificities. Further, even a precise concentration measurement in a specific sample would not generate a general statement about plasma protein abundances because of the variation among individuals.
To the extent that these roughly estimated values are accurate, the very large amount of data contributes to the accuracy. Data heterogeneity may also add to accuracy by allowing averaging over many diverse samples. However, it may also detract because of the variety of instruments and settings used. Dynamic exclusion settings, for example, can be optimized to amplify the spectral counts of low abundance proteins relative to the counts for high abundance proteins (49); the mixing of results in PeptideAtlas from experiments with optimized and nonoptimized settings could reduce accuracy. Obviously, estimated concentrations are sensitive to the calibration values used; see supplemental Fig. S4, for illustration.
Concentration is estimated for the group representative for each protein group (as long as its ProteinProphet-adjusted PSM count is at least 4). This concentration must be considered to be shared among all protein sequences in the group, usually splice isoforms or paralogs. Some atlas data come from analysis of depleted samples; concentrations for depleted proteins (including those proteins that are inadvertently removed during the depletion process, see (50)) are underestimated. Plasma concentrations for cellular proteins can be elevated when there is nonphysiological breakage of blood cells during sample collection and preparation. The sum of the estimated concentrations for hemoglobin-α and -β, 71 μg/ml, is close to the 100-200 μg/ml measured in serum in (51), suggesting that such breakage was minimal.
The estimated concentrations based on spectral counting of the canonical protein sequences in the Human Plasma PeptideAtlas span 6.5 orders of magnitude, ranging from 1.6 × 10^6 ng/ml for serum albumin (P02768) down to 0.5 ng/ml for CEACAM1 (P13688, Carcinoembryonic antigen-related cell adhesion molecule 1). Serum albumin is known to be the most abundant protein in plasma with a normal range of 3.4-5.4 × 10^7 ng/ml (2,52), but is underestimated in the atlas because of depletion.
N-linked Glycoproteome-Many proteins of medical interest, such as receptor extracellular portions, transport molecules, and hormones, are N-linked glycosylated. Ninety percent of the 485 canonical protein sequences in the Human Plasma N-Glyco PeptideAtlas contain the N-glycosite motif (N-[not P]-[S/T]) and are thus likely N-linked glycoproteins. However, we emphasize that our computational protocol does not confirm N-linked glycosylation for any particular protein and the N[115] notation does not indicate a confirmed deamidation site. See supplemental Data, Computational pipeline for N-Glyco atlas does not confirm glycosylation, for details. The employed glycocapture technique also purifies some non-glycosylated peptides, presumably through nonspecific binding to the base bead used (Table I).
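Checking a sequence for the N-glycosite motif N-[not P]-[S/T] is a simple pattern match, sketched below. The function name and the example sequence are made up for illustration. As noted above, a motif match only marks a potential site and does not confirm glycosylation.

```python
import re

# N followed by any residue except P, then S or T. The lookahead leaves the
# two trailing residues unconsumed, so overlapping sites are all reported.
N_GLYCOSITE = re.compile(r"N(?=[^P][ST])")


def n_glycosite_positions(sequence):
    """1-based positions of asparagines that start an N-[not P]-[S/T] motif."""
    return [match.start() + 1 for match in N_GLYCOSITE.finditer(sequence)]


# Made-up sequence: N3 (NGS), N13 (NNT), and N14 (NTS) match; N8 (NPT) does not.
print(n_glycosite_positions("MKNGSAANPTLLNNTS"))
```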
Eighty-six canonical protein sequences from the Human Plasma N-Glyco PeptideAtlas, all with estimated concentrations ≤ 25 ng/ml, are not found in the Human Plasma Nonglyco PeptideAtlas exhaustive set (supplemental Table S7). All but one of the 125 peptides mapping to these proteins have an N-glycosite motif. Because glycosylation hinders LC-MS/MS identification, it is highly unlikely that these peptides would be identified without the glycocapture protocol, which results in removal of glycan groups. Indeed, only four are present in the NIST 3.0 spectral library we used to search the non-glyco data.
Of the 86 proteins, 31% have no spectra in the NIST 3.0 library and thus could not have been identified by spectral searching. However, as explained in supplemental Data, Completeness of spectral library searching, we expect very few additional canonical proteins would be identified were we to perform database searching on the non-glyco data. Therefore, we conclude that nearly all of these 86 proteins are missing from the Non-glyco atlas because they are of low abundance in plasma.
Missed Cleavages; Semitryptic and Nontryptic Peptides-Both SpectraST and X!Tandem were set to allow matches to peptides with missed cleavages and/or peptides that were not fully tryptic; see Table II for tallies. Missed cleavages and nontryptic termini are usually penalized by ProteinProphet; penalties vary depending on the software's statistical modeling of each data set.
Contribution of Trauma Experiments-Our intention was to catalog the proteins found in normal plasma; therefore, the 2010 Human Plasma PeptideAtlas almost exclusively includes experiments on samples originating from individuals with no known disease state or other unusual condition. Six included experiments, however, were performed on a pool of six severe trauma patients plus one healthy subject (20). We found that 455, or 24%, of the canonical protein sequences in the Atlas were observed only in one or more of these experiments and not in any of the other 85, raising the question of whether these proteins are trauma-specific. The 455 are all low abundance, with at most 145 PSMs per protein. We believe that most of these are difficult-to-detect proteins present in normal plasma, rather than trauma-specific proteins, because of the advanced technology employed in these experiments: depletion of the 12 most abundant plasma proteins; fractionation into cysteinyl and noncysteinyl peptides and glyco- and nonglycopeptides; separation of each fraction into 30 subfractions using strong cation exchange; and analysis on a Thermo LTQ instrument. This workflow yielded nearly twice the peptide identifications per experiment when compared with earlier experiments from the same lab, which depleted only the six most abundant plasma proteins and, in some cases, employed a less advanced instrument (Thermo LCQ) (54).
Keratins and Immunoglobulins-Some keratins are common contaminants in proteomic sample processing, and the immunoglobulins are a very large class of plasma proteins consisting of similar interchangeable subunits, so one may wish to omit these classes of protein sequences from a plasma proteome reference set. We estimated the number of canonical protein sequences that belong to these classes (Table II) by counting those identified as immunoglobulins or keratins in their descriptions, plus all those in the same protein group as such a sequence. We counted all keratins, even those that are internal cytokeratins and not skin contaminants. We did not count sequences annotated as immunoglobulin-like or immunoglobulin-related. Omitting these immunoglobulins and keratins leaves 1769 canonical protein sequences not belonging to these classes.
Evidence for Multiple Splice Isoforms and Single Nucleotide Polymorphisms-The human section of Swiss-Prot is curated to contain one entry per protein-coding gene, each with descriptions for known splice isoforms. There is only one Swiss-Prot entry for which two splice isoforms exist in the canonical set, and it is only this protein, mannan-binding lectin serine protease 1, which we confidently claim is present in more than one splice isoform in human plasma. Twelve additional Swiss-Prot alternative splice isoforms are noted as possibly distinguished; we are less confident that these are present as distinct isoforms because possibly distinguished protein sequences have only a small amount of peptide evidence distinguishing them from their canonical counterparts. Further, 131 canonical protein sequences come from the IPI or Ensembl databases, indicating that each includes at least one observed peptide that is not mappable to any Swiss-Prot entry. These might represent single nucleotide polymorphisms or sequence errors (see IPI00887739 in Complement C3 group in supplemental Fig. S1 for an example), or protein-coding genes or splice variants not described in Swiss-Prot.

Composition and Completeness of Proteome Reference Set-Our set of 1929 canonical protein sequences, by far the largest published so far at this confidence level, includes the highest concentration proteins as well as nearly complete coverage of the phosphoproteome described in (55) (details in Table III). Still, we believe it is far from a complete catalog of the human plasma proteome. First, our reference set and the MS-derived lists in Table III are all biased toward proteins that are readily detectable by MS techniques; proteins missing from one list are likely to be missing in the others, so coverage of the lists in Table III is not indicative of complete proteome coverage.
Other evidence suggests we are not close to full coverage of even the LC-MS/MS-observable proteome. Mayu analysis of the 5% protein FDR plasma atlas (see Experimental Procedures) shows that at least 410 correct identifications are excluded from the 1% protein FDR Human Plasma PeptideAtlas by its stringent FDR threshold. Fig. 4, showing the accumulation of canonical proteins as additional identified MS/MS spectra were added to the Human Plasma PeptideAtlas, also suggests that we are not near complete coverage. The PPP-I data contributed about 38% of the total canonical proteins. Growth after PPP-I was shallow, then jumped with the addition of experiments employing extensive depletion and fractionation and high mass accuracy instruments (19,20). The curve will asymptotically approach the total number of proteins detectable with the techniques used, but is not yet nearing that limit. In 2008, Schenk et al. published a plasma proteome reference set (16) of comparable confidence to ours (see supplemental Data, Comparison of confidence level with Schenk et al., for details). Of the 697 nonredundant, nonimmunoglobulin protein identifiers in (16), 51 are in our combined protein sequence database, but not in the Human Plasma PeptideAtlas exhaustive set, meaning that we identified no peptides for them (see supplemental Table S8). If the Schenk et al. data were added to the Human Plasma PeptideAtlas, most or all of these would appear in the resulting canonical list. This supports our conclusion that more data, preferably from different laboratories using different sample sources, depletion techniques, and preparation techniques, will continue to add significant numbers of high confidence protein sequences to the human plasma proteome.
Because we searched the nonglycocapture data against a spectral library and not against a sequence database, we only identified peptides that had been previously seen in LC-MS/MS experiments and included in the NIST spectral library. However, the NIST library is extremely comprehensive, including most of the human data in the PeptideAtlas (from plasma and many other sources), so nearly all human-derived spectra identifiable with a sequence search engine with standard parameters will be identified with our spectral library search. Very few, if any, canonical proteins would be added to the Atlas were we to incorporate sequence database search results (see supplemental Data, Completeness of spectral library searching, for analysis).
Multi-tiered Protein Identifications: Alternatives for Comparison of Data Sets-As described under "Experimental Procedures," we created our exhaustive identification set by mapping all identified peptides to a combined protein sequence database containing many sequences repeated identically or with only slight variations. Removing redundancy from such a set is always a problem in interpreting proteomics data, and no standard methods have been agreed upon.
In considering this issue, it is critical to understand that virtually no protein identification list for a given data set can be considered definitive. Once one eliminates exact duplicates, the process of removing redundancy necessarily involves choices that are somewhat arbitrary, as described in (58), and is at odds with the preservation of identifications consistent with the data. In most cases, a highly nonredundant list is necessarily a model or example list, each entry of which may represent several proteins that are as likely, or almost as likely, to exist in the sample. In particular, we emphasize that we do not claim to have definitive evidence for any of the specific isoforms in our canonical set; rather, we claim that, for each protein sequence in the set, there exists either that protein or a closely related one in at least one of the samples.
For some purposes, such as estimating the number of distinct proteins revealed by the data, a highly nonredundant protein identification set is desired. For other purposes, such as comparison with a nonredundant list for another proteome, filtering by molecular weight or pI, or selection of peptides for SRM experiment design, redundancy is desirable. As described under "Experimental Procedures," we created several different protein sequence sets that could be used, alone or in combination, for different purposes.
Multitiered schemes are not novel and have been implemented in many proteomics studies. For example, the core data set for PPP-I contained 3020 protein sequences, but alternative threshold criteria were used to generate several other sets including a set of 889 protein sequences using very restrictive criteria with an adjustment for multiple hypothesis testing (6), roughly analogous to our canonical set, and an unintegrated set of 15,710 protein sequences based on only a single peptide, roughly analogous to the exhaustive set defined here.
With the current work, we make two contributions in this area. First, we present Cedar, a protein identification classification scheme based on the freely available ProteinProphet and applicable to any search results that can be converted to mzML (59) or mzXML (23). Protein identifications generated for different data sets using Cedar can be easily and meaningfully compared against each other. Although software is not yet available to automate Cedar, all steps except for the manual validation of single-PSM identifications are clearly defined and reproducible, and we propose Cedar as a standard for the community, including the HUPO Human Proteome Project.
Second, we assert that when evaluating the overlap between the protein identifications for two proteomics data sets, it is essential to map to the same sequence databases and to compare the highly nonredundant (Cedar's canonical) set for one against the maximally redundant (Cedar's exhaustive) set for the other. Otherwise, the overlap will be under-reported. For example, Schenk et al. reported that 242 of their 697 high confidence identifications were found on the HUPO high-confidence list. We compared their identifications against the exhaustive set for an atlas we built from most of the HUPO PPP-I data (see "Results", Single-PSM Protein Identifications) and found an overlap of 362 identifications, which is 50% more.
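The effect of comparing canonical against exhaustive sets can be illustrated with a toy example. The identifiers below are placeholders, not real accessions, and the sketch assumes both studies have already been mapped to the same sequence database.

```python
def overlap(nonredundant_ids, other_ids):
    """Number of identifiers in the nonredundant set that also occur in the other set."""
    return len(set(nonredundant_ids) & set(other_ids))


# Study A: a highly nonredundant (canonical-style) list.
study_a_canonical = {"PROT_1", "PROT_2", "PROT_3", "PROT_4"}

# Study B: the same underlying data expressed at two redundancy levels.
# Its canonical list happens to represent some groups by a different,
# equally plausible sequence (for example, "PROT_2_ISO2" instead of "PROT_2").
study_b_canonical = {"PROT_1", "PROT_2_ISO2", "PROT_3_ISO2", "PROT_9"}
study_b_exhaustive = {
    "PROT_1", "PROT_2", "PROT_2_ISO2", "PROT_3", "PROT_3_ISO2", "PROT_9",
}

print(overlap(study_a_canonical, study_b_canonical))   # canonical vs canonical
print(overlap(study_a_canonical, study_b_exhaustive))  # canonical vs exhaustive
```

Canonical-versus-canonical comparison misses PROT_2 and PROT_3 only because each list picked a different representative sequence. Comparing against the exhaustive set recovers them.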
Spectral Counting-Spectral counting has allowed us to provide rough estimates for protein concentrations in the Human Plasma PeptideAtlas down to 0.54 ng/ml, but even lower estimated concentrations are achievable. By including about 100 times as many PSMs as currently included, we could reach 7 × 10^-3 ng/ml, the lowest concentration measured by antibody-based methods in (40). See supplemental Data, Completeness of spectral counting, for analysis.
Estimated concentrations in ng/ml, along with uncertainty factors, are now available in PeptideAtlas. Again, these are rough estimates and should not be mistaken as accurate quantitative values. Experimentally measured concentrations from (40), (57), and (Anderson, N. L. (2007) private communication) are provided as well. We plan to apply this same spectral counting method to atlas builds for other subproteomes such as human urine, mouse plasma, and various organ or cell type data sets that we acquire. Our goal is to develop a quantitative PeptideAtlas reflecting protein expression in multiple organs, cell types, and biofluids in health and disease.

Uses of the Human Plasma PeptideAtlas
Biomarker Discovery-Polanski and Anderson in 2006 (57) published a review of candidate cancer biomarkers listing 1261 proteins believed to be differentially expressed in patients with various cancers. Literature search revealed only 274 to be reported in plasma, but 326 appear in the Human Plasma PeptideAtlas exhaustive set (Table III), skewed toward lower concentrations. Those identified in (46) as "high priority" for biomarker development (about one-third of the 326) are listed in supplemental Table S9.
Experiment Design for Targeted Proteomics-When a protein is observed in a sample that is analyzed with LC-MS/MS techniques, some of the protein's peptides are observed many times, whereas others are not observed at all, despite being in the observable mass range and otherwise having attributes consistent with MS analysis (60,61). Several algorithms that attempt to predict observability based on sequence attributes have been put forward (39,60,61); these are heavily influenced by the data with which they are trained. As noted under "Experimental Procedures", for all peptides in the Human Plasma PeptideAtlas, we calculated an empirical observability score that does not rely on prediction algorithms; however, it is highly dependent on MS data collection parameters, including dynamic exclusion settings, as in (49).
Because shotgun-style experiments of complex samples will always miss many proteins, especially low concentration proteins, a targeted approach in which the mass spectrometer selects only peptides contained within specific proteins of interest should be more successful, reproducible, and time efficient. Using the PeptideAtlas web interface, one can select peptides based on the empirical observability score and other attributes, such as number of observations, number of protein mappings, missed cleavages, semitryptic, or multiple genome locations, and present these as an inclusion list for the mass spectrometer.
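The kind of attribute-based peptide selection described above can be sketched as a simple filter and ranking. This is not the PeptideAtlas web interface or any PeptideAtlas API; the record fields, thresholds, function name, and example peptides are all hypothetical and chosen only to mirror the attributes listed in the text.

```python
from dataclasses import dataclass


@dataclass
class Peptide:
    sequence: str
    observability: float      # empirical observability score (hypothetical 0-1 scale)
    n_observations: int
    n_protein_mappings: int
    missed_cleavages: int
    semitryptic: bool


def select_for_inclusion(peptides, min_observability=0.5, min_observations=5, max_peptides=3):
    """Keep unique, fully tryptic, well-observed peptides, best observability first."""
    candidates = [
        p for p in peptides
        if p.n_protein_mappings == 1
        and p.missed_cleavages == 0
        and not p.semitryptic
        and p.n_observations >= min_observations
        and p.observability >= min_observability
    ]
    candidates.sort(key=lambda p: p.observability, reverse=True)
    return candidates[:max_peptides]


# Made-up peptide records for one protein of interest.
peptides = [
    Peptide("AAAGLVK", 0.90, 120, 1, 0, False),
    Peptide("LLEDKSR", 0.80, 40, 1, 1, False),   # rejected: missed cleavage
    Peptide("GGSTVNR", 0.70, 15, 2, 0, False),   # rejected: maps to two proteins
    Peptide("TTPEQLK", 0.60, 8, 1, 0, False),
    Peptide("SSDVFYR", 0.95, 3, 1, 0, False),    # rejected: too few observations
]

print([p.sequence for p in select_for_inclusion(peptides)])
```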
PeptideAtlas includes several other features to support SRM experiment design. For peptides belonging to proteins not yet observed in PeptideAtlas, observability scores based on sequence attributes are calculated. When multiple spectra exist for the same precursor ion, they are combined to generate a consensus spectrum that can be visualized by the user. Transition lists can be generated automatically from these consensus spectra according to user-specified rules. For absolute protein abundance measurements, the estimated protein concentrations described above allow one to spike in synthetic reference peptides at concentrations similar to those expected in the sample. These features and others are described in (43). Finally, we and others are in the process of systematically generating reference fragment ion spectra from synthetic peptide libraries using the triple quadrupole instruments used for SRM measurements, and we will make these publicly accessible as verified transition sets (63,64).

CONCLUSION
PeptideAtlas is an integral part of the ProteomeXchange infrastructure for HUPO initiatives and other worldwide data submissions (figure published in (65)), together with the ProteomeCommons.org Tranche distributed file-sharing system (66) and the EBI PRIDE (67) database. PRIDE contains the investigators' original data sets; PeptideAtlas consolidates the raw data of individual studies into re-analyzed proteome reference sets. A significant aspect of PPP-II is the establishment of a standard method for the submission of data to the ProteomeXchange consortium. It is the policy of PPP-II that all published plasma data be submitted to Tranche or PRIDE, from which it will be stored in Tranche and incorporated into the PeptideAtlas.
The PeptideAtlas approach described here provides a framework for the continued analysis of human and other complex proteomes. Soon, MS/MS data interpretation based on translated genomes will be replaced by rich spectral libraries derived from both natural and synthetic peptide information, which outperform current database searching strategies. Already, there is a complete spectral database for the entire yeast proteome (64) and mouse and human are being completed (Deutsch et al., in preparation; Kusebauch et al., in preparation).
The 2010 Human Plasma PeptideAtlas, a comprehensive collection of high-confidence peptide and protein identifications, contains well over twice as many protein sequences as any previous collection at a similar confidence level. With estimated concentrations and a multitiered protein identification scheme, it is a useful resource for biomarker discovery and SRM experiment design. Peptide identifications, protein identifications, estimated concentrations, and raw data in mzXML (23) format are all offered freely to the public at www.PeptideAtlas.org.