Tandem Mass Spectral Libraries of Peptides in Digests of Individual Proteins: Human Serum Albumin (HSA) *

This work presents a method for creating a mass spectral library containing tandem spectra of identifiable peptide ions in the tryptic digestion of a single protein. Human serum albumin (HSA) was selected for this purpose owing to its ubiquity, high level of characterization and availability of digest data. The underlying experimental data consisted of ∼3000 one-dimensional LC-ESI-MS/MS runs with ion-trap fragmentation. In order to generate a wide range of peptides, studies covered a broad set of instrument and digestion conditions using multiple sources of HSA and trypsin. Computer methods were developed to enable the reliable identification and reference spectrum extraction of all peptide ions identifiable by current sequence search methods. This process made use of both MS2 (tandem) spectra and MS1 (electrospray) data. Identified spectra were generated for 2918 different peptide ions, using a variety of manually validated filters to ensure spectrum quality and identification reliability. The resulting library was composed of 10% conventional tryptic and 29% semitryptic peptide ions, along with 42% tryptic peptide ions with known or unknown modifications, which included both analytical artifacts and post-translational modifications (PTMs) present in the original HSA. The remaining 19% contained unexpected missed cleavages or were under- or over-alkylated. The methods described can be extended to create equivalent spectral libraries for any target protein. Such libraries have a number of applications in addition to their known advantages of speed and sensitivity, including the ready re-identification of known PTMs, rejection of artifact spectra and a means of assessing sample and digestion quality.

Shotgun proteomics is a widely used and evolving method for determining the protein composition of a biological mixture (1-3).
It most often involves the digestion of denatured proteins by trypsin, followed by the identification of product peptides and the use of this information to infer protein identities and possibly targeted post-translational modifications (PTMs). However, because digestion is a highly complex chemical process, a large proportion of identifiable products are not specifically targeted for analysis and are therefore invisible to it. These include unexpected and unwanted peptides that interfere with the analysis. Others may contain modifications of biological origin, which, unless specifically targeted, can be lost among the forest of artifacts (4-6). This paper describes methods for building a tandem mass spectral library capable of characterizing all identifiable peptides in a tryptic digest of a selected protein. Spectral libraries are known to provide an effective way of reusing this information to quickly, reliably, and sensitively determine peptide identities (7-11). These identifications can serve several purposes, including 1) ensuring that all previously identified peptides are identified regardless of search engine settings, 2) tagging artifact peptides that might otherwise lead to false positive identifications, 3) ensuring the identification of known and identifiable biological post-translational modifications without explicitly looking for them, and 4) providing a list of artifact peptides for assessing the quality of the sample preparation process.
HSA, human serum albumin, was selected as the target protein for library development partly because of its ubiquity, making up >50% of the total protein in blood (12, 13) and therefore found in many biological samples, and partly because of the considerable background information available for its digestion products (14-19). However, despite the longstanding interest in this protein (20, 21), a thorough determination of its digestion products has not been reported. HSA is composed of 585 amino acids and yields a wide range of tryptic peptides, including many with missed or irregular cleavages and a variety of both native and analytical modifications. At first sight, the analysis of just one protein may appear straightforward because it is common practice in the field of proteomics to search for thousands of proteins in a biological sample. However, a thorough analytical characterization of HSA peptide ions requires a very different method of analysis. It needs to deal with the wide diversity of digestion products, many of which cannot be predicted in advance and whose relative concentrations are likely to depend on complex chemical processes that cannot be fully controlled. Products include peptides with missed and irregular cleavages, under- or over-alkylation, unexpectedly high and low charge states, and an uncertain number of modifications, including unknown modifications (i.e. so-called blind modifications (22, 23)). Furthermore, the process of identifying such peptides is prone to misidentification by accidental "homologies" (two different peptides yielding an overlapping set of y/b ions). Including these variant peptides leads to a dramatic increase in the number of both true and false HSA peptide identifications compared with those of the commonly sought tryptic peptides (24, 25) at a given score threshold.
This paper describes a series of methods designed to first produce all possible identifications and then to reject false identifications using a variety of filters to generate a reliable and comprehensive library of reference spectra for a single protein.
Experimental and Computational Procedures-Experimental Methods and Data Sources-Most of the mass spectral data used for building the HSA library came from 2035 LTQ runs and 522 LTQ/Orbitrap runs (Thermo Fisher Scientific, San Jose, CA, see Disclaimer). Many of these were generated for two studies examining digestion variability (26, 27). These served to generate peptides over a wide range of conditions and HSA sources, including 12 HSA samples from five vendors, eight sources of trypsin, and a range of denaturing/digestion conditions. High temperature (90°C) and urea (6 M) were the most commonly used denaturing conditions. Most commonly, dithiothreitol (DTT) was the reducing agent, iodoacetamide (IAA) the alkylation agent and tris-hydroxymethyl-aminomethane (TRIS) the buffer. Concentrations of these were varied, as were those of HSA and trypsin. Other runs employed organic denaturants or no denaturant, cleavable surfactants, tris(2-carboxyethyl)phosphine (TCEP) as a reducing agent, and widely varying digestion times (5 min to 2 days). Also included were 355 runs of digests of a plasma-like protein mix from the NIH/NCI-supported Clinical Proteomic Technology Assessment for Cancer (CPTAC) program (http://proteomics.cancer.gov/programs/CPTAC/), comprised of 200 LTQ and 155 LTQ/Orbitrap runs (28-30). Some 122 spectra from the NIST Human library were also included (described later).
Initial Peptide Identifications-The method developed for building this single-protein spectral library was derived from the methods currently used for building the NIST tandem mass spectral libraries of tryptic peptides from digests of biological protein samples (31, 32). As in that earlier work, initial identifications were made from ion-trap fragmentation spectra derived from tryptic digests using four sequence search engines (OMSSA (33), X!Tandem (34), Comet (35), and ProteinProspector (36)), but with a FASTA file containing only the HSA sequence (see Supplemental Table S1) and its reverse. It was found that to reliably identify both long, highly charged peptides and peptides containing a wide range of modifications, two separate sets of searches were necessary. Otherwise, incorrect high scoring semitryptic peptides with unusual modifications could overwhelm correct identifications of conventional tryptic peptides, especially those with multiple missed cleavages. The first search allowed up to two missed cleavages and four charges as well as one nontryptic terminus (semitryptic) and included a list of 22 categories of HSA-targeted modifications (16 in Table IV and 6 in Table V). The second search allowed up to four missed cleavage sites and six charge states, did not allow semitryptic peptides, and permitted only common modifications (variable cysteine alkylation, methionine oxidation, ammonia loss from N-terminal Gln and carbamidomethyl-Cys, and water loss from N-terminal Glu). Results of these searches were merged. To find unidentified modifications, two additional search engines, namely InSpect (37) and TagRecon (38), served to identify single, untargeted modifications with mass shifts at specific residues between −300 and 300 Da. The list of the 22 specified modifications just described was partly built by examining and assigning some of these identifications.
Parent and fragment tolerances of 0.2 m/z and 0.8 m/z, respectively, were used at this stage.
Scores from each of the search engines were normalized using results of searching a combined HSA forward and reversed sequence database. This method refined scores using fractions of unassigned fragment abundances and peptide classes. Tentative identifications were determined based upon a formal 5% false discovery rate (FDR) using a target-decoy approach (39). Owing to the large variety of peptides allowed, even this single protein generated sufficient decoy hits to allow setting a statistically meaningful FDR. Manual examination showed that the computed score threshold was sufficiently low not to miss any of the conventional peptides expected to be generated in HSA digestion. Note also that the actual FDR was far higher than 5% because of the wide search space employed and the consequent generation of many false "homologous" peptide identifications.
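As an illustration of how a score threshold can be derived from a target-decoy search, the following minimal sketch walks down a score-ranked list of hits and keeps the lowest score at which the ratio of decoy to target hits stays within the requested FDR. The function name, the input layout, and the decoys/targets estimator are assumptions made for this example; the text does not specify how the formal FDR was computed, and ties between scores are not treated specially here.

```python
def threshold_at_fdr(hits, max_fdr=0.05):
    """Return the lowest score at which decoys/targets <= max_fdr.

    hits: iterable of (score, is_decoy) pairs (hypothetical layout).
    Returns None when no threshold satisfies the requested FDR.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    targets = 0
    decoys = 0
    best = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all hits scoring at or above this one.
        if targets > 0 and decoys / targets <= max_fdr:
            best = score
    return best


# Toy data: three forward (target) hits and one reversed (decoy) hit.
example_hits = [(0.9, False), (0.8, False), (0.7, True), (0.6, False)]
strict = threshold_at_fdr(example_hits, max_fdr=0.05)
loose = threshold_at_fdr(example_hits, max_fdr=0.5)
```

With a wide search space, a threshold chosen this way controls only the formal rate; as noted above, the actual error rate can be much higher, which is why the filters described next are needed.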
Filters-The wide peptide search space generated a large number of incorrect identifications at search scores appropriate for reliable identification of conventional tryptic peptides.
Ideally, scores would depend on the "prior probability" (40) that a particular variety of peptide ion would be present in the digest; of course, this is not done by present methods. Rejection of these unusual and less predictable peptides requires post-processing analyses. To some degree, this was done by adjusting scores of certain classes of peptides (31, 32), but this was found to be inadequate for the wide range of modifications considered here. Therefore, a general peptide classification scheme, along with a series of five quality filters and one flag, was developed. These are summarized in Table I, which shows the name of each filter, the type of data it uses, the specifics of the filter, and the thresholds for rejection. A description of the peptide classification method and each of the filters follows.
Peptide Classification-For the purpose of excluding the most improbable peptides, peptides were first separated into two broad classes-common and unusual. Common peptides are those expected from digestion and most commonly sought in sequence identification searching. Briefly, these include tryptic peptides with normal missed cleavages (near acidic groups or a terminus), Met/Trp oxidation and N-terminal Cys or Gln loss of ammonia. In-source peptides that co-elute with their precursor peptide are also expected as is the alkylation of all cysteines. Other peptides are classified as "unusual." Peptides that contain features of two or more unusual classes or modifications are rejected.
Filter 1: Peptide Ion Significance-This filter rejects identifications with weak signals that occur rarely. It uses two derived values, the median relative abundance, MRAB, and the peptide ion identification frequency, PIIF. The MRAB of each ion was extracted from the raw data by ProMS, a software tool for LC-MS/MS ion perception and annotation developed at NIST and used in the NIST MSQC Pipeline (30, 41). The abundance of each identified ion was determined from extracted ion chromatograms (XIC). For high resolution data, individual isotopic peaks were summed, whereas for low resolution (LTQ) data (e.g. unresolved isotopic peaks), the peaks were summed within a defined range (-0.6 to 1.6) of the m/z calculated from the ion average mass, a range that generally covers the isotopic components. The relative abundance was then derived by dividing this value by that of the largest identified ion in that run. MRAB is the median of the relative abundance values obtained from all LC/MS runs in which the ion was identified. If a precursor peak could not be found, its abundance was set to zero. The PIIF was simply the fraction of runs in which an ion was identified, excluding special cases such as nonalkylated runs. These two values were computed separately for LTQ and LTQ Orbitrap data. Filtering used LTQ Orbitrap values when available and LTQ values when identifications were made only on those low resolution instruments.
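The two quantities used by this filter can be sketched as follows. This is only an illustration under stated assumptions: the function names and input layout are hypothetical, the relative abundances are assumed to have been extracted already (for example by a tool such as ProMS), and the total run count is assumed to exclude special runs. The rejection rule (PIIF ≤ 0.01 or MRAB = 0) is the threshold reported later in the results.

```python
from statistics import median


def ion_significance(rel_abundances, n_runs_total):
    """Compute (MRAB, PIIF) for one peptide ion.

    rel_abundances: relative abundance of the ion in each run where it was
        identified, with 0.0 where no precursor peak could be found.
    n_runs_total: number of runs considered (special runs already excluded).
    """
    mrab = median(rel_abundances) if rel_abundances else 0.0
    piif = len(rel_abundances) / n_runs_total
    return mrab, piif


def passes_filter1(mrab, piif, min_piif=0.01):
    # Rejected when PIIF <= 0.01 or MRAB = 0.
    return piif > min_piif and mrab > 0.0


# Toy example: an ion identified in 3 of 100 runs.
mrab, piif = ion_significance([0.02, 0.0, 0.04], 100)
keep = passes_filter1(mrab, piif)
```

Using the median rather than the mean keeps a single unusually intense run from rescuing an ion whose precursor peak is missing in most of the runs where it was identified.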
Filter 2: m/z Error-The difference between the observed and theoretical mass of each identified ion served as a filter. The m/z of each peptide ion in a run was taken as its intensity-weighted monoisotopic m/z averaged over its elution profile. Each value was corrected for instrument bias by linear regression of these deviations versus m/z based on the confident identifications. Median absolute m/z deviations were then computed. Identifications made for Orbitrap spectra were rejected when these median deviations exceeded 5 ppm, whereas identifications made only in low resolution instruments (ion trap m/z determination) were rejected when these deviations exceeded 0.25 m/z.
Filter 3: Unidentified Fragment Ions-The presence of significant fragment ions that could not be traced to known fragmentation paths suggests that either the spectrum was contaminated with co-fragmenting ions or that the identification was erroneous. In the NIST human ion trap library the median percentage of unidentified abundance in a spectrum was 8% and the percentage of unidentified peaks was 15%. Examination of questionable spectra led to development of a filter that used both abundances and numbers of peaks. Subfilter 1 was the geometric mean of the fraction of unassigned abundance for the most abundant 20 peaks and for all peaks. Subfilter 2 added to this value the geometric mean of the unassigned fraction of the 20 most abundant peaks and all peaks. If the values for subfilters 1 and 2 exceeded 0.32 and 0.36, respectively, the spectrum was rejected. Note that neutral loss from the precursor was excluded in these calculations, and that small peptides of sequence length less than six were not subject to this filter.
Filter 4: Sufficient Ions above the Precursor m/z-Fragmentation of multiply charged peptides is generally expected to produce significant product ions above the precursor m/z. Moreover, it was noted that a common feature of some questionable identifications was the presence of little signal above the precursor m/z. Based on examination of spectra and findings from the NIST human ion trap library, spectra were rejected when the fraction of the largest 20 fragment ions (excluding neutral loss from the precursor) above the precursor m/z was less than 0.2 for charge 2, 0.3 for charge 3, or 0.36 for charge states higher than 3.
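The Filter 4 rule can be sketched as below. This is a minimal illustration rather than the implementation used for the library: the names are hypothetical, the "fraction of the largest 20 fragment ions" is interpreted here as a fraction of peak counts (an abundance-weighted fraction would be an equally plausible reading), and the caller is assumed to have removed precursor neutral-loss peaks beforehand.

```python
# Minimum fraction of top peaks above the precursor m/z, by charge state.
MIN_FRACTION_ABOVE = {2: 0.2, 3: 0.3}
MIN_FRACTION_HIGHER = 0.36  # charge states higher than 3


def fraction_above_precursor(peaks, precursor_mz, top_n=20):
    """peaks: list of (mz, intensity), precursor neutral losses removed."""
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    if not top:
        return 0.0
    above = sum(1 for mz, _ in top if mz > precursor_mz)
    return above / len(top)


def passes_filter4(peaks, precursor_mz, charge):
    if charge < 2:
        return True  # the filter only concerns multiply charged ions
    limit = MIN_FRACTION_ABOVE.get(charge, MIN_FRACTION_HIGHER)
    return fraction_above_precursor(peaks, precursor_mz) >= limit


# Toy spectra (m/z, intensity); values are invented for illustration.
good = [(300.0, 10), (400.0, 9), (700.0, 8), (800.0, 7), (900.0, 1)]
poor = [(300.0, 10), (400.0, 9), (450.0, 8), (480.0, 7), (490.0, 5), (900.0, 1)]
good_ok = passes_filter4(good, precursor_mz=500.0, charge=2)
poor_ok = passes_filter4(poor, precursor_mz=500.0, charge=3)
```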
Filter 5: Principal Charge State-A significant fraction of the abundance of most tryptic peptides appears in the peptide ion whose charge state is equal to the number of basic residues (NBR = Arg, Lys, His, and N-terminal amine) (42). Relatively little signal typically is carried by charge states more than 1 charge state away from this value. This behavior was confirmed for predominant tryptic peptides and peptides with multiple charge states. Therefore, peptides identified in only one charge state, constituting about 75% of identified HSA peptides, were rejected if their charge state did not match the NBR, with the following exceptions. When basic groups were adjacent, one lower charge state was permitted for each such pair (43, 44). Because of possible long range interactions and involvement of less basic peptides (42), peptides of sequence length greater than 20 containing multiple basic sites were not subject to this filter.
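A minimal sketch of the principal charge state rule for peptides seen in a single charge state is given below. Several details are assumptions made for illustration: "multiple basic sites" is taken to mean two or more Arg/Lys/His residues, adjacent basic residues are counted as non-overlapping pairs, and the N-terminal amine is not paired with a neighboring basic residue. The sequence "AAKRAAK" is a made-up input used only to exercise the adjacency exception.

```python
BASIC = set("RKH")


def n_basic_residues(seq):
    """NBR: Arg, Lys and His residues plus the N-terminal amine."""
    return sum(aa in BASIC for aa in seq) + 1


def adjacent_basic_pairs(seq):
    """Count non-overlapping pairs of adjacent basic residues."""
    pairs = 0
    i = 0
    while i < len(seq) - 1:
        if seq[i] in BASIC and seq[i + 1] in BASIC:
            pairs += 1
            i += 2
        else:
            i += 1
    return pairs


def passes_filter5(seq, charge):
    nbr = n_basic_residues(seq)
    # Long peptides with multiple basic sites are exempt (assumed: >= 2).
    if len(seq) > 20 and nbr - 1 >= 2:
        return True
    # One lower charge state is allowed per adjacent basic pair.
    lowest_allowed = nbr - adjacent_basic_pairs(seq)
    return lowest_allowed <= charge <= nbr


ok_2plus = passes_filter5("FSALEVDETYVPK", 2)   # NBR = 2 (Lys + N terminus)
bad_3plus = passes_filter5("FSALEVDETYVPK", 3)
adjacent_ok = passes_filter5("AAKRAAK", 3)      # NBR = 4, one KR pair
```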
Flag: Gaps in Charge State Distribution-When peptides were identified in multiple charge states, all charge states between the maximum and minimum charge are expected to be identified.
Any gaps were manually examined to find the origin of the problem. As discussed later, this led to improvements in the methods.
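Detecting such gaps is straightforward; a minimal sketch, with a hypothetical function name, is shown here. It only lists the missing charge states and leaves interpretation to manual review, in keeping with the flag's role described above.

```python
def charge_state_gaps(charges):
    """Return charge states missing between the lowest and highest observed."""
    observed = set(charges)
    if not observed:
        return []
    return [z for z in range(min(observed), max(observed) + 1)
            if z not in observed]


gaps = charge_state_gaps([2, 4, 5])  # charge 3 is missing
no_gaps = charge_state_gaps([2, 3])
```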
FIG. 1. Single-protein spectral library building pipeline. Flow diagram illustrating the six major stages of the library building process used for the single-protein spectral library.
In Stage 1 the underlying data was generated or collected. In Stage 2 peptides were tentatively identified using the wide range of search methods and parameters described earlier. This large search space led to many false and conflicting spectrum identifications. This most frequently occurred for groups of peptide ions having high charge states, unusual modifications, and/or irregular cleavages, but with sufficient sequence similarity to more common tryptic peptides to generate an overlapping set of y- or b-ions. These identifications were often found to depend on the search engine and its specific settings. These ambiguities were resolved in the later stages. In Stage 3, spectra for these tentative identifications were combined to create an annotated "consensus" spectrum that included information concerning the origin of the underlying spectra, peak labeling, search engine scores, and other processing details (31, 32). In Stage 4, relevant MS1 and MS2 information needed for later filters was extracted and analyzed for these identifications using the underlying raw data. In Stage 5, the classifications and filters described above were used to reject uncertain identifications. Many rejected spectra, especially those with high scores and identification frequency, were examined to find why they were rejected, guiding the development of the present method. In Stage 6, the final library was derived: all spectra were inter-compared and conflicts between similar spectra having different identifications were resolved. In this process, expected peptides were preferred over unusual peptides. When this did not resolve ambiguities, the higher scoring identification was kept, with alternatives given in the spectrum annotation.
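The Stage 6 preference rule can be sketched as a simple ordering. The record layout (`peptide`, `is_common`, `score`) and the function name are hypothetical and chosen only for illustration; how similar spectra were matched to one another is not covered here.

```python
def resolve_conflict(candidates):
    """Pick one identification among conflicting ones for similar spectra.

    candidates: list of dicts with keys 'peptide', 'is_common' and 'score'
    (hypothetical layout). Expected ("common") peptides are preferred over
    unusual ones; remaining ties are broken by the higher score.
    Returns (kept, alternatives); alternatives go into the annotation.
    """
    ranked = sorted(candidates,
                    key=lambda c: (c["is_common"], c["score"]),
                    reverse=True)
    return ranked[0], ranked[1:]


# Placeholder labels, not real identifications.
conflict = [
    {"peptide": "unusual candidate", "is_common": False, "score": 0.9},
    {"peptide": "common candidate", "is_common": True, "score": 0.6},
]
kept, alternatives = resolve_conflict(conflict)
```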
Consensus Spectrum Rejection using Quality Filters-Peptides were divided into the nine classes presented in Table II. For each class is shown the type of peptide (common or unusual), peptide description, number of ions prior to filtering, the numbers of ions rejected by each filter, ions in the final library (number and percent) and the contribution of each class to total identified ion intensity.
Filter 1: Peptide Ion Significance-The ability of the library consensus spectrum of a peptide ion to re-identify this ion in the original data provided a measure of significance of the ion and quality of its spectrum. Using the preliminary library (before filters were applied), 2214 consensus spectra had PIIF ≤ 0.01 or MRAB = 0 (threshold in Table I); these were rejected. Among them, 83 consensus spectra were not matched in any run. This occurred when a good quality consensus spectrum could not be derived during the construction of library consensus spectra because of low quality source spectra. It was also found that 141 ions produced identifications (score > 0.45) in only 1 or 2 runs, and 473 ions were matched in fewer than 10 runs. This filter has an especially large effect on Class 9 (unidentified modifications), removing 1739 ions, most seen only in low mass accuracy runs. Some examples of the excluded spectra are included in supplemental Table S2.
Filter 2: m/z Error-The mass accuracy calculations described in the Methods section rejected hundreds of ambiguous or erroneous peptide identifications. Insufficient precursor m/z accuracy led to rejection of 792 (13%) of initially identified ions (Table II). Among them, 65% were from Orbitrap data and the rest were from LTQ runs. In Orbitrap runs, the deviations of 510 rejected ions ranged from 5 to 2181 ppm with a median of 471 ppm. As shown in the Filter 2 column of Table II, 95% of these rejected ions were from classes 5-9, with only 36 ions from the common classes. Manual inspection showed that many of these had different assignments from different sequence search engines. Filter 2 rejected many false assignments, with some examples given in supplemental Table S3.
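The rejection rule behind these counts can be sketched as follows, using the thresholds given in the procedures (5 ppm for Orbitrap data, 0.25 m/z for ion-trap-only identifications). The function names are hypothetical, and the sketch assumes the per-run m/z values have already been corrected for instrument bias; the regression step itself is omitted.

```python
from statistics import median


def ppm_error(observed_mz, theoretical_mz):
    """Signed m/z error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def passes_filter2(observed_mzs, theoretical_mz, high_res):
    """Apply the median absolute deviation test for one peptide ion.

    observed_mzs: bias-corrected precursor m/z per run (assumed input).
    high_res: True for Orbitrap data, False for ion-trap-only data.
    """
    if high_res:
        dev = median(abs(ppm_error(o, theoretical_mz)) for o in observed_mzs)
        return dev <= 5.0
    dev = median(abs(o - theoretical_mz) for o in observed_mzs)
    return dev <= 0.25


# Invented values: deviations of 2, 4 and 1 ppm around m/z 500.
orbitrap_ok = passes_filter2([500.001, 500.002, 500.0005], 500.0, True)
iontrap_ok = passes_filter2([500.3, 500.4, 500.1], 500.0, False)
```

A median deviation of several hundred ppm, as reported for many rejected Orbitrap identifications, fails this test by a wide margin, which is consistent with those assignments being wrong rather than merely imprecise.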
Filter 3: Unidentified Fragment Ions-This filter led to the removal of 852 peptide spectra (14%) from the initial library. Of these rejected spectra, 595 would have been removed by Filter 1 as insignificant spectra and 114 would have been removed by Filter 2 because of large mass error. Peptides with unusual modifications constituted 80% of the rejections.
Filter 4: Insufficient Ions above the Precursor m/z-Of 2946 multiply charged ions, 224 did not pass the requirement for sufficient sequence ions above the precursor m/z. Of these, 75% would have been rejected by Filter 1 because of low peptide identification frequency or by Filter 2 because of large mass error. This absence of significant identified peaks above the precursor m/z was a useful filter for removing low quality spectra (see supplemental Fig. S1 for examples).
Filter 5: Principal Charge States-Of the 4275 peptides identified in only one charge state, 151 were rejected because their charge state was not equal to the number of basic residues (NBR = Arg, Lys, His, and N-terminal amine) in the peptide.
Flag: Gaps in the Charge State Distribution-Prior to application of the filters, 53 peptides identified in multiple charge states had gaps in their charge states. In some cases, gaps originated from erroneous identification of at least one peptide ion. After final filter development, all of these gaps disappeared. These flags therefore greatly assisted the refinement of the other filters. In other cases, consensus spectra of some minor peptides for intermediate charge states were rejected when reliable consensus spectra could not be created, possibly because of contamination. These spectra were retained by the library.
* This process started with 7359 spectra. After discarding peptides falling into multiple "unusual" classes, 5991 spectra remained and were then subjected to quality filtering.
Peptide Classes-The following sections present findings for the peptide classes given in Table II. Peptides are ranked by the peptide identification significance value, PSIG, defined as the geometric mean of MRAB and PIIF values. To better represent typical conditions, the following statistics exclude exceptional runs such as those without reducing or alkylating agents or with unusual m/z ranges.
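The ranking value defined above is simply the geometric mean of the two Filter 1 quantities. A minimal sketch follows; the tuple layout and the ion labels are hypothetical.

```python
from math import sqrt


def psig(mrab, piif):
    """Peptide identification significance: geometric mean of MRAB and PIIF."""
    return sqrt(mrab * piif)


def rank_by_psig(ions):
    """ions: list of (label, mrab, piif); returns them sorted by PSIG."""
    return sorted(ions, key=lambda ion: psig(ion[1], ion[2]), reverse=True)


# Invented values for two placeholder ions.
ranked = rank_by_psig([("ion_a", 0.01, 0.5), ("ion_b", 0.04, 0.25)])
```

Because the geometric mean is zero whenever either factor is zero, an ion that is abundant but almost never identified, or frequently identified but without measurable precursor signal, ranks near the bottom.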
Classes 1 and 2: Tryptic Peptides with and without Expected Missed-cleavages-These peptide classes dominate the field of shotgun proteomics. Table III lists those peptide ions with an identification frequency (PIIF) over 50% in the 350 LTQ-Orbitrap runs. Class 1 includes "proteotypic" peptides (45) with no missed cleavages (it also includes peptides with Lys/Arg at the N terminus resulting from cleavage between adjacent cleavable residues). Class 2 includes peptides that contain plausible missed cleavages that are often identified in sequence searching. These only include peptides with missed cleavages where D, E, K, or R is near the missed cleavage site (46). Other missed cleavages can occur when digestion is incomplete and so can be very significant for short digestion times. Classes 1 and 2, which represent only 10.7% (312) of identified peptides, account for over 70% of total peptide abundance (Table II). Their sequence lengths ranged from 4 to 51 amino acid residues, covering over 96% of the total protein sequence. Identifications were also made for 16 small peptides composed of two, three, or four amino acids in special LTQ-Orbitrap runs at lower m/z settings (100-600 m/z). They were identified by both sequence database search and the NIST MSMS library containing tryptic dipeptides and tripeptides (47). Note that peptides having fewer than six amino acid residues are generally invisible in sequence searching but are readily identified by spectrum library searching.
Classes 3 and 6: Common and Less Common Modifications-These peptides are separated into two broad classes: first to be discussed are the 858 analytical modifications (Table IV) and, second, in the next section, are 22 post-translational modifications likely present in the starting HSA (Table V). The origin of a few, such as methionine oxidation, can be unclear. Table IV lists the identified analytical modifications, all of which have been reported in the literature (48-60). The most frequently observed were oxidation of methionine, carbamylation of the N terminus and lysine (when urea is used as a denaturant), formylation of the N terminus, lysine, serine, and threonine, and adduction by sodium and iron, with maximum intensities in the range 1% to 4% of the most abundant ion. Several adducts, including sodium, iron, and calcium, most often appeared to originate in the electrospray, as indicated by their co-eluting with the nonadduct peptide. In some cases, two distinct chromatographic peaks for the same modified peptide were observed, suggesting the presence of some adduct in the original digest. This was especially common for methionine-oxidized peptides (49). One less discussed modification was transpeptidation, which involves the transfer of a basic residue to the N or C terminus of a peptide. Several papers have highlighted its ubiquity (53-56). Transpeptidation was observed as the N- and C-terminal adduct of arginine or lysine. Fifty-three such peptides were identified, contributing 0.09% to the peptide total intensity and covering 50% of the HSA sequence. Another unusual modification, vicinal disulfide (57, 58), the formation of a disulfide bond between adjacent cysteines, was observed between Cys90-91, Cys168-169, and Cys476-477. The delta mass of −2.0157 was detected on the bridged form of these adjacent cysteines in the MS2 spectra.
Although they had low abundances of 0.15%, 0.23%, and 0.07% of the most abundant ion in the run, respectively, each was observed in over one-quarter of the LTQ Orbitrap runs. The MS2 fragmentation pattern of these peptides was consistent with that of their unmodified counterpart but without a cleavage product from the adjacent cysteine bonds. Some adducts, such as Fe and Ca, often appeared to be attached to residues not reported by the Unimod database (59); work is underway to confirm these results and define the positions more precisely.
Subclass: Post-translational Modifications (PTM)-HSA is known to possess various biological modifications (12, 13). Such modifications have direct effects on the binding and antioxidant properties of the molecule and are associated with various diseases (13, 15-19, 61-63). Therefore, these modifications were examined with special care. Using the methods described above, we were able to detect the presence of six categories of PTMs in HSA (Table V). These were: (a) cysteinylation (cysteine addition to Cys34), (b) Cys34 oxidation, (c) protein terminus truncation (the loss of aspartate-alanine from the N terminus or leucine from the C terminus), (d) glycation, (e) acetylation, and (f) phosphorylation. Except for cysteinylation, these identifications were made with a mass accuracy of less than 3 ppm derived from the high resolution LTQ-Orbitrap data under normal digestion conditions. Cysteinylation was only identified in analyses without a reducing agent (64), thereby leaving all native disulfide bonds intact.
Cysteinylation at Cys34 was a particularly abundant modification (64, 65), roughly 70% as abundant as the unmodified counterpart in the same nonreducing runs. Oxidation of Cys34 to sulfenic acid, sulfonic acid, and sulfinamide was detected under typical digestion conditions in four peptide ions at abundance levels of about 5% of their unmodified counterparts. All 3+ charge states of these peptides were reported by Li and Grigoryan et al. (66, 67). Loss of N-terminal aspartate-alanine (-186.06 Da) and C-terminal leucine (-113.08 Da) was identified, and their median relative abundances suggested that C-terminal truncation was more prevalent than N-terminal truncation (68, 69). Several other modifications were also detected in the HSA digestion. Glycation was observed at several lysine residues, including the well documented Lys525 (70-73). This specific modification was detected with an identification frequency range from 26% to 53% of LTQ-Orbitrap runs and an abundance of up to 1% of the most abundant ion in the run. Two lysine sites of HSA acetylation were identified, with Lys199 being seen in 82% of runs and Lys525 observed in only 7% of the runs. HSA phosphorylation was rare, observed in three ions at very low abundance. All of these were only observed in CPTAC studies (28, 29), which employed recombinant human serum albumin. All modification sites in Table V have been reported by the Universal Protein Resource (UniProt) and PhosphoSitePlus (70, 71) and other references (72-75).
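The truncation mass shifts quoted above, and the −2.0157 shift reported earlier for vicinal disulfides, can be checked from elemental compositions. The sketch below uses standard monoisotopic atomic masses and standard amino acid residue formulas; the dictionary and function names are hypothetical, and only the residues needed here are tabulated.

```python
# Monoisotopic atomic masses (Da).
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}

# Elemental composition of amino acid residues (amino acid minus water).
RESIDUE_FORMULA = {
    "D": {"C": 4, "H": 5, "N": 1, "O": 3},   # aspartate
    "A": {"C": 3, "H": 5, "N": 1, "O": 1},   # alanine
    "L": {"C": 6, "H": 11, "N": 1, "O": 1},  # leucine
}


def residue_mass(aa):
    return sum(MONO[el] * n for el, n in RESIDUE_FORMULA[aa].items())


def truncation_shift(removed):
    """Mass shift (Da) caused by removing the given terminal residues."""
    return -sum(residue_mass(aa) for aa in removed)


n_term_shift = truncation_shift("DA")   # loss of N-terminal Asp-Ala
c_term_shift = truncation_shift("L")    # loss of C-terminal Leu
disulfide_shift = -2 * MONO["H"]        # two hydrogens lost on bridging
```

Rounded to the precision used in the text, these evaluate to -186.06 Da, -113.08 Da, and -2.0157 Da, respectively.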

Classes 4 and 5: In-solution and In-source Semitryptic Peptides-These peptides were generated by either in-source fragmentation (labeled "in-source") in the electrospray or non-tryptic cleavage during digestion (labeled "in-solution"). The former were distinguished by their co-elution with their precursor peptides (generally observed within 5 seconds) and the presence of their own precursor m/z as a major peak in the MS2 spectrum of their precursor peptide. As shown in Table II, 263 of these (Class 4) were identified as in-source fragments, and 577 (Class 5) were generated during the in-solution digestion. Table VI lists the most frequently identified semitryptic peptides and their precursor ions. The relative abundance ratios of in-source or in-solution fragments to their probable precursor ion were found to vary by up to 25%. To ensure confidence in their identification, the peptides of ranks 1 and 2, "FSALEVDETYVPK" and "FYAPELLFFAK," in the Class 5 section of Table IV, were synthesized and co-injected in digestion mixtures to confirm their non-in-source origin. Both eluted at times distinctly different from those of their potential precursor peptides and were not dominant fragmentation products of these potential precursors, confirming that they originated in the digestion process. Note that both are characteristic of "pseudotryptic" activity (76). Curiously, the three very abundant in-solution peptides, numbers 1, 2, and 4 in the second part of Table VI, have been reported as candidate biomarkers for disease diagnosis (77-79).
Class 7: Tryptic Peptides with Unexpected Missed-cleavage-Tryptic cleavages after K/R that are not hindered by nearby acidic or cleavable basic residues or proline are expected to be rapid, with a large fraction cleaved in less than 30 min. Hence, at longer digestion times the relative amounts of peptides with such missed cleavages are expected to be small. However, a number of such trypsin cleavage sites persisted even after 18 h digestion periods and changed little in relative abundance between 2 and 18 h. A set of 293 such peptides was identified, accounting for 10% of peptides. The most significant ions of these persistent peptides with a PIIF over 0.40 are given in Table VII. The reason for their stability is not clear. It is plausible, but unproven, that a fraction of these peptides, once formed, have isomerized or coiled in some way to prevent further trypsinization.
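The in-source versus in-solution distinction described above for Classes 4 and 5 can be illustrated with a minimal sketch. Only the 5 second co-elution window comes from the text; the `Candidate` record, the m/z tolerance, and the "major peak" threshold (a fraction of the base peak) are hypothetical choices made for illustration and are not the filters actually used to build the library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """Hypothetical record pairing a semitryptic ion with its probable precursor."""
    fragment_rt_s: float    # retention time of the semitryptic ion (seconds)
    precursor_rt_s: float   # retention time of the probable precursor peptide (seconds)
    fragment_mz: float      # m/z of the semitryptic ion
    precursor_ms2: List[Tuple[float, float]]  # (m/z, intensity) peaks of the precursor's MS2


def classify_semitryptic(c, rt_window_s=5.0, mz_tol=0.5, major_fraction=0.2):
    """Label a semitryptic ion 'in-source' or 'in-solution'.

    rt_window_s follows the ~5 s co-elution window in the text;
    mz_tol and major_fraction are assumed values.
    """
    co_eluting = abs(c.fragment_rt_s - c.precursor_rt_s) <= rt_window_s
    if not c.precursor_ms2:
        return "in-solution"
    base_peak = max(intensity for _, intensity in c.precursor_ms2)
    # "Major peak": within tolerance of the fragment m/z and intense
    # relative to the base peak of the precursor's MS2 spectrum.
    is_major_peak = any(
        abs(mz - c.fragment_mz) <= mz_tol and intensity >= major_fraction * base_peak
        for mz, intensity in c.precursor_ms2
    )
    return "in-source" if (co_eluting and is_major_peak) else "in-solution"
```

Both criteria are required for an "in-source" label in this sketch, mirroring the two conditions named in the text; a peptide that elutes far from its potential precursor is treated as a digestion product regardless of the MS2 evidence.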
Class 8: Under and Over Alkylation-Low accessibility of cysteine sites may lead to incomplete cysteine alkylation (80), which was found for 205 peptides. Alternatively, over-alkylation by iodoacetamide (IAA) can occur when alkylation is not stopped by removing or "quenching" IAA with added DTT (81). Residues E, H, and K were the most commonly alkylated in these cases. Over-alkylation was observed for 51 peptide ions. Table VIII shows the eight most frequently observed over-alkylated and under-alkylated peptides, all of which were observed in over 40% of LTQ-Orbitrap runs. Peptides with under- or over-alkylation typically amounted to 1.9% of the HSA abundance under conventional digestion conditions.
Class 9: Tryptic Peptides with Unidentified Modifications-In an effort to identify all products of digestion, two nontarget modification search engines, InSpect (37) and TagRecon (38), were applied to find any single modification changing the peptide mass by up to 300 Da. Those that were identified were then added to the list of targeted modifications. Because exact mass was especially important for identifying members of this class, only those identified at high mass accuracy (Orbitrap) were included, subject to the requirement that they appear in at least 10% of the runs. This generated 470 peptide ions with unknown modifications, accounting for 16.1% of total library peptides and 2.2% of the total peptide abundance. In most cases their position in the sequence and even their exact chemical formula are not yet certain. Table IX lists the peptides of this class appearing in over 40% of Orbitrap runs.
One particularly prevalent modification, identified from over 75% of 350 LTQ-Orbitrap runs, had a mass of 69.988 Da and appeared on the N terminus (see Table IX). This appears to be associated with tris(hydroxymethyl)aminomethane (Tris) buffer because it did not appear when ammonium bicarbonate was used in its place. Work in progress will add localization procedures to precisely locate these modification sites and attempt to more precisely determine chemical formulas. The final HSA library contains 651 peptide ions with less common modifications and 470 with unidentified (unknown) modifications.
a PIIF was calculated a) for cysteinylation using 24 LTQ non-reducing runs, b) for Cys34 oxidation, N- or C-terminal truncation, glycation, and acetylation, using 350 LTQ-Orbitrap runs, and c) for phosphorylation using 170 LTQ-Orbitrap runs in CPTAC studies (26-27).
b Category 2, Cys34 oxidation, has three oxidized forms (sulfinic/sulfonic acid and sulfinamide).
c All cysteines in categories 4-6 are alkylated.
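The two Class 9 acceptance criteria stated above (a single mass shift of up to 300 Da, seen in at least 10% of runs) can be expressed as a short filter. This is a sketch under stated assumptions: the input tuple layout, the function name `filter_unknown_mods`, the use of the absolute mass shift, and the grouping of shifts by rounding to 0.01 Da are all illustrative choices, not the procedure used with InSpect or TagRecon.

```python
def filter_unknown_mods(hits, n_runs, max_shift_da=300.0, min_run_fraction=0.10):
    """Keep (peptide ion, mass shift) pairs seen in enough runs.

    hits: iterable of (peptide_ion, mass_shift_da, run_id) from high mass
          accuracy runs (hypothetical layout).
    Returns {(peptide_ion, rounded_shift): fraction_of_runs}.
    """
    runs_by_key = {}
    for peptide_ion, shift, run_id in hits:
        if abs(shift) > max_shift_da:
            continue  # outside the 300 Da window described in the text
        # Assumed grouping: shifts agreeing to 0.01 Da count as one modification.
        key = (peptide_ion, round(shift, 2))
        runs_by_key.setdefault(key, set()).add(run_id)

    kept = {}
    for key, runs in runs_by_key.items():
        fraction = len(runs) / n_runs
        if fraction >= min_run_fraction:  # the 10% occurrence requirement
            kept[key] = fraction
    return kept
```

Rounding is a deliberately crude way to group mass shifts and can split one modification across two bins near a rounding boundary; a tolerance-based clustering would be more robust.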

HSA Spectra in the NIST Human Spectral Library-Spectra derived from the newly built HSA spectral library were compared with HSA peptides already present in the 2012 NIST library of human tryptic peptides (31). Of the 2918 HSA peptide ions derived in this work, 911 were present in the human library, whereas 122 HSA ions in the human library were not in the HSA library. Among the latter set were 72 peptides with new charge states, 15 with common modifications or multiple missed alkylation sites, and 35 semitryptic peptides. All were then added to the HSA library. These new identifications likely arise because of the very wide range of analysis conditions and instruments in the experiments from which the human library was built. In fact, 45% of these additions arose from peptides also found in the newly created HSA library, but with lower charge states, possibly reflecting lower protonation levels in some electrospray sources. This comparison also led to the discovery of 55 spectra in the human library that matched spectra in the HSA library but were not assigned to HSA. These were found to be false identifications caused by the assignment of spectra for unusual peptides in the present HSA library to simple tryptic peptides of less common proteins in the comprehensive human library; these have been removed in the 2013 release of the human library.
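The bookkeeping behind this comparison is a set comparison over peptide ion identities. The sketch below assumes each ion is keyed by a (sequence, charge, modification) tuple; that key, the function name `compare_libraries`, and the toy charges are hypothetical, and matching in practice also relies on spectral similarity rather than on identifiers alone.

```python
def compare_libraries(hsa_ions, human_hsa_ions):
    """Partition peptide ions into shared and library-specific sets.

    Each ion is a hashable key, here (sequence, charge, modification).
    """
    hsa = set(hsa_ions)
    human = set(human_hsa_ions)
    return {
        "shared": hsa & human,         # ions present in both libraries
        "hsa_only": hsa - human,       # ions new in the single-protein library
        "human_only": human - hsa,     # candidates to add to the HSA library
    }


# Toy example: the sequences appear in the text, the charges are illustrative.
hsa_library = {("FSALEVDETYVPK", 2, ""), ("FYAPELLFFAK", 2, "")}
human_library = {("FSALEVDETYVPK", 2, ""), ("FSALEVDETYVPK", 1, "")}
parts = compare_libraries(hsa_library, human_library)
merged = hsa_library | parts["human_only"]
```

With the counts reported above, the shared set would hold 911 ions and the human-only set 122 ions (72 + 15 + 35), giving 2918 + 122 = 3040 spectra after merging.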

DISCUSSION
Creating a comprehensive library of tandem spectra of peptides for a single protein is quite a different task from building a library of peptides from digests of the thousands of proteins in a "proteome." Though single protein libraries may appear easier to build, in some ways they are more difficult. This difficulty is a consequence of the need to deal with the wide variety of peptide classes found even in a simple digest, the unpredictability of their concentrations, and even the uncertainty of some of their identities. The procedures described here employ the wider search space necessary to find these peptides, but then add a variety of quality control filters necessary to reject the increased number of false identifications.
Fig. 2 is a stacked bar graph of peptide ion identification frequency (PIIF) values for nine peptide classes at each residue position along the HSA sequence. This plot illustrates the wide range of fates of the individual residues and their dependence on the locations in the sequence. Note that 100% sequence coverage is achieved. The ordinate provides a measure of the number of different peptides in which each residue can appear in an HSA digest. Maxima are produced in regions where, by virtue of its location within observed peptides, a residue can be found in many different peptide ions. Minima are regions where residues are not well represented because they are not part of readily observed tryptic peptides, due primarily to their proximity to multiple K/R residues that do not give rise to abundant tryptic peptides with missed cleavages, and which form peptides too short to be observed in these experiments. These regions are typically probed using alternate proteases. As evident in Table II, the bulk of the product ion intensity from the digestion of HSA arises from conventional tryptic peptides. However, as evident in Fig. 2, in terms of numbers, the majority of identifiable peptides represent other varieties of peptides that are generally ignored by shotgun proteomics. In cases where peptides have biological significance, such as PTMs, searching a library containing such spectra will ensure that the modification is not missed. Otherwise, as would occur if it were not explicitly sought, the modification could be "crowded out" by using the large search space. Further, unexpected quantities of unusual peptides may signify problems with the digestion or sample preparation.
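A per-residue profile of the kind plotted in Fig. 2 can be accumulated from peptide positions. The sketch below assumes that each peptide ion contributes its PIIF to every residue it spans and that the contributions are summed per class; whether the figure was built exactly this way is an assumption, and the input layout and function names are hypothetical.

```python
from collections import defaultdict


def residue_profile(protein_length, peptide_ions):
    """Sum PIIF values per residue position, separately for each peptide class.

    peptide_ions: iterable of (start, end, peptide_class, piif) with
                  1-based, inclusive residue positions (hypothetical layout).
    Returns {peptide_class: [summed PIIF at residue 1, residue 2, ...]}.
    """
    profile = defaultdict(lambda: [0.0] * protein_length)
    for start, end, peptide_class, piif in peptide_ions:
        for position in range(start, end + 1):
            profile[peptide_class][position - 1] += piif
    return dict(profile)


def sequence_coverage(profile, protein_length):
    """Fraction of residues covered by at least one peptide ion of any class."""
    covered = sum(
        1
        for i in range(protein_length)
        if any(values[i] > 0 for values in profile.values())
    )
    return covered / protein_length
```

Stacking the per-class lists at each position gives the bar heights: maxima appear where many peptide ions overlap a residue, and positions with a total of zero are the uncovered residues.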
Single protein libraries have a variety of applications. First, they provide a convenient means of storing and re-identifying all identifiable peptides and modifications found in the digest of a given protein. This can assist the separation of true PTMs from analytical artifacts by limiting possible identifications to previously observed peptides. In fact, relative numbers of spectra identified in prior runs provide a measure of "prior probabilities" (40) of potential value in deriving more accurate probabilities. A second application is to identify the large possible number of peptides in a digest (e.g. carbamylated or otherwise modified) to prevent their misidentification as well as to assess the quality of the sample preparation process. A third application is the integration of these spectra with comprehensive proteome libraries, such as the NIST human library (31). This not only adds more and better quality peptide spectra for individual proteins, but, as described earlier, can reveal incorrect identifications of peptides that may be falsely identified as tryptic peptides of minor proteins, but which actually originate as minor modifications of peptides from major proteins. This first attempt to build a single protein library involved a considerable amount of manual inspection to refine filters and assess their efficacy. This process was aided by the availability of the large numbers of digest results available from prior studies (26-27); however, far fewer are expected to be needed for future library building efforts. Future work will extend this method to other proteins and proteases as well as develop a fully automated method for single-protein library creation. It is hoped that this procedure can then be extended to a large number of proteins of importance in proteomics and become a useful tool for those who have special interest in particular proteins.
Other work is ongoing to build libraries of energy-dependent spectra from high resolution, collision-cell instruments.
We note that certain highly modified proteins remain a challenge for full characterization in libraries of digest peptides. Especially for highly modified proteins, procedures are needed to localize modifications, possibly by extension of widely used methods to fix phosphorylation sites (82-83). Highly glycosylated proteins present a special challenge because glycan heterogeneity, identity, and the analysis of O-linked glycans require special effort.
The HSA spectral library described in this work is available for download from http://peptide.nist.gov. It contains both 2918 spectra from the filtering described here and 122 spectra from the NIST human library. Occurrence information for the former spectra is given in Supplemental Table S4.
* This work was supported by the NIH/NCI CPTAC program (http://proteomics.cancer.gov/) through a series of Interagency Agreements with NIST.
This article contains supplemental Fig. S1 and Tables S1 to S4.
§ To whom correspondence should be addressed: Biomolecular Measurement Division, National Institute of Standards and Technology, 100 Bureau Drive, Stop 8362, Gaithersburg, MD 20899, United States.
Single Protein Library Building: HSA. DISCLAIMER: Certain commercial instruments are identified in this document. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the products identified are necessarily the best available for the purpose.