A Classifier Based on Accurate Mass Measurements to Aid Large Scale, Unbiased Glycoproteomics*

Determining which glycan moieties occupy specific N-glycosylation sites is a highly challenging analytical task. Arguably, the most common approach involves LC-MS and LC-MS/MS analysis of glycopeptides generated by proteases with high cleavage site specificity; however, the depth achieved by this approach is modest. Nonglycosylated peptides are a major challenge to glycoproteomics, as they are preferentially selected for data-dependent MS/MS due to higher ionization efficiencies and higher stoichiometric levels in moderately complex samples. With the goal of improving glycopeptide coverage, a mass defect classifier was developed that discriminates between peptides and glycopeptides in complex mixtures based on accurate mass measurements of precursor peaks. By using the classifier, glycopeptides that were not fragmented in an initial data-dependent acquisition run may be targeted in a subsequent analysis without any prior knowledge of the glycan or protein species present in the mixture. Additionally, from probable glycopeptides that were poorly fragmented, tandem mass spectra may be reacquired using optimal glycopeptide settings. We demonstrate high sensitivity (0.892) and specificity (0.947) based on an in silico dataset spanning >100,000 tryptic entries. Comparable results were obtained using chymotryptic species. Further validation using published data and a fractionated tryptic digest of human urinary proteins was performed, yielding a sensitivity of 0.90 and a specificity of 0.93. Lists of glycopeptides may be generated from an initial proteomics experiment, and we show they may be efficiently targeted using the classifier. Considering the growing availability of high accuracy mass analyzers, this approach represents a simple and broadly applicable means of increasing the depth of MS/MS-based glycoproteomic analyses.

N-Glycosylation is an important post-translational modification that affects cell-cell signaling and protein stability, and it has been implicated in various pathologies (1). Protein N-glycosylation is difficult to characterize, due to heterogeneity at the levels of glycosylation site occupancy, glycan composition, and glycan structure. A truly comprehensive analysis of protein glycosylation identifies glycans, maps occupied sites, and matches the glycans to specific sites on glycoproteins (2). This site-specific analysis can be performed via analysis of intact glycopeptides using mass spectrometry (MS). However, this technique is complicated by sensitivity, sample preparation, and fragmentation challenges (3) that limit the throughput and depth of the results.
The analysis of site-specific glycosylation is complicated in part because the ionization of glycopeptides is suppressed by the nonglycosylated peptides that are coproduced during digestion with specific proteases. Alternatively, digestion using nonspecific proteases has been implemented to eliminate competing peptide species (4,5). Notably, specific proteases yield highly predictable peptide footprints and have been utilized for analysis of complex mixtures. However, as noted in a recent publication, glycopeptides are often not selected for fragmentation in data-dependent analysis (DDA) (6), which prevents their identification, as fragmentation is required for glycopeptide identification in nontrivial samples (7). To circumvent this issue, enrichment protocols using normal-phase hydrophilic interaction chromatography or lectin affinity have been established for glycopeptides (8). Although highly valuable, these purification approaches have varying specificities for glycopeptides, may preferentially isolate glycopeptides with certain types of glycans attached, and add sample handling steps.
Because of these challenges, a classifier capable of quickly discriminating between peptide and glycopeptide signals in mass spectrometry would be valuable and may significantly complement existing purification techniques. Fragment ions have been found that are specific to glycopeptides (9,10); however, these are not useful if the glycopeptides were not selected for fragmentation or if they yielded low quality MS/MS spectra. As mass defect (MD) classifications have been applied to similar challenges in proteomics (11-13), we investigated whether an MD classification would be useful for discriminating between peptides and glycopeptides.
Notably, a lowering of the MD has been observed for glycopeptides, because of the relative increase of oxygen (and its negative MD value) in glycopeptides (14). However, this original observation was made through comparison of tryptic peptides and small glycopeptides generated by nonspecific proteolysis. Since that original work, no systematic studies have been performed to maximize the analytical utility of this MD shift or to determine the true- and false-positive rates of such a classifier based on accurate mass measurements. Furthermore, no studies have determined whether the MD shift holds for peptides and glycopeptides generated by the same protease, a more pertinent comparison given typical sample preparation protocols.
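To make the origin of the MD shift concrete, the following minimal sketch computes the monoisotopic mass and the mass defect per dalton of a few residues from their elemental formulas. The function names and the choice of residues are illustrative assumptions and are not taken from the study; the isotope masses are standard monoisotopic values rounded to seven decimals.

```python
# Monoisotopic masses of the most abundant isotopes, in daltons.
MONO = {"C": 12.0, "H": 1.0078250, "N": 14.0030740, "O": 15.9949146}

# Residue compositions (the free molecule minus one water); illustrative selection.
RESIDUES = {
    "Hex":    {"C": 6,  "H": 10, "O": 5},
    "HexNAc": {"C": 8,  "H": 13, "N": 1, "O": 5},
    "NeuAc":  {"C": 11, "H": 17, "N": 1, "O": 8},
    "Ala":    {"C": 3,  "H": 5,  "N": 1, "O": 1},
    "Leu":    {"C": 6,  "H": 11, "N": 1, "O": 1},
}

def mono_mass(formula):
    """Monoisotopic mass of a composition given as {element: count}."""
    return sum(MONO[element] * count for element, count in formula.items())

def defect_per_dalton(formula):
    """Decimal portion of the mass divided by the mass itself."""
    mass = mono_mass(formula)
    return (mass - int(mass)) / mass

if __name__ == "__main__":
    for name, formula in RESIDUES.items():
        print(f"{name:7s} {mono_mass(formula):10.4f} {defect_per_dalton(formula):.6f}")
```

In this small selection, the oxygen-rich sugar residues carry less mass defect per dalton than alanine or leucine, which is the direction of the shift described above. Individual amino acids vary (glycine, for instance, is comparatively low), so the effect is a trend across whole molecules rather than a rule for every residue.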
In this study, we have determined these glycopeptide-rich acquisition enhancement zones (GRAEZs) and validated their utility in identifying useful precursor m/z values for large scale glycopeptide assignment by tandem MS. This classification may be applied to identify likely N-glycopeptides without parallel proteomic or glycomic experiments and without any prior knowledge of the proteome or glycome present in a mixture. Targeted MS studies of species within the GRAEZ will increase selection of glycopeptides for fragmentation, leading to increased glycopeptide identification. This concept is presented schematically in Fig. 1. We further demonstrate the efficacy of GRAEZ classification by validating LC-MS/MS data from urinary proteomics analysis.

MATERIALS AND METHODS
The GRAEZ MD settings were determined using an in silico training dataset and evaluated using an in silico test dataset of peptides and glycopeptides. Training and test sets were generated from the HUPO plasma proteome database, which may be accessed online. Entries were re-mapped to SwissProt identifiers. A total of 1797 unique entries were generated. Six hundred random protein entries were selected and digested in silico with either trypsin or chymotrypsin using MS-Digest to form the training sets. The remaining 1197 proteins were used to form the test set. One missed cleavage was permitted; cysteine residues were considered as their carbamidomethyl derivatives, and peptide output was restricted to >3 amino acids and 400-5000 daltons. This range was chosen to select peptide sizes that were typically amenable to analysis on most MS instrumentation. MS-Digest reported singly protonated m/z values for all peptides. Peptide output was imported into Microsoft Excel for data analysis.
Redundant peptide sequences were removed. Peptides containing a potential N-glycosylation consensus site were identified by the presence of NXS or NXT sequences, where X is any amino acid except proline. Glycopeptides were then generated in silico by adding the monosaccharide masses of eight distinct N-glycan compositions to each consensus site peptide. The glycans utilized are shown in Table I and were chosen to represent common Homo sapiens N-glycans without excessively biasing the classifier toward large N-linked glycans. Because the MD shift is proportionally less for smaller N-glycans, a range of N-glycan masses was tested to challenge the classifier. Size distributions for tryptic and chymotryptic peptides are shown in supplemental Fig. 1.
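The consensus-site search and the in silico glycopeptide generation described above can be sketched as follows. This is a minimal illustration, not the spreadsheet workflow used in the study: the function names are hypothetical, the residue masses are standard monoisotopic values, and the example composition (Hex5HexNAc2) is an assumption chosen for illustration that may or may not appear in Table I.

```python
import re

# Monoisotopic residue masses (daltons) of common N-glycan building blocks.
RESIDUE_MASS = {
    "Hex": 162.05282,
    "HexNAc": 203.07937,
    "Fuc": 146.05791,
    "NeuAc": 291.09542,
}

# N-X-S/T with X != P; the lookahead also finds overlapping sites.
SEQUON = re.compile(r"(?=N[^P][ST])")

def sequon_positions(peptide):
    """Zero-based indices of Asn residues in an N-X-S/T consensus site."""
    return [match.start() for match in SEQUON.finditer(peptide)]

def glycan_mass(composition):
    """Sum of residue masses for a composition such as {'Hex': 5, 'HexNAc': 2}."""
    return sum(RESIDUE_MASS[name] * count for name, count in composition.items())

def glycopeptide_mh(peptide_mh, composition):
    """(M + H)+ of the glycopeptide: peptide (M + H)+ plus the glycan residues."""
    return peptide_mh + glycan_mass(composition)

# Hypothetical example composition (a high-mannose glycan).
EXAMPLE_GLYCAN = {"Hex": 5, "HexNAc": 2}

if __name__ == "__main__":
    print(sequon_positions("NPSANATNVS"))
    print(round(glycopeptide_mh(1500.7500, EXAMPLE_GLYCAN), 4))
```

Residue masses (the monosaccharide minus water) are used because the glycan is joined to the peptide through a glycosidic bond, so no additional water needs to be added or removed.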
Peptides and glycopeptides were plotted on a mass defect map to identify initial trends in integer and defect mass for each species, and best-fit lines were generated for each class. Initial GRAEZ settings were set between the best-fit lines for each class, and the accuracy (or % of correct assignments) of the classifier was evaluated. The initial slope and intercept values were then optimized using an automated iterative process to maximize accuracy.
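The text does not specify the iterative procedure, so the sketch below substitutes a plain exhaustive grid search over a shared slope and two intercepts. It is an assumption-laden illustration only: the function names, the toy data, and the parameter grids are hypothetical, and mass defects are treated as unwrapped values (no reduction to the 0 to 1 range) to keep the example short.

```python
from itertools import product

def band_accuracy(points, labels, slope, low_intercept, high_intercept):
    """Fraction of species assigned correctly by a linear band in (NM, MD) space.

    points: list of (nominal_mass, mass_defect); labels: True for glycopeptides.
    """
    correct = 0
    for (nominal_mass, mass_defect), is_glyco in zip(points, labels):
        low = slope * nominal_mass + low_intercept
        high = slope * nominal_mass + high_intercept
        predicted_glyco = low <= mass_defect <= high
        correct += predicted_glyco == is_glyco
    return correct / len(points)

def grid_search(points, labels, slopes, low_intercepts, high_intercepts):
    """Return the (slope, low, high) triple with the highest accuracy, and that accuracy."""
    best = max(
        product(slopes, low_intercepts, high_intercepts),
        key=lambda params: band_accuracy(points, labels, *params),
    )
    return best, band_accuracy(points, labels, *best)

# Toy data (hypothetical): glycopeptides sit at lower MD than peptides of equal NM.
TOY_POINTS = [(1000, 0.35), (1000, 0.50), (2000, 0.70), (2000, 1.00)]
TOY_LABELS = [True, False, True, False]

if __name__ == "__main__":
    print(grid_search(TOY_POINTS, TOY_LABELS, [0.00035, 0.0005], [-0.05], [0.05]))
```

A real optimization would start from the best-fit lines of each class, as described above, and refine the boundaries in progressively smaller steps.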
To retrospectively verify the in silico findings, a catheterized urine sample from a healthy male infant was obtained with an IRB-approved protocol and processed using a previously published sample preparation method for urinary proteomics (15). Briefly, urine was concentrated and desalted on 5K molecular weight cutoff spin filters (Sartorius). Proteins were reduced and alkylated in the spin filter, washed extensively with TEAB, and removed from the upper chamber before digestion with trypsin at a (w/w) ratio of 50:1 sample/enzyme overnight at 37°C. Peptides were labeled with TMT6-126 (Thermo Scientific) according to the manufacturer's instructions and purified with HLB cartridges (Oasis). Peptides were separated into 24 fractions using an Agilent OFFGEL isoelectric point fractionator for 50 kV-h, extracted, and dried.
Individual fractions were reconstituted in loading buffer and analyzed by LC-MS/MS using a Thermo Scientific QExactive MS system equipped with an Eksigent two-dimensional nano-LC system, autosampler, and C18 column (15 cm length × 17 μm diameter). A "top 10" data-dependent LC-MS/MS method was utilized; resolution was set to 70 K for MS1 and 17.5 K for MS2 scans. A 60-min linear gradient from 5 to 35% ACN was used. Normalized collision energy was 30, and the AGC was set for 1e6 for MS1 and 5e4 for MS2 scans.
In addition to the retrospective GRAEZ evaluation, prospective GRAEZ testing was also performed. Tryptic peptides were generated as above using a urine sample donated by a healthy male adult. An initial DDA run was performed on the nonfractionated sample after cleanup. After acquisition, all MS1 features were extracted using MaxQuant (16) and evaluated for GRAEZ status. A list of 2325 unique precursors that were classified as glycopeptides by GRAEZ was generated and targeted in two subsequent LC-MS runs. Data were acquired with similar instrumental parameters, except the normalized collision energy was 29 and the AGC was set for 3e6 for MS1 and 1e5 for MS2 scans.
All MS2 spectra from the retrospective experiment were searched for the presence of two marker ions, the TMT reporter ion at 126.1277 daltons or the diagnostic Hex1HexNAc1 oxonium ion at 366.1395. Prospective data were evaluated for the 366.1395 and 204.0867 ions. Rapid identification of the relevant precursor m/z and z values was achieved by the use of an in-house script that functioned as an add-in for the msconvert tool. The tool, mzPresent, filters all MS2 spectra for user-defined fragment ions and creates an mgf file and a comma-separated value file as output that contains scan number, retention time, m/z selected for fragmentation, charge state of the precursor, and the intensity of the fragment ion. mzPresent has been incorporated into the ProteoWizard tool (17) for ease of use, and may use any arbitrary m/z value.
For this study, 10 ppm mass error was allowed, and a minimum of 25% relative intensity was required for the fragment ions. The precursor m/z and z values were used to calculate (M + H)+ values for GRAEZ classification, and these GRAEZ classifications were cross-referenced against the presence of the glycopeptide-specific ions in MS2 spectra to estimate the true- and false-positive rates of GRAEZ, as detailed below.
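The two calculations in this step, converting a precursor m/z and charge to (M + H)+ and screening a spectrum for a marker ion, can be sketched as below. This is not the mzPresent implementation; the function names and the peak-list representation are hypothetical, and relative intensity is assumed to be measured against the base peak of the spectrum.

```python
PROTON = 1.007276  # mass of a proton in daltons

def mh_from_mz(mz, z):
    """Singly protonated (M + H)+ value from an observed m/z and charge state z."""
    return mz * z - (z - 1) * PROTON

def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def has_marker(peaks, target_mz, tol_ppm=10.0, min_rel=0.25):
    """True if a peak within tol_ppm of target_mz reaches min_rel of the base peak.

    peaks: list of (m/z, intensity) tuples for one MS/MS spectrum.
    """
    if not peaks:
        return False
    base = max(intensity for _, intensity in peaks)
    return any(
        abs(ppm_error(mz, target_mz)) <= tol_ppm and intensity >= min_rel * base
        for mz, intensity in peaks
    )

if __name__ == "__main__":
    spectrum = [(204.0868, 40.0), (366.1396, 55.0), (1203.55, 100.0)]
    print(has_marker(spectrum, 366.1395))      # oxonium ion screen
    print(round(mh_from_mz(1000.5, 2), 4))     # doubly charged precursor
```

The defaults mirror the settings stated above (10 ppm, 25% relative intensity) and can be relaxed, for example to the 5% cutoff.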

RESULTS AND DISCUSSION
Creating GRAEZ Settings and in Silico Evaluation-Because of the contribution of N-linked glycans, N-glycopeptides are larger in size than peptides, as shown by the histograms in supplemental Fig. 1. Based on the in silico data, all species below 1500 daltons may reasonably be excluded from targeted N-glycopeptide analysis with negligible loss in sensitivity. Approximately 49% of tryptic peptides and 43% of chymotryptic peptides were smaller than 1500 daltons. However, the in silico specificity measures listed below do not consider the elimination of these low mass species and therefore are quite conservative with regard to overall glycopeptide specificity.
The final GRAEZ settings are given in Equation 1, where NM is the nominal mass (i.e. integer portion of the mass) of the singly protonated (or multiply protonated and deconvoluted) species being tested, and MD is the defect mass (i.e. decimal portion of the mass). Species within the GRAEZ are more likely to be glycosylated peptides, as detailed below.
The GRAEZ regions determined by these equations are highlighted in Fig. 2 and plotted along with the test datasets. The "or" conditions shown in Equation 1 are required when the calculated values for the "high" end of the GRAEZ become >1 or >2. Any calculated GRAEZ values that were larger than 1 had their integer value subtracted, as MD by definition is between the values of 0 and 1. A species that satisfies the condition is classified as a glycopeptide by GRAEZ. For a large scale analysis, GRAEZ testing can be quickly and easily performed in any of several different software platforms after deconvolution of LC-MS data.
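The wrap-around logic described in this paragraph can be sketched as a short function. Because the numerical coefficients of Equation 1 are not reproduced in this text, the slope and intercept values below are placeholders chosen only to exercise the code; they are not the published GRAEZ settings, and the function names are hypothetical.

```python
import math

def split_mass(mh):
    """Return (nominal mass, mass defect) as the integer and decimal portions."""
    nominal = math.floor(mh)
    return nominal, mh - nominal

def in_zone(mh, lower, upper):
    """True if the mass defect lies between two linear boundaries, with wrap-around.

    lower and upper are (slope, intercept) pairs evaluated at the nominal mass.
    Boundary values above 1 have their integer part removed, so the zone may
    straddle the 1 -> 0 rollover of the mass defect.
    """
    nominal, defect = split_mass(mh)
    low = lower[0] * nominal + lower[1]
    high = upper[0] * nominal + upper[1]
    if high - low >= 1.0:          # zone covers the whole defect range
        return True
    low_wrapped, high_wrapped = low % 1.0, high % 1.0
    if low_wrapped <= high_wrapped:
        return low_wrapped <= defect <= high_wrapped
    # Zone straddles the rollover: accept either side of it.
    return defect >= low_wrapped or defect <= high_wrapped

# Placeholder boundaries, NOT the published settings.
PLACEHOLDER_LOWER = (0.0004, 0.0)
PLACEHOLDER_UPPER = (0.0004, 0.3)

if __name__ == "__main__":
    for mass in (2000.95, 2000.05, 2000.50):
        print(mass, in_zone(mass, PLACEHOLDER_LOWER, PLACEHOLDER_UPPER))
```

With these placeholder values the zone at a nominal mass of 2000 runs from 0.8 through the rollover to 0.1, which is the situation the "or" conditions are meant to handle.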
The tryptic training set had a sensitivity of 0.952 and a specificity of 0.900 within the mass range of 1500 to 5000 daltons. After eliminating m/z values outside the GRAEZ (or GRAEZing for glycopeptides), the glycopeptide/peptide ratio increased 9.5-fold. Similarly, the tryptic test set yielded an 8.8-fold increase and the chymotryptic sets averaged a 10-fold increase. The overall accuracy of GRAEZ classification (the proportion of correct assignments) averaged 0.922 for tryptic digests. Similar sensitivity and specificity were achieved for the chymotryptic species, as summarized in Table II. Furthermore, tryptic peptide and glycopeptide test sets were evaluated using the MD classification from the initial study that proposed an MD difference between these species (14). Although the original scheme achieved some improvement in identifying likely peptides, the true-positive rate of glycopeptide assignment dropped to 0.68, meaning over 30% of tryptic glycopeptides were misclassified as nonmodified peptides in silico using the original MD classification scheme. GRAEZ classification is therefore substantially more sensitive for glycopeptides.
The GRAEZ settings were further applied in silico to the remaining set of 1197 proteins to verify their performance on another dataset. The full list of peptides and glycopeptides utilized may be found in supplemental file 1. Both the tryptic and chymotryptic test sets showed a negligible change in accuracy relative to the training sets (Table II), suggesting that the GRAEZ classifier is robust. In total, over 100,000 tryptic species were tested in silico and GRAEZ correctly classified 91.9% of these species. Similar accuracy was achieved with the chymotryptic test set species (93.3%), which numbered >90,000.
The in silico training sets were also evaluated as the 13C1 and 13C2 isotopes, in addition to the monoisotopic species. The GRAEZ classification did not change with the heavy isotopes over 99% of the time (shown in supplemental file 1), a critical consideration for larger analytes for which the 13C1 or 13C2 isotopes are the most abundant. The experimental data shown below also support this claim, as the majority of glycopeptide precursors in human urine had at least one isotopic shift (Table III). Notably, combinatorial approaches to glycoproteomics assign glycopeptides by matching experimentally observed monoisotopic m/z values to a combination of a glycan and a peptide mass. We anticipate that experimental misclassification of precursor m/z values will occur frequently for large precursors and therefore detract from such approaches more often than 1% of the time. This could potentially be an advantage of the GRAEZ approach relative to combinatorial glycoproteomics; however, it requires additional study.
GRAEZ Evaluation of Published Reports-To further validate the in silico results, published proteomic and glycoproteomic data were also evaluated. GRAEZ testing of a recently published proteomic dataset of the HeLa cell proteome (18) correctly classified 96.2% of 4760 unique tryptic peptides between 1500 and 5000 daltons as peptides, with a specificity of 0.962 (supplemental file 1). Similarly, a retrospective GRAEZ classification of several published site-specific glycoproteomic studies was also performed to validate the sensitivity of the method. As glycoproteomics studies have not approached the scale of proteomics studies, several distinct studies were needed to generate a sufficient number of glycopeptides to test GRAEZ classifications. These studies examined a variety of different samples, including glycoprotein standards (19), fetal bovine serum (20), human urine (21) (24), hepatitis C glycoprotein (25), HIV envelope glycoprotein gp140 (26), and human IgG subclasses (27). In total, 624 nonredundant, intact tryptic glycopeptides were identified in these studies within the mass range of 1500-5000 daltons. Subsequent GRAEZ testing was performed on experimental m/z values when given and on imputed m/z values when absent. GRAEZ correctly classified 564 of these species as glycopeptides (as detailed in supplemental file 3), for an overall sensitivity of 0.904. This result demonstrates that the sensitivity of GRAEZ classification was maintained among these reports on diverse samples. Based on these experimental data from multiple organisms, instrumental platforms, and laboratories, we predict GRAEZ classification will be useful for a wide variety of future N-glycoproteomic studies to identify likely N-glycopeptide precursors in LC-MS.
Experimental Validation of GRAEZ Classification-The utility of GRAEZ was further evaluated experimentally using tryptic peptides isolated from urine. Urine is a highly complex, clinically relevant sample type, and it contains numerous salts, peptides, and metabolites. To combat the possibility of nonpeptide background contamination affecting the classification, peptides were labeled with amine-reactive TMT tags before analysis. Using the mzPresent tool, every MS2 spectrum collected was searched for two fragment ions as follows: the TMT reporter ion at 126.1277, which was required for the "peptide" designation, and the 366.1395 peak, which was required for the "glycopeptide" designation. Species without either of these ions were not considered in the GRAEZ classification.
Urine was chosen to challenge the GRAEZ classification, as it is a highly complex sample containing thousands of proteins. For the same reason, we opted against any glycopeptide enrichment. It was therefore expected that peptides would vastly outnumber glycopeptides, as peptides are known to suppress ionization of glycopeptides and are present at higher stoichiometric levels. Furthermore, a recent study (21) shows that O-glycopeptides are prevalent in urine, which is expected to further challenge the accuracy of N-linked glycopeptide characterization. Although N- and O-linked glycopeptides can yield the same small oxonium ions upon higher energy collisionally activated dissociation (HCD) fragmentation, they will not respond equally well to GRAEZ classification: because the glycan moieties of most O-linked glycopeptides are small, O-linked glycopeptides are anticipated to be missed by GRAEZ. Thus, the subset of oxonium-yielding precursor ions likely includes some O-glycopeptides that are less ideal candidates for GRAEZ classification. This subjects the classification criteria to yet another challenge that can reasonably be expected to occur in other sample types.
Despite these challenges, an analysis of MS2 spectra (n = 90,624) showed that 90% (692/772) of all species that yielded oxonium fragments upon activation by HCD were characterized as N-glycopeptides by the GRAEZ algorithm. Similarly, 93% (83,289/89,852) of all peptide species were correctly classified. The lack of any glycopeptide enrichment led to a very challenging analytical background. In total, 116 unique peptide precursors were selected by DDA for every glycopeptide precursor. To date, there are few studies that intentionally analyze intact glycopeptides and peptides simultaneously, because peptides and glycopeptides have distinct optimal instrumental parameters (28,29). Unfortunately, the presence of both species in a glycopeptide enrichment sample is an undesired yet common outcome of contemporary enrichment protocols. In this study, we wanted to demonstrate the utility of GRAEZ classification within a highly complex and challenging analytical background. In addition, these samples were analyzed using peptide-optimized MS settings, and the majority (>85%) of the spectra acquired were of low quality.
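The rates quoted in this paragraph follow directly from the four counts it reports. The sketch below recomputes them; the function names are hypothetical, and the enrichment calculation is our reading of the "glycopeptide/peptide ratio" used earlier in the text rather than a formula given by the source.

```python
def rates(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

def enrichment_fold(tp, fn, tn, fp):
    """Glycopeptide/peptide ratio inside the zone divided by the ratio overall."""
    ratio_before = (tp + fn) / (tn + fp)
    ratio_after = tp / fp
    return ratio_after / ratio_before

# Counts from the fractionated urine data set described above.
TP = 692                 # oxonium-yielding species inside the GRAEZ
FN = 772 - 692           # oxonium-yielding species outside the GRAEZ
TN = 83289               # TMT-labeled peptides outside the GRAEZ
FP = 89852 - 83289       # TMT-labeled peptides inside the GRAEZ

if __name__ == "__main__":
    sens, spec, acc = rates(TP, FN, TN, FP)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
    print(f"enrichment={enrichment_fold(TP, FN, TN, FP):.1f}-fold")
```

Note that the "glycopeptide" label here rests on the 366.1395 oxonium ion, so O-glycopeptides are counted among the positives, as discussed above.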
However, several high quality glycopeptide fragmentations were still observed, and the glycan portions were assigned by the presence of the abundant Y1 ion (nomenclature detailed in Ref. 30) and a minimum of three other glycosidic fragment ions. Two examples of higher quality spectra are shown in Fig. 3. In each spectrum, a loss corresponding to the nonreducing end glycan moieties was observed, followed by successive losses of six monosaccharide residues. In both spectra, a 0,2X0 ion was observed, and in Fig. 3B, loss of the terminal GlcNAc residue was also observed. Each spectrum identified the mass of the peptide portion in addition to the glycan composition. Spectra corresponding to a total of 61 glycopeptides were acquired with sufficient quality to manually assign the glycan portion of the glycopeptides in the data-dependent analyses, and relevant information is tabulated in Table III. These species were predominantly glycopeptides with sialylated complex-type glycans. The peptide MH+ values were imputed after assignment of the MS/MS pattern observed, usually supported by abundant Y1 and 0,2X0 type ions. After identifying the Y1 ion, the remaining mass lost from the calculated precursor MH+ was determined and cross-referenced against plausible N-glycan compositions to confirm the compositional assignment. Each glycan loss matched an N-glycan composition at less than 20 ppm mass tolerance. The peptide portions were not sequenced in this study and are reported as their imputed (M + H)+ values.

TABLE III An annotated set of glycopeptide assignments identified by LC-MS/MS
A total of 64 species were assigned, and relevant analytical information has been tabulated. A high proportion of sialylated glycopeptides with 1-3 sialic acid residues was observed, and a total of 23 distinct glycan compositions were observed. For the glycan composition entry, the following notations were used: H, hexose; N, N-acetylhexosamine; F, fucose; A, N-acetylneuraminic acid. Each glycan assignment was supported by a sub-20 ppm mass error in the MS/MS spectra.

Prospective Analysis of Precursors of Interest-An unfractionated sample of urinary peptides was initially analyzed by DDA MS/MS and subsequently by targeted MS. A total of 2325 species from the initial analysis were characterized as glycopeptides by the GRAEZ.
A total of 3196 MS2 spectra were acquired, and 2598 (81%) of these had an oxonium ion at a minimum of 25% of the base peak intensity. A less stringent cutoff of 5% increased the number to 2878 or 90% of all MS2 spectra acquired. Our fractionated urine sample gave a glycopeptide sampling rate of only 0.8% by comparison, generating only 772 MS2 spectra in substantially more instrument time. Therefore, generating a targeted list based on GRAEZ classification significantly increased both the glycopeptide MS/MS sampling efficiency and depth.

Considerations for GRAEZ Classification for Glycoproteomics-The GRAEZ approach has specific requirements. First, it requires an instrument capable of making precursor m/z measurements with high mass accuracy. Furthermore, an examination of the in silico data suggests that the GRAEZ approach will be useful provided the instrumental precursor mass error is no larger than 20 ppm (distributions shown in supplemental Fig. 2). GRAEZing does not solve enrichment issues; however, the experimental data acquired in this study suggest that MS1 targets may be determined while performing standard proteomics experimentation. Overall, the size of the list of putative N-glycopeptides presented here (61) is comparable with recent reports using glycopeptide enrichment (21) of human urine, although a full characterization of the peptide portions was not performed in this study. GRAEZ performance was expected to be highest for N-glycopeptides with large glycan portions. The glycopeptide recall rate was lowest, at ~60%, for glycopeptides with very small N-glycans (Hex3HexNAc2 and Hex3HexNAc5) in silico.

FIG. 3. Examples of two glycopeptide MS/MS spectra. A, complex, monosialylated, and difucosylated N-glycan is observed; B, complex monosialylated N-glycan is observed. Fragment ions are observed as a series of Y-type ions from the intact N-glycopeptide precursor and a clear sequential loss of the N-linked core mannoses and N-acetylglucosamine. In each case, a 0,2X0 type cleavage is observed for the reducing end N-acetylglucosamine. Remaining glycan compositions are assigned by accurate mass losses from the precursor ion, and a minimum of four Y- or X-type ions was required for each assignment. Although full structural characterization was not performed, a possible glycan is shown for each spectrum that reflects the composition determined. The full set of assignments is shown in Table III.
Recent work from the Desaire laboratory has optimized the MD boundaries of peptides and advanced the use of MD filters (31). They found the width of 95% of the peptide distribution at 3000 daltons to be 0.49 daltons, whereas the GRAEZ settings for glycopeptides are only 0.316 daltons wide for the same NM. We believe this narrowing is due to two factors. First, the set of peptides with a consensus motif is a subset of the full peptide pool, narrowing the distribution. Second, the distribution of possible MDs for glycopeptides is narrowed by the glycan moiety itself. For example, if an N-glycopeptide of a nominal mass of 3000 daltons has an N-linked core attached, the corresponding spread of MD contributions from the peptide portion is approximately equivalent to that of a peptide of 1600 daltons. The glycan contributions do not spread the MD distribution as much, as there are fewer possible glycan compositions. Notably, the effect of mass shifting is essentially negligible with small post-translational modifications such as sulfation and phosphorylation, as the MD width does not shift much over a small mass difference.
We anticipate GRAEZing will be useful whenever glycopeptide signals are suppressed such that they are not or are sparsely fragmented and yet are still visible in MS 1 scans, a very common occurrence in data-dependent analysis of complex mixtures. Essentially no information about the sample was required (i.e. no glycomic or proteomic experiments need to be performed) other than knowledge of the protease used. Finally, GRAEZing is modular, and it may be performed after glycopeptide enrichment, increasing the utility of GRAEZing by limiting peptide contamination and improving the outcome of established glycopeptide enrichment strategies by increasing glycopeptide sampling in MS/MS analysis. We demonstrate GRAEZing may be performed after an initial proteomics DDA analysis, resulting in extensive coverage of glycopeptide targets.

CONCLUSIONS
The large scale analysis of glycopeptides remains a major analytical challenge. In this work, we developed a sensitive and specific method, applicable to generic proteomic data, that discriminates N-glycopeptides from nonglycosylated peptides based on accurate mass measurements. Precursors within the GRAEZ are enriched in glycopeptides by an order of magnitude. By identifying an N-glycopeptide-enriched targeted list from an initial data-dependent analysis, the researcher may efficiently target glycopeptides in a subsequent re-analysis. This also introduces the intriguing possibility of targeting glycopeptides in a single on-line experiment in which the instrument software is trained to preferentially select likely glycopeptide masses for MS/MS. Because classification is based on the intrinsic mass defects of the elements, this method can be applied to diverse glycoproteomic problems without the need for prior knowledge regarding the proteome or glycome present. For example, GRAEZ classification may be used to quickly compare the effectiveness of different glycopeptide sample preparations, where the detailed interpretation of hundreds or thousands of tandem MS spectra would be too laborious. Finally, GRAEZ classification of existing proteomic datasets could be used to quickly evaluate the prevalence of glycosylated peptides in existing data, with no additional instrument time required. This will allow researchers to quickly evaluate the benefit of a thorough glycoproteomic characterization on previously analyzed samples. Overall, this approach represents a simple means of significantly improving glycoproteomic depth; as such, we predict that mass defect filtering will contribute substantially to the field of site-specific glycoproteomics.