Comprehensive Analysis of a Multidimensional Liquid Chromatography Mass Spectrometry Dataset Acquired on a Quadrupole Selecting, Quadrupole Collision Cell, Time-of-flight Mass Spectrometer

An in-depth analysis of a multidimensional chromatography-mass spectrometry dataset acquired on a quadrupole selecting, quadrupole collision cell, time-of-flight (QqTOF) geometry instrument was carried out. A total of 3269 CID spectra were acquired. Through manual verification of database search results and de novo interpretation of spectra, 2368 spectra could be confidently assigned to predicted tryptic peptides. A detailed analysis of the non-matching spectra was also carried out, highlighting what the non-matching spectra in a database search typically comprise. The results of this comprehensive dataset study demonstrate that QqTOF instruments produce information-rich data, a high percentage of which is readily interpretable.

Mass spectrometers interfaced to chromatographic separation allow the acquisition of large amounts of data in a relatively short period of time. New high throughput technologies have thus been developed to utilize this ability (1-6). The quantity of data produced renders manual analysis of a significant amount of the data impractical. Scientists are therefore dependent on automated database search engines to summarize their results, the most popular being Mascot (www.matrixscience.com) and Sequest (8).
In database searches of large datasets there is always a long list of spectra that have not been matched to anything by the search engine. There are a number of reasons why these may not match, including poor quality spectra, spectra of peptides containing modifications that were not considered in the search, or peptides that were formed by non-specific cleavages when a certain enzyme cleavage specificity was defined in the search engine. Also the data analyzed by search engines are not the raw data but rather centroided peak list data, which are not always completely representative of the raw data.
These unmatched spectra are typically ignored despite the possibility they could contain important information. A summary of the complications in automated peptide and protein identification has been published recently (9). Hence a number of groups have developed statistical analysis programs of search results to better define the reliability of the reported matches (10-13).
There are many groups publishing results from large scale mass spectrometric analyses using different combinations of mass spectrometers and search engines. Unfortunately if a researcher uses one particular combination of tools it can be difficult to assess the quality of the data in studies using different instrument and search engine combinations. Hence there is a drive toward making the raw data itself available so that one can independently assess results and, if desired, reanalyze the results using an alternative searching strategy (14).
In this study we present data from a multidimensional LC-MSMS experiment where we analyzed all acquired spectra manually. From this we are able to report exactly what these unmatched spectra actually constitute. We think this information is important for understanding where there are currently problems with these automated search strategies and to indicate areas where with further refinement this list of unmatched spectra could be reduced. The dataset submitted here was acquired on a QqTOF geometry instrument, a QSTAR Pulsar (MDS Sciex/Applied Biosystems). A dataset of a multidimensional LC-MSMS experiment created on an ion trap, LCQ-DECA (Thermo), has already been published in this journal (15). Here we present a QSTAR dataset for comparison. Second to ion traps, QqTOF geometry instruments are the major type of instrument used for large scale proteomic analyses. This dataset submission will allow comparisons of the relative merits of data acquired on each instrument type.

EXPERIMENTAL PROCEDURES
His-tagged Gsp1p was expressed and purified from Escherichia coli as published previously (16). Yeast cells were arrested at the G1 stage of the cell cycle using 2.5 μg/ml α-factor exposure for 3 h or at M phase using 20 μg/ml nocodazole for 3 h, and then interacting proteins were isolated as published previously (17). Proteins from each cell state (about 5-10 μg/cell state) were labeled with the cleavable ICAT reagent (Applied Biosystems, Foster City, CA) and analyzed essentially following our published protocol for ICAT of low level samples (18). Briefly proteins were denatured in 9 M urea and reduced with trichloroethylphosphine, and then cysteines of G1 phase-arrested proteins were alkylated with light ICAT reagent, while M phase proteins were alkylated with isotopically heavy reagent. After tryptic digestion peptides were separated by strong cation exchange using a Beckman Gold HPLC system equipped with an analytical flow upgrade. Separation was achieved using a 2.1 × 10-mm polysulfoethyl A column (PolyLC) where Buffer A was 30% ACN, 0.05% formic acid and Buffer B was Buffer A containing 400 mM NH4Cl. Six fractions were collected, and each of these was successively passed through the biotin affinity cartridge (Applied Biosystems ICAT kit). Each flow-through was collected separately, and then all ICAT peptides were eluted into one fraction using 30% ACN, 0.4% trifluoroacetic acid. ICAT tags were cleaved in 95% trifluoroacetic acid.
Each fraction was desalted by reverse phase cleanup (Zip Tips, Millipore) and then analyzed by reverse phase LC-MSMS. Reverse phase chromatography was performed using an Ultimate HPLC system and a Famos autosampler (both LC-Packings). Separation was achieved using a 75-μm × 150-mm Pepmap column (LC-Packings) at a flow rate of 300 nl/min. Buffer A was 0.1% formic acid, while Buffer B was acetonitrile, 0.1% formic acid. The gradient separation was 5-40% B over 105 min. As peptides eluted off the column they were introduced on line into an ESI-QqTOF instrument (QSTAR) and were analyzed using data-dependent switching between MS and MSMS modes; after a 1-s MS spectrum up to three multiply charged precursor ions could be selected for 2-s CID spectra acquisition. After a given precursor was selected, dynamic exclusion was used for the next 60 s to prevent its subsequent reselection.
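The data-dependent acquisition logic described above can be sketched in a few lines. This is an illustrative model only, not the instrument control software: the names `Precursor`, `DynamicExclusion`, and `select_precursors` are hypothetical, the m/z matching tolerance is an assumed value, and ranking candidates by intensity is an assumption, since the text states only that up to three multiply charged precursors are chosen per survey scan and excluded for 60 s.

```python
from dataclasses import dataclass, field


@dataclass
class Precursor:
    mz: float
    charge: int
    intensity: float


@dataclass
class DynamicExclusion:
    duration_s: float = 60.0     # exclusion time stated in the text
    mz_tolerance: float = 0.1    # assumed matching tolerance (not from the text)
    _entries: list = field(default_factory=list)  # (mz, expiry time) pairs

    def is_excluded(self, mz: float, now: float) -> bool:
        # Drop expired entries, then look for a match within tolerance.
        self._entries = [(m, t) for m, t in self._entries if t > now]
        return any(abs(m - mz) <= self.mz_tolerance for m, _ in self._entries)

    def add(self, mz: float, now: float) -> None:
        self._entries.append((mz, now + self.duration_s))


def select_precursors(survey, now, exclusion, max_precursors=3):
    """Pick up to `max_precursors` multiply charged, non-excluded ions."""
    candidates = [
        p for p in survey
        if p.charge >= 2 and not exclusion.is_excluded(p.mz, now)
    ]
    # Assumption: the most intense candidates are fragmented first.
    candidates.sort(key=lambda p: p.intensity, reverse=True)
    chosen = candidates[:max_precursors]
    for p in chosen:
        exclusion.add(p.mz, now)
    return chosen


survey = [
    Precursor(500.0, 2, 100.0),
    Precursor(600.0, 1, 500.0),   # singly charged, never selected
    Precursor(700.0, 3, 50.0),
    Precursor(800.0, 2, 80.0),
    Precursor(900.0, 2, 10.0),
]
exclusion = DynamicExclusion()
first_cycle = select_precursors(survey, now=0.0, exclusion=exclusion)
second_cycle = select_precursors(survey, now=7.0, exclusion=exclusion)
```

In this toy run the second cycle can select only the weak ion at m/z 900, because the three more intense multiply charged ions are still excluded.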
Peak lists of MSMS spectra from each LC-MS run were created using the Mascot.dll script (version 1.4) within Analyst. These were searched using "Batch Tag," a new piece of software in the latest in-house developmental version of Protein Prospector (for further details see Ref. 19). Spectra that did not return a high confidence result were analyzed manually by examining the raw spectra in the Analyst software. This involved either interpreting amino acid sequence tags and searching them in MS-Homology (Protein Prospector) or examining the Batch Tag results more closely to assess whether the ions observed were those one would predict to be most intense on the basis of the sites of amino acid cleavage (e.g. cleavage N-terminal to a proline or C-terminal to an aspartic acid).

RESULTS
During analysis of the six cation exchange fractions of the non-ICAT-labeled peptides (i.e. non-cysteine-containing) a total of 3269 MSMS spectra were acquired. These spectra were initially searched using Batch Tag, a new program in Protein Prospector designed for searching of LC-MSMS data against the Swiss-Prot Database (April 3, 2004), allowing only yeast proteins plus a couple of expected non-yeast proteins (GST and human keratins) (for details of Batch Tag see Ref. 19). The database search results were used to assist in the manual analysis of each spectrum, i.e. if a sequence tag of three or four amino acids was manually interpreted and this matched to the result Batch Tag returned and this result also explained the assignment of all the major peaks in the spectrum then the assignment was accepted.
Approximately 2000 of these spectra gave confident results, and these were verified only by a cursory look at plots of the ions observed and what they were matched to. The majority of these matches were on the basis of an extensive "y" ion series. The other ~1300 spectra were manually analyzed in more extensive detail to determine whether the peptides could be de novo interpreted and, if not, why a peptide could not be confidently assigned.
Following this comprehensive analysis of the dataset we could confidently assign 2368 spectra to predicted tryptic peptides that we felt a search engine should be able to identify when allowing for the modifications of oxidized methionines, protein N-terminal acetylation, and pyroglutamate formation from N-terminal glutamine residues. This left 901 spectra that for various reasons one would not expect the search engine to make a confident match. The reasons for this are summarized in Table I and reported graphically in Fig. 1.
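As a quick consistency check on the counts above, the split between assigned and unassigned spectra can be recomputed directly. The function name `summarize` is hypothetical; the two input numbers are taken from the text.

```python
def summarize(total: int, assigned: int):
    """Return the unassigned count and the percentage of spectra assigned."""
    unassigned = total - assigned
    percent_assigned = 100.0 * assigned / total
    return unassigned, percent_assigned


# 3269 spectra acquired, 2368 assigned to predicted tryptic peptides.
unassigned, percent_assigned = summarize(3269, 2368)
print(unassigned, f"{percent_assigned:.1f}%")
```

The result reproduces the 901 unassigned spectra and corresponds to roughly 72% of spectra assigned.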
TABLE I (excerpt)
Spectra   Reason
11        MSMS of peptide that has lost water in-source
8         peptides formed from in-source fragmentation of abundant co-eluting peak
1         peptide contains an internal disulfide bond
1         spectrum contains a methylated lysine
83        wrong precursor charge assignment
1         wrong precursor charge assignment and multiple peptides fragmented
2         wrong precursor charge assignment and not peptides
78        wrong monoisotopic peak assignment
14        wrong precursor charge assignment and monoisotopic peak assignment
3         wrong monoisotopic peak assignment and multiple peptides fragmented
a Polyethylene glycol.

226 of the spectra were not fragmentation spectra of peptides but were rather fragments of chemical contaminants, most commonly ICAT-related products (presumably either chemical side product impurities during synthesis of the reagent or produced by side reactions during the reagent cleavage step in 95% trifluoroacetic acid). An example of one of these spectra is shown in Fig. 2. These spectra do not produce any immonium ion masses and nearly always contain characteristic fragment ions at m/z 481.28, 515.28, and 556.29. Some fragmentation spectra were of peptides less than 620 Da in mass, generally corresponding to a peptide only five amino acids in length. Many of the spectra of these short peptides do not contain enough ions to make a confident assignment, and even if the sequence could be determined, then a five-amino acid string is not sufficient to uniquely identify a protein. The selection window for the precursor ion corresponded to roughly −1.5 and +2 Da from the selected monoisotopic mass. For 43 spectra multiple precursor ions co-eluted within this mass range and were simultaneously fragmented. In some cases both of the peptides could be identified by manual analysis and in many cases at least one could be determined, but unless one component was present at a significantly higher level than the other it would be difficult for the search engine to produce a confident match.
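The asymmetric precursor selection window described above lends itself to a simple check for co-isolated species. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the window offsets are applied on the m/z axis, and the input list is assumed to hold monoisotopic peaks of distinct species (isotope peaks of the selected precursor itself would otherwise also fall inside the window).

```python
def in_selection_window(selected_mz, other_mz, low=-1.5, high=2.0):
    """True if other_mz lies within the window around selected_mz."""
    return low <= other_mz - selected_mz <= high


def co_isolated(selected_mz, survey_mzs):
    """Return other monoisotopic peaks that would be fragmented together
    with the selected precursor."""
    return [
        mz for mz in survey_mzs
        if mz != selected_mz and in_selection_window(selected_mz, mz)
    ]


# Hypothetical survey scan peaks around a precursor selected at m/z 650.30.
peaks = [648.70, 649.10, 650.30, 651.80, 652.40]
print(co_isolated(650.30, peaks))
```

Here the peaks at m/z 649.10 and 651.80 fall inside the window and would contribute fragment ions to the same CID spectrum, whereas 648.70 and 652.40 lie just outside it.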
42 spectra were of peptides that would not be formed by tryptic cleavage of proteins in the database so are either formed by non-specific tryptic cleavage or other protease cleavage during sample isolation, and a further eight were formed by in-source fragmentation of an abundant co-eluting peak to produce a peptide with no enzyme specificity. Hence searching the dataset specifying tryptic cleavage would not match these spectra, although searching with no enzyme specificity could potentially identify these.
A total of 51 spectra were of modified peptides. The majority of these were either peptides where an asparagine had become deamidated to an aspartic acid or were from the trypsin, which is methylated to reduce chymotryptic activity and minimize autolysis (20). However, there was also a peptide that had an internal disulfide intact, thus having a molecular mass 2 Da less than the peptide with free sulfhydryl groups. A peptide from elongation factor 1α was identified that had a methylated lysine. Lysine 30 is a known site of modification (7).
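The mass differences that these modifications introduce can be worked out from monoisotopic element masses. The sketch below uses standard monoisotopic values for H, C, N, and O; the helper name `mass_shift` is hypothetical. It shows why an intact disulfide makes a peptide about 2 Da lighter than the form with free sulfhydryls (loss of two hydrogens), and gives the shifts for deamidation (NH replaced by O) and methylation (addition of CH2).

```python
# Standard monoisotopic masses in Da.
MONO = {"H": 1.00782503, "C": 12.0, "N": 14.0030740, "O": 15.9949146}


def mass_shift(gained, lost=None):
    """Net monoisotopic mass change for a given gain and loss of atoms."""
    lost = lost or {}
    plus = sum(MONO[element] * n for element, n in gained.items())
    minus = sum(MONO[element] * n for element, n in lost.items())
    return plus - minus


disulfide = mass_shift({}, {"H": 2})                    # about -2.0157 Da
deamidation = mass_shift({"O": 1}, {"N": 1, "H": 1})    # about +0.9840 Da
methylation = mass_shift({"C": 1, "H": 2})              # about +14.0157 Da
```

A search that does not consider these shifts will compare the observed parent mass against the unmodified peptide mass and miss the match.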
A number of spectra could not be assigned because of problems in the creation of the peak list used for searching. The data are acquired as profile data but become converted to centroid data for database searching. Errors in the assignment of the peak charge state and recognition of the monoisotopic peak after this centroiding process lead to incorrect information about the parent ion mass, and thus the peptide will not be identified. Both of these problems were most common in spectra of components of relatively high mass (2500 Da or higher) and were mainly caused by poor ion statistics on weak monoisotopic peaks. Jagged peak shapes lead to labeling of multiple spikes on one isotopic peak, leading to the software interpreting this as a part of a highly charged ion and not part of the same isotope profile as the second and third isotopes (Fig. 3).
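The effect of a jagged peak on charge assignment can be illustrated with a deliberately simplified charge estimator. This is not the algorithm used by Mascot.dll or any other centroiding script; it is a sketch that assumes charge is inferred only from the spacing between the first two labeled peaks, using the fact that adjacent isotope peaks are separated by about 1.00335/z on the m/z axis. The function name, tolerance, and peak values are hypothetical.

```python
C13_C12 = 1.0033548  # mass difference between 13C and 12C, in Da


def charge_from_spacing(peak_mzs, max_charge=5, tol=0.02):
    """Estimate charge from the spacing of the first two labeled peaks."""
    if len(peak_mzs) < 2:
        return None
    spacing = peak_mzs[1] - peak_mzs[0]
    best = min(range(1, max_charge + 1),
               key=lambda z: abs(spacing - C13_C12 / z))
    return best if abs(spacing - C13_C12 / best) <= tol else None


# Clean isotope cluster of a doubly charged ion: spacing of about 0.5.
clean = [1300.60, 1301.10, 1301.60]

# Same ion, but a weak, jagged monoisotopic peak is labeled as two spikes
# 0.2 apart, which mimics the isotope spacing of a 5+ ion.
jagged = [1300.58, 1300.78, 1301.10, 1301.60]

print(charge_from_spacing(clean), charge_from_spacing(jagged))
```

With the clean cluster the estimate is 2, whereas the split monoisotopic peak leads to an estimate of 5 and therefore to a parent mass that is wrong by a large factor.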
Four peptides corresponded to sequences that were not present in either Swiss-Prot (04.03.2004) or National Center for Biotechnology Information (NCBI) (03.29.2004) Databases. 313 spectra did not contain enough information for a confident manual assignment of a peptide mainly because they were weak spectra with few ions. In some cases a more intense MSMS spectrum of the same precursor was acquired at a similar time in the same or a neighboring ion exchange fraction; this allowed assignment of the weaker spectrum. The full curated list of all the spectra and their assignments or reason for lack of assignment is supplied in Supplemental Table 1.

DISCUSSION
From reverse phase LC-MSMS analysis of six cation exchange fractions a total of 3269 CID spectra were acquired. Of these 2368 spectra (72%) can be confidently interpreted as tryptic peptides by a combination of database searching with manual verification or manual de novo interpretation. There were errors in assignment of parent ion mass of 181 spectra through incorrect charge state and/or monoisotopic mass determination. Monoisotopic peak recognition and charge state determination are often not straightforward. However, new software is improving at this task. For example, Matrix Science recently released new software, Mascot Distiller, that made fewer errors in parent ion determination on this dataset (data not shown). Also many peak centroiding scripts, including the recent Mascot.dll in the Analyst software, if they are not certain of the charge state will "create" spectra with different charge states and assume the highest scoring MSMS that results from the multiple assigned charge states is correct. With monoisotopic peak and charge state correctly determined several more spectra could be assigned. 7% of CID spectra were not fragmentation spectra of peptides. This figure is likely to be much higher in datasets acquired on ion trap instruments. Because of the higher resolution of data acquired by time-of-flight, on-the-fly charge state determination of precursor ions allows one to specify to only fragment multiply charged precursor ions. Charge state determination of precursor ions on an ion trap can be performed using a narrow m/z range "zoom scan." However, many users choose not to perform this scan as it significantly increases the duty cycle of the analysis, reducing the number of precursor ions that are fragmented. Chemical contaminants are generally singly charged, whereas peptides usually are multiply charged species through capture of protons on the basic N terminus and the C-terminal basic residue (lysine or arginine). 
Hence QqTOF MSMS datasets will contain significantly fewer fragmentation spectra of non-peptide species.
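The strategy mentioned above, in which a spectrum with an uncertain charge state is submitted under several charge assumptions and the best-scoring one is kept, can be sketched as follows. The neutral mass follows from M = z × (m/z − m_proton). The function names are hypothetical, and `toy_score` is a placeholder standing in for whatever score a search engine would return; it is not a real scoring function.

```python
PROTON = 1.00727647  # proton mass in Da


def neutral_mass(mz: float, z: int) -> float:
    """Neutral monoisotopic mass implied by an m/z value and a charge."""
    return z * (mz - PROTON)


def candidate_masses(mz: float, charges=(2, 3, 4)):
    """One candidate parent mass per assumed charge state."""
    return {z: neutral_mass(mz, z) for z in charges}


def best_charge(mz: float, score_fn, charges=(2, 3, 4)) -> int:
    """Keep the charge whose candidate mass gives the highest score."""
    return max(charges, key=lambda z: score_fn(neutral_mass(mz, z)))


def toy_score(mass: float) -> float:
    # Placeholder: pretend the database contains one peptide of this mass.
    return -abs(mass - 1299.985)


candidates = candidate_masses(651.0)
chosen = best_charge(651.0, toy_score)
```

For a precursor at m/z 651.0 the candidates are about 1299.99, 1949.98, and 2599.97 Da, and the placeholder score selects charge 2. The cost of this approach is that each ambiguous spectrum is searched several times.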
This study is not reporting the results of a database search but a manual analysis of what we think a search engine could theoretically achieve on this dataset. For analysis of how search engines perform on this dataset, see the accompanying study (19). As these results are on the basis of manual assignments, there is inherently a subjectivity to the results. For example, 313 spectra were categorized as being unassignable fragmentation spectra of peptides. Their lack of assignment is due to an inability to determine with personal confidence an identity for the spectrum. This was in general due to there being very few ions in the spectrum, although some spectra contained several fragment ions of which many were clearly not derived from a peptide; i.e. the spectrum was a mixture of fragmentation of a peptide and a chemical contaminant.
Through the manual analysis of all the data we have been able to assess the quality of data acquired on a QSTAR mass spectrometer. This analysis has also highlighted some of the problems with the data produced. Although this dataset cannot be taken as completely representative of all data acquired on this type of instrument, it does show that the data are typically information-rich and that a high percentage of the data should be assignable.