Preprocessing Significantly Improves the Peptide/Protein Identification Sensitivity of High-resolution Isobarically Labeled Tandem Mass Spectrometry Data*

Isobaric labeling techniques coupled with high-resolution mass spectrometry have been widely employed in proteomic workflows requiring relative quantification. For each high-resolution tandem mass spectrum (MS/MS), isobaric labeling techniques can be used not only to quantify the peptide from different samples by reporter ions, but also to identify the peptide it is derived from. Because the ions related to isobaric labeling may act as noise in database searching, the MS/MS spectrum should be preprocessed before peptide or protein identification. In this article, we demonstrate that there are a lot of high-frequency, high-abundance isobaric related ions in the MS/MS spectrum, and removing isobaric related ions combined with deisotoping and deconvolution in MS/MS preprocessing procedures significantly improves the peptide/protein identification sensitivity. The user-friendly software package TurboRaw2MGF (v2.0) has been implemented for converting raw TIC data files to mascot generic format files and can be downloaded for free from https://github.com/shengqh/RCPA.Tools/releases as part of the software suite ProteomicsTools. The data have been deposited to the ProteomeXchange with identifier PXD000994.

Mass spectrometry-based proteomics has been widely applied to investigate protein mixtures derived from tissue, cell lysates, or body fluids (1,2). Liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS) is the most popular strategy for analyzing protein/peptide mixtures in shotgun proteomics (3). Large-scale protein/peptide mixtures are separated by liquid chromatography followed by online detection by tandem mass spectrometry. The capabilities of proteomics rely greatly on the performance of the mass spectrometer. With the improvement of MS technology, proteomics has benefited significantly from high resolution and excellent mass accuracy (4).
In recent years, owing to the higher efficiency of higher energy collision dissociation (HCD), a new "high-high" strategy (high-resolution MS as well as high-resolution MS/MS) has been applied instead of the "high-low" strategy (high-resolution MS, e.g. in the Orbitrap, and low-resolution MS/MS, e.g. in the ion trap) to obtain high-quality MS/MS data as well as full MS data in shotgun proteomics. Both full MS scans and MS/MS scans can be performed, and the whole cycle time of MS detection is compatible with the chromatographic time scale (5).
High-resolution measurement is one of the most important features in mass spectrometric applications. In the high-high strategy, high-resolution and accurate spectra are acquired in MS/MS scans as well as in full MS scans, which makes isotopic peaks distinguishable from one another and thus enables the straightforward calculation of precise charge states and monoisotopic masses. During an LC-MS/MS experiment, a multiply charged precursor ion (peptide) is usually isolated and fragmented, and fragment ions in multiple charge states are then generated and collected. After peak lists are extracted from the original tandem mass spectra, the commonly used search engines (e.g. Mascot (6), Sequest (7)) cannot distinguish isotopic peaks or recognize charge states, so every product ion is considered under all charge state hypotheses during the database search for protein identification. The multiple charge states of fragment ions and their isotopic cluster peaks can be incorrectly assigned by the search engine, which can cause false peptide identifications. To overcome this issue, the high-resolution MS/MS spectra must be preprocessed before they are submitted for identification. Two major preprocessing steps are usually used for high-resolution MS/MS data: deisotoping and deconvolution (8,9). Deisotoping removes all peaks of an isotopic cluster except the monoisotopic peak. Deconvolution translates multiply charged ions to singly charged ions and accumulates the intensity of each fragment ion by summing the intensities of all of its charge states. After these two preprocessing steps, the resulting spectra are simpler and cleaner and allow more precise database searching and more accurate bioinformatics analysis.
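The deisotoping step described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: it assumes that each peak already carries a charge assignment, uses a fixed 13C/12C spacing of 1.00335 Da, ignores the expected isotopic intensity pattern, and the function name `deisotope` and the 20 ppm tolerance are chosen here only for illustration.

```python
from typing import List, Tuple

# Mass difference between 13C and 12C in Daltons.
ISOTOPE_SPACING = 1.00335

Peak = Tuple[float, float, int]  # (m/z, intensity, charge)


def deisotope(peaks: List[Peak], tol_ppm: float = 20.0) -> List[Peak]:
    """Keep only the first (monoisotopic) peak of each isotopic cluster.

    A peak is treated as an isotopic peak when a peak with the same charge
    has already been seen at an m/z lower by ISOTOPE_SPACING / charge.
    """
    kept: List[Peak] = []
    seen: List[Peak] = []  # every peak visited so far, isotopic or not
    for mz, intensity, charge in sorted(peaks):
        expected_prev = mz - ISOTOPE_SPACING / charge
        is_isotope = any(
            prev_charge == charge
            and abs(prev_mz - expected_prev) <= expected_prev * tol_ppm * 1e-6
            for prev_mz, _, prev_charge in seen
        )
        seen.append((mz, intensity, charge))
        if not is_isotope:
            kept.append((mz, intensity, charge))
    return kept


# A doubly charged cluster (three peaks) and one singly charged peak.
spectrum = [
    (500.0, 100.0, 2),
    (500.501675, 60.0, 2),
    (501.00335, 20.0, 2),
    (600.0, 50.0, 1),
]
monoisotopic = deisotope(spectrum)
```

Because every visited peak is remembered, the third peak of a cluster is recognized through the second one, so whole clusters collapse to their first peak.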
With the capacity to analyze multiple samples simultaneously, stable isotope labeling approaches have been widely used in quantitative proteomics. Stable isotope labeling approaches are categorized as metabolic labeling (SILAC, stable isotope labeling by amino acids in cell culture) and chemical labeling (10,11). Peptides labeled by the SILAC approach are quantified by precursor ions in full MS spectra, whereas peptides that have been isobarically labeled by chemical means are quantified by reporter ions in MS/MS spectra. There are two similar isobaric chemical labeling methods: (1) isobaric tag for relative and absolute quantification (iTRAQ), and (2) tandem mass tag (TMT) (12,13). These reagents contain an amino-reactive group that specifically reacts with N-terminal amino groups and epsilon-amino groups of lysine residues to label digested peptides in a typical shotgun proteomics experiment. There are four different channels of isobaric tags: TMT two-plex, iTRAQ four-plex, TMT six-plex, and iTRAQ eight-plex (12-16). The number before "plex" denotes the number of samples that can be analyzed simultaneously. Peptides labeled with different isotopic variants of the tag have identical or similar masses and appear as a single peak in full scans. This single peak may be selected for subsequent MS/MS analysis. In an MS/MS scan, the masses of the reporter ions (114 to 117 for iTRAQ four-plex, 113 to 121 for iTRAQ eight-plex, and 126 to 131 for TMT six-plex upon CID or HCD activation) are associated with the corresponding samples, and their intensities represent the relative abundances of the labeled peptides. Meanwhile, the other ions in the MS/MS spectra can be used for peptide identification. Because of this multiplexing capability, isobaric labeling methods combined with bottom-up proteomics have been widely applied for accurate quantification of proteins on a global scale (14, 17-19). Although mostly associated with peptide labeling, these isobaric labeling methods have also been applied at the protein level (20-23).
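How reporter ion intensities translate into relative abundances can be shown with a short Python sketch. The reporter masses below are approximate values for iTRAQ four-plex supplied here as assumptions, and the function name `reporter_ratios` and the 0.01 Da matching tolerance are illustrative choices rather than settings taken from this study.

```python
from typing import Dict, List, Tuple

# Approximate iTRAQ four-plex reporter ion masses (assumed, in Daltons).
ITRAQ4_REPORTERS: Dict[str, float] = {
    "114": 114.1107,
    "115": 115.1077,
    "116": 116.1111,
    "117": 117.1144,
}


def reporter_ratios(
    peaks: List[Tuple[float, float]],
    reporters: Dict[str, float] = ITRAQ4_REPORTERS,
    tol_da: float = 0.01,
) -> Dict[str, float]:
    """Return the fraction of the total reporter signal found in each channel."""
    intensities: Dict[str, float] = {}
    for channel, mass in reporters.items():
        matched = [inten for mz, inten in peaks if abs(mz - mass) <= tol_da]
        # Use the most intense matching peak; 0.0 when the channel is absent.
        intensities[channel] = max(matched, default=0.0)
    total = sum(intensities.values())
    if total == 0:
        return {channel: 0.0 for channel in intensities}
    return {channel: value / total for channel, value in intensities.items()}


# Four reporter peaks plus one unrelated fragment ion.
msms = [
    (114.1107, 100.0),
    (115.1077, 200.0),
    (116.1111, 300.0),
    (117.1144, 400.0),
    (175.1190, 999.0),
]
ratios = reporter_ratios(msms)
```

Only peaks near a reporter mass contribute to the ratios; every other peak remains available for peptide identification.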
For the proteomic analysis of isobarically labeled peptides/proteins with the "high-high" MS strategy, the common consensus is that accurate reporter ions contribute to more accurate quantification. However, there is no evidence showing how the ions related to isobaric labeling affect peptide/protein identification or what preprocessing steps should be taken for high-resolution isobarically labeled MS/MS spectra. To demonstrate the effectiveness and importance of preprocessing, we examined how combinations of preprocessing steps improved peptide/protein identification sensitivity in database searching. Several combinations of preprocessing steps were applied for high-throughput data analysis, including deisotoping to keep only monoisotopic peaks, deconvolution of ions with multiple charge states, and preservation of the top 10 peaks in every 100 Dalton mass range. After systematic analysis of high-resolution isobarically labeled spectra, we further processed the spectra and removed interfering ions that were not related to the peptide. Our results suggest that preprocessing of isobarically labeled high-resolution tandem mass spectra significantly improves the peptide/protein identification sensitivity.

EXPERIMENTAL PROCEDURES
Sample Preparation-The Goto-Kakizaki (GK) rat liver tissue was mixed with SDT-lysis buffer (2% SDS, 0.1 M DTT, and 0.1 M Tris-HCl, pH = 7.6) and then heated for 5 min at 100°C. After that, the tissue layers were cooled to room temperature, sonicated for 60 s at 100 W, and then centrifuged at 16,000 × g for 30 min at 20°C to remove cell debris. The protein concentration was determined by measuring tryptophan fluorescence as described (24). Briefly, 1 µl of sample or tryptophan standard (100 ng/µl) was added to 3 ml of 8 M urea buffer (8 M urea and 20 mM Tris-HCl, pH = 7.6). Fluorescence was excited at 295 nm and measured at 350 nm. The slits were set at 10 nm.
Six hundred micrograms of liver tissue from GK rat was digested by the FASP procedure as described (25) with small modifications. Each sample was transferred to a 10k filter (Pall Corporation, Port Washington, NY) and centrifuged at 10,000 × g for 20 min at 20°C. Then 200 µl of UA buffer (8 M urea and 0.1 M Tris-HCl, pH = 8.5) was added and the filter was centrifuged at 10,000 × g for 20 min again. This step was repeated once. The concentrate was then mixed with 100 µl of 50 mM IAA in UA buffer and incubated for an additional 40 min at room temperature in darkness. After that, IAA was removed by centrifugation at 10,000 × g for 20 min. Following dilution with 200 µl of UA buffer and centrifugation twice, 200 µl of 200 mM triethylammonium bicarbonate (TEAB) buffer (pH 8.5) was added and the filter was centrifuged at 10,000 × g for 20 min. This step was repeated four times. Finally, 100 µl of 50 mM TEAB buffer (pH 8.5) and trypsin (1:50, enzyme to protein) were added to the filter, and after 4 h, another 50 µg of trypsin was added. The samples were digested for 20 h at 37°C, and peptides were collected by centrifugation at 16,000 × g. To increase the yield of peptides, the filter was washed twice with 500 µl of 0.5 M TEAB buffer (pH 8.5). The peptide solutions were dried in a vacuum concentrator.
The trypsin digestion of 100 µg of protein from each sample was processed as described elsewhere. iTRAQ labeling was done following the manufacturer's instructions (AB SCIEX, Foster City, CA). Briefly, for each four- or eight-plex experiment, 100 µg of dried peptide mixture powder from each digested sample was reconstituted with 30 µl of 0.5 mM TEAB buffer (pH 8.5). Each peptide solution was labeled at room temperature for 2 h with one iTRAQ reagent vial (four-plex mass tag 114, 115, 116, 117 or eight-plex mass tag 113, 114, 115, 116, 117, 118, 119, 121) previously reconstituted with 70 µl of anhydrous acetonitrile (ACN). After 2 h, 100 µl of ddH2O was added to each tube to quench the iTRAQ reaction, and the tubes were incubated at room temperature for 30 min. The contents of all iTRAQ reagent-labeled sample tubes were combined into one tube for the four-plex or eight-plex experiment, respectively. The labeled samples were then dried down by evaporation in a SpeedVac to obtain a brown pellet. Next, 100 µl of water was added to the tube and the sample was dried completely. Prior to MS analysis, samples were desalted on an Empore C18 47 mm Disk (3M). Just prior to nano-LC, the fractions were resuspended in 20

LC-MS/MS Analysis-The reverse phase-high performance liquid chromatography (RP-HPLC) separation was achieved on an UltiMate 3000 RSLC nanoLC system (Dionex, now Thermo Fisher Scientific) equipped with a self-packed tip column (75 µm × 240 mm; C18, 1.9 µm) using a 180 min gradient at a flow rate of 150 nl/min. An LTQ-Orbitrap Velos instrument (Thermo Fisher Scientific) was operated in data-dependent mode. MS full scans were acquired in the range m/z 300-2000. The mass spectrometer was set so that each full MS scan was followed by MS/MS of the ten most intense ions with charge ≥ +2, with the following Dynamic Exclusion™ settings: repeat count, 1; repeat duration, 30 s; exclusion duration, 180 s. The normalized collision energy for MS2 was 45.0%. Full MS scans and MS/MS scans were acquired at resolutions of 30,000 in profile mode and 7500 in centroid mode, respectively, with the lock mass option enabled for the 445.120025 ion. Data were acquired using Xcalibur software.
b/y Free Windows-The b/y free windows of a mass spectrum are two mass windows in which no B ion or Y ion can occur. Assuming that the mass of the isobaric tag is M, that trypsin is used as the protease, and that the isobaric tag is attached to both the N terminus of the peptide and lysine (K), the b/y free windows of a spectrum with singly charged precursor mass MH+ can be calculated as follows. Because only fully tryptic peptides are considered in data analysis, the last amino acid of the peptide is either arginine (R) with mass 156 or lysine with mass 128. Given that glycine (G) is the smallest amino acid, with mass 57, the minimum and maximum masses of B and Y ions can be calculated by formulas (1-4), where H2O is the mass of water and H is the mass of hydrogen. The b/y free window in the low mass range then extends from 0 to minimum(minimum(B), minimum(Y)), and the b/y free window in the high mass range extends from maximum(maximum(B), maximum(Y)) to infinity.

Ion Frequency and Abundance Analysis-Only the spectra with precursor charges 2, 3, and 4 were used to detect high frequency ions. The ion frequency and ion abundance distributions in each sample were generated by the software "Raw Ion Frequency Statistic Builder," which is also part of ProteomicsTools. The charge, mass to charge ratio (m/z), and abundance of each ion were extracted from each MS/MS spectrum through Thermo's MS File Reader interface. The abundances of the ions in each MS/MS spectrum were normalized to the range [0, 1]. Ions with relative abundance less than 0.01 were discarded. All remaining ions were deconvoluted to the corresponding singly charged ions by formula (5), where H is the mass of hydrogen. Ions without charge information were treated as singly charged.
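The normalization and deconvolution steps can be sketched in Python as follows. Formula (5) is not reproduced in this text, so the sketch assumes the standard conversion of an ion at m/z with charge z to its singly charged mass, m/z × z - (z - 1) × H, and it assumes that normalization means division by the most abundant ion of the spectrum. The value used for H and the function names are likewise assumptions made for illustration.

```python
from typing import List, Optional, Tuple

H = 1.007276  # assumed value for the mass of hydrogen (proton), in Daltons


def to_singly_charged(mz: float, charge: Optional[int]) -> float:
    """Convert an ion to its singly charged mass (assumed form of formula 5).

    Ions without charge information (None or 0) are treated as singly charged.
    """
    z = charge if charge and charge > 0 else 1
    return mz * z - (z - 1) * H


def normalize_and_deconvolute(
    peaks: List[Tuple[float, float, Optional[int]]],
    min_relative: float = 0.01,
) -> List[Tuple[float, float]]:
    """Scale abundances to [0, 1], drop weak ions, and deconvolute the rest."""
    if not peaks:
        return []
    top = max(intensity for _, intensity, _ in peaks)
    if top <= 0:
        return []
    result = []
    for mz, intensity, charge in peaks:
        relative = intensity / top
        if relative < min_relative:
            continue  # relative abundance below the 0.01 cutoff
        result.append((to_singly_charged(mz, charge), relative))
    return sorted(result)


# (m/z, abundance, charge); a charge of 0 means "no charge information".
ions = [(200.0, 1000.0, 1), (500.0, 500.0, 2), (300.0, 5.0, 0)]
deconvoluted = normalize_and_deconvolute(ions)
```

In this example the ion at m/z 300 falls below the relative abundance cutoff and is dropped, while the doubly charged ion at m/z 500 is reported at its singly charged mass.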
Ions in different deconvoluted spectra whose masses differed by less than 20 parts per million (ppm) were considered identical ions. The ion frequency and the average relative ion abundance were calculated from all the MS/MS spectra in the sample. Ions with frequency larger than 0.3 and average relative abundance larger than 0.05 were defined as high frequency ions and classified into five categories: "Rep+," "Label+," "Y1," "b/y free," and "Unknown." "Rep+" denotes that an ion is a reporter ion. "Label+" denotes that an ion is an isobaric tag ion with both reporter group and balance group. "Y1" denotes that an ion is a first Y series ion. Because trypsin was used in the sample preparation, a Y1 ion was produced from either lysine (K) or arginine (R). "b/y free" denotes that the mass of the ion is located in the b/y free windows of that spectrum. All other ions belonged to the "Unknown" category. An ion within one of the first four categories (Rep+, Label+, Y1, and b/y free) was considered annotated.

For each deconvoluted tandem mass spectrum (forward spectrum), a backward spectrum was generated by subtracting the mass of each forward ion from the mass of the precursor. The backward ions were filtered and annotated in the same fashion as the forward ions, except that the ions with mass equal to "Label+" were marked as "Precursor-Label+." "Precursor-Label+" denotes a precursor ion without the isobaric tag. The ions annotated as Rep+, Label+, and Precursor-Label+ are not related to the peptide and therefore can be confidently removed during data preprocessing. The ions annotated as b/y free in the low mass range are very likely not related to the peptide either, but it is still possible that they are actually multiply charged ions that lack charge information in the spectrum.
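The window calculation and the annotation categories can be sketched in Python. Formulas (1-4) are not reproduced in this text, so the bounds below are a reconstruction under stated assumptions: standard b/y ion masses, the smallest b ion taken as a tagged N-terminal glycine, the smallest y ion taken as arginine or tagged lysine, the largest b and y ions taken as the complements of those ions, and Label+ taken as the tag mass plus H. The mass constants are approximate, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Approximate monoisotopic masses in Daltons (illustrative values).
H = 1.00728     # hydrogen, taken here as the proton mass
H2O = 18.01056  # water
G = 57.02146    # glycine residue, the smallest amino acid
K = 128.09496   # lysine residue
R = 156.10111   # arginine residue


def by_free_windows(precursor_mh: float, tag_mass: float) -> Tuple[float, float]:
    """Return (upper bound of the low window, lower bound of the high window)."""
    min_b = tag_mass + G + H                # b1: tagged N-terminal glycine
    min_y = min(R, K + tag_mass) + H2O + H  # y1: arginine or tagged lysine
    max_b = precursor_mh + H - min_y        # complement of the smallest y1
    max_y = precursor_mh + H - min_b        # complement of the smallest b1
    return min(min_b, min_y), max(max_b, max_y)


def within_ppm(mass: float, target: float, tol_ppm: float) -> bool:
    return abs(mass - target) <= target * tol_ppm * 1e-6


def annotate(
    mass: float,
    precursor_mh: float,
    tag_mass: float,
    reporter_masses: Sequence[float],
    tol_ppm: float = 20.0,
) -> str:
    """Assign one of the five categories to a singly charged forward ion."""
    if any(within_ppm(mass, rep, tol_ppm) for rep in reporter_masses):
        return "Rep+"
    if within_ppm(mass, tag_mass + H, tol_ppm):
        return "Label+"
    y1_masses = (R + H2O + H, K + tag_mass + H2O + H)
    if any(within_ppm(mass, y1, tol_ppm) for y1 in y1_masses):
        return "Y1"
    low_upper, high_lower = by_free_windows(precursor_mh, tag_mass)
    if mass < low_upper or mass > high_lower:
        return "b/y free"
    return "Unknown"


def backward_spectrum(masses: Sequence[float], precursor_mh: float) -> List[float]:
    """Precursor mass minus the mass of each forward ion."""
    return [precursor_mh - m for m in masses]


ITRAQ4_TAG = 144.102063                                # assumed tag mass
ITRAQ4_REP = (114.1107, 115.1077, 116.1111, 117.1144)  # assumed reporter masses
low, high = by_free_windows(1000.0, ITRAQ4_TAG)
```

The specific categories are tested before the b/y free window because reporter and tag ions can themselves fall inside the low mass window, as is the case for iTRAQ4 under these assumptions.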
Data Preprocessing-The tandem mass spectra were extracted by TurboRaw2MGF (v1.3.4) for database searching. Four fixed criteria were used to filter out low quality spectra: (1) the required precursor mass range was 400 to 5000 Daltons, (2) the minimum absolute ion abundance was 1.0, (3) the minimum ion count of a spectrum was 15, and (4) the minimum total absolute ion abundance of a spectrum was 100. Four processing options were also provided in TurboRaw2MGF: deisotoping to keep monoisotopic peaks, deconvolution of ions with multiple charge states, preservation of the top 10 peaks in every 100 Dalton mass range, and removal of the ions that may not be related to the peptide. The spectra that passed the fixed criteria and were processed with a combination of the four options were saved in Mascot generic format for further database searching.
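The fixed quality criteria and the top 10 peaks per 100 Dalton option can be sketched in Python. This is an illustration rather than the TurboRaw2MGF code: it assumes that the minimum ion abundance is applied per ion before the ion count and total abundance are checked, and that the 100 Dalton windows are aligned to multiples of 100. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, absolute abundance)


def passes_quality_filter(
    precursor_mass: float,
    peaks: List[Peak],
    min_mass: float = 400.0,
    max_mass: float = 5000.0,
    min_abundance: float = 1.0,
    min_ion_count: int = 15,
    min_total_abundance: float = 100.0,
) -> bool:
    """Apply the four fixed criteria to one spectrum."""
    if not (min_mass <= precursor_mass <= max_mass):
        return False
    # Assumption: ions below the minimum abundance are ignored, not fatal.
    usable = [inten for _, inten in peaks if inten >= min_abundance]
    return len(usable) >= min_ion_count and sum(usable) >= min_total_abundance


def keep_top_n_per_window(
    peaks: List[Peak], n: int = 10, window: float = 100.0
) -> List[Peak]:
    """Keep the n most abundant peaks in every mass window of the given width."""
    bins: Dict[int, List[Peak]] = defaultdict(list)
    for mz, inten in peaks:
        bins[int(mz // window)].append((mz, inten))
    kept: List[Peak] = []
    for members in bins.values():
        members.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(members[:n])
    return sorted(kept)


# Twenty peaks between 100 and 119 Da plus one peak in the 200-300 Da window.
example = [(100.0 + i, float(i + 1)) for i in range(20)] + [(250.0, 3.0)]
filtered = keep_top_n_per_window(example)
```

In the example, the ten weakest peaks of the crowded 100-200 Dalton window are discarded, while the single peak in the 200-300 Dalton window is kept regardless of its abundance.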
Software Development-We implemented our preprocessing steps in a user-friendly software package named TurboRaw2MGF (v2.0). The previous version of TurboRaw2MGF was developed for low-resolution tandem mass spectra and was integrated into the package ProtQuantSuite (32). TurboRaw2MGF (v2.0) was developed using the C# programming language and was compiled in Microsoft Visual Studio 2012 Professional Edition. The software is fully compatible with Windows-based operating systems with .NET Framework v4.5. TurboRaw2MGF (v2.0) and its source code can be downloaded freely from https://github.com/shengqh/RCPA.Tools/releases/. The manual of TurboRaw2MGF (v2.0) can be viewed at https://github.com/shengqh/RCPA.Tools/wiki/.
Data Availability-The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (33) with the data set identifier PXD000994.

For iTRAQ4 spectra, the mass of a Label+ ion is within the low mass b/y free window, and the mass of a Precursor-Label+ ion is within the high mass b/y free window. The isobaric related mass ranges therefore consist of the low and high b/y free windows. For iTRAQ8 spectra, the mass of a Label+ ion is not within the low mass b/y free window, and the mass of a Precursor-Label+ ion is not within the high mass b/y free window. The isobaric related mass ranges therefore include not only the low and high b/y free windows but also the mass ranges around the Label+ ion and the Precursor-Label+ ion within a specific tolerance, which was 20 ppm in our study.

Ion Frequency and Abundance-Tables II and III show the high frequency forward ions in iTRAQ4 and iTRAQ8 tandem mass spectra, respectively. Almost all high frequency forward ions in iTRAQ4 tandem mass spectra were annotated, except 429.0888. Although the majority of high frequency ions were annotated, more ions were left unannotated in the iTRAQ8 tandem mass spectra than in the iTRAQ4 tandem mass spectra.

For backward ions, only 144.1 (frequency = 0.3316, median of abundance = 0.207) from iTRAQ4 tandem mass spectra with doubly charged precursors and 304.1997 (frequency = 0.380, median of abundance = 0.199) from iTRAQ8 tandem mass spectra with doubly charged precursors passed the criteria. Both ions were annotated as Precursor-Label+. Also, the frequency and abundance of reporter ions in both the iTRAQ4 and iTRAQ8 data sets decreased as the corresponding precursor charge increased.

Table footnotes: a, b/y free window in high mass range; b, reporter and isobaric tag ions; c, b/y free window in low mass range.
Identification Sensitivity Improvement-We evaluated how the combination of preprocessing steps affected the peptide/protein identification sensitivity at the same peptide/protein false discovery rate of 0.01. Fig. 1 illustrates the identification results from the iTRAQ4 and iTRAQ8 data sets using five search engines. The bigger the point of a method in the graph, the more identifications that method achieved with the same engine and the same isobaric labeling method. The red circle indicates the preprocessing method that achieved the most identifications among all 16 methods. In the iTRAQ4 data set, Mascot, MyriMatch, OMSSA, and X! Tandem achieved the most spectrum, peptide, and two-hit protein identifications with preprocessing of isobaric related ions, although the top-performing method was not necessarily the same for each engine. In the iTRAQ8 data set, only Mascot, OMSSA, and X! Tandem achieved the most two-hit protein identifications with preprocessing of isobaric related ions. The preprocessing did not significantly improve the Comet identification sensitivity in either the iTRAQ4 or the iTRAQ8 data set.

FIG. 1. The size of each spot indicates the rank of the method based on identification performance: the bigger the spot, the better the identification performance. The red circle indicates the best-performing method at the same identification level, with the same engine and the same isobaric labeling approach. Mascot, MyriMatch, OMSSA, and X! Tandem achieved the best two-hit protein identification with preprocessing of isobaric related ions in the iTRAQ4 data set. Mascot, OMSSA, and X! Tandem achieved the best two-hit protein identification with preprocessing of isobaric related ions in the iTRAQ8 data set. The preprocessing of isobaric related ions did not significantly improve the Comet identification sensitivity in either the iTRAQ4 or the iTRAQ8 data set.

Fig. 2 illustrates the identification improvement of the 15 preprocessing methods compared with the non-preprocessing method in the iTRAQ4 and iTRAQ8 data sets. Among the five search engines, Mascot identification sensitivity was significantly improved by most of the preprocessing methods. The identification sensitivity of MyriMatch, OMSSA, and X! Tandem was moderately improved by some of the preprocessing methods. The identification sensitivity of Comet was not improved by most of the preprocessing methods. The detailed identification summary is provided in supplemental Tables S1-S10.

Comparing method 2 to method 1 in Tables IV and V indicates that deisotoping and deconvolution significantly improved the Mascot spectrum identification for iTRAQ4 and iTRAQ8, from 16,442 to 18,286 (increased 11.2%) and from 8817 to 10,219 (increased 15.9%), respectively. Comparing method 3 to method 1 shows that keeping the top 10 ions in each 100 Dalton window decreased the Mascot identification sensitivity for the iTRAQ4 data set but increased the identification sensitivity for the iTRAQ8 data set. Identified spectrum counts were moderately increased for iTRAQ4 (from 16,442 to 17,912, increased 8.9%) and significantly increased for iTRAQ8 (from 8817 to 12,012, increased 36.2%) by removing isobaric tag ions and the ions in the low mass range b/y free window (comparing method 5 to method 1). Comparing methods 5, 6, and 7 to method 1 indicates that removing any one of the three isobaric related ion types improved Mascot identification sensitivity in both the iTRAQ4 and iTRAQ8 data sets, except for the ions in the high mass range b/y free window in the iTRAQ4 data set.

Finally, comparing method 10 to method 1 in Table IV indicates that deisotoping, deconvolution, and removing isobaric ions improved the Mascot spectrum identification from 16,442 to 19,118 (increased 16.3%), the peptide identification from 6275 to 7148 (increased 13.9%), and the two-hit protein identification from 950 to 1013 (increased 6.6%) in the iTRAQ4 data set. Table V indicates that deisotoping, deconvolution, and removing all possible isobaric related ions improved the Mascot spectrum identification from 8817 to 13,240 (increased 50.2%), the peptide identification from 3349 to 4671 (increased 39.5%), and the two-hit protein identification from 612 to 766 (increased 25.2%) in the iTRAQ8 data set.
Mascot Score Improvement by Data Preprocessing-We evaluated how the Mascot peptide identification scores were improved by preprocessing of tandem mass spectra before database searching. The scores of the peptide-spectrum matches identified by methods 1 and 10 in the iTRAQ4 data set and by methods 1 and 16 in the iTRAQ8 data set were extracted (see supplemental Table S11). Fig. 3 indicates that data preprocessing before database searching improved the identification scores of a majority of spectra in both the iTRAQ4 and iTRAQ8 data sets. The p value of 2.2e-16 from the Wilcoxon rank sum test indicates that the score improvement in the iTRAQ8 data set was significantly higher than in the iTRAQ4 data set.

FIG. 3. Mascot score improvement after preprocessing tandem mass spectra. Both the top two density plots and the bottom two violin plots indicate that the majority of the spectra gained score improvement with data preprocessing in both the iTRAQ4 and iTRAQ8 data sets. The p value of 2.2e-16 from the Wilcoxon rank sum test indicates that the score improvement in the iTRAQ8 data set was significantly higher than in the iTRAQ4 data set.
C-terminal Peptide Identification-Because the tryptic peptide generated from the protein carboxyl terminus (C-terminal peptide) usually does not follow the assumption that the Y1 ion is either Y1(K) or Y1(R), which we use for calculating the b/y free window, we checked how those peptides were identified before and after data preprocessing. The scores of the C-terminal peptides identified by methods 1 and 10 in the iTRAQ4 data set and by methods 1 and 16 in the iTRAQ8 data set were extracted (see supplemental Table S12). In Fig. 4, the top two Venn diagrams indicate that preprocessing also increases C-terminal peptide identification sensitivity in both the iTRAQ4 and iTRAQ8 data sets, and the bottom two scatter plots indicate that the Mascot scores of a majority of commonly identified C-terminal peptides also increased after preprocessing.

FIG. 4. C-terminal peptide identification improvement in iTRAQ4 and iTRAQ8 data sets after preprocessing tandem mass spectra. The top two Venn diagrams indicate that preprocessing also increased C-terminal peptide identification sensitivity in both the iTRAQ4 and iTRAQ8 data sets. The bottom two scatter plots indicate that the Mascot scores of the majority of commonly identified C-terminal peptides also increased after preprocessing.

DISCUSSION
We annotated the high frequency ions in isobarically labeled tandem mass spectra. The majority of high frequency ions in the iTRAQ4 and iTRAQ8 data sets could be annotated as reporter ions (Rep+), isobaric tag ions (Label+), Y1 ions, or ions in the b/y free window. More unannotated ions were observed in the iTRAQ8 data set than in the iTRAQ4 data set. This may be caused by the more complex iTRAQ8 isobaric labeling tag, which could introduce more byproduct ions during isolation in mass spectrometry than the iTRAQ4 tag. Reporter ions and isobaric tag ions are isobaric related ions and can be confidently removed from the MS/MS spectrum for database searching.
The other high frequency ions in the b/y free windows are very likely introduced not by the peptide itself but by either the isobaric labeling procedure or the mass spectrometry system. Those ions might be removed to de-noise the tandem mass spectra and improve identification sensitivity. However, it is still possible that the ions in the low mass range b/y free window are actually multiply charged b/y ions whose charges cannot be estimated from the mass spectrum; removing such ions may decrease the identification sensitivity. The benefit of removing the ions in the b/y free window may vary between isobaric labeling methods and between search engines. Because there are fewer ions in the low mass b/y free window in the iTRAQ4 data set than in the iTRAQ8 data set (supplemental Fig. S1), removing only isobaric ions may be more suitable for iTRAQ4 data, and removing ions in the low mass b/y free window may be more suitable for iTRAQ8 data. We also observed a few high frequency ions outside of the b/y free windows, including 429.0888. Without conclusive evidence about their origin, we did not remove them in this study.
We also examined the factors that might affect the sensitivity of peptide identification. Our results showed that the combination of deisotoping/deconvolution and removing isobaric related ions significantly improved the Mascot identification sensitivity and moderately improved the MyriMatch, X! Tandem, and OMSSA identification sensitivity for both the iTRAQ4 and iTRAQ8 data sets. Comet was only slightly affected by the preprocessing procedure. We further validated our results with Mascot on an independent TMT6 data set, which showed a similar improvement in peptide/protein identification sensitivity (see supplemental Table S13). Based on our results, we conclude that removing isobaric related ions combined with deisotoping/deconvolution is highly recommended for preprocessing isobarically labeled MS/MS spectra before database searching, especially for the Mascot search engine.
The complexity of the isobaric labeling tag significantly affects the identification sensitivity improvement after preprocessing tandem mass spectra. Keeping the top 10 ions in each 100 Dalton window slightly decreased the Mascot peptide identification sensitivity in the iTRAQ4 data set, regardless of whether it was combined with deisotoping and deconvolution. This may indicate that the high-resolution mass spectra in our iTRAQ4 data set were so clean that keeping the top 10 ions in each 100 Daltons was not necessary during data preprocessing. This finding may require additional validation in other independent iTRAQ4 data sets. On the other hand, keeping the top 10 ions in each 100 Dalton window slightly increased the Mascot peptide identification sensitivity in the iTRAQ8 data set. Compared with method 1, a combination of deisotoping/deconvolution, keeping the top 10 ions in each 100 Dalton window, and removing isobaric related ions (method 16) improved identified spectra, peptides, and two-hit proteins for iTRAQ8 over iTRAQ4 by 32.7%, 36.4%, and 18.5%, respectively. This suggests that preprocessing is more crucial for iTRAQ8 than for iTRAQ4 data.
We validated the identification improvement of the C-terminal peptides. C-terminal peptides might not end with "K" or "R," which violates the assumption used in our b/y free window calculation that Y1 ions come from either K or R. The results indicated that data preprocessing not only improved the Mascot scores of the majority of C-terminal peptides but also increased the identification sensitivity of C-terminal peptides, even though the ions in the low mass b/y free window were removed.
We implemented TurboRaw2MGF (v2.0) with a user-friendly GUI. The GUI allows users to conveniently convert the data generated by high-resolution mass spectrometers (such as the Thermo LTQ-Orbitrap Velos) to Mascot generic format files. TurboRaw2MGF also supports filtering spectra based on user-defined mass ranges. For example, the user may define the range 428.75-429.25 to remove the 429.0888 ion. TurboRaw2MGF (v2.0) offers other conveniences as well: conversion from mzData and mzXML files to Mascot generic format files is supported, and multiple files can be converted in batch mode. TurboRaw2MGF is free, and it will be supported in the coming years.
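The user-defined mass range filter mentioned above can be illustrated with a few lines of Python. This sketch only mirrors the idea; it does not reproduce TurboRaw2MGF's C# code or its option names, and the function name `remove_mass_ranges` is hypothetical.

```python
from typing import List, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, abundance)


def remove_mass_ranges(
    peaks: List[Peak], ranges: Sequence[Tuple[float, float]]
) -> List[Peak]:
    """Drop every peak whose m/z lies inside any (low, high) range, inclusive."""
    return [
        (mz, inten)
        for mz, inten in peaks
        if not any(low <= mz <= high for low, high in ranges)
    ]


# Remove the unannotated 429.0888 ion with the range given in the text.
cleaned = remove_mass_ranges(
    [(175.119, 40.0), (429.0888, 85.0), (500.0, 12.0)],
    ranges=[(428.75, 429.25)],
)
```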