ProLuCID: An improved SEQUEST-like algorithm with enhanced sensitivity and specificity

ProLuCID, a new algorithm for peptide identification using tandem mass spectrometry and protein sequence databases, has been developed. This algorithm uses a three-tier scoring scheme. First, a binomial probability is used as a preliminary scoring scheme to select candidate peptides. The binomial probability scores generated by ProLuCID minimize molecular weight bias and are independent of database size. A modified cross-correlation score is then calculated for each candidate peptide identified by the binomial probability. This cross-correlation scoring function models the isotopic distributions of fragment ions of candidate peptides, which ultimately results in higher sensitivity and specificity than that obtained with the SEQUEST XCorr. Finally, ProLuCID uses the distribution of XCorr values for all of the selected candidate peptides to compute a Z score for the peptide hit with the highest XCorr. The ProLuCID Z score combines the discriminative power of XCorr and DeltaCN, the standard parameters for assessing the quality of the peptide identification using SEQUEST, and displays significant improvement in specificity over ProLuCID XCorr alone. ProLuCID is also able to take advantage of high resolution MS/MS spectra, leading to further improvements in specificity when compared to low resolution tandem MS data. A comparison of filtered data searched with SEQUEST and ProLuCID using the same false discovery rate, as estimated by a target-decoy database strategy, shows that ProLuCID was able to identify as many as 25% more proteins than SEQUEST. ProLuCID is implemented in Java and can be easily installed on a single computer or a computer cluster. © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
In recent years, shotgun proteomics [1,2] has emerged as a robust and sensitive method for identifying and quantifying proteins in a complex biological sample and is now a preferred method for large-scale proteomic analyses [3,4]. The strategy is based on proteolytic digestion of complex protein mixtures into peptides followed by identification of the peptides using tandem mass spectrometry (MS/MS). Peptide identifications can be used to identify their corresponding proteins using an automated database search. Recent improvements in MS technologies allow the acquisition of hundreds of thousands of MS/MS spectra over the course of one LC/MS/MS analysis [5-7], and a large-scale shotgun proteomics project typically generates hundreds of millions of MS/MS spectra. Each of these spectra has to be correlated with the amino acid sequence of a peptide and corresponding protein. The sensitivity and efficiency of the database search program used are of critical importance in any high-throughput protein identification experiment.
The advent and commercialization of the high-performance mass spectrometer enables routine, wide-spread high resolution high mass accuracy measurements of peptides in proteomics [39]. Early studies using this hybrid instrument have demonstrated a number of advantages including high mass accuracy, high resolution, large space charge capacity, and high dynamic range [5,40]. Venable et al. evaluated the use of the LTQ-Orbitrap for the quantification of stable isotope-labeled peptides and showed a 4-5 fold improvement in the number and quality of the peptide ratio measurements compared with similar analyses done on the LTQ [41]. In addition, the high mass accuracy generated by the LTQ-Orbitrap hybrid mass spectrometer can be used to improve the confidence of peptide identification and database search speed. One strategy for doing this is to obtain high mass resolution data for all precursor ions during the full MS scan in the Orbitrap mass analyzer and then collect low resolution MS/MS spectra on those precursor peptides in the linear ion trap. An alternative approach that takes advantage of the LTQ-Orbitrap is to collect both high resolution MS and MS/MS spectra in the Orbitrap mass analyzer for peptide identification.
Some database search programs utilize a two-step scoring scheme. The first step is a preliminary scoring (Sp) step that is used to select a fixed number of candidate peptides, which are then analyzed using a more sophisticated second step of scoring. This Sp step is important for the speed of the identification process since the final scoring algorithms are usually slower, making them impractical for scoring every candidate sequence. One common method for Sp scoring is to use the number of shared peaks to select the final candidates. This is done by multiple algorithms, including the hypergeometric probability based PEP_PROBE [29], OMSSA [31] and the central limit theorem based PEP_PROBE [32]. However, the "number of shared peaks" approach may not work well for a low quality spectrum, especially when the fragmentation is poor. Alternatively, the preliminary score (Sp) of SEQUEST is an empirically derived score that restricts the number of sequences analyzed in the correlation analysis. Sp sums the peak intensity of fragment ions matching the predicted sequence ions and accounts for the continuity of an ion series and the length of a peptide. The original score is:

$$S_p = \left(\sum i_m\right) \, m \, \frac{(1+\beta)(1+\rho)}{L}$$

where the first term in the product is the sum of ion abundances of all matched peaks, m is the number of matches, β is a 'reward' for each consecutive match of an ion series (for example, 0.075), ρ is a 'reward' for the presence of an immonium ion (for example, 0.15) and L is the number of all theoretical ions of an amino acid sequence. The final scoring uses one of the following two methods to measure closeness of fit between spectra and peptide sequences: the first method uses a shared peak model to generate a quantitative measure of the fit, while the second method uses fragment ion frequency to generate the probability that the sequence and spectrum are the best fit [32]. Because the final scoring is usually more sophisticated and sensitive than the preliminary scoring, the final scoring method would ideally be applied to each candidate peptide rather than a limited number of them.
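As an illustration of the SEQUEST-style preliminary score described above, the following minimal Python sketch assumes the score is the product of the summed matched intensity, the number of matches m, and the factors (1 + β) and (1 + ρ), divided by L. The function name and the choice to pass β and ρ as already-accumulated totals are assumptions made for this example, not part of SEQUEST itself.

```python
def preliminary_score(matched_intensities, n_theoretical, beta=0.0, rho=0.0):
    """Sketch of an Sp-like score.

    matched_intensities: intensities of experimental peaks that matched
        a predicted fragment ion.
    n_theoretical: L, the number of theoretical ions for the sequence.
    beta: total reward for consecutive ion-series matches (assumed to be
        accumulated by the caller, e.g. 0.075 per consecutive match).
    rho: total reward for immonium ions (e.g. 0.15 when present).
    """
    if n_theoretical <= 0:
        raise ValueError("n_theoretical must be positive")
    m = len(matched_intensities)  # number of matched peaks
    return sum(matched_intensities) * m * (1.0 + beta) * (1.0 + rho) / n_theoretical
```

With three matched peaks of intensity 10, 20 and 30 and 12 theoretical ions, the unrewarded score is 60 × 3 / 12 = 15.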
It is well known that the results of an unfiltered database search include a large number of false positive identifications from random hits to the database. Post-database search filtering programs, such as DTASelect [42,43], PeptideProphet [44], and Search Engine Processor [45], are essential for the optimal separation of true peptide/protein hits from random hits. For a peptide to be successfully identified by a database search algorithm, it has to pass the following three tests: (1) it must be ranked high enough in the Sp scoring to be selected for the final scoring, (2) it must be assigned the top rank during the final scoring, and (3) its score or scores have to be high enough to pass the post-search filtering criteria [46]. The major challenge in improving the overall performance of a database search algorithm is how to increase the sensitivity of searches while maintaining adequate discrimination between correct answers and false positives.
In this paper, we present ProLuCID, an MS/MS-based database search program with enhanced peptide identification sensitivity and specificity relative to SEQUEST. ProLuCID uses a three-tiered scoring scheme to maximize the sensitivity of database searching. For its Sp scoring method, ProLuCID computes a binomial probability score for each candidate peptide with a calculated mass that matches a precursor mass within a user-specified tolerance. Then, based on the binomial probability scores, it selects a user-specified number of candidate peptides for final scoring (default = 500) that are least likely to be random hits. For each candidate peptide selected for further analysis, ProLuCID calculates a modified cross-correlation score (XCorr) and then generates another score (Z score) based on the distribution of the XCorr of all final candidate peptides for that spectrum. This three-tiered scoring scheme gives ProLuCID significantly higher sensitivity and specificity than SEQUEST. Here we show that for low mass accuracy MS/MS data, the cross-correlation-based Z score outperforms the binomial probability score in making correct spectral assignments, while the binomial probability score performs better with high mass accuracy MS/MS data.

Sample preparation
A variety of samples and instrument platforms were used to demonstrate the improvement in identification, regardless of sample complexity or instrument sensitivity. Samples of varying degrees of complexity were used in this study: a mixture of 17 known proteins, a human saliva sample, a rat brain sample, a human cell lysate, and a protein fractionated human cell lysate. The 17 protein mixture sample was used to assess the sensitivity and the specificity of ProLuCID and SEQUEST scores, while the other more complex samples were used to demonstrate the sensitivity improvement in protein identification with samples of medium to high complexity.

HEK 293 cells
Standard HEK293 cell lysate was prepared from HEK cells grown in Dulbecco's modified Eagle's medium (D-MEM) with 10% fetal bovine serum (FBS) supplemented with penicillin and streptomycin. Cells were grown (37 °C/5% CO2) to approximately 80% confluence in tissue culture flasks. Cells were washed twice with DPBS, scraped from flasks, supplemented with protease inhibitor cocktail (Roche) and lysed by sonication. Protein concentration was determined by BCA assay. Standard samples were kept at −80 °C until use.

Protein fractionation
HEK lysate was subjected to protein-based fractionation by addition of organic solvent, yielding ten protein fractions and effectively reducing the sample complexity. Protein pellets were washed with acetone and digested with trypsin. Dried pellets were dissolved in 8 M urea/100 mM Tris, pH 8.5. Proteins were reduced with 5 mM tris(2-carboxyethyl)phosphine hydrochloride (TCEP, Sigma-Aldrich) and alkylated with 10 mM iodoacetamide (Sigma-Aldrich). Proteins were digested overnight at 37 °C in 2 M urea/100 mM Tris, pH 8.5, 1 mM CaCl2 with trypsin (Promega) in a ratio of 1:100 (enzyme:protein). Digestion was stopped with formic acid, 5% final concentration. Debris was removed by centrifugation.
For the saliva and the rat brain samples, about 200 micrograms of proteins were solubilized with 8 M urea/Invitrosol (Invitrogen, Carlsbad, CA), reduced with 10 mM dithiothreitol, alkylated with 10 mM iodoacetamide, diluted with 4 volumes of 100 mM Tris-HCl, and then digested with trypsin overnight. After digestion, the pH was adjusted to ~2.5 using 90% formic acid. Sixty micrograms of protein digest from each sample was analyzed by MudPIT.

Multidimensional protein identification technology
Digested proteins were pressure-loaded onto a fused silica capillary column packed with 3 cm of 5-μm Partisphere strong cation exchanger (SCX, Whatman, Clifton, NJ) and 3 cm of 5-μm Aqua C18 material (RP, Phenomenex, Ventura, CA) with a 2 μm filtered union (UpChurch Scientific, Oak Harbor, WA) attached to the SCX end. The column was washed with buffer containing 95% water, 5% acetonitrile, and 0.1% formic acid. After desalting, a 100-μm i.d. capillary with a 5-μm pulled tip packed with 10 cm of 3-μm Aqua C18 material was attached to the filter union, and the entire split-column was placed inline with an Agilent 1100 quaternary HPLC (Agilent, Palo Alto, CA) and analyzed using a modified 12-step separation procedure described previously [2]. Three buffer solutions were used: 5% acetonitrile/0.1% formic acid (buffer A); 80% acetonitrile/0.1% formic acid (buffer B); and 500 mM ammonium acetate/5% acetonitrile/0.1% formic acid (buffer C). The first step consisted of a 100 min gradient from 0 to 100% buffer B. Steps 2-11 had the following profile: 3 min of 100% buffer A, 5 min of X% buffer C, and a 100 min gradient from 15 to 45% buffer B. The 5 min buffer C percentages (X) were 5, 10, 15, 20, 25, 30, 35, 40, 55, and 75%, respectively, for steps 2-11. In the final step, the gradient contained 3 min of 100% buffer A, 20 min of 100% buffer C, a 10 min gradient from 0 to 15% buffer B, and a 107 min gradient from 15 to 100% buffer B. As peptides were eluted from the microcapillary column, they were electrosprayed directly into an LTQ or LTQ-Orbitrap mass spectrometer (Thermo-Fisher, Palo Alto, CA) with the application of a distal 2.4-kV spray voltage. A cycle of one full scan mass spectrum (400-1400 m/z) followed by 8 data dependent MS/MS spectra at a 35% normalized collision energy was repeated continuously throughout each step of the multidimensional separation.

Database search
The data for the 17 protein mix, the human saliva sample and the rat brain sample were searched against a database with sequences of the 17 proteins added to a Schizosaccharomyces pombe protein FASTA database (http://www.sanger.ac.uk/Projects/S_pombe/protein_download.shtml, release date of March 3, 2005), the IPI human protein FASTA database (version 3.06, release date of May 10, 2005), and the IPI rat protein FASTA database (version 3.08, release date of July 12, 2005), respectively. Each protein database was concatenated with reversed sequences of all the proteins to estimate the false positive rate. ProLuCID database searches were performed with a precursor ion mass tolerance of 3 amu for low accuracy data or between 5 and 50 ppm for FTMS data, while fragment ion mass tolerances were 0.4 amu for low-resolution data and 30 ppm for the high resolution probability score calculated for FT-MS/MS data. All searches considered a static modification of 57.0215 on cysteine due to carboxyamidomethylation. The database search was not restricted by enzymatic specificity. Each dataset was searched twice, once with SEQUEST and once with ProLuCID, and the search results were directly compared. Similar database searches (precursor ion mass tolerance of 3 amu and no enzyme restriction) were done with SEQUEST, MASCOT, XTANDEM, and OMSSA on the 17 protein mix dataset for sensitivity and specificity comparison. The raw and processed datasets are available at http://fields.scripps.edu/published/ProLuCID/.
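The two search settings described above, a target database concatenated with reversed sequences and a mass tolerance given either in amu or in ppm, can be sketched in a few lines of Python. This is only an illustration: the function names, the dictionary representation of the database and the "Reverse_" prefix are assumptions for this example and do not describe ProLuCID's actual file handling.

```python
def add_reversed_decoys(proteins, decoy_prefix="Reverse_"):
    """Return a target+decoy database from a {name: sequence} mapping.

    Each decoy is the reversed target sequence, stored under a prefixed
    name (the prefix is a hypothetical naming choice).
    """
    combined = dict(proteins)
    for name, sequence in proteins.items():
        combined[decoy_prefix + name] = sequence[::-1]
    return combined


def within_tolerance(observed, calculated, tolerance, unit="ppm"):
    """Check whether two masses agree within a tolerance in ppm or amu."""
    delta = abs(observed - calculated)
    if unit == "ppm":
        return delta <= calculated * tolerance * 1e-6
    if unit == "amu":
        return delta <= tolerance
    raise ValueError("unit must be 'ppm' or 'amu'")
```

For a peptide of mass 1000, a 5 ppm window corresponds to ±0.005 mass units, which is several hundred times narrower than the 3 amu window used for low accuracy data.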

Theory
ProLuCID utilizes a three-tiered scoring scheme. It first selects candidate peptides (500 by default) for final scoring based on a binomial probability score. This binomial probability score is computed for each peptide in the protein database that has a calculated mass within the precursor mass ± user-defined mass tolerance. It then computes an XCorr and a Z score for each candidate peptide that is selected for final scoring. Previous studies have shown that the distribution of matching fragment ions between a set of candidate peptides and an experimental spectrum can be approximated by a Poisson distribution [20,29,31]. As shown in Fig. 1, the number of fragment ions that match an experimental spectrum also fits a binomial distribution very well. The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. Such a success/failure experiment is also called a Bernoulli experiment or Bernoulli trial. We consider the testing of each theoretical peak as a Bernoulli trial and compute the probability of a peptide with at least m random matches with formula (1):

$$P(x \ge m) = \sum_{k=m}^{n} \binom{n}{k} \, p^{k} (1-p)^{n-k} \qquad (1)$$

where n is the number of theoretical peaks of the candidate peptide tested, which is determined by the peptide length together with the minimum and maximum m/z in the spectrum; m is the number of theoretical peaks that match a peak in the experimental spectrum and is guaranteed not to be greater than n; p is the probability that any fragment ion matches a peak in the spectrum, as determined by the mass tolerance for a fragment ion match and the density and distribution of peaks in the experimental spectrum. The binomial probability score P(x ≥ m) is the probability of getting m or more matches when n theoretical peaks are tested. By design, the binomial probability score computed by ProLuCID is database independent and depends solely on characteristics of the spectrum and the peptide sequence.
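The binomial probability score described above is the upper tail of a binomial distribution, that is, the probability of m or more matches among n tested theoretical peaks when each matches at random with probability p. A minimal Python sketch of the exact calculation follows; the function name is chosen for this example, and ProLuCID itself is described as using an approximation rather than this direct sum.

```python
from math import comb


def binomial_tail(n, m, p):
    """Exact probability of at least m successes in n Bernoulli trials.

    n: number of theoretical peaks tested.
    m: number of theoretical peaks that matched an experimental peak.
    p: probability that a single theoretical peak matches at random.
    """
    if not 0 <= m <= n:
        raise ValueError("m must satisfy 0 <= m <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(m, n + 1))
```

A smaller tail probability means that the observed number of matches is less likely to have arisen by chance, so candidates with the smallest values are the ones least likely to be random hits.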
The second ProLuCID score is referred to as XCorr and is very similar to the SEQUEST XCorr. It is a cross-correlation of the experimental and theoretical spectra.
The correlation is processed and averaged to remove the periodic noise in the interval (−75 to 75). Unlike the SEQUEST cross-correlation procedure, which assigns an intensity of 50 to the monoisotopic peak of each major peak series and an intensity of 25 to a window of 1 amu around the major peak, ProLuCID uses averagine [47] to model the isotopic distribution of each major ion peak based on its mass. Based on the averagine table, any isotopic peaks within the isotopic envelope that have at least 20% of the intensity of the base peak (i.e., the most intense peak) are assigned an intensity that is proportional to their theoretical intensity. In order to keep the ProLuCID XCorr comparable to the SEQUEST XCorr, we set the intensity of the base peak in the isotopic envelope of each major fragment ion to 50 and the intensity of each minor peak (i.e., a ion, z ion, b loss of H2O, b loss of NH3 for CID spectra) to 10, as is done in SEQUEST.
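The rule for turning an isotopic envelope into theoretical peak intensities can be sketched as follows. This minimal Python example assumes the relative isotopic abundances for a fragment have already been read from an averagine table (the input list here is hypothetical), keeps only isotopic peaks at or above 20% of the base peak, and scales them so the base peak has an intensity of 50. The function name and parameters are illustrative only.

```python
def envelope_intensities(relative_abundances, base_intensity=50.0, cutoff=0.20):
    """Scale an isotopic envelope for use in a theoretical spectrum.

    relative_abundances: abundances of the M+0, M+1, M+2, ... isotopic
        peaks of one fragment ion (hypothetical averagine lookup result).
    Returns a list of (isotope_index, intensity) for the retained peaks.
    """
    base = max(relative_abundances)
    if base <= 0:
        raise ValueError("envelope must contain a positive abundance")
    peaks = []
    for index, abundance in enumerate(relative_abundances):
        fraction = abundance / base  # intensity relative to the base peak
        if fraction >= cutoff:
            peaks.append((index, base_intensity * fraction))
    return peaks
```

For an envelope of 1.0, 0.5 and 0.1, the first two isotopic peaks are kept with intensities 50 and 25, and the third falls below the 20% cutoff.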
In addition to the preliminary score and XCorr, ProLuCID computes a third score (Z score) for each final candidate peptide. For each spectrum, there should only be one correct answer and all the other candidate peptides are considered random hits. We have found that the distribution of XCorr values for the top 500 peptide hits to each spectrum is very close to a normal distribution, with the true hit being an obvious outlier that is statistically significantly different from the other final candidates. There are many ways to detect outliers from normal distributions, and the Z score of Grubbs' test [48] is the method implemented in ProLuCID. The Z score is calculated as the difference between the outlier and the mean divided by the standard deviation SD (Eq. (2)). A large Z score means that the XCorr of the top hit is significantly different from the other hits and the peptide is more likely to be a true hit.
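$$Z = \frac{X - \mu}{SD} \qquad (2)$$

where X is the XCorr of the top hit, μ is the mean XCorr of all the final candidate peptides, SD is the standard deviation of their XCorr values, and n is the number of final candidate peptides over which μ and SD are computed.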
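A minimal Python sketch of this Z score is given below. It assumes that the mean and standard deviation are taken over all final candidates including the top hit, and it uses the sample standard deviation (n − 1 denominator) as in Grubbs' test; whether ProLuCID uses the sample or the population form is not stated here, so that choice is an assumption.

```python
from statistics import mean, stdev


def top_hit_z_score(xcorrs):
    """Z score of the best XCorr relative to all final candidates.

    xcorrs: XCorr values of the final candidate peptides for one spectrum.
    """
    if len(xcorrs) < 2:
        raise ValueError("need at least two candidates")
    top = max(xcorrs)
    mu = mean(xcorrs)
    sd = stdev(xcorrs)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("all XCorr values are identical")
    return (top - mu) / sd
```

When one candidate stands far above a tight cluster of random hits, the numerator grows while the standard deviation stays small, which is what makes the top hit appear as an outlier.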

Overview of ProLuCID
We have developed ProLuCID, a new database search algorithm for peptide identification that is highly flexible, efficient, and sensitive. ProLuCID is implemented in Java 1.6 and can be run on either a single CPU or a multi-node computing cluster with Java version 1.6 or later. With Java multithreading technology, ProLuCID users can specify the number of compute cores to be used to take advantage of the multi-core architectures that come with most modern computers. It can be used with protein FASTA databases or pre-processed databases for faster search speed. ProLuCID is also able to perform efficient and flexible differential modification searches and is capable of taking advantage of the high mass accuracy data generated by the latest instrumentation. These features of ProLuCID are described in greater detail below.

Using binomial probability as preliminary score (Sp) to improve sensitivity
The goal of a tandem mass spectral database search is to identify the best peptide sequence match for a spectrum. The ProLuCID algorithm uses a three-tiered scoring scheme to assess the quality of a match between a spectrum and a peptide amino acid sequence from a protein database. First, ProLuCID uses a binomial probability score (Sp) to distinguish random matches and select peptide candidates for final cross-correlation scoring. Although the cross-correlation score provides higher sensitivity than the binomial probability score, it is computationally expensive (i.e., slow) and thus not practical for use in the initial scoring scheme. Instead, we select a user-defined number (500 by default) of candidate peptides for final scoring based on binomial probability scores. It is worth noting that the computation of the exact binomial probability score is also a very slow process due to the computation of the factorials and exponentiations (see Eq. (1)); thus ProLuCID uses an approximation method to compute the scores. A lookup table is calculated when the program starts, and the approximate probability score can be retrieved based on the number of matched peaks, the number of peaks tried, and the fraction of the region from the minimum m/z to the maximum m/z in the tandem mass spectrum that is considered positive (the p in Eq. (1)). In order to use the lookup table, the value of p is rounded to two significant digits so that any p can be mapped to an integer between 1 and 100. This approximation also makes it feasible for ProLuCID to use the binomial probability score as a preliminary scoring method for all candidate peptides within a given mass tolerance rather than just a final list of 100-200 peptide candidates, as is done in other database search programs. The advantage of using a more sophisticated scoring function as the preliminary scoring routine can be seen in Fig. 2.
Based on the 17 protein mix dataset, the SEQUEST Sp score gives 5338 correct spectrum assignments while the ProLuCID binomial probability score gives 6353 correct spectrum assignments. Based on this result, we can conclude that the approximate binomial probability score displays better sensitivity than the SEQUEST Sp score.
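The lookup-table approximation described in this section can be illustrated with a small Python sketch. It precomputes binomial tail probabilities for p rounded to integer percentages from 1 to 100 and for every (n, m) pair up to a maximum number of peaks, so that scoring a candidate becomes a table access. The table layout, the function names and the rounding via `round(p * 100)` are assumptions for illustration and are not taken from the ProLuCID source.

```python
from math import comb


def build_tail_table(max_peaks):
    """Precompute table[p_idx][n][m] = P(x >= m) for p = p_idx / 100."""
    table = {}
    for p_idx in range(1, 101):
        p = p_idx / 100.0
        per_n = []
        for n in range(max_peaks + 1):
            pmf = [comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]
            tails = [0.0] * (n + 2)
            # Accumulate from the top so tails[m] is the sum of pmf[m..n].
            for k in range(n, -1, -1):
                tails[k] = tails[k + 1] + pmf[k]
            per_n.append(tails[: n + 1])
        table[p_idx] = per_n
    return table


def lookup_tail(table, n, m, p):
    """Retrieve the approximate score after rounding p to a whole percent."""
    p_idx = min(100, max(1, round(p * 100)))
    return table[p_idx][n][m]
```

The table is built once at start-up, and each candidate peptide then costs one rounding step and one lookup instead of a sum of factorial and power terms.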

Matching the isotopic distribution of fragment ions
The second score generated by ProLuCID is a measure of cross-correlation between the experimental and theoretical spectra for a peptide and is referred to as XCorr. In SEQUEST, a theoretical spectrum is generated from predicted fragment ions for each peptide sequence (b- and y-ions for CID and c- and z-ions for ETD). In the theoretical spectrum, the main ion series products are assigned an abundance of 50, a window of 1 amu around the main fragment is assigned an intensity of 25, and water and ammonia losses are assigned an intensity of 10. The theoretical and normalized experimental spectra are then cross-correlated to obtain similarities between the spectra. In contrast to SEQUEST, ProLuCID models the isotope distribution of each fragment ion in order to generate a more realistic theoretical spectrum for cross-correlation. Theoretical isotopic abundance distributions for proteins and peptides were created using a look-up table of 150 averagine theoretical isotopic distributions with monoisotopic mass values for multiples of 500 Da up to 75,000 Da, with all abundance distributions in the look-up table created by Mercury [47]. ProLuCID uses the averagine table to closely model the isotopic distribution of the fragment ions. This modification makes the distribution of ProLuCID XCorr of decoy hits closer to a normal distribution, and the score itself becomes more discriminative (Fig. 3). Importantly, the benefits of modeling the isotope distribution are realized even for low-resolution LTQ data in which the charge states cannot be determined.
Since ProLuCID can be configured to output both the binomial probability score and XCorr for each candidate peptide, we can determine which score is more sensitive in identifying target peptides by comparing the number of true peptides (from the 17 protein mix) that are ranked as the top hit by each scoring scheme. Fig. 2 shows that XCorr performs better than preliminary scoring using either the ProLuCID binomial probability score or the SEQUEST Sp score. The ProLuCID XCorr identifies more spectra correctly than ProLuCID's binomial probability score (7299 vs 6353), while the ProLuCID probability score gets more correct spectral assignments than the SEQUEST Sp score (6353 vs 5201). Based on these results, we can conclude that ProLuCID's binomial probability is a better score than SEQUEST's Sp score, and ProLuCID's XCorr is a better score than ProLuCID's binomial probability score. The combination of binomial probability preliminary scoring and the modeling of the isotopic distribution of fragment ions makes ProLuCID more sensitive than SEQUEST in terms of correct spectrum assignments (7299 vs 6974), regardless of the specificity of the scores.

Statistical Z score improves the specificity
In addition to the binomial probability and cross-correlation scores, ProLuCID outputs a Z score for each peptide hit. The Z score is a dimensionless score derived by subtracting the population mean from an individual raw score and then dividing the difference by the population standard deviation. It reveals how many units of the standard deviation a case is above or below the mean. Unlike XCorr, which is independent of database size and reflects the quality of the match between the experimental spectrum and the peptide sequence, the ProLuCID Z score is database-dependent and reflects the quality of the match relative to near misses. A higher Z score indicates that the peptide hit is more likely to be a correct match to the spectrum.
Traditionally, filtering of database search results by DTASelect used threshold cutoffs for XCorr and DeltaCN, where DeltaCN is the difference between the top hit XCorr and the second best hit XCorr divided by the XCorr of the top hit. In the latest version of DTASelect (DTASelect2) [43], these two measurements are combined using a discriminant function that dynamically sets the XCorr and DeltaCN values in order to achieve a user-specified false discovery rate. In either case, high confidence spectrum assignments generally have both high XCorr and high DeltaCN scores. Since the XCorr shows positive correlation across charge states (i.e., XCorr values increase for higher charge state spectra) [49], different cutoffs are usually applied to assignments with different charge states. DeltaCN measures the difference between the best hit and the second best hit and has proven to be a very good measure for separating true hits from false. However, in some cases the sequence corresponding to the second highest XCorr might have very high sequence similarity to the top hit, making the DeltaCN value very small. Thus, even though the identification itself may be reliable, it would be discarded by DTASelect due to its similarity to the second best hit. Tang et al. [34] used the distance score, which is defined as the difference between the highest score and the seventh highest score for each MS/MS spectrum. The distance score provides a measure of the separation between the highest scoring peptide and the pack of wrong peptides. The larger the distance score, the larger the probability that the highest scoring peptide is indeed a legitimate answer. The distributions of distance scores for correct peptides and incorrect peptides were found to be approximately Poisson. ProLuCID's Z score provides a statistical measurement of how significant the difference is between the best match and the rest of the matches to the same spectrum. This measure provides an effective way of distinguishing the true hits from the random hits using a strong statistical foundation.
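The two separation measures discussed above, DeltaCN and the distance score, differ only in which lower-ranked hit is used as the reference and in whether the difference is normalized. The following Python sketch follows the definitions given in the text; the function names are chosen for this example.

```python
def delta_cn(xcorrs):
    """(top XCorr - second best XCorr) / top XCorr for one spectrum."""
    ranked = sorted(xcorrs, reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        raise ValueError("need two candidates and a positive top XCorr")
    return (ranked[0] - ranked[1]) / ranked[0]


def distance_score(scores, rank=7):
    """Difference between the highest and the rank-th highest score."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < rank:
        raise ValueError("not enough candidates for the requested rank")
    return ranked[0] - ranked[rank - 1]
```

If the second best hit is a close homolog of the top hit, `delta_cn` collapses toward zero even when the identification is sound, whereas a measure that looks further down the list, or at the whole distribution as the Z score does, is less affected by a single near-identical sequence.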
A common method for visualizing and comparing discrimination ability is the receiver operating characteristic (ROC) plot [50], in which one can read the false positive level that must be tolerated in order to obtain any given true positive level. In our case, we consider any identification that matches a peptide sequence from any of the 17 proteins as a true positive, and any identification that matches a peptide sequence from a reversed protein as a false positive. The ROC curves in Fig. 4A-C clearly illustrate the improvement in sensitivity and specificity of the ProLuCID XCorr and Z score compared with the SEQUEST XCorr. Fig. 4A is a typical ROC curve with the area under the curve being 0.89, 0.91 and 0.96 for the SEQUEST XCorr, ProLuCID XCorr and ProLuCID Z score, respectively. Based on these results, we can conclude that ProLuCID XCorr is a more discriminative score than SEQUEST XCorr and that the ProLuCID Z score shows significantly improved specificity over both SEQUEST and ProLuCID XCorr. Fig. 4C is a modified ROC curve that plots the number of true hits against the false positive fraction. These figures clearly show that ProLuCID XCorr and Z score have better sensitivity and specificity than SEQUEST XCorr and that the ProLuCID Z score shows better specificity than ProLuCID XCorr. We also implemented the Z score in SEQUEST, and it shows significantly better specificity than SEQUEST XCorr (Fig. 4).
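The area under a ROC curve can be computed directly from two lists of scores, one for true positives and one for false positives, because it equals the probability that a randomly chosen true hit scores higher than a randomly chosen false hit. The sketch below uses this pairwise (Mann-Whitney) formulation rather than integrating a plotted curve; it illustrates the metric and is not the procedure used to produce Fig. 4.

```python
def roc_auc(true_scores, false_scores):
    """Area under the ROC curve from scores of true and false hits.

    Counts the fraction of (true, false) pairs in which the true hit
    scores higher; ties count as half.
    """
    if not true_scores or not false_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for t in true_scores:
        for f in false_scores:
            if t > f:
                wins += 1.0
            elif t == f:
                wins += 0.5
    return wins / (len(true_scores) * len(false_scores))
```

An area of 0.5 corresponds to a score with no discriminating power, and an area of 1.0 to a score that ranks every true hit above every false hit.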
It is also worth noting that the ProLuCID Z score distributions for charge +2 and charge +3 decoy hits as shown in Fig. 5 are very similar, indicating that the ProLuCID Z score is largely charge state independent.
It is important for practical applications to know the true and false positive rates at given score thresholds. Fig. 6 plots the false positive rate against ProLuCID Z scores. In this dataset, the spectrum assignment false positive rate is 10% at Z score 4.42, 5% at Z score 4.67 and 1% at Z score 5.28. Although the distribution of ProLuCID Z scores shows relatively small variation between different MudPIT runs, it is still dataset dependent to some degree.
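Because the thresholds are dataset dependent, a cutoff is normally derived from each dataset's own target and decoy hits. The Python sketch below finds the lowest score threshold at which the estimated false positive rate does not exceed a requested value. It assumes the rate is estimated as the number of decoy hits divided by the number of target hits at or above the threshold; other estimators exist, and the one behind Fig. 6 is not specified here.

```python
def lowest_threshold_for_rate(target_scores, decoy_scores, max_rate):
    """Lowest score cutoff whose decoy/target ratio is <= max_rate.

    Returns None when no cutoff with at least one target hit qualifies.
    """
    candidates = sorted(set(target_scores) | set(decoy_scores))
    for threshold in candidates:  # ascending, so the first hit is the lowest
        n_target = sum(score >= threshold for score in target_scores)
        n_decoy = sum(score >= threshold for score in decoy_scores)
        if n_target > 0 and n_decoy / n_target <= max_rate:
            return threshold
    return None
```

Applied per run, this yields thresholds analogous to the Z score cutoffs quoted above for 10%, 5% and 1% false positive rates.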

Performance test with biological samples of medium and high complexity
In order to test the performance of ProLuCID on data from more complex samples, we performed 12-step MudPIT experiments with a human salivary sample and a rat brain whole cell lysate sample. Human saliva is a biological fluid with a medium level of complexity. In a large scale saliva protein cataloging project that combined results from over 200 MudPIT experiments, we previously identified about 1500 proteins with high confidence (≤1% false positive rate). With a single 12-step MudPIT experiment using an LTQ-Orbitrap, we identified 372 proteins with ProLuCID and 300 proteins with SEQUEST using the same DTASelect filtering criteria (at least two peptides per protein, at least one tryptic terminus per peptide, and a 5% spectrum level false positive rate). From the results in Table 1, we find that ProLuCID identifies more proteins than SEQUEST at a similar false positive rate. On the more complex sample of rat brain whole cell lysate, we identified about 3345 proteins with ProLuCID compared with 2991 with SEQUEST (Table 2), with false positive rates of 1.23% and 1.44%, respectively. Thus, the improvements in the scoring methods used in ProLuCID versus SEQUEST lead to higher confidence in protein identifications. In Table 3, we show that ProLuCID results have higher sequence coverage, peptide counts and spectrum counts than SEQUEST results.

Comparison with Comet and SEQUEST on HeLa sample
We compared ProLuCID with Comet and SEQUEST by searching triplicate data from a HeLa sample. The tandem mass spectra were searched against the UniProt human database (downloaded on November 08, 2010). To estimate peptide probabilities and FDRs accurately, we used a target/decoy database containing the reversed sequences. The search space included fully tryptic peptide candidates that fell within the mass tolerance window, with a maximum of three internal missed cleavages. Carbamidomethylation (+57.02146 Da) of cysteine was considered as a static modification. The validity of peptide/spectrum matches (PSMs) was assessed in DTASelect, using a spectrum level FDR of less than 1% and a precursor delta mass threshold of 10 ppm. Table 4 shows that ProLuCID identified more proteins, peptides and spectra than both SEQUEST and Comet, based on average counts from the triplicates. ProLuCID identified 12% more spectra than SEQUEST, and 5% more than Comet.
We also compared ProLuCID and Comet using a half-tryptic search (considering candidate peptides with at least one tryptic end) (Table 5). Compared with the fully tryptic search, the margin by which ProLuCID outperformed Comet was even larger: ProLuCID identified 20% more spectra and 19% more peptides than Comet.

High Resolution MS and MS/MS database searches
The LTQ-Orbitrap hybrid mass spectrometer combines high resolution and mass accuracy with fast scan rates and the flexibility of two different mass analyzers, which gives the user the opportunity to operate the instrument in different modes. One mode uses the Orbitrap mass analyzer to collect all spectra for an experiment, including both high resolution full MS scans of precursor ions and high resolution tandem mass spectra after peptide fragmentation. The major advantages of this approach are the high mass accuracy of the precursor ion, which restricts the number of candidate peptides that need to be considered by the database search algorithm, and the high mass accuracy of the fragment ions, which could lead to more confident peptide and protein identifications, as well as PTM localization. The disadvantage of this strategy, however, is the lower scan rate of the Orbitrap compared with the LTQ, which would result in the collection of fewer tandem mass spectra and likely fewer peptide and protein identifications. Alternatively, the LTQ-Orbitrap can be used so that the full MS scans are collected by the Orbitrap while the LTQ is used to obtain low resolution MS/MS spectra. In this approach, high mass accuracy is obtained for precursor ions while low mass accuracy is obtained for fragment ions. The advantage of this mode is that the high precursor mass accuracy can be used to reduce the false positive rate and/or speed up the database search while a large number of tandem mass spectra are collected by the LTQ. Importantly, ProLuCID is capable of handling all of these possibilities and can search spectra with either high or low mass accuracy for both precursor and fragment ions, including deisotoped and decharged high-resolution MS/MS spectra [51].
ProLuCID allows the user to specify the precursor and fragment ion mass tolerance from 1 ppm to 1000 ppm. When high precursor mass accuracy is available, ProLuCID can be configured to use a very narrow precursor mass tolerance to reduce the number of candidate peptides and thus speed up the search. In this case, however, the mass spectrometer may select and record non-monoisotopic peaks (i.e., peptide ions containing one or more 13C atoms) for MS/MS fragmentation, which can prevent these spectra from being identified when searches are restricted to small m/z windows. To address this problem, ProLuCID selects candidate peptides by assuming that the precursor can be the M+0 (monoisotopic), M+1 (one 13C), M+2 (two 13C), etc., isotopic peak. The number of isotopic peaks considered by ProLuCID can be specified by the user in the ProLuCID search parameter file. This approach significantly reduces the number of candidate peptides and speeds up the database search without missing spectra obtained from the fragmentation of non-monoisotopic peaks.
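The isotopic-peak handling described above can be sketched as a set of narrow mass windows, one per assumed isotopic peak. The function names and the example masses are hypothetical, and the code is only an illustration of the idea, not ProLuCID's implementation; the 13C/12C mass difference of about 1.0033548 Da is the only physical constant used.

```python
from typing import List, Tuple

C13_C12_DELTA = 1.0033548  # mass difference between 13C and 12C, in Da


def precursor_windows(precursor_mass: float, ppm_tol: float,
                      n_isotopes: int) -> List[Tuple[float, float]]:
    """Return one (low, high) monoisotopic mass window per assumed peak.

    Window k assumes the observed precursor is the M+k isotopic peak, so
    the candidate monoisotopic mass is shifted down by k 13C-12C deltas.
    """
    windows = []
    for k in range(n_isotopes):
        mono = precursor_mass - k * C13_C12_DELTA
        half_width = mono * ppm_tol * 1e-6
        windows.append((mono - half_width, mono + half_width))
    return windows


def is_candidate(peptide_mass: float, precursor_mass: float,
                 ppm_tol: float, n_isotopes: int) -> bool:
    """True if the peptide's monoisotopic mass falls in any window."""
    return any(lo <= peptide_mass <= hi
               for lo, hi in precursor_windows(precursor_mass, ppm_tol,
                                               n_isotopes))


# Hypothetical case: the instrument picked the M+1 peak of a 2000.0 Da peptide.
observed = 2000.0 + C13_C12_DELTA
missed = is_candidate(2000.0, observed, ppm_tol=5.0, n_isotopes=1)
found = is_candidate(2000.0, observed, ppm_tol=5.0, n_isotopes=2)
```

With a single window the peptide is missed because the 1 Da offset far exceeds 5 ppm, whereas allowing two isotopic peaks recovers it while keeping each window narrow.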
Additionally, ProLuCID can use a preprocessed database in which the peptides are sorted by mass, which can improve computational efficiency by more than 1000-2000% over SEQUEST when a stringent precursor mass tolerance (e.g., 5 ppm) is used. The speed improvement can be even more dramatic for differential modification searches and depends largely on the database size, precursor mass tolerance, enzyme restriction, etc.
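To see why a mass-sorted peptide list helps with narrow tolerances, consider the following minimal sketch, which retrieves all peptides in a mass window with two binary searches. The `MassIndex` class and the toy entries are hypothetical; the layout of ProLuCID's preprocessed database is not described here and may be quite different.

```python
import bisect
from typing import Iterable, List, Tuple


class MassIndex:
    """Hypothetical in-memory index of peptides sorted by mass."""

    def __init__(self, peptides: Iterable[Tuple[float, str]]):
        # Sort once up front; each later query is a pair of binary searches.
        self._entries = sorted(peptides)
        self._masses = [mass for mass, _ in self._entries]

    def query(self, low: float, high: float) -> List[str]:
        """Return sequences whose mass lies in the closed range [low, high]."""
        start = bisect.bisect_left(self._masses, low)
        end = bisect.bisect_right(self._masses, high)
        return [seq for _, seq in self._entries[start:end]]


# Toy entries (masses and labels are made up for illustration).
index = MassIndex([(1000.500, "PEP_A"), (999.900, "PEP_B"),
                   (1000.502, "PEP_C"), (1200.000, "PEP_D")])

# A 5 ppm window around 1000.5 Da is about +/- 0.005 Da wide.
hits = index.query(1000.495, 1000.505)
```

Each lookup costs time logarithmic in the number of indexed peptides plus the number of hits, so a tighter precursor tolerance translates directly into fewer candidates to score.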
Another advantage of high resolution full MS spectra is the ability to correctly assign charge states to the precursor ions. For low resolution data, the charge state of the precursor ions cannot easily be determined for spectra with charge states higher than +1. When the charge state of a multiply charged precursor ion cannot be determined, the spectrum is typically searched against the database twice, once assuming a +2 charge state and then again assuming a +3 charge state. In this approach, spectra with charge states higher than +3 are always incorrectly assigned. With high resolution Orbitrap data, charge states can be assigned to over 90% of MS/MS spectra using the in-house algorithm RawXtract. This eliminates the need to guess the charge state of precursor ions and enables peptides with charge states of +4 or higher to be identified (Fig. 7). ProLuCID models +1 fragment ions for +1 and +2 spectra, +1 and +2 fragment ions for +3 spectra, and fragment ions of charge state from +1 to the floor of (z + 2)/2 for spectra with precursor charge state +4 or higher, where z is the precursor charge state.
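The fragment charge rule stated above can be written as a small function. The function name is hypothetical and the code simply restates the rule from the text.

```python
from typing import List


def fragment_charges(z: int) -> List[int]:
    """Fragment ion charge states modeled for precursor charge state z.

    +1 and +2 precursors: +1 fragments only.
    +3 precursors: +1 and +2 fragments.
    +4 and higher: +1 up to floor((z + 2) / 2).
    """
    if z < 1:
        raise ValueError("precursor charge state must be at least 1")
    if z <= 2:
        return [1]
    if z == 3:
        return [1, 2]
    return list(range(1, (z + 2) // 2 + 1))


charges_by_precursor = {z: fragment_charges(z) for z in range(1, 7)}
```

For example, a +4 or +5 precursor yields fragment charges +1 to +3, and a +6 precursor yields +1 to +4.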
For high resolution MS/MS data, ProLuCID allows users to specify the fragment ion mass tolerance in parts-per-million (ppm), e.g., 20 ppm. We collected high resolution tandem mass spectra in a 4-step MudPIT experiment with the 17 protein mix sample. The same set of tandem mass spectra was searched as high resolution data using a 30 ppm fragment mass tolerance and as low resolution data using a 0.4 amu fragment mass tolerance. Fig. 4D shows that the ProLuCID binomial probability score for a high fragment mass accuracy search has better sensitivity and specificity than both the binomial probability score and the Z score for a low fragment mass accuracy search. It is worth noting that the Z score is computed from an XCorr with low fragment ion mass accuracy.
ProLuCID takes an approach similar to that of SEQUEST for differential (variable) modification searches. Users need to specify the type of modification and the maximum number of modifications to be considered. However, unlike SEQUEST, in which the maximum number of modification types is set to 3 and each modification or mass shift can occur on at most 3 amino acid residues, ProLuCID allows users to specify as many differential modification types as desired, and each modification type or mass shift can be applied to as many residues as needed. This gives users the opportunity to search for unexpected modifications at a relatively low computational cost. Of course, for any given protein database, search times will increase as more modifications are considered.
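The practical difference between the 30 ppm and 0.4 amu fragment tolerances used in the comparison above can be made concrete with a short sketch. The function name, the `unit` argument and the example m/z values are assumptions for illustration and do not correspond to ProLuCID parameter names.

```python
def within_tolerance(observed_mz: float, theoretical_mz: float,
                     tol: float, unit: str = "ppm") -> bool:
    """Check whether an observed fragment peak matches a theoretical one.

    unit="ppm": the allowed deviation scales with the theoretical m/z.
    unit="amu": the allowed deviation is a fixed absolute value.
    """
    if unit == "ppm":
        allowed = theoretical_mz * tol * 1e-6
    elif unit == "amu":
        allowed = tol
    else:
        raise ValueError("unit must be 'ppm' or 'amu'")
    return abs(observed_mz - theoretical_mz) <= allowed


# At m/z 1000, 30 ppm corresponds to 0.03, more than ten times tighter
# than a fixed 0.4 amu window.
tight_match = within_tolerance(1000.02, 1000.0, 30, unit="ppm")
tight_miss = within_tolerance(1000.05, 1000.0, 30, unit="ppm")
loose_match = within_tolerance(1000.05, 1000.0, 0.4, unit="amu")
```

A peak 0.05 away from the theoretical fragment is accepted under the 0.4 amu tolerance but rejected at 30 ppm, which is why fewer random fragment matches are expected in a high mass accuracy search.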

Conclusions
ProLuCID achieves enhanced sensitivity and specificity by using a binomial probability score as a preliminary score, an improved XCorr, and a novel Z score. The ProLuCID Z score shows significantly higher sensitivity and specificity than the SEQUEST XCorr. For high resolution (Orbitrap) MS/MS data, the ProLuCID probability score outperforms the Z score, while the Z score performs better than the ProLuCID probability score for low mass accuracy (LTQ) MS/MS data. We show that for typical shotgun proteomics experiments, using DTASelect with the same false positive rate filter, ProLuCID usually identifies about 10%-25% more proteins than SEQUEST does. The overall confidence of the identified proteins is improved due to significant increases in peptide count, spectrum count and sequence coverage.

Fig. 1. Distribution (blue line) of the number of fragment ions matched to a tandem mass spectrum for all candidate peptides in a protein database. The protein FASTA database contains the amino acid sequences of the 17 proteins, all Pombe proteins and the reverse copy of each protein (10,006 entries in total). The fitted curve (pink line) is a binomial distribution B(22, 0.1391).

Fig. 4. ROC curves of ProLuCID and SEQUEST scores. A. Typical ROC curves of SEQUEST XCorr, ProLuCID XCorr and ProLuCID Z score. B. Modified ROC curves, showing true positive fraction as a function of false positive rate. C. Plots of number of true hits against false positive fraction of SEQUEST XCorr, ProLuCID XCorr and ProLuCID Z score. D. Plots of number of true hits against false positive fraction of ProLuCID high mass accuracy probability score, low mass accuracy probability score and Z score.

Fig. 3. Histogram of SEQUEST and ProLuCID XCorr scores, separated into true hits and reverse hits, showing that the XCorr scores generated by ProLuCID are more discriminative than those generated by SEQUEST, because ProLuCID closely models fragment ion isotopic distributions.

Fig. 6. Plot of ProLuCID Z score as a function of false positive rate on the 17 protein mixture dataset.

Fig. 5. Histograms of ProLuCID Z scores of the true hits and decoy hits, showing good separation between the true hits and decoy hits, and that the distributions of the Z scores of the decoy hits of charge +2 and charge +3 spectra are very similar.
Fig. 2. Number of correct spectrum assignments by ProLuCID and SEQUEST XCorr and Sp scores. BC: both XCorr rank and Sp rank are correct; XC: XCorr rank is correct and Sp rank is incorrect; SPC: Sp rank is correct and XCorr rank is incorrect; FP: top hits on the reverse sequences of the 17 proteins. These results are based on a 6-step MudPIT with 75,866 spectra. The ProLuCID XCorr outperforms the SEQUEST XCorr in terms of the number of correct spectrum assignments (7299 vs 6974); the ProLuCID Sp scores (binomial probability scores) work better than the SEQUEST Sp scores (6353 vs 5338); and the ProLuCID XCorr gives more true hits at the top rank than the ProLuCID Sp (7299 vs 6353).

Table 1
Number of proteins identified in the saliva sample with SEQUEST and ProLuCID after DTASelect filtering.

Table 5
Search result comparison of ProLuCID and Comet with the half-tryptic parameter. Triplicates of the HeLa sample were searched against the same protein database. Search results were filtered with the same DTASelect parameters. The results were averaged from triplicates.

Table 4
Search result comparison of ProLuCID, SEQUEST, and Comet. Triplicates of the HeLa sample were searched against the same protein database. Search results were filtered with the same DTASelect parameters. The results were averaged from triplicates.

Table 3
Average peptide count, spectrum count and sequence coverage of the 1000 proteins with the highest sequence coverage identified in the rat brain sample with SEQUEST and ProLuCID after DTASelect2 filtering.

Table 2
Number of proteins identified in the rat brain sample with SEQUEST and ProLuCID after DTASelect filtering.
Columns: Search program and filter options; Forward hits; Decoy hits; False positive rate.
A MudPIT dataset with 139,277 spectra was searched with SEQUEST and ProLuCID, respectively, and DTASelect2 was used to produce the final protein lists. SEQUEST_XD: SEQUEST search with XCorr and DeltaCN for DTASelect filtering; ProLuCID_XD: ProLuCID search with XCorr and DeltaCN for DTASelect2 filtering; ProLuCID_Z: ProLuCID search with Z score only for DTASelect filtering (with the additional -noxc -nodcn -sp options). DTASelect2 options -p 2 -y 1 -fp 0.05 were used for all three, with the additional options (-noxc -nodcn -sp) for ProLuCID_Z so that only the Z score is used for filtering. A protein was considered identified if it had at least two peptides that passed the 5% PSM (peptide-spectrum match) false positive rate filter and each peptide had at least one tryptic terminus.