Protein Identification Using Top-Down Spectra*

In the last two years, because of advances in protein separation and mass spectrometry, top-down mass spectrometry moved from analyzing single proteins to analyzing complex samples and identifying hundreds and even thousands of proteins. However, computational tools for database search of top-down spectra against protein databases are still in their infancy. We describe MS-Align+, a fast algorithm for top-down protein identification based on spectral alignment that enables searches for unexpected post-translational modifications. We also propose a method for evaluating statistical significance of top-down protein identifications and further benchmark various software tools on two top-down data sets from Saccharomyces cerevisiae and Salmonella typhimurium. We demonstrate that MS-Align+ significantly increases the number of identified spectra as compared with MASCOT and OMSSA on both data sets. Although MS-Align+ and ProSightPC have similar performance on the Salmonella typhimurium data set, MS-Align+ outperforms ProSightPC on the (more complex) Saccharomyces cerevisiae data set.

In the last two years, because of advances in protein separation and mass spectrometry, top-down mass spectrometry moved from analyzing single proteins to analyzing complex samples and identifying hundreds and even thousands of proteins. However, computational tools for database search of top-down spectra against protein databases are still in their infancy. We describe MS-Align؉, a fast algorithm for top-down protein identification based on spectral alignment that enables searches for unexpected post-translational modifications. We also propose a method for evaluating statistical significance of topdown protein identifications and further benchmark various software tools on two top-down data sets from Saccharomyces cerevisiae and Salmonella typhimurium. We demonstrate that MS-Align؉ significantly increases the number of identified spectra as compared with MASCOT and OMSSA on both data sets. Although MS In the past two decades, proteomics was dominated by bottom-up mass spectrometry that analyzes digested peptides rather than intact proteins. Bottom-up approaches, although powerful, do have limitations in analyzing protein species, e.g. various proteolytic forms of the same protein or various protein isoforms resulting from alternative splicing. Top-down mass spectrometry focuses on analyzing intact proteins and large peptides (1)(2)(3)(4)(5)(6)(7)(8)(9)(10) and has advantages in localizing multiple post-translational modifications (PTMs) 1 in a coordinated fashion (e.g. combinatorial PTM code) and identifying multiple protein species (e.g. proteolytically processed protein species) (11). Until recently, most top-down studies were limited to single purified proteins (12)(13)(14)(15). Topdown studies of protein mixtures were restricted by difficulties in separating and fragmenting intact proteins and a shortage of robust computational tools.
In the last two years, because of advances in protein separation and top-down instrumentation, top-down mass spectrometry moved from analyzing single proteins to analyzing complex samples containing hundreds and even thousands of proteins (16 -21). Because algorithms for interpreting topdown spectra are still in their infancy, many recent developments include computational innovations in protein identification.
Because top-down spectra are complex, the first step in top-down spectral interpretation is usually spectral deconvolution, which converts a complex top-down spectrum to a list of monoisotopic masses (a deconvolved spectrum). Every protein (possibly with modifications) can be scored against a top-down deconvoluted spectrum, resulting in a Protein-Spectrum-Match (PrSM). The top-down protein identification problem is finding a protein in a database with the highest scoring PrSM for a top-down spectrum and further output the PrSM if it is statistically significant. There are several software tools for top-down protein identification (Table I).
• ProSightPC-ProSightPC is the most commonly used tool for top-down protein identification (22,23). ProSightPC searches spectra against a "shotgun annotated" protein database, which is generated by considering all expected PTMs. The "shotgun annotated" protein database is much larger than the original protein database. ProSightPC can identify some (but not all) proteins with unexpected PTMs using advanced search options, such as biomarker search and ⌬m mode, but it is not designed for identifying truncated proteins with unexpected PTMs that are not represented in the "shotgun annotated" database. ProSightPC is a fast tool that reports the statistical significance of PrSMs.
• PIITA-Unlike ProSightPC, PIITA (19) is a precursor independent method that uses only fragment ions for protein identification. It is capable of identifying protein species with unexpected PTMs on N-or C-termini, but it cannot directly identify protein species with PTMs on both N-and C-termini.
PIITA is a fast tool that provides FIT scores and ⌬ scores rather than statistical significance estimates.
• USTag-Unique Sequence Tag (USTag) (17) generates long (6 amino acids or longer) peptide sequence tags to identify PrSMs. This approach, although fast, relies on long peptide sequence tags that may be difficult to obtain for some spectra. It also does not provide an estimate of the statistical significance of PrSMs.
• MS-TopDown-MS-TopDown (24) is based on spectral alignment (25). MS-TopDown allows one to match top-down spectra to proteins with unexpected PTMs, i.e. without knowing which PTMs are present in the sample. However, MS-TopDown is rather slow when searching against large proteomes and does not provide the statistical significance of PrSMs, making it difficult to select good PrSMs.
We describe MS-Alignϩ, a fast software tool for top-down protein identification. MS-Alignϩ shares the spectral alignment approach with MS-TopDown, but greatly improves on speed, statistical analysis (providing E-values of PrSMs), and the number of identified PrSMs (e.g. by finding spectral alignments between spectra and truncated proteins). We benchmarked various tools for top-down protein identification on two data sets from Saccharomyces cerevisiae (SC) and Salmonella typhimurium (ST). We demonstrate that MS-Alignϩ significantly increase the number of identified spectra as compared with MASCOT and OMSSA on both data sets. Although MS-Alignϩ and ProSightPC have similar performance on the ST data set, MS-Alignϩ outperforms ProSightPC on the more complex SC data set.

EXPERIMENTAL PROCEDURES
Spectral alignment-Some MS/MS tools searching for unexpected PTMs transform a spectrum into a prefix residue mass (PRM) spectrum (28) that represents various peptide prefixes supported by the spectrum. For example, a monoisotopic peak with mass m corresponding to a y-ion in the MS/MS spectrum is converted to a mass M -m in the PRM spectrum (M is the precursor mass). A deconvoluted spectrum with k monoisotopic masses and precursor mass M is converted to a PRM spectrum S with 2k ϩ 2 masses. For a CID spectrum, we include into S two masses m and M -m for each monoisotopic mass m and two other masses 0 and M -18 (M -18 is treated as the PRM of the entire protein). We represent spectrum S as a list of ordered masses a 0 Ͻ a 1 Ͻ … Ͻ a n , where a 0 ϭ 0 and a n ϭ M -18.
A protein P of length m is represented as theoretical peak masses b 0 Ͻ b 1 Ͻ … Ͻ b m , where b i equals the sum of the masses of the first i residues in P (we assume that b 0 ϭ 0 and b m ϭ MЈ -18 where MЈ is the molecular mass of the unmodified protein form). Although, for the sake of simplicity, we ignore intensities of the peaks, intensities can be easily incorporated in the spectral alignment framework.
Given a protein P ϭ {b 0 , b 1 , …, b m } and a spectrum S ϭ {a 0 , a 1 , …, a n }, we define the spectral grid of P and S as a rectangle in two dimension formed by four points (0, 0), (b m , 0), (0, -a n ), (b m , -a n ) with (n ϩ 1)(m ϩ 1) matching points (b j , -a i ) (for 0 Յ i Յ n and 0 Յ j Յ m) located within the rectangle. A global spectral alignment between P and S is a path from the top left corner to the bottom right corner of the spectral grid consisting of diagonal (-45°), vertical, and horizontal segments (Fig. 1A). Diagonal segments correspond to subpeptides of P matched to S without modifications; vertical and horizontal segments correspond to PTMs with positive and negative offsets, respectively. The length of a horizontal or vertical segment represents the absolute value of the offset. For example, the spectral alignment in Fig. 1A has three diagonal segments (corresponding to subpeptides "PR", "TEI", and "STRING"), one vertical segment (a modification or mutation with ϩ80 Da offset) and one horizontal segment (a modification or mutation with -114 Da offset).
We often observe top-down spectra mapped to proteins with truncated peptides on N-and/or C termini. To identify these spectra, we define a semiglobal spectral alignment between P and S as a path from a matching point (b x , 0) on the top border to a matching point (b y , -a n ), x Ͻ y, on the bottom border in the spectral grid, consisting of diagonal, vertical, and horizontal segments.
Based on the positions of b x and b y , we classify semiglobal spectral alignments into four groups: complete (x ϭ 0 and y ϭ m), prefix (x ϭ 0 and y Ͻ m), suffix (x Ͼ 0 and y ϭ m), and internal (x Ͼ 0 and y Ͻ m) (Fig. 1). A complete alignment is characterized by a path from the top left corner to the bottom right corner of the spectral grid, corresponding to a global spectral alignment between the spectrum and the entire protein. A prefix/suffix/internal alignment corresponds to a global spectral alignment between the spectrum and a prefix/suffix/ internal subprotein of the protein. An alignment is called a nonshift alignment if it has no offsets and contains only one diagonal segment. Otherwise, it is called a shift alignment.
A matching point is on a diagonal segment if the distance between the point and the segment is within a given error tolerance. The score of an alignment is the number of matching points on the diagonal Fast ϩ a ProSightPC has various search modes that contribute to bridging the gap between blind and restrictive modes of MS/MS database search. It can identify truncated proteins by using biomarker search and identify unexpected modifications by using ⌬m mode and setting the error tolerance of precursor mass to a large value (e.g., 1999 Da). However, it is not designed for identifying truncated proteins with unexpected PTMs which are not represented in the "shotgun annotated" database.
b In its most advances mode, ProSightPC can search the annotated top-down database that contains various protein species. However, ProSightPC searches in this mode become an order of magnitude slower. segments in the alignment path. For example, the path in Fig. 1A represents an alignment with a score of 12. Spectral alignment algorithms find a highest-scoring alignment between a deconvoluted spectrum and a protein (24,25). Frank et al., 2008 (24) described a dynamic programming algorithm for finding an optimal global alignment between P and S with F offsets in O(nmF) time. This algorithm with minor modifications still works for finding an optimal semi-global spectral alignment between a deconvoluted spectrum and a protein.
Speeding up Spectral Alignment-The spectral alignment algorithm in (24) (e.g. its MS-TopDown implementation) typically takes 3 s to align a single spectrum against a single protein (on a typical desktop machine with a 2.2 GHz CPU). Aligning a small data set with 1000 spectra against a small proteome with 1000 proteins takes 3 ϫ 10 6 seconds (Ϸ800 h) and makes spectral alignment computationally intensive. To speed up protein identification, we quickly filter out proteins in the proteome that cannot possibly attain high-scoring PrSMs for a given spectrum, thus reducing the number of candidate proteins.
Given a -45°line crossing the spectral grid, we define its weight as the number of matching points on this line. This definition of weight is different from the score of a spectral alignment, which is defined as the number of matching points on its corresponding path. We rank all crossing lines in the decreasing order of their weights and output top ␣ crossing lines (Fig. 2). The weights of crossing lines can be quickly computed using spectral convolution (see (29) for the definition of spectral convolution).

FIG. 1. Spectral alignment examples.
A, A global spectral alignment (complete alignment) between a spectrum and a protein P ϭ PRTEINSTRING. The blue path from the top left corner to the bottom right corner represents the alignment of protein P and the spectrum arising from P with two PTMs: ϩ80 Da on "T" and -114 Da between "I" and "S" (deletion of 'N'). The circles along the path denote the matching points on the path. B, A prefix alignment between a spectrum and protein P. The path from the top left corner to a matching point on the bottom border represents an alignment between the spectrum and a prefix "PRTEINSTRI" of P. C, A suffix alignment between a spectrum and protein P. The path from a matching point on the top border to the bottom right corner represents an alignment between the spectrum and a suffix "EINSTRING" of P. D, An internal alignment between a spectrum and protein P. The path from a matching point on the top border to a matching point on the bottom border represents an alignment between the spectrum and a subsequence "EINSTRI" of P.
FIG. 2. The top five crossing lines in the spectral grid of the protein P ‫؍‬ PRTEINSTRING and the spectrum in Fig. 1A. The blue crossing line has the largest weight 7. The alignment in Fig. 1A is composed of three diagonal segments from the red, green, and blue crossing lines. The two black crossing lines, with weights 3 and 2, are not in the path of the best alignment.
For a spectral alignment with a score s and k diagonal segments, there exists a crossing line in the spectral grid with weight at least s/k. Thus, if all crossing lines in the spectral grid have low weights, the corresponding protein can be safely filtered out, eliminating the need to run spectral alignment against this protein. We use this filter to reduce the number of candidate proteins, thus speeding up protein identification. Given a spectrum, the diagonal score of a protein is the maximum weight among all crossing lines in the spectral grid. We rank all proteins in the proteome based on their diagonal scores and filter out all but top ␤ high-scoring proteins that are further subjected to spectral alignment (Fig. 3). The default value for ␤ is 20.
If a crossing line contains a diagonal segment that is in an optimal spectral alignment, the crossing line usually has a relatively heavier weight compared with a random crossing line. Therefore, we first select top ␣ crossing lines (default value ␣ ϭ 20) for a proteinspectrum pair, and use a dynamic programming algorithm to find an optimal path containing only diagonal segments on the ␣ crossing lines. Although this method may miss some PrSMs, it is much faster than the spectral alignment dynamic programming algorithm from (24). The reason is that we consider only ␣ crossing lines, not all crossing lines, in the dynamic programming algorithm, thus reducing the search space. With the improvements described above, MS-Alignϩ takes only 18 min to compare 1000 spectra against 1000 proteins, a 2500-fold speed-up as compared with the spectral alignment algorithm implemented in MS-TopDown. ProSightPC takes 22 min to compare 1000 spectra against 1000 proteins using biomarker search mode on the SC data set. However, ProSightPC becomes an order of magnitude slower in its most advanced mode (searching against annotated top-down database).
Statistical Significance of Spectral Alignments-ProSightPC is the only top-down tool that attempts to estimate the statistical significance of PrSMs. It approximates E-values based on the number of peaks in the spectrum, the number of matched peaks, the accuracy threshold, and the size of the database. However, as shown in (30) (in the case of bottom-up spectra), the approximated E-values are often inaccurate because PrSMs may have the same parameters but vastly different "true" E-values. Below we attempt to compute E-values based on entire spectra rather than a limited number of parameters.
Let X S be a random variable representing the number of matched peaks between a random protein and a spectrum S. The notion of a random protein needs to be carefully defined and we assume the same probabilistic model for random proteins as in (30) but only consider proteins with the same molecular mass (within error tolerance) as the precursor mass of S. Kim et al., 2008 (30) proposed a generating function approach to computing the probability Prob(X S Ն t) that a random protein and S have a complete nonshift alignment with a score at least t. We will use this probability several times in the following analysis.
We first analyze the statistical significance of a complete nonshift alignment between a protein and a spectrum S with a score t (case (1)). The E-value of the alignment is evaluated as where Prob(X S Ն t) is computed as described in (30) and N(S) is the number of proteins in the database with the same mass (with error tolerance) as the precursor mass of S.
For prefix, suffix, and internal nonshift alignments (case (2)), the type of the alignment determines the number of candidate peptides in the database that can have a nonshift alignment with the spectrum. A prefix nonshift alignment between a protein and a spectrum S with a score t is viewed as a complete alignment between a prefix of the protein and S. Thus, the E-value of the alignment is evaluated as N pref (S) ⅐ Prob(X S Ն t), where N pref (S) is the number of protein prefixes with the same mass (with error tolerance) to the precursor mass of S in the database. We remark that, in this case, the conversion from p value to E-value differs from the case (1). Similarly, when computing the E-values of suffix/internal nonshift alignments, we consider the number of protein suffixes/internal subproteins with the same mass (with error tolerance) to the precursor mass of the spectrum. This analysis addresses cases (1) and (2). For cases (3) and (4), we break the spectrum into several subspectra without internal PTMs and compute the E-value based on the statistical significance of the sub-spectra (see the supplementary material).
The computation of Prob(X S Ն t) using the generating function is a challenging task for top-down spectra. In this paper, the implementation of MS-Alignϩ uses bins of a fixed size in the dynamic programming procedure for computing the generating function. In this approach, the rounding mass errors (generated by the use of bins) may introduce deviations in reported probabilities because top-down spectra have high m/z accuracy and large precursor masses.

RESULTS
Data Sets-Two top-down data sets from Saccharomyces cerevisiae and Salmonella typhimurium were used for benchmarking: S. cerevisiae (SC) Data Set (16)-A lysate was quickly extracted from SC cell with the use of pressure cycling technology and in the presence of a protease inhibitor. The lysate obtained was directly separated on an LC system that was coupled online to an LTQ-Orbitrap (Thermo Fisher Scientific). A total of 30,760 FT-MS/MS spectra were acquired during the 600-min LC separation. Both FT-MS and MS/MS spectra were collected at a resolution of 30,000. The charge states of the MS/MS spectra range from 1 to 30 and the precursor masses range from 800 to 20 KDa (supplemental Fig. S2A).
S. typhimurium (ST) Data Set (19)-The proteins extracted from ST were reduced with dithiothreitol and alkylated with iodoacetamide, and desalted on a C4 spin column. The protein mixture obtained was separated with a NonoAquity HPLC system that was coupled online to an LTQ-Orbitrap (Thermo Fisher Scientific). MS and MS/MS spectra were collected at a resolution of 60,000 and 30,000, respectively. The experiment was repeated using gas-phase fractionation. The detailed experiment procedure can be found in (19). A total of 14,041 FT-MS/MS spectra were acquired. The charge states of the MS/MS spectra range from 1 to 24 and the precursor masses range from 1K to 20K Da (supplemental Fig. S2B).
Data Preprocessing-Thermo raw files were first converted to mzXML files using ReAdW (http://tools.proteomecenter. org/software.php), and each MS/MS spectrum was converted to a deconvoluted spectrum (a monoisotopic mass list) using MS-Deconv (31). A total of 11,030 and 4439 deconvoluted spectra had a precursor mass Ն 2500 Da and at least 10 fragment peaks in the SC and ST data sets, respectively. We focused our attention on these deconvoluted spectra because shorter peptides are well handled by the existing bottom-up approaches and spectra with a small number of peaks seldom have good PrSMs. Below these 11,030 and 4439 spectra are referred to as SC* and ST* data sets, respectively. Some of the spectra had very similar precursor mass (up to 15 ppm error tolerance), suggesting that some of them may correspond to the same protein species. For error tolerance 15 ppm, there were 4462 and 1825 various precursor masses in the two sets of spectra.
Protein Identification-MS-Alignϩ was applied to search the deconvoluted spectra against the corresponding proteomes using the parameter settings described in supplemental Table S4. The SC and ST protein databases (downloaded from www.yeastgenome.org and NCBI) had 5885 and 4527 proteins, respectively. Error tolerance for fragment and precursor ions was set to 15 ppm. Because various deconvolution software tools often report monoisotopic masses that deviate from correct masses by Ϯ1 Da, this type of shift error was allowed in spectral alignment. We emphasize the difference between the shift errors (that represents large but fixed Ϯ1 Da shifts from correct masses) and typical (small) errors in peak position, e.g. 15 ppm error from correct masses. To reflect shift errors, a theoretical peak and a spectral peak with mass m were classified as matching peaks if the mass of the theoretical peak was matched to either m -1, or m, or m ϩ 1 within 15 ppm. Carbamidomethylation of cysteine was used as a fixed modification for ST* data set, and no fixed modifications for SC* data set. MS-Alignϩ allows users to select some variable modifications to boost the number of identified proteins with particular frequent modifications. Protein N-terminal acetylation, protein N-terminal methionine loss, and protein N-terminal methionine loss plus acetylation were used as variable modifications for the two data sets. In addition, two mass shifts were allowed in spectral alignment (N-and C-terminal truncated peptides were not counted as mass shifts). Thus, every identified PrSM corresponded to up to 4 modifications other than the specified fixed and variable modifications. MS-Alignϩ reported only the PrSM with the best E-value for each spectrum. The running time of MS-Alignϩ for analyzing SC* and ST* data sets on a PC with a 2.67 GHz CPU and 12 GB memory was 724 and 495 min.
To estimate False Discovery Rate (FDR), we also searched the spectra in SC* and ST* data sets against decoy SC and ST databases, respectively (32). The decoy databases were generated by shuffling each protein sequence in the original databases. Spectrum level FDR was estimated based on the distributions of the E-values of the identified PrSMs in the target and decoy databases (supplemental Figs. S3, S4). With spectrum level 1% FDR, MS-Alignϩ identified 4059 spectra (290 proteins) and 2217 spectra (180 proteins), corresponding to about 36.8% and 49.9% identification rates, from SC* and ST* data sets, respectively (supplemental Table S1). Most unidentified spectra were characterized by small numbers of deconvoluted peaks (typically less than 30) and small diagonal scores (typically less than 10) (Fig. 4).

Protein Identification Using Top-Down
COT, OMSSA, PIITA, and ProSightPC on SC* and ST* data sets using the parameter settings described in supplemental Table S4.
Comparison with MASCOT and OMSSA-Both MASCOT and OMSSA are popular software tools for bottom-up protein identification (27,33) that are sometimes used for analyzing top-down spectra.
Using the same target/decoy approach, with spectrum level 1% FDR, MASCOT identified 3111 spectra (217 proteins) and 571 spectra (76 proteins), and OMSSA identified 2874 spectra (212 proteins) and 464 spectra (81 proteins), from SC* and ST* data sets, respectively. More than 91% of the spectra and more than 83% of the proteins were also reported by MS-Alignϩ ( Fig. 5 and supplemental Table S5), indicating that most of them were correctly identified. In addition, MS-Alignϩ identified a large number of spectra, especially in ST* data set, that were missed by MASCOT and OMSSA. Fig. 6 illustrates that MS-Alignϩ outperforms MASCOT and OMSSA and that the performance of MASCOT and OMSSA deteriorates for spectra with precursor masses Ն5K Da.
There were 301 spectra identified by MASCOT or/and OMSSA, but missed by MS-Alignϩ from SC* and ST* data sets. The simple scoring function of MS-Alignϩ, which is based on peak counting, may be another reason why the spectra were missed by MS-Alignϩ. The distribution of E-values of the spectra reported by MS-Alignϩ is shown in supplemental Fig. S5.
Comparison with PIITA-We compared PIITA and MS-Alignϩ on ST* data set (running PIITA on SC* data set resulted in a software error). Using the same target/decoy approach, with spectrum level 1% FDR, PIITA coupled with MS-Deconv identified 1958 spectra (169 proteins) from ST* data set, a significant improvement over Mascot and OMSSA. MS-Alignϩ and PIITA shared 1524 identified spectra and 144 identified proteins (supplemental Fig. S6 and supplemental Table S5). Interestingly, PIITA found some PrSMs missed by MS-Alignϩ and vice-versa (supplemental Tables S2 and  S3). The possible reasons why PIITA identified some PrSMs missed by MS-Alignϩ are (1) PIITA considers six ion types: b, b-H 2 O b-NH 3 , y, y-H 2 O and y-NH 3 , whereas MS-Alignϩ considers only b and y-ions; (2) PIITA does not depend on the precursor mass for protein identification, whereas MS-Alignϩ uses the precursor mass to convert y-ion peaks to PRM peaks. Therefore, PIITA is more sensitive when (1) the spectrum contains many peaks from ions with neutral losses or (2) the precursor mass is incorrect. PIITA identified 434 PrSMs from ST* data set which were missed by MS-Alignϩ. For each PrSM, PIITA reports only the unmodified form, not the modified form, of the identified protein. Therefore, we can not compare the precursor mass of the spectrum with the molecular mass of the modified form the protein. In the 434 PrSMs, no precursor masses can be matched to the molecular masses of the unmodified forms of the proteins within 0.5 Da (supplemental Table S3). On the other hand, PIITA can not identify PrSMs with both Nand C-terminal PTMs and may fail to identify PrSMs with internal PTMs. Therefore, MS-Alignϩ identified many PrSMs missed by PIITA.
Comparison with ProSightPC-ProSightPC 2.0 was used in the comparison. Target and shuffled protein databases required by ProSightPC were generated from the same FASTA sequences for other experiments using the default parameters of ProSightPC. Coupling with MS-Deconv, two search modes of ProSightPC: absolute mass search and (more powerful) biomarker search were used to analyze SC* and ST* data sets. The running time of ProSightPC for processing SC* and ST* data sets using biomarker search on a PC with a 2.16 GHz CPU and 3 GB memory was 890 and 199 min, respectively. The running time for each run of absolute mass search was less than 90 min. With spectrum level 1% FDR, ProSightPC identified 123 and 2230 spectra from SC* data set with absolute mass search and biomarker search, respec- tively. A total of 2299 (175 proteins) spectra were reported by combining the results from the two search modes (Fig. 7 and supplemental Table S5). MS-Alignϩ identified all but 10 of the spectra (Fig. 7A). Using the same approach, ProSightPC identified 2230 spectra (158 proteins) from ST* data sets (absolute mass search reported 2083 spectra and biomarker search reported 892 spectra). ST* data set is somehow "simpler" than SC* data set because it contains fewer truncated and modified proteins. MS-Alignϩ identified 1744 (78.2%) of the spectra identified by ProSightPC (Fig. 7B).
ProSightPC has various search modes and parameters. The user can increase its search space by setting a very wide window for precursor ions or using an annotated protein database containing many forms of a protein. With most advanced search modes, the running time of ProSightPC increases dramatically. For example, the "biomarker

Protein Identification Using Top-Down
search" mode with ⌬m option turned on and a precursor ion error tolerance of 100 Da is estimated to take 77 days to analyze SC* data set (on a single processor) while identifying more PrSMs. Thus, the user has to balance between the number of identified PrSMs and the running time of ProSightPC. We used SC* data set to analyze the performance and running time of several advanced modes of ProSightPC.
First, we tested ProSightPC using annotated databases. We downloaded a standard annotated top-down database of yeast from https://prosightptm2.northwestern.edu/. The annotated database contains 912,107 protein forms, about 32 times larger than the unannotated database with 28,238 protein forms generated from the FASTA sequences. The running time of ProSightPC for searching SC* data set against the annotated database was about 4 h and 160 h for absolute mass search and biomarker search (⌬m option was not used for biomarker search), respectively. Using the same E-value cutoffs as in the experiment with the unannotated database, ProSightPC identified 2310 spectra from 181 proteins (326 spectra from absolute mass search and 2125 spectra from biomarker search). MS-Alignϩ identified all but 16 of the spectra in about 12 h, a 13 times speed-up compared with ProSightPC. We also downloaded a SWISSPROT flat file of yeast S288C from the SWISSPROT database. By importing the flat file to ProSightPC with its default parameters, we generated an annotated database with 6629 proteins and 4,437,468 protein forms, which is about 158 times larger than the unannotated database. We only tested absolute mass search for this database, which took about 8 h. Using the same E-value cutoff as in the experiment with the annotated database, ProSightPC identified 157 spectra from 11 proteins, of which MS-Alignϩ identified 148 spectra. Compared with the unannotated database, ProSightPC coupled with the annotated databases is slower and does not significantly improve the number of identifications. The reason may be that SC* data set contains many truncated and modified proteins, which are not included the annotated database.
Second, we tested the advanced search modes with a large search window for precursor ions. Because the running time for these parameter settings was estimated to be very large (77 days), a smaller database of 500 yeast proteins were used instead of the complete yeast database with 5885 proteins. The small database was composed of 305 proteins identified by MS-Alignϩ or ProSightPC (supplemental Table S5) and 195 randomly selected proteins. Using the small database, the running time of ProSightPC is estimated to be 11 times smaller as compared with the complete database. We changed the precursor ion error tolerance from 1999 Da to 600K Da for absolute mass search (other parameters were set the same as in the previous experiment). With this parameter setting, no proteins were filtered out by precursor ion error. The running time of ProSightPC was about 10 h (estimated running time for the complete database is about 110 h). With 1% spectrum level FDR, 1217 PrSMs (122 proteins) were identified. We also changed the precursor ion error tolerance from 15 ppm to 100 Da for biomarker search. The running time of ProSightPC was about 7 days (estimated running time for the complete database is about 77 days). With 1% spectrum level FDR, 2526 PrSMs (174 proteins) were identified. By combining the results of the two search modes, ProSightPC identified 2822 spectra from 198 proteins. Using the same small database, MS-Alignϩ identified 4448 spectra from 298 proteins in 3 h. ProSightPC shared 2654 (94%) identified PrSMs with MS-Alignϩ. MS-Alignϩ identified about 57% more PrSMs while being 56 times faster. We point out that this approach is not optimal and is chosen only to bypass the extreme time requirements of ProSightPC. Indeed, when a small database is populated mainly with identified proteins, the FDR estimates become biased. Our goal is to roughly compare the relative performance of the two tools rather than to estimate the absolute number of proteins these tools can identify. We emphasize that some advanced modes of ProS-ightPC, although increasing the number of identified PrSMs, may be extremely time-consuming. For example, the biomarker search mode with a precursor ion error tolerance of 100 Da is estimated to take more than 150 days to search SC* data set against both the target and decoy protein databases.
The reason why MS-Alignϩ missed some spectra identified by ProSightPC is that the filtering step of MS-Alignϩ may miss some possible PrSMs. ProSightPC also failed to report many spectra identified by MS-Alignϩ, especially for SC* data set. Among the 4059 PrSMs identified by MS-Alignϩ from SC* data set, 3778 PrSMs are from truncated proteins (supplemental Table S1). Because most of the spectra in SC* data set are truncated proteins, absolute mass search of ProSightPC may fail to identify the corresponding PrSMs (see Fig. 6A). This observation suggests that MS-Alignϩ and Pro-SightPC may complement each other in top-down searches.
Coupling MS-Alignϩ and ProSightPC with Thrash-MS-Alignϩ is usually coupled with MS-Deconv and is optimized for the frequent Ϯ1 Da shift errors that are common in MS-Deconv. ProSightPC, on the other hand, is coupled with Thrash. To analyze the performance of MS-Alignϩ decoupling from MS-Deconv, we coupled MS-Alignϩ and ProSightPC with Thrash instead. The general conclusion is that switching from Thrash to Ms-Deconv improves the performance of both MS-Alignϩ and ProSightPC (similar observation was made in (31)).
Using its default parameters, Thrash reported 5136 and 2097 merged deconvoluted spectra with precursor mass Ն 2500 Da and at least 10 peaks from the SC and ST data sets, respectively. Using the same target/decoy approach, with spectrum level 1% FDR, MS-Alignϩ (coupled with Thrash) identified 2031 and 934 spectra from the merged spectra from th SC and ST data sets, respectively. By combining the results of absolute mass search and biomarker search, ProSightPC identified 617 spectra (61 spectra from absolute mass search and 562 spectra from biomarker search) from the SC data set, and 900 spectra (815 spectra from absolute mass search and 418 spectra from biomarker search) from the ST data set. MS-Alignϩ and ProSightPC shared 598 and 707 identified spectra in the SC and ST data sets, respectively.
Because MS-Deconv recovered single spectra and Thrash recovered merged spectra, we mapped the single spectra to the merged spectra and compared the identified single spectra and the identified merged spectra. For MS-Alignϩ, a total of 2592 (87.4%) identified merged spectra can be mapped to identified single spectra, and 4343 (69.2%) identified single spectra can be mapped to identified merged spectra. For ProSightPC, 1319 (86.9%) identified merged spectra can be mapped to identified single spectra, and 2295 (50.6%) identified single spectra can be mapped to identified merged spectra (supplemental Table S4). The reason might be that MS-Deconv performed better than Thrash for single spectra, and that Thrash performed better than MS-Deconv when several spectra can be merged to report a single deconvoluted spectrum.
Validation of E-Values Reported by MS-Alignϩ and ProSightPC-Both MS-Alignϩ and ProSightPC report an Evalue for each identified PrSM, which is used for estimating the statistical significance of the PrSM. The E-value is computed based on an estimated p value of the PrSM and the size of the protein database. In difference from ProSightPC, which uses the number of proteins as the size of the database, MS-Alignϩ uses the number of proteins with the same mass (with error tolerance) as the precursor mass of the spectrum as the size of the database. Although the size of a protein database is easy to compute, the estimation of p value is a difficult problem. For example, the p values reported by some bottom-up database search tools may differ from the textbook definition of the notion of p value by orders of magnitude (30). To test if the p values reported by MS-Alignϩ and Pro-SightPC are accurate, we estimate the theoretical p values by searching the spectra against a giant random protein database as described in Kim et al., 2008 (30). Suppose a PrSM between spectrum S and protein P has no PTMs and its score is t. We generate a random protein database with 10 6 proteins, each of which has a molecular mass similar to P. Then we search S against the random protein database and count the number a of PrSMs with score Ն t. Thus, the textbook estimate of the theoretical p value is a ϫ 10 Ϫ6 . The accuracy of reported p values is examined by comparing the reported p values and the estimated theoretical p values.
MS-Alignϩ used a generating function approach to compute p values, and reported 1348 PrSMs with p value Յ 10 Ϫ4 without PTMs in ST* data set. We computed the estimated theoretical p value for each PrSM by searching a giant random protein database with 1 million proteins. Even with such a giant protein database, one is unable to reliably estimate theoretical p values in the cases of p values below 10 -6 (because one expects to find less than a single database hit in these cases, often resulting in the estimated p value equal to 0). Thus, we had to limit our attention to only 144 PrSMs (out of 1348) with a reported p value in the interval [10 -6, 10 -4 ] and an estimated theoretical p value Ն 10 -6 . We computed the ratios between the reported p values and the estimated theoretical p values for these 144 PrSMs (Fig. 8). In the case of accurate p values one expects to see log-ratios of theoretical to reported p values equal to 0. Although the histogram in a log-ratio (base 10) between the reported and the estimated theoretical p value Ն 0.5 and 0.5% of the PrSMs have a log-ratio Յ -0.5. The reason for this discordance is that the rigorous computation of the generating function for top-down spectra is difficult because of rounding errors in the dynamic programming procedure from (30).
A similar experiment with ProSightPC becomes rather timeconsuming because it requires manual loading of the giant random database for each spectrum. Thus, we were able to conduct this experiment for only 10 spectra. We selected 10 PrSMs identified from the ST data set by both ProSightPC and MS-Alignϩ, and compared the reported p values and the estimated theoretical p values (Table II). For several examples in Table II, the p values reported by ProSightPC are rather inaccurate (half of all entries feature the error of at least one order of magnitude). One possible reason is that when ProSightPC computes p value, it does not consider positions of peaks and uses a simplifying and often incorrect assumption that matches between peaks and theoretical masses represent statistically independent events.

Analysis of Identified Protein-Spectrum Alignments
N-Terminal Methionine Excision-According to the canonical rule for N-terminal Methionine Excision (NME), a protein with an N-terminal prefix "MX," where X ϭ G,A,P,V,S,T,C, is expected to undergo NME (34). MS-Alignϩ reported 59 proteins without NME, including 13 proteins that were expected to undergo NME based on the NME rule, from ST* data set. Out of the 13 proteins, nine proteins had both NME and non-NME species and the remaining four proteins had only non-NME species. Presence of both NME and non-NME forms of the same protein is surprising because methionine aminopeptidase enzymes (responsible for NME) are believed to be very specific (35). MS-Alignϩ found 64 proteins with NME, including 5 proteins that were expected to not undergo NME based the NME rule, from ST* data set. For SC* data set, MS-Alignϩ identified 31 proteins without NME and 88 proteins with NME, including seven proteins that did not follow the NME specificity rule (34).
Signal Peptides-MS-Alignϩ identified 28 proteins from ST* data set with a short truncated prefix (the truncated prefix had 15-35 residues). Although 14 proteins had a truncated prefix with the canonical "AXA" motif for signal peptides, the other 14 proteins did not. For the 14 protein without AXA motif, SignalP (36) reported 12 signal peptides (supplemental Table S7), out of which nine signal peptides matched to the truncated prefixes reported by MS-Alignϩ, and the other three signal peptides did not.
Erroneous Annotations of Translation Start Sites-MS-Alignϩ found in ST* data set 12 protein species (12 proteins) with a truncated prefix that either ends in "M" or exposes M as the first amino acid in the truncated protein (Table III), including eight proteins with short truncated prefixes (from 2 to 9 residues) typical for mis-annotated bacterial start sites (37). However, the remaining 4 proteins had long truncated prefixes (over 80 amino acids) pointing to potential alternative start sites (38,39). All start sites but one in Table III follow the NME rule, thus reinforcing our conclusion that the translation start sites in these proteins represent either mis-annotated or alternative start sites.
PTMs on Internal Residues-MS-Alignϩ identified many proteins with common PTMs, e.g. proteins with oxidation (ϩ16 Da) and methylation (ϩ14 Da), and several proteins with uncommon PTMs. In SC* data set, MS-Alignϩ reported proteins with a mass shift of ϩ38 Da and proteins with a mass shift of ϩ183 Da (the protease inhibitor mixture used in preparing the sample may contain 4-(2-Aminoethyl)benzenesulfonyl fluoride hydrochloride that introduces this modification. See http://www.unimod.org/modifications_view.php?editid1ϭ 276). In ST* data set, MS-Alignϩ reported proteins with Carbamidomethyl DTT on cysteine (the cysteine modification is ϩ209 Da and the modification relative to a carbamidomethy- lated cysteine is ϩ152 Da), proteins with a disulfide bond on two cysteines (-116 Da compared with two cysteines with carbamidomethylation), and proteins with a -12 Da mass shift (one possible explanation is replacing an isoleucine with a threonine). Searching Six-Frame Translation-We also generated the six-frame translation of the ST genome and searched the top-down spectra in ST* data set against the database. Similar searches have recently been done using ProSightPC (40). By searching the (unannotated) six-frame translation, MS-Alignϩ was able to identify nearly all proteins identified in searches of the (annotated) proteome thus emphasizing potential proteogenomics applications of top-down mass spectrometry. Moreover, MS-Alignϩ identified a protein species of NP_462331.2 which contains four more amino acids than the annotated protein sequence at the N terminus (Fig. 9), the likely indication of an annotation error.
Analyzing Unidentified Spectra: Evidence of Room for Improvement-Highly accurate precursor mass measurements of top-down spectra allow one to consider a hypothesis that a protein matches to an unidentified spectrum based only on the precursor mass. Indeed, as supplemental Fig. S7A shows, a surprisingly large number of spectra have precursor masses either equal to the theoretical precursor masses of proteins in the ST proteome (118 spectra) or differ from these masses by 131 Da (89 spectra). The peaks at 0 (matching precursor and theoretical masses) and at 131 (matching precursor and theoretical mass minus NME) are the tallest peaks in supplemental Fig. S7A whereas the noise level corresponds to Ϸ 25 spectra. Thus, we estimate that roughly 118 ϩ 89 -25 -25 ϭ In the reported protein species, four residues "AKQS" (underlined) at the N terminus precede the annotated start site and suggest an annotation error.

CONCLUSIONS
Our analysis demonstrated that MS-Alignϩ favorably compares with other tools in identifying PTMs, particularly in the blind mode. In addition, it is fast and it computes more accurate p values than ProSightPC. As a result, MS-Alignϩ reported many new protein identifications (e.g. proteins with Nand C-terminal truncations and proteins with internal PTMs) from the SC and ST data sets.
There are several computational challenges in top-down mass spectrometry that require further studies. For example, a faster and sensitive filtering method is needed for speeding up the protein identification without significant loss of sensitivity. In MS-Alignϩ, multiple PTMs can be identified only if there are several peaks in the spectrum supporting each PTM. Such peaks may be missing in some spectra thus raising a problem of identifying multiple PTMs with a limited number of matched peaks. Another important problem is to develop computational tools for automatically determining and localizing PTMs.