Jiaan Dai, Wei Jiang, Fengchao Yu, Weichuan Yu, Xolik: finding cross-linked peptides with maximum paired scores in linear time, Bioinformatics, Volume 35, Issue 2, January 2019, Pages 251–257, https://doi.org/10.1093/bioinformatics/bty526
Abstract
Cross-linking technique coupled with mass spectrometry (MS) is widely used in the analysis of protein structures and protein-protein interactions. In order to identify cross-linked peptides from MS data, we need to consider all pairwise combinations of peptides, which is computationally prohibitive when the sequence database is large. To alleviate this problem, some heuristic screening strategies are used to reduce the number of peptide pairs during the identification. However, heuristic screening strategies may miss some true cross-linked peptides.
We directly tackle the combination challenge without using any screening strategies. Using a double-ended queue, the proposed algorithm reduces the quadratic time complexity of exhaustive searching to linear time. We implement the algorithm in a tool named Xolik. The running time of Xolik is validated using databases with different numbers of proteins. Experiments using synthetic and empirical datasets show that Xolik outperforms existing tools in terms of running time and statistical power.
Source code and binaries of Xolik are freely available at http://bioinformatics.ust.hk/Xolik.html.
Supplementary data are available at Bioinformatics online.
1 Introduction
Cross-linking technique in combination with mass spectrometry (MS) is commonly used to analyze protein structures and protein-protein interactions (Young et al., 2000). Various computational tools have been developed to analyze cross-linking MS data. These tools include MS2Assign (Schilling et al., 2003), xQuest/xProphet (Walzthoeni et al., 2012), crux (McIlwain et al., 2010), xComb (Panchaud et al., 2010), Xlink-Identifier (Du et al., 2011), Protein Prospector (Trnka et al., 2014), pLink (Yang et al., 2012), MeroX (Götze et al., 2015), MXDB (Wang et al., 2014), Kojak (Hoopmann et al., 2015), XlinkX (Liu et al., 2015), ECL (Yu et al., 2016) and ECL2 (Yu et al., 2017).
Compared with the identification of single peptides (Eng et al., 1994; Perkins et al., 1999), the identification of cross-linked peptides requires a much larger search space because we need to examine all pairs of candidate peptides. To be more precise, the search space of identifying cross-linked peptides is quadratic with respect to the number of candidate peptides in a database (Liu et al., 2015). Therefore, it is computationally challenging to examine all possible candidate pairs in a large database.
To tackle this problem, MS-cleavable cross-linkers, such as disuccinimidyl sulfoxide (DSSO) (Kao et al., 2011), BuUrBu (Müller et al., 2010) and cyanurbiotindipropionylsuccinimide (CBDPS) (Petrotchenko et al., 2011), have been introduced. With MS-cleavable cross-linkers, the identification of cross-linked peptides can be finished in linear time (Liu et al., 2015). However, the time complexity is still quadratic when non-cleavable cross-linkers are used.
To reduce the quadratic time complexity when non-cleavable cross-linkers are used, most of the existing tools use some heuristic screening strategies to reduce the number of candidates. For example, pLink (Yang et al., 2012) first conducts a coarse-grained scoring on all single peptides, and then selects the top 500 of them as candidates for further fine-grained scoring. In the fine-grained scoring, only these 500 candidate peptides are paired together as candidates of cross-linked peptides. pLink2 (Meng et al., 2017) adopts a different screening strategy by limiting one element peptide of the cross-linked pair being within the top 5–10 single peptides and finding the other element peptide in the whole database. Kojak (Hoopmann et al., 2015) uses a strategy similar to pLink, selecting only the top 250 single peptides. These tools limit the number of candidates when enumerating combinations. Consequently, the running time can be reduced to an acceptable level.
Instead of reducing the number of candidates, Chen et al. (2001) proposed a general algorithm for identifying cross-linked peptides and provided a theoretical analysis of their method. Their method can be decomposed into two stages. In the first stage, it generates possible solutions that satisfy the requirement of precursor mass. In the second stage, it measures the similarity between the experimental mass spectrum and each pair of peptides. Because of the large number of combinations, the authors suggested a speed-up implementation by removing low-score candidates in measuring similarities of single peptides. This is also a screening strategy.
ECL2 (Yu et al., 2017) uses an algorithm whose time complexity is linear with respect to the number of peptides in a database to solve the problem of cross-linked peptide identification. It uses the additive property of the scoring function, and a binning strategy to assign peptides into small bins according to their mass. Afterwards, it searches and records the peptide with the maximum score in each bin, and enumerates bin pairs to determine the most similar cross-linked peptides. As the number of bins can be defined beforehand, the computational cost of enumerating bin pairs will be fixed.
However, this method suffers from the following issues. The time complexity of enumerating bins is O(Mτ/w²), where M is the total range of the peptide mass, τ is the MS1 tolerance and w is the MS1 bin width (Yu et al., 2017). This time complexity can be rewritten as O(n_b m_b), where n_b = M/w is the total number of bins, and m_b = τ/w is the number of candidate bins within the MS1 tolerance range. Both are proportional to 1/w, so the time complexity of ECL2 is still quadratic with respect to the MS1 mass precision 1/w. Since peptide combinations are replaced by bin combinations, i.e. the time complexity O(nm) by enumerating peptide pairs (n peptides and average m candidate peptides within the MS1 tolerance range) becomes O(n_b m_b) in ECL2, the time complexity is still quadratic with respect to the objects being enumerated. In other words, the high complexity issue of enumerating combinations has not been fully solved. Furthermore, because the binning strategy is used in ECL2, the precursor mass constraint may not be strictly satisfied.
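To make the dependence on the bin width concrete, the short Python sketch below counts bin pairs as the number of mass bins multiplied by the number of candidate bins inside the tolerance window. The function name and the numbers (a 5000 Da mass range and a 0.5 Da window) are hypothetical values chosen for illustration only; they are not taken from ECL2.

```python
def bin_pair_count(mass_range, tolerance, bin_width):
    """Count (bin, candidate bin) pairs when every mass bin is paired with
    each bin that falls inside the tolerance window.

    All three arguments are in Da and are illustrative assumptions.
    """
    total_bins = round(mass_range / bin_width)      # proportional to 1 / bin_width
    candidate_bins = round(tolerance / bin_width)   # proportional to 1 / bin_width
    return total_bins * candidate_bins


coarse = bin_pair_count(5000.0, 0.5, 0.01)
fine = bin_pair_count(5000.0, 0.5, 0.001)
print(coarse, fine, fine // coarse)  # 25000000 2500000000 100
```

Making the bins ten times narrower multiplies both factors by ten, so the number of enumerated bin pairs grows by a factor of one hundred.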
In this work, we propose a new linear-time algorithm to solve the problem of identifying cross-linked peptides. The proposed algorithm not only overcomes the above issues, but also shows a novel advantage in reducing the computational cost of scoring peptides. By using the data structure of double-ended queue, the proposed algorithm exhaustively searches cross-linked peptides in linear time. The time complexity of the proposed algorithm is not only linear with respect to the number of peptides in a database, but also constant with respect to the MS1 tolerance. We implement this algorithm in a tool named Xolik. The correctness proof and the time complexity analysis are given in the Supplementary Material. Experiments using synthetic and empirical datasets show that Xolik outperforms existing tools in terms of running time and statistical power.
2 Materials and methods
2.1 Problem formulation
2.2 Linear-time search algorithm
Any scoring function satisfying the above additive property can be used in our algorithm. In the default implementation, we use a modified version of the XCorr scoring function (Eng et al., 2008). The modification is that we do not incorporate theoretical ions from two peptides if they are within the same MS2 bin. We generate all b and y theoretical ions of the cross-linked peptides in the scoring. To calculate the mass of ions containing the cross-linked site, we consider the mass difference PM − m (where PM denotes the precursor mass and m denotes the mass of the examined peptide) as a pseudo-modification at the cross-linked site. This is also the method adopted by pLink in its open-search mode (Yang et al., 2012).
To identify cross-linked peptides from MS data, peptides in a sequence database are first digested in silico and sorted by mass in advance. (Sorting by mass requires O(n log n) time. However, since it is done offline, we do not include it in the analysis of the time complexity, which considers the running time when analyzing one spectrum.) With the proposed algorithm, finding the cross-linked peptides given a query MS2 spectrum takes O(n) time. We only calculate the similarity score between the query spectrum and a candidate if the comparison between them is necessary. This allows us to reduce the computational cost of scoring because we merely score peptides that are needed, instead of blindly scoring all candidates in the whole database. In the next subsection, we will describe the benefits of this strategy. To search the peptide pairs that satisfy the precursor mass constraint, we use three pointers (I_f, I_bf, I_be) to denote the currently examined peptide, the lower bound and the upper bound of the index range of the candidate peptides that satisfy the requirement of the precursor mass, respectively (see Figs 1 and 2). I_f initially points to the peptide with the smallest mass, while I_bf and I_be initially point to the peptide with the largest mass. We use a double-ended queue (deque) to maintain the order of scores compared in previous iterations. In each iteration, we examine one peptide and compute the range of the other peptide that satisfies the mass constraint. For each examined score, finding the other peptide with the maximum score in the valid range can be solved in constant amortized time with the help of the double-ended queue. Therefore, after all iterations, the maximum score among all peptide pairs corresponding to the query MS2 spectrum is available in linear time with respect to the number of candidate peptides in the database.
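The search described above can be sketched as a sliding-window maximum over the mass-sorted peptide list. The Python code below is a minimal illustration and not Xolik's implementation: the function names, the absolute tolerance in Da (rather than ppm), the assumption that the cross-linker mass has already been subtracted from the target mass, and the choice to let a peptide pair with itself are all simplifications made here. A quadratic reference search is included only to check the result.

```python
from collections import deque


def best_pair(masses, scores, target, tol):
    """Return (score_sum, i, j) with the largest scores[i] + scores[j] such
    that target - tol <= masses[i] + masses[j] <= target + tol, or None.

    masses must be sorted in ascending order.
    """
    window = deque()             # candidate indices, scores decreasing front to back
    next_in = len(masses) - 1    # next (lighter) index that may enter the window
    best = None
    for i in range(len(masses)):
        low = target - tol - masses[i]    # lightest partner allowed for peptide i
        high = target + tol - masses[i]   # heaviest partner allowed for peptide i
        # Both bounds shrink as masses[i] grows, so the window only moves
        # towards lighter peptides: admit new indices at the back ...
        while next_in >= 0 and masses[next_in] >= low:
            while window and scores[window[-1]] <= scores[next_in]:
                window.pop()     # dominated: heavier index with no better score
            window.append(next_in)
            next_in -= 1
        # ... and drop indices that have become too heavy from the front.
        while window and masses[window[0]] > high:
            window.popleft()
        if window:
            j = window[0]        # best-scoring partner currently in range
            total = scores[i] + scores[j]
            if best is None or total > best[0]:
                best = (total, i, j)
    return best


def brute_force_best(masses, scores, target, tol):
    """Quadratic reference: best score sum over all valid pairs, or None."""
    sums = [scores[i] + scores[j]
            for i in range(len(masses)) for j in range(len(masses))
            if target - tol - masses[i] <= masses[j] <= target + tol - masses[i]]
    return max(sums) if sums else None


masses = [500.0, 700.0, 800.0, 1000.0, 1200.0, 1500.0]
scores = [1.0, 3.0, 2.5, 0.5, 4.0, 2.0]
print(best_pair(masses, scores, 2000.0, 5.0))         # (6.5, 2, 4)
print(brute_force_best(masses, scores, 2000.0, 5.0))  # 6.5
```

Each index enters and leaves the deque at most once, which is why the cost per examined peptide is constant in the amortized sense even when the tolerance window is wide.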
The improvement of the proposed linear-time algorithm compared with ECL2 (Yu et al., 2017) lies in removing the overhead of enumerating bin pairs. The time complexity of ECL2 is O(n + Mτ/w²), where O(n) is the time complexity of scoring and binning, and O(Mτ/w²) (with M the total range of the peptide mass, τ the MS1 tolerance and w the MS1 bin width) is the time complexity of enumerating bin pairs. Xolik directly solves the problem in O(n) time, without any binning strategies. Therefore, the overhead in ECL2 is eliminated in Xolik. Besides this, Xolik does not relax the constraint of the precursor mass because no binning strategy is used.
2.3 Lazy evaluation on scoring single peptides
A nice property of the algorithm is that, when we retrieve a similarity score of a peptide, we are sure that this peptide must be in some valid solutions of cross-linked peptides that meet the requirement of the precursor mass. So the score of this peptide is necessary for the algorithm to determine the most similar cross-linked peptides. At the same time, if a peptide is not in any valid solutions of cross-linked peptides, we will never retrieve its score. Therefore, we can postpone the computations of similarity scores of single peptides until we retrieve the scores for comparison. This lazy evaluation strategy saves resources on computing the similarity scores of those ‘useless’ peptides. Along with the memoization on the computed scores, all similarity scores will be computed at most once. To implement the memoization technique, we only need an additional cache layer to store the flags indicating whether scores have been calculated or not. This additional layer only requires a small amount of memory space. For a typical number of peptides in a large database, e.g. 2 000 000 peptides, a naive implementation using 64-bit integers as flags merely needs about 16 MB of extra memory.
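The combination of lazy evaluation and memoization can be sketched as a small container that computes a score only on first access and records a flag afterwards. The class name, the `evaluations` counter and the toy scoring function below are hypothetical and stand in for the real spectrum-to-peptide scoring; Python lists are also not a faithful model of 64-bit flags.

```python
class LazyScores:
    """Peptide scores computed on demand and cached (illustrative sketch)."""

    def __init__(self, n_peptides, score_fn):
        self._score_fn = score_fn
        self._computed = [False] * n_peptides  # flag layer: has score k been computed?
        self._values = [0.0] * n_peptides      # cached scores
        self.evaluations = 0                   # number of real score computations

    def __getitem__(self, index):
        if not self._computed[index]:
            self._values[index] = self._score_fn(index)
            self._computed[index] = True
            self.evaluations += 1
        return self._values[index]


def toy_score(index):
    """Stand-in for an expensive similarity score (assumption for the demo)."""
    return float(index % 7)


lazy = LazyScores(1000, toy_score)
retrieved = [lazy[3], lazy[10], lazy[3]]
print(retrieved, lazy.evaluations)  # [3.0, 3.0, 3.0] 2
```

Only the two peptides that were actually requested are scored, and the repeated request for peptide 3 is served from the cache. With one 8-byte flag per peptide, 2 000 000 peptides correspond to 16 000 000 bytes of flags.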
2.4 False discovery rate calculation
We use ≥ instead of > in Line 26 of Figure 1 to find the global maximum. Since decoy proteins are appended after the target proteins, after generating peptides and stably sorting peptides by peptide mass, decoy peptides will appear after target peptides if they have the same mass. If a decoy peptide has the same score as a target peptide that has the same mass, which means that there may be a mixture of TT (both peptides are from the target database) and TD (one from the target database and the other from the decoy database) with the same maximum score, our algorithm will only report one cross-linked pair containing the decoy peptide in this situation. That will result in an overestimated FDR, and finally lead to a conservative FDR control.
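The effect of the comparison operator on ties can be illustrated with a simplified single scan; this is not the pair search of Figure 1, and the peptide tuples and function names below are made up for the demonstration. Python's `sorted` is stable, so a decoy with the same mass as a target stays behind it, and a non-strict comparison then lets the later (decoy) entry win the tie.

```python
# (mass, score, label); targets are listed before decoys, mimicking a
# database in which decoy proteins are appended after the target proteins.
peptides = [
    (1000.0, 2.0, "target"),
    (900.0, 1.0, "target"),
    (1000.0, 2.0, "decoy"),
    (900.0, 0.5, "decoy"),
]
by_mass = sorted(peptides, key=lambda p: p[0])  # stable: equal masses keep their order


def last_best(items):
    """Scan with a non-strict comparison: on ties the later item wins."""
    best = None
    for item in items:
        if best is None or item[1] >= best[1]:
            best = item
    return best


def first_best(items):
    """Scan with a strict comparison: on ties the earlier item wins."""
    best = None
    for item in items:
        if best is None or item[1] > best[1]:
            best = item
    return best


print(last_best(by_mass))   # (1000.0, 2.0, 'decoy')
print(first_best(by_mass))  # (1000.0, 2.0, 'target')
```

Reporting the decoy on a tie counts the match against the search, which is the conservative direction for FDR estimation.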
3 Results
3.1 Running time validation
In this subsection, we will examine whether Xolik can complete the identification task in linear time. As a comparison, we also run ECL (version 1.1.1) and ECL2 (version 2.1.4) to finish the same task. We use an MS data file (20111221_ananiav_DPDS_lib1_90min_CID35.mzXML) from a synthetic dataset (Wang et al., 2014) to search against random databases of different sizes (i.e. different numbers of proteins). Random databases are generated by randomly selecting proteins from the human protein database, with protein numbers ranging from 100 to 20 000. The MS1 tolerance is set at 50 ppm. To validate the effect when the MS1 tolerance changes, we also search the data file against a random database containing 10 000 proteins, with MS1 tolerance ranging from 50 ppm to 500 ppm. The MS2 bin size for XCorr is set at 0.5 Da (roughly 0.25 Da MS2 tolerance), and the maximum number of missed cleavages is 2. The mass range of candidate peptides is . In this experiment, we would like to compare the running time in a common scenario. Thus, we set BS3 as the cross-linker and enable the E-value estimation, even though the MS spectra in the data file are synthetically cross-linked by the SS-bond (Wang et al., 2014). We also search against a decoy database, which is constructed by reversing the protein sequences in the target database. The MS data file contains 16 557 MS2 spectra. The running time versus the database size is shown in Figure 3(a), and the running time versus the MS1 tolerance is shown in Figure 3(b). All tools are deployed on an Intel Core i5 3.20 GHz Linux desktop computer with 16GB memory, which is a regular PC with a basic configuration, and run on 4 threads (if applicable), with all memory assigned to the process.
Figure 3(a) shows that the running time of ECL increases quadratically, whereas the running times of ECL2 and Xolik increase linearly, with respect to the number of proteins in the database. The running time of each tool is consistent with its time complexity. Most notably, Xolik searches an MS data file against 20 000 proteins in around 13 min on a regular PC. This allows us to search against a large database within an acceptable period of time. We also show the performance of Xolik in searching a real dataset against a large protein database in the last experiment. As shown in Figure 3(b), the running time of Xolik is stable when the MS1 tolerance increases. In contrast, the running time of ECL2 with 0.001 Da MS1 bin width increases roughly linearly. Compared with the running time with 0.01 Da MS1 bins, when the MS1 tolerance is large, e.g. 500 ppm, one additional order of magnitude on the MS1 mass precision (0.001 Da MS1 bins) requires 55 min extra running time for ECL2.
We provide the theoretical analysis of Xolik’s linear-time algorithm in the Supplementary Material. It proves that it is indeed a linear-time algorithm and the time complexity will not change with respect to the MS1 tolerance.
3.2 Analysis of synthetic disulfide-bridged peptides
We run Xolik using a synthetic disulfide-bridged peptide dataset (Wang et al., 2014) to examine its statistical power. As a comparison, we also run ECL2, pLink (version 1.22), pLink2 (version 2.2) and Kojak (version 1.5.3) using the same dataset. pLink-SS (Lu et al., 2015) is a special workflow of pLink to analyze disulfide-bridged datasets, and we use pLink-SS in this experiment. ECL2 currently does not support the customization on the cross-linked site [SS-bond links at Cysteine (C)], and pLink2 does not support collision-induced dissociation (CID) data when conducting SS-bond analysis. Consequently, we do not compare Xolik with ECL2 and pLink2. Please note that Xolik does not consider linear peptides. All peptide-spectral matches (PSMs) mentioned in the experiments below are cross-linked PSMs. The dataset we analyze contains three synthetic peptide libraries, which have the following three specific sequence patterns:
K[AW][DE]F[VSHY]A[DY]SCVA[KR]
[TW]A[LE]H[FV]SCVT[PSGY]F[KR]
[WA]VK[FL]C[DE]T[VSGY]FA[KR]
Please refer to the original paper (Wang et al., 2014) for a detailed description of the sample preparation and the XL-MS analysis.
To validate the statistical power of Xolik, we search these libraries against a target database containing all peptide sequences matching the above patterns. In total, 704 peptide sequences are in the target database. The MS1 tolerance is set at 50 ppm and the MS2 bin size for XCorr is set at 0.5 Da. The maximum number of missed cleavages is 2. The mass range of the candidate peptides is . The cross-linker used in the database search is SS-bond. The decoy database is constructed by reversing the protein sequences in the target database, and the FDR is controlled at 5% at the PSM level. For Kojak, we follow the whole identification procedure using Percolator (Kall et al., 2007) to control the FDR of the findings. FDRs are controlled separately for intra-protein and inter-protein identifications (Walzthoeni et al., 2012). Since Kojak will output multiple PSMs for one spectrum, for a fair comparison, we only keep the top match for each query spectrum on the controlled identifications. E-value estimation is enabled in Xolik and pLink as a default setting. Since Kojak does not have the E-value estimation option, we also run Xolik with E-value estimation disabled. All tools run on an Intel Core i5 3.30 GHz Windows desktop computer with 12 GB memory on 4 threads. Detailed comparisons of the search results are shown in Table 1.
Dataset | #PSMs: pLink | #PSMs: Kojak | #PSMs: Xolik | #PSMs: Xolik (no E-value) | #Correct: pLink | #Correct: Kojak | #Correct: Xolik | #Correct: Xolik (no E-value) | Accuracy: pLink | Accuracy: Kojak | Accuracy: Xolik | Accuracy: Xolik (no E-value) | Time (s): pLink | Time (s): Kojak | Time (s): Xolik | Time (s): Xolik (no E-value) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 61 | 1673 | 2091 | 2374 | 56 | 1559 | 1900 | 2184 | 91.8% | 93.2% | 90.9% | 92.0% | 453 | 53 | 24 | 10 |
2 | 1848 | 2962 | 3176 | 3647 | 1649 | 2797 | 2812 | 3296 | 89.2% | 94.4% | 88.5% | 90.4% | 364 | 36 | 13 | 8 |
3 | 176 | 1685 | 1113 | 1260 | 53 | 625 | 606 | 676 | 30.1% | 37.1% | 54.4% | 53.6% | 382 | 43 | 12 | 11 |
Total | 2085 | 6320 | 6380 | 7281 | 1758 | 4981 | 5318 | 6156 | 84.3% | 78.8% | 83.4% | 84.5% | 1199 | 132 | 49 | 29 |
Note: Xolik outperforms pLink and Kojak in terms of identified PSMs, identified correct PSMs and the running time. Accuracy is defined as #Correct/#PSMs. The second Xolik column in each group reports the results obtained by Xolik when the E-value estimation is disabled. All tools run on an Intel Core i5 3.30 GHz Windows desktop computer with 12 GB memory on 4 threads. Bold values in the table indicate the best results in the comparisons.
Each library only contains peptides following one specific sequence pattern. Therefore, we are able to evaluate the reported PSMs by comparing the identified sequences with the sequence pattern corresponding to the library. Only when both sequences of a cross-linked PSM match the sequence pattern is the reported peptide pair considered a correct identification.
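Because the three sequence patterns are written as character classes, the correctness check can be sketched with regular expressions. In the Python sketch below, the function names and the example PSMs are hypothetical, and requiring a full-length match of each peptide against the pattern is an assumption; the accuracy follows the definition #Correct/#PSMs used in Table 1.

```python
import re

# The three library patterns, used directly as regular expressions.
LIBRARY_PATTERNS = {
    1: re.compile(r"K[AW][DE]F[VSHY]A[DY]SCVA[KR]"),
    2: re.compile(r"[TW]A[LE]H[FV]SCVT[PSGY]F[KR]"),
    3: re.compile(r"[WA]VK[FL]C[DE]T[VSGY]FA[KR]"),
}


def is_correct(library, peptide_a, peptide_b):
    """A PSM counts as correct only if both peptides match the library pattern."""
    pattern = LIBRARY_PATTERNS[library]
    return (pattern.fullmatch(peptide_a) is not None
            and pattern.fullmatch(peptide_b) is not None)


def accuracy(psms, library):
    """Fraction of reported PSMs that are correct (#Correct / #PSMs)."""
    correct = sum(is_correct(library, a, b) for a, b in psms)
    return correct / len(psms)


# Made-up PSMs reported for library 1: the second pairs a library-1 peptide
# with a peptide that follows the library-2 pattern, so it is incorrect.
example_psms = [
    ("KADFVADSCVAK", "KWEFYAYSCVAR"),
    ("KADFVADSCVAK", "TALHFSCVTPFK"),
]
print(accuracy(example_psms, 1))  # 0.5
```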
As shown in Table 1, Xolik identifies more PSMs than Kojak and pLink in total. After manual examination of the identified sequence patterns, we find that Xolik also identifies more correct PSMs than Kojak and pLink in total, though Xolik identifies fewer correct PSMs than Kojak in the third dataset. This indicates that Xolik has a higher statistical power, which comes from the fact that Xolik exhaustively searches all candidate cross-linked peptides. Moreover, Xolik runs faster than Kojak and pLink. The second Xolik column in each group of Table 1 shows the results of Xolik when the E-value estimation is disabled. The results show that whether the E-value estimation is enabled or not does not affect the conclusion.
3.3 Analysis of Escherichia coli 30S and 50S ribosomal subunits
To illustrate the performance on real datasets, we run Xolik using an E. coli ribosome dataset (Lauber and Reilly, 2011). There are 48 mass spectra files analyzed in total. As a comparison, we also run ECL2, pLink, pLink2 and Kojak using the same dataset. pLink2 does not fully support the customized cross-linker diethyl suberthioimidate (DEST) (Lauber and Reilly, 2011). Therefore, we do not compare with pLink2 using this dataset. All proteins in E. coli 30S and 50S ribosomal subunits are included in the target database (55 proteins), and the decoy database is constructed by reversing the protein sequences in the target database. The MS1 tolerance is set at 5 ppm, and the MS2 bin size for XCorr is set at 0.02 Da. The maximum number of missed cleavages is 2. We set a fixed modification +57.02146 Da at Cysteine (C), and no variable modifications. The allowed mass range of the candidate peptides is . Since pLink cannot change the mass range of candidate peptides, we use the default setting when running pLink. The cross-linker used in the database search is DEST. E-value estimation and multithreading (4 threads) are enabled if applicable. (pLink cannot run in multithread settings when using customized cross-linkers, and Kojak does not have E-value estimation.) Due to the limited size of the reported identifications, when running Percolator on Kojak’s results, we add the ‘-F 0.1 -Y’ options to make Percolator encounter fewer warnings and obtain better results. All tools are deployed on an Intel Core i5 3.30 GHz Windows desktop computer with 12 GB memory. When running ECL2, all memory (12 GB) is assigned to the process. We control the FDR on the reported identifications at 5% using the methods bundled in each tool at the PSM level. The search results of all tools are shown in Figure 4.
As shown in Figure 4, Xolik reports more PSMs than pLink and Kojak, and reports a similar number of PSMs compared with ECL2. In terms of running time, Xolik significantly outperforms ECL2 and pLink, and shows a similar performance compared with Kojak. The difference between ECL2 and Xolik is only in the matching algorithm for pairing two single peptides. However, because the algorithm used in ECL2 relaxes the constraints on the MS1 tolerance, peptides outside the MS1 tolerance range are possibly reported by ECL2. Therefore, around the boundary of the tolerance range, ECL2 and Xolik may assign different labels to the query spectrum. This also affects the threshold determined by the FDR controlling algorithm. As a consequence, after controlling the FDR, the identification results between Xolik and ECL2 are partly different. Although Xolik reports more PSMs than ECL2 in this experiment, it is not clear whether a strict precursor mass constraint will lead to an increment of identifications or not.
The numbers of reported identifications are not stable using this dataset compared with the other experiments presented in this paper. After manually examining the intermediate results, we found that the small number of experimental MS2 spectra per data file is the likely reason for the unstable numbers. Kojak relies on Percolator for FDR control. Due to the small size of the training samples, Percolator sometimes reports ‘cannot find an initial direction with positive training examples’ during the classification. This is perhaps the drawback of training-based methods, though more and more data are becoming available, making sample size less and less an issue.
3.4 Analysis of Homo sapiens HeLa cell dataset
We also run Xolik using a human sample dataset (Makowski et al., 2016) to evaluate the performance when searching a large protein database. This dataset contains around MS2 spectra, and the whole human protein database (downloaded from UniProt on 2016.03.10, 20 198 proteins in total) is used in the database search. The decoy database is constructed by reversing the protein sequences in the target database, and the FDR is controlled at 5% at the PSM level. The MS1 tolerance is set at 5 ppm, and the MS2 bin size for XCorr is set at 0.02 Da. The maximum number of missed cleavages is 2. We set a fixed modification +57.02146 Da at Cysteine (C), and no variable modifications. The mass range of the candidate peptides is . The cross-linker used in the database search is BS3. We also run ECL2, pLink, pLink2 and Kojak using the same dataset for comparison. ECL2 cannot handle a database with more than 20 000 proteins because the Java platform used by ECL2 requires more than 32 GB memory during the analysis. As a consequence, ECL2 spends most of the time waiting for spare memory. Neither ECL2 nor pLink can finish the analysis within a week. Therefore, we only show the results of pLink2, Kojak and Xolik. We enable E-value estimation in pLink2 and Xolik, and enable multithreading in all tools (8 threads for Kojak and Xolik, and 7 threads for pLink2 because pLink2 does not allow users to specify the thread number as equal to the core number). All tools are deployed on an Intel Core i7 3.50 GHz Windows desktop computer with 32 GB memory. The results are shown in Figure 5. Xolik outperforms pLink2 and Kojak in terms of identified PSMs at the same FDR level. Also, Xolik runs faster than Kojak, even though Xolik searches exhaustively on all candidate cross-linked peptides. The running time of pLink2 and Xolik is roughly the same, while the search space of pLink2 is much smaller than that of Xolik.
4 Conclusion
In this paper, we propose a linear-time algorithm for finding cross-linked peptides with maximum paired scores in a protein sequence database. Instead of reducing the search space, the proposed algorithm exhaustively searches the entire search space. Using the double-ended queue to store the order of scores compared in previous iterations, the proposed algorithm achieves linear time complexity with respect to the number of candidate peptides in the database. Moreover, utilizing the lazy evaluation strategy together with the memoization technique, Xolik further reduces the computational cost. Experiments using a synthetic dataset and two empirical datasets show that Xolik outperforms existing tools in terms of running time and statistical power.
Xolik does not consider linear and mono-linked peptide interpretations of spectra. For the demand of interpreting the spectra in the dataset corresponding to linear and mono-linked peptides, search engines like Comet (Eng et al., 2013) and Mascot (Perkins et al., 1999) can be used. To resolve the conflict between linear peptide identifications and cross-linked peptide identifications, we may use a commonly used strategy that first runs linear peptide search, and then removes the confidently identified spectra from the dataset before searching the cross-linked peptides (Panchaud et al., 2010; Singh et al., 2008).
Acknowledgements
We thank the pLink Team for providing pLink2 for comparison, and for explaining the underlying strategies adopted by pLink2.
Funding
This work was partially supported by the theme-based project T12-402/13N from the Research Grant Council (RGC) of the Hong Kong S.A.R. government.
Conflict of Interest: none declared.
References
Author notes
The authors wish it to be known that, in their opinion, Jiaan Dai, Wei Jiang and Fengchao Yu should be regarded as Joint First Authors.