In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics
Introduction
Proteomics is used to study the biological functions of proteins, their cellular localization, post-translational modifications (PTMs), and protein-protein interactions [1], [2]. The field has advanced considerably in recent years thanks to improvements in mass spectrometry (MS) instrumentation, new analytical methods [3], [4], [5], and novel computational approaches [2]. Bottom-up proteomics is currently the standard analytical method for identifying and quantifying proteins based on the peptides obtained by digesting the protein mixture during sample preparation. Current computational approaches can typically be broken down into three main steps: 1) peptide identification, 2) quality assessment of the peptide identifications, and 3) assembly of the identified peptides into a final protein list using protein inference algorithms [6], [7]. During peptide identification, peptide fragmentation spectra (MS/MS) are assigned to peptide sequences by database search engines such as SEQUEST [8], Comet [9], Mascot [10], MS-GF+ [11], or X!Tandem [12], yielding a set of peptide-spectrum matches (PSMs). The reliability of these identifications must then be assessed [13], either by estimating collective false discovery rates or by assigning a correctness probability to each PSM. Finally, the identified peptide sequences are assembled into a set of confident proteins, which enables protein quantitation or pathway analysis [14].
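As an illustration of the second step, the following minimal sketch estimates q-values for a list of PSMs. It assumes a target-decoy strategy (an assumption here; the benchmark workflow described later identifies decoys in the FASTA database), scores where higher is better, and the simple estimate FDR = decoys / targets. The function name `q_values` and the input layout are hypothetical and not taken from any of the tools discussed.

```python
from typing import List, Tuple


def q_values(psms: List[Tuple[float, bool]]) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) tuples sorted by descending score.

    psms: (score, is_decoy) pairs; a higher score means a better match.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate at each rank: decoys seen so far / targets seen so far.
    targets = 0
    decoys = 0
    fdrs = []
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # The q-value of a PSM is the smallest FDR at its rank or any lower rank,
    # which makes q-values monotone along the ranked list.
    qvals = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvals.append(running_min)
    qvals.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, qvals)]


if __name__ == "__main__":
    example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, True)]
    for score, is_decoy, q in q_values(example):
        print(score, "decoy" if is_decoy else "target", round(q, 3))
```

Filtering at a given threshold then amounts to keeping the target PSMs whose q-value does not exceed it.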
Ideally, protein inference produces, from the identified peptides, a list of all proteins present in the original sample prior to digestion. However, ambiguities arise when an identified peptide sequence can be explained by more than one entry in a protein database [15]. Under certain assumptions, some of these ambiguities can be resolved by taking other peptide identifications, physicochemical properties, or quantities into account. In other cases the ambiguity cannot be resolved, e.g. if two protein entries map to exactly the same set of identified peptides.
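The unresolvable case can be made concrete with a small sketch that collects protein entries sharing exactly the same set of identified peptides into one group. This is only an illustration of the ambiguity, not the procedure of any tool evaluated here; the function name and the peptide and accession labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def group_indistinguishable(peptide_to_proteins: Dict[str, Set[str]]) -> List[List[str]]:
    """Group protein accessions that are supported by identical peptide sets."""
    # Invert the mapping: protein accession -> set of identified peptides.
    protein_to_peptides: Dict[str, Set[str]] = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # Proteins with the same peptide set cannot be told apart from this evidence.
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)

    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    mapping = {
        "PEPA": {"P1", "P2"},  # shared peptide
        "PEPB": {"P1", "P2"},  # shared peptide
        "PEPC": {"P3"},        # unique to P3
    }
    print(group_indistinguishable(mapping))
```

In this toy input P1 and P2 are backed by the same two peptides, so they end up in one group, while P3 forms its own.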
In 2003, the PeptideProphet/ProteinProphet combination was published as one of the first sets of algorithms and tools to address the challenges of protein inference using a probabilistic model [16]. ProteinProphet, a widely used algorithm integrated into the Trans-Proteomic Pipeline (TPP), employs an iterative heuristic probability model to estimate protein probabilities from peptide probabilities. Other algorithms have been proposed that use Bayesian methods [17] or linear programming [18], or that incorporate additional information such as the isoelectric point [19], retention time, or detectability during protein inference. As a result, several protein inference implementations are available to the proteomics community [20], including those provided by search engines such as Mascot or Andromeda [21]. In addition, a number of commercial tools provide protein inference, such as ProteomeDiscoverer (Thermo Scientific, http://www.thermoscientific.com/en/products/mass-spectrometry.html) and Scaffold. Despite this wide range of tools and algorithms, only a few evaluations have been performed to benchmark their performance [22], [23]. In 2012, Claassen and co-workers benchmarked ProteinProphet against different “gene locus inference” approaches and opened the field to further studies covering other inference approaches [22]. A thorough comparison is hampered by the large number of possible combinations of tools, by problems with tool interoperability (e.g., proprietary file formats, insufficient documentation or platform dependence), and by the lack of a clear set of metrics for unbiased performance evaluation.
Here, we evaluate and benchmark five leading tools for protein inference: ProteinProphet [16], MSBayesPro [24], ProteinLP [18], Fido [17] and PIA [25], [26], [27]. To achieve this, three popular search engines (Mascot, X!Tandem and MS-GF+) and their combinations were used with every protein inference tool. We implemented a workflow in the highly customizable KNIME (https://www.knime.org/) workflow environment using a series of OpenMS [28] nodes and several new workflow nodes (https://github.com/KNIME-OMICS) to study all combinations of these search engines and inference algorithms. This approach is scalable to arbitrary numbers of algorithms. We provide different metrics to benchmark the algorithms under study. Among others, the numbers of reported proteins, peptides per protein, and uniquely reported proteins per inference method are used to evaluate the performance of each inference method. Four datasets of different complexities and from different species were employed, including one “gold standard” or “ground truth” dataset previously used to compare protein inference algorithms [25], [29]. The final results for complex samples (the yeast “gold standard” dataset and the human lung cancer dataset, PXD000603) vary not only in the number of protein groups but also in which groups are actually reported. How robust the number of reported proteins is to databases of differing complexity depends on the inference algorithm applied. The final results also showed that merging the identifications of multiple search engines does not necessarily increase the number of reported proteins, but does increase the number of peptides per protein and can thus generally be recommended. At the same time, the present study shows that proper selection of search engine and inference algorithm is crucial to the yield of information from proteomic data sets.
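The three metrics named above can be sketched as follows. This is a minimal illustration, assuming each inference method's output has already been reduced to a mapping from reported protein (or protein group) identifier to its supporting peptides; the function name `summarize`, the dictionary keys, and the tool labels are hypothetical and do not reflect the file formats used in the workflow.

```python
from typing import Dict, Set


def summarize(results: Dict[str, Dict[str, Set[str]]]) -> Dict[str, Dict[str, float]]:
    """Compute simple comparison metrics per inference method.

    results: method name -> {protein identifier -> set of supporting peptides}.
    """
    summary: Dict[str, Dict[str, float]] = {}
    for method, proteins in results.items():
        # Proteins reported by any of the other methods.
        reported_elsewhere: Set[str] = set()
        for other_method, other_proteins in results.items():
            if other_method != method:
                reported_elsewhere.update(other_proteins)

        n_reported = len(proteins)
        n_peptides = sum(len(peptides) for peptides in proteins.values())
        summary[method] = {
            "reported": n_reported,
            "mean_peptides_per_protein": n_peptides / n_reported if n_reported else 0.0,
            "unique_to_method": len(set(proteins) - reported_elsewhere),
        }
    return summary


if __name__ == "__main__":
    toy = {
        "toolA": {"P1": {"a", "b"}, "P2": {"c"}},
        "toolB": {"P1": {"a"}, "P3": {"d", "e", "f"}},
    }
    for method, metrics in summarize(toy).items():
        print(method, metrics)
```

Comparing protein groups rather than single accessions would additionally require a rule for when two groups count as the same, which this sketch leaves out.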
The benchmark workflow
The presented protein inference comparison workflow is based on KNIME and OpenMS [28]. We made use of the existing OpenMS nodes, but we also implemented additional nodes for some of the analyzed tools. The developed workflow can be split into seven different steps (Fig. 1). The first step (A) configures basic variables like the regular expression to identify decoys in the FASTA protein database and the allowed FDR q-value threshold. Also, if a gold-standard dataset is analyzed, the reference
Multiple search engine assessment
Running the aforementioned workflow, we analyzed 420 different protein lists due to the combination of the three different search engines, the five inference tools and the four datasets using ten different databases. We analyzed the number of FDR filtered PSMs for each single search engine and their combinations before performing any protein inference evaluation. The benefit of combining search engine results for spectrum identification has already been shown extensively in other publications
Discussion
We have evaluated in detail the performance of different inference algorithms using four different datasets and a set of well-defined metrics. MSBayesPro needs detectability predictions for each peptide as an input to the inference algorithm. These values can only be calculated from the results of preceding experiments or estimated using algorithms like the PTModel. Both modelling approaches have drawbacks when experimenting with analytical methods (e.g., enrichment, different fractionation
Conclusion
We introduced a workflow that uses three search engines and five open-source and generally applicable protein inference algorithms for the fair and in-depth comparison of protein inference results. The workflow and inference methods were tested on four datasets with different complexities of protein databases. While there is no explicit best inference algorithm, different considerations for choosing a tool can be given.
The analysis of identifications using protein databases with varying
Availability
All of the analyzed protein inference algorithms are available as KNIME nodes and can be used together with OpenMS workflows to yield protein identification lists. The designed workflow also allows the exchange of the tested inference algorithms and thus comprehensive benchmarking of new implementations. The plugins for the newly developed nodes and the complete workflow are available as open source on https://github.com/KNIME-OMICS. The workflows, search engine results and all of the final
Acknowledgements
E.A. was supported by a grant from the Boehringer Ingelheim Fonds; J.U. and T.S. are funded by the BMBF grant de.NBI - German Network for Bioinformatics Infrastructure (FKZ 031 A 534A and FKZ 031 A 535A, respectively); funding of M.E. is related to PURE and Valibio, projects of North Rhine-Westphalia; Y.P-R. is supported by the BBSRC ‘PROCESS’ grant [BB/K01997X/1].
References (53)
- et al. SCX charge state selective separation of tryptic peptides combined with 2D-RP-HPLC allows for detailed proteome mapping. J. Proteome (2013)
- et al. Computational proteomics pitfalls and challenges: HavanaBioinfo 2012 workshop report. J. Proteome (2013)
- et al. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. (1994)
- A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J. Proteome (2010)
- et al. In silico analysis of accurate proteomics, complemented by selective isolation of peptides. J. Proteome (2011)
- et al. Isoelectric point optimization using peptide descriptors and support vector machines. J. Proteome (2012)
- et al. PRIDE Inspector Toolsuite: moving toward a universal visualization tool for proteomics data standard formats and quality assessment of ProteomeXchange datasets. Mol. Cell. Proteomics (2016)
- et al. Combining results of multiple search engines in proteomics. Mol. Cell. Proteomics (2013)
- et al. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol. Cell. Proteomics (2009)
- et al. The mzTab data exchange format: communicating mass-spectrometry-based proteomics and metabolomics experimental results to a wider audience. Mol. Cell. Proteomics (2014)
- Next-generation proteomics: towards an integrative view of proteome dynamics. Nat. Rev. Genet.
- Open source libraries and frameworks for mass spectrometry based proteomics: a developer's perspective. Biochim. Biophys. Acta
- Peptide fractionation by acid pH SDS-free electrophoresis. Electrophoresis
- Proteomics based on peptide fractionation by SDS-free PAGE. J. Proteome Res.
- Bioinformatics challenges in mass spectrometry-driven proteomics. Methods Mol. Biol.
- Comet: an open-source MS/MS sequence database search tool. Proteomics
- Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis
- MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun.
- TANDEM: matching proteins with tandem mass spectra. Bioinformatics
- Bioinformatics tools for the functional interpretation of quantitative proteomics results. Curr. Top. Med. Chem.
- A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem.
- Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data. J. Proteome Res.
- A linear programming model for protein inference problem in shotgun proteomics. Bioinformatics
- Protein inference: a review. Brief. Bioinform.
- Andromeda: a peptide search engine integrated into the MaxQuant environment. J. Proteome Res.
- Generic comparison of protein inference engines. Mol. Cell. Proteomics
1 Both authors contributed equally to this work.