Journal of Proteomics

Volume 150, 6 January 2017, Pages 170-182

In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics

https://doi.org/10.1016/j.jprot.2016.08.002

Highlights

  • A new workflow system to benchmark protein inference algorithms in multi-search-engine studies.

  • Benchmark of five different protein inference algorithms on four different datasets, including two ground-truth datasets.

  • Different metrics were defined to evaluate the quality and accuracy of protein inference algorithms.

  • Benchmark of three different search engines and their combinations in proteomics studies.

Abstract

In mass spectrometry-based shotgun proteomics, protein identifications are usually the desired result. However, most of the analytical methods are based on the identification of reliable peptides and not on the direct identification of intact proteins. Thus, assembling the peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is a critical step in proteomics research. Currently, different protein inference algorithms and tools are available to the proteomics community. Here, we evaluated five software tools for protein inference (PIA, ProteinProphet, Fido, ProteinLP, MSBayesPro) using three popular database search engines: Mascot, X!Tandem, and MS-GF+. All the algorithms were evaluated using a highly customizable KNIME workflow on four different public datasets of varying complexity (different sample preparation, species and analytical instruments). We defined a set of quality control metrics to evaluate the performance of each combination of search engine, protein inference algorithm, and parameters on each dataset. We show that the results for complex samples vary not only regarding the actual numbers of reported protein groups but also concerning the composition of those groups. Furthermore, the robustness of the reported proteins when using databases of differing complexities is strongly dependent on the applied inference algorithm. Finally, merging the identifications of multiple search engines does not necessarily increase the number of reported proteins, but does increase the number of peptides per protein and thus can generally be recommended.

Significance

Protein inference is one of the major challenges in MS-based proteomics today. Currently, a vast number of protein inference algorithms and implementations are available to the proteomics community. Protein assembly impacts the final results of a study, the quantitation values and the claims made in the resulting manuscript. Even though protein inference is a crucial step in proteomics data analysis, a comprehensive evaluation of the many different inference methods has never been performed. The Journal of Proteomics has previously published several benchmarks of bioinformatics algorithms in proteomics studies (PMID: 26585461; PMID: 22728601), underlining the importance of such studies for the proteomics community and the journal's audience.

This manuscript presents a new bioinformatics solution based on the KNIME/OpenMS platform that aims at providing a fair comparison of protein inference algorithms (https://github.com/KNIME-OMICS). Five different algorithms - ProteinProphet, MSBayesPro, ProteinLP, Fido and PIA - were evaluated using the highly customizable workflow on four public datasets with varying complexities. Three popular database search engines - Mascot, X!Tandem and MS-GF+ - and combinations thereof were evaluated for every protein inference tool. In total, > 186 protein lists were analyzed and carefully compared using three metrics for quality assessment of the protein inference results: 1) the number of reported proteins, 2) the number of peptides per protein, and 3) the number of uniquely reported proteins per inference method. We also examined how many proteins were reported for each combination of search engines, protein inference algorithms and parameters on each dataset.
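As a concrete illustration of how these three metrics can be derived from inference output, the following minimal Python sketch computes them from a hypothetical mapping of each inference method to its reported proteins and supporting peptides; the data layout and values are assumptions for illustration, not the format used by the published workflow.

```python
# Minimal sketch (hypothetical data layout, not the published KNIME workflow):
# each inference method maps reported protein accessions to their supporting peptides.
results = {
    "PIA":            {"P1": {"pepA", "pepB"}, "P2": {"pepC"}},
    "ProteinProphet": {"P1": {"pepA"},          "P3": {"pepD", "pepE"}},
}

def report_metrics(results):
    for method, proteins in results.items():
        n_proteins = len(proteins)                     # 1) number of reported proteins
        peptides_per_protein = (
            sum(len(peps) for peps in proteins.values()) / n_proteins
            if n_proteins else 0.0                     # 2) mean peptides per protein
        )
        others = set().union(*(set(p) for m, p in results.items() if m != method))
        unique = set(proteins) - others                # 3) proteins reported only by this method
        print(f"{method}: {n_proteins} proteins, "
              f"{peptides_per_protein:.2f} peptides/protein, "
              f"{len(unique)} uniquely reported")

report_metrics(results)
```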

The results show that: 1) PIA or Fido seems to be a good choice for the analyzed workflows, regarding not only the reported proteins and high-quality identifications, but also the required runtime. 2) Merging the identifications of multiple search engines almost always gives more confident results and increases the number of peptides per protein group. 3) The use of databases containing not only the canonical forms but also known isoforms of proteins has a small impact on the number of reported proteins. Depending on the question behind the study, the detection of specific isoforms can compensate for the slightly shorter lists produced by parsimonious reporting. 4) The current workflow can easily be extended to support new algorithms and search engine combinations.

Introduction

Proteomics can be used to study the biological functions of proteins, their cellular localization, post-translational modifications (PTMs), and interactions between proteins [1], [2]. The field has seen great development in recent years due to advances in mass spectrometry (MS) instrumentation, the development of new analytical methods [3], [4], [5], and novel computational approaches [2]. Bottom-up proteomics is currently the standard analytical method to identify and quantify proteins based on the presence of peptides obtained by digestion of the protein mixture during sample preparation. Current computational approaches can typically be broken down into three main steps: 1) peptide identification, 2) quality assessment of the peptide identifications, and 3) the assembly of the identified peptides into a final protein list using protein inference algorithms [6], [7]. During peptide identification, peptide fragmentation spectra (MS/MS) are assigned to peptide sequences to generate a set of Peptide-Spectrum Matches (PSMs) using database search engines such as SEQUEST [8], Comet [9], Mascot [10], MS-GF+ [11], or X!Tandem [12]. Then, it is necessary to assess the reliability of these identifications [13], either by estimating collective false discovery rates or by assessing correctness probabilities for each PSM. Finally, the identified peptide sequences are assembled into a set of confident proteins, which enables protein quantitation or pathway analysis [14].
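The quality-assessment step is commonly implemented with the target-decoy strategy. The sketch below shows one simple way to turn ranked PSM scores into q-values; the scores are invented and the estimator (decoys/targets with a monotonic backward pass) is only one of several variants, not the exact procedure of any of the cited tools.

```python
# Minimal target-decoy q-value sketch (hypothetical PSM scores; higher = better).
# Not the exact FDR procedure of any cited tool.
psms = [  # (psm_id, score, is_decoy)
    ("psm1", 95.0, False), ("psm2", 90.0, True),
    ("psm3", 88.0, False), ("psm4", 60.0, False),
]

def assign_q_values(psms):
    ranked = sorted(psms, key=lambda p: p[1], reverse=True)
    targets = decoys = 0
    fdrs = []
    for _, _, is_decoy in ranked:
        decoys += is_decoy
        targets += not is_decoy
        fdrs.append(decoys / max(targets, 1))  # FDR estimate at this score threshold
    # q-value: minimal FDR at which this PSM would still be accepted
    q, qvals = 1.0, [0.0] * len(ranked)
    for i in range(len(ranked) - 1, -1, -1):
        q = min(q, fdrs[i])
        qvals[i] = q
    return {ranked[i][0]: qvals[i] for i in range(len(ranked))}

print(assign_q_values(psms))
```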

Ideally, protein inference produces, from the identified peptides, a protein list that contains all proteins present in the original sample prior to digestion. Unfortunately, ambiguities arise when an identified peptide sequence can be explained by more than one entry in a protein database [15]. Under certain assumptions, some of these ambiguities can be resolved by taking other peptide identifications, physicochemical properties, or quantities into account. Unfortunately, there are cases in which an ambiguity cannot be resolved, e.g. if two protein entries map to exactly the same set of identified peptides.
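To make the unresolvable case concrete, the short sketch below groups database entries that are explained by exactly the same set of identified peptides into a single ambiguity (protein) group; the peptide-to-protein map is invented for illustration and does not come from the manuscript.

```python
from collections import defaultdict

# Hypothetical peptide -> candidate proteins map (illustration only).
peptide_to_proteins = {
    "pepA": {"ProtX", "ProtY"},   # shared peptide: ambiguous on its own
    "pepB": {"ProtX", "ProtY"},
    "pepC": {"ProtZ"},            # unique peptide: unambiguous evidence for ProtZ
}

def group_indistinguishable(peptide_to_proteins):
    # Invert the map: which peptides support each protein?
    protein_to_peptides = defaultdict(set)
    for pep, prots in peptide_to_proteins.items():
        for prot in prots:
            protein_to_peptides[prot].add(pep)
    # Proteins explained by exactly the same peptide set form one group.
    groups = defaultdict(set)
    for prot, peps in protein_to_peptides.items():
        groups[frozenset(peps)].add(prot)
    return list(groups.values())

print(group_indistinguishable(peptide_to_proteins))
# e.g. [{'ProtX', 'ProtY'}, {'ProtZ'}] (set ordering may vary)
```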

In 2003, the PeptideProphet/ProteinProphet combination was published as one of the first sets of algorithms and tools to address the challenges of protein inference, using a probabilistic model [16]. ProteinProphet, a widely used algorithm integrated into the Trans-Proteomic Pipeline (TPP), employs an iterative heuristic probability model to estimate protein probabilities from peptide probabilities. Other algorithms have since been proposed using Bayesian methods [17] or linear programming [18], incorporating additional information such as the isoelectric point [19], retention time, or detectability during protein inference. As a result, several protein inference implementations are available to the proteomics community [20], including those provided by search engines such as Mascot or Andromeda [21]. In addition, a number of commercial tools provide protein inference, such as ProteomeDiscoverer (Thermo Scientific, http://www.thermoscientific.com/en/products/mass-spectrometry.html) and Scaffold. Despite this wide range of tools and algorithms, only a few evaluations have been performed to benchmark their performance [22], [23]. In 2012, Claassen and co-workers benchmarked ProteinProphet against different "gene locus inference" approaches, opening the field for further studies covering other inference approaches [22]. A thorough comparison is hampered by the large number of possible combinations of tools, problems with interoperability between tools (e.g., the use of proprietary file formats, insufficient documentation or platform dependence), and the lack of a clear set of metrics for an unbiased evaluation of performance.

Here, we evaluate and benchmark five leading tools for protein inference: ProteinProphet [16], MSBayesPro [24], ProteinLP [18], Fido [17] and PIA [25], [26], [27]. To achieve this, three popular search engines - Mascot, X!Tandem and MS-GF+ - and their combinations were used with every protein inference tool. We implemented a workflow in the highly customizable KNIME (https://www.knime.org/) workflow environment using a series of OpenMS [28] nodes and several new workflow nodes (https://github.com/KNIME-OMICS) to study all combinations of these search engines and inference algorithms. This approach is scalable to arbitrary numbers of algorithms. We provide different metrics to benchmark the algorithms under study. Among others, the number of reported proteins, the number of peptides per protein, and the number of uniquely reported proteins per inference method are used to evaluate the performance of each inference method. Four datasets of different complexities and from different species were employed to evaluate the performance of the protein inference algorithms, including one "gold standard" or "ground truth" dataset previously used to compare protein inference algorithms [25], [29]. The final results for complex samples (the yeast "gold standard" dataset and the human lung cancer dataset - PXD000603) vary not only regarding the actual numbers of protein groups but also concerning the composition of the reported groups. The robustness of the numbers of reported proteins when using databases of differing complexities depends on the applied inference algorithm. The results also show that merging the identifications of multiple search engines does not necessarily increase the number of reported proteins, but does increase the number of peptides per protein and thus can generally be recommended. At the same time, the present study shows that a proper selection of search engine and inference algorithm is crucial for the yield of information from proteomic datasets.
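The combinatorial scale of such a benchmark can be sketched directly: each non-empty combination of search engines is paired with every inference tool and dataset (the protein databases multiply the count further, which is how the manuscript arrives at several hundred protein lists). The enumeration below uses placeholder dataset names and is purely illustrative.

```python
from itertools import combinations, product

# Illustrative enumeration only; two dataset names are placeholders, and the
# protein databases used in the manuscript are not enumerated here.
engines = ["Mascot", "X!Tandem", "MS-GF+"]
inference_tools = ["PIA", "ProteinProphet", "Fido", "ProteinLP", "MSBayesPro"]
datasets = ["gold-standard yeast", "PXD000603", "dataset 3", "dataset 4"]

# All non-empty search engine combinations: single engines plus merged results.
engine_combos = [c for r in range(1, len(engines) + 1)
                 for c in combinations(engines, r)]

runs = list(product(engine_combos, inference_tools, datasets))
print(f"{len(engine_combos)} engine combinations x {len(inference_tools)} tools "
      f"x {len(datasets)} datasets = {len(runs)} benchmark runs")
```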

Section snippets

The benchmark workflow

The presented protein inference comparison workflow is based on KNIME and OpenMS [28]. We made use of the existing OpenMS nodes, but we also implemented additional nodes for some of the analyzed tools. The developed workflow can be split into seven different steps (Fig. 1). The first step (A) configures basic variables like the regular expression to identify decoys in the FASTA protein database and the allowed FDR q-value threshold. Also, if a gold-standard dataset is analyzed, the reference
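As a minimal sketch of the kind of settings configured in step (A), the snippet below defines a decoy-matching regular expression and an FDR q-value threshold; the variable names, the DECOY_ prefix and the default values are assumptions for illustration, not the actual settings of the published KNIME workflow.

```python
import re

# Hypothetical step-(A)-style configuration (illustration only; not the
# actual variable names or defaults of the published workflow).
CONFIG = {
    "decoy_regex": r"^DECOY_",   # how decoy entries are recognised in the FASTA database
    "fdr_q_value": 0.01,         # allowed q-value threshold for filtering
}

def is_decoy(accession, config=CONFIG):
    return re.search(config["decoy_regex"], accession) is not None

def passes_fdr(q_value, config=CONFIG):
    return q_value <= config["fdr_q_value"]

print(is_decoy("DECOY_sp|P12345|EXAMPLE"), passes_fdr(0.003))
```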

Multiple search engine assessment

Running the aforementioned workflow, we analyzed 420 different protein lists resulting from the combination of the three different search engines, the five inference tools and the four datasets using ten different databases. We analyzed the number of FDR-filtered PSMs for each single search engine and their combinations before performing any protein inference evaluation. The benefit of combining search engine results for spectrum identification has already been shown extensively in other publications
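One straightforward way to combine FDR-filtered PSMs from several engines before inference is to take, per spectrum, the union of accepted peptides, as in the sketch below; the identifiers are invented and this is only one possible combination strategy, not the specific merging performed by the workflow's nodes.

```python
from collections import defaultdict

# FDR-filtered PSMs per search engine: spectrum id -> identified peptide.
# All identifiers are invented for illustration.
mascot   = {"scan_001": "PEPTIDEK", "scan_002": "SAMPLER"}
xtandem  = {"scan_001": "PEPTIDEK", "scan_003": "ANOTHERK"}
msgfplus = {"scan_002": "SAMPLER",  "scan_003": "ANOTHERK"}

def combine_psms(*engine_results):
    combined = defaultdict(set)
    for results in engine_results:
        for spectrum, peptide in results.items():
            combined[spectrum].add(peptide)
    return combined

merged = combine_psms(mascot, xtandem, msgfplus)
print({spectrum: sorted(peps) for spectrum, peps in merged.items()})
# Conflicting peptides for the same spectrum are kept here and would need to be
# resolved (e.g. by score) before protein inference.
```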

Discussion

We have evaluated in detail the performance of different inference algorithms using four different datasets and a set of well-defined metrics. MSBayesPro needs detectability predictions for each peptide as input to the inference algorithm. These values can only be calculated using the results of preceding experiments or estimated using algorithms like the PTModel. Both modelling approaches have drawbacks when experimenting with analytical methods (e.g., enrichment, different fractionation

Conclusion

We introduced a workflow that uses three search engines and five open-source and generally applicable protein inference algorithms for the fair and in-depth comparison of protein inference results. The workflow and inference methods were tested on four datasets with different complexities of protein databases. While there is no explicit best inference algorithm, different considerations for choosing a tool can be given.

The analysis of identifications using protein databases with varying

Availability

All of the analyzed protein inference algorithms are available as KNIME nodes and can be used together with OpenMS workflows to yield protein identification lists. The designed workflow also allows the exchange of the tested inference algorithms and thus comprehensive benchmarking of new implementations. The plugins for the newly developed nodes and the complete workflow are available as open source on https://github.com/KNIME-OMICS. The workflows, search engine results and all of the final

Transparency document

Acknowledgements

E.A. was supported by a grant from the Boehringer Ingelheim Fonds; J.U. and T.S. are funded by the BMBF grant de.NBI - German Network for Bioinformatics Infrastructure (FKZ 031 A 534A and FKZ 031 A 535A, respectively); funding of M.E. is related to PURE and Valibio, projects of North Rhine-Westphalia; Y.P.-R. is supported by the BBSRC ‘PROCESS’ grant [BB/K01997X/1].

References (53)

  • A. Altelaar et al., Next-generation proteomics: towards an integrative view of proteome dynamics, Nat. Rev. Genet. (2013)

  • Y. Perez-Riverol et al., Open source libraries and frameworks for mass spectrometry based proteomics: a developer's perspective, Biochim. Biophys. Acta 1844 (2014)

  • Y. Ramos et al., Peptide fractionation by acid pH SDS-free electrophoresis, Electrophoresis (2011)

  • Y. Ramos et al., Proteomics based on peptide fractionation by SDS-free PAGE, J. Proteome Res. (2008)

  • L. Martens, Bioinformatics challenges in mass spectrometry-driven proteomics, Methods Mol. Biol. (2011)

  • J.K. Eng et al., Comet: an open-source MS/MS sequence database search tool, Proteomics (2013)

  • D.N. Perkins et al., Probability-based protein identification by searching sequence databases using mass spectrometry data, Electrophoresis (1999)

  • S. Kim et al., MS-GF+ makes progress towards a universal database search tool for proteomics, Nat. Commun. (2014)

  • R. Craig et al., TANDEM: matching proteins with tandem mass spectra, Bioinformatics (2004)

  • T.N. Villavicencio-Diaz et al., Bioinformatics tools for the functional interpretation of quantitative proteomics results, Curr. Top. Med. Chem. (2014)

  • A.I. Nesvizhskii et al., A statistical model for identifying proteins by tandem mass spectrometry, Anal. Chem. (2003)

  • O. Serang et al., Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data, J. Proteome Res. (2010)

  • T. Huang et al., A linear programming model for protein inference problem in shotgun proteomics, Bioinformatics (2012)

  • T. Huang et al., Protein inference: a review, Brief. Bioinform. (2012)

  • J. Cox et al., Andromeda: a peptide search engine integrated into the MaxQuant environment, J. Proteome Res. (2011)

  • M. Claassen et al., Generic comparison of protein inference engines, Mol. Cell. Proteomics (2012)
1 Both authors contributed equally to this work.
