Abstract
Protein sequencing algorithms process data from a variety of instruments, generated under diverse experimental conditions. Currently there is no way to predict the accuracy of an algorithm for a given data set. Most published algorithms and their associated software have been evaluated on a limited number of experimental data sets. These performance evaluations do not cover the complete search space that the algorithm and the software might encounter in the real world. To this end, we present a database of simulated spectra that can be used to benchmark any spectra-to-peptide search engine. We demonstrate the usability of this database by benchmarking two popular peptide sequencing engines. We show wide variation in the accuracy of peptide deductions and that a complete quality profile of a given algorithm can be useful for practitioners and algorithm developers. All benchmarking data is available at https://users.cs.fiu.edu/~fsaeed/Benchmark.html
Data availability
The proposed database is available at: https://users.cs.fiu.edu/~fsaeed/Benchmark.html
References
Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422(6928):198
Arbelaez P, Maire M, Fowlkes C, Malik J (2011) Contour detection and hierarchical image segmentation. IEEE Trans Pattern Anal Mach Intell 33(5):898–916
Diament BJ, Noble WS (2011) Faster SEQUEST searching for peptide identification from tandem mass spectra. J Proteome Res 10(9):3871–3879
Ebhardt HA, Root A, Sander C, Aebersold R (2015) Applications of targeted proteomics in systems biology and translational medicine. Proteomics 15(18):3193–3208
Elias JE, Gygi SP (2007) Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 4(3):207–214
Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M (2018) Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Research 7
Gul Awan M, Saeed F (2016) MS-reduce: an ultrafast technique for reduction of big mass spectrometry data for high-throughput processing. Bioinformatics 32(10):1518–1526
Gul Awan M, Saeed F (2018) MaSS-Simulator: a highly configurable simulator for generating MS/MS datasets for benchmarking of proteomics algorithms. Proteomics 18(20):1800206
Iglesias-Gato D, Wikström P, Tyanova S, Lavallee C, Thysell E, Carlsson J, Hägglöf C, Cox J, Andrén O, Stattin P et al (2016) The proteome of primary prostate cancer. Eur Urol 69(5):942–952
Käll L, Canterbury JD, Weston J, Noble WS, MacCoss MJ (2007) Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods 4(11):923
Käll L, Storey JD, MacCoss MJ, Noble WS (2008) Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J Proteome Res 7(1):29–34
Keller A, Nesvizhskii AI, Kolker E, Aebersold R (2002) Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem 74(20):5383–5392
Kong AT, Leprevost FV, Avtonomov DM, Mellacheruvu D, Nesvizhskii AI (2017) MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat Methods 14(5):513
Ma B (2015) Novor: real-time peptide de novo sequencing software. J Am Soc Mass Spectrom 26(11):1885–1894
McIlwain S, Tamura K, Kertesz-Farkas A, Grant CE, Diament B, Frewen B, Howbert JJ, Hoopmann MR, Käll L, Eng JK et al (2014) Crux: rapid open source protein tandem mass spectrometry analysis. J Proteome Res 13(10):4488–4491
Costa PM, Fadeel B (2016) Emerging systems biology approaches in nanotoxicology: towards a mechanism-based understanding of nanomaterial hazard and risk. Toxicol Appl Pharmacol 299:101–111
Saeed F (2015) Big data proteogenomics and high performance computing: challenges and opportunities. In: 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, pp 141–145
Savitski MM, Wilhelm M, Hahne H, Kuster B, Bantscheff M (2015) A scalable approach for protein false discovery rate estimation in large proteomic data sets. Mol Cell Proteom 14(9):2394–2404
Shteynberg D, Deutsch EW, Lam H, Eng JK, Sun Z, Tasman N, Mendoza L, Moritz RL, Aebersold R, Nesvizhskii AI (2011) iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol Cell Proteom 10(12):M111-007690
Tsai T-H, Song E, Zhu R, Di Poto C, Wang M, Luo Y, Varghese RS, Tadesse MG, Ziada DH, Desai CS et al (2015) LC-MS/MS-based serum proteomics for identification of candidate biomarkers for hepatocellular carcinoma. Proteomics 15(13):2369–2381
Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9(2):513–530
Acknowledgements
Research reported in this paper was supported by NIGMS of the National Institutes of Health under award number R01GM134384. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Fahad Saeed was further supported by the National Science Foundation (NSF) under award number NSF CAREER OAC-1925960. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation.
Author information
Contributions
MA devised the method and database, wrote the software for generation of the database and wrote the manuscript. AA performed experiments and analyzed the results. FS proposed the initial idea, designed the experiments, supervised the research, and wrote and edited the manuscript.
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Code availability
Proposed databases were generated using the MaSS-Simulator software (Gul Awan and Saeed 2018).
About this article
Cite this article
Awan, M.G., Awan, A.G. & Saeed, F. Benchmarking mass spectrometry based proteomics algorithms using a simulated database. Netw Model Anal Health Inform Bioinforma 10, 23 (2021). https://doi.org/10.1007/s13721-021-00298-3