Mitigating the missing-fragmentation problem in de novo peptide sequencing with a two-stage graph-based deep learning model

A preprint version of the article is available at Research Square.

Abstract

Novel protein discovery and immunopeptidomics depend on highly sensitive de novo peptide sequencing with tandem mass spectrometry. Despite notable improvement using deep learning models, the missing-fragmentation problem remains an important hurdle that severely degrades the performance of de novo peptide sequencing. Here we reveal that in the process of peptide prediction, missing fragmentation results in the generation of incorrect amino acids within those regions and causes error accumulation thereafter. To tackle this problem, we propose GraphNovo, a two-stage de novo peptide-sequencing algorithm based on a graph neural network. GraphNovo focuses on finding the optimal path in the first stage to guide the sequence prediction in the second stage. Our experiments demonstrate that GraphNovo mitigates the effects of missing fragmentation and outperforms the state-of-the-art de novo peptide-sequencing algorithms.
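
To make the missing-fragmentation problem concrete, the following toy sketch shows why an unobserved cleavage site creates ambiguity. It is not GraphNovo's implementation. It only illustrates the classic idea of matching mass gaps between consecutive fragment masses to amino acid residues. The function names (`prefix_masses`, `explain_gap`, `gap_labels`), the small residue table and the 0.01 Da tolerance are all assumptions chosen for illustration.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da for a small, illustrative subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
    "L": 113.08406,
    "Q": 128.05858,
}


def prefix_masses(peptide):
    """Cumulative residue masses: one value per cleavage site, plus start and end."""
    masses = [0.0]
    for aa in peptide:
        masses.append(masses[-1] + RESIDUE_MASS[aa])
    return masses


def explain_gap(delta, tol=0.01, max_len=2):
    """Return every unordered residue combination (up to max_len residues)
    whose total mass matches delta within tol."""
    hits = []
    for k in range(1, max_len + 1):
        for combo in combinations_with_replacement(sorted(RESIDUE_MASS), k):
            if abs(sum(RESIDUE_MASS[a] for a in combo) - delta) <= tol:
                hits.append("".join(combo))
    return hits


def gap_labels(nodes, tol=0.01):
    """Label each gap between consecutive observed masses with the combinations that fit."""
    return [explain_gap(b - a, tol) for a, b in zip(nodes, nodes[1:])]


if __name__ == "__main__":
    full = prefix_masses("GASVL")
    # Every cleavage site observed: each gap maps to exactly one residue.
    print(gap_labels(full))
    # Drop the site between G and A to simulate missing fragmentation.
    observed = full[:1] + full[2:]
    print(gap_labels(observed))
```

With every cleavage site present, each gap in this example matches a single residue. Once the site between G and A is dropped, the first gap (about 128.06 Da) is explained both by the single residue Q and by the pair A+G in either order, so a sequencer that predicts one amino acid at a time has no fragment evidence to choose among them.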

Fig. 1: Overview of GraphNovo.
Fig. 2: The model details of GraphNovo.
Fig. 3: Overall evaluation.
Fig. 4: Evaluation of GraphNovo-PathSearcher.
Fig. 5: GraphNovo mitigates the missing-fragmentation problem.
Fig. 6: The degree of missing fragmentation affecting model performance.

Data availability

The source data and trained models for all experiments reported in this paper are accessible at https://doi.org/10.5281/zenodo.8000316 (ref. 49).

Code availability

The source code of GraphNovo is available on GitHub at https://github.com/AmadeusloveIris/Graphnovo (ref. 50).

References

  1. Angel, T. E. et al. Mass spectrometry-based proteomics: existing capabilities and future directions. Chem. Soc. Rev. 41, 3912–3928 (2012).

  2. Dančík, V., Addona, T. A., Clauser, K. R., Vath, J. E. & Pevzner, P. A. De novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 6, 327–342 (1999).

  3. Griss, J. Spectral library searching in proteomics. Proteomics 16, 729–740 (2016).

  4. Fernandez-de Cossio, J. et al. Automated interpretation of high-energy collision-induced dissociation spectra of singly protonated peptides by ‘seqms’, a software aid for de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 12, 1867–1878 (1998).

  5. Lu, B. & Chen, T. Algorithms for de novo peptide sequencing using tandem mass spectrometry. Drug Discov. Today BioSilico 2, 85–90 (2004).

  6. Tran, N. H. et al. Complete de novo assembly of monoclonal antibody sequences. Sci. Rep. 6, 31730 (2016).

  7. Tran, N. H. et al. Personalized deep learning of individual immunopeptidomes to identify neoantigens for cancer vaccines. Nat. Mach. Intell. 2, 764–771 (2020).

  8. Vitorino, R. et al. De novo sequencing of proteins by mass spectrometry. Expert Rev. Proteomics 17, 595–607 (2020).

  9. Muth, T. & Renard, B. Y. Evaluating de novo sequencing in proteomics: already an accurate alternative to database-driven peptide identification? Brief. Bioinform. 19, 954–970 (2018).

  10. Muth, T., Renard, B. Y. & Martens, L. Metaproteomic data analysis at a glance: advances in computational microbial community proteomics. Expert Rev. Proteomics 13, 757–769 (2016).

  11. Kuhring, M. & Renard, B. Y. Estimating the computational limits of detection of microbial non-model organisms. Proteomics 15, 3580–3584 (2015).

  12. Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 11, 1114–1125 (2014).

  13. Jagannath, S. & Sabareesh, V. Peptide fragment ion analyser (PFIA): a simple and versatile tool for the interpretation of tandem mass spectrometric data and de novo sequencing of peptides. Rapid Commun. Mass Spectrom. 21, 3033–3038 (2007).

  14. Chen, T., Kao, M.-Y., Tepel, M., Rush, J. & Church, G. M. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 8, 325–337 (2001).

  15. Ma, B. et al. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 17, 2337–2342 (2003).

  16. Mo, L., Dutta, D., Wan, Y. & Chen, T. MSNovo: a dynamic programming algorithm for de novo peptide sequencing via tandem mass spectrometry. Anal. Chem. 79, 4870–4878 (2007).

  17. Taylor, J. A. & Johnson, R. S. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Anal. Chem. 73, 2594–2604 (2001).

  18. Chi, H. et al. pNovo: de novo peptide sequencing and identification using HCD spectra. J. Proteome Res. 9, 2713–2724 (2010).

  19. Yang, H. et al. Open-pNovo: de novo peptide sequencing with thousands of protein modifications. J. Proteome Res. 16, 645–654 (2017).

  20. Chi, H. et al. pNovo+: de novo peptide sequencing using complementary HCD and ETD tandem mass spectra. J. Proteome Res. 12, 615–625 (2013).

  21. Fischer, B. et al. NovoHMM: a hidden Markov model for de novo peptide sequencing. Anal. Chem. 77, 7265–7273 (2005).

  22. Frank, A. & Pevzner, P. PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal. Chem. 77, 964–973 (2005).

  23. Tran, N. H., Zhang, X., Xin, L., Shan, B. & Li, M. De novo peptide sequencing by deep learning. Proc. Natl Acad. Sci. USA 114, 8247–8252 (2017).

  24. Yang, H., Chi, H., Zeng, W.-F., Zhou, W.-J. & He, S.-M. pNovo 3: precise de novo peptide sequencing using a learning-to-rank framework. Bioinformatics 35, i183–i190 (2019).

  25. Qiao, R. et al. Computationally instrument-resolution-independent de novo peptide sequencing for high-resolution devices. Nat. Mach. Intell. 3, 420–425 (2021).

  26. Yilmaz, M., Fondrie, W., Bittremieux, W., Oh, S. & Noble, W. S. De novo mass spectrometry peptide sequencing with a transformer model. In Proc. 39th International Conference on Machine Learning, Proc. Machine Learning Research Vol. 162 (eds Chaudhuri, K. et al.) 25514–25522 (PMLR, 2022).

  27. McDonnell, K., Howley, E. & Abram, F. The impact of noise and missing fragmentation cleavages on de novo peptide identification algorithms. Comput. Struct. Biotechnol. J. 20, 1402–1412 (2022).

  28. Ying, C. et al. Do transformers really perform badly for graph representation? Adv. Neural Inf. Process. Syst. 34, 28877–28888 (2021).

  29. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).

  30. Ma, B. Novor: real-time peptide de novo sequencing software. J. Am. Soc. Mass Spectrom. 26, 1885–1894 (2015).

  31. Grossmann, J. et al. AUDENS: a tool for automated peptide de novo sequencing. J. Proteome Res. 4, 1768–1774 (2005).

  32. Roepstorff, P. & Fohlman, J. Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed. Mass Spectrom. 11, 601–605 (1984).

  33. Frese, C. K. et al. Toward full peptide sequence coverage by dual fragmentation combining electron-transfer and higher-energy collision dissociation tandem mass spectrometry. Anal. Chem. 84, 9668–9673 (2012).

  34. Baba, T. et al. Dissociation of biomolecules by an intense low-energy electron beam in a high sensitivity time-of-flight mass spectrometer. J. Am. Soc. Mass Spectrom. 32, 1964–1975 (2021).

  35. Qi, C. R., Su, H., Mo, K. & Guibas, L. J. PointNet: deep learning on point sets for 3D classification and segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 77–85 (2017).

  36. Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR 7, 1–9 (2019).

  37. Shazeer, N., Lan, Z., Cheng, Y., Ding, N. & Hou, L. Talking-heads attention. Preprint at https://arxiv.org/abs/2003.02436 (2020).

  38. Bhojanapalli, S., Yun, C., Rawat, A. S., Reddi, S. J. & Kumar, S. Low-rank bottleneck in multi-head attention models. Proceedings of Machine Learning Research 119, 864–873 (2020).

  39. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).

  40. Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In ICLR 7, 1–18 (2019).

  41. Biewald, L. Experiment Tracking with Weights and Biases (Weights & Biases, 2020); https://www.wandb.com/

  42. Yadan, O. Hydra—a framework for elegantly configuring complex applications. GitHub https://github.com/facebookresearch/hydra (2019).

  43. Eng, J. K., McCormack, A. L. & Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994).

  44. Käll, L., Canterbury, J. D., Weston, J., Noble, W. S. & MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 4, 923–925 (2007).

  45. Meier, F., Geyer, P. E., Virreira Winter, S., Cox, J. & Mann, M. BoxCar acquisition method enables single-shot proteomics at a depth of 10,000 proteins in 100 minutes. Nat. Methods 15, 440–448 (2018).

  46. Fíla, J. et al. The beta subunit of nascent polypeptide associated complex plays a role in flowers and siliques development of Arabidopsis thaliana. Int. J. Mol. Sci. 21, 2065 (2020).

  47. Tharyan, R. G. et al. NFYB-1 regulates mitochondrial function and longevity via lysosomal prosaposin. Nat. Metab. 2, 387–396 (2020).

  48. Yu, Y. et al. Predictive signatures of 19 antibiotic-induced Escherichia coli proteomes. ACS Infect. Dis. 6, 2120–2129 (2020).

  49. Zeping, M. & Ruixue, Z. Graphnovo dataset and checkpoint. Zenodo https://doi.org/10.5281/zenodo.8000316 (2023).

  50. Zeping, M. & Ruixue, Z. Amadeusloveiris/graphnovo: Nature Machine Intelligence original code. Zenodo https://doi.org/10.5281/zenodo.7996510 (2023).

Acknowledgements

This work is partially supported by the National Key R&D Program of China grant 2022YFA1304603 (M.L.), the Canada Research Chair programme (M.L.) and NSERC grant OGP0046506 (M.L.). We thank B. Shan, N. H. Tran and X. Cui for discussions.

Author information

Contributions

Z.M. conceived the initial idea and the prototype of the model. Z.M. and R.Z. implemented the proposed algorithm. R.Z. evaluated the results and did the data analysis. Z.M. and R.Z. wrote the paper and all authors contributed to improving the paper. M.L. supervised the research project. M.L. and L.X. revised the paper.

Corresponding author

Correspondence to Ming Li.

Ethics declarations

Competing interests

L.X. is an employee of Bioinformatics Solutions Inc. The other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Ting Chen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Jacob Huth, in collaboration with the Nature Machine Intelligence team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Figs. 1–7, Tables 1–3 and Discussion for each figure and table.

About this article

Cite this article

Mao, Z., Zhang, R., Xin, L. et al. Mitigating the missing-fragmentation problem in de novo peptide sequencing with a two-stage graph-based deep learning model. Nat Mach Intell 5, 1250–1260 (2023). https://doi.org/10.1038/s42256-023-00738-x

  • DOI: https://doi.org/10.1038/s42256-023-00738-x
