Mitigating the missing-fragmentation problem in de novo peptide sequencing with a two-stage graph-based deep learning model

Mao, Zeping; Zhang, Ruixue; Xin, Lei; Li, Ming

doi:10.1038/s42256-023-00738-x

Article
Published: 19 October 2023

Nature Machine Intelligence volume 5, pages 1250–1260 (2023)

A preprint version of the article is available at Research Square.

Abstract

Novel protein discovery and immunopeptidomics depend on highly sensitive de novo peptide sequencing with tandem mass spectrometry. Despite notable improvement using deep learning models, the missing-fragmentation problem remains an important hurdle that severely degrades the performance of de novo peptide sequencing. Here we reveal that in the process of peptide prediction, missing fragmentation results in the generation of incorrect amino acids within those regions and causes error accumulation thereafter. To tackle this problem, we propose GraphNovo, a two-stage de novo peptide-sequencing algorithm based on a graph neural network. GraphNovo focuses on finding the optimal path in the first stage to guide the sequence prediction in the second stage. Our experiments demonstrate that GraphNovo mitigates the effects of missing fragmentation and outperforms the state-of-the-art de novo peptide-sequencing algorithms.
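
To make the missing-fragmentation problem concrete, the following is a minimal sketch, not GraphNovo's code. It computes the theoretical singly charged b- and y-ion m/z values of a peptide and reports the cleavage sites for which neither ion has a matching peak. The function names, the 0.02 Da tolerance and the toy peak list are illustrative assumptions; the residue masses are the standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)


def fragment_ions(peptide):
    """Return (site, b_mz, y_mz) for every cleavage site, charge 1+.

    Site k is the bond between residue k and residue k + 1 (1-based).
    """
    total = sum(RESIDUE_MASS[aa] for aa in peptide)
    ions = []
    prefix = 0.0
    for site in range(1, len(peptide)):
        prefix += RESIDUE_MASS[peptide[site - 1]]
        b_mz = prefix + PROTON
        y_mz = (total - prefix) + WATER + PROTON
        ions.append((site, b_mz, y_mz))
    return ions


def missing_sites(peptide, peaks, tol=0.02):
    """Cleavage sites with neither a b nor a y peak within tol (assumed, in Da)."""
    def observed(mz):
        return any(abs(mz - p) <= tol for p in peaks)

    return [site for site, b_mz, y_mz in fragment_ions(peptide)
            if not observed(b_mz) and not observed(y_mz)]


# Toy spectrum (an assumption for illustration): only b ions are present,
# and the fragments for sites 3 and 4 were never produced.
ions = fragment_ions("PEPTIDE")
peaks = [b_mz for site, b_mz, _ in ions if site not in (3, 4)]
print(missing_sites("PEPTIDE", peaks))  # prints [3, 4]
```

In this toy case the b2 and b5 peaks bound a gap whose mass only constrains the sum of the three residues in between, so their identity and order cannot be read directly from the spectrum. This is the kind of region in which, according to the abstract, incorrect amino acids are generated.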

**Fig. 2: The model details of GraphNovo.**

**Fig. 4: Evaluation of GraphNovo-PathSearcher.**

**Fig. 5: GraphNovo mitigates the missing-fragmentation problem.**

**Fig. 6: The degree of missing fragmentation affecting model performance.**

Data availability

The source data and trained models for all experiments reported in this paper are accessible at https://doi.org/10.5281/zenodo.8000316 (ref. ⁴⁹).

Code availability

The source code of GraphNovo is available on GitHub at https://github.com/AmadeusloveIris/Graphnovo (ref. ⁵⁰).

References

Angel, T. E. et al. Mass spectrometry-based proteomics: existing capabilities and future directions. Chem. Soc. Rev. 41, 3912–3928 (2012).
Dančík, V., Addona, T. A., Clauser, K. R., Vath, J. E. & Pevzner, P. A. De novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 6, 327–342 (1999).
Griss, J. Spectral library searching in proteomics. Proteomics 16, 729–740 (2016).
Fernandez-de Cossio, J. et al. Automated interpretation of high-energy collision-induced dissociation spectra of singly protonated peptides by ‘seqms’, a software aid for de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 12, 1867–1878 (1998).
Lu, B. & Chen, T. Algorithms for de novo peptide sequencing using tandem mass spectrometry. Drug Discov. Today BioSilico 2, 85–90 (2004).
Tran, N. H. et al. Complete de novo assembly of monoclonal antibody sequences. Sci. Rep. 6, 31730 (2016).
Tran, N. H. et al. Personalized deep learning of individual immunopeptidomes to identify neoantigens for cancer vaccines. Nat. Mach. Intell. 2, 764–771 (2020).
Vitorino, R. et al. De novo sequencing of proteins by mass spectrometry. Expert Rev. Proteomics 17, 595–607 (2020).
Muth, T. & Renard, B. Y. Evaluating de novo sequencing in proteomics: already an accurate alternative to database-driven peptide identification? Brief. Bioinform. 19, 954–970 (2018).
Muth, T., Renard, B. Y. & Martens, L. Metaproteomic data analysis at a glance: advances in computational microbial community proteomics. Expert Rev. Proteomics 13, 757–769 (2016).
Kuhring, M. & Renard, B. Y. Estimating the computational limits of detection of microbial non-model organisms. Proteomics 15, 3580–3584 (2015).
Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 11, 1114–1125 (2014).
Jagannath, S. & Sabareesh, V. Peptide fragment ion analyser (PFIA): a simple and versatile tool for the interpretation of tandem mass spectrometric data and de novo sequencing of peptides. Rapid Commun. Mass Spectrom. 21, 3033–3038 (2007).
Chen, T., Kao, M.-Y., Tepel, M., Rush, J. & Church, G. M. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 8, 325–337 (2001).
Ma, B. et al. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 17, 2337–2342 (2003).
Mo, L., Dutta, D., Wan, Y. & Chen, T. MSNovo: a dynamic programming algorithm for de novo peptide sequencing via tandem mass spectrometry. Anal. Chem. 79, 4870–4878 (2007).
Taylor, J. A. & Johnson, R. S. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Anal. Chem. 73, 2594–2604 (2001).
Chi, H. et al. pNovo: de novo peptide sequencing and identification using HCD spectra. J. Proteome Res. 9, 2713–2724 (2010).
Yang, H. et al. Open-pNovo: de novo peptide sequencing with thousands of protein modifications. J. Proteome Res. 16, 645–654 (2017).
Chi, H. et al. pNovo+: de novo peptide sequencing using complementary HCD and ETD tandem mass spectra. J. Proteome Res. 12, 615–625 (2013).
Fischer, B. et al. NovoHMM: a hidden Markov model for de novo peptide sequencing. Anal. Chem. 77, 7265–7273 (2005).
Frank, A. & Pevzner, P. PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal. Chem. 77, 964–973 (2005).
Tran, N. H., Zhang, X., Xin, L., Shan, B. & Li, M. De novo peptide sequencing by deep learning. Proc. Natl Acad. Sci. USA 114, 8247–8252 (2017).
Yang, H., Chi, H., Zeng, W.-F., Zhou, W.-J. & He, S.-M. pNovo 3: precise de novo peptide sequencing using a learning-to-rank framework. Bioinformatics 35, i183–i190 (2019).
Qiao, R. et al. Computationally instrument-resolution-independent de novo peptide sequencing for high-resolution devices. Nat. Mach. Intell. 3, 420–425 (2021).
Yilmaz, M., Fondrie, W., Bittremieux, W., Oh, S. & Noble, W. S. De novo mass spectrometry peptide sequencing with a transformer model. In Proc. 39th International Conference on Machine Learning, Proc. Machine Learning Research Vol. 162 (eds Chaudhuri, K. et al.) 25514–25522 (PMLR, 2022).
McDonnell, K., Howley, E. & Abram, F. The impact of noise and missing fragmentation cleavages on de novo peptide identification algorithms. Comput. Struct. Biotechnol. J. 20, 1402–1412 (2022).
Ying, C. et al. Do transformers really perform badly for graph representation? Adv. Neural Inf. Process. Syst. 34, 28877–28888 (2021).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
Ma, B. Novor: real-time peptide de novo sequencing software. J. Am. Soc. Mass Spectrom. 26, 1885–1894 (2015).
Grossmann, J. et al. AUDENS: a tool for automated peptide de novo sequencing. J. Proteome Res. 4, 1768–1774 (2005).
Roepstorff, P. Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed. Mass Spectrom. 11, 601–605 (1984).
Frese, C. K. et al. Toward full peptide sequence coverage by dual fragmentation combining electron-transfer and higher-energy collision dissociation tandem mass spectrometry. Anal. Chem. 84, 9668–9673 (2012).
Baba, T. et al. Dissociation of biomolecules by an intense low-energy electron beam in a high sensitivity time-of-flight mass spectrometer. J. Am. Soc. Mass Spectrom. 32, 1964–1975 (2021).
Qi, C. R., Su, H., Mo, K. & Guibas, L. J. PointNet: deep learning on point sets for 3D classification and segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 77–85 (2017).
Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR 7, 1–9 (2019).
Shazeer, N., Lan, Z., Cheng, Y., Ding, N. & Hou, L. Talking-heads attention. Preprint at https://arxiv.org/abs/2003.02436 (2020).
Bhojanapalli, S., Yun, C., Rawat, A. S., Reddi, S. J. & Kumar, S. Low-rank bottleneck in multi-head attention models. Proceedings of Machine Learning Research 119, 864–873 (2020).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In ICLR 7, 1–18 (2019).
Biewald, L. Experiment Tracking with Weights and Biases (Weights & Biases, 2020); https://www.wandb.com/
Yadan, O. Hydra—a framework for elegantly configuring complex applications. GitHub https://github.com/facebookresearch/hydra (2019).
Eng, J. K., McCormack, A. L. & Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994).
Käll, L., Canterbury, J. D., Weston, J., Noble, W. S. & MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 4, 923–925 (2007).
Meier, F., Geyer, P. E., Virreira Winter, S., Cox, J. & Mann, M. BoxCar acquisition method enables single-shot proteomics at a depth of 10,000 proteins in 100 minutes. Nat. Methods 15, 440–448 (2018).
Fíla, J. et al. The beta subunit of nascent polypeptide associated complex plays a role in flowers and siliques development of Arabidopsis thaliana. Int. J. Mol. Sci. 21, 2065 (2020).
Tharyan, R. G. et al. NFYB-1 regulates mitochondrial function and longevity via lysosomal prosaposin. Nat. Metab. 2, 387–396 (2020).
Yu, Y. et al. Predictive signatures of 19 antibiotic-induced Escherichia coli proteomes. ACS Infect. Dis. 6, 2120–2129 (2020).
Zeping, M. & Ruixue, Z. Graphnovo dataset and checkpoint. Zenodo https://doi.org/10.5281/zenodo.8000316 (2023).
Zeping, M. & Ruixue, Z. Amadeusloveiris/graphnovo: Nature Machine Intelligence original code. Zenodo https://doi.org/10.5281/zenodo.7996510 (2023).

Acknowledgements

This work is partially supported by the National Key R&D Program of China grant 2022YFA1304603 (M.L.), the Canada Research Chair programme (M.L.) and NSERC grant OGP0046506 (M.L.). We thank B. Shan, N. H. Tran and X. Cui for discussions.

Author information

These authors contributed equally: Zeping Mao, Ruixue Zhang.

Authors and Affiliations

Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
Zeping Mao, Ruixue Zhang & Ming Li
Bioinformatics Solutions Inc., Waterloo, Ontario, Canada
Lei Xin
Henan Academy of Sciences, Henan, China
Ming Li
Shanghai Institute of Mathematical and Interdisciplinary Sciences, Shanghai, China
Ming Li

Contributions

Z.M. conceived the initial idea and the prototype of the model. Z.M. and R.Z. implemented the proposed algorithm. R.Z. evaluated the results and did the data analysis. Z.M. and R.Z. wrote the paper and all authors contributed to improving the paper. M.L. supervised the research project. M.L. and L.X. revised the paper.

Corresponding author

Correspondence to Ming Li.

Ethics declarations

Competing interests

L.X. is an employee of Bioinformatics Solutions Inc. The other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Ting Chen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Jacob Huth, in collaboration with the Nature Machine Intelligence team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Figs. 1–7, Tables 1–3 and Discussion for each figure and table.

About this article

Cite this article

Mao, Z., Zhang, R., Xin, L. et al. Mitigating the missing-fragmentation problem in de novo peptide sequencing with a two-stage graph-based deep learning model. Nat Mach Intell 5, 1250–1260 (2023). https://doi.org/10.1038/s42256-023-00738-x

Received: 16 February 2023
Accepted: 14 September 2023
Published: 19 October 2023
Issue Date: November 2023
DOI: https://doi.org/10.1038/s42256-023-00738-x