Abstract
Novel protein discovery and immunopeptidomics depend on highly sensitive de novo peptide sequencing with tandem mass spectrometry. Despite notable improvement using deep learning models, the missing-fragmentation problem remains an important hurdle that severely degrades the performance of de novo peptide sequencing. Here we reveal that in the process of peptide prediction, missing fragmentation results in the generation of incorrect amino acids within those regions and causes error accumulation thereafter. To tackle this problem, we propose GraphNovo, a two-stage de novo peptide-sequencing algorithm based on a graph neural network. GraphNovo focuses on finding the optimal path in the first stage to guide the sequence prediction in the second stage. Our experiments demonstrate that GraphNovo mitigates the effects of missing fragmentation and outperforms the state-of-the-art de novo peptide-sequencing algorithms.
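The effect of a missing fragment peak can be illustrated with a small mass calculation. The following is a minimal sketch, not the GraphNovo algorithm: the helper names (`prefix_masses`, `gaps_with_missing`, `explain_gap`), the example peptide `SGAK` and the 0.02 Da tolerance are assumptions chosen for illustration, and only standard monoisotopic residue masses are used. When the peak between G and A is absent, the remaining mass gap is explained equally well by Q or by G and A in either order, which is the kind of ambiguity that leads to incorrect amino acids in those regions.

```python
from itertools import combinations_with_replacement

# Standard monoisotopic residue masses in daltons (unmodified cysteine).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def prefix_masses(peptide):
    """Summed residue masses of every proper prefix (neutral, no proton or water)."""
    total, out = 0.0, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        out.append(total)
    return out


def gaps_with_missing(peptide, missing_index):
    """Drop one prefix peak and return the mass gaps between the remaining peaks."""
    full_mass = sum(RESIDUE_MASS[aa] for aa in peptide)
    peaks = [0.0] + prefix_masses(peptide) + [full_mass]
    del peaks[missing_index + 1]  # +1 skips the leading 0.0 anchor
    return [b - a for a, b in zip(peaks, peaks[1:])]


def explain_gap(gap, tol=0.02, max_len=2):
    """List residue multisets (as sorted strings) whose summed mass is within tol of gap."""
    hits = []
    for n in range(1, max_len + 1):
        for combo in combinations_with_replacement(sorted(RESIDUE_MASS), n):
            if abs(sum(RESIDUE_MASS[aa] for aa in combo) - gap) <= tol:
                hits.append("".join(combo))
    return hits


# Hypothetical example: the prefix peak for "SG" is not observed.
gaps = gaps_with_missing("SGAK", missing_index=1)
for gap in gaps:
    print(round(gap, 5), explain_gap(gap))
```

In this sketch the middle gap is matched by both `Q` and the multiset `AG`, so a decoder working only from peak differences cannot tell the candidates apart without further evidence.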
Data availability
The source data and trained models for all experiments reported in this paper are accessible at https://doi.org/10.5281/zenodo.8000316 (ref. 49).
Code availability
The source code of GraphNovo is available on GitHub at https://github.com/AmadeusloveIris/Graphnovo (ref. 50).
References
Angel, T. E. et al. Mass spectrometry-based proteomics: existing capabilities and future directions. Chem. Soc. Rev. 41, 3912–3928 (2012).
Dančík, V., Addona, T. A., Clauser, K. R., Vath, J. E. & Pevzner, P. A. De novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 6, 327–342 (1999).
Griss, J. Spectral library searching in proteomics. Proteomics 16, 729–740 (2016).
Fernandez-de Cossio, J. et al. Automated interpretation of high-energy collision-induced dissociation spectra of singly protonated peptides by ‘seqms’, a software aid for de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 12, 1867–1878 (1998).
Lu, B. & Chen, T. Algorithms for de novo peptide sequencing using tandem mass spectrometry. Drug Discov. Today BioSilico 2, 85–90 (2004).
Tran, N. H. et al. Complete de novo assembly of monoclonal antibody sequences. Sci. Rep. 6, 31730 (2016).
Tran, N. H. et al. Personalized deep learning of individual immunopeptidomes to identify neoantigens for cancer vaccines. Nat. Mach. Intell. 2, 764–771 (2020).
Vitorino, R. et al. De novo sequencing of proteins by mass spectrometry. Expert Rev. Proteomics 17, 595–607 (2020).
Muth, T. & Renard, B. Y. Evaluating de novo sequencing in proteomics: already an accurate alternative to database-driven peptide identification? Brief. Bioinform. 19, 954–970 (2018).
Muth, T., Renard, B. Y. & Martens, L. Metaproteomic data analysis at a glance: advances in computational microbial community proteomics. Expert Rev. Proteomics 13, 757–769 (2016).
Kuhring, M. & Renard, B. Y. Estimating the computational limits of detection of microbial non-model organisms. Proteomics 15, 3580–3584 (2015).
Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 11, 1114–1125 (2014).
Jagannath, S. & Sabareesh, V. Peptide fragment ion analyser (PFIA): a simple and versatile tool for the interpretation of tandem mass spectrometric data and de novo sequencing of peptides. Rapid Commun. Mass Spectrom. 21, 3033–3038 (2007).
Chen, T., Kao, M.-Y., Tepel, M., Rush, J. & Church, G. M. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 8, 325–337 (2001).
Ma, B. et al. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 17, 2337–2342 (2003).
Mo, L., Dutta, D., Wan, Y. & Chen, T. MSNovo: a dynamic programming algorithm for de novo peptide sequencing via tandem mass spectrometry. Anal. Chem. 79, 4870–4878 (2007).
Taylor, J. A. & Johnson, R. S. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Anal. Chem. 73, 2594–2604 (2001).
Chi, H. et al. pNovo: de novo peptide sequencing and identification using HCD spectra. J. Proteome Res. 9, 2713–2724 (2010).
Yang, H. et al. Open-pNovo: de novo peptide sequencing with thousands of protein modifications. J. Proteome Res. 16, 645–654 (2017).
Chi, H. et al. pNovo+: de novo peptide sequencing using complementary HCD and ETD tandem mass spectra. J. Proteome Res. 12, 615–625 (2013).
Fischer, B. et al. NovoHMM: a hidden Markov model for de novo peptide sequencing. Anal. Chem. 77, 7265–7273 (2005).
Frank, A. & Pevzner, P. PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal. Chem. 77, 964–973 (2005).
Tran, N. H., Zhang, X., Xin, L., Shan, B. & Li, M. De novo peptide sequencing by deep learning. Proc. Natl Acad. Sci. USA 114, 8247–8252 (2017).
Yang, H., Chi, H., Zeng, W.-F., Zhou, W.-J. & He, S.-M. pNovo 3: precise de novo peptide sequencing using a learning-to-rank framework. Bioinformatics 35, i183–i190 (2019).
Qiao, R. et al. Computationally instrument-resolution-independent de novo peptide sequencing for high-resolution devices. Nat. Mach. Intell. 3, 420–425 (2021).
Yilmaz, M., Fondrie, W., Bittremieux, W., Oh, S. & Noble, W. S. De novo mass spectrometry peptide sequencing with a transformer model. In Proc. 39th International Conference on Machine Learning, Proc. Machine Learning Research Vol. 162 (eds Chaudhuri, K. et al.) 25514–25522 (PMLR, 2022).
McDonnell, K., Howley, E. & Abram, F. The impact of noise and missing fragmentation cleavages on de novo peptide identification algorithms. Comput. Struct. Biotechnol. J. 20, 1402–1412 (2022).
Ying, C. et al. Do transformers really perform badly for graph representation? Adv. Neural Inf. Process. Syst. 34, 28877–28888 (2021).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
Ma, B. Novor: real-time peptide de novo sequencing software. J. Am. Soc. Mass Spectrom. 26, 1885–1894 (2015).
Grossmann, J. et al. AUDENS: a tool for automated peptide de novo sequencing. J. Proteome Res. 4, 1768–1774 (2005).
Roepstorff, P. Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed. Mass Spectrom. 11, 601–605 (1984).
Frese, C. K. et al. Toward full peptide sequence coverage by dual fragmentation combining electron-transfer and higher-energy collision dissociation tandem mass spectrometry. Anal. Chem. 84, 9668–9673 (2012).
Baba, T. et al. Dissociation of biomolecules by an intense low-energy electron beam in a high sensitivity time-of-flight mass spectrometer. J. Am. Soc. Mass Spectrom. 32, 1964–1975 (2021).
Qi, C. R., Su, H., Mo, K. & Guibas, L. J. PointNet: deep learning on point sets for 3D classification and segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 77–85 (2017).
Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR 7, 1–9 (2019).
Shazeer, N., Lan, Z., Cheng, Y., Ding, N. & Hou, L. Talking-heads attention. Preprint at https://arxiv.org/abs/2003.02436 (2020).
Bhojanapalli, S., Yun, C., Rawat, A. S., Reddi, S. J. & Kumar, S. Low-rank bottleneck in multi-head attention models. Proceedings of Machine Learning Research 119, 864–873 (2020).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In ICLR 7, 1–18 (2019).
Biewald, L. Experiment Tracking with Weights and Biases (Weights & Biases, 2020); https://www.wandb.com/
Yadan, O. Hydra—a framework for elegantly configuring complex applications. GitHub https://github.com/facebookresearch/hydra (2019).
Eng, J. K., McCormack, A. L. & Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994).
Käll, L., Canterbury, J. D., Weston, J., Noble, W. S. & MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 4, 923–925 (2007).
Meier, F., Geyer, P. E., Virreira Winter, S., Cox, J. & Mann, M. BoxCar acquisition method enables single-shot proteomics at a depth of 10,000 proteins in 100 minutes. Nat. Methods 15, 440–448 (2018).
Fíla, J. et al. The beta subunit of nascent polypeptide associated complex plays a role in flowers and siliques development of Arabidopsis thaliana. Int. J. Mol. Sci. 21, 2065 (2020).
Tharyan, R. G. et al. NFYB-1 regulates mitochondrial function and longevity via lysosomal prosaposin. Nat. Metab. 2, 387–396 (2020).
Yu, Y. et al. Predictive signatures of 19 antibiotic-induced Escherichia coli proteomes. ACS Infect. Dis. 6, 2120–2129 (2020).
Zeping, M. & Ruixue, Z. Graphnovo dataset and checkpoint. Zenodo https://doi.org/10.5281/zenodo.8000316 (2023).
Zeping, M. & Ruixue, Z. Amadeusloveiris/graphnovo: Nature Machine Intelligence original code. Zenodo https://doi.org/10.5281/zenodo.7996510 (2023).
Acknowledgements
This work is partially supported by the National Key R&D Program of China grant 2022YFA1304603 (M.L.), the Canada Research Chair programme (M.L.) and NSERC grant OGP0046506 (M.L.). We thank B. Shan, N. H. Tran and X. Cui for discussions.
Author information
Contributions
Z.M. conceived the initial idea and the prototype of the model. Z.M. and R.Z. implemented the proposed algorithm. R.Z. evaluated the results and did the data analysis. Z.M. and R.Z. wrote the paper and all authors contributed to improving the paper. M.L. supervised the research project. M.L. and L.X. revised the paper.
Ethics declarations
Competing interests
L.X. is an employee of Bioinformatics Solutions Inc. The other authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Ting Chen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Jacob Huth, in collaboration with the Nature Machine Intelligence team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–7, Tables 1–3 and Discussion for each figure and table.
About this article
Cite this article
Mao, Z., Zhang, R., Xin, L. et al. Mitigating the missing-fragmentation problem in de novo peptide sequencing with a two-stage graph-based deep learning model. Nat Mach Intell 5, 1250–1260 (2023). https://doi.org/10.1038/s42256-023-00738-x
DOI: https://doi.org/10.1038/s42256-023-00738-x