
Bridging the gap between chemical reaction pretraining and conditional molecule generation with a unified model

A preprint version of the article is available at arXiv.

Abstract

Chemical reactions are the fundamental building blocks of drug design and organic chemistry research. In recent years, there has been a growing need for a large-scale deep-learning framework that can efficiently capture the basic rules of chemical reactions. In this paper, we propose a unified framework that addresses both reaction-representation learning and molecule generation, allowing a more holistic approach. Inspired by organic chemistry mechanisms, we develop a new pretraining framework that incorporates inductive biases into the model. Our framework achieves state-of-the-art performance on challenging downstream tasks. By possessing chemical knowledge, our generative framework overcomes the limitation of current molecule generation models that rely on a small number of reaction templates. In extensive experiments, our model generates high-quality, synthesizable drug-like structures. Overall, our work presents a noteworthy step toward a large-scale deep-learning framework for a variety of reaction-based applications.


Fig. 1: Components and methods of Uni-RXN.
Fig. 2: Retrieval performance and attention weights of Uni-RXN.
Fig. 3: Process and performance of Uni-RXNGen.

Data availability

The USPTO MIT dataset was downloaded from the official GitHub repository (https://github.com/wengong-jin/nips17-rexgen) and the Schneider dataset was downloaded from the Supplementary Information of the original article (ref. 9) (https://pubs.acs.org/doi/suppl/10.1021/ci5006614/suppl_file/ci5006614_si_002.zip). We provide our processed training data in Python pickle format at https://doi.org/10.5281/zenodo.8075066 (ref. 42).
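The processed training data are distributed as Python pickle files. A minimal loading sketch is below; since the actual filenames and record schema of the Zenodo archive are not described here, both are illustrative assumptions, demonstrated on a stand-in file:

```python
import os
import pickle
import tempfile

# Stand-in for one processed file from the archive; the filename
# "uni_rxn_train.pkl" and the record schema are hypothetical.
records = [
    {"reactants": "CCO.CC(=O)O", "products": "CCOC(C)=O"},  # esterification
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "uni_rxn_train.pkl")
    # Write the stand-in file so the loading pattern below is runnable.
    with open(path, "wb") as f:
        pickle.dump(records, f)

    # The loading pattern itself: open in binary mode and unpickle.
    with open(path, "rb") as f:
        data = pickle.load(f)

print(len(data), data[0]["products"])
```

Only unpickle files from sources you trust, as `pickle.load` can execute arbitrary code.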

Code availability

The code to reproduce the results and Python scripts to reproduce the training data are publicly available at https://github.com/qiangbo1222/Uni-RXN-official (ref. 43).

References

  1. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2018).

  2. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  3. Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).

  4. Hendrycks, D. et al. Pretrained transformers improve out-of-distribution robustness. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D. et al.) 2744–2751 (Association for Computational Linguistics, 2020).

  5. Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 1–23 (2021).

  6. Lowe, D. M. Extraction of Chemical Structures and Reactions from the Literature. PhD thesis, Univ. Cambridge (2012).

  7. Lowe, D. Chemical reactions from US patents (1976-Sep2016). figshare https://doi.org/10.6084/m9.figshare.5104873.v1 (2017).

  8. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).

  9. Schneider, N., Lowe, D. M., Sayle, R. A. & Landrum, G. A. Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity. J. Chem. Inf. Model. 55, 39–53 (2015).

  10. Probst, D., Schwaller, P. & Reymond, J.-L. Reaction classification and yield prediction using the differential reaction fingerprint DRFP. Digit. Discov. 1, 91–97 (2022).

  11. Schwaller, P. et al. Mapping the space of chemical reactions using attention-based neural networks. Nat. Mach. Intell. 3, 144–152 (2021).

  12. Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. J. Chemformer: a pretrained transformer for computational chemistry. Mach. Learn. 3, 015022 (2022).

  13. Wen, M., Blau, S. M., Xie, X., Dwaraknath, S. & Persson, K. A. Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining. Chem. Sci. 13, 1446–1458 (2022).

  14. Wang, H. et al. International Conference on Learning Representations (ICLR, 2022).

  15. NameRXN (Nextmove Software, 2021); http://www.nextmovesoftware.com/namerxn.html

  16. Schwaller, P., Vaucher, A. C., Laino, T. & Reymond, J.-L. Prediction of chemical reaction yields using deep learning. Mach. Learn. 2, 015016 (2021).

  17. Korovina, K. et al. ChemBO: Bayesian optimization of small organic molecules with synthesizable recommendations. In Proc. 23rd International Conference on Artificial Intelligence and Statistics (eds Chiappa, S. & Calandra, R.) 3393–3403 (PMLR, 2020).

  18. Button, A., Merk, D., Hiss, J. A. & Schneider, G. Automated de novo molecular design by hybrid machine intelligence and rule-driven chemical synthesis. Nat. Mach. Intell. 1, 307–315 (2019).

  19. Gao, W., Mercado, R. & Coley, C. W. International Conference on Learning Representations (ICLR, 2022).

  20. Noh, J. et al. Path-aware and structure-preserving generation of synthetically accessible molecules. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 16952–16968 (PMLR, 2022).

  21. Coley, C. W., Barzilay, R., Jaakkola, T. S., Green, W. H. & Jensen, K. F. Prediction of organic reaction outcomes using machine learning. ACS Cent. Sci. 3, 434–443 (2017).

  22. Jin, W., Coley, C., Barzilay, R. & Jaakkola, T. Predicting organic reaction outcomes with Weisfeiler–Lehman network. In Proc. 31st International Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 2604–2613 (Curran Associates Inc., 2017).

  23. Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).

  24. Bradshaw, J., Paige, B., Kusner, M. J., Segler, M. & Hernández-Lobato, J. M. A model to search for synthesizable molecules. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 7937–7949 (Curran Associates Inc., 2019).

  25. Bradshaw, J., Paige, B., Kusner, M. J., Segler, M. & Hernández-Lobato, J. M. Barking up the right tree: an approach to search over molecule synthesis DAGs. Adv. Neural Inf. Process. Syst. 33, 6852–6866 (2020).

  26. Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019).

  27. Genheden, S., Engkvist, O. & Bjerrum, E. J. A quick policy to filter reactions based on feasibility in AI-guided retrosynthetic planning. Preprint at chemRxiv https://doi.org/10.26434/chemrxiv.13280495.v1 (2020).

  28. Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46, 1074–1082 (2018).

  29. Fialková, V. et al. LibINVENT: reaction-based generative scaffold decoration for in silico library design. J. Chem. Inf. Model. 62, 2046–2063 (2021).

  30. Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).

  31. Thakkar, A., Chadimová, V., Bjerrum, E. J., Engkvist, O. & Reymond, J.-L. Retrosynthetic accessibility score (RAscore): rapid machine learned synthesizability classification from AI driven retrosynthetic planning. Chem. Sci. 12, 3339–3349 (2021).

  32. Morris, A. et al. Discovery of SARS-CoV-2 main protease inhibitors using a synthesis-directed de novo design model. Chem. Commun. 57, 5909–5912 (2021).

  33. Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. 1, 045024 (2020).

  34. Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 6000–6010 (Curran Associates Inc., 2017).

  35. Ying, C. et al. Do transformers really perform badly for graph representation? Adv. Neural Inf. Process. Syst. 34, 28877–28888 (2021).

  36. Zhang, L., Xu, D., Arnab, A. & Torr, P. H. Dynamic graph message passing networks. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 3726–3735 (2020).

  37. Jacob, P.-M. & Lapkin, A. Statistics of the network of organic chemistry. React. Chem. Eng. 3, 102–118 (2018).

  38. Vignac, C. & Frossard, P. International Conference on Learning Representations (ICLR, 2022).

  39. Chen, S. & Jung, Y. A generalized-template-based graph neural network for accurate organic reactivity prediction. Nat. Mach. Intell. 4, 772–780 (2022).

  40. Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).

  41. Friesner, R. A. et al. Extra precision glide: docking and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes. J. Med. Chem. 49, 6177–6196 (2006).

  42. Qiang, B. Processed training data for ‘Bridging the gap between chemical reaction pretraining and conditional molecule generation with a unified model’. Zenodo https://doi.org/10.5281/zenodo.8075067 (2023).

  43. Qiang, B. qiangbo1222/Uni-RXN-official V1.0. Zenodo https://doi.org/10.5281/zenodo.8113249 (2023).

  44. Reymond Group: DRFP. GitHub https://github.com/reymond-group/drfp (2023).

Acknowledgements

This work was financially supported by National Key R&D Programme of China (grant no. 2022YFF1203003 (Z.L.) and grant no. 2022YFC2303700 (L.Z.)), Beijing AI Health Cultivation Project (grant no. Z221100003522022 (Z.L.)), Peking University Health Science and StoneWise Technology Joint Laboratory Project (grant no. L202107 (Z.L.)) and the Open Fund of State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, China (grant no. KF-202304 (Z.L.)).

Author information

Contributions

B.Q. conceived the initial idea for the projects. B.Q. and Y.D. processed the dataset and trained the model. B.H. provided support on computing resources. B.Q. and Y.Z. performed the experiments using the pretrained model and the generative model. Y.Z. analysed the results and B.Q. wrote the manuscript. B.Q., S.S., L.Z. and Z.L. contributed to the revision of the manuscript. The project was supervised by L.Z. and Z.L. All authors participated in discussions.

Corresponding authors

Correspondence to Bo Huang or Zhenming Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Esben Jannik Bjerrum and Thomas Blaschke for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Related works and details of experiments and implementation.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Qiang, B., Zhou, Y., Ding, Y. et al. Bridging the gap between chemical reaction pretraining and conditional molecule generation with a unified model. Nat Mach Intell 5, 1476–1485 (2023). https://doi.org/10.1038/s42256-023-00764-9
