
Bridging the gap between chemical reaction pretraining and conditional molecule generation with a unified model

A preprint version of the article is available at arXiv.

Abstract

Chemical reactions are the fundamental building blocks of drug design and organic chemistry research. In recent years, there has been a growing need for a large-scale deep-learning framework that can efficiently capture the basic rules of chemical reactions. In this paper, we propose a unified framework that addresses both reaction-representation learning and molecule generation, allowing a more holistic approach. Inspired by organic chemistry mechanisms, we develop a new pretraining framework that incorporates inductive biases into the model. Our framework achieves state-of-the-art performance on challenging downstream tasks. By possessing chemical knowledge, our generative framework overcomes the limitation of current molecule generation models that rely on a small number of reaction templates. In extensive experiments, our model generates high-quality, synthesizable drug-like structures. Overall, our work presents a noteworthy step toward a large-scale deep-learning framework for a variety of reaction-based applications.


Fig. 1: Components and methods of Uni-RXN.
Fig. 2: Retrieval performance and attention weights of Uni-RXN.
Fig. 3: Process and performance of Uni-RXNGen.

Data availability

The USPTO MIT dataset was downloaded from the official GitHub repository (https://github.com/wengong-jin/nips17-rexgen) and the Schneider dataset was downloaded from the Supplementary Information of the original article (ref. 9) (https://pubs.acs.org/doi/suppl/10.1021/ci5006614/suppl_file/ci5006614_si_002.zip). We provide our processed training data in Python pickle format at https://doi.org/10.5281/zenodo.8075066 (ref. 42).
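The processed training data are distributed as Python pickle files. A minimal loading sketch is below; since the actual filenames and record schema of the Zenodo archive are not described here, both are illustrative assumptions, demonstrated on a stand-in file:

```python
import os
import pickle
import tempfile

# Stand-in for one processed file from the archive; the filename
# "uni_rxn_train.pkl" and the record schema are hypothetical.
records = [
    {"reactants": "CCO.CC(=O)O", "products": "CCOC(C)=O"},  # esterification
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "uni_rxn_train.pkl")
    # Write the stand-in file so the loading pattern below is runnable.
    with open(path, "wb") as f:
        pickle.dump(records, f)

    # The loading pattern itself: open in binary mode and unpickle.
    with open(path, "rb") as f:
        data = pickle.load(f)

print(len(data), data[0]["products"])
```

Only unpickle files from sources you trust, as `pickle.load` can execute arbitrary code.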

Code availability

The code to reproduce the results and Python scripts to reproduce the training data are publicly available at https://github.com/qiangbo1222/Uni-RXN-official (ref. 43).

References

  1. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2018).

  2. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  3. Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).

  4. Hendrycks, D. et al. Pretrained transformers improve out-of-distribution robustness. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D. et al.) 2744–2751 (Association for Computational Linguistics, 2020).

  5. Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 1–23 (2021).

  6. Lowe, D. M. Extraction of Chemical Structures and Reactions from the Literature. PhD thesis, Univ. Cambridge (2012).

  7. Lowe, D. Chemical reactions from US patents (1976-Sep2016). figshare https://doi.org/10.6084/m9.figshare.5104873.v1 (2017).

  8. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).

  9. Schneider, N., Lowe, D. M., Sayle, R. A. & Landrum, G. A. Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity. J. Chem. Inf. Model. 55, 39–53 (2015).

  10. Probst, D., Schwaller, P. & Reymond, J.-L. Reaction classification and yield prediction using the differential reaction fingerprint DRFP. Digit. Discov. 1, 91–97 (2022).

  11. Schwaller, P. et al. Mapping the space of chemical reactions using attention-based neural networks. Nat. Mach. Intell. 3, 144–152 (2021).

  12. Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. J. Chemformer: a pretrained transformer for computational chemistry. Mach. Learn. 3, 015022 (2022).

  13. Wen, M., Blau, S. M., Xie, X., Dwaraknath, S. & Persson, K. A. Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining. Chem. Sci. 13, 1446–1458 (2022).

  14. Wang, H. et al. International Conference on Learning Representations (ICLR, 2022).

  15. NameRXN (Nextmove Software, 2021); http://www.nextmovesoftware.com/namerxn.html

  16. Schwaller, P., Vaucher, A. C., Laino, T. & Reymond, J.-L. Prediction of chemical reaction yields using deep learning. Mach. Learn. 2, 015016 (2021).

  17. Korovina, K. et al. ChemBO: Bayesian optimization of small organic molecules with synthesizable recommendations. In Proc. 23rd International Conference on Artificial Intelligence and Statistics (eds Chiappa, S. & Calandra, R.) 3393–3403 (PMLR, 2020).

  18. Button, A., Merk, D., Hiss, J. A. & Schneider, G. Automated de novo molecular design by hybrid machine intelligence and rule-driven chemical synthesis. Nat. Mach. Intell. 1, 307–315 (2019).

  19. Gao, W., Mercado, R. & Coley, C. W. International Conference on Learning Representations (ICLR, 2022).

  20. Noh, J. et al. Path-aware and structure-preserving generation of synthetically accessible molecules. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 16952–16968 (PMLR, 2022).

  21. Coley, C. W., Barzilay, R., Jaakkola, T. S., Green, W. H. & Jensen, K. F. Prediction of organic reaction outcomes using machine learning. ACS Cent. Sci. 3, 434–443 (2017).

  22. Jin, W., Coley, C., Barzilay, R. & Jaakkola, T. Predicting organic reaction outcomes with Weisfeiler–Lehman network. In Proc. 31st International Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 2604–2613 (Curran Associates Inc., 2017).

  23. Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).

  24. Bradshaw, J., Paige, B., Kusner, M. J., Segler, M. & Hernández-Lobato, J. M. A model to search for synthesizable molecules. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 7937–7949 (Curran Associates Inc., 2019).

  25. Bradshaw, J., Paige, B., Kusner, M. J., Segler, M. & Hernández-Lobato, J. M. Barking up the right tree: an approach to search over molecule synthesis DAGs. Adv. Neural Inf. Process. Syst. 33, 6852–6866 (2020).

  26. Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019).

  27. Genheden, S., Engkvist, O. & Bjerrum, E. J. A quick policy to filter reactions based on feasibility in AI-guided retrosynthetic planning. Preprint at chemRxiv https://doi.org/10.26434/chemrxiv.13280495.v1 (2020).

  28. Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46, 1074–1082 (2018).

  29. Fialková, V. et al. LibINVENT: reaction-based generative scaffold decoration for in silico library design. J. Chem. Inf. Model. 62, 2046–2063 (2021).

  30. Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).

  31. Thakkar, A., Chadimová, V., Bjerrum, E. J., Engkvist, O. & Reymond, J.-L. Retrosynthetic accessibility score (RAscore): rapid machine learned synthesizability classification from AI driven retrosynthetic planning. Chem. Sci. 12, 3339–3349 (2021).

  32. Morris, A. et al. Discovery of SARS-CoV-2 main protease inhibitors using a synthesis-directed de novo design model. Chem. Commun. 57, 5909–5912 (2021).

  33. Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. 1, 045024 (2020).

  34. Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems (eds Guyon, I. et al.) 6000–6010 (Curran Associates Inc., 2017).

  35. Ying, C. et al. Do transformers really perform badly for graph representation? Adv. Neural Inf. Process. Syst. 34, 28877–28888 (2021).

  36. Zhang, L., Xu, D., Arnab, A. & Torr, P. H. Dynamic graph message passing networks. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 3726–3735 (2020).

  37. Jacob, P.-M. & Lapkin, A. Statistics of the network of organic chemistry. React. Chem. Eng. 3, 102–118 (2018).

  38. Vignac, C. & Frossard, P. International Conference on Learning Representations (ICLR, 2022).

  39. Chen, S. & Jung, Y. A generalized-template-based graph neural network for accurate organic reactivity prediction. Nat. Mach. Intell. 4, 772–780 (2022).

  40. Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).

  41. Friesner, R. A. et al. Extra precision glide: docking and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes. J. Med. Chem. 49, 6177–6196 (2006).

  42. Qiang, B. Processed training data for ‘Bridging the gap between chemical reaction pretraining and conditional molecule generation with a unified model’. Zenodo https://doi.org/10.5281/zenodo.8075067 (2023).

  43. Qiang, B. qiangbo1222/Uni-RXN-official V1.0. Zenodo https://doi.org/10.5281/zenodo.8113249 (2023).

  44. Reymond Group: DRFP. GitHub https://github.com/reymond-group/drfp (2023).

Acknowledgements

This work was financially supported by National Key R&D Programme of China (grant no. 2022YFF1203003 (Z.L.) and grant no. 2022YFC2303700 (L.Z.)), Beijing AI Health Cultivation Project (grant no. Z221100003522022 (Z.L.)), Peking University Health Science and StoneWise Technology Joint Laboratory Project (grant no. L202107 (Z.L.)) and the Open Fund of State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, China (grant no. KF-202304 (Z.L.)).

Author information

Contributions

B.Q. conceived the initial idea for the projects. B.Q. and Y.D. processed the dataset and trained the model. B.H. provided support on computing resources. B.Q. and Y.Z. performed the experiments using the pretrained model and the generative model. Y.Z. analysed the results and B.Q. wrote the manuscript. B.Q., S.S., L.Z. and Z.L. contributed to the revision of the manuscript. The project was supervised by L.Z. and Z.L. All authors participated in discussions.

Corresponding authors

Correspondence to Bo Huang or Zhenming Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Esben Jannik Bjerrum and Thomas Blaschke for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Related works and details of experiments and implementation.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Qiang, B., Zhou, Y., Ding, Y. et al. Bridging the gap between chemical reaction pretraining and conditional molecule generation with a unified model. Nat Mach Intell 5, 1476–1485 (2023). https://doi.org/10.1038/s42256-023-00764-9
