Enabling data-limited chemical bioactivity predictions through deep neural network transfer learning

Liu, Ruifeng; Laxminarayan, Srinivas; Reifman, Jaques; Wallqvist, Anders

doi:10.1007/s10822-022-00486-x

Published: 22 October 2022

Volume 36, pages 867–878 (2022)

Journal of Computer-Aided Molecular Design

Abstract

The main limitation in developing deep neural network (DNN) models to predict bioactivity properties of chemicals is the lack of sufficient assay data to train the network’s classification layers. Focusing on feedforward DNNs that use atom- and bond-based structural fingerprints as input, we examined whether layers of a fully trained DNN based on large amounts of data to predict one property could be used to develop DNNs to predict other related or unrelated properties based on limited amounts of data. Hence, we assessed if and under what conditions the dense layers of a pre-trained DNN could be transferred and used for the development of another DNN associated with limited training data. We carried out a quantitative study employing more than 400 pairs of assay datasets, where we used fully trained layers from a large dataset to augment the training of a small dataset. We found that the higher the correlation r between two assay datasets, the more efficient the transfer learning is in reducing prediction errors associated with the smaller dataset DNN predictions. The reduction in mean squared prediction errors ranged from 10 to 20% for every 0.1 increase in r² between the datasets, with the bulk of the error reductions associated with transfers of the first dense layer. Transfer of other dense layers did not result in additional benefits, suggesting that deeper, dense layers conveyed more specialized and assay-specific information. Importantly, depending on the dataset correlation, training sample size could be reduced by up to tenfold without any loss of prediction accuracy.
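The dataset correlation that the abstract relates to transfer efficiency can be illustrated with a short sketch. It computes the Pearson r and r² between two assays, assuming that the correlation is taken over the compounds measured in both datasets; that restriction, the function names, and the toy activity values are assumptions of this example and are not taken from the article.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant activities")
    return cov / math.sqrt(var_x * var_y)


def dataset_r_squared(assay_a, assay_b):
    """r^2 over the compounds present in both assays.

    assay_a, assay_b: dicts mapping a compound identifier to an activity.
    Restricting to shared compounds is an assumption of this sketch.
    """
    shared = sorted(set(assay_a) & set(assay_b))
    r = pearson_r([assay_a[c] for c in shared], [assay_b[c] for c in shared])
    return r * r


# Hypothetical toy activities, for illustration only.
large = {"c1": 5.0, "c2": 6.0, "c3": 7.0, "c4": 8.0}
small = {"c2": 2.0, "c3": 3.0, "c4": 5.0, "c9": 1.0}
print(round(dataset_r_squared(large, small), 3))
```

A pair of datasets with a higher value from `dataset_r_squared` would, following the abstract, be a more promising candidate for transferring the first dense layer.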

Availability of data and materials

All the data used in this study were downloaded from public repositories as described in the Molecular Activity Datasets section. The web links to specific datasets are given in Table S1 in the Supplemental Materials. We generated the input features using Pipeline Pilot 18.1.100.11 (Dassault Systèmes, Vélizy-Villacoublay, France; an evaluation license is available from https://www.3ds.com/how-to-buy/contact-sales). For anyone interested in repeating the computations without access to Pipeline Pilot, we provide the input data generated from Pipeline Pilot in the Supplementary Information. We performed all DNN studies using the open-source Keras API in TensorFlow 2.1.0, available from https://www.tensorflow.org/. Our Python code for transfer learning using a two-hidden-layer neural network is provided in the Supplementary Information.
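The mechanics of transferring a first dense layer can be sketched without any deep learning framework. The example below is not the authors' TL_2HiddenLayers.py and does not use Keras; it is a dependency-free illustration in which the layer sizes, the Glorot-style initialization, the ReLU activations, and the choice to freeze the transferred layer are all assumptions made for the example.

```python
import copy
import random


def init_layer(n_in, n_out, rng):
    """Randomly initialized dense layer: weights[n_out][n_in] and biases[n_out]."""
    limit = (6.0 / (n_in + n_out)) ** 0.5  # Glorot-style uniform range (assumed)
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)]
               for _ in range(n_out)]
    return {"weights": weights, "biases": [0.0] * n_out, "trainable": True}


def build_network(n_in, n_hidden1, n_hidden2, seed):
    """Two hidden dense layers followed by a single regression output."""
    rng = random.Random(seed)
    return [init_layer(n_in, n_hidden1, rng),
            init_layer(n_hidden1, n_hidden2, rng),
            init_layer(n_hidden2, 1, rng)]


def transfer_first_layer(source, target, freeze=True):
    """Copy the first dense layer of `source` into `target`.

    The remaining layers of `target` keep their own initialization and would
    be trained on the small dataset. Freezing is an assumption of this sketch.
    """
    target[0] = copy.deepcopy(source[0])
    target[0]["trainable"] = not freeze
    return target


def forward(network, features):
    """Forward pass: ReLU on hidden layers (assumed), linear output."""
    activations = features
    for i, layer in enumerate(network):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(layer["weights"], layer["biases"])]
        is_output = i == len(network) - 1
        activations = z if is_output else [max(0.0, v) for v in z]
    return activations[0]


# `pretrained` stands in for a model trained on the large dataset.
pretrained = build_network(8, 4, 3, seed=0)
small_model = build_network(8, 4, 3, seed=1)
transfer_first_layer(pretrained, small_model)
print(forward(small_model, [1.0] * 8))
```

In a framework such as Keras, the same idea would typically be expressed by copying the weights of one layer into the corresponding layer of a new model and, optionally, marking that layer as non-trainable before fitting on the small dataset.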

Funding

This research was funded by the U.S. Army Medical Research and Development Command under Contract No. W81XWH20C0031 and by Defense Threat Reduction Agency Grant CBCall14-CBS-05-2-0007.

Author information

Authors and Affiliations

Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Development Command, FCMR-TT, 504 Scott Street, Fort Detrick, MD, 21702-5012, USA
Ruifeng Liu, Srinivas Laxminarayan, Jaques Reifman & Anders Wallqvist
The Henry M. Jackson Foundation for the Advancement of Military Medicine, Inc., Bethesda, MD, USA
Ruifeng Liu & Srinivas Laxminarayan

Contributions

All authors contributed to the study conception and design. Material preparation, data collection, computation, and analysis were performed by Ruifeng Liu and Srinivas Laxminarayan. The first draft of the manuscript was written by Anders Wallqvist and Ruifeng Liu. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Anders Wallqvist.

Ethics declarations

Competing interests

The authors declare no competing interests.

Consent for publication

All authors have given consent for publication of the article. The opinions and assertions contained herein are the private views of the authors and are not to be construed as official or as reflecting the views of the U.S. Army, the U.S. Department of Defense, or The Henry M. Jackson Foundation for the Advancement of Military Medicine, Inc. This paper has been approved for public release with unlimited distribution.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Datasets.zip

Datasets used in this study. In each dataset, column 1 contains the indexes of the input molecules, column 2 contains the molecular activities, and columns 3 to 1026 contain the counts of ECFP_2 fingerprint features. We generated the counts of ECFP_2 fingerprint features using Pipeline Pilot. (ZIP 49560 KB)
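The column layout described above can be read with a few lines of standard-library Python. This is a minimal sketch: the comma delimiter, the absence of a header row, and the integer formatting of the counts are assumptions, since the file format inside Datasets.zip is not specified here, and `read_dataset` is a hypothetical helper name.

```python
import csv
import io

N_FEATURES = 1024  # columns 3 to 1026 hold the ECFP_2 feature counts


def read_dataset(handle, delimiter=","):
    """Return (ids, activities, fingerprints) from an open text file.

    Column 1: molecule index, column 2: activity, columns 3-1026: counts.
    The delimiter and the lack of a header row are assumptions.
    """
    ids, activities, fingerprints = [], [], []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        if len(row) != 2 + N_FEATURES:
            raise ValueError(f"expected {2 + N_FEATURES} columns, got {len(row)}")
        ids.append(row[0])
        activities.append(float(row[1]))
        fingerprints.append([int(v) for v in row[2:]])
    return ids, activities, fingerprints


# Synthetic one-row example, for illustration only.
row = ["1", "6.25"] + ["0"] * N_FEATURES
row[2] = "3"
demo = io.StringIO(",".join(row) + "\n")
ids, activities, fingerprints = read_dataset(demo)
print(ids, activities, len(fingerprints[0]))
```

For a tab-separated file, the same function could be called with `delimiter="\t"`.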

Table S1. Details of the molecular activity datasets used in this study (XLSX 12 KB)

Table S2. Mean squared error (standard deviation) of 2-hidden layer DNN models of A549 cell inhibition trained with an increasing number of compounds (XLSX 11 KB)

Table S3. Mean squared error (standard deviation) of 3-hidden layer DNN models of A549 cell inhibition trained with an increasing number of compounds (XLSX 11 KB)

Table S4. Mean squared error (standard deviation) of HTB132 inhibition models trained with and without transfer parameters from pre-trained A549 inhibition models (XLSX 11 KB)

Table S5. Transfer-learning efficiency between dataset pairs with different degrees of correlation. All models are trained with 500 compounds of dataset 2 with or without transferring the first hidden layer from models trained with all compounds of dataset 1 (XLSX 24 KB)

Table S6. Transfer-learning efficiency between dataset pairs with different degrees of correlation. All models are trained with 1,000 compounds of dataset 2 with or without transferring the first hidden layer from models trained with all compounds of dataset 1 (XLSX 24 KB)

TL_2HiddenLayers.py

Python code for transfer learning using a two-hidden layer deep neural network (PY 12 KB)

About this article

Cite this article

Liu, R., Laxminarayan, S., Reifman, J. et al. Enabling data-limited chemical bioactivity predictions through deep neural network transfer learning. J Comput Aided Mol Des 36, 867–878 (2022). https://doi.org/10.1007/s10822-022-00486-x

Received: 06 August 2022
Accepted: 17 October 2022
Published: 22 October 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s10822-022-00486-x
