Computational chemogenomics: Is it more than inductive transfer?

Brown, J. B.; Okuno, Yasushi; Marcou, Gilles; Varnek, Alexandre; Horvath, Dragos

doi:10.1007/s10822-014-9743-1

Computational chemogenomics: Is it more than inductive transfer?

Published: 27 April 2014

Volume 28, pages 597–618, (2014)
Cite this article

Journal of Computer-Aided Molecular Design Aims and scope Submit manuscript

J. B. Brown²,
Yasushi Okuno²,
Gilles Marcou¹,
Alexandre Varnek¹ &
…
Dragos Horvath¹

762 Accesses
25 Citations
1 Altmetric
Explore all metrics

Abstract

High-throughput assays challenge us to extract knowledge from multi-ligand, multi-target activity data. In QSAR, weights are statically fitted to each ligand descriptor with respect to a single endpoint or target. However, computational chemogenomics (CG) has demonstrated benefits of learning from entire grids of data at once, rather than building target-specific QSARs. A possible reason for this is the emergence of inductive knowledge transfer (IT) between targets, providing statistical robustness to the model, with no assumption about the structure of the targets. Relevant protein descriptors in CG should allow one to learn how to dynamically adjust ligand attribute weights with respect to protein structure. Hence, models built through explicit learning (EL) by including protein information, while benefitting from IT enhancement, should provide additional predictive capability, notably for protein deorphanization. This interplay between IT and EL in CG modeling is not sufficiently studied. While IT is likely to occur irrespective of the injected target information, it is not clear whether and when boosting due to EL may occur. EL is only possible if protein description is appropriate to the target set under investigation. The key issue here is the search for evidence of genuine EL exceeding expectations based on pure IT. We explore the problem in the context of Support Vector Regression, using more than 9,400 \(pK_i\) values of 31 GPCRs, where compound–protein interactions are represented by the concatenation of vectorial descriptions of compounds and proteins. This provides a unified framework to generate both IT-enhanced and potentially EL-enabled models, where the difference is toggled by supplied protein information. For EL-enabled models, protein information includes genuine protein descriptors such as typical sequence-based terms, but also the experimentally determined affinity cross-correlation fingerprints. These latter benchmark the expected behavior of a quasi-ideal descriptor capturing the actual functional protein-protein relatedness, and therefore thought to be the most likely to enable EL. EL- and IT-based methods were benchmarked alongside classical QSAR, with respect to cross-validation and deorphanization challenges. A rational method for projecting benchmarked methodologies into a strategy space is given, in the aims that the projection will provide directions for the types of molecule designs possible using a given methodology. While EL-enabled strategies outperform classical QSARs and favorably compare to similar published results, they are, in all respects evaluated herein, not strongly distinguished from IT-enhanced models. Moreover, EL-enabled strategies failed to prove superior in deorphanization challenges. Therefore, this paper raises caution that, contrary to common belief and intuitive expectation, the benefits of chemogenomics models over classical QSAR are quite possibly due less to the injection of protein-related information, and rather impacted more by the effect of inductive transfer, due to simultaneous learning from all of the modeled endpoints. These results show that the field of protein descriptor research needs further improvements to truly realize the expected benefit of EL.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Selection of Informative Examples in Chemogenomic Datasets

Badapple: promiscuity patterns from noisy evidence

Article Open access 28 May 2016

Jeremy J. Yang, Oleg Ursu, … Cristian G. Bologa

Nonadditivity in public and inhouse data: implications for drug design

Article Open access 02 July 2021

D. Gogishvili, E. Nittinger, … C. Tyrchan

Abbreviations

CG:: Chemogenomics
GA:: Genetic algorithm
DS:: Descriptor space
GPCR:: G-protein coupled receptor
QSAR:: Quantitative structure–activity relationships
SVM:: Support vector machine
SVR:: Support vector regression
IT:: Inductive transfer
MTL:: Multi-task learning
RMSE:: Root mean squared error
ISIDA:: In silico design and data analysis

References

Abernethy J, Bach F, Evgeniou T, Vert JP (2009) A new approach to collaborative filtering: operator estimation with spectral regularization. J Mach Learn Res 10:803–826
Google Scholar
Argyriou A, Evgeniou T, Pontil M (2008) Convex multi-task feature learning. Mach Learn 73(3):243–272
Article Google Scholar
Bock JR, Gough DA (2002) A new method to estimate ligand-receptor energetics. Mol Cell Proteomics 1(11):904–910
Article CAS Google Scholar
Bock JR, Gough DA (2005) Virtual screen for ligands of orphan G protein-coupled receptors. J Chem Inf Model 45(5):1402–1414
Article CAS Google Scholar
Bonachera F, Horvath D (2008) Fuzzy tricentric pharmacophore fingerprints. 2. Application of topological fuzzy pharmacophore triplets in quantitative structure–activity relationships. J Chem Inf Model 48(2):409–425
Article CAS Google Scholar
Bonachera F, Parent B, Barbosa F, Froloff N, Horvath D (2006) Fuzzy tricentric pharmacophore fingerprints. 1—topological fuzzy pharmacophore triplets and adapted molecular similarity scoring schemes. J Chem Inf Model 46:2457–2477
Article CAS Google Scholar
Brown J, Nijima S, Okuno Y (2013) Compound–protein interaction prediction within chemogenomics: theoretical concepts, practical usage, and future directions. Mol Inf 32:906–921
Article CAS Google Scholar
Brown J, Okuno Y (2012) Systems biology and systems chemistry: new directions for drug discovery. Chem Biol 19(1):23–28
Article CAS Google Scholar
Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75
Article Google Scholar
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(27):1–27
Article Google Scholar
Collantes E, Dunn W (1995) Amino acid side chain descriptors for quantitative structure–activity relationship studies of peptide analogs. J Med Chem 38(14):2705–2713
Article CAS Google Scholar
Evgeniou T, Micchelli CA, Pontil M (2005) Learning multiple tasks with kernel methods. J Mach Learn Res 6:615–637
Google Scholar
Frimurer T, Ulven T, Elling C, Gerlach LO, Kostenis E, Hogberg T (2005) A physicogenetic method to assign ligand–binding relationships between 7TM receptors. Bioorg Med Chem Lett 15:3707–3712
Article CAS Google Scholar
Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP (2011) Chembl: a large-scale bioactivity database for drug discovery. Nucl Acids Res 40(D1):D1100–D1107
Article Google Scholar
Gozalbes R, Rolland C, Nicola E, Paugam MF, Coussy L, Horvath D, Barbosa F, Mao B, Revah F, Froloff N (2005) QSAR strategy and experimental validation for the development of a GPCR focused library. QSAR Comb Sci 24(4):508–516
Article CAS Google Scholar
Harrell F (2001) Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Graduate texts in mathematics. Springer, Berlin
Horvath D, Bonachera F, Solov’ev V, Gaudin C, Varnek A (2007) Stochastic versus stepwise strategies for quantitative structure–activity relationship generation—how much effort may the mining for successful QSAR models take? J Chem Inf Model 47:927–939
Article CAS Google Scholar
Horvath D, Marcou G, Varnek A (2013) Do not hesitate to use tversky—and other hints for successful active analogue searches with feature count descriptors. J Chem Inf Model 53(7):1543–1562
Article CAS Google Scholar
Hurle MR, Yang L, Xie Q, Rajpal DK, Sanseau P, Agarwal P (2013) Computational drug repositioning: from data to therapeutics. Clin Pharmacol Ther 93(4):335–341
Article CAS Google Scholar
Ivanciuc O (2007) Applications of support vector machines in chemistry. Wiley, New York, pp 291–400
Jacob L, Hoffmann B, Stoven V, Vert JP (2008) Virtual screening of GPCRS: an in silico chemogenomics approach. BMC Bioinform 9(1):363
Article Google Scholar
Jacob L, Vert JP (2008) Protein–ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 24(19):2149–2156
Article CAS Google Scholar
Kontijevskis A, Komorowski J, Wikberg JES (2008) Generalized proteochemometric model of multiple cytochrome p450 enzymes and their inhibitors. J Chem Inf Model 48(9):1840–1850
Article CAS Google Scholar
Kontijevskis A, Prusis P, Petrovska R, Yahorava S, Mutulis F, Mutule I, Komorowski J, Wikberg J (2007) A look inside HIV resistance through retroviral protease interaction maps. PLoS Comput Biol 3:e48
Article Google Scholar
Lapins M, Eklund M, Spjuth O, Prusis P, Wikberg J (2008) Proteochemometric modeling of hiv protease susceptibility. BMC Bioinform 9(1):181
Article Google Scholar
Lapinsh M, Prusis P, Gutcaits A, Lundstedt T, Wikberg J (2001) Development of proteo-chemometrics: a novel technology for the analysis of drug–receptor interactions. Biochim Biophys Acta 1525:180–190
Article CAS Google Scholar
Leslie CS, Eskin E, Cohen A, Weston J, Noble WS (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20(4):467–476
Article CAS Google Scholar
Li S, Xi L, Wang C, Li J, Lei B, Liu H, Yao X (2009) A novel method for protein–ligand binding affinity prediction and the related descriptors exploration. J Comput Chem 30(6):900–909
Article CAS Google Scholar
Li ZR, Lin HH, Han LY, Jiang L, Chen X, Chen YZ (2006) PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucl Acids Res 34(Suppl. 2):W32–W37
Medina-Franco JL, Giulianotti MA, Welmaker GS, Houghten RA (2013) Shifting from the single to the multitarget paradigm in drug discovery. Drug Discov Today 18(9–10):495–501
Article Google Scholar
Mikhalev AA, Shpilrain V, Yu JT (2004) The embedding problem. In: Borwein P, Borwein J (eds) Combinatorial methods. CMS books in mathematics. Springer, New York, pp 108–128
Pelikan M, Goldberg DE, Lobo FG (2002) A survey of optimization by building and using probabilistic models. Comput Optim Appl 21:5–20
Article Google Scholar
Rao HB, Zhu F, Yang GB, Li ZR, Chen YZ (2011) Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucl Acids Res 39(Suppl. 2):W385–W390
Rosenbaum L, Dorr A, Bauer MR, Boeckler FM, Zell A (2013) Inferring multi-target QSAR models with taxonomy-based multi-task learning. J Cheminform 5:1–20
Article Google Scholar
Ruggiu F, Gizzi P, Galzi JL, Hibert M, Haiech J, Baskin I, Horvath D, Marcou G, Varnek A (2014) Quantitative structure–property relationship modeling: a valuable support in high-throughput screening quality control. Anal Chem 86(5):2510–2520
Article CAS Google Scholar
Ruggiu F, Marcou G, Varnek A, Horvath D (2010) Isida property-labelled fragment descriptors. Mol Inform 29(12):855–868
Article CAS Google Scholar
Sandberg M, Eriksson L, Jonsson J, Sjostrom M, Wold S (1998) New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J Med Chem 41:2481–2491
Article CAS Google Scholar
Schölkopf B, Tsuda K, Vert J (2004) Kernel methods in computational biology. MIT, Boston, MA, USA
Smola AJ, Schlkopf B (2004) A tutorial on support vector regression. Stat Comput 14(3):199–222
Article Google Scholar
Strombergsson H, Daniluk P, Kryshtafovych A, Fidelis K, Wikberg J, Kleywegt G, Hvidsten T (2008) Interaction model based on local protein substructures generalizes to the entire structural enzyme–ligand space. J Chem Inf Model 48:2278–2288
Article Google Scholar
Tetko IV (2002) Neural network studies. 4. Introduction to associative neural networks. J Chem Inf Comput Sci 42(3):717–728
Article CAS Google Scholar
Van Westen G, Wegner J, Geluykens P, Kwanten L, Vereycken I, Peeters A, IJzerman A, Van Vlijmen H, Bender A (2011) Which compound to select in lead optimization? Prospectively validated proteochemometric models guide preclinical development. PLoS One 6:e27518
Article Google Scholar
Van Westen G, Wegner J, Ijzerman A, Van Vlijmen H, Bender A (2011) Proteochemometric modeling as a tool for designing selective compounds and extrapolating to novel targets. Med Chem Commun 2:16–30
Article Google Scholar
Varnek A, Gaudin C, Marcou G, Baskin I, Pandey AK, Tetko IV (2009) Inductive transfer of knowledge: application of multi-task learning and feature net approaches to model tissue-air partition coefficients. J Chem Inf Model 49(1):133–144
Article CAS Google Scholar
Varnek A, Tropsha A (2009) Chemoinformatics: approaches to virtual screening. Royal Society of Chemistry. Cambridge, USA
Wassermann AM, Geppert H, Bajorath J (2009) Ligand prediction for orphan targets using support vector machines and various target-ligand kernels is dominated by nearest neighbor effects. J Chem Inf Model 49(10):2155–2167
Article CAS Google Scholar
Weill N, Rognan D (2009) Development and validation of a novel protein–ligand fingerprint to mine chemogenomic space: application to G protein-coupled receptors and their ligands. J Chem Inf Model 49(4):1049–1062
Article CAS Google Scholar
Weill N, Rognan D (2010) Alignment-free ultra-high-throughput comparison of druggable proteinligand binding sites. J Chem Inf Model 50(1):123–135
Article CAS Google Scholar
van Westen G, Swier R, Cortes-Ciriano I, Wegner J, Overington J, IJzerman A, Van Vlijmen H, Bender A (2013) Benchmarking of protein descriptors in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptors. J Cheminform 5:42
Article Google Scholar
van Westen GJP, Wegner JK, Ijzerman AP, van Vlijmen HWT, Bender A (2010) Proteochemometric modeling as a tool to design selective compounds and for extrapolating to novel targets. MedChemComm 2(1):16–30
Article Google Scholar
Yabuuchi H, Niijima S, Takematsu H, Ida T, Hirokawa T, Hara T, Ogawa T, Minowa Y, Tsujimoto G, Okuno Y (2011) Analysis of multiple compound–protein interactions reveals novel bioactive molecules. Mol Syst Biol 7(472)

Download references

Acknowledgments

The authors thank the High Performance Computing centers of the Universities of Strasbourg, France, and Cluj, Romania, for having hosted a part of the herein reported calculations. This work was also supported in part by a Grant-in-Aid for Young Scientists from the Japanese Society for the Promotion of Science [Kakenhi (B) 25870336]. Special thanks are given to Professor Jürgen Bajorath of the University of Bonn for extracting and cleaning the herein used datasets. This research was additionally supported by the Funding Program for Next Generation World-Leading Researchers as well as the CREST program of the Japan Science and Technology Agency. J. B. Brown and Yasushi Okuno are supported in part by grants from Chugai Pharmaceutical Co. Ltd. and Mitsui Knowledge Co. Ltd.

Author information

Authors and Affiliations

Laboratoire de Chémoinformatique, UMR 7140 CNRS, Univ. Strasbourg, 1, rue Blaise Pascal, 67000 , Strasbourg, France
Gilles Marcou, Alexandre Varnek & Dragos Horvath
Department of Clinical System Onco-Informatics, Graduate School of Medicine, Kyoto University, Kyoto, 606-8501, Japan
J. B. Brown & Yasushi Okuno

Authors

J. B. Brown
View author publications
You can also search for this author in PubMed Google Scholar
Yasushi Okuno
View author publications
You can also search for this author in PubMed Google Scholar
Gilles Marcou
View author publications
You can also search for this author in PubMed Google Scholar
Alexandre Varnek
View author publications
You can also search for this author in PubMed Google Scholar
Dragos Horvath
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dragos Horvath.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 2301 KB)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Brown, J.B., Okuno, Y., Marcou, G. et al. Computational chemogenomics: Is it more than inductive transfer?. J Comput Aided Mol Des 28, 597–618 (2014). https://doi.org/10.1007/s10822-014-9743-1

Download citation

Received: 20 January 2014
Accepted: 11 April 2014
Published: 27 April 2014
Issue Date: June 2014
DOI: https://doi.org/10.1007/s10822-014-9743-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Computational chemogenomics: Is it more than inductive transfer?

Abstract

Access this article

Similar content being viewed by others

Selection of Informative Examples in Chemogenomic Datasets

Badapple: promiscuity patterns from noisy evidence

Nonadditivity in public and inhouse data: implications for drug design

Abbreviations

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

Supplementary material 1 (zip 2301 KB)

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Computational chemogenomics: Is it more than inductive transfer?

Abstract

Access this article

Similar content being viewed by others

Selection of Informative Examples in Chemogenomic Datasets

Badapple: promiscuity patterns from noisy evidence

Nonadditivity in public and inhouse data: implications for drug design

Abbreviations

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

Supplementary material 1 (zip 2301 KB)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation