Computational chemogenomics: Is it more than inductive transfer?

Journal of Computer-Aided Molecular Design

Abstract

High-throughput assays challenge us to extract knowledge from multi-ligand, multi-target activity data. In QSAR, weights are statically fitted to each ligand descriptor with respect to a single endpoint or target. However, computational chemogenomics (CG) has demonstrated the benefits of learning from entire grids of data at once, rather than building target-specific QSARs. A possible reason for this is the emergence of inductive knowledge transfer (IT) between targets, which provides statistical robustness to the model without any assumption about the structure of the targets. Relevant protein descriptors in CG should allow one to learn how to dynamically adjust ligand attribute weights with respect to protein structure. Hence, models built through explicit learning (EL), that is, by including protein information, should still benefit from IT enhancement while providing additional predictive capability, notably for protein deorphanization. This interplay between IT and EL in CG modeling has not been sufficiently studied. While IT is likely to occur irrespective of the injected target information, it is not clear whether and when a boost due to EL may occur. EL is only possible if the protein description is appropriate to the target set under investigation. The key issue here is the search for evidence of genuine EL exceeding expectations based on pure IT. We explore the problem in the context of Support Vector Regression, using more than 9,400 \(pK_i\) values of 31 GPCRs, where compound–protein interactions are represented by the concatenation of vectorial descriptions of compounds and proteins. This provides a unified framework to generate both IT-enhanced and potentially EL-enabled models, where the difference is toggled by the supplied protein information. For EL-enabled models, protein information includes genuine protein descriptors such as typical sequence-based terms, but also the experimentally determined affinity cross-correlation fingerprints. The latter benchmark the expected behavior of a quasi-ideal descriptor that captures the actual functional protein–protein relatedness, and are therefore thought to be the most likely to enable EL. EL- and IT-based methods were benchmarked alongside classical QSAR, with respect to cross-validation and deorphanization challenges. A rational method for projecting benchmarked methodologies into a strategy space is given, with the aim that the projection will indicate which types of molecular design are possible with a given methodology. While EL-enabled strategies outperform classical QSARs and compare favorably to similar published results, they are, in all respects evaluated herein, not strongly distinguished from IT-enhanced models. Moreover, EL-enabled strategies failed to prove superior in deorphanization challenges. Therefore, this paper raises the caution that, contrary to common belief and intuitive expectation, the benefits of chemogenomics models over classical QSAR are quite possibly due less to the injection of protein-related information than to inductive transfer arising from simultaneous learning from all of the modeled endpoints. These results show that protein descriptor research needs further progress before the expected benefit of EL can be truly realized.
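To make the concatenated representation described above concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that an IT-only model can be emulated by replacing real protein descriptors with a one-hot target identity vector; that encoding is an assumption of this sketch and is not confirmed by the abstract. All names (`identity_descriptors`, `build_rows`, `leave_one_target_out`) and all numeric values are hypothetical toy data. The deorphanization split is likewise assumed to mean holding out every measurement of one target.

```python
from typing import Dict, List, Sequence, Tuple

Activity = Tuple[str, str, float]  # (ligand id, target id, pKi value)


def one_hot(index: int, size: int) -> List[float]:
    """Vector of zeros with a single 1.0 at the given position."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def identity_descriptors(target_ids: Sequence[str]) -> Dict[str, List[float]]:
    """Assumed IT-only encoding: the vector says which target a row belongs to
    and carries no information about the structure of that target."""
    ordered = sorted(target_ids)
    return {t: one_hot(i, len(ordered)) for i, t in enumerate(ordered)}


def build_rows(
    activities: Sequence[Activity],
    ligand_desc: Dict[str, List[float]],
    protein_desc: Dict[str, List[float]],
) -> Tuple[List[List[float]], List[float]]:
    """Concatenate ligand and protein vectors into one row per measurement."""
    rows = [ligand_desc[lig] + protein_desc[tgt] for lig, tgt, _ in activities]
    labels = [pki for _, _, pki in activities]
    return rows, labels


def leave_one_target_out(
    activities: Sequence[Activity], orphan: str
) -> Tuple[List[Activity], List[Activity]]:
    """Assumed deorphanization split: no data of the orphan target in training."""
    train = [a for a in activities if a[1] != orphan]
    test = [a for a in activities if a[1] == orphan]
    return train, test


# Toy, made-up descriptors and activities (for illustration only).
LIGANDS = {"lig1": [1.0, 0.0, 2.0], "lig2": [0.0, 3.0, 1.0]}
SEQ_DESC = {"GPCR_A": [0.2, 0.7], "GPCR_B": [0.9, 0.1]}
DATA = [("lig1", "GPCR_A", 7.1), ("lig2", "GPCR_A", 5.4), ("lig1", "GPCR_B", 6.3)]

# Same ligand block in both cases; only the protein block is toggled.
it_rows, y = build_rows(DATA, LIGANDS, identity_descriptors(list(SEQ_DESC)))
el_rows, _ = build_rows(DATA, LIGANDS, SEQ_DESC)
train, test = leave_one_target_out(DATA, "GPCR_B")
```

The resulting rows and labels could then be passed to any support vector regression implementation. Because both variants share the ligand block and the learner, any difference in performance would be attributable to the protein block alone, which is the comparison the abstract describes.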


Abbreviations

CG: Chemogenomics
GA: Genetic algorithm
DS: Descriptor space
GPCR: G-protein coupled receptor
QSAR: Quantitative structure–activity relationships
SVM: Support vector machine
SVR: Support vector regression
IT: Inductive transfer
MTL: Multi-task learning
RMSE: Root mean squared error
ISIDA: In silico design and data analysis


Acknowledgments

The authors thank the High Performance Computing centers of the Universities of Strasbourg, France, and Cluj, Romania, for hosting part of the calculations reported herein. This work was also supported in part by a Grant-in-Aid for Young Scientists from the Japan Society for the Promotion of Science [Kakenhi (B) 25870336]. Special thanks are given to Professor Jürgen Bajorath of the University of Bonn for extracting and cleaning the datasets used herein. This research was additionally supported by the Funding Program for Next Generation World-Leading Researchers as well as the CREST program of the Japan Science and Technology Agency. J. B. Brown and Yasushi Okuno are supported in part by grants from Chugai Pharmaceutical Co. Ltd. and Mitsui Knowledge Co. Ltd.

Author information

Corresponding author

Correspondence to Dragos Horvath.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 2301 KB)


Cite this article

Brown, J.B., Okuno, Y., Marcou, G. et al. Computational chemogenomics: Is it more than inductive transfer? J Comput Aided Mol Des 28, 597–618 (2014). https://doi.org/10.1007/s10822-014-9743-1


  • DOI: https://doi.org/10.1007/s10822-014-9743-1
