Abstract
High-throughput assays challenge us to extract knowledge from multi-ligand, multi-target activity data. In QSAR, weights are statically fitted to each ligand descriptor with respect to a single endpoint or target. However, computational chemogenomics (CG) has demonstrated benefits of learning from entire grids of data at once, rather than building target-specific QSARs. A possible reason for this is the emergence of inductive knowledge transfer (IT) between targets, providing statistical robustness to the model, with no assumption about the structure of the targets. Relevant protein descriptors in CG should allow one to learn how to dynamically adjust ligand attribute weights with respect to protein structure. Hence, models built through explicit learning (EL) by including protein information, while benefitting from IT enhancement, should provide additional predictive capability, notably for protein deorphanization. This interplay between IT and EL in CG modeling is not sufficiently studied. While IT is likely to occur irrespective of the injected target information, it is not clear whether and when boosting due to EL may occur. EL is only possible if protein description is appropriate to the target set under investigation. The key issue here is the search for evidence of genuine EL exceeding expectations based on pure IT. We explore the problem in the context of Support Vector Regression, using more than 9,400 \(pK_i\) values of 31 GPCRs, where compound–protein interactions are represented by the concatenation of vectorial descriptions of compounds and proteins. This provides a unified framework to generate both IT-enhanced and potentially EL-enabled models, where the difference is toggled by supplied protein information. For EL-enabled models, protein information includes genuine protein descriptors such as typical sequence-based terms, but also the experimentally determined affinity cross-correlation fingerprints. These latter benchmark the expected behavior of a quasi-ideal descriptor capturing the actual functional protein-protein relatedness, and therefore thought to be the most likely to enable EL. EL- and IT-based methods were benchmarked alongside classical QSAR, with respect to cross-validation and deorphanization challenges. A rational method for projecting benchmarked methodologies into a strategy space is given, in the aims that the projection will provide directions for the types of molecule designs possible using a given methodology. While EL-enabled strategies outperform classical QSARs and favorably compare to similar published results, they are, in all respects evaluated herein, not strongly distinguished from IT-enhanced models. Moreover, EL-enabled strategies failed to prove superior in deorphanization challenges. Therefore, this paper raises caution that, contrary to common belief and intuitive expectation, the benefits of chemogenomics models over classical QSAR are quite possibly due less to the injection of protein-related information, and rather impacted more by the effect of inductive transfer, due to simultaneous learning from all of the modeled endpoints. These results show that the field of protein descriptor research needs further improvements to truly realize the expected benefit of EL.
Similar content being viewed by others
Abbreviations
- CG:
-
Chemogenomics
- GA:
-
Genetic algorithm
- DS:
-
Descriptor space
- GPCR:
-
G-protein coupled receptor
- QSAR:
-
Quantitative structure–activity relationships
- SVM:
-
Support vector machine
- SVR:
-
Support vector regression
- IT:
-
Inductive transfer
- MTL:
-
Multi-task learning
- RMSE:
-
Root mean squared error
- ISIDA:
-
In silico design and data analysis
References
Abernethy J, Bach F, Evgeniou T, Vert JP (2009) A new approach to collaborative filtering: operator estimation with spectral regularization. J Mach Learn Res 10:803–826
Argyriou A, Evgeniou T, Pontil M (2008) Convex multi-task feature learning. Mach Learn 73(3):243–272
Bock JR, Gough DA (2002) A new method to estimate ligand-receptor energetics. Mol Cell Proteomics 1(11):904–910
Bock JR, Gough DA (2005) Virtual screen for ligands of orphan G protein-coupled receptors. J Chem Inf Model 45(5):1402–1414
Bonachera F, Horvath D (2008) Fuzzy tricentric pharmacophore fingerprints. 2. Application of topological fuzzy pharmacophore triplets in quantitative structure–activity relationships. J Chem Inf Model 48(2):409–425
Bonachera F, Parent B, Barbosa F, Froloff N, Horvath D (2006) Fuzzy tricentric pharmacophore fingerprints. 1—topological fuzzy pharmacophore triplets and adapted molecular similarity scoring schemes. J Chem Inf Model 46:2457–2477
Brown J, Nijima S, Okuno Y (2013) Compound–protein interaction prediction within chemogenomics: theoretical concepts, practical usage, and future directions. Mol Inf 32:906–921
Brown J, Okuno Y (2012) Systems biology and systems chemistry: new directions for drug discovery. Chem Biol 19(1):23–28
Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(27):1–27
Collantes E, Dunn W (1995) Amino acid side chain descriptors for quantitative structure–activity relationship studies of peptide analogs. J Med Chem 38(14):2705–2713
Evgeniou T, Micchelli CA, Pontil M (2005) Learning multiple tasks with kernel methods. J Mach Learn Res 6:615–637
Frimurer T, Ulven T, Elling C, Gerlach LO, Kostenis E, Hogberg T (2005) A physicogenetic method to assign ligand–binding relationships between 7TM receptors. Bioorg Med Chem Lett 15:3707–3712
Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP (2011) Chembl: a large-scale bioactivity database for drug discovery. Nucl Acids Res 40(D1):D1100–D1107
Gozalbes R, Rolland C, Nicola E, Paugam MF, Coussy L, Horvath D, Barbosa F, Mao B, Revah F, Froloff N (2005) QSAR strategy and experimental validation for the development of a GPCR focused library. QSAR Comb Sci 24(4):508–516
Harrell F (2001) Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Graduate texts in mathematics. Springer, Berlin
Horvath D, Bonachera F, Solov’ev V, Gaudin C, Varnek A (2007) Stochastic versus stepwise strategies for quantitative structure–activity relationship generation—how much effort may the mining for successful QSAR models take? J Chem Inf Model 47:927–939
Horvath D, Marcou G, Varnek A (2013) Do not hesitate to use tversky—and other hints for successful active analogue searches with feature count descriptors. J Chem Inf Model 53(7):1543–1562
Hurle MR, Yang L, Xie Q, Rajpal DK, Sanseau P, Agarwal P (2013) Computational drug repositioning: from data to therapeutics. Clin Pharmacol Ther 93(4):335–341
Ivanciuc O (2007) Applications of support vector machines in chemistry. Wiley, New York, pp 291–400
Jacob L, Hoffmann B, Stoven V, Vert JP (2008) Virtual screening of GPCRS: an in silico chemogenomics approach. BMC Bioinform 9(1):363
Jacob L, Vert JP (2008) Protein–ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 24(19):2149–2156
Kontijevskis A, Komorowski J, Wikberg JES (2008) Generalized proteochemometric model of multiple cytochrome p450 enzymes and their inhibitors. J Chem Inf Model 48(9):1840–1850
Kontijevskis A, Prusis P, Petrovska R, Yahorava S, Mutulis F, Mutule I, Komorowski J, Wikberg J (2007) A look inside HIV resistance through retroviral protease interaction maps. PLoS Comput Biol 3:e48
Lapins M, Eklund M, Spjuth O, Prusis P, Wikberg J (2008) Proteochemometric modeling of hiv protease susceptibility. BMC Bioinform 9(1):181
Lapinsh M, Prusis P, Gutcaits A, Lundstedt T, Wikberg J (2001) Development of proteo-chemometrics: a novel technology for the analysis of drug–receptor interactions. Biochim Biophys Acta 1525:180–190
Leslie CS, Eskin E, Cohen A, Weston J, Noble WS (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20(4):467–476
Li S, Xi L, Wang C, Li J, Lei B, Liu H, Yao X (2009) A novel method for protein–ligand binding affinity prediction and the related descriptors exploration. J Comput Chem 30(6):900–909
Li ZR, Lin HH, Han LY, Jiang L, Chen X, Chen YZ (2006) PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucl Acids Res 34(Suppl. 2):W32–W37
Medina-Franco JL, Giulianotti MA, Welmaker GS, Houghten RA (2013) Shifting from the single to the multitarget paradigm in drug discovery. Drug Discov Today 18(9–10):495–501
Mikhalev AA, Shpilrain V, Yu JT (2004) The embedding problem. In: Borwein P, Borwein J (eds) Combinatorial methods. CMS books in mathematics. Springer, New York, pp 108–128
Pelikan M, Goldberg DE, Lobo FG (2002) A survey of optimization by building and using probabilistic models. Comput Optim Appl 21:5–20
Rao HB, Zhu F, Yang GB, Li ZR, Chen YZ (2011) Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucl Acids Res 39(Suppl. 2):W385–W390
Rosenbaum L, Dorr A, Bauer MR, Boeckler FM, Zell A (2013) Inferring multi-target QSAR models with taxonomy-based multi-task learning. J Cheminform 5:1–20
Ruggiu F, Gizzi P, Galzi JL, Hibert M, Haiech J, Baskin I, Horvath D, Marcou G, Varnek A (2014) Quantitative structure–property relationship modeling: a valuable support in high-throughput screening quality control. Anal Chem 86(5):2510–2520
Ruggiu F, Marcou G, Varnek A, Horvath D (2010) Isida property-labelled fragment descriptors. Mol Inform 29(12):855–868
Sandberg M, Eriksson L, Jonsson J, Sjostrom M, Wold S (1998) New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J Med Chem 41:2481–2491
Schölkopf B, Tsuda K, Vert J (2004) Kernel methods in computational biology. MIT, Boston, MA, USA
Smola AJ, Schlkopf B (2004) A tutorial on support vector regression. Stat Comput 14(3):199–222
Strombergsson H, Daniluk P, Kryshtafovych A, Fidelis K, Wikberg J, Kleywegt G, Hvidsten T (2008) Interaction model based on local protein substructures generalizes to the entire structural enzyme–ligand space. J Chem Inf Model 48:2278–2288
Tetko IV (2002) Neural network studies. 4. Introduction to associative neural networks. J Chem Inf Comput Sci 42(3):717–728
Van Westen G, Wegner J, Geluykens P, Kwanten L, Vereycken I, Peeters A, IJzerman A, Van Vlijmen H, Bender A (2011) Which compound to select in lead optimization? Prospectively validated proteochemometric models guide preclinical development. PLoS One 6:e27518
Van Westen G, Wegner J, Ijzerman A, Van Vlijmen H, Bender A (2011) Proteochemometric modeling as a tool for designing selective compounds and extrapolating to novel targets. Med Chem Commun 2:16–30
Varnek A, Gaudin C, Marcou G, Baskin I, Pandey AK, Tetko IV (2009) Inductive transfer of knowledge: application of multi-task learning and feature net approaches to model tissue-air partition coefficients. J Chem Inf Model 49(1):133–144
Varnek A, Tropsha A (2009) Chemoinformatics: approaches to virtual screening. Royal Society of Chemistry. Cambridge, USA
Wassermann AM, Geppert H, Bajorath J (2009) Ligand prediction for orphan targets using support vector machines and various target-ligand kernels is dominated by nearest neighbor effects. J Chem Inf Model 49(10):2155–2167
Weill N, Rognan D (2009) Development and validation of a novel protein–ligand fingerprint to mine chemogenomic space: application to G protein-coupled receptors and their ligands. J Chem Inf Model 49(4):1049–1062
Weill N, Rognan D (2010) Alignment-free ultra-high-throughput comparison of druggable proteinligand binding sites. J Chem Inf Model 50(1):123–135
van Westen G, Swier R, Cortes-Ciriano I, Wegner J, Overington J, IJzerman A, Van Vlijmen H, Bender A (2013) Benchmarking of protein descriptors in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptors. J Cheminform 5:42
van Westen GJP, Wegner JK, Ijzerman AP, van Vlijmen HWT, Bender A (2010) Proteochemometric modeling as a tool to design selective compounds and for extrapolating to novel targets. MedChemComm 2(1):16–30
Yabuuchi H, Niijima S, Takematsu H, Ida T, Hirokawa T, Hara T, Ogawa T, Minowa Y, Tsujimoto G, Okuno Y (2011) Analysis of multiple compound–protein interactions reveals novel bioactive molecules. Mol Syst Biol 7(472)
Acknowledgments
The authors thank the High Performance Computing centers of the Universities of Strasbourg, France, and Cluj, Romania, for having hosted a part of the herein reported calculations. This work was also supported in part by a Grant-in-Aid for Young Scientists from the Japanese Society for the Promotion of Science [Kakenhi (B) 25870336]. Special thanks are given to Professor Jürgen Bajorath of the University of Bonn for extracting and cleaning the herein used datasets. This research was additionally supported by the Funding Program for Next Generation World-Leading Researchers as well as the CREST program of the Japan Science and Technology Agency. J. B. Brown and Yasushi Okuno are supported in part by grants from Chugai Pharmaceutical Co. Ltd. and Mitsui Knowledge Co. Ltd.
Author information
Authors and Affiliations
Corresponding author
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Brown, J.B., Okuno, Y., Marcou, G. et al. Computational chemogenomics: Is it more than inductive transfer?. J Comput Aided Mol Des 28, 597–618 (2014). https://doi.org/10.1007/s10822-014-9743-1
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10822-014-9743-1