
Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection

Molecular Diversity

Abstract

One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power. The latter can be defined as the ability of a model to predict accurately the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into the training and test sets, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For quantitative description of these criteria, we use molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414–425). For rational division of a dataset into the training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
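
The abstract does not spell out the three sphere-exclusion algorithms, so the following is only a minimal sketch of a generic sphere-exclusion split, not the authors' procedure. It assumes compounds are given as numeric descriptor vectors, uses Euclidean distance, and takes the sphere radius as a free parameter; the function name `sphere_exclusion_split` and the rule "sphere centre goes to the training set, the other points inside the sphere go to the test set" are illustrative assumptions.

```python
import math
import random


def sphere_exclusion_split(points, radius, seed=0):
    """Split point indices into (train, test) by a generic sphere-exclusion pass.

    Assumed variant (not taken from the paper): repeatedly pick a random
    unassigned point as a sphere centre and add it to the training set;
    every other unassigned point within `radius` of that centre is added
    to the test set. Repeat until no unassigned points remain.
    """
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    train, test = [], []
    while remaining:
        # Choose a sphere centre among the points not yet assigned.
        centre = remaining.pop(rng.randrange(len(remaining)))
        train.append(centre)
        # Points inside the sphere around the centre go to the test set.
        inside = {i for i in remaining
                  if math.dist(points[i], points[centre]) <= radius}
        test.extend(sorted(inside))
        remaining = [i for i in remaining if i not in inside]
    return train, test


if __name__ == "__main__":
    # Toy 2-D "descriptor space": a 5 x 5 grid of points.
    grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
    train_idx, test_idx = sphere_exclusion_split(grid, radius=1.5, seed=1)
    print("training set size:", len(train_idx))
    print("test set size:", len(test_idx))
```

By construction, every test point lies within `radius` of at least one training point (criterion i), and any two training points are more than `radius` apart, which keeps the training set spread out (criterion iii). Criterion (ii) is not guaranteed by this sketch: an isolated training point may have no test point nearby.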




Cite this article

Golbraikh, A., Tropsha, A. Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection. Mol Divers 5, 231–243 (2000). https://doi.org/10.1023/A:1021372108686


  • DOI: https://doi.org/10.1023/A:1021372108686
