  • Review Article

Chemical database techniques in drug discovery

Key Points

  • Although originally created merely as repositories for compounds synthesized within an organization, chemical databases can now be searched to give novel ideas for lead discovery.

  • The familiar chemical-structure diagrams are not amenable to computational operations such as database searching, so several types of chemical-structure representation have been developed by theoretical chemists for use in computer systems. The predominant form is the atom–bond connection table.

  • A connection table along with two-dimensional coordinates for display is generally sufficient to identify the substance. However, to perform any energy calculations or to determine if the compound has the potential to bind to a receptor or enzyme of interest, three-dimensional coordinates are necessary.

  • Chemical structures differ considerably from other entities that are commonly stored in databases, such as text, and so the various search modes also differ considerably, although some parallels can be drawn.

  • Exact-match searches — which might be performed to find out if a proposed new structure already exists in a database, for example — can be thought of as looking up a complete word in a dictionary.

  • Substructure searches — in which a user picks pieces of a chemical structure and requests that the system return a set of compounds that contain the pieces — are analogous to a wild-carded text search.

  • Similarity searches, which a user might run to find compounds that a chemist would intuitively judge to resemble the compound of interest even though they are neither exact nor substructure matches, are analogous to a 'sounds like' text search.

  • In pharmacophoric searches, assumptions about which groups of atoms on the small molecule are involved in binding are combined with the spatial relationship of these groups to give a three-dimensional query. Although the process is generally slower than previously mentioned search types, the results provide an indication of whether a set of structures can bind to a receptor or enzyme, and so hits might be very valuable in the drug design process.

  • Molecular docking — placing a series of candidate molecules from a database into the active site of a protein to evaluate how well the compounds might bind to the receptor or enzyme — has become a more popular mode of database searching owing to continual improvement in the quality of the docking and scoring algorithms.

  • Once a database search has been performed, the list of potential molecules for biological testing can be refined by filtering (removing molecules deemed to have unsuitable properties), clustering (grouping similar compounds) and human inspection.
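The atom–bond connection table mentioned in the Key Points can be illustrated with a minimal sketch. It encodes the heavy atoms of γ-aminobutyric acid (GABA), the molecule used in Figure 1, with implicit hydrogens. The atom numbering, the variable names and the helper functions are assumptions made for this illustration; they do not reproduce any particular file format.

```python
from collections import Counter

# Heavy-atom connection table for GABA (H2N-CH2-CH2-CH2-COOH).
# Hydrogens are implicit; the atom numbering is arbitrary.
ATOMS = {1: "N", 2: "C", 3: "C", 4: "C", 5: "C", 6: "O", 7: "O"}

# Each bond is (atom_a, atom_b, bond_order).
BONDS = [
    (1, 2, 1),
    (2, 3, 1),
    (3, 4, 1),
    (4, 5, 1),
    (5, 6, 2),  # carbonyl C=O
    (5, 7, 1),  # hydroxyl C-O
]


def neighbours(atom_id, bonds):
    """Return {neighbour_id: bond_order} for one atom."""
    result = {}
    for a, b, order in bonds:
        if a == atom_id:
            result[b] = order
        elif b == atom_id:
            result[a] = order
    return result


def heavy_atom_counts(atoms):
    """Count heavy atoms per element."""
    return dict(Counter(atoms.values()))


carboxyl_carbon = neighbours(5, BONDS)
composition = heavy_atom_counts(ATOMS)
```

A table like this identifies the substance but carries no geometry; two- or three-dimensional coordinates would be stored as extra per-atom fields.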

Abstract

Chemical databases are becoming a powerful tool in drug discovery. Database searches based on possible requirements for biological activity can identify compounds that might be suitable for further analysis or indicate novel ways to achieve the desired activity. What considerations are involved in the construction and searching of chemical databases?

Figure 1: Representations of chemical structure, using the neurotransmitter γ-aminobutyric acid (GABA) as an example.

Acknowledgements

M. M. would like to thank his colleagues in the scientific architecture group for useful feedback on the manuscript.

Related links

FURTHER INFORMATION

CACTVS

ChemIDPlus

ChemWeb

Daylight Chemical Information Systems

Fingerprints — Screening and Similarity

National Cancer Institute Database

SMILES

Glossary

DRUG LIKE

Sharing certain characteristics with other molecules that act as drugs. The set of characteristics — size, shape and solubility in water and organic solvents — varies depending on who is evaluating the molecules.

CYCLIC/ACYCLIC BONDS

If chemical bonds occur in a ring, they are termed 'cyclic'. 'Acyclic bonds' occur in open chain structures.

COUNTERION

A set of one or more bonded atoms, with opposite charge and generally smaller size, that accompanies another charged set of bonded atoms as dictated by the principle of electrical neutrality of substances, solutions and so on.

COMBINATORIAL CHEMISTRY

The generation of large collections, or 'libraries', of compounds by synthesizing all possible combinations of a set of smaller chemical structures.

SEMA NAME

A stereochemical extension of the Morgan algorithm. A compact, canonical representation of a connection table.

SUBSTRUCTURE

One chemical structure is said to be a substructure of another if the first structure can be located within the second. (The second is said to be the superstructure of the first.) All structures are substructures of themselves. A substructure search scans a database for all substructural matches.
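A substructure search can be sketched as a backtracking match of a query graph onto each database graph. The code below is a simple illustration under stated assumptions: molecules are heavy-atom connection tables with implicit hydrogens, atoms match only on element and bonds only on bond order, and the three-entry "database" is hypothetical. Production systems use far more refined matching rules and pre-screening.

```python
def make_graph(atoms, bonds):
    """Build {atom_id: (element, {neighbour_id: bond_order})}."""
    graph = {i: (element, {}) for i, element in atoms.items()}
    for a, b, order in bonds:
        graph[a][1][b] = order
        graph[b][1][a] = order
    return graph


def is_substructure(query, target):
    """True if every query atom and bond can be mapped onto the target."""
    q_ids = list(query)

    def extend(mapping):
        if len(mapping) == len(q_ids):
            return True
        q = q_ids[len(mapping)]
        q_elem, q_nbrs = query[q]
        for t, (t_elem, t_nbrs) in target.items():
            if t in mapping.values() or t_elem != q_elem:
                continue
            # Each already-mapped query neighbour must be bonded to t
            # in the target with the same bond order.
            if all(t_nbrs.get(mapping[n]) == order
                   for n, order in q_nbrs.items() if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


# Hypothetical three-compound database (heavy atoms only).
DATABASE = {
    "GABA": make_graph(
        {1: "N", 2: "C", 3: "C", 4: "C", 5: "C", 6: "O", 7: "O"},
        [(1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 5, 1), (5, 6, 2), (5, 7, 1)]),
    "butylamine": make_graph(
        {1: "N", 2: "C", 3: "C", 4: "C", 5: "C"},
        [(1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 5, 1)]),
    "acetic acid": make_graph(
        {1: "C", 2: "C", 3: "O", 4: "O"},
        [(1, 2, 1), (2, 3, 2), (2, 4, 1)]),
}

# Query fragment: a carbon with one double-bonded and one single-bonded oxygen.
CARBOXYL = make_graph({1: "C", 2: "O", 3: "O"}, [(1, 2, 2), (1, 3, 1)])

hits = sorted(name for name, mol in DATABASE.items()
              if is_substructure(CARBOXYL, mol))
```

Because the mapping only requires that query bonds exist in the target, extra atoms and bonds in the target are allowed, which is what distinguishes a substructure match from an exact match.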

CONFORMATIONAL SPACE

The ensemble of three-dimensional shapes that a molecule can adopt without breaking any bonds.

MARKUSH STRUCTURE

Markush structures represent a set of chemical structures as a common core that contains marked substitution sites, and a set of possible fragments for each substitution point. They can be used to represent a set of compounds that are analysed to determine the effect of varying substituents on compound activity; to represent a set of compounds that are produced using combinatorial techniques; to produce a fine-tuned substructure query; or to represent a set of structures in a chemical patent or patent database.
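The enumeration of a Markush structure into individual compounds can be sketched with a template and a Cartesian product over the substituent lists. The core, the site labels `R1` and `R2` and the fragment lists below are hypothetical, and the structures are written as SMILES strings purely for compactness; this is an illustration, not the representation used by any specific database system.

```python
from itertools import product


def enumerate_markush(core, substituents):
    """Expand a core template into every combination of substituents.

    core         -- string with named placeholders such as "{R1}"
    substituents -- {site_name: [fragment, ...]}
    """
    sites = sorted(substituents)
    structures = []
    for combo in product(*(substituents[site] for site in sites)):
        structures.append(core.format(**dict(zip(sites, combo))))
    return structures


# Hypothetical para-disubstituted benzene core with two substitution sites.
CORE = "{R1}c1ccc({R2})cc1"
SUBSTITUENTS = {
    "R1": ["C", "CC", "Cl"],  # methyl, ethyl, chloro
    "R2": ["O", "N"],         # hydroxyl, amino
}

library = enumerate_markush(CORE, SUBSTITUENTS)
```

The library size is the product of the list lengths (here 3 × 2 = 6), which is why combinatorial libraries grow so quickly as sites and fragments are added.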

PHARMACOPHORE

The ensemble of steric and electronic features that is necessary to ensure optimal interactions with a specific biological target structure and to trigger (or to block) its biological response.

STEREOCHEMISTRY

The spatial arrangements of atoms in molecules and complexes.

TAUTOMER

One of two or more structural isomers that exist in equilibrium and are readily converted from one isomeric form to another.

HYDROGEN BOND

A weak attraction (much weaker than a covalent or ionic chemical bond, but much stronger than van der Waals forces) between an oxygen, nitrogen or fluorine atom in one molecule and a hydrogen atom in a neighbouring molecule. Hydrogen-bond donors are groups with electron-hungry hydrogen atoms. Hydrogen-bond acceptors are atoms with electrons to share.

BIT STRING

A contiguous set of characters that consists entirely of 1s and 0s. A bit string can be used to encode a good deal of information in a compact way, and is easily and rapidly interpreted by computer systems.

AND

The combination of two input bits such that the result is 1 if both bits are 1 and 0 otherwise.
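Bit strings and the AND operation are commonly combined in two ways, sketched below under stated assumptions. In a substructure screen, a molecule can contain the query only if every bit set in the query is also set in the molecule. In a similarity search, the bits two molecules share are compared with the bits each sets; the Tanimoto coefficient used here is one widely used measure, not necessarily the one any given system applies. The four-bit fingerprints are hypothetical and far shorter than real ones.

```python
def popcount(bits):
    """Number of bits set to 1."""
    return bin(bits).count("1")


def passes_screen(query_fp, mol_fp):
    """True if every bit set in the query is also set in the molecule."""
    return (query_fp & mol_fp) == query_fp


def tanimoto(fp_a, fp_b):
    """Shared bits divided by bits set in either fingerprint."""
    common = popcount(fp_a & fp_b)
    total = popcount(fp_a) + popcount(fp_b) - common
    return common / total if total else 0.0


# Hypothetical fingerprints, written as Python binary literals.
QUERY = 0b0110
MOL_A = 0b1110
MOL_B = 0b1010

screen_a = passes_screen(QUERY, MOL_A)  # all query bits present in MOL_A
screen_b = passes_screen(QUERY, MOL_B)  # one query bit missing in MOL_B
```

Passing the screen only means a molecule might contain the query; an atom-by-atom match is still needed to confirm it, because different fragments can set the same bits.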

LOG P

The octanol/water partition coefficient is the ratio of the solubility of a compound in octanol to its solubility in water (also known as Kow). The logarithm of this partition coefficient is called log P. It provides an estimate of the ability of the compound to pass through a cell membrane.
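As a worked illustration of the definition above, log P can be computed as the base-10 logarithm of the octanol-to-water ratio and then used as a simple filter. The compound names, the concentration values and the `max_log_p` cutoff are hypothetical numbers chosen for this sketch, not measured data or a recommended threshold.

```python
import math


def log_p(conc_octanol, conc_water):
    """log10 of the octanol/water ratio; both values must be positive
    and expressed in the same units."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


def filter_by_log_p(compounds, max_log_p):
    """Keep names whose log P does not exceed a user-chosen cutoff."""
    return [name for name, (c_oct, c_wat) in compounds.items()
            if log_p(c_oct, c_wat) <= max_log_p]


# Hypothetical (octanol, water) concentrations in arbitrary matching units.
COMPOUNDS = {
    "compound_a": (100.0, 1.0),     # ratio 100, log P = 2
    "compound_b": (1.0, 10.0),      # ratio 0.1, log P = -1
    "compound_c": (1000000.0, 1.0)  # ratio 1e6, log P = 6
}

kept = filter_by_log_p(COMPOUNDS, max_log_p=5.0)
```

A positive log P indicates a preference for the octanol phase and a negative value a preference for water.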

Cite this article

Miller, M. Chemical database techniques in drug discovery. Nat Rev Drug Discov 1, 220–227 (2002). https://doi.org/10.1038/nrd745

  • DOI: https://doi.org/10.1038/nrd745
