Chemical database techniques in drug discovery

Miller, Mitchell A.

doi:10.1038/nrd745

Review Article
Published: 01 March 2002

Nature Reviews Drug Discovery volume 1, pages 220–227 (2002)

Key Points

Although originally created merely as repositories for compounds synthesized within an organization, chemical databases can now be searched to give novel ideas for lead discovery.
The familiar chemical-structure diagrams are not amenable to computational operations such as database searching, so several types of chemical-structure representation have been developed by theoretical chemists for use in computer systems. The predominant form is the atom–bond connection table.
A connection table along with two-dimensional coordinates for display is generally sufficient to identify the substance. However, to perform any energy calculations or to determine if the compound has the potential to bind to a receptor or enzyme of interest, three-dimensional coordinates are necessary.
Chemical structures differ considerably from other entities that are commonly stored in databases, such as text, and so the various search modes also differ considerably, although some parallels can be drawn.
Exact-match searches — which might be performed to find out if a proposed new structure already exists in a database, for example — can be thought of as looking up a complete word in a dictionary.
Substructure searches — in which a user picks pieces of a chemical structure and requests that the system return a set of compounds that contain the pieces — are analogous to a wild-carded text search.
Similarity searches, which might be performed if a user wants compounds that a chemist would intuitively consider similar to the compound of interest but that are not necessarily an exact or substructure match, are analogous to a 'sounds like' text search.
In pharmacophoric searches, assumptions about which groups of atoms on the small molecule are involved in binding are combined with the spatial relationship of these groups to give a three-dimensional query. Although the process is generally slower than the search types described above, the results indicate whether a set of structures could bind to a receptor or enzyme, and so hits might be very valuable in the drug design process.
Molecular docking — placing a series of candidate molecules from a database into the active site of a protein to evaluate how well the compounds might bind to the receptor or enzyme — has become a more popular mode of database searching owing to continual improvement in the quality of the docking and scoring algorithms.
Once a database search has been performed, the list of potential molecules for biological testing can be refined by filtering (removing molecules deemed to have unsuitable properties), clustering (grouping similar compounds) and human inspection.
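
The connection-table, exact-match and substructure ideas in the key points above can be illustrated with a minimal sketch. It assumes a heavy-atom connection table for GABA (hydrogens implicit, no stereochemistry or aromaticity) and uses a naive backtracking matcher. The names `adjacency`, `find_substructure` and `exact_match` are hypothetical and are not taken from the article; production systems typically pre-screen candidates and use far more efficient algorithms.

```python
from typing import Dict, List, Optional, Tuple

# A connection table: element symbols plus (atom_i, atom_j, bond_order) triples.
Atoms = List[str]
Bonds = List[Tuple[int, int, int]]


def adjacency(n_atoms: int, bonds: Bonds) -> List[Dict[int, int]]:
    """Map each atom index to {neighbour index: bond order}."""
    adj: List[Dict[int, int]] = [{} for _ in range(n_atoms)]
    for i, j, order in bonds:
        adj[i][j] = order
        adj[j][i] = order
    return adj


def find_substructure(q_atoms: Atoms, q_bonds: Bonds,
                      t_atoms: Atoms, t_bonds: Bonds) -> Optional[Dict[int, int]]:
    """Return one query-to-target atom mapping, or None if the query is absent."""
    q_adj = adjacency(len(q_atoms), q_bonds)
    t_adj = adjacency(len(t_atoms), t_bonds)
    mapping: Dict[int, int] = {}

    def extend(q: int) -> bool:
        if q == len(q_atoms):
            return True  # every query atom has been placed
        for t in range(len(t_atoms)):
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Each bond from q to an already-mapped query atom must exist
            # in the target with the same bond order.
            if all(t_adj[t].get(mapping[qn]) == order
                   for qn, order in q_adj[q].items() if qn in mapping):
                mapping[q] = t
                if extend(q + 1):
                    return True
                del mapping[q]  # backtrack
        return False

    return dict(mapping) if extend(0) else None


def exact_match(a_atoms: Atoms, a_bonds: Bonds,
                b_atoms: Atoms, b_bonds: Bonds) -> bool:
    """Two tables match exactly if they have equal sizes and one embeds in the other."""
    same_size = len(a_atoms) == len(b_atoms) and len(a_bonds) == len(b_bonds)
    return same_size and find_substructure(a_atoms, a_bonds, b_atoms, b_bonds) is not None


# GABA, NCCCC(=O)O, heavy atoms only (illustrative numbering).
gaba_atoms = ["N", "C", "C", "C", "C", "O", "O"]
gaba_bonds = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 5, 2), (4, 6, 1)]

# Queries: a carboxylic acid fragment C(=O)O and an amide fragment C(=O)N.
acid_atoms, acid_bonds = ["C", "O", "O"], [(0, 1, 2), (0, 2, 1)]
amide_atoms, amide_bonds = ["C", "O", "N"], [(0, 1, 2), (0, 2, 1)]

print(find_substructure(acid_atoms, acid_bonds, gaba_atoms, gaba_bonds))    # {0: 4, 1: 5, 2: 6}
print(find_substructure(amide_atoms, amide_bonds, gaba_atoms, gaba_bonds))  # None
print(exact_match(gaba_atoms, gaba_bonds, gaba_atoms, gaba_bonds))          # True
```

The exact-match check here simply reuses the substructure matcher with a size test, which mirrors the dictionary and wild-card analogies: a substructure query tolerates extra atoms in the target, whereas an exact match does not.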

Abstract

Chemical databases are becoming a powerful tool in drug discovery. Database searches based on possible requirements for biological activity can identify compounds that might be suitable for further analysis or indicate novel ways to achieve the desired activity. What considerations are involved in the construction and searching of chemical databases?

**Figure 1: Representations of chemical structure, using the neurotransmitter γ-aminobutyric acid (GABA) as an example.**

Acknowledgements

M. M. would like to thank his colleagues in the scientific architecture group for useful feedback on the manuscript.

Author information

Authors and Affiliations

LION bioscience, 9880 Campus Point Drive, San Diego, California 92121, USA
Mitchell A. Miller

Supplementary information

Supplementary Table 1 | Selected chemical database software products (PDF 35 kb)

Supplementary Table 2 | A selection of commercially available chemical databases (PDF 42 kb)

Glossary

DRUG LIKE: Sharing certain characteristics with other molecules that act as drugs. The set of characteristics — size, shape and solubility in water and organic solvents — varies depending on who is evaluating the molecules.
CYCLIC/ACYCLIC BONDS: If chemical bonds occur in a ring, they are termed 'cyclic'. 'Acyclic bonds' occur in open chain structures.
COUNTERION: A set of one or more bonded atoms, with opposite charge and generally smaller size, that accompanies another charged set of bonded atoms as dictated by the principle of electrical neutrality of substances, solutions and so on.
COMBINATORIAL CHEMISTRY: The generation of large collections, or 'libraries', of compounds by synthesizing all possible combinations of a set of smaller chemical structures.
SEMA NAME: A name generated by the stereochemically extended Morgan algorithm (SEMA). It is a compact, canonical representation of a connection table.
SUBSTRUCTURE: One chemical structure is said to be a substructure of another if the first structure can be located within the second. (The second is said to be the superstructure of the first.) All structures are substructures of themselves. A substructure search scans a database for all substructural matches.
CONFORMATIONAL SPACE: The ensemble of three-dimensional shapes that a molecule can adopt without breaking any bonds.
MARKUSH STRUCTURE: Markush structures represent a set of chemical structures as a common core that contains marked substitution sites, and a set of possible fragments for each substitution point. They can be used to represent a set of compounds that are analysed to determine the effect of varying substituents on compound activity; to represent a set of compounds that are produced using combinatorial techniques; to produce a fine-tuned substructure query; or to represent a set of structures in a chemical patent or patent database.
PHARMACOPHORE: The ensemble of steric and electronic features that is necessary to ensure optimal interactions with a specific biological target structure and to trigger (or to block) its biological response.
STEREOCHEMISTRY: The spatial arrangements of atoms in molecules and complexes.
TAUTOMER: One of two or more structural isomers that exist in equilibrium and are readily converted from one isomeric form to another.
HYDROGEN BOND: A weak attraction (much weaker than a covalent or ionic chemical bond, but much stronger than van der Waals forces) between an electronegative atom such as oxygen, nitrogen or fluorine and a hydrogen atom that is itself covalently bonded to an electronegative atom, either in a neighbouring molecule or elsewhere in the same molecule. Hydrogen-bond donors are groups in which a hydrogen atom is bonded to an electronegative atom. Hydrogen-bond acceptors are atoms with lone-pair electrons to share.
BIT STRING: A contiguous set of characters that consists entirely of 1s and 0s. A bit string can be used to encode a good deal of information in a compact way, and is easily and rapidly interpreted by computer systems.
AND: The combination of two input bits such that the result is 1 if both bits are 1 and 0 otherwise.
LOG P: The octanol/water partition coefficient (also known as K_ow) is the ratio of the equilibrium concentration of a compound in octanol to its concentration in water. The base-10 logarithm of this partition coefficient is called log P. It is a measure of lipophilicity and provides an estimate of the ability of the compound to pass through a cell membrane.
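
The BIT STRING and AND entries above combine naturally in fingerprint-based screening. The sketch below assumes that each structure has already been encoded as a fingerprint in which every bit flags the presence of some structural feature; the 8-bit values are invented for illustration. It uses the Tanimoto coefficient as the similarity measure, which is a common choice but an assumption here, and the function names are hypothetical.

```python
def popcount(bits: int) -> int:
    """Number of bits set to 1."""
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient: shared on-bits divided by on-bits in either string."""
    common = popcount(a & b)  # AND keeps only the bits set in both strings
    total = popcount(a) + popcount(b) - common
    return common / total if total else 1.0


def may_contain(query_fp: int, mol_fp: int) -> bool:
    """Screening test: a molecule can contain the query only if every
    feature bit set in the query is also set in the molecule."""
    return (query_fp & mol_fp) == query_fp


# Hypothetical 8-bit fingerprints (real ones are hundreds or thousands of bits).
mol_a = 0b10110010
mol_b = 0b10100110

print(tanimoto(mol_a, mol_b))          # 0.6  (3 shared bits, 5 bits set in total)
print(may_contain(0b10100000, mol_a))  # True: both query bits are set in mol_a
print(may_contain(0b00000100, mol_a))  # False: mol_a lacks the query bit
```

A passing `may_contain` test does not prove that the substructure is present; it only rules out molecules that cannot match, so that a slower atom-by-atom comparison is run on fewer candidates.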

About this article

Cite this article

Miller, M. Chemical database techniques in drug discovery. Nat Rev Drug Discov 1, 220–227 (2002). https://doi.org/10.1038/nrd745

Issue Date: 01 March 2002
DOI: https://doi.org/10.1038/nrd745
