Key Points
- Although originally created merely as repositories for compounds synthesized within an organization, chemical databases can now be searched to give novel ideas for lead discovery.
- The familiar chemical-structure diagrams are not amenable to computational operations such as database searching, so several types of chemical-structure representation have been developed by theoretical chemists for use in computer systems. The predominant form is the atom–bond connection table.
- A connection table along with two-dimensional coordinates for display is generally sufficient to identify the substance. However, to perform any energy calculations or to determine whether the compound has the potential to bind to a receptor or enzyme of interest, three-dimensional coordinates are necessary.
- Chemical structures differ considerably from other entities that are commonly stored in databases, such as text, and so the various search modes also differ considerably, although some parallels can be drawn.
- Exact-match searches, which might be performed to find out whether a proposed new structure already exists in a database, can be thought of as looking up a complete word in a dictionary.
- Substructure searches, in which a user picks pieces of a chemical structure and requests that the system return a set of compounds that contain the pieces, are analogous to a wild-carded text search.
- Similarity searches, which might be performed if a user wants compounds that a chemist would intuitively regard as resembling the compound of interest but that are not necessarily exact or substructure matches, are analogous to a 'sounds like' text search.
- In pharmacophoric searches, assumptions about which groups of atoms on the small molecule are involved in binding are combined with the spatial relationship of these groups to give a three-dimensional query. Although the process is generally slower than the previously mentioned search types, the results provide an indication of whether a set of structures can bind to a receptor or enzyme, and so hits might be very valuable in the drug design process.
- Molecular docking, in which a series of candidate molecules from a database is placed into the active site of a protein to evaluate how well the compounds might bind to the receptor or enzyme, has become a more popular mode of database searching owing to continual improvement in the quality of the docking and scoring algorithms.
- Once a database search has been performed, the list of potential molecules for biological testing can be refined by filtering (removing molecules deemed to have unsuitable properties), clustering (grouping similar compounds) and human inspection.
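To make two of the points above concrete, the following Python sketch shows one possible in-memory form of an atom–bond connection table and a simple property filter applied to a list of hits. It is an illustration only, not the representation used by any particular database system: the class names `Atom` and `ConnectionTable`, the function `filter_by_weight`, the use of implicit hydrogen counts and the 50 g/mol cutoff are all assumptions chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

# Average atomic masses (g/mol) for the few elements used in this example.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


@dataclass(frozen=True)
class Atom:
    element: str
    implicit_h: int = 0  # hydrogens attached to this atom but not listed as atoms


@dataclass(frozen=True)
class ConnectionTable:
    """Hypothetical minimal connection table: a list of atoms plus a list of bonds."""

    name: str
    atoms: tuple[Atom, ...]
    bonds: tuple[tuple[int, int, int], ...]  # (atom index, atom index, bond order)

    def neighbours(self, index: int) -> list[int]:
        """Return the indices of all atoms bonded to the atom at `index`."""
        result = []
        for a, b, _order in self.bonds:
            if a == index:
                result.append(b)
            elif b == index:
                result.append(a)
        return result

    def molecular_weight(self) -> float:
        """Sum the atomic masses, including implicit hydrogens."""
        return sum(
            ATOMIC_MASS[atom.element] + atom.implicit_h * ATOMIC_MASS["H"]
            for atom in self.atoms
        )


def filter_by_weight(hits, max_weight):
    """Keep only the hits whose molecular weight does not exceed `max_weight`."""
    return [m for m in hits if m.molecular_weight() <= max_weight]


# Ethanol, CH3-CH2-OH: two carbons and one oxygen joined by single bonds.
ethanol = ConnectionTable(
    name="ethanol",
    atoms=(Atom("C", 3), Atom("C", 2), Atom("O", 1)),
    bonds=((0, 1, 1), (1, 2, 1)),
)

# Acetic acid, CH3-C(=O)-OH: the second bond has order 2.
acetic_acid = ConnectionTable(
    name="acetic acid",
    atoms=(Atom("C", 3), Atom("C", 0), Atom("O", 0), Atom("O", 1)),
    bonds=((0, 1, 1), (1, 2, 2), (1, 3, 1)),
)

# The cutoff is an arbitrary illustrative value, not a recommended threshold.
kept = filter_by_weight([ethanol, acetic_acid], max_weight=50.0)
print([m.name for m in kept])  # ['ethanol']
```

The table stores only connectivity, so it identifies the substance but carries no coordinates; a real system would add two- or three-dimensional coordinates, charges and stereochemistry alongside it.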
Abstract
Chemical databases are becoming a powerful tool in drug discovery. Database searches based on possible requirements for biological activity can identify compounds that might be suitable for further analysis or indicate novel ways to achieve the desired activity. What considerations are involved in the construction and searching of chemical databases?
Acknowledgements
M. M. would like to thank his colleagues in the scientific architecture group for useful feedback on the manuscript.
Related links
FURTHER INFORMATION
Daylight Chemical Information Systems
Fingerprints — Screening and Similarity
Glossary
- DRUG LIKE: Sharing certain characteristics with other molecules that act as drugs. The set of characteristics (size, shape and solubility in water and organic solvents) varies depending on who is evaluating the molecules.
- CYCLIC/ACYCLIC BONDS: If chemical bonds occur in a ring, they are termed 'cyclic'. 'Acyclic bonds' occur in open-chain structures.
- COUNTERION: A set of one or more bonded atoms, with opposite charge and generally smaller size, that accompanies another charged set of bonded atoms as dictated by the principle of electrical neutrality of substances, solutions and so on.
- COMBINATORIAL CHEMISTRY: The generation of large collections, or 'libraries', of compounds by synthesizing all possible combinations of a set of smaller chemical structures.
- SEMA NAME: A stereochemical extension of the Morgan algorithm. A compact, canonical representation of a connection table.
- SUBSTRUCTURE: One chemical structure is said to be a substructure of another if the first structure can be located within the second. (The second is said to be the superstructure of the first.) All structures are substructures of themselves. A substructure search scans a database for all substructural matches.
- CONFORMATIONAL SPACE: The ensemble of three-dimensional shapes that a molecule can adopt without breaking any bonds.
- MARKUSH STRUCTURE: Markush structures represent a set of chemical structures as a common core that contains marked substitution sites, and a set of possible fragments for each substitution point. They can be used to represent a set of compounds that are analysed to determine the effect of varying substituents on compound activity; to represent a set of compounds that are produced using combinatorial techniques; to produce a fine-tuned substructure query; or to represent a set of structures in a chemical patent or patent database.
- PHARMACOPHORE: The ensemble of steric and electronic features that is necessary to ensure optimal interactions with a specific biological target structure and to trigger (or to block) its biological response.
- STEREOCHEMISTRY: The spatial arrangements of atoms in molecules and complexes.
- TAUTOMER: One of two or more structural isomers that exist in equilibrium and are readily converted from one isomeric form to another.
- HYDROGEN BOND: A weak attraction (much weaker than a covalent or ionic chemical bond, but much stronger than van der Waals forces) between an oxygen, nitrogen or fluorine atom in one molecule and a hydrogen atom in a neighbouring molecule. Hydrogen-bond donors are groups with electron-hungry hydrogen atoms. Hydrogen-bond acceptors are atoms with electrons to share.
- BIT STRING: A contiguous set of characters that consists entirely of 1s and 0s. A bit string can be used to encode a good deal of information in a compact way, and is easily and rapidly interpreted by computer systems.
- AND: The combination of two input bits such that the result is 1 if both bits are 1 and 0 otherwise.
- LOG P: The octanol/water partition coefficient (also known as Kow) is the ratio of the concentration of a compound in octanol to its concentration in water when the two phases are at equilibrium. The logarithm of this partition coefficient is called log P. It provides an estimate of the ability of the compound to pass through a cell membrane.
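The BIT STRING and AND entries can be illustrated with a short Python sketch of how fingerprint bit strings might be used for screening and similarity. It is a sketch under stated assumptions: the helper names `fingerprint`, `passes_screen` and `tanimoto`, the 16-bit length and the bit positions are all made up for the example, and the Tanimoto coefficient is a commonly used similarity measure that the glossary itself does not define. Python integers stand in for the bit strings.

```python
def fingerprint(bits_set, length=16):
    """Build a bit string (stored as an int) with the given positions set to 1."""
    value = 0
    for position in bits_set:
        if not 0 <= position < length:
            raise ValueError(f"bit position {position} outside 0..{length - 1}")
        value |= 1 << position
    return value


def passes_screen(query_fp, molecule_fp):
    """True if every bit set in the query is also set in the molecule.

    AND-ing the two strings keeps only the bits they share; if that result
    equals the query, the molecule has all the features the query asks for.
    """
    return (query_fp & molecule_fp) == query_fp


def tanimoto(fp_a, fp_b):
    """Shared 1-bits divided by 1-bits present in either string."""
    common = bin(fp_a & fp_b).count("1")
    either = bin(fp_a | fp_b).count("1")
    # Assumption: two empty fingerprints are given a similarity of 0.0.
    return common / either if either else 0.0


# Hypothetical fingerprints; each bit position stands for some structural feature.
query = fingerprint({1, 4})
mol_a = fingerprint({1, 4, 7, 9})
mol_b = fingerprint({1, 7})

print(format(mol_a, "016b"))        # 0000001010010010
print(passes_screen(query, mol_a))  # True
print(passes_screen(query, mol_b))  # False
print(tanimoto(mol_a, mol_b))       # 0.5
```

Passing such a screen is only a necessary condition for a substructure match: a molecule can contain every feature the query encodes without containing the query as a connected piece, so the survivors would still need an atom-by-atom comparison.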
Cite this article
Miller, M. Chemical database techniques in drug discovery. Nat Rev Drug Discov 1, 220–227 (2002). https://doi.org/10.1038/nrd745
DOI: https://doi.org/10.1038/nrd745