Elsevier

Drug Discovery Today

Volume 21, Issue 8, August 2016, Pages 1291-1302

Review
Informatics
Descriptors and their selection methods in QSAR analysis: paradigm for drug design

https://doi.org/10.1016/j.drudis.2016.06.013

Highlights

  • A few newly introduced molecular descriptors are discussed.

  • Various computational approaches to calculate the descriptors are listed.

  • Several methods of descriptor selection for building highly predictive QSAR models are described.

  • Advantages and disadvantages of the selection methods are addressed.

  • Studies that successfully applied the descriptors and their selection methods are also addressed.

The screening of chemical libraries with traditional methods, such as high-throughput screening (HTS), is expensive and time consuming. Quantitative structure–activity relationship (QSAR) modeling is an alternative method that can assist in the selection of lead molecules by using the information from reference active and inactive compounds. This approach requires good molecular descriptors that are representative of the molecular features responsible for the relevant molecular activity. The usefulness of these descriptors in QSAR studies has been extensively demonstrated, and they have also been used as a measure of structural similarity or diversity. In this review, we provide a brief explanation of descriptors and the selection approaches most commonly used in QSAR experiments. In addition, some studies have demonstrated the positive influence of feature selection on drug development models.

Introduction

Differentiating drug-like from non-drug-like molecules is essential to reduce the cost associated with failed drug development. Various in silico approaches have shown the potential for screening chemical databases against the desired biological targets for the development of new potential leads [1]. Among them, ligand-based virtual screening has become popular because of its ability to screen millions of molecules rapidly from available chemical databases [2]. QSAR modeling is an important approach in drug discovery that correlates molecular structure with biological and pharmaceutical activities [3]. Such 2D methods rely on the calculation and comparison of molecular properties with the aim of identifying molecules that are similar to the query molecule. Compared with 3D (or structure-based) methods, 2D approaches require substantially lower calculation times and, therefore, are mostly used as preliminary filters to reduce the number of compounds that can be used for further screening in later stages of drug development [4]. These 2D approaches are widely used in academia, industry, and research institutions worldwide. Developing a QSAR model involves (i) understanding the fundamental chemistry of the set of analogs, including any outliers; (ii) quantitatively correlating and summarizing the relations between chemical structure alterations and relevant changes in biological endpoint to determine the chemical properties that are the most likely determinants of the biological activities of the drug candidate; (iii) optimizing the existing leads to improve their biological activities; and (iv) predicting the biological activities of untried compounds.
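The similarity comparison that underlies such 2D screening can be illustrated with a minimal sketch. It assumes that each molecule is represented by a set of "on" fingerprint bits and that similarity is measured with the Tanimoto coefficient; both choices, the function names, and the toy fingerprints are illustrative assumptions and are not taken from the studies cited here.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_similarity(query, library):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = [(name, tanimoto(query, bits)) for name, bits in library.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: each integer is the index of a structural
# feature that is present in the molecule.
query = {1, 4, 7, 9}
library = {
    "mol_A": {1, 4, 7, 9, 12},
    "mol_B": {2, 3, 5},
    "mol_C": {1, 4, 8},
}

ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In a preliminary filter of this kind, only the top-ranked molecules would be passed on to slower, more detailed methods.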

Different QSAR approaches have been developed over the past few decades [5–8]. These approaches can determine reliable relations between variations in the values of calculated descriptors and the biological activity for a series of chemical molecules, so that they can be used for predicting the activity of untried or newly synthesized compound(s). The chemical structures used in QSAR model building are encoded by a substantial number of molecular descriptors. The model is built by using only a few descriptors that are valid for closely related compounds. Most learning algorithms become computationally intractable when the number of features is large, in both the training and the production steps. High-throughput data used in statistical modeling pose a challenge to accurate prediction. Given the large amount of inherent noise and variation in samples and their high dimensionality, there is a risk of overfitting [9]. Thus, there is a need for descriptor selection to improve model performance and avoid overfitting.
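The idea of relating descriptor values to activity and then predicting an untried compound can be sketched in its simplest form, a least-squares line through one descriptor. This is only an illustration under stated assumptions: the descriptor (a hypothetical logP column), the activity values, and the function names are invented for the example, and real QSAR models typically use several descriptors and external validation.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept for one descriptor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Made-up training set: one descriptor value and one measured activity per compound.
logp = [1.0, 2.0, 3.0, 4.0]
activity = [5.1, 5.9, 7.1, 7.9]

slope, intercept = fit_line(logp, activity)
fit_quality = r_squared(logp, activity, slope, intercept)

# Predicted activity of an untried compound with a descriptor value of 5.0.
predicted = slope * 5.0 + intercept
print(slope, intercept, fit_quality, predicted)
```

With thousands of candidate descriptors and only a handful of compounds, a fit of this kind can reproduce the training data by chance, which is the overfitting risk that motivates descriptor selection.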

Descriptor selection methods provide a way of reducing computation time, improving prediction performance, and providing a better understanding of the data in machine learning. Descriptor selection is an important step for several reasons [10], including: (i) using only a few descriptors increases the interpretability and understanding of resulting models; (ii) it can reduce the risk of overfitting from noisy, redundant molecular descriptors; (iii) it can provide faster and cost-effective models; and (iv) it removes the activity cliff. However, noisy, redundant, or irrelevant descriptors should be removed in a way that the dimension of the input space is reduced without any loss of significant information [3]. In this review, we provide an update on, and a brief explanation of, commonly used descriptors, with a particular emphasis on their selection approaches for the development of more reliable, predictable, and generalized QSAR models.
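A minimal sketch of one simple way to remove uninformative and redundant descriptors is shown here. It assumes a filter that drops near-constant descriptors and then drops any descriptor that is highly correlated with one already kept; the thresholds, the descriptor names, and the values are hypothetical, and this is not presented as the procedure of any particular study.

```python
from math import sqrt


def variance(values):
    """Population variance of one descriptor column."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient between two descriptor columns."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return cov / norm


def select_descriptors(table, min_variance=1e-6, max_abs_corr=0.95):
    """table maps a descriptor name to its list of values (one per compound)."""
    kept = []
    for name, values in table.items():
        if variance(values) < min_variance:
            continue  # near-constant: cannot explain differences in activity
        if any(abs(pearson(values, table[k])) > max_abs_corr for k in kept):
            continue  # redundant with a descriptor that was already kept
        kept.append(name)
    return kept


# Hypothetical descriptor table for four compounds.
table = {
    "mol_weight": [180.0, 250.0, 310.0, 420.0],
    "heavy_atoms": [13.0, 18.0, 22.0, 30.0],  # nearly proportional to mol_weight
    "n_rings": [1.0, 3.0, 1.0, 2.0],
    "has_carbon": [1.0, 1.0, 1.0, 1.0],  # constant across the set
}

selected = select_descriptors(table)
print(selected)
```

The filter is greedy, so the result depends on the order of the columns: of two correlated descriptors, the one listed first is retained. Such a filter also ignores the activity values entirely, so it is usually a first step before a selection method that does use them.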


Molecular descriptors

Despite great advances in the field of drug design, the use of descriptors to define the molecular structure of biologically active compounds is the main method utilized to discover new lead molecules. Descriptors are the chemical characteristics of a molecule expressed in numerical form, used for QSAR/QSPR studies. Fig. 1 depicts the basic definition of these descriptors. Mathematical representation of these descriptors has to be invariant to the size of the molecule and the number of atoms it contains
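As an illustration of how a structure can be turned into a single number, the following sketch computes the Wiener index, a classic topological descriptor listed in the references, defined as the sum of shortest-path distances (counted in bonds) over all pairs of non-hydrogen atoms. The adjacency-list representation and the function name are assumptions made for this example.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path lengths (in bonds) over all unordered atom pairs."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives the bond distance from `start` to every atom.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if neighbor not in distance:
                    distance[neighbor] = distance[atom] + 1
                    queue.append(neighbor)
        total += sum(distance.values())
    return total // 2  # every pair was counted once from each end


# Hydrogen-suppressed carbon skeletons as adjacency lists (atom index -> bonded atoms).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The two isomers contain the same atoms but receive different values, which shows how a topological descriptor encodes branching rather than composition.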

Concluding remarks

Molecular descriptors are an essential part of the methodological toolbox used to study structure–property correlations and are widely used to optimize the characteristics of compounds in molecular design. Reliable prediction of these descriptors is significant for the development of predictable QSAR models, because accurate predictions can limit the number of expensive and time-consuming experiments required to synthesize the active novel hits with optimized pharmacodynamic and pharmacokinetic

Acknowledgments

We acknowledge DBT (Department of Biotechnology), Government of India, for its support and for the internal facilities of the department. This work was supported by internal funds from the Biotechnology Unit, AMU; ICMR grant AMR/5/2011-ECD-1; and DBT grants BT/PR8281/BID/7/448/2013 and BT/HRD/NBA/34/01/2012 to A.U.K.

References (79)

  • M. Randic

    Novel molecular description for structure–property studies

    Chem. Phys. Lett.

    (1993)
  • S. Nikolic

    Wiener index revisited

    Chem. Phys. Lett.

    (2001)
  • X. Li

    A novel set of Wiener indices

    J. Mol. Graph Model

    (2003)
  • M. Danishuddin et al.

    Virtual screening strategies: a state of art to combat with multiple drug resistance strains

    MOJ. Proteomics Bioinform.

    (2015)
  • M. Shahlaei

    Descriptor selection methods in quantitative structure–activity relationship studies: a review study

    Chem. Rev.

    (2013)
  • M.S. Murgueitio

    In silico virtual screening approaches for anti-viral drug discovery

    Drug Discov. Today.

    (2012)
  • A. Cherkasov

    QSAR modeling: where have you been? Where are you going to?

    J. Med. Chem.

    (2014)
  • P. Borysov

    Activity prediction and identification of mis-annotated chemical compounds using extreme descriptors

    J. Chemometrics.

    (2016)
  • M. Luo

    Comparative analysis of QSAR-based vs. chemical similarity based predictors of GPCR binding affinity

    Mol. Inf.

    (2015)
  • P. Shi

    Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction

    BMC Bioinform.

    (2011)
  • M. Goodarzi

    Feature selection methods in QSAR studies

    J. AOAC Int.

    (2012)
  • F. Harary

    Graph Theory

    (1971)
  • K. Roy

    Topological descriptors in drug design and modeling studies

    Mol. Diver.

    (2004)
  • H. Wiener

    Structural determination of paraffin boiling points

    J. Am. Chem. Soc.

    (1947)
  • M. Randic

    On characterization of molecular branching

    J. Am. Chem. Soc.

    (1975)
  • L.B. Kier

    Indexes of molecular shape from chemical graphs

    Acta Pharm. Jugosl.

    (1986)
  • I. Gutman et al.

    The first Zagreb index 30 years after

    MATCH Commun. Math. Comput. Chem.

    (2004)
  • L.B. Kier et al.

    Molecular Connectivity in Structure Activity Analysis

    (1986)
  • M. Randic

    Generalized molecular descriptors

    J. Math. Chem.

    (1991)
  • J.L. Melville et al.

    TMACC: interpretable correlation descriptors for quantitative structure–activity relationships

    J. Chem. Inf. Model.

    (2007)
  • B.W. Spowage

    Interpretable correlation descriptors for quantitative structure–activity relationships

    J. Cheminf.

    (2009)
  • R. Todeschini et al.

    Handbook of Molecular Descriptors

    (2000)
  • A.K. Ghose et al.

    Atomic physicochemical parameters for three-dimensional structure-directed quantitative structure–activity relationships I. Partition coefficients as a measure of hydrophobicity

    J. Comput. Chem.

    (1986)
  • J.A. Arnott

    Lipophilicity indices for drug development

    J. Appl. Biopharm. Pharmacokinet.

    (2013)
  • H. Waterbeemd et al.

    The parameterization of lipophilicity and other structural properties in drug design

    Adv. Drug. Res.

    (1987)
  • M.J. Waring

    Lipophilicity in drug discovery

    Expert Opin. Drug Discov.

    (2010)
  • A. Leo

    Calculation of hydrophobic constant (log P) from pi. and f constants

    J. Med. Chem.

    (1975)
  • A. Alex

    Intramolecular hydrogen bonding to improve membrane permeability and absorption in beyond rule of five chemical space

    Med. Chem. Commun.

    (2011)
  • P. Matsson

    Cell permeability beyond the rule-of-5

    Adv. Drug Deliv.

    (2015)