Descriptors and their selection methods in QSAR analysis: paradigm for drug design

doi:10.1016/j.drudis.2016.06.013

Drug Discovery Today

Volume 21, Issue 8, August 2016, Pages 1291-1302

https://doi.org/10.1016/j.drudis.2016.06.013

Highlights

• A few newly introduced molecular descriptors are discussed.
• Various computational approaches to calculate the descriptors are listed.
• Several descriptor selection methods for building highly predictive QSAR models are described.
• Advantages and disadvantages of the selection methods are addressed.
• Studies that successfully applied the descriptors and their selection methods are also addressed.

The screening of chemical libraries with traditional methods, such as high-throughput screening (HTS), is expensive and time consuming. Quantitative structure–activity relationship (QSAR) modeling is an alternative method that can assist in the selection of lead molecules by using information from reference active and inactive compounds. This approach requires good molecular descriptors that are representative of the molecular features responsible for the relevant molecular activity. The usefulness of these descriptors in QSAR studies has been extensively demonstrated, and they have also been used as a measure of structural similarity or diversity. In this review, we provide a brief explanation of descriptors and the selection approaches most commonly used in QSAR experiments. In addition, some studies have demonstrated the positive influence of feature selection on drug development models.

Introduction

Differentiating drug-like from non-drug-like molecules is essential to reduce the cost associated with failed drug development. Various in silico approaches have shown potential for screening chemical databases against desired biological targets to develop new potential leads [1]. Among them, ligand-based virtual screening has become popular because of its ability to screen millions of molecules rapidly from available chemical databases [2]. QSAR modeling is an important approach in drug discovery that correlates molecular structure with biological and pharmaceutical activities [3]. Such 2D methods rely on the calculation and comparison of molecular properties with the aim of identifying molecules that are similar to the query molecule. Compared with 3D (or structure-based) methods, 2D approaches require substantially lower calculation times and, therefore, are mostly used as preliminary filters to reduce the number of compounds that can be used for further screening in later stages of drug development [4]. These 2D approaches are widely used in academia, industry, and research institutions worldwide. For the development of a QSAR model, one should consider (i) the fundamental chemistry of the set of analogs, including any outliers; (ii) the quantitative correlation and summary of the relations between chemical structure alterations and relevant changes in biological endpoint, to determine the chemical properties that are the most likely determinants of the biological activities of the drug candidate; (iii) the optimization of existing leads to improve their biological activities; and (iv) the prediction of the biological activities of untried compounds.
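As a minimal sketch of how such a 2D similarity comparison can work, the Python snippet below ranks a toy library against a query molecule using the Tanimoto coefficient, with each fingerprint represented as a set of on-bit indices. The fingerprints, molecule names, and function names are hypothetical and chosen only for illustration; the review does not prescribe a particular similarity measure.

```python
from __future__ import annotations


def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_similarity(query: set[int], library: dict[str, set[int]]) -> list[tuple[str, float]]:
    """Return (name, similarity) pairs sorted from most to least similar to the query."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy fingerprints (not derived from real molecules).
query = {1, 4, 7, 9}
library = {
    "mol_a": {1, 4, 7, 9},
    "mol_b": {1, 4, 8},
    "mol_c": {2, 3},
}
ranking = rank_by_similarity(query, library)
```

In a screening setting, only the top-ranked entries of such a list would typically be passed on to slower 3D or structure-based methods.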

Different QSAR approaches have been developed over the past few decades [5–8]. These approaches can determine reliable relations between variations in the values of calculated descriptors and the biological activity for a series of chemical molecules, so that they can be used for predicting the activity of untried or newly synthesized compounds. The chemical structures used in QSAR model building are encoded by a substantial number of molecular descriptors. The model is built by using only a few descriptors that are valid for closely related compounds. Most learning algorithms become computationally intractable when the number of features is large, for example in the training and production steps. High-throughput data used in statistical modeling pose a challenge to accurate prediction. Given the large amount of inherent noise and variation in samples and their high dimensionality, there is a risk of overfitting [9]. Thus, there is a need for descriptor selection to improve model performance and avoid overfitting.

Descriptor selection methods provide a way of reducing computation time, improving prediction performance, and providing a better understanding of the data in machine learning. Descriptor selection is an important step for several reasons [10], including: (i) using only a few descriptors increases the interpretability and understanding of the resulting models; (ii) it can reduce the risk of overfitting from noisy, redundant molecular descriptors; (iii) it can provide faster and more cost-effective models; and (iv) it removes the activity cliff. However, noisy, redundant, or irrelevant descriptors should be removed in such a way that the dimension of the input space is reduced without any loss of significant information [3]. In this review, we provide an update on, and a brief explanation of, commonly used descriptors, with a particular emphasis on their selection approaches for the development of more reliable, predictive, and generalized QSAR models.
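One simple way to remove uninformative and redundant descriptors is a filter that drops near-constant columns and then keeps only one member of each highly correlated group. The sketch below illustrates that idea in plain Python. It is an assumption-laden example rather than a method taken from this review: the thresholds, the descriptor names, and the toy values are all hypothetical, and the greedy pass depends on the order in which descriptors are listed.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def select_descriptors(table, min_variance=1e-8, max_abs_corr=0.95):
    """Greedy filter: `table` maps descriptor name -> values (one per compound)."""
    kept = []
    for name, values in table.items():
        if pvariance(values) <= min_variance:
            continue  # near-constant descriptor carries no information
        if any(abs(pearson(values, table[k])) > max_abs_corr for k in kept):
            continue  # redundant with a descriptor that was already kept
        kept.append(name)
    return kept


# Hypothetical descriptor table for four compounds (illustrative values only).
descriptors = {
    "mol_weight": [180.2, 151.2, 206.3, 194.2],
    "heavy_atoms": [13, 11, 15, 14],   # almost collinear with mol_weight
    "n_rings": [1, 1, 1, 1],           # constant across the set
    "logp": [1.2, 0.5, 3.5, -0.1],
}
selected = select_descriptors(descriptors)
```

Here the filter keeps `mol_weight` and `logp`, dropping the constant ring count and the heavy-atom count that duplicates the information in molecular weight. Because the filter never looks at the activity values, it only reduces redundancy; wrapper or embedded methods would additionally use model performance to guide selection.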

Section snippets

Molecular descriptors

Despite great advances in the field of drug design, the use of descriptors to define the molecular structure of biologically active compounds is the main method utilized to discover new lead molecules. Descriptors are the chemical characteristics of a molecule in numerical form, used for QSAR/QSPR studies. Fig. 1 depicts the basic definition of these descriptors. Mathematical representation of these descriptors has to be invariant to the size of the molecule and the number of atoms it contains
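To make the idea of a descriptor as "a chemical characteristic in numerical form" concrete, the sketch below computes the Wiener index, a classic topological descriptor that appears in the reference list (H. Wiener, 1947). It is the sum of shortest-path bond distances over all pairs of heavy atoms in the hydrogen-suppressed molecular graph. The adjacency-list encoding and the function name are assumptions made for this illustration.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path distances over all unordered atom pairs of a connected graph."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives bond-count distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if neighbor not in dist:
                    dist[neighbor] = dist[atom] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
    return total // 2  # each pair was counted once from each end


# Hydrogen-suppressed carbon skeletons as adjacency lists (atom index -> bonded atoms).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

w_linear = wiener_index(n_butane)     # 10
w_branched = wiener_index(isobutane)  # 9
```

The two isomers have the same formula but different values, which is how a single number can encode branching. The value also does not depend on how the atoms happen to be numbered.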

Concluding remarks

Molecular descriptors are an essential part of the methodological toolbox used to study structure–property correlations and are widely used to optimize the characteristics of compounds in molecular design. Reliable prediction of these descriptors is important for the development of predictive QSAR models, because accurate predictions can limit the number of expensive and time-consuming experiments required to synthesize the active novel hits with optimized pharmacodynamic and pharmacokinetic

Acknowledgments

We acknowledge the Department of Biotechnology (DBT), Government of India, for support and for the internal facilities of the department. This work was supported by internal funds from the Biotechnology Unit, AMU; ICMR grant AMR/5/2011-ECD-1; and DBT grants BT/PR8281/BID/7/448/2013 and BT/HRD/NBA/34/01/2012 to A.U.K.

References (79)

M. Danishuddin et al., Structure based virtual screening to discover putative drug candidates: necessary considerations and successful case studies, Methods (2015)
A. Lavecchia, Machine-learning approaches in drug discovery: methods and applications, Drug Discov. Today (2015)
A.T. Balaban, Highly discriminating distance-based topological index, Chem. Phys. Lett. (1982)
S. Winiwarter, Hydrogen bonding descriptors in the prediction of human in vivo intestinal permeability, J. Mol. Graph. Model. (2003)
S. Maldonado et al., A wrapper method for feature selection using support vector machines, Inf. Sci. (2009)
C.L. Huang et al., A GA-based feature selection and parameters optimization for support vector machines, Expert Syst. Appl. (2006)
A.S. Reddy, Hybrid-genetic algorithm based descriptor optimization and QSAR models for predicting the biological activity of Tipranavir analogs for HIV protease inhibition, J. Mol. Graph. Model. (2010)
P. Willett, Similarity-based virtual screening using 2D fingerprints, Drug Discov. Today (2006)
J. Duan, Analysis and comparison of 2D fingerprints: insights into database screening performance using eight fingerprint methods, J. Mol. Graph. Model. (2010)
M. Randic, The connectivity index 25 years after, J. Mol. Graph. Model. (2001)
M. Randic, Novel molecular description for structure–property studies, Chem. Phys. Lett. (1993)
S. Nikolic, Wiener index revisited, Chem. Phys. Lett. (2001)
X. Li, A novel set of Wiener indices, J. Mol. Graph. Model. (2003)
M. Danishuddin et al., Virtual screening strategies: a state of art to combat with multiple drug resistance strains, MOJ. Proteomics Bioinform. (2015)
M. Shahlaei, Descriptor selection methods in quantitative structure–activity relationship studies: a review study, Chem. Rev. (2013)
M.S. Murgueitio, In silico virtual screening approaches for anti-viral drug discovery, Drug Discov. Today (2012)
A. Cherkasov, QSAR modeling: where have you been? Where are you going to?, J. Med. Chem. (2014)
P. Borysov, Activity prediction and identification of mis-annotated chemical compounds using extreme descriptors, J. Chemometrics (2016)
M. Luo, Comparative analysis of QSAR-based vs. chemical similarity based predictors of GPCR binding affinity, Mol. Inf. (2015)
P. Shi, Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction, BMC Bioinform. (2011)
M. Goodarzi, Feature selection methods in QSAR studies, J. AOAC Int. (2012)
F. Harary, Graph Theory (1971)
K. Roy, Topological descriptors in drug design and modeling studies, Mol. Diver. (2004)
H. Wiener, Structural determination of paraffin boiling points, J. Am. Chem. Soc. (1947)
M. Randic, On characterization of molecular branching, J. Am. Chem. Soc. (1975)
L.B. Kier, Indexes of molecular shape from chemical graphs, Acta Pharm. Jugosl. (1986)
I. Gutman et al., The first Zagreb index 30 years after, MATCH Commun. Math. Comput. Chem. (2004)
L.B. Kier et al., Molecular Connectivity in Structure Activity Analysis (1986)
M. Randic, Generalized molecular descriptors, J. Math. Chem. (1991)
J.L. Melville et al., TMACC: interpretable correlation descriptors for quantitative structure–activity relationships, J. Chem. Inf. Model. (2007)
B.W. Spowage, Interpretable correlation descriptors for quantitative structure–activity relationships, J. Cheminf. (2009)
R. Todeschini et al., Handbook of Molecular Descriptors (2000)
A.K. Ghose et al., Atomic physicochemical parameters for three-dimensional structure-directed quantitative structure–activity relationships I. Partition coefficients as a measure of hydrophobicity, J. Comput. Chem. (1986)
J.A. Arnott, Lipophilicity indices for drug development, J. Appl. Biopharm. Pharmacokinet. (2013)
H. Waterbeemd et al., The parameterization of lipophilicity and other structural properties in drug design, Adv. Drug. Res. (1987)
M.J. Waring, Lipophilicity in drug discovery, Expert Opin. Drug Discov. (2010)
A. Leo, Calculation of hydrophobic constant (log P) from π and f constants, J. Med. Chem. (1975)
A. Alex, Intramolecular hydrogen bonding to improve membrane permeability and absorption in beyond rule of five chemical space, Med. Chem. Commun. (2011)
P. Matsson, Cell permeability beyond the rule-of-5, Adv. Drug Deliv. (2015)

