Elsevier

Analytica Chimica Acta

Volume 609, Issue 2, 25 February 2008, Pages 169-174
Application of the modelling power approach to variable subset selection for GA-PLS QSAR models

https://doi.org/10.1016/j.aca.2008.01.013

Abstract

A previously developed function, the Modelling Power Plot, has been applied to QSARs developed using partial least squares (PLS) following variable selection from a genetic algorithm (GA). Modelling power (Mp) integrates the predictive and descriptive capabilities of a QSAR. With regard to QSARs for narcotic toxic potency, Mp was able to guide the optimal selection of variables using a GA. The results emphasise the importance of Mp to assess the success of the variable selection and that techniques such as PLS are more robust following variable selection.

Introduction

The next decade will see an increased use of (quantitative) structure-activity relationships ((Q)SARs) to predict toxicity for new and existing chemicals. Much of the focus will be on their application to reduce or replace animal use in toxicological testing for the regulation of existing chemicals (e.g. in the REACH legislation) [1]. As such, there is a high probability that models will be applied, and the predictions interpreted, by “non-experts”. To facilitate their use, much effort has been placed in the provision of strategies to evaluate the models and guidance for their usage, e.g. the OECD Principles for the Validation of (Q)SARs [2] and associated guidance. While these principles were envisioned for regulatory toxicology they are, of course, equally appropriate to assist in the solution of problems in the use of QSARs in drug design, property estimation and a number of different areas.

For the successful use of QSARs for environmental effects, such as acute toxicity (as well as any other use), a non-expert user must be able to apply robust and transparent models. The OECD Principles for the Validation of (Q)SARs provide a suitable framework for the development of such models. It is inevitable, and indeed in most cases helpful, that a large variety of statistical approaches is applied. Models must be statistically robust, i.e. they must reflect a true “causal” relationship and be predictive. Traditional statistics (R², s, F, etc.) associated with regression-based QSARs provide simple and interpretable measures of statistical fit [3]. However, even slightly more sophisticated methods, such as leave-one-out (or leave-more-out) cross-validation, have been shown to be over-optimistic [4], [5]. Recently, a more thorough and applicable method for evaluating the robustness of a QSAR was proposed: the Modelling Power Plot [6].

The Modelling Power Plot is a method to compare the quality of individual QSARs [6]. It is based on two new statistics associated with a regression model, the ‘Descriptive power’ (Dp), which is estimated through the relative uncertainty of model coefficients, and the ‘Predictive power’ (Pp), which is estimated through both the fitted and cross-validated explained variance of the response variable (i.e. biological activity). In particular, the “Modelling Power Plot” was shown to be a very useful tool to delineate between models based on different numbers of variables and different variable selection techniques [6].
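As an illustration of the ingredients named above, the following minimal Python sketch computes the fitted explained variance, its leave-one-out cross-validated counterpart, and a jackknife uncertainty for the coefficient of a one-descriptor least-squares model. It is not the Dp, Pp or Mp calculation of [6], whose formulas are not reproduced in this text. The function names, the toy data, the use of ordinary least squares in place of PLS, and the standard jackknife standard-error formula are all assumptions made for the example.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit y = a + b*x; returns (b, a)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return b, my - b * mx


def model_statistics(x, y):
    """Return (EV, EVCV, b, U_b) for a one-descriptor model.

    EV   : fitted explained variance, 1 - RSS/TSS
    EVCV : leave-one-out explained variance, 1 - PRESS/TSS
    U_b  : jackknife standard error of the slope (assumed uncertainty measure)
    """
    n = len(y)
    my = sum(y) / n
    tss = sum((yi - my) ** 2 for yi in y)
    b, a = fit_line(x, y)
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    press = 0.0
    loo_slopes = []
    for i in range(n):
        # Refit without compound i, then predict compound i.
        bi, ai = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        loo_slopes.append(bi)
        press += (y[i] - (ai + bi * x[i])) ** 2

    mean_b = sum(loo_slopes) / n
    u_b = math.sqrt((n - 1) / n * sum((bi - mean_b) ** 2 for bi in loo_slopes))
    return 1.0 - rss / tss, 1.0 - press / tss, b, u_b


# Hypothetical descriptor and response values, for illustration only.
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_toy = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
ev, evcv, b, u_b = model_statistics(x_toy, y_toy)
print(f"EV={ev:.3f}  EVCV={evcv:.3f}  b={b:.3f}  U(b)/|b|={u_b / abs(b):.3f}")
```

A model with a small gap between EV and EVCV and a small relative uncertainty U(b)/|b| would be considered both predictive and well described in the sense discussed above.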

There is a trend to develop QSARs from a variety of methods. In particular, Genetic Algorithms (GAs) are frequently used as search algorithms for variable selection in chemometrics and QSAR [7]. The GA provides a “population” of models, from which it could be difficult to identify the most significant or relevant models (which may be preferred in certain uses, e.g. regulatory toxicology prediction) [8].

Most QSAR-GA approaches use a single statistic related to the predictive ability of the model as the objective function (e.g. cross-validation predictive residual sum of squares, PRESS, RMSE, Q²). In some cases, such statistics have been combined with other statistics and/or decision rules to decide on model consistency (e.g. [9], [10] and references therein). In the past, the uncertainty related to the model coefficients (b) has received little attention. Fortunately, there is a trend to report QSARs with an indication of their b-uncertainties, mainly as a standard deviation or confidence interval, e.g. [9], but also as the estimated uncertainty interval b ± U(b) [6], [11]. However, to our knowledge, the combined b ± U(b) information (e.g. the descriptive power of the model, Dp [6]) has not been used as a criterion in the GA decision process.
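The classical single-statistic approach can be sketched as follows. This is a minimal, hedged Python illustration of a GA that selects descriptor subsets using leave-one-out Q² alone as the fitness function. It is not the GA-PLS procedure of this paper: ordinary least squares stands in for PLS, the data are synthetic, and every name and parameter (`ga_select`, `q2_loo`, population size, mutation rate) is hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    z = [[1.0] + r for r in rows]
    p = len(z[0])
    xtx = [[sum(zi[j] * zi[k] for zi in z) for k in range(p)] for j in range(p)]
    xty = [sum(zi[j] * yi for zi, yi in zip(z, y)) for j in range(p)]
    return solve(xtx, xty)


def q2_loo(x, y, mask):
    """Leave-one-out Q2 of an OLS model on the descriptors flagged in mask."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return float("-inf")  # an empty subset is not a usable model
    rows = [[r[j] for j in cols] for r in x]
    y_mean = sum(y) / len(y)
    press = 0.0
    try:
        for i in range(len(y)):
            b = ols(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
            pred = b[0] + sum(bj * v for bj, v in zip(b[1:], rows[i]))
            press += (y[i] - pred) ** 2
    except ValueError:
        return float("-inf")
    return 1.0 - press / sum((v - y_mean) ** 2 for v in y)


def ga_select(x, y, pop_size=8, generations=15, p_mut=0.2, seed=0):
    """Tiny elitist GA over binary masks; the fitness is Q2 alone."""
    rng = random.Random(seed)
    n_var = len(x[0])
    # Seed the population with the full model plus random subsets.
    pop = [[1] * n_var]
    pop += [[rng.randint(0, 1) for _ in range(n_var)] for _ in range(pop_size - 1)]

    def fitness(mask):
        return q2_loo(x, y, mask)

    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # elitism: the best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n_var)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)


# Synthetic data: only descriptors 0 and 2 carry signal (illustration only).
_rng = random.Random(1)
X = [[_rng.uniform(0, 1) for _ in range(4)] for _ in range(12)]
Y = [2.0 * r[0] - 1.0 * r[2] + _rng.gauss(0, 0.05) for r in X]
best_mask, best_q2 = ga_select(X, Y)
print(best_mask, round(best_q2, 3))
```

Because the fitness here rewards predictivity only, nothing penalises a subset whose coefficients are poorly determined. A criterion that also incorporates the b ± U(b) information would replace `q2_loo` in the fitness call.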

The purpose of this study was to investigate the use of the modelling power (Mp) statistic to guide the GA variable selection process and to use the Modelling Power Plot to evaluate QSARs developed using different GA strategies. The Mp-criterion combines the predictive and descriptive abilities of individual models. A comparison with the classical criterion based on maximum model predictive ability was performed. It should be stressed at the outset that the aim was not to create novel models for toxicity, but to introduce the Mp-criterion as a fitness function for GA-PLS as a way to optimize both the variable selection and the number of PLS latent variables (LVs). The novelty therefore lies in incorporating the ‘descriptive power, Dp’ (i.e. the uncertainty of the coefficients of the selected descriptors) into this process, where in the past only the predictive power has been considered. The goal is to observe differences between the ‘maximum predictivity’ and the ‘high predictivity and descriptivity’ criteria in the GA process.

At this stage, we do not intend to propose the Mp-criterion as a definitive or exclusive tool in GA (this would require exploring it with a large number of cases and comparing it with the most accepted criteria, which is still a non-harmonized matter), but to recommend it as a complement to other proposed criteria, to be used and further explored by modellers with other datasets.

Compounds considered and toxicological endpoint modelled

The concentrations causing 50% lethality (Cnar, in mol dm−3) of 123 compounds to the tadpole were taken from [12] (the full data matrix is available from this reference). The response variable (y-vector) was taken as log 1/Cnar [6].
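The response transformation can be written as a one-line function. This is a small sketch that assumes the logarithm is base 10, as is conventional for toxic potency; the function name and the example concentration are hypothetical and are not taken from [12].

```python
import math


def narcosis_response(c_nar):
    """Return log(1/Cnar) for a concentration in mol dm^-3 (base 10 assumed)."""
    if c_nar <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(c_nar)


# Hypothetical concentration: 1e-3 mol dm^-3 corresponds to a response close to 3.
print(narcosis_response(1e-3))
```

On this scale, more potent compounds (lower Cnar) have larger response values.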

All of the compounds are assumed to be narcotic in action and cover a broad range of organic structures [12]. In a previous work [6], it was shown that four compounds (Triacetin, Hexan-1-ol, Decan-1-ol and Acetal) appear as outliers. They were eliminated for the

Results and discussion

For toxicological QSARs to be used with confidence, a number of criteria should ideally be defined and met (e.g. those enshrined in the OECD Principles for the Validation of (Q)SARs) [1]. At the heart of these criteria is the statistical robustness of a QSAR. Assessing the robustness of any model is more difficult than it may appear at first glance, especially as the model becomes increasingly multivariate and/or complex in nature. This paper has applied an approach, known as the Modelling Power

Conclusions

The modelling power statistic provides a simple tool to be used in combination with GAs for variable (and LV) selection to determine QSAR-PLS models on the basis of their predictive and descriptive abilities. To introduce the technique we have used a well-known dataset (for tadpole narcosis) with 10 variables; however, the criterion needs to be tested in different situations (say, problems with more descriptors) to establish its real applicability. At this point, it is not possible to

Acknowledgements

The financial support of the Spanish Ministry of Science and Technology (MCYT) and the European Regional Development Fund (ERDF) (Project SAF2005-01435) is gratefully acknowledged.

References (14)

  • A. Golbraikh et al., J. Mol. Graph. Model. (2002)
  • P. Gramatica et al., J. Mol. Graph. Model. (2007)
  • R. Todeschini et al., Anal. Chim. Acta (2004)
  • V.K. Agrawal et al., Bioorg. Med. Chem. (2003)
  • A.P. Worth et al., SAR QSAR Environ. Res. (2007)
  • OECD, Guidance Document on the Validation of (Quantitative) Structure Activity Relationship [(Q)SAR] Models, OECD...
  • L. Eriksson et al., Environ. Health Perspect. (2003)
There are more references available in the full text version of this article.
