ALK-5 Inhibition: A Molecular Interpretation of the Main Physicochemical Properties Related to Bioactive Ligands

The activin-like kinase 5 (ALK-5) receptor is an attractive target for cancer treatment. Quantitative structure-activity relationship (QSAR) analyses were performed to explore the relationship between the molecular structure of 1,5-naphthyridine, pyrazole and quinazoline derivatives and the inhibition of ALK-5. For a data set containing 59 compounds, various electronic descriptors were calculated using the density functional theory (DFT) method; stereochemical descriptors (such as molecular volume and area), polar surface area (PSA), log P and Dragon descriptors were also calculated. The ordered predictor selection (OPS) algorithm, weighted principal component analysis (WPCA) and Fisher's weights (FW), each combined with sequential forward selection, were employed to select the most relevant descriptors for the partial least squares regressions. Using this procedure, we selected the most informative descriptors and achieved significant correlation coefficients (q² = 0.74, r² = 0.83). Additional validation tests indicated that the obtained model is robust and reliable and, consequently, can be used to predict the biological activity of new compounds.


Introduction
Cancer is a global problem and a cause of death in all countries. The number of cancer cases is estimated to increase worldwide because of the growth and aging of the population, particularly in less developed countries, where about 82% of the world's population resides. In Brazil, an estimate by the Brazilian National Cancer Institute (INCA) for 2014, also valid for 2015, predicted an increase of 75% in new cases of cancer. [1-4] Cancer is the second most common cause of death in the United States, exceeded only by heart disease, accounting for nearly one of every four deaths. Cancer is widespread in the population, and mortality is expected to rise globally. [3,4] These global estimates are of great concern because cancer is generally caused by genetic mutations that give the affected cells specific characteristics, such as high levels of proliferation, invasion of neighboring tissues (metastasis) and evasion of apoptosis. Thus, it is extremely important to find new drug candidates that target cancer progression, invasion and metastasis. [1,2] In this scenario, an interesting target protein is transforming growth factor β (TGF-β). [3,6-8]

The complex function of TGF-β depends on the activation of two highly conserved single-pass transmembrane serine/threonine kinases: the type I receptor (TβRI, or activin-like kinase 5, ALK-5) and the type II receptor (TβRII). The mechanism of TGF-β binding involves the following steps. TβRII phosphorylates the threonine residues in the GS (repeated series of serine-glycine) domain of the ligand-occupied ALK-5 (or TβRI). ALK-5, in turn, phosphorylates the cytoplasmic proteins SMAD2 and SMAD3 at two carboxy-terminal serine residues. The phosphorylated SMAD proteins form heteromeric complexes with SMAD4, and this complex translocates into the nucleus to affect gene transcription. Changes in gene expression are known to be important to the evolution and adaptation of living organisms. Thus, TGF-β and its receptors (ALK-5 and TβRII) are able to control cellular growth and to promote several biological responses. In summary, these receptors can be considered important targets for treating complex diseases such as cancer and fibrosis. [17]

In vitro assays of the activity of ALK-5 inhibitors remain labor-intensive and time-consuming. In this context, more efficient and economical alternative methods should be employed, such as in silico molecular modeling approaches, which are used in virtual screening to predict and prioritize chemicals for subsequent in vitro and in vivo screening.

Vol. 26, No. 9, 2015
Quantitative structure-activity relationship (QSAR) studies have been widely used to help predict and design new bioactive compounds. Accordingly, QSAR methodology was employed in this study to explain how the molecular properties of a compound series are associated with biological activity. [19,20] Some QSAR models of ALK-5 inhibitors, such as benzimidazoles, [21] 4-(quinolone-4-yl)-substituted, 1,5-naphthyridine, pyrazole and quinazoline derivative series, have been reported in the literature. [22,23] However, these previous studies did not take into account all the compound classes analyzed here, and their authors employed other QSAR techniques. This study presents a different point of view on the main interactions that may occur between the selected compound classes and the biological target (ALK-5).
QSAR studies, along with information extracted from the available X-ray crystallographic structure of ALK-5, have proven to be useful tools in lead compound optimization, both to obtain potential therapeutic agents for the treatment of cancer and to understand the role of ALK-5 in the pathology of this disease. To this end, we constructed a series of models to elucidate the most relevant relationships between the molecular properties of the ALK-5 inhibitors studied in this work and their biological activity. Another objective of this study is to evaluate the ability of various methodologies to perform efficient variable selection and, consequently, to construct a statistical model that is independent of molecular alignment: a simple, effective and innovative model that could be employed in further virtual screening protocols.

Data set
A data set of 59 compounds, synthesized and tested under the same experimental conditions by Gellibert et al., [24-26] was selected to construct robust and reliable statistical models. These compounds comprise three structurally diverse classes: 1,5-naphthyridine, pyrazole and quinazoline derivatives, whose IC50 values were converted to pIC50 (−log IC50; see Figure 1 and Table S1).
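The IC50 to pIC50 conversion can be illustrated with a minimal sketch. It assumes that IC50 is expressed in mol/L, which the text does not state; the function name and the 12 nM example value are hypothetical and are not taken from Table S1.

```python
import math

def pic50_from_ic50(ic50_molar: float) -> float:
    """Return pIC50 = -log10(IC50), assuming IC50 is given in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)

# Hypothetical example: an IC50 of 12 nM (invented for illustration)
ic50_nm = 12.0
print(round(pic50_from_ic50(ic50_nm * 1e-9), 2))  # 7.92
```

On this scale, a tenfold decrease in IC50 raises pIC50 by one unit, which is why residuals are later reported in log units.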

Generation of 3D structures
To generate a bioactive conformation of each compound, we performed several docking analyses with the GOLD 5.0 software, [27] which uses a genetic algorithm to generate the ligand conformation and GOLDScore as the scoring function. All steps and details of the docking protocol, as well as the pose generation, were described in a previous study. [23] The quality of the selected poses/conformations is supported by the statistical parameters of the 3D model, which is highly sensitive to the three-dimensional alignment of the data set. [28] In addition, the docking analyses of the most and the least active compounds corroborate the binding mode proposed in the literature. Therefore, the docking poses can be considered a good model for the bioactive conformations of the studied ALK-5 inhibitors.

Physicochemical properties and calculation of descriptors
After all 3D conformations had been generated from the docking analyses, several electronic properties (for example, molecular orbital energies, dipole moment and atomic charges) and other descriptors were calculated using the density functional theory (DFT) method with the B3LYP functional [24] and the 6-311G(d) basis set, [29,30] as implemented in the Gaussian09 package. [27] Stereochemical descriptors (such as molecular volume and area), polar surface area (PSA), log P, molecular weight and others were calculated using the software Spartan'08, [31] HyperChem 8.1 [32] and Sybyl 8.1. [33] Topological descriptors, which provide valuable information about several aspects of the molecular structure, [35,36] were calculated with E-dragon 2.1, available at the Virtual Computational Chemistry Laboratory (VCCLAB). [34]

Feature selection
Selecting the features that mathematically represent the compound set and its relationship with the biological activity is not a trivial task. The methods employed to generate molecular descriptors can provide a large number of variables (some of them yield up to thousands of descriptors). Furthermore, QSAR studies follow a rule of thumb: in general, one descriptor should be used for every five compounds. [18,36] This proportion, associated with the parsimony principle or Occam's razor, was proposed to facilitate the physicochemical interpretation of the QSAR model and to avoid overfitting (a condition in which excess information improves the apparent quality of the model by chance). [37] For this reason, we tested the ability of various methods to perform an efficient variable selection, with the aim of keeping only the chemical descriptors able to generate a robust model. All methods employed in this study for variable selection are described below.

Fisher's weight (FW)
Fisher's weight (FW) is a widely used method in pattern recognition studies. It selects the variables that characterize a given data set or separate it into two or more groups. [38,39] The main idea of FW is to find a subset of variables such that, in the data space spanned by the selected variables, the distances between observations in different classes are as large as possible, while the distances between observations in the same class are as small as possible. The variable selection is performed by maximizing the trace criterion, an optimization function that can be applied to several dimensionality reduction methods because it directly captures the distances between observations within and between the classes of data. To simplify this problem, the most common heuristic is to compute a weight for each variable of the set X (x j , j = 1,…,n) according to the criterion F. [40] Considering µ j k the average of the j th variable in the k th class and s j the standard deviation of the j th variable, the weight of each variable is computed by equation 1. After the Fisher's weight has been calculated for each variable, the variables with the highest weights are selected.
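The expression of equation 1 is not reproduced in this text. As a hedged illustration, a common form of the Fisher criterion that uses the symbols defined above is $F_j = \sum_k n_k (\mu_j^k - \mu_j)^2 / s_j^2$, where $n_k$ (number of observations in class k) and $\mu_j$ (overall mean of variable j) are assumed symbols that the surviving text does not define. The sketch below implements this assumed form, which may differ from the authors' exact equation; the toy data are invented.

```python
from statistics import mean, pvariance

def fisher_weight(values, labels):
    """Between-class scatter of one descriptor divided by its overall variance."""
    overall_mean = mean(values)
    overall_var = pvariance(values)  # s_j squared, over the whole data set
    if overall_var == 0:
        return 0.0  # a constant descriptor cannot separate classes
    between = 0.0
    for k in sorted(set(labels)):
        members = [v for v, lab in zip(values, labels) if lab == k]
        between += len(members) * (mean(members) - overall_mean) ** 2
    return between / overall_var

# Hypothetical toy data: six compounds, two activity classes
labels = [0, 0, 0, 1, 1, 1]
separating = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
uninformative = [1.0, 3.0, 2.0, 1.1, 2.9, 2.1]
print(fisher_weight(separating, labels) > fisher_weight(uninformative, labels))  # True
```

A descriptor whose class means are far apart relative to its spread receives a high weight, which is the property used to rank the variables.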

Ordered predictor selection (OPS)
Ordered predictor selection (OPS) is an algorithm employed to select the most relevant descriptors for regression analyses. [41] The OPS method generates a vector (the informative vector) that contains information about the location of the best chemical descriptors for prediction. This vector can be obtained directly from calculations performed with the descriptors (X) and the dependent variable (y), or from combinations of different vectors obtained for the same purpose. Afterwards, the original variables are ranked according to the corresponding absolute values of the informative vector: the higher the absolute value, the more important the variable, which allows the variables to be sorted in descending order of magnitude. [42] Multivariate regression models are then built and evaluated using a cross-validation strategy. An initial subset of variables (window) is selected to build and evaluate the first model. This matrix is then expanded by adding a fixed number of variables (increment), and a new model is built and evaluated. New increments are added until all variables, or a given percentage of them, have been taken into account. Quality parameters are obtained for every evaluation and stored for later comparison. The evaluated variable sets (the initial window and its extensions) are compared using the quality parameters calculated during the validations. The model with the best quality parameters should contain the variables with the best predictive capability, and these are the selected variables.
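The window-and-increment search described above can be sketched as follows. This is only an illustration under stated assumptions: the informative vector is taken to be the absolute Pearson correlation of each descriptor with y, ordinary least squares with leave-one-out cross-validation stands in for the PLS models used in the study, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def loo_rmse(X, y):
    """Exact leave-one-out RMSE of a least-squares model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                      # hat matrix
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))   # leave-one-out residuals
    return float(np.sqrt(np.mean(loo_resid ** 2)))

def ops_sketch(X, y, window=2, increment=1):
    """Sort descriptors by an informative vector, then grow the subset stepwise."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    # Informative vector (assumption): |Pearson correlation| with y
    info = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    order = np.argsort(-info)                      # descending importance
    best_rmse, best_subset = np.inf, []
    for size in range(window, X.shape[1] + 1, increment):
        subset = order[:size]
        rmse = loo_rmse(X[:, subset], y)
        if rmse < best_rmse:                       # keep the best evaluated window
            best_rmse, best_subset = rmse, sorted(subset.tolist())
    return best_subset, best_rmse

# Synthetic data: only columns 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.05, size=60)
subset, rmse = ops_sketch(X, y)
print(subset, round(rmse, 3))
```

The leave-one-out residuals are obtained from the hat matrix, which avoids refitting the model once per compound.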

Weighted principal components analysis (WPCA)
Weighted principal components analysis (WPCA) is a method that uses the matrix of loadings obtained by PCA to perform the variable selection. [43] WPCA combines the weighted principal components with a threshold algorithm. Specifically, the contribution of each feature is represented by a loading value in a weighted principal component, and a threshold algorithm based on a moving range-based control chart evaluates the significance of this contribution. [43] In WPCA, the weight of each variable is obtained from the sum of the loading values that represent the importance of each feature in the formation of a PC (for example, a ij indicates the degree of importance of the j th feature for the i th PC). When the loading values of the j th original feature are computed for m PCs, the importance of the j th feature can be represented by equation 2, where a ij (i,j = 1,2,…,n) represents the loading value of each variable in each PC after the application of PCA [44] and b i represents the weight of the i th PC. One way to determine b i is to compute the total variance explained by the i th PC; w j is called the weighted PC loading for feature j.

$$w_j = \sum_{i=1}^{m} b_i \,|a_{ij}| \qquad (2)$$

After the weighted PC loadings have been obtained, the moving range-based threshold algorithm is applied to identify the significant features. The threshold algorithm comes from a moving range-based control chart widely used in quality control. [41] A feature is considered significant if the corresponding weighted PC loading exceeds the threshold g.
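The weighted PC loadings can be computed with a few lines of NumPy. The sketch below assumes autoscaled data, uses the absolute loading values, and takes b i as the fraction of variance explained by each PC; the moving range-based threshold is not implemented, and the function name and random data are hypothetical.

```python
import numpy as np

def weighted_pc_loadings(X, m):
    """Return w_j = sum_i b_i * |a_ij| over the first m principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale each feature
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)   # rows of Vt are PC loadings
    b = s ** 2 / np.sum(s ** 2)     # b_i: fraction of variance explained by PC i
    A = np.abs(Vt[:m])              # |a_ij|: row i = PC i, column j = feature j
    return b[:m] @ A

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
w = weighted_pc_loadings(X, m=3)
print(np.argsort(-w))               # features ranked by weighted PC loading
```

Features would then be retained when their weighted loading exceeds the chosen threshold.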

Sequential forward selection (SFS)
Sequential forward selection (SFS) is another method used for variable selection; it selects the subset of variables that gives the best result in a regression or classification model. The search is carried out as follows: (i) the algorithm starts by looking for the single variable that generates a regression model satisfying a given criterion (e.g., low calibration error); (ii) new variables are then added sequentially to the selected subset, as long as the value obtained is better than that of the previous subset or until a given number of variables is reached. More information about this method can be found in other studies. [45] All the methods described above (OPS, WPCA and FW) were employed in combination with sequential forward selection to reach a defined final number of descriptors that best describe our system.
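Steps (i) and (ii) can be sketched as a greedy loop. As an assumption made for brevity, the criterion here is the leave-one-out RMSE of an ordinary least-squares model rather than the PLS models used in the study; the function names and synthetic data are hypothetical.

```python
import numpy as np

def loo_rmse(X, y):
    """Exact leave-one-out RMSE of a least-squares model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))
    return float(np.sqrt(np.mean(loo_resid ** 2)))

def sequential_forward_selection(X, y, n_select):
    """Greedily add the variable that most reduces the cross-validated error."""
    selected, best_score = [], np.inf
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        scores = {j: loo_rmse(X[:, selected + [j]], y) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_score:
            break                       # no candidate improves the current subset
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_score

# Synthetic data: only columns 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.05, size=60)
selected, score = sequential_forward_selection(X, y, n_select=2)
print(selected, round(score, 3))
```

The loop stops either when the requested number of variables is reached or when no remaining variable improves the criterion, mirroring the two stopping conditions in the text.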

Splitting of training and test sets
Training and test sets are important to determine the quality of the statistical models obtained from regression methods. Their composition matters both for obtaining an internally consistent model and for testing its external predictive ability on an equally representative set. Kennard-Stone is a rational method that is widely employed to split data into training and test sets. [46,47] This method was developed to produce a division when no standard experimental design can be applied. [44] The Kennard-Stone algorithm selects the objects so that they are distributed evenly throughout the descriptor space of the original data set. The technique is applied as follows: (i) select the first two molecules of the data set by choosing the two that are farthest apart in terms of Euclidean distance; (ii) select the compound that has the maximum dissimilarity from each of the previously selected molecules and place it in the training set; (iii) repeat step (ii) until the desired number of molecules has been added to the training set.
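Steps (i) to (iii) can be sketched in plain Python. The sketch interprets "maximum dissimilarity" in the usual Kennard-Stone sense, as the largest distance to the nearest already-selected object (an assumption about the implementation used); the function name and the one-dimensional toy points are hypothetical.

```python
import math

def kennard_stone(points, n_train):
    """Return (training indices, test indices) chosen by the Kennard-Stone rule."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Step (i): start with the two most distant objects.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    # Steps (ii)-(iii): add the object farthest from its nearest selected neighbour.
    while len(selected) < n_train:
        remaining = [i for i in range(n) if i not in selected]
        nxt = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(nxt)
    test = [i for i in range(n) if i not in selected]
    return selected, test

points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
print(kennard_stone(points, n_train=3))  # ([0, 4, 3], [1, 2])
```

Because the extremes are picked first, the training set spans the descriptor space and the test compounds fall inside it.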

Outlier detection and applicability domain
Two other important aspects that should be checked when generating QSAR models are outlier detection and the analysis of the applicability domain. Both are measures of the robustness of QSAR models that will be used to predict compounds with unknown activity. In this study, outliers were detected with the method proposed by Filzmoser et al., [48] which combines the ordered squared robust Mahalanobis distances (MD) of the observations with the chi-squared distribution. Initially, the MD value of each observation is calculated. Afterwards, observations that exceed a certain value of the chi-squared distribution are marked as outliers. More details about this method can be found in Filzmoser et al. [48]

The applicability domain is widely used to express the scope and limitations of a QSAR model, i.e., the range of chemical structures for which the model is considered applicable. [49] In our study, we used the leverage values and Studentized residuals to determine the applicability domain of the compounds. The leverage method provides a measure of the distance of each compound from the centroid of the data set (i.e., the mean vector of the data set). Compounds near the centroid are less influential in the QSAR model than those at extreme points. More details about these techniques can be found in the references. [50,51]

Construction of QSAR models
The QSAR models were generated using the partial least squares (PLS) method, as implemented in the Pirouette 3.11 software. [52] The PLS method can handle data with numerous independent variables by constructing principal components (PCs) from a linear combination of all X variables used to construct the QSAR model. In short, the X matrix of independent variables (containing the descriptors) is correlated with the Y vector (representing the biological data, in this case) in such a way that the projected coordinates (T) are good predictors of Y. [53] An important feature of PLS is that the biological data are included in the decomposition procedure. Besides, the loading matrix (W) is defined in such a way that the product (variance in X) times (the correlation of XW with Y) is maximized. [53] A detailed description of PLS can be found in other references. [54,55] The model quality was evaluated according to its internal consistency (q², from the leave-one-out and leave-N-out methods), external predictive ability (r² of the test set and residual values), sensitivity to randomization (Y-scrambling) and external predictive potential (r²m).
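The r²m metric mentioned above is not defined in the text. A minimal sketch, assuming the commonly used definition r²m = r²(1 − √|r² − r0²|), where r² is the squared correlation between observed and predicted values and r0² is the determination coefficient of the regression through the origin, is given below; the function name and the example values are hypothetical.

```python
import math

def rm_squared(y_obs, y_pred):
    """r2m = r2 * (1 - sqrt(|r2 - r0_2|)) for observed versus predicted values."""
    n = len(y_obs)
    mo, mp = sum(y_obs) / n, sum(y_pred) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(y_obs, y_pred))
    sxx = sum((o - mo) ** 2 for o in y_obs)
    syy = sum((p - mp) ** 2 for p in y_pred)
    r2 = sxy ** 2 / (sxx * syy)                      # squared Pearson correlation
    # Slope of the regression of observed on predicted values through the origin
    k = sum(o * p for o, p in zip(y_obs, y_pred)) / sum(p * p for p in y_pred)
    r0_2 = 1.0 - sum((o - k * p) ** 2 for o, p in zip(y_obs, y_pred)) / sxx
    return r2 * (1.0 - math.sqrt(abs(r2 - r0_2)))

# Invented pIC50 values, for illustration only
observed = [5.0, 6.0, 7.0, 7.5]
predicted = [5.2, 5.9, 6.8, 7.6]
print(round(rm_squared(observed, predicted), 3))
```

The metric penalizes models whose predictions correlate with the observations but deviate from the ideal line through the origin, so it never exceeds r².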

Results and Discussion
To define the best model, the workflow shown in Figure 2 was followed. Initially, the 1719 calculated descriptors were submitted to an intermediary filter using WPCA, OPS and two forms of Fisher's weight. After the application of these techniques, the SFS algorithm was used to obtain models with 8 variables, in line with the rule of 1 descriptor for every 5 compounds, since the training set of our study contains 46 compounds (46/5 ≈ 9). The variable selection with WPCA was carried out in MATLAB. [56] As the WPCA parameter, we applied an error range of 0.01 (β = 0.01). This method returned 42 variables. The variable selection with OPS was performed with the OPS package developed by Teofilo et al., [42] also implemented in MATLAB. The selection criterion was the minimum root mean squared error obtained after the application of the PLS technique. This procedure selected 256 variables.
The initial version of FW was applied by separating the molecules into two classes of biological activity: (i) a class with pIC50 between 4.95 and 7.32; (ii) a class with pIC50 between 7.33 and 7.92. The data set was split using non-uniform activity ranges because this threshold (pIC50 ca. 7.33) separates the compounds into two subsets that are balanced both biologically and structurally, with about 23 compounds each. Finally, the variables with weights higher than 5.00 were selected, resulting in 357 variables.
To apply the second version of Fisher's weight (MFW), the data set was initially divided into six classes, according to the following ranges of biological activity: (i) class 1 (4.95-5.49); (ii) class 2 (5.50-5.99); (iii) class 3 (6.00-6.49); (iv) class 4 (6.50-6.99); (v) class 5 (7.00-7.49) and (vi) class 6 (7.50-7.99). The main idea of MFW is to select the descriptors that are important to discriminate between the most active compounds and the least active ones. In other words, MFW is designed to discriminate the most active compounds (class 6) from each of the other classes individually.
After the classes had been defined, the most active class (class 6) was compared with each of the five remaining classes using FW, and the weights for each comparison were determined. Finally, for each variable, we calculated the weighted sum of the weights found in each comparison using equation 3.

$$\mathrm{MFW} = 0.35\,\mathrm{FW}_{6\text{-}1} + 0.30\,\mathrm{FW}_{6\text{-}2} + 0.20\,\mathrm{FW}_{6\text{-}3} + 0.10\,\mathrm{FW}_{6\text{-}4} + 0.05\,\mathrm{FW}_{6\text{-}5} \qquad (3)$$

In the last step of MFW, we selected the variables with weights higher than 5.0, as done in the selection with FW. The major difference between FW and MFW is that the former provides the variables related to the split of the data set into two classes (the most and the least potent compounds), whereas the latter provides the X variables that gradually discriminate the most active compounds (class 6) from the least active class. The application of MFW returned 646 variables.

After the initial variable selection step with WPCA, FW, MFW and OPS, the SFS technique was used to select eight variables from each of the variable subsets cited previously, as shown in Figure 2. The SFS technique was combined with the PLS method, and the eight variables that gave the best value of q² were selected. The main results are summarized in Table 1.
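The class assignment and the weighted sum of equation 3 can be written compactly. In this sketch the class boundaries and coefficients are those given in the text, while the function names and the pairwise FW values are hypothetical.

```python
from bisect import bisect_right

CLASS_EDGES = [5.50, 6.00, 6.50, 7.00, 7.50]   # lower bounds of classes 2 to 6
MFW_COEFFS = {1: 0.35, 2: 0.30, 3: 0.20, 4: 0.10, 5: 0.05}  # equation 3

def activity_class(pic50):
    """Map a pIC50 value to its activity class (1 = least active, 6 = most active)."""
    return bisect_right(CLASS_EDGES, pic50) + 1

def mfw(pairwise_fw):
    """Weighted sum of the Fisher's weights FW(6-k) between class 6 and class k."""
    return sum(coeff * pairwise_fw[k] for k, coeff in MFW_COEFFS.items())

print(activity_class(5.49), activity_class(7.33), activity_class(7.92))  # 1 5 6
# Invented pairwise weights for one descriptor
print(round(mfw({1: 10.0, 2: 8.0, 3: 6.0, 4: 4.0, 5: 2.0}), 2))  # 7.6
```

The coefficients sum to 1 and decrease from class 1 to class 5, so a descriptor is rewarded most for separating the most active compounds from the least active ones.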
Based on Table 1, we selected the variables indicated by the MFW method, since they returned the best value of q² and the lowest standard error of estimation. The difference between the models generated with the MFW and FW methods is not significant, so we employed the MFW model for the physicochemical interpretation of the selected variables, but we also analyzed the other models.
After choosing the best set of variables using MFW, we performed several outlier analyses and tested different splits into training and test sets. For the outlier analysis, the technique described by Filzmoser et al. [48] was applied, searching for outliers with a chi-squared distribution and a limit value of 0.95 (Figure 3). Moreover, the leverage values obtained after the variable selection were calculated. Among all compounds, compound 3 was identified as an outlier both by Filzmoser's technique and by its leverage value. Thus, this compound was removed from the data set in the subsequent analyses.
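The leverage values used in this check can be obtained from the diagonal of the hat matrix. The sketch below is an illustration with synthetic descriptors: the warning limit h* = 3(p + 1)/n is a commonly used convention that the text does not state, the robust Mahalanobis distance procedure of Filzmoser et al. is not implemented, and the function name is hypothetical.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix for a model with intercept and descriptors X."""
    A = np.column_stack([np.ones(X.shape[0]), X])
    H = A @ np.linalg.pinv(A)
    return np.diag(H)

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))
X[0] = [10.0, 10.0]                              # artificial extreme compound
h = leverages(X)
h_star = 3 * (X.shape[1] + 1) / X.shape[0]       # assumed warning limit
print(np.where(h > h_star)[0])                   # indices flagged as high leverage
```

Compounds far from the centroid of the descriptor space receive leverages close to 1 and are the ones examined as potential outliers.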
The splitting into training and test sets was performed in two steps: (i) the data were divided into two subsets according to the level of biological activity: 4.95-7.32 and 7.33-7.92; (ii) after this initial split, the Kennard-Stone method was applied to each subset, assigning 80% to the training set and 20% to the test set. As a final result, 46 molecules were selected for the training set and 12 for the test set (Supplementary Material, Table S2).

Statistical analysis of model 3
Compared with the other models, model 3 displays satisfactory internal and external correlation coefficients (q²LOO and q²LNO = 0.74; r² = 0.83 and r² test set = 0.87), and the Y-scrambling results (the average values of q² and r² for the scrambled models) indicate that the model was not obtained by chance (Table 2). The better quality of model 3 can also be seen by comparing the r²m values of all models: only the MFW and OPS models (models 3 and 4, respectively) showed acceptable external predictive ability, and that of model 3 was clearly superior to that of model 4. Moreover, the model obtained with the combination of MFW and SFS presented the lowest SEV and SEC values.
To evaluate the robustness and the stability of the selected model, leave-N-out and Y-scrambling tests were carried out (Table 2). A good QSAR model must have an average value of q² close to the q² obtained with the leave-one-out procedure, while the standard deviation for each N should not exceed 0.1. [51] The model obtained with the variable selection using MFW and SFS was stable, with deviations from q² lower than 0.020 for each N. These findings confirm the stability and robustness of model 3 (Figure 4).
The predictive power of model 3 was also evaluated by predicting the biological activity of the compounds in the test set (external validation). Experimental and predicted pIC50 values are listed in Table 3. The results indicate that the model is predictive, since the residuals of the external predictions were lower than 0.80 log units.
A plot of the experimental versus predicted pIC50 for the compounds in the training and test sets is shown in Figure 5. The good agreement between the experimental and calculated values indicates that a predictive MFW model was obtained and can be used to predict the biological activity of other compounds within this structural class.
The Y-scrambling validation was also employed to check for chance correlations between the dependent variable and the selected descriptors. In this test, the pIC50 values were scrambled and the r² and q² values were recalculated (Figure 6). In the 100 Y-scrambling experiments performed on our data, only low values of q² and r² were obtained, with averages of −0.31 and 0.18, respectively. Because both parameters are low for the scrambled models, the correlation between the selected descriptors and the response variable in our data set is unlikely to be due to chance.
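The Y-scrambling procedure can be sketched as follows. As an assumption, ordinary least squares replaces the PLS model of the study, q² is computed from leave-one-out residuals, and the synthetic data only mimic the dimensions of the training set (46 compounds, 8 descriptors); all names are hypothetical.

```python
import numpy as np

def r2_q2(X, y):
    """Calibration r2 and leave-one-out q2 of a least-squares model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)
    resid = y - H @ y
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / ss_tot
    press = np.sum((resid / (1.0 - np.diag(H))) ** 2)   # leave-one-out residuals
    return float(r2), float(1.0 - press / ss_tot)

def y_scrambling(X, y, n_runs=100, seed=0):
    """Mean r2 and q2 over models refitted to randomly permuted responses."""
    rng = np.random.default_rng(seed)
    results = [r2_q2(X, rng.permutation(y)) for _ in range(n_runs)]
    return np.mean(results, axis=0)

rng = np.random.default_rng(3)
X = rng.normal(size=(46, 8))
y = X @ np.linspace(0.2, 1.0, 8) + rng.normal(scale=0.3, size=46)
r2_true, q2_true = r2_q2(X, y)
r2_scr, q2_scr = y_scrambling(X, y)
print(round(r2_true, 2), round(q2_true, 2), round(r2_scr, 2), round(q2_scr, 2))
```

A model that captures a real relationship shows r² and q² well above the averages obtained for the scrambled responses, for which q² is typically negative.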
In summary, all internal and external validations indicate that model 3 is suitable for predicting the biological activity of new ALK-5 inhibitors and, consequently, that this model contains statistically relevant information on the relationships between the calculated descriptors and the biological activity.

Physicochemical interpretation of the best model
For the model obtained using the MFW and SFS algorithms (variable selection), 8 descriptors were selected: MATS4v, EEig04x, ESpm12r, BELp5, SPH, Mor26e, R8m+ and R5e+. Table 4 describes each variable employed in the construction of model 3.
The calculated values of the 8 selected descriptors are shown in the Supplementary Information (Table S2), and the contribution of each descriptor to the regression vector of model 3 is displayed in Figure 7.
Regarding the selected descriptors used to build the model presented in this study, some considerations can be made (Figure 7):

(i) SPH is a geometrical descriptor that refers to the spherical shape of the molecule. This variable suggests that the spherical shape of the compounds is an important parameter in ALK-5 inhibition, since this descriptor showed the highest contribution to the PC. SPH values close to 1 indicate a more spherical shape, while values close to 0 indicate non-spherical compounds. [57] In this study, the SPH descriptor presented an important contribution (Figure 7), indicating a better complementarity between spherical compounds and the active site. Indeed, the three most potent compounds (21, 19 and 39) have SPH values of 0.937, 0.949 and 0.853, respectively, while the three least potent ones (50, 46 and 49) have SPH values of 0.732, 0.783 and 0.802, respectively. These results indicate that the most potent compounds have higher SPH values and, consequently, are more spherical and can make more interactions in the active site of the biological target.

(ii) EEig04x is the descriptor with the second-highest positive contribution and represents eigenvalue 04 of the edge-adjacency matrix weighted by edge degrees, which belongs to the edge-adjacency indices. The adjacency matrix also provides some generalized descriptors of network connectivity, such as the average vertex degree and connectivity. [58,59]

(iii) MATS4v represents the distribution mode of the atomic van der Waals volumes along the topological structure of the compounds. [60,65] Therefore, the positive contribution of this descriptor indicates a relationship between the topological structure weighted by van der Waals volume and the biological activity.

(iv) MorSE descriptors carry structural information by means of 3D atomic coordinates. In this case, Mor26e is a 3D-MorSE descriptor weighted by atomic electronegativity, and it has the fourth-highest positive contribution. Thus, the atomic electronegativity of the compounds showed high statistical importance for the protein-ligand interaction. [35,61]

(v) R8m+ and R5e+ are GETAWAY (geometry, topology and atom-weights assembly) descriptors derived from the leverage matrix, which is obtained by centering all atomic coordinates. [62,63] R8m+ is weighted by atomic masses and has a positive contribution for the data set, while R5e+ is weighted by atomic electronegativities and has a negative contribution (see Figure 7). Therefore, these descriptors can account for the size (R8m+) and the shape (R5e+) of the ALK-5 inhibitors, weighted by the respective atomic properties, in relation to the pIC50 values.

(vi) ESpm12r represents the resonance effects or resonance integrals between atoms twelve bonds apart. [59,66] As its contribution to the model was negative, the resonance effects could be inversely related to the biological activity.

(vii) BELp5 is a 2D Burden eigenvalue descriptor and has the lowest contribution (Figure 7). This descriptor is weighted by atomic polarizabilities and encodes molecular branching, position and length. This topological descriptor is designed to encode atomic properties that drive intramolecular interactions. [64,67]

Based on the results obtained in this study, and in view of the continuous search for new anti-cancer compounds, statistical models can play an important role in the discovery and optimization of new drug candidates. In this work, WPCA, FW, MFW and OPS-PLS models were developed to provide insights into the molecular features relevant to ALK-5 inhibition. A set of 8 descriptors selected by the MFW and SFS techniques proved suitable for the construction of reliable models. The good statistical parameters, stability and robustness of the models obtained here, as supported by the validation tests applied to our data, indicate that these models can be used to design other inhibitors with improved anti-cancer activity, e.g., by using the model as a virtual screening filter. Therefore, the selected descriptors could be employed to construct focused chemical libraries to find new ALK-5 inhibitors.

Conclusions
In this study, four models were investigated with the aim of describing the relationships between the chemical structure of a series of bioactive ligands and their inhibition of the ALK-5 receptor. The WPCA, FW, MFW and OPS-PLS algorithms were employed to select the most relevant descriptors. MFW was the best algorithm for variable selection because it resulted in significant correlation coefficients (q² = 0.74, r² = 0.83 and r² test = 0.87). The strategy employed in this work has provided a reliable model of ALK-5 inhibition for the class of ligands studied. Our findings suggest the importance of topological, geometrical, edge-adjacency, 2D autocorrelation and 3D features for the anti-cancer activity of the studied compounds. The descriptors selected using the MFW method describe molecular features such as geometry (SPH) and connectivity (EEig04x), both of which are Dragon descriptors. Additionally, the influence of the distribution mode of atomic van der Waals volumes is indicated by a 2D autocorrelation descriptor (MATS4v), and that of atomic electronegativity by Mor26e. Therefore, these results can be used to design other ALK-5 inhibitors with anti-cancer activity.

Figure 1. The most and the least active compounds of the data set.

Figure 2. Scheme used to select chemical descriptors.

Figure 3. (a) Analysis of outliers and (b) plot of leverage versus Studentized residuals.

Figure 4. Plot of the results obtained for the leave-N-out validation.

Figure 5. Experimental versus predicted pIC50 of the training and test set compounds.

Figure 6. Plot of the results obtained in the Y-scrambling tests.

Figure 7. Contribution of all selected descriptors.

Table 1. Results of PLS regression combined with SFS technique
a q²: validation coefficient; b SEV: standard error of validation; c r²: calibration coefficient; d SEC: standard error of calibration; e PCs: number of principal components; f weighted principal components analysis (WPCA); g Fisher's weight (FW); h second version of Fisher's weight (MFW); i ordered predictor selection (OPS).

Table 2. Other statistical parameters for all obtained models
c ordered predictor selection (OPS); d second version of Fisher's weight (MFW); e q²: validation coefficient; f SEV: standard error of validation; g r²: calibration coefficient; h SEC: standard error of calibration; i PCs: number of principal components.

Table 3. Experimental and predicted pIC50 values for the test set compounds

Table 4. Symbols, types and definitions of the selected descriptors