QSPR study of the water solubility of a diverse set of agrochemicals: hybrid (GA/ MLR) approach Etude QSPR de la solubilité aqueuse d'un ensemble de divers produits agrochimiques: approche hybride (AG/RLM)

A quantitative structure- property relationship (QSPR) was performed for the prediction of the aqueous solubility of pesticides belonging to four chemical classes: acid, urea, triazine, and carbamate. The entire set of 77 pesticides was divided into a training set of 58 pesticides and a test set of 19 pesticides according to the Snee technique. A six descriptor model, with squared correlation coefficient (R 2 ) of 0.8895 and standard error of estimation (s) of 0.52 log unit, was developed by applying multiple linear regression analysis using the ordinary least square regression method and genetic algorithm- variable subset selection. The reliability of the proposed model was further illustrated using various evaluation techniques: leave- one- out cross- validation, bootstrap, randomization tests, and validation through the test set.

The massive use of agrochemicals, known generically as pesticides [1], has allowed significant reduction in the agricultural plagues, and consequently, increased the productivity. On the other hand, the massive use of these agrochemicals has an environmental cost (due to their toxicity, their persistence, or their tendency to bioaccumulation), which is necessary to evaluate to conciliate productivity and environment protection [2]. Solubility in water is an important physicochemical property, having numerous applications to the modeling of the environmental effects of chemicals [3]. It is a direct measurement of hydrophobicity, that is, the tendency of water to exclude the substance from solution. Although the experimental determination of solubility is not difficult, there are some justifications to develop models that can predict it. This is especially important in environmental studies where the compounds are toxic, carcinogenic, or undesirable for some or other reason. An extensive series of studies for the prediction of aqueous solubility has been reported in the literature [4][5][6][7][8][9][10]. These methods can be categorized into three types: 1 -Correlation of solubility with experimentally data such as melting point (MP) and log P (logarithme of octanol/ water partition coefficient). However, this approach is of little use because it requires a knowledge of the compound's experimental melting point which is not available for virtual compounds. The melting point is a key index of the cohesive interactions in the solid and it is difficult to estimate.
2 -Estimation of solubility by group contribution methods. The group contribution method allows the approximate calculation of solubility by summing up fragmental values associated with substructural units of the compounds. The disadvantages of the group contribution method are that: 1/ the groups included must be defined in advance and therefore the solubility of a new compound containing new groups cannot be estimated; and 2/ the different effects of a group in different chemical environments are not considered.
3 -Correlation of solubility with descriptors derived from the molecular structure by computational methods. This third approach has been proven to be particularly successful for the prediction of solubility because it does not need experimental descriptors and can therefore also be applied to collections of virtual compounds. The aim of the present work is to develop a robust QSPR model that could predict the aqueous solubility values for a diverse set of agrochemicals (which consists of 26 acids, 25 ureas, 13 triazines and 13 carbamates) using the general molecular descriptors computed with the help of DRAGON software [11].

Experimental Data
The experimental S values (mg/l) of 77 selected, structurally heterogeneous, pesticides were taken from Hansen [12]. The water solubility values (log S) span between -1.05 and 5.90 (Table 1). The detailed structures of all studied compounds are available as Supporting Information.

Descriptor Generation
The chemical structure of each compound was sketched on a PC using the HYPERCHEM program [13] and preoptimized using MM+ molecular mechanics method (Polack-Ribiere algorithm). The final geometries of the minimum energy conformation were obtained by the semi-empirical PM3 method at a restricted Hartree-Fock level with no configuration interaction, applying a gradient norm limit of 0.01 kcal.Ǻ -1 .mol -1 as a stopping criterion. Then the geometries were used as input for the generation of 1664 descriptors using the Dragon software (version 5.4) [11]. Quantum-chemical descriptors such as HOMO (highest occupied molecular orbital), LUMO (lowest unoccupied molecul arorbital), HOMO -LUMO gap (DHL), and ionization potential (P ion ), calculated by the semi empirical PM3 method using [13], were added and used for descriptor selection during model development. Constant values and descriptors found to be correlated pairwise were excluded in a prereduction step (when there was more than 98% pairwise correlation, one variable was deleted), and the genetic algorithm was applied for variables selection to a final set of 1230 descriptors.

Selection of the training and test sets
It is important to rationally define a training set from which the model is built and external test set on which to evaluate its prediction power. The object of this selection should be to

©UBMA -2016 14
generate two sets with similar molecular diversity, in order to be reciprocally representative and to cover all the main structural and physicochemical characteristics of the global data set. Several procedures can be adopted for the selection of the training and test sets, the later should contain between 15 and 40% of the compounds in the full data set.
DUPLEX algorithm adopted in this study proceeds as follows. In the first step, the two points which are furthest away from each other are selected for the training set. From the remaining points, the two-objects which are furthest away are included in the test set. In the third step, the remaining point which is furthest away from the two previously selected for the training set is included in that set. The procedure is repeated selecting a single point for the test set which is furthest from the existing points in that set. Following the procedure, points are added alternately to each set [14]. This algorithm was applied in the present study to separate data into two independent subsets: a training set of 58 compounds to build the model and a test set of the remained 19 compounds to evaluate its prediction ability.

Model Development and Validation
Multiple linear regression analysis (MLR) and variable selection were performed by the software MobyDigs [15] using the Ordinary Least Square regression (OLS) method and Genetic Algorithm-Variable Subset Selection (GA-VSS) [16]. The outcome of the application of the genetic algorithms is a population of 100 regression models, ordered according to their decreasing internal predictive performance, verified by Q 2 . The models with lower Q 2 are those with fewer descriptors. First of all, models with 1-2 variables were developed by the allsubsetmethod procedure in order to explore all the low dimension combinations. The number of descriptors was subsequently increased one by one, and new models were formed. The best models are selected at each rank, and the final model must be chosen from among them. This has to be sufficiently correlated and, at the same time, protect against any overparameterization, which would lead to a loss of predictive power for molecules outside training set. From a statistical view point the ratio of the number of samples (n) to the number of descriptors (m) should not be too low. Usually, it is recommended that n/ m ≥ 5 [17]. The GA was stopped when increasing the model size did not increase the Q 2 value to any significant degree. Particular attention was paid to the collinearity of the selected molecular descriptors: by applying the QUIK rule (Q Under Influence of K) [18] a necessary condition for the model validity. Acceptable model is only that with a global correlation of [x + y] block (K xy ) greater than the global correlation of the x block (K xx ) variable, x being the molecular descriptors and y the response variable. The collinearity in the original set of molecular descriptors results in many similar models that more or less yield the same predictive power (in MOBYDIGS software 100 models of different dimensionality). Therefore, when there were models of similar performance, those with higher ∆K (K xy -K xx ) were selected and further verified. The models were justified by the R 2 , the adjusted R 2 , the cross-validated values of Q 2 by leave-one-out (LOO), the F ratio values and the standard error s. The robustness of the models and their predictivity were evaluated by both 2 LOO Q and bootstrap. In this last procedure K ndimensional groups are generated by a randomly repeated selection of n-objects from the original data set. The model obtained on the first selected objects is used to predict the values for the excluded sample, and then Q 2 is calculated for each model. The bootstrapping was repeated 8000 times. The proposed model was also checked for reliability and robustness by permutation testing: new models are recalculated for randomly recorded response (Y-scrambling) by using the same original independent variable matrix. After repeating this test several times (100 times in this work) it is expected to obtain new models that have significantly lower R 2 and Q 2 than the original model. If this condition is not verified the original model is not acceptable, as it was due to a chance correlation or a structural redundancy in the training set. Obtaining a robust model does not give real information about its prediction power. This is evaluated by predicting the compounds included in the test set.The external for the test set is determined with equation (1):

©UBMA -2016 15
Here n ext and n tr are the number of objects in the external set (or left out by bootstrap) and the number of training set objects, respectively.

Applicability Domain Analysis
The applicability domain (AD) [19,20] is a theoretical region in the space defined by the descriptors of the model and the modeled response, for which a given QSPR should make reliable predictions. In this work, the structural AD was verified by the leverage (h ii ) approach [21].
The warning leverage * h is, generally, fixed at 3(m + 1)/n , where n is the total number of samples in the training set and m is the number of descriptors involved in the correlation. The presence of both the response outliers (Y outliers) and the structurally influential compounds (X outliers) was verified by the Williams plot [22], the plot of standardized residuals versus leverage values.

Results of the MLR Model
The dissolving process is the establishment of equilibrium between the phase of solute and its saturated aqueous solution. Aqueous solubility is almost exclusively dependent on the intermolecular forces that exist between the solute molecules and the water molecules. The solute-solute, solute-water, and water-water adhesive interactions determine the amount of compound dissolving in water. Additional solute-solute interactions are associated with the lattice energy in the crystalline state. The solubility of a compound is thus affected by many factors: the state of solute, the relative aromatic and aliphatic degree of the molecules, the size and shape of the molecules, the polarity of the molecule, steric effects, and the ability of some groups to participate in hydrogen bonding.
In order to predict solubility accurately, all these factors correlated with solubility should be represented numerically by descriptors derived from the structure of the molecule.
A best six-parameters equation was obtained, which is as the following:  [27,28] ; HATS7v is the leverage weighted autocorrelation of lag 7/ weighted by atomic van der Waals volumes [29,30] ; RTu+ is the R maximal index/ unweighted [29,30] ; AlogP2 is the squared Ghose-Crippen-Viswanadhan octanol-water partition coefficient [31,32]. More information about these descriptors can be found in [33] and the references therein. The results for the randomized models can be compared with the real starting one only by representing in a plot the statistical coefficients 2 R and 2 Q . This is depicted in figure 1. The statistics for the modified logS vectors are clearly lower than the real QSPR model. This ensures that a real structure-property relationship has been found out. Figure1. Randomization test associated to previous QSPR model. Circles represent the randomly ordered solubilities, and star corresponds to the real solubilities.

©UBMA -2016 17
Some important statistical parameters (as given in table 2) were used to evaluate the involved descriptors. The t -value of a descriptor measures the statistical significance of the regression coefficients. The high absolute tvalues shown in table 2 express that the regression coefficients of the descriptors involved in the MLR model are significantly larger than the standard deviation. The tprobability of a descriptor can describe the statistical significance when combined together within an overall collective QSPR model (i. e., descriptor's interactions). Descriptors with tprobability values below 0.05 (95% confidence) are usually considered statistically significant in a particular model, which means that their influence on the response variable is not merely by chance [34]. The smaller t -probability suggests the more significant descriptor. The tprobability values of the six descriptors are very small, indicating that all of them are highly significant descriptors. The VIF values suggest that these descriptors are weakly correlated with each others. Thus, the model can be regarded as an optimal regression equation. The calculated log S values from equation (2) for the training and test set are showed in table 1 and figure 2. The distribution of errors for the entire data set is given in figure 3. As the errors are distributed on both sides of the zero line, one may conclude that there is no systematic error in the developed model.

Descriptor Contribution Analysis and Interpretation
Based on a previously described procedure [35,36], the relative contribution of the six descriptors to the model were determined and they decrease in the following order: HATS7v (17.91%) > Mor2v (17.67%) > HOMO (16.94%) > AlogP2 (16.80%) > G2e (15.76%) >RTu+ (14.89%). It should be noted that the difference in the descriptor contribution between any two descriptors used in the model is not significant, indicating that all of the descriptors are indispensable in generating the predictive model (Fig.4).
where s is the scattering angle, nAT is the number of atoms, r ij is the interatomic distance between the i th and the j th atoms, w is an atomic property, including atomic number, masses, van der Waals volumes, Sanderson electronegativities, and polarizabilities. The coefficient of Mor02v is negative, indicating that an increase in Mor02v would result in a decrease in log S values.
Hence, as expected, atomic volumes have a specific effect on the log S values: an increase in Mor02v (or in HATS7v) would result in a decrease in log S values.
The Squared-Ghose-Crippen-Viswanadhan octanol-water partition coefficient (AlogP2) [31,32] is calculated from a regression equation based on the hydrophobic character of the molecule. It reflects both the interactions of the solute with the bulk of the surrounding solvent (macroscopic or non specific solvent effects) and the specific bonding between the solute and individual solvent molecules (microscopic or specific solvent effects). When this descriptor increases, the log S decreases.
Highest occupied molecular orbital energy (E HOMO ) is a measure of the nucleophicity of a molecule. It should explain the differences in the tendency of solutes to take part in the charge transfer interactions, i. e. the ability of electron-donating to water molecules of solute molecules. According to the Koopmans theorem [37], the energy of the HOMO is directly related to the ionization potential IP (-E HOMO = IP), provided that the ionization process is adequately represented by the removal of an electron from an orbital without change in the wave functions of the other electrons. The descriptor and its coefficient in the model are negative, so the contribution of E HOMO is positive.
The importance of the axial shape and symmetry of the molecule on the log S values is apparent due to the presence of G2e. In the calculations Sanderson atomic electronegativity was used for each atom because it may determine, with other atomic properties, the macroscopic properties of a compound. The positive sign of G2e means that the increase in this descriptor decreases the log S. RTu+, as HATS7v, is a GETAWAY descriptor and correlates with the experimental log S values of 0.490. The GETAWAY descriptors [29,30] have been proposed as chemical structure descriptors derived from a new representation of molecular structure, the molecular influence matrix. These descriptors, as based on spatial autocorrelation, encode information on molecular space. Moreover, they are independent of molecule alignment and, to some extent, account also for information on molecular size and shape as well as for specific atomic properties.

Applicability Domain of the MLR Model
Before a QSPR model is put into use for screening compounds, its applicability domain must be defined and predictions for only those compounds that fall in this domain can be considered as reliable. The AD of the MLR model was analyzed in the Williams plot (shown in Fig.5). Clearly observation 35 of the training set with leverage higher than the warning limit of 0.36 is a structurally influential compound. Deleting observation 35 could alter slightly 2 R between the experimental logS values and the selected descriptors to 0.8866 ( 2 Q = 0.8485) and increase the standard error to 0.524, while utilization of a higer energy conformation geometry for this observation alter negatively the calculated model.

CONCLUSION
In this paper, the QSPR method was applied to the prediction of the aqueous solubility of various type of pesticides. A six-parameter linear model was developed by hybrid GA/ MLR approach with 2 R of 88.95 and s of 0.52 log unit for the training set. The selected descriptors express many factors influencing aqueous solubility, to name: molecular size and shape, specific atomic properties, both macroscopic and microscopic effects and tendency of solutes to take part in the charge transfer interactions. Several validation techniques, including leave-one-out crossvalidation and bootstrap, randomization tests, and validation through the test set, illustrated the reliability of the proposed model. All of the descriptors can be directly calculated from the molecular structure of the compound, thus the proposed model is predictive and could be used to estimate the solubility of pesticides. In this case, the applicability domain will serve as a valuable tool to filter out "dissimilar" chemical structures.