QSPR Study of the Retention / Release Property of Odorant Molecules in Water Using Statistical Methods

An integrated approach combining physicochemistry and structure-property relationships was used to study the retention/release of odorant molecules in water. The study aimed to identify the molecular properties (molecular descriptors) that govern this phenomenon, on the assumption that modifying the structure changes the retention/release property of an odorant molecule. ACD/ChemSketch, MarvinSketch, and ChemOffice were used to calculate several molecular descriptors for 51 odorant molecules (15 alcohols, 11 aldehydes, 9 ketones and 16 esters). A total of 37 molecules (about 2/3 of the data set) were placed in the training set used to build the QSPR models, and the remaining 14 molecules (about 1/3 of the data set) constituted the test set. The best descriptors were selected to establish the quantitative structure-property relationship (QSPR) for the retention/release property of odorant molecules in water using multiple linear regression (MLR), multiple non-linear regression (MNLR) and artificial neural network (ANN) methods. We propose a quantitative model based on these analyses. The models were used to predict the retention/release property of the test set compounds, and the agreement between experimental and predicted values was verified. The descriptors identified by the QSPR study were used to study and design new compounds. The statistical results indicate that the predicted values are in good agreement with the experimental results. To validate the predictive power of the resulting models, the external validation multiple correlation coefficient was calculated; the models show both good predictive power and favorable stability.


INTRODUCTION
The concept of food quality commonly includes four criteria: Safety, Health, Flavor and Services. These keywords refer, respectively, to product safety, nutritional value and health, the organoleptic criteria of taste and odor, and all the services associated with the food product. Although safety and health are present in the consumer's mind at the time of purchase, the organoleptic dimension of a product remains essential [1]. To be perceived by the senses, the flavor compounds present in a product must first be released from the food phase. The release of odorant molecules from the solid or liquid food matrix and their passage through the vapor phase is therefore the first step before any perception, which results from the activation of the olfactory receptors in the nasal cavity and of complex neurophysiological events [2].
The retention/release of odorant molecules is a phenomenon that depends primarily on the interactions between the solute and the stationary phase, which include directional, induction and dispersion forces and hydrogen bonding [3]. These forces can be related to the topological structure of the molecules; it is therefore possible to predict solute retention from molecular descriptors.
Theoretically calculated molecular descriptors can be used to construct mathematical models that relate them to molecular properties. In this context, Quantitative Structure Property Relationship (QSPR) modeling [4] seeks a robust and predictive mathematical model that links a response variable to molecular descriptors calculated through molecular modeling methods.
The purpose of this work is to study the retention/release property of odorant molecules in water by varying their chemical class and molecular structure (linear, branched and/or unsaturated) using QSPR chemical modeling methods.

MATERIAL AND METHODS
This QSPR study was carried out to predict and interpret the studied property and to design new compounds using linear and nonlinear methods. It consists of four stages: selection of the data set and generation of molecular descriptors, descriptive analysis, statistical analysis and suggestion of novel compounds.
The methodology used in this QSPR study is as follows (Fig. 1):

Experimental data set
In this study, we selected 51 odorant molecules with properties reported in the literature [5] to provide a diversified set of chemical families (alcohols, aldehydes, ketones and esters) and chemical structures (linear, unsaturated and unsaturated-branched). The fragrant molecules were selected by their structures without taking into account their organoleptic qualities. The list of molecules and the Log(1/K) values are displayed in Table 1.
The retention/release property of the selected odorant molecules was examined in pure water. This property was quantified by the vapor-liquid partition coefficient K, and more precisely by the Log(1/K) values [5].
A total of 37 molecules were placed in the training set used to build the QSPR models, and the remaining 14 molecules constituted the test set. The division was carried out by random selection using the SPSS 19.0 statistical package [6].
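The split itself was made in SPSS 19.0; the following is only a minimal Python sketch of an equivalent random 37/14 partition. The function name `split_train_test`, the seed value and the use of Python's `random` module are assumptions made for illustration and do not reproduce the authors' actual selection.

```python
import random

def split_train_test(n_total=51, n_test=14, seed=0):
    """Randomly partition compound indices into a training and a test set."""
    rng = random.Random(seed)      # the seed is an arbitrary, hypothetical choice
    indices = list(range(n_total))
    rng.shuffle(indices)
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test

train_idx, test_idx = split_train_test()
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the partition reproducible, which matters when several modeling methods are later compared on the same test set.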

Molecular descriptor generation
A wide variety of molecular descriptors were calculated using the ACD/ChemSketch, MarvinSketch and ChemOffice software [7][8][9] in order to examine the correlation between these descriptors and the retention/release property of the molecules studied (Table 2). Table S1 and Table S2 give the values of these descriptors for each molecule studied.

[Table 2: list of the calculated molecular descriptors; surviving entries: IGTC, percent ratios of carbon]

Statistical analysis
In this step, the correlation matrix was used to assess the collinearity between variables (descriptors) and to select the descriptors correlated with the property [10].
Multiple Linear Regression (MLR) was then used to study the relationship between a dependent variable and several independent variables. It minimizes the differences between actual and predicted values, and it was used here to select the descriptors serving as inputs to Multiple Non-Linear Regression (MNLR) and to Artificial Neural Networks (ANN) of the Multi-Layer Perceptron (MLP) and Radial Basis Function (RBF) types. Multiple linear and nonlinear regressions were used to predict the effects on the property; the equations were assessed by the correlation coefficient (r), the mean squared error (MSE), the Fisher value (F) and the significance level (p) [11]. The MLR, MNLR and ANN models were generated using the SPSS 19.0 statistical package [6].
Cross-validation, the most commonly used technique for internal validation, is a statistical technique in which different proportions of chemicals are iteratively held out from the training set used for model development (the step in which optimal model parameters are selected) and "predicted" as new by the developed model in order to verify internal "predictivity". In this work, the Leave-One-Out (LOO) procedure is used: it successively removes one molecule from the training set of 37 molecules. A QSPR model is constructed on the remaining 36 compounds, and the removed molecule is predicted by the model. This procedure is repeated 37 times in order to predict the property of all molecules [12]. Y-randomization, which randomly scrambles the responses, is another internal validation approach that must be used in parallel with cross-validation. It must always be applied to test the significance of the derived QSPR model, because it reveals apparent models obtained only by chance correlation [12]. In this work we performed 100 y-randomization tests for the MLR and MNLR models. In this test, random QSPR models are generated by randomly shuffling the dependent variable while keeping the independent variables unchanged. The new QSPR models are expected to have low $r^2$ and $r^2_{cv}$ values over several trials, which confirms that the developed QSPR models are robust.
The permutation test has proved to be a good tool for detecting trends in the residuals of multivariate regression models. The quality of the permutation test depends on the number of permutations used; a total of 500,000 permutations is enough for reproducible test results [13]. In this work we used the Matlab code for the permutation test algorithm presented in the literature [13]. When the p-value of the test is smaller than the adopted significance level (α = 0.05), the residuals are not random. Otherwise, there are no trends in the residuals [13].
Another useful parameter is the RMSEP (Root Mean Squared Error of Prediction), calculated on the test set. The r and rcv values are good tests for evenly distributed data, but they are not always reliable for unevenly distributed data sets; RMSEP instead provides a more reliable indication of the fitness of the model, independently of the applied splitting. The randomization t-test for comparing the predictive accuracy (RMSEP) of methods is useful in this case. In this work we used the Matlab code for the randomization t-test algorithm presented in the literature [14]. When the p-value of the test is smaller than the adopted significance level (α = 0.005 for 199 randomization trials) [14], the difference between methods is significant.
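As an illustration of these two quantities, the sketch below computes RMSEP as the square root of the mean squared prediction error on the test set and compares two methods with a simple sign-flip randomization test on the differences of squared errors. It is an assumption-laden stand-in and not the Matlab code of reference [14]; the function names, the seed and the toy values are hypothetical.

```python
import math
import random

def rmsep(y_obs, y_pred):
    """Root mean squared error of prediction over a test set."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs))

def randomization_test(y_obs, pred_a, pred_b, n_trials=199, seed=0):
    """Two-sided sign-flip randomization test on squared-error differences."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    d = [(o - a) ** 2 - (o - b) ** 2 for o, a, b in zip(y_obs, pred_a, pred_b)]
    observed = abs(sum(d) / len(d))
    hits = 0
    for _ in range(n_trials):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [x if rng.random() < 0.5 else -x for x in d]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # The observed arrangement counts as one trial, so the smallest
    # attainable p-value is 1 / (n_trials + 1).
    return (hits + 1) / (n_trials + 1)

# Toy values (not data from this study): two identical sets of predictions.
y_obs = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.1, 1.9, 3.2, 3.8]
pred_b = [1.1, 1.9, 3.2, 3.8]
p_same = randomization_test(y_obs, pred_a, pred_b)
```

With 199 trials the smallest attainable p-value is 1/200 = 0.005, which is consistent with the significance level quoted above for that number of trials.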

Data set for analysis
A QSPR study was carried out for a series of 51 odorant molecules, as indicated above, to determine a quantitative relationship between the structure and the property studied. The values of the 26 descriptors are shown in Table S1 and Table S2, and the correlations between these descriptors and the Log(1/K) values are shown in Table 3 as a correlation matrix.

Multiple Linear Regressions (MLR)
The results of the PCA are used to select the input data of the MLR. First, we eliminated all variables (descriptors) whose correlation with the dependent variable (Log(1/K)) is small (not significant, r ≤ 0.3). Then, to reduce the redundancy in our data matrix, among highly inter-correlated descriptors (r ≥ 0.9) those with the lower correlation with the dependent variable were excluded (Table 3).
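The two filtering rules described here (discard descriptors weakly correlated with the property, then discard the weaker member of each highly inter-correlated pair) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `filter_descriptors`, the greedy pairwise order and the toy data are hypothetical, and absolute values of r are used, which the text does not state explicitly.

```python
import numpy as np

def filter_descriptors(X, y, names, low=0.3, high=0.9):
    """Return the descriptor names that survive both correlation filters."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_desc = X.shape[1]
    # |r| between each descriptor and the property.
    r_y = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_desc)])
    keep = [j for j in range(n_desc) if r_y[j] > low]
    # |r| between every pair of descriptors.
    r_xx = np.abs(np.corrcoef(X, rowvar=False))
    dropped = set()
    for a in keep:
        for b in keep:
            if a < b and a not in dropped and b not in dropped and r_xx[a, b] >= high:
                # Drop the member of the pair less correlated with the property.
                dropped.add(a if r_y[a] < r_y[b] else b)
    return [names[j] for j in keep if j not in dropped]

# Toy data (not the descriptors of this study): x2 nearly duplicates x1,
# and x3 is almost uncorrelated with the response.
t = np.arange(40.0)
alt = np.where(t % 2 == 0, 1.0, -1.0)
X_demo = np.column_stack([t, t + 0.1 * np.sin(t), alt])
y_demo = 2.0 * t + 0.5 * alt
kept = filter_descriptors(X_demo, y_demo, ["x1", "x2", "x3"])
```

In the toy example only one of the two near-duplicate descriptors is retained, and the uncorrelated descriptor is removed by the first filter.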
The VIF (Variance Inflation Factor) is defined as $1/(1 - r^2)$, where r is the multiple correlation coefficient of one independent variable regressed against all other descriptors in the model. Models with a VIF greater than 5 were considered unstable and were eliminated; models with VIF values between 1 and 4 may be accepted.
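A short sketch of this definition: each descriptor is regressed on the remaining ones (with an intercept), and the resulting $r^2$ gives VIF = $1/(1 - r^2)$. The function name `vif` and the small demonstration matrices are assumptions for illustration; the study itself obtained these values from its statistical package.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column of the descriptor matrix X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        # Regress descriptor j on all the other descriptors plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Three mutually orthogonal, zero-mean toy columns: no inflation expected.
a = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
b = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
c = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
vif_independent = vif(np.column_stack([a, b, c]))
# Replacing the third column by a near-combination of the first two
# pushes every VIF far above the threshold of 5.
vif_collinear = vif(np.column_stack([a, b, a + b + 0.1 * c]))
```

A descriptor whose VIF exceeds 5 in such a check would be a candidate for removal under the rule stated above.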
At this stage, VIF values greater than 5 were found. To improve the results, among the highly correlated descriptors (r ≥ 0.8) those with the lower correlation with the dependent variable were eliminated (Table 4). The relationship obtained using this method corresponds to a linear combination of the following descriptors: heat of formation (H°), Henry's law constant (KH), index of refraction (n) and Balaban index (J). In Equation (1), N is the number of compounds, r is the correlation coefficient, $r^2$ is the coefficient of determination, MSE is the mean squared error, F is the Fisher's criterion and P is the significance level.
The correlation coefficient r is very high and the mean squared error (MSE) is low, which indicates that the model is reliable. A P value much smaller than 0.05 indicates that the regression equation is statistically significant; we can therefore conclude with confidence that the model provides a significant amount of information [15].
The predicted Log(1/K) values calculated from Equation (1) are given in Table 5 in comparison to the observed values.
The correlation between the predicted and observed Log(1/K) values and the residual values are shown in Fig. 2.
The residuals should not show any trend, since a trend would indicate that the residuals are not independent. In the permutation test, the MLR model gave a p-value of 0.4999, above the significance level of 0.05; the residuals of the MLR model were therefore random.
The descriptors proposed in Equation (1) by MLR are therefore used as input parameters in the multiple non-linear regression (MNLR) and the artificial neural networks (ANN) [16].

Multiple Non-Linear Regression (MNLR)
We also used the multiple non-linear regression technique to quantitatively improve the structure-property relationship by accounting for several parameters. MNLR is the most commonly used tool for the study of multidimensional data. We applied it to the data matrix constituted from the descriptors proposed by the MLR.

Artificial Neural Networks (ANN)
To increase the probability of a good characterization of the molecules studied, Artificial Neural Networks (ANN) were used to generate a predictive model of the QSPR relationship between the descriptors obtained from the MLR and the observed property.
In this study, we used two types of artificial neural networks: Multi-Layer Perceptron (MLP) and Radial Basis Function Networks (RBFs).

Multi-Layer Perceptron (MLP):
ANN models have aroused great interest because, as universal function approximators, they are capable of mapping any linear or nonlinear function. The multi-layer perceptron (MLP) neural network model is a supervised neural network based on the original simple perceptron model, with back propagation used to train the network. It commonly consists of an input layer of source nodes, an output layer and one or more hidden layers of computation nodes (neurons) that increase the learning power of the MLP model. The number of hidden neurons determines the learning capacity of the MLP network. It is recommended to select the network that performs best with the smallest possible number of hidden neurons.
The property model computed by the MLP method was developed using the properties of several molecules studied (Fig. 4). The correlation between the predicted and observed Log(1/K) values and the residual values are shown in Fig. 5. In the permutation test, the MLP model gave a p-value of 0.2856, above the significance level of 0.05; the residuals of the MLP model were therefore random.
The predicted Log(1/K) values calculated by the MLP method are given in Table 5 in comparison to the observed values.
Radial Basis Function Networks (RBFs):
RBF neural networks are neural networks based on localized basis functions and iterative function approximation. In terms of structure, an RBF network is composed of three layers, namely an input layer, a hidden layer and an output layer (see Fig. 6). These networks have a fixed architecture with a single hidden layer, whereas an MLP may have more than one hidden layer. Indeed, an RBF network represents a special case of an MLP [18]. Owing to their simple design, strong tolerance to input noise, and fast training, these networks have attracted a great deal of attention. In an RBF network, no processing is undertaken in the input layer. The hidden layer contains the radial basis functions, and the output layer linearly combines the outputs of the hidden-layer neurons to generate the network output. Compared to MLP networks, this type of network requires a larger number of neurons but can be designed more quickly, the principal distinction being the activation functions used by the neurons [19]. The property model computed by the RBF method was developed using the properties of several molecules studied (Fig. 6). The correlation between the predicted and observed property and the residual values are illustrated in Fig. 7. In the permutation test, the RBF model gave a p-value of 0.4512, above the significance level of 0.05; the residuals of the RBF model were therefore random.
The predicted Log(1/K) values calculated by the RBF method are given in Table 5 in comparison to the observed values.
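The three-layer structure described for RBF networks (pass-through input layer, hidden layer of radial basis functions, linear output layer) can be sketched with Gaussian basis functions and a least-squares output layer. This is a generic illustration and not the SPSS procedure used in the study: the Gaussian form, the fixed centers and width, the function names and the one-dimensional toy data are all assumptions.

```python
import numpy as np

def rbf_design(X, centers, width):
    """Hidden-layer outputs (Gaussian basis functions) plus a bias column."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-d2 / (2.0 * width ** 2))
    return np.column_stack([np.ones(len(X)), phi])

def fit_rbf(X, y, centers, width):
    """Output layer: linear combination of hidden outputs, fitted by least squares."""
    weights, *_ = np.linalg.lstsq(rbf_design(X, centers, width), y, rcond=None)
    return weights

def predict_rbf(X, centers, width, weights):
    return rbf_design(X, centers, width) @ weights

# Toy one-dimensional problem (not data from this study), nine hidden neurons.
x = np.linspace(0.0, 1.0, 30)[:, None]
y = np.sin(2.0 * np.pi * x[:, 0])
centers = np.linspace(0.0, 1.0, 9)[:, None]
weights = fit_rbf(x, y, centers, width=0.15)
rmse = np.sqrt(np.mean((predict_rbf(x, centers, 0.15, weights) - y) ** 2))
```

Because only the output weights are fitted, training reduces to one linear least-squares problem, which illustrates why such networks are described as fast to train.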

Cross-Validation:
The cross-validation statistical procedure can be used to evaluate the predictive power of QSPR models. The Leave-One-Out (LOO) procedure successively removes one molecule from the training set of n molecules. A QSPR model is constructed on the remaining n-1 compounds, and the removed molecule is predicted by the model. This procedure is repeated n times in order to predict the property of all molecules.
The QSPR model expressed by the equations of the MLR and MNLR methods is validated by its appreciable $r^2_{cv}$ values (Table 6) obtained using the LOO procedure. An $r^2_{cv}$ value greater than 0.5 is the basic condition for qualifying a QSPR model as valid. We use cross-validation as an internal test of the quality of the MLR and MNLR models. The performance of the models was good, with $r^2_{cv}$ values of 0.846 for the MLR and 0.860 for the MNLR method (Table 6).
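The LOO procedure described above can be sketched for an MLR model as follows. The sketch assumes the common definition $r^2_{cv} = 1 - \mathrm{PRESS}/SS_{tot}$, which the text does not spell out, and it uses synthetic data of the same shape as the training set (37 compounds, 4 descriptors); the function names, the seed and the data are hypothetical.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def loo_r2cv(X, y):
    """Leave-one-out cross-validated r^2, assumed here to be 1 - PRESS / SS_tot."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i               # leave compound i out
        coef = fit_mlr(X[mask], y[mask])       # model built on n-1 compounds
        preds[i] = predict_mlr(coef, X[i:i + 1])[0]
    press = np.sum((y - preds) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic example with 37 "compounds" and 4 "descriptors".
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(37, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=37)
q2 = loo_r2cv(X_demo, y_demo)
```

Each of the 37 predictions comes from a model that never saw the predicted compound, which is what makes the resulting value an internal estimate of predictivity rather than of fit.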

y-Randomization test:
To ensure that the developed QSPR model is robust and not the result of chance, the y-randomization test was performed on the training set data as recommended [20]. In this test, MLR and MNLR models are generated by randomly scrambling the dependent variable (property data) while keeping the independent variables (descriptors) unchanged. The resulting models are expected to have low $r^2$ and cross-validated $r^2_{cv}$ values over several trials, which confirms that the developed models are robust. We performed 100 y-randomization tests and observed that, for all the models, the values of $r^2$ and $r^2_{cv}$ were below 0.5 (Fig. 8). This test confirms that the developed models are robust and not derived merely by chance.
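A minimal sketch of the y-randomization idea for an MLR model: the response is shuffled 100 times, the model is refitted each time, and the resulting $r^2$ values are compared with that of the unshuffled model. The data are synthetic and the function names and seed are assumptions; only $r^2$ is shown, not $r^2_{cv}$.

```python
import numpy as np

def r2_mlr(X, y):
    """Coefficient of determination of an ordinary least-squares fit."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_trials=100, seed=0):
    """r^2 of models refitted on randomly scrambled responses."""
    rng = np.random.default_rng(seed)  # hypothetical seed
    return np.array([r2_mlr(X, rng.permutation(y)) for _ in range(n_trials)])

# Synthetic example with 37 "compounds" and 4 "descriptors".
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(37, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=37)
r2_true = r2_mlr(X_demo, y_demo)
r2_scrambled = y_randomization(X_demo, y_demo)
```

If the scrambled models reached $r^2$ values close to that of the real model, the original correlation would be suspected of arising by chance.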

External Validation
To estimate the predictive power of the MLR, MNLR and ANN (MLP and RBF types) models, we must use a set of compounds that were not used in the training set to establish the QSPR model. The models established on the training set of 37 odorant molecules are used to predict the property of the remaining 14 molecules. The main performance parameters for the four models are shown in Table 7.
The results obtained by the MLR, MNLR and ANN (MLP and RBF types) models are sufficient to establish the performance of the models, as confirmed by the test on the 14 compounds.
A comparison of the MLR, MNLR and ANN (MLP and RBF types) models shows that all four approaches have good predictive capability. MLR, MNLR and ANN were all able to establish a satisfactory relationship between the molecular descriptors and the retention/release property of the studied compounds. The MLR method yielded the smallest RMSEP, but the randomization t-test comparing the prediction accuracy of the four methods shows that the difference between MLR and the other methods (MNLR and ANN (MLP and RBF types)) is only indicative (p = 0.01 for 199 randomization trials, so p > α = 0.005). The four methods therefore do not differ significantly in prediction accuracy.

Figure 8. y-Randomization plots of the MLR and MNLR models (panels: MLR Model, MNLR Model).

Domain of applicability
Evaluating the applicability domain of a QSPR model is an important step in establishing that the model makes reliable predictions within the chemical space for which it was developed [21]. Several methods exist for defining the applicability domain of a QSPR model; in this study we used the most common one, the leverage approach [22]. The leverage $h_i$ of a given compound is computed from its descriptor vector and the descriptor matrix of the training set, and it is compared with a warning leverage $h^*$; the expressions for $h_i$ and $h^*$ are given with Fig. 9. From the Williams plot (Fig. 9), all compounds in the test set fall inside the domain of the MLR model (the warning leverage limit is 0.405). For all compounds in the training and test sets, the standardized residuals are smaller than three standard deviation units (±3σ). Therefore, the retention/release property predicted by the developed MLR model is reliable.
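The leverage approach can be sketched from the definitions given in this paper, $h_i = x_i^T (X^T X)^{-1} x_i$ and $h^* = 3(p + 1)/n$. With p = 4 descriptors and n = 37 training compounds, $h^* = 15/37 \approx 0.405$, the limit quoted above. The descriptor matrices below are synthetic, the function names are hypothetical, and whether X carries a column of ones for the intercept is not stated in the text, so the sketch uses the descriptors only.

```python
import numpy as np

def leverages(X_train, X_query):
    """h_i = x_i^T (X^T X)^-1 x_i for every row x_i of X_query."""
    Xt = np.asarray(X_train, dtype=float)
    Xq = np.asarray(X_query, dtype=float)
    xtx_inv = np.linalg.inv(Xt.T @ Xt)
    return np.einsum("ij,jk,ik->i", Xq, xtx_inv, Xq)

def warning_leverage(n_train, n_descriptors):
    """h* = 3 (p + 1) / n."""
    return 3.0 * (n_descriptors + 1) / n_train

h_star = warning_leverage(37, 4)
print(round(h_star, 3))  # 0.405

# Synthetic descriptor matrices (not the descriptors of this study).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(37, 4))
X_test = rng.normal(size=(14, 4))
h_train = leverages(X_train, X_train)
h_test = leverages(X_train, X_test)
outside_domain = h_test > h_star  # compounds a Williams plot would flag
```

In a Williams plot these leverages are plotted against the standardized residuals, so that structurally distant compounds (high leverage) and poorly predicted compounds (large residual) can be read off the same figure.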

Proposed novel compounds
QSPR correlates property data with the physicochemical properties of a group of compounds. It has frequently been used to predict the properties of new compounds and to design compounds with desired properties.
The developed Equation (1) can be used to design new odorant molecule derivatives with improved retention/release property (Log(1/K)).
Comparing the t-test and standardized coefficient values of the descriptors (Table 8) indicates that the influence of Henry's law constant KH on Log(1/K) is stronger than that of the other descriptors. The results show that the retention of an odorant molecule is increased by increasing its Henry's law constant KH, and that its release is increased by decreasing KH. New compounds were therefore designed by adding suitable substituents, and their property was calculated using Equation (1).
The structures of the designed compounds, their parameter values calculated by the same methods, and the Log(1/K) values theoretically predicted by the MLR model (Equation (1)) are listed in Table 9.
From the predicted properties, the designed compounds X1, X2, X3 and X5 have higher Log(1/K) values than the existing compounds. The designed compounds Y1, Y2, Y3, Y4, Y5, Y6, Y7, Y8 and Y9 have lower Log(1/K) values than the existing compounds among the 51 studied compounds (Table 1).
The leverage values (h) calculated for the newly designed compounds with the MLR model of Equation (1) are displayed in Table 9. Only three compounds, X2, X3 and Y6, are identified as outliers and are consequently not considered, because their leverage is greater than h* (h* = 0.405) [23].

CONCLUSION
Multiple linear and non-linear regression and artificial neural networks (MLP and RBF types) were used to construct quantitative structure-property relationship models of odorant molecules for their retention/release property. The results show that the models proposed in this paper can predict the retention/release property accurately and that the selected parameters are pertinent. The accuracy and predictability of the proposed models were illustrated by comparing the key statistical terms r or $r^2$, and the predictive power of the equations was validated by internal tests (cross-validation and 100 y-randomization tests) and by an external test set.
All the models used have good predictive capability, but MLR gives the most readily interpretable results. The applicability domain of the MLR model was defined.
We conclude that the most important finding of this research is that we were able to design and propose new compounds with higher or lower property values than the existing ones by adding suitable substituents and calculating their property using the regression equation. Consequently, the proposed models will reduce the time and cost of synthesis and of determining the retention/release property of odorant molecules.

ACKNOWLEDGMENTS
We are grateful to the "Association Marocaine des Chimistes Théoriciens" (AMCT) for its pertinent help concerning the programs.

Figure 1. Flow chart of the methodology used in this work.

Figure 2. Graphical representation of calculated and observed property and the residual values calculated by MLR (training set in blue; test set in red).

Figure 3. Graphical representation of calculated and observed property and the residual values calculated by MNLR (training set in blue; test set in red).

Figure 4. The architecture of the MLP method used (four input variables, one neuron in the hidden layer and one neuron in the output layer).

Figure 5. Graphical representation of calculated and observed property and the residual values calculated by the MLP method (training set in blue; test set in red).

Figure 6. The architecture of the RBF method used (four input variables, nine neurons in the hidden layer and one neuron in the output layer).

Figure 7. Graphical representation of calculated and observed property and the residual values calculated by the RBF method (training set in blue; test set in red).
The leverage $h_i$ of a compound is:
$$h_i = x_i^{T} (X^{T} X)^{-1} x_i \quad (i = 1, \ldots, n)$$
where $x_i$ is the descriptor vector (row of the descriptor matrix) of the query compound and $X$ is the descriptor matrix of the training set compounds used to develop the model. As a prediction tool, the warning leverage $h^*$ is defined as:
$$h^* = \frac{3(p + 1)}{n}$$
where $n$ is the number of training compounds and $p$ is the number of descriptors in the model.

Figure 9. Williams plot to evaluate the applicability domain of the MLR model.

Table 1. List of aroma compounds.
a Test set.

Table 3. Matrix of correlation.

Table 5. Comparison of the observed values with those calculated by the MLR, MNLR and ANN (MLP and RBF types) methods.
a Test set.

Table 6. $r^2_{cv}$ values obtained by the leave-one-out (LOO) method.

Table 7. Comparison of the MLR, MNLR and ANN (MLP and RBF types) models.
Equation (1) of the MLR method indicates a positive correlation with Henry's law constant KH.

Table 9. Values of descriptors, retention/release property (Log(1/K)), and leverages (h) for the newly designed compounds.