QSPR Modeling of Odor Threshold of Aliphatic Alcohols Using Extended Topochemical Atom (ETA) Indices

The present work establishes a quantitative structure-property relationship (QSPR) between topochemical features and the odor threshold (OT) of aliphatic alcohols. A data set of 53 aliphatic alcohols was analyzed using different chemometric techniques, among which genetic function approximation with the spline option (GFA-spline) gave the most acceptable results in terms of internal and external validation metrics. The extended topochemical atom (ETA) indices, developed by the present authors’ group, were considered as descriptors for model development. Selected non-ETA descriptors were also tried. The models with ETA indices surpassed the models developed using other descriptors in predictive ability. The final model suggests that molecular branching and electronic parameters significantly influence the odor potency of the molecules. Additionally, increased lipophilicity and reduced electronegativity increase the odorant property. The model may be used to predict the odor threshold of untested aliphatic alcohols that fall within its applicability domain. (doi: 10.5562/cca2284)


INTRODUCTION
Odorant compounds constitute a significant portion of organic chemistry. Those present in the environment can be identified in air by their characteristic odor. The olfactory threshold is a key feature of all odor-active compounds, and its value may differ with the measurement protocol. The odor threshold, a physicochemical property, can be defined as the lowest concentration of an airborne chemical that is perceived by half of the healthy individuals tested. 1 However, two chemicals having the same odor threshold may not produce the same level of annoyance in the surroundings, as this depends on the type of odor of those chemicals. 2 This points to a complex mechanism of action of the odorant receptors (ORs). An odorous molecule present in the environment is thought to bind to a number of ORs at a time. Thus, the ambiguous nature of odorant receptors, along with the varied characteristics of olfactory data, has increased the need for odor threshold data of various compounds, which have wide application in bioscience, food chemistry and environmental pollution. 3–5 Regarding the mechanism of odorant binding, it was earlier proposed that odorants with similar properties activate common receptor subtypes. This was later shown to be wrong, as homologous oxygenated aliphatic molecules with similar molecular properties do not share a similar odor quality. It was therefore proposed that the olfactory mechanism lies in the combinatorial effects of different types of receptors. 6 The majority of mammalian olfactory receptors were found to belong to class A of the G-protein coupled receptor (GPCR) superfamily, although a convincing explanation for the anomalous behavior of these olfactory receptors is still lacking. 6
Thus, much current research has been directed towards the development of models that predict the odor threshold of compounds and help in understanding their binding possibilities while avoiding time-consuming and costly experiments. In this respect, in-silico prediction of the olfactory threshold using the quantitative structure-property relationship (QSPR) 7 technique is of particular interest. It is a method by which the structural information of a chemical compound is correlated with its property value. This involves extracting chemical information in the form of descriptors and correlating them with the property values of individual compounds to give a predictive mathematical equation. 8 This computational technique makes it easier to identify the structural fragments that alter the physicochemical properties of compounds, which in turn helps to design new potential molecules with low odor threshold. Moreover, the developed predictive models can assist in screening potent odorant moieties from large databases of compounds, which reduces the need for time-consuming synthesis and testing of a large number of odorous compounds. The QSPR paradigm is now supported by the Registration, Evaluation, and Authorization of Chemicals (REACH) norms, 9 a legislative initiative of the European Commission, and also by the Organisation for Economic Co-operation and Development (OECD). 10,11 In-silico predictive models of this type are also used by the Food and Drug Administration (FDA) 12 to minimize the rate of false negatives and false positives, saving considerable costs for manufacturers.
The Council for International Organizations of Medical Sciences (CIOMS) 13 also recommends methods such as in silico mathematical models, computer simulation, and in vitro biological systems before animal experiments for the advancement of biological knowledge. However, all QSPR/QSAR 14 (activity)/QSTR 15 (toxicity) models should be sufficiently validated before they are applied to the prediction of new data.
The QSPR/QSAR approach has been widely used by various research groups for successful prediction of odor potency. Luan et al. 16 established that the support vector machine (SVM) can be an effective tool in QSAR studies for developing classification-based models of fragrance properties. In that work, 91 organic compounds were used to build both linear and non-linear models; the non-linear model (SVM) showed better predictability than the linear model developed using linear discriminant analysis. The QSAR approach was also taken up by Du et al., 17 who used 64 volatile organic compounds to predict odor detection thresholds and nasal pungency thresholds (NPTs) for the olfaction and nasal trigeminal chemosensory systems. The best model was developed using the local lazy regression method, which proved to be effective even when the experimental property values are unknown.
In the present study, QSPR analysis was carried out to establish a relationship between the odor threshold (OT) data of 53 aliphatic alcohols and their structural attributes. Different types of descriptors were calculated for the purpose, but a very simple class of 2D descriptors proved to be the most important for prediction of OT values. These 2D descriptors belong to the extended topochemical atom (ETA) 18,19 indices, which have previously been shown to be effective in predicting other properties such as solubility 20 and CMC 21 values. The calculation of ETA parameters does not require computationally exhaustive conformational analysis or alignment procedures. Thus, less computational time is required for calculating these variables than for complex 3D descriptors. The first generation ETA descriptors were developed based on TAU descriptors representing the valence electron mobile (VEM) environment. 22,23 The development history and formalism of the first generation ETA indices have been detailed in a book chapter by Roy and Das. 19 These indices provide information regarding the electronic features, size, shape, branching, and functionality of molecules. Moreover, the second generation variables can describe the electron richness, unsaturation, polar surface area and hydrogen-bonding ability of a given molecule. The first and second generation ETA indices are now available in PaDEL-Descriptor (version 2.11), 24 an open source software available at http://padel.nus.edu.sg/software/padeldescriptor.

The Dataset
The dataset for the present QSPR study was collected from a report by Anker et al. 25 The negative logarithm of the average of the highest and lowest observed odor thresholds of each compound, log(1/T), was taken as the response variable for the current analysis. In total, the dataset comprises 53 aliphatic alcohols, the odor thresholds of which are expressed in mol/L. The list of compounds is given in Table 1.
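The response variable described above can be sketched as follows. This is a minimal illustration, assuming that "average" means the arithmetic mean and that the logarithm is base 10; the function name and the example thresholds are hypothetical and are not taken from Table 1.

```python
import math

def response_from_thresholds(ot_low, ot_high):
    """Return log(1/T), with T the mean of the lowest and highest
    reported odor thresholds (both in mol/L).

    Assumption: arithmetic mean and base-10 logarithm.
    """
    if ot_low <= 0 or ot_high <= 0:
        raise ValueError("thresholds must be positive concentrations")
    t_mean = (ot_low + ot_high) / 2.0
    return -math.log10(t_mean)

# Hypothetical thresholds, not values from the paper.
example = response_from_thresholds(1.0e-6, 3.0e-6)
```

A lower threshold (a more potent odorant) gives a larger log(1/T), which is why the discussion treats high response values as high odor potency.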

Model Development
To generate reliable QSPR models based on odor threshold data, descriptors that contribute considerably to modulating the physicochemical property concerned were first calculated. The independent variables comprised descriptors from the PaDEL-Descriptor, 24 Cerius2 26 and Dragon 27 software platforms. A total of 107 descriptors were finally considered after elimination of highly correlated variables and of those whose variance was less than 0.0001. The pool of descriptors used for model development is shown in Table S1 in the Supplementary Materials section. The non-ETA descriptors 28 include topological, structural, physicochemical, electronic and spatial types, whereas the ETA descriptors include both first and second generation variables. All the ETA parameters are further described in Table S2 in the Supplementary Materials section. The compounds were divided into two sets: a training set of 42 compounds and a test set of 11 compounds (size ratio of about 4:1). Division of the whole dataset into training and test sets is an important step in model development, since the quality of a QSPR model depends strongly on the selection of these sets. 29 In the present study, the k-means clustering technique, 30 available in the SPSS software, 31 was employed for splitting the dataset. Five clusters were generated according to the features of the respective compounds. The 42 training compounds were selected from across the clusters so that the training set encompasses the entire range of chemical space of the whole dataset. Figure S1 in the Supplementary Materials shows the plot of the first three principal components of the variables and indicates that each test set compound lies in close vicinity of at least one training compound. 32,33
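The descriptor pre-filtering step mentioned above (removing near-constant and highly inter-correlated variables) could look like the sketch below. The variance cutoff of 0.0001 comes from the text; the correlation cutoff of 0.95, the use of the sample variance, and all function names are assumptions made only for illustration.

```python
def variance(xs):
    """Sample variance (n - 1 denominator); an assumption, the text does not say which."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def prefilter(descriptors, min_var=1e-4, max_abs_r=0.95):
    """Keep descriptor names that pass both filters.

    descriptors: dict mapping name -> list of values (one per compound).
    max_abs_r is a hypothetical cutoff; the paper does not state one.
    """
    kept = []
    for name, values in descriptors.items():
        if variance(values) < min_var:
            continue  # near-constant descriptor
        if any(abs(pearson_r(values, descriptors[k])) > max_abs_r for k in kept):
            continue  # redundant with an already retained descriptor
        kept.append(name)
    return kept

# Toy descriptor matrix, not data from the paper.
toy = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.0, 4.0, 6.0, 8.1],   # nearly collinear with d1
    "d3": [0.5, 0.5, 0.5, 0.5],   # constant
    "d4": [1.0, -1.0, 1.0, -1.0],
}
selected = prefilter(toy)
```

The filter is order dependent: of two correlated descriptors, the one listed first is retained. A real workflow might instead keep the one that correlates better with the response.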

Chemometric Tools Employed for Model Development
The training set was used for model development and the test set for subsequent model validation. At first, the pool of independent variables comprising both first and second generation ETA descriptors was used. Different algorithms were applied for model development while keeping the division into training and test set compounds unaltered. These include the GFA-MLR 34 (genetic function approximation followed by multiple linear regression) and G/PLS 35 (genetic/partial least squares) methodologies. Both linear and spline options were considered for each method. For the GFA-MLR models, the best model was selected based on the lowest LOF 36 (lack-of-fit) score using 5000 crossovers. The G/PLS models were derived in the Cerius2 software at 1000 iterations using scaled variables. The smoothness parameter was kept at a value of 1.0. Further, the training set compounds were used to generate QSPR models with the non-ETA parameters, using the same algorithms. Lastly, the whole descriptor pool was employed for the third category of model development, in which both ETA and non-ETA variables were considered. It included a total of 107 descriptors.
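The LOF score used to rank the GFA models is not defined in the text. The sketch below assumes the form commonly attributed to Friedman, LOF = LSE / (1 − (c + d·p)/M)², where LSE is the mean squared residual, c the number of basis functions, d the smoothness parameter, p the number of descriptors in the model and M the number of training compounds. The function and argument names are illustrative, and counting the intercept as a basis function (c = 4) is an assumption.

```python
def friedman_lof(lse, c, d, p, m):
    """Lack-of-fit score, assuming LOF = LSE / (1 - (c + d*p)/m)**2.

    lse: mean squared residual of the fitted model
    c:   number of basis functions (assumed here to include the intercept)
    d:   smoothness parameter
    p:   number of descriptors contained in the model
    m:   number of training compounds
    """
    penalty = 1.0 - (c + d * p) / m
    if penalty <= 0:
        raise ValueError("model too complex for the number of samples")
    return lse / penalty ** 2

# Mean squared residual derived from values reported for the final model:
# s = 0.463 at df = 38, with 42 training compounds.
lse_final = 0.463 ** 2 * 38 / 42
lof_example = friedman_lof(lse_final, c=4, d=1.0, p=3, m=42)
```

Under these assumptions the score comes out close to the LOF of 0.279 reported for the final model, which is consistent with, but does not confirm, this reading of the formula. The penalty term grows with the number of terms, so a model that only fits better by adding descriptors is not automatically preferred.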

Validation of the QSPR Models
All the final QSPR models, developed using three sets of descriptors (ETA, non-ETA, and combined ETA and non-ETA), were selected based on significant values of different statistical parameters 37 such as the determination coefficient ($R^2$), explained variance ($R_a^2$) and variance ratio ($F$) at specified degrees of freedom (df), which describe the quality of the model (the $F$ value should be significant at p < 0.01). The standard errors of all regression coefficients should be sufficiently low so that the corresponding t values are significant at p < 0.01. The error associated with model development can also be judged from the standard error of estimate ($s$) and rmse values. The definitions of the statistical metrics for equation quality are given in the Supplementary Materials section.
All the developed models were subjected to extensive statistical validation involving internal, external and overall strategies, thereby complying with the OECD guidelines. For internal validation of the QSPR models, the leave-one-out (LOO) cross-validation technique was employed, the results being expressed as the cross-validated squared correlation coefficient ($Q^2$ or $Q^2_{(LOO)}$). The $Q^2$ value reflects the internal predictivity of the model, since it is calculated from the LOO-predicted values of the training set compounds generated during each cross-validation cycle. Furthermore, the $r_m^2$ metrics 38 were calculated to describe the internal (LOO) validation. Equations (1) and (2) give the formulae for computing the $r_m^2$ metrics.
$\overline{r_m^2} = (r_m^2 + r_m'^2)/2$   (1)
$\Delta r_m^2 = |r_m^2 - r_m'^2|$   (2)
in which $r_m^2 = r^2 (1 - \sqrt{r^2 - r_0^2})$ and $r_m'^2 = r^2 (1 - \sqrt{r^2 - r_0'^2})$. Here $r^2$ represents the squared correlation coefficient between the observed and predicted (LOO-predicted) data of the compounds with intercept (i.e., for the regression line, observed = slope × predicted + intercept), while $r_0^2$ and $r_0'^2$ are the corresponding values without intercept, the latter obtained with the axes interchanged.
The test set compounds were employed to check the predictive ability of the model and thus to verify its external predictive potential. The resultant $R^2_{pred}$ 39 value indicates the predictability of the developed model for the odor threshold values of similar untested compounds. The $r_m^2$ metrics were also applied for external validation. 40 The same metrics, calculated for the whole data set, signify the overall performance of the deduced models.
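The $r_m^2$ metrics referred to above can be computed as in the following sketch. It assumes the commonly used definitions $r_m^2 = r^2(1 - \sqrt{|r^2 - r_0^2|})$ and its counterpart with the axes interchanged; the absolute value guards against small negative differences and is an implementation choice, and all function names are illustrative.

```python
import math

def _pearson_r2(xs, ys):
    """Squared Pearson correlation (regression with intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def _r0_squared(y, x):
    """Determination coefficient of y regressed on x through the origin."""
    k = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - k * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot

def rm2_metrics(observed, predicted):
    """Return (average rm2, delta rm2) for observed vs. predicted values."""
    r2 = _pearson_r2(observed, predicted)
    r0 = _r0_squared(observed, predicted)        # observed on the y axis
    r0_prime = _r0_squared(predicted, observed)  # axes interchanged
    rm2 = r2 * (1.0 - math.sqrt(abs(r2 - r0)))
    rm2_prime = r2 * (1.0 - math.sqrt(abs(r2 - r0_prime)))
    return (rm2 + rm2_prime) / 2.0, abs(rm2 - rm2_prime)

# Toy values, not data from the paper.
avg_rm2, delta_rm2 = rm2_metrics([1.0, 2.0, 3.0, 4.0, 5.0],
                                 [1.1, 1.9, 3.2, 3.8, 5.0])
```

For LOO-predicted training values this gives the internal metrics, for test set predictions the external ones, and for the pooled data the overall ones.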
Additionally, the robustness of the models was ascertained by the Y-randomization test 41 available in the Cerius2 software. Here, many models were developed after randomizing the values of the dependent variable while keeping the descriptor matrix intact. A QSPR model is said to be robust if the $R^2$ of the non-random model is greater than the squared average correlation coefficient ($R_r^2$) of the randomized models. In the present study, both process and model randomization tests were performed for the final model developed using only the ETA variables, at 90 % and 99 % confidence levels respectively. Finally, an additional metric, $^cR_p^2$, was calculated using the following formula (Equation 3), which shows the reliability of the model and of the process by which it was established. 42
$^cR_p^2 = R \sqrt{R^2 - R_r^2}$   (3)
According to principle 3 of the OECD 10,11 principles for QSAR model development, the applicability domain of a QSPR model must also be well defined, since a single in-silico predictive model cannot be universally accepted for all types of compounds. The domain of applicability is a theoretical space covering the model descriptors and response variables of the training set. In this study, the domain of applicability was determined following the leverage approach (Williams plot). 43 The plot has leverage values ($h$) on the x axis and standardized residual values on the y axis. The leverage ($h$) of a compound in the original variable space is calculated from the hat matrix $H = X (X^T X)^{-1} X^T$, where $H$ is an ($n \times n$) matrix that orthogonally projects vectors into the space spanned by the columns of $X$; the leverage of compound $i$ is the diagonal element $h_{ii}$. The leverage values of all the compounds were calculated using the Statistica software 44 and help to determine whether a compound fits in the applicability domain of the model or not.
Here, the critical leverage value, $h^*$, was calculated using the formula $h^* = 3(p + 1)/n$, and the standardized residual limit for the boundary of the applicability domain was set at ±2.5σ.
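The leverage approach described above can be sketched in plain Python as follows. The example computes the diagonal of the hat matrix $X (X^T X)^{-1} X^T$ and the critical leverage $h^* = 3(p + 1)/n$. The helper names and the toy design matrix are hypothetical, and the paper itself used the Statistica software for this step.

```python
def _invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(a)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def leverages(X):
    """Diagonal of H = X (X^T X)^-1 X^T.

    X is a list of rows; include a leading 1.0 in each row for the intercept.
    """
    n, k = len(X), len(X[0])
    xtx = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
           for i in range(k)]
    inv = _invert(xtx)
    return [sum(X[r][i] * inv[i][j] * X[r][j]
                for i in range(k) for j in range(k))
            for r in range(n)]

def critical_leverage(p, n):
    """h* = 3(p + 1)/n for p descriptors and n training compounds."""
    return 3.0 * (p + 1) / n

def in_domain(h, std_residual, h_star, residual_limit=2.5):
    """True if a compound lies inside the Williams plot boundaries."""
    return h <= h_star and abs(std_residual) <= residual_limit

# Toy design matrix (intercept column plus one descriptor), not paper data.
toy_X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
toy_h = leverages(toy_X)
h_star_final = critical_leverage(p=3, n=42)
```

For three descriptors and 42 training compounds the critical value evaluates to about 0.286. The leverages of a data set sum to the number of columns of $X$, which gives a quick sanity check on any implementation.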

RESULTS AND DISCUSSION
Diverse models were developed using two chemometric tools, namely the GFA and G/PLS algorithms, employing three different sets of descriptors (ETA, non-ETA, and the combination of ETA and non-ETA). In the present study, four sets of models were developed for each descriptor set. A comparison among the models is given briefly in Table 2. Most of the models showed encouraging statistical parameters, supporting the reliability of the models and of the process of their development. The equation (Equation 4, model 2; see below) with the best prediction ability [with respect to both internal and external validation measures ($Q^2$ = 0.778, $R^2_{pred}$ = 0.813)] was selected as the best model for the odor threshold values. From Table 3 it can be noted that the 2D ETA descriptors play a significant role in correlating the property value (log(1/T)) with the structural features of each molecule.
Croat. Chem. Acta 87 (2014) 29.
Although good quality models were developed from non-ETA variables, QSPR models with improved prediction power were obtained when ETA parameters were added to the descriptor matrix. For example, model 5 (Table 3) was developed using the GFA-linear algorithm. All the non-ETA descriptors were used for model development, among which the lipophilic factor Log P and the 3D descriptor Jurs_WPSA_2 were selected for the generation of the model. The corresponding $R^2$ and $Q^2$ values were 0.784 and 0.736. When an ETA parameter, $\eta_F^{local}$, was introduced in the model (Equation 12) using the same algorithm and the same division of the dataset, the $R^2$ and $Q^2$ values increased to 0.806 and 0.767 respectively. For models with non-ETA parameters, the $R^2$ values were satisfactory but the $R^2_{pred}$ values were moderately low. A good QSPR model should not only have good $R^2$ and $Q^2$ values; above all, it should have good prediction capacity. The prediction ability of the models was also enhanced on addition of ETA descriptors. This was observed for Equations 13 and 12, where the presence of only one ETA parameter in each model ($\eta$ and $\eta_F^{local}$ respectively) significantly increases the ability of the model to predict the log(1/T) value of different compounds. Thus, the use of ETA parameters in a QSPR model is noteworthy. In this context, a model with only ETA descriptors provided good statistical quality along with better predictability than all the other QSPR models on odor threshold that were developed using non-ETA descriptors and the combined pool of descriptors (ETA and non-ETA).

Discussion of the best model
Among all the modeling techniques employed, the GFA-spline algorithm gave the best results. These models account for nonlinearity in the developed correlation. The spline terms are denoted in angle brackets, e.g. <f(x) − a>, where 'f(x)' denotes the variable while 'a' is the knot of the spline, representing an optimum value of the independent variable. In each case, the spline term is taken as zero if the difference between the value of the variable and the knot is negative. 34 Three different models were developed with the GFA-spline tool using three different descriptor matrices. Among them, the model developed using only the ETA variables showed the best results for the prediction of odor threshold values. In QSAR studies, more emphasis is now given to the predictive quality of a model. Model 2 in Table 3 shows the highest $Q^2$ (internal validation metric) and the highest $R^2_{pred}$ (external validation metric) among all the tabulated models and hence was selected as the best model.
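The truncated spline term described above can be written in one line. The sketch below is only an illustration of the <f(x) − a> notation; the function name is hypothetical, and the knot value of 0.744 is the one quoted later in the discussion of the best model.

```python
def spline_term(value, knot):
    """Truncated term <value - knot>: zero whenever value does not exceed the knot."""
    return max(0.0, value - knot)

# Descriptor values below the knot contribute nothing to the model.
below_knot = spline_term(0.5, 0.744)
above_knot = spline_term(0.9, 0.744)
```

Because the term is zero below the knot and linear above it, a model containing such terms is piecewise linear in the descriptor, which is how the GFA-spline option captures nonlinearity.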
The standard errors of the regression coefficients are shown within parentheses. Equation 4 denotes the best QSPR model along with the results of the statistical and validation parameters. GFA was performed with 5000 iterations using 42 compounds as the training set ($n_{training}$) and validated with the 11 test set compounds ($n_{test}$). Among 100 models, the one with the lowest Friedman's LOF score (0.279) was selected as the final model, since this fitness function penalizes overfitting of the model. The model could explain 79.4 % of the variance (adjusted determination coefficient, $R_a^2$) and could predict 77.8 % of the variance (leave-one-out predicted variance, $Q^2$). The error involved in the model is expressed in terms of the standard error of estimate ($s$) and the predicted residual sum of squares (PRESS), both of which are low (0.463 and 9.478 respectively) for the best model. The statistical quality of the model can be judged from the determination coefficient ($R^2$), which should be as near as possible to 1 for a good model. In the present case, the $R^2$ value is 0.809, which signifies that the descriptors in the final model encode well the structural parameters of the compounds required to explain the response variable. All the regression coefficients are significant at p < 0.01, as evidenced by the corresponding t values at df = 38.
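The statistics reported above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below starts from the reported $R^2$, $s$ and PRESS and assumes the usual definitions (residual sum of squares = $s^2 \times$ df, $Q^2$ = 1 − PRESS/TSS, and an internal rmsep equal to $\sqrt{PRESS/n}$); the variable names are illustrative.

```python
import math

# Values reported for the best model (Equation 4).
n, p = 42, 3
r2, s, press = 0.809, 0.463, 9.478

df = n - p - 1                           # degrees of freedom: 38
r2_adj = 1 - (1 - r2) * (n - 1) / df     # adjusted R^2
rss = s ** 2 * df                        # residual sum of squares implied by s
tss = rss / (1 - r2)                     # total sum of squares implied by R^2
q2 = 1 - press / tss                     # leave-one-out Q^2
rmsep_int = math.sqrt(press / n)         # assumed definition of the internal rmsep
```

Within rounding, the derived values reproduce the reported adjusted $R^2$ of 0.794, $Q^2$ of 0.778 and internal rmsep of 0.475.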
The F value of the model is significant at p < 0.01. The values of all descriptors appearing in Equation (4) are given in Table S3 in the Supplementary Materials. The deviation of the predicted data of the test set compounds from the observed data is expressed as the root mean square error in prediction ($rmsep_{ext}$ = 0.415). The error involved in predicting the responses of the training set compounds by cross-validation is given by the $rmsep_{int}$ value (0.475). 45 The $rmsep_{int}$ value was calculated from the leave-one-out predicted values, while the $rmsep_{ext}$ value was computed from the predicted values of the test set compounds. Both values are quite low and close to each other. The values calculated or predicted by the best QSPR model (Equation 4) for the individual compounds of the dataset are provided in Table 1. The proximity of the observed and calculated/predicted responses of the compounds of both the training and test sets is shown in the scatter plot (Figure 1). The absence of chance correlation between the response variable and the descriptors during model development was checked using the Y-randomization test. For the best model (Equation 4), the squared average correlation coefficient of the randomized models ($R_r^2$) is much less than the $R^2$ value of the non-random model, which resulted in significant values for the $^cR_p^2$ parameter (model = 0.678 and process = 0.634). A $^cR_p^2$ value of more than 0.5 indicates robustness of the model and of the process involved.
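The randomization metric discussed above can be sketched as follows, assuming the commonly used definition $^cR_p^2 = R\sqrt{R^2 - R_r^2}$. The function names are illustrative, and the second helper merely inverts the formula to show what average $R_r^2$ a reported $^cR_p^2$ would imply under that assumption.

```python
import math

def c_rp2(r2, rr2):
    """cRp2 = R * sqrt(R^2 - Rr^2), with Rr^2 from the randomized models."""
    if rr2 > r2:
        raise ValueError("randomized models fit better than the real model")
    return math.sqrt(r2) * math.sqrt(r2 - rr2)

def implied_rr2(r2, crp2):
    """Rr^2 that would produce a given cRp2 for a model with the given R^2."""
    return r2 - crp2 ** 2 / r2

# Reported values for the best model: R^2 = 0.809, cRp2 (model) = 0.678.
rr2_model = implied_rr2(0.809, 0.678)
```

When the randomized models explain nothing ($R_r^2$ = 0) the metric equals $R^2$, and it falls to zero as the randomized models approach the fit of the real one, which is why values above 0.5 are read as evidence against chance correlation.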
The best QSPR model (Equation 4) for the odor threshold data consists of three 2D independent variables encoding the essential physicochemical features of the compounds concerned. The first descriptor, $\Sigma\alpha/N_V$, has the highest regression coefficient among the three variables. It is a first generation ETA descriptor in which α denotes the size of an individual atom. Thus, $\Sigma\alpha$ represents the molecular bulk of a molecule, whereas $N_V$ stands for the total number of non-hydrogen atoms. Moreover, the positive sign of the coefficient of this descriptor clearly indicates that log(1/T) increases with increasing molecular size of the alcohols. This is observed for the most potent molecule, C16 (1-decanol), and also for long chain alcohols like 1-dodecanol (C18), 2-decanol (C30) and 3-decanol (C49). Since lipophilicity increases with molecular bulk, it can be said that a molecule should be more lipophilic to be a potent odorant. The $\beta_s$-based descriptor, ranking second among the three descriptors in the value of its regression coefficient, modulates the odor threshold value inversely. Here, the basic parameter β covers the electronic features of molecules, and $\beta_s$ denotes the contribution of σ electrons. The descriptor describes the contribution of electronegativity (electron richness) to the prediction of odor threshold values. Thus, the lower the electronegativity, the better the odorant property, as seen for compound 23 (3-methyl-2-butanol). Again, the first generation composite ETA index, $\eta$, within the spline term denotes the overall topological environment of a molecule. Although its regression coefficient is the smallest among the three descriptors, its presence plays a significant role in obtaining a good correlation.
From the equation, it can be inferred that a positive value of the spline term is obtained only for values of the descriptor greater than 0.744 (the knot of the spline). Such a condition is essential in order to obtain molecules with significant odor potency. The $\eta$ index simultaneously reflects the molecular branching and electronic distribution features present in a compound. Taking all the descriptors together, it may be concluded that the unfavorable value of the most important variable, $\Sigma\alpha/N_V$, is responsible for the reduced potency of compounds like ethanol (C1), 2-propanol (C19) and 2-butanol (C20), although they possess acceptable values for the other descriptors.

Domain of Applicability
Determining the applicability domain is a prime requisite of any QSPR model, since a prediction can be considered reliable only if the compound falls within the domain of applicability of the model. Figure 2 shows the Williams plot by which the applicability domain of the final model (Equation 4) with only ETA descriptors was ascertained. From the plot, it can be seen that the training set compound C1 (ethanol) lies outside the domain, i.e., its leverage value is higher than the critical hat value ($h^*$), which is equal to 0.286. Thus, ethanol can be considered an influential chemical with respect to the developed QSPR model, since omitting it can lower the correlation value. All the test set compounds lie within the applicability domain of the model, denoting reliable prediction.

COMPARISON OF THE BEST MODEL WITH PREVIOUSLY REPORTED MODELS
Junkes et al. 46 established a relationship between a semiempirical topological index and odor threshold values using the same dataset as the present study, taking 49 aliphatic compounds from it. Their best model had an $R^2$ value of 0.714 and a $Q^2$ value of 0.747. Anker and Jurs 25 developed a QSPR model using the same set of compounds, which yielded a squared correlation coefficient of 0.863, and four compounds were marked as response outliers by the model. Here, we have developed a QSPR model (Equation 4) with training set compounds selected by a clustering technique, using ETA indices and the genetic function approximation approach, which showed a determination coefficient of 0.809. Of the 53 compounds, 42 were used for model generation and the rest were predicted using the best model ($R^2_{pred}$ = 0.813). The final model showed acceptable values for the various validation parameters. The applicability domain of the model has also been reported. Since the descriptors can depict the topological as well as the chemical nature of the compounds at the same time, the model is useful for in-silico prediction of odor threshold. A comparison between different QSPR models on odor threshold, developed by different research groups, is summarized in Table S4 of the Supplementary Materials section.

CONCLUSION
The present study demonstrates the application of ETA indices to predict the odor threshold of a series of aliphatic alcohols. The model constructed using the GFA (spline) technique showed acceptable internal stability along with good external prediction quality. The closeness between the experimental and predicted values of log(1/T) for all the compounds was also reflected in the significant values of the $r_m^2$ metrics. Thus, the ETA parameters possess sufficient diagnostic power to describe the changes in the property values with variation in the structure of compounds bearing the -OH functional group that fall within the domain of applicability of the developed model. The mechanistic interpretation of the best QSPR model (Equation 4) suggests that increased lipophilicity and reduced electronegativity potentiate the odorant property. Molecular branching and electronic distribution properties of each compound may also be studied further to understand the mechanism behind the binding of odorants to receptors. Hence, ETA descriptors, which are simple, easily interpretable and quick to calculate, can be applied for developing reliable QSPR models for prediction of odor threshold.
Supplementary Materials. - Supporting information to the paper is enclosed in the electronic version of the article. These data can be found on the website of Croatica Chemica Acta (http://public.carnet.hr/ccacaa).

Quality measures in fitting of a QSAR model
A QSAR model needs to be checked for its quality before it is applied for screening of new molecules. Several statistical parameters are available for assessing the quality of the model. Initially, the acceptability of a QSAR model depends upon three statistical parameters: (i) standard error of estimate ($s$), (ii) squared correlation coefficient ($R^2$) and (iii) explained variance ($R_a^2$) based on the MLR technique. The error in the estimation of individual activity values of the compounds under study using the MLR method can be quantified from their residuals. The standard error of estimate (SEE or $s$) is calculated as the root mean square of the residuals, corrected for the degrees of freedom. The standard error of the estimate is a measure of the accuracy of fitting. Lower values of SEE correspond to improved model acceptability.
$s = \sqrt{\sum (Y_{obs} - Y_{calc})^2 / (n - p - 1)}$   (S1)
In Eq. S1, $Y_{obs}$ and $Y_{calc}$ are the actual and estimated scores respectively, while $n$ is the number of scores and $p$ is the number of descriptors. Again, the agreement between data and model is quantified by the correlation coefficient ($R$), which measures how closely the observed data track the fitted regression line. An $R^2$ of 0 means that there is no relationship between the activity and the parameters selected for the study, while an $R^2$ of 1 means a perfect correlation. $R^2$ is calculated as the ratio of the regression variance to the original variance, where the regression variance is the original variance minus the variance around the regression line.
$R^2 = 1 - \sum (Y_{obs} - Y_{calc})^2 / \sum (Y_{obs} - \bar{Y}_{training})^2$   (S2)
In Eq. S2, $\bar{Y}_{training}$ is the mean observed activity of the training set compounds. Previously, QSAR models were judged only on the fitting quality of the mathematical equation using the correlation coefficient. The prime drawback of the $R^2$ parameter is that it does not provide any information on whether: (i) the independent variables are a true cause of the changes in the dependent variable, (ii) the correct regression was used, (iii) the most appropriate set of independent variables has been chosen, (iv) the model might be improved by using transformed versions of the existing set of independent variables and (v) any collinearity exists in the data. However, adjusted $R^2$ ($R_a^2$, Eq. S3) is a modification of $R^2$ that adjusts for the number of explanatory terms in a model. Unlike $R^2$, $R_a^2$ increases only if the new term improves the model more than would be expected by chance. The adjusted $R^2$ can be negative, and will always be less than or equal to $R^2$.
$R_a^2 = ((n - 1) R^2 - p) / (n - p - 1)$   (S3)
In Eq. S3, $n$ is the number of compounds and $p$ is the number of descriptors. However, acceptable values of these statistical parameters are not always sufficient to judge model predictivity, and alternative methods are employed to assess the predictive ability of the developed QSAR models. The addition of descriptors to the model increases the value of $R^2$, but this may not indicate an improvement in model quality. Therefore, to assess the predictive quality properly, the models need to be further validated using various validation techniques.
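The three fitting measures described above can be computed directly from observed and calculated responses. The sketch below assumes the standard forms, with $n - p - 1$ degrees of freedom for $s$ and for the adjusted $R^2$; the function name and the toy data are hypothetical.

```python
import math

def fit_statistics(y_obs, y_calc, p):
    """Return s, R^2 and adjusted R^2 for a model with p descriptors."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    rss = sum((o - c) ** 2 for o, c in zip(y_obs, y_calc))  # around the regression
    tss = sum((o - mean_obs) ** 2 for o in y_obs)           # around the mean
    s = math.sqrt(rss / (n - p - 1))
    r2 = 1.0 - rss / tss
    r2_adj = ((n - 1) * r2 - p) / (n - p - 1)
    return {"s": s, "R2": r2, "R2_adj": r2_adj}

# Toy data for a one-descriptor model, not values from the paper.
stats = fit_statistics([1.0, 2.0, 3.0, 4.0, 5.0],
                       [1.1, 1.9, 3.2, 3.8, 5.0], p=1)
```

Adding a descriptor can only raise $R^2$ or leave it unchanged, whereas the adjusted value also depends on $p$ and can fall when the extra term contributes little.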

Validation strategies
Internal and external validation statistics constitute the primary methods for validation of the developed QSAR models. Both methods have been widely used by different groups of researchers for assessing the predictive ability of a developed model. Several metrics are used to check the predictivity of QSAR models. For the validation of QSAR models, the following strategies are primarily adopted: (i) internal validation using the training set molecules, and (ii) external validation based on the test set compounds.

Internal validation (Leave-one-out cross-validation)
Internal validation deals with validation of a QSAR model based on the molecules involved in the model building process (training set data). In this technique, one compound is eliminated from the training set in each cycle and the model is built using the rest of the compounds. The model thus formed is used for predicting the activity of the eliminated compound. The process is repeated until all the compounds have been eliminated once. On the basis of the predictive ability of the model, the predicted residual sum of squares (PRESS) (Eq. S4), the standard deviation of error of prediction (SDEP) (Eq. S5) and the cross-validated $R^2$ ($Q^2$) (Eq. S6) are determined. The higher the value of $Q^2$ (more than 0.5), the better the model predictivity.
$PRESS = \sum (Y_{obs} - Y_{pred})^2$   (S4)
$SDEP = \sqrt{PRESS / n}$   (S5)
$Q^2 = 1 - \sum (Y_{obs} - Y_{pred})^2 / \sum (Y_{obs} - \bar{Y}_{training})^2$   (S6)

Table S2. Brief description of ETA descriptors
The ETA indices provide information about electronic features and the contribution of size, shape, branching, and functionality of a molecule. The second generation indices have better power to encode the structural features responsible for electron richness, unsaturation, polar surface and hydrogen-bonding ability of a given molecule. The variables are built from basic parameters such as α, which is related to size or bulk, ε, which provides information about the electronegativity of atoms, and β, which is related to electronic contribution. The following are the ETA indices that have been utilized in the present work.
No.   Descriptor   Description
13   $\Delta\varepsilon_C$   A measure of contribution of electronegativity
14   $\Delta\varepsilon_D$   A measure of contribution of hydrogen bond donor atoms
15   $\Delta\psi_A$   A measure of hydrogen bonding propensity of the molecules
16   $\Delta\beta'$   A measure of relative unsaturation content relative to molecular size
17   $\Sigma\beta'_{ns(\delta)}$   A measure of lone electrons entering into resonance relative to molecular size

Figure S1. Principal component analysis plot with the first three principal components generated by factor analysis
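The leave-one-out procedure described in the internal validation section can be sketched as below for the simplest case of a one-descriptor linear model. It assumes PRESS = Σ(obs − pred)², SDEP = √(PRESS/n) and $Q^2$ = 1 − PRESS/Σ(obs − mean of the training set)²; the function names and the toy data are hypothetical, and the paper's own models were built with GFA rather than simple regression.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def loo_metrics(xs, ys):
    """Leave-one-out PRESS, SDEP and Q^2 for a one-descriptor model."""
    n = len(xs)
    mean_y = sum(ys) / n
    press = 0.0
    for i in range(n):
        # Refit the model without compound i, then predict compound i.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    tss = sum((y - mean_y) ** 2 for y in ys)
    return {"PRESS": press, "SDEP": math.sqrt(press / n), "Q2": 1.0 - press / tss}

# Toy data, not values from the paper.
example_loo = loo_metrics([1.0, 2.0, 3.0, 4.0, 5.0],
                          [2.1, 3.9, 6.2, 7.8, 10.1])
```

Each compound is predicted by a model that never saw it, so $Q^2$ is typically lower than the fitted $R^2$; a large gap between the two is a common warning sign of overfitting.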