Sweetness power QSARs by PRECLAV software

This paper presents several QSAR (Quantitative Structure Activity Relationship) studies with a testing set, carried out with the PRECLAV (Property Evaluation by Class Variables) computer program. The database we used contains sweeteners with very diverse structures: sugars, halosugars, guanidine derivatives and 3-aminosuccinamic acid derivatives. According to their estimated values of Log(RS), the testing set molecules are classified as “recommended”, “uncertain”, or “un-recommended” for synthesis. Comparing the estimated Log(RS) values with the observed values, we have found that this classification is sufficiently correct to have practical value, even if the training/testing set contains sweeteners of several different classes. The N-phenyl-guanidine-acetic acid derivatives with a polycyclic system bonded to the nitrogen atom represent a distinct subclass of guanidinic sweeteners.


Introduction
The PRECLAV (Property Evaluation by Class Variables) computer program 32 has been used for several years in QSAR (Quantitative Structure Activity Relationship) studies, both for "academic" purposes (testing the quality of certain algorithms and/or the predictive ability of certain descriptors) and for "practical" problems proposed by various research groups in the drug design area (identifying the predictors with the highest influence on the values of the dependent property, and estimating the value of the desired property for molecules not yet synthesized) [1][2][3][4][5][6][7][8][9][10][11].
We have recently described in detail the algorithm of the program's latest version 12. The present paper reports the results of several QSAR studies in which we have used databases containing sweeteners with very diverse structures: sugars, halosugars, guanidine derivatives and dipeptides.

Methods and formulae
The molecules were built virtually with the molecular mechanics program PCMODEL 13.
The geometry of the minimum energy conformer was obtained using the MMX force field and the GMMX algorithm 14. The geometry was then optimized more rigorously with the quantum mechanics program MOPAC 15, using the keyword string: "am1 pulay gnorm=0.01 shift=50 geook mmok camp-king bonds vectors".
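The optimization step can be illustrated with a minimal sketch that assembles a MOPAC input file around the keyword string quoted above, read here with "gnorm=0.01" and "shift=50" as two separate keywords (an assumed reading). The function name `mopac_input` and the water geometry are hypothetical and serve only to show the usual MOPAC layout: a keyword line, two comment lines, then Cartesian coordinates with optimization flags. This is not the authors' actual workflow, which started from PCMODEL/GMMX geometries.

```python
# Keyword string from the text, with gnorm and shift written as separate keywords.
KEYWORDS = "am1 pulay gnorm=0.01 shift=50 geook mmok camp-king bonds vectors"


def mopac_input(title, atoms, keywords=KEYWORDS):
    """Return MOPAC input text for a Cartesian geometry.

    atoms: list of (element, x, y, z) tuples, coordinates in angstroms.
    The flag 1 after each coordinate marks it as free to be optimized.
    """
    lines = [keywords, title, ""]  # line 1: keywords; lines 2-3: comments
    for element, x, y, z in atoms:
        lines.append(f"{element:<2} {x:10.5f} 1 {y:10.5f} 1 {z:10.5f} 1")
    return "\n".join(lines) + "\n"


# Hypothetical starting geometry (water), used only to show the file layout.
water = [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0), ("H", -0.24, 0.93, 0.0)]
text = mopac_input("water, illustrative only", water)
print(text)
```

In practice one such file would be written per molecule, and the resulting MOPAC output files would then be passed to PRECLAV.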
The output files created by MOPAC for each analyzed molecule are input files for PRECLAV, and they contain the values of some descriptors. Using the data from these files, PRECLAV computed most of the descriptors and performed the statistical analysis. A detailed list of descriptors is available as supplementary material.
The analyzed dependent property was Log(RS), where RS (relative sweetness) is the sweetness power relative to sucrose. When the analyzed molecules had a common skeleton we used "whole molecule" and "grid" descriptors. Otherwise we used only "whole molecule" descriptors.
QSAR studies can be carried out with or without a testing set. In QSAR studies with a testing set, PRECLAV uses the Class function to identify the significant descriptors. The QSAR equation that PRECLAV uses for prediction in such situations is not the same as the equation obtained when the program works without a testing set.
The "significant" descriptors satisfy conditions (1) and (2), which are expressed in terms of the quantities defined below.
$C_v$ is the coefficient of variation of the descriptor values, defined as usual by formula (3), where $\sigma$ is the standard deviation around the average value and $V_m$ is the average absolute value.
$Q$ is the quality function of the analysed descriptor, defined by formula (4), where $r^2$ is the square of the Pearson linear correlation between the descriptor values and the dependent property values, and $r^2_{\min}$ is the minimum value imposed for $r^2$. The default value of $r^2_{\min}$, empirically established, is $4/N$ (where $N$ is the number of molecules in the training set); the user may modify this value.
$C$ is the Class function, defined by formula (5), where $\sigma_N$ is $\sigma$ from formula (3) computed for the $N$ molecules of the training set, $\sigma_{N+K}$ is $\sigma$ from formula (3) computed for the entire database ($N$ molecules from the training set + $K$ molecules from the testing set), $a$ is a real number whose value was established empirically ($a = 10$) by analysing a large number of databases (training set + testing set), and $b = 1$ for the "whole molecule" descriptors and $b = 2$ for the "grid" descriptors (this way the selection of "grid" descriptors is more drastic).
The Class function is considered to measure how representative a sample the training set is, from the statistical point of view and for the analyzed descriptor, of the joint set of training and testing molecules. If the testing set is missing, then $C = 1$ for all descriptors and condition (4) becomes $r^2 > b \cdot r^2_{\min}$. Usually, according to the selection criteria (1) and (2), only 5-25% of the computed descriptors are "significant".
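The case without a testing set can be sketched in a few lines of Python, since the text states that the selection condition then reduces to r² > b·r²min with the default r²min = 4/N. The sketch also computes a coefficient of variation as σ/V_m; the use of the population standard deviation and of the mean of absolute values for V_m are assumptions, because formula (3) is not reproduced here. The function names and the sample data are hypothetical, and the general case with the Class function is deliberately not implemented.

```python
import math


def coefficient_of_variation(values):
    """sigma / V_m (assumed reading of formula 3)."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population sigma (assumption)
    v_m = sum(abs(v) for v in values) / n  # mean of absolute values (assumption)
    return sigma / v_m if v_m > 0 else 0.0


def pearson_r2(x, y):
    """Square of the Pearson linear correlation between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant descriptor carries no information
    return sxy * sxy / (sxx * syy)


def passes_r2_condition(descriptor, log_rs, b=1, r2_min=None):
    """Check r^2 > b * r2_min, the condition stated for C = 1 (no testing set)."""
    if r2_min is None:
        r2_min = 4.0 / len(log_rs)  # default value given in the text
    return pearson_r2(descriptor, log_rs) > b * r2_min


# Made-up values for 8 training molecules (so r2_min = 0.5).
log_rs = [0, 1, 2, 3, 4, 5, 6, 7]
d_good = [0.1, 1.2, 1.9, 3.1, 3.9, 5.2, 6.1, 6.8]
d_poor = [1, -1, 1, -1, 1, -1, 1, -1]
print(passes_r2_condition(d_good, log_rs), passes_r2_condition(d_poor, log_rs))
```

Passing b=2 applies the stricter threshold that the text assigns to "grid" descriptors.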
The results of some QSAR studies performed without a testing set, using the same databases we have used here, will be presented in a future paper. Here we present only the results of several QSAR studies performed with a testing set.
The training and testing sets were defined by a standard procedure. The molecules in the database are ordered by the value of the dependent property, starting with the smallest value. The molecules with rank 3, 8, 13, 18, 23, … in this ordered list form the testing set.
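The splitting procedure described above can be sketched as follows. The step of 5 between testing ranks is inferred from the listed ranks 3, 8, 13, 18, 23; the function name `split_by_rank` and the made-up database are hypothetical.

```python
def split_by_rank(molecules, first_rank=3, step=5):
    """Split (name, log_rs) pairs into training and testing sets.

    Molecules are sorted by ascending Log(RS); those with rank
    first_rank, first_rank + step, ... (1-based) go to the testing set.
    """
    ordered = sorted(molecules, key=lambda m: m[1])
    training, testing = [], []
    for rank, molecule in enumerate(ordered, start=1):
        if rank >= first_rank and (rank - first_rank) % step == 0:
            testing.append(molecule)
        else:
            training.append(molecule)
    return training, testing


# Made-up database of 41 molecules with increasing Log(RS) values.
database = [(f"mol{i}", 0.1 * i) for i in range(1, 42)]
training, testing = split_by_rank(database)
print(len(training), len(testing))
```

For a database of 41 molecules this rule yields 33 training and 8 testing molecules, which matches the set sizes reported for the 41-molecule databases in this paper.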
The analysis of the training set molecules has produced tens of thousands of multi-linear QSAR equations of the following form:

$$\mathrm{Log(RS)} = c_0 + \sum_{i} c_i D_i \qquad (6)$$

where $D_i$ are the values of the descriptors used as predictors and $c_0$, $c_i$ are the fitted coefficients. The "best" QSAR equation was selected according to the value of a cross-validation quality function, specific to PRECLAV 12. This equation was then used to predict the values of Log(RS) for the molecules in the testing set.
Once the computations were over, the testing set molecules were classified into three categories: "recommended for synthesis", "uncertain", and "un-recommended for synthesis". The classification was based on the estimated value of Log(RS), relative to the estimated values for the other molecules in the testing set. After computing the values of the dependent property for the molecules in the testing set, PRECLAV sorts these molecules according to the estimated values. An average value $P^{calc}_m$ of the estimated values is computed, together with the standard deviation $\sigma$ of the estimated values around this average.
The program considers a value "high" if it fulfils criterion (7) and "low" if it fulfils criterion (8). If the user wishes to synthesize molecules with a pronounced biochemical activity, the molecules fulfilling criterion (7) are "recommended for synthesis", while the ones fulfilling criterion (8) are "un-recommended for synthesis"; the remaining molecules are "uncertain".
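The three-way labeling can be illustrated with a short sketch. Criteria (7) and (8) are not reproduced in this text, so the thresholds used below (mean plus or minus k standard deviations, with k = 1) are an assumption made purely for illustration and may differ from the ones PRECLAV applies. The function name `classify` and the sample estimates are hypothetical.

```python
import statistics


def classify(estimates, k=1.0):
    """Label testing set molecules from their estimated Log(RS) values.

    ASSUMPTION: "high" means above mean + k*sigma and "low" means below
    mean - k*sigma; the actual criteria (7) and (8) are not given here.
    """
    values = list(estimates.values())
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population sigma (assumption)
    labels = {}
    # Sort by decreasing estimate, as the program sorts the testing set.
    for name, value in sorted(estimates.items(), key=lambda kv: kv[1], reverse=True):
        if value > mean + k * sigma:
            labels[name] = "recommended for synthesis"
        elif value < mean - k * sigma:
            labels[name] = "un-recommended for synthesis"
        else:
            labels[name] = "uncertain"
    return labels


# Made-up estimated Log(RS) values for five testing set molecules.
estimates = {"A": 0.5, "B": 1.0, "C": 1.2, "D": 1.4, "E": 3.0}
for name, label in classify(estimates).items():
    print(name, label)
```

Because the labels depend only on the position of each estimate relative to the others, the scheme remains usable even when the absolute estimates are biased.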
In "practical" QSAR studies, the testing set contains new molecules, not yet synthesized, with structures imagined by the program user. In this case the observed values of Log(RS) for the testing set are not known, because the molecules have not yet been analyzed by physical or chemical methods. It is very important that the program sorts the testing set molecules properly by their estimated values of Log(RS), even if the values themselves do not correspond very well to the real values: what matters most is that the program arranges the molecules in the correct order. This way the molecules "recommended for synthesis" can be correctly identified. Thus, in the "academic" QSAR studies we present here, we have considered that an adequate measure of the quality of the prediction is the value of the Kendall rank correlation between the computed and the observed values of Log(RS).
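A minimal sketch of the Kendall rank correlation follows. The text does not say which variant PRECLAV computes, so the tau-a form (concordant minus discordant pairs, divided by the total number of pairs) is an assumption. The function name `kendall_tau_a` and the data are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # tied pairs (sign == 0) count as neither
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up observed and estimated Log(RS) values for four molecules.
observed = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 2.5, 2.0, 4.2]
print(kendall_tau_a(observed, estimated))
```

Only the ordering of the values enters the result, which is why this measure suits the ranking task described above. With a testing set of 8 molecules there are 28 pairs, so under this definition and without ties the possible values are multiples of 2/28.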
The SMILES notation of analysed molecules is available as supplementary material.

Results and Discussion
QSAR study #1
Database: sugars and halosugars, 41 molecules (Fig. 1, Table 1)
Dependent property: Log(RS); the values are taken from the literature 16,17
Training set: 33 molecules (Table 1, normal font)
Testing set: 8 molecules (Table 1, bold font)
Descriptors: "whole molecule" + "grid"
Number of significant descriptors: 99
The type (6) […]
The three molecules in the testing set having the smallest observed values of Log(RS) have been labeled "un-recommended for synthesis". Two molecules having the highest observed values of Log(RS) have been labeled "recommended for synthesis" and another one has been labeled "uncertain". For molecule 23 the value of Log(RS) is over-estimated, while for molecule 33 it is under-estimated.
In QSAR study #1 the descriptor having the highest influence on the Log(RS) value is the "QSAR of molecular orbital energies". This descriptor gives Log(RS) as a linear function of the inverse of the energy differences between the HOMO-1, HOMO, LUMO and LUMO+1 molecular orbitals. When all the molecules from Table 1 were included in the training set, the same descriptor proved to have the highest influence on the Log(RS) value. This suggests that the Log(RS) value for (halo)sugars correlates with the wavelengths of the radiation absorbed in the UV-VIS domain.
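The link between orbital energy differences and absorbed wavelengths can be made explicit with the relation λ = hc/ΔE: the inverse of an energy gap is proportional to a wavelength. The sketch below converts a gap in eV to a wavelength in nm. The orbital energies are made-up values, and treating an orbital energy gap as an excitation energy is only a rough approximation, so this is an illustration and not part of the PRECLAV descriptor definition.

```python
H_C_EV_NM = 1239.841984  # Planck constant times speed of light, in eV*nm


def gap_to_wavelength_nm(e_upper_ev, e_lower_ev):
    """Wavelength (nm) of a photon whose energy equals the orbital gap (eV)."""
    gap = e_upper_ev - e_lower_ev
    if gap <= 0:
        raise ValueError("the upper orbital must lie above the lower one")
    return H_C_EV_NM / gap


# Made-up orbital energies in eV, for illustration only.
e_homo, e_lumo = -9.0, -4.0
wavelength = gap_to_wavelength_nm(e_lumo, e_homo)
print(round(wavelength, 1), "nm")
```

A descriptor that is linear in the inverse gaps is therefore linear in the corresponding wavelengths, which is the basis of the interpretation given above.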

QSAR study #2
Database: guanidine derivatives, 41 molecules (Fig. 2, Table 2)
Dependent property: Log(RS); the values are taken from the literature 16
Training set: 33 molecules ([…]
The three testing set molecules having the highest values of Log(RS) have been labeled "recommended for synthesis". The molecule having the smallest Log(RS) value has been labeled "un-recommended for synthesis".
It is remarkable how few significant descriptors there are. Given how PRECLAV selects the significant descriptors (from a group of almost 1000 computed), a small number of significant descriptors suggests that the training set is not a representative sample of the molecules in Table 2. Of the 21 retained significant descriptors only 5 are "grid" descriptors. Nevertheless, the equation used for prediction contains only "grid" predictors.
In QSAR study #2 the descriptor having the highest influence on the Log(RS) value is the "grid" descriptor A33. When all the molecules from Table 2 were included in the training set, the "bond orders sum" descriptor proved to have the highest influence on Log(RS). This suggests that, for the guanidine derivatives in Fig. 2, the values of Log(RS) depend on the molecular size and on the degree of unsaturation of the chemical bonds. The importance of molecular size is also stressed, through the "moment of inertia C" descriptor, by the QSAR study on guanidines performed with a very different training set by Katritzky et al. 33
Some guanidines have been synthesized in which the R4 chemical group (see Figure 2) contains a polycyclic system (naphthyl, indanyl, adamantyl, 1,3-benzodioxolyl etc.) 16. We have performed numerous other QSAR studies using PRECLAV (not included here) with a database including both the molecules from Table 2 and several guanidines with a polycyclic system. No matter how we grouped the molecules into training and testing sets, the predictive power of the resulting equations was much weaker, for both the training set and the testing set molecules. We therefore conclude that the guanidines with an R4 group containing a polycyclic system and the guanidines from Table 2 belong to two different subclasses of guanidinic sweeteners.

QSAR study #3
The prediction for the testing set molecules is poorer (K = 0.5714). Nevertheless, molecule 85, having the lowest Log(RS) value, is correctly labeled "un-recommended for synthesis", and molecule 120, having the highest Log(RS) value, is correctly labeled "recommended for synthesis".
In QSAR study #3 the descriptor having the highest influence on the value of Log(RS) is the "E(lumo+1) - E(homo-1) gap" descriptor. When all the molecules from Table 3 were included in the training set, the "Platt topologic index / Heavy atoms number ratio" descriptor proved to have the highest influence on Log(RS). This suggests that, in the case of dipeptides, the value of Log(RS) depends on the molecular size and on the degree of branching of the chain.

QSAR study #4
Database: 123 molecules ([…]
The "recommended for synthesis" group in the testing set includes 8 guanidines and 2 dipeptides. The "un-recommended for synthesis" group in the testing set includes 6 (halo)sugars and 1 dipeptide.
In QSAR study #4 the descriptor having the highest influence on the Log(RS) value is the "Molecular weight" descriptor. When all the molecules from Table 1, Table 2, and Table 3 were included in the training set, the "Percent of oxygen * Maximum charge of oxygen atoms product" descriptor proved to have the highest influence on Log(RS). This suggests that the value of Log(RS) depends on the size of the molecules and on the electrostatic interactions involving oxygen atoms.

Conclusions
The PRECLAV software classifies the potential sweeteners in the testing set, according to their estimated values of Log(RS), as "recommended" or "un-recommended" for synthesis. By comparing the estimated values with the observed Log(RS) values we have found that the classification is mostly correct and thus has practical value. This is the case even if the training/testing set contains sweeteners from several different classes.
The descriptors having the highest influence on Log(RS) are specific to each class of sweeteners.
The N-phenyl-guanidine-acetic acid derivatives, with a polycyclic system bonded to the nitrogen atom, represent a distinct subclass of N-phenyl-guanidine-acetic acid derivative sweeteners.

Supplementary Material is Available
Global descriptors Grid descriptors

Table 2. Log(RS) values of guanidine derivatives

Table 4. Log(RS) values of testing set molecules (entire database)