Prediction of Standard Enthalpy of Formation by a QSPR Model

The standard enthalpies of formation of 1115 compounds from all chemical groups were predicted using genetic algorithm-based multivariate linear regression (GA-MLR). The five-descriptor multivariate linear model obtained by GA-MLR has a squared correlation coefficient of $R^2 = 0.9830$. All molecular descriptors that enter this model are calculated from the chemical structure of the molecule alone. As a result, the model is easy to apply to any compound, and its predictions are accurate.


Introduction
Physical and thermodynamic property data of compounds are needed in the design and operation of industrial chemical processes. Among them, the standard enthalpy of formation (or standard heat of formation) is an important fundamental property. It is defined as the change of enthalpy that accompanies the formation of 1 mole of a compound in its standard state from its constituent elements in their standard states (the most stable form of the element at 1 atm of pressure and the specified temperature, usually 298 K or 25 degrees Celsius). All elements in their standard states (such as hydrogen gas, solid carbon in the form of graphite, etc.) have a standard enthalpy of formation of zero, as there is no change involved in their formation. The standard enthalpy of formation is used in thermochemistry to find the standard enthalpy change of a reaction. This is done by subtracting the sum of the standard enthalpies of formation of the reactants from the sum of the standard enthalpies of formation of the products, each weighted by its stoichiometric coefficient $\nu$:

$$\Delta_r H^\circ = \sum_{\text{products}} \nu_p \, \Delta_f H^\circ_p - \sum_{\text{reactants}} \nu_r \, \Delta_f H^\circ_r$$

Three methods for estimating the standard enthalpy of formation are widely used: the Benson method [1], the Joback and Reid method [2], and the Constantinou and Gani method [3]. All three are group contribution methods, in which the property of a compound is estimated as the sum of the contributions of the simple chemical groups that occur in its molecular structure. They provide the important advantage of rapid estimates without requiring substantial computational resources.
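As a small worked illustration of the reaction-enthalpy relation described earlier in this section (products minus reactants), the following Python sketch computes the standard enthalpy of combustion of methane. The function name `reaction_enthalpy` and the table `DELTA_F_H` are illustrative, and the numerical values are common textbook values rather than data taken from this study.

```python
# Standard enthalpies of formation at 298 K in kJ/mol.
# Assumption: common textbook values, not data from this study.
DELTA_F_H = {"CH4(g)": -74.8, "O2(g)": 0.0, "CO2(g)": -393.5, "H2O(l)": -285.8}


def reaction_enthalpy(reactants, products, table=DELTA_F_H):
    """Sum over products minus sum over reactants, weighted by stoichiometric coefficients."""
    def total(side):
        return sum(nu * table[species] for species, nu in side.items())
    return total(products) - total(reactants)


# CH4(g) + 2 O2(g) -> CO2(g) + 2 H2O(l)
dh = reaction_enthalpy({"CH4(g)": 1, "O2(g)": 2}, {"CO2(g)": 1, "H2O(l)": 2})
print(f"{dh:.1f} kJ/mol")  # -890.3 kJ/mol
```

Note that the element O2(g) contributes zero, in line with the definition of the standard state given above.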
The application of quantitative structure-property relationship (QSPR) models to the prediction and estimation of physical properties of materials is becoming widespread [4][5]. In QSPR, advanced mathematical methods (genetic algorithms, neural networks, etc.) are used to find a relation between the property of interest and basic molecular properties, called "molecular descriptors", which are obtained solely from the chemical structure of the compounds.
In this study, a new QSPR model for prediction of the standard enthalpy of formation is presented.

Data set
Many compilations of standard enthalpies of formation ($\Delta_f H^\circ$) have been published in the literature; among them, we selected the DIPPR 801 compilation [6] for our problem. This compilation has been recommended by AIChE (American Institute of Chemical Engineers). From it, 1115 compounds were selected and their $\Delta_f H^\circ$ values were extracted.

Calculation of Molecular Descriptors
The calculation of molecular descriptors requires optimized chemical structures of the compounds. The chemical structures of all 1115 compounds in our data set were drawn in the Hyperchem software [7] and pre-optimized using the MM+ molecular mechanics force field. A more precise optimization was then done with the PM3 semi-empirical method in Hyperchem.
In the next step, molecular descriptors were calculated for all 1115 compounds with the Dragon software [8]. Dragon can calculate 1664 molecular descriptors for any chemical structure. After calculating the descriptors for all 1115 structures, non-informative descriptors must be rejected from the Dragon output. First, descriptors with a standard deviation lower than 0.0001 were rejected because they were nearly constant. Second, descriptors with only one value different from the remaining ones were rejected. Third, the pair correlation of every two descriptors was checked, and one descriptor of each pair with a correlation coefficient equal to one (the threshold value) was excluded: for each such pair, the descriptor showing the higher pair correlation with the other descriptors was rejected from the pool.
Finally, the pool of molecular descriptors was further reduced by deleting descriptors that could not be calculated for every structure in our data set.
As a result of this screening, 1477 of the 1664 calculated molecular descriptors remained in the pool.
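The screening steps of this section can be sketched in Python with NumPy as follows. This is only a minimal illustration, not the procedure implemented in Dragon: the function name `screen_descriptors`, the numerical tolerance used for "correlation equal to one", the use of the absolute value of the correlation, and the mean absolute correlation as the tie-breaking measure are all assumptions. Descriptors that could not be calculated (represented here as NaN) are removed first rather than last, so that they do not disturb the statistics of the other steps.

```python
import numpy as np


def screen_descriptors(X, sd_tol=1e-4, corr_tol=1.0 - 1e-9):
    """Return the column indices of X that survive the screening steps."""
    X = np.asarray(X, dtype=float)
    keep = np.arange(X.shape[1])

    # Drop descriptors that could not be calculated for every structure.
    keep = keep[np.all(np.isfinite(X[:, keep]), axis=0)]

    # Step 1: drop near-constant descriptors (standard deviation below sd_tol).
    keep = keep[X[:, keep].std(axis=0, ddof=1) >= sd_tol]

    # Step 2: drop descriptors with only one value different from all the others.
    def one_off(col):
        _, counts = np.unique(col, return_counts=True)
        return counts.size == 2 and counts.min() == 1

    keep = np.array([j for j in keep if not one_off(X[:, j])], dtype=int)
    if keep.size < 2:
        return keep

    # Step 3: for each perfectly correlated pair, drop the descriptor that is
    # more correlated (on average) with the rest of the pool.
    R = np.abs(np.corrcoef(X[:, keep], rowvar=False))
    np.fill_diagonal(R, 0.0)
    mean_corr = R.mean(axis=0)
    dropped = np.zeros(keep.size, dtype=bool)
    for a, b in zip(*np.where(np.triu(R >= corr_tol))):
        if dropped[a] or dropped[b]:
            continue
        dropped[a if mean_corr[a] >= mean_corr[b] else b] = True
    return keep[~dropped]
```

The function returns column indices rather than a reduced matrix, so the surviving descriptors can still be traced back to their names in the original descriptor table.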

Methods of calculation and results
In this step, 20% of our data set (223 compounds) was randomly removed and set aside as a test set. This test set was used in the next steps only for testing the predictive power of the obtained model and was not used for developing the model. The remaining 80% (892 compounds) was used as the training set.
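A random 80/20 split of this kind can be sketched as below. The function name `split_train_test` and the fixed seed are illustrative assumptions; the source does not state how the random selection was implemented.

```python
import numpy as np


def split_train_test(n_compounds, test_fraction=0.2, seed=0):
    """Randomly split compound indices into a training set and an excluded test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_compounds)
    n_test = round(test_fraction * n_compounds)
    return np.sort(order[n_test:]), np.sort(order[:n_test])


train_idx, test_idx = split_train_test(1115)
print(len(train_idx), len(test_idx))  # 892 223
```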
The problem at this stage is to find the multivariate linear model that is most accurate while using the smallest possible number of molecular descriptors. One of the best algorithms for this type of problem has been proposed by Leardi et al. [9]. To apply this algorithm, a program was written in MATLAB (Mathworks Inc. software). The program finds the best multivariate linear model by the genetic algorithm-based multivariate linear regression (GA-MLR) proposed by Leardi et al. [9], which we have used successfully in our previous works [10][11][12]. The inputs of the program are the molecular descriptors obtained in the previous section and the desired number of parameters of the multivariate linear model. The fitness function of the program was the cross-validated coefficient. To obtain the best model, the effect of increasing the number of molecular descriptors on the value of the cross-validated coefficient must be considered. When the cross-validated coefficient stays nearly constant as the number of descriptors increases, the search is stopped and the best result has been obtained.
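To make the idea concrete, the sketch below shows a deliberately simplified genetic search over descriptor subsets of a fixed size, with the leave-one-out cross-validated coefficient as fitness. It is not the algorithm of Leardi et al. [9] and not the MATLAB program used in this work: the function names `q2_loo` and `ga_mlr`, the selection, crossover and mutation rules, and all parameter values are assumptions chosen for brevity, and the data are synthetic.

```python
import numpy as np


def q2_loo(X, y):
    """Leave-one-out cross-validated Q^2 of a least-squares model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                 # hat matrix of the linear fit
    resid = y - H @ y
    loo_resid = resid / (1.0 - np.diag(H))    # exact leave-one-out residuals
    return 1.0 - np.sum(loo_resid ** 2) / np.sum((y - y.mean()) ** 2)


def ga_mlr(X, y, n_desc, pop_size=30, n_gen=40, p_mut=0.2, seed=0):
    """Search for the n_desc-descriptor subset with the highest Q^2 (simplified GA)."""
    rng = np.random.default_rng(seed)
    n_pool = X.shape[1]

    def fitness(subset):
        return q2_loo(X[:, subset], y)

    pop = [rng.choice(n_pool, n_desc, replace=False) for _ in range(pop_size)]
    for _ in range(n_gen):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]        # keep the better half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), 2, replace=False)
            genes = np.union1d(parents[i], parents[j])
            child = rng.choice(genes, n_desc, replace=False)   # crossover
            unused = np.setdiff1d(np.arange(n_pool), child)
            if unused.size and rng.random() < p_mut:           # mutation
                child[rng.integers(n_desc)] = rng.choice(unused)
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return np.sort(best), fitness(best)


# Synthetic demonstration: only descriptors 2 and 7 carry information.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 12))
y_demo = 3.0 * X_demo[:, 2] - 2.0 * X_demo[:, 7] + 0.05 * rng.normal(size=60)
for k in (1, 2, 3):
    subset, q2 = ga_mlr(X_demo, y_demo, n_desc=k)
    print(k, subset, round(q2, 4))
```

On such synthetic data the cross-validated coefficient is expected to rise sharply from one to two descriptors and then stay nearly constant, which is the stopping signal described above.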
To obtain the best multivariate linear model, we started with one-descriptor models and found the best one; then two-descriptor models were tested and the best two-descriptor model was found. This procedure was repeated with an increasing number of descriptors until a further increase no longer improved the accuracy of the best model. The best model obtained has six parameters and is presented below:

Validation of Model
There are many techniques for checking the validity of the obtained model [13]. Todeschini et al. [13] presented a quick rule for this purpose. The rule compares the multivariate correlation index $K_X$ of the X-block of predictor variables with the multivariate correlation index $K_{XY}$ obtained from the X-block matrix augmented by the column of the response variable. According to this rule, if $K_{XY}$ is greater than $K_X$, the model is predictive [13]. The values of these two indices obtained in our problem are 62 , and, by this quick rule, the obtained model is therefore predictive ($K_{XY} > K_X$).

Cross-validation is the most common validation technique [13]. In this technique, each member of the data set is deleted in turn, a model is built from the remaining members, and the value of the deleted object is predicted. This is repeated for all members of the data set, and finally a squared cross-validated correlation coefficient is obtained. In our problem, the squared cross-validated correlation coefficient ($Q^2$) was 0.9826. The difference between $R^2$ and $Q^2$ is small (0.9830 versus 0.9826), and thus the validity of the model is confirmed by this technique.

Another validation technique is the bootstrap [13]. In this technique, validation is performed by randomly generating training sets with sample repetitions and then evaluating the predicted responses of the samples not included in the training set. This is usually repeated thousands of times. After 5000 repetitions of this technique, the parameter
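The quick rule and the bootstrap check described in this section can be sketched as follows. The sketch assumes that the multivariate correlation index is the K index computed from the eigenvalues $\lambda_j$ of the correlation matrix of $p$ variables, $K = \sum_j |\lambda_j / \sum_j \lambda_j - 1/p| \, / \, (2(p-1)/p)$, and it uses one plausible pooled definition of a bootstrap $Q^2$; neither detail is spelled out in the text above. The function names `k_index`, `quik_rule` and `bootstrap_q2` are illustrative.

```python
import numpy as np


def k_index(M):
    """Multivariate correlation index K of the columns of M (0: uncorrelated, 1: collinear)."""
    lam = np.linalg.eigvalsh(np.corrcoef(M, rowvar=False))
    p = lam.size
    return np.sum(np.abs(lam / lam.sum() - 1.0 / p)) / (2.0 * (p - 1) / p)


def quik_rule(X, y):
    """Return (K_X, K_XY, predictive), where predictive means K_XY > K_X."""
    k_x = k_index(X)
    k_xy = k_index(np.column_stack([X, y]))
    return k_x, k_xy, bool(k_xy > k_x)


def bootstrap_q2(X, y, n_rounds=5000, seed=0):
    """Pooled Q^2 over bootstrap rounds, predicting only the left-out samples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    press = tss = 0.0
    for _ in range(n_rounds):
        boot = rng.integers(n, size=n)            # training set drawn with repetition
        out = np.setdiff1d(np.arange(n), boot)    # samples not in this training set
        if out.size == 0:
            continue
        coef, *_ = np.linalg.lstsq(A[boot], y[boot], rcond=None)
        press += np.sum((y[out] - A[out] @ coef) ** 2)
        tss += np.sum((y[out] - y[boot].mean()) ** 2)
    return 1.0 - press / tss
```

Pooling the squared errors over all rounds before forming the ratio avoids giving undue weight to rounds in which only a few samples happen to be left out.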

Discussion
In the formation of a molecule from its constituent elements,

Conclusions
In the present study, a simple five-descriptor linear model was presented. The model is the result of a QSPR study on the standard enthalpy of formation of 1115 compounds. Because these compounds were selected from all families of compounds, there is no specific limit on the applicability of the model. Its simplicity of use is another advantage.
All molecular descriptors of this model can be easily calculated from the chemical structure of a molecule.