How Good Can the Characteristic Polynomial Be for Correlations?

The aim of this study was to investigate the characteristic polynomials resulting from the molecular graphs used as molecular descriptors in the characterization of the properties of chemical compounds. A formal calculus method is proposed in order to identify the value of the characteristic polynomial parameters for which the extremum values of the squared correlation coefficient are obtained in univariate regression models. The developed calculation algorithm was applied to a sample of nonane isomers. The obtained results revealed that the proposed method produced an accurate and unique solution for the best relationship between the characteristic polynomial as molecular descriptor and the property of interest.

Hosoya first reported the use of the absolute values of the coefficients of the characteristic polynomial of a non-ciclic chemical compound in 1971 [9], known today as the Hosoya index Z. Since then, the analysis of the correlation between Z and many thermodynamic properties has been thoroughly studied [26][27][28][29][30][31]. However, a polynomial is a more general treatment than an index. The characteristic polynomial is just one polynomial calculated on a molecular structure [5]. The advantage of polynomials is the reduction of degeneration. Our goal was to create a procedure for creating and using a polynomial formula to correlate the structure with a given property through the value of polynomials at a point. This concept generalizes somehow the use of polynomials in regression analysis. Moreover, the desired functionality of our application is to find all singularities of polynomials derivatives, in order to answer our proposed question: How Good Can the Characteristic Polynomial Be for Correlations?
Starting with the characteristic polynomials as molecular descriptors in characterization of structure-property relationships, the aim of the research was to develop a formal calculation algorithm able to identify the value of the characteristic polynomial parameter for which the extremum values of squared correlation coefficients are obtained in univariate regression models.

Statement of the Problem and Mathematical Solution
Let's consider a sample of n compounds. The molecule will be abbreviated as c i where i is an integer and takes values from 1 to n.
The characteristic polynomial can be built and calculated based on the compound's structure by using the following generic functions (where for the simplification all polynomials are of the same degree k): mol i ChP i = a 0i X 0 + a 1i X 1 + a 2i X 2 + …+ a ki X k (2) where a ki are coefficients of the characteristic polynomial (a 0i = the constant coefficient and a ki = the leading coefficient, k = the degree of polynomial), ChP i are the characteristic polynomial functions, and X is a generic variable. Each chemical compound from the sample (c i ) has a molecular structure (S i ) and an associated property of interest (Y i ). These can be written as: We have compounds (c i ) with associated property of interest (Y i ), and starting from their structure associated characteristic polynomials (ChP i ): For characterization of the compounds' property, the abstract function of the characteristic polynomial is not useful; the value associated with the characteristic polynomial function is necessary: where ChP i (x) is the value of the characteristic polynomial function associated to i molecule.
A problem arises at this point: what is the value of X (X = x) for which the correlation between the property of interest and the characteristic polynomial function attain the maximum value?
It is well known that the Pearson product-moment correlation coefficient is the most used correlation coefficient for quantitative variables. In our example this coefficient indicates the strength and direction of the linear relationship between property of interest and characteristic polynomial. Transforming the problem into a formula, the problem becomes: where cov is the covariance; σ y , σ ChP(X) are the standard deviation of the property activity (Y) and characteristic polynomial (ChP(X)); M is the expected value of the variables Y and ChP(X); and µ Y , µ ChP(X) are the variables averages. The above parameters could be written as: ). In these conditions, the formula of the correlation coefficient is: To solve the problem it is necessary to find equations of unknown grade in X with real solutions. The formula: where ∂r/∂X = derivative of r(Y, ChP(X)), and j is an integer, gives the solutions for x 1 , …, x j . Note that it is difficult to work with r from Eq. (7); it is much easier to work with its squared value (r 2 ). Using squared correlation coefficient (r 2 ) instead of correlation coefficient (r), Eq. (8) becomes: ∂r 2 /∂X = 2r∂r/∂X = 0 (9) So, the roots x 1 , …, x j of ∂r(Y, ChP(X))/∂X = 0 will be between the roots of ∂r2(Y, ChP(X))/∂X = 0. In any case, not all roots of r 2 = 0 (or r = 0) are of interest. Eq. (10) will provide all extremum points (Eq. (11)): ∂(·)/∂X = 0 (10) ∂(·)(Y,ChP(X))/∂X| X=x = 0 ⇔ x is a extremum point of (·) (11) where dot (·) designs any function (such as r, r 2 in our case) In order to find which among the solutions ({x 1 , …, x k }) of Eq. (11) are global maxima, the values of all r(Y,ChP(x k )) must be computed and from the obtained values the greatest ones must be selected: x j is a maximum (positive or negative) ⇔ r(Y,ChP(x j )) = max{r(Y,ChP(x k ))} (12) Assuming that there is a string of polynomials (as in Eq. (2)) with equal degree k: P j = a 0j X 0 + a 1j X 1 + a 2j X 2 + … + a kj X k (13) the proposed implementation of the model uses the following elementary mathematical operations: ÷ Multiplication: R = αP j = αa 0j X 0 + αa 1j X 1 + αa 2j X 2 + …+ αa kj X k (14) ÷ Addition: R = P i +P j = (a 0i +a 0j )X 0 + (a 1i +a 1j )X 1 + (a 2i +a 2j )X 2 + … + (a ki +a kj )X k (15) ÷ Average: R = M(P i ) = M(a 0j )X 0 + M(a 1j )X 1 + M(a 2j )X 2 + … + M(a kj )X k (16) ÷ Product: R = P i P j = (a 0i a 0j )X 0 + (a 0i a 1j +a 1i a 0j )X 1 + … + (a ki a kj )X 2k (17) ÷ Derivative: In order to solve Eq (9), a derivative of a fraction is also necessary: if R = (P i /P j )' = 0 then P i 'P j -P i Pj' = 0 The proposed calculus could be done with pen and paper, but is time consuming, especially when there are many compounds of interest. Thus, a formal computation method could help to find the exact and unique solution of the best relationship between characteristic polynomial and property of interest.

Calculation Algorithm
1. Parse polynomials formulas for all given molecules (ChP j , 1 ≤ j ≤ n); parse measured data values for given molecules (Y j , 1 ≤ j ≤ n). Comments: a. The polynomials are stored as sums of monomials; b. Every monomial is in fact a pair of two values: the power of variable (X) and the coefficient; c. A measured data value is assigned with a polynomial through j value (where j is an integer and takes value from 1 to n). 2. Search in the polynomial formulas and remove the identical monomials (as in Table 1). Comments: a. It is safe to remove the repeated monomials (such for example the X 9 or -8·X 7 , see Table 1). The calculations made by using Eq. (7) revealed that the values of correlation coefficients are not affected; b. It is better to remove the identical monomials in order to reduce the calculation complexity, magnitude of numbers, and errors propagation. ∂denominator(X)/∂X; 6. Calculate the product between numerator1(X) and denominator(X) (as polynomial): product1(X) = numerator1(X)·denominator(X); 7. Calculate the product between numerator(X) and denominator1(X) (as polynomial): product2(X) = numerator(X)·denominator1(X); 8. Change the sign of the product2(X): product2(X) = (-1)·product2(X); 9. Add the product2(X) to the product1(X) and store the result in the r2_1_numerator: r2_1_numerator(X) = product1(X) + product2(X); 10. Factorize r2_1_numerator(X) if it is possible (usually is easy to factorize with X if this factor is contain in it, so will factorize on X); let X p be the factor; delete the factor; thus the r2_1_numerator became: r2_1_numerator(X) = r2_1_numerator(X)/X p ; 11. Find roots of equation r2_1_numerator(X) = 0 and return them as pairs (x i ,ε i ) 1 ≤ i ≤ m where in fact r2_1_numerator(x i ) = ε i . Comments: a. The procedure of finding roots is an approximate one for at least two reasons. First, the M(·) operator is used, so the coefficients cannot be integers. Second, even if the S(·) operator (sum operator) is used instead of the M(·) operator in order to obtain integer coefficients, the degree of the obtained polynomial is too great to apply some nonnumeric methods here (for our example the degree of the obtained polynomial equation was 12); b. The returning of the ε i is used in order to know how close the exact solution is to the result; c. The procedure of finding roots is a recursive one and it also calculates and uses all superior derivatives of the polynomial in order to find all real roots of the equation. 12. Use the set of roots {x i }1 ≤ i ≤ m and pairs of polynomials (numerator(X),denominator(X)) to calculate the value of r 2 in the following points: The above-presented algorithm has been implemented using PHP language (Hypertext Preprocessor). In order to illustrate its effectiveness, the program was run for a sample of nonane isomers, the Henry's law constant (solubility) being the property of interest.
The solutions of roots for the squared correlation coefficient obtained by the proposed algorithm for the sample of nonane isomers are presented in Table 2, where the ε i parameter shows how closely the obtained value is to the exact solution r 2 (x i ) ' X = 0 . Indeed, the r2_1_numerator(x i ) = ε i was true, where the r2_1_numerator was from the eleventh step of the proposed algorithm and represented a part of parameter depicted above.  As it can be observed from Table 2, the proposed algorithm obtained pairs of roots (as negative and positive values: 1.1 -1.2, 2.1 -2.2, and 3.1 -3.2, see the values from the x i column). The values of squared correlation coefficients are local extremum values (maximum and/or minimum values): one negative (for pair of roots of ± 0.856…) and two positive (one minimum for the ± 0.481… pair of roots and one maximum for the ± 1.656… pair of roots). These are the expected results taking into account that the r2_1_numerator(X) is a polynomial pair of X.
Analyzing the results presented in Table 2 it can be observed that, for the identified roots, the numerical errors of the models were in all cases less than 0.0001. These results sustain the power of the model to identify the imposed solutions. Looking at the values of the obtained squared correlation coefficients it can be observed that the proposed method identified one maximum value (for roots ± 1.656…) and two minimum values (± 0.856, and ± 0.481) (note that these are local extremum values). Regarding the maximum value of the squared correlation coefficient, it can be observed that is 0.296 and, from the statistical point of view, revealed a week linear relationship between the characteristic polynomial and Henry's law constant for the studied alkanes. It must be noted that the aim of the paper was not to obtained a significant correlation coefficient; it was to develop and implement a formal algorithm able to identify the characteristic polynomial parameter for which the extremum values (as maximum and minimum values) for the correlation coefficient are obtained in univariate regression models, this aim being accomplished.
Regarding the proposed method one question can arise: why use the proposed method when the Hosoya Z index [9] can be used in QSPR without using a computer? First, the use of characteristic polynomials instead of the Z index reduces the degeneration. Second, the proposed model is able to find all singularities of polynomial derivatives. Last in sequence but not least in importance, the proposed computer based method is able to work with small as well as with large sample sizes without any involvement of human time or abilities, eliminating any human errors.
It is well known that the squared correlation coefficients increase with the number of variables used by a linear regression model [33]. Starting from this hypothesis it will be interesting to analyze the applicability of the proposed model to multivariate regression models. The next plan of our research refers the implementation of a similar computational algorithm for multivariate models when characteristic polynomials are use as molecular descriptors. Another question that needs to be answered refers the usefulness of the method for characterization of relationships between compound's activity and structure, an approach that will be investigated in future research.

Concluding Remarks
The proposed calculation algorithm is able to obtain unique and reproducible solutions. The solutions are unique, meaning that for a sample of compounds with a property of interest the maximum value of the squared correlation coefficient between property and characteristic polynomials is always given by a single pair of roots. The computation algorithm can be applicable on any class of compounds when the characteristic polynomials are used as descriptors in analysis of the relationship between compounds' structure and their properties.