Bond, Bond-type, and Total Linear Indices of the Non-stochastic and Stochastic Edge Adjacency Matrix. 1. Theory and QSPR Studies

Novel bond-level molecular descriptors based on linear maps similar to those defined in algebra theory are proposed. The $k$th edge-adjacency matrix ($E^k$) denotes the matrix of bond linear indices (non-stochastic) with respect to the canonical basis set. The $k$th stochastic edge-adjacency matrix, $ES^k$, is here proposed as a new molecular representation easily calculated from $E^k$. Then, the $k$th stochastic bond linear indices are calculated using $ES^k$ as operators of linear transformations. In both cases, the bond-type formalism was developed. The $k$th non-stochastic and stochastic bond-type linear indices are the sums of the $k$th non-stochastic and stochastic bond linear indices, respectively, for bonds of the same bond type. In the same way, the $k$th non-stochastic and stochastic total (whole-molecule) linear indices are calculated by summing the corresponding $k$th bond linear indices of all bonds in the molecule. The new bond-based molecular descriptors were tested for suitability in quantitative structure-property relationship (QSPR) studies by analyzing regressions of the novel indices against selected physicochemical properties of octane isomers. All the regression models found are statistically significant and showed very good stability to data variation in leave-one-out cross-validation experiments. The general performance of the new descriptors in these QSPR studies was evaluated with respect to well-known sets of 2D/3D molecular descriptors. From this analysis, we conclude that the non-stochastic and stochastic bond-based (total and bond-type) linear indices have an overall good modeling capability, proving their usefulness in QSPR studies. The approach described in this work appears to be a very promising structural invariant, useful not only for QSPR/QSAR studies but also for similarity/diversity analysis and drug discovery protocols.


INTRODUCTION
These theoretical indices (topological indices, TIs) are numbers that describe the structural information of molecules through graph-theoretical invariants and can be considered as structure-explicit descriptors [6]. At present, there are a great number of TIs that can be used in QSPR/QSAR studies. However, a simple inspection of the large number of TIs defined in the literature shows that many of them are computed with identical mathematical equations, using different molecular matrices. There are two main sources of TIs, the vertex (atom)-based adjacency (A) and distance (D) matrices [1-7]. Furthermore, the number and diversity of the graph invariants are so wide that it is difficult to find general relations for the so-derived molecular fingerprints.
The edge (bond)-adjacency relationships have also been used in the generation of new TIs. Their matrix form has been considered and explicitly defined in the chemical graph theory literature, but has received very little attention in both the chemical and the mathematical literature. Nevertheless, in the last decade Estrada rediscovered this matrix as an important source of graph-theoretical invariants useful in the generation of new molecular descriptors [1]. For instance, the ε index was first defined by this author [8] using the Randić-type graph-theoretical invariant. That is to say, this index is analogous to the Randić branching index but is calculated from edge degrees instead of vertex degrees.
In a second work, our research group [9] extended the edge-adjacency matrix E of the molecular graph to a 3D-E matrix in order to generate the so-called topographic edge-connectivity index ε(ρ), also using the Randić-type graph-theoretical invariant. Later, Estrada used the same edge-adjacency relationships to generate a new family of TIs, the spectral moments of the E matrix [10]. The analogous concept of spectral moments of the vertex-adjacency matrix had been discussed previously by different authors [11]. Afterward, Estrada et al. [12] introduced an extended set of edge-connectivity indices, ${}^{m}\varepsilon_{t}(G)$, in the same way in which the branching index of Randić was extended to the series of molecular connectivity indices. Finally, a novel graph-theoretical polynomial, $P_{\varepsilon}(G, x)$, counting the edge connectivity was introduced by the same researcher [13]. The first derivative of this polynomial evaluated at x = 0 is equal to the edge-connectivity index of the molecular graph. A series of edge-connectivity indices modified to include long-range bond contributions, $\varepsilon_{c}(x)$, was obtained by this author using values of x different from zero. Such edge-adjacency relationships are applied in the present report to generate a series of bond-based molecular descriptors to be used in drug design and chemoinformatic studies.
On the other hand, TIs can be classified as "global" or "local" according to the way in which they characterize the molecular structure, although most of them can be considered as global molecular fingerprints [14]. One exception is the electrotopological state (E-state) index [15]. Other "global" descriptors, such as the spectral moments of the edge-adjacency matrix, have been redefined in local form [14]. The great success of the E-state and edge-based spectral moments in QSPR/QSAR recently stimulated us to propose and validate some novel total and local descriptors based on a topological (edge-adjacency relationships) characterization of the molecular structure. In this sense, in a manner similar to that for the atom and atom-type level E-state, an E-state index for bonds and bond types has been proposed. The bond-based E-state indices provided an improvement of 25% over the atom-based E-state indices in the description of the boiling point of 372 alkanes, alcohols, and chloroalkanes [15].
Recently, one of the present authors, Y. M-P, introduced a new set of atom-level molecular descriptors of relevance to QSAR/QSPR studies and 'rational' drug design, the atom linear indices $f_k(x_i)$ [16]. These local (atom and atom-type) indices are based on the calculation of linear maps in $\Re^n$ in the canonical basis. The significance and interpretation of these indices, and their comparison to other molecular descriptors, were also described [16]. Specifically, the features of the $k$th total and local linear indices were illustrated by examples of various types of molecular structures, including chain length and branching as well as content of heteroatoms and multiple bonds [16]. Additionally, the linear independence of the atom-type linear fingerprints from 229 other 0D-3D molecular descriptors was demonstrated. In this sense, it was concluded that local (atom-based) linear fingerprints are independent indices, which contain important structural information to be used in QSPR/QSAR and drug design studies [16].
This in silico method has been successfully applied to the prediction of several physical, physicochemical, and chemical properties of organic compounds [17]. These atom-level molecular descriptors, and their stochastic forms [18,19], have also been useful for the selection of novel subsystems of compounds having a desired property/activity. In this sense, the method was successfully applied to the virtual (computational) screening of novel anthelmintic compounds, which were then synthesized and evaluated in vivo on Fasciola hepatica [20]. In addition, the molecular linear indices have been extended to consider three-dimensional features of small/medium-sized molecules based on the trigonometric 3D-chirality-correction factor approach [21]. Finally, promising results have been found in the modeling of the interaction between drugs and the HIV Ψ-RNA packaging region in the field of bioinformatics using the nucleic acid linear indices [22]. An alternative formulation of our approach for the structural characterization of proteins was also carried out recently [23]. This extended method was used to encompass protein stability studies, specifically how alanine substitution mutations on the Arc repressor wild-type protein affect protein stability, by means of a combination of protein linear or quadratic indices (macromolecular fingerprints) and statistical (linear and non-linear model) methods [23].
We propose in this paper new local (bond and bond-type) and total molecular descriptors based on the adjacency of edges. We also propose a new matrix representation of the molecule based on the "stochastic" adjacency of edges, together with the linear indices derived from it. In addition, the correlation ability of the new descriptors is tested in a QSPR study of some physicochemical properties of octanes.

THEORETICAL FRAMEWORK
In this section, we first define the nomenclature to be used in this work; then the atom-based molecular vector (X) is redefined for bond characterization using the same approach as previously reported; and finally some new definitions of bond-based non-stochastic and stochastic linear indices are given.

Background in Edge-Adjacency Matrix
Let G = (V, E) be a simple graph, with $V = \{v_1, v_2, \ldots, v_n\}$ and $E = \{e_1, e_2, \ldots, e_m\}$ being the vertex and edge sets of G, respectively. Then G represents a molecular graph having n vertices and m edges (bonds). The edge-adjacency matrix E of G (likewise called the bond-adjacency matrix, B) is a square and symmetric matrix whose elements $e_{ij}$ are 1 if and only if edge i is adjacent to edge j, and 0 otherwise [1,10,14]. Two edges are adjacent if they are incident to a common vertex. This matrix corresponds to the vertex-adjacency matrix of the associated line graph. Finally, the sum of the $i$th row (or column) of E is named the edge degree of bond i, $\delta(e_i)$ [1,8,12,13].
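As an illustration of these definitions, the following minimal Python sketch builds E and the edge degrees from a list of bonds. The function names (`edge_adjacency`, `edge_degrees`) and the atom numbering chosen for the H-suppressed graph of 2-hydroxybut-2-enenitrile are assumptions made for illustration only; they are not taken from the TOMOCOMD-CARDD software.

```python
def edge_adjacency(bonds):
    """Edge-adjacency matrix E of a simple graph given as a list of bonds.

    Each bond is a pair of atom labels. Two distinct bonds are adjacent
    (e_ij = 1) when they are incident to a common atom.
    """
    m = len(bonds)
    E = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            if set(bonds[i]).intersection(bonds[j]):  # shared vertex
                E[i][j] = E[j][i] = 1                 # E is symmetric
    return E


def edge_degrees(E):
    """Edge degree of each bond: the sum of the corresponding row of E."""
    return [sum(row) for row in E]


# Assumed labeling: atoms C1, C2, C3, C4, N5, O6 and bonds
# 1: C1-C2, 2: C2=C3, 3: C3-C4, 4: C4#N5, 5: C3-O6.
# Bond order plays no role in E; only shared atoms matter.
bonds = [("C1", "C2"), ("C2", "C3"), ("C3", "C4"), ("C4", "N5"), ("C3", "O6")]
E = edge_adjacency(bonds)
degrees = edge_degrees(E)
for row in E:
    print(row)
print("edge degrees:", degrees)
```

Because E depends only on which bonds share an atom, the same routine applies unchanged to H-suppressed and H-explicit graphs; only the bond list differs.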

New Edge-Relations: Stochastic Edge-Adjacency Matrix
By using the edge (bond)-adjacency relationships we can find another new relation for a molecular graph, which is introduced here. The $k$th stochastic edge-adjacency matrix, $ES^k$, can be obtained directly from $E^k$. Here, $ES^k = [{}^{k}es_{ij}]$ is a square table of order m (m = number of bonds), and the elements ${}^{k}es_{ij}$ are defined as follows:

$$ {}^{k}es_{ij} = \frac{{}^{k}e_{ij}}{{}^{k}\mathrm{SUM}_{i}} = \frac{{}^{k}e_{ij}}{{}^{k}\delta(e_i)} \qquad (1) $$

where ${}^{k}e_{ij}$ are the elements of the $k$th power of E, and the sum of the $i$th row of $E^k$, ${}^{k}\mathrm{SUM}_{i}$, is named the k-order edge degree of bond i, ${}^{k}\delta(e_i)$. Note that the matrix $ES^k$ in Eq. 1 has the property that the sum of the elements in each row is 1. An m × m matrix with nonnegative entries having this property is called a "stochastic matrix" [26].
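Eq. 1 can be sketched in a few lines of Python. The helper names (`mat_mul`, `mat_pow`, `stochastic`) are hypothetical, and the matrix E is written out by hand for the H-suppressed graph of 2-hydroxybut-2-enenitrile under an assumed bond labeling (1: C-C, 2: C=C, 3: C-C, 4: C≡N, 5: C-O). The handling of a row whose sum is zero is also an assumption, since the text does not discuss that case.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(E, k):
    """k-th power of E, with E^0 taken as the identity matrix."""
    n = len(E)
    P = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        P = mat_mul(P, E)
    return P


def stochastic(Ek):
    """Row-normalize E^k as in Eq. 1: es_ij = e_ij / (sum of row i).

    Assumption: a row whose sum is zero is left as zeros to avoid
    dividing by zero.
    """
    ES = []
    for row in Ek:
        total = sum(row)
        ES.append([e / total if total else 0.0 for e in row])
    return ES


# Assumed edge-adjacency matrix of the five-bond example molecule.
E = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 1],
     [0, 1, 0, 1, 1],
     [0, 0, 1, 0, 0],
     [0, 1, 1, 0, 0]]

ES1 = stochastic(mat_pow(E, 1))
ES2 = stochastic(mat_pow(E, 2))
print(ES1[0][1], ES1[1][0])           # entries (1,2) and (2,1) differ
print([sum(row) for row in ES2])      # every row sums to 1
```

Note that the sketch normalizes the rows of $E^k$ itself, following Eq. 1, rather than raising the first-order stochastic matrix to the $k$th power; the two constructions generally give different matrices.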
The components (w) of W are numeric values that represent a certain standard bond property (bond label). That is to say, these weights correspond to different bond properties for organic molecules. Thus, a molecule having 5, 10, 15, ..., m bonds can be represented by means of vectors with 5, 10, 15, ..., m components, belonging to the spaces $\Re^5$, $\Re^{10}$, $\Re^{15}$, ..., $\Re^m$, respectively, where m is the dimension of the real set ($\Re^m$). This approach allows us to encode organic molecules such as 2-hydroxybut-2-enenitrile through the molecular vector W = [w(Csp3-Csp2), w(Csp2=Csp2), w(Csp2-Osp3), w(H-Osp3), w(Csp2-Csp), w(Csp≡Nsp)]. This vector belongs to the product space $\Re^6$. These properties characterize each kind of bond (and bond type) within the molecule.
Diverse kinds of bond weights (w) can be used in order to codify information related to each bond in the molecule. These bond labels are chemically meaningful numbers such as the standard bond distance [36-39] and standard bond dipole [36-39], or even mathematical expressions involving atomic weights such as atomic log P [40], surface contributions of polar atoms [41], atomic molar refractivity [42], atomic hybrid polarizabilities [43], Gasteiger-Marsili atomic charge [44], atomic electronegativity in the Pauling scale [45], and so on. Here, we characterize each bond with the following parameter:

$$ w = \frac{x_i}{\delta_i} + \frac{x_j}{\delta_j} \qquad (2) $$

In this expression, $x_i$ can be any standard weight of the atom i bonded with atom j, and $\delta_i$ is the vertex (atom) degree of atom i. The use of each scale (bond property) defines alternative molecular vectors, W.
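A minimal sketch of Eq. 2 follows, assuming Pauling electronegativities (C = 2.55, N = 3.04, O = 3.44) as atomic weights. The atom degrees are copied from the worked values reported in this paper for 2-hydroxybut-2-enenitrile, where multiple bonds appear to be counted with their multiplicity; the function name `bond_weight` and the data layout are illustrative assumptions.

```python
def bond_weight(x_i, delta_i, x_j, delta_j):
    """Eq. 2: w = x_i/delta_i + x_j/delta_j for the bond between atoms i and j."""
    return x_i / delta_i + x_j / delta_j


# Pauling electronegativities used as atomic weights (atom labels).
x = {"C": 2.55, "N": 3.04, "O": 3.44}

# (element, degree) for the two atoms of each bond, in the bond order
# C-C, C=C, C-C, C#N, C-O. The degrees are inputs copied from the
# worked values in the text; this sketch does not derive them.
bonds = [
    (("C", 1), ("C", 3)),
    (("C", 3), ("C", 4)),
    (("C", 4), ("C", 4)),
    (("C", 4), ("N", 3)),
    (("C", 4), ("O", 1)),
]

W = [bond_weight(x[a], da, x[b], db) for (a, da), (b, db) in bonds]
print([round(w, 6) for w in W])
```

Swapping the dictionary `x` for another atomic property scale gives an alternative molecular vector W without changing anything else in the calculation.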

Calculation of Linear Indices for Bonds, Bond-Types and the Whole Molecule
If a molecule consists of m bonds (a vector of $\Re^m$), then the $k$th bond linear indices for bond i in the molecule are calculated as linear maps on $\Re^m$ (endomorphisms on $\Re^m$) in the canonical basis set. Specifically, the $k$th non-stochastic and stochastic bond linear indices, $f_k(w_i)$ and ${}^{s}f_k(w_i)$, are computed from the $k$th non-stochastic and stochastic edge-adjacency matrices, $E^k$ and $ES^k$, as shown in Eqs. 3 and 4, respectively:

$$ f_k(w_i) = \sum_{j=1}^{m} {}^{k}e_{ij}\, w_j \qquad (3) $$

$$ {}^{s}f_k(w_i) = \sum_{j=1}^{m} {}^{k}es_{ij}\, w_j \qquad (4) $$

where m is the number of bonds of the molecule and $w_j$ are the coordinates of the bond-based molecular vector (W) in the so-called canonical ('natural') basis. In this basis system, the coordinates of any vector W coincide with the components of this vector [26,46,47]. For that reason, those coordinates can be considered as weights (bond labels) of the edges of the molecular graph. The coefficients ${}^{k}e_{ij}$ and ${}^{k}es_{ij}$ are the elements of the $k$th power of the matrix E(G) and of the $k$th stochastic matrix $ES^k(G)$ (Eq. 1), respectively, of the molecular graph. The defining Eqs. 3 and 4 for $f_k(w_i)$ and ${}^{s}f_k(w_i)$, respectively, may also be written as single matrix equations:

$$ [f_k(W)] = E^k [W], \qquad [{}^{s}f_k(W)] = ES^k [W] \qquad (5) $$

where [W] is a column vector (an m × 1 matrix) of the coordinates of W in the canonical basis of $\Re^m$, and $[f_k(W)]$ and $[{}^{s}f_k(W)]$ are the column vectors whose $i$th entries are $f_k(w_i)$ and ${}^{s}f_k(w_i)$. Here, $E^k$ and $ES^k$ denote the matrices of the linear maps with respect to the natural basis set.
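The following Python sketch evaluates Eqs. 3 and 4 for the five-bond example molecule. The helper names are hypothetical; E is written out by hand under an assumed bond labeling (1: C-C, 2: C=C, 3: C-C, 4: C≡N, 5: C-O), and W is taken from the bond weights reported in the sample calculation. The stochastic indices use the row-normalized $E^k$ of Eq. 1, so they are obtained here as $(E^k W)_i$ divided by the $i$th row sum of $E^k$.

```python
def mat_vec(M, v):
    """Matrix-vector product M v for a matrix stored as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def bond_linear_indices(E, W, k):
    """Non-stochastic bond linear indices f_k(w_i) = sum_j (E^k)_ij w_j (Eq. 3)."""
    f = list(W)              # k = 0: E^0 is the identity
    for _ in range(k):
        f = mat_vec(E, f)    # apply one more power of E
    return f


def stochastic_bond_linear_indices(E, W, k):
    """Stochastic bond linear indices (Eq. 4) with ES^k defined by Eq. 1.

    Since ES^k is E^k with each row divided by its sum, the index for
    bond i equals (E^k W)_i / (E^k 1)_i. Rows summing to zero are
    assumed to give 0.0.
    """
    numer = bond_linear_indices(E, W, k)
    denom = bond_linear_indices(E, [1.0] * len(W), k)  # row sums of E^k
    return [n / d if d else 0.0 for n, d in zip(numer, denom)]


E = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 1],
     [0, 1, 0, 1, 1],
     [0, 0, 1, 0, 0],
     [0, 1, 1, 0, 0]]
W = [3.4, 1.4875, 1.275, 1.650833, 4.0775]

f3 = bond_linear_indices(E, W, 3)
sf1 = stochastic_bond_linear_indices(E, W, 1)
print([round(v, 5) for v in f3])
print([round(v, 5) for v in sf1])
```

The third-order values for the five bonds can be compared with the terms 11.46583, 37.51083, 34.65, 12.79, and 24.25583 that appear in the sample calculation of this paper.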
Note that both bond linear indices are defined as a linear transformation $f_k$ on the molecular vector space $\Re^m$. This map is a correspondence that assigns a vector f(W) to every vector W in $\Re^m$ in such a way that:

$$ f(\lambda_1 W_1 + \lambda_2 W_2) = \lambda_1 f(W_1) + \lambda_2 f(W_2) $$

for any scalars $\lambda_1$, $\lambda_2$ and any vectors $W_1$, $W_2$ in $\Re^m$.
Total (whole-molecule) bond-based non-stochastic and stochastic linear indices, $f_k(w)$ and ${}^{s}f_k(w)$, are calculated from the local (bond) linear indices as shown in Eqs. 6 and 7, respectively:

$$ f_k(w) = \sum_{i=1}^{m} f_k(w_i) \qquad (6) $$

$$ {}^{s}f_k(w) = \sum_{i=1}^{m} {}^{s}f_k(w_i) \qquad (7) $$

where m is the number of bonds, and $f_k(w_i)$ and ${}^{s}f_k(w_i)$ are the non-stochastic and stochastic bond linear indices obtained by Eqs. 3 and 4, respectively. Both total linear forms, $f_k(w)$ and ${}^{s}f_k(w)$, can also be written in matrix form for each molecular vector $W \in \Re^m$:

$$ f_k(w) = [u]^t E^k [W], \qquad {}^{s}f_k(w) = [u]^t ES^k [W] $$

where $[u]^t$ is an m-dimensional unitary row vector (all entries equal to 1). As can be seen, the $k$th total linear indices (both non-stochastic and stochastic) are calculated by summing the local (bond) linear indices of all bonds in the molecule.
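Eqs. 6 and 7 reduce to a single sum over the local indices, as the short Python sketch below shows. The function names are hypothetical, E is an assumed hand-written edge-adjacency matrix for the H-suppressed graph of 2-hydroxybut-2-enenitrile, and W is the vector of bond weights reported in the sample calculation.

```python
def mat_vec(M, v):
    """Matrix-vector product M v for a matrix stored as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def local_indices(E, W, k, stochastic=False):
    """k-th bond linear indices (Eq. 3), or their stochastic form (Eq. 4)."""
    f = list(W)
    row_sums = [1.0] * len(W)
    for _ in range(k):
        f = mat_vec(E, f)                # E^k W
        row_sums = mat_vec(E, row_sums)  # row sums of E^k
    if stochastic:
        # Assumption: a zero row sum yields 0.0 instead of a division error.
        return [a / b if b else 0.0 for a, b in zip(f, row_sums)]
    return f


def total_index(E, W, k, stochastic=False):
    """Eqs. 6 and 7: [u]^t applied to the vector of local indices."""
    return sum(local_indices(E, W, k, stochastic))


E = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 1],
     [0, 1, 0, 1, 1],
     [0, 0, 1, 0, 0],
     [0, 1, 1, 0, 0]]
W = [3.4, 1.4875, 1.275, 1.650833, 4.0775]

totals = [total_index(E, W, k) for k in range(4)]
s_totals = [total_index(E, W, k, stochastic=True) for k in range(4)]
print([round(t, 4) for t in totals])
print([round(t, 4) for t in s_totals])
```

The third-order non-stochastic total can be checked against the value 120.6725 given in the sample calculation; for k = 0 both totals equal the plain sum of the bond weights, because $E^0$ is the identity matrix.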
In the bond-type linear indices formalism, each bond in the molecule is classified into a bond type (fragment). In this sense, bonds may be classified into bond types in terms of the characteristics of the two atoms that define the bond. For all data sets, including those with a common molecular scaffold as well as those with very diverse structures, the $k$th fragment (bond-type) linear indices provide much useful information. Thus, the development of the bond-type linear indices description provides the basis for application to a wider range of biological problems in which the local formalism is applicable without the need for superposition of a closely related set of structures.
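The bond-type formalism amounts to grouping the local indices by a label attached to each bond. The sketch below is an illustration under stated assumptions: the bond-type labels ("C-C", "C=C", "C#N", "C-O"), the helper names, and the hand-written matrix E for 2-hydroxybut-2-enenitrile are all chosen here for the example, and W is the vector of bond weights from the sample calculation.

```python
from collections import defaultdict


def mat_vec(M, v):
    """Matrix-vector product M v for a matrix stored as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def bond_linear_indices(E, W, k):
    """Non-stochastic bond linear indices of order k (E^k applied to W)."""
    f = list(W)
    for _ in range(k):
        f = mat_vec(E, f)
    return f


def bond_type_indices(local, bond_types):
    """Sum the local (bond) indices over all bonds of the same type."""
    sums = defaultdict(float)
    for value, bond_type in zip(local, bond_types):
        sums[bond_type] += value
    return dict(sums)


E = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 1],
     [0, 1, 0, 1, 1],
     [0, 0, 1, 0, 0],
     [0, 1, 1, 0, 0]]
W = [3.4, 1.4875, 1.275, 1.650833, 4.0775]
bond_types = ["C-C", "C=C", "C-C", "C#N", "C-O"]  # assumed labels

local = bond_linear_indices(E, W, 3)
by_type = bond_type_indices(local, bond_types)
for bond_type, value in by_type.items():
    print(bond_type, round(value, 5))
```

Since every bond belongs to exactly one type, the bond-type indices of a given order add up to the total index of that order.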
It is useful to perform a calculation on a molecule to illustrate the steps in the procedure. For this, in the next section we give a pictorial representation of the calculation of the non-stochastic and stochastic linear indices of the bond matrix (both total and local) using a simple chemical example. In that section, we also point out that our approach is rather similar to the LCBO-MO (Linear Combination of Bond Orbitals-Molecular Orbitals) method (e.g., for k = 1) [48]. LCBO-MO is another way of forming molecular orbitals, by taking linear combinations of functions associated with the different bonds in the molecule. In this sense, MOs are made up as linear combinations of the bond orbitals (BOs) composing the system, i.e., they are written in the form

$$ \phi_i = \sum_{j} c_{ij}\, \Psi_j $$

where i is the number of the MO, $\phi_i$ [in our case, $f_1(w_i)$]; j are the numbers of the bond orbitals $\Psi_j$ (in our case, $w_j$); and $c_{ij}$ (in our case, ${}^{1}e_{ij}$ or ${}^{1}es_{ij}$ for the non-stochastic and stochastic indices, respectively) are the numerical coefficients defining the contributions of the individual BOs to the given MO. Although the LCAO (Linear Combination of Atomic Orbitals) approximation has been particularly useful for the study of conjugated hydrocarbons, the LCBO method has been applied mainly to the calculation of properties of saturated hydrocarbons. As a saturated molecule can be considered as made up of localized bonds, it is reasonable to associate an orbital with each of the corresponding regions [48].

Sample Calculation
The linear indices of the bond matrix are calculated in the following way. Considering the molecule of 2-hydroxybut-2-enenitrile as a simple example, we have the following labeled molecular graph and bond-based adjacency matrices (E and ES). The second (k = 2) and third (k = 3) powers of these matrices and the bond-based molecular vector, W, are also given:

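[Figure: labeled molecular graph of 2-hydroxybut-2-enenitrile with the matrices E and ES, their second and third powers, and the molecular vector W]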
The molecule contains five localized bonds (corresponding to the five edges in the H-suppressed molecular graph). To these we associate the five "bond orbitals" $w_1$, $w_2$, $w_3$, $w_4$, and $w_5$. Thus, W = [$w_1$, $w_2$, $w_3$, $w_4$, $w_5$] = [w(C-C), w(C=C), w(C-C), w(C≡N), w(C-O)], and each "bond orbital" can be computed by Eq. 2 using, for instance, the atomic electronegativity in the Pauling scale (x) [45] as the atomic weight (atom label). Each non-stochastic and stochastic "molecular orbital" will then have the form:

$$ f_k(w_i) = {}^{k}e_{i1} w_1 + {}^{k}e_{i2} w_2 + {}^{k}e_{i3} w_3 + {}^{k}e_{i4} w_4 + {}^{k}e_{i5} w_5 $$

$$ {}^{s}f_k(w_i) = {}^{k}es_{i1} w_1 + {}^{k}es_{i2} w_2 + {}^{k}es_{i3} w_3 + {}^{k}es_{i4} w_4 + {}^{k}es_{i5} w_5 $$

The ${}^{k}e_{ii}$ and ${}^{k}es_{ii}$ terms can be considered to measure the attraction of an electron for a bond at step k. The ${}^{k}e_{ij}$ and ${}^{k}es_{ij}$ are the terms of interaction between two bonds at step k. The terms ${}^{k}e_{ij} = {}^{k}e_{ji}$ are equal by symmetry (non-oriented molecular graph). However, in general ${}^{k}es_{ij} \neq {}^{k}es_{ji}$. This is a logical result because the ${}^{k}es_{ij}$ elements are the transition probabilities of the 'electrons' moving from bond i to bond j at the discrete time periods $t_k$, and these should differ in the two directions. This result is in total agreement with expectation if the electronegativities of the two atom types in the bonds are taken into account.

QSPR Studies
The decisive criterion of quality for any molecular descriptor is its ability to describe structure-related properties of molecules. With this objective we developed QSPR models to describe six physicochemical properties of octane isomers. This selection is recommended because most of the physicochemical properties commonly studied in QSPR analyses with topological indices are interrelated for data sets of compounds with different molecular weights, for instance for alkanes with two to nine carbon atoms. These correlations are not necessarily observed when the same indices are used in isomeric data sets of compounds, such as the octane data set. In addition, these properties are hardly interrelated when octanes are used as a data set [59]. On the other hand, all topological indices are designed to increase (gradually) with increasing molecular weight. Thus, if we carried out the present study using a series of compounds having different molecular weights, we would find "false" interrelations between the indices through an overestimation of the size effects inherent to these descriptors [13,52]. The same is also valid when the QSPR model is to be obtained. It is not difficult to find "good" linear correlations between TIs and physicochemical properties of alkanes in data sets with great size variability [13,52]. In fact, the simple use of the number of vertices in the molecular graph produced regression coefficients greater than 0.97 for most of the physicochemical properties of the C2-C9 alkanes studied by Needham et al. [60]. However, when data sets of isomeric compounds are considered, correlations that had high correlation coefficients for molecules of different sizes typically no longer show such good linear correlation. In conclusion, if a newly proposed molecular descriptor is not able to model the variation of at least one property of octanes, then it probably does not contain any useful molecular information. Moreover, octanes constitute a good set of chemicals for comparative study, since many experimental data on their physicochemical properties are available. In this sense, we analyzed the quality of the QSPR models obtained to describe the boiling point (BP), motor octane number (MON), heat of vaporization (HV), molar volume (MV), entropy (S), and heat of formation (∆fH) of the octane isomers [54-58]. Specifically, to evaluate the quality of the models based on our new bond-level chemical descriptors we have taken as references: 1) the models published by Randić [54-56] based on diverse topological indices such as the Wiener matrix invariants, 2) the equation published by Diudea [58] based on the SP indices, and 3) the best models obtained with a set constituted by the topological (69), WHIM (99), and GETAWAY (197) descriptors [53].
The total and local (bond-type) bond-based linear indices used to search for the best regressions of the selected physicochemical properties of octanes were calculated with the TOMOCOMD-CARDD (acronym of TOpological MOlecular COMputer Design-Computer Aided "Rational" Drug Design) program [61]. This software is an interactive program for molecular design and bioinformatic research. It was developed following a user-friendly philosophy; that is to say, this computer graphics software interacts efficiently with users who have no prior programming knowledge (e.g., practicing pharmaceutical and organic chemists, teachers, university students, and so on). The CARDD subprogram allows drawing the structures (drawing mode) and calculating 2D (topologic), 3D-chiral (2.5D), and 3D (geometric and topographic) non-stochastic and stochastic molecular descriptors (calculation mode). The bond-based TOMOCOMD-CARDD descriptors computed in this study were the $k$th total and local (bond-type) bond-based linear indices, which were used as molecular descriptors for deriving the QSARs.
One of the difficulties with a large number of descriptors is deciding which ones will provide the best regressions, considering both goodness of fit and the chemical meaning of the regression. Genetic algorithms (GAs) are a class of algorithms inspired by the process of natural evolution, in which species having a high fitness under some conditions can prevail and survive to the next generation; the best species can be adapted by crossover and/or mutation in the search for better individuals [64-70]. The software BuildQSAR [71] was employed to perform variable selection and QSAR modeling. The mutation probability was specified as 35%. The length of the equations was set to three or four terms plus a constant. The population size was established as 100. The GA with an initial population size of 100 converged rapidly (200 generations) and reached an optimal QSAR model in a reasonable number of GA generations.
The search for the best model can be carried out in terms of the highest correlation coefficient (R) or F-test values (Fisher ratio's p-level [p(F)]) and the lowest standard deviation (s) [71]. The quality of the models was also determined by examining the leave-one-out (LOO) cross-validation (CV) statistics ($q^2$, $s_{cv}$) [72]. In recent years, the LOO press statistics (e.g., $q^2$) have been used as a means of indicating predictive ability. Many authors consider high $q^2$ values (for instance, $q^2$ > 0.5) as an indicator or even as the ultimate proof of the high predictive power of a QSAR model.
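To make the LOO statistic concrete, the sketch below computes $q^2$ for an ordinary least-squares line with a single descriptor. It is a simplified illustration, not the BuildQSAR procedure: the models in this work use several descriptors, the formula $q^2 = 1 - \mathrm{PRESS}/SS_{tot}$ (with the mean of all observed values in $SS_{tot}$) is one common convention assumed here, and the descriptor and property values are invented toy numbers.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_q2(xs, ys):
    """Leave-one-out cross-validated q2 = 1 - PRESS / SS_tot."""
    n = len(ys)
    mean_y = sum(ys) / n
    press = 0.0
    for i in range(n):
        # Refit the model without compound i, then predict compound i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot


# Hypothetical descriptor values and property values (toy data only).
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
prop = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

q2_toy = loo_q2(descriptor, prop)
q2_exact = loo_q2(descriptor, [2.0 * x + 1.0 for x in descriptor])
print(round(q2_toy, 4), round(q2_exact, 4))
```

A perfectly linear data set gives $q^2 = 1$, and any prediction error on the left-out compounds lowers the value, which is why $q^2$ is read as a measure of stability to data variation.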
The best linear models found using non-stochastic and stochastic total and bond-type linear indices are presented in Table 1. For each selected property of the octane isomers, the statistical information for the best regressions with 1, 2, and 3 molecular descriptors published so far is also given in Table 1. Together with the LOO cross-validated explained variance ($q^2_{LOO}$), the determination coefficient ($R^2$), the standard error of the estimate (s), and the Fisher ratio (F) are listed. The molecular descriptor symbols are reported in the eighth column, and the last column of the table contains the references of the models taken from the literature.

As can be appreciated from the statistical parameters of regression equations in Table 1, all of the physicochemical properties were well described by bond-based linear indices.
In this table we can also observe the statistical parameters of the models obtained with the bond-based linear indices alongside those of the models taken from the literature. According to the QSPR results obtained, it is possible to conclude that the new descriptors encode useful molecular information different from that of previously proposed descriptors. Moreover, they are quite diverse among themselves, being able to describe well the variation of different properties of octanes.

CONCLUDING REMARKS
The total and local (bond and bond-type) linear indices of the non-stochastic and stochastic edge-adjacency matrices are novel sets of graph-theoretical descriptors. These indices have a series of important features that make them useful molecular descriptors to be employed in QSPR/QSAR studies, similarity/diversity analysis, and drug design protocols. The correlations found with these new sets of bond-level chemical descriptors for the description of six representative physicochemical properties of octane isomers can be considered statistically significant. The approach described in this paper appears to be a promising method for finding in silico models to describe physical, chemical, and biological properties. Applications of these new descriptors in molecular property/activity modeling, similarity/diversity analysis, and biosilico drug discovery will be published in subsequent papers.

Finally, in addition to the total and bond linear indices computed for each bond in the molecule, a local-fragment (bond-type) formalism can be developed. The $k$th bond-type linear index of the edge-adjacency matrix is calculated by summing the $k$th bond linear indices of all bonds of the same bond type in the molecule. That is to say, this extension of the bond linear index is similar to group-additive schemes, in which an index appears for each bond type in the molecule together with its contribution based on the bond linear index. Consequently, if a molecule is partitioned into Z molecular fragments, the total non-stochastic [or stochastic] linear indices can be partitioned into Z local non-stochastic [or stochastic] linear indices $f_{kL}(w)$ [or ${}^{s}f_{kL}(w)$], L = 1, ..., Z. That is to say, the total (both non-stochastic and stochastic) linear indices of order k can be expressed as the sum of the local linear indices of the Z fragments of the same order:

$$ f_k(w) = \sum_{L=1}^{Z} f_{kL}(w), \qquad {}^{s}f_k(w) = \sum_{L=1}^{Z} {}^{s}f_{kL}(w) $$

For the sample calculation, Eq. 2 with Pauling electronegativities gives:

$w_1 = x_C/1 + x_C/3 = 2.55/1 + 2.55/3 = 3.4$
$w_2 = x_C/3 + x_C/4 = 2.55/3 + 2.55/4 = 1.4875$
$w_3 = x_C/4 + x_C/4 = 2.55/4 + 2.55/4 = 1.275$
$w_4 = x_C/4 + x_N/3 = 2.55/4 + 3.04/3 = 1.650833$
$w_5 = x_C/4 + x_O/1 = 2.55/4 + 3.44/1 = 4.0775$

and therefore, W = [3.4, 1.4875, 1.275, 1.650833, 4.0775]

$f_3(w) = f_3(w_1) + f_3(w_2) + f_3(w_3) + f_3(w_4) + f_3(w_5)$ = 11.46583 + 37.51083 + 34.65 + 12.79 + 24.25583 = 120.6725

The terms in the summations for calculating the total linear indices are the so-called bond linear indices. We have written these terms in the consecutive order of the bond labels in the graph. For instance, the non-stochastic bond linear indices of order 0, 1, 2, and 3 for the bond labeled as 1 are 3.4, 1.4875, 8.7525, and 11.46583, respectively. The $k$th total stochastic linear indices are likewise the sums of the $k$th local (bond) stochastic linear indices of all bonds in the molecule.

1) $k$th (k = 15) total non-stochastic bond-based linear indices, not considering and considering H-atoms in the molecular graph (G) [$f_k(w)$ and $f_k^H(w)$, respectively].
2) $k$th (k = 15) total stochastic bond-based linear indices, not considering and considering H-atoms in the molecular graph (G) [${}^{s}f_k(w)$ and ${}^{s}f_k^H(w)$, respectively].
3) $k$th (k = 15) bond-type (C-H in methyl group) non-stochastic and stochastic linear indices, considering H-atoms in the molecular graph (G) [$f_{kL}^H(w_{C-H})$ and ${}^{s}f_{kL}^H(w_{C-H})$, respectively]. These local descriptors are calculated taking into account only one of the three carbon-hydrogen bond types (C primary-H) present in the octane data set.
The statistical parameters of the models obtained with bond-based linear indices to describe the motor octane number (MON) (Eqs. 15 and 16) and molar volume (MV) (Eqs. 19 and 20) of octanes are better than those of the models taken from the literature. The latter property, MV, is well described exclusively by the bond-based linear indices. Note also that, among the models based on the bond-level linear indices, the two regressions for the heat of vaporization (HV) (Eqs. 17 and 18) are better than or similar to the models published so far. Only the models found by us to describe the boiling point (BP) (Eqs. 13 and 14), entropy (S) (Eqs. 21 and 22), and heat of formation (∆fH) (Eqs. 23 and 24) differ significantly from the preceding models obtained by applying the selection procedure to the set formed by the GETAWAY descriptors plus the WHIM and topological indices.

Table 1. Statistical Information for Best Multiple Regression Models of Selected Physicochemical Properties of Octane Isomers.