Role of Vertex Index in Substructure Identification and Activity Prediction : A Study on Antitubercular Activity of a Series of Acid Alkyl Ester Derivatives

Tuberculosis (TB) is a life threatening disease caused due to infection from Mycobacterium tuberculosis (Mtb). That most of the TB strains have become resistant to various existing drugs, development of effective novel drug candidates to combat this disease is a need of the day. In spite of intensive research world-wide, the success rate of discovering a new anti-TB drug is very poor. Therefore, novel drug discovery methods have to be tried. We have used a rule based computational method that utilizes a vertex index, named ‘distance exponent index (D)’ (taken x = –4 here) for predicting anti-TB activity of a series of acid alkyl ester derivatives. The method is meant to identify activity related substructures from a series a compounds and predict activity of a compound on that basis. The high degree of successful prediction in the present study suggests that the said method may be useful in discovering effective anti-TB compound. It is also apparent that substructural approaches may be leveraged for wide purposes in computer-aided drug design. (doi: 10.5562/cca2306)


INTRODUCTION
Use of computational methods for the prediction of the pharmacological and toxicological activities of chemical compounds is one of the core areas of study in drug discovery research. 1The conventional methods are primarily based on the use of molecular properties that can explain such activities and several statistical / chemometric approaches have been adopted over the years in this purpose. 2,3In a traditional medicinal chemistry approach chemists try to find either a new lead compound or, an effective congener of some series of compounds by suitably changing the substituents in different possible ways at different positions of the core structure or, scaffold.However, such traditional methods become intractable when one has to deal with a large number of compounds.In order to work with such problems the need for using mathematical / statistical tools has become essential.
Molecular properties which have been of interest for long are broadly categorized as steric, electronic and hydrophobic in nature. 4The idea is to interpret biological activities of chemical compounds in terms of such molecular properties.In mathematical term, biological activity (BA) of a chemical compound may be thought of as a function ( f ) of such molecular properties: BA (compound)  f(Steric, Electronic, Hydrophobic).This categorization has been found to be quite in line with how the activity of a drug molecule in a physiological milieu can be interpreted in a reasonable manner. 1,4owever, for carrying out quantitative or, qualitative evaluation of biological activities of chemical compounds, quantitative molecular descriptors / indices giving structural and molecular property information can be extremely helpful. 1 Looking into their applicability, use of such indices has become an integral part of today's drug discovery programs.Some of the widely used molecular properties / descriptors are van der Waals volume, Hydrophobicity (logP), Molar Refractivity (MR), steric parameters proposed by Taft, Hammett's electronic parameters, quantum chemical parameters like HOMO, LUMO etc. 1 Although descriptors for whole molecules have been used widely in explaining and predicting biological activities, descriptors for substructure(s) / certain part(s) of a molecule such as a substituent which may be the sole or, primary determinant of the compound's activity can also play important role in this purpose.Those structural components may be responsible for a drug molecule's interaction with the target biomolecule(s) or, for having adequate properties of the compound to become drug able.Also, such a substructure may be present in a molecule having distinctly different core structure i.e., other than similar type of structures in, say, a series of congeners and the presence of such an activity related substructure may increase the possibility of making the compound active.Surmising the importance of the role substructural descriptors may play in activity prediction, physicochemical substructural descriptors, known as substituent constants, have been proposed and dealt with by Hansch and coworkers. 5During the same period, Free and Wilson 6 proposed a method for quantitative prediction of activity using indicator variables.In this approach, 6 digits 1 and 0 are used to mark, respectively, the presence and absence of a particular molecular fragment in a compound.
As required for practical purposes, all such molecular descriptors / properties are physical / physicochemical in nature and can serve the purpose of propertyactivity correlation studies very well. 5,7However, such descriptors do not convey the structural information such as the way atoms are connected in a molecule or, the minimum / low energy conformation of a molecule.Such structural information is important in understanding the structural requirements of a compound for becoming biologically active as well as in explaining the changes in the activities of a chemical compound with the changes in its structure.In order to do that, one requires structural descriptors of chemical compounds for having structural interpretation of their biological activities.Over the years, structural descriptors have been derived from molecular structures in the form of 2D (two-dimensional) descriptors [7][8][9][10][11][12] using methods based on graph theory 13 and are derived from molecular graph models of the structural formula of chemical compounds and as 3D (three-dimensional) descriptors 14 derived from 3D structural information of chemical compounds obtained from X-ray crystallographic studies or, developed using 3D structure generating computational tools.While 3D descriptors reflect some physical characteristics of a molecule or, parts of it such as those obtained from 3D co-ordinates / conformations of a molecule, 14 2D descriptors translate into numbers the topological characteristics of a molecule such as, branching, cyclicity, neighborhood etc. [15][16][17][18][19][20][21] which are some of the fundamental structural characteristics of a chemical compound.
Molecular structural descriptors which are of much use in computer-aided drug design are derived for the whole molecule as it is done for computing physical / physicochemical properties.However, as discussed earlier, since certain parts of a molecule can be the sole or, major determinant of biological activities of chemical compounds, substructural descriptors may also find useful applications in interpreting biological activities.Like structural descriptors for whole molecule, substructural descriptors too can be of both 3D type 14 and 2D type. 22,23So far computation of substructural descriptors is concerned, 3D descriptors are computed from 3D co-ordinates of chemical compounds 14 and 2D descriptors are computed from the subgraphs 13 representing structural components of chemical compounds.It may be noted, that studies on molecular fragments that contain both chemical nature of the atoms and the bonding pattern i.e., single bond, double bond etc, is also an active area of research 3,24 in the field of computational drug discovery.
For the present purpose, our interest lies on the application of substructural / vertex indices derived from molecular graphs. 23Balaban, in a series of papers, 22,25,26 has discussed on the development of local / vertex indices from various structural and mathematical considerations.Such methods are based on vertex degrees, topological distances between pairs of vertices, Eigen values obtained from the adjacency matrix of a graph, information-theoretical formalism etc.In general, most of such local indices are combined together to compute molecular indices and are used for structure-property and structure-activity correlation studies. 7,11More recently, several vertex degree based local indices have been used for the same purpose. 279][30][31][32] For their work, vertex indices have been computed from the connected graph 13 models of chemical structures.The indices are computed from the topological distances of all the vertices in a molecular graph with respect to individual vertices of that graph.From structural point of view, each one of such vertex indices translates the connectivity of a tree rooted at a given vertex in the graph having same number of vertices as that of the graph itself.For example, for a graph of 8 vertices, one would get eight vertex indices for eight rooted trees of 8 vertices.Thus, for a series of bioactive compounds, containing both active and inactive molecules, one gets a large number of vertex indices corresponding to as many rooted trees for all the atoms of active and inactive molecules.Generally, atoms (vertices) situated at a similar type of topological environment in different molecules, active or, inactive, have rooted trees very close in their structures.It has been found that such vertex indices have fair amount of discriminating power which is helpful in identifying activity related substructures (rooted trees) effectively from those which are not related to activity using certain Croat.Chem.Acta 87 (2014) 39.
9][30][31][32] Moreover, since vertex indices are computed for each atom of a molecule, it becomes possible to identify all potential activity related substructures present in a molecule.Understandably, this rule based system is meant for learning the structural requirement for the activity of a compound from substructural descriptors and predict activity.
It seems worth noting that referring to the potentiality of using local indices for activity prediction, done by Klopman and Raychaudhury, 30 Pepperrell 14 proposed 3D local descriptors for use in molecular similarity studies.In this approach, the entries of the distance matrix are the Euclidean distances between pairs of vertices (instead of topological distances) of the hydrogen suppressed graph of a molecule and the distances are computed from the 3D atomic co-ordinates. 14This further emphasizes on the applicability of local / substructural indices in the studies where structural / substructural descriptors are used.
In the present paper, we have used the earlier mentioned rule based system 23,[28][29][30] along with a vertex index, 23,31,32 named "distance exponent index" (D x ), x being the exponent which can assume any real number value, in identifying activity related substructures from a series of acid alkyl ester derivatives 33 and in predicting their anti-TB activities.Tuberculosis (TB) is a life threatening disease spreading over a large section of the globe and a number of TB strains are already resistant to several existing drugs. 33Therefore, it seems necessary to carryout research for developing novel anti-TB drugs, may be using novel method too such as the one used here, to evaluate its usefulness for this series of anti-TB compounds.So far vertex index is concerned, we have considered x = -4, i.e., have used D -4 index for carrying out this study since it has produced successful predictions for different series of bioactive compounds. 23,31,32egarding activity prediction, this method has been used to predict antitubercular activity, MIC (μM), of the said series of acid alkyl ester derivatives 33 which have been found to be effective in inhibiting the growth of replicating Mtb (R-TB) strains H 37 Rv in a microplate Alamar blue assay (MABA). 33Here we have used the afore mentioned method to identify activity related substructures by training the system with a set of compounds (Training Set) from acid alkyl ester derivatives 33 and have kept few compounds (Test Set) for testing the predictive ability of the system.Subsequently, the activities of both training set and test set compounds have been predicted using the trained system.The results show a high degree of successful prediction.This suggests the usefulness of the present method in anti-TB drug discovery and the bigger role that vertex indices may play in computer-aided drug design in general.

Methodology
Local / vertex indices which have so far been used for characterizing chemical structure and correlating molecular properties or, activities are based mainly on vertex degree or, distances between pairs of vertices in a molecular graph. 22,25,26One of the traditional ways of computing such local indices is to use adjacency matrix associated with a molecular graph for obtaining degree based indices and distance matrix for distance based indices.The adjacency and distance matrices for a graph G representing carbon skeletal of one of the pentane isomers is shown in Figure 1.It may be noted that both A(G) and D(G) are square symmetric matrices having all the main diagonals entries as zero. 13It is so, because in such connected molecular graphs no vertex is adjacent to itself (unless one introduces a self-loop on a vertex) and each vertex is at a distance zero from itself.However, such matrices can be used to extract some important information.For example, the sum of all the entries of any row / column of an adjacency matrix gives the degree 13,15 of the corresponding vertex.This value may give the valency of the atom it is representing if hydrogen-filled graphs are considered.Again, the sum of all the entries of any row / column of D(G), known as distance sum, 9,10,16,17 is obtained from the distance distribution of all the vertices of G from the given vertex.From structural point of view, this distance distribution helps get some interesting features.For example, if there are two vertices at a distance one and two vertices at a distance two from a vertex v in a connected graph, then one gets two graphs from this distance distribution v(1,2, 2), 1 for v itself G: and 2 each for other vertices at distances one and two from v (Figure 2).The point of interest here is that these two graphs, generated from the same distance distribution, represent two different molecules although they would be very similar to each other.Therefore, it may be assumed that two vertices having the same or, very close distance distributions associated with them would have closeness in their structures too.This aspect may be used for identifying similar substructures present in different compounds.Now, using the distance distribution one can compute substructural or, vertex indices for the vertices of a series of compounds.As discussed earlier, an index having reasonable discriminating power may be preferred.The index to be used for the present study 'distance exponent index' 23,31,32 is believed to have the required discriminating power and is denoted by where D is a distance value that gives the topological distance of any other vertex in a graph from a given vertex v for which the index is being computed.The index D x is computed considering an undirected, unweighted and unlabelled connected graph 13 model of a chemical compound and is done as follows: Let there be n i vertices at a distance d i from a vertex v in a molecular graph for which the vertex index D x has to be computed and d i , i = 1, 2, …, m, are different topological distances 10,13,16,17 of those n vertices from v (vertex v lies at a distance zero from itself) in that molecular graph.Then the index D x for vertex v may be computed from Equation (1).

( ) ×( )
The exponent x can assume any real value.However, for the present study we have used x = -4 as this exponent value has produced high level of successful prediction for different series of compounds. 23,31,32omputation of the index D -4 is illustrated below for vertex representing the carbon atom of the orthosubstituted methylthio group in the phenyl ring of compound No. 14 (Supplementary info S2).This carbon atom is atom No. 20 of the said molecule.
For the vertex, say u, representing the above mentioned atom, D -4 index D -4 (u) for vertex u may be com-puted from the topological distances of all other vertices in the molecular graph, representing compound 14, from vertex u.Since there are 4, 1, 2, 4, 5, 3, 2, 2, 1, 1, 2, 1, 3 and 3 vertices at distances 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14, respectively from u, D -4 (u) may be computed as follows: For working with this vertex index, D -4 is computed for all the vertices (atoms) of hydrogen-filled graph models of all the compounds in the training set.All these index values for the training set are then arranged in an ascending order.By doing that, a number of segments in the ordering are found to be formed by values coming from active molecules and a number of other segments where the values are contributed by inactive compounds.This observation has led us to formulate rules 23 to identify active and inactive ranges which may be used to predict activity of both the training set compounds (retrofit) as well as for the test set compounds (may be done for newly discovered or, designed compounds as well).However, it may be noted that although D -4 values have been computed from the hydrogenfilled graph models of all the compounds in the data set in order to capture the entire topological architecture of the molecules, the activity prediction has been carried out, for both training set and test set, considering the index values of non-hydrogen atoms only since that gives the essential connectivity information of a molecule with hydrogen atoms connected to them as required.Thus, for the purpose of activity prediction, D -4 index values of only non-hydrogen atoms have been considered.The rules used for identifying active and inactive ranges in the ordering of the values of D -4 index are as follows: 1. Three or, more consecutive index values coming exclusively from active molecules or, exclusively from inactive molecules are considered to form an active range or, an inactive range, respectively.However, at least three of them have to be distinct when coming from the same molecule or, at least two of them have to be distinct when they come from different molecules.2. Some single value coming from both active and inactive compounds is not considered to form, say, an active range by itself or, together with other values unless more than two-third of the molecules contributing this value are active.The same rule is applied in identifying inactive ranges too.Once such active and inactive ranges are identified, the activity of a compound is predicted from the occurrences of D -4 values for the non-hydrogen atoms of its molecular graph in active and inactive ranges.Another set of rules 23 are followed for the prediction purposes.Accordingly, a compound is predicted to be active if the D -4 index values of its vertices (for nonhydrogen atoms) fall: 1.Only in active range(s), or, 2. In both active and inactive ranges and the number of index values in the active ranges is greater than the number of index values in inactive ranges.
Otherwise the compound is predicted to be inactive.

RESULTS AND DISCUSSION
The present work has been carried out for the identification of activity related substructures and for the prediction of anti-TB activity of a series of acid alkyl ester derivatives using vertex index D -4 and the rules 23 described earlier here.For this study, we have taken 41 such compounds from the literature. 33Few more compounds which come with both cis-and trans-isomers 33 have not been considered since such isomers cannot be discriminated by the molecular topology based method used here for computing vertex indices.The compounds along with their activity data, minimum inhibitory concentration (MIC), have been given in Table 1.The activity data, MIC (μM), is a measure of anti-TB activity as obtained from microplate Alamar blue assay (MABA) against R-TB strain H 37 Rv. 33The compounds have been grouped as active and inactive on the basis of a cut off in their activity values defined suitably for this study.In doing that we find that compounds which have MIC values ≤ 3.9 are among the potentially active compounds and for most of the other compounds the MIC values are much higher than 3.9.Accordingly, we have considered compounds having MIC ≤ 3.9 as active and the remaining compounds as inactive.On the basis of this criterion, 21 compounds have gotten active tag and the remaining 20 compounds inactive tag.This also makes the data set unbiased to any particular activity type (active / inactive).
In order to evaluate the predictive power of this method for the present series of anti-TB compounds, 33 this data set of 41 compounds has been divided into training set and test set (Table 1).The idea is to train the system with the topological substructural information of the training set compounds and then predict the activity of the compounds present in both training set (retrofit) and the test set.In particular, successful prediction for the test set compounds is more important since those Furthermore, in order to validate the results given in Table 1 as well as for all the 10,000 training set-test set (Tr-Ts) combinations studied here, we have computed validation parameters Sensitivity (Sen), Specificity (Spe) and Accuracy (Acc) for these combinations and the results are given in Table 2.It clearly indicates that the accuracy of prediction is very high for the combination given in Table 1 as well as for the entire 10,000 combinations of training sets and test sets.Also, high maximum values for the validation parameters, sensitivity and specificity indicate that most of the active as well as inactive compounds have been classified correctly.Low minimum values seem to be rightly indicating that some of the combinations have not produced for 10,000 runs which means there are 3 mispredictions in 9 Tr-Ts combinations, 4 in 210 combinations and so on.A consolidated overview may be obtained by looking at the accuracies of 10,000 randomly chosen Tr-Ts combinations (Supplementary Figure S1).As mentioned earlier, it is clear from this result that all the Tr-Ts combinations may not produce useful classification and prediction since all the training sets may not be proper representation of the total data set (41 compounds in our case).Accordingly, the average values of the validation parameters (Table 2) are the results of this mixed outcome, lying in between maximum and minimum values.This is very much in line with the present purpose where we attempt to find suitable training sets that can be used for effective classification of compounds and subsequently for useful activity prediction.The activity prediction result given in Table 1 is one of the better outcomes obtained out of the 10,000 Tr-Ts combinations.So far the procedure of predicting activity of a compound by the present method is concerned activity prediction of compound No. 14 of the test set (Table 1) has been given in Table 3.It is interesting to note from Table 3 that D -4 values of most of the atoms (vertices) of this compound (Figure 3) have fallen in active ranges compared to only one falling in an inactive range.Moreover, those atoms which are falling in active ranges are present at the important parts of the molecule which are believed to make the compound active.It suggests that the index D -4 has been able to identify activity related substructures effectively making the  predictions highly significant as apparent from the activity prediction of compound No. 14 (Table 3) which is one of the most active compounds in the dataset considered here and the compound has been correctly predicted as active.One can also find from the supplementary file that introduction of chlorine (Cl) atoms on the aromatic ring gave the most active compound of this series and the vertices representing these atoms have fallen in an active range (values 1.198245-1.198606,nos.82-86 in the ordering given in the supplementary info S3) indicating that the range selection is capable of identifying potential activity related substructures.More such instances may be found in the ordering of index values given in the supplementary info S3.
It is apparent from the present study that the index D -4 along with the rules described here may find useful application in the studies for antitubercular drug design with the acid alkyl ester derivatives.At the same time, results obtained here are encouraging enough for carrying out studies with various other series of anti-TB compounds.That may help in investigating the usefulness of the present method in antitubercular drug design in greater detail.It may pave the path for discovering novel anti-TB compounds as well through chemical database searching using the identified activity related substructures.This may also help design novel anti-TB drugs using graph reconstruction technique 30 based on the distance distribution associated with the vertex indices of activity related substructures which may facilitate design of structures having very different and diverse core or, scaffold.Moreover, the present method is also believed to reflect some kind of topo-isosterism.For example, if the elements sulfur (S) and oxygen (O) are interchanged at the same position of a compound and the compound remains active, then the vertex index values for both of them would be designated as active (+) and that would help forming an active range.On the other hand, if one is active and the other is inactive, the index value will have both positive (+) and negative (-) tags and that may go against forming an active range and subsequently in activity prediction.This may be the case for halogen substitutions as well.
In addition to that, elaborate studies on creating useful training set-test set combination may help identify activity related substructures more appropriately, perhaps of wide variety too, which may, in turn, help predict activities with higher precision.This can be evaluated from acceptable successful prediction of activities of test set compounds as well as for a set of newly designed molecules.The predicted activities of the newly designed molecules may be verified experimentally by synthesizing the compounds.On a broader perspective, we feel tempted to believe by looking at the findings of the present study as well as at those obtained earlier 23,[28][29][30][31][32] that an approach based on vertex / local / substructural indices along with machine learning techniques and / or, different mathematical / statistical methods may find a bigger role to play in computer-aided drug design and establish the importance of using vertex indices in structure-activity analyses on a stronger platform.Supplementary Materials.-Supporting informations to the paper are enclosed to the electronic version of the article.These data can be found on the website of Croatica Chemica Acta (http://public.carnet.hr/ccacaa).3.

1 2 0 Figure 1 .
Figure 1.Adjacency matrix A(G) and distance matrix D(G) of graph G representing the hydrogen-suppressed structure of one the pentane isomers.

2 .
Figure 2. Adjacency matrix A(G) and distance matrix D(G) of graph G representing the hydrogen-suppressed structure of one the pentane isomers.
are D -4 index values of non-hydrogen atoms in compound No. 14. (X) = (A), (I) means that the value falls in active, inactive ranges respectively.(X) = (N) means the value does not fall in any active or, inactive range.Start = the starting D -4 value in the range for both active and inactive ranges.End = The last D -4 value in the range for both active and inactive ranges.A-Count = No. of values coming from active molecule in that range.I-Count = No. of values coming from inactive molecule in that range.Croat.Chem.Acta 87 (2014) 39.

Figure 3 .
Figure 3. Heavy atom labeled 2D diagram of Compound No. 14 as given in the supplementary file S1 and used for illustration in Table3.

Figure S1 .
Figure S1.Accuracies for 10000 random training and test sets

Table 1 .
Experimentally determined MIC values and Assigned and predicted qualitative activities of 41 acid alkyl ester derivatives divided into 29 training set compounds and 12 test set compounds different from the training set compounds and are therefore new to the trained system.As a matter of fact, one can always generate various combinations of training set and test set for a given series of compounds.In order to explore this aspect, 10,000 such combinations of training set of 29 compounds and test set of 12 compounds have been generated from the present data set.The number of compounds kept in training set and test set seems to be suitable enough both for training the system with the substructural information as well as for testing the predictive power of the trained system.However, care has been taken to keep the number of active and inactive molecules almost equal in number in the training sets in order to keep it as much unbiased as possible.The remaining 12 compounds in each combination have been kept as a test set where too the number of active and inactive compounds is equal in number (6 each).Interestingly (and perhaps expectedly too) a number of such combinations of training and test sets produced high degree of successful activity prediction.The reason may be that the training set compounds in such cases represent the actual data set more appropriately than others and therefore are able to predict activities of the compounds more effectively compared to those which have not been able to produce acceptable activity prediction.The assigned and predicted activities of one of such better combinations of training set and test set have been furnished in Table1.The predictions have been made on the basis of the occurrences of the compounds' vertex index values in active and inactive ranges obtained from the ordering of the vertex index values for compounds of the training set given in Table1.For this training set, 91 ranges have been identified out of which 49 ranges are active ranges and 42 ranges are inactive ranges.Table1shows a high degree of successful prediction with only 2 compounds in both training set and test set being wrongly predicted.While for the training set, compound Nos.12 and 27 have been mispredicted, compound Nos. 2 and 13 have been misclassified for the test set.Furthermore, out of those two, only one active has been mispredicted in both training set (No. 27) and test set (No. 13).It seems important not to have many active compounds predicted inactive (false negative) since it may not be desirable in situations where one does not want to miss out potentially active compounds in the process of screening.It may be noted that the number of active and inactive ranges in an ordering depends on the training set used for a study / run and may vary from training set to training set.

Table 2 .
Experimentally determined MIC values and Assigned and predicted qualitative activities of 41 acid alkyl ester derivatives divided into 29 training set compounds and 12 test set compounds

Table 3 .
Illustration for activity prediction of ACTIVE compound No.14 given in Figure3

Table S2 :
Ordering of vertex Index D -4 for all the atoms of the training set