Models for Antitubercular Activity of 5′-O-[(N-Acyl)sulfamoyl]adenosines

The relationship between topological indices and the antitubercular activity of 5′-O-[(N-Acyl)sulfamoyl]adenosines has been investigated. A data set consisting of 31 analogues of 5′-O-[(N-Acyl)sulfamoyl]adenosines was selected for the present study. The values of numerous topostructural and topochemical indices for each of the 31 differently substituted analogues were computed using an in-house computer program. The resulting data were analyzed and suitable models were developed through decision tree, random forest and moving average analysis (MAA). The goodness of the models was assessed by calculating the overall accuracy of prediction, sensitivity, specificity and Matthews correlation coefficient. The pendentic eccentricity index, a novel, highly discriminating, non-correlating, pendenticity-based topochemical descriptor, was also conceptualized and successfully utilized for the development of a model for antitubercular activity of 5′-O-[(N-Acyl)sulfamoyl]adenosines. The proposed index exhibited not only high sensitivity towards both the presence and the relative position(s) of pendent atom(s)/heteroatom(s) but also a significant reduction in degeneracy. Random forest correctly classified the analogues into active and inactive with an accuracy of 67.74%. A decision tree was also employed for determining the importance of molecular descriptors. The decision tree learned the information from the input data with an accuracy of 100% and correctly predicted the cross-validated (10-fold) data with an accuracy of up to 77.4%. The statistical significance of the proposed models was also investigated using intercorrelation analysis. The accuracy of prediction of the proposed MAA models ranged from 90.4 to 91.6%.


Introduction
In the pharmaceutical industry, much effort is being devoted to developing new drugs [1]. The seven steps involved in the drug discovery process are: disease selection, target hypothesis, lead identification, lead optimization, pre-clinical trial, clinical trial and pharmacogenomic optimization. Traditionally, these steps are carried out sequentially, and if one of these steps is slow, it naturally slows down the entire process [2]. Considering both the potential benefits to human health and the enormous cost in time and money of drug discovery, any tool or technique that enhances the efficiency of any stage of the drug discovery enterprise will be highly prized [3]. A viable solution to this quagmire lies in the estimation of the necessary properties of molecules directly from their structure, without the input of any other experimental data, through quantitative structure-activity relationship (QSAR) models [4]. The main hypothesis in the QSAR/QSPR (quantitative structure-activity/property relationship) approach is that all properties (physico-chemical and biological) of a chemical substance are statistically related to its molecular structure [5]. Quantitative relations generated from such studies help in hypothesizing important contributions of specific structural aspects or chemical interactions in modifying physicochemical properties and biological activities, and also in predicting properties and activities of untested and not yet synthesized compounds [6]. Mathematical descriptors of molecular structure, such as various topological indices (TIs), have been widely used in structure-property-activity relationship studies [7]. Topological descriptors are mathematical entities encoding molecular graphs composed of vertices (corresponding to the atoms) and edges (representing the bonds among atoms).
These are two-dimensional descriptors which take into account the internal atomic arrangement of compounds, and encode in numerical form information about molecular size, shape, branching, presence of heteroatoms and multiple bonds [8]. One of the most interesting advantages of molecular topology is the straightforward calculation of topological descriptors [9] without the requirement of any experimentally derived measurement. The usefulness of TIs in QSPR and QSAR studies has been widely demonstrated, and they have also been used as a measure of structural similarity or diversity by their application to databases virtually generated by computer [10]. Although a large number of topostructural and topochemical indices of diverse nature have been reported in the literature, only a small proportion of them have been successfully employed in structure-activity relationships (SARs). Some of the topostructural and topochemical indices which have been successfully employed in SAR studies include Wiener's index [11], Hosoya's index [12], Randic's molecular connectivity index [13], Zagreb group parameters [14,15], Balaban's index [16], Schultz's index [17], molecular connectivity topochemical index [18,19], eccentric connectivity index [20], revised Wiener index [21], E-state index [22], eccentric connectivity topochemical index [23], Zagreb topochemical indices [24], and superaugmented eccentric connectivity indices [25].
Tuberculosis (TB), one of the oldest recorded human afflictions, is still one of the biggest killers among the infectious diseases, despite the worldwide use of a live attenuated vaccine and a combination of several antibiotics [26]. The disease spreads more easily in overcrowded places and in conditions of malnutrition and poverty, characteristics typical of developing countries. Tuberculosis is the commonest opportunistic disease in persons infected with human immunodeficiency virus [27]. Mycobacterium tuberculosis, the causative agent of TB, is the leading bacterial cause of infectious disease mortality. Mycobacterium tuberculosis and Yersinia pestis, the causative agent of plague, have been reported to be pathogens with serious ongoing impact on global public health and potential use as agents of bioterrorism [28]. The development of M. tuberculosis strains which are resistant to all of the current front-line antitubercular drugs has prompted worldwide efforts to develop new antibiotics to treat this notorious pathogen [29]. It is a well-known fact that iron is a required element for the growth and survival of M. tuberculosis in its host, and iron overload can be an exacerbating cofactor to tuberculosis [30]. Although iron's abundance in the earth's crust, spin state, and redox tuneability make it the most versatile among transition elements, the insolubility of ferric hydroxide at pH 7.4 limits the concentration of [Fe³⁺] (the free aqueous ion) to ~10⁻¹⁸ M. However, even below this concentration, free ferric ion is toxic. To avoid toxicity and regulate iron transport, the human serum iron transport protein, transferrin, maintains the free ferric iron concentration at about 10⁻²⁴ M [31]. In a mammalian host, the concentration of free iron in serum and body fluids is too low to support the growth of bacteria [32].
The ability of pathogens to obtain iron from transferrins, ferritin, hemoglobin, and other iron-containing proteins of their host is central to whether they can live or die [33]. Both pathogenic and saprophytic microorganisms have evolved sophisticated iron-acquisition systems to overcome iron deficiency imposed by host defensive mechanism and their environment. At the core of such systems is the production of small molecules known as siderophores, which are secreted into the extracellular space, tightly bind available iron, and then are reinternalized with their bound iron through specific cell surface receptors [34]. M. tuberculosis is reported to produce two series of structurally related siderophores, collectively known as the mycobactins, which are critical for virulence and growth. Mycobactin biosynthesis is initiated by MbtA, an adenylate-forming enzyme that catalyzes a two-step reaction and is responsible for incorporating salicylic acid into the mycobactins [35]. The reaction mechanism catalyzed by MbtA provides several opportunities to develop inhibitors against MbtA [32]. MbtA is an ideal target since it has no mammalian homologues [36]. Inhibition of siderophore biosynthesis has emerged as an attractive strategy to develop new antibiotics against pathogens which require siderophores for virulence [32].
In the present study, a pendenticity based topochemical descriptor termed as pendentic eccentricity index (in both topostructural and topochemical forms) has been conceptualized and successfully utilized along with existing TIs for development of models for prediction of antitubercular activity of 5'-O-[(N-Acyl)sulfamoyl]adenosines.

Dataset
A dataset comprising 31 analogues of 5'-O-[(N-Acyl)sulfamoyl]adenosines was selected for the present investigation [35]. The basic structures of 5'-O-[(N-Acyl)sulfamoyl]adenosines are shown in Fig. 1. Although the Ki app values reported are not a measure of the true inhibitor potency, the differences are reflective of free energy differences associated with inhibitor binding to MbtA, presuming equivalent modalities of inhibition [35]. All inhibitors were also evaluated against whole-cell M. tuberculosis H37Rv under iron-limiting and iron-rich conditions by Qiao et al. [35].
For the purpose of the present study, the analogues possessing Ki app values of ≤0.05 μM were considered to be active and analogues possessing Ki app values of >0.05 μM were considered to be inactive. Further, the analogues possessing MIC99 (minimum inhibitory concentration that inhibited >99% of cell growth) values of ≤12.5 μM in iron-deficient conditions and ≤50 μM in iron-rich conditions were considered to be active, and analogues possessing MIC99 values of >12.5 μM in iron-deficient conditions and >50 μM in iron-rich conditions were considered to be inactive.
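The activity cut-offs above can be written as two small labelling functions. This is only an illustrative sketch: the function and constant names are hypothetical and are not taken from the in-house program used in the study.

```python
# Illustrative sketch of the activity cut-offs stated in the text.
# All names here are hypothetical; only the threshold values come from the text.

KI_APP_CUTOFF_UM = 0.05           # Ki app <= 0.05 uM  -> active
MIC99_CUTOFF_IRON_DEFICIENT = 12.5  # MIC99 <= 12.5 uM -> active (iron-deficient)
MIC99_CUTOFF_IRON_RICH = 50.0       # MIC99 <= 50 uM   -> active (iron-rich)


def label_by_ki(ki_app_um: float) -> str:
    """Label an analogue from its apparent Ki value (in uM)."""
    return "active" if ki_app_um <= KI_APP_CUTOFF_UM else "inactive"


def label_by_mic99(mic99_um: float, iron_rich: bool) -> str:
    """Label an analogue from its MIC99 value (in uM) for the given iron condition."""
    cutoff = MIC99_CUTOFF_IRON_RICH if iron_rich else MIC99_CUTOFF_IRON_DEFICIENT
    return "active" if mic99_um <= cutoff else "inactive"
```

Note that the same MIC99 value can be labelled differently under the two iron conditions, because the cut-offs differ.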

Topological indices
Values of twenty-six topological indices [13-15, 18-20, 23-25, 37-50] of diverse nature used in the present study (Tab. 2) were calculated for all the analogues involved in the data set using an in-house computer program.

Decision tree
The decision tree (DT) methodology determines the activity of a chemical through a series of rules based on selection of descriptors [51]. The simplified mechanism of a decision tree is to find some rules for each class based on the descriptors of the training set. These rules are subsequently utilized for building a decision tree having several branches, each leading to a leaf with a given class assignment [52]. The name decision tree arises because the classification is done using a set of tests (or decisions) that are arranged in the form of a tree [53]. The prediction for a molecule reaching a given terminal node is obtained by majority vote of the training-set molecules reaching the same terminal node. The tree with the lowest cross-validation error is selected as the optimal tree [54]. In this study, the R program (version 2.1.0) along with the RPART library was used to grow the decision tree.

Random Forest
A random forest (RF) is an ensemble of unpruned classification trees created by using bootstrap samples of the training data to construct multiple trees (forests) and random subsets of variables to define the best split at each node, hence the name "random" forests [55,56]. Random forest operates by generating a user-defined number of decision trees, 100 in this application. Mathematically, an RF may be expressed as [57]:

$$\{T_1(X),\, T_2(X),\, \ldots,\, T_B(X)\}$$

where $T_1(X)$ is a single decision tree, $B$ is the number of trees (100 here) and $X$ represents a single molecular descriptor vector. In the present study, the RFs were grown with the R program (version 2.1.0) using the random forest library.

Moving average analysis
In order to develop single topological index based models for classifying data set into active and inactive analogues, moving average analysis (MAA) was applied. Index values of all the 26 chosen descriptors were analyzed and suitable models were developed after identification of the active ranges by maximization of moving average with respect to active compounds (<35% = inactive, 35-65% = transitional, >65% = active) [44,54]. Subsequently, each analogue of data set was assigned a biological activity using these models, which was then compared with the reported activity [35].
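The range assignment described above can be sketched as follows. This is one plausible reading rather than the procedure actually used: the text gives only the percentage thresholds, so the window size and the function names are assumptions made for illustration.

```python
# Sketch of moving average analysis (MAA) range labelling.
# Assumption: compounds are sorted by increasing index value and a fixed-size
# sliding window is used; the window size is hypothetical (not given in the text).


def moving_average_percent_active(flags, window):
    """Percent of active compounds in each sliding window.

    flags: list of 1 (active) / 0 (inactive), ordered by increasing index value.
    """
    percents = []
    for start in range(len(flags) - window + 1):
        chunk = flags[start:start + window]
        percents.append(100.0 * sum(chunk) / window)
    return percents


def range_label(percent_active):
    """Apply the thresholds from the text: <35% inactive, 35-65% transitional, >65% active."""
    if percent_active > 65.0:
        return "active"
    if percent_active < 35.0:
        return "inactive"
    return "transitional"


if __name__ == "__main__":
    flags = [0, 0, 1, 1, 1, 0]  # toy data, not from the study
    for pct in moving_average_percent_active(flags, window=3):
        print(round(pct, 1), range_label(pct))
```

Contiguous windows sharing the same label would then be merged into the inactive, transitional and active index ranges of a model.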

Calculation of topological indices
Although a total of 26 indices were employed in the present study (Tab. 2), only 11 indices were ultimately shortlisted by either DT or MAA. Classification ability and the non-correlating nature of the TIs were the main criteria adopted for shortlisting TIs for MAA.

Wiener's topochemical index ($W_c$)
Wiener's topochemical index [41] is defined as the sum of the chemical distances between all pairs of vertices in the hydrogen-suppressed molecular graph. It is a refined form of the oldest and most widely used distance-based topological index, Wiener's index [11], and this modified index takes into consideration the presence as well as the relative position of heteroatom(s) in a molecular structure. It can be expressed as:

$$W_c = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} P_{i_c j_c}$$

where $P_{i_c j_c}$ is the chemical length of the path that contains the least number of edges between vertices i and j in the graph G, and n is the number of vertices in the hydrogen-depleted graph [41].
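As a minimal sketch of the pairwise-distance sum, the code below computes the topostructural Wiener index of a hydrogen-suppressed graph using unit edge lengths. The topochemical variant would substitute chemical path lengths, whose exact weighting scheme is not given in this section, so it is not attempted here. The adjacency-dictionary representation and function names are assumptions for illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def wiener_index(adj):
    """Half the sum of all entries of the distance matrix (each pair counted once)."""
    total = 0
    for u in adj:
        total += sum(bfs_distances(adj, u).values())
    return total // 2


if __name__ == "__main__":
    # Carbon skeletons as hydrogen-suppressed graphs (toy examples).
    n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    print(wiener_index(n_butane), wiener_index(isobutane))
```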

Molecular connectivity topochemical index ($\chi^A$)
The molecular connectivity topochemical index [18,19] is defined as the summation of the modified bond values of adjacent vertices for all edges in the hydrogen-suppressed molecular graph. It is a modified form of the widely used adjacency-based topological index, the molecular connectivity index [13,43], and it takes into consideration the presence as well as the relative position of heteroatom(s) in a molecular structure, as per the following equation:

$$\chi^A = \sum_{i=1}^{n} \left( V_i^c\, V_j^c \right)^{-1/2} \qquad \text{(Eq. 2)}$$

where n is the number of vertices, and $V_i^c$ and $V_j^c$ are the chemical degrees of adjacent vertices i and j forming the edge {i, j} in a graph G. The modified degree of a vertex can be obtained from the adjacency matrix by substituting the row element corresponding to a heteroatom with its relative atomic weight with respect to the carbon atom [18,19].
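A short sketch of the edge summation follows. It evaluates the sum of inverse square roots of degree products over all edges, taking the vertex degrees as an input mapping so that either ordinary degrees or chemical degrees could be supplied. The helper names are hypothetical, and the example uses ordinary (topostructural) degrees only.

```python
import math


def vertex_degrees(edges):
    """Ordinary vertex degrees computed from an edge list."""
    degree = {}
    for i, j in edges:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    return degree


def connectivity_index(edges, degree):
    """Sum of (degree[i] * degree[j]) ** -0.5 over all edges {i, j}.

    Passing chemical degrees instead of ordinary degrees would give the
    topochemical form described in the text.
    """
    return sum(1.0 / math.sqrt(degree[i] * degree[j]) for i, j in edges)


if __name__ == "__main__":
    propane_edges = [(0, 1), (1, 2)]
    print(connectivity_index(propane_edges, vertex_degrees(propane_edges)))
```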

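Superpendentic index ($\int P$)
A pendenticity-based graph invariant termed the superpendentic index, denoted by $\int P$, is calculated as the square root of the sum of the products of the non-zero row elements in the pendent matrix [49]. It is expressed as:

$$\int P = \left( \sum_{i=1}^{n} \prod_{j=1}^{m} d(i,j) \right)^{1/2} \qquad \text{(Eq. 3)}$$

where $d(i,j)$ is the non-zero distance between vertex i and pendent vertex j, m is the number of pendent vertices and n is the number of vertices in the hydrogen-suppressed molecular graph.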
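The verbal definition above can be followed step by step in code. The sketch below handles the topostructural case only (unit edge lengths); the graph representation and function names are assumptions, not the in-house implementation.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def superpendentic_index(adj):
    """Square root of the sum, over all rows of the pendent matrix,
    of the product of the non-zero elements in that row."""
    pendent = [v for v in adj if len(adj[v]) == 1]  # vertices of degree one
    total = 0
    for i in adj:
        dist = bfs_distances(adj, i)
        product = 1
        for j in pendent:
            if dist[j] != 0:  # skip the zero entry when i is itself pendent
                product *= dist[j]
        total += product
    return math.sqrt(total)


if __name__ == "__main__":
    isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    print(superpendentic_index(isobutane))
```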

Pendentic eccentricity index ($\xi^P$)
The pendentic eccentricity index ($\xi^P$), proposed in the present study, can be defined as the summation of the quotients of the product of non-zero row elements in the pendent matrix and the squared eccentricity of the concerned vertex, for all vertices in the hydrogen-suppressed molecular graph. The pendent matrix, $D_p$, of a graph G is a submatrix of the distance matrix obtained by retaining the columns corresponding to pendent vertices, i.e. terminal (end) vertices with a degree of one [58]. The eccentricity $E_i$ of a vertex i in a graph G is the path length from vertex i to the vertex j that is farthest from i ($E_i = \max d(i,j),\ j \in G$). The index is expressed as:

$$\xi^P = \sum_{i=1}^{n} \frac{\prod_{j=1}^{m} d(i,j)}{E_i^{2}}$$

where $d(i,j)$ is the non-zero distance between vertex i and pendent vertex j, m is the number of pendent vertices and n is the number of vertices in the hydrogen-suppressed molecular graph.

The pendentic eccentricity topochemical index ($\xi_c^P$) can be easily calculated from the chemical pendent matrix, a submatrix of the chemical distance matrix. Calculation of the proposed index for three isomers of a five-membered molecule containing one heteroatom and at least one pendent vertex is exemplified in Fig. 2. The sensitivity of the proposed topochemical descriptor towards the presence and relative position of heteroatom(s) for all three-, four- and five-membered isomers containing only one heteroatom and at least one pendent vertex is illustrated in Tab. 3. The discriminating power and degeneracy of the pendentic eccentricity topochemical index were investigated using all possible structures with three, four and five vertices containing one heteroatom and at least one pendent vertex, and were compared with those of the other three indices (Tab. 4).
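To make the definition concrete, the sketch below evaluates the topostructural form of the index (unit edge lengths, no heteroatom weighting) directly from the verbal definition: for each vertex, the product of its non-zero distances to the pendent vertices is divided by its squared eccentricity, and the quotients are summed. The graph encoding and function names are assumptions made for illustration; the topochemical values shown in Fig. 2 and Tab. 3 require chemical distances and are not reproduced here.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def pendentic_eccentricity_index(adj):
    """Topostructural pendentic eccentricity index of a connected graph
    with at least two vertices, given as an adjacency dictionary."""
    pendent = [v for v in adj if len(adj[v]) == 1]  # vertices of degree one
    total = 0.0
    for i in adj:
        dist = bfs_distances(adj, i)
        eccentricity = max(dist.values())
        product = 1
        for j in pendent:
            if dist[j] != 0:  # only non-zero pendent-matrix elements
                product *= dist[j]
        total += product / eccentricity ** 2
    return total


if __name__ == "__main__":
    # Unbranched and fully branched five-vertex skeletons (toy examples).
    chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
    print(pendentic_eccentricity_index(chain), pendentic_eccentricity_index(star))
```

In this unweighted setting the five-vertex chain and star differ by roughly a factor of four, which illustrates how strongly the index responds to branching.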

Fig. 2.
Calculation of pendentic eccentricity topochemical index values for three isomers of a five membered molecule containing one heteroatom and at least one pendent vertex.

Tab. 3.
Index values for all possible structures with three, four and five vertices containing one heteroatom and at least one pendent vertex.

$$M_2^c = \sum_{\{i,j\} \in E(G)} d_c(i)\, d_c(j) \qquad \text{(Eq. 10)}$$

where $d_c(i)\,d_c(j)$ is the chemical weight of the edge {i, j} in the hydrogen-suppressed molecular graph and n is the number of edges [24].

Augmented eccentric connectivity index
This is an adjacency-cum-distance based index [44] and is defined as the summation of the quotients of the product of adjacent vertex degrees and the eccentricity of the concerned vertex, for all vertices in the hydrogen-suppressed molecular graph. It is expressed as:

$$^{A}\xi^{c} = \sum_{i=1}^{n} \frac{M_i}{E_i}$$

where $M_i$ is the product of the degrees of all vertices ($v_j$) adjacent to vertex i, $E_i$ is the eccentricity of vertex i, and n is the number of vertices in graph G [44].
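The definition above translates directly into a short routine. The sketch below is an illustrative topostructural implementation (ordinary degrees, unit edge lengths); the graph encoding and function names are assumptions.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest shortest-path length (in edges) from source to any vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def augmented_eccentric_connectivity_index(adj):
    """Sum over vertices of (product of neighbour degrees) / eccentricity."""
    total = 0.0
    for i in adj:
        m_i = 1
        for j in adj[i]:
            m_i *= len(adj[j])  # degree of each vertex adjacent to i
        total += m_i / eccentricity(adj, i)
    return total


if __name__ == "__main__":
    n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(augmented_eccentric_connectivity_index(n_butane))
```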

Performance evaluation
The goodness of the models was assessed by calculating sensitivity, specificity [59,60], overall accuracy of prediction [44], and Matthews correlation coefficient (MCC) [61]. The sensitivity and specificity are defined as follows:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. MCC quantifies the strength of the linear relation between the molecular descriptors and the classifications, and it may often provide a much more balanced evaluation of the prediction than, for instance, the percentages (accuracy). MCC takes both sensitivity and specificity into account: a value of 1 corresponds to a perfect prediction, whereas 0 corresponds to a completely random prediction. It is calculated as [59]:

$$MCC = \frac{TP \times TN - FN \times FP}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}}$$

The percent classification was obtained from the ratio of the number of compounds present in the active and inactive ranges to the total number of compounds in the data set. The percent degree of prediction for each range as well as the overall accuracy of prediction of the proposed model for antitubercular activity in the iron-deficient and iron-rich states were also calculated.
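These measures are simple functions of the four confusion-matrix counts, as the sketch below shows. The function name is hypothetical. The counts in the usage example are an assumption: they were chosen only because they are consistent with the cross-validated decision tree figures reported in this paper for 31 compounds (sensitivity 70%, specificity 80.9%, accuracy 77.4%), and are not copied from Tab. 5.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, overall accuracy and Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    # By convention MCC is reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fn * fp) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Assumed counts (see lead-in): 10 actives and 21 inactives in total.
    metrics = classification_metrics(tp=7, tn=17, fp=4, fn=3)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

With these assumed counts the MCC evaluates to about 0.498, in line with the reported cross-validated value.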
The validation of the DT based model and self-consistency test were performed by 10-fold cross validation (CV) method, in which the compound dataset was randomly split into 10 folds. The model was developed using 9 randomly selected folds, and prediction was done on the remaining fold. The goodness of DT based model was also assessed by calculating sensitivity, specificity, overall accuracy of prediction and MCC. The 10-fold CV results are given in Tab. 5. From a practical application point of view, topological descriptors used should be least correlated [62]. Absence of direct correlation indicates that the two indices are distinctive and consider different structural components. Statistical significance of TIs used in building predictive models was also assessed by intercorrelation analysis by using index values of analogues of 5'-O-[(N-Acyl)sulfamoyl]adenosines.
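The 10-fold split described above can be sketched as follows. The study used R for this step, so this Python version is an illustrative equivalent only; the function names and the fixed random seed are assumptions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Randomly partition item indices 0..n_items-1 into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_items, k=10, seed=0):
    """Yield (train, test) index lists: each fold is held out once for prediction."""
    folds = k_fold_indices(n_items, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in train_test_splits(31):
        print(len(train), len(test))
```

With 31 compounds, nine folds hold three compounds and one holds four, so every compound is predicted exactly once by a model that did not see it during training.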

Results and Discussion
Computational approaches applied in drug discovery and toxicity prediction often require molecular descriptors that reflect structural information and physicochemical properties of molecules [63]. The description of the molecular structure through the so-called molecular descriptors is a more difficult but necessary task. Difficulties arise in the generation of such indices, due to non-mathematical nature of the molecular structure [64]. Topological indices are one of the widely used molecular descriptors, which are easily available and can be quickly computed for existing and virtual structures [65,66]. The successful implementation of QSPR and QSAR certainly decreases the number of compounds synthesized, by making it possible to select the most promising compounds. However, it does not completely eliminate the trial and error factor involved in the development of new drugs [67].
Researchers are striving to develop new TIs that not only have high discriminating power but are also devoid of both degeneracy and correlation with existing TIs. As observed from Fig. 2, the value of the pendentic eccentricity index changes by more than 4 times (from 2.052 to 8.395) with a small change in the branching of a five-membered molecule containing one heteroatom and at least one pendent vertex. Thus, the novel descriptor has high discriminating power, defined as the ratio of the highest to the lowest value for all possible structures with the same number of vertices. This is evident from the fact that the ratio of the highest to the lowest value for all possible structures containing five vertices is 6.25 for $\xi_c^P$, in contrast to 1.5, 1.22 and 2.24 for $W_c$, $\chi^A$ and $\int P_c$ respectively. Thus, the pendentic eccentricity topochemical index revealed ~4 times higher discriminating power than Wiener's topochemical index, >5 times higher discriminating power than the molecular connectivity topochemical index and ~2.8 times higher discriminating power than the superpendentic topochemical index for all the possible structures of five vertices containing a heteroatom and at least one pendent vertex (Tab. 4). High discriminating power and extremely low degeneracy are desirable properties of an ideal topological index. The high discriminating power of the proposed descriptor makes it more sensitive towards any change in molecular structure.
Degeneracy is a measure of the ability of an index to differentiate between the relative positions of atoms in a molecule. It is a well-known fact that topological indices show degeneracy, that is, two or more non-isomorphic graphs may have identical numerical values for an index [68]. The novel pendentic eccentricity topochemical index had significantly reduced degeneracy as compared to Wiener's topochemical index and the superpendentic topochemical index. This is evident from the fact that the pendentic eccentricity topochemical index had only 5 identical values out of 30 structures with five vertices containing one heteroatom and at least one pendent vertex, whereas Wiener's topochemical index and the superpendentic topochemical index had 13 and 9 identical values, respectively, for the same compounds (Tab. 4). It is pertinent to mention here that the pendentic eccentricity topochemical index also had reduced degeneracy as compared to the molecular connectivity topochemical index, as is evident from the fact that the novel index had a single identical index value out of the 31 values of the dataset under study, whereas the molecular connectivity topochemical index had two identical values for the same (see Tab. 1). The lower the degeneracy, the better the index [39]. A significant reduction in degeneracy indicates the enhanced capability of the novel topochemical index to differentiate and demonstrate slight variations in the molecular structure. This means that the likelihood of different structures having the same value is very low. As observed from Tab. 6, the pendentic eccentricity topochemical index is not correlated with most of the commonly used TIs. Pairs of indices with r≥0.97 are highly intercorrelated, those with 0.90≤r<0.97 are appreciably correlated, those with 0.50≤r≤0.89 are weakly correlated and finally the pairs of indices with low r values (<0.50) are not intercorrelated [69]. Intercorrelation analysis (Tab. 6) revealed that the pair of indices ∫

Tab. 5.
Confusion Matrix for antitubercular activity and recognition rate of models based on decision tree and Random forest.

In the present study, DT, RF and MAA based models were developed for the prediction of antitubercular activity of 5'-O-[(N-Acyl)sulfamoyl]adenosines. The decision tree was built by utilizing 26 TIs of diverse nature. This recursive partitioning scheme generates rules based on the numerical data of the available descriptors for each molecule. In this case, a classification of the data set [35] into active and inactive compounds was desired. The decision tree assigns a probability value (0-1) that a compound is active or inactive; compounds with a probability equal to or greater than 0.5 are designated as active, while others are designated as inactive [70]. The decision tree identified five important topological indices: superpendentic topochemical index (A11), Zagreb group parameter, $M_2$ (A21), molecular connectivity topochemical index (A1), Zagreb topochemical index, $M_2^c$ (A8) and augmented eccentric connectivity topochemical index (A3). The obtained topology of the decision tree is shown in Fig. 3, where the respective descriptor is denoted with an alphanumerical abbreviation that refers to Tab. 2. The index at the root node is the most important, and the significance of an index decreases with increasing depth in the tree. The DT classified the analogues of 5'-O-[(N-Acyl)sulfamoyl]adenosines in the training set with an accuracy of 100% and in the cross-validated set with an accuracy of 77.4% with regard to antitubercular activity. The sensitivity and specificity of the DT based model in the training set were both found to be 100%. The sensitivity and specificity of the DT based model in the cross-validated set were of the order of 70% and 80.9% respectively. The values of MCC for the DT based model in the training set and the cross-validated set are 1 and 0.497 respectively, suggesting satisfactory performance as well as robustness of the model. The values of sensitivity, specificity and MCC are shown in Tab. 5.

Tab. 6.
Intercorrelation matrix.

The methodology used in the present study aims at the development of suitable models for providing lead molecules through exploitation of the active ranges in the proposed models based on topological indices. The proposed models are unique and differ widely from conventional QSAR models. Both systems of modeling have their own advantages and limitations. In the present case, the modeling system adopted has the distinct advantage of identifying narrow active range(s), which may be erroneously skipped during routine regression analysis in conventional QSAR modeling. Since the ultimate goal of modeling is to provide lead structures, these active ranges can play a vital role in lead identification [71].

Tab. 7.
MAA derived topological models for antitubercular activity.

Retrofit analysis of the data with regard to the pendentic eccentricity topochemical index (Tab. 7-9) revealed that 91.3% of the analogues were predicted correctly with respect to antitubercular activity. The extremely low average Ki app value of 0.018 μM for the correctly predicted compounds indicates the high potency of the active range in the proposed model. The activity of all the analogues in both inactive ranges was predicted correctly. The average Ki app values of the lower inactive range and of the upper inactive range were found to be 48.9 μM and 52.92 μM respectively. The existence of a transitional range indicates a gradual change in biological activity. The ratios of the average Ki app value of the active range to those of the lower inactive range and the upper inactive range were found to be 1:2716.66 and 1:2940 respectively. The overall accuracy of this model for prediction of antitubercular activity in the iron-deficient and iron-rich states was found to be 82.6%. Sensitivity, specificity, and MCC for this model were found to be 100%, 87.5%, and 0.82 respectively.
The pendentic eccentricity topochemical index ($\xi_c^P$) depends upon the number of pendent atoms and eccentricity. It also takes care of both the nature and the relative position(s) of pendent atom(s)/heteroatom(s). For a compound to be biologically active, two pendent vertices on the cyclic substituent R (at appropriate places) are essential, as observed from the relative Ki app (μM) values [35]. Any deviation from such substitution leads to either loss or reduction of biological activity. All of the compounds which were characterized as active by the proposed model contained two pendent atoms in the cyclic substituent R. Accordingly, all the compounds predicted as active by the proposed model, except 7 and 16, were also experimentally reported to be active. Compounds 7 and 16 were categorised as active by the proposed model. Although these two compounds are experimentally inactive under the cut-off value of ≤0.05 μM, both exhibited significant biological activity, with Ki app values of 0.061 and 0.137 respectively, when compared to the average Ki app values of ~50 μM for the inactive ranges. Consequently, all the compounds which were categorised as active by the proposed model were either experimentally reported to be active or exhibited significant biological activity. All the compounds which were characterized as inactive by the model possessed either fewer than two or more than two pendent atoms in the cyclic substituent R, with the exception of compound 17. The inactivity of compound 17 may be due to the lack of a pendent vertex at the ortho-position. This fact has already been reported earlier [35]. Since the study signifies the influence of both the number and the relative position(s) of pendent atom(s) in the cyclic substituent R on the biological activity, pendenticity-based topological descriptors will naturally be of utmost importance in drug design.
The results of average Ki app (µM) values of correctly predicted analogues in various ranges of the proposed MAA based topological models are shown in Figures 4-6.

Conclusion
The pendentic eccentricity topochemical index, a novel molecular descriptor, exhibited high discriminating power and sensitivity towards both the presence and the relative position(s) of pendent atom(s)/heteroatom(s), apart from reduced degeneracy. Moreover, the pendentic eccentricity topochemical index was found not to be correlated with important topological descriptors, rendering it a highly beneficial tool for isomer discrimination, similarity/dissimilarity studies, drug design, quantitative structure-activity/structure-property relationships, lead optimization and combinatorial library design.
Significant correlation of topological descriptors with the antitubercular activity of 5'-O-[(N-Acyl)sulfamoyl]adenosines led to the development of numerous models through decision tree, random forest and MAA. All the proposed models exhibited a high degree of prediction with regard to antitubercular activity. These models offer vast potential for providing lead structures for the development of potent therapeutic agents for the treatment of tuberculosis.