In silico Antibacterial Activity Modeling Based on the TOMOCOMD-CARDD Approach

In the recent times, the race to cope with the increasing multidrug resistance of pathogenic bacteria has lost much of its momentum and health professionals are grasping for solutions to deal with the unprecedented resistance levels. As a result, there is an urgent need for a concerted effort towards the development of new antimicrobial drugs to stay ahead in the fight against the ever adapting bacteria. In the present report, antibacterial classification functions (models) based on the topological molecular computational design-computer aided ‘‘rational’’ drug design (TOMOCOMD-CARDD) atom-based non-stochastic and stochastic bilinear indices are presented. These models were built using the linear discriminant analysis (LDA) method over a balanced chemical compounds dataset of 2230 molecular structures, with a diverse range of structural and molecular mechanism modes. The results of this study indicated that the non-stochastic and stochastic bilinear indices provided excellent classification of the chemical compounds (with accuracies of 86.31% and 84.92%, respectively, in the training set). These models were further externally validated yielding correct classification percentages of 86.55% and 87.91% for the nonstochastic and stochastic bilinear models, respectively. Additionally, the obtained models were compared with those reported in the literature and demonstrated comparable results, although the latter were built over much smaller datasets and with much higher degrees of freedom. Finally, simulated ligand-based virtual screening of 116 compounds, recently identified as potential antibacterials, was performed yielding 86.21% and 83.62% of correct classification, respectively, and thus demonstrating the utility of the obtained TOMOCOMD-CARDD models in the search of novel compounds with desirable antibacterial activity.

In the recent times, the race to cope with the increasing multidrug resistance of pathogenic bacteria has lost much of its momentum and health professionals are grasping for solutions to deal with the unprecedented resistance levels.As a result, there is an urgent need for a concerted effort towards the development of new antimicrobial drugs to stay ahead in the fight against the ever adapting bacteria.In the present report, antibacterial classification functions (models) based on the topological molecular computational design-computer aided ''rational'' drug design (TOMOCOMD-CARDD) atom-based non-stochastic and stochastic bilinear indices are presented.These models were built using the linear discriminant analysis (LDA) method over a balanced chemical compounds dataset of 2230 molecular structures, with a diverse range of structural and molecular mechanism modes.The results of this study indicated that the non-stochastic and stochastic bilinear indices provided excellent classification of the chemical compounds (with accuracies of 86.31% and 84.92%, respectively, in the training set).These models were further externally validated yielding correct classification percentages of 86.55% and 87.91% for the nonstochastic and stochastic bilinear models, respectively.Additionally, the obtained models were compared with those reported in the literature and demonstrated comparable results, although the latter were built over much smaller datasets and with much higher degrees of freedom.Finally, simulated ligand-based virtual screening of 116 compounds, recently identified as potential antibacterials, was performed yielding 86.21% and 83.62% of correct classification, respectively, and thus demonstrating the utility of the obtained TOMOCOMD-CARDD models in the search of novel compounds with desirable antibacterial activity.

Introduction
The race to cope with the increasing multidrug resistance of pathogenic bacteria has in the recent times lost much of its momentum, particularly due to a fundamental shift in the interest of the pharmaceutical companies.In the period 2010-2012, only one antibiotic was approved by the US food and drug administration, highlighting the tremendous fall in the development of novel antibiotics. 1 The argument put forward by pharmaceutical companies is that antibiotics are a non-viable investment given they are short course therapies compared to other chronic illnesses. 2herefore in the wake of the ever leaner antibiotic pipeline, antibiotic resistance has risen to unprecedented levels catching up with remedies long considered as "last resorts" such as vancomycin and carbapenem. 3[6][7][8][9][10] Consequently, it is needless to say that there is an urgent need for concerted efforts in the development of new antimicrobial drugs (with more selectivity and less toxicity) to stay ahead in the battle against the ever adapting disease causing bacteria.During the past two decades, drug discovery research has been reoriented towards the development of theoretical and/or computational methods enabling the rational selection or design of novel agents with the desired properties, offering huge advantages in terms of the time and cost incurred in the search of novel lead compounds. 11Recently, high throughput and ultrahigh throughput screening (HTS and uHTS) have been introduced to spearhead the identification of new lead compounds. 12However, while these methods are rapid they are still relatively cost friendly, for many pharmacological activities, HTS endpoints may not be available. 13In addition, because of the low number of high-quality leads derived from HTS tests (1 per 100 000 compounds), 14 several techniques for "recognizing drug-like molecules" have been introduced.Thus, virtual (computational) screening has emerged as an interesting alternative to HTS. 15 In this way, computational techniques are used to select a reduced number of potentially active compounds, from large available chemical or virtual combinatorial libraries.The main aim of this approach is to discriminate potential candidate molecules from inactive ones.Indeed, various in silico approaches have been employed in the construction of quantitative structure-activity relationship (QSAR) models for the prediction of antibacterial activity, 13,[16][17][18][19][20][21][22] with greater emphasis on ligand-based classification methods.
The QSAR models can be categorized as local, global and universal models.7][18][19][20][21] Therefore, it is imperative to construct diverse chemical datasets (in structural and activity mechanisms terms) for QSAR modeling and to explore alternative molecular structural characterizing strategies/ parameters capable of codifying orthogonal information to the existing ones, as a means of expanding the AD of the obtained models.In this context, a novel scheme to perform rational in silico molecular design (or selection/ identification of lead drug-like chemicals) and quantitative structure activity/property relationship (QSAR/QSPR) studies, known as TOMOCOMD-CARDD (acronym of topological molecular computational design-computer aided ''rational'' drug design) 29 has recently introduced.][32][33][34] The primary objective of the present report is to construct classification functions, based on the TOMOCOMD-CARDD atom-based non-stochastic and stochastic bilinear indices and a diverse chemical dataset of 2230 compounds, with the ultimate goal of screening for NMEs with possible broad spectrum antibacterial activity.The linear discriminant analysis (LDA) technique will be employed as the classification method.The key advantage of LDA is its inherent simplicity, both in terms of the lower computational cost involved with this method [compared to other non-linear methods such as artificial neural networks (ANN), support vector machine (SVM), random forest (RF), etc.] and the ease in the analysis of linear biological relationships.Comparisons with other approaches reported in the literature will be performed with the aim of assessing the performance of the built classification models.Finally, simulated virtual screening is carried out over 116 new compounds, recently reported in the literature as possessing antibacterial activity, to evaluate the true predictive ability of the TOMOCOMD-CARDD models.

Construction of chemical dataset
It is known that greater structural variability of the training series dataset enhances the general performance of learning methods as it favors a broader AD for the classification models, critical in the successful screening the huge chemical compound databanks present today for NMEs with possibly novel modes of action.In this sense, we constructed a data set comprised of 2230 chemicals, with 1051 compounds reported in the literature as antibacterials [35][36][37] and the rest (1179 compounds) with other pharmacological uses. 36,37The former were considered as active, while the latter as inactive, consistent with binary classification models.The dataset of active compounds was built considering representatives from most of the different structural patterns and action modes of antibacterial activity (see also Table S1 in Supplementary Information).For instance, it includes antimicrobial agents that interfere with the synthesis or action of folate (sulphonamides and dihydrofolate-reductase inhibitors such as trimethoprim), b-lactam antibiotics (cephalosporins, cephamycins, penicillins, monobactams and carbapenems), antimicrobial agents affecting bacterial protein synthesis (tetracyclines, phenicols, aminoglycosides, macrolides, lincosamides), chemicals affecting DNA gyrase (quinolones), miscellaneous antibacterial agents (vancomycin, polymixim antibiotics, nitrovinylfurans, bacitracin) and many others. 38,39Other compounds for which a specific mode of action has not been found or defined were also included. 36,37Therefore great variability in structural and molecular mechanism terms was achieved.
Posteriorly, in order to divide the built chemical compound dataset into training and test sets, respectively, two k-means cluster analyses (k-MCAs) were performed for active and inactive series of chemicals, separately. 40,41he main idea consists of carrying out a partition of the active and inactive series of chemicals in several statistically representative classes of chemicals.This procedure ensures that all chemical classes (as determined by the clusters derived from k-MCA) will be represented in both series of compounds. 40Posteriorly, for each cluster, the selection of the training and test sets was performed following a random sampling procedure.As a result, the training set was composed of 799 antibacterials and 918 non-antibacterials out of a set of 1717 chemicals.The remaining group, composed of 252 antibacterials (active) and 261 compounds with different biological properties (considered as inactive in this context), was used as the test set for the validation of the models.
Finally, an external validation set comprised of 116 novel antimicrobial agents recently reported in the literature 42,43 was set aside to assess, in a simulate virtual screening experiment, the earnest predictive ability of the obtained classification models.

Molecular descriptor computation
The proper selection of molecular structural characterizing parameters for use in statistical modeling plays an important role in the quality of the models constructed thereof.While there is no general consensus on which family/class of theoretical molecular descriptors (MDs) should be preferred, orthogonality (in regard to the chemical information codified), simplicity (in terms of algorithms followed in their computation) and high discriminating power for structurally different chemicals constitute desirable characteristics to be taken into account. 44In this sense, the so-called topological (and topo-chemical) indices (TIs) have gained increasing utility in QSPR/QSAR modeling as they generally approximate to these attributes. 44,45The TIs are numbers that describe the structural information of molecules through graph-theoretical invariants and can be considered as structure-explicit descriptors. 46n previous reports, Marrero et al. 11,40 introduced a new family of TIs known as TOMOCOMD-CARDD bilinear indices.These MDs are based on the calculation of bilinear maps in ℜ n , in canonical basis sets. 47,48The computation of the non-stochastic and stochastic bilinear indices is performed using the k th "non-stochastic and stochastic graph-theoretical electronic-density matrices" denoted by M k and S k , respectively. 47,48These matrix operators are graph-theoretical electronic-structural models, similar to the ''extended Hückel MO (molecular orbital) model''.The M 1 matrix considers all valencebond electrons (s-and p-networks) in one step, and their power k (k = 0, 1, 2, 3, ...) can be considered as an interacting-electronic chemical-network in k th step.The present approach is based on a simple model for the intramolecular (stochastic) movement of all outer-shell electrons.Therefore, our approach describes changes in the electronic distribution throughout the molecular backbone with time. 47,48These indices may be computed to characterize the totality of the entire molecular structure and/or determined fragments or characteristics of the molecule (local atom and atom-type indices).This is an essential attribute bearing in mind that the antibacterial activity may sometimes not necessarily depend on the entire molecular structure but on determined regions which interact with the inhibitory sites in the microorganism.Vol. 26, No. 6, 2015   To automatize the computation of these indices, a free and interactive computational program denominated TOMOCOMD-CARDD was implemented.This software was in the Java programming language and is designed to operate in a parallel environment and thus maximizing the architecture of modern computers. 49The following descriptors were calculated for this study: ] bilinear indices were also computed.Additionally, the following properties were employed as weighting schemes for atoms in the molecular structures: atomic mass (M), atomic polarizability (P), atomic Mulliken electronegativity (K) and van der Waals volume (V).

Chemometric tools
The statistical software Statistica was used to perform the k-MCA. 50The number of members in every cluster and the standard deviation of the variables in the cluster were taken into account (as low as possible) in order to have acceptable statistical quality of data partition in clusters.We also made an inspection of the standard deviation between and within the clusters, as well as the respective Fisher ratios and p-levels of significance, the latter was considered to be lower than 0.05. 41,51Posteriorly, LDA-based classification models were built using the Statistica software. 50The forward stepwise algorithm was used as the strategy for variable selection.The performance of the LDA models was assessed using the statistical parameters square Mahalanobis distance (D 2 ), Wilks' λ parameter (U-statistic), Fisher ratio (F), p-level [p(F)] and the percentage of correct classification in the training and test sets, respectively.The statistical robustness and predictive power of the obtained models were assessed using the test set of compounds.The binary classification variable CA with values 1 and -1 for active and inactive compounds, respectively.It follows that a compound is classified as active, if ΔP% > 0 with ΔP% = [P(active) -P(inactive)] × 100, and as inactive otherwise [P(active) and P(inactive) are the probabilities with which the equations classify a compound as active and inactive, respectively].Finally, the calculation of percentages of global good classification (accuracy), sensibility, specificity (also known as 'hit rate'), false positive rate (also known as 'false alarm rate'), Matthews correlation coefficient (MCC) and the receiver operating characteristic (ROC) curve analysis for the training (1717 compounds) and test (513 compounds) sets, respectively, permitted us to carry out the assessment of the models' robustness, predictive power and the statistical significance with respect a random classifier. 52,53Posteriorly, external validation set (VS) of 116 novel antimicrobial agents, taken from recently published reports 42,43 were used to more vigorously assess the predictive ability of the obtained classification models in a simulated virtual screening experiment.where N is the number of compounds, λ is Wilks' lambda, D 2 is the square Mahalanobis distance, F is the Fisher ratio, p-value is the significance level, MCC is the Matthews' correlation coefficient for the training set, Q LS and Q PS are the accuracy of the model for the training and prediction sets, respectively.

Development of the discriminant functions
Equation 1, built from non-stochastic indices, has an accuracy of 86.31% for the training set.This model showed a good MCC of 0.72; MCC quantifies the strength of the linear relation between the MDs and the classifications, and it may often provide a much more balanced evaluation of the prediction than, for instance, the percentages (accuracy). 54Nevertheless, the most important criterion, for the acceptance or not of a discriminant model, is based on the statistics for the test set.The non-stochastic model showed an accuracy of 86.55% (MCC = 0.73) for the compounds in the test set.
A rather similar behavior was obtained with the stochastic bilinear indices (equation 2).In this case, the model achieved an accuracy of 84.92% with a MCC of 0.70 for the training set, and for the test set an accuracy of 87.91% and MCC of 0.76; these values are similar-to-better than to those obtained with non-stochastic bilinear indices.The results of classification obtained with both models are shown in Table 1.
Figures 1 and 2 illustrate the ROC curves for the nonstochastic and stochastic bilinear index-based LDA models, respectively, for the training and test sets in each case.As can be observed, the values for the area under the curve (AUC) in the case of the former are of 0.931 and 0.938, while in the latter are 0.928 and 0.942 for the training and test sets, respectively.These scores ratify the inference that the obtained models are significantly different from a random classifier [AUC (random classifier) = 0.5].
All together, the statistical quality of the built models was satisfactory, validating the applicability of these models in virtual screening of chemical compounds.The complete set of compounds in the training and test sets, as well as their classification using both models, is given in Supplementary Information (for details see Tables S2-S4).

Comparison with other approaches for antibacterial activity prediction
7][18][19][20][21] Nevertheless, straightforward comparisons are not possible, bearing in mind the differences in the chemometric methods and experimental data employed in the respective studies.Therefore this comparative study was based on the characteristics and statistics of the different studies such as: the fitting and validation methods for the different models and the corresponding statistics, the number and diversity of chemical compounds in the datasets and the percentages of correct classifications.The Table 2 shows a comparison of the results obtained using the bilinear indices with those reported in the literature.
As can be observed, in this study we used a much larger (2230 compounds) and more diverse (comprised of a broader range of antibacterial families) dataset, probably  the largest and most diverse dataset used in antibacterial activity modeling, from the best of our knowledge.Therefore this dataset may be considered as a benchmark for posterior modeling tasks of the antibacterial activity (a complete list of the 2230 chemicals employed in the present report is available as Supplementary Information; see S2-S4).7][18][19][20][21] The importance of the dataset structural pattern in QSAR modeling cannot be overemphasized.It is anticipated that models built with datasets with diverse families of antibiotics and mechanisms of action should increase their utility in the search of broad spectrum antibacterials, as a desirable characteristic of novel chemotherapeutic compounds.Therefore, although there are some models with superior statistics (for the training and test set) compared to those in the present report, the former are generally based on a narrower chemical space (in terms of the number and diversity of the compounds), which decreases their utility in present day virtual screening tasks.Additionally, with the exception of the models reported by Domenech and de Julián-Ortiz, 17 the bilinear index-based models (equations 1 and 2) are much smaller size (3-variable models) compared to those reported in the literature (ranging from 6-62 variables).It is thus logical that the models in the present report possess superior Fisher ratio (F) values than the rest of the models (Table 2).This implies that the former are unlikely prone to overfitting and thus possess a greater generalizability.Also some studies reported in the literature used much more robust non-linear modeling methods [i.e., artificial neural networks (ANN) and binary logistic regression (BLR)] were used and these are known to generally yield better results (Table 2).Even then the statistics of the present models do not significantly differ from the ones obtained with these approaches.
It may therefore be concluded that the models obtained in the present report compare relatively well with those reported in the literature, bearing in mind the inherent dissimilarities in the different studies.

Computational screening of new compounds with antibacterial activity reported in the anti-infective field
Several reports in the literature have pointed out that it is much more desirable that the obtained and validated QSAR models be further tested on chemical compounds that did not form part of the studied dataset, as this adds greater rigor to the external validation procedure. 57,58However, due to the scarcity of experimental results on the biological activity for chemical compounds, the model building workflow is more often limited to the splitting of datasets into training and test sets, for fitting and external validation purposes, respectively, but rarely are models tested on an additional set of compounds not forming part of original datasets, in what may be denominated as "simulated" virtual screening.Therefore in this study, a search for compounds recently identified as possessing antibacterial activity was performed yielding 116 compounds 42,43 and these were evaluated using TOMOCOMD-CARDD models.The non-stochastic (eq. 1) and stochastic (eq.2) bilinear models showed predictive accuracies of 86.21% and 83.62%, respectively (for details see, Supplementary Information for Tables S5  and S6).Considering that many of the 116 compounds in this validation set were recently identified as antimicrobial agents, this -in silico-evaluation may be viewed as an equivalent to the discovery of new lead compounds using the developed models.Therefore the utility, in particular the predictive power and generalizability, of the obtained QSAR classification models is demonstrated, which opens way for posterior virtual screening tasks of novel compounds with antibacterial activity.

Conclusions
The TOMOCOMD-CARDD based models obtained in the present study were statistically significant and compared satisfactorily with most of the ligand-based antimicrobial classification models reported up to date.Additionally, the simulated virtual screening experiment performed on the 116 compounds, recently reported as possessing antibacterial activity, revealed the predictive power and the ultimate usability of the models obtained using the TOMOCOMD-CARDD approach as a quicker and reliable alternative applicable in the virtual screening of novel antibacterial lead compounds.
On the other hand, the comprehensive chemical compounds dataset presented in the present report constitutes an important benchmark for posterior QSAR studies the modeling and/or virtual screening of novel compounds with antibacterial activity.
(i) k th nonstochastic total bilinear indices, considering hydrogen suppressed and filled molecular pseudographs (G) denoted by b k (respectively; (ii) k th non-stochastic local bilinear indices (for heteroatoms based on atomtypes: S, N, O), for H filled and suppressed molecular pseudographs (G) denoted by b kL H (x E ,y E ) and b kL (x E ,y E ), respectively.These local descriptors are putative H-bonding acceptors, charge and the dipole moment; (iii) k th nonstochastic local (for atom groups based on H-atoms bonded to heteroatoms: S, N, O) bilinear indices, considering H atoms in the molecular pseudograph (G) [b kL H (x E -H ,y E-H )].These local descriptors are putative H-bonding donors.The k th stochastic total [ s b k (x,y) and s b k H

Figure 1 .
Figure 1.ROC curves for discriminant function based on the nonstochastic indices (equation 1) for the (a) training set (b) test set.

Figure 2 .
Figure 2. ROC curves for discriminant function based on the stochastic indices (equation 2) for the (a) training set (b) test set.

Table 1 .
Global results of the classification of compounds in the training and test sets 17are reported in this work, model A is reported by some of the present authors by using another dataset and different (quadratic) TOMOCOMD-CARDD MDs; 11 models 1 and 2 were reported by Domenech and de Julián-Ortiz;17model 3 was reported by Tomás-Vert et al.; 18 model 4 was reported by Mishra et al., 16 models 5 and 6 are after Cronin et al.; 13 model 7 was reported by Molina et al.; 19 models 8 and 9 were reported by Murcia-Soler et al.; 20 model 10 was reported by Cherkasov, 21 models 11 and 12 were reported by Aptula et al.; 55 and model 13 was introduced by González-Díaz et al.;56 bLDA refers to linear discriminant analysis, ANN to artificial neural network, and BLR to binary logistic regression; c only largely represented families were considered, e.g., methods 1 and 2 used 3 in training quinolones, sulphonamides, and cephalosporins but add only diaminopyridine (1 compound), cephamicins (2), oxacephems(1)and sulfones (1) to predicting series; d validation methods are: (i) test set, (ii) leave-many-out cross-validation (sub-sample) and (iii) external individual prediction set: e by using 116 compounds or f 87 chemicals.N.R.: not reported.
a Equations 1 and