Molecular Topology for the Discovery of New Broad-Spectrum Antibacterial Drugs

In this study, molecular topology was used to develop several discriminant equations capable of classifying compounds according to their antibacterial activity. Topological indices were used as structural descriptors and their relation to antibacterial activity was determined by applying linear discriminant analysis (LDA) on a group of quinolones and quinolone-like compounds. Four equations were constructed, named DF1, DF2, DF3, and DF4, all with good statistical parameters such as Fisher–Snedecor’s F (over 25 in all cases), Wilk’s lambda (below 0.36 in all cases) and percentage of correct classification (over 80% in all cases), which allows a reliable extrapolation prediction of antibacterial activity in any organic compound. From the four discriminant functions, it can be extracted that the presence of sp3 carbons, ramifications, and secondary amine groups in a molecule enhance antibacterial activity, whereas the presence of 5-member rings, sp2 carbons, and sp2 oxygens hinder it. The results obtained clearly reveal the high efficiency of combining molecular topology with LDA for the prediction of antibacterial activity.


Introduction
The synthesis of artificial antibacterial agents and the discovery and improvement of antibiotics have brought about a true pharmacological revolution in the treatment of infectious diseases in this century. However, the extreme versatility and adaptability of microorganisms has prevented a decrease in the prevalence of infectious diseases, since many bacteria have been developing mechanisms that protect them against many drugs [1].
Currently, there is a growing concern about decreased efficacy of antimicrobial agents and increased prevalence of new and old bacterial pathogens. Increases in the rate of antibiotic resistance are resulting in higher mortality rates, and higher healthcare costs [2]. In fact, according to the World Health Organization (WHO), bacterial resistances will be the leading cause of death in 2050 and the biggest challenge in the field of Biomedicine in the 21st century [3]. For all these reasons, it is necessary to know the sensitivity of the main microorganisms and to be continuously alert to the appearance of resistant strains that could lead to treatment failure [2].
Given the problem involved in antibacterial therapy, the emergence of resistance to treatments with classic antibacterial compounds, it is necessary to expand the therapeutic arsenal of any group of antibacterial agents. One effective and low-cost way of tackling this problem is by using molecular connectivity or molecular topology (MT), developed by Kier and Hall in the mid-1970s, a Quantitative Structure-Activity Relationships (QSAR) derived method capable of predicting molecular properties in new compounds, without the need to synthesize them [4]. By combining it with techniques of pattern recognition such as linear discriminant analysis (LDA) [5], neural networks [6], multilinear regression [7] or principal component analysis [8] and appropriately selecting the molecular descriptors to use, we can build mathematical-topological equations able to identify almost any molecular property, becoming a powerful tool for the search and design of new compounds with pharmacological activities.
In this context, this study aims to obtain mathematical-topological equations capable of predicting antibacterial activity. By combining MT and LDA, the topological indices (TI) can be used to classify a compound as antibacterial or non-antibacterial. To do this, we simply select a group of compounds with antibacterial activity and another one lacking it.
To do this, antibacterial and non-antibacterial quinolones have been selected, since it is a known and extensive group that will allow us to collect numerous data [9], leading to greater precision of the predictive equations [10]. Moreover, the interest of the therapeutic target group is reflected in the large number of publications issued since 2010, more than 21,000 articles according to PubMed and 8000 according to the Web of Science (WoS). The discriminant equations obtained can be applied both in new quinolones not used in the study as in molecules that have no structural relationship, since it will select those that have a similar mathematical-topological relationship. The effectiveness of the topological method has also been proven, having been successfully applied in different therapeutic groups such as oral hypoglycemic [11], anti-inflammatory [12], antimalarial [13], antihistaminic [14], antidementia [15] and antibacterial agents [16].

Compound Selection
With the purpose of collecting the maximum information on experimental values, we held a comprehensive and exhaustive bibliographical review. We collected information on quinolones using the ISI Web of Science, Medline and SciFinder (Caplus) search engines.
We finally collected in vitro activity data of 99 quinolones and structurally related compounds against various bacteria, which were classified into two groups: one with 50 antibacterial quinolones and one with 49 non-antibacterial quinolones or closely related structures (the structure of all compounds as well as bibliographic references about their activity can be found in Tables S1 and S2). To consider a compound as antibacterial, it should be active (minimum inhibitory concentration (MIC) < 1 mg/mL) against numerous bacteria, while those compounds considered non-antibacterial (MIC >16 mg/mL) should be inactive against at least three Gram-positive and three Gram-negative bacteria. Those with intermediate activities were not included in the study. Regarding stereoisomers, if any of them was active, it was included in the active group. If all individual stereoisomers or the mixture of them in any ratio was inactive, they were included as a single graph in the inactive group.

Topological Descriptors
Each compound was characterized by a set of 144 non-redundant, significant (indices with value 0 for every compound and with identical values for all quinolones were removed) descriptors specific to each molecule. These descriptors do not contain 3D parameters that distinguish between enantiomers. They were computed from the adjacency topological matrix obtained from the hydrogen-depleted chemical pseudographs, previously drawn with the ChemBioDraw Ultra 12. Chemical pseudographs are two-dimensional objects and, therefore, so are their molecular descriptors, although they can represent a high content of information about three-dimensional structures [17]. The molecular descriptors used are described in Table S3 along with their definitions and references.

Linear Discriminant Analysis (LDA)
Stepwise linear discriminant analysis, LDA, is a pattern recognition method providing a classification model based on the combination of variables that best predicts the category or group to which a given compound belongs. The variables used to compute the linear classification functions were chosen in a stepwise manner, based on the Fisher-Snedecor parameter F, which relates the variance explained by the equation with the residual variance. At each step, the variable with the greater value of F, thus, the variable that makes the larger contribution to the separation of the groups, was entered in the discriminant function. Conversely, selected variables with a small value of F, thus, variables which lowered the statistical significance of the classification function, were removed.
The discriminant ability was assessed by the percentage of correct classifications attained for each set. The classification criterion is the minimal Mahalanobis distance (distance of each case to the mean of all the cases in a category). The quality of the discriminant function was evaluated through Wilk's U-statistical parameter, λ, which was obtained by a multivariate analysis of variance that tests the equality of group means for the variable in the discriminant model. LDA was then applied to the database, except for the molecules reserved as the test group, to obtain a predictive mathematical model linking structural descriptors and activity. The independent variables in this study were the topological descriptors, and the discriminant property was antimicrobial activity. The software used for the LDA study was the BioMeDical Program (BMDP) 7 M module (BMDP Statistical software Inc., University of California: Berkeley, CA, USA, 1990), which randomly chooses the compounds reserved for the test set.
The equations were finally validated internally by the Jack-Knife (JK) method and externally by the test groups.

Pharmacological Distribution Diagrams
After selecting the discriminant function, the corresponding pharmacological distribution diagrams (PDD) were built up. These plots are useful to determine the intervals of the discriminant function in which the expectancy, E, to find active compounds is maximum. PDDs are histogram-like plots of connectivity functions in which the expectancies appear on the ordinate axis. For an arbitrary interval of values of a given function, we can define the expectancy of activity as Ea = a/(i + 1), where "a" is the number of active compounds in the interval divided by the total number of active compounds, and "i" is the number of inactive compounds. The expectancy of inactivity is defined in a symmetrical way, as Ei = i/(a + 1). This representation provides good visualization of the regions of minimum overlap and selects regions in which the probability of finding improved compounds is maximum [18].
PDDs allowed us to carry out the assignment of thresholds useful to discriminate active from inactive compounds with the highest probability of success.

Results
To obtain the discriminant functions we used the data from a previous study [19], in which the statistical program BMDP randomly formed two training groups with 38 active and 38 inactive compounds and two test groups with 12 active and 11 inactive compounds. This test groups allowed evaluating the quality of the selected discriminant functions.
The discriminant functions formed in this study along with their statistical parameters are shown in Equations (1) to (4), while their corresponding classification matrices are shown in Tables 1-4. The classification criterion was determined by the value of the discriminant functions: if the value of the equation for a given compound was equal or bigger than 0, such compound was classified as active, whereas if the value of the equation for a compound was smaller than 0, such compound was classified as inactive.
We plotted the corresponding PDDs for every function to visualize the values of the function in which the probability of classifying a compound as active or inactive is maximum. In other words, to find areas where the overlap between the two groups of compounds is minimal. The PDDs obtained for DF1-4 along with the highest activity range for each function are shown in  Compounds with values below the range were considered inactive while compounds with values over the range were considered unclassified. Thus, the value ranges derived from these PDDs establish the applicability domain for each of the discriminant functions.
Biomolecules 2020, 10, x 5 of 12 We plotted the corresponding PDDs for every function to visualize the values of the function in which the probability of classifying a compound as active or inactive is maximum. In other words, to find areas where the overlap between the two groups of compounds is minimal. The PDDs obtained for DF1-4 along with the highest activity range for each function are shown in Figures 1-4. Compounds with values below the range were considered inactive while compounds with values over the range were considered unclassified. Thus, the value ranges derived from these PDDs establish the applicability domain for each of the discriminant functions.

DF1
DF1 Highest Activity Expectancy = 2-10  We plotted the corresponding PDDs for every function to visualize the values of the function in which the probability of classifying a compound as active or inactive is maximum. In other words, to find areas where the overlap between the two groups of compounds is minimal. The PDDs obtained for DF1-4 along with the highest activity range for each function are shown in Figures 1-4. Compounds with values below the range were considered inactive while compounds with values over the range were considered unclassified. Thus, the value ranges derived from these PDDs establish the applicability domain for each of the discriminant functions.

DF3
DF3 Highest Activity Expectancy = 0-12  When PDDs are used, the accuracy for active compounds decreases whereas that for inactive ones increases, so the probability of a false active compound being selected after applying the PDD filter decreases. The average percentage for the four equations of correctly classified inactive compounds is 94.1% for the training group and 100% for the test group; and the average percentage of accurately classified active compounds is 88.8% for the training group and 87.5% for the test group. Tables 5 and 6 summarize the classification of the results obtained for all functions selected for both training groups, active group and inactive group, respectively, and Table 7 summarizes the results for the test groups. As can be inferred from the tables, the training and test groups exhibit an average overall accuracy of 91.9%.  When PDDs are used, the accuracy for active compounds decreases whereas that for inactive ones increases, so the probability of a false active compound being selected after applying the PDD filter decreases. The average percentage for the four equations of correctly classified inactive compounds is 94.1% for the training group and 100% for the test group; and the average percentage of accurately classified active compounds is 88.8% for the training group and 87.5% for the test group. Tables 5 and 6 summarize the classification of the results obtained for all functions selected for both training groups, active group and inactive group, respectively, and Table 7 summarizes the results for the test groups. As can be inferred from the tables, the training and test groups exhibit an average overall accuracy of 91.9%. When PDDs are used, the accuracy for active compounds decreases whereas that for inactive ones increases, so the probability of a false active compound being selected after applying the PDD filter decreases. The average percentage for the four equations of correctly classified inactive compounds is 94.1% for the training group and 100% for the test group; and the average percentage of accurately classified active compounds is 88.8% for the training group and 87.5% for the test group. Tables 5 and 6 summarize the classification of the results obtained for all functions selected for both training groups, active group and inactive group, respectively, and Table 7 summarizes the results for the test groups.
As can be inferred from the tables, the training and test groups exhibit an average overall accuracy of 91.9%.

Discussion
All equations have a low value of λ, indicating that there is a low linear dependence between independent variables. Furthermore, the high value of F in the equations indicate that the selected independent variables contribute largely to the separation of the active and inactive groups. Moreover, all equations correctly classify each compound with its corresponding group with high success rates.
Three functions (DF1-3) were obtained using combinations of different types of indices: DF1 used charge indices [20] and connectivity's differences and ratios indices, while DF2 and DF3 used electrotopological-state (S i ) [21] and connectivity indices. DF4 was obtained using all 144 TI. DF1 involves three charge indices (G 1 , G 4 and J 5 ) and two connectivity indices formed by difference and ratios between non-valence and valence connectivity indices of zero order ( 0 D) and third order cluster type ( 3 C c ). The DF1 value is influenced very positively by the topological charge present in 5th order subgraphs (J 5 ), but negatively by the topological charge found in the 1st and 4th order subgraphs (G 1 and G 1 ). Differences ( 0 D) and ratios ( 3 C c ) of non-valence and valence indices affect the value of DF1 oppositely. The first type of indices describe the distribution of the global charge in the molecule through the evaluation of charge transfer between pairs of atoms, while the second type of indices take into account the charge densities from each chemical graph, representing a measure of the polarizability of the molecule.
DF2 involves one connectivity index ( 3 χ ch ) and three electrotopological indices (S =C< , S >C< and S =O ). In the case of DF2, the presence of sp 3 carbons (S >C< ) increase its value while the presence of sp 2 carbons (S =C< ), and sp 2 oxygens (S =O ) decreases it. The equation also shows a clear dependence of the activity relative to the connectivity chain type third order index ( 3 χ ch ), which implies that the presence of a cyclopropyl group greatly enhances the antibacterial activity.
DF3 involves two valence connectivity indices ( 3 χ V c and 5 χ V ch ) and four electrotopological indices (S =C< , S >C< , S -NH-and S =O ). In this case, the value of the equation is positively influenced by the valence connectivity cluster type third order index ( 3 χ V c ), meaning that the presence of ramifications (especially if they are attached to hydrogen donors) favors the antibacterial activity. On the other hand, the value of the equation is very negatively influenced by the valence connectivity chain type fifth order index ( 5 χ V ch ), meaning that the presence of 5-member rings (especially if such rings include double bonds and/or hydrogen donors). Regarding the electrotopological indices, the presence of sp 3 carbons (S >C< ) and secondary amine groups (S -NH-) increase the value of the equation while the presence of sp 2 carbons (S =C< ) and sp 2 oxygens (S =O ) decrease it.
DF4 involves two connectivity indices ( 5 χ V ch and 3 C c ) and three electrotopological indices (S =C< , S -NH-and S =O ). The connectivity indices are the valence connectivity chain type fifth order index ( 5 χ V ch ) and the ratio between non-valence and valence connectivity indices of third order cluster type ( 3 C c ). The only index that has a positive influence on the value of the equation is S -NH-, meaning that the presence of secondary amine groups favors the antibacterial activity, while the rest are detrimental for such activity.
MT transforms the structure into numbers by applying more or less complex mathematical functions [4], so it is generally difficult to draw structural conclusions from mathematical equations, but from the four discriminant functions it can be extracted that the presence of sp 3 carbons, ramifications, and secondary amine groups in a molecule enhance antibacterial activity, whereas the presence of 5-member rings, sp 2 carbons, and sp 2 oxygens hinder it.
Furthermore, these discriminant functions can be used for the development of new antibacterial drugs in different ways. First, QSAR models are traditionally used to optimize the activity of already existing molecules. Along these lines, Wang et al. [22] developed a QSAR model based on cinnamaldehyde-amino acid Schiff base compounds, a family of newly discovered compounds that exhibit good broad-spectrum antibacterial activity, which was then used to design and synthesize three new compounds with comparable antibacterial activity to that of ciprofloxacin. Similarly, De Bruijn et al. [23] used the information derived from a QSAR model to develop novel lead compounds with a 1,4-benzoxazin-3-one scaffold which were reported to be up to 5 times more active than any of the analogue compounds used to build the model. In the same way that these authors used the topological information provided by their models to increase the activity of a family of compounds, the DFs built in this work also provide information which could be useful to optimize the activity of antibacterial quinolones.
The second main approach by which these DFs could aid in the discovery of new antibacterial compounds is drug repurposing. There is extensive literature in which using QSAR models to reposition approved drugs is posed as a key tool in the development of new drug-disease associations [24,25]. More specifically, this approach has been already successfully used to identify antibacterial activity in several drugs approved for different purposes (Table 8) [26][27][28][29][30][31].

Conclusions
Currently, the development of antibiotic resistance in microorganisms is one of the most important problems that have appeared in recent years in the treatment of infectious diseases. MT has proven to be a useful methodology for identifying new compounds with antibacterial activity. By combining it with LDA, four statistically significant discriminant equations with a very good classification success rate have been obtained. The stability of the selected functions is supported by external and internal validation studies, as well as by all the statistical parameters that have been used in their selection, which can be classified as very satisfactory in all cases. These equations, used both as predictive models or as filters, can be extremely useful to find new molecules with antibacterial activity and even for use in drug repositioning, where we find thousands of candidate compounds to be tested for new pharmacological activities.
We can conclude that the discriminant functions obtained confirm MT as a powerful, cost-effective, and useful tool in the prediction of antibacterial activity. These equations, used either as predictive models or as filters such as Lipinski's, can be extremely useful to find new molecules with antibacterial activity and even for use in drug repositioning, where we find thousands of candidate compounds to be tested for new pharmacological activities. With this information, it is simpler to understand how certain non-obvious molecular properties described by TI may affect the antibacterial activity of a compound. One additional advantage of MT is that it allows virtual screening of large databases in a short time for the search of possible antibacterial compounds. Moreover, this methodology also allows the prediction of pharmacokinetic and toxicological properties to identify safer and more efficient drugs.
Supplementary Materials: The following are available online at http://www.mdpi.com/2218-273X/10/9/1343/ s1, Table S1: Active group used in the topological pattern: Paper name/code, IUPAC name and structure, and bibliographic references about activity for each compound, Table S2: Inactive group used in the topological pattern: Paper name/code, IUPAC name and structure, and bibliographic references about activity for each compound, Table S3: Symbols and Definitions of TI used with DESMOL13 and MOLCONN-Z programs.