A Novel Ensemble Based Recommendation Approach using Network Based Analysis for Identification of Effective Drugs for Tuberculosis

Tuberculosis (TB) is a fatal infectious disease that has affected millions of people worldwide for many decades, and with mutating drug resistant strains it now poses bigger challenges in the treatment of patients. Computational techniques might play a crucial role in rapidly developing new or modified anti-tuberculosis drugs which can tackle these mutating strains of TB. This research work applied a computational approach to generate a unique recommendation list of possible TB drugs as an alternative to a popular drug, EMB, by first securing an initial list of drugs from a popular online database, PubChem, and thereafter applying an ensemble of ranking mechanisms. As a novelty, both the pharmacokinetic properties and some network based attributes of the chemical structure of the drugs are considered for generating separate recommendation lists. The work also provides customized modifications on a popular and traditional ensemble ranking technique to cater to the specific dataset and requirements. The final recommendation list provides established chemical structures along with their ranks, which could be used as alternatives to EMB. It is believed that the incorporation of both pharmacokinetic and network based properties in the ensemble ranking process added to the effectiveness and relevance of the final recommendation.


Introduction
Tuberculosis (TB), caused by the bacterium Mycobacterium tuberculosis (Mtb), is a major global health hazard, with human mortality exceeding 1 million per year. Surveys have revealed the existence of Mtb for more than four decades [1], and more importantly, it has developed various degrees of resistance to multiple anti-tuberculosis drugs along this journey. Studies have also shown [2] that the pathways that led to these drug resistant Mtb strains were very complex and indicated that the bacteria acquired resistance through a step by step process, starting from a low degree mutation to a near complete drug resistance. The role of bacterial factors like persistence, compensatory evolution, fitness and hypermutation in the evolution of drug resistance was also explored [3].
There are various resources which documented the drug resistant properties of Mtb by various types of databases [4]. For example, the TBDRMD database [5] reported polymorphisms at different codon regions of certain genes of H37Rv, a specific strain of Mtb, in response to specific anti tuberculosis drugs. The VFDB database [6] provided virulence factors and structural features and functions of various bacteria including the different strains of Mtb. One important thing which came out from these studies [4] was the urgent need to rapidly discover new drugs which could tackle the mutating drug resistant strains of Mtb.
PubChem, an open chemistry database [7], provided an extensive resource on chemical structures, chemical and physical properties of a chemical compound (drug) as well as a comprehensive list of compounds (drugs) which were similar in nature. It was encouraging to know that a large number of growth inhibitors for Mtb were submitted [8] on PubChem after an elaborate process of high throughput screening and further evaluation based on potency, reproducibility and cytotoxicity. Thus, for this paper, PubChem was chosen to be the resource which could give an initial list of drugs which had some kind of similarity with an existing anti tuberculosis drug.
The basic functional effectiveness of any drug depends heavily on the molecular properties that are crucial to its pharmacokinetics, including absorption, distribution, metabolism and excretion (ADME).
The well-established Lipinski's rule of five [9] defined some boundaries on the chemical properties of the drugs which, when satisfied, made the drugs more functionally acceptable for human intake. This important aspect was emphasized in our proposed methodology, both during data pre-processing as well as during the actual recommendation process.
There were many instances found in the literature where machine learning techniques were used to suggest new drugs, using available datasets related to drug resistant Mtb. Reviews [4] have pointed out cheminformatics tools that were used on several datasets related to drug resistant Mtb. For example, a cheminformatics data fusion approach [10] was followed by validation on datasets having cytotoxicity and tuberculosis data by Support Vector Machine and Bayesian models. Subsequently, Bayesian machine learning models were applied [11] on large datasets for rapid screening and prioritizing of compounds for anti-tubercular activities before in vitro testing. As both high throughput screening techniques and chemical datasets expanded [12], the need to integrate and link the different datasets, as well as to consolidate the techniques by secure mechanisms, was put forward. It has been reported [13] that popular machine learning techniques in cheminformatics can be useful in making the drug discovery process more efficient. This was due to the faster selection of filtered compounds for testing, when compared to the traditional in vitro screening for toxicity of compounds. This inference was also validated [14] when structure-activity relationships of a couple of Mtb enzymes were analysed both by in vitro screening and by a Naïve Bayesian model with smoothing. In another review [15], the relevance of the advancement of tools, including machine learning techniques, was highlighted with regard to the development of anti-tuberculosis drugs. It also mentioned how recent advances in genomics and high throughput screening have played a significant role in coming up with a variety of drugs at a more efficient rate.
The literature also revealed interesting approaches, other than high throughput screening and cheminformatics tools, which were applied for drug discovery against drug resistant Mtb. Lipophilicity [16], along with structural peculiarities and energy depletion, was reported in some cases to be more relevant than Lipinski's rule of five in the specific context of drug resistant TB. Reports [17] have summarised newer anti-tuberculosis drugs in different stages of clinical trials, along with their various methods of action. In another study [18], the efficacy of Ethambutol (EMB), a popular anti-tuberculosis drug, was evaluated in terms of how the drug affected both the cell wall integrity and the metabolism of Mtb.
The review [15] mentioned before acknowledged the effectiveness of compounds like EMB and INH as part of a multi-drug regimen against TB, but it also underlined the critical need to come up with new anti-tuberculosis drugs, and suggested using repurposed drugs by exploring properties like genomics and crystallography. Reports [19] have also suggested that existing drugs meant for Parkinson's disease could be repositioned to treat drug resistant TB, by modelling protein-ligand interaction networks based on similarities in ligand binding sites and protein-ligand docking properties. In the context of drug repurposing in general, reviews [20] classified the approaches along two axes, drug based or disease based. The review also highlighted that computational methods of repurposing benefitted both these axes. Studies [21] have also pointed out three actual repurposed drugs which were extensively used to treat drug resistant Mtb when traditional combinations of anti-tuberculosis drugs were not effective. One study [22] looked at repurposing in general, by computational workflows, linking data availability with current trends in technology along with the associated algorithms. On similar lines, drug repositioning [23] was highlighted for associations between drug to target and target to disease, which consequently helped in predicting drug to disease mappings. In an experimental study [24], Primaquine derivatives were shown to be effective as both anti-malarial and anti-tuberculosis agents. In one more recent study [25], a traditional anti-tuberculosis drug, AMC, and a repurposed drug, Diosmin, were used together to treat TB.
There were some studies which looked at drug discovery, in general, from a graph oriented, network based approach. PROMISCUOUS [26] was a database that had an extensive list of drugs annotated with protein-protein interactions (PPI) and drug-protein interactions. This study highlighted that structural similarities among drugs, connected with PPI, have potential in the fields of multi-pharmacology and drug repositioning. Network analysis approaches [27] were studied in the context of drug discovery, and similarities between social networks and biological networks were suggested. Incomplete network motifs of bi-cliques [28] were utilised in drug-target-disease networks to predict new drugs for one or more diseases. In one study, molecular interaction networks [29] were explored for lead identification in the drug discovery pipeline. It also pointed out how this can help in exploring the emergence of drug resistance and drug repositioning. Reports [30] had also suggested the need to connect genome-based biological networks with anti-tuberculosis drug discovery to come up with potentially rational drugs. Surveys [31] had shown that network based computational approaches focussed on molecular interactions can address both drug repositioning and drug combination. In the case of finding drugs for Mtb, both of these (repositioning and combination) were very crucial for effective treatment of TB.
A proteomic structural approach [32] was adopted for creation of clusters of pocket-similarity networks, or pocketomes, which consequently helped in finding sets of binding sites within Mtb. In a recent study [33], a drug-disease proximity measure was proposed by looking into the network neighbourhoods of both disease genes and drugs. This study revealed that effectiveness of most of the drugs was limited to small sub networks of disease genes.
In summary, the literature study repeatedly highlighted the importance of discovering relevant and effective anti-tuberculosis drugs in a rapid manner. The majority of these studies acknowledged the importance of the pharmacokinetic / ADME properties while discovering new drugs, while quite a few of them looked at the problem from the structural network point of view. With these issues in context, this work is aimed at providing a recommendation list of possible anti-tuberculosis drugs which are similar to a popular and existing tuberculosis drug. The proposed method incorporated an ensemble approach, where both the pharmacokinetic properties and the network properties of the chemical structures of the recommended drugs were given special attention. For this ensemble recommendation approach, the literature showed that Borda Count [34] was a very well established method and was used both in social science and in machine learning domains. Studies [35] also showed that Borda Count can be effectively tailored for specific scenarios. In this context, the uniqueness of this effort lay in the customisation of this popular ensemble ranking method, Borda Count, as well as in utilising the network properties of the recommended drugs a second time to fine tune the final recommendation list. All these specific customisations were made to address the drug resistant property of Mtb.
This section provided a basic introduction to the needs and a summary of techniques practised in discovering drugs against drug resistant Mtb. Section 2 details the methodology of this study along with our proposed ensemble recommendation system. Section 3 provides the experimental details for a popular anti-tuberculosis drug, and documents the results at each step of the process. Section 4 discusses and analyses the significance of our results, and finally, Section 5 highlights the conclusion and future possibilities.

Methodology
The process of consolidating the recommendation list of proposed TB drugs from a given effective TB drug is shown by the block diagram in Figure 1. The process involved data pre-processing at the beginning, followed by the generation of three different recommendation lists (Uniform weighted, Pharmacokinetic weighted and Network weighted). These were then fed into an ensemble recommendation system resulting in a consolidated recommendation. The final step involved refinement of the consolidated ranking by utilising the Network weighted list of values.
[Suggested insertion of Figure 1]

Data Pre-Processing
The online PubChem database provided a list of drugs having similar characteristics for a TB drug given as a query. The dataset consisted of the list of drugs along with several attributes for each of these drugs.
Since the objective of the research work dealt with providing recommendations by utilising different computational measures, a quantifiable approach was the basic requirement. In this context, data preprocessing was done on two fronts. The first filtering was done on numeric/non-numeric nature of the attributes, and the subsequent filtering was done on the pharmacokinetic values of the attributes.

Numeric attributes
From the initial dataset, at first, all non-numeric attributes were removed. This paved the way for performing numerical computations.

Pharmacokinetic filtering
The well-established pharmacokinetic properties of chemical drugs, provided by Lipinski's rule of five, the Ghose filter and Veber's rule, were applied to the relevant attribute values, and those drugs whose attributes did not satisfy these rules were removed. The combined rules are listed in Table 1.
[Suggested insertion of Table 1]
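The filtering step can be illustrated with a minimal sketch. Since the combined rules of Table 1 are not reproduced here, the sketch applies only the commonly cited upper limits of Lipinski's rule of five (molecular weight 500, XLogP 5, hydrogen bond donors 5, hydrogen bond acceptors 10) and leaves out the Ghose and Veber conditions. The field names and the two sample records are hypothetical and are not taken from the PubChem dataset.

```python
# Minimal sketch of rule-based pharmacokinetic filtering (Lipinski limits only).
# Field names and sample values are hypothetical, not taken from the dataset.

LIPINSKI_LIMITS = {
    "mw": 500.0,           # molecular weight (g/mol)
    "xlogp": 5.0,          # octanol-water partition coefficient
    "hbond_donor": 5,      # hydrogen bond donor count
    "hbond_acceptor": 10,  # hydrogen bond acceptor count
}


def passes_lipinski(drug):
    """Return True when every Lipinski attribute is within its upper limit."""
    return all(drug[name] <= limit for name, limit in LIPINSKI_LIMITS.items())


def pharmacokinetic_filter(drugs):
    """Keep only the records that satisfy the rule set."""
    return [drug for drug in drugs if passes_lipinski(drug)]


sample = [
    {"cid": 1, "mw": 204.3, "xlogp": -0.1, "hbond_donor": 4, "hbond_acceptor": 4},
    {"cid": 2, "mw": 612.7, "xlogp": 6.2, "hbond_donor": 2, "hbond_acceptor": 9},
]
kept = pharmacokinetic_filter(sample)
```

In this toy input the second record exceeds both the molecular weight and the XLogP limits, so only the first record survives the filter.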

Ensemble of Ranked Lists
Three different ranked lists were generated from the processed dataset. The objective was to explore multiple ways of finding k nearest neighbours / drugs for the given query drug, where the entries in the ranked recommendation lists were in ascending order of dissimilarity. This meant that the topmost drug on a particular list would be the closest to the queried drug, as per the chosen measure of evaluation, and consequently the bottommost (k th) drug on the list would be the farthest. Three separate ranked recommendation lists were generated using three different evaluation measures.

Uniform Weighted Ranking
The first, and completely unbiased, ranking involved finding out the individual distances between each of the drugs in the processed dataset to the queried drug, and consequently storing the nearest drugs.
Mathematically, this process was similar to finding the Euclidean distance between vectors, and storing the closest vectors, in sorted order, with respect to a single vector, represented by the queried drug.
Each drug was represented by a vector of numeric attributes. The formula for finding out the Euclidean dissimilarity between each of the suggested drugs and the queried drug is given in Equation 1.
$$\mathrm{UED}(q, d_i) = \sqrt{\sum_{j=1}^{9} \left(q_j - d_{ij}\right)^2} \qquad (1)$$
$\mathrm{UED}(q, d_i)$ stands for the uniform Euclidean distance between the vectors of the queried drug, $q$, and $d_i$, the i th suggested drug. Each of these vectors had nine numeric entries: $q_1$ to $q_9$ represent the numeric entries of the queried drug, and $d_{i1}$ to $d_{i9}$ represent the numeric entries of the i th suggested drug.
The key characteristic of this evaluation measure was that each of the attributes was given equal weightage when distance / dissimilarity were calculated for each of the suggested drugs from the queried drug.
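A short sketch may make the uniform ranking concrete. It computes the unweighted Euclidean distance from a query vector to each candidate vector and keeps the k closest, with rank 1 being the least dissimilar. The cids, the three-attribute vectors and the function names are hypothetical; the paper's vectors have nine attributes.

```python
import math


def uniform_euclidean(query, candidate):
    """Unweighted Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((q - c) ** 2 for q, c in zip(query, candidate)))


def rank_nearest(query, candidates, k):
    """Return the k nearest (cid, distance, rank) triples, closest first."""
    scored = sorted(
        ((cid, uniform_euclidean(query, vec)) for cid, vec in candidates.items()),
        key=lambda pair: pair[1],
    )
    return [(cid, dist, rank) for rank, (cid, dist) in enumerate(scored[:k], start=1)]


# Hypothetical three-attribute vectors (the paper uses nine attributes).
query_vec = [1.0, 2.0, 3.0]
candidate_vecs = {
    101: [1.0, 2.0, 4.0],
    102: [4.0, 6.0, 3.0],
    103: [1.0, 2.0, 3.0],
}
ranking = rank_nearest(query_vec, candidate_vecs, k=2)
```

Each returned triple mirrors the three columns described for the ranked lists: the cid, the dissimilarity score and the dissimilarity rank.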

Pharmacokinetic weighted Ranking
The second ranking used the same processed dataset to generate the nearest drugs and used the same Euclidean distance measure to find the dissimilarity between each of the suggested drugs and the queried drug. However, this method gave more weightage to a subset of the attributes while calculating the overall dissimilarity. These more important attributes (four in number) were the ones mentioned in Lipinski's rule of five. Equation 2 shows the formula for calculating the dissimilarity, incorporating the added weightage given to these more important attributes. This weighted ranking approach gave more importance to the findings reported by researchers who had studied the pharmacokinetic properties of chemical drugs.
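One common way to weight a Euclidean distance is to multiply each squared difference by a per-attribute weight, and the sketch below assumes that form. The weight value of 2.0, the attribute positions and the vectors are hypothetical, since the actual weights of Equation 2 are not reproduced here.

```python
import math


def weighted_euclidean(query, candidate, weights):
    """Euclidean distance with a per-attribute weight on each squared difference."""
    return math.sqrt(
        sum(w * (q - c) ** 2 for q, c, w in zip(query, candidate, weights))
    )


# Hypothetical layout: the first two attributes are the "important" ones.
weights = [2.0, 2.0, 1.0]       # assumed weights, not the paper's values
query_vec = [0.0, 0.0, 0.0]
candidate_a = [1.0, 1.0, 0.0]   # differs only in the weighted attributes
candidate_b = [0.0, 0.0, 1.5]   # differs only in the unweighted attribute

dist_a = weighted_euclidean(query_vec, candidate_a, weights)
dist_b = weighted_euclidean(query_vec, candidate_b, weights)
```

With uniform weights, candidate_a would be closer to the query than candidate_b; with the assumed weights the order flips, which is the kind of re-ranking that emphasising a subset of attributes can produce.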

Network Weighted Ranking
This paper introduces a new set of numeric attributes, which represent the network based characteristics of the chemical drugs. These network/graph based properties not only gave more insight into the chemical structures, by means of how the atoms were situated with respect to each other and how they were connected to each other, but also indicated the importance of an atom, or of a connection between two atoms, in acting as a bridge to the remaining atoms of the network structure. This information, we believe, can be related to molecular bonding properties. The four attributes considered were (i) average degree, (ii) average closeness, (iii) average node betweenness and (iv) average edge betweenness. In their definitions, $|E|$ stands for the total number of edges and $|V|$ stands for the total number of nodes in a network. In the case of average closeness, $d(u, v)$ signifies the shortest distance between nodes $u$ and $v$. For average node betweenness, $\sigma_{st}$ represents the total number of shortest paths between the nodes $s$ and $t$, while $\sigma_{st}(v)$ shows the ones among those which pass through the node $v$. Similarly, for average edge betweenness, $\sigma_{st}$ gives the total number of shortest paths between the nodes $s$ and $t$, while $\sigma_{st}(e)$ gives the ones which pass through the edge $e$.
This network based ranked recommendation list stored k nearest neighbours / drugs with respect to the queried drug, where the input was only the ids of the suggested drugs, and the four attributes (i to iv).
All these four numeric values were calculated from the graphical representations of the chemical drugs.
The k nearest neighbours / drugs, in sorted order, with respect to the queried drug, were generated from this customized dataset in the same manner as in the Uniform Weighted Ranking, using Equation 1 to calculate the dissimilarity.
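The four network attributes can be sketched in plain Python for a small molecular graph, with atoms as nodes and bonds as edges. The normalisations are assumptions, because the paper's exact formulas are not reproduced here: average degree is taken as twice the edge count divided by the node count, closeness of a node as (number of nodes minus 1) divided by the sum of its shortest distances, and both betweenness values as unnormalised sums over unordered node pairs. The function names and the three-atom example are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs(adj, source):
    """Return (dist, sigma): hop distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def network_attributes(adj):
    """Average degree, closeness, node betweenness and edge betweenness.

    Assumes a connected, undirected, unweighted graph without self-loops
    and with at least two nodes.
    """
    nodes = list(adj)
    n = len(nodes)
    edges = {frozenset((u, w)) for u in adj for w in adj[u]}
    info = {v: bfs(adj, v) for v in nodes}

    avg_degree = 2 * len(edges) / n
    avg_closeness = sum((n - 1) / sum(info[v][0].values()) for v in nodes) / n

    node_bc = dict.fromkeys(nodes, 0.0)
    edge_bc = dict.fromkeys(edges, 0.0)
    for s, t in combinations(nodes, 2):
        d_st, sigma_st = info[s][0][t], info[s][1][t]
        for v in nodes:
            # v lies on a shortest s-t path when the two legs add up to d_st.
            if v not in (s, t) and info[s][0][v] + info[v][0][t] == d_st:
                node_bc[v] += info[s][1][v] * info[v][1][t] / sigma_st
        for edge in edges:
            u, w = tuple(edge)
            for a, b in ((u, w), (w, u)):  # try both orientations of the edge
                if info[s][0][a] + 1 + info[b][0][t] == d_st:
                    edge_bc[edge] += info[s][1][a] * info[b][1][t] / sigma_st

    return (
        avg_degree,
        avg_closeness,
        sum(node_bc.values()) / n,
        sum(edge_bc.values()) / len(edges),
    )


# Hypothetical three-atom chain C1 - C2 - N1.
chain = {"C1": ["C2"], "C2": ["C1", "N1"], "N1": ["C2"]}
attrs = network_attributes(chain)
```

The resulting four-value tuple per structure is the kind of vector that could then be compared with the query structure's vector through the Euclidean distance of Equation 1.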

Consolidated Ranking
The three different k-ranked recommendation lists were fed as input to the consolidated ranking system to generate a combined and consistent k-ranked recommendation list of drugs for a queried drug. The popular Borda ranking technique was initially chosen for this task, not only for its simple and effective way of ranking multiple ranked lists, but also for its unbiased way of giving weightage to a candidate even if it did not feature in the top k ranks in any of the ranked lists.
This paper proposes a customized version of Borda ranking to cater to the specific properties of the dataset. The ranking mechanism is demonstrated in Table 2.
[Suggested insertion of Table 2]
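For orientation, a plain (traditional) Borda count over three ranked lists can be sketched as below. This is not the customised variant proposed in this paper, whose details are given in Table 2; it only shows the baseline idea of turning positions into points and summing them across lists. The cids, the point scheme (list length minus position) and the tie-break by cid are assumptions made for illustration.

```python
from collections import defaultdict


def borda_consolidate(ranked_lists, k):
    """Merge ranked lists (best first) with a plain Borda count.

    An item at position p (0-based) in a list of length m earns m - p points;
    items missing from a list earn nothing from it.
    """
    points = defaultdict(int)
    for ranked in ranked_lists:
        m = len(ranked)
        for position, cid in enumerate(ranked):
            points[cid] += m - position
    # Higher total first; ties broken by cid only to keep the output deterministic.
    ordered = sorted(points, key=lambda cid: (-points[cid], cid))
    return ordered[:k]


# Hypothetical cids standing in for the three ranked recommendation lists.
uniform_list = [11, 12, 13, 14]
pharmacokinetic_list = [12, 11, 14, 15]
network_list = [12, 13, 11, 16]
consolidated = borda_consolidate(
    [uniform_list, pharmacokinetic_list, network_list], k=3
)
```

In this toy input, cid 12 collects the most points because it is ranked first in two lists and second in the third, so it heads the consolidated list.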

Refinement of Recommendation
The consolidated ranked recommendation list provided k nearest neighbours / drugs with respect to the queried drug by incorporating three kinds of evaluation measures. This paper specifically looked at one Tuberculosis (TB) drug in the query and consequently generated the k nearest recommended drugs.
The queried TB drug, although quite effective, was not perfect, as limited polymorphisms were reported in certain genes of a specific Mtb strain. This indicated that the effectiveness of the drug might reduce drastically in the future once the Mtb strain gradually becomes resistant to that specific drug. Therefore, an effective way to recommend other drugs could be to look for similar drugs, i.e., drugs that are high up in the ranked recommendation list, but not those drugs which are extremely similar, or identical, in terms of the attribute values. In this specific context, this paper proposes a unique refinement of the consolidated recommendation: the network based recommendation list is revisited, and those topmost recommended drugs which had network structure based attribute values identical to those of the queried drug are removed from the consolidated ranked recommendation list. The rationale is that drugs with an almost identical chemical structure might not remain effective recommendations in the long run, while top recommended drugs which are similar, but do not have identical network structures, are viable suggestions.
The consolidated ranked k nearest neighbour recommendation list was thus refined by removing those drug entries which had identical attribute values with respect to the queried drug in the network based ranked recommendation list.
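The refinement step can be sketched as a simple filter over the consolidated list. The sketch assumes that "identical network attribute values" corresponds to a network dissimilarity of zero with respect to the queried drug, and that drugs absent from the network based list are kept; the function name, the tolerance parameter and the sample values are hypothetical.

```python
def refine(consolidated, network_dissimilarity, k, tol=0.0):
    """Drop drugs whose network dissimilarity to the query is at most tol.

    consolidated: cids ordered best first.
    network_dissimilarity: cid -> distance to the query on network attributes.
    Returns up to k (rank, cid) pairs, re-ranked from 1 after the removal.
    """
    kept = [
        cid
        for cid in consolidated
        # Assumption: a cid missing from the network list is not treated as identical.
        if network_dissimilarity.get(cid, float("inf")) > tol
    ]
    return [(rank, cid) for rank, cid in enumerate(kept[:k], start=1)]


# Hypothetical consolidated ranking and network dissimilarities.
consolidated_cids = [21, 22, 23, 24, 25]
network_scores = {21: 0.0, 22: 0.12, 23: 0.0, 24: 0.4}
refined = refine(consolidated_cids, network_scores, k=3)
```

Keeping a few more candidates in the consolidated list than the k finally required leaves room for such removals while still returning k suggestions.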

Algorithm
The proposed methodology is detailed by the algorithm given below.
Input: A dataset "ID" with "n" tuples and "c" columns. Each tuple represents a drug along with its properties. The dataset was provided by PubChem in response to a queried drug.
Output: A table "FR" with "fn" tuples and 2 columns, the first column being the unique numeric code (cid) of the drug and the second column representing the consolidated rank of the drug, 1 being the highest rank.

Begin
Step 2: D[n' x num] ← ID'[n x num] // ID'[n x num] is reduced to D[n' x num] after removing the tuples (drugs) that did not satisfy the pharmacokinetic properties specified in Table 1.
Step 3a: A[tn,3] ← D[n' x num] // A[tn,3] is obtained from D[n' x num] by calculating the uniform weighted dissimilarity using Equation (1). The first column of A holds the cid of the recommended drug, the second column the uniform dissimilarity score and the third column the dissimilarity rank. tn is the number of top-ranked drugs asked for.
Step 3b: B[tn,3] ← D[n' x num] // B[tn,3] is obtained from D[n' x num] by calculating the pharmacokinetic weighted dissimilarity using Equation (2). The first column of B holds the cid of the recommended drug, the second column the weighted dissimilarity score and the third column the dissimilarity rank. tn is the number of top-ranked drugs asked for.
Step … iii) customised Borda rank for tn tuples (drugs). The inputs were the datasets A, B and C, each with tn entries and only two columns: i) the cid and ii) the dissimilarity rank.

Experimental Results
The online PubChem database was used to gather the initial dataset for this effort. The dataset provided a list of chemical structures (and their details) that were similar to a queried chemical structure. The nine numeric attributes were kept and the remaining attributes were pruned as the first part of data pre-processing, which reduced the dataset to 352 x 10. Thereafter, pharmacokinetic filtering was applied as the second part of data pre-processing, which involved the values of six (i, ii, iv, vi, vii and vii) of the numeric attributes. The filtering was done as per the rules described in Table 1, and the final pre-processed dataset, D, came out with a dimension of 236 x 10.
The cleaned dataset D was first utilised to generate the uniform weighted ranked recommendation list.
For the experiments conducted, we chose the top 15 (k=15) chemical structures / drugs which were similar to the queried drug, EMB, having a cid of 14052. The uniform weighted ranked recommendation list is shown in Table 3. This was calculated using Equation 1, where k-nearest neighbours/drugs of 14052 were listed with the dissimilarity values and the corresponding ranks, the 1st rank being the one with the least dissimilarity.
[Suggested insertion of Table 3]
The dataset D was again used to generate the pharmacokinetic weighted ranked recommendation list, where the top k (=15) nearest neighbours/drugs of 14052 were listed (in Table 4) with the dissimilarity values and the corresponding ranks, the 1st rank being the one with the least dissimilarity. The dissimilarity was calculated using Equation 2, by a Euclidean distance measure, where more weightage was given to four of the attributes (mw, xlogp, hbond donor count, hbond acceptor count).
[Suggested insertion of Table 4]
The dataset D had a list of drugs (cids) which were similar to EMB (14052) as per the PubChem query.
PubChem also provided 2-dimensional chemical structures of these cids. This paper, as a novelty, looked into four network based properties (average degree, average closeness, average node betweenness and average edge betweenness) of each of these chemical structures and found out how dissimilar each of them was from the chemical structural properties of the popular TB drug EMB. The dissimilarity was calculated by the Euclidean distance measure, and consequently the network weighted ranked recommendation list was generated for the k (=15) nearest neighbours/drugs, as shown in Table 5.
[Suggested insertion of Table 5]
After the three different ranked recommendation lists (Tables 3, 4 and 5) were generated, they were consolidated by the modified Borda ranking. The 'Consolidated Rank' column of Table 6 shows the result of this modified Borda ranking.
[Suggested insertion of Table 6]
The final list of suggested drugs with their corresponding ranks is listed in the 'Refined Recommendation' column of Table 6. According to our proposed recommendation mechanism, the chemical compound Myambutol, having cid 3279, was top ranked among the suggested chemical structures, which could be studied as an alternative to the queried drug, EMB, with cid 14052.
[Suggested insertion of Figure 2]
[Suggested insertion of Figure 3]

Discussion
It is evident from to satisfy basic pharmacokinetic properties for effective utilization by the human body. In this context, this study attempted to provide a balanced approach by giving due weightage to both these aspects by generating two additional recommendation lists, one with emphasis on network centric similarities and the other with emphasis on pharmacokinetic related similarities. Consequently, the customised ensemble ranking was aimed at providing an elegant way to come up with a consolidated ranking which took into account all these aspects.
The final list of suggested drugs (from the last column in Table 6) had 15 entries, even though Table 3, Table 4 and Table 5 showed 18 entries. These 3 additional entries were utilised in our modified Borda ranking technique. They were also helpful because, after the final refinement of the recommendation, where three chemical structures were omitted, the method was still able to suggest 15 chemical structures at the end.

Conclusion and Future Work
The effort was aimed at generating a list of possible TB drugs by using an ensemble of ranking mechanisms. One of the ranking mechanisms included a novel way of looking at the chemical structures of the drugs and evaluating them on network based attributes, and another ranking mechanism stressed pharmacokinetic compatibility. The effort also proposed unique modifications on a popular ensemble ranking technique to suit our specific requirement. The results provided established chemical structures along with their ranks, which could possibly be used as alternatives to EMB, a popular drug used for the H37Rv strain of TB. It is believed that the incorporation of the pharmacokinetic properties as well as network based properties in the ensemble ranking mechanism would make the proposed recommendations logical, practical and effective.
The proposed methodology can be applied not only to TB drugs but also to any other drugs, provided they are recorded in the online PubChem database. From the pharmacokinetic point of view, more studies can be done to verify that the recommendation does not produce an unwanted inhibitory or toxic effect on the biological system. On the other hand, while evaluating network based properties of the chemical structures, four popular properties were looked at. One can also look at more network properties, as well as graph isomorphism properties, to find out the level of similarity/dissimilarity among chemical structures of drugs.

Funding statement:
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the research work.

Data availability statement:
Data sharing is not applicable to this article as no new data were created or analysed in this study.