
1 Introduction

Drugs are designed with their intended targets in mind. A drug's interaction with its target is what produces the therapeutic effect for which the drug was developed. The target may be a protein (or gene) that is directly associated with the disease, or it may be a protein whose perturbation indirectly helps counteract the disease-causing one [1]. Either way, interactions are worth studying because it is through them that the therapeutic effects of drugs take place. Note that while drug targets can be DNA, RNA, or proteins, we only consider proteins here for ease of discussion and because they currently constitute the majority of known drug targets.

The drug discovery process is costly and time-consuming; developing a drug typically takes over 10 years and costs over one billion US dollars [2]. This has led to a growing trend, namely, drug repositioning, where existing drugs are repurposed to treat new diseases [3]. The rationale behind drug repositioning is that existing drugs have already undergone clinical trials (at a minimum, phase I), meaning they are well studied and have passed safety testing. This accelerates the drug discovery effort considerably [4]. Moreover, the fact that drugs usually bind more than one protein [5, 6] further encourages the search for new interactions for known drugs.

There are experimental wet-lab techniques (e.g., bioassays [7]) that can be used to determine interactions for known drugs (or compounds, in general). However, these techniques are tedious and take time to set up and execute. Computational prediction methods, on the other hand, suggest candidate interactions quickly and cheaply. As such, computational methods can be used to narrow down the search space to be investigated by experimental techniques.

Computational methods that predict drug-target interactions use machine learning to train prediction models to detect new undiscovered interactions. The interaction (i.e., label) data required for training such models may be obtained from online databases where information on known drug-target interactions is regularly curated, such as KEGG [8], DrugBank [9], and STITCH [10]. As for the data used to represent drugs and targets, many different forms have been used in previous work; to name a few examples, drugs have been represented by string encodings of their chemical structures [11], side effects [12], and ATC codes [13], whereas targets have been represented using their protein sequences [14], Gene Ontology (GO) information [15], disease associations [16], and protein-protein interaction network information [17].

Many promising and successful computational prediction methods have appeared in the last decade. They can be roughly divided into feature-based methods and similarity-based methods. Similarity-based methods are those that are able to work with non-vectorial data (i.e., data that are not in the form of fixed-length feature vectors). For example, while each target protein can be represented by its amino acid sequence, such sequences cannot be fed directly to most machine learning algorithms because they vary in length. The same can be said of drugs' chemical structures, which vary in size and molecular weight. As such, similarity-based methods resort to the kernel trick and generate pairwise similarity matrices from these non-vectorial data. The generated matrices can then be used as input to the machine learning algorithm.

A pioneering similarity-based method [18] generated a pairwise similarity matrix for targets by computing normalized Smith-Waterman scores [19] between all pairs of target proteins using their amino acid sequences. A similar matrix was computed for drugs using the SIMCOMP [20] similarity between the drugs' chemical structures. Along with an interaction matrix (collected from KEGG [8]) indicating which drugs and targets interact, the similarity matrices were given as input to the machine learning method. Since then, this work has been followed by many other similarity-based methods [21,22,23,24,25,26,27,28,29,30,31,32,33,34]. Some of these methods generate similarity matrices from other sources of information and use them simultaneously via multiple kernel learning (e.g., [27, 33]).
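To make this concrete, the following is a minimal sketch of how such a normalized sequence-similarity matrix could be computed, assuming Biopython is installed; the short example sequences and the BLOSUM62/gap-penalty settings are illustrative choices of ours, not necessarily those of [18].

```python
import numpy as np
from Bio import Align
from Bio.Align import substitution_matrices

# Illustrative amino acid sequences; in practice these would be the
# full sequences of the target proteins.
sequences = ["MKTAYIAKQR", "MKTWYIAKQR", "GAVLIMCFYW"]

# Local alignment (Smith-Waterman style) with BLOSUM62 scoring.
aligner = Align.PairwiseAligner()
aligner.mode = "local"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10.0
aligner.extend_gap_score = -0.5

n = len(sequences)
S = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # Normalize so that each sequence's self-similarity equals 1.
        S[i, j] = aligner.score(sequences[i], sequences[j]) / np.sqrt(
            aligner.score(sequences[i], sequences[i])
            * aligner.score(sequences[j], sequences[j]))
```

The resulting matrix S is symmetric with ones on the diagonal and can serve directly as a kernel-style input to a similarity-based method.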

Feature-based methods, on the other hand, are standard methods that take input only as fixed-length feature vectors, which may at first seem like a disadvantage compared with similarity-based methods. However, similarity-based methods suffer from what is known as the cold start problem, a direct result of their dependence on matrix operations to perform predictions. To illustrate this issue, consider the case where the prediction method has already been run, and we would like to predict interactions involving new drugs and targets that were not in the original training set. In such a case, the similarity matrices would have to be recomputed, and the method would have to be run all over again to obtain predictions for the new drugs and targets. Feature-based methods do not suffer from this issue, since their trained models can directly predict interactions for new drugs and targets that did not appear in the training set.

One of the earlier feature-based methods [35] represented drugs as binary vectors indicating the presence/absence of a number of functional groups commonly found in drugs' chemical structures, while targets were represented using their pseudo amino acid composition. Drug-target pairs were then represented by concatenating the feature vectors of the involved drugs and targets. That is, if a drug d is represented by a feature vector [d1, d2, …, dp] and a target t is represented by a feature vector [t1, t2, …, tq], then the drug and target feature vectors would be concatenated to represent the drug-target pair (d, t) as [d1, d2, …, dp, t1, t2, …, tq]. A feature selection phase then selects a subset of the most relevant features. Finally, a nearest neighbor algorithm is used to obtain predictions.

Many feature-based methods come in the form of decision tree-based ensembles built on random forest [36,37,38], rotation forest [39], and extremely randomized trees [40]. Other feature-based methods have also been proposed, such as those based on fuzzy KNN [41], relevance vector machines [42], and deep learning [43,44,45,46,47,48,49]. Furthermore, there are works that primarily focus on training interpretable models [50,51,52]. Once trained, these interpretable models can be scanned for learned rules that may contain useful insights and provide a better understanding of the factors that govern drug-target interactions.

This chapter presents a machine learning method for predicting drug-target interactions. That is, given a set of known drug-target interactions, the method aims to predict new undiscovered ones. In particular, the method described below is EnsemDT, a feature-based method that has previously been published in [38]. Using that method as an example, we describe the data needed to perform drug-target interaction prediction, show how to evaluate the prediction performance of the developed machine learning method, and provide tips on how to further improve the quality of the obtained predictions.

2 Materials

In order to use machine learning to perform predictions, data on known drug-target interactions need to be gathered first so that new undiscovered interactions may be predicted based on them. More precisely, the gathered data are used to train a computational model for predicting interactions on unseen data. The data required for computational prediction of interactions via machine learning include:

  1. Interaction data showing which drugs and targets interact with each other

  2. Features for representing the different drugs

  3. Features for representing the different targets

In this work, we use the dataset that has been introduced in [37]. We elaborate on the details of this dataset in the following sections.

2.1 Interaction Data

The interaction data were obtained from the DrugBank database (version 4.3, released on 17 November 2015). More precisely, we downloaded an XML file from https://www.drugbank.ca/releases/latest and created a MySQL database from it. All interactions were extracted from the created MySQL database and used to generate an adjacency matrix Y ∈ ℝ^{n×m} with n drug rows and m target columns. That is, Yij = 1 if drug di and target tj interact, and Yij = 0 otherwise. Some statistics regarding the collected interaction data are given in Table 1. Note that, for reasons that will be mentioned shortly, some drugs and targets were eliminated from Y. The statistics provided in Table 1 reflect the final dataset after the removal of these drugs and targets.
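As a minimal illustration of this step, the following sketch builds Y from a list of extracted (drug, target) pairs; the pair list and ID indexing here are hypothetical stand-ins for the output of the MySQL queries.

```python
import numpy as np

# Hypothetical interactions extracted from the database: (drug ID, target ID).
interactions = [("DB00001", "P00734"), ("DB00002", "P00533")]

# In practice, the drug and target lists would come from the full database,
# not only from the interacting pairs.
drug_ids = sorted({d for d, _ in interactions})
target_ids = sorted({t for _, t in interactions})
row = {d: i for i, d in enumerate(drug_ids)}
col = {t: j for j, t in enumerate(target_ids)}

# Y[i, j] = 1 if drug i and target j are known to interact, 0 otherwise.
Y = np.zeros((len(drug_ids), len(target_ids)), dtype=int)
for d, t in interactions:
    Y[row[d], col[t]] = 1
```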

Table 1 Statistics of the interaction data

2.2 Drug Features

The Rcpi package [53] was used to generate features for the drugs. In order to obtain features for representing drugs using Rcpi, we first had to obtain their SMILES [11] representations (i.e., line notations that encode drugs' chemical structures) to use as input for Rcpi. The SMILES representations of the drugs can be easily obtained from the DrugBank website via a script looping over all the drug IDs. For example, the drug with ID DB04237 can have its SMILES representation downloaded from the link: https://www.drugbank.ca/structures/small_molecule_drugs/DB04237.smiles.
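A minimal sketch of such a download loop is shown below, assuming the Python requests library and a hypothetical list of drug IDs; note that DrugBank's download policies may have changed since this dataset was collected, so treat this as illustrative only.

```python
import requests

drug_ids = ["DB04237", "DB00316"]  # hypothetical list of DrugBank IDs
smiles = {}
for drug_id in drug_ids:
    url = (f"https://www.drugbank.ca/structures/"
           f"small_molecule_drugs/{drug_id}.smiles")
    response = requests.get(url)
    # Drugs without a SMILES representation (e.g., biotech drugs) are skipped.
    if response.ok and response.text.strip():
        smiles[drug_id] = response.text.strip()
```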

After the SMILES representations were obtained, the Rcpi package was used to extract the drugs' features. The extracted features include topological, constitutional, and geometrical descriptors, among other molecular properties.
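Rcpi is an R package; for readers working in Python, a rough analogue (our sketch, not the chapter's actual pipeline) is to compute a handful of molecular descriptors with RDKit:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def descriptor_vector(smiles_string):
    """Compute a small, illustrative set of molecular descriptors."""
    mol = Chem.MolFromSmiles(smiles_string)
    if mol is None:  # unparsable SMILES string
        return None
    return [
        Descriptors.MolWt(mol),         # molecular weight
        Descriptors.MolLogP(mol),       # octanol-water partition coefficient
        Descriptors.TPSA(mol),          # topological polar surface area
        Descriptors.NumHDonors(mol),    # hydrogen bond donors
        Descriptors.NumHAcceptors(mol), # hydrogen bond acceptors
    ]

print(descriptor_vector("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin
```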

Since SMILES representations are only available for small molecule drugs, biotech drugs were excluded from the dataset. In addition, some small molecule drugs were missing their SMILES representations and had to be removed as well.

2.3 Target Features

On the target side, the PROFEAT server [54] was used to extract features from the targets' amino acid sequences; target features include amino acid composition, dipeptide composition, CTD and autocorrelation descriptors, quasi-sequence order, amphiphilic pseudo amino acid composition, and total amino acid properties. The latest version of the PROFEAT server can be found at http://bidd2.nus.edu.sg/cgi-bin/profeat2016/main.cgi.

Before using PROFEAT to obtain the target features, the amino acid sequences of the target proteins were downloaded from the UniProt database [14]. However, some targets had to be excluded from the dataset, since PROFEAT cannot generate features for sequences below a minimum length. As such, targets whose sequences are shorter than 100 amino acids were removed from the final dataset.

2.4 Data Cleaning

After the features were generated, features with constant values were removed, as they would not contribute to the prediction of drug-target interactions. Moreover, some features had missing values for some drugs and targets. To deal with these cases, each missing value was assigned the mean of its respective feature over all drugs and targets.
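A sketch of this cleaning step is given below, assuming the features are held in a NumPy matrix X (instances by features) with NaN marking missing values:

```python
import numpy as np

# Toy feature matrix: 3 instances x 3 features; np.nan marks missing values.
X = np.array([[0.1, 1.0, np.nan],
              [0.4, 1.0, 0.30],
              [0.2, 1.0, 0.50]])

# Remove constant features (zero variance, ignoring missing entries).
X = X[:, np.nanstd(X, axis=0) > 0]

# Impute each remaining missing value with its feature's mean.
col_means = np.nanmean(X, axis=0)
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]
```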

2.5 Representing Drug-Target Pairs

Now that feature vectors for representing drugs and targets have been generated, each drug-target pair is represented by a feature vector formed by concatenating those of the involved drug and target. That is, each drug-target pair (d,t) is represented as

$$ \left[{d}_1,{d}_2,\dots, {d}_p,{t}_1,{t}_2,\dots, {t}_q\right] $$

where p and q are the numbers of drug and target features, respectively, and [d1, d2, …, dp] and [t1, t2, …, tq] are the feature vectors representing drug d and target t, respectively. For the remainder of this chapter, drug-target pairs will also be referred to as instances. Finally, to prevent features with larger value ranges from dominating, all features were normalized to the range [0, 1] using min-max normalization as

$$ \forall i=1,\dots, p,\kern1em {d}_i=\frac{d_i-\min \left({d}_i\right)}{\max \left({d}_i\right)-\min \left({d}_i\right)} $$
$$ \forall j=1,\dots, q,\kern1em {t}_j=\frac{t_j-\min \left({t}_j\right)}{\max \left({t}_j\right)-\min \left({t}_j\right)}. $$
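The following sketch shows both steps, pair concatenation and min-max normalization, on hypothetical feature matrices; the guard against constant columns is our addition to avoid division by zero:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to [0, 1]; constant columns map to 0."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    return (X - mins) / span

rng = np.random.default_rng(0)
drug_features = min_max_normalize(rng.random((4, 3)))    # 4 drugs, p = 3
target_features = min_max_normalize(rng.random((5, 2)))  # 5 targets, q = 2

# The pair (drug i, target j) is represented by concatenation.
i, j = 0, 2
pair_vector = np.concatenate([drug_features[i], target_features[j]])
print(pair_vector.shape)  # (5,) = p + q
```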

This dataset—including the lists of drugs and targets, their features, and the interaction data—is accessible online in the supplementary material for [37].

3 Methods

Here, we present EnsemDT [38], a simple machine learning method built on an ensemble of decision tree classifiers. More details are given below.

3.1 Method Description

The dataset used here consists of 19,676,196 drug-target pairs in total, of which 12,674 are interactions (positive instances) and the rest are non-interactions (negative instances). Including the entire dataset in the training set would make the training phase computationally intensive, as the training set would be too big. Moreover, the dataset suffers from severe class imbalance: the (negative) majority class is orders of magnitude larger than the (positive) minority class. Such imbalance can bias the predictions toward the majority class, lowering prediction performance on the minority class. Undersampling of the negative set (i.e., the set of non-interactions) is therefore utilized here to deal with both issues. That is, a subset of the full negative set is randomly sampled and appended to the training set. More precisely, a separate training set is generated for each base learner. The entire positive set (i.e., the set of interactions) is included in the training set of each base learner, while a different subset of the full negative set is randomly sampled for each base learner. The size of the sampled negative subset is controlled via the parameter npRatio: if |P| is the size of the positive set, then the size of the negative subset is npRatio × |P|.

Feature subspacing is then performed to inject more diversity into the ensemble; i.e., a different feature subset is randomly selected for each base learner. Diversity is known to be useful for improving the prediction performance of the ensemble [55]. More precisely, using a parameter r (where 0 < r ≤ 1), each base learner has a feature subset of size r × |F| randomly sampled for it (where |F| is the total number of features).

Figure 1 shows the pseudocode for EnsemDT. The number of base learners is controlled via the parameter M. Furthermore, we remind the reader that p and q are the numbers of drug and target features, respectively. Note that, in [38], there is an additional dimensionality reduction step directly following the feature subspacing step. We omit it here for simplicity, but we discuss dimensionality reduction in the upcoming Notes section (see Note 3).

Fig. 1 Pseudocode of EnsemDT, an Ensemble of Decision Tree classifiers that uses feature subspacing as well as undersampling of the negative set
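For readers who prefer executable code to pseudocode, below is a minimal Python sketch consistent with Fig. 1, written with NumPy and scikit-learn; the variable names and implementation details are ours, not those of the original MATLAB implementation of [38].

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ensemdt(X, y, M=50, np_ratio=5, r=0.2, seed=0):
    """X: pair feature matrix; y: 1 = interaction, 0 = non-interaction."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    k = max(1, int(r * X.shape[1]))  # size of each feature subset
    ensemble = []
    for _ in range(M):
        # Undersampling: a fresh negative subset of size np_ratio * |P|.
        neg_sample = rng.choice(neg, size=np_ratio * len(pos), replace=False)
        idx = np.concatenate([pos, neg_sample])
        # Feature subspacing: a random feature subset per base learner.
        feats = rng.choice(X.shape[1], size=k, replace=False)
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1 << 31)))
        tree.fit(X[np.ix_(idx, feats)], y[idx])
        ensemble.append((tree, feats))
    return ensemble

def predict_ensemdt(ensemble, X):
    """Average the base learners' probabilities for the positive class."""
    scores = [tree.predict_proba(X[:, feats])[:, 1] for tree, feats in ensemble]
    return np.mean(scores, axis=0)
```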

3.2 Evaluation

An experiment needs to be executed in order to gauge the prediction performance of EnsemDT. Cross validation experiments are commonly used to assess the performance of machine learning methods. As for the evaluation metric, a frequently used one is the AUC (i.e., area under the ROC curve), owing to its insensitivity to class imbalance in the data [56]. The AUC can be computed by the equation

$$ \mathrm{AUC}=\frac{{\sum \limits}_{i=1}^{n^{+}}{\sum \limits}_{j=1}^{n^{-}}I\left(f\left({x}_i^{+}\right)>f\left({x}_j^{-}\right)\right)}{n^{-}{n}^{+}} $$

where f(x) is the prediction score given by the prediction method; \( n^{+} \) and \( n^{-} \) are the numbers of positive and negative instances in the data, respectively; and \( {x}_i^{+} \) and \( {x}_j^{-} \) are the current positive and negative instances, respectively. Note that I(⋅) is an indicator function: I(⋅) = 1 when \( f\left({x}_i^{+}\right)>f\left({x}_j^{-}\right) \) and I(⋅) = 0 otherwise. The AUC is adequate here since it indicates how well a prediction method ranks the positive instances above the negative ones. In other words, the better the AUC, the more confident we are in the method's ability to rank the true interactions highly.
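This equation translates almost directly into code; the following sketch is a deliberately literal (and therefore inefficient) implementation, in which ties count as incorrect rankings, exactly as the indicator above specifies. In practice, a library routine such as scikit-learn's roc_auc_score computes the same quantity far more efficiently.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs that are ranked correctly."""
    scores_pos = np.asarray(scores_pos)
    scores_neg = np.asarray(scores_neg)
    wins = (scores_pos[:, None] > scores_neg[None, :]).sum()
    return wins / (len(scores_pos) * len(scores_neg))

print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1]))  # 11/12 ≈ 0.917
```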

Here, we conduct a 5-fold cross validation experiment. That is, the data is divided into five folds, and each fold takes a turn as the test set while the remaining folds are used as the training set. An AUC score is computed for each fold, and the five scores are then averaged to give the final score. In Table 2, EnsemDT is compared with other state-of-the-art feature-based methods in terms of AUC.
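A sketch of such a cross validation loop is given below; it reuses the train_ensemdt and predict_ensemdt functions sketched earlier, and the stratified splitting (which preserves the class ratio in each fold) is our choice rather than a detail stated in the chapter.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cross_validate(X, y, n_splits=5, seed=0):
    """Return the mean AUC over n_splits cross validation folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in skf.split(X, y):
        ensemble = train_ensemdt(X[train_idx], y[train_idx])
        scores = predict_ensemdt(ensemble, X[test_idx])
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))
```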

Table 2 AUC results of the cross validation experiment

Comparing the results of EnsemDT against those of the other methods shows that EnsemDT succeeds in improving prediction performance beyond the competing methods. In the Notes section, we investigate and analyze the impacts of the different components of EnsemDT (see Note 1).

The parameter values in EnsemDT were set as follows: npRatio = 5, r = 0.2, and M = 50. These values were selected based on the results of sensitivity analyses that are provided later in the Notes section (see Note 2). As for the other methods, default values for decision tree, random forest, and support vector machines were obtained from MATLAB's fitctree, TreeBagger, and fitcsvm functions, respectively. Other notes pertaining to EnsemDT and machine learning methods in general are provided as well (see Notes 4 and 5).

4 Notes

  1.

    One way to analyze the different aspects of EnsemDT is to create multiple variants, where, in each variant, one aspect is removed to see how the method would perform without it. In the first variant, the training set is the same for all base learners (a single subset of the negative data is randomly sampled and used for all base learners), meaning that the only difference between the base learners is the different feature subsets obtained from the feature subspacing step. The second variant is identical to the first (i.e., same training set for all base learners) except that bagging (i.e., random sampling with replacement) is performed on the training sets of the base learners. The third variant is similar to the second except that bagging is performed only on the negative instances (i.e., the entire positive set is included in the training sets of all base learners). The fourth variant removes bagging and instead randomly samples a different negative set for each base learner; this is none other than the original EnsemDT included in Table 2. Table 3 shows the results of the comparison between the four variants.

    The first variant (same training set) acts as a baseline against which to compare the rest of the variants. The second variant (bagging on the same training set) shows a decrease in prediction performance. This result is surprising, because bagging typically improves prediction performance by adding diversity to the ensemble. Note that the second variant is the most similar to random forest, differing only in parameter tuning. The third variant explains this surprising result: when bagging is applied only to the negative instances (i.e., the positive set is left intact and used in full in the training sets of all base learners), a slight improvement in AUC is obtained. It seems that the positive instances are too precious to leave any of them out through bagging. The best results, however, are obtained by the last variant, where a different negative set is randomly sampled and added to the training set of each base learner.

  2.

    We further analyze the different aspects of a drug-target interaction prediction method by performing sensitivity analyses for its parameters. Figures 2 and 3 show sensitivity analyses that were performed for EnsemDT’s parameters: M (the number of base learners) and r (the portion of features to randomly sample for each base learner).

    For M (the number of base learners), increasing its value generally improves the prediction performance of the ensemble. This makes sense, as each base learner added to the ensemble comes with a different set of training instances represented by a different subset of randomly sampled features. This variability between the base learners adds to the diversity of the ensemble, which helps improve overall prediction performance. As for r (the portion of randomly sampled features), multiple conclusions were drawn. Using part of the feature set in each base learner was found to be better than using the entire feature set in all base learners (i.e., no feature subspacing), which makes sense because feature subspacing increases the diversity of the ensemble and thus improves prediction performance. Furthermore, prediction performance was found to be generally robust to the value of r, so it makes sense to use a smaller value as the default (e.g., r = 0.2); a smaller number of sampled features per base learner improves computational efficiency.

  3.

    When the number of base learners is high, the running time may be long due to the high dimensionality of the data. In such a case, a dimensionality reduction technique may be used to help further improve the computational efficiency of the ensemble. Table 4 shows the running times of EnsemDT without dimensionality reduction and with two dimensionality reduction techniques, singular value decomposition and partial least squares [57].

    Based on the running times given in Table 4, there is an obvious improvement in running time after applying dimensionality reduction. Occasionally, dimensionality reduction techniques have the pleasant side effect of improving prediction performance, since they may reduce noise and/or eliminate feature redundancy in the data. For the sake of completeness, we provide the pseudocode of EnsemDT with the dimensionality reduction step in Fig. 4.

  4.

    In [58], pair-input methods, i.e., methods that make predictions for pairs of objects, were studied. Methods that predict drug-target interactions are pair-input methods where the objects of a pair are a drug and a target. The authors investigated a concept called differential representation bias, which refers to how much an object appears (or is represented) in the positive class (interactions) as opposed to the negative class (non-interactions). This representation bias was found to be generally beneficial for prediction performance. For example, assuming the training data accurately reflect the general population, if an object is represented in the positive training data more than in the negative training data, then the prediction method is more likely to predict test pairs involving this object as positive, which is usually correct [58].

    However, in drug-target interaction data, an extreme case of differential representation bias is prevalent: there are always drugs and targets that appear only in the negative set. This is because many drugs and targets have only a single known interaction, which may be left out in the cross validation experiments, leaving the involved drug or target absent from the positive training data. As a result, interactions involving such drugs and targets are unlikely to be predicted correctly.

    As such, an experiment was performed where we modified the procedure for randomly sampling the negative set for each base learner. Specifically, a negative instance (i.e., a non-interacting drug-target pair) is not added to the training set of any base learner unless both the involved drug and target have each appeared at least once in the positive set; a minimal sketch of this filtering step is given after these notes. Results of this experiment are given in Table 5.

    As can be seen from Table 5, the sampling modification had quite a positive impact on the AUC results produced by EnsemDT. This shows the importance of ensuring that drugs and targets appearing in the negative training data also appear in the positive training data.

  5.

    It is generally desirable to train prediction models with the most reliable data possible. Unfortunately, a common issue in drug-target interaction datasets is the absence of a reliable set of non-interactions for training prediction models. In the case of EnsemDT, negative instances are sampled from the set of drug-target pairs that are not known to interact. While these non-interactions are not of high confidence, it is assumed that the majority of them are correct (i.e., true non-interactions), with relatively few being potential undiscovered interactions. Most prediction methods assume all non-interactions in the data to be true and use them to train prediction models.

    While researchers in the field of drug development do not typically report non-interactions encountered in their work, various efforts have been made to come up with a reliable set of negative instances. For example, the BioLip [59] and BindingDB [60] databases were used in [61] to generate a set of reliable non-interactions by collecting drug-target pairs with weak binding affinities (weaker than 10 μM). Another way to circumvent the absence of reliable negatives is to use a dataset that records the binding affinities of the different drug-target pairs (continuous values on a scale) as opposed to binary values encoding interacting and non-interacting pairs [62]. Two examples of such continuous-valued datasets are those given in [63] and [64].

    Moreover, other works have dealt with this issue via positive unlabeled learning. One such work uses a biased SVM [65], where different weights are given to the positive and negative classes (positives receive higher weights since they are more reliable); the weights are tuned to obtain the best prediction performance. PUDT [66] is another positive unlabeled learning approach, where the negative instances are first labeled as reliable negative or likely negative, and weights are then assigned to them (along with the positive instances) and tuned to give the best prediction performance.
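As promised in Note 4, here is a minimal sketch of the modified negative sampling filter; the data layout (a list of (drug, target) ID pairs aligned with a label vector) is hypothetical, and the returned indices would replace the raw negative pool passed to the per-learner sampling.

```python
import numpy as np

def filter_negative_candidates(pairs, y):
    """Keep only negatives whose drug AND target occur in some positive pair.

    pairs: list of (drug ID, target ID) tuples, aligned with label vector y.
    Returns the indices of all positives plus the admissible negatives.
    """
    pos_drugs = {d for (d, t), label in zip(pairs, y) if label == 1}
    pos_targets = {t for (d, t), label in zip(pairs, y) if label == 1}
    keep = [i for i, ((d, t), label) in enumerate(zip(pairs, y))
            if label == 1 or (d in pos_drugs and t in pos_targets)]
    return np.array(keep)
```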

Table 3 AUC results of the different EnsemDT variants
Fig. 2 Sensitivity analysis for M, showing the effect on the prediction performance (AUC) for different values of M

Fig. 3 Sensitivity analysis for r, showing the effect on the prediction performance (AUC) for different values of r

Table 4 Running times (in minutes) of EnsemDT with/without dimensionality reduction
Fig. 4 Pseudocode of EnsemDT with dimensionality reduction, where EnsemDT is augmented with a dimensionality reduction step that directly follows the feature subspacing step

Table 5 AUC results with/without the sampling modification