
1 Introduction

Drugs are designed with their intended targets in mind. A drug's interaction with its target is what produces the therapeutic effect for which the drug was developed. The target may be a protein (or gene) that is directly associated with the disease, or it may be a protein whose perturbation indirectly helps counteract the disease-causing one [1]. Either way, interactions are worth studying because it is through them that the therapeutic effects of drugs take place. Note that while drug targets can be DNA, RNA, or proteins, we only consider proteins here for ease of discussion and because they currently constitute the majority of known drug targets.

The drug discovery process is costly and time-consuming; developing a drug typically takes over 10 years and costs over one billion US dollars [2]. This has led to a growing trend, namely, drug repositioning, where existing drugs are repurposed to treat new diseases [3]. The rationale behind drug repositioning is that existing drugs have already undergone clinical trials (at a minimum, phase I), meaning they are well studied and have passed safety testing. This accelerates the drug discovery effort considerably [4]. Moreover, the fact that drugs usually bind more than one protein [5, 6] further encourages the search for new interactions for known drugs.

There are experimental wet-lab techniques (e.g., bioassays [7]) that can be used to determine interactions for known drugs (or compounds, in general). However, these techniques are tedious and take time to set up and execute. Computational prediction methods, on the other hand, suggest candidate interactions quickly and cheaply. As such, computational methods can be used to narrow down the search space to be investigated by experimental techniques.

Computational methods that predict drug-target interactions use machine learning to train prediction models to detect new undiscovered interactions. The interaction (i.e., label) data required for training such models may be obtained from online databases where information on known drug-target interactions is regularly curated, such as KEGG [8], DrugBank [9], and STITCH [10]. As for the data used to represent drugs and targets, many different forms have been used in previous work; to name a few examples, drugs have been represented by string encodings of their chemical structures [11], side effects [12], and ATC codes [13], whereas targets have been represented using their protein sequences [14], Gene Ontology (GO) information [15], disease associations [16], and protein-protein interaction network information [17].

Many promising and successful computational prediction methods have appeared in the last decade. They can be roughly divided into feature-based methods and similarity-based methods. Similarity-based methods are those that are able to work with non-vectorial data (i.e., data that are not in the form of fixed-length feature vectors). For example, while each target protein can be represented by its amino acid sequence, such sequences cannot be fed directly to most machine learning algorithms because they vary in length. The same can be said of drugs' chemical structures, which vary in size and molecular weight. As such, similarity-based methods resort to the kernel trick and generate pairwise similarity matrices from these non-vectorial data. The generated matrices can then be used as input to the machine learning algorithm.

A pioneering similarity-based method [18] generated a pairwise similarity matrix for targets by computing normalized Smith-Waterman scores [19] between all pairs of target proteins using their amino acid sequences. A similar matrix was computed for drugs using the SIMCOMP [20] similarity between the drugs' chemical structures. Along with an interaction matrix (collected from KEGG [8]) indicating which drugs and targets interact, the similarity matrices were given as input to the machine learning method. Since then, this work has been followed by many other similarity-based methods [21,22,23,24,25,26,27,28,29,30,31,32,33,34]. Some of these methods generate similarity matrices from other sources of information and use them simultaneously via multiple kernel learning (e.g., [27, 33]).
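To make this concrete, the following is a minimal sketch of how such a normalized sequence-similarity matrix could be computed, assuming Biopython is installed; the short example sequences and the BLOSUM62/gap-penalty settings are illustrative choices of ours, not necessarily those of [18].

```python
import numpy as np
from Bio import Align
from Bio.Align import substitution_matrices

# Illustrative amino acid sequences; in practice these would be the
# full sequences of the target proteins.
sequences = ["MKTAYIAKQR", "MKTWYIAKQR", "GAVLIMCFYW"]

# Local alignment (Smith-Waterman style) with BLOSUM62 scoring.
aligner = Align.PairwiseAligner()
aligner.mode = "local"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10.0
aligner.extend_gap_score = -0.5

n = len(sequences)
S = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # Normalize so that each sequence's self-similarity equals 1.
        S[i, j] = aligner.score(sequences[i], sequences[j]) / np.sqrt(
            aligner.score(sequences[i], sequences[i])
            * aligner.score(sequences[j], sequences[j]))
```

The resulting matrix S is symmetric with ones on the diagonal and can serve directly as a kernel-style input to a similarity-based method.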

Feature-based methods, on the other hand, are standard methods that take input only as fixed-length feature vectors, which may at first seem like a disadvantage compared with similarity-based methods. However, similarity-based methods suffer from what is known as the cold start problem, a direct result of their dependence on matrix operations to perform predictions. To illustrate this issue, consider the case where the prediction method has already been run, and we would like to predict interactions involving new drugs and targets that were not in the original training set. In such a case, the similarity matrices would have to be recomputed, and the method would have to be run all over again to obtain predictions for the new drugs and targets. Feature-based methods do not suffer from this issue, since their trained models can directly predict interactions for new drugs and targets that did not appear in the training set.

One of the earlier feature-based methods [35] represented drugs as binary vectors indicating the presence/absence of a number of functional groups commonly found in drugs' chemical structures, while targets were represented using their pseudo amino acid composition. Drug-target pairs were then represented by concatenating the feature vectors of the involved drugs and targets. That is, if a drug d is represented by a feature vector [d1, d2, …, dp] and a target t is represented by a feature vector [t1, t2, …, tq], then the drug and target feature vectors would be concatenated to represent the drug-target pair (d, t) as [d1, d2, …, dp, t1, t2, …, tq]. A feature selection phase then selects a subset of the most relevant features. Finally, a nearest neighbor algorithm is used to obtain predictions.

Many feature-based methods come in the form of decision tree-based ensembles built on random forest [36,37,38], rotation forest [39], and extremely randomized trees [40]. Other feature-based methods have also been proposed, such as those based on fuzzy KNN [41], relevance vector machines [42], and deep learning [43,44,45,46,47,48,49]. Furthermore, there are works that primarily focus on training interpretable models [50,51,52]. Once trained, these interpretable models can be scanned for learned rules that may contain useful insights and provide a better understanding of the factors that govern drug-target interactions.

This chapter presents a machine learning method for predicting drug-target interactions. That is, given a set of known drug-target interactions, the method aims to predict new undiscovered ones. In particular, the method described below is EnsemDT, a feature-based method that has previously been published in [38]. Using that method as an example, we describe the data needed to perform drug-target interaction prediction, show how to evaluate the prediction performance of the developed machine learning method, and provide tips on how to further improve the quality of the obtained predictions.

2 Materials

In order to use machine learning to perform predictions, data on known drug-target interactions need to be gathered first so that new undiscovered interactions may be predicted based on them. More precisely, the gathered data are used to train a computational model for predicting interactions on unseen data. The data required for computational prediction of interactions via machine learning include:

  1. Interaction data showing which drugs and targets interact with each other

  2. Features for representing the different drugs

  3. Features for representing the different targets

In this work, we use the dataset that has been introduced in [37]. We elaborate on the details of this dataset in the following sections.

2.1 Interaction Data

The interaction data were obtained from the DrugBank database (version 4.3, released on 17 November 2015). More precisely, we downloaded an XML file from https://www.drugbank.ca/releases/latest and created a MySQL database from it. All interactions were extracted from the created MySQL database and used to generate an adjacency matrix Y ∈ ℝ^{n×m} with n drug rows and m target columns. That is, Yij = 1 if drug di and target tj interact, and Yij = 0 otherwise. Some statistics regarding the collected interaction data are given in Table 1. Note that, for reasons that will be mentioned shortly, some drugs and targets were eliminated from Y. The statistics provided in Table 1 reflect the final dataset after the removal of these drugs and targets.
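As a minimal illustration of this step, the following sketch builds Y from a list of extracted (drug, target) pairs; the pair list and ID indexing here are hypothetical stand-ins for the output of the MySQL queries.

```python
import numpy as np

# Hypothetical interactions extracted from the database: (drug ID, target ID).
interactions = [("DB00001", "P00734"), ("DB00002", "P00533")]

# In practice, the drug and target lists would come from the full database,
# not only from the interacting pairs.
drug_ids = sorted({d for d, _ in interactions})
target_ids = sorted({t for _, t in interactions})
row = {d: i for i, d in enumerate(drug_ids)}
col = {t: j for j, t in enumerate(target_ids)}

# Y[i, j] = 1 if drug i and target j are known to interact, 0 otherwise.
Y = np.zeros((len(drug_ids), len(target_ids)), dtype=int)
for d, t in interactions:
    Y[row[d], col[t]] = 1
```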

Table 1 Statistics of the interaction data

2.2 Drug Features

The Rcpi package [53] was used to generate features for the drugs. In order to obtain features for representing drugs using Rcpi, we first had to obtain their SMILES [11] representations (i.e., line notations that encode drugs' chemical structures) to use as input for Rcpi. The SMILES representations of the drugs can be easily obtained from the DrugBank website via a script looping over all the drug IDs. For example, the drug with ID DB04237 can have its SMILES representation downloaded from the link: https://www.drugbank.ca/structures/small_molecule_drugs/DB04237.smiles.
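A minimal sketch of such a download loop is shown below, assuming the Python requests library and a hypothetical list of drug IDs; note that DrugBank's download policies may have changed since this dataset was collected, so treat this as illustrative only.

```python
import requests

drug_ids = ["DB04237", "DB00316"]  # hypothetical list of DrugBank IDs
smiles = {}
for drug_id in drug_ids:
    url = (f"https://www.drugbank.ca/structures/"
           f"small_molecule_drugs/{drug_id}.smiles")
    response = requests.get(url)
    # Drugs without a SMILES representation (e.g., biotech drugs) are skipped.
    if response.ok and response.text.strip():
        smiles[drug_id] = response.text.strip()
```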

After the SMILES representations were obtained, the Rcpi package was used to extract the drugs' features. The extracted features include topological, constitutional, and geometrical descriptors, among other molecular properties.
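Rcpi is an R package; for readers working in Python, a rough analogue (our sketch, not the chapter's actual pipeline) is to compute a handful of molecular descriptors with RDKit:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def descriptor_vector(smiles_string):
    """Compute a small, illustrative set of molecular descriptors."""
    mol = Chem.MolFromSmiles(smiles_string)
    if mol is None:  # unparsable SMILES string
        return None
    return [
        Descriptors.MolWt(mol),         # molecular weight
        Descriptors.MolLogP(mol),       # octanol-water partition coefficient
        Descriptors.TPSA(mol),          # topological polar surface area
        Descriptors.NumHDonors(mol),    # hydrogen bond donors
        Descriptors.NumHAcceptors(mol), # hydrogen bond acceptors
    ]

print(descriptor_vector("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin
```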

Since SMILES representations are only available for small molecule drugs, biotech drugs were excluded from the dataset. In addition, some small molecule drugs were missing their SMILES representations and had to be removed as well.

2.3 Target Features

On the target side, the PROFEAT server [54] was used to extract features from the targets' amino acid sequences; target features include amino acid composition, dipeptide composition, CTD and autocorrelation descriptors, quasi-sequence order, amphiphilic pseudo amino acid composition, and total amino acid properties. The latest version of the PROFEAT server can be found at http://bidd2.nus.edu.sg/cgi-bin/profeat2016/main.cgi.

Before using PROFEAT to obtain the target features, the amino acid sequences of the target proteins were downloaded from the UniProt database [14]. However, some targets had to be excluded from the dataset, since PROFEAT cannot generate features for sequences below a minimum length. As such, targets whose sequences are shorter than 100 amino acids were removed from the final dataset.

2.4 Data Cleaning

After the features were generated, features with constant values were removed, as they would not contribute to the prediction of drug-target interactions. Moreover, some features had missing values for some drugs and targets. To deal with these cases, each missing value was assigned the mean of its respective feature over all drugs and targets.
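A sketch of this cleaning step is given below, assuming the features are held in a NumPy matrix X (instances by features) with NaN marking missing values:

```python
import numpy as np

# Toy feature matrix: 3 instances x 3 features; np.nan marks missing values.
X = np.array([[0.1, 1.0, np.nan],
              [0.4, 1.0, 0.30],
              [0.2, 1.0, 0.50]])

# Remove constant features (zero variance, ignoring missing entries).
X = X[:, np.nanstd(X, axis=0) > 0]

# Impute each remaining missing value with its feature's mean.
col_means = np.nanmean(X, axis=0)
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]
```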

2.5 Representing Drug-Target Pairs

Now that feature vectors for representing drugs and targets have been generated, each drug-target pair is represented by a feature vector formed by concatenating those of the involved drug and target. That is, each drug-target pair (d,t) is represented as

$$ \left[{d}_1,{d}_2,\dots, {d}_p,{t}_1,{t}_2,\dots, {t}_q\right] $$

where p and q are the numbers of drug and target features, respectively, and [d1, d2, …, dp] and [t1, t2, …, tq] are the feature vectors representing drug d and target t, respectively. For the remainder of this chapter, drug-target pairs will also be referred to as instances. Finally, to prevent features with larger value ranges from dominating, all features were normalized to the range [0, 1] using min-max normalization as

$$ \forall i=1,\dots, p,\kern1em {d}_i=\frac{d_i-\min \left({d}_i\right)}{\max \left({d}_i\right)-\min \left({d}_i\right)} $$
$$ \forall j=1,\dots, q,\kern1em {t}_j=\frac{t_j-\min \left({t}_j\right)}{\max \left({t}_j\right)-\min \left({t}_j\right)}. $$
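The following sketch shows both steps, pair concatenation and min-max normalization, on hypothetical feature matrices; the guard against constant columns is our addition to avoid division by zero:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to [0, 1]; constant columns map to 0."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    return (X - mins) / span

rng = np.random.default_rng(0)
drug_features = min_max_normalize(rng.random((4, 3)))    # 4 drugs, p = 3
target_features = min_max_normalize(rng.random((5, 2)))  # 5 targets, q = 2

# The pair (drug i, target j) is represented by concatenation.
i, j = 0, 2
pair_vector = np.concatenate([drug_features[i], target_features[j]])
print(pair_vector.shape)  # (5,) = p + q
```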

This dataset—including the lists of drugs and targets, their features, and the interaction data—is accessible online in the supplementary material for [37].

3 Methods

Here, we present EnsemDT [38], a simple machine learning method built on an ensemble of decision tree classifiers. More details are given below.

3.1 Method Description

The dataset used here consists of 19,676,196 drug-target pairs in total, of which 12,674 are interactions (positive instances) and the rest are non-interactions (negative instances). Including the entire dataset in the training set would make the training phase computationally intensive, as the training set would be too big. Moreover, the dataset suffers from severe class imbalance: the (negative) majority class is orders of magnitude larger than the (positive) minority class. Such imbalance can bias the predictions toward the majority class, lowering prediction performance on the minority class. Undersampling of the negative set (i.e., the set of non-interactions) is therefore utilized here to deal with both issues. That is, a subset of the full negative set is randomly sampled and appended to the training set. More precisely, a separate training set is generated for each base learner. The entire positive set (i.e., the set of interactions) is included in the training set of each base learner, while a different subset of the full negative set is randomly sampled for each base learner. The size of the sampled negative subset is controlled via the parameter npRatio: if |P| is the size of the positive set, then the size of the negative subset is npRatio × |P|.

Feature subspacing is then performed to inject more diversity into the ensemble; i.e., a different feature subset is randomly selected for each base learner. Diversity is known to be useful for improving the prediction performance of the ensemble [55]. More precisely, using a parameter r (where 0 < r ≤ 1), each base learner has a feature subset of size r × |F| randomly sampled for it (where |F| is the total number of features).

Figure 1 shows the pseudocode for EnsemDT. The number of base learners is controlled via the parameter M. Furthermore, we remind the reader that p and q are the numbers of drug and target features, respectively. Note that, in [38], there is an additional dimensionality reduction step directly following the feature subspacing step. We omit it here for simplicity, but we discuss dimensionality reduction in the upcoming Notes section (see Note 3).

Fig. 1 Pseudocode of EnsemDT, an Ensemble of Decision Tree classifiers that uses feature subspacing as well as undersampling of the negative set
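For readers who prefer executable code to pseudocode, below is a minimal Python sketch consistent with Fig. 1, written with NumPy and scikit-learn; the variable names and implementation details are ours, not those of the original MATLAB implementation of [38].

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ensemdt(X, y, M=50, np_ratio=5, r=0.2, seed=0):
    """X: pair feature matrix; y: 1 = interaction, 0 = non-interaction."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    k = max(1, int(r * X.shape[1]))  # size of each feature subset
    ensemble = []
    for _ in range(M):
        # Undersampling: a fresh negative subset of size np_ratio * |P|.
        neg_sample = rng.choice(neg, size=np_ratio * len(pos), replace=False)
        idx = np.concatenate([pos, neg_sample])
        # Feature subspacing: a random feature subset per base learner.
        feats = rng.choice(X.shape[1], size=k, replace=False)
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1 << 31)))
        tree.fit(X[np.ix_(idx, feats)], y[idx])
        ensemble.append((tree, feats))
    return ensemble

def predict_ensemdt(ensemble, X):
    """Average the base learners' probabilities for the positive class."""
    scores = [tree.predict_proba(X[:, feats])[:, 1] for tree, feats in ensemble]
    return np.mean(scores, axis=0)
```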

3.2 Evaluation

An experiment needs to be executed in order to gauge the prediction performance of EnsemDT. Cross validation experiments are commonly used to assess the performance of machine learning methods. As for the evaluation metric, a frequently used one is the AUC (i.e., area under the ROC curve), owing to its insensitivity to class imbalance in the data [56]. The AUC can be computed by the equation

$$ \mathrm{AUC}=\frac{{\sum \limits}_{i=1}^{n^{+}}{\sum \limits}_{j=1}^{n^{-}}I\left(f\left({x}_i^{+}\right)>f\left({x}_j^{-}\right)\right)}{n^{-}{n}^{+}} $$

where f(x) is the prediction score given by the prediction method; \( n^{+} \) and \( n^{-} \) are the numbers of positive and negative instances in the data, respectively; and \( {x}_i^{+} \) and \( {x}_j^{-} \) are the current positive and negative instances, respectively. Note that I(⋅) is an indicator function: I(⋅) = 1 when \( f\left({x}_i^{+}\right)>f\left({x}_j^{-}\right) \) and I(⋅) = 0 otherwise. The AUC is adequate here since it indicates how well a prediction method ranks the positive instances above the negative ones. In other words, the better the AUC, the more confident we are in the method's ability to rank the true interactions highly.
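This equation translates almost directly into code; the following sketch is a deliberately literal (and therefore inefficient) implementation, in which ties count as incorrect rankings, exactly as the indicator above specifies. In practice, a library routine such as scikit-learn's roc_auc_score computes the same quantity far more efficiently.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs that are ranked correctly."""
    scores_pos = np.asarray(scores_pos)
    scores_neg = np.asarray(scores_neg)
    wins = (scores_pos[:, None] > scores_neg[None, :]).sum()
    return wins / (len(scores_pos) * len(scores_neg))

print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1]))  # 11/12 ≈ 0.917
```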

Here, we conduct a 5-fold cross validation experiment. That is, the data is divided into five folds, and each fold takes a turn as the test set while the remaining folds are used as the training set. An AUC score is computed for each fold, and the five scores are then averaged to give the final score. In Table 2, EnsemDT is compared with other state-of-the-art feature-based methods in terms of AUC.
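A sketch of such a cross validation loop is given below; it reuses the train_ensemdt and predict_ensemdt functions sketched earlier, and the stratified splitting (which preserves the class ratio in each fold) is our choice rather than a detail stated in the chapter.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cross_validate(X, y, n_splits=5, seed=0):
    """Return the mean AUC over n_splits cross validation folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in skf.split(X, y):
        ensemble = train_ensemdt(X[train_idx], y[train_idx])
        scores = predict_ensemdt(ensemble, X[test_idx])
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))
```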

Table 2 AUC results of the cross validation experiment

Comparing the results of EnsemDT against those of the other methods shows that EnsemDT succeeds in improving prediction performance beyond the competing methods. In the Notes section, we investigate and analyze the impacts of the different components of EnsemDT (see Note 1).

The parameter values in EnsemDT were set as follows: npRatio = 5, r = 0.2, and M = 50. These values were selected based on the results of sensitivity analyses that are provided later in the Notes section (see Note 2). As for the other methods, default values for decision tree, random forest, and support vector machines were obtained from MATLAB's fitctree, TreeBagger, and fitcsvm functions, respectively. Other notes pertaining to EnsemDT and machine learning methods in general are provided as well (see Notes 4 and 5).

4 Notes

  1.

    One way to analyze the different aspects of EnsemDT is to create multiple variants, where, in each variant, one aspect is removed to see how the method would perform without it. In the first variant, the training set is the same for all base learners (a single subset of the negative data is randomly sampled and used for all base learners), meaning that the only difference between the base learners is the different feature subsets obtained from the feature subspacing step. The second variant is identical to the first (i.e., same training set for all base learners) except that bagging (i.e., random sampling with replacement) is performed on the training sets of the base learners. The third variant is similar to the second except that bagging is performed only on the negative instances (i.e., the entire positive set is included in the training sets of all base learners). The fourth variant removes bagging and instead randomly samples a different negative set for each base learner; this is none other than the original EnsemDT included in Table 2. Table 3 shows the results of the comparison between the four variants.

    The first variant (same training set) acts as a baseline against which to compare the rest of the variants. The second variant (bagging on the same training set) shows a decrease in prediction performance. This result is surprising, because bagging typically improves prediction performance by adding diversity to the ensemble. Note that the second variant is the most similar to random forest, differing only in parameter tuning. The third variant explains this surprising result: when bagging is applied only to the negative instances (i.e., the positive set is left intact and used in full in the training sets of all base learners), a slight improvement in AUC is obtained. It seems that the positive instances are too precious to leave any of them out through bagging. The best results, however, are obtained by the last variant, where a different negative set is randomly sampled and added to the training set of each base learner.

  2.

    We further analyze the different aspects of a drug-target interaction prediction method by performing sensitivity analyses for its parameters. Figures 2 and 3 show sensitivity analyses that were performed for EnsemDT’s parameters: M (the number of base learners) and r (the portion of features to randomly sample for each base learner).

    For M (the number of base learners), increasing its value generally improves the prediction performance of the ensemble. This makes sense, as each base learner added to the ensemble comes with a different set of training instances represented by a different subset of randomly sampled features. This variability between the base learners adds to the diversity of the ensemble, which helps improve overall prediction performance. As for r (the portion of randomly sampled features), multiple conclusions were drawn. Using part of the feature set in each base learner was found to be better than using the entire feature set in all base learners (i.e., no feature subspacing), which makes sense because feature subspacing increases the diversity of the ensemble and thus improves prediction performance. Furthermore, prediction performance was found to be generally robust to the value of r, so it makes sense to use a smaller value as the default (e.g., r = 0.2); a smaller number of sampled features per base learner improves computational efficiency.

  3.

    When the number of base learners is high, the running time may be long due to the high dimensionality of the data. In such a case, a dimensionality reduction technique may be used to help further improve the computational efficiency of the ensemble. Table 4 shows the running times of EnsemDT without dimensionality reduction and with two dimensionality reduction techniques, singular value decomposition and partial least squares [57].

    Based on the running times given in Table 4, there is an obvious improvement in running time after applying dimensionality reduction. Occasionally, dimensionality reduction techniques have the pleasant side effect of improving prediction performance, since they may reduce noise and/or eliminate feature redundancy in the data. For the sake of completeness, we provide the pseudocode of EnsemDT with the dimensionality reduction step in Fig. 4.

  4.

    In [58], pair-input methods, i.e., methods that make predictions for pairs of objects, were studied. Methods that predict drug-target interactions are pair-input methods where the objects of a pair are a drug and a target. The authors investigated a concept called differential representation bias, which refers to how much an object appears (or is represented) in the positive class (interactions) as opposed to the negative class (non-interactions). This representation bias was found to be generally beneficial for prediction performance. For example, assuming the training data accurately reflect the general population, if an object is represented in the positive training data more than in the negative training data, then the prediction method is more likely to predict test pairs involving this object as positive, which is usually correct [58].

    However, in drug-target interaction data, an extreme case of differential representation bias is prevalent: there are always drugs and targets that appear only in the negative set. This is because many drugs and targets have only a single known interaction, which may be left out in the cross validation experiments, leaving the involved drug or target absent from the positive training data. As a result, interactions involving such drugs and targets are unlikely to be predicted correctly.

    As such, an experiment was performed where we modified the procedure for randomly sampling the negative set for each base learner. Specifically, a negative instance (i.e., a non-interacting drug-target pair) is not added to the training set of any base learner unless both the involved drug and target have each appeared at least once in the positive set; a minimal sketch of this filtering step is given after these notes. Results of this experiment are given in Table 5.

    As can be seen from Table 5, the sampling modification had quite a positive impact on the AUC results produced by EnsemDT. This shows the importance of ensuring that drugs and targets appearing in the negative training data also appear in the positive training data.

  5.

    It is generally desirable to train prediction models with the most reliable data possible. Unfortunately, a common issue in drug-target interaction datasets is the absence of a reliable set of non-interactions for training prediction models. In the case of EnsemDT, negative instances are sampled from the set of drug-target pairs that are not known to interact. While these non-interactions are not of high confidence, it is assumed that the majority of them are correct (i.e., true non-interactions), with relatively few being potential undiscovered interactions. Most prediction methods assume all non-interactions in the data to be true and use them to train prediction models.

    While researchers in the field of drug development do not typically report non-interactions encountered in their work, various efforts have been made to come up with a reliable set of negative instances. For example, the BioLip [59] and BindingDB [60] databases were used in [61] to generate a set of reliable non-interactions by collecting drug-target pairs with weak binding affinities (weaker than 10 μM). Another way to circumvent the absence of reliable negatives is to use a dataset that records the binding affinities of the different drug-target pairs (continuous values on a scale) as opposed to binary values encoding interacting and non-interacting pairs [62]. Two examples of such continuous-valued datasets are those given in [63] and [64].

    Moreover, other works have dealt with this issue via positive unlabeled learning. One such work uses a biased SVM [65], where different weights are given to the positive and negative classes (positives receive higher weights since they are more reliable); the weights are tuned to obtain the best prediction performance. PUDT [66] is another positive unlabeled learning approach, where the negative instances are first labeled as reliable negative or likely negative, and weights are then assigned to them (along with the positive instances) and tuned to give the best prediction performance.
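As promised in Note 4, here is a minimal sketch of the modified negative sampling filter; the data layout (a list of (drug, target) ID pairs aligned with a label vector) is hypothetical, and the returned indices would replace the raw negative pool passed to the per-learner sampling.

```python
import numpy as np

def filter_negative_candidates(pairs, y):
    """Keep only negatives whose drug AND target occur in some positive pair.

    pairs: list of (drug ID, target ID) tuples, aligned with label vector y.
    Returns the indices of all positives plus the admissible negatives.
    """
    pos_drugs = {d for (d, t), label in zip(pairs, y) if label == 1}
    pos_targets = {t for (d, t), label in zip(pairs, y) if label == 1}
    keep = [i for i, ((d, t), label) in enumerate(zip(pairs, y))
            if label == 1 or (d in pos_drugs and t in pos_targets)]
    return np.array(keep)
```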

Table 3 AUC results of the different EnsemDT variants
Fig. 2 Sensitivity analysis for M, showing the effect on the prediction performance (AUC) for different values of M

Fig. 3 Sensitivity analysis for r, showing the effect on the prediction performance (AUC) for different values of r

Table 4 Running times (in minutes) of EnsemDT with/without dimensionality reduction
Fig. 4 Pseudocode of EnsemDT with dimensionality reduction, where EnsemDT is augmented with a dimensionality reduction step that directly follows the feature subspacing step

Table 5 AUC results with/without the sampling modification