BE-DTI’: Ensemble framework for drug target interaction prediction using dimensionality reduction and active learning
Introduction
Pharmaceutical science is an interdisciplinary research domain that draws on various fields of science and engineering with the objective of discovering potential drugs. Drug discovery is a tedious task that involves identifying new drugs and their potential targets. Existing drug research focuses on repurposing known drugs for new targets and diseases. Because repurposed drugs are already approved and have been studied extensively [1], [2], [3], [4], repurposing reduces the cost and time involved in drug discovery. Moreover, cancer classification using optimization approaches [5], [6], [7] is one of the key research domains. In this paper, we focus on the computational prediction of drug-target interactions (DTI) using ensemble machine learning. Drug-target interaction prediction can be treated as a binary classification problem. Two sets of agents are involved: chemical compounds form the drug set, and proteins (amino acids) form the target set. Drug-target interactions play an important role in drug discovery research and help to identify new potential drug targets. They also help to explain drug mechanisms and the side effects caused by drugs. However, drug discovery faces several issues, such as time-consuming clinical trials, drug resistance and toxicity towards patients. Two further challenges are the heterogeneous effects of a drug on different people [8], [9] and the mapping of drug effects to the drug interaction pathway [10].
Drug-target interactions can be predicted using either of two approaches: experimental/clinical (in vivo) methods or computational (in silico) methods. These methods are further classified into four broad categories: docking methods [11], [12], ligand-based methods [13], literature text mining [14] and pharmacogenomics methods [15], [16]. Clinical methods are time-consuming, tiresome and difficult to reproduce [17]. Docking methods are more reliable and well accepted, but the unavailability of 3-D protein structures and time-consuming simulations are major drawbacks. Docking methods take the 3-D structure of a protein and use simulations to estimate whether it will interact with a given drug. Ligand-based methods rely on the similarity between targets (ligands), but they are less popular because information about target ligands is often lacking. Literature text mining methods explore the literature to identify relationships between a given drug and target; they too suffer from a limited information source for predicting new interactions. The fourth category, pharmacogenomics, uses drug and target features simultaneously to identify potential drug-target interactions. Pharmacogenomic approaches employ computational methods such as machine learning and kernel-based similarity to reduce the complexity of the DTI prediction problem. Recently, various computational methods have been proposed for predicting DTI [18], [19]. Computational approaches are gaining popularity because they speed up the drug discovery process.
Various online databases provide access to data on compounds and target proteins. PubChem [20] contains 35 million compounds, but only approximately 7000 of them have associated target protein information. Other databases include DrugBank [21], ChEMBL [22] and KEGG DRUG [23]. These databases help in building computational approaches and new protocols/pipelines for predicting DTI. Most of the existing computational techniques in the literature use benchmark datasets [15] obtained from online data sources, and most of them treat drug-target interaction prediction as a binary classification problem. A positive label (1) denotes the presence of an interaction, while a negative label (0) denotes non-interaction for a given drug-target pair. In most cases, positive labels are in the minority and negative labels in the majority, which leads to class imbalance. This imbalance can bias the prediction model towards the majority class (negative labels), whereas our main interest is in the minority class (positive labels). One solution is to randomly sample, from the negative class, as many instances as there are positive instances. This can improve modeling, but it discards useful information from the remaining negative instances. Another issue that arises in binary classification is the curse of dimensionality. The raw data is generally high dimensional, so one objective of the classification pipeline is to represent the data in a lower-dimensional subspace while preserving the information in the whole dataset. This can be achieved with dimensionality reduction methods, which find a subspace that can represent the whole of the data.
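The random undersampling strategy described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the sampling scheme of the proposed framework: the function name undersample_negatives, the seed, and the toy drug/target identifiers are hypothetical.

```python
import random

def undersample_negatives(pairs, labels, seed=0):
    """Keep every positive pair and an equally sized random subset of negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never request more negatives than are available.
    sampled_neg = rng.sample(neg, min(len(pos), len(neg)))
    keep = sorted(pos + sampled_neg)
    return [pairs[i] for i in keep], [labels[i] for i in keep]

# Toy data (hypothetical identifiers): 3 interacting pairs among 20.
pairs = [("drug%d" % (i // 5), "target%d" % (i % 5)) for i in range(20)]
labels = [1 if i in (2, 7, 11) else 0 for i in range(20)]

balanced_pairs, balanced_labels = undersample_negatives(pairs, labels)
print(len(balanced_labels), sum(balanced_labels))
```

In this toy case 14 of the 17 negative pairs are thrown away, which is the information loss the paragraph above refers to.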
In this paper, we propose a bagging-based ensemble framework for drug-target interaction prediction using dimensionality reduction and active learning. The rest of the paper is organized as follows. Section 2 gives a detailed overview of related work on the problem under consideration. Section 3 briefly describes the preliminaries and background of the proposed framework. Section 4 describes the proposed framework in detail. Section 5 presents the experiments and simulation results. Finally, Section 6 concludes the paper.
Related work
This section briefly reviews existing techniques for DTI prediction. We focus only on the computational approaches described in the literature. Recently, machine learning methods have been used extensively to predict drug-target interactions from drug and target features. Computational methods can be broadly categorized as feature-based methods and similarity-based methods. Similarity-based methods use kernel techniques [18], [31], matrix factorization [19], [32]
Preliminaries and background
In this section, some basic terminology and background are discussed before we turn to the proposed framework. The framework uses a decision tree as the base learner for ensemble classification. The training dataset is sampled using an active learning strategy that extends bootstrap aggregation (bagging). Active learning helps to improve bagging-based ensembles on imbalanced datasets.
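As a rough illustration of plain bootstrap aggregation, the Python sketch below trains one-level decision trees (stumps) on bootstrap samples and combines them by majority vote. It rests on several assumptions: the stump stands in for a full decision tree to keep the example short, the active learning extension is not shown, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a one-level decision tree by exhaustive search over (feature, threshold)."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for above in (0, 1):  # label predicted when the feature value exceeds t
                preds = [above if row[f] > t else 1 - above for row in X]
                err = sum(p != yi for p, yi in zip(preds, y))
                if best is None or err < best[0]:
                    best = (err, f, t, above)
    _, f, t, above = best
    return lambda row: above if row[f] > t else 1 - above

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Train each base learner on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, row):
    """Combine the base learners by majority vote."""
    votes = Counter(m(row) for m in models)
    return votes.most_common(1)[0][0]

# Toy one-dimensional data: small feature values are negative, large are positive.
X = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

models = bagging_fit(X, y)
print(bagging_predict(models, [0.05]), bagging_predict(models, [0.95]))
```

Because every bootstrap sample differs, the stumps pick slightly different thresholds, and that variation is what the vote averages over.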
Problem formulation
The drug-target interaction prediction problem can be defined as a supervised machine learning (classification) problem. Let X be a set of drugs and Y be a set of targets. The problem is to determine whether an interaction exists between the existing drug candidates (Nx) and target proteins (Ny). Existing drug-target interactions can be represented by a matrix Z whose rows denote drugs and whose columns denote targets. If there exists an interaction between drug i and target j then zij=1 otherwise
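A minimal Python sketch of the interaction matrix Z may make the formulation concrete. It assumes that pairs without a known interaction are encoded as 0, in line with the negative label used in the introduction. The helper names and the drug/target identifiers are hypothetical.

```python
def build_interaction_matrix(drugs, targets, known_pairs):
    """Return Z with one row per drug and one column per target."""
    drug_idx = {d: i for i, d in enumerate(drugs)}
    target_idx = {t: j for j, t in enumerate(targets)}
    # Assumption: every pair without a known interaction is labelled 0.
    Z = [[0] * len(targets) for _ in drugs]
    for d, t in known_pairs:
        Z[drug_idx[d]][target_idx[t]] = 1
    return Z

def to_labelled_pairs(drugs, targets, Z):
    """Flatten Z into ((drug, target), label) examples for a binary classifier."""
    return [((d, t), Z[i][j])
            for i, d in enumerate(drugs)
            for j, t in enumerate(targets)]

drugs = ["D1", "D2", "D3"]
targets = ["T1", "T2"]
known_pairs = [("D1", "T2"), ("D3", "T1")]

Z = build_interaction_matrix(drugs, targets, known_pairs)
examples = to_labelled_pairs(drugs, targets, Z)
print(Z)
```

Flattening Z in this way gives one labelled example per drug-target pair, which is the input a binary classifier expects.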
Experimental evaluation and results
This section gives a detailed overview of the experiments and analysis performed using the proposed technique. We compare the proposed framework with other feature-based approaches: SVM, RF and DT. Further, 10-fold cross-validation is performed to check the robustness of the proposed technique. Table 3 shows the generalized confusion matrix used to evaluate the performance of the classification models.
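To make the evaluation procedure concrete, the Python sketch below computes confusion-matrix counts with a few common derived metrics and generates k-fold train/test splits. The metrics shown are common choices and are an assumption, since the text here does not list which ones the paper derives from Table 3. The function names and toy labels are hypothetical.

```python
import random

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def summary_metrics(tp, fp, tn, fn):
    """Derive precision, recall, accuracy and F1, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

y_true = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
scores = summary_metrics(*counts)
folds = list(k_fold_indices(len(y_true), k=5))
print(counts, scores)
```

On imbalanced DTI data, accuracy alone can look high even when few positives are recovered, so precision and recall on the positive class are usually reported alongside it.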
Discussion
In this paper, we propose a computational framework for DTI prediction using machine learning. The proposed framework successfully ranks drugs for varying targets and vice versa. Its main contribution is the handling of class-imbalanced data using an active learning strategy. Moreover, ensemble learning adds diversity among the base learners and hence improves prediction accuracy. Performance of all the methods is assessed in terms of AUC,
References (58)
- et al. Spotted hyena optimizer: a novel bio-inspired based metaheuristic technique for engineering applications. Adv. Eng. Softw. (2017)
- et al. Multi-objective spotted hyena optimizer: a multi-objective optimization algorithm for engineering problems. Knowl. Based Syst. (2018)
- et al. A fast flexible docking method using an incremental construction algorithm. J. Mol. Biol. (1996)
- et al. Kernel-based data fusion improves the drug–protein interaction prediction. Comput. Biol. Chem. (2011)
- et al. PubChem: integrated platform of small molecules and biological activities. Annu. Rep. Comput. Chem. (2008)
- et al. Drug–target interaction prediction with bipartite local models and hubness-aware regression. Neurocomputing (2017)
- et al. Drug-target interaction prediction: a Bayesian ranking approach. Comput. Methods Programs Biomed. (2017)
- et al. Ensemble manifold regularized sparse low-rank approximation for multiview feature embedding. Pattern Recognit. (2015)
- SIMPLS: an alternative approach to partial least squares regression. Chemom. Intell. Lab. Syst. (1993)
- et al. Neighbourhood sampling in bagging for imbalanced data. Neurocomputing (2015)