BE-DTI’: Ensemble framework for drug target interaction prediction using dimensionality reduction and active learning

https://doi.org/10.1016/j.cmpb.2018.08.011

Highlights

  • The proposed framework uses bagging-based ensemble learning, which adds diversity among the classifiers.

  • Drug features are computed with the Rcpi package, and target features with the PROFEAT web server.

  • Class imbalance is handled using active learning.

  • The high dimensionality of the data is handled using dimensionality reduction.

  • The proposed framework is compared with five existing feature-based approaches in terms of AUC, AUPR, sensitivity, specificity, and G-mean.

Abstract

Background and objective

Drug-target interaction (DTI) prediction plays an integral role in the drug discovery process. Predicting novel drugs and targets helps identify optimal drug therapies for various diseases. Computational prediction of drug-target interactions can help identify potential drug-target pairs and speed up drug repositioning. In the present work, we focus on machine learning algorithms for predicting drug-target interactions from the pool of existing drug-target data. The key idea is to train a classifier on known DTIs so that it can predict new or unknown DTIs. However, challenges such as class imbalance and the high dimensionality of the data need to be addressed before an effective drug-target interaction model can be developed.

Methods

In this paper, we propose a bagging-based ensemble framework named BE-DTI’ for drug-target interaction prediction. It combines dimensionality reduction, to deal with high-dimensional data, with active learning, which improves under-sampling bagging-based ensembles on class-imbalanced data.

Results

Results show that the proposed technique outperforms the other five competing methods in 10-fold cross-validation experiments, achieving AUC = 0.927, sensitivity = 0.886, specificity = 0.864, and G-mean = 0.874.

Conclusion

The proposed framework is used to predict missing and new interactions. To check its accuracy, some known interactions are removed from the original dataset and then re-predicted. Moreover, the proposed approach is validated on an external dataset. All these results show that structurally similar drugs tend to interact with similar targets.

Introduction

Pharmaceutical science is an interdisciplinary research domain that incorporates various fields of science and engineering with the objective of discovering potential drugs. Drug discovery is a tedious task involving the identification of new drugs and their potential targets. Existing research on drugs focuses on repurposing known drugs for new targets and diseases. Repurposed drugs are already approved and have been studied extensively [1], [2], [3], [4], which helps reduce the cost and time involved in drug discovery. Moreover, cancer classification using optimization approaches [5], [6], [7] is one of the key research domains. In this paper, we focus on the computational prediction of drug-target interactions using ensemble machine learning. Drug-target interaction prediction can be treated as a binary classification problem. Two sets of agents are involved in drug-target interactions: chemical compounds form the drug set, and proteins (amino acid sequences) form the target set. Drug-target interactions play an important role in drug discovery research and help to identify new potential drug targets. They also help in understanding drug mechanisms and the side effects caused by drugs. However, there are several issues pertaining to drug discovery, such as time-consuming clinical trials, drug resistance, and toxicity towards patients. The first issue is the heterogeneous effect of drugs on different people [8], [9], and the second is mapping a drug's effect to the drug interaction pathway [10].

Drug-target interactions can be predicted using either of two approaches: experimental/clinical (in vivo) methods or computational (in silico) methods. These methods are further classified into four broad categories: docking methods [11], [12], ligand-based methods [13], literature text mining [14], and pharmacogenomics methods [15], [16]. Clinical methods are time-consuming, tiresome, and difficult to reproduce [17]. Docking methods are more reliable and well accepted, but the unavailability of 3-D protein structures and time-consuming simulations are major drawbacks. These methods take the 3-D structure of a protein and use simulations to estimate whether it will interact with a given drug. Secondly, ligand-based methods rely on the similarity between targets (ligands). However, they are less popular because information about target ligands is often lacking. Literature text mining methods explore the literature to identify relationships between a given drug and target; they also suffer from limited information sources for predicting new interactions. The fourth category, pharmacogenomics, uses drug and target features simultaneously to identify potential drug-target interactions. Pharmacogenomic approaches apply computational methods such as machine learning and kernel-based similarity to reduce the complexity of the DTI prediction problem. Recently, various computational methods have been proposed for predicting DTI [18], [19]. Computational approaches are gaining popularity because they speed up the drug discovery process.

Various online databases provide access to data on compounds and target proteins. PubChem [20] contains 35 million compounds, but only approximately 7000 of them have information about target proteins. Other databases include DrugBank [21], ChEMBL [22], and KEGG DRUG [23]. These databases help in building computational approaches and new protocols/pipelines for predicting DTI. Most existing computational techniques proposed in the literature use benchmark datasets [15] obtained from online data sources, and most of them treat drug-target interaction prediction as a binary classification problem. A positive label (1) denotes the presence of an interaction, while a negative label (0) denotes non-interaction for a given drug-target pair. In most cases, positive labels are in the minority and negative labels in the majority, which leads to class imbalance. This imbalance can bias the prediction model towards the majority class, i.e., negative labels, whereas our main interest is in the minority class, i.e., positive labels. One solution is to randomly sample from the negative class as many instances as there are positive instances. This can help in better modeling, but it discards useful information from the negative class. Another issue that arises in binary classification is the curse of dimensionality. Raw data is generally high-dimensional, so one objective is to represent the data in a lower-dimensional subspace while preserving the information in the full data. This can be achieved using dimensionality reduction methods, which find a subspace that can represent the whole of the data.
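
The random under-sampling baseline described above can be illustrated with a minimal sketch. The function name random_undersample, the toy drug-target pairs, and the seed are assumptions made for illustration only; they are not taken from the paper.

```python
import random

def random_undersample(pairs, labels, seed=0):
    """Keep every positive pair and an equally sized random subset of negatives."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    # Draw as many negatives as there are positives, without replacement.
    kept_negatives = rng.sample(negatives, min(len(positives), len(negatives)))
    keep = sorted(positives + kept_negatives)
    return [pairs[i] for i in keep], [labels[i] for i in keep]

# Hypothetical toy data: 3 interacting pairs among 12 candidate drug-target pairs.
pairs = [(f"d{i // 4}", f"t{i % 4}") for i in range(12)]
labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

balanced_pairs, balanced_labels = random_undersample(pairs, labels)
positives_kept = sum(balanced_labels)
negatives_kept = len(balanced_labels) - positives_kept
print(positives_kept, negatives_kept)
```

In this toy case six of the nine negative pairs are discarded, which is the information loss noted above.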

In this paper, we propose a bagging-based ensemble framework for drug-target interaction prediction using dimensionality reduction and active learning. The rest of the paper is organized as follows. Section 2 gives a detailed overview of related work. Section 3 briefly describes the preliminaries and background related to the proposed framework. Section 4 describes the proposed framework in detail. Section 5 provides an overview of the experiments and simulation results. Finally, Section 6 concludes the paper.

Related work

This section presents a brief review of existing techniques for DTI prediction. We focus only on computational approaches described in the literature. Recently, machine learning methods have been used extensively to predict drug-target interactions from drug and target features. Computational methods can be broadly categorized as feature-based methods and similarity-based methods. Similarity-based methods use kernel techniques [18], [31], matrix factorization [19], [32]

Preliminaries and background

In this section, some basic terminology and background are discussed before turning to the proposed framework. The framework uses the decision tree as the base learner for ensemble classification. The training dataset is sampled using an active learning strategy, which is an extension of bootstrap aggregation (bagging). Active learning helps to improve bagging-based ensembles on imbalanced datasets.
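
To make the under-sampling bagging idea concrete, the following is a minimal, dependency-free sketch and not the BE-DTI’ implementation. It assumes decision stumps (depth-1 trees) as a stand-in for full decision trees, balanced bags built from bootstrapped positives and randomly drawn negatives, and majority voting. The active learning selection step is not reproduced because this section does not specify it; all function names and the toy data are hypothetical.

```python
import random

def fit_stump(X, y):
    """Depth-1 decision tree: choose the (feature, threshold, side) with fewest errors."""
    best = None  # (errors, feature, threshold, label_above_threshold)
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            for high_label in (0, 1):
                preds = [high_label if row[f] > threshold else 1 - high_label for row in X]
                errors = sum(p != t for p, t in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, high_label)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, high_label = stump
    return high_label if row[f] > threshold else 1 - high_label

def fit_balanced_bagging(X, y, n_estimators=11, seed=0):
    """Train one stump per balanced bag (assumes negatives outnumber positives)."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    ensemble = []
    for _ in range(n_estimators):
        # Bootstrap the positives, then draw an equal number of negatives at random.
        bag = [rng.choice(pos) for _ in pos] + rng.sample(neg, len(pos))
        ensemble.append(fit_stump([X[i] for i in bag], [y[i] for i in bag]))
    return ensemble

def predict_bagging(ensemble, row):
    """Majority vote over the base learners."""
    votes = sum(predict_stump(s, row) for s in ensemble)
    return 1 if 2 * votes > len(ensemble) else 0

# Hypothetical toy data: 3 positives with a high first feature, 10 negatives with a low one.
X = [[0.90, 0.2], [0.80, 0.7], [0.85, 0.4]] + [[0.10 + 0.03 * k, 0.1 * k] for k in range(10)]
y = [1, 1, 1] + [0] * 10

ensemble = fit_balanced_bagging(X, y)
print(predict_bagging(ensemble, [0.95, 0.5]), predict_bagging(ensemble, [0.05, 0.5]))
```

Because each bag sees a different subset of the negatives, the base learners differ from one another. An active learning variant would replace the random draw of negatives with a more informed selection.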

Problem formulation

The drug-target interaction prediction problem can be defined as a supervised machine learning (classification) problem. Let X be a set of drugs and Y be a set of targets. The problem is to determine whether an interaction exists between the existing drug candidates (Nx) and target proteins (Ny). Existing drug-target interactions can be represented by a matrix Z whose rows denote drugs and whose columns denote targets. If there exists an interaction between a drug and a target then zij=1 otherwise
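
As an illustration of the interaction matrix Z described above, the sketch below builds Z from a list of known pairs and flattens it into labeled drug-target samples. It assumes that non-interacting pairs are encoded as 0, in line with the negative label (0) used in the Introduction. The drug and target names and both function names are hypothetical.

```python
def build_interaction_matrix(drugs, targets, known_pairs):
    """Return Z as a list of rows: Z[i][j] is 1 if drug i interacts with target j, else 0."""
    drug_index = {d: i for i, d in enumerate(drugs)}
    target_index = {t: j for j, t in enumerate(targets)}
    Z = [[0] * len(targets) for _ in drugs]
    for drug, target in known_pairs:
        Z[drug_index[drug]][target_index[target]] = 1
    return Z

def to_labeled_pairs(drugs, targets, Z):
    """Flatten Z into (drug, target, label) triples for binary classification."""
    return [(d, t, Z[i][j]) for i, d in enumerate(drugs) for j, t in enumerate(targets)]

# Hypothetical example with Nx = 3 drugs and Ny = 2 targets.
drugs = ["drugA", "drugB", "drugC"]
targets = ["targetX", "targetY"]
known = [("drugA", "targetX"), ("drugC", "targetY")]

Z = build_interaction_matrix(drugs, targets, known)
samples = to_labeled_pairs(drugs, targets, Z)
print(Z)
print(len(samples), sum(label for _, _, label in samples))
```

Every one of the Nx × Ny pairs becomes one sample, so positives are typically a small fraction of the data even in this tiny example (2 of 6).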

Experimental evaluation and results

This section gives a detailed overview of the experiments and analysis performed using the proposed technique. We compare the proposed framework with other feature-based approaches: SVM, RF, and DT. Further, 10-fold cross-validation is performed to check the robustness of the proposed technique. Table 3 shows the generalized confusion matrix used to evaluate the performance of classification models.
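
For reference, the threshold-based measures named in this paper can be derived from a confusion matrix as sketched below. The sketch uses the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), G-mean = square root of their product); the paper's own formulas are not visible in this section, so this is an assumption. AUC and AUPR are omitted because they require ranked scores. The labels below are made up.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels, where 1 means interaction."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

def imbalance_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, and their geometric mean (0.0 when a class is empty)."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
    }

# Hypothetical predictions for 4 interacting and 6 non-interacting pairs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
metrics = imbalance_metrics(*counts)
print(counts, metrics)
```

G-mean is low whenever either class is predicted poorly, which makes it more informative than plain accuracy on imbalanced data.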

Discussion

In this paper, we propose a computational framework for DTI prediction using machine learning. The proposed framework successfully ranks drugs for varying targets and vice versa. Its main contribution is handling class-imbalanced data using an active learning strategy. Moreover, ensemble learning adds diversity among the base learners and hence improves prediction accuracy. Performance of all the methods is assessed in terms of AUC,

References (58)

  • A. Sharma et al. An optimized framework for cancer classification using deep learning and genetic algorithm. J. Med. Imaging Health Inf. (2017)
  • A. Sharma et al. KSRMF: kernelized similarity based regularized matrix factorization framework for predicting anti-cancer drug responses. J. Intell. Fuzzy Syst. (2018)
  • A. Sharma et al. Classification of cancerous profiles using machine learning. Machine Learning and Data Science (MLDS), 2017 International Conference on (2017)
  • A. Sharma et al. An integrated framework for identification of effective and synergistic anti-cancer drug combinations. J. Bioinf. Comput. Biol. (2018)
  • G. Dhiman et al. Emperor penguin optimizer: a bio-inspired algorithm for engineering problems. Knowl. Based Syst. (2018)
  • W.E. Evans et al. Pharmacogenomics drug disposition, drug targets, and side effects. N. Engl. J. Med. (2003)
  • D.-Q. Wei et al. Molecular modeling of two CYP2C19 SNPs and its implications for personalized drug design. Protein Pept. Lett. (2008)
  • S. Mizutani et al. Relating drug–protein interaction network with drug side effects. Bioinformatics (2012)
  • L. Xie et al. Drug discovery using chemical systems biology: weak inhibition of multiple kinases may contribute to the anti-cancer effect of nelfinavir. PLoS Comput. Biol. (2011)
  • L. Jacob et al. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics (2008)
  • S. Zhu et al. A probabilistic model for mining implicit chemical compound–gene relations from literature. Bioinformatics (2005)
  • Y. Yamanishi et al. Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics (2008)
  • S. Fakhraei et al. Network-based drug-target interaction prediction with probabilistic soft logic. IEEE/ACM Trans. Comput. Biol. Bioinf. (2014)
  • T. van Laarhoven et al. Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics (2011)
  • X. Zheng et al. Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013)
  • C. Knox et al. DrugBank 3.0: a comprehensive resource for omics research on drugs. Nucleic Acids Res. (2010)
  • A. Gaulton et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. (2011)
  • M. Kanehisa et al. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. (2011)
  • Y.-C. Wang et al. Computationally probing drug-protein interactions via support vector machine. Lett. Drug Des. Discovery (2010)