Elsevier

Journal of Theoretical Biology

Volume 293, 21 January 2012, Pages 143-150
Prediction of protein–protein interaction sites using patch-based residue characterization

https://doi.org/10.1016/j.jtbi.2011.10.021

Abstract

Identifying protein–protein interaction sites provides important clues to the function of a protein and is becoming increasingly relevant in fields such as systems biology and drug discovery. Using a patch-based model for residue characterization, we trained random forest classifiers for residue-based interface prediction and then applied a clustering procedure to their output to produce patches for patch-based interface prediction. For residue-based interface prediction, our method achieves a specificity of 0.7 and a sensitivity of 0.78. For patch-based interface prediction, it achieves a success rate of 0.80. Using the same datasets, we also compare our method with several published methods. The results show that our method is a successful predictor for both residue-based and patch-based interface prediction.

Highlights

► We construct a patch-based model for residue characterization.
► Random forests are trained to predict protein–protein interfaces based on this model.
► Compared with several published methods, our method achieves better results.

Introduction

Proteins function through interactions with other biomolecules such as proteins, nucleic acids, carbohydrates or small ligands. Protein–protein interfaces hold key information for a molecular understanding of protein function. These interactions occur through the formation of complexes, either transient or permanent. The atomic structures of protein complexes are therefore crucial for a detailed understanding of protein function. Experimental methods exist for solving the structures of complexes, but they are expensive and laborious. Computational methods such as docking are therefore expected to play an important role in structural biology. Protein docking aims at predicting the 3D structure of a protein complex from the known structures of its individual components. Its two major problems are the vast search space of potential poses and the difficulty of ranking docked poses. Interface prediction provides important clues to the function of a protein and can help with both problems (Qin and Zhou, 2007). Predicted interface residues can help limit the initial search; this is referred to as front-end use. Alternatively, they can assist in scoring the docked poses; this is referred to as back-end use (Zhou and Qin, 2007). Interface prediction has been put to both front-end and back-end use in docking studies (Heuser et al., 2005, Tress et al., 2005, van Dijk et al., 2005, Chelliah et al., 2006, Tjong et al., 2007). To aid the understanding of protein function, computational studies of protein–protein interaction networks and protein–protein interaction mechanisms have also been presented (Chou and Cai, 2006, Hu et al., 2011, Ren et al., 2011, Xia et al., 2010, Yang and Jiang, 2010, Zhou, 2011).

A number of articles have been published on interface prediction. Many properties have been identified that have some predictive power for interfaces (de Vries and Bonvin, 2008). They can be roughly divided into three groups: (a) the type and properties of the residues in the amino acid sequence; (b) the evolutionary conservation; (c) the information contained in the atomic coordinates of the structure. Unfortunately, no single property is sufficient for unambiguous identification of the interface (Zhou and Qin, 2007). Many methods therefore integrate several of these properties, and they can be grouped into two approaches. The first approach optimizes a usually small number of parameters to construct a discriminant function that combines the properties, linearly or nonlinearly (Jones and Thornton, 1997, de Vries et al., 2006, Li et al., 2006, Liang et al., 2006, Kufareva et al., 2007). The second approach uses a machine learning algorithm to combine the properties in an optimal way using a large number of parameters. Popular machine learning algorithms are neural networks (NN) (Zhou and Shan, 2001, Fariselli et al., 2002, Ofran and Rost, 2003, Chen and Zhou, 2005, Ofran and Rost, 2007, Pettit et al., 2007, Porollo and Meller, 2007), support vector machines (SVM) (Koike and Takagi, 2004, Bordner and Abagyan, 2005, Bradford and Westhead, 2005, Res et al., 2005, Chung et al., 2006, Wang et al., 2006a, Wang et al., 2006b, Dong et al., 2007), random forests (Chen and Jeong, 2009, Sikic et al., 2009) and Bayesian networks (Neuvirth et al., 2004, Bradford et al., 2006).

Introduction of novel mathematical approaches and physical concepts into molecular biology, such as Mahalanobis distance (Chou, 1995), pseudo amino acid composition (Chou, 2001), graph and diagram analysis (Andraos, 2008, Chou, 1989a, Chou, 2010, Zhou, 2011, Zhou and Deng, 1984), cellular automaton (Xiao et al., 2009), gray system theory (Xiao et al., 2008b), geometric moments (Xiao et al., 2008a), low-frequency (or Terahertz frequency) phonons (Chou, 1988, Chou, 1989b, Madkan et al., 2009), surface diffusion-controlled reaction (Chou and Zhou, 1982), ensemble classifier (Chou and Shen, 2007b) and various network approaches (He et al., 2010, Hu et al., 2011, Huang et al., 2010), can significantly stimulate the development of biological and medical science. Here, we would like to introduce a novel approach, the so-called “patch-based residue characterization”, for prediction of protein–protein interaction sites.

In this study, without direct use of sequence information, we design a novel residue characterization model, the patch-based model, which contains 28 properties based on 3D structure to characterize each surface residue in a protein. That is to say, a residue is characterized by the properties of three patches centered on it. Random forests are chosen as our classification algorithm. Besides interface prediction, they have also been used for protein prediction (Jia and Hu, 2011, Kandaswamy et al., 2011). Using this patch-based model, we define interacting residues in several different ways and construct a corresponding random forest classifier for each definition. Combining these classifiers results in better sensitivity in prediction. The significance of these interface residue definitions is also discussed.
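The following is a minimal sketch of how three patches centred on a residue could be built and how the outputs of several classifiers could be merged. It is an illustration under stated assumptions, not the authors' implementation. The patch sizes (a centre residue plus its n-1 nearest spatial residues, with n = 5, 9 and 15) and the OR rule for combining classifiers follow descriptions given by later articles citing this work. Using one representative point per residue, the function names `build_patches`, `patch_mean` and `combine_or`, and the averaging of a per-residue property are all hypothetical choices made for illustration; the 28 properties themselves are not listed in this excerpt.

```python
import math

# Patch sizes: centre residue plus its n-1 nearest spatial neighbours.
# The values 5, 9 and 15 come from a citing article's description of the model.
PATCH_SIZES = (5, 9, 15)


def build_patches(coords, center, sizes=PATCH_SIZES):
    """Return one patch (a list of residue indices) per size, centred on `center`.

    `coords` holds one representative 3D point per surface residue
    (an assumption; the excerpt does not say how residue distance is measured).
    """
    # Sort residues by distance to the centre; the centre itself has distance 0.
    order = sorted(
        range(len(coords)),
        key=lambda i: math.dist(coords[i], coords[center]),
    )
    return [order[:n] for n in sizes]


def patch_mean(values, patch):
    """Average a per-residue property over a patch (hypothetical patch feature)."""
    return sum(values[i] for i in patch) / len(patch)


def combine_or(*predictions):
    """OR rule: a residue is called interface if any classifier says so."""
    return [any(votes) for votes in zip(*predictions)]


if __name__ == "__main__":
    # Toy data: 20 residues on a straight line, 1 unit apart.
    coords = [(float(i), 0.0, 0.0) for i in range(20)]
    small, medium, large = build_patches(coords, center=10)
    print(sorted(small))
    print(combine_or([True, False, False], [False, False, True]))
```

The OR rule can only add positive calls, which is why it raises sensitivity; the same property means it can also lower specificity.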

Current methods for interface prediction have two kinds of prediction target: residues and patches. Most are residue-based methods. Only a few methods predict surface patches (Jones and Thornton, 1997, Bradford and Westhead, 2005, Bradford et al., 2006, Higa and Tozzi, 2008). In these methods, the patch area for a test protein is estimated from its total surface area by a linear regression fitted on the training dataset. Consequently, for a given protein, every predicted patch has the same fixed area. With interface residue prediction as the first stage, we use a clustering procedure as the second stage to cluster the output of the first stage and produce continuous patches whose area is not fixed. Patch-based interface predictors generally identify only surface patches, whereas our method can predict surface residues and surface patches at the same time.
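The two-stage idea can be sketched as follows: residues predicted as interface in the first stage are grouped into spatially continuous patches whose size depends on the data rather than on a preset area. The clustering procedure actually used by the authors is not described in this excerpt, so the sketch substitutes a simple single-linkage grouping with a distance cutoff. The function name `cluster_residues`, the use of one representative point per residue and the cutoff value are assumptions made for illustration only.

```python
import math
from collections import deque


def cluster_residues(coords, predicted, cutoff=6.0):
    """Group predicted interface residues into connected patches.

    Two residues are linked when their representative points lie within
    `cutoff` of each other (a hypothetical threshold). Patches are the
    connected components of that graph, returned largest first.
    """
    unvisited = set(predicted)
    patches = []
    while unvisited:
        # Start a new patch from the lowest-numbered unassigned residue.
        seed = min(unvisited)
        unvisited.remove(seed)
        queue = deque([seed])
        patch = [seed]
        while queue:
            current = queue.popleft()
            near = [
                j for j in unvisited
                if math.dist(coords[current], coords[j]) <= cutoff
            ]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                patch.append(j)
        patches.append(sorted(patch))
    # Larger patches first; ties keep their discovery order.
    patches.sort(key=len, reverse=True)
    return patches


if __name__ == "__main__":
    coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (30, 0, 0), (33, 0, 0), (100, 0, 0)]
    # Residue 5 was not predicted as interface, so it is ignored.
    print(cluster_residues(coords, predicted=[0, 1, 2, 3, 4], cutoff=4.0))
```

Because patch membership follows from which residues were predicted and how close they are, the patch area varies from protein to protein instead of being fixed in advance.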

Materials and methods

According to a recent comprehensive review (Chou, 2011), to develop a useful predictor for protein systems, the following things often need to be considered: (i) benchmark dataset construction or selection, (ii) protein sample formulation, (iii) operating algorithm (or engine), (iv) anticipated accuracy and (v) web-server establishment. Below, we elaborate on how to deal with these procedures.

Results for different definitions of interface residue

In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling test and jackknife test (Chou and Zhang, 1995). Of the three, the jackknife test is deemed the most objective, as elucidated by Chou and Shen (2008) and demonstrated by Eqs. 28–30 in Chou (2011). Accordingly, the jackknife test has been increasingly and

Acknowledgments

The authors gratefully acknowledge financial support for this work from the National Natural Science Foundation (no. 11072048) and National Basic Research Program of China (no. 2009CB918501).

References (97)

  • D.N. Georgiou et al.

    Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition

    J. Theor. Biol.

    (2009)
  • S. Jones et al.

    Prediction of protein–protein interaction sites using patch analysis

    J. Mol. Biol.

    (1997)
  • K.K. Kandaswamy et al.

    AFP-Pred: a random forest approach for predicting antifreeze proteins from sequence-derived properties

    J. Theor. Biol.

    (2011)
  • J. Kyte et al.

    A simple method for displaying the hydropathic character of a protein

    J. Mol. Biol.

    (1982)
  • J.J. Li et al.

    Identifying protein–protein interfacial residues in heterocomplexes using residue conservation scores

    Int. J. Biol. Macromol.

    (2006)
  • H. Lin

    The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition

    J. Theor. Biol.

    (2008)
  • H. Mohabatkar et al.

    Prediction of GABA(A) receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine

    J. Theor. Biol.

    (2011)
  • H. Neuvirth et al.

    ProMate: a structure based prediction program to identify the location of protein–protein binding sites

    J. Mol. Biol.

    (2004)
  • Y. Ofran et al.

    Predicted protein–protein interaction sites from local sequence information

    FEBS Lett.

    (2003)
  • F.K. Pettit et al.

    HotPatch: a statistical approach to finding biologically relevant features on protein surfaces

    J. Mol. Biol.

    (2007)
  • B. Wang et al.

    Predicting protein interaction sites from residue spatial sequence profile and evolution rate

    FEBS Lett.

    (2006)
  • X. Xiao et al.

    Predicting protein structural classes with pseudo amino acid composition: an approach using geometric moments of cellular automaton image

    J. Theor. Biol.

    (2008)
  • X. Xiao et al.

    iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites

    J. Theor. Biol.

    (2011)
  • L. Yu et al.

    SecretP: identifying bacterial secreted proteins by fusing new features into Chou's pseudo-amino acid composition

    J. Theor. Biol.

    (2010)
  • Y.H. Zeng et al.

    Using the augmented Chou's pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach

    J. Theor. Biol.

    (2009)
  • G.Y. Zhang et al.

    Predicting the cofactors of oxidoreductases based on amino acid composition distribution and Chou's amphiphilic pseudo amino acid composition

    J. Theor. Biol.

    (2008)
  • G.P. Zhou

    The disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein–protein interaction mechanism

    J. Theor. Biol.

    (2011)
  • X.B. Zhou et al.

    Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes

    J. Theor. Biol.

    (2007)
  • R. Abagyan

    Protein structure prediction by global energy optimization

  • J. Andraos

    Kinetic plasticity and the determination of product ratios for kinetic schemes leading to multiple products without rate laws: new methods based on directed graphs

    Can. J. Chem.

    (2008)
  • H.M. Berman et al.

    The Protein Data Bank

    Nucleic Acids Res.

    (2000)
  • A.J. Bordner et al.

    Statistical analysis and prediction of protein–protein interfaces

    Proteins

    (2005)
  • J.R. Bradford et al.

    Improved prediction of protein–protein binding sites using a support vector machines approach

    Bioinformatics

    (2005)
  • L. Breiman

    Random forests

    Mach. Learn.

    (2001)
  • C. Chen et al.

    Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine

    Protein Pept. Lett.

    (2009)
  • H.L. Chen et al.

    Prediction of interface residues in protein–protein complexes by a consensus neural network method: test against NMR data

    Proteins

    (2005)
  • X.W. Chen et al.

    Sequence-based prediction of protein interaction sites with an integrative method

    Bioinformatics

    (2009)
  • K.C. Chou

    A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space

    Proteins

    (1995)
  • K.C. Chou

    Prediction of protein cellular attributes using pseudo amino acid composition

    Proteins

    (2001)
  • K.C. Chou

    Graphic rule for drug metabolism systems

    Curr. Drug Metab.

    (2010)
  • K.C. Chou et al.

    Predicting protein–protein interactions from sequences in a hybridization space

    J. Proteome Res.

    (2006)
  • K.C. Chou et al.

    Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites

    J. Proteome Res.

    (2007)
  • K.C. Chou et al.

    Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms (updated version: Cell-PLoc 2.0: an improved package of web-servers for predicting subcellular localization of proteins in various organisms. Natural Science, 2010, 2, 1090–1103)

    Nat. Protocols

    (2008)
  • K.C. Chou et al.

    Review: recent advances in developing web-servers for predicting protein attributes

    Nat. Sci.

    (2009)
  • K.C. Chou et al.

    Plant-mPLoc: a top–down strategy to augment the power for predicting plant protein subcellular localization

    PLoS ONE

    (2010)
  • K.C. Chou et al.

    iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins

    PLoS ONE

    (2011)
  • K.C. Chou et al.

    Review: prediction of protein structural classes

    Crit. Rev. Biochem. Mol. Biol.

    (1995)
  • K.C. Chou et al.

    Role of the protein outside active site on the diffusion-controlled reaction of enzyme

    J. Am. Chem. Soc.

    (1982)