Prediction of protein–protein interaction sites using patch-based residue characterization
Highlights
► We construct a patch-based model for residue characterization.
► Random forests are trained to predict protein–protein interfaces based on this model.
► Compared with several published methods, our method achieves better results.
Introduction
Proteins function through interactions with other biomolecules such as proteins, nucleic acids, carbohydrates or small ligands. Protein–protein interfaces hold key information for a molecular understanding of protein function. These interactions occur through the formation of complexes, either transient or permanent, so the atomic structures of protein complexes are crucial for a detailed understanding of protein function. Experimental methods for solving the structures of complexes exist, but they are expensive and laborious. Computational methods such as docking are therefore expected to play an important role in structural biology. Protein docking aims to predict the 3D structure of a protein complex from the known structures of its individual components. Its two major problems are the vast search space of potential poses and the difficulty of ranking docked poses. Interface prediction provides important clues to the function of a protein and can help with both problems (Qin and Zhou, 2007). Predicted interface residues can help limit the initial search; this is referred to as front-end use. Alternatively, they can assist in scoring the docked poses; this is referred to as back-end use (Zhou and Qin, 2007). Interface prediction has been put to both front-end and back-end use in docking studies (Heuser et al., 2005, Tress et al., 2005, van Dijk et al., 2005, Chelliah et al., 2006, Tjong et al., 2007). To aid the understanding of protein function, computational studies of protein–protein interaction networks and protein–protein interaction mechanisms have also been presented (Chou and Cai, 2006, Hu et al., 2011, Ren et al., 2011, Xia et al., 2010, Yang and Jiang, 2010, Zhou, 2011).
A number of articles on interface prediction have been published. Many properties have been identified that have some predictive power for interfaces (de Vries and Bonvin, 2008). They can be roughly divided into three groups: (a) the type and properties of the residues in the amino acid sequence; (b) the evolutionary conservation; (c) the information contained in the atomic coordinates of the structure. Unfortunately, no single property is sufficient for unambiguous identification of the interface (Zhou and Qin, 2007). Therefore, many methods integrate these properties for interface prediction, and these methods can be grouped into two approaches. The first optimizes a usually small number of parameters to construct a discriminant function that combines the properties, linearly or nonlinearly (Jones and Thornton, 1997, de Vries et al., 2006, Li et al., 2006, Liang et al., 2006, Kufareva et al., 2007). The second uses a machine learning algorithm to combine different properties in an optimal way from a large number of parameters. Popular machine learning algorithms are neural networks (NN) (Zhou and Shan, 2001, Fariselli et al., 2002, Ofran and Rost, 2003, Chen and Zhou, 2005, Ofran and Rost, 2007, Pettit et al., 2007, Porollo and Meller, 2007), support vector machines (SVM) (Koike and Takagi, 2004, Bordner and Abagyan, 2005, Bradford and Westhead, 2005, Res et al., 2005, Chung et al., 2006, Wang et al., 2006a, Wang et al., 2006b, Dong et al., 2007), random forests (Chen and Jeong, 2009, Sikic et al., 2009) and Bayesian networks (Neuvirth et al., 2004, Bradford et al., 2006).
Introduction of novel mathematical approaches and physical concepts into molecular biology, such as Mahalanobis distance (Chou, 1995), pseudo amino acid composition (Chou, 2001), graph and diagram analysis (Andraos, 2008, Chou, 1989a, Chou, 2010, Zhou, 2011, Zhou and Deng, 1984), cellular automaton (Xiao et al., 2009), gray system theory (Xiao et al., 2008b), geometric moments (Xiao et al., 2008a), low-frequency (or Terahertz frequency) phonons (Chou, 1988, Chou, 1989b, Madkan et al., 2009), surface diffusion-controlled reaction (Chou and Zhou, 1982), ensemble classifier (Chou and Shen, 2007b) and various network approaches (He et al., 2010, Hu et al., 2011, Huang et al., 2010), can significantly stimulate the development of biological and medical science. Here, we would like to introduce a novel approach, the so-called “patch-based residue characterization”, for prediction of protein–protein interaction sites.
In this study, without direct use of sequence information, we design a novel residue characterization model, i.e. the patch-based model, containing 28 properties based on 3D structure to characterize each surface residue in a protein. That is, a residue is characterized by the properties of three patches centered on it. Random forests are chosen as our classification algorithm. Besides interface prediction, they have also been used for other protein prediction tasks (Jia and Hu, 2011, Kandaswamy et al., 2011). Using this patch-based model, we adopt several definitions of interacting residues and construct a corresponding random forest classifier for each. Combining these classifiers yields better sensitivity in prediction. The significance of these interface residue definitions is also discussed.
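The idea of characterizing a residue by the patches centered on it can be illustrated with a minimal sketch. It assumes each patch is the center residue plus its nearest spatial neighbors and that each patch is summarized by the mean of every property. The patch sizes, the mean aggregation, the helper names `nearest_patch` and `patch_features`, and the toy coordinates are all illustrative assumptions, not the 28 properties of the model itself.

```python
import math

def nearest_patch(center, coords, n):
    """Indices of the center residue and its n-1 nearest residues (assumed patch rule)."""
    order = sorted(range(len(coords)),
                   key=lambda j: math.dist(coords[center], coords[j]))
    return order[:n]

def patch_features(center, coords, props, sizes=(5, 9, 15)):
    """Concatenate per-property means over each patch centered on `center`.

    coords: one (x, y, z) tuple per surface residue
    props:  one list of numeric properties per residue (hypothetical values)
    sizes:  residues per patch; three patches by default (assumed sizes)
    """
    n_props = len(props[0])
    features = []
    for n in sizes:
        patch = nearest_patch(center, coords, n)
        for k in range(n_props):
            features.append(sum(props[j][k] for j in patch) / len(patch))
    return features

# Toy example: four residues on a line, one property each.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
props = [[1.0], [2.0], [3.0], [4.0]]
vector = patch_features(0, coords, props, sizes=(1, 2, 3))
```

A feature vector built this way for every surface residue could then be passed to any classifier; a random forest would be the choice matching the text.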
Current methods for interface prediction have two kinds of prediction targets: residues and patches. Most are residue-based. Only a few methods predict surface patches (Jones and Thornton, 1997, Bradford and Westhead, 2005, Bradford et al., 2006, Higa and Tozzi, 2008). In these methods, the patch area for a test protein is estimated from its total surface area by a linear regression fitted on the training dataset, so for a given protein the area of all predicted patches is a fixed value. Taking interface residue prediction as the first stage, we use a clustering procedure as the second stage to group the first-stage output into continuous patches whose area is not fixed in advance. Patch-based interface predictors generally identify only surface patches, whereas our method predicts surface residues and surface patches at the same time.
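The second-stage idea, grouping predicted interface residues into continuous patches of variable size, can be sketched as follows. The clustering algorithm is not specified in this section, so the single-linkage rule, the 6.0 distance cutoff and the function name `cluster_residues` are assumptions chosen only for illustration.

```python
import math
from collections import deque

def cluster_residues(coords, predicted, cutoff=6.0):
    """Group predicted residues into patches by single linkage (assumed rule).

    Two residues join the same patch when a chain of residues connects them
    with every consecutive pair no farther apart than `cutoff`.
    """
    unvisited = set(predicted)
    patches = []
    for seed in sorted(predicted):
        if seed not in unvisited:
            continue  # already absorbed into an earlier patch
        unvisited.discard(seed)
        queue = deque([seed])
        patch = []
        while queue:
            i = queue.popleft()
            patch.append(i)
            near = [j for j in unvisited
                    if math.dist(coords[i], coords[j]) <= cutoff]
            for j in near:
                unvisited.discard(j)
                queue.append(j)
        patches.append(sorted(patch))
    return patches

# Toy example: six predicted residues forming three separate groups.
coords = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (30, 0, 0), (33, 0, 0), (60, 0, 0)]
patches = cluster_residues(coords, predicted=[0, 1, 2, 3, 4, 5])
```

Because patch membership follows from the first-stage output rather than from a preset area, the patch size varies from protein to protein, which is the property the text emphasizes.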
Materials and methods
According to a recent comprehensive review (Chou, 2011), to develop a useful predictor for protein systems, the following things often need to be considered: (i) benchmark dataset construction or selection, (ii) protein sample formulation, (iii) operating algorithm (or engine), (iv) anticipated accuracy and (v) web-server establishment. Below, we elaborate on how we deal with these procedures.
Results for different definitions of interface residue
In statistical prediction, the following three cross-validation methods are often used to examine the effectiveness of a predictor in practical application: the independent dataset test, the subsampling test and the jackknife test (Chou and Zhang, 1995). Of the three, the jackknife test is deemed the most objective, as elucidated by Chou and Shen (2008) and demonstrated by Eqs. 28–30 in Chou (2011). Accordingly, the jackknife test has been increasingly and
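A jackknife (leave-one-out) test holds out each sample once, trains on all the others, and predicts the held-out sample, so a given benchmark dataset always yields the same result. The sketch below is a minimal illustration under stated assumptions: a 1-nearest-neighbor rule stands in for the random forest classifier, and the function names and toy data are hypothetical.

```python
import math

def nearest_neighbor_predict(train_x, train_y, x):
    """Stand-in classifier (assumption): label of the closest training sample."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[best]

def jackknife_accuracy(xs, ys, predict=nearest_neighbor_predict):
    """Leave each sample out once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

# Toy example: two well-separated classes in one dimension.
xs = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
ys = [0, 0, 0, 1, 1, 1]
accuracy = jackknife_accuracy(xs, ys)
```

Any classifier with the same `predict(train_x, train_y, x)` shape could be substituted for the stand-in rule.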
Acknowledgments
The authors gratefully acknowledge financial support for this work from the National Natural Science Foundation (no. 11072048) and National Basic Research Program of China (no. 2009CB918501).
References (97)
- et al.
Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential
Fold. Des.
(1997)
- et al.
Insights into protein–protein interfaces using a Bayesian network prediction method
J. Mol. Biol.
(2006)
- et al.
Efficient restraints for protein–protein docking by comparison of observed amino acid substitution patterns with those predicted from local environment
J. Mol. Biol.
(2006)
Review: low-frequency collective motion in biomacromolecules and its biological functions
Biophys. Chem.
(1988)
Graphic rules in steady and non-steady enzyme kinetics
J. Biol. Chem.
(1989)
Low-frequency resonance and cooperativity of hemoglobin
Trends Biochem. Sci.
(1989)
Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review)
J. Theor. Biol.
(2011)
- et al.
Review: recent progresses in protein subcellular location prediction
Anal. Biochem.
(2007)
- et al.
Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses
J. Theor. Biol.
(2010)
- et al.
Identification of protein–protein interaction sites from docking energy landscapes
J. Mol. Biol.
(2004)
Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition
J. Theor. Biol.
Prediction of protein–protein interaction sites using patch analysis
J. Mol. Biol.
AFP-Pred: a random forest approach for predicting antifreeze proteins from sequence-derived properties
J. Theor. Biol.
A simple method for displaying the hydropathic character of a protein
J. Mol. Biol.
Identifying protein–protein interfacial residues in heterocomplexes using residue conservation scores
Int. J. Biol. Macromol.
The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition
J. Theor. Biol.
Prediction of GABA(A) receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine
J. Theor. Biol.
ProMate: a structure based prediction program to identify the location of protein–protein binding sites
J. Mol. Biol.
Predicted protein–protein interaction sites from local sequence information
FEBS Lett.
HotPatch: a statistical approach to finding biologically relevant features on protein surfaces
J. Mol. Biol.
Predicting protein interaction sites from residue spatial sequence profile and evolution rate
FEBS Lett.
Predicting protein structural classes with pseudo amino acid composition: an approach using geometric moments of cellular automaton image
J. Theor. Biol.
iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites
J. Theor. Biol.
SecretP: identifying bacterial secreted proteins by fusing new features into Chou's pseudo-amino acid composition
J. Theor. Biol.
Using the augmented Chou's pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach
J. Theor. Biol.
Predicting the cofactors of oxidoreductases based on amino acid composition distribution and Chou's amphiphilic pseudo amino acid composition
J. Theor. Biol.
The disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein–protein interaction mechanism
J. Theor. Biol.
Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes
J. Theor. Biol.
Protein structure prediction by global energy optimization
Kinetic plasticity and the determination of product ratios for kinetic schemes leading to multiple products without rate laws: new methods based on directed graphs
Can. J. Chem.
The Protein Data Bank
Nucleic Acids Res.
Statistical analysis and prediction of protein–protein interfaces
Proteins
Improved prediction of protein–protein binding sites using a support vector machines approach
Bioinformatics
Random forests
Mach. Learn.
Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine
Protein Pept. Lett.
Prediction of interface residues in protein–protein complexes by a consensus neural network method: test against NMR data
Proteins
Sequence-based prediction of protein interaction sites with an integrative method
Bioinformatics
A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space
Proteins
Prediction of protein cellular attributes using pseudo amino acid composition
Proteins
Graphic rule for drug metabolism systems
Curr. Drug Metab.
Predicting protein–protein interactions from sequences in a hybridization space
J. Proteome Res.
Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites
J. Proteome Res.
Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms (updated version: Cell-PLoc 2.0: an improved package of web-servers for predicting subcellular localization of proteins in various organisms. Natural Science, 2010, 2, 1090–1103)
Nat. Protocols
Review: recent advances in developing web-servers for predicting protein attributes
Nat. Sci.
Plant-mPLoc: a top–down strategy to augment the power for predicting plant protein subcellular localization
PLoS ONE
iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins
PLoS ONE
Review: prediction of protein structural classes
Crit. Rev. Biochem. Mol. Biol.
Role of the protein outside active site on the diffusion-controlled reaction of enzyme
J. Am. Chem. Soc.
Cited by (18)
Protein–protein interaction site predictions with minimum covariance determinant and Mahalanobis distance
2017, Journal of Theoretical Biology
Citation Excerpt: Unless otherwise noted, any residue referred to below is regarded as a surface residue. A patch-based model (Qiu and Wang, 2012) with 28 features that are relevant to the structural and physicochemical characters of protein surface residues, was introduced to characterize every residue. It contains three patches each of which is made up of a center residue and its n-1 nearest spatial residues (n = 5, 9 and 15, respectively).
Prediction of interface residue based on the features of residue interaction network
2017, Journal of Theoretical Biology
Citation Excerpt: With the random decision forest model, the overfitting to the training set also can be corrected. Random forest has been widely used in many areas, for example, the QSAR model (Svetnik et al., 2003), the prediction of protein-protein binding sites (Ma and Sun, 2014; Qiu and Wang, 2011, 2012), or protein-DNA binding site (Lin et al., 2011), the prediction of antifreeze proteins (Kandaswamy et al., 2011) and lysine succinylation sites (Jia et al., 2016c), etc. In this work, the random forest method is used to make the prediction.
A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition
2013, Journal of Theoretical Biology
Efficacy of function specific 3D-motifs in enzyme classification according to their EC-numbers
2013, Journal of Theoretical Biology
Citation Excerpt: However, of the three test methods, the jackknife test is regarded as the least arbitrary that can always yield a unique result for a given benchmark dataset as elaborated in (Chou, 2011) and demonstrated by Eqs. 28–30 in that paper. Consequently, the jackknife test has been widely recognized and increasingly adopted by investigators to examine the quality of various predictors (see, e.g. (Cao et al., 2012; Chen and Li, 2013; Chou, 2001; Esmaeili et al., 2010; Guo et al., 2011; Huang and Yuan, 2013; Lin, 2008; Mohabatkar, 2010; Mohabatkar et al., 2013; Qiu and Wang, 2012; Sahu and Panda, 2010; Shi et al., 2012; Wang et al., 2010; Zakeri et al., 2011; Zhang et al., 2008)). The jackknife is similar to a specific type of k-fold cross-validation where k is equal to the number of observations.
A simple iterative method to optimize protein-ligand-binding residue prediction
2013, Journal of Theoretical Biology
Citation Excerpt: In order to use these classifiers to identify the ligand-binding residues in proteins, we must adopt a middle method. In our previous study (Qiu and Wang, 2011, 2012), combining classifiers with OR rule was used to increase their recalls, which better improved the patch-based binding site prediction. The threshold-altering method (TAM) depends on the residue representation model used in this paper.
Interrogating noise in protein sequences from the perspective of protein-protein interactions prediction
2012, Journal of Theoretical Biology
Citation Excerpt: Various experimental techniques have been developed for large-scale protein-protein interactions (PPIs) analysis, including yeast two-hybrid systems (Fields and Song, 1989; Ito et al., 2001), mass spectrometry (Gavin et al., 2002; Ho et al., 2002), protein chip (Zhu et al., 2001) and so on. One computational idea is applying the machine learning methods to learn understandable rules from the available PPIs and furthermore to predict novel interactions (Deng et al., 2011; Hu et al., 2011; Ma et al., 2011; Qiu and Wang, 2012; Chou and Cai, 2006; Ren et al., 2011; Xia et al., 2010; Yang and Jiang, 2010; Zhang et al., 2011; Zhou, 2011). Comparing with costly and time-consuming biochemical experiments, computational methods for PPIs prediction have played an important role (Shen et al., 2007).