Residue co-evolution helps predict interaction sites in α-helical membrane proteins
Introduction
A full understanding of a protein’s function requires not only as complete a knowledge as possible of its interaction partners, but also of the location of the binding sites on its surface. Modern high-throughput assays, such as yeast two-hybrid or tandem affinity purification, have generated data on close to 900,000 binary interactions (Kerrien et al., 2007). By contrast, information about the specific interaction interfaces remains relatively scarce. Out of the 52,660 proteins from human and the model organisms E. coli and Bacillus subtilis (bacteria), S. cerevisiae (fungi) and Mus musculus (vertebrates), only 3318 proteins (6.3%) have regions annotated as interaction sites in the Swiss-Prot (The UniProt, 2017) database, with only 12 of them annotated as 3D structure-derived. Furthermore, 1722 and 39 proteins have annotations obtained by similarity-based transfer or by motif- and rule-based approaches, respectively. Fewer than a third of the proteins (1296) were annotated based on publications involving experiments, such as alanine-scanning mutagenesis (Moreira et al., 2007). Manual inspection of these experimental annotations revealed that many of them are also based on 3D structures. In addition, 710 proteins have interface region annotations for which no evidence is provided. The percentage of proteins with available interface annotations ranges from less than 1% in bacteria to around 2% in yeast and 8–9% in vertebrates. For transmembrane proteins (TMPs), which constitute 20–30% of all proteins in living cells (Frishman and Mewes, 1997), these numbers are even lower: only 0.5% in yeast and 5–6% in vertebrates.
PPI interfaces possess specific physico-chemical (amino acid composition, hydrophobicity, polarity), geometrical (accessible area, planarity), and evolutionary (conservation) properties (Nooren and Thornton, 2003), which make them amenable to recognition by machine learning methods. A number of sequence-based (Ofran and Rost, 2003, Res et al., 2005, Murakami and Mizuguchi, 2010, Meyer, 2018, Kamisetty et al., 2013) and structure-based (Fernandez-Recio et al., 2004, Neuvirth et al., 2004) computational methods have been proposed to predict PPI interfaces in globular proteins. The structure-based methods tend to be more accurate because they can leverage structure-level information such as solvent accessibility and the proximity of residues to each other, while the sequence-based methods have the advantage of being applicable to the vast majority of proteins, for which no experimental atomic structure is available. Most methods use machine learning techniques such as neural networks (Zhou and Shan, 2001, Fariselli, 2002, Wang, 2006, Chen, 2012), support vector machines (Bordner and Abagyan, 2005, Koike and Takagi, 2004, Bradford and Westhead, 2005, Zellner, 2012) and random forests (Li, 2012, Sikic et al., 2009, Segura et al., 2011), and almost all of them have been trained on globular proteins. The only method specifically geared towards predicting the interface residues in TMPs from sequence was proposed by Bordner (2009). His random forest model, trained on evolutionary profiles extracted from 128 TMPs, achieved an average AUC of 0.75. Over the past 10 years, not only has the number of experimentally determined 3D structures of TMPs significantly increased, but database search tools such as HHblits have also become much more sensitive (Remmert, 2012); additionally, vastly improved sequence co-evolution measures have become available (Marks et al., 2011, Morcos, 2011), providing powerful features for training machine learning algorithms.
Here we describe a novel computational method, MBPred (Membrane-protein Binding-residues Prediction), which utilizes a combination of four individual random forest models, MBPredTM, MBPredCyto, MBPredExtra, and MBPredAll, trained to predict residues involved in protein interactions in transmembrane, cytoplasmic, and extracellular segments as well as in the entire amino acid sequence, respectively. MBPredCombined merges the output of MBPredTM, MBPredCyto, and MBPredExtra and is used when the location of the TM segments is known from structure or other experiments. Alternatively, MBPredAll is used when the location of the TM segments is unknown and therefore has to be predicted from sequence. The method was trained on 171 structures of α-helical membrane proteins from 133 complexes and tested on an independent dataset of 36 structures. Since Bordner’s method does not appear to be available, no direct comparison with MBPred was possible. However, in our own implementation, a similar method using only evolutionary features achieved an AUC of 0.75 on the much larger dataset, while adding further features, such as TM helix orientation, residue co-evolution, and relative residue position with respect to the membrane, improved the AUC to 0.79. We also demonstrate that the surface patches consisting of amino acid residues classified by MBPred as interacting exhibit a significant overlap with the structure-derived interface regions. In 75% of the proteins, more than half of the residues in the interface patches were correctly predicted.
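The way the segment-specific models and the whole-sequence model divide the work can be illustrated with a minimal Python sketch. It is not the MBPred implementation: the function name `predict_residues`, the `Scorer` type, and the segment labels `"TM"`, `"Cyto"`, and `"Extra"` used as dictionary keys are assumptions made for illustration, and each model is represented by a plain callable rather than a trained random forest.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical stand-in for a trained per-residue model: it maps one residue's
# feature vector to an interaction score in [0, 1].
Scorer = Callable[[Sequence[float]], float]


def predict_residues(
    features: Sequence[Sequence[float]],
    segment_models: Dict[str, Scorer],
    all_model: Scorer,
    topology: Optional[Sequence[str]] = None,
) -> List[float]:
    """Return one interaction score per residue.

    If per-residue segment labels are supplied (assumed labels: "TM", "Cyto",
    "Extra"), each residue is scored by the model for its segment type, which
    mimics the merged, "Combined"-style output. Without labels, every residue
    is scored by the whole-sequence model.
    """
    if topology is None:
        return [all_model(f) for f in features]
    if len(topology) != len(features):
        raise ValueError("topology and features must have the same length")
    return [segment_models[label](f) for f, label in zip(features, topology)]


if __name__ == "__main__":
    # Toy scorers with constant outputs, purely for demonstration.
    toy_models = {
        "TM": lambda f: 0.9,
        "Cyto": lambda f: 0.2,
        "Extra": lambda f: 0.5,
    }
    toy_features = [[0.0], [1.0], [2.0]]
    print(predict_residues(toy_features, toy_models, lambda f: 0.1,
                           ["TM", "Cyto", "Extra"]))
```

In practice each callable would wrap a trained classifier's probability output; the dispatch logic stays the same.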
Datasets
For training and benchmarking our method, we created three datasets: (i) a comparison dataset (CompData), used solely for comparing the results with the previous work of Bordner (2009), (ii) a classification dataset (ClassData), for training and cross-validating the classifier, and (iii) an independent test dataset (TestData), for evaluating the performance of the final classifier.
Binding residues are more conserved in the transmembrane portions of proteins
Residues mediating inter-molecular interactions tend to be evolutionarily conserved (Guharoy and Chakrabarti, 2010). We compared sequence conservation calculated by the entropy-based score (Section 2.7.1) between interacting and non-interacting residues in the three types of segments (TM, Cyto, and Extra) as well as in the full TMP sequences (All) (Fig. 2). Alignment positions with more than 50% gaps were ignored. Interacting residues in the full TMP sequences are significantly more conserved
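To make the idea of an entropy-based conservation score concrete, here is a minimal Python sketch. The exact score from Section 2.7.1 is not reproduced in this excerpt, so the formula is an assumption: Shannon entropy over the non-gap residues of an alignment column, normalised by log2(20) and inverted so that 1.0 means fully conserved. The function names and the `max_gap_fraction` parameter are illustrative; the 50% gap cutoff follows the text.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence

GAP = "-"


def column_conservation(column: Sequence[str],
                        max_gap_fraction: float = 0.5) -> Optional[float]:
    """Score one alignment column; return None if the column is ignored.

    Assumed formula (not taken from the paper): 1 - H / log2(20), where H is
    the Shannon entropy of the non-gap residue frequencies in the column.
    """
    if len(column) == 0:
        return None
    gap_fraction = sum(1 for c in column if c == GAP) / len(column)
    if gap_fraction > max_gap_fraction:
        return None  # columns with more than 50% gaps are skipped
    residues = [c for c in column if c != GAP]
    if not residues:
        return None
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(residues).values())
    return 1.0 - entropy / math.log2(20)


def alignment_conservation(msa: Sequence[str]) -> List[Optional[float]]:
    """Score every column of an alignment given as equal-length strings."""
    return [column_conservation(col) for col in zip(*msa)]


if __name__ == "__main__":
    toy_msa = ["AC-", "AD-", "AC-", "ACG"]  # toy alignment, not real data
    print(alignment_conservation(toy_msa))
```

Per-residue scores from such a function could then be grouped by segment type (TM, Cyto, Extra) and by interacting versus non-interacting status for comparison.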
Availability
The full source code and a standalone version of MBPred are available from https://github.com/bojigu/MBPred.git.
Conclusions
α-helical transmembrane proteins form complexes and bind to ligands in order to perform their biological functions. Elucidating the precise location of the binding sites is an indispensable part of protein functional annotation. Here we present a machine learning approach called MBPred for predicting interacting residues in α-helical membrane proteins from sequence alone. MBPred was developed with two application scenarios in mind. In the first situation, the user wishes to identify potential
Conflict of interest
The authors declared that there is no conflict of interest.
Acknowledgements
This work was supported by the Deutsche Forschungsgemeinschaft (grant FR 1411/14-1) and by the Russian Science Foundation (grant number 16-44-02002).
References (58)
et al. Identification of protein-protein interaction sites from docking energy landscapes. J. Mol. Biol. (2004)
Transmembrane domains interactions within the membrane milieu: principles, advances and challenges. Biochim. Biophys. Acta (2012)
Native-like photosystem II superstructure at 2.44 Å resolution through detergent extraction from the protein crystal. Structure (2014)
et al. Accurate prediction of helix interactions and residue contacts in membrane proteins. J. Struct. Biol. (2016)
Three-dimensional structures of membrane proteins from genomic sequencing. Cell (2012)
et al. ProMate: a structure based prediction program to identify the location of protein-protein binding sites. J. Mol. Biol. (2004)
et al. Predicted protein-protein interaction sites from local sequence information. FEBS Lett. (2003)
Solvent accessible surface area and excluded volume in proteins. Analytical equations for overlapping spheres and implications for the hydrophobic effect. J. Mol. Biol. (1984)
Knowledge-based potential for positioning membrane-associated structures and assessing residue-specific energetic contributions. Structure (2012)
Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett. (2006)
The membrane- and soluble-protein helix-helix interactome: similar geometry via different interactions. Structure
Empirical lipid propensities of amino acid residues in multispan alpha helical membrane proteins. Proteins
Prediction of transmembrane helix orientation in polytopic membrane proteins. BMC Struct. Biol.
Predicting protein-protein binding sites in membrane proteins. BMC Bioinf.
Statistical analysis and prediction of protein-protein interfaces. Proteins-Struct. Funct. Bioinform.
Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics
Are protein-protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci.
Protein-protein interaction site predictions with three-dimensional probability distributions of interacting atoms on protein surfaces. PLoS One
A hot spot of binding energy in a hormone-receptor interface. Science
Prediction of protein–protein interaction sites in heterocomplexes with neural networks. Eur. J. Biochem.
Protein structural classes in five complete genomes. Nat. Struct. Biol.
Prediction of helix-helix contacts and interacting helices in polytopic membrane proteins using neural networks. Proteins
Conserved residue clusters at protein-protein interfaces and their use in binding site identification. BMC Bioinf.
Alternative protein-protein interfaces are frequent exceptions. PLoS Comput. Biol.
FreeContact: fast and free software for protein contact prediction from residue co-evolution. BMC Bioinf.
Advantages of combined transmembrane topology and signal peptide prediction--the Phobius web server. Nucleic Acids Res.
Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc. Natl. Acad. Sci. USA
AAindex: amino acid index database. Nucleic Acids Res.
IntAct--open source resource for molecular interaction data. Nucleic Acids Res.