Residue co-evolution helps predict interaction sites in α-helical membrane proteins

https://doi.org/10.1016/j.jsb.2019.02.009

Highlights

  • Transmembrane protein interaction sites are best predicted by a specific classifier.

  • Separate models for predicted and structure-determined transmembrane regions improve accuracy.

  • Using co-evolution as a feature compensates for the restricted amino acid composition in transmembrane segments.

Abstract

Many integral membrane proteins, just like their globular counterparts, form either transient or permanent multi-subunit complexes to fulfill specific cellular roles. Although numerous interactions between these proteins have been experimentally determined, the structural coverage of the complexes is very low. Therefore, the computational identification of the amino acid residues involved in the interaction interfaces is a crucial step towards the functional annotation of all membrane proteins. Here, we present MBPred, a sequence-based method for predicting the interface residues in transmembrane proteins. A unique feature of our method is that it contains separate random forest models for two different use cases: (a) when the location of transmembrane regions is precisely known from a crystal structure, and (b) when it is predicted from sequence. In stark contrast to the aqueous-exposed protein segments, we found that the interaction sites located in the membrane are not enriched for evolutionary conservation, most likely due to their restricted amino acid composition or their random distribution among buried and exposed residues. On the other hand, residue co-evolution proved to be a very informative feature, which has not so far been used for predicting interaction sites in individual proteins. MBPred reaches AUC, precision and recall values of 0.79/0.73, 0.69/0.51 and 0.55/0.48 on the cross-validation and independent test datasets, respectively, thus outperforming the previously published method of Bordner as well as all methods trained on globular proteins. Moreover, we show that for the majority of complete interface patches, the method captures more than 50% of the involved residues.

Introduction

A full understanding of a protein’s function requires knowledge, as complete as possible, not only of its interaction partners but also of the location of the binding sites on its surface. Modern high-throughput assays, such as yeast two-hybrid or tandem affinity purification, have generated data on close to 900,000 binary interactions (Kerrien et al., 2007). By contrast, information about the specific interaction interfaces remains relatively scarce. Out of the 52,660 proteins from human and the model organisms E. coli and Bacillus subtilis (bacteria), S. cerevisiae (fungi) and Mus musculus (vertebrates), only 3318 proteins (6.3%) have regions annotated as interaction sites in the Swiss-Prot (The UniProt, 2017) database, with only 12 of them annotated as 3D structure-derived. Furthermore, 1722 and 39 proteins have annotations obtained by similarity-based transfer or motif and rule-based approaches, respectively. Fewer than a third of the proteins (1296) were annotated based on publications involving experiments, such as alanine-scanning mutagenesis (Moreira et al., 2007). Manual inspection of these experimental annotations revealed that many of them are also based on 3D structures. In addition, 710 proteins have interface region annotations for which no evidence is provided. The percentage of proteins with available interface annotations ranges from less than 1% in bacteria to around 2% in yeast and 8–9% in vertebrates. For transmembrane proteins (TMPs), which constitute 20–30% of all proteins in living cells (Frishman and Mewes, 1997), these numbers are even lower: only 0.5% in yeast and 5–6% in vertebrates.

Protein-protein interaction (PPI) interfaces possess specific physico-chemical (amino acid composition, hydrophobicity, polarity), geometrical (accessible area, planarity), and evolutionary (conservation) properties (Nooren and Thornton, 2003), which make them amenable to recognition by machine learning methods. A number of sequence- (Ofran and Rost, 2003, Res et al., 2005, Murakami and Mizuguchi, 2010, Meyer, 2018, Kamisetty et al., 2013) and structure-based (Fernandez-Recio et al., 2004, Neuvirth et al., 2004) computational methods have been proposed to predict PPI interfaces in globular proteins. The latter group of methods tends to be more accurate, because it can leverage structure-level information such as solvent accessibility and the proximity of residues to each other, while the former has the advantage of being applicable to the vast majority of proteins, for which no experimental atomic structure is available. Most methods use machine learning techniques such as neural networks (Zhou and Shan, 2001, Fariselli, 2002, Wang, 2006, Chen, 2012), support vector machines (Bordner and Abagyan, 2005, Koike and Takagi, 2004, Bradford and Westhead, 2005, Zellner, 2012) and random forests (Li, 2012, Sikic et al., 2009, Segura et al., 2011), and almost all of them have been trained on globular proteins. The only method specifically geared towards predicting the interface residues in TMPs from sequence was proposed by Bordner (2009). His random forest model, trained on evolutionary profiles extracted from 128 TMPs, achieved an average AUC of 0.75. Over the past 10 years, not only has the number of experimentally determined 3D structures of TMPs significantly increased, but database search tools such as HHblits have also become much more sensitive (Remmert, 2012); additionally, vastly improved sequence co-evolution measures have become available (Marks et al., 2011, Morcos, 2011), providing powerful features for training machine learning algorithms.
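To make the idea of a co-evolution feature concrete, the sketch below scores how strongly two alignment columns vary together. It uses plain mutual information, which is a deliberately simple stand-in and not the direct-coupling measures of Marks et al. (2011) or Morcos (2011); the function name and the toy alignment are illustrative assumptions.

```python
import math
from collections import Counter
from typing import List


def mutual_information(msa: List[str], i: int, j: int, gap: str = "-") -> float:
    """Mutual information (in bits) between alignment columns i and j.

    Simplified stand-in for a co-evolution score; rows with a gap in
    either column are skipped.
    """
    pairs = [(s[i], s[j]) for s in msa if s[i] != gap and s[j] != gap]
    n = len(pairs)
    if n == 0:
        return 0.0
    pair_counts = Counter(pairs)
    a_counts = Counter(a for a, _ in pairs)
    b_counts = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(p_ab * n * n / (a_counts[a] * b_counts[b]))
    return mi


# Toy alignment (hypothetical): columns 0 and 1 change together,
# column 2 is invariant and therefore carries no co-variation signal.
toy_msa = ["ALGV", "ALGI", "VIGV", "VIGL"]
covarying = mutual_information(toy_msa, 0, 1)
invariant = mutual_information(toy_msa, 0, 2)
```

A fully conserved column has zero mutual information with every other column, which hints at why co-variation can carry signal that conservation alone does not.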

Here we describe a novel computational method, MBPred (Membrane-protein Binding-residues Prediction), which utilizes a combination of four individual random forest models (MBPredTM, MBPredCyto, MBPredExtra, and MBPredAll) trained to predict residues involved in protein interactions in transmembrane, cytoplasmic, and extracellular segments as well as in the entire amino acid sequence, respectively. MBPredCombined merges the output of MBPredTM, MBPredCyto, and MBPredExtra and is used when the location of the TM segments is known from structure or other experiments. Alternatively, MBPredAll is used when the location of the TM segments is unknown and therefore has to be predicted from sequence. The method was trained on 171 structures of α-helical membrane proteins from 133 complexes and tested on an independent dataset of 36 structures. Since Bordner’s method does not appear to be available, no direct comparison with MBPred was possible. However, in our own implementation, a similar method using only evolutionary features achieved an AUC of 0.75 on the much larger dataset, while adding further features, such as TM helix orientation, residue co-evolution, and relative residue position with respect to the membrane, improved the AUC to 0.79. We also demonstrate that the surface patches consisting of amino acid residues classified by MBPred as interacting exhibit a significant overlap with the structure-derived interface regions. In 75% of the proteins, more than half of the residues in the interface patches were correctly predicted.
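The two use cases can be pictured as a simple dispatch, sketched below. This is not the MBPred code: the `Scorer` type, the `predict_interface` function, the region labels, and the rule that each residue is scored by the model for its own segment are all assumptions made for illustration, since the text does not spell out how MBPredCombined merges the three outputs.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical scorer type: (sequence, residue index) -> interaction score.
Scorer = Callable[[str, int], float]


def predict_interface(
    sequence: str,
    all_model: Scorer,
    region_models: Optional[Dict[str, Scorer]] = None,
    regions: Optional[List[str]] = None,
) -> List[float]:
    """Per-residue scores from region-specific models or one global model.

    If per-residue region labels ("TM", "Cyto", "Extra") are supplied,
    each residue is scored by the model for its segment (assumed merge
    rule); otherwise the whole-sequence model is used.
    """
    if regions is None or region_models is None:
        return [all_model(sequence, i) for i in range(len(sequence))]
    if len(regions) != len(sequence):
        raise ValueError("need exactly one region label per residue")
    return [region_models[regions[i]](sequence, i) for i in range(len(sequence))]


# Toy constant scorers standing in for trained random forest models.
toy_region_models: Dict[str, Scorer] = {
    "TM": lambda seq, i: 0.9,
    "Cyto": lambda seq, i: 0.2,
    "Extra": lambda seq, i: 0.4,
}
toy_all_model: Scorer = lambda seq, i: 0.5

combined = predict_interface(
    "MLVA", toy_all_model, toy_region_models, ["Cyto", "TM", "TM", "Extra"]
)
fallback = predict_interface("MLVA", toy_all_model)
```

Keeping the whole-sequence model as a separate fallback means predictions are still produced when no reliable segment boundaries are available.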

Datasets

For training and benchmarking our method we created three datasets: (i) comparison dataset (CompData), solely used for comparing the results with the previous work of Bordner (2009), (ii) classification dataset (ClassData), for training and cross validating the classifier, and (iii) an independent test dataset (TestData), for evaluating the performance of the final classifier.
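Performance on such datasets is reported as AUC, precision and recall. As a minimal sketch of how these per-residue metrics can be computed from binary labels and predicted scores, the pure-Python functions below use a fixed decision threshold for precision and recall and the rank-based (pairwise comparison) definition of the ROC AUC. The function names, the 0.5 threshold, and the toy data are assumptions for illustration, not values taken from the paper.

```python
from typing import List, Sequence, Tuple


def precision_recall(
    labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Precision and recall when residues with score >= threshold are called interacting."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y and p)
    fp = sum(1 for y, p in zip(labels, predicted) if not y and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random interacting residue outscores a random
    non-interacting one (ties count as one half)."""
    pos: List[float] = [s for y, s in zip(labels, scores) if y]
    neg: List[float] = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = interacting residue, 0 = non-interacting residue.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.6, 0.7, 0.2, 0.4, 0.1]
toy_precision, toy_recall = precision_recall(toy_labels, toy_scores)
toy_auc = roc_auc(toy_labels, toy_scores)
```

Unlike precision and recall, the AUC does not depend on the choice of threshold, which makes it convenient for comparing classifiers across datasets with different class balance.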

Binding residues are more conserved in the transmembrane portions of proteins

Residues mediating inter-molecular interactions tend to be evolutionarily conserved (Guharoy and Chakrabarti, 2010). We compared sequence conservation calculated by the entropy-based score (Section 2.7.1) between interacting and non-interacting residues in the three types of segments (TM, Cyto, and Extra) as well as in the full TMP sequences (All) (Fig. 2). Alignment positions with more than 50% gaps were ignored. Interacting residues in the full TMP sequences are significantly more conserved
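As an illustration of an entropy-based conservation score with the gap filter described above, the sketch below computes a normalized Shannon entropy per alignment column and skips columns with more than 50% gaps. The exact formula of Section 2.7.1 is not reproduced in this excerpt, so the normalization by log2(20) and the function name are assumptions.

```python
import math
from collections import Counter
from typing import List, Optional


def column_conservation(
    msa: List[str], max_gap_fraction: float = 0.5, gap: str = "-"
) -> List[Optional[float]]:
    """One score per column: 1 - H / log2(20), where H is the Shannon
    entropy of the non-gap residues. Columns whose gap fraction exceeds
    max_gap_fraction are ignored (None)."""
    n_seqs = len(msa)
    scores: List[Optional[float]] = []
    for i in range(len(msa[0])):
        col = [s[i] for s in msa]
        counts = Counter(c for c in col if c != gap)
        total = sum(counts.values())
        if col.count(gap) / n_seqs > max_gap_fraction or total == 0:
            scores.append(None)
            continue
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        scores.append(1.0 - entropy / math.log2(20))
    return scores


# Toy alignment (hypothetical): column 0 is invariant, column 2 is
# mostly gaps, column 3 has four different residues.
toy_alignment = ["AG-L", "AG-I", "AC-V", "AGWF"]
toy_conservation = column_conservation(toy_alignment)
```

With this convention a score of 1 means a fully conserved column, and lower values mean more variable positions.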

Availability

The full source code and a standalone version of MBPred are available from https://github.com/bojigu/MBPred.git.

Conclusions

α-helical transmembrane proteins form complexes and bind to ligands in order to perform their biological functions. Elucidating the precise location of the binding sites is an indispensable part of protein functional annotation. Here we present a machine learning approach called MBPred for predicting interacting residues in α-helical membrane proteins from sequence alone. MBPred was developed with two application scenarios in mind. In the first situation, the user wishes to identify potential

Conflict of interest

The authors declared that there is no conflict of interest.

Acknowledgements

This work was supported by the Deutsche Forschungsgemeinschaft (grant FR 1411/14-1) and by the Russian Science Foundation (grant number 16-44-02002).

References (58)

  • S.Q. Zhang. The membrane- and soluble-protein helix-helix interactome: similar geometry via different interactions. Structure (2015)
  • L. Adamian. Empirical lipid propensities of amino acid residues in multispan alpha helical membrane proteins. Proteins (2005)
  • L. Adamian et al. Prediction of transmembrane helix orientation in polytopic membrane proteins. BMC Struct. Biol. (2006)
  • A.J. Bordner. Predicting protein-protein binding sites in membrane proteins. BMC Bioinf. (2009)
  • A.J. Bordner et al. Statistical analysis and prediction of protein-protein interfaces. Proteins-Struct. Funct. Bioinform. (2005)
  • J.R. Bradford et al. Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics (2005)
  • D.R. Caffrey. Are protein-protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci. (2004)
  • C.T. Chen. Protein-protein interaction site predictions with three-dimensional probability distributions of interacting atoms on protein surfaces. PLoS One (2012)
  • T. Clackson et al. A hot spot of binding energy in a hormone-receptor interface. Science (1995)
  • P. Fariselli. Prediction of protein–protein interaction sites in heterocomplexes with neural networks. Eur. J. Biochem. (2002)
  • D. Frishman et al. Protein structural classes in five complete genomes. Nat. Struct. Biol. (1997)
  • A. Fuchs et al. Prediction of helix-helix contacts and interacting helices in polytopic membrane proteins using neural networks. Proteins (2009)
  • M. Guharoy et al. Conserved residue clusters at protein-protein interfaces and their use in binding site identification. BMC Bioinf. (2010)
  • T. Hamp et al. Alternative protein-protein interfaces are frequent exceptions. PLoS Comput. Biol. (2012)
  • L. Kajan. FreeContact: fast and free software for protein contact prediction from residue co-evolution. BMC Bioinf. (2014)
  • L. Kall et al. Advantages of combined transmembrane topology and signal peptide prediction--the Phobius web server. Nucleic Acids Res. (2007)
  • H. Kamisetty et al. Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc. Natl. Acad. Sci. USA (2013)
  • S. Kawashima et al. AAindex: amino acid index database. Nucleic Acids Res. (1999)
  • S. Kerrien. IntAct--open source resource for molecular interaction data. Nucleic Acids Res. (2007)
