Sequence-based prediction of protein-protein interaction sites by simplified long short-term memory network

doi:10.1016/j.neucom.2019.05.013

Neurocomputing

Volume 357, 10 September 2019, Pages 86-100

Highlights

• A faster variant of the long short-term memory network.
• Effective measures for combating imbalanced classification in deep learning.
• DLPred, a deep learning tool for predicting protein-protein interaction sites.

Abstract

Proteins often interact with each other and form protein complexes to carry out various biochemical activities. Knowledge of the interaction sites is helpful for understanding disease mechanisms and for drug design. Accurate prediction of the interaction sites from protein sequences is still a challenging task, and severe class imbalance in the data further degrades the performance of computational methods. In this study, we propose a deep learning method for improving the imbalanced prediction of protein interaction sites. We develop a new simplified long short-term memory (SLSTM) network to implement a deep learning architecture (named DLPred). To deal with the imbalanced classification in the deep learning model, we explore three new ideas. First, we construct the training data as a set of protein sequences, instead of a set of single residues, to retain the sequential completeness of each protein. Second, a new penalization factor is appended to the loss function such that the penalization of the non-interaction site loss is effectively enhanced. Third, multi-task learning of interaction site and residue solvent accessibility prediction is used to correct the preference of the prediction model for the non-interaction sites. Our model is evaluated on three public datasets: Dset186, Dtestset72 and PDBtestset164. Compared with current state-of-the-art methods, DLPred significantly improves the predictive accuracies and AUC values while also improving the F-measure. The training dataset, test datasets, a standalone version of DLPred and an online service are available at http://qianglab.scst.suda.edu.cn/dlp/.

Introduction

Protein-protein interactions are fundamental for many cellular biological processes, such as signal transduction, immune response, and cellular organization [1]. The protein-protein interaction sites (PPISs) are composed of a set of amino acid residues that form chemical bonds with a part of another molecule. Detection of interaction domains in sequences is very useful for understanding the mechanisms of various biological processes, disease development and drug design. Experimentally determined protein 3D structures provide important clues for identifying interaction sites and understanding protein functions [2]. However, biological experimental methods [3] are labor-intensive and time-consuming, and the number of known 3D structures is still considerably smaller than the number of known protein sequences.

Over the past decades, researchers have investigated the possibility of using computational approaches to rapidly and accurately predict interacting residues from protein sequences. Jones and Thornton [4] reported that solvation potential, residue interface propensity, hydrophobicity, planarity, protrusion and accessible surface area are the most important features for differentiating an observed interface from other patches defined on the surface of a protein. Neuvirth et al. [2] suggested that the locations of protein-protein binding sites are imprinted in the structures of the proteins. Ofran and Rost [5] also concluded that unbound proteins could suffice for the identification of interface residues.

To date, many computational methods have been proposed to deal with this prediction problem, including artificial neural networks [1], [6], [7], [8], support vector machines (SVMs) [7], [9], [10], random forests [11], [12], the Naïve Bayes classifier [13], L1-regularized logistic regression [14], and ensembles of SVMs and sample-weighted random forests [15]. In particular, Zhou and Shan [1] proposed a neural network predictor that takes sequence profiles of neighboring residues and solvent exposure as input. Ofran and Rost [5] proposed another neural network predictor (ISIS), which was trained on sequence profiles and structural features predicted from the sequences. Porollo and Meller [7] proposed a method named SPPIDER that uses an SVM, a neural network and linear discriminant analysis based on 19 selected features from the sequences. Murakami and Mizuguchi [13] developed a predictor called PSIVER, a Naïve Bayes classifier with kernel density estimation based on the position-specific scoring matrix (PSSM) and predicted solvent accessibility. Kaustubh et al. [14] proposed an L1-regularized logistic regression classifier named LORIS. Furthermore, Singh et al. [8] proposed a novel artificial neural network predictor, SPRINGS. Both SPRINGS and LORIS are trained on a feature space of PSSM, averaged cumulative hydropathy and predicted relative solvent accessibility.

Although much progress has been made, there is still room for improving the performance of PPIS prediction, and one of the challenging issues in this research is class imbalance. Recently, some methods have been dedicated to solving this problem. Wei et al. [16] were the first to address it, proposing a cascade random forests (CRF) algorithm. CRF connects multiple random forests in a cascade-like manner, each of which is trained with a balanced training subset that includes all minority samples and a subset of the majority samples. However, sampling the training data at the residue level destroys the completeness of a sequence. Another method, SSWRF [15], combines an ensemble of SVMs and sample-weighted random forests to cope with the class imbalance issue, but its prediction accuracy is not very appealing.

In this work, we explore new ideas to address the imbalance issue and design a deep learning architecture that generalizes better on imbalanced data.

Firstly, a lightweight variant of long short-term memory (LSTM) [17], named the simplified long short-term memory (SLSTM) network, is proposed and used as the fundamental module in our model architecture. Our deep learning model (named DLPred) stacks three SLSTM layers followed by two feed-forward neural network layers. Compared with models using LSTM or gated recurrent units (GRU) [18], the SLSTM-based model has only 61.4% of the parameters of the LSTM-based model and 81.7% of those of the GRU-based model. The SLSTM-based model trains faster than the GRU-based or LSTM-based models, while its performance is comparable to that of the GRU-based model and better than that of the LSTM-based model.
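As a rough sanity check on the parameter figures above, the per-layer parameter counts of a standard LSTM and a standard GRU can be computed directly. The sketch below uses the textbook cell definitions (four weight blocks for LSTM, three for GRU, one bias vector per block; some libraries use two bias vectors). The input size of 50 and hidden size of 200 are illustrative assumptions, not values taken from the paper, and the SLSTM cell itself is not reproduced because its equations are not given in this excerpt.

```python
def recurrent_params(input_dim: int, hidden_dim: int, n_blocks: int) -> int:
    """Parameter count of one recurrent layer with n_blocks gate/candidate blocks.

    Each block holds an input weight matrix, a recurrent weight matrix
    and a single bias vector (textbook formulation).
    """
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return n_blocks * per_block


def lstm_params(input_dim: int, hidden_dim: int) -> int:
    # LSTM: input, forget and output gates plus the cell candidate -> 4 blocks
    return recurrent_params(input_dim, hidden_dim, 4)


def gru_params(input_dim: int, hidden_dim: int) -> int:
    # GRU: update and reset gates plus the hidden candidate -> 3 blocks
    return recurrent_params(input_dim, hidden_dim, 3)


# Illustrative sizes only (assumed, not from the paper)
d, h = 50, 200
lstm = lstm_params(d, h)
gru = gru_params(d, h)
gru_to_lstm = gru / lstm

# Ratio implied by the percentages quoted in the text:
# SLSTM = 61.4% of LSTM and 81.7% of GRU  =>  GRU / LSTM = 0.614 / 0.817
implied_gru_to_lstm = 0.614 / 0.817

print(lstm, gru, round(gru_to_lstm, 3), round(implied_gru_to_lstm, 3))
```

For a single layer the GRU-to-LSTM ratio is exactly 3/4, and the two percentages quoted in the text imply a ratio of about 0.75 as well. The quoted figures refer to whole models, which also contain fully connected layers, so an exact match is not expected.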

The training data are filtered at the sequence level. Approaches to constructing training data that handle the imbalance issue have been well investigated in the literature [19], [20]; the most straightforward are various techniques for adjusting the training data. Traditionally, the training data are collected as a set of individual residues. Adjusting the training data at the residue level, however, shatters the sequential completeness of many proteins. In this work, the training data are collected as a set of complete protein sequences, so each sequence in the training dataset still contains its complete set of binding residues and its complete set of non-binding residues. Our training dataset (TR5860), comprising 5860 sequences, is collected from multiple data sources, and in each sequence at least 10% of the residues are interacting residues [21].
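The sequence-level filtering described above can be sketched as follows. This is a minimal illustration, assuming each protein is stored as a sequence string with a parallel list of 0/1 labels (1 = interacting residue). The function names and the toy sequences are hypothetical; only the 10% threshold comes from the text.

```python
from typing import List, Tuple

# (sequence, per-residue labels; 1 = interacting, 0 = non-interacting)
Protein = Tuple[str, List[int]]


def interacting_fraction(labels: List[int]) -> float:
    """Fraction of residues labelled as interacting."""
    return sum(labels) / len(labels) if labels else 0.0


def filter_sequences(proteins: List[Protein], min_fraction: float = 0.10) -> List[Protein]:
    """Keep whole sequences whose interacting fraction reaches min_fraction.

    Sequences are kept or dropped as a unit, so every retained protein
    still has all of its binding and non-binding residues.
    """
    kept = []
    for seq, labels in proteins:
        if len(seq) != len(labels):
            raise ValueError("sequence and label lengths differ")
        if interacting_fraction(labels) >= min_fraction:
            kept.append((seq, labels))
    return kept


# Toy data (hypothetical sequences and labels)
toy = [
    ("MKTAYIAKQR", [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]),   # 20% interacting
    ("GSHMLEDPVA", [0] * 10),                          # 0% interacting
    ("ACDEFGHIKLMNPQRSTVWY", [1] + [0] * 19),          # 5% interacting
    ("ACDEFGHIKL", [1] + [0] * 9),                     # exactly 10% interacting
]
kept = filter_sequences(toy)
kept_sequences = [seq for seq, _ in kept]
print(kept_sequences)
```

Filtering whole sequences raises the share of interacting residues in the training set without removing individual residues from any retained protein.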

Inspired by the recent successes of cost-sensitive learning in convolutional neural networks (CNNs) [22], we append a new penalization factor to the loss function so that the penalization of misclassified non-interacting residues is enhanced to cope with the imbalance issue.
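The exact form of the penalization factor is not given in this excerpt. As a generic illustration of cost-sensitive learning, the sketch below shows a class-weighted binary cross-entropy in which each class has its own weight. It is not the loss used in DLPred; the function name, the weights and the toy values are assumptions.

```python
import math
from typing import Sequence


def weighted_cross_entropy(
    probs: Sequence[float],
    labels: Sequence[int],
    w_pos: float = 1.0,
    w_neg: float = 1.0,
    eps: float = 1e-7,
) -> float:
    """Mean binary cross-entropy with separate weights per class.

    probs  : predicted probability that each residue is interacting
    labels : 1 for interacting, 0 for non-interacting
    w_pos / w_neg : weights on the interacting / non-interacting terms
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        if y == 1:
            total += -w_pos * math.log(p)
        else:
            total += -w_neg * math.log(1.0 - p)
    return total / len(labels)


# Toy predictions (hypothetical values)
labels = [1, 0, 0, 0]
probs = [0.3, 0.6, 0.1, 0.2]

plain = weighted_cross_entropy(probs, labels)
# Heavier penalty on errors made on non-interacting residues (w_neg is assumed)
penalized = weighted_cross_entropy(probs, labels, w_neg=2.0)
print(plain, penalized)
```

Raising `w_neg` increases the contribution of non-interacting residues to the loss, which corresponds to the direction of penalization described in the text. Suitable values would have to be tuned.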

Finally, multi-task learning is used to correct the preference of the prediction model for the non-interacting residues. The interacting residues are closely correlated with residue solvent accessibility (RSA) in our feature space construction. Most interacting residues are interface residues of the protein, and residues with a larger solvent-accessible area have a higher potential to become interface residues. We therefore propose to predict PPISs and RSA concurrently, which is an effective way to improve the generalization of our model in imbalanced classification.
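One common way to train two related outputs jointly is to sum their losses with a weighting term. The sketch below combines a cross-entropy loss for PPIS labels with a mean squared error on RSA values. This excerpt does not say whether RSA is treated as regression or classification, nor how the two tasks are weighted, so the regression form and `rsa_weight` are assumptions.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], labels: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy for the interaction-site task."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(labels)


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error for the auxiliary RSA task (assumed regression)."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)


def multitask_loss(ppis_probs, ppis_labels, rsa_pred, rsa_true, rsa_weight: float = 0.5) -> float:
    """Joint loss: main PPIS term plus a weighted auxiliary RSA term."""
    main = cross_entropy(ppis_probs, ppis_labels)
    auxiliary = mean_squared_error(rsa_pred, rsa_true)
    return main + rsa_weight * auxiliary


# Toy values for two residues (hypothetical)
total = multitask_loss(
    ppis_probs=[0.8, 0.2], ppis_labels=[1, 0],
    rsa_pred=[0.5, 0.1], rsa_true=[0.7, 0.1],
)
print(total)
```

Because both terms are computed from a shared network in a multi-task setup, gradients from the RSA term also shape the shared layers. This is the mechanism by which an auxiliary task can counteract a bias toward the majority class.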

In this study, we incorporate sequence-derived features such as the PSSM, physical properties and hydropathy index into the DLPred model. DLPred is evaluated on three public PPIS test datasets: Dset186, Dtestset72 and PDBtestset164. Experimental results show that our model improves the F-measure, predictive accuracy and AUC. We achieved 38.9%, 69.1% and 80.1% in F-measure, accuracy and AUC, respectively, on Dset186; 42.6%, 69% and 81.1% on Dtestset72; and 38.8%, 68.4% and 78.9% on PDBtestset164. Compared with other predictors, DLPred is simple yet more generalizable and improves the performance of imbalanced classification.
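For readers less familiar with the three metrics reported above, the sketch below computes F-measure, accuracy and AUC from per-residue labels and scores using their standard definitions. The toy labels, scores and the 0.5 decision threshold are assumptions chosen for illustration. They are unrelated to the reported results.

```python
from typing import Sequence, Tuple


def f_measure_and_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """F-measure (harmonic mean of precision and recall) and accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return f1, accuracy


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 2 interacting residues out of 10 (hypothetical scores)
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.2, 0.1, 0.4, 0.5, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold is an assumption

f1, acc = f_measure_and_accuracy(y_true, y_pred)
area = auc(y_true, scores)
print(f1, acc, area)
```

On this toy data the F-measure is 0.4, the accuracy is 0.7 and the AUC is 0.875. A predictor that labels every residue as non-interacting would reach an accuracy of 0.8 with an F-measure of 0, which is why accuracy alone is a weak indicator on imbalanced data.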

Section snippets

Materials and methodology

In this section, the proposed method of protein-protein interaction sites prediction is explained in detail.

Experimental setup

In this study, 200, 400 and 400 units are used in the first, second and third bidirectional recurrent neural network (BRNN) layers, respectively. The output dimensionality of each BRNN layer is 400. Sixty-four hidden nodes are used in the first fully connected layer, and the following fully connected layer is the classification layer with the softmax function. Dropout (p = 0.5) is applied to the output of each hidden layer to avoid overfitting. To obtain a better overall performance model, the F-measure is

Conclusion

We have presented a novel deep learning method for improving the prediction performance of protein interacting residues, which is an imbalanced classification problem. We proposed a simplified long short-term memory (SLSTM) network and used it to design a deep learning model, DLPred. Three ideas are used to deal with the imbalance issue: collection of protein sequences having a high ratio of interacting residues for the training dataset, a new penalization factor introduced in the loss function, and

Conflict of interest

All authors have declared that no conflict of interest exists.

Acknowledgments

This research was supported in part by the National Natural Science Foundation of China (Nos. 61170125 and 31801108), the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase) and the Natural Science Research Project of Anhui Provincial Department of Education (No. KJ2018A0383).

Buzhong Zhang is a Ph.D. candidate at the School of Computer Science and Technology, Soochow University, China. His research interests include bioinformatics, data mining and machine learning.

References (51)

H. Neuvirth et al. ProMate: a structure based prediction program to identify the location of protein-protein binding sites. J. Mol. Biol. (2004)
G. Drewes et al. Global approaches to protein-protein interactions. Curr. Opin. Cell Biol. (2003)
S. Jones et al. Analysis of protein-protein interaction sites using surface patches. J. Mol. Biol. (1997)
Z.S. Wei et al. Protein-protein interaction sites prediction by ensembling SVM and sample-weighted random forests. Neurocomputing (2016)
M. Galar et al. EUSBoost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognit. (2013)
H. Zhou et al. Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins: Struct. Funct. Bioinform. (2001)
Y. Ofran et al. ISIS: interaction sites identified from sequence. Bioinformatics (2007)
H. Chen et al. Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data. Proteins: Struct. Funct. Bioinform. (2005)
A. Porollo et al. Prediction-based fingerprints of protein-protein interactions. Proteins: Struct. Funct. Bioinform. (2007)
G. Singh et al. SPRINGS: prediction of protein-protein interaction sites using artificial neural networks. J. Proteomic Comput. Biol. (2014)

Jinyan Li is a Professor of Data Science at the Advanced Analytics Institute and a core member at the Centre for Health Technologies, Faculty of Engineering and IT, UTS. He is also the Bioinformatics Program leader. Jinyan received the Ph.D. degree in computer science from the University of Melbourne, Australia.

Lijun Quan is a Lecturer with the School of Computer Science and Technology, Soochow University. She received the Ph.D. degree from Soochow University, Suzhou, China. Her research interests include bioinformatics, parallel and distributed computing, and machine learning.

Yu Chen is a teacher at the School of Computer Science and Technology, Soochow University, China. His research interests include computer vision, protein structure prediction, and machine learning.

Qiang Lü received the Ph.D. degree in computer science from Soochow University, China, in 2006. He is currently a professor in the School of Computer Science and Technology, Soochow University, Suzhou, China. His research interests include protein and RNA structure prediction, metaheuristic search, and parallel algorithms.

Prediction of protein-protein interaction sites using an ensemble method. BMC Bioinformatics
Detection of outlier residues for improving interface prediction in protein heterocomplexes. IEEE/ACM Trans. Comput. Biol. Bioinform.
Prediction of protein-protein interaction sites in sequences and 3D structures by random forests. PLoS Comput. Biol.
Seeing the trees through the forest: sequence-based homo- and heteromeric protein-protein interaction sites prediction using random forest. Bioinformatics
Applying the naïve Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites. Bioinformatics
Sequence-based prediction of protein-protein interaction sites with L1-logreg classifier. J. Theor. Biol.
A cascade random forests algorithm for predicting protein-protein interaction sites. IEEE Trans. Nanobiosci.
Long short-term memory. Neural Comput.
Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing
Learning from imbalanced data. IEEE Trans. Knowl. Data Eng.
Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans. Knowl. Data Eng.
Classification of imbalanced data: a review. Int. J. Pattern Recognit. Artif. Intell.
Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst.
PISCES: a protein sequence culling server. Bioinformatics
CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics