Sequence-based prediction of protein-protein interaction sites by simplified long short-term memory network
Introduction
Protein-protein interactions are fundamental to many cellular biological processes, such as signal transduction, immune response, and cellular organization [1]. Protein-protein interaction sites (PPISs) are the sets of amino acid residues that form chemical bonds with part of another molecule. Detecting interaction domains in sequences is very useful for understanding the mechanisms of various biological processes, disease development and drug design. Experimentally determined protein 3D structures provide important clues for identifying interaction sites and understanding protein functions [2]. However, biological experimental methods [3] are labor-intensive and time-consuming, and the number of known 3D structures is still considerably smaller than the number of known protein sequences.
Over the decades, researchers have investigated the possibility of utilizing computational approaches to rapidly and accurately predict interacting residues from protein sequences. Jones and Thornton’s research [4] reported that solvation potential, residue interface propensity, hydrophobicity, planarity, protrusion and accessible surface area are the most important features to differentiate an observed interface from others defined on the surface of a protein. Neuvirth [2] suggested that locations of protein-protein binding sites are imprinted in the structures of the proteins. Ofran and Rost also concluded [5] that unbound proteins could suffice for the identification of interface residues.
To date, many computational methods have been proposed for this prediction problem, including artificial neural networks [1], [6], [7], [8], support vector machines (SVMs) [7], [9], [10], random forests [11], [12], a Naïve Bayes classifier [13], L1-regularized logistic regression [14], and ensembles of SVMs and sample-weighted random forests [15]. In particular, Zhou and Shan [1] proposed a neural network predictor that takes sequence profiles of neighboring residues and solvent exposure as input. Ofran and Rost [5] proposed another neural network predictor (ISIS), which was trained on sequence profiles and structural features predicted from the sequences. Porollo and Meller [7] proposed a method named SPPIDER that uses an SVM, a neural network and linear discriminant analysis based on 19 features selected from the sequences. Murakami and Mizuguchi [13] developed a predictor called PSIVER, a Naïve Bayes classifier with kernel density estimation based on the position-specific scoring matrix (PSSM) and predicted solvent accessibility. Kaustubh et al. [14] proposed an L1-regularized logistic regression classifier named LORIS. Furthermore, Singh et al. [8] proposed a novel artificial neural network predictor, SPRINGS. Both SPRINGS and LORIS are trained on a feature space of PSSM, averaged cumulative hydropathy and predicted relative solvent accessibility.
Although much progress has been made, there is still room to improve the performance of PPIS prediction. One of the challenging issues in this research is class imbalance, and some recent methods have been dedicated to this problem. Wei et al. [16] were the first to address it, proposing a cascade random forests (CRF) algorithm. CRF connects multiple random forests in a cascade-like manner, each trained with a balanced training subset that includes all minority samples and a subset of the majority samples. However, sampling the training data at the residue level destroys the completeness of a sequence. Another method, SSWRF [15], combines an ensemble of SVMs and sample-weighted random forests to cope with the class imbalance issue, but its prediction accuracy remains modest.
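The balanced-subset construction attributed to CRF [16] above can be illustrated with a minimal sketch. The function name, the 1:1 minority-to-majority ratio and the toy labels are assumptions made for illustration; they are not taken from [16].

```python
import random

def balanced_subsets(labels, n_subsets, seed=0):
    """Build index lists that each hold every minority (label 1) sample
    plus an equally sized random draw of majority (label 0) samples.
    The 1:1 ratio is an assumption, not a detail given in the text."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(minority), len(majority))
    return [sorted(minority + rng.sample(majority, k)) for _ in range(n_subsets)]

# Toy residue labels: 1 = interacting (minority), 0 = non-interacting.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
subsets = balanced_subsets(labels, n_subsets=3)
```

Because the indices are drawn residue by residue, the residues of one protein end up scattered across subsets, which is the loss of sequence completeness noted above.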
In this work, we explore new ideas to address the imbalance issue and design a deep learning architecture that generalizes better on imbalanced data.
First, a lightweight variant of long short-term memory (LSTM) [17], named the simplified long short-term memory (SLSTM) network, is proposed and used as the fundamental module of our model architecture. Our deep learning model, DLPred, stacks three SLSTM layers followed by two feed-forward layers. Compared with models using LSTM or gated recurrent units (GRU) [18], the SLSTM-based model has only 61.4% of the parameters of the LSTM-based model and 81.7% of those of the GRU-based model. The SLSTM-based model trains faster than the GRU-based and LSTM-based models, while its performance is comparable to that of the GRU-based model and better than that of the LSTM-based model.
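As a rough sanity check on the two percentages quoted above, the sketch below counts the parameters of a single standard LSTM layer and a single standard GRU layer. The layer sizes are illustrative assumptions, and the formulas are the textbook ones (one bias vector per gate); the SLSTM cell itself is not specified in this section and is therefore not modeled.

```python
def lstm_params(input_dim, hidden):
    # Four gate blocks (input, forget, output, candidate), each with
    # input weights, recurrent weights and one bias vector.
    return 4 * (hidden * (input_dim + hidden) + hidden)

def gru_params(input_dim, hidden):
    # Three gate blocks (update, reset, candidate) with the same layout.
    return 3 * (hidden * (input_dim + hidden) + hidden)

input_dim, hidden = 50, 200  # illustrative sizes, not taken from the paper
gru_to_lstm = gru_params(input_dim, hidden) / lstm_params(input_dim, hidden)
implied = 0.614 / 0.817      # GRU/LSTM ratio implied by the quoted percentages

print(gru_to_lstm, implied)
```

A standard GRU layer has three quarters of the parameters of an LSTM layer of the same size, and the quoted figures imply nearly the same ratio. The match is only approximate, since the full models also contain fully connected layers.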
The training data are filtered at the sequence level. Specific approaches to constructing training data to handle the imbalance issue have been well investigated in the literature [19], [20]. The most straightforward approaches are various techniques for adjusting the training data. Traditionally, training data are collected as a set of individual residues. Adjusting the training data at the residue level, however, breaks the sequential completeness of many proteins. In this work, training data are collected as a set of complete protein sequences, so each sequence in the training dataset still contains its complete sets of binding and non-binding residues. Our training dataset (TR5860), comprising 5860 sequences, is collected from multiple data sources; in each sequence, at least 10% of the residues are interacting residues [21].
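The sequence-level filter described above can be sketched as follows. The 10% threshold comes from the text; the function names, the dictionary layout and the toy proteins are hypothetical.

```python
def interacting_fraction(labels):
    """Fraction of residues labeled as interacting (1) in one sequence."""
    return sum(labels) / len(labels) if labels else 0.0

def filter_sequences(dataset, min_fraction=0.10):
    """Keep whole sequences whose interacting fraction is at least min_fraction.
    Sequences are kept or dropped as a unit; no residue is removed individually."""
    return {pid: labels for pid, labels in dataset.items()
            if interacting_fraction(labels) >= min_fraction}

# Hypothetical per-residue labels: 1 = interacting, 0 = non-interacting.
dataset = {
    "protA": [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # 20% interacting
    "protB": [0] * 19 + [1],                   # 5% interacting
    "protC": [1] + [0] * 9,                    # 10% interacting
}
kept = filter_sequences(dataset)
```

Every retained sequence keeps all of its binding and non-binding residues, so the order of residues within a protein is untouched.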
Inspired by the recent successes of cost-sensitive learning in convolutional neural networks (CNNs) [22], we add a new penalization factor to the loss function that strengthens the penalty on misclassified non-interacting residues, to cope with the imbalance issue.
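The exact form of the penalization factor is not given in this section. As a minimal sketch of the general idea, the example below scales the cross-entropy terms of one class by a constant; the function name, the default `penalty` value and the choice of a constant multiplicative factor are all assumptions, not the DLPred loss.

```python
import math

def weighted_cross_entropy(y_true, p_pred, penalty=2.0, penalized_class=0, eps=1e-12):
    """Mean binary cross-entropy in which the terms of `penalized_class`
    are multiplied by `penalty` (a hypothetical constant factor)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        if y == penalized_class:
            loss *= penalty
        total += loss
    return total / len(y_true)

y_true = [0, 0, 1]        # 1 = interacting, 0 = non-interacting
p_pred = [0.9, 0.2, 0.7]  # predicted probability of the interacting class
plain = weighted_cross_entropy(y_true, p_pred, penalty=1.0)
weighted = weighted_cross_entropy(y_true, p_pred, penalty=2.0)
```

With `penalty=1.0` the function reduces to ordinary cross-entropy, which makes the effect of the factor easy to compare.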
Finally, multi-task learning is used to correct the preference of the prediction model for non-interacting residues. In our feature space construction, the interacting residues are closely correlated with residue solvent accessibility (RSA). Most interacting residues are interface residues of the protein, and only residues with a larger solvent-accessible area have a high potential to become interface residues. We therefore propose to predict PPISs and RSA concurrently, which is an effective way to improve the generalization of our model on imbalanced classification.
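How the two tasks are combined is not described in this section. One common way to train two outputs concurrently is to sum a classification loss and a regression loss, as in the sketch below; the squared-error term for RSA, the `rsa_weight` parameter and all names are assumptions made for illustration.

```python
import math

def multitask_loss(y_site, p_site, rsa_true, rsa_pred, rsa_weight=1.0, eps=1e-12):
    """Cross-entropy on interaction sites plus weighted mean squared error on RSA.
    Both the MSE choice and rsa_weight are hypothetical."""
    n = len(y_site)
    ce = 0.0
    for y, p in zip(y_site, p_site):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        ce += -math.log(p) if y == 1 else -math.log(1.0 - p)
    mse = sum((t - q) ** 2 for t, q in zip(rsa_true, rsa_pred))
    return ce / n + rsa_weight * mse / n

y_site = [1, 0, 0]
p_site = [0.8, 0.3, 0.1]
rsa_true = [0.6, 0.1, 0.2]
site_only = multitask_loss(y_site, p_site, rsa_true, rsa_true)       # perfect RSA
joint = multitask_loss(y_site, p_site, rsa_true, [0.4, 0.3, 0.2])    # RSA errors
```

In a shared network, the RSA term adds a training signal for every residue, including the many non-interacting ones.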
In this study, we incorporate sequence-derived features such as the PSSM, physical properties and the hydropathy index in the DLPred model. DLPred is evaluated on three public PPIS test datasets: Dset186, Dtestset72 and PDBtestset164. Experimental results show that our model improves F-measure, prediction accuracy and AUC. We achieved 38.9%, 69.1% and 80.1% in F-measure, accuracy and AUC, respectively, on Dset186; 42.6%, 69% and 81.1% on Dtestset72; and 38.8%, 68.4% and 78.9% on PDBtestset164. Compared with other predictors, DLPred is simple yet generalizes better and improves the performance of imbalanced classification.
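For reference, the three reported metrics can be computed as in the sketch below. It uses the standard definitions, with AUC computed as the fraction of positive-negative pairs ranked correctly. The toy labels, scores and the 0.5 decision threshold are illustrative assumptions and are unrelated to the reported results.

```python
def f_measure_and_accuracy(y_true, y_pred):
    """F-measure (harmonic mean of precision and recall) and accuracy."""
    tp = sum(1 for y, z in zip(y_true, y_pred) if y == 1 and z == 1)
    fp = sum(1 for y, z in zip(y_true, y_pred) if y == 0 and z == 1)
    fn = sum(1 for y, z in zip(y_true, y_pred) if y == 1 and z == 0)
    tn = sum(1 for y, z in zip(y_true, y_pred) if y == 0 and z == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f, (tp + tn) / len(y_true)

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher;
    ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 0, 1, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.1, 0.3, 0.35]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5
f, acc = f_measure_and_accuracy(y_true, y_pred)
area = auc(y_true, scores)
```

On imbalanced data, accuracy can look high even when few interacting residues are found, which is why F-measure and AUC are reported alongside it.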
Materials and methodology
In this section, the proposed method for predicting protein-protein interaction sites is explained in detail.
Experimental setup
In this study, 200, 400 and 400 units are used in the first, second and third BRNN layers, respectively. The output dimensionality of each BRNN layer is 400. Sixty-four hidden nodes are used in the first fully connected layer, and the following fully connected layer is the classification layer with the softmax function. To avoid overfitting, dropout (p = 0.5) is applied to the output of each hidden layer. To obtain a better overall performance model, the F-measure is
Conclusion
We have presented a novel deep learning method for improving the prediction of protein interacting residues, which is an imbalanced classification problem. We proposed a simplified long short-term memory (SLSTM) network and used it to design a deep learning model, DLPred. Three ideas are used to deal with the imbalance issue: collection of protein sequences having a high ratio of interacting residues for the training dataset, a new penalization factor introduced in the loss function, and
Conflict of interest
All authors have declared that no conflict of interest exists.
Acknowledgments
This research was supported in part by the National Natural Science Foundation of China (Nos. 61170125 and 31801108), the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase) and the Natural Science Research Project of Anhui Provincial Department of Education (No. KJ2018A0383).
Buzhong Zhang is a Ph.D. candidate at the School of Computer Science and Technology, Soochow University, China. His research interests include bioinformatics, data mining and machine learning.
References (51)
- et al. ProMate: a structure based prediction program to identify the location of protein-protein binding sites. J. Mol. Biol. (2004)
- et al. Global approaches to protein-protein interactions. Curr. Opin. Cell Biol. (2003)
- et al. Analysis of protein-protein interaction sites using surface patches. J. Mol. Biol. (1997)
- et al. Protein-protein interaction sites prediction by ensembling SVM and sample-weighted random forests. Neurocomputing (2016)
- et al. EUSBoost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognit. (2013)
- et al. Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins: Struct. Funct. Bioinform. (2001)
- et al. ISIS: interaction sites identified from sequence. Bioinformatics (2007)
- et al. Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data. Proteins: Struct. Funct. Bioinform. (2005)
- et al. Prediction-based fingerprints of protein-protein interactions. Proteins: Struct. Funct. Bioinform. (2007)
- et al. SPRINGS: prediction of protein-protein interaction sites using artificial neural networks. J. Proteomic Comput. Biol. (2014)
- Prediction of protein-protein interaction sites using an ensemble method. BMC Bioinformatics
- Detection of outlier residues for improving interface prediction in protein heterocomplexes. IEEE/ACM Trans. Comput. Biol. Bioinform.
- Prediction of protein-protein interaction sites in sequences and 3D structures by random forests. PLoS Comput. Biol.
- Seeing the trees through the forest: sequence-based homo- and heteromeric protein-protein interaction sites prediction using random forest. Bioinformatics
- Applying the naïve Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites. Bioinformatics
- Sequence-based prediction of protein-protein interaction sites with L1-logreg classifier. J. Theor. Biol.
- A cascade random forests algorithm for predicting protein-protein interaction sites. IEEE Trans. Nanobiosci.
- Long short-term memory. Neural Comput.
- Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing
- Learning from imbalanced data. IEEE Trans. Knowl. Data Eng.
- Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans. Knowl. Data Eng.
- Classification of imbalanced data: a review. Int. J. Pattern Recognit. Artif. Intell.
- Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst.
- PISCES: a protein sequence culling server. Bioinformatics
- CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics
Jinyan Li is a Professor of Data Science at the Advanced Analytics Institute and a core member at the Centre for Health Technologies, Faculty of Engineering and IT, UTS. He is also the Bioinformatics Program leader. Jinyan received the Ph.D. degree in computer science from the University of Melbourne, Australia.
Lijun Quan is a Lecturer with the School of Computer Science and Technology, Soochow University. She received the Ph.D. degree from Soochow University, Suzhou, China. Her research interests include bioinformatics, parallel and distributed computing and machine learning.
Yu Chen is a teacher at the School of Computer Science and Technology, Soochow University, China. His research interests include computer vision, protein structure prediction, and machine learning.
Qiang Lü received the Ph.D. degree in computer science from Soochow University, China, in 2006. He is currently a professor in the School of Computer Science and Technology, Soochow University, Suzhou, China. His research interests include protein and RNA structure prediction, metaheuristic search, and parallel algorithms.