Elsevier

Neurocomputing

Volume 357, 10 September 2019, Pages 86-100

Sequence-based prediction of protein-protein interaction sites by simplified long short-term memory network

https://doi.org/10.1016/j.neucom.2019.05.013

Highlights

  • A faster variant of the long short-term memory network.

  • Effective measures for combating class imbalance in deep learning.

  • DLPred, a deep learning tool for predicting protein-protein interaction sites.

Abstract

Proteins often interact with each other and form protein complexes to carry out various biochemical activities. Knowledge of the interaction sites is helpful for understanding disease mechanisms and for drug design. Accurate prediction of interaction sites from protein sequences remains a challenging task, and severe class imbalance in the data further degrades the performance of computational methods. In this study, we propose a deep learning method to improve the imbalanced prediction of protein interaction sites. We develop a new simplified long short-term memory (SLSTM) network and use it to build a deep learning architecture named DLPred. To deal with the imbalanced classification in the deep learning model, we explore three new ideas. First, we construct the training data as a set of complete protein sequences, instead of a set of single residues, so that the sequential completeness of each protein is retained. Second, a new penalization factor is appended to the loss function so that the penalization of the non-interaction site loss is effectively enhanced. Third, multi-task learning of interaction site prediction and residue solvent accessibility prediction is used to correct the preference of the prediction model for non-interaction sites. Our model is evaluated on three public datasets: Dset186, Dtestset72 and PDBtestset164. Compared with current state-of-the-art methods, DLPred significantly improves the predictive accuracy and AUC values while also improving the F-measure. The training dataset, test datasets, a standalone version of DLPred and an online service are available at http://qianglab.scst.suda.edu.cn/dlp/.

Introduction

Protein-protein interactions are fundamental to many cellular biological processes, such as signal transduction, immune response, and cellular organization [1]. Protein-protein interaction sites (PPISs) are composed of a set of amino acid residues that form chemical bonds with a part of another molecule. Detecting interaction sites in sequences is very useful for understanding the mechanisms of various biological processes, disease development and drug design. Experimentally determined protein 3D structures provide important clues for identifying interaction sites and understanding protein functions [2]. However, biological experimental methods [3] are labor-intensive and time-consuming, and the number of known 3D structures is still considerably smaller than the number of known protein sequences.

Over the past decades, researchers have investigated the possibility of using computational approaches to rapidly and accurately predict interacting residues from protein sequences. Jones and Thornton [4] reported that solvation potential, residue interface propensity, hydrophobicity, planarity, protrusion and accessible surface area are the most important features for differentiating an observed interface from other patches defined on the surface of a protein. Neuvirth et al. [2] suggested that the locations of protein-protein binding sites are imprinted in the structures of the proteins. Ofran and Rost [5] also concluded that unbound proteins could suffice for the identification of interface residues.

To date, many computational methods have been proposed for this prediction problem, including artificial neural networks [1], [6], [7], [8], support vector machines (SVMs) [7], [9], [10], random forests [11], [12], the Naïve Bayes classifier [13], L1-regularized logistic regression [14], and ensembles of SVMs and sample-weighted random forests [15]. In particular, Zhou and Shan [1] proposed a neural network predictor that takes sequence profiles of neighboring residues and solvent exposure as input. Ofran and Rost [5] proposed another neural network predictor (ISIS), which was trained on sequence profiles and structural features predicted from the sequences. Porollo and Meller [7] proposed a method named SPPIDER that uses an SVM, a neural network and linear discriminant analysis based on 19 features selected from the sequences. Murakami and Mizuguchi [13] developed a predictor called PSIVER, which is a Naïve Bayes classifier with kernel density estimation based on the position-specific scoring matrix (PSSM) and predicted solvent accessibility. Kaustubh et al. [14] proposed an L1-regularized logistic regression classifier named LORIS. Furthermore, Singh et al. [8] proposed a novel artificial neural network predictor, SPRINGS. Both SPRINGS and LORIS are trained on a feature space consisting of the PSSM, averaged cumulative hydropathy and predicted relative solvent accessibility.

Although much progress has been made, there is still room to improve the performance of PPIS prediction. One of the challenging issues in this research is class imbalance. Recently, some methods have been dedicated to solving this problem. Wei et al. [16] were the first to address it and proposed a cascade random forests (CRF) algorithm. CRF connects multiple random forests in a cascade-like manner, each of which is trained on a balanced training subset that includes all minority samples and a subset of the majority samples. However, sampling the training data at the residue level destroys the completeness of a sequence. Another method, SSWRF [15], combines an ensemble of SVMs and sample-weighted random forests to cope with the class imbalance issue, but its prediction accuracy is not very satisfactory.

In this work, we explore new ideas to address the imbalance issue and design a suitable deep learning architecture so that the model generalizes better on imbalanced data.

Firstly, a lightweight variant of long short-term memory (LSTM) [17], named the simplified long short-term memory (SLSTM) network, is proposed and used as the fundamental module of our model architecture. Our deep learning model, named DLPred, stacks three SLSTM layers followed by two feed-forward layers. Compared with models using LSTM or gated recurrent units (GRU) [18], the SLSTM-based model has only 61.4% of the parameters of the LSTM-based model and 81.7% of those of the GRU-based model. The SLSTM-based model trains faster than the GRU-based or LSTM-based models, while the performance of the SLSTM-based DLPred model is comparable to that of the GRU-based model and better than that of the LSTM-based model.
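The reported ratios can be sanity-checked against the textbook parameter counts of LSTM and GRU layers. The sketch below is illustrative only: the layer sizes and helper names are hypothetical, bias conventions differ between libraries, and SLSTM itself is not modeled because its gate structure is not described in this excerpt. A standard LSTM layer has four gate-like weight blocks and a GRU has three, so a GRU layer holds 75% of the parameters of an equally sized LSTM layer. The figures quoted above imply a GRU/LSTM ratio of 0.614 / 0.817, roughly 0.75, for the full models, which is consistent with that count.

```python
def gated_layer_params(input_dim: int, hidden_dim: int, n_blocks: int) -> int:
    """Parameter count of one unidirectional gated recurrent layer.

    Each gate-like block has an input matrix (hidden x input), a recurrent
    matrix (hidden x hidden) and one bias vector (hidden). This is the
    textbook formulation; some libraries add a second bias vector.
    """
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return n_blocks * per_block


def lstm_params(input_dim: int, hidden_dim: int) -> int:
    # Input, forget and output gates plus the candidate cell state.
    return gated_layer_params(input_dim, hidden_dim, n_blocks=4)


def gru_params(input_dim: int, hidden_dim: int) -> int:
    # Update and reset gates plus the candidate hidden state.
    return gated_layer_params(input_dim, hidden_dim, n_blocks=3)


# Hypothetical sizes, chosen only for illustration.
input_dim, hidden_dim = 50, 200

gru_over_lstm = gru_params(input_dim, hidden_dim) / lstm_params(input_dim, hidden_dim)

# Ratio implied by the percentages quoted in the text (whole models,
# including the feed-forward layers, so only approximately comparable).
implied_gru_over_lstm = 0.614 / 0.817

print(f"GRU/LSTM per layer: {gru_over_lstm:.3f}")
print(f"GRU/LSTM implied by the text: {implied_gru_over_lstm:.3f}")
```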

Secondly, the training data is filtered at the sequence level. Approaches to constructing training data that handle the imbalance issue have been well investigated in the literature [19], [20]. The most straightforward approaches are various techniques for adjusting the training data. Traditionally, the training data is collected as a set of individual residues. Adjusting the training data at the residue level, however, shatters the sequential completeness of many proteins. In this work, the training data is collected as a set of complete protein sequences. Thus, each sequence in the training dataset still contains its complete set of binding residues and its complete set of non-binding residues. Our training dataset (TR5860), which comprises 5860 sequences, is collected from multiple data sources, and interacting residues make up at least 10% of each sequence [21].
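Sequence-level filtering of this kind can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the data layout (a residue string paired with a '0'/'1' label string), the function names and the toy proteins are all assumptions; only the 10% threshold comes from the text.

```python
from typing import List, Tuple

# Hypothetical record layout: (residue sequence, per-residue labels),
# where '1' marks an interacting residue and '0' a non-interacting one.
Protein = Tuple[str, str]


def interacting_fraction(labels: str) -> float:
    """Fraction of residues labelled as interacting."""
    return labels.count("1") / len(labels) if labels else 0.0


def filter_sequences(proteins: List[Protein], min_fraction: float = 0.10) -> List[Protein]:
    """Keep whole sequences whose interacting fraction reaches the threshold.

    Sequences are kept or dropped as a unit, so every retained protein
    still carries all of its binding and non-binding residues.
    """
    return [
        (seq, labels)
        for seq, labels in proteins
        if len(seq) == len(labels) and interacting_fraction(labels) >= min_fraction
    ]


# Toy data, invented for illustration.
toy = [
    ("MKTAYIAKQR", "0100000000"),  # 10% interacting: kept
    ("GSHMLEDPVA", "0000000000"),  # 0% interacting: dropped
    ("ACDEFGHIKL", "1110000011"),  # 50% interacting: kept
]
kept = filter_sequences(toy)
print(len(kept), "of", len(toy), "sequences kept")
```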

Thirdly, inspired by the recent successes of cost-sensitive learning in convolutional neural networks (CNNs) [22], we append a new penalization factor to the loss function so that the penalization of misclassified non-interacting residues is enhanced to cope with the imbalance issue.
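The exact form of the penalization factor is not given in this excerpt. As a generic stand-in, the sketch below shows a class-weighted binary cross-entropy, a common cost-sensitive loss; it should not be read as the DLPred loss. The function name, the weights w_pos and w_neg, and the toy values are assumptions.

```python
import math
from typing import Sequence


def weighted_cross_entropy(
    probs: Sequence[float],
    labels: Sequence[int],
    w_pos: float = 1.0,
    w_neg: float = 1.0,
) -> float:
    """Mean binary cross-entropy with a separate weight per class.

    probs[i] is the predicted probability that residue i is interacting;
    labels[i] is 1 for interacting and 0 for non-interacting residues.
    Raising w_neg increases the cost of errors on non-interacting residues.
    """
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -w_pos * math.log(p)
        else:
            total += -w_neg * math.log(1.0 - p)
    return total / len(labels)


# Toy example: the second residue is non-interacting but scored 0.8.
probs, labels = [0.9, 0.8], [1, 0]
plain = weighted_cross_entropy(probs, labels)
penalized = weighted_cross_entropy(probs, labels, w_neg=2.0)
print(f"unweighted: {plain:.4f}, with w_neg=2: {penalized:.4f}")
```

With w_neg = 2 the loss contribution of the misclassified non-interacting residue doubles, while the contribution of the interacting residue is unchanged.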

Finally, multi-task learning is used to correct the preference of the prediction model for non-interacting residues. In our feature space construction, interacting residues are closely correlated with residue solvent accessibility (RSA). Most interacting residues are interface residues of the protein, and only residues with a larger solvent-accessible area have a high potential to become interface residues. We therefore propose to predict PPISs and RSA concurrently, which is an effective way to improve the generalization of our model under class imbalance.
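A joint objective for two tasks that share one network is usually written as a weighted sum of the per-task losses. The sketch below illustrates that idea under stated assumptions: RSA is simplified to a binary exposed/buried label, both tasks use plain cross-entropy, and rsa_weight is a hypothetical balancing hyperparameter. The loss actually used by DLPred is not specified in this excerpt.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy over the residues of one protein."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(labels)


def multitask_loss(
    ppis_probs: Sequence[float],
    ppis_labels: Sequence[int],
    rsa_probs: Sequence[float],
    rsa_labels: Sequence[int],
    rsa_weight: float = 1.0,
) -> float:
    """Main-task loss plus a weighted auxiliary RSA loss.

    Both heads would share the same recurrent layers, so gradients from
    the RSA term also shape the features used for PPIS prediction.
    """
    return cross_entropy(ppis_probs, ppis_labels) + rsa_weight * cross_entropy(
        rsa_probs, rsa_labels
    )


# Toy predictions for a three-residue fragment (invented values).
ppis_probs, ppis_labels = [0.7, 0.2, 0.1], [1, 0, 0]
rsa_probs, rsa_labels = [0.8, 0.6, 0.3], [1, 1, 0]

ppis_only = multitask_loss(ppis_probs, ppis_labels, rsa_probs, rsa_labels, rsa_weight=0.0)
joint = multitask_loss(ppis_probs, ppis_labels, rsa_probs, rsa_labels, rsa_weight=1.0)
print(f"PPIS only: {ppis_only:.4f}, joint: {joint:.4f}")
```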

In this study, we incorporate sequence-derived features such as the PSSM, physical properties and the hydropathy index into the DLPred model. DLPred is evaluated on three public PPIS test datasets: Dset186, Dtestset72 and PDBtestset164. Experimental results show that our model improves the F-measure, predictive accuracy and AUC. We achieved 38.9%, 69.1% and 80.1% in F-measure, accuracy and AUC, respectively, on Dset186; 42.6%, 69% and 81.1% on Dtestset72; and 38.8%, 68.4% and 78.9% on PDBtestset164. Compared with other predictors, DLPred is simple yet more generalizable, and it improves performance on imbalanced classification.
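The three reported metrics behave differently under class imbalance, which is why the F-measure and AUC are quoted alongside accuracy. The sketch below uses the standard definitions; the helper names and toy data are invented for illustration and are unrelated to the benchmark results above.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of residues whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f_measure(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced example: 2 interacting residues out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * 10  # a predictor that never reports an interaction site

print("accuracy:", accuracy(y_true, all_negative))    # high despite finding nothing
print("F-measure:", f_measure(y_true, all_negative))  # zero

toy_auc = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.5])
print("AUC:", toy_auc)
```

A predictor that labels every residue as non-interacting reaches 80% accuracy on this toy set but an F-measure of zero, which is the failure mode the imbalance-handling measures are meant to avoid.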

Materials and methodology

In this section, the proposed method for predicting protein-protein interaction sites is explained in detail.

Experimental setup

In this study, 200, 400 and 400 units are used in the first, second and third BRNN layers, respectively. The output dimensionality of each BRNN layer is 400. Sixty-four hidden nodes are used in the first fully connected layer, and the following fully connected layer is the classification layer with the softmax function. To avoid overfitting, dropout (p = 0.5) is applied to the output of each hidden layer. To obtain a better overall performance model, the F-measure is

Conclusion

We have presented a novel deep learning method for improving the prediction of protein interacting residues, which is an imbalanced classification problem. We proposed a simplified long short-term memory (SLSTM) network and used it to design a deep learning model, DLPred. Three ideas are used to deal with the imbalance issue: collection of protein sequences having a high ratio of interacting residues for the training dataset, a new penalization factor introduced in the loss function, and

Conflict of interest

All authors declare that no conflict of interest exists.

Acknowledgments

This research was supported in part by the National Natural Science Foundation of China (No. 61170125 and 31801108), the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase) and the Natural Science Research Project of Anhui Provincial Department of Education (No. KJ2018A0383).

Buzhong Zhang is a Ph.D. candidate at the School of Computer Science and Technology, Soochow University, China. His research interests include bioinformatics, data mining and machine learning.

References (51)

  • D. Lei et al.

    Prediction of protein-protein interaction sites using an ensemble method

    BMC Bioinformatics

    (2009)
  • P. Chen et al.

    Detection of outlier residues for improving interface prediction in protein heterocomplexes

    IEEE/ACM Trans. Comput. Biol. Bioinform.

    (2012)
  • M. Šikić et al.

    Prediction of protein-protein interaction sites in sequences and 3D structures by random forests

    PLoS Comput. Biol.

    (2009)
  • Q. Hou et al.

    Seeing the trees through the forest: sequence-based homo- and heteromeric protein-protein interaction sites prediction using random forest

    Bioinformatics

    (2017)
  • Y. Murakami et al.

    Applying the naïve Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites

    Bioinformatics

    (2010)
  • D. Kaustubh et al.

    Sequence-based prediction of protein-protein interaction sites with l1-logreg classifier

    J. Theor. Biol.

    (2014)
  • Z.S. Wei et al.

    A cascade random forests algorithm for predicting protein-protein interaction sites

    IEEE Trans. Nanobiosci.

    (2015)
  • S. Hochreiter et al.

    Long short-term memory

    Neural Comput.

    (1997)
  • K. Cho et al.

    Learning phrase representations using RNN encoder-decoder for statistical machine translation

    Proceedings of the Conference on Empirical Methods in Natural Language Processing

    (2014)
  • H. He et al.

    Learning from imbalanced data

    IEEE Trans. Knowl. Data Eng.

    (2009)
  • Z.H. Zhou et al.

    Training cost-sensitive neural networks with methods addressing the class imbalance problem

    IEEE Trans. Knowl. Data Eng.

    (2006)
  • Y. Sun et al.

    Classification of imbalanced data: a review

    Int. J. Pattern Recognit. Artif. Intell.

    (2009)
  • S.H. Khan et al.

    Cost-sensitive learning of deep feature representations from imbalanced data

    IEEE Trans. Neural Netw. Learn. Syst.

    (2018)
  • G. Wang et al.

    PISCES: a protein sequence culling server

    Bioinformatics

    (2003)
  • W. Li et al.

    CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences

    Bioinformatics

    (2006)

Jinyan Li is a Professor of Data Science at the Advanced Analytics Institute and a core member at the Centre for Health Technologies, Faculty of Engineering and IT, UTS. He is also the Bioinformatics Program leader. Jinyan received the Ph.D. degree in computer science from the University of Melbourne, Australia.

Lijun Quan is a Lecturer with the School of Computer Science and Technology, Soochow University. She received the Ph.D. degree from Soochow University, Suzhou, China. Her research interests include bioinformatics, parallel and distributed computing and machine learning.

Yu Chen is a teacher at the School of Computer Science and Technology, Soochow University, China. His research interests include computer vision, protein structure prediction, and machine learning.

Qiang Lü received the Ph.D. degree in computer science from Soochow University, China, in 2006. He is currently a professor in the School of Computer Science and Technology, Soochow University, Suzhou, China. His research interests include protein and RNA structure prediction, metaheuristic search, and parallel algorithms.
