GTB-PPI: Predict Protein–protein Interactions Based on L1-regularized Logistic Regression and Gradient Tree Boosting

Protein–protein interactions (PPIs) are of great importance for understanding genetic mechanisms, delineating disease pathogenesis, and guiding drug design. With the increase of PPI data and the development of machine learning technologies, prediction and identification of PPIs have become a research hotspot in proteomics. In this study, we propose a new prediction pipeline for PPIs based on gradient tree boosting (GTB). First, the initial feature vector is extracted by fusing pseudo amino acid composition (PseAAC), pseudo position-specific scoring matrix (PsePSSM), reduced sequence and index-vectors (RSIV), and autocorrelation descriptor (AD). Second, to remove redundancy and noise, we employ L1-regularized logistic regression (L1-RLR) to select an optimal feature subset. Finally, the GTB-PPI model is constructed. Five-fold cross-validation showed that GTB-PPI achieved accuracies of 95.15% and 90.47% on the Saccharomyces cerevisiae and Helicobacter pylori datasets, respectively. In addition, GTB-PPI could be applied to predict the independent test datasets for Caenorhabditis elegans, Escherichia coli, Homo sapiens, and Mus musculus, the one-core PPI network for CD9, and the crossover PPI network for the Wnt-related signaling pathways. The results show that GTB-PPI can significantly improve the accuracy of PPI prediction. The code and datasets of GTB-PPI can be downloaded from https://github.com/QUST-AIBBDRC/GTB-PPI/.


Introduction
Knowledge of protein-protein interactions (PPIs) can help to probe the mechanisms underlying various biological processes, such as DNA replication, protein modification, and signal transduction [1,2]. The accurate understanding and analysis of PPIs can reveal multiple functions at the molecular and proteome levels, which has become a research hotspot [3,4]. However, wet-lab identification methods suffer from incomplete and false predictions [5]. Alternatively, employing reliable bioinformatics methods for PPI prediction could provide candidates for subsequent experimental validation in a cost-effective way.
Compared with structure-based methods, sequence-based methods are straightforward and do not require a priori information, and they have therefore been widely used. Martin et al. [6] proposed the signature kernel method to extract protein sequence feature information, but they did not use physicochemical property information. Subsequently, Guo et al. [7] employed seven physicochemical properties of amino acids to predict PPIs by combining autocovariance and support vector machine (SVM).
Different feature extraction methods can complement each other, and prediction accuracy can be improved by effective feature fusion [8,9]. For instance, Du et al. [8] constructed a PPI prediction framework called DeepPPI, which employed deep neural networks as the classifier. They fused amino acid composition information-based features and physicochemical property-based sequence features. However, the presence of information redundancy, noise, and excessively high dimensionality after feature fusion would affect the classification accuracy. You et al. [10] used the minimum redundancy maximum relevance (mRMR) method to determine important and distinguishable features to predict PPIs based on SVM.
Ensemble learning systems can achieve higher prediction performance than a single classifier. For example, Jia et al. [11] combined seven random forest (RF) classifiers according to voting principles. As an ensemble learning method, gradient tree boosting (GTB) has been widely applied in miRNA-disease association [12], drug-target interaction [13], and RNA-binding residue prediction [14]. In these applications, GTB outperforms SVM and RF, showing superior model generalization performance.
Although a large number of algorithms have been proposed and developed, challenges remain for the sequence-based PPI predictors currently available. First, the sequence-only-based information of PPIs is not fully represented and elucidated, and satisfactory results cannot be obtained by merely adjusting individual parameters. Multi-information fusion is a very useful strategy that fuses multiple descriptors, such as pseudo amino acid composition (PseAAC) and pseudo position-specific scoring matrix (PsePSSM), which have been widely applied in PPI prediction [15], Gram-negative protein localization prediction [16], identification of submitochondrial locations [17], and apoptosis protein localization prediction [18]. Second, there is a severe data imbalance problem in PPI prediction. The number of non-interacting protein pairs is much higher than that of interacting protein pairs. Currently, machine learning methods cannot deal with such problems well and could result in poor overall performance when dealing with imbalanced data [19].
To overcome the aforementioned limitations of machine learning methods, this study proposes a new PPI prediction pipeline called GTB-PPI. First, we fuse PseAAC, PsePSSM, reduced sequence and index-vectors (RSIV), and autocorrelation descriptor (AD) to extract amino acid composition-based information, evolutionary information, and physicochemical information. To retrieve effective details representing PPIs without losing important and reliable characteristic information, L1-regularized logistic regression (L1-RLR) is utilized for the first time in PPI prediction to eliminate redundant features. At the same time, we employ GTB as a classifier to bridge the gap between the extracted PPI features and the class label. Our data show that the PPI prediction performance of GTB is better than that of SVM, RF, Naïve Bayes (NB), and K nearest neighbors (KNN) classifiers. The linear combination of decision trees can fit the PPI data well. When applied to network prediction, GTB-PPI obtains accuracy values of 93.75% and 95.83% for the one-core PPI network for CD9 and the crossover PPI network for the Wnt-related signaling pathways, respectively.

Method

Data source
The Saccharomyces cerevisiae PPI dataset was obtained from the Database of Interacting Proteins (DIP) (DIP: 20070219) [7]. Protein sequences consisting of < 50 amino acid residues or showing sequence identity ≥ 40% via CD-HIT [20] were removed. Thus, 5594 interacting protein pairs are considered as positive samples; 5594 protein pairs with different subcellular location information are selected as negative samples, and their location information is obtained from Swiss-Prot. The Helicobacter pylori PPI dataset was constructed previously [6] and contains 2916 samples (1458 PPI pairs and 1458 non-PPI pairs).
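The construction of negative samples described above can be illustrated with a small sketch. It is only an assumption about how such sampling might be done: the function names (`filter_short`, `sample_negative_pairs`), the toy protein identifiers, and the random sampling scheme are hypothetical, and the CD-HIT identity filter is an external tool that is not reproduced here.

```python
import random

def filter_short(sequences, min_len=50):
    """Drop proteins with fewer than min_len residues."""
    return {pid: seq for pid, seq in sequences.items() if len(seq) >= min_len}

def sample_negative_pairs(locations, positive_pairs, n_pairs, seed=0):
    """Sample pairs whose two proteins have different subcellular locations.

    Pairs that appear among the known interactions are never returned.
    """
    rng = random.Random(seed)
    proteins = sorted(locations)
    positives = {frozenset(p) for p in positive_pairs}
    negatives = set()
    tries, max_tries = 0, 1000 * n_pairs  # guard against an unreachable target
    while len(negatives) < n_pairs and tries < max_tries:
        tries += 1
        a, b = rng.sample(proteins, 2)
        if locations[a] != locations[b] and frozenset((a, b)) not in positives:
            negatives.add(frozenset((a, b)))
    return sorted(tuple(sorted(p)) for p in negatives)

# Toy example with hypothetical identifiers and locations.
locations = {"P1": "nucleus", "P2": "nucleus", "P3": "cytoplasm",
             "P4": "membrane", "P5": "cytoplasm"}
positive_pairs = [("P1", "P3")]
negative_pairs = sample_negative_pairs(locations, positive_pairs, n_pairs=4)
kept = filter_short({"short": "M" * 49, "long": "M" * 50})
print(negative_pairs, sorted(kept))
```

Sampling the same number of negative pairs as positive pairs yields a balanced benchmark, as in the 5594 versus 5594 split reported for S. cerevisiae.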
Four independent PPI datasets [21] were also used to test the performance of GTB-PPI. These datasets are obtained from Caenorhabditis elegans (4013 interacting pairs), Escherichia coli (6954 interacting pairs), Homo sapiens (1412 interacting pairs), and Mus musculus (313 interacting pairs). The number of unique proteins in each dataset is shown in Table S1.

Feature extraction
We fuse PseAAC, PsePSSM, RSIV, and AD to extract the PPI feature information, including sequence-based features, evolutionary information features, and physicochemical property features. The detailed descriptions of methods are presented in File S1.
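L1-RLR selects features by solving the following optimization problem:

$$\min_{\omega} \; \|\omega\|_1 + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i \omega^{T} x_i}\right) \qquad (1)$$

where $\|\cdot\|_1$ represents the L1 norm; $l$ is the number of samples; $x_i$ and $y_i \in \{-1, +1\}$ are the feature vector and label of the ith sample; $\omega$ represents the weight coefficient vector; and $C$ represents the penalty term, which determines the number of selected features.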
We use the coordinate descent algorithm in LIBLINEAR [22] to solve Equation (1).
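As an illustration of this selection step, the following sketch uses scikit-learn, whose `liblinear` solver is backed by LIBLINEAR. The synthetic data, the helper name `select_features_l1`, and the threshold for treating a weight as nonzero are assumptions made for the example and are not taken from the GTB-PPI code.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def select_features_l1(X, y, C=1.0):
    """Fit L1-regularized logistic regression and keep features with nonzero weight."""
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    # Features whose absolute weight falls below the threshold are discarded.
    return SelectFromModel(model, threshold=1e-5).fit(X, y)

# Synthetic stand-in for the fused protein-pair features (not the paper's data).
X, y = make_classification(n_samples=300, n_features=100,
                           n_informative=10, random_state=0)
X = StandardScaler().fit_transform(X)

# A smaller C means a stronger L1 penalty and therefore fewer surviving features.
n_kept = {C: int(select_features_l1(X, y, C).get_support().sum())
          for C in (0.01, 0.1, 1.0)}
print(n_kept)
```

The dictionary printed at the end shows how the penalty term C controls the size of the selected feature subset.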

GTB
GTB can be used to aggregate multiple decision trees [23,24]. Different from other ensemble learning algorithms, GTB fits the residual of the regression tree at each iteration using the negative gradient of the loss. GTB models the relationship between the label $y$ and the vector of input variables $x$, which are connected via a joint probability distribution $p(x, y)$. The goal of GTB is to obtain the estimated function $\hat{F}(x)$ by minimizing the expected loss $L(y, F(x))$:

$$\hat{F} = \arg\min_{F} \mathbb{E}_{x,y}\left[L(y, F(x))\right]$$

Let $h_m(x)$ be the mth decision tree and $J_m$ the number of its leaves. The tree partitions the input space into $J_m$ disjoint regions $R_{1,m}, R_{2,m}, \cdots, R_{J_m,m}$ and predicts a numerical value $b_{jm}$ for each region $R_{jm}$. The output of $h_m(x)$ can be described as:

$$h_m(x) = \sum_{j=1}^{J_m} b_{jm} \, \mathbf{1}\left(x \in R_{jm}\right)$$

Then the value of $\gamma_m$ can be obtained using steepest descent to fulfill the GTB model:

$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{l} L\left(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\right)$$

where $F_{m-1}(x)$ represents the estimated function after $m-1$ iterations and the sum runs over the $l$ training samples. The iterative criterion of GTB is shown using Equation (4).

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x) \qquad (4)$$

where the number of iterations is set as $M$, and the final GTB model is $\hat{F}(x) = F_M(x)$.
GTB can complement the weak learning ability of a single decision tree, thus improving the ability of representation, optimization, and generalization. GTB can capture higher-order information and is invariant to scaling of sample data. GTB can effectively avoid overfitting through its weighted combination scheme. GTB-PPI uses the GTB algorithm of Scikit-learn [25].
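The boosting iteration described above (fit a tree to the negative gradient of the loss, then add it to the current estimate) can be sketched from scratch as follows. This is a simplified illustration rather than the Scikit-learn implementation used by GTB-PPI: it assumes the logistic loss with labels in {0, 1}, replaces the per-iteration line search for the step size with a fixed step, and all function names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, F):
    """Mean logistic loss for labels y in {0, 1} and raw scores F."""
    p = np.clip(sigmoid(F), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_gtb(X, y, n_iter=50, step=0.1, max_depth=3):
    """Boost regression trees on the negative gradient of the logistic loss."""
    p0 = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    f0 = float(np.log(p0 / (1 - p0)))      # constant initial estimate F_0
    F = np.full(len(y), f0)
    trees, losses = [], [log_loss(y, F)]
    for _ in range(n_iter):
        residual = y - sigmoid(F)          # negative gradient of the loss w.r.t. F
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)              # the m-th tree h_m
        F = F + step * tree.predict(X)     # F_m = F_{m-1} + step * h_m (fixed step)
        trees.append(tree)
        losses.append(log_loss(y, F))
    return {"f0": f0, "step": step, "trees": trees, "losses": losses}

def predict_proba(model, X):
    F = np.full(X.shape[0], model["f0"])
    for tree in model["trees"]:
        F = F + model["step"] * tree.predict(X)
    return sigmoid(F)

# Synthetic two-class data (not the paper's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
model = fit_gtb(X, y)
train_acc = float(np.mean((predict_proba(model, X) > 0.5) == (y == 1)))
print(model["losses"][0], model["losses"][-1], train_acc)
```

The recorded losses make the effect of each added tree visible: the training loss after the last iteration is lower than the loss of the constant initial estimate.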

Performance evaluation
In the GTB-PPI pipeline, recall, precision, overall prediction accuracy (ACC), and Matthews correlation coefficient (MCC) are used to evaluate the model performance [8]. The definitions are as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP indicates the number of PPI samples correctly predicted; TN indicates the number of non-PPI samples correctly predicted; FP and FN indicate the numbers of false positives and false negatives, respectively. Receiver operating characteristic (ROC) curve [26], precision-recall (PR) curve [27], area under ROC curve (AUROC), and area under PR curve (AUPRC) are also used to evaluate the generalization ability of GTB-PPI.
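For concreteness, the four threshold-based measures can be computed from the confusion counts as in the sketch below. The function name `ppi_metrics` and the example counts are illustrative assumptions and do not correspond to any result reported here.

```python
import math

def ppi_metrics(tp, tn, fp, fn):
    """Recall, precision, ACC, and MCC from the four confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # convention: 0 when undefined
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "acc": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }

# Hypothetical counts for a balanced test set of 200 pairs.
m = ppi_metrics(tp=90, tn=85, fp=15, fn=10)
print(m)
```

Unlike ACC, MCC uses all four counts in both numerator and denominator, so it remains informative when the classes are imbalanced.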

GTB-PPI pipeline
The pipeline of GTB-PPI for predicting PPIs is shown in Figure 1, which can be implemented using MATLAB 2014a and Python 3.6. There are five steps of GTB-PPI as described below.

Data input
The input values of GTB-PPI are PPI samples, non-PPI samples, and the corresponding binary labels.

PPI prediction based on GTB
After feature extraction (step 2), L1-RLR is applied for dimensionality reduction (step 3) to better capture the sequence representation details. The GTB-PPI model is then constructed using GTB as the classifier.

PPI prediction on independent test datasets and network datasets
The optimal feature set representing PPIs can be obtained through feature encoding, fusion, and selection. GTB is employed to predict the binary labels on four independent test datasets and two network datasets.

Parameter optimization of PseAAC, PsePSSM, and AD
It is essential to optimize the parameters of PseAAC, PsePSSM, and AD for GTB-PPI predictor construction. We implement the hyperparameter optimization through five-fold cross-validation.
To extract features from the sequence, the values for λ of PseAAC, ξ of PsePSSM, and lag of AD should be determined. We set the values of λ as 1, 3, 5, 7, 9, and 11; similarly, values for ξ and lag are also set as 1, 3, 5, 7, 9, and 11 in order. GTB is then used to predict the binary labels (Tables S2-S4). As shown in Figure 2, the prediction performance on the S. cerevisiae and H. pylori datasets changed with the alteration in the values of the respective parameters. For the parameter λ in PseAAC, the highest prediction performance for these two datasets was obtained at different λ values: the optimal λ value for S. cerevisiae is 9, while the optimal λ value for H. pylori is 11. Considering that PseAAC generates fewer dimensional vectors than the other three feature extraction methods (PsePSSM, RSIV, and AD), we choose the optimal parameter λ = 11 to mine more PseAAC information. The parameter selection of ξ and lag can be found in File S2.

Figure 1 Overall framework of GTB-PPI for PPI prediction. First, the benchmark datasets are collected. Second, PseAAC, PsePSSM, RSIV, and AD are used for feature extraction. Third, the L1-RLR is employed for dimensionality reduction. Fourth, we use GTB to predict PPIs and the GTB-PPI model is constructed. Finally, five-fold cross-validation, independent test, and PPI network are employed to evaluate GTB-PPI. PseAAC, pseudo amino acid composition; PsePSSM, pseudo position-specific scoring matrix; RSIV, reduced sequence and index-vectors; AD, autocorrelation descriptor; GTB, gradient tree boosting; PPI, protein-protein interaction; ACC, overall prediction accuracy; MCC, Matthews correlation coefficient; L1-RLR, L1-regularized logistic regression.

Figure 2 Prediction results of different parameters λ, ξ, and lag on the S. cerevisiae and H. pylori datasets. The λ, ξ, and lag are the parameters that need to be adjusted in PseAAC, PsePSSM, and AD, respectively.

In summary, for each protein sequence, PseAAC extracts 20 + 11 = 31 features, PsePSSM obtains 20 + 20 × 9 = 200 features, the dimension of RSIV is 197, and AD encodes 3 × 7 × 11 = 231 features. We can obtain 659-dimensional vectors by fusing all four coding methods. Then the 1318-dimensional feature vectors are constructed by concatenating the feature vectors of the two proteins in each pair.
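The dimension bookkeeping above can be checked with a few lines of code. The constant names and the helper `encode_pair` are hypothetical, and the all-zero and all-one vectors are placeholders standing in for real descriptor values.

```python
import numpy as np

# Parameter values chosen in the text for PseAAC, PsePSSM, and AD.
PSEAAC_PARAM, PSEPSSM_PARAM, AD_LAG = 11, 9, 11

dims = {
    "PseAAC": 20 + PSEAAC_PARAM,        # 20 composition terms + sequence-order terms
    "PsePSSM": 20 + 20 * PSEPSSM_PARAM,
    "RSIV": 197,                        # fixed size reported in the text
    "AD": 3 * 7 * AD_LAG,               # factors as given in the text
}
per_protein = sum(dims.values())

def encode_pair(feat_a, feat_b):
    """Concatenate the fused feature vectors of the two proteins in a pair."""
    return np.concatenate([feat_a, feat_b])

pair_vector = encode_pair(np.zeros(per_protein), np.ones(per_protein))
print(dims, per_protein, pair_vector.shape)
```

Plain concatenation makes the pair encoding order dependent, so (A, B) and (B, A) produce different vectors; whether the original pipeline handles this symmetry is not stated in the text.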

Effect of dimensionality reduction
L1-RLR can effectively improve prediction performance with higher computational efficiency. The process of parameter selection is described in File S3. To evaluate the performance of L1-RLR (C = 1), we compared its prediction performance with SSDR [28], PCA [29] (setting of contribution rate is shown in Table S5), KPCA [30] (adjustment of contribution rate is shown in Table S6), FA [31], mRMR [32], and CMIM [33] (Table S7). ROC and PR curves of different dimensionality reduction methods are shown in Figure 3. The AUROC and AUPRC are shown in Table S8. The numbers of raw features and optimal features are shown in Figures S1 and S2.
As shown in Figure 3A and B, ROC curves for both the S. cerevisiae and H. pylori datasets show that L1-RLR has superior model performance. For the S. cerevisiae dataset, the AUROC value of L1-RLR is 0.9875, which is 4.55%, 4.83%, 6.13%, 3.21%, 1.07%, and 1.09% higher than that of SSDR, PCA, KPCA, FA, mRMR, and CMIM, respectively (Table S8). For the H. pylori dataset, the AUROC value of L1-RLR is 0.9559, which is 3.47%, 9.80%, 8.59%, 8.33%, 1.04%, and 9.55% higher than that of SSDR, PCA, KPCA, FA, mRMR, and CMIM, respectively (Table S8). The AUPRC values of L1-RLR are likewise higher than those of the other six dimensionality reduction methods on the S. cerevisiae and H. pylori datasets (Table S8). These results indicate that L1-RLR can effectively remove the redundant features without losing important information. The effective features related to PPIs could be fed into a GTB classifier, generating a reliable GTB-PPI prediction model.
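A comparison of this kind can be set up as in the sketch below, which scores two dimensionality reduction choices by cross-validated AUROC and AUPRC. Everything specific here is an assumption for illustration: the synthetic data, the reduced number of boosting iterations, the 90% explained-variance setting for PCA, and the use of average precision as the AUPRC estimate. Only L1-RLR and PCA are shown; the other methods compared in the text are omitted.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fused pair features (not the paper's data).
X, y = make_classification(n_samples=300, n_features=60,
                           n_informative=8, random_state=0)

reducers = {
    "L1-RLR": SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
        threshold=1e-5),
    "PCA": PCA(n_components=0.9, svd_solver="full"),  # keep 90% of the variance
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, reducer in reducers.items():
    # The reducer sits inside the pipeline so it is refit on each training fold.
    pipe = make_pipeline(StandardScaler(), reducer,
                         GradientBoostingClassifier(n_estimators=50, random_state=0))
    proba = cross_val_predict(pipe, X, y, cv=cv, method="predict_proba")[:, 1]
    scores[name] = {"AUROC": roc_auc_score(y, proba),
                    "AUPRC": average_precision_score(y, proba)}
print(scores)
```

Placing the reduction step inside the cross-validated pipeline prevents information from the held-out fold from influencing which features are kept.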

Selection of classifier algorithms
GTB is used as a classifier with the number of iterations set to 1000 and the loss function set as "deviance". The prediction results of the other four classifiers are also provided via five-fold cross-validation, including KNN [34] (number of neighbors = 3) (Table S9), NB [35], SVM [36] (recursive feature elimination as the kernel function), and RF [37] (number of the base decision trees = 1000) (Table S10). The prediction results of KNN, SVM, NB, RF, and GTB on the S. cerevisiae and H. pylori datasets are shown in Table S11 and Figures S3 and S4. We also obtain the ROC and PR curves (Figure 4) and AUROC and AUPRC values for different classifiers (Table S12).
As shown in Figure 4A and B, ROC curves for both the S. cerevisiae and H. pylori datasets show that the GTB classifier outperforms KNN, NB, SVM, and RF. The AUROC values of GTB are 1.16%-24.65% and 0.53%-22.95% higher than those of the other four classifiers on the S. cerevisiae and H. pylori datasets, respectively (Table S12). As shown in Figure 4C and D, the prediction performance of GTB is superior to that of KNN, NB, SVM, and RF. The AUPRC values of GTB are 1.42%-24.32% and 0.22%-24.56% higher than those of the other four classifiers on the S. cerevisiae and H. pylori datasets, respectively (Table S12). These results demonstrate that GTB-PPI can accurately indicate whether a pair of proteins interact with each other within the S. cerevisiae or H. pylori dataset. GTB is an ensemble method using a boosting algorithm that can achieve superior generalization performance over a single learner. In particular, RF achieves worse performance than GTB, because all the base decision trees of RF are treated equally. If the prediction performance of the base classifiers is biased, the final ensemble classifier may produce unreliable and biased predictions. GTB can utilize the steepest descent step algorithm to bridge the gap between the sequence and PPI label information.
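A five-fold comparison of these classifier families can be sketched with scikit-learn as follows. Several choices are assumptions made to keep the example small and runnable: the data are synthetic, the ensembles use 50 trees instead of 1000, NB is taken to be Gaussian NB, and the SVM is left at its default kernel. In recent scikit-learn releases the "deviance" loss is named `log_loss` and is the default, so it is not set explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

N_TREES = 50  # the text uses 1000; reduced here so the sketch runs quickly

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "NB": GaussianNB(),                      # assumed NB variant
    "SVM": SVC(),                            # default kernel, an assumption
    "RF": RandomForestClassifier(n_estimators=N_TREES, random_state=0),
    "GTB": GradientBoostingClassifier(n_estimators=N_TREES, random_state=0),
}

# Synthetic stand-in for the selected pair features (not the paper's data).
X, y = make_classification(n_samples=300, n_features=40,
                           n_informative=8, random_state=0)

mean_acc = {name: float(cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())
            for name, clf in classifiers.items()}
print(mean_acc)
```

The ranking obtained on synthetic data says nothing about the PPI benchmarks; the sketch only shows how such a comparison is organized.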

Comparison of GTB-PPI with other PPI prediction methods
To verify the validity of the GTB-PPI model, we compare GTB-PPI with ACC+SVM [7], DeepPPI [8], and other state-of-the-art methods on the S. cerevisiae and H. pylori datasets. As shown in Table 1, for the S. cerevisiae dataset, compared with other existing methods, the ACC of GTB-PPI increases by 0.14%-9.00%; the recall of GTB-PPI is 0.15% higher than DeepPPI [8] and 1.54% higher than MCD+SVM [10]; the precision of GTB-PPI is 1.32% higher than DeepPPI [8] and 0.81% higher than MIMI+NMBAC+RF [41].

PPI prediction on independent test datasets
The performance of GTB-PPI can also be evaluated using cross-species datasets. After the feature extraction, fusion, and selection, the S. cerevisiae dataset is used as a training set to predict PPIs of four independent test datasets.
(82.22%) [45]. For the M. musculus dataset, the ACC of GTB-PPI (98.08%) is 2.23%-18.21% higher than DeepPPI (91.37%) [8], MIMI+NMBAC+RF (95.85%) [41], MLD+RF (91.96%) [39], and DCT+WSRC (79.87%) [45]. The findings indicate that the hypothesis of mapping PPIs from one species to another species is reasonable. We can conclude that PPIs in one organism might have "co-evolved" with those in other organisms [41].

Table 3 Performance comparison of GTB-PPI with other state-of-the-art predictors on independent datasets

Figure 5 Prediction results of one-core and crossover networks using GTB-PPI. A. The prediction performance of one-core network for CD9. CD9 is the core protein, and the others are satellite proteins. 15 of all 16 PPIs are predicted successfully. B. The prediction performance of crossover network for the Wnt-related signaling pathways. WNT9A, DVL1, AXIN1, and CTNNB are linked in this work, which are of great importance to the Wnt-related signaling pathways. 92 of the 96 PPI pairs are identified. The blue and red lines represent true and false prediction of PPIs, respectively. The two networks are from Ding et al. [41] and Shen et al. [48].

PPI network prediction
The graph visualization of a PPI network provides a broad and informative view for understanding the proteome and analyzing protein functions. We employ GTB-PPI to predict the simple one-core PPI network for CD9 [46] and the crossover PPI network for the Wnt-related signaling pathways [47] using the S. cerevisiae dataset as a training set.
As shown in Figure 5A, only the interaction between CD9 and Collagen-binding protein 2 is not predicted successfully by GTB-PPI, and it was not predicted by Shen et al. [48] either. Compared with Shen et al. [48] and Ding et al. [41], GTB-PPI achieves superior prediction performance. The ACC is 93.75%, which is 12.50% higher than Shen et al. (81.25%) [48] and 6.25% higher than Ding et al.
The palmitoylation of CD9 could support the interaction of CD9 with CD53 [49]. In the one-core network for CD9, the interaction between CD9 and CD53 is predicted successfully by GTB-PPI. In the crossover PPI network for the Wnt-related signaling pathways, ANP32A, CRMP1, and KIAA1377 are linked to the Wnt signaling pathway via PPIs. ANP32A has been demonstrated to be a potential tumor suppressor [50], and GTB-PPI could predict its interactions with the corresponding proteins. However, the interaction between ROCK1 and CRMP1 is not predicted. This is likely because we use the S. cerevisiae dataset as the training set, whereas ROCK1 and CRMP1 are genes from a different organism. In addition, ROCK1 is part of the noncanonical Wnt signaling pathway [47], so GTB-PPI may not be very effective in this case. A previous study has reported that AXIN1 could interact with multiple proteins [51]. Here, we find that GTB-PPI can predict the interactions between AXIN1 and its satellite proteins, which provides new insights into the biological mechanisms of the PPI network.

Conclusion
The knowledge and analysis of PPIs can help us to reveal the structure and function of proteins at the molecular level, including growth, development, metabolism, signal transduction, differentiation, and apoptosis. In this study, a new PPI prediction pipeline called GTB-PPI is presented. First, PseAAC, PsePSSM, RSIV, and AD are concatenated as the initial feature information for predicting PPIs. PseAAC obtains not only the amino acid composition information but also the sequence order information. PsePSSM can mine the evolutionary information and local order information. RSIV can obtain the frequency feature information using the reduced sequence. AD reflects the physicochemical property features of the global amino acid sequence. Second, L1-RLR can obtain effective information features related to PPIs without losing accuracy and generalization. Moreover, the performance of L1-RLR is superior to that of SSDR, PCA, KPCA, FA, mRMR, and CMIM (Figure 3). Finally, the PPIs are predicted based on GTB, whose base classifier is a decision tree and which can bridge the gap between amino acid sequence information features and the class label. Experimental results show that the PPI prediction performance of GTB is better than that of SVM, RF, NB, and KNN. Notably, in the field of binary PPI prediction, L1-RLR is used for dimensionality reduction for the first time, and GTB is employed as a classifier for the first time. In summary, GTB-PPI shows good performance, representation ability, and generalization ability.