Prediction of Protein–Protein Interaction Sites Using Convolutional Neural Network and Improved Data Sets

Protein–protein interaction (PPI) sites play a key role in the formation of protein complexes, which is the basis of a variety of biological processes. Experimental methods to identify PPI sites are expensive and time-consuming, which has led to the development of different kinds of prediction algorithms. We propose a convolutional neural network for PPI site prediction and use residue binding propensity to improve the positive samples. Our method achieves a remarkable area under the curve (AUC) of 0.912 on the improved data set. In addition, it yields much better results on samples with high binding propensity than on randomly selected samples. This suggests that there are considerable false-positive PPI sites among the positive samples defined by the distance between residue atoms.


Introduction
Proteins play key roles in various aspects of life [1] by physically interacting with other proteins [2,3]. Protein-protein interactions (PPIs) are the molecular basis for many biological processes, such as signal transduction, transport, metabolism, gene expression, and the growth and proliferation of cells [4,5]. Protein-binding interfaces are heterogeneous, and some interface residues contribute more to binding than others. These residues are called "hotspots" [6][7][8][9][10]. Hotspots are often pre-organized in the unbound protein state, so it has been suggested that much of the protein surface does not accommodate binding and that the potential binding sites of a protein are already imprinted in its unbound state [8].
Popular PPI site prediction methods can be sorted into three groups according to the information they are based on.

1. Sequence-based methods. Methods based on sequence information use features extracted from protein sequences to predict protein interaction sites. PPiPP [48] uses the position-specific scoring matrix (PSSM) and amino acid composition to predict PPI sites and achieves an area under the receiver operating characteristic (ROC) curve (AUC) of 0.729. DLPred [19], which uses long short-term memory (LSTM) to learn features such as PSSM, physical properties, and hydropathy index, obtains a higher AUC score of 0.811. Still, more information is needed to improve prediction accuracy.
2. Structure-based methods. Knowledge of the three-dimensional (3D) structure of the protein complex provides much valuable information on the protein interaction sites [14]. Some PPI site predictors utilize 3D structural information of proteins for prediction. ProMate combines all the significant interface properties and reaches a success rate of 0.70 [33]. Bradford and Westhead [21] achieved a successful prediction rate of 0.76 based on protein structure data.
3. Methods based on integrated information. Three-dimensional structures of proteins are far more difficult and expensive to elucidate than protein sequences, so the number of entries in protein structure databases such as the Protein Data Bank (PDB) [49] is remarkably smaller than the number of sequences in protein sequence databases like UniProt [50]. Therefore, most methods use a combination of structural and sequence information for the prediction of PPI sites. Li et al. [38] use physicochemical properties, sequence conservation, residue disorder, secondary structure, solvent accessibility, and five 3D structural features to train a random forest model to predict PPI sites. SPPIDER [51] uses relative solvent accessibility (RSA), sequence, and structure features to predict PPI sites and demonstrates that RSA prediction-based fingerprints of protein interactions significantly improve the discrimination between interacting and noninteracting sites. It yields an overall classification accuracy of about 0.74 and a Matthews correlation coefficient (MCC) of 0.42. IntPred [39] uses 11 features of both sequence and structure and obtains a specificity of 0.916 and a sensitivity of 0.411. PAIRpred [24] captures sequence and structure information about residue pairs through pairwise kernels that are used to train a support vector machine classifier. This method gives a remarkable AUC score of 0.870 and rank of the first positive prediction (RFPP) values on the 176 complexes in protein-protein docking benchmark version 4.0 (DBD 4.0) [52] with its structure kernel.
In this study, we propose a novel statistics-based method for judging the binding propensity of amino acids and apply it to the partitioning of samples. We extracted the sequence and structure features of each sample and input them into a convolutional neural network for training. Compared to previous methods, our approach achieves a significant improvement in the AUC score (0.912) and in some of the RFPP values (RFPP (100) = 580) on the 116 dimers in DBD 4.0 [52].

Distribution Tendency of Residues in Proteins
In order to demonstrate the distribution tendency of residues, we first compared the abundance of residues (AR) between the protein surface (AR s ) and whole protein (AR w ) and used AR w /AR s as the indicator of the tendency of a residue to be inside proteins (Table 1, Section 4.3).
From Table 1, we find that most hydrophobic residues (alanine, leucine, isoleucine, valine, glycine, cysteine, phenylalanine, and proline) and amphipathic residues (tryptophan, tyrosine, and methionine) tend to distribute inside proteins, except proline, which tends to be on the protein surface, and glycine, which shows a weak surface tendency. Charged (arginine, lysine, aspartic acid, and glutamic acid) and hydrophilic residues tend to appear on the protein surface, with the exception of histidine, which shows no tendency towards the protein inside or surface. N w : the number of specified amino acids in whole proteins; N s : the number of specified amino acids at the protein surface; AR w : abundance of residues in whole proteins; AR s : abundance of residues at the protein surface. AR w /AR s ≥ 1 (shaded) indicates residues that tend to distribute inside proteins.

Residue Binding Propensity
Protein residues exhibit different binding propensities for different residues in protein-protein interfaces. We used a statistical method (Section 4.4) to classify the residues interacting with one certain residue into high and low binding propensity groups and compared the binding propensity with the polarity, hydrophobicity, and distribution tendency of residues. Table 2 shows the relative abundance of interacting residues (RAIR, Section 4.4), which indicates the binding propensity of each residue with all 20 residues. Data on the abundance of interacting residues (AIR) and AR are available in Tables S1 and S2 in the Supplementary Material, respectively. Polarity, hydrophobicity [45], and the ratio of AR w to AR s from Table 1 are also placed in the same table for comparison.
From Table 2, we find that (1) ten residues (leucine, isoleucine, valine, arginine, histidine, cysteine, methionine, tyrosine, tryptophan, and phenylalanine) show a high propensity to bind to most residues (RAIR scores ≥ 1, shaded), while the other ten residues show a low binding propensity. (2) Most of the residues with high binding propensity overlap with the residues with polarity ≤ 7 (shaded polarity scores), except arginine (polarity = 10.5) and histidine (polarity = 10.4). (3) Residues with positive hydrophobicity (shaded hydrophobicity scores) also exhibit higher binding propensity, but with more exceptions: alanine, glycine, and proline have positive hydrophobicity but low binding propensity, while arginine and histidine have negative hydrophobicity but high binding propensity. (4) Interestingly, residues with AR w /AR s ≥ 1 (shaded AR w /AR s scores) coincide closely with those with high binding propensity, with the only exceptions being alanine (AR w /AR s = 1.22) and arginine (AR w /AR s = 0.95).
Finally, a total of 221 residue pairs with high binding propensity (shaded RAIR scores in rows 2-21 of Table 2) were obtained and used for further screening of positive samples. The contact propensity of each residue pair in a protein-protein interface is shown in Figure S1 in the Supplementary Material.

Positive Samples with High Binding Propensity
In order to verify the effectiveness of the improved positive samples, we used the same model parameters (Section 4.6) to perform leave-one-complex-out cross-validation on two sample data sets, one with high binding propensity and another with no propensity.
According to the definition of interacting residue pairs (Section 4.2) from PAIRPred [24] and PPiPP [48], a total of 12,138 positive and 5,534,983 negative samples from 138 dimers of the DBD 5.0 version [53] were obtained (Section 4.1). Among the positive samples, 6739 residue pairs with binding propensity ≥ 1 were used as final positive samples in this study. There is an average of 49 pairs of positive samples for each dimer.
The test data sets used in this study were extremely imbalanced, so we used AUC as the main measure to evaluate the model performance on the two data sets. The AUC for the data set with no propensity is 0.824, while the AUC for the data set with high propensity reaches 0.912 (Figure 1). We also provide accuracy and recall under different thresholds in Table S3 in the Supplementary Material.


Comparison with Randomly Sampled Data Set
To further verify the rationality of the binding propensity, we conducted five-fold cross-validation to compare the performance of our model on the data set with high binding propensity and a data set randomly sampled (also containing 6739 residue pairs) from the original positive samples, using 138 dimers from DBD 5.0 [53].
There is a significant difference between the AUCs of these two data sets. The AUC for the data set with high propensity is 0.889 ± 0.007 (Figure 2a), while the AUC for the randomly sampled data set is 0.811 ± 0.006 (Figure 2b). It is notable that the result of five-fold cross-validation is close to that of leave-one-complex-out cross-validation (0.912) on the data set with high propensity. This result indicates that our model benefits from the data set screened using a propensity score and identifies PPI sites with more accuracy.

Figure 2. Comparison of AUC scores between the data set with high binding propensity and the randomly sampled data set. (a) ROC curve and AUC score of the high-propensity data set using five-fold cross-validation; (b) ROC curve and AUC score of the randomly sampled data set using five-fold cross-validation. The red dotted line is a control line on which AUC = 0.5.

Comparison with Existing Methods
To further evaluate our model, we compared its performance with those of PSIVER [35], PPiPP [48], SSWRF [54], DLPred [19], and PAIRPred [24] using AUC scores (Table 3). The first four methods used sequence-based features, while PAIRPred and our method used both sequence- and structure-based features. The AUC score of our method is notably higher than those of the other methods. The results also prove the importance of structural features in PPI site prediction.

Table 3.

Methods        AUC (%)
PSIVER [35]    62.8
PPiPP [48]     72.9
SSWRF [54]     72.9
DLPred [19]    81.1
PAIRPred [24]  86.2
Ours           91.2

PAIRPred [24] is currently one of the best-performing methods for predicting protein interaction sites based on sequence and structure features. It uses an SVM with pairwise kernels to predict whether two residues interact with each other. We compared the performance of PAIRPred and our model on 116 dimers from DBD 4.0. The results are shown in Figure 3. We performed leave-one-complex-out cross-validation on positive samples with high propensity and obtained an AUC of 0.912, which is higher than that of PAIRPred (0.862).
We also evaluated our model using the rank of the first positive prediction (RFPP, Section 4.7). Our method performs better at 90% (169) and 100% (580) than PAIRPred (194 and 2861, respectively) for dimers in DBD 4.0, while PAIRPred has better RFPP results at 10%, 25%, 50%, and 75% (Table 4). Our method is significantly improved on RFPP (100) of the complexes, which suggests that our model has better generalization ability.

Discussion
Protein-protein interactions play essential roles in many biological processes. Among the different approaches proposed to predict PPIs, machine learning is the most promising and commonly used. Deep learning is a popular branch of machine learning and has been applied to many fields in recent years. In this paper, we used a convolutional deep learning model and improved data sets for the prediction of PPI sites and obtained an AUC of 0.912, which is better than those of published predictors.
The protein-protein docking benchmark data sets (DBD) have been widely used for the evaluation of PPI prediction methods. The latest version is 5.0 [53]. We emphasized comparison with PAIRpred [24] since it is one of the best-performing PPI predictors. PAIRpred was tested on DBD 4.0 [52], but SPINE X [55], on which PAIRpred depends, is obsolete now, so we cannot calculate the AUC and RFPP of PAIRpred on DBD 5.0. As a result, we compared the results of PAIRpred and our method on 116 dimers in DBD 4.0, and used 138 dimers in DBD 5.0 (including the 116 dimers in DBD 4.0) to further evaluate our model.
Residue interface propensities have been observed in different kinds of protein complexes [25] and have been used to improve the prediction accuracy of PPI sites in different studies [4,21,25,33,56,57]. They are usually used as parameters of predicting models. In this study, we used residue binding propensity to screen positive samples and improved the performance of prediction remarkably. Our method may be a little radical, but the result suggests that it makes sense to reduce the fraction of false positive samples by introducing binding propensity.
It has been found that polar residues are statistically disfavoured in interface sites, with the exception of arginine [25,33,56]. In order to show the correlation between polarity, hydrophobicity, and binding propensity in a more intuitive way, we compared polarity, hydrophobicity [45], and RAIR, and found that besides arginine, another polar residue, histidine, also exhibits a high binding propensity. This coincides with the result of another study [58], which found that histidine was favoured in all types of interactions.
An interesting finding of this study is that residues which tend to be inside proteins (AR w /AR s > 1) have higher binding propensity. This seems strange but makes sense, since most of these residues are hydrophobic, and if they appear at the surface of proteins they tend to interact with the hydrophobic residues on the surface of other proteins. An exception is alanine, whose side chain is just a methyl group, which disfavours interaction with other residues and has been utilized for alanine-scanning mutagenesis analysis [9]. On the contrary, residues with large hydrophobic side chains, such as tryptophan, were found to have a unique role in the folded structure and the binding sites of proteins [59]. Charged residues show high binding propensity for oppositely charged residues. Arginine was found to be the most frequently occurring residue in known protein interaction sites because of its wide radius of action [11]. We also found that arginine exhibited a high binding propensity in our study, although it tended to appear at the protein surface (AR w /AR s = 0.74).
In conclusion, our convolutional deep learning model performs well for the prediction of protein interaction sites, especially on the improved data set with high binding propensity. This suggests that a nonnegligible portion of false positive interacting pairs exists in the original positive samples obtained by the 6 Å definition, which may impede efforts to improve the accuracy of PPI site prediction. Reducing false positive interacting samples is likely to become a promising direction for PPI site prediction studies.

Data Sets
The protein-protein docking benchmark data set (DBD, version 5.0) [53] and DBD 4.0 [52] were used in this work. DBD 5.0 contains 139 non-redundant dimers with characterized bound and unbound X-ray crystallography structures. DBD 4.0 [52] contains 174 complexes, among which 116 are dimers and form a subset of DBD 5.0. The two interacting protein chains of a dimer are from different families defined by the Structural Classification of Proteins (SCOP), with sequence identity less than 30% [53]. There are a few deletions in the sequence of 1ZLI in the unbound state, so it was excluded from the data sets. Finally, 174 complexes from DBD 4.0 were used for computing residue distribution tendency and binding propensity statistics, 116 dimers from DBD 4.0 were used for model comparison, and 138 dimers from DBD 5.0 were used to further validate our model (Tables 5 and 6).

Definition of Interacting Residue Pairs
A pair of residues from two proteins is considered to interact if the Euclidean distance between any two atoms from each of the two residues in the bound state is less than or equal to 6 Å [24,48]. According to this definition, 12,138 positive samples (interacting residue pairs) and 5,522,852 negative samples (non-interacting residue pairs) were obtained; each dimer has an average of 88 positive samples and 40,006 negative samples. Contacting residues within a protein chain were not included.
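As a rough sketch of this definition, the following helper (our own illustration, not taken from the paper's code base) finds all interacting residue pairs between two chains, assuming each residue is given as a numpy array of its atom coordinates:

```python
import numpy as np

def interacting_pairs(chain_a, chain_b, cutoff=6.0):
    """Return index pairs (i, j) of residues from two chains whose minimum
    inter-atomic Euclidean distance is <= cutoff (in angstroms).

    chain_a, chain_b: lists of (n_atoms, 3) numpy arrays, one per residue.
    """
    pairs = []
    for i, atoms_i in enumerate(chain_a):
        for j, atoms_j in enumerate(chain_b):
            # Pairwise distances between every atom of residue i and residue j.
            diff = atoms_i[:, None, :] - atoms_j[None, :, :]
            dists = np.sqrt((diff ** 2).sum(axis=-1))
            if dists.min() <= cutoff:
                pairs.append((i, j))
    return pairs
```

Residues whose closest atoms are farther apart than the cutoff are treated as non-interacting (negative samples).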
The number of negative samples in this study is much larger than that of positive samples. These imbalanced data make it difficult to train the model, and simple under-sampling might lead to information loss, so we used the EasyEnsemble algorithm [60] to build training sets with equal numbers of positive and negative samples.
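The balancing idea can be sketched as below. This is a simplified illustration of EasyEnsemble-style subset construction (several independent balanced subsets of the majority class), not the full algorithm of ref. [60], which additionally trains and combines one classifier per subset:

```python
import random

def easy_ensemble_subsets(positives, negatives, n_subsets=10, seed=0):
    """Draw several independent random subsets of the (much larger)
    negative class, each the same size as the positive class, so that
    every subset yields a balanced training set without discarding the
    rest of the negatives for good."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        sampled_neg = rng.sample(negatives, len(positives))
        subsets.append(positives + sampled_neg)
    return subsets
```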

Distribution Tendency of Residues in Proteins
Residues show different preferences for locations in proteins. We calculated the abundance of residues (AR) at the protein surface (AR s ) and in whole proteins (AR w ) using 174 complexes from DBD 4.0 (Section 4.1) and used AR w /AR s as the indicator of a residue's tendency to be inside or at the surface of proteins. If AR w /AR s for a residue is larger than 1, it tends to be inside proteins. Otherwise, if AR w /AR s < 1, it appears more often at the surface of proteins.
DSSP [61] from xssp [62] was used to calculate the solvent-accessible surface area (ASA) of a residue. If the ratio of ASA of a residue to its maximum ASA (Table 7) [63,64] is larger than or equal to 0.16 [28,63], then it is defined as a surface residue.
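The surface-residue test and the AR w /AR s ratio can be sketched as follows. The MAX_ASA values shown are illustrative placeholders for the paper's Table 7, and the function names are our own:

```python
from collections import Counter

# Maximum ASA values in A^2; illustrative numbers only -- the paper takes
# its actual values from refs [63,64] (Table 7).
MAX_ASA = {"ALA": 106.0, "ARG": 248.0, "GLY": 84.0}

def is_surface(residue_asa, residue_name, threshold=0.16):
    """A residue counts as a surface residue when its relative ASA
    (the DSSP ASA divided by the residue's maximum ASA) is >= 0.16."""
    return residue_asa / MAX_ASA[residue_name] >= threshold

def ar_ratio(all_residues, surface_residues):
    """AR_w / AR_s per residue type: abundance in whole proteins divided
    by abundance on the surface. A ratio >= 1 indicates a tendency to
    distribute inside proteins."""
    ar_w = Counter(all_residues)
    ar_s = Counter(surface_residues)
    n_w, n_s = len(all_residues), len(surface_residues)
    return {aa: (ar_w[aa] / n_w) / (ar_s[aa] / n_s)
            for aa in ar_w if ar_s.get(aa)}
```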

Binding Propensity of Residue Pairs
The above definition of interacting residues is concise and easy to use, but the binding propensity between residues varies considerably. Residues with strong binding propensity may dominate the interaction and drag their adjacent residues with weak binding propensity into the defined range of interacting residues. In this situation, the dominant interacting residues (DIRs) are true positive samples, while it is more reasonable to classify the passive interacting residues (PIRs) as false positive samples (Figure 4).

Figure 4. Due to the strong interaction of D and R, the Euclidean distance between G and P or A and N is less than 6.0 Å, but the interaction is dominated by D and R.
The binding propensity between different residues can be computed based on interacting residue frequencies [56]. In this study, a statistical method was used to classify residues interacting with one certain residue into high and low binding propensity residue groups. Residues of 174 protein complexes from DBD 4.0 were used to estimate the residue binding propensity.
Relative abundance of interacting residues (RAIR) was used to indicate the binding propensity of each residue pair. Abundance of residues (AR) represents the frequency of each residue (20 in total) among the total number of surface residues of the 174 protein complexes from DBD 4.0. The abundance of interacting residues (AIR) represents the frequency at which each residue interacts with the 20 residues (400 pairs in total). The RAIR between residue i and residue j is defined as follows:

RAIR ij = (M ij / M i ) / (N i / N)

where N is the total number of all surface residues of the 174 protein complexes, N i is the number of residue i, M ij is the number of residue j interacting with residue i, and M i is the total number of all residues interacting with residue i. RAIR is used for further classification of samples in this study. A pair of residues is considered to have a high binding propensity when RAIR ≥ 1; otherwise, the pair has a low binding propensity.
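A minimal sketch of the RAIR computation, following the AIR and AR definitions in the text, with RAIR ij = (M ij / M i ) / (N i / N) as our reading of the formula (the original equation is garbled in this copy); the data layout and function name are assumptions:

```python
def rair(M, counts, n_surface):
    """Relative abundance of interacting residues.

    M[i][j]  : number of residues of type j interacting with type i (M_ij)
    counts[i]: number of surface residues of type i (N_i)
    n_surface: total number of surface residues (N)
    """
    out = {}
    for i, row in M.items():
        m_i = sum(row.values())          # M_i: all partners of residue i
        ar_i = counts[i] / n_surface     # AR: background frequency
        for j, m_ij in row.items():
            air_ij = m_ij / m_i          # AIR: partner frequency
            out[(i, j)] = air_ij / ar_i
    return out
```

Pairs with a score ≥ 1 would then be kept as high-binding-propensity positive samples.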

Amino Acid Encoding
The twenty amino acids were encoded with one-hot encoding [65] (Table S4 in the Supplementary Material).
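One-hot encoding of the 20 amino acids can be sketched as below; the alphabetical ordering of one-letter codes is our assumption, since the paper's actual ordering is fixed in its Table S4:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes

def one_hot(aa):
    """Encode a residue as a 20-dimensional one-hot vector."""
    vec = [0] * 20
    vec[AMINO_ACIDS.index(aa)] = 1
    return vec
```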

Profile Features
The position-specific scoring matrix (PSSM) and position-specific frequency matrix (PSFM) reflect the conservation of residues at specific positions of protein chains based on evolutionary information [3,24]. Each row of the PSSM or PSFM is a 20-dimensional vector. The PSSM and PSFM were computed by running 3 iterations of PSI-BLAST [66] against the NCBI NR database for a given protein, with the E-value set to 0.001. PSSM and PSFM columns were taken within a length-3 window centered at a residue of the protein to obtain a 3 × 40 matrix.
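Extracting the 3 × 40 profile window might look like this sketch; zero-padding at chain ends is our assumption, as the paper does not state how termini are handled:

```python
import numpy as np

def profile_window(pssm, psfm, idx, half=1):
    """Take a window of length 3 (half=1 on each side) from the PSSM and
    PSFM, centred at residue `idx`, giving a 3 x 40 feature matrix.
    Rows falling outside the chain are zero-padded (an assumption)."""
    n = pssm.shape[0]
    rows = []
    for k in range(idx - half, idx + half + 1):
        if 0 <= k < n:
            rows.append(np.concatenate([pssm[k], psfm[k]]))  # 20 + 20 = 40
        else:
            rows.append(np.zeros(40))
    return np.stack(rows)  # shape (3, 40)
```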

Structure Features
Residues that play important roles in protein function generally appear at the surface of proteins. The accessible surface area (ASA) and the relative accessible surface area (RASA) were used to identify whether a residue is at the surface of a protein. The geometric properties of the protein surface can affect the interaction between proteins [38,68]. Protrusion index (CX) and depth index (DPX) were used to describe these properties. We also used hydrophobicity, which plays an important role in PPIs, protein folding and unfolding [69,70]. These five structure-based features were computed using PSAIA [71], which was developed for calculation of the geometric parameters of large protein structures and the prediction of protein interaction sites.

Deep Learning Model
The structure of the deep learning model used in this paper is shown in Figure 5. The source code of this study is available at https://github.com/Xiaoya-Deng/PPI-sites-prediction. Whether two residues interact is a binary classification problem. In our model, each sample is represented by ((r i , l i ), y i ), where (r i , l i ) represents a pair of residues and y i is the corresponding label: y i = 1 if the two residues interact, otherwise y i = 0.
Each residue pair is encoded as a 2 × 217 × 1-dimensional vector for the input of the network.

Convolutional Layers
Three convolutional layers are used in this paper. The filters of the first and the second convolutional layers have the size of 3 × 3 with depth of 32 and 64, respectively. The third convolutional layer filter has a size of 1 × 3 and a depth of 128. All three layers use stride 2 and no zero padding.

Pooling Layers
There is a pooling layer after each convolutional layer. All the pooling layers use max pooling with filter size of 2 × 2. The strides for both directions are set to 2.

Fully Connected Layer
The output of the last pooling layer is expanded into a 1-dimensional vector and used as the input of a fully connected neural network with 1024 neurons. Finally, a Softmax function is used as the classifier.

Activation Function and Loss Function
Softplus and Softmax are used as the activation and loss functions, respectively, in our model.
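The two functions can be written out numerically as a sketch, using numpy rather than the paper's TensorFlow implementation; the softmax loss corresponds to Eq. (6):

```python
import numpy as np

def softplus(x):
    """Softplus activation: log(1 + e^x), computed stably via logaddexp."""
    return np.logaddexp(0.0, x)

def softmax_loss(logits, true_class):
    """Cross-entropy of Eq. (6): L(x_i) = -log(exp(x_i) / sum_j exp(x_j)),
    where x_i is the logit of the true class."""
    shifted = logits - logits.max()                     # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[true_class]
```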

L(x i ) = −log(exp(x i ) / Σ j exp(x j )) (6)

Model Optimization
The AdamOptimizer in TensorFlow was used for training optimization. The dropout method and the decaying learning rate method were used to prevent over-fitting during training.

Performance Measure
A leave-one-complex-out cross-validation method [24] was used to evaluate our model. All positive and negative samples of one protein complex were chosen as the imbalanced validation set, while the samples of the other complexes were used as the training set, which consists of balanced positive and negative samples.
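The leave-one-complex-out split can be sketched as below, assuming samples are grouped by complex identifier:

```python
def loco_splits(samples):
    """Leave-one-complex-out cross-validation: each fold holds out all
    samples of one complex for validation and trains on the rest.

    samples: dict mapping complex id -> list of samples.
    Yields (train, validation) pairs.
    """
    for held_out in samples:
        validation = samples[held_out]
        train = [s for cid, lst in samples.items()
                 if cid != held_out for s in lst]
        yield train, validation
```

Balancing of the training side (Section 4.2) would be applied to each fold's training set separately.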
The model returns a probability value between 0 and 1 for each sample, while the class label is binary (1 for interaction and 0 for non-interaction). At a given probability threshold, any correctly predicted pair of interacting residues is counted as a true positive (N TP ), and any correctly predicted pair of non-interacting residues as a true negative (N TN ), while false positives (N FP ) and false negatives (N FN ) are pairs of residues that are incorrectly predicted to be positive or negative, respectively.
Accuracy, recall, and precision can be used for the evaluation of machine learning models, but the balance between these parameters varies with the threshold. Two other evaluation methods are more commonly used: (1) the area under the ROC curve (AUC), where the ROC curve plots recall against (1 − specificity) over the entire threshold range; and (2) accuracy, recall, and F1 at a set of optimal thresholds. In this study, we use AUC as the main performance metric.
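AUC can be computed directly from its rank interpretation; a minimal sketch (for large sample sets one would use an optimized library routine instead of this quadratic loop):

```python
def auc_score(labels, scores):
    """AUC as the probability that a randomly chosen positive sample is
    scored higher than a randomly chosen negative one (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```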
For imbalanced data, AUC may give a false impression of accuracy [24], so we employed another measure proposed in PAIRpred [24], the rank of the first positive prediction (RFPP). RFPP is defined as RFPP(p) = q, which indicates that p% of the dimers tested have at least one true positive interacting residue pair among their top q predictions [24].
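A sketch of computing RFPP(p) from the per-dimer rank of the first true positive; the function name and the ceiling convention for selecting the p-th percentile are our assumptions:

```python
def rfpp(first_positive_ranks, p):
    """RFPP(p) = q: p% of the tested dimers have at least one true
    positive pair within their top q predictions.

    first_positive_ranks: one value per dimer, the 1-based rank of the
    highest-scoring true positive pair for that dimer.
    """
    ranks = sorted(first_positive_ranks)
    k = max(1, -(-len(ranks) * p // 100))   # ceil(len * p / 100)
    return ranks[k - 1]
```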

Validation on Randomly Sampled Data
In order to further validate the rationality of the propensity score, we compared the results on the data set with high propensity to those obtained using randomly sampled interacting pairs from the original positive samples. The same numbers of positive and negative samples (6739) were randomly selected from the 12,138 original positive and 5,522,852 negative samples, respectively, using EasyEnsemble [60]. We used five-fold cross-validation to compare the AUCs of this random sample to those of the data set with high propensity to see whether there is a difference between them.