A novel matrix of sequence descriptors for predicting protein-protein interactions from amino acid sequences

Protein-protein interactions (PPIs) play an important role in the life activities of organisms. With the availability of large amounts of protein sequence data, PPI prediction methods have attracted increasing attention. A variety of protein sequence coding methods have emerged, but training models on these encodings is particularly time consuming. To address this issue, we propose a novel matrix sequence coding method. Based on a deep neural network (DNN) and a novel matrix protein sequence descriptor, we constructed a model for predicting PPIs. When applied to human PPIs data, the method achieved an accuracy of 94.34%, a recall of 98.28%, an area under the curve (AUC) of 97.79% and a loss of 23.25%. The model was also evaluated on a non-redundant dataset, where the prediction accuracy was 88.29%. These results indicate that the matrix of sequence (MOS) descriptor can enhance the predictive power of PPIs and reduce training time, which can be a useful complement for future proteomics research. The experimental code and experimental results can be found at https://github.com/smalltalkman/hppi-tensorflow.


Introduction
Protein-protein interactions (PPIs) are useful for elucidating the changing mechanisms of organisms in physiological or pathological conditions and are important for disease prevention and drug development. In the last decade, numerous experimental methods for studying protein-protein interactions, such as yeast two-hybrid screens [1], hybrid approaches [2] and protein chips [3], have emerged. However, all of these experimental methods are time-consuming and costly. Therefore, using computational approaches to predict unknown PPIs has become an important research topic in bioinformatics. In recent years, many computational methods have been proposed to predict PPIs based on phylogenetic profiles [4], amino acid index distribution [5] and gene fusion events [6,7]. However, these methods are not universal because their reliability depends on a priori information about the protein pairs. Meanwhile, a large amount of protein sequence information has accumulated, and sequence-based computational methods have become more universal and widely accepted [8][9][10][11][12][13][14][15][16][17][18], such as support vector machines (SVM) [8][9][10], Naïve Bayes [11,12], decision trees [13][14], random forests [15][16], and deep learning [17][18]. These studies show that the accuracy of PPI prediction depends not only on the machine learning method but also on the protein coding method. Protein coding methods and classification algorithms are the core steps of PPI prediction and have become primary tasks of current life science research. Many efficient protein coding methods have been proposed for inferring PPIs from protein sequences, such as the conjoint triad method (CT) [19], the auto covariance method (AC) [20] and the local descriptor (LD) [21]. Among them, the conjoint triad method (CT) [19] considers the order relationship of three amino acids.
In this protein coding method, the 20 amino acids are clustered into seven classes according to the dipoles and volumes of the side chains. Auto covariance (AC) [20] considers the order relationship of 30 amino acids. Local descriptor (LD) is an alignment-free approach whose effectiveness depends largely on the underlying amino acid groups, and it only considers the neighbouring effect of two adjacent types of amino acids [21]. Although the protein coding methods described above are useful, one drawback is that the order relationship of the entire amino acid sequence is not considered. To overcome this problem, we propose a sequence-based method built on a novel representation, the matrix of sequence (MOS). The 20 amino acids are first grouped into 7 classes, following the classification used successfully by Shen et al. [19]. Then, we combine this classification with a novel representation of protein sequence descriptors. Next, we constructed a DNN-MOS (deep neural network-matrix of sequence) model by combining the DNN and MOS. Finally, we evaluated the performance of the DNN-MOS protein prediction model. When applied to human data, our method had an accuracy of 94.34%, a recall of 98.28%, an area under the curve (AUC) of 97.79% and a loss of 23.25%. To assess the effectiveness of MOS, we compared MOS with existing protein coding methods. We found that MOS can greatly reduce the loss and training time, and the prediction performance is improved. Additionally, we found that MOS achieves better performance in other classifiers such as decision tree, K-Neighbors and random forest.

Data set construction
(1) Benchmark dataset: The benchmark PPIs dataset used in our experiment was provided by Pan et al. [22]. In this benchmark dataset, the positive samples were taken from the Human Protein Reference Database (HPRD), 2007 version, and the negative samples were taken from the Swiss-Prot database, version 57.3. The positive samples have usually been verified by reliable methods [23][24]. The negative samples (non-interacting pairs of proteins) were generated by pairing proteins found in different subcellular locations, according to the following requirements [19,25]: (1) the non-interacting pairs cannot appear in the interacting dataset; (2) sequences annotated with ambiguous or uncertain subcellular location terms were excluded from the negative samples; (3) sequences annotated with two or more locations were excluded because their location is not unique. After removing the self-interactions and duplicate interactions from the positive dataset, we obtained 36,630 positive pairs and 36,480 negative pairs. Protein pairs containing unusual amino acids (such as B, J, O, U, X and Z) or sequences of fewer than 50 amino acids were excluded, yielding 36,591 positive samples and 36,324 negative samples that form the benchmark dataset. We mixed the positive and negative samples in the benchmark dataset and randomly selected 60,000 pairs (30,000 positive samples and 30,000 negative samples) as the training dataset, with the remainder serving as a hold-out test set to validate the model.
(2) Non-redundant dataset: This dataset was also provided by Pan et al. [22]. It is derived from the benchmark dataset by excluding proteins with ≥25% sequence identity. This dataset contains 3,899 positive protein pairs and 4,262 negative protein pairs.
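The sequence-level exclusion rule used for the benchmark dataset (no unusual amino acids, at least 50 residues) can be illustrated with a minimal sketch. The function names `is_usable` and `filter_pairs` are hypothetical, and the sketch assumes that a pair is dropped when either of its two sequences fails the check; it is not taken from the authors' code.

```python
# Standard 20 amino acids; any other letter (B, J, O, U, X, Z, ...) is "unusual".
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 50  # sequences with fewer residues are excluded


def is_usable(seq: str) -> bool:
    """Return True if the sequence is long enough and uses only standard residues."""
    return len(seq) >= MIN_LENGTH and set(seq) <= STANDARD_AA


def filter_pairs(pairs):
    """Keep only pairs in which both sequences pass the check (assumed rule)."""
    return [(a, b) for a, b in pairs if is_usable(a) and is_usable(b)]


good = "ACDEFGHIKLMNPQRSTVWY" * 3      # 60 standard residues
short = "ACDEFG"                        # too short
odd = "ACDEFGHIKLMNPQRSTVWY" * 3 + "X"  # contains an unusual residue
kept = filter_pairs([(good, good), (good, short), (odd, good)])
print(len(kept))  # 1
```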

Matrix of sequence (MOS)
Classification of amino acids. According to Shen et al. [19], 20 amino acids can be divided into seven different groups based on their dipole and side chain volumes. The seven different amino acid classifications are shown in Table 1. Then, a protein sequence is represented by these seven groups according to Table 1. For example, the protein sequence "AGCRQTSPLGVKSE" would be represented as "11754332211536".
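The encoding step can be sketched as follows. Table 1 is not reproduced in this text, so the grouping below is an assumption: it is the seven-class grouping commonly associated with the conjoint triad method, and it reproduces the worked example above. The names `CLASSES` and `encode` are illustrative only.

```python
# Assumed seven-class grouping (dipole / side-chain volume); stands in for Table 1.
CLASSES = {
    1: "AGV",
    2: "ILFP",
    3: "YMTS",
    4: "HNQW",
    5: "RK",
    6: "DE",
    7: "C",
}
# Invert the table: residue letter -> class number.
AA_TO_CLASS = {aa: cls for cls, members in CLASSES.items() for aa in members}


def encode(seq: str) -> str:
    """Replace every residue by the digit of its class."""
    return "".join(str(AA_TO_CLASS[aa]) for aa in seq)


print(encode("AGCRQTSPLGVKSE"))  # 11754332211536
```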
Algorithm of sequence matrix. Let $\Omega = \{w_1, \ldots, w_N\}$ be a non-empty finite set, where N is the number of categories of the sequence. Given a sequence $S = S_1, S_2, \ldots, S_L$, where L represents the length of sequence S and $S_i \in \Omega$, $1 \le i \le L$, the sequence matrix of S can be expressed as an $N \times N$ matrix with elements $m_{ij}$. Based on the definition of the sequence matrix, the sum of all elements in the sequence matrix is equal to $\frac{L(L+1)}{2}$:
$$\sum_{i=1}^{N} \sum_{j=1}^{N} m_{ij} = \frac{L(L+1)}{2}$$
Thus, for any two sequences, when the sequence lengths are different, or the sequence lengths are the same but at least one category occurs a different number of times, the corresponding sequence matrices are different.
Protein feature representation. In this article, we present a novel method of protein feature representation by combining sequence matrix descriptors with the amino acid classification method. To reduce the dimension of the feature vector, we first classify the 20 amino acids into 7 classes according to the amino acid classification method in Table 1. Thus, a protein sequence can be represented by a 7×7 matrix, as shown in Eq 2.
The next step is to scale each matrix element $m_{ij}$ to the range from 0 to 1. To this end, we defined a new parameter $p_{ij}$ by normalizing $m_{ij}$ with Eq 3, where L is the length of the protein sequence. The value of $p_{ij}$ for each protein ranges from 0 to 1. The elements on the diagonal of the matrix and the elements above the diagonal are combined into a 28-dimensional vector. To distinguish the lengths of the sequences, a sequence tag is added, represented by the reciprocal of the length of the protein. Finally, a 29-dimensional vector is built to represent each protein sequence.
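A minimal sketch of the 29-dimensional feature construction is given below. The defining equations of the sequence matrix and of the normalization are not reproduced in this text, so two assumptions are made and should be checked against the original equations: (a) $m_{ij}$ counts the position pairs $(a, b)$ with $a \le b$ in which position $a$ belongs to class $i$ and position $b$ to class $j$, which is consistent with the stated element sum of $L(L+1)/2$; (b) $p_{ij} = m_{ij} / (L(L+1)/2)$. The function names are hypothetical.

```python
def sequence_matrix(encoded, n_classes=7):
    """Assumed definition: m[i][j] counts position pairs (a, b) with a <= b,
    where position a is in class i+1 and position b is in class j+1."""
    m = [[0] * n_classes for _ in range(n_classes)]
    seen = [0] * n_classes  # class counts over positions up to and including b
    for ch in encoded:
        j = int(ch) - 1
        seen[j] += 1
        for i in range(n_classes):
            m[i][j] += seen[i]
    return m


def mos_vector(encoded, n_classes=7):
    """Upper triangle (with diagonal) of the normalized matrix plus 1/L."""
    length = len(encoded)
    total = length * (length + 1) / 2  # assumed normalizer
    m = sequence_matrix(encoded, n_classes)
    upper = [m[i][j] / total for i in range(n_classes) for j in range(i, n_classes)]
    return upper + [1.0 / length]  # sequence tag: reciprocal of the length


encoded = "11754332211536"  # class-encoded example sequence, L = 14
matrix = sequence_matrix(encoded)
features = mos_vector(encoded)
print(sum(map(sum, matrix)), len(features))  # 105 29
```

Under this assumed definition, $m_{ij} + m_{ji}$ for $i \ne j$ equals the product of the two class counts, and the diagonal determines each class count, so the diagonal and upper triangle already carry the information of the full matrix.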

Deep neural network (DNN)
A deep neural network is a popular type of deep learning algorithm with three or more hidden layers. The basic structure of a deep neural network is similar to that of a shallow neural network and consists of an input layer, middle hidden layers, and an output layer. However, deep neural networks have more parameters, calculation units and algorithms than traditional shallow neural networks. As shown in Fig 1, input data (x) are given to the input layer, processed layer by layer through the hidden layers, and then transmitted to the output layer. The weights $w^{(l)}$ between neurons are free parameters that capture the model's representation of the data and are learned from input/output samples. Each neuron computes a weighted sum of its inputs and applies a nonlinear activation function to calculate its outputs. The formulation of input data in forward propagation is calculated according to Eq 1:
$$a^{(l+1)} = \delta\left(w^{(l+1)} a^{(l)} + b^{(l+1)}\right) \quad (1)$$
where $a^{(l+1)}$ is the input data of the (l+1)-th layer, $\delta$ denotes the activation of the (l+1)-th layer, $w^{(l+1)}$ is the connection weight matrix between the (l)-th layer and the (l+1)-th layer, $a^{(l)}$ is the input data of the (l)-th layer, and $b^{(l+1)}$ is the bias term in the (l+1)-th layer. Back propagation passes the output error backward from the output layer through the hidden layers toward the input layer, distributing the error to all of the units of each layer to obtain the error signal of each layer. In general, ReLU (rectified linear unit) is used as the activation function for neurons in DNN. ReLU sets all negative values to zero while leaving the positive values unchanged. Compared to other activation functions, ReLU has a few advantages [26,27]. Compared with linear functions, ReLU is more expressive, especially in deep networks. Compared with other non-linear functions such as sigmoid and tanh, ReLU is less affected by vanishing gradients and can thereby keep the convergence speed of the model at a stable level.
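The layer-by-layer forward computation, in which each layer applies a weighted sum, a bias and the ReLU activation, can be illustrated with a small pure-Python sketch. The weights below are arbitrary illustrative numbers, not trained parameters, and the sketch applies ReLU at every layer for simplicity; a real classifier would usually end with a softmax or sigmoid output.

```python
def relu(values):
    """Set negative entries to zero and keep positive entries unchanged."""
    return [max(0.0, v) for v in values]


def dense(weights, bias, inputs):
    """Weighted sum plus bias; `weights` holds one row per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def forward(layers, x):
    """Apply a = relu(W a + b) layer by layer."""
    a = x
    for weights, bias in layers:
        a = relu(dense(weights, bias, a))
    return a


# Two tiny layers with made-up parameters: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 2.0]], [0.5]),
]
print(forward(layers, [2.0, 1.0]))  # [2.5]
```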

Evaluation measure
The performance of the models was evaluated by a series of evaluation indicators, including the accuracy, recall, AUC and loss in this study. Accuracy and recall are defined, respectively, by:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
where TP, TN, FP, and FN represent the true positive, true negative, false positive, and false negative, respectively. AUC was calculated using an open source code [28]. Loss was calculated according to the following cross-entropy function:
$$\text{Loss} = -\sum_{i=1}^{n} y^{*}_{i} \log y_{i}$$
where $y = (y_1, y_2, y_3, \ldots, y_n)$ represents the actual output and $y^{*} = (y^{*}_1, y^{*}_2, y^{*}_3, \ldots, y^{*}_n)$ represents the desired output.
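As a worked illustration of these indicators, the sketch below computes accuracy and recall from confusion counts and evaluates a cross-entropy of the standard form (the negative sum of desired output times the logarithm of actual output). The labels and probabilities are invented for the example, and the helper names are hypothetical; the small `eps` clamp is an added safeguard against taking the logarithm of zero.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = interacting, 0 = not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    return tp / (tp + fn)


def cross_entropy(desired, actual, eps=1e-12):
    """-sum(desired_i * log(actual_i)); eps avoids log(0)."""
    return -sum(d * math.log(max(a, eps)) for d, a in zip(desired, actual))


tp, tn, fp, fn = confusion_counts([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
print(accuracy(tp, tn, fp, fn))                # 0.6
print(round(recall(tp, fn), 4))                # 0.6667
print(round(cross_entropy([0.0, 1.0], [0.2, 0.8]), 4))  # 0.2231
```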

Selecting optimal parameters
Selecting optimal parameters is an important step in the model training process and one of the key elements in training a robust model. In this experiment, ReLU was selected as the activation function, Adam as the optimizer, and cross entropy as the cost function. Compared with the sigmoid and tanh activation functions, ReLU is simple to compute, produces sparse activations, and converges faster during gradient descent. Because of these advantages, ReLU was used as the activation function for this model [27]. Adam combines the advantages of the RMSprop and Adagrad algorithms and handles noisy gradients well, which led us to choose it as the optimizer [29,30]. The cross-entropy cost function measures the discrepancy between the predicted and actual values in a deep neural network, and it compensates for the slow learning caused by the easy saturation of the sigmoid function, so that training converges faster. We therefore used the cross-entropy cost function in this experiment.
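To make the optimizer choice concrete, the following sketch applies the standard Adam update rule to a one-dimensional toy problem. It is an illustration of the algorithm only, not the training code used in this study; the decay rates `beta1`, `beta2` and `eps` are the commonly used default values and are assumptions here, as is the toy objective.

```python
import math


def adam_minimize(grad, x0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Minimize a scalar function given its gradient, using Adam updates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g      # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 (x - 3); the minimum is at x = 3.
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(abs(x_min - 3.0) < 0.05)  # True
```

The first step moves the parameter by approximately `lr` regardless of the gradient magnitude, which is one reason Adam is less sensitive to gradient scale than plain gradient descent.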
In our method, the three parameters of learning rate, network width and network depth must be determined. To determine the learning rate, the number of hidden layer nodes is set to 64, the activation function is ReLU, the optimization algorithm is Adam, the batch size is 128, the dropout is 0, and the number of iterations is 300,000. The results of adjusting the learning rate are shown in Table 2. From Table 2, we found that when the learning rate is 0.01, it has the best predictive performance in the context of PPIs prediction. Therefore, here, 0.01 was chosen as the learning rate for our experiment.
To determine the network width of the model, the learning rate is set to 0.01, the other parameters are unchanged, and the network width results of our model are shown in Table 3. According to Table 3, when the network width is 512, the performance of the model is better than that of several other network widths. Therefore, the network width of this model is set to 512.
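The tuning procedure described so far fixes all other settings, selects the best learning rate, and then carries that value forward while selecting the network width. A minimal sketch of this one-parameter-at-a-time search follows. The candidate lists, the starting learning rate and the `toy_score` function are hypothetical: `toy_score` merely stands in for training the network and reading off a validation metric, is constructed to peak at the values reported in the text, and does not reproduce Table 2 or Table 3.

```python
import math


def tune_one_at_a_time(evaluate, base_config, search_space):
    """Tune parameters in order, keeping each chosen value for later steps."""
    config = dict(base_config)
    for name, candidates in search_space:
        config[name] = max(candidates, key=lambda v: evaluate({**config, name: v}))
    return config


def toy_score(config):
    """Hypothetical stand-in for 'train the DNN and return a validation score'."""
    lr_penalty = abs(math.log10(config["learning_rate"]) + 2.0)
    width_penalty = abs(config["width"] - 512) / 512
    return -(lr_penalty + width_penalty)


base = {"learning_rate": 0.001, "width": 64, "batch_size": 128, "dropout": 0.0}
space = [
    ("learning_rate", [0.1, 0.01, 0.001, 0.0001]),  # hypothetical candidates
    ("width", [64, 128, 256, 512, 1024]),           # hypothetical candidates
]
best = tune_one_at_a_time(toy_score, base, space)
print(best["learning_rate"], best["width"])  # 0.01 512
```

Tuning one parameter at a time needs far fewer training runs than a full grid, at the cost of ignoring interactions between parameters.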
After the learning rate and network width are determined, the next step is to determine the depth of the model. The depth adjustment results are shown in Table 4. From Table 4, the

Performance of MOS on PPIs
Results on benchmark dataset. The proposed DNN-MOS model (protein sequences coded by MOS descriptors) was applied to the human dataset. To investigate the contribution of the novel MOS descriptor, we separately trained DNN based on CT, AC, LD, and MOS. Among them, the parameter settings of DNN-MOS are shown in 3.1. The parameters of DNN-CT (deep neural network-conjoint triad), DNN-AC (deep neural network-auto covariance) and DNN-LD (deep neural network-local descriptor) are set as follows: the activation function was ReLU, the optimization algorithm was Adam, the batch size was 128, the dropout was 0, the number of hidden layer nodes was set to 256, the network depth was [256-256-256], the learning rate was 0.001, the number of times to repeat the hold-out-validation was 30 and the number of times was 10,000 per iteration.
The results of each prediction model are shown in Table 5. From Table 5, we can observe that the predictive performance using MOS is not superior to that of the other descriptors for almost all evaluation metrics. The accuracy and recall of DNN-MOS are 94.34% and 98.28%, lower than those of DNN-CT and DNN-AC. The AUC of DNN-MOS is slightly higher than that of DNN-LD and lower than those of DNN-CT and DNN-AC. However, the loss of DNN-MOS is markedly lower than that of the other three encoding methods.
The training time is related to parameters such as the width and depth of the model. To compare the training time of each encoding, we used the same parameters for all models. The parameters of DNN-MOS, DNN-CT, DNN-AC and DNN-LD were set as follows: the number of hidden layer nodes was set to 64, the activation function was ReLU, the optimization algorithm was Adam, the batch size was 128, the dropout was 0, the learning rate was 0.001, the number of times to repeat the hold-out-validation was 30 and the number of times was 10,000 per iteration. The training times are shown in Table 6. As shown in Table 6, DNN-MOS has the lowest training time per 1000 steps, only 0.1261 seconds. Training DNN-MOS is nearly 2 times faster than training DNN-AC, more than 2 times faster than training DNN-CT and more than 3 times faster than training DNN-LD. The differences in test time were small, but the test time followed the same trend as the training time. Therefore, MOS can save both training time and test time. Table 6 also shows that the larger the vector dimension, the more training time was required.
Results on non-redundant dataset. To further assess the practical prediction ability of DNN-MOS, we trained the DNN-MOS, DNN-CT and DNN-AC models on a non-redundant dataset (removing the samples that have ≥25% sequence identity to any sample in the pre-training set). The prediction results are shown in Table 7. From Table 7, we can observe that the accuracies of DNN-MOS, DNN-CT and DNN-AC on the non-redundant dataset are 88.29%, 89.88% and 93.35%, respectively. Shen et al. [17] studied the PPIs of the dataset using a deep learning algorithm, achieving an accuracy of 85.84%, which is lower than our results.

Comparison with different classifiers
To verify the effectiveness of the MOS feature extraction method on PPIs, we combined MOS with Decision Tree (DT), K-Neighbors (KN) and Random Forest (RF) on human data to construct three models: DT-MOS (decision tree-matrix of sequence), KN-MOS (K-Neighbors-matrix of sequence) and RF-MOS (random forest-matrix of sequence). The results are shown in Table 8. From Table 8, we can see that these methods achieve accuracies of 83.01-97.29%; the accuracies of DT-MOS, KN-MOS and RF-MOS are 94.36%, 83.01%, and 97.29%, respectively. These results show that the proposed MOS descriptor is also effective with other classifiers such as DT, KN and RF.

Discussion
We have presented a novel protein sequence coding approach for PPIs prediction. Of note, we propose a strategy for projecting protein sequences into a vector space, which is used to represent the matrix space of PPI information. Specifically, we first classify the 20 amino acids into 7 classes according to their physicochemical properties (Table 1). The dimensions of the matrix space can thereby be significantly reduced, from 20×20 to 7×7. Next, we combine the elements on the 7×7 matrix diagonal and the elements above the diagonal into a 28-dimensional vector.
To distinguish the length of a sequence, a sequence label is added. Finally, a 29-dimensional vector can represent a protein sequence. We combined MOS with DT, KN and RF and achieved good results. The experimental results show that the proposed MOS feature extraction method is effective. However, a disadvantage of the novel matrix sequence descriptor is that the sequence matrix is not in one-to-one correspondence with the protein sequence. For any two sequences, the corresponding sequence matrices are guaranteed to differ when the sequence lengths are different, or when the sequence lengths are the same but at least one amino acid class occurs a different number of times. Therefore, data pre-processing is required to remove protein pairs whose sequences have the same length and the same class counts.
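The lack of a one-to-one correspondence can be illustrated with a small example. The sketch below assumes that element $m_{ij}$ counts the position pairs $(a, b)$ with $a \le b$ whose classes are $i$ and $j$, a definition that is consistent with the stated element sum of $L(L+1)/2$ but is not confirmed by the text; the class-encoded strings are invented for the illustration.

```python
def sequence_matrix(encoded, n_classes=2):
    """Assumed definition: m[i][j] counts position pairs (a, b), a <= b,
    with position a in class i+1 and position b in class j+1."""
    m = [[0] * n_classes for _ in range(n_classes)]
    seen = [0] * n_classes
    for ch in encoded:
        j = int(ch) - 1
        seen[j] += 1
        for i in range(n_classes):
            m[i][j] += seen[i]
    return m


# Same length, same class counts, different order.
print(sequence_matrix("1221"))  # [[3, 2], [2, 3]]
print(sequence_matrix("2112"))  # [[3, 2], [2, 3]]
print(sequence_matrix("1122"))  # [[3, 4], [0, 3]]
```

Under this assumption, "1221" and "2112" are distinct sequences with identical matrices, whereas "1122" has the same length and class counts but a different matrix, so equal length and equal class counts are necessary but not sufficient for a collision.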
Recently, new feature extraction approaches for PPIs have been developed [30][31][32][33]. Among them, Li et al. [30] proposed a new method for predicting self-interacting proteins (SIPs) based on amino acid sequences, achieving high precisions of 86.86% and 91.30% on the Saccharomyces cerevisiae and human SIPs datasets, respectively. Wang et al. [31] reported a novel method for predicting PPIs from protein amino acid sequences based on pseudo position specific scoring matrix (PSSM) feature descriptors and an ensemble rotation forest (RF) learning system. Their method achieved accuracies of 98.38%, 89.75%, and 96.25% on the yeast, H. pylori, and independent datasets, respectively. Li et al. [32] developed a hybrid of physicochemical and evolution-based feature extraction methods, which can capture discriminant features from both evolutionary information and physicochemical properties. An et al. [33] explored a new feature representation method based on local binary pattern (LBP), which considers not only the amino acid sequence information but also the evolutionary information of multiple sequence alignments. The above studies show that effective feature extraction methods can mine useful information on protein pairs and improve the performance of PPIs prediction. In this study, although the performance of DNN-MOS is not prominent in Table 5, DNN-MOS can greatly reduce loss and training time (Table 6). In addition, Table 8 shows that the proposed MOS descriptor is also effective with other classifiers such as DT, KN and RF. Overall, although the performance of DNN-MOS is not prominent, it can be a useful supplement to PPIs predictions. The accuracy of DNN-MOS may be lower than that of DNN-CT, DNN-AC and DNN-LD because part of the information is lost when converting the protein sequence into a matrix vector. In future research, we will try to solve this problem and improve the predictive performance of DNN-MOS.

Conclusion
With the increasing number of PPI calculation methods, coding methods for amino acid feature vectors are also emerging. Although protein encoding methods such as AC, CT, and LD are useful, one of their disadvantages is that the order relationship of the entire amino acid sequence is not considered. CT [19] considers the order relationship of three amino acids. AC [20] considers the order relationship of 30 amino acids. LD only considers the neighbouring effect of two adjacent types of amino acids [21]. To overcome this problem, we propose an efficient method for predicting PPIs from amino acid sequences that combines a novel matrix sequence descriptor feature representation with a deep neural network. The proposed protein feature extraction method considers the order relationship of the entire amino acid sequence. When applied to human PPIs data, DNN-MOS, DT-MOS, KN-MOS and RF-MOS achieved good results. Additionally, the prediction model was evaluated on a non-redundant dataset, where the prediction accuracy was 88.29%. The experimental results show that the matrix sequence descriptor is promising for predicting PPIs and can be used as a complement to other methods.

Supporting information
S1 File. The positive protein-protein interactions. There are 36,630 protein-protein pairs from a total of 9,476 proteins. The first column is a protein ID from HPRD, the second column is the ID of the other protein, and the two proteins constitute a positive protein-protein interaction. (DOC)
S2 File. The negative protein-protein interactions. There are 36,480 protein-protein pairs from a total of 2,184 proteins. The first column is a protein ID, the second column is the ID of the other protein, and the two proteins constitute a negative protein-protein interaction. (DOC)
S3 File. Positive protein-protein interactions with identity below 25%. There are 3,899 protein-protein pairs from a total of 2,502 proteins. The first column is a protein ID from HPRD, the second column is the ID of the other protein, and the two proteins constitute a positive protein-protein interaction. The protein identity of all the proteins in the S3 file is below 25%. (DOC)
S4 File. Negative protein-protein interactions with identity below 25%. There are 4,262 protein-protein pairs from a total of 661 proteins. The first column is a protein ID from HPRD, the second column is the ID of the other protein, and the two proteins constitute a negative protein-protein interaction. The protein identity of all the proteins in the S4 file is below 25%. (DOC)