Abstract

Protein-protein interactions (PPIs) in plants are crucial for understanding biological processes. Although high-throughput techniques have produced valuable information for identifying PPIs in plants, they are usually expensive, inefficient, and extremely time-consuming. Hence, there is an urgent need to develop novel computational methods to predict PPIs in plants. In this article, we propose a novel approach to predict PPIs in plants using only protein sequence information. Specifically, plant protein sequences are first converted into position-specific scoring matrices (PSSMs); then, the fast Walsh–Hadamard transform (FWHT) algorithm is used to extract feature vectors from the PSSMs to capture the evolutionary information of plant proteins. Lastly, the rotation forest (RF) classifier is trained for prediction. We name this approach FWHT-RF because FWHT and RF are used for feature extraction and classification, respectively. When FWHT-RF was applied to three plant PPI datasets, Maize, Rice, and Arabidopsis thaliana (Arabidopsis), the average accuracies under 5-fold cross-validation were 95.20%, 94.42%, and 83.85%, respectively. To further evaluate the predictive power of FWHT-RF, we compared it with the state-of-the-art support vector machine (SVM) and K-nearest neighbor (KNN) classifiers. The experimental results demonstrate that FWHT-RF can be a useful supplementary method for predicting potential PPIs in plants.

1. Introduction

Protein-protein interactions (PPIs) in plants underlie many biological processes, including cellular organization, signal transduction [1], metabolic cycles [2], and plant defense [3]. Thus, detecting and characterizing protein interactions is critically important for understanding the relevant molecular mechanisms inside plant cells. With scientific and technological advances, a multitude of experimental approaches have been developed to identify PPIs in plants, such as yeast two-hybrid (Y2H) [4], bimolecular fluorescence complementation (BiFC) [5], tandem affinity purification (TAP) [6], and other high-throughput DNA sequencing technologies for PPI detection. As a result, a huge and ever-increasing amount of experimental data on plant PPIs has accumulated. However, these approaches have some unavoidable shortcomings: they are expensive and time-consuming, and they often suffer from high false-negative rates. Besides, large-scale experiments are difficult to carry out in plants because of the complexity of interactions in plant cells. Given these shortcomings, accurate computational methods for predicting PPIs would be of great value to plant biologists.

In recent years, many computational methods and ensemble learning algorithms have been established to offer complementary and supporting information for experimental approaches [7–9]. These methods can be broadly classified into three categories: protein structure-based methods, docking-based methods, and sequence-based methods. The first two categories usually need structural details. However, for many proteins, such prior knowledge, including 3D structural information and protein homology, is not available. In addition, with the rapid advance of high-throughput sequencing technology, more and more plant protein sequence data are becoming available, which has led to great interest in sequence-based methods for PPI prediction.

To date, many sequence-based approaches have been presented for predicting PPIs, and many ensemble learning algorithms have been proposed for classification [10–12]. For example, Yi et al. [13] proposed a method called RPI-SAN, which adopts a deep-learning stacked auto-encoder network to mine features from RNA and protein sequences and then employs the rotation forest classifier to predict ncRNA-binding proteins. Hashemifar et al. [14] developed a deep learning framework named DPPI, which combines random projection and data augmentation with a deep, Siamese-like convolutional neural network to predict PPIs. Zhang et al. [15] presented the EnsDNN (Ensemble Deep Neural Networks) method, which first employs the local descriptor, the covariance descriptor, and the multiscale continuous and discontinuous local descriptor together to explore the interactions between proteins. It then trains deep neural networks (DNNs) with different configurations for each descriptor. Finally, a neural network with two hidden layers integrates these DNNs to predict potential PPIs. Wei et al. [16] combined novel negative samples, features, and an ensemble classifier to predict PPIs. They reported two novel feature extraction methods: one is based on the physicochemical properties of proteins, and the other is based on secondary structure information. Sun et al. [17] applied a deep learning algorithm, the stacked autoencoder, to identify PPIs from protein sequences. Kulmanov et al. [18] developed a method called DeepGO, which combines a deep ontology-aware classifier with amino acid sequence information to detect protein functions and interactions. Despite these advances, there is still room for improvement in the prediction performance of PPI models [19].

In this article, we present a novel sequence-based computational approach, namely, FWHT-RF, to predict potential protein-protein interactions in plants. More specifically, we first transformed the plant protein sequences into position-specific scoring matrices (PSSMs). Then, in order to fully characterize the evolutionary information of protein pairs, we performed the fast Walsh–Hadamard transform (FWHT) on the PSSMs to extract feature vectors. Although FWHT plays an essential role in image analysis and pattern recognition, to the best of our knowledge, this is the first time it has been applied in plant biology for PPI prediction. Lastly, a powerful classification model, rotation forest (RF), was trained on these features. The major contributions of FWHT-RF are as follows: (1) FWHT-RF does not depend on unique subspaces in the studied proteomic space or on known PPI samples because it extracts features directly from the PSSM of the plant protein sequence. (2) Since these features are linked to the evolutionary history of plant proteins, they have more power to detect PPIs than many other approaches. (3) The basic features from the PSSM of each plant protein were extracted using a novel statistical feature selection mechanism and converted into a 400-dimensional feature vector. The feature vectors of the two proteins in a pair were then concatenated to create an 800-dimensional feature vector for each protein pair. (4) Finally, this work uses the RF classifier to train on these features, which can improve the accuracy of PPI prediction. The model was evaluated on three plant datasets (Maize, Rice, and Arabidopsis thaliana (Arabidopsis)) and yielded high prediction accuracies of 95.20%, 94.42%, and 83.85%, respectively. To further evaluate the predictive performance of FWHT-RF, we compared it with the state-of-the-art support vector machine (SVM) and k-nearest neighbor (KNN) classifiers. The experimental results indicate that FWHT-RF can be a complementary tool for large-scale prediction of PPIs in plants.

2. Materials and Methods

2.1. Benchmark Datasets Collection

Although many experiments and databases have been developed to identify and store PPI data in plants [20, 21], false positive interactions are common in these data and may have a negative impact on computational methods. Therefore, it is necessary to construct benchmark datasets to improve the accuracy of plant PPI prediction. In this paper, we evaluate the FWHT-RF approach on three plant benchmark datasets: Maize, Rice, and Arabidopsis thaliana (Arabidopsis).

Maize is one of the most important cereal crops in the world and a model plant for genomic studies of PPIs. The Maize dataset was gathered from the Protein-Protein Interaction Database for Maize (PPIM) [22] and agriGO [23]. We obtained 14,800 nonredundant maize protein pairs, which formed the positive dataset. To construct the negative dataset, we selected 14,800 additional maize protein pairs whose members have different subcellular localizations. Consequently, the whole Maize dataset consists of 29,600 protein pairs.

To further demonstrate the feasibility of the proposed method, two other plant PPI datasets were also adopted in this study. The first is Rice, which was gathered from the PRIN [24] database and consists of 9600 nonredundant rice protein pairs (4800 interacting pairs and 4800 noninteracting pairs). The second is the popular model plant Arabidopsis. We collected Arabidopsis PPIs from the public PPI databases IntAct [25], BioGRID [26], and TAIR [27]. After the removal of redundant sequences, we obtained 28,110 interactions among 7437 Arabidopsis proteins. The negative protein pairs were generated by randomly pairing proteins without evidence of interaction. In this way, the whole Arabidopsis dataset consists of 56,220 protein pairs.
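The random-pairing step used for the Arabidopsis negative set can be illustrated with a minimal sketch. The function name `sample_negative_pairs`, the seed, and the toy protein identifiers are hypothetical and are not taken from the original study; the sketch only assumes that a pair counts as negative when it does not appear in the list of known interactions.

```python
import random

def sample_negative_pairs(proteins, positive_pairs, n_pairs, seed=0):
    """Randomly pair proteins, skipping any pair listed as interacting."""
    rng = random.Random(seed)
    # Store pairs as frozensets so (a, b) and (b, a) count as the same pair.
    known = {frozenset(p) for p in positive_pairs}
    # Conservative guard so the sampling loop below always terminates.
    n = len(proteins)
    if n_pairs > n * (n - 1) // 2 - len(known):
        raise ValueError("not enough non-interacting pairs available")
    negatives = set()
    while len(negatives) < n_pairs:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in known:
            negatives.add(pair)         # the set removes duplicates
    # Sort for a reproducible output order.
    return sorted(tuple(sorted(p)) for p in negatives)
```

For example, `sample_negative_pairs(["P1", "P2", "P3", "P4"], [("P1", "P2")], 3)` would return three distinct pairs, none of which is the known interaction P1-P2. The Maize negative set was built differently (from pairs with different subcellular localizations), which this sketch does not cover.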

2.2. Representation of Target Proteins

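The position-specific scoring matrix (PSSM) [28] was first proposed for detecting distantly related proteins. In recent years, PSSM has been widely used for mining the evolutionary information of protein sequences [29]. A PSSM is a $P \times 20$ matrix, where $P$ is the number of amino acids in the protein and the 20 columns correspond to the 20 native amino acids. Suppose that $M = \{\varphi_{i,j} : i = 1, \ldots, P;\ j = 1, \ldots, 20\}$; the matrix can then be written as

$$M = \begin{bmatrix} \varphi_{1,1} & \varphi_{1,2} & \cdots & \varphi_{1,20} \\ \varphi_{2,1} & \varphi_{2,2} & \cdots & \varphi_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ \varphi_{P,1} & \varphi_{P,2} & \cdots & \varphi_{P,20} \end{bmatrix},$$

where $\varphi_{i,j}$ in the $i$th row of the PSSM reflects the probability of the $i$th residue being mutated into the $j$th native amino acid.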

In this study, we adopted the Position-Specific Iterated BLAST (PSI-BLAST) [30] tool to generate the PSSM for the purpose of extracting evolutionary information. To retrieve a broad set of highly homologous sequences, the expectation value (e value) was set to 0.001, the number of iterations was set to 3, and the other parameters were kept at their default values.
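As an illustration of these settings, the following sketch assembles a command line for the `psiblast` program from the NCBI BLAST+ suite. It is only an assumption that the reported e value maps to the `-evalue` option; the helper name `build_psiblast_command`, the file names, and the database name are hypothetical, since the source does not state which sequence database was searched.

```python
def build_psiblast_command(query_fasta, database, pssm_out,
                           evalue=0.001, iterations=3):
    """Return the argument list for one BLAST+ psiblast run."""
    return [
        "psiblast",
        "-query", query_fasta,               # single-sequence FASTA file
        "-db", database,                     # assumed local BLAST database
        "-evalue", str(evalue),              # e value reported in the text
        "-num_iterations", str(iterations),  # three PSI-BLAST iterations
        "-out_ascii_pssm", pssm_out,         # text-format PSSM output
    ]

# Hypothetical file and database names, for illustration only.
cmd = build_psiblast_command("protein.fasta", "swissprot", "protein.pssm")
```

On a machine where BLAST+ is installed, the returned list could be passed to `subprocess.run(cmd, check=True)`; all other options are left at their defaults, as described above.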

2.3. 2D Fast Walsh–Hadamard Transform

The Walsh–Hadamard transform (WHT) [31] is employed in many applications such as image analysis and signal processing. It is regarded as a generalized class of Fourier transforms (FT) and has three popular orderings: (1) Natural Ordering (Hadamard Ordering), (2) Dyadic Ordering (Paley Ordering), and (3) Sequency Ordering (Walsh Ordering) [32–34]. In this study, we focus on the WHT of Natural Ordering. The WHT matrix consists only of ±1 entries. Since no multiplication is required in the computation, the computational cost is greatly reduced. In the encrypted domain, this algorithm can avoid quantization error, and thus WHT can ensure perfect reconstruction of the encrypted image. In these respects, WHT can be more efficient than transformations such as the DFT [35] or DCT [36].

Suppose that $X$ represents the input image of size $a \times b$, where $a$ and $b$ are both powers of 2. The two-dimensional fast Walsh–Hadamard transform (FWHT) [37] of Natural Ordering (Hadamard Ordering) can be defined, up to a normalization constant, as follows:

$$Y = H_a X H_b,$$

where $H_a$ and $H_b$ denote the Hadamard matrices of orders $a$ and $b$. A Hadamard matrix $H_{2^n}$ can be generated from the core matrix

$$H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},$$

and the Kronecker product recursion is as follows:

$$H_{2^n} = H_2 \otimes H_{2^{n-1}},$$

where $\otimes$ is the Kronecker product operator [38]. The 2D FWHT [39] is a separable transformation that can be divided into two 1D transforms: applying the 2D FWHT to the input image $X$ is equivalent to applying the 1D FWHT to all columns of $X$ first and then applying the 1D FWHT to all rows of the result. For the 2D FWHT, the computational complexity is $O(ab \log_2 (ab))$. In this study, the input signal matrix $X$ is the PSSM. In this way, the plant protein sequence can be represented by FWHT feature descriptors.
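The separable column-then-row scheme can be sketched in a few lines of plain Python. This is a minimal, unnormalized illustration of the natural-ordered transform and not the authors' implementation. In particular, `pad_to_power_of_two` is an assumption: the text does not say how a PSSM with 20 columns and an arbitrary number of rows is brought to power-of-2 dimensions, and zero-padding is only one possible choice.

```python
def fwht_1d(values):
    """Unnormalized 1D fast Walsh-Hadamard transform, natural (Hadamard) order."""
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    h = 1
    while h < n:
        # Butterfly step: combine entries h apart using only additions
        # and subtractions (no multiplications are needed).
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = data[i], data[i + h]
                data[i], data[i + h] = x + y, x - y
        h *= 2
    return data

def fwht_2d(matrix):
    """2D FWHT: transform every column, then every row of the result."""
    cols = [fwht_1d(col) for col in zip(*matrix)]  # 1D FWHT on each column
    rows = [list(r) for r in zip(*cols)]           # back to row-major layout
    return [fwht_1d(row) for row in rows]          # 1D FWHT on each row

def pad_to_power_of_two(matrix, fill=0):
    """Zero-pad both dimensions to the next power of 2 (an assumption)."""
    def next_pow2(k):
        p = 1
        while p < k:
            p *= 2
        return p
    n_rows, n_cols = len(matrix), len(matrix[0])
    a, b = next_pow2(n_rows), next_pow2(n_cols)
    padded = [list(row) + [fill] * (b - n_cols) for row in matrix]
    padded += [[fill] * b for _ in range(a - n_rows)]
    return padded
```

For a 2 × 2 input such as `[[1, 2], [3, 4]]`, `fwht_2d` computes the same values as the matrix product of the core Hadamard matrix, the input, and the core Hadamard matrix again. How the transformed PSSM is reduced to the 400-dimensional descriptor is not specified in the text and is therefore not sketched here.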

2.4. Ensemble Rotation Forest Classifier

Rotation forest (RF), introduced by Rodriguez et al. [40], is an ensemble learning algorithm built from independently trained decision trees. The main advantage of RF is that it can balance diversity and accuracy at the same time. RF first randomly divides the feature set into different subsets. Then, principal component analysis (PCA) [41] is used to transform the attribute subsets in order to increase the difference between the subsets. Finally, the transformed subsets are fed into the decision trees, and the final result of RF is obtained by combining the outputs of these trees. The specific steps of RF are as follows.

Suppose that $\{x_i, y_i\}$ contains $T$ samples, in which $x_i = (x_{i1}, x_{i2}, \ldots, x_{iL})$ is an $L$-dimensional feature vector. Let $Z$ represent the training sample set containing the $T$ training samples, forming a matrix of size $T \times L$. Let $U$ represent the feature set and $Y$ denote the label set. Assume that the number of decision trees is $S$; then, the decision trees can be denoted as $D_1, D_2, \ldots, D_S$. The rotation forest algorithm is implemented as follows:
(1) Choose a suitable parameter $M$ and randomly split $U$ into $M$ disjoint subsets, so that each feature subset contains $L/M$ features.
(2) Let $U_{i,j}$ represent the $j$th feature subset used to train the classifier $D_i$. The sample subset $Z'_{i,j}$ is constructed from a nonempty subset of the training samples, randomly drawn in a certain proportion.
(3) Apply PCA on $Z'_{i,j}$ to obtain the principal component coefficients, which are stored in the matrix $C_{i,j}$.
(4) The coefficients in the matrices $C_{i,1}, \ldots, C_{i,M}$ are used to construct a sparse rotation matrix $R_i$, which can be defined as follows:

$$R_i = \begin{bmatrix} C_{i,1} & 0 & \cdots & 0 \\ 0 & C_{i,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & C_{i,M} \end{bmatrix}.$$

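During the prediction process, given a test sample $x$, let $d_{i,j}(x R_i^a)$ denote the probability, produced by the classifier $D_i$, that $x$ belongs to class $y_j$, where $R_i^a$ is the rotation matrix $R_i$ with its columns rearranged to match the original feature order. Then, the confidence of each class is calculated via the average combination, which can be expressed as follows:

$$\mu_j(x) = \frac{1}{S} \sum_{i=1}^{S} d_{i,j}(x R_i^a).$$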

Then, $x$ is assigned to the class with the largest confidence value. An overview of the FWHT-RF workflow is presented in Figure 1.
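Two of the steps above, assembling the sparse rotation matrix from per-subset PCA coefficient blocks and combining the trees by averaging, can be sketched as follows. The function names are hypothetical, and the sketch assumes that the coefficient blocks and the per-tree class probabilities have already been computed elsewhere; PCA and decision tree training are deliberately left out.

```python
def block_diagonal(blocks):
    """Place square coefficient blocks on the diagonal, zeros elsewhere."""
    size = sum(len(block) for block in blocks)
    rotation = [[0.0] * size for _ in range(size)]
    offset = 0
    for block in blocks:
        k = len(block)
        for r in range(k):
            for c in range(k):
                rotation[offset + r][offset + c] = float(block[r][c])
        offset += k  # the next block starts after the current one
    return rotation

def average_confidence(tree_probabilities):
    """Average the per-class probabilities over all trees."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    return [sum(p[j] for p in tree_probabilities) / n_trees
            for j in range(n_classes)]

def predict_class(tree_probabilities):
    """Return the index of the class with the largest averaged confidence."""
    confidence = average_confidence(tree_probabilities)
    return max(range(len(confidence)), key=confidence.__getitem__)
```

With three trees voting `[0.25, 0.75]`, `[0.5, 0.5]`, and `[0.0, 1.0]` for the classes (noninteracting, interacting), the averaged confidences are 0.25 and 0.75, so the pair would be labeled as interacting.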

3. Results and Discussion

3.1. Validation Measures

In this work, we employed multiple evaluation indicators to assess the effectiveness of FWHT-RF, including accuracy (Acc.), sensitivity (Sen.), precision (Prec.), Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC). The first four measures are defined as follows:

$$\mathrm{Acc.} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{Sen.} = \frac{TP}{TP + FN},$$

$$\mathrm{Prec.} = \frac{TP}{TP + FP},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP (true positive) is the number of interacting plant protein pairs (positive samples) that are correctly predicted as interacting, FP (false positive) is the number of noninteracting plant protein pairs (negative samples) that are incorrectly predicted as interacting, TN (true negative) is the number of noninteracting pairs that are correctly predicted as noninteracting, and FN (false negative) is the number of interacting pairs that are incorrectly predicted as noninteracting.
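As a quick check of the standard definitions of these four measures, the following sketch computes them from the confusion-matrix counts. The function name and the toy counts in the usage note are illustrative only and are not results from this study; returning 0.0 for an undefined ratio is a convention chosen here, not something stated in the text.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute Acc., Sen., Prec., and MCC from confusion-matrix counts."""
    def safe_div(num, den):
        # Convention chosen for this sketch: an undefined ratio becomes 0.0.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": safe_div(tp + tn, tp + tn + fp + fn),
        "Sen": safe_div(tp, tp + fn),
        "Prec": safe_div(tp, tp + fp),
        "MCC": safe_div(tp * tn - fp * fn, mcc_den),
    }
```

For instance, with 40 true positives, 10 false positives, 40 true negatives, and 10 false negatives, accuracy, sensitivity, and precision are all 0.8, while MCC is (1600 − 100) / 2500 = 0.6.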

To provide a more comprehensive assessment of the FWHT-RF method, receiver operating characteristic (ROC) curves, which are suitable for assessing the performance of the proposed method, were computed. The area under the ROC curve (AUC) was also calculated to test the predictive ability of FWHT-RF. AUC denotes the probability that a randomly chosen positive sample is ranked ahead of a randomly chosen negative one. An AUC value closer to 1.0 indicates better predictive performance of the FWHT-RF method [42].
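The ranking interpretation of AUC can be made concrete with a small sketch that counts, over all positive-negative pairs, how often the positive sample receives the higher score. The function name and the toy scores in the usage note are illustrative; the quadratic loop is chosen for clarity and would be slow on datasets of the size used here.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked ahead of negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

With labels `[1, 1, 0, 0]` and scores `[0.9, 0.4, 0.5, 0.1]`, three of the four positive-negative pairs are ordered correctly, which gives an AUC of 0.75.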

3.2. Assessment of Prediction Ability

In this article, we adopt the 5-fold cross-validation technique to assess the prediction performance of FWHT-RF on the three plant datasets, Maize, Rice, and Arabidopsis. In this way, we can reduce the risk of overfitting and test the stability of the proposed method. More specifically, each plant PPI dataset is randomly split into five subsets; each subset is used in turn as the testing set, while the other four subsets are used as the training set. Thus, five models are generated from the five splits of the data. Cross-validation has the advantage that it can minimize the impact of data dependency and improve the reliability of the results.
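The splitting procedure described above can be sketched with the standard library alone. The generator name `k_fold_indices` and the fixed seed are assumptions made for reproducibility of the illustration; the source does not state how the random split was seeded or whether it was stratified by class.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random split of the dataset
    folds = [indices[i::k] for i in range(k)]   # k nearly equal subsets
    for i in range(k):
        test = folds[i]                         # one subset held out in turn
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]               # the other k - 1 subsets
        yield train, test
```

Each sample index appears in exactly one testing set, so averaging the five per-fold scores uses every protein pair once for evaluation.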

The 5-fold cross-validation results of the proposed approach on the three plant datasets are listed in Tables 1–3. From Tables 1–3, we can observe that the best prediction results were obtained on the Maize dataset, with an average accuracy, precision, sensitivity, and MCC of 95.20%, 97.29%, 92.99%, and 90.85% and corresponding standard deviations of 0.38%, 0.26%, 0.62%, and 0.69%, respectively. On the Rice dataset, FWHT-RF yielded an average accuracy, precision, sensitivity, and MCC of 94.42%, 94.63%, 94.17%, and 89.46%, with standard deviations of 0.56%, 0.84%, 0.72%, and 0.99%, respectively. On the Arabidopsis dataset, the proposed approach obtained an average accuracy, precision, sensitivity, and MCC of 83.85%, 89.29%, 76.95%, and 72.66%, with standard deviations of 0.35%, 0.62%, 1.16%, and 0.52%, respectively. Figures 2–4 show the ROC curves of the proposed approach on Maize, Rice, and Arabidopsis. The average AUC values range from 90.55% to 97.50% (Maize: 97.50%, Rice: 96.90%, and Arabidopsis: 90.55%), demonstrating that FWHT-RF is well suited to predicting PPIs in plants from amino acid sequences.

Collectively, these results indicate that it is feasible to predict PPIs in plants using only protein sequence information and that strong prediction capability can be obtained by combining the RF classifier with FWHT feature descriptors. The high accuracies and low standard deviations of these criteria indicate that FWHT-RF is feasible and effective for predicting potential PPIs in plants.

3.3. Comparison of RF with SVM and KNN Classifiers

There are various machine learning methodologies for identifying PPIs, and most of them are based on traditional classifiers. To further assess the predictive performance of FWHT-RF, we compared it with the state-of-the-art SVM and KNN classifiers, using the same feature extraction approach on the same three plant datasets. The main idea of the SVM algorithm is to find the optimal hyperplane that maximally separates the training data of the two classes, and it is effective for solving classification problems. K-nearest neighbor is a supervised machine learning technique that can also solve classification tasks. The LIBSVM tool was selected in this paper to train the SVM model, which has two parameters, c and $\gamma$, that need to be optimized. In the experiments on the Maize and Rice datasets, we set c = 5, $\gamma$ = 0.3 and c = 7, $\gamma$ = 0.4, respectively. For the Arabidopsis dataset, we set c = 5 and $\gamma$ = 0.7. The KNN model requires choosing the number of neighbors k and the distance function. In this paper, k is set to 1 and the L1 distance is used.
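The KNN baseline with k = 1 and the L1 distance is simple enough to sketch directly. The function names and the toy feature vectors in the usage note are hypothetical; in the comparison described above, the inputs would be the 800-dimensional FWHT feature vectors of protein pairs.

```python
def l1_distance(u, v):
    """L1 (Manhattan) distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def predict_1nn(train_x, train_y, query):
    """Return the label of the single nearest training sample (k = 1)."""
    nearest = min(range(len(train_x)),
                  key=lambda i: l1_distance(train_x[i], query))
    return train_y[nearest]
```

For example, with training vectors `[[0, 0], [10, 10]]` labeled `[0, 1]`, the query `[1, 2]` lies at L1 distance 3 from the first vector and 17 from the second, so it receives label 0.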

Figure 5 shows the experimental results of the RF, SVM, and KNN models on the three plant datasets, Maize, Rice, and Arabidopsis. From Figures 5(a)–5(d), it can be concluded that the results of the RF classifier are clearly better than those of the SVM and KNN classifiers. For example, the accuracy gaps between SVM and RF on Maize, Rice, and Arabidopsis were 7.98%, 8.53%, and 3.26%, respectively. Similarly, the accuracy gaps between KNN and RF were 11.72%, 15.36%, and 10.40%, respectively. The ROC curves achieved by the SVM and KNN classifiers on the three plant datasets are shown in Figures 6–8. All the experimental results are listed in Table 4.

4. Discussion and Conclusions

In this study, we presented an effective sequence-based method called FWHT-RF to predict potential PPIs in plants. This method combines the position-specific scoring matrix (PSSM) with the fast Walsh–Hadamard transform (FWHT) and the rotation forest (RF) classifier. First, we transformed the plant protein sequences into PSSMs to obtain their evolutionary information. Then, the FWHT algorithm was used to extract as much hidden information as possible from the PSSMs. Finally, the RF classifier was trained to predict PPIs in plants. When applied to the three plant PPI datasets, Maize, Rice, and Arabidopsis, FWHT-RF achieved high prediction accuracies of 95.20%, 94.42%, and 83.85%, respectively. Moreover, we compared FWHT-RF with the state-of-the-art SVM and KNN classifiers using the same feature extraction method. These experiments demonstrate that FWHT-RF is an effective tool for predicting PPIs in plants. In future work, we will consider applying FWHT-RF to other bioinformatics problems.

Data Availability

The data are original, and the data source is restricted.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This research was funded by the National Natural Science Foundation of China, Grant nos. 61722212 and 62002297.