PredDBP-Stack: Prediction of DNA-Binding Proteins from HMM Profiles using a Stacked Ensemble Method.

DNA-binding proteins (DBPs) play vital roles in all aspects of genetic activities. However, identifying DBPs with wet-lab experimental approaches is often time-consuming and laborious. In this study, we develop a novel computational method, called PredDBP-Stack, to predict DBPs solely from protein sequences. First, amino acid composition (AAC) and transition probability composition (TPC) extracted from the hidden Markov model (HMM) profile are adopted to represent a protein. Next, we establish a stacked ensemble model to identify DBPs, which involves two stages of learning. In the first stage, four base classifiers are trained with the HMM-based composition features. In the second stage, the prediction probabilities of these base classifiers are used as inputs to the meta-classifier to perform the final prediction of DBPs. On the PDB1075 benchmark dataset, we conduct a jackknife cross validation with the proposed PredDBP-Stack predictor and obtain a balanced sensitivity and specificity of 92.47% and 92.36%, respectively. This result outperforms most of the existing classifiers. Furthermore, our method also achieves superior performance and model robustness on the PDB186 independent dataset. This demonstrates that PredDBP-Stack is an effective classifier for accurately identifying DBPs based on protein sequence information alone.


Introduction
DNA-binding proteins (DBPs) are fundamental in the process of composing DNA and regulating genes. They execute intercellular and intracellular functions such as transcription, DNA replication, recombination, modification, and other biological activities associated with DNA [1]. Given the significant role that DBPs play, their effective identification has become a hot research topic in protein science. The past decade has witnessed tremendous progress in DBP recognition, including both experimental and computational methods [2]. In early research, DBPs were detected by laborious experimental techniques such as filter binding assays, genetic analysis, X-ray crystallography, chromatin immunoprecipitation on microarrays, and nuclear magnetic resonance [3]. With the rapid development of high-throughput sequencing technology and the continuing growth of protein sequence data, more efficient and accurate machine learning (ML) methods have been implemented and applied to the classification of DBPs [4,5].
Feature encoding schemes and classification algorithms have a great impact on the performance of ML-based methods. Feature representation numerically encodes variable-length protein sequences as fixed-length feature vectors, and the encodings can be categorized into structure-based and sequence-based models. Structure-based methods rely on structural information of proteins such as the spatial distribution, net charge, electrostatic potential, dipole moment, and quadrupole moment tensors [6,7]. However, the great difficulty of acquiring high-resolution crystal structures and the insufficient number of proteins with known structural information heavily limit the use of structure-based predictors [8].
Previous studies have demonstrated the importance of features based on the position-specific scoring matrix (PSSM) for enhancing DBP prediction. For example, Kumar et al. first adopted evolutionary information embedded in the PSSM profile to identify DBPs and achieved good performance [17]. Waris et al. produced an ingenious classifier by integrating the PSSM profile with dipeptide composition and split AAC [18]. Zou et al. proposed a fuzzy kernel ridge regression model to predict DBPs based on multiview sequence features [22]. Ali et al. introduced the DP-BINDER model for the discrimination of DBPs by fusing physicochemical information and PSSM-based features [23]. More recently, Zaman et al. built the HMMBinder predictor for the DBP recognition problem by extracting monogram and bigram features derived from the HMM profile [20]. They also showed experimentally that HMM-based features are more effective for the prediction of DBPs than PSSM-based features, especially on the jackknife test. Nevertheless, HMMBinder achieved relatively poor performance on the independent test. Accordingly, there is still room to improve DBP prediction by exploring highly discriminative features from the HMM profile.
Prediction of DBPs is usually formulated as a supervised learning problem. In recent years, many classification algorithms have been adopted to solve this problem, including support vector machine (SVM) [24][25][26], random forest (RF) [27,28], naive Bayes classifier [3], ensemble classifiers [29][30][31], and deep learning [32][33][34]. Among these models, stacked generalization (or stacking) is an ensemble learning technique that takes the outputs of base classifiers as input and attempts to find the optimal combination of the base learners to make a better prediction [35]. Xiong et al. constructed a stacked ensemble model to predict bacterial type IV secreted effectors from protein sequences by using the PSSM-composition features [36]. Recently, Mishra et al. developed a StackDPPred method for the effective prediction of DBPs, which utilized a stacking-based ML method and features extracted from the PSSM profiles [29].
Inspired by the work of Zaman et al. [20] and Mishra et al. [29], we propose a stacked ensemble method, called PredDBP-Stack, to further improve the performance of DBP prediction by exploring valuable features from the HMM profiles. First, we convert the HMM profiles into 420-dimensional feature vectors by fusing AAC and transition probability composition (TPC) features. Next, six types of ML algorithms are adopted to implement base classifiers in the first stage. Then, the optimal combination of base learners is searched, and the prediction probabilities of the selected base learners are used as inputs to the meta-classifier to make the final prediction in the second stage. Compared with existing state-of-the-art predictors, our method performs better on the jackknife cross validation as well as on the independent test.

Materials and Methods
In this section, we describe all details about the proposed prediction model for identifying DBPs. The system diagram of the PredDBP-Stack methodology is illustrated in Figure 1. Several major intermediate steps in the development process of PredDBP-Stack are specified in the following subsections.
2.1. Datasets. The construction of a high-quality benchmark dataset is crucial for building a robust and reliable ML-based predictive model. In this study, two well-established datasets, i.e., PDB1075 [5] and PDB186 [3], are adopted to examine the performance of our predictor. The PDB1075 dataset consists of 1075 protein sequences (525 DBPs and 550 non-DBPs), which are used for model training and testing with the jackknife cross validation. The PDB186 dataset is designed as an independent test dataset that contains 93 DBPs and 93 non-DBPs. All protein sequences in these two datasets were downloaded from the Protein Data Bank [37] and have been filtered rigorously by removing those with relatively high similarity (≥25%), those that are too short (<50 amino acids), and those involving unknown residues such as "X".
HMM profiles have been widely used in bioinformatics, for example in protein remote homology detection [38], DBP prediction [20], and protein fold recognition [39]. In this study, HMM profiles are generated from the multiple sequence alignments by running four iterations of the HHblits program [40] against the latest UniProt database [41] with default parameters. Similar to the PSSM profile, we only use the first 20 columns of the HMM profile in the form of an L × 20 matrix, where L represents the length of the query protein sequence. Each element of the HMM profile is then normalized, with x denoting the original value of the HMM profile.
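The normalization function itself is not reproduced in the text above. As a rough illustration only, the sketch below assumes the HH-suite convention for `.hhm` files, in which each stored integer equals -1000 · log2(p) and `*` marks a zero probability, so that a raw entry x maps back to 2^(-x/1000). The function names `normalize_hmm_value` and `normalize_row` are hypothetical, and the mapping is an assumption rather than the confirmed formula of this study.

```python
def normalize_hmm_value(raw: str) -> float:
    """Map one raw .hhm entry to a probability in [0, 1].

    Assumption: entries follow the HH-suite encoding, i.e. an integer
    equal to -1000 * log2(p), with '*' standing for p = 0.
    """
    if raw == "*":
        return 0.0
    return 2.0 ** (-int(raw) / 1000.0)


def normalize_row(fields):
    """Normalize the first 20 columns (emission scores) of one profile row."""
    return [normalize_hmm_value(f) for f in fields[:20]]


# Toy row: four informative entries followed by zero-probability markers.
row = ["0", "1000", "*", "2000"] + ["*"] * 16
probs = normalize_row(row)
print(probs[:4])
```

Applying `normalize_row` to every residue position of a parsed profile would give the L × 20 matrix used in the following subsections.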

Feature Extraction from HMM Profiles.
Feature extraction often plays an important role in most protein classification problems, which has a direct impact on the prediction accuracy of ML-based predictors. In this study, a simple and powerful feature encoding scheme by extracting AAC and TPC features is adopted to convert the HMM profiles into fixed-length feature vectors.
Since the DNA-binding preference of a protein is closely related to its AAC [9], we first obtain AAC features from the HMM profile by using the following formula:

$$x_j = \frac{1}{L}\sum_{i=1}^{L} h_{i,j}, \quad j = 1, 2, \ldots, 20,$$

where $h_{i,j}$ is the value in the $i$-th row and $j$-th column of the HMM profile. Here, $x_j$ is the composition of amino acid type $j$ in the HMM profile and represents the average score of the amino acid residues in the query protein being changed to amino acid type $j$ during the evolution process. AAC based on the HMM profile is a simple and intuitive feature; however, it ignores the role of sequence-order information.
To partially reflect the local sequence-order effect, TPC features are also computed from the HMM profile, one for each ordered pair of the 20 amino acid types (400 features in total). To include both evolution information and sequence-order information, a 420-dimensional vector is finally employed to represent a protein by fusing the 20 AAC and 400 TPC features. We call this feature encoding method AATP-HMM in this study.
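The AATP-HMM encoding can be sketched in a few lines. The AAC part follows the column-average description given above. The TPC formula is not reproduced in the text, so the sketch assumes the definition commonly used for AATP-style encodings: for amino acid types i and j, the sum over positions k of h[k][i] · h[k+1][j], divided by the same sum taken over all j. The function names and the zero-row guard are illustrative choices, not part of the original method description.

```python
def aac_features(H):
    """AAC: mean of each of the 20 columns of an L x 20 normalized profile."""
    L = len(H)
    return [sum(row[j] for row in H) / L for j in range(20)]


def tpc_features(H):
    """TPC under the assumed definition: row-normalized sums of h[k][i] * h[k+1][j]."""
    L = len(H)
    feats = []
    for i in range(20):
        # Unnormalized transition scores from amino acid type i to each type j.
        raw = [sum(H[k][i] * H[k + 1][j] for k in range(L - 1)) for j in range(20)]
        total = sum(raw)
        # Assumption: emit zeros when a row has no transition mass at all.
        feats.extend([v / total if total > 0 else 0.0 for v in raw])
    return feats


def aatp_hmm(H):
    """Concatenate 20 AAC and 400 TPC values into one 420-dimensional vector."""
    return aac_features(H) + tpc_features(H)


# Toy profile: 5 residues, every entry equal to 0.05 (not real protein data).
H = [[0.05] * 20 for _ in range(5)]
vec = aatp_hmm(H)
print(len(vec))
```

Under this assumed definition, each block of 20 TPC values sums to 1, so the TPC part behaves like a row-stochastic transition matrix flattened into a vector.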

Classification Algorithm.
In this study, we apply an effective ensemble technique called stacking [35] to improve the performance of the DBP predictor. Stacking compensates for the limitations of a single classifier by integrating prediction results from multiple classification algorithms. There are two stages in our stacked ensemble scheme (Figure 2). In the first stage, various classification algorithms are employed individually as base classifiers to produce predicted class probabilities. In the second stage, these probabilities are fed, in different combinations, into the meta-classifier to generate the final prediction results.
Taking into account the underlying principle of each classification algorithm and their prediction performance, we select the three top learners, i.e., SVM (RBF), RF, and XGB, and combine each of them with the other base classifiers. We also build one stacking model (SM) from these three best-performing classifiers and another from all classification models. The five combinations of base classifiers considered in this study are as follows:
SM1: KNN, LR, DT, and SVM (RBF)
SM2: KNN, LR, DT, and XGB
SM3: KNN, LR, DT, and RF
SM4: SVM (RBF), XGB, and RF
SM5: KNN, LR, DT, SVM (RBF), XGB, and RF
In all cases, gradient boosting decision tree (GBDT) [49] is used as the meta-classifier to perform the final prediction of DBPs. Gradient boosting is a powerful ML technique, which produces a prediction model in the form of an ensemble of weak learners, typically DT [50]. Because the loss function can be chosen freely, GBDT can be customized to any particular ML task.
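A two-stage stacking scheme of this kind can be sketched with scikit-learn's `StackingClassifier`. The example below wires KNN, LR, DT, and an RBF-kernel SVM as base learners and a gradient boosting classifier as the meta-classifier, passing class probabilities to the second stage. It is a minimal sketch under several assumptions: the data are synthetic stand-ins for 420-dimensional feature vectors, all hyperparameters are library defaults rather than tuned values from this study, and the internal 5-fold split used to build meta-features is a choice of this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for 420-dimensional AATP-HMM vectors (not real protein data).
X, y = make_classification(
    n_samples=200, n_features=420, n_informative=20, random_state=0
)

# First stage: base classifiers that each output class probabilities.
base_learners = [
    ("knn", KNeighborsClassifier()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
]

# Second stage: gradient boosting trained on the base-level probabilities.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=GradientBoostingClassifier(random_state=0),
    stack_method="predict_proba",
    cv=5,  # assumption: out-of-fold probabilities from 5-fold splitting
)
stack.fit(X, y)
proba = stack.predict_proba(X[:5])
print(proba.shape)
```

Swapping entries in `base_learners` is enough to express other combinations of base classifiers with the same meta-classifier.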

Performance Evaluation.
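To evaluate the performance of PredDBP-Stack, we first implement the jackknife cross validation test on the PDB1075 dataset. In the jackknife test, every protein is tested one by one by the predictor trained with the remaining proteins in the benchmark dataset. Next, the independent test on the PDB186 dataset is also performed to examine the generalization ability of the proposed model. In this study, four widely used performance metrics are employed to compare PredDBP-Stack with several state-of-the-art models for identifying DBPs, including Overall Accuracy (OA), Sensitivity (SN), Specificity (SP), and Matthews correlation coefficient (MCC) [51][52][53][54]. These metrics are formulated as follows:

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{SN} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP and TN denote the numbers of correctly predicted DBPs and non-DBPs, respectively, and FP and FN denote the numbers of non-DBPs predicted as DBPs and DBPs predicted as non-DBPs, respectively.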
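The jackknife procedure and the four metrics can be illustrated together. The sketch below runs a leave-one-out loop with a toy 1-nearest-neighbor classifier, used here only to keep the example self-contained, and then computes OA, SN, SP, and MCC from the standard confusion-matrix definitions. The helper names and the toy data are hypothetical, and the metric code assumes both classes are present so that no denominator of SN or SP is zero.

```python
import math


def jackknife_predict(X, y, fit_predict):
    """Leave-one-out: predict each sample with a model trained on all the others."""
    preds = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        preds.append(fit_predict(train_X, train_y, X[i]))
    return preds


def nearest_neighbor(train_X, train_y, query):
    """Toy 1-NN classifier, standing in for a real predictor."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_X]
    return train_y[dists.index(min(dists))]


def evaluate(y_true, y_pred):
    """Compute OA, SN, SP, and MCC, treating label 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "OA": (tp + tn) / (tp + tn + fp + fn),
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
        # MCC is conventionally set to 0 when the denominator vanishes.
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy one-dimensional data (1 = positive, 0 = negative); not real protein features.
X = [[0], [1], [2], [10], [11], [4]]
y = [0, 0, 0, 1, 1, 1]
preds = jackknife_predict(X, y, nearest_neighbor)
metrics = evaluate(y, preds)
print(preds, metrics)
```

On this toy set the last sample is misclassified, which gives one false negative and shows how a single error lowers SN while SP stays at 1.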

Results and Discussion
3.1. Performance of Base Classifiers. Based on the AATP-HMM feature representation, we first analyze the predictive power of six classifiers, i.e., DT, KNN, LR, XGB, RF, and SVM, employed at the base level of stacking. The models are tested on the PDB1075 dataset by using the jackknife cross validation, and the experimental results are shown in Table 1. Table 1 indicates that the optimized SVM with the RBF kernel provides the highest performance in terms of OA, MCC, and AUC compared with the other methods for the prediction of DBPs. Moreover, the RF method obtains the best SN value of 83.4%, and the XGB method gives an outstanding SP value of 80.69%. It is also evident that the DT model performs worst in this task. In addition, KNN and LR show acceptable performance, with AUC values larger than 0.8. To keep the figure clear and legible, only the three ROC curves corresponding to the LR, DT, and SVM models are shown in Figure 3, which is consistent with the findings in Table 1.

Performance of Meta-Classifiers.
To find the optimal combination of base learners, we construct five SMs with different classifiers as follows. Because SVM, XGB, and RF are the top three classifiers in the above tests, each of them is combined with the remaining classifiers (KNN, LR, and DT) to formulate an SM, namely, SM1, SM2, and SM3, respectively. The combination of the three top classifiers and the combination of all six classifiers are formulated as SM4 and SM5, respectively. For all the SMs, the meta-classifier in the second stage is GBDT. The performance of the five SMs on the PDB1075 dataset using the jackknife test is shown in Table 2.
From Table 2, we observe that SM1, SM2, SM3, and SM5 provide similar performance, with OA larger than 90%. However, SM4 produces less competitive scores on the five evaluation measures. This may imply that combining only the top three classifiers does not necessarily yield the best result. Additionally, SM1, which employs KNN, LR, DT, and SVM (RBF) as base learners and GBDT as the meta-classifier, achieves the highest scores on OA, SN, MCC, and AUC. SM2 gives the best SP of 92.55%. We also plot the ROC curves for SM1 and its four base classifiers in Figure 4, which demonstrates that stacked generalization can indeed improve the performance of base-level learners. Thus, SM1 is adopted as the final predictor for the identification of DBPs in the subsequent analysis.

Comparison with Existing Methods.
In this section, we evaluate the performance of PredDBP-Stack under two testing protocols for a fair comparison with the existing methods, including DNABinder [17], DNA-Prot [4], iDNA-Prot [28], iDNA-Prot|dis [5], Kmer1+ACC [14], iDNAPro-PseAAC [19], Local-DPP [27], HMMBinder [20], and StackDPPred [29]. The jackknife test is first implemented on the benchmark dataset PDB1075, and the detailed results are reported in Table 3. As shown in Table 3, HMMBinder, StackDPPred, and the proposed PredDBP-Stack provide outstanding performance, with OA higher than 85% and AUC above 0.9. However, our method shows the best predictive power on the five metrics: OA (92.42%), SN (92.47%), SP (92.36%), MCC (0.85), and AUC (0.9677). This is likely attributable to the effective feature extraction technique from the HMM profile and the powerful stacked ensemble classifier adopted in the PredDBP-Stack model.
To further assess the robustness of the proposed method, we perform an independent test on the PDB186 dataset, where PredDBP-Stack is trained beforehand on the benchmark dataset. Table 4 lists the predictive results of our method and the nine existing state-of-the-art predictors mentioned above. From Table 4, we observe that our method, together with StackDPPred, performs better than the other methods on the PDB186 dataset, with an OA of 86.56%. Specifically, our method achieves the highest SP (86.02%) and AUC (0.8932) among the evaluated methods. In addition, the proposed PredDBP-Stack attains the second-best SN (87.10%) and MCC (0.731), which are slightly lower than those of StackDPPred. It should be pointed out that StackDPPred also applies a stacking technique to establish a powerful predictor for the identification of DBPs, and it utilizes two different types of features, i.e., the PSSM profile and the residue-wise contact energy profile [29]. In contrast, our method obtains favorable prediction accuracy when only the HMM profile is used. The successful applications of StackDPPred and PredDBP-Stack show that the stacking-based ML technique might yield a competitive tool for the prediction of DBPs and other protein classification tasks.
From the above comparisons, our method outperforms the existing models on the jackknife test and matches or exceeds them on most metrics of the independent test. This indicates that our method is a very promising tool for identifying DBPs and may at least play an important complementary role to existing methods.

Conclusions
Even though considerable efforts have been devoted so far, prediction of DBPs solely from sequence information remains a challenging problem in bioinformatics. In this study, we develop a stacking-based ML model, PredDBP-Stack, to further improve the prediction accuracy of DBPs, which employs an ensemble of base learners, namely KNN, LR, DT, and SVM, to generate outputs for the meta-classifier. First, a hybrid feature encoding model, called AATP-HMM, is proposed to transform the HMM profiles into fixed-length numeric vectors, which incorporate evolution information and sequence-order effects. Next, these feature vectors are used to train the base-level predictors in the first stage. Then, GBDT is adopted as the meta-classifier in the second stage to implement the final prediction of DBPs. Finally, the jackknife cross validation and the independent test are performed on the two benchmark datasets to evaluate the predictive power of the proposed method. Comparison with the other existing predictors indicates that our method provides a clear improvement and could serve as a useful tool for predicting DBPs, given the sequence information alone.

Data Availability
The datasets and source code for this study are freely available to the academic community at: https://github.com/taigangliu/PredDBP-Stack.