Identification of DNA-binding proteins based on a multiple kernel model

Abstract: DNA-binding proteins (DBPs) play a critical role in DNA biology research and in the development of drugs for treating genetic diseases, so predicting DBPs accurately and efficiently is essential. In this paper, a Laplacian Local Kernel Alignment-based Restricted Kernel Machine (LapLKA-RKM) is proposed to predict DBPs. In detail, we first extract features from the protein sequence using six methods. Second, the Radial Basis Function (RBF) kernel is utilized to construct predefined kernel matrices. Then, these matrices are combined linearly with weights calculated by LapLKA. Finally, the fused kernel is input to the RKM for training and prediction. Independent tests and leave-one-out cross-validation were used to validate the performance of our method on one small dataset and two large datasets. Importantly, we built an online platform for our model, which is freely accessible via http://8.130.69.121:8082/.


Introduction
Many biological processes are carried out by DBPs, such as specific nucleotide sequence recognition, transcription and DNA replication. Therefore, the identification of DBPs has become an important subject in biology. DBPs can be identified by various experimental techniques, such as ChIP-chip [1,2] and filter binding assays [3]. However, with the development of high-throughput sequencing technology, protein sequence databases have grown at an unprecedented rate, and the number of proteins with unknown structure and function keeps rising. A rapid and accurate method for identifying and characterizing DBPs based on their protein sequences is therefore highly desired. Computational prediction methods have been widely applied to various biological problems [4][5][6][7][8][9][10][11].
The existing prediction methods are broadly divided into two groups. The first group is model-based prediction methods, which borrow prior information across sequences to predict DBPs, including amino acid composition [12,13], evolutionary information [14,15] and physicochemical characters [16]. For example, Rahman et al. [17] presented a predictor named DPP-PseAAC. They used Chou's PseAAC [18] to extract features from the amino acid composition and a Random Forest (RF) model to reduce the dimension of the feature vector. Then, they applied a Support Vector Machine [19] (SVM) with a linear kernel to train the prediction model. Similarly, StackPDB takes three steps to predict DBPs: feature extraction, feature selection and model construction. StackPDB extracts protein sequence features from the amino acid composition and from evolutionary information. Evolutionary information can be represented by the position-specific scoring matrix (PSSM), which is generated by the PSI-BLAST [20] program. In the StackPDB method, PsePSSM, PSSM-TPC, EDT and RPT are used to extract features from the PSSM. Extreme gradient boosting-recursive feature elimination is then used to select the best features. Finally, the resulting feature subset is fed into a stacked ensemble classifier composed of XGBoost, SVM and LightGBM. From previous studies [21][22][23][24][25], we can see that a protein sequence can be described by different representations, such as amino acid composition and the PSSM. Because fusion methods can exploit information from all representations to effectively improve model performance, several fusion techniques have been applied to the identification of DBPs.
Examples include CKA-MKL [26], HSIC-MKL [27], HKAM-MKL [28] and MLapSVM-LBS [29]. CKA-MKL, HSIC-MKL and HKAM-MKL are Multiple Kernel Learning (MKL) methods, a popular early-fusion technique. MKL aims to learn optimal kernel weights; the optimal kernel is a linear combination of multiple base kernels weighted accordingly. CKA-MKL maximizes the cosine similarity score between the optimal kernel and the ideal kernel, and introduces a Laplacian term on the weights into the objective function to avoid extreme situations. However, CKA-MKL only considers global kernel alignment and ignores the difference information between local samples. Therefore, HKAM-MKL maximizes both the local and global kernel alignment scores, and is consequently superior to CKA-MKL in predicting DBPs. CKA-MKL and HKAM-MKL both use SVM as the classifier. HSIC-MKL maximizes the dependence between the trained samples and the labels in a Reproducing Kernel Hilbert Space (RKHS); the optimal kernel is then input into a hypergraph-based Laplacian SVM, an extension of SVM. Different from the above MKL methods, MLapSVM-LBS fuses multiple sources of information during the training process, using the multiple local behavior similarity graph as a regularization term. Because the objective function of MLapSVM-LBS is non-convex, an alternating algorithm is employed. The advantage of MLapSVM-LBS is that multiple sources of information are fused during the training phase while allowing some freedom to model the views differently.
There are also several methods that predict DBPs based on structural information. Using structural alignment and statistical potential, Gao et al. [4] proposed DBD-Hunter. DBD-Threader was subsequently proposed by Gao et al. [30]; it uses a template library consisting of DNA-protein complex structures, while its classification relies only on the sequence of the target protein. Structure-based predictors can be used when the structure of a candidate protein is known; therefore, predictors that rely solely on structural information are limited in their application.
The second group is deep learning-based prediction methods, which are designed to capture hidden representations of protein sequences. For example, Du et al. [31] reported a deep learning-based method called MsDBP, which relies only on the primary sequence, without hand-crafted feature selection. Lu et al. [32] proposed a predictor that contains parallel long short-term memory (LSTM) and convolutional neural networks (CNN); in their work, the inputs of the LSTM and CNN are the sequence and the PSSM, respectively. The spatial structure of a protein contains richer information than the protein sequence, so Lu et al. [33] further constructed a graph convolutional network based on the contact map generated by Pconsc4 [34]. Yan et al. [35] employed transfer learning to construct datasets and built a deep neural network with attention mechanisms to detect DBPs. Because of their nature, most deep learning-based methods [33,36,37] are not suitable for small datasets. Inspired by a series of recent publications [26,27,[38][39][40][41][42][43][44][45][46], we propose a predictor for detecting DBPs called LapLKA-RKM, which involves the following three steps: 1) represent the protein sequence with a set of feature vectors, including Global Encoding (GE), the Multi-scale Continuous and Discontinuous descriptor (MCD), Normalized Moreau-Broto Auto Correlation (NMBAC), PSSM-based Discrete Wavelet Transform (PSSM-DWT), PSSM-based Average Blocks (PSSM-AB) and PSSM-Pse; 2) fuse these features by LapLKA (this process can be seen as feature selection); 3) use RKM to make the prediction. A brief architecture of LapLKA-RKM is shown in Figure 1. We conducted LOOCV and independent testing on PDB1075 and PDB2272, respectively. The prediction accuracy indicates that our method is an effective tool for DBP detection.
The contributions of our method include: 1) we propose an MKL algorithm, called LapLKA, which outperforms other MKL methods in handling multiple kernels; 2) we extend the RKM to a multiple kernel setting by weighting shared hidden features.

Datasets and experimental setup
Three protein datasets of different sizes were adopted in our study to test the ability of LapLKA-RKM to predict DBPs. These datasets were collected from the PDB, UniProt and Swiss-Prot databases, and are named PDB1075 [12], PDB14189 [31] and PDB2272 [31].
The dataset construction rule is as follows:

N = N+ + N-,

where N is the total number of samples, N+ is the number of DBP samples and N- is the number of non-DBP samples. We present a brief summary of the three datasets in Table 1. Sequences with sequence similarity greater than 25%, 25% and 40% were removed from PDB1075, PDB2272 and PDB14189, respectively. Leave-one-out cross-validation (LOOCV) and independent testing are conducted to show the ability of the predictor. We conduct LOOCV and 10-CV on PDB1075, because it is a small dataset and the running time is acceptable. To show the robustness, generalization and large-dataset ability of the models, we take PDB14189 as the training set and PDB2272 as the test set.

Feature extraction
A total of six sequence-based features are extracted from proteins: GE [47], MCD [48], NMBAC [49], PSSM-DWT [50], PSSM-AB [51] and PSSM-Pse [13,[52][53][54]. GE and MCD extract feature vectors from the amino acid composition of sequences. NMBAC describes six physicochemical properties of amino acids, namely Polarizability, Polarity, Solvent Accessible Surface Area, Hydrophobicity, Net Charge Index of Side Chains and Volume of Side Chains. PSSM-AB, PSSM-DWT and PSSM-Pse consider a protein's evolutionary information, which can be represented by the position-specific scoring matrix (PSSM) generated by PSI-BLAST [20]. The optimal parameters of NMBAC and PSSM-Pse follow a previous study [26]. These features are described in detail in the related literature.
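To illustrate how an autocorrelation descriptor of this family is computed, the sketch below implements a single-property normalized Moreau-Broto autocorrelation in Python. The property table, lag range and normalization shown here are illustrative assumptions, not the exact NMBAC settings of [49], which uses six properties rather than the one shown.

```python
import numpy as np

# Illustrative property table: one normalized physicochemical value per amino acid
# (Kyte-Doolittle-style hydrophobicity scale; NMBAC itself uses six properties).
HYDROPHOBICITY = {
    'A': 0.62, 'C': 0.29, 'D': -0.90, 'E': -0.74, 'F': 1.19,
    'G': 0.48, 'H': -0.40, 'I': 1.38, 'K': -1.50, 'L': 1.06,
    'M': 0.64, 'N': -0.78, 'P': 0.12, 'Q': -0.85, 'R': -2.53,
    'S': -0.18, 'T': -0.05, 'V': 1.08, 'W': 0.81, 'Y': 0.26,
}

def nmbac(sequence, prop=HYDROPHOBICITY, max_lag=30):
    """Normalized Moreau-Broto autocorrelation along one property scale."""
    p = np.array([prop[a] for a in sequence if a in prop])
    p = (p - p.mean()) / (p.std() + 1e-8)   # standardize the property values
    L = len(p)
    feats = []
    for d in range(1, max_lag + 1):         # one feature per sequence lag d
        feats.append(np.dot(p[:L - d], p[d:]) / (L - d))
    return np.array(feats)
```

In the full descriptor, this computation would be repeated for each of the six property scales and the resulting vectors concatenated.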
RKM is a kind of kernel method [55][56][57][58]. It maps data points from the input space to a feature space; the mapping is determined implicitly by a kernel function. Therefore, we need to construct kernel matrices as input to the RKM. Common kernel functions include the Linear, Polynomial and Radial Basis Function (RBF) kernels. Like other methods [27,38,[59][60][61], the RBF kernel is employed to construct kernels, and its formula is defined as:

K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2*sigma^2)),

where x_i and x_j are sample points and sigma is the kernel bandwidth. A predefined kernel set K = {K_GE, K_MCD, K_NMBAC, K_PSSM-DWT, K_PSSM-AB, K_PSSM-Pse} is then obtained.
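A minimal sketch of building the predefined RBF kernel set from several feature representations (the view names, dimensions and bandwidth below are illustrative placeholders, not the paper's actual feature matrices):

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-np.maximum(d2, 0) / (2 * sigma ** 2))

# One kernel matrix per feature representation (names/shapes are illustrative).
feature_views = {"GE": np.random.randn(8, 10), "MCD": np.random.randn(8, 12)}
kernels = [rbf_kernel(X, sigma=2.0) for X in feature_views.values()]
```

Each view yields an N x N kernel matrix over the same N proteins, which is what the MKL step below consumes.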

Laplacian Local Kernel Alignment Algorithm
Laplacian Local Kernel Alignment (LapLKA) is a kind of supervised Multiple Kernel Learning (MKL). As is well known, an appropriate kernel matrix is very important to the success of any kernel method [62]. However, choosing an appropriate kernel matrix is difficult for biological applications; a protein sequence can be described by different kernel matrices. To address this limitation, MKL was proposed [39]. MKL aims to combine a set of predefined kernels with linear weights so that the optimal kernel accurately represents a set of protein sequences. Let P be the number of predefined kernels and {K_1, ..., K_P} the kernel set. The optimal kernel K* is the linear combination of the kernel set:

K* = sum_{p=1}^{P} beta_p K_p,

where beta_p is the kernel mixture weight. Usually, the L1-norm is imposed to constrain the structure of beta: sum_{p=1}^{P} beta_p = 1 with beta_p >= 0. The main goal of the LapLKA algorithm is to determine the values of beta. There are two parts to LapLKA's learning strategy: local kernels and the inner relationship of global kernels. In previous studies [63][64][65], the kernel alignment score is calculated only in a global or a local manner. The global manner aims to maximize the alignment score between the whole optimal kernel and the ideal kernel, which may ignore the differences between similar samples. Contrary to the global manner, the local manner only considers sub-kernels constructed from sets of similar samples, so information across the whole sample set is missed. For this reason, we propose LapLKA, which integrates local kernel alignments with the global kernel alignment.
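The linear combination under the L1 (simplex) constraint can be sketched directly; the helper name and example weights here are ours:

```python
import numpy as np

def combine_kernels(kernels, beta):
    """K* = sum_p beta_p * K_p, with beta on the probability simplex (L1 constraint)."""
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0) and np.isclose(beta.sum(), 1.0)
    return sum(b * K for b, K in zip(beta, kernels))
```

The weights beta themselves are what LapLKA optimizes; here they are just supplied by hand.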
First, we define the kernel alignment function as follows:

A(P, Q) = <P, Q>_F / (||P||_F ||Q||_F),    (6)

where P and Q are positive definite matrices, and <.,.>_F and ||.||_F are the Frobenius inner product and Frobenius norm, respectively. The kernel alignment value is the cosine similarity between two kernels.
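The alignment function in Eq (6) can be sketched as below. The ideal kernel is not defined explicitly in the surviving text; it is commonly taken in the kernel-alignment literature to be y y^T built from the labels, which is the assumption made here:

```python
import numpy as np

def kernel_alignment(P, Q):
    """Cosine similarity of two kernel matrices under the Frobenius inner product."""
    num = np.sum(P * Q)                              # <P, Q>_F
    den = np.linalg.norm(P) * np.linalg.norm(Q)      # ||P||_F * ||Q||_F
    return num / den

# Common choice of ideal kernel from labels y in {-1, +1}: K_ideal = y y^T.
y = np.array([1, 1, -1, -1])
K_ideal = np.outer(y, y)
```

Note that alignment is scale-invariant: multiplying a kernel by a positive constant does not change its score.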
For the local manner, we maximize the alignment score between each local kernel and the ideal kernel. The local kernel is constructed from each sample and its neighbors. We select the indices of the k neighbor samples nearest to each sample, using the Euclidean distance in the input space as the measure of sample similarity. The set of neighbors of sample x_i is denoted N(x_i), and the local kernel about x_i can be represented as the sub-matrix of the kernel indexed by N(x_i).
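A sketch of selecting neighbor indices by Euclidean distance and extracting the corresponding sub-kernel; the helper names are ours, and the brute-force distance computation is only for illustration:

```python
import numpy as np

def neighbor_indices(X, k):
    """For each row of X, the index of the sample itself plus its k nearest neighbors."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    order = np.argsort(d2, axis=1)
    return order[:, :k + 1]          # column 0 is the sample itself (distance 0)

def local_kernel(K, idx_i):
    """Sub-kernel restricted to one sample's neighborhood."""
    return K[np.ix_(idx_i, idx_i)]
```

The local alignment score for sample i would then be Eq (6) applied to `local_kernel(K, idx[i])` and the matching sub-block of the ideal kernel.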

Restricted kernel machine
The Restricted Kernel Machine (RKM) classification model is a kind of kernel method [8]. It was proposed by Suykens [56], and its objective function is closely related to that of the Least Squares Support Vector Machine (LS-SVM) [74]. SVM is also a kernel method, and most methods [15,23,26,28,75] select SVM as the classifier. We choose RKM instead because it is easily extended to a deep framework, called Deep RKM [56], which can produce good results; we use RKM throughout the rest of this paper.
Let {(x_i, y_i)}, i = 1, ..., N, denote the training data, where x_i is the i-th input pattern and y_i in {-1, +1} is its label. We formulate a lower bound on the function in Eq (14), and then the objective function of RKM classification is obtained in Eq (15), where b is a bias term, lambda and eta are hyperparameters, and h_i is a hidden feature. The map function phi(.) maps x from the input space into a reproducing kernel Hilbert space. Hidden features are obtained by an inner pairing e^T h, where e is the classification error. The stationary points of the objective function Eq (15) in the primal formulation are characterized in Eq (16). By eliminating the weights w, the linear formulation Eq (17) is obtained, where I_N and 1_N are the identity matrix and an all-ones column vector, and the circled-dot operator is the element-wise product.
In this paper, we mainly focus on RKM-based MKL formulations. The final linear system of RKM-based MKL is given in Eq (18), which can be solved from the training data. The variables h and the bias term b are used to construct the classifier. For a test point x_t, the final decision function is given in Eq (19).
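Eqs (15)-(19) are not reproduced above, so the sketch below solves the closely related LS-SVM dual system instead, which the text notes is similar to the RKM objective. This is an illustrative stand-in for training with a (possibly fused) kernel matrix, not the authors' exact RKM formulation:

```python
import numpy as np

def lssvm_train(K, y, gamma=1.0):
    """Solve the LS-SVM dual linear system (a close relative of the RKM system):
       [0   y^T          ] [b    ]   [0]
       [y   Omega + I/gam] [alpha] = [1],  with Omega_ij = y_i y_j K_ij."""
    N = len(y)
    Omega = K * np.outer(y, y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.zeros(N + 1)
    rhs[1:] = 1.0
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                   # bias b, dual variables alpha

def lssvm_predict(K_test, y, alpha, b):
    """K_test: kernel between test and training points (n_test x N)."""
    return np.sign(K_test @ (alpha * y) + b)
```

In the multiple kernel setting, K would simply be the fused kernel K* from the MKL step.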

Evaluation measurements
Since the identification of DBPs is a binary classification problem, the following metrics are employed to measure the performance of the predictor:

ACC = (TP + TN) / (TP + TN + FP + FN),
SN = TP / (TP + FN),
SP = TN / (TN + FP),
MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

Here, TP is the number of DBPs that are correctly predicted to be DBPs; FN is the number of DBPs that are predicted to be non-DBPs; FP is the number of non-DBPs that are predicted to be DBPs; and TN is the number of non-DBPs that are correctly predicted to be non-DBPs. In addition, the ROC curve [76,77] and PR curve are also used to evaluate classification performance.
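These metrics can be computed directly from the confusion-matrix counts:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """ACC, SN, SP and MCC for a binary (+1 = DBP, -1 = non-DBP) prediction."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)                      # sensitivity (recall on DBPs)
    sp = tn / (tn + fp)                      # specificity (recall on non-DBPs)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, sn, sp, mcc
```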

Parameters selection
We tune parameters for best performance by 5-fold cross-validation (5-CV) and grid search on PDB1075. First, we find the optimal kernel bandwidth for the six types of kernels; the bandwidth of each single-kernel RKM is searched over the range 2^-5 to 2^5 in powers of 2. The results are shown in Table 2. Then, we select the parameters mu, lambda and eta from 2^-5 to 2^5 in powers of 2, and k from 10 to 50 with step 5. mu and k are parameters of LapLKA: mu weighs the relationship between the local manner and the global manner, and k is the number of neighbors for each sample. lambda and eta are regularization parameters in the RKM objective function.
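The grid search over powers of two with 5-CV can be sketched generically. The fold construction and the `train_and_score` callback interface here are our own illustration, not the paper's Matlab tuning code:

```python
import numpy as np
from itertools import product

def grid_search_5cv(X, y, train_and_score, param_grid, seed=0):
    """Exhaustive grid search with 5-fold CV; train_and_score returns an accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), 5)
    best = (None, -np.inf)
    for params in product(*param_grid.values()):
        p = dict(zip(param_grid.keys(), params))
        accs = []
        for i in range(5):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(5) if j != i])
            accs.append(train_and_score(X[train_idx], y[train_idx],
                                        X[test_idx], y[test_idx], **p))
        if np.mean(accs) > best[1]:
            best = (p, np.mean(accs))
    return best

# Candidate values on a log-2 grid, matching the paper's search range.
grid = {"sigma": [2.0 ** e for e in range(-5, 6)]}
```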
To demonstrate the parameter sensitivity of LapLKA, we study the variation of performance with mu and k while the RKM parameters are fixed. Figure 2 shows the ACC variation with mu and k on PDB1075; we can see that our method is not sensitive to mu and k, especially mu. Similarly, we study the parameter sensitivity of RKM with the LapLKA parameters fixed. The ACC variation with lambda and eta is shown in Figure 3. We can observe that lambda and eta are both sensitive parameters: with large values the ACC score is the lowest, and as lambda and eta decrease gradually, the 5-CV performance increases. The sensitivity of models to hyperparameters remains an open problem. Finally, we set k, mu, lambda and eta to 15, 2, 0.125 and 2, respectively.

Compared with single kernel
To analyze the performance of these kernels, we evaluate them in two experiments, as shown in Tables 3 and 4 and Figure 4. Results of LOOCV on PDB1075 are listed in Table 3 and Figure 4. Because LapLKA is a linear combination of the six types of kernels, it performs much better than any single kernel. In addition, the average scores of ACC, SN, SP, MCC and AUC for the kernels using PSSM information (PSSM-AB, PSSM-DWT, PSSM-Pse) are 76.28%, 79.49%, 73.21%, 0.5279 and 0.8439, respectively. The kernels using AAC information (GE, MCD) perform worst, with average scores of ACC 69.58%, SN 74.28%, SP 65.09%, MCC 0.3960 and AUC 0.7743. We can observe that models using PSSM information are better than those using other information; PSSM is an excellent feature extraction method because it captures the evolutionary relationship with other sequences. Results of the independent test on PDB2272 are listed in Table 4, which shows the same trend as Table 3: LapLKA achieves the best performance, and models using PSSM information outperform the others. In addition, PSSM-AB achieves the highest SP (65.33%) and the second highest ACC (77.33%), MCC (0.5601) and AUC (0.8656). The advantage of LapLKA is also reflected on PDB2272: the improvements over the best single kernel in ACC, SN, MCC and AUC are 2.2% (over PSSM-AB), 4.51% (over PSSM-DWT), 0.0663 and 0.0647, respectively.
The running time of RKM with different kernels is also evaluated; the results are presented in Table 5. RKM with multiple kernels is implemented in Matlab and runs on an Intel i7-10750H CPU with 16 GB RAM. As we can see, our method is the most time-consuming. This can be explained by comparing the time complexity of RKM with a single kernel against RKM with LapLKA-MKL. With a single kernel, the training time is dominated by calculating the kernel matrix, which costs O(N^2 d) for N samples with d-dimensional features, and solving the linear system, which costs O(N^3).

Compared with baseline methods
Compared with single kernels, LapLKA achieves an obvious advantage. To further demonstrate LapLKA's fusion capability, we compare it with BSK, FC, Comm and MV. Other MKL algorithms are also evaluated, including CKA, HSIC and FKL. In addition, we compare our method with other well-known classifiers, which are fed the concatenated multiple features for a fair comparison. Details of the baseline methods are as follows: • Best Single Kernel with RKM (BSK-RKM): the result of applying RKM to the best-performing single kernel.
• Feature Concatenation with RKM (FC-RKM): multiple features are concatenated and RKM is used for classification.
• Feature Concatenation with eXtreme Gradient Boosting (FC-XGBoost): multiple features are concatenated and XGBoost is used for classification. The XGBoost [78] algorithm is an ensemble learning model that produces a strong model by assembling decision trees.
• Feature Concatenation with Random Forest (FC-RF): multiple features are concatenated and RF is used for classification. RF [79] is a classification algorithm combining an ensemble of tree-structured classifiers.
• Feature Concatenation with K Nearest Neighbors (FC-KNN): multiple features are concatenated and KNN [80] is used for classification. KNN assigns a class label to a new data point based on its k nearest neighbors in the feature space.
• Committee RKM (Comm-RKM): each kernel is input to an RKM classifier separately, and the average of the multiple RKM outputs is taken as the final prediction.
• Multi-View RKM classification [55] (MV-RKM): MV-RKM is an extension of RKM classification that assumes shared hidden nodes over all different features. In its linear system, Eq (24), N_P is a column vector whose elements all equal P. From Eq (24), we can observe that MV-RKM can be seen as mean-weighted MKL based on RKM.
• Centered Kernel Alignment [26] with RKM (CKA-RKM): CKA is an MKL algorithm that estimates the optimal kernel weights by maximizing the cosine similarity between the optimal kernel and the ideal kernel. Unlike LapLKA, CKA only considers the global manner.
• Hilbert-Schmidt Independence Criterion [81] with RKM (HSIC-RKM): HSIC is an MKL algorithm that optimizes the kernel weights by maximizing the dependence between the optimal kernel and the ideal kernel. Its advantages are simple calculation and fast convergence.
• Fast Kernel Learning [82] with RKM (FKL-RKM): FKL is also an MKL algorithm; it finds the fusion weights by minimizing the Euclidean distance between the optimal kernel and the ideal kernel. Since the objective function of FKL is a quadratic program, it solves for the kernel weights quickly and effectively.
The hyperparameters of these fusion methods are selected by 5-CV and grid search on PDB1075. Table 6 and Figure 5 show all baseline methods and LapLKA on PDB1075 by LOOCV, and Table 8 shows the comparison between the baseline methods on PDB2272 by independent test. We can see that: 1) LapLKA has the best performance both in LOOCV on PDB1075 and in the independent test on the large dataset, indicating that LapLKA obtains the best optimal kernel for classification by effectively combining the multiple kernels; 2) MV, CKA, HSIC and FKL perform better than the typical fusion methods (BSK, FC and Comm) on PDB1075 by LOOCV, although on the independent test these MKL methods are slightly inferior to the typical fusion methods.
A good prediction method should have good generalization capability. In light of this, we report the uncertainties of our method and the baseline methods by 10-CV on PDB1075; the results are shown in Figure 6. According to the boxplot, our method is likely to produce similar results for different cross-validation splits, and it produces the highest mean ACC. Furthermore, we report statistical tests of the differences under 10-CV on PDB1075. Table 7 demonstrates that our method has a statistically significant improvement over the other baseline methods (P-value < 0.05 by t-test, in terms of ACC, for BSK-RKM, FC-RKM, FC-XGBoost, FC-RF, FC-KNN and CKA-RKM).

The kernel weights are shown in Figure 7. In the HSIC and LapLKA approaches, the weight of PSSM-Pse is the largest and that of NMBAC is close to 0. Additionally, the weights of kernels using AAC are usually lower than those of kernels using PSSM; for example, in LapLKA the sum of the weights of the PSSM kernels is 0.598 while that of the AAC kernels is 0.281. The single-kernel analysis above showed that models using PSSM information are better than those using other information, so we can conclude that LapLKA assigns low weights to noisy kernels.

We also compare our approach with other existing methods on PDB1075 by LOOCV and on PDB2272 by independent test, as shown in Tables 9 and 10, respectively. Our method achieves a high ACC of 85.77% (PDB1075 by LOOCV) and 79.5% (PDB2272 by independent test). On PDB1075, our method gains improvements of 1.12%, 2.26% and 0.03 in ACC, SN and MCC over the second best method, MV-H-RKM. MV-H-RKM enforces structural consistency between the input features and the hidden nodes through a hypergraph regularization term, and therefore also achieves good performance; however, like MV-RKM, it couples multiple features by means of the hidden vector, which means it cannot filter noisy features. HKAM-MKL also achieves good performance with ACC 84.28% and MCC 0.69. Similar to our method, HKAM-MKL considers both local and global kernel alignment in a hybrid kernel alignment model; unlike our method, its optimal kernel is input to an SVM.

Conclusions
In this paper, we developed an approach called LapLKA-RKM, a machine learning-based predictor for DBPs. Our method contains three steps: feature extraction, feature fusion and classifier construction. We apply six different feature extraction methods (MCD, GE, NMBAC, PSSM-AB, PSSM-DWT and PSSM-Pse) to represent the protein sequences. Then, we utilize LapLKA-MKL to combine multiple predefined kernels. Finally, we employ RKM as the predictive classifier.
Compared with other baseline methods and existing DBP predictors, our method achieves the best accuracy on different datasets by LOOCV and independent test. In LOOCV on PDB1075, LapLKA-RKM achieves the highest ACC, SN, SP, MCC and AUC of 85.77%, 89.90%, 81.82%, 0.72 and 0.9258, respectively. Further, our method was tested on PDB2272 via independent test and also achieves better performance, with ACC (79.5%), SN (96.6%), MCC (0.626) and AUC (0.9303). The results demonstrate that our method is an accurate tool for the identification of DBPs. We also built an online platform to present our model, and we hope its simple-to-use web interface will lead to wide adoption of our method.

Figure 2. Effect of the LapLKA parameters (the local-global trade-off weight and the number of neighbors k) on ACC, with the RKM parameters fixed, via 5-CV on PDB1075.
Figure 3. Effect of the RKM regularization parameters on ACC, with the LapLKA parameters fixed, via 5-CV on PDB1075.

Figure 4. The ROC and PR curves of different kernels (LOOCV).

Three steps are involved in RKM with LapLKA-MKL: calculating the P kernel matrices (O(P N^2 d)), running the MKL optimization, and solving a linear problem (O(N^3)); the additional kernel computations and the MKL step account for the extra running time.

Figure 5. The ROC and PR curves of different baseline methods.

Table 1. A summary of the three datasets used in this study.
by the labels of the related samples. The global kernel alignment information is introduced into Eq (8) by the Laplacian regularization term in Eq (9), where W_ij represents the kernel alignment value A(K_i, K_j). Equations (8) and (9) are then integrated into a single objective.

Table 2. The optimal parameters for each single-kernel RKM.

Table 4. Comparison with single kernels on PDB2272 (independent test).

Table 5. The running time of different kernels on PDB2272 (independent test).

Table 6. Comparison with baseline methods on PDB1075 (LOOCV).

Table 7. The statistics of different baseline methods on PDB1075 (10-CV).

Table 8. Performance comparison with other baseline methods on PDB2272 (independent test).
Figure 7. The weights of each kernel obtained by the different MKL methods (MV, CKA, HSIC, FKL and LapLKA) on PDB14189.

Table 9. Performance comparison with other existing methods on PDB1075 (LOOCV).

Table 10. Performance comparison with other existing methods on PDB2272 (independent test).