MLapSVM-LBS: Predicting DNA-binding proteins via a multiple Laplacian regularized support vector machine with local behavior similarity

DNA-binding proteins (DBPs) are of great significance in many basic cellular processes. Experiment-based methods for identifying DBPs are costly and time-consuming. To handle large-scale DBP identification tasks, a variety of computation-based methods have been developed. Inspired by previous work, we propose a multiple Laplacian regularized support vector machine with local behavior similarity (MLapSVM-LBS) to predict DBPs. We serially combine three features extracted from protein sequences (PsePSSM, GE and NMBAC) and feed them into MLapSVM-LBS. Based on human behavior learning theory, MLapSVM-LBS can better represent the relationships between samples through local behavior similarity. We introduce a new edge weight calculation method that takes label information into consideration. In addition, a local distribution parameter reflecting the underlying probability distribution of a sample's neighborhood is also employed. To further improve the robustness of the model, we utilize multiple Laplacian regularization to build a multigraph model in which five Laplacian graphs are constructed with local behavior similarity by changing the neighborhood size. To appraise the performance of our model, MLapSVM-LBS is trained and tested on the PDB186, PDB1075, PDB2272 and PDB14189 datasets. On the two independent test sets (PDB186 and PDB2272), our method reaches accuracies of 0.887 and 0.712, respectively. The good results on both datasets demonstrate the reliable performance of our model.


Introduction
DNA-binding proteins constitute a large class of proteins that physically attach to DNA. These proteins play an important role in a number of major cellular processes, including DNA transcription, replication, recombination and transposition. There are several experiment-based biological methods for identifying DBPs, such as nuclear magnetic resonance (NMR), X-ray diffraction crystallography, filter-binding assays, chromatin immunoprecipitation and the yeast one-hybrid system (Y1H). However, these traditional methods are highly costly and time-consuming. With the rapid development of biology, such methods cannot handle large-scale DBP identification.
In the biomedicine and bioinformatics fields, machine learning methods have been widely used and have obtained good results. Examples include O-GlcNAcylation site prediction [1], electron transport protein identification [2], protein remote homology detection [3], protein crystallization prediction [4], protein subcellular localization detection [5,6], drug-target interaction prediction [7][8][9], drug-drug interaction identification [10,11], and potential disease-associated microRNA detection [12][13][14][15][16][17]. To solve the problems mentioned above, machine learning methods that can reduce the considerable consumption of resources and time are commonly implemented to detect DBPs [18][19][20]. Some researchers have employed structural information of proteins to identify DBPs. Guy Nimrod et al. [21] built a random forest model for identifying DBPs via the average surface electrostatic potential, amino acid conservation pattern information and dipole moment. Based on a support vector machine, Bhardwaj et al. [22] utilized three types of features (overall charge, surface patches and composition features) to develop a predictive model. By combining structural and evolutionary information, Shahana Yasmin Chowdhury et al. [23] proposed a model called iDNAProt-ES to detect DBPs. Ahmad et al. [24] employed a neural network model to detect DBPs. Three types of features are fed into the model: the net charge of the protein, the electric dipole moment and the fourth-moment tensor.
The structural information of proteins is difficult to obtain, and most of it remains unknown. In practice, sequence-based methods are more effective in many applications. DNA methylation sites, recombination spots, and posttranslational modification (PTM) sites of proteins have all been detected by sequence-based methods. Proteins with similar sequences tend to have similar structures. Consequently, it is highly feasible to utilize sequence-based computational methods to identify DBPs. During the past decade, a massive number of sequence-based computational models have been developed to identify DBPs. Cai and Liu [25][26][27] extracted amino acid composition and pseudo amino acid composition (PseAAC) features from the protein sequence and fed them into a support vector machine. Based on the position-specific scoring matrix (PSSM), which is generated by the PSI-BLAST software [28], Kumar and Liu [29] established a predictor named DNAbinder. Leyi Wei et al. [18] developed a predictive model named Local-DPP by applying local conservation information of PSSM features to a random forest classifier. By means of Chou's five-step rule, Zou et al. [30] built a fuzzy kernel ridge regression model based on multiview sequence features (FKRR-MVSF) to detect DBPs. To take advantage of multiple protein sequence features, Ding et al. [31] proposed a multikernel support vector machine model through heuristic kernel alignment (MKSVM-HKA).
Inspired by previous work, we develop a novel DBP identification model called the multiple Laplacian regularized support vector machine with local behavior similarity (MLapSVM-LBS). Based on human behavior learning theory, we utilize local behavior similarity (LBS) to construct the adjacency matrix of samples. Specifically, we apply label information to the edge weight calculation and introduce a local distribution parameter to reflect the underlying probability distribution of a sample's neighborhood. Furthermore, we construct five Laplacian graphs by changing the number of nearest neighbors k to make the model less sensitive to the neighborhood size. Moreover, we combine three features extracted from the protein sequence, namely PsePSSM, GE, and NMBAC, and feed them into the MLapSVM-LBS model. Compared with other methods, our model achieves better results, reaching accuracies of 0.887 and 0.712 on PDB186 and PDB2272, respectively.
The contributions of our study include: (1) We employ an adjacency matrix calculation method named local behavior similarity (LBS). Combined with human cognitive characteristics, LBS can better characterize the relationships between samples. (2) We propose a multiple Laplacian regularized LapSVM to make the model less sensitive to the number of nearest neighbors k. (3) An iterative optimization method is utilized to solve the objective function.

Method and materials
DBP prediction is a classical binary classification problem. To precisely detect DBPs, it is important to utilize effective feature extraction methods and to train a model with good generalization. In our work, we utilize three feature extraction methods to represent protein features: the pseudo position-specific scoring matrix (PsePSSM) [26], normalized Moreau-Broto autocorrelation (NMBAC) [32] and global encoding (GE) [33]. Based on human behavior learning theory, we utilize local behavior similarity (LBS), which applies label information to the edge weight calculation and introduces a local distribution parameter to reflect the underlying distribution of a sample's neighborhood when constructing the adjacency matrix. To further enhance the performance of our model, multiple Laplacian regularization built with LBS is employed to develop the multigraph LapSVM model. The framework of our method is shown in Fig. 1.

Benchmark datasets
In our work, four datasets, PDB1075, PDB186, PDB14189 and PDB2272, are employed to test the performance of our model. The details of the datasets are listed in Table 1. PDB1075 and PDB186 were obtained from Liu [25] and Lou [34], respectively. PDB14189 and PDB2272 were constructed by Du [35]. All datasets were selected from the Protein Data Bank (PDB).

Related work
The Laplacian support vector machine (LapSVM) [36] is a well-performing semi-supervised learning algorithm proposed by Mikhail Belkin. It successfully applies manifold regularization, which captures the geometric information of both labeled and unlabeled samples, to the support vector machine (SVM) [37].
Let us consider the labeled sample set $\{(x_i, y_i)\}_{i=1}^{l}$ and the unlabeled sample set $\{x_i\}_{i=l+1}^{l+u}$, where $x_i \in \mathbb{R}^d$ and the labels $y_i \in \{+1, -1\}$.
LapSVM uses a reproducing kernel Hilbert space (RKHS) as its hypothesis space. The objective function of LapSVM, containing the kernel and loss terms, can be defined as follows:
$$f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l}\sum_{i=1}^{l} \max\big(0,\, 1 - y_i f(x_i)\big) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(l+u)^2}\, \mathbf{f}^T L \mathbf{f}, \tag{1}$$
where the first term is the hinge loss over the $l$ labeled samples, $\|f\|_K^2$ is the standard regularization in the RKHS aiming to maintain the smoothness of the solution, $L$ is the graph Laplacian matrix, $\mathbf{f} = [f(x_1), \ldots, f(x_{l+u})]^T$, and $\gamma_A$ and $\gamma_I$ are the accommodation coefficients.

By the representer theorem, the solution can be written as
$$f^*(x) = \sum_{i=1}^{l+u} \alpha_i K(x, x_i). \tag{2}$$
Therefore, the LapSVM optimization problem is equivalent to
$$\min_{\alpha, \xi, b}\; \frac{1}{l}\sum_{i=1}^{l} \xi_i + \gamma_A\, \alpha^T K \alpha + \frac{\gamma_I}{(l+u)^2}\, \alpha^T K L K \alpha \quad \text{s.t. } y_i\Big(\sum_{j=1}^{l+u}\alpha_j K(x_i, x_j) + b\Big) \geq 1 - \xi_i,\; \xi_i \geq 0, \tag{3}$$
where $K$ is the kernel matrix of both labeled and unlabeled samples.

Introducing the Lagrangian multipliers $\beta_i$ and $\zeta_i$ gives
$$L_g(\alpha, \xi, b, \beta, \zeta) = \frac{1}{l}\sum_{i=1}^{l}\xi_i + \frac{1}{2}\alpha^T\Big(2\gamma_A K + \frac{2\gamma_I}{(l+u)^2} K L K\Big)\alpha - \sum_{i=1}^{l}\beta_i\Big(y_i\Big(\sum_{j=1}^{l+u}\alpha_j K(x_i, x_j) + b\Big) - 1 + \xi_i\Big) - \sum_{i=1}^{l}\zeta_i \xi_i. \tag{4}$$
Setting the first-order partial derivatives with respect to $b$ and $\xi_i$ to zero in Eq. (4) yields
$$\sum_{i=1}^{l}\beta_i y_i = 0, \qquad \frac{1}{l} - \beta_i - \zeta_i = 0 \;\Rightarrow\; 0 \leq \beta_i \leq \frac{1}{l}. \tag{5}$$
We integrate Eq. (4) and Eq. (5) to obtain the simplified expression
$$L_g(\alpha, \beta) = \frac{1}{2}\alpha^T\Big(2\gamma_A K + \frac{2\gamma_I}{(l+u)^2} K L K\Big)\alpha - \alpha^T K J^T Y \beta + \sum_{i=1}^{l}\beta_i, \tag{6}$$
where $J = [I \; 0]$ is an $l \times (l+u)$ matrix, $I$ is an $l \times l$ identity matrix, and $Y = \mathrm{diag}(y_1, y_2, \ldots, y_l)$.

Setting the derivative with respect to $\alpha$ to zero,
$$\frac{\partial L_g}{\partial \alpha} = \Big(2\gamma_A K + \frac{2\gamma_I}{(l+u)^2} K L K\Big)\alpha - K J^T Y \beta = 0,$$
we can obtain the optimal solution
$$\alpha^* = \Big(2\gamma_A I + \frac{2\gamma_I}{(l+u)^2} L K\Big)^{-1} J^T Y \beta. \tag{7}$$
Substituting $\alpha^*$ into Eq. (6), the Lagrangian dual problem can be obtained:
$$\max_{\beta}\; \sum_{i=1}^{l}\beta_i - \frac{1}{2}\beta^T Q \beta \quad \text{s.t. } \sum_{i=1}^{l}\beta_i y_i = 0,\; 0 \leq \beta_i \leq \frac{1}{l}, \tag{8}$$
where $Q = Y J K \big(2\gamma_A I + \frac{2\gamma_I}{(l+u)^2} L K\big)^{-1} J^T Y$.

Algorithm 1 Procedure of LapSVM.

Input: Labeled samples $\{(x_i, y_i)\}_{i=1}^{l}$ and unlabeled samples $\{x_i\}_{i=l+1}^{l+u}$; coefficients $\gamma_A$ and $\gamma_I$.
1: Calculate the edge weights and construct the adjacency graph over the labeled and unlabeled samples;
2: Compute the graph Laplacian matrix $L$;
3: Build the kernel matrix $K$ for both labeled and unlabeled samples;
4: Utilize the standard SVM algorithm to solve the quadratic programming problem and obtain $\beta^*$, then compute $\alpha^*$;
Output: The decision function $f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* K(x, x_i)$.
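As a concrete sketch of Algorithm 1, the following NumPy snippet builds the heat-kernel graph, the Laplacian and the kernel matrix, then solves for the expansion coefficients. To stay self-contained it substitutes a squared loss for the hinge loss (the LapRLS variant of manifold regularization), so step 4 becomes a single linear solve instead of a quadratic program; all hyperparameter values are illustrative, not those tuned in the paper.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def heat_kernel_graph(X, t=1.0):
    # W_ij = exp(-||x_i - x_j||^2 / (4t)), the adjacency used by standard LapSVM
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (4.0 * t))
    np.fill_diagonal(W, 0.0)
    return W

def lap_rls(X, y_labeled, gamma_A=0.1, gamma_I=0.1, gamma_k=0.5):
    """Manifold-regularized classifier on l labeled + u unlabeled points.
    A squared loss replaces the hinge loss of LapSVM, so alpha* comes from
    one linear system (LapRLS); the two regularizers are the same."""
    n, l = X.shape[0], len(y_labeled)
    K = rbf_kernel(X, X, gamma_k)
    W = heat_kernel_graph(X)
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian L = D - W
    J = np.zeros((n, n)); J[:l, :l] = np.eye(l)    # selects the labeled points
    y = np.zeros(n); y[:l] = y_labeled
    A = J @ K + l * gamma_A * np.eye(n) + l * gamma_I / n**2 * (L @ K)
    alpha = np.linalg.solve(A, y)
    return lambda Z: np.sign(rbf_kernel(Z, X, gamma_k) @ alpha)
```

With two well-separated clusters, a few labels per cluster are enough for the unlabeled points to be classified by their cluster membership, which is exactly the manifold assumption the regularizer encodes.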

Local behavior similarity
In many practical applications, people tend to comprehensively utilize their experience, including supervised experience obtained from teachers and passive experience obtained from nature, to make classification decisions. During the learning process, people reason about the concept mechanism, and the learning task is terminated once they determine the concept category. This is in line with the central idea of semi-supervised learning algorithms. Consequently, it makes sense to study how humans build conceptual dividing lines based on both labeled and unlabeled samples [38].
According to the LapSVM solving process, a critical step is to work out the Laplacian graph L, which in essence amounts to calculating the adjacency matrix W. Accordingly, the quality of W has a decisive impact on classification performance and efficiency.
Traditional LapSVM employs the heat kernel function, $W_{ij} = e^{-\|x_i - x_j\|^2/4t}$, to calculate $W$. However, it has several shortcomings that degrade the model performance. Assume that Fig. 2 shows a dataset with two classes of samples. Triangles and circles represent samples of different classes. Filled dots and hollow dots represent labeled and unlabeled samples, respectively. $x_1$, $x_2$, $x_3$ and $x_4$ are all in the neighborhood of $x_0$, and they share the same distance from $x_0$. Thus, in LapSVM, the edge weights between $x_0$ and each of $x_1$, $x_2$, $x_3$, $x_4$ are identical. Obviously, this does not conform to human cognitive characteristics. People naturally prioritize the label information of data when making decisions. We prefer to place objects with identical labels in the same category even if their features are not very similar, and to distinguish objects of different classes even though they bear a strong resemblance in features. In Fig. 2, the label of $x_0$ is the same as that of $x_1$ and different from that of $x_2$; we can naturally conclude that $x_0$ and $x_1$ have a higher similarity. The density and distance information of the feature-space distribution is implied in the sample labels. Thus, we can better depict the distribution of samples by introducing label information into the adjacency matrix construction. Combined with human cognitive characteristics, the edge weight based on behavior similarity, $W_{ij}^{BS}$, is defined piecewise in Eq. (10) over three cases: $y_i = y_j$, $y_i \neq y_j$, and all other situations (at least one of the two samples is unlabeled). The base weight is $W_{ij} = e^{-d(x_i, x_j)/2}$. Fig. 3 is a graphical presentation of Eq. (10), where the red line denotes $y_i = y_j$, the purple line represents $y_i \neq y_j$, and the black line represents the other situations. As shown in Fig. 3, under the assumption that the distance between samples is the same, the edge weight is clearly divided into three areas based on label information. The intent that the edge weight of samples with identical labels be larger than that of samples belonging to different classes is thus well achieved.

The heat kernel function focuses only on the pair of samples itself, ignoring their neighbors. Apparently, this is also contrary to human cognitive characteristics. When people measure the interrelationship of objects, they do not merely contemplate the objects themselves; the surrounding environment also makes an indispensable difference in decision-making. People often take the local distribution of the feature space into account to make correct decisions. Based on such cognitive characteristics, the local view distance from $x_i$ to $x_j$ is defined as
$$d(x_i, x_j)/\rho_i, \qquad \rho_i = \frac{1}{N_k}\sum_{x_p \in \mathcal{N}(x_i)} d(x_i, x_p), \tag{11}$$
where $\rho_i$ denotes the local distribution parameter of $x_i$; in other words, $\rho_i$ is the average distance between $x_i$ and its $N_k$ nearest neighbors $\mathcal{N}(x_i)$, where $N_k$ is the number of neighbors. The local view distance is based on human cognitive features and can better represent the neighborhood distribution.

The local view distance from sample $x_j$ to sample $x_i$ is $d(x_j, x_i)/\rho_j$. Integrating the local view distances of both, the mutual distance between $x_i$ and $x_j$ is obtained by symmetrically combining $d(x_i, x_j)/\rho_i$ and $d(x_j, x_i)/\rho_j$ (Eq. (12)). Finally, this mutual distance replaces the distance measurement and the kernel parameter $t$ in the heat kernel function, yielding the edge weight of Eq. (13).
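A small NumPy sketch of the local distribution parameter and the mutual local view distance described above. Two caveats: the averaged symmetrization of the two local view distances is an assumption, and the `boost`/`damp` label factors are hypothetical stand-ins for the exact piecewise values of Eq. (10) (same-label edges strengthened, different-label edges weakened, edges involving unlabeled samples left unchanged).

```python
import numpy as np

def local_distribution_params(D, k):
    """rho_i: average distance from x_i to its k nearest neighbors,
    computed from a pairwise Euclidean distance matrix D."""
    Dd = D.copy()
    np.fill_diagonal(Dd, np.inf)                  # exclude self from the neighbor list
    return np.sort(Dd, axis=1)[:, :k].mean(axis=1)

def behavior_similarity(X, y, k=4, boost=1.5, damp=0.5):
    """Behavior-similarity adjacency sketch. y holds +1/-1 for labeled
    samples and 0 for unlabeled ones; boost/damp are illustrative."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    rho = local_distribution_params(D, k)
    Dv = D / rho[:, None]                         # local view distance d(x_i, x_j)/rho_i
    Dm = 0.5 * (Dv + Dv.T)                        # symmetrized mutual distance (assumed form)
    W = np.exp(-Dm / 2.0)
    same = (y[:, None] == y[None, :]) & (y[:, None] != 0)   # both labeled, same class
    diff = (y[:, None] * y[None, :]) == -1                  # both labeled, opposite classes
    W = np.where(same, boost * W, np.where(diff, damp * W, W))
    np.fill_diagonal(W, 0.0)
    return W
```

Because both the mutual distance and the label masks are symmetric, the resulting adjacency matrix stays symmetric, as required for building a valid graph Laplacian.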

MLapSVM-LBS
In the case of local behavior similarity, if the dataset is fixed, the key factor that changes the edge weight is the number of nearest neighbors k. However, the optimal value of k relies heavily on the neighborhood distribution of a sample. Moreover, it is difficult to manually set a proper k for different datasets. A value of k that is too small may provide insufficient neighborhood information, whereas outliers are included among the k nearest neighbors if k far exceeds the local neighborhood size of the samples. As a result, it is preferable to combine the information of various neighborhood sizes rather than tuning k as a fixed value.
In this paper, we construct five Laplacian graphs with different numbers of nearest neighbors ($k \in \{2, 4, 8, 16, 32\}$) to make the model less sensitive to the neighborhood size and apply the resulting multi-Laplacian regularization within the LapSVM framework.
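The construction of the five graphs can be sketched as follows; for brevity this uses a plain symmetrized kNN graph with heat-kernel weights in place of the LBS weights, so it illustrates only the multi-neighborhood-size idea.

```python
import numpy as np

def knn_laplacian(X, k):
    """Graph Laplacian L = D - W on a symmetrized k-nearest-neighbor graph
    with heat-kernel edge weights (a plain-kNN stand-in for the LBS graph)."""
    n = len(X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(D2, axis=1)[:, 1:k + 1]      # k nearest neighbors, self excluded
    W = np.zeros((n, n))
    for i in range(n):
        W[i, idx[i]] = np.exp(-D2[i, idx[i]] / 2.0)
    W = np.maximum(W, W.T)                         # symmetrize the kNN graph
    return np.diag(W.sum(axis=1)) - W

def build_laplacians(X, sizes=(2, 4, 8, 16, 32)):
    # one Laplacian graph per neighborhood size, as in MLapSVM-LBS
    return [knn_laplacian(X, k) for k in sizes]
```

Each Laplacian has zero row sums by construction, which is the defining property the multigraph regularizer relies on.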

Formulation
Different graphs represent various distribution information in the neighborhoods of samples, each of which makes a diverse contribution to the MLapSVM-LBS model. Let $\eta = [\eta_1, \eta_2, \ldots, \eta_V]^T \in \mathbb{R}^{V \times 1}$ be the nonnegative weight vector used to integrate the Laplacian matrices, where $V$ is the number of matrices. The formula of the fused Laplacian matrix is as follows:
$$L^* = \sum_{v=1}^{V} \eta_v L_v, \quad \text{s.t. } \sum_{v=1}^{V} \eta_v = 1,\; \eta_v \geq 0. \tag{14}$$
Integrating the multigraph regularization, the optimization problem of MLapSVM-LBS is as follows:
$$\min_{f, \eta}\; \frac{1}{l}\sum_{i=1}^{l} \max\big(0,\, 1 - y_i f(x_i)\big) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(l+u)^2}\sum_{v=1}^{V} \eta_v^e\, \mathbf{f}^T L_v \mathbf{f} \quad \text{s.t. } \sum_{v=1}^{V} \eta_v = 1,\; \eta_v \geq 0. \tag{15}$$
In certain situations, the weight of a single Laplacian matrix approaches 0 or 1, and the model cannot exploit the complementarity of the multiple graphs. To prevent this problem, we employ a technique in which $\eta_v$ is replaced with $\eta_v^e$, where $e > 1$.
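The fused matrix with the $\eta_v^e$ reweighting is a direct weighted sum, as the following one-function sketch shows:

```python
import numpy as np

def fused_laplacian(etas, laplacians, e=2.0):
    """Weighted combination sum_v eta_v^e * L_v used by the multigraph
    regularizer; the exponent e > 1 keeps the learned weights away from
    the degenerate 0/1 extremes."""
    return sum(w ** e * L for w, L in zip(etas, laplacians))
```

For example, with two graphs weighted 0.5 each and e = 2, every graph contributes with coefficient 0.25.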

Optimization
To solve the objective function in Eq. ( 15), an effective alternation algorithm is utilized.The overview of MLapSVM-LBS is listed in Algorithm 2.
First, we fix $\eta_v = 1/V$ and optimize $f^*$. Given $\eta$, the objective function takes the same form as that of traditional LapSVM with the fused Laplacian matrix, so $f^*$ can be obtained according to the solution of LapSVM.
Second, we fix $f^*$ and optimize $\eta$. The objective function reduces to
$$\min_{\eta}\; \sum_{v=1}^{V} \eta_v^e\, \mathbf{f}^T L_v \mathbf{f} \quad \text{s.t. } \sum_{v=1}^{V} \eta_v = 1,\; \eta_v \geq 0. \tag{16}$$
Introducing the Lagrange multiplier $\xi$, the problem above is converted to the Lagrange function
$$L(\eta, \xi) = \sum_{v=1}^{V} \eta_v^e\, \mathbf{f}^T L_v \mathbf{f} - \xi\Big(\sum_{v=1}^{V} \eta_v - 1\Big). \tag{17}$$
Setting the derivatives with respect to $\eta_v$ and $\xi$ to 0,
$$e\,\eta_v^{e-1}\, \mathbf{f}^T L_v \mathbf{f} - \xi = 0, \qquad \sum_{v=1}^{V} \eta_v = 1, \tag{18}$$
$\eta_v$ can be obtained by the following equation:
$$\eta_v = \frac{\big(1/\mathbf{f}^T L_v \mathbf{f}\big)^{1/(e-1)}}{\sum_{u=1}^{V} \big(1/\mathbf{f}^T L_u \mathbf{f}\big)^{1/(e-1)}}. \tag{19}$$

Algorithm 2 Algorithm of MLapSVM-LBS.
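The closed-form weight update of Eq. (19) is a one-liner in NumPy; it assumes every $\mathbf{f}^T L_v \mathbf{f}$ is strictly positive (otherwise the inverse power is undefined).

```python
import numpy as np

def update_graph_weights(f, laplacians, e=2.0):
    """Closed-form minimizer of sum_v eta_v^e * (f^T L_v f) subject to
    sum_v eta_v = 1, eta_v >= 0: graphs on which f varies less (smaller
    smoothness penalty) receive larger weights."""
    s = np.array([f @ L @ f for L in laplacians])
    w = s ** (-1.0 / (e - 1.0))
    return w / w.sum()
```

As a sanity check of the math (using scaled identity matrices purely as stand-ins for Laplacians): with penalties 3 and 12 and e = 2, the weights are proportional to 1/3 and 1/12, i.e. 0.8 and 0.2.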

Input: Training set $\{(x_i, y_i)\}_{i=1}^{l}$ and test set $\{x_i\}_{i=l+1}^{l+u}$; the maximum number of iterations $t_{max}$; coefficients $\gamma_A$ and $\gamma_I$.
1: Use Eq. (10) and Eq. (13) to calculate the edge weights $W_{ij}^{BS}$ under the different neighborhood sizes $k$;
2: Construct the Laplacian matrices $L_v$, $v = 1, \ldots, V$;
3: Initialize $\eta_v = 1/V$;
4: for $t = 1$ to $t_{max}$ do
5: Compute the fused Laplacian matrix $\sum_{v=1}^{V} \eta_v^e L_v$;
6: Calculate $f^{*(t)}$ according to the solution of LapSVM;
7: Update $\eta_v$, $v = 1, \ldots, V$, via Eq. (19);
8: end for
Output: The decision function $f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* K(x, x_i)$.

Evaluation measurements
In this work, we employ accuracy (ACC), Matthew's correlation coefficient (MCC), specificity (SP), sensitivity (SN), and the area under the receiver operating characteristic curve (AUC) to measure the performance of our method. ACC, SN, SP and MCC are calculated as follows:
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \quad SN = \frac{TP}{TP + FN}, \quad SP = \frac{TN}{TN + FP},$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
where TP and TN denote the numbers of true positive and true negative samples, and FP and FN denote the numbers of false positive and false negative samples.
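The four threshold-based measures can be computed directly from the confusion-matrix counts:

```python
import math

def dbp_metrics(TP, TN, FP, FN):
    """ACC, SN, SP and MCC from confusion-matrix counts."""
    acc = (TP + TN) / (TP + TN + FP + FN)
    sn = TP / (TP + FN)      # sensitivity: fraction of true DBPs recovered
    sp = TN / (TN + FP)      # specificity: fraction of non-DBPs recovered
    mcc = (TP * TN - FP * FN) / math.sqrt(
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return acc, sn, sp, mcc
```

For instance, 50 true positives, 40 true negatives, 10 false positives and no false negatives give ACC = 0.9, SN = 1.0, SP = 0.8 and MCC ≈ 0.816.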

Feature combination
In our study, we consider six types of features extracted from the protein sequence: GE, NMBAC, MCD, PSSM-AB, PsePSSM and PSSM-DWT. To obtain the best feature combination, single features and feature fusions are compared on PDB1075 under LOOCV using traditional LapSVM. The comparison results are listed in Table 2. As shown in Table 2, LapSVM achieves the best performance (ACC: 0.7598, MCC: 0.52) with the fused features (GE, NMBAC, PsePSSM). Consequently, we utilize GE, NMBAC and PsePSSM to train and test our model.
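Serial combination here simply means column-wise concatenation of the per-protein feature vectors. The snippet below illustrates this; the dimensionalities and random values are placeholders, not the actual PsePSSM/GE/NMBAC sizes.

```python
import numpy as np

# Serial (column-wise) fusion of the three selected sequence features.
# Dimensions are illustrative placeholders, one row per protein.
n_proteins = 1075
pse_pssm = np.random.rand(n_proteins, 200)   # PsePSSM vectors (placeholder size)
ge       = np.random.rand(n_proteins, 80)    # global encoding (placeholder size)
nmbac    = np.random.rand(n_proteins, 60)    # NMBAC (placeholder size)
fused = np.hstack([pse_pssm, ge, nmbac])     # fused matrix fed to MLapSVM-LBS
```

The fused matrix has one row per protein and as many columns as the three feature dimensionalities combined.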

Parameter selection
In our work, the grid search method is utilized to obtain the optimal parameters. There are four parameters ($C$, $\gamma$, $\gamma_A$, $a = \gamma_I/(l+u)^2$) in our model, and we tune them on PDB1075 by fivefold cross-validation (5-CV). The values of $C$ and $\gamma$ range from $2^{-5}$ to $2^{5}$ in multiplicative steps of $2^{1}$. We set the ranges of $\gamma_A$ and $a$ from 0.1 to 0.9 with step 0.1. For MLapSVM-LBS, the obtained optimal values of $C$, $\gamma$, $\gamma_A$ and $a$ are 8, 0.5, 0.9 and 1, respectively.
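The grid described above can be enumerated exhaustively; in this sketch, `evaluate` is a placeholder for the 5-CV accuracy of the model under a given parameter setting.

```python
import itertools

# Search grids mirroring the text: C and gamma over {2^-5, ..., 2^5},
# gamma_A and a over {0.1, ..., 0.9}.
C_GRID  = [2.0 ** p for p in range(-5, 6)]
G_GRID  = [2.0 ** p for p in range(-5, 6)]
GA_GRID = [round(0.1 * i, 1) for i in range(1, 10)]
A_GRID  = [round(0.1 * i, 1) for i in range(1, 10)]

def grid_search(evaluate):
    """Exhaustive search; `evaluate(C, g, gA, a)` returns a validation
    score (e.g. 5-fold cross-validated accuracy) to be maximized."""
    best, best_score = None, float("-inf")
    for C, g, gA, a in itertools.product(C_GRID, G_GRID, GA_GRID, A_GRID):
        score = evaluate(C, g, gA, a)
        if score > best_score:
            best, best_score = (C, g, gA, a), score
    return best, best_score
```

Because the grids are small (11 × 11 × 9 × 9 = 9801 settings), exhaustive enumeration is cheap relative to the cross-validation runs themselves.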
To further prove the effectiveness of our model, we also evaluated it on PDB1075 and compared its results with those of traditional LapSVM.

Performance analysis on PDB186
PDB1075 and PDB186 are employed as the training and test sets, respectively, to further test the reliability of our model. The sequence similarity between the proteins in the test set and those in the training set has a great influence on the prediction results. We removed proteins in PDB1075 that bear more than 25% similarity to any protein in PDB186 and rebuilt the model on the reduced PDB1075. The test results are listed in Table 6.

Performance analysis on PDB14189
To evaluate the robustness of our method, we also test our model on a large dataset (PDB14189) by five-fold cross-validation. The comparison results are shown in Table 7. Compared with the evaluated methods, MLapSVM-LBS achieves the best results. The values of ACC, SN, SP, MCC and AUC are 0.8451, 0.8312, 0.8681, 0.65 and 0.917, respectively. This better performance shows the effectiveness of the proposed model.

Performance analysis on PDB2272
We also remove proteins in PDB14189 with more than 25% similarity to any protein in PDB2272 and rebuild the model on the reduced PDB14189. On this independent test, MLapSVM-LBS achieves an accuracy of 0.712.

Conclusion and discussion
In this work, we utilize three protein features, PsePSSM, NMBAC and GE, to represent proteins and build the MLapSVM-LBS model for detecting DBPs. Compared with standard LapSVM, the Laplacian matrix of MLapSVM-LBS is built with local behavior similarity (LBS), which is inspired by human behavior learning theory. In detail, MLapSVM-LBS takes advantage of label information when calculating edge weights and introduces a local distribution parameter to reflect the underlying probability distribution of a sample's neighborhood. In addition, to improve predictive performance, MLapSVM-LBS applies multiple Laplacian regularization by changing the neighborhood size, which makes full use of the geometric distribution of samples. Our method reaches accuracies of 0.887 and 0.712 on PDB186 (the independent test set of PDB1075) and PDB2272 (the independent test set of PDB14189), respectively.
Although a number of computation-based methods have been proposed to detect DBPs, the performance of existing methods can still be improved. Similar to many other sequence-based approaches, our method does not consider noise. In future work, we will utilize other graph-based models [45,46], density-based methods and fuzzy theory to improve the performance of our model.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Sun: Initial methodology, Experimental part, Writing - original draft, Proofreading. Prayag Tiwari: Methodological part, Experimental part, Writing - original draft, Proofreading. Yuqin Qian: Improved the methodological part, Experimental part, Writing - original draft, Proofreading. Yijie Ding: Improved the methodological part, Experimental part, Writing - original draft, Proofreading. Quan Zou: Improved the methodological part and helped in the experimental part, Writing - original draft, Proofreading.

Table 1
The details on four benchmark datasets.

Table 6
The results of comparison between MLapSVM-LBS and previous methods on PDB186 (independent test of reduced PDB1075).
a ''-'' represents that the value is not available.b Proteins in PDB1075 with more than 25% similarity to any protein in PDB186 are removed.