Deep Non-Parallel Hyperplane Support Vector Machine for Classification

In the last few decades, deep learning based on neural networks has become popular for classification tasks, since it combines feature extraction with classification and usually achieves satisfactory performance. The non-parallel hyperplane support vector machine (NPHSVM) constructs two non-parallel hyperplanes to classify data, and extracted features are usually used as its input, so the quality of the extracted features greatly influences its performance. Therefore, in this paper we propose a novel deep non-parallel hyperplane support vector machine (DNHSVM) for classification, which seamlessly combines deep feature extraction with the construction of the hyperplanes. Each hyperplane is close to its own class and as far as possible from the other class, and the extracted deep features are friendly for classification, so samples become easy to classify. Experiments on UCI datasets show the effectiveness of the proposed method, which outperforms the compared state-of-the-art algorithms.


I. INTRODUCTION
The support vector machine (SVM) [1] achieves relatively good performance in many applications. It constructs a separating hyperplane following the structural risk minimization principle. In recent decades, several variants of SVM have been proposed, such as the least squares SVM (LSSVM) [2], the pinball SVM (Pin-SVM) [3], robust support vector classifiers (RSVC) [4] and so on.
However, SVM requires two parallel supporting hyperplanes, which becomes a limitation in some learning tasks. Consequently, Mangasarian et al. [5] proposed the generalized eigenvalue proximal support vector machine (GEPSVM), which constructs two non-parallel hyperplanes by solving generalized eigenvalue problems. After that, Tian et al. presented the non-parallel support vector machine for pattern classification [6], in which the ϵ-insensitive loss function replaces the quadratic loss function in the two primal problems. Li et al. [7] proposed the robust L1-norm non-parallel proximal support vector machine, which uses an iterative technique to solve a pair of L1-norm optimization problems, since the L1 norm is robust to outliers. The twin support vector machine (TSVM) [8] was presented by Jayadeva et al.; it constructs two non-parallel hyperplanes by solving two small quadratic programming problems (QPPs). In the last few years, many variants of TSVM have also been proposed. For example, Gao et al. proposed the L1-norm least squares twin support vector machine [9], which can automatically select relevant features. Qi et al. [10] presented the structural twin support vector machine for classification, which fully exploits prior structural information to improve the classification accuracy. Wang et al. proposed the robust capped L1-norm twin support vector machine [11], which is more robust to outliers, and designed a simple and efficient algorithm to solve the resulting objective.

Xie et al. [12] proposed the multitask twin support vector machine, which learns multiple related tasks simultaneously and extends TSVM to multitask learning. They also proposed the multitask centroid twin support vector machine [13], which overcomes the disadvantage that TSVM is sensitive to outliers. In addition, Xie et al. extended TSVM to multi-view learning, including general multi-view semi-supervised least squares support vector machines with multi-manifold regularization [14], multi-view twin support vector machines [15], multi-view support vector machines with consensus and complementary information [16], multi-view semi-supervised least squares twin support vector machines with manifold-preserving graph reduction [17] and so on. Shao et al. [18] proposed an efficient weighted Lagrangian twin support vector machine for imbalanced data classification. They introduced a graph-based undersampling strategy to preserve the proximal information, which is robust to outliers; in addition, weight biases are embedded in the Lagrangian TSVM formulation, which overcomes the bias phenomenon in imbalanced data classification. Li et al. [19] proposed the generalized elastic net Lp-norm non-parallel support vector machine, in which the Lp norm measures the distance of samples to each hyperplane and an appropriate p can be chosen to achieve the desired performance. Li et al. [20] proposed domain adaptive twin support vector machine learning using privileged information, together with an effective method to generate two non-parallel hyperplanes that minimize the distance between the source domain and the target domain. Xie et al. [21] proposed the Laplacian Lp-norm least squares twin support vector machine, whose performance can be improved with an appropriate p value and the Lp-norm graph regularization term. Rezvani et al. [22] proposed intuitionistic fuzzy twin support vector machines for imbalanced data, which easily deal with imbalanced datasets in the presence of noise and outliers. Chen et al. [23] proposed the fuzzy support vector machine with graph for classifying imbalanced datasets. Firstly, they designed a graph-based fuzzy membership function to assess the importance of samples in the original feature space. Secondly, they proved that the function can mine discriminative information between samples in high-dimensional data. Thirdly, a method was provided to calculate the fuzzy membership function in the kernel space. Finally, the model analyzes the samples of each class independently.

In recent years, SVM and TSVM have been extended to the field of deep learning. For example, Li et al. presented the deep twin support vector machine [24], which regards the projection as a kind of location information and projects the raw data into a 4-dimensional space to obtain new data; however, it does not use a deep neural network (DNN) to extract deep features. Deep learning using support vector machines [25] was proposed by Tang. There are also several applications of deep SVMs. For example, Wu et al. [26] proposed a deep two-view support vector machine for facial expression recognition, in which the visible and thermal facial images are viewed as two views and two deep neural networks are trained for the visible and thermal image data. In addition, Okwuashi et al. proposed a deep support vector machine for hyperspectral image classification [27].
The kernel trick can be used to project the original data into a higher-dimensional space and thus handle linearly non-separable problems; a neural network can also be used to extract features. In this paper, inspired by the success of the non-parallel hyperplane support vector machine (NPHSVM) [28] and of deep learning, we propose a novel deep non-parallel hyperplane support vector machine (DNHSVM) for classification.
The contributions of this paper are summarized as follows:
(1) We extend the non-parallel hyperplane support vector machine to the deep learning field, which combines feature extraction with the classification task seamlessly.
(2) The parameters of the neural network and of the hyperplanes are optimized simultaneously by the stochastic gradient descent (SGD) algorithm, so the extracted deep features are friendly for classification.
(3) Experiments on UCI datasets demonstrate the effectiveness of our proposed method, which outperforms the other compared state-of-the-art methods.
The remainder of this paper is organized as follows. Section II gives a brief introduction to SVM, TSVM, NPHSVM and the deep support vector machine (DSVM). Section III elaborates our proposed DNHSVM and its optimization procedure. Section IV reports experiments on benchmark datasets for our method and the compared methods. Finally, Section V concludes the paper.

II. RELATED WORK
The notations used throughout this paper are summarized here. Matrices are written in uppercase. For a matrix $A \in \mathbb{R}^{n \times d}$, $A_i$ denotes the $i$th row of $A$ and $A^{\top}$ denotes its transpose. $A^{-1}$ is the inverse of $A$ and $I$ is an identity matrix. Vectors and scalars are written in lowercase.

A. SUPPORT VECTOR MACHINE
Suppose we are given $n$ training samples arranged as a matrix $A \in \mathbb{R}^{n \times d}$, where $d$ is the dimension of the samples. The $i$th row of $A$ represents the $i$th sample, whose label is $y_i \in \{-1, 1\}$. Let $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ be the normal vector and the bias, respectively. To obtain the hyperplane under the structural risk minimization principle, the following constraint should be satisfied:
$$y_i (w^{\top} A_i^{\top} + b) \geq 1, \quad i = 1, \ldots, n.$$
The hyperplane is denoted as $w^{\top} x + b = 0$ and the margin between the two supporting hyperplanes $w^{\top} x + b = \pm 1$ is $\frac{2}{\|w\|}$. The standard (soft-margin) SVM is obtained by solving the following optimization problem:
$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + c \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^{\top} A_i^{\top} + b) \geq 1 - \xi_i, \ \xi_i \geq 0, \ i = 1, \ldots, n,$$
where the $\xi_i$ are slack variables and $c$ is a hyperparameter. The parameters of the hyperplane are obtained by solving this QPP, and the decision function can be written as
$$f(x) = \operatorname{sign}(w^{\top} x + b).$$
A test sample $x \in \mathbb{R}^d$ is assigned to the positive class ''+1'' or the negative class ''−1'' according to which side of the hyperplane it lies on.
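Although the SVM parameters are usually obtained by solving the QPP, the same soft-margin objective can also be minimized directly by subgradient descent. The following minimal numpy sketch illustrates this; the function names, learning rate and epoch count are illustrative choices, not part of the paper:

```python
import numpy as np

def train_linear_svm(X, y, c=1.0, lr=0.01, epochs=200):
    """Subgradient descent on the soft-margin SVM objective
    (1/2)||w||^2 + c * sum_i max(0, 1 - y_i (w^T x_i + b)).
    X: (n, d) data, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                          # samples violating the margin
        # subgradient of the objective with respect to w and b
        grad_w = w - c * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -c * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def decision(X, w, b):
    """Decision function f(x) = sign(w^T x + b)."""
    return np.sign(X @ w + b)
```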

B. TWIN SUPPORT VECTOR MACHINE
TSVM attempts to construct two non-parallel hyperplanes by solving a pair of small QPPs; each hyperplane is close to its own class and keeps away from the other class. Positive and negative samples are denoted as $A \in \mathbb{R}^{n_1 \times d}$ and $B \in \mathbb{R}^{n_2 \times d}$, respectively, where $n_1$ and $n_2$ are the numbers of positive and negative samples and $d$ is the dimension of the samples. The two hyperplanes are given as
$$w_1^{\top} x + b_1 = 0 \quad \text{and} \quad w_2^{\top} x + b_2 = 0,$$
where $w_1$ and $w_2$ are the normal vectors of the hyperplanes and $b_1$ and $b_2$ are the bias terms. The primal problems of TSVM can be formulated as
$$\min_{w_1, b_1, \xi} \ \frac{1}{2}\|A w_1 + e_1 b_1\|^2 + c_1 e_2^{\top} \xi \quad \text{s.t.} \quad -(B w_1 + e_2 b_1) + \xi \geq e_2, \ \xi \geq 0,$$
$$\min_{w_2, b_2, \eta} \ \frac{1}{2}\|B w_2 + e_2 b_2\|^2 + c_2 e_1^{\top} \eta \quad \text{s.t.} \quad (A w_2 + e_1 b_2) + \eta \geq e_1, \ \eta \geq 0,$$
where $\xi$ and $\eta$ are slack variables, $c_1$ and $c_2$ are hyperparameters which need to be tuned, and $e_1$ and $e_2$ are vectors of ones of the appropriate dimensions. After introducing the Lagrangian multipliers $\alpha$ and $\beta$ and applying the KKT conditions, the dual problems are obtained as
$$\max_{\alpha} \ e_2^{\top} \alpha - \frac{1}{2} \alpha^{\top} G (H^{\top} H)^{-1} G^{\top} \alpha \quad \text{s.t.} \quad 0 \leq \alpha \leq c_1 e_2,$$
$$\max_{\beta} \ e_1^{\top} \beta - \frac{1}{2} \beta^{\top} H (G^{\top} G)^{-1} H^{\top} \beta \quad \text{s.t.} \quad 0 \leq \beta \leq c_2 e_1,$$
where $H = [A \ \ e_1]$ and $G = [B \ \ e_2]$. After solving these QPPs, the parameters of the two hyperplanes can be formulated as
$$\begin{bmatrix} w_1 \\ b_1 \end{bmatrix} = -(H^{\top} H + \delta I)^{-1} G^{\top} \alpha, \qquad \begin{bmatrix} w_2 \\ b_2 \end{bmatrix} = (G^{\top} G + \delta I)^{-1} H^{\top} \beta,$$
where $I$ is an identity matrix of the appropriate dimension and $\delta I$ is a regularization term that alleviates the difficulty of inverting $G^{\top} G$ or $H^{\top} H$. After obtaining the parameters of the two hyperplanes, a new sample $x \in \mathbb{R}^d$ is assigned to the positive class ''+1'' or the negative class ''−1'' depending on which hyperplane it is closer to, i.e.
$$\text{label}(x) = \arg\min_{k \in \{1, 2\}} \frac{|w_k^{\top} x + b_k|}{\|w_k\|}.$$
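As an illustration of the closed-form solution and the decision rule above, a small numpy sketch follows. Solving the dual QPPs to obtain the multipliers is not shown (a QP solver would be needed), so alpha is assumed to be given:

```python
import numpy as np

def first_hyperplane(H, G, alpha, delta=1e-4):
    """[w1; b1] = -(H^T H + delta*I)^{-1} G^T alpha, with H = [A, e1] and G = [B, e2].
    alpha is the multiplier vector from the first dual QPP (its computation is not
    shown here).  The second hyperplane uses (G^T G + delta*I)^{-1} H^T beta."""
    M = H.T @ H + delta * np.eye(H.shape[1])
    return -np.linalg.solve(M, G.T @ alpha)

def tsvm_predict(X, w1, b1, w2, b2):
    """Assign each row of X to +1 or -1 according to which hyperplane
    w_k^T x + b_k = 0 it is closer to (distance |w_k^T x + b_k| / ||w_k||)."""
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)
```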

C. NONPARALLEL HYPERPLANE SUPPORT VECTOR MACHINE
The nonparallel hyperplane support vector machine aims at constructing two hyperplanes
$$w_1^{\top} x + b_1 = 0 \quad \text{and} \quad w_2^{\top} x + b_2 = 0.$$
Each hyperplane is close to its own class and far away from the other class. Denoting the samples of the two classes by $X_1$ and $X_2$, NPHSVM maximizes the differences $(X_1 w_1 + e_1 b_1) - (X_1 w_2 + e_1 b_2)$ and $(X_2 w_2 + e_2 b_2) - (X_2 w_1 + e_2 b_1)$. Its primal problem contains three terms, with regularization parameters $C_1$ and $C_2$: the first is the regularization term; the second is the sum of squared distances from the two hyperplanes to the samples of the corresponding classes, which keeps each hyperplane close to its own class; the third is the sum of the error variables corresponding to the constraints. With the Lagrangian and the Karush-Kuhn-Tucker (KKT) conditions, the dual of the problem can be obtained and solved for the hyperplane parameters.

D. DEEP SUPPORT VECTOR MACHINE
DSVM combines a deep neural network with SVM so that feature extraction and the construction of the hyperplane are learned jointly. Its loss function can be written as
$$\min_{w, b} \ \frac{1}{2}\|w\|^2 + c \sum_{i=1}^{n} \max\bigl(0, \ 1 - y_i (w^{\top} f_M(x_i) + b)\bigr), \qquad f_M(x_i) = g(w_M f_{M-1}(x_i) + b_M),$$
where $w$ is the normal vector of the hyperplane and $b$ is the bias term of DSVM, $c$ is a hyperparameter that needs to be tuned, $g(\cdot)$ is an activation function such as ReLU or Sigmoid, and $w_M$ and $b_M$ are the weight and bias of the $M$th layer of the neural network. The model of DSVM is shown in Fig. 1. After the neural network is trained, test samples are assigned their labels according to which side of the hyperplane they lie on.
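A minimal PyTorch sketch of the DSVM idea follows, assuming a small MLP feature extractor, a single hyperplane on top and the hinge loss; the layer sizes, optimizer and synthetic data are illustrative choices, not the configuration used in the paper:

```python
import torch
import torch.nn as nn

class DSVM(nn.Module):
    """Sketch of a DSVM: an MLP feature extractor f_M followed by a single
    hyperplane w^T f_M(x) + b."""
    def __init__(self, d_in, d_feat=32):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                                      nn.Linear(64, d_feat), nn.ReLU())
        self.hyperplane = nn.Linear(d_feat, 1)   # holds w and b

    def forward(self, x):
        return self.hyperplane(self.features(x)).squeeze(-1)

def dsvm_loss(scores, y, w, c=1.0):
    """(1/2)||w||^2 + c * sum_i max(0, 1 - y_i * score_i), with y in {-1, +1}."""
    hinge = torch.clamp(1.0 - y * scores, min=0).sum()
    return 0.5 * w.pow(2).sum() + c * hinge

# toy usage: one optimization step on random data
x = torch.randn(16, 10)
y = torch.sign(torch.randn(16))
model = DSVM(d_in=10)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss = dsvm_loss(model(x), y, model.hyperplane.weight)
loss.backward()
opt.step()
```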

III. OUR PROPOSED DEEP NON-PARALLEL HYPERPLANE SUPPORT VECTOR MACHINE
A. BINARY CLASSIFICATION FOR DNHSVM
In this section, we elaborate our proposed deep non-parallel hyperplane support vector machine. The loss function $L$ of DNHSVM consists of three terms, explained below, where $f_M(A)$ and $f_M(B)$ are the features extracted by the $M$-layer neural network for the positive and negative samples, respectively, $w_1$ and $w_2$ are the normal vectors of the hyperplanes, $b_1$ and $b_2$ are the bias terms, and $e_1$ and $e_2$ are vectors of ones of the appropriate dimensions.
To explain the principle of our DNHSVM, we provide the following analyses.
(1) The first term indicates the squared distance from samples to their own hyperplanes. The smaller the value is, the closer they are to their own hyperplanes.
(2) The second term requires that the difference between the distance from each sample to the other hyperplane and the distance to its own hyperplane be at least one, which can be used to separate the two classes effectively.
(3) The third term $\frac{1}{2}(\|w_1\|^2 + \|w_2\|^2)$ is a regularization term which is used to avoid over-fitting.
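Since the exact equation is not reproduced here, the following PyTorch sketch shows one plausible instantiation of the three terms described above; the use of absolute distances in the hinge term and the weights lam1 and lam2 are assumptions, not the paper's exact formulation:

```python
import torch

def dnhsvm_loss(fa, fb, w1, b1, w2, b2, lam1=1.0, lam2=1.0):
    """Sketch of a three-term DNHSVM-style loss.
    fa, fb : deep features f_M(A), f_M(B) of the positive / negative samples.
    w1, b1, w2, b2 : parameters of the two non-parallel hyperplanes."""
    # (1) squared distance of each class to its own hyperplane
    own = 0.5 * ((fa @ w1 + b1).pow(2).sum() + (fb @ w2 + b2).pow(2).sum())
    # (2) hinge penalty: the distance to the other hyperplane should exceed
    #     the distance to the own hyperplane by at least one
    hinge_a = torch.clamp(1.0 - ((fa @ w2 + b2).abs() - (fa @ w1 + b1).abs()), min=0).sum()
    hinge_b = torch.clamp(1.0 - ((fb @ w1 + b1).abs() - (fb @ w2 + b2).abs()), min=0).sum()
    # (3) regularization on the normal vectors to avoid over-fitting
    reg = 0.5 * (w1.pow(2).sum() + w2.pow(2).sum())
    return own + lam1 * (hinge_a + hinge_b) + lam2 * reg
```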
The model is shown in Fig. 2, in which the red and blue lines represent the parameters of the two non-parallel hyperplanes we seek.
We jointly optimize the parameters of the two hyperplanes and of the neural network. The gradients of $L$ with respect to $w_1$, $b_1$, $w_2$ and $b_2$ are obtained by differentiating the loss function directly. Denoting the learning rate by $\alpha$, the hyperplane parameters are updated as
$$w_k \leftarrow w_k - \alpha \frac{\partial L}{\partial w_k}, \qquad b_k \leftarrow b_k - \alpha \frac{\partial L}{\partial b_k}, \qquad k = 1, 2.$$
The backward error is then passed to the previous layers. For the $M$th layer, the backward errors of the $i$th positive and negative samples are denoted by $\delta^{M}_{A_i}$ and $\delta^{M}_{B_i}$, respectively, i.e.
$$\delta^{M}_{A_i} = \frac{\partial L}{\partial f_M(A_i)} \odot g'\bigl(z^{M}_{A_i}\bigr), \qquad \delta^{M}_{B_i} = \frac{\partial L}{\partial f_M(B_i)} \odot g'\bigl(z^{M}_{B_i}\bigr),$$
where $z^{M}_{A_i}$ and $z^{M}_{B_i}$ are the inputs of the $M$th layer for the $i$th positive and negative samples, $f_M(A_i)$ and $f_M(B_i)$ are the corresponding outputs of the $M$th layer, $\odot$ denotes element-wise multiplication, and $g'(\cdot)$ is the derivative of the activation function $g(\cdot)$. The backward error of an earlier layer $m$ is obtained recursively as
$$\delta^{(m)}_{i} = \bigl(W^{(m+1)}\bigr)^{\top} \delta^{(m+1)}_{i} \odot g'\bigl(z^{(m)}_{i}\bigr).$$
According to the definition of $f^{(m)}_{i}$ and the back-propagation algorithm, the subgradients with respect to $W^{(m)}$ and $b^{(m)}$ can be formulated as
$$\frac{\partial L}{\partial W^{(m)}} = \sum_{i} \delta^{(m)}_{i} \bigl(f^{(m-1)}_{i}\bigr)^{\top}, \qquad \frac{\partial L}{\partial b^{(m)}} = \sum_{i} \delta^{(m)}_{i},$$
where $f^{(m-1)}_{i}$ represents the output of the $(m-1)$th layer for the $i$th positive or negative sample. With these gradients, the SGD algorithm updates $W^{(m)}$ and $b^{(m)}$ as
$$W^{(m)} \leftarrow W^{(m)} - \alpha \frac{\partial L}{\partial W^{(m)}}, \qquad b^{(m)} \leftarrow b^{(m)} - \alpha \frac{\partial L}{\partial b^{(m)}}.$$
Before training, the parameters of the deep neural network are initialized randomly. During testing, a sample $x \in \mathbb{R}^d$ is assigned to the class whose hyperplane is closer to its deep feature, i.e.
$$\text{label}(x) = \arg\min_{k \in \{1, 2\}} \frac{|w_k^{\top} f_M(x) + b_k|}{\|w_k\|},$$
where $f_M(x)$ is the feature extracted by the $M$-layer neural network. The training procedure of our proposed method is summarized in Algorithm 1.
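A short PyTorch sketch of this decision rule follows; the feature extractor, hyperplane parameters and data below are random placeholders standing in for trained values:

```python
import torch
import torch.nn as nn

def dnhsvm_predict(x, feature_net, w1, b1, w2, b2):
    """Assign each sample to the class whose hyperplane is closer to its
    deep feature f_M(x); feature_net and the hyperplane parameters are
    assumed to be already trained."""
    f = feature_net(x)                           # f_M(x), shape (n, d_feat)
    d1 = (f @ w1 + b1).abs() / w1.norm()
    d2 = (f @ w2 + b2).abs() / w2.norm()
    return (d1 <= d2).long() * 2 - 1             # +1 for class 1, -1 for class 2

# toy usage with an untrained (random) feature extractor, just to show shapes
net = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 8), nn.ReLU())
w1, b1 = torch.randn(8), torch.tensor(0.0)
w2, b2 = torch.randn(8), torch.tensor(0.0)
labels = dnhsvm_predict(torch.randn(5, 10), net, w1, b1, w2, b2)
```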

Algorithm 1 Training Procedure of DNHSVM
1: Initialize the network parameters and $w_1$, $b_1$, $w_2$, $b_2$ randomly;
2: for each epoch and each mini-batch do
3:   Forward propagate the positive and negative samples to obtain $f_M(A)$, $f_M(B)$ and compute the loss $L$;
4:   Backward propagate the network to get $\frac{\partial L}{\partial w_1}$, $\frac{\partial L}{\partial b_1}$, $\frac{\partial L}{\partial w_2}$, $\frac{\partial L}{\partial b_2}$ and the gradients for the weights and biases of the previous layers;
5:   Update $w_1$, $b_1$, $w_2$, $b_2$ and the weights and biases of the previous layers with the obtained gradients;
6: end for

B. MULTICLASS CLASSIFICATION FOR DNHSVM
An easy way to extend the binary non-parallel support vector machine to multiclass classification is the one-vs-one approach. For a $K$-class problem, $\frac{K(K-1)}{2}$ DNHSVM models are trained independently and a voting strategy is used to obtain the final result.
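A generic one-vs-one wrapper of this kind can be sketched as follows; `train_binary` is a placeholder for any binary learner (such as a binary DNHSVM), and class labels are assumed to be integers 0, ..., K-1:

```python
import numpy as np
from itertools import combinations

def one_vs_one_train(X, y, train_binary):
    """Train K(K-1)/2 binary classifiers, one for every pair of classes.
    train_binary(Xa, Xb) should return a predictor p(X) -> {+1, -1}."""
    classes = np.unique(y)
    models = {}
    for a, b in combinations(classes, 2):
        models[(a, b)] = train_binary(X[y == a], X[y == b])
    return models

def one_vs_one_predict(X, models, n_classes):
    """Each pairwise model votes for one of its two classes; the class with
    the most votes wins."""
    votes = np.zeros((X.shape[0], n_classes), dtype=int)
    for (a, b), predict in models.items():
        pred = predict(X)                 # +1 means class a, -1 means class b
        votes[np.arange(X.shape[0]), np.where(pred == 1, a, b)] += 1
    return votes.argmax(axis=1)
```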

IV. EXPERIMENTS
In this section, we conduct experiments on UCI datasets to validate the effectiveness of our proposed method and compare it with other algorithms, including the K nearest neighbors (KNN) algorithm, SVM [1], TSVM [8], NPHSVM [28], the least squares recursive projection twin support vector machine (LSPTSVM) [29] and DSVM [25]. Our experiments are implemented on a Windows 10 computer with a 3.6 GHz Intel Core i7-9700K CPU, 32 GB RAM and an RTX 2080Ti GPU. Before training, the data for KNN, SVM, TSVM, NPHSVM and LSPTSVM are scaled to [0, 1], while the data for DSVM and DNHSVM are standardized. The compared methods are briefly introduced as follows:
• KNN: the K nearest neighbors algorithm; its core idea is that if most of the k nearest samples in the feature space belong to a certain category, the query sample also belongs to this category.
• SVM: it constructs a hyperplane under the structural risk minimization principle, and the parameters of the hyperplane are obtained by solving a QPP.
• TSVM: it constructs two non-parallel hyperplanes, each close to its own class and as far as possible from the other class, by solving two small QPPs.
• NPHSVM: it constructs two non-parallel hyperplanes simultaneously by solving a single QPP and is consistent between its prediction and training processes.
• LSPTSVM: it adds a regularization term to the projection twin support vector machine to ensure that the optimization problems are positive definite, which gives it better generalization ability.
• DSVM: it uses a DNN to combine feature extraction with the construction of the hyperplane seamlessly, and the hinge loss is used as the loss function.
To show the performance and generalization of our algorithm and the compared algorithms, we perform experiments on UCI datasets; the details of the UCI datasets we use are shown in Table 1.

A. IMPLEMENTATION
In this subsection, we introduce the settings of all methods. The hyperparameters of all methods are determined by cross-validation from a candidate set. The learning rate $\alpha$ is fixed to 0.01 and the mini-batch size $m_b$ is set to 100. The maximum number of training epochs $E$ in the optimization process is set to 100, the Adam optimizer is applied when training the neural network, and ReLU is used as the activation function. We repeat each algorithm five times and report the average performance and standard deviation.
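For illustration, a k-fold cross-validation grid search over the two regularization parameters could be sketched as below; the candidate grid and the `train_and_score` routine are assumptions, not the paper's actual settings:

```python
import numpy as np
from itertools import product

def cross_validate(X, y, candidates, train_and_score, k=5, seed=0):
    """Pick (lam1, lam2) by k-fold cross-validation.  train_and_score is assumed
    to train a model with the given hyperparameters and return validation accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    best, best_acc = None, -1.0
    for lam1, lam2 in product(candidates, candidates):
        accs = []
        for i in range(k):
            va = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            accs.append(train_and_score(X[tr], y[tr], X[va], y[va], lam1, lam2))
        if np.mean(accs) > best_acc:
            best, best_acc = (lam1, lam2), np.mean(accs)
    return best, best_acc

# illustrative candidate grid (an assumption, not the paper's actual set)
candidates = [2.0 ** p for p in range(-5, 6)]
```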

B. EXPERIMENT RESULTS
The results of all methods are shown in Table 2, where the best performance is reported in boldface. As we can see from Table 2, our method obtains the best performance on nearly all the datasets. To be specific, DNHSVM achieves about 0.58%, 1.09% and 1.89% relative improvement in accuracy over the second-best algorithm on the datasets Austra, Breast and Vehicle, respectively.
There are several reasons why our proposed method outperforms the other state-of-the-art methods: (1) The deep model integrates feature extraction and the construction of the hyperplanes seamlessly, so samples become easier to classify. (2) The number of neurons is not fixed, so the original data can be projected into a suitable space in which the positive and negative samples differ significantly.

C. HYPERPARAMETER INFLUENCE AND CONVERGENCE ANALYSIS
This part analyzes the influence of the hyperparameters. The influence on the UCI datasets Breast, Bupa, Heart, Wpbc, Wine and Air is shown in Fig. 3; the other UCI datasets in Table 1 are insensitive to the choice of hyperparameters, so they are not included in Fig. 3. As we can see from Fig. 3, our method achieves promising performance when both $\lambda_1$ and $\lambda_2$ take small values, whereas it performs poorly when $\lambda_1$ takes a small value and $\lambda_2$ takes a large value. Hence we suggest setting both $\lambda_1$ and $\lambda_2$ to small values.
The convergence behavior is shown in Fig. 4. We only plot the convergence curves for the binary classification datasets, since multiclass classification trains multiple binary classifiers and its objective function value is simply the sum of their objectives. As we can see from Fig. 4, the objective function value of our method decreases monotonically on most datasets and becomes stable after dozens of iterations.

D. DISCUSSION
In this subsection, we give a discussion of the proposed DNHSVM model. The results in Table 2 show that our method outperforms other state-of-the-art methods, which shows that the deep model can effectively extract friendly features and that better hyperplanes can be obtained for our method.
The influence of the hyperparameters is described in Fig. 3. We discuss the hyperparameter influence and the convergence analysis in subsection IV-C and give suggestions for the choice of hyperparameters. The convergence figures show that the objective function value becomes stable after dozens of iterations.
Fig. 5 reports the accuracy on the UCI datasets as the number of epochs increases. For the datasets Austra, Breast, Ionosphere and Sonar, the accuracy first rises and then fluctuates within a small range; for the other datasets, the accuracy of our method fluctuates over a relatively large range.
Analysis of variance (ANOVA) is a statistical hypothesis test, which can be used to analyze the accuracy results and test for significant differences between several groups of results.
To evaluate whether the performance improvement of our proposed DNHSVM is statistically significant, the proposed DNHSVM and the compared state-of-the-art methods are compared pairwise on the test datasets. The null hypothesis is that there is no significant difference in accuracy between the two methods on these datasets. The accuracies on the test datasets are shown in Table 2, and the resulting p-values are shown in Table 3. Since all the p-values in Table 3 are less than 0.1, the differences are significant at the 0.1 significance level.
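Such a pairwise test can be computed, for instance, with scipy; the accuracy values in the snippet are placeholders rather than the numbers in Table 2:

```python
import numpy as np
from scipy.stats import f_oneway

def pairwise_anova(acc_ours, acc_other, alpha=0.1):
    """One-way ANOVA between the accuracies of two methods across datasets
    (with two groups this is equivalent to a t-test).  Returns the p-value and
    whether the null hypothesis of equal means is rejected at level alpha."""
    _, p = f_oneway(np.asarray(acc_ours), np.asarray(acc_other))
    return p, p < alpha

# placeholder accuracies for illustration only
p, significant = pairwise_anova([0.90, 0.85, 0.88, 0.93], [0.87, 0.83, 0.86, 0.90])
```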

V. CONCLUSION
In this paper, we propose a novel deep non-parallel hyperplane support vector machine that combines feature extraction and classification seamlessly; the features extracted through the neural network are friendly for classification. Regarding the construction of the hyperplanes, each hyperplane is close to its own class and as far as possible from the other class. The experiments on UCI datasets show that our proposed method outperforms other state-of-the-art methods, which demonstrates its effectiveness. In future work, we will extend our model to multi-view learning and semi-supervised learning.