SPLINE FUNCTION SMOOTH SUPPORT VECTOR MACHINE FOR CLASSIFICATION

Abstract. Support vector machine (SVM) is a very popular method for binary data classification in data mining (machine learning). Since the objective function of the unconstrained SVM model is non-smooth, many efficient optimization algorithms cannot be applied to it directly. To overcome this non-smoothness, Lee and Mangasarian proposed the smooth support vector machine (SSVM) in 2001, and Yuan et al. proposed the polynomial smooth support vector machine (PSSVM) in 2005. In this paper, a three-order spline function is used to smooth the objective function, yielding a three-order spline smooth support vector machine model (TSSVM). An analysis of the smooth function shows that the smoothing precision is improved significantly. Moreover, the BFGS and Newton-Armijo algorithms are used to solve the TSSVM model. Our experimental results show that the TSSVM model has better classification performance than other competitive baselines.


1. Introduction. Among the methods commonly used for data classification, the support vector machine (SVM), a well-known method based on statistical learning theory ([1][2][3][4][5][6][7][8]), has become one of the most popular. In fact, the support vector machine has surpassed neural networks in popularity among statistical learning methods. SVM is now widely used in pattern recognition, regression analysis, probability density estimation and so on. For many of these tasks, SVM performs as well as or better than traditional machine learning methods.
In this paper, we focus on the SVM approach for data classification. The SVM model for classification can be formulated as a non-smooth unconstrained optimization problem ([9][10][11][12]) whose objective function is non-differentiable at zero. In our approach, we modify the model slightly and apply the smoothing techniques that have been extensively used for solving important mathematical programming problems ([13][14][15][16][17][18][19][20][21][22][23]). In 2001, Lee et al. ([20]) employed a smoothing method to solve the resulting optimization problem. They used the integral of the sigmoid function,

p(x, k) = x + (1/k) ln(1 + ε^{−kx}),        (1)

and obtained the smooth SSVM model. Here, ε is the base of the natural logarithm and k > 0 is called the smooth parameter. Later, they used the same smooth function for support vector machine regression in 2005 (see [21]). In 2005, we presented two polynomial smooth functions ([22]), denoted q(x, k) and f(x, k) in (2) and (3). Applying these smooth functions to the non-smooth unconstrained optimization problem yields a smooth SVM model, which we call PSSVM. It can be proved through theoretical analysis that PSSVM is more effective than SSVM. The primary goal of this paper is to use a three-order spline function to smooth the objective function of the original model; the resulting model is the three-order spline smooth support vector machine (TSSVM). We also conclude, through theoretical analysis and experimental results, that the TSSVM model is better than the existing models in both smoothing precision and classification capability.
The rest of the paper is organized as follows. In Section 2 we briefly introduce how the SSVM model is obtained and state the three-order spline function used to form the three-order spline smooth support vector machine (TSSVM) model. In Section 3 we study the smoothing properties of the three-order spline function, such as its smoothing capability and its degree of approximation to the original function. In Section 4 we prove the convergence of the TSSVM model. In Section 5 we discuss optimization algorithms for the TSSVM model, namely BFGS and the Newton method, together with the choice of the smooth parameter. In Section 6 the experimental results are given. We conclude the paper in Section 7.
For ease of understanding, we use the following notation throughout the paper. All vectors are column vectors unless transposed to a row vector by the superscript T; the transpose of a vector or matrix is denoted by (·)^T. For a vector x in the n-dimensional real space R^n, the plus function x_+ is defined by (x_+)_i = max(0, x_i), i = 1, 2, ..., n. The scalar (inner) product of two vectors x, y in R^n is denoted by x^T y and the p-norm of x by ‖x‖_p. For a matrix A ∈ R^{m×n}, A_i is the i-th row of A, which is a row vector in R^n. A column vector of ones of arbitrary dimension is denoted by e. If f is a real-valued function defined on R^n, the gradient of f at x is denoted by ∇f(x), a row vector in R^n, and the n × n Hessian matrix of f at x by ∇²f(x). The level set of f is defined as L_v(f) = {x ∈ Domain(f) : f(x) ≤ v} for a given real number v.
2. Three-order spline smooth support vector machine model. In this section, the unconstrained optimization model of the SVM is obtained and then smoothed.
A pattern classification problem is to classify m points in the n-dimensional real space R^n, represented by an m × n matrix A, according to the membership of each point A_i in class 1 or −1, as specified by a given m × m diagonal matrix D with 1 or −1 on its diagonal. The standard support vector machine (see [20][21][22][23][24][25]) for this problem is given by

min_{(ω,γ,y) ∈ R^{n+1+m}}  ν e^T y + (1/2) ω^T ω
s.t.  D(Aω − eγ) + y ≥ e,   y ≥ 0,
for some ν > 0, where ω is the normal to the bounding planes and γ determines their location relative to the origin. The linear separating plane is x^T ω = γ, with normal ω ∈ R^n and distance |γ|/‖ω‖_2 to the origin. To illustrate these concepts, a simple diagram is given in Figure 1, where the circle points in the left corner belong to the class with label −1 and the bounding planes are determined by the support vectors. The distance between the two solid lines is the classification margin, which equals 2/‖ω‖_2. The first term in the objective function of this problem is the 1-norm of the slack variable y with weight ν; the second term ω^T ω is the square of the 2-norm of ω, with the factor one half introduced for simplicity. Replacing the first term by the squared 2-norm of y, the SVM problem can be modified into the following form:

min_{(ω,γ,y) ∈ R^{n+1+m}}  (ν/2) y^T y + (1/2)(ω^T ω + γ^2)
s.t.  D(Aω − eγ) + y ≥ e.        (6)

At a solution of problem (6), y is given by

y = (e − D(Aω − eγ))_+ ,

where the elements of the vector (a)_+ are defined by ((a)_+)_i = max(0, a_i). Substituting this y into the objective function of (6) converts problem (6) into the equivalent unconstrained optimization problem

min_{(ω,γ) ∈ R^{n+1}}  (ν/2) ‖(e − D(Aω − eγ))_+‖_2^2 + (1/2)(ω^T ω + γ^2).        (9)

This is a strongly convex minimization problem without any constraints and it has a unique solution. However, the objective function in (9) is not differentiable at zero, which precludes the use of existing derivative-based optimization methods. Since many efficient algorithms require the objective function to be once or twice differentiable, it is necessary to smooth the objective function.
In this paper, a new smooth function is introduced: the three-order spline function

T(x, k) =
  0,                                             if x < −1/k,
  (k^2/6)x^3 + (k/2)x^2 + x/2 + 1/(6k),          if −1/k ≤ x < 0,
  −(k^2/6)x^3 + (k/2)x^2 + x/2 + 1/(6k),         if 0 ≤ x < 1/k,
  x,                                             if x ≥ 1/k.        (10)

Replacing the plus function in (9) by this function (applied componentwise), a new smooth SVM model is obtained:

min_{(ω,γ) ∈ R^{n+1}}  (ν/2) ‖T(e − D(Aω − eγ), k)‖_2^2 + (1/2)(ω^T ω + γ^2).        (11)

In the next section, we prove that the smoothing performance of the three-order spline function is better than that of the previous smooth functions with the same smooth parameter k.
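To make the construction concrete, the following sketch (in Python with NumPy; the paper's own experiments were implemented in MATLAB) evaluates the spline T(x, k) of (10) elementwise and the smoothed objective of (11). The function names spline_plus and tssvm_objective are illustrative, and the diagonal label matrix D is passed as a vector d of ±1 entries; this is a minimal sketch under those assumptions, not the paper's own code.

```python
import numpy as np

def spline_plus(x, k):
    """Three-order spline approximation T(x, k) of the plus function,
    following the piecewise form of (10); applied elementwise."""
    x = np.asarray(x, dtype=float)
    out = np.where(x >= 1.0 / k, x, 0.0)    # x >= 1/k: T = x;  x < -1/k: T = 0
    lo = (x >= -1.0 / k) & (x < 0.0)
    hi = (x >= 0.0) & (x < 1.0 / k)
    out = np.where(lo, (k**2 / 6) * x**3 + (k / 2) * x**2 + x / 2 + 1 / (6 * k), out)
    out = np.where(hi, -(k**2 / 6) * x**3 + (k / 2) * x**2 + x / 2 + 1 / (6 * k), out)
    return out

def tssvm_objective(w, gamma, A, d, nu, k):
    """Smoothed objective of (11):
    (nu/2) * ||T(e - D(Aw - e*gamma), k)||_2^2 + (1/2) * (w'w + gamma^2).
    Here d is the vector of +/-1 labels (the diagonal of D)."""
    margin = 1.0 - d * (A @ w - gamma)       # e - D(Aw - e*gamma), elementwise
    t = spline_plus(margin, k)
    return 0.5 * nu * np.dot(t, t) + 0.5 * (np.dot(w, w) + gamma**2)
```

Replacing spline_plus by np.maximum(·, 0) recovers the non-smooth objective of (9), which is a convenient way to check how closely the smoothed model tracks the original one.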
3. Performance analysis of smooth functions. Before carrying out the performance analysis, we need to introduce the following lemmas.

Lemma 1.
Let Ω ⊂ R, let p(x, k) be defined as in (1), and let x_+ be the plus function. Then the following results hold.
The proof can be seen in [21].

Lemma 2.
Let Ω ⊂ R, let q(x, k) and f(x, k) be defined as in (2) and (3), and let x_+ be the plus function. Then the following results hold.
The proof can be seen in [22,23].
Remark 1. The results in Lemma 1 and Lemma 2 are easy to verify when x is a real value in R.

Theorem 1.
Let Ω ⊂ R, let T(x, k) be defined as in (10), and let x_+ be the plus function. Then the following results hold.
Proof. (i) Consider the definition of T(x, k) in (10); substituting the points x = ±1/k and x = 0 into it directly yields the results in (i).
(ii)-(iii) The remaining results follow by substituting a = kx into the expression for T(x, k) and maximizing over a.
According to the results of Lemma 1, Lemma 2 and Theorem 1, the following comparison of the smooth functions is obtained.
(i) If the smooth function is defined as in (1), the corresponding approximation bound is given by Lemma 1. (ii) If the smooth function is defined as in (2) or (3), the bound is given by Lemma 2. (iii) If the smooth function is defined as in (10), the bound is given by Theorem 1. These results follow directly from Lemma 1, Lemma 2 and Theorem 1. From the above comparison, it is clear that T(x, k) is the best smooth function among them. To show the difference more clearly, we present the smoothing performance comparison diagram in Figure 2, with the smooth parameter set to k = 10.

As can be seen from Figure 2, our proposed spline smooth function is the closest to the original non-smooth function, which indicates the superiority of our proposed smooth function.
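As a quick numerical counterpart to Figure 2, the following sketch compares the sigmoid-integral function p(x, k) of (1) with the spline T(x, k) of (10) at k = 10 by measuring the largest deviation from the plus function on [−1, 1]. The polynomial functions (2) and (3) are omitted because their explicit forms are not reproduced in the text; the sketch reuses spline_plus from Section 2.

```python
import numpy as np

def p_sigmoid_integral(x, k):
    """SSVM smoothing function (1): x + (1/k) * ln(1 + exp(-k x))."""
    return x + np.log1p(np.exp(-k * np.asarray(x, dtype=float))) / k

k = 10.0
xs = np.linspace(-1.0, 1.0, 20001)
plus = np.maximum(xs, 0.0)

for name, f in [("p(x,k), sigmoid integral", p_sigmoid_integral),
                ("T(x,k), three-order spline", spline_plus)]:
    print(f"{name}: max |f - x_+| = {np.max(np.abs(f(xs, k) - plus)):.6f}")
```

For k = 10 the maximum deviations are ln 2 / k ≈ 0.069 for p(x, k) and 1/(6k) ≈ 0.017 for T(x, k), both attained at x = 0, which is consistent with the spline curve lying closest to the plus function in Figure 2.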
4. Convergence analysis of the TSSVM model. In this section, the convergence of the TSSVM model (11) is presented. We prove that the solution of TSSVM closely approximates the optimal solution of the original model (9) as k tends to positive infinity. The smooth function (10) is applied componentwise to vectors.

Theorem 3. Let h(x) denote the objective function of the original model (9) and let g(x, k) denote the objective function of the smoothed model (11). The following conclusions can be obtained:
(i) h(x) and g(x, k) are strongly convex functions;
(ii) there is a unique solution x* to min_{x∈R^n} h(x) and a unique solution (x*)_k to min_{x∈R^n} g(x, k);
(iii) for every k ≥ 1, x* and (x*)_k satisfy ‖(x*)_k − x*‖_2^2 ≤ m/(48k^2);        (16)
(iv) lim_{k→+∞} (x*)_k = x*.

Proof. (i) h(x) and g(x, k) are strongly convex because ‖·‖_2^2 is strongly convex.
(ii) Let L_v(h(x)) be the level set of h(x) and L_v(g(x, k)) be the level set of g(x, k). According to result (ii) of Theorem 1, L_v(g(x, k)) and L_v(h(x)) are strictly convex sets, and therefore min_x h(x) and min_x g(x, k) each have a unique solution.
(iii) Let x* be the optimal solution of min_x h(x) and (x*)_k the optimal solution of min_x g(x, k). By the optimality conditions and the convexity of h(x) and g(x, k), the corresponding first-order inequalities hold at x* and (x*)_k. Adding these two inequalities and noting that T(x, k) ≥ x_+, we obtain a bound on ‖(x*)_k − x*‖_2^2; by result (iii) of Theorem 1, ‖(x*)_k − x*‖_2^2 ≤ m/(48k^2), so conclusion (16) holds.
(iv) Letting k go to infinity in (16) gives lim_{k→+∞} (x*)_k = x*.

Conclusion (iv) of Theorem 3 shows that the optimal solution of the smooth TSSVM model approaches that of the original support vector machine model as k tends to positive infinity. With this in mind, we design two algorithms to solve TSSVM in the next section.

5. Design of algorithms. Two things need to be considered when designing algorithms for solving the SVM model: the choice of an optimal smooth parameter and the choice of the solution method. We have shown above that TSSVM's optimal solution converges to the solution of the original model as k goes to positive infinity. However, there is a trade-off between the size of k and the computational cost, so k cannot be too large in practice. It is therefore important to choose a proper smooth parameter k before designing an algorithm.
Definition 1. Let A ∈ R^{m×n} be a training data set with m samples, let ǫ be a given precision parameter, let (x*)_k be an optimal solution of TSSVM and let x* be an optimal solution of the original model. When the algorithm stops, k must satisfy ‖(x*)_k − x*‖_2^2 ≤ ǫ. We call the minimal such k the minimal smooth parameter and denote it by kopt(m, ǫ).
From Definition 1, if the smooth parameter k in the TSSVM model satisfies k ≥ kopt(m, ǫ), the solution of TSSVM satisfies the precision requirement ‖(x*)_k − x*‖_2^2 ≤ ǫ. We next present a theorem on how to estimate the minimal smooth parameter.
Theorem 4. Let A ∈ R^{m×n} be a sample data set, let (x*)_k be the optimal solution of TSSVM and let x* be the optimal solution of SVM, where m is the number of sample points and ǫ is the algorithm's terminating control parameter. Then kopt(m, ǫ) ≤ √(m/(48ǫ)); that is, any k ≥ √(m/(48ǫ)) guarantees ‖(x*)_k − x*‖_2^2 ≤ ǫ.
The result follows directly from conclusion (iii) of Theorem 3; the proof is omitted.
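As an illustration of Theorem 4, the following small sketch computes the resulting estimate of the minimal smooth parameter under the Theorem 3 bound ‖(x*)_k − x*‖_2^2 ≤ m/(48k^2); the function name k_opt is illustrative.

```python
import math

def k_opt(m, eps):
    """Smallest k guaranteeing ||(x*)_k - x*||_2^2 <= eps under the
    bound m / (48 k^2) of Theorem 3."""
    return math.sqrt(m / (48.0 * eps))

print(k_opt(1000, 1e-3))   # m = 1000 training points, eps = 1e-3  ->  approx. 144.3
```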
We next present two algorithms for solving the SVM models. The BFGS method is suitable for unconstrained optimization problems in which the objective function and its gradient can be evaluated cheaply; it is the most widely used of the quasi-Newton methods [27][28].
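The detailed steps of Algorithm 1 (BFGS for TSSVM) are not reproduced in the text above. As a rough stand-in, the following sketch minimizes the smoothed objective of (11) with SciPy's quasi-Newton BFGS routine, reusing tssvm_objective from Section 2. It relies on SciPy's default finite-difference gradient for brevity, whereas the paper's algorithm would use the analytic gradient; solve_tssvm_bfgs is an illustrative name.

```python
import numpy as np
from scipy.optimize import minimize

def solve_tssvm_bfgs(A, d, nu, k, tol=1e-3):
    """Fit (w, gamma) by minimizing the TSSVM objective (11) with BFGS.
    A: m x n data matrix; d: vector of +/-1 labels."""
    m, n = A.shape

    def obj(p):
        # p packs the n weights and the offset gamma into one vector
        return tssvm_objective(p[:n], p[n], A, d, nu, k)

    res = minimize(obj, x0=np.zeros(n + 1), method="BFGS", options={"gtol": tol})
    return res.x[:n], res.x[n]
```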
The Newton-Armijo method for problem (11) is as follows.
Algorithm 2 (Newton-Armijo algorithm for TSSVM (11)).
Step 1 (Initialization): set H_0 = I, ((ω_0)^T, γ_0) = p_0 ∈ R^{n+1}, choose ǫ, set α_0 = 1 and i := 0.
Step 2: compute F_i = F(p_i, kopt) and g_i = ∇F(p_i, kopt).
Step 3: if ‖g_i‖_2^2 ≤ ǫ or the step length from the previous iteration is smaller than 10^{−12}, stop and accept p_i = ((ω_i)^T, γ_i) as the optimal solution of (11); otherwise compute the Newton direction d_i from the system of equations ∇²F(p_i, kopt) d_i = −g_i.
Step 4: perform a line search along the direction d_i with the Armijo rule to obtain a step length α_i > 0, and set p_{i+1} = p_i + α_i d_i.
Step 5: set i := i + 1 and go to Step 2.
Overall, the BFGS algorithm places weaker requirements on the objective function than the Newton method: it does not require second-order derivatives, and it needs less storage and memory.
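For completeness, here is a compact sketch of Algorithm 2 under the same assumptions as the earlier sketches: the piecewise spline form of (10), its first and second derivatives derived from that form, and a label vector d in place of the diagonal matrix D. The gradient and Hessian below are those of the smoothed objective F(p, k) with p = (ω, γ), and the Armijo search halves the step until a sufficient-decrease condition holds.

```python
import numpy as np

def spline_plus_d1(x, k):
    """First derivative of T(x, k) under the piecewise form of (10)."""
    x = np.asarray(x, dtype=float)
    out = np.where(x >= 1.0 / k, 1.0, 0.0)
    lo = (x >= -1.0 / k) & (x < 0.0)
    hi = (x >= 0.0) & (x < 1.0 / k)
    out = np.where(lo, (k**2 / 2) * x**2 + k * x + 0.5, out)
    out = np.where(hi, -(k**2 / 2) * x**2 + k * x + 0.5, out)
    return out

def spline_plus_d2(x, k):
    """Second derivative of T(x, k)."""
    x = np.asarray(x, dtype=float)
    lo = (x >= -1.0 / k) & (x < 0.0)
    hi = (x >= 0.0) & (x < 1.0 / k)
    return np.where(lo, k**2 * x + k, np.where(hi, -k**2 * x + k, 0.0))

def newton_armijo_tssvm(A, d, nu, k, eps=1e-3, max_iter=100):
    """Sketch of Algorithm 2 (Newton-Armijo) for the smoothed objective (11)."""
    m, n = A.shape
    B = np.hstack([A, -np.ones((m, 1))])      # B @ p = A w - e*gamma, with p = (w, gamma)
    p = np.zeros(n + 1)

    def pieces(p):
        r = 1.0 - d * (B @ p)                  # e - D(Aw - e*gamma)
        t, t1, t2 = spline_plus(r, k), spline_plus_d1(r, k), spline_plus_d2(r, k)
        F = 0.5 * nu * t @ t + 0.5 * p @ p
        g = -nu * B.T @ (d * t * t1) + p       # gradient of F
        H = nu * B.T @ (B * (t1**2 + t * t2)[:, None]) + np.eye(n + 1)   # Hessian of F
        return F, g, H

    for _ in range(max_iter):
        F, g, H = pieces(p)
        if g @ g <= eps:                       # stopping test ||g||^2 <= eps
            break
        direction = np.linalg.solve(H, -g)     # Newton direction
        alpha = 1.0                            # Armijo backtracking line search
        while alpha > 1e-12 and pieces(p + alpha * direction)[0] > F + 1e-4 * alpha * (g @ direction):
            alpha *= 0.5
        p = p + alpha * direction
    return p[:n], p[n]
```

Since T(x, k) is twice continuously differentiable and the Hessian above is positive definite (the regularization contributes the identity), the Newton system is always solvable; this matches the remark that the Newton method needs second-order derivatives while BFGS does not.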
6. Numerical experiments and results.

6.1. Experimental design and results. We synthetically create the data sets for our experiments. The first step is to generate (randomly) the experimental data D (D ∈ R^{(m+m_1)×n}), where m is the number of training records, m_1 is the number of testing records, and n is the number of attributes of both the training and test data sets. Let e ∈ R^n be the vector whose every element equals 1. Next, we label every data point x in D according to the following rules: (i) if e^T x ≥ 1, it is assigned to the first class; (ii) if e^T x ≤ −1, it is assigned to the second class; (iii) if −1 < e^T x < 1, it is treated as noise data. Each noisy data record is assigned probabilistically to one of the two classes: a point with −1 < e^T x < 1 is kept with probability p_1 (e.g. p_1 = 0.7); if it is kept and e^T x < 0, it is assigned to the second class with probability p_2 (e.g. p_2 = 0.8) and to the first class with probability 1 − p_2; conversely, if e^T x ≥ 0, it is assigned to the first class with probability p_2 and to the second class with probability 1 − p_2. A sketch of this generation procedure is given at the end of this subsection.

We conduct numerical experiments to test the following three aspects of the proposed algorithm.
(i) Algorithm validity. Validity is assessed with three indexes: the optimization objective value (optv), the norm of the gradient when the algorithm ends (ng) and the step length when the algorithm ends (stepl). The algorithm terminates when ng < ǫ or stepl < ǫ_1. When the algorithm terminates because stepl < ǫ_1, the condition ng < ǫ is usually not satisfied, which results in low classification accuracy. On the other hand, if the algorithm ends because ng < ǫ, it usually achieves high accuracy regardless of whether stepl < ǫ_1 holds. Thus a smaller ng generally indicates better algorithm validity.
(ii) Algorithm efficiency. Efficiency is measured by the computing time (cput); the smaller cput is, the more efficient the algorithm.
(iii) Classification performance. The performance of the linear classifiers obtained with different smooth functions is measured by two indexes: the training correct rate (TrCR) and the test correct rate (TeCR). Higher TrCR and TeCR mean better classification performance.
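The data generation described above can be sketched as follows. The uniform sampling range and the mapping of the first class to label +1 and the second class to label −1 are assumptions, since the text does not fix them, and make_synthetic_data is an illustrative name.

```python
import numpy as np

def make_synthetic_data(m, n, p1=0.7, p2=0.8, rng=None):
    """Generate labelled data following the rules of Section 6.1:
    points with e'x >= 1 go to the first class (+1), points with e'x <= -1
    to the second class (-1); 'noise' points with -1 < e'x < 1 are kept with
    probability p1 and labelled by the sign of e'x with probability p2."""
    rng = np.random.default_rng(rng)
    X, y = [], []
    while len(y) < m:
        x = rng.uniform(-2.0, 2.0, size=n)    # sampling range is an assumption
        s = x.sum()                            # e' x
        if s >= 1.0:
            label = 1
        elif s <= -1.0:
            label = -1
        else:                                  # noise region
            if rng.random() > p1:
                continue                       # drop the noisy point
            major = 1 if s >= 0.0 else -1
            label = major if rng.random() < p2 else -major
        X.append(x)
        y.append(label)
    return np.array(X), np.array(y)
```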
The numerical experiments were carried out on a computer with a 1.8 GHz CPU and 256 MB of memory, and the code was written in MATLAB 6.1. The parameters were set as follows: ǫ = 10^{−3}, ǫ_1 = 10^{−3}, and the initial iteration point was p_0 = 0. The experimental results are shown in Tables 1 and 2. In both tables, the first column shows the scale of the training data set, where m is the number of training points and n is the dimension of each training point; the remaining columns report the numerical results of the different algorithms with different smooth functions. The number of testing points m_1 is 1000. The values in each cell of the tables are TrCR, TeCR, cput, optv, ng and stepl, respectively.

6.2. Analysis of the experimental results. In terms of the optimization algorithm, both the BFGS and Newton algorithms find optimal solutions of our TSSVM model, since both terminate with small ng values. Among all the smooth functions, our proposed spline function achieves the lowest optv and ng, which demonstrates the advantage of the proposed TSSVM model.
In terms of algorithm efficiency, the Newton-Armijo method is faster than BFGS. However, Newton-Armijo cannot be applied to the QPSSVM model because of the lack of second-order derivatives. To show the relationship between the sample size and the CPU time more clearly, we plot the results in Figures 3, 4, 5 and 6.
Among the four smooth functions, our proposed TSSVM model performs best in terms of classification performance as measured by TeCR, followed by PSSVM; SSVM has some difficulty in finding the optimal solutions.

7. Conclusions and future work. From the analysis of the above experimental results, we can see that the TSSVM model is more effective than the previous models: it has very good classification performance and computational stability. The Newton method can be used as the preferred optimization algorithm for the TSSVM model; in other words, for the spline-based smooth function, the Newton method finds the optimal solution in less time than BFGS.
For future work, we are going to do the following: (i) The investigation of other smooth functions. In this paper we presented a three-order spline function. We believe there are better smooth functions yet to be discovered and evaluated.
(ii) Research on optimization algorithms for solving smooth-function-based SVM models. Many optimization algorithms are available, but they are suited to different objective functions; we need to examine different smooth functions and identify the best optimization algorithm for each model.

Table 2. Results for different smooth functions using different algorithms with fixed training size.
(iii) Research on the duality theory of TSSVM. In recent years, several duality theories have been developed [31,32,33,34], such as strong duality theory with no duality gap and the canonical duality theory proposed by Gao. We believe that further interesting results may be obtained by applying these duality theories.