Robust Algorithm for Multiclass Weighted Support Vector Machine

Support Vector Machine (SVM) has shown better performance than other methods in real-world classification applications, because it offers mathematical tractability and geometrical interpretation. However, the standard SVM can suffer from outliers in either the response or the predictor space. SVM can be viewed as a penalized method with the hinge loss function and a penalty function. Instead of the $L_2$ penalty function we consider the Smoothly Clipped Absolute Deviation (SCAD) function, because it has two advantages: sparse learning and unbiasedness. However, it is not robust when there are outliers in the data. We develop a robust algorithm for SVM using a weight function with the SCAD penalty, based on a Local Linear Approximation (LLA) method and a Local Quadratic Approximation (LQA) method. We compare the performance of the proposed algorithm with the standard SVM using $L_1$ and $L_2$ penalty functions.


I. INTRODUCTION
Classification is an important method in pattern recognition and discrimination, and it has recently shed new light on big data technologies. There are many algorithms for classification, such as the linear discriminant function, logistic regression, k-nearest neighbour, boosting and neural networks [Hastie et al., 2001]. Vapnik (1995) introduced the Support Vector Machine (SVM), which is an optimal margin classifier among linear classifiers. SVM has shown better performance than other methods in real applications in engineering and bioinformatics, because it offers mathematical tractability and geometrical interpretation.
Let $x \in \mathbb{R}^p$ denote a feature vector. The class labels, $y$, are coded as $\{-1, 1\}$. For a given training data set $(x_i, y_i)$, $i = 1, 2, \cdots, n$, the SVM can be written as the penalized hinge loss problem (1), which minimizes over $(\beta_0, \beta)$ the sum of the hinge losses $[1 - y_i(\beta_0 + x_i^T\beta)]_+$ plus an $L_2$ penalty on $\beta$ controlled by a tuning parameter $\lambda$, where the subscript $+$ means the positive part, for example $z_+ = \max(z, 0)$. The SVM classifier becomes the sign of the function $\beta_0 + x^T\beta$ for a given feature vector $x$. Equation (1) can be interpreted as a penalized regression with the hinge loss function and the $L_2$ penalty function. It is well known that the $L_2$ penalty function tends to keep all components of the feature vector in the model. To obtain a sparse solution, Tibshirani (1996) proposed the Least Absolute Shrinkage and Selection Operator (LASSO) in linear regression, which uses the $L_1$ penalty function instead of the $L_2$ function. However, the LASSO estimates can be biased for large coefficients, since larger penalties are imposed on larger coefficients. Fan & Li (2001) proposed a non-convex penalty function, the Smoothly Clipped Absolute Deviation (SCAD) penalty function, which gives unbiased estimates even for large coefficients.
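As a minimal illustration of the penalized hinge loss view, the following Python sketch evaluates an $L_2$-penalized hinge objective and the sign classifier. The function names and the scaling (mean hinge loss plus `lam` times the squared norm) are assumptions made for this example and are not taken from Equation (1).

```python
def decision_value(beta0, beta, x):
    """Linear decision function beta0 + x^T beta."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def hinge(margin):
    """Hinge loss [1 - margin]_+ for a margin y * f(x)."""
    return max(1.0 - margin, 0.0)


def l2_svm_objective(beta0, beta, X, y, lam):
    """Mean hinge loss plus lam * ||beta||^2 (the scaling is an assumption)."""
    loss = sum(hinge(yi * decision_value(beta0, beta, xi))
               for xi, yi in zip(X, y)) / len(y)
    penalty = lam * sum(b * b for b in beta)
    return loss + penalty


def classify(beta0, beta, x):
    """Predicted label in {-1, 1}: the sign of the decision function."""
    return 1 if decision_value(beta0, beta, x) >= 0 else -1


# Toy data: the third point lies on the wrong side of the boundary.
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, -1]
objective = l2_svm_objective(0.0, (1.0, 0.0), X, y, lam=0.1)
```

Only the third point contributes to the loss here, and its contribution grows without bound as it moves farther to the wrong side, which is the source of the sensitivity to outliers discussed in this paper.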
In real classification problems the binary SVM is often insufficient, because many problems involve more than two classes. Lee et al. (2004) proposed a simultaneous algorithm for multiclass classification problems, and Jung (2012) suggested a simultaneous multiclass SVM algorithm with the SCAD penalty function. The SCAD multiclass SVM conducts variable selection and classification simultaneously, and it gives a compact classifier with high accuracy. Because gene selection treats the expression levels of thousands of genes simultaneously in a single experiment, the SCAD SVM can be very useful in bioinformatics [Zhang et al., 2006].
SVM is known to be sensitive to noisy training data, because the loss function in (1) is not bounded. Outliers in the SVM can be defined as data points lying far away from their own classes, and the unbounded hinge loss function lets them strongly affect the SVM algorithm. In this paper we consider a robust algorithm for SVM using a weight function that makes the loss function bounded. It can yield a higher correct classification rate than the ordinary SVM in many problems. Furthermore, the number of support vectors for the proposed SVM is smaller than that of the SVM, because our algorithm is not affected by outliers and the proposed method does not retain outliers as support vectors. That is, the set of support vectors for our algorithm is a subset of the set of support vectors for the SVM [Wu & Liu, 2007]. Hence, it does not require much computation time and it allows easy interpretation of the input variables.
The paper is organized as follows. Section 2 describes related works. Section 3 presents our proposed algorithm, a weighted SVM with the SCAD penalty. Since the SCAD function is not convex, we use an approximation algorithm to handle the non-differentiable and non-convex objective function of the SVM with the SCAD penalty. The SVM can be solved as a linear programming or quadratic programming problem, which requires general-purpose software. We provide two formulations, one solved by linear programming and the other by a linear equation system. Section 4 illustrates the results of a simulation and a real data set. It shows that the proposed algorithm is superior to the other methods in terms of robustness.

II. RELATED WORKS
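For variable selection the $L_2$ penalty of the standard SVM (1) is replaced by the SCAD penalty, which gives the objective function (3): the hinge loss plus $\sum_{j=1}^{p} p_\lambda(|\beta_j|)$. The SCAD penalty of Fan & Li (2001) is defined by
$$p_\lambda(|\beta|) = \begin{cases} \lambda|\beta| & \text{if } |\beta| \le \lambda, \\ -\dfrac{|\beta|^2 - 2a\lambda|\beta| + \lambda^2}{2(a-1)} & \text{if } \lambda < |\beta| \le a\lambda, \\ \dfrac{(a+1)\lambda^2}{2} & \text{if } |\beta| > a\lambda, \end{cases}$$
where $a > 2$ and $\lambda > 0$ are tuning parameters. The parameter $\lambda$ in the objective function (3) regulates the trade-off between data fitting and model parsimony. The parameter $a$ is set to 3.7, since Fan & Li (2001) showed that the Bayes risks are not sensitive to the choice of $a$ and that $a = 3.7$ gave good results for many problems.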
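The behaviour of the SCAD penalty can be checked numerically. The sketch below implements the standard piecewise definition of Fan & Li (2001) with the default $a = 3.7$; the function name `scad_penalty` is chosen for this example only.

```python
def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty p_lambda(|beta|) of Fan & Li (2001), with a > 2 and lam > 0."""
    b = abs(beta)
    if b <= lam:
        # Same as the L1 penalty near zero, which produces sparsity.
        return lam * b
    if b <= a * lam:
        # Quadratic transition between the linear and the constant part.
        return -(b * b - 2.0 * a * lam * b + lam * lam) / (2.0 * (a - 1.0))
    # Constant for large coefficients, so they are not penalized further.
    return (a + 1.0) * lam * lam / 2.0


small = scad_penalty(0.5, lam=1.0)    # linear region
large = scad_penalty(10.0, lam=1.0)   # flat region
l1_large = 1.0 * abs(10.0)            # L1 penalty at the same point, for comparison
```

For a large coefficient the SCAD penalty stays at $(a+1)\lambda^2/2$ while the $L_1$ penalty keeps growing, which is why the SCAD estimates avoid the bias of the LASSO for large coefficients.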
To reduce the influence of outliers, Wu & Liu (2007) used a truncated hinge loss function, which is a bounded loss function. Because the truncated hinge loss function is not convex, they proposed applying the difference convex algorithm, which solves the non-convex problem through a sequence of convex sub-problems derived from (1). Liu & Shen (2006) developed a non-convex loss function to treat robustness problems in SVM learning; they generalized binary $\psi$-learning to the multiclass case. Wu & Liu (2013) used an SVM with a weighted loss function and the $L_2$ penalty, which gives larger weights to points closer to the boundaries and smaller weights to points farther away.

Binary Weighted SVM
The solution of (1) is sparse in the sense that most of the coefficients become zero and only the support vectors have an impact on the SVM classifier. Among the support vectors, the misclassified points lying far from the hyperplane have a strong impact on the classifier, because their distance from the boundary is larger than that of the other points.
In linear regression, one class of robust estimates is a weighted version that reduces the impact of large residuals, so that points with large residuals do not influence the regression coefficients. Wu & Liu (2013) adapted this idea to (1) by multiplying the hinge loss of the $i$th observation by a weight $w_i$, which gives the weighted SVM (4). In addition, the dual form of (4) can be solved by a quadratic programming solver. They set the weight $w_i = 1/(1 + |\hat{f}(x_i)|)$ for $i = 1, \cdots, n$, where $\hat{f}(\cdot)$ is the solution of (1). The function $w_i[1 - y_i f(x_i)]_+$ becomes the same as the 0-1 loss except on $[0, 1]$. However, the weighted hinge loss function is continuous. When $w_i = 1$ for all the data, the weighted SVM reduces to the standard SVM (1).
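The effect of the weight on the hinge loss can be verified with a few numbers. In the sketch below the weight is evaluated at the same decision value as the loss, which is a simplification made for illustration; in the procedure described in the text the weight comes from a previously fitted solution. The function names are hypothetical.

```python
def robust_weight(f_value):
    """Weight 1 / (1 + |f(x)|): small for points far from the boundary."""
    return 1.0 / (1.0 + abs(f_value))


def weighted_hinge(y, f_value):
    """Weighted hinge loss w * [1 - y f(x)]_+ with w taken at the same f(x)."""
    return robust_weight(f_value) * max(1.0 - y * f_value, 0.0)


far_misclassified = weighted_hinge(1, -4.0)   # margin -4: plain hinge loss would be 5
well_classified = weighted_hinge(1, 3.0)      # margin 3: no loss
inside_margin = weighted_hinge(1, 0.5)        # margin in [0, 1]: between 0 and 1
```

A badly misclassified point contributes a loss of one instead of an unbounded hinge loss, so the weighted loss agrees with the 0-1 loss outside the margin interval $[0, 1]$.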
In this paper, we adapt the weight function for a robust SVM to the SCAD penalty. The resulting objective function (5) is the sum of the weighted hinge losses $w_i[1 - y_i(\beta_0 + x_i^T\beta)]_+$ plus the penalty $\sum_{j=1}^{p} p'_\lambda(|\beta_j^0|)|\beta_j|$, where $p'_\lambda(|\beta_j^0|)|\beta_j|$ is the linearized SCAD penalty function [Zou & Li, 2008] and $\beta^0$ is an initial estimator. Unlike the objective function (3), the objective function defined in (5) is convex in $\beta$, because the linearized penalty is a weighted $L_1$ norm. Replacing the penalty function in (3) by its linear approximation in Equation (6) gives the objective function (5). The solution of (5) is updated until it converges. The weight $w_i$ can be small for misclassified data and close to one for well-classified data. The weighted SCAD SVM procedure consists of two steps. The first step is to solve (5) with $w_i = 1$, $i = 1, \cdots, n$, for all training data points, and the second step is to solve (5) with the weights $w_i = 1/(1 + |\hat{f}(x_i)|)$, where $\hat{f}(x_i) = \hat{\beta}_0 + x_i^T\hat{\beta}$ and $\hat{\beta}_0$, $\hat{\beta}$ are the solution of (5) for the non-weighted case $w_i = 1$. Wu & Liu (2013) recommended a one-step weighted iteration, because the iterative solution is not guaranteed to converge if weights based on the original hinge loss function are assigned instead of the one-step solution.
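To make the linearized penalty concrete, the following sketch computes the SCAD derivative $p'_\lambda(\cdot)$ at an initial estimate and uses it as a per-coefficient weight on $|\beta_j|$. The derivative formula is the standard one from Fan & Li (2001); the function names and the example values are assumptions made for illustration.

```python
def scad_derivative(theta, lam, a=3.7):
    """Derivative p'_lambda(|theta|) of the SCAD penalty."""
    theta = abs(theta)
    if theta <= lam:
        return lam
    return max(a * lam - theta, 0.0) / (a - 1.0)


def linearized_penalty(beta, beta_init, lam, a=3.7):
    """Sum over j of p'_lambda(|beta_init_j|) * |beta_j| (weighted L1 norm)."""
    return sum(scad_derivative(b0, lam, a) * abs(b)
               for b, b0 in zip(beta, beta_init))


beta_init = (0.0, 0.5, 2.0, 5.0)
penalty_weights = [scad_derivative(b0, lam=1.0) for b0 in beta_init]
value = linearized_penalty((0.1, 0.4, 2.1, 5.2), beta_init, lam=1.0)
```

Coefficients whose initial estimate is small receive the full $L_1$ weight $\lambda$, while coefficients with a large initial estimate receive weight zero and are left unpenalized. Because the weights are fixed and non-negative, the penalty is convex in $\beta$.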
Now we obtain the solution of (5). Equation (5) can be solved by standard Linear Programming (LP) software. To derive the LP formulation of (5), we introduce a set of slack variables.

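The proposed algorithm can be summarized as follows.
Step 1: Set the initial solution $\beta_0^0$, $\beta^0$ by the linear discriminant function.
Step 2: Solve the linear programming problem (7) or the linear system (9) with $w_i = 1$ until convergence.
Step 3: Compute the weights $w_i = 1/(1 + |\hat{f}(x_i)|)$ from the solution of Step 2 and solve (7) or (9) again with these weights.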
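The two-step weighted procedure can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: a plain subgradient descent replaces the LP solver and the linear system of the paper, the initial coefficients are passed in by the caller instead of being computed by a linear discriminant function, the loss is scaled by $1/n$, and the penalty is re-linearized at the first-step fit. All function names are hypothetical.

```python
def scad_derivative(theta, lam, a=3.7):
    """Derivative of the SCAD penalty at |theta| (Fan & Li, 2001)."""
    theta = abs(theta)
    if theta <= lam:
        return lam
    return max(a * lam - theta, 0.0) / (a - 1.0)


def decision(beta0, beta, x):
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def fit(X, y, w, pen, steps=2000, lr=0.1):
    """Subgradient descent on mean_i w_i * hinge_i + sum_j pen_j * |beta_j|.

    Assumption: this stands in for the LP / linear-system solvers of the paper.
    """
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for t in range(1, steps + 1):
        g0, g = 0.0, [0.0] * p
        for xi, yi, wi in zip(X, y, w):
            if yi * decision(beta0, beta, xi) < 1.0:  # hinge term is active
                g0 -= wi * yi / n
                for j in range(p):
                    g[j] -= wi * yi * xi[j] / n
        for j in range(p):
            sign = (beta[j] > 0) - (beta[j] < 0)      # subgradient of |beta_j|
            g[j] += pen[j] * sign
        step = lr / t ** 0.5                          # diminishing step size
        beta0 -= step * g0
        beta = [b - step * gj for b, gj in zip(beta, g)]
    return beta0, beta


def two_step_weighted_scad_svm(X, y, lam, beta_init):
    n = len(X)
    pen = [scad_derivative(b, lam) for b in beta_init]
    beta0, beta = fit(X, y, [1.0] * n, pen)           # first step: w_i = 1
    w = [1.0 / (1.0 + abs(decision(beta0, beta, xi))) for xi in X]
    pen = [scad_derivative(b, lam) for b in beta]     # assumption: re-linearize here
    beta0, beta = fit(X, y, w, pen)                   # second step: robust weights
    return beta0, beta, w


# Toy data: the last point carries a flipped label and acts as an outlier.
X = [(2.0, 0.1), (3.0, -0.2), (-2.0, 0.0), (-3.0, 0.3), (-2.5, 0.1)]
y = [1, 1, -1, -1, 1]
beta0, beta, w = two_step_weighted_scad_svm(X, y, lam=0.05, beta_init=[1.0, 0.0])
```

The weights are computed once from the unweighted fit and then held fixed, in line with the one-step weighting recommended by Wu & Liu (2013).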

Multiclass Weighted SVM
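Consider a $K$-class problem with a training set $\{(x_i, y_i);\ i = 1, \cdots, n\}$, where $x_i$ is the input vector and $y_i \in \{1, 2, \cdots, K\}$ represents its class label. The classifier needs a $K$-dimensional decision function $f(x) = (f_1(x), \cdots, f_K(x))$ with the sum-to-zero constraint $\sum_{k=1}^{K} f_k(x) = 0$ for any input vector $x \in \mathbb{R}^p$, minimizing the objective function (11) [Lee et al., 2004], which combines the multiclass hinge loss with the SCAD penalty, subject to $\sum_{k=1}^{K} \beta_{kj} = 0$ for $j = 0, 1, \cdots, p$, where $f_k(x) = \beta_{k0} + \sum_{j=1}^{p} \beta_{kj}x_j$ for all $1 \le k \le K$.
Similar to Section 3.1, we can use the LQA method in the multiclass SVM. Let $c_{ik}$ be the $(i, k)$ element of the $n \times K$ matrix having the value $I(y_i \ne k)$. Define $\beta = (\beta_1^T, \cdots, \beta_{K-1}^T)^T$, where $\beta_k = (\beta_{k0}, \beta_{k1}, \cdots, \beta_{kp})^T$. In the unweighted case of (11), the approximation is taken around $\beta_k^0$, where $\beta_k^0$ denotes an initial value of $\beta_k$. Then (13) can be written in the matrix form (14), $\frac{1}{2}\beta^T Q\beta + L^T\beta$, in which each class contributes a block of length $(p + 1)$.
Now we consider a weighted version of the objective function (11). The minimization problem of (11) can be written as
$$\frac{1}{2}\beta^T Q_w\beta + L_w^T\beta \qquad (15)$$
where $Q_w$ and $L_w$ are defined similarly to $Q$ and $L$ in (14), except that the element $c_{ik}$ is replaced by $w_i c_{ik}$. In the unweighted case, the objective function (15) reduces to (14).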
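The role of the indicator $I(y_i \ne k)$ and of the weights in the multiclass loss can be illustrated as follows. The loss form $\sum_i w_i \sum_k I(y_i \ne k)[f_k(x_i) + 1]_+$ and the arg-max classification rule are assumptions based on the multiclass SVM of Lee et al. (2004) as it is described in this paper; the exact scaling of (11) is not reproduced, and the function names are hypothetical.

```python
def multiclass_hinge_loss(F, y, w=None):
    """Sum_i w_i * sum_k I(y_i != k) * [f_k(x_i) + 1]_+ with classes 1..K.

    F[i][k-1] holds f_k(x_i); each row of F is assumed to sum to zero.
    """
    if w is None:
        w = [1.0] * len(F)                       # unweighted case
    total = 0.0
    for fi, yi, wi in zip(F, y, w):
        for k, fk in enumerate(fi, start=1):
            c_ik = 1.0 if yi != k else 0.0       # indicator I(y_i != k)
            total += wi * c_ik * max(fk + 1.0, 0.0)
    return total


def satisfies_sum_to_zero(fi, tol=1e-9):
    """Check the sum-to-zero constraint for one decision vector."""
    return abs(sum(fi)) <= tol


def predict(fi):
    """Assign the class with the largest decision value."""
    return max(range(1, len(fi) + 1), key=lambda k: fi[k - 1])


# Two observations from class 1; the second one is badly misclassified.
F = [[1.0, -0.5, -0.5], [-2.0, 1.5, 0.5]]
y = [1, 1]
unweighted = multiclass_hinge_loss(F, y)
weighted = multiclass_hinge_loss(F, y, w=[1.0, 0.25])
```

Multiplying $c_{ik}$ by a small weight $w_i$ shrinks the contribution of the misclassified second observation, which mirrors the replacement of $c_{ik}$ by $w_i c_{ik}$ in the weighted objective.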
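We can summarize a robust algorithm for the weighted multiclass SVM with the SCAD penalty function by the following iterative steps.
Step 1: Set the initial solution $\beta_0^0$, $\beta^0$ by the linear discriminant function.
Step 2: Solve the linear programming problem (12) or the LQA problem (15) with $w_i = 1$ until convergence.
Step 3: Compute the weights $w_i$ from the solution of Step 2 and solve (12) or (15) again with these weights.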

IV. SIMULATION RESULTS
This section presents simulations that show the robustness of the method proposed in Section III. We numerically compare the proposed method with the $L_2$ SVM, the $L_1$ SVM and the SCAD SVM.

Simulation
We consider sample sizes of 100, 1000 and 10000 for the training data, the tuning data and the test data, respectively.
To select an appropriate tuning parameter $\lambda$, we choose the value minimizing the misclassification rate over the grid $\log_2\lambda = -10, -9, \cdots, 1, 2$. If the minimum values are tied, the largest of the tied tuning parameters is selected for better generalization. We used the approximate objective functions (9) and (15). First we consider a multiclass example with $K = 3$ and $p = 2$. The data $x$ are generated from the bivariate normal distribution $N(\mu_k, \sigma^2 I_2)$, where $\mu_1 = (\sqrt{3}, 1)^T$, $\mu_2 = (-\sqrt{3}, 1)^T$, $\mu_3 = (0, -2)^T$ and $\sigma^2 = 2$ [Jung, 2012]. We contaminated the data by flipping the response to one of the other response values, each with a given probability $per/2$.
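The simulation design can be sketched in Python with the standard library only. The class means and the variance follow the text; the equal class probabilities, the function names and the example error rates are assumptions made for this illustration.

```python
import math
import random

# Class means from the text; the common variance is sigma^2 = 2.
MEANS = {1: (math.sqrt(3), 1.0), 2: (-math.sqrt(3), 1.0), 3: (0.0, -2.0)}
SIGMA = math.sqrt(2.0)


def generate(n, rng):
    """Draw n labelled points; equal class probabilities are an assumption."""
    data = []
    for _ in range(n):
        k = rng.choice([1, 2, 3])
        mx, my = MEANS[k]
        data.append(((rng.gauss(mx, SIGMA), rng.gauss(my, SIGMA)), k))
    return data


def flip_labels(data, rate, rng):
    """With probability `rate`, move a label to one of the two other classes.

    Each of the other classes is then chosen with probability rate / 2.
    """
    flipped = []
    for x, k in data:
        if rng.random() < rate:
            k = rng.choice([c for c in (1, 2, 3) if c != k])
        flipped.append((x, k))
    return flipped


def select_log2_lambda(error_by_log2_lambda):
    """Pick the grid value with the smallest error; ties go to the largest."""
    best = min(error_by_log2_lambda.values())
    return max(l for l, e in error_by_log2_lambda.items() if e == best)


rng = random.Random(0)
train = flip_labels(generate(100, rng), rate=0.1, rng=rng)
grid = list(range(-10, 3))  # log2(lambda) = -10, -9, ..., 1, 2
chosen = select_log2_lambda({-10: 0.30, -9: 0.20, -8: 0.20, -7: 0.25})
```

In the last line the values $-9$ and $-8$ tie for the smallest error, so the larger one is selected.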
After learning from the training data, we compute the mean and standard deviation of the misclassification rate on the test data. Table 1 summarizes the results for 100 replications. Each entry in the table is the mean of the misclassification rate, and the number in parentheses is the sample standard deviation. It shows that our proposed robust algorithm achieves the lowest mean misclassification rate among the four methods. In particular, its standard deviation is also the smallest among the methods.
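The summary statistics reported per method can be computed as in the sketch below. The rates used here are made-up placeholder values, not results from the paper, and the function name is hypothetical.

```python
import statistics


def summarize(rates):
    """Mean and sample standard deviation (n - 1 denominator) over replications."""
    return statistics.mean(rates), statistics.stdev(rates)


# Placeholder misclassification rates from three hypothetical replications.
mean_rate, sd_rate = summarize([0.10, 0.20, 0.30])
```

`statistics.stdev` uses the $n - 1$ denominator, which matches the sample standard deviation mentioned in the text.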

Real Data
We choose a real data set from the UCI repository. The liver data set consists of 345 observations of 6 input variables with 2 classes. The first 5 variables are measurements from blood tests believed to be sensitive to liver disorders that might arise from excessive alcohol consumption. The sixth input variable is the number of half-pint equivalents of alcoholic beverages drunk per day. We randomly divide the data set into three parts as usual: the training set, the tuning set and the test set. Each set contains 115 samples. To contaminate the data we flipped the labels as in Section 4.1. The flipping rates are 0%, 3%, 6% and 10%. The simulation is conducted for 100 replications. Table 2 shows the robustness performance on the contaminated data. Our proposed method is the most robust among the methods in terms of resistance to outliers.
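A random three-way split of the 345 observations into sets of 115 can be sketched as follows. The function name and the seed are assumptions made for this example; loading the data itself is not shown.

```python
import random


def three_way_split(n, rng):
    """Shuffle indices 0..n-1 and cut them into three equal parts."""
    idx = list(range(n))
    rng.shuffle(idx)
    size = n // 3
    return idx[:size], idx[size:2 * size], idx[2 * size:3 * size]


train_idx, tune_idx, test_idx = three_way_split(345, random.Random(1))
```

Repeating the split with a fresh random state in each of the 100 replications gives independent training, tuning and test sets for every run.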

V. CONCLUSION
In this paper we proposed a robust algorithm for the multiclass SVM with the SCAD penalty function. We used a weight function for robustness. We derived two approximate objective functions to treat the non-convex optimization problem: one based on LLA and the other on LQA. Although LLA is more efficient than LQA, its implementation is not easy. The simulation shows the effectiveness of our proposed algorithm.
… is convex in $\beta$. Zou & Li (2008) proposed a new unified algorithm based on Local Linear Approximation (LLA) to the penalty function …
… $2, \cdots, n$, and we write $\beta_j = \beta_j^+ - \beta_j^-$, where $\beta_j^+$ and $\beta_j^-$ denote the positive and negative parts of $\beta_j$, respectively. Then it is straightforward to show that (5) is equivalent to …
… is defined by $\{f_k - f_j,\ j \ne k\}$ [Wu & Liu, 2013]. The weight reduces the impact of the misclassified data points far from the decision boundaries, which can be support vectors. Let … $= \sum I(y_i \ne k)[\beta_{k0} + x_i^T\beta_k + 1]_+$.

Table 1: Simulation Results for Three Classes

Table 2: Average of Misclassification Rates for the Liver Data