Variable selection for the multicategory SVM via adaptive sup-norm regularization

The Support Vector Machine (SVM) is a popular classification paradigm in machine learning and has achieved great success in real applications. However, the standard SVM cannot select variables automatically, and therefore its solution typically utilizes all the input variables without discrimination. This makes it difficult to identify important predictor variables, which is often one of the primary goals in data analysis. In this paper, we propose two novel types of regularization in the context of the multicategory SVM (MSVM) for simultaneous classification and variable selection. The MSVM generally requires estimation of multiple discriminating functions and applies the argmax rule for prediction. For each individual variable, we propose to characterize its importance by the sup-norm of its coefficient vector associated with the different functions, and then to minimize the MSVM hinge loss subject to a penalty on the sum of these sup-norms. To further improve the sup-norm penalty, we propose an adaptive regularization, which allows different weights to be imposed on different variables according to their relative importance. Both types of regularization automate variable selection in the process of building classifiers, and lead to sparse multi-classifiers with enhanced interpretability and improved accuracy, especially for high-dimensional, low-sample-size data. One big advantage of the sup-norm penalty is its easy implementation via standard linear programming. Several simulated examples and one real gene expression data analysis demonstrate the outstanding performance of the adaptive sup-norm penalty in various data settings.


Introduction
In supervised learning problems, we are given a training set of n examples from K ≥ 2 different populations. For each example in the training set, we observe its covariate x_i ∈ R^d and the corresponding label y_i indicating its membership. Our ultimate goal is to learn a classification rule which can accurately predict the class label of a future example based on its covariate. Among many classification methods, the Support Vector Machine (SVM) has gained much popularity in both machine learning and statistics. The seminal work by Vapnik (1995, 1998) laid the foundation for general statistical learning theory and the SVM, which in turn inspired various extensions of the SVM. For other references on the binary SVM, see Cristianini and Shawe-Taylor (2000), Schölkopf and Smola (2002), and references therein. Recently a few attempts have been made to generalize the SVM to multiclass problems, such as Vapnik (1998), Weston and Watkins (1999), Crammer and Singer (2001), Lee et al. (2004), Liu and Shen (2006), and Wu and Liu (2007a).
While the SVM outperforms many other methods in terms of classification accuracy in numerous real problems, the implicit nature of its solution makes it less attractive in providing insight into the predictive ability of individual variables. Oftentimes, selecting relevant variables is the primary goal of data mining. For the binary SVM, Bradley and Mangasarian (1998) demonstrated the utility of the L_1 penalty, which can effectively select variables by shrinking small or redundant coefficients to zero. Zhu et al. (2003) provide an efficient algorithm to compute the entire solution path for the L_1-norm SVM. Other forms of penalty have also been studied in the context of the binary SVM, such as the L_0 penalty (Weston et al., 2003), the SCAD penalty (Zhang et al., 2006), the L_q penalty, the combination of the L_0 and L_1 penalties (Liu and Wu, 2007), the combination of the L_1 and L_2 penalties (Wang et al., 2006), the F_∞ norm (Zou and Yuan, 2006), and others (Zhao et al., 2006; Zou, 2006).
For multiclass problems, variable selection becomes more complex than in the binary case, since the MSVM requires estimation of multiple discriminating functions, each of which has its own subset of important predictors. One natural idea is to extend the L_1 SVM to the L_1 MSVM, as done in the recent work of Lee et al. (2006) and Wang and Shen (2007b). However, the L_1 penalty does not distinguish the source of the coefficients: it treats all coefficients equally, regardless of whether they correspond to the same variable or to different variables, and regardless of whether they are more likely to be relevant or irrelevant. In this paper, we propose a new regularized MSVM for more effective variable selection. In contrast to the L_1 MSVM, which imposes a penalty on the sum of absolute values of all coefficients, we penalize the sup-norm of the coefficients associated with each variable. The proposed method is shown to achieve a higher degree of model parsimony than the L_1 MSVM without compromising classification accuracy.
This paper is organized as follows. Section 2 formulates the sup-norm regularization for the MSVM. Section 3 proposes an efficient algorithm to implement the MSVM. Section 4 discusses an adaptive approach to improve performance of the sup-norm MSVM by allowing different penalties for different covariates according to their relative importance. Numerical results on simulated and gene expression data are given in Sections 5 and 6, followed by a summary.

Methodology
In K-category classification problems, we code y as {1, . . . , K} and define f = (f_1, . . . , f_K) as a decision function vector. Each f_k, a mapping from the input domain R^d to R, represents the strength of the evidence that an example with input x belongs to class k; k = 1, . . . , K. A classifier induced by f assigns an example with input x to the class with the largest f_k(x). We assume the n training pairs {(x_i, y_i), i = 1, . . . , n} are independently and identically distributed according to an unknown probability distribution P(x, y). Given a classifier f, its performance is measured by the generalization error (GE), GE(f) = P(Y ≠ argmax_k f_k(X)). Let p_k(x) = Pr(Y = k|X = x) be the conditional probability of class k given X = x. The Bayes rule which minimizes the GE is then given by

f_B(x) = argmax_{k=1,...,K} p_k(x). (2.1)

For nonlinear problems, we assume f_k(x) = b_k + Σ_{j=1}^q w_kj h_j(x) using a set of basis functions {h_j(x)}. This linear representation of a nonlinear classifier through basis functions greatly facilitates the formulation of the proposed method. Alternatively, nonlinear classifiers can also be obtained by applying the kernel trick (Boser et al., 1992). However, a kernel classifier is often given as a black-box function, in which the contribution of each individual covariate to the decision rule is too implicit to be characterized. Therefore we use the basis expansion to construct nonlinear classifiers in this paper.
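The argmax rule with a basis expansion can be sketched in a few lines. This is only an illustration; the function and variable names are ours, with W holding the coefficient w_kj in row k, column j:

```python
import numpy as np

def predict(X, b, W, basis=None):
    """Argmax rule: assign x to the class k maximizing
    f_k(x) = b_k + sum_j w_kj * h_j(x)."""
    H = X if basis is None else basis(X)   # n x q matrix of basis evaluations
    F = H @ W.T + b                        # n x K matrix of scores f_k(x_i)
    return np.argmax(F, axis=1) + 1        # classes coded 1..K

# toy check in 1-D with K = 2: f_1(x) = x, f_2(x) = -x (sum-to-zero)
W = np.array([[1.0], [-1.0]])
b = np.array([0.0, 0.0])
print(predict(np.array([[2.0], [-3.0]]), b, W))   # -> [1 2]
```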
The standard multicategory SVM (MSVM; Lee et al., 2004) solves

min_f (1/n) Σ_{i=1}^n Σ_{k≠y_i} [f_k(x_i) + 1]_+ + λ Σ_{k=1}^K ||w_k||_2^2, (2.2)

under the sum-to-zero constraint Σ_{k=1}^K f_k = 0, where [u]_+ = max(u, 0). The sum-to-zero constraint used here follows Lee et al. (2004) in their framework for the MSVM. It is imposed to eliminate redundancy in the f_k's and to ensure identifiability of the solution. This constraint is also a necessary condition for the Fisher consistency of the MSVM proposed by Lee et al. (2004). To achieve variable selection, Wang and Shen (2007b) proposed to impose the L_1 penalty on the coefficients; the corresponding L_1 MSVM then solves

min (1/n) Σ_{i=1}^n Σ_{k≠y_i} [f_k(x_i) + 1]_+ + λ Σ_{k=1}^K Σ_{j=1}^d |w_kj|, (2.3)

under the sum-to-zero constraint. For linear classification rules, we start with f_k(x) = b_k + w_k^T x, k = 1, . . . , K. The sum-to-zero constraint then becomes

Σ_{k=1}^K b_k = 0 and Σ_{k=1}^K w_kj = 0 for j = 1, . . . , d. (2.4)

The L_1 MSVM treats all w_kj's equally without distinction. As opposed to this, we take into account the fact that some of the coefficients are associated with the same covariate, and therefore it is more natural to treat them as a group rather than separately.
Throughout the paper, we use w_k = (w_k1, . . . , w_kd)^T to represent the kth row vector of W, and w_(j) = (w_1j, . . . , w_Kj)^T for the jth column vector of W. According to Crammer and Singer (2001), the value b_k + w_k^T x defines the similarity score of class k, and the predicted label is the index of the row attaining the highest similarity score with x. We define the sup-norm of the coefficient vector w_(j) as

||w_(j)||_∞ = max_{k=1,...,K} |w_kj|. (2.5)

In this way, the importance of each covariate x_j is directly controlled by its largest absolute coefficient. We propose the sup-norm regularized MSVM:

min (1/n) Σ_{i=1}^n Σ_{k≠y_i} [f_k(x_i) + 1]_+ + λ Σ_{j=1}^d ||w_(j)||_∞, (2.6)

subject to the sum-to-zero constraint (2.4). The sup-norm MSVM encourages sparser solutions than the L_1 MSVM and identifies important variables more precisely. In the following, we describe the main motivation of the sup-norm MSVM, which makes it more attractive for variable selection than the L_1 MSVM. First, with a sup-norm penalty, a noise variable is removed if and only if all K of its estimated coefficients are 0. On the other hand, if a variable is important, with a positive sup-norm, then the sup-norm penalty, unlike the L_1 penalty, does not put any additional penalty on the other K − 1 coefficients. This is desirable since a variable will be kept in the model as long as the sup-norm of its K coefficients is positive; no further shrinkage of the remaining coefficients is needed for variable selection. For illustration, we plot the region 0 ≤ t_1 + t_2 ≤ C in Figure 1, where t_1 = max(|w_11|, |w_21|, |w_31|, |w_41|) and t_2 = max(|w_12|, |w_22|, |w_32|, |w_42|). Clearly, the sup-norm penalty shrinks the sum of the two maxima corresponding to the two variables, which leads to more parsimonious models. In short, in contrast to the L_1 penalty, the sup-norm utilizes the group structure of the decision function vector, and consequently the sup-norm MSVM can deliver better variable selection.
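The contrast between the two penalties can be made concrete in a few lines (illustrative helpers, assuming W stores w_kj in row k, column j): under the sup-norm, a variable kept by one class incurs no extra cost for its other coefficients, whereas the L_1 penalty charges every nonzero coefficient.

```python
import numpy as np

def supnorm_penalty(W):
    """Sum over variables j of ||w_(j)||_inf = max_k |w_kj|; W is K x d."""
    return np.abs(W).max(axis=0).sum()

def l1_penalty(W):
    """Sum of |w_kj| over all k and j."""
    return np.abs(W).sum()

# three classes, two variables: variable 1 has two nonzero coefficients
W = np.array([[ 1.0, 0.0],
              [ 0.0, 0.5],
              [-1.0, 0.0]])
print(supnorm_penalty(W))  # 1.0 + 0.5 = 1.5
print(l1_penalty(W))       # 1.0 + 0.5 + 1.0 = 2.5
```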
For three-class problems, we show that the L_1 MSVM and the newly proposed sup-norm MSVM give identical solutions after adjusting the tuning parameters, which is due to the sum-to-zero constraints on the w_(j)'s. This equivalence, however, does not hold for the adaptive procedures introduced in Section 4.

Proposition 2.1. When K = 3, the L_1 MSVM (2.3) and the sup-norm MSVM (2.6) are equivalent.
When K > 3, our empirical experience shows that the sup-norm MSVM generally performs well in terms of classification accuracy.
Here we would like to point out two fundamental differences between the sup-norm penalty and the F_∞ penalty used for group variable selection (Zhao et al., 2006; Zou and Yuan, 2006), considering their similar expressions. The purpose of group selection is to select several predictor variables together when these predictors work as a group. First, each F_∞ term in Zou and Yuan (2006) is based on the regression coefficients of several variables belonging to one group, whereas each sup-norm term in (2.6) is associated with only one predictor variable. Second, in the implementation of the F_∞ penalty, one has to decide in advance the number of groups and which variables belong to each group, whereas in the sup-norm MSVM each variable is naturally associated with its own group, and the number of groups is the same as the number of covariates.
As a remark, we point out that Argyriou et al. (2006, 2007) proposed a similar penalty for the purpose of multi-task feature learning. Specifically, they used a mixture of L_1 and L_2 penalties: they first applied the L_2 penalty to each feature across different tasks and then used the L_1 penalty for feature selection. In contrast, our penalty is a combination of the L_1 and sup-norm penalties for multicategory classification.

The tuning parameter λ in (2.6) balances the tradeoff between data fit and model parsimony. A proper choice of λ is important to assure good performance of the resulting classifier. If the chosen λ is too small, the procedure tends to overfit the training data and gives a less sparse solution; on the other hand, if λ is too large, the solution can become very sparse but possibly with low prediction power. The choice of the tuning parameter is typically made by minimizing either an estimate of the generalization error or other related performance measures. In the simulations, we generate an extra independent tuning set to choose the best λ. For the real data analysis, we use leave-one-out cross validation of the misclassification rate to select λ.

Computational Algorithms
In this section we show that the optimization problem (2.6) can be converted to a linear programming (LP) problem, and can therefore be solved using standard LP techniques in polynomial time. This great computational advantage is very important in real applications, especially for large data sets.
Let A be an n × K class-indicator matrix with entries a_ik = I(y_i = k) for i = 1, . . . , n and k = 1, . . . , K. First we introduce slack variables ξ_ik such that

ξ_ik ≥ f_k(x_i) + 1 and ξ_ik ≥ 0, for all i and k with a_ik = 0. (3.1)

The optimization problem (2.6) can then be expressed as

min (1/n) Σ_{i=1}^n Σ_{k: a_ik=0} ξ_ik + λ Σ_{j=1}^d ||w_(j)||_∞, subject to (3.1) and (2.4). (3.2)

To further simplify (3.2), we introduce a second set of slack variables η_j, which add the new constraints |w_kj| ≤ η_j, for k = 1, . . . , K; j = 1, . . . , d.
Finally, write w_kj = w+_kj − w−_kj, where w+_kj and w−_kj denote the positive and negative parts of w_kj, respectively. Similarly, w+_(j) and w−_(j) consist of the positive and negative parts of the components of w_(j). Denote η = (η_1, . . . , η_d)^T; then (3.2) becomes the linear program

min (1/n) Σ_{i=1}^n Σ_{k: a_ik=0} ξ_ik + λ Σ_{j=1}^d η_j (3.3)
subject to ξ_ik ≥ b_k + w_k^T x_i + 1, ξ_ik ≥ 0, for all i and k with a_ik = 0;
w+_kj + w−_kj ≤ η_j, w+_kj ≥ 0, w−_kj ≥ 0, for all k and j;
Σ_{k=1}^K b_k = 0 and Σ_{k=1}^K (w+_kj − w−_kj) = 0, for j = 1, . . . , d.
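As a rough illustration of this LP formulation, the following sketch assembles the hinge, sup-norm, and sum-to-zero constraints for a generic solver. It is a hypothetical re-implementation using scipy.optimize.linprog; the helper name, variable stacking, and solver choice are our assumptions (the paper's own experiments use CPLEX with AMPL):

```python
import numpy as np
from scipy.optimize import linprog

def supnorm_msvm_lp(X, y, lam):
    """Linear sup-norm MSVM as an LP (illustrative sketch).
    Variables, stacked in order: b+ (K), b- (K), w+ (K*d), w- (K*d),
    xi (n*K), eta (d).  Classes in y are coded 1..K.
    Minimize (1/n) * sum_{i, k != y_i} xi_ik + lam * sum_j eta_j."""
    n, d = X.shape
    K = int(y.max())
    nb, nw = K, K * d
    nvar = 2 * nb + 2 * nw + n * K + d
    ob, ow, oxi, oeta = 0, 2 * nb, 2 * nb + 2 * nw, 2 * nb + 2 * nw + n * K

    c = np.zeros(nvar)
    for i in range(n):
        for k in range(K):
            if k != y[i] - 1:
                c[oxi + i * K + k] = 1.0 / n
    c[oeta:] = lam

    A_ub, b_ub = [], []
    # hinge constraints: b_k + w_k^T x_i - xi_ik <= -1 for k != y_i
    for i in range(n):
        for k in range(K):
            if k == y[i] - 1:
                continue
            row = np.zeros(nvar)
            row[ob + k], row[ob + K + k] = 1.0, -1.0             # b_k = b+ - b-
            row[ow + k * d: ow + (k + 1) * d] = X[i]             # w+_k
            row[ow + nw + k * d: ow + nw + (k + 1) * d] = -X[i]  # -w-_k
            row[oxi + i * K + k] = -1.0
            A_ub.append(row); b_ub.append(-1.0)
    # sup-norm constraints: w+_kj + w-_kj <= eta_j
    for k in range(K):
        for j in range(d):
            row = np.zeros(nvar)
            row[ow + k * d + j] = 1.0
            row[ow + nw + k * d + j] = 1.0
            row[oeta + j] = -1.0
            A_ub.append(row); b_ub.append(0.0)

    A_eq, b_eq = [], []
    # sum-to-zero: sum_k b_k = 0 and sum_k w_kj = 0 for each j
    row = np.zeros(nvar); row[ob:ob + K] = 1.0; row[ob + K:ob + 2 * K] = -1.0
    A_eq.append(row); b_eq.append(0.0)
    for j in range(d):
        row = np.zeros(nvar)
        for k in range(K):
            row[ow + k * d + j] = 1.0
            row[ow + nw + k * d + j] = -1.0
        A_eq.append(row); b_eq.append(0.0)

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * nvar, method="highs")
    b = res.x[ob:ob + K] - res.x[ob + K:ob + 2 * K]
    W = (res.x[ow:ow + nw] - res.x[ow + nw:ow + 2 * nw]).reshape(K, d)
    return b, W
```

On a small, well-separated three-class problem this LP drives the slacks to zero and recovers a classifier whose coefficient columns satisfy the sum-to-zero constraint up to solver tolerance.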

Adaptive Penalty
In (2.3) and (2.6), the same weight is used for every variable in the penalty term, which may be too restrictive: a smaller penalty is more desirable for variables that are so important that we want to retain them in the model. In this section, we suggest that different variables be penalized differently according to their relative importance. Ideally, large penalties should be imposed on redundant variables in order to eliminate them from the model more easily, and small penalties should be used on important variables in order to retain them in the final classifier. Motivated by this, we consider the following adaptive L_1 MSVM:

min (1/n) Σ_{i=1}^n Σ_{k≠y_i} [f_k(x_i) + 1]_+ + λ Σ_{k=1}^K Σ_{j=1}^d τ_kj |w_kj|, (4.1)

where τ_kj > 0 represents the weight for coefficient w_kj. Adaptive shrinkage for each variable has been proposed and studied in various regression contexts, including the adaptive LASSO for linear regression (Zou, 2006), proportional hazards models (Zhang and Lu, 2007), and quantile regression (Wang et al., 2007; Wu and Liu, 2007b). In particular, Zou (2006) established the oracle property of the adaptive LASSO and justified the use of different amounts of shrinkage for different variables. Due to the special form of the sup-norm MSVM, we consider the following two ways to employ adaptive penalties:

[I] min (1/n) Σ_{i=1}^n Σ_{k≠y_i} [f_k(x_i) + 1]_+ + λ Σ_{j=1}^d τ_j ||w_(j)||_∞, (4.2)

[II] min (1/n) Σ_{i=1}^n Σ_{k≠y_i} [f_k(x_i) + 1]_+ + λ Σ_{j=1}^d ||(τ w)_(j)||_∞, (4.3)

where τ_j > 0 is a per-variable weight and the vector (τ w)_(j) = (τ_1j w_1j, . . . , τ_Kj w_Kj)^T for j = 1, . . . , d.
In (4.1), (4.2), and (4.3), the weights can be regarded as leverage factors, adaptively chosen so that large penalties are imposed on the coefficients of unimportant covariates and small penalties on those of important ones. Let w̃ be the solution to the standard MSVM (2.2) with the L_2 penalty. Our empirical experience suggests that τ_kj = 1/|w̃_kj| is a good choice for (4.1) and (4.3), and τ_j = 1/||w̃_(j)||_∞ = 1/max_k |w̃_kj| is a good choice for (4.2). If w̃_kj = 0, which implies an infinite penalty on w_kj, we set the corresponding coefficient solution ŵ_kj to zero.
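Deriving both sets of adaptive weights from an initial fit can be sketched as follows (a hypothetical helper; the eps guard for numerically-zero coefficients is our addition, and an infinite weight encodes "fix this coefficient at zero"):

```python
import numpy as np

def adaptive_weights(W_init, eps=1e-8):
    """Weights from an initial L2-MSVM fit W~ (K x d):
    tau_kj = 1/|w~_kj|        for the adaptive L1 (4.1) and type-II sup-norm (4.3);
    tau_j  = 1/max_k |w~_kj|  for the type-I sup-norm (4.2).
    Coefficients with w~_kj = 0 receive infinite weight."""
    absW = np.abs(W_init)
    with np.errstate(divide="ignore"):
        tau_kj = np.where(absW > eps, 1.0 / np.maximum(absW, eps), np.inf)
        colmax = absW.max(axis=0)
        tau_j = np.where(colmax > eps, 1.0 / np.maximum(colmax, eps), np.inf)
    return tau_kj, tau_j

W_init = np.array([[ 0.5, 0.0],
                   [-2.0, 0.0]])
tau_kj, tau_j = adaptive_weights(W_init)
print(tau_kj)  # [[2.  inf] [0.5 inf]]
print(tau_j)   # [0.5 inf]
```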
In terms of computational issues, all three problems (4.1), (4.2), and (4.3) can be solved as LP problems. Their entire solution paths may be obtained by some modifications of the algorithms in Wang and Shen (2007b).

Simulation
In this section, we compare the performance of six MSVM methods: the standard L_2 MSVM, the L_1 MSVM, the sup-norm MSVM, the adaptive L_1 MSVM, and the two adaptive sup-norm MSVMs. Three simulation models are considered: (1) a linear example with five classes; (2) a linear example with four classes; (3) a nonlinear example with three classes. In each simulation setting, n observations are simulated as the training data, and another n observations are generated for tuning the regularization parameter λ for each procedure; the total sample size for obtaining the final classifiers is therefore 2n. To test the accuracy of the classification rules, we also independently generate n′ observations as a test set. The tuning parameter λ is selected via a grid search over log_2(λ) = −14, −13, . . . , 15; when a tie occurs, we choose the larger value of λ. As suggested in Section 4, we use the L_2 MSVM solution to derive the weights for the adaptive MSVMs; this solution is the final one tuned on the separate tuning set. Once the weights are chosen, we tune the parameter λ of each adaptive procedure on the same tuning set.
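The tie-breaking grid search described above can be sketched as follows (a hypothetical helper: `fit` stands for any routine returning (b, W) for given data and λ; iterating the grid from the largest λ downward with a strict improvement test implements "choose the larger λ on ties"):

```python
import numpy as np

def tune_lambda(fit, X_train, y_train, X_tune, y_tune):
    """Grid search over log2(lambda) = -14,...,15, ties broken toward
    the larger lambda (largest-first grid + strict '<' comparison)."""
    best_err, best_lam = np.inf, None
    for lam in [2.0 ** p for p in range(15, -15, -1)]:   # 2^15 down to 2^-14
        b, W = fit(X_train, y_train, lam)
        pred = np.argmax(X_tune @ W.T + b, axis=1) + 1
        err = np.mean(pred != y_tune)
        if err < best_err:        # ties keep the earlier (larger) lambda
            best_err, best_lam = err, lam
    return best_lam, best_err
```

Any of the LP-based solvers can be plugged in as `fit`; with a fixed classifier, all grid points tie and the largest λ is returned.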
We conduct 100 simulations for each classification method under all settings. Each fitted classifier is then evaluated in terms of its classification accuracy and variable selection performance. For each method, we report its average testing error, the number of correct and incorrect zero coefficients among Kd coefficients, the model size as the number of important ones among the d variables, and the number of times that the true model is correctly identified. The numbers given in the parentheses in the tables are the standard errors of the testing errors. We also summarize the frequency of each variable being selected over 100 runs. All simulations are done using the optimization software CPLEX with the AMPL interface (Fourer et al., 2003). More information about CPLEX can be found on the ILOG website http://www.ilog.com/products/optimization/.

Five-Class Example
Consider a five-class example with the input vector x in a 10-dimensional space. The first two components of the input vector are generated from a Gaussian mixture in the following way: for each class k, generate (x_1, x_2) independently from N(µ_k, σ_1^2 I_2), with µ_k = 2(cos([2k − 1]π/5), sin([2k − 1]π/5)), k = 1, 2, 3, 4, 5, and the remaining eight components i.i.d. from N(0, σ_2^2). We generate the same number of observations in each class. Here σ_1 = √2, σ_2 = 1, n = 250, and n′ = 50,000. Table 1 shows that, in terms of classification accuracy, the L_2 MSVM, the sup-norm MSVM, and the two adaptive sup-norm MSVMs are among the best, and their testing errors are close to each other. In terms of other measures, such as the number of correct/incorrect zeros, the model size, and the number of times the true model is correctly identified, the sup-norm MSVM procedures work much better than the other MSVM methods. Table 2 shows the frequency with which each variable is selected by each procedure over the 100 runs. The type I sup-norm MSVM performs the best among all. Overall, the adaptive MSVMs show significant improvement over the non-adaptive classifiers in terms of both classification accuracy and variable selection.
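Under the stated parameters, the data-generating mechanism can be sketched as follows (hypothetical helper names; n is assumed divisible by the five classes):

```python
import numpy as np

def simulate_five_class(n, rng):
    """Five-class example: (x1, x2) ~ N(mu_k, 2*I_2) with mu_k on a circle
    of radius 2 at angle (2k-1)*pi/5; x3..x10 are N(0, 1) noise."""
    K, d = 5, 10
    y = np.repeat(np.arange(1, K + 1), n // K)            # equal class sizes
    theta = (2 * y - 1) * np.pi / 5
    X = rng.normal(0.0, 1.0, size=(len(y), d))            # sigma_2 = 1 noise
    X[:, 0] = 2 * np.cos(theta) + rng.normal(0, np.sqrt(2), len(y))
    X[:, 1] = 2 * np.sin(theta) + rng.normal(0, np.sqrt(2), len(y))
    return X, y

X, y = simulate_five_class(250, np.random.default_rng(0))
print(X.shape, np.bincount(y)[1:])   # (250, 10) [50 50 50 50 50]
```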

Four-Class Linear Example
In the simulation example in Section 5.1, the informative variables are important for all classes. In this section, we consider an example where the informative variables are important for some classes but not for others. Specifically, we generate four i.i.d. important variables x_1, x_2, x_3, x_4 from Unif[−1, 1] as well as six i.i.d. noise variables x_5, . . . , x_10 from N(0, 8^2). Define the class functions f_k and set p_k(x) = P(Y = k|X = x) ∝ exp(f_k(x)), k = 1, 2, 3, 4. In this example, we set n = 200 and n′ = 40,000. Note that x_1 is not important for distinguishing class 3 from class 4. Similarly, x_2 is noninformative for classes 1 and 4, x_3 is noninformative for classes 1 and 2, and x_4 is noninformative for classes 2 and 3. Table 3 summarizes the performance of the various procedures, and Table 4 shows the frequency with which each variable is selected by each procedure over the 100 runs. Due to the increased difficulty of this problem, the performance of all methods is not as good as in the five-class example. From these results, we can see that the adaptive procedures work better than the non-adaptive procedures in terms of both classification accuracy and variable selection. Furthermore, the adaptive L_1 MSVM performs the best overall. This is due to the difference between the L_1 and sup-norm penalties: the sup-norm penalty treats all coefficients of one variable across the different classes as a group and removes the variable only if it is noninformative for all class labels. By design of this example, important variables have zero coefficients for certain classes. As a result, the sup-norm penalty does not deliver the best performance here. Nevertheless, the adaptive sup-norm procedures still perform reasonably well.
To examine the performance of the various methods using a richer set of basis functions, we also fit nonlinear MSVMs via a polynomial basis of degree 3, with 55 basis functions. Results on classification and variable selection with n = 200 and 400 are reported in Tables 7 and 8, respectively. Compared with the second-order polynomial basis, the classification testing errors using the third-order polynomial basis are much larger for the L_2, L_1, and sup-norm MSVMs, but similar for the adaptive procedures. Due to the large basis set, none of the methods can identify the correct model; however, the adaptive procedures eliminate more noise variables than the non-adaptive procedures. This further demonstrates the effectiveness of adaptive weighting. The variable selection frequencies (not reported due to lack of space) show a similar pattern to those for the second-order polynomial. When n increases from 200 to 400, classification accuracy for all methods improves as expected. Interestingly, compared to the case of n = 200, the variable selection performance with n = 400 stays relatively the same for the non-adaptive procedures, while it improves dramatically for the adaptive procedures.

Real Example
DNA microarray technology has made it possible to monitor the mRNA expression of thousands of genes simultaneously. In this section, we apply the six MSVMs to the children's cancer data set of Khan et al. (2001), who classified the small round blue cell tumors (SRBCTs) of childhood into four classes, namely neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing family of tumors (EWS), using cDNA gene expression profiles. After filtering, 2308 gene profiles out of 6567 genes are given in the data set, available at http://research.nhgri.nih.gov/microarray/Supplement/. The data set includes a training set of size 63 and a test set of size 20. The distributions of the four tumor categories in the training and test sets are given in Table 9. Note that Burkitt lymphoma (BL) is a subset of NHL.
To analyze the data, we first standardize the data sets by a simple linear transformation based on the training data. Specifically, we standardize the expression x̃_gi of the gth gene of subject i to obtain

x_gi = (x̃_gi − (1/n) Σ_{j=1}^n x̃_gj) / sd(x̃_g1, . . . , x̃_gn).

Then we rank all genes by their marginal relevance in class separation, adopting a simple criterion used in Dudoit et al. (2002). Specifically, the relevance measure for gene g is defined as the ratio of the between-class sum of squares to the within-class sum of squares:

R(g) = [Σ_{i=1}^n Σ_{k=1}^K I(y_i = k)(x̄_·g^(k) − x̄_·g)^2] / [Σ_{i=1}^n Σ_{k=1}^K I(y_i = k)(x_gi − x̄_·g^(k))^2],

where n is the size of the training set, x̄_·g^(k) denotes the average expression level of gene g for class k observations, and x̄_·g is the overall mean expression level of gene g in the training set. To examine the variable selection performance of the different methods, we select the top 100 and bottom 100 genes as covariates according to the relevance measure R. Our main goal here is to obtain a set of "important" genes and a set of "unimportant" genes, and to see whether our methods can effectively remove the "unimportant" genes.
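The standardization and the between-to-within sum-of-squares ranking can be sketched as follows (hypothetical helpers; columns play the role of genes and rows the role of subjects):

```python
import numpy as np

def standardize(X):
    """Center and scale each gene (column) using the training data,
    with the sample standard deviation sd(x_g1, ..., x_gn)."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    return (X - mu) / sd

def relevance(X, y):
    """Marginal relevance of each gene: ratio of between-class to
    within-class sum of squares (criterion of Dudoit et al., 2002)."""
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        bss += len(Xk) * (Xk.mean(axis=0) - overall) ** 2
        wss += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return bss / wss

# toy check: gene 0 separates the two classes, gene 1 is pure noise
X = np.array([[1.0, 0.10], [1.2, -0.10], [5.0, 0.05], [5.2, -0.05]])
y = np.array([1, 1, 2, 2])
R = relevance(X, y)
print(R[0] > R[1])   # True
```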
All six MSVMs with their different penalties are applied to the training set. We use leave-one-out cross validation on the standardized training data with the 200 genes to select the tuning parameter, and then apply the resulting classifiers to the test data. The results are tabulated in Table 10. All methods have either 0 or 1 misclassifications on the test set. In terms of gene selection, the three sup-norm MSVMs eliminate all of the bottom 100 genes and use around 50 of the top 100 genes to achieve classification performance comparable to the other methods.
In Figure 3, we plot heat maps of the training and testing sets in the left and right panels, respectively. In these heat maps, rows represent the 50 genes selected by the Type I sup-norm MSVM and columns represent patients. The gene expression values are reflected by the colors on the plot, with red representing the highest expression level and blue the lowest. For visualization, we group the columns within each class together and use hierarchical clustering with correlation distance on the training set to order the genes, so that genes close to each other have similar expression patterns. The left panel of Figure 3 shows that the selected genes display clear class-specific expression patterns across the four tumor types.

Discussion

The notion of the "group lasso" has also been studied in the context of learning a kernel (Micchelli and Pontil, 2007). If such group information is available for multicategory classification, there will be two kinds of group information available for model building: one type of group formed by the same covariate corresponding to different classes, as considered in this paper, and the other formed among covariates. A future research direction is to combine both kinds of group information to construct a new multicategory classification method. We believe that such classifiers can potentially outperform those that do not use the additional information.
This paper focuses on the variable selection issue for supervised learning. In practice, semi-supervised learning is often encountered, and many methods have been developed including Zhu et al. (2003) and Wang and Shen (2007a). Another future topic is to generalize the sup-norm penalty to the context of semi-supervised learning.