Neural Networks, Volume 48, December 2013, Pages 32–43

Analysis of programming properties and the row–column generation method for 1-norm support vector machines

https://doi.org/10.1016/j.neunet.2013.07.005

Abstract

This paper deals with fast methods for training a 1-norm support vector machine (SVM). First, we define a specific class of linear programs with many sparse constraints, namely row–column sparse constraint linear programming (RCSC-LP). By nature, the 1-norm SVM is an instance of RCSC-LP. To construct and solve subproblems of RCSC-LP, a family of row–column generation (RCG) methods is introduced. RCG methods belong to the category of decomposition techniques and perform row and column generation in parallel. In particular, for the 1-norm SVM, the maximum size of the RCG subproblems is identical to the number of support vectors (SVs). We also introduce a semi-deleting rule for RCG methods and prove their convergence when this rule is used. Experimental results on toy data and real-world datasets show that RCG is efficient for training the 1-norm SVM, especially when the number of SVs is small.

Introduction

It is known that the introduction of the hinge loss into standard support vector machines (SVMs) leads to a sparse model representation (Girosi, 1998, Steinwart, 2003). In the least absolute shrinkage and selection operator (LASSO) (Efron et al., 2004, Tibshirani, 1996, Tibshirani, 1997) and basis pursuit (BP) methods (Mallat and Zhang, 1993, Smola and Bartlett, 2001, Smola and Schölkopf, 2000, Vincent and Bengio, 2002), the 1-norm regularization penalty, used as a substitute for the 0-norm regularization penalty, can also induce a sparse solution. The 1-norm SVM minimizes a weighted combination of the hinge loss and the 1-norm regularization penalty, so it is expected to be even sparser. Zhang and Zhou (2010) showed that the number of non-zero coefficients in the 1-norm SVM model is no more than the number of exact support vectors (ESVs), which are only a subset of the support vectors (SVs). ESVs are samples for which the product of the label and the corresponding output of the hypothesis function equals 1. By contrast, the number of non-zero coefficients in the standard SVM equals the number of SVs (Girosi, 1998, Steinwart, 2003). Further empirical evidence of the sparsity of the 1-norm SVM has been reported in the literature (Bi et al., 2005, Natsoulis et al., 2005, Stoeckel and Fung, 2007, Zhang et al., 2006). Sparsity not only clearly improves the test speed but also leads to good generalization performance. Demiriz, Bennett, and Shawe-Taylor (2002) showed that the 1-norm soft margin error rule employed by the 1-norm SVM is related to a generalization error bound. From the viewpoint of statistical learning theory, the weighted sum of the 1-norm penalty and the hinge loss also constitutes a structural risk in which the 1-norm penalty plays an important role in controlling the capacity of the 1-norm SVM (Floyd and Warmuth, 1995, Graepel and Shawe-Taylor, 2000, Vapnik, 1998, Vapnik, 1999). Therefore the 1-norm SVM usually has generalization performance comparable to that of standard SVMs. As a result of its sparsity, the linear 1-norm SVM can be used as a feature (or variable) selector, and the kernel 1-norm SVM as a basis function selector (Bi, Bennett, Embrechts, Breneman, & Song, 2003).
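
For concreteness, the objective minimized by the 1-norm SVM discussed above can be written, in one common notation with C denoting the regularization trade-off (the paper's own notation may differ), as

\min_{\mathbf{w},b}\; \|\mathbf{w}\|_1 + C\sum_{i=1}^{\ell}\max\bigl(0,\,1-y_i(\mathbf{w}^{\top}\mathbf{x}_i+b)\bigr),

where the first term is the 1-norm regularization penalty and the second is the hinge loss. In the kernel case the hypothesis takes the form f(\mathbf{x})=\sum_{j=1}^{\ell}\alpha_j K(\mathbf{x}_j,\mathbf{x})+b and the penalty becomes \|\boldsymbol{\alpha}\|_1.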

The 1-norm SVM can be formulated as a standard linear program (LP) with inequality constraints (Bennett, 1999, Mangasarian, 2000, Rätsch et al., 2000, Zhou et al., 2002, Zhu et al., 2004). It is well known that many commercial optimization tools can be used to solve the LP problem of the 1-norm SVM; for example, the simplex method (Vanderbei, 1996) and its accelerated variant (Barrodale & Roberts, 1973) can be used to train the 1-norm SVM. For large-scale problems, however, these tools suffer from high computational complexity. Since the 1-norm SVM usually has a simpler form than standard SVMs, more efficient algorithms are strongly desired, and past research has produced a variety of algorithms for large-scale problems. Bradley and Mangasarian (2000) developed a linear programming chunking (LPC) algorithm to train the large-scale linear 1-norm SVM. The LPC algorithm is essentially a block row generation (RG) method for the dual problem of the 1-norm SVM. Using the LPC algorithm, Bradley and Mangasarian (2000) reported that the linear 1-norm SVM can be applied to a learning task with one million samples. However, the LPC algorithm is inapplicable to the kernel 1-norm SVM with large sample sizes, because each subproblem in the LPC algorithm still has a large number of constraints and variables. An RG method for the dual problem (i.e., a column generation (CG) method for the primal problem) was proposed by Demiriz et al. (2002) to solve LP boosting, and it can easily be generalized to solve the LP problem of the 1-norm SVM. In this method, each subproblem has ℓ+n+1 constraints and ℓ+1 variables, where ℓ is the number of samples and n is the number of rows generated so far. Both Fung and Mangasarian (2004) and Mangasarian (2006) used the Newton method to minimize an exterior penalty problem constructed from the dual problem of the 1-norm SVM, a method called NLPSVM. They also showed that an exact least 2-norm solution to the primal problem can be obtained from the solution to the dual problem. Each iteration of NLPSVM must compute the inverse of a generalized Hessian at a time cost of O(min(ℓ,D)^3) when the Sherman–Morrison–Woodbury identity is applied, where ℓ and D are the numbers of rows and columns of the sample matrix, respectively. Thus, NLPSVM is applicable only when the sample matrix is far from square, that is, when either the number of training samples or the number of sample features is small. For the linear 1-norm SVM, ℓ is the number of samples and D is the dimensionality of the samples, which is usually small, whereas for the kernel 1-norm SVM, D=ℓ. Hence, the NLPSVM method is also not applicable to the kernel 1-norm SVM with large samples.
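
As a reference point for the LP formulations cited above, one standard way (notation may differ from the paper's) to cast the kernel 1-norm SVM as an inequality-constrained LP is to split each coefficient into non-negative parts, \alpha_j = \alpha_j^{+} - \alpha_j^{-}, and introduce slack variables \xi_i:

\begin{aligned}
\min_{\alpha^{+},\alpha^{-},\xi,b}\quad & \sum_{j=1}^{\ell}(\alpha_j^{+}+\alpha_j^{-}) + C\sum_{i=1}^{\ell}\xi_i\\
\text{s.t.}\quad & y_i\Bigl(\sum_{j=1}^{\ell}(\alpha_j^{+}-\alpha_j^{-})K(\mathbf{x}_j,\mathbf{x}_i)+b\Bigr) \ge 1-\xi_i,\quad i=1,\dots,\ell,\\
& \alpha^{+}\ge 0,\;\alpha^{-}\ge 0,\;\xi\ge 0 .
\end{aligned}

In this form there are ℓ row constraints (one per sample) and O(ℓ) columns (one per coefficient and per slack variable), which is why both row generation and column generation are natural decomposition strategies for this LP.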

We know that the primal problem of the standard SVM has only a few active (or tight) constraints, and the number of active constraints equals the number of support vectors (SVs) (or non-zero coefficients) because of the use of the hinge loss (Burges, 1998, Smola, 1996, Vapnik, 1998). Since the 1-norm SVM also employs the hinge loss, the number of active constraints equals the number of non-zero coefficients in the 1-norm SVM model. Usually the samples with non-zero coefficients form only a small part of the training samples. Steinwart (2003) gave an asymptotic bound on the number of SVs for the 2-norm penalty SVM, which can informally be stated as: "With probability tending to 1, the fraction of support vectors is essentially greater than the Bayes risk as the number of training examples tends to infinity".
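
In terms of the hypothesis f introduced above, this can be made concrete (under the usual convention for the hinge loss) as follows: at the optimum, the margin constraint of sample i is active when y_i f(\mathbf{x}_i) \le 1, which characterizes the SVs, while the ESVs of the previous section are the samples with y_i f(\mathbf{x}_i) = 1 exactly; samples with y_i f(\mathbf{x}_i) > 1 contribute inactive constraints.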

In the 1-norm SVM, there exist many inactive (or relaxed) constraints in the primal problem. This type of LP with many inactive constraints is hereinafter defined as the row sparse constraint LP. Based on this property, the LPC algorithm can use primal RG to train the linear 1-norm SVM with a large number of examples. By the well-known complementary slackness theorem (Dantzig & Wolfe, 1961), the number of active constraints in the primal problem equals that in its dual if both are non-degenerate. This implies that, for the 1-norm SVM, both the row constraints of the primal problem and the column constraints of the dual problem are sparse. This type of LP with sparse row and column constraints is defined as the row–column sparse constraint LP (RCSC-LP). Based on the column sparsity, LP boosting via column generation can efficiently train the 1-norm SVM with a large number of variables (Demiriz et al., 2002).
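
The complementary slackness theorem invoked above can be stated as follows for a primal–dual LP pair written in the generic form \min_{x\ge 0}\{c^{\top}x : Ax\ge b\} and \max_{u\ge 0}\{b^{\top}u : A^{\top}u\le c\}: optimal solutions x^{*} and u^{*} satisfy

u_i^{*}\,(Ax^{*}-b)_i = 0 \quad\text{and}\quad x_j^{*}\,(c-A^{\top}u^{*})_j = 0 \quad\text{for all } i,j.

Hence an inactive primal row forces the corresponding dual variable to zero, and an inactive dual column forces the corresponding primal variable to zero, which is why row sparsity and column sparsity appear together in the 1-norm SVM LP.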

Note that the LPC algorithm and the LP boosting algorithm each exploit only one side of the sparseness of the 1-norm SVM to reduce the number of constraints or variables of the LP. Therefore, these two algorithms are applicable only to problems in which the sample matrix is far from square. At present, no algorithm trains the 1-norm SVM by exploiting both row and column sparseness. This paper introduces a family of row–column generation (RCG) methods that utilize both kinds of sparseness to train the 1-norm SVM. RCG methods were first proposed by Jaumard, Meyer, and Vovor (1999) and applied to the channel assignment problem; a similar RCG method was presented and applied to the multistage cutting stock problem (Zak, 2002). Row and column generation have seldom been used together. This can be explained in part by the convergence issue: row generation progressively restricts the problem while column generation progressively relaxes it, so the sequence of objective function values is not monotonic when row and column generation are used at the same time.

The contribution of our work is twofold: we introduce RCG methods to solve the LP problem of the 1-norm SVM (i.e., RCSC-LP), and we propose a semi-deleting rule for RG/CG and RCG methods. With the semi-deleting rule, not only is convergence guaranteed, but the size of each subproblem is also greatly reduced, because most inactive constraints can be deleted. RCG methods fully exploit the sparsity of the row and column constraints and thereby speed up the optimization process.

The rest of this paper is organized as follows. Section 2 gives the definition of RCSC-LP and reviews the 1-norm SVM. RG/CG methods and a new deleting rule are introduced in Section 3.1. RCG methods and their implementation are discussed in Sections 3.2 and 3.3, respectively. In addition, Section 3 presents theorems on the convergence of these methods when the new rule is adopted; the theorems are proved in the Appendices. Numerical comparisons are reported in Section 4. Finally, Section 5 concludes the paper.

Section snippets

Row–column sparse constraint LP and 1-norm SVM

In this section, we define RCSC-LP and review the 1-norm SVM for classification.

Methods for training the 1-norm SVM

In this section, we discuss three methods for training the 1-norm SVM: the column generation (CG) method, the row generation (RG) method, and the row–column generation (RCG) method. A semi-deleting rule is introduced to guarantee the convergence of the CG, RG, and RCG methods, and to speed up the optimization process. So far, RCG methods have not been used to solve the LP problem of the 1-norm SVM. Here we mainly focus on RCG methods and their application to 1-norm SVMs.
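
Because only a snippet of this section is reproduced here, the following is merely a schematic sketch of a generic row–column generation loop with a simple deleting heuristic; it is not the paper's RCG algorithm or its semi-deleting rule, and the tolerances, working-set sizes, and the use of SciPy's HiGHS-based linprog (including its dual multipliers) are assumptions for illustration only.

    # Schematic row-column generation (RCG) loop for an LP of the form
    #     min c^T x   s.t.   A x >= b,  x >= 0,
    # in which most rows and columns are inactive at the optimum (RCSC-LP).
    # Illustrative sketch only; the paper's methods are defined in Section 3.
    import numpy as np
    from scipy.optimize import linprog  # SciPy >= 1.7 for HiGHS dual values


    def rcg_sketch(A, b, c, n_add=10, tol=1e-6, max_iter=100):
        m, n = A.shape
        rows = list(range(min(n_add, m)))   # working set of constraints (rows)
        cols = list(range(min(n_add, n)))   # working set of variables (columns)
        x_full = np.zeros(n)

        for _ in range(max_iter):
            # Solve the restricted subproblem over the current working sets
            # (A x >= b is rewritten as -A x <= -b for linprog).
            res = linprog(c[cols], A_ub=-A[np.ix_(rows, cols)], b_ub=-b[rows],
                          bounds=[(0, None)] * len(cols), method="highs")
            x_sub = res.x
            u_sub = -res.ineqlin.marginals  # duals of the >= constraints, u >= 0

            x_full = np.zeros(n)
            x_full[cols] = x_sub

            # Row generation: pick the most violated constraints of the full problem.
            violation = b - A @ x_full
            new_rows = [i for i in np.argsort(-violation)[:n_add]
                        if violation[i] > tol and i not in rows]

            # Column generation: pick variables with the most negative reduced cost.
            u_full = np.zeros(m)
            u_full[rows] = u_sub
            reduced_cost = c - A.T @ u_full
            new_cols = [j for j in np.argsort(reduced_cost)[:n_add]
                        if reduced_cost[j] < -tol and j not in cols]

            if not new_rows and not new_cols:
                break                        # no violated rows or improving columns

            # Deleting heuristic: drop rows whose constraints are strictly inactive.
            # (The paper's semi-deleting rule is more careful, so that convergence
            # can be proved; this is only a stand-in.)
            slack = A[np.ix_(rows, cols)] @ x_sub - b[rows]
            kept = [r for r, s in zip(rows, slack) if s <= tol]
            rows = (kept or rows) + new_rows
            cols = cols + new_cols

        return x_full

For the 1-norm SVM LP sketched earlier, the rows correspond to training samples and the columns to coefficients and slack variables, so in such a loop the working sets tend to shrink toward the support vectors, in line with the observation that the maximum subproblem size of RCG matches the number of SVs.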

Numerical experiments

To validate the RCG methods for training the 1-norm SVM for classification, experiments are performed on a toy dataset with a two-class Gaussian distribution, five two-class datasets from the UCI repository (Frank & Asuncion, 2010), and three multi-class datasets. All numerical simulations were performed on a personal computer with a 1.8 GHz Pentium III processor and 1 GB of memory, running Windows XP and Matlab 7.1.

In all simulations, the RCG methods are compared with

Concluding remarks

The 1-norm SVM can be formulated as a standard inequality-constrained LP belonging to a specific class of such problems, called RCSC-LP. It is shown that the LP problem of the 1-norm SVM has as many active row and column constraints as support vectors. We introduce a family of RCG methods which generate rows and columns synchronously. In addition, in order to counterbalance the non-deleting rule, which leaves many inactive constraints in the

Acknowledgments

We would like to thank two anonymous reviewers and Editor Kenji Doya for their valuable comments and suggestions, which have significantly improved this paper. This work was supported in part by the National Natural Science Foundation of China under Grant No. 61033013, by the Natural Science Foundation of Jiangsu Province of China under Grant Nos. BK2011284 and BK201222725, by the Natural Science Pre-research Project of Soochow University under Grant No. SDY2011B09, by the Natural Science

References (43)

  • E.J. Zak

    Row and column generation technique for a multistage cutting stock problem

    Computers and Operations Research

    (2002)
  • L. Zhang et al.

    On the sparseness of 1-norm support vector machines

    Neural Networks

    (2010)
  • W. Zhou et al.

    Linear programming support vector machines

    Pattern Recognition

    (2002)
  • E. Allwein et al.

    Reducing multiclass to binary: a unifying approach for margin classifiers

    Journal of Machine Learning Research

    (2000)
  • C. Barnhart et al.

    Branch-and-price: column generation for solving huge integer programs

    Operations Research

    (1998)
  • I. Barrodale et al.

    An improved algorithm for discrete ℓ1 linear approximation

    SIAM Journal on Numerical Analysis

    (1973)
  • K.P. Bennett

    Combining support vector and mathematical programming methods for classification

  • J. Bi et al.

    Dimensionality reduction via sparse support vector machines

    Journal of Machine Learning Research

    (2003)
  • Bi, J., Chen, Y., & Wang, J. (2005). A sparse support vector machine approach to region-based image categorization. In...
  • P.S. Bradley et al.

    Massive data discrimination via linear support vector machines

    Optimization Methods and Software

    (2000)
  • C.J.C. Burges

    A tutorial on support vector machines for pattern recognition

    Data Mining and Knowledge Discovery

    (1998)
  • G. Dantzig et al.

    The decomposition algorithm for linear programs

    Econometrica

    (1961)
  • A. Demiriz et al.

    Linear programming boosting via column generation

    Machine Learning

    (2002)
  • B. Efron et al.

    Least angle regression

    Annals of Statistics

    (2004)
  • S. Floyd et al.

    Sample compression, learnability, and the Vapnik–Chervonenkis dimension

    Machine Learning

    (1995)
  • Frank, A., & Asuncion, A. (2010). UCI machine learning repository. From...
  • G. Fung et al.

    A feature selection Newton method for support vector machine classification

    Computational Optimization and Applications

    (2004)
  • F. Girosi

    An equivalence between sparse approximation and support vector machines

    Neural Computation

    (1998)
  • T. Graepel et al.

    Generalization error bounds for sparse linear classifiers

  • Jaumard, B., Meyer, C., & Vovor, T. (1999). Column/row generation and elimination methods (Les Cahiers du GERAD No....
  • Y. LeCun et al.

    Gradient-based learning applied to document recognition

    Proceedings of the IEEE

    (1998)