Neural Networks, Volume 48, December 2013, Pages 32–43

Analysis of programming properties and the row–column generation method for 1-norm support vector machines

https://doi.org/10.1016/j.neunet.2013.07.005

Abstract

This paper deals with fast methods for training a 1-norm support vector machine (SVM). First, we define a specific class of linear programs with many sparse constraints, namely row–column sparse constraint linear programming (RCSC-LP). By nature, the 1-norm SVM is an instance of RCSC-LP. To construct and solve subproblems of RCSC-LP, a family of row–column generation (RCG) methods is introduced. RCG methods belong to the category of decomposition techniques and perform row and column generation in parallel. In particular, for the 1-norm SVM, the maximum size of the RCG subproblems is identical to the number of support vectors (SVs). We also introduce a semi-deleting rule for RCG methods and prove their convergence when this rule is used. Experimental results on toy data and real-world datasets show that RCG is efficient for training the 1-norm SVM, especially when the number of SVs is small.

Introduction

It is known that the introduction of the hinge loss into standard support vector machines (SVMs) leads to a sparse model representation (Girosi, 1998, Steinwart, 2003). In the least absolute shrinkage and selection operator (LASSO) (Efron et al., 2004, Tibshirani, 1996, Tibshirani, 1997) and basis pursuit (BP) methods (Mallat and Zhang, 1993, Smola and Bartlett, 2001, Smola and Schölkopf, 2000, Vincent and Bengio, 2002), the 1-norm regularization penalty, used as a substitute for the 0-norm regularization penalty, can also induce a sparse solution. The 1-norm SVM minimizes a weighted combination of the hinge loss and the 1-norm regularization penalty, so it is expected to be even sparser. Zhang and Zhou (2010) showed that the number of non-zero coefficients in the 1-norm SVM model is no more than the number of exact support vectors (ESVs), which are only a subset of the support vectors (SVs). ESVs are samples for which the product of the label and the corresponding output of the hypothesis function equals 1. By contrast, the number of non-zero coefficients in the standard SVM equals the number of SVs (Girosi, 1998, Steinwart, 2003). Further empirical evidence of the sparsity of the 1-norm SVM has been reported in the literature (Bi et al., 2005, Natsoulis et al., 2005, Stoeckel and Fung, 2007, Zhang et al., 2006). Sparsity not only clearly improves the test speed but also leads to good generalization performance. Demiriz, Bennett, and Shawe-Taylor (2002) showed that the 1-norm soft margin error rule employed by the 1-norm SVM is related to a generalization error bound. From the viewpoint of statistical learning theory, the weighted sum of the 1-norm penalty and the hinge loss also constitutes a structural risk in which the 1-norm penalty plays an important role in controlling the capacity of the 1-norm SVM (Floyd and Warmuth, 1995, Graepel and Shawe-Taylor, 2000, Vapnik, 1998, Vapnik, 1999). Therefore the 1-norm SVM usually has generalization performance comparable to that of standard SVMs. As a result of its sparsity, the linear 1-norm SVM can be used as a feature (or variable) selector, and the kernel 1-norm SVM as a basis function selector (Bi, Bennett, Embrechts, Breneman, & Song, 2003).
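
For concreteness, the objective minimized by the 1-norm SVM discussed above can be written, in one common notation with C denoting the regularization trade-off (the paper's own notation may differ), as

\min_{\mathbf{w},b}\; \|\mathbf{w}\|_1 + C\sum_{i=1}^{\ell}\max\bigl(0,\,1-y_i(\mathbf{w}^{\top}\mathbf{x}_i+b)\bigr),

where the first term is the 1-norm regularization penalty and the second is the hinge loss. In the kernel case the hypothesis takes the form f(\mathbf{x})=\sum_{j=1}^{\ell}\alpha_j K(\mathbf{x}_j,\mathbf{x})+b and the penalty becomes \|\boldsymbol{\alpha}\|_1.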

The 1-norm SVM can be formulated as a standard linear program (LP) with inequality constraints (Bennett, 1999, Mangasarian, 2000, Rätsch et al., 2000, Zhou et al., 2002, Zhu et al., 2004). It is well known that many commercial optimization tools can be used to solve the LP problem of the 1-norm SVM; for example, the simplex method (Vanderbei, 1996) and its accelerated variant (Barrodale & Roberts, 1973) can be used to train the 1-norm SVM. For large-scale problems, however, these tools suffer from high computational complexity. Since the 1-norm SVM usually has a simpler form than standard SVMs, more efficient algorithms are strongly desired, and past research has produced a variety of algorithms for large-scale problems. Bradley and Mangasarian (2000) developed a linear programming chunking (LPC) algorithm to train the large-scale linear 1-norm SVM. The LPC algorithm is essentially a block row generation (RG) method for the dual problem of the 1-norm SVM. Using the LPC algorithm, Bradley and Mangasarian (2000) reported that the linear 1-norm SVM can be applied to a learning task with one million samples. However, the LPC algorithm is inapplicable to the kernel 1-norm SVM with large sample sizes, because each subproblem in the LPC algorithm still has a large number of constraints and variables. An RG method for the dual problem (i.e., a column generation (CG) method for the primal problem) was proposed by Demiriz et al. (2002) to solve LP boosting, and it can easily be generalized to solve the LP problem of the 1-norm SVM. In this method, each subproblem has ℓ+n+1 constraints and ℓ+1 variables, where ℓ is the number of samples and n is the number of rows generated so far. Both Fung and Mangasarian (2004) and Mangasarian (2006) used the Newton method to minimize an exterior penalty problem constructed from the dual problem of the 1-norm SVM, a method called NLPSVM. They also showed that an exact least 2-norm solution to the primal problem can be obtained from the solution to the dual problem. Each iteration of NLPSVM must compute the inverse of a generalized Hessian at a time cost of O(min(ℓ,D)^3) when the Sherman–Morrison–Woodbury identity is applied, where ℓ and D are the numbers of rows and columns of the sample matrix, respectively. Thus, NLPSVM is applicable only when the sample matrix is far from square, that is, when either the number of training samples or the number of sample features is small. For the linear 1-norm SVM, ℓ is the number of samples and D is the dimensionality of the samples, which is usually small, whereas for the kernel 1-norm SVM, D=ℓ. Hence, the NLPSVM method is also not applicable to the kernel 1-norm SVM with large samples.
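
As a reference point for the LP formulations cited above, one standard way (notation may differ from the paper's) to cast the kernel 1-norm SVM as an inequality-constrained LP is to split each coefficient into non-negative parts, \alpha_j = \alpha_j^{+} - \alpha_j^{-}, and introduce slack variables \xi_i:

\begin{aligned}
\min_{\alpha^{+},\alpha^{-},\xi,b}\quad & \sum_{j=1}^{\ell}(\alpha_j^{+}+\alpha_j^{-}) + C\sum_{i=1}^{\ell}\xi_i\\
\text{s.t.}\quad & y_i\Bigl(\sum_{j=1}^{\ell}(\alpha_j^{+}-\alpha_j^{-})K(\mathbf{x}_j,\mathbf{x}_i)+b\Bigr) \ge 1-\xi_i,\quad i=1,\dots,\ell,\\
& \alpha^{+}\ge 0,\;\alpha^{-}\ge 0,\;\xi\ge 0 .
\end{aligned}

In this form there are ℓ row constraints (one per sample) and O(ℓ) columns (one per coefficient and per slack variable), which is why both row generation and column generation are natural decomposition strategies for this LP.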

We know that the primal problem of the standard SVM has only a few active (or tight) constraints, and the number of active constraints equals the number of support vectors (SVs) (or non-zero coefficients) because of the use of the hinge loss (Burges, 1998, Smola, 1996, Vapnik, 1998). Since the 1-norm SVM also employs the hinge loss, the number of active constraints equals the number of non-zero coefficients in the 1-norm SVM model. Usually the samples with non-zero coefficients form only a small part of the training samples. Steinwart (2003) gave an asymptotic bound on the number of SVs for the 2-norm penalty SVM, which can informally be stated as: "With probability tending to 1, the fraction of support vectors is essentially greater than the Bayes risk as the number of training examples tends to infinity".
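
In terms of the hypothesis f introduced above, this can be made concrete (under the usual convention for the hinge loss) as follows: at the optimum, the margin constraint of sample i is active when y_i f(\mathbf{x}_i) \le 1, which characterizes the SVs, while the ESVs of the previous section are the samples with y_i f(\mathbf{x}_i) = 1 exactly; samples with y_i f(\mathbf{x}_i) > 1 contribute inactive constraints.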

In the 1-norm SVM, there exist many inactive (or relaxed) constraints in the primal problem. This type of LP with many inactive constraints is hereinafter defined as the row sparse constraint LP. Based on this property, the LPC algorithm can use primal RG to train the linear 1-norm SVM with a large number of examples. By the well-known complementary slackness theorem (Dantzig & Wolfe, 1961), the number of active constraints in the primal problem equals that in its dual if both are non-degenerate. This implies that, for the 1-norm SVM, both the row constraints of the primal problem and the column constraints of the dual problem are sparse. This type of LP with sparse row and column constraints is defined as the row–column sparse constraint LP (RCSC-LP). Based on the column sparsity, LP boosting via column generation can efficiently train the 1-norm SVM with a large number of variables (Demiriz et al., 2002).
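
The complementary slackness theorem invoked above can be stated as follows for a primal–dual LP pair written in the generic form \min_{x\ge 0}\{c^{\top}x : Ax\ge b\} and \max_{u\ge 0}\{b^{\top}u : A^{\top}u\le c\}: optimal solutions x^{*} and u^{*} satisfy

u_i^{*}\,(Ax^{*}-b)_i = 0 \quad\text{and}\quad x_j^{*}\,(c-A^{\top}u^{*})_j = 0 \quad\text{for all } i,j.

Hence an inactive primal row forces the corresponding dual variable to zero, and an inactive dual column forces the corresponding primal variable to zero, which is why row sparsity and column sparsity appear together in the 1-norm SVM LP.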

Note that the LPC algorithm and the LP boosting algorithm each exploit only one side of the sparseness of the 1-norm SVM to reduce the number of constraints or variables of the LP. Therefore, these two algorithms are applicable only to problems in which the sample matrix is far from square. At present, no algorithm trains the 1-norm SVM by exploiting both row and column sparseness. This paper introduces a family of row–column generation (RCG) methods that utilize both kinds of sparseness to train the 1-norm SVM. RCG methods were first proposed by Jaumard, Meyer, and Vovor (1999) and applied to the channel assignment problem; a similar RCG method was presented and applied to the multistage cutting stock problem (Zak, 2002). Row and column generation have seldom been used together. This can be explained in part by the convergence issue: row generation progressively restricts the problem while column generation progressively relaxes it, so the sequence of objective function values is not monotonic when row and column generation are used at the same time.

The contribution of our work is twofold: we introduce RCG methods to solve the LP problem of the 1-norm SVM (i.e., RCSC-LP), and we propose a semi-deleting rule for RG/CG and RCG methods. With the semi-deleting rule, not only is convergence guaranteed, but the size of each subproblem is also greatly reduced, because most inactive constraints can be deleted. RCG methods fully exploit the sparsity of the row and column constraints and thereby speed up the optimization process.

The rest of this paper is organized as follows. Section 2 gives the definition of RCSC-LP and reviews the 1-norm SVM. RG/CG methods and a new deleting rule are introduced in Section 3.1. RCG methods and their implementation are discussed in Sections 3.2 and 3.3, respectively. In addition, Section 3 presents theorems on the convergence of these methods when the new rule is adopted; the theorems are proved in the Appendices. Numerical comparisons are reported in Section 4. Finally, Section 5 concludes the paper.

Section snippets

Row–column sparse constraint LP and 1-norm SVM

In this section, we define RCSC-LP and review the 1-norm SVM for classification.

Methods for training the 1-norm SVM

In this section, we discuss three methods for training the 1-norm SVM: the column generation (CG) method, the row generation (RG) method, and the row–column generation (RCG) method. A semi-deleting rule is introduced to guarantee the convergence of the CG, RG, and RCG methods, and to speed up the optimization process. So far, RCG methods have not been used to solve the LP problem of the 1-norm SVM. Here we mainly focus on RCG methods and their application to 1-norm SVMs.
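
Because only a snippet of this section is reproduced here, the following is merely a schematic sketch of a generic row–column generation loop with a simple deleting heuristic; it is not the paper's RCG algorithm or its semi-deleting rule, and the tolerances, working-set sizes, and the use of SciPy's HiGHS-based linprog (including its dual multipliers) are assumptions for illustration only.

    # Schematic row-column generation (RCG) loop for an LP of the form
    #     min c^T x   s.t.   A x >= b,  x >= 0,
    # in which most rows and columns are inactive at the optimum (RCSC-LP).
    # Illustrative sketch only; the paper's methods are defined in Section 3.
    import numpy as np
    from scipy.optimize import linprog  # SciPy >= 1.7 for HiGHS dual values


    def rcg_sketch(A, b, c, n_add=10, tol=1e-6, max_iter=100):
        m, n = A.shape
        rows = list(range(min(n_add, m)))   # working set of constraints (rows)
        cols = list(range(min(n_add, n)))   # working set of variables (columns)
        x_full = np.zeros(n)

        for _ in range(max_iter):
            # Solve the restricted subproblem over the current working sets
            # (A x >= b is rewritten as -A x <= -b for linprog).
            res = linprog(c[cols], A_ub=-A[np.ix_(rows, cols)], b_ub=-b[rows],
                          bounds=[(0, None)] * len(cols), method="highs")
            x_sub = res.x
            u_sub = -res.ineqlin.marginals  # duals of the >= constraints, u >= 0

            x_full = np.zeros(n)
            x_full[cols] = x_sub

            # Row generation: pick the most violated constraints of the full problem.
            violation = b - A @ x_full
            new_rows = [i for i in np.argsort(-violation)[:n_add]
                        if violation[i] > tol and i not in rows]

            # Column generation: pick variables with the most negative reduced cost.
            u_full = np.zeros(m)
            u_full[rows] = u_sub
            reduced_cost = c - A.T @ u_full
            new_cols = [j for j in np.argsort(reduced_cost)[:n_add]
                        if reduced_cost[j] < -tol and j not in cols]

            if not new_rows and not new_cols:
                break                        # no violated rows or improving columns

            # Deleting heuristic: drop rows whose constraints are strictly inactive.
            # (The paper's semi-deleting rule is more careful, so that convergence
            # can be proved; this is only a stand-in.)
            slack = A[np.ix_(rows, cols)] @ x_sub - b[rows]
            kept = [r for r, s in zip(rows, slack) if s <= tol]
            rows = (kept or rows) + new_rows
            cols = cols + new_cols

        return x_full

For the 1-norm SVM LP sketched earlier, the rows correspond to training samples and the columns to coefficients and slack variables, so in such a loop the working sets tend to shrink toward the support vectors, in line with the observation that the maximum subproblem size of RCG matches the number of SVs.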

Numerical experiments

To validate the RCG methods for training the 1-norm SVM for classification, experiments are performed on a toy dataset with a two-class Gaussian distribution, five two-class datasets from the UCI repository (Frank & Asuncion, 2010), and three multi-class datasets. All numerical simulations were performed on a personal computer with a 1.8 GHz Pentium III processor and 1 GB of memory, running Windows XP and Matlab 7.1.

In all simulations, the RCG methods are compared with

Concluding remarks

The 1-norm SVM can be formulated as a standard inequality-constrained LP belonging to a specific class of such problems, called RCSC-LP. It is shown that the LP problem of the 1-norm SVM has as many active row and column constraints as support vectors. We introduce a family of RCG methods which generate rows and columns synchronously. In addition, in order to counterbalance the non-deleting rule, which leaves many inactive constraints in the

Acknowledgments

We would like to thank two anonymous reviewers and Editor Kenji Doya for their valuable comments and suggestions, which have significantly improved this paper. This work was supported in part by the National Natural Science Foundation of China under Grant No. 61033013, by the Natural Science Foundation of Jiangsu Province of China under Grant Nos. BK2011284 and BK201222725, by the Natural Science Pre-research Project of Soochow University under Grant No. SDY2011B09, by the Natural Science

References (43)

  • E.J. Zak

    Row and column generation technique for a multistage cutting stock problem

    Computers and Operations Research

    (2002)
  • L. Zhang et al.

    On the sparseness of 1-norm support vector machines

    Neural Networks

    (2010)
  • W. Zhou et al.

    Linear programming support vector machines

    Pattern Recognition

    (2002)
  • E. Allwein et al.

    Reducing multiclass to binary: a unifying approach for margin classifiers

    Journal of Machine Learning Research

    (2000)
  • C. Barnhart et al.

    Branch-and-price: column generation for solving huge integer programs

    Operations Research

    (1998)
  • I. Barrodale et al.

    An improved algorithm for discrete ℓ1 linear approximation

    SIAM Journal on Numerical Analysis

    (1973)
  • K.P. Bennett

    Combining support vector and mathematical programming methods for classification

  • J. Bi et al.

    Dimensionality reduction via sparse support vector machines

    Journal of Machine Learning Research

    (2003)
  • Bi, J., Chen, Y., & Wang, J. (2005). A sparse support vector machine approach to region-based image categorization. In...
  • P.S. Bradley et al.

    Massive data discrimination via linear support vector machines

    Optimization Methods and Software

    (2000)
  • C.J.C. Burges

    A tutorial on support vector machines for pattern recognition

    Data Mining and Knowledge Discovery

    (1998)
  • G. Dantzig et al.

    The decomposition algorithm for linear programs

    Econometrica

    (1961)
  • A. Demiriz et al.

    Linear programming boosting via column generation

    Machine Learning

    (2002)
  • B. Efron et al.

    Least angle regression

    Annals of Statistics

    (2004)
  • S. Floyd et al.

    Sample compression, learnability, and the Vapnik–Chervonenkis dimension

    Machine Learning

    (1995)
  • Frank, A., & Asuncion, A. (2010). UCI machine learning repository. From...
  • G. Fung et al.

    A feature selection Newton method for support vector machine classification

    Computational Optimization and Applications

    (2004)
  • F. Girosi

    An equivalence between sparse approximation and support vector machines

    Neural Computation

    (1998)
  • T. Graepel et al.

    Generalization error bounds for sparse linear classifiers

  • Jaumard, B., Meyer, C., & Vovor, T. (1999). Column/row generation and elimination methods (Les Cahiers du GERAD No....
  • Y. LeCun et al.

    Gradient-based learning applied to document recognition

    Proceedings of the IEEE

    (1998)