Reweighted stochastic learning
Introduction
In many domains dealing with online and stochastic learning, the input instances are of very high dimension, yet within any particular instance only a few features are non-zero. Therefore stochastic and online approaches crafted with sparsity-inducing regularization are of particular interest to many machine learning researchers and practitioners. This paper investigates the interplay between Regularized Dual Averaging (RDA) approaches [1] (along with other techniques for solving linear SVMs in the context of stochastic learning [2]) and parsimony concepts arising from the application of sparsity-inducing norms, such as l0-type penalties.
One can see the increasing importance of correctly identified sparsity patterns and a proliferation of proximal and soft-thresholding subgradient-based methods [1], [3], [4]. The parsimony concept has made many important contributions to the machine learning field. One may allude to the interpretability of the obtained solution and simplified or easily extractable decision rules [5], [6], [7]. On the other hand, the informativeness of the obtained features might be useful for better generalization on unseen data [5]. Approaches based on l1-regularized loss minimization were studied in the context of stochastic and online learning by several research groups [1], [3], [8], [9], but we are not aware of any l0-norm-inducing methods applied in the context of Regularized Dual Averaging and stochastic optimization.
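As a brief illustration of the soft-thresholding operation underlying the proximal methods cited above (a generic sketch, not this paper's algorithm; the function and variable names are ours):

```python
import numpy as np

def soft_threshold(w, tau):
    # Proximal operator of tau * ||w||_1: every coordinate is shrunk
    # toward zero by tau, and coordinates with |w_i| <= tau become
    # exactly zero -- this is how l1 regularization induces sparsity.
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

w = np.array([0.9, -0.2, 0.05, -1.5])
w_sparse = soft_threshold(w, 0.3)  # the two small coordinates are zeroed out
```

Applied after each (sub)gradient step, this operator yields exact zeros rather than merely small values, which is what makes the recovered sparsity pattern meaningful.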
In this paper we provide a supplementary analysis and sufficient regret bounds for learning sparser linear Regularized Dual Averaging (RDA) [1] models from random observations. We extend and modify our previous research [10], [11] and present complementary proofs, under fewer assumptions, together with a discussion of the reported theoretical findings. We use sequences of (strongly) convex reweighted optimization objectives to accomplish this goal.
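The core idea behind such reweighted objectives can be sketched as follows (an iteratively reweighted l1 rule in the spirit of the schemes cited in Section 2; the exact per-round rule and constants used in this paper are defined in Section 3, so the smoothing term `eps` and the names here are illustrative):

```python
import numpy as np

def reweight(w, eps=1e-3):
    # Per-coordinate penalty weights derived from the current iterate:
    # coordinates with small magnitude receive a large weight and are
    # pushed to exactly zero on the next round, so the weighted l1 term
    # sum_i c_i * |w_i| mimics the l0 pseudonorm while each subproblem
    # remains convex.
    return 1.0 / (np.abs(w) + eps)

w = np.array([1.2, 0.01, -0.8])
c = reweight(w)  # the near-zero coordinate gets by far the largest weight
```

Each round thus solves a convex weighted-l1 problem, and the sequence of such problems approximates the non-convex l0-penalized objective.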
This paper is structured as follows. Section 2 describes previous work on l0-norm induced learning and some existing solutions to stochastic optimization with regularized loss. Section 3.1 presents a problem statement for the reweighted algorithms. Sections 3.2 and 3.5 introduce our reweighted l1-RDA and l2-RDA methods respectively, while Section 3.8 presents a completely novel approach based on a probabilistic reweighted Pegasos-like linear SVM solver. Sections 3.4 and 3.7 provide the theoretical background for our reweighted RDA approaches. Section 4 presents our numerical results and Section 5 concludes the paper.
Section snippets
Related work
Learning with pseudonorm regularization is an NP-hard problem [12] and can be approached via reweighting schemes [13], [14], [15], [16], which, however, lack a proper theoretical analysis of convergence in the online and stochastic learning settings. Some methods, like [17], consider an embedded approach where one has to solve a sequence of QP problems, which can be very expensive both computationally and memory-wise while still lacking proper convergence criteria.
In many existing iterative
Problem statement
In the stochastic Regularized Dual Averaging approach developed by Xiao [1] one approximates the loss function f(w) by using a finite set of independent observations {ξ_i}_{i=1}^T. Under this setting one minimizes the following optimization objective: min_w (1/T) Σ_{i=1}^T f(w, ξ_i) + Ψ(w), where Ψ(w) represents a regularization term. Every observation ξ_i is given as a pair of input–output variables ξ_i = (x_i, y_i). In the above setting one deals with a simple classification model and calculates the
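To make the setting concrete, the closed-form coordinate update of the simple l1-RDA method from [1] can be sketched as follows (assuming Ψ(w) = λ‖w‖₁ and a step-size parameter γ > 0; the variable names are ours, and the reweighted variants developed below modify this baseline step):

```python
import numpy as np

def l1_rda_step(g_bar, t, lam, gamma):
    # g_bar is the running average of subgradients after t rounds.
    # Coordinates whose averaged subgradient magnitude stays below the
    # truncation threshold lam are set exactly to zero; the remaining
    # ones are shrunk by lam and scaled by -sqrt(t)/gamma.
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    return -(np.sqrt(t) / gamma) * shrunk
```

Because the truncation acts on the *averaged* subgradient rather than on the iterate itself, l1-RDA tends to produce sparser solutions than comparable proximal subgradient schemes.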
Experimental setup
For all methods in our experiments with UCI datasets [24], hyperparameter tuning (e.g. estimating the ubiquitous λ hyperparameter or the tuples of hyperparameters employed in Algorithm 1, Algorithm 2) is performed with Coupled Simulated Annealing [27] initialized with 5 random sets of parameters. These random sets consist of tuples of hyperparameters linked to one particular setup of an algorithm. At every CSA iteration step we proceed with a 10-fold cross-validation. Within the cross-validation we are
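The scoring step inside this tuning procedure can be sketched as follows (a simplified stand-in: `train_and_eval` is a hypothetical callback standing for training one algorithm setup and returning its validation score; the actual candidate search is driven by Coupled Simulated Annealing [27], which is not shown here):

```python
import numpy as np

def cv_score(train_and_eval, X, y, params, n_folds=10, seed=0):
    # Average validation score of one hyperparameter tuple over a
    # 10-fold split, as used to rank candidate tuples during tuning.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for fold in folds:
        val = np.zeros(len(y), dtype=bool)
        val[fold] = True  # hold this fold out for validation
        scores.append(train_and_eval(X[~val], y[~val], X[val], y[val], params))
    return float(np.mean(scores))
```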
Conclusion
In this paper we studied reweighted stochastic learning in the context of dual averaging schemes and solvers for linear SVMs. We presented two different directions for applying reweighting at each round t. The first approach efficiently approximates the l0-type penalty using a reliable and proven dual averaging scheme [22]. We applied the reweighting procedure to different norms and elaborated two versions of the Regularized Dual Averaging method [1], namely Reweighted l1- and l
Acknowledgments
EU: The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC AdG A-DATADRIVE-B (290923). This paper reflects only the authors' views; the Union is not liable for any use that may be made of the contained information. Research Council KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; PhD/Postdoc grants Flemish Government: FWO: projects: G.0377.12 (Structured systems), G.088114N
Vilen Jumutc received his B.Sc. and M.Sc. degrees in Computer Science from the Riga Technical University in 2007 and 2009 respectively. He is currently a Ph.D. researcher in the Department of Electrical Engineering (ESAT) of the Katholieke Universiteit Leuven. His interests include large-scale stochastic and online learning problems, kernel methods, semi-supervised learning and convex optimization.
References (28)
- et al., Rule extraction from support vector machines: a review, Neurocomputing (2010)
- et al., The null space property for sparse recovery from multiple measurement vectors, Appl. Comput. Harmon. Anal. (2011)
- Dual averaging methods for regularized stochastic learning and online optimization, J. Mach. Learn. Res. (2010)
- S. Shalev-Shwartz, Y. Singer, N. Srebro, Pegasos: primal estimated sub-gradient solver for SVM, in: Proceedings of the...
- S. Shalev-Shwartz, A. Tewari, Stochastic methods for l1 regularized loss minimization, in: Proceedings of the 26th...
- et al., Efficient online and batch learning using forward backward splitting, J. Mach. Learn. Res. (2009)
- H. Núñez, C. Angulo, A. Català, Rule extraction from support vector machines, in: Proceedings of European Symposium on...
- C.J.C. Burges, Simplified support vector decision rules, in: L. Saitta (Ed.), Proceedings of the 13th International...
- X. Chen, Q. Lin, J. Peña, Optimal regularized dual averaging methods for stochastic optimization, in: P.L. Bartlett,...
- et al., Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. (2011)
Cited by (1)
Mini-batch algorithms with Barzilai–Borwein update step
2018, Neurocomputing
Johan A.K. Suykens was born in Willebroek Belgium, May 18, 1966. He received the master degree in Electro-Mechanical Engineering and the Ph.D. degree in Applied Sciences from the Katholieke Universiteit Leuven, in 1989 and 1995 respectively. In 1996 he has been a Visiting Postdoctoral Researcher at the University of California, Berkeley. He has been a Postdoctoral Researcher with the Fund for Scientific Research FWO Flanders and is currently a Professor (Hoogleraar) with KU Leuven. He is Author of the books “Artificial Neural Networks for Modelling and Control of Non-linear Systems” (Kluwer Academic Publishers) and “Least Squares Support Vector Machines” (World Scientific), co-author of the book “Cellular Neural Networks, Multi-Scroll Chaos and Synchronization” (World Scientific) and Editor of the books “Nonlinear Modeling: Advanced Black-Box Techniques” (Kluwer Academic Publishers), “Advances in Learning Theory: Methods, Models and Applications” (IOS Press) and “Regularization, Optimization, Kernels, and Support Vector Machines” (Chapman & Hall/CRC). In 1998 he organized an International Workshop on Nonlinear Modelling with Time-Series Prediction Competition. He has served as an Associate Editor for the IEEE Transactions on Circuits and Systems (1997–1999 and 2004–2007) and for the IEEE Transactions on Neural Networks (1998–2009). He received an IEEE Signal Processing Society 1999 Best Paper Award and several Best Paper Awards at International Conferences. He is a recipient of the International Neural Networks Society INNS 2000 Young Investigator Award for significant contributions in the field of neural networks. 
He has served as a Director and Organizer of the NATO Advanced Study Institute on Learning Theory and Practice (Leuven 2002), as a Program Co-Chair for the International Joint Conference on Neural Networks 2004 and the International Symposium on Nonlinear Theory and its Applications 2005, as an Organizer of the International Symposium on Synchronization in Complex Networks 2007, a Co-Organizer of the NIPS 2010 Workshop on Tensors, Kernels and Machine Learning, and Chair of ROKS 2013. He has been awarded an ERC Advanced Grant 2011 and has been elevated IEEE Fellow 2015 for developing least squares support vector machines.