1 Introduction

Linear classification and regression models, such as the support vector machine (SVM) and logistic and linear regression, are widely used in machine learning, and regularized empirical-risk minimization is a standard approach to optimizing their parameters. To avoid overfitting, linear models are typically either densely or sparsely regularized. With an \(\ell _2\) regularizer, one obtains a dense weight vector in which all features contribute to the prediction task. For interpretability, one is often interested in a sparse solution in which many entries of the weight vector are zero. To this end, one may employ an \(\ell _1\) (absolute-value norm) regularizer [18, 25]. While this type of regularization may lead to lower predictive accuracy than \(\ell _2\) regularization [10], the resulting model focuses on only the most relevant features.

This paper promotes the idea of using a combination of both types of regularization, thus combining the best of both worlds. Instead of using just a single weight vector \(\mathbf{{w}}\) that is either dense or sparse, we employ a sum of two weight vectors \(\mathbf{{w}}+\mathbf{{v}}\). While \(\mathbf{{w}}\) is \(\ell _2\) regularized and therefore dense, \(\mathbf{{v}}\) is \(\ell _1\) regularized and therefore sparse. Having two different weight vectors with different regularizations allows linear models to fit the data more flexibly. This flexibility comes at a moderate computational cost: the number of parameters merely doubles.

Fig. 1. Geometrical illustration of the proposed Huber-norm regularizer and comparison to common regularizers.

We first show that the proposed combination of two weight vectors is mathematically equivalent to imposing Huber-norm [7] regularization on the empirical risk of a linear model. This approach is known to be statistically more robust [8] in the sense that individual sparse weights do not necessarily incur a huge cost in the loss. The Huber norm imposes quadratic costs near the origin and linear costs far from the origin, and thereby penalizes outliers less severely. Because of this analogy, we call our method Huber-norm regularization. We derive uniform and data-dependent upper bounds on the statistical risk of the model class by means of the Rademacher complexity. We also deduce a simple type of oracle inequality for the learned decision rule, which measures the deviation of the model's risk from the lowest risk of any model in the class.

Our empirical studies show that Huber-norm regularized logistic regression outperforms \(\ell _1\)-, \(\ell _2\)-, and elastic-net-regularized logistic regression [26] in the majority of cases over a wide range of benchmark problems. To support this claim, we provide evidence based on empirical studies on the UCI machine learning repository, where our method performs best among the compared methods on 23 out of 31 data sets. On one particular data set, the well-known Iris data set, Huber-norm regularization leads to a prediction accuracy of 0.96 while the next-best method achieves only 0.84.

Our paper is organized as follows. Section 2 reviews related work. In Sect. 3, we describe our model and its basic properties. We also prove the equivalence of the two weight vectors to Huber-norm regularization in the conventional setting. In Sect. 4 we then present the underlying theoretical foundations of our approach, where we prove an upper bound on the statistical risk. We present our experimental results in Sect. 5 and conclude in Sect. 6.

2 Related Work

Comparisons between \(\ell _1\)-norm and \(\ell _2\)-norm SVMs are ubiquitous in the literature [13, 14, 25]. A robust alternative to the SVM based on the smooth ramp loss [23] requires the convex-concave procedure, which solves the resulting non-convex optimization problem via a sequence of convex problems [24]. Another way of making the SVM robust [20] is based on the weighted LS-SVM, which yields sparse results. Different types of classification problems for the SVM (both convex and non-convex) are discussed by Hailong et al. [6], where a conjugate gradient approach is used to solve the optimization problem.

Our novel type of regularizer relates to the elastic-net regularizer [26] that simply amounts to taking the sum of an \(\ell _1\) and \(\ell _2\) regularizer. Our proposed regularizer is very different, as is evident from Fig. 1. The plot shows contours of different regularizers in comparison. As a major difference between the elastic net and our approach, our regularizer grows asymptotically linearly for large weight vectors whereas the elastic net grows asymptotically quadratically. Lastly, our theoretical contributions are based on fundamental work by Vapnik [22].

The Huber norm [7] is frequently used as a loss function; it penalizes outliers asymptotically linearly, which makes it more robust than the squared loss. The Huber norm has also been used as a regularization term in optimization problems for image super-resolution [21] and other computer-graphics problems. The inverse Huber function [17] has been studied as a regularizer for regression problems. While the Huber norm penalizes large weights asymptotically linearly, the inverse Huber function imposes an asymptotically quadratic penalty on large weights.

3 Huber-Norm-Regularized Linear Models

In this section, after formally introducing the problem setting and optimization criterion, we show that this optimization criterion has an equivalent formulation in which the Huber norm becomes explicit. We derive the dual form and show how Huber-norm regularization for linear models can be implemented.

3.1 Problem Setting and Preliminaries

We consider the standard supervised prediction setup, where we are given a training sample \(S=\{{\mathbf{x }_{i}},y_{i}\}_{i=1}^{n}\) from a space \(\mathcal X\times \mathcal Y\) with \(\mathcal {X} = \mathbb {R}^{d}\). We aim at finding a linear function f that predicts well. A common way to achieve this is to first define a loss function \(\ell :\mathbb {R} \times {\mathcal Y}\rightarrow \mathbb {R}_{+}\cup \{0\}\) that measures the deviation of the prediction \(f(\mathbf{{x}})\) from the correct value y, such as the logistic loss \(\ell (f(\mathbf{{x}}),y) := \log (1+\exp (-yf(\mathbf{{x}})))\) or the hinge loss \(\ell (f(\mathbf{{x}}),y) := \max (0,1-yf(\mathbf{{x}}))\). The empirical risk of f is then the average loss over the training sample, \( \hat{L}_{n}(f) = \frac{1}{n}\sum _{i=1}^{n}\ell (f(\mathbf{{x}}_i),y_{i})\).
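To make these definitions concrete, the following sketch (our own illustration, not code from the paper) evaluates the logistic loss, the hinge loss, and the empirical risk of a linear model \(f(\mathbf{{x}})=\mathbf{{w}}^\top \mathbf{{x}}\) in Python with NumPy.

```python
# Minimal sketch (ours, not from the paper): the two loss functions named above
# and the empirical risk of a linear model f(x) = w^T x, in Python/NumPy.
import numpy as np

def logistic_loss(score, y):
    # log(1 + exp(-y * f(x))); logaddexp avoids overflow for large |score|
    return np.logaddexp(0.0, -y * score)

def hinge_loss(score, y):
    # max(0, 1 - y * f(x))
    return np.maximum(0.0, 1.0 - y * score)

def empirical_risk(w, X, y, loss=logistic_loss):
    # \hat{L}_n(f) = (1/n) sum_i loss(f(x_i), y_i)
    return float(np.mean(loss(X @ w, y)))
```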

In this paper we consider methods that employ linear prediction functions \(f(\mathbf{{x}})= \mathbf{{w}}^\top \mathbf{{x}}\). To avoid overfitting, one usually uses a regularizer such as the \(\ell _1\) regularizer \(R_1(\mathbf w)=||\mathbf w||_1\), the \(\ell _2\) regularizer \(R_2(\mathbf w)=||\mathbf w||_2^2\), or the elastic-net regularizer \(R_{en}(\mathbf w)=||\mathbf w||_1+||\mathbf w||_2^2\). This results in the regularized empirical risk minimization or short reg-ERM problem:

$$\begin{aligned} \min _{\mathbf w} \lambda R(\mathbf w) + \frac{1}{n}\sum _{i=1}^n\ell (y_i,\mathbf w^\top \mathbf{{x}}_i). \end{aligned}$$

The \(\ell _1\) and elastic-net regularizers produce sparse weight vectors, whereas the \(\ell _2\)-norm regularizer produces dense ones. Hence, depending on the problem, the regularizer can be chosen to match the underlying sparsity of the problem.
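As a minimal sketch (our own, assuming Python with NumPy), the reg-ERM objective with a pluggable regularizer can be written as follows; the logistic loss serves as the example loss.

```python
# Hedged sketch (ours) of the reg-ERM objective with the three regularizers
# named above; `lam` plays the role of the regularization weight \lambda.
import numpy as np

def R1(w):                 # \ell_1 regularizer ||w||_1 (encourages sparsity)
    return np.sum(np.abs(w))

def R2(w):                 # squared \ell_2 regularizer ||w||_2^2 (dense weights)
    return float(np.dot(w, w))

def R_en(w):               # elastic-net regularizer ||w||_1 + ||w||_2^2
    return R1(w) + R2(w)

def reg_erm_objective(w, X, y, lam, R=R2):
    # lam * R(w) + (1/n) sum_i log(1 + exp(-y_i * w^T x_i))
    return lam * R(w) + float(np.mean(np.logaddexp(0.0, -y * (X @ w))))
```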

3.2 Linear Models with Sums of Dense and Sparse Weights

Using \(\ell _1\)-, \(\ell _2\)-, or elastic-net-regularized ERM produces either dense or sparse solutions. In this paper, we argue that it can be beneficial to produce dense solutions that additionally contain a few pronounced feature weights, as in \(\ell _1\)-norm regularized methods. We propose to consider linear models of the form \(f(\mathbf{{x}}) := (\mathbf v+\mathbf w)^\top \mathbf x\) (for notational convenience, we disregard constant offsets and assume that the first element of each \(\mathbf{{x}}\) is a constant 1) and the regularizer \(R_H(\mathbf v,\mathbf w)=\lambda ||\mathbf v||_1+\mu ||\mathbf w||_2^2\), resulting in the following optimization problem.

Optimization Problem 1

(Sums of dense and sparse weights). Given \(\lambda , \mu > 0\) and loss function \(\ell (t,y)\), solve:

$$\begin{aligned} (\hat{\mathbf{{w}}}_{},\hat{\mathbf{{v}}}_{})&= \mathop {{{\mathrm{arg\,min}}}}_{\mathbf{{v}},\mathbf{{w}}} G(\mathbf{{w}},\mathbf{{v}},S)\nonumber \\ with\ G(\mathbf{{w}},\mathbf{{v}},S)&= \lambda ||\mathbf{{v}}||_{1}+\mu ||\mathbf{{w}}||_{2}^{2} + \frac{1}{n}\sum _{i=1}^{n}\ell (y_i,(\mathbf{{w}}+\mathbf{{v}})^\top \mathbf{{x}}_{i}), \end{aligned}$$
(1)

where \(||\cdot ||_{2}\) and \(||\cdot ||_{1}\) denote the standard \(\ell _2\) and \(\ell _1\) norms, respectively.

For reasons that will become clear in the following section, we call the method Huber-regularized empirical risk minimization, or Huber-regERM for short. Note that letting \(\lambda \rightarrow \infty \) recovers classic \(\ell _2\)-norm regularization, while letting \(\mu \rightarrow \infty \) recovers \(\ell _1\)-norm regularization; both methods are thus obtained as limit cases of our method. Elastic-net regularization is not a special case of this framework, but it can be obtained by enforcing the additional constraint \(\mathbf v=\mathbf w\).
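For concreteness, a minimal sketch (our own illustration) of the objective \(G(\mathbf{{w}},\mathbf{{v}},S)\) of Optimization Problem 1 with the logistic loss is given below; the variable names lam and mu stand for \(\lambda \) and \(\mu \).

```python
# Sketch (ours) of the objective G(w, v, S) of Optimization Problem 1 with the
# logistic loss; lam and mu stand for the hyperparameters lambda and mu.
import numpy as np

def G(w, v, X, y, lam, mu):
    # lam*||v||_1 + mu*||w||_2^2 + (1/n) sum_i log(1 + exp(-y_i (w+v)^T x_i))
    scores = X @ (w + v)
    data_fit = float(np.mean(np.logaddexp(0.0, -y * scores)))
    return lam * np.sum(np.abs(v)) + mu * float(np.dot(w, w)) + data_fit
```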

3.3 Geometry of the Huber Norm

The following geometrical interpretation lets us compare linear models with sums of dense and sparse weights to the \(\ell _1\), \(\ell _2\), and elastic-net regularizers. We prove that Problem 1 is equivalent to the following problem.

Optimization Problem 2

(Equivalent Huber-Norm Problem). Optimization Problem 1 can equivalently be formulated as:

$$\begin{aligned} \hat{\mathbf {z}}&= \mathop {{{\mathrm{arg\,min}}}}\limits _{\mathbf {z}}R_{H}(\mathbf {z})+\frac{1}{n}\sum \limits _{i=1}^{n}\ell (y_i,\mathbf {z}^\top \mathbf{{x}}_{i}) \end{aligned}$$
(2)

where \(R_{H}(\mathbf {\mathbf {z}})=\sum _{i=1}^{d} r_{H}(z_{i})\), and \( r_{H}(z_{i})= {\left\{ \begin{array}{ll} \lambda \left( |z_{i}| -\frac{\lambda }{4\mu }\right) &{}\text {if}\, |z_{i}| \ge \frac{\lambda }{2\mu } \\ \mu z_{i}^2 ,&{}\text {otherwise}\\ \end{array}\right. }. \)

Note that \(R_{H}(\mathbf {z})\) is the Huber norm of \(\mathbf {z}\). While the Huber norm is often used as a robust loss function that is less sensitive to outliers, Optimization Problem 2 employs the Huber norm as a regularizer. Intuitively, the resulting regularization scheme penalizes individual features that have a strong impact on f less severely than \(\ell _2\) regularization does. Figure 1 illustrates contour lines of the Huber-norm regularizer and of common regularizers for \(\lambda =\mu =1\). The Huber norm is composed of linear and quadratic segments. While it does not encourage sparsity as the \(\ell _1\) regularizer does, it encourages solutions in which most attributes have only a small impact on the decision function.

Proof

(Equivalence of Optimization Problems 1 and 2 ). Let \(\mathbf z =\mathbf{{w}}+\mathbf{{v}}\). Problem 1 can then be formulated as

$$\begin{aligned} \min \limits _{\mathbf{{w}},\mathbf{{v}}} G(\mathbf{{w}},\mathbf{{v}},S)=&\min \limits _{\mathbf {z},\mathbf {v}} \lambda ||\mathbf{{v}}||_{1}+\mu ||\mathbf {z}-\mathbf{{v}}||_{2}^{2} + \frac{1}{n}\sum \limits _{i=1}^{n}\ell (y_i,\mathbf {z}^\top \mathbf{{x}}_{i}) \nonumber \\ =&\min \limits _{\mathbf {z}}\left( \mu \min \limits _{\mathbf{{v}}} \left( \frac{\lambda }{\mu }||\mathbf{{v}}||_{1}+||\mathbf {z}-\mathbf{{v}}||_{2}^{2} \right) +\frac{1}{n}\sum \limits _{i=1}^{n}\ell (y_i,\mathbf {z}^\top \mathbf{{x}}_{i})\right) . \end{aligned}$$
(3)

Let us define \(R(\mathbf{{v}},\mathbf{z}) := {\overline{c}}||\mathbf{{v}}||_{1}+||\mathbf {z}-\mathbf{{v}}||_{2}^{2}\) where \(\overline{c} := \frac{\lambda }{\mu }\). It remains to be shown that \(\min \limits _{\mathbf{{v}}}R(\mathbf{{v}},\mathbf{z})\) is a Huber-norm regularizer.

Simplifying \(R = \mathbf{{v}}^\top \mathbf{{v}}-2\mathbf{{v}}^\top \left( \mathbf {z}- \frac{\overline{c}}{2}{{\mathrm{sgn}}}(\mathbf{{v}})\right) + \mathbf {z}^\top \mathbf {z}\), we find

$$\begin{aligned} \min \limits _{\mathbf{{v}}}R =\min \limits _{\mathbf{{v}}=(v_1,...,v_d)} \left( \sum \limits _{i=1}^{d}v_{i}^2-2v_{i}(z_{i}-\frac{\overline{c}}{2}{{\mathrm{sgn}}}(v_{i}))\right) +\sum \limits _{i=1}^{d}z_{i}^{2}. \end{aligned}$$
(4)

For each \(i \in \{1,...,d\}\) we minimize \(R_{i} := v_{i}^2-2v_{i}(z_{i}-\frac{\overline{c}}{2}{{\mathrm{sgn}}}(v_{i}))\) with respect to \(v_{i}\). This is equivalent to:

$$\begin{aligned} \left[ {\begin{array}{*{20}l} {\mathop {\min }\limits _{{v_{i} }} v_{i}^{2} - 2(z_{i} - \frac{{\bar{c}}}{2})v_{i} } &{} {{\text {if}}\,v_{i} > 0} \\ {\mathop {\min }\limits _{{v_{i} }} v_{i}^{2} - 2(z_{i} + \frac{{\bar{c}}}{2})v_{i} } &{} {{\text {if}}\,v_{i} \le 0}. \\ \end{array} } \right. \end{aligned}$$

We can minimize each of these two quadratic terms analytically:

$$\begin{aligned} \left[ {\begin{array}{ll} -\left( |z_{i}|-\frac{\overline{c}}{2}\right) ^{2} &{} \text {if}\, z_{i} \in \mathcal {A} := \{z \in \mathbb {R}: |z|\ge \frac{\overline{c}}{2}\} \\ 0 &{} \text {if}\, z_{i} \in \mathcal {A}_{c} := \{z \in \mathbb {R}: |z|< \frac{\overline{c}}{2}\}.\\ \end{array}}\right. \end{aligned}$$

This means that Eq. 4 becomes:

$$\begin{aligned} \min \limits _{\mathbf{{v}}}R = \sum \limits _{i=1}^{d} \left( z_{i}^{2} - \left( |z_{i}|-\frac{\overline{c}}{2}\right) ^2\mathbb {I}_{z_{i} \in \mathcal {A}}\right) = \sum _{i=1}^{d}\left( z_{i}^2\mathbb {I}_{z_{i} \in \mathcal {A}_{c}} + \overline{c}\left( |z_{i}|-\frac{\overline{c}}{4}\right) \mathbb {I}_{z_{i} \in \mathcal {A}}\right) . \end{aligned}$$

Multiplied by \(\mu \), as it appears in Eq. 3, this is exactly the Huber-norm regularizer \(R_H(\mathbf{z})\) of Optimization Problem 2.

   \(\square \)
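As a sanity check of this equivalence (our own illustration, not part of the paper), one can verify numerically that \(\mu \min _{v}\left( \frac{\lambda }{\mu }|v|+(z-v)^2\right) \) coincides with \(r_H(z)\) coordinate-wise; the brute-force grid below is purely illustrative.

```python
# Numerical sanity check (ours): for each coordinate, mu * min_v ((lam/mu)*|v|
# + (z - v)^2) should equal the Huber term r_H(z) of Optimization Problem 2.
import numpy as np

def r_H(z, lam, mu):
    quad = mu * z ** 2                            # quadratic part near the origin
    lin = lam * (abs(z) - lam / (4.0 * mu))       # linear part far from the origin
    return lin if abs(z) >= lam / (2.0 * mu) else quad

def inner_min(z, lam, mu):
    # brute-force minimisation of mu * ((lam/mu)*|v| + (z - v)^2) over a fine grid
    v = np.linspace(-10.0, 10.0, 200001)
    return float(np.min(mu * ((lam / mu) * np.abs(v) + (z - v) ** 2)))

lam, mu = 0.7, 1.3
for z in [-3.0, -0.1, 0.0, 0.2, 2.5]:
    assert abs(inner_min(z, lam, mu) - r_H(z, lam, mu)) < 1e-3
```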

3.4 Dual Problem

In order to classify an instance, we need to compute the inner product \((\mathbf {\mathbf{{w}}}+\mathbf {\mathbf{{v}}})^\top \mathbf {x}\), which may be expensive when the vectors \(\mathbf {w},\mathbf {v}\) are high-dimensional. One way to address this is to express the solution through a weighted combination of the training constraints, which leads to a dual formulation. Steinwart [19] gives a general overview of dual optimization problems for SVMs with \(\ell _2\)- and \(\ell _1\)-norm regularizers; the dual form depends on the loss function. We complement Steinwart's overview by deriving the dual form of the Huber-norm regularized SVM in the following.

Optimization Problem 3

(Dual Huber-Norm SVM Problem). Optimization Problem 1 with the hinge loss (Huber-norm SVM) has the following equivalent dual form:

$$\begin{aligned} \max \limits _{\alpha \in \mathbb {R}^{n}}&\sum \limits _{i=1}^{n}\alpha _{i} -\frac{1}{2}\sum \limits _{i,j=1}^{n}\alpha _{i}\alpha _{j}y_{i}y_{j}\mathbf{{x}}^\top _{i}{\mathbf {x}_{j}} \nonumber \\ \text{ s.t. }&\mathbf {\alpha } \in [0,C]^{n}\wedge ||\mathbf {X}^\top \mathrm {diag}(\mathbf {y})\,\mathbf {\alpha }||_{\infty } \le \frac{\lambda }{2\mu } , \end{aligned}$$
(5)

where \(C := \frac{1}{2n\mu }\), \(\mathbf {X}=\left( \mathbf {x}_1,...,\mathbf {x}_{n}\right)^\top \in \mathbb {R}^{n \times d}\), and \(\mathbf {y} = (y_{1},...,y_{n})^\top \in \mathbb {R}^{n}\).

Proof

The Lagrangian \(L(\mathbf {w},\mathbf {v},\xi ,\alpha ,\eta )\) that corresponds to Eq. 1 with the hinge loss, rewritten with slack variables \(\xi _{i}\) and rescaled by \(\frac{1}{2\mu }\), is given as follows:

$$\begin{aligned} L(\mathbf {w},\mathbf {v},\xi ,\alpha ,\eta ) := C\sum \limits _{i=1}^{n}\xi _{i} +&\frac{\lambda }{2\mu }||\mathbf {v}||_{1}+\frac{1}{2}||\mathbf {w}||^{2}_{2} + \nonumber \\&\sum \limits _{i=1}^{n}\alpha _{i}\left( 1-y_{i} (\mathbf {w}^\top +\mathbf {v}^\top ) \mathbf{{x}}_i -\xi _{i} \right) -\sum \limits _{i=1}^{n}\eta _{i}\xi _{i}, \end{aligned}$$
(6)

where \(\alpha = (\alpha _{1},...,\alpha _{n}) \in [0,\infty )^{n}\) and \(\mathbf {\eta } = (\eta _{1},...,\eta _{n}) \in [0,\infty )^{n}\). So the dual problem [3] can be written as:

$$\begin{aligned} \max \limits _{\alpha ,\eta }\inf _{\mathbf {w},\mathbf {v},\xi } L(\mathbf {w},\mathbf {v},\xi ,\alpha ,\eta ). \end{aligned}$$
(7)

Grouping the terms in the Lagrangian gives us:

$$\begin{aligned} L(\mathbf {w},\mathbf {v},\xi ,\alpha ,\eta ) =&\sum \limits _{i=1}^{n}(C-\alpha _{i}-\eta _{i})\xi _{i}+\frac{\lambda }{{2\mu }}||\mathbf {v}||_{1}\nonumber \\&\quad \quad \quad -\sum \limits _{i=1}^{n}\alpha _{i}y_{i} \mathbf {v}^\top \mathbf{{x}}_i + \frac{1}{2}||\mathbf {w}||^{2}_{2} - \sum \limits _{i=1}^{n}\alpha _{i}y_{i} \mathbf {w}^\top \mathbf{{x}}_i+\sum \limits _{i=1}^{n}\alpha _{i}. \end{aligned}$$

Now, considering the infimum with respect to \(\mathbf {v}\) and \(\mathbf {w}\) separately, and using the definition of a conjugate function [3, 19] we obtain:

$$\begin{aligned} \inf \limits _{\mathbf {v}}\frac{\lambda }{2\mu }||\mathbf {v}||_{1}-\sum \limits _{i=1}^{n}\alpha _{i}y_{i} \mathbf {v}^\top {\mathbf {x}_{i}}&= -\sup \limits _{\mathbf {v}}\left( \mathbf {v}^\top \sum \limits _{i=1}^{n}\alpha _{i}y_{i}{\mathbf {x}_{i}} - \frac{\lambda }{2\mu }||\mathbf {v}||_{1}\right) \nonumber \\&= {\left\{ \begin{array}{ll} 0,&{}\text {if}\, ||\mathbf {X}^\top \mathrm {diag}(\mathbf {y})\,\alpha ||_{\infty } \le \frac{\lambda }{2\mu }\\ -\infty ,&{}\text {otherwise},\\ \end{array}\right. } \end{aligned}$$
(8)

where \(\mathbf {X}=\left( \mathbf {x}_1,...,\mathbf {x}_{n}\right)^\top \in \mathbb {R}^{n \times d}\) is the data matrix whose rows \(\mathbf{{x}}_i^\top \) are the instances, so that \(\mathbf {X}^\top \mathrm {diag}(\mathbf {y})\,\alpha =\sum _{i=1}^{n}\alpha _{i}y_{i}{\mathbf {x}_{i}}\), and \(||\cdot ||_{\infty }\) denotes the supremum norm in \(\mathbb {R}^d\). Analogously, for \(\mathbf {w}\) we have:

$$\begin{aligned} \inf \limits _{\mathbf {w}}\frac{1}{2}||\mathbf {w}||^2_{2}-\sum \limits _{i=1}^{n}\alpha _{i}y_{i} \mathbf {w}^\top {\mathbf {x}_{i}}&= -\sup \limits _{\mathbf {w}}\left( -\frac{1}{2}||\mathbf {w}||^{2}_{2}+ \mathbf {w}^\top \sum \limits _{i=1}^{n}\alpha _{i}y_{i}{\mathbf {x}_{i}}\right) \nonumber \\&= -\frac{1}{2}\left( \sum \limits _{i=1}^{n}\alpha _{i}y_{i}{\mathbf {x}_{i}}\right) ^\top \left( \sum \limits _{i=1}^{n}\alpha _{i}y_{i}{\mathbf {x}_{i}}\right) . \end{aligned}$$
(9)

Finally, setting the gradient with respect to \(\xi \) to zero gives, for each \(i \in \{1,...,n\}\):

$$\begin{aligned} C-\eta _{i}-\alpha _{i}=0 \Leftrightarrow \alpha _{i} = C -\eta _{i}. \end{aligned}$$
(10)

Now, for fixed \(\lambda ,\mu ,\) and \(\mathbf {X}\), define \(P=\{\alpha \in [0,C]^{n} : ||\mathbf {X}^\top \mathrm {diag}(\mathbf {y})\,\alpha ||_{\infty } \le \frac{\lambda }{2\mu } \}\). Substituting Eqs. 8, 9, and 10 into Eq. 7 gives the following dual problem:

$$\begin{aligned} \max \limits _{\alpha \in P} \sum \limits _{i=1}^{n}\alpha _{i} -\frac{1}{2}\sum \limits _{i,j=1}^{n}\alpha _{i}\alpha _{j}y_{i}y_{j}\mathbf{{x}}^\top _{i}{\mathbf {x}_{j}}, \end{aligned}$$
(11)

which is a quadratic optimization problem over the set P and can be solved with standard methods.    \(\square \)

Close inspection of Eq. 11 shows that our dual optimization problem closely resembles the dual of the \(\ell _2\)-regularized SVM; the difference lies in the form of the feasible set P of the optimization problem, as illustrated by the sketch below.
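As an illustration (our own, not part of the paper), the dual problem of Eq. 11 can be handed to a generic convex solver. The sketch below uses CVXPY, which is our choice of tooling and not prescribed by the paper; the dense vector \(\mathbf{{w}}\) is recovered from the stationarity condition \(\mathbf{{w}}=\sum _i \alpha _i y_i \mathbf{{x}}_i\) of the Lagrangian in Eq. 6, while recovering the sparse vector \(\mathbf{{v}}\) is omitted.

```python
# Hedged sketch: solving the dual of the Huber-norm SVM (Eq. 11) with CVXPY,
# which is our choice of solver; any quadratic-programming solver would do.
import cvxpy as cp
import numpy as np

def dual_huber_svm(X, y, lam, mu):
    n, d = X.shape
    C = 1.0 / (2.0 * n * mu)
    alpha = cp.Variable(n)
    Yx = y[:, None] * X                          # i-th row is y_i * x_i^T
    # dual objective: sum_i alpha_i - 0.5 * ||sum_i alpha_i y_i x_i||_2^2
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Yx.T @ alpha))
    constraints = [alpha >= 0, alpha <= C,
                   cp.norm(Yx.T @ alpha, "inf") <= lam / (2.0 * mu)]
    cp.Problem(objective, constraints).solve()
    w = Yx.T @ alpha.value                       # stationarity of Eq. 6 in w
    return alpha.value, w
```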

3.5 Algorithm and Implementation

Algorithm 1 implements Huber-regularized empirical risk minimization for linear models. The algorithm alternates between the \(\ell _1\)-norm regularized minimization in \(\mathbf{{v}}\) and the \(\ell _2\)-norm regularized minimization in \(\mathbf{{w}}\). In each step, we perform gradient descent on one weight vector while keeping the other fixed. The gradient of the \(\ell _1\) norm of \(\mathbf{{v}}\) is not defined at zero entries of \(\mathbf{{v}}\); here, we use subgradients [3]. A possible implementation is sketched below.

(Algorithm 1: alternating subgradient descent on \(G(\mathbf{{w}},\mathbf{{v}},S)\); pseudocode not reproduced here.)
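The sketch below shows one possible implementation of this alternating scheme for the logistic loss. Step size, iteration counts, and the plain subgradient update for the \(\ell _1\) term are our own illustrative choices and are not taken from Algorithm 1.

```python
# Hedged sketch of the alternating optimization described above, for the
# logistic loss. Step size and iteration counts are illustrative choices;
# the l1 update uses a subgradient, with sign(0) taken to be 0.
import numpy as np

def huber_reg_erm(X, y, lam, mu, eta=0.01, n_outer=200, n_inner=5):
    n, d = X.shape
    w = np.zeros(d)                 # dense, l2-regularized weight vector
    v = np.zeros(d)                 # sparse, l1-regularized weight vector

    def data_grad():
        # gradient of (1/n) sum_i log(1 + exp(-y_i (w+v)^T x_i)) w.r.t. (w+v)
        margins = y * (X @ (w + v))
        coef = -y * 0.5 * (1.0 - np.tanh(margins / 2.0))   # = -y * sigmoid(-margin)
        return X.T @ coef / n

    for _ in range(n_outer):
        for _ in range(n_inner):    # l2 step: v held fixed
            w -= eta * (data_grad() + 2.0 * mu * w)
        for _ in range(n_inner):    # l1 (sub)gradient step: w held fixed
            v -= eta * (data_grad() + lam * np.sign(v))
    return w, v
```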

4 Theoretical Analysis

In this section we present a theoretical analysis of the proposed Huber-norm regularizer for linear models. We obtain bounds on the statistical risk based on the established framework of Rademacher complexities [2, 16]; the bounds depend on the norms of the vectors \(\mathbf{{v}},\mathbf{{w}}\) and on the number of training examples n [2].

4.1 Preliminaries and Aim

Let \(S=\{{\mathbf {x_{i}}},y_{i}\}_{i=1}^{n}\) be a sample of n training points that are independently drawn from one and the same distribution \(P_{X,Y}\) over \(\mathcal {X} \times \mathcal {Y}\), where \(\mathcal {X} = \mathbb {R}^{d}\); let the output space \(\mathcal {Y}\) be discrete for classification and continuous for regression. In this theoretical analysis, we study the Huber-regERM model class

$$\begin{aligned} \mathcal {F} := \{f : \mathbf {x} \mapsto (\mathbf {w}+\mathbf {v})^\top \mathbf {x} : \mathbb {R}^{d} \rightarrow \mathbb {R}| ||\mathbf{{w}}||_{2} \le W, ||\mathbf{{v}}||_{2} \le V\}, \end{aligned}$$
(12)

where W and V are initially unknown constants. Loss function \(\ell :\mathbb {R} \times \mathcal {Y} \rightarrow \mathbb {R}_{+}\cup \{0\}\) may be any convex loss function that is \(\mathcal {L}\)-Lipschitz continuous and absolutely bounded by constant \(B\in \mathbb {R}\). The aim of our theoretical analysis is to obtain bounds on the deviation of the risk \(L(f) = E_{P_{X,Y}}[\ell (f(\mathbf {x}),y)]\) of the model \(f \in \mathcal {F}\) from empirical risk \(\hat{L}_{n}(f) = \frac{1}{n}\sum _{i=1}^{n}\ell (f(\mathbf {x_{i}}),y_{i})\).

Let \(\{\sigma _{i}\}_{i=1}^{n}\) be independent Rademacher random variables, meaning that each of them is uniformly distributed over \(\{-1,+1\}\). Denote by \(\varSigma \) the joint uniform distribution of \(\sigma _1,\ldots ,\sigma _n\). Then the empirical Rademacher complexity is defined as

$$\begin{aligned} \hat{\mathfrak {R}}_{S}(\ell \circ \mathcal {F}) := E_{\varSigma }\left[ \sup _{f \in \mathcal {F}} \frac{1}{n}\sum _{i=1}^{n}\sigma _{i}\ell (f(\mathbf{{x}}_{i}),y_{i})\right] , \end{aligned}$$
(13)

and the (theoretical) Rademacher complexity [2, 16] is defined as \(\mathfrak {R}_{n}(\ell \circ \mathcal {F}) := E_S[\hat{\mathfrak {R}}_{S}(\ell \circ \mathcal {F})].\) Here, the expectation is taken under the distribution of the sample S. It has been shown [2, 16] that when \(\ell \) is \(\mathcal {L}\)-Lipschitz continuous in the second argument, then with probability at least \(1-\delta \), for all \(f\in \mathcal {F}\):

$$\begin{aligned} L(f) \le \hat{L}_{n}(f)+2 \mathcal {L}E_S\left[ \hat{\mathfrak {R}}_{S}( \mathcal {F})\right] + B\sqrt{\frac{\log \delta ^{-1}}{2n}}. \end{aligned}$$
(14)
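For intuition (our own illustration, not part of the paper), the empirical Rademacher complexity of the linear class of Eq. 12 can be estimated by Monte Carlo: for a fixed sign vector \(\sigma \), the supremum of \(\frac{1}{n}(\mathbf{{w}}+\mathbf{{v}})^\top \sum _i\sigma _i\mathbf{{x}}_i\) over \(||\mathbf{{w}}||_2\le W\) and \(||\mathbf{{v}}||_2\le V\) equals \((W+V)||\sum _i\sigma _i\mathbf{{x}}_i||_2/n\), which the sketch below averages over random sign draws and compares with the bound derived in Lemma 1 below.

```python
# Monte Carlo estimate (ours) of the empirical Rademacher complexity of the
# class of Eq. 12: for each sign vector sigma, the supremum over ||w||_2 <= W
# and ||v||_2 <= V is (W + V) * ||sum_i sigma_i x_i||_2 / n.
import numpy as np

def empirical_rademacher(X, W, V, n_draws=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))    # Rademacher signs
    sups = (W + V) * np.linalg.norm(sigma @ X, axis=1) / n
    return float(sups.mean())

def lemma1_bound(X, W, V):
    # closed-form bound derived in Lemma 1 below
    return float(np.sqrt(2.0 * (W ** 2 + V ** 2) * np.sum(X ** 2)) / X.shape[0])
```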

4.2 Bounds on the Risk of Huber-Regularized Linear Models

Our main theoretical contributions are bounds on statistical risk based on data-dependent and uniform upper bounds on the Rademacher complexity of the model class \(\mathcal {F}\) defined by Eq. 12.

Theorem 1

(Uniform risk bound for Huber regularization). Let \(\mathcal {F}\) be defined by Eq. 12, let \(\ell \) be an \(\mathcal {L}\)-Lipschitz continuous loss function, and let R be a constant such that \(|\ell (t,y)|\le R\) for all \(t\in \mathbb {R}\) and \(y \in \mathcal {Y}\). Let the \(\ell _2\) norm of all instances be bounded, \(||\mathbf{{x}}||_2\le R_{\mathbf{{x}}}\), with probability 1 for some constant \(R_{\mathbf{{x}}}\). Then, for every \(\delta \in ( 0,1)\), with probability at least \(1-\delta \) the following holds for all \(f \in \mathcal {F}\):

$$\begin{aligned} L(f) \le \hat{L}_{n}(f) + 2\mathcal {L}\sqrt{\frac{2(W^2+V^2)}{n}}R_{\mathbf{{x}}}+ R\sqrt{\frac{\log \delta ^{-1}}{2n}} \end{aligned}$$
(15)

where \( W = \sqrt{\frac{R}{\mu }}\) and \(V=\frac{R}{\lambda }\).
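To see how the bound behaves for concrete constants, the following sketch (our own) evaluates the complexity slack of Eq. 15, that is, everything on the right-hand side except \(\hat{L}_{n}(f)\); all constants plugged in below are illustrative.

```python
# Evaluating the complexity slack of Theorem 1, i.e. everything to the right of
# \hat{L}_n(f) in Eq. 15; the arguments are the constants named in the theorem.
import numpy as np

def theorem1_slack(lam, mu, n, R_x, L, R, delta):
    W = np.sqrt(R / mu)             # bound on ||w||_2 from the theorem statement
    V = R / lam                     # bound on ||v||_2 from the theorem statement
    complexity = 2.0 * L * np.sqrt(2.0 * (W ** 2 + V ** 2) / n) * R_x
    confidence = R * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return complexity + confidence

# illustrative constants only
print(theorem1_slack(lam=1.0, mu=1.0, n=10_000, R_x=1.0, L=1.0, R=1.0, delta=0.05))
```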

Instead of relying on a uniform bound \(R_{\mathbf{{x}}}\) on the data \(\mathbf{{x}}_{i}\), we can give the following data-dependent bound on the risk.

Proposition 1

(Data-dependent risk bound for Huber regularization). Let \(\mathcal {F}\) be defined by Eq. 12, and let \(\ell \) be an \(\mathcal {L}\)-Lipschitz continuous loss function. Then, for every \(\delta \in ( 0,1)\), with probability at least \(1-\delta \) the following holds for all \(f \in \mathcal {F}\), where W, V, and R are defined as in Theorem 1:

$$\begin{aligned} L(f) \le \hat{L}_{n}(f) + 2\mathcal {L}\frac{\sqrt{2(W^2+V^2)\sum \limits _{i=1}^{n}||\mathbf{{x}}_{i}||^2}}{n}+ (2\mathcal {L}+1)R\sqrt{\frac{\log (\frac{2}{\delta })}{2n}}. \end{aligned}$$
(16)

4.3 Lemmata and Auxiliary Results

The risk bounds are based on the following three lemmas.

Lemma 1

For the functional class \(\mathcal {F}\) of Eq. 12, the following data-dependent bound on the empirical Rademacher complexity holds:

$$\begin{aligned} \hat{\mathfrak {R}}_{S}(\mathcal {F}) \le \frac{\sqrt{2(W^2+V^2)\sum \limits _{i=1}^{n}||\mathbf{{x}}_{i}||^2}}{n}. \end{aligned}$$
(17)

Lemma 2

For the functional class \(\mathcal {F}\) of Eq. 12, the (theoretical) Rademacher complexity is bounded as follows:

$$\begin{aligned} \mathfrak {R}_{n}(\mathcal {F})=E_S[\hat{\mathfrak {R}}_{S}(\mathcal {F})] \le \sqrt{\frac{2(W^2+V^2)}{n}}R_{\mathbf{{x}}}. \end{aligned}$$
(18)

where \(R_{\mathbf{{x}}}\) is a constant such that \(||\mathbf{{x}}||_{2} \le R_{\mathbf{{x}}}\) almost surely under \(P_{X}\).

Lemma 3

Let \((\hat{\mathbf {w}},\hat{\mathbf {v}})= \mathop {{{\mathrm{arg\,min}}}}\limits _{\mathbf {v},\mathbf {w}}G(\mathbf {w},\mathbf {v},S)\). Then \(||\hat{\mathbf {w}}||_2 \!\le \! \sqrt{\frac{R}{\mu }}\) and \(||\hat{\mathbf {v}}||_2 \!\le \! {\frac{R}{\lambda }}\), where R is as in Theorem 1.

Proof

(Lemma 1 ). Following the ideas presented by Mohri et al. [16], we rewrite the empirical Rademacher complexity using the Cauchy-Schwarz inequality:

$$\begin{aligned} \hat{\mathfrak {R}}_{S}(\mathcal {F})&=\frac{1}{n} E_{\sigma }\left[ \sup \limits _{||\mathbf {w}||_{2} \le W, ||\mathbf {v}||_{2}\le V}\sum \limits _{i=1}^{n}\left( \sigma _{i}(\mathbf {w}+\mathbf {v})^{\top }{\mathbf {x}_{i}} \right) \right] \nonumber \\&= \frac{1}{n} E_{\sigma }\left[ \sup \limits _{||\mathbf {w}||_{2} \le W, ||\mathbf {v}||_{2}\le V} (\mathbf {w}+\mathbf {v})^{\top }\sum \limits _{i=1}^{n}\sigma _{i}{\mathbf {x}_{i}}\right] \nonumber \\&\le \frac{1}{n} E_{\sigma }\left[ \sup \limits _{||\mathbf {w}||_{2} \le W, ||\mathbf {v}||_{2}\le V} ||\mathbf {w}+\mathbf {v}||_{2}\left\| {\sum _{i=1}^{n}\sigma _{i}{\mathbf {x}_{i}}}\right\| _{2}\right] . \end{aligned}$$
(19)

Using the inequality \(\sum _{i=1}^{d}({w}_{i}+{v}_{i})^2 \le 2\sum _{i=1}^{d}(w_{i}^2+{v}_{i}^2) \) for the right-hand side of Eq. 19, according to the restrictions on the norms of \(\mathbf {w},\mathbf {v}\) we get:

$$\begin{aligned} E_{\sigma }\left[ \sup \limits _{||\mathbf {w}||_{2} \le W, ||\mathbf {v}||_{2}\le V} ||\mathbf {w}+\mathbf {v}||_{2}\left\| {\sum _{i=1}^{n}\sigma _{i}{\mathbf {x}_{i}}}\right\| _{2}\right] \le \sqrt{2(W^2+V^2)}E_{\sigma }\left[ \left\| {\sum \limits _{i=1}^{n}\sigma _{i}{\mathbf {x}_{i}}}\right\| _2\right] \end{aligned}$$
(20)

and because of Jensen’s inequality for \(E_{\sigma }\left[ ||\cdot || \right] \), linearity of expectation and independence of \(\sigma _{i},\sigma _{j}\) for \(j\ne i\) we obtain:

$$\begin{aligned} E_{\sigma }\left[ \left\| {\sum \limits _{i=1}^{n}\sigma _{i}{\mathbf {x}_{i}}}\right\| _2\right] \le&\sqrt{E_{\sigma }\left[ \sum \limits _{i,j=1}^{n}\sigma _{i}\sigma _{j}{\mathbf {x}_{i}}^{\top }{\mathbf {x}_{j}}\right] } \nonumber \\ =&\sqrt{\sum \limits _{i=1}^{n}E_{\sigma }\left[ ||{\mathbf {x}_{i}}||_2^2\right] } = \sqrt{\sum \limits _{i=1}^{n}||\mathbf{{x}}_{i}||_{2}^2}. \end{aligned}$$
(21)

Combining Inequalities 20 and 21 with Inequality 19 yields the statement of Lemma 1.    \(\square \)

Proof

(Lemma 2 ). Using Lemma 1 and the assumption that the \(\mathbf{{x}}_{i}\) are uniformly bounded by constant \(R_{\mathbf{{x}}}\) we obtain:

$$\begin{aligned} \hat{\mathfrak {R}}_{S}(\mathcal {F}) \le \sqrt{\frac{2(W^2+V^2)}{n}}R_{\mathbf{{x}}}. \end{aligned}$$
(22)

The right-hand side of Inequality 22 no longer depends on the sample, and therefore Lemma 2 follows.

   \(\square \)

Naturally, one may not have any a-priori knowledge of the constants W and V that restrict the possible values of \(\mathbf {w}\) and \(\mathbf {v}\) in Inequality 18. However, for a given optimization problem over this class of models, one can infer bounds on W and V. Lemma 3 provides such bounds for Optimization Problem 1.

Proof

(Lemma 3 ). If \((\hat{\mathbf {w}},\hat{\mathbf {v}})\) is a solution of Optimization Problem 1, then

$$\begin{aligned} G(\hat{\mathbf {w}},\hat{\mathbf {v}},S) \le G(\mathbf {0},\mathbf {0},S) \le R \end{aligned}$$

since \(G(\mathbf {0},\mathbf {0},S)=\frac{1}{n}\sum _{i=1}^{n}\ell (y_i,0) \le R\). Because the loss is non-negative, the optimal solution necessarily satisfies \(\lambda ||\hat{\mathbf {v}}||_1 + \mu ||\hat{\mathbf {w}}||_2^{2} \le R\). Since \(||\hat{\mathbf {v}}||_1 \ge ||\hat{\mathbf {v}}||_2\), the optimal solution satisfies \(||\hat{\mathbf {v}}||_2 \le \frac{R}{\lambda }\). For \(\hat{\mathbf {w}}\) we directly obtain the necessary condition \(||\hat{\mathbf {w}}||_2^{2} \le \frac{R}{\mu }\), which implies the claim of Lemma 3.    \(\square \)

Lemma 3 implies that the vectors \(\mathbf{{v}}\) and \(\mathbf{{w}}\) of a solution of Optimization Problem 1 necessarily lie within balls centered at the origin with radius \(W := \sqrt{\frac{R}{\mu }} \) for \(\mathbf {w}\) and radius \(V := \frac{R}{\lambda }\) for \(\mathbf {v}\).

4.4 Proof of the Huber-Norm Risk Bounds

We are now equipped to prove Theorem 1.

Proof

(Theorem 1 ). Lemma 2 gives us a bound on the Rademacher complexity of the functional class of Eq. 12, and Lemma 3 gives us necessary constraints on the norms W and V. Inserting both into Inequality 14, we obtain Theorem 1.    \(\square \)

Proof

(Proposition 1 ). Lemma 1 gives us a data-dependent bound on the empirical Rademacher complexity of the functional class of Eq. 12. Adapting Inequality (3.14) of Theorem 3.1 in Mohri et al. [16] to our setting, we have, with probability at least \(1-\frac{\delta }{2}\):

$$\begin{aligned} \mathfrak {R}_{n}(\mathcal {F}) \le \hat{\mathfrak {R}}_{S}(\mathcal {F}) + R\sqrt{\frac{\log (\frac{2}{\delta })}{2n}}. \end{aligned}$$
(23)

Using the union bound for Inequality 14 (with \(\frac{\delta }{2}\) instead of \(\delta \) and constant R from Theorem 1) and Inequality 23, we get, with probability at least \(1-\delta \):

$$\begin{aligned} L(f) \le \hat{L}_{n}(f) + 2\mathcal {L}\hat{\mathfrak {R}}_{S}(\mathcal {F})+ 2\mathcal {L}R\sqrt{\frac{\log (\frac{2}{\delta })}{2n}} + R\sqrt{\frac{\log (\frac{2}{\delta })}{2n}}. \end{aligned}$$
(24)

Together with Lemma 1 this yields the claim of Proposition 1.    \(\square \)

4.5 Corollaries

In practice, we are interested in upper bounds for concrete loss functions such as the hinge loss \(\ell (t,y)=\max (0,1-yt)\) or the logistic loss \(\ell (y,t) = \log (1+\exp (-yt))\) in the case of two-class classification problems. Since these loss functions are 1-Lipschitz [19], Theorem 1 yields the following corollaries.

Corollary 1

For Optimization Problem 1 under the assumptions of Theorem 1 with loss function \(\ell (y,t) = \max (0,1-yt), t \in \mathbb {R},y \in \{-1,1\}\), one obtains that, with probability at least \(1-\delta \), for all \( f \in \mathcal {F}\):

$$\begin{aligned} L(f) \le \hat{L}_{n}(f) + 2\sqrt{\frac{2(W^2+V^2)}{n}}R_{\mathbf{{x}}}+ B\sqrt{\frac{\log \delta ^{-1}}{2n}} \end{aligned}$$
(25)

where \(W=\sqrt{\frac{1}{\mu }}\), \(V = {\frac{1}{\lambda }}\), \(B=1+\sqrt{2(W^2+V^2)}R_{\mathbf{{x}}}\).

Corollary 2

For Optimization Problem 1 under the assumptions of Theorem 1 with loss function \(\ell (y,t) = \log (1+\exp (-yt)), t \in \mathbb {R}, y \in \{-1,1\}\), one obtains that, with probability at least \(1-\delta \), for all \( f \in \mathcal {F}\):

$$\begin{aligned} L(f) \le \hat{L}_{n}(f) + 2\sqrt{\frac{2(W^2+V^2)}{n}}R_{\mathbf{{x}}}+ B^{l}\sqrt{\frac{\log \delta ^{-1}}{2n}} \end{aligned}$$
(26)

where \(W=\sqrt{\frac{\log 2}{\mu }}\), \(V = {\frac{\log 2}{\lambda }}\), \(B^{l} := \frac{\exp (R_{\mathbf{{x}}}\sqrt{2(W^2+V^2)})}{ \sqrt{\exp (R_{\mathbf{{x}}}\sqrt{2(W^2+V^2)})+1}}\).

Proof

For the hinge loss, under the conditions of Theorem 1, we have for any \(\mathbf {x} \in \mathbb {R}^d\) such that \(||\mathbf {x}||_2 \le R_{\mathbf{{x}}} \) that the loss is bounded by \(1+| (\mathbf {w}+\mathbf {v})^\top \mathbf {x}|\), which is upper-bounded by \(1+\sqrt{2(W^2+V^2)}R_{\mathbf{{x}}}\) by combining the bounds on \(||\mathbf {w}+\mathbf {v}||_2\) and \(||\mathbf {x}||_2\). Hence \(|\ell (t,y)| \le B := 1+\sqrt{2(W^2+V^2)}R_{\mathbf{{x}}}\), and the conclusion follows by applying Theorem 1. The proof for the logistic loss is analogous.    \(\square \)

4.6 Discussion of Results

We will now compare the generalization performance of the developed Huber-norm regularizer with the performance of known regularizers.

Comparison to \(\ell _1\)- and \(\ell _2\)-Norm Regularization. The optimization problems of \(\ell _2\)-norm and \(\ell _1\)-norm regularized empirical risk minimization are

$$\begin{aligned} \hat{\mathbf {w}}_{} =&\mathop {{{\mathrm{arg\,min}}}}\limits _{\mathbf {w}} \mu ||\mathbf {w}||_{2}^{2} + \frac{1}{n}\sum \limits _{i=1}^{n}\ell (y_{i},\mathbf {w}^{T}\mathbf{{x}}_i) \text{ and } \end{aligned}$$
(27)
$$\begin{aligned} \hat{\mathbf {v}}_{} =&\mathop {{{\mathrm{arg\,min}}}}\limits _{\mathbf {v}} \lambda ||\mathbf {v}||_{1}+ \frac{1}{n}\sum \limits _{i=1}^{n}\ell (y_{i},\mathbf {v}^{T}\mathbf{{x}}_i), \end{aligned}$$
(28)

respectively. Theoretical upper bounds on the statistical risk for both Eqs. 27 and 28 follow from the results of Mohri et al. [16] on the Rademacher complexity of linear models. In these cases, the upper bound on the Rademacher complexity is also of order \(\sqrt{\frac{1}{n}}\) and likewise depends on the bounds W and V on the norms of the weight vectors (separately for each case) as well as on the bound on the data.

Comparison to Elastic Net. The optimization problem of the empirical risk minimization with elastic-net regularizer is

$$\begin{aligned} \hat{\mathbf {w}}_{} = \mathop {{{\mathrm{arg\,min}}}}\limits _{\mathbf {w}} \lambda ||\mathbf {w}||_{1}+\mu ||\mathbf {w}||_{2}^2+ \frac{1}{n}\sum \nolimits _{i=1}^{n}\ell (y_{i},\mathbf {w}^\top \mathbf{{x}}_i) \end{aligned}$$
(29)

with \(\ell (y,0) = 1\) [26]. By an argument similar to that of Theorem 1 [11, 16], one can infer that upper bounds on the Rademacher complexity for this procedure are also of order \(\mathcal {O}(\sqrt{\frac{W^2R_{\mathbf{{x}}}^2}{n}})\), where now \(W = \sqrt{\frac{1}{\lambda + \mu }} \) and \(R_{\mathbf{{x}}}\) is as before.

Oracle Inequality. We relate the generalization performance of the model to the performance of the best possible model in the class, which is unknown in practice, using an oracle-type inequality [4, 12]. As a corollary of Theorem 1, we obtain an oracle-type inequality that holds with high probability for \(\mathcal {F}\):

$$\begin{aligned} G(\hat{\mathbf {w}},\hat{\mathbf {v}},S) \le \min \nolimits _{(\mathbf {w},\mathbf {v})}G(\mathbf {w},\mathbf {v},S)+2\varDelta , \end{aligned}$$

where \(\varDelta \) is a complexity term associated with \((\hat{\mathbf{{w}}},\hat{\mathbf{{v}}})\in \mathcal {F}\); it is given explicitly in the following Proposition 2, which follows from Theorem 1.

Proposition 2

Let all conditions of Theorem 1 hold, let \((\hat{\mathbf {w}},\hat{\mathbf {v}}) = {{\mathrm{arg\,min}}}_{\mathbf {w},\mathbf {v}} G(\mathbf {w},\mathbf {v},S)\), and let W, V, \(R_{\mathbf{{x}}}\), and R be defined as in Theorem 1. Then with probability at least \(1-\delta \):

$$\begin{aligned} G(\hat{\mathbf {w}},\hat{\mathbf {v}},S) - \min \limits _{\mathbf {w},\mathbf {v}}G(\mathbf {w},\mathbf {v},S) \le 2\mathcal {L}\sqrt{\frac{2(W^2+V^2)}{n}}R_{\mathbf{{x}}}+ R\sqrt{\frac{\log \delta ^{-1}}{2n}}. \end{aligned}$$
(30)

Tightness Comparison. Comparing the order of our upper risk bound with classical results for empirical risk minimization [1, 5], one can see that our bound is tight in the sense that it matches the classical order of \(\sqrt{\frac{1}{n}}\).

5 Experiments

This section compares logistic regression with Huber-norm regularization to logistic regression with \(\ell _1\), with \(\ell _2\), and with elastic-net regularization.

5.1 Experimental Setting

We conduct experiments on benchmark problems from the UCI repository [15]. In order to avoid a possible selection bias, we select the first 31 classification problems (in alphabetical order) that use a matrix data format. We skip trivial problems for which all models achieve perfect accuracy. We transform categorical features into binary values using one-hot coding. For multi-class problems, we remove classes that have fewer instances than the number of cross-validation folds. All features are centered and scaled to unit variance. Missing values are filled in using mean imputation for continuous attributes and are represented as a separate one-hot coded attribute for categorical values.

Table 1. Accuracies and standard errors for UCI data sets

We run nested stratified cross validation with an outer loop of five folds. Regularization parameters \([\lambda ,\mu ]\) are tuned by an inner loop of three-fold cross validation on the training portion over the grid \([10^{-5},...,10^{3}] \times [10^{-3},...,10^{4}]\). A sketch of this protocol is shown below.
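A possible implementation of this protocol with scikit-learn is sketched below (our own illustration); HuberNormLogisticRegression is a hypothetical placeholder for an estimator implementing Algorithm 1 with scikit-learn's fit/predict interface, and X, y denote the preprocessed features and labels of one data set.

```python
# Hedged sketch of the nested cross-validation protocol with scikit-learn.
# HuberNormLogisticRegression is a hypothetical estimator (not part of
# scikit-learn) wrapping the procedure of Sect. 3.5; X, y are assumed given.
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

param_grid = {
    "lam": [10.0 ** k for k in range(-5, 4)],   # lambda grid 10^-5 ... 10^3
    "mu":  [10.0 ** k for k in range(-3, 5)],   # mu grid     10^-3 ... 10^4
}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(HuberNormLogisticRegression(), param_grid, cv=inner_cv)
accuracies = cross_val_score(search, X, y, cv=outer_cv)
```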

5.2 Results

Table 1 shows the accuracies of the different regularizers. For each problem, the highest empirical accuracy is typeset in bold face; asterisks mark models that are significantly better than the best of the other three models, based on a paired t test with \(p<0.05\). Logistic regression with Huber-norm regularization achieves the highest empirical accuracy for 23 out of 31 problems; its accuracy is significantly higher than the accuracy of any other model for 8 problems. None of the reference methods outperforms Huber-norm regularization significantly.

The UCI repository reflects a certain distribution P(S) over data sets. We state the null hypothesis A that the probability of Huber-norm regularization outperforming all three reference methods on a randomly drawn problem under P(S) does not exceed 0.5, and the null hypothesis B that this probability is below 0.5. We count each cross-validation fold of each UCI data set as a single observation of a binary random variable and determine the binomial likelihood of observing the outcomes reflected in Table 1. Logistic regression with Huber-norm regularization achieves a higher empirical accuracy than all three baselines in 86 out of 155 cross-validation folds, and an equally high accuracy as the best baseline in an additional 24 cases. We can therefore reject null hypothesis A at \(p=0.09\) and null hypothesis B even at \(p<0.001\). We conclude that, for the distribution of UCI problems, Huber-norm regularization is the best-performing among the \(\ell _1\), \(\ell _2\), elastic-net, and Huber-norm regularizers.

6 Conclusions

We proposed a new way of regularizing linear prediction models based on a combination of dense and sparse weight vectors. In more detail, we employ a linear weight vector that is the sum of two terms, \(\mathbf{{w}}+\mathbf{{v}}\), where \(\mathbf{{w}}\) is \(\ell _2\) regularized and \(\mathbf{{v}}\) is \(\ell _1\) regularized. This results in an effective Huber-norm regularizer for \(\mathbf{{w}}+\mathbf{{v}}\), which is very different from an elastic net. Starting with theoretical considerations, we first derived bounds on the statistical risk based on the framework of Rademacher complexities. In our subsequent experimental study, our algorithm showed higher predictive accuracies on a majority of UCI data sets, where we compared against \(\ell _1\), \(\ell _2\), and elastic-net regularization. In future work, we would like to study extensions to non-linear kernel functions and multiple kernels [9].