Lasso type classifiers with a reject option

We consider the problem of binary classification where, for a given cost, one can choose not to classify an observation. We present a simple proof of an oracle inequality for the excess risk of structural risk minimizers using a lasso type penalty.


Introduction
This paper discusses structural risk minimization in the setting of classification with a reject option. Binary classification is about classifying observations that take values in an arbitrary feature space X into one of two classes, labelled −1 or +1. A discriminant function f : X → R yields a classifier sgn(f(x)) ∈ {−1, +1} that represents our guess of the label Y of a future observation X, and we err if the margin y·f(x) < 0. Write

(1)  η(x) = P{Y = +1 | X = x}

for the conditional probability of the positive class. Since observations x for which η(x) is close to 1/2 are difficult to classify, we introduce a reject option for classifiers, by allowing for a third decision, (reject), expressing doubt. We build in the reject option by using a threshold value 0 ≤ τ < 1 as follows. Given a discriminant function f : X → R, we report sgn(f(x)) ∈ {−1, +1} if |f(x)| > τ, but we withhold decision and report (reject) if |f(x)| ≤ τ. We assume that the cost of making a wrong decision is 1 and the cost of utilizing the reject option is d > 0. The appropriate risk function is then

(2)  R(f) = E[ℓ(Y f(X))] = P{Y f(X) < −τ} + d·P{|f(X)| ≤ τ}

for the discontinuous loss

(3)  ℓ(z) = 1{z < −τ} + d·1{|z| ≤ τ}.

Since we never reject if d > 1/2, see [11], we restrict ourselves to the cases 0 ≤ d ≤ 1/2. The generalized Bayes discriminant function, minimizing (2), is then

(4)  f_0(x) = −1 if η(x) < d,  0 if d ≤ η(x) ≤ 1 − d,  +1 if η(x) > 1 − d,

with risk E[min{η(X), 1 − η(X), d}], see [9,13]. The case (τ, d) = (0, 1/2) reduces to the classical situation without the reject option. We can view d as an upper bound on the conditional probability of misclassification (given X) that is considered tolerable.

The estimators of f_0 that we study in this paper are linear combinations of base functions f_j from a dictionary F_M = {f_1, . . . , f_M}. We suggest regularized empirical risk minimization based on convex surrogate loss functions φ and a penalty term p(λ) = 2 r_n |λ|_1 that is proportional to the ℓ_1-norm |λ|_1 of the parameter λ. The regularized empirical risk is then convex in λ and its minimization can be solved by a (tractable) convex program. The organization of the paper is as follows.
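To make the decision rule concrete, here is a minimal sketch (function names are ours, not from the paper) of the thresholded classifier and the empirical counterpart of the risk in (2), assuming labels in {−1, +1} and encoding (reject) as 0:

```python
import numpy as np

def classify_with_reject(scores, tau):
    """Turn discriminant values f(x) into decisions: +1, -1, or 0 (reject)."""
    scores = np.asarray(scores, dtype=float)
    decisions = np.sign(scores)           # sgn(f(x))
    decisions[np.abs(scores) <= tau] = 0  # withhold decision when |f(x)| <= tau
    return decisions

def reject_risk(scores, y, tau, d):
    """Empirical risk: cost 1 for a wrong sign, cost d for using the reject option."""
    decisions = classify_with_reject(scores, tau)
    errors = (decisions != 0) & (decisions != y)
    rejects = decisions == 0
    return errors.mean() + d * rejects.mean()
```

With d = 1/2 and τ = 0 the reject region is never cheaper than guessing, which recovers the classical setting mentioned above.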
Section 2 presents a general bound on the excess risk of minimizers λ̂ of the penalized empirical risk (5). We define an oracle target λ*, that provides an ideal approximation f_{λ*} of f_0 with possibly many fewer elements f_i of the dictionary F_M, and show under mild assumptions that this oracle target can be recovered by minimization of (5), even if M is larger than n. We advance the use of a novel type of oracle inequality, explored in [8,6], where the aim is to show that the sum of the excess risk and the penalty term p(λ̂ − λ*) achieves the optimal balance between the excess risk and a regularization term. This allows us to determine that the oracle can be recovered and gives us information about the ℓ_1-distance between λ̂ and the oracle vector λ*. This extends the work of [4,5,6,7] on lasso-type estimators in regression and density estimation problems to empirical risk minimization of the general criterion (5) in the context of classification with a reject option. We take a different approach than the recent technical report [17]. In particular, we use the concept of mutual coherence, used in [4,5,6,7], which is weaker than the corresponding requirement in [17], and give a different, simple proof of the main oracle inequality. We demonstrate that the choice of the tuning parameter r_n in the penalty p(λ) = 2 r_n |λ|_1 is crucial. We prove that the oracle inequality holds on an event where r_n exceeds a certain random quantity r̂. Then we show that r̂ is highly concentrated around its mean using McDiarmid's concentration inequality, and provide an upper bound for E[r̂].
Section 3 applies the results of Section 2 to the specific generalized hinge loss function φ_d introduced in [1], extending the work of [14] to classification with a reject option. This loss is convex, so that the minimization of (5) is computationally feasible, and at the same time classification calibrated: the minimizer of E[φ_d(Y f(X))] is the Bayes discriminant function f_0, our parameter of interest.
Finally, the proofs are collected in Section 4.

Preliminaries
The data (X_1, Y_1), . . . , (X_n, Y_n) consist of independent copies of (X, Y), where X takes values in an arbitrary measurable space X and Y ∈ {−1, +1}. Let F_M = {f_1, . . . , f_M} be a finite set of functions (dictionary) with ‖f_j‖_∞ ≤ C_F, and we consider discriminant functions

f_λ(x) = Σ_{j=1}^M λ_j f_j(x),  λ ∈ R^M.

We consider a Lipschitz continuous loss function φ : R → [0, ∞) with Lipschitz constant C_φ < ∞, and based on this loss function, we define the risk functions

R_φ(λ) = E[φ(Y f_λ(X))]  and  R̂_φ(λ) = (1/n) Σ_{i=1}^n φ(Y_i f_λ(X_i)).

We assume that f_0 defined in (4) minimizes the risk E[φ(Y f(X))] over all measurable f : X → R, and we denote its risk by R_0, that is,

R_0 = E[φ(Y f_0(X))].

We measure the performance of our estimators in terms of the excess risk

∆_φ(λ) = R_φ(λ) − R_0.

Based on the penalty p(λ) = 2 r_n |λ|_1, with r_n specified later in Section 2.4, the penalized empirical risk minimizer λ̂ of

(5)  R̂_φ(λ) + p(λ)

satisfies

(6)  R̂_φ(λ̂) + p(λ̂) ≤ R̂_φ(λ) + p(λ)  for all λ ∈ R^M.

In particular, (6) ensures that for λ_0 = (0, . . . , 0),

R̂_φ(λ̂) + 2 r_n |λ̂|_1 ≤ R̂_φ(λ_0) ≤ φ(0).

This means that we effectively minimize the penalized empirical risk R̂_φ(λ) + p(λ) over λ in the set

Λ_n = {λ ∈ R^M : |λ|_1 ≤ φ(0)/(2 r_n)}.
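The minimization of (5) is a convex program. As an illustration only, the following sketch minimizes an ℓ_1-penalized empirical risk by proximal gradient descent (ISTA), using the smooth logistic loss φ(z) = log(1 + e^{−z}) as a stand-in convex surrogate; the paper's piecewise-linear φ_d would instead admit a linear-programming formulation. The function names, solver, and step-size rule are ours:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |.|_1 (coordinatewise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_risk_minimizer(F, y, r_n, n_iter=500):
    """Minimize (1/n) sum_i phi(y_i * (F @ lam)_i) + 2 * r_n * |lam|_1 by ISTA,
    with the logistic surrogate phi(z) = log(1 + exp(-z)).
    F is the n x M matrix with F[i, j] = f_j(X_i)."""
    n, M = F.shape
    lam = np.zeros(M)
    # Gradient Lipschitz constant of the smooth part: phi'' <= 1/4.
    L = 0.25 * np.linalg.norm(F, 2) ** 2 / n
    step = 1.0 / L
    for _ in range(n_iter):
        margins = y * (F @ lam)
        # Gradient of the smooth part: phi'(z) = -1 / (1 + exp(z)).
        grad = -(F.T @ (y / (1.0 + np.exp(margins)))) / n
        lam = soft_threshold(lam - step * grad, step * 2.0 * r_n)
    return lam
```

Soft-thresholding is exactly the proximal map of the ℓ_1 penalty, which is why the nonsmooth term causes no difficulty here.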

Assumptions
We impose two conditions. Given some finite measure µ on X, set

‖f‖² = ∫_X f² dµ.

The first condition imposes a link between the distance ‖f_λ − f_0‖ and the excess risk ∆_φ(λ):

(7)  ‖f_λ − f_0‖ ≤ C_{∆,µ} (∆_φ(λ))^β  for all λ ∈ Λ_n,

for constants 0 ≤ β ≤ 1/2 and C_{∆,µ} < ∞. In regression and density estimation problems as considered in [4,5,6,7], this condition trivially holds with β = 1/2 and C_{∆,µ} = 1. This relation is more delicate to establish in classification problems. It depends on the behavior of the conditional probability η(X) near d and 1 − d, see Section 3 below.
Our goal is to estimate f_0 via linear combinations f_λ(x) and to evaluate performance in terms of the excess risk ∆_φ(λ). For any I = {i_1, . . . , i_m} ⊆ {1, . . . , M}, we define the approximating parameter space

Λ(I) = {λ ∈ R^M : λ_i = 0 for all i ∉ I}

and let λ̄_I minimize R_φ(λ) over Λ(I). An oracle that knows f_0 would be able to tell us in advance which approximating space Λ(I) yields the smallest excess risk ∆_φ(λ̄_I). However, f_0 is unknown, so the best we can do is to mimic the behavior of the oracle. General theory for empirical risk minimization in the classification context [2,3,11] indicates that the empirical risk minimizer λ̂_I over Λ(I) satisfies

∆_φ(λ̂_I) ≲ ∆_φ(λ̄_I) + (|I|/n)^ρ,

where |I| denotes the cardinality of the set I and the symbol ≲ means that the inequality holds up to known multiplicative constants. Various choices are possible for the parameter ρ, depending on the margin exponent α ≥ 0 defined in Section 3. Our target of interest, the oracle vector λ* ∈ Λ_n, depends on β. Formally, we define it as follows:

Definition. Let c_µ = min_{1≤i≤M} ‖f_i‖ and let λ* be the minimizer over λ ∈ Λ_n of

3 ∆_φ(λ) + 2 (8 C_{∆,µ})^{1/(1−β)} (r_n² |λ|_0 / c_µ²)^{1/(2−2β)},

where |λ|_0 denotes the number of non-zero coefficients of λ.

Thus λ* balances the approximation error, as measured by the excess risk ∆_φ(λ), and the complexity of the parameter set Λ(I) to which λ* belongs, as measured by the regularization term (r_n² |λ|_0)^{1/(2−2β)}. The constants 3 and 2(8 C_{∆,µ})^{1/(1−β)} can be changed: a decrease in the former will lead to an increase in the latter, and vice versa. The constant c_µ can be avoided altogether if we take the penalty p(λ) = 2 r_n Σ_{i=1}^M ‖f_i‖ |λ_i|, but in practice µ, and consequently ‖f_i‖, is unknown. Of course we could plug in estimates for ‖f_i‖ as in [4,5,6,17], but we chose to keep the exposition and proofs as simple as possible.
Let I* = {i : λ*_i ≠ 0} be the set of indices of the non-zero coefficients of λ*, let M* = |I*| be its cardinality, and let

ρ(i, j) = ⟨f_i, f_j⟩ / (‖f_i‖ ‖f_j‖),  ⟨f_i, f_j⟩ = ∫_X f_i f_j dµ,

be the correlation between f_i and f_j. Our second assumption (10) requires, with c_µ = min_{1≤j≤M} ‖f_j‖, that the correlations ρ(i, j) with i ∈ I* and j ≠ i are sufficiently small. This mainly states that the submatrix (⟨f_i, f_j⟩)_{i,j∈I*} is positive definite and that the correlations ρ(i, j) between elements f_i, i ∈ I*, of this submatrix and outside elements f_j, j ∉ I*, are relatively small. We refer to this assumption as the local mutual coherence assumption, see [4,5,6,7].
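The quantities entering this assumption are straightforward to compute for a given dictionary. A small sketch (names ours), taking µ to be the empirical measure of X_1, . . . , X_n, so that ⟨f_i, f_j⟩ becomes a sample average:

```python
import numpy as np

def coherence_quantities(F):
    """Given the n x M design matrix F with F[i, j] = f_j(X_i), return the
    norms ||f_j||, the correlation matrix rho(i, j), and c_mu = min_j ||f_j||,
    all under the empirical measure of X_1, ..., X_n."""
    n, M = F.shape
    gram = F.T @ F / n                   # <f_i, f_j> under the empirical measure
    norms = np.sqrt(np.diag(gram))       # ||f_j||
    rho = gram / np.outer(norms, norms)  # correlations rho(i, j)
    return norms, rho, norms.min()

def max_local_coherence(rho, I_star):
    """Largest off-diagonal |rho(i, j)| with i in I_star: the quantity that the
    local mutual coherence assumption requires to be small."""
    M = rho.shape[0]
    return max(abs(rho[i, j]) for i in I_star for j in range(M) if j != i)
```

For an orthonormal dictionary the off-diagonal correlations vanish, so the assumption holds trivially.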

Oracle inequality
Instrumental in our argument is the random quantity

r̂ = sup_{λ ∈ Λ_n} |G(λ) − G(λ*)| / (|λ − λ*|_1 ∨ ε_n),  G(λ) = R̂_φ(λ) − R_φ(λ),

where we take ε_n = φ(0)/(n r_n). Our first result states the oracle inequality. It holds true as long as the tuning parameter r_n in the penalty term exceeds r̂.
Theorem 1. Assume that (7) and (10) hold. On the event {r_n > r̂},

∆_φ(λ̂) + r_n |λ̂ − λ*|_1 ≤ 3 ∆_φ(λ*) + 2 (8 C_{∆,µ})^{1/(1−β)} (r_n² M* / c_µ²)^{1/(2−2β)} + r_n ε_n.

The next section discusses choices of the tuning parameter r_n that ensure that the probability of the event {r_n ≥ r̂} is large.

Choice of the tuning parameter r n
The next lemma states that r̂ is sharply concentrated around its mean.

Lemma 2. We have

(13)  0 ≤ r̂ ≤ 2 C_φ C_F

and, for all δ > 0,

(14)  P{r̂ ≥ E[r̂] + δ} ≤ exp(−n δ² / (2 C_φ² C_F²)).

Proof. The first assertion follows directly from the definition of r̂. The second statement follows from an application of McDiarmid's bounded differences inequality [10, Theorem 2.2, page 8] after observing that a change of a single pair (X_i, Y_i) changes r̂ by at most 2 C_φ C_F / n.
The range of r̂ in (13) is important for implementation of the method: we suggest finding a good value for r_n by cross-validation, with the grid taken on the interval [0, 2 C_φ C_F]. Inequality (14) is important for theoretical considerations. It shows that we should take

(16)  r_n = E[r̂] + C_φ C_F (2 log(1/δ)/n)^{1/2}

for some 0 < δ < 1, since then P{r̂ > r_n} ≤ δ. The expected value E[r̂] is of order {log(M ∨ n)/n}^{1/2} by the following lemma.
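In practice E[r̂] is unknown, so the cross-validation grid on [0, 2 C_φ C_F] is the practical route; for theory, E[r̂] can be replaced by a surrogate of its order {log(M ∨ n)/n}^{1/2}. A sketch of both choices (the surrogate constant and all names are ours, not prescribed by the paper):

```python
import numpy as np

def cv_grid_for_r_n(C_phi, C_F, num=50):
    """Candidate values for r_n on the interval [0, 2 C_phi C_F] given by the
    range of r-hat; in practice each candidate is scored by cross-validation."""
    return np.linspace(0.0, 2.0 * C_phi * C_F, num)

def theoretical_r_n(n, M, C_phi, C_F, delta):
    """One concrete instantiation of the theoretical choice: a proxy of order
    {log(M or n)/n}^{1/2} for E[r-hat], plus the McDiarmid deviation term
    C_phi * C_F * sqrt(2 log(1/delta) / n)."""
    mean_proxy = C_phi * C_F * np.sqrt(2.0 * np.log(2 * max(M, n)) / n)
    deviation = C_phi * C_F * np.sqrt(2.0 * np.log(1.0 / delta) / n)
    return mean_proxy + deviation
```

Both terms shrink at the rate n^{−1/2} up to logarithms, so the penalty vanishes as the sample size grows even when M grows polynomially in n.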

Example: generalized hinge loss
Throughout this section, we consider a fixed cost d and a fixed threshold value τ with 0 ≤ d ≤ 1/2 and d ≤ τ ≤ 1 − d. Instead of the discontinuous loss ℓ(z) defined in (3), [1] considers the convex surrogate loss

φ_d(z) = 1 − a z if z < 0,  1 − z if 0 ≤ z < 1,  0 if z ≥ 1,

where a = (1 − d)/d ≥ 1, and shows that the Bayes discriminant function f_0 defined in (4) minimizes both the risks E[ℓ(Y f(X))] and E[φ_d(Y f(X))]. Moreover, [1] shows that a relation like this holds not only for the loss functions, and hence the risks, but for the excess risks as well. In particular, the excess risk for ℓ is bounded by a constant multiple of the φ_d-excess risk, see (18). This is important since minimization of (5) produces oracle inequalities in terms of the φ_d-excess risk (Theorem 1), not in terms of the original excess risk directly. The latter risk has a sound statistical interpretation.

For plug-in rules and empirical risk minimizers, [1,11] show that for classification with a reject option, fast rates (faster than n^{−1/2}) for the excess risk may be obtained if the probability that η(X), defined in (1), is close to the critical values d and 1 − d, is small. More precisely, assume that there exist A ≥ 1 and α ≥ 0 such that for all t > 0,

(19)  P{|η(X) − d| ≤ t} + P{|η(X) − (1 − d)| ≤ t} ≤ A t^α.

For d = 1/2, this assumption is equivalent to Tsybakov's margin condition [15]. Then, [1, Proof of Lemma 7] establishes the bound (20) relating the distance between f and f_0 to the φ_d-excess risk. Following [14], we consider the measure µ defined in (21) for any Borel set B in terms of P, the probability measure of X. It then follows from (20) that condition (7) holds for all λ with |λ|_1 ≤ C_Λ, with C_{∆,µ} defined in (22) and β = α/(2 + 2α).

Let λ̂ minimize the penalized empirical risk R̂_{φ_d}(λ) + p(λ) over the restricted set

Λ = {λ ∈ R^M : |λ|_1 ≤ C_Λ}

for some finite C_Λ, and let λ* minimize the oracle criterion (23) over λ ∈ Λ. Provided that the mutual coherence assumption (10) holds, Corollary 4 states that for all choices r_n = r_n(δ) in (16), the oracle inequality of Theorem 1 holds with probability at least 1 − δ, where 0 < δ < 1 is given in (16). Consequently, via (18), we obtain the following result.

Theorem 5. Assume that (19) holds for some α ≥ 0 and that the dictionary F_M satisfies (10) with µ defined in (21). Let λ* ∈ Λ be as given in (23). Then the minimizer λ̂ ∈ Λ with r_n as in (16), with δ = 1/(n ∨ M) and C_φ = (1 − d)/d, satisfies, for C_{∆,µ} defined in (22), the corresponding oracle bound on the excess risk with probability tending to 1 as n → ∞.
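For concreteness, here is a sketch of the generalized hinge loss as we read it from [1]: linear with slope −a on the negative half-line, the usual hinge on [0, 1], and zero beyond (names ours; verify against [1] before relying on the exact form):

```python
import numpy as np

def generalized_hinge(z, d):
    """Generalized hinge loss phi_d of [1] (as we read it): with a = (1 - d)/d,
    phi_d(z) = 1 - a*z for z < 0, 1 - z for 0 <= z < 1, and 0 for z >= 1."""
    a = (1.0 - d) / d
    z = np.asarray(z, dtype=float)
    return np.where(z < 0, 1.0 - a * z, np.maximum(1.0 - z, 0.0))
```

Note that φ_d(0) = 1, the slopes −a ≤ −1 ≤ 0 are non-decreasing so φ_d is convex, its Lipschitz constant is a = (1 − d)/d (matching the choice C_φ = (1 − d)/d above), and for d = 1/2 it reduces to the standard hinge loss max(0, 1 − z).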

Proof of Theorem 1
Lemma 6. On the event {r̂ ≤ r_n}, we have

∆_φ(λ̂) + r_n |λ̂ − λ*|_1 ≤ ∆_φ(λ*) + 4 r_n Σ_{i∈I*} |λ̂_i − λ*_i| + r_n ε_n.

Proof. Rewrite (6) to obtain, for G(λ) = R̂_φ(λ) − R_φ(λ),

R_φ(λ̂) + 2 r_n |λ̂|_1 ≤ R_φ(λ*) + 2 r_n |λ*|_1 + G(λ*) − G(λ̂).

On the event {r_n ≥ r̂} then,

G(λ*) − G(λ̂) ≤ r̂ (|λ̂ − λ*|_1 ∨ ε_n) ≤ r_n |λ̂ − λ*|_1 + r_n ε_n.

Add r_n |λ̂ − λ*|_1 to both sides, and deduce

∆_φ(λ̂) + r_n |λ̂ − λ*|_1 ≤ ∆_φ(λ*) + 2 r_n (|λ̂ − λ*|_1 + |λ*|_1 − |λ̂|_1) + r_n ε_n ≤ ∆_φ(λ*) + 4 r_n Σ_{i∈I*} |λ̂_i − λ*_i| + r_n ε_n,

which proves our claim.
Proof. See the proof of Theorem 2 of [7, pages 536–537]. For completeness, we repeat the argument. Set u_i = λ̂_i − λ*_i and U* = Σ_{i∈I*} ‖f_i‖ |u_i|, and combine Lemma 6 with the mutual coherence assumption (10) to obtain a quadratic inequality in U*. The left-hand side can be bounded by

Σ_{j∈I*} u_j² ‖f_j‖² ≥ (U*)² / |λ*|_0,

using the Cauchy–Schwarz inequality, and using the properties of a function of degree two in U*, we further obtain the stated bound; the result follows from c_µ Σ_{i∈I*} |λ̂_i − λ*_i| ≤ U*.

Combining both lemmas with the mutual coherence assumption immediately gives Lemma 8.

Proof of Lemma 3. On the event {r_n ≥ r̂}, we split the supremum defining r̂ into a term (I) over |λ − λ*|_1 ≤ ε_n and a term (II) over ε_n ≤ |λ − λ*|_1 ≤ φ(0)/r_n, as |λ − λ*|_1 ≤ φ(0)/r_n for all λ ∈ Λ_n. The first term can be bounded by the contraction principle for Rademacher processes, see [12, pages 112–113]. This implies that

E[(I)] ≤ C_φ C_F (2 log(2M)/n)^{1/2},

where we used [10, Lemma 2.2, page 7] to get the last inequality. We can apply this result since the required exponential moment bound holds for all s, which follows in turn from [10, Lemma 2.1, page 5].

The second term (II) requires a peeling argument [16, page 70]. Since 0 ≤ r̂ ≤ 2 C_φ C_F almost surely, we can use the bound (29). Observe that for J_n the smallest integer with 2^{J_n} ε_n ≥ φ(0)/r_n or 2^{J_n} ≥ n, the supremum over ε_n ≤ |λ − λ*|_1 ≤ φ(0)/r_n can be peeled into the shells 2^{j−1} ε_n ≤ |λ − λ*|_1 ≤ 2^j ε_n, j = 1, . . . , J_n. Now, set

Z_j = sup_{|λ−λ*|_1 ≤ 2^j ε_n} {G_0(λ) − G_0(λ*)},

and the same considerations leading to the final bound of (I) above yield

E[Z_j] ≤ 2^j ε_n C_φ C_F (2 log(2M)/n)^{1/2}.

A change of a single pair (X_i, Y_i) changes Z_j by at most 2 C_φ C_F (2^j ε_n)/n, so that another application of the bounded differences inequality [10, Theorem 2.2, page 8] gives

Σ_{j=1}^{J_n} P{ Z_j − E[Z_j] ≥ 2 C_φ C_F 2^j ε_n (log(2M ∨ 2n)/n)^{1/2} } ≤ J_n exp{−2 log(2M ∨ 2n)} = J_n (2M ∨ 2n)^{−2}.
Invoke (29) to conclude the proof of Lemma 3.