Generalization error for multi-class margin classification

In this article, we study rates of convergence of the generalization error of multi-class margin classifiers. In particular, we develop an upper bound theory quantifying the generalization error of various large margin classifiers. The theory permits a treatment of general margin losses, convex or nonconvex, in the presence or absence of a dominating class. Three main results are established. First, for any fixed margin loss, there may be a trade-off between the ideal and actual generalization performances with respect to the choice of the class of candidate decision functions, which is governed by the trade-off between the approximation and estimation errors. In fact, different margin losses lead to different ideal or actual performances in specific cases. Second, we demonstrate, in a problem of linear learning, that the convergence rate can be arbitrarily fast in the sample size $n$ depending on the joint distribution of the input/output pair. This goes beyond the anticipated rate $O(n^{-1})$. Third, we establish rates of convergence of several margin classifiers in feature selection, with the number of candidate variables $p$ allowed to greatly exceed the sample size $n$ provided that $p$ grows no faster than $\exp(n)$.


Introduction
Large margin classification has seen significant developments in the past several years, including many well-known classifiers such as the Support Vector Machine (SVM, (7)) and neural networks. This article investigates the generalization accuracy of margin classifiers in multi-class classification.
In the literature, the generalization accuracy of large margin classifiers has been investigated in two-class classification. Relevant results can be found in, for example, (3), (29) and (14). For multi-class classification, however, there are many distinct generalizations of the same two-class margin classifier; see Section 3 for a further discussion of this aspect. As a result, much less is known with regard to the generalization accuracy of large margin classifiers, particularly its relation to the presence or absence of a dominating class, which is not of concern in the two-class case. Consistency has been studied in (30) and (21). To our knowledge, rates of convergence of the generalization error have not yet been studied for general margin classifiers in multi-class classification.
In the two-class case, the generalization accuracy of a large margin classifier is studied through the notion of Fisher consistency (cf., (15); (30)), where the Bayesian regret $\mathrm{Regret}(\hat f, \bar f)$ is used to measure the discrepancy between an estimated decision function $\hat f$ and $\bar f$, the (global) Bayes decision function over all possible candidate functions. When a specific class of candidate decision functions $\mathcal{F}$ and a surrogate loss $V$ are used in classification, $\bar f$ is often not the risk minimizer defined by $V$ over $\mathcal{F}$. Then an approximation error of $\mathcal{F}$ to $\bar f$ with respect to $V$ is usually assumed, yielding an upper bound of $\mathrm{Regret}(\hat f, \bar f)$ expressed as an approximation error plus an estimation error of estimating the decision function. One major difficulty with this formulation is that the approximation error may dominate the corresponding estimation error and be non-zero. This occurs in classification with linear decision functions; see Section 5.1 for an example. In such a situation, well-established bounds for the estimation error become irrelevant, and such a learning theory breaks down when the approximation error does not tend to zero.
To treat multi-class margin classification and circumvent the aforementioned difficulty, we take a novel approach by targeting $\mathrm{Regret}(\hat f, f_V)$, with $f_V$ the risk minimizer over $\mathcal{F}$ given $V$. Toward this end, we study the ideal generalization performance of $f_V$ and the mean-variance relationship of the cost function. This permits a comparison of various margin classifiers with respect to the ideal and actual performances, respectively described in Sections 3 and 4, bypassing the requirement of studying Fisher consistency. As illustrated in Section 5.2, we show that the rate of convergence of the generalization error of certain large margin classifiers can be arbitrarily fast in linear classification, depending on the joint distribution of the input/output pair. Moreover, in linear classification, the ideal generalization performance is more crucial than the actual generalization performance, whereas in nonlinear classification the approximation error becomes important to the actual generalization performance. Finally, we treat variable selection in sparse learning in a high-dimensional situation. There the focus has been on how to utilize the sparseness structure to attack the curse of high dimensionality, cf., (31) and (12). Our formulation permits the number of candidate variables $p$ to greatly exceed the sample size $n$. Specifically, we obtain results for several margin classifiers involving feature selection when $p$ grows no faster than $\exp(n)$. This illustrates the important role of the penalty in sparse learning.
This article is organized as follows. Section 2 introduces the notion of generalized multi-class margin losses to unify various generalizations of two-class margin losses. Section 3 discusses the ideal generalization performance of $f_V$ with respect to $V$, whereas Section 4 establishes an upper bound theory concerning the generalization error for margin classifiers. Section 5 illustrates the general theory through four classification examples. The Appendix contains technical proofs.

Multi-class and generalized margin losses
In $k$-class classification, a decision function vector $f = (f_1, \cdots, f_k)$, with $f_j$ representing class $j$ and mapping from input space $X \subset R^d$ to $R$, is estimated through a training sample $Z_i = (X_i, Y_i)$; $i = 1, \cdots, n$, independent and identically distributed according to an unknown joint probability $P(x, y)$, where $Y_i$ is coded as $\{1, \cdots, k\}$. For an instance $x$, classification is performed by the rule $\arg\max_{1 \leq j \leq k} f_j(x)$, assigning $x$ to the class with the highest value of $f_j(x)$; $j = 1, \cdots, k$. The classifier defined by $\arg\max_{1 \leq j \leq k} f_j(x)$ partitions $X$ into $k$ disjoint and exhaustive regions $X_1, \cdots, X_k$. To avoid redundancy in $f$, a zero-sum constraint $\sum_{j=1}^{k} f_j = 0$ is enforced. Note that $f_j$; $j = 1, \cdots, k$, are not probabilities.
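The decision rule and the zero-sum constraint above can be sketched in a few lines of Python. This is an illustrative snippet (the function names are ours, not from the paper), assuming decision values are stored as an $n \times k$ array; note that subtracting a per-row constant enforces the constraint without changing the arg max.

```python
import numpy as np

def enforce_zero_sum(F):
    """Project decision values onto the zero-sum constraint sum_j f_j = 0."""
    return F - F.mean(axis=1, keepdims=True)

def classify(F):
    """arg max_{1<=j<=k} f_j(x), with classes coded 1, ..., k."""
    return np.argmax(F, axis=1) + 1

# Two instances, k = 3 classes.
F = np.array([[0.2, 1.1, -0.3],
              [0.9, -0.4, 0.1]])
F0 = enforce_zero_sum(F)   # rows now sum to zero
labels = classify(F0)      # same labels as classify(F): [2, 1]
```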
In multi-class margin classification, there are a number of generalizations of the same two-class method. We now introduce a framework using the notion of generalized margin, unifying various generalizations. Define the generalized margin vector $u(f(x), y) = (f_y(x) - f_j(x))_{j \neq y}$, comparing class $y$ against the remaining classes. When $k = 2$, it reduces to the binary functional margin $f_y - f_{c \neq y}$, which, together with the zero-sum constraint, is equivalent to $yf(x)$ with $y = \pm 1$. Within this framework, we define a generalized margin loss $V(f, z) = h(u(f(x), y))$ for some measurable function $h$ and $z = (x, y)$, where $V$ is called large margin if it is nondecreasing with respect to each component of $u(f(x), y)$, and $V$ is often called a surrogate loss when it is not the 0-1 loss. Examples include the loss of the import vector machine (33) and, for multi-class SVMs, several versions of the generalized hinge loss, proposed by (23), (26), (5), (8), and (11), by (13), and by (16), each defined through a different choice of $h$. For classification, a penalized cost function is constructed through $V(f, Z)$: $n^{-1} \sum_{i=1}^{n} V(f, Z_i) + \lambda J(f)$ (2.1), where $J(f)$ is a nonnegative penalty penalizing undesirable properties of $f$, and $\lambda > 0$ is a tuning parameter controlling the trade-off between training and $J(f)$. The minimizer of (2.1) with respect to $f \in \mathcal{F} = \{(f_1, \cdots, f_k) \in F : \sum_{j=1}^{k} f_j = 0\}$, a class of candidate decision function vectors, yields $\hat f = (\hat f_1, \cdots, \hat f_k)$ and thus the classifier $\arg\max_{j=1,\cdots,k} \hat f_j$.
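To make the generalized margin concrete, the sketch below computes $u(f(x), y)$ as the vector of differences $f_y - f_j$, $j \neq y$, and evaluates two illustrative hinge-type choices of $h$. These particular formulas are common sum-type and min-type variants written by us for illustration; the cited definitions differ in detail.

```python
import numpy as np

def generalized_margin(f, y):
    """u(f(x), y): compare class y against each remaining class."""
    return np.array([f[y] - f[j] for j in range(len(f)) if j != y])

def hinge_sum(f, y):
    """Sum-type generalized hinge: sum_{j != y} [1 - (f_y - f_j)]_+."""
    return np.maximum(0.0, 1.0 - generalized_margin(f, y)).sum()

def hinge_min(f, y):
    """Min-type generalized hinge: [1 - min_{j != y} (f_y - f_j)]_+,
    penalizing only the strongest competing class."""
    return max(0.0, 1.0 - generalized_margin(f, y).min())

f = np.array([1.0, -0.25, -0.75])   # zero-sum decision values, k = 3
loss_a = hinge_sum(f, 0)            # 0.0: both margins of class 0 exceed 1
loss_b = hinge_min(f, 1)            # 2.25: class 1 loses to class 0 by 1.25
```

Both functions are nondecreasing in each component of $u(f(x), y)$, hence large margin losses in the sense above.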
In classification, $J(f)$ is often the inverse of the geometric margin defined by various norms, or the conditional Fisher information (6). For instance, in linear SVM classification with feature selection, the inverse geometric margin with respect to a linear decision function vector $f$ is defined as $\frac{1}{2} \sum_{j=1}^{k} \|w_j\|_1$, cf., (4), where $f_j(x) = \langle w_j, x \rangle + b_j$; $j = 1, \cdots, k$, with $\langle \cdot, \cdot \rangle$ the usual inner product in $R^d$, $b_j \in R$, and $\|\cdot\|_1$ the usual $L_1$ norm. In standard kernel SVM learning, the inverse geometric margin becomes $\frac{1}{2} \sum_{j=1}^{k} \|g_j\|_K^2$ with $f_j = g_j + b_j$. Here $K(\cdot, \cdot)$ is symmetric and positive semi-definite, mapping from $X \times X$ to $R$, and is assumed to satisfy Mercer's condition (17) so that $\|g\|_K$ is a norm.
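As an illustration of the penalized cost (2.1) with the $L_1$ inverse-geometric-margin penalty, the following sketch evaluates the objective for linear decision functions $f_j(x) = \langle w_j, x \rangle + b_j$. The surrogate loss passed in is a sum-type hinge chosen for illustration, and all names are ours.

```python
import numpy as np

def l1_inverse_margin(W):
    """J(f) = (1/2) * sum_j ||w_j||_1 for linear decision functions."""
    return 0.5 * np.abs(W).sum()

def hinge_sum(f, y):
    """Sum-type generalized hinge loss, as one choice of V."""
    u = np.array([f[y] - f[j] for j in range(len(f)) if j != y])
    return np.maximum(0.0, 1.0 - u).sum()

def penalized_cost(W, b, X, y, loss, lam):
    """(2.1): n^{-1} sum_i V(f, Z_i) + lambda * J(f).

    W: k x d weights, b: length-k intercepts, y coded 0..k-1."""
    F = X @ W.T + b                              # n x k decision values
    risk = np.mean([loss(F[i], y[i]) for i in range(len(y))])
    return risk + lam * l1_inverse_margin(W)

W = np.array([[ 1.0, 0.0],
              [-1.0, 0.0],
              [ 0.0, 0.0]])                      # columns sum to zero
b = np.zeros(3)
X = np.array([[ 2.0, 0.5],
              [-2.0, 0.3]])
y = np.array([0, 1])
cost = penalized_cost(W, b, X, y, hinge_sum, lam=0.1)   # 0.0 risk + 0.1 * 1.0
```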

Ideal generalization performance
The generalization error (GE) is often used to measure the generalization accuracy of a classifier defined by $f$, which is $\mathrm{Err}(f) = E L(f, Z)$ with the multi-class misclassification (0-1) loss $L(f, z) = I(Y \neq \arg\max_{j=1,\cdots,k} f_j(X))$.
The corresponding empirical generalization error (EGE) is $n^{-1} \sum_{i=1}^{n} L(f, Z_i)$. Often a surrogate loss $V$ is used in (2.1), as opposed to the 0-1 loss, for computational considerations. In such a situation, (2.1) targets the minimizer $f_V = \arg\inf_{f \in \mathcal{F}} EV(f, Z)$, which may not belong to $\mathcal{F}$. Consequently, define $e(f, f_V) = \mathrm{Err}(f) - \mathrm{Err}(f_V)$ and $e_V(f, f_V) = EV(f, Z) - EV(f_V, Z)$. Note that for $f \in \mathcal{F}$, $e_V(f, f_V) \geq 0$, but $e(f, f_V)$ may not be so, depending on the choice of $V$. In this article, we provide a bound on $|e(f, f_V)|$ to measure the discrepancy between the actual and ideal generalization performances of a classifier defined by $f$.
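A minimal sketch of the quantities above: the empirical 0-1 error of the rule $\arg\max_j f_j$, which estimates $\mathrm{Err}(f) = E L(f, Z)$. Names and the coding of labels as 0, ..., k-1 are ours.

```python
import numpy as np

def zero_one_loss(F, y):
    """L(f, z) = I(y != argmax_j f_j(x)), evaluated row by row."""
    return (np.argmax(F, axis=1) != y).astype(float)

def empirical_ge(F, y):
    """EGE: n^{-1} sum_i L(f, Z_i)."""
    return zero_one_loss(F, y).mean()

F = np.array([[ 0.9, -0.4, -0.5],
              [-0.2,  0.7, -0.5],
              [-0.6, -0.1,  0.7],
              [ 0.1,  0.3, -0.4]])
y = np.array([0, 1, 2, 0])          # last instance is misclassified
ege = empirical_ge(F, y)            # 1 error out of 4 -> 0.25
```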
It is worthwhile to mention that, for two margin losses $V_i$; $i = 1, 2$, the ideal generalization performances determine the asymptotic behavior of the actual generalization performances of the corresponding classifiers defined by $\hat f_i$. Therefore, if $EL(f_{V_1}, Z) < EL(f_{V_2}, Z)$, then $EL(\hat f_1, Z) < EL(\hat f_2, Z)$ eventually, provided that $|e(\hat f_i, f_{V_i})| \to 0$ as $n \to \infty$. Consequently, a comparison of $|e(\hat f_1, f_{V_1})|$ with $|e(\hat f_2, f_{V_2})|$ is useful only when the ideal performances are the same, that is, $EL(f_{V_1}, Z) = EL(f_{V_2}, Z)$.
To study the ideal generalization performance of $f_V$ with respect to $V$, let $\bar f$ be the (global) Bayes rule, obtained by minimizing $\mathrm{Err}(f)$ with respect to all $f$, including $f \notin \mathcal{F}$. Note that the (global) Bayes rule is not unique, but its error is unique with respect to loss $L$, because any $\bar f$ satisfying $\arg\max_j \bar f_j(x) = \arg\max_j P_j(x)$, with $P_j(x) = P(Y = j | X = x)$, yields the same minimal error. Without loss of generality, we define $\bar f = (\bar f_1, \ldots, \bar f_k)$ with $\bar f_l(x) = \frac{k-1}{k}$ if $l = \arg\max_j P_j(x)$, and $-\frac{1}{k}$ otherwise. Let $V_{svmj}$ and $V_\psi$ be margin losses defined by $h_{svmj}$ and $h_\psi$, respectively.
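The canonical form of $\bar f$ above is easy to construct from the conditional class probabilities $P_j(x)$; the sketch below (our naming) does so and also confirms the zero-sum property, since $(k-1)/k + (k-1) \cdot (-1/k) = 0$.

```python
import numpy as np

def bayes_rule(P):
    """Canonical Bayes decision vectors: (k-1)/k at argmax_j P_j(x),
    and -1/k elsewhere.

    P: n x k array of posteriors P_j(x), each row summing to 1."""
    n, k = P.shape
    f = np.full((n, k), -1.0 / k)
    f[np.arange(n), P.argmax(axis=1)] = (k - 1.0) / k
    return f

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.2, 0.7]])
fbar = bayes_rule(P)
# Each row sums to zero, and argmax_j fbar_j matches argmax_j P_j(x).
```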
Lemma 1. Let $f_L = \arg\inf_{f \in \mathcal{F}} EL(f, Z)$. Then $EL(f_{V_\psi}, Z) = EL(f_L, Z) \leq EL(f_V, Z)$ for any margin loss $V$. If, in addition, for generalized hinge losses $V_{svmj}$, $j \in \{1, 3\}$, the problem is separable in that $EV(f_{V_{svmj}}, Z) = 0$, then $EL(f_{V_{svmj}}, Z) = EL(f_L, Z)$.
Lemma 1 concerns $V_\psi$ in both the separable and nonseparable cases, and $V_{svmj}$; $j = 1, 3$, in the separable case, in relation to other margin losses. For other margin losses, such an inequality may not hold generally, depending on $\mathcal{F}$ and $V$. Therefore a case-by-case examination may be necessary; see Section 5.1 for an example.

Actual generalization performance
In our formulation, $\mathcal{F}$ is allowed to depend on the sample size $n$; so is $f_V$ defined by $\mathcal{F}$. When $f_V$ depends on $n$, it approximates $f^*$ (independent of $n$). For any $f \in \mathcal{F}$ and some truncation constant $T > 0$, define the truncated loss $V_T(f, z) = V(f, z) \wedge T$, where $\wedge$ denotes the minimum, and the truncated mean difference $e_{V_T}(f, f^*) = E V_T(f, Z) - E V_T(f^*, Z)$. The following conditions are assumed, based on the bracketing $L_2$ metric entropy and the uniform entropy.
Assumption A: (Conversion) There exists a constant $T > 0$, independent of $n$, such that $T > \max(V(f_0, Z), V(f^*, Z))$ a.s., and there exist constants $0 < \alpha \leq \infty$ and $c_1 > 0$ such that the conversion inequality holds for all $0 < \epsilon \leq T$ and $f \in \mathcal{F}$.
Assumption B: (Variance) For some constant $T > 0$, there exist constants $\beta \geq 0$ and $c_2 > 0$ such that the variance bound holds for all $0 < \epsilon \leq T$ and $f \in \mathcal{F}$.
To specify Assumption C, we define the $L_2$-bracketing metric entropy and the uniform metric entropy for a function space $\mathcal{G} = \{g\}$. For any $\epsilon > 0$, call $\{(g_1^l, g_1^u), \ldots, (g_m^l, g_m^u)\}$ an $\epsilon$-bracketing set of $\mathcal{G}$ if for any $g \in \mathcal{G}$ there exists a $j$ such that $g_j^l \leq g \leq g_j^u$ and $\|g_j^u - g_j^l\|_2 \leq \epsilon$, where $\|g\|_2 = (Eg^2)^{1/2}$ is the usual $L_2$-norm. The metric entropy $H_B(\epsilon, \mathcal{G})$ of $\mathcal{G}$ with bracketing is then defined as the logarithm of the cardinality of the smallest $\epsilon$-bracketing set of $\mathcal{G}$. Similarly, a set $(g_1, \cdots, g_m)$ is called an $\epsilon$-net of $\mathcal{G}$ if for any $g \in \mathcal{G}$ there exists a $j$ such that $\|g_j - g\|_{Q,2} \leq \epsilon$, where $\|\cdot\|_{Q,2}$ is the $L_2(Q)$-norm with respect to $Q$, defined as $\|g\|_{Q,2} = (\int g^2 \, dQ)^{1/2}$. The $L_2(Q)$-metric entropy $H_Q(\epsilon, \mathcal{G})$ is the logarithm of the covering number, the minimal size of all $\epsilon$-nets. The uniform metric entropy is defined as $H(\epsilon, \mathcal{G}) = \sup_Q H_Q(\epsilon, \mathcal{G})$.
Assumption C: (Complexity) For some constants $c_i > 0$; $i = 3, \cdots, 5$, there exists $\varepsilon_n > 0$ satisfying the required entropy bound.
Assumption A specifies a relationship between $e(f, f^*)$ and $e_{V_T}(f, f^*)$, which is a first-moment condition. Assumption B, on the other hand, relates the variance of the loss difference to its mean, with $\beta = 0$ corresponding to the worst case. The exponents $\alpha$ and $\beta$ in Assumptions A and B are critical in determining the speed of convergence of $e(\hat f, f^*)$, even when $e_{V_T}(\hat f, f^*)$ does not converge fast. As illustrated in Section 5.2, an arbitrarily fast rate is achievable in large margin linear classification, because $\alpha$ can be arbitrarily large. Assumption B appears to be important in discriminating several classifiers in the linear and non-linear cases.
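The covering-number side of these definitions can be made concrete for a finite class of functions evaluated on a sample. The greedy sketch below (our construction, a simplified illustration) returns an upper bound on the $\epsilon$-covering number under the empirical $L_2$ norm; its logarithm then upper-bounds the corresponding metric entropy. A minimal net would require optimization, so this is only a bound.

```python
import numpy as np

def covering_number_ub(G, eps):
    """Greedy upper bound on N(eps, G, L2(P_n)) for a finite class.

    G: m x n array; row i is function g_i evaluated at n sample points."""
    remaining = list(range(len(G)))
    centers = []
    while remaining:
        c = remaining.pop(0)                 # pick a new center
        centers.append(c)
        # discard every function within eps of it (empirical L2 norm)
        remaining = [i for i in remaining
                     if np.sqrt(np.mean((G[i] - G[c]) ** 2)) > eps]
    return len(centers)

# Three functions on two sample points; the first two are close.
G = np.array([[0.0, 0.0],
              [0.1, 0.1],
              [1.0, 1.0]])
n_cover = covering_number_ub(G, eps=0.5)     # 2 balls suffice
```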
Assumption C measures the complexity of $\mathcal{F}$. However, if $c_1$ and $c_2$ in Assumptions A and B depend on $n$, then they may enter into the rate.
Two situations are worth mentioning, depending on the richness of $\mathcal{F}$. First, when $\mathcal{F}$ is rich, $f^* = \bar f$, and margin classification depends only on the behavior of the marginal distribution of $X$ near the decision boundary. This is characterized by the values of $\alpha$ and $\beta$. For instance, in nonlinear multi-class $\psi$-learning, $\alpha = 1$ and $0 < \beta \leq 1$, cf., (16). This corresponds to the $n^{-1}$ rate in the separable case and $n^{-1/2}$ in the non-separable case, as described in (2). Second, when $\mathcal{F}$ is not rich, as in linear classification, $f^* \neq \bar f$ is typically the case, and $\alpha$ and $\beta$ depend heavily on the distribution of $(X, Y)$; see Section 5.2 for an example. As a result, the actual generalization performances of various margin classifiers are dominated by different ideal generalization performances; see Section 5.1 for an example.

Theorem 1. If Assumptions A-C hold, then, for any estimated decision function vector $\hat f$ defined in (2.1), there exists a constant $c_6 > 0$ such that $P(|e(\hat f, f^*)| \geq \delta_n^2) \leq 3.5 \exp(-c_6 n (\lambda J_0)^{2 - \min(1, \beta)})$, with $J_0 = \max(J(f_0), 1)$.
The rate $\delta_n^2$ is governed by two factors: (1) $\varepsilon_n^2$, determined by the complexity of $\mathcal{F}$, and (2) the approximation error $e_V(f^*, f_0)$ defined by $V$. When $e(f^*, f_0) \neq 0$, there is usually a trade-off between the approximation error and the complexity of $\mathcal{F}$ with respect to the choice of $f_0$; see Section 5.3.
Remark 1: The results in Theorem 1 and Corollary 1 continue to hold if the "global" entropy is replaced by its corresponding "local" version; see, e.g., (24). The proof requires only a slight modification. The local entropy allows us to avoid a loss of $\log(n)$ in linear classification, although it may not be useful for nonlinear classification.
Remark 2: For $\psi$-learning, Theorem 1 may be strengthened by replacing the entropy of $\mathcal{F}$ by the corresponding set entropy if the problem structure is used; cf., (19).
Remark 3: The preceding formulation can be easily extended to the situation of multiple regularizers by replacing $\lambda J(f)$ by its vector version, i.e., $\lambda^T J(f)$.

Linear classification: Ideal and actual performances
This section illustrates that the ideal generalization performances of various margin classifiers, defined by $f_V$, may differ and may dominate the corresponding actual ones when $e(f_V, \bar f) \neq 0$. This reinforces our discussion in Section 3.
Interestingly, the fast rate $e(\hat f, \bar f) = e(\hat f, f_V) = O_p(n^{-(\gamma+1)/2})$ arises because classification is easier than its counterpart, function estimation, as measured by $\gamma \geq 0$. This is evident from the fact that $e_V(\hat f, f_V) = O_p(n^{-1/2})$. The rate becomes arbitrarily fast as $\gamma \to \infty$.
In this example, $\mathcal{F} = \{f : f^{(m-1)} \text{ is absolutely continuous}, f^{(m)} \in L_2[0,1]\}$, with the degree of smoothness $m$ measured by the $L_2$-norm, is generated by the spline kernel $K(\cdot, \cdot)$, whose expression can be found in (10) or (23). It follows from the reproducing kernel Hilbert space (RKHS) representation theorem (cf., (10)) that minimization of (2.1) over $\mathcal{F}$ is equivalent to that over its finite-dimensional subspace. We now verify Assumptions A-C. Some useful facts are given in Lemmas 5-7.

Lemma 5. (Global Bayes rule $\bar f$) In this example, $\bar f$ satisfies the stated bound for some constant $C > 0$, any $T \geq 9$, and any measurable $f \in R^3$.
$\psi$-learning: Set $T \geq 1$, as $0 < V_\psi \leq 1$. For Assumption A, $\alpha = 1$ by Theorem 3.1 of (16). For Assumption B, $\beta = 1$, following an argument similar to that in (16). For Assumption C, let $f_0(x) = \tau(4 - 9x, 1, 9x - 5)$ when $m \geq 2$. Evidently, the approximation error $e_V(f_0, \bar f)$ and $J_0 = \max(J(f_0), 1)$ play a key role in the rates of convergence. With different choices of the approximating $f_0$ for the $\psi$-loss and the hinge loss, $\psi$-learning and the SVM have different error rates, with the $\psi$-loss yielding a faster rate when $m \geq 2$ and the same rate when $m = 1$. Moreover, in this example, the presence of a dominating class does not seem to be an issue.

Feature selection: High-dimension p but low sample size n
This section illustrates the applicability of the general theory to the high-dimension, low sample size situation. Consider feature selection in classification, where the number of candidate covariates $p$ is allowed to greatly exceed the sample size $n$ and to depend on $n$. For the $L_1$ penalty, (22) and (28) obtained rates of convergence for the binary SVM when $p < n$ and the multi-class SVM when $p > n$.
Here we apply the general theory to the elastic-net penalty (see (31)) for the binary SVM (27) to obtain a result parallel to that of (28). We use linear representations in (2.1) as in (27), because non-linear representations would be over-specified.
We now verify Assumptions A-C for the hinge loss $V$. Because $X_1$ and $Y$ are independent of $(X_2, \ldots, X_p)$, one can verify that the minimum of $EV(f, Z)$ equals that of $EV(f_1(X_1), Y)$ over $\{f_1 : f_1(x) = a x_1 + b\}$, attained by $f_1^* = a^* x_1$ for some $a^* > 0$. For Assumptions A-B, we apply the result in Example 5.1 to obtain $\alpha = 1/2$ and $\beta = 1$.
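A sketch of the setting above: a binary SVM with an elastic-net penalty, fit by subgradient descent on simulated data in which only the first coordinate carries the signal. This is our illustrative implementation under assumed data and tuning values, not the estimator analyzed in the text; the penalty is $\lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$ and labels are coded $\pm 1$.

```python
import numpy as np

def elastic_net_svm(X, y, lam1=0.01, lam2=0.01, lr=0.1, epochs=200):
    """Hinge loss + elastic-net penalty, minimized by subgradient descent."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                     # margin violators
        grad_w = (-(y[active, None] * X[active]).sum(axis=0) / n
                  + lam1 * np.sign(w) + 2.0 * lam2 * w)
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Only the first coordinate carries the signal, as in the example above.
X = np.array([[ 1.0,  0.2], [ 2.0, -0.1],
              [-1.0,  0.1], [-2.0, -0.2]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = elastic_net_svm(X, y)
preds = np.sign(X @ w + b)       # separates the training data
```

The $L_1$ part of the penalty shrinks the weight on the noise coordinate toward zero, which is the feature-selection effect discussed in the text.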

Conclusion
This article develops a statistical learning theory quantifying the generalization error of large margin classifiers in multi-class classification. In particular, the theory develops upper bounds for a general large margin classifier, which permits a theoretical treatment of the high-dimension, low sample size situation. Through the theory, several learning examples are studied, and the generalization errors of several large margin classifiers are established. In a linear case, fast rates of convergence are obtained; in a case of sparse learning, rates are derived for feature selection in which the number of variables greatly exceeds the sample size.
To compare various large margin classifiers with regard to generalization, we may need to develop a lower bound theory. Without one, a comparison may be inconclusive, although our learning theory provides an upper bound result.
Proof of Lemma 1: To prove $EL(f_{V_\psi}, Z) = EL(f_L, Z)$, note that it follows from the definition of $f_L$ that $EL(f_V, Z) \geq EL(f_L, Z)$. Then for any $\varepsilon > 0$ there exists $f_0 \in \mathcal{F}$ such that $EL(f_0, Z) \leq EL(f_L, Z) + \varepsilon$. It follows from the linearity of $\mathcal{F}$ that $cf_0 \in \mathcal{F}$ for any constant $c > 0$. The result then follows from the fact that $\lim_{c \to \infty} EV_\psi(cf_0, Z) = EL(f_0, Z)$.
The treatment here is to use a large deviation inequality in Theorem 3 of (20) for the bracketing entropy, and Lemma 9 below for the uniform entropy. Our approach to bounding $P(|e(\hat f, f^*)| \geq \delta_n^2)$ is to bound a sequence of empirical processes induced by the cost function $l$ over $P(A_{ij})$; $i, j = 1, \cdots, n$. Specifically, we apply a large deviation inequality for empirical processes, controlling the mean and variance defined by $V(f, Z_i)$ and the penalty $\lambda$. This yields an inequality for empirical processes and thus for $e(\hat f, f^*)$. In what follows, we shall prove the case of the bracketing entropy, as that for the uniform entropy is essentially the same.
To prove the result with the uniform entropy, we use Lemma 9 with a slight modification of the proof. Finally, this implies that $I^{1/2} \leq (5/2 + I^{1/2}) \exp(-c_6 n (\lambda J_0)^{2 - \min(1, \beta)})$. The result then follows from the fact that $I \leq I^{1/2} \leq 1$. Now we derive Lemma 9 as a version of Theorem 1 of (20) using the uniform entropy.

Lemma 9. Suppose the stated conditions hold and, if $s \leq v^{1/2}$, then the conclusion holds.
Proof: The proof uses conditioning and chaining. The first step is conditioning. Let $Z_1, \ldots, Z_N$ be an i.i.d. sample from $P$, and let $(R_1, \ldots, R_N)$ be uniformly distributed over the set of permutations of $(1, \ldots, N)$, where $N = mn$ with $m = 2$. Define $n' = N - n$, $\hat{P}_{n,N} = n^{-1} \sum_{i=1}^{n} \delta_{Z_{R_i}}$, and $P_N = N^{-1} \sum_{i=1}^{N} \delta_{Z_i}$, with $\delta_{Z_i}$ the Dirac measure at observation $Z_i$. Then the inequality (A.4) can be thought of as an alternative to the classical symmetrization inequality (cf., (25), Lemma 2.14.18 with $a = 2^{-1}$ and $m = 2$). Conditioning on $Z_1, \ldots, Z_N$, it suffices to consider $P^*_{|N}(\sup_{\mathcal{F}} |\hat{P}_{n,N} h - P_N h| > M)$, where $P_{|N}$ is the conditional distribution given $Z_1, \ldots, Z_N$.
The second step is to bound $P^*_{|N}(\sup_{\mathcal{F}} |\hat{P}_{n,N} h - P_N h| > M)$ by chaining. Let $\varepsilon_0 > \varepsilon_1 > \ldots > \varepsilon_T > 0$ be a sequence to be specified. Denote by $\mathcal{F}_q$ the minimal $\varepsilon_q$-net for $\mathcal{F}$ with respect to the $L_2(P_N)$-norm. For each $h$, let $\pi_q h = \arg\min_{g \in \mathcal{F}_q} \|g - h\|_{P_N, 2}$. Evidently, $\|\pi_q h - h\|_{P_N, 2} \leq \varepsilon_q$, and $|\mathcal{F}_q| = N(\varepsilon_q, \mathcal{F}, L_2(P_N))$, the covering number. Then $P^*_{|N}(\sup_{\mathcal{F}} |\hat{P}_{n,N} h - P_N h| > M)$ is bounded by a sum of terms $P_1$-$P_3$ over $q = 1, \ldots, T$, where $T = \min\{q : \varepsilon_q \leq s\}$. Note that $\varepsilon_0 \leq v^{1/2}$ by construction. Furthermore, by (A.3) and Lemma 3.1 of (1), the increments can be controlled. We now proceed to bound $P_1$-$P_3$ separately.