General oracle inequalities for model selection

Abstract: Model selection is often performed by empirical risk minimization. The quality of selection in a given situation can be assessed by risk bounds, which require assumptions both on the margin and the tails of the losses used. Starting with examples from the three basic estimation problems, regression, classification and density estimation, we formulate risk bounds for empirical risk minimization and prove them at a very general level, for general margin and power tail behavior of the excess losses. We then apply these bounds to typical examples.


Introduction
Consider a sample Z_1, …, Z_N of independent random variables in some space Z, whose distribution depends on an unknown parameter f. To estimate f, we split the sample into two parts: a test set Z_1, …, Z_n and a training set Z_{n+1}, …, Z_N. Based on the training set, various estimators of f are constructed, say f̂_1, …, f̂_p. To decide among these estimators, we use the test set. Suppose that γ_f : Z → R is a loss function. The final estimate f̂ is now chosen to minimize the empirical risk over the test set, i.e. f̂ := arg min_{f̂_j : 1 ≤ j ≤ p} (1/n) Σ_{i=1}^n γ_{f̂_j}(Z_i). In this paper, we examine whether this empirical risk minimization leads to taking, among the p estimators, the "nearly best" one. Here, "nearly best" will be defined in terms of the excess risk of the estimators.
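As a concrete illustration, the sample-splitting scheme above can be sketched in code. This is a minimal toy example of our own (the target `f0`, the polynomial candidates and all constants are made up, not from the paper); squared loss plays the role of γ_f.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression target (illustrative assumption, not from the paper).
def f0(x):
    return np.sin(2 * np.pi * x)

N = 1000
x = rng.uniform(0.0, 1.0, N)
y = f0(x) + 0.3 * rng.standard_normal(N)

# Split the sample: first n points form the test set, the rest the training set.
n = N // 2
x_test, y_test = x[:n], y[:n]
x_train, y_train = x[n:], y[n:]

# "Training": construct p candidate estimators \hat f_1, ..., \hat f_p
# (here: polynomial fits of increasing degree) on the training set.
degrees = [1, 3, 5, 9]
candidates = [np.polynomial.Polynomial.fit(x_train, y_train, d) for d in degrees]

# Empirical risk minimization on the test set:
# \hat f := argmin_j (1/n) sum_i gamma_{\hat f_j}(Z_i), squared loss.
emp_risks = [float(np.mean((y_test - fj(x_test)) ** 2)) for fj in candidates]
j_hat = int(np.argmin(emp_risks))
print("selected degree:", degrees[j_hat])
```

The selected candidate is exactly the minimizer of the empirical risk on the test half; whether it is also "nearly best" in excess risk is the question the paper addresses.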
The behavior of the excess risk near f will be called the margin behavior. We not only consider the classical case, which is quadratic margin behavior, but also more general margin behavior. For the tails of our excess loss functions, we consider both an exponential moment condition and a more general power tail condition. We prove a risk inequality under the most general combination of these conditions, and in doing so automatically obtain risk inequalities for more restricted situations. These latter situations represent examples we give from regression, classification and density estimation.
A common and succinct way of expressing the quality of an aggregated estimator is by way of an oracle inequality of the form E R(γ̂) ≤ A · inf_{γ∈Γ} R(γ) + C(Γ, n).
Here R(γ) := E_Z γ(Z) is the risk of the procedure that has loss γ, and C(Γ, n) is a quantity that depends on the cardinality (when finite) or complexity (such as the metric entropy) of the class Γ of models or aggregates up for selection, as well as on the sample size n.
When the number of procedures being aggregated is a finite number p := |Γ|, most of the results in the literature set O(log(p)/n) to be the benchmark for the rate of the term C(Γ, n) above. For instance, Bunea et al. [8] give this rate for Gaussian regression and a linear aggregate that minimizes a penalized sum of squares. For a more general risk problem, Györfi and Wegkamp [11] obtain a similar result, and Lecué [15] achieves the same rate for the Cumulative Aggregation with Exponential Weights (CAEW) procedure in a classification setup with bounded loss. Other types of results in this vein include Bartlett and Mendelson's [5] high probability bounds for the estimator risk of empirical risk minimization, done for the estimation of functions from a class with a uniform bound.
The analysis of empirical risk minimization stands on two major pillars. The first of these is empirical process theory. In Vapnik and Chervonenkis' seminal work on pattern recognition [25], the importance of the empirical process ((P_n − P)(f))_{f∈F} of the class F of candidate procedures for the study of empirical risk minimizers was already recognized. More recently, van de Geer [22] also describes the use of empirical processes in understanding such estimators. The second foundation we need is the study of concentration inequalities, which describe the concentration of random variables and their empirical means around their true means. The value of such inequalities in the analysis of model selection via empirical risk minimization is recognised, and put to use, in the papers of Barron et al. [4] and Birgé and Massart [6].
In much of the literature, the quantities to be estimated are assumed to be uniformly bounded. Another very important condition for ensuring good rates in oracle inequalities is the margin condition, which controls the "noise" between procedures that differ only very slightly in risk, and thus makes assumptions on the small-scale behaviour of the family of losses. For some regression setups, a uniform bound on the target and the estimates already dispenses with the need for a margin condition, as in the results of Bunea et al. [8]. (We shall see in Example 3.1 that such a uniform bound implies the margin condition when using L_2-loss.) In classification, though, which is the original area for margin conditions, the situation is somewhat more complex. Here the margin conditions that hold are generally weaker than the ones known in regression or density estimation setups. Tsybakov [21] provides a good treatment of this case.
Koltchinskii [14] looks at a wider range of situations, generalizing Tsybakov's results, among others; besides a margin condition, his approach also requires direct conditions on the empirical process or on the complexity of the candidate class Γ in lieu of boundedness conditions. In this paper, we shall define the margin condition in Section 3 and there examine it more closely.
Generally, most of the literature deals with only one particular problem, such as regression; furthermore, the strong boundedness conditions usually imposed are not always necessary. It is well known that some conditions must be imposed in order to obtain risk rates that are better than O(1/√n). For example, Lee et al. [17] give an overview of risk rates in an agnostic learning setup and show that convexity properties of the class of candidate functions lead to risk rates around O(1/n) rather than O(1/√n). Mendelson [19] uses a least-squares regression example to show that O(1/√n) cannot be improved upon without assuming something like a Bernstein-type inequality. (While convexity assumptions can suffice for obtaining fast risk rates, they are not always necessary, as also shown by Mendelson [18].) Our interest lies in inequalities for a general loss function setup, with boundedness conditions replaced by suitably loose requirements on the tails, at least when conditioning on the training set. Such conditioning on the training set is common practice; to average the results over the training data then requires margin and power tail conditions to hold uniformly over all trained versions of the estimators used, if possible, or otherwise other, possibly more stringent conditions.
Another fairly general approach is taken by Audibert [1], who looks at the general prediction problem, i.e. regression and classification, and uses a progressive mixture rule for aggregation, but with only a brief reference to averaging over the training stage, which would be part of the full sample splitting problem. On the other hand, Rigollet [20] examines sample splitting schemes with multiple splits and thus comes close to cross validation, but does so only for the problem of density estimation. A direct treatment of a cross validation scheme is to be found in van der Vaart et al. [24]. And in the context of classification, recent inequalities are given for recursive aggregation by mirror descent by Juditsky et al. [13] and for aggregation with exponential weights by Lecué [15].

Notation
The results will be conditional on the training set. We use P to denote the distribution of the test sample, and E denotes the expectation of random variables depending on the test sample.
We will write f* for the corresponding parameter value (or an arbitrary choice thereof, if it is not unique) at which this minimum is attained, i.e. for which γ* = γ_{f*}. With f_0 denoting the overall risk minimizer and γ_0 := γ_{f_0}, we define the excess risks E_j := P(γ_j − γ_0), j = 1, …, p, the excess risk Ê := P(γ̂ − γ_0) of the selected estimator (which is a random variable, as it depends on the test sample), and E* := P(γ* − γ_0) = min_{1≤j≤p} E_j. Without loss of generality, we assume that Γ is of the form Γ := {γ_f : f ∈ F}, where F is a subset of a semi-metric space with semi-metric d, and write (with some abuse of notation) γ_{f̂_j} as γ_j, {f̂_j}_{j=1}^p ⊂ F.

Goal
Our goal is now to show that Ê is close to E* (with large probability or in expectation). The results are modifications of inequalities of the form E Ê ≤ (1 + δ) · E* + ∆, where δ > 0 is an arbitrary small constant, and with ∆ of order log(p)/n and not depending on E* (see for example Chapter 7 in Györfi et al. [10]). In the standard setup of Section 4 and under a quadratic "margin condition", for instance, we show that for 1 ≤ m ≤ 1 + log p, (E Ê^{m/2})^{1/m} ≤ E*^{1/2} + c √∆, with ∆ of order log(2p)/n, c a constant, and neither depending on E*. In particular, with m = 2, this reads E Ê ≤ (E*^{1/2} + c √∆)². This gives rise to a non-sharp oracle inequality E Ê ≤ (1 + δ) · E* + (1 + 1/δ) · c² ∆. A sharp (δ = 0) and rate-optimal (correction term O(∆)) oracle inequality cannot be established in a general setup by empirical risk minimization (cf. Lecué [15]). Instead, methods such as mirror averaging could be used, as by Juditsky et al. [12]. See also Audibert ([2] and [3]) for some limitations of empirical risk minimization, and alternative approaches to overcome them. We believe, however, that empirical risk minimization remains an important topic of study, because it is widely applied in practice and is closely related to various cross validation schemes.
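The passage from the quadratic-margin moment bound with m = 2 to a non-sharp oracle inequality is elementary. Writing c for the constant multiplying √∆ (a placeholder for the constant appearing in the results of Section 4), one has, in the notation above:

```latex
\mathbb{E}\hat{\mathcal E}
 \;\le\; \bigl(\mathcal E_*^{1/2}+c\sqrt{\Delta}\bigr)^2
 \;=\; \mathcal E_* + 2c\sqrt{\mathcal E_*\,\Delta} + c^2\Delta
 \;\le\; (1+\delta)\,\mathcal E_* + (1+1/\delta)\,c^2\Delta ,
```

using 2ab ≤ δa² + b²/δ with a = E*^{1/2} and b = c√∆, for any δ > 0.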

Convex loss
In our proofs, we only use the property P_n γ̂ ≤ P_n γ*.
This means that we can replace γ̂ by γ_{αf̂+(1−α)f*} throughout, for 0 < α ≤ 1, leading to inequalities for the excess risk of αf̂ + (1−α)f*. Indeed, if f ↦ γ_f is convex, then P_n γ_{αf̂+(1−α)f*} ≤ α P_n γ̂ + (1−α) P_n γ* ≤ P_n γ*. From these, one may then often deduce inequalities for the original d(f̂, f_0). As we shall see, this extension (with α < 1) allows us to work with weaker conditions than with α = 1. In particular, the example on maximum likelihood will take a similar approach with α set to 1/2.

Organization of the paper
The paper is organized as follows. Section 2 presents Bernstein's inequality. It is stated in the form of a probability inequality and a moment inequality. Section 3 presents the margin condition and some examples where it holds. Section 4 gives the main results: one for exponential moments combined with a very general margin condition, and one for power tails combined with a particular form of the margin condition. Subsequently, Section 5 applies the main results to the examples already given. Finally, the proofs are in Section 6.

Bernstein's inequality
Bernstein's inequality for a single average is well known, and the extension of Bernstein's probability inequality to a uniform probability inequality over p averages is completely straightforward. The result can be seen as the simplest version of a concentration inequality in the spirit e.g. of Bousquet [7] (emphasizing how tight these general concentration inequalities are). The moment inequality for the maximum of p averages is perhaps less known. For all j, we let γ^c_j(·) := γ_j(·) − P γ_j denote the centered loss functions. To obtain our results, we make assumptions on the tails of the centered excess losses γ^c_j − γ^c_* or of their envelope Γ := max_{1≤j≤p} |γ^c_j − γ^c_*|, as follows.

Definition 2.1. We say that the excess losses γ_j − γ* satisfy the exponential moment condition for some constant K > 0 and semi-metric d if

P|γ^c_j − γ^c_*|^m ≤ (m!/2) K^{m−2} d²(f_j, f*)   (1)

for all m = 2, 3, … and for all j = 1, …, p.
We say that the envelope function Γ has power tails of order s > 1 if there exists an M ∈ (0, ∞) such that

P(Γ > t) ≤ (M/t)^s for all t > 0.   (2)

Here d(·, ·) is a semi-metric on the underlying parameter space that allows for different weighting of the procedures under consideration. As an important example of this, define, for all γ, the variance σ²(γ) := P(γ^c)² = P γ² − (P γ)². Then clearly (1) (with m = 2) implies that

σ²(γ_j − γ*) ≤ d²(f_j, f*).   (3)

Moreover, if the centered excess losses are bounded, |γ^c_j − γ^c_*| ≤ 3K for all j, then (1) holds with d(f_j, f*) := σ(γ_j − γ*), since m!/2 ≥ 3^{m−2} for all m ≥ 2. In the following sections, we will indeed often assume (1) with this value for d(f_j, f*), but we will also consider an extension. The choice of the semi-metric d is intertwined with the margin behavior, which we consider in the next section. Furthermore, when applying the margin condition, we shall implicitly use the inequality (3). As we will make repeated use of Bernstein's inequality, and the term 2 log(2p)/n will appear frequently, we will henceforth denote this term by ∆ := 2 log(2p)/n.
Using this notation, the version of Bernstein's inequality that we will need in this paper is the following (Bernstein's inequality for the maximum of p averages, weighted version). Assume that for some constant K, the exponential moment condition (1) holds. Then a uniform probability inequality holds for all t > 0 and τ > 0, and moreover a moment inequality holds for all 1 ≤ m ≤ 1 + log p. Remark: The moment inequality is for moments of order m ≤ 1 + log p. It can be extended to hold for general m, provided a slight adjustment, depending on m, is made to the constants. Because we have in mind the situation where p is large, we have formulated the result for m ≤ 1 + log p to facilitate the exposition.
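A small Monte Carlo experiment (our own illustration, not part of the paper) shows the scale √∆ at work: for p bounded, centered averages, the maximum of the p empirical averages is of order √∆ with ∆ = 2 log(2p)/n.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 2000, 50
delta = 2 * np.log(2 * p) / n  # Delta := 2 log(2p) / n

reps = 200
max_avgs = np.empty(reps)
for r in range(reps):
    # p averages of n i.i.d. Rademacher variables (bounded by 1, mean 0).
    z = rng.choice([-1.0, 1.0], size=(p, n))
    max_avgs[r] = np.abs(z.mean(axis=1)).max()

# The observed maxima should be of order sqrt(Delta); the constant in
# front is problem-dependent, so only the order of magnitude is checked.
print(max_avgs.mean(), np.sqrt(delta))
```

The average observed maximum lands within a small constant factor of √∆, consistent with the uniform Bernstein bound.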

Margin behavior
Definition 3.1. We say that the margin condition holds with a strictly convex, increasing function G (with G(0) = 0) if

P(γ_f − γ_0) ≥ G(d(f, f_0)) for all f ∈ F.   (4)

Furthermore, we say that the margin condition holds with constants κ > 1/2 and C > 0, if (4) holds with G(u) = u^{2κ}/C^{2κ}, u > 0. This specific case of G is the one most typically used in the literature, with the semi-metric d taken to be the standard deviation of the excess loss γ_j − γ_0. Such a margin condition can be found e.g. in Chesneau and Lecué [9] for regression and density estimation setups, with a comparable result as here for regression, and a result for a different example (squared loss) given for density estimation. Tsybakov [21] gives a similar margin condition for classification; in that paper, the use of 0-1 loss means that the excess losses are bounded, so that their variance can serve as d². The concept of a Bernstein class, as used by Bartlett and Mendelson [5], is the same thing after a suitable reparametrization. As we shall see, κ = 1 in typical cases, but other, in particular larger, values can also occur.
Let us now consider some examples. In a regression or classification situation, the quality of a prediction can be measured by applying a loss function γ : Y × Y → R to the true and the estimated response.
Let F be a class of real-valued functions on X, and for all x ∈ X and y ∈ Y, let l(a, x) := E(γ(a, Y) | X = x), a ∈ R. We moreover write l_f(x) := l(f(x), x). As our target we take the overall minimizer f_0(·) := arg min_{a∈R} l(a, ·).
We now check whether the margin condition holds with κ = 1 and a semi-metric d proportional to an L_2-norm, with proportionality constant K_2, where K_2 is an appropriate constant. Then, for all ‖f − f_0‖_∞ ≤ K_1, we have the required quadratic lower bound. If l(a, ·) has two derivatives near a = f_0(·), and the second derivatives are positive and bounded away from zero, then l(a, ·) behaves quadratically near its minimum, i.e., (5) holds for some K_1 > 0.
It is also clear that (6) holds as soon as γ(·, y) is Lipschitz for all y, with Lipschitz constant L; then we may take K_2 = L. When γ(·, y) is not Lipschitz (e.g., quadratic loss), it may be useful to work with a modified quantity, which leads to the bound (7). Note that with fixed design, the second term in (7) vanishes.
Quadratic loss: In the case of least squares, the loss function is γ(a, y) := (y − a)². Assuming that the conditional variance is bounded by some constant σ²_ε, i.e., sup_{x∈X} Var(Y | X = x) ≤ σ²_ε, we may conclude the following.
Least squares with fixed design: The margin condition holds with κ = 1 and C² = 4σ²_ε. Least squares with random design: If ‖f̂_j − f_0‖_∞ ≤ K_1 for all j, the margin condition holds with κ = 1 and a constant C depending on σ_ε and K_1. Classification: The target is again the overall minimizer, and it is clear that f_0 is then the Bayes rule.
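The fixed-design least-squares computation can be checked numerically. In this sketch (our own, with an assumed constant shift f = f_0 + c), the excess loss is (y − f)² − (y − f_0)² = c² − 2εc, whose mean is c² and whose variance is 4σ²_ε c²; hence excess risk = variance/C² with C² = 4σ²_ε, i.e. κ = 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameters (assumptions, not from the paper).
sigma, c, nsim = 0.5, 0.7, 200_000

# With y = f0 + eps and f = f0 + c, the excess loss is c^2 - 2*eps*c.
eps = sigma * rng.standard_normal(nsim)
excess_loss = c**2 - 2 * eps * c

mean_hat = excess_loss.mean()   # should be close to c^2 (the excess risk)
var_hat = excess_loss.var()     # should be close to 4 * sigma^2 * c^2
print(mean_hat, c**2)
print(var_hat, 4 * sigma**2 * c**2)
```

So the excess risk equals the variance of the excess loss divided by 4σ²_ε, which is exactly the quadratic margin condition with C² = 4σ²_ε.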

Consider the function H_1 and its convex conjugate G_1 (assuming the maximum in the definition of the conjugate exists).

Lemma 3.2. The inequality P(γ_f − γ_0) ≥ G_1(P|f − f_0|) holds. If H_1(v) = 0 for v ≤ C_1, we take G_1(u) = C_1 u. More generally, the Tsybakov margin condition (see [21]) assumes that one may take, for some C_1 ≥ 1 and λ ≥ 0 (Tsybakov himself writes γ for this parameter), G_1(u) = u^{1+λ}/C_1^λ. Thus, with d²(f, f_0) := P|f − f_0|, the margin condition then holds with this value of C and with κ = 1 + λ.

The squared Hellinger distance of densities f and f̃ is h²(f, f̃) := (1/2)∫(√f − √f̃)² dµ. We now check the margin and power tail conditions for a distance measure d(f, f_0) which is a multiple of h(f, f_0).

Lemma 3.3. For all densities f, the excess risk, i.e. the Kullback-Leibler divergence to the true density, is bounded from below by the squared Hellinger distance h²(f, f_0). Moreover, under a suitable boundedness assumption on the density ratios, the centered excess losses satisfy a Bernstein-type moment bound.

This lemma contains the exponential moment condition (1) for K = 1, and also allows us to deduce the margin condition with margin constants κ = 1 and C = 1.
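The squared Hellinger distance is easy to evaluate numerically. The following check is our own illustration with two normal densities, for which the closed form h² = 1 − exp(−(µ₁ − µ₂)²/(8σ²)) is a standard fact (not derived in the paper):

```python
import numpy as np

# h^2(f, g) = (1/2) * integral (sqrt(f) - sqrt(g))^2 dmu, computed on a grid.
def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

mu1, mu2, sigma = 0.0, 1.0, 1.0
x = np.linspace(-20.0, 21.0, 400_001)
dx = x[1] - x[0]
f = normal_pdf(x, mu1, sigma)
g = normal_pdf(x, mu2, sigma)

h2_numeric = 0.5 * np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx
# Closed form for equal-variance normal densities.
h2_closed = 1.0 - np.exp(-((mu1 - mu2) ** 2) / (8 * sigma**2))
print(h2_numeric, h2_closed)
```

The numeric integral agrees with the closed form to high accuracy, confirming the normalization (the factor 1/2) in the definition above.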

Main results
If we assume exponential tails on the loss functions, we are able to obtain a result for a wide range of margin conditions. Lemma 4.1. Assume that for some m ≤ 1 + log p, the function H(v^{1/m}), v > 0, is concave. Assume moreover that the exponential moment condition (1) holds for some K > 0 and for d(f_j, f*) := G^{−1}(E_j) + G^{−1}(E*). Then for all 0 < δ < 1 and ε > 0, the corresponding risk bound holds. The next theorem focuses on the common family of margin functions G(u) = u^{2κ}/C^{2κ}, u > 0, κ ≥ 1, but also relaxes the exponential tail condition to a power tail condition. Note that for this family of margin functions, the corresponding convex conjugates H(v) are of order O(v^{2κ/(2κ−1)}), and thus Lemma 4.1 gives an oracle inequality with correction term rate O(∆^{κ/(2κ−1)}), which agrees with the rates found in the literature and in the next theorem. Theorem 4.1. (i) Suppose that the margin condition holds for the loss functions γ_j with constants κ ≥ 1 and C > 0 and some d satisfying d(f_j, f_0) ≥ σ(γ_j − γ_0) for all j. Also assume that the envelope Γ has power tails in the form of (2), of order s > 1 and for some M > 0. Then for all m in the interval [2κ, min(2sκ, 1 + log p)[ and for all τ > 0, we have the following inequality:

(ii) Furthermore, if the excess losses satisfy the exponential moment condition (1) for some constant K > 0, then the analogous moment bound holds for all m in the interval [2κ, 1 + log p[. In this case we also have tail bounds for all t > 0.
These statements lead to simpler ones if we use that τ ≤ E* ∨ τ ≤ E* + τ and then optimize over τ, trading off the summand with positive exponent 1/(2κ) and the one with negative exponent −1/(2κ) · αβ/(α + β). This yields the main result of this paper, Corollary 4.1: a moment bound with a constant ξ(κ, s, m) depending only on κ, s and m when the loss envelope Γ has power tails (2), and an analogous bound when the excess losses satisfy the exponential moment condition (1). In the latter case we also have the tail bound

P(Ê^{1/(2κ)} ≥ E*^{1/(2κ)} + A(κ) · C^α · (∆ + 2t/n)^{α/2} + A(κ)^{2κ−1} · K · (∆ + 2t/n)) ≤ e^{−t},

and for any δ > 0 the general inequality (a + b)^{2κ} ≤ (1 + δ)^{2κ−1} · a^{2κ} + (1 + 1/δ)^{2κ−1} · b^{2κ} (for a, b ≥ 0 and δ > 0) then yields the oracle inequality. Corollary 4.1 naturally also leads to statements about risk ratios: under the exponential moment condition, for example, a bound on the ratio Ê/E* follows whenever E* is at least of the order of the correction term. The results of Corollary 4.1 constitute a generalization of other, similar, results to be found in the literature. For instance, the rate O(∆^{κ/(2κ−1)}) we obtain for exponential tails and a margin condition of order κ ≥ 1 is similar to that described by Lecué [16] for classification using Tsybakov's margin condition; the only difference is that there the rate also depends on that of the oracle, i.e. the rate at which E* tends to zero as ∆ does. For bounded losses, Chesneau and Lecué [9] give a general oracle inequality that they subsequently apply to examples of density estimation and bounded regression. Their most general oracle inequality also has the rate O(∆^{κ/(2κ−1)}) when the oracle rate is not too large.
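The order of the convex conjugate of G(u) = u^{2κ}/C^{2κ}, used for the rate O(∆^{κ/(2κ−1)}), can be verified by a routine computation (included here for convenience; u_* denotes the maximizer):

```latex
\begin{aligned}
H(v) &= \sup_{u>0}\Bigl(uv-\frac{u^{2\kappa}}{C^{2\kappa}}\Bigr)
      = u_*v-\frac{u_*^{2\kappa}}{C^{2\kappa}},
\qquad
u_* = \Bigl(\frac{C^{2\kappa}v}{2\kappa}\Bigr)^{\frac{1}{2\kappa-1}},\\
H(v) &= (2\kappa-1)\,(2\kappa)^{-\frac{2\kappa}{2\kappa-1}}\,
        C^{\frac{2\kappa}{2\kappa-1}}\,v^{\frac{2\kappa}{2\kappa-1}}
      = O\bigl(v^{\frac{2\kappa}{2\kappa-1}}\bigr).
\end{aligned}
```

Evaluated at v of order √∆, this gives the correction term O(∆^{κ/(2κ−1)}); for κ = 1 the formula reduces to H(v) = C²v²/4.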

Application to examples
We can apply Corollary 4.1 to the (more restricted) cases described in the previous sections:

Quadratic margin, exponential tails
The quadratic margin condition corresponds to taking κ = 1. Taking the second part of Corollary 4.1 for this value of κ yields the oracle inequality for all δ > 0, when the losses satisfy the exponential moment condition.
Example 5.1. (Density estimation)

In Lemma 3.3, we have already shown the margin and exponential moment conditions for the transformed parameters f̃_j and the scaled Hellinger distance d(f, f′) := L·h(f, f′). The parameters there are C = K = κ = 1, and thus we obtain the oracle inequality (11). This involves the density (f̂ + f*)/2, which is not an estimator. We can however use this oracle inequality to deduce a risk inequality for the estimator f̂ using the following lemma about the Hellinger distance. Lemma 5.1. Let f, f′ and f_0 be densities with respect to the measure µ. Then h(f, f_0) can be bounded in terms of h((f + f′)/2, f_0) and h(f′, f_0). By the first part of Lemma 3.3, we have K̂ ≥ h²(f̂, f_0) and K* ≥ h²(f*, f_0). Combining this with the oracle inequality (11) and with Lemma 5.1, we obtain the desired risk inequality. We cannot expect to obtain an oracle inequality involving E*, however, as there is no general bound of the Kullback-Leibler divergence of densities by their Hellinger distance.

Example 5.2. (Regression)
Upper bounds: In Example 3.1, we saw that least-squares regression satisfies a quadratic margin condition, i.e. one with κ = 1. For instance, we have the margin parameter C := 2σ_ε in the fixed-design case. If furthermore we assume that the errors ε_i possess some finite moment of order 2s > 2 (a less restrictive assumption than the Gaussianity often required), then the loss has power tails of order s > 1, as follows by Chebyshev's inequality. Thus the oracle inequality (12) holds here.

Lower bounds:
Consider the fixed-design case with double Pareto tails of order s > 2, i.e. the distribution of the ε_i is symmetric around 0 and has a Pareto tail of order s. Fix some p ∈ N, p ≥ 2, and define f_p := f_0 ≡ 0, f_j(x) := 1{x = X_j} · n^{1/(2s)}, x ∈ X, j = 1, …, p − 1.
Lemma 5.2. The margin condition holds with κ = 1 and C² = 8/((s − 2)(s − 1)), and when p ≥ √n + 1, the power tail condition (2) holds with M = 2. For n ≥ 2^{2s} and all p ≤ n, moreover, we have a corresponding lower bound for Ê. Remark: We can easily extend the lower bound result to p > n, because we can add, as candidates, any number of bounded functions f_j, say with ‖f_j‖_∞ ≤ 1, without necessitating an increase in the scale parameter M of the power tail condition. These additional functions may be selected by the least squares estimator, but if they all have squared norm P f_j² ≥ n^{−(s−1)/s}, selecting one of these still gives the same lower bound.
Combining this lower bound with the oracle inequality (12), we find that for n ≥ p ≥ √n + 1, the risk E Ê is of order n^{−(s−1)/s} up to a constant C′(s) that depends only on s, which shows the rate-optimality, up to a logarithmic factor, of the upper bound. If p is small compared to √n, however, things look very different: the relevant constant is then c_s := (2√(2/π) · Γ((s + 1)/2))^{1/s}.
This leads to a (non-sharp) oracle inequality whose correction term has the order p^{2/s}/n. If p ≪ √n, then p^{2/s}/n ≪ n^{−(s−1)/s}, i.e. a lower bound of order n^{−(s−1)/s} for E Ê will not hold.
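The comparison of the two orders can be illustrated with a few numbers (s = 3 and n = 10000 are arbitrary choices of ours). Since n^{−(s−1)/s} = n^{1/s}/n, the two orders cross exactly at p = √n:

```python
# Small-p correction order p^(2/s)/n versus lower-bound order n^(-(s-1)/s).
s = 3.0
n = 10_000
lower_bound_rate = n ** (-(s - 1) / s)   # = n^(1/s) / n
for p in (10, 100, 1000):                # below, at, and above sqrt(n) = 100
    small_p_rate = p ** (2 / s) / n
    print(p, small_p_rate, lower_bound_rate)
```

For p = 10 ≪ √n the term p^{2/s}/n is strictly smaller than n^{−(s−1)/s}; at p = √n = 100 the two coincide; for p = 1000 the small-p bound is no longer the smaller one.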

General margin, exponential tails
The risk bound in this case was given in Part (ii) of Corollary 4.1, whose correction term is of order O(∆^{1/(4κ−2)}). Taking m = 2κ, this leads to an oracle inequality, valid for all δ > 0.

Example 5.3. (Classification)
In Example 3.2, we saw the margin condition for κ = 1 + λ, where λ ≥ 0, as a consequence of Tsybakov's margin condition. Furthermore, the excess losses are bounded by one for all f in this example, which means that they have exponential moments (1) with K = 1. Thus we have an oracle inequality for all δ > 0 and for constants Ã_1(C, λ) and Ã_2.
Proofs

Proofs for Section 2

Proof of Bernstein's inequality for the maximum of p averages. The first step is a reduction in which τ > 0 is arbitrary. Thus it suffices to show that, under condition (1) for centered loss functions γ_j, the probability inequality (13) holds for all t > 0, and the moment inequality (14) holds for all 1 ≤ m ≤ 1 + log p. Bernstein's probability inequality gives an exponential bound for all t > 0; it follows from the intermediate result (15), which holds for all L > 2K. Inequality (13) follows immediately from (15). To prove (14), we apply Lemma 6.1 to the function g(x) := (L · log(x + 1))^m, which is increasing on [0, ∞) and concave on [e^{m−1} − 1, ∞). We then obtain, for all L > 0 and all m, the bound (16). From (16), and invoking e^{|x|} ≤ e^x + e^{−x}, we obtain the result for L > 2K, and we use the extra restriction m ≤ 1 + log p to get the desired constant.

Proof of Lemma 6.1. We apply Jensen's inequality to the term on the left, and then use the concavity on [c, ∞) to incorporate the term on the right:

E g(|X|) ≤ g(E(|X| | |X| ≥ c)) P(|X| ≥ c) + g(c) P(|X| < c) ≤ g(E|X| + c P(|X| < c)).

Proofs for Section 3
Proof of Lemma 3.1. This follows from the quadratic behavior of l(a, ·) near its minimum.

Proof of Lemma 3.2. We have, for every v, a linear lower bound with u = P|f − f_0|. Since this is true for all v, we may maximize over v to obtain the claimed inequality.

Proof of Lemma 3.3. As the excess risk is a Kullback-Leibler divergence to the true distribution, the first statement of the Lemma is just the classical lower bound by the squared Hellinger distance. For the second part, we can use Lemma 7.2 in van de Geer [22], which bounds the moments of the relevant log-likelihood ratios; the exponential moment condition then follows.

Proofs for Section 4

Preparatory lemmas
We begin with two simple results (without proofs) for ease of reference.
Proof. The second part is clear, as it involves only the omission of positive summands from the left-hand side to the right-hand side. For the first part, we write the difference as a function f̃ of z and note that f̃(0) = 0. Since f̃ is non-decreasing for 0 ≤ z ≤ 1, we know that f̃(z), and thus f(z), is non-negative on [0, 1].

Main proofs
Proof of Lemma 4.1. Define the weighted maximal empirical process Z. Then we have a basic inequality for Ê. Using Lemma 6.5, we obtain an inequality for Ê in which, for the second step, we used the elementary observation a^{2κ} + b^{2κ} ≤ (a + b)^{2κ} for a, b ≥ 0, κ > 1/2. Now we will first compute the moments of Z by an application of Bernstein's inequality. We know that d(f_j, f*) ≤ d(f_j, f_0) + d(f*, f_0), which by the margin condition is

≤ C · (P(γ_j − γ_0))^{1/(2κ)} + C · (P(γ* − γ_0))^{1/(2κ)}.

Thus for all j, d(f_j, f*) can be bounded in terms of C and E^τ_* := E* ∨ τ. We then compute the moments of (P_n + P)(Γ 1{Γ > K}) and minimize the upper bound over K ≥ 0 (using Lemma 6.3), obtaining the desired oracle inequality for the power tail case.
(ii) If we assume the exponential moment condition instead of power tails, we can dispense with the truncation step, and we obtain the same bound for the moments of Z as before, but with no term stemming from Γ 1{Γ > K}. This yields the desired risk moment inequality. The corresponding risk tail bound also comes straight from applying Bernstein's inequality (13) to Z. For the lower bound, it follows that with probability at least 1 − exp[−2^{−1}(1 + u/2)^{−s} · (p − 1)/√n], we have min_{1≤j≤p} P_n(γ_j) < P_n(γ_0) − u · n^{−(s−1)/s}.