On the uniform convergence of empirical norms and inner products, with application to causal inference

Uniform convergence of empirical norms - empirical measures of squared functions - is a topic which has received considerable attention in the literature on empirical processes. The results are relevant because empirical norms occur due to symmetrization. They also play a prominent role in statistical applications. The contraction inequality has been a main tool, but recently other approaches have been shown to lead to better results in important cases. We present an overview, including the linear (anisotropic) case, and give new results for inner products of functions. Our main application is the estimation of the parental structure in a directed acyclic graph. As an intermediate result we establish convergence of the least squares estimator when the model is wrong.


Introduction
Let X_1, ..., X_n be independent random variables with values in X and let F be a class of real-valued functions on X. For a function f : X → R, we denote its empirical measure by P_n f := Σ_{i=1}^n f(X_i)/n and its theoretical measure by Pf := Σ_{i=1}^n E f(X_i)/n (assuming it exists). Furthermore, we let ‖f‖_n² := P_n f² and ‖f‖² := Pf² (again assuming it exists). We call ‖f‖_n the empirical norm of the function f and ‖f‖ its theoretical norm. We review some results concerning the uniform (over F) convergence of ‖·‖_n to ‖·‖. As an example, we consider the case X = R^p (with p possibly large) where F is a class of additive functions f(x_1, ..., x_p) = Σ_{k=1}^p f_k(x_k) with each f_k in a given class F_0 of functions on R (Theorem 2.3). We extend the results to uniform convergence of the empirical measure of products of functions. The latter will be an important tool in statistical theory for causal inference. As an intermediate step we show convergence of the least squares estimator when the model is wrong.
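As a quick numerical sanity check (our own toy example, not from the paper), the law of large numbers already gives pointwise convergence of ‖f‖_n to ‖f‖ for a single fixed f; the results reviewed below make this uniform over F. Here we take a Uniform[0,1] design and f(x) = x, both our own choices:

```python
import numpy as np

# The empirical norm ||f||_n = sqrt(P_n f^2) of a fixed function f
# concentrates around its theoretical norm ||f|| as n grows.
# With X_i ~ Uniform[0,1] and f(x) = x, ||f||^2 = E f^2(X) = 1/3.
rng = np.random.default_rng(0)

def empirical_norm(f, X):
    """||f||_n = sqrt(P_n f^2) = sqrt(mean of f(X_i)^2)."""
    return np.sqrt(np.mean(f(X) ** 2))

f = lambda x: x
theoretical_norm = np.sqrt(1.0 / 3.0)

for n in (100, 10_000, 1_000_000):
    X = rng.uniform(0.0, 1.0, size=n)
    print(n, abs(empirical_norm(f, X) - theoretical_norm))
```

The printed gaps shrink at the usual n^{-1/2} scale; the theory below quantifies how much is lost when the supremum over a whole class F is taken.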
In Theorem 2.1 we present results from Guédon et al. [2007] and Bartlett et al. [2012], and in Theorem 2.2 we compare these with more classical approaches using, e.g., the contraction inequality. The extension to inner products is given in Theorem 3.1. The latter can be used in statistical applications where functions from different smoothness classes are estimated (for example in an additive model).
We pay special attention to the linear case, i.e. the case where F is (a subset of) a linear space. For isotropic distributions the uniform convergence of ‖·‖_n to ‖·‖ over linear functions is well developed. We refer to Adamczak et al. [2011] and, for sub-Gaussian random vectors, to Raskutti et al. [2010], Loh and Wainwright [2012] and Rudelson and Zhou [2013]. We will not require isotropic distributions but instead consider possibly anisotropic but bounded random variables, using the refined inequalities from Guédon et al. [2007] and Ahlswede and Winter [2002]. We show in Theorem 6.1 that the estimator of the order in a directed acyclic graph introduced in Section 6 is consistent under various scenarios: IP(π̂ ∉ Π_0) converges to zero, where Π_0 is the set of correct orderings. An important assumption here is an identifiability assumption: see Condition 6.1. This excludes the Gaussian linear structural equations model where X_{1,j} depends linearly on its parents. We will instead model each f_j ∈ F_j as an additive non-linear function, where each component f_{k,j} belongs to a given class F_0 of real-valued functions on R.
We consider several cases. The results can be found in Theorem 6.1. They are a consequence of the uniform convergence of empirical norms over a class of additive functions, as given in Theorem 2.3, which may be of independent interest. Let us summarize the findings here.
In the first two cases, the class F_0 is assumed to have a finite entropy integral for the supremum norm. We then derive consistency when p³ = o(n). Under additional assumptions this can be relaxed to p^{3−(1−α)²} = o(n), where 0 < α < 1 is a measure of the "smoothness" of the class F_0.
An important special case is where F_0 is a class of linear functions. Each f_{k,j} is then a linear combination of functions in a given dictionary {ψ_r}_{r=1}^N: f_{k,j}(x_k) = Σ_{r=1}^N β_{r,k,j} ψ_r(x_k).
In other words, the dependence of a variable (index j) on one of its parents (index k) is then modelled as a linear combination of certain features (index r) of this parent. We assume the dictionary to be bounded in supremum norm.
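The dictionary expansion above can be sketched numerically. In this toy snippet (our own: the cosine dictionary, the value N = 5 and the coefficients are illustrative choices, not from the paper) one parent-to-child effect is a linear combination of bounded features:

```python
import numpy as np

# One additive component f_{k,j}(x) = sum_r beta[r] * psi_r(x) for a
# dictionary {psi_r} bounded in supremum norm; cosines satisfy
# max_r ||psi_r||_inf <= 1.
N = 5  # dictionary size (illustrative)

def psi(r, x):
    """r-th dictionary function; ||psi_r||_inf <= 1."""
    return np.cos(r * np.pi * x)

def f_kj(beta, x):
    """f_{k,j}(x) = sum_r beta[r] psi_{r+1}(x)."""
    return sum(beta[r] * psi(r + 1, x) for r in range(N))

beta = np.array([0.5, -0.2, 0.1, 0.0, 0.05])
x = np.linspace(0.0, 1.0, 7)
print(f_kj(beta, x))  # the modelled effect of one parent on one child
```

Because the dictionary is bounded, ‖f_{k,j}‖_∞ ≤ Σ_r |β_{r,k,j}|, which is what makes the ℓ_∞-based bounds of Section 2 applicable.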
If F_0 is the signed convex hull of the functions {ψ_r}_{r=1}^N, we obtain consistency when p² log N log³ n = o(n). The latter situation covers, for example, the case where F_0 is a collection of functions with total variation bounded by a fixed constant.
Under certain eigenvalue conditions we find that pN² log n = o(n) also yields consistency.
Finally, if F_0 can be approximated by linear functions in a space of dimension N with bias of order N^{−1/(2α)}, then consistency follows from p^{1+4α} log n = o(n).
The paper  shows consistency for the case of fixed p (the low-dimensional case). It also has theoretical results for the high-dimensional case, but for a restricted estimator where it is assumed that X_{1,j} has only a few parents and a superset of parents(X_{1,j}) is known or can be estimated (j = 1, ..., p). This superset is then required to be small.

The paper is organized as follows. In Sections 2 and 3 we study a generic class of functions F satisfying some ‖·‖- and ‖·‖_∞-bounds. We present the uniform convergence for empirical norms in Section 2, with the main example in Subsection 2.5.
Section 3 looks at empirical inner products of functions in different "smoothness" classes. Subsection 3.2 illustrates the results by considering two classes of functions satisfying different entropy conditions. In many applications one also needs uniform convergence of inner products with a sub-Gaussian (instead of bounded) random variable. Therefore we briefly review this case as well in Subsection 3.3.
Section 4 applies the theory to a class of linear functions and Section 5 studies linear regression when the model is wrong. Section 6 contains the main application: estimation of the order in a directed acyclic graph. Section 7 concludes.
Section 8 presents the technical tools and Section 9 contains the proofs. Throughout, C_0, C_1, C_2, ... and c_0, c_1, c_2, ... are universal constants, not the same at each appearance.
Bounds for the empirical norm

Entropy and entropy integrals
For a real-valued function f on X we let its supremum norm restricted to the sample be ‖f‖_{n,∞} := max_{1≤i≤n} |f(X_i)|, and we let H(u, F, ‖·‖_{n,∞}) be the entropy of (F, ‖·‖_{n,∞}). We further define for z > 0 the entropy integral

J_∞(z, F) := C_0 ∫_0^z √(H(u, F, ‖·‖_{n,∞})) du,   (1)

where the constant C_0 is taken as in Theorem 8.3 (Dudley's theorem). We can without loss of generality assume the integral exists (replace the entropy by a continuous upper bound). The subscript ∞ here refers to the fact that we are considering ℓ_∞-norms.
We also consider uniform ℓ_2-entropies, defined as follows. Let 𝒜_n be the set of all configurations A_n of n (possibly non-distinct) points within the support of P. For A_n ∈ 𝒜_n and f a real-valued function on X we let

‖f‖²_{A_n} := Σ_{x ∈ A_n} f²(x)/n.

Note that ‖f‖_n = ‖f‖_X where X is the random sample X := {X_1, ..., X_n}. For a class F of functions on X, we let H(·, F) := sup_{A_n ∈ 𝒜_n} H(·, F, ‖·‖_{A_n}) and

𝒥_0(z, F) := C_0 ∫_0^z √(H(u, F)) du.   (2)

The calligraphic symbol 𝒥 indicates that instead of random entropies we consider the maximum entropy over all possible configurations of (at most) n points. Apart from this, and from considering ℓ_2-entropy instead of ℓ_∞-entropy, we now moreover implicitly assume that the entropy integral converges, and we use 𝒥_0 with subscript 0 to indicate this. The reason for taking 0 as the lower limit of integration is that v ↦ 𝒥_0(√v, F) is a concave function. We will see in Theorem 2.2 that this is useful in view of Jensen's inequality.
Finally, for A_n ∈ 𝒜_n and f a real-valued function on X we let

‖f‖_{A_n,∞} := max_{x ∈ A_n} |f(x)|.

Note that ‖f‖_{n,∞} = ‖f‖_{X,∞} where X is the sample X := {X_1, ..., X_n}. For a class F of functions on X we set H_∞(·, F) := sup_{A_n ∈ 𝒜_n} H(·, F, ‖·‖_{A_n,∞}). We furthermore define for z > 0

𝒥_∞(z, F) := C_0 ∫_0^z √(H_∞(u, F)) du.   (3)

By the definition (1) of J_∞ we have J_∞(z, F) ≤ 𝒥_∞(z, F). We use the calligraphic symbol 𝒥_∞ with subscript ∞ here to indicate that the maximal ℓ_∞-entropy over all possible configurations of (at most) n points is used.

Bounds using ℓ_∞-norms
The following theorem follows from Guédon et al. [2007]. Recall the definition (1) of J_∞.

Theorem 2.1 Suppose that sup_{f∈F} ‖f‖ ≤ R and sup_{f∈F} ‖f‖_∞ ≤ K. Then

E sup_{f∈F} |‖f‖_n² − ‖f‖²| ≤ C_1 [R J_∞(K, F)/√n + J_∞²(K, F)/n].

Moreover, for all t > 0, with probability at least 1 − exp[−t],

sup_{f∈F} |‖f‖_n² − ‖f‖²| ≤ C_1 [R J_∞(K, F)/√n + J_∞²(K, F)/n + RK√(t/n) + K² t/n],

where the constant C_1 is as in Theorem 8.4 (a deviation inequality). As a by-product of the proof, we find √(E R̂²) ≤ R + 2J_∞(K, F)/√n, where R̂ := sup_{f∈F} ‖f‖_n. Actually, in Guédon et al. [2007] the entropy-integral quantity J_∞ is replaced by a more general quantity coming from generic chaining.

Bounds using ℓ_2-norms
In Theorem 2.2 below, we reverse the roles of R and K as compared to Theorem 2.1. The result is well known; it follows from the contraction inequality (Ledoux and Talagrand [1991]) or from a direct argument. See also Giné and Koltchinskii [2006]. Recall the definition (2) of 𝒥_0.
Theorem 2.2 Let, for z > 0, G^{-1}(z²) := 𝒥_0(z, F), and let H be the convex conjugate of G. Assume that R² ≥ H(4K/√n). Then

E sup_{f∈F} |‖f‖_n² − ‖f‖²| ≤ C_1 K 𝒥_0(R, F)/√n,

where the constant C_1 is as in Theorem 8.4. As a by-product, the proof also yields a bound on E sup_{f∈F} ‖f‖_n².
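To make the roles of G and its convex conjugate H concrete, here is a worked instance (our own example, assuming a polynomial entropy integral, as for the Sobolev-type classes of Subsection 2.5):

```latex
\mathcal{J}_0(z, \mathcal{F}) = A z^{1-\alpha}
\quad\Longrightarrow\quad
G(w) = (w/A)^{2/(1-\alpha)} ,
\qquad
H(u) = \sup_{w>0}\{ uw - G(w) \} \asymp A^{2/(1+\alpha)}\, u^{2/(1+\alpha)} .
```

The condition R² ≥ H(4K/√n) then reads, up to constants, R^{1+α} ≥ AK/√n, which is the same as requiring K 𝒥_0(R, F)/√n ≲ R², i.e. that the bound of Theorem 2.2 does not exceed the squared radius.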

The scaling phenomenon
As said, the essential difference between Theorems 2.1 and 2.2 is that the roles of K and R are reversed: instead of R J_∞(K, F) we are dealing with K 𝒥_0(R, F).
These assumptions say that the local class F(R) behaves like the global class F_1 as far as supremum norm and entropy are concerned. Then, taking K ≍ 1, we see that by using Theorem 2.1 instead of Theorem 2.2 we win a factor R^α.
Otherwise put, let F_K := {Kf : f ∈ F_1} for some K ≥ 1. Then, taking R = 1, by using Theorem 2.1 instead of Theorem 2.2 we get rid of a factor K^α.
In fact, we find a scaling phenomenon in Theorem 2.1: whereas for general deviation inequalities the term involving the expectation of the supremum of the empirical process dominates the deviation term, in the current situation they are of the same order.
More generally as well, Theorem 2.1 gives better results than Theorem 2.2. As we will see, in the particular case where F is the signed convex hull of p given functions, uniform convergence follows from Theorem 2.1 for p of small order n, up to log-factors (see Theorem 4.1), whereas Theorem 2.2 needs p to be of small order √n (up to log-factors).

Example: additive functions
Let F_0 be a class of real-valued functions defined on the real line. Let further X := R^p, where p ≤ n, and let F be the corresponding class of additive functions. We will sometimes require the following incoherence condition: for a constant c_1, for all f_0 ∈ F_0 and all k, with f_{0,k}(x_1, ..., x_p) := f_0(x_k). In the following theorem one may think of F_0 as being, for a given m ∈ N, the Sobolev class. The constant α is then α = 1/(2m), and the choice N ≍ n^{α/(1+α)} corresponds to taking a piecewise polynomial approximation with ≍ n^{1/(2m+1)} pieces (i.e. a bandwidth of the usual order n^{−1/(2m+1)}). The bound (7) is shown for this case in Agmon and Jones [1965] under the condition that the one-dimensional marginal densities of the X_i ∈ R^p stay away from zero (see also Lemma 2.1 below).
We define the quantity Z(F(1)) below. Case 1 then gives Z²(F(1)) = O_P(p³/n). Case 2. Assume, in addition to the condition of Case 1, that the incoherence condition (4) holds true for some constant c_1 = O(1), together with a further smoothness condition for some constant. Then the bound improves accordingly. The next cases concern linear classes, where {ψ_r} is a given dictionary satisfying max_r ‖ψ_r‖_∞ = O(1). Assume that the incoherence condition (4) is met for some constant c_1 = O(1). Assume moreover that, for a constant c_0 = O(1), all β ∈ R^N and all k, an eigenvalue condition holds. Then Z²(F(1)) = O_P(pN² log n/n). When one chooses N ≍ n^{α/(1+α)} this reads Z²(F(1)) = O_P(p n^{−(1−α)/(1+α)} log n). Moreover, assume the incoherence condition (4) with c_1 = O(1) and that, for all N ∈ N, all β ∈ R^N and all k, with f_{β,k}(x_1, ..., x_p) := Σ_{r=1}^N β_r ψ_r(x_k), an approximation condition holds. Then Z²(F(1)) = O_P(p^{(1+4α)/(1+2α)} (log n/n)^{1/(1+2α)}).

Remark 2.1 If Condition (4) holds, one may in fact replace
The same is true for Case 2, where Condition (4) is indeed assumed. In Case 3, assuming (4), one may replace condition (8) by a local version. To complete the picture, we show in the next lemma that condition (7) is natural in the context of Case 5 (although we do not use it there).
i.e., that as soon as s > c_0, ψ_r and ψ_{r+s} do not overlap. Then (7) holds for some constant c_2 = O(1).
In Case 2, the bound found in Meier et al. [2009] is Z(F(1)) = O_P(p^{2(1+α)}/n). Note that in Case 5, we have Z(F(1)) = o_P(1) whenever p^{1+4α}/n = o(1). The conditions on p can possibly be weakened (perhaps by replacing entropy bounds by Gaussian means), but this is an open problem. It is not clear to us whether the bounds presented in Theorem 2.3 are sharp.
Cases 1 and 2 of Theorem 2.3 follow from Theorem 2.1 by straightforward entropy bounds. Case 3 is based on a result from Rudelson and Zhou [2013], cited here as Theorem 4.1. Case 4 is based on the general matrix version of Bernstein's inequality of Ahlswede and Winter [2002], cited here as Theorem 4.3. Case 5 follows from Case 4 using a trade-off argument for the choice of N (the value N ≍ n^{α/(1+α)} suggested in Case 4 may not give the optimal trade-off). The details are in Section 9.

Empirical inner products
Consider products fg of functions f and g, with f in some class F and g in some class G. Note that one can derive results for products via squares: if F and G have the same ‖·‖-diameter R and the same ‖·‖_∞-diameter K, it is easy to see that without loss of generality we may assume F = G (replace F and G by F ∪ G). However, if f and g are in different classes it may be more appropriate to analyze the products directly. This case, with F and G having different radii, is studied here.
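The reduction of products to squares mentioned above is just polarization; spelling it out (a one-line identity, not displayed in the text):

```latex
fg = \tfrac14\bigl[(f+g)^2 - (f-g)^2\bigr],
\qquad\text{hence}\qquad
(P_n - P)(fg) = \tfrac14\bigl[(P_n - P)(f+g)^2 - (P_n - P)(f-g)^2\bigr].
```

Thus uniform convergence of empirical norms over the sum and difference classes {f ± g : f ∈ F, g ∈ G} controls empirical inner products, which is why equal radii allow the reduction to F ∪ G.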
We only present the results using ℓ_∞-norms. Again, one may reverse the roles of the ‖·‖_∞-radii and the ‖·‖-radii, obtaining other versions of the bounds. The best bound may depend on the situation at hand.

Inner products of functions from different classes
Theorem 3.1 Let R_1, K_1 be the ‖·‖- and ‖·‖_∞-radii of F and R_2, K_2 those of G. Then, with probability at least 1 − 12 exp[−t], the supremum of |(P_n − P)fg| over f ∈ F, g ∈ G is bounded as stated.

Remark 3.1 Theorem 3.1 can be refined using generic chaining type quantities instead of entropies. We have omitted this to avoid digressions.
Remark 3.2 Consider the special case where G = {g_0} is a singleton. Assume that ‖g_0‖_∞ = K_0. Take R_2 = K_2 = K_0 in Theorem 3.1, and write R_1 := R and K_1 := K. For a singleton G, the term J_∞(K_2, G) can be omitted. We then get from Theorem 3.1, for t ≥ 4, a corresponding bound. We will see a similar result in Theorem 3.2, where g_0 is not bounded but sub-Gaussian.

Empirical inner products for smooth functions
Let us suppose that J_∞(z, F) ≍ z^{1−α} and J_∞(z, G) ≍ z^{1−β}, where β > α. For example, one may think of Sobolev classes as indicated in Subsection 2.5, or of more locally adaptive cases. Then, for instance, J_∞(z, F) ≍ z^{3/4} (α = 1/4) and J_∞(z, G) ≍ z^{1/2}√(log n) (β = 1/2). The log n-term plays a moderate role and we neglect such details in the following general line of reasoning.
The fact that β > α expresses that F is smoother (less rich) than G. Having an additive model in mind (the response Y_i is an additive function plus noise: Y_i = f_0(X_{i,1}) + g_0(X_{i,2}) + ε_i, i = 1, ..., n), one may expect to be able to estimate a function f_0 ∈ F with squared rate R_1² := n^{−1/(1+α)} and a function g_0 ∈ G with (slower) squared rate R_2² := n^{−1/(1+β)}. Let us simplify the situation by assuming that X_{i,1} and X_{i,2} are independent (the dependent case is detailed in van de Geer and Mammen [2013]). Also assume that the functions in F and G are already centred. We now want to show that P_n fg is small, namely negligible as compared to R_1². Indeed, inserting Theorem 3.1 (note that (11) and (12) are true for t fixed and n sufficiently large), we get a bound holding with probability at least 1 − 12 exp[−t]. For fixed t the right-hand side of this bound is o(1).
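The claim that P_n fg is negligible for independent, centred components can be checked in a toy simulation (our own; the choices f(x) = x and g(x) = x³ on uniform designs are illustrative only):

```python
import numpy as np

# For independent, centred f(X_1) and g(X_2), the empirical inner product
# P_n fg = mean of f(X_{i,1}) g(X_{i,2}) is of order n^{-1/2}, hence
# negligible compared to slower nonparametric squared rates.
rng = np.random.default_rng(1)

def empirical_inner_product(n):
    x1 = rng.uniform(-1.0, 1.0, size=n)
    x2 = rng.uniform(-1.0, 1.0, size=n)
    f = x1            # centred: E f(X_1) = 0
    g = x2 ** 3       # centred: E g(X_2) = 0
    return np.mean(f * g)

for n in (100, 10_000, 1_000_000):
    print(n, empirical_inner_product(n))
```

The uniform-in-F,G statements of Theorem 3.1 are of course stronger than this single-pair computation; the simulation only illustrates the scale of the effect.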
Actually, van de Geer and Mammen [2013] first prove the global (slow) rate R = R_2. Suppose now that f/K_1 ∈ F, where K_1 = R/λ with λ ≍ n^{−1/(1+α)}. Again (11) and (12) are true for t fixed and n sufficiently large, with R_1² = R_2² = R² = n^{−1/(1+α)}, K_1 = R/λ and K_2 = 1. We find a similar result as above, holding with probability at least 1 − 12 exp[−t]. Related is the paper Müller and van de Geer [2013], where the additive model is studied with f_0 a high-dimensional linear function. Again, it can be shown that f_0 can be estimated with a fast oracle rate, faster than the rate of estimation of the unknown function g_0.

Products with a sub-Gaussian random variable
Consider now real-valued random variables Y_i, i = 1, ..., n. We let P_n be the empirical measure based on {X_i, Y_i}_{i=1}^n: for a real-valued function f on X, P_n Yf := Σ_{i=1}^n Y_i f(X_i)/n. We write P Yf := E P_n Yf. We study the supremum of the absolute value of the product process (P_n − P)Yf, f ∈ F.
Definition 3.1 For a real-valued random variable Z and Ψ_k(z) := exp[|z|^k], k = 1, 2, we define the Orlicz norm ‖Z‖_{Ψ_k} := inf{c > 0 : E Ψ_k(Z/c) ≤ 2}, whenever it exists. If ‖Z‖_{Ψ_1} exists, we call Z sub-exponential, and if ‖Z‖_{Ψ_2} exists we call Z sub-Gaussian.
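The Orlicz norm can be estimated numerically; the following Monte Carlo sketch (our own illustration, using the standard convention inf{c : E exp(|Z/c|^k) ≤ 2}) finds it by bisection:

```python
import numpy as np

# ||Z||_{Psi_k} = inf{ c > 0 : E exp(|Z/c|^k) <= 2 }, estimated by
# Monte Carlo and bisection (c -> E exp(|Z/c|^k) is decreasing in c).
# The tail estimate is rough, so this is only a sketch.
rng = np.random.default_rng(2)

def orlicz_norm(z, k, lo=0.5, hi=10.0, iters=60):
    """Bisect for inf{c : mean(exp(|z/c|^k)) <= 2} over c in [lo, hi]."""
    for _ in range(iters):
        c = 0.5 * (lo + hi)
        if np.mean(np.exp(np.abs(z / c) ** k)) <= 2.0:
            hi = c
        else:
            lo = c
    return hi

z = rng.standard_normal(200_000)
# For Z ~ N(0,1), E exp(Z^2/c^2) = (1 - 2/c^2)^{-1/2}, so the exact
# Psi_2-norm solves (1 - 2/c^2)^{-1/2} = 2, i.e. c = sqrt(8/3) ~ 1.63.
print(orlicz_norm(z, k=2))
```

With k = 2 this returns a value near √(8/3) for Gaussian input, confirming that Gaussian variables are sub-Gaussian in the sense of Definition 3.1.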

Definition 3.2 We say that Y := {Y_1, ..., Y_n} is uniformly sub-Gaussian with constant K_0 if max_{1≤i≤n} ‖Y_i‖_{Ψ_2} ≤ K_0.
The result below is about products of functions, where the class G is replaced by the single sub-Gaussian variable Y.
We recall the definition (3) of 𝒥_∞.

Theorem 3.2 Let sup_{f∈F} ‖f‖_{n,∞} ≤ K and suppose Y is uniformly sub-Gaussian with constant K_0. Consider values of t and n such that √(2t/n) + t/n ≤ 1. For these values, a deviation bound for sup_{f∈F} |(P_n − P)Yf| holds, where the constant C_1 is as in Theorem 8.5.

Application to a class of linear functions
Suppose X = R^p. We let X_i be a row vector in R^p, i = 1, ..., n. For a column vector β ∈ R^p we define f_β(X_i) := X_i β. We assume that for some constant K_X, max_{i,j} |X_{i,j}| ≤ K_X. The following lemma is Lemma 3.7 in Rudelson and Vershynin [2008]. We inserted an explicit constant.
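For linear functions, the empirical norm is a quadratic form in the Gram matrix, which is worth seeing explicitly (our own numerical illustration; the uniform design, which happens to be isotropic, is an arbitrary choice):

```python
import numpy as np

# For f_beta(X_i) = X_i beta, the squared empirical norm is
#   ||f_beta||_n^2 = P_n f_beta^2 = beta^T (X^T X / n) beta,
# so uniform convergence of ||.||_n to ||.|| over {||beta||_2 = 1} is
# convergence of the Gram matrix X^T X / n to Sigma := E X_1^T X_1.
rng = np.random.default_rng(3)
n, p = 5_000, 10
X = rng.uniform(-1.0, 1.0, size=(n, p))   # bounded design; Sigma = I/3 here

Sigma_hat = X.T @ X / n
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)

empirical = np.mean((X @ beta) ** 2)      # ||f_beta||_n^2
quadratic = beta @ Sigma_hat @ beta       # the same quadratic form
print(empirical, quadratic)               # agree up to rounding
```

This is why, in the linear case, the results of this section can be phrased equivalently as operator-norm or restricted-eigenvalue statements about X^T X/n.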

Lemma 4.1 We have
As a consequence, we obtain a result which is in Rudelson and Zhou [2013]. It suffices to combine Theorem 2.1 with Lemma 4.1.
Theorem 4.1 For all t > 0, with probability at least 1 − exp[−t], the empirical norms ‖f_β‖_n concentrate uniformly around the norms ‖f_β‖.

Theorem 4.1 has very useful applications, in particular to ℓ_1-regularization or to exact recovery using basis pursuit (Chen et al. [1998]), where results often rely on bounds for compatibility constants (van de Geer [2007], van de Geer and Bühlmann [2009]) or restricted eigenvalues (Bickel et al. [2009]). This is elaborated upon in Rudelson and Zhou [2013].
Theorem 4.1 can also be applied to obtain a uniform bound over all subspaces. Define the minimal eigenvalue Λ²_min := min_{‖β‖_2 = 1} ‖f_β‖².
Theorem 4.2 Suppose Λ_min > 0. Define for S ⊂ {1, ..., p}, β_{j,S} := β_j 1{j ∈ S}, j = 1, ..., p. Then for all t > 0 a bound holds, uniformly in S, with probability at least 1 − exp[−t].

The next theorem is a direct application of a Bernstein-type inequality for random matrices as given in Ahlswede and Winter [2002] (see also Theorem 3 in Koltchinskii [2013]). It shows that in Theorem 4.2 the log³ n-term can be omitted when one considers a fixed set S instead of requiring a result uniform in S.

Remark 4.1 Let us briefly indicate how this compares to an isotropic case.
Following an idea of Loh and Wainwright [2012] (see also Lemma 1 in Nickl and van de Geer [2013]), one can show that the supremum over all ‖f_β‖ ≤ 1 can in fact be replaced by a maximum over a finite class {f_{β_j}}_{j=1}^N, where ‖f_{β_j}‖ ≤ 1 for all j = 1, ..., N and where log N ≤ c_0² p. We can now proceed by invoking the union bound for the maximum. An isotropy assumption then leads to good results. We assume sub-Gaussianity of the vectors {X_i}, meaning that each f_β(X_i) is sub-Gaussian: there is a constant K_1 such that for all ‖f_β‖ ≤ 1 and all i it holds that ‖f_β(X_i)‖_{Ψ_2} ≤ K_1. Then Bernstein's inequality applies for all ‖f_β‖ ≤ 1 and all t > 0, and the union bound together with the above reduction gives a bound for all t > 0. The latter result is a "true" deviation inequality: the deviation from the bound ≍ p/n for the mean does not involve this bound, i.e., there is no p in front of t inside the probability. This is in contrast to the result (13).

To avoid too involved expressions, from now on we will use order symbols. Then the results needed for the next section can be summarized as follows.

Least squares when the model is wrong
In this section we examine a p-dimensional linear model, with p moderately large, and the least squares estimator. The observations are {(X_i, Y_i)}_{i=1}^n, independent, with X_i ∈ X and Y_i ∈ R (i = 1, ..., n). Let {ψ_j}_{j=1}^p be a given dictionary of functions on X. We write f_β(·) := Σ_{j=1}^p β_j ψ_j(·), β ∈ R^p. The least squares estimator is

β̂ := arg min_{β ∈ R^p} Σ_{i=1}^n (Y_i − f_β(X_i))²/n,

and we write f̂ := f_{β̂} as well as f_0(X_i) := E(Y_i | X_i), i = 1, ..., n. The projection in L_2(P) of f_0 on the linear space {f_β : β ∈ R^p} is written as f*. We want to show convergence of f̂ to f*. Because we know little about the higher-order moments of f* (only the second moment is under control, as ‖f*‖ ≤ ‖Y‖), the situation is a little more delicate than in the usual regression context (where f* − f_0 is small). This is where uniform convergence of ‖·‖_n to ‖·‖ comes in.
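A toy simulation makes the projection interpretation tangible (our own example; the regression function sin(3x), the dictionary (1, x) and the uniform design are illustrative choices, not from the paper):

```python
import numpy as np

# When the linear model is wrong, the least squares estimator still
# converges -- but to the L_2(P) projection f* of the true regression
# f_0 onto span{psi_1, ..., psi_p}, not to f_0 itself.
rng = np.random.default_rng(4)
n = 50_000
X = rng.uniform(-1.0, 1.0, size=n)
Y = np.sin(3.0 * X) + 0.1 * rng.standard_normal(n)   # f_0(x) = sin(3x)

# Misspecified dictionary psi = (1, x): we fit beta_1 + beta_2 * x.
Psi = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.lstsq(Psi, Y, rcond=None)[0]

# Population projection beta* = Sigma^{-1} P(psi f_0), with
# Sigma = diag(1, 1/3) for X ~ Uniform(-1, 1) and, in closed form,
# E[X sin(3X)] = sin(3)/9 - cos(3)/3.
b1_star = 0.0
b2_star = 3.0 * (np.sin(3.0) / 9.0 - np.cos(3.0) / 3.0)
print(beta_hat, (b1_star, b2_star))  # beta_hat is close to beta*
```

The fitted slope approaches the projection slope β*, not any feature of the sine curve itself; the theory below makes this convergence quantitative.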
Lemma 5.1 Let 0 < δ_n < 1/2, and consider the set T on which the empirical and theoretical norms are uniformly close.

To handle the set T given in the above lemma, we invoke Summary 4.1. To this end, define the matrix Σ := P ψ^T ψ and let Λ²_min be the smallest eigenvalue of Σ.
Theorem 5.1 Suppose that Y := {Y_1, ..., Y_n} is uniformly sub-Gaussian with constant K_0 (see Definition 3.2), that max_{i,j} |X_{i,j}| ≤ K_X, that Λ_min > 0, and that δ_n = o(1), where δ_n² is as displayed. Then f̂ converges to f*; moreover, the same holds in the empirical norm.

In view of the uniformity in Summary 4.1 we can formulate an extension. Such an extension will be useful in the next section. Recall the notation: for a set S ⊂ {1, ..., p} and β ∈ R^p, β_{j,S} := β_j 1{j ∈ S}, j = 1, ..., p.
Consider, for any set S ⊂ {1, ..., p}, the projection f*_S of f_0 on the |S|-dimensional space F_S := {f_{β_S} : β ∈ R^p} and the corresponding least squares estimator f̂_S.

Theorem 5.2 Assume the conditions of Theorem 5.1 and let δ_n be defined as there. Then the conclusions of Theorem 5.1 hold uniformly in all S.

Application to DAG's

Let X be an n × p matrix with i.i.d. rows. Throughout this section we assume p ≤ n. The i-th row is denoted by X_i := (X_{i,1}, ..., X_{i,p}) (i = 1, ..., n). The distribution of a row, say X_1, is denoted by P.
We assume a directed acyclic graph (DAG) structure. Namely, we assume the structural equations model defined as follows.

Some notation
We consider a given class F 0 of functions f 0 : R → R.

Identifiability
In order to be able to estimate a correct permutation, one needs to assume that wrong permutations can be detected.
Condition 6.1 (Identifiability condition). For some constant ξ > 0, wrong permutations inflate the average log residual variance by at least ξ.

This condition is discussed in . The linear Gaussian structural equations model has Π_0 = Π, i.e. any permutation is correct. In the non-linear case, we think of the situation where, unlike the linear case, the parental dependence is the same for all π⁰ ∈ Π_0, say f⁰_j(X_{1,π⁰_1}, ..., X_{1,π⁰_{j−1}}) := f*_j({X_{1,k}}_{k≠j}) (j = 1, ..., p), and hence also the residual variances σ²_j, j = 1, ..., p, do not depend on π⁰. The identifiability condition then requires that choosing π ∉ Π_0 gives on average too large residual variances. If the model is misspecified, Condition 6.1 is to be seen as assuming robustness to the bias that misspecification introduces. In an asymptotic formulation, it suffices to assume identifiability at the truth, inf_{π∉Π_0} Σ_{j=1}^p log(σ_j(π)/σ_j)/p > ξ_0 with 1/ξ_0 = O(1), together with a vanishing bias: sup_{π⁰∈Π_0} Σ_{j=1}^p log(σ_j(π⁰)/σ_j)/p → 0. One may consider choosing a model with low complexity (large bias) because π⁰ is the parameter of interest here. The estimation of the f⁰_j (j = 1, ..., p) can then follow in a second step, using a standard (nonparametric) regression estimator and the estimated permutation.

The estimator
To describe the estimator of π⁰ we introduce empirical counterparts of the quantities given above, for each j and π. We let f̂_j(π) be the least squares estimator

f̂_j(π) := arg min {‖X^π_j − f_j‖_n : f_j ∈ F_j},

and take the normalized residual sum of squares

σ̂²_j(π) := ‖X^π_j − f̂_j(π)‖²_n

as estimator of σ²_j(π). We then let

π̂ ∈ arg min_{π∈Π} Σ_{j=1}^p log σ̂²_j(π).
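The estimator can be sketched end to end on a hypothetical 3-node DAG (our own toy model; we use cubic polynomial regression as a simple stand-in for the nonparametric least squares over the classes F_j):

```python
import itertools
import numpy as np

# For each permutation pi, regress X_{pi_j} on polynomial features of its
# predecessors, record the normalized residual variance sigma_hat_j^2(pi),
# and pick pi minimizing sum_j log sigma_hat_j^2(pi).
rng = np.random.default_rng(5)
n = 2_000
x1 = rng.standard_normal(n)
x2 = 0.5 * x1 ** 3 + 0.3 * rng.standard_normal(n)   # x1 -> x2
x3 = x2 ** 2 + 0.3 * rng.standard_normal(n)         # x2 -> x3
X = np.column_stack([x1, x2, x3])

def residual_variance(y, parents):
    """sigma_hat^2 of regressing y on cubic features of its parents."""
    feats = [np.ones_like(y)]
    for col in parents.T:
        feats += [col, col ** 2, col ** 3]
    A = np.column_stack(feats)
    r = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return np.mean(r ** 2)

def score(pi):
    total = 0.0
    for j in range(len(pi)):
        parents = X[:, list(pi[:j])] if j > 0 else np.empty((n, 0))
        total += np.log(residual_variance(X[:, pi[j]], parents))
    return total

pi_hat = min(itertools.permutations(range(3)), key=score)
print(pi_hat)  # here the causal order (0, 1, 2) attains the minimum
```

The non-linear mechanisms are what make the order detectable; with linear Gaussian mechanisms every permutation would achieve the same score, which is exactly what Condition 6.1 rules out.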
We recall Remark 2.1: the conditions on F may be weakened to local versions.

Conclusion
In this paper we summarized some results for the uniform convergence of empirical norms and the extension to empirical inner products.
For statistical theory the results are very useful. In Bartlett et al. [2012] one can find an application to ℓ_1-restricted regression for the case of random design, and Rudelson and Zhou [2013] focuses on the restricted isometry property and restricted eigenvalues. We have given the application to order estimation in directed acyclic graphs (DAG's). We omitted important computational issues and further discussion of this special case, as it is beyond the scope of the paper. For more details we refer to .
The results can also be applied to generalize the results in van de Geer and  for DAG's to the linear non-Gaussian case, in particular to anisotropic distributions. A generalization to isotropic distributions (e.g. sub-Gaussian distributions) is possible but perhaps less relevant, as in many statistical applications isotropy is not very natural or stable (for DAG's, sub-Gaussianity can hold when the linear model is exactly true, but it is not clear what happens when the model is only approximately linear).
A further application is the estimation of a precision matrix for non-Gaussian data. We mention that such an approach is used in van de Geer et al.
[2013] to construct confidence intervals for a single parameter. Here, a Lasso is used for estimating a Fisher-information matrix. The estimator is based on empirical projections, and the function to be estimated is also a theoretical projection, as in Section 5. In the context of confidence intervals in ℓ_2, the uniform convergence may generalize the (sub-)Gaussian case considered in Nickl and van de Geer

Technical tools

Symmetrization

Let moreover ǫ_1, ..., ǫ_n be a Rademacher sequence (that is, ǫ_1, ..., ǫ_n are independent random variables taking the values +1 and −1, each with probability 1/2) independent of X_1, ..., X_n, and define P^ǫ_n f := Σ_{i=1}^n ǫ_i f(X_i)/n, Z(F) := sup_{f∈F} |(P_n − P)f| and Z^ǫ(F) := sup_{f∈F} |P^ǫ_n f|.

Theorem 8.1 (see e.g. van der Vaart and Wellner [1996]). It holds that E Z(F) ≤ 2 E Z^ǫ(F).

Theorem 8.2 (see Pollard [1984]). Let R := sup_{f∈F} ‖f‖. For t ≥ 4,

IP(Z(F) ≥ 4R√(2t/n)) ≤ 4 IP(Z^ǫ(F) ≥ R√(2t/n)).

Dudley's theorem
Dudley's theorem is originally for Gaussian processes (see Dudley [1967]). The extension to sub-Gaussian random variables and Rademacher averages is rather straightforward. We summarize these in our context in Theorem 8.3 below.
Let H(·, F, ‖·‖_n) denote the entropy of F equipped with the metric induced by the empirical norm ‖·‖_n, and let R̂ be the random radius R̂ := sup_{f∈F} ‖f‖_n.

Deviation inequalities
We present two deviation inequalities, for the bounded case and the sub-Gaussian case.
Proofs

Proofs for Section 2

Theorem 2.1 follows from Guédon et al. [2007]. We present a proof for completeness and to facilitate the extension to products of functions.
Proof of Theorem 2.1. We consider the symmetrized process P^ǫ_n f² := Σ_{i=1}^n ǫ_i f²(X_i)/n, with ǫ_1, ..., ǫ_n a Rademacher sequence independent of X_1, ..., X_n, and then apply Dudley's theorem (Theorem 8.3). Note that for two functions f and f̃ in the class F,

‖f² − f̃²‖_n ≤ ‖f + f̃‖_n ‖f − f̃‖_{n,∞} ≤ 2R̂ ‖f − f̃‖_{n,∞}.

It follows that H(u, F², ‖·‖_n) ≤ H(u/(2R̂), F, ‖·‖_{n,∞}), u > 0.
Here we used that ‖f²‖_n ≤ R̂K. So Theorem 8.3 applies, and then Theorem 8.1. This leads to the by-product of the theorem: the inequality E R̂² ≤ R² + 2J_∞(K, F)√(E R̂²)/√n gives √(E R̂²) ≤ R + 2J_∞(K, F)/√n.
Insert this in (16) to find the bound for the expectation. We now apply Theorem 8.4: inserting the just obtained bound for the expectation, we arrive for all t > 0 at the stated probability inequality.

⊔ ⊓
Proof of Theorem 2.2. We start as in the proof of Theorem 2.1 by considering the symmetrized process P^ǫ_n f² := Σ_{i=1}^n ǫ_i f²(X_i)/n, with ǫ_1, ..., ǫ_n a Rademacher sequence independent of X_1, ..., X_n. But when applying Dudley's theorem (Theorem 8.3), we use a different entropy bound.
For two functions f and f̃ in the class F,

‖f² − f̃²‖_n ≤ 2K ‖f − f̃‖_n.
So by Theorem 8.3
Inserting this back, we find the stated bound. Finally apply Theorem 8.4.

Proofs for Section 3
Proof of Theorem 3.1. For functions f, f̃ in the class F and g, g̃ in the class G, we have

‖fg − f̃g̃‖_n ≤ ‖fg − f̃g‖_n + ‖f̃g − f̃g̃‖_n ≤ R̂_2 ‖f − f̃‖_{n,∞} + R̂_1 ‖g − g̃‖_{n,∞}.