On higher order isotropy conditions and lower bounds for sparse quadratic forms

This study contributes to the theory of lower bounds for empirical compatibility constants and empirical restricted eigenvalues. Such bounds are of importance in compressed sensing and in the theory of $\ell_1$-regularized estimators. Let $X$ be an $n \times p$ data matrix with rows being independent copies of a $p$-dimensional random variable. Let $\hat \Sigma := X^T X / n$ be the inner product matrix. We show that the quadratic forms $u^T \hat \Sigma u$ are lower bounded by a value converging to one, uniformly over the set of vectors $u$ with $u^T \Sigma_0 u$ equal to one and $\ell_1$-norm at most $M$. Here $\Sigma_0 := {\bf E} \hat \Sigma$ is the theoretical inner product matrix, which we assume to exist. The constant $M$ is required to be of small order $\sqrt{n / \log p}$. We assume moreover $m$-th order isotropy for some $m > 2$, as well as sub-exponential tails or moments up to order $\log p$ for the entries of $X$. As a consequence we obtain convergence of the empirical compatibility constant to its theoretical counterpart, and similarly for the empirical restricted eigenvalue. If the data matrix $X$ is first normalized so that its columns all have equal length, we obtain lower bounds assuming only isotropy and no further moment conditions on its entries. The isotropy condition is shown to hold in certain martingale situations.


Introduction
Let $X$ be an $n \times p$ data matrix with rows being i.i.d. copies of a random vector $X_0 \in \mathbb{R}^p$. We consider the empirical inner product matrix $\hat\Sigma = X^T X / n$. For a vector $u \in \mathbb{R}^p$, let $\|u\|_q$ be its $\ell_q$-norm ($1 \le q \le \infty$). We examine sparse quadratic forms $u^T \hat\Sigma u$, where $u$ is sparse in the sense that $\|u\|_1 \le M$ for some constant $M \ge 1$. We will provide lower bounds for $\min\{u^T \hat\Sigma u : u^T \Sigma_0 u = 1, \ \|u\|_1 \le M\}$, with $\Sigma_0 := {\bf E} \hat\Sigma$ being the theoretical inner product matrix, which we assume to exist. The constant $M$ will be required to be of small order $\sqrt{n / \log p}$.
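As a concrete, purely illustrative rendering of these objects (not from the paper; all names and the Monte Carlo search are ours), the following sketch draws a Gaussian design with $\Sigma_0 = I$, forms $\hat\Sigma$, and estimates the sparse minimum by random search over $2$-sparse directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M, s = 200, 50, 3.0, 2

# Rows of X are i.i.d. copies of X_0 ~ N(0, I), so Sigma_0 = I here.
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n              # empirical inner product matrix

def sparse_quadratic_min(Sigma_hat, M, s, n_trials=2000):
    """Monte Carlo search for min{u^T Sigma_hat u : u^T Sigma_0 u = 1, ||u||_1 <= M}
    restricted to s-sparse u; with Sigma_0 = I the constraint u^T Sigma_0 u = 1
    is just ||u||_2 = 1."""
    p = Sigma_hat.shape[0]
    best = np.inf
    for _ in range(n_trials):
        u = np.zeros(p)
        support = rng.choice(p, size=s, replace=False)
        u[support] = rng.standard_normal(s)
        u /= np.linalg.norm(u)               # enforce u^T Sigma_0 u = 1
        if np.abs(u).sum() <= M:             # l1 constraint (here sqrt(s) <= M always)
            best = min(best, u @ Sigma_hat @ u)
    return best

val = sparse_quadratic_min(Sigma_hat, M, s)
```

For such sample sizes the Monte Carlo minimum lies below, but not far from, one, in line with the lower bounds developed below.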
The condition $\hat\phi^2(L, S) > 0$ for suitable values of $L$ and $S$ allows one to establish oracle inequalities for the Lasso. Indeed, let $u^0$ be the sparse vector we want to recover and let $S := \{j : u^0_j \neq 0\}$ be its active set. Let $\xi \in \mathbb{R}^n$ be a "noise" vector. Consider the Lasso
$$\hat u := \arg\min_{u \in \mathbb{R}^p} \|\xi + X u^0 - X u\|_{2,n}^2 + 2 \lambda \|u\|_1,$$
where $\lambda > 0$ is a tuning parameter and where we use the notation $\|v\|_{2,n}^2 := v^T v / n$, $v \in \mathbb{R}^n$. For $\lambda > \lambda_0 := \|\xi^T X\|_\infty / n$ it holds that
$$\|X(\hat u - u^0)\|_{2,n}^2 \le (\lambda + \lambda_0)^2 |S| / \hat\phi^2(L, S), \tag{1.1}$$
where $L := (\lambda + \lambda_0)/(\lambda - \lambda_0)$. We refer to [4] and the references therein. In the literature, result (1.1) is considered to be an "oracle inequality" if, in a suitable asymptotic formulation, the constant $L$ remains bounded (i.e., $\lambda$ is of the same order as $\lambda_0$) and $\hat\phi^2(L, S)$ stays away from zero. In the present paper this case serves as a benchmark. We give non-asymptotic results and some asymptotic consequences showing that under certain conditions $\hat\phi^2(L, S)$ indeed stays away from zero. Closely related is the so-called null space property¹ (see e.g. [6]) used in exact recovery. One says that $X$ has the null space property relative to $S$ if for all $u \in \mathbb{R}^p$ with $Xu = 0$ and $u \neq 0$ it holds that $\|u_S\|_1 < \|u_{-S}\|_1$. The null space property is the same as the condition $\hat\phi(1, S) > 0$ and implies, in the noiseless case, exact recovery of a sparse signal $u^0$ with active set $S$ using basis pursuit ([5]): $\arg\min\{\|u\|_1 : Xu = X u^0\} = u^0$.
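To make the quantities entering (1.1) concrete, here is a hedged numerical sketch: a minimal cyclic coordinate descent Lasso (our own implementation, written for illustration only), with $\lambda_0$ replaced by its realized value $\|X^T \xi\|_\infty / n$ and $\lambda = 2\lambda_0$, so that $L = 3$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 100, 20, 3

X = rng.standard_normal((n, p))
u0 = np.zeros(p)
u0[:s] = 1.0                              # sparse target, active set S = {0, 1, 2}
xi = 0.1 * rng.standard_normal(n)         # noise vector
y = X @ u0 + xi

lam0 = np.max(np.abs(X.T @ xi)) / n       # realized value of lambda_0
lam = 2 * lam0                            # lambda > lambda_0, giving L = 3

def lasso_cd(X, y, lam, n_iter=500):
    """Minimize ||y - X u||_{2,n}^2 + 2*lam*||u||_1 by cyclic coordinate descent
    with soft-thresholding updates."""
    n, p = X.shape
    u = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ u + X[:, j] * u[j]          # partial residual
            z = X[:, j] @ r / n
            u[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
    return u

u_hat = lasso_cd(X, y, lam)
pred_err = np.mean((X @ (u_hat - u0)) ** 2)   # ||X(u_hat - u0)||_{2,n}^2
```

In this well-conditioned example the prediction error is small, as (1.1) suggests when the compatibility constant stays away from zero.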
In many cases (e.g. when applying the Lasso) the data are first normalized: for $\hat\sigma_j^2 := \hat\Sigma_{j,j}$ one replaces the $j$-th column $X_j$ of $X$ by $X_j / \hat\sigma_j$, $j = 1, \ldots, p$. Therefore we study in Section 5 the compatibility constant for normalized design
$$\tilde\phi^2(L, S) := \min\{|S| \, u^T \hat R u : \|u_S\|_1 = 1, \ \|u_{-S}\|_1 \le L\},$$
where $\hat R$ is the empirical inner product matrix of the normalized design.

¹We thank Emmanuel Candès for pointing this out.

Organization of the paper
After some notation and definitions in the next section, we present in Section 3 a bound for sparse quadratic forms. The lower bounds for the empirical compatibility constant and empirical restricted eigenvalue follow from this. The upper bounds depend on fourth moments. We will show that $\hat\phi(L, S)$ converges to its theoretical counterpart, and similarly for $\hat\kappa(L, S)$ (see Theorem 4.2). For this we need $(L+1)\sqrt{s}$ to be of small order $\sqrt{n / \log p}$ (for the lower bound). This is detailed in Section 4. In Section 5 we consider the transfer principle from [13], which allows one to show that when the data are normalized, very weak moment conditions suffice. Section 6 is devoted to a discussion of related work. There we summarize the comparison of results in Table 1. In Section 7 we make a brief comparison of the results when we drop the isotropy assumption. We show convergence of $|u^T \hat\Sigma u - u^T \Sigma_0 u|$ uniformly over $\|u\|_1 \le M$, assuming sub-exponential entries in $X_0$. In Section 8 we examine the higher order isotropy condition. Finally, Section 9 contains the proofs.

Notation and definitions
We let $\Sigma_0 := {\bf E} X_0 X_0^T = {\bf E} \hat\Sigma$ be the theoretical inner product matrix. Its smallest eigenvalue is denoted by $\psi_0^2$. We do not assume $\psi_0 > 0$. For $m \ge 1$ and $Z$ a real-valued random variable, we introduce the notation $\|Z\|_m^m := {\bf E}|Z|^m$.
Thus $u^T \Sigma_0 u = \|X_0 u\|_2^2$, where $X_0 u$ is the inner product $X_0^T u$, $u \in \mathbb{R}^p$. Let $X_{i,\cdot}^T$ be the $i$-th row of $X$ ($i = 1, \ldots, n$). For a function $f : \mathbb{R}^p \to \mathbb{R}$ we write $\|f\|_{2,n}^2 := \sum_{i=1}^n f^2(X_{i,\cdot}) / n$, so that $\|Xu\|_{2,n}^2 = u^T \hat\Sigma u$.

Definition 2.1. We say that a random variable $Z$ is Bernstein with constants $\sigma$ and $K$ if for all $k \in \{2, 3, \ldots\}$
$${\bf E}|Z|^k \le \frac{k!}{2} \sigma^2 K^{k-2}.$$

Definition 2.2. We say that a random variable $Z \in \mathbb{R}$ is sub-Gaussian with constant $C$ if for all $\lambda > 0$
$${\rm I\!P}(|Z| \ge \lambda) \le 2 \exp\left[-\frac{\lambda^2}{2C^2}\right].$$

Let us denote, for $k = 1, 2$, the Orlicz norm by $\|Z\|_{\Psi_k} := \inf\{c > 0 : {\bf E} \exp[|Z|^k / c^k] \le 2\}$. Then being a Bernstein random variable is equivalent to having a finite $\|\cdot\|_{\Psi_1}$-norm (i.e., being sub-exponential), and sub-Gaussianity is equivalent to having a finite $\|\cdot\|_{\Psi_2}$-norm. We have chosen Definitions 2.1 and 2.2 in order to have simple explicit dependence on the constants later on.
Note that if a random variable is sub-Gaussian with constant $C$, it is also Bernstein with constants $\sigma = 2C$ and $K = \sqrt{2}\, C$. Moreover, for a Bernstein random variable $Z$ with constants $\sigma$ and $K$ one may always take $\sigma \le 3K$, and then $\|Z\|_m \le m K$ for all $m \in \{3, 4, \ldots\}$.
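The implication can be checked numerically for a standard Gaussian, using the closed form ${\bf E}|g|^k = 2^{k/2}\Gamma((k+1)/2)/\sqrt{\pi}$; the constants below follow the claim above and are otherwise illustrative (ours, not from the paper).

```python
import math

def abs_moment_std_normal(k):
    # E|g|^k = 2^{k/2} * Gamma((k+1)/2) / sqrt(pi) for g ~ N(0, 1)
    return 2 ** (k / 2) * math.gamma((k + 1) / 2) / math.sqrt(math.pi)

# Sub-Gaussian constant of a standard normal under the tail definition above.
C = 1.0
sigma, K = 2 * C, math.sqrt(2) * C   # claimed Bernstein constants

ratios = []
for k in range(2, 21):
    bernstein_bound = (math.factorial(k) / 2) * sigma ** 2 * K ** (k - 2)
    ratios.append(abs_moment_std_normal(k) / bernstein_bound)

ok = max(ratios) <= 1.0   # the Bernstein bound dominates every moment up to k = 20
```

All ratios stay below one, so the Bernstein moment condition holds with these constants in this Gaussian test case.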
We use the definition of [12] or [11] of a sub-Gaussian vector (in a slightly different formulation).

Definition 2.3. The random vector $X_0 \in \mathbb{R}^p$ is sub-Gaussian with constant $C$ if for all $u \in \mathbb{R}^p$ with $\|X_0 u\|_2 = 1$ the random variable $X_0 u$ is sub-Gaussian with constant $C$.
The main concept we will use in this paper is weak isotropy, for which we now present the definition.

Definition 2.4. Let $m \ge 2$. The random vector $X_0 \in \mathbb{R}^p$ is weakly $m$-th order isotropic with constant $C_m$ if for all $u \in \mathbb{R}^p$ with $\|X_0 u\|_2 = 1$ and all $t > 0$ it holds that
$${\rm I\!P}(|X_0 u| > t) \le (C_m / t)^m.$$

A Gaussian vector is sub-Gaussian with constant 1 and is strongly $m$-th order isotropic (defined in Definition 6.1) with constant $\sqrt{2}\left(\Gamma((m+1)/2)/\sqrt{\pi}\right)^{1/m}$. Definitions 2.4 (and 2.3) are invariant under rotations: if $\psi_0 > 0$ one may without loss of generality assume $\Sigma_0 = I$ here. We however explicitly do not assume $\Sigma_0 = I$, because conditions on the $\ell_1$-norm are not invariant under rotation. In contrast to the literature, where the "isotropic" case is sometimes defined as the case $\Sigma_0 = I$, our notion of isotropy is rather to be understood as uniformity in all one-dimensional directions (very much like isotropy of functions in Besov spaces).

Lower bounds for sparse quadratic forms under higher order isotropy
The first result of Theorem 3.1 below is as in [16] and is given for completeness. It is only of interest when $p$ is smaller than $n$. The result is improved in [7]. We refer to Section 6 for a discussion. The second result of Theorem 3.1 extends the situation to the case where $p$ can be larger than $n$, but $\ell_1$-restrictions are invoked.
Here we need bounds on Rademacher averages. A Rademacher sequence is a sequence of independent random variables $\epsilon_1, \ldots, \epsilon_n$, where each $\epsilon_i$ takes the values $\pm 1$ with probability $1/2$. We assume that $\epsilon := (\epsilon_1, \ldots, \epsilon_n)^T$ is independent of $X$. Consider the Rademacher averages $W := (W_1, \ldots, W_p)^T$, $W^T := \epsilon^T X / n$, and let $\|W\|_\infty := \max_{1 \le j \le p} |W_j|$. We will need bounds for ${\bf E}\|W\|_\infty$. If the entries in $X_0$ are Bernstein with constants $\sigma_X$ and $K_X$, then applying Lemma 14.12 in [4] gives
$${\bf E}\|W\|_\infty \le \sigma_X \sqrt{\frac{2 \log(2p)}{n}} + \frac{K_X \log(2p)}{n} =: \delta_n, \tag{3.1}$$
and it is this bound that is invoked in the second result of Theorem 3.1. Further bounds for ${\bf E}\|W\|_\infty$ are discussed in Subsection 3.1.
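A quick simulation (ours, not from [4]) illustrates that $\delta_n$ of (3.1) indeed dominates ${\bf E}\|W\|_\infty$ for a Gaussian design; the Bernstein constants used are the illustrative ones for standard Gaussian entries.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 400, 100, 200

X = rng.standard_normal((n, p))   # entries sub-Gaussian, hence Bernstein

w_inf = []
for _ in range(reps):
    eps = rng.choice([-1.0, 1.0], size=n)   # Rademacher sequence
    W = eps @ X / n                          # Rademacher averages
    w_inf.append(np.abs(W).max())
mean_w_inf = float(np.mean(w_inf))           # estimate of E||W||_inf given X

sigma_X, K_X = 2.0, np.sqrt(2.0)             # illustrative Bernstein constants
delta_n = sigma_X * np.sqrt(2 * np.log(2 * p) / n) + K_X * np.log(2 * p) / n
```

In this run the simulated mean is well below $\delta_n$, reflecting the slack in the union-bound argument behind (3.1).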
Theorem 3.1. Suppose that for some $m > 2$ the random vector $X_0$ is weakly $m$-th order isotropic with constant $C_m$, and define $D_m$ by (3.2). Then for all $t > 0$, with probability at least $1 - \exp[-t]$,
$$\inf\{\|Xu\|_{2,n}^2 : \|X_0 u\|_2 = 1, \ \|u\|_1 \le M\} \ge 1 - \Delta_n(M, t). \tag{3.3}$$
If in addition the entries in $X_0$ are Bernstein with constants $\sigma_X$ and $K_X$, then for all $t > 0$, with probability at least $1 - \exp[-t]$,
$$\inf\{\|Xu\|_{2,n}^2 : \|X_0 u\|_2 = 1, \ \|u\|_1 \le M\} \ge 1 - \Delta_n^L(M, t), \tag{3.4}$$
with $\Delta_n^L(M, t)$ as defined in (3.5).

Asymptotics. In an asymptotic formulation, suppose that $C_m$, $K_X$ and $\sigma_X$ remain fixed and that $(\log p)/n = o(1)$. Then the second result (3.4) of Theorem 3.1 says that for $M$ of small order $\sqrt{n / \log p}$ the sparse quadratic forms are lower bounded by $1 - o(1)$.

Remark 3.1. The constant $t$ in the formulation of Theorem 3.1 allows one to choose the confidence level of the result. If $t$ is large (for example $p$ large and $t = \log p$) the bounds will hold with large probability. Of course, for very large $t$ the bounds become void.

Remark 3.2. We have not attempted to obtain small constants in the bound of Theorem 3.1. In fact, the last "smaller order" term in the expression (3.5) for $\Delta_n^L(M, t)$ can be refined, but this would make the expressions more involved.

Remark 3.3. The technique used to prove Theorem 3.1 does not rely on the fact that we consider squared functions $|(Xu)_i|^2$, $i = 1, \ldots, n$. For example, one may use it for bounding the infimum of $\sum_{i=1}^n |(Xu)_i|^q / n$ for powers $q$ other than $2$. One could then e.g. use weak isotropy conditions of order $m > q$. However, a motivation for having such results is perhaps lacking.
Theorem 3.1 is based on a truncation argument. For the case of a sub-Gaussian vector X 0 the truncation level can be taken rather small leading to an improved bound. We present this for completeness in the next lemma.
Lemma 3.1. If the random vector $X_0$ is sub-Gaussian with constant $C$, then for all $t > 0$, with probability at least $1 - \exp[-t]$, the lower bound of Theorem 3.1 holds with $\delta_n$ replaced by $\delta_n' := C \sqrt{\frac{2 \log(2p)}{n}}$.

Bounds for ${\bf E}\|W\|_\infty$
Inequality (3.1) presents a bound for ${\bf E}\|W\|_\infty$ assuming Bernstein conditions. This bound is then invoked in Theorem 3.1. One may derive alternative bounds for ${\bf E}\|W\|_\infty$ and adjust the definition of $\delta_n$ in Theorem 3.1 accordingly. For example, one may impose existence of $k$-th moments of the entries of $X_0$, where $k$ is of order $\log p$. The paper [9] presents refined results, which we cite in the next lemma.
Lemma 3.2. Let $Z_1, \ldots, Z_n$ be i.i.d. copies of a mean-zero random variable $Z \in \mathbb{R}$ and $\bar Z := \sum_{i=1}^n Z_i / n$. Suppose that for some constants $\kappa_1$ and $\alpha \ge 1/2$ one has $\|Z\|_k \le \kappa_1 k^\alpha$ for all $k \le k_0$. Then for $n \ge k_0^{\max\{2\alpha - 1, 1\}}$ and for all $k \le k_0$,
$$\|\bar Z\|_k \le c_0 \kappa_1 \sqrt{k / n},$$
where $c_0$ is a universal constant.
Corollary 3.1. Suppose that for some constants $\kappa_1$, $\eta \ge 2/\log p$ and $\alpha \ge 1/2$ one has
$$\max_{1 \le j \le p} \|X_{0,j}\|_k \le \kappa_1 k^\alpha \ \text{for all } k \le k_0 := \eta \log p. \tag{3.6}$$
Then for $n \ge k_0^{\max\{2\alpha - 1, 1\}}$,
$${\bf E}\|W\|_\infty \le c_0 \kappa_1 e^{1/\eta} \sqrt{\frac{\eta \log p}{n}},$$
where $c_0$ is a universal constant. This bound may then replace $\delta_n$ in Theorem 3.1.

Convergence of the compatibility constant and restricted eigenvalue
An "almost isometric" (in a terminology from [7]) lower bound for the empirical compatibility constant and empirical restricted eigenvalue follows easily from Theorem 3.1 as is shown in the next theorem.
Recall that $S \subset \{1, \ldots, p\}$ is an arbitrary subset. Let, for $s := |S|$,
$$\phi_0^2(L, S) := \min\{s \, u^T \Sigma_0 u : \|u_S\|_1 = 1, \ \|u_{-S}\|_1 \le L\}$$
be the theoretical compatibility constant and
$$\kappa_0^2(L, S) := \min\{u^T \Sigma_0 u : \|u_S\|_2 = 1, \ \|u_{-S}\|_1 \le L \|u_S\|_1\}$$
be the theoretical restricted eigenvalue.
Theorem 4.1. Under the conditions of Theorem 3.1 and using its notation, we find that for all $t > 0$, with probability at least $1 - \exp[-t]$,
$$\hat\phi^2(L, S) \ge \left(1 - \Delta_n^L(M, t)\right) \phi_0^2(L, S), \quad M := (L+1)\sqrt{s} / \phi_0(L, S),$$
and similarly for the restricted eigenvalue, with $\phi_0(L, S)$ replaced by $\kappa_0(L, S)$.

Note that Theorem 4.1 does not depend on the smallest eigenvalue $\psi_0$ of $\Sigma_0$, nor on its largest eigenvalue. If $\psi_0 > 0$ one may however want to insert the bounds $\phi_0(L, S) \ge \kappa_0(L, S) \ge \psi_0$. We refer to the "Asymptotics" paragraph at the end of this section for a further discussion.
The next issue is whether $\hat\phi(L, S)$ actually converges to $\phi_0(L, S)$ and $\hat\kappa(L, S)$ to $\kappa_0(L, S)$. This follows easily from the lower bounds of Theorem 4.1 and convergence of $\|Xu\|_{2,n}^2 - \|X_0 u\|_2^2$ for fixed values of $u$, for which in turn we would like to have fourth moments. If $m > 4$, this fourth order moment condition follows from $m$-th order weak isotropy. If however $X_0$ is only $m$-th order weakly isotropic for $m \le 4$, we need some other means to check fourth moments. The next lemma can be invoked.
Lemma 4.1. Suppose that the entries in $X_0$ are Bernstein with constants $\sigma_X$ and $K_X \ge \sigma_X$, and that condition (4.1) below holds for some constant $c_0 \ge 1$. Then for all $u$ with $\|X_0 u\|_2 \le 1$ and $\|u\|_1 \le M$, the fourth moment $\|X_0 u\|_4$ is bounded by a constant depending only on $c_0$, $K_X$ and $\sigma_X$.

Combining Theorem 4.1 with Lemma 4.1 gives the upper and lower bounds shown in the next theorem.
Theorem 4.2. Suppose that $X_0$ is weakly $m$-th order isotropic with constant $C_m$ and that the entries in $X_0$ are Bernstein with constants $\sigma_X$ and $K_X > \sigma_X$. For the case $m \le 4$ we assume in addition that condition (4.1) holds for some constant $c_0 \ge 1$ and for $c_1 := 2(1 + c_0)(K_X + \sigma_X)$. Define $D_m$ as in (3.2) and $\Delta_n^L(M, t)$ as in (3.5). Then for all $t > 0$, with probability at least $1 - \exp[-t]$, the compatibility constant $\hat\phi^2(L, S)$ is bounded from below and from above in terms of $\phi_0^2(L, S)$, and similarly for the restricted eigenvalue.

Asymptotics. In an asymptotic formulation we assume that the constants $1/\phi_0(L, S)$, $C_m$, $\sigma_X$ and $K_X$ remain bounded. Then it follows from Theorem 4.2 that under its conditions, as long as $(L+1)\sqrt{s} = o(\sqrt{n / \log p})$, the empirical compatibility constant converges to its theoretical counterpart. Similar results hold for the restricted eigenvalue. (Note that for $c_0$ fixed and $p > n$, condition (4.1) follows from the already imposed condition $(L+1)\sqrt{s} = o(\sqrt{n / \log p})$.) Thus, in the upper bound an additional $\sqrt{\log p}$ appears in the requirement on $M$. This term can be omitted if $m > 4$, or if we assume that the entries in $X_0$ are sub-Gaussian instead of Bernstein.

Bounds for the compatibility constant and restricted eigenvalue using the transfer principle
In this section we assume for simplicity that $\Sigma_0$ has ones on the diagonal. We let $\hat\sigma_j^2 := \hat\Sigma_{j,j} = \|X_j\|_{2,n}^2$, $j = 1, \ldots, p$, where $X_j$ denotes the $j$-th column of $X$.

The transfer principle
The transfer principle given in the next theorem is from [13]. As shown in the latter paper it can be used to move from the case p ≤ n to p > n assuming ℓ 1 -conditions. We will apply this technique here as well, for non-normalized design in Theorem 5.2 and for normalized design in Theorem 5.3. The results are compared with [13] in Section 6.
We will invoke the transfer principle via the following corollary (as well as directly in the proof of Theorem 5.3). The corollary is as in [13] and we state it here in our notation for ease of reference.
It involves an event $A$ on which the sparse quadratic forms are lower bounded and, for some $\epsilon > 0$, the event $B := \{\max_{1 \le j \le p} \hat\sigma_j^2 \le 1 + \epsilon\}$. To put this corollary to work we insert the first result of Theorem 3.1.
Theorem 5.2. Suppose that for some $m > 2$ the random vector $X_0$ is weakly $m$-th order isotropic with constant $C_m$. Define $D_m$ as in Theorem 3.1. Let $A$ be the event on which the lower bound of Theorem 3.1 holds, and let, for some $\epsilon > 0$, $B$ be the event $\{\max_{1 \le j \le p} \hat\sigma_j^2 \le 1 + \epsilon\}$. Then, with probability at least that of $A \cap B$, the compatibility constant is lower bounded as in Corollary 5.1.

The above theorem invokes Theorem 3.1 for handling the event $A$. One may also use the results in [13] for the case of $m$-th order strong isotropy (defined in Definition 6.1) with $m \ge 4$, or those which can be deduced from [7] for the case of $m$-th order weak isotropy with $m > 2$ (the latter paper does not explicitly treat an event of the form $A$). For the case $m < 4$, for instance, the arguments in [7] would allow one to replace $\tilde\Delta_n(M, t)$ in Theorem 5.2 (which is of order $(M \sqrt{\log p / n})^{\frac{m-2}{m-1}}$) by a term of order $M \sqrt{\log p / n} \left(\log\left(1 / (M \sqrt{\log p / n})\right)\right)^{2(m-2)/m}$.
Clearly, one can again apply the results to the compatibility constant and restricted eigenvalue as in Theorem 4.1. This gives the following corollary.

The behaviour of $\max_j \hat\sigma_j^2$
Recall that the lower bound in Theorem 3.1 for $\inf\{\|Xu\|_{2,n} : \|X_0 u\|_2 = 1, \ \|u\|_1 \le M\}$ involves ${\bf E}\|W\|_\infty$. Bounding ${\bf E}\|W\|_\infty$ leads to moment conditions on the entries in $X_0$. The transfer principle now leads to requiring a bound for $\max_j \hat\sigma_j^2$. The latter is clearly a more difficult task than the former. In the non-normalized case this appears to be the price to pay for the application of the elegant transfer principle.
We first assume sub-Gaussian tail behaviour in Lemma 5.1 and then moments up to order log p in Lemma 5.2.

Normalized design
Define $\tilde X_j := X_j / \hat\sigma_j$, $j = 1, \ldots, p$, and $\tilde X := (\tilde X_1, \ldots, \tilde X_p)$, and let $\tilde\phi^2(L, S)$ be the compatibility constant for normalized design as given in Section 1. Similarly, the (empirical) restricted eigenvalue for normalized design is defined with $\tilde X$ in place of $X$. In [4] the (theoretical) adaptive restricted eigenvalue $\kappa_*^2(L, S)$ is defined. Clearly $\kappa_*^2(L, S) \le \kappa_0^2(L, S)$. We prove in Theorem 5.3 that the empirical compatibility constant $\tilde\phi^2(L, S)$ can be bounded from below by the theoretical adaptive restricted eigenvalue. The theorem establishes that compatibility needs no further moment conditions on the entries in $X_0$. If we do assume such moment conditions on the entries $X_{0,j}$ with $j \in S$, the results can be extended to restricted eigenvalues, as shown in [13] for the case of 4-th order strong isotropy (defined in Definition 6.1), and as shown in the next theorem.
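The normalization step can be sketched as follows (a hedged illustration with our own naming; the Monte Carlo search over feasible $u$ only yields an upper bound on $\tilde\phi^2(L, S)$, since any feasible $u$ does).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 30
X = rng.standard_normal((n, p))

sigma_hat = np.sqrt((X ** 2).mean(axis=0))   # sigma_hat_j = ||X_j||_{2,n}
X_tilde = X / sigma_hat                       # normalized design
R_hat = X_tilde.T @ X_tilde / n               # inner product matrix of X_tilde

S = np.arange(3)                              # active set with |S| = 3
L = 2.0

def compat_upper(R_hat, S, L, n_trials=5000):
    # Random feasible u give an UPPER bound on
    # phi_tilde^2(L, S) = min{|S| u^T R_hat u : ||u_S||_1 = 1, ||u_{-S}||_1 <= L}.
    p = R_hat.shape[0]
    in_S = np.zeros(p, dtype=bool)
    in_S[S] = True
    best = np.inf
    for _ in range(n_trials):
        u = rng.standard_normal(p)
        u[in_S] /= np.abs(u[in_S]).sum()                  # ||u_S||_1 = 1
        u[~in_S] *= min(1.0, L / np.abs(u[~in_S]).sum())  # ||u_{-S}||_1 <= L
        best = min(best, len(S) * (u @ R_hat @ u))
    return best

phi2_upper = compat_upper(R_hat, S, L)
```

By construction $\hat R$ has unit diagonal, which is the point of normalizing: column-scale information is removed before compatibility is assessed.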

Related work
Before discussing related work we present the definitions of the concepts used. Recall that in this paper we require weak isotropy (see Definition 2.4).
Definition 6.1. Let $m \ge 2$. The random vector $X_0 \in \mathbb{R}^p$ is strongly $m$-th order isotropic with constant $C_m$ if for all $u \in \mathbb{R}^p$ with $\|X_0 u\|_2 = 1$ it holds that $\|X_0 u\|_m \le C_m$.

Definition 6.2. The random vector $X_0 \in \mathbb{R}^p$ satisfies the $L_1$-$L_2$ property with constant $C$ if for all $u \in \mathbb{R}^p$ with $\|X_0 u\|_2 = 1$ it holds that $\|X_0 u\|_1 \ge 1/C$.

Definition 6.3. The random vector $X_0 \in \mathbb{R}^p$ satisfies the small ball property with constants $C_1 > 0$ and $C_2 > 0$ if for all $u \in \mathbb{R}^p$ with $\|X_0 u\|_2 = 1$ it holds that ${\rm I\!P}(|X_0 u| \ge C_1) \ge C_2$.

It can be shown that, for appropriate constants, one has (for $m > 2$): strong $m$-th order isotropy $\Rightarrow$ weak $m$-th order isotropy $\Rightarrow$ $L_1$-$L_2$ property $\Rightarrow$ small ball property.
E.g. for the last implication see [7].
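The first implication is just Markov's inequality, which is easy to see numerically; the snippet below (ours, not from any cited reference) checks the weak-isotropy tail bound for a standard Gaussian direction, with the strong $4$-th order constant estimated empirically.

```python
import numpy as np

rng = np.random.default_rng(6)
z = rng.standard_normal(500_000)     # plays the role of X_0 u with ||X_0 u||_2 = 1
m = 4
C_m = np.mean(np.abs(z) ** m) ** (1 / m)   # empirical strong m-th order constant

# Markov's inequality turns the strong moment bound into the weak (tail) form:
# P(|X_0 u| > t) <= E|X_0 u|^m / t^m = (C_m / t)^m.
tail_ok = all(np.mean(np.abs(z) > t) <= (C_m / t) ** m for t in (1.0, 2.0, 3.0))
```

For a standard Gaussian, $C_m^4 \approx {\bf E}|g|^4 = 3$, and the empirical tails sit comfortably below the Markov bound at each threshold.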

Relation of this work with [7] and [13]
The paper [7] obtains lower bounds for the smallest eigenvalue of $\hat\Sigma$ for the case $p \le n$. Their approach allows one to show that for $p \ll n$ one has $u^T \hat\Sigma u \ge (1 - \Delta) u^T \Sigma_0 u$ uniformly in $u \in \mathbb{R}^p$, with large probability, for some small $\Delta$. Such a result is not stated explicitly, but it is easy to infer. The bounds in [7] are better than the first result (3.3) of Theorem 3.1. The paper employs a type of "peeling device" and the fact that for all $0 < a < b$ the collection $\{\{x : |x^T u| > K\} : u \in \mathbb{R}^p, \ a \le K \le b\}$ is a VC class with dimension at most $p$. If we had "good" bounds for the entropy for $\|\cdot\|_{2,n}$ of such classes restricted to $\ell_1$-balls, their argument could be extended to the case $p > n$ with $\ell_1$-restrictions. However, how to derive "good" entropy bounds for such classes is as yet not clear to us. Both papers [7] and [13] assume $m$-th order isotropy (defined here in Definitions 2.4 and 6.1). The paper [7] has results with weak isotropy for any $m > 2$, whereas [13] assumes strong isotropy with $m = 4$. The paper [13] shows that by a transfer principle (described here in Theorem 5.1) a result for $p \le n$ can be invoked to derive that also for the case $p \gg n$ one has $u^T \hat\Sigma u \ge (1 - \Delta') u^T \Sigma_0 u$ uniformly in $\|u\|_1 \le M \sqrt{u^T \Sigma_0 u}$, with large probability, for some small $\Delta'$ and not too large $M$ (generally of small order $\sqrt{n / \log p}$). In the present paper we consider weak isotropy with $m > 2$, as in [7], and we show by a direct method that $u^T \hat\Sigma u \ge (1 - \Delta) u^T \Sigma_0 u$ uniformly in $\|u\|_1 \le M \sqrt{u^T \Sigma_0 u}$, with large probability, for some small $\Delta$. Here we assume sub-exponential tails for the entries in $X_0$, or, inserting results from [9], existence of moments up to order $\log p$ for these entries. We compared the result with the one using the transfer principle of [13]. Our finding is that the transfer principle needs slightly stronger moment conditions.
In fact, our direct approach requires a bound for the maximum of the $p$ Rademacher averages of the columns of $X$, whereas the approach using the transfer principle makes it necessary to bound the maximal length of the $p$ columns of $X$. Both can be dealt with by assuming higher order moments, but the Rademacher averages clearly require fewer moments than the lengths.
The paper [13] shows that when the columns of $X$ are normalized to all have equal length, the transfer principle leads to lower bounds for the (empirical) compatibility constant and (empirical) restricted eigenvalues assuming only 4-th order strong isotropy and moments of order larger than 4 for the entries in $X_0$. We presented this result in Section 5, relaxing 4-th order strong isotropy to $m$-th order weak isotropy with $m > 2$. Moreover, we derive that the compatibility constant $\tilde\phi^2(L, S)$ is positive with large probability assuming only isotropy and no additional moment assumptions on the entries in $X_0$. Thus, using normalized design one obtains exact recovery under isotropy only.

Further related work
In [15] a result of [14] concerning a lower bound for restricted eigenvalues is extended from the Gaussian case to the sub-Gaussian case. The paper [1] considers the case of log-concave distributions, which is related to sub-exponentiality of the vector $X_0$ (the sub-exponential variant of Definition 2.3). The papers [16] and [7] provide lower bounds for the empirical smallest eigenvalue $\hat\psi^2 := \min\{u^T \hat\Sigma u : \|u\|_2 = 1\}$ for the case where $p$ is at most $n$. The paper [16] uses higher order isotropy conditions (defined in Definitions 6.1 and 2.4), and the paper [7] uses these too, but in addition explores small ball properties (defined in Definition 6.3). The paper [9] considers the null space property and restricted eigenvalues $\hat\kappa^2(L, S)$ invoking small ball properties. Indeed, they show that small ball properties are very natural requirements when one aims at lower bounds. With the small ball property one obtains an "isomorphic" bound (we call this a result of type II in Table 1), that is, in a standard asymptotic framework the lower bound remains strictly smaller than the theoretical counterpart. Apart from the small ball property, the paper [9] needs moment conditions. It requires the stronger ("sub-Gaussian") conditions of Lemma 5.2 instead of the ("sub-exponential") condition (3.6) of Corollary 3.1. The papers [8] and [9] show that moment conditions are necessary for exact recovery.
In Table 1 we present a summary of the results in the cited papers in comparison with the present paper. Of course it is not possible to make a simple comparison that does all aspects of the cited papers justice. The summary should be seen as focussing on what are, in our view, the relevant differences.

The case of (almost) bounded random variables
The bounded case is considered in [15] and a reformulation is in [18]. It is shown there that when $\|X_0\|_\infty \le K_X$, then for a universal constant $c_1$ and for all $t > 0$,

Table 1: The entry "pp" stands for the present paper. With "isotropy" we mean weak or strong isotropy. The "sub-Gaussian" results concern the lower tails of quadratic forms. With "conditions on $p$" we mean conditions stronger than the asymptotic one $\log p / n \to 0$. The entries "normalized" stand for normalized design and "non-normalized" for non-normalized design. The symbols $\kappa^2$, $\psi^2$ and $\phi^2$ are shorthand for restricted eigenvalue, smallest eigenvalue and compatibility constant respectively. With results of "type I" we mean results in terms of theoretical counterparts. Results of "type II" are in terms of the constants occurring e.g. in the small ball property. The "moment conditions" are, apart from isotropy (or small ball properties), those on the entries of $X_0$.
with probability at least $1 - \exp[-t]$ a two-sided bound holds. Observe that this inequality goes both ways, and it does not require higher order isotropy conditions. On the other hand, the bounds involve an additional $\log^3 n$ factor. If we replace the assumption of bounded random variables by (say) a sub-Gaussian assumption, but do assume (say) strong isotropy, we can again use a truncation argument and obtain an inequality that goes both ways. Admittedly, the number of $\log p$- and $\log n$-terms increases. We first present an auxiliary truncation lemma.
Lemma 7.1. Suppose $X_0$ is strongly $m$-th order isotropic with constant $C_m$ and that its components are sub-Gaussian with constant $C$. Let $t > 0$ be arbitrary and let the truncation level be chosen accordingly. Then for all $u$ with $\|X_0 u\|_2 = 1$ a corresponding two-sided truncation bound holds.

Higher order isotropy
If $X_0$ is (strongly or weakly) $m$-th order isotropic with constant $C_m$ and $A$ is a $q \times p$ matrix, then clearly $A X_0$ is also (strongly or weakly) $m$-th order isotropic with constant $C_m$. In other words, the property is invariant under linear transformations. The same is true for sub-Gaussianity. In particular, we have invariance under any permutation of the $X_{0,j}$. In the next subsection we assume that the $\{X_{0,j}\}$ form a directed acyclic graph (possibly after some linear transformation), where the noise terms are a martingale difference array with fixed sub-Gaussian tail behaviour. In Subsections 8.2 and 8.3 we then extend this to the situation where the conditional tail behaviour is sub-Gaussian or Bernstein, with constants depending on predictable random variables. We consider there a filtration $\{\mathcal{F}_j\}_{j=1}^p$ and predictable random variables $\{V_j\}_{j=1}^p$ that satisfy, for some constants $m > 2$ and $\mu_m$, $\max_{1 \le j \le p} \|V_j\|_m \le \mu_m$.
We investigate strong $m$-th order isotropy. In fact, we give explicit expressions for $\|X_0 u\|_m$ in terms of $\|u\|_2$. This implies strong isotropy if we assume that the smallest eigenvalue $\psi_0^2$ of $\Sigma_0$ is positive. Obviously this also implies a bound for the largest eigenvalue $\psi_{\max}^2$ of $\Sigma_0$.

Directed acyclic graphs
Let $X_0$ be a vector of random variables with mean zero and covariance matrix $\Sigma_0 := {\bf E} X_0 X_0^T$. We want to find conditions such that for all $u \in \mathbb{R}^p$ with $\|X_0 u\|_2 = 1$ the random variable $X_0 u$ is sub-Gaussian with constant $C$. We will examine this here for the situation where $X_0$ has a directed acyclic graph (DAG) structure, that is, it satisfies (after an appropriate permutation of the indexes) the structural equations model
$$X_{0,j} = \sum_{k=1}^{j-1} \beta_{j,k} X_{0,k} + \epsilon_{0,j}, \quad j = 1, \ldots, p, \tag{8.1}$$
where $\{\epsilon_{0,j}\}_{j=1}^p$ is a martingale difference array for the filtration $\{\mathcal{F}_j\}_{j=0}^{p-1}$. We assume $X_{0,j}$ is $\mathcal{F}_j$-measurable, $j = 1, \ldots, p$. We moreover assume that $\omega_j^2 := {\rm var}(\epsilon_{0,j}) = {\bf E}\,{\rm var}(X_{0,j} \mid \mathcal{F}_{j-1})$ exists for all $j$. Note that model (8.1) holds when $X_0$ is Gaussian, for example. More generally, the standard linear structural equations model is a special case. The latter model assumes that for $j \ge 2$ the noise $\epsilon_{0,j}$ is independent of $\{X_{0,k}\}_{k=1}^{j-1}$, and that $\epsilon_{0,1}, \ldots, \epsilon_{0,p}$ are independent mean-zero random variables.
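A minimal simulation of a linear structural equations model of the form (8.1) (with our own, hypothetical coefficient names) shows how $\Sigma_0$ is determined by the structural coefficients and matches the empirical inner product matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 5

# Strictly lower-triangular coefficients: X_{0,j} = sum_{k<j} B[j,k] X_{0,k} + eps_j
B = np.tril(rng.uniform(-0.5, 0.5, size=(p, p)), k=-1)
omega = rng.uniform(0.5, 1.5, size=p)         # noise standard deviations omega_j

A = np.linalg.inv(np.eye(p) - B)              # solve the recursion: X_0 = A @ eps
Sigma_0 = A @ np.diag(omega ** 2) @ A.T       # theoretical inner product matrix

# Simulate i.i.d. copies and compare the empirical matrix with Sigma_0.
n = 200_000
eps = rng.standard_normal((n, p)) * omega     # independent Gaussian noise
X = eps @ A.T                                 # rows are i.i.d. copies of X_0
Sigma_hat = X.T @ X / n
err = float(np.abs(Sigma_hat - Sigma_0).max())
```

Since $X_0$ is a linear image of the independent noise vector, its isotropy and sub-Gaussianity properties are inherited from those of the noise, as noted below Lemma 8.1.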
Lemma 8.1. Assume the structural equations model (8.1). Assume in addition that for some constant $C$, for all $\lambda \in \mathbb{R}$ and all $j$,
$${\bf E}\left[\exp[\lambda \epsilon_{0,j}] \mid \mathcal{F}_{j-1}\right] \le \exp[C^2 \lambda^2 \omega_j^2 / 2].$$
Then $X_0$ is sub-Gaussian with constant $C$.
The above lemma follows from the fact that its condition implies that the vector $\epsilon_0 := (\epsilon_{0,1}, \ldots, \epsilon_{0,p})^T$ is sub-Gaussian with constant $C$. If $\epsilon_0$ is (strongly or weakly) $m$-th order isotropic with constant $C_m$, then under the structural equations model (8.1) the vector $X_0$ is also (strongly or weakly) $m$-th order isotropic with constant $C_m$. This follows from the fact that $X_0$ is a linear transformation of $\epsilon_0$. One may use the results of the next two subsections to check isotropy of $\epsilon_0$.

The conditionally sub-Gaussian case
Let $\{\mathcal{F}_j\}_{j=0}^p$ be a filtration and, for $j = 1, \ldots, p$, let $X_{0,j}$ be $\mathcal{F}_j$-measurable and $V_j$ be $\mathcal{F}_{j-1}$-measurable. We assume that for some $m > 2$, $\max_{1 \le j \le p} \|V_j\|_m =: \mu_m < \infty$.
For general predictable $\{V_j\}$ we have, for $2 < m_0 < m$ and all $\|u\|_2 = 1$, a corresponding bound on $\|X_0 u\|_{m_0}$ in terms of $\mu_m$.

The conditionally Bernstein (or sub-exponential) case
Let, as in the previous subsection, $\{\mathcal{F}_j\}_{j=0}^p$ be a filtration and, for $j = 1, \ldots, p$, let $X_{0,j}$ be $\mathcal{F}_j$-measurable and $V_j$ be $\mathcal{F}_{j-1}$-measurable, satisfying for some $m > 2$, $\max_{1 \le j \le p} \|V_j\|_m =: \mu_m < \infty$.
As in the previous subsection, we prove strong isotropy, but now under a different condition.
Lemma 8.3. Suppose that for some constant $K$ and all $j$ the conditional distribution of $X_{0,j}$ given $\mathcal{F}_{j-1}$ is Bernstein with constants $V_j$ and $K$. If the $\{V_j\}_{j=1}^p$ are non-random, then strong $m$-th order isotropy holds for all $\|u\|_2 = 1$. For general predictable $\{V_j\}_{j=1}^p$ we have, for all $2 < m_0 < m$ and all $\|u\|_2 = 1$, strong $m_0$-th order isotropy.

Note that the conditions of the above lemma imply that the entries in $X_0$ are Bernstein with constants $\mu_2$ and $K$, where $\mu_2 := \max_{1 \le j \le p} \|V_j\|_2 \le \mu_m$. In other words, the conditions of the lemma imply the bound of Theorem 3.1 with $\delta_n = \mu_2 \sqrt{2 \log(2p)/n} + K \log(2p)/n$ and with $m$ replaced by any $m_0 < m$.

Proofs for Section 3
Recall that Theorem 3.1 presents lower bounds for sparse quadratic forms.
Proof of Theorem 3.1. For $Z \in \mathbb{R}$ and $K > 0$, we introduce the truncated version $[Z]_K := \min\{|Z|, K\}$. We obviously have, for any $K > 0$ and $u \in \mathbb{R}^p$, that the truncated quadratic form lower bounds the original one. By symmetrization (see e.g. [20], p. 108) and contraction ([10], p. 112), the expected supremum of the centered truncated process can be bounded by Rademacher averages; continuing with the last bound, we apply the bounds on ${\bf E}\|W\|_\infty$ for deriving (3.3) and (3.4). Next we apply the concentration inequality of [3] to $Z$. We get a deviation bound for all $t > 0$, where we used, for $\|X_0 u\|_2 \le 1/K$, a bound on the variance term. We invoke that
$$\sqrt{\frac{2t}{n}} \sqrt{\frac{1}{K^2} + 4 {\bf E} Z} \le \sqrt{\frac{2t}{n}} \left(\frac{1}{K} + 2 \sqrt{{\bf E} Z}\right).$$
Collecting the bounds and choosing the truncation level $K$ appropriately gives the result with probability at least $1 - \exp[-t]$. □

Remark 9.1. With assumptions weaker than the weak isotropy assumption used in the present paper, for example with the $L_1$-$L_2$ property, one can prove lower bounds along the same lines as for Theorem 3.1. One applies, instead of the truncation inequality (9.1) in the proof of Theorem 3.1, a corresponding inequality for the absolute values. One can then proceed using the arguments following (9.1) in the proof of Theorem 3.1, using the Lipschitz property of the absolute value function $Z \mapsto |Z|$.
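The two elementary facts driving the truncation argument, namely that truncation only decreases the quadratic form and that the truncation bias is controlled by an $m$-th moment, can be seen numerically (an illustration under our own choices of $K$ and $m$).

```python
import numpy as np

rng = np.random.default_rng(5)
n, K, m = 100_000, 3.0, 4

z = rng.standard_normal(n)            # plays the role of (Xu)_i with ||X_0 u||_2 = 1
sq = z ** 2
sq_trunc = np.minimum(sq, K ** 2)     # truncated squares: min(z_i^2, K^2)

# Truncation can only decrease the empirical quadratic form, and the bias
# is controlled by higher moments: z^2 - min(z^2, K^2) <= |z|^m / K^{m-2}.
bias = float(sq.mean() - sq_trunc.mean())
bound = float(np.mean(np.abs(z) ** m) / K ** (m - 2))
```

The pointwise inequality $z^2 - z^2 \wedge K^2 \le |z|^m / K^{m-2}$ holds deterministically, so the empirical bias is always below the empirical moment bound; this is the mechanism by which weak $m$-th order isotropy makes the truncation error negligible for large $K$.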
For results assuming only the small ball property, we refer to [9].
We now provide a proof for the sub-Gaussian case along the same lines as the proof of Theorem 3.1.
Proof of Lemma 3.1. We use the same notation as in the proof of Theorem 3.1, with truncation at a value $K$. Whenever $\|X_0 u\|_2 = 1$, the truncation error can be controlled using sub-Gaussianity. The result then follows by the same arguments as those used for Theorem 3.1, inserting that in the sub-Gaussian case one has ${\bf E}\|W\|_\infty \le \delta_n'$. □

Proofs for Section 4
We first prove the "almost isometric" bound for the compatibility constant and restricted eigenvalue.

Proof of Theorem 4.1. By Theorem 3.1 we know that, with probability at least $1 - \exp[-t]$, uniformly in $u$ with $\|u\|_1 \le M \|X_0 u\|_2$,
$$\|Xu\|_{2,n}^2 \ge \left(1 - \Delta_n^L(M, t)\right) \|X_0 u\|_2^2.$$
If $\|u_S\|_1 = 1$ and $\|u_{-S}\|_1 \le L$ we clearly have $\|u\|_1 \le L + 1 \le (L+1)\sqrt{s}\,\|X_0 u\|_2 / \phi_0(L, S)$. This implies the lower bound for the compatibility constant. If $\|u_S\|_2 = 1$ and $\|u_{-S}\|_1 \le L \|u_S\|_1$ we again have $\|u\|_1 \le (L+1)\sqrt{s}\,\|u_S\|_2 \le (L+1)\sqrt{s}\,\|X_0 u\|_2 / \kappa_0(L, S)$, which implies the result for the restricted eigenvalue.
□

We now check the fourth moments, i.e. the second moments of the quadratic forms.
Proof of Lemma 4.1.
One readily sees that each $X_{0,j}$ has Orlicz norm $\|\cdot\|_{\Psi_1}$ bounded by $K_X + \sigma_X$. Hence for all $t > 0$ and all $j$ an exponential tail bound holds, and it follows that for all $t > 0$ a corresponding bound for $\max_j |X_{0,j}|$ holds as well. We have, for $\|u\|_1 \le M$ and $\|X_0 u\|_2 \le 1$, a resulting bound on $|X_0 u|$. Now, for a random variable $Z$ satisfying ${\rm I\!P}(|Z| > bt + K/2) \le c \exp[-t]$ for all $t > 0$ and certain constants $b$, $c$ and $K$, the fourth moment can be bounded.
But by Chebyshev's inequality, for all $t > 0$ the complementary event has small probability. Insert the bound of Lemma 4.1 for $\|X_0 u^*\|_4^4 / \|X_0 u^*\|_2^4$ or, in the case $m > 4$, the bound implied by $m$-th order isotropy. This gives that with probability at least $1 - 1/t$ the upper bound holds. The result for the restricted eigenvalue follows in the same way. □

Proofs for Section 5
We use the transfer principle to obtain lower bounds for sparse quadratic forms. The result now follows from Corollary 5.1. □

To handle the event $B = \{\max_j \hat\sigma_j^2 \le 1 + \epsilon\}$ we gave two lemmas. Here are their proofs.
Proof of Lemma 5.1. Recall we assumed in the beginning of Section 5 that $\|X_{0,j}\|_2 = 1$ for all $j$. The assumption that the $X_{0,j}$ are sub-Gaussian implies $\|X_{0,j}\|_{\Psi_2} \le 2C$.
We therefore have, by the lemma just cited, control of $\max_j \hat\sigma_j^2$ and hence of the event $B$. □

The truncated quadratic form is non-negative on $A$ and less than or equal to 1. So on $A$, by the transfer principle (Theorem 5.1), we know for all $u \in \mathbb{R}^p$ with $\|u\|_1^2 \le (L+1)^2$ that the required lower bound holds. The further bounds on the event $A \cap B_S$ follow in the same way. □

Proofs for Section 7
We show that a vector $X_0$ which is $m$-th order strongly isotropic and has $p$ sub-Gaussian entries is, up to constants, "almost bounded" by $\sqrt{\log(2p)}$. The proof is finished by applying this inequality. □

If we have $n$ independent copies of a vector $X_0$ which is $m$-th order strongly isotropic and has $p$ sub-Gaussian entries, these $n \times p$ variables are, up to constants, "almost bounded" by $\sqrt{\log(np)}$. For such bounded random variables, we now prove uniform convergence of the empirical norm.