Adaptive estimation of covariance matrices via Cholesky decomposition

This paper studies the estimation of a large covariance matrix. We introduce a novel procedure called ChoSelect, based on the Cholesky factor of the inverse covariance. This method uses a dimension reduction strategy by selecting the pattern of zeros of the Cholesky factor. Alternatively, ChoSelect can be interpreted as a graph estimation procedure for directed Gaussian graphical models. Our approach is particularly relevant when the variables under study have a natural ordering (e.g. time series) or, more generally, when the Cholesky factor is approximately sparse. ChoSelect achieves non-asymptotic oracle inequalities with respect to the Kullback-Leibler entropy. Moreover, it satisfies various adaptive properties from a minimax point of view. We also introduce and study a two-stage procedure that combines ChoSelect with the Lasso. This last method enables the practitioner to choose their own trade-off between statistical efficiency and computational complexity. Moreover, it is consistent under weaker assumptions than the Lasso. The practical performances of the different procedures are assessed on numerical examples.


Introduction
The problem of estimating large covariance matrices has recently attracted a lot of attention. On the one hand, there is an inflation of high-dimensional data in many scientific areas: gene arrays, functional magnetic resonance imaging (fMRI), image classification, and climate studies. On the other hand, many data analysis tools require an estimation of the covariance matrix Σ. This is for instance the case for principal component analysis (PCA), for linear discriminant analysis (LDA), or for establishing independences or conditional independences between the variables. It has been known for a long time that the simplest estimator, the sample covariance matrix, performs poorly when the dimension p of the vector is larger than the number of observations n (see for instance Johnstone [16]).
Depending on the objectives of the analysis and on the applications, different approaches are used for estimating high-dimensional covariance matrices. Indeed, if one wants to perform PCA or to establish independences between the covariates, then it is advised to estimate the covariance matrix Σ directly. In contrast, performing LDA further relies on the inverse of the covariance matrix. In the sequel, we call this matrix the precision matrix and denote it by Ω. Sparse precision matrices are also of interest because of their connection with graphical models and conditional independence. The pattern of zeros of Ω indeed corresponds to the graph structure of the distribution (see for instance Lauritzen [20] Sect. 5.1.3).
Most of the methods based on direct covariance matrix estimation amount to regularizing the empirical covariance matrix. Let us mention the work of Ledoit and Wolf [21], who propose to replace the sample covariance with a linear combination of it and the identity matrix. However, these shrinkage methods are known to provide an inconsistent estimation of the eigenvectors [17]. Applying recent results of random matrix theory, El Karoui [11] and Bickel and Levina [5] have studied thresholding estimators of Σ. The resulting estimator is sparse and is proved (for instance in [5]) to be consistent with respect to the operator norm under mild conditions as long as log(p)/n goes to 0. These results are of particular interest for performing PCA since they imply a consistent estimation of the eigenvalues and the eigenvectors. Observe that all these methods are invariant under permutation of the variables. Yet, in many applications (for instance time series, spectroscopy, climate data), there exists a natural ordering on the data. In such a case, one should use other procedures to obtain faster rates of convergence. Among others, Furrer and Bengtsson [14] and Bickel and Levina [6] use banded or tapering estimators. Again, the consistency of such estimators is proved. Moreover, all these methods share an attractive computational cost. We refer to the introduction of [5] for a more complete review.
The estimation procedures of the precision matrix Ω fall into three categories, depending on whether there exists an ordering on the variables and to what extent this ordering is important. If there is no such ordering, d'Aspremont et al. [3] and Yuan and Lin [33] have adopted a penalized likelihood approach by applying an l1 penalty to the entries of the precision matrix. It has also been discussed by Rothman et al. [27] and Friedman et al. [13] and extended by Lam and Fan [19] or Fan et al. [12] to other penalization methods. These estimators are known to converge with respect to the Frobenius norm (for instance [27]) when the underlying precision matrix is sparse enough.
When there is a natural ordering on the covariates, the regularization is introduced via the Cholesky decomposition Ω = T* S^{-1} T, where T is a lower triangular matrix with a unit diagonal and S is a diagonal matrix with positive entries. The elements of the i-th row of T can be interpreted as the regression coefficients of the i-th component on its predecessors. This will be further explained in Section 2.1. For time series or spectroscopy data, it is likely that the relevant covariates for the regression of the i-th component are its closest predecessors. In other words, it is expected that the matrix T is approximately banded. With this in mind, Wu and Pourahmadi [31] introduce a k-banded estimator of the matrix T by smoothing along the first k subdiagonals and setting the rest to 0. The choice of k is made by applying AIC (Akaike [1]). They prove element-wise consistency of their estimator but do not provide any high-dimensional result with respect to a loss function such as the Kullback or Frobenius losses. Bickel and Levina [6] also consider k-banded estimators of T and are able to prove rates of convergence in the matrix operator norm. Moreover, they introduce a cross-validation approach for choosing a suitable k, but they do not prove that the selection method achieves adaptiveness. More recently, Levina et al. [22] propose a new banding procedure based on a nested Lasso penalty. Unlike the previous methods, they allow the number k = k_i used for banding to depend on the line i of T. They do not state any theoretical result, but they exhibit numerical evidence of its efficiency. In the sequel, we call the issue of estimating Ω by banding the matrix T the banding problem.
Between the first approach based on precision matrix regularization and the second one which relies on banding the Cholesky factor, there exists a third one which is not permutation invariant, but does not assume that the matrix T is approximately banded. It consists in approximating T by a sparse lower triangular matrix (i.e. most of the entries are set to 0).
When is it interesting to adopt this approach? If we consider a directed graphical model whose graph is sparse and compatible with the ordering of the variables, then the Cholesky factor T is sparse. Indeed, its pattern of zeros is related to the directed acyclic graph (DAG) of the directed graphical model associated to this ordering (see Section 2.1 for a definition). More generally, it may be worth using this strategy even if one does not know a "good" ordering on the variables. On the one hand, most of the procedures based on the estimation of T are computationally faster than their counterparts based on the estimation of Ω. This is due to the decomposition of the likelihood into p independent terms explained in Section 3. On the other hand, there exist examples of sparse Cholesky factors T such that the precision matrix Ω is not sparse at all. Consider for instance a matrix T which is zero except on the diagonal and on the last line. Admittedly, it is not completely satisfying to apply a method that depends on the ordering of the variables when we do not know a good ordering. There are indeed examples of sparse precision matrices Ω such that, for a bad ordering, the Cholesky factor is not sparse at all (see [27] Sect. 4). Nevertheless, since sparse precision matrices and sparse Cholesky factors have different approximation capacities, it remains unclear which one should be favored.
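The last-line example above can be checked numerically: take T equal to the identity except for a dense last row; with S = I_p, the precision matrix Ω = T* S^{-1} T = T* T then has no zero off-diagonal entry. A minimal sketch:

```python
import numpy as np

# A Cholesky factor T that is zero except on the diagonal and the last line.
p = 5
T = np.eye(p)
T[-1, :-1] = -0.5          # dense last row, all other off-diagonal entries zero

# Precision matrix with S = I_p: Omega = T' T.
Omega = T.T @ T

# Every off-diagonal entry of Omega is non-zero, although T is very sparse.
dense_offdiag = np.count_nonzero(np.triu(Omega, 1))
```

Here `dense_offdiag` equals p(p-1)/2, i.e. the full upper triangle, although T has only p-1 non-zero off-diagonal entries.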
In the sequel, we call the issue of estimating T in the class of sparse lower triangular matrices the complete graph selection problem, by analogy with the complete variable selection problem in regression. In this setting, Huang et al. [15] propose to add an l1 penalty on the elements of T. More recently, Lam and Fan [19] have extended the method to other types of penalties and have proved its consistency in the Frobenius norm when the matrix T is exactly sparse. Finally, let us mention that Wagaman and Levina [30] have developed a data-driven method based on the Isomap algorithm for picking a "good" ordering on the variables.
In this paper, we consider both the banding problem and the complete graph selection problem. We introduce a general l 0 penalization method based on maximum likelihood for estimating the matrices T and S. We exhibit a non-asymptotic oracle inequality with respect to the Kullback loss without any assumption on the target Ω.
For the adaptive banding issue, our method is shown to achieve the optimal rate of convergence and is adaptive to the rate of decay of the entries of T when one moves away from the diagonal. Corresponding minimax lower bounds are also provided. We also compute asymptotic rates of convergence in the Frobenius norm. Contrary to the l 1 penalization methods, we explicitly provide the constant for tuning the penalty. Finally, the method is computationally efficient.
For complete graph selection, we prove that our estimator non-asymptotically achieves the optimal rates of convergence when T is sparse. We also provide the corresponding minimax lower bounds. To our knowledge, these minimax lower bounds with respect to the Kullback discrepancy are new. Moreover, our method is flexible and allows one to incorporate prior knowledge on the graph. However, this procedure is computationally intensive, which makes it infeasible for p larger than 30. This is why we introduce in Section 7 a computationally faster version of the estimator by applying a two-stage procedure. This method inherits some of the good properties of the previous one and applies for arbitrarily large p. Moreover, it is shown to select the pattern of zeros consistently under weaker assumptions than the Lasso. These theoretical results are corroborated by a simulation study.
Since data analysis methods like LDA are based on the likelihood, we find it more relevant to obtain rates of convergence with respect to the Kullback-Leibler loss than with respect to the Frobenius norm.
imsart-ejs ver. 2010/09/07 file: Chol_ejs.tex date: October 11, 2010
Moreover, considering the Kullback loss allows us to obtain rates of convergence that are free of hidden dependencies on parameters such as the largest eigenvalue of Σ. In this sense, we argue that this loss function is more natural for the statistical problem under consideration.
The paper is organized as follows. Section 2 gathers some preliminaries about the Cholesky decomposition and introduces the main notations. In Section 3, we describe the procedure and provide an algorithm for computing the estimator Ω. In Section 4, we state the main result of the paper, namely a general non-asymptotic oracle-type inequality for the risk of Ω. In Section 5, we specify our result to the problem of adaptive banding. Moreover, we prove that the so-defined estimator is minimax adaptive to the decay of the off-diagonal coefficients of the matrix T. Asymptotic rates of convergence with respect to the Frobenius norm are also provided. In Section 6, we investigate the complete graph selection issue. We first derive a non-asymptotic oracle inequality and then derive that our procedure is minimax adaptive to the unknown sparsity of the Cholesky factor T. As previously, we provide asymptotic rates of convergence with respect to the Frobenius loss function. Moreover, we introduce a computationally feasible estimation procedure in Section 7 and we derive an oracle-type inequality and a sufficient condition for consistent selection of the graph. In Section 8, the performances of the procedures are assessed on numerical examples for both the banding and the complete graph selection problems. We make a few concluding remarks in Section 9. Sketches of the proofs are given in Section 10, while the details are postponed to the technical Appendix [29].

Link with conditional regression and graphical models
In this subsection, we review basic properties about Cholesky factors and explain their connection with directed graphical models.
We consider a random vector X = (X_i)_{1≤i≤p} of size p which follows a centered normal distribution with covariance matrix Σ. We always assume that Σ is non-singular. We recall that the precision matrix Ω uniquely decomposes as Ω = T* S^{-1} T, where T is a lower triangular matrix with unit diagonal and S is a diagonal matrix with positive entries. Let us first emphasize the connection between the modified Cholesky factor T and conditional regressions. For any i between 2 and p, we note t_i the vector of size i - 1 made of the i - 1 first elements of the i-th line of T. By convention, t_1 is the vector of null size. Besides, we note s_i the i-th diagonal element of the matrix S. Let us define the vector ε = (ε_i)_{1≤i≤p} of size p as ε := TX. By standard Gaussian properties, the covariance matrix of ε is S. Since the diagonal of T is one, it follows that, for any 1 ≤ i ≤ p,

X_i = - Σ_{j<i} T[i, j] X_j + ε_i ,     (1)

where Var(ε_i) = s_i and the (ε_i)_{1≤i≤p} are independent.
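The correspondence between the rows of T and the conditional regressions can be made explicit numerically. The sketch below (the function name is ours) builds T and S from Σ by computing, for each i, the regression coefficients of X_i on its predecessors, and recovers the decomposition Ω = T* S^{-1} T:

```python
import numpy as np

def modified_cholesky(Sigma):
    """Modified Cholesky decomposition of the precision matrix:
    Omega = T' S^{-1} T, with T unit lower triangular and S diagonal.
    Row i of T stores (minus) the coefficients of the regression of X_i
    on X_1, ..., X_{i-1}; s_i is the residual (conditional) variance."""
    p = Sigma.shape[0]
    T = np.eye(p)
    S = np.zeros(p)
    S[0] = Sigma[0, 0]
    for i in range(1, p):
        # Best linear predictor of X_i given its predecessors.
        beta = np.linalg.solve(Sigma[:i, :i], Sigma[:i, i])
        T[i, :i] = -beta                              # t_i = -beta, unit diagonal kept
        S[i] = Sigma[i, i] - Sigma[i, :i] @ beta      # conditional variance s_i
    return T, S
```

Since ε = TX has covariance T Σ T* = S, one can check that T Σ T* is diagonal and that T* diag(1/s_i) T equals Σ^{-1}.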
Let −→G be a directed acyclic graph whose vertex set is {1, . . . , p}. We assume that the direction of the edges is compatible with the natural ordering of {1, . . . , p}. In other words, we assume that any edge j → i in −→G satisfies j < i. Given a vertex i, the set of its parents is defined by pa_{−→G}(i) := {j < i , j → i}. Then, the vector X is said to be a directed Gaussian graphical model with respect to −→G if, for any 1 ≤ i ≤ p, X_i is independent of (X_k)_{k<i, k∉pa_{−→G}(i)} conditionally on (X_k)_{k∈pa_{−→G}(i)}. This means that only the variables (X_k)_{k∈pa_{−→G}(i)} are relevant for predicting X_i among the variables (X_k)_{k<i}. There are several definitions of directed Gaussian graphical models (see Lauritzen [20]), which are all equivalent when Σ is non-singular.
There exists a correspondence between the graph −→G and the Cholesky factor T of the precision matrix Ω. If X is a directed graphical model with respect to −→G, then T[i, j] = 0 for any j < i such that j is not a parent of i. Conversely, X is a directed graphical model with respect to the graph −→G defined by j → i if and only if T[i, j] ≠ 0. Hence, it is equivalent to estimate the pattern of zeros of T and the minimal graph −→G compatible with the ordering. These definitions and properties depend on a particular ordering of the variables. It is beyond the scope of this paper to discuss graph estimation when the ordering is not fixed. We refer the interested reader to Kalisch and Bühlmann [18].
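In practice, reading the minimal compatible DAG off the Cholesky factor amounts to computing pa(i) = {j < i : T[i, j] ≠ 0} for each vertex; a one-function sketch (the function name is ours):

```python
import numpy as np

def parents_from_T(T, tol=1e-12):
    """Minimal DAG compatible with the ordering: pa(i) = {j < i : T[i, j] != 0}.
    Entries smaller than `tol` in absolute value are treated as zeros."""
    p = T.shape[0]
    return {i: [j for j in range(i) if abs(T[i, j]) > tol] for i in range(p)}
```

For instance, a factor T whose only non-zero sub-diagonal entries are T[1, 0] and T[2, 1] corresponds to the chain 0 → 1 → 2.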

Notations
For any set A, |A| stands for its cardinality. We are given n independent observations of the random vector X. We always assume that X follows a centered Gaussian distribution N(0_p, Σ). In the sequel, we note X the n × p matrix of the observations. Moreover, for any 1 ≤ i ≤ p and any subset A of {1, . . . , p − 1}, X_i and X_A respectively refer to the vector of the n observations of X_i and to the n × |A| matrix of the observations of (X_j)_{j∈A}.
In the sequel, K(Ω; Ω′) stands for the Kullback divergence between the centered normal distribution with covariance Ω^{-1} and the centered normal distribution with covariance Ω′^{-1}. We shall also sometimes assess the performance of the procedures using the Frobenius norm and the l2 operator norm. This is why we define the Frobenius norm of a matrix A by ‖A‖²_F := Σ_{i,j} A[i, j]² and write ‖A‖ for its l2 operator norm. For any matrix Ω, ϕ_max(Ω) stands for the largest eigenvalue of Ω. Finally, L, L_1, L_2, . . . denote universal constants that can vary from line to line. The notation L_· specifies the dependency of the constant on some quantities.
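The loss K(Ω; Ω′) has the usual closed form for centered Gaussians; written directly in terms of the precision matrices, it reads K(Ω; Ω′) = ½[tr(Ω′Ω^{-1}) − p − log det(Ω′Ω^{-1})]. A direct implementation:

```python
import numpy as np

def kullback(Omega, Omega_prime):
    """K(Omega; Omega'): Kullback-Leibler divergence between the centered
    Gaussian distributions with covariances Omega^{-1} and Omega'^{-1}."""
    p = Omega.shape[0]
    M = Omega_prime @ np.linalg.inv(Omega)
    _, logdet = np.linalg.slogdet(M)   # stable log-determinant
    return 0.5 * (np.trace(M) - p - logdet)
```

Note that K(Ω; Ω′) vanishes if and only if Ω = Ω′ and is not symmetric in its arguments.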

Description of the procedure
In this section, we introduce our procedure for estimating Ω given a n-sample of the vector X. For any i between 1 and p, m i stands for a subset of {1, . . . , i − 1}. By convention, m 1 = ∅. In terms of directed graphs, m i stands for the set of parents of i. Besides, we call any set m of the form m = m 1 × m 2 × . . . × m p a model. This model m is one to one with a directed graph whose ordering is compatible with the natural ordering of {1, . . . , p}. We shall sometimes call m a graph in order to emphasize the connection with graphical models.
Given a model m, we define T_m as the affine space of lower triangular matrices T with unit diagonal such that, for any i between 1 and p, the support (i.e. the non-zero coefficients) of t_i is included in m_i. We note Diag(p) the set of all diagonal matrices with positive entries on the diagonal. The matrices T_m and S_m are then defined as the maximum likelihood estimators of T and S:

(T_m, S_m) = arg min_{(T,S) ∈ T_m × Diag(p)} L_n(T, S) ,     (2)

where L_n(T, S) stands for the negative log-likelihood. Hence, the estimated precision matrix is Ω_m = T*_m S_m^{-1} T_m. This matrix Ω_m is the maximum likelihood estimator of Ω among the precision matrices that correspond to directed graphical models with respect to the graph m.
For any i between 1 and p, M i refers to a collection of subsets of {1, . . . , i − 1} and we call M := M 1 × . . . × M p a collection of models (or graphs). The choice of the collection M depends on the estimation problem under consideration. For instance, we shall use a collection corresponding to banded matrices when we will consider the banding problems. The collections M are specified for the banding problem and the complete graph selection problem in Sections 5 and 6.
Our objective is to select a model m ∈ M such that the Kullback-Leibler risk E[K(Ω; Ω_m)] is as small as possible. We achieve it through penalization. For any 1 ≤ i ≤ p, pen_i : M_i → R+ is a positive function that we shall explicitly define later. The penalty function pen : M → R+ is defined as

pen(m) := Σ_{i=1}^p pen_i(m_i) .

As mentioned earlier, the idea underlying the use of the matrices T and S lies in the regression models (1). Indeed, these regressions naturally appear when expanding the negative log-likelihood (2):

L_n(T, S) = (n/2) Σ_{i=1}^p [ log(s_i) + ‖X_i + X t_i‖_n² / s_i ] + c_n ,

where ‖·‖_n stands for the Euclidean norm in R^n divided by √n and c_n does not depend on (T, S). By definition of T_m and S_m, we easily derive that the i-th row vector t_{i,m_i} of T_m and the i-th diagonal element s_{i,m_i} of S_m respectively equal

t_{i,m_i} = arg min_{supp(t′_i) ⊆ m_i} ‖X_i + X t′_i‖_n²   and   s_{i,m_i} = ‖X_i + X t_{i,m_i}‖_n² ,     (3)

for any 1 ≤ i ≤ p. Here, supp(t′_i) stands for the support of t′_i. Hence, the row vector t_{i,m_i} is the least-squares estimator of t_i in the regression model (1) and s_{i,m_i} is the empirical conditional variance of X_i given X_{m_i}. There are two main consequences: first, Expression (3) emphasizes the connection between covariance estimation and linear regression with Gaussian design. Second, it highly simplifies the computational cost of our procedure. Indeed, the negative log-likelihood now writes L_n(T_m, S_m) = (n/2) Σ_{i=1}^p [log(s_{i,m_i}) + 1] + c_n, and it follows that m_i = arg min_{m_i ∈ M_i} log(s_{i,m_i}) + pen_i(m_i). This is why we suggest to compute m and Ω as follows. Assume we are given a collection of graphs M = (M_1, . . . , M_p) and penalty functions (pen_1(·), . . . , pen_p(·)).
1. For i going from 1 to p:
• Compute s_{i,m_i} for each model m_i ∈ M_i.
• Take m_i = arg min_{m_i ∈ M_i} log(s_{i,m_i}) + pen_i(m_i).
2. Compute the selected model m = (m_1, . . . , m_p) and the estimator Ω = T*_m S_m^{-1} T_m.
In what follows, we refer to this method as ChoSelect. In order to select m, one needs to compute all s_{i,m_i} for any i ∈ {1, . . . , p} and any model m_i ∈ M_i. Hence, the complexity of the procedure is proportional to Σ_{i=1}^p |M_i|. We further discuss computational issues and we provide a faster procedure in Section 7.
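The two steps above can be sketched for the nested ("banding") collections of Section 5, where each candidate m_i keeps the k closest predecessors of i. The penalty shape K·k/(n − k) below is an illustrative stand-in for the paper's pen_i, and the function name is ours:

```python
import numpy as np

def cho_select_banded(X, d, K=2.0):
    """Sketch of ChoSelect over nested candidate sets: for each row i, the
    candidates are {i-1}, {i-2, i-1}, ... up to the d closest predecessors,
    and the size k minimizing log(s_hat) + pen_i(k) is kept.
    The penalty K*k/(n-k) is a hypothetical stand-in for the paper's pen_i."""
    n, p = X.shape
    T = np.eye(p)
    S = np.zeros(p)
    S[0] = np.mean(X[:, 0] ** 2)
    for i in range(1, p):
        best = (np.inf, 0, None, None)
        for k in range(0, min(d, i) + 1):
            if k == 0:
                beta, resid = np.empty(0), X[:, i]
            else:
                preds = X[:, i - k:i]                     # k closest predecessors
                beta, *_ = np.linalg.lstsq(preds, X[:, i], rcond=None)
                resid = X[:, i] - preds @ beta
            s_hat = np.mean(resid ** 2)                   # empirical conditional variance
            crit = np.log(s_hat) + K * k / (n - k)        # penalized criterion
            if crit < best[0]:
                best = (crit, k, beta, s_hat)
        _, k, beta, s_hat = best
        T[i, i - k:i] = -beta
        S[i] = s_hat
    return T, S
```

For the complete graph selection collections of Section 6, the inner loop would instead range over all subsets of size at most d, whence the exponential cost discussed later.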

Risk analysis
In this section, we first provide a bias-variance decomposition for the Kullback risk of the parametric estimator Ω m . Afterwards, we state a general non-asymptotic risk bound for Ω.

Parametric estimation
We define the conditional Kullback-Leibler divergence of the distribution of X_i given X_{<i} by K(t_i, s_i; t′_i, s′_i) := E[ K( P_{t_i,s_i}(X_i|X_{<i}) ; P_{t′_i,s′_i}(X_i|X_{<i}) ) ], where P_{t_i,s_i}(X_i|X_{<i}) stands for the conditional distribution of X_i given X_{<i} with parameters (t_i, s_i) and the expectation is taken over X_{<i}.
Applying the chain rule, we obtain that K(Ω; Ω′) = Σ_{i=1}^p K(t_i, s_i; t′_i, s′_i). Consequently, we analyze the Kullback risk E[K(Ω; Ω_m)] by controlling each conditional risk E[K(t_i, s_i; t_{i,m_i}, s_{i,m_i})]. Let us define t_{i,m_i} and s_{i,m_i} as the projections of (t_i, s_i) on the space associated to the model m_i with respect to the Kullback divergence K(t_i, s_i; ·, ·); in other words, the projections minimize K(t_i, s_i; t′_i, s′_i) over all t′_i supported on m_i and all s′_i > 0. Applying the chain rule, we check that the projection t_{i,m_i} corresponds to the (i − 1) first elements of the i-th line of T_m and s_{i,m_i} is the i-th diagonal element of S_m. Thanks to this property, we derive a bias-variance decomposition for the Kullback risk.

Proposition 4.1. Assume that |m_i| is smaller than n − 2. Then, the Kullback risk of the estimator (t_{i,m_i}, s_{i,m_i}) decomposes as the sum of the bias K(t_i, s_i; ·, ·) evaluated at the projection and of a variance term R_{n,|m_i|}, where R_{n,d} depends only on n and d.

An explicit expression of R_{n,d} is provided in the proof. Applying the chain rule, we then derive a bias-variance decomposition for the maximum likelihood estimator Ω_m.
Corollary 4.2. Let m = (m_1, . . . , m_p) be a model such that the size |m_i| of each submodel is smaller than n − 2. Then, the Kullback risk of the maximum likelihood estimator Ω_m decomposes into the sum of the bias terms K(t_i, s_i; ·, ·) evaluated at the projections and of the variance terms R_{n,|m_i|}, for i from 1 to p.

If the size |m_i| of each submodel is small with respect to n, the variance term is of the order [Σ_{i=1}^p |m_i| + p]/n. For other loss functions, such as the Frobenius norm or the l2 operator norm between Ω and Ω_m, there is no such bias-variance decomposition with a variance term that does not depend on the target.

Main result
In this subsection, we state a general non-asymptotic oracle inequality for the Kullback-Leibler risk of the estimator Ω. We shall consider two types of penalty function pen(.): the first one only takes into account the complexity of the model collection while the second is based on a prior probability on the model collection.
For any 1 ≤ i ≤ p, we measure the complexity of the collection M_i through the function H_i(·) defined by H_i(d) := log |{m_i ∈ M_i , |m_i| = d}| / d, where d is any integer larger than or equal to 1. Besides, H_i(0) is set to 0 for any i between 1 and p.

Assumption (H_{K,η}): Given K > 1 and η > 0, the collection M and the number η satisfy a condition, expressed in terms of the complexities H_i(·), in which the quantity η(K) is positive and increases to one with K. This condition requires that the size of the collection is not too large. Assumption (H_{K,η}) is similar to the assumption made in [28] Sect. 3.1 for obtaining an oracle inequality in the linear regression with Gaussian design framework. We further discuss (H_{K,η}) in Sections 5 and 6 when considering the particular problems of ordered and complete variable selection.
Theorem 4.4. Let K > 1 and let η < η(K). Assume that n is larger than some quantity n_0(K) only depending on K and that the collection M satisfies (H_{K,η}). If the penalty pen(·) is lower bounded in terms of the complexity functions H_i(·), then the risk of Ω is upper bounded, up to a multiplicative constant, by the best trade-off inf_{m∈M} [K(Ω; Ω_m) + pen(m)], plus a term L_2(K, η) p/n and a remainder term τ_n defined in the proof, where L_2(K, η) is positive. Here, I_p stands for the identity matrix of size p.
Remark 4.1. This theorem tells us that Ω performs almost as well as the best trade-off between the bias term K(Ω; Ω_m) and the penalty term pen(m). The term p/n is unavoidable since it is of the same order as the variance term for the null model by Corollary 4.2. The error term τ_n is considered as negligible since it converges exponentially fast to 0 with n.

Remark 4.2.
The result is non-asymptotic and holds for arbitrarily large p as long as n is larger than the quantity n_0(K) (independent of p). There is no hidden dependency on p except in the complexity functions H_i(·) and Assumption (H_{K,η}), which we shall discuss for particular cases in Sections 5.1 and 6.1. Besides, we make no assumption on the true precision matrix Ω except that it is invertible. In particular, we do not assume that it is sparse, and we give a rate of convergence that only depends on a bias-variance trade-off. Besides, there is no hidden constant that depends on Ω (except in τ_n).

Remark 4.3.
Finally, the penalty introduced in this theorem only depends on the collection M and on a number K > 1. One chooses the parameter K depending on how conservative one wants the procedure to be. We further discuss the practical choice of K in Sections 5 and 6. In any case, the main point is that we do not need any additional method to calibrate the penalty.

Penalties based on a prior distribution
The penalty defined in Theorem 4.4 only depends on the models through their cardinality. However, the methodology developed in the proof easily extends to the case where the user has some prior knowledge of the relevant models.
Suppose we are given a prior probability measure π_M = π_{M_1} × . . . × π_{M_p} on the collection M. For any non-empty model m_i ∈ M_i, we associate to the prior π_{M_i} a weight l^{(i)}_{m_i}. By convention, we set l^{(i)}_∅ to 1. We define in the next proposition penalty functions based on the quantities l^{(i)}_{m_i} that allow one to get non-asymptotic oracle inequalities.
Assumption (H^bay_{K,η}): Given K > 1 and η > 0, the collection M, the numbers l^{(i)}_m and the number η satisfy a condition analogous to (H_{K,η}), where η(K) is defined as in (H_{K,η}).
Proposition 4.5. Let K > 1 and let η < η(K). Assume that n ≥ n_0(K) and that Assumption (H^bay_{K,η}) is fulfilled. If the penalty pen(·) is lower bounded as in (11), then the risk of Ω is upper bounded as in (12), where L_{K,η} and τ_n are the same as in Theorem 4.4.
The proof is postponed to the technical Appendix [29].
Remark 4.4. In this proposition, the penalty (11) as well as the risk bound (12) depend on the prior distribution π M . In fact, the bound (12) means that Ω achieves the trade-off between the bias and some prior weight, which is of the order − log[π M (m)]/n . This emphasizes that Ω favours models with a high prior probability. Similar risk bounds are obtained in the fixed design regression framework in Birgé and Massart [8].
Remark 4.5. Roughly speaking, Assumption (H bay K,η ) requires that the prior probabilities π Mi (m i ) are not exponentially small with respect to n.
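Remark 4.4 suggests a simple way to use such a prior in practice: penalize each candidate parent set by a term proportional to −log π(m_i)/n. The sketch below is ours (the function name and the constant in front of the prior weight are illustrative, not the exact penalty (11)):

```python
import numpy as np

def select_with_prior(X, i, candidates, priors, K=2.0):
    """Select the parent set of variable i among `candidates` (lists of column
    indices j < i), trading off the empirical conditional variance against a
    prior-based penalty of order -log(prior)/n, in the spirit of Remark 4.4."""
    n = X.shape[0]
    crits = []
    for m, pi_m in zip(candidates, priors):
        if len(m) == 0:
            resid = X[:, i]
        else:
            beta, *_ = np.linalg.lstsq(X[:, m], X[:, i], rcond=None)
            resid = X[:, i] - X[:, m] @ beta
        s_hat = np.mean(resid ** 2)                   # empirical conditional variance
        crits.append(np.log(s_hat) - K * np.log(pi_m) / n)
    return candidates[int(np.argmin(crits))]
```

Models with a high prior probability pay a smaller penalty and are therefore favoured, exactly as the bound (12) indicates.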

Adaptive banding
In this section, we apply our method ChoSelect to the adaptive banding problem and we investigate its theoretical properties.

Oracle inequalities
Let d be some fixed positive integer which stands for the largest dimension of the models m_i. For any 2 ≤ i ≤ p, we consider the ordered collections M^d_{i,ord} made of the nested sets {i − k, . . . , i − 1} of the k closest predecessors of i, for 0 ≤ k ≤ d ∧ (i − 1). We write Ω^d_ord for the estimator Ω defined with the collection M^d_ord := M^d_{1,ord} × . . . × M^d_{p,ord} and the penalty (13).
Corollary 5.1. Let K > 1 and η < η(K), and take d ≤ nη/(1 + η). If n is larger than some quantity n_0(K), then Ω^d_ord satisfies the oracle inequality (14). This bound is a direct application of Theorem 4.4.
Remark 5.1. The term τ_n is defined in Theorem 4.4 and is considered as negligible since it converges exponentially fast to 0. Hence, the penalized estimator Ω achieves an oracle inequality without any assumption on the target Ω.
Remark 5.2. This oracle inequality is non-asymptotic and holds for any p and any n larger than n 0 (K). Moreover, by choosing a constant K large enough, one can consider a maximal dimension of model d up to the order of n, because η(K) converges to one when K increases.
Choice of the parameters K and d. Setting K to 2 gives a criterion close to AICc (see for instance [24]). Besides, Verzelen [28] (Prop. 3.2) has justified, in a closely related framework, that this choice of K is asymptotically optimal. The choice K = 3 is advised if one wants a more conservative procedure. We have stated Corollary 5.1 for models m_i of size smaller than d = nη/(1 + η). In practice, taking d = n/2 yields rather good results even if this choice is not completely covered by the theory.
Computational cost. The procedure is fast in this setting. Indeed, its complexity is the same as p times the complexity of an ordered variable selection in a classical regression framework. From numerical comparisons, it seems to be slightly faster than the methods of Bickel and Levina [6] and Levina et al. [22] which require cross-validation type strategies.

Adaptiveness with respect to ellipsoids
We now state that the estimator Ω d ord is simultaneously minimax over a large class of sets that we call ellipsoids.
Definition 5.2. Let (a_i)_{1≤i≤p−1} be a non-increasing sequence of positive numbers such that a_1 = 1 and let R be a positive number. Then, the set E(a, R, p) is made of all the non-singular matrices Ω = T* S^{-1} T, where S is in Diag(p) and T is a lower triangular matrix with unit diagonal whose coefficients satisfy a decay condition, measured by (a_i) and R, when one moves away from the diagonal. By convention, we set a_p = 0.

The sequence (a_i) measures the rate of decay of each line of T when one moves away from the diagonal. Observe that in this definition, every line of T decreases at the same rate. At the price of more technicality, we could also allow different rates of decay for each line of T. We shall restrict ourselves to covariance matrices whose eigenvalues lie in a compact set, denoted B_op(γ) below, when considering the ellipsoid E(a, R, p).

Proposition 5.3. For any ellipsoid E(a, R, p), the minimax rate of estimation is lower bounded as in (17). Let us consider the estimator Ω^d_ord defined in Section 5.1 with d = ⌊nη/(1 + η)⌋ and the penalty (13). We also fix γ > 2. If R² ≥ 1/n and the sequence (a_i)_{1≤i≤p} does not converge too slowly towards zero, then the corresponding upper bound (18) holds as soon as n is larger than n_0(K).

Remark 5.3. The minimax rate of convergence over E(a, R, p) in the lower bound (17) is similar to the one obtained for classical ellipsoids in the Gaussian fixed design regression setting (see for instance [23] Th. 4.9). We conclude from the second result that our estimator Ω^d_ord is minimax adaptive to the ellipsoids that are not degenerate (i.e. R² ≥ 1/n) and whose rates (a_i) do not converge too slowly towards zero. Note that all the sequences (a_i) such that a_i² ≤ R²/i satisfy the last assumption.

Remark 5.4. However, the estimator Ω^d_ord is not adaptive to the parameter γ, since the constant L in (18) depends on γ. This is not really surprising. Indeed, the oracle inequality (14) is expressed in terms of the Kullback loss while the ellipsoids are defined in terms of the entries of T.
Had we considered the minimax rates of estimation over sets analogous to E(a, R, p) but defined in terms of the decay of the Kullback bias, we would have obtained minimax adaptiveness without any condition on the eigenvalues.
We are also able to prove asymptotic rates of convergence and asymptotic minimax properties with respect to the Frobenius loss function. For any s > 0, we define the ellipsoid E′(s, R, p) as the ellipsoid E(a, R, p) with the sequence (a_i)_{1≤i≤p−1} := (i^{−s})_{1≤i≤p−1}. If s > 1/2, then, uniformly over the set E′(s, R, p_n) ∩ B_op(γ), the estimator Ω^d_ord achieves the stated rates of convergence. Moreover, these two rates are optimal from a minimax point of view.
The estimator Ω d ord achieves the minimax rates of estimation over special cases of ellipsoids. However, all these results depend on γ and are of asymptotic nature.

Complete graph selection
We now turn to the complete Cholesky factor estimation problem. First, we adapt the model selection procedure ChoSelect to this setting. Then, we derive an oracle inequality for the Kullback loss. Afterwards, we state that the procedure is minimax adaptive to the unknown sparsity both with respect to the Kullback entropy and the Frobenius norm. Finally, we discuss the computational complexity and we introduce a faster two-stage procedure.

Oracle inequalities
Again, d is a positive integer that stands for the maximal size of the models m_i. We consider the collections of models M^d_{i,co} that contain all the subsets of {1, . . . , i − 1} of size smaller than or equal to d, and we set M^d_co := M^d_{1,co} × . . . × M^d_{p,co}. A model m ∈ M^d_co corresponds to a pattern of zeros in the Cholesky factor T. As explained in Section 2, such a model m is also in correspondence with a directed graph −→G which is compatible with the ordering. Hence, the collection M^d_co is in correspondence with the set of directed graphs −→G of degree smaller than d which are compatible with the natural ordering of {1, . . . , p}.
For any 2 ≤ i ≤ p and any model m i in M d i,co we fix the penalty where K > 1. In the sequel, Ω d co corresponds to the estimator ChoSelect with the collection M d co and the penalty (21).
Corollary 6.1. Let K > 1 and η < η ′ (K) (defined in the proof). Assume that d[1 + log(p/d) ∨ 0] ≤ nη (Condition (22)). If n is larger than some quantity n 0 (K), then Ω d co satisfies the oracle inequality (23), where the remaining term τ ′ n is of the same order as τ n in Theorem 4.4. A proof is provided in Section 10.3. We get an oracle inequality up to logarithmic factors, but we prove in Section 6.2 that the terms log[(i − 1)/|m i |] are in fact unavoidable. For the sake of clarity, we straightforwardly derive from (23) the less sharp but more readable upper bound ... , where |m| := Σ p i=1 |m i |.

Remark 6.1. As for the previous results, we do not make any assumption on the target Ω, and the obtained upper bound is non-asymptotic. By Condition (22), we can consider dimensions d up to the order n/[log(p/n) ∨ 1]. If p is much larger than n, the maximal dimension has to be smaller than the order n/ log(p). This is not really surprising, since the same restriction arises for linear regression with Gaussian design, as stated in [28] Sect. 3.2. There is no precise result proving that this n/ log(p) bound is optimal, but we believe that it is unimprovable. If p is of the same order as n, it is possible to consider dimensions up to the same order as p.
Remark 6.2. The same bound (23) holds if we use the penalty ... For a given K, observe that pen i (m i ) = log(1 + pen ′ i (m i )). Hence, these two penalties are equivalent when n is large. In Corollary 6.1, we have privileged the logarithmic penalty because it gives slightly better results in practice.
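The claimed equivalence of the two penalties when n is large rests on the elementary fact that log(1 + x) ≈ x for small x; a quick numerical check (purely illustrative) confirms that the relative gap shrinks with x:

```python
import math

# For small penalties (i.e. large n), log(1 + x) and x differ by at
# most x^2/2, so the two penalty choices of Remark 6.2 become
# interchangeable.  We check the relative error at a few values.
for x in [0.5, 0.1, 0.01]:
    rel_err = abs(math.log1p(x) - x) / x
    assert rel_err < x  # since x - log(1+x) <= x^2/2
```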
Choice of K and d. In practice, we set the maximal dimension to d = n/{2.5[2 + (log(p/n) ∨ 0)]}. Concerning the choice of K, we advise the value 1.1 if the goal is to minimize the risk. When the goal is to estimate the underlying graph, one should use a larger value of K, such as 2.5, in order to decrease the proportion of false discoveries.
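The practical default for the maximal dimension can be written down directly; the function name below is hypothetical, but the formula is the one advocated above.

```python
import math

def default_max_dim(n, p):
    """Default maximal model dimension advocated in the text:
    d = floor( n / (2.5 * [2 + (log(p/n) v 0)]) ).
    The max(..., 0) term only kicks in when p > n."""
    return int(n / (2.5 * (2 + max(math.log(p / n), 0.0))))

# K = 1.1 to minimize the risk; K = 2.5 (or 3) to keep the
# proportion of false discoveries small in graph estimation.
assert default_max_dim(100, 30) == 20    # log term vanishes when p <= n
assert default_max_dim(100, 200) == 14   # 100 / (2.5 * (2 + log 2)) ~ 14.85
```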

Adaptiveness to unknown sparsity
In this section, we state that the estimator Ω d co simultaneously achieves the minimax rates of estimation over a range of sparsity classes for the matrix T . In the sequel, U 1 [k, p] stands for the set of positive square matrices Ω = T * S −1 T of size p whose Cholesky factor T contains at most k non-zero off-diagonal coefficients on each line. The set U 1 [k, p] contains the precision matrices of the directed Gaussian graphical models whose underlying directed acyclic graph − → G satisfies the two following properties:

imsart-ejs ver. 2010/09/07 file: Chol_ejs.tex date: October 11, 2010

• It is compatible with the ordering on the variables.
• Each node of − → G has at most k parents.
We shall also consider the set U 2 [k, p] that contains the positive square matrices whose Cholesky factor is k-sparse (i.e. contains at most k non-zero elements). Hence, the set U 2 [k, p] corresponds to the precision matrices of the directed Gaussian graphical models whose underlying directed acyclic graph − → G is compatible with the ordering on the variables and has at most k edges. When Ω belongs to U 2 [k, p] with k "small", we say that the underlying Cholesky factor T is ultra-sparse.
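Membership in the two sparsity classes only depends on the zero pattern of T, so it is easy to check numerically; the helper names below are hypothetical.

```python
import numpy as np

def row_sparsity(T):
    """Max number of non-zero off-diagonal entries per row of T:
    membership in U1[k, p] requires this to be <= k."""
    off = T - np.diag(np.diag(T))
    return int(np.max(np.count_nonzero(off, axis=1)))

def total_sparsity(T):
    """Total number of non-zero off-diagonal entries of T:
    membership in U2[k, p] requires this to be <= k."""
    off = T - np.diag(np.diag(T))
    return int(np.count_nonzero(off))

T = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.2, 0.0, -0.3, 1.0]])
assert row_sparsity(T) == 2    # node 4 has two parents: T is in U1[2, 4]
assert total_sparsity(T) == 3  # three edges in total: T is in U2[3, 4]
```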
For deriving the minimax rates of estimation, we shall restrict ourselves to precision matrices whose Kullback divergence with the identity is not too large. This is why we define the set B K (·) by ... Consider K > 1, β > 1, and η < η(K). Assume that n ≥ n 0 (K) and choose a positive integer d that satisfies Condition (22). The penalized estimator Ω d co defined in Corollary 6.1 is minimax adaptive over the sets U 1 [k, p] ∩ B K (n β ) for all k smaller than d that also satisfy n ≥ Lk 2 (1 + log(p/k)). It is also minimax adaptive over U 2 [k, p] ∩ B K (n β ) for all k less than d.
Remark 6.3. The minimax rate of estimation over U 1 [k, p] is of order kp[1 + log(p/k)]/n. We do not think that the condition n ≥ Lk 2 [1 + log(p/k)] is necessary, but we do not know how to remove it. The technical condition K(Ω; I p ) ≤ pn β is not really restrictive. It comes from the term n 5/2 K(Ω; I p ) exp[−nL K,η ] in Theorem 4.4, which goes exponentially fast to 0 with n as long as K(Ω; I p )/p grows at most polynomially with respect to n. In conclusion, our estimator Ω d co is adaptive to the sparsity of its Cholesky factor T .

Remark 6.4. Let us translate the proposition in terms of directed graphical models. The Kullback minimax rate of covariance estimation over graphical models with at most k parents by node is of the order pk(1 + log(p/k))/n. Moreover, the Kullback minimax rate of covariance estimation over graphical models with at most k edges is of the order (p + k log p)/n. Finally, Ω d co is minimax adaptive for estimating the distribution of a sparse directed Gaussian graphical model whose underlying graph is unknown.
We can also consider the rates of convergence with respect to the Frobenius norm or the operator norm, in the spirit of the results of Lam and Fan [19]. We recall that ∥.∥ F and ∥.∥ respectively refer to the Frobenius norm and the operator norm in the space of matrices. We also recall that the set B op (γ) is defined in (16). Corollary 6.3. Let K > 1, η < η(K), γ > 2, and let d be the largest integer that satisfies (22). If p n k n [1 + log(p n /k n )] = o(n), then ... Moreover, all these Frobenius rates of convergence are optimal from a minimax point of view.
Remark 6.5. The estimator Ω d co is asymptotically minimax adaptive to the sets U 1 [k, p] ∩ B op (γ) and U 2 [k, p] ∩ B op (γ) with respect to the Frobenius norm. Moreover, these rates are coherent with the ones obtained by Lam and Fan in Sect. 4 of [19]. We do not think that the rates of convergence with respect to the operator norm are sharp.

Remark 6.6. These results are asymptotic in nature and require that p n be much smaller than n. Besides, the upper bounds on the rates highly depend on the largest eigenvalue ϕ max (Ω). This is why we have restricted ourselves to precision matrices whose eigenvalues lie in the compact set [1/γ; γ]. Nevertheless, to our knowledge, all results in this setting suffer from the same drawbacks; see for instance Th. 11 of Lam and Fan [19].

A two-step procedure
The computational cost of Ω d co is proportional to the size of M d i,co , which is of the order of p d . Hence, it becomes prohibitive when p is larger than 50. In fact, Ω d co minimizes a penalized criterion over the collection M d co , but the collections M d i,co contain an overwhelming number of models that are clearly irrelevant. This is why we shall use a two-stage procedure: first, we compute a subcollection of M d co ; then, we minimize the penalized criterion over this subcollection. Suppose we are given a fast data-driven method that computes a subset M i of M d i,co for any i in {1, . . . , p}. In what follows, we refer to this method as ChoSelect f . For any 2 ≤ i ≤ p and any model m i in M d i,co , we advise to fix the penalty as in Section 6.1, with K > 1; K = 1.1 gives good results in practice.
where τ n is defined in Theorem 4.4.
Remark 7.2. Hence, under the event A m * , where m * is the oracle model, Ω f achieves the optimal rate of convergence. The estimator also achieves a small risk as soon as some "good" model belongs to the estimated collection; here, "good" refers to a small Kullback risk. Observe that it is much easier to estimate a collection M i that contains a "good" model than to directly estimate a "good" model.
In fact, Algorithm 7.1 and Proposition 7.1 apply to any collection M and to the penalties defined by (7) or (11). The estimator Ω f then performs as well (up to a log p factor) as the best parametric estimator built on a model in the regularization path. The collection size is fairly small, but the oracle model may not belong to M i with large probability. This is especially the case if the true covariance Σ is far from the identity, since the Lasso estimator is then possibly inconsistent. In many cases, the true (or the oracle) model is a submodel of the model selected by the Lasso with a suitable parameter [2]. When choosing k = D, it is therefore likely that the true model or a "good" model belongs to M i .
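The two-stage logic for a single row regression can be sketched as follows. This is only a simplified stand-in, not the authors' procedure: stage one ranks covariates by absolute correlation (a crude substitute for the Lasso/LARS regularization path), and the penalty shape K·|m|·(1 + log(p/|m|)) is schematic; the function name is hypothetical.

```python
import numpy as np

def choselect_fast_row(X, y, d, K=1.1):
    """Two-stage selection for one row regression.

    Stage 1: screen covariates by |correlation| with y and keep the
    nested candidate models of size 0..d (a stand-in for the Lasso
    regularization path).
    Stage 2: minimize a log-likelihood-type criterion
    n/2 * log(RSS/n) plus a schematic penalty over the candidates."""
    n, p = X.shape
    order = np.argsort(-np.abs(X.T @ y))              # screening step
    candidates = [list(order[:k]) for k in range(d + 1)]
    best, best_crit = None, np.inf
    for m in candidates:
        if m:
            coef, *_ = np.linalg.lstsq(X[:, m], y, rcond=None)
            rss = float(np.sum((y - X[:, m] @ coef) ** 2))
        else:
            rss = float(np.sum(y ** 2))
        pen = K * len(m) * (1 + np.log(p / max(len(m), 1)))
        crit = 0.5 * n * np.log(rss / n) + pen
        if crit < best_crit:
            best, best_crit = m, crit
    return set(best)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 5 * X[:, 2] - 5 * X[:, 7] + 0.1 * rng.standard_normal(100)
# With such a strong signal, the selected model contains the support.
assert {2, 7} <= choselect_fast_row(X, y, d=4)
```

The point of the sketch is the cost structure: the second stage only visits d + 1 candidate models instead of the order of p^d models in the exhaustive collection.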
The regularization path of the Lasso is not necessarily increasing [10]. If we want M to contain all subsets of sparse solutions of the Lasso, we need to use a variant of the previous algorithm: 1. Using the LARS algorithm [10], compute all the Lasso solutions for the regression of X i with respect to the covariates X <i . 2. For any λ > 0, consider the set {X j1 , X j2 , . . . , X js λ } of variables selected by the Lasso.
If s λ > k, we define A λ i = ∅, while we take A λ i = P(X j1 , . . . , X js λ ) if s λ ≤ k. Here, P(A) contains all the subsets of A. In the following proposition, we show that ChoSelect f outperforms the Lasso under restricted eigenvalue conditions. We consider an asymptotic setup where p and n go to infinity with p larger than n.
Moreover, we assume that ... and that q * log(p)/n goes to 0 when p and n go to infinity.
• (H.2) Fix some v < 1. The vector t p (which corresponds to the p-th line of T ) is q-sparse with some q < n v log p ∨ n log p . The set of non-zero components is denoted m * . Let us fix some K > 24 ∨ (2/(1 − v)) and define ... For any zero component t p [j], we have ... The quantities q and q * are such that ... The proof of the proposition is postponed to the appendix [29].

Remark 7.4. In contrast to ChoSelect f , the Lasso procedure does not consistently select the support of t p under restricted eigenvalue conditions [35, 34]. Observe that our assumptions (H.1), (H.2), (H.3) and our result are quite similar to those obtained for the stability selection method of Meinshausen and Bühlmann [25].
Remark 7.5. Under similar conditions, one can prove that ChoSelect f consistently selects the support of any vector t i for n ≤ i ≤ p. In order to consistently estimate the whole pattern of zero of T , one needs to slightly change the penalty (21) by replacing (i − 1) by (i − 1) ∨ n.
Remark 7.6. For the sake of simplicity, we have only described two methods for building the collection M. One may also use a collection based on the adaptive Lasso or, more generally, any (data-driven) collection M. Moreover, ChoSelect f can be interpreted as a way to tune an estimation procedure and to merge different procedures. Suppose we are given a collection A of estimation procedures. For any procedure a ∈ A, we build a collection M a using the model corresponding to the estimator Ω a or using a regularization path associated with a (when available). If we take the collection M as the union of all M a for a ∈ A, then by Proposition 7.1 the estimator Ω f nearly selects the best model (from the risk point of view) among the ones previously selected by the procedures a ∈ A.

Simulation Study
In this section, we investigate the practical performances of the proposed estimators. We concentrate on two applications: adaptive banding and complete graph selection.

Simulation scheme
Simulating the data. We have used a scheme similar to that of Levina et al. [22]. Simulations were carried out for centered Gaussian vectors with two different precision models. The first one has entries of the Cholesky factor decaying exponentially as one moves away from the diagonal.
The second model allows different sparse structures for the Cholesky factors.
Here, U (k 1 , k 2 ) denotes an integer selected uniformly at random among all integers from k 1 to k 2 . We generate from this structure for p = 30. Levina et al. pointed out that this structure can generate poorly conditioned covariance matrices for larger p. To avoid this problem, for p = 100 and p = 200 we divide the variables into respectively 4 and 8 blocks, and we generate a random structure from the model described above for each of the blocks. For each of the covariance models, we generate a sample of size n = 100. We consider three different values of p: 30, 100, and 200.
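The first precision model can be sketched as follows. The decay rate rho and the unit conditional variances are illustrative choices, not the exact constants of the simulation study; the function name is hypothetical.

```python
import numpy as np

def sample_model1(n, p, rho=0.5, rng=None):
    """Sketch of the first simulation model: the Cholesky factor T
    has off-diagonal entries decaying exponentially away from the
    diagonal, and Omega = T' S^{-1} T with S diagonal."""
    rng = rng or np.random.default_rng()
    T = np.eye(p)
    for i in range(1, p):
        for j in range(i):
            T[i, j] = rho ** (i - j)     # exponential decay off the diagonal
    S = np.eye(p)                        # conditional variances, taken to be 1 here
    Omega = T.T @ np.linalg.inv(S) @ T   # precision matrix
    Sigma = np.linalg.inv(Omega)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return X, Omega

X, Omega = sample_model1(n=100, p=30, rng=np.random.default_rng(1))
assert X.shape == (100, 30)
assert np.all(np.linalg.eigvalsh(Omega) > 0)  # Omega is positive definite
```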
We apply the following procedures: • our procedure ChoSelect as described in Section 5. More precisely, we take the collection M ⌊n/2⌋ ord , the penalty (13), and K = 3.
• the nested Lasso method of Levina et al. [22]. It is computed with the J 1 penalty, while its tuning parameter is selected via 5-fold cross-validation based on the likelihood. We have used the penalty J 1 instead of J 2 for computational reasons. • the banding procedure of Bickel and Levina [6]. The tuning parameter is chosen according to Sect.5 in [6] with 50 random splits. • the regularization method of Ledoit and Wolf [21].
For the first covariance model Ω 1 , we also compute the oracle estimator, i.e. the parametric estimator which minimizes the Kullback risk among all the estimators Ω m with m ∈ M ⌊n/2⌋ ord . We recall that the computation of the oracle estimator requires the knowledge of the target Ω 1 . The performances of this estimator are presented here as a benchmark. The experiments are repeated N = 100 times. In the second scheme, N 1 = 10 precision matrices are sampled and N 2 = 10 experiments are made for each sample.

Results
In Tables 1 and 2, we report the Kullback risk, the operator distance ∥Ω − Ω∥, and the operator distance between the inverses ∥Ω −1 − Σ∥ for each of the fore-mentioned estimators. We have chosen the Kullback loss because of its connection with discriminant analysis. The two other loss functions are interestingly connected to the estimation of the eigenvalues and the eigenspaces. For the second structure, we also consider the pattern of zero estimated by our procedure, the nested Lasso, and the banding method of Bickel and Levina. More precisely, we estimate the power (i.e. the fraction of non-zero terms in T estimated as non-zero) and the FDR (i.e. the ratio of the false discoveries over the true discoveries) in Table 3.

Table 2: Estimation and 95% confidence interval of the Kullback risk, the operator distance risk, and the operator distance between inverses risk for the second covariance model Ω 2 .

Method
Comments on Tables 1 and 2. In the first scheme Ω 1 , the three methods based on the Cholesky decomposition exhibit a Kullback risk close to the oracle: the ratio of their Kullback risks over the oracle risk remains smaller than 1.4. The risk of the nested Lasso and of the banding method is about 15% smaller than the risk of ChoSelect. We observe the same pattern for the operator distance between precision matrices. In contrast, all these estimators have more or less the same risk for the operator distance between the covariance matrices. The estimator of Ledoit and Wolf is a regularized version of the empirical covariance matrix. Its performances with respect to the Kullback loss are poor, but it behaves well with respect to the operator norms.
In the second scheme, the method of Ledoit and Wolf performs poorly with respect to the Kullback loss functions and the first operator norm loss function. ChoSelect performs two times better than the nested Lasso in terms of the Kullback discrepancy and the operator distance between precision matrices. The banding method exhibits a far worse Kullback risk. As in the first scheme, the three procedures based on Cholesky decomposition perform similarly in terms of the operator distance between covariance matrices. These last risks are high for p = 30 because the covariance matrix is poorly conditioned in this case and its eigenvalues are high.
The banding method only performs well if the Cholesky factor T is well approximated by a banded matrix, which is not the case in the second scheme. The nested Lasso seems to perform well when the coefficients decay exponentially, as in the first scheme. However, its performance seems to be far worse when the decay is not exponential. In contrast, ChoSelect seems to always perform quite well. This observation corroborates the theory: indeed, we have stated in Corollary 5.1 that ChoSelect satisfies an oracle inequality without any assumption on Σ. Finally, there is no clear interpretation for the risk with respect to the operator norm between covariances.

Table 3: Estimation and 95% confidence interval of the power and FDR for the second precision model Ω 2 .
Estimating the pattern of zero. In the second scheme, we can compare the ability of the procedures to estimate the pattern of non-zero coefficients (Table 3). The banding method does not work well, since the Cholesky factor T is not banded. ChoSelect has a higher power and a lower FDR than the nested Lasso.

Simulation scheme
Simulating the data. In the first simulation study, we consider Gaussian random vectors whose precision matrices are based on directed graphical models.
1. First, we sample a directed graph − → G in the following way. For any node i in {2, . . . , p} and any node j < i, we put an edge going from j to i with probability (Esp/(i − 1) ∧ 0.5), where Esp is a positive parameter previously chosen. Hence, the expected number of parents for a given node is Esp∧(i − 1)/2.
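The graph-sampling step described above can be sketched directly; the function name is hypothetical, but the edge probability is the one stated.

```python
import numpy as np

def sample_ordered_dag(p, esp, rng=None):
    """Sample a directed graph compatible with the ordering: an edge
    j -> i (with j < i) is drawn with probability min(esp/(i-1), 0.5),
    so node i has esp ^ (i-1)/2 parents in expectation."""
    rng = rng or np.random.default_rng()
    parents = {i: [] for i in range(1, p + 1)}   # node 1 has no parents
    for i in range(2, p + 1):
        prob = min(esp / (i - 1), 0.5)
        for j in range(1, i):
            if rng.random() < prob:
                parents[i].append(j)
    return parents

G = sample_ordered_dag(p=50, esp=3, rng=np.random.default_rng(0))
# Every edge goes from a smaller to a larger index, so the graph is
# acyclic and compatible with the ordering by construction.
assert all(j < i for i, js in G.items() for j in js)
```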
In the second simulation scheme, we consider the case where the "good" ordering is partially known. More precisely, we first sample a precision matrix Ω c 1 according to the first simulation scheme. Then, we sample uniformly 10 variables and change their place in the ordering uniformly at random. This results in a new precision matrix Ω c 2 , whose Cholesky factor is generally less sparse than the one of Ω c 1 . The purpose of this scheme is to check whether our method is robust to small changes in the ordering. For this study, we choose p = 200, Esp = 1, 3, 5, and n = 100.
We compute the following estimators:
• the procedure ChoSelect f as described in Section 7, with the collection M d co and d = ...
• the procedure ChoSelect with the collection M 7 co and the penalty (21) with K = 1.1. Since this method is computationally prohibitive, we only apply it for p = 30.
• the Glasso method [3]. It is computed using the Glasso R-package by Friedman et al. [13], while the tuning parameter is chosen via 5-fold cross validation based on the likelihood. Following Rothman et al. [27] and Yuan and Lin [33], we do not penalize the diagonal of Ω. • the Lasso method of Huang et al. [15]. The regularization parameter is calculated by 5-fold cross validation based on the likelihood.
For each estimator and simulation scheme, we evaluate the Kullback loss K(Ω; Ω), the operator distance ∥Ω − Ω∥, and the operator distance between the inverses ∥Ω −1 − Σ∥. We also consider the pattern of zero estimated by our procedure ChoSelect f and by the Lasso of Huang et al. [15]. More precisely, we evaluate the power (i.e. the fraction of non-zero terms in T estimated as non-zero) and the FDR (i.e. the ratio of the false discoveries over the true discoveries) in the first simulation study. Empirical 95% confidence intervals of the estimates are also computed. The experiments are repeated N = 100 times: N 1 = 10 precision matrices are sampled and N 2 = 10 experiments are made for each precision matrix sampled.
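The two support-recovery metrics can be computed directly from the zero patterns of the true and estimated Cholesky factors. The helper below is a hypothetical sketch; it uses the standard convention of false discoveries over all discoveries for the FDR.

```python
import numpy as np

def power_and_fdr(T_true, T_hat):
    """Power: fraction of true non-zero off-diagonal entries of T
    estimated as non-zero.  FDR: false discoveries over all
    discoveries (taken to be 0 when nothing is discovered)."""
    mask = ~np.eye(T_true.shape[0], dtype=bool)   # off-diagonal entries only
    true_nz = (T_true != 0) & mask
    est_nz = (T_hat != 0) & mask
    power = (true_nz & est_nz).sum() / max(true_nz.sum(), 1)
    fdr = (est_nz & ~true_nz).sum() / max(est_nz.sum(), 1)
    return power, fdr

T_true = np.array([[1.0, 0.0, 0.0], [0.8, 1.0, 0.0], [0.0, 0.4, 1.0]])
T_hat  = np.array([[1.0, 0.0, 0.0], [0.7, 1.0, 0.0], [0.1, 0.0, 1.0]])
power, fdr = power_and_fdr(T_true, T_hat)
assert power == 0.5  # one of the two true entries is recovered
assert fdr == 0.5    # one of the two discoveries is false
```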

Results
Comparison of ChoSelect and ChoSelect f . In Table 4, we have set p = 30 in order to be able to compute the method ChoSelect and compare it with ChoSelect f . Both methods perform more or less similarly. When the sparsity of the Cholesky factor decreases (Esp = 5), ChoSelect f exhibits a slightly smaller Kullback risk.
These simulations confirm that ChoSelect f exhibits performances similar to those of ChoSelect with a much smaller computational complexity. In the other simulations, we only compute ChoSelect f .

Estimation of Ω. This study corresponds to the situation where a "good" ordering of the variables is known. In Table 5, ChoSelect f has a smaller Kullback risk than the Lasso, which is in turn better than the Glasso and Ledoit and Wolf's method. This is especially true when p is large. We observe the same results in terms of the operator distance between the precision matrices. The results for the operator distance between covariance matrices are more difficult to interpret: it seems that the risk of the Lasso is high, while the Glasso and ChoSelect f perform more or less similarly. Ledoit and Wolf's method gives good results when Esp = 3, 5.

Estimation of the graph. In Table 6, we compare the ability of the procedures to estimate the underlying directed graph. This is why we only consider the procedures based on the Cholesky decomposition: the Lasso of Huang et al. and ChoSelect f . The Lasso exhibits a high power but also a high FDR (larger than 50%). In contrast, ChoSelect f keeps the FDR reasonably small at the price of a small loss in power. When p increases, the power of the procedures decreases. These results corroborate the results of Proposition 7.2. When the number of parents (i.e. Esp) increases, the FDR of ChoSelect f seems to increase. We recall that if one wants a lower FDR in the graph estimator, one should choose a larger value of K; in practice, taking K = 2.5 or K = 3 enforces an FDR smaller than 10%.

Effect of the ordering. In Table 7, we study the performances of the procedures when the ordering of the variables is slightly modified. The Glasso method and the regularization method of Ledoit and Wolf perform as in the first scheme, since these procedures do not depend on a particular ordering of the variables. The Lasso and ChoSelect f procedures provide slightly worse results than in the first scheme, especially when the sparsity decreases. Indeed, the effect of a bad ordering is higher when the sparsity is low. Nevertheless, ChoSelect f still performs better than the other procedures for the Kullback risk and the operator distance between precision matrices, while the Glasso and ChoSelect f still perform similarly for the operator distance between covariance matrices. The respective performances are different when the ordering is completely unknown (see the Appendix [29]).

Method
Conclusion. When the ordering is known or partially known, ChoSelect f has a small risk with respect to the Kullback discrepancy and the operator distance between precision matrices. Moreover, ChoSelect f provides a good estimation of the underlying graph. It is difficult to interpret the results for the operator distance between the covariance matrices. If the objective is to minimize the operator distance ∥Σ − Σ∥, it seems that a direct estimation of Σ should be preferred to the inversion of an estimation of Ω.

Discussion
Adaptive banding problem. ChoSelect achieves an oracle inequality and is adaptive to the decay of the Cholesky factor T . We have also derived corresponding asymptotic results for the Frobenius loss function. This procedure is computationally competitive with the other existing methods. Finally, we provide the penalty explicitly, so that, contrary to most procedures in the literature, there is no calibration issue. In a future work, we would like to study the performances of ChoSelect with respect to the operator norm and prove corresponding minimax bounds. Bickel and Levina have indeed proved risk bounds for their banding procedure [6]. Their method is, like ChoSelect, based on maximum likelihood estimators. This is why we believe that ChoSelect may also achieve fast rates of convergence with respect to the operator distance.
Complete graph estimation problem. We have shown that ChoSelect satisfies an oracle-type inequality, and we have derived the minimax rates of estimation for sparse Cholesky factors T . ChoSelect achieves minimax adaptivity to the unknown sparsity of the Cholesky factor. As in the banded case, we provide an explicit penalty. However, this procedure is computationally feasible only for small p. In contrast, the method ChoSelect f introduced in Section 7 shares some advantages of the previous method at a much lower computational cost. In Algorithm 7.2, we propose two collections based on the Lasso. In practice, there may be smarter ways of building the collections M i than using the Lasso.

Some notations and probabilistic tools
First, we introduce the prediction contrasts l i (., .). Let i be an integer between 2 and p and let (t, t ′ ) be two row vectors in R i−1 ; then the contrast l i (t, t ′ ) is defined by ... Consider a model m i ∈ M i . We define the random variable ǫ mi by ... By definition of t i,mi in Section 4.1, the variable ǫ mi is independent of ǫ and of X mi . Besides, its variance equals l i (t i,mi , t i ). It follows from the definition of s i,mi that s i,mi = l i (t i,mi , t i ) + s i . The vectors ǫ and ǫ m refer to the n samples of ǫ and ǫ m . For any model m and any vector Z of size n, Π m Z refers to the projection of Z onto the subspace generated by (X i ) i∈m , whereas Π ⊥ m Z stands for Z − Π m Z. For any subset m of {1, . . . , p}, Σ m denotes the covariance matrix of the vector X * m . Moreover, we define the row vector Z m := X m Σ −1 m in order to deal with standard Gaussian vectors. Similarly to the matrix X m , the n × |m| matrix Z m stands for the n observations of Z m .
The estimators t i,mi and s i,mi are expressed as follows: ... This lemma is a consequence of the definitions of t i,mi , s i,mi , and K(t i , s i ; t ′ i , s ′ i ) in Sections 3 and 4.1.
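These row-wise estimators are plain least-squares quantities, which can be illustrated numerically. The sketch below is hypothetical in its names, and the sign convention for t is schematic (the text regresses X_i on the preceding covariates).

```python
import numpy as np

def row_estimates(Xm, Xi):
    """Least-squares estimates for one row, in the spirit of the
    text: t_hat solves the regression of X_i on the covariates X_m,
    and s_hat = ||Pi_m^perp X_i||_n^2 is the mean squared residual
    after projecting X_i onto the span of the columns of X_m."""
    n = Xi.shape[0]
    t_hat, *_ = np.linalg.lstsq(Xm, Xi, rcond=None)
    proj = Xm @ t_hat                     # Pi_m X_i
    s_hat = float(np.sum((Xi - proj) ** 2)) / n   # ||Pi_m^perp X_i||_n^2
    return t_hat, s_hat

rng = np.random.default_rng(0)
Xm = rng.standard_normal((200, 3))
Xi = Xm @ np.array([1.0, -0.5, 0.0]) + 0.1 * rng.standard_normal(200)
t_hat, s_hat = row_estimates(Xm, Xi)
assert np.allclose(t_hat, [1.0, -0.5, 0.0], atol=0.05)
assert s_hat < 0.02   # close to the true conditional variance 0.01
```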

Proof of Proposition 4.1
Proof of Proposition 4.1. First, we decompose the Kullback-Leibler divergence into a bias term and a variance term using Expression (31).
By definition, t i,mi is the least-squares estimator of t i over the set of vectors of size i − 1 whose support is included in m i , and −X <i t * i,mi is the best predictor of X i given X mi . Hence, the prediction error l i ( t i,mi , t i ) + s i equals l i ( t i,mi , t i,mi ) + s i,mi , and it follows that ... Let us compute the expectation of these three last terms. Notice that n s i,mi /s i,mi = n∥Π ⊥ mi X i ∥ 2 n /s i,mi follows a χ 2 distribution with n − |m i | degrees of freedom.
Obviously, Assumption (H K,η ) is equivalent to the union of the assumptions (H i K,η ).
Proposition 10.3. Let K > 1 and η < η(K). Assume that n ≥ n 0 (K), that (H i K,η ) holds, and that the penalty function is lower bounded as follows for any m ∈ M i and some K > 1 .
Then, the penalized estimator ( t i , s i ) satisfies ... The remaining term τ n (t i , s i , K, η) is defined by ... , where 0 stands for the null vector of size i − 1.
Let us apply this proposition for any i between 1 and p. Then, we get an upper bound for E[K(Ω; Ω)] by applying the chain rule as in Section 4.1. The risk bound (8) follows.
Proof of Proposition 10.3. The proof of this result is mainly inspired by ideas introduced in the proofs of Th.3 in [4] and of Th.3.4 in [28]. The case i = 1 is a consequence of Proposition 4.1 since |M 1 | = 1. Let us assume that i is larger than one. For the sake of clarity, we forget the subscripts i in the remainder of the proof.
Let us introduce some new notations. First, ., . n is the inner product in R n associated to the norm . n . Let m be any model in the collection M.
We shall use the constants κ 1 , κ 2 , and ν(K) as defined in the proof of Th.3.4 in [28]. We provide their expression for completeness although they are not really of interest.
For clarity, the proof is split into six lemmas.
This lemma gives a decomposition of the relevant terms that we have to bound. See [29] Sect.1.1 for a detailed computation. In the next four lemmas, we bound each of these terms.
Lemma 10.5. Assume that n ≥ n 0 (K), where n 0 (K) is defined in the proof. There exists an event B 1 of probability larger than 1 − L K n exp[−nL ′ (K, η)], with L ′ (K, η) > 0, such that ... , where v(K, η) is a positive constant (strictly) smaller than 1. These two upper bounds are at the heart of the proof. The sketch of their proofs is analogous to that of Lemmas 7.10 and 7.11 in [28]. The main tools are deviation inequalities for χ 2 random variables and for the largest eigenvalue of a Wishart matrix. See [29] Sect. 1.2 and 1.3 for detailed proofs.
Since l( t, t)/ s is smaller than 2K( t, s; t, s), it follows that 2E[K( t, s; t, s)1 B1 ] ≤ L K,η {2E[K( t, s; t m , s m )] + pen(m)} + E[(R 2 (m) + R 4 (m, m))1 B1 ]. The proof of this lemma relies on the same ideas as the proof of Lemma 3 in [4]. See [29] Sect. 1.5 for a detailed proof.
Gathering these two lemmas, we control the Kullback risk of ( t, s) on the event B 1 : 2E[K( t, s; t, s)1 B1 ] ≤ L K,η {2E[K( t, s; t m , s m )] + pen(m)}. To conclude, we need to control the Kullback risk of the estimator ( t, s) on the event B c 1 . This lemma is based on Hölder's inequality and on an upper bound of the moments of the parametric losses K(t, s; t m , s m ). A detailed proof is in the technical Appendix [29] Sect. 1.6. Combining (40) and Lemma 10.9 allows us to conclude that E[K(t, s; t, s)] ≤ L K,η {E[K(t, s; t m , s m )] + pen(m)} + L K /n + L K,η n 5/2 [1 + K(t, s; 0, 1)] exp[−nL K ]. If we assume that d[1 + log(p/d) ∨ 0] ≤ nη ′ (K) for some well-chosen function η ′ (K), then (H K ′ ,η ) is fulfilled and the risk bound (23) holds. A detailed proof is in the technical Appendix [29] Sect. 1.7.

Proofs of the minimax bounds
The minimax bounds are based on Fano's method [32]. Since the Kullback discrepancy is not a distance, we cannot directly apply this method. Instead, we use a modified version of Birgé's lemma [7] for covariance estimation. In the sequel, we denote by ∥t∥ l2 the Euclidean norm of a vector t.
The numerical constants κ 1 and κ 2 are made explicit in the proof.
The general setup of the proofs is to pick a maximal subset Υ of matrices that are well separated with respect to d(., .) and whose pairwise Kullback discrepancies are not too large. The existence of such subsets is ensured by technical combinatorial arguments. We postpone the complete proofs to the technical appendix [29] Sect. 2.

Proof of the Frobenius bounds
We derive the Frobenius rates of convergence from the Kullback bounds. Indeed, we prove in [29] that √ΣΩ ′ √Σ − I pn is small when K(Ω; Ω ′ ) is close to 0. Hence, one may upper bound the Frobenius distance between Ω ′ and Ω in terms of the Kullback discrepancy using that ... The complete proofs of Corollaries 5.4 and 6.3 are postponed to the technical Appendix [29] Sect. 4.
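The second-order expansion behind this comparison is a standard computation, which can be sketched as follows (writing A := √Σ Ω′ √Σ, as in the Kullback formula for centered Gaussian distributions):

```latex
K(\Omega;\Omega')
  = \tfrac{1}{2}\bigl[\operatorname{tr}(A) - \log\det(A) - p\bigr],
  \qquad A := \sqrt{\Sigma}\,\Omega'\sqrt{\Sigma}.
```

Writing λ_1, . . . , λ_p for the eigenvalues of A, we get

```latex
K(\Omega;\Omega')
  = \tfrac{1}{2}\sum_{i=1}^{p}\bigl(\lambda_i - 1 - \log\lambda_i\bigr)
  \;\approx\; \tfrac{1}{4}\sum_{i=1}^{p}(\lambda_i - 1)^2
  \;=\; \tfrac{1}{4}\,\bigl\|\sqrt{\Sigma}\,\Omega'\sqrt{\Sigma} - I_p\bigr\|_F^2,
```

since x − 1 − log x = (x − 1)²/2 + O((x − 1)³) near x = 1. This is why the Frobenius distance between Ω′ and Ω can be controlled by the Kullback discrepancy when the latter is small.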