Discussion of “Estimating structured high-dimensional covariance and precision matrices: Optimal rates and adaptive estimation”

Statistical inference for large covariance and precision matrices is a novel and interesting topic that has emerged in the last decade. The paper by Tony Cai, Zhao Ren and Harry Zhou (further referred to as [CRZ]) summarizes the key recent achievements in this rapidly developing area, where the authors are among the leading contributors. The focus is on fundamental decision-theoretic aspects, namely, on the following questions: (a) what are the best attainable rates of convergence of estimators in a minimax sense on various classes of matrices, and (b) how to construct data-driven adaptive procedures attaining these rates without knowledge of the parameters of the classes. When the dimension of the covariance matrix is greater than the sample size, accurate estimation is problematic unless some assumptions are imposed on the structure of the matrix. A wealth of such structural assumptions is presented in the paper, most of them having the form of sparsity or approximate sparsity constraints. Sparsity here is understood either as a small number of non-zero entries or of non-zero columns/rows of the matrix, or as a small $\ell_q$-norm of columns/rows, or as a low rank of the matrix, or as a combination of these properties.
The questions addressed in the paper have analogs in the classical Gaussian mean (Gaussian sequence) model, which is now extensively studied, cf., e.g., [2].
A key problem there is to construct minimax optimal and adaptive estimators of vectors on the $\ell_q$-balls based on observation of the unknown vector in Gaussian noise. A straightforward matrix extension of this classical problem is estimation of a sparse matrix $\Sigma \in \mathbb{R}^{p \times p}$ from the observation

$$Y = \Sigma + \varepsilon W, \tag{1}$$

where $W$ is a random noise matrix with i.i.d. standard Gaussian entries and $\varepsilon > 0$ is the noise level, which we can set as $\varepsilon = 1/\sqrt{n}$ in order to explore similarities with the covariance matrix estimation model. Some work on minimax optimal estimation in model (1) under sparsity (in the ordinary sense or in the sense of low rank) is now available, mainly regarding estimation in the Frobenius norm (see, e.g., [3,4,8]). It would be interesting to see what the differences or similarities are with the covariance matrix estimation problem under the same assumptions on $\Sigma$.

* Main article 10.1214/15-EJS1081.

A. B. Tsybakov

Of course, for covariance matrix estimation, the model is somewhat different. Then, observations of the form (1) are also available, with $Y$ being the empirical covariance matrix, but the noise matrix $W$ is not Gaussian and its entries are not i.i.d. Furthermore, as compared to the above papers, there are more restrictions on the matrix $\Sigma$, since it should be symmetric and positive definite. These differences make the analysis of covariance matrix estimation more involved, especially as far as the minimax lower bounds are concerned. However, intuitively it seems that there should be no fundamental difference in the rates between the two models. It would be interesting to clarify this point.

Consider one example, namely, the estimation of sparse spiked covariance matrices treated in Theorem 4 of [CRZ]. This theorem is based on a result in [1]. At first sight, it seems that the rate is different from what could be expected for model (1) under the same assumptions on $\Sigma$. Indeed, as shown in [4], the minimax rate of convergence under the spectral norm in model (1) does not depend on the rank, while the rank $r$ appears in the rate of Theorem 4. However, Theorem 4, as well as its prototype in [1], is valid under the condition $r \le k$, where $r = r_{n,p}$ and $k = c_{n,p}$ in the notation of [CRZ]. Thus, assuming that $\lambda_{n,p}$ is bounded by a fixed constant, we immediately deduce from Theorem 4 an upper bound of the order $k \log(ep/k)/n$ on the minimax risk. The lower bound is also of the same order.
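Therefore, with a fixed bound on $\lambda_{n,p}$, there is no dependency on the rank, which is in accordance with our initial guess based on the knowledge about model (1). The same result is easy to obtain by considering the estimator

$$\hat\Sigma = \operatorname*{argmin}_{\Sigma \in H_0(k)} \ \max_{u \in \mathbb{R}^p:\ \|u\|_2 = 1,\ \|u\|_0 \le 2k} \big| u^T (\Sigma - \Sigma_{\mathrm{emp}}) u \big|,$$

where $\Sigma_{\mathrm{emp}}$ denotes the empirical covariance matrix, and $H_0(k)$ is the class of all covariance matrices of size $p \times p$ represented as $\Sigma = I + B$ with a symmetric matrix $B$ having at most $k$ non-zero rows and $k$ non-zero columns. Here, $\|u\|_2$ is the Euclidean norm of $u \in \mathbb{R}^p$ and $\|u\|_0$ is the number of its non-zero components.

Let $\Sigma^* \in H_0(k)$ be the true covariance matrix. By definition of $\hat\Sigma$ we have

$$\max_{\|u\|_2 = 1,\ \|u\|_0 \le 2k} \big| u^T (\hat\Sigma - \Sigma_{\mathrm{emp}}) u \big| \le \max_{\|u\|_2 = 1,\ \|u\|_0 \le 2k} \big| u^T (\Sigma^* - \Sigma_{\mathrm{emp}}) u \big|.$$

Since both $\hat\Sigma$ and $\Sigma^*$ belong to $H_0(k)$, the difference $\hat\Sigma - \Sigma^*$ has at most $2k$ non-zero rows and at most $2k$ non-zero columns. Thus,

$$\|\hat\Sigma - \Sigma^*\| = \max_{\|u\|_2 = 1,\ \|u\|_0 \le 2k} \big| u^T (\hat\Sigma - \Sigma^*) u \big| \le 2 \max_{\|u\|_2 = 1,\ \|u\|_0 \le 2k} \big| u^T (\Sigma^* - \Sigma_{\mathrm{emp}}) u \big|,$$

where $\|\cdot\|$ denotes the spectral norm. On the other hand, if the observations $X_i$ are i.i.d. $N(0, \Sigma^*)$ and $2k \le n$, the bound (2) holds, where $\lambda$ is an upper bound on the spectral norm of $B^*$ in the representation $\Sigma^* = I + B^*$, and $C > 0$ is an absolute constant. The first inequality in (2) follows from the results of [9,5] and the union bound. In conclusion, we obtain a bound on the risk of $\hat\Sigma$ over the class $H(k, \lambda) = \{\Sigma = I + B \in H_0(k) : \|B\| \le \lambda\}$, which is a larger class than the one considered in Theorem 4 of [CRZ], and this bound on the risk holds for any rank $r \le p$.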
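The observation that the difference of two matrices from $H_0(k)$ has at most $2k$ non-zero rows and columns, so that its spectral norm is determined by a principal submatrix of size at most $2k$, can be checked numerically. The following is a minimal sketch and not part of the original argument: the dimensions, the random seed, the positive semidefinite choice of $B$, and the helper names `random_member` and `restricted_spectral_norm` are all illustrative assumptions.

```python
import numpy as np


def random_member(rng, p, k):
    """Draw Sigma = I + B, with B symmetric and supported on k rows/columns.

    B is taken positive semidefinite (an assumption made here so that
    Sigma is a valid covariance matrix). Returns (Sigma, support).
    """
    support = np.sort(rng.choice(p, size=k, replace=False))
    G = rng.standard_normal((k, k))
    B = np.zeros((p, p))
    B[np.ix_(support, support)] = G @ G.T / k
    return np.eye(p) + B, support


def restricted_spectral_norm(M, support):
    """Spectral norm of the principal submatrix of symmetric M on `support`."""
    sub = M[np.ix_(support, support)]
    return np.max(np.abs(np.linalg.eigvalsh(sub)))


rng = np.random.default_rng(0)
p, k = 30, 3  # hypothetical dimension and sparsity level

Sigma1, S1 = random_member(rng, p, k)
Sigma2, S2 = random_member(rng, p, k)

# The identity parts cancel, so D is supported on the union of the supports.
D = Sigma1 - Sigma2
union = np.union1d(S1, S2)  # at most 2k indices

full_norm = np.linalg.norm(D, 2)  # largest singular value of D
restricted_norm = restricted_spectral_norm(D, union)

# A unit vector with at most 2k non-zero components attaining the norm:
# embed the top eigenvector of the submatrix into R^p.
eigvals, eigvecs = np.linalg.eigh(D[np.ix_(union, union)])
top = np.argmax(np.abs(eigvals))
u = np.zeros(p)
u[union] = eigvecs[:, top]
quad_form = abs(u @ D @ u)

print(full_norm, restricted_norm, quad_form)
```

In this sketch the three printed quantities should agree up to floating-point error, which illustrates why the maximum of $|u^T M u|$ over unit vectors with at most $2k$ non-zero components recovers the spectral norm of such a difference $M$.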
Another interesting point of comparison with model (1) arises in the context of missing data. When the dimensions are very high, assuming that all entries of the matrix are observed is often unrealistic. This motivated the theory of matrix completion, which is now a well-developed field concerned mainly with model (1). Much less is known about the behavior of estimators of covariance structures with missing data. The first papers in this direction, devoted to sparse PCA and to estimation of covariance matrices, have appeared only very recently [6,7]. The main question here is what is the largest fraction of missing values for which successful estimation of the matrix or of its characteristics is still possible. The focus in [6,7] is on low rank covariance structures. The same question can be asked about various other covariance or precision matrix structures discussed in [CRZ].