Estimating Structured High-dimensional Covariance and Precision Matrices: Optimal Rates and Adaptive Estimation

This is an expository paper that reviews recent developments on optimal estimation of structured high-dimensional covariance and precision matrices. Minimax rates of convergence for estimating several classes of structured covariance and precision matrices, including bandable, Toeplitz, and sparse covariance matrices as well as sparse precision matrices, are given under the spectral norm loss. Data-driven adaptive procedures for estimating various classes of matrices are presented. Some key technical tools including large deviation results and minimax lower bound arguments that are used in the theoretical analyses are discussed. In addition, estimation under other losses and a few related problems such as Gaussian graph-ical models, sparse principal component analysis, and hypothesis testing on the covariance structure are considered. Some open problems on estimating high-dimensional covariance and precision matrices and their functionals are also discussed.


Introduction
Driven by a wide range of applications in many fields, from medicine to signal processing to climate studies and social science, high-dimensional statistical inference has emerged as one of the most important and active areas of current research in statistics. There have been tremendous recent efforts to develop new methodologies and theories for the analysis of high-dimensional data, whose dimension p can be much larger than the sample size n. The methodological and theoretical developments in high-dimensional statistics are mainly driven by the important scientific applications, but also by the fact that some of these high-dimensional problems exhibit new features that are very distinct from those in the classical low-dimensional settings.
Covariance structure plays a particularly important role in high-dimensional data analysis. A large collection of fundamental statistical methods, including the principal component analysis, linear and quadratic discriminant analysis, clustering analysis, and regression analysis, require the knowledge of the covariance structure or some aspects thereof. Estimating a high-dimensional covariance matrix and its inverse, the precision matrix, is becoming a crucial problem in many applications including functional magnetic resonance imaging, analysis of gene expression arrays, risk management and portfolio allocation.
The standard and most natural estimator, the sample covariance matrix, performs poorly and can lead to invalid conclusions in the high-dimensional settings. For example, when p/n → c ∈ (0, ∞], the largest eigenvalue of the sample covariance matrix is not a consistent estimate of the largest eigenvalue of the population covariance matrix, and the eigenvectors of the sample covariance matrix can be nearly orthogonal to the truth. See Wachter (1976Wachter ( , 1978, Johnstone (2001), El Karoui (2003, Paul (2007), and Johnstone and Lu (2009). In particular, when p > n, the sample covariance matrix is not invertible, and thus cannot be applied in many applications that require estimation of the precision matrix.
To overcome the difficulty due to the high-dimensionality, structural assumptions are needed in order to estimate the covariance or precision matrix consistently. Various families of structured covariance and precision matrices have been introduced in recent years, including bandable covariance matrices, sparse covariance matrices, spiked covariance matrices, covariances with a tensor product structure, sparse precision matrices, bandable precision matrix via Cholesky decomposition, and latent graphical models. These different structural assumptions are motivated by various scientific applications, such as genomics, genetics, and financial economics. Many regularization methods have been developed accordingly to exploit the structural assumptions for estimation of covariance and precision matrices. These include the banding method in Wu and Pourahmadi (2009) and Bickel and Levina (2008a), tapering in Furrer and Bengtsson (2007) and Cai et al. (2010), thresholding in Levina (2008b), El Karoui (2008) and Cai and Liu (2011a), penalized likelihood estimation in Huang et al. (2006), Yuan and Lin (2007), d 'Aspremont et al. (2008), Banerjee et al. (2008), Rothman et al. (2008), Lam and Fan (2009), Ravikumar et al. (2011), and Chandrasekaran et al. (2012), regularizing principal components in Johnstone and Lu (2009), Zou et al. (2006), Cai, Ma, and Wu (2013), and Vu and Lei (2013), and penalized regression for precision matrix estimation in Meinshausen and Bühlmann (2006), Yuan (2010), , Sun and Zhang (2013), and Ren et al. (2015).
Parallel to methodological advances on estimation of covariance and precision matrices there have been theoretical studies of the fundamental difficulty of the various estimation problems in terms of the minimax risks. Cai et al. (2010) established the optimal rates of convergence for estimating a class of high-dimensional bandable covariance matrices under the spectral norm and Frobenius norm losses. Rate-sharp minimax lower bounds were obtained and a class of tapering estimators were constructed and shown to achieve the optimal rates. Cai and Zhou (2012a,b) considered the problems of optimal estimation of sparse covariance and sparse precision matrices under a range of losses, including the spectral norm and matrix 1 norm losses. Cai, Ren, and Zhou (2013) studied optimal estimation of a Toeplitz covariance matrix by using a method inspired by an asymptotic equivalence theory between the spectral density estimation and Gaussian white noise established in Golubev et al. (2010).  solved the minimax estimation problem for a large class of sparse spiked covariance matrices under the spectral norm loss. Recently Ren et al. (2015) obtained fundamental limits on estimation of individual entries of a sparse precision matrix.
Standard techniques often fail to yield good results for many of these matrix estimation problems, and new tools are thus needed. In particular, for estimating sparse covariance matrices under the spectral norm, a new lower bound technique was developed in Cai and Zhou (2012b) that is particularly well suited to treat the "two-directional" nature of the covariance matrices, where one direction is along the rows and another along the columns. The result can be viewed as a generalization of Le Cam's method in one direction and Assouad's lemma in another. This new technical tool is useful for a range of other estimation problems. For example, it was used in  for establishing the optimal rate of convergence for estimating sparse precision matrices and Tao et al. (2013) applied the technique to obtain the optimal rate for volatility matrix estimation.
The goal of the present paper is to provide a survey of these recent optimality results on estimation of structured high-dimensional covariance and precision matrices, and discuss some key technical tools that are used in the theoretical analyses. In addition, we will present data-driven adaptive procedures for estimation of various structured matrices. The main focus is on the bandable, Toeplitz, sparse, and sparse spiked covariance matrices as well as sparse precision matrices. Among these classes of matrices, the optimal procedures for estimating the bandable, Toeplitz, and sparse covariance matrices are obtained by "smoothing" or thresholding the sample covariance matrices based on various sparsity assumptions. In contrast, estimation of sparse spiked covariance matrices, which have sparse principal components, requires significantly different techniques to achieve optimality results. A few related problems such as sparse principal component analysis, factor models, and hypothesis testing on the covariance structure are also considered. Some open problems will be discussed at the end.
Throughout the paper, we assume that we observe a random sample {X (1) , . . . , X (n) } which consists of n independent copies of a p-dimensional random vector X = (X 1 , . . . , X p ) following some distribution with covariance matrix Σ = (σ ij ). The goal is to estimate the covariance matrix Σ and its inverse, the precision matrix Ω = Σ −1 = (ω ij ), based on the sample X (1) , . . . , X (n) . Here for ease of presentation we assume E(X) = 0. This assumption is not essential. The non-centered mean case will be briefly discussed in Section 5.
Notations. Before we present a concise summary of the optimality results for estimating various structured covariance and precision matrices in this section, we introduce some basic notation that will be used in the rest of the paper. For any vector x ∈ R p , we use ||x|| ω to denote its ω norm with the convention that ||x|| = ||x|| 2 . For any p by q matrix M = (m ij ) ∈ R p×q , we use M to denote its transpose. The matrix ω operator norm is denoted by ||M || ω = max ||x||ω =1 ||Mx|| ω with the convention ||M || = ||M || 2 for the spectral norm. Moreover, the entrywise ω norm, that is, the ω norm of M viewed as a vector, is denoted by ||M || ω and the Frobenius norm is represented by ||M || F = ||M || 2 = ( i,j m 2 ij ) 1/2 . The submatrix with rows indexed by I and columns indexed by J is denoted by M I,J . When the submatrix is a vector or a real number, we sometimes also use the lower case m instead of M . We use ||f || ∞ = sup x |f (x)| to denote the sup-norm of a function f (·), and I {A} to denote the indicator function of an event A. We denote the covariance matrix of a random vector X by Cov(X) with the convention Var(X) = Cov(X) when X is a random variable. For a symmetric matrix M , M 0 means positive definiteness, M 0 means positive semi-definiteness and det(M ) is its determinant. We use λ max (M ) and λ min (M ) to denote its largest and smallest eigenvalues respectively. Given two sequences a n and b n , we write a n = O(b n ), if there is some constant C > 0 such that a n ≤ Cb n for all n, and a n = o(b n ) implies a n /b n → 0. The notation a n b n means a n = O(b n ) and b n = O(a n ). The n × p dimensional data matrix is denoted by X = (X (1) , . . . , X (n) ) and the sample covariance matrix with known E(X) = 0 is then defined asΣ n = X X/n = (σ ij ). For any index set I ⊆ {1, . . . , p}, we denote by X I the submatrix of X consisting of the columns of X indexed by I. The following definition of sub-Gaussian distributions is used throughout the paper. Definition 1. The distribution of a random vector X is said to be sub-Gaussian with constant ρ > 0 if P{|v (X − EX)| > t} ≤ 2e −t 2 ρ/2 , for all t > 0 and all deterministic unit vectors v = 1.
Several matrix norms are used for the loss functions and in the technical analysis. These include the spectral norm · , matrix 1 operator norm || · || 1 , matrix ∞ operator norm || · || ∞ and Frobenius norm || · || F among others. In particular, for symmetric matrices, the spectral norm is the largest singular value. The matrix 1 ( ∞ ) operator norms are the largest column (row) sum of absolute values. Hence for symmetric matrices, these two norms coincide with each other. While the choice of the loss function varies according to specific applications and needs, the minimax behavior of the matrix estimation critically depends on the norm under which the error is measured. Matrix estimation under the Frobenius norm loss is essentially a vector estimation problem. What really makes matrix estimation apart from it is those estimation problems under matrix operator norm losses, which are highly non-additive with respect to entries of the matrix. In particular, estimating covariance and precision matrices under the spectral norm loss brings in many new challenges as well as insights. We present the optimality results in the rest of this section under the Gaussian assumption with the focus on estimation under the spectral norm loss. More general settings will be discussed in Sections 2 and 3. A brief discussion on estimation under the Schatten q norm losses is given in Section 6.

Estimation of structured covariance matrices
We will consider in this paper optimal estimation of a range of structured covariance matrices, including bandable, Toeplitz, sparse and sparse spiked covariance matrices.

Bandable covariance matrices
The bandable covariance structure exhibits a natural "order" or "distance" among variables. This assumption is mainly motivated by time series with many scientific applications such as climatology and spectroscopy. See, for example, Friston et al. (1994) and Visser and Molenaar (1995). We consider settings where σ ij is close to zero when |i − j| is large. In other words, the variables X i and X j are nearly uncorrelated when the distance |i − j| between them is large. The following parameter space was proposed in Bickel and Levina (2008a) (see also Wu and Pourahmadi (2003)), The parameter α specifies how fast the sequence σ ij decays to zero as j → ∞ for each fixed i. This can be viewed as the smoothness parameter of the class F α , which is usually seen in nonparametric function estimation problems. In particular, for stationary processes, the parameter α is related to the smoothness of the corresponding spectral density function, which makes F α (M 0 , M) a natural class of covariance matrices. See Grenander and Szegö (1958) for details. A larger α implies a smaller number of "effective" parameters in the model. Some other classes of bandable covariance matrices have also been considered in the literature, for example, Note that G α (M 1 ) ⊂ F α (M 0 , M) if M 0 and M are sufficiently large. We will mainly focus on the larger class (1) in this paper. Assume that p ≤ exp( n) Estimating structured high-dimensional covariance and precision matrices 7 for some constant > 0, then the optimal rate of convergence for estimating the covariance matrix under the spectral norm loss over the class F α (M 0 , M) is given as follows. See Cai et al. (2010).
Theorem 1 (Bandable Covariance Matrix). Suppose X is Gaussian. The minimax risk of estimating the covariance matrix over the bandable class given in (1) under the spectral norm loss is The minimax upper bound is derived by using a tapering estimator and the key rate n − 2α 2α+1 in the minimax lower bound is obtained by applying Assouad's lemma. We will discuss the important technical details in Sections 2.1 and 4.2 respectively. An adaptive block thresholding procedure, not depending on the knowledge of smoothness parameter α, is also introduced in Section 2.1.

Toeplitz covariance matrices
Toeplitz covariance matrix arises naturally in the analysis of stationary stochastic processes with a wide range of applications in many fields, including engineering, economics, and biology. See, for instance, Franaszczuk et al. (1985), Fuhrmann (1991) and Quah (2000) for specific applications. It can also be viewed as a special case of bandable covariance matrices. Similar decay or smoothness assumption like the one given in (1) is imposed, but each descending diagonal from left to right is constant for a Toeplitz matrix. Specifically, suppose the random vector X = (X 1 , . . . , X p ) consists of the first p variables of a fixed zero mean stationary process {X i } as p → ∞. Then the Toeplitz covariance matrix Σ of X is uniquely determined by the autocovariance sequence It is well known that the spectrum of Toeplitz covariance matrix Σ is closely connected to the spectral density of the stationary process {X i } given by See, e.g., Grenander and Szegö (1958). Motivated by time series applications, we consider the following class of Toeplitz covariance matrices FT α (M 0 , M) defined in terms of the smoothness of the spectral density f . Let α = γ + β > 0, where γ is the largest integer strictly less than α, 0 < β ≤ 1, (3) In other words, the parameter space FT α (M 0 , M) contains the Toeplitz covariance matrices whose corresponding spectral density functions are of Hölder smoothness α. See, e.g., Parzen (1957) and Samarov (1977). Another parameter space, which directly specifies the decay rate of the autocovariance sequence (σ m ), has also been considered in the literature, see, Cai, Ren, and Zhou (2013). Under the assumption (np/ log(np)) 1 2α+1 < p/2, the following theorem gives the optimal rate of convergence for estimating the Toeplitz covariance matrices over the class FT α (M 0 , M) under the spectral norm loss.
Theorem 2 (Toeplitz Covariance Matrix). Suppose X is Gaussian. The minimax risk of estimating the Toeplitz covariance matrices over the class given in The minimax upper bound is attained by a tapering procedure on certain estimators of autocovariance sequence, which is different from those banding estimators considered in Bickel and Levina (2008a) or Furrer and Bengtsson (2007). The minimax lower bound is established through the construction of a more informative model and an application of Fano's lemma. The essential technical details can be found in Sections 2.2 and 4.5 respectively. See Cai, Ren, and Zhou (2013) for further details.

Sparse covariance matrices
For estimating bandable and Toeplitz covariance matrices, one can take advantage of the information from the natural "order" on the variables. However, in many other applications such as genomics, there is no knowledge of distance or metric between variables, but the covariance between most pairs of the variables are often assumed to be insignificant. The class of sparse covariance matrices assumes that most of entries in each row and each column of the covariance matrix are zero or negligible. Compared to the previous two classes, there is no information on the "order" among the variables. We consider the following large class of sparse covariance matrices, If an extra assumption that the variances σ ii are uniformly bounded with max i σ ii ≤ ρ for some constant ρ > 0 is imposed, then H(c n,p ) can be defined in terms of the maximal truncated 1 norm max 1≤i≤p p j=1 min{1,|σ ij |(n/log p) 1/2 }, which has been considered in high-dimensional regression setting (see, for example, Zhang and Zhang (2012)).
For recovering the support of the sparse covariance matrices, it is natural to consider the following parameter space in which there are at most c n,p nonzero entries in each row/column of a covariance matrix, Estimating structured high-dimensional covariance and precision matrices 9 One important feature of the classes H(c n,p ) and H 0 (c n,p ) is that they do not put any constraint on the variances σ ii , i = 1, . . . , p. Therefore the variances σ ii can be in a very wide range and possibly max i σ ii → ∞. When the additional bounded variance condition max i σ ii ≤ ρ for some constant ρ > 0 is imposed, it can be shown that the class H(c n,p ) contains other commonly considered classes of sparse covariance matrices in the literature, including an q ball assumption max i p j=1 |σ ij | q ≤ s n,p in Bickel and Levina (2008b), and a weak q ball assumption max 1≤j≤p σ j[k] q ≤ s n,p /k for each integer k in Cai and Zhou (2012a) where σ j [k] is the kth largest entry in magnitude of the jth row (σ ij ) 1≤i≤p . More specifically, these two classes of sparse covariance matrices are contained in H(c n,p ) with c n,p = C q s n,p (n/ log p) q/2 for some constant C q depending on q only. The class H(c n,p ) also covers the adaptive sparse covariance class U * q (s n,p ) proposed in Cai and Liu (2011a), in which each row/column (σ ij ) 1≤i≤p is assumed to be in a weighted q ball for 0 ≤ q < 1,i.e., with c n,p = C q s n,p (n/ log p) q/2 for 0 ≤ q < 1. Although the class H(c n,p ) is slightly larger than the classes defined via q ball, weak q ball under bounded variances condition and the class U * q (s n,p ), the minimax risks of estimation over these different classes are the same. Indeed, the least favorable subclass of H(c n,p ) chosen to establish the minimax lower bound is contained in other smaller classes. See Cai and Zhou (2012b) for further details on the minimax lower bound construction. It is worthwhile to point out that the class H(c n,p ) is less general than the ones considered in El Karoui (2008) for which consistency results under the spectral norm loss were established. We advocate the larger sparse covariance class H(c n,p ) in this paper not only because it contains almost all other classes considered in the literature, but also the comparison between the noise level ((σ ii σ jj log p)/n) 1/2 and the signal level |σ ij | captures the essence of the sparsity of the model.
Under some mild conditions 1 ≤ c n,p ≤ C n/(log p) 3 and p ≥ n φ for some constants φ and C > 0, the optimal rate of convergence for estimating sparse covariance matrices over the class H(c n,p ) under the spectral norm is given as follows. See Cai and Zhou (2012b).
Theorem 3 (Sparse Covariance Matrix). Suppose X is Gaussian. The minimax risk of estimating a sparse covariance matrix over the class H(c n,p ) given in (4) satisfies The distributional assumption on X can be significantly relaxed. See Section 2.3 for further details. Besides, an adaptive thresholding estimator is constructed in Section 2.3, and it is shown to be adaptive to the variability of the individual entries and attains the minimax upper bound. The lower bound argument for estimation under the spectral norm was given in Cai and Zhou (2012b) by applying Le Cam-Assouad's method, which is introduced in Section 4.1.

Sparse spiked covariance matrices
where λ 1 ≥ λ 2 ≥ · · · ≥ λ r > 0 and the vectors v 1 , . . . , v r are orthonormal, arises naturally in principal component analysis as well as factor models with homoscedastic noise. In particular, the first r eigenvalues of Σ are strictly larger than 1, the remaining eigenvalues. Since the spectrum of Σ has r spikes, (6) was first named spiked covariance model in Johnstone (2001). Before defining the sparse spiked covariance structure, we introduce some notation. Let the set of p by r matrices with orthonormal columns be O(p, r) = {V ∈ R p×r : V V = I}. For any such V = (v 1 , v 2 , . . . , v r ) ∈ O(p, r), we denote its ith row by V i, * . The diagonal matrix Λ = diag(λ 1 , . . . , λ r ). The class of sparse spiked covariance matrices imposes the sparsity on the rows of V . Specifically, we define where 1 ≤ r n,p ≤ c n,p ≤ p. More concretely, the group-sparse structure is imposed on the r leading eigenvectors of Σ in J (c n,p , r n,p , λ n,p ) and the number of nonzero rows of V is no more than c n,p . As a consequence, as the sum of a sparse matrix and an identity matrix, the covariance matrix Σ itself is also sparse. In particular, there are no more than c n,p nonzero entries in each row and each column of Σ and hence J (c n,p , r n,p , λ n,p ) ⊂ H 0 (c n,p ) defined in (5). However, compared with the previous three structures, sparse spiked covariance matrix has a very different structural component, the rank r matrix r i=1 λ r v i v i . This extra low-rank structure, together with the sparsity assumption yields a faster minimax rate of convergence for estimating the covariance matrices over J (c n,p , r n,p , λ n,p ) than that over sparse covariance class with sparsity c n,p . Under the assumption cn,p n log ep cn,p ≤ c for some sufficiently small constant c > 0, the optimal rate of convergence for estimating sparse spiked covariance matrices over the class J (c n,p , r n,p , λ n,p ) under the spectral norm is given as follows. See .
Theorem 4 (Sparse Spiked Covariance Matrix). Suppose X is Gaussian. The minimax risk of estimating the sparse spiked covariance matrices over the class given in (7) (λ n,p + 1) c n,p n log ep c n,p + λ 2 n,p r n,p n , λ 2 n,p .
The minimax risk essentially has two terms (λn,p+1)cn,p n log ep cn,p and λ 2 n,p rn,p n . The first term does not involve r n,p and is the same bound in the rank-one r = 1 setting, in which estimation can be reduced into a single sparse vector estimation problem. The second term is the oracle risk when the support of V is known and can be treated as the risk in r dimensional setting. The minimax upper bound is attained by a global searching scheme for the support of V which is constructed in Section 2.4. The minimax lower bounds of the two terms are shown through applications of Fano's lemma and Le Cam's method respectively, which are introduced in Sections 4.1. See  for further details.

Estimation of structured precision matrices
In addition to covariance matrix estimation, there is also significant interest in estimation of its inverse, the precision matrix, under the structural assumptions on the precision matrix itself. Precision matrix is closely connected to the undirected Gaussian graphical model, which is a powerful tool to model the relationships among a large number of random variables in a complex system and is used in a wide array of scientific applications. See, for instance, Wille et al. (2004) and Runge et al. (2014), for some recent applications in genomics and climate studies. It is well known that recovering the graph structure G = (V, E) of an undirected Gaussian graphical model is equivalent to recovering the support of the precision matrix. In fact, if X ∼ N 0, Ω −1 is a graphical model with respect to G, then the entry ω ij is zero if and only if the variables X i and X j are conditionally independent given all the remaining variables, which is equivalent to the edge (i, j) / ∈ E. (See, e.g., Lauritzen (1996).) Consequently, a sparse graph corresponds to a sparse precision matrix. We thus focus on estimation of sparse precision matrices and present the optimality results under the Gaussian assumption.

Sparse precision matrices and Gaussian graphical model
The class of sparse precision matrices assumes that most of entries in each row/column of the precision matrix are zero or negligible. The class of sparse precision matrices HP(c n,p , M) introduced in this section is similar to the class of sparse covariance matrices defined in (4), where the sparsity is modeled by a truncated 1 type norm. We also assume the spectra of Ω are bounded from below and max i σ ii is bounded above for simplicity. More specifically, we define HP(c n,p , M) by where M is some universal constant and the sparsity parameter c n,p is allowed to grow with p, n → ∞. This class of precision matrices was proposed in Ren et al. (2015) and contains similar classes proposed in  and , in which an extra matrix 1 norm bound M n,p is included in the definition. As a special case where each |ω ij | is either zero or above the level (n/ log p) −1/2 , such a matrix in HP(c n,p , M) has at most c n,p nonzero entries on each row/column which is called the maximum node degree of Ω in the Gaussian graphical model. We define the following class for the support recovery purpose, The following theorem provides the optimal rate of convergence for estimating sparse precision class under the spectral norm loss.
Theorem 5 (Sparse Precision Matrix). Suppose X is Gaussian. Assume that 1 ≤ c n,p ≤ C n/(log p) 3 . The minimax risk of estimating the sparse precision matrix over the class HP(c n,p , M) given in (8) satisfies In Section 3.1, we establish the minimax upper bound via a neighborhood regression approach. The estimator ANT introduced in Section 3.2 also achieves this optimal rate. The lower bound argument is provided by applying Le Cam-Assouad's method developed in  which is discussed in Section 4.4.
For estimating sparse precision matrices, besides the minimax risk under the spectral norm, it is also important to understand the minimax risk of estimating individual entries of the precision matrix. The solution is not only helpful for the support recovery problem but also makes important advancements in the understanding of statistical inference of low-dimensional parameters in a highdimensional setting. See Ren et al. (2015).
Theorem 6 (Entry of Sparse Precision Matrix). Suppose X is Gaussian. Assume that (c n,p log p)/n = o(1). The minimax risk of estimating ω ij for each i, j over the sparse precision class given in (8) is The minimax upper bound is based on a multivariate regression approach given in Section 3.2 while Le Cam's lemma is used to show the minimax lower bound in Section 4.3.

Organization of the paper
The rest of the paper is organized as follows. Section 2 presents several minimax and adaptive procedures for estimating various structured covariance matrices and establishes the corresponding minimax upper bounds under the spectral norm loss. Estimation under the factor models and sparse principal component Estimating structured high-dimensional covariance and precision matrices 13 analysis are also discussed. Section 3 considers minimax and adaptive estimation of sparse precision matrices under the spectral norm loss. Section 3 also discusses inference on the individual entries of a sparse precision matrix and the latent graphical model. Section 4 focuses on the lower bound arguments in matrix estimation problems. It begins with a review of general lower bound techniques and then applies the tools to establish rate-sharp minimax lower bounds for various covariance and precision matrix estimation problems. The upper and lower bounds together yield immediately the optimal rates of convergence stated earlier in this section. Section 5 briefly discusses the non-centered mean case and the positive semi-definite issue in covariance and precision matrix estimation. Hypothesis testing on the covariance structure is also discussed. The paper is concluded with a discussion on some open problems on estimating covariance and precision matrices as well as their functionals in Section 6.

Estimation of structured covariance matrices
This section focuses on estimation of structured covariance matrices. Minimax upper bounds and adaptive procedures are introduced. Many estimators are based on "smoothing" the sample covariance matrices. These include the banding, tapering and thresholding estimators. The optimal estimator for sparse spiked covariance matrices, however, relies on a global searching scheme. Estimation of precision matrices is considered in the next section.

Bandable covariance matrices
Minimax upper bound Bickel and Levina (2008a) introduced the class of bandable covariance matrices F α (M 0 , M) given (1) and proposed a banding estimator based on the sample covariance matrixΣ n = (σ ij ) for estimating a covariance matrix Σ ∈ F α (M 0 , M). The bandwidth k was chosen to be k B = log p n 1 2(α+1) and the rate of convergence log p n α α+1 for estimation under the spectral norm loss was proved under the sub-Gaussian assumption on X = (X 1 , . . . , X p ) . It was unclear if this rate is optimal. Cai et al. (2010) further studied the optimal estimation problem for the classes F α (M 0 , M) in (1) and G α (M 1 ) in (2) under the sub-Gaussian assumption. A tapering estimator was proposed. Tapering estimators have been effectively used in the literature of climate studies. See, for instance, Gaspari and Cohn (1999), Houtekamer and Mitchell (2001) and Hamill et al. (2001). Furrer et al. (2006) and Furrer and Bengtsson (2007) previously also considered tapering estimators for estimating covariance matrices but in different settings.
14 T. T. Cai et al. Specifically, for a given even positive integer k ≤ p, let ω = (ω m ) 0≤m≤p−1 be a weight sequence with ω m given by The tapering estimatorΣ T,k of the covariance matrix Σ is defined bŷ It was shown that the tapering estimator with bandwidth k T = min{n 1 2α+1 , p} gives the following rate of convergence under the spectral norm.
Theorem 7 (Cai et al. (2010)). Suppose that X is sub-Gaussian distributed with some finite constant. Then the tapering estimatorΣ T,k Theorem 7 clearly also holds for G α (M 1 ), a subspace of F α (M 0 , M). Note that the rate given in (12) is faster than the rate ((log p)/n) α/(α+1) obtained in Bickel and Levina (2008a) for the banding estimatorΣ B,k with the bandwidth k B = log p n 1 2(α+1) , which implies that this banding estimator is sub-optimal. A minimax lower bound is also established in Cai et al. (2010), which shows that the rate of convergence in (12) is indeed optimal. We will discuss this minimax lower bound argument in Section 4.2.
There are two key steps in the technical analysis of the tapering estimator. In the first step, it is shown that the tapering estimatorΣ T,k has a simple representation and can be written as the average of many small disjoint submatrices of size no more than k in the sample covariance matrixΣ n . Consequently, the distance Σ T,k T − Σ can be bounded by the maximum of distances of these submatrices from their respective means. The second key step involves the application of a large deviation result for sample covariance matrix of relatively small size under the spectral norm. This random matrix result, stated in the following lemma, is a commonly used technical tool in high-dimensional statistical problems. See Cai et al. (2010) for further details.
Then there exist some universal constant C > 0 and some constant ρ 1 depending on ρ, such that the sample covariance matrix of See also Davidson and Szarek (2001) for more refined results under the Gaussian assumption.
For the banding estimator, Bickel and Levina (2008a) used the matrix 1 norm as the upper bound to control the spectral norm. Then bounding the risk under the spectral norm can be turned into bounding the error on each row of Σ n under the vector 1 norm, which is an easier task. An analysis of the bias and variance trade-off then leads to their choice of bandwidth k B = (n/ log p) 1 2(α+1) . The loose control of spectral norm by the matrix 1 norm is the main reason for the sub-optimal result in Bickel and Levina (2008a) under the spectral norm loss. It is worthwhile to point out that the result in Bickel and Levina (2008a) is still not optimal under the matrix 1 norm loss due to the sub-optimal choice of the bandwidth. See the discussion on estimation under other losses at the end of this section. An interesting question is whether the banding estimator with a different bandwidth is also optimal. Indeed it can be shown that the banding estimatorΣ B,k with the bandwidth k = min{n 1 2α+1 , p} is rate-optimal. See, for example, Xiao and Bunea (2014) for a detailed calculation under the Gaussian assumption.

Adaptive estimation through block thresholding
It is evident that the construction of the optimal tapering estimatorΣ T,k T requires the explicit knowledge of the decay rate α which is usually unknown in practice. Cai and Yuan (2012) considered the adaptive estimation problem and constructed a data-driven block thresholding estimator, not depending on α, M 0 , M or M 1 , that achieves the optimal rate of convergence simultaneously over the parameter spaces F α (M 0 , M) and G α (M 1 ) for all α > 0.
The construction of the adaptive estimator consists of two steps. In the first step, we divide the sample covariance matrix into blocks of increasing sizes as they move away from the diagonal, suggested by the decay structure of the bandable covariance matrices. The second step is to simultaneous kill or keep all the entries within each block to construct the final estimator, where the thresholding levels are chosen adaptively for different blocks. The underlying idea is to mimic the analysis for the tapering estimatorΣ T,k T in the sense that the final estimator can be decomposed as the sum of small submatrices. However, the choice of the bandwidth is chosen adaptively here through using the block thresholding strategy, where the threshold rule is established by a novel norm compression inequality. See Theorem 3.4 of Cai and Yuan (2012) for details. Now we briefly introduce the construction of blocks. First, we construct disjoint square blocks of size k ad log p along the diagonal. Second, a new layer of blocks of size k ad are created towards the top right corner along the diagonal next to the previous layer. In particular, this layer of blocks has either two or one block (of size k ad ) in an alternating fashion (see Figure 1). After this step, we note that the odd rows of blocks has three blocks of size k ad and even rows of blocks have two blocks of size k ad . This creates space to double the size of the blocks in the next step. We then repeat the first and second steps building on the previous layer of blocks but double the size of the blocks until the whole upper half of the matrix is covered. In the end, the same blocking construction is done for the lower half of the matrix. It is possible that the last row and last column of the blocks are rectangular instead of being square. For the sake of brevity, we omit the discussion on these blocks. See Cai and Yuan (2012) for further details. By the construction, the blocks form a partition of {1, . . . , p} 2 . We list the indices of those blocks by B = {B 1 , . . . , B N }, assuming there are N blocks in total. The construction of the blocks is illustrated in Figure 1.
Once the blocks B are constructed, we define the final adaptive estima-torΣ Ada CY by the following thresholding procedure on the blocks of the sample covariance matrixΣ n . First, we keep the diagonal blocks of size k ad constructed at the very beginning, i.e., (Σ Ada CY ) B = (Σ n ) B for those B in the diagonal. Second, we set all the large blocks to 0, i.e., (Σ Ada CY ) B = 0 for all B ∈ B with size(B) > n/ log n, where for B = I × J, size(B) is defined to be max{|I|, |J|}. In the end, we threshold the intermediate blocks adaptively according to their spectral norm as follows.
Finally we obtain the adaptive estimatorΣ Ada CY , which is rate optimal over the classes F α (M 0 , M) or G α (M 1 ).
Theorem 8 (Cai and Yuan (2012)). Suppose that X is sub-Gaussian distributed with some finite constant. Then for all α > 0, the adaptive estimatorΣ Ada CY constructed above satisfies where the constant C depends on α, M 0 and M .
In light of the optimal rate of convergence given in Theorem 1, this shows that the block thresholding estimator adaptively achieves the optimal rate over

Estimation under other losses
In addition to estimating bandable covariance matrices under the spectral norm, estimation under other losses has also been considered in literature. Furrer and Bengtsson (2007) introduced a general tapering estimator, which gradually shrinks the off-diagonal entries toward zero with certain weights, to estimate covariance matrices under the Frobenius norm loss in different settings. Cai et al. (2010) studied the problem of optimal estimation under the Frobenius norm loss over the classes F α (M 0 , M) and G α (M 1 ), and Cai and Zhou (2012a) investigated optimal estimation under the matrix 1 norm.
The optimal estimators for both the Frobenius norm and matrix 1 norm are again based on the tapering estimator (or banding estimator) with the bandwidth k T,F = n 1 2(α+1) and k T,1 = min n 1 2(α+1) , (n/ log p) 1 2α+1 respectively. The rate of convergence under the Frobenius norm of F α (M 0 , M) is different from that of G α (M 1 ). The optimal rate of convergence under the matrix 1 norm is min (log p/n) 2α 2α+1 + n − α α+1 , p 2 /n . Their corresponding minimax lower bounds are established in Cai et al. (2010) and Cai and Zhou (2012a). Comparing these estimation results, it should be noted that although the optimal estimators under different norms are all based on tapering or banding, the best bandwidth critically depends on the norm under which the estimation accuracy is measured.

Toeplitz covariance matrices
We turn to optimal estimation of Toeplitz covariance matrices. Recall that if X is a stationary process with autocovariance sequence (σ m ), then the covariance matrix Σ p×p has the Toeplitz structure such that σ ij = σ |i−j| . Wu and Pourahmadi (2009) introduced and studied a banding estimator based on the sample autocovariance matrix, and McMurry and Politis (2010) extended their results to tapering estimators. Xiao and Wu (2012) further improved these two results and established sharp banding estimator under the spectral norm. All these results are obtained in the framework of causal representation and physical dependence measures proposed in Wu (2005). One important assumption is that there is only one realization available, i.e. n = 1.
More recently, Cai, Ren, and Zhou (2013) considered the problem of optimal estimation of Toeplitz covariance matrices over the class FT α (M 0 , M). The result is valid not only for fixed sample size but also for the general case when n, p → ∞.
An optimal tapering estimator is constructed by tapering the sample autocovariance sequence. More specifically, recallΣ n = (σ ij ) is the sample covariance matrix. For 0 ≤ m ≤ p − 1, set which is an unbiased estimator of σ m . The estimatorσ m was also proposed in Bickel and Levina (2004). Then the tapering Toeplitz estimatorΣ T oep T,k = (σ st ) of Σ with bandwidth k is defined asσ st = ω |s−t|σ|s−t| , where the weight ω m is defined in (11). By picking the best choice of the bandwidth k T o = (np/ log (np)) 1 2α+1 , the optimal rate of convergence is given as follows.
Theorem 9 (Cai, Ren, and Zhou (2013) The performance of the banding estimator was also considered in Cai, Ren, and Zhou (2013). Surprisingly, the best banding estimator is inferior to the optimal tapering estimator over the class FT α (M 0 , M), which is due to a larger bias term caused by the banding estimator. See Cai, Ren, and Zhou (2013) for further details. At a high level, the upper bound proof is based on the fact that the spectral norm of a Toeplitz matrix can be bounded above by the sup-norm of the corresponding spectral density function. which leads to the appearance of the logarithmic term in the upper bound. A lower bound argument will be introduced in Section 4.5. The appearance of the logarithmic term indicates the significant difference in the technical analyses between estimating bandable and Toeplitz covariance matrices.

Sparse covariance matrices
We now consider optimal estimation of another important class of structured covariance matrices -sparse covariance matrices. This problem has been considered by Bickel and Levina (2008b), Rothman et al. (2009) and Cai and Zhou (2012a,b). These works assume that the variances are uniformly bounded, i.e., max i σ ii ≤ ρ for some constant ρ > 0. Under such a condition, a universal thresholding estimatorΣ U, . This estimator is not adaptive as it depends on the unknown parameter ρ. The rates of convergence s n,p (log p/n) (1−q)/2 under the spectral norm and matrix 1 norm in probability are obtained over the classes of sparse covariance matrices, in which each row/column is in an q ball or a weak q ball, 0

Sparse covariance matrices: Adaptive minimax upper bound
Estimation of a sparse covariance matrix is intrinsically a heteroscedastic problem in the sense that the variances of the entries of the sample covariance matrix σ ij can vary over a wide range. A universal thresholding estimator essentially treats the problem as a homoscedastic problem and does not perform well in general. Cai and Liu (2011a) proposed a data-driven estimatorΣ Ad,T h which adaptively thresholds the entries according to their individual variabilities, The advantages of this adaptive procedure are that it is fully data-driven and no longer requires the variances σ ii to be uniformly bounded. The estimator Σ Ad,T h attains the following rate of convergence adaptively over sparse covariance classes H(c n,p ) defined in (4) A similar idea was applied in the practical implementation without theoretical justification in El Karoui (2008). The proof of Theorem 10 essentially follows from the analysis in Cai and Liu (2011a) forΣ Ad,T h under the matrix 1 norm over a smaller class of sparse covariance matrices U * q (s n,p ), where each row/column (σ ij ) 1≤i≤p is assumed to be in a weighted q ball, That result is automatically valid for all matrix ω norms, following the claim in Section 6 of Cai and Zhou (2012b), where the Riesz-Thorin interpolation theorem is applied. The key technical tool in the analysis is the following large deviation result for self-normalized entries of the sample covariance matrix.

Lemma 2.
Under the assumptions of Theorem 10, for any small ε > 0, Lemma 2 follows from a moderate deviation result in Shao (1999). See Cai and Liu (2011a) for further details.
Theorem 10 states the rate of convergence in probability. The same rate of convergence holds in expectation with a mild sample size condition p ≥ n φ for some φ > 0. A minimax lower bound argument under the spectral norm is provided in Cai and Zhou (2012b) (see also Section 4.4), which shows thatΣ Ad,T h is indeed rate optimal under all matrix ω norms. In contrast, the universal thresholding estimatorΣ U,T h is sub-optimal over H(c n,p ) due to the possibility Besides the matrix ω norms, Cai and Zhou (2012b) considered a unified result on estimating sparse covariance matrices under a class of Bregman divergence losses which include the commonly used Frobenius norm as a special case. Following a similar proof there, it can be shown that the estimatorΣ Ad,T h also attains the optimal rate of convergence under the Bregman divergence losses over the large parameter class H(c n,p ). In addition to the hard thresholding estimator introduced above, Rothman et al. (2009) considered a class of thresholding rules with more general thresholding functions including soft thresholding, SCAD and adaptive Lasso. It is straightforward to extend all results above to this setting. Therefore, the choice of the thresholding function is not important as far as the rate optimality is concerned.
It is worthwhile to point out that all related results hold under assumptions on the marginals of X only. This is significantly different from other covariance matrices estimation problems, where joint distributional assumptions are typically imposed. Among these results, distributions with polynomial-type tails have been considered by Bickel and Levina (2008b), El Karoui (2008) and Cai and Liu (2011a). In particular, Cai and Liu (2011a) showed that the adaptive thresholding estimatorΣ Ad,T h attains the same rate of convergence c n,p ((log p)/n) 1/2 as the one for the sub-Gaussian case (14) in probability over the class U * q (s n,p ), assuming that for some γ, C > 0, p ≤ Cn γ and for some > 0, such that E |X j | 4γ+4+ ≤ K for all j. The superiority of the adaptive thresholding estima-torΣ Ad,T h for heavy-tailed distributions is also due to the moderate deviation for the self-normalized statisticσ ij /θ ij (see Shao (1999)). It is easy to see that this still holds over the class H(c n,p ). Recently, Chen et al. (2013) and Basu and Michailidis (2015) considered sparse covariance matrix estimation for time series data based on certain dependence measures, which play an important role in the rates of convergence. In particular, the results in Basu and Michailidis (2015) rely on a new measure of stability for stationary processes without using the commonly imposed functional dependence measure in Wu (2005).

Support recovery
A closely related problem to estimating a sparse covariance matrix is the recovery of the support of the covariance matrix. This problem has been considered by, for example, Rothman et al. (2009) and Cai and Liu (2011a). For support recovery, it is natural to consider the parameter space H 0 (c n,p ). Define the Rothman et al. (2009) applied the universal thresholding estimatorΣ U,T h to estimate the support of true sparse covariance in H 0 (c n,p ), assuming bounded variances max i (σ ii ) ≤ ρ.
In particular, it successfully recovers the support of Σ in probability, provided that the magnitudes of the nonzero entries are above a certain threshold, i.e., min (i,j)∈supp(Σ) |σ ij | > γ 0 (log p/n) 1/2 for a sufficiently large γ 0 > 0. Cai and Liu (2011a) extended this result under a weaker assumption on entries in the support using the adaptive thresholding estimatorΣ Ad,T h . We state the result below.

Minimax upper bound
We now turn to the optimal estimation of sparse spiked covariance matrices over J (c n,p , r n,p , λ n,p ) in (7) under the squared spectral norm loss. Recall the spiked covariance structure Σ = V ΛV +I is defined in (6). This structure was originally considered in Johnstone (2001) (2013), , Berthet and Rigollet (2013) and . However, most of the works considered estimating either individual leading eigenvector v i or the principal subspace V V spanned by r n,p leading eigenvectors under the Frobenius norm loss. A brief discussion along this line is presented later in this section. Despite its close relationship with estimation of the covariance matrix Σ via the well-known sin-theta theorem (Davis and Kahan (1970)), the optimality estimation of sparse spiked covariance matrices cannot be obtained immediately, especially under the setting where r n,p and λ n,p go to infinity as n, p → ∞.
In a recent work,  considered minimax estimation of sparse spiked covariance matrices under the squared spectral norm. The key step of constructing optimal estimators is a global searching scheme to find the support of V . More specifically, let the support of V be supp (V ) = {i : V i, * = 0} ⊂ {1, . . . , p} and the cardinality of any set S be D S . RecallΣ = (σ ij ) is the sample covariance matrix. Then the estimator of supp (V ) is defined by an arbitrary 22 T. T. Cai et al. element in the following set where the thresholding quantity η (a, n, p, γ) = 2 a n + γ n a log(ep/a) + a n + γ n a log(ep/a) 2 , with γ ≥ 10 under the Gaussian assumption. Intuitively, the two deviation criterions in SV (c n,p ) admit supp (V ) as a feasible element and at the same time rule out the possibility that S c overlaps with supp (V ) for any S ∈ SV (c n,p ). The quantity η (a, n, p, γ) is carefully picked under the Gaussian assumption based on the Davidson-Szarek bound (Davidson and Szarek (2001)). See  (also Proposition D.1 in Ma (2013)) for further details. Given an estimatorŜ ∈ SV (c n,p ) of supp (V ), the final sparse spiked covariance matrix estimator can be defined aŝ We setΣ CMW = I if SV (c n,p ) is the empty set. It was shown that the global search estimatorΣ CMW attains the following rate of convergence under the spectral norm.
Theorem 12 At a high level, the first term in the upper bound above does not involve r n,p and is due to estimation of the leading eigenvector. The second term essentially follows from the estimation error of eigenvalue for a c n,p by c n,p matrix with rank r n,p . It is clear that under the setting in which λ n,p is bounded by a universal constant, the rate of convergence reduces to cn,p n log ep cn,p since r n,p cannot be larger than c n,p . Under such a setting, it is interesting to compare the rates of convergence over J (c n,p , r n,p , λ n,p ) and a larger class H 0 (c n,p ) considered in Section 2.3. Theorems 10 and 12 imply that the rate is reduced from c 2 n,p log p n to cn,p n log ep cn,p . This much faster rate of convergence can be achieved because of the extra spike structure.
Unlike other estimators considered in the previous sections which are obtained from the sample covariance via a direct "smoothing" step, the estimator Σ CMW is obtained by a global search for the support of V , which is computationally intensive. It is of interest to ask whether a computationally efficient while still minimax optimal estimator exists. Consider a special case of rank one r n,p = 1. Some recent works imply that without a strong sample size condition, namely n > Cc 2 n,p log p λ 2 n,p for a sufficiently large constant C > 0, minimax estimation cannot be achieved by any randomised polynomial-time algorithm. Indeed, Berthet and Rigollet (2013) first showed that there is no computationally efficient algorithm for a sparse principal component detection problem when the strong sample size condition is violated, assuming the widely-believed hardness of Planted Clique detection problem. See Wang et al. (2014) for the extension to the problem of estimating leading eigenvector. Both works established this computational lower bound by reducing the PCA problems into the Planted Clique detection problem but required a larger parameter space which includes many general distributions besides Gaussian distribution. In a recent work, Gao et al. (2014) bridged the gap and established this result for estimating leading eigenvector, faithful to the Gaussian spiked covariance model. It is immediate to extend it into the estimation of covariance matrix Σ because of r n,p = 1. See Gao et al. (2014) for further details.
When the upper bound in Theorem 12 is larger than λ 2 n,p , it is trivial to use identity matrix as the estimation with an upper bound rate λ 2 n,p . A minimax lower bound is also constructed in  to show that the minimum of λ 2 n,p and the rate of convergence in Theorem 12 is indeed optimal.

Sparse principal component analysis
Principal component analysis (PCA) is one of the most commonly used techniques for dimension reduction and feature extraction with many applications, including image recognition, data compression, and clustering. PCA is closely related to, but different from, covariance matrix estimation under the matrix norm losses. In the classical fixed p setting, the leading eigenvector or eigenspace of Σ can be consistently estimated by the counterparts operated on the sample covariance matrixΣ n as n → ∞. However this standard PCA approach yields inconsistent estimators in the high-dimensional settings when the spectra of the population covariance matrix remain bounded. See, for example, Paul (2007), d 'Aspremont et al. (2007) and Johnstone and Lu (2009). It is worthwhile to point out that this is different from the high-dimensional factor model considered in Fan et al. (2013), where leading eigenvalues increase at least in order of p and as a result the standard PCA still performs well. Various regularized approaches have been introduced in the literature for PCA, assuming certain sparsity structures on the leading eigenvectors. Zou et al. (2006) imposed Lasso type sparsity constraints on the eigenvector after transforming the PCA problem into a regression problem. d 'Aspremont et al. (2007) proposed a semi-definite program as a relaxation to the 0 penalty. Shen and Huang (2008) applied a regularized low-rank approach with its consistency established in Shen et al. (2013). See also Jolliffe et al. (2003) and Witten et al. (2009) for other methodologies.
Theoretical analysis for PCA has so far mainly focused on the spiked covariance matrix model Σ = r i=1 λ r v i v i + I, defined in (6). It is commonly assumed 24 T. T. Cai et al. that the λ i 's are bounded from below and above by some universal constants and each eigenvector v i is either in a weak q ball with radius c 1/q n,p for 0 < q < 2 or exactly sparse with c n,p nonzero entries for q = 0. Johnstone and Lu (2009) considered the single spike case r = 1 and proposed a diagonal thresholding (DT) procedure. In particular, a thresholding procedure is applied to eachσ ii to pick out those coordinates of strong signals with magnitude at least at the level of ((log p)/n) 1/4 and then the standard PCA is performed on this subset of coordinates. The final estimator of the leading eigenvector is obtained by padding zeros to the remaining coordinates. Consistency is shown in the paper for estimating the leading eigenvector v 1 under the squared 2 loss. Amini and Wainwright (2009) studied the theoretical properties of the leading eigenvectors obtained in Johnstone and Lu (2009) and d 'Aspremont et al. (2007) with a focus on the model selection setting, in which the leading eigenvector v 1 is exactly sparse, i.e. q = 0. Birnbaum et al. (2013) established the minimax rates of convergence c n,p (log p/n) 1−q/2 of the individual leading eigenvectors for finite r with distinct leading eigenvalues λ i = λ j for i = j. The DT method is shown to be sub-optimal but can be used as the first step of a two-stage optimal coordinate selection approach called ASPCA. The purpose of the second stage is to further pick out those coordinates with magnitude larger than the optimal threshold level ((log p)/n) 1/2 . Ma (2013) proposed an iterative thresholding procedure (IT-SPCA) based on DT, which also attains the same optimal rate for estimating each leading eigenvector. Estimation of the principal subspace spanned by the leading r eigenvectors is more appropriate when some of the leading eigenvalues have multiplicity great than one. Ma (2013) further established the rates of convergence of ITSPCA for estimating leading principal subspace under a loss function defined by the squared Frobenius distance between the projection matrices of the leading principal subspace V = (v 1 , v 2 , . . . , v r ) and its estimator V , i.e., V V −VV 2 F . All results discussed above assumed finite rank r. Cai, Ma, and Wu (2013) further considered the optimality problems of estimating the leading principal subspace under a group sparsity assumption, allowing the number of spikes r to diverge to infinity. Specifically, it is assumed that the vector obtained from the 2 norm of each row of V is in a weak q ball with radius c 1/q n,p for 0 < q < 2 or exactly sparse with c n,p nonzero entries for q = 0. The minimax rate of convergence is established and an aggregation procedure is constructed and shown to attain the optimal rate c n,p ((r + log ep cn,p )/(nh(λ))) 1−q/2 , where h(λ) = λ 2 /(1 + λ) and leading eigenvalues λ i λ for all i. In particular, the rate is optimal with respect to all the parameters n, p, r, λ, c n,p . However, this aggregation procedure is computationally infeasible and an optimal adaptive procedure is then proposed in Cai, Ma, and Wu (2013). Vu and Lei (2013) extended the spiked covariance model (6) into a more general setting and considered the problem of optimally estimating the leading eigenspace under group sparsity or column sparsity in a slightly difference setting. Compared to the optimal rates in Cai, Ma, and Wu (2013), the dependency on λ is not optimal for the method proposed in Vu and Lei (2013). In a related paper,  further studied the mini-max estimation the principal subspace V under the squared spectral norm loss V V −VV 2 based on the estimatorΣ CMW in (15) proposed for the minimax estimation of spiked covariance matrix we discussed at the beginning of this section.

Factor model
Factor models in the high-dimensional setting have been used in a range of applications in finance and economics such as modeling wage rates and optimal portfolio allocations. See, for instance, Engle and Watson (1981) and Goldfarb and Iyengar (2003). Despite closely related to the spiked covariance models where the covariance also can be written as the sum of a low-rank and a sparse matrix, estimation of factor models is different from that of sparse spiked covariance matrices. We consider the following multi-factor model (See Ross (1976Ross ( , 1977 and Chamberlain and Rothschild (1983).) X j = b j F + U j , where b j is a deterministic k by 1 vector of factor loadings, F is the random vector of common factors and U j is the error component of X j . Set U = (U 1 , . . . , U p ) be the error vector of X and B = (b 1 , . . . , b p ) . Assume that U and factors F are independent, then it is easy to see the covariance of X can be written as ij is the error covariance matrix. Usually a small value of k is assumed and a sparse structure, such as the diagonal structure, is imposed on Σ U . Therefore in factor models, the covariance matrix Σ can be represented as the sum of a low-rank matrix and a sparse matrix. Fan et al. (2008) considered a factor model assuming that the error components are independent, which results Σ U to be a diagonal matrix. This result was extended and improved in Fan et al. (2011) by further assuming general sparse structure on Σ U . More specifically, there are no more than c n,p nonzero entries in each row/column of Σ U , i.e. Σ U ∈ H 0 (c n,p ). Let the ith observation is its error vector. Both works assume that the factors F (i) , i = 1, . . . , n are observable and hence the number of factors k is known as well. In other words, the models are more about the regression models. Indeed, this allows using ordinary least squares estimatorb j to estimate loadings b j accurately first and then estimating the errorsÛ by the residuals. Additional adaptive thresholding procedure is then applied to the error covariance matrix estimator Σ U = 1 n n i=1Û (i) (Û (i) ) to estimate Σ U , motivated by the procedure in (13). Under certain conditions, including bounded variances max i Var(X i ), bounded λ min (Σ U ), λ min (Cov(F )) and exponential tails of F and U , the rates of convergence c n,p k (log p/n) 1/2 under the spectral norm are obtained for estimating Σ −1 U and Σ −1 . Note the number of factors k plays a role on the rates, compared to the one in (14).
Recently, Fan et al. (2013) considered the setting in which the factors are unobservable and must be estimated from the data as well. This imposes new challenges since there are two matrices to estimate while only a noisy version of their sum is observed. To overcome this difficulty, Fan et al. (2013) assumed the factors are pervasive in the sense that a non-negligible fraction of factor loadings is bounded away from zero by a universal constant. As a result, the k eigenvalues of the low-rank matrix BCov (F ) B diverge at the rate O(p) while the spectra of the sparse matrix Σ U is assumed to be bounded from below and above. Under this assumption, after simply running the SVD on the sample covariance matrixΣ n , the matrix BCov (F ) B can be accurately estimated by the matrix formed by the first k principal components ofΣ n and the sparse matrix Σ U can then be estimated by adaptively thresholding the remaining principal components. k is assumed to be finite rather than diverging in Fan et al. (2013). Under other similar assumptions as those in Fan et al. (2011), the rates of convergence c n,p (log p/n) 1/2 for estimating Σ −1 U and Σ −1 under the spectral norm are derived, assuming Σ U ∈ H 0 (c n,p ).
We would like to point out that there also is a growing literature on the study of decomposition from the sum of a low-rank matrix and a sparse matrix. However, the focus is mainly on the data matrix instead of the covariance structure with the goal of identification. The methods are mainly based on certain "incoherence condition" between the two matrices to ensure identifiability while the spectra of the two matrices are on the same order. Let L = U ΛV be the SVD of the low-rank matrix L with rank r. For example, the incoherence condition with parameter μ defined in Candès et al. (2011) states that max i U i, * 2 ≤ μr/n, max i V i, * 2 ≤ μr/n and UV ∞ ≤ n −1 √ μr. Hence under this incoherence condition, the low-rank matrix cannot be a sparse matrix. It is worthwhile to point out that such an incoherence condition is not required for the factor models discussed in this section and is not satisfied by the sparse spiked covariance matrices in Section 2.4, where the low-rank matrix is also sparse. See, e.g., Candès et al. (2011) and Agarwal et al. (2012).

Minimax upper bounds of estimating sparse precision structure
We turn in this section to optimal estimation of sparse precision matrices and recovering its support which have close connections to Gaussian graphical models. The problem has drawn considerable recent attentions. We have seen in the last section that optimal estimators of structured covariance matrices, with the exception of sparse spiked covariance matrices, are usually obtained from the sample covariance matrix through certain direct "smoothing" operations such as banding, tapering, or thresholding. Compared to those methods, estimation of the structured precision matrices is more involved due to the lack of a natural pivotal estimator and is usually obtained through some regression or optimization procedures.
There are two major approaches to estimation of sparse precision matrices: neighborhood-based and penalized likelihood approaches. Neighborhood-based approach runs a Lasso regression or Dantzig selector of each variable on all other variables to estimate the precision matrix column by column. This approach requires running p Lasso regressions. We focus on it in Section 3.1 with an emphasis on the adaptive rate optimal procedure proposed in Sun and Zhang (2012). An extension of this approach to regressing two variables against others lead to a statistical inference result on each entry ω ij . In Section 3.2, we introduce such a method proposed in Ren et al. (2015) and consider support recovery as well. Penalized likelihood approach is surveyed in Section 3.3, together with latent graphical model structure estimation problems.

Sparse precision matrix: Adaptive minimax upper bound under spectral norm
Under the Gaussian assumption, the motivation of neighborhood-based approach is the following conditional distribution of X j given all other variables X j c , where ω j c j is the jth column of Ω with the jth coordinate removed. See, for example, Anderson (2003). Meinshausen and Bühlmann (2006) first proposed the neighborhood selection approach and applied the standard Lasso regression on X j against X j c to estimate nonzero entries in each row. The goal of this paper however is to identify the support of Ω. In the same spirit, Yuan (2010) applied the Dantzig selector version of this regression to estimate Ω column by column. i.e. min β 1 s.t.
X j c X j − X j c X j c β /n ∞ ≤ τ .  further proposed an estimator called CLIME by solving a related optimization problem arg min Ω 1 : Σ n Ω − I ∞ ≤ τ .
In practice, the tuning parameter τ is chosen via cross-validation. However the theoretical choice of τ = CM n,p log p/n requires the knowledge of the matrix 1 norm M n,p = ||Ω|| 1 , which is unknown.  introduced an adaptive version of CLIMEΩ ACLIM E , which is data-driven and adaptive to the variability of individual entries ofΣ n Ω − I. Over the class HP(c n,p , M), the estimators proposed in Yuan (2010),  and  can be shown to attain the optimal rate under the spectral norm if the matrix 1 norm M n,p is bounded. Besides the Gaussian case, both  and  also considered sub-Gaussian and polynomial tail distributions. It turns out that if each X i has finite 4 + ε moments, under some mild condition on the relationship between p and n, the rate of convergence is the same as those in the Gaussian case. Now we introduce the minimax upper bound for estimating Ω under the spectral norm over the class HP(c n,p , M). Sun and Zhang (2012) constructed an estimator for each column with the scaled Lasso, a joint estimator for the regression coefficients and noise level. For simplicity, we assume X is Gaussian. For each j = 1, . . . , p, the scaled Lasso is applied to the linear regression of the 28 T. T. Cai et al. jth column X j of the data matrix against all other columns X j c as follows: where λ = A (log p) /n for some constant A > 2 andσ kk is the sample variance of X k . The optimization (17)  Without assuming bounded matrix 1 norm on Ω, the rate of convergence of Ω SL is given under the spectral norm as follows Theorem 13 (Sun and Zhang (2012)). Suppose that X is Gaussian and c n,p ( log p n ) 1/2 = o(1). The parameter class HP(c n,p , M) is defined in (8) The key technical tool in the analysis is the oracle inequality for the prediction error as well as the bound on the error under the 1 norm in the high-dimensional sparse linear regression setting for the Lasso estimator and Dantzig selector. We state the results for Lasso as a lemma in the following simple case. Assume the observations Y = (Y 1 , Y 2 , . . . , Y n ) have the following form where X is an n by p design matrix and W = (W 1 , W 2 , . . . , W n ) is the vector of i.i.d. independent sub-Gaussian noise with variance σ 2 . Suppose the coefficient β is sparse with no more than s nonzero coordinates, i.e. β 0 ≤ s. The Lasso estimator of β is defined bŷ Moreover, we assume that the rows of X are i.i.d. copies of some sub-Gaussian distribution X with mean 0 and covariance matrix Cov(X) whose spectra are bounded from below and above by constants.

Lemma 3. Assume that (s log p)/n = o(1). For any given M > 0, there exists a sufficiently large constant A > 0 depending on M and the spectra of Cov(X)
such that the following results hold with probability 1 − O(p −M ) for the Lasso estimator with the tuning parameter λ > Aσ (log p)/n in (19), where the constant C depends on M and the spectra of Cov(X).
Lemma 3 follows from the standard Lasso regression results. See, for example, Bickel et al. (2009) for further details. Note that under the sub-Gaussian assumption on the random design matrix X and the assumption (s log p)/n = o(1), the required properties on the Gram matrix X X/n such as the compatibility factor condition (van de Geer and Bühlmann (2009)), the restricted eigenvalue condition (Bickel et al. (2009)) or the cone invertibility factors condition (Ye and Zhang (2010)) are automatically satisfied with probability 1 − c 1 exp(−c 2 n), where constants c 1 and c 2 depend on the spectra of Cov(X) and σ. See, for example, Rudelson and Zhou (2013) for details.
Another contribution ofΩ SL is that the procedure is tuning-free in the sense that λ is well specified. Finally, the assumptions in Theorem 13 can be further weakened and a smaller λ is also valid. See Sun and Zhang (2012) for further details.

Individual entries of sparse precision matrix: Asymptotic normality
Given the connection between the entry ω ij and the corresponding edge (i, j) in a Gaussian graph, it is of significant interest to make inference on and provide a confidence interval for ω ij . Furthermore, the analysis would lead to results on support recovery. Along this line, to estimate a given entry ω ij , Ren et al. (2015) extended the neighborhood-based approach to regress two variables against the remaining ones, based on the following conditional distribution, j} is the index set of the two variables. Recall that ω ij sits in the coefficients −ω −1 jj ω j c j of neighborhood selection model in (16), our goal thus was to estimate the coefficients as a whole under some vector norm loss. In comparison, here we only need to estimate the noise level in the regression model (20) since ω ij is one of the three parameters in Θ A,A . This leads to the multivariate regression with two response variables X i and X j . Scaled Lasso regression is applied in Ren et al. (2015) as follows. For each m ∈ A = {i, j}, where the vector b is indexed by A c andσ kk is the sample variance of X k . Define the (p−2)×2 dimensional coefficientsβ = (β i ,β j ) and the residuals of the scaled Lasso regression byˆ A = X A − X A cβ . The estimator of Θ A,A can be given bŷ Θ A,A =ˆ Aˆ A /n. Finally, the estimatorΩ A,A = (ω kl ) k,l∈A is obtained by simply invertingΘ A,A , i.e.Ω A,A =Θ −1 A,A . In particular, The rate of convergence of estimating each ω ij is then provided under a certain sparsity assumption over HP(c n,p , M). In particular when c n,p = o ( √ n/ log p), an asymptotic efficiency result and the corresponding confidence interval are obtained.
Theorem 14 (Ren et al. (2015)). Let λ = 2δ log p n for any δ > 1 in Equation (21). Assume (c n,p log p)/n = o(1), then for any small > 0, there exists a constant Furthermore,ω ij is asymptotically efficient The key technical tool in the analysis is also related to Lemma 3 but focuses on the prediction error rather than estimation under the 1 norm. The advantage of estimatorω ij overω SL ij defined in Section 3.1 is that by estimating the noise level rather than each coefficient, the estimation accuracy can be significantly improved. However, we have to pay for the accuracy with the computational costs. If our goal is to estimate all those entries above the threshold level (log p/n) 1/2 individually, we can first apply the method proposed in Liu (2013) with p regressions to pick those order pc n,p entries above the threshold level, then order pc n,p regressions have to be done to estimate them individually. In contrast, methods in Section 3.1 only require p regressions. Minimax lower bounds are also provided in Ren et al. (2015) to show the estimatorω ij is indeed rate optimal. See Section 4.3 for details. This methodology can be routinely extended into a more general form with A replaced by some subset B ⊂ {1, 2, . . . , p} with bounded size. Then the inference result can be obtained to estimate a smooth functional of Ω −1 B,B .
Other related works in high-dimensional regression also can be applied to the current setting. Zhang and Zhang (2014) proposed a relaxed projection approach for making inference of each coefficient in a regression setting. See also van de Geer et al. (2014) and Javanmard and Montanari (2014). Although the procedures in those works seem different fromω ij in (22), essentially all methods try to estimate the partial correlation of X i and X j and hence they are asymptotically equivalent. Liu (2013) recently developed a multiple testing procedure with the false discovery rate (FDR) control for testing the entries of Ω, H 0ij : ω ij = 0. Surprisingly, to test all ω ij , it only requires running p regressions as (17).
The support recovery problem is closely related due to the graphical interpretation as well. Based on the estimatorω ij in (22), Ren et al. (2015) applied an additional thresholding procedure, adaptive to the Fisher information F ij to recover the sign of Ω. More specifically, define S(Ω) = {sgn(ω ij ), 1 ≤ i, j ≤ p} and Theorem 15 (Ren et al. (2015)). Let λ = The sufficient condition on each nonzero entry in the Theorem 15 is much weaker compared with other results in the literature, where the smallest magnitude of the nonzero entries is required to be above the threshold level Ω 1 (log p)/n. It is worthwhile to point out that based on Theorem 14, it can be easily shown thatΩ AN T also attains the optimal rates of convergence under the spectral norm over HP(c n,p , M) as that in Theorem 13.

Related results
Penalized likelihood approaches Penalized likelihood methods have also been introduced for estimating sparse precision matrices. It is easy to see that under the Gaussian assumption the negative log-likelihood up to a constant, can be written as l X (1) , . . . , X (n) ; Ω = tr(Σ n Ω) − log det(Ω), where det(Ω) is the determinant of Ω. To incorporate the sparsity of Ω, we consider the following penalized log-likelihood estimator with Lasso-type penaltŷ where Ω 0 means symmetric positive definite. Some results are derived by using 1 penalty on the off-diagonal entries i =j |ω ij | rather than all entries. We will review some theoretical properties and computational issue respectively below. Yuan and Lin (2007) first proposed usingΩ λ and studied its asymptotic properties for fixed p as n → ∞. Rothman et al. (2008) analyzed the high-dimensional behavior of this estimator. Assuming that spectra of Ω are bounded from below and above, the rates of convergence ((p + s) log p/n) 1/2 and ((1 + s) log p/n) 1/2 under the Frobenius norm and spectral norm are obtained respectively with s = i =j I{ω ij = 0} being the number of nonzero off-diagonal entries. The sparsity s is defined globally, which is different from the local sparsity c n,p defined in (9) and can be as large as pc n,p over the class HP 0 (c n,p , M). Hence in the worst case scenario, the rate of convergence ((1 + s) log p/n) 1/2 under the spectral norm is not as good as that (c n,p log p/n) 1/2 derived by neighborhoodbased approach in (18). Lam and Fan (2009) studied a generalization of (27) and replace the Lasso penalty by general non-convex penalties such as SCAD to overcome the bias issue. Ravikumar et al. (2011) applied the primal-dual witness construction to derive the rate of convergence (log p/n) 1/2 under the sup-norm which in turn leads to convergence rates in the Frobenius and spectral norms as well as support recovery under certain regularity conditions. The results heavily depend on a strong irrepresentability condition imposed on the Hessian matrix Γ = Σ ⊗ Σ, where ⊗ is the tensor (or Kronecker) product. Both sub-Gaussian and polynomial tail cases are considered. However this method cannot be extended to allowing many small nonzero entries such as the class HP(c n,p , M).

Latent Gaussian graphical model
We have seen the connections between the precision matrix and the corresponding Gaussian graph, in which we assume all variables are fully observed. In some applications, one may not have access to all the relevant variables. Suppose we only observe p coordinates X = (X 1 , . . . , X p ) of a (p + r)-dimensional Gaussian vector (X , Y ) , where Y = (Y 1 , . . . , Y r ) represent the latent coordinates. Denote the covariance matrix of all variables by Σ (X,Y ) . It is natural to assume that the fully observed (p + r)-dimensional Gaussian graphical model has a sparse dependence graph. In other words, the precision matrix Ω (X,Y ) = Σ −1 (X,Y ) is sparse. Represent the precision matrix Ω (X,Y ) in the following block form In such a case, the p × p precision matrix Ω of the observed coordinates X can be written as the difference by the Schur complement formula, Here S * = Ω XX is a sparse matrix corresponding to the structure of the subgraph induced by those observed p variables and L * = Ω XY Ω −1 Y Y Ω Y X is a low-rank matrix with rank at most r, which is the number of the unobserved latent variables. Chandrasekaran et al. (2012) proposed a penalized likelihood approach to estimate both the sparse structure S * and the low-rank part L * as the solutions to (28) HereΣ n is the sample covariance matrix of the observed coordinates X and tr (L) is the trace of L, which is used to induce the low-rank structure on L. Consistency results were established under a strong irrepresentability condition and assumptions on the minimum magnitude of the nonzero entries of S * and the minimum nonzero eigenvalue of L * . Ren and Zhou (2012) relaxed the assumptions by considering the parameter space HP 0 (c n,p , M) (9) for S * with bounded matrix 1 norm and a "spread-out" parameter space for L * . Ren et al. (2015) further removed the matrix 1 norm condition based on the estimator Ω AN T obtained in (25).

Computational issues
Estimating high-dimensional precision matrices is a computationally challenging problem. Pang et al. (2014) proposed an efficient parametric simplex algorithm to implement the CLIME estimator. In particular, their algorithm efficiently calculates the full piecewise-linear regularization path and provides an accurate dual certificate as stopping criterion. An R package 'fastclime' coded in C has been developed. See Pang et al. (2014) for more details.
For the semi-definite program (27), Yuan and Lin (2007) solved the problem using interior-point method for the general max-det problem which is proposed by Vanderberghe et al. (1998). Rothman et al. (2008) derived their algorithm based on the Cholesky decomposition and the local quadratic approximation such that cyclical coordinate descent approach can be applied. Define W =Ω −1 λ . Banerjee et al. (2008) showed that one can solve the dual of (27) through optimizing over each row/column of W in a block coordinate descent form (see also d' Aspremont et al. (2008)). In fact, solving this dual form is equivalent to solving p coupled Lasso regression problems, which are related to the neighborhood-based approach considered in Section 3.1. Friedman et al. (2008) further proposed the GLasso algorithm which takes advantage of the fast coordinate descent algorithms (Friedman et al. (2007)) to solve it efficiently. The computational complexity of GLasso is O(p 3 ). In comparison, the algorithms of Banerjee et al. (2008) and Yuan and Lin (2007) have higher computational costs. A modified GLasso algorithm is proposed by Witten et al. (2011) when the solutionΩ λ is block diagonal with blocks C 1 , . . . , C m , where |C i | is the size of the ith block. Hsieh et al. (2011) apply a second-order algorithm to solve (27) and achieve superlinear rate of convergence. Their algorithm is based on a modified Newton's method which leverages the sparse structure of the solution. Recently, Hsieh et al. (2013) further improved this result and claimed that the optimization problem (27) can be solved even for a million variables. A block coordinate descent method with the blocks chosen via a clustering approach is used to avoid the memory bottleneck of storing the gradient W when dimension p is very large.

Lower bounds
A major step in establishing a minimax theory is the derivation of rate sharp minimax lower bounds. In this section, we first review a few effective lower bound arguments based on hypothesis testing. These include Le Cam's method, Assouad's Lemma and Fano's Lemma which have been commonly used in the more conventional nonparametric estimation problems. See Yu (1997) and Tsybakov (2009) for further discussions. We will also discuss a new lower bound technique developed in Cai and Zhou (2012b) that is particularly well suited for treating "two-directional" problems such as matrix estimation, where one direction is along the rows and another along the columns. The technique can be viewed as a generalization of both Le Cam's method and Assouad's Lemma. We will then apply these lower bound arguments to the various covariance and precision matrix estimation problems discussed in the previous sections to obtain minimax lower bounds, which match the corresponding upper bound results in the last two sections. The upper and lower bounds together yield the optimal rates of convergence given in Section 1.

Le Cam's method
Le Cam's method is based on a two-point testing argument. See Le Cam (1973) and Donoho and Liu (1991). In nonparametric estimation problems, Le Cam's method often provides the minimax lower bound for estimating a real-valued functional. See, for instance, Bickel and Ritov (1988) and Fan (1991) for the quadratic functional estimation problems.
Let X be an observation from a distribution P θ where θ belongs to a parameter set Θ. For two distributions P θ1 and P θ2 with densities p θ1 and p θ2 with respect to a common dominating measure μ, the total variation affinity is given by P θ1 ∧ P θ2 = p θ1 ∧ p θ2 dμ. Le Cam's method relates the testing problem with the total variation affinity. In other words, when the total variation affinity between the two distributions is bounded away from zero, it is impossible to test between those two distributions perfectly. As a consequence, a lower bound can be measured by the distance of the two parameters θ 1 and θ 2 , which index those two distributions.
In the current paper, we introduce a version of Le Cam's method which tests the simple hypothesis H 0 : θ = θ 0 against a composite alternative H 1 : θ ∈ Θ 1 with a finite parameter set Θ = {θ 0 , θ 1 , . . . , θ D }. Let L be a loss function on Θ * , the domain of θ. Define and denoteP = 1 D D i=1 P θi . Le Cam's method gives a lower bound for the maximum estimation risk over the parameter set Θ.

Assouad's lemma
Assouad's lemma works with a hypercube Θ = {0, 1} r . It is based on testing a number of pairs of simple hypotheses and is connected to multiple comparisons. See Assouad (1983). In nonparametric estimation problems, Assouad's lemma is often successful in obtaining the minimax lower bound for many global estimation problems such as estimating the whole density or regression functions in certain smoothness classes.
For a parameter θ = (θ 1 , . . . , θ r ) where θ i ∈ {0, 1}, one tests whether θ i = 0 or 1 for each 1 ≤ i ≤ r based on the observation X. In other words, we decompose the global estimation problem into r sub-problems. For each sub-problem or each pair of simple hypotheses, there is a certain loss for making an error in the comparison. The lower bound given by Assouad's lemma is a combination of losses from testing all pairs of simple hypotheses. To make connection to Le Cam's method, we can view the loss for making an error due to each subproblem is obtained by applying Le Cam's method. In particular, when r = 1, Assouad's lemma becomes Le Cam's method with D = 1 in Lemma 4, which tests a simple null hypothesis against a simple alternative. Let be the Hamming distance on Θ. Assouad's lemma gives a lower bound for the maximum risk over the hypercube Θ of estimating an arbitrary quantity ψ (θ) belonging to a metric space with metric d. It works especially well when the metric d is decomposable with respect to the Hamming distance.
Lemma 5 (Assouad). Let X ∼ P θ with θ ∈ Θ = {0, 1} r and let T = T (X) be an estimator of ψ(θ) based on X. Then for all s > 0 The lower bound in (31) has three factors. The first factor is the minimum cost of making a mistake per comparison and the second one is the expected number of mistakes one would make when each pair of simple hypotheses is indistinguishable. The last factor, which is usually bounded below by some positive constant in applications, is the total variation affinity for each sub-problem.

Le Cam-Assouad's method
The Le Cam-Assouad's method, which was first introduced in Cai and Zhou (2012b), is designed to treat problems such as estimation of sparse matrices with constraints on both rows and columns. Again, let X ∼ P θ where θ ∈ Θ. The parameter space Θ of interest has a special structure which can be viewed as the Cartesian product of two components Γ and Λ. For a given positive integer r and a finite set The Le Cam-Assouad's method reduces to the classical Assouad's lemma when Λ contains only one element, and becomes Le Cam's method when r = 1. The advantage of this method is that it breaks down the lower bound calculations for the whole matrix estimation problem into calculations for individual rows so that the overall analysis is tractable. For θ = (γ, λ) ∈ Θ, denote the projection of θ to Γ by γ (θ) = (γ i (θ)) 1≤i≤r and to Λ by λ (θ) = (λ i (θ)) 1≤i≤r . Let D Λ be the cardinality of Λ. For a given a ∈ {0, 1} and 1 ≤ i ≤ r, we define the mixture distributionP a,i bȳ SoP a,i is the mixture distribution over all P θ with γ i (θ) fixed to be a while all other components of θ vary over all possible values. In the construction of the parameter set for establishing the minimax lower bound of matrix estimation problems, each θ = (γ, λ) is uniquely associated with some symmetric matrix. Usually r is the number of possibly non-zero rows in the upper triangle of the matrix, and each element λ of Λ is associated with a matrix by making r nonzero rows of the matrix equal to the r coordinates of λ in order.
Lemma 6 (Le Cam-Assouad). For any estimator T of ψ(θ) based on an observation from a probability distribution in {P θ , θ ∈ Θ}, and any s > 0 whereP a,i is defined in Equation (33) and α is given by

Fano's lemma
Fano's lemma, like Assouad's lemma, is also based on multiple hypotheses testing argument and has been widely used for global estimation problems in nonparametric settings. Fano's lemma applies to a more general setting and hence is stronger than Assouad's lemma, although the latter one seems easier to use in many applications. For their relationship, we refer to Yu (1997) for more details.
For two probability measures P and Q with density p and q with respect to a common dominating measure μ, write the Kullback-Leibler divergence as K(P, Q) = p log p q dμ. The following lemma, which can be viewed as a version of Fano's lemma, gives a lower bound for the minimax risk over the parameter set Θ = {θ 0 , θ 1 , . . . , θ m * }. See , Section 2.6) for more detailed discussions.
Lemma 7 (Fano). Let Θ = {θ i : i = 0, . . . , m * } be a parameter set and d be a distance over Θ. Let {P θ : θ ∈ Θ} be a collection of probability distributions satisfying with 0 < c < 1/8. Letθ be any estimator based on an observation from a distribution in {P θ , θ ∈ Θ}. Then Another advantage of Fano's lemma is that it sometimes only relies on the analytical behavior of the packing number of the parameter space with respect to the loss function d and avoids constructing an explicit subsets of parameter spaces, which can be a challenging task in many situations. See Yang and Barron (1999) for details. Hence the accurate rate of packing number at the logarithm level is the key to applying Fano's lemma and obtaining the minimax lower bounds. Rigollet and Tsybakov (2012) applied this method to improve the assumptions in the minimax lower bound argument for estimating sparse covariance matrices under the matrix 1 norm in Cai and Zhou (2012a).

Application of Assouad's lemma to estimating bandable covariance matrices
In Section 2.1, we constructed the tapering estimatorΣ T,k T and claimed in Theorem 7 that it attains the rate of convergence min , p n over the bandable class F α (M 0 , M) defined in (1) under the spectral norm. We now apply Assouad's lemma to give a lower bound n − 2α 2α+1 . The basic strategy is to carefully construct a finite least favorable subset of the corresponding parameter space in the sense that the difficulty of estimation over the subset is essentially the same as that of estimation over the whole parameter space. The finite collection that is appropriate for this lower bound argument is defined as follows. For given positive integers k and m with k ≤ p/2 and 1 ≤ m ≤ k, define the p × p matrix D(m, k) = (d ij ) p×p with d ij = I {i = m and m + 1 ≤ j ≤ 2k, or j = m and m + 1 ≤ i ≤ 2k} .
Set k = n 1 2α+1 and a = k −(α+1) . We then define the collection of 2 k covariance matrices as where I is the p × p identity matrix and 0 < τ < 2 −α−1 M . Without loss of generality we assume that M 0 > 1.
∼ N (0, Σ (θ)) with Σ (θ) ∈ F sub and the joint distribution P θ . For 0 < τ < 2 −α−1 M it is easy to check that F sub ⊂ F α (M 0 , M) for sufficiently large n. Hence applying Lemma 5 to the parameter space F sub , we have It is easy to check by our construction that there exists some constant c 1 > 0, such that the first factor above is lower bounded by c 1 ka 2 , i.e., min H(θ,θ)≥1 In addition, it can be shown that the total variation affinities between the pairs of distributions satisfy min H(θ,θ)=1 P θ ∧ Pθ ≥ c 2 for some positive constant c 2 . Putting all together, the minimax lower bound n − 2α 2α+1 follows from the above results with the choice of k = n 1 2α+1 . Other lower bound arguments based on Assouad's lemma have also been established in the literature. For instance, The lower bound rates n − 2α+1 2α+2 and n − α α+1 of estimating bandable covariance matrix under the Frobenius norm and matrix 1 norm over class G α (M 1 ) defined in (2) respectively. See Cai et al. (2010) and Cai and Zhou (2012a). We omit the details here.

Application of Le Cam's method to estimating entries of precision matrices
In Section 3.2, we discussed that the estimatorω ij for each pair of i, j attains the rate of convergence max{C 1 (c n,p log p)/n, C 2 n −1/2 }. Since the proof of parametric lower bound n −1/2 is trivial, we focus on the novel lower bound (c n,p log p)/n only and apply Le Cam's method to show that it is indeed a lower bound for estimating ω ij . Assume p ≥ c υ n,p with ν > 2. In particular, a finite collection of distributions F sub ⊂ HP(c n,p , M) is carefully constructed in the next paragraph.
Without loss of generality, we consider estimating ω 11 and ω 12 .
kl ) p×p is a matrix with all diagonal entries equal to 1, σ (0) 12 = σ (0) 21 = b and the rest all zeros. The constant b will be chosen later. It is easy to see that Ω 0 also has a very simple form with ω ii = 1 for i ≥ 3 and all other entries being zeros. Besides, denote by A the collection of all p × p symmetric matrices with exactly (c n,p − 1) entries equal to 1 between the third and the last entries on the first row/column and the rest all zeros. Now based on Ω 0 and A, the finite collection F sub = {Ω 0 , Ω 1 , . . . , Ω m * } can be defined formally as follows where a = (τ 1 log p/n) 1/2 and b = (1 − 1/M ) /2 for some sufficiently small positive τ 1 , depending on M , b and ν. In other words, we use subscripts 1, . . . , m * to label those Ω = (Σ 0 + aA) where m * = p−2 cn,p−1 . To show that F sub is a subclass of HP(c n,p , M), we check the sparsity as well as the spectra of each element. First, it is easy to see that number of nonzero off-diagonal entries in Ω m , 0 ≤ m ≤ m * is no more than c n,p per row/column by its construction. Second, tedious calculations yield that the spectra of Ω The motivation of our construction is that a signal-to-noise ratio level a on each entry of the covariance matrix with the sparsity c n,p is able to accumulate to a level of c n,p a 2 (c n,p log p)/n on some entry of the precision matrix. Indeed, it is easy to check that Let P Ωm denote the joint distribution of X 1 , . . . , X n , i.i.d. copies of N (0, Ω −1 m ), 0 ≤ m ≤ m * . It can be shown that P Ω0 ∧P ≥ C 0 . Finally, applying Lemma 4, together with above two facts, we obtain the lower bounds of estimating ω 11 and ω 12 as follows, which match the upper bounds attained by the corresponding estimators in Section 3.2. Cai et al. The Le Cam's method is applied to establish the second term of the minimax lower bound min{ λ 2 n,p rn,p n , λ 2 n,p } for estimating sparse spiked covariance matrices over the parameter space J (c n,p , r n,p , λ n,p ) defined in (7) under the squared spectral norm loss. To calculate the total variation affinity used in Le Cam's method, an interesting analysis on the symmetric random walk stopped at a hypergeometrically distributed time is established. For reasons of space, we omit the details. See Cai, Ma, and Wu (2015) for further details.
The Le Cam's method can also be used to obtain the lower bounds of other covariance/precision matrix estimation problems. For example, the lower bound (log p/n) 1/2 is derived for bandable class F α (M 0 , M) defined in (1) under the spectral norm by picking a collection of covariance/precision matrices with nonzero value only on the diagonal entries. See Cai et al. (2010). Le Cam's method is also applied in Cai and Zhou (2012a) to obtain the lower bound c 2 n,p (log p)/n for estimating sparse covariance matrices over the class H(c n,p ) defined in (4) under the matrix 1 norm. It is worthwhile to point out that the lower bound under the spectral norm loss discussed later using Le Cam-Assouad's method immediately implies this lower bound under the matrix 1 norm. However, the construction using Le Cam's method is much easier when the loss function is the matrix 1 norm.

Application of Le Cam-Assouad's method to estimating sparse precision matrices
In Section 1.2, we introduced the classes of parameters HP(c n,p , M) defined in (8) to model the sparse structure of precision matrices. We have seen that Sun and Zhang (2013) show that the estimatorΩ SL attains the rate of convergence c 2 n,p (log p)/n under the spectral norm over HP (c n,p , M) in Theorem 13 of Section 3.1. Moreover, the estimatorΩ AN T in Section 3.2 also attains the same rate. In this section, we will apply Le Cam-Assouad's method to show that these estimators are indeed rate optimal. The same technique was also used to show the rate-optimality of the ACLIME estimator proposed in  over a different parameter space.
The Le Cam-Assouad's method was originally developed in Cai and Zhou (2012b) to establish a rate sharp lower bound for estimating sparse covariance matrices over the parameter space H(c n,p ) defined in (4) under the squared spectral norm loss. It was shown that a lower bound in such a setting is c 2 n,p (log p)/n. Hence the thresholding estimator defined in (13) attaining the rate of convergence c 2 n,p (log p)/n is indeed rate optimal over H(c n,p ). The main idea for the lower bound proof is similar to that of estimating sparse precision matrices and is hence omitted here. See Cai and Zhou (2012b) for the detailed analysis.
Again, the key of the minimax lower bound proof is to carefully construct a finite collection of distributions in HP(c n,p , M). We assume p > c 1 n β for some β > 1 and c 1 > 0. In the current setting of estimating sparse precision matrices over the class HP(c n,p , M), the parameter subset F sub is constructed as follows.
Let r = p/2 and let B be the collection of all vectors (b j ) 1≤j≤p such that b j = 0 for 1 ≤ j ≤ p − r and b j = 0 or 1 for p − r + 1 ≤ j ≤ p under the constraint b 0 = k = c n,p /2 . For each b ∈ B and each 1 ≤ m ≤ r, define a p × p matrix λ m (b) by making the mth row of λ m (b) equal to b and the rest of the entries 0. It is clear that the cardinality of B is r k . Set Γ = {0, 1} r . Hence each component b i of λ = (b 1 , . . . , b r ) ∈ Λ can be uniquely associated with a p × p matrix λ i (b i ). Now it is the time to define Λ as the set of all matrices λ with every column sum less than or equal to 2k. Define Θ = Γ ⊗ Λ and let n,p ∈ R be fixed, whose value will be chosen later. For each θ = (γ, λ) ∈ Θ with γ = (γ 1 , . . . , γ r ) and λ = (b 1 , . . . , b r ), we associate θ with a precision matrix Ω(θ) by Finally we define the finite collection F sub of precision matrices as Set n,p = υ((log p)/n) 1/2 for some sufficiently small positive υ. Now we can check that each Ω(θ) ∈ F sub is diagonal dominated and further satisfies the bounded spectrum condition. Clearly for each Ω(θ) ∈ F sub , we have Therefore, we obtain that F sub ∈ HP(c n,p , M). Let X 1 , . . . , X n be i.i.d. copies of N (0, Ω(θ) −1 ) with θ ∈ Θ and denote the joint distribution by P θ . Applying Lemma 6 to the parameter space Θ indexing F sub with s = 2 and metric d being the spectral norm, we have where the per comparison loss α is defined in (35) and the mixture distributions P 0,i andP 1,i are defined as in (33). It can be shown that the per comparison loss α ≥ (k n,p ) 2 p and the affinity min i P 0,i ∧P 1,i ≥ c 1 with a constant c 1 > 0. Plugging these two facts into equation (38), we obtain the desired minimax lower bound for estimating a sparse precision matrix over HP(c n,p , M), for some constant c 2 > 0. Besides covariance and precision matrix estimation problems, Le Cam-Assouad's method is also appropriate for lower bound proofs in other matrix estimation problems. One particular example is the volatility matrix estimation problem for high-dimensional diffusions. For more details, please refer to Tao et al. (2013).

Application of Fano's lemma to estimating toeplitz covariance matrices
In Section 2.2, we consider the problem of estimating Toeplitz covariance matrices. In particular, for estimation over the parameter space FT α (M 0 , M) which is defined in terms of the smoothness of the spectral density f in (3), Cai, Ren, and Zhou (2013) showed that the tapering estimatorΣ T oep T,k T o attains the optimal rate of convergence (log(np)/np) 2α 2α+1 under the spectral norm. We briefly present a minimax lower bound argument, focusing on the use of Fano's lemma in this section.
Let X 1 , . . . , X n be i.i.d. copies of N (0, Σ), where the covariance sequence (σ 0 , σ 1 , . . . , σ p ) is given via its corresponding spectral density f = 1 2π (σ 0 + 2 ∞ m=1 σ m cos mx) in FT α (M 0 , M). There are two main steps in the lower bound argument. The first step is to construct a more informative model which is exactly equivalent to a Gaussian scale model under which one observes for |j| ≤ p − 1, and i = 1, 2, . . . n. Here S p (f )(x) = 1 2π (σ 0 + 2 p−1 m=1 σ m cos mx) is the partial sum of f with order p. The advantage of this more informative model is to make the analysis much easier. (see Cai, Ren, and Zhou (2013) for details). From now on we focus on this more informative model. The second step is to establish a minimax lower bound for this Gaussian scale model, which automatically provides a lower bound for the original model. We elaborate on the second step which mainly involves the construction of finite collection of spectral densities F sub = f 0 , f 1 , . . . f k * /2 ⊂ FT α (M 0 , M) as follows.
Define f 0 = M 0 /2 and f i as follows, , n,p = 2π/k * (40) where i = 1, 2, . . . , k * /2 with k * = (np/ log (np)) 1 2β+1 , and A(u) = exp(− 1 1−4u 2 )1 {|2u|<1} . It is easy to check that each distribution in our collection f i ∈ FT α (M 0 , M) by setting τ > 0 sufficiently small, noting A ∈ C ∞ (R) ∩ FT α e −1 , 1/2 . Therefore, we have F sub ⊂ FT α (M 0 , M). Now we apply Lemma 7 to the parameter space F sub with the distance d = · ∞ , the sup-norm. Careful calculation implies that the assumption (36) in Lemma 7 is satisfied with m * = k * /2 and P fi the probability distribution of the Gaussian scale model in (39) indexed by f i ∈ F sub . Hence we obtain that there exist some positive constants c i , i = 1, 2, 3 such that Finally, a close connection between autocovariance matrix and spectral density function implies that, for our construction of F sub , it can be shown that (42) Hence the minimax lower bound for estimating a Toeplitz covariance matrix over the collection FT α (M 0 , M) is obtained by putting (41) and (42) together. For estimating sparse spiked covariance matrices over the parameter space J (c n,p , r n,p , λ n,p ) defined in (7) under the squared spectral norm loss, Fano's lemma is also used to obtain the first term in the minimax lower bound min{ (λn,p+1)cn,p n log ep cn,p , λ 2 n,p } in Theorem 4. See Theorem 12 in Section 2.4 for the method attaining this minimax optimal rate. Note that this term does not depend on r n,p and is based on a sparse vector estimation argument only. Hence the analysis is standard given the packing number of sparse vectors under the vector 2 norm and we omit the proof in this paper. See Theorem 4 in  (also Theorem 2 in Birnbaum et al. (2013)) for further details.
Besides covariance and precision matrix estimation problems, Fano's lemma has also been used in other matrix estimation problems. For example, Rohde and Tsybakov (2011) applied it in a trace regression model to provide a lower bound of low-rank matrix estimation under the Frobenius norm.

Discussions
We have considered optimal estimation of high-dimensional covariance and precision matrices under various structural assumptions. Minimax rates of convergence are established and rate-optimal adaptive procedures are constructed. For ease of presentation, we have so far assumed that the random vector X is centered. As mentioned in the introduction, this assumption is not essential. We will elaborate on this point here. The estimators introduced in the previous sections are positive definite with high probability, but not guaranteed so for a given sample. It is sometimes desirable to have estimators that are guaranteed to be positive (semi)-definite. We will show that a simple additional step will lead to a desirable estimator with the same theoretical guarantees. We also discuss in this section a related problem, hypothesis testing on the covariance structure.

Non-centered case
Suppose E(X) = μ with μ unknown. In this case, μ can be estimated by the sample meanμ = 1 n n i=1 X (i) . We can then apply the corresponding procedures such as banding, tapering, thresholding or regression to the sample covariance matrixΣ n = X X/n −μμ . In fact, all the results remain the same in the unknown mean case except those for estimating the Toeplitz covariance matrix.
All the rate-optimal procedures introduced so far are translation invariant, hence we can assume μ = 0. Those covariance estimators, which are directly based on the sample covariance matrix, now depends onΣ n = X X/n −μμ . Compared to the term X X/n which is the sample covariance matrix when the mean is known, the termμμ usually yields a negligible estimation error. In particular, note that Eμμ = Σ/n since μ = 0. Clearly the contribution of this term entrywise is negligible with respect to the noise level n −1/2 . Globally, the contribution ofμμ is usually also negligible following an analysis that is similar to that of X X/n, as long as the optimal rate is not faster than n −1/2 . This can be made rigorous for those covariance estimators in Section 2 except the estimator of Toeplitz covariance matrices. See, for example, Remark 1 in Cai et al. (2010). In the Toeplitz case, the effective sample size for estimating each autocovariance σ i is far larger than n but there are only n samples for estimating its mean. The issue due to the unknown mean is no longer negligible unless extra information on μ is imposed. For example, under the condition that all coordinates of the mean are equal to some constant c u , we can estimate it using all np samples byĉ u = 1 np n i=1 p j=1 X (i) j . Then the Toeplitz estimator depends on X X/n −ĉ u 11 , where the second termĉ u ∼ O p ((np) −1/2 ) is of higher order and can be ignored. In this setting, it can be shown that the results of Toeplitz covariance estimators remain valid.
For the sparse precision matrix classes, we introduced rate optimal matrix estimators under the spectral norm in Section 3.1 and rate optimal estimators of each entry ω ij in Section 3.2. Both estimators are derived through a regression approach, motivated by the conditional distribution where either the index set A is a singleton in Equation (16) or A = {i, j} in Equation (20). The corresponding regression model of the data matrix can be written as X A = X A c β + A . Then the analysis of both estimators involves the scaled Lasso in Equations (17) and (21). When taking the unknown meanμ into account, the analysis of scaled Lasso is applied to the data matrix as follows where¯ A is the sample mean of the noise vector A . Note that the extra sample mean terms introduced above have a higher order, for example,¯ A ∼ O p (n −1/2 ). As a consequence, the contribution of these terms is also negligible and all the results remain valid. For more details, see, for instance, the discussion section in Ren et al. (2015).

Positive (semi-)definiteness
In many applications, the positive (semi-) definiteness of the covariance or precision matrix estimator is usually required. Although nearly all estimators of both covariance and precision matrices we surveyed in the current paper are symmetric, they are not guaranteed to be positive (semi-) definite. Under the mild condition that the population covariance is nonsingular, it follows from the consistency results that those estimators are positive definite with high probability. Whenever an estimatorÂ is not positive semi-definite, a simple extra step can make the final estimatorÂ + positive semi-definite and also achieve the optimal rate of convergence. Write the eigen-decomposition of the estimatorÂ aŝ where eigenvaluesλ 1 ≥λ 2 ≥ · · · ≥λ p andv i 's are the corresponding eigenvectors. Define the final estimator where I p is the p-dimensional identity matrix and I{λ p < 0} is the indicator function thatÂ is negative definite. Then with A being the target covariance or precision matrix and λ p being its smallest eigenvalue, we have ClearlyÂ + is positive semi-definite and enjoys the same rate of convergence as that ofÂ. This simple idea has appeared in some papers on matrix estimation. See, e.g., El Karoui (2008). Another advantage of this final procedure is that A + has the same desirable structure ofÂ such as bandable, sparse or Toeplitz structure.

Hypothesis testing for the covariance structure
In addition to estimation, there have been considerable recent developments on testing high-dimensional covariance structure. Unlike the estimation problems, an asymptotic null distribution of a test statistic is required explicitly such that the significance level of the test can be controlled. The asymptotic analysis can be very delicate. Various testing methods have been proposed including likelihood ratio test in Bai et al. (2009) andJiang et al. (2012), largest eigenvalue test in Johnstone (2001), Soshnikov (2002) and Peche (2009), Frobenius distance test in Ledoit and Wolf (2002), Srivastava (2005), Birke and Dette (2005), and Chen et al. (2010), and maximum entrywise deviation test in Jiang (2004) (2011), Shao and Zhou (2014), and Cai, Liu, and Xia (2013). But unlike estimation problems there are only few optimality results on testing, see Baik et al. (2005), El Karoui (2007),  and Onatski et al. (2013). To show the optimality of a test, asymptotic power functions are needed under alternatives to match the lower bound. We now briefly survey some recent developments on testing the covariance structure.
In the high-dimensional setting, Chen et al. (2010) also investigated the testing problem with the Frobenius distance (43). However, instead of plugging in the sample covariance matrix to estimate tr(Σ), tr(Σ 2 ) , a U -statistic is applied to derive more accurate and reliable estimators (T 1,n ,T 2,n ) of tr (Σ) , tr Σ 2 . Chen et al. (2010) also provided a lower bound for the asymptotic power function. In particular, as long as Σ − I F n/p → ∞, the test is consistent. In a separate paper,  showed that this procedure is indeed optimal under some natural alternatives. Similar results are obtained for testing sphericity as well.
Several tests based on the maximum entrywise deviations for testing the hypothesis H 0 : R = I, where R is the correlation matrix, have been proposed and studied in the literature. Jiang (2004) studied the asymptotic distribution of the test statistic L n = max whereρ ij is the sample correlation between X i and X j . In particular, the Gumbel distribution is derived as the null limiting distribution of L n lim n→∞ P nL 2 n − 4 log p + log log p ≤ y = exp − under the moment condition E |X i | 30+ε < ∞ and the regime p/n → γ ∈ (0, ∞). Jiang's work attracted considerable attention. However, the moment condition and asymptotic regime seem too restrictive. Zhou (2007) reduced the moment condition to x 6 P (|X 1 X 2 | ≥ x) → 0 and Liu et al. (2008) further weakened it to x 6 / log 3 xP (|X 1 X 2 | ≥ x) → 0 as a special case. In general, their result allowed wide regime cn α ≤ p ≤ Cn α but the moment condition also depends on α. Liu et al. (2008) also introduced some "intermediate" approximation to approximate the test statistic L n with a much faster rate of convergence. In comparison, the convergence of L n to the Gumbel distribution has a typical slow rate log log n/ log n in the regime cn α ≤ p ≤ Cn α . Li et al. (2010) and Li et al. (2012) further showed that some moment condition is necessary. More specifically, in the bounded ratio regime lim p/n ∈ (c, C), if X 1 has finite second moment, then E |X 1 | β < ∞ for all β < 6 is a necessary condition such that limiting distribution (45) holds. Cai and Jiang (2011) and Shao and Zhou (2014) generalized the polynomial regime and push it to the ultra-high-dimensional case log p = o(n βα ). Shao and Zhou (2014) showed that under the moment condition E exp (t |X 1,1 | α ) < ∞ for some t > 0 and α ∈ (0, 1], the necessary and sufficient conditions for establishing (45) in the ultra-high-dimensional setting in terms of the optimal β α is that β α = 4/ (4 − α).
Testing more general covariance structures Hypothesis testing for other covariance structures has also been considered in the literature. These include (i) banded structure with H 0 (k) : Σ is k banded, (ii) bandable structure and (iii) Toeplitz structure. Here a matrix Σ = (σ ij ) is called k banded if σ ij = 0 for all pairs (i, j) such that |i − j| ≥ k. Cai and Jiang (2011) and Shao and Zhou (2014) considered testing k banded structure. To test this hypothesis H 0 (k), analogous to the definition of L n , Cai and Jiang (2011) proposed the following test statistic It can be shown that L n,k has the same limiting Gumbel distribution in (45) as long as most correlations are bounded away from 1 and k = o(p τ ) for some small τ . Recently, Xiao and Wu (2011) established a more general result which allows dependent entries over large range of p. Instead of the L n,k , a self-normalized version of maximum entrywise deviation is constructed in Xiao and Wu (2011) as the new test statistic, where τ ij = Var(X i X j ) andτ ij is the empirical counterpart. In different regimes p = O(n α ) and p = o exp n β , M n is also shown to weakly converge to the Gumbel distribution. This result in turn allows testing all three structures (i), (ii), and (iii) listed above. Qiu and Chen (2012) constructed an unbiased estimator of Σ |i−j|≥k σ 2 ij via certain U -statistic to test banded covariance structure H 0 (k), motivated by the Frobenius distance-based tests in Chen et al. (2010). A lower bound of asymptotic power function is also established.

Some open problems
Although much progress has been made on estimation of structured high-dimensional covariance and precision matrices, there are still many open problems. We conclude the paper with a brief discussion on a few interesting open problems.

Optimality for covariance matrix estimation under Schatten q norm
In addition to the matrix ω norm and Frobenius norm considered in this paper, the Schatten q norm, which is unitarily invariant, is another commonly used matrix norm and considered in many statistical problems, including trace regression, low-rank matrix recovery and density matrix estimation in quantum tomography. The Schatten q norm is the vector q norm of the spectra. Denote the singular values of Σ by {λ i }, i = 1, . . . , p. The Schatten q norm is defined by When q = 2, it coincides with the Frobenius norm and when q = ∞, it becomes the spectral norm. Estimating a covariance or precision matrix under the general Schatten q norm is an interesting open problem. So far the corresponding optimality results for covariance and precision matrix estimation under the general Schatten q norm remain unknown. The major difficulty lies in the establishment of a rate-sharp minimax lower bound. We believe that new technical tools are needed to solve this problem.

Lower Bound via packing number
Fano's lemma is a standard tool for deriving the minimax lower bounds. It relies on a good bound for the cardinality of a suitable packing set for a given parameter space. So far little is known about the packing number for the class of sparse matrices under the spectral norm. Consider the following general sparse matrix class S(k) in which there are at most k ones in each row and each column S(k) = A p×p = (a ij ) : a ij = 0 or 1, max A 1 , A ∞ ≤ k .
We conjecture that there exists a "good" packing set S sub (k) ⊂ S(k) under the spectral norm such that for any A 1 , A 2 ∈ S sub (k), we have A 1 − A 2 ≥ k and log |S sub (k)| > kp log(p/k) for some small constant . If this statement is true, then the standard Fano's lemma can be applied to obtain the minimax lower bound under the spectral norm for estimating sparse covariance/precision/volatility matrix. As a consequence, the proof of those lower bounds in literature using Le Cam-Assouad's method introduced in Section 4 can be unified and simplified. In addition, we point out that in Theorems 3 and 5, the optimality results are obtained under the assumption c n,p ≤ C n/(log p) 3 rather than the natural one c n,p ≤ C n/(log p), which is due to the Le Cam-Assouad's method employed to prove the lower bound. Therefore if the conjecture is true, we can obtain more complete optimality results without such an unpleasant sparsity condition. Similarly, a good bound on the packing number for the class of bandable covariance matrices and the class of sparse covariance matrices under the general Schatten q norm would be very helpful for the establishment of the minimax lower bound for estimating the corresponding covariance matrices under the Schatten q norm.

Optimal estimation of matrix functionals
Many high-dimensional statistical inference problems such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) require knowledge of certain aspects, i.e., functionals, of the covariance structure, instead of the whole matrix itself. So estimation of the functionals of covariance/precision matrices is an important problem. The most common practice for matrix functional estimation is the plug-in approach: first estimating the whole matrix in a certain optimal way and then plugging-in the estimator to estimate the corresponding functional. This often leads to a sub-optimal solution.
Despite recent progress on optimality of estimating covariance and precision matrices under various matrix norms, there have been few optimality results on estimation of functionals of the covariance matrices. Cai, Liang, and Zhou (2015) obtained the limiting distribution of the log determinant of the sample covariance matrix in the Gaussian setting and applied the result to establish the optimality for estimation of the differential entropy, which is a functional of the covariance matrix. The problem of optimally estimating an individual entry of a sparse precision matrix discussed in Section 3.2 can also be viewed as estimating a functional.
Given two independent samples, X (1) , . . . , X (n1) iid ∼ N p (μ 1 , Ω −1 ) and Y (1) , . . . , , an important functional to estimate is Ω(μ 1 − μ 2 ). This is motivated by the linear discriminant analysis. In the ideal case when the parameters μ 1 , μ 2 and Ω are known, for a new observation Z drawn with equal probability from either N p (μ 1 , Ω −1 ) or N p (μ 2 , Ω −1 ), the optimal classification rule is Fisher's linear discriminant rule which classifies Z to class 1 if and only if (Z − (μ 1 + μ 2 )/2) Ω (μ 1 − μ 2 ) > 0. In applications, the parameters μ 1 , μ 2 and Ω are unknown and it is a common practice to estimate them separately and then plug in. This approach has been shown to be inefficient as the discriminant depends on the parameters primarily through the functional Ω (μ 1 − μ 2 ). Cai and Liu (2011b) introduced a constrained 1 minimization method for estimating the functional Ω (μ 1 − μ 2 ) directly and proposed a classification method based on the estimator. A similar approach was also used in Mai et al. (2012). Although direct estimators of Ω (μ 1 − μ 2 ) have been proposed and used for classification, the optimality of the estimation problem remains unknown. See also El Karoui and Kösters (2011) for further discussions. It is of significant interest to study the problem of optimal estimation of the functional Ω (μ 1 − μ 2 ) under certain sparsity assumptions.