Confidence intervals for high-dimensional inverse covariance estimation

We propose methodology for statistical inference for low-dimensional parameters of sparse precision matrices in a high-dimensional setting. Our method leads to a non-sparse estimator of the precision matrix whose entries have a Gaussian limiting distribution. Asymptotic properties of the novel estimator are analyzed for the case of sub-Gaussian observations under a sparsity assumption on the entries of the true precision matrix and regularity conditions. Thresholding the de-sparsified estimator gives guarantees for edge selection in the associated graphical model. Performance of the proposed method is illustrated in a simulation study.


Introduction
Motivation. A large number of methods have been proposed for the problem of inverse covariance estimation in high-dimensional settings, where the number of parameters is much larger than the sample size. For an overview of the different methodologies, we refer to [11]. However, few theoretical results exist on assessing the accuracy of these estimators and on developing methodology for statistical inference for the parameters of interest in this setting. In this paper, our aim is to develop methodology for inference for individual entries of the inverse covariance matrix. To this end, we consider a popular estimator of the inverse covariance matrix, mostly referred to as the graphical Lasso [5]. We note that existing work on statistical inference in high-dimensional settings has mostly focused on inference for parameters in linear models and generalized linear models [15,17,8,6,3,2,10]. Our setting lies outside of the regression context; nevertheless, as shall be demonstrated, the general approach presented in [15] carries through successfully to the present context.

Problem overview and related work. Consider an i.i.d. sample X_1, …, X_n ∈ R^p of size n from a zero-mean distribution with unknown covariance matrix Σ* ∈ R^{p×p}. Denoting the inverse covariance matrix, often referred to as the precision or concentration matrix, by Θ* = (Σ*)^{−1}, the goal is to estimate Θ* in a setting where p ≫ n. The most natural candidate for an estimator of the covariance matrix is the sample covariance matrix. However, in the setting p > n, the sample covariance matrix is singular with probability one. Moreover, even when p/n tends to a constant, the sample covariance matrix exhibits poor performance [7]. Consequently, different structural assumptions have been imposed on the model to allow for consistent estimation. One natural assumption is sparsity. For instance, sparsity may be imposed by restricting the cardinality of the set of non-zero elements of the covariance matrix or the precision matrix. A popular estimation procedure which relies on sparsity of Θ* is the graphical Lasso [5,4,16]. The estimator minimizes the negative Gaussian log-likelihood with regularization in terms of the ℓ_1 norm of the off-diagonal (or possibly all) entries of the precision matrix. Although the method arises from the situation when the data is normally distributed, it is justified also when this is not the case. For a discussion see [11], where it is shown that for non-Gaussian X the method corresponds to minimizing the ℓ_1-penalized log-determinant Bregman divergence. The optimization problem corresponding to the graphical Lasso is convex and can be solved with, for instance, coordinate descent methods [4,5] in polynomial time. The statistical performance of the graphical Lasso has been analyzed in [13] and [11], where convergence rates of the method were established under a sparsity assumption on the entries of Θ*. In [13] (see also [9]), mild conditions on the eigenvalues of Θ* allow to derive rates in Frobenius norm. High-dimensionality is reflected in p being allowed to grow as a function of n; however, in the limit p/n → 0 is required. In [11] convergence rates in the maximum norm are derived under the stronger irrepresentability condition on the true precision matrix Θ*, but only the weaker condition log p/n → 0 is required, allowing for p ≫ n asymptotically.
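The singularity of the sample covariance matrix in the p > n regime is easy to verify numerically; the following minimal numpy sketch (all names are illustrative) checks that its rank cannot exceed n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # high-dimensional regime: p > n
X = rng.standard_normal((n, p))    # rows are the observations X_1, ..., X_n

# Sample covariance of a zero-mean sample: (1/n) * X^T X.
Sigma_hat = X.T @ X / n

# rank(Sigma_hat) <= n < p, so Sigma_hat is singular with probability one.
rank = np.linalg.matrix_rank(Sigma_hat)
assert rank <= n < p
```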
A closely related work [15] considers inference for low-dimensional subsets of a high-dimensional vector in the regression setting, under sparsity assumptions on the parameters and the design matrix. The proposed method is based on the Lasso estimator, for which the Karush-Kuhn-Tucker conditions are modified ("inverted") to obtain a de-sparsified estimator. Subsequently, the estimator is shown to have a normal limiting distribution and to reach the semiparametric efficiency bound. In addition, we mention a recent paper [12] that suggests a "piecewise" procedure for precision matrix estimation in Gaussian graphical models, under mild assumptions on Θ*. The method restricts itself to the Gaussian model by making use of the well-known conditional distribution property of the multivariate normal distribution. This leads to the idea of using regression to estimate the precision matrix. Roughly speaking, the procedure regresses every pair of variables (X_i, X_j) on the remaining p − 2 variables using the scaled Lasso, resulting in order p² individual high-dimensional regressions. The acquired partial estimators are then combined to form an estimator for Θ*. Asymptotic normality is proved, allowing for inference results about the parameters.

Outline
In this paper we aim for statistical inference for individual elements of the precision matrix in a high-dimensional setting, under appropriate sparsity assumptions on Θ*. The work closely follows the idea of the more general approach presented in [15], which builds on "inverting" the necessary Karush-Kuhn-Tucker conditions of an optimization problem. The paper [15] demonstrates this method for the case of linear regression and generalized linear models, while we apply the idea to a fully nonlinear estimator. By inverting the KKT conditions, we obtain a de-sparsified graphical Lasso estimator and subsequently analyze its asymptotic properties. Asymptotic normality of the new estimator is proved for sub-Gaussian observations. The rest of the paper is organized as follows. In section 1.2, we briefly introduce the model. Section 1.3 summarizes deviation inequalities for the covariance matrix for further reference throughout the paper. Section 2 contains the main results. Section 3 illustrates the theoretical results in a simulation study. Finally, section 4 contains proofs.
Notation. For a vector x ∈ R^d and q ∈ (0, ∞] we write ‖x‖_q for the ℓ_q norm of x in the classical sense. For a matrix A we write ‖A‖_∞ := max_{i,j} |A_{ij}| for the elementwise maximum norm and |||A|||_∞ := max_i Σ_j |A_{ij}| for the ℓ_∞ operator norm (maximum absolute row sum). The symbol vec(A) denotes the vectorized version of a matrix A. For two matrices A and B, A ⊗ B denotes the Kronecker product of A and B. By e_i we denote the p-dimensional vector of zeros with a one at position i, and by e_{ij} := e_i ⊗ e_j the p²-dimensional vector of zeros with a one at the position indexed by (i, j). For sequences f_n, g_n, we write f_n = O(g_n) if f_n ≤ C g_n for some C > 0 independent of n and all n > C. Analogously, we write f_n = Ω(g_n) if f_n ≥ C g_n for some C > 0 independent of n and all n > C. We write f_n ≍ g_n if both f_n = O(g_n) and f_n = Ω(g_n) hold. Finally, f_n = o(g_n) if lim_{n→∞} f_n/g_n = 0. We use S^p_+ to denote the cone of positive semi-definite p × p matrices, i.e. S^p_+ := {A ∈ R^{p×p} | A = A^T, A ⪰ 0}, and S^p_{++} to denote the set of positive definite p × p matrices, S^p_{++} := {A ∈ R^{p×p} | A = A^T, A ≻ 0}.
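As a sanity check on the selector vectors e_ij = e_i ⊗ e_j, the following small numpy sketch (assuming, for illustration, a row-major vec(·) convention, which the text does not fix) verifies that e_ij extracts the (i, j)-th entry of vec(A):

```python
import numpy as np

p = 4
A = np.arange(p * p, dtype=float).reshape(p, p)

def unit(k, dim):
    """Vector of zeros of length dim with a one at position k."""
    v = np.zeros(dim)
    v[k] = 1.0
    return v

i, j = 1, 3
e_ij = np.kron(unit(i, p), unit(j, p))   # the selector e_ij := e_i (x) e_j

# Under a row-major vectorization (an assumption on the convention),
# e_ij^T vec(A) extracts exactly the entry A[i, j].
vecA = A.reshape(-1)
assert e_ij @ vecA == A[i, j]
```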

Model setup
In this section, we introduce the model and present assumptions that will be referenced throughout the paper. Given an i.i.d. sample X_1, …, X_n, the graphical Lasso estimator Θ̂_n is defined as the solution to the optimization problem

Θ̂_n := argmin_{Θ ∈ S^p_{++}} { tr(Θ Σ̂_n) − log det(Θ) + λ_n ‖Θ‖_{1,off} },    (P1)

where Σ̂_n is the sample covariance matrix, Σ̂_n := (1/n) Σ_{k=1}^n X_k X_k^T, and ‖Θ‖_{1,off} is the ℓ_1 norm of the off-diagonal entries of Θ. Note that when the data is normally distributed, the problem is equivalent to ℓ_1-penalized maximum likelihood estimation of the precision matrix.
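A minimal sketch of the objective in (P1) can be written directly in numpy; the function below is illustrative (not the glasso solver itself) and checks that, without regularization, the inverse sample covariance minimizes the objective along a simple one-parameter family:

```python
import numpy as np

def glasso_objective(Theta, Sigma_hat, lam):
    """Objective of (P1): tr(Theta Sigma_hat) - log det(Theta) + lam * ||Theta||_{1,off}."""
    off_l1 = np.abs(Theta).sum() - np.abs(np.diag(Theta)).sum()
    sign, logdet = np.linalg.slogdet(Theta)
    return np.trace(Theta @ Sigma_hat) - logdet + lam * off_l1

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n

# With lam = 0, the objective over the family t * Theta_0, Theta_0 = Sigma_hat^{-1},
# equals t*p - p*log(t) - log det(Theta_0), which is minimized at t = 1.
Theta_0 = np.linalg.inv(Sigma_hat)
vals = [glasso_objective(t * Theta_0, Sigma_hat, 0.0) for t in (0.5, 1.0, 2.0)]
assert vals[1] == min(vals)
```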
The setting where the observations are Gaussian is perhaps of the highest interest, due to the fact that the problem corresponds to estimating the structure of an underlying Gaussian graphical model. Assume thus in the rest of this paragraph that the data comes from a multivariate normal distribution with zero mean and covariance matrix Σ*, i.e. X = (X_1, …, X_p) ∼ N(0, Σ*). Associate the variables X_1, …, X_p with the vertex set V = {1, …, p} of an undirected graph G = (V, E) with edge set E. We say that the concentration matrix Θ* respects the edge structure of the graph G if Θ*_{ij} = 0 for all (i, j) ∉ E. Consequently, estimation of the precision matrix corresponds to parameter estimation in a Gaussian graphical model, and specifying the non-zero set of Θ* corresponds to model selection. Note that in the Gaussian case, sparsity assumptions on the entries of the precision matrix translate to sparsity of the edges in the underlying Gaussian graphical model.
Let S(Θ*) := {(i, j) ∈ V × V : Θ*_{ij} ≠ 0} be the set of all non-zero entries of Θ* and denote the cardinality of S(Θ*) by s. Use S^c(Θ*) for the complement of S(Θ*) in V × V. We shall impose a sparsity assumption on the maximum row cardinality of Θ*; therefore define d = d_n as

d := max_{i ∈ V} |{j ∈ V : Θ*_{ij} ≠ 0}|.

In terms of a Gaussian graphical model, the parameter d corresponds to the maximum vertex degree in the graph G (note that self-loops at each vertex are also counted). The parameter d is allowed to grow as a function of n.
We shall consider two parameters of the model, κ_{Σ*} and κ_{Γ*}, defined below (cf. [11]), which may be viewed as certain measures of the model complexity. The parameters may be allowed to grow as a function of the sample size n, and the rate of this growth then determines the convergence rate of ‖Θ̂ − Θ*‖_∞. However, to keep the presentation cleaner, we assume the parameters κ_{Σ*} and κ_{Γ*} are bounded in n.
First, define κ_{Σ*} to be the ℓ_∞ operator norm of the true covariance matrix Σ*, i.e. κ_{Σ*} := |||Σ*|||_∞.
To define the second parameter, consider first the Hessian Γ(Θ) of the function tr(ΘΣ) − log det(Θ), which takes the form Γ(Θ) := Θ^{−1} ⊗ Θ^{−1} ∈ R^{p²×p²}. We shall impose some restrictions on the Hessian Γ evaluated at the true Θ*, Γ* := Γ(Θ*). To this end, let us fix the following notation. For any two subsets T and T′ of V × V, we use Γ*_{TT′} to denote the |T| × |T′| matrix with rows and columns of Γ* indexed by T and T′, respectively. Consequently, define κ_{Γ*} to be the ℓ_∞ operator norm of the matrix (Γ*_{SS})^{−1}, i.e. κ_{Γ*} := |||(Γ*_{SS})^{−1}|||_∞. Finally, we shall consider an irrepresentability condition on the Hessian of the log-determinant function evaluated at Θ*, as in assumption (A1) below, and boundedness of eigenvalues, as in assumption (A2).
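The Kronecker form of the Hessian, Γ(Θ) = Θ^{−1} ⊗ Θ^{−1}, can be verified numerically by finite differences of the gradient Σ − Θ^{−1}; the sketch below assumes a row-major vec(·) convention:

```python
import numpy as np

rng = np.random.default_rng(2)
p, eps = 4, 1e-6

# A random positive definite stand-in for the precision matrix Theta.
B = rng.standard_normal((p, p))
Theta = B @ B.T + p * np.eye(p)
Theta_inv = np.linalg.inv(Theta)

# Candidate Hessian of tr(Theta Sigma) - log det(Theta): Theta^{-1} (x) Theta^{-1}.
Gamma = np.kron(Theta_inv, Theta_inv)

# Finite-difference check: the gradient is Sigma - Theta^{-1}, and a perturbation
# in direction E changes -Theta^{-1} by approximately Theta^{-1} E Theta^{-1},
# i.e. Gamma @ vec(E) under a row-major vec convention.
E = rng.standard_normal((p, p))
E = (E + E.T) / 2
grad_diff = -(np.linalg.inv(Theta + eps * E) - Theta_inv) / eps
hess_action = (Gamma @ E.reshape(-1)).reshape(p, p)
assert np.allclose(grad_diff, hess_action, atol=1e-4)
```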

Assumption (A1). (Irrepresentability condition) There exists α ∈ (0, 1] such that max_{e ∈ S^c(Θ*)} ‖Γ*_{eS} (Γ*_{SS})^{−1}‖_1 ≤ 1 − α.
Remark 1. Theoretically, under the irrepresentability condition, one could use the graphical Lasso and then the maximum likelihood estimator based on the nonzeros of Θ̂. The de-sparsified graphical Lasso is an alternative that we believe to be less sensitive to violations of the conditions for exact variable selection. Besides, the above two-step procedure would result in discarding the "false negatives" produced by the graphical Lasso. We also refer here to [14], where the results of [11] are extended using an irrepresentability condition on the small (not necessarily zero) entries of Θ*.

Assumption (A2). (Bounded eigenvalues)
There exists M > 0 such that, for every n, each eigenvalue of Θ* lies in the interval [1/M, M], i.e. 1/M ≤ Λ_min(Θ*) ≤ Λ_max(Θ*) ≤ M. Since our results depend on the convergence rates for the graphical Lasso derived in [11], assumption (A1) corresponds to the irrepresentability condition used in [11]. Parameters appearing in assumption (A3) influence the rates of convergence if they scale with n (see [11]).

Tail Bounds
In this section, based on [11] and [1], we review a result concerning a high-probability bound for a deviation condition for the sample covariance matrix. Since the optimization problem (P1) depends on a surrogate of Σ*, namely the sample covariance matrix Σ̂, the performance of the method (such as the rate of ‖Θ̂ − Θ*‖_∞) depends on the difference Σ̂ − Σ*. Roughly speaking, Σ̂ − Σ* enters through the estimation error of the trace term in the objective. The general approach (as in [11]) depends on showing that, for a specific pre-assumed statistical model for X, the event

T_n := {‖Σ̂_n − Σ*‖_∞ ≤ δ_n}    (1.2)

holds with high probability for n → ∞ and some sequence δ_n ↓ 0.
To measure the rate of decay of the tails of Σ̂_{n,ij} − Σ*_{ij}, we introduce the following definition.
Definition 1. Given a random vector X ∈ R^p, we say that a function τ : N × (0, ∞) → R_+, increasing in both variables, is a tail function if there exists a constant ν ∈ (0, ∞] such that for all (i, j) ∈ V × V and for all δ ∈ (0, ν],

P(|Σ̂_{ij} − Σ*_{ij}| ≥ δ) ≤ 1/τ(n, δ),

where Σ̂ is the sample covariance matrix associated with an i.i.d. sample distributed as X. For a given tail function τ, define the inverse tail function, for r ≥ 1 (and for each n fixed),

δ_τ(n, r) := inf{δ > 0 : τ(n, δ) ≥ r}.

For a given tail behaviour of X, the union bound gives a high-probability bound for T_n as in lemma 6.

Sub-Gaussian Tails
The deviation condition ‖Σ̂_n − Σ*‖_∞ < δ_n can subsequently be shown to hold with high probability for special classes of random variables. Of interest is for instance the case when X is sub-Gaussian, though the results can be extended to some other cases as well (cf. [11]). We shall restrict ourselves to the class of sub-Gaussian random variables.
Definition 2. A real zero-mean random variable X is sub-Gaussian if there exists β > 0 such that for all t > 0,

P(|X| ≥ t) ≤ 2 exp(−t²/(2β²)).

We shall consider the following sub-Gaussianity condition for random vectors.

Condition (C1). (Sub-Gaussianity condition)
A zero-mean random vector X = (X_1, …, X_p) with covariance matrix Σ* satisfies the sub-Gaussianity condition if all its normalized components X_i/√(Σ*_{ii}) are sub-Gaussian random variables with a common parameter β > 0.
The following lemma from [11] gives probabilistic bounds for the event T_n if X satisfies (C1).

Lemma 1 ([11], sub-Gaussian model). Let X = (X_1, …, X_p) be a zero-mean random vector with covariance matrix Σ* satisfying the sub-Gaussianity condition (C1). Then an inverse tail function is given by δ_τ(n, p^γ) = c_0 √(γ log p / n), and for every γ > 2 and each n such that δ_τ(n, p^γ) < c_1 we have P(T_n^c) ≤ 1/p^{γ−2}, where c_0, c_1 depend only on β and max_i Σ*_{ii}.
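The √(log p / n) scaling of the deviation ‖Σ̂_n − Σ*‖_∞ can be observed in a quick Monte Carlo experiment (a rough illustration with arbitrary constants, not a proof): quadrupling n should roughly halve the maximal entrywise deviation.

```python
import numpy as np

rng = np.random.default_rng(3)
p, reps = 50, 200

def mean_max_dev(n):
    """Monte Carlo estimate of E ||Sigma_hat - Sigma*||_inf for N(0, I_p) data."""
    devs = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        devs.append(np.abs(X.T @ X / n - np.eye(p)).max())
    return float(np.mean(devs))

# delta_tau(n, p^gamma) ~ sqrt(log p / n): quadrupling n should roughly
# halve the maximal entrywise deviation of the sample covariance.
d1, d2 = mean_max_dev(200), mean_max_dev(800)
assert d2 < d1
assert 1.3 < d1 / d2 < 3.0
```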

Main results
In this section we present the main results, which imply inference for individual parameters of the precision matrix. The main result is contained in theorem 1, where asymptotic normality of the de-sparsified estimator is proved. At this point we remark that for any λ_n > 0 and Σ̂_n with strictly positive diagonal elements, the optimization problem (P1) has a unique solution Θ̂_n ∈ S^p_{++}, which is characterized by the Karush-Kuhn-Tucker conditions

Σ̂_n − Θ̂_n^{−1} + λ_n Ẑ = 0,    (2.1)

where the matrix Ẑ belongs to the subdifferential of the off-diagonal norm ‖·‖_{1,off} evaluated at Θ̂_n (see Lemma 3 in [11]).
Inverting the KKT conditions (2.1) leads us to the de-sparsified graphical Lasso estimator, which is defined as

T̂ := 2Θ̂_n − Θ̂_n Σ̂_n Θ̂_n.

To prove asymptotic normality of T̂, we shall first need the following lemma, which shows that a term we may identify as a "remainder term" converges in probability to zero. The result holds uniformly over all the elements of Θ*. The proof of the lemma is given in section 4.
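A minimal numpy sketch of the de-sparsified estimator T̂ = 2Θ̂_n − Θ̂_n Σ̂_n Θ̂_n (the formula obtained by inverting the KKT conditions): as a sanity check, if Θ̂_n happens to be the exact inverse of Σ̂_n, the correction term cancels and T̂ = Θ̂_n.

```python
import numpy as np

def desparsify(Theta_hat, Sigma_hat):
    """De-sparsified estimator T = 2*Theta_hat - Theta_hat @ Sigma_hat @ Theta_hat."""
    return 2 * Theta_hat - Theta_hat @ Sigma_hat @ Theta_hat

rng = np.random.default_rng(4)
n, p = 500, 5
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n

# Sanity check: if Theta_hat inverts Sigma_hat exactly (possible since p < n),
# the correction term cancels and T coincides with Theta_hat.
Theta_hat = np.linalg.inv(Sigma_hat)
T = desparsify(Theta_hat, Sigma_hat)
assert np.allclose(T, Theta_hat)
```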
Lemma 2. Suppose that Θ* satisfies assumptions (A1), (A3) and the sparsity assumption d = O(1/δ_n) for some 0 < δ_n ≤ 1. Suppose that Θ̂_n is the solution to the optimization problem (P1) with sample covariance matrix Σ̂ satisfying ‖Σ̂ − Σ*‖_∞ ≤ δ_n and regularization parameter λ_n = (8/α)δ_n. Then

√n (T̂ − Θ*) = W + Rem,  where W := −√n Θ*(Σ̂ − Σ*)Θ* and ‖Rem‖_∞ = O(√n d δ_n²).

The statement of lemma 2 is deterministic, but as lemma 1 suggests, in cases of interest (e.g. the sub-Gaussian model) the deviation condition holds with high probability. The result of lemma 2 implies that, under the additional sparsity assumption d = o(1/(√n δ_n²)), on the event T_n it holds that ‖Rem‖_∞ = o(1).

Corollary 1 (Sub-Gaussian model). Suppose that in addition to the assumptions of lemma 2, the i.i.d. random vectors X_1, …, X_n ∈ R^p satisfy the sub-Gaussianity condition (C1). Put δ_n ≍ δ_τ(n, p^γ) = o(1), where γ > 2. Then ‖Rem‖_∞ = O_P(d log p/√n). Furthermore, provided that d = o(√n/log p), it holds that ‖Rem‖_∞ = o_P(1).

Note also that if X_1, …, X_n ∈ R^p is an i.i.d. sample from the multivariate normal N_p(0, Σ*) distribution, then W in lemma 2 has the Wishart distribution W_p(Θ*, n) (up to scaling and shifted by its mean).
Consequently, lemma 2 under the sub-Gaussian model assumption lets us show asymptotic normality, as demonstrated in theorem 1 below. Since our aim is inference about individual elements of Θ*, we fix (i, j) ∈ V × V and, to keep the notation simple, suppress the index (i, j). Note also that the columns of Θ* will be denoted by Θ*_{•i} ∈ R^p. For a fixed n, the elements of the sequence of random vectors X_1, X_2, …, X_n depend also on p = p_n (X_k ∈ R^p). In order to control this dependence we shall consider the doubly indexed (or triangular) array {X_nk}_{1≤k≤n}, n ∈ N.
Theorem 1. Let {X_nk}_{k=1}^n be i.i.d., satisfying the sub-Gaussianity condition (C1) with a parameter β which does not depend on n. Suppose that Θ̂_n is the solution to the optimization problem (P1) with regularization parameter λ_n = (8/α) δ_τ(n, p^γ) for some γ > 2. Let Θ* satisfy assumptions (A1), (A2), (A3), the sparsity assumption d = o(√n / log p), and suppose σ_n² = Ω(1). Then for all (i, j),

√n (T̂_ij − Θ*_ij)/σ_n = Z^n_ij + o_P(1),    (2.4)

where σ_n² := Var(e_i^T Θ* X_{n1} X_{n1}^T Θ* e_j) and Z^n_ij converges weakly to N(0, 1).

Note that in practice σ_n is unknown, so instead of σ_n one needs to use a consistent estimator σ̂_n > 0. The next lemma shows that for the Gaussian model σ_n² = Ω(1) and there is a natural estimator of σ_n².

Lemma 3. Suppose that the conditions of theorem 1 hold with the exception of σ_n² = Ω(1). Furthermore, suppose that the X_nk are Gaussian random vectors. Then for all (i, j),

√n (T̂_ij − Θ*_ij)/σ̂_n = Z^n_ij + o_P(1),

where σ̂_n² := Θ̂_ij² + Θ̂_ii Θ̂_jj and Z^n_ij converges weakly to N(0, 1).
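The resulting confidence interval T̂_ij ± Φ^{−1}(1 − α/2) σ̂_n/√n is straightforward to implement. The sketch below is illustrative only: it uses a low-dimensional setting where the exact inverse of the sample covariance stands in for the graphical Lasso (so the de-sparsifying correction vanishes), and checks that the empirical coverage for a zero entry of the identity precision matrix is close to the nominal 95%.

```python
import numpy as np

Z_975 = 1.959964  # standard normal quantile Phi^{-1}(0.975)

def conf_int(T_ij, sigma_hat, n, z=Z_975):
    """Asymptotic 95% CI for Theta*_ij: T_ij +/- z * sigma_hat / sqrt(n)."""
    half = z * sigma_hat / np.sqrt(n)
    return T_ij - half, T_ij + half

rng = np.random.default_rng(5)
n, p, i, j, reps = 500, 5, 0, 1, 300
hits = 0
for _ in range(reps):
    X = rng.standard_normal((n, p))        # true precision Theta* = I, so Theta*_01 = 0
    Sigma_hat = X.T @ X / n
    Theta_hat = np.linalg.inv(Sigma_hat)   # low-dimensional stand-in for the glasso
    T = 2 * Theta_hat - Theta_hat @ Sigma_hat @ Theta_hat
    sig = np.sqrt(Theta_hat[i, j] ** 2 + Theta_hat[i, i] * Theta_hat[j, j])
    lo, hi = conf_int(T[i, j], sig, n)
    hits += (lo <= 0.0 <= hi)
coverage = hits / reps
assert 0.88 <= coverage <= 0.995           # close to the nominal 0.95
```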

Simulation Results
In this section we report on the performance of our method on simulated data and provide a brief comparison with some other methodologies. We consider two particular instances of sparse Gaussian graphical models, which may be (fully) specified by the precision matrix Θ*. The model setup for the simulations was as follows. As a first graphical model instance, we consider the chain graph on p vertices, where the maximum vertex degree is by definition restricted to d = 2. As a second instance, we consider the star graph, where a given central vertex may or may not be connected to any of the remaining vertices; thus the maximum degree d may vary from 0 to p − 1. We specify the graphical models by the corresponding precision matrices in a normalized form with ones on the main diagonal. For the case of a chain graph, the corresponding precision matrix Θ* is a tridiagonal matrix, which was set to Θ* = tridiag(ρ, 1, ρ) for a given ρ > 0. The cardinality of the active set is then 3p − 2. For the star graph, Θ*_ii = 1 and Θ*_ij = ρ for i ≠ j, (i, j) ∈ E. To solve the graphical Lasso program (P1), we used the procedure glasso implemented by Friedman et al.
[5]. By theorem 1 and lemma 3, it follows that the (1 − α)100% asymptotic confidence interval for Θ*_ij is given by

I_{ij,α} := [T̂_ij − Φ^{−1}(1 − α/2) σ̂_n/√n, T̂_ij + Φ^{−1}(1 − α/2) σ̂_n/√n].

For each parameter Θ*_ij, the probability that the true value Θ*_ij is covered by the confidence interval was estimated by its empirical version,

α̂_ij := (1/N) Σ_{k=1}^N 1{Θ*_ij ∈ I^{(k)}_{ij,α}},

where I^{(k)}_{ij,α} denotes the confidence interval computed in the k-th iteration. The number of iterations used to calculate the estimates α̂_ij was set to N = 300. Next, for a set A ⊂ V × V, define the average coverage over the set A as the average of α̂_ij over (i, j) ∈ A. After obtaining the estimates α̂_ij, we averaged them over the sets S_0 and S_0^c to obtain Avgcov_{S_0} and Avgcov_{S_0^c}, respectively. Similarly, we calculated the average length of the confidence interval for each parameter Θ*_ij from N = 300 iterations and again averaged these over the sets S_0 and S_0^c to obtain Avglength_{S_0} and Avglength_{S_0^c}. We carried out a comparison of the graphical Lasso-based confidence intervals with two other methods for their construction: the "oracle" maximum likelihood estimator with the zero set S_0^c pre-specified, and a method based on the sample covariance matrix. The results are shown in figure 1. It may be noted that the graphical Lasso performs worse for the chain graph, where the active set is bigger.
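The two simulated precision matrices can be constructed as follows (a sketch; the `spokes` parameter of the star graph is our own device, chosen so that, e.g., 7 spokes reproduce the active-set cardinality 94 reported for p = 80):

```python
import numpy as np

def chain_precision(p, rho):
    """Chain-graph precision matrix: tridiag(rho, 1, rho)."""
    Theta = np.eye(p)
    idx = np.arange(p - 1)
    Theta[idx, idx + 1] = rho
    Theta[idx + 1, idx] = rho
    return Theta

def star_precision(p, rho, spokes):
    """Star-graph precision: unit diagonal, rho between the hub (vertex 0) and
    the first `spokes` remaining vertices (`spokes` is an illustrative parameter)."""
    Theta = np.eye(p)
    Theta[0, 1:spokes + 1] = rho
    Theta[1:spokes + 1, 0] = rho
    return Theta

# Active-set cardinalities: 3p - 2 for the chain (238 for p = 80);
# p + 2 * spokes for the star (7 spokes give the reported 94 for p = 80).
assert (chain_precision(80, 0.3) != 0).sum() == 3 * 80 - 2
assert (star_precision(80, 0.3, 7) != 0).sum() == 80 + 2 * 7
```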
The choice of the regularization parameter. Theory implies that the correct choice of the regularization parameter satisfies λ_n ≍ √(log p / n). In our setting, roughly speaking, it might be sufficient to choose the regularization parameter λ smaller than when model selection is the goal (i.e. identifying the zero set of Θ*). For confidence intervals, there is presumably no significant difference between a small absolute value of a parameter and a zero value, thus less regularization may suffice. The practical choice of regularization was done empirically, optimizing average coverage as a function of λ. Very roughly, based on our experiments for the two Gaussian graphical models, λ_n = √(log p / n) produces reasonably satisfactory results.

Proofs
Proof of Lemma 2. The KKT conditions for the optimization problem (P1) read

Σ̂ − Θ̂^{−1} + λẐ = 0,

where the matrix Ẑ belongs to the subdifferential of ‖·‖_{1,off} at the optimum Θ̂. Vectorizing the equation and multiplying by the matrix Θ̂ ⊗ Θ̂, we obtain a vectorized identity for Θ̂. Adding the expression Θ̂ − Θ* to both sides and rearranging to obtain the asymptotic pivot (i.e. subtracting its expectation from Σ̂), the claimed decomposition follows.
Next we find an elementwise bound for the remainders Rem_1 and Rem_2. In what follows, condition on the event T_n defined in (1.2).
Denoting ∆ := Θ̂ − Θ* and rewriting the second remainder in terms of ∆, we obtain a bound in which inequality (4.3) follows since |a^T b| ≤ ‖a‖_∞‖b‖_1 for any vectors a, b ∈ R^p, and equality (4.4) follows since max_i ‖e_i^T A‖_∞ = ‖A‖_∞ and max_j ‖A e_j‖_∞ = ‖A‖_∞. Next, the first remainder is controlled by the sub-multiplicativity of |||·|||_∞. We then bound the second remainder. For any m ∈ N, matrix A ∈ R^{m×m} and vector x ∈ R^m we have ‖Ax‖_∞ ≤ |||A|||_∞‖x‖_∞. Recall that conditioned on T_n we have ‖Σ̂ − Σ*‖_∞ ≤ δ_n. Also observe that Θ̂ ⊗ Θ̂ − Θ* ⊗ Θ* may be rewritten as ∆ ⊗ Θ̂ + Θ* ⊗ ∆. Moreover, |||A ⊗ B|||_∞ = |||A|||_∞|||B|||_∞: to see this, realize that the (i, j)-th row of the Kronecker product is given by (A ⊗ B)_{(i,j),•} = A_{i•} ⊗ B_{j•}, and for vectors u, v it is easy to check by a straightforward calculation that ‖u ⊗ v‖_1 = ‖u‖_1‖v‖_1; hence |||A ⊗ B|||_∞ = max_{(i,j)∈V×V} ‖A_{i•} ⊗ B_{j•}‖_1 = |||A|||_∞|||B|||_∞. Applying the triangle inequality together with the last property of |||·|||_∞, and then applying (4.6) again, we finally obtain an elementwise bound for the remainder, where the last inequality follows since by assumption (A3) κ_{Σ*} is uniformly bounded, by assumption (A2) Λ_max(Θ*) < M, and since d ≥ 1.
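The identity |||A ⊗ B|||_∞ = |||A|||_∞ |||B|||_∞ used above is easy to confirm numerically:

```python
import numpy as np

def op_inf(A):
    """ell_infinity operator norm |||A|||_inf: maximum absolute row sum."""
    return np.abs(A).sum(axis=1).max()

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((3, 3))

# Row (i, j) of A (x) B is A_{i.} (x) B_{j.}, whose l1 norm factorizes,
# giving |||A (x) B|||_inf = |||A|||_inf * |||B|||_inf.
assert np.isclose(op_inf(np.kron(A, B)), op_inf(A) * op_inf(B))
```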
Proof of theorem 1. By corollary 1, under the sub-Gaussianity condition (C1), the assumptions of lemma 2 and the sparsity assumption d = o(√n / log p), it holds that ‖Rem‖_∞ = o_P(1). Considering then the result of lemma 2 elementwise, for every (i, j) we obtain the expansion (4.7). To prove the claim of the theorem, we need to show that the scaled summation term in (4.7) converges weakly to the normal distribution. To this end, define Z_{n1}, …, Z_{nn} to be the summands. For each n fixed, Z_{n1}, …, Z_{nn} are identically distributed random variables with mean zero. The variable e_i^T Θ* X_{nk} is sub-Gaussian, since a finite sum of sub-Gaussian random variables is sub-Gaussian. The moments of a sub-Gaussian random variable are finite, which implies finiteness of σ_n. In the integral appearing in the Lindeberg condition, substitute for t; again by the restriction on d and the Lebesgue dominated convergence theorem it then follows that the limit of the integral is 0. In conclusion, we get S_n/s_n →_D N(0, 1) as n → ∞.

Lemma 4. Let Θ* satisfy assumption (A2) and let the random vector X_n ∈ R^p satisfy the sub-Gaussianity condition (C1). Then for t > c_0, the random variable e_i^T Θ* X_n X_n^T Θ* e_j satisfies an exponential tail bound at level t, where c_0, c_1 do not depend on n.
and, using the union bound and sub-Gaussianity of X_{nk}/√(Σ*_{kk}) by assumption (C1), the tail bound follows. Note that by remark 2, Σ*_{kk} and ‖Θ*_{•i}‖_2 ≤ Λ_max(Θ*) are uniformly bounded in n by the assumption on the eigenvalues of Σ*.

Proof of lemma 3. The claim follows by first applying theorem 1, provided that we show that σ_n = Ω(1), and then by multiplying (2.4) by σ_n/σ̂_n, provided that we show σ_n/σ̂_n →_P 1. We shall first prove that σ_n² is the corresponding element of the inverse of the Fisher information matrix for Θ*. Assuming that X ∼ N(0, Σ*), then Θ* X ∼ N(0, Θ*); hence σ_n² = Var(e_i^T Θ* X X^T Θ* e_j) = Var(e_i^T Z Z^T e_j), where Z ∼ N(0, Θ*). Thus σ_n² = Θ*_ii Θ*_jj + Θ*_ij². Next we show that σ_n = Ω(1). To obtain a lower bound on the diagonal elements of Θ*, observe by assumption (A2) that Θ*_ii ≥ Λ_min(Θ*) > 1/M > 0. Hence Θ*_ii Θ*_jj + Θ*_ij² ≥ 1/M² > 0, where M is independent of n; therefore σ_n = Ω(1) as required.
Next we show that σ_n/σ̂_n →_P 1, which holds on the event T_n.

Lemma 6 (Lemma 8, Ravikumar et al. [11]). Let X be a zero-mean random vector with precision matrix Θ*, inverse tail function δ_τ and tail constant ν. For γ > 2, put δ_n := δ_τ(n, p^γ) in the definition of T_n. Then for a sample size n such that δ_τ(n, p^γ) < ν, we have P(T_n^c) ≤ 1/p^{γ−2}.

d^{3/2} = o(√n / log p). Next, considering the limit of the last term in (4.8), we have
Fig 1. Tables showing a comparison of the graphical Lasso with the maximum likelihood estimator with specified set S_0 and an estimator based on the sample covariance matrix Σ̂. The results were obtained for p = 80 and n = 250. The regularization parameter was chosen as λ = √(log p / n). The active sets have cardinality 238 for the chain graph and 94 for the star graph. The constant ρ in the definition of Θ* was set to 0.3 for both graphical models.
Fig 2. A table showing the performance of the graphical Lasso (P1) for the number of parameters p taking values 100, 200, 300, 500 and n = 100. The regularization parameter was chosen as λ = √(log p / n). The constant ρ in the definition of Θ* is 0.3.