TIGER: A Tuning-Insensitive Approach for Optimally Estimating Gaussian Graphical Models

We propose a new procedure for estimating high dimensional Gaussian graphical models. Our approach is asymptotically tuning-free and non-asymptotically tuning-insensitive: it requires very little effort to choose the tuning parameter in finite sample settings. Computationally, our procedure is significantly faster than existing methods due to its tuning-insensitive property. Theoretically, the obtained estimator is simultaneously minimax optimal for precision matrix estimation under different norms. Empirically, we illustrate the advantages of our method through extensive simulated and real data examples. The R package bigmatrix implementing the proposed methods is available on the Comprehensive R Archive Network: http://cran.r-project.org/.

In low dimensions where d < n, Perlman (2007, 2008) develops a multiple testing procedure for identifying the sparsity pattern of the precision matrix. In high dimensions where d ≫ n, Meinshausen and Bühlmann (2006) propose a neighborhood pursuit approach for estimating Gaussian graphical models by solving a collection of sparse regression problems using the Lasso (Tibshirani, 1996; Chen et al., 1998). This approach can be viewed as a pseudo-likelihood approximation of the full likelihood. A related approach is to directly estimate Θ by penalizing the likelihood using the L_1-penalty (Banerjee et al., 2008; Yuan and Lin, 2007; Friedman et al., 2008). To further reduce the estimation bias, Lam and Fan (2009), Jalali et al. (2012), and Shen et al. (2012) propose either greedy algorithms or non-convex penalties for sparse precision matrix estimation. Under certain conditions, Ravikumar et al. (2011) and Rothman et al. (2008) study the theoretical properties of the penalized likelihood methods. Yuan (2010) and Cai et al. (2011a) also propose the graphical Dantzig selector and CLIME respectively, which can be solved by linear programming and are more amenable to theoretical analysis than the penalized likelihood approach. More recently, Liu and Luo (2012) and Sun and Zhang (2012) propose the SCIO and scaled-Lasso methods, which estimate the sparse precision matrix in a column-by-column fashion and have good theoretical properties.
Besides Gaussian graphical models, a semiparametric procedure named the nonparanormal skeptic has been proposed, which extends the Gaussian family to the more flexible semiparametric Gaussian copula family. Instead of assuming that X follows a Gaussian distribution, it assumes there exists a set of monotone functions f_1, ..., f_d such that the transformed data f(X) := (f_1(X_1), ..., f_d(X_d))ᵀ is Gaussian. More details can be found in Lafferty et al. (2012). Zhao et al. (2012) develop a scalable software package implementing the nonparanormal algorithms. Other nonparametric graph estimation methods include forest graphical models (Liu et al., 2011) and conditional graphical models (Liu et al., 2010a).
Most of these methods require choosing tuning parameters that control the bias-variance tradeoff. Theoretical justifications of these methods are usually built on oracle choices of tuning parameters that cannot be implemented in practice. It remains a challenging problem to choose the regularization parameter in a data-dependent way. Popular techniques include the C_p-statistic (Mallows, 1973), AIC (Akaike information criterion, Akaike (1973)), BIC (Bayesian information criterion, Schwarz (1978)), extended BIC (Chen and Chen, 2008, 2012; Foygel and Drton, 2010), RIC (Risk inflation criterion, Foster and George (1994)), cross validation (Efron, 1982), and covariance penalization (Efron, 2004). Most of these methods require data splitting and have only been justified in low dimensional settings. Significant progress has been made recently on developing likelihood-free regularization selection techniques, including permutation methods (Wu et al., 2007; Boos et al., 2009; Lysen, 2009) and subsampling methods (Lange et al., 2004; Ben-david et al., 2006; Meinshausen and Bühlmann, 2010; Bach, 2008). Meinshausen and Bühlmann (2010), Bach (2008), and Liu et al. (2010b) also propose to select the tuning parameters using subsampling. However, these subsampling based methods are computationally expensive and still lack theoretical guarantees.
In this paper we propose a new procedure for estimating high dimensional Gaussian graphical models. Our method, named TIGER (Tuning-Insensitive Graph Estimation and Regression), enjoys a "tuning-insensitive property": it automatically adapts to the unknown sparsity pattern and is asymptotically tuning-free. In finite sample settings, only very little effort is needed to tune the regularization parameter. The main idea is to estimate the precision matrix Θ in a column-by-column fashion. For each column, the computation is reduced to a sparse regression problem. This idea has been adopted by many methods, including the neighborhood pursuit (Meinshausen and Bühlmann, 2006), graphical Dantzig selector (Yuan, 2010), CLIME (Cai et al., 2011a), SCIO (Liu and Luo, 2012), and the scaled-Lasso method (Sun and Zhang, 2012). These methods differ mainly in how they solve the sparse regression subproblem: the graphical Dantzig selector and CLIME use the Dantzig selector, the SCIO and neighborhood pursuit use the Lasso, while Sun and Zhang (2012) use the scaled-Lasso. Unlike these existing methods, the TIGER solves this sparse regression problem using the SQRT-Lasso (Belloni et al., 2012). By using the SQRT-Lasso regression, the TIGER enjoys the tuning-insensitive property and improves upon existing methods both theoretically and empirically.
The main advantage of the TIGER over existing methods is its asymptotic tuning-free property, which allows us to use the entire dataset to efficiently learn and select the model. In contrast, it is well known that cross-validation and subsampling methods are computationally expensive. Moreover, they may waste valuable data that could otherwise be exploited to learn a better model (Bishop et al., 2003). With such a tuning-insensitive property, the TIGER also allows us to conduct more objective scientific data analysis.
Another advantage of the TIGER is its computational simplicity and scalability to large datasets. For problems with large dimensionality d, the TIGER divides the whole problem into many subproblems, each of which estimates one column of the precision matrix by solving a simple SQRT-Lasso problem. The final matrix estimator is formed by combining the vector solutions into a matrix. This procedure can be run in parallel and scales linearly with the number of CPU cores. An additional performance improvement comes from the tuning-insensitive property of the TIGER. Most existing methods exploit cross-validation to choose tuning parameters, which requires computing the solutions over a full regularization path. In contrast, the TIGER solves the SQRT-Lasso subproblem only for a few fixed tuning parameters. For sparse problems, this is significantly faster than existing methods.
In the current paper, we prove theoretical guarantees for the TIGER estimator Θ̂ of the true precision matrix Θ. In particular, let ‖Θ‖_max := max_{j,k} |Θ_jk| and ‖Θ‖_1 := max_j Σ_k |Θ_jk|. Under the assumption that the condition number of Θ is bounded by a constant, we establish the elementwise sup-norm rate of convergence:

‖Θ̂ − Θ‖_max = O_P( ‖Θ‖_1 √(log d / n) ).   (1)

If we further assume ‖Θ‖_1 ≤ M_d, where M_d may scale with d, the rate in (1) is minimax optimal over the model class consisting of precision matrices with bounded condition numbers. This result allows us to effectively conduct graph estimation without the need for irrepresentable conditions. Let I(·) be the indicator function and s := Σ_{j≠k} I(Θ_jk ≠ 0) be the number of nonzero off-diagonal elements of Θ. The result in (1) implies that the Frobenius norm error between Θ̂ and Θ satisfies

‖Θ̂ − Θ‖_F = O_P( ‖Θ‖_1 √( (s + d) log d / n ) ).   (2)

Similarly, if we assume ‖Θ‖_1 ≤ M_d, where M_d may scale with d, the rate in (2) is minimax optimal for the Frobenius norm error over the same model class of precision matrices with bounded condition numbers. Let ‖Θ‖_2 be the largest eigenvalue of Θ (i.e., ‖Θ‖_2 is the spectral norm of Θ) and k := max_{i=1,...,d} Σ_j I(Θ_ij ≠ 0). We also prove that

‖Θ̂ − Θ‖_2 = O_P( ‖Θ‖_1 k √(log d / n) ).   (3)

Under the same condition that ‖Θ‖_1 ≤ M_d, where M_d may scale with d, the spectral norm rate in (3) is also minimax optimal over the same model class as before. Besides these theoretical results, we also establish a relationship between the SQRT-Lasso and the scaled-Lasso proposed by Sun and Zhang (2012). More specifically, the objective function of the scaled-Lasso can be viewed as a variational upper bound of that of the SQRT-Lasso. This relationship allows us to develop a very efficient algorithm for the TIGER.
Computationally, the TIGER is significantly faster than existing methods since very little tuning is needed. In particular, we propose an iterative algorithm whose initial values are obtained by the Alternating Direction Method of Multipliers (ADMM). For each reduced sparse regression subproblem, the computational complexity is of the same order as solving a single Lasso with a sparse solution. Under the parallel computing framework, our algorithm is even faster and more scalable than the glasso and huge packages (Friedman et al., 2008; Zhao et al., 2012). Empirically, we present thorough numerical simulations to compare the graph recovery and parameter estimation performance of our method with other approaches. A real data experiment on a gene expression dataset is also provided to back up our theory. The R package bigmatrix implementing the proposed methods is available on the Comprehensive R Archive Network: http://cran.r-project.org/.
The rest of the paper is organized as follows. In Section 2, we introduce basic notation and background on Gaussian graphical models. In Section 3, we present the TIGER estimator and its computational algorithm. In Section 4, we present the theoretical properties, including the rates of convergence for parameter estimation and graph recovery; we also discuss the connections and differences between our results and related methods. In Section 5, we demonstrate the numerical performance on synthetic and real datasets. The proofs of the main results are given in the appendix.

Background
Let x_1, ..., x_n be n data points from a d-dimensional Gaussian random vector X := (X_1, ..., X_d)ᵀ ∼ N_d(0, Σ). We denote x_i := (x_i1, ..., x_id)ᵀ. As discussed in the previous section, we define the precision matrix Θ := Σ^{-1}. In this section, we start with some notation, followed by an introduction to solving Gaussian graphical models in a column-by-column approach.

Notations
Let v := (v_1, ..., v_d)ᵀ ∈ R^d and let I(·) be the indicator function. For 0 < q < ∞, we define

‖v‖_q := ( Σ_{j=1}^d |v_j|^q )^{1/q}.

We also define ‖v‖_0 := Σ_{j=1}^d I(v_j ≠ 0) and ‖v‖_∞ := max_j |v_j|. Let A ∈ R^{d×d} be a symmetric matrix and let I, J ⊂ {1, ..., d} be two index sets. We denote by A_{I,J} the submatrix of A with rows and columns indexed by I and J. Let A_{*j} be the j-th column of A and A_{*\j} be the submatrix of A with the j-th column A_{*j} removed.
We define the matrix L_q-operator norm ‖A‖_q := max_{‖v‖_q = 1} ‖Av‖_q; in particular, ‖A‖_1 = max_j Σ_i |A_ij| is the maximum absolute column sum. It is easy to see that, since A is symmetric, the case q = ∞ satisfies ‖A‖_∞ = ‖A‖_1. We also denote by Λ_max(A) and Λ_min(A) the largest and smallest eigenvalues of A.
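The matrix norms above are direct to compute; a minimal numpy illustration (the matrix A below is ours, chosen only to make the quantities concrete):

```python
import numpy as np

# A small symmetric matrix to illustrate the notation.
A = np.array([[2.0, -1.0],
              [-1.0, 3.0]])

# Matrix L1 norm: the maximum absolute column sum.
l1 = np.abs(A).sum(axis=0).max()
# Matrix L-infinity norm: the maximum absolute row sum.
linf = np.abs(A).sum(axis=1).max()
# For a symmetric matrix the two coincide.
assert l1 == linf == 4.0

# Lambda_max(A) and Lambda_min(A): the extreme eigenvalues.
eigvals = np.linalg.eigvalsh(A)
lam_min, lam_max = eigvals[0], eigvals[-1]
assert lam_min > 0  # this particular A is positive definite
```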

Gaussian Graphical Model and Column-by-Column Regression
Let X ∼ N_d(0, Σ). The conditional distribution of X_j given X_\j is Gaussian. Let α_j := (Σ_{\j,\j})^{-1} Σ_{\j,j} ∈ R^{d−1} and σ_j² := Σ_jj − Σ_{\j,j}ᵀ (Σ_{\j,\j})^{-1} Σ_{\j,j}. We have

X_j = α_jᵀ X_\j + ε_j,   (7)

where ε_j ∼ N(0, σ_j²) is independent of X_\j. By the block matrix inversion formula, we have

Θ_jj = σ_j^{-2},   (8)
Θ_{\j,j} = −σ_j^{-2} α_j.   (9)

Therefore, we can recover Θ in a column-by-column manner by regressing X_j on X_\j for j = 1, 2, ..., d. For example, let X := (x_1, ..., x_n)ᵀ ∈ R^{n×d} be the data matrix, and denote by X_{*j} its j-th column and by X_{*\j} the matrix X with the j-th column removed. We denote α_j := (α_j1, ..., α_j(d−1))ᵀ ∈ R^{d−1}. Meinshausen and Bühlmann (2006) propose to iteratively estimate each α_j by solving the Lasso regression

α̂_j := argmin_{α ∈ R^{d−1}} { (1/(2n)) ‖X_{*j} − X_{*\j} α‖₂² + λ_j ‖α‖₁ },

where λ_j is a tuning parameter. Once α̂_j is given, we get the neighborhood edges by reading out its nonzero coefficients. The final graph estimate Ĝ is obtained by combining the neighborhoods of all d nodes using either the "AND" or the "OR" rule. However, the neighborhood pursuit method of Meinshausen and Bühlmann (2006) only estimates the graph G and does not estimate the inverse covariance matrix Θ.
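The block-inversion identities Θ_jj = 1/σ_j² and Θ_{\j,j} = −α_j/σ_j² can be checked numerically; a minimal numpy sketch (the dimension and covariance below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
B = rng.standard_normal((d, d))
Sigma = B @ B.T + d * np.eye(d)   # a generic positive definite covariance
Theta = np.linalg.inv(Sigma)      # the precision matrix

j = 0
mask = np.arange(d) != j
# Population regression of X_j on X_{\j}: coefficients and residual variance.
alpha_j = np.linalg.solve(Sigma[np.ix_(mask, mask)], Sigma[mask, j])
sigma2_j = Sigma[j, j] - Sigma[mask, j] @ alpha_j

# Block matrix inversion: Theta_jj = 1/sigma_j^2, Theta_{\j,j} = -alpha_j/sigma_j^2.
assert np.isclose(Theta[j, j], 1.0 / sigma2_j)
assert np.allclose(Theta[mask, j], -alpha_j / sigma2_j)
```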
To explicitly estimate Θ, Yuan (2010) proposes to estimate α_j by solving the Dantzig selector

α̂_j := argmin_{α ∈ R^{d−1}} ‖α‖₁  subject to  ‖Σ̂_{\j,j} − Σ̂_{\j,\j} α‖_∞ ≤ γ_j,

where Σ̂ := (1/n) XᵀX is the sample covariance matrix and γ_j is a tuning parameter. Once α̂_j is given, we can estimate σ_j² by the plug-in residual variance

σ̂_j² := Σ̂_jj − 2 α̂_jᵀ Σ̂_{\j,j} + α̂_jᵀ Σ̂_{\j,\j} α̂_j.

We then obtain the estimate Θ̂ of Θ by plugging α̂_j and σ̂_j² into (8) and (9). Yuan (2010) analyzes the matrix L_1-norm error ‖Θ̂ − Θ‖_1 and shows its minimax optimality over a certain model space. However, no graph estimation result is provided for this approach.
In another work, Sun and Zhang (2012) propose to estimate α_j and σ_j by solving a scaled-Lasso problem:

(α̂_j, σ̂_j) := argmin_{α ∈ R^{d−1}, σ > 0} { ‖X_{*j} − X_{*\j} α‖₂² / (2nσ) + σ/2 + λ ‖α‖₁ }.

Sun and Zhang (2012) analyze the spectral-norm rate of convergence of the resulting precision matrix estimator. They do not investigate the elementwise sup-norm rate or the graph recovery performance. In the next section, we will show that the scaled-Lasso estimator is closely related to our proposed procedure and will discuss this relationship in more detail.
To estimate both the precision matrix Θ and the graph G, Cai et al. (2011a) propose the CLIME estimator, which directly estimates the j-th column of Θ by solving

Θ̂_{*j} := argmin_{β ∈ R^d} ‖β‖₁  subject to  ‖Σ̂ β − e_j‖_∞ ≤ δ_j,

where e_j is the j-th canonical basis vector and δ_j is a tuning parameter. Cai et al. (2011a) show that this convex optimization can be formulated as a linear program and has the potential to scale to large problems. Once Θ̂ is obtained, another tuning parameter τ is used to threshold Θ̂ and estimate the graph G. In a follow-up work on the CLIME, Liu and Luo (2012) propose the SCIO estimator, which solves for the j-th column of Θ by

Θ̂_{*j} := argmin_{β ∈ R^d} { (1/2) βᵀ Σ̂ β − e_jᵀ β + λ ‖β‖₁ }.

The justifications of most of these graph estimation methods are built on theoretical choices of tuning parameters that cannot be implemented in practice. For example, in the neighborhood pursuit method and the graphical Dantzig selector, the tuning parameters λ_j and γ_j depend on σ_j², which is unknown. Practically, we usually set λ = λ_1 = ... = λ_d and γ = γ_1 = ... = γ_d to reduce the number of tuning parameters. However, as we illustrate in later sections, such a choice makes the estimation procedure non-adaptive to inhomogeneous graphs. The tuning parameters of the CLIME and SCIO depend on ‖Θ‖_1, which is also unknown. In general, these methods employ cross-validation to conduct data-dependent tuning parameter selection, and Liu and Luo (2012) provide a theoretical analysis of the cross-validation estimator. However, as discussed before, cross-validation is computationally expensive and may waste valuable training data. In the next section, we describe a tuning-insensitive procedure that simultaneously estimates the precision matrix Θ and the graph G with the optimal rates of convergence.

Method
In this section we introduce the use of the SQRT-Lasso from Belloni et al. (2012) for simultaneously estimating the graph G and precision matrix Θ := Σ −1 .
The SQRT-Lasso is a penalized optimization method for solving high dimensional linear regression problems. Consider a linear regression problem y = Xβ + ε, where y ∈ R^n is the response, X ∈ R^{n×d} is the design matrix, β ∈ R^d is the vector of unknown coefficients, and ε ∈ R^n is the noise vector. The SQRT-Lasso estimates β by solving

β̂ := argmin_{β ∈ R^d} { ‖y − Xβ‖₂ / √n + λ ‖β‖₁ },

where λ is the tuning parameter. It is shown in Belloni et al. (2012) that the choice of λ for the SQRT-Lasso is asymptotically universal and does not depend on any unknown parameter. In contrast, most other methods, including the Lasso and the Dantzig selector, rely heavily on a known standard deviation of the noise. Moreover, the SQRT-Lasso achieves near-oracle performance for the estimation of β.
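The reason the SQRT-Lasso tuning parameter can be noise-level-free is visible in the gradient of its loss at the true coefficients, Xᵀε/(√n ‖ε‖₂): rescaling the noise ε by any constant leaves this quantity unchanged. A small numerical check (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 100
X = rng.standard_normal((n, d))
eps = rng.standard_normal(n)

def pivotal_score(e):
    # max_j |X_j^T e| / (sqrt(n) * ||e||_2): the sup-norm of the
    # SQRT-Lasso loss gradient evaluated at the true coefficient vector.
    return np.max(np.abs(X.T @ e)) / (np.sqrt(n) * np.linalg.norm(e))

# Multiplying the noise by any constant leaves the score unchanged,
# so the tuning parameter need not depend on the noise level.
assert np.isclose(pivotal_score(eps), pivotal_score(10.0 * eps))
```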

TIGER for Graph and Precision Matrix Estimation
In the discussion of this section, we always condition on the observed data x_1, ..., x_n. Let Γ̂ := diag(Σ̂) be the d-dimensional diagonal matrix whose diagonal elements are the same as those of Σ̂, and let Γ := diag(Σ) be its population version. We define the standardized variables

Z_j := X_j / √(Σ_jj) for j = 1, ..., d.   (19)

By (7), we have X_j = α_jᵀ X_\j + ε_j with ε_j ∼ N(0, σ_j²) independent of X_\j. We define

β_j ∈ R^{d−1} with entries β_jk := α_jk √(Σ_kk / Σ_jj), and τ_j² := σ_j² / Σ_jj.   (20)

Therefore, we have the standardized regression model

Z_j = β_jᵀ Z_\j + τ_j e_j,   (21)

where e_j ∼ N(0, 1) is independent of Z_\j. We define R̂ := Γ̂^{-1/2} Σ̂ Γ̂^{-1/2} to be the sample correlation matrix. Motivated by the model in (21), we propose the following precision matrix estimator.
For j = 1, ..., d, we estimate the j-th column of Θ by solving

β̂_j := argmin_{β ∈ R^{d−1}} { √( 1 − 2βᵀ R̂_{\j,j} + βᵀ R̂_{\j,\j} β ) + λ ‖β‖₁ }   (22)

and

τ̂_j := √( 1 − 2β̂_jᵀ R̂_{\j,j} + β̂_jᵀ R̂_{\j,\j} β̂_j ),   (23)

and then setting

Θ̂_jj := ( τ̂_j² Σ̂_jj )^{-1} and Θ̂_{\j,j} := − τ̂_j^{-2} Σ̂_jj^{-1/2} Γ̂_{\j,\j}^{-1/2} β̂_j.   (24)

For the estimator in (22), λ is a tuning parameter. In the next section, we show that by choosing λ = π √( log d / (2n) ), the obtained estimator achieves the optimal rates of convergence in the asymptotic setting. Therefore, our procedure is asymptotically tuning-parameter free.
For finite samples, we set

λ := ζ π √( log d / (2n) ),   (25)

where ζ can be chosen from the interval [√2/π, 1]. Since the choice of ζ does not depend on any unknown parameters or quantities, we call the procedure tuning-insensitive. Empirically, we found that in most cases we can simply set ζ = √2/π, for which (25) reduces to λ = √(log d / n), and the resulting estimator works well in finite sample settings. More details can be found in the simulation section.
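The finite-sample tuning formula is simple to compute; a sketch, assuming λ = ζπ√(log d/(2n)) as in (25) (the function name and default are ours):

```python
import math

def tiger_lambda(n, d, zeta=math.sqrt(2) / math.pi):
    """lambda = zeta * pi * sqrt(log(d) / (2n)); default zeta = sqrt(2)/pi."""
    return zeta * math.pi * math.sqrt(math.log(d) / (2.0 * n))

# With the default zeta = sqrt(2)/pi, the formula reduces to sqrt(log(d) / n).
assert math.isclose(tiger_lambda(200, 100), math.sqrt(math.log(100) / 200))
```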
If a symmetric precision matrix estimate is preferred, we conduct the following correction: for all k ≠ j,

Θ̃_jk := Θ̂_jk · I( |Θ̂_jk| ≤ |Θ̂_kj| ) + Θ̂_kj · I( |Θ̂_kj| < |Θ̂_jk| ),

i.e., between Θ̂_jk and Θ̂_kj we keep the entry with the smaller magnitude. Another symmetrization method is Θ̃ := (Θ̂ + Θ̂ᵀ)/2. As shown by Cai et al. (2011a), if Θ̂ is a good estimator, then Θ̃ will also be a good estimator: they achieve the same rates of convergence in the asymptotic settings. Let Z ∈ R^{n×d} be the normalized data matrix, i.e., Z_{*j} = X_{*j} Σ̂_jj^{-1/2} for j = 1, ..., d. An equivalent form of (22) and (23) is

β̂_j := argmin_{β ∈ R^{d−1}} { ‖Z_{*j} − Z_{*\j} β‖₂ / √n + λ ‖β‖₁ },   (27)
τ̂_j := ‖Z_{*j} − Z_{*\j} β̂_j‖₂ / √n.   (28)

Once we have Θ̂, the estimated graph is Ĝ := (V, Ê), where (j, k) ∈ Ê if and only if Θ̂_jk Θ̂_kj ≠ 0.
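Both symmetrization rules are one-liners in numpy; a sketch (function names are ours; the min-magnitude rule keeps, between the (j,k) and (k,j) entries, the one smaller in absolute value):

```python
import numpy as np

def symmetrize_min(theta):
    """Keep the smaller-magnitude entry between theta[j,k] and theta[k,j]."""
    keep = np.abs(theta) <= np.abs(theta.T)
    return np.where(keep, theta, theta.T)

def symmetrize_avg(theta):
    """Alternative rule: average theta with its transpose."""
    return (theta + theta.T) / 2.0

t = np.array([[2.0, 0.5],
              [-0.1, 3.0]])
s = symmetrize_min(t)
assert np.allclose(s, s.T)            # the result is symmetric
assert s[0, 1] == s[1, 0] == -0.1     # the smaller-magnitude entry survives
```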

Relationship with the Scaled-Lasso Estimator
In this subsection, we show that the objective function of the scaled-Lasso from Sun and Zhang (2012) can be viewed as a variational upper bound of the objective function of the SQRT-Lasso, so the two methods solve the same problem. More specifically, we consider the following optimization:

(β̂_j, τ̂_j) := argmin_{β ∈ R^{d−1}, τ > 0} { ‖Z_{*j} − Z_{*\j} β‖₂² / (2nτ) + τ/2 + λ ‖β‖₁ }.   (29)

Proposition 1. The optimizations in (22) and (29) provide the same solution β̂_j.
Proof. For any a, b > 0, we have a² + b² ≥ 2ab, where the equality is attained if and only if a = b. Applying this with a := ‖Z_{*j} − Z_{*\j}β‖₂ / √(2nτ) and b := √(τ/2), we have, for any τ > 0,

‖Z_{*j} − Z_{*\j}β‖₂² / (2nτ) + τ/2 ≥ ‖Z_{*j} − Z_{*\j}β‖₂ / √n.

This shows that the objective function in (29) is an upper bound of the objective function in (22). The equality is attained if and only if τ = ‖Z_{*j} − Z_{*\j}β‖₂ / √n. Therefore, minimizing (29) over τ recovers the objective in (22), and the two problems share the same solution β̂_j. This finishes the proof.
This relationship between the TIGER and scaled-Lasso provides an efficient algorithm as described in the next subsection.

Computational Algorithm
Equation (29) is jointly convex with respect to β_j and τ_j and can be solved by a coordinate-descent procedure. In the t-th iteration, for a given τ_j^(t), we first solve the subproblem

β_j^(t+1) := argmin_{β ∈ R^{d−1}} { ‖Z_{*j} − Z_{*\j} β‖₂² / (2nτ_j^(t)) + λ ‖β‖₁ }.   (32)

This is a Lasso problem and can be efficiently solved by the coordinate descent algorithm developed by Friedman et al. (2007). Once β_j^(t+1) is obtained, we calculate

τ_j^(t+1) := ‖Z_{*j} − Z_{*\j} β_j^(t+1)‖₂ / √n.   (33)

We iterate these two steps until the algorithm converges. One thing to note is that the above algorithm converges quickly if a good initial value of τ_j is provided. For example, if the initial value is close to τ̂_j, the algorithm converges in only 3 to 5 iterations. In fact, with a good initial value of τ_j, the computation is roughly the same as running the Lasso for one single tuning parameter with a sparse solution. However, with a bad initial value of τ_j, the computational complexity can be as heavy as calculating the full regularization path of a Lasso problem, which is less efficient.
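The two-step iteration can be sketched in numpy with a hand-rolled coordinate-descent Lasso as the inner solver. This is an illustrative implementation under our own naming and defaults, not the packaged algorithm; note that minimizing (32) over β is the same as a standard Lasso with effective penalty τ·λ:

```python
import numpy as np

def lasso_cd(Z, y, lam, beta, n_sweeps=200, tol=1e-8):
    """Coordinate descent for (1/(2n))||y - Z b||_2^2 + lam * ||b||_1."""
    n, p = Z.shape
    col_sq = (Z ** 2).sum(axis=0) / n
    r = y - Z @ beta                      # current residual
    for _ in range(n_sweeps):
        max_delta = 0.0
        for k in range(p):
            old = beta[k]
            rho = Z[:, k] @ r / n + col_sq[k] * old
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[k]
            if new != old:
                r += Z[:, k] * (old - new)
                beta[k] = new
                max_delta = max(max_delta, abs(new - old))
        if max_delta < tol:
            break
    return beta

def sqrt_lasso_alternating(Z, y, lam, n_iter=20):
    """Minimize ||y - Z b||_2/sqrt(n) + lam*||b||_1 by alternating a Lasso
    step in b (effective penalty tau*lam) with the closed-form tau update."""
    n, p = Z.shape
    beta = np.zeros(p)
    tau = np.linalg.norm(y) / np.sqrt(n)  # crude initial noise level
    for _ in range(n_iter):
        beta = lasso_cd(Z, y, tau * lam, beta)
        tau = np.linalg.norm(y - Z @ beta) / np.sqrt(n)
    return beta, tau

# Illustration on synthetic data.
rng = np.random.default_rng(1)
n, p = 100, 20
Z = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = 1.0
y = Z @ beta_true + 0.5 * rng.standard_normal(n)
lam = np.sqrt(np.log(p) / n)
beta_hat, tau_hat = sqrt_lasso_alternating(Z, y, lam)
```

Each outer pass decreases the scaled objective, so the SQRT-Lasso objective at the output never exceeds its value at β = 0.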
To obtain a good initial estimate of τ_j, we propose an augmented Lagrangian method as follows. We first reparameterize (27) and (28) by introducing γ := Z_{*j} − Z_{*\j} β_j, so that the problem becomes

min_{β_j, γ} { ‖γ‖₂ / √n + λ ‖β_j‖₁ }  subject to  γ = Z_{*j} − Z_{*\j} β_j.

We consider the following augmented Lagrangian function

L_ρ(β_j, γ, α) := ‖γ‖₂ / √n + λ ‖β_j‖₁ + (ρ/2) ‖Z_{*j} − Z_{*\j} β_j − γ − α‖₂² − (ρ/2) ‖α‖₂²,

where ρ > 0 is the penalty parameter for the violation of the linear constraint. For simplicity, throughout this paper we assume that ρ > 0 is fixed (in implementations, we simply set ρ = 1), and α ∈ R^n is the Lagrange multiplier vector, rescaled by ρ for computational and notational convenience. This reparameterization decouples the computational dependency in the optimization problem, so a complicated problem can be split into multiple simpler subproblems, each of which can be solved easily.
The augmented Lagrangian method works in an iterative fashion. Suppose we have the solution (β_j^(t), γ^(t), α^(t)) at the t-th iteration; the algorithm proceeds as follows.

Step 1. Update β_j by

β_j^(t+1) := argmin_β L_ρ(β, γ^(t), α^(t)).   (37)

Let u^(t) := Z_{*j} − α^(t) − γ^(t) and λ_ρ := λ/ρ. The problem in (37) is equivalent to

β_j^(t+1) = argmin_β { (1/2) ‖u^(t) − Z_{*\j} β‖₂² + λ_ρ ‖β‖₁ }.

This is a Lasso subproblem which can be efficiently solved by the coordinate descent algorithm (Friedman et al., 2007).

Step 2. Given β_j^(t+1), we update γ by minimizing L_ρ over γ. Writing v^(t) := Z_{*j} − Z_{*\j} β_j^(t+1) − α^(t), this is the proximal operator of ‖·‖₂/√n, which shrinks v^(t) toward zero and returns exactly zero once ‖v^(t)‖₂ ≤ (√n ρ)^{-1}.
Step 3. Given γ^(t+1) and β_j^(t+1), we update the Lagrange multiplier α by

α^(t+1) := α^(t) − ( Z_{*j} − Z_{*\j} β_j^(t+1) − γ^(t+1) ).

Since we have rescaled the Lagrange multiplier α by ρ, the updating equation for α does not involve ρ.
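The γ-update of this ADMM reduces to the proximal operator of a scaled ℓ2 norm, i.e., block soft-thresholding; a sketch, where v stands for the residual the update is applied to and the threshold t would be 1/(√n ρ) (names are ours):

```python
import numpy as np

def prox_l2(v, t):
    """prox of t * ||.||_2 at v: shrink the whole vector toward zero,
    returning exactly zero once ||v||_2 <= t (block soft-thresholding)."""
    norm_v = np.linalg.norm(v)
    if norm_v <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm_v) * v

v = np.array([3.0, 4.0])                       # ||v||_2 = 5
assert np.allclose(prox_l2(v, 1.0), [2.4, 3.2])
assert np.allclose(prox_l2(v, 6.0), np.zeros(2))
```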
The algorithm stops when successive iterates are close enough, e.g., when

max{ ‖β_j^(t+1) − β_j^(t)‖₂, ‖γ^(t+1) − γ^(t)‖₂ } ≤ ε,

where ε > 0 is a precision tolerance parameter. Theoretically, we can set ε to a very small value, e.g., ε = 10^{-6}, so that this directly solves for β̂_j. However, empirically we found it more efficient to set ε = 10^{-3} and only obtain a crude initial estimate τ̂_j^{crude}. We then use τ̂_j^{crude} as the initial value and alternately solve (32) and (33). Such a hybrid procedure delivers the best empirical performance.

Fine-tune the Regularization Parameter
To secure the best finite sample performance, we could also fine-tune the ζ in (25) over the small interval [√2/π, 1]. Practically, due to the tuning-insensitive property of our procedure, we find it suffices to cross-validate only three values and pick the best one: ζ ∈ {√2/π, 0.6, 1}. All three values guarantee that the solutions are relatively sparse, so the algorithm runs very efficiently.

Theoretical Properties
In this section we investigate the theoretical properties of the proposed method. We begin with some notation and assumptions. We define a matrix class M(ξ_max, k):

M(ξ_max, k) := { Θ ∈ R^{d×d} : Θ ≻ 0, Λ_max(Θ)/Λ_min(Θ) ≤ ξ_max, max_i Σ_j I(Θ_ij ≠ 0) ≤ k },

where ξ_max is a constant and k may scale with the sample size n. We require three assumptions: (A1) Θ ∈ M(ξ_max, k); (A2) k² log d = o(n); (A3) the largest marginal variance max_j Σ_jj does not diverge too fast with (n, d). All these assumptions are mild. Assumption (A1) only requires the inverse covariance matrix Θ := Σ^{-1} to have a bounded condition number. Assumption (A2) is equivalent to k √(log d / n) = o(1); in later analysis, we will show that this condition is necessary to secure the consistency of the precision matrix estimation under different matrix norms. Assumption (A3) constrains that the marginal variances of the X_j should not diverge too fast.

Precision Matrix Estimation Consistency
We study the estimation error Θ̂ − Θ under different norms, including the spectral norm, matrix L_1-norm, elementwise sup-norm, and Frobenius norm. The rate under the elementwise sup-norm is important for graph recovery. The rate under the spectral norm is important since it leads to consistent estimation of eigenvalues and eigenvectors, which may further be utilized to analyze the theoretical properties of downstream statistical inference. We analyze the spectral norm rate by bounding the matrix L_1-norm rate. The Frobenius norm rate is also a useful measure of the accuracy of estimating Θ and can be viewed as the sum of squared errors for estimating individual rows. Our main results indicate that the TIGER procedure simultaneously achieves the optimal rates of convergence under all these matrix norms. We present these results in two main theorems and compare our results with related work in the literature. The proofs of these theorems can be found in the appendix.
Theorem 2 provides the rates of convergence under the matrix L_1 and spectral norms.

Theorem 2 (Matrix L_1-norm and spectral-norm rates). We choose the regularization parameter λ as in (25) with ζ = 1. Under Assumptions (A1), (A2), and (A3), we have

‖Θ̂ − Θ‖_2 ≤ ‖Θ̂ − Θ‖_1 = O_P( ‖Θ‖_1 k √(log d / n) ).

If we further assume ‖Θ‖_1 ≤ M_d, where M_d may scale with d, i.e., if we define the following new matrix class

M(M_d, ξ_max, k) := { Θ ∈ M(ξ_max, k) : ‖Θ‖_1 ≤ M_d },

then the result of Theorem 2 implies that

‖Θ̂ − Θ‖_2 = O_P( M_d k √(log d / n) ).

Based on the results of Cai et al. (2011b), Liu and Luo (2012), and Yuan (2010), this rate of convergence is minimax optimal on the model class M(M_d, ξ_max, k). The next theorem provides the rates of convergence under the elementwise sup-norm and the Frobenius norm. The elementwise sup-norm result is useful for recovering the graph from the precision matrix Θ.
Theorem 3 (Elementwise sup-norm and Frobenius-norm rates). We choose the regularization parameter λ as in (25) with ζ = 1. Let s be the total number of nonzero off-diagonal elements of Θ. Under Assumptions (A1), (A2), and (A3), we have

‖Θ̂ − Θ‖_max = O_P( ‖Θ‖_1 √(log d / n) )   (49)

and

‖Θ̂ − Θ‖_F = O_P( ‖Θ‖_1 √( (s + d) log d / n ) ).   (50)

Proof. The proof of (49) can be found in Appendix D.2. To prove (50), since s is the total number of nonzero off-diagonal elements of Θ, we have

‖Θ̂ − Θ‖_F ≤ √(s + d) · ‖Θ̂ − Θ‖_max = O_P( ‖Θ‖_1 √( (s + d) log d / n ) ),

where the second step follows from (49).
Again, based on the results in Cai et al. (2011b) and Liu and Luo (2012), we know that the TIGER achieves the minimax optimal rates of convergence under both elementwise sup-norm and Frobenius norm on model class M(M d , ξ max , k).
In summary, the TIGER simultaneously achieves the optimal rates of convergence for precision matrix estimation under spectral norm, Frobenius norm, matrix L 1 -norm, and elementwise sup-norm.

Graph Recovery Consistency
Next, we study the graph recovery property of the TIGER. Let Θ̂ be the estimated precision matrix. Recall that the estimated graph is Ĝ := (V, Ê), where (j, k) ∈ Ê if and only if Θ̂_jk Θ̂_kj ≠ 0. Similarly, the true graph is G := (V, E), where (j, k) ∈ E if and only if Θ_jk ≠ 0. We have the following theorem on graph recovery consistency.
Theorem 4 (Graph recovery consistency). We choose the regularization parameter λ as in (25) with ζ = 1. Assume that Assumptions (A1), (A2), and (A3) hold and that, for a sufficiently large constant K, the smallest nonzero element of Θ satisfies

min_{(j,k): Θ_jk ≠ 0} |Θ_jk| ≥ K ‖Θ‖_1 √(log d / n).

Then lim inf_{n→∞} P( E ⊆ Ê ) = 1.
Theoretically, it is possible that the TIGER delivers precision matrix estimates with very small nonzero values. To achieve exact recovery, we can hard threshold the estimated precision matrix Θ̂ to get a sparse estimate Θ̂_A, as discussed in Cai et al. (2011a). Let

(Θ̂_A)_jk := Θ̂_jk · I( |Θ̂_jk| ≥ A √(log d / n) ).

It can be seen that, under the same conditions as Theorem 4, there exists a constant A (which may depend on K) such that the above hard-threshold estimator achieves exact recovery. More discussion can be found in Cai et al. (2011a). Unlike the CLIME and the graphical Dantzig selector, whose linear program solvers may deliver very small nonzero values (but not exact zeros), our algorithm defined in Section 3 is based on the soft-thresholding operator and is capable of delivering exact zeros. Empirically, we found that the TIGER procedure works very effectively in graph estimation even without this hard-thresholding step. So the hard-thresholding step is mainly of theoretical interest and is not necessary in applications.
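The hard-thresholding correction is a one-liner; a sketch, where `level` stands in for the theoretical threshold A√(log d/n) (function name is ours):

```python
import numpy as np

def hard_threshold(theta, level):
    """Zero out entries smaller than `level` in magnitude, keeping the diagonal."""
    out = np.where(np.abs(theta) >= level, theta, 0.0)
    np.fill_diagonal(out, np.diag(theta))
    return out

t = np.array([[2.0, 0.05],
              [0.05, 3.0]])
h = hard_threshold(t, 0.1)
assert h[0, 1] == 0.0 and h[0, 0] == 2.0
```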
It is worth pointing out that if we are only interested in estimating the graph, assumption (A2) can be relaxed to k log d = o(n), see for example Meinshausen and Bühlmann (2006) and Jalali et al. (2012). Such an extension is relatively straightforward and will be reported elsewhere.

Discussion
In this subsection, we briefly discuss the theoretical properties of the TIGER estimator and compare them with other existing results.
The SCIO method proposed in Liu and Luo (2012) also provides rates of convergence for precision matrix estimation under various norms. One can see that the TIGER and SCIO estimators achieve the same rates of convergence in terms of the spectral norm, matrix L_1-norm, Frobenius norm, and elementwise sup-norm. However, we consider a much larger matrix class, since we only bound the condition number of the precision matrix while the SCIO bounds the largest and smallest eigenvalues from above and below, respectively. Moreover, the SCIO requires the irrepresentable condition, which is stronger than our condition and not required by the TIGER. In fact, whether the SCIO estimator can achieve the same rates as the TIGER on the model class M(M_d, ξ_max, k) is still an open problem.
Note that the graphical Dantzig selector proposed in Yuan (2010) also considers a model class where the largest and smallest eigenvalues are bounded from above and below. Therefore the results of the graphical Dantzig selector are again on a smaller model class than the TIGER estimator. Moreover, the graph recovery performance of the graphical Dantzig selector is still an open problem.
When compared with the CLIME of Cai et al. (2011a), the rate of convergence of the TIGER is faster: the spectral norm rate of the CLIME is O_P( k M_d² √(log d / n) ), while the spectral norm rate of the TIGER is O_P( k M_d √(log d / n) ). When compared with the glasso method (see, for example, Rothman et al. (2008) and Ravikumar et al. (2011)), the TIGER estimator achieves the minimax optimal rates of convergence under the spectral norm, Frobenius norm, matrix L_1-norm, and elementwise sup-norm. In contrast, the only theoretical result for the glasso is in terms of the Frobenius norm. Whether the glasso can achieve the same spectral norm rate of convergence as the TIGER is still an open problem.
The scaled-Lasso estimator proposed in Sun and Zhang (2012) provides rates of convergence under the spectral norm and matrix L_1-norm for precision matrix estimation. However, the scaled-Lasso estimator considers a different model class, in which the smallest eigenvalue of the correlation matrix is bounded away from zero by a constant, and the optimality of the obtained rates of convergence over that model class is unclear. Besides, Sun and Zhang (2012) do not provide an elementwise sup-norm analysis and hence have no graph recovery result. Though the TIGER has a close relationship with the scaled-Lasso, the theoretical analysis in the current paper is dramatically different from that in Sun and Zhang (2012).
Another related work is Meinshausen and Bühlmann (2006), which focuses only on graph recovery and does not provide a precision matrix estimation result.

Experimental Results
We compare the numerical performance of the TIGER with other methods (glasso and CLIME) in parameter estimation and graph recovery using simulated and real datasets. The TIGER and CLIME algorithms are implemented in the R package bigmatrix, which is publicly available through CRAN. The glasso is implemented in the R package huge (ver. 1.2.3).

Numerical Simulations
In our numerical simulations, we consider 6 settings to compare these methods: (i) n = 200, d = 100; (ii) n = 200, d = 200; (iii) n = 200, d = 400; (iv) n = 400, d = 100; (v) n = 400, d = 200; (vi) n = 400, d = 400. We adopt the following models for generating undirected graphs and precision matrices. A typical run of the generated graphs and the heatmaps of the precision matrices are illustrated in Figure 1.

* Scale-free graph. The degree distribution of a scale-free graph follows a power law. The graph is generated by the preferential attachment mechanism: it begins with an initial chain graph of 2 nodes, and new nodes are added to the graph one at a time. Each new node is connected to one existing node with a probability proportional to the degree that the existing node already has. Formally, the probability p_i that the new node is connected to an existing node i is p_i = k_i / Σ_j k_j, where k_i is the degree of node i. The resulting graph has d edges (d = 200 or d = 400). Once the graph is obtained, we generate an adjacency matrix A by setting its nonzero off-diagonal elements to 0.3 and its diagonal elements to 0, and calculate its smallest eigenvalue Λ_min(A). The precision matrix is constructed as

Θ := D ( A + ( |Λ_min(A)| + δ ) I_d ) D,   (53)

where δ > 0 is a small constant ensuring positive definiteness and D ∈ R^{d×d} is a diagonal matrix with D_jj = 1 for j = 1, ..., d/2 and D_jj = 3 for j = d/2 + 1, ..., d. The covariance matrix Σ := Θ^{-1} is then computed to generate the multivariate normal data x_1, ..., x_n ∼ N_d(0, Σ).

* Erdős–Rényi random graph. We add an edge between each pair of nodes with probability 0.02, independently. The resulting graph has approximately 400 edges when d = 200 and 1,596 edges when d = 400. Once the graph is obtained, we construct the adjacency matrix A and generate the precision matrix Θ using (53), but setting D_jj = 1 for j = 1, ..., d/2 and D_jj = 1.5 for j = d/2 + 1, ..., d. We then invert Θ to get the covariance matrix Σ and generate the multivariate normal data x_1, ..., x_n ∼ N_d(0, Σ).

* Hub graph. Each block has off-diagonal entries equal to 0.5 and diagonal entries equal to 1. Such a matrix is guaranteed to be positive definite. The resulting matrix is then randomly permuted by rows and columns. The resulting graph has approximately 900 edges when d = 200 and 3,800 edges when d = 400. The covariance matrix Σ := D^{-1} Θ^{-1} D^{-1} is then computed to generate the multivariate normal data, where D is a diagonal matrix with D_jj = 1 for j = 1, ..., d/2 and D_jj = 1.5 for j = d/2 + 1, ..., d.
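The Erdős–Rényi construction can be sketched in numpy. Since the display for (53) is not reproduced here, the spectral shift below (adding (|Λ_min(A)| + 0.2)·I before rescaling by D) is our plausible reading of it, not a verbatim transcription:

```python
import numpy as np

rng = np.random.default_rng(0)
d, prob = 50, 0.02

# Erdos-Renyi graph: each pair is an edge with probability `prob`;
# nonzero off-diagonal entries of the adjacency matrix are set to 0.3.
upper = np.triu(rng.random((d, d)) < prob, k=1) * 0.3
A = upper + upper.T

# Shift the spectrum so the matrix is positive definite, then rescale
# by the diagonal matrix D (1 on the first half, 1.5 on the second).
shift = abs(np.linalg.eigvalsh(A).min()) + 0.2
D = np.diag(np.r_[np.ones(d // 2), 1.5 * np.ones(d - d // 2)])
Theta = D @ (A + shift * np.eye(d)) @ D

assert np.linalg.eigvalsh(Theta).min() > 0   # a valid precision matrix
Sigma = np.linalg.inv(Theta)
x = rng.multivariate_normal(np.zeros(d), Sigma, size=100)
assert x.shape == (100, d)
```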

Graph Recovery Performance
We first compare the TIGER with the CLIME and glasso in terms of graph recovery performance. Let G = (V, E) be a d-dimensional graph, and denote by |E| the number of edges in G. We use the false positive and false negative rates to evaluate graph recovery. Let Ĝ_λ = (V, Ê_λ) be the graph estimated with regularization parameter λ in any of these procedures; the number of false positives FP(λ), the number of false negatives FN(λ), and the corresponding rates FPR(λ) and FNR(λ) are defined in (54). To illustrate the overall performance of the TIGER, CLIME, and glasso over their full regularization paths, receiver operating characteristic (ROC) curves are drawn using the pairs (FNR(λ), 1 − FPR(λ)). For the TIGER, we found that the graph estimates are the same with or without the second hard-thresholding step, so the presented results do not use it. The ROC curves for these models are presented in Figures 2, 3, and 4. From the ROC curves for the scale-free model in Figure 2, we see that the graph recovery performance of the TIGER is significantly better than that of the CLIME and glasso in the higher dimensional settings (d = 200 or d = 400 with d ≥ n). This result suggests that the TIGER is more adaptive to inhomogeneous noise models. From Figure 2, we also see that the TIGER has graph recovery performance similar to the CLIME on the band model; both significantly outperform the glasso. In particular, in the high dimensional setting with d = 400 and n = 200, the TIGER outperforms both the CLIME and glasso. This result suggests that the TIGER is more reliable for higher dimensional problems on this model.
For the other models, from the ROC curves in Figures 3 and 4, we see that the three methods perform similarly, and for the Erdös-Rényi random graph models the TIGER outperforms the CLIME in the settings with n ≤ d. This means that the TIGER is effective for a wide range of models. In summary, these numerical results suggest that the TIGER is a very competitive graph estimation method in high dimensions.
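As a concrete reference for how the false positive and false negative rates are computed from an estimated graph, here is a minimal sketch; the function name and the convention of counting over the d(d − 1)/2 distinct node pairs are ours, with the formal definitions in (54).

```python
import numpy as np

def graph_error_rates(E_hat, E_true):
    """False positive and false negative rates for edge recovery.

    E_hat, E_true: boolean (d, d) adjacency matrices, symmetric with a
    zero diagonal. Rates are computed over the d(d-1)/2 distinct node
    pairs, mirroring FPR(lambda) and FNR(lambda) in the text.
    """
    iu = np.triu_indices_from(E_true, k=1)   # distinct pairs only
    est, tru = E_hat[iu], E_true[iu]
    fp = np.sum(est & ~tru)                  # edges reported but absent
    fn = np.sum(~est & tru)                  # true edges missed
    fpr = fp / max(np.sum(~tru), 1)
    fnr = fn / max(np.sum(tru), 1)
    return fpr, fnr
```

Sweeping λ over the regularization path and plotting the pairs (FNR(λ), 1 − FPR(λ)) then traces the ROC curve.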

Tuning-Insensitive Regularization Path
We are interested in studying the tuning-insensitive property of the TIGER. For conciseness, we only discuss the TIGER and CLIME in this section and compare their regularization paths. Recall that ζ and λ have a one-to-one correspondence. In Figure 5(a) and (b), we plot the Frobenius-norm error against the tuning parameter ζ for the TIGER and CLIME. We define FNR(ζ) and FPR(ζ) in the same way as in (54), and use them to define the graph recovery accuracy. In Figure 5(c) and (d), we plot this accuracy against ζ for the TIGER and CLIME. The vertical axes of these plots are calibrated so that the results are directly comparable. The plots illustrate the tuning-insensitive property of the TIGER regularization path. For the TIGER, we found it empirically safe to consider only the range ζ ∈ [√(2/π), 1]. In subplots (a) and (c), the regularization paths are flat and do not change dramatically with ζ; in other words, the procedure is insensitive to the tuning parameter. In contrast, for the CLIME, we need to search over a much larger range of ζ to find the optimal value, and the paths are more irregular. Subplots (b) and (d) visualize the regularization paths of the CLIME over ζ ∈ [0.125, 2]; these cover only a fraction of the full CLIME regularization paths, and the paths are irregular (that is, more sensitive to the choice of ζ). Therefore, it is much easier to choose a reasonable tuning parameter for the TIGER than for the CLIME. In most cases, the choice ζ = 1 for the TIGER yields a sparser graph estimate than the true graph. This is because the asymptotic analysis in the previous section involves many union bounds, which may be too conservative in finite sample settings.
As will be illustrated in the next subsection, we found that in most settings, simply setting ζ = √(2/π) yields reasonably good graph and precision matrix estimates. This choice is used as the default in the bigmatrix package on CRAN.

Figure 5: Comparison of the regularization paths of the TIGER and CLIME on the hub graph model, for the six (n, d) settings with n ∈ {200, 400} and d ∈ {100, 200, 400}. Panels: (a) TIGER Frobenius norm error; (b) CLIME Frobenius norm error; (c) TIGER graph recovery accuracy; (d) CLIME graph recovery accuracy. In each panel the horizontal axis is ζ and the vertical axis is the adjusted Frobenius norm error (Frobenius norm divided by the dimensionality d) or the graph recovery accuracy, calibrated so that the results are directly comparable. These plots illustrate the tuning-insensitive property of the TIGER regularization path. For the TIGER, we only need to consider the range ζ ∈ [√(2/π), 1]; in subplots (a) and (c), the regularization paths are flat and do not change dramatically with ζ. In contrast, for the CLIME, we need to search over a much larger range of ζ to find the optimal value, and the paths are more irregular (subplots (b) and (d) show only a part of this range).

Quantitative Evaluation on Parameter Estimation Performance
We then compare the TIGER with the CLIME and glasso on parameter estimation performance. For each model described above, we generate a training sample of size n. The tuning parameters of the glasso and CLIME are chosen automatically in a data-dependent way: for a given sample size n, we generate the same number of independent data points from the same distribution as a validation set; for each tuning parameter, the glasso or CLIME precision matrix estimate Θ̂ is computed on the training data, and the optimal tuning parameter is chosen by minimizing the held-out negative log-likelihood loss on the validation set. For the TIGER, the tuning parameter ζ in (25) is simply set to ζ = √(2/π), so that the procedure is completely tuning-free. For dimensionality d = 100, 200, 400, we consider the spectral norm error ‖Θ̂ − Θ‖_2 and the Frobenius norm error ‖Θ̂ − Θ‖_F for all 6 models described in the previous subsections.
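The validation-based selection and the two error metrics can be sketched as follows. The helper names are ours, and the candidate precision matrices along the path would come from glasso or CLIME (not implemented here); the held-out loss tr(S_val Θ) − log det Θ equals the negative Gaussian log-likelihood up to constants.

```python
import numpy as np

def heldout_nll(Theta, X_val):
    """Held-out negative Gaussian log-likelihood (up to constants):
    tr(S_val @ Theta) - log det Theta, with S_val the validation covariance."""
    S_val = np.cov(X_val, rowvar=False, bias=True)
    sign, logdet = np.linalg.slogdet(Theta)
    return np.trace(S_val @ Theta) - logdet

def select_tuning(estimates, X_val):
    """Pick the tuning value whose precision estimate minimizes the
    held-out loss. `estimates` maps tuning value -> estimated precision."""
    return min(estimates, key=lambda lam: heldout_nll(estimates[lam], X_val))

def norm_errors(Theta_hat, Theta):
    """Spectral and Frobenius norm estimation errors."""
    diff = Theta_hat - Theta
    return np.linalg.norm(diff, 2), np.linalg.norm(diff, "fro")
```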
The results are reported in Tables 1 and 2, where we present the mean and standard deviation (in parentheses) of the spectral and Frobenius norm errors over 50 random trials. We see that in almost all cases the TIGER and CLIME outperform the glasso, and in most cases the TIGER outperforms the CLIME. Our results, though obtained under different experimental settings, are consistent with those of Sun and Zhang (2012) for the scaled-Lasso based matrix inversion method.
One possible reason for the superior empirical performance of the TIGER over the CLIME and glasso is that the data-dependent tuning selection procedure described above does not work well for the CLIME and glasso. To gain more insight, we also report oracle estimation results in Tables 3 and 4, again presenting the mean and standard deviation (in parentheses) of the spectral and Frobenius norm errors over 50 random trials. For all three methods, we draw the full regularization paths and select the best tuning parameter by minimizing the corresponding spectral or Frobenius norm error to the true precision matrix. From these tables, we see that the TIGER and CLIME again outperform the glasso, and the TIGER is slightly better than the CLIME.

Gene Network
This dataset includes 118 gene expression arrays from Arabidopsis thaliana which originally appeared in Wille et al. (2004). Our analysis focuses on the expression of 39 genes involved in two isoprenoid metabolic pathways: 16 genes from the mevalonate (MVA) pathway located in the cytoplasm, 18 genes from the plastidial (MEP) pathway located in the chloroplast, and 5 genes located in the mitochondria. While the two pathways generally operate independently, crosstalk is known to occur (Wille et al., 2004). Our goal is to recover the gene regulatory network, with special interest in crosstalk. We first examine whether the data actually satisfy the Gaussian distribution assumption. In Figure 6 we plot the histogram and normal QQ plot of the expression levels of the gene MECPS. From the histogram, we see that the distribution is left-skewed relative to the Gaussian; from the normal QQ plot, we see that the empirical distribution has a heavier tail than the Gaussian. To suitably apply the TIGER method to this dataset, we first transform the data so that its distribution is closer to Gaussian. Therefore, we Gaussianize the marginal expression values of each gene by converting them to the corresponding normal scores. This is done automatically by the huge.npn function in the R package huge (Zhao et al., 2012).
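A minimal sketch of the normal-score Gaussianization, written in Python rather than R: the exact rescaling used by huge.npn may differ, and the rank/(n + 1) convention below is an assumption that keeps the quantiles finite.

```python
import numpy as np
from statistics import NormalDist

def normal_scores(X):
    """Gaussianize each column of X by converting values to normal scores.

    A sketch of what huge.npn does: within each column, ranks are mapped
    through the standard normal quantile function. The rank/(n + 1)
    convention is our assumption; huge.npn additionally rescales.
    """
    n, d = X.shape
    inv = NormalDist().inv_cdf
    Z = np.empty_like(X, dtype=float)
    for j in range(d):
        # argsort of argsort gives 0-based ranks (ties broken arbitrarily).
        ranks = np.argsort(np.argsort(X[:, j])) + 1
        Z[:, j] = [inv(r / (n + 1)) for r in ranks]
    return Z
```

After the transform, each marginal is symmetric and approximately standard normal, while the within-column ordering of the expression values is preserved.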

We apply the TIGER to the transformed data using the default tuning parameter ζ = √(2/π). The estimated network is shown in Figure 7. We see that the estimated network is very sparse, with only 44 edges. We draw the within-pathway connections using solid lines and the between-pathway connections using dashed lines. Our result is consistent with previous investigations suggesting that the connections from genes AACT1 and HMGR2 to gene MECPS indicate a primary source of the crosstalk between the MEP and MVA pathways; these edges are present in the estimated network. MECPS is clearly a hub gene for this pathway.
For the MEP pathway, the genes DXPS2, DXR, MCT, CMK, HDR, and MECPS are connected as in the true metabolic pathway. Similarly, for the MVA pathway, the genes AACT2, HMGR2, MK, MPDC1, MPDC2, FPPS1, and FPP2 are closely connected. Our analysis suggests 11 cross-pathway links, which is consistent with the previous investigation in Wille et al. (2004). This result suggests that rich inter-pathway crosstalk may exist.

Conclusions
We introduce a tuning-insensitive approach named TIGER for estimating high dimensional Gaussian graphical models. Our method is asymptotically tuning-free and simultaneously achieves the minimax optimal rates of convergence for precision matrix estimation under different matrix norms (matrix L_1, spectral, Frobenius, and elementwise sup-norm). Computationally, our procedure is significantly faster than existing methods due to its tuning-insensitive property. The advantages of our estimators are also illustrated using both simulated and real data examples. The TIGER approach is a very competitive alternative for estimating high dimensional Gaussian graphical models.
There are several possible directions in which to extend the current method. First, it would be interesting to extend the TIGER to the nonparanormal setting for estimating high dimensional Gaussian copula graphical models. Second, it would also be interesting to extend the TIGER to more complex settings where latent variables or missing data may exist.

Acknowledgement
We thank Professor Cun-hui Zhang for constructive comments and for pointing out the connection between the SQRT-Lasso and scaled-Lasso. We also thank Professor Tony Cai, Professor Weidong Liu, and Professor Harrison Zhou for providing the results on minimax lower bounds, and Xingguo Li and Tuo Zhao for their help in implementing the software and running the experiments. Han Liu's research is supported by NSF Grant III-1116730. Lie Wang's research is supported by NSF Grant DMS-1005539.

A Appendix: Proofs
We first prove several important lemmas, followed by the proof of the main theorems. For notational simplicity, we use a generic constant C whose value may change from line to line.

A.1 Preliminaries
For any set S ⊂ {1, 2, · · · , d} with |S| ≤ k, let ∆_c̄^d(S) denote the cone {β ∈ R^d : ‖β_{S^c}‖_1 ≤ c̄ ‖β_S‖_1}, and let ∆_c̄^d(k) denote the union of ∆_c̄^d(S) over all S with |S| ≤ k. Recall the matrix class M(ξ_max, k) defined in the main text. We define the population correlation matrix R := diag(Σ)^{−1/2} Σ diag(Σ)^{−1/2} and τ_j := σ_j Γ_jj^{−1/2}, and we recall the three assumptions stated before the main theorems, under which β_j from (27) can be written in the regression form used below. Throughout the proof, for notational simplicity, we take the tuning parameter to be λ = c √((2a log d)/n), where c > 1 and a > 2. It is easy to see that λ = π √((2 log d)/n) is a special case of this setup.

A.2 Technical Lemmas
Throughout this paper we often use the following tail bounds for central χ² random variables. We denote by χ²_d a chi-square variable with d degrees of freedom. Lemma 5 presents some well-known results on χ²_d; the proofs can be found in the original papers.
Lemma 5 (Johnstone (2000) and Laurent and Massart (1998)). Let X ∼ χ²_d. For any t > 0, we have P(X − d ≥ 2√(dt) + 2t) ≤ exp(−t) and P(d − X ≥ 2√(dt)) ≤ exp(−t). The next lemma bounds the tail of the sample correlation for bivariate Gaussian random variables.
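One standard χ² bound of the kind collected in Lemma 5 is the Laurent-Massart inequality P(χ²_d ≥ d + 2√(dt) + 2t) ≤ e^{−t}. A quick Monte Carlo sanity check (the simulation sizes below are arbitrary choices):

```python
import numpy as np

# Monte Carlo sanity check of the Laurent-Massart chi-square tail bound
#   P(chi2_d >= d + 2*sqrt(d*t) + 2*t) <= exp(-t).
rng = np.random.default_rng(0)
d, t, n_sim = 10, 2.0, 200_000
X = rng.chisquare(d, size=n_sim)
threshold = d + 2 * np.sqrt(d * t) + 2 * t
empirical = np.mean(X >= threshold)       # empirical tail probability
assert empirical <= np.exp(-t)            # sits below the theoretical bound
```

The empirical tail is typically far below e^{−t}, reflecting the slack that such exponential bounds carry in moderate dimensions.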
The following lemma bounds the difference between the sample correlation matrix and the true correlation matrix in the elementwise sup-norm.
Lemma 7. Let R̂ and R be the sample and population correlation matrices. We define the event A_1 := { max_{j,k} |R̂_jk − R_jk| ≤ √((18 log d)/n) }. Then, we have P(A_1) ≥ 1 − 1/d.
Proof. From Lemma 6, we have an exponential tail bound on each entry |R̂_jk − R_jk|. The result follows by choosing t = √((18 log d)/n) and applying a union bound over all entries.
Lemma 8. Let ε_j := (ε_{j1}, . . . , ε_{jn})^T ∈ R^n with ε_j ∼ N_n(0, σ_j² I_n), and define the event A_2 accordingly. By Lemma 5, it is easy to see that the first stated bound holds for any constant 1 ≤ w < 1.5 and all j = 1, 2, · · · , d; setting w = 1.4 yields (101). Similarly, for t ∈ [0, 1/2), an analogous bound holds; setting t = 3.5 log d / n yields (103). The desired result of this lemma follows from a union bound over (101) and (103).
The next lemma bounds the sample standard deviation of each marginal univariate Gaussian random variable.
Lemma 9. Let Σ̂ be the sample covariance matrix. Under assumption (104), we define the event A_3. We have P(A_3) ≥ 1 − 1/d.
Proof. By the definitions of Λ_max(Σ) and Λ_min(Σ̂), we have the stated inequalities. By (85), we know that for any j ∈ {1, . . . , d} and 0 ≤ ε < 1/2, the corresponding tail bound holds. We choose ε = tΣ_jj; then, by the union bound, we obtain the stated probability bound. Under assumption (104), we know that, for large enough n, we must have ε = tΣ_jj < 1/2. The desired result of the lemma follows by setting t = 3.5 log d / n.
The next lemma provides theoretical justification for the choice of the tuning parameter λ defined above.
Lemma 10. Let λ = c√((2a log d)/n) with c > 1 and a > 2, and define the event A_4 accordingly. Then the stated probability bound holds.
Proof. Using the facts that Z_{*j} = X_{*j} Γ_jj^{−1/2} and Z_{*\j} = X_{*\j} Γ_{\j,\j}^{−1/2}, we obtain the stated decomposition, where ε_j ∼ N_n(0, σ_j² I_n). From the properties of the multivariate Gaussian distribution, we know that Z_{*\j} and ε_j are independent. For any ℓ ≠ j, Z_{*ℓ}^T ε_j follows a N(0, nσ_j²) distribution when conditioning on Z_{*\j}. In the following argument, we suppose everything is conditioning on Z_{*\j}. Since ‖ε_j‖_2² / σ_j² ∼ χ²_n, by Lemma 5 we know that, for any 0 ≤ r_n < 1/2, the stated tail bound holds.
Proof. For any S ⊂ {1, 2, · · · , d} with |S| ≤ k, we have, for any β ∈ ∆_c̄^d(S), the stated inequalities, where the last inequality uses the fact that ‖β‖_1 ≤ (1 + c̄)√k ‖β_S‖_2. This result is obtained from (142).
Recall that X ∈ R^{n×d} is a matrix whose rows are independent N(0, Σ) Gaussian random vectors, where Σ is the d × d covariance matrix, and let Σ̂ be the sample covariance matrix of X. From Theorem 1 of Raskutti et al. (2010), we know that there exist two positive constants c_1 and c_2 such that P( β^T Σ̂ β ≥ (1/4) β^T Σ β − 9 max_{1≤j≤d} Σ_jj (log d / n) ‖β‖_1², for all β ∈ R^d ) ≥ 1 − c_1 exp(−c_2 n).
By (142), for any β ∈ ∆_c̄^d(k), we have the stated bound. Therefore, since we assume k log d = o(n), for n large enough the correction term is dominated and the lower bound, with leading constant involving 1/(4(1 + c̄)), remains bounded away from zero. We finish the proof of this lemma.
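A minimal simulation sketch of the lower bound above, assuming Σ = I_d and reading the subtracted term as 9 max_j Σ_jj (log d/n)‖β‖_1² (our rendering of the display): the script only spot-checks random sparse vectors, whereas the theorem holds uniformly over β with high probability.

```python
import numpy as np

# Empirical check of the Raskutti-et-al.-style lower bound used in the proof:
#   beta' Sigma_hat beta >= (1/4) beta' Sigma beta
#                           - 9 max_j Sigma_jj (log d / n) * ||beta||_1^2,
# specialized to Sigma = I_d and checked on random k-sparse vectors.
rng = np.random.default_rng(0)
n, d, k = 400, 100, 5
X = rng.standard_normal((n, d))          # rows are N(0, I_d)
Sigma_hat = X.T @ X / n
for _ in range(100):
    beta = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    beta[support] = rng.standard_normal(k)
    lhs = beta @ Sigma_hat @ beta
    rhs = 0.25 * beta @ beta - 9 * (np.log(d) / n) * np.sum(np.abs(beta)) ** 2
    assert lhs >= rhs                    # the lower bound holds on this draw
```

In the regime k log d = o(n), the subtracted L1 term is of smaller order than the quadratic term over the sparse cone, which is exactly why the restricted eigenvalue stays bounded away from zero.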
We can see that, when k log d = o(n) and n is large enough, the restricted eigenvalue is bounded away from zero by a positive constant with high probability. This implies the restricted eigenvalue condition required by the SQRT-Lasso method, as shown in the next lemma.

Proof.
We only need to verify that the L_1-restricted eigenvalue condition of the SQRT-Lasso is satisfied. More specifically, we need to verify the conditions in Theorem 1 of Belloni et al. (2011).
we have the stated chain of inequalities. This implies that max_{1≤j≤d} |Θ̂_jj − Θ_jj| ≤ C · ‖Θ‖_2 √((log d)/n).
The last inequality follows from the fact that max_{1≤j≤d} Θ_jj ≤ ‖Θ‖_2.

C.2 Analyzing the Off-diagonal Elements in L 1 -norm Error
We first bound the L_1-norm of each column of the off-diagonal elements of Θ̂ − Θ.
Combining all the above analysis, we have ‖Θ̂_{\j,j} − Θ_{\j,j}‖_1 ≤ C ‖Θ‖_2 k √((log d)/n) + C ‖Θ‖_1 √((log d)/n).
We thus prove the desired result of the lemma.

C.3 Analyzing the Off-diagonal Elements in Sup-norm Error
To conduct the sup-norm analysis of each column of the off-diagonal elements, we first establish the bound (Γ_jj^{1/2}/n) ‖Z_{*\j}^T (Z_{*\j} β̂_j − Z_{*\j} β_j)‖_∞ ≤ C σ_j √(2ak) (log d)/n + 2σ_j √((2.8a log d)/n). The result follows from the fact that √k (log d)/n ≤ √((log d)/n) for large enough n, since k log d = o(n). We finish the proof of this lemma.
Lemma 17. On the event E, the stated bound holds for large enough n.

D.1 Proof of Theorem 2
Proof. By piecing together the results of Lemma 14 and Lemma 15, we obtain the stated bound, where the last inequality follows from the fact noted above. We complete the proof of this theorem.

D.2 Proof of Theorem 3
Proof. By piecing together the results of Lemma 14 and Lemma 17, we obtain the stated bound. We complete the proof of this theorem.