Graphical model selection with latent variables

Abstract: Gaussian graphical models are commonly used to characterize the conditional dependence among variables. However, ignoring the effect of latent variables may blur the structure of a graph and corrupt statistical inference. In this paper, we propose a method for learning latent variable graphical models via ℓ1 and trace penalized D-trace loss (LVD), which achieves parameter estimation and model selection consistency under certain identifiability conditions. We also present an efficient ADMM algorithm to obtain the penalized estimate of the sparse precision matrix. Using simulation studies, we validate the theoretical properties of our estimator and show its superior performance over other methods. The usefulness of the proposed method is also demonstrated through an application to a yeast genetical genomics dataset.


Introduction
Model selection for high-dimensional data has attracted much attention in recent years due to the need to analyze and interpret various types of high-dimensional data resulting from technological advances, including the analysis of gene expression data, spectroscopic imaging, fMRI data, and weather forecasting. A model selection problem of great importance is the estimation of a high-dimensional covariance matrix and its inverse, also known as the precision matrix. A number of papers have studied this problem in the context of Gaussian graphical models (Cox and Wermuth, 1996), which are represented by an undirected graph G = (V, E), where V contains p nodes corresponding to a collection of jointly Gaussian random variables and the edges E = (e_ij), 1 ≤ i < j ≤ p, describe the conditional independence relationships among the variables. Every pair of variables not included in E is conditionally independent given all the other variables and corresponds to a zero entry in the precision matrix (Lauritzen, 1996).
To deal with the singularity problem that high-dimensional data present when inferring the precision matrix, regularization methods have been proposed to impose a sparsity constraint on the precision matrix, where an ℓ1 penalty is often applied to induce sparsity (Yuan and Lin, 2007; Friedman et al., 2008; Rothman et al., 2008; Cai et al., 2011; Zhang and Zou, 2014). Explicit rates of convergence for such estimators have been obtained (Rothman et al., 2008; Lam and Fan, 2009; Cai et al., 2011; Ravikumar et al., 2011). An alternative approach to estimating a sparse Gaussian graphical model, first proposed by Meinshausen and Bühlmann (2006), is to perform an ℓ1-regularized regression on every variable. It can asymptotically recover the true graph, although it does not provide estimates of the precision matrix. Other neighbourhood-based methods have since been introduced to estimate the precision matrix; see Yuan (2010), Sun and Zhang (2013) and Ren et al. (2015). For a comprehensive review of theoretical properties and optimality results for the estimation of structured covariance and precision matrices, see the recent review paper by Cai et al. (2016) and the references therein.
In many applications, however, the graph structure among variables can be decomposed into intrinsic local connections and external global effects. One such example is related to the analysis of genomic data, where Gaussian graphical models have been applied to infer the relationships between genes at the transcriptional level (Segal et al., 2005; Li and Gui, 2006). Although a direct application provides some insights into the gene regulatory network, it ignores the effects of covariates such as sex, age, race and genetic variants on gene expression, as well as unmeasured confounders that may blur the statistical inference (Cheung and Spielman, 2002). Some methods have been proposed to model conditional Gaussian graphical models that adjust for the effect of covariates (Li et al., 2012; Cai et al., 2013; Chen et al., 2016). Nevertheless, these approaches are not applicable when we either do not observe all relevant variables or do not incorporate them into the model. To further illustrate the impact of latent variables, Figure 1 shows a simple example of a graph with 7 nodes. Node G is a latent variable and Figure 1(a) is the full graph with all variables. Recall that an edge between two nodes indicates conditional dependence between these two variables given all other variables in the graph. Without adjusting for the effect of the latent variable G, the reduced graph shown in Figure 1(b) is much denser, and many spurious conditional dependences are observed.
The problem of Gaussian graphical models with latent variables was first studied by Chandrasekaran et al. (2012). Under the joint Gaussian assumption, the authors proposed a regularized maximum likelihood approach and described conditions under which such graphical models are identifiable and the estimators are consistent. A number of alternative estimators were then proposed by other researchers in the accompanying discussion papers. Yuan (2012), Lauritzen and Meinshausen (2012) and Agakov et al. (2012) described an EM-based estimator, which can be treated as an alternative algorithm, as can the ADMM algorithm proposed by Ma et al. (2013). Error bounds for the estimated precision matrix in the Frobenius norm were analyzed by Meng et al. (2014) under more stringent conditions. Giraud and Tsybakov (2012) proposed two approaches based on the Dantzig selector (Candès and Tao, 2007) and the neighborhood selection approach (Meinshausen and Bühlmann, 2006). Ren and Zhou (2012) further analyzed a CLIME-like (Cai et al., 2011) estimator, and the results were improved using a regression method (Ren et al., 2015). The maximum likelihood approach with non-convex penalties has also been studied by Xu et al. (2017). Finally, Städler and Bühlmann (2012) considered Gaussian graphical models with missing values, which can also be applied to this problem if we treat hidden variables as missing data.
In this paper, we consider a factor model where the unobserved and observed variables follow a linear relationship. Under this regime, the covariance matrix of the observed variables can be decomposed into the sum of a low-rank matrix and the inverse of a sparse matrix, which correspond to the effect of the latent variables and the true conditional dependence between the manifest variables, respectively. This problem is analogous to the "sparse plus low-rank" matrix decomposition problem (Chandrasekaran et al., 2011; Candès et al., 2011; Hsu et al., 2011; Agarwal et al., 2012) after reformulation. Fan et al. (2013) considered a similar factor model, but they focused on the estimation of a high-dimensional covariance matrix and assumed a sparse error covariance matrix. In addition, Kalaitzis and Lawrence (2012) developed residual component analysis, which also includes the low-rank plus inverse sparse matrix decomposition as a special case. Nevertheless, they adopted a Bayesian approach and had to solve an intractable problem in an EM fashion. In this article, we propose an ℓ1 and trace penalized approach to estimating the precision matrix so as to recover the true graph after adjusting for the effect of hidden variables. Following Chandrasekaran et al. (2012), we present conditions under which such an "inverse sparse plus low-rank" decomposition is identifiable. Theoretical analysis reveals that our LVD estimator enjoys estimation and model selection consistency under these identifiability conditions. We also develop an efficient alternating direction method of multipliers (ADMM) for solving our optimization problem. Simulation studies and a real data analysis validate our theoretical results and show the superiority of our method over competing approaches.
The rest of the paper is organized as follows. In Section 2, after basic notation is introduced, we present the formulation of our method. Then in Section 3 we introduce the associated computational algorithm. Theoretical analysis, including the identifiability results and rates of convergence, is established in Section 4. The numerical performance of our method is presented through simulation studies and a real data analysis in Sections 5 and 6, respectively. Section 7 concludes with a summary and discussion. The proofs of the main results are given in the Appendix.

Notation
For a vector a = (a_1, · · · , a_p)^T, we write ||a||_1 = Σ_i |a_i| and ||a||_2 = (Σ_i a_i^2)^{1/2}. For a symmetric matrix A = (a_ij) ∈ R^{p×p}, we write tr(A) for the trace of A, and φ_min(A) and φ_max(A) for the minimum and maximum eigenvalues of A. We use ||A||_1 = Σ_{i,j} |a_ij| for the elementwise ℓ1 norm, ||A||_∞ = max_{i,j} |a_ij| for the elementwise ℓ∞ norm, ||A||_2 for the spectral norm, ||A||_F for the Frobenius norm, and the nuclear norm ||A||_* = tr((A^T A)^{1/2}). We use the notation A ≻ 0 or A ⪰ 0 to denote that A is positive definite or positive semidefinite, respectively.
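As a concrete check of the matrix quantities above, the following NumPy sketch (our own illustration, not part of the paper) verifies that the nuclear norm equals the sum of singular values and reduces to the trace on positive semidefinite matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2  # symmetrize

eigvals = np.linalg.eigvalsh(A)
phi_min, phi_max = eigvals.min(), eigvals.max()

# Nuclear norm ||A||_* = tr((A^T A)^{1/2}) = sum of singular values.
nuclear = np.linalg.svd(A, compute_uv=False).sum()

# For a positive semidefinite matrix, the nuclear norm reduces to the trace.
P = A @ A.T  # PSD by construction
assert np.isclose(np.linalg.svd(P, compute_uv=False).sum(), np.trace(P))
```

For a symmetric matrix the singular values are the absolute eigenvalues, so the nuclear norm also equals Σ_i |φ_i(A)|.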

The LVD procedure
Suppose Y = (Y_1, · · · , Y_p)^T ∈ R^p is a manifest random vector (gene expression levels) and X ∈ R^r is a hidden random vector (confounding factors). We consider the following model:

Y = BX + ε, (2.1)

where B is a p × r unknown coefficient matrix representing the effect of the latent variables on Y, and ε is a p × 1 Gaussian random error vector, independent of X, with mean zero, covariance matrix Σ* and precision matrix S* = (Σ*)^{-1}. Hence S* is the parameter of main interest that characterizes the gene-gene interaction network conditioning on the latent variables. Without loss of generality, we assume that Y and X are centered at zero and E(XX^T) = I, where I is the identity matrix. Taking variances on both sides of (2.1) and writing Σ_Y = Var(Y), we obtain

Σ_Y = BB^T + (S*)^{-1}. (2.2)

If we only observe Y, we only have access to Σ_Y. The two terms that compose Σ_Y can be interpreted as follows. The matrix BB^T serves as a summary of the effect of marginalization over the latent variables X, which is low-rank if the number of latent variables is small compared to that of the observed variables. The precision matrix S* has an interpretation of conditional independence between the observed variables given the latent variables: Y_i and Y_j are conditionally independent given X and the remaining observed variables if and only if S*_ij = 0. If the interaction network can be depicted by a sparse graphical model, then S* is sparse. Thus, the covariance matrix Σ_Y can be decomposed into the sum of a low-rank matrix and the inverse of a sparse matrix. We are interested in detecting the nonzero entries of S* in order to construct a conditional independence graph for Y after the effect of the hidden variables X on Y is removed. Remark 2.1. We note that X and ε can be correlated and the "low-rank plus sparse inverse" relationship still holds, since in this case we have

Σ_Y = BB^T + BCov(X, ε) + Cov(ε, X)B^T + (S*)^{-1}.

Since rank(A + B) ≤ rank(A) + rank(B), the sum BB^T + BCov(X, ε) + Cov(ε, X)B^T is still a low-rank matrix as long as r is small.
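To make the decomposition concrete, here is a small NumPy sketch (our own illustration; the tridiagonal choice of S* is arbitrary) checking that Σ_Y = BB^T + (S*)^{-1} and that its inverse splits into S* minus a low-rank term via the Sherman-Morrison-Woodbury formula.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 6, 2
B = rng.standard_normal((p, r))

# Sparse-precision error: tridiagonal S*, positive definite by construction.
S_star = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma_star = np.linalg.inv(S_star)

# Population covariance of Y = BX + eps with Var(X) = I:
Sigma_Y = B @ B.T + Sigma_star

# Woodbury: Sigma_Y^{-1} = S* - L*, where L* has rank at most r.
L_star = S_star @ B @ np.linalg.inv(np.eye(r) + B.T @ S_star @ B) @ B.T @ S_star
assert np.allclose(np.linalg.inv(Sigma_Y), S_star - L_star)
assert np.linalg.matrix_rank(L_star) <= r
```

The low-rank term L* = S*B(I + B^T S* B)^{-1}B^T S* is exactly the correction produced by the Woodbury identity applied to (2.2).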
Suppose that we have n independent and identically distributed observations Y^(1), . . . , Y^(n), and let Σ_n = n^{-1} Σ_{i=1}^n Y^(i)(Y^(i))^T be the sample covariance matrix. Unlike the joint Gaussian distribution on (X, Y) assumed by Chandrasekaran et al. (2012), we do not impose any specific distribution on X, so a maximum likelihood approach is inapplicable. Since (S*)^{-1} in (2.2) is not a convex function of S*, we take inverses on both sides of (2.2) and use the Sherman-Morrison-Woodbury formula to obtain

Σ_Y^{-1} = S* − L*, where L* = S*B(I + B^T S* B)^{-1} B^T S*. (2.3)

Following Zhang and Zou (2014), we consider the following quadratic loss function based on (2.3):

L(S, L) = (1/2) tr[(S − L) Σ_n (S − L)] − tr(S − L). (2.4)

This is called the D-trace loss in Zhang and Zou (2014), since it is expressed as the difference of two trace operators. To see the intuition behind this loss, note that if we multiply Σ_Y^{1/2} by S − L, we should obtain Σ_Y^{−1/2} when S − L equals the true inverse Σ_Y^{−1}. Considering the Frobenius norm of the difference of these two terms leads to the D-trace loss in (2.4), once we discard terms independent of S or L and replace Σ_Y with its sample version Σ_n. Remark 2.2. As noted in Zhang and Zou (2014), the D-trace loss bears some resemblance to the one proposed in Liu and Luo (2015). Indeed, they consider the same loss for S − L as in (2.4), except in a column-by-column fashion. Specifically, let s_i and l_i be the ith columns of S and L respectively; then the loss function for s_i − l_i in Liu and Luo (2015) takes the form

(1/2)(s_i − l_i)^T Σ_n (s_i − l_i) − e_i^T(s_i − l_i),

where e_i is the ith standard orthonormal basis vector of R^p. Nevertheless, as pointed out in Zhang and Zou (2014), the positive definiteness of a matrix, which should be taken into account in graphical models, can only be properly dealt with using a loss that views S − L as a whole rather than column-wise. This is even more crucial in our setting, as the information in the low-rank matrix L would be lost if we considered l_i, i = 1, . . . , p, separately.
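The D-trace loss above can be checked numerically: its gradient (ΣR + RΣ)/2 − I vanishes at R = Σ^{-1}, and the loss increases under symmetric perturbations. A minimal NumPy sketch (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
M = rng.standard_normal((p, p))
Sigma = M @ M.T + p * np.eye(p)  # well-conditioned covariance
R_true = np.linalg.inv(Sigma)

def dtrace_loss(R, Sigma):
    """D-trace loss: (1/2) tr(R Sigma R) - tr(R)."""
    return 0.5 * np.trace(R @ Sigma @ R) - np.trace(R)

# The gradient (Sigma R + R Sigma)/2 - I vanishes at R = Sigma^{-1}.
grad = 0.5 * (Sigma @ R_true + R_true @ Sigma) - np.eye(p)
assert np.allclose(grad, 0)

# The loss at the truth is below that at a symmetric perturbation.
E = rng.standard_normal((p, p)) * 0.1
E = (E + E.T) / 2
assert dtrace_loss(R_true, Sigma) < dtrace_loss(R_true + E, Sigma)
```

Since the quadratic term tr(RΣR) is strictly convex for positive definite Σ, the minimizer Σ^{-1} is unique.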
To obtain a low-rank L and a sparse S, we propose to estimate S and L by solving the following ℓ1 and trace regularized D-trace loss problem:

min_{R,S,L} (1/2) tr(R Σ_n R) − tr(R) + λ_n (γ ||S||_1 + tr(L)) subject to R = S − L, R ≻ 0, L ⪰ 0, (2.5)

where λ_n is a tuning parameter that controls the strength of regularization and γ is a tuning parameter that provides a trade-off between the ℓ1 and trace penalties. These penalties are used to encourage sparsity and low-rankness, as the ℓ1 and nuclear norms (the latter reducing to the trace when L is positive semidefinite) are convex relaxations of the sparsity level and the rank, respectively; see for example Donoho (2006) and Candès and Tao (2010). Finally, the positive (semi)definite constraints are imposed on S − L and L according to model (2.2) and decomposition (2.3). Chandrasekaran et al. (2012) assumed that (Y, X) is jointly Gaussian with covariance matrix Σ_(Y,X) = [Σ_Y, Σ_YX; Σ_XY, Σ_X] and precision matrix Ω_(Y,X) = Σ_(Y,X)^{-1}. Then by the Schur complement (Horn and Johnson, 2012) with respect to the block Ω_X, we have

Σ_Y^{-1} = Ω_Y − Ω_YX Ω_X^{-1} Ω_XY. (2.6)

Penalized likelihood approach
They proposed a regularized maximum likelihood approach to estimate the sparse structure Ω_Y and the low-rank term Ω_YX Ω_X^{-1} Ω_XY. We see that criterion (2.6) is the same as (2.3); it is also used by other authors, see Ren and Zhou (2012), Giraud and Tsybakov (2012) and Xu et al. (2017), among others. Since the joint Gaussian distribution implies a linear conditional distribution of Y given X, our assumption is weaker yet arrives at the same matrix decomposition criterion. Nevertheless, we emphasize that allowing X to take distributions other than Gaussian, especially discrete distributions, is of particular interest in practice. The global dependence between variables is often due to factors taking discrete values, such as batch effects, hence the joint Gaussian assumption is sometimes unrealistic.
As to the different forms of the loss functions, we make two remarks here. First, one might think that the log-likelihood is the optimal loss for estimating the precision matrix under the joint Gaussian assumption, owing to its likelihood interpretation. Under our regime, however, even though ε in (2.1) follows a Gaussian distribution to ensure the conditional independence interpretation, we do not make any distributional assumption on X. Hence our loss function, which is analogous to the least squares loss, is more convenient under such a model. Second, similar to the discussion presented in Zhang and Zou (2014), both loss functions attain their optimum at the true value if we replace Σ_n with its population counterpart Σ_Y. The preference between these two loss functions may vary from case to case, and we will explore the performance of these two approaches further in the numerical studies.

Factor analysis
The factor analysis (FA) model can be expressed in the same form as (2.1), but with different assumptions on the random error ε. We state it here explicitly as

Y = BX + ε,

where B is a p × r unknown factor loading matrix and X is the r × 1 vector of hidden factors satisfying EX = 0 and EXX^T = I. Last, ε is a p × 1 Gaussian random error, independent of X, with zero mean and covariance matrix Σ satisfying Σ = diag{σ_1, . . . , σ_p}.
The main difference between our model and the classical factor analysis model is the goal of inference. Our interest lies in the detection of conditional dependence between observed variables after adjusting for the effect of latent factors, and we do not pay much attention to the global dependence induced by the hidden factors. On the other hand, the underlying assumption in FA is that the variability among manifest variables can be captured by a small number of unobserved factors, and the parameters of interest include B, X and {σ_i, i = 1, . . . , p}. In general, the decomposition of the correlations among observed variables is the same for these two models, where the hidden variables describe the global dependence while the random error depicts a more subtle local dependence. With different goals in mind, there are substantial differences both in the assumptions made on the linear model and in the strategies used to make statistical inference. Fan et al. (2013) considered the same factor model as in (2.1), but assumed a sparse error covariance Σ* instead of a sparse error precision matrix. Based on an assumption of fast diverging eigenvalues, they introduced the principal orthogonal complement thresholding estimator for estimating the covariance and derived its convergence rate. Although their "low-rank plus sparse" decomposition bears some resemblance to our model, the interpretations of the two models differ significantly. Further, their main interest lies in estimating the high-dimensional covariance matrix of the observed variables, i.e., Σ_Y, which is also different from our method.

Numerical algorithm
In this section we develop an efficient algorithm for solving the optimization problem (2.5). Since {R ≻ 0} is not a closed convex cone, we instead consider the following problem:

min_{S,L} (1/2) tr[(S − L) Σ_n (S − L)] − tr(S − L) + λ_n (γ ||S||_1 + tr(L)) subject to L ⪰ 0. (3.1)

Note that we drop the positive-definite constraint on S − L. Although S − L is an estimate of Σ_Y^{-1}, which should be positive definite according to its interpretation, solving (3.1) is more convenient than solving the version that imposes positive definiteness on S − L. Furthermore, we find in our numerical experiments that in all cases the solution to (3.1) does satisfy the constraint.
Remark 3.1. For the sake of completeness, we also develop methods for solving the variant of (3.1) with the explicit positive-definiteness constraint S − L ⪰ εI, where ε is a pre-defined threshold. The algorithm for solving this problem is deferred to Appendix C.
We next introduce the ADMM for solving (3.1). For notational simplicity, let

f(Z) = (1/2) tr(R Σ_n R) − tr(R) + λ_n γ ||S||_1 + λ_n tr(L) + I(L ⪰ 0), Z = (R, S, L),

where the indicator function I(L ⪰ 0) is defined to be 0 if L ⪰ 0 and +∞ otherwise. We now rewrite problem (3.1) as one with two blocks: let Z = (R, S, L) and Z̃ = (R̃, S̃, L̃); we aim to solve

min_{Z, Z̃} f(Z) + φ(Z̃) subject to Z̃ = Z, (3.3)

where φ(Z̃) = I(R̃ − S̃ + L̃ = 0). Its augmented Lagrangian form is then given by

L_ρ(Z, Z̃, Λ) = f(Z) + φ(Z̃) + ⟨Λ, Z̃ − Z⟩ + (ρ/2) ||Z̃ − Z||_F^2,

where Λ = (Λ_R, Λ_S, Λ_L) collects the multipliers of the linear constraint Z̃ = Z, ρ > 0 is the penalty parameter for the violation of the constraint, and ⟨·, ·⟩ denotes the standard trace inner product. The ADMM algorithm alternates between minimizing L_ρ over Z (step (3.4)), minimizing L_ρ over Z̃ (step (3.5)), and updating the multiplier via Λ^{k+1} = Λ^k + ρ(Z̃^{k+1} − Z^{k+1}) (step (3.6)). Step (3.6) is trivial, so we elaborate on strategies for solving subproblems (3.4) and (3.5). As for (3.4), the three blocks R, S and L can be solved for separately. To update S^{k+1}, we have

S^{k+1} = Shrink(S̃^k + Λ_S^k/ρ, λ_n γ/ρ), (3.7)

where the shrink operator is defined elementwise as

Shrink(A, τ)_ij = sign(a_ij) max{|a_ij| − τ, 0}.

To update L^{k+1}, we have

L^{k+1} = Proj(L̃^k + Λ_L^k/ρ − (λ_n/ρ)I), (3.8)

where the projection operator is defined in the following way: for a real symmetric matrix L, let L = U_L diag(σ_L) U_L^T be its eigenvalue decomposition and define

Proj(L) = U_L diag(max{σ_L, 0}) U_L^T.

Last, to update R^{k+1}, we write C^k = I + Λ_R^k + ρ R̃^k. The first-order optimality condition is then given by

(Σ_n R + R Σ_n)/2 + ρ R = C^k.

Following Zhang and Zou (2014), the above problem has an explicit-form solution. Let

Σ_n = U D U^T, D = diag(σ_1, . . . , σ_p), (3.9)

be the eigenvalue decomposition of Σ_n; then

R^{k+1} = U [ (U^T C^k U) ∘ Π_Σ ] U^T, (3.10)

where ∘ denotes the Hadamard product of matrices and Π_Σ is defined as (Π_Σ)_ij = 2/(σ_i + σ_j + 2ρ). The proofs of (3.7) and (3.8) are straightforward, and we provide a proof of (3.10) in Appendix B for completeness. We next derive the solution to subproblem (3.5). Note that solving (3.5) is equivalent to projecting Z^{k+1} − Λ^k/ρ onto the affine subspace {Z̃ : R̃ − S̃ + L̃ = 0}. Writing the first-order optimality condition with Γ the Lagrange multiplier associated with this constraint, we have

R̃ = R − Λ_R/ρ − Γ/ρ, S̃ = S − Λ_S/ρ + Γ/ρ, L̃ = L − Λ_L/ρ − Γ/ρ,

with Z evaluated at iteration k + 1 and Λ at iteration k. Combining the above display with the equality constraint R̃ − S̃ + L̃ = 0 yields

Γ/ρ = (1/3) [ (R − Λ_R/ρ) − (S − Λ_S/ρ) + (L − Λ_L/ρ) ],

which completes the update of Z̃. We summarize the proposed algorithm as follows.
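The shrink and projection operators used in the S- and L-updates can be sketched in NumPy as follows (a minimal illustration with our own function names, not the paper's code):

```python
import numpy as np

def shrink(A, tau):
    """Elementwise soft-thresholding used in the S-update."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def proj_psd(L):
    """Projection onto the positive semidefinite cone used in the L-update."""
    vals, vecs = np.linalg.eigh((L + L.T) / 2)
    return vecs @ np.diag(np.maximum(vals, 0.0)) @ vecs.T

# Quick sanity checks on a random symmetric matrix.
rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2
assert np.all(np.abs(shrink(A, 0.5)) <= np.maximum(np.abs(A) - 0.5, 0))
assert np.linalg.eigvalsh(proj_psd(A)).min() >= -1e-10
```

Both operators are separable proximal maps, which is what makes the Z-subproblem cheap relative to the eigendecomposition in the L-update.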
Our algorithm stops when both the primal error (Z^{k+1} − Z̃^{k+1}) and the dual error (Z̃^{k+1} − Z̃^k) are small enough, with a tolerance level of 10^{-4}. In view of (3.3), the proposed ADMM is a special case of the consensus problem (Boyd et al., 2011, Chap. 7) with two blocks. The D-trace loss, being the composition of a quadratic form with the affine mapping S − L, is convex, as are the two penalty terms and the positive semidefinite constraint; this guarantees the global convergence of our algorithm. In terms of computational complexity, since it suffices to compute the SVD of Σ_n in (3.9) only once, the most time-consuming step is the SVD required for solving (3.8) in each iteration, whose complexity is O(p^3). This decomposition can be further accelerated by computing only the leading singular values and vectors when λ_n is not too small. Besides, ADMM usually converges to moderate accuracy in only tens of iterations (Boyd et al., 2011, Chap. 3.2.2), which is also observed in our numerical experiments and proves to be sufficient. Using a laptop with an Intel Core i5-5200 2.20GHz CPU and 8 GB of RAM, our code written in MATLAB 9.1.0 converges in seconds for a few hundred nodes and in tens of seconds when p = 1000.

Choices of tuning parameters
We have two tuning parameters: λ_n, which controls the strength of regularization, and γ, which provides a trade-off between the regularizations on S and L. We will show later in Theorem 4.1 that the consistency results hold for a range of values of γ, which means our LVD procedure is robust with respect to γ. This phenomenon is also observed by Chandrasekaran et al. (2012) and in our numerical studies.
As for λ_n, we tune it via K-fold cross validation over a fine grid of λ values. Specifically, we denote the cross validation error for λ by

CV(λ) = Σ_{k=1}^K { (1/2) tr[(Ŝ^(−k) − L̂^(−k)) Σ_n^(k) (Ŝ^(−k) − L̂^(−k))] − tr(Ŝ^(−k) − L̂^(−k)) },

where Σ_n^(k) is the sample covariance matrix of the kth part of the sample, and Ŝ^(−k) and L̂^(−k) are the estimates under λ with the kth part of the sample removed. Another useful practice is to apply a warm-start method on a decreasing sequence of values of λ. Specifically, if we want to compute a sequence of solutions of (3.1) for λ_1 > λ_2 > · · · > λ_L, the initial values of the solution at λ_i can be chosen as the solution at λ_{i−1}. Note that when λ_1 is large, Ŝ will be very sparse, so the algorithm is not very sensitive to the initial values for λ_1. Subsequently, for the other values λ_i, the initial values will be close to the resulting estimates, hence the number of iterations is greatly reduced. The proposed algorithm usually stops after tens of iterations, and the performance of this warm-start strategy proves to be satisfactory in practice.
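The warm-start strategy can be sketched as follows, assuming a hypothetical solver `lvd_admm` that implements the ADMM of Section 3 and accepts initial values; the function name and signature are ours, not the paper's.

```python
import numpy as np

def warm_start_path(Sigma_n, lambdas, gamma, lvd_admm):
    """Solve (3.1) along a decreasing lambda grid, warm-starting each fit.

    `lvd_admm(Sigma_n, lam, gamma, S_init=..., L_init=...)` is an assumed
    solver returning (S_hat, L_hat); any solver with this shape works.
    """
    p = Sigma_n.shape[0]
    S0, L0 = np.eye(p), np.zeros((p, p))  # cheap start for the largest lambda
    path = []
    for lam in sorted(lambdas, reverse=True):  # lambda_1 > lambda_2 > ...
        S0, L0 = lvd_admm(Sigma_n, lam, gamma, S_init=S0, L_init=L0)
        path.append((lam, S0.copy(), L0.copy()))
    return path
```

Each fit starts from the previous solution, so the number of ADMM iterations per grid point stays small, as described above.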

Identifiability
Before we investigate the theoretical properties of the proposed LVD estimator, it is important to first cope with the identifiability issue. Recall that the whole procedure stems from the decomposition (2.3), that is, the inverse of the population covariance can be decomposed into the sum of a sparse matrix and a low-rank matrix. Obviously this decomposition problem is ill-posed if we only have access to the sample covariance matrix without any further assumptions. To deal with this problem, we begin by introducing some notions that play an important role in the identifiability issue.
Indeed there are cases where such a decomposition is unidentifiable. For example, if the sparse matrix S* itself has low rank (after subtracting its diagonal entries), then we can subtract these entries from S* and add them to L*. Figure 2 illustrates an example of the "sparse plus low-rank" decomposition that seems to be identifiable, as the entries of the low-rank matrix L* (with rank(L*) = 1) spread out over all entries of the matrix. Such identifiability problems have been studied in Candès et al. (2011), Chandrasekaran et al. (2011, 2012) and Hsu et al. (2011). We follow the framework developed by Chandrasekaran et al. (2012) to deal with this problem. Since the set of sparse matrices and the set of low-rank matrices can be viewed as algebraic varieties (sets of solutions to systems of polynomial equations), we denote the tangent space at the sparse matrix S* by

Ω(S*) = {M ∈ R^{p×p} : support(M) ⊆ support(S*)}.

Similarly, the tangent space at the low-rank matrix L* is defined as follows: suppose L* is a rank-r square matrix with singular value decomposition L* = U_L diag(σ_L) V_L^T, where U_L, V_L ∈ R^{p×r} and σ_L ∈ R^r; then the tangent space at L* is given by

T(L*) = {U_L Y_1^T + Y_2 V_L^T : Y_1, Y_2 ∈ R^{p×r}}.

Denote Ω* = Ω(S*) and T* = T(L*). Since both Ω* and T* are subspaces of R^{p×p}, a sufficient and necessary condition for S* and L* to be identifiable given knowledge of Σ_Y, Ω* and T* is that these subspaces intersect transversally:

Ω* ∩ T* = {0}.

To quantify transversality between these two tangent spaces, Chandrasekaran et al. (2012) introduced the following quantities with respect to the tangent spaces Ω* and T*:

ξ(T*) = max_{M ∈ T*, ||M||_2 ≤ 1} ||M||_∞, μ(Ω*) = max_{M ∈ Ω*, ||M||_∞ ≤ 1} ||M||_2.

By these definitions, a small ξ(T*) means that no element of the tangent space T* can have its support concentrated in a few locations; a small μ(Ω*) implies that the spectrum of no element of Ω* is too concentrated.
Hence these two quantities serve as tools for measuring the "incoherence" of the row/column spaces of the low-rank matrix and the spectrum of the sparse matrix, respectively. As discussed in Chandrasekaran et al. (2011), we always have ξ(T*) ≤ 1. Further, if rank(L*) = r and L* has almost maximally incoherent row/column spaces, ξ(T*) can be as small as of order r/p. Similarly, μ(Ω*) is upper-bounded by the maximal degree of S*: if S* has at most deg(S*) nonzero entries per row/column, then we must have μ(Ω*) ≤ deg(S*).
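The tangent spaces Ω* and T* admit simple orthogonal projections, which we sketch below for intuition (assumed forms following the Chandrasekaran et al. framework; for a symmetric L* we use a single column-space basis U):

```python
import numpy as np

def proj_sparse_tangent(M, support):
    """P_{Omega*}: zero out entries off the support of S*."""
    return M * support

def proj_lowrank_tangent(M, U):
    """P_{T*} for symmetric L* with orthonormal column-space basis U:
    P_U M + M P_U - P_U M P_U, where P_U = U U^T."""
    PU = U @ U.T
    return PU @ M + M @ PU - PU @ M @ PU

# Sanity check: both maps are idempotent, as projections must be.
rng = np.random.default_rng(4)
p, r = 6, 2
M = rng.standard_normal((p, p))
support = (rng.random((p, p)) < 0.3).astype(float)
U, _ = np.linalg.qr(rng.standard_normal((p, r)))  # orthonormal basis
assert np.allclose(proj_sparse_tangent(proj_sparse_tangent(M, support), support),
                   proj_sparse_tangent(M, support))
assert np.allclose(proj_lowrank_tangent(proj_lowrank_tangent(M, U), U),
                   proj_lowrank_tangent(M, U))
```

These projections are the P_T operators that appear in the quantities of the next subsection.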
We next present the irrepresentability condition used to establish the model selection consistency of our LVD estimator. As our loss function is itself a quadratic loss, a key quantity is its derivative with respect to S and L (cf. the Gram matrix and its role in the analysis of the lasso). Specifically, we define the following linear operator:

h_Σ(R) = (ΣR + RΣ)/2, (4.1)

where both Σ and R are p × p matrices. For any linear subspace T of matrices, denote by P_T the projection onto T. To measure transversality between the tangent spaces Ω* and T* in terms of the operator h_{Σ_Y}(·), we need to analyze the following quantities. First, the minimum gain of h_{Σ_Y}(·) restricted to Ω* and the maximum effect of elements in Ω* on the orthogonal complement (Ω*)^⊥ are given by

α_Ω = min_{M ∈ Ω*, ||M||_∞ = 1} ||P_{Ω*} h_{Σ_Y}(M)||_∞, δ_Ω = max_{M ∈ Ω*, ||M||_∞ = 1} ||P_{(Ω*)^⊥} h_{Σ_Y}(M)||_∞.

Similarly, the minimum gain of h_{Σ_Y}(·) restricted to T* and the maximum effect of elements in T* on the orthogonal direction (T*)^⊥ are defined as

α_T = min_{M ∈ T*, ||M||_2 = 1} ||P_{T*} h_{Σ_Y}(M)||_2, δ_T = max_{M ∈ T*, ||M||_2 = 1} ||P_{(T*)^⊥} h_{Σ_Y}(M)||_2.

Finally, we also need to control the behavior of h_{Σ_Y}(·) restricted to Ω* in the spectral norm and its behavior restricted to T* in the ℓ∞ norm, that is,

β_Ω = max_{M ∈ Ω*, ||M||_∞ = 1} ||h_{Σ_Y}(M)||_2, β_T = max_{M ∈ T*, ||M||_2 = 1} ||h_{Σ_Y}(M)||_∞.

Note that the two pairs of quantities (α_Ω, δ_Ω) and (α_T, δ_T) measure the effect of h_{Σ_Y}(·) restricted to the spaces Ω* and T* in their natural norms, respectively. For notational simplicity, set

α = min{α_Ω, α_T}, δ = max{δ_Ω, δ_T}, β = max{β_Ω, β_T}.

Then the following irrepresentability condition is assumed in our theoretical analysis: there exists a ν ∈ (0, 1/2) such that

δ/α ≤ 1 − 2ν. (4.2)

We note that this assumption can be viewed as a generalization of the irrepresentability condition assumed in the high-dimensional regression literature (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006) and the graphical model literature (Ravikumar et al., 2011; Zhang and Zou, 2014). It also bears some resemblance to the one imposed in Chandrasekaran et al. (2012), although the operators lying at the core of the analysis differ owing to the different loss functions.
Finally, the superposition of the two penalties can be viewed as a norm, whose dual norm is given by

g_γ(A, B) = max{ ||A||_∞ / γ, ||B||_2 }.

This dual norm g_γ(·, ·) will serve as a tool for measuring consistency in our subsequent analysis.

Estimation and model selection consistency
In this section, we analyze the performance of the proposed LVD estimator in a nonasymptotic framework. Let (Ŝ, L̂) be the solution to optimization problem (2.5). We establish rates of convergence under the assumption that Y^(1), . . . , Y^(n) are independent and identically distributed samples from a sub-Gaussian distribution with covariance Σ_Y. That is, we assume that there exists K > 0 such that, for every unit vector v ∈ R^p and every t > 0,

P(|v^T Y| > t) ≤ 2 exp(−t^2/K^2).

This assumption implies that the observed variable Y has sub-Gaussian tails, which is used to control the deviation between the sample covariance matrix and its population counterpart in the spectral norm, i.e., ||Σ_n − Σ_Y||_2. Remark 4.1. If this condition is relaxed to Y having finite fourth moments, Vershynin (2012) proved a suboptimal rate that differs from the optimal one only by a logarithmic factor log p, and conjectured that the same convergence rate holds. For the sake of simplicity, we stick to sub-Gaussian tails. Still, we emphasize that this is weaker than the joint Gaussian assumption in Chandrasekaran et al. (2012). One possible scaling occurs when the number of latent variables and the magnitude of the coefficient matrix B are fixed; X may then follow a discrete distribution with finitely many values or any sub-Gaussian distribution.
We are now ready to present results that establish the estimation and model selection consistency of our LVD estimator. Denote ψ_1 = φ_min(Σ_Y) and ψ_2 = φ_max(Σ_Y), and let d be the maximum node degree in S*. We further let θ be the minimum magnitude of the nonzero entries of the sparse matrix S* and σ be the minimum nonzero eigenvalue of the low-rank matrix L*.
Theorem 4.1 (Estimation and model selection consistency). Suppose that the irrepresentability condition (4.2) holds, and that μ(Ω*) and ξ(T*) satisfy the identifiability condition (4.4). If we choose γ and λ_n suitably, where C_K is an absolute constant that depends only on K, and assume that σ > (3/α)λ_n and ψ_1 ψ_2 > max{1, 1/γ}, then with probability at least 1 − 2 exp{−p}, the stated error bounds for Ŝ − S* and L̂ − L* hold. Hence, if we further assume θ > (3γ/α)λ_n, then with the same probability we have sign(Ŝ) = sign(S*). (4.7) The proof of Theorem 4.1 is given in Appendix D. Theorem 4.1 says that, under the identifiability condition (4.4), the irrepresentability condition (4.2) and some conditions on the smallest eigenvalues of L* and Σ_Y, with properly chosen tuning parameters γ and λ_n, our proposed LVD procedure is consistent with certain rates of convergence. Model selection consistency is also achieved by further assuming a minimal signal strength condition on the precision matrix S*.
Our results look similar to those of Chandrasekaran et al. (2012) (hereafter referred to as CPW, after the authors' initials). As both methods require irrepresentability and identifiability conditions, it is interesting to compare our conditions (4.2) and (4.4) with theirs. Although the conditions take the same form, the specific definitions of some quantities, such as α, β and δ, differ from each other. The key difference is the operator (4.1) and its counterpart in their analysis, which is

h^CPW_{Σ_Y}(R) = Σ_Y R Σ_Y.

All six quantities defined with h_{Σ_Y}(·) in our approach should be replaced with h^CPW_{Σ_Y}(·) in their analysis. It is in general difficult to give a thorough comparison between these two sets of conditions. Instead, we consider a simple and special case in which these quantities can be computed explicitly.

Table 1: Comparison of key quantities used in the irrepresentability and identifiability conditions.
Let S* = I_p be the p × p identity matrix and L* = π(1_p/√p)(1_p/√p)^T = πJ_p/p be the rank-1 matrix maximally incoherent with the standard orthonormal basis, where J_p is the p × p matrix of all ones and π is a parameter. Note that to guarantee that L* and S* − L* are both positive definite, we must have 0 < π < 1. In this specific circumstance, all six quantities for both methods are given in Table 1. Hence for the irrepresentability condition, as long as p ≥ 2, our method requires

[π/(p(1−π))] / [1 + π/(p(1−π))] < 1,

which always holds regardless of π. In contrast, the CPW approach requires a more stringent inequality. For a fixed 0 < π < 1, the solution of that inequality (as a function of p) is the largest root of a cubic function, and its approximate solution is given by p ≈ √2 π/(1−π)^2. This means that if π is close to 1, the dimensionality must be large enough to guarantee the incoherence of the low-rank matrix. For example, when π = 0.9, the identifiability condition fails for the CPW method when p < 118 (118 is the exact number and 127 is the approximation).
Similarly, we can compare the identifiability conditions of the two methods, for which it suffices to focus on the right-hand side of equation (4.4). Denote ζ = να/((2 − ν)β) and set ν = (1/2)(1 − δ/α); then ζ = να/((2 − ν)β) = α(α − δ)/((3α + δ)β). It can be shown after tedious calculation that ζ_LVD > ζ_CPW always holds when p ≥ 2 and 0 < π < 1. For any fixed 0 < π < 1, the ratio ζ_LVD/ζ_CPW has a limit greater than 1 as p → ∞. Figure 3 shows the ratio for 30 ≤ p ≤ 1000 and 0 < π ≤ 0.75. Note that we do not include large π and small p, as the irrepresentability condition for CPW barely holds, or fails, in that scenario. Besides, the limit of ζ_LVD/ζ_CPW approaches 1 when π is small and can become large as π increases.
We make a few more remarks on Theorem 4.1. First, the maximum degree of S* and the rank of L* do not explicitly appear in the theorem, but are hidden in the conditions and other factors. To better illustrate their impact on our results, we focus on μ(Ω*) and ξ(T*) and assume temporarily that α, β, δ, ν, ψ_1, ψ_2 are all of order O(1). Recall that μ(Ω*) ≤ d := deg(S*) and ξ(T*) ∼ r/p if L* is nearly maximally incoherent. In this case, condition (4.4) essentially requires d·r/p = O(1), and λ_n is of an order somewhere between O(d√(p/n)) and O(p/√(rn)) (depending on the choice of γ). These results match the rate of CPW, but potentially improve other factors appearing in the rate, and are derived without the joint Gaussian assumption.
Second, the consistency result (like that of CPW) holds for a range of values of γ, which is preferable in practice since we do not need to tune two parameters. If we choose γ at the upper end of this range, we have the following corollary. Finally, our technical analysis differs from that of CPW. While they resorted to Brouwer's fixed point theorem in their proofs, we use a more direct approach to analyze the LVD procedure owing to its simple quadratic form. Since there is no log-determinant term in our loss function, we additionally assume that the spectrum of the observed covariance matrix Σ_Y is well-conditioned to guarantee positive definiteness of our estimates.

Simulation studies
In this section, simulation studies are carried out to compare the numerical performance of our LVD estimator Ŝ_LVD, the regularized maximum likelihood estimator Ŝ_CPW from Chandrasekaran et al. (2012), and the graphical lasso Ŝ_glasso from Friedman et al. (2008). We are particularly interested in how these methods perform under different settings of latent variables and how the dimensionality affects their performance.
We consider nine models as shown in Table 2. For each model, we generate an r × p coefficient matrix B with entries independently following Unif([−1.5, −0.5] ∪ [0.5, 1.5]). Each element of the n × r design matrix X is independently drawn from a Bernoulli(0.5) or a standard normal distribution according to the model set-up. We generate the p × p precision matrix S* with P(S*_ij ≠ 0), i ≠ j, shown in Table 2 and diagonal entries equal to 1. If S*_ij ≠ 0 for i ≠ j, we set |S*_ij| = 0.25 and choose its sign with equal probability. Finally, we generate an n × p random error matrix E.

The nine models are divided into three groups in terms of dimensionality. While the sample size is fixed at n = 200, the number of observed variables ranges from p = 20 in the first three models to p = 50 in Models 4-6, and we consider a moderately high dimension with p = 100 in Models 7-9. Within each group, the latent variables follow Bernoulli distributions in the first model and normal distributions in the second. We also set r = 0 in the third model, which corresponds to no latent variable at all. When X is normal, Y also has a normal distribution, which may give an edge to Ŝ_CPW since it adopts the maximum likelihood approach. If X takes binary values, on the other hand, Y follows a Gaussian mixture distribution. This setting is very common in real data analysis when X is used to model batch effects or other categorical covariates such as sex and race. Finally, we expect Ŝ_glasso to perform well when there is no latent variable. The number of latent variables is set to r = 2 in Models 1-2 and 4-5, and increases to r = 5 in Models 7-8.

For each model, we choose the tuning parameter λ_n using 10-fold cross validation as described in Section 3.2. As for γ, we have shown that our estimate is insensitive to a wide range of low-rank tuning parameters, which is also observed by Chandrasekaran et al.
(2012), Yuan (2012) and in our numerical experiments; hence we report results with a pre-chosen fixed constant that may vary across models. The tuning parameters for CPW and glasso are also chosen by 10-fold cross validation, except that the cross-validation error is defined as the negative log-likelihood according to their loss functions. Further, for the sake of fairness, the tuning parameter γ for CPW is also fixed at a constant such that the low-rank matrices estimated by LVD and CPW have approximately the same rank.
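As an illustration, the data-generating scheme described above can be sketched in a few lines of numpy. The error distribution is our assumption (the extracted text truncates before specifying it; we take rows of E to be N(0, (S*)^{-1}), consistent with S* being the precision matrix of the noise), and the positive-definiteness guard is ours as well:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=200, p=20, r=2, edge_prob=0.1, binary_latent=True):
    """Generate (Y, S*) following the simulation design sketched above."""
    # r x p coefficient matrix B with entries Unif([-1.5,-0.5] U [0.5,1.5])
    B = rng.uniform(0.5, 1.5, (r, p)) * rng.choice([-1.0, 1.0], (r, p))
    # n x r latent design X: Bernoulli(0.5) or standard normal entries
    if binary_latent:
        X = (rng.random((n, r)) < 0.5).astype(float)
    else:
        X = rng.standard_normal((n, r))
    # sparse precision S*: unit diagonal, off-diagonals +-0.25 w.p. edge_prob
    upper = np.triu(rng.random((p, p)) < edge_prob, k=1) * rng.choice([-0.25, 0.25], (p, p))
    S_star = np.eye(p) + upper + upper.T
    lam_min = np.linalg.eigvalsh(S_star)[0]
    if lam_min < 0.05:  # guard (our addition): shrink off-diagonals to keep S* PD
        c = 0.95 / (1.0 - lam_min)
        S_star = np.eye(p) + c * (S_star - np.eye(p))
    # assumed error model: rows of E drawn from N(0, (S*)^{-1}); then Y = X B + E
    cov_chol = np.linalg.cholesky(np.linalg.inv(S_star))
    E = rng.standard_normal((n, p)) @ cov_chol.T
    return X @ B + E, S_star
```

The guard rescales the off-diagonal part so that the smallest eigenvalue is at least 0.05, which preserves the sign pattern of S* while ensuring the noise covariance exists.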
Since our main goal is to estimate S*, especially its non-zero pattern, which carries information on direct interactions among manifest variables after adjusting for the effect of latent variables, we use six measures of estimation and model selection performance to assess each method. The estimation error Ŝ − S* is evaluated by the spectral norm, the matrix ℓ1 norm and the Frobenius norm. Further, the model selection performance is characterized by the true positive rate (TPR), the true negative rate (TNR), and the Matthews correlation coefficient (MCC). Specifically, these three measures are defined as

TPR = TP/(TP + FN),  TNR = TN/(TN + FP),
MCC = (TP·TN − FP·FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives among the off-diagonal entries.

The results are summarized in Table 3. We see that when p is small, our proposed LVD procedure outperforms the other two methods in all measures, especially when X has a discrete distribution. When the dimensionality is moderately high, the performance of our estimator and that of Ŝ_CPW on TPR, TNR and MCC are comparable. This is because the marginal distribution of Y does not deviate significantly from a multivariate normal distribution owing to the dimensionality, and because the penalized likelihood can be viewed as a penalized log-determinant Bregman divergence (Ravikumar et al., 2011). Nevertheless, we note that Ŝ_LVD has the smallest estimation error even when p is moderately high. Finally, the graphical lasso estimator performs well only in the absence of latent variables, as expected. Notice that glasso may not achieve the best estimation performance even in the case without latent variables, owing to the different choices of loss functions: while we use the Frobenius norm as in (3.12), the other two methods adopt the log-determinant loss. Our method tends to choose a smaller tuning parameter than CPW and glasso, which may lead to better estimation performance. Similar results have also been observed in Zhang and Zou (2014), and such differences become negligible as the tuning parameters vary; see Figure 4.

Table 3: Simulation results for Models 1-9. Each performance measure is averaged over 100 replications with standard deviations shown in parentheses.
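For concreteness, the three model selection measures can be computed from the off-diagonal supports of Ŝ and S* as follows. This is a small helper with our own conventions (MCC set to 0 when a denominator vanishes; only off-diagonal entries counted as candidate edges):

```python
import numpy as np

def selection_metrics(S_hat, S_star, tol=1e-8):
    """TPR, TNR and MCC on the off-diagonal support of the precision matrix."""
    p = S_star.shape[0]
    off = ~np.eye(p, dtype=bool)                 # exclude the diagonal
    truth = np.abs(S_star[off]) > tol            # true edges
    est = np.abs(S_hat[off]) > tol               # estimated edges
    tp = np.sum(truth & est); tn = np.sum(~truth & ~est)
    fp = np.sum(~truth & est); fn = np.sum(truth & ~est)
    tpr = tp / max(tp + fn, 1)
    tnr = tn / max(tn + fp, 1)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return tpr, tnr, mcc
```

Perfect support recovery yields TPR = TNR = MCC = 1, while an empty estimate of a non-empty graph drives TPR and MCC toward 0.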
To further investigate the performance on graph structure recovery, we obtain the receiver operating characteristic (ROC) curve for each simulated data set by varying the tuning parameter for the sparse matrix. Figure 4 shows the ROC curves averaged over 100 replications; figures in the same row correspond to the same dimensionality and different set-ups of the latent variables X. The performance of our LVD estimator is comparable to that of the penalized likelihood method in most cases and better than the other two methods when p = 20 and X follows a Bernoulli distribution. The graphical lasso estimator, on the other hand, performs poorly when there exist latent variables, that is, in Models 1-2, 4-5, and 7-8. We also note that in these cases there is a "kink" in the ROC curve for glasso: below this critical point, the graphical lasso estimator performs no better than random guessing; as p and r increase (and thus the effects of latent variables decrease), this critical point moves towards the origin. This phenomenon implies that a direct application of the graphical lasso in real data analysis may lead to unreliable results if there is concern over latent variables, especially when the number of nodes is not large. Finally, we emphasize that the two methods accounting for the effect of latent variables, Ŝ_LVD and Ŝ_CPW, perform no worse than Ŝ_glasso in all models.

Analysis of a yeast data set
To demonstrate our method in real data analysis, we present results from the analysis of a yeast genetical genomics data set generated by Brem and Kruglyak (2005). They used BY4716 and the wild isolate RM11-1a as parent strains to grow 112 yeast segregants. Then they isolated RNA and hybridized cDNA to microarrays that had 6216 yeast genes assayed on each array. It is nearly impossible to build a gene-gene interaction network for all the genes owing to the small sample size and restricted perturbation in biological systems. We instead apply our method to a set of 56 genes that belong to the yeast mitogen-activated protein kinase (MAPK) signaling pathway provided by the KEGG database (Kanehisa et al., 2010).
The S. cerevisiae genome encodes multiple MAP kinase orthologs. Fus3 mediates the cellular response to peptide pheromones, Kss1 permits adjustment to nutrient-limiting conditions, and Hog1 is necessary for survival under hyperosmotic conditions. In addition, Slt2/Mpk1 is required for the repair of injuries to the cell wall. Figure 5 displays the illustrative pathway structure. Since several genes, such as Ste20, Ste12 and Ste11, appear in multiple locations, this graph cannot be treated as the "true graph" for evaluating or comparing different methods. Our goal is to construct a conditional independence network among these genes at the expression level and compare the result to this reference graph in the hope of gaining some biological interpretations.
We apply the above methods to this set of 56 genes and use 10-fold cross-validation to choose tuning parameters for all approaches. The models selected by cross-validation include 44 (LVD), 508 (CPW) and 536 (glasso) links among the 56 genes, respectively. While the graphical lasso estimator tends to choose a large number of interactions as expected, we see that Ŝ_CPW also selects too many links to interpret. On the other hand, our LVD estimator yields a much sparser graph, which suggests that the yeast data set we are analyzing may be subject to the effect of latent variables. We emphasize that we choose γ such that the low-rank matrices estimated by LVD and CPW have the same rank. Further, although tuning γ changes the rank of the estimated low-rank matrix, the resulting graphs of these two methods are stable across different values of γ.

Figure 6 shows the undirected graph for the 37 linked genes on the MAPK pathway constructed based on our estimator. Although we do not expect the resulting graph to fully recover the original MAPK pathway, it does carry biological meaning. For example, DIG1, FUS1, FUS3, GPA1, FAR1, STE2, STE3, STE12, STE18 and STE20 are linked together, suggesting strong interactions between these genes, as they are all involved in the yeast pheromone and mating process. Similarly, PKC1 and MKK1, MLP1 and SWI4 are linked together, while SLT2, RLM1, FKS1 and MID2 are linked through CTT1 and MSN4 owing to their interaction in the cell wall remodelling process. Finally, CTT1, SHO1, MSN4, YPD1, MCM1 and SLN1 are connected via SLT2 and RLM1 since they participate in osmolyte synthesis. Our method also estimates a low-rank matrix that summarizes the effects of latent variables; methods have been developed to interpret latent variables (Taeb and Chandrasekaran, 2016) and to factorize this low-rank matrix, for example into sparse factors (Witten et al., 2009).
To compare across different methods, we follow the stability selection approach (Meinshausen and Bühlmann, 2010) and fit the graph for all three methods on 100 bootstrap subsamples of size n/2, with tuning parameters fixed at the values chosen before. An extra cut-off threshold π_thr, which determines whether an interaction is stable, varies across methods and is derived according to Theorem 1 in Meinshausen and Bühlmann (2010) such that the expected number of wrong edges is less than 30 among the 1540 possible edges. Since the cut-off thresholds π_thr take different values for different methods, the resulting graphs have comparable numbers of edges: 36 for LVD, 39 for CPW and 50 for glasso. Among these, 26 edges are selected by all three methods, with another 2 links selected simultaneously by our method and glasso/CPW. Although our estimator selects the fewest edges, it involves the largest number of genes whose interactions can be interpreted by the real MAPK pathway illustrated in Figure 5.
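The subsampling scheme above can be sketched generically. Here `fit_support` is a hypothetical placeholder for any of the three estimators (a wrapper returning the boolean off-diagonal support of Ŝ), and the per-method threshold π_thr is passed in directly rather than derived from Theorem 1 of Meinshausen and Bühlmann (2010):

```python
import numpy as np

def stability_selection(Y, fit_support, B=100, pi_thr=0.8, seed=0):
    """Edge selection frequencies over B subsamples of size n/2.

    fit_support: callable mapping a data matrix to a (p, p) boolean
    support estimate (hypothetical; not part of the paper's notation).
    Returns the stable-edge indicator (freq >= pi_thr) and the frequencies.
    """
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    freq = np.zeros((p, p))
    for _ in range(B):
        idx = rng.choice(n, size=n // 2, replace=False)  # subsample without replacement
        freq += fit_support(Y[idx])
    freq /= B
    return freq >= pi_thr, freq
```

An edge is declared stable when it is selected in at least a fraction π_thr of the subsampled fits, which is the criterion the comparison in this section relies on.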

Conclusion
In this paper, we study the problem of modeling the statistical dependence of random variables as a sparse graphical model conditional on a few additional latent components. Since observed variables are commonly correlated with hidden variables in real-world data, such a model is of great importance. We develop the LVD procedure, an ℓ1- and trace-regularized minimization method for latent-variable graphical model selection, and prove its estimation and model selection consistency under certain identifiability conditions. We also propose a computationally efficient algorithm that can be applied in high-dimensional settings. Our simulation studies verify the theoretical results and show the superior performance of our method over other approaches. Finally, we demonstrate the effectiveness of our method on a yeast gene expression data set.
Several improvements and extensions of our method deserve further analysis. Both the joint Gaussian model and our linear factor model (2.1) assume a Gaussian distribution for the random error to facilitate the estimation of conditional independence among variables. Graphical models for non-Gaussian data have attracted much attention recently, including discrete variables (Ravikumar et al., 2010; Loh and Wainwright, 2013), exponential families (Yang et al., 2015), and mixed data (Chen et al., 2015; Cheng et al., 2016; Fan et al., 2017). It is of great interest to develop statistically consistent methods for latent-variable modeling with such non-Gaussian variables. In fact, Tan et al. (2016) have already studied semiparametric exponential family graphical models with latent variables when replicates within each subject are available. Another important issue worth pursuing concerns controlling the false discovery rate (FDR) and constructing confidence intervals for latent-variable graphical models. Similar problems for Gaussian graphical models have been considered by Liu (2013), Wasserman et al. (2014), Janková and van de Geer (2015), and Ren et al. (2015). In particular, entrywise confidence intervals for latent-variable graphical models under the joint Gaussian assumption have been established in Ren et al. (2015). It is promising to extend these methodologies to more general distributional assumptions, with an emphasis on controlling the FDR of detected edges.

A.1. Stability of γ
We have already shown in Theorem 4.1 that estimation and model selection consistency holds for a range of γ. In this section we investigate the performance of our LVD estimator as γ varies. For simplicity, we consider only Model 4 in Section 5, that is, n = 200, p = 50, r = 2 with binary latent variables. Recall from Table 3 that the average rank of the estimated low-rank matrix is 4.67 for our LVD estimator. Table 4 summarizes the results as we vary γ; for each fixed γ, we choose λ_n using 10-fold cross validation. Note that γ = 0.15 corresponds to the results in Table 3.
We see that as γ varies from 0.12 to 0.35, the average estimated rank of the low-rank matrix ranges from 2.85 to 13.11. Nevertheless, the estimation errors measured by all three norms remain stable across different choices of γ, especially when 0.15 ≤ γ ≤ 0.3. Moreover, the Matthews correlation coefficient even increases as γ grows, i.e., when our LVD method estimates L̂ with a moderately high rank. Consequently, the estimated sparse matrix becomes sparser (TPR decreases and TNR increases), since γ provides a trade-off between the sparse and low-rank matrices.

A.2. Effect of r
To understand how the global effects caused by latent variables affect the performance of different methods, we extend the simulation study in Section 5 by varying r. To better illustrate the effect of r, we consider the low-dimensional scenario, that is, n = 200 and p = 20. As explained in Section 5, we fix γ at a pre-chosen constant so that the low-rank matrices estimated by LVD and CPW have approximately the same rank, and for simplicity we let X take only binary values. Table 5 summarizes the results as r varies from 0 to 4. We have already seen in Section 5 that glasso performs well only when there is no latent variable at all. Although the performance of all three methods deteriorates as r increases, our LVD procedure outperforms the other approaches in all cases.

Proof. We rewrite the first-order optimality condition for (3.9) as (B.1). Multiplying by U_Σ^T from the left and U_Σ from the right in (B.1), we obtain an elementwise system in the eigenbasis of Σ. To solve for Ṙ^{k+1} from the above display, it suffices to ensure that the resulting scalar equation holds for each i, j = 1, . . . , p, which establishes (3.10).

Appendix C: Numerical algorithm for solving (3.2)
The ADMM described in Algorithm 1 for solving (3.1) needs to be slightly modified. Specifically, we update R^{k+1} by solving the following optimization problem instead of (3.9). As this problem has no closed-form solution, we solve it via an inner loop of ADMM: after introducing an auxiliary variable Ř with Ř = R, we solve the resulting augmented problem. The updates of R and Ř are then similar to (3.10) and (3.8). We finally note that it is sufficient to solve (3.1) in practice, which is much simpler and faster.
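To make the outer ADMM concrete, here is a minimal numpy sketch of updates of the form (3.7)-(3.10) for the penalized D-trace problem (3.1). It is a simplified illustration under our own choices (fixed step size ρ, eigenvalue thresholding of the symmetrized matrix in the L-update so that L stays positive semidefinite), not the authors' exact Algorithm 1:

```python
import numpy as np

def soft(M, t):
    # elementwise soft-thresholding: proximal operator of t * ||.||_1
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def eig_threshold(M, t):
    # threshold eigenvalues of the symmetrized matrix; keeps L >= 0 (our choice)
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    return (V * np.maximum(w - t, 0.0)) @ V.T

def lvd_admm(Sigma, lam, gamma, rho=1.0, n_iter=200):
    """Sketch of ADMM for min (1/2)tr(R Sigma R) - tr(R)
    + lam*(gamma*||S||_1 + ||L||_*), subject to R = S - L."""
    p = Sigma.shape[0]
    w, Q = np.linalg.eigh(Sigma)
    denom = w[:, None] + w[None, :] + 2.0 * rho   # eigen-domain Sylvester solve
    S, L, U = np.eye(p), np.zeros((p, p)), np.zeros((p, p))
    for _ in range(n_iter):
        # R-update: Sigma R + R Sigma + 2 rho R = 2(I + rho(S - L - U)),
        # solved exactly in the eigenbasis of Sigma
        M = 2.0 * (np.eye(p) + rho * (S - L - U))
        R = Q @ ((Q.T @ M @ Q) / denom) @ Q.T
        S = soft(R + L + U, lam * gamma / rho)     # sparse part
        L = eig_threshold(S - R - U, lam / rho)    # low-rank part
        U = U + R - (S - L)                        # scaled dual update
    return S, L, R
```

The R-update is where the eigendecomposition of Σ enters: in the eigenbasis the Sylvester equation decouples into p² scalar equations with denominators w_i + w_j + 2ρ, mirroring the elementwise solution referred to around (3.10).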

Appendix D: Proof of Theorem 4.1
We start from a lemma characterizing the deviation of the sample covariance matrix from its population counterpart; its proof can be found in Vershynin (2010).

Lemma D.1. Under assumption (4.3), the empirical covariance matrix satisfies
where C K is a constant that only depends on K.
Proof of Theorem 4.1. In what follows we condition on the event that the result in Lemma D.1 holds, which occurs with probability at least 1 − 2 exp(−p).
The key idea used in the proof is the primal-dual witness technique used previously in the analysis of the Lasso (Wainwright, 2009) and of graphical models (Ravikumar et al., 2011). The proof is summarized in the following three steps.
Step 1. Note that there is no positive-(semi)definite constraint in (D.1). It will be shown later in Step 2 that, under the conditions assumed in Theorem 4.1, Ř and Ľ are positive (semi)definite with high probability, hence it suffices to study problem (D.1). We show in this step that if we solve (D.1) subject to the additional constraints that S and L belong to the tangent spaces Ω* and T*, that is,

min_{S,L,R}  (1/2) tr(R Σ_n R) − tr(R) + λ_n (γ ‖S‖_1 + ‖L‖_*),
subject to  R = S − L,  R = R^T,  L = L^T,  S ∈ Ω*,  L ∈ T*,    (D.2)

then the two sets of solutions coincide.
Step 2. We analyze the solution to the optimization problem (D.2), denoted by (S̄, L̄, R̄), and show that it enjoys estimation consistency measured in terms of g_γ(S̄ − S*, L̄ − L*), as claimed in Theorem 4.1. We further show that under suitable conditions on the minimum magnitude of the nonzero entries of S* and the minimum nonzero singular value of L*, Ř and Ľ are positive (semi)definite with high probability; thus model selection consistency for S* is achieved. Therefore, (Ŝ, L̂, R̂) = (Š, Ľ, Ř) = (S̄, L̄, R̄), which completes the proof of Theorem 4.1.
Restricting (D.3) to the space Ω*, we have

P_{Ω*}[h_{Σ_n}(R̄) − I] = −λ_n γ sign(S̄),

where L̄ = Ū diag(σ̄) V̄^T is the SVD of L̄. Let 𝒜 and 𝒜† denote the addition operator and its adjoint; that is, for matrices A and B, we have