Learning Gaussian Graphical Models with Latent Confounders

Gaussian graphical models (GGMs) are widely used to estimate network structure in applications ranging from biology to finance. In practice, data are often corrupted by latent confounders, which bias inference of the underlying true graphical structure. We compare and contrast two strategies for inference in graphical models with latent confounders: Gaussian graphical models with latent variables (LVGGM) and PCA-based removal of confounding (PCA+GGM). While these two approaches have similar goals, they are motivated by different assumptions about confounding. We explore the connection between these two approaches and propose a new method that combines their strengths. We prove consistency and derive the convergence rate for the PCA-based method, and use these results to provide guidance about when to use each method. We demonstrate the effectiveness of our methodology using both simulations and two real-world applications.


Introduction
In many domains, it is necessary to characterize relationships between features using network models. For example, networks have been used to identify transcriptional patterns and regulatory relationships in genetic networks, to characterize functional brain connectivity and cognitive disorders, and to provide insights into neurodegenerative diseases such as Alzheimer's (Fox and Raichle, 2007; Van Dijk et al., 2012; Barch et al., 2013; Price et al., 2014). One of the most common methods for inferring a network from observations is the Gaussian graphical model (GGM). A GGM is defined with respect to a graph in which the nodes correspond to jointly Gaussian random variables and the edges correspond to conditional dependencies between pairs of variables. A key property of GGMs is that the presence or absence of edges can be read off the precision matrix of the multivariate Gaussian distribution (Lauritzen, 1996). Similar to LASSO regression (Tibshirani, 1996), we can learn a GGM via sparse precision matrix estimation with ℓ_1-regularized maximum likelihood estimation. This family of approaches is called the graphical lasso (Glasso) (Friedman et al., 2008; Yuan and Lin, 2007).
In practice, however, network inference may be complicated by the presence of latent confounders. For example, when characterizing relationships between the stock prices of publicly traded companies, the existence of overall market and sector factors induces extra correlation between stocks (Choi et al., 2011), which can obscure the underlying network structure between companies.
Let Ω be the precision matrix encoding the graph structure of interest, and Σ = Ω^{-1}. When latent confounders are present, the covariance matrix of the observed data can be expressed as

Σ_obs = Σ + L_Σ,   (1)

where the positive semidefinite matrix L_Σ reflects the effect of latent confounders. Alternatively, by the Sherman-Morrison identity (Horn and Johnson, 2012), the observed precision matrix can be expressed as

Ω_obs = Ω − L_Ω,   (2)

where L_Ω again reflects the effect of unobserved confounding. Importantly, if confounding is ignored, the inferred networks may be heavily biased because the observed precision matrix is no longer (necessarily) sparse. Motivated by this problem, multiple approaches have been proposed to recover the graph encoded by Ω in the presence of confounding. Our goal in this paper is to generalize two seemingly different notions of confounding into a common framework for addressing the effect of L_Σ in order to obtain the graph structure encoded in Ω = Σ^{-1}. We compare and contrast two methods for GGM inference with latent confounders and propose a generalization that combines them.
In order to control for the effects of confounding, some key assumptions are required. The first assumption is that Ω is sparse and L Ω , or equivalently L Σ , is low rank. The low rank assumption is equivalent to assuming that the number of confounding variables is small relative to the number of observed variables. As such, these methods are often referred to as "sparse plus low rank" methods (Chandrasekaran et al., 2011). We focus on two prominent approaches for sparse plus low rank inference, which require different assumptions for identifiability of Ω.
One common approach, known as latent variable Gaussian Graphical Models (LVGGM), uses the parameterization in (2) and involves joint inference for Ω and L Ω . In this approach, the focus is on the effect of unobserved variables in the complete graph, which affect the partial correlations of the variables in the observed graph Ω. This perspective can be particularly useful when the unobserved variables would have been included in the graph, had they been unobserved. In LVGGM, in order for Ω to be identifiable, L Ω must be sufficiently dense (Chandrasekaran et al., 2011).
An alternative approach uses principal component analysis (PCA) as a preprocessing step to remove the effect of confounders, focusing on how confounders affect the marginal correlations between observed variables. This approach uses the parameterization in (1): in a first stage, the effect of L_Σ is removed by subtracting the leading eigencomponents of Σ_obs, and in a second stage standard GGM inference is applied (Parsana et al., 2019). We call this PCA-based approach PCA+GGM. For it to be effective, the norm of L_Σ (the noise) must be large relative to the norm of Σ (the signal). PCA+GGM has been shown to be especially useful for estimating gene co-expression networks, where correlated measurement noise and batch effects induce large extraneous marginal correlations between observed variables (Geng et al., 2018; Leek and Storey, 2007; Stegle et al., 2011; Gagnon-Bartsch et al., 2013; Freytag et al., 2015; Jacob et al., 2016). In contrast to Chandrasekaran et al. (2012), the confounders in Parsana et al. (2019) are usually thought of as nuisance variables that would not be included in the complete graph even if they had been observed.
In practice, it is quite possible that the data are affected by both sources of confounding, e.g., measurement batch effects as well as unobserved nodes in the complete graph of interest. Moreover, the presumed type of confounding will affect which approach is most appropriate. In this paper, our goal is to explore proper ways to address the effect of confounders in order to recover the graph structure encoded in Ω. We relate and compare two methods, LVGGM and PCA+GGM, and then propose a new method, PCA+LVGGM, that combines the two in order to address confounding from multiple sources. The combined approach is more general, since PCA+LVGGM contains both LVGGM and PCA+GGM as special cases. We demonstrate the efficacy of PCA+LVGGM in reconstructing the underlying network structure from gene co-expression data and stock return data in section 5.
In summary, our contributions are: • We carefully compare PCA+GGM and LVGGM, and illustrate the connections and differences between the two methods. We provide a non-asymptotic convergence result for the PCA+GGM procedure proposed by Parsana et al. (2019) and show that its convergence rate is of the same order as that of LVGGM. We demonstrate both theoretically and empirically that PCA+GGM works particularly well when the norm of the extraneous low-rank component is large compared to that of the signal.
• We propose PCA+LVGGM, which combines elements of PCA+GGM and LVGGM. We show that PCA+LVGGM can outperform PCA+GGM and LVGGM used individually when the data are corrupted by multiple sources of confounding with certain structure. We perform extensive simulations to validate the theory and compare the performance of the three methods, and we demonstrate the utility of our approach in two applications.
The remainder of this paper is organized as follows: In section 2, we introduce the basic setup for GGM, LVGGM and PCA+GGM. A brief literature review follows. Then we describe our method PCA+LVGGM. In section 3, we present theoretical results for PCA+GGM and use this result to analyze the similarities and differences between LVGGM and PCA+GGM. In section 4, we demonstrate the utility of the various approaches in the simulation setting. Finally, in section 5 we apply the methods on two real world data sets.

Notations: For a vector v ∈ R^p, ‖v‖_2 denotes its Euclidean norm. For a matrix M, ‖M‖_∞ = max_{i,j} |M_{ij}| denotes the entrywise maximum norm and ‖M‖_1 = Σ_{i,j} |M_{ij}| the entrywise ℓ_1 norm. We also define the spectral norm ‖M‖_2 = sup_{‖v‖_2 ≤ 1} ‖Mv‖_2 and ‖M‖_{L_1} = max_j Σ_i |M_{ij}|. The nuclear norm ‖M‖_* is the sum of the singular values of M; when M ∈ R^{p×p} is symmetric, it equals the sum of the absolute values of the eigenvalues of M.

Problem Setup and Methods Review

Gaussian Graphical Models
Consider a p-dimensional random vector X = (X_1, ..., X_p)^T with covariance matrix Σ and precision matrix Ω. Let G = (V, E) be the graph associated with X, where V is the set of nodes corresponding to the elements of X and E is the set of edges connecting nodes. The graph encodes the conditional independence relations between elements of X: for any pair of nodes not joined by an edge, the corresponding variables are conditionally independent given the remaining variables, i.e., X_i ⊥ X_j | X_{∖{i,j}} for all (i, j) ∉ E. If X is multivariate Gaussian, then X_i and X_j are conditionally independent given the other variables if and only if Ω_{ij} = 0, and thus the graph structure can be recovered from the precision matrix of X.
Without loss of generality, we assume throughout section 2 that X has mean zero. Assuming that the graph is sparse, given a random sample {X^{(1)}, ..., X^{(n)}} from the distribution of X, the Glasso estimate Ω̂_Glasso (Yuan and Lin, 2007; Friedman et al., 2008) is obtained by solving the ℓ_1-regularized negative log-likelihood problem

Ω̂_Glasso = argmin_{Ω ≻ 0} Tr(Ω Σ_n) − log det(Ω) + λ ‖Ω‖_1,   (3)

where Tr denotes the trace of a matrix and Σ_n = (1/n) Σ_{k=1}^{n} X^{(k)} X^{(k)T} is the sample covariance matrix. Many alternative objective functions for sparse precision matrix estimation have been proposed (Cai et al., 2011; Meinshausen et al., 2006; Peng et al., 2009; Khare et al., 2015), and the behavior and convergence rates of these approaches are well studied (Bickel et al., 2008b,a; Rothman et al., 2008; Ravikumar et al., 2011; Cai et al., 2016).
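As a concrete illustration (a hypothetical example, not from the paper: the chain-graph precision matrix and the regularization level are arbitrary choices), the Glasso estimate can be computed with scikit-learn:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p, n = 10, 2000

# Sparse ground-truth precision matrix: a chain graph (tridiagonal, PD).
Omega = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Omega)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
model = GraphicalLasso(alpha=0.05).fit(X)   # alpha plays the role of lambda
Omega_hat = model.precision_

# Entries far from the chain are shrunk toward zero, recovering the sparsity.
```

With a moderate penalty and n much larger than p, the estimated precision matrix is near zero away from the chain while the true chain edges remain clearly nonzero.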
In the presence of latent confounders, Glasso and other GGM methods will likely recover an overly dense precision matrix, owing to spurious partial correlations introduced between observed variables. In other words, even when the underlying graph is sparse conditional on the latent variables, the observed graph is dense marginally.

Latent Variable Gaussian Graphical Models
One method for controlling the effects of confounders is the latent variable Gaussian graphical model (LVGGM) approach, first proposed by Chandrasekaran et al. (2012). It assumes that the number of latent factors is small compared to the number of observed variables and that the conditional graph is sparse. Consider a (p + r)-dimensional mean-zero normal random variable X = (X_O, X_H)^T, where X_O ∈ R^p is observed and X_H ∈ R^r is latent. Let X have precision matrix Ω̄ ∈ R^{(p+r)×(p+r)}, with submatrices Ω_O ∈ R^{p×p}, Ω_H ∈ R^{r×r}, and Ω_{O,H} ∈ R^{p×r} specifying the dependencies among observed variables, among latent variables, and between observed and latent variables, respectively. By the Schur complement, the precision matrix of the observed variables satisfies

(Σ_O)^{-1} = Ω_O − Ω_{O,H} Ω_H^{-1} Ω_{O,H}^T,   (4)

where Ω = Ω_O encodes the conditional independence relations of interest and is sparse by assumption.
The term L_Ω = Ω_{O,H} Ω_H^{-1} Ω_{O,H}^T reflects the low-rank effect of the latent variables X_H. Based on this sparse plus low-rank decomposition, Chandrasekaran et al. (2012) proposed the regularized maximum likelihood problem

minimize_{Ω, L_Ω}  −ℓ(Ω − L_Ω, Σ_n) + λ ‖Ω‖_1 + γ ‖L_Ω‖_*   subject to  Ω − L_Ω ≻ 0,  L_Ω ⪰ 0,   (5)

where Σ_n is the observed sample covariance matrix and ℓ(Ω, Σ) = log(det(Ω)) − Tr(ΩΣ) is the Gaussian log-likelihood function. The ℓ_1 norm encourages sparsity in Ω and the nuclear norm encourages low-rank structure in L_Ω.
The sparse plus low-rank decomposition is ill-posed if L_Ω is not sufficiently dense, because a sparse low-rank component cannot be distinguished from Ω. Identifiability of Ω coincides with the incoherence condition in the matrix completion problem (Candès and Recht, 2009), which requires that |u_k^T e_i| be small for all k ∈ {1, ..., r} and i ∈ {1, ..., p}, where u_k is the k-th eigenvector of L_Ω and e_i is the i-th canonical basis vector. When the eigenvectors of L_Ω are sufficiently dissimilar to the canonical vectors, the error bound on ‖Ω̂ − Ω‖_∞ is of order O(√(p/n)). Further analysis of LVGGM can be found in Agarwal et al. (2012) and Meng et al. (2014).

Procedure 1 PCA+GGM
Input: sample covariance matrix Σ̂_obs; rank of L̂_Σ, r
Output: precision matrix estimate Ω̂
1: Estimate L̂_Σ from the leading eigencomponents of Σ̂_obs: L̂_Σ = Σ_{i=1}^{r} λ̂_i θ̂_i θ̂_i^T
2: Set Σ̂ = Σ̂_obs − L̂_Σ
3: Using Σ̂, compute Ω̂ as the solution to (3)

Procedure 2 PCA+LVGGM
Input: sample covariance matrix Σ̂_obs; rank of L̂_Σ, r_P; rank of L̂_Ω, r_L
Output: precision matrix estimate Ω̂
1: Estimate L̂_Σ from the leading eigencomponents of Σ̂_obs: L̂_Σ = Σ_{i=1}^{r_P} λ̂_i θ̂_i θ̂_i^T
2: Set Σ̂ = Σ̂_obs − L̂_Σ
3: Using Σ̂, compute Ω̂ as the solution to (5), with γ chosen such that rank(L̂_Ω) = r_L

Finally, Ren and Zhou (2012) show that standard GGM approaches can still recover Ω_O in the presence of latent confounding as long as the norm of the low-rank component is sufficiently small. In our simulations, we observe that LVGGM significantly outperforms Glasso only when the low-rank component has a large enough norm compared to the signal. If the magnitude of L_Ω is large enough, however, a more direct approach that removes high-variance components can be more appropriate. This partly motivates the PCA+GGM method described in the next section.

PCA+GGM Approach
Unlike LVGGM, which involves a decomposition of the observed-data precision matrix, PCA+GGM involves the decomposition of the observed-data covariance matrix in (1):

Σ_obs = Σ + L_Σ.

Motivated by confounding from measurement error and batch effects, Parsana et al. (2019) proposed the principal components correction (PC-correction) for removing L_Σ. Consider observed data X_obs = X + AZ, where X ∼ N(0, Σ) and Z ∼ N(0, I_r), and A ∈ R^{p×r} is non-random, so that L_Σ = AA^T. In general, additional structural assumptions are needed to distinguish L_Σ from Σ. If the norm of L_Σ is large relative to that of Σ, then under mild conditions L_Σ is close to the sum of the first few eigencomponents of Σ_obs. Parsana et al. (2019) thus proposed removing the first r eigencomponents from Σ_obs. The PCA+GGM method is described in Procedure 1.
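A minimal numerical sketch of Procedure 1 (a hypothetical example, not from the paper: the chain-graph precision matrix, the dense rank-one confounder direction, and the confounder variance are all illustrative choices) removes the leading eigencomponent of the sample covariance and then runs graphical lasso on the corrected matrix:

```python
import numpy as np
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(1)
p, n, r = 10, 2000, 1

# Sparse "signal" precision matrix: a chain graph (tridiagonal, PD).
Omega = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Omega)

v = np.ones(p) / np.sqrt(p)   # dense confounder direction
sigma2 = 50.0                 # confounder variance, large relative to ||Sigma||_2
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X_obs = X + np.sqrt(sigma2) * rng.standard_normal((n, 1)) * v

S_obs = X_obs.T @ X_obs / n
vals, vecs = np.linalg.eigh(S_obs)            # eigenvalues in ascending order
# Steps 1-2: subtract the r leading eigencomponents of S_obs.
S_hat = S_obs - vecs[:, -r:] * vals[-r:] @ vecs[:, -r:].T
# Step 3: standard GGM inference on the corrected covariance
# (a small ridge keeps the rank-deficient S_hat strictly positive definite).
_, Omega_hat = graphical_lasso(S_hat + 1e-6 * np.eye(p), alpha=0.05)
```

Removing the top eigencomponent also removes whatever part of Σ lies along v, which is the trade-off quantified by the λ_k(Σ)‖θ_1θ_1^T‖_∞ term analyzed in section 3.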
We demonstrate the effectiveness of Procedure 1 both theoretically and in simulations in section 3 and section 4. The number of principal components needs to be determined a priori and we discuss the rank selection in section 3.5.

PCA+LVGGM Strategies
Although LVGGM and PCA+GGM address the same problem, they have different motivations. As described in the previous sections, identifiability in LVGGM requires that L_Ω be sufficiently dense, whereas identifiability in PCA+GGM requires that the norm of L_Σ dominate the norm of Σ. In many cases, it is likely that the data are corrupted by multiple sources of confounding. For example, in the biological application discussed in section 5.1, both batch effects and unmeasured biological variables likely confound estimates of the graph structure between observed variables.
As (4) illustrates, the observed precision matrix may be corrupted by a latent factor L_Ω:

Ω_obs = Ω − L_Ω.   (8)

Rewriting (8) in terms of Σ_obs = Ω_obs^{-1} and Σ = Ω^{-1} by applying the Sherman-Morrison identity,

Ω_obs^{-1} = (Ω − L_Ω)^{-1} = Σ + L̃_Ω,   (9)

where L̃_Ω is still a low-rank matrix. If the data are further corrupted by an additive noise factor with covariance L_Σ, the observed covariance matrix becomes

Σ_obs = Σ + L̃_Ω + L_Σ.   (10)

In this setting, if the norm of L_Σ is much larger than those of Σ and L̃_Ω, then removing L_Σ using PC-correction is likely to be effective. If the norm of L_Ω is not much larger than that of Σ, then Ω and L_Ω can be well estimated by LVGGM. Since in (10) the overall confounding L̃_Ω + L_Σ is the sum of two low-rank components with different norms, we can use both methods: first remove L_Σ via eigendecomposition, then apply LVGGM to estimate Ω and L_Ω. We call this procedure PCA+LVGGM; it is summarized in Procedure 2. We discuss how to determine the ranks r_P and r_L in section 3.5.
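The layered decomposition above can be checked numerically. The sketch below (all matrices are illustrative, randomly generated) verifies that a rank-one latent-node term subtracted from the precision matrix induces, by Sherman-Morrison, a rank-one additive term on the covariance side, on top of which an additive-noise confounder is then layered:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 8

A = rng.standard_normal((p, p))
Omega = A @ A.T + p * np.eye(p)          # conditional precision matrix, PD
u = rng.standard_normal((p, 1))
L_Om = 0.5 * (u @ u.T) / (u.T @ u)       # rank-one latent-node effect, ||L_Om||_2 = 0.5

Omega_marg = Omega - L_Om                # observed precision before additive noise
L_Om_tilde = np.linalg.inv(Omega_marg) - np.linalg.inv(Omega)  # covariance-side term

w = rng.standard_normal((p, 1))
L_Sig = w @ w.T                          # additive-noise confounder (rank one)
Sigma_obs = np.linalg.inv(Omega) + L_Om_tilde + L_Sig

# L_Om_tilde is exactly rank one, by the Sherman-Morrison identity.
```

The two low-rank corrections live on different sides of the inversion, which is why a single removal strategy may not handle both.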

Theoretical Analysis and Model Comparisons
In this section, we investigate the theoretical properties of PCA+GGM: we derive its convergence rate and compare it to that of LVGGM. The theoretical analysis of Parsana et al. (2019) shows that the low-rank confounder can be well estimated by PC-correction when the number of features p → ∞ with the number of observations n fixed. We instead provide a non-asymptotic analysis in both p and n, and our result shows that the graph can be recovered exactly as n → ∞ with p fixed.

Convergence Analysis on PCA+GGM
Without loss of generality, we consider the case of a rank-one confounder. Assume that we observe a random sample of p-dimensional vectors

X_obs^{(i)} = X^{(i)} + σ v Z^{(i)},  i = 1, ..., n,   (11)

where Cov(X^{(i)}) = Σ and Z^{(i)} is a univariate standard normal random variable, v ∈ R^p is a non-random vector with unit norm, and σ is a non-negative scalar constant. Without loss of generality, we assume that X^{(i)} ⊥ Z^{(i)}. To see how v affects estimation, we assume that v is the k-th eigenvector of Σ. The covariance matrix of X_obs^{(i)} is therefore

Σ_obs = Σ_{−k} + (λ_k(Σ) + σ²) v v^T,   (12)

where Σ_{−k} is the matrix Σ without its k-th eigencomponent and λ_k(Σ) is the k-th eigenvalue of Σ. When σ² > λ_1(Σ), λ_k(Σ) + σ² becomes the first eigenvalue of Σ_obs, and v is the corresponding first eigenvector. We remove the first principal component from the sample covariance matrix Σ̂_obs:

Σ̂ = Σ̂_obs − λ̂_1 θ̂_1 θ̂_1^T,   (13)

where λ̂_1 and θ̂_1 are the first eigenvalue and first eigenvector of Σ̂_obs. We then use Σ̂ to estimate Ω. We first show that, under mild conditions, Σ̂ is close to Σ. Following Bickel et al. (2008b, 3.1), we assume there exists a constant M such that

max_i (Σ_obs)_{ii} ≤ M.   (14)

Theorem 1. Let λ_i be the i-th eigenvalue of Σ_obs and ν = λ_1 − λ_2 be the eigengap of Σ_obs. Suppose Σ_obs satisfies condition (14) and X_obs^{(i)} is generated as in (11), and assume further that σ² > λ_1(Σ). If n ≥ p and ‖Σ‖_2 ((ν+1)/ν²) √(p/n) ≤ 1/128, then, with probability greater than 1 − C_4 p^{−6} − C_5 e^{−p}, ‖Σ̂ − Σ‖_∞ is bounded by a sum of terms of order √(p/n) together with the additional term λ_k(Σ) ‖θ_1 θ_1^T‖_∞. The constants C_1 and C_2 depend on M, C_3 depends on λ_1(Σ), and C_4, C_5 > 0.
Proof. By (12) and (13),

Σ̂ − Σ = (Σ̂_obs − Σ_obs) + (λ_1 θ_1 θ_1^T − λ̂_1 θ̂_1 θ̂_1^T) − λ_k(Σ) θ_1 θ_1^T,

where λ_k is the k-th eigenvalue of Σ_obs and θ_k is the k-th eigenvector of Σ_obs. At a high level, we bound ‖Σ̂ − Σ‖_∞ by bounding the norms of Σ̂_obs − Σ_obs, λ_1 − λ̂_1, and θ_1 − θ̂_1. The details of the complete proof are in Appendix A.1.
The bound in Theorem 1 can be further simplified to C_s √(p/n) + λ_k(Σ) ‖θ_1 θ_1^T‖_∞ for some large constant C_s. We express it in the fuller form because it provides more insight into how each term affects the result.
After obtaining Σ̂, we can use Glasso, CLIME (Cai et al., 2011), or any other sparse GGM estimation approach to estimate Ω; a good estimate of Ω is possible when ‖Σ̂ − Σ‖_∞ is not too large. With the same input Σ̂, the theoretical convergence rate of the CLIME estimate is of the same order as that of the Glasso estimate. Since the derivation of the Glasso error bound requires irrepresentability and restricted eigenvalue conditions (see Ravikumar et al., 2011), and for reasons of space, we prove edge selection consistency only for CLIME; that is, in the theoretical analysis we apply CLIME after obtaining Σ̂.
The CLIME estimator Ω̂_1 is obtained by solving

minimize ‖Ω‖_1   subject to   ‖Σ̂ Ω − I‖_∞ ≤ λ_n.   (15)

Since Ω̂_1 might not be symmetric, a symmetrization step that keeps, for each pair (i, j), the entry of smaller magnitude between (Ω̂_1)_{ij} and (Ω̂_1)_{ji} yields the final estimate Ω̂.
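For intuition, the CLIME program decouples into one small linear program per column. A minimal sketch with scipy (the 2×2 covariance and the value of λ_n are illustrative; practical implementations use specialized solvers):

```python
import numpy as np
from scipy.optimize import linprog

def clime(S, lam):
    """Solve min ||w||_1 s.t. ||S w - e_i||_inf <= lam for each column i."""
    p = S.shape[0]
    Omega1 = np.zeros((p, p))
    # Split w = w_plus - w_minus with w_plus, w_minus >= 0, so the linear
    # objective sum(w_plus + w_minus) equals ||w||_1 at the optimum.
    c = np.ones(2 * p)
    A_ub = np.vstack([np.hstack([S, -S]),      #  S w <= e_i + lam
                      np.hstack([-S, S])])     # -S w <= -e_i + lam
    for i in range(p):
        e = np.zeros(p)
        e[i] = 1.0
        b_ub = np.concatenate([e + lam, -e + lam])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
        Omega1[:, i] = res.x[:p] - res.x[p:]
    # Symmetrization step: keep the entry of smaller magnitude.
    keep = np.abs(Omega1) <= np.abs(Omega1.T)
    return np.where(keep, Omega1, Omega1.T)

S = np.array([[1.0, 0.3],
              [0.3, 1.0]])
Omega_hat = clime(S, lam=0.01)   # close to inv(S) for small lam
```

With a small λ_n and a well-conditioned input, the estimate stays close to the exact inverse; larger λ_n trades accuracy for sparsity.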
Following Cai et al. (2011), we assume that Ω belongs to the class

U(s_0, M_0) = { Ω ≻ 0 : ‖Ω‖_{L_1} ≤ M_0,  max_{1 ≤ i ≤ p} Σ_{j=1}^{p} 1{Ω_{ij} ≠ 0} ≤ s_0 },   (16)

where we allow s_0 and M_0 to grow as p and n increase. With Σ̂ obtained from equation (13) as the input to (15), we have the following result.
Theorem 2. Suppose Ω ∈ U(s_0, M_0), the conditions of Theorem 1 hold, and the tuning parameter satisfies λ_n ≥ M_0 ‖Σ̂ − Σ‖_∞. Then the symmetrized CLIME estimate satisfies ‖Ω̂ − Ω‖_∞ ≤ 2 M_0 λ_n with probability greater than 1 − C_4 p^{−6} − C_5 e^{−p}, where the C_i are defined as in Theorem 1.
Proof. The main steps follow the proof of Theorem 6 in Cai et al. (2011). The complete proof is in Appendix A.2.
Therefore, if the minimum magnitude of the nonzero entries of Ω is larger than the error bound 2M_0 λ_n, then we obtain exact edge selection with high probability.

Analysis of Theoretical Results
The error bound in Theorem 2 depends on the largest eigenvalue of Σ_obs, the eigengap ν = λ_1(Σ_obs) − λ_2(Σ_obs), the eigenvector of the confounder, and n and p. The term (ν+1)/ν² shows that a larger eigengap ν yields a smaller bound. Recall that when σ² > λ_1(Σ), λ_k(Σ) + σ² becomes the first eigenvalue of Σ_obs; hence if σ² is large compared to λ_1(Σ), the eigengap is large. This observation (a larger eigengap leads to a better convergence rate) is closely related to the concept of effective dimension. For any positive semidefinite matrix M ∈ R^{p×p} we can write the trace constraint

Tr(M) = C ‖M‖_2,   (17)

where C ≥ 1 can be viewed as the effective dimension of M (Meng et al., 2014; Wainwright, 2019). If the matrix is approximately low rank, i.e., its first eigenvalue is much larger than the rest, then C is much smaller than the ambient dimension p, and the p appearing in the error bound O(√(p/n)) can be reduced significantly to the effective dimension C in (17).
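A quick numerical illustration of the effective dimension (the eigenvalue profile below, one dominant eigenvalue among p = 100, is an arbitrary choice):

```python
import numpy as np

p = 100
# PSD matrix with one dominant eigenvalue: approximately rank one.
eigvals = np.concatenate([[50.0], np.full(p - 1, 0.5)])
M = np.diag(eigvals)

C = np.trace(M) / np.linalg.norm(M, 2)   # effective dimension Tr(M) / ||M||_2
# C is about 2, far smaller than the ambient dimension p = 100.
```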
The bound in Theorem 1 also contains the term λ_k(Σ) ‖θ_1 θ_1^T‖_∞, meaning that if the eigenvector of the low-rank component is closely aligned with the first few eigenvectors of Σ, then the bound is larger. This reflects the fact that the first few eigencomponents play a dominant role in determining the structure of Σ and its inverse; estimation may therefore be poor if those eigencomponents are removed by PC-correction.
Note that ‖θ_1 θ_1^T‖_∞ is upper bounded by 1, since θ_1 is an eigenvector and thus has unit Euclidean norm; however, ‖θ_1 θ_1^T‖_∞ can be much smaller than 1 when θ_1 is dense. In the extreme case where every element of θ_1 equals 1/√p, ‖θ_1 θ_1^T‖_∞ is as small as 1/p. This corresponds to a scenario in which the confounder has a widespread effect over all p observed variables, which is in accordance with the denseness requirement on the low-rank component in LVGGM. Both approaches thus lead to the same observation: more "widespread" confounding is easier to remove.
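The two extremes can be checked directly (p = 100 is an arbitrary choice):

```python
import numpy as np

p = 100
theta_dense = np.ones(p) / np.sqrt(p)    # widespread confounder direction
theta_spiky = np.zeros(p)
theta_spiky[0] = 1.0                     # confounder concentrated on one variable

dense_inf = np.max(np.abs(np.outer(theta_dense, theta_dense)))  # about 1/p
spiky_inf = np.max(np.abs(np.outer(theta_spiky, theta_spiky)))  # exactly 1
```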
The previous analysis assumes that the low-rank confounder has rank 1, is independent of X and the eigenvector of the covariance of the low-rank confounding is one of the eigenvectors of Σ. More detail about the general settings can be found in Appendix B.

Comparison with LVGGM
Now we compare LVGGM to PCA+GGM in more detail. The convergence rate of PCA+GGM is of order O(√(p/n)), the same as that of LVGGM. Since √(p/n) is the optimal rate for estimating eigenvectors without further structural assumptions (Wainwright, 2019, Chapter 8), our result achieves the optimal rate.
We also observe that PCA+GGM can be viewed as a supplement to LVGGM: the assumptions of PCA+GGM can be well satisfied when those of LVGGM cannot. In (12), now let v be the k-th eigenvector of Ω (and thus the (p − k + 1)-th eigenvector of Σ). The Sherman-Morrison identity gives

Ω_obs = (Σ + σ² v v^T)^{-1} = Ω − (λ_k(Ω)² / (λ_k(Ω) + 1/σ²)) v v^T,   (18)

so L_Ω = (λ_k(Ω)² / (λ_k(Ω) + 1/σ²)) v v^T, and as σ increases, the coefficient λ_k(Ω)² / (λ_k(Ω) + 1/σ²) increases. In later simulations, we observe that LVGGM can hardly produce a good estimate when v is closely aligned with the first few eigenvectors of Ω (and therefore the last few eigenvectors of Σ). This observation is consistent with the conclusion of Agarwal et al. (2012), who impose a spikiness condition, weaker than the incoherence condition of Chandrasekaran et al. (2012), requiring that ‖L_Ω‖_∞ not be too large. Equation (18) shows that L_Ω tends to have a larger norm when v is aligned with the first few eigenvectors of Ω and σ is large, since in that case λ_k(Ω), and hence the coefficient above, is large. A large norm of L_Ω means the spikiness condition is poorly satisfied, so the error bound of LVGGM is large. Note, however, that the first few eigenvectors of Ω are the last few eigenvectors of Σ; our analysis shows that the error bound of PCA+GGM is small precisely when v is aligned with the first few eigenvectors of Ω and σ is large.
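Identity (18) is straightforward to verify numerically (the random Ω and σ² = 10 below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 6
A = rng.standard_normal((p, p))
Omega = A @ A.T + np.eye(p)              # precision matrix, PD
lam, vecs = np.linalg.eigh(Omega)        # ascending eigenvalues
k = p - 1                                # use the leading eigenvector of Omega
v, lam_k = vecs[:, k], lam[k]            # (i.e., the trailing eigenvector of Sigma)

sigma2 = 10.0
Sigma = np.linalg.inv(Omega)
lhs = np.linalg.inv(Sigma + sigma2 * np.outer(v, v))
c = lam_k**2 / (lam_k + 1.0 / sigma2)    # coefficient from the identity
rhs = Omega - c * np.outer(v, v)
# lhs and rhs agree to numerical precision.
```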

PCA+LVGGM
In this section we discuss the PCA+LVGGM method briefly, using the same formulation as (8) to (10). We claim that PCA+LVGGM outperforms PCA+GGM or LVGGM used individually when the norm of L_Σ is large compared to those of L_Ω and Σ, the eigenvectors of L_Σ are not aligned with the first few eigenvectors of Σ, and the norm of L_Ω is not significantly larger than that of Σ. This is because, by Theorems 1 and 2, PCA+GGM is effective only when the norm of the low-rank confounding is large relative to that of the signal: PC-correction can effectively remove L_Σ but not L_Ω, whose norm is not significantly larger than that of Σ. In contrast, LVGGM can estimate L_Ω well, but not L_Σ, because L_Σ has a larger norm and its eigenvectors might be aligned with the first few eigenvectors of Ω.

Tuning Parameter Selection
In both LVGGM and PCA+GGM there are crucial tuning parameters to select. For LVGGM, recall that λ controls the sparsity of Ω and γ controls the rank of L_Ω. Chandrasekaran et al. (2012) argue that λ should be proportional to √(p/n), the rate in the convergence analysis, and that γ should be chosen among a range of values that keeps the graph structure of Ω̂ stable (see Chandrasekaran et al., 2011, for more detail).
When using PCA+GGM, we need to determine the rank first (i.e. how many principal components should be removed). Leek and Storey (2007) and Leek et al. (2012) suggest using the sva function from Bioconductor, which is based on parallel analysis (Horn, 1965;Buja and Eyuboglu, 1992;Lim and Jahng, 2019). Parallel analysis compares the eigenvalues of the sample correlation matrix to the eigenvalues of a random correlation matrix for which no factors are assumed. Given the number of principal components to remove, we can use model selection tools such as AIC, BIC or cross-validation to choose the sparsity parameter in Glasso. One may also decide how many principal components to remove by considering the eigenvalues of the observed covariance matrix (see section 3.1 and section 3.2).
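A stripped-down sketch of parallel analysis (hypothetical two-factor data; tools such as sva compare against a quantile over many permutations rather than the single permutation used here):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, r_true = 300, 20, 2

# Observed data: white noise plus two strong latent factors.
X = rng.standard_normal((n, p))
B = rng.standard_normal((n, r_true)) @ rng.standard_normal((r_true, p))
X_obs = X + 3.0 * B

def corr_eigvals(M):
    # Eigenvalues of the sample correlation matrix, in decreasing order.
    return np.sort(np.linalg.eigvalsh(np.corrcoef(M, rowvar=False)))[::-1]

obs = corr_eigvals(X_obs)
# Null reference: permute each column independently to destroy correlation.
perm = np.column_stack([rng.permutation(X_obs[:, j]) for j in range(p)])
null = corr_eigvals(perm)

r_hat = int(np.sum(obs > null))          # keep components that beat the null
```

Here the estimated number of components matches the two planted factors; with weaker confounding the single-permutation version becomes noisier, which is why quantiles over many permutations are used in practice.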
The PCA+LVGGM method has three tuning parameters: the rank of L_Σ, γ, and λ. We first inspect the eigenvalues, or use the sva package, to determine the total rank of the low-rank component L_Σ + L_Ω. We then partition this total rank between L_Σ and L_Ω: if we determine that rank(L_Σ + L_Ω) = k, we look for an eigengap within the first k eigenvalues and allocate the largest m < k eigenvalues to PC-removal. As we will see in later sections, in many applications domain knowledge can be used to motivate the number of components for PC-removal. After removing the principal components, we choose γ in LVGGM so that L_Ω has approximately rank k − m. We observe that, when running LVGGM with a fixed γ, the rank does not change over a range of λ values; it therefore suffices to fix γ first to control the rank and then tune λ to control the sparsity.
Practically, network estimation is often used for exploratory data analysis and hypothesis generation. For these purposes, model selection methods such as AIC, BIC, or cross-validation tend to choose models that are too dense (Danaher et al., 2014), a pattern we also observe in our experiments. We therefore recommend that the selection of tuning parameters be driven by prior knowledge and the practical goals of the application, such as network interpretability and stability, or identification of important edges with low false discovery rate (Meinshausen and Bühlmann, 2010). For biological applications, for example, the model should be biologically plausible, complex enough to capture important structure, and sparse enough to be interpretable. In this context, a robustness analysis can be used to explore how edges change over a range of tuning parameters.

Simulations
In this section, numerical experiments illustrate the utility of each sparse plus low-rank method. In section 4.1 we illustrate the behavior of Glasso, LVGGM, and PCA+GGM under different assumptions about rank-one confounding. In section 4.2 we show the efficacy of PCA+LVGGM in a variety of simulation scenarios. In all experiments we set p = 100 and use the scale-free and random networks from the huge.generator function in the R package huge. Due to space limits, we only include results for the scale-free structure.

The Efficacy of LVGGM and PCA+GGM
We compare the relative performance of PCA+GGM, LVGGM, and Glasso in the presence of a rank-one confounder L. Guided by our analysis in section 3, we show that the relationship between L and the eigenstructure of Σ determines the relative performance of the three methods. We generate the data as

X_obs^{(i)} = X^{(i)} + σ V Z^{(i)},

where X^{(i)} ∈ R^p is normally distributed with mean zero and covariance matrix Σ, and Z^{(i)} ∈ R^r, the low-rank confounder, follows a normal distribution with mean zero and identity covariance matrix. V ∈ R^{p×r} is a non-random semi-orthogonal matrix satisfying V^T V = I, and σ ∈ R represents the magnitude of the confounder. Without loss of generality, we assume that X^{(i)} and Z^{(i)} are independent. We illustrate the performance of the different methods under various choices of V, the eigencomponents of L.
We first set r = 1, p = 100, and n = 200. The largest eigenvalue of Σ is around 5. We use v_i to denote the i-th eigenvector of Σ. When examining the effect of σ, we choose the 95-th eigenvector of Σ as V, to ensure that V is not closely aligned with the first few eigenvectors of Σ, and compare the cases σ² = 20 and σ² = 3. Next, we examine the effect of the eigenvectors: we fix σ² at 20 and use the i-th eigenvector of Σ as V, for i ∈ {1, 60, 95}; following the previous notation, these are v_1, v_60, and v_95. The rank for both PC-correction and LVGGM is set to 1. We generate ROC curves (Hastie et al., 2009, 9.2.5) for each method based on 50 simulated samples and use the averages to draw the ROC curves (Figure 1). We truncate the ROC curves at FPR = 0.2, since estimates with large FPR are typically less useful in practice.
From Figure 1, we observe that when the confounder has large norm and its eigenvectors are not closely aligned with the first few eigenvectors of Σ, PCA+GGM performs better than LVGGM and Glasso. LVGGM performs best when the confounder has large norm and its eigenvectors are not aligned with the last few eigenvectors of Σ (equivalently, the first few eigenvectors of Σ^{-1}). When the low-rank component does not have a large norm, Glasso also performs well.

The Efficacy of PCA+LVGGM
In this section, we use multiple examples to demonstrate the efficacy of PCA+LVGGM. We corrupt the signal with two low-rank confounders, generating the data as

X_obs^{(i)} = X^{(i)} + V_1 D_1 Z_1^{(i)} + V_2 D_2 Z_2^{(i)},

where X^{(i)} ∈ R^p is normally distributed with mean zero and covariance matrix Σ. Z_1^{(i)} ∈ R^{d_1}, corresponding to the first source of low-rank confounding, has a normal distribution with mean zero and covariance matrix I_{d_1}; V_1 ∈ R^{p×d_1} is a non-random semi-orthogonal matrix satisfying V_1^T V_1 = I, and D_1 ∈ R^{d_1×d_1} is a diagonal matrix measuring the magnitude of the first confounder. Similarly, Z_2^{(i)} ∈ R^{d_2}, corresponding to the second source of low-rank confounding, has a normal distribution with mean zero and covariance matrix I_{d_2}; V_2 ∈ R^{p×d_2} is a semi-orthogonal matrix satisfying V_2^T V_2 = I, and D_2 ∈ R^{d_2×d_2} is a diagonal matrix measuring the magnitude of the second confounder.

Figure 1: n = 200; the low-rank component has rank 1. σ² is the magnitude of the low-rank component, and v_i is the i-th eigenvector of Σ. The first row illustrates the effect of σ: we use the 95-th eigenvector of Σ as V and set σ² to 20 and 3, from left to right. The second row illustrates the effect of V: we fix σ² at 20 and use the 60-th and the first eigenvector of Σ as V, from left to right. PCA+GGM works best when σ² is large and V is not aligned with the first eigenvector of Σ; LVGGM works best when σ² is large and V is not aligned with the last eigenvector of Σ. When σ² is small, Glasso works as well as the other two.
Without loss of generality, we assume that X^(i), Z_1^(i) and Z_2^(i) are pairwise independent. Hence the observed covariance matrix is

Σ_obs = Σ + V_1 D_1^2 V_1^T + V_2 D_2^2 V_2^T.

Our first simulation setup (case 1) is an ideal case for PCA+LVGGM, in which it performs much better than PCA+GGM, LVGGM, or Glasso. Let d_1 = d_2 = 3, and set p = 100 and n = 100. In this first example, the columns of V_1 and V_2 come from the eigenvectors of Σ. We intend for PC-correction to remove L_2 = V_2 D_2^2 V_2^T, so we set all diagonal elements of D_2^2 to 50 and use the last 3 eigenvectors of Σ as V_2; this guarantees that PC-correction performs much better than LVGGM and Glasso at removing L_2. We then use LVGGM to estimate L_1 = V_1 D_1^2 V_1^T, which requires a moderately large magnitude; we set all diagonal elements of D_1^2 to 20 and use the first 3 eigenvectors of Σ as V_1, which ensures that LVGGM performs better than PC-correction and Glasso at estimating L_1.
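The generative model above can be sketched in a few lines of Python. This is a minimal sketch of the case-1-style setup: the diagonal Σ below is only a placeholder (the paper's simulations use precision matrices from scale-free graphs), and `simulate_confounded` is a hypothetical helper name, not code from the paper.

```python
import numpy as np

def simulate_confounded(n, p, Sigma, V1, D1, V2, D2, rng):
    """Draw n observations X_obs = X + V1 D1 Z1 + V2 D2 Z2."""
    d1, d2 = D1.shape[0], D2.shape[0]
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Z1 = rng.standard_normal((n, d1))
    Z2 = rng.standard_normal((n, d2))
    return X + Z1 @ D1 @ V1.T + Z2 @ D2 @ V2.T

# Case-1-style setup: confounder eigenvectors taken from Sigma itself.
rng = np.random.default_rng(0)
p, n, d = 100, 100, 3
Sigma = np.diag(np.linspace(3.0, 1.0, p))   # placeholder true covariance
evals, evecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
V1 = evecs[:, ::-1][:, :d]                  # first 3 eigenvectors of Sigma
V2 = evecs[:, :d]                           # last 3 eigenvectors of Sigma
D1 = np.sqrt(20.0) * np.eye(d)              # diagonal of D1^2 is 20
D2 = np.sqrt(50.0) * np.eye(d)              # diagonal of D2^2 is 50
X_obs = simulate_confounded(n, p, Sigma, V1, D1, V2, D2, rng)
```

Because V1 and V2 are semi-orthogonal, the population covariance of `X_obs` is exactly Σ + V_1 D_1^2 V_1^T + V_2 D_2^2 V_2^T.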
Using the sva package, we estimate the rank of L_1 + L_2 to be 6. Looking at the eigenvalues of the observed sample covariance matrix, the first 3 eigenvalues are much larger than the 4-th through 6-th (top row of Figure 2). We therefore allocate rank 3 to PC-correction and 6 − 3 = 3 to LVGGM. We also try allocating 1 to PC-correction and 5 to LVGGM. We then compare additional approaches: PC-correction alone, removing either 3 or 6 principal components; LVGGM with rank 6 for the low-rank component; and the uncorrected Glasso. We again use 50 datasets and draw the ROC curve of the averages over varying sparsity parameters λ. The ROC for the scale-free example is in the bottom row of Figure 2, and we also report the AUC (area under the ROC curve) for each method. To compare PCA+LVGGM with rank 3 in PC-correction against the other methods, for each dataset we compute (AUC of PCA+LVGGM)/(AUC of the other method), then report the sample mean and sample standard deviation of this ratio over the 50 datasets. The results for the scale-free graph are shown in the first column of Table 1. PCA+LVGGM with rank 3 for PC-correction and rank 3 for LVGGM performs much better than the other methods for both graph structures, indicating that when the assumptions are satisfied, our method and parameter-tuning procedure are useful.
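The rank-allocation step just described (a total rank estimate from sva, split where the eigenvalue spectrum shows a visible gap) can be sketched as follows. `split_rank` is a hypothetical helper implementing the eigen-gap heuristic, not the paper's exact procedure, and the toy spectrum mimics the case-1 setup.

```python
import numpy as np

def split_rank(sample_cov, total_rank):
    """Heuristically split an estimated confounder rank between
    PC-correction and LVGGM at the largest eigenvalue gap."""
    evals = np.linalg.eigvalsh(sample_cov)[::-1]   # descending eigenvalues
    top = evals[:total_rank + 1]
    gaps = top[:-1] - top[1:]                      # consecutive eigenvalue gaps
    k_pca = int(np.argmax(gaps)) + 1               # cut after the largest gap
    return k_pca, total_rank - k_pca

# Toy spectrum mimicking the paper's setup: 3 dominant eigenvalues,
# 3 moderate ones, then the bulk.
evals = np.array([50., 48., 45., 20., 19., 18.] + [2.] * 94)
k_pca, k_lvggm = split_rank(np.diag(evals), total_rank=6)
print(k_pca, k_lvggm)   # -> 3 3
```

With this spectrum the largest gap among the leading eigenvalues sits between the 3rd and 4th, so the heuristic reproduces the 3 + 3 allocation used in the text.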
Finally, we try setups that are more similar to real-world data. We still use (19) to generate the data, with p = 100 and n = 100. Unlike the previous settings, we now use randomly generated eigenvectors as the columns of V_1 and V_2. We examine the eigenvalue distributions of the covariance matrices of the gene co-expression and stock return data, and design the simulation settings to resemble those examples. We run two setups. The first, a large-magnitude case (case 2), takes D_1^2 diagonal with elements (7, 6, 6) and D_2^2 diagonal with elements (20, 10, 10). The second, a moderately-large-magnitude case (case 3), uses the same eigenvectors for the low-rank component but smaller magnitudes, with diagonal elements (3, 3, 3) for D_1^2 and (10, 8, 6) for D_2^2. Using the sva package, we estimate the rank of L_1 + L_2 to be 6 in both case 2 and case 3. The first 3 eigenvalues are larger than the rest, so we allocate 3 to PC-correction and use 6 − 3 = 3 as the rank of the low-rank component in LVGGM. We also try allocating 1 to PC-correction and 5 to LVGGM, PC-correction alone removing 3 or 6 principal components, LVGGM with rank 6, and Glasso. Again, we run over 50 datasets and report ROC curves and AUC tables.

Figure 2: We use the scale-free structure when generating graphs. The first row shows the eigenvalues of Σ_obs under the 3 setups, and the second row shows the corresponding ROC curves for the different methods. PC(k) means that we use rank k in PC-correction, and PC(k)+LV means PCA+LVGGM with rank k for PC-correction.
From Figure 2 and Table 1, we see that PCA+LVGGM with rank 3 in PC-correction performs best in cases 1 and 2, while in case 3 the confounding magnitude is smaller and PC-correction with rank 3 performs best.

Table 1: We use the scale-free structure when generating graphs. We compute the ratio of AUC between PCA+LVGGM with rank 3 in PC-correction and each other method, using PCA+LVGGM as the numerator. The table shows the sample mean and sample standard deviation of this ratio (in parentheses) over 50 datasets. In case 3, the magnitude of the confounding is not as large as in the other cases, so PC-correction with rank 3 has the best performance.
Note that LVGGM can be obtained from PCA+LVGGM by allocating rank 0 to PC-correction. From the simulation and real-data examples, we observe that using PCA+GGM with higher ranks often removes useful information, resulting in more false negatives. On the other hand, if the data contain effects of multiple confounders that are not well represented by the first few principal components, PCA+GGM alone may not remove these additional sources of noise. Likewise, LVGGM alone may not remove confounding with large norm, leading to spurious connections between nodes. We therefore suggest PCA+LVGGM as a default setting and a starting point for problems with low-rank confounding; the rank allocation can then be adjusted to the specific problem and goals of interest.

Gene Co-expression Networks
Our first application is to reanalyze the gene co-expression networks originally analyzed by Parsana et al. (2019). The goal of gene co-expression network analysis is to identify transcriptional patterns indicating functional and regulatory relationships between genes. Inferring the structure of these networks is of great biological interest; however, constructing them from data is challenging, since the data is usually corrupted by technical and unwanted biological variability known to confound expression data. Such artifacts can introduce spurious correlation between genes: if we apply sparse precision matrix inference directly without addressing confounding, we may obtain a graph with many false-positive edges. Parsana et al. (2019) use PCA+GGM to estimate this network and show that PC-correction can effectively control the false discovery rate of the network. In practice, however, some effects of confounding may not be represented in the top few principal components. This motivates the more flexible PCA+LVGGM approach: PC-correction removes high-variance confounding, and LVGGM subsequently accounts for any remaining low-rank confounding. We consider gene expression data from 3 diverse tissues (blood, lung and tibial nerve), with sample sizes between 300 and 400 each; 1000 genes are chosen from each tissue. More details about the source of the data and the pre-processing steps are given in Appendix C.
Looking at the eigenvalues of the gene covariance matrices, we observe that all of the sample covariance matrices are approximately low-rank, indicating the potential existence of high-variance confounding. We use the sva package to estimate the rank for PC-correction and call this the full sva rank correction. Parsana et al. (2019) suggest that the rank estimated by sva might be so large that some useful network signal is removed. To reduce the effect of over-correction, we also apply PC-correction with half and one quarter of the sva rank, which we refer to as half sva rank correction and quarter sva rank correction, respectively. For many tissues, the first eigenvalue is much larger than the rest, which motivates us to also try rank-1 PC-correction. We include the results with half sva rank, quarter sva rank and rank-1 PC-corrections in Figure 3. After running these PC-corrections to remove high-variance confounding, we run LVGGM as an additional step to further estimate and remove the low-rank noise with moderate variance. We use two different values for the γ parameter in LVGGM: larger γ removes lower-rank confounding and smaller γ removes higher-rank confounding. We show the results for both choices of γ. We use different λ to control the sparsity of the estimated graph, and Figure 3 is drawn similarly to a precision-recall plot: the y-axis is the precision (true positives/(true positives + false positives)) and the x-axis is the number of true positives. PCA+LVGGM performs best or equivalently well compared to the other approaches for almost all 3 tissues, indicating that it can be useful to run LVGGM after PC-correction when estimating gene co-expression networks.
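As a rough sketch of the pipeline's first stage, the snippet below removes the top principal components from the data and then fits a sparse precision matrix. It uses scikit-learn's `GraphicalLasso` as a stand-in for the paper's Glasso step; the dense rank-1 confounder here is synthetic, and `pc_correct` is a hypothetical helper.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def pc_correct(X, k):
    """Remove the top-k principal components from the centered data X (n x p)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt: principal directions
    proj = Vt[:k].T @ Vt[:k]                           # projector onto the top-k directions
    return Xc - Xc @ proj

rng = np.random.default_rng(1)
n, p = 300, 50
X = rng.standard_normal((n, p))
X = X + 5.0 * rng.standard_normal((n, 1))   # add a dense rank-1 confounder
X_corr = pc_correct(X, k=1)

# Sparse precision estimation on the corrected data (alpha plays the role of lambda).
model = GraphicalLasso(alpha=0.5).fit(X_corr)
edges = np.abs(model.precision_) > 1e-8      # estimated conditional dependencies
```

After correction, the leading singular value of the data drops sharply, since the confounder dominated the top principal direction.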

Stock Return Data
In finance, the Capital Asset Pricing Model (CAPM) states that there is a widespread market factor which dominates the movement of all stock prices. Empirical evidence for the market trend can be found in the first principal component of the stock data, which is dense and has approximately equal loadings across all stocks (Figure 5, left). In fact, the first few eigenvalues of the stock correlation matrix are significantly larger than the rest (Fama and French, 2004), suggesting that only a few latent factors drive stock correlations.
In this section, we posit that the conditional dependence structure after accounting for these latent effects is more likely to reflect direct relationships between companies, aside from the market and, perhaps, sector trends. Our interest is in recovering the undirected graphical model (conditional dependence) structure between stock returns after controlling for potential low-rank confounders.
We compare networks inferred by PCA+LVGGM, PCA+GGM, LVGGM and Glasso by analyzing monthly returns of component stocks in the S&P 100 index between 2008 and 2019 (Hayden et al., 2015). The 49 chosen companies are in 6 sectors: technology (10 companies), finance (11), energy (7), health (8), capital goods (7) and non-cyclical stocks (6). For PCA+GGM, we remove the first eigenvector, which corresponds to the overall market trend. For the other latent-variable methods, we use the sva package to identify a plausible rank. For PCA+LVGGM, we remove the first principal component corresponding to the overall market trend and use LVGGM to estimate the remaining latent confounders and the graph. Figure 4 shows the networks obtained by each approach.
For each method, the sparsity-inducing tuning parameter is chosen to minimize the negative log-likelihood using 6-fold cross-validation, and the number of low-rank components is chosen manually. For LVGGM, allocating rank 1 or 2 to the low-rank component produces estimates barely different from Glasso, while allocating rank higher than 6 leads to higher out-of-sample error, so rank 5, the rank picked by sva, is among the best choices. For PCA-based methods, removing more than one principal component leads to higher out-of-sample error. As expected, the Glasso result is denser than the networks learned with the sparse-plus-low-rank methodology, with PCA+LVGGM yielding the sparsest network.
For LVGGM, we note that the method effectively controls for the sector effect but is less effective at controlling for the overall market trend. Let Σ̂_obs be the empirical observed covariance matrix and v̂_i its i-th eigenvector. We make the following observations. First, the first principal component is closely aligned with the overall market trend: the absolute inner product between the first eigenvector of Σ̂_obs and the normalized all-ones vector is 0.98. Second, the observed empirical covariance matrix has an approximately low-rank structure: the first eigenvalue of Σ̂_obs is 18.25, the second is 3.5, and all other eigenvalues are close to or smaller than 1.
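The two diagnostics used above (alignment of the leading eigenvector with the normalized all-ones vector, and the eigenvalue decay) are straightforward to compute. The snippet below checks them on synthetic returns with a planted market factor; `market_alignment` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def market_alignment(cov):
    """Alignment of the leading eigenvector with the normalized all-ones
    (market) vector, plus the top three eigenvalues."""
    evals, evecs = np.linalg.eigh(cov)
    v1 = evecs[:, -1]                              # leading eigenvector
    ones = np.ones(cov.shape[0]) / np.sqrt(cov.shape[0])
    return abs(v1 @ ones), evals[::-1][:3]

# Synthetic returns: one dominant market factor plus idiosyncratic noise.
rng = np.random.default_rng(3)
n, p = 500, 49
market = rng.standard_normal((n, 1))
R = 0.6 * market + 0.2 * rng.standard_normal((n, p))
align, top3 = market_alignment(np.cov(R, rowvar=False))
```

When the market factor dominates, `align` is close to 1 and the first eigenvalue dwarfs the rest, mirroring the 0.98 alignment and the 18.25 versus 3.5 eigenvalues reported for the stock data.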
Third, LVGGM does not capture the full effect of the market trend. Let L̂_Ω be the estimate of L_Ω in (9). When we apply LVGGM to Σ̂_obs, the inner product between the first eigenvector of L̂_Ω and v̂_1 is close to 1, but the first eigenvalue of L̂_Ω is only 0.55, much smaller than the first eigenvalue of Σ̂_obs. The rank-one approximation to L̂_Σ is close to a constant matrix; in contrast, L̂_Ω reflects sector effects but not the strong effect of the overall market trend.
We argue that PCA+LVGGM is the most appropriate method for this application because it controls for both market and sector effects. Let L̂_Σ and L̂_Ω be the estimates of the low-rank components defined in (10). For PCA+LVGGM, we remove L_Σ by removing the first eigencomponent of Σ̂_obs, then run LVGGM to estimate L_Ω and Ω. We claim that PCA+LVGGM fully removes the confounding effect in the market-trend direction, as well as the remaining confounding effect in other directions. To see this, first, v̂_1 is removed in PC-correction. Second, the inner product between the first eigenvector of L̂_Ω and v̂_2, the second eigenvector of Σ̂_obs, is 0.99, while the first eigenvalue of L̂_Ω is 0.4 and the second eigenvalue of Σ̂_obs is 3.5; thus, when applying LVGGM, only part of the information in the direction of v̂_2 is removed. The direction of v̂_1 reflects the market trend, but v̂_2 may contain both true graph information and some latent confounding, so LVGGM may be a good choice for capturing the confounding effect in the direction of v̂_2. Overall, PCA+LVGGM may therefore be a better choice than LVGGM or the PCA-based method. Figure 5 shows heat maps of L̂_Σ obtained via PCA and L̂_Ω obtained with LVGGM (rank 5). As expected, the elements of L̂_Σ are roughly equal in magnitude, reflecting the market trend and the large first eigenvalue of Σ̂_obs. In contrast, L̂_Ω shows a block-diagonal structure and its elements have smaller magnitudes, suggesting that LVGGM does not adequately account for the overall market trend.
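The first stage of PCA+LVGGM described above, removing the leading eigencomponent of Σ̂_obs before the LVGGM fit, can be sketched at the covariance level. `remove_top_eigencomponent` is a hypothetical helper, and the eigenvalues below reuse the paper's reported values (18.25 and 3.5) purely for illustration.

```python
import numpy as np

def remove_top_eigencomponent(cov, k=1):
    """Subtract the top-k eigencomponents lambda_j v_j v_j^T from a covariance."""
    evals, evecs = np.linalg.eigh(cov)       # ascending eigenvalues
    out = cov.copy()
    for j in range(1, k + 1):
        out -= evals[-j] * np.outer(evecs[:, -j], evecs[:, -j])
    return out

# A covariance with one dominant (market-like) direction.
cov = np.diag([18.25, 3.5, 1.0, 0.8])
corrected = remove_top_eigencomponent(cov, k=1)
print(np.linalg.eigvalsh(corrected).max())   # -> 3.5
```

After this correction, the LVGGM stage only needs to explain the remaining moderate-variance low-rank structure (the sector effects), which matches the division of labor argued for above.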
On the other hand, the block-diagonal structure of L̂_Ω reflects inferred sector effects. PCA+GGM is most effective at reducing confounding from the overall market trend, and LVGGM is more effective at accounting for the remaining confounding, such as the sector effect. PCA+LVGGM, which combines the benefits of PCA and LVGGM, is therefore arguably the most appropriate choice for addressing the latent confounding in this context.

Conclusion
We have studied the problem of estimating the graph structure in Gaussian graphical models when the data is corrupted by latent confounders. We compare two popular methods, PCA+GGM and LVGGM. One of our contributions is to show the connections and differences between these two approaches, both theoretically and empirically. Building on this, we propose a new method, PCA+LVGGM, whose effectiveness is supported by theoretical analysis, simulations and applications. Our analysis also provides guidance on when to use each approach to estimate a GGM in the presence of latent confounders. We believe this guidance can help researchers in many fields, such as finance and biology, facing graph estimation problems with confounders.
There are several future directions. First, the current framework can be extended to other distributions, such as transelliptical distributions (Liu et al., 2012) and the Ising model (Ravikumar et al., 2010; Nussbaum and Giesen, 2019). Second, we can consider more structures; for example, we are interested in what happens if the principal components of the observed data are sparse. Other future work includes applying our methods and analysis in more applications, such as functional magnetic resonance imaging (fMRI) in neuroscience.
Proof. Our proof follows the proof of Wainwright (2019, Corollary 8.7). By the Davis–Kahan theorem, the difference between two eigenvectors is bounded whenever the spectral norm of the difference between the corresponding matrices is bounded, so we start by bounding ‖Σ̂_obs − Σ_obs‖_2. Because σ^2 > λ_1, we have

X_obs^(i) = X^(i) + σ v Z^(i) = X^(i) + σ θ_1 Z^(i),  i = 1, ..., n,

where Z^(i), independent of X^(i), is a standard normal random variable. The true covariance matrix is

Σ_obs = σ^2 θ_1 θ_1^T + Σ,

and the sample observed covariance matrix is

Σ̂_obs = (1/n) Σ_{i=1}^n X_obs^(i) (X_obs^(i))^T.

Expanding this expression using the definition of X_obs^(i), we obtain the perturbation P = Σ̂_obs − Σ_obs:

P = ((1/n) Σ_{i=1}^n X^(i) (X^(i))^T − Σ) + σ (θ_1 w^T + w θ_1^T) + σ^2 ((1/n) Σ_{i=1}^n (Z^(i))^2 − 1) θ_1 θ_1^T =: P_1 + P_2 + P_3,

where w = (1/n) Σ_{i=1}^n Z^(i) X^(i). Let Σ_obs = U Λ U^T be the eigendecomposition of Σ_obs, write U = [θ_1, U_2] and define p̃ = U_2^T P θ_1. Then we can bound the first eigenvector of Σ̂_obs via Wainwright (2019, Theorem 8.5). We start by bounding ‖p̃‖_2. Equation (20) and U_2^T θ_1 = 0 give an expression for p̃. Because the columns of U_2 have unit norm, we obtain a bound on ‖p̃‖_2. Then Cauchy–Schwarz gives