Heterogeneity Adjustment with Applications to Graphical Model Inference

Heterogeneity is an unwanted variation when analyzing aggregated datasets from multiple sources. Though different methods have been proposed for heterogeneity adjustment, no systematic theory exists to justify these methods. In this work, we propose a generic framework named ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) to model, estimate, and adjust heterogeneity from the original data. Once the heterogeneity is adjusted, we are able to remove the biases of batch effects and to enhance the inferential power by aggregating the homogeneous residuals from multiple sources. Under a pervasive assumption that the latent heterogeneity factors simultaneously affect a large fraction of observed variables, we provide a rigorous theory to justify the proposed framework. Our framework also allows the incorporation of informative covariates and appeals to the"Bless of Dimensionality". As an illustrative application of this generic framework, we consider a problem of estimating high-dimensional precision matrix for graphical model inference based on multiple datasets. We also provide thorough numerical studies on both synthetic datasets and a brain imaging dataset to demonstrate the efficacy of the developed theory and methods.


Introduction
Aggregating and analyzing heterogeneous data is one of the most fundamental challenges in scientific data analysis.In particular, the intrinsic heterogeneity across multiple data sources violates the ideal 'independent and identically distributed' sampling assumption and may produce misleading results if it is ignored.For example, in genomics, data heterogeneity is ubiquitous and referred to as either 'batch effect' or 'lab effect'.Microarray gene expression data obtained from different labs at different processing dates may contain systematic variability.More specifically, Leek et al. (2010) analyzed a microarray data from a bladder cancer study and showed that the gene expressions vary significantly across different batches even after data normalization.Furthermore, Leek and Storey (2007) pointed out that heterogeneity across multiple data sources may be caused by unobserved factors that have confounding effects on the variables of interest, generating spurious signals.In finance, it is also known that asset returns are driven by varying market regimes and economy status, which can be regarded as a temporal batch effect.Later in this paper, we will use a brain imaging dataset to show similar heterogeneity effect.Therefore, to properly analyze data aggregated from multiple sources, we need to carefully model and adjust the heterogeneity effect.
Modeling and estimating heterogeneity effect is challenging for two reasons.(i) Typically, we can only access a limited number of samples from an individual group, given the high cost of biological experiment, technological constraint or fast economy regime switching.(ii) The dimensionality can be much larger than the total aggregated number of samples.The past decade has witnessed the development of many methods for adjusting batch effect in high throughput genomics data.See, for example, Sims et al. (2008); Alter et al. (2000); Leek and Storey (2007); Johnson et al. (2007).Though progresses have been made, most of the aforementioned papers focus on the practical side and none of them has a systematic theoretical justification.In fact, most of these methods are developed in a case-by-case fashion and are only applicable to certain problem domains.Thus, there is still a gap that exists between practice and theories.
To bridge this gap, we propose a generic theoretical framework to model, estimate, and adjust heterogeneity across multiple datasets.Formally, we assume the data come from m different sources: the i th data source contributes n i samples, each having p measurements such as gene expressions of an individual or stock returns of a day.To explicitly model heterogeneity, we assume that the batch-specific latent factors f i t influence the observed data X i jt in batch i (j indexes variables; t indexes samples) as in the approximate factor model: where λ i j is unknown factor loading for variable j and u i jt is true uncorrupted signals.The linear term λ i j f i t models the heterogeneity effect.We assume that f i t is independent of u i jt and u i t = (u 1t , . . ., u pt ) shares the same common distribution with mean 0 and covariance Σ p×p across all data sources.In the matrix-form model, (1.1) can be written as where X i is a p × n i data matrix in the i th batch, Λ i is a p × K i factor loading matrix with λ i j in the j th row, F i is an n i × K i factor matrix and U i is a signal matrix of dimension p × n i .Here, we allow the number of latent factors K i to depend on batch i.
To see how model (1.2) models the heterogeneity, we assume f i t ∼ N (0, I) and u i t ∼ N (0, Σ).Then, the t th sample X i t , which is the t th column of X i , follows (1.3) Therefore, the heterogeneity effect is modeled as a low rank component Λ i Λ i of the population covariance matrix of X i t .Later, we will show that, under a pervasive assumption, the heterogeneity component can be estimated by directly applying principal component analysis (PCA) or Projected-PCA, which is more accurate when there are sufficiently informative covariates W i (Fan et al., 2016).Let Λ i F i be the estimated heterogeneity component.We denote U i = X i t − Λ i F i to be the heterogeneity adjusted signal, which can be treated as homogeneous across different datasets and thus can be combined together for downstream statistical analysis.This whole framework of heterogeneity adjustment is termed ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) and is schematically shown in Figure 1.
The proposed ALPHA framework is fully generic and applicable to almost all kinds of multivariate analysis of the combined, heterogeneity adjusted datasets.As an illustrative example, in this paper, we focus on the problem of Gaussian graphical model inference based on multiple datasets.It is a powerful tool to explore complex dependence structure among variables X = (X 1 , . . ., X p ) ∼ N (0, Σ).The sparsity pattern of the precision matrix Ω = Σ −1 encodes the information of an undirected graph G = (V, E) where V consists of p vertices corresponding to p variables in X and E describes their dependence relationship.To be specific, V i and V j are linked by an edge if and only if Ω ij = 0, meaning that X i and X j are dependent conditioning on the rest variables.For heterogeneous data across m data sources, we need to first adjust for heterogeneity using the ALPHA framework.The idea of covariate-adjusted precision matrix estimation has been studied by Cai et al. (2012), but the factor model they used assumes observed factors and no heterogeneity issue, i.e., m = 1.
A significant amount of literature has focused on the estimation of the precision matrix Ω for graphical models for homogeneous data.Yuan and Lin (2007), Banerjee et al. (2008), Figure 1: Schematic illustration of ALPHA: Depending whether we can find some sufficiently informative covariates W, we implement principal component analysis (PCA) or Projected-PCA (PPCA) methods (labeled respectively M 1 and M 2 ) to remove the heterogeneity effects ΛF for each batch of data.This decision was made adaptively by a test statistic.After removing the unwanted variations, the homogeneous data {U (i) } m i=1 are aggregrated for further analysis.Friedman et al. (2008) developed the Graphical Lasso method using the L 1 penalty and Lam and Fan (2009) and Shen et al. (2012) used a non-convex penalty.Furthermore, Ravikumar et al. (2011) and Loh and Wainwright (2013) studied the theoretical properties under different assumptions.Estimating Ω can be equivalently reformulated as a set of node-wise sparse linear regression that utilizes Lasso or Danzig selector for each node (Meinshausen and Bühlmann, 2006;Yuan, 2010;Cai et al., 2011).To relax the assumption of Gaussian data, Liu et al. (2009) and Liu et al. (2012) extend the graphical model to the case of semiparametric Gaussian copula and transelliptical family.Under the ALPHA framework, the adjusted data U i can be combined to construct an estimator for the inverse matrix Ω by the above methods.
The rest of the paper is organized as follows.Section 2 lays out a basic problem setup and necessary assumptions.We model the heterogeneity by a semiparametric factor model.Section 3 introduces the ALPHA methodology for heterogeneity adjustment.Two main methods of PCA and Projected-PCA will be introduced for adjusting the factor effects under different settings.A guiding rule of thumb is also proposed to determine which method is more appropriate in the real data analysis.The heterogeneity-adjusted data will be combined to provide valid graph estimation in Section 4. The CLIME method of Cai et al. (2011) is used for estimating precision matrix, although other related methods are also applicable.Some synthetic simulations and a real dataset are analyzed to demonstrate the proposed framework in Section 5. Section 6 contains some further discussions.All the proofs are relegated to the Appendix.

Problem Setup
To more efficiently use the external covariate information in removing heterogeneity effect, we first present a semiparametric factor model.Then, based on whether the collected external covariates have explaining power on factor loadings, we discuss two different regimes where PCA or Projected-PCA (PPCA) should be used.We will state the conditions under which these methods can be formally justified.

Semiparametric factor model
We assume that for subgroup i, we have d external covariates W i j = (W i j1 , . . ., W i jd ) for variable j.In stock returns, these can be attributes of a firm; in brain imaging, these can be the physical locations of voxels.We assume that these covariates have some explanatory power on the loading parameters λ i j in (1.1) so that it can be further modeled as λ i j = g i (W i j ) + γ i j , where g i (•) is the external covariate effects on λ i j and γ i j is the part that can not be explained by the covariates.Thus, model (1.1) can be written as (2.1) Model (2.1) does not put much restriction.If W i j is not informative (i.e., λ i j does not depend on W i j ), then g i (•) = 0, the model reduces to a regular factor model.In a matrix form, model (2.1) can be written as (2.2) ) and Γ i are p × K i component matrices of the factor loading Λ i .More specifically, g i k (W i j ) and γ jk are the (j, k) th element of G i (W i ) and Γ i respectively.Expression (2.2) suggests that the observed data can be decomposed into a low-rank heterogeneity term Λ i F i and a homogeneous signal term U i .
Letting u i t be the t th column of U i , we assume all u i t 's share the same distribution for any t ≤ n i and for all subgroups i ≤ m with E[u i t ] = 0, Var(u i t ) = Σ.Our goal is to recover U i from the observation X i and combine all the estimated U i 's together to enhance the inferential power of Σ or Ω = Σ −1 .
There has been a large literature on factor models in econometrics (Bai, 2003;Bai and Ng, 2013;Fan et al., 2013;Stock and Watson, 2002), machine learning (Cai et al., 2013;Negahban and Wainwright, 2011;Candès et al., 2011) and random matrix theories (Johnstone and Lu, 2009;Paul, 2007;Shen et al., 2013;Fan and Wang, 2015).We refer the interested readers to those relevant papers and the references therein.However, none of these models incorporate the external covariate information.The semiparametric factor model (2.1) was first proposed by Connor and Linton (2007) and further investigated by Connor et al. (2012); Fan et al. (2016).Using sufficiently informative external covariates, we are able to more accurately estimate the factors and loadings, and hence yield better heterogeneous adjustment.

Modeling assumptions and general methodology
In this subsection, we explicitly list all the required modeling assumptions.We start with an introduction of the data generating processes.
t } t≤n i ,i≤m are independent within and between subgroups.u i t 's are identically sub-Gaussian distributed with mean zero and variance Σ across all subgroups and are independent of {W i j , f i t }. {f i t } t≤n i is a stationary process, but with arbitrary temporal dependency.(iii) There exists a constant C 0 > 0 such that Σ 2 ≤ C 0 .(iv) The tail of the factors is sub-Gaussian, i.e., ∃C 1 , C 2 > 0 such that for k ≤ K i , t ≤ n i , The above set of assumptions are commonly used in the literature, see Bai and Ng (2013); Fan et al. (2016).We omit detailed discussions here.
Based on whether the external covariates are informative, we specify two regimes, each of which requires some additional technical conditions.

Regime 1: External covariates are not informative
For the case that G i (W i ) = 0, the external covariates do not have explanatory power on the factor loadings Λ i and model (2.2) reduces to the traditional factor model, extensively studied in econometrics (Bai, 2003;Stock and Watson, 2002;Onatski, 2012).PCA will be employed in Section 3.1 to estimate the heterogeneous effect.It requires the following assumptions.
Assumption 2.2.(i) (Pervasiveness) There are two positive constants c min and c max so that The first condition is common and essential in the factor model literature (e.g., Stock and Watson (2002)).It requires the factors to be strong enough such that the covariance matrix Λ i cov(f i t )Λ i + Σ has spiked eigenvalues.This is trivially true if {λ i j } p j=1 's can be regarded as random samples from a population with nondegenerate sample covariance matrix (Fan et al., 2013).The second condition is technical, and a relaxation of bounded requirement in the literature (Fan et al., 2013;Bai and Ng, 2013).

Regime 2: External covariates are informative
When covariates are informative, we will employ the PPCA (Fan et al., 2016) to better estimate the heterogeneous effect.It requires the following assumptions.
Assumption 2.3.(i) (Pervasiveness) There are two positive constants c min and c max so that This assumption is parallel to Assumption 2.2 (i).Pervasiveness is trivially satisfied if {W i j } j≤p are independent and G i is sufficiently smooth.
Condition (i) is parallel to Assumption 2.2 (ii) whereas Condition (ii) is natural since Γ i can not be explained by W i .Condition (iii) imposes cross-sectional weak dependence of γ i j , which is much weaker than assuming independent and identically distributed {γ i j } j≤p .This condition is mild as main serial dependency has been taken care of by g k (•)'s.

The ALPHA Framework
We introduce the ALPHA framework for heterogeneity adjustment.Methodologically, for each sub-dataset we aim to estimate the heterogeneity component and subtract it from the raw data.Theoretically, we aim to obtain the explicit rates of convergence for both the corrected homogeneous signal and its sample covariance matrix.Those rates will be useful when aggregating the homogeneous residuals from multiple sources.
This section covers details for heterogeneity adjustments under both regimes that G i (•) = 0 and G i (•) = 0: they correspond to estimating U i by either PCA or Projected-PCA.From now on, we drop the superscript i whenever there is no confusion as we focus on the i th data source.We will use the notation F if F is estimated by PCA and F if estimated by PPCA.This convention applies to other related quantities such as U and U, the heterogeneityadjusted estimator.In addition, we use notations such as F and Ǔ to denote the final estimators, which are F and U if PCA is used, and F and U if PPCA is used.
Estimators for latent factors under regimes 1 and 2 satisfy n −1 F F = I, which corresponds to normalization in Assumption 2.1 (i).By the principle of least squares, the residual estimator of U then admits the form It possesses the following properties.
Theorem 3.1.For any K by K matrix H such that where where The above theorem states that the error of estimating U by Ǔ (or estimating UU by Ǔ Ǔ ) is decomposed into two parts.The first part is inevitable even when the factor matrix F in (3.1) is known in advance.The second part is caused by the uncertainty from estimating F. Since the true F is identifiable up to an orthonormal transformation H, we need to carefully choose H to bound the error Π (or ∆).We will provide explicit rates of convergence for those terms in the following two subsections.

Estimating factors by PCA
In regime 1, we directly use PCA to adjust data heterogeneity.PCA estimates F by F where the k th column of F/ √ n is the eigenvector of (pn) −1 X X corresponding to the k th largest eigenvalue.By the definition of F, we have (np) −1 X X F = FK, where K is a K by K diagonal matrix with top K eigenvalues of (np) −1 X X in descending order as diagonal elements.Define a K by K matrix H as in Fan et al. (2013): It has been shown that K , K −1 and H , H −1 are all O P (1).The following theorem provides all the rates of convergences that are needed for downstream analysis.
Theorem 3.2.Under Assumptions 2.1 and 2.2, we have Combining the above results with Theorem 3.1, we have where where

Estimating factors by Projected-PCA
In regime 2, we would like to incorporate the external covariates using the Projected-PCA method proposed by Fan et al. (2016).We now explain this method.
To reduce the curse of dimensionality of g k (W j ), we assume it takes an additive form: To model the unknown function g kl (•), we adopt a sieve based idea which approximates g kl (•) by a linear combination of basis functions (e.g., B-spline, Fourier series, polynomial series, wavelets).Let {φ 1 (x), φ 2 (x), • • • } be a set of basis functions.Then for each l ≤ d, Here {b ν,kl } J ν=1 are the sieve coefficients of the l th additive component of g k (W j ), corresponding to the k th factor loading; R kl is the remainder function representing the approximation error; J denotes the number of sieve bases which may grow slowly as p diverges.The basic assumption for sieve approximation is that sup x |R kl (x)| → 0 as J → ∞.To facilitate notation, we take the same basis functions in (3.3) for all k and l though they can be different.
Define, for each k ≤ K and for each j ≤ p, Then, we can write )) be a p × (Jd) matrix of basis functions, and R(W) be a p × K matrix with the (j, k) th element d l=1 R kl (W jl ).Then the matrix form (2.2) can be written as recalling that we drop the data source index i.Thus the residual term contains three parts: the sieve approximation error R(W)F , unexplained loading ΓF and true signal U.
The idea of Projected-PCA is simple: since the factor loadings are a function of the covariates in (3.4) and U and Γ are independent of W, if we project (smooth) the observed data onto the space of W, the effect of U and Γ will be significantly reduced and the problem becomes nearly a noiseless one, recalling that the approximation error R(W) is small.Define P as the projection operator onto the space spanned by the basis functions of W: Then, by (3.4), PX ≈ PΦ(W)BF ≈ G(W)F .Thus, F can be estimated from the 'noiseless data' PX, using the traditional PCA.Let the columns of F/ √ n be the eigenvectors corresponding to the top K eigenvalues of the n × n matrix X PX, which is the sample covariance matrix of the projected data PX.Then, F is the PPCA estimator of F. It differs from the conventional PCA in that we use smoothed or projected data PX.
By the definition of F, we have (np) −1 X PX F = FK where K is a K × K diagonal matrix with the first K largest eigenvalues of (np) −1 X PX in descending order as its diagonal elements.Define the K by K matrix H as in Fan et al. (2016): It has been shown that K , K −1 and H , H −1 are all O P (1).Here we remind that though H and K are different from those in regime 1, they play essentially the same roles (thus with same notations).
As in Fan et al. (2016), we need the following conditions for the basis functions and accuracy of the sieve approximation.
Assumption 3.1 (Basis functions).(i) There are d min and d max > 0 so that almost surely, where X l is the support of the l th element of W j , and J is the sieve dimension.
Condition (i) in Assumption 3.2 is satisfied by most commonly used basis.For example, when {φ ν } is polynomial basis or B-splines, it is implied by the condition that smooth curve g kl (•) belongs to a Hölder class G, defined by G = {g : (Lorentz, 2005;Chen, 2007).Another example is step function g kl (•) with finite many distinct values, which can be expressed exactly as the linear combination of disjoint indicator functions so that κ can be arbitrarily large.
With the above conditions, the following theorem provides all the rates we need, recalling the definition of ν p in Assumption 2.4 (iii).

Specification test
In this section, we give an adaptive rule to decide whether the covariates W are informative enough to use PPCA or just PCA.We test the hypothesis that H 0 : G(W) = 0 using the test statistic (Fan et al., 2016) .
Here, we use PCA estimator Λ as PPCA is not applicable under H 0 .If Λ has nothing to do with W, then P Λ ≈ 0 and S should be quite small after projection.Conversely, if G(W) = 0, S will be large, hence we reject the null.We showed the following theorem, whose proof is omitted.
Theorem 3.4.Under all the assumptions discussed above, if {W j , γ j } j≤p are independent and identically distributed, as p, n i , J → ∞, we have under H 0 for the i th subgroup, Based on this result, we can decide whether or not to reject the null hypothesis, namely to use PPCA or PCA.When the test is applied to all m data sources, it becomes a multiple testing problem.The thresholding can be chosen by using various false discovery rate control methods such as Benjamini-Hochberg method (Benjamini and Hochberg, 1995).If a hypothesis is rejected, we identify the subgroup as regime 2 and use Projected-PCA to obtain U; otherwise, we identify the subgroup as regime 1 and apply regular PCA to get U.

Estimating number of factors
We now address the problem of estimating the number of factors for two different regimes.Extensive literature has made contributions to this problem in regime 1, i.e. the regular factor model (Bai and Ng, 2002;Hallin and Liška, 2007;Ahn and Horenstein, 2013;Lam and Yao, 2012).Ahn and Horenstein (2013) and Lam and Yao (2012) proposed to use ratio of adjacent eigenvalues of X X to infer the number of factors.They showed the estimator K = arg max k≤Kmax λ k (X X)/λ k+1 (X X) correctly identifies K with probability tending to 1, where K max can be a fixed prior upper bound for the number of factors.
For the geniune semiparametric factor model, in the recent work by Fan et al. (2016), they propose K = arg max k≤Kmax λ k (X PX)/λ k+1 (X PX).Here K max is of the same order as Jd, say K max = Jd/2.It was shown that P( K = K) → 1 under assumptions we omit here.When we have genuine and pervasive covariates, K typically outperforms K.More details can be found in Fan et al. (2016).

Summary of ALPHA
We now summarize the final procedure and convergence rates.We first divide m subgroups based on whether the collected covariates have influence on the loadings.Let ALPHA consists of the following three steps.
Step 1: (Preprocessing) For data source i, estimate K i by K i ; test H 0 : G i (X i ) = 0 by S i and use FDR control to construct two groups.For rejected groups, refine K i by K i .
Step 2: (Adjustment) Apply Projected-PCA to estimate Λ i F i if the test is rejected, otherwise use PCA to remove the heterogeneity, resulting in adjusted data Ǔ, which is either U i or U i .
Step 3: (Aggregation) Combine adjusted data { Ǔi } m i=1 to conduct further statistical analysis.For example, estimate sample covariance where N = i n i is the aggregated sample size; or estimate sparse precision matrix Ω by existing graphical model methods say CLIME (Cai et al., 2011).
We summarize the ALPHA procedure in Algorithm 1 given in the appendix.We also summarize the convergence of U i and U i here.To ease presentation, we consider a typical regime in practice: n i < Cp, i≤m K i < CN for some constant C. Also we focus on the situation of sufficiently smooth curves κ = ∞ so that J diverges very slowly (say with rate O( √ log p)) and constant φ max , ν p .Based on discussions of the previous subsections, for estimation of U, we have Therefore, Projected-PCA dominates PCA as long as the effective covariates are provided.However, U i F i F i /n i dominates all the remaining terms so that Ǔi In addition, for estimation of UU , we have where δ = n 2 i Σ 2 1 log p/p 2 , depending on Σ 1 .If we consider a general Σ so that Σ 1 can be as large as O( √ p), then the rate for i ∈ M 1 is simplified to This illustrates the advantage of Projected-PCA since its convergence is faster.If we only consider very sparse covariance matrix so that Σ 1 is bounded, we can simply drop the term δ in both regimes.Then, regime 1 achieves better rate if p = O(n 2 i ν p ), but regime 2 dominates otherwise.

Conditional Graphical Model
We have summarized the order of bias caused by adjusting heterogeneity for each data source in Section 3.5.Now we combine the adjusted data together for further statistical analysis.As an example, we study graph estimation under a Gaussian graphical model.
Assume u i t ∼ N (0, Σ) and consider the class of the precision matrices: To simplify the analysis, we assume R is fixed, but all the analysis can be easily extended to include growing R.
To estimate Ω = Σ −1 via CLIME, we first need a covariance estimator as the input.We also assume here the number of factors is known, i.e., the exception probability of recovering K i has been ignored for ease of discussion.Such an estimator is natually given by Since the number of data sources is large, we focus on the typical case of diverging N and p.

Covariance estimation
Denote by Σ N the oracle sample covariance matrix i.e.Σ N = N −1 m i=1 U i U i .We consider the difference of our proposed Σ with Σ N in this subsection.The oracle estimator obviously attains the rate Σ N − Σ max = O P ( log p/N ). Let It is Gaussian distributed with mean zero and variance Σ.Note that ξ i k are iid with respect to k and i, using the assumption where K tot = i≤m K i .Therefore, by (3.6), we have N and N 2 = i∈M 2 n i .We now examine the difference of the ALPHA estimator from the oracle estimator for two specific cases.In the first case, we apply PCA to all data sources, i.e., all i ∈ M 1 and K i is bounded.We then have a m,N,p = m log p/N .This rate is dominated by the oracle error rate log p/N if and only if m = O( N/ log p).This means traditional PCA performs optimally for adjusting heterogeneity as long as the number of subgroups grows slower than the order of N/ log p.
If we apply PPCA to all data sources, i.e., i ∈ M 2 and K i is bounded, then a m,N,p = ν p /p log p+ √ m log p/N .This rate is of smaller order than rate log p/N if p/ log p > CN for some constant C > 0. The advantage of using PPCA is that when n i is bound so that m N , we can still achieve optimal rate of convergence so long as we have a large enough dimensionality at least of the order N .

Precision matrix estimation
In order to obtain an estimator for the sparse precision matrix from Σ, we apply the CLIME estimator proposed by Cai et al. (2011).For a given Σ, CLIME solves the following optimization problem: where Ω 1,1 = i,j≤p |σ ij | and λ is a tuning parameter.Note that (4.4) can be solved column-wisely by linear programming.However, CLIME does not necessarily generate a symmetric matrix.We can simply symmetrize it by taking the one with minimal magnitude of σij and σji .The resulting matrix after symmetrization, still denoted as Ω with a little bit abuse of notation, also attains good rate of convergence.In particular, we consider the sparse precision matrix class F(s, C 0 ) in (4.1).The following lemma provides guarantee for recovering any sparse matrix Ω ∈ F(s, C 0 ). Furthermore, The proof of the theorem can be found in the appendix.The theorem shows that CLIME has strong theoretical guarantee of convergence under different matrix norms.The rate of convergence has two parts, one corresponding to the minimax optimal rate (Yuan, 2010) while the other is due to the error caused by estimating the unknown factors under various situations.The discussions at the end of Section 4.1 suggests that the latter error is often negligible.

Numerical Studies
In this section, we first validate the theoretical results derived above through Monte Carlo simulations.Our purpose is to show that after heterogeneity adjustment, our proposed aggregated covariance estimator Σ approximates well the oracle sample covariance Σ N , thereby leading to accurate estimation of the true covariance matrix Σ and precision matrix Ω.We also compare the performance of Projected-PCA and regular PCA on heterogeneity adjustments under different asymptotic settings.
In addition, we analyze a real brain image data using the proposed procedure.The dataset to be analyzed is the ADHD-200 data (Biswal et al., 2010).It consists of rs-fMRI images of 688 subjects, of whom 491 are healthy and 197 are diagnosed with ADHD.We dropped 16 subjects (13 healthy, 3 diseased) in our analysis since their data contain missing values.Following Power et al. (2011), we divided the whole brain into 264 regions of interest (ROI, p = 264), which are regarded as nodes in our graphical model.Each brain is scanned for multiple times with sample sizes ranging from 76 to 261 (76 ≤ n i ≤ 261).In each scan, we acquire the blood-oxygen-level dependent (BOLD) signal within each ROI.Here the heterogeneity among subjects arises from the difference in age, gender, handedness and IQ.

Preliminary analysis
To analyze the data, the first question is what external covariates W i j are for each of the 264 regions.Ideally, we hope these covariates have pervasive power on explaining the batch effect, while bearing no association with the graph structure of u t .For the current data, we can construct such covariates from physical locations of the regions, since the level of batch effect is non-uniform over different locations of the brain when scanned in fMRI machines, and furthermore it has been widely acknowledged in biological study that spatial adjacency does not necessarily imply brain functional connectivity.
Here we simply split the 264 regions into 10 clusters (J = 10) by the hierarchy clustering (Ward's minimum variance method) of their physical locations and use the categorical cluster indices as the covariates of the nodes.Note that the healthy and ADHD group share the same physical locations.The clustering result is shown in Figure 2 and the spatial locations of the 264 regions are shown in Figure 6 in 10 different colors.Black (middle), green (left) and blue (right) represent roughly the region of frontal lobe; gray (middle), pink (left) and magenta (right) occupy the region of parietal lobe; red (left) and orange (right) are in the area of occipital lobe; finally yellow (left) and navy (right) provide information about temporal lobe.
Other possible values for J could also be considered, but we do not want J too large to overfit the smooth loading functions.Note that here since the covariate W is one-dimensional (d = 1) and discrete, the sieve basis functions are just indicator functions 1(w − 0.5 ≤ W < w + 0.5) for w = 1, . . ., 10.We use the same covariates for all subjects.
The next question is whether the selected covariates can explain the loadings well.We implemented the specification tests described in Section 5 and find out the p-values for each subject.Most of the p-values are rather small (82.4% subjects in the healthy group and 79.0% subjects in the patient group have p-values smaller than 10 −3 ).We chose to control the FDR by Benjamini-Hochberg method (Benjamini and Hochberg, 1995) below the level of 1%.We discovered 425 healthy samples (91.4%) and 129 diseased samples (90.2%) rejecting the null, meaning that the selected covariates have significant explanatory powers on factor loadings of most subjects.We identified them as samples in the class M 2 and used Projected-PCA to estimate the heterogeneity effect.For those whose null hypotheses were not rejected, we classified them as individuals in the class M 1 and regular PCA was applied.
Based on which class each subject falls into, we employ the corresponding method to estimate the number of factors.We used K max = 5.The estimated number of factors for the two groups are summarized in Table 1.

Synthetic datasets
In this simulation study, for stability, we use the first 15 subjects in the healthy group to calibrate the simulation models.The testing results reveal that the external covariates W are informative for each of these 15 subjects.We specify four asymptotic settings for our Here the last setting represents regime 1 with G(W) = 0 where we should expect PCA to work well when the number of subjects is of order of square root of the total sample size, that is m √ N .The first three settings represent regime 2 with informative covariates G(W) = 0; they present asymptotics with growing p, m and n i respectively.

Model calibration and data generation
We calibrate (estimate) the covariance matix Σ of u t , which is a 264 by 264 matrix, by our proposed method to the data in the healthy group.Plugging it as input in CLIME solver delivers a sparse precision matrix Ω, which will be taken as truth in the simulation.Note that due to the regularization in CLIME, Ω −1 is not the same as Σ.To obtain the covariance matrix used in setting 1, we also calibrate, using the same method, a sub-model that involves only the first 100 regions.We then copy this 100 × 100 matrix multiple times to form a p × p block diagonal matrix and used it for simulations in setting 1.We describe how we calibrate these 'true models' and generate data from the models as follows.
1. (External covariates) For each j ≤ p, generate the external covariate W i.i.d.from the multinomial distribution with P(W j = s) = w s , s ≤ 10 where {w s } 10 s=1 are calibrated with the hierarchy clustering results of the real data (Figure 2).

(Calibration)
For the first 15 healthy subjects, obtain estimators for F, B and Γ by PPCA, resulting in Fan et al. (2016).Use the rows of the estimated factors to fit a stationary VAR model f t = Af t−1 + t , where t ∼ N (0, Σ ), and obtain the estimators A and Σ .

(Simulation)
For each subject i ≤ m, pick one of the 15 calibrated models and their associated parameters from above at random and do the following.
(a) Generate γ i jk i.i.d.from N (0, σ2 γ ) where σ2 γ is the variance of all entries of Γ.For the first three settings, compute the 'true' loading matrix Λ i = Φ(W) B + Γ i .For the last setting, set , where the parameters A and Σ are taken from the fitted values in step 2.
(c) Finally, generate the observed data X i = Λ i F i + U i , where each column of U i is randomly sampled from N (0, Ω −1 ), where Ω has been calibrated by the CLIME solver as described at beginning of the section.

Estimation of Σ
In this subsection, we investigate the errors of estimating covariance of u t in max-norm after applying Projected-PCA or regular PCA for heterogeneity adjustment.We also compare them with the estimation errors if we naively pool all the data together without any heterogeneity adjustment, but the estimation errors for the first 3 cases are too large to fit  in the graph.Denote the oracle sample covariance of u t by Σ N as before.The estimation errors under all four settings are presented in Figure 3, which are based on 100 simulations.
In Case 1, m and n i are fixed while the dimension p increases.For this setting, n i is small and this highlights more the advantages of Projected-PCA over regular PCA.From the left panel, we observe that increase of dimensionality improves the performance of Projected-PCA.This is consistent with the rate we derived in theories.In Case 2, n i and p are fixed while m increases.Both Projected-PCA and regular PCA benefit from increasing number of subjects.However, since n i is small, again Projected-PCA outperforms regular PCA.In Case 3, m and p are fixed while n i increases.Again both methods achieve better estimation as n i increases, but more importantly, regular PCA outperforms Projected-PCA when n i is large enough.This is again consistent with our theories.As illustrated by Section 4.1, when m is fixed, PCA attains the convergence rate Σ − Σ max = O P ( log p/N ), while Projected-PCA only achieves Σ − Σ max = O P (log p/ √ p), which is worse than PCA when p/ log p = o(N ).In Case 4, p is fixed, and both m and n i increase.Note that the covariates have no explanation power at all, i.e., Condition 2.3 about pervasiveness does not hold so that PPCA is not applicable.As expected, adjusting by PCA behaves much better than by Projected-PCA, which can sometimes be as bad as 'nPCA', corresponding to no heterogeneity adjustment.This is not unexpected as we utilized a noisy external covariates.

Estimation of Ω
In this subsection, we focus on estimation error of the precision matrix of u t .We plug Σ, obtained from data after adjusting for heterogeneity, into CLIME to get an estimator Ω of Ω.In Figure 4, Ω − Ω max and Ω − Ω 1 are depicted under the same four asymptotic settings as before.From the plots we see Ω − Ω max and Ω − Ω 1 share similar behavior with Σ − Σ max in all the four settings.In the first three cases, if we do not adjust data heterogeneity, Ω − Ω max and Ω − Ω 1 will be too large to be fitted in the current plots.
We also present the ROC curves of our proposed methods in Figure 5, which is of interest to readers concerned with sparsity pattern recovery.The black dashed line is the 45 degree line connecting (0, 0) and (1, 1), representing performance of the random guess.It is obvious from those plots that heterogeneity adjustment very much improves the sparsity recovery of the precision matrix Ω.When the sample size of each subject is small, genuine pervasive covariates increase the power of Projected-PCA method while on the other hand if the sample size is relatively large, PCA is sufficiently good in recovering graph structures.Also notice that in all cases, the naive method with no heterogeneity adjustment can still achieve a certain amount of power, but we can improve the performance dramatically by correcting the batch effects.

Brain image network data
We report the estimated graphs for both the healthy group and the ADHD patient group with batch effects removed using three methods: (1) PPCA using physical locations of ROI as covariates; (2) PCA without using any covariates; (3) no-PCA, which ignores heterogeneity and naively pool the data from all subjects together.We took various sparsity levels of the networks from 1% to 5% (corresponding to the same set of λ's for two groups) and selected the common edges, which are stable with respect to tuning, to be depicted.The brain network produced by Method 1 is reported in Figures 6.We omit the networks produced by Methods 2 and 3 since the inferred graphs actually do not differ too much, given that the number of subjects and total sample size are large.All methods give around 90% identical edges for the two networks, while respectively generating 8.6%, 8.0% and 11.6% unshared edges.Therefore, Methods 1 and 2 with adjustments provide more consistent and trustworthy graphs as batch effect brings more non-biological factors that exaggerate the difference.Preferences for Methods 1 and 2 should be based on relationship of p, n i , m and whether the collected covariates are influential enough to explain the loadings.From Figure 6, it is obvious that the brain is more connected for the ADHD subjects, but the connections are weaker.Actually, the average correlation reduction for the estimated correlation matrices (obtained from Ω −1 ) of the two groups is 0.001; a paired t-test gives p-value < 2.2e−16.This is consistent with the recent finding that kids with ADHD show weaker interactions among brain networks (Cai et al., 2016).
In addition, we investigate how those unshared edges between two groups of people are distributed across the 10 clusters.We only focus on networks from Method 1.We are interested in which cluster contributes to the difference of the distributions of vertices of the unshared edges the most.We summarized the total degree of unshared edge vertices within each cluster in Table 2.For each column j, we consider the hypothesis testing for p jH = p jD , where H and D denotes the healthy and diseased group respectively, meaning that the unshared edges within the j th cluster are found due to the same Bernoulli distribution.A simple chi-square test for the null was carried out for each column and those p-values are reported also in Table 2.The most noteworthy fact is that occipital lobe shows dependency change from right brain (orange) to left brain (red) for ADHD patients.The left frontal lobe (green) and the left parietal lobe (pink) have relatively large change in dependence structure compared with other parts of the brain.These are signs that ADHD is a complex disease that affects many regions of the brain.The general methodology we provide here could be valuable for further understanding the mechanism of the disease.

Discussions
In this paper, we developed a generic method called ALPHA that can consistently estimate and remove data heterogeneity and lead to effective subsequent statistical analysis on the true signal.The entire analysis relies on the pervasive assumption that most of the dimensions are corrupted by the heterogeneous factors.Future work may relax such pervasive conditions to allow for weaker signal batch effect, thus delivering more flexibility to recover the homogeneous residual.
As we have seen, ALPHA is adaptive to factor structures and is flexible to include external information.For brain image data analysis, previous literature rarely took physical locations into considerations.With the new framework, we can take advantage of external characteristics of the voxels or genes relevant to the batch effect, and consistently estimate the pervasive heterogeneity term even with very limited samples.However, this advantage of Projected-PCA is accompanied by more assumptions and the practical issue of selecting proper basis functions and the number of them in sieve approximation.On the other hand, if no valuable covariates exist and the sample size is relatively large for each data source, we have shown conventional PCA is still an effective tool.Direct aggregation of less heterogeneous subgroups (say subjects with the same age and gender in the ADHD dataset) might also be helpful to increase the sample size.
Finally, note that after heterogeneity adjustment, the recovered residuals Ǔ are not column-wisely independently distributed anymore.Statistical procedures that require assumptions of i.i.d.data cannot be directly applied on Ǔ.However, the ALPHA procedure gives theoretical guarantee for Ǔ − U max and Σ − Σ max , which serve as foundations for establishing the statistical properties of the subsequent procedure.In this sense, our framework is compatible with any statistical procedure that only requires an accurate estimator as the input, for instance, the CLIME procedure.Methods robust to small perturbations on the truth are preferable.

Algorithm 1 Algorithm for adaptive low-rank principal heterogeneity adjustment
Input: Panel X i p×n i and d-dimensional {W i j } p j=1 from m data sources Output: Ǔi , the adjusted estimator for U i and Σ 1: procedure ALPHA 2: for each subject i ≤ m do 3: for each subject i ≤ m do 8: else 12: K i ← projected eigenvalue-ratio method 13: 14: end if 17: end for 18: return { Ǔi } m i=1 and Σ 20: end procedure Now we consider Ǔ Ǔ in the following.
So ∆ = III + IV and it suffices to bound the two terms.
Now let us take a look at IV .
By assumption, H max ≤ H = O P (1).Simple decompositions of D 1 gives It is also not hard to show D 3 max = O P ( III max + D 1 max ).Under both Theorems 3.2 and 3.3 (replacing F by F for regime 1 and F for regime 2), we can check the following relationship holds: Therefore we have C Proof of Theorem 3.2

C.1 Convergence of factors F
Let K denote the K × K diagonal matrix consisting of the first K largest eigenvalues of (pn) −1 X X in descending order.By the definition of eigenvalues, we have To bound F − FH max , note that there is a constant C > 0, so that Hence we need to bound E i max for i = 1, 2, 3 since K −1 2 = O P (1).The following lemma gives the stochastic bounds for each individual term.
) again according to Lemma F.1.So combining the three terms, we have F − FH F = O P (1 + n/p).We now refine the bound for Note first that the two matrices under consideration is both K by K, so we do not lose rates bounding them by their Frobenius norm.
Let us find out rate for F ( F−FH) F .Basically we need to bound F E i F for i = 1, 2, 3. Firstly Finally, So combining three terms we have Therefore HH − I F has the same rate since

C.3 Rate of U( F − FH) max
In order to study rate of U( F − FH) max , we essentially need to bound UE i max for i = 1, 2, 3. We handle each term separately.
From bounding E 3 F , the last term has rate So combining three terms, we conclude

D Proof of Theorem 3.3 D.1 Convergence of factors F
Let K denote the K × K diagonal matrix consisting of the first K largest eigenvalues of (pn) −1 X PX in descending order.By the definition of eigenvalues, we have where A i , i ≤ 3 has nothing to do with R(W) and Γ: Similarly, A 5 max attains the same rate of convergence.In addition, notice A 9 , A 10 have similar representation as A 4 , A 5 .The only difference is to replace R by Γ.It is not hard to see A 11 has similar representation as A  Note first that the two matrices under consideration is both K by K, so we do not lose rates bounding them by their Frobenius norm.
It has been proved in Fan et al. (2016) that F ( F−FH) F = O P ( n/p+n/p+n ν p /p+ nJ −κ/2 ).By the choice of J, the last term vanishes.So  By (D.1), in order to bound U( F − FH) max , we essentially need to bound UA i max for i = 1, . . ., 15.We do not bother going into the details of each term again as in Lemma D.1.However, we point out the difference here.All A i are separated into two types: the ones starting with F and the ones starting with U.
If a term A i starts with F, say A i = FQ, in Lemma D.1, we bound A i max using √ K F max Q F .Now we use bound UA i max ≤ √ K UF max Q F so that we obtain all related rates by just changing rate F max = O P ( √ log n) to UF max = O P ( √ n log p).Terms starting with U includes A i , i = 2, 3, 8, 13.In Lemma D.1, we bound A i max , i = 3, 8, 13 using U Φ max while we bound A 2 max using U ΦB max .Correspondingly now we need to control UU Φ max and UU ΦB max separately to update the rates.The derivation is relegated to Lemma F.5.We have UU Φ(W) max = O P (φ max ( √ np log p + n Σ 1 )) and UU Φ(W)B max = O P ( √ np log p + nJφ max Σ 1 ).So we replace the corresponding terms in Lemma D.1.It is not hard to see the dominating term is UA 2 max = O P ( n log p/p + nJφ max Σ 1 /p).Therefore, U( F − FH) max has the same rate.

E Proof of Theorem 4.1
Proof.Denote the empirical covariance matrix as As in Cai et al. (2011) where the first term of the last inequality uses the constraint of (4.4) while the optimality condition of (4.4) is applied to bound Ω 1 by Ω 1 .So it remains to find τ N,p in (E.1).
Since Ω ∈ F(s, C 0 ), Ω 1 ≤ C 0 , so we just need to bound Σ − Σ N max and Σ N − Σ max .Obviously, By assumption Σ − Σ N max = O P (a m,N,p ).Thus τ = log p/N + a m,N,p .Similar proof as in Cai et al. (2011) can also reach error bounds under • 1 and • 2 , which we omit.The proof is now complete.

F Technical lemmas
Lemma F.1.(i) Λ U 2 F = O P (np), (ii) U U 2 F = O P (np 2 + pn 2 ), (iii) U UF 2 F = O P (np 2 + pn 2 ).Proof.We simply apply Markov inequality to get the rates.We minimize the right hand side and choose θ = s/(2Cp), it is easy to check η < 1/(4 Σ ) and see that max  Proof.This results can be found in the paper of Fan, Liao and Wang (2014).But the conditions they used are a little bit different from our conditions.In particular, we allow no time (sample) dependence and only require bounded Σ 2 instead of Σ 1 .By Markov inequality, it is sufficient to show the expected value of each term attains the corresponding rate of convergence.

Figure 3 :
Figure 3: Estimation of Σ by PCA, PPCA and the oracle sample covariance matrix for 4 different settings.Case 1: m and n i are fixed while the dimension p increases; case 2: n i and p are fixed while m increases; case 3: m and p are fixed while n i increases; case 4: p is fixed, and both m and n i increase and conditions for PPCA are violated.

Figure 4 :Figure 5 :
Figure 4: Estimation of Ω. Presented are the estimation errors in max-norm and in L 1 -norm for 4 different settings.The same captions in Figure 3 apply.

Figure 6 :
Figure 6: Estimated brain functional connectivity networks using physical locations as covariates to correct heterogeneity.10 region clusters are labeled in 10 colors.Black, blue and red edges represent respectively common edges, unshared edges in the healthy group and in the ADHD group.
M 2 ← control FDR by Same bound holds for A 15 .The final rate of convergence for F−FH max and F−FH F are summarized as follows.Proposition D.1.Choose J = (p min(n, p, ν −1 p )) 1/κ and assume J 2 φ 2 max log(nJ) = O(p) and ν p = O(1), F − FH max = O P log n p and F − FH F = O P n p .(D.2) Proof.The max norm result follows from Lemmas D.1 and (D.1), while the Frobenius norm result has been shown in Fan et al. (2016).D.2 Rates of F ( F − FH) max and HH − I max

Table 1 :
The distributions of the estimated number of factors for healthy and ADHD groups

Table 2 :
The degree of unshared edge vertices for each cluster red orange blue green yellow navy pink black magenta gray The final rate of convergence for F−FH max and F−FH F are summarized as follows..2Rates of F ( F − FH) max and HH − I max C ).The rate of convergence for A 8 can be bounded in the same way.So do A 12 and A 13 .Given that Γ Φ F = O P (pJν p ), we have A 12 max = O P (Jφ max ν p log(nJ) log n/p) = A 13 max .(vi)Obviously, A 14 max = O P t =t |u t u t | = O P ( √ p log n).So we conclude that U U max = O P (p).(iii)Let fk be the k th column of F. U UF max = max t,k |u t U fk | ≤ max t,k |u t U (−t) fk(−t) |+ max t,k |u t u t f tk | where U (−t) , fk(−t) are U and fk canceling the t th column and element respectively.From (ii) we know the second term is of order O P (p max tk |f tk|) = O P (p √ log n).Define ξ = U (−t) fk(−t) ∼ subGaussian(0, Σ fk(−t)2 ), which is independent with u t .Thus|u t ξ| > s) ≤ 2nKe −sθ E[exp(θu t ξ)],where E[exp(θu t ξ)] ≤ E[exp(θ 2 u t Σu t fk(−t) 2 /2)] ≤ E[exp(Cθ 2 n u t 2 )].Similar to (ii), we choose η = Cθ 2 n here.It is not hard to see max t,k |u t ξ| = O P ( √ np log n).Thus U UF max = O P ( √ np log n + p √ log n).