Gaussian graphical model estimation with measurement error

Recovering the precision matrix in the presence of additive and multiplicative measurement errors.


Introduction
Estimation of the precision (inverse covariance) matrix has become a fundamental problem in modern multivariate statistical analysis, with a very wide range of practical applications. For example, in genome-wide association studies, genomic data are used to estimate the precision matrix, and the genetic correlations between traits are read off from its nonzero entries. This is not only crucial for understanding gene regulatory pathways and gene functions, but also contributes to a deeper understanding of the genetic basis of quantitative variation in complex traits [1]. Similarly, in social network analysis [2, 3], the precision matrix can be used to explore correlations between users; it is also widely used in economics and finance [4-6], health science [7], high-dimensional discriminant analysis [8] and complex data visualization [9].
The estimation of the precision matrix has attracted many scholars in recent years. In terms of methodology, the proposed approaches can be roughly divided into two classes: penalized likelihood methods and column-by-column estimation methods. In the former class, Ref. [10] laid down a general framework for the penalized likelihood method, and Refs. [11-15] used different penalty algorithms to select the zero elements of the precision matrix; these methods all impose a penalty term on the negative log-likelihood function. The theoretical properties of these methods have been thoroughly studied in Refs. [16, 17]. The latter class includes, for instance, Refs. [8, 14, 18-21]. Such methods convert precision matrix estimation into a nodewise or pairwise regression or optimization problem, and then apply high-dimensional regularization techniques of the Lasso or Dantzig selector type. However, most of these existing methods are designed for clean data, while corrupted data are encountered in many fields. Naively applying the aforementioned methods to corrupted data can lead to inconsistent and unstable estimates, and thus to misleading conclusions. Therefore, it is urgent to develop an efficient method for large precision matrix estimation under measurement errors.

Various statistical methods have been proposed to mitigate the effects of measurement errors. Specifically, a series of works address measurement errors in univariate-response linear regression models, which can be traced back to Ref. [22], with extensions in Refs. [23, 24]. In recent years, great progress has also been made in high-dimensional error-in-variables regression. For instance, Ref. [25] developed an $\ell_1$-regularized likelihood approach to handle missing data by solving a negative log-likelihood optimization problem via the EM algorithm. Similarly, Ref. [26] developed a Lasso-type estimator by replacing the corrupted Gram matrix with an unbiased estimate. Furthermore, Ref. [27] proposed a Dantzig selector-type estimator based on the compensated matrix uncertainty method. However, after adjusting for corrupted data, the negative likelihood functions are usually nonconvex and may depend on crucial hidden parameters. To address this difficulty, Ref. [28] proposed the convex conditioned Lasso (CoCoLasso), which replaces the unbiased covariance matrix estimate with its nearest positive semidefinite matrix, thus enjoying stepwise convexity in high-dimensional measurement error regression.
In this work, motivated by the studies above, we develop a new approach, called convex conditioned innovative scalable efficient estimation (COCOISEE), for Gaussian graphical model estimation under measurement errors. It combines the strengths of the recently developed innovative scalable efficient estimation (ISEE) [21] and the nearest positive semidefinite matrix projection [28], thus enjoying both stepwise convexity and scalability.
The remainder of the paper is organized as follows. Section 2 presents the model setting and the new methodology. Theoretical properties, including estimation consistency, are established in Section 3. We provide simulation examples in Section 4. Section 5 discusses possible extensions of our method and future work. All technical details are relegated to the appendix.
Model setting
Consider an undirected Gaussian graphical model for a $p$-dimensional Gaussian random vector $x = (X_1, \ldots, X_p)^{\mathrm{T}} \sim N(\mu, \Sigma)$, where $G = (V, E)$ is the undirected graph associated with $x$, $V = \{1, \ldots, p\}$ is the node set, $E = (e_{jk})$ is the edge matrix with each entry $e_{jk}$ representing the edge between $X_j$ and $X_k$, $\mu$ is the mean vector and $\Sigma$ is the covariance matrix. Assume that $x_1, \ldots, x_n$ is an independently and identically distributed sample from this Gaussian distribution. Without loss of generality, suppose that the mean vector $\mu = 0$ throughout the paper. Denote by $\Omega = (\omega_{jk}) = \Sigma^{-1}$ the precision matrix, that is, the inverse of the covariance matrix $\Sigma$. It is well known in Gaussian graphical model theory that an edge exists between $X_j$ and $X_k$ (i.e., $e_{jk} \neq 0$) if and only if the corresponding entry $\omega_{jk}$ of the precision matrix is nonzero. This characterization of the precision matrix shows that estimating the Gaussian graphical model is equivalent to estimating the support $\mathrm{supp}(\Omega) = \{(j, k) : \omega_{jk} \neq 0\}$.

To motivate our new method, we consider a linear transformation based on the precision matrix similar to Ref. [21]. Specifically, applying the transformation $\tilde{x} = \Omega x$ to $x$, we have

$$\tilde{x} = \Omega x, \quad \tilde{x} \sim N(0, \Omega). \tag{1}$$

Therefore, from the distribution of $\tilde{x}$, we can see that once an estimate of $\tilde{x}$ is obtained, the precision matrix can be recovered by estimating the covariance matrix of $\tilde{x}$.
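To make the transformation concrete, here is a small numerical check (our illustration, not part of the paper; all variable names are ours) that the innovated vector $\tilde{x} = \Omega x$ indeed has covariance matrix $\Omega$, since $\mathrm{Cov}(\Omega x) = \Omega \Sigma \Omega = \Omega$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 200_000

# A small, well-conditioned sparse precision matrix Omega.
Omega = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Omega)

# Sample x ~ N(0, Sigma) and form the innovated vectors x_tilde = Omega x.
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X_tilde = X @ Omega  # valid row-wise since Omega is symmetric

# The sample covariance of x_tilde should approximate Omega itself.
print(np.round(np.cov(X_tilde, rowvar=False), 2))  # ~ Omega
```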
By the definition of $\tilde{x}$, we can write the subvector $\tilde{x}_A$ for an index set $A \subset \{1, \ldots, p\}$ in the following form:

$$\tilde{x}_A = \Omega_{A,A} x_A + \Omega_{A,A^c} x_{A^c}, \tag{2}$$

where $A^c$ denotes the complement of $A$. It is well known that in the Gaussian graphical model, the conditional distribution of the subvector $x_A$ given $x_{A^c}$ can be written as

$$x_A \mid x_{A^c} \sim N\!\left(-(\Omega_{A,A})^{-1} \Omega_{A,A^c}\, x_{A^c},\ (\Omega_{A,A})^{-1}\right). \tag{3}$$

Formula (3) can be expressed as the following multivariate linear regression model:

$$x_A = \mathrm{B}_A^{\mathrm{T}} x_{A^c} + \varepsilon_A, \tag{4}$$

where $\mathrm{B}_A = -\Omega_{A^c,A} (\Omega_{A,A})^{-1}$ is a $(p - |A|) \times |A|$-dimensional regression coefficient matrix, and $\varepsilon_A \sim N(0, (\Omega_{A,A})^{-1})$ is the vector of model errors, which has a multivariate Gaussian distribution and is independent of $x_{A^c}$. Substituting Eq. (4) into Eq. (2) gives $\tilde{x}_A = \Omega_{A,A} \varepsilon_A$, so the two terms on the right-hand side can be efficiently estimated by linear regression techniques, e.g., the Lasso [29].
The representation $\tilde{x}_A = \Omega_{A,A} \varepsilon_A$ shows that the unknown subvector $\tilde{x}_A$ can be estimated by regression techniques. To see this, let $\hat{\varepsilon}_A$ be the residual vector obtained by fitting model (4) using the Lasso. Then the unknown matrix $\Omega_{A,A}$ can be estimated as the inverse of the sample covariance matrix of the residuals, and the subvector in Eq. (2) can be estimated as $\hat{\tilde{x}}_A = \hat{\Omega}_{A,A} \hat{\varepsilon}_A$.
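On clean data, the block step just described can be sketched as follows. This is a minimal illustration of ours, using scikit-learn's Lasso as the nodewise regression (the paper's implementation uses the scaled Lasso); the function name and tuning value are assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

def isee_block(X, A, lam=0.1):
    """One clean-data block step: regress each node j in A on the rest,
    collect the residual matrix E_A, invert its sample covariance to get
    Omega_{A,A}, and return the estimated innovated subvectors."""
    n, p = X.shape
    Ac = [j for j in range(p) if j not in A]
    E = np.empty((n, len(A)))
    for k, j in enumerate(A):
        fit = Lasso(alpha=lam, fit_intercept=False).fit(X[:, Ac], X[:, j])
        E[:, k] = X[:, j] - X[:, Ac] @ fit.coef_
    Omega_AA = np.linalg.inv(E.T @ E / n)  # inverse residual covariance
    return Omega_AA, E @ Omega_AA          # block estimate and X_tilde_A
```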
ISEE [21] repeats this block step for each index set $A_l$ with $l = 1, \ldots, L$ (a partition of $\{1, \ldots, p\}$ into small blocks) to obtain the estimated subvectors $\hat{\tilde{x}}_{A_l}$, and then stacks all these subvectors together to form an estimate of the oracle innovated vector $\tilde{x}$. Thus, the problem of estimating the precision matrix based on the original vector $x$ reduces to estimating the covariance matrix based on the estimated transformed vector $\hat{\tilde{x}}$.

However, in many practical applications, the collected data may contain nonnegligible measurement errors, resulting in inaccurate Gaussian graphical model estimation. In this paper, we consider two kinds of measurement errors associated with the data matrix $X = (x_1, \ldots, x_n)^{\mathrm{T}}$, as described below.

• Additive error. We assume that the observed elements of the data matrix are contaminated by additive measurement errors, written in matrix form as $Z = X + A$, where $Z$ is the observed data matrix and $A$ is the measurement error matrix. We assume that the rows of $A$ are i.i.d. with mean $0$ and finite covariance matrix $\Sigma_A$.
• Multiplicative error. We assume that the observed elements of the data matrix are contaminated by multiplicative measurement errors, written in matrix form as $Z = X \circ M$, where $\circ$ denotes the Hadamard product and $M$ is the measurement error matrix. We also assume that the rows of $M$ are i.i.d. with mean $\mu_M$ and finite covariance matrix $\Sigma_M$. In particular, missing data can be regarded as a special case of multiplicative measurement error, where $m_{ij} = I(x_{ij}\ \text{is not missing})$ and $I(\cdot)$ represents the indicator function. A sketch of generating data under these mechanisms is given after this list.
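The two corruption mechanisms (plus the missing data special case) can be simulated as follows; this is a hedged sketch of ours, and the noise levels and distributional choices are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)

def corrupt_additive(X, sigma_a=0.3):
    """Z = X + A with i.i.d. Gaussian additive errors (level sigma_a)."""
    return X + sigma_a * rng.standard_normal(X.shape)

def corrupt_multiplicative(X, sigma_m=0.3):
    """Z = X o M with i.i.d. log-normal multiplicative errors."""
    return X * rng.lognormal(mean=0.0, sigma=sigma_m, size=X.shape)

def corrupt_missing(X, prob=0.1):
    """Missing at random: m_ij = I(x_ij is not missing), zeroing entries."""
    M = rng.binomial(1, 1.0 - prob, size=X.shape)
    return X * M
```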

Scalable estimation by COCOISEE
Recall that the clean data matrix is $X = (x_1, \ldots, x_n)^{\mathrm{T}}$. Then the transformed data matrix is defined as $\tilde{X} = X \Omega$. Thus, Eq. (4) can be rewritten in matrix form as

$$X_A = X_{A^c} \mathrm{B}_A + \mathrm{E}_A, \tag{5}$$

where $\mathrm{E}_A$ is an $n \times |A|$ error matrix whose rows are i.i.d. copies of $\varepsilon_A^{\mathrm{T}}$. Then $\tilde{X}_A$ can be written as

$$\tilde{X}_A = \mathrm{E}_A \Omega_{A,A}. \tag{6}$$

The representation in Eq. (6) provides the basis for the estimation of the transformed matrix $\tilde{X}$.

To fit the Gaussian linear regression model (5), we suggest using univariate regression methods. Specifically, for each node $j$ in the index set $A$, we consider the univariate linear regression model

$$X_j = X_{A^c} \theta_j + \mathrm{E}_j \quad \text{for } j \in A. \tag{7}$$

In general, we can use high-dimensional regression methods to fit Eq. (7), such as SCAD [30], the Lasso [29], the elastic net [31], the adaptive Lasso [32] and the Dantzig selector [33]. However, in linear models with errors in variables we observe the corrupted data matrix $Z$ instead of $X$. Thus, directly applying the Lasso to this equation by minimizing

$$\frac{1}{2} \theta^{\mathrm{T}} \hat{\Sigma}_{A^c, A^c}^{\mathrm{nai}} \theta - (\hat{\rho}^{\mathrm{nai}})^{\mathrm{T}} \theta + \lambda \|\theta\|_1$$

is often erroneous, where $\hat{\Sigma}^{\mathrm{nai}} = \frac{1}{n} Z^{\mathrm{T}} Z$ is the naive Gram matrix and $\hat{\rho}^{\mathrm{nai}} = \hat{\Sigma}^{\mathrm{nai}}_{A^c, j}$. Based on the corrupted data matrix $Z$, Ref. [28] suggested constructing unbiased surrogates $\hat{\Sigma}$ and $\hat{\rho}$ for the unobservable Gram matrix and covariance vector to mitigate the effects of measurement errors. For the additive error setting, the unbiased surrogate is defined as

$$\hat{\Sigma} = \frac{1}{n} Z^{\mathrm{T}} Z - \Sigma_A. \tag{8}$$

And for the multiplicative error setting, the unbiased surrogate is defined as

$$\hat{\Sigma} = \left(\frac{1}{n} Z^{\mathrm{T}} Z\right) \oslash \left(\mu_M \mu_M^{\mathrm{T}} + \Sigma_M\right), \tag{9}$$

where $\oslash$ represents elementwise division of vectors and matrices, and the surrogate $\hat{\rho} = \hat{\Sigma}_{A^c, j}$ is the corresponding subvector. Since $\Sigma_A$, or $\mu_M$ and $\Sigma_M$, can be obtained in practice through repeated measurements or field experience, we assume that they are known for the model.
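A minimal sketch of these surrogate constructions (our own; it builds the full $p \times p$ Gram surrogate, from which the needed submatrices are extracted, and assumes $\Sigma_A$, $\mu_M$ and $\Sigma_M$ are known as stated above):

```python
import numpy as np

def surrogate_additive(Z, Sigma_A):
    """Unbiased surrogate for X^T X / n under Z = X + A (Eq. (8))."""
    n = Z.shape[0]
    return Z.T @ Z / n - Sigma_A

def surrogate_multiplicative(Z, mu_M, Sigma_M):
    """Unbiased surrogate under Z = X o M (Eq. (9)), using the identity
    E[z_j z_k] = E[x_j x_k] * (mu_j mu_k + Sigma_M[j, k])."""
    n = Z.shape[0]
    return (Z.T @ Z / n) / (np.outer(mu_M, mu_M) + Sigma_M)

# The regression pieces for node j with index set A are then submatrices,
# e.g. Sigma_hat[np.ix_(Ac, Ac)] and rho_hat = Sigma_hat[Ac, j].
```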
However, the surrogate $\hat{\Sigma}$ is usually not positive semidefinite in high-dimensional models, so the related optimization problems may no longer be convex. To overcome this problem, we draw on the idea of Ref. [28] and obtain the sparse regression coefficients by minimizing

$$\hat{\theta}_j = \underset{\theta}{\arg\min}\ \frac{1}{2} \theta^{\mathrm{T}} (\tilde{\Sigma})_{A^c, A^c} \theta - \tilde{\rho}^{\mathrm{T}} \theta + \lambda \|\theta\|_1, \tag{10}$$

where $(\tilde{\Sigma})_{A^c, A^c}$ is the principal submatrix of $\tilde{\Sigma}$, $\tilde{\rho} = (\tilde{\Sigma})_{A^c, j}$, and $\tilde{\Sigma} = \arg\min_{\Sigma_1 \succeq 0} \|\Sigma_1 - \hat{\Sigma}\|_{\max}$ is the nearest positive semidefinite matrix. From the definition of $\tilde{\Sigma}$, it can be computed efficiently by the alternating direction method of multipliers (ADMM), and we have

$$\|\tilde{\Sigma} - \Sigma\|_{\max} \leq \|\tilde{\Sigma} - \hat{\Sigma}\|_{\max} + \|\hat{\Sigma} - \Sigma\|_{\max} \leq 2 \|\hat{\Sigma} - \Sigma\|_{\max}. \tag{11}$$

Eq. (11) follows from the triangle inequality and the definition of $\tilde{\Sigma}$, and it ensures that $\tilde{\Sigma}$ approximates $\Sigma$ as well as the initial surrogate $\hat{\Sigma}$ does.
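The nearest positive semidefinite projection in the elementwise maximum norm admits a standard ADMM routine. The sketch below is our own construction, not the authors' code: it alternates a projection onto the PSD cone with the proximal step of the max norm, the latter obtained from an $\ell_1$-ball projection via the Moreau decomposition:

```python
import numpy as np

def proj_l1_ball(v, r):
    """Euclidean projection of the vector v onto the l1-ball of radius r."""
    if np.abs(v).sum() <= r:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - (css - r) / k > 0)[0][-1]
    theta = (css[rho] - r) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def proj_psd(S):
    """Projection onto the positive semidefinite cone by eigenvalue clipping."""
    S = (S + S.T) / 2.0
    w, V = np.linalg.eigh(S)
    return (V * np.maximum(w, 0.0)) @ V.T

def nearest_psd_maxnorm(Sigma_hat, rho=1.0, n_iter=200):
    """ADMM for: minimize ||S - Sigma_hat||_max over PSD matrices S."""
    A = proj_psd(Sigma_hat)
    B, U = A.copy(), np.zeros_like(A)
    for _ in range(n_iter):
        A = proj_psd(B - U)
        # B-update: prox of (1/rho) * ||. - Sigma_hat||_max at A + U,
        # computed via the Moreau decomposition with an l1-ball projection.
        d = (A + U - Sigma_hat).ravel()
        B = Sigma_hat + (d - proj_l1_ball(d, 1.0 / rho)).reshape(A.shape)
        U = U + A - B
    return proj_psd(B)  # enforce exact PSD feasibility on exit
```

For moderate dimensions the dominant cost is the eigendecomposition performed in each iteration.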
Based on the regression step, for each node $j$ in the index set $A$ we cannot directly replace $X$ with $Z$ in Eq. (5) to calculate the residuals. Instead, we can use the estimated coefficients $\hat{\theta}_j$ together with the positive semidefinite surrogate $\tilde{\Sigma}$ as intermediate quantities to calculate the block estimates $\hat{\Omega}_{A,A}$. Stacking the blocks yields the initial COCOISEE estimator $\hat{\Omega}^{\mathrm{ini}}_{\mathrm{COCOISEE}}$, whose block-diagonal entries (Eq. (12)), off-block-diagonal entries (Eq. (13)) and assembled form (Eq. (14)) parallel the ISEE construction [21], with the corrupted quantities replaced by their convex conditioned surrogates. Then, for a given threshold $\tau \geq 0$, define

$$\hat{\Omega}_{\mathrm{COCOISEE}} = T_\tau\left(\hat{\Omega}^{\mathrm{ini}}_{\mathrm{COCOISEE}}\right), \tag{15}$$

where $T_\tau(\cdot)$ denotes the matrix thresholded at $\tau$, that is, entries with magnitude below $\tau$ are set to zero. Thus, the estimated set of links of the graphical structure is $\hat{E} = \{(j, k) : j \neq k,\ (\hat{\Omega}_{\mathrm{COCOISEE}})_{jk} \neq 0\}$. The choice of the threshold $\tau$ in Eq. (15) is important for practical implementation. We use a cross-validation method for large precision matrix estimation. Specifically, we randomly split the sample of $n$ rows of the data matrix into two subsamples of sizes $n_1$ and $n_2$, and repeat this $H$ times. Denote by $\hat{\Omega}_1^{(h)}$ and $\hat{\Omega}_2^{(h)}$ the precision matrix estimates defined as in Eq. (14) based on these two subsamples, respectively, for the $h$th split. Thus, the threshold can be chosen to minimize

$$\frac{1}{H} \sum_{h=1}^{H} \left\| T_\tau\left(\hat{\Omega}_1^{(h)}\right) - \hat{\Omega}_2^{(h)} \right\|_F^2. \tag{16}$$

The implementation of COCOISEE is summarized in Algorithm 1.
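The thresholding step and the random-split choice of $\tau$ in Eq. (16) can be sketched as follows. This is our illustration; `estimate_omega` stands in for the initial COCOISEE estimator of Eq. (14) and is an assumed callable:

```python
import numpy as np

def threshold(Omega, tau):
    """Entrywise thresholding T_tau, keeping the diagonal intact."""
    Out = np.where(np.abs(Omega) >= tau, Omega, 0.0)
    np.fill_diagonal(Out, np.diag(Omega))
    return Out

def choose_tau(Z, estimate_omega, taus, n_splits=5, seed=0):
    """Pick tau by minimizing the random-split criterion of Eq. (16)."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    losses = np.zeros(len(taus))
    for _ in range(n_splits):
        idx = rng.permutation(n)
        n1 = int(0.9 * n)  # 90/10 split, as described in Section 4
        O1 = estimate_omega(Z[idx[:n1]])
        O2 = estimate_omega(Z[idx[n1:]])
        for i, tau in enumerate(taus):
            losses[i] += np.linalg.norm(threshold(O1, tau) - O2, "fro") ** 2
    return taus[int(np.argmin(losses))]
```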

Theoretical properties
In this section, we will present several technical conditions and then analyze the theoretical properties of COCOISEE.

Technical conditions
Condition 1. (Closeness condition) Assume that the distributions of the measurement error matrices $A$ and $M$ are identified by a set of parameters $\kappa$. Then there exist universal constants and positive functions depending on $\kappa$ and $\Sigma$ such that, with probability tending to one, the surrogates $\tilde{\Sigma}$ and $\tilde{\rho}$ are close to $\Sigma$ and $\rho$, respectively, in terms of the elementwise maximum norm.

Condition 2. (Restricted eigenvalue condition) The restricted eigenvalue of the Gram matrix $(\tilde{\Sigma})_{A^c, A^c}$ is bounded away from zero.

Condition 3. (Irrepresentable condition) The irrepresentable condition of Refs. [35, 36] holds for the Gram matrix with respect to the true support set $S$.

Condition 1 requires that the surrogates $\tilde{\Sigma}$ and $\tilde{\rho}$ are close to $\Sigma$ and $\rho$, respectively, in terms of the elementwise maximum norm. Condition 2 is the restricted eigenvalue condition proposed in Ref. [34] and is widely used in Lasso-related work; it imposes a lower bound on the eigenvalues of the Gram matrix to constrain the correlations among relatively small numbers of predictors in the design matrix. Condition 3 is the irrepresentable condition proposed in Refs. [35, 36], originally used to prove a model selection consistency result for Gaussian graphical model selection using the Lasso. Conditions 1-3 are needed to prove Lemma A.1 in the appendix.

Here $S = \mathrm{supp}(\theta^*)$ denotes the true support set of the regression coefficient vector $\theta^*$.

Condition 4. The entries of the measurement error matrices $A$ and $M$ are all independent and identically distributed sub-Gaussian random variables.

Condition 5. The random vectors $x_i$ obey a zero-mean sub-Gaussian distribution with a finite variance parameter.

Condition 6. The Gaussian graphical model we focus on satisfies $\Omega \in G(M, K) = \{\Omega : \text{each row has at most } K \text{ nonzero off-diagonal entries and } M^{-1} \leq \lambda_{\min}(\Omega) \leq \lambda_{\max}(\Omega) \leq M\}$.
Condition 4 imposes a mild assumption on the measurement error matrices, since sub-Gaussian variables form a natural class of random variables to which the properties of Gaussians extend [37]. Condition 5 is standard and has been widely used in high-dimensional linear regression models. Condition 6 guarantees that the number of links for each node is bounded from above by $K$ and that the precision matrix has a bounded spectrum [21].

Algorithm 1. COCOISEE.
for each index set $A_l$, $l = 1, \ldots, L$ do
    Solve the convex conditioned regressions in Eq. (10) for the nodes in $A_l$.
    Construct the block-diagonal entries according to Eq. (12).
end for
Construct the off-block-diagonal entries according to Eq. (13).
Construct the initial COCOISEE estimator according to Eq. (14) and threshold it as in Eq. (15).

Theorem 1. Assume that Conditions 1-6 hold and $\lambda K = o(1)$. Then, with probability tending to one, the initial COCOISEE estimator $\hat{\Omega}^{\mathrm{ini}}_{\mathrm{COCOISEE}}$ in Eq. (14) satisfies an entrywise infinity norm estimation bound given by the maximum of two rate components, corresponding to the block-diagonal and off-block-diagonal entries of $\Omega$.

Theorem 1 establishes the entrywise infinity norm estimation bound for the initial COCOISEE estimator. From the proof of Theorem 1 in the appendix, we see that the rate of convergence for the initial COCOISEE estimator is the maximum of two components, for both the block-diagonal and off-block-diagonal entries of $\Omega$. Since we assume that $\lambda K = o(1)$, one component dominates the other, so the rate of convergence for the initial COCOISEE estimator reduces to the dominating component.
Theorem 2. Assume that the conditions of Theorem 1 hold and that the threshold $\tau$ is chosen of the same order as the rate in Theorem 1, with some sufficiently large constants. Then, with probability tending to one, $\mathrm{supp}(\hat{\Omega}_{\mathrm{COCOISEE}}) \subseteq \mathrm{supp}(\Omega)$, and for any entry $(j, k)$ with $|\omega_{jk}|$ above the threshold level, we have $(\hat{\Omega}_{\mathrm{COCOISEE}})_{jk} \neq 0$.

Theorem 2 shows that the sparse precision matrix estimator $\hat{\Omega}_{\mathrm{COCOISEE}}$ and the recovered graph enjoy good asymptotic properties.

Simulation studies
In this section, we investigate the finite-sample performance of the proposed COCOISEE method. To illustrate the impact of ignoring measurement errors, we compare it with two methods designed for clean data: Glasso [12] and ISEE [21]. The three methods were implemented as follows. Glasso was implemented with the R package glasso, and its tuning parameter was chosen by five-fold cross-validation. ISEE [21] selected the tuning parameters in the scaled Lasso [38] following the suggestion of Ref. [39], and adapted the cross-validation method proposed in Refs. [40, 41] for the threshold parameters. For our COCOISEE approach, we used five-fold corrected cross-validation, similar to Ref. [29]. Specifically, we used 90% of the sample to calculate $\hat{\Omega}_1^{(h)}$ and the remaining 10% to calculate $\hat{\Omega}_2^{(h)}$. Then we chose the threshold $\tau$ from 20 equally spaced values by minimizing criterion (16), with a fixed number $H$ of random splits.

We generated 50 data sets. For each data set, we generated the precision matrix in two steps. First, we produced a band matrix $\Omega_0$ with diagonal entries equal to one, a common nonzero value on the first off-diagonals (entries $(j, k)$ with $|j - k| = 1$), and all other entries equal to zero. Second, we randomly permuted the rows and columns of $\Omega_0$ to obtain the precision matrix $\Omega$; thus, in general, the final precision matrix no longer has the band structure. We next sampled the rows of the data matrix $X$ as independent and identically distributed (i.i.d.) vectors from the multivariate Gaussian distribution $N(0, \Omega^{-1})$. Then, based on the different types of measurement errors, the corrupted covariate matrices were generated as follows (a code sketch follows the list).
• Additive errors case. The observed covariates are $Z = X + A$, where the rows of $A$ are i.i.d. vectors drawn from a multivariate Gaussian distribution $N(0, \Sigma_A)$ with a prespecified error covariance.
• Multiplicative errors case. The observed covariates are $Z = X \circ M$, where the elements of $M$ follow a log-normal distribution; that is, the $m_{ij}$'s are i.i.d. variables with $\log m_{ij}$ drawn from a Gaussian distribution with a prespecified variance.
• Missing data case. The observed covariates are defined as $z_{ij} = x_{ij} m_{ij}$, where $m_{ij} = I(x_{ij}\ \text{is not missing})$, so that the covariates are missing at random with probability 0.1.
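A minimal sketch of this data-generating process (our own illustration; the off-diagonal value 0.4, the sizes and the noise level are assumptions, since the paper's exact settings are not stated here). The corruption helpers sketched in Section 2 can be reused for the other error cases:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_precision(p, off=0.4):
    """Band matrix with unit diagonal and `off` on the first off-diagonals,
    randomly permuted so the final matrix loses the band structure."""
    Omega0 = np.eye(p) + off * (np.eye(p, k=1) + np.eye(p, k=-1))
    perm = rng.permutation(p)
    return Omega0[np.ix_(perm, perm)]

n, p = 200, 100
Omega = make_precision(p)
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)
Z = X + 0.3 * rng.standard_normal((n, p))  # e.g., the additive errors case
```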
Due to the high computational cost, we consider settings with moderate sample size whose dimensionality $p$ varies over several values. To compare the aforementioned methods, we consider the same performance measures as suggested in Ref. [21]. The first two measures are the true positive rate (TPR) and the false positive rate (FPR), defined as

$$\mathrm{TPR} = \frac{\#\ \text{of correctly identified edges}}{\#\ \text{of true edges in total}}, \quad \mathrm{FPR} = \frac{\#\ \text{of falsely identified edges}}{\#\ \text{of true nonedges in total}},$$

respectively. We also use the Frobenius norm to calculate the normalized estimation error (NEE), defined as

$$\mathrm{NEE} = \frac{\|\hat{\Omega} - \Omega\|_F}{\|\Omega\|_F}.$$

Tables 1-3 summarize the simulation results on these three performance measures for the additive error, multiplicative error and missing data cases. In view of the TPR, FPR and NEE in these tables, it is clear that COCOISEE attains the smallest NEE, while its TPR and FPR also perform very well.
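These measures can be computed with a small helper of ours, following the definitions above:

```python
import numpy as np

def metrics(Omega_hat, Omega, tol=1e-10):
    """TPR and FPR over off-diagonal supports, plus the normalized error."""
    off = ~np.eye(Omega.shape[0], dtype=bool)
    true_edge = (np.abs(Omega) > tol) & off
    est_edge = (np.abs(Omega_hat) > tol) & off
    tpr = (est_edge & true_edge).sum() / max(true_edge.sum(), 1)
    fpr = (est_edge & ~true_edge).sum() / max((off & ~true_edge).sum(), 1)
    nee = np.linalg.norm(Omega_hat - Omega) / np.linalg.norm(Omega)  # Frobenius
    return tpr, fpr, nee
```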

Conclusions
In this paper, we have introduced a new methodology, COCOISEE, to achieve scalable and interpretable estimation of a Gaussian graphical model under both additive and multiplicative measurement errors. It takes full advantage of the recently developed innovative scalable efficient estimation and the nearest positive semidefinite matrix projection, thus enjoying stepwise convexity and scalability in Gaussian graphical model estimation. The suggested method is ideal for parallel, distributed and cloud computing, and has been shown to enjoy appealing theoretical properties. Both the established theoretical properties and the numerical performance demonstrate that the proposed method achieves good estimation and recovery accuracy and high scalability under both additive and multiplicative measurement errors. Similar theoretical conclusions can also be extended to sub-Gaussian settings. It would be very interesting to study extensions of COCOISEE to more general model settings, such as time series models, generalized linear models and large nonparanormal graphical models, which are beyond the scope of the current paper and demand future study.
There are three main contributions in this paper. First, the proposed COCOISEE method is scalable and highly efficient: the main computational cost comes from the estimation of the regression coefficients in the presence of measurement errors, while the other steps use matrix operations, which are computationally efficient. Second, we use the recently developed positive semidefinite matrix projection technique to mitigate the effects of measurement errors when recovering high-dimensional coefficient matrices from column-by-column regressions; COCOISEE therefore enjoys stepwise convexity and computational stability. Finally, we provide comprehensive theoretical support for the proposed method by establishing the consistency of the estimator. Numerical results demonstrate the effectiveness of the proposed method.


Table 3. Missing data case.