High-dimensional robust precision matrix estimation: Cellwise corruption under $\epsilon$-contamination

We analyze the statistical consistency of robust estimators for precision matrices in high dimensions. We focus on a contamination mechanism acting cellwise on the data matrix. The estimators we analyze are formed by plugging appropriately chosen robust covariance matrix estimators into the graphical Lasso and CLIME. Such estimators were recently proposed in the robust statistics literature, but have only been analyzed mathematically from the point of view of the breakdown point. This paper provides complementary high-dimensional error bounds for the precision matrix estimators that reveal the interplay between the dimensionality of the problem and the degree of contamination permitted in the observed distribution. We also show that although the graphical Lasso and CLIME estimators perform equally well from the point of view of statistical consistency, the breakdown property of the graphical Lasso is superior to that of CLIME. We discuss implications of our work for problems involving graphical model estimation when the uncontaminated data follow a multivariate normal distribution and the goal is to estimate the support of the population-level precision matrix. Our error bounds make no assumptions about the contaminating distribution and allow for a nonvanishing fraction of cellwise contamination.


Introduction
Covariance matrix estimation has long taken center stage in multivariate analysis (Anderson, 2003). The sample covariance estimator, which originates as the maximum likelihood estimator under a multivariate normal model, is optimal in many respects: It is unbiased, consistent, efficient under various distributional assumptions, and easy to compute. Despite its many positive traits, however, the sample covariance matrix is also highly non-robust when data are observed subject to contamination. Hence, various procedures in robust statistics have been derived to obtain a covariance matrix estimator that behaves well even in the presence of contaminated data (Huber, 1981; Hampel et al., 2011).
In other areas of multivariate analysis, the precision matrix Ω* := (Σ*)^{-1} is of significant interest. Examples include computing Mahalanobis distances, linear discriminant analysis, and Gaussian graphical models. In the setting of graphical models, a random vector X is associated with an undirected graph G = (V, E) that encodes the conditional independence relations between components of X (Lauritzen, 1996). The vertex set V contains X_1, . . . , X_p, while the edge set E consists of pairs (i, j), where (i, j) ∈ E if X_i and X_j are connected by an edge. For each nonedge (i, j) ∉ E, the variables X_i and X_j are conditionally independent given all the remaining variables. When X ∼ N(µ, Σ*), pairwise conditional independence holds if and only if Ω*_ij = 0. Thus, recovering the support of the precision matrix is equivalent to graphical model selection. The aforementioned observations have been used for network reconstruction in many scientific fields, including genetics and neuroscience (e.g., see Werhli et al. (2006), Smith et al. (2011), and the references cited therein). When the dimensionality p is small compared to the number of samples n, a reasonable method for robust precision matrix estimation could consist of computing a robust estimate of the covariance matrix and then taking a matrix inverse.
With the recent deluge of high-dimensional data, however, a need has arisen to obtain high-dimensional analogs of classical statistical procedures that are both computable and possess rigorous theoretical guarantees. Although several methods, notably the graphical Lasso (GLasso) (Yuan and Lin, 2007; Friedman et al., 2008) and the method of constrained ℓ1-minimization for inverse matrix estimation (CLIME) (Cai et al., 2011), have been proposed for high-dimensional precision matrix estimation, robust estimation of high-dimensional precision matrices has only recently emerged in the literature. The GLasso and CLIME estimators themselves tend to perform poorly under contaminated data, since they take as input the sample covariance matrix, which is sensitive to even a single outlier.
Popular classical robust covariance estimators are applicable in settings where less than half of the observation vectors are contaminated. Such an assumption is closely connected to the Tukey-Huber contamination model that underlies much of the existing robustness theory (Tukey, 1962; Huber, 1964). In the Tukey-Huber contamination model, a mixture distribution with a dominant nominal component (such as a multivariate normal distribution) and a minority unspecified component is posited, and each observation vector is either completely clean or completely spoiled. Classical robust covariance estimators then involve downweighting contaminated observations in order to reduce their influence. When the dimension p is large, however, the fraction of perfectly observed data vectors may be rather small: If all components of an observation vector had an independent chance of being contaminated, most observation vectors would be contaminated. Thus, downweighting an entire observation would waste the information contained in the clean components of the observation vector. This describes the setting of the cellwise contamination model, which was developed by Alqallaf et al. (2002). It generalizes the classical Tukey-Huber contamination model, which may be viewed as a case of rowwise contamination of the data matrix, and is fairly realistic for applications involving measurement error in DNA microarray analysis (Troyanskaya et al., 2001) or dropout measurements in sensor arrays (Swanson, 2000).
On the other hand, most existing approaches for robust covariance estimation focus on affine equivariance. These include the M-estimators (Maronna, 1976), the Minimum Volume Ellipsoid (MVE) and Minimum Covariance Determinant (MCD) estimators (Rousseeuw, 1984, 1985), and the Stahel-Donoho (SD) estimator (Stahel, 1981; Donoho, 1982). Although affine equivariance may be a desirable property under rowwise contamination, it is less appropriate in the setting of cellwise contamination, since linear combinations of observation vectors lead to a propagation of outliers (Alqallaf et al., 2009). In addition, the MVE, MCD, and SD estimators all require heavy computational effort, rendering them impractical for high-dimensional datasets. To deal with cellwise contamination, Van Aelst (2014) proposed a modified SD estimator that adapts winsorization (Huber, 1981; Alqallaf et al., 2002) and a cellwise weighting scheme. Similar to the original SD estimator, however, computation is only feasible for small p. A recent approach by Agostinelli et al. (2014) is capable of dealing with both rowwise and cellwise outliers. The procedure consists of two steps: (1) flag cellwise outliers as missing values; and (2) apply a rowwise robust method to the incomplete data. However, computation is again infeasible in high dimensions. Other recent proposals for robust high-dimensional covariance matrix estimation include Chen et al. (2015) and Han et al. (2015), but both methods treat different contamination models and are not suitable for handling data with cellwise contamination.
In contrast, relatively few approaches exist for robust high-dimensional precision matrix estimation under any form of contamination. One method is supplied by the TLasso estimator of Finegold and Drton (2011), which builds upon the GLasso and models the data as coming from the multivariate t-distribution, a long-tailed surrogate for the multivariate normal distribution. The "alternative multivariate t-distribution" is used to model a case where different coordinates of the distribution are obtained from the latent multivariate normal distribution using different weights. Although the TLasso demonstrates a higher degree of robustness than the GLasso under both rowwise and cellwise contamination in simulations, a theoretical analysis from the point of view of robust statistics has not been derived. More recently, Oellerer and Croux (2014) and Tarr et al. (2015) propose promising new methods for high-dimensional precision matrix estimation, designed specifically for cellwise contamination. The methods consist of combining a robust covariance estimator that may be computed efficiently with a suitable high-dimensional precision matrix estimation procedure. However, whereas Tarr et al. (2015) focus on developing new methodology and Oellerer and Croux (2014) analyze the breakdown behavior of the precision matrix estimators, a rigorous high-dimensional analysis from the point of view of statistical consistency has not been conducted.
In this paper, we focus on high-dimensional robust estimation of precision matrices under the cellwise contamination model, using the estimators proposed by Oellerer and Croux (2014) and Tarr et al. (2015). Formally, we derive statistical error bounds in elementwise ℓ∞-norm for robust precision matrix estimation procedures under an ε-contamination model, where at most an ε fraction of entries in the data matrix are corrupted by outliers. Our work fuses two threads of research involving classical robust procedures and high-dimensional statistical estimation in a novel and rigorous manner. The bounds we derive match standard high-dimensional bounds for uncontaminated precision matrix estimation, up to a constant multiple of ε. Furthermore, they are of a complementary nature to Oellerer and Croux (2014), since we are primarily concerned with robustness as measured from the viewpoint of statistical consistency, rather than breakdown behavior.
More generally, our results reveal an interesting interplay between bounds for statistical error under ε-contamination and classical measures of robustness such as the influence function (Hampel, 1974) and breakdown point (Donoho and Huber, 1983). Estimators with bounded influence have long been favored in classical robust statistics, as the rate of change in the statistical functional associated with the estimator is controlled when the nominal distribution is contaminated by an arbitrary point mass distribution. Our results show that a variety of bounded influence estimators, including Kendall's and Spearman's correlation coefficients, give rise to (inverse) covariance estimators with statistical error rates that depend linearly on the degree of contamination; the converse relationship may be seen to hold more generally as a result of our proof arguments. On the other hand, our discussion of the breakdown point of the precision matrix estimators, building upon the analysis of Oellerer and Croux (2014), emphasizes the significant differences between the notions of breakdown point and statistical consistency. Whereas our analysis shows that the robust CLIME and GLasso procedures have comparable behavior from the point of view of high-dimensional statistical consistency, the CLIME estimator has a substantially smaller breakdown point than the GLasso, due to its constrained feasibility region. Rather than advocating one measure of robustness over another, our discussion emphasizes the value of considering different quantitative measures of robustness in selecting an appropriate estimator.
The remainder of the paper is organized as follows: Section 2 furnishes the mathematical background for the cellwise contamination model and the robust covariance and precision matrix estimators to be considered in the paper. Section 3 presents our main theoretical contributions, providing bounds on the statistical error of the covariance and precision matrix estimators under the cellwise contamination model, as well as concrete consequences in the presence of outliers and/or missing data. Section 4 provides a discussion of the breakdown point for the robust GLasso and CLIME estimators. In Section 5, we discuss the main steps of the proofs of our theorems. Section 6 contains simulation results that are used to validate the theoretical results of the paper. We conclude with a discussion in Section 7, including some avenues for future research.
Notation: For a vector a = (a_1, . . . , a_p)^T ∈ R^p, we use standard notation for vector and matrix norms. For a symmetric matrix A, we write λ_1(A) ≥ · · · ≥ λ_p(A) to denote the ordered eigenvalues of A, and we write A ≻ 0 (respectively, A ⪰ 0) to indicate that A is positive definite (respectively, positive semidefinite). We write I for the identity matrix and 0 for the vector of all zeros (the respective dimensions of which will be clear from context). The binary operation ⊗ denotes the tensor product.

Background and Problem Setup
We begin with a description of the cellwise contamination model, followed by a rigorous formulation of the robust covariance and precision matrix estimators to be studied in our paper.
Following the notation of Alqallaf et al. (2002, 2009), we write the cellwise contamination model in the following form:

X_k = (I − B_k) Y_k + B_k Z_k.   (1)

Here, we observe the contaminated random vector X_k ∈ R^p. The unobservable random quantities Y_k, Z_k, and B_k are independent, with Y_k ∼ G (a nominal distribution) and Z_k ∼ H* (an unspecified outlier-generating distribution). Furthermore, B_k = diag(B_k1, . . . , B_kp) is a diagonal matrix, where B_k1, . . . , B_kp are independent Bernoulli random variables with P(B_ki = 1) = ε_i, for all 1 ≤ i ≤ p. When ε_1 = · · · = ε_p = ε, the probability of an observation vector having no contamination in any component is (1 − ε)^p, a quantity that decreases exponentially as the dimension increases. This probability goes below the critical value 1/2 for p ≥ 14 at ε = 0.05, and for p ≥ 69 at ε = 0.01. Equation (1) is a special case of a more general model, where we allow other joint distributions for B_k1, . . . , B_kp. For instance, if B_k1, . . . , B_kp were completely dependent (i.e., P(B_k1 = · · · = B_kp) = 1), we would obtain the rowwise contamination model. In that case, the probability of an observation vector being totally free of contamination would be 1 − ε, which is independent of the dimension. Alqallaf et al. (2009) also use the terms fully independent contamination model (FICM) and fully dependent contamination model (FDCM) to denote the cellwise and rowwise contamination settings, in order to distinguish the pattern of contamination across rows of the data matrix.
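As a concrete illustration, the cellwise mechanism and the clean-row probability (1 − ε)^p can be sketched in a few lines of Python; the point-mass outlier value and the standard-library-only style are conveniences of this sketch, not part of the model:

```python
import random

def contaminate(Y, eps, outlier=10.0, rng=random):
    """Cellwise contamination: each cell of the data matrix is replaced,
    independently with probability eps, by a draw from the outlier
    mechanism (here a point mass, for simplicity)."""
    return [[(outlier if rng.random() < eps else y) for y in row] for row in Y]

def prob_clean_row(eps, p):
    """Probability that a p-dimensional observation vector escapes
    contamination in every component: (1 - eps)^p."""
    return (1.0 - eps) ** p

# The clean-row probability drops below 1/2 at p = 14 for eps = 0.05
# and at p = 69 for eps = 0.01, as noted in the text.
assert prob_clean_row(0.05, 14) < 0.5 < prob_clean_row(0.05, 13)
assert prob_clean_row(0.01, 69) < 0.5 < prob_clean_row(0.01, 68)
```

Under rowwise (FDCM) contamination, by contrast, the analogous probability is 1 − ε regardless of p, which is why the cellwise model becomes the more demanding regime as the dimension grows.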
Throughout, we will work under the cellwise contamination model (1), and assume that G is a multivariate normal distribution N (µ, Σ * ). Our goal is to estimate the matrices Σ * and Ω * = (Σ * ) −1 from the (uncontaminated) normal component.

Covariance Matrix Estimation
Note that when ε = 0 (i.e., the data are uncontaminated), we may use the classical sample covariance matrix estimator Σ̂, defined pairwise as

Σ̂_ij = (1/n) Σ_{k=1}^n (X_ki − X̄_i)(X_kj − X̄_j),

where X̄_i = (1/n) Σ_{k=1}^n X_ki and X̄_j = (1/n) Σ_{k=1}^n X_kj. When n ≫ p, the sample covariance is an efficient estimator of Σ*. However, when ε > 0, the performance of Σ̂ may be compromised, depending on the properties of H*: Under the cellwise contamination model, for i ≠ j, the covariance Σ*_X,ij of the observed variables mixes contributions from the nominal covariance Σ*_Y,ij and the covariance Σ*_Z,ij of the contaminating distribution. When no restrictions are placed on the covariance Σ*_Z of the contaminating distribution, the elementwise deviations between Σ*_X and Σ*_Y (and consequently, also between the sample covariance Σ̂_X := Σ̂ and Σ*_Y) will in general behave arbitrarily badly. Furthermore, note that even when Σ*_Z is constrained to lie in a space where the deviations between Σ*_X and Σ*_Y are suitably bounded, we would require the contaminating distribution to have properties such as sub-Gaussian tails in order to ensure consistency of the sample covariance estimator at the rate O(√(log p / n)). When a procedure based on covariance estimation is used to estimate the precision matrix, the errors incurred during the covariance estimation step propagate to the next step; for instance, this issue arises when using the CLIME or GLasso estimator. In contrast, our theory for robust covariance estimators will not require any assumptions on either Σ*_Z or the tail behavior of the contaminating distribution.
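To see how fragile the sample covariance is, the following small sketch (with hypothetical data values) shows a single corrupted cell driving the pairwise sample covariance arbitrarily far from its clean value:

```python
from statistics import mean

def sample_cov(x, y):
    """Classical (1/n)-normalized pairwise sample covariance."""
    n = len(x)
    xb, yb = mean(x), mean(y)
    return sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / n

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
clean = sample_cov(x, y)

# Corrupt a single cell with a gross outlier: the estimate is ruined,
# and the deviation scales with the magnitude of the outlier.
y_bad = y[:]
y_bad[0] = 1e6
corrupt = sample_cov(x, y_bad)
assert abs(corrupt - clean) > 1e5
```

This is precisely the non-robustness that the pairwise estimators of the next subsection are designed to avoid.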
To deal with cellwise contamination in the high-dimensional setting, we therefore take the pairwise approach suggested by Oellerer and Croux (2014), where a robust covariance or correlation estimate is computed for each pair of variables. Early proposals of robust procedures are of this type (Bickel, 1964; Puri and Sen, 1971), where a coordinatewise approach is taken for robust estimation of location. In addition to having relatively low computational complexity, the pairwise approach is appealing because a high breakdown point of the pairwise estimators translates into a high breakdown point of the overall covariance matrix. For 1 ≤ i, j ≤ p, we write

Σ*_ij = σ_i σ_j ρ_ij,

where σ_i = [Var(X_ki)]^{1/2}, σ_j = [Var(X_kj)]^{1/2}, and ρ_ij = Corr(X_ki, X_kj). We will take suitable robust estimators σ̂_i, σ̂_j, and ρ̂_ij to obtain the covariance matrix estimator Σ̂, with (i, j) entry Σ̂_ij = σ̂_i σ̂_j ρ̂_ij. To estimate σ_i, we consider the median absolute deviation from the median (MAD), a robust measure of scale. The MAD estimator was popularized by Hampel (1974), who attributes the concept to Gauss; it has a breakdown point of 50%. Let X_(1),i ≤ · · · ≤ X_(n),i denote the ordered values of X_1i, . . . , X_ni. The sample median m̂_i and the sample MAD d̂_i are defined, respectively, as m̂_i = X_(k*),i and d̂_i = W_(k*),i, where W_(1),i ≤ · · · ≤ W_(n),i are the ordered values of W_ki = |X_ki − m̂_i|, for k = 1, . . . , n, and k* = ⌈n/2⌉. Expressed another way, d̂_i is the median of the absolute deviations of X_1i, . . . , X_ni from their median. We then estimate σ_i by σ̂_i = [Φ^{-1}(0.75)]^{-1} d̂_i, where the constant [Φ^{-1}(0.75)]^{-1} is chosen in order to make the estimator consistent for σ_i at the normal distribution. The population-level median of a distribution with cdf F is defined to be m(F) := F^{-1}(0.5), where F^{-1}(c) = inf{x : F(x) ≥ c}, for c ∈ [0, 1]. Similarly, we may define the population-level MAD d(F) to be the median of the distribution of |X − m(F)|, where X has cdf F.
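A minimal sketch of the rescaled MAD scale estimator, using only the Python standard library (the test data below are hypothetical):

```python
from statistics import NormalDist, median

PHI_INV_075 = NormalDist().inv_cdf(0.75)  # Phi^{-1}(0.75) ~= 0.6745

def mad_scale(x):
    """MAD of a sample, rescaled to be consistent for the standard
    deviation at the normal distribution: sigma_hat = MAD / Phi^{-1}(0.75)."""
    m = median(x)                          # sample median m_hat
    d = median(abs(xi - m) for xi in x)    # raw MAD d_hat
    return d / PHI_INV_075

# The MAD shrugs off gross outliers in up to half of the cells:
x = [0.9, 1.1, 1.0, 0.8, 1.2, 1e6, -1e6]
assert mad_scale(x) < 1.0   # still on the scale of the clean cells
```

Replacing the two huge cells with well-behaved values changes the estimate only slightly, which is the 50% breakdown property at work.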
Kendall's tau: This statistic is given by

r^K_ij = (2 / (n(n − 1))) Σ_{k<l} sign(X_ki − X_li) sign(X_kj − X_lj).

Spearman's rho: This statistic is given by

r^S_ij = Σ_{k=1}^n (rank(X_ki) − (n+1)/2)(rank(X_kj) − (n+1)/2) / [Σ_{k=1}^n (rank(X_ki) − (n+1)/2)^2 Σ_{k=1}^n (rank(X_kj) − (n+1)/2)^2]^{1/2},

where rank(X_ki) denotes the rank of X_ki among X_1i, . . . , X_ni.
Hence, for asymptotic consistency at the normal distribution, our estimators of ρ_ij are the transformed versions of Kendall's tau and Spearman's rho, given by ρ̂^K_ij = sin((π/2) r^K_ij) and ρ̂^S_ij = 2 sin((π/6) r^S_ij), respectively. We then define our robust covariance matrix estimator Σ̂ entrywise by Σ̂_ij = σ̂_i σ̂_j ρ̂_ij.
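The transformed rank-correlation estimators can be sketched as follows; the pair-counting implementation is a naive O(n²) version for illustration, and distinct sample values are assumed:

```python
import math
from itertools import combinations

def _sign(t):
    return (t > 0) - (t < 0)

def kendall_tau(x, y):
    """r^K: (concordant - discordant pairs) / (n choose 2)."""
    n = len(x)
    s = sum(_sign((x[k] - x[l]) * (y[k] - y[l]))
            for k, l in combinations(range(n), 2))
    return 2.0 * s / (n * (n - 1))

def _ranks(x):
    order = sorted(range(len(x)), key=x.__getitem__)
    r = [0] * len(x)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman_rho(x, y):
    """r^S: Pearson correlation of the ranks (distinct values assumed)."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    c = (n + 1) / 2.0
    num = sum((a - c) * (b - c) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - c) ** 2 for a in rx) * sum((b - c) ** 2 for b in ry))
    return num / den

def rho_hat_kendall(x, y):
    # Transform for consistency at the normal: sin(pi/2 * r^K).
    return math.sin(math.pi / 2 * kendall_tau(x, y))

def rho_hat_spearman(x, y):
    # Transform for consistency at the normal: 2 sin(pi/6 * r^S).
    return 2 * math.sin(math.pi / 6 * spearman_rho(x, y))

x = [0.2, 1.5, -0.3, 2.1, 0.9]
y = [xi + 1.0 for xi in x]          # perfectly monotone relationship
assert abs(rho_hat_kendall(x, y) - 1.0) < 1e-9
assert abs(rho_hat_spearman(x, y) - 1.0) < 1e-9
```

Because both statistics depend on the data only through signs and ranks, replacing any one cell by an arbitrarily large outlier perturbs them by at most O(1/n), which is the source of their bounded influence.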

Precision Matrix Estimation
A long line of literature exists for precision matrix estimation in the high-dimensional setting. We will focus our attention on sparse precision matrix estimation; i.e., Ω* contains many zero entries. In this section, we review two techniques, the GLasso and CLIME, which produce a sparse precision matrix estimator based on optimizing a function of the sample covariance matrix. As proposed by Oellerer and Croux (2014) and Tarr et al. (2015), these methods may easily be modified to obtain robust versions, where the sample covariance matrix estimator is simply replaced by a robust covariance estimator Σ̂ as described in the previous section.
The graphical lasso (GLasso) estimator (Yuan and Lin, 2007; Friedman et al., 2008) is defined as the maximizer of the following ℓ1-penalized log-likelihood function:

Ω̂ = argmax_{Ω ≻ 0} { log det Ω − trace(Σ̂ Ω) − λ Σ_{i,j} |Ω_ij| }.

Here, λ > 0 is a tuning parameter that controls the sparsity of the resulting precision matrix estimator.
In this paper, we replace the sample covariance matrix Σ̂ by the robust alternative Σ̃, and consider a variant where only the off-diagonal entries of the estimator are penalized:

Ω̂ = argmax_{Ω ≻ 0} { log det Ω − trace(Σ̃ Ω) − λ Σ_{i≠j} |Ω_ij| }.   (8)

Note that although the program (8) is convex for any choice of Σ̃ ∈ R^{p×p}, several state-of-the-art algorithms for optimizing the GLasso require the matrix Σ̃ to be positive semidefinite (Friedman et al., 2008; Zhao et al., 2012; Hsieh et al., 2011). We will first derive statistical theory for the robust GLasso without a positive semidefinite projection step, and then discuss properties of the projected version in Section 4. A popular alternative to the GLasso is the method of constrained ℓ1-minimization for inverse matrix estimation (CLIME) proposed in Cai et al. (2011). The CLIME routine solves the following convex optimization problem by linear programming:

Ω̂^1 = argmin_Ω ||Ω||_1, subject to ||Σ̂ Ω − I||_∞ ≤ λ.

Note that here, no symmetry condition is imposed on Ω, and the solution is not symmetric in general. If a symmetric precision matrix estimate is desired, we may perform a post-symmetrization step on Ω̂^1 = (ω^1_ij) to obtain the symmetric matrix Ω̂_sym, defined by

(Ω̂_sym)_ij = ω^1_ij 1{|ω^1_ij| ≤ |ω^1_ji|} + ω^1_ji 1{|ω^1_ji| < |ω^1_ij|}.   (9)

In other words, between ω^1_ij and ω^1_ji, we pick the entry with smaller magnitude. Similar to the GLasso case, we will robustify the CLIME estimator by solving

Ω̂ = argmin_Ω ||Ω||_1, subject to ||Σ̃ Ω − I||_∞ ≤ λ,   (10)

and then apply the post-symmetrization step (9) to Ω̂ to obtain the final robust CLIME estimator Ω̂_sym.
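The post-symmetrization step (9) is straightforward to implement; a minimal sketch for a dense matrix stored as nested lists:

```python
def symmetrize(Omega):
    """Post-symmetrization: between Omega[i][j] and Omega[j][i],
    keep the entry of smaller magnitude (ties go to Omega[i][j])."""
    p = len(Omega)
    S = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            a, b = Omega[i][j], Omega[j][i]
            S[i][j] = a if abs(a) <= abs(b) else b
    return S

Omega = [[2.0, 0.5], [-0.1, 3.0]]
assert symmetrize(Omega) == [[2.0, -0.1], [-0.1, 3.0]]
```

Choosing the smaller-magnitude entry keeps the symmetrized estimate at least as sparse (after thresholding) as either asymmetric candidate, which is the rationale behind this step.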

Main Results and Consequences
In this section, we provide rigorous statements of the main results of the paper. We begin by deriving bounds for robust covariance matrix estimation, which are used to obtain bounds on the error incurred by the precision matrix estimator. Note, however, that the statistical error bounds presented in Section 3.1 are of independent interest, and we believe they are the first bounds appearing in the literature that quantify the robustness of covariance matrix estimators under a cellwise contamination model.

Covariance Matrix Estimation
Throughout this section, we will assume that the standard deviations σ_i of the uncontaminated distributions are bounded as in inequality (11), and we define the auxiliary expression appearing in condition (12). Our first theorem provides a bound on the statistical error of the robust covariance estimator Σ̂^K based on Kendall's tau correlations. Note that our result does not involve any assumptions on the nature of the contaminating distribution H*. Thus, the distribution H* may contain point masses, and we do not require a probability density function of H* to even exist.
Theorem 1. Under the cellwise contamination model (1), suppose inequality (11) is satisfied and ε = max_{1≤i≤p} ε_i ≤ 0.02. Let C > π√2 and C′ be constants satisfying condition (12), and suppose Φ^{-1}(0.75) C′ √(log p / n) < 1. Then, with probability tending to one, the robust covariance estimator Σ̂^K satisfies the elementwise error bound (13).

The proof of Theorem 1 is provided in Section 5.1.
Remark 1. Theorem 1 clearly illustrates the effect of ε-contamination on the estimation error of the covariance matrix estimator. Note that when ε = 0, we recover the minimax optimal rate for covariance matrix estimation in ℓ∞-norm (Cai and Zhou, 2012); although the estimator Σ̂^K is not equal to the sample covariance estimator in the uncontaminated case, the robust covariance estimator nonetheless converges to the true covariance matrix at the optimal rate. On the other hand, cellwise contamination introduces an extra term that is linear in ε. Another way to interpret the bound (13) is that if the level of contamination ε is bounded by a constant times √(log p / n), then the robust covariance estimator Σ̂^K will enjoy the same statistical error rate as the optimal covariance estimator in the uncontaminated case. As we will see in Theorems 3 and 4 below, the sample size requirements for precision matrix estimation are such that the condition ε ≤ C √(log p / n) still allows for a nonvanishing fraction of contamination. Furthermore, note that although the restriction ε ≤ 0.02 may seem somewhat prohibitive, the proof of Theorem 1 reveals that the specific bound on ε is an artifact of the proof technique, and a more careful analysis would allow for a larger degree of contamination, at the expense of slightly looser constants in the covariance estimation bound (13), as long as ε is bounded by some constant in [0, 1).
The following theorem is an analog of Theorem 1, derived for the robust covariance estimator Σ̂^S based on Spearman's correlation coefficient. We assume that the ranks of the variables between samples are distinct; note that this happens almost surely when the contaminating distribution has a continuous density.
Remark 2. The conclusion of Theorem 2 is very similar to that of Theorem 1, except for the constants and an additional requirement on the size of n. However, note that when (log p)/n = o(1), implying the statistical consistency of the robust covariance estimator, the requirement n ≥ max{15, 16π²C² log p} is essentially extraneous.
Although the high-dimensional error bounds derived in Theorems 1 and 2 are substantially different from the canonical measures analyzed in the robust statistics literature, our bounds are somewhat related to the notion of the influence function of an estimator. The influence function (Hampel, 1974), defined at the population level, measures the infinitesimal change incurred by the statistical functional associated with an estimator when the underlying distribution is contaminated by a point mass. Thus, an estimator has a bounded influence function if the extent of the deviation in its functional representation due to contamination remains bounded, regardless of the location of the point mass. The error bounds (13) and (14) also reveal that the extent to which the error deviation between the robust covariance estimator and the true covariance grows is bounded by a constant depending only on M σ . The two notions do not match precisely; for instance, our theorems allow contamination by an arbitrary distribution rather than simply a point mass, and we are comparing finite-sample deviations of an estimator from Σ * rather than population-level deviations of a statistical functional under a contaminated distribution. However, note that by sending n → ∞ in the finite-sample bounds and taking the contaminating distribution to be a point mass, we may conclude that the influence function of the robust covariance estimator is bounded. Furthermore, the arguments in our proofs (cf. Lemmas 12 and 13 in Appendix C) may be used to derive the fact that the corresponding correlation estimators have a bounded influence function, the precise forms of which appear in Croux and Dehon (2010). The reverse implication, that a correlation estimator with bounded influence (together with a bounded-influence scale estimator) gives rise to high-dimensional deviation bounds of the form in inequalities (13) and (14), seems natural but is not immediate.
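For reference, the influence function of a statistical functional T at a distribution F (Hampel, 1974) may be written as:

```latex
\mathrm{IF}(x; T, F) \;=\; \lim_{t \downarrow 0} \frac{T\bigl((1-t)F + t\,\delta_x\bigr) - T(F)}{t},
```

where δ_x denotes the point mass at x; the estimator associated with T has bounded influence if sup_x |IF(x; T, F)| < ∞.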
Finally, note that although Theorems 1 and 2 have been derived under the assumption that the uncontaminated data are drawn from a normal distribution, the same proof techniques may be applied to analyze settings where the uncontaminated data are drawn from a different underlying distribution, as long as the uncontaminated distribution is suitably well-behaved (e.g., has sub-Gaussian tails). Since our ultimate goal is precision matrix estimation, we have focused only on the scenario where the uncontaminated data are drawn from a Gaussian distribution, in which case the structure of the precision matrix is of great interest in the statistical community.
Extensions. Similar high-dimensional error bounds could be derived for the robust covariance estimator based on the quadrant correlation estimator, given by

r^Q_ij = (1/n) Σ_{k=1}^n sign(X_ki − m̂_i) sign(X_kj − m̂_j),

which is also known to have bounded influence (Shevlyakov and Vilchevski, 2002). However, we do not provide the full derivations here, since they follow from arguments similar to the ones used in the case of Kendall's and Spearman's correlations.
We also comment briefly on another pairwise covariance estimator appearing in the robust statistics literature. Tarr et al. (2015) and Oellerer and Croux (2014) propose to use an estimator based on an idea of Gnanadesikan and Kettenring (1972): noting that

Cov(X, Y) = (1/(4αβ)) [Var(αX + βY) − Var(αX − βY)],

where α = 1/√Var(X) and β = 1/√Var(Y), the proposal is to replace the variance estimator by a robust variance estimator (e.g., the square of the MAD estimator). However, the drawback of this estimator in comparison to the covariance estimators based on Kendall's tau and Spearman's rho is that the covariance estimator has a maximal breakdown point of 25% under cellwise contamination, since the argument of the variance involves a sum of variables, and any robust variance estimator has a maximal breakdown point of 50%. We remark that from the point of view of statistical consistency, a version of the Gnanadesikan-Kettenring covariance estimator may be analyzed in the same manner as the above estimators. Indeed, if we consider the covariance estimator

Σ̌_ij = (1/4) (σ̂²_(i,j),+ − σ̂²_(i,j),−),   (15)

where σ̂_(i,j),+ is the (rescaled) MAD statistic computed from {X_ki + X_kj : 1 ≤ k ≤ n}, and σ̂_(i,j),− is analogously defined to be the MAD statistic computed from {X_ki − X_kj : 1 ≤ k ≤ n}, our derivations showing the consistency of the MAD estimator (cf. Lemmas 10 and 11, with minor modifications) show that σ̂_(i,j),+ and σ̂_(i,j),− concentrate uniformly over 1 ≤ i, j ≤ p around σ_(i,j),+ and σ_(i,j),− for data from the cellwise contamination model, where σ_(i,j),+ and σ_(i,j),− are the population-level standard deviations of the distributions of X_ki + X_kj and X_ki − X_kj, respectively. We may then conclude that the covariance estimator (15) deviates from the true covariance Cov(X_ki, X_kj) by the same margin.
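The following sketch verifies the Gnanadesikan-Kettenring identity in the unscaled case (α = β = 1) and shows where a robust scale estimate would be substituted; the var_est hook and the data values are devices of this sketch:

```python
from statistics import mean

def var(x):
    """(1/n)-normalized variance, matching the covariance identity exactly."""
    m = mean(x)
    return sum((xi - m) ** 2 for xi in x) / len(x)

def cov(x, y):
    m_x, m_y = mean(x), mean(y)
    return sum((a - m_x) * (b - m_y) for a, b in zip(x, y)) / len(x)

def gk_cov(x, y, var_est=var):
    """Gnanadesikan-Kettenring: Cov(X, Y) = (Var(X+Y) - Var(X-Y)) / 4.
    Passing a robust scale estimate (e.g. the squared rescaled MAD) as
    var_est yields the robust pairwise covariance discussed in the text."""
    plus = [a + b for a, b in zip(x, y)]
    minus = [a - b for a, b in zip(x, y)]
    return (var_est(plus) - var_est(minus)) / 4.0

x = [1.0, 2.0, 4.0, 8.0]
y = [0.5, 1.0, -2.0, 3.0]
# With the classical variance plugged in, the identity is exact.
assert abs(gk_cov(x, y) - cov(x, y)) < 1e-12
```

The 25% breakdown point arises because a single corrupted cell in either variable corrupts the corresponding entry of both the sum and difference sequences fed to var_est.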
Finally, we remark briefly on another popular robust scale estimator known as the Q_n estimator (Rousseeuw and Croux, 1993), defined as the k*-th order statistic of the pairwise absolute differences:

Q_n = c · {|X_ki − X_li| : 1 ≤ k < l ≤ n}_(k*),

where c is a constant factor, chosen such that Q_n is Fisher-consistent for the population standard deviation, and k* = ⌈C(n, 2)/4⌉. Since the Q_n estimator is also based on quantiles, essentially the same types of arguments used to derive MAD concentration (cf. Appendix B) may be used to establish concentration bounds for the Q_n estimator similar to those appearing in Lemmas 10 and 11, up to constant factors.
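A naive O(n²) sketch of a Q_n-type estimator; the consistency constant c ≈ 2.2219 and the choice k* = C(h, 2) with h = ⌊n/2⌋ + 1 follow the usual convention of Rousseeuw and Croux (1993) and are assumptions of this sketch:

```python
from itertools import combinations

def qn_scale(x, c=2.2219):
    """Q_n-type scale estimate: a low-order statistic of the n-choose-2
    pairwise absolute differences, times a consistency factor c (the
    value of c and the order k* are conventions assumed by this sketch)."""
    n = len(x)
    diffs = sorted(abs(a - b) for a, b in combinations(x, 2))
    h = n // 2 + 1
    k_star = h * (h - 1) // 2          # C(h, 2), roughly C(n, 2) / 4
    return c * diffs[k_star - 1]

x = [1.0, 1.2, 0.8, 1.1, 0.9, 1e6]     # one gross outlier among six cells
assert qn_scale(x) < 2.0               # stays on the order of the clean data
```

Because a single corrupted cell contaminates n − 1 of the pairwise differences, the estimate depends only on a low quantile of the differences, which is what keeps it stable here.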

Precision Matrix Estimation
Using the novel statistical error bounds derived in the previous section, we now provide statistical error bounds on the precision matrix estimators attained by plugging the robust covariance matrix estimates into the CLIME and GLasso. We provide explicit statements in the case of the covariance estimate based on Kendall's tau; analogous statements hold for Spearman's rho, assuming uniqueness of ranks.
We begin with the CLIME estimator. Consider the uniformity class of matrices defined in (16). The following result provides an elementwise error bound on the estimation error between the CLIME output and the true precision matrix, provided the true precision matrix lies in the class (16):

Theorem 3. Under the cellwise contamination model (1), suppose inequality (11) is satisfied and ε = max_{1≤i≤p} ε_i ≤ 0.02. Let C > π√2 and C′ be constants as in Theorem 1, suppose inequality (12) also holds, and suppose Φ^{-1}(0.75) C′ √(log p / n) < 1. If the regularization parameter λ satisfies the lower bound (17), then with probability tending to one, the CLIME estimator (10) satisfies the elementwise error bound stated in the theorem.

The proof of Theorem 3 is contained in Section 5.3.
Remark 3. Clearly, the optimal choice of λ to minimize the estimation error bound in Theorem 3 is λ = C_1 √(log p / n) + C_2 ε, where C_1 and C_2 are the constant prefactors appearing on the right-hand side of inequality (17). In this case, the estimation error bound again consists of a term of order √(log p / n) plus a term linear in ε. Turning to the GLasso, we focus on the class of precision matrices satisfying the following incoherence assumption:

Assumption 1. There exists a constant α ∈ (0, 1] such that max_{e ∈ S^c} ||Γ*_{e,S} (Γ*_{S,S})^{-1}||_1 ≤ 1 − α,

where Γ* := Σ* ⊗ Σ* and S = supp(Ω*) is the true edge set.
We then have the following result, which is stated in terms of the population-level quantities as well as k, the maximum number of nonzero elements in each row of Ω*. The theorem also involves constants C_0, C_1, and C_2, which are independent of ε and the problem parameters n, p, and k.
Theorem 4. Under the cellwise contamination model (1), suppose inequality (11) is satisfied and ε = max_{1≤i≤p} ε_i ≤ 0.02. Also suppose the sample size satisfies the stated scaling, and suppose Assumption 1 holds. For an appropriate choice of the regularization parameter λ, with probability tending to one, the GLasso estimator satisfies the error bound (19) and supp(Ω̂) ⊆ supp(Ω*).

The proof of Theorem 4 is contained in Section 5.4. Note that Theorem 4 implicitly assumes that ε ≤ C/k, so that the expression in parentheses on the right-hand side of inequality (19) is positive.
Remark 4. Comparing the results of Theorems 3 and 4, we see that as in the traditional uncontaminated setting, the GLasso delivers slightly stronger guarantees, at the expense of more stringent assumptions. In particular, the GLasso requires the sample size to scale as n ≥ C k² log p, whereas the CLIME requires the scaling n ≥ C ||Ω*||²_{L1} log p in order to achieve consistency. When the parameter M defining the precision matrix class scales more slowly than k², the CLIME thus requires a weaker scaling. In addition, the GLasso result supposes Assumption 1, which posits an incoherence bound on submatrices of Γ*. On the other hand, Theorem 4 establishes that supp(Ω̂) ⊆ supp(Ω*) for the GLasso estimator, whereas Theorem 3 only guarantees consistency for the CLIME estimator in ℓ∞-norm, so the estimated support might contain extraneous terms. In the case of the CLIME estimator, however, the true support of Ω* may be obtained via thresholding, assuming the nonzero elements of Ω* are of sufficiently large order under the corresponding assumptions. Hence, when ε ≤ C √(log p / n), the estimation error matches the error of the optimal precision matrix estimator in the uncontaminated case, up to a constant factor. Further note that when ε ≤ C √(log p / n), the condition ε = O(1/k) required by inequality (19) in Theorem 4 clearly holds when the sample size satisfies n ≥ C k² log p. Note that although the level of contamination tolerated by the estimator decreases as the sparsity parameter k increases, it is not required to decrease as n and p increase, as long as the ratio (log p)/n remains fixed. Thus, the conclusions of Theorems 3 and 4 are truly high-dimensional. As in the case of the robust covariance matrix estimators, another nice feature is that when the data are uncontaminated (ε = 0), the rate of convergence of the robust precision matrix estimator to the true precision matrix agrees with the optimal rate.
Lastly, note that since the inverse of the correlation matrix has the same support as the precision matrix, we could also estimate $\mathrm{supp}(\Omega^*)$ using the Kendall's tau or Spearman's rho correlation matrices, $\widehat{\rho}^K$ or $\widehat{\rho}^S$ respectively, as inputs to the CLIME (10) or GLasso (8). Then the same derivations as in Theorems 3 and 4, omitting the concentration bounds on the MAD estimates of scale, would show convergence of $\widehat{\rho}^K$ and $\widehat{\rho}^S$ to the population correlation matrix $\rho^*$ in $\ell_\infty$-norm. Note, however, that the conditions for support recovery would then need to hold for the correlation matrix $\rho^*$, rather than for the precision matrix $\Omega^*$. In particular, a minimum signal strength requirement on $\rho^*$ is stronger than the same requirement imposed on $\Omega^*$, since the latter can scale inversely with the standard deviations of individual variables in the joint distribution. Therefore, we have chosen to focus our attention in this paper on the output of the CLIME and GLasso when applied to an estimate of the covariance matrix rather than the correlation matrix.
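To make the sine-transformed rank correlation concrete, here is a minimal numerical sketch (our own, not the authors' code) of the estimate $\sin\big(\frac{\pi}{2}\widehat{r}^K_{ij}\big)$, which is Fisher-consistent for the Pearson correlation when the uncontaminated data are bivariate normal:

```python
# Sketch: sine-transformed Kendall's tau as a robust correlation estimate.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n, rho = 2000, 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
y = rng.multivariate_normal(np.zeros(2), cov, size=n)

tau, _ = kendalltau(y[:, 0], y[:, 1])   # rank-based, resistant to outliers
rho_hat = np.sin(np.pi / 2 * tau)       # Fisher-consistent transformation
print(rho_hat)  # close to the population value 0.6
```

The analogous Spearman-based estimate replaces the transformation by $2\sin\big(\frac{\pi}{6}\widehat{r}^S_{ij}\big)$.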

Consequences for Robust Estimation
We now interpret the conclusions of our theorems in some concrete settings, where the data matrix is contaminated according to several different mechanisms.

Constant fraction of outliers:
We first briefly discuss the most basic setting of cellwise contamination, to emphasize the generality of our results. Following the model (1), suppose each entry of the data matrix $X$ is contaminated independently with probability $\epsilon$. Furthermore, either all contaminated entries may be drawn independently from a fixed contaminating distribution, or the contaminated entries in each row may be drawn jointly from a fixed contaminating distribution. In each case, Theorems 1 and 2 provide elementwise error bounds on the robust covariance estimators, and Theorems 3 and 4 provide elementwise error bounds on the robust precision matrix estimators constructed from the CLIME and GLasso. The strength of the theorems lies in the fact that we do not make any side assumptions about the outlier distribution; in particular, it may be heavy-tailed and/or contain point masses. Hence, whereas statistics such as the sample covariance and sample correlation will have slower rates of convergence due to a constant fraction of outliers drawn from an ill-behaved distribution, their robust counterparts are agnostic to the outlier distribution. It is also important to note that the statistical error bounds given in the theorems of Sections 3.1 and 3.2 continue to hold when $\epsilon > C \sqrt{\frac{\log p}{n}}$. The difference is that in such scenarios, the statistical error will be of the order $O(\epsilon)$ rather than $O\big(\sqrt{\frac{\log p}{n}}\big)$. However, the effect of an $\epsilon$ fraction of outliers nonetheless grows only linearly as a function of $\epsilon$. This emphasizes the robustness properties of the covariance and precision matrix estimators studied in our paper.
Missing data: Turning to a somewhat different setting, note that missing data may also be seen as an instance of cellwise contamination. In this model, data are missing completely at random (MCAR), meaning that the probability of missingness is independent of the location of the unobserved entry of the data matrix (Little and Rubin, 1986). In other words, if we observe the matrix $X^{\mathrm{mis}}$ with missing entries, where the probability that an entry in column $i$ is missing is equal to $\epsilon_i$, we have the model (20), where $Y$ is the fully-observed matrix. Note that if we zero-fill the missing entries of $X^{\mathrm{mis}}$, the resulting matrix $X$ exactly follows the cellwise contamination model (1), with $Z_k = 0$ for all $k$.
The following result is an immediate consequence of our theorems: Corollary 1. Suppose data are drawn from the missing data model (20), and the matrix $X$ is the zero-filled data matrix. Let $\epsilon = \max_{1 \le i \le p} \epsilon_i$. Under the same conditions as in Theorem 3, the same error bound holds for the robust CLIME estimator constructed from $X$. Under the same conditions as in Theorem 4, we have $\mathrm{supp}(\widehat{\Omega}) \subseteq \mathrm{supp}(\Omega^*)$ and the same error bound for the robust GLasso estimator constructed from $X$.
Note that the conclusion of Corollary 1 does not actually require the matrix $X$ to be zero-filled for missing values; in fact, we could fill the missing entries with samples generated according to any distribution (as long as the distribution remains the same across rows). This is because the missing entries are essentially treated as outliers. Of course, our bounds should only be interpreted up to constant factors, and filling missing entries in a strategic way, e.g., filling entries in column $i$ with the mean $\mathbb{E}(X_{ki})$, could lead to smaller estimation error in practice.
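The reduction from MCAR missingness to cellwise contamination can be sketched in a few lines (our own illustration; the variable names are ours):

```python
# Sketch: zero-filling MCAR missing entries yields the cellwise model (1)
# with degenerate contaminating distribution Z_k = 0.
import numpy as np

rng = np.random.default_rng(1)
n, p, eps = 100, 5, 0.1
Y = rng.standard_normal((n, p))      # fully observed data matrix
mask = rng.random((n, p)) < eps      # MCAR missingness indicators
X = np.where(mask, 0.0, Y)           # zero-filled observed matrix

# Each entry equals the clean value unless its cell was "contaminated" (set to 0)
assert np.all((X == Y) | mask)
print(mask.mean())  # empirical contamination fraction, near eps = 0.1
```

Feeding `X` into the robust pipeline then falls directly under Corollary 1.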
Rowwise contamination: Although we have thus far assumed that data are contaminated according to a cellwise mechanism, we now show that the same results apply for rowwise contamination, as well. Recall that each row in the data matrix for the rowwise contamination model with contamination level $\epsilon$ is given by $X_k = (1 - B_k) Y_k + B_k Z_k$, where $Y_k$ is the uncontaminated row vector, $Z_k$ is the contamination vector, and $B_k \sim \mathrm{Bernoulli}(\epsilon)$. Although model (21) differs from model (1), a simple inspection of the proofs of Theorems 3 and 4 shows that only Lemma 1 needs to be modified. Furthermore, the equation (44), giving the distribution of pairwise entries in a row, simply needs to be replaced by equation (22) in the proof of Lemma 1. Equation (22) comes from the fact that the pair is either drawn jointly from a normal distribution with probability $1 - \epsilon$, or from the contaminating distribution with probability $\epsilon$. Then the remainder of the argument follows as before, implying that the same conclusion of Lemma 1 applies. (We could obtain a smaller prefactor for $\epsilon$ in the bound (29), since $2\epsilon$ is replaced by $\epsilon$, but we are not concerned about optimizing constants here.) We therefore arrive at the following result: Under the rowwise contamination model (21), the same conclusions as in Corollary 1 hold for the CLIME and GLasso estimators constructed from $X$.
We emphasize that the rowwise contamination model (21) is not in general a special case of the cellwise contamination model (1); rather, the proof techniques for analyzing the cellwise model may be used to handle the rowwise model, as well.

Breakdown Point
We now turn to a brief discussion of the breakdown point of the estimators studied in this paper. As discussed in Donoho and Huber (1983) and Hampel et al. (2011), breakdown analysis concerns the global behavior of a procedure under large departures from an assumed situation. On the other hand, the theoretical analysis of statistical consistency and efficiency is related to notions of infinitesimal robustness, and quantifies the local behavior of a procedure at or near the assumed situation. The analogy is made in Donoho and Huber (1983) between the fields of material science and statistics, where the notions of stiffness (resistance of a material to displacements caused by a small load) and breaking strength (the amount of load required to make the material fracture) parallel those of the influence function and the breakdown point. Ideally, a procedure should perform well both locally and globally; optimizing either measure alone is unwise. Our key result of this section shows that although the GLasso and CLIME estimators both enjoy roughly the same statistical rate of estimation, the CLIME does not perform as well as the GLasso when the breakdown point is used to quantify the degree of robustness.
Our analysis of the GLasso estimator closely follows that of Oellerer and Croux (2014); however, since the specific precision matrix estimators analyzed in our paper differ slightly from those of Oellerer and Croux (2014), we include the full argument for the sake of completeness. We define the finite-sample breakdown point of the precision matrix estimator under cellwise contamination to be $\epsilon_n(\widehat{\Omega}, X) := \min_{1 \le m \le n} \big\{ \frac{m}{n} : \sup_{X^m} D\big(\widehat{\Omega}(X), \widehat{\Omega}(X^m)\big) = \infty \big\}$, where $D$ measures the divergence between the two estimates in terms of maximum and minimum eigenvalues, and $X^m$ is a data matrix obtained from $X$ by replacing at most $m$ entries in each column by arbitrary elements. We also define the explosion finite-sample breakdown point of a covariance matrix estimator analogously, in terms of the maximum eigenvalue of $\widehat{\Sigma}(X^m)$ (Maronna and Zamar, 2002). Note that the explosion breakdown point only accounts for maximum eigenvalues, whereas the overall covariance matrix estimator breaks down under explosion or implosion (i.e., arbitrarily small minimum eigenvalues). Also, the breakdown point under cellwise contamination is less than or equal to the breakdown point under rowwise contamination, since the supremum in the latter case is only taken over $X^m$ with at most $m$ rows replaced. We will consider the breakdown behavior of a slightly tweaked version of the GLasso presented earlier, using a positive semidefinite matrix as the input to the optimization problem. Consider the matrix $\widetilde{\Sigma}(X) := \arg\min_{\Sigma \succeq 0} \|\Sigma - \widehat{\Sigma}\|_\infty$, where $\widehat{\Sigma} = \widehat{\Sigma}(X)$ is the robust covariance matrix estimator constructed from the data matrix $X$ and the norm is the elementwise $\ell_\infty$-norm; this is the projection step (25). Let $\widetilde{\Omega}(X) := \arg\min_{\Omega \succ 0} \big\{ \mathrm{tr}\big(\widetilde{\Sigma}(X)\,\Omega\big) - \log\det\Omega + \lambda \|\Omega\|_{1,\mathrm{off}} \big\}$ be the corresponding GLasso estimator (26). Note that from a computational standpoint, the projection step (25) is important so that fast solvers for the GLasso program (26) may be applied (e.g., Friedman et al. (2008)). Furthermore, we note that the projection step (25) constitutes a convex program, so the additional computational time is negligible compared to the computation required for running the GLasso. We have the following result: Theorem 5. Consider the positive semidefinite version of the robust GLasso estimator (26).
Then under the same conditions as in Theorem 4, we have $\mathrm{supp}(\widetilde{\Omega}) \subseteq \mathrm{supp}(\Omega^*)$ and the error bound (27). Furthermore, for any data matrix $X \in \mathbb{R}^{n \times p}$, the breakdown point satisfies $\epsilon_n(\widetilde{\Omega}, X) = 50\%$.
The proof of Theorem 5 is provided in Section 5.5.
Remark 5. Note that Theorem 5 guarantees that the robust GLasso estimator $\widetilde{\Omega}$ obtained from a semidefinite projection of the robust covariance estimator shares the same level of statistical consistency achieved by the robust GLasso estimator $\widehat{\Omega}$. In addition, the precision matrix estimator $\widetilde{\Omega}$ has a breakdown point of 50%. Although other authors (Oellerer and Croux, 2014; Tarr et al., 2015) also suggest projecting the robust covariance estimator onto the positive semidefinite cone before applying the GLasso, they advocate a projection in terms of the Frobenius norm rather than the elementwise $\ell_\infty$-norm in the optimization program (25). As can be seen in the proof of Theorem 5, minimizing the elementwise $\ell_\infty$-norm is much more natural from the point of view of statistical consistency, since it guarantees that the $\ell_\infty$-error between the precision matrix estimate and the true precision matrix grows by at most a factor of two.
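For contrast with the $\ell_\infty$ projection (25), which requires a semidefinite-programming solver, the Frobenius-norm projection advocated by Oellerer and Croux (2014) has a closed form via eigenvalue clipping. The following is a minimal sketch of that cheaper variant (our own code, shown only to illustrate the alternative discussed above):

```python
# Sketch: nearest PSD matrix in Frobenius norm = clip negative eigenvalues.
import numpy as np

def project_psd_frobenius(S):
    """Frobenius-nearest PSD matrix to a symmetric input S."""
    S = (S + S.T) / 2                       # symmetrize defensively
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

# An indefinite pairwise "covariance" estimate (such matrices can arise
# because pairwise robust estimators need not be PSD):
S = np.array([[1.0, 0.9, 0.9],
              [0.9, 1.0, -0.9],
              [0.9, -0.9, 1.0]])
S_psd = project_psd_frobenius(S)
print(np.linalg.eigvalsh(S_psd).min())  # nonnegative (up to rounding)
```

Unlike the $\ell_\infty$ projection, this step does not come with the factor-of-two guarantee on the elementwise error noted in Remark 5.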
Turning to the CLIME estimator, we now show that although the CLIME is as robust as the GLasso in terms of statistical consistency under the cellwise contamination model, it has much poorer breakdown behavior. Consider the CLIME estimator based on corrupted data, computed from $\widehat{\Sigma}(X^m)$, where $\widehat{\Sigma}(X^m)$ is the robust covariance estimator based on a data matrix with at most $m$ arbitrarily corrupted entries per column. Since the CLIME estimator arises as the solution to a constrained linear program, the solution is undefined (infinite) when the problem is infeasible. Indeed, we will show in the following theorem that such a case may arise even by corrupting at most one entry in each column of the data matrix.
Theorem 6. In the case when $p = 2$, there exists $X \in \mathbb{R}^{n \times 2}$ such that $\epsilon_n(\widehat{\Omega}, X) = \frac{1}{n}$, where $\widehat{\Omega}$ denotes the CLIME estimator.
The proof of Theorem 6, supplied in Section 5.6, provides the construction of a data matrix $X \in \mathbb{R}^{n \times 2}$ where the CLIME estimator becomes infeasible after perturbing a single entry in each column. This is in stark contrast to the result in Theorem 5, which establishes that the breakdown point of the robust GLasso estimator is 50%, for any realization of the data matrix $X$. Remark 6. Although Theorem 6 is stated for the case $p = 2$, the argument used to prove the theorem is readily generalizable to higher dimensions, as well, in which case we would also have a matrix $X \in \mathbb{R}^{n \times p}$ satisfying $\epsilon_n(\widehat{\Omega}, X) = \frac{1}{n}$. For instance, we could construct an $n \times p$ matrix $X^1$ such that $\widehat{\Sigma}(X^1)$ is a block matrix with upper-left block equal to the matrix constructed in the proof of Theorem 6, lower-right block equal to the identity, and off-diagonal blocks equal to zero.
The conclusion of Theorem 6 underscores the fact that consistency and breakdown point under cellwise contamination are in some sense orthogonal measures of robustness. As we demonstrated in the previous section, both the CLIME and GLasso lead to estimators that enjoy good rates of statistical consistency when the contamination fraction is sufficiently small relative to the problem parameters. On the other hand, the results of this section show that the CLIME is extremely nonrobust in terms of its breakdown point. Similarly, procedures such as the Gnanadesikan-Kettenring estimator (15) may be shown to be statistically consistent under cellwise contamination, but as discussed in Oellerer and Croux (2014), the breakdown point of the covariance estimator $\widehat{\Sigma}$ is at most 25%, which leads to error propagation in $\widehat{\Omega}$.
Finally, we note that the notion of breakdown point that we consider in equation (23) is defined with respect to a finite sample, without recourse to probability distributions. Other notions of breakdown point, defined with respect to an $\epsilon$-contaminated distribution, have also been studied in the literature (Hampel et al., 2011). For some alternative measures of breakdown robustness, the CLIME estimator may have a more controlled breakdown behavior, but we have not explored them here.

Proofs
In this section, we provide an outline of the proofs of the main theorems in the paper. Proofs of the more technical supporting lemmas are contained in the supplementary Appendix.

Proof of Theorem 1
The proof is based on Lemma 1, which gives an error bound for the pairwise terms $\sin\big(\frac{\pi}{2}\widehat{r}^K_{ij}\big)$, and Lemma 2, which gives an error bound for the scale estimates $\widehat{\sigma}_i$. Note that we require the bound $\epsilon \le 0.02$ on the level of contamination in Lemma 1, but the requirement could be relaxed with a more refined proof technique. The proofs of Lemmas 1 and 2 are provided in Appendices A.1 and A.2.

Proof of Theorem 2
The proof is based on Lemma 3, which gives an error bound for $2\sin\big(\frac{\pi}{6}\widehat{r}^S_{ij}\big)$, and Lemma 2, which gives an error bound for $\widehat{\sigma}_i$. Note that we require the bound $\epsilon \le 0.01$ on the level of contamination in Lemma 3, but the requirement could again be relaxed with a more refined proof technique. The proof of Lemma 3 is contained in Appendix A.3.
Lemma 3. Under model (1), let $\epsilon = \max_{1 \le i \le p} \epsilon_i \le 0.01$. Suppose $C > 8\pi$ and the sample size satisfies $n \ge \max\big\{15, \frac{16\pi^2}{C^2 \log p}\big\}$. Then the stated bound holds with probability at least $1 - 2p^{-c}$, for a constant $c > 0$ depending on $C$. Using a similar decomposition as in the proof of Theorem 1, together with Lemmas 2 and 3, we obtain an overall upper bound which is easily simplified to obtain the prescribed bound.

Proof of Theorem 3
Clearly, it suffices to prove the elementwise deviation bound for the unsymmetrized matrix $\widehat{\Omega}$. We begin with the following general lemma, relating deviation bounds for the covariance matrix estimator $\widehat{\Sigma}$ to the error of the CLIME estimator. A version of the following result appears in the original analysis of the CLIME, but we include the relatively short proof for the sake of completeness.
Inspecting the proofs of the technical lemmas employed in proving Theorem 1, we may see that inequality (32) holds with the function $f(n, \delta) = c_1 \exp\big(-c_2 n (\delta - c_0)^2\big)$, defined for $\delta > c_0$, where $c_0$, $c_1$, and $c_2$ are appropriately chosen constants. An easy calculation then verifies the remaining conditions, and Lemma 5 implies the desired conclusions.

Proof of Theorem 5
Note that $\widetilde{\Sigma}$ is the projection of the robust covariance estimator $\widehat{\Sigma}$ onto the positive semidefinite cone, where the distance is measured in the elementwise $\ell_\infty$-norm. Furthermore, since $\Sigma^* \succeq 0$ is feasible for the projection (25), we have $\|\widetilde{\Sigma} - \widehat{\Sigma}\|_\infty \le \|\widehat{\Sigma} - \Sigma^*\|_\infty$, so by the triangle inequality, $\|\widetilde{\Sigma} - \Sigma^*\|_\infty \le 2\|\widehat{\Sigma} - \Sigma^*\|_\infty$. This implies that the bound (32) in Lemma 5 holds with $\widehat{\Sigma}$ replaced by $\widetilde{\Sigma}$, and $f(n, \delta)$ replaced by $f(n, \delta/2)$. Proceeding as in the proof of Theorem 4 with these minor modifications, we arrive at the bound (27).
Consider the estimator $\widetilde{\Sigma}(X^m)$, based on corrupted data. We have the bound (36), where the first inequality follows from the bound (33), and the second inequality comes from the triangle inequality. Furthermore, since $\widetilde{\Sigma}(X^m) \succeq 0$ by construction, we have the bound (37), where we have used the bound (33) in the last inequality. Combining inequalities (36) and (37), we then obtain inequality (38). Finally, since the correlation estimators are bounded in magnitude by 1, we have the bound (39), where $\{\widehat{\sigma}_i(X^m)\}_{1 \le i \le p}$ are the robust scale estimators based on $X^m$, given by the MAD estimators calculated from the corresponding columns. Furthermore, the breakdown point of the MAD is 50% (Huber, 1981), meaning the quantity on the right-hand side of inequality (39) is finite when $\frac{m}{n} < 50\%$. Then by inequality (38) and the definition of the explosion breakdown point, we conclude that the bound (35) holds. By inequality (34), we therefore have $\epsilon_n(\widetilde{\Omega}(X), X) \ge 50\%$, as well. We now establish that $\epsilon_n(\widetilde{\Omega}(X), X) = 50\%$. Note that if we are allowed to corrupt more than 50% of the entries in each column of the data matrix, the columnwise MAD estimates may be made arbitrarily small (say, smaller than some value $a$); indeed, we may simply replace more than half of the entries in each column by values in $(0, a)$. Consequently, the overall covariance estimator $\widehat{\Sigma}(X^m)$ will have all entries bounded in magnitude by $[\Phi^{-1}(0.75)]^{-2} a^2$. We claim that the diagonal elements of $\widetilde{\Sigma}(X^m)$ must therefore be bounded in magnitude by $2[\Phi^{-1}(0.75)]^{-2} a^2$. Indeed, note that the matrix $\mathrm{diag}(\widehat{\Sigma}(X^m))$ is feasible for the projection (25); hence, we must have $\|\widetilde{\Sigma}(X^m) - \widehat{\Sigma}(X^m)\|_\infty \le [\Phi^{-1}(0.75)]^{-2} a^2$, and the claim follows. Now note that the first-order optimality condition for the GLasso is given by $\widetilde{\Sigma}(X^m) - \widetilde{\Omega}(X^m)^{-1} + \lambda \, \mathrm{sign}\big(\widetilde{\Omega}(X^m)\big) = 0$, where the sign function is computed entrywise, omitting the diagonal elements of $\widetilde{\Omega}(X^m)$. In particular, this implies that $\mathrm{diag}\big(\widetilde{\Sigma}(X^m)\big) = \mathrm{diag}\big(\widetilde{\Omega}(X^m)^{-1}\big)$, so the diagonal elements of $\widetilde{\Omega}(X^m)^{-1}$ are also bounded in magnitude by $2[\Phi^{-1}(0.75)]^{-2} a^2$.
Hence, the minimum eigenvalue satisfies $\lambda_p\big(\widetilde{\Omega}(X^m)^{-1}\big) \le \min_j e_j^\top \widetilde{\Omega}(X^m)^{-1} e_j \le 2[\Phi^{-1}(0.75)]^{-2} a^2$, where the $e_j$'s are the canonical basis vectors, and we have used the variational representation of eigenvalues of a Hermitian matrix to show that the minimum eigenvalue is bounded by the minimum diagonal entry. This allows us to conclude that $\lambda_1\big(\widetilde{\Omega}(X^m)\big) \ge \frac{1}{2}[\Phi^{-1}(0.75)]^{2} a^{-2}$, where we have used the inequality $\lambda_p(AB) \le \lambda_1(A)\lambda_p(B)$, for $A, B \succeq 0$ (Zhang, 2011). However, we may choose $a$ to be arbitrarily close to 0, implying that the maximum eigenvalue of $\widetilde{\Omega}(X^m)$ may be made arbitrarily large, and the estimator breaks down. This concludes the proof.
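The implosion step of the argument above, that corrupting just over half of a column drives the MAD scale estimate below any target $a$, is easy to verify numerically. The following sketch (our own code) uses the usual normal-consistency constant $1/\Phi^{-1}(0.75) \approx 1.4826$:

```python
# Sketch: the columnwise MAD breaks down once more than 50% of entries
# are replaced by values in (0, a), as used in the proof of Theorem 5.
import numpy as np

def mad_scale(x):
    """MAD scale estimate, normalized for consistency at the normal."""
    c = 1.4826  # approx 1 / Phi^{-1}(0.75)
    return c * np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(2)
x = rng.standard_normal(101)
a = 1e-6
x_corrupt = x.copy()
x_corrupt[:51] = a / 2          # corrupt 51 of 101 entries (just over half)
print(mad_scale(x_corrupt) < a)  # True: the scale estimate has imploded
```

With a majority of entries tied near zero, both the median and the median absolute deviation collapse onto the corrupted values, so the scale estimate falls below $a$.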

Proof of Theorem 6
Clearly, $\epsilon_n(\widehat{\Omega}, X) \ge \frac{1}{n}$ for any $X$, by the definition of the breakdown point. To show equality, we now provide a data matrix $X$ and a corrupted data matrix $X^1$, where $X^1$ differs from $X$ in at most one element per column, and the CLIME problem is feasible for $\widehat{\Sigma}(X)$ but infeasible for $\widehat{\Sigma}(X^1)$. Consider the $n \times 2$ matrix $X^1$ with rows $(a_k, -a_k)$, for $1 \le k \le n$, where the $a_k$'s are all distinct. Note that the columns of $X^1$ are perfectly negatively correlated; hence, the correlation matrix (computed from either Kendall's tau or Spearman's rho, for instance) is $\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$.
Furthermore, we have $\widehat{\sigma}_1 = \widehat{\sigma}_2 := \widehat{\sigma}$, since the data in the two columns are negatives of each other. It follows that $\widehat{\Sigma}(X^1) = \widehat{\sigma}^2 \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$, which is singular. Clearly, the CLIME constraint $\|\widehat{\Sigma}(X^1)\,\Omega - I\|_\infty \le \lambda$ is infeasible for $\lambda < \frac{1}{2}$. Hence, the CLIME estimator based on $\widehat{\Sigma}(X^1)$ is infeasible. On the other hand, we may construct an initial data matrix $X$ such that the CLIME program based on $\widehat{\Sigma}(X)$ is feasible, simply by altering the last row of $X^1$. Suppose we change the last row of $X^1$ to $(a_n, a_n)$. Then the columns are no longer perfectly negatively correlated, and it is easy to check that the correlation matrix of $X$ will take the form $\begin{pmatrix} 1 & a \\ a & 1 \end{pmatrix}$, for some $|a| < 1$. Denoting the corresponding estimates of scale as $\widehat{\sigma}_1$ and $\widehat{\sigma}_2$, we then have $\widehat{\Sigma}(X) = \begin{pmatrix} \widehat{\sigma}_1^2 & a \widehat{\sigma}_1 \widehat{\sigma}_2 \\ a \widehat{\sigma}_1 \widehat{\sigma}_2 & \widehat{\sigma}_2^2 \end{pmatrix}$. Note that $\det\{\widehat{\Sigma}(X)\} = \widehat{\sigma}_1^2 \widehat{\sigma}_2^2 (1 - a^2) > 0$. It follows that $\widehat{\Sigma}(X)$ is invertible. In particular, the matrix $\widehat{\Sigma}(X)^{-1}$ is always a feasible point for the CLIME program based on $\widehat{\Sigma}(X)$. Hence, we conclude that the CLIME program breaks down when even one corruption per column is allowed. It follows that $\epsilon_n(\widehat{\Omega}, X) = \frac{1}{n}$ for the constructed value of $X$.
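The key infeasibility claim, that no $\Omega$ can bring $\|\widehat{\Sigma}(X^1)\Omega - I\|_\infty$ below $\frac{1}{2}$ when $\widehat{\Sigma}(X^1)$ is the rank-one matrix above, can be checked numerically (our own sketch; the grid search stands in for the one-dimensional minimization in the proof):

```python
# Numerical check: with columns that are exact negatives, every column of
# S @ Omega is a multiple of (1, -1)^T, so the first column of S @ Omega - I
# has entries (t - 1, -t); min over t of max(|t - 1|, |t|) equals 1/2.
import numpy as np

sigma2 = 1.3                                        # any positive scale
S = sigma2 * np.array([[1.0, -1.0], [-1.0, 1.0]])   # singular rank-one estimate

best = np.inf
for t in np.linspace(-2.0, 2.0, 100001):
    best = min(best, max(abs(t - 1.0), abs(t)))
print(round(best, 3))  # 0.5, attained at t = 1/2
```

Hence the CLIME constraint is violated for any $\lambda < \frac{1}{2}$, exactly as the proof asserts.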

Simulations
In this section, we perform simulation studies to examine the performance of the two robust covariance matrix estimators introduced in Section 2, and also the robust precision matrix estimators obtained using the GLasso. We will refer to the two types of estimators as Kendall and Spearman, respectively. For comparison, we also compute the following robust covariance matrix estimators, which are similarly plugged into the GLasso to obtain robust precision matrix estimators:
• SpearmanU: The pairwise covariance matrix estimator proposed in Oellerer and Croux (2014), where the MAD estimator is combined with Spearman's rho (without the sine transformation).
• OGK: The OGK estimator proposed in Maronna and Zamar (2002), with scale estimator $Q_n$.
• NPD: The pairwise covariance matrix estimator considered in Tarr et al. (2015), where an NPD projection is applied to the pairwise covariance estimate $\widetilde{\Sigma}$ to obtain the final positive semidefinite covariance matrix estimator.
Further details for the orthogonalized Gnanadesikan-Kettenring (OGK) and nearest positive definite (NPD) procedures may be found in Maronna and Zamar (2002) and Higham (2002), respectively. The nonrobust GLasso, which takes the sample covariance matrix estimator as an input (SampleCov), as well as the inverse sample covariance matrix estimator (InvCov), applicable in the case $p < n$, are used as points of reference. An implementation of the GLasso that allows the diagonal entries of the precision matrix estimator to be unpenalized is provided in the widely used glasso package. In this paper, however, we use the GLasso implementation from the QUIC package (Hsieh et al., 2011), since it does not require the input covariance matrix to be positive semidefinite, and is substantially faster than glasso. We select the tuning parameter $\lambda$ in the GLasso by cross-validation: We first split the data into $K$ groups, or folds, of nearly equal size. For a given $\lambda$ and $1 \le k \le K$, we take the $k$th fold as the test set, and compute the precision matrix estimate $\widehat{\Omega}^{(-k)}_\lambda$ based on the remaining $K - 1$ folds. We then compute the negative log-likelihood $\mathrm{tr}\big(\widehat{\Sigma}^{(k)} \widehat{\Omega}^{(-k)}_\lambda\big) - \log\det \widehat{\Omega}^{(-k)}_\lambda$ on the test set, where $\widehat{\Sigma}^{(k)}$ is the robust covariance estimate obtained from the test set. This is done over a logarithmically spaced grid of 15 values between $\lambda_{\max} = \max_{i \ne j} |\widehat{\Sigma}_{ij}|$ and $\lambda_{\min} = 0.01 \lambda_{\max}$, where $\widehat{\Sigma}$ is the robust covariance estimate computed from the whole data set. The value of $\lambda$ that minimizes the average negative log-likelihood over the folds is selected as the final tuning parameter.
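The cross-validation loop above can be sketched as follows. This is our own illustration, not the paper's R code: the plain sample covariance stands in for the robust estimate, and a ridge-style inverse is a placeholder for the GLasso/QUIC solver at each $\lambda$.

```python
# Sketch of the CV scoring: held-out negative log-likelihood
# tr(S_test @ Omega) - logdet(Omega) over a log-spaced lambda grid.
import numpy as np

def neg_loglik(S_test, Omega):
    _, logdet = np.linalg.slogdet(Omega)
    return np.trace(S_test @ Omega) - logdet

rng = np.random.default_rng(3)
X = rng.multivariate_normal(np.zeros(3), np.eye(3), size=300)
K = 5
folds = np.array_split(np.arange(len(X)), K)

S_full = np.cov(X, rowvar=False)
lam_max = np.max(np.abs(S_full - np.diag(np.diag(S_full))))
grid = np.geomspace(lam_max, 0.01 * lam_max, 15)   # 15 log-spaced values

scores = []
for lam in grid:
    total = 0.0
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(len(X)), test)
        S_train = np.cov(X[train], rowvar=False)
        # placeholder for the GLasso fit at this lambda:
        Omega = np.linalg.inv(S_train + lam * np.eye(3))
        total += neg_loglik(np.cov(X[test], rowvar=False), Omega)
    scores.append(total / K)

lam_star = grid[int(np.argmin(scores))]  # selected tuning parameter
print(lam_star > 0)
```

In the paper's pipeline, both `S_train` and `S_test` would be replaced by the robust (Kendall/Spearman plus MAD) covariance estimates, and the inner fit by a QUIC call.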
Simulation settings: We consider the following four sampling schemes, covering different structures of the true precision matrix $\Omega^* \in \mathbb{R}^{p \times p}$. The first three structures are taken from the literature.
• Sparse: $\Omega^* = B + \delta I_p$, where $b_{ii} = 0$ and $b_{ij} = b_{ji}$, with $P(b_{ij} = 0.5) = 0.1$ and $P(b_{ij} = 0) = 0.9$, for $i \ne j$. The parameter $\delta$ is chosen such that the condition number of $\Omega^*$ equals $p$. The matrix is then standardized to have unit diagonals.
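A minimal sketch of the "Sparse" scheme follows (our own code; the closed-form choice of $\delta$ from the extreme eigenvalues of $B$ is our implementation detail):

```python
# Sketch: Omega* = B + delta * I with symmetric Bernoulli(0.1) entries of
# size 0.5, delta set so cond(B + delta * I) = p, then unit-diagonal scaling.
import numpy as np

def make_sparse_precision(p, seed=0):
    rng = np.random.default_rng(seed)
    B = np.where(rng.random((p, p)) < 0.1, 0.5, 0.0)
    B = np.triu(B, 1)
    B = B + B.T                      # symmetric with zero diagonal
    w = np.linalg.eigvalsh(B)
    # solve (w_max + d) / (w_min + d) = p for d:
    d = (w[-1] - p * w[0]) / (p - 1)
    Omega = B + d * np.eye(p)        # positive definite, cond = p
    s = np.sqrt(np.diag(Omega))
    return Omega / np.outer(s, s)    # standardize to unit diagonal

Omega = make_sparse_precision(20)
print(np.allclose(np.diag(Omega), 1.0))  # True
```

Note that the final rescaling preserves positive definiteness (it is a congruence transformation), though it alters the condition number slightly.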
Performance measures: We assess the performance of the covariance and precision matrix estimators via the elementwise deviations $\|\widehat{\Sigma} - \Sigma^*\|_\infty$ and $\|\widehat{\Omega} - \Omega^*\|_\infty$, respectively. To measure the accuracy of recovering the support of the true precision matrix, we also consider the false positive (FP) and false negative (FN) rates.
FP gives the proportion of zero elements in the true precision matrix that are incorrectly estimated to be nonzero, while FN gives the proportion of nonzero elements in the true precision matrix that are incorrectly estimated to be zero. Note that if Ω * has no zero entries, as in the case of the banded and dense structures, the quantity FP is undefined.
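The FP/FN computation described above can be sketched directly (our own code; restricting to off-diagonal entries is one natural convention we assume here):

```python
# Sketch: false positive and false negative rates for support recovery.
import numpy as np

def fp_fn_rates(Omega_hat, Omega_star, tol=1e-8):
    off = ~np.eye(Omega_star.shape[0], dtype=bool)
    true_zero = (np.abs(Omega_star) <= tol) & off
    true_nonzero = (np.abs(Omega_star) > tol) & off
    est_nonzero = np.abs(Omega_hat) > tol
    # FP undefined when Omega* has no zero entries (banded/dense schemes)
    fp = est_nonzero[true_zero].mean() if true_zero.any() else np.nan
    fn = (~est_nonzero[true_nonzero]).mean() if true_nonzero.any() else np.nan
    return fp, fn

Omega_star = np.array([[2.0, 0.5, 0.0],
                       [0.5, 2.0, 0.0],
                       [0.0, 0.0, 1.0]])
Omega_hat = np.array([[2.0, 0.0, 0.3],
                      [0.0, 2.0, 0.0],
                      [0.3, 0.0, 1.0]])
fp, fn = fp_fn_rates(Omega_hat, Omega_star)
print(fp, fn)  # 0.5 1.0
```

In this toy example, half of the truly zero off-diagonal entries are estimated as nonzero (FP = 0.5) and the single true edge is missed (FN = 1.0).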
Tables 1 and 2 show the results for n = 200 and p = 120. We summarize the salient points below: • When the dataset is clean, SampleCov performs best in terms of both covariance and precision matrix estimation, across all sampling schemes. Note that even though the data are uncontaminated, InvCov performs poorly, due to the fact that the sample covariance matrix has low precision when p > n/2.
• In the case of rowwise contamination, the nonrobust SampleCov has the largest estimation error for the covariance matrix, as expected. Curiously, the precision matrix estimation error based on SampleCov is the lowest among all estimators. We do not have a good explanation for this, but the tuning parameter selected for SampleCov by cross-validation tends to be smaller (as can be seen from its relatively low FN). NPD, Kendall, Spearman, and SpearmanU have similar performance in terms of both covariance and precision matrix estimation. In all sampling schemes, OGK outperforms these four estimators for covariance estimation, but not consistently so for precision matrix estimation.
• For covariance and precision matrix estimation under cellwise contamination, the Kendall, Spearman, and SpearmanU estimators perform the best. NPD performs the worst among all cellwise robust covariance matrix estimators. Nonetheless, NPD still beats OGK, which is designed to work well under rowwise contamination, and also beats the nonrobust SampleCov.
• When the data are generated from the multivariate t-distribution or alternative t-distribution, we again see that Kendall, Spearman, and SpearmanU behave similarly and outperform all other estimators, across all sampling schemes.
• When Ω * is either sparse or diagonal, FP is low for all estimators except InvCov, under all contamination mechanisms.
• Except for InvCov, FN is high when Ω * is banded or dense, under all contamination mechanisms. This is expected because GLasso implicitly assumes the underlying Ω * to be sparse, which is not true in these cases. When Ω * is sparse, the FN for Kendall, Spearman, and SpearmanU are relatively low compared to the other estimators.
Tables 3 and 4 show the results for n = 200 and p = 400. Since $p > n$, the inverse sample covariance matrix cannot be computed, and is hence excluded from the analysis. Overall, we obtain conclusions similar to those obtained in the first set of simulations: • When the data are clean, SampleCov performs best in terms of estimation error, across all sampling schemes. Immediately following are OGK and NPD, and then Kendall, Spearman, and SpearmanU (the last three have nearly the same performance).
• Under rowwise contamination, SampleCov has the worst covariance estimation error, but also the best precision estimation error, across all sampling schemes. OGK performs best in terms of covariance estimation, but not precision estimation. NPD, Kendall, Spearman, and SpearmanU have similar performance in nearly all cases. When Ω * is diagonal and the contamination fraction is 10%, Kendall turns out to have high precision estimation error, possibly because the selected tuning parameter in GLasso is too small (as can be seen by the high FP).
• In terms of estimation error under cellwise contamination, OGK performs nearly as badly as SampleCov. Kendall, Spearman, and SpearmanU perform equally well, while NPD is slightly worse off.
• When the data are generated from the multivariate t-distribution or alternative t-distribution, SampleCov performs badly. Kendall, Spearman, and SpearmanU perform similarly and outperform OGK and NPD, across all sampling schemes.
• In general, under all contamination mechanisms, when Ω * is either sparse or diagonal, FP is low for all estimators. On the other hand, when Ω * is banded or dense, FN is high, as expected.
When Ω * is sparse, FN is not as low as desired.
In summary, SampleCov performs best for clean data. Under rowwise contamination, OGK yields the best results in terms of covariance estimation. Under cellwise contamination, Kendall, Spearman, and SpearmanU equally share the best performance, while NPD is slightly worse off. Kendall, Spearman, and SpearmanU also perform very well when the data are generated from a multivariate t-distribution or the alternative t-distribution, although these latter cases are not covered by our theory.   Table 1: Simulation results for seven estimators and four sampling schemes, when n = 200 and p = 120. Performance is measured by Σ − Σ * ∞ for covariance matrix estimation (Cov), Ω − Ω * ∞ for precision matrix estimation (Prec), and false positive rate (FP) and false negative rate (FN) for support recovery of the true precision matrix. The results are averaged over 100 replications.       Table 4: Simulation results for six estimators and four sampling schemes, when n = 200 and p = 400. Performance is measured by Σ − Σ * ∞ for covariance matrix estimation (Cov), Ω − Ω * ∞ for precision matrix estimation (Prec), and false positive rate (FP) and false negative rate (FN) for support recovery of the true precision matrix. The results are averaged over 100 replications.

Discussion
In this paper, we have derived statistical error bounds for high-dimensional robust precision matrix estimators, when data are drawn from a multivariate normal distribution and then observed subject to cellwise contamination. We show that in such settings, the precision matrix estimators that are obtained by plugging pairwise robust covariance estimators into the GLasso or CLIME routine, as suggested by Oellerer and Croux (2014) and Tarr et al. (2015), have error bounds that match standard high-dimensional bounds for uncontaminated precision matrix estimation, up to an additive factor involving a constant multiple of the contamination fraction $\epsilon$. Our results for precision matrix estimators are derived via estimation error bounds for robust covariance matrix estimators, which have similar deviation properties.
The results of our paper naturally suggest several avenues for future work. As discussed earlier, our results seem to indicate that covariance estimators based on bounded-influence estimators of correlation and scale give rise to statistical error bounds of the form derived in our paper, and it would be interesting to make this notion rigorous, as a further attempt to connect the fields of robust and high-dimensional statistics. It would also be interesting to relate the nonasymptotic statistical error bounds to the behavior of the sensitivity curve of the robust covariance estimator, which is the finite-sample analog of the influence function. We have also left open the question of calculating the breakdown point for the CLIME estimator with respect to more general data matrices, as well as the breakdown behavior of the CLIME and GLasso under different notions of breakdown point. Although our results imply the superiority of the GLasso over the CLIME estimator from the perspective of the finite-sample breakdown point, this may only be part of the story.
Lastly, it would be interesting to generalize our study to other classes of distributions. In one direction, it would be possible to study contaminated versions of other distributions besides the multivariate Gaussian, for which the precision matrix encodes information about the underlying graphical model (e.g., Ising models on trees). A harder question to tackle would be the problem of robust graphical model estimation in settings where the structure of the graph is not encoded in the precision matrix alone. Finally, one could consider robust estimation of scatter matrices, when the uncontaminated data are drawn from an elliptical distribution. In that case, the proposed Kendall's tau and Spearman's rho correlation coefficients would still be Fisher consistent upon taking the respective sine transformations, so similar error bounds should hold. As demonstrated in our simulation results, the pairwise covariance estimators based on Kendall's tau and Spearman's rho perform reasonably well when data are generated from either the multivariate t-distribution or the alternative t-distribution. This motivates studying the convergence rates of the same covariance matrix estimators under heavy-tailed or elliptical distributions.
The problem of estimating high-dimensional covariance matrices under various structural assumptions has also been widely studied. Various families of structured covariance matrices have been introduced, including bandable matrices (Cai et al., 2010), Toeplitz matrices (Cai et al., 2013), and sparse matrices (Bickel and Levina, 2008; Cai and Zhou, 2012). The proposed covariance matrix estimators involve regularizing the sample covariance matrix in accordance with structural assumptions. It would be interesting to study robust versions of these structured covariance matrix estimators under a model such as cellwise contamination. Besides graphical models, covariance matrix estimation is also useful for statistical methods such as linear discriminant analysis and principal component analysis. Several high-dimensional procedures have been proposed with proven theoretical guarantees when data are uncontaminated (e.g., Vu et al., 2013), and it would be interesting to study robust adaptations of these procedures, as well.

A Proofs of supporting lemmas
In this Appendix, we provide the proofs of the technical lemmas used to establish Theorems 1 and 2 in Section 3.1.
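Before turning to the individual proofs, it may help to record the form of Hoeffding's inequality for U-statistics that is invoked repeatedly below. The following is a sketch with illustrative constants, assuming a kernel of order two bounded in $[-1,1]$, as holds for Kendall's tau:

```latex
% For a U-statistic U_n of order two with kernel bounded in [-1,1] and mean theta:
\[
  P\bigl( |U_n - \theta| \ge t \bigr)
    \;\le\; 2\exp\!\Bigl( -\tfrac{\lfloor n/2 \rfloor\, t^2}{2} \Bigr).
\]
% Since sin(x) is 1-Lipschitz, a deviation of t in sin((pi/2) r) forces a
% deviation of at least 2t/pi in r, so for each pair (i, j),
\[
  P\Bigl( \bigl| \sin\bigl(\tfrac{\pi}{2}\hat{r}^K_{ij}\bigr)
      - \sin\bigl(\tfrac{\pi}{2}\rho^K_{ij}\bigr) \bigr| \ge t \Bigr)
    \;\le\; 2\exp\!\Bigl( -\tfrac{2\lfloor n/2 \rfloor\, t^2}{\pi^2} \Bigr),
\]
% and taking t = C * sqrt(log p / n) with a union bound over the p^2 entries
% gives a failure probability on the order of 2 p^{-(C^2/pi^2 - 2)}.
```

This is the mechanism behind the probability $1 - 2p^{-(C^2/\pi^2 - 2)}$ appearing in the proof of Lemma 1.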

A.1 Proof of Lemma 1
When $i = j$, we have $\sin(\frac{\pi}{2}\hat{r}^K_{ii}) = \rho_{ii} = 1$; hence, we only need to consider the case where $i \neq j$. Note that $\hat{r}^K_{ij}$ is a U-statistic, and since $\cos(x)$ is bounded by 1, the map $x \mapsto \sin(x)$ is 1-Lipschitz. By Hoeffding's inequality for U-statistics, we obtain inequality (40). Now, consider the decomposition
\[
\Bigl| \sin\Bigl(\frac{\pi}{2}\hat{r}^K_{ij}\Bigr) - \rho_{ij} \Bigr| \;\le\; \Bigl| \sin\Bigl(\frac{\pi}{2}\hat{r}^K_{ij}\Bigr) - \sin\Bigl(\frac{\pi}{2}\rho^K_{ij}\Bigr) \Bigr| + \Bigl| \sin\Bigl(\frac{\pi}{2}\rho^K_{ij}\Bigr) - \rho_{ij} \Bigr|, \tag{41}
\]
where $\rho^K_{ij} = E(\hat{r}^K_{ij})$ and the expectation is with respect to the distribution under model (1). Since $\hat{r}^K_{ij}$ is a U-statistic with kernel bounded between $-1$ and $1$, Hoeffding's inequality and the fact that $\sin(x)$ is 1-Lipschitz imply that the first term on the right-hand side of inequality (41) satisfies inequality (42). Combining inequalities (40) and (42) and taking $t = C\sqrt{\frac{\log p}{n}}$, we conclude that the stated bound holds with probability at least $1 - 2p^{-(C^2/\pi^2 - 2)}$. For the second term on the right-hand side of inequality (41), we have under model (1) that, for any pair $i \neq j$, the marginal distribution of $(Y_{ki}, Y_{kj})$ is the mixture $(1 - \gamma_{ij})\,\Phi_{\mu_{\{i,j\}}, \Sigma_{\{i,j\}}} + \gamma_{ij} H_{ij}$, where $\Phi_{\mu_{\{i,j\}}, \Sigma_{\{i,j\}}} = N(\mu_{\{i,j\}}, \Sigma_{\{i,j\}})$ is the uncontaminated marginal distribution of $(Y_{ki}, Y_{kj})$, the distribution $H_{ij}$ is a mixture of the distributions of $Y_{ki}$, $Y_{kj}$, $Z_{ki}$, and $Z_{kj}$, and $1 - \gamma_{ij} = (1 - \epsilon_i)(1 - \epsilon_j)$. By Lemma 12, the second term is bounded accordingly; in particular, this bound is less than 1 when $\epsilon \le 0.02$. The claim then follows using the fact that $|\sin(x) - x| \le |x|^3$.
A.2 Proof of Lemma 2
With $\hat{d}_i$ defined as in equation (3), it suffices to bound the term $|\hat{d}_i - d(\Phi_{\mu_i, \sigma_i})|$, which we decompose into a concentration term, measuring the deviation of $\hat{d}_i$ from its population-level analog under contamination, and a bias term, measuring the effect of the contamination itself. By Lemma 11, for $0 < t < 1$, the concentration term is bounded with high probability; on the other hand, by Lemma 10, the bias term is bounded deterministically. Combining the two bounds, we conclude that the desired inequality holds with the stated probability.

A.3 Proof of Lemma 3
When $i = j$, we have $2\sin(\frac{\pi}{6}\hat{r}^S_{ii}) = \rho_{ii} = 1$; hence, we only need to consider the case when $i \neq j$. First, note that
\[
\Bigl| 2\sin\Bigl(\frac{\pi}{6}\hat{r}^S_{ij}\Bigr) - \rho_{ij} \Bigr| \;\le\; \Bigl| 2\sin\Bigl(\frac{\pi}{6}\hat{r}^S_{ij}\Bigr) - 2\sin\Bigl(\frac{\pi}{6}\rho^S_{ij}\Bigr) \Bigr| + \Bigl| 2\sin\Bigl(\frac{\pi}{6}\rho^S_{ij}\Bigr) - \rho_{ij} \Bigr|,
\]
where $\rho^S_{ij} = E(\hat{r}^S_{ij})$ and the expectation is taken with respect to the distribution under model (1). By Lemma 14, we have $\hat{r}^S_{ij} = \frac{n-2}{n+1} U_{ij} + \frac{3}{n+1}\hat{r}^K_{ij}$, where $U_{ij}$ is a U-statistic with kernel bounded between $-3$ and $3$, and $\hat{r}^K_{ij}$ is the Kendall's tau correlation. Using the fact that $\sin(x)$ is 1-Lipschitz, we may bound the first term, where the last inequality follows from the choice $t = C\sqrt{\frac{\log p}{n}}$ and the fact that $\frac{6}{n+1} \le \frac{3t}{2\pi}$ when $n \ge \frac{16\pi^2}{C^2 \log p}$. Furthermore, Hoeffding's inequality implies a tail bound on the deviation of $U_{ij}$ from its mean. Plugging in $t = C\sqrt{\frac{\log p}{n}}$ and using a union bound, we then obtain inequality (47). For the second term on the right-hand side of inequality (41), we have under model (1) that, for any pair $i \neq j$, the bias is controlled by a remainder term $R_{ij}$. Note that $\gamma_{ij} = \epsilon_i + \epsilon_j - \epsilon_i \epsilon_j \le 2\epsilon$, so
\[
|R_{ij}| \;\le\; \frac{\pi}{6}\Bigl( 48\gamma_{ij} + 129\gamma_{ij}^2 + 88\gamma_{ij}^3 + \frac{12}{n+1} \Bigr) \;\le\; \frac{\pi}{6}\Bigl( 48 \cdot 2\epsilon + 129(2\epsilon)^2 + 88(2\epsilon)^3 + \frac{12}{n+1} \Bigr) \;\le\; 16\pi\epsilon + 86\pi\epsilon^2 + 118\pi\epsilon^3 + \frac{2\pi}{n+1}.
\]
In particular, this bound is less than 1 when $\epsilon \le 0.01$ and $n \ge 15$. The remaining argument then uses the fact that $|\sin(x) - x| \le |x|^3$, where the final inequality uses the assumption $n \ge \frac{16\pi^2}{C^2 \log p}$ once more. Combining this bound with inequality (47) implies the desired result.
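To see the practical effect of these bounds, the following sketch (our own illustration, not the paper's code; the contamination value, fraction, and sample size are arbitrary) contrasts the Pearson correlation with the sine-transformed Kendall's tau under cellwise contamination:

```python
import numpy as np

def kendall_tau(x, y):
    # U-statistic form of Kendall's tau, O(n^2)
    n = len(x)
    total = 0.0
    for k in range(n - 1):
        total += np.sum(np.sign((x[k] - x[k + 1:]) * (y[k] - y[k + 1:])))
    return 2.0 * total / (n * (n - 1))

rng = np.random.default_rng(0)
n, rho, eps = 2000, 0.6, 0.05

# Clean bivariate normal data with correlation rho
z = rng.standard_normal((n, 2))
x = rho_x = z[:, 0]
y = rho * z[:, 0] + np.sqrt(1 - rho ** 2) * z[:, 1]

# Cellwise contamination: each cell is independently replaced w.p. eps
x_c = np.where(rng.random(n) < eps, 50.0, x)
y_c = np.where(rng.random(n) < eps, 50.0, y)

pearson = np.corrcoef(x_c, y_c)[0, 1]
robust = np.sin(np.pi / 2 * kendall_tau(x_c, y_c))
```

The Pearson correlation is driven far from $\rho = 0.6$ by a small fraction of corrupted cells, while the rank-based estimate degrades gracefully, in line with the bias terms of order $\epsilon$ appearing in the bounds above.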

B Lemmas for MAD concentration
In this Appendix, we prove several lemmas that are needed in deriving consistency of the MAD estimator. We begin with some results concerning the concentration of sample medians from an arbitrary distribution. A version of Lemmas 7 and 8 is also contained in Serfling and Mazumder (2009).
Lemma 6. Let $X_1, \ldots, X_n$ be a random sample from a distribution with cdf $F$, and let $\hat{m}$ be the sample median. If $\hat{m} < c$, then $|\{X_i : X_i \le c\}| \ge \frac{n}{2}$. Proof. This result follows easily from the definition of the sample median.
Lemma 7. Let $X_1, \ldots, X_n$ be a random sample from a distribution $F$. Let $m$ be the population median and let $\hat{m}$ be the sample median. Then
\[
P\Bigl( |\hat{m} - m| > \frac{t}{2} \Bigr) \;\le\; 2\exp\bigl( -2nb^2(t) \bigr), \qquad b(t) = \min\Bigl\{ F\Bigl(m + \frac{t}{2}\Bigr) - \frac{1}{2},\; \frac{1}{2} - F\Bigl(m - \frac{t}{2}\Bigr) \Bigr\}.
\]
Proof. By Lemma 6,
\[
P\Bigl( \hat{m} > m + \frac{t}{2} \Bigr) \;\le\; P\Bigl( \sum_{i=1}^n Y_i < \frac{n}{2} \Bigr) \;\le\; \exp\Bigl( -2n\Bigl(p_1 - \frac{1}{2}\Bigr)^2 \Bigr), \tag{48}
\]
where $Y_i = \mathbf{1}\{X_i \le m + \frac{t}{2}\}$ and $p_1 = F(m + \frac{t}{2})$, and the last inequality follows from Hoeffding's inequality. Similarly, we have
\[
P\Bigl( \hat{m} < m - \frac{t}{2} \Bigr) \;\le\; \exp\Bigl( -2n\Bigl(p_2 - \frac{1}{2}\Bigr)^2 \Bigr), \tag{49}
\]
where $Z_i = \mathbf{1}\{X_i \le m - \frac{t}{2}\}$ and $p_2 = F(m - \frac{t}{2})$. Combining expressions (48) and (49), we then obtain the desired bound.
Lemma 8. Let $X_1, \ldots, X_n$ be a random sample from a distribution with cdf $F$. Let $m$ and $d$ denote the population median and MAD, respectively, and let $\hat{m}$ and $\hat{d}$ denote the sample median and MAD. Let $G$ be the distribution of $|X_i - m|$. Then
\[
P\Bigl( |\hat{d} - d| > \frac{t}{2} \Bigr) \;\le\; \exp\Bigl( -2n\Bigl(p_3 - \frac{1}{2}\Bigr)^2 \Bigr) + \exp\Bigl( -2n\Bigl(p_4 - \frac{1}{2}\Bigr)^2 \Bigr) + 4\exp\bigl( -2nb^2(t) \bigr),
\]
where $p_3 = G(d + \frac{t}{2})$, $p_4 = G(d - \frac{t}{2})$, and $b(t)$ is defined as in Lemma 7. Proof. Let $W_i = |X_i - \hat{m}|$. By the definition of the sample MAD, Lemma 6 gives a bound on $P(\hat{d} > d + \frac{t}{2})$ in terms of $\sum_{i=1}^n Y_i$, where $Y_i = \mathbf{1}\{|X_i - m| \le d + \frac{t}{2}\}$ and $p_3 = G(d + \frac{t}{2})$. Then by Hoeffding's inequality and Lemma 7, the last quantity is bounded by $\exp(-2n(p_3 - \frac{1}{2})^2) + 2\exp(-2nb^2(t))$. Similarly, $P(\hat{d} < d - \frac{t}{2})$ is bounded in terms of $\sum_{i=1}^n Z_i$, where $Z_i = \mathbf{1}\{|X_i - m| \le d - \frac{t}{2}\}$ and $p_4 = G(d - \frac{t}{2})$. By Hoeffding's inequality and Lemma 7, the last quantity is upper-bounded by $\exp(-2n(p_4 - \frac{1}{2})^2) + 2\exp(-2nb^2(t))$.
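A quick numerical check of these concentration results (our own simulation; the sample size and tolerances are arbitrary) for standard normal data, where the population median is $0$ and the population MAD is $\Phi^{-1}(0.75) \approx 0.6745$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.standard_normal(n)

m_hat = np.median(x)                  # sample median
d_hat = np.median(np.abs(x - m_hat))  # sample MAD

pop_mad = 0.6744897501960817          # Phi^{-1}(0.75) for the standard normal
print(abs(m_hat), abs(d_hat - pop_mad))
```

Both deviations are of order $1/\sqrt{n}$, as the exponential tails $\exp(-2nb^2(t))$ in Lemmas 7 and 8 predict.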
Next, we prove two population-level lemmas for the $\epsilon$-contamination model. As remarked in the introduction, we use the notation $F^{-1}(c) = \inf\{x : F(x) \ge c\}$, which is defined even if the cdf $F$ is not surjective on the interval $[0, 1]$. Note that Lemmas 9 and 10 do not impose any conditions on the contaminating distribution $H$.
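The generalized inverse $F^{-1}(c) = \inf\{x : F(x) \ge c\}$ can be made concrete for an empirical cdf, which takes only the values $k/n$ and is therefore not surjective on $[0,1]$. A small sketch (the function name is ours):

```python
import math

def generalized_inverse(sorted_vals, c):
    """F^{-1}(c) = inf{x : F(x) >= c} for the empirical cdf of sorted_vals.

    Well-defined for any 0 < c <= 1: F(x_(k)) = k/n, so we return the
    smallest order statistic x_(k) with k/n >= c.
    """
    n = len(sorted_vals)
    k = math.ceil(c * n)  # smallest k with k/n >= c
    return sorted_vals[k - 1]
```

For example, with `sorted_vals = [1.0, 2.0, 3.0, 4.0]`, the value $F^{-1}(0.6)$ is $3.0$, since $F(2.0) = 0.5 < 0.6 \le 0.75 = F(3.0)$.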
Proof. By an abuse of notation, we also use $F$ to denote the cdf of the contaminated distribution, so that $F^{-1}$ is its quantile function. Note in particular that certain elementary implications hold, where $X \sim F$, as an easy consequence of the definition of $F^{-1}$. Furthermore, we may write the contaminated distribution in terms of $Z \sim N(\mu, \sigma^2)$. By Lemma 9, the last expression is further lower-bounded accordingly. For an appropriate choice of $a$, it follows that $P(|X - F^{-1}(0.5)| \le a) < 0.5$, so inequality (56) follows by implication (ii) above. Using the fact that $d(\Phi_{\mu,\sigma}) = \Phi^{-1}_{\mu,\sigma}(0.75)$ and $\Phi^{-1}_{\mu,\sigma}(0.5) = 0$, inequality (56) then implies the desired result.
We conclude with the main lemma of this section, which establishes the convergence of the sample MAD to its population-level version.
The proof of the following lemma is adapted from an argument in Croux and Dehon (2010).
This concludes the proof.