Overlap in Observational Studies with High-Dimensional Covariates

Estimating causal effects under exogeneity hinges on two key assumptions: unconfoundedness and overlap. Researchers often argue that unconfoundedness is more plausible when more covariates are included in the analysis. Less discussed is the fact that covariate overlap is more difficult to satisfy in this setting. In this paper, we explore the implications of overlap in observational studies with high-dimensional covariates and formalize a curse-of-dimensionality argument, suggesting that these assumptions are stronger than investigators likely realize. Our key innovation is to explore how strict overlap restricts global discrepancies between the covariate distributions in the treated and control populations. Exploiting results from information theory, we derive explicit bounds on the average imbalance in covariate means under strict overlap and show that these bounds become more restrictive as the dimension grows large. We discuss how these implications interact with assumptions and procedures commonly deployed in observational causal inference, including sparsity and trimming.


Introduction
Accompanying the rapid growth in administrative databases and online platforms, there has been a push to extend methods for estimating causal effects under exogeneity to settings with high-dimensional covariates (Belloni et al., 2014; Farrell, 2015; Athey et al., 2018). These studies typically require a pair of identifying assumptions (Rosenbaum & Rubin, 1983; Imbens, 2004): unconfoundedness, also known as selection on observables, in which the treatment assignment mechanism depends only on observed covariates; and overlap, also known as positivity or common support, in which all units have a non-zero probability of assignment to each treatment condition.
A key argument for high-dimensional observational studies is that unconfoundedness is more plausible when the analyst adjusts for more covariates (Rosenbaum, 2002; Rubin, 2009). Setting aside notable counter-examples to this argument (Pearl, 2011; Wooldridge, 2016), the intuition is straightforward to state: the richer the set of covariates, the more likely that unmeasured confounding variables become measured confounding variables. This intuition, however, has the opposite implication for overlap: the richer the set of covariates, the closer these covariates come to perfectly predicting treatment assignment for at least some subgroups.
We formalize this curse of dimensionality argument and demonstrate that there are strong implications of overlap when there are many covariates. In particular, we focus on the strict overlap assumption, which asserts that the propensity score is bounded away from zero and one with probability one. While this appears to be a local constraint, we show that strict overlap implies global restrictions on the discrepancy between the covariate distributions in the treated and control populations. To do so, we re-frame strict overlap as bounding a likelihood ratio, which is a well-studied problem in information theory (Hellman & Cover, 1970). Adapting results from Rukhin (1997), we derive explicit bounds on various types of covariate imbalance, and show that these bounds become more restrictive as the dimension of the covariates grows. For example, we show that as the dimension of the covariates grows, strict overlap implies that the covariates must either be highly correlated, or that their means must become arbitrarily close to balance on average.
To put these results into context, we discuss how the implications of strict overlap intersect with common modeling assumptions, and how our results inform the common practice of trimming in high-dimensional contexts.
We contribute to a growing literature on the critical role of overlap in observational settings. In the context of semiparametric estimators, several papers show that the convergence rate critically depends on the level of overlap (Khan & Tamer, 2010; Hong et al., 2018; Ma & Wang, 2019); see Busso et al. (2014) for relevant simulation evidence. Recognizing this, one common approach is to trim units that have extreme values of the propensity score (Dehejia & Wahba, 1999; Crump et al., 2009; Petersen et al., 2012; Yang & Ding, 2018). An alternative is to instead propose estimators and inference methods that have additional robustness to overlap violations (Chen et al., 2008; van der Laan & Rose, 2011; Chaudhuri & Hill, 2014; Rothe, 2017; Armstrong & Kolesár, 2018; Sasaki & Ura, 2018). Finally, our results are especially relevant for recent efforts to incorporate machine learning into estimating causal effects, partly to exploit rich covariates (see Chernozhukov et al., 2019; Athey & Imbens, 2019, for recent reviews). On the one hand, by using machine learning to perform covariate adjustment, these methods can achieve parametric convergence rates under extremely weak nonparametric modeling assumptions. On the other hand, the cost of this nonparametric flexibility is that these methods are highly sensitive to poor overlap.
Understanding the implications of overlap with high-dimensional covariates is therefore critical across many open research areas.
The paper proceeds as follows. Section 2 sets up the problem and defines key notation. Section 3 gives the main results on implications of strict overlap. Section 4 discusses the role of assumptions on the outcome model, such as sparsity, as well as trimming. Section 5 offers some discussion. In separate work, we address possible remedies and methodologies for assessing overlap in this setting, but believe that characterizing the implications of overlap remains of independent interest.

Preliminaries
We focus on an observational study with a binary treatment. For each sampled unit i, let the tuple (Y i (0), Y i (1), T i , X i ) of potential outcomes, binary treatment indicator, and covariates be independently and identically distributed according to a superpopulation probability measure P . We drop the i subscript when discussing population stochastic properties of these quantities.
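We observe triples (Y obs , T, X) where Y obs = (1 − T )Y (0) + T Y (1). We would like to estimate the average treatment effect

$$\tau^{ATE} := E[Y(1) - Y(0)],$$

though our results immediately extend to other estimands like the average treatment effect on the treated. The standard approach in observational studies is to argue that identification is plausible conditional on a possibly large set of covariates (Rosenbaum & Rubin, 1983; Imbens, 2004). Specifically, the investigator chooses a set of p covariates X 1:p ⊂ X, and assumes unconfoundedness.

Assumption 1 (Unconfoundedness). $(Y(0), Y(1)) \perp\!\!\!\perp T \mid X_{1:p}$.

Under Assumption 1, the average treatment effect is identified as

$$\tau^{ATE} = E\left\{ E[Y^{obs} \mid T = 1, X_{1:p}] - E[Y^{obs} \mid T = 0, X_{1:p}] \right\}. \tag{1}$$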
Importantly, the conditional expectations in (1) are non-parametrically identifiable only if the following population overlap assumption is satisfied. Let e(X 1:p ) = P (T = 1 | X 1:p ) be the propensity score.

Assumption 2 (Overlap). $0 < e(X_{1:p}) < 1$ with probability 1.
Assumption 2 is sufficient for non-parametric identification of τ ATE , but is not sufficient for efficient semiparametric estimation of τ ATE , a fact we discuss in further detail in the next section. For this reason, investigators typically invoke a stronger variant of Assumption 2 (e.g., Hirano et al., 2003; Khan & Tamer, 2010), which we call the strict overlap assumption with bound η.

Assumption 3 (Strict Overlap). For some constant $\eta \in (0, 0.5)$, $\eta \le e(X_{1:p}) \le 1 - \eta$ with probability 1.
Strict overlap is integral across a range of settings. Without any restrictions on the outcome distribution, strict overlap is a necessary condition for the existence of regular semiparametric estimators of τ ATE that are uniformly n 1/2 -consistent over a nonparametric model family (Khan & Tamer, 2010). This necessity may not hold if other conditions, e.g., conditional moment conditions and smoothness conditions, are imposed on the potential outcomes (e.g., Chen et al., 2008; Hirshberg & Wager, 2017; Ma & Wang, 2019). Technically, we can relax Assumption 3, but this will involve non-standard asymptotic analyses (e.g., Hong et al., 2018; Ma & Wang, 2019) and it is difficult, if not impossible, to conduct uniform inference on τ ATE (e.g., Khan & Nekipelov, 2013). Khan & Nekipelov (2013) prove that neither bootstrap inference nor pivotal inference is asymptotically valid without this assumption. Ma & Wang (2019) shed some light on the possibility of uniform inference under assumptions on the tail behavior of inverse propensity scores, though they do not provide a complete recipe. In general, a lack of uniform inference is problematic in practice, even if we can characterize the limiting behavior for every data generating distribution in a model, because the correct choice of inferential procedure will depend on the unknown truth. See, e.g., Romano & Wolf (1999), Andrews & Cheng (2012), and Chen et al. (2011) for discussion in other contexts.

Framework
In this section, we show that strict overlap restricts the overall discrepancy between the treated and control covariate measures, and that this restriction becomes more binding as the dimension p increases. Formally, we write the control and treatment measures for covariates, for all p, as:

$$P_0(X_{1:p} \in A) := P(X_{1:p} \in A \mid T = 0), \qquad P_1(X_{1:p} \in A) := P(X_{1:p} \in A \mid T = 1).$$

For the remainder of the paper, we will assume that the marginal probability that any unit is assigned to treatment, π := P (T = 1), is bounded by η ≤ π ≤ 1 − η. With a slight abuse of notation, we define the marginal probability measure on covariates, implied by the superpopulation distribution, as P = πP 1 + (1 − π)P 0 , a mixture of the condition-specific probability measures P 0 and P 1 .
We write the densities of P 1 and P 0 with respect to the dominating measure P as dP 1 /dP and dP 0 /dP . We write the marginal probability measures of finite-dimensional covariate sets X 1:p as P 0 (X 1:p ) and P 1 (X 1:p ), and the marginal densities as dP 1 /dP (X 1:p ) and dP 0 /dP (X 1:p ). When discussing density ratios, we will omit the dominating measure dP .
By Bayes' Theorem, Assumption 3 is equivalent to the following bound on the density ratio between P 1 and P 0 , which we will refer to as a likelihood ratio:

$$b_{min} \le \frac{dP_1(X_{1:p})}{dP_0(X_{1:p})} \le b_{max} \quad \text{with probability 1}, \tag{2}$$

where

$$b_{min} := \frac{1-\pi}{\pi} \cdot \frac{\eta}{1-\eta}, \qquad b_{max} := \frac{1-\pi}{\pi} \cdot \frac{1-\eta}{\eta}. \tag{3}$$

Implications of bounded likelihood ratios are well-studied in information theory (Hellman & Cover, 1970; Rukhin, 1993, 1997). Each of the results that follow is an application of a theorem due to Rukhin (1997), which relates likelihood ratio bounds of the form (2) to upper bounds on certain divergences measuring the discrepancy between the distributions P 0 (X 1:p ) and P 1 (X 1:p ). We include an adaptation of Rukhin's theorem in the appendix, as Theorem 2. We also derive additional implications of this result in the appendix.
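As a small illustration of the Bayes' theorem step, the sketch below converts an overlap bound η and a marginal treatment probability π into bounds on the likelihood ratio dP 1 /dP 0 , using the relation e(x) / (1 − e(x)) = {π / (1 − π)} · dP 1 /dP 0 (x). The function names are hypothetical and chosen only for this example.

```python
def likelihood_ratio_bounds(eta, pi):
    """Bounds on dP1/dP0 implied by eta <= e(X) <= 1 - eta when P(T = 1) = pi."""
    if not 0.0 < eta <= 0.5:
        raise ValueError("eta must lie in (0, 0.5]")
    if not eta <= pi <= 1.0 - eta:
        raise ValueError("pi must lie in [eta, 1 - eta]")
    inverse_prior_odds = (1.0 - pi) / pi
    b_min = inverse_prior_odds * eta / (1.0 - eta)
    b_max = inverse_prior_odds * (1.0 - eta) / eta
    return b_min, b_max


def propensity_from_ratio(ratio, pi):
    """Bayes' theorem: e(x) = pi * r / (pi * r + 1 - pi), where r = dP1/dP0 at x."""
    return pi * ratio / (pi * ratio + 1.0 - pi)


# Balanced design with eta = 0.1: the likelihood ratio must stay in [1/9, 9].
b_min, b_max = likelihood_ratio_bounds(eta=0.1, pi=0.5)
print(b_min, b_max)
```

Mapping the two endpoints back through `propensity_from_ratio` recovers η and 1 − η, which is a quick way to check the equivalence numerically.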
In what follows, we explore the implications of Assumption 3 when there are many covariates. To do so, we set up an analytical framework in which the covariate sequence X is a stochastic process (X (k) ) k>0 . For any single problem, the investigator selects a finite set of covariates X 1:p from the infinite pool of covariates (X (k) ) k>0 . Importantly, this framework includes no notion of sample size because we are examining the population-level implications of an assumption about the population measure P . Our results are independent of the number of samples that an investigator might draw from this population.
Remark 1 (Strict Overlap and Gaussian Covariates). While we focus on the implications of strict overlap in high dimensions, this assumption also has surprising implications in low dimensions.
For example, if X is one-dimensional and follows a Gaussian distribution under both P 0 and P 1 , strict overlap implies that P 0 = P 1 , or that the covariate is perfectly balanced. This is because if P 0 ≠ P 1 , the log-density ratio log dP 0 /dP 1 (X) diverges for values of X with large magnitude, implying that e(X) can be arbitrarily close to 0 or 1 with positive probability. Similar results can be derived when X 1:p is multi-dimensional Gaussian. Thus, for Gaussian covariates, the implications of strict overlap are so strong that they are uninteresting. For this reason, we do not give any examples of the implications of the strict overlap assumption when the covariates are Gaussian.
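The Gaussian remark can be checked numerically. The sketch below assumes a specific, illustrative setup that is not taken from the text: X ~ N(0, 1) under control, X ~ N(δ, 1) under treatment with δ ≥ 0, and π = 1/2, so that logit e(x) = δx − δ²/2. It returns the marginal probability that e(X) falls outside [η, 1 − η], which is positive for every δ > 0 and every η.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal Z, computed stably with erfc."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def violation_probability(delta, eta):
    """P(e(X) < eta or e(X) > 1 - eta) under the assumed two-Gaussian model.

    Assumed model: X | T=0 ~ N(0, 1), X | T=1 ~ N(delta, 1), P(T=1) = 1/2,
    so the log-odds of treatment at x equal delta * x - delta**2 / 2.
    """
    if delta < 0.0:
        raise ValueError("this sketch assumes delta >= 0")
    if delta == 0.0:
        return 0.0  # identical distributions: e(X) = 1/2 everywhere
    c = math.log((1.0 - eta) / eta)
    hi = (c + delta ** 2 / 2.0) / delta   # e(x) > 1 - eta  iff  x > hi
    lo = (-c + delta ** 2 / 2.0) / delta  # e(x) < eta      iff  x < lo
    # X is marginally an equal mixture of N(0, 1) and N(delta, 1).
    above = 0.5 * upper_tail(hi) + 0.5 * upper_tail(hi - delta)
    below = 0.5 * upper_tail(-lo) + 0.5 * upper_tail(-(lo - delta))
    return above + below


for delta in (0.0, 0.5, 2.0):
    print(delta, violation_probability(delta, eta=0.1))
```

Even a small mean shift leaves a nonzero mass of units with extreme propensity scores, so no η > 0 can hold with probability one unless δ = 0.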

Strict Overlap Implies Bounded Mean Discrepancy
We now use these bounds to derive concrete implications of strict overlap. Here, we show that strict overlap implies a strong restriction on the discrepancy between the means of P 0 (X 1:p ) and P 1 (X 1:p ). In particular, when p is large, strict overlap implies that either the covariates are highly correlated under both P 0 and P 1 , or the average discrepancy in means across covariates is small.
We use ‖·‖ to denote the Euclidean norm of a vector, and ‖·‖ op to denote the operator norm of a matrix.
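Theorem 1. Let µ 0,1:p and µ 1,1:p denote the means, and Σ 0,1:p and Σ 1,1:p the covariance matrices, of X 1:p under P 0 and P 1 , respectively. Assumption 3 implies

$$\|\mu_{0,1:p} - \mu_{1,1:p}\| \le \min\left\{ \|\Sigma_{0,1:p}\|_{op}^{1/2} \, B_{\chi^2(1\|0)}^{1/2}, \; \|\Sigma_{1,1:p}\|_{op}^{1/2} \, B_{\chi^2(0\|1)}^{1/2} \right\}, \tag{4}$$

where b min and b max are defined in (3), and

$$B_{\chi^2(1\|0)} := (1 - b_{min})(b_{max} - 1), \qquad B_{\chi^2(0\|1)} := (1 - b_{max}^{-1})(b_{min}^{-1} - 1).$$

The proof is included in the Appendix. Theorem 1 has strong implications when p is large.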
These implications become apparent when we examine how much each covariate mean can differ, on average, under (4).

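Corollary 1. Let µ 0 (k) and µ 1 (k) denote the means of the kth covariate X (k) under P 0 and P 1 . Assumption 3 implies

$$\frac{1}{p} \sum_{k=1}^{p} \left| \mu_0^{(k)} - \mu_1^{(k)} \right| \le p^{-1/2} \min\left\{ \|\Sigma_{0,1:p}\|_{op}^{1/2} \, B_{\chi^2(1\|0)}^{1/2}, \; \|\Sigma_{1,1:p}\|_{op}^{1/2} \, B_{\chi^2(0\|1)}^{1/2} \right\}. \tag{5}$$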
The mean discrepancy bounds in Theorem 1 and Corollary 1 depend on the operator norms of the covariance matrices Σ 0,1:p and Σ 1,1:p . The operator norm is equal to the largest eigenvalue of the covariance matrix and is a proxy for the degree to which the covariates X 1:p are correlated. In particular, the operator norm is large relative to the dimension p if and only if a large proportion of the variance in X 1:p is contained in a low-dimensional projection of X 1:p . For example, in the cases where the components of X 1:p are independent, or where X 1:p are samples from a stationary ergodic process, the operator norm scales like a constant in p. On the other hand, in the case where the variance in X 1:p is dominated by a low-dimensional latent factor model, the operator norm scales linearly in p. We treat these examples precisely in the appendix.
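To make the two scaling regimes concrete, the following sketch computes the operator norm (largest eigenvalue) of two illustrative covariance matrices: an independent-covariate covariance, and a one-factor model in which every covariate loads on a single latent factor. The constructions and parameter names are assumptions made for this example only.

```python
import numpy as np


def operator_norm(cov):
    """Largest eigenvalue of a symmetric positive semi-definite matrix."""
    return float(np.linalg.eigvalsh(cov)[-1])  # eigvalsh sorts ascending


def independent_covariance(p, variance=1.0):
    """Covariance of p independent covariates with a common variance."""
    return variance * np.eye(p)


def one_factor_covariance(p, loading=1.0, noise_variance=1.0):
    """Covariance of X_k = loading * F + noise_k with Var(F) = 1 (hypothetical model)."""
    loadings = np.full((p, 1), loading)
    return loadings @ loadings.T + noise_variance * np.eye(p)


for p in (10, 100, 1000):
    print(p,
          operator_norm(independent_covariance(p)),
          operator_norm(one_factor_covariance(p)))
```

In the independent case the norm stays at the common variance for every p, while in the one-factor case it equals loading² · p + noise_variance and so grows linearly in p.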
Corollary 1 establishes that strict overlap implies that the average mean discrepancy across covariates is not too large relative to the operator norms of the covariance matrices Σ 0,1:p and Σ 1,1:p . When p is large, these implications are strong. To explore this, let (X (k) ) k>0 be a sequence of covariates such that for each p, X 1:p ⊂ (X (k) ) k>0 . When the smaller operator norm min(‖Σ 0,1:p ‖ op , ‖Σ 1,1:p ‖ op ) grows more slowly than p, the bound in (5) converges to zero, implying that the covariate means are, on average, arbitrarily close to balance. On the other hand, for the bound to remain non-zero as p grows large, both operator norms must grow at the same rate as p.
This is a strong restriction on the covariance structure; it implies that all but a vanishing proportion of the variance in X 1:p concentrates in a finite-dimensional subspace under both P 0 and P 1 .
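The sketch below evaluates an average mean discrepancy bound as a function of p. It assumes the bound takes the form p^(-1/2) · min{(‖Σ 0 ‖ op · B 10 )^(1/2), (‖Σ 1 ‖ op · B 01 )^(1/2)}, with B 10 = (1 − b min )(b max − 1) and B 01 = (1 − 1/b max )(1/b min − 1); these constants are the largest χ²-divergences compatible with a likelihood ratio confined to [b min , b max ], and this form should be read as an assumption of the example. All function names are hypothetical.

```python
import math


def chi2_bound_constants(eta, pi=0.5):
    """Largest chi-square divergences compatible with the likelihood ratio bounds."""
    inverse_prior_odds = (1.0 - pi) / pi
    b_min = inverse_prior_odds * eta / (1.0 - eta)
    b_max = inverse_prior_odds * (1.0 - eta) / eta
    b_10 = (1.0 - b_min) * (b_max - 1.0)
    b_01 = (1.0 - 1.0 / b_max) * (1.0 / b_min - 1.0)
    return b_10, b_01


def average_mean_discrepancy_bound(p, op_norm_0, op_norm_1, eta, pi=0.5):
    """Assumed bound on (1/p) * sum_k |mu_0k - mu_1k| under strict overlap."""
    b_10, b_01 = chi2_bound_constants(eta, pi)
    norm_bound = min(math.sqrt(op_norm_0 * b_10), math.sqrt(op_norm_1 * b_01))
    return norm_bound / math.sqrt(p)


for p in (10, 100, 1000, 10000):
    constant_norm = average_mean_discrepancy_bound(p, 1.0, 1.0, eta=0.1)
    linear_norm = average_mean_discrepancy_bound(p, float(p), float(p), eta=0.1)
    print(p, constant_norm, linear_norm)
```

With a constant operator norm the bound shrinks like p^(-1/2); with an operator norm proportional to p it stays flat, matching the two cases discussed above.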
Remark 2. Theorem 1 bounds the mean discrepancy of X 1:p , which is a special case of a bound on any functional discrepancy of the form |E P 0 {g(X 1:p )} − E P 1 {g(X 1:p )}| for any function g : R p → R that is measurable and square-integrable under P 0 or P 1 . This result is of independent interest, and is included in the appendix.

Strict Overlap Restricts General Distinguishability
In addition to bounds on mean discrepancies, strict overlap also implies restrictions on more general discrepancies between P 0 (X 1:p ) and P 1 (X 1:p ). In this section, we present two additional results showing that strict overlap restricts how well the covariate distributions can be distinguished from each other.
First, we show that Assumption 3 restricts the extent to which P 0 (X 1:p ) can be distinguished from P 1 (X 1:p ) by any classifier or statistical test. Let φ(X 1:p ) be a classifier that maps from the covariate support X 1:p to {0, 1}. We have the following upper bound on the accuracy of any classifier φ(X 1:p ) when Assumption 3 holds.
Asymptotically, by Proposition 1, strict overlap implies that there exists no consistent classifier of P 0 against P 1 in the large-p limit.
Definition 1. A classifier φ(X 1:p ) is p-consistent if and only if P (φ(X 1:p ) = T ) → 1 as p grows to infinity.
Corollary 2 (No Consistent Classifier). Let (X (k) ) k>0 be a sequence of covariates, and for each p, let X 1:p be a finite subset. If Assumption 3 holds as p grows large, there exists no p-consistent classifier of P 0 against P 1 .
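The tension between classification accuracy and strict overlap can be seen in a toy model, sketched below. The model is an assumption made for illustration: p independent binary covariates with P 1 (X (k) = 1) = 0.5 + δ and P 0 (X (k) = 1) = 0.5 − δ. The count of ones is then a sufficient statistic, so the Bayes optimal accuracy E[max{e(X), 1 − e(X)}] can be computed exactly. Under strict overlap this quantity can never exceed 1 − η, since max{e, 1 − e} ≤ 1 − η pointwise.

```python
import math


def binomial_pmf(s, p, q):
    """P(S = s) for S ~ Binomial(p, q)."""
    return math.comb(p, s) * q ** s * (1.0 - q) ** (p - s)


def bayes_accuracy(p, delta, pi=0.5):
    """Exact accuracy of the Bayes optimal classifier in the assumed binary model."""
    q1, q0 = 0.5 + delta, 0.5 - delta
    return sum(
        max(pi * binomial_pmf(s, p, q1), (1.0 - pi) * binomial_pmf(s, p, q0))
        for s in range(p + 1)
    )


def extreme_propensity(p, delta, pi=0.5):
    """Propensity score of a unit whose p covariates all equal 1."""
    q1, q0 = 0.5 + delta, 0.5 - delta
    odds = pi / (1.0 - pi) * math.exp(p * math.log(q1 / q0))
    return odds / (1.0 + odds)


for p in (1, 10, 50, 200):
    print(p, bayes_accuracy(p, delta=0.1), extreme_propensity(p, delta=0.1))
```

In this toy model a fixed per-covariate imbalance lets the classifier become arbitrarily accurate as p grows, so any fixed η is eventually violated; conversely, holding η fixed forces the per-covariate imbalance to shrink.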
We can characterize the relationship between the dimension p and the distinguishability of P 0 (X 1:p ) from P 1 (X 1:p ) non-asymptotically by examining the Kullback-Leibler divergence. The following result is a special case of Theorem 2, included in the appendix.
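Proposition 2 (KL Divergence Bound). Assumption 3 implies

$$KL\left(P_1(X_{1:p}) \,\|\, P_0(X_{1:p})\right) \le B_{KL(1\|0)}, \qquad KL\left(P_0(X_{1:p}) \,\|\, P_1(X_{1:p})\right) \le B_{KL(0\|1)},$$

where

$$B_{KL(1\|0)} := \frac{(1 - b_{min})\, b_{max} \log b_{max} + (b_{max} - 1)\, b_{min} \log b_{min}}{b_{max} - b_{min}}, \qquad B_{KL(0\|1)} := -\frac{(1 - b_{min}) \log b_{max} + (b_{max} - 1) \log b_{min}}{b_{max} - b_{min}}$$

are free of p, with b min and b max defined in (3).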
In the case of balanced treatment assignment with π = 0.5, B KL(1‖0) and B KL(0‖1) have a simple form:

$$B_{KL(1\|0)} = B_{KL(0\|1)} = (1 - 2\eta) \log\frac{1-\eta}{\eta}.$$

Proposition 2 becomes more restrictive for larger values of p. This follows because neither bound in Proposition 2 depends on p, while the KL divergence is free to grow in p. In particular, by the so-called chain rule, the KL divergence can be expanded into a summation of p non-negative terms (Cover & Thomas, 2005, Theorem 2.5.3):

$$KL\left(P_1(X_{1:p}) \,\|\, P_0(X_{1:p})\right) = \sum_{k=1}^{p} E_{P_1}\left[ KL\left(P_1(X^{(k)} \mid X_{1:k-1}) \,\|\, P_0(X^{(k)} \mid X_{1:k-1})\right) \right]. \tag{7}$$

Each term in (7) is the expected KL divergence between the conditional distributions of the kth covariate X (k) under P 0 and P 1 , after conditioning on all previous covariates X 1:k−1 . Thus, each term corresponds to the discriminating information added by X (k) , beyond the information contained in X 1:k−1 . In the large-p limit, strict overlap implies that the average unique discriminating information contained in each covariate X (k) converges to zero.
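Corollary 3. Let (X (k) ) k>0 be a sequence of covariates, and for each p, let X 1:p be a finite subset of (X (k) ) k>0 . As p grows large, Assumption 3 implies

$$\frac{1}{p} \sum_{k=1}^{p} E_{P_1}\left[ KL\left(P_1(X^{(k)} \mid X_{1:k-1}) \,\|\, P_0(X^{(k)} \mid X_{1:k-1})\right) \right] \le \frac{B_{KL(1\|0)}}{p} \to 0,$$

and likewise for the KL divergence evaluated in the opposite direction.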
By Corollary 3, strict overlap implies that, on average, the conditional distributions of each covariate X (k) , given all previous covariates X 1:k−1 , are arbitrarily close to balance. In the special case where the covariates X (k) are mutually independent under both P 0 and P 1 , Corollary 3 implies that, on average, the marginal treated and control distributions for each covariate X (k) are arbitrarily close to balance.
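For mutually independent covariates the chain rule reduces to a plain sum of marginal KL divergences, which makes the dimension dependence easy to compute. The sketch below assumes independent Bernoulli covariates with success probability q1 under P 1 and q0 under P 0 (an illustrative model), and a balanced design, for which the KL bound is assumed to equal (1 − 2η) log{(1 − η)/η}. It reports the largest p at which the total KL divergence still respects that bound; beyond it, strict overlap with bound η cannot hold.

```python
import math


def kl_bernoulli(q1, q0):
    """KL(Bernoulli(q1) || Bernoulli(q0)) for probabilities strictly inside (0, 1)."""
    return q1 * math.log(q1 / q0) + (1.0 - q1) * math.log((1.0 - q1) / (1.0 - q0))


def kl_bound_balanced(eta):
    """Assumed KL bound under strict overlap with pi = 0.5."""
    return (1.0 - 2.0 * eta) * math.log((1.0 - eta) / eta)


def largest_admissible_dimension(q1, q0, eta):
    """Largest p with p * KL per covariate <= bound; requires q1 != q0."""
    per_covariate = kl_bernoulli(q1, q0)
    return math.floor(kl_bound_balanced(eta) / per_covariate)


print(kl_bound_balanced(0.1))
print(largest_admissible_dimension(q1=0.6, q0=0.4, eta=0.1))
```

The KL bound is only a necessary condition, so a dimension below this threshold does not by itself guarantee strict overlap.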

Treatment Models: Strict Overlap with Fewer Implications
In this section, we discuss how the implications of strict overlap align with common modeling assumptions about the assignment mechanism. We show that certain modeling assumptions already impose many of the constraints that strict overlap implies. Thus, if one is willing to accept these modeling assumptions, strict overlap has fewer unique implications.
We will focus specifically on the class of modeling assumptions that assert that the propensity score e(X 1:p ) is only a function of a sufficient summary of the covariates b(X 1:p ). In this case, overlap in the summary b(X 1:p ) implies overlap in the full set of covariates X 1:p . Models in this class include sparse models and latent variable models.
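Assumption 4 (Sufficient Condition for Strict Overlap). There exists some function of the covariates b(X 1:p ) satisfying the following two conditions:

$$X_{1:p} \perp\!\!\!\perp T \mid b(X_{1:p}), \qquad \eta \le e_b(X_{1:p}) \le 1 - \eta \ \text{ with probability 1},$$

where e b (X 1:p ) := P (T = 1 | b(X 1:p )).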
Here, the variable b(X 1:p ) is a balancing score as in Rosenbaum & Rubin (1983). The propensity score is the coarsest balancing score in the sense that there exists some h(·) such that e(X 1:p ) = h(b(X 1:p )). Thus, b(X 1:p ) is a sufficient summary of the covariates X 1:p for the treatment assignment T , and overlap in b(X 1:p ) is a sufficient condition for overlap in the entire covariate set X 1:p .
Assumption 4 has some trivial specifications, which are useful examples. At one extreme, we may specify that b(X 1:p ) = e(X 1:p ). In this case, Assumption 4 is vacuous: there are no restrictions on the form of the propensity score; and strict overlap overall is equivalent to strict overlap with respect to b(X 1:p ). At the other extreme, we may specify b(X 1:p ) to be a constant, i.e., we assume that the data were generated from a randomized trial. In this case, the overlap condition in Assumption 4 holds automatically.
Of particular interest are restrictions on b(X 1:p ) between these two extremes, such as the sparse propensity score model in Example 1 below. Such restrictions trade off stronger modeling assumptions on the propensity score e(X 1:p ) with weaker implications of strict overlap.
Example 1 (Sparse Propensity Score). Consider a study where the propensity score is sparse in the covariate set X 1:p , so that for some subset of covariates X 1:s ⊂ X 1:p with s < p, e(X 1:p ) = e(X 1:s ).
This implies X 1:p ⊥⊥ T | X 1:s , and e(X 1:s ) is a balancing score. In this case, strict overlap in the lower-dimensional X 1:s implies strict overlap for X 1:p . Belloni et al. (2013) and Farrell (2015) propose a specification similar to this, with an "approximately sparse" specification for the propensity score. The approximately sparse specification in these papers is broader than the model defined here, but has similar implications for overlap.
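A hypothetical logistic model illustrates the contrast between sparse and dense propensity scores. The sketch assumes, purely for illustration, that each covariate is supported on [−r, r] and that logit e(x) equals a common coefficient times the sum of the active covariates. The best overlap bound η is then the propensity score at the most extreme corner of the support.

```python
import math


def expit(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def overlap_bound_logistic(num_active, coef, covariate_range=1.0):
    """Largest eta with eta <= e(x) <= 1 - eta over the whole support.

    Hypothetical model: logit e(x) = coef * (sum of num_active covariates),
    with each covariate supported on [-covariate_range, covariate_range].
    """
    max_index = num_active * abs(coef) * covariate_range
    return expit(-max_index)


s = 3  # number of covariates that enter a sparse propensity score
for p in (3, 30, 300):
    sparse_eta = overlap_bound_logistic(num_active=s, coef=0.5)
    dense_eta = overlap_bound_logistic(num_active=p, coef=0.5)
    print(p, sparse_eta, dense_eta)
```

Under the sparse specification the bound does not depend on p, whereas under the dense specification with a fixed coefficient it decays exponentially in p.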
Example 2 (Latent Variable Model for Propensity Score). Consider a study where the treatment assignment mechanism is only a function of some possibly multivariate latent variable U , such that

$$X_{1:p} \perp\!\!\!\perp T \mid U.$$

For example, such a structure exists when treatment is assigned only as a function of a latent class or latent factor. In that case, the projection of e(U ) := P (T = 1 | U ) onto X 1:p is a balancing score:

$$X_{1:p} \perp\!\!\!\perp T \mid b_U(X_{1:p}),$$

where b U (X 1:p ) := E{e(U ) | X 1:p }. Due to (9), strict overlap in the latent variable U implies strict overlap in b U (X 1:p ), which implies strict overlap in X 1:p by Proposition 3. Athey et al. (2018) propose a specification similar to this in their simulations, in which the propensity score is dense with respect to observable covariates but can be specified simply in terms of a latent class.

Outcome Models: Identification and Estimation with Weaker Overlap
The average treatment effect can be identified and estimated under weaker overlap conditions if one is willing to make structural assumptions about the data generating process. For example, if one assumes that the conditional expectations of outcomes E[Y (0) | X 1:p ] and E[Y (1) | X 1:p ] belong to a restricted class, Hansen (2008) established that τ ATE can be estimated under Assumption 1 and the following assumption.
Modifying Hansen (2008)'s nomenclature slightly, we call r(X 1:p ) a prognostic score. The assumption of strict overlap in a prognostic score r(X 1:p ) in (11) is never more stringent than Assumption 3 with the same η. van der Laan & Gruber (2010) and Luo et al. (2017) propose methodology designed to exploit this sort of structure.
One can also weaken overlap requirements by imposing modeling assumptions on the outcome process via the conditional average treatment effect τ (X 1:p ) := E[Y (1) − Y (0) | X 1:p ]. If τ (X 1:p ) is assumed constant, for example, in the case of the partial linear model (Belloni et al., 2014;Farrell, 2015), then estimation of τ ATE only requires that strict overlap hold with positive probability, rather than with probability 1.
Assumption 6 (Strict Overlap with Positive Probability). For some δ > 0,

$$P\left(\eta \le e(X_{1:p}) \le 1 - \eta\right) > \delta.$$

Here, Assumption 6 is sufficient because the constant treatment effect assumption justifies extrapolation from subpopulations where the treatment effect can be estimated to other subpopulations for which strict overlap may fail. The constant treatment effect assumption can also be used to justify trimming strategies, which we turn to next.

Trimming
When Assumption 3 does not hold, one can still estimate an average treatment effect within a subpopulation in which strict overlap does hold. This motivates the common practice of trimming, where the investigator drops observations in regions without overlap (Dehejia & Wahba, 1999; Crump et al., 2009; Petersen et al., 2012; Yang & Ding, 2018). In general, trimming changes the estimand unless additional structure, such as a constant treatment effect, is imposed on the conditional treatment effect surface τ (X 1:p ).
Our results suggest that trimming may need to be employed more often when the covariate dimension p is large, especially in cases where overlap violations result from small imbalances accumulated over many dimensions. In these cases, trimming procedures may have undesirable properties for the same reason that strict overlap does not hold. For example, in high dimensions, one may need to trim a large proportion of units to achieve desirable overlap in the new target subpopulation. The proportion of units that can be retained under a trimming policy designed to achieve overlap bound η is related to the accuracy of the Bayes optimal classifier in (6) by the following proposition.
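Proposition 4. For an overlap bound η ∈ (0, 1/2), and with φ(X 1:p ) denoting the Bayes optimal classifier, we have

$$P\left(\eta \le e(X_{1:p}) \le 1 - \eta\right) \le \frac{1 - P\left(\phi(X_{1:p}) = T\right)}{\eta}.$$

Proof. Define the event A := {η ≤ e(X 1:p ) ≤ 1 − η}. The conclusion follows from

$$P\left(\phi(X_{1:p}) = T\right) = P\left(\phi(X_{1:p}) = T \mid A\right) P(A) + P\left(\phi(X_{1:p}) = T \mid A^c\right) P(A^c) \le (1 - \eta) P(A) + 1 - P(A) = 1 - \eta P(A).$$

When large covariate sets X 1:p enable units to be more accurately classified in treatment and control, the probability that a unit has an acceptable propensity score becomes small. In this case, a trimming procedure must throw away a large proportion of the sample. In the large-p limit, if the Bayes optimal classifier φ(X 1:p ) is consistent in the sense of Definition 1, then the expected proportion of the sample that must be discarded to achieve any η approaches 1.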
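The relationship between classification accuracy and the retained fraction can be checked exactly in a toy model. The sketch below assumes p independent binary covariates with P 1 (X (k) = 1) = 0.5 + δ and P 0 (X (k) = 1) = 0.5 − δ (an illustrative model), and compares the exact probability of the retained region with the quantity (1 − accuracy) / η, which follows from E[min{e(X), 1 − e(X)}] ≥ η · P(η ≤ e(X) ≤ 1 − η). Names are hypothetical.

```python
import math


def count_distribution(p, q):
    """Distribution of the number of ones among p independent Bernoulli(q) covariates."""
    return [math.comb(p, s) * q ** s * (1.0 - q) ** (p - s) for s in range(p + 1)]


def trimming_summary(p, delta, eta_trim, pi=0.5):
    """Return (exact retained fraction, upper bound from Bayes accuracy)."""
    pmf1 = count_distribution(p, 0.5 + delta)
    pmf0 = count_distribution(p, 0.5 - delta)
    retained = 0.0
    accuracy = 0.0
    for f1, f0 in zip(pmf1, pmf0):
        joint1, joint0 = pi * f1, (1.0 - pi) * f0
        marginal = joint1 + joint0
        propensity = joint1 / marginal
        accuracy += max(joint1, joint0)  # Bayes classifier picks the larger joint mass
        if eta_trim <= propensity <= 1.0 - eta_trim:
            retained += marginal
    bound = min(1.0, (1.0 - accuracy) / eta_trim)
    return retained, bound


for p in (5, 50, 200):
    print(p, trimming_summary(p, delta=0.1, eta_trim=0.1))
```

In this toy model the retained fraction collapses as p grows, in step with the rising accuracy of the optimal classifier.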

Discussion
In this paper, we have shown that the strict overlap assumption has strong implications in settings with high-dimensional covariates. In particular, we show that the strict overlap assumption implies that the information distinguishing the treated and control covariate distributions must remain fixed, even as the dimension of the covariates grows. This results in binding, population-level restrictions on the data-generating process. Importantly, techniques such as regularization do not avoid these restrictions, though they are often necessary for estimation with high-dimensional covariates.
Our results suggest that overlap assumptions should be carefully considered when adjusting for rich covariates. First, strict overlap is a testable assumption in the sense that, for any fixed bound η, one can construct finite-sample exact tests (D'Amour et al., 2019). We explore this in separate work and suggest that such empirical validation should be standard practice in these settings. In addition, in cases where the unconfoundedness assumption is violated, overlap appears to play a key role in bias amplification phenomena that result from adjusting for covariates, such as instruments, that are highly predictive of treatment assignment but not of the outcome (Myers et al., 2011; Pearl, 2010; Ding et al., 2017). As the dimensionality increases, appropriately accounting for these complications is important both from a population and finite-sample perspective.

A Strict Overlap Implies Bounded f-Divergences
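Here, we adapt a theorem from information theory, due to Rukhin (1997), to derive general implications of strict overlap. The theorem states that a likelihood ratio bound of the form (2) implies upper bounds on f-divergences between P 0 and P 1 . f-divergences are a family of discrepancy measures between probability distributions defined in terms of a convex function f (Csiszár, 1963; Ali & Silvey, 1966; Liese & Vajda, 2006). Formally, the f-divergence from some probability measure Q 0 to another Q 1 is defined as

$$D_f(Q_1 \,\|\, Q_0) := E_{Q_0}\left[ f\left( \frac{dQ_1}{dQ_0} \right) \right].$$

Theorem 2. Let D f be an f-divergence such that f has a minimum at 1. Assumption 3 implies

$$D_f\left(P_1(X_{1:p}) \,\|\, P_0(X_{1:p})\right) \le \frac{b_{max} - 1}{b_{max} - b_{min}} f(b_{min}) + \frac{1 - b_{min}}{b_{max} - b_{min}} f(b_{max}), \tag{A.1}$$

$$D_f\left(P_0(X_{1:p}) \,\|\, P_1(X_{1:p})\right) \le \frac{b_{min}^{-1} - 1}{b_{min}^{-1} - b_{max}^{-1}} f(b_{max}^{-1}) + \frac{1 - b_{max}^{-1}}{b_{min}^{-1} - b_{max}^{-1}} f(b_{min}^{-1}). \tag{A.2}$$

Proof. Theorem 2.1 of Rukhin (1997) shows that the likelihood ratio bound in (2) implies the bounds in (A.1) and (A.2) when f has a minimum at 1 and is "bowl-shaped", i.e., non-increasing on (0, 1) and non-decreasing on (1, ∞). The "bowl-shaped" constraint is satisfied because f is convex.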
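One way to build intuition for bounds of this type is to note that, for convex f, the expectation E[f(L)] over likelihood ratios L supported on [b min , b max ] with mean one is maximized by a two-point distribution on the endpoints. The sketch below evaluates that two-point value for a given generator f; treating it as the bound in Theorem 2 is an assumption of this example, and the function names are hypothetical.

```python
import math


def f_divergence_bound(f, b_min, b_max):
    """Two-point extremal value of E[f(L)] for L in [b_min, b_max] with E[L] = 1."""
    if not b_min < 1.0 < b_max:
        raise ValueError("requires b_min < 1 < b_max")
    weight_on_max = (1.0 - b_min) / (b_max - b_min)
    weight_on_min = (b_max - 1.0) / (b_max - b_min)
    return weight_on_min * f(b_min) + weight_on_max * f(b_max)


def chi2_generator(t):
    """Generator of the chi-square divergence."""
    return (t - 1.0) ** 2


def kl_generator(t):
    """Generator of the KL divergence."""
    return t * math.log(t)


eta = 0.1
b_min, b_max = eta / (1.0 - eta), (1.0 - eta) / eta  # balanced design, pi = 0.5
print(f_divergence_bound(chi2_generator, b_min, b_max))
print(f_divergence_bound(kl_generator, b_min, b_max))
```

For the χ² generator the two-point value simplifies to (1 − b min )(b max − 1), and for the KL generator in a balanced design it simplifies to (1 − 2η) log{(1 − η)/η}.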

B Proof of Theorem 1

B.1 Strict Overlap Implies Bounded Functional Discrepancy
The proof of Theorem 1 follows from several steps, each of which is of independent interest.
Here, we apply Theorem 2 to show that strict overlap implies an upper bound on functional discrepancies of the form |E P 0 {g(X 1:p )} − E P 1 {g(X 1:p )}| for any function g : R p → R that is measurable under P 0 and P 1 . This result plays a key role in the proof of Theorem 1, but is general enough to be of independent interest.
We establish this bound by applying Theorem 2 to the special case of the χ 2 -divergence

$$\chi^2(Q_1 \,\|\, Q_0) := E_{Q_0}\left[ \left( \frac{dQ_1}{dQ_0} - 1 \right)^2 \right].$$

Strict overlap implies the following bound on the χ 2 -divergence.
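A standard route from a χ²-divergence bound to a functional discrepancy bound is the Cauchy-Schwarz inequality: |E P 1 g − E P 0 g| = |E P 0 [(dP 1 /dP 0 − 1)(g − E P 0 g)]| ≤ {χ²(P 1 ‖ P 0 ) · Var P 0 (g)}^(1/2). The sketch below checks this inequality on small discrete distributions; it illustrates the general argument and may differ in detail from the steps used in the proof.

```python
import numpy as np


def chi2_divergence(p1, p0):
    """Chi-square divergence of P1 from P0 on a common finite support (p0 > 0)."""
    return float(np.sum((p1 - p0) ** 2 / p0))


def functional_discrepancy_check(p0, p1, g):
    """Return (|E_P1 g - E_P0 g|, Cauchy-Schwarz bound using Var under P0)."""
    mean_gap = abs(float(np.dot(p1 - p0, g)))
    variance_0 = float(np.dot(p0, g ** 2) - np.dot(p0, g) ** 2)
    bound = float(np.sqrt(chi2_divergence(p1, p0) * variance_0))
    return mean_gap, bound


p0 = np.array([0.2, 0.3, 0.5])
p1 = np.array([0.3, 0.3, 0.4])
g = np.array([1.0, 2.0, 4.0])
print(functional_discrepancy_check(p0, p1, g))
```

Replacing the χ²-divergence by its upper bound under strict overlap then yields a discrepancy bound that depends on g only through its variance.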

C Other implications of strict overlap
The decomposition in (B.4) can be used to construct additional upper bounds on the mean discrepancy in g using Hölder's inequality in combination with upper bounds on χ α -divergences (Vajda, 1973). These bounds give a tighter bound in terms of η, but are functions of higher-order moments of g(X 1:p ). Formally, χ α -divergences are a class of divergences that generalize the χ 2 -divergence (Vajda, 1973):

$$\chi^{\alpha}(P_1 \,\|\, P_0) := E_{P_0}\left[ \left| \frac{dP_1}{dP_0} - 1 \right|^{\alpha} \right], \qquad \alpha \ge 1.$$

The χ α divergence in the opposite direction is obtained by switching the roles of P 0 and P 1 .

D Operator Norm
The behavior of the bounds in Theorem 1 and Corollary 1 depends on the operator norm of the covariance matrix under P 0 and P 1 . Heuristically, this operator norm is large whenever there is high correlation between the covariates X 1:p under the corresponding probability measure. Thus, these bounds on mean imbalance become more restrictive as the dimension grows. Because all points in this discussion apply equally to Σ 0,1:p and Σ 1,1:p , we will refer to a generic covariance matrix Σ 1:p , which can be taken to be either Σ 0,1:p or Σ 1,1:p .
In this section, we give several examples of covariance structures and the behavior of their corresponding operator norm as p grows large. In the first two examples, the operator norm is of constant order; in the third example, the growth rate of the operator norm can vary from O(1) to O(p).
Example 4 (Stationary Covariance Case). When (X (k) ) k>0 is a stationary ergodic process with spectral density bounded by M , ‖Σ 1:p ‖ op ≤ M (Bickel & Levina, 2004). For example, when (X (k) ) k>0 is an MA(1) process with parameter θ, it has a banded covariance matrix so that all elements on the diagonal σ k,k = σ 2 and all elements on the first off-diagonal σ k,k±1 = θ. In this case, the spectral density is upper bounded by σ 2 (1 + θ) 2 /(2π), so the operator norm is O(1).
Example 5 (Restricted Rank Case). If (X (k) ) k>0 has component-wise variances given by σ k 2 and Σ 1:p has rank s p , then $\|\Sigma_{1:p}\|_{op} \ge s_p^{-1} \sum_{k=1}^{p} \sigma_k^2$, because the maximum eigenvalue of Σ 1:p must be at least as large as the average of its non-zero eigenvalues. Thus, if s p = s is constant in p and the component-wise variances are bounded away from 0 and ∞, the operator norm is O(p). In the special case where s = 1, the covariates are perfectly correlated. On the other hand, if s p is a non-decreasing function of p, then the operator norm grows as O(p/s p ).
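The rank argument can be verified numerically: the largest eigenvalue of a rank-s covariance matrix is at least the trace divided by s, and the trace is the sum of the component-wise variances. The sketch below builds a random low-rank covariance matrix (an arbitrary construction chosen for illustration) and compares the operator norm with this lower bound.

```python
import numpy as np


def low_rank_covariance(p, rank, seed=0):
    """Random p x p covariance matrix of the given rank (illustrative construction)."""
    rng = np.random.default_rng(seed)
    factors = rng.standard_normal((p, rank))
    return factors @ factors.T


def operator_norm_and_lower_bound(cov, rank):
    """Return (largest eigenvalue, trace / rank)."""
    op_norm = float(np.linalg.eigvalsh(cov)[-1])
    lower_bound = float(np.trace(cov)) / rank
    return op_norm, lower_bound


for p in (50, 200, 800):
    op_norm, lower_bound = operator_norm_and_lower_bound(low_rank_covariance(p, rank=3), rank=3)
    print(p, op_norm / p, lower_bound / p)
```

With the rank held fixed, both quantities grow in proportion to p in this construction, consistent with the linear scaling described in the example.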
Each example shows that if the covariates X 1:p are not too correlated, so that ‖Σ 1:p ‖ op = o(p), strict overlap implies that the mean absolute discrepancy in (5) converges to zero, and the covariate means approach balance, on average, as p grows large.