Confidence intervals for linear unbiased estimators under constrained dependence

Abstract: We propose an approach for conducting inference for linear unbiased estimators applied to dependent outcomes given constraints on their independence relations, in the form of a dependency graph. We establish the consistency of an oracle variance estimator when a dependency graph is known, along with an associated central limit theorem. We derive an integer linear program for finding an upper bound for the estimated variance when a dependency graph is unknown, but topological or degree-based constraints are available on one such graph. We develop alternative bounds, including a closed-form bound, under an additional homoskedasticity assumption. We establish a basis for Wald-type confidence intervals that are guaranteed to have asymptotically conservative coverage.


Introduction
Researchers often encounter dependent data, where the exact nature of that dependence is unknown, and they wish to make inferences about a feature of the outcome distribution. When the observed data consist of many independent clusters, with possibly dependent outcomes within clusters, standard approaches often remain unbiased and consistent or can be adapted to yield consistent estimators [e.g. 14]. When many independent clusters are not available, current methods typically assume either independence of unit outcomes, or that the dependency structure is known or directly estimable [7,24,5,22,19]. In many cases, however, researchers may only have limited information about the nature of dependence between units, or perhaps only the number of other units on which a given unit's outcome depends.
Dependency graphs [3] represent a set of non-independence relationships in a set of variables [see also bidirected graphical models [21] and marginal independence models [9]]. Vertices represent individual units and edges represent the possibility of probabilistic dependence. Dependency graphs are useful because they imply marginal non-independence relations in a set of variables, and the class of joint distributions compatible with any dependency graph is flexible. When researchers have partial knowledge of independence relations for a set of variables, it is often easier to incorporate that knowledge into a topological constraint on a dependency graph than to impose restrictions on the space of joint distributions for the variables directly.
In this paper, we study the class of linear estimators that remain unbiased even when outcomes are neither independent nor identically distributed, including the case of ordinary least squares regression coefficients. We develop a framework for constructing confidence intervals for such linear estimators when applied to dependent outcomes, where independence relationships are unknown or partially known, but subject to topological constraints on their dependency graph. We seek an upper bound for the estimated variance of the sum using upper bounds for the degrees of each unit in a dependency graph and a local dependence assumption. We show that this optimization problem can be expressed as an integer linear program for the elements of the adjacency matrix corresponding to a dependency graph. We show that this approach yields asymptotically conservative Wald-type confidence intervals under a normal approximation. We also derive computationally simple bounds, including a closed-form bound, when the random variables are assumed to be homoskedastic.

Setting
Consider a simple graph $G = (V, E)$ with no parallel edges or self-loops. Let $|V| = N$. Associated with each vertex $i \in V$ is a random variable $Y_i$, and $G$ characterizes probabilistic dependencies in the outcomes [e.g., 3].

Definition 1 (Dependency graph). $G$ is a dependency graph for $(Y_1, \ldots, Y_N)$ if, for any two disjoint subsets $S, T \subseteq V$ with no edge between them, the random vectors $(Y_i)_{i \in S}$ and $(Y_j)_{j \in T}$ are independent.

Suppose $G$ is a dependency graph and we observe a subset $V_S \subseteq V$, where $|V_S| = n$. Label these observed vertices $1, \ldots, n$, and label the unobserved vertices in $V \setminus V_S$ arbitrarily by $n+1, \ldots, N$. For each $i \in V_S$, we observe the outcomes $Y_1, \ldots, Y_n$, a fixed vector of coefficients $\theta = (\theta_1, \ldots, \theta_n)$, and the degrees $d_1, \ldots, d_n$ of the observed vertices in $G$.

Definition 2 (Induced subgraph). For a set of vertices $W \subseteq V$, the induced subgraph $G[W] = (W, E_W)$ consists of the vertices in $W$ and every edge in $E$ whose endpoints both lie in $W$.

Let $G_S = (V_S, E_S)$ be the induced subgraph of the observed vertices $V_S$. It follows that $G_S$ is also a dependency graph. Let $G_R = (V_S, E_R)$ be a subgraph of $G_S$, consisting of all the observed vertices in $V_S$ and a subset of the edges in $E_S$.

Assumption 1 (Observed data). We observe the random outcomes $Y_1, \ldots, Y_n$, the fixed degrees $d_1, \ldots, d_n$, the fixed coefficients $\theta_1, \ldots, \theta_n$, and the fixed recruitment graph $G_R$. We write $Z = (Y, d, G_R)$ for the observed data.
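As an illustrative sketch (not part of the original development), the following Python code simulates outcomes that respect a given dependency graph by assigning an independent noise term to each vertex and to each edge, so that two units are dependent if and only if they share an edge. The graph, coefficients, and noise distribution are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Dependency graph on n observed units: a path 0-1-2, an edge 3-4, unit 5 isolated.
edges = [(0, 1), (1, 2), (3, 4)]
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1

# Outcomes that respect the dependency graph: each unit gets its own noise
# plus one shared noise term per incident edge, so units are dependent if
# and only if they are joined by an edge.
Y = rng.normal(size=n)
for (i, j), eta in zip(edges, rng.normal(size=len(edges))):
    Y[i] += eta
    Y[j] += eta

theta = np.ones(n)             # fixed coefficients theta_i
d = A.sum(axis=1)              # degree of each unit in the dependency graph
beta_hat = np.mean(theta * Y)  # linear estimator: beta_hat = n^{-1} sum_i theta_i Y_i
```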

Definition 3 (Linear unbiased estimator (LUE)). An estimator $\hat\beta$ is a linear unbiased estimator (LUE) of $\beta$ if $\hat\beta = n^{-1} \sum_{i=1}^{n} \theta_i Y_i$ and $E[\hat\beta] = \beta$.
Many familiar linear estimators are unbiased in settings with outcomes that are neither independent nor identically distributed. For example, consider the linear regression model $E[Y_i] = \sum_{j=1}^{p} x_{ij} \beta_j$ for $i = 1, \ldots, n$ [e.g., 10, Ch. 1], where $x_i$ and $\beta$ have dimension $p \times 1$, with the coefficients $\beta$ estimated by ordinary least squares. Let $X$ be the $n \times p$ matrix of covariates, and let $Y$ be the $n \times 1$ vector of outcomes. The estimated coefficients are $\hat\beta = (X^\top X)^{-1} X^\top Y$.
Define the $p \times n$ matrix $\Theta = n (X^\top X)^{-1} X^\top$. Then for $j = 1, \ldots, p$ we can express the vector of coefficient estimates as $\hat\beta = n^{-1} \Theta Y$, or individually as $\hat\beta_j = n^{-1} \sum_{i=1}^{n} \Theta_{ji} Y_i$. Additional examples include Horvitz-Thompson-type estimators for a finite population total [11, 20], and the difference-in-means estimator under random assignment of a treatment [18].
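To make the correspondence concrete, the following sketch (with arbitrary simulated data) verifies numerically that $n^{-1} \Theta Y$ with $\Theta = n (X^\top X)^{-1} X^\top$ reproduces the ordinary least squares coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# OLS coefficients via the normal equations.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# The same estimator written in the paper's linear form:
# beta_hat = n^{-1} Theta @ Y with Theta = n (X'X)^{-1} X'.
Theta = n * np.linalg.solve(X.T @ X, X.T)
beta_linear = Theta @ Y / n

assert np.allclose(beta_ols, beta_linear)
```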
In what follows, we wish to conduct inference on $\beta$ given $Z$. We proceed by constructing conservative estimators of $\mathrm{var}(\hat\beta)$. We may use the square roots of these estimates as standard error estimators in order to construct Wald-type confidence intervals that are guaranteed to have asymptotic coverage for $\beta$ at greater than or equal to nominal levels.

Variance estimation
The observed subgraph $G_R$ may not reveal all the edges in $G_S$ that connect observed vertices. We consider a class of variance estimators that depend on knowledge of $G_S$, whose structure is represented by an $n \times n$ binary symmetric zero-diagonal adjacency matrix in which rows and columns are ordered by the indices $1, \ldots, n$ of the vertices in $V_S$. We now define some key concepts.

Definition 4 (Compatibility). The $n \times n$ binary symmetric adjacency matrix $A$ is compatible with the observed data $Z$, written $A \in \mathcal{A}(Z)$, if $A$ has zero diagonal, $A_R \preceq A$ element-wise (so that $A$ contains every edge of the recruitment graph $G_R$), and $\sum_{j=1}^{n} a_{ij} \le d_i$ for each $i = 1, \ldots, n$.

The last condition in Definition 4 requires that the degree of $i$ in the subgraph $G_S$ not be greater than its degree in the full graph $G$.
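A direct translation of this compatibility check, under the reconstruction of Definition 4 given above, might look as follows; the function name is ours.

```python
import numpy as np

def is_compatible(A, A_R, d):
    """Check whether adjacency matrix A is compatible with Z = (Y, d, G_R).

    A, A_R : (n, n) binary numpy arrays; d : length-n array of degree bounds.
    """
    A = np.asarray(A)
    return (
        np.array_equal(A, A.T)              # symmetric
        and np.all(np.diag(A) == 0)         # zero diagonal (no self-loops)
        and np.all(np.isin(A, [0, 1]))      # binary entries
        and np.all(A_R <= A)                # contains all recruitment-graph edges
        and np.all(A.sum(axis=1) <= d)      # degree of i at most d_i
    )
```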

Definition 5 (Oracle estimator). For a family of variance estimators $\{\hat V(A; Z) : A \in \mathcal{A}(Z)\}$, the oracle estimator is $\hat V(A_O; Z)$, where $A_O$ is the adjacency matrix of the true induced subgraph $G_S$.

The maximal compatible estimator provides a sharp upper bound for the oracle estimator because $A_O \in \mathcal{A}(Z)$, so that $\hat V(A_O; Z) \le \max_{A \in \mathcal{A}(Z)} \hat V(A; Z)$.

We now describe an asymptotic scaling, along with boundedness conditions for outcome values and unit degrees. We will primarily rely on restrictions on a dependency graph to obtain root-$n$ consistency, a central limit theorem, and convergence of the variance estimator. To this end, we describe an asymptotic regime in which the number of units (vertices in $G$) is increasing, but the maximal dependence between their outcomes is bounded.

Assumption 2 (Bounded outcomes and degrees). There exist finite, positive constants $c_1$ and $c_2$ such that $|Y_i| < c_1$ for all $i$, and the degree of each vertex in $G$ is at most $c_2$. Further assume there exists a finite, positive constant $c_3$ such that $\lim_{n \to \infty} n \, \mathrm{var}(\hat\beta) = c_3$ (nondegenerate limiting variance). Finally, assume $c_4$ is a finite, positive constant such that $|\theta_i| < c_4$ for all $i$.
We will proceed by deriving oracle estimators under two sets of nested assumptions. We establish their asymptotic properties, then derive feasible estimators that dominate the oracle estimators.

General case
We first consider the case where we impose no distributional assumptions on the outcomes beyond the boundedness conditions of Assumption 2. Let $\widehat{V}$ denote the plug-in sample covariance matrix of the outcomes, with $ij$th element $\hat v_{ij}$, and define the estimator
$$\hat V_1(A; Z) = n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( a_{ij} + \mathbf{1}\{i = j\} \right) \theta_i \theta_j \hat v_{ij},$$
which sums the weighted plug-in variances along with the weighted plug-in covariances for all pairs joined by an edge in $A$. The corresponding oracle estimator $\hat V_1(A_O; Z)$ is consistent.
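Under the reconstruction above, $\hat V_1(A; Z)$ is a weighted sum of covariance terms over the diagonal and the edges of $A$. A minimal sketch follows, with `V_hat` standing in for whatever plug-in covariance matrix is used; the function name is ours.

```python
import numpy as np

def V1_hat(A, theta, V_hat):
    """Variance estimator V_1(A; Z) = n^{-2} sum_{ij} (a_ij + 1{i=j}) theta_i theta_j v_ij.

    A      : (n, n) binary symmetric adjacency matrix with zero diagonal.
    theta  : length-n vector of fixed coefficients.
    V_hat  : (n, n) plug-in sample covariance matrix of the outcomes.
    """
    n = len(theta)
    weights = A + np.eye(n)  # covariances along edges, plus variances on the diagonal
    return (weights * np.outer(theta, theta) * V_hat).sum() / n**2
```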

Proposition 1. Under Assumption 2, for any $\varepsilon > 0$, $\lim_{n \to \infty} \Pr\left( \left| n \hat V_1(A_O; Z) - n \, \mathrm{var}(\hat\beta) \right| > \varepsilon \right) = 0$.

Proof. We follow the general proof strategy of Aronow and Samii [1] to show that when the true dependency structure $A_O$ is known and degrees in the dependency graph are bounded, the sum of covariances of outcomes does not grow too quickly as $n \to \infty$. We will establish mean square convergence of $n \hat V_1(A_O; Z)$ to $n \, \mathrm{var}(\hat\beta)$, allowing us to invoke Chebyshev's inequality to prove the proposition.
Asymptotic unbiasedness follows directly from linearity of expectations and $\mathrm{var}(\hat\beta) = O(n^{-1})$. To establish mean square convergence, we consider the variance of $n \hat V_1(A_O; Z)$, which by bilinearity of covariance expands into a sum of terms of the form $\xi_{ijkl} = \mathrm{cov}\!\left( \theta_i \theta_j \hat v_{ij}, \, \theta_k \theta_l \hat v_{kl} \right)$ over quadruples $(i, j, k, l)$ contributing to the estimator. We now examine the conditions under which $\xi_{ijkl} \neq 0$. Expanding the covariance, and applying root-$n$ consistency of means and Slutsky's theorem, as $n \to \infty$ expectations involving $\hat\beta$ factorize, so that $\xi_{ijkl}$ vanishes asymptotically unless the outcomes indexed by $(i, j)$ are dependent on those indexed by $(k, l)$. By Assumption 2, the degree of each vertex in $V_S$ is bounded by $c_2$, so this condition is satisfied by at most $4 n c_2^3$ terms in the summation, each of which is bounded under the boundedness conditions of Assumption 2. The remainder terms vanish, so $\mathrm{var}\!\left( n \hat V_1(A_O; Z) \right) \to 0$ and the result follows.

Proposition 1 is applicable to problems where a dependency graph is known, as it provides a basis for consistent variance estimation, generalizing results for special cases [7, 2]. We now address the case where the true subgraph $G_S$ is not known, but constraints on the graph are available.
Let $\mathcal{A}_1^m = \{A \in \mathcal{A}(Z) : \hat V_1(A; Z) \text{ is maximized}\}$ be the set of compatible adjacency matrices that maximize $\hat V_1(A; Z)$. Let $\Theta = \mathrm{diag}(\theta)$ be the $n \times n$ matrix with $\Theta_{ii} = \theta_i$ and $\Theta_{ij} = 0$ for $i \neq j$. We can find an element $A_m$ of $\mathcal{A}_1^m$ by solving the 0-1 integer linear program
$$\begin{aligned}
\underset{A}{\text{maximize}} \quad & \mathbf{1}^\top \left( \Theta \widehat{V} \Theta \circ A \right) \mathbf{1} \\
\text{subject to} \quad & A \mathbf{1} \preceq d, \quad A = A^\top, \quad \mathrm{diag}(A) = \mathbf{0}, \\
& a_{ij} \in \{0, 1\}, \quad A_R \preceq A,
\end{aligned} \tag{3.5}$$
where $\circ$ denotes the element-wise product, $A_R$ is the adjacency matrix of $G_R$, and $\preceq$ denotes the element-wise "less than or equal to" relation.
Since $A$ is an adjacency matrix, we can reduce the program and maximize over only the decision variables that correspond to the upper (or lower) triangular elements of $A$. Let $\hat v_{ij}$ be the $ij$th element of the sample covariance matrix, for $i = 1, \ldots, n$ and $j = 1, \ldots, n$. Since the sample covariance matrix is symmetric, we can focus on its upper triangular part and use the decision variable $a_{ij} = 1$ if the edge $\{i, j\}$ is included and $a_{ij} = 0$ otherwise, for each $i < j$. In terms of these decision variables, the degree constraints become $\sum_{j \neq i} a_{ij} \le d_i$, where $d_i$ is the degree of $i$, and the program is further simplified by the constraint $A_R \preceq A$, which fixes some of the decision variables $a_{ij}$ equal to one. The resulting program has at most $n(n-1)/2$ decision variables, and in general it is a multidimensional knapsack problem [12]; see Chapter 9 of Kellerer et al. [13] for an overview. The program (3.5) is NP-hard, but it admits a polynomial time approximation scheme (PTAS). The running time of a typical PTAS depends heavily on the size of the problem and can be very high (see, e.g., Section 9.4.2 of Kellerer et al. [13]). In spite of this, in standard practice, for example with 1000 observations or fewer, problem (3.5) can be solved in a few seconds with modern optimization solvers such as Gurobi. To obtain a solution within a provably small optimality gap, these solvers use a variety of techniques, including: linear programming relaxations and branch-and-bound procedures to reduce the set of feasible solutions; presolve routines applied prior to the branch-and-bound procedures to reduce the size of the problem; cutting-plane methods to remove fractional solutions and tighten the formulation; and a collection of heuristics to find good incumbent solutions during branch-and-bound [4, 15, 17].
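The text solves (3.5) with a commercial solver such as Gurobi. As an illustration, the open-source PuLP modeler (with its default CBC solver) can express the same program; the objective and constraints below follow the reconstruction of (3.5) above, and all data are simulated placeholders.

```python
import numpy as np
import pulp

rng = np.random.default_rng(2)
n = 8
theta = rng.normal(size=n)
V_hat = np.cov(rng.normal(size=(n, 100)))  # placeholder plug-in covariance matrix
d = np.full(n, 3)                          # degree bounds d_i
A_R = np.zeros((n, n), dtype=int)          # recruitment graph (here: no known edges)

prob = pulp.LpProblem("max_V1", pulp.LpMaximize)
# One binary decision variable per upper-triangular entry of A.
a = {(i, j): pulp.LpVariable(f"a_{i}_{j}", cat="Binary")
     for i in range(n) for j in range(i + 1, n)}

# Objective: sum of theta_i theta_j v_ij over included edges (both triangles).
prob += pulp.lpSum(2 * theta[i] * theta[j] * V_hat[i, j] * a[i, j]
                   for (i, j) in a)

# Degree constraints: sum_{j != i} a_ij <= d_i for each vertex i.
for i in range(n):
    prob += pulp.lpSum(a[min(i, j), max(i, j)]
                       for j in range(n) if j != i) <= d[i]

# Recruitment-graph constraint A_R <= A: known edges must be included.
for (i, j) in a:
    if A_R[i, j] == 1:
        prob += a[i, j] == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
A_m = np.zeros((n, n), dtype=int)
for (i, j), var in a.items():
    A_m[i, j] = A_m[j, i] = int(var.value())
```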
While the true adjacency matrix $A_O$ is not known, an element $A_m \in \mathcal{A}_1^m$ produces a variance estimate $\hat V_1(A_m; Z)$ that is at least as large as the oracle estimator $\hat V_1(A_O; Z)$. As $n$ grows large, the variance estimate $\hat V_1(A_m; Z)$ is conservative (Corollary 1): for any $\varepsilon > 0$, the probability that $n \hat V_1(A_m; Z)$ underestimates $n \, \mathrm{var}(\hat\beta)$ by more than $\varepsilon$ tends to zero. The proof is given in the Appendix. Corollary 1 does not imply consistency of $\hat V_1(A_m; Z)$ as an estimator of $\mathrm{var}(\hat\beta)$, nor does it imply that the estimator converges to any particular limiting value. Rather, we have established that, for large $n$, its value will tend to be at least as large as the true variance.

Alternative bounds under homoskedasticity
When all variances are equal, we can obtain an alternative closed-form bound that is computationally simpler and less sensitive to between-sample variability in the empirical variance-covariance matrix. This estimator depends essentially only on the estimated variance of unit outcomes and the maximum number of edges in a dependency graph.

Assumption 3 (Homoskedasticity). $\mathrm{var}(\theta_i Y_i) = \sigma_\theta^2$ for all $i = 1, \ldots, n$.
Under homoskedasticity, the general estimator $\hat V_1(A_m; Z)$ developed in Section 3.1 provides a conservative variance estimate. A bound that is relatively simple to compute can be derived by noting that when $\mathrm{var}(\theta_i Y_i) = \sigma_\theta^2$ for all $i$, each covariance term is bounded above by $\sigma_\theta^2$, since every pairwise correlation is at most one. Letting $\hat\sigma_\theta^2$ denote the plug-in estimate of $\sigma_\theta^2$, define
$$\hat V_2(A; Z) = n^{-2} \, \hat\sigma_\theta^2 \left( n + \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} \right). \tag{3.6}$$

Proposition 2. The oracle estimator $\hat V_2(A_O; Z)$ need not be consistent, though it is asymptotically conservative.

Proof. We first define an alternative oracle estimator which presumes knowledge of the pairwise correlations; replacing each correlation by its upper bound of one can only increase the estimate. Multiplying by $n$, and arguing as in the proof of Proposition 1, $\hat\sigma_\theta^2$ converges in mean square, and the result follows.

As before, we can maximize the estimator $\hat V_2(A; Z)$ over the family of compatible dependency graphs. Define $\mathcal{A}_2^m = \{A \in \mathcal{A}(Z) : \hat V_2(A; Z) \text{ is maximized}\}$, and let $A_m \in \mathcal{A}_2^m$. To find an element of $\mathcal{A}_2^m$, we solve the 0-1 integer linear program
$$\begin{aligned}
\underset{A}{\text{maximize}} \quad & \mathbf{1}^\top A \mathbf{1} \\
\text{subject to} \quad & A \mathbf{1} \preceq d, \quad A = A^\top, \quad \mathrm{diag}(A) = \mathbf{0}, \\
& a_{ij} \in \{0, 1\}, \quad A_R \preceq A,
\end{aligned} \tag{3.8}$$
where again $A$ is an arbitrary 0-1 adjacency matrix and $A_R$ is the adjacency matrix of $G_R$. Program (3.8) amounts to setting $\hat v_{ij} = 1$ for every $i = 1, \ldots, n$ and $j = 1, \ldots, n$ in the objective that yields (3.6). Note that finding the solution to this problem does not depend on the empirical variance-covariance matrix; the variability of the estimator $\hat V_2(A_m; Z)$ is purely attributable to estimation error in $\hat\sigma_\theta^2$.

Since $\hat V_2(A; Z)$ relies only on the number of positive entries in $A$, we can derive a looser closed-form upper bound by considering the maximum number of edges that can be in $E_S$. For $i \in V_S$, let $\bar d_i = \min\{d_i, n - 1\}$ be the degree of $i$ in $G$, truncated at $n - 1$. Let
$$\hat V_3(Z) = n^{-2} \, \hat\sigma_\theta^2 \left( n + \sum_{i=1}^{n} \bar d_i \right). \tag{3.9}$$
The estimator (3.9) does not depend on any particular compatible adjacency matrix. The proof follows from Lemma 1 and the same reasoning employed in the proof of Corollary 1.
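Under the reconstruction of (3.9) above, and assuming $\hat\sigma_\theta^2$ is taken to be the plug-in sample variance of the terms $\theta_i Y_i$, the closed-form bound reduces to a few lines; the function name and the form of `sigma2_theta` are ours.

```python
import numpy as np

def V3_hat(Y, theta, d):
    """Closed-form conservative variance bound under homoskedasticity.

    Computes V_3(Z) = n^{-2} * sigma2_theta * (n + sum_i min(d_i, n - 1)),
    following the reconstruction of equation (3.9) in the text.
    """
    n = len(Y)
    sigma2_theta = np.var(theta * Y, ddof=1)  # plug-in estimate of var(theta_i Y_i)
    d_bar = np.minimum(d, n - 1)              # degrees truncated at n - 1
    return sigma2_theta * (n + d_bar.sum()) / n**2
```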

Wald-type confidence intervals
We now prove that our variance estimates can be used to form valid Wald-type confidence intervals about $\beta$. First, we establish a central limit theorem for $\hat\beta$ given our asymptotic scaling. Lemma 2, a standard result in applying Stein's method to the setting of local dependence, has been proven by, e.g., Theorem 2.7 of Chen et al. [6]. Similarly, we reiterate in Lemma 3 the well-known basis for Wald-type confidence intervals. Proposition 3 then establishes that Wald-type intervals formed with any asymptotically conservative variance estimator attain at least nominal asymptotic coverage.

Proof. Define a random variable $U$ such that $U = \hat V(A; Z)$ if $\hat V(A; Z) \le \mathrm{var}(\hat\beta)$, and $U = \mathrm{var}(\hat\beta)$ otherwise. Then $\lim_{n \to \infty} \Pr\left( |nU - n \, \mathrm{var}(\hat\beta)| > \varepsilon \right) = 0$, and by Lemma 3, Wald-type confidence intervals formed with $U$ as a variance estimate will have at least nominal coverage. Across every sample realization, $\hat V(A; Z) \ge U$, and thus the coverage of Wald-type confidence intervals using $\hat V(A; Z)$ will also be at least the nominal level.
It follows that Wald-type confidence intervals constructed using the conservative variance estimators derived in Section 3 yield conservative asymptotic coverage.

Corollary 3. Given Assumption 2, confidence intervals formed as $\hat\beta \pm z_{1 - \alpha/2} \, \hat V_1(A_m; Z)^{1/2}$ have asymptotic coverage of $\beta$ at a level greater than or equal to $1 - \alpha$.

Corollary 4. Given Assumptions 2 and 3, confidence intervals formed as $\hat\beta \pm z_{1 - \alpha/2} \, \hat V_2(A_m; Z)^{1/2}$ have asymptotic coverage of $\beta$ at a level greater than or equal to $1 - \alpha$.

Proofs for Corollaries 3 and 4 follow directly from Corollaries 1 and 2 and Proposition 3. Upper bounds for the variance estimates can be obtained by solving a relaxed form of the programs (3.5) and (3.8). By Proposition 3, using such upper bounds as a basis for conservative inference will also yield valid confidence intervals. In practice, the estimates obtained by modern optimization solvers, which come with a provably small optimality gap, will be tighter, and thus in problems of moderate size will typically be preferable.
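Putting the pieces together, a Wald-type interval based on any of the conservative variance estimates takes the usual form. This sketch uses scipy's normal quantile and assumes `beta_hat` and `V_max` (a maximized variance estimate, e.g. from program (3.5) or the bound (3.9)) have already been computed.

```python
from scipy.stats import norm

def wald_ci(beta_hat, V_max, alpha=0.05):
    """Wald-type confidence interval beta_hat +/- z_{1-alpha/2} * sqrt(V_max).

    V_max is a conservative variance estimate, e.g. V1_hat evaluated at the
    maximizer A_m of the integer program (3.5), or the closed-form bound V3_hat.
    """
    z = norm.ppf(1 - alpha / 2)
    half_width = z * V_max ** 0.5
    return beta_hat - half_width, beta_hat + half_width
```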

Discussion
We have developed conservative estimators for the variance of a linear unbiased estimator under partial observation of a dependency graph and assumptions about the variance of individual outcomes. The variance estimation setting we address here can accommodate a wide variety of dependency and observation assumptions. For example, Assumption 1, which states that we observe $Z = (Y, d, G_R)$, can be weakened when $G_R$ is completely unknown. In this case the constraint in the integer linear programs (3.5) and (3.8) becomes $\mathbf{0} \preceq A$, where $\mathbf{0}$ is the $n \times n$ matrix of all zeros; this constraint is met by all adjacency matrices $A$, so it becomes superfluous. Alternatively, we may not have full knowledge of the degrees $d = (d_1, \ldots, d_n)$, and instead have only an upper bound $d_i^*$ for each $d_i$, or a global upper bound $d_i \le d^*$ for all $i = 1, \ldots, n$. Conservative variance estimation in both of these cases can be achieved (by substituting $d_i^*$ or $d^*$ for $d_i$) with no change to the programs (3.5) and (3.8) or to the asymptotic results given here. When no information about $G_R$ or the degrees $d$ is available, setting every $d_i = d^* = n - 1$ delivers a maximally conservative upper bound.
We note here three extensions. First, it is likely possible to extend our results to obtain confidence intervals more generally for asymptotically linear estimators, using an empirical analogue of the variance of the influence function as the objective function. Second, a generalization of our results may facilitate conservative inference for causal estimands under interference between units [e.g., 23, 16], given interference that can be characterized by a constrained dependency graph. Finally, the general logic of our approach, maximizing an estimator over a space of graphical structural relations [e.g., 8], may profitably be used to obtain conservative interval (or region) estimates for network functionals, including with alternative representations of constraints on the dependency structure (e.g., restrictions compatible with Markov random field-type assumptions). Furthermore, extensions of our approach need not require that constraints involving the adjacency matrix $A$ be linear. More generally, if a constraint of the form $f(A) \le 0$ can be accommodated and the resulting program solved, it is straightforward to conceptualize a broader class of inferential procedures involving network-topological constraints beyond maximum degree, e.g., constraints on triangles, diameter, or clustering.