Detecting Treatment Interference under the K-Nearest-Neighbors Interference Model

We propose a model of treatment interference where the response of a unit depends only on its treatment status and the statuses of units within its K-neighborhood. Current methods for detecting interference include carefully designed randomized experiments and conditional randomization tests on a set of focal units. We give guidance on how to choose focal units under this model of interference. We then conduct a simulation study to evaluate the efficacy of existing methods for detecting network interference. We show that this choice of focal units leads to powerful tests of treatment interference which outperform current experimental methods.


Introduction
Randomized experiments have long been viewed as the gold standard for causal inference [1]. In epidemiology, researchers may want to study the effect of vaccines on a target population to protect individuals who are at risk of an infectious disease [2]. Technology companies such as Google, Amazon, Facebook, LinkedIn, Netflix, and Twitter run online randomized controlled experiments to evaluate the effect of a new feature or product on user engagement [3,4,5]. However, in such settings, units under study may interact with each other; for example, a user assigned a new feature may interact with one not assigned the feature, thereby impacting the response of the latter user. This interaction poses challenges in estimating and inferring treatment effects under traditional causal inference methodologies [6].
In particular, a fundamental assumption in the traditional causal inference framework is that there is only a single version of each treatment status and that the response of a unit is unaffected by the treatment status of any other unit (see Imbens and Rubin [1] for a review). This is known as the stable unit treatment value assumption (SUTVA) [7]. SUTVA is violated in settings with treatment interference, that is, when a treatment assigned to one unit affects the response of other units. Effects on the response due to treatment interference are also known as spillover, peer influence, social interaction, or network effects.
The dependence of a unit's outcome on other units' exposures or treatments poses statistical challenges because the potential outcome of a unit (the hypothetical outcome of that unit given a realized treatment assignment) is affected not only by its own treatment status but also by the treatment conditions received by other units. In some settings, interference can be considered a nuisance parameter, and experiments may be designed to mitigate this interference, thereby reducing the bias in treatment effect estimates [8]. Although these designs may minimize the effect of interference, they are not always possible. In other settings, estimating the causal effect in the presence of interference is itself of interest. Examples include studies on the efficacy of vaccines in which vaccinated and non-vaccinated members of a population interact with each other and researchers are interested in the overall infection rates. Under these latter settings, considerable work has been devoted to the development of reasonable models of interference in order to ensure identification of both the direct effect of treatment and the effect of treatment spillover on the response [9,10,11,12,13].
In this paper, we introduce a model of treatment interference called the K-nearest neighbors interference model (KNNIM). Under KNNIM, the response of a unit is affected only by the treatment given to that unit and the treatment statuses of its K nearest neighbors (KNN). Such models of interference may be reasonable, for example, in social network settings, where only a few of the observable potential interactions (e.g., accounts that a Twitter user follows) may be influential on a unit's response, and the strength of interaction may be measured by the amount of engagement between users.
We then perform a simulation study to determine how existing methods, and one newly developed method, for detecting treatment interference perform on data generated under KNNIM. While these methods were originally developed to detect arbitrary interference [14,15,16,4,5], it is reasonable to assume that their efficacy may vary depending on the structure of interference. However, little work has been done to assess how these methods perform under various interference models. We repeatedly simulate data under KNNIM and apply these methods to the simulated data. We then assess the power of these methods to detect treatment interference when it is present and their likelihood of concluding that interference is insignificant when it is absent. Results suggest that methods which incorporate structured selection of focal units [14,15] tend to perform reasonably well on this type of data. We then apply the existing methods to a study on the efficacy of an anti-conflict intervention in schools to determine their ability to detect interference on a real dataset.
The rest of this paper is organized as follows. A motivating example is provided in Subsection 1.1. An overview of causal inference under interference is presented in Section 2. KNNIM is introduced in Section 3. Applying conditional randomization tests for detecting interference is discussed in Section 4. An algorithm for selecting the focal units under KNNIM is provided in Section 5. Section 6 gives a summary of current methods for detecting interference. Our proposed test statistic for detecting interference under KNNIM is given in Section 7. Section 8 evaluates current methods as well as our test under KNNIM through a simulation study. The application of our method to our motivating example is given in Section 9. Section 10 concludes.

Motivating Example: An Anti-Conflict Program in New Jersey Schools
To motivate our approach, we refer to a recent randomized field experiment assessing the efficacy of an anti-conflict intervention aimed at reducing conflict among middle school students in 56 schools in New Jersey [17]. In particular, the experiment was explicitly designed to determine whether benefits of the program can be propagated through social interactions between students. The intervention was administered through "seed" students, those selected to actively participate in and advocate for the anti-conflict program. These students attended meetings with the program staff every two weeks to address conflict behaviors in their schools and to talk about strategies to mitigate peer conflict. Additionally, seed students were encouraged to publicly express their opposition to conflict in their school (for example, by identifying a common conflict in their school and creating a hashtag about it) and were also asked to distribute orange wristbands with the intervention logo to students who demonstrated anti-conflict attitudes.
Seed students were randomly assigned as follows. First, within each of the 56 schools, between 40 and 64 students were identified as being eligible to be seed students. Then, from the 56 schools in the study, 28 schools were randomly assigned to receive the anti-conflict program. Finally, within each of these assigned schools, half of the eligible students were selected to be seed students. Analysis was performed only on students that were eligible to be seeds (N = 2,451).
Of particular note, to assess potential pathways for treatment interference, students were asked to identify, in order, the 10 other students that they spent the most time with during the previous few weeks. Specifically, the survey asks the following question: "In the last few weeks I decided to spend time with these students at my school: (in school, out of school, or online) - Number 1 is for the person you spent most time with, then number 2, then number 3... You don't have to fill in all the lines! To make it easier, you can write down their initials here, then find their number. It can be boys and girls!" [17]. Students' responses to this question may include both seed and non-seed students. This yields a unique dataset in which the strength of the interaction between two individuals under study is explicitly recorded. Hence, statistical analyses may benefit from an interference model, such as KNNIM, that allows for direct incorporation of the relative strengths of the interactions. For this dataset, KNNIM models with K up to 10 may be applicable.
An analysis performed by Aronow and Samii [9] estimated the indirect effect of being a seed student on wearing an orange wristband to be about 0.15, with a 95% confidence interval between about 8 and 23 percentage points. That is, students exposed to treated peers were about 15 percentage points more likely to report wearing an orange wristband in comparison to students in control schools.

Background and Related Work
The Neyman-Rubin Causal Model (NRCM) is a popular model of response in causal inference [18,1,7,19]. Consider a simple experiment on $N$ units, numbered $1, \ldots, N$, where each unit is given either a treatment or a control condition. The NRCM assumes that the response of unit $i$, denoted $Y_i$, follows the model
$$Y_i = y_i(1) W_i + y_i(0) (1 - W_i).$$
Here, $y_i(W_i)$ is the potential outcome under treatment status $W_i \in \{0, 1\}$, that is, the hypothetical response of unit $i$ had that unit received treatment status $W_i$, and $W_i$ is a treatment indicator: $W_i = 1$ if unit $i$ receives treatment and $W_i = 0$ if unit $i$ receives control. Inherent in this model is the no-interference assumption, or stable unit treatment value assumption (SUTVA). This assumption states that there is only a single version of each treatment status and that a unit's outcome is affected only by its own treatment status and not by the treatment status of any other unit [20,7].
In many settings, SUTVA is not plausible, and considerable work has been performed on analyzing causal effects when SUTVA is violated. Sobel [6] showed that violating SUTVA can lead to wrong conclusions about the effectiveness of the treatment of interest. Forastiere et al. [10] derive bias formulas for the treatment effect when SUTVA is wrongly assumed and show that the bias due to the presence of interference is proportional to the level of interference and to the relationship between the individual and the neighborhood treatments.
When interference is present, the effect of a treatment on a unit's response may occur directly through application of the treatment to that unit, indirectly through application of treatment to units that interact with the original unit, or both [2]. We can extend the potential outcomes framework to account for both direct and indirect treatment components. Let $y_i(\mathbf{W}) = y_i(W_i, \mathbf{W}_{-i})$ denote the potential outcome of unit $i$ under treatment allocation $\mathbf{W} \in \{0, 1\}^N$, where unit $i$ is given treatment $W_i$ and the remaining treatment statuses are allocated according to $\mathbf{W}_{-i}$. Responses $Y_i$ satisfy
$$Y_i = \sum_{\mathbf{w} \in \{0,1\}^N} y_i(\mathbf{w}) \, \mathbb{1}(\mathbf{W} = \mathbf{w}),$$
where $\mathbb{1}(\mathbf{W} = \mathbf{w})$ is an indicator variable that is equal to 1 if and only if the observed treatment status $\mathbf{W}$ is equal to the hypothetical treatment status $\mathbf{w}$.
The average direct effect $\tau_{dir}$ is the average difference in a unit's potential outcomes when changing that unit's treatment status and holding all other units' treatment statuses fixed. It may be defined as
$$\tau_{dir} = \frac{1}{N} \sum_{i=1}^{N} \left[ y_i(1, \mathbf{W}_{-i} = \mathbf{1}) - y_i(0, \mathbf{W}_{-i} = \mathbf{1}) \right], \tag{1}$$
where $\mathbf{1}$ denotes a vector of all 1's. In contrast to the direct effect, the average indirect effect $\tau_{ind}$ is defined as the average difference in a unit's potential outcome when changing all other treatment statuses from control to treated, holding its own treatment fixed. It may be defined as
$$\tau_{ind} = \frac{1}{N} \sum_{i=1}^{N} \left[ y_i(0, \mathbf{W}_{-i} = \mathbf{1}) - y_i(0, \mathbf{W}_{-i} = \mathbf{0}) \right], \tag{2}$$
where $\mathbf{0}$ denotes a vector of all 0's. The average total effect $\tau_{tot}$ measures the average difference in potential outcomes between all units receiving treatment and all units receiving control:
$$\tau_{tot} = \frac{1}{N} \sum_{i=1}^{N} \left[ y_i(\mathbf{1}) - y_i(\mathbf{0}) \right].$$
Summing (1) and (2) yields the expression
$$\tau_{tot} = \tau_{dir} + \tau_{ind}. \tag{3}$$
Alternatively, the quantities $\tau_{dir}$ and $\tau_{ind}$ may be defined respectively as
$$\tau_{dir} = \frac{1}{N} \sum_{i=1}^{N} \left[ y_i(1, \mathbf{W}_{-i} = \mathbf{0}) - y_i(0, \mathbf{W}_{-i} = \mathbf{0}) \right], \qquad \tau_{ind} = \frac{1}{N} \sum_{i=1}^{N} \left[ y_i(1, \mathbf{W}_{-i} = \mathbf{1}) - y_i(1, \mathbf{W}_{-i} = \mathbf{0}) \right],$$
while still ensuring that (3) holds. These quantities may differ from (1) and (2) if there is interaction between direct effects and indirect effects, that is, if the differences $y_i(1, \mathbf{W}_{-i}) - y_i(0, \mathbf{W}_{-i})$ differ depending on the allocation of treatment given to $\mathbf{W}_{-i}$. Moreover, direct effects may be defined for each possible allocation $\mathbf{W}_{-i}$; however, such definitions may prevent a decomposition of the total effect into direct and indirect effects [2]. Finally, when SUTVA holds, $\tau_{tot} = \tau_{dir}$ and $\tau_{ind} = 0$.
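The decomposition of the total effect can be checked numerically on a toy population. The sketch below is illustrative only: it assumes the direct effect is evaluated with all other units treated and the indirect effect with the unit's own status held at control, and the function names and the additive outcome model `toy_outcome` are hypothetical, not part of the paper.

```python
def average_effects(potential_outcome, n_units):
    """Return (tau_dir, tau_ind, tau_tot) for a population of n_units.

    potential_outcome(i, w) is a hypothetical function returning y_i(w)
    for a full assignment tuple w in {0, 1}^n_units.
    """
    all_treated = (1,) * n_units
    all_control = (0,) * n_units
    tau_dir = tau_ind = tau_tot = 0.0
    for i in range(n_units):
        # Unit i under control while every other unit is treated.
        i_control_rest_treated = all_treated[:i] + (0,) + all_treated[i + 1:]
        y_11 = potential_outcome(i, all_treated)
        y_01 = potential_outcome(i, i_control_rest_treated)
        y_00 = potential_outcome(i, all_control)
        tau_dir += (y_11 - y_01) / n_units
        tau_ind += (y_01 - y_00) / n_units
        tau_tot += (y_11 - y_00) / n_units
    return tau_dir, tau_ind, tau_tot


def toy_outcome(i, w):
    # Assumed toy model: baseline 1, direct effect 2, plus 0.5 per treated other unit.
    others_treated = sum(w) - w[i]
    return 1.0 + 2.0 * w[i] + 0.5 * others_treated


tau_dir, tau_ind, tau_tot = average_effects(toy_outcome, n_units=4)
print(tau_dir, tau_ind, tau_tot)
```

Because the toy model is additive, there is no interaction between direct and indirect effects, so the alternative definitions would give the same values here.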
There are a variety of strategies for designing and analyzing experiments under treatment interference. One approach is to view interference as a nuisance parameter and to reduce the effect of treatment interference on causal estimates through effective experimental design. This line of work aims to use available information on potential interaction of units to design an experiment that mitigates the effect of this interaction. Often, this is done by forming clusters with high within-cluster interaction and randomizing treatment across clusters rather than individual units [8,3,21]. However, knowledge of the interaction network may not be necessary to make progress on this problem: Sävje et al. [22] investigate methods for consistent estimation of treatment effects when the structure of interference is unknown. Treating interference as a nuisance may not be ideal when indirect effects are of interest to the researcher.
Rather than considering interference as a nuisance, some researchers relax SUTVA and allow for different models of interference, treating the interference effect as of primary interest. One significant example involves experiments on the efficacy of vaccines, where the likelihood of a person contracting an infectious disease depends on others in the same population who are vaccinated [23,2,24]. Under this setting, interference is allowed within groups but not across groups. This is referred to as a partial interference assumption [6], i.e., SUTVA is assumed between groups [25,2,26,27,6,28].
A similar approach to partial interference assumes that treatment interference on a unit can only occur within a small closed neighborhood of that unit [12]; the K-nearest-neighbors interference model (KNNIM) introduced in this paper is a variant of this setting. Another common approach is to assume that the treatment condition can only "spill over" and affect the response of a control unit if a certain number or fraction of potential interactors of that unit receive treatment [3,13]. Finally, in the least restrictive setting, Aronow and Samii [9] consider the use of Horvitz-Thompson estimators for estimating treatment effects under arbitrary forms of interference.
Another research direction focuses on the development of hypothesis tests to detect the presence of treatment interference in an experiment. Aronow [14] introduces a framework for conditional randomization tests for detecting treatment interference. Athey et al. [15] extend this approach to develop tests for more general forms of treatment interference. Basse et al. [16] build on this work and consider the validity of the test by conditioning on the observed treatment assignment of the subset of units who received an exposure of interest. Saveski et al. [5] and Pouget-Abadie et al. [4] develop an experimental framework to simultaneously estimate treatment effects and test whether treatment interference is present within an experiment.

K-Nearest Neighbors Interference Model
To obtain meaningful estimates and inferences on treatment effects under interference, interference models often assume some kind of structure restricting how interference can propagate across units. Otherwise, if a model allows for arbitrary interference, each unit will have a unique type of exposure depending on the treatment assignment for all $N$ individuals. This results in $2^N$ distinct potential outcomes for each unit and $N 2^N$ potential outcomes for the experimental population in total. However, we only observe $N$ of these potential outcomes, and many causal quantities of interest will be unidentifiable under arbitrary interference.
Thus, the assumptions that researchers make about interference often lie strictly between assuming SUTVA and assuming arbitrary interference, and often greatly reduce the number of potential outcomes for each unit [9,12,13,21]. Many of these models specify that the units' outcomes are affected by the number or fraction of treated neighbors, but do not specify which neighbors impact unit response and how they affect the response.
We now propose an interference model, the K-nearest-neighbors interference model (KNNIM), where the treatment status of a unit $j$ can affect the response of a unit $i$ only if $j$ is one of $i$'s K-nearest neighbors. This model allows for neighbors of $i$ to contribute differing effects on the response of $i$ depending on the proximity of their relationship: neighbors that are "closer" to unit $i$ may have a larger influence on the response of $i$. Additionally, this model restricts the number of potential outcomes to be $2^{K+1}$ for each unit.

Interaction Measure
We begin formally introducing KNNIM by defining an interaction measure $d(i, j)$ that measures how strongly unit $i$ associates with unit $j$. This measure does not necessarily need to be computed across every pair of units $(i, j)$; however, we assume that, for each unit $i$, at least $K$ values of $d(i, j)$ with $j \neq i$ can be computed. Here, $d(i, j)$ may be measured explicitly. For example, Section 1.1 describes a study in which respondents rank up to 10 students, from 1 to 10, where 1 denotes the closest connection, 2 denotes the second closest connection, and so on [17]. Alternatively, $d(i, j)$ may combine several interaction measures to form a proxy for overall interaction. For example, an experiment on a social network may define $d(i, j)$ to be an index variable aggregating the number of comments, likes, and other forms of engagement performed by user $i$ and directed towards user $j$. Smaller values of $d(i, j)$ may correspond to stronger or weaker interactions from $i$ towards $j$ depending on researcher preference. In this paper, we assume smaller values correspond to stronger interactions.
Of particular note, the interaction measure is allowed to be asymmetric; that is, $d(i, j)$ and $d(j, i)$ may differ. Such a property may be necessary if one user strongly influences another user, but not vice versa. A common instance of this involves social media moguls; a mogul $i$ may induce strong engagement from millions of followers $j$, but may interact sparingly with the vast majority of these followers. This would suggest that followers of the mogul may be strongly impacted by an intervention given to the mogul (indicated by a small value of $d(j, i)$), but the mogul's behavior may not be altered by their followers (indicated by a large value of $d(i, j)$).
Additionally, the same absolute value of $d(i, j)$ may be interpreted differently across users. For example, suppose that $d(i, j)$ is an index variable for engagement on a social media platform. If two users $i$ and $i'$ interact with the same user $j$ in identical ways, we may have $d(i, j) = d(i', j)$. However, if $i$ engages with the platform often and $i'$ does so sparingly, then $d(i, j)$ may be relatively large for user $i$ (that is, $i$ may interact even more with close users $j^*$, leading to smaller values of $d(i, j^*)$), but $d(i', j)$ may be relatively small for user $i'$.

Remarks
Note that, when we define our interaction measure $d(i, j)$, we assume that these interactions can be measured precisely and without error. This assumption may be reasonable in certain settings (for example, the motivating example in Section 1.1) but may be unlikely to hold in others. For example, although a social network may have an error-free record of interactions between users, and thus it may be possible to exactly determine $d(i, j)$ on that network, an external observer of the network may only have a small fraction of these observations to determine the strength of interactions between users. Moreover, even in the presence of perfect information, useful estimates and inferences still require careful selection of $d(i, j)$ to ensure it accurately measures the strength of the interaction between users. Settings under which these interactions are measured with error have been previously considered [22,29]; such a consideration is outside of the scope of this paper but may be an area of further research.
Additionally, previous work on treatment interference has considered models where the interaction is determined by the absolute value of $d(i, j)$, rather than its value relative to $d(i, j^*)$ for other units $j^*$ [29]. While such a model may be plausible in certain settings, the aforementioned examples suggest scenarios for which a model that relies on the relative value of $d(i, j)$ rather than its absolute value may be more appropriate.

K-Neighborhood Interference Assumption
Let $d(i, (j))$ denote the $j$th smallest value of $\{d(i, j^*) : j^* \neq i\}$. For ease of exposition, we assume that all values of $d(i, j)$ are unique (in practice, ties may be broken arbitrarily). The K-neighborhood of unit $i$, denoted $\mathcal{N}_{iK}$, is the set of the $K$ "closest" units to unit $i$:
$$\mathcal{N}_{iK} = \{ j \neq i : d(i, j) \le d(i, (K)) \}.$$
Define $\mathcal{N}_{-iK} = \{1, \ldots, N\} \setminus (\{i\} \cup \mathcal{N}_{iK})$ as the set of units that are outside of $i$'s K-neighborhood. Note that the sets $\{i\}$, $\mathcal{N}_{iK}$, and $\mathcal{N}_{-iK}$ form a partition of the $N$ units.
Recall that $W_i$ is a treatment indicator for unit $i$, and let $\mathbf{W} = (W_1, W_2, \ldots, W_N) = \{W_i, \mathbf{W}_{\mathcal{N}_{iK}}, \mathbf{W}_{\mathcal{N}_{-iK}}\}$ denote the vector of treatment assignments given to all $N$ units. Additionally, recall that $y_i(\mathbf{W})$ denotes the potential outcome for unit $i$ under treatment allocation $\mathbf{W} \in \{0, 1\}^N$. We now give the following assumption that defines the K-nearest neighbors interference model.
Assumption 1 (K-Neighborhood Interference Assumption (K-NIA)). Units under study satisfy the K-Neighborhood Interference Assumption (K-NIA) if and only if, for each unit $i$ and for all treatment allocations $\mathbf{W}_{\mathcal{N}_{-iK}}$, $\mathbf{W}'_{\mathcal{N}_{-iK}}$, the potential outcomes satisfy
$$y_i(W_i, \mathbf{W}_{\mathcal{N}_{iK}}, \mathbf{W}_{\mathcal{N}_{-iK}}) = y_i(W_i, \mathbf{W}_{\mathcal{N}_{iK}}, \mathbf{W}'_{\mathcal{N}_{-iK}}).$$
Assumption 1 states that the potential outcome of unit $i$ is only affected by its treatment and by the treatments assigned to its K-nearest neighbors. Changing treatments for units outside the K-neighborhood will not affect the potential outcome of unit $i$. This is a special case of the neighborhood interference assumption (NIA) described in Sussman and Airoldi [12]. In its most general form, the K-nearest neighbors interference model (KNNIM) assumes only that the treatment interference structure satisfies Assumption 1. For convenience, we will suppress the treatment statuses in $\mathbf{W}_{\mathcal{N}_{-iK}}$ when referring to the potential outcomes $y_i$.
For ease of exposition, it is often convenient to view units under study as a mathematical graph. For KNNIM, let $G_{KNN} = (V, E_{KNN})$ denote the directed graph in which each vertex in $V$ corresponds to a unit and a directed edge $\vec{ij} \in E_{KNN}$ if and only if $j \in \mathcal{N}_{iK}$. Each edge $\vec{ij}$ has weight equal to the interaction measure $d(i, j)$. In this paper, we may refer to $G_{KNN}$ as the weighted adjacency graph. Throughout this article, the terms vertex, unit, and individual will be used interchangeably.
Let $A$ denote the $N \times N$ adjacency matrix of $G_{KNN}$, which indicates the presence or absence of an edge $\vec{ij}$ in the graph $G_{KNN}$. That is, $A_{ij} = 1$ if $\vec{ij} \in E_{KNN}$ and $A_{ij} = 0$ otherwise. Note that the diagonal elements of the adjacency matrix are zero; that is, $A_{ii} = 0$ for all $i$.
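As an illustration, the K-neighborhoods and the adjacency matrix can be built directly from a table of interaction measures. The following is a minimal sketch under the convention that smaller values of d(i, j) mean stronger interactions; the function name `knn_adjacency`, the list-of-lists layout, and the tie-breaking by unit index are assumptions made for this example.

```python
def knn_adjacency(d, k):
    """Build K-neighborhoods and the adjacency matrix A from interaction measures.

    d[i][j] holds d(i, j); smaller values mean stronger interaction.
    Diagonal entries are ignored. Ties are broken by unit index.
    """
    n = len(d)
    neighborhoods = []
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # sorted() is stable, so tied units keep their index order.
        nearest = sorted(others, key=lambda j: d[i][j])[:k]
        neighborhoods.append(set(nearest))
        for j in nearest:
            adjacency[i][j] = 1  # directed edge from i to j
    return neighborhoods, adjacency


# Toy asymmetric interaction measures for four units.
d = [
    [0, 1, 2, 3],
    [5, 0, 1, 2],
    [4, 3, 0, 1],
    [1, 2, 3, 0],
]
neighborhoods, A = knn_adjacency(d, k=2)
print(neighborhoods)
print(A)
```

In this toy input, unit 1 is in the 2-neighborhood of unit 0 but not the reverse, which shows how an asymmetric measure yields an asymmetric adjacency matrix.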

Choosing the neighborhood size K
The choice of $K$ for a given study may vary depending on the field of study, the purpose of the study, and the availability of data. The experimenter may also use prior knowledge from previous studies to help choose $K$. For example, if previous studies have indicated that a person's behavior is influenced by their two closest friends, setting $K = 2$ may be appropriate. When possible, $K$ should be selected in the early phases of the study to help construct the adjacency matrix $A$ when collecting data.
However, another factor that should be addressed when choosing $K$ is the sample size needed to accurately quantify, estimate, and draw inference on the K-nearest neighbors indirect effects. As mentioned above, the number of possible exposures to treatments under KNNIM is $2^{K+1}$. Hence, to ensure sufficient power, many methods that incorporate KNNIM will require a sufficient number of units assigned to each of these exposure levels. From our experience, a good heuristic is to require roughly 30 observations for each treatment exposure. Under this heuristic, most studies may find models with $K = 2$ or $3$ to be most useful.
Issues may arise if responses are used to inform the value of $K$. For example, a post-hoc selection of $K$ could lead to inaccurate detection of treatment interference due to inherent multiple testing issues (inferences must account for testing both the appropriateness of $K$ and the presence of interference in the model) and/or bias in indirect effect estimates. It may be possible to incorporate additional structure into KNNIM to allow for a rigorous treatment of this problem, but such work is outside of the scope of this paper. See Alzubaidi and Higgins [30] for additional information about the estimation of indirect effects under KNNIM.
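To make the heuristic concrete, the short sketch below tabulates the number of exposures and the implied minimum sample size for several values of K. It assumes exposures are perfectly balanced, so the figures are lower bounds rather than recommendations; the function names are hypothetical.

```python
def exposure_count(k):
    # Own treatment status plus K neighbor statuses gives 2^(K+1) exposures.
    return 2 ** (k + 1)


def min_units_needed(k, per_exposure=30):
    # Lower bound on sample size if every exposure receives per_exposure units.
    return per_exposure * exposure_count(k)


for k in (1, 2, 3, 5, 10):
    print(k, exposure_count(k), min_units_needed(k))
```

Under these assumptions, K = 2 and K = 3 call for at least 240 and 480 units, respectively, while K = 10 would call for at least 61,440 units.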

Randomization Inference for Detecting Interference
We now describe the framework for randomization inference for testing the presence of treatment interference under KNNIM. Recall that $\mathbf{W}$ is the treatment assignment vector and $y_i(\mathbf{W})$ is the potential outcome of unit $i$ under treatment $\mathbf{W}$. Let $T = T(\mathbf{W}, \mathbf{y}(\mathbf{W}))$ denote a test statistic, a random variable whose randomness follows from the random treatment assignment vector $\mathbf{W}$. Let $\mathbf{W}^{obs}$ and $\mathbf{Y}^{obs} = \mathbf{Y}(\mathbf{W}^{obs})$ denote the observed treatment assignment vector and the observed outcome vector, respectively. Then $T(\mathbf{W}^{obs}, \mathbf{Y}^{obs})$ is the observed value of the test statistic. We aim to test the null hypothesis of no treatment interference for each unit:
$$H_0: y_i(W_i, \mathbf{W}_{\mathcal{N}_{iK}}) = y_i(W_i, \mathbf{W}'_{\mathcal{N}_{iK}}) \quad \text{for all } i \text{ and all } \mathbf{W}_{\mathcal{N}_{iK}}, \mathbf{W}'_{\mathcal{N}_{iK}} \in \{0, 1\}^K. \tag{4}$$
Typically, randomization tests under the potential outcome framework assume a sharp null hypothesis of no unit-level treatment effects, and potential outcomes can be imputed under this sharp null across randomizations [31]. However, since the hypothesis (4) does not make assumptions about the direct effect of treatment on each unit, the potential outcome $y_i(W_i, \mathbf{W}_{\mathcal{N}_{iK}})$ may not be imputable for randomizations under which $W_i \neq W_i^{obs}$. Progress can be made by conditioning on a set of randomizations $\Omega$ and choosing a test statistic $T$ such that $T$ is imputable under randomizations in $\Omega$ [16]. Afterward, a conditional p-value is obtained by computing, for example, the fraction of randomizations $\mathbf{W}' \in \Omega$ such that
$$T(\mathbf{W}', \mathbf{Y}(\mathbf{W}')) \ge T(\mathbf{W}^{obs}, \mathbf{Y}^{obs}).$$
Following Aronow [14] and Athey et al. [15], this conditional randomization inference can be performed by first selecting a subset of units under study called focal units and then only considering randomizations of treatment $\mathbf{W}$ that do not affect the treatment status of the focal units. Only variant units (those that are not focal units) can have differing treatment statuses across randomizations. In other words, we simulate draws from the random treatment assignment vectors conditional on the fixed treatment of the focal units. Thus, the null hypothesis of no interference is sharp on the focal units since only the treatment statuses of variant units, the only units that can impose indirect effects, are randomized. The test statistic $T$ is only computed on the outcomes of the focal units and hence, the test statistic is imputable under alternative treatment assignment vectors.
Randomization tests tend to be the preferred approach for testing for interference under the potential outcome framework. Asymptotic results for statistics for testing interference can be challenging to derive for a number of reasons, including having to account for inherent dependencies between units' treatment allocations induced through the adjacency matrix $A$. Hence, the use of asymptotic tests tends to be restricted either to settings that rely on strong distributional assumptions or to carefully designed studies.
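The conditional randomization procedure can be sketched in a few lines. The example below is a simplified illustration, not the implementation used in the paper: it assumes Bernoulli randomization with probability `p_treat` (a real analysis must redraw treatment according to the actual design), uses a two-sided comparison on the absolute value of the statistic, and expects a user-supplied `statistic` function that depends only on focal outcomes. All names are hypothetical.

```python
import random


def conditional_randomization_test(w_obs, y_obs, focal, statistic,
                                   n_draws=1000, p_treat=0.5, seed=0):
    """Approximate conditional p-value for the null of no interference."""
    rng = random.Random(seed)
    focal = set(focal)
    t_obs = statistic(w_obs, y_obs, focal)
    extreme = 0
    for _ in range(n_draws):
        # Redraw treatment for variant units only; focal units keep
        # their observed treatment status.
        w_new = [w if i in focal else int(rng.random() < p_treat)
                 for i, w in enumerate(w_obs)]
        # Under the null, focal outcomes do not change, so y_obs is reused.
        if abs(statistic(w_new, y_obs, focal)) >= abs(t_obs):
            extreme += 1
    # Count the observed assignment itself so the p-value is never zero.
    return (extreme + 1) / (n_draws + 1)


# Toy usage: units 0 and 3 are focal; each has one variant neighbor.
neighbors = {0: [1], 3: [2]}


def mean_gap(w, y, focal):
    # Mean focal outcome with a treated neighbor minus mean without one.
    exposed = [y[i] for i in focal if any(w[j] for j in neighbors[i])]
    unexposed = [y[i] for i in focal if not any(w[j] for j in neighbors[i])]
    if not exposed or not unexposed:
        return 0.0  # convention assumed here when one group is empty
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)


p_value = conditional_randomization_test(
    [1, 1, 0, 0], [5.0, 3.0, 2.0, 1.0], focal=[0, 3], statistic=mean_gap)
print(p_value)
```

Fixing the seed makes the Monte Carlo p-value reproducible; with only four units, the toy example is meant to show the mechanics rather than to have any power.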
Finally, while these approaches were originally developed for tests of treatment interference, Basse et al. [16] extend this work to build a framework for randomization tests for more general forms of causal effects.

Selection of the Focal Units
Although the choice of the focal units does not affect the validity of randomization tests for interference, it plays a key role in determining the power of these tests [15]. More precisely, there is a trade-off between the size of the focal set (the set of focal units) and the size of the variant set (the set of variant units). Adding focal units allows for larger sample sizes when testing for treatment interference, thereby increasing the power of these tests, but decreases the number of potential randomizations on the variant units, which decreases their power. For general interference models, several useful heuristics for choosing focal units have been proposed, varying widely in complexity. We now outline a few of these methods.
The most basic approach, suggested by Athey et al. [15], is to simply select at random half of the units in the sample to be focal units; the other half are variant units. Note that this rule does not take into account, in any way, the interference model being assumed.
For models in which interference only exists between units with $d(i, j) \le r$ (see Section 3.1.1), Aronow [14] suggests a rule for choosing focal units that ensures a substantial number of treated and control variant units within each focal unit's neighborhood. The rule is expressed in terms of $N_F$, the number of focal units, and $N_{T,var,r}$ and $N_{C,var,r}$, the numbers of treated and control units in the variant set, respectively, within a "distance" of $r$ from a randomly selected focal unit.
Finally, when the adjacency graph $G = (V, E)$ is known, Athey et al. [15] propose using an ε-net as the set of focal units: a set of units such that there is a path of ε edges or fewer in $G$ from any variant unit $j$ to some focal unit $i$ [32]. Note that this is equivalent to choosing a maximal independent set of units in the graph $G_\varepsilon = (V, E_\varepsilon)$, where an edge $\vec{ij} \in E_\varepsilon$ if and only if there is a path of ε edges or fewer from $i$ to $j$ in $G$.
Under KNNIM, we suggest choosing focal units in a way such that the K-neighborhoods of the focal units do not overlap. This can be done by creating a 2-net on the undirected adjacency graph $G^*_{KNN} = (V, E^*_{KNN})$, in which an undirected edge $ij \in E^*_{KNN}$ if and only if $\vec{ij} \in E_{KNN}$ or $\vec{ji} \in E_{KNN}$, where $E_{KNN}$ is the edge set of the directed weighted adjacency graph $G_{KNN}$. The 2-net can then be used as the set of focal units. This enables us to remove dependencies between outcomes of focal units induced by indirect effects. In fact, if treatment is Bernoulli-randomized across units, the responses of the focal units will be independent of each other. Additionally, a substantial fraction of focal units may still be selected under this condition, increasing the power of the randomization inference.
We now describe a simple algorithm to obtain a 2-net on the undirected adjacency graph $G^*_{KNN}$.
Algorithm 1. Given a K-nearest neighbors undirected adjacency graph $G^*_{KNN} = (V, E^*_{KNN})$, the following algorithm will obtain a 2-net on $G^*_{KNN}$.

Step 1: (Initialize) Let $U = V$. Initialize the set of focal units $F = \emptyset$. Initialize the set of variant units $I = \emptyset$.
Step 2: (Select focal unit) Choose one vertex $i \in U$ at random. Set $i$ as a focal unit: $F = F \cup \{i\}$.
Step 3: (Find nearest neighbors) Set $I$ equal to all units $j$ such that $ij \in E^*_{KNN}$.
Step 4: (Find neighbors of neighbors) Find all units $k \in V \setminus I$ such that, for some unit $j$ found in Step 3, $jk \in E^*_{KNN}$. Add these units $k$ to $I$.
Step 5: (Remove units) Remove all vertices in $F$ and $I$ from $U$.
Step 6: (Repeat or terminate) If $|U| = 0$, stop. The set of focal units $F$ is a 2-net for $G^*_{KNN}$. Otherwise, set $I = \emptyset$ and return to Step 2.
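The steps of Algorithm 1 can be sketched as follows. This is an illustrative implementation under assumed data structures: the graph is a dictionary mapping each unit to the set of its neighbors, and the helper `symmetrize`, which builds the undirected graph from directed K-nearest-neighbor lists, is a hypothetical convenience rather than part of the algorithm as stated.

```python
import random


def symmetrize(directed_neighbors):
    """Undirected graph: i and j are adjacent if either lists the other."""
    undirected = {i: set(nbrs) for i, nbrs in directed_neighbors.items()}
    for i, nbrs in directed_neighbors.items():
        for j in nbrs:
            undirected.setdefault(j, set()).add(i)
    return undirected


def select_focal_units(undirected_neighbors, seed=0):
    """Greedy 2-net selection following Steps 1 to 6 of Algorithm 1."""
    rng = random.Random(seed)
    unassigned = set(undirected_neighbors)            # Step 1: U = V
    focal = set()
    while unassigned:                                 # Step 6: repeat until U is empty
        i = rng.choice(sorted(unassigned))            # Step 2: random focal unit
        focal.add(i)
        blocked = set(undirected_neighbors[i])        # Step 3: neighbors of i
        for j in list(blocked):                       # Step 4: neighbors of neighbors
            blocked |= undirected_neighbors[j]
        unassigned -= blocked | {i}                   # Step 5: remove from U
    return focal


# Toy usage: each unit lists its single nearest neighbor (K = 1).
directed = {0: [1], 1: [2], 2: [3], 3: [4], 4: [5], 5: [6], 6: [5]}
graph = symmetrize(directed)
focal_units = select_focal_units(graph, seed=1)
print(sorted(focal_units))
```

Because focal units are drawn at random, different seeds can give focal sets of different sizes; in practice one might run the selection several times and keep the largest set, although that choice is not prescribed by the algorithm above.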

Current Methods for Detecting Interference
Current methods for detecting interference include conditional randomization tests [14,15] (as outlined in Section 4) and carefully designed experiments performed with the intention to detect interference [4,5]. We now provide a summary of these methods for testing for interference. For randomization tests, we focus on the choice of test statistic used. For experimental design methods, we describe both the experimental setup and the test statistic.

Test Statistics for Randomization Tests
Aronow [14] introduced the randomization inference approach for testing for interference between units, where units are affected by their own treatment and by the treatment assigned to their immediate neighbors. In this test, the treatment status for a subset of focal units remains fixed; the rest of the units are the variant subset. The randomization inference is conditional on the observed treatment status of the fixed subset. That is, this test is on indirect effects resulting from the treatment allocation on the variant subset of units. A variety of test statistics may be used under this framework. The Pearson correlation coefficient $\rho$ between the outcomes of the fixed units ($Y_F$) and the "distance" to the nearest unit of a particular treatment status in the variant subset ($D_{nearest}$) may be used as the test statistic:

$$\rho = \frac{\mathrm{Cov}(Y_F, D_{nearest})}{S_{Y_F}\, S_{D_{nearest}}}, \tag{5}$$

where $\mathrm{Cov}$ denotes the sample covariance and $S$ the sample standard deviation. A common choice of distance is the Euclidean distance between pretreatment covariates. This distance can be incorporated into the KNNIM framework through the interaction measure $d$. Aronow [14] advocates for computing the Pearson correlation coefficient on the ranks of these quantities; however, preliminary simulations suggest that the statistic $\rho$ tends to be more powerful for the models considered in Section 8.

Athey et al. [15] extend this work and develop tests for more general realizations of interference (e.g., no higher-order interference). As part of this work, they suggest additional test statistics for detecting interference. The edge-level contrast statistic $T_{elc}$, a modification of a test statistic proposed by Bond et al. [33], is the difference between the average outcomes of the focal units with treated neighbors and the focal units with control neighbors. Here, $T_{elc}$ averages over edges $ij$ where $i$ is a focal unit and $j$ is not a focal unit:

$$T_{elc} = \frac{\sum_{i \neq j} F_i A_{ij} (1 - F_j) W_j Y_i^{obs}}{\sum_{i \neq j} F_i A_{ij} (1 - F_j) W_j} - \frac{\sum_{i \neq j} F_i A_{ij} (1 - F_j)(1 - W_j) Y_i^{obs}}{\sum_{i \neq j} F_i A_{ij} (1 - F_j)(1 - W_j)},$$

where $A_{ij}$ is the $(i, j)$ entry of the adjacency matrix $A$ and $F_i$ is an indicator variable satisfying $F_i = 1$ if and only if $i \in F$.
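A second test statistic is the score test statistic $T_{score}$ [15]. This statistic is motivated by a model of treatment interference in which the indirect effect is proportional to the fraction of treated neighbors [34,11]. The score test begins by computing

$$r_i = Y_i^{obs} - \bar{Y}^{obs}_{F,0} - W_i \left( \bar{Y}^{obs}_{F,1} - \bar{Y}^{obs}_{F,0} \right)$$

for each focal unit $i \in F$, where $\bar{Y}^{obs}_{F,1}$ and $\bar{Y}^{obs}_{F,0}$ are the average outcomes for the treated and control focal units respectively. Then, $T_{score}$ is the covariance between these $r_i$ terms and

$$\bar{W}_i = \frac{\sum_{j} A_{ij} W_j}{\sum_{j} A_{ij}},$$

which is the fraction of treated neighbors for unit $i$. This statistic is computed across only focal units that have at least one treated neighbor:

$$T_{score} = \mathrm{Cov}\left( r_i, \bar{W}_i \,\middle|\, i \in F, \ \sum_{j} A_{ij} W_j > 0 \right).$$

Finally, Athey et al. [15] consider the has-treated-neighbor test statistic $T_{htn}$, a modification of the Pearson correlation coefficient (5). Instead of using the distance to the nearest treated neighbor, this statistic uses an indicator variable $E_i$ for whether any of a unit's neighbors in the variant subset are treated: that is,

$$E_i = \begin{cases} 1 & \text{if } \sum_{j} A_{ij} (1 - F_j) W_j > 0, \\ 0 & \text{otherwise.} \end{cases}$$

Then $T_{htn}$ is the correlation between this indicator and the outcomes for the focal units $F$:

$$T_{htn} = \frac{1}{N_F - 1} \sum_{i \in F} \frac{\left( Y_i^{obs} - \bar{Y}^{obs}_F \right) \left( E_i - \bar{E} \right)}{S_{Y^{obs}_F}\, S_E},$$

where $N_F$ is the number of focal units, $\bar{Y}^{obs}_F$ and $S_{Y^{obs}_F}$ are the sample mean and standard deviation of the outcomes for focal units respectively, and $\bar{E}$ and $S_E$ are the sample mean and standard deviation of the $E_i$ variables.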
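As an illustration of how two of these statistics can be computed from data, the following sketch implements the edge-level contrast and the has-treated-neighbor statistic as described in the text. The function names and the data layout (a list of outcomes `y`, a list of treatment indicators `w`, and a mapping `neighbors` from each unit to its neighbors) are assumptions made for this example and are not taken from [15].

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Sample Pearson correlation; undefined if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def edge_level_contrast(y, w, focal, neighbors):
    """Mean focal outcome over edges to treated non-focal neighbors,
    minus the mean over edges to control non-focal neighbors."""
    focal_set = set(focal)
    treated_edges, control_edges = [], []
    for i in focal:
        for j in neighbors[i]:
            if j in focal_set:
                continue  # only edges from a focal unit to a variant unit count
            (treated_edges if w[j] == 1 else control_edges).append(y[i])
    return mean(treated_edges) - mean(control_edges)


def has_treated_neighbor(y, w, focal, neighbors):
    """Correlation between focal outcomes and the indicator E_i that at
    least one non-focal neighbor of unit i is treated."""
    focal_set = set(focal)
    e = [int(any(w[j] == 1 for j in neighbors[i] if j not in focal_set))
         for i in focal]
    return pearson([y[i] for i in focal], e)
```

Both functions raise an error when a required group is empty or an input to the correlation is constant, which a production implementation would need to handle explicitly.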

Experimental Design Approach
Saveski et al. [5] and Pouget-Abadie et al. [4] present a two-stage experimental design to test for the presence of interference. In this design, the units under study are divided into two groups and two experiments are performed simultaneously: for one group, treatment is assigned completely at random, and for the other group, units are clustered and treatment is assigned across clusters rather than units. Then, estimates of the average direct effect are computed under the assumption of no interference for both the completely randomized and cluster randomized designs. Finally, a standardized difference $T_{exp}$ is computed between these estimates:

$$T_{exp} = \frac{\left| \hat{\tau}_{cr} - \hat{\tau}_{cbr} \right|}{\hat{\sigma}_p}, \tag{7}$$

where $\hat{\tau}_{cr}$ and $\hat{\tau}_{cbr}$ are the estimates of the direct effect under the completely randomized and cluster randomized designs respectively and $\hat{\sigma}_p$ is a pooled standard deviation of responses from both the completely randomized and cluster randomized designs [5]. Large values of $T_{exp}$ imply the presence of indirect effects.
A conservative test of the null hypothesis of no treatment interference can be performed at the $\alpha$ significance level by rejecting the null hypothesis if and only if $T_{exp} \geq \alpha^{-1/2}$. Additionally, as the number of units $n \to \infty$, it can be shown that $T_{exp}$ converges to a standard normal distribution (provided that cluster sizes remain fixed). Thus, an approximate size $\alpha$ test can be conducted by rejecting the null hypothesis of no interference if $T_{exp} \geq z_{1-\alpha/2}$, where $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution.
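To give a sense of how far apart the two rejection rules are, the short sketch below evaluates both thresholds for a given significance level. The function name `rejection_thresholds` is hypothetical; only the two cutoffs stated in the text are computed.

```python
from statistics import NormalDist


def rejection_thresholds(alpha):
    """Return (conservative, asymptotic) cutoffs for the experimental-design test."""
    conservative = alpha ** -0.5                       # reject if T_exp >= alpha^(-1/2)
    asymptotic = NormalDist().inv_cdf(1 - alpha / 2)   # reject if T_exp >= z_{1 - alpha/2}
    return conservative, asymptotic


conservative, asymptotic = rejection_thresholds(0.05)
print(round(conservative, 3), round(asymptotic, 3))  # 4.472 1.96
```

At α = 0.05 the conservative cutoff is more than twice the asymptotic one, so the conservative rule requires a much larger standardized difference before it rejects.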

K-Nearest Neighbors Indirect Effect Test Statistic
We now propose an additional test statistic designed to detect K-nearest neighbors indirect effects. Let $\bar{Y}^{obs}(W_i, W_\ell = 1)$ and $\bar{Y}^{obs}(W_i, W_\ell = 0)$ denote the average response of observed units that are assigned to treatment status $W_i$ and have their $\ell$th nearest neighbor assigned to the treatment condition and the control condition respectively. The K-nearest neighbors indirect effect test statistic $T_{knn}$ is obtained by computing differences in average observed responses between focal units that receive the same treatment status but differ on the status of their $\ell$th nearest neighbor, and summing these differences across each of the K nearest neighbors. That is, for $W_i \in \{0, 1\}$ and $\ell \in \{1, \ldots, K\}$, define

$$T_{knn,\ell}(W_i) = \bar{Y}^{obs}(W_i, W_\ell = 1) - \bar{Y}^{obs}(W_i, W_\ell = 0),$$

and define $T_{knn,\ell}$ as a weighted average of these terms:

$$T_{knn,\ell} = \frac{N_{F_t}\, T_{knn,\ell}(1) + N_{F_c}\, T_{knn,\ell}(0)}{N_{F_t} + N_{F_c}},$$

where $N_{F_t}$ and $N_{F_c}$ are the number of treated focal units and control focal units respectively. We then define $T_{knn}$ as a sum of these $T_{knn,\ell}$ statistics:

$$T_{knn} = \sum_{\ell=1}^{K} T_{knn,\ell}.$$

Note that, under the null hypothesis of no treatment interference, each of the $T_{knn,\ell}(W_i)$ terms should be close to 0. Thus, since $T_{knn}$ is a linear combination of these terms, values of $T_{knn}$ that are relatively large in magnitude provide evidence against this null hypothesis, and so $|T_{knn}|$ may be effective as a test statistic. Additionally, note that the statistic $T_{knn,\ell}$ may be used directly for a test of interference stemming from treatments assigned to the $\ell$th-nearest neighbor.
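The construction of this statistic can be sketched as follows. The sketch assumes that the weights in the weighted average are proportional to the numbers of treated and control focal units, that `knn[i]` lists the neighbors of unit `i` from closest to farthest, and that every combination of own status and neighbor status is observed among the focal units; the function names are hypothetical.

```python
from statistics import mean


def t_knn_l(y, w, focal, knn, l):
    """Statistic for the neighbor at position l (0-based, so l = 0 is the
    nearest neighbor). Assumes no empty cells; mean() raises otherwise."""
    n_focal = len(focal)
    total = 0.0
    for status in (0, 1):
        group = [i for i in focal if w[i] == status]
        with_treated = [y[i] for i in group if w[knn[i][l]] == 1]
        with_control = [y[i] for i in group if w[knn[i][l]] == 0]
        diff = mean(with_treated) - mean(with_control)
        # weight by the share of focal units with this own-treatment status
        total += len(group) / n_focal * diff
    return total


def t_knn(y, w, focal, knn, K):
    """Sum of the per-neighbor statistics over the K nearest neighbors."""
    return sum(t_knn_l(y, w, focal, knn, l) for l in range(K))
```

In a balanced toy example where a focal unit's response equals 5 times its own status plus 2 times the status of its nearest neighbor plus 1 times the status of its second-nearest neighbor, the per-neighbor statistics recover 2 and 1, and their sum is 3.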

Simulation
We now compare the performance of the methods covered in Sections 6 and 7 for testing the null hypothesis of no interference under the K-nearest neighbors interference model.

Data Generation Procedure
We generate the responses under model (8), which satisfies KNNIM with K = 3. In this model, we assume that the closest three neighbors affect the response $Y_i$; we use $W_{i\ell}$ to denote the treatment status of the $\ell$th nearest neighbor of unit $i$. The covariates $X_j$, $j = 1, 2, 3$, are independent and identically distributed $\mathrm{Normal}(0, 1)$ random variables. We use the Euclidean distance between the covariates $X_i$ and $X_j$ as the interaction measure $d(i, j)$: units with more similar covariate values are more likely to interact with each other. Note that the model (8) defines the set of potential outcomes for each unit $i$. Simulated data are then generated by randomizing treatment across units. Different models are obtained by varying the coefficients $\beta = (\beta_1, \beta_2, \beta_3, \beta_d)$ and the sample size N. We consider sample sizes of N = 256 and N = 1024.
For each choice of sample size, we consider sixteen different models of interference. We describe these models in Table 4 in terms of the coefficient vector $\beta$. The first three elements of $\beta$ represent the indirect effects contributed by the first, second, and third nearest neighbors respectively. The last element $\beta_d$ is the unit's direct effect. In all models considered, the closer the relationship to unit $i$, the greater the indirect effect: $\beta_1 \geq \beta_2 \geq \beta_3$. The models are grouped by degree of interference: no interference in the first three models, very weak interference in the next three, weak interference in the next three, moderate interference in the next three, and strong interference in the last four.
For datasets with N = 256 observations, 1,000 realizations of potential outcomes following each model are generated. Tests of indirect effects are then applied to each of the 1,000 realizations. Results for N = 256 are given in Section 8.4. Due to computational limitations, only 100 realizations are generated for models containing N = 1024 units. Results for N = 1024 are given in the Supplementary Material.
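The neighbor structure used in the simulation, in which the interaction measure is the Euclidean distance between covariate vectors, can be sketched as below. The function name `k_nearest_neighbors` and the rule of breaking distance ties by unit index are assumptions made for this illustration; the response model itself is not reproduced here.

```python
import math
import random


def k_nearest_neighbors(covariates, K):
    """Map each unit to its K nearest other units, closest first.

    covariates: list of equal-length tuples, one per unit.
    Ties in distance are broken by unit index (an assumption of this sketch).
    """
    knn = {}
    for i, xi in enumerate(covariates):
        others = [j for j in range(len(covariates)) if j != i]
        others.sort(key=lambda j: (math.dist(xi, covariates[j]), j))
        knn[i] = others[:K]
    return knn


# Example: N = 256 units with three independent standard normal covariates.
rng = random.Random(0)
covariates = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(256)]
knn = k_nearest_neighbors(covariates, K=3)
```

This brute-force search compares every pair of units, which is adequate for the sample sizes considered here; larger problems would typically use a spatial index instead.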

Simulation for Randomization Tests
We compare the performance of both conditional randomization tests and experimental design approaches for detecting interference. For the conditional randomization tests, for each set of generated potential outcomes, treatment is initially assigned completely at random to units, with half of the units receiving treatment and the other half receiving control. Then, focal units are selected according to Algorithm 1. We then proceed with randomization tests as described in Sections 4 and 6.1. We evaluate the performance of the following test statistics: the Pearson correlation coefficient (Pearson) [14], the edge level contrast statistic (ELC), the score statistic (Score), the has-treated-neighbor statistic (HTN) [15], and the K-nearest neighbors indirect effect test statistic (KNN).
Test statistics are computed across 1,000 randomizations for each realization of the potential outcomes; for each randomization, treatment statuses are fixed for focal units and are completely randomized across variant units. For each set of potential outcomes and for each choice of test statistic, we obtain a p-value for the null hypothesis of no treatment interference. Thus, for N = 256, we obtain a distribution of 1,000 p-values for each test statistic under each model. The power of the tests can also be estimated by computing the fraction of p-values that fall beneath a pre-specified significance level α.
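The conditional randomization test itself can be sketched generically for any test statistic. In this sketch, complete randomization across variant units is implemented by shuffling their observed statuses, which keeps the number of treated variant units fixed, and draws that tie the observed statistic are counted as exceedances; both choices, along with the function name, are assumptions of the illustration. Outcomes of focal units are held fixed, as they would be under the null hypothesis of no interference.

```python
import random


def randomization_p_value(statistic, y, w, focal, n_draws=1000, seed=None):
    """Conditional randomization p-value for a statistic(y, w) callable."""
    rng = random.Random(seed)
    focal_set = set(focal)
    variant = [i for i in range(len(w)) if i not in focal_set]
    observed = abs(statistic(y, w))
    variant_w = [w[i] for i in variant]
    exceed = 0
    for _ in range(n_draws):
        rng.shuffle(variant_w)          # re-randomize variant units only
        w_new = list(w)                 # focal statuses stay as observed
        for i, status in zip(variant, variant_w):
            w_new[i] = status
        if abs(statistic(y, w_new)) >= observed:
            exceed += 1
    return exceed / n_draws
```

Repeating this over many simulated datasets and recording the share of p-values below α gives the estimated Type I Error or power described above.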

Simulation for Experimental Design Approach
In addition, we follow the experimental design in Saveski et al. [5] (described in Section 6.2) to determine its efficacy for testing whether SUTVA holds under KNNIM. For each set of generated potential outcomes, we divide the units into clusters of four units using a heuristic algorithm for the clique partitioning problem with minimum clique size requirement from Ji [35] (Algorithm 4). This clustering is performed once per set of potential outcomes.
We then randomly select half of the clusters to be cluster randomized; for this group, treatment is assigned at the cluster level, with half of the clusters receiving treatment and the other half receiving control. For units belonging to the remaining clusters, each unit's cluster assignment is ignored, and treatment is completely randomized across all of these remaining units. Again, half of these units receive treatment and the other half receive control. For each set of potential outcomes, the random selection of clusters and the treatment randomization are performed 1,000 times.
For each randomization, the statistic $T_{exp}$ in (7) is computed. We then perform a test of the null hypothesis of no treatment interference at the α = 0.05 significance level. A conservative test rejects this null hypothesis if $T_{exp} \geq \alpha^{-1/2}$ and an asymptotic test rejects the null if $T_{exp} \geq z_{1-\alpha/2}$. Thus, for N = 256, we perform a total of 1,000,000 tests: that is, 1,000 tests for each of the 1,000 generated potential outcomes. By computing the fraction of rejected null hypotheses, we are able to assess the Type I Error (Models 1-3) and the power (Models 4-16) of the experimental design approach.
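The assignment step of this design can be sketched as follows. The function name `two_stage_assignment` and the representation of clusters as lists of unit labels are assumptions for this illustration; the clustering algorithm from Ji [35] is not reproduced, and clusters are simply taken as given.

```python
import random


def two_stage_assignment(clusters, seed=None):
    """Assign treatment under the two-stage design.

    clusters: list of lists of unit labels.
    Returns (w, arm): treatment status and design arm for each unit.
    """
    rng = random.Random(seed)
    order = list(range(len(clusters)))
    rng.shuffle(order)
    half = len(order) // 2
    cluster_arm, complete_arm = order[:half], order[half:]

    w, arm = {}, {}
    # Cluster randomized arm: half of its clusters are treated as a whole.
    treated_clusters = set(cluster_arm[: len(cluster_arm) // 2])
    for c in cluster_arm:
        for u in clusters[c]:
            w[u] = int(c in treated_clusters)
            arm[u] = "cluster"
    # Completely randomized arm: cluster membership is ignored.
    units = [u for c in complete_arm for u in clusters[c]]
    rng.shuffle(units)
    n_treated = len(units) // 2
    for position, u in enumerate(units):
        w[u] = int(position < n_treated)
        arm[u] = "complete"
    return w, arm
```

With N = 256 units in 64 clusters of four, this yields 32 cluster randomized clusters, of which 16 are treated, and 128 completely randomized units, of which 64 are treated.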

Discussion
Figure 1 provides a visual comparison of the distribution of p-values for the randomization tests to detect interference under KNNIM. Table 5 provides the estimated Type I Error and power of these tests (conducted at significance level α = 0.05) across the 16 considered models. As is expected by design [36], the p-values of all randomization tests under models without treatment interference (Models 1-3) are approximately uniformly distributed between 0 and 1. All tests lack power under very weak interference (Models 4-6), where the highest power is 0.110 for the KNN test, followed by 0.108 for the Score test. Under weak interference (Models 7-9), the ELC, Score, and KNN tests seem to outperform the Pearson and HTN tests; the p-values are smaller overall for these three tests. Similar trends hold under moderate interference (Models 10-12) and strong interference (Models 13-16). In particular, under strong interference, the Score, KNN, and ELC tests have near 100% power to detect treatment interference.
However, the ELC and HTN tests seem to have some difficulty with detecting indirect effects when direct effects become large. For example, the p-values for these tests under Models 9 and 12, which have comparatively larger direct effects, are substantially larger than under Models 7 and 8 and Models 10 and 11 respectively. The Score and KNN tests do not suffer from this loss of power as direct effects increase. For example, for Model 9, the Score and KNN tests have an estimated power of 0.844 and 0.839 respectively, whereas the ELC and HTN tests have an estimated power of 0.553 and 0.249 respectively. Thus, for the considered tests, the Score and KNN tests seem to have the best combination of power in detecting treatment effects and isolating indirect effects in the presence of direct effects. Similar comparisons between the methods hold for datasets with N = 1024 and/or when focal units are selected from only one treatment condition (see the Supplementary Material for details).
Figure 3 gives box plots of the estimated rejection rate across all 1,000 generated potential outcomes for both the conservative and asymptotic tests using the experimental design method [4,5] with N = 256 and significance level α = 0.05. This plot also shows the estimated power of the considered randomization tests under these 16 models. Table 5 includes the median values of the rejection rates across the 1,000 generated potential outcomes for these tests. The conservative experimental approach appears to lead to a very conservative test; the true Type I Error is much smaller than α = 0.05, and the test appears to have weak power under very weak, weak, and moderate interference. Even under Models 13-16, which exhibit strong interference, the conservative test only has a median power of approximately 0.6965.
The asymptotic test yields much more desirable results for our simulated data. Overall, the Type I Error seems quite close to the nominal α = 0.05. The asymptotic test outperforms the Pearson and HTN randomization tests for almost all models of interference, and has power close to 1 for detecting interference under Models 13-16. However, the power of the asymptotic test still lags behind that of the Score, KNN, and ELC tests across all models.
When we increase the sample size to N = 1024, the conservative approach seems to be powerful for moderate and strong interference, while the asymptotic approach is powerful for all interference models except the very weak interference models. However, both approaches remain comparatively less powerful than the Score, KNN, and ELC randomization tests (see the Supplementary Material for details).

Analysis of Anti-Conflict Program Experiment
In this section we reanalyze data from the motivating study described in Section 1.1 designed to reduce conflict among middle school students in New Jersey. Following Paluck et al. [17], we only perform our analysis on seed-eligible students; hence, the adjacency matrix $A$ only contains information about connections between seed-eligible students. We then select a set of focal units following the procedure in Algorithm 1.
For this study, randomization inference is then performed assuming complete randomization of treatment to the non-focal units. Note that this is a simplification of how treatment was originally assigned to seed-eligible students: specifically, treatment was block-randomized with the schools serving as blocks. However, as our focus is more on discussing the implementation of these randomization tests on data than on confirming the results of Paluck et al. [17], we allow this simplifying assumption.

Selecting K
Recall that the K = 10 closest connections were identified for each student. However, implementing a KNNIM model with K = 10 is impractical for this example. For a study of this size (N = 2,451), such a model would result in too many potential exposures for each unit ($2^{11} = 2{,}048$ in total) to allow for meaningful inference to be performed on the indirect effect. Moreover, seed-eligible students often identify connections with ineligible students, which are not included in $A$; in fact, most seed-eligible students have fewer than 3 connections with other seed-eligible students. This complicates the implementation of KNNIM with K = 10, which (from Section 3) is only well-identified when each observation has at least K = 10 connections.
To determine whether a choice of K is appropriate for this application, we first subset all seed-eligible students that have at least K connections with other seed-eligible students. We then calculate how many of these students are exposed to each of the $2^{K+1}$ treatment exposures. Finally, we choose the largest K that yields sufficient sample sizes (at least 30 students) for each exposure for our KNNIM model.
To make this explicit, suppose we consider a KNNIM model with K = 2. This sample contains N = 348 units: that is, there are 348 seed-eligible students that interact with at least two other seed-eligible students. Moreover, there are eight treatment exposures possible for each student in this sample; in Table 1, we see that each possible exposure has at least 34 students assigned to that exposure. Hence, K = 2 seems to be an acceptable choice. Now, suppose we restrict our analysis further to only eligible students in treated schools who have at least K = 3 seed-eligible nearest neighbors. In this case, the sample size is reduced to only 100 students. Additionally, from Table 2, we see that there are an insufficient number of units assigned to each exposure; in fact, there is only one student in the sample for which that student and all three of its seed-eligible nearest neighbors are treated. We conclude that K = 3 yields an inappropriate model, and continue our analysis using a KNNIM model with K = 2.
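The exposure-counting procedure used to select K can be sketched as below. The function names, the data layout (a list of treatment indicators `w` and a mapping `knn` from each unit to its eligible neighbors, ordered from closest to farthest), and the default threshold argument are assumptions for this illustration; the threshold of 30 units per exposure is the one stated in the text.

```python
from collections import Counter
from itertools import product


def exposure_counts(w, knn, K):
    """Count units in each of the 2^(K+1) exposures.

    Only units with at least K listed neighbors are used. An exposure is the
    tuple (own status, status of 1st neighbor, ..., status of Kth neighbor).
    """
    eligible = [i for i, nbrs in knn.items() if len(nbrs) >= K]
    counts = Counter(
        (w[i],) + tuple(w[j] for j in knn[i][:K]) for i in eligible
    )
    return {e: counts.get(e, 0) for e in product((0, 1), repeat=K + 1)}


def largest_adequate_k(w, knn, k_max, min_count=30):
    """Largest K up to k_max whose sparsest exposure has at least min_count
    units, or None if no K qualifies."""
    best = None
    for K in range(1, k_max + 1):
        if min(exposure_counts(w, knn, K).values()) >= min_count:
            best = K
    return best
```

The number of exposures doubles with each additional neighbor (8 for K = 2, 16 for K = 3, 2,048 for K = 10), which is why the per-exposure counts shrink so quickly as K grows.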

Assessing indirect effects using randomization tests
We evaluate the performance of the randomization tests for the following statistics: the Pearson statistic (Pearson), the edge level contrast statistic (ELC), the score statistic (Score), the has-treated-neighbor statistic (HTN), and the K-nearest neighbors indirect effect test statistic (KNN). We choose focal units according to Algorithm 1, and treatment is re-randomized across non-focal units 1,000 times. The p-value is the proportion of the replications where the absolute value of the simulated test statistic is greater than the absolute value of the observed test statistic. Results are given in Table 3. For context, an analysis of this experiment by Aronow and Samii [9] estimated the indirect effect to be 0.154: that is, the probability that a non-seed student wears a wristband increases by about 15 percentage points if they have a connection with a seed student. Failure of these permutation tests to detect an indirect effect does not negate the findings of the original study. For example, from Section 8, we find that permutation tests struggle to detect indirect effects of similar sizes consistently. Additionally, this modified demonstration dramatically reduces the sample size of the original study, further decreasing the power of these tests.

Conclusion
Traditional causal inference methodologies may fail to make reliable causal statements on treatment effects in the presence of interference. A substantial amount of recent work has been devoted to causal inference under interference, including methods for detecting treatment interference [14,9,15,16,10,11,4,5,12,13].
We consider a new model of treatment interference, the K-nearest-neighbors interference model (KNNIM), in which the treatment status of a unit $i$ affects the response of a unit $j$ only if $i$ is one of $j$'s K closest neighbors. We give advice for selecting focal units for conditional randomization tests for detecting interference under KNNIM, and suggest a new test statistic, the K-nearest neighbors indirect effect test statistic (KNN), for these randomization tests. We then perform a simulation study to compare the efficacy of both the randomization tests and the experimental design approach for detecting interference under KNNIM.

Figure 1: Boxplots of p-values for the Pearson test (Pearson), has treated neighbor test (HTN), edge level contrast test (ELC), score test (Score) and K-nearest neighbors indirect effect test (KNN) under various KNNIM models. We use N = 256 units and K = 3 nearest neighbors. The p-values are estimated using 1,000 randomizations for each of the 1,000 generated potential outcome realizations.

Figure 3: Boxplots of the estimated rejection rates under the experimental design approach for both the conservative and asymptotic tests of the null hypothesis of no treatment interference under various KNNIM models. Plots also contain the estimated Type I Error (Models 1-3) and power (Models 4-16) for the Pearson test (Pearson), edge level contrast test (ELC), score test (Score), has treated neighbor test (HTN) and K-nearest neighbors indirect effect test (KNN). We use N = 256 units and K = 3 nearest neighbors. The rejection rates are estimated using 1,000 treatment assignments for each of the 1,000 generated potential outcomes. Tests are performed at significance level α = 0.05.

Figure 4: Boxplots of p-values for the Pearson test (Pearson), has treated neighbor test (HTN), edge level contrast test (ELC) and K-nearest neighbors indirect effect test (KNN) under various KNNIM models using only control focal units. We use N = 256 units and K = 3 nearest neighbors. The p-values are estimated using 1,000 randomizations for each of the 1,000 generated potential outcome realizations.

Figure 5: Boxplots of p-values for the Pearson test (Pearson), has treated neighbor test (HTN), edge level contrast test (ELC) and K-nearest neighbors indirect effect test (KNN) under various KNNIM models using only control focal units. We use N = 1024 units and K = 3 nearest neighbors. The p-values are estimated using 1,000 randomizations for each of the 100 generated potential outcome realizations.

Figure 6: Boxplots of p-values for the Pearson test (Pearson), has treated neighbor test (HTN), edge level contrast test (ELC), score test (Score) and K-nearest neighbors indirect effect test (KNN) under various KNNIM models. We use N = 1024 units and K = 3 nearest neighbors. The p-values are estimated using 1,000 randomizations for each of the 100 generated potential outcome realizations.

Figure 7: Boxplots of the estimated rejection rates under the experimental design approach for both the conservative and asymptotic tests of the null hypothesis of no treatment interference under various KNNIM models. Plots also contain the estimated Type I Error (Models 1-3) and power (Models 4-16) for the Pearson test (Pearson), edge level contrast test (ELC), score test (Score), has treated neighbor test (HTN) and K-nearest neighbors indirect effect test (KNN). We use N = 1024 units and K = 3 nearest neighbors. The rejection rates are estimated using 1,000 treatment assignments for each of the 100 generated potential outcomes. Tests are performed at significance level α = 0.05.

Table 1: Number of units in each exposure of the Anti-Conflict Program Experiment with K = 2.

Table 2: Number of units in each exposure of the Anti-Conflict Program Experiment with K = 3.

Table 3: Data analysis of the Anti-Conflict Program Experiment. For this modified experiment, all randomization tests fail to detect an indirect effect. The p-value is smallest for the ELC test (p = 0.14), followed by the Score test (p = 0.22) and the KNN test (p = 0.34).

Table 5: Estimated Type I Errors (Models 1-3) and estimated power (Models 4-16) for tests of treatment interference on simulated data under KNNIM with sample size N = 256. Results are provided for the score test (Score), K-nearest neighbors indirect effect test (KNN), edge level contrast test (ELC), has treated neighbor test (HTN) and the Pearson test (Pearson).