What is a Randomization Test?

Abstract The meaning of randomization tests has become obscure in statistics education and practice over the last century. This article makes a fresh attempt at rectifying this core concept of statistics. A new term—“quasi-randomization test”—is introduced to define significance tests based on theoretical models and distinguish these tests from the “randomization tests” based on the physical act of randomization. The practical importance of this distinction is illustrated through a real stepped-wedge cluster-randomized trial. Building on the recent literature on randomization inference, a general framework of conditional randomization tests is developed and some practical methods to construct conditioning events are given. The proposed terminology and framework are then applied to understand several widely used (quasi-)randomization tests, including Fisher’s exact test, permutation tests for treatment effect, quasi-randomization tests for independence and conditional independence, adaptive randomization, and conformal prediction. Supplementary materials for this article are available online.


Introduction
Randomization is one of the oldest and most important ideas in statistics, playing several roles in experimental design and inference (Cox, 2009). Randomization tests were introduced by Fisher (1935, Chapter 21) as a substitute for Student's t-test when normality does not hold, and to restore randomization as "the physical basis of the validity of statistical tests".
An appealing property of randomization tests is that they have exact control of the nominal type I error rate in finite samples without relying on any distributional assumptions. This is particularly attractive in modern statistical applications that involve arbitrarily complex sampling distributions. Recently, there has been a rejuvenated interest in randomization tests in several areas of statistics, including testing associations in genomics (Bates et al., 2020; Efron et al., 2001), testing conditional independence (Candès et al., 2018; Berrett et al., 2020), conformal inference for machine learning methods (Lei et al., 2013; Vovk et al., 2005), analysis of complex experimental designs (Ji et al., 2017; Morgan and Rubin, 2012), evidence factors for observational studies (Karmakar et al., 2019; Rosenbaum, 2010, 2017), and causal inference with interference (Athey et al., 2018; Basse et al., 2019).
Along with its popularity, the term "randomization test" is increasingly used in statistics and its applications, but unfortunately often not to represent what it originally meant.
For example, at the time of writing, Wikipedia redirects "randomization test" to a page titled "Resampling (statistics)" and describes it alongside bootstrapping, jackknifing, and subsampling. The terms "randomization test" and "permutation test" are often used interchangeably, which causes a great deal of confusion. A common belief is that randomization tests rely on certain kinds of group structure or exchangeability (Lehmann and Romano, 2006; Southworth et al., 2009; Rosenbaum, 2017). This has led some authors to categorize randomization tests as a special case of "permutation tests" (Ernst, 2004) or vice versa (Lehmann and Romano, 2006, p. 632). Furthermore, some authors have started to use new and, in our opinion, redundant terminology. An example is the "rerandomization test" (Brillinger et al., 1978; Gabriel and Hall, 1983), which is nothing more than a usual randomization test and is easily confused with the technique called "rerandomization" that is useful for improving covariate balance (Morgan and Rubin, 2012). For a historical clarification of the terminology, we refer the readers to Onghena (2017).
The main objective of this paper is to give a clear-cut formulation of the randomization test, so it can be distinguished from closely related concepts. Our formulation follows the work of Rubin (1980), Rosenbaum (2002), Basse et al. (2019), and many others in causal inference based on the potential outcomes model first conceived by Neyman (1923). In fact, this was also the model adopted in the first line of work on randomization tests by Pitman (1937), Welch (1937), and Kempthorne (1952). However, as the fields of survey sampling and experimental design grew apart after the 1930s, and because randomization tests often take simpler forms under the convenient exchangeability assumption, the popularity of the potential outcomes model dwindled until Rubin (1974) introduced it to observational studies. In consequence, most contemporary statisticians are not familiar with this approach, which gives a more precise characterization of the randomization test and distinguishes it from related ideas. Our discussion below will focus on the conceptual differences and key statistical ideas; detailed implementations and examples of randomization tests (and related tests) can be found in the references in this article and the book by Edgington and Onghena (2007).
The fundamental reason behind this confusing nomenclature is that a randomization test coincides with a permutation test in the simplest example where half of the experimental units are randomized to treatment and half to control. The coincidence is caused by the fact that all permutations of the treatment assignment are equally likely to realize in the assignment distribution. As this is usually the first example in a lecture or article where randomization tests or permutation tests are introduced, it is understandable that many think the basic ideas behind the two tests are the same.
The reconciliation, we believe, lies precisely in the names of these tests. Randomization refers to a physical action that makes the treatment assignment random, while permutation refers to a step of an algorithm that computes the significance level of a test. The former emphasizes the basis of inference, while the latter emphasizes the algorithmic side of inference, so neither "randomization test" nor "permutation test" subsumes the other. In the example above, we can use either term to refer to the same test, but the name "randomization test" is preferable as it provides more information about the context of the problem.
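The coincidence can be verified numerically. The following Python sketch (with hypothetical data and function names of our own choosing) enumerates all assignments that place half of the units in treatment; these are equally likely under complete randomization, so the randomization distribution coincides with the permutation distribution:

```python
import itertools
import statistics

def randomization_p_value(y, z):
    """One-sided randomization p-value for the sharp null of no effect.

    y: observed outcomes; z: 0/1 treatment indicators with half the
    units treated. All assignments with the same number of treated
    units are equally likely under complete randomization, so the
    randomization test below is computed exactly like a permutation test.
    """
    n_treated = sum(z)

    def stat(assign):
        treated = [yi for yi, zi in zip(y, assign) if zi == 1]
        control = [yi for yi, zi in zip(y, assign) if zi == 0]
        return statistics.mean(treated) - statistics.mean(control)

    t_obs = stat(z)
    count = total = 0
    for idx in itertools.combinations(range(len(y)), n_treated):
        assign = [1 if i in idx else 0 for i in range(len(y))]
        total += 1
        if stat(assign) >= t_obs - 1e-12:  # tolerance for float ties
            count += 1
    return count / total
```

For instance, with outcomes (3, 5, 4, 6) and the last two units treated, two of the six equally likely assignments give a mean difference at least as large as the observed one, so the p-value is 2/6.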

Randomization tests vs. Quasi-randomization tests
To further clarify the distinction between randomization tests and permutation tests, we believe it is helpful to introduce a new term, the "quasi-randomization test". It refers to any significance test that is not based on the physical act of randomization; it is exactly the complement of randomization tests. A test can then be characterized in two dimensions: (1) whether it is based on the physical act of randomization; and (2) how it is computed (using permutations, resampling, or distributional models). With this in mind, we can now distinguish two tests that are computationally identical (in the sense that the same acceptance/rejection decision is always reached given the same observed data) based on their underlying assumptions.
To illustrate this, consider a permutation test in the following two scenarios. In the first scenario, an even number of units are paired before being randomized to receive one of two treatments, with exactly one unit in each pair receiving each treatment. In the second scenario, the units are observed (but not randomized) and we pair each unit receiving the first treatment with a different unit receiving the second treatment. To test the null hypothesis that the two treatments do not differ in their effect on any unit, we can permute treatment assignments within the pairs. Although these tests are computationally identical, the same conclusion (e.g. rejecting the null hypothesis) from them may carry very different weight.
The second test relies on the assumption that the two units in the same pair are indeed exchangeable besides their treatment status. This assumption can be easily violated if the units are different in some way; even if they look comparable in every way we can think of now, someone in the future may discover an overlooked distinction. Randomization plays a crucial role in hedging against such a possibility (Kempthorne and Doerfler, 1969; Marks, 2003). In the new terminology we propose, both tests are permutation tests, but the first is a randomization test and the second is a quasi-randomization test. That is, even if a randomization test and a quasi-randomization test are algorithmically identical, they have entirely different inferential bases and thus must be distinguished.
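The shared computation of the two scenarios can be sketched as follows (a minimal Python illustration with a hypothetical function name; the data are not from any study). Under the sharp null, swapping the two units within a pair flips the sign of the pair's treated-minus-control difference, so all sign patterns are enumerated:

```python
import itertools

def paired_permutation_p_value(diffs):
    """Two-sided within-pair permutation p-value.

    diffs: treated-minus-control difference within each pair. Under the
    sharp null of no effect, swapping the two units of a pair flips the
    sign of its difference, so all 2^K sign patterns are enumerated.
    Whether this computation is a randomization test (pairs randomized)
    or a quasi-randomization test (pairs matched after the fact, under a
    within-pair exchangeability assumption) depends on the design, not
    on the code.
    """
    t_obs = abs(sum(diffs))
    count = total = 0
    for signs in itertools.product([1, -1], repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= t_obs:
            count += 1
    return count / total
```

With pair differences (1, 2, 3), only the two all-same-sign patterns reach the observed absolute sum of 6, giving a p-value of 2/8.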
The distinction between a randomization test and a quasi-randomization test is intimately related to causal inference using experimental and observational data. Our nomenclature is motivated by the term "quasi-experiment", coined by Campbell and Stanley (1963) to refer to an observational study that is designed to estimate the causal impact of an intervention.
Since then, this term has been widely used in social science (Cook et al., 2002).

Randomness used in a randomization test
At this point, our answer to the question in the title of this article should already be clear.
A randomization test is precisely what its name suggests: a hypothesis test based on randomization and nothing more than randomization. (In this article we use the terms "significance test" and "hypothesis test" interchangeably. Some authors argued that we should also distinguish a significance test, "as a conclusion or condensation device", from a hypothesis test "as a decision device" (Kempthorne and Doerfler, 1969). This is closely related to the "inductive inference" vs. "inductive behaviour" debate between Fisher and Neyman (Lehmann, 1993).) But what does "based on randomization" exactly mean? To answer this question, it is helpful to consider counterfactual versions of the data. In the causal inference literature, this is known as the Neyman-Rubin model (Holland, 1986), which postulates the existence of a potential outcome (or counterfactual) for every possible realization of the treatment assignment. Broadly speaking, a "treatment assignment" can be anything that is randomized in an experiment (so not necessarily an actual treatment), while the "outcome" includes everything observed after randomization. To clarify the nature of randomization tests, we separate the randomness in data and statistical tests into: (i) randomness introduced by nature in the potential outcomes; (ii) randomness introduced by the experimenter (e.g., drawing balls from an urn); (iii) randomness introduced by the analyst, which is optional.
Using this trichotomy, a randomization test can be understood as a hypothesis test that conditions on the potential outcomes and obtains the sampling distribution (often called the randomization distribution) using the second and third sources of randomness. A randomization test is based solely on the randomness introduced by humans (experimenters and/or analysts), thereby providing a coherent logic of scientific induction (Fisher, 1956).
We would like to make two comments on the definitions above. First, the notion of potential outcomes, first introduced by Neyman in the context of randomized agricultural experiments, was not uncommon in early descriptions of randomization tests. This was often implicit, but the seminal paper by Welch (1937) used potential outcomes to clarify that the randomization test applies to Fisher's sharp null hypothesis rather than Neyman's null hypothesis concerning the average treatment effect. Second, the difference between randomization before and after an experiment is also well recognized (Basu, 1980; Kempthorne and Doerfler, 1969).
Much of the recent literature on randomization tests is motivated by the interference (cross-unit effect) problem in causal inference. A key feature of the interference problem is that the null hypothesis is only "partially sharp", another term we coin in this article to refer to the phenomenon that the potential outcomes are not always imputable under all possible treatment assignments. In consequence, a randomization test that uses all the randomness introduced by the experimenter may be uncomputable. A general solution to this problem is conditioning on some carefully constructed events of the treatment assignment.

An overview of the article
Section 2 investigates a real cluster-randomized controlled trial using (quasi-)randomization tests that are based on different assumptions about the data. Section 3 develops an overarching theory for (conditional) randomization tests by generalizing the classical Neyman-Rubin causal model. The use of the potential outcome notation allows us to give a precise definition of the randomization test. Section 4 then reviews some practical methods to construct conditional randomization tests. Section 5 discusses some quasi-randomization tests in the recent literature, including tests for (conditional) independence and conformal prediction.
Finally, Section 6 concludes the paper with some further discussion.
Notation. We use calligraphic letters for sets, boldface letters for vectors, upper-case letters for random quantities, and lower-case letters for fixed quantities. We use a single integer in a pair of square brackets as shorthand for the indexing set from 1: [N] = {1, . . . , N}. We use a set-valued subscript to denote a sub-vector; for example, Z_S = (Z_i)_{i∈S}.

An illustrative example: The Australia weekend health services disinvestment trial

We first illustrate the conceptual and practical differences of randomization and quasi-randomization tests through a real data example. Haines et al. (2017) reported the results of a cluster randomized controlled trial about the impact of disinvestment from weekend allied health services across acute medical and surgical wards. The trial consisted of two phases: in the first phase, the original weekend allied health service model was terminated, and in the second phase, a newly developed model was instated. The trial involved 12 hospital wards in 2 hospitals in Melbourne, Australia. As our main purpose is to demonstrate the distinction between randomization and quasi-randomization tests, we will focus on the first phase of the trial and the 6 wards in the Dandenong Hospital. The original article investigated a number of patient outcomes; below we will just focus on patient length of stay after a log transformation.

Trial background
A somewhat unusual feature of the design of this trial is that the hospital wards received treatment (no weekend health services) in a staggered fashion. This is often referred to as the "stepped-wedge" design. In the first month of the trial period, all 6 wards received regular weekend health service. In each of the following 6 months, one ward crossed over to the treatment, and the order was randomized at the beginning of the trial. The dataset contains patient-level information including when and where they were hospitalized, their length of stay, and other demographic and medical information. More details about the data can be found in the trial report (Haines et al., 2017). Figure 1 illustrates the stepped-wedge design and shows the mean outcome of each ward in each calendar month; the average (log-transformed) length of stay tends to be higher after the treatment, but more careful analysis is required to decide if such a pattern is statistically meaningful in some way.

Trial analysis via (quasi-)randomization tests
We say a patient is exposed to the treatment if there are no weekend health services when the patient is admitted to a hospital ward. The exposure status is jointly determined by the actual treatment (the crossover order of the wards) and when and where the patient was admitted. This motivates seven permutation tests for the sharp null hypothesis that removing the weekend allied health services has no effect on the length of stay. These tests differ in which variable(s) they permute. In particular, we considered permuting (i) Crossover: the crossover order in which the hospital wards are exposed to the treatment; (ii) Time: the calendar months during which the patients visited the hospital; (iii) Ward: the hospital wards visited by the patients.
As the trial only randomizes the crossover order, only the test permuting crossover qualifies as a randomization test. We considered three test statistics for the permutation tests. The first statistic T_1 is simply the exposed-minus-control difference in the mean outcome; equivalently, this can be obtained by the least-squares estimator for a simple linear regression of log length of stay on exposure status (with intercept). The second statistic T_2 is the estimated coefficient of the exposure status in the linear regression that adjusts for the hospital wards, and the third statistic T_3 further adjusts for the time of hospitalization (in calendar month). It may be useful to know that the adjusted R² of the first linear model is only 1.8%, while the adjusted R² of the second and third models are both about 8.4%.
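For concreteness, the following Python sketch works through a miniature stepped-wedge layout (3 hypothetical wards and 4 months with synthetic outcomes, not the trial data; all function names are ours). It combines a regression-adjusted statistic in the spirit of T_2 with permutations of the crossover order, the only quantity actually randomized:

```python
import itertools
import numpy as np

def t2(y, order):
    """Exposure coefficient in a regression of outcome on exposure and
    ward dummies (a miniature analogue of the statistic T_2 in the text).
    order[w] is the position of ward w in the crossover order: ward w is
    exposed from month order[w] + 1 onwards."""
    n_wards, n_months = len(y), len(y[0])
    rows, resp = [], []
    for w in range(n_wards):
        for m in range(n_months):
            d = int(m >= order[w] + 1)              # exposure indicator
            dummies = [int(w == k) for k in range(n_wards)]
            rows.append([d] + dummies)              # ward dummies absorb the intercept
            resp.append(y[w][m])
    beta, *_ = np.linalg.lstsq(np.array(rows, float),
                               np.array(resp, float), rcond=None)
    return beta[0]

def crossover_p_value(y, obs_order):
    """Randomization p-value permuting only the crossover order."""
    t_obs = t2(y, obs_order)
    stats = [t2(y, o) for o in itertools.permutations(range(len(y)))]
    return sum(s >= t_obs - 1e-9 for s in stats) / len(stats)
```

With 3 wards there are only 3! = 6 crossover orders, so the smallest attainable p-value is 1/6; this mirrors, in miniature, the granularity limit of 1/720 discussed for the actual trial.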

Results: Randomization test vs. quasi-randomization tests
The randomization test that only permutes the crossover order almost always gave the largest p-values and widest confidence intervals. This is perhaps not too surprising given that there are only 6! = 720 permutations in total, so the smallest possible permutation p-value is 1/720 ≈ 0.0014. Using better test statistics (T_2 or T_3) substantially reduces the length of the confidence intervals. On the other hand, the quasi-randomization tests, while generally giving smaller p-values, were quite sensitive to the choice of the test statistic.
In particular, for the four quasi-randomization tests that permute ward (alone or together with other variables), the confidence intervals based on T_1 and T_2 do not overlap. The same phenomenon also occurs with the normal linear model (last row of the table).
To better understand the difference between randomization and quasi-randomization tests, Figure 2 shows the (quasi-)randomization distributions of the test statistics. Each p-value in Table 1 is calculated by computing the corresponding area of the (quasi-)randomization distribution (the black curve) above the observed statistic (the dashed red line). A small p-value means that the observed test statistic is more extreme than the test statistic on the permuted data for most of the permutations considered, which is strong evidence of the positive treatment effect on the log length of stay conjectured in Figure 1.
Notably, in the top row of Figure 2, the randomization distribution of T_1 is quite flat, indicating low power. In contrast, the distributions of T_2 and T_3 have sharper peaks.
Interestingly, the randomization distribution of T_2 (which adjusts for ward) is clearly not centred at 0. This is due to a general upward trend in the length of stay over the course of this trial, as shown in Figure 1. Since the wards gradually crossed over to the treatment group, this trend confounds the causal effect under investigation. Thus, when the crossover order and/or hospital ward are permuted, the exposure status would have a positive coefficient in the linear model that does not adjust for time.
In the other rows of Figure 2, the distributions of T_1 and T_2 are centred at different places depending on which variables are permuted. This explains why inverting the quasi-randomization tests based on T_1 and T_2 gives non-overlapping confidence intervals in Table 1.
Because T_3 adjusts for both time and ward, its permutation distributions are much less affected by permuting time and ward, so the results of the quasi-randomization tests are very close to the randomization test (top row in that column), which only permutes the crossover order.

Recap
The real data example gives a clear demonstration of how the interpretation of a permutation test depends on the randomness it tries to exploit. The distinction between randomization and quasi-randomization tests is therefore of great practical importance.

A general theory for randomization tests
Next, we provide a general framework for (conditional) randomization tests by formalizing and generalizing what has become "folklore" in causal inference after the strong advocacy by Rubin (1980) and Rosenbaum (2002). Many quasi-randomization tests in the literature (such as tests of independence and conformal prediction) do not test a causal hypothesis but still fall within our framework by conceiving an imaginary randomization (e.g. through an i.i.d. or exchangeability assumption). Importantly, it is often helpful to construct artificial "potential outcomes" in those problems to fully understand the underlying assumptions; see Section 5 for some examples.

Potential outcomes and randomization
Consider an experiment on N units in which a treatment variable Z ∈ Z is randomized.
We use boldface Z to emphasize that the treatment Z is usually multivariate. In most experiments, Z = (Z_1, . . . , Z_N) collects a common attribute of the experimental units (e.g., whether a drug is administered), but this is not always the case, and the dimension of the treatment variable Z is not important in the general theory. For example, in the Australia weekend health services disinvestment trial in Section 2, the treatment Z (the crossover order) is randomized at the ward level, while the outcomes (length of stay) are observed at the patient level; Z ranges over the 6! = 720 permutations of the wards. What the theory below requires is that (i) Z is randomized in an exogenous way by the experimenter (e.g., using a random number generator); (ii) the distribution of Z is known (often called the treatment assignment mechanism); and (iii) one can reasonably define or conceptualize the potential outcomes of the experimental units under different treatment assignments.
To formalize these requirements, we adopt the potential outcome (also called the Neyman-Rubin or counterfactual) framework for causal inference (Holland, 1986; Neyman, 1923; Rubin, 1974). In this framework, unit i has a real-valued potential outcome (or counterfactual) Y_i(z) for every treatment assignment z ∈ Z. We assume the observed outcome (or factual outcome) for unit i is given by Y_i = Y_i(Z), where Z is the realized treatment assignment.
This is often referred to as the consistency assumption in the causal inference literature. In our running example, Y_i(z) is the (potential) length of stay of patient i had the crossover order of the wards been z. When the treatment Z is an N-vector, the no interference assumption is often invoked to reduce the number of potential outcomes; this essentially says that Y_i(z) only depends on z through z_i. However, our theory does not rely on this assumption but treats it as part of the sharp null hypothesis introduced below.
It is convenient to introduce some vector notation for the potential and realized outcomes. Let W = (Y_i(z))_{i∈[N], z∈Z} ∈ W collect all the potential outcomes (which are random variables defined on the same probability space as Z). We will call W the potential outcomes schedule, following the terminology in Freedman (2009). This is also known as the science table in the literature (Rubin, 2005). It may be helpful to view the potential outcomes Y(z) as a (vector-valued) function from Z to Y; in this sense, W consists of all functions from Z to Y.
Using this notation, the following assumption formally defines a randomized experiment.
Assumption 1 (Randomized experiment). Z ⊥⊥ W and the density function π(·) of Z (with respect to some reference measure on Z) is known and positive everywhere.
We write the conditional distribution of Z given W in Assumption 1 as Z | W ∼ π(·). This assumption formalizes the requirement that Z is randomized in an exogenous way.
Intuitively, the potential outcomes schedule W is determined by the nature of the experimental units. Since Z is randomized by the experimenter, it is reasonable to assume that Z ⊥⊥ W.
In many experiments, the treatment is randomized according to some other observed covariates X (e.g., characteristics of the units or some observed network structure on the units). This can be dealt with by assuming Z ⊥⊥ W | X in Assumption 1 instead. Notice that in this case the treatment assignment mechanism π may depend on X. To simplify the exposition, unless otherwise mentioned we will simply treat X as fixed, so Z ⊥⊥ W is still true (in the conditional probability space with X fixed at the observed value).
For the rest of this article, we will assume Z is discrete, so the assignment space Z (e.g., Z = {0, 1}^N) is finite.

Partially sharp null hypotheses
As Holland (1986) pointed out, the fundamental problem in causal inference is that only one potential outcome can be observed for each unit under the consistency assumption. To overcome this problem, additional assumptions on W beyond randomization (Assumption 1) must be placed. In randomization inference, the required additional assumptions are (partially) sharp null hypotheses that relate different potential outcomes.

A typical (partially) sharp null hypothesis assumes that certain potential outcomes are equal or related in certain ways. As a concrete example, the no interference assumption assumes that Y_i(z) = Y_i(z*) whenever z_i = z*_i. This allows one to simplify the notation and write the potential outcomes as Y_i(z_i). The no treatment effect hypothesis (often referred to as Fisher's sharp or exact null hypothesis) further assumes that Y_i(z_i) = Y_i(z*_i) for all i and z_i, z*_i. When the treatment of each unit is binary (i.e., Z_i is either 0 or 1), under the no interference assumption we may also consider the null hypothesis that the treatment effect is equal to a constant τ, that is, H_0: Y_i(1) − Y_i(0) = τ for all i. Under the consistency assumption, this allows us to impute the potential outcomes as Y_i(1) = Y_i + (1 − Z_i)τ and Y_i(0) = Y_i − Z_i τ.

More abstractly, a (partially) sharp null hypothesis H defines a number of relationships between the potential outcomes. Each relationship allows us to impute some of the potential outcomes if another potential outcome is observed (through consistency). We can summarize these relationships using a set-valued mapping:

Definition 1. A partially sharp null hypothesis H defines an imputability mapping H: Z × Z → 2^[N], where i ∈ H(z, z*) if and only if Y_i(z*) can be imputed from the observed data when the realized assignment is z.

If we assume no interference and no treatment effect, all the potential outcomes are observed or imputable regardless of the realized treatment assignment z, so H(z, z*) = [N] for all z, z* ∈ Z. In this case, we call H a fully sharp null hypothesis. In more sophisticated problems, H(z, z*) may depend on z and z* in a nontrivial way, and we call such a hypothesis partially sharp. The concept of imputability has appeared before in Basse et al. (2019) and Puelz et al. (2019), though imputability was tied to test statistics under a hypothesis (see Definition 4 below) instead of the hypothesis itself.
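The imputation under the constant-effect null H_0: Y_i(1) − Y_i(0) = τ can be sketched in a few lines of Python (the function name is ours; the data are illustrative). Because every missing potential outcome is recovered from the observed one, this null is fully sharp:

```python
def impute_potential_outcomes(y_obs, z, tau):
    """Impute both potential outcomes under the sharp null
    H0: Y_i(1) - Y_i(0) = tau, assuming consistency and no interference.

    y_obs: observed outcomes; z: 0/1 treatment indicators.
    Returns (y0, y1), the imputed control and treated outcomes:
    Y_i(0) = Y_i - Z_i * tau and Y_i(1) = Y_i + (1 - Z_i) * tau.
    """
    y0 = [y - zi * tau for y, zi in zip(y_obs, z)]
    y1 = [y + (1 - zi) * tau for y, zi in zip(y_obs, z)]
    return y0, y1
```

For example, with observed outcomes (5, 7), assignment (1, 0), and τ = 2, the imputed schedules are Y(0) = (3, 7) and Y(1) = (5, 9).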
In the Australia trial example (Section 2), we made an implicit "no interference" type assumption when obtaining the confidence intervals: we assumed that Y_i(z) only depends on z through the implied binary exposure status D_i(z) of the ith patient under the crossover order z.
That is, given when and where a patient is admitted, the patient's potential outcome only depends on whether that ward has already crossed over to the treatment group according to z. This would be violated when there is interference between the wards or the effect of ending the weekend health services is time-varying. This assumption allows us to abbreviate the potential outcomes as Y_i(d) for d ∈ {0, 1}. In the permutation tests, test statistics were computed using the three linear models in Section 2 with the shifted outcomes Y_i − D_i(Z)τ, i = 1, . . . , N. The permutation tests were subsequently inverted to obtain confidence intervals for τ.
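The inversion step can be sketched as follows (a minimal Python illustration with illustrative data and function names of our own; a simple difference-in-means statistic stands in for the regression statistics of Section 2). For each candidate τ, the shifted outcomes are tested for no effect, and the confidence set collects the values of τ that are not rejected:

```python
import itertools
import statistics

def shifted_p_value(y, z, tau):
    """Two-sided randomization p-value for H0: constant effect tau,
    obtained by testing no effect on the shifted outcomes y_i - z_i*tau."""
    y_shift = [yi - zi * tau for yi, zi in zip(y, z)]
    n1 = sum(z)

    def stat(assign):
        t = [yi for yi, zi in zip(y_shift, assign) if zi]
        c = [yi for yi, zi in zip(y_shift, assign) if not zi]
        return abs(statistics.mean(t) - statistics.mean(c))

    t_obs = stat(z)
    assigns = list(itertools.combinations(range(len(y)), n1))
    hits = 0
    for idx in assigns:
        a = [int(i in idx) for i in range(len(y))]
        if stat(a) >= t_obs - 1e-12:
            hits += 1
    return hits / len(assigns)

def invert_to_ci(y, z, grid, alpha):
    """Confidence set for tau: all grid values not rejected at level alpha."""
    return [t for t in grid if shifted_p_value(y, z, t) > alpha]
```

Note that with only 4 units there are just 6 assignments, so the smallest attainable p-value is 1/3 and a large α must be used purely for illustration; this is the same granularity issue as the 1/720 bound in the trial analysis.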

Conditional randomization tests (CRTs)
To test a (partially) sharp null hypothesis, a randomization test compares an observed test statistic with its randomization distribution, which is given by the value of the statistic under a random treatment assignment. However, it may be impossible to compute the entire randomization distribution when some potential outcomes are not imputable (i.e., when H(z, z*) is smaller than [N]). To tackle this issue, we confine ourselves to a smaller set of treatment assignments. This is formalized in the next definition.
Definition 2. A conditional randomization test (CRT) for a treatment Z is defined by (i) a countable partition R = {S_m}_{m=1}^∞ of the assignment space Z, where S_z denotes the element of the partition that contains z; and (ii) a collection of test statistics T_z(·, ·): Z × W → R indexed by z ∈ Z, where S_z and T_z depend on z only through the element of R containing z.

Definition 3. The p-value of the CRT in Definition 2 is given by

P(Z, W) = P*( T_Z(Z*, W) ≥ T_Z(Z, W) | Z* ∈ S_Z, W ),   (1)

where Z* is an independent copy of Z conditional on W and the notation P* is used to emphasize that the probability is taken over Z*.
Because Z ⊥⊥ W (Assumption 1), Z* is independent of Z and W and Z* ∼ π(·). The invariance property in Lemma 1 is important because it ensures that, when computing the p-value, the same conditioning set is used for all the assignments within it. By using the equivalence relation ≡_R defined by the partition R, we can rewrite (1) as

P(Z, W) = P*( T_Z(Z*, W) ≥ T_Z(Z, W) | Z* ≡_R Z, W ).   (2)

When S_z = Z for all z ∈ Z, this reduces to an unconditional randomization test.

Notice that T_Z(Z*, W) generally depends on some unobserved potential outcomes in W. Thus the p-value (1) may not be computable if the null hypothesis does not make enough restrictions on how T_Z(Z*, W) depends on W. By using the imputability mapping H, we say the p-value is computable if it depends on W only through potential outcomes that are observed or imputable under the null hypothesis; in that case we write it as P(Z, Y).

Theorem 1. Given Assumption 1 and a partially sharp null hypothesis H, if P(Z, W) is computable, then

P( P(Z, W) ≤ α | W ) ≤ α for all 0 ≤ α ≤ 1.   (3)

Note that by marginalizing (3) over the potential outcomes schedule W, we obtain P( P(Z, W) ≤ α ) ≤ α. The conditional statement (3) is stronger as it means that the type I error is always controlled for the given sample; beyond Assumption 1, no assumptions are required about the sample.

Nature of conditioning
To construct the partition R, one common approach is to condition on a function of Z, that is, a random variable G = g(Z) generated by Z. This idea is formalized by the next result, which immediately follows by defining the equivalence relation z* ≡_R z when g(z*) = g(z).
Tests of this form have appeared before in Zheng and Zelen (2008) and Hennessy et al. (2016) to deal with covariate imbalance in randomized experiments. As an example, consider a Bernoulli trial with 10 females (units i = 1, . . . , 10) and 10 males (units i = 11, . . . , 20). Let g(z) = Σ_{i=1}^{10} z_i denote the number of treated females in any assignment z. Suppose the realized randomization has only 2 treated females, that is, g(Z) = 2, due to chance.
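A conditional randomization p-value of this form can be sketched as follows (a minimal Python illustration with a tiny hypothetical design of 2 females and 2 males rather than 20 units, so that the conditioning event can be enumerated; the function name is ours). Under Bernoulli(1/2) randomization all assignments are equally likely, so they remain uniform within the event {g(Z) = g(z_obs)}:

```python
import itertools
import statistics

def crt_p_value(y, z_obs, female_idx):
    """Conditional randomization p-value for a Bernoulli(1/2) trial,
    conditioning on G = g(Z) = number of treated units in female_idx.
    We additionally condition on both arms being nonempty (also a
    function of Z, hence a legitimate conditioning event) so that the
    difference-in-means statistic is defined."""
    n = len(y)

    def g(z):
        return sum(z[i] for i in female_idx)

    def stat(z):
        t = [y[i] for i in range(n) if z[i] == 1]
        c = [y[i] for i in range(n) if z[i] == 0]
        return statistics.mean(t) - statistics.mean(c)

    cell = [z for z in itertools.product([0, 1], repeat=n)
            if g(z) == g(z_obs) and 0 < sum(z) < n]
    t_obs = stat(z_obs)
    return sum(stat(z) >= t_obs - 1e-12 for z in cell) / len(cell)
```

With outcomes (1, 2, 3, 4), females indexed by {0, 1}, and observed assignment (1, 0, 1, 0), the conditioning event contains 8 assignments, of which 7 give a statistic at least as large as the observed one.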
More generally, one can consider a measure-theoretic formulation of conditional randomization tests. For example, let G = σ({S_m}_{m=1}^∞) be the σ-algebra generated by the conditioning events in (1). Because {S_m}_{m=1}^∞ is a partition, G consists of all countable unions of {S_m}_{m=1}^∞. This allows us to rewrite (2) as

P(Z, W) = P*( T_Z(Z*, W) ≥ T_Z(Z, W) | G, W ),

which is equivalent to conditioning on the element of the partition that contains Z. This measure-theoretic formulation is useful for extending the theory above to continuous treatments and for studying the structure of conditioning events in multiple CRTs, which will be considered in a separate article.

Post-randomization
In many problems, there are several ways to construct the conditioning event/variable; see, e.g., Section 4. In such a situation, a natural idea is to post-randomize the test.
Consider a collection of CRTs, defined by conditioning sets S_z(v) and test statistics T_z^v, that are indexed by v ∈ V where V is countable. In the example of the bipartite graph representation introduced below in Section 4.3, v can be a biclique decomposition of the graph. Each v defines a p-value

P(Z, W; v) = P*( T_Z^v(Z*, W) ≥ T_Z^v(Z, W) | Z* ∈ S_Z(v), W ),   (4)

where Z* is an independent copy of Z, and we may use a random value P(Z, W, V) where V is drawn by the analyst and thus independent of (Z, W). It immediately follows from Theorem 1 that this defines a valid test in the sense that P( P(Z, W, V) ≤ α ) ≤ α. A more general viewpoint is that we may condition on a random variable G = g(Z, V) that depends on not only the randomness introduced by the experimenter in Z but also the randomness introduced by the analyst in V. Proposition 1 can then be generalized in a straightforward way. The construction below is inspired by Bates et al. (2020) (who were concerned with genetic mapping) and personal communications with Stephen Bates.
As above, suppose V ⊥ ⊥ (Z, W ) and G has a countable support.Because G is generated by Z, the conditional distribution of G given Z is known.Let π(• | g) be the density function of Z given G = g, which can be obtained from Bayes' formula: , where π is the density of Z with respect to some reference measure µ on Z.Let T g (•, W ) be the test statistic that is now indexed by g in the support of G.The post-randomized p-value is then defined as where the probability is taken over and the randomized p-value can be written as Similar to above, we say P (Z, W ; g) is computable if it is a function of Z and Y under the null hypothesis and write it as P (Z, Y ; g).Theorem 2. Under the setting above, the randomized CRT is valid in the following sense Theorem 2 generalizes several results above.Theorem 1 is essentially a special case where G = S Z is a set.Proposition 1 is also a special case where G = g(Z) is not randomized.
Finally, equation (4) amounts to conditioning on the post-randomized set G = S_Z(V).
Theorem 2 also generalizes a similar result in Basse et al. (2019, Theorem 1) by allowing post-randomization and by not requiring imputability of the test statistic. In other words, imputability only affects whether the p-value can be computed from the observed data; it is not necessary for the validity of the p-value.
An alternative to a single post-randomized test is to average the p-values over different realizations of G. It can be shown that the average of these p-values is valid up to a factor of 2, in the sense that the type I error is upper bounded by 2α if the null is rejected when the average p-value is less than α (Rüschendorf, 1982; Vovk and Wang, 2020). This strategy may be useful when post-randomization gives rise to a large variance.
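The twice-the-average rule can be sketched in a few lines; the helper name and the illustrative p-values below are ours, not from the literature.

```python
def averaged_p_value(p_values):
    """Combine p-values from different realizations of G by averaging.

    By Rueschendorf (1982) and Vovk and Wang (2020), twice the arithmetic
    mean of arbitrarily dependent valid p-values is again a valid p-value,
    so rejecting when 2 * mean <= alpha controls the type I error at alpha.
    """
    return min(1.0, 2.0 * sum(p_values) / len(p_values))

# Illustration: p-values from five hypothetical post-randomized CRTs.
p_vals = [0.012, 0.030, 0.008, 0.045, 0.020]
combined = averaged_p_value(p_vals)
```

Note that the combined p-value is never smaller than twice the average, so this route trades some power for robustness to the dependence among the realizations.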

Practical methods of CRTs
This section summarizes some practical methods to construct computable and powerful tests from the causal interference literature (Aronow and Samii, 2017; Athey et al., 2018; Basse et al., 2019; Bowers et al., 2013; Hudgens and Halloran, 2008; Li et al., 2019; Puelz et al., 2019). The following example provides some context. Consider an experiment that displays an advertisement (or nothing) to the users of a social network, where we would like to test whether displaying the advertisement to a user has a spillover effect on their friends. Each user thus has one of three exposures: directly seeing the advertisement ("treated"), having a friend who sees the advertisement ("spillover"), or having no direct or indirect exposure to the advertisement ("control"). Because the null hypothesis of no spillover effect only relates the potential outcomes under spillover and control, the outcomes of the treated users (denote this collection by I_Z ⊂ [N]) provide no information about the hypothesis; in other words, I_Z ∩ H(Z, z*) = ∅ for all z*. Hence a test statistic T(Z, W) is imputable only if it does not depend on the potential outcomes of the users in I_Z, a set that changes with Z. This makes it difficult to construct an imputable test statistic.

Intersection method
Often, the test statistic of a CRT depends on the potential outcomes only through those corresponding to the (counterfactual) treatment, taking the form T_z(z*, W) = T_z(z*, Y(z*)). The fundamental challenge is that only a sub-vector Y_{H(z,z*)}(z*) of Y(z*) is imputable under H. A natural idea is to use only the imputable potential outcomes.
Then, under Assumption 1, the partition R and the test statistics (T_m(z, Y_{H_m}(z)))_{m=1}^M define a computable p-value.
However, the CRT in Proposition 2 would be powerless if H_m is empty. More generally, the power of the CRT depends on the sizes of S_m and H_m, and there is an important trade-off: with a coarser partition R, the CRT can utilize a larger subset S_m of treatment assignments but a smaller subset H_m of experimental units. In many problems, choosing a good partition R is nontrivial. In such cases, it may be helpful to impose some structure on the imputability mapping H(z, z*).
Definition 6. A partially sharp null hypothesis H is said to have a level-set structure with respect to a collection of exposure functions D_i : Z → D, i = 1, . . ., N, if D is countable and H(z, z*) = {i ∈ [N] : D_i(z) = D_i(z*)}. The imputability mapping is then defined by the level sets of the exposure functions.
This would occur if, for example, the null hypothesis only specifies the treatment effect between two exposure levels, as in our social network advertisement example. Definition 6 is inspired by Athey et al. (2018, Definition 3), but the concept of an exposure mapping can be traced back to Aronow and Samii (2017), Manski (2013), and Ugander et al. (2013).
An immediate consequence of the level-set structure is that H(z, z*) is symmetric, that is, H(z, z*) = H(z*, z).
Moreover, by using the level-set structure, we can write H_m in Proposition 2 as H_m = {i ∈ [N] : D_i(z) = D_i(z*) for all z, z* ∈ S_m}, the set of units whose exposure is constant over S_m. This provides a way to choose the test statistic once the partition R = {S_m}_{m=1}^M is given.

Focal units
We may also proceed in the other direction and choose the experimental units first, fixing a set of "focal units" in advance; see (7) below. In the social network example, H(z, z*) is the subset of users who do not receive the advertisement directly under both z and z*, which is slightly more restrictive than (5). The "conditional focal units" H_m in (6) can then be written as the fixed focal set for all m. The key insight of Puelz et al. (2019) is that the condition in (9) can be visualized using a bipartite graph.

Examples of (quasi-)randomization tests
Next, we examine some randomization and quasi-randomization tests proposed in the literature. These examples not only demonstrate the generality and usefulness of the theory above but also help clarify concepts and terminology related to randomization tests.

Fisher's exact test
Fisher's exact test is perhaps the simplest (quasi-)randomization test. In our notation, let Z ∈ {0, 1}^N be the treatment assignment for N units and Y_i(0), Y_i(1) ∈ {0, 1} be the potential outcomes of each unit i (so the no-interference assumption is made). We are interested in testing the hypothesis that the treatment has no effect whatsoever. Because both the treatment Z_i and the outcome

Permutation tests for treatment effect
In a permutation test, the p-value is obtained by computing the test statistic under all allowed permutations of the observed data points. As argued in the Introduction, the name "permutation test" emphasizes the algorithmic perspective of the statistical test and is thus not synonymous with "randomization test".
In the context of testing a treatment effect, a permutation test is essentially a CRT that uses the conditioning sets (supposing Z is a vector of length N) S_z = {z_g : g ∈ Ω_N}, where Ω_N is the set of all N! permutations of [N] and z_g = (z_{g(1)}, . . ., z_{g(N)}). In view of Proposition 1, a permutation test is a CRT that conditions on the order statistics of Z. In permutation tests, the treatment assignments are typically assumed to be exchangeable, (Z_1, . . ., Z_N) =_d (Z_{g(1)}, . . ., Z_{g(N)}) for all permutations g of [N], so each permutation of Z has the same probability of being realized under the treatment assignment mechanism π(·). See Kalbfleisch (1978) for a formulation of rank-based tests based on marginal and conditional likelihoods.
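A Monte Carlo version of this permutation test can be sketched as follows, assuming a completely randomized binary treatment and the difference in means as the test statistic; all names here are illustrative, and the "+1" correction keeps the randomized p-value exactly valid.

```python
import random

def permutation_p_value(z, y, n_perm=999, rng=None):
    """Monte Carlo permutation p-value for a treatment effect.

    Test statistic: difference in mean outcomes between treated (z = 1)
    and control (z = 0) units. Under exchangeability, every permutation
    z_g of the observed assignment is equally likely, so we compare the
    observed statistic with its permutation distribution.
    """
    rng = rng or random.Random(0)

    def stat(assign):
        treated = [yi for zi, yi in zip(assign, y) if zi == 1]
        control = [yi for zi, yi in zip(assign, y) if zi == 0]
        return sum(treated) / len(treated) - sum(control) / len(control)

    t_obs = stat(z)
    count = sum(
        stat(rng.sample(z, len(z))) >= t_obs  # random permutation of z
        for _ in range(n_perm)
    )
    return (1 + count) / (1 + n_perm)
```

With 999 permutations the smallest attainable p-value is 1/1000, so n_perm should be chosen with the target significance level in mind.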
The same notation P(Z, Y) is used in (13) as it is algorithmically identical to a conditional randomization test given the order statistics of Z. So it appears that the same test can be used to solve a different, non-causal problem; after all, no counterfactuals are involved in testing independence. In fact, Lehmann (1975) referred to the causal inference problem as the randomization model and the independence testing problem as the population model. Ernst (2004) and Hemerik and Goeman (2020) argued that the reasoning behind these two models is different.
However, a statistical test tests not only the null hypothesis H_0 but also any assumptions needed to set up the problem. For example, the CRT described in Section 3 tests not only the absence of a treatment effect but also the assumption that the treatment is randomized (Assumption 1). Due to the physical randomization, however, we can treat the latter as given.
Conversely, in independence testing we may artificially define the potential outcomes as Y_i(z) = Y_i for all z ∈ Z. Rather than distinguishing the two problems according to the type of "model" (randomization or population), we believe the more fundamental difference is the nature of the randomness used in each test. In testing a treatment effect, inference is based entirely on the randomness introduced by the experimenter and is thus a randomization test. In testing independence, inference is based on the permutation principle (12), which follows from a theoretical model, so the same permutation test is a quasi-randomization test in our terminology.
Recently, there has been growing interest in using quasi-randomization tests for conditional independence (Berrett et al., 2020; Candès et al., 2018; Katsevich and Ramdas, 2020; Liu et al., 2020). Typically, it is assumed that we have independent and identically distributed observations (Z_1, Y_1, X_1), . . ., (Z_n, Y_n, X_n) and would like to test H_0 : Z_i ⊥⊥ Y_i | X_i. This can easily be incorporated in our framework by treating X = (X_1, . . ., X_n) as fixed; see the last paragraph of Section 3.1. In this case, the quasi-randomization distribution of Z = (Z_1, . . ., Z_n) is given by the conditional distribution of Z given X, and it is straightforward to construct a quasi-randomization p-value (see, e.g., Candès et al., 2018, Section 4.1). Berrett et al. (2020) extended this test by further conditioning on the order statistics of Z, resulting in a permutation test.
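A minimal sketch of such a quasi-randomization p-value, under the illustrative assumption that the conditional law Z_i | X_i ~ N(X_i, 1) is known; the function name, the test statistic, and this conditional model are ours, chosen only to make the resampling idea concrete.

```python
import random

def model_x_crt_p_value(z, y, x, n_resample=499, rng=None):
    """Quasi-randomization test of H0: Z independent of Y given X.

    Under H0 and the assumed conditional model Z_i | X_i ~ N(X_i, 1),
    we may resample fresh copies Z* from this law, holding (X, Y)
    fixed, and compare the observed statistic with the resampled ones.
    """
    rng = rng or random.Random(0)
    n = len(z)

    def stat(zz):
        # absolute sample covariance between zz and y
        zbar = sum(zz) / n
        ybar = sum(y) / n
        return abs(sum((a - zbar) * (b - ybar) for a, b in zip(zz, y)) / n)

    t_obs = stat(z)
    count = 0
    for _ in range(n_resample):
        z_star = [rng.gauss(xi, 1.0) for xi in x]  # draw from Z | X
        if stat(z_star) >= t_obs:
            count += 1
    return (1 + count) / (1 + n_resample)
```

The validity of this p-value rests entirely on the assumed conditional model of Z given X, which is exactly why we call it a quasi-randomization test.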
As a remark on the terminology, the test in the last paragraph was referred to as the "conditional randomization test" by Candès et al. (2018) because the test is conditional on X.
However, such conditioning is fundamentally different from post-experimental conditioning (such as conditioning on S_Z), which is what distinguishes conditional from unconditional randomization tests in Section 3. When Z is randomized according to X, conditioning on X is mandatory in randomization inference because the inference must use the randomness introduced by the experimenter. On the other hand, further conditioning on S_Z, or more generally on G = g(Z, V) as in Section 3.5, is introduced by the analyst to improve power or to make the p-value computable. For this reason, we think it is best to refer to the test in Candès et al. (2018) as an unconditional quasi-randomization test and the permutation test in Berrett et al. (2020) as a conditional quasi-randomization test.

Covariate imbalance and adaptive randomization
Morgan and Rubin (2012) proposed to rerandomize the treatment assignment if some baseline covariates are not well balanced. Li and Ding (2019) showed that the asymptotic distribution of standard regression-adjusted estimators under this design is a mixture of a normal and a truncated normal distribution, whose variance is never larger than that under the standard completely randomized design.
Notice that the meaning of "rerandomization" here is completely different from that in "rerandomization test" (Brillinger et al., 1978; Gabriel and Hall, 1983), which emphasizes the Monte Carlo approximation to a randomization test. The key insight of Morgan and Rubin (2012) is that the experiment should then be analyzed with the rerandomization taken into account. More specifically, rerandomization is simply a rejection sampling algorithm for randomly choosing Z from the subset Z = {z : g(z) ≤ η}, where g(z) measures the covariate imbalance implied by the treatment assignment z and η is the experimenter's tolerance for covariate imbalance. Therefore, we simply need to use the randomization distribution over this subset to carry out the randomization test. The same reasoning applies to other adaptive trial designs; see Rosenberger et al. (2018) and references therein.
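The rejection-sampling view can be sketched as follows; the absolute difference in covariate means is a simplified stand-in for the Mahalanobis-distance imbalance measure used by Morgan and Rubin (2012), and the function names are ours.

```python
import random

def rerandomize(x, n_treat, imbalance_tol, rng=None, max_tries=10000):
    """Rejection-sampling sketch of rerandomization.

    Repeatedly draw a completely randomized assignment of n_treat units
    and accept it only when the covariate imbalance g(z) (here the
    absolute difference in covariate means between arms) is at most the
    tolerance eta = imbalance_tol.
    """
    rng = rng or random.Random(0)
    n = len(x)

    def imbalance(z):
        t = [xi for zi, xi in zip(z, x) if zi == 1]
        c = [xi for zi, xi in zip(z, x) if zi == 0]
        return abs(sum(t) / len(t) - sum(c) / len(c))

    for _ in range(max_tries):
        z = [1] * n_treat + [0] * (n - n_treat)
        rng.shuffle(z)
        if imbalance(z) <= imbalance_tol:
            return z
    raise RuntimeError("imbalance tolerance too strict")
```

The randomization test then resamples Z* by calling the same accept/reject procedure, so the reference distribution is automatically restricted to the accepted set {z : g(z) ≤ η}.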
specific task: testing a sharp null hypothesis. Thus, in a randomized experiment, there seems to be no reason not to report the result of a randomization test (for more discussion, see Rosenberger et al., 2018; Rubin, 1980). Quasi-randomization and model-based tests may have better power, but they are sensitive to model misspecification and to violations of any assumption involved in justifying the quasi-randomization (such as exchangeability); careful arguments are thus needed to justify their use. When no randomization test is available (e.g., when constructing prediction intervals without experimental data), the choice between quasi-randomization and model-based tests rests on how reasonable the theoretical assumptions are in the practical problem at hand and how easily the tests can be computed.

Figure 1: Australia weekend health services disinvestment trial: a 7-month stepped-wedge design with monthly ward-level mean outcomes. The six hospital wards are indexed by A, B, C, D, E, and F.
as a randomization test according to our definition in the Introduction. All six other tests, which involve permuting other variables, are instances of quasi-randomization tests in our terminology because their permuted variables were not randomized in the trial. The quasi-randomization tests may have smaller p-values than the randomization test, but their validity requires an exchangeability assumption that may not hold. For instance, it is questionable to permute the admission times if the patients have seasonal diseases, or to permute the hospital wards if the wards have different specialties (which is indeed the case in this trial).

Figure 2: Australia weekend health services disinvestment trial: (quasi-)randomization distributions of three different test statistics.

(quasi-)randomization tests lead to practically different conclusions. Using a randomization test protects against model misspecification and allows us to safely take advantage of a better model in the test statistic without sacrificing validity or making additional assumptions. In contrast, quasi-randomization tests and tests based on the normal linear model are sensitive to model misspecification and tend to overstate statistical significance.
S_1, . . ., S_M are disjoint subsets of Z satisfying Z = S_1 ∪ · · · ∪ S_M; and (ii) a collection of test statistics (T_m(·, ·))_{m=1}^M, where T_m : Z × W → R is a real-valued function that computes a test statistic for each realization of the treatment assignment Z given the potential outcomes schedule W. Methods to construct R and examples will be discussed in the following sections.

Any partition R defines an equivalence relation ≡_R and vice versa, so S_1, . . ., S_M are simply the equivalence classes generated by ≡_R. With an abuse of notation, we let S_z ∈ R denote the equivalence class containing z. For any z ∈ S_m, we thus have S_z = S_m and T_z(·, ·) = T_m(·, ·). This notation is convenient because the p-value of the CRT defined below conditions on Z* ∈ S_z when we observe Z = z. The following property follows immediately from the fact that ≡_R is an equivalence relation:

Lemma 1 (Invariance of conditioning sets and test statistics). For any z ∈ Z and z* ∈ S_z, we have z ∈ S_{z*}, S_{z*} = S_z, and T_{z*}(·, ·) = T_z(·, ·).
Definition 1; this is formalized in the next definition.

Definition 4. Consider a CRT defined by the partition R = {S_m}_{m=1}^M and test statistics (T_m(·, ·))_{m=1}^M. We say the test statistic T_z(·, ·) is imputable under a partially sharp null hypothesis H if, for all z* ∈ S_z, T_z(z*, W) depends on the potential outcomes schedule W = (Y(z) : z ∈ Z) only through its imputable part Y_{H(z,z*)}(z*).

Lemma 2. Suppose Assumption 1 is satisfied and T_z(·, ·) is imputable under H for all z ∈ Z. Then the p-value P(Z, W) only depends on Z and Y.

Definition 5. Under the assumptions in Lemma 2, we say the p-value is computable under H and denote it, with an abuse of notation, by P(Z, Y).

Given a computable p-value, the CRT rejects the null hypothesis H at significance level α ∈ [0, 1] if P(Z, Y) ≤ α. The next theorem establishes the validity of this test.

Theorem 1. Consider a CRT defined by the partition R = {S_m}_{m=1}^M and test statistics (T_m(·, ·))_{m=1}^M. Then the p-value P(Z, W) is valid in the sense that it stochastically dominates the uniform distribution on [0, 1]: Pr(P(Z, W) ≤ α) ≤ α for all α ∈ [0, 1].
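For a finite assignment set with known design probabilities (Assumption 1), the conditional p-value of Theorem 1 can be sketched as below; the function and argument names are ours, and the test statistic is passed in as a plain function of the assignment.

```python
def crt_p_value(z_obs, support, pi, in_same_cell, stat):
    """Conditional randomization test p-value (cf. Theorem 1).

    support: list of possible assignments z*;
    pi: dict mapping each assignment to its known design probability;
    in_same_cell(z, z*): whether z* lies in the conditioning set S_z;
    stat(z*): the test statistic T_z(z*, W) as a plain function.

    Returns Pr(T(Z*) >= T(z_obs) | Z* in S_{z_obs}), i.e. the tail
    probability of the design distribution restricted to S_{z_obs}
    and renormalized.
    """
    cell = [zs for zs in support if in_same_cell(z_obs, zs)]
    total = sum(pi[zs] for zs in cell)
    tail = sum(pi[zs] for zs in cell if stat(zs) >= stat(z_obs))
    return tail / total
```

Note that the observed assignment always lies in its own cell, so the p-value is strictly positive and the renormalizing denominator is nonzero.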

Aronow (2012) and Athey et al. (2018) proposed to choose a partition R = {S_m}_{m=1}^M such that H_m is equal to a fixed subset of "focal units", I ⊆ [N], for all m. Given any I ⊆ [N], the conditioning set is given by all the treatment assignments under which all the units in I receive the same exposure. That is,

S_z = {z* ∈ Z : I ⊆ H(z, z*)} = {z* ∈ Z : D_I(z*) = D_I(z)},   (7)

where D_I(·) = (D_i(·))_{i∈I}. From the right-hand side of (7), it is easy to see that {S_z : z ∈ Z} satisfies Lemma 1 and thus forms a partition of Z. Furthermore, {S_z : z ∈ Z} is countable because S_z is determined by D_I(z), an element of the countable set D^I. In our social network advertisement example, the focal units can be a randomly chosen subset of users; see Aronow (2012) and Athey et al. (2018) for more discussion. The next proposition summarizes the method proposed by Athey et al. (2018) and follows immediately from the discussion above.

Proposition 3. Given a null hypothesis H with a level-set structure as in Definition 6 and a set of focal units I ⊆ [N], under Assumption 1, the partition R = {S_z : z ∈ Z} defined in (7) and any test statistic T(z, Y_I(z)) induce a computable p-value. (Athey et al. (2018) used the same test statistic in all conditioning events, which is reflected in Proposition 3; our construction further allows the test statistic T_Z(z, Y_I(z)) to depend on Z through D_I(Z).)

Bipartite graph representation

Puelz et al. (2019) provided an alternative way to use the level-set structure. They consider imputability mappings of the form (supposing 0 ∈ D)

H(z, z*) = {i ∈ [N] : D_i(z) = D_i(z*) = 0},

and define a bipartite graph with vertex set V = [N] ∪ Z and edge set E = {(i, z) ∈ [N] × Z : D_i(z) = 0}, connecting every unit i with every assignment z satisfying D_i(z) = 0. Puelz et al. (2019) referred to this as the null exposure graph G = (V, E). (Their "null exposure graph" actually allows D_i(z) and D_i(z*) to belong to a prespecified subset of D; this can be incorporated in our setup by redefining the exposure functions.) Then by using (9), we have:

Proposition 4. The vertex subset V_m = H_m ∪ S_m and the edge subset E_m = {(i, z) ∈ H_m × S_m} form a biclique (i.e., a complete bipartite subgraph) in G.

By definition, both V_m and E_m depend on S_m. The challenging problem of finding a good partition of Z is thus reduced to finding a collection of large bicliques {(V_m, E_m)}_{m=1}^M in the graph such that {S_m}_{m=1}^M partitions Z. This was called a biclique decomposition in Puelz et al. (2019), who further described an approximate algorithm that finds a biclique decomposition by greedily removing the treatment assignments in the largest biclique.
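The focal-unit conditioning set in (7) can be sketched for a toy spillover design; the exposure function below (a line-graph neighbor rule) is an illustrative stand-in for the exposure mappings D_i, and all names are ours.

```python
def conditioning_set(z_obs, focal_units, exposure, support):
    """Focal-unit conditioning set S_z of equation (7).

    exposure(i, z) plays the role of D_i(z); the set collects every
    assignment z* under which all focal units receive the same exposure
    as under the observed assignment, i.e. D_I(z*) = D_I(z_obs).
    """
    d_obs = tuple(exposure(i, z_obs) for i in focal_units)
    return [
        z for z in support
        if tuple(exposure(i, z) for i in focal_units) == d_obs
    ]

# Toy spillover exposure on a line graph: unit i is "treated" (2) if
# z_i = 1, in "spillover" (1) if a neighbor is treated, else "control" (0).
def exposure(i, z):
    if z[i] == 1:
        return 2
    if (i > 0 and z[i - 1] == 1) or (i < len(z) - 1 and z[i + 1] == 1):
        return 1
    return 0
```

Because membership is determined by the value of D_I alone, the resulting sets automatically satisfy the invariance property of Lemma 1.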
Since the treatment Z_i and the outcome Y_i are both binary, the data can be summarized by a 2 × 2 table where N_zy denotes the number of units with treatment z and outcome y for z, y ∈ {0, 1}. Let N_z• = N_z0 + N_z1 and N_•y = N_0y + N_1y be the row and column totals.

Conceptual confusion and controversy can be avoided by distinguishing randomization tests from quasi-randomization tests.
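For the 2 × 2 table above, the one-sided exact p-value follows from the hypergeometric distribution of N_11 given the margins; a self-contained sketch using only the standard library (the function name is ours):

```python
from math import comb

def fisher_exact_one_sided(n11, n10, n01, n00):
    """One-sided Fisher's exact p-value for a 2 x 2 table.

    Rows are treatment z in {1, 0}, columns are outcome y in {1, 0}.
    Conditional on the margins, N_11 follows a hypergeometric
    distribution under the null of no effect; the p-value sums its
    upper tail (alternative: treatment increases the outcome).
    """
    row1 = n11 + n10            # treated total N_1.
    col1 = n11 + n01            # outcome-1 total N_.1
    n = n11 + n10 + n01 + n00   # grand total
    upper = min(row1, col1)
    denom = comb(n, col1)
    return sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(n11, upper + 1)
    ) / denom
```

For Fisher's classic lady-tasting-tea table with counts 3, 1, 1, 3, this gives 17/70 ≈ 0.243.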
so W consists of many identical copies of Y, and the "causal" null hypothesis H_0 : Y(z) = Y(z*), ∀z, z* ∈ Z is automatically satisfied. Suppose the test statistic is given by T_z(z*, W) = T(z*, Y(z*)) as in Section 4.2. Because of how the potential outcomes schedule W is defined, the test statistic is simply T(z*, Y). The CRT then tests Z ⊥⊥ W in Assumption 1, which is equivalent to Z ⊥⊥ Y. When S_Z is given by all the permutations of Z as in (10), Z* | Z* ∈ S_Z has the same distribution as Z_g where g is a random permutation. These observations show that the permutation test of independence is identical to the permutation test for the artificial "causal" null hypothesis. So permutation tests of treatment effect and of independence are two sides of the same coin: they are algorithmically the same but differ in what is regarded as the presumption and what is regarded as the hypothesis being tested.

Table 1
shows the one-sided p-values (the alternative hypothesis being a positive treatment effect) of the seven permutation tests with these three statistics. Confidence intervals were obtained by inverting the two one-sided permutation tests of null hypotheses with varying constant treatment effects (see Section 3.2 for more details). The results are further compared with the corresponding output of the two linear models assuming normal homoskedastic noise.

Table 1:
Results of (quasi-)randomization tests applied to the Australia weekend health services disinvestment trial. T_1 is the exposed-minus-control difference of log length of stay. T_2 is the coefficient of exposure status in the linear regression of log length of stay on treatment status and hospital ward. T_3 is the coefficient of exposure status in the linear regression of log length of stay on exposure status, hospital ward, and time of hospitalization. Results of the permutation tests are compared with the output of the corresponding linear models assuming normal homoskedastic noise. (CI: 90% confidence interval.)

Exchangeability makes it straightforward to compute the p-value (1), as Z* is uniformly distributed over S_Z if Z has distinct elements. In this sense, our assumption that the assignment distribution of Z is known (Assumption 1) is more general than exchangeability. See Roach and Valdar (2018) for some recent developments on generalized permutation tests in non-exchangeable models. Notice that in permutation tests, the invariance of S_z in Lemma 1 is satisfied because the permutation group is closed under composition; that is, the composition of two permutations of [N] is still a permutation of [N]. This property can be violated when the test conditions on

Let Z = (Z_1, . . ., Z_N), Y = (Y_1, . . ., Y_N), and Z_g = (Z_{g(1)}, . . ., Z_{g(N)}). Given a test statistic T(Z, Y), independence is rejected by the permutation test if the following p-value is less than the significance level α:

P(Z, Y) = (1/N!) Σ_{g ∈ Ω_N} 1{T(Z_g, Y) ≥ T(Z, Y)}.   (13)
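For small N, the p-value (13) can be computed by exhaustive enumeration over all N! permutations; a minimal sketch with an illustrative function name:

```python
from itertools import permutations
from math import factorial

def permutation_independence_p_value(z, y, stat):
    """Exact permutation p-value for H0: Z independent of Y, as in (13).

    Enumerates all N! permutations g of the indices (feasible only for
    small N) and returns the fraction with T(Z_g, Y) >= T(Z, Y).
    """
    n = len(z)
    t_obs = stat(z, y)
    count = sum(
        stat([z[i] for i in g], y) >= t_obs
        for g in permutations(range(n))
    )
    return count / factorial(n)
```

For larger N, this exhaustive sum is replaced by the Monte Carlo approximation over random permutations, which is what the term "rerandomization test" in the sense of Brillinger et al. (1978) refers to.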