Kernel regression analysis of tie-breaker designs

Tie-breaker experimental designs are hybrids of Randomized Controlled Trials (RCTs) and Regression Discontinuity Designs (RDDs) in which subjects with moderate scores are placed in an RCT while subjects with extreme scores are deterministically assigned to the treatment or control group. In settings where it is unfair or uneconomical to deny the treatment to the more deserving recipients, the tie-breaker design (TBD) trades off the practical advantages of the RDD with the statistical advantages of the RCT. The practical costs of the randomization in TBDs can be hard to quantify in generality, while the statistical benefits conferred by randomization in TBDs have only been studied under linear and quadratic models. In this paper, we discuss and quantify the statistical benefits of TBDs without using parametric modeling assumptions. If the goal is estimation of the average treatment effect or the treatment effect at more than one score value, the statistical benefits of using a TBD over an RDD are apparent. If the goal is nonparametric estimation of the mean treatment effect at merely one score value, we prove that about 2.8 times more subjects are needed for an RDD in order to achieve the same asymptotic mean squared error. We further demonstrate, using both theoretical results and simulations based on the Angrist and Lavy (1999) classroom size dataset, that larger experimental radius choices for the TBD lead to greater statistical efficiency.


Introduction
In this paper we study a nonparametric regression approach to tie-breaker studies. In the settings of tie-breaker studies, there is a costly treatment while the control is inexpensive or even free. In addition, an investigator can decide how to allocate the costly treatment using a priority ordering on the subjects. The priority ordering could be based on how deserving of the treatment each subject is, or based on how strongly each subject is expected to respond to the treatment. Examples include offering scholarships or school placement to students (Angrist, Autor and Pallais, 2020), offering a drug rehabilitation program to people of varying needs (Cappelleri and Trochim, 2003), or assigning interventions to reduce risk factors for child abuse and neglect (Krantz, 2022). In these settings a randomized controlled trial (RCT) is inappropriate because it is extremely inefficient economically, or even ethically questionable. The natural, even automatic, approach to settings like this is to rank the subjects i = 1, . . . , N according to their value of a running variable x_i and assign the treatment to only those subjects with the highest values of x_i. For simplicity one can assume that the number of subjects to treat is fixed and that the treatment is offered to subject i if and only if x_i > t for some threshold t.
The problem with a deterministic treatment based on x_i is that it complicates causal inference of the effect of the treatment. One can use regression discontinuity analysis (Thistlethwaite and Campbell, 1960; Cattaneo and Titiunik, 2022), but the regression discontinuity design (RDD) is known to give treatment effect estimates with very high variance. See, for example, Gelman and Imbens (2019). In a parametric regression model the treatment and running variable are highly correlated, making for an inefficient design (Goldberger, 1972; Jacob et al., 2012). In a tie-breaker design (TBD), the subjects are given the treatment with a probability that increases with x_i. The top ranked subjects get the treatment with probability one, the bottom ranked subjects do not get the treatment, and subjects in between are randomly assigned to either treatment or control. The tie-breaker design interpolates between two extremes, the RCT and the RDD, trading off statistical efficiency against the short term economic value of aligning treatment with the running variable.
We believe that there are many more good uses for TBDs. Many companies interact electronically with their customers and partners. Perks, such as service upgrades, can easily be assigned with some randomization. Because the treatment is costly, it is important to evaluate the treatment efficacy later. This provides a strong motivation to introduce randomization. When the perk is simply a gift to some subjects, there is less ethical concern over whether it goes to the most loyal customers or over introducing some randomization. We also expect that tie-breakers will be useful in evaluating governmental programs such as the one in Krantz (2022), as well as educational programs such as those in Angrist, Autor and Pallais (2020).
TBDs have been primarily studied as experimental designs using parametric regression modeling assumptions. While the design literature focuses on parametric models, the RDD literature primarily uses nonparametric regression methods. In this paper, we quantify the statistical gains to be obtained by conducting a TBD instead of an RDD using nonparametric regression.
Our main theoretical contributions are as follows. We study a kernel weighted local linear regression with a slope and intercept for both treated and control subjects and bandwidth h. The RDD can consistently estimate the treatment effect only at x = t, and so we focus our comparison at that point. We find an expression for the optimal bandwidth for estimating the treatment effect at x = t under the TBD. We then compare the optimal mean squared error at x = t for the two designs. For the popular triangular kernel, a TBD reduces the asymptotic mean squared error (AMSE) by a factor of about 2.27 compared to an RDD of the same sample size N. For other popular kernels, the AMSE is reduced by a slightly greater factor. In this setting, the AMSE decreases proportionally to N^{-4/5}, and using N points in an RDD is comparable to using only 0.36N points in a TBD. The asymptotic analysis has a bandwidth that converges to zero. Since this convergence is at the very slow N^{-1/5} rate, we cannot assume that in practice h will be small enough that subjects without randomized treatment are discarded. Therefore, we also compare the designs when h is fixed and large enough to include nonrandomized subjects in the regression. In this setting, the efficiency ratio, which we define as the relative variance of treatment effect estimators under the two designs, can be as large as four. For a fixed bandwidth and for the triangular and boxcar kernels, we also find that the efficiency ratio is monotone non-decreasing in the proportion of subjects who are given a randomized treatment assignment.
A further advantage of the TBD is that it can give consistent nonparametric estimates of the treatment effect for any value of x in the randomization window. It can also be used to estimate the average causal effect over that window. These additional advantages are described more explicitly in Section 3.
An outline of this paper is as follows. Section 2 reviews the literature on tie-breakers as well as the much larger literature on RDDs. In Section 3, we define a causal parameter of interest that can be used to compare the TBD to the RDD. We also introduce the causal identification assumptions needed and the local linear regressions used for estimating that parameter. In Section 4, we compare the mean squared error (MSE) in asymptotically optimal estimation of our causal parameter of interest under an RDD to that under a TBD. In that asymptotic setting, the optimal bandwidth decreases at the slow O(N^{-1/5}) rate and then the local linear regression is eventually supported entirely in the experimental region of the TBD. In Section 5, we investigate another regime where the bandwidth h is fixed and is assumed to be larger than the radius ∆ of the experimental region. For this setting, deferring an investigation of the bias to Appendix D, we study the variance of our estimator as a function of ∆ and find the efficiency ratio to be monotone in ∆ for the triangular and boxcar kernels. Section 6 shows how one can compute efficiency ratios empirically using one's actual assignment variable levels, focusing on the Israeli classroom size data from Angrist and Lavy (1999) as an example. The curves of the empirical efficiency ratios are quite similar to the ones obtained theoretically. Section 7 presents a discussion. Appendix C extends our results to TBDs in which each subject in the experimental group is given the treatment with a probability p that need not equal 1/2. Appendices A, B, E, and F contain some of our proofs.

Literature review
Here we survey the small TBD literature and some recent developments in the much larger RDD literature. We also note connections to the experimental design literature. Most of the TBD literature has focused on global parametric models. Those are also the dominant model for experimental design. The TBD is usually compared to the RDD, for which nonparametric models are the norm.

Regression discontinuity methods
Here we present some concepts from the regression discontinuity literature, drawing heavily on Cattaneo and Titiunik (2022). We begin with a setting where there is a running variable x_i (also called a score or index) for subject i = 1, . . . , N. Subjects with x_i > t are given the treatment and others get the control. The treatment levels are typically T_i ∈ {0, 1} with T_i = 0 being the control. For the TBD setting it is more convenient to use Z_i ∈ {−1, 1} with Z_i = −1 indicating the control. The potential outcomes for subject i are Y_{i+} if treated and Y_{i−} if control.
There are two main approaches to RDD in causal inference, continuity-based and local randomization-based. The continuity-based approach assumes that the mean response for treated subjects is continuous in x, as is that for control subjects. If the mean response for all subjects shows a discontinuity at x = t, then the magnitude of this discontinuity is defined to be the causal effect of treatment on subjects at x = t, and one then considers how to estimate that effect. A formal version of this approach appears in Hahn, Todd and van der Klaauw (2001). We will work primarily with a superpopulation setting where the subjects in the study are sampled from a joint distribution. Cattaneo and Titiunik (2022) discuss this setting along with some other settings that focus on causal inference for the given subjects. The local randomization approach from Cattaneo, Frandsen and Titiunik (2015) assumes the existence of a window W = [t − h, t + h] such that for x ∈ W the treatment variable is 'as good as randomized'. In the local randomization approach we assume (a) that the joint distribution of the Z_i for x_i ∈ W is known, and (b) that the potential outcomes (Y_{i+}, Y_{i−}) are independent of x_i. In particular, both mean responses must be constant functions of x ∈ W. A variant of local randomization has the treatment based on a threshold of x where x is a noisy version of a latent variable u (Eckles et al., 2020) and where we have outside information under which the probability of treatment given u is known.
Both frameworks have challenges. The obvious difficulty with local randomization is choosing the window W (or knowing the treatment probability given u). A smaller window provides a smaller dataset to use while making the window larger will normally increase the discrepancy between the model and the ground truth. The TBD can be viewed as a strategy to impose by design the first assumption in the local randomization approach while not making the second assumption.
The challenge in the continuity framework is in estimating the necessary limits. In a parametric model, those limits are estimated from all the data but have a bias due to lack of fit of the parametric model. As a result, nonparametric regression methods based on local polynomial models are favored. The challenge there is that one must choose a bandwidth h, analogously to the window size from the local randomization framework. Because the mean responses are only locally polynomial we must contend with a bias-variance tradeoff in estimating the limits.
Our theoretical and numerical results compare the TBD to the RDD in the continuity framework. We think that this is the more likely alternative analysis for our motivating applications if randomization had not been used, because the local randomization assumptions do not seem natural in those applications. We focus on the accuracy of point estimation. There is also a large literature on constructing confidence intervals around the RDD estimate (see Cattaneo and Titiunik (2022)). We describe some of those concepts in this paper, but we do not develop confidence intervals for the TBD due to space constraints.
There are many different settings where treatments depend in a discontinuous way on x. In a sharp design, the treatment is Z_i = 1 if and only if x_i > t. In a fuzzy design, the assignment to treatment or control might not perfectly match x_i > t versus x_i ≤ t, for reasons beyond the control of the investigator. For instance there may be subjects that do not comply with their assigned treatment. A related issue is that some subjects might be able to manipulate their value of the running variable in order to get (or avoid) the treatment. Rosenman and Rajkumar (2019) discuss some ways to counter that problem. In the settings we consider, the investigator has control of the treatment and so we study the sharp design. We also do not address issues of subject compliance, as we suppose that the effect of the treatment assignment or the intent to treat is of sufficient interest to the investigator. The RDD setting has been extended well beyond the simple framework described above. There are versions with treatments at more than two levels as well as versions with continuous treatments. The cutoff can be defined in terms of a vector of covariates, yielding a discontinuity set of dimension one less than the vector has. The treatment discontinuity could be defined by geographical boundaries. There are multi-cutoff settings where subject i gets the treatment when x_i > t_i. There are models where it is the derivative of E(Y | x, Z = 1) − E(Y | x, Z = −1) that has a step discontinuity at t. For discussion and references to the variants above, see Cattaneo and Titiunik (2022).

Experimental design
While we study TBDs in comparison to regression discontinuity, they can also be considered within an experimental design framework such as the covariate-dependent designs considered by Metelkina and Pronzato (2017). That paper emphasizes sequential problems, and like most of the design literature it works primarily with parametric models. They find conditions where the treatment policies converge to optimal deterministic functions of a covariate vector. The TBD does not use deterministic allocations, which is an advantage if the response distribution is subject to change between experiments.
Experimental design, especially in a sequential setting, is closely related to bandit methods. We are motivated by problems where the responses Y_i arrive too slowly for bandit methods to be suitable. In a business setting, the responses may arrive after a year or calendar quarter, while the effect of a scholarship on graduation rates can only be seen years later.

Tie-breakers
The simplest tie-breaker design replaces the threshold t by two thresholds t ± ∆. Subjects with x_i > t + ∆ get the treatment, subjects with x_i < t − ∆ get the control and other subjects are randomized to either treatment or control. The simplest choice has

Pr(Z_i = 1 | x_i) = 1 if x_i > t + ∆, 1/2 if |x_i − t| ≤ ∆, and 0 if x_i < t − ∆.    (1)

Campbell (1969) describes the ∆ = 0 version of this design. Some subjects are exactly at the threshold t and then randomization breaks the ties among them. Boruch (1975) considers positive values of ∆ such that differences in the running variable among subjects with |x − t| ≤ ∆ are essentially arbitrary because x is an imperfect measure. Abdulkadiroglu et al. (2022) study the New York school system that breaks ties among applicants by lottery, or standardized test, or audition, depending on the program. We only consider randomized tie-breaking. Goldberger (1972) considers a simple two line regression model that in our notation is

Y_i = β_1 + β_2 x_i + Z_i(β_3 + β_4 x_i) + ε_i    (2)

for IID errors ε_i with mean zero and variance σ^2. He finds that an RDD estimates these coefficients with a variance that is asymptotically π/(π − 2) ≈ 2.75 times as large as it would be under an RCT. His setting has Gaussian x_i and β_4 = 0. Jacob et al. (2012) generalize the above model to polynomials of degree two or three in x with or without interactions between x and Z. Their Table 6 shows that an RCT is 4 times as efficient as the RDD for a uniformly distributed running variable with t at the midpoint of its range. They also provide similar efficiency estimates for other polynomial models for both uniform and Gaussian x and include settings where t is not at the median of the distribution of x.
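The three-level rule just described can be sketched in code. This is an illustrative sketch only; the function name and interface are ours, not from the paper, and subjects exactly at t ± ∆ are placed in the randomization window.

```python
import random

def tie_breaker_assign(x, t, delta, rng=random):
    """Three-level tie-breaker assignment: treat (Z = +1) above t + delta,
    control (Z = -1) below t - delta, and a fair coin flip in between.
    Setting delta = 0 recovers the sharp RDD rule (up to ties at t)."""
    if x > t + delta:
        return 1
    if x < t - delta:
        return -1
    return 1 if rng.random() < 0.5 else -1
```

With delta = 0, every subject with x > t is treated deterministically, matching the RDD; larger delta widens the randomized middle band.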
The above comparisons of RCTs to RDDs do not include tie-breakers. Cappelleri, Darlington and Trochim (1994) compare small, medium and large randomization windows in which 20%, 35% and 50%, respectively, of the subjects get a randomized treatment. They tabulate the sample sizes needed to attain a certain level of statistical power for three treatment effect sizes in these TBDs as well as in an RDD and in an RCT. All designs had half of the subjects getting the treatment. The running variable x was normally distributed. The model was (2) with β_4 = 0, making the treatment effect constant. The power calculations were done by Monte Carlo sampling. The required sample sizes became smaller with increased randomization at any level of power and effect size.
Owen and Varian (2020) work out the asymptotic variance of β̂ in the model (2) as a function of ∆. They consider both U[−1, 1] and N(0, 1) distributions for x and a threshold t at the median of x's distribution. The estimated treatment effect is 2(β̂_3 + β̂_4 x) and they find for uniform x that this estimate has asymptotic variance proportional to 16(1 + 3x^2)/[1 + 3∆^2(2 − ∆^2)], where ∆ = 0 describes the RDD and ∆ = 1 is the RCT. This decreases monotonically in ∆ while increasing monotonically in |x|.
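The displayed variance expression for uniform x is easy to check numerically. The sketch below (the function name is ours) verifies the monotonicity claims and the factor-of-four gap at x = 0 between the RDD (∆ = 0) and the RCT (∆ = 1).

```python
def ov_variance(x, delta):
    """Asymptotic variance (up to a common proportionality factor) of the
    estimated treatment effect 2*(b3 + b4*x) under the two-line model with
    a uniform running variable, as reported by Owen and Varian (2020):
        16*(1 + 3*x**2) / (1 + 3*delta**2*(2 - delta**2)).
    delta = 0 is the RDD and delta = 1 is the RCT."""
    return 16.0 * (1.0 + 3.0 * x * x) / (1.0 + 3.0 * delta**2 * (2.0 - delta**2))

# At the threshold, the RDD variance is exactly 4 times the RCT variance.
assert ov_variance(0.0, 0.0) / ov_variance(0.0, 1.0) == 4.0
# Monotone non-increasing in delta on [0, 1], increasing in |x|.
vals = [ov_variance(0.0, d / 50) for d in range(51)]
assert all(a >= b for a, b in zip(vals, vals[1:]))
assert ov_variance(0.5, 0.5) > ov_variance(0.0, 0.5)
```

The factor of four at ∆ = 0 versus ∆ = 1 matches the largest efficiency ratio mentioned in the introduction.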
They also consider the opportunity cost of experimentation compared to the RDD. If larger Y_i are better and β_4 > 0, then the opportunity cost grows proportionally to β_4 ∆^2. They discuss how one might trade off this opportunity cost against statistical efficiency.
The TBD has so far been analyzed with simpler methods than the RDD has. This can be understood by comparing their workflows. In a TBD we measure x_i, then sample Z_i, and then some time later observe Y_i. For an RDD we usually get (x_i, Z_i, Y_i) all at once. The investigator planning a TBD has only the x_i, must decide how to assign the Z_i without necessarily knowing what model will be fit later, and then chooses some specific model to design for. When one studies the TBD theoretically, one does not even have the x_i, and then it is natural to assume a distribution for them. The TBD is prospective while the RDD is retrospective.
When vectors x_i of covariates are available, Owen and Varian (2020, Section 8) describe how to investigate numerically the efficiency of a TBD that fits a regression model on some collection of features of x_i that interact with Z_i, where the treatment window is based on a linear combination of x_i. Morrison and Owen (2022) study tie-breakers in a multiple regression model. They study a prospective D-optimality criterion that maximizes the determinant of E(X^T X), where X is the (random) design matrix built from the x_i and Z_i. For any known x_i, the finite sample optimal treatment probabilities p_i can be computed by convex optimization.
For as yet unobserved x_i, the prospective D-optimality criterion averages over both random Z_i and random x_i from an assumed distribution for x_i. For random x_i, they study a three level tie-breaker with running variable x_i^T η and treatment probabilities 0, 0.5 and 1. Owen and Varian (2020) consider replacing the simple trichotomy (1) by various sliding scales where Pr(Z_i = 1 | x_i) is a monotone function of x_i. They find no advantage to such alternatives when x_i has a symmetric distribution about t and half the subjects are treated. Li and Owen (2022) revisit that problem for the two line model and find optimal designs for general x_i distributions and general fractions of treated subjects, without assuming that half of the subjects will be treated. These optimal designs can greatly improve upon the design defined in (1). They still have Pr(Z_i = 1 | x_i) monotone in x_i, and only two treatment probability levels are needed.
A limitation of previous comparisons between RDDs and TBDs is that they all assume parametric regression models for the response Y i . We compare them using local linear regression. For simplicity, we restrict our attention to the three level version of the TBD in (1).
Our model assumes an additive error on top of smooth functions of x for the data points where |x − t| > ∆. When h > ∆, the causal estimate we consider merges deterministic and randomized treatment allocations and then cannot be analyzed in a potential outcomes framework. It is common in causal inference to ignore such data. For instance, a rule of thumb in Crump et al. (2009) is to omit data where the treatment probability is outside [0.1, 0.9]. Asymptotically, h < ∆ and then a potential outcomes analysis is available. Otherwise, to stay within the potential outcomes framework, one must choose between ignoring some data and using the additive error model as we do.

Causal estimand and problem formulation
Throughout the text we will compare the TBD to the RDD. In our comparison, we will define t to be the putative RDD threshold and ∆ to be the experimental radius, and we consider allocation of the treatments to the N subjects according to the 3-level tie-breaker design (1).
Next we discuss the estimands of interest. For each subject we consider the assignment variable X ∈ R, the treatment Z ∈ {−1, 1} and two potential outcomes: Y_+ = Y(Z = 1) and Y_− = Y(Z = −1). The treatment effect at X = x is defined as

τ(x) ≡ E(Y_+ − Y_− | X = x) = µ_+(x) − µ_−(x),    (3)

where µ_±(x) ≡ E(Y_± | X = x), and we write τ_thresh ≡ τ(t) for the treatment effect at the threshold. If the investigator chooses an RDD with a threshold at t, then under certain regularity conditions, the causal estimand τ_thresh can be consistently estimated. In particular, we assume IID samples (or sufficiently weak dependence between the samples) and that: (i) the density f(·) of the assignment variable X is continuous at t with f(t) > 0; (ii) the conditional mean functions µ_+ and µ_− in (3) have at least 3 continuous derivatives in an open neighborhood of t; (iii) the conditional variance functions σ^2_±(x) ≡ Var(Y_± | X = x) are both bounded in a neighborhood of t and continuous at t. Under these conditions, τ_thresh can be consistently estimated by local linear regression with O_p(N^{-2/5}) errors (Imbens and Kalyanaraman, 2012). On the other hand, the conditions above do not suffice to let an RDD consistently estimate τ(x) for any x ≠ t.
If an investigator runs a TBD with ∆ > 0, then assumptions like those above with t replaced by x allow consistent estimation of τ(x) for any x ∈ (t − ∆, t + ∆). Furthermore, as long as Var(Y_±) < ∞, the window average treatment effect τ_ATE(∆) ≡ E(τ(X) | t − ∆ < X < t + ∆) can be consistently estimated with error O_p(N^{-1/2}) in a TBD without requiring assumptions (i), (ii), and (iii).
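A minimal sketch of how the window average treatment effect might be estimated by a difference of means within the randomization window. The function name and the (x, z, y) tuple format are illustrative assumptions, not the paper's estimator.

```python
def window_ate_estimate(data, t, delta):
    """Difference-in-means sketch for the window average treatment effect:
    mean Y among treated minus mean Y among controls, using only subjects
    with a randomized assignment, i.e. |x - t| <= delta.
    `data` is a list of (x, z, y) tuples with z in {-1, +1}."""
    treated = [y for (x, z, y) in data if abs(x - t) <= delta and z == 1]
    control = [y for (x, z, y) in data if abs(x - t) <= delta and z == -1]
    return sum(treated) / len(treated) - sum(control) / len(control)
```

Because treatment is randomized inside the window, this simple contrast is unbiased for the window-average effect there, with the usual O_p(N^{-1/2}) sampling error and no smoothness conditions on µ_±.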
The discussion above leaves open the possibility that an RDD could be better than a TBD when estimating τ thresh . Therefore, for the remainder of the paper our primary focus will be on showing that even if the only goal is estimating τ thresh , it is still beneficial to run a TBD rather than an RDD. Our other focus will be to show that when the only goal is to estimate τ thresh , it is beneficial to pick a larger ∆ in the experimental design stage when the option is available. Picking a larger ∆ has other benefits as well such as making τ (x) identifiable for more values of the assignment variable and making τ ATE (∆) more representative of the overall population and easier to estimate. Naturally, there are non-statistical reasons to keep ∆ smaller.

Local linear estimation
In keeping with current RDD practice, we suppose that under an RDD τ_thresh will be estimated with local linear regression. In particular, we assume that a parameter vector β̂ = (β̂_1, β̂_2, β̂_3, β̂_4) defined by

β̂ = argmin_β Σ_{i=1}^N K((x_i − t)/h) (Y_i − β_1 − β_2(x_i − t) − Z_i β_3 − Z_i β_4 (x_i − t))^2    (4)

will be fit for some symmetric kernel function K(·) ≥ 0 and bandwidth parameter h > 0, and that τ_thresh will be estimated with

τ̂_thresh = 2β̂_3.    (5)

While this formulation of estimating τ_thresh may be less familiar than the approach of fitting separate local linear regressions for the treatment and control groups, it is easy to check that the two formulations yield the same estimator. Throughout the paper, we suppose that under a TBD, τ_thresh will also be estimated using local linear regression according to (4) and (5). We do not use the same bandwidth h for the TBD and RDD. Indeed, in the next section we see that the optimal bandwidth choice (in terms of AMSE) is different for the two designs.
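A minimal sketch of the pooled local linear estimator, assuming the Z ∈ {−1, 1} coding above so that the estimated effect at t is twice the fitted Z coefficient. The triangular kernel used here is one common choice, and all names are ours.

```python
import numpy as np

def triangular_kernel(u):
    return np.maximum(0.0, 1.0 - np.abs(u))

def tau_hat_thresh(x, z, y, t, h, kernel=triangular_kernel):
    """Kernel-weighted local linear fit of
    y ~ b1 + b2*(x - t) + z*b3 + z*(x - t)*b4,
    returning 2*b3_hat, the estimated treatment effect at x = t."""
    x, z, y = map(np.asarray, (x, z, y))
    w = kernel((x - t) / h)
    keep = w > 0                      # drop points with zero kernel weight
    x, z, y, w = x[keep], z[keep], y[keep], w[keep]
    X = np.column_stack([np.ones_like(x), x - t, z, z * (x - t)])
    sw = np.sqrt(w)                   # weighted least squares via sqrt weights
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return 2.0 * beta[2]
```

On noiseless data generated exactly from the two-line model, the fit recovers the coefficients and hence the effect 2β_3 exactly, which gives a convenient sanity check.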
Because kernels with unbounded support are not typically used in RDD analysis (Cattaneo and Titiunik, 2022), we only consider kernels with bounded support. We assume, without loss of generality, that the kernel is supported on [−1, 1]. We have a special interest in the uniform (boxcar) kernel K_BC(x) = 1{|x| ≤ 1} because it is a popular kernel choice and is a local version of the regression model (2). We are also interested in the triangular kernel K_TS(x) = (1 − |x|)_+ where z_+ = max(0, z). This kernel was shown by Cheng, Fan and Marron (1997) to optimize a bias-variance tradeoff for extrapolation from x_i > t to E(Y | x = t) and has been advocated for RDD analysis by Imbens and Kalyanaraman (2012) and Calonico, Cattaneo and Titiunik (2014), among others.
The local linear regression estimator from (5) has a bias and variance that both depend on the bandwidth h. A larger h typically brings greater bias because the true regression function is not precisely linear over a region centered on t. A smaller h brings greater variance because fewer data points enter the regression. Imbens and Kalyanaraman (2012) develop a method for choosing the bandwidth h that is asymptotically mean squared optimal for the RDD. In the next section, we compare the AMSE of the TBD with that of the RDD, when each has its asymptotically optimal bandwidth choice.
In this paper, we focus on the accuracy of the estimated treatment effect. The RDD literature includes several papers devoted to the construction of confidence intervals. There it is necessary to account for the bias in a local polynomial regression. A simple approach is to choose h to undersmooth the regression function, resulting in a bias of lower order than the standard error, and this simplifies confidence interval construction. Undersmoothing, however, brings less accuracy (Calonico, Cattaneo and Titiunik, 2014). See Calonico, Cattaneo and Farrell (2019) for a discussion of bandwidth choices to optimize estimation, or optimize confidence interval construction, or to get robust (asymptotically valid) confidence intervals using the bandwidth that is optimal for estimation.

Asymptotically optimal mean squared error
In this section, we demonstrate the advantage of the TBD over the RDD when each design's bandwidth is chosen to minimize the AMSE in the estimation of τ_thresh = µ_+(t) − µ_−(t). Following Imbens and Kalyanaraman (2012), we assume more general regularity conditions (i)-(vi) for estimating the causal effect at X = t; among them, the kernel K(·) is continuous on its support and is strictly positive somewhere.
Under an RDD, Z_i = 1 if X_i > t and Z_i = −1 otherwise, so these assumptions imply Assumptions 3.1-3.6 that Imbens and Kalyanaraman (2012) make for an RDD. To allow for analysis in the TBD setting, our assumptions (i)-(vi) are slightly stronger than those in Imbens and Kalyanaraman (2012); for example, unlike our Assumption (iii), Imbens and Kalyanaraman (2012) make no analogous assumption. Regarding assumption (vi), Imbens and Kalyanaraman (2012) also consider the case where µ''_+(t) = µ''_−(t) and show that in this case, their proposed method of estimating τ_thresh has error O_p(N^{-3/7}) rather than O_p(N^{-2/5}). We do not consider the case µ''_+(t) = µ''_−(t) in detail for the TBD as the result should be similar to that for the RDD and is of less interest for our head-to-head comparison of TBD with RDD.
Because our assumptions (i)-(vi) imply Assumptions 3.1-3.6 in Imbens and Kalyanaraman (2012) for an RDD, if we define the one-sided kernel moments for j ∈ N and the derived constants C̃_1 and C̃_2 as at (6), then Lemma 3.1 of Imbens and Kalyanaraman (2012) holds. We reproduce the statement of this lemma below.
Lemma 1. Under Assumptions (i)-(vi), if an RDD determines the treatment assignment and both h → 0 and Nh → ∞ as the number of samples N → ∞, then the mean squared error in estimating τ_thresh is given by

AMSE_RDD(h, N) = C̃_1 h^4 + C̃_2 γ^5/(Nh) + o(h^4 + (Nh)^{-1}),    (7)

where γ is a constant that does not depend on h, N or the kernel and is determined by the joint distribution of (X, Y, Z) near t, and the asymptotically optimal bandwidth, defined by argmin_h AMSE_RDD(h, N), is given by

h_opt,RDD(N) = γ (C̃_2/(4C̃_1))^{1/5} N^{-1/5}.    (9)

Proof. Imbens and Kalyanaraman (2012, Lemma 3.1).
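The leading-order AMSE above has the generic form A h^4 + B/(N h). The following sketch numerically confirms its closed-form minimizer, with A and B as illustrative stand-ins for the kernel- and distribution-dependent constants (they are not values from the paper).

```python
def h_opt(A, B, N):
    """Closed-form minimizer of the leading-order AMSE(h) = A*h**4 + B/(N*h):
    setting the derivative 4*A*h**3 - B/(N*h**2) to zero gives
    h* = (B/(4*A*N))**(1/5), which shrinks at the slow N**(-1/5) rate."""
    return (B / (4.0 * A * N)) ** 0.2

def amse(A, B, N, h):
    return A * h**4 + B / (N * h)

A, B, N = 1.5, 2.5, 10_000
h_star = h_opt(A, B, N)
# no nearby bandwidth does better than the closed form
for s in (0.5, 0.8, 1.25, 2.0):
    assert amse(A, B, N, h_star) < amse(A, B, N, s * h_star)
```

Multiplying N by 32 shrinks the optimal bandwidth by exactly a factor of 32^{1/5} = 2, illustrating the N^{-1/5} rate discussed in the introduction.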
Because we wish to compare the RDD to the tie-breaker design, we derive a similar result for the asymptotic MSE for the tie-breaker design. The TBD counterparts to the RDD quantities above are the kernel moments for j ∈ N at (10) and the constants C_1 and C_2 at (11), defined with integrals over the kernel's full support rather than a half line.

Lemma 2. Under Assumptions (i)-(vi), if a TBD with a fixed experimental radius ∆ > 0 determines the treatment assignment and both h → 0 and Nh → ∞ as the number of samples N → ∞, then the mean squared error in estimating τ_thresh is given by

AMSE_TBD(h, N) = C_1 h^4 + C_2 γ^5/(Nh) + o(h^4 + (Nh)^{-1}),    (12)

and the asymptotically optimal bandwidth, defined by argmin_h AMSE_TBD(h, N), is

h_opt,TBD(N) = γ (C_2/(4C_1))^{1/5} N^{-1/5}.    (13)

Proof. See Appendix A.
The proof of this lemma is very similar to the proof of Lemma 3.1 in Imbens and Kalyanaraman (2012), from their appendix. Instead of pointing to their proof and noting the parts of their proof that differ in the tie-breaker design setting, we write out the proof of Lemma 2 in Appendix A to ensure there are no subtle issues with using their proof in the tie-breaker design setting.
The leading order MSE formulas are derived by evaluating and summing the leading order terms for both the squared-bias and the variance. In formulas (7) and (12) for the leading order MSE, the first term gives the leading order squared-bias while the second term gives the leading order variance. See formulas (36) and (37) for explicit calculations of the leading order bias and variance in the TBD case, and see the formulas for 'B' and 'V' in the appendix of Imbens and Kalyanaraman (2012) for explicit calculations of these quantities in the RDD case. It is not surprising that the formulas for the leading order squared-bias, variance and MSE differ between the two design types: for an RDD, estimation of τ_thresh involves estimation of the mean functions at a boundary point, whereas for a TBD, it involves estimation of the mean functions at an interior point.

[Fig 1: A comparison of the asymptotic bias-variance tradeoff for the regression discontinuity design (dotted lines) versus the tie-breaker design (solid lines). The x-axes are in units of the asymptotically MSE optimal bandwidth for the RDD given at (9), while the y-axes are the leading order terms in units of αN^{-4/5}, where N is the sample size and α is a constant given in (15) that depends on properties of the joint distribution of (X, Y, Z) in a neighborhood of the cutoff.]
In Figure 1, we plug scalar multiples of the optimal bandwidth for the RDD given in Lemma 1 into the first and second terms of formulas (7) and (12) to visualize the trade-off between the leading order squared-bias and variance in a tie-breaker design compared to a regression discontinuity design. The formulas simplify when written in terms of the quantity α defined at (15), which does not depend on h, N or the kernel choice.
In practice, the optimal bandwidth is not known and must be estimated. For both the RDD and the TBD, the optimal bandwidth depends on the quantity γ, which must be estimated from the observed data. We consider the regularized estimator for γ of Imbens and Kalyanaraman (2012). We take the estimated optimal bandwidth ĥ_opt proposed in their Section 4.2 and set γ̂_RDD = (4C̃_1/C̃_2)^{1/5} ĥ_opt N^{1/5}. It can be seen from the proof of Theorem 4.1 in Imbens and Kalyanaraman (2012) that under assumptions (i)-(vi), γ̂_RDD → γ in probability. In the TBD case, we know a consistent estimator of γ exists. For example, if we let γ̂_TBD,naive be an estimator of γ that is constructed similarly to γ̂_RDD using only the subset of the data which looks like an RDD, then γ̂_TBD,naive → γ in probability. Of course such an estimator of γ is inefficient; in practice one should instead use an estimator of γ that does not throw out all of the control samples with x > t and all of the treated samples with x < t. For our theoretical comparison of TBDs with RDDs, we are not concerned with the actual form of γ̂_TBD as long as it is consistent. Therefore, in the TBD case we let γ̂_TBD be any estimator that satisfies γ̂_TBD → γ in probability. We make a few remarks about estimation of γ in the TBD setting in the discussion section.
To compare the AMSE for the RDD versus the TBD, we will assume that an investigator running an RDD and seeking mean squared optimal estimation of τ_thresh would ultimately use the bandwidth ĥ_opt,RDD(N) = (C_2/(4C_1))^{1/5} γ̂_RDD N^{-1/5}, where γ̂_RDD is the consistent estimator for γ described above and the C_j are defined at (6). We will also assume that an investigator running a TBD and seeking mean squared optimal estimation of τ_thresh would ultimately use the bandwidth ĥ_opt,TBD(N) = (C̃_2/(4C̃_1))^{1/5} γ̂_TBD N^{-1/5}, where γ̂_TBD is any consistent estimator of γ and the C̃_j are defined at (11).
The following theorem compares the RDD with N points to a TBD with θN points for some θ > 0. We will use the value of θ that provides equal MSEs for estimation of τ thresh as a metric to compare the two designs.
holds for any tie-breaker design of the form (1) with ∆ > 0.
Proof. See Appendix B.
Theorem 1 uses the assumption that Pr(Z_i = 1 | x_i) = 1/2 for x_i in the randomization window. If σ_+^2(t) ≠ σ_-^2(t), then we might prefer to offer the treatment with some probability p ≠ 1/2. In Appendix C, we study a general treatment probability p ∈ (0, 1). The asymptotic MSE is minimized when p = σ_+(t)/(σ_+(t) + σ_-(t)), though an investigator would also want to account for the cost of the treatment. If one chooses p using poor prior estimates of σ_± it is possible that the resulting TBD will have a higher asymptotic MSE than the RDD. However, for any of the kernels in Table 1, one can protect against that by choosing p ∈ [0.18, 0.82].
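As a function of p alone, the leading variance contribution at the threshold scales like σ_+^2(t)/p + σ_-^2(t)/(1 − p) (our reading of Appendix C; factors not depending on p are dropped). A minimal numerical sketch that this profile is minimized at p = σ_+/(σ_+ + σ_-), using hypothetical values σ_+ = 2 and σ_- = 1:

```python
# Variance profile in p only (constants free of p omitted); sigma_p and
# sigma_m are hypothetical standard deviations above/below the threshold.
def variance_profile(p, sigma_p, sigma_m):
    return sigma_p**2 / p + sigma_m**2 / (1.0 - p)

sigma_p, sigma_m = 2.0, 1.0
grid = [i / 10000 for i in range(1, 10000)]          # p in (0, 1)
p_star = min(grid, key=lambda p: variance_profile(p, sigma_p, sigma_m))
print(round(p_star, 3))   # close to sigma_p / (sigma_p + sigma_m) = 2/3
```

The grid minimizer lands on p* = σ_+/(σ_+ + σ_-) up to the grid spacing, matching the formula stated above.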

Asymptotic MSE comparison for some specific kernels
We now use Theorem 1 to compare the MSE in estimating τ_thresh for the RDD versus the TBD, under optimal bandwidth choices for various kernels of interest. See Table 1. If an investigator is deliberating between an RDD with N samples and a TBD (for a fixed ∆ > 0) with N samples, and either experimental design is to be analyzed with the asymptotically optimal bandwidth choice for the prespecified kernel, then the ratio of the MSEs converges in probability to (C_1 C_2^4/(C̃_1 C̃_2^4))^{1/5} as N → ∞. Using formulas (6) and (11), the fourth column of Table 1 gives the value of (C_1 C_2^4/(C̃_1 C̃_2^4))^{1/5} rounded to 2 decimal places. For the boxcar and triangular kernels respectively, this quantity is precisely 64^{1/5} and 60.46618^{1/5} without rounding. It is also interesting to consider the sample size ratio θ* = (C̃_1 C̃_2^4/(C_1 C_2^4))^{1/4}. As a result of Theorem 1, an experimental designer deciding to use a TBD rather than an RDD would only need to collect θ* times as many samples in order to achieve the same asymptotic MSE in estimating τ_thresh. Table 1 shows that the kernel choice has a remarkably small impact on the relative benefit of using a TBD rather than an RDD to estimate τ_thresh. It is well known in the usual kernel smoothing setting that there is little difference in performance among the widely used kernels; see Wand and Jones (1994). If τ_thresh is to be estimated with local linear regression using one of the seven popular kernel choices exhibited in Table 1, then the RDD has an asymptotic MSE that is about 2.3 times as large as that of the TBD, and the TBD will require 64 to 65 percent fewer samples than the RDD in order to achieve the same asymptotic MSE.
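Because the optimized AMSE scales as N^{-4/5}, equating the two AMSEs ties the MSE ratio R to the sample size ratio via θ* = R^{-5/4}. A quick arithmetic check of this link using the two exact values quoted above (boxcar: R^5 = 64; triangular: R^5 = 60.46618):

```python
# R is the asymptotic MSE ratio (RDD over TBD); theta is the fraction of the
# RDD sample size that the TBD needs for equal AMSE.  theta = R**(-5/4)
# because the optimized AMSE scales as N**(-4/5).
for name, r5 in [("boxcar", 64.0), ("triangular", 60.46618)]:
    R = r5 ** 0.2                # MSE ratio, about 2.3
    theta = R ** -1.25           # equals r5 ** -0.25
    print(name, round(R, 2), round(100 * (1 - theta)))  # ratio, % fewer samples
```

The printed percentages of samples saved fall in the 64 to 65 percent range stated in the text.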
We think that a version of Theorem 1 will also hold for unbounded kernels such as the N(0, 1) density under reasonable but stronger regularity conditions on f(·), µ_±(·) and σ_±(·). We do not develop such a result because Cattaneo and Titiunik (2022) state that kernels with unbounded support are not used in RDD analysis.

Table 1. An asymptotic comparison of regression discontinuity designs with tie-breaker designs in kernel regression-based estimation of τ_thresh. The fourth column gives the quantity (C_1 C_2^4/(C̃_1 C̃_2^4))^{1/5}, computed using (6) and (11) and rounded to 2 decimal places. The last column gives the quantity θ* given by (20).

Variance comparisons at fixed h > ∆
The AMSE comparison in Section 4 depends upon the optimal TBD bandwidth, ĥ_opt,TBD, eventually becoming smaller than the positive experimental radius ∆. However, the optimal h converges to zero only at the very slow rate N^{-1/5}. Furthermore, the constant in that rate includes the factor |µ_+^{(2)}(t) − µ_-^{(2)}(t)|^{-2/5}, which could be very large. We believe that in many applied settings the optimal value of h will not be smaller than ∆. Then ∆/h falls within the support of the kernel and Y values from outside the experimental region are included in the local linear regression.
In this section, we complement the prior analysis with one where h is fixed and larger than ∆. We assume a symmetric kernel function that is Lipschitz continuous on its support.
The kernel regression estimate of τ_thresh has a leading bias of O(h^2). In the regime where the bandwidth is bigger than ∆, mean squared optimality analysis for estimating τ_thresh is complicated by the fact that for the TBD there will often exist an h > ∆ such that the constant in this O(h^2) term vanishes. Remarkably, such a bandwidth depends only on the experimental radius ∆ and the kernel K. It does not depend on µ_+, µ_-, f, or N. In Appendix D, we prove that under certain regularity conditions on f, µ_±, and K, a bandwidth h that solves

ν_2^2 = 4 ∫_{∆/h}^∞ u K(u) du ∫_{∆/h}^∞ u^3 K(u) du

removes the leading order bias, and moreover, that such a solution exists. See Table D.1 for numerical solutions of this equation for the kernel choices considered previously. We find that for these kernel choices, the bandwidth removing the leading order bias ranges from approximately 3.13∆ for the boxcar kernel to approximately 4.84∆ for the triweight kernel. We caution investigators against picking this bandwidth because it does not shrink with N. It could place too little weight on reducing variance for small N, and the third order bias term will be O(1).
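For the boxcar kernel K(u) = 1{|u| ≤ 1}, the moments in this equation are available in closed form (ν_2 = 2/3, ∫_δ^1 u du = (1 − δ^2)/2 and ∫_δ^1 u^3 du = (1 − δ^4)/4, with δ = ∆/h), so the bias-removing bandwidth can be recovered by a one-dimensional root search; a minimal sketch:

```python
# Root of g(d) = nu2^2 - 4 * I1(d) * I3(d) in d = Delta/h for the boxcar
# kernel K(u) = 1{|u| <= 1}; the bias-removing bandwidth is h = Delta/d.
def g(d):
    nu2 = 2.0 / 3.0                    # integral of u^2 K(u) over [-1, 1]
    I1 = 0.5 * (1.0 - d**2)            # integral of u K(u) over [d, 1]
    I3 = 0.25 * (1.0 - d**4)           # integral of u^3 K(u) over [d, 1]
    return nu2**2 - 4.0 * I1 * I3

lo, hi = 1e-9, 1.0 - 1e-9              # g(lo) < 0 < g(hi), g increasing
for _ in range(100):                   # plain bisection
    mid = 0.5 * (lo + hi)
    if g(mid) < 0:
        lo = mid
    else:
        hi = mid
print(round(1.0 / lo, 2))              # h/Delta, approximately 3.13
```

The root search reproduces the value 3.13∆ reported for the boxcar kernel; other kernels in Table D.1 can be handled the same way with numerical quadrature in place of the closed forms.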
Due to the existence of a fixed bandwidth bigger than ∆ that removes the leading order bias ofτ thresh , analysis of bias and mean squared optimality using second order Taylor expansions of µ ± (·) would be misleading. Hence, we do not conduct an analysis similar to that seen in Section 4 for the regime where h > ∆.
For that regime, we instead restrict our attention to the variance in estimating τ_thresh at a fixed bandwidth h. The variance of the local linear estimator τ̂_thresh given in (4) and (5) can be computed as follows. The design matrix for the regression is X ∈ R^{N×4}, with ith row (1, x_i, Z_i, x_i Z_i). For simplicity we assume var(Y | X) = σ^2 I_N and without loss of generality we assume t = 0. The kernel weights are K(x_i/h), and we let W = diag(K(x_1/h), ..., K(x_N/h)). Then

β̂ = (X^T W X)^{-1} X^T W Y,    (21)

and under the assumption that var(Y | X) = σ^2 I_N we have

var(β̂ | X; ∆) = σ^2 (X^T W X)^{-1} X^T W^2 X (X^T W X)^{-1}.    (22)

Formula (21) for β̂ matches the familiar generalized least squares formula for the case where var(Y | X) = W^{-1}σ^2. Here W arises from weights that are not of inverse variance type, and hence the formula for var(β̂ | X; ∆) involves a W^2 factor and less cancellation than we might have expected. The boxcar kernel is special because then K(x_i/h) ∈ {0, 1} equals its own square. In that case var(β̂ | X; ∆) = (X^T W X)^{-1}σ^2. The estimator is τ̂_thresh = 2β̂_3. Therefore, we study var(β̂_3 | X; ∆) under a tie-breaker design as (var(β̂ | X; ∆))_{3,3}, using the expression in (22). At the stage where the experiment is being designed and ∆ is being chosen, the investigator does not have much information about X ∈ R^{N×4}, but as we will later see, quite a bit is known about the quantity N var(β̂_3 | X; ∆)/σ^2. For x_i from a real dataset, we see in Section 6 (e.g. Figure 5) that this quantity does not vary much over different simulations of the random treatment assignments. To get theoretical insight, we turn our attention to the uniformly spaced setting with x_i = (2i − N − 1)/N, which yields tractable theoretical results. We give an asymptotic justification for this assumption using results from Fan and Gijbels (1996) in Section 6. This rank transformation is also used in Owen and Varian (2020).
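The variance computation just described can be sketched directly with the sandwich formula var(β̂ | X; ∆) = σ^2 (XᵀWX)⁻¹XᵀW²X(XᵀWX)⁻¹ (a minimal illustration, not the paper's code: boxcar weights, evenly spaced x_i, the bandwidth h = 0.5 and the ∆ grid are illustrative, and Z alternates deterministically within the window, as in the stratified scheme):

```python
import numpy as np

def nvar_beta3(N, Delta, h=0.5):
    """N * var(beta3_hat | X; Delta) / sigma^2 via the sandwich formula
    (X'WX)^{-1} X'W^2X (X'WX)^{-1} with boxcar weights."""
    x = (2.0 * np.arange(1, N + 1) - N - 1) / N
    Z = np.sign(x)                          # deterministic outside the window
    idx = np.where(np.abs(x) <= Delta)[0]   # randomization window
    Z[idx[0::2]], Z[idx[1::2]] = 1.0, -1.0  # alternating (stratified) pairs
    w = (np.abs(x / h) <= 1.0).astype(float)
    X = np.column_stack([np.ones(N), x, Z, x * Z])
    A = np.linalg.inv(X.T @ (w[:, None] * X))
    V = A @ (X.T @ ((w**2)[:, None] * X)) @ A
    return N * V[2, 2]                      # index 2: coefficient on Z

N = 20000
v = [nvar_beta3(N, D) for D in (0.0, 0.25, 0.5)]
print([round(v[0] / vi, 2) for vi in v])    # efficiency grows with Delta
```

The printed ratios are the empirical efficiencies var(β̂_3 | X; 0)/var(β̂_3 | X; ∆); they increase with ∆ and approach 4 when the whole local window ∆ = h is randomized, matching the boxcar results derived below.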
For x_i = (2i − N − 1)/N, the matrices X^T W X/N and X^T W^2 X/N contain elements that can be approximated by integrals of the form I_{r,s,t} for integer exponents r, s and t. Our expressions simplify somewhat because Z^2 = 1, making every I_{r,2,t} = I_{r,0,t}, and also because both x and E(Z | x; ∆) are antisymmetric functions of x, making them orthogonal to K(x/h), which we have assumed to be symmetric. The error in those moment approximations is O_p(N^{-1/2}) if the Z_i are independent random variables. The error can be much less with other sampling schemes. For instance, we could use stratified sampling, forming pairs of subjects (i, i + 1) in the experimental region and randomly setting Z_i = ±1 and Z_{i+1} = −Z_i. We will use ≈ to describe approximations that are O_p(N^{-1/2}) or better. Applying first Z^2 = 1, then using symmetry and anti-symmetry, and noting that K^2(·) is also a symmetric function, the 32 components of these two matrices involve at most six distinct integrals. We rewrite those matrices, beginning with X^T W X/N, whose entries involve the quantities κ_0 and κ_2. Note that κ_0 and κ_2 may depend on h but they do not depend on ∆. A similar argument applies to X^T W^2 X/N, whose entries involve the quantities λ_0, λ_2, φ(∆) and ψ(∆). Now we are ready to describe the asymptotic variance of β̂_3.

Now (28) follows directly by matrix inversion and multiplication.
Our Lipschitz condition on the kernel K is present for technical reasons. Using a formulation with fixed and discrete x_i = (2i − N − 1)/N, this condition allows us to obtain the same error rate of O_p(N^{-1/2}) as would be obtained with random x_i drawn IID from U[−1, 1]. Without the Lipschitz condition, an adversarially chosen kernel K(·) might have point discontinuities at every rational multiple of the bandwidth h. We remark that our Lipschitz condition can be loosened to a 1/2-Hölder continuity condition, with details available from the first author upon request.
The variance formula in Theorem 2 does not require the linear model (2) to hold. When it does not hold there will generally be some bias E(2β̂_3 | X; ∆) − (µ_+(0) − µ_-(0)) ≠ 0. We suppose that the user will choose an h to appropriately navigate the bias-variance tradeoff, but that step takes place after the outcomes Y_i are observed, which are not available when ∆ is chosen, so we resort to comparing the variance for any choice of h.
We are primarily interested in comparing the asymptotic variance of τ̂_0 = 2β̂_3 for various choices of ∆. We especially want to compare the efficiency of tie-breaker designs with ∆ > 0 to the RDD with ∆ = 0. To do this we consider the efficiency ratio Eff^{(N)}(∆) = var(β̂_3 | X; 0)/var(β̂_3 | X; ∆). Using Theorem 2, Eff^{(N)}(∆) converges in probability to the asymptotic efficiency ratio (30), which is expressed using quantities that we defined at (25) and (27).

Efficiency with boxcar and triangular kernels
In this subsection we present the efficiency ratios under the conditions of Theorem 2 for the two kernels of greatest interest: the boxcar kernel and the triangular kernel. We work with x i = (2i−N −1)/N throughout this subsection.
For the boxcar kernel K_BC(u) = 1{|u| ≤ 1}, we can assume without loss of generality that h ≤ 1, because there are no data with |x_i − t| = |x_i| > 1, and any h > 1 will give the same estimate as h = 1. We find for this kernel the values of κ_0, κ_2, λ_0, λ_2, φ(∆) and ψ(∆) recorded at (31). Using some foresight, we define the local tie-breaker constant δ = ∆/h. This is the fraction of the local regression region in which the treatment was assigned at random.
Proposition 1. Under the conditions of Theorem 2 and using the boxcar kernel K_BC, the asymptotic efficiency ratio of the tie-breaker design is

Eff_BC = 4 − 3((1 − δ^2)_+)^2    (32)

for δ = ∆/h ≤ 1. If δ > 1, then Eff_BC = 4.
Proof. Because many quantities from (31) are identical, substituting them into (30) produces numerous simplifications that yield (32). For 0 ≤ δ < 1, formula (32) follows from expanding the quadratic, while for δ > 1 the positive part term vanishes.
Choosing h = 1 makes the local regression a global one. We then get the same efficiency ratio as in equation (6) from Owen and Varian (2020). By taking derivatives it is easy to show that the efficiency ratio in (32) is strictly increasing as the local amount of experimentation δ varies over the interval 0 < δ < 1. Figure 2 plots Eff BC versus δ.
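The boxcar efficiency ratio admits the closed form Eff_BC(δ) = 4 − 3((1 − δ^2)_+)^2 (our reconstruction, consistent with Eff_BC(0) = 1, Eff_BC = 4 for δ ≥ 1, and the positive part remark in the proof above). A quick numerical check of the endpoints and the monotone increase on (0, 1):

```python
# Closed form for the boxcar efficiency ratio; the positive part makes
# Eff_BC constant at 4 once delta >= 1 (fully randomized local window).
def eff_bc(delta):
    return 4.0 - 3.0 * max(0.0, 1.0 - delta * delta) ** 2

vals = [eff_bc(i / 100.0) for i in range(101)]
print(vals[0], vals[-1])                            # 1.0 at delta = 0, 4.0 at delta = 1
assert all(a < b for a, b in zip(vals, vals[1:]))   # strictly increasing on [0, 1]
```

The endpoint values 1 (an RDD gains nothing over itself) and 4 agree with the limits discussed above, and the grid confirms the strict monotonicity established by differentiation.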
The triangular spike kernel K_TS(x) = (1 − |x|)_+ (triangular kernel for short) is more complicated than the boxcar kernel because for it, K^2 is not proportional to K. Once again, we assume that h ≤ 1. For this kernel we compute κ_0 = h/2, κ_2 = h^3/12, λ_0 = h/3, and λ_2 = h^3/30, and then using δ = ∆/h, we get the following result.

Proposition 2. Under the conditions of Theorem 2 and using the triangular kernel K_TS, the asymptotic efficiency ratio Eff_TS of the tie-breaker design is the rational function of δ obtained by substituting these quantities into (30).

Proof. This follows from plugging the values of κ_0, κ_2, λ_0, λ_2, φ(∆), and ψ(∆) for the triangular kernel into (30). See Appendix E for the explicit calculations.
The second panel in Figure 2 shows Eff_TS versus the local experiment size δ. The efficiency curve has a similar monotone increasing shape to the one we saw for the boxcar kernel. The maximum efficiency ratio, at δ = 1, is 18/5 = 3.6 instead of 4. The efficiency ratio is a rational function of δ with a numerator of degree 12 and a denominator of degree 7. It is strictly increasing on the interval 0 < δ < 1, though the proof is lengthy enough that we move it to the Appendix.
Proposition 3. The derivative of Eff TS with respect to δ is positive for 0 < δ < 1.
Proof. See Appendix F.
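The maximum efficiency 18/5 = 3.6 at δ = 1 quoted above can be checked numerically from the sandwich variance formula var(β̂ | X; ∆) = σ^2 (XᵀWX)⁻¹XᵀW²X(XᵀWX)⁻¹ (a sketch, not the paper's code; triangular weights on the evenly spaced x_i, with N and h = 1 illustrative):

```python
import numpy as np

def nvar_beta3_tri(N, Delta, h=1.0):
    """N * var(beta3_hat | X; Delta) / sigma^2 with triangular weights."""
    x = (2.0 * np.arange(1, N + 1) - N - 1) / N
    Z = np.sign(x)                              # deterministic outside window
    idx = np.where(np.abs(x) <= Delta)[0]
    Z[idx[0::2]], Z[idx[1::2]] = 1.0, -1.0      # stratified alternating pairs
    w = np.maximum(0.0, 1.0 - np.abs(x / h))    # K_TS(u) = (1 - |u|)_+
    X = np.column_stack([np.ones(N), x, Z, x * Z])
    A = np.linalg.inv(X.T @ (w[:, None] * X))
    V = A @ (X.T @ ((w**2)[:, None] * X)) @ A
    return N * V[2, 2]                          # index 2: coefficient on Z

N = 20000
eff = nvar_beta3_tri(N, 0.0) / nvar_beta3_tri(N, 1.0)
print(round(eff, 2))     # near 18/5 = 3.6
```

The finite-sample ratio at δ = 1 lands close to the asymptotic maximum 3.6, slightly below the boxcar kernel's maximum of 4.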

Classroom size data
We explored the efficiency ratio for the tie-breaker design for x_i with a uniform distribution. While that can be arranged by using ranks, in other situations we might prefer to use the original values of a running variable, and those might not be uniformly distributed. We show how to proceed using a dataset from Angrist and Lavy (1999) on classroom sizes. Angrist and Lavy (1999) studied the causal effect of classroom size on the test performance of elementary school students in Israel. In Israel, the Maimonides rule mandates that elementary school classes cannot exceed 40 students. If a school has 41 students enrolled in a particular grade, that grade must be split into two classes. Note that grades with 40 or fewer enrolled students are allowed to split into multiple classes, and that grades with slightly more than 40 students occasionally violate the Maimonides rule and do not split into multiple classes. Despite this, we can consider this a setting for an RDD where the treatment variable is whether or not the school is legally mandated to split a particular grade into smaller classes.
The dataset, published on the Harvard Dataverse (Angrist and Lavy, 2009), has verbal and math scores for 3rd, 4th and 5th graders across Israel. We chose to focus exclusively on 4th grade verbal scores as our response variable and 4th grade enrollments as our assignment variable because Angrist and Lavy (1999) suggest that a slightly significant effect of the treatment on 4th grade verbal scores exists. Even though the data were not generated by a tie-breaker we can still compute the relative efficiency that a tie-breaker design would have had.
To simplify the analysis, we removed from the dataset all schools that had either more than 80 students or more than two 4th-grade classes. We further removed all schools that had NA entries for either class size or verbal scores, leaving N = 711 schools in our filtered dataset. See Figure 3 for a visualization of the distribution of the 4th grade enrollments and Figure 4 for visualizations of the local linear regression-based RDD on this dataset using boxcar and triangular kernels. We use the bandwidths h_IK given by the Imbens and Kalyanaraman (2012) procedure, computed using that paper's MATLAB code. The apparent benefit from smaller classrooms is positive but small and, it turns out, not statistically significant in this analysis. The 95% confidence interval (assuming homoscedastic errors) for the effect size at the boundary of the local linear regression-based RDD was (−1.5, 9.2) when a boxcar kernel with bandwidth h_IK,BC = 7.09 was used. The 95% confidence interval for the effect size at the boundary of this RDD was (−2.4, 9.4) when a triangular kernel with bandwidth h_IK,TS = 9.02 was used.
Next we illustrate how an investigator can estimate the efficiency ratio of tiebreaker designs as a function of ∆ on sample values of the assignment variable. First we translate the data, replacing x i by x i − 40.5 to move the threshold from t = 40.5 to t = 0. Next, for each ∆ of interest we use 1000 Monte Carlo samples to estimate var(β 3 | X ; ∆) and also var(β 3 | X ; 0), both up to a constant σ 2 . That gives us 1000 efficiency ratios Eff (N ) (∆) = var(β 3 | X ; 0)/var(β 3 | X ; ∆) for each ∆. In each of our 1000 samples, we simulate random assignments Z i for a tie-breaker design at the given experimental radius ∆. The random assignments are stratified: in each consecutive pair of classroom sizes in the experimental region, one was randomly chosen to have Z = 1 and the other got Z = −1. The x i and the random Z i let us compute the matrices X and W defined in the beginning of Section 5, from which we compute a non-asymptotic var(β 3 | X ; ∆) using (22). We do not simulate any Y i values because efficiency only depends on X , and we are retaining the bandwidths from the Imbens and Kalyanaraman (2012) procedure on the original data. A more detailed simulation randomizing the bandwidth choice is out of scope. Our simulations demonstrate that the TBD is more efficient at each fixed h, so we expect that it will also be more efficient at a randomly chosen h. There could be exceptions if the bandwidth is adversarially correlated with the estimation errors but we do not think that is likely. Figure 5 shows boxplots of 1000 simulated Eff (N ) (∆) values for various choices of ∆ ∈ N to plot the full efficiency curve. It is clear from Figure 5 that with stratified allocations the efficiency is very reproducible. Figure 6 shows results for different bandwidths, ranging from h IK /2 to 3h IK /2. Because the efficiencies are so reproducible given the bandwidth, we just plot curves of the mean and standard deviations of estimated Eff values. 
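A sketch of this simulation loop on synthetic data (the Angrist-Lavy enrollments are not reproduced here, so the x values, bandwidth h = 7, radius ∆ = 5, and 200 replications below are all illustrative stand-ins for the quantities described above):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic enrollment-like counts, centered so the threshold is at 0.
x = np.sort(rng.integers(5, 76, size=700) - 40.5)
h, Delta, reps = 7.0, 5.0, 200

def nvar_beta3(Z, w):
    """Sandwich variance (X'WX)^{-1} X'W^2X (X'WX)^{-1}, entry for Z."""
    X = np.column_stack([np.ones(x.size), x, Z, x * Z])
    A = np.linalg.inv(X.T @ (w[:, None] * X))
    return (A @ (X.T @ ((w**2)[:, None] * X)) @ A)[2, 2]

w = (np.abs(x / h) <= 1.0).astype(float)     # boxcar kernel weights
v_rdd = nvar_beta3(np.sign(x), w)            # Delta = 0 reference design

effs = []
for _ in range(reps):
    Z = np.sign(x).copy()
    idx = np.where(np.abs(x) <= Delta)[0]    # experimental region
    pairs = len(idx) // 2
    s = rng.choice([-1.0, 1.0], size=pairs)  # stratified: random sign per pair
    Z[idx[0 : 2 * pairs : 2]] = s
    Z[idx[1 : 2 * pairs : 2]] = -s
    effs.append(v_rdd / nvar_beta3(Z, w))

print(round(float(np.mean(effs)), 2), round(float(np.std(effs)), 3))
```

As in Figure 5, the simulated efficiencies sit well above 1 and, with stratified allocation, vary very little across replications; no Y values are needed because the efficiency depends only on X.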
For both the boxcar and triangular kernels, we see that the tie-breaker design is reproducibly more efficient than the RDD and the effect increases as δ = ∆/h increases for all h we studied. The efficiency curves for this dataset under various bandwidth choices look similar to the theoretical efficiency curves derived in Section 5 for the case of a uniform assignment variable.
For a further discussion of the Maimonides rule, see Angrist et al. (2019). They consider different data sets and also investigate the possibility that the class sizes are sometimes manipulated to be above the threshold triggering a classroom split.

Comparison with theoretical results for uniform assignment variable
Our theoretical analysis in Section 5 is for a uniformly spaced assignment variable. We can offer one explanation for why the empirical efficiencies on nonuniformly distributed data look so similar to the theoretical ones for uniformly distributed data (see the left panels in Figure 6). The explanation uses some results about nonparametric regression from Fan and Gijbels (1996, Table 2.1). Nonparametric regression estimates µ̂(t) typically have an asymptotic variance whose leading term is proportional to 1/f(t), where f is the probability density of the x_i. This arises because the local sample size is asymptotically proportional to f(t). Hence, when considering nonuniform distributions, the 1/f(t) factors in the leading order variance terms cancel out when computing the efficiency ratios. Some nonparametric regression estimators, such as the Nadaraya-Watson estimator, have a lead term in their bias that depends on the derivative f′(t), and while f′(t) = 0 for uniformly distributed data, it is not zero in general. Kernel weighted least squares methods (with symmetric K(·)) do not have a dependency on f′(t) in their bias. There is a curvature bias from µ″(t) but that is not related to the sampling distribution of the x_i. The lead terms in bias and variance for local linear regressions do not distinguish between distributions with the same value of f(t) but different f′(t). Thus the effects of non-uniformity of X are asymptotically negligible.

Discussion
If an investigator is able to implement a 3-level tie-breaker design with any experimental radius ∆ ∈ (0, ∆ max ), our results show that the TBD has considerable statistical advantages over the RDD.
The most obvious advantage is that the TBD allows estimation of multiple causal parameters of interest, including the average treatment effect over subjects with x ∈ (t − ∆, t + ∆) as well as the expected treatment effect at any particular x ∈ (t − ∆, t + ∆). The former is estimable at a faster rate and with fewer assumptions, whereas the latter may still be of interest for choosing a future policy threshold. Meanwhile, the RDD only allows estimation of τ_thresh, the expected treatment effect at x = t.
Even if the only goal is estimation of τ thresh , our results indicate a statistical advantage to running a TBD rather than an RDD and an advantage to picking a larger experimental radius ∆ ∈ (0, ∆ max ). As seen in Section 4, to achieve the same asymptotic MSE in mean squared optimal estimation of τ thresh , a TBD would require roughly 64 percent fewer samples than would be needed for an RDD. Moreover, the asymptotic advantage for a TBD is largely driven by its lower variance (Figure 1). Hence, if the convenient, but controversial, method of undersmoothing to construct asymptotically valid confidence intervals for τ thresh is used instead of more nearly optimal approaches, the TBD would exhibit even greater advantages over the RDD. We point readers to the introduction of Calonico, Cattaneo and Farrell (2018) for an overview of the history of undersmoothing, and Calonico, Cattaneo and Farrell (2019) for a modern approach to constructing confidence intervals that has better coverage properties than undersmoothing has.
In terms of the statistical advantages of picking a larger ∆, Owen and Varian (2020) found an efficiency advantage for the tie-breaker in a global regression, wherein the estimation variance decreased monotonically in ∆. We provide a comparable finding for the now more standard local linear regression approach: for any fixed bandwidth h, the theoretical efficiency increases with the amount ∆ of experimentation. We have not investigated the effect of ∆ on the subsequent choice of h when ĥ_opt,TBD > ∆, although one candidate choice is an h > ∆ that removes the leading order bias term, which we derived in Appendix D.
There is room for an improved estimator of γ in the TBD context which uses data from both treatments on both sides of the threshold t. We leave this for future work. A critical ingredient is the estimation of µ_±^{(2)}(t). Compared to the method in Imbens and Kalyanaraman (2012), one could use a bandwidth tuned for an interior point t instead of one tuned for an endpoint. Also, the curvature estimates in Imbens and Kalyanaraman (2012) use local quadratic regressions, while Fan and Gijbels (1996, p. 63) suggest using local cubic regressions for curvature estimation at an interior point.

Appendix A: Proof of Lemma 2
Without loss of generality suppose t = 0 (if this is not the case we can define a new assignment variable that is a translation of the original assignment variable by t). It is convenient to define, for nonnegative integers j, the quantities F_{j,±} and G_{j,±}.
Next, let f_± be the conditional density of X_i given that Z_i = ±1. We now prove some helpful small bandwidth approximations for F_{j,±} and G_{j,±}.
Lemma A.1. In the setting of Lemma 2, for nonnegative integers j, Proof. We will prove this for F_{j,+}; the proof for F_{j,-} is identical. First note that F_{j,+} is the average of N_+ IID random variables, so where the last equality holds because K(·) is bounded with bounded support, f_+ is continuously differentiable at 0, and asymptotically h → 0. Meanwhile, the term Var[X^j K_h(X) | Z = 1] is upper bounded by where the last equality holds because K(·) is bounded with bounded support, f_+ is continuous and strictly positive at 0, and asymptotically h → 0. Thus where the last equality holds because as N → ∞, hN → ∞, so h^{2j}/(hN) = o(h^{2j}), and because N/N_+ = O_p(1).
Combining the previous results completes the proof.
Lemma A.2. In the setting of Lemma 2, for nonnegative integers j, with π_j = ∫_{-∞}^∞ u^j K^2(u) du. Proof. We will prove this for G_{j,+}; the proof for G_{j,-} is identical. First note that G_{j,+} is the average of N_+ IID random variables, so where the last equality holds because K(·) is bounded with bounded support, σ_+^2(·)f_+(·) is continuous at 0, and asymptotically h → 0.
Meanwhile, the term Var[X^j K_h^2(X)σ_+^2(X) | Z = 1] is upper bounded by where the last equality holds because K(·) is bounded with bounded support, σ_+^2(·)f_+(·) is continuous and strictly positive at 0, and asymptotically h → 0. Thus where the last equality holds because as N → ∞, hN → ∞, so h^{2j}/(hN) = o(h^{2j}), and because N/N_+ = O_p(1). Combining the previous results completes the proof. We are now ready to compute an asymptotic approximation for the bias and variance of the causal estimator τ̂_thresh = τ̂(0). Recall that, as discussed in Section 3, τ̂(0) can be equivalently computed by solving two separate local linear regressions, one for the treatment group and one for the control group, rather than solving (4) and plugging the solution into (5). In this proof, we use the two separate local linear regression formulation, so that the proof more closely resembles that seen in the appendix of Imbens and Kalyanaraman (2012).
In particular, define X_+ ∈ R^{N_+×2} and X_- ∈ R^{N_-×2} to be the design matrices for the local linear regression restricted to the treated group and the control group respectively. Also define the corresponding local linear regression weight matrices and the corresponding conditional variance matrices, and let e_1 = (1, 0). The causal estimator for τ(0) = µ_+(0) − µ_-(0) is given by τ̂(0) = µ̂_+(0) − µ̂_-(0), where µ̂_+(0) and µ̂_-(0) are the local linear regression estimators. Let X ∈ R^{N×4} be the full design matrix whose ith row is (1, X_i, Z_i, X_i Z_i), and note that the matrices X_+ and X_- and the sets Z_+ and Z_- are functions of the full design matrix.
In the next subsection, we compute the asymptotic approximation to the bias of the estimator E[τ (0)|X ] − τ (0), and in the following subsection, we compute the asymptotic approximation to its variance Var[τ (0)|X ]. These calculations leverage Lemmas A.1 and A.2.

Asymptotic approximation of the bias
We will compute the asymptotic formula for B_+, and by an identical argument, the asymptotic formula for B_- will follow. Since µ_+(·) has at least three continuous derivatives in an open neighborhood of 0, for each i ∈ Z_+ we may expand µ_+(X_i) = µ_+(0) + µ_+^{(1)}(0)X_i + µ_+^{(2)}(0)X_i^2/2 + T_i, with remainder satisfying |T_i| ≤ (|X_i|^3/6) sup_{x∈[-|X_i|,|X_i|]} |µ_+^{(3)}(x)|. So letting T_+ = (T_i)_{i∈Z_+} and S_+ = (µ_+^{(2)}(0)X_i^2/2)_{i∈Z_+}, and combining previous results, since ν_0ν_2 − ν_1^2 > 0 (by the Cauchy-Schwarz inequality), Lemma A.1 and a first order Taylor expansion, as h → 0 and N → ∞, yield the leading term. Now note that since K(·) has bounded support, there exists an ε > 0 such that for all h ∈ (0, ε), K_h(x) = 0 whenever |x| > a. Therefore, for all h < ε and all i, the remainder bound applies. Thus for all h sufficiently small, the bottom equality holds by Lemma A.1, and the top equality holds by the exact same argument as the proof of Lemma A.1 except with absolute values.
Combining the two previous results and multiplying through and dividing by N + , Similarly, by Lemma A.1, and therefore, A similar argument shows that Hence, recalling that the bias is given by E[τ (0)|X ] − τ (0) = B + − B − it follows that the asymptotic formula for the bias is Squaring the above formula, we recover the leading order squared-bias. As noted in Section 4, the leading order squared-bias is the first term in (12).

Asymptotic approximation of the variance
Defining V_± = Var(µ̂_±(0) | X), the variance of our causal estimator is Var(τ̂(0) | X) = V_+ + V_-, where the equality holds because (X_i, Y_{i+}, Y_{i-}) are IID by assumption, making Y_+ and Y_- independent conditionally on the treatment assignments, which in turn implies that µ̂_±(0) are independent conditionally on X.
We now compute an asymptotic formula for V + and by an identical argument, the asymptotic formula for V − will follow.
First note that the variance has a sandwich form. Rescaling the middle factor by 1/N_+ and recalling the expansion from the previous subsection, then dividing through by N_+ and rearranging terms, we obtain the leading order expression. By the weak law of large numbers, N Pr(Z_i = 1)/N_+ = 1 + o_p(1), and a similar argument gives the formula for V_-. As remarked in Section 4, the leading order variance matches the second term in formula (12).

Asymptotic expression for mean squared error
To complete the proof of the asymptotic expression for the tie-breaker design MSE in Lemma 2, note that N h where the second last equality holds from combining equations (37), (36), and (11), while the last equality holds from definition (12).

Asymptotically optimal bandwidth expression
To complete the proof of Lemma 2, recall that we define h_opt,TBD(N) = argmin_h AMSE_TBD(h, N), and minimize (12) over h.
Plugging (18) into (12), an identical argument that uses γ̂_TBD →p γ and definitions (15) and (16) completes the argument. In applying Lemma 1, it will be helpful to note that, letting c = (C_2/(4C_1))^{1/5}, we have ĥ_opt,RDD(N) = c γ̂_RDD N^{-1/5}. Therefore, under the conditions of Theorem 1, applying Lemmas 1 and 2 and combining the previous results proves the theorem.

Appendix C: Extensions to assignment probability p ≠ 1/2

In this appendix we consider the MSE of a generalization of the 3-level TBD given by (1), in which we allow the assignment probabilities of Z = 1 and Z = −1 to differ from 1/2 within the interval of experimentation. In particular, we consider a design with the following assignment probabilities for some p ∈ (0, 1). Under this more general version of the TBD, one can show that the AMSE formula given in (12) changes accordingly, and that the following lemma holds.
Lemma C.1. Under the conditions of Lemma 2, with the exception that the assignment probabilities follow (38) rather than (1), the mean squared error in estimating τ thresh is given by and the asymptotically optimal bandwidth, defined by arg min h AMSE TBD(p) (h, N ) is Proof. The proof is identical to the proof of Lemma 2 in Appendix A, except that the formulas for the functions f ± (·) presented at (34) are instead given by The changes in the formulas for f ± (·) do not affect the leading order bias formula but do affect the leading order variance and optimal bandwidth formulas.
Analogously to the empirical bandwidth choice given by (18), an investigator running a TBD of the form (38) and seeking mean squared optimal estimation of τ_thresh would ultimately use the bandwidth ĥ_opt,TBD(p)(N) = (C_2/(4C_1))^{1/5} γ̂_TBD(p) N^{-1/5}, where γ̂_TBD(p) is some consistent estimator of the p-dependent analogue of γ in which σ_+^2(t) enters with weight 1/(2p) and σ_-^2(t) with weight 1/(2(1 − p)). To derive an analogue of Theorem 1 for a TBD of the form (38), it is convenient to define the relative variance measure below. The following theorem compares the RDD with N points to a TBD of the form (38) with θN points for some θ > 0. In the accompanying contour plot we consider the same sample size for each design (θ = 1) and suppose that the most favorable kernel for the RDD (the triangular kernel) is used. The color scale for the contour plot is displayed on the right, with the added blue lines denoting a level of 1.
Outside the blue lines, the RDD performs better than the TBD asymptotically. The vertical black lines, between which the relative AMSE is always greater than 1, are placed at p = θ*/2 and p = 1 − θ*/2, for θ* = 60.46618^{-1/4}. The pair of green lines gives the boundaries of the two triangular regions in which the TBD with assignment probabilities (38) has a smaller AMSE than the default TBD with p = 1/2. Proof. The proof is the same as that of Theorem 1, except that the formulas for AMSE_TBD(p)(h, N), h_opt,TBD(p)(N), and ĥ_opt,TBD(p)(N) are different in the setting of a tie-breaker design of the form (38).
In the proof of Lemma 2, where we computed the leading-order bias, we did not need to assume (I) and (II), because we supposed that h → 0 as N → ∞. In the following theorem about the leading-order bias, because we do not assume h → 0, we assume (I) and (II) in addition to (i)–(vi).
Theorem D.1. Suppose conditions (i)–(vi) from the main text and conditions (I) and (II) hold, and that N → ∞. Let ∆ > 0 be fixed. Under the tie-breaker design defined by (1) and estimation of $\tau_{\mathrm{thresh}}$ according to (4) and (5) for some bandwidth h > 0, the bias of $\hat{\tau}_{\mathrm{thresh}}$ is given by (43), where $\nu_j$ is defined at (10). As a result, the leading-order bias term equals 0 whenever the bandwidth h is chosen so that (44) holds. Moreover, there must exist an h > ∆ > 0 that satisfies (44).
Proof. Fix ∆ > 0 and suppose without loss of generality that t = 0. Define $K_h(\cdot)$, $Z_\pm$, $N_\pm$, $F_{j,\pm}$ for j = 0, 1, . . . , 4, $f_\pm(\cdot)$, $X_\pm$, $W_\pm$, $Y_\pm$, $\hat{\mu}_\pm(0)$, and $B_\pm$ according to the same definitions presented in Appendix A. We will first derive a formula for $F_{j,\pm}$ that is similar to that given in Lemma A.1, except here we will not rely on a simplification that occurs in the asymptotic regime where h → 0, eventually dropping below ∆.
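To make the quantities just defined concrete, the one-sided local linear fit behind $\hat{\mu}_{+}(0)$ can be sketched as follows. The data, bandwidth, and triangular kernel below are illustrative choices, not taken from the paper; only the structure of the fit (kernel weights $K_h(X_i)$, moments $F_{0,+}, F_{1,+}, F_{2,+}$, and the denominator $F_{0,+}F_{2,+} - F_{1,+}^2$) mirrors the definitions in the text.

```python
import numpy as np

# Schematic of the one-sided local linear fit behind mu_hat_+(0): regress the
# treated-side responses on (1, X_i) with kernel weights K_h(X_i) and read off
# the fitted intercept at the threshold t = 0.

def triangular_kernel(u):
    return np.maximum(1.0 - np.abs(u), 0.0)

def local_linear_at_zero(x, y, h):
    """Intercept of the kernel-weighted least squares line, evaluated at t = 0."""
    w = triangular_kernel(x / h) / h           # weights K_h(X_i)
    F0, F1, F2 = np.sum(w), np.sum(w * x), np.sum(w * x**2)
    # F0*F2 - F1**2 is the denominator shown in the text to stay bounded away from zero.
    denom = F0 * F2 - F1**2
    return (F2 * np.sum(w * y) - F1 * np.sum(w * x * y)) / denom

x = np.linspace(0.01, 1.0, 200)                # treated-side scores (X_i > t = 0)
y = 1.0 + 2.0 * x                              # noiseless linear response, for the check below
mu_hat = local_linear_at_zero(x, y, h=0.5)
```

On noiseless data from an exactly linear mean function, the weighted least squares intercept recovers the true value at the threshold (here 1.0) up to floating-point error, which is a useful sanity check on the normal-equations algebra.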
For any nonnegative integer $j$, because the samples are IID, we can compute $F_{j,+}$ directly. To simplify this expression further, observe that, for any $a, b \in [-\infty, \infty]$, the last two steps hold by Assumptions (II) and (iv) respectively. Combining this with the above formula for $F_{j,+}$ yields (45). If we define
$$\nu_{j,-}(\delta) \equiv \int_{-\delta}^{\delta} u^{j} K(u)\,du + 2\int_{-\infty}^{-\delta} u^{j} K(u)\,du,$$
a similar argument yields an analogous formula for $F_{j,-}$. By noting the similarities between the formula for $F_{j,+}$ in Lemma A.1 and that from (45), the formula for $B_+$ can be derived by the same argument presented in Appendix A, by simply replacing the $\nu_j$ terms with $\nu_{j,+}(\Delta/h)$ terms, for any integer $j$. Because we now no longer assume $h \to 0$ as $N \to \infty$, there are three steps where we have to use a slightly different argument from that presented in Appendix A. First, since we no longer assume that $h \to 0$, the second-order Taylor expansion of the mean functions with the remainder terms $T_i$ requires that $\mu^{(2)}_{\pm}(\cdot)$ be continuous everywhere, rather than merely in a neighborhood of $t$ as given by Assumption (iii). Assumption (I), that $\mu^{(3)}_{\pm}(\cdot)$ is bounded, guarantees that $\mu^{(2)}_{\pm}(\cdot)$ is continuous everywhere, because differentiability implies continuity. Second, we can use Assumption (I) to show that there exists an $\bar{a}_+ < \infty$ for which $K_h(X_i)\sup_{x \in [-X_i, X_i]} |\mu^{(3)}_{+}(x)| \le \bar{a}_+ K_h(X_i)$ for any $h > 0$. In Appendix A we did not need Assumption (I), since we supposed $h \to 0$, so it was enough to show that the inequality holds for all sufficiently small $h$. Third, the argument involves a first-order Taylor expansion of the quantity $(F_{0,+}F_{2,+} - F_{1,+}^2)^{-1}$. In the current setting, where we do not suppose $h \to 0$ as $N \to \infty$, we must now ascertain that, with probability approaching 1 as $N \to \infty$, $F_{0,+}F_{2,+} - F_{1,+}^2$ is positive and bounded away from zero.
Recalling that, up to an $O_p(1/\sqrt{N})$ term, $F_{j,+}$ equals $h^{j}\int_{-\infty}^{\infty} u^{j} K(u) f_{+}(hu)\,du$, it follows by the Cauchy–Schwarz inequality that, with probability converging to 1 as $N \to \infty$, $F_{0,+}F_{2,+} - F_{1,+}^2$ is positive and bounded away from zero. Consequently, a first-order Taylor expansion of $(F_{0,+}F_{2,+} - F_{1,+}^2)^{-1}$ holds for all $h > 0$, with a first term whose denominator involves $h^2 f_{+}^2(0)$ and is positive and bounded away from 0 by a similar Cauchy–Schwarz argument that leverages the symmetry of $K$. By the same derivation of the bias formula presented in Appendix A, with the exceptions in the argument noted above, we obtain the bias $B_+$ of $\hat{\mu}_{+}(0)$. The bias of $\hat{\tau}_{\mathrm{thresh}}$ is then $B_+ - B_-$. To simplify the expression for $B_+ - B_-$, observe that by the symmetry of $K(\cdot)$, for any $h > 0$,
$$\nu_{j,-}(\Delta/h) = \nu_{j,+}(\Delta/h) = \nu_j \quad \text{for even } j, \qquad \nu_{j,-}(\Delta/h) = -\nu_{j,+}(\Delta/h) = -2\int_{\Delta/h}^{\infty} u^{j} K(u)\,du \quad \text{for odd } j.$$
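The symmetry identities for $\nu_{j,-}(\delta)$ can be verified numerically. The sketch below uses the triangular kernel $K(u) = (1 - |u|)_+$ purely as an illustrative symmetric kernel, and a simple trapezoid rule in place of exact integration.

```python
import numpy as np

# Numerical check of the identities
#   nu_{j,-}(delta) = nu_j                                   for even j,
#   nu_{j,-}(delta) = -2 * int_{delta}^{inf} u^j K(u) du     for odd j,
# for the symmetric triangular kernel K(u) = max(1 - |u|, 0), which vanishes
# outside [-1, 1], so all "infinite" integrals truncate at +-1.

def K(u):
    return np.maximum(1.0 - np.abs(u), 0.0)

def integrate(f, a, b, n=200000):
    u = np.linspace(a, b, n + 1)
    v = f(u)
    du = (b - a) / n
    return du * (v.sum() - 0.5 * (v[0] + v[-1]))   # trapezoid rule

def nu_minus(j, delta):
    # nu_{j,-}(delta) = int_{-delta}^{delta} u^j K du + 2 * int_{-inf}^{-delta} u^j K du
    central = integrate(lambda u: u**j * K(u), -delta, delta)
    left_tail = integrate(lambda u: u**j * K(u), -1.0, -delta)
    return central + 2.0 * left_tail

delta = 0.3
nu_0 = integrate(lambda u: u**0 * K(u), -1.0, 1.0)   # full moment, equals 1
nu_2 = integrate(lambda u: u**2 * K(u), -1.0, 1.0)   # full moment, equals 1/6
tail_1 = integrate(lambda u: u * K(u), delta, 1.0)   # int_{delta}^{inf} u K(u) du
```

For even $j$ the truncated-plus-reflected integral recovers the full moment $\nu_j$, while for odd $j$ it collapses to a pure tail integral, exactly as used in the simplification of $B_+ - B_-$.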
Here $\nu_j$ is defined at (10). Hence, subtracting $B_-$ from $B_+$ gives the stated bias of $\hat{\tau}_{\mathrm{thresh}}$. Since we supposed $t = 0$ without loss of generality, this proves (43). Finally, since $\nu_0\nu_2 - 4\bigl[\int_{\Delta/h}^{\infty} u K(u)\,du\bigr]^2$ is bounded away from zero, the leading-order term in $h$ in formula (43) for the bias of $\hat{\tau}_{\mathrm{thresh}}$ is zero whenever $h$ satisfies (44).

Appendix F: Proof of Proposition 3
We want to show that the function
$$\mathrm{Eff}_{\mathrm{TS}}(\delta) = \frac{\Bigl[\frac{2}{3} - 2\bigl(1 - 3\delta^2 + 2\delta^3\bigr)^2\Bigr]^2}{\frac{2}{5} - 5\bigl(1 - 3\delta^2 + 2\delta^3\bigr)\bigl(1 - 6\delta^2 + 8\delta^3 - 3\delta^4\bigr) + 2\bigl(1 - 3\delta^2 + 2\delta^3\bigr)^2}$$
has a positive derivative for 0 < δ < 1. The numerator has degree 12 and the denominator has degree 7. The customary formula for the derivative of a rational function produces a rational function with a non-negative denominator and a numerator of degree 18. We will work through a sequence of steps reducing the degree of this polynomial to show that the numerator must be positive on (0, 1). That then rigorously establishes the monotonicity of $\mathrm{Eff}_{\mathrm{TS}}(\delta)$ which is visually apparent.
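The degree bookkeeping above can be checked with exact rational arithmetic. The sketch below builds the numerator and denominator polynomials from the pieces $a = 1 - 3\delta^2 + 2\delta^3$ and $b = 1 - 6\delta^2 + 8\delta^3 - 3\delta^4$ as they appear in the display, then forms the quotient-rule numerator $P'Q - PQ'$; the specific coefficients are taken from the reconstructed display, and only the degree counts (12, 7, and 18) are being verified.

```python
from fractions import Fraction as F

# Polynomials are coefficient lists in ascending powers of delta, with exact
# Fraction coefficients so that no degree is lost to rounding.

def mul(p, q):
    out = [F(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def add(p, q):
    out = [F(0)] * max(len(p), len(q))
    for i, c in enumerate(p):
        out[i] += c
    for i, c in enumerate(q):
        out[i] += c
    return out

def scale(p, c):
    return [c * x for x in p]

def deriv(p):
    return [F(i) * c for i, c in enumerate(p)][1:]

def degree(p):
    d = -1
    for i, c in enumerate(p):
        if c != 0:
            d = i
    return d

a = [F(1), F(0), F(-3), F(2)]                     # 1 - 3d^2 + 2d^3
b = [F(1), F(0), F(-6), F(8), F(-3)]              # 1 - 6d^2 + 8d^3 - 3d^4
a2 = mul(a, a)
inner = add([F(2, 3)], scale(a2, F(-2)))          # 2/3 - 2a^2
P = mul(inner, inner)                             # numerator: degree 12
Q = add(add([F(2, 5)], scale(mul(a, b), F(-5))), scale(a2, F(2)))  # denominator: degree 7
D = add(mul(deriv(P), Q), scale(mul(P, deriv(Q)), F(-1)))          # P'Q - PQ': degree 18
```

The quotient rule gives the derivative of $P/Q$ as $(P'Q - PQ')/Q^2$, so the sign analysis in the proof comes down to the degree-18 polynomial $D$ computed above.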