Ridge rerandomization: An experimental design strategy in the presence of covariate collinearity

Randomization ensures that observed and unobserved covariates are balanced, on average. However, randomizing units to treatment and control often leads to covariate imbalances in realization, and such imbalances can inflate the variance of estimators of the treatment effect. One solution to this problem is rerandomization, an experimental design strategy that randomizes units until some balance criterion is fulfilled, which yields more precise estimators of the treatment effect if covariates are correlated with the outcome. Most rerandomization schemes in the literature utilize the Mahalanobis distance, which may not be preferable when covariates are high-dimensional or highly correlated with each other. As an alternative, we introduce an experimental design strategy called ridge rerandomization, which utilizes a modified Mahalanobis distance that addresses collinearities among covariates. This modified Mahalanobis distance has connections to principal components and the Euclidean distance, and, to our knowledge, has remained unexplored. We establish several theoretical properties of this modified Mahalanobis distance and our ridge rerandomization scheme. These results guarantee that ridge rerandomization is preferable over randomization and suggest when ridge rerandomization is preferable over standard rerandomization schemes. We also provide simulation evidence that suggests that ridge rerandomization is particularly preferable over typical rerandomization schemes in high-dimensional or high-collinearity settings. © 2020 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Randomized experiments are often considered the "gold standard" of scientific investigations because, on average, randomization balances all potential confounders, both observed and unobserved (Krause and Howard, 2003). However, many have noted that randomized experiments can yield "bad allocations", where some covariates are not well-balanced across treatment groups (Seidenfeld, 1981; Lindley, 1982; Papineau, 1994; Rosenberger and Sverdlov, 2008). Covariate imbalance among different treatment groups complicates the interpretation of estimated causal effects, and thus covariate adjustments are often employed, typically through regression or other comparable methods.
However, it would be better to prevent such covariate imbalances from occurring before treatment is administered, rather than depend on assumptions for covariate adjustment post-treatment which may not hold (Freedman, 2008). One common experimental design tool is blocking, where units are first grouped together based on categorical covariates, and then treatment is randomized within these groups. However, blocking is less intuitive when there are non-categorical covariates. A more recent experimental design tool that prevents covariate imbalance and allows for non-categorical covariates is the rerandomization scheme of Morgan and Rubin (2012), where units are randomized until a prespecified level of covariate balance is achieved. Rerandomization has been discussed as early as R.A. Fisher (e.g., see Fisher, 1992), and more recent works (e.g., Cox, 2009; Bruhn and McKenzie, 2009; Worrall, 2010) recommend rerandomization. Morgan and Rubin (2012) formalized these recommendations in treatment-versus-control settings and was one of the first works to establish a theoretical framework for rerandomization schemes. Since Morgan and Rubin (2012), several extensions have been made. Morgan and Rubin (2015) developed rerandomization for treatment-versus-control experiments where there are tiers of covariates that vary in importance; Branson et al. (2016) extended rerandomization to $2^K$ factorial designs; and Zhou et al. (2018) developed a rerandomization scheme for sequential designs. Finally, Li et al. (2018) established asymptotic results for the rerandomization schemes considered in Morgan and Rubin (2012, 2015), and Li and Ding (2020) established asymptotic results for regression adjustment combined with rerandomization.
All of these works focus on using an omnibus measure of covariate balance, the Mahalanobis distance (Mahalanobis, 1936), during the rerandomization scheme. The Mahalanobis distance is well-known within the matching and observational study literature, where it is used to find subsets of the treatment and control that are similar (Rubin, 1974; Rosenbaum and Rubin, 1985; Gu and Rosenbaum, 1993; Rubin and Thomas, 2000). The Mahalanobis distance is particularly useful in rerandomization schemes because (1) it is symmetric in the treatment assignment, which leads to unbiased estimators of the average treatment effect under rerandomization; and (2) it is equal-percent variance reducing if the covariates are ellipsoidally symmetric, meaning that rerandomization using the Mahalanobis distance reduces the variance of all covariate mean differences by the same percentage (Morgan and Rubin, 2012).
However, the Mahalanobis distance is known to perform poorly in matching for observational studies when there are strong collinearities among the covariates or there are many covariates (Gu and Rosenbaum, 1993; Olsen, 1997; Stuart, 2010). One reason for this is that matching using the Mahalanobis distance places equal importance on balancing all covariates as well as their interactions (Stuart, 2010), and this issue also occurs in rerandomization schemes that use the Mahalanobis distance. This issue was partially addressed by Morgan and Rubin (2015), who proposed an extension of Morgan and Rubin (2012) that incorporates tiers of covariates that vary in importance, such that the most important covariates receive the most variance reduction. However, this requires researchers to specify an explicit hierarchy of importance for the covariates, which might be difficult, especially when the number of covariates is large. Furthermore, it is unclear how to conduct current rerandomization schemes if collinearity is so severe that the covariance matrix of covariates is degenerate, and thus the Mahalanobis distance is undefined.
As an alternative, we consider a rerandomization scheme using a modified Mahalanobis distance that inflates the eigenvalues of the covariates' covariance matrix to alleviate collinearities among the covariates, which has connections to ridge regression (Hoerl and Kennard, 1970). Such a quantity has remained largely unexplored in the literature. First we establish several theoretical properties about this quantity, as well as several properties about a rerandomization scheme that uses this quantity. In particular, instead of reducing the variance of all covariates equally, ridge rerandomization increases the variance reduction of the first principal components of the covariate space at the expense of decreasing the variance reduction of the last principal components. We show through simulation that a rerandomization scheme that incorporates this modified criterion can be beneficial in terms of variance reduction when there are strong collinearities among the covariates. We also discuss how this modified Mahalanobis distance connects to other criteria, such as principal components and the Euclidean distance. Because the rerandomization literature has focused almost exclusively on the Mahalanobis distance, this work also contributes to the literature by exploring the use of other criteria besides the Mahalanobis distance for rerandomization schemes.
The remainder of this paper is organized as follows. In Section 2, we introduce the notation that will be used throughout the paper. In Section 3, we review the rerandomization scheme of Morgan and Rubin (2012). In Section 4, we outline our proposed rerandomization approach and establish several theoretical properties of this approach, as well as several theoretical properties about the modified Mahalanobis distance. In Section 5, we provide simulation evidence that suggests that our rerandomization approach is often preferable over other rerandomization approaches, particularly in high-dimensional or high-collinearity settings. In Section 6, we conclude with a discussion of future work.

Notation
We use the colon notation $\lambda_{1:K} = (\lambda_1, \ldots, \lambda_K) \in \mathbb{R}^K$ for tuples of objects, and we let $f(\lambda_{1:K}) = (f(\lambda_1), \ldots, f(\lambda_K))$ for any univariate function $f : \mathbb{R} \to \mathbb{R}$. We respectively denote by $I_N$ and $1_N$ the $N \times N$ identity matrix and the $N$-dimensional column vector whose coefficients are all equal to 1. Given a matrix $A$, we denote by $A_{ij}$ its $(i, j)$-coefficient, $A_{i\bullet}$ its $i$th row, $A_{\bullet j}$ its $j$th column, $A^\top$ its transpose, and $\mathrm{tr}(A)$ its trace when $A$ is square. Given two symmetric matrices $A$ and $B$ of the same size, we write $A > B$ (resp. $A \ge B$) if the matrix $A - B$ is positive definite (resp. semi-definite).
Let $x$ be the $N \times K$ matrix representing $K$ covariates measured on $N$ experimental units. Let $W_i = 1$ if unit $i$ is assigned to treatment and 0 otherwise, and let $W = (W_1, \ldots, W_N)^\top$. Unless stated otherwise, we will focus on completely randomized experiments (see Definition 4.2 of Imbens and Rubin, 2015) with a fixed number of $N_T$ treated units and $N_C = N - N_T$ control units. For a given assignment vector $W$, we define $\bar{x}_T = N_T^{-1} x^\top W$ and $\bar{x}_C = N_C^{-1} x^\top (1_N - W)$ as the respective covariate mean vectors within treatment and control. For completely randomized experiments, the covariance matrix of the covariate mean differences is
$$\Sigma = \mathrm{Cov}(\bar{x}_T - \bar{x}_C \mid x) = \frac{N}{N_T N_C}\, S_x^2,$$
where $S_x^2$ is the sample covariance matrix of $x$ (Morgan and Rubin, 2012). Throughout, we use $\Sigma$ to refer to this fixed covariance matrix, and we assume $\Sigma > 0$. Because $\Sigma$ is symmetric, the spectral theorem ensures that it is diagonalizable, with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_K > 0$. Let $\Gamma$ be the orthogonal matrix of corresponding eigenvectors, so that we may write $\Sigma = \Gamma\, \mathrm{Diag}(\lambda_{1:K})\, \Gamma^\top$, where $\mathrm{Diag}(\lambda_{1:K})$ denotes the $K \times K$ diagonal matrix whose $(k, k)$-coefficient is $\lambda_k$. Thus, $\Sigma$ and its eigenstructure are available in closed form, and the latter coincides with the eigenstructure of $S_x^2$ up to a scaling factor.
We let $\chi^2_K$ denote a chi-squared distribution with $K$ degrees of freedom, $P(\chi^2_K \le a)$ its cumulative distribution function (CDF) evaluated at $a \in \mathbb{R}$, and $q_{\chi^2_K}(p)$ its $p$-quantile for $p \in (0, 1)$.

Review of rerandomization
We follow the potential outcomes framework (Rubin, 1990, 2005), where each unit $i$ has fixed potential outcomes $Y_i(1)$ and $Y_i(0)$, which denote the outcome for unit $i$ under treatment and control, respectively. Thus, the observed outcome for unit $i$ is $y_i^{obs} = W_i Y_i(1) + (1 - W_i) Y_i(0)$, and we define $y^{obs} = (y_1^{obs}, \ldots, y_N^{obs})^\top$ as the vector of observed outcomes. We focus on the average treatment effect as the causal estimand, defined as
$$\tau = \frac{1}{N} \sum_{i=1}^{N} \{Y_i(1) - Y_i(0)\}. \tag{1}$$
Furthermore, we focus on the mean-difference estimator
$$\hat{\tau} = \bar{y}_T - \bar{y}_C, \tag{2}$$
where $\bar{y}_T = N_T^{-1} W^\top y^{obs}$ and $\bar{y}_C = N_C^{-1} (1_N - W)^\top y^{obs}$ are the average treatment and control outcomes, respectively.
When conducting a randomized experiment, ideally we would like $\bar{x}_T$ and $\bar{x}_C$ to be close; otherwise, the estimator $\hat{\tau}$ could be confounded by imbalances in the covariate means. Morgan and Rubin (2012) focused on a rerandomization scheme using the Mahalanobis distance to ensure that the covariate means are reasonably balanced for a particular treatment assignment. The Mahalanobis distance between the treatment and control covariate means is defined as
$$M = (\bar{x}_T - \bar{x}_C)^\top \Sigma^{-1} (\bar{x}_T - \bar{x}_C), \tag{3}$$
where the dependence of $M$ on the assignment vector $W$ is implicit through $(\bar{x}_T - \bar{x}_C)$. Morgan and Rubin (2012) suggest randomizing units to treatment and control by performing independent draws from the distribution of $W \mid x$ until $M \le a$ for some threshold $a \ge 0$. Hereafter, we refer to this procedure of randomizing units until $M \le a$ as rerandomization. The expected number of draws until the first acceptable randomization is equal to $1/p_a$, where $p_a = P(M \le a \mid x)$ is the probability that a particular realization of $W$ yields a Mahalanobis distance $M$ less than or equal to $a$. Thus, fixing $p_a$ effectively allocates an expected computational budget and induces a corresponding threshold $a$: the smaller the acceptance probability $p_a$, the smaller the threshold $a$ and thus the more balanced the two groups, but the larger the expected computational cost of drawing an acceptable $W$.
For example, to restrict rerandomization to the "best" 1% of randomizations, one would set $p_a = 0.01$, which implicitly sets $a$ equal to the $p_a$-quantile of the distribution of $M$ given $x$. If one assumes $(\bar{x}_T - \bar{x}_C) \mid x \sim N(0, \Sigma)$, then $M \mid x \sim \chi^2_K$, so that $a$ can be chosen equal to the $p_a$-quantile of a chi-squared distribution with $K$ degrees of freedom. The assumption $(\bar{x}_T - \bar{x}_C) \mid x \sim N(0, \Sigma)$ can be justified by invoking the finite population Central Limit Theorem (Erdös and Rényi, 1959; Li and Ding, 2017). When the distribution of $M \mid x$ is unknown, one can approximate it via Monte Carlo by simulating independent draws of $M \mid x$ and setting $a$ to the $p_a$-quantile of $M$'s empirical distribution.
Morgan and Rubin (2012) established that the mean-difference estimator $\hat{\tau}$ under this rerandomization scheme is unbiased in estimating the average treatment effect $\tau$, i.e., that $E[\hat{\tau} \mid x, M \le a] = \tau$. Furthermore, they also established that under rerandomization, if $N_T = N_C$ and $(\bar{x}_T - \bar{x}_C) \mid x \sim N(0, \Sigma)$, then not only are the covariate mean differences centered at 0, i.e., $E[\bar{x}_T - \bar{x}_C \mid x, M \le a] = 0$, but also they are more closely concentrated around 0 than they would be under randomization. More precisely, Morgan and Rubin (2012) established that
$$\mathrm{Cov}(\bar{x}_T - \bar{x}_C \mid x, M \le a) = v_a \Sigma, \tag{4}$$
where
$$v_a = \frac{P(\chi^2_{K+2} \le a)}{P(\chi^2_K \le a)} \in (0, 1). \tag{5}$$
Therefore, under their assumptions, rerandomization using the Mahalanobis distance reduces the variance of each covariate mean difference by $100(1 - v_a)\%$ compared to randomization. Morgan and Rubin (2012) call this last property equally percent variance reducing (EPVR). Thus, using the Mahalanobis distance for rerandomization can be quite appealing, but Morgan and Rubin (2012) rightly point out that non-EPVR rerandomization schemes may be preferable in settings with covariates of unequal importance. This is in part addressed by Morgan and Rubin (2015), who developed a rerandomization scheme that incorporates tiers of covariates that vary in importance. However, this requires researchers to specify an explicit hierarchy of covariate importance, which may not be immediately clear, especially when the number of covariates is large. Furthermore, if there are strong collinearities amongst covariates such that $\Sigma$ is degenerate and thus the $M$ in (3) is undefined, then it is unclear how one should conduct the rerandomization scheme of Morgan and Rubin (2012) and its extensions (Morgan and Rubin, 2015; Branson et al., 2016; Li et al., 2018; Li and Ding, 2020).
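The rerandomization loop reviewed in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Morgan and Rubin (2012): it assumes $\Sigma$ equals $N/(N_T N_C)$ times the sample covariance of $x$, approximates the threshold $a$ by Monte Carlo rather than by a chi-squared quantile, and uses hypothetical helper names (`mahalanobis_balance`, `draw_assignment`, `rerandomize`) and simulated covariates.

```python
import numpy as np

def mahalanobis_balance(x, w, sigma_inv):
    """M = (xbar_T - xbar_C)^T Sigma^{-1} (xbar_T - xbar_C) for assignment w."""
    diff = x[w == 1].mean(axis=0) - x[w == 0].mean(axis=0)
    return float(diff @ sigma_inv @ diff)

def draw_assignment(n, n_t, rng):
    """One completely randomized assignment with exactly n_t treated units."""
    w = np.zeros(n, dtype=int)
    w[rng.choice(n, size=n_t, replace=False)] = 1
    return w

def rerandomize(x, n_t, p_a, rng, n_mc=2000):
    """Draw assignments until M <= a, with a the empirical p_a-quantile of M."""
    n = x.shape[0]
    n_c = n - n_t
    # Assumed form of Sigma for a completely randomized experiment.
    sigma = n / (n_t * n_c) * np.cov(x, rowvar=False)
    sigma_inv = np.linalg.inv(sigma)
    # Monte Carlo approximation of the distribution of M given x.
    m_draws = [mahalanobis_balance(x, draw_assignment(n, n_t, rng), sigma_inv)
               for _ in range(n_mc)]
    a = float(np.quantile(m_draws, p_a))
    while True:
        w = draw_assignment(n, n_t, rng)
        m = mahalanobis_balance(x, w, sigma_inv)
        if m <= a:
            return w, m, a

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))      # simulated covariates, for illustration only
w, m, a = rerandomize(x, n_t=50, p_a=0.1, rng=rng)
```

With $p_a = 0.1$, the loop needs about ten draws on average, which matches the $1/p_a$ expected cost discussed above.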

Ridge rerandomization
As an alternative, we consider a modified Mahalanobis distance, defined as
$$M_\lambda = (\bar{x}_T - \bar{x}_C)^\top (\Sigma + \lambda I_K)^{-1} (\bar{x}_T - \bar{x}_C) \tag{6}$$
for some prespecified $\lambda \ge 0$. Guidelines for choosing $\lambda$ will be provided in Section 4.2. The eigenvalues of $\Sigma$ in (6) are inflated in a way that is reminiscent of ridge regression (Hoerl and Kennard, 1970). For this reason, we will refer to the quantity $M_\lambda$ as the ridge Mahalanobis distance. To our knowledge, the ridge Mahalanobis distance has remained largely unexplored, except for Kato et al. (1999), who used it in an application for a Chinese and Japanese character recognition system. Our proposed rerandomization scheme, referred to as ridge rerandomization, involves using the ridge Mahalanobis distance in place of the standard Mahalanobis distance within the rerandomization framework of Morgan and Rubin (2012). In other words, one randomizes the assignment vector $W$ until $M_\lambda \le a_\lambda$ for some threshold $a_\lambda \ge 0$.
In order to make a fair comparison between rerandomization and ridge rerandomization, we will fix the expected computational cost of ridge rerandomization by calibrating the respective thresholds so that
$$P(M_\lambda \le a_\lambda \mid x) = P(M \le a \mid x) = p_a. \tag{7}$$
Thus, fixing $p_a$ implicitly ties $\lambda$ and $a_\lambda$ together: every fixed $\lambda \ge 0$ and $p_a \in (0, 1)$ correspond to a unique $a_\lambda$ that satisfies (7).
As we will discuss in Section 4.3, the ridge Mahalanobis distance alleviates collinearity among the covariate mean differences by placing higher importance on the directions that account for the most variation. In that section we also discuss how ridge rerandomization encapsulates a spectrum of other standard rerandomization schemes. But first, in Section 4.1 we establish several theoretical properties of ridge rerandomization for some prespecified $(\lambda, a_\lambda)$, and in Section 4.2 we provide guidelines for specifying $(\lambda, a_\lambda)$. In Section 4.4, we discuss how to conduct inference for the average treatment effect $\tau$ after ridge rerandomization is used to design a randomized experiment.
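To make the ridge Mahalanobis distance concrete, the sketch below computes $M_\lambda$ on simulated covariates whose third column is an exact linear combination of the first two, so that $\Sigma$ is degenerate and the standard Mahalanobis distance is undefined. It then calibrates $a_\lambda$ as an empirical $p_a$-quantile. The helper names, the simulated data, and the values $\lambda = 0.01$ and $p_a = 0.1$ are illustrative assumptions, and $\Sigma$ is assumed to equal $N/(N_T N_C)$ times the sample covariance of $x$.

```python
import numpy as np

def ridge_mahalanobis(diff, sigma, lam):
    """M_lambda = diff^T (Sigma + lam * I_K)^{-1} diff."""
    k = sigma.shape[0]
    return float(diff @ np.linalg.solve(sigma + lam * np.eye(k), diff))

def mean_diff(x, n_t, rng):
    """Covariate mean difference for one completely randomized assignment."""
    treated = np.zeros(x.shape[0], dtype=bool)
    treated[rng.choice(x.shape[0], size=n_t, replace=False)] = True
    return x[treated].mean(axis=0) - x[~treated].mean(axis=0)

rng = np.random.default_rng(1)
z = rng.normal(size=(60, 2))
x = np.column_stack([z, z[:, 0] + z[:, 1]])   # third covariate is exactly collinear
n, n_t = x.shape[0], 30
sigma = n / (n_t * (n - n_t)) * np.cov(x, rowvar=False)   # degenerate 3 x 3 matrix

lam, p_a = 0.01, 0.1   # illustrative values, not recommendations
m_draws = np.array([ridge_mahalanobis(mean_diff(x, n_t, rng), sigma, lam)
                    for _ in range(2000)])
a_lam = float(np.quantile(m_draws, p_a))      # empirical p_a-quantile of M_lambda
accept_rate = float(np.mean(m_draws <= a_lam))
```

Because $\Sigma + \lambda I_K$ is positive definite for any $\lambda > 0$, every draw of $M_\lambda$ is finite here even though $\Sigma$ itself cannot be inverted.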

Properties of ridge rerandomization
The following theorem establishes that, on average, the covariate means in the treatment and control groups are balanced under ridge rerandomization, and that $\hat{\tau}$ is an unbiased estimator of $\tau$ under ridge rerandomization.
Theorem 4.1. Let $\lambda \ge 0$ and $a_\lambda > 0$ be some prespecified constants. If $N_T = N_C$, then $E[\bar{x}_T - \bar{x}_C \mid x, M_\lambda \le a_\lambda] = 0$ and $E[\hat{\tau} \mid x, M_\lambda \le a_\lambda] = \tau$.
Theorem 4.1 is a particular case of Theorem 2.1 and Corollary 2.2 from Morgan and Rubin (2012). Theorem 4.1 follows from the symmetry of $M_\lambda$ in treatment and control, in the sense that both assignments $W$ and $(1_N - W)$ yield the same value of $M_\lambda$. From Morgan and Rubin (2012), we even have the stronger result that $E[\bar{V}_T - \bar{V}_C \mid x, M_\lambda \le a_\lambda] = 0$ for any covariate $V$, regardless of whether $V$ is observed or not. While it may seem stringent to require that $N_T = N_C$, Morgan and Rubin (2012) demonstrate with a simple counterexample that rerandomization can yield biased treatment effect estimates when $N_T \neq N_C$. However, Morgan and Rubin (2015, Section 3.2) conjectured that this bias was small for even moderate sample sizes, and Li et al. (2018) formalized this conjecture by showing that $\hat{\tau}$ is asymptotically unbiased under rerandomization even when $N_T \neq N_C$. While asymptotic properties of ridge rerandomization are outside the scope of this work, we can similarly conjecture that the bias of $\hat{\tau}$ under ridge rerandomization will be small for moderate sample sizes, even when $N_T \neq N_C$. We discuss simulation results that validate this conjecture in Section 5.4.
Now we establish the covariance structure of $(\bar{x}_T - \bar{x}_C)$ under ridge rerandomization. To do this, we first derive the exact distribution of $M_\lambda$. The following lemma establishes that if we assume $(\bar{x}_T - \bar{x}_C) \mid x \sim N(0, \Sigma)$, then $M_\lambda$ is distributed as a weighted sum of $K$ independent $\chi^2_1$ random variables.
Lemma 4.1 (Distribution of $M_\lambda$). Let $\lambda \ge 0$ be some prespecified constant. If $(\bar{x}_T - \bar{x}_C) \mid x \sim N(0, \Sigma)$, then
$$M_\lambda \mid x \sim \sum_{j=1}^{K} \frac{\lambda_j}{\lambda_j + \lambda} Z_j^2, \tag{8}$$
where $Z_1, \ldots, Z_K$ are independent $N(0, 1)$ random variables.
The proof of Lemma 4.1 is provided in the Appendix; see Appendix A.1. Under the Normality assumption, the representation in (8) provides a straightforward way to simulate independent draws of $M_\lambda$, despite its CDF being typically intractable and requiring numerical approximations (e.g., see Bodenham and Adams, 2016, and references therein).
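The weighted-sum representation of Lemma 4.1 translates directly into a simulation routine. The sketch below draws from $\sum_j \lambda_j(\lambda_j + \lambda)^{-1} Z_j^2$ and reads off an empirical $p_a$-quantile as an approximation of $a_\lambda$. The eigenvalues, $\lambda$, and $p_a$ are hypothetical values chosen for illustration, and the function name `simulate_m_lambda` is not from the paper.

```python
import numpy as np

def simulate_m_lambda(eigvals, lam, n_draws, rng):
    """Draws of sum_j eigvals[j] / (eigvals[j] + lam) * Z_j^2, Z_j iid N(0, 1)."""
    weights = eigvals / (eigvals + lam)
    z2 = rng.standard_normal((n_draws, eigvals.size)) ** 2
    return z2 @ weights

rng = np.random.default_rng(2)
eigvals = np.array([2.0, 1.0, 0.05])   # hypothetical eigenvalues of Sigma
lam, p_a = 0.1, 0.1                    # illustrative values

draws = simulate_m_lambda(eigvals, lam, 100_000, rng)
a_lam = float(np.quantile(draws, p_a))     # Monte Carlo approximation of a_lambda

# With lam = 0 all weights equal 1 and the draws follow a chi-squared
# distribution with K = 3 degrees of freedom.
draws_zero = simulate_m_lambda(eigvals, 0.0, 100_000, rng)
```

The mean of the draws should be close to the sum of the weights, since each $Z_j^2$ has mean 1; this gives a quick sanity check on the simulation.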
We will find that the covariance structure of $(\bar{x}_T - \bar{x}_C)$ under ridge rerandomization depends on the conditional expectations $E[Z_k^2 \mid \sum_{j=1}^{K} \lambda_j (\lambda_j + \lambda)^{-1} Z_j^2 \le a_\lambda]$ for $k = 1, \ldots, K$, where $Z_1, \ldots, Z_K$ are independent and identically distributed as $N(0, 1)$. The following lemma establishes a property that will be helpful for characterizing these conditional expectations.

Lemma 4.2 (Conditional Expectations of Constrained Non-Negative Random Variables). Let $L_1, \ldots, L_K$ be independent and identically distributed non-negative random variables, let $C_1, \ldots, C_K$ be non-negative constants such that $C_1 \ge C_2 \ge \cdots \ge C_K$, and let $a > 0$ be some constant. Then the conditional expectations $E[L_k \mid \sum_{j=1}^{K} C_j L_j \le a]$, for $k = 1, \ldots, K$, are non-decreasing in $k$.
The proof of Lemma 4.2 is provided in the Appendix; see Appendix A.2. We would like to thank an anonymous reviewer for suggesting a way to prove this result.
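Using Lemmas 4.1 and 4.2, we can derive the covariance structure of $\bar{x}_T - \bar{x}_C$ under ridge rerandomization, as stated by the following theorem.
Theorem 4.2. Let $\lambda \ge 0$ and $a_\lambda > 0$ be some prespecified constants. If $(\bar{x}_T - \bar{x}_C) \mid x \sim N(0, \Sigma)$, then
$$\mathrm{Cov}(\bar{x}_T - \bar{x}_C \mid x, M_\lambda \le a_\lambda) = \Gamma\, \mathrm{Diag}\big((\lambda_k d_{k,\lambda})_{1 \le k \le K}\big)\, \Gamma^\top,$$
where $\Gamma$ is the orthogonal matrix of eigenvectors of $\Sigma$ corresponding to the ordered eigenvalues $\lambda_1 \ge \cdots \ge \lambda_K > 0$, and for all $k = 1, \ldots, K$,
$$d_{k,\lambda} = E\Big[ Z_k^2 \,\Big|\, \sum_{j=1}^{K} \frac{\lambda_j}{\lambda_j + \lambda} Z_j^2 \le a_\lambda \Big], \tag{11}$$
with $Z_1, \ldots, Z_K$ independent $N(0, 1)$ random variables.
The proof of Theorem 4.2 is in Appendix A.3. The quantities $d_{k,\lambda}$ are intractable functions of $\lambda$ and $a_\lambda$ and thus need to be approximated numerically, as explained in Section 4.2. Conditioning on $M_\lambda \le a_\lambda$ in (11) effectively constrains the magnitude of the positive random variables $Z_k^2$. Since the weights $\lambda_k (\lambda_k + \lambda)^{-1}$ of their respective contributions to $M_\lambda$ are positive and non-increasing with $k = 1, \ldots, K$, intuitively $0 < d_{1,\lambda} \le \cdots \le d_{K,\lambda} < 1$, and this is established by Lemma 4.2.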
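Using the above results, we can now compare randomization, rerandomization, and ridge rerandomization. Under the assumptions stated in Theorem 4.2, the covariance matrices of $\bar{x}_T - \bar{x}_C$ under randomization, rerandomization, and ridge rerandomization can be respectively written as
$$\mathrm{Cov}(\bar{x}_T - \bar{x}_C \mid x) = \Gamma\, \mathrm{Diag}\big((\lambda_k)_{1 \le k \le K}\big)\, \Gamma^\top, \tag{12}$$
$$\mathrm{Cov}(\bar{x}_T - \bar{x}_C \mid x, M \le a) = \Gamma\, \mathrm{Diag}\big((\lambda_k v_a)_{1 \le k \le K}\big)\, \Gamma^\top, \tag{13}$$
$$\mathrm{Cov}(\bar{x}_T - \bar{x}_C \mid x, M_\lambda \le a_\lambda) = \Gamma\, \mathrm{Diag}\big((\lambda_k d_{k,\lambda})_{1 \le k \le K}\big)\, \Gamma^\top, \tag{14}$$
where (13) follows from Theorem 3.1 in Morgan and Rubin (2012) with $v_a \in (0, 1)$, and (14) follows from Theorem 4.2 with $d_{k,\lambda} \in (0, 1)$ defined in (11). If we define new covariates $x^*$ as the principal components of the original ones, i.e., $x^* = x\Gamma$, then (13) and (14) respectively yield
$$\mathrm{Cov}(\bar{x}^*_T - \bar{x}^*_C \mid x, M \le a) = v_a\, \mathrm{Cov}(\bar{x}^*_T - \bar{x}^*_C \mid x) \tag{15}$$
and, for all $k = 1, \ldots, K$,
$$\mathrm{Var}\big((\bar{x}^*_T - \bar{x}^*_C)_k \mid x, M_\lambda \le a_\lambda\big) = d_{k,\lambda}\, \mathrm{Var}\big((\bar{x}^*_T - \bar{x}^*_C)_k \mid x\big), \tag{16}$$
where $(\bar{x}^*_T - \bar{x}^*_C)_k$ is the $k$th principal component mean difference between the treatment and control groups, i.e., the $k$th coefficient of $\Gamma^\top (\bar{x}_T - \bar{x}_C)$. From (15) we see that rerandomization reduces the variances of the principal component mean differences equally by $100(1 - v_a)\%$ and is thus EPVR for the principal components, as well as for the original covariates, as discussed in Section 3. On the other hand, ridge rerandomization reduces these variances by unequal amounts: the variance of the $k$th principal component mean difference is reduced by $100(1 - d_{k,\lambda})\%$, and because $0 < d_{1,\lambda} \le \cdots \le d_{K,\lambda} < 1$, ridge rerandomization places more importance on the first principal components.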
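Since the factors $d_{k,\lambda}$ have no closed form, a simple Monte Carlo approximation may help build intuition for their ordering. The sketch below assumes the Normal representation of Lemma 4.1, estimates $a_\lambda$ as an empirical $p_a$-quantile, and averages $Z_k^2$ over the accepted draws. The eigenvalues, $\lambda$, and $p_a$ are hypothetical, and `estimate_d` is an illustrative helper rather than the paper's procedure.

```python
import numpy as np

def estimate_d(eigvals, lam, p_a, n_draws, rng):
    """Monte Carlo estimates of d_{k,lambda} = E[Z_k^2 | weighted sum <= a_lambda]."""
    weights = eigvals / (eigvals + lam)
    z2 = rng.standard_normal((n_draws, eigvals.size)) ** 2
    q = z2 @ weights                     # draws of the weighted chi-squared sum
    a_lam = np.quantile(q, p_a)          # estimated threshold a_lambda
    accepted = z2[q <= a_lam]            # draws satisfying the balance criterion
    return accepted.mean(axis=0), float(a_lam)

rng = np.random.default_rng(3)
eigvals = np.array([2.0, 0.5, 0.05])     # hypothetical eigenvalues, descending
d_hat, a_lam = estimate_d(eigvals, lam=0.5, p_a=0.1, n_draws=200_000, rng=rng)
```

In this example the weights are 0.8, 0.5, and roughly 0.09, so the first principal component is constrained the most and the estimated factors increase with $k$, in line with the ordering stated above.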
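Translating (16) back to the original covariates yields the following corollary, which establishes that ridge rerandomization is always preferable over randomization in terms of reducing the variance of each covariate mean difference.
Corollary 4.1. Under the assumptions of Theorem 4.2, for all $k = 1, \ldots, K$, the ratio
$$v_{k,\lambda} = \frac{\sum_{j=1}^{K} \Gamma_{kj}^2\, \lambda_j\, d_{j,\lambda}}{\sum_{j=1}^{K} \Gamma_{kj}^2\, \lambda_j} \tag{17}$$
satisfies $v_{k,\lambda} \in (0, 1)$, so that
$$\mathrm{Var}\big((\bar{x}_T - \bar{x}_C)_k \mid x, M_\lambda \le a_\lambda\big) = v_{k,\lambda}\, \mathrm{Var}\big((\bar{x}_T - \bar{x}_C)_k \mid x\big) < \mathrm{Var}\big((\bar{x}_T - \bar{x}_C)_k \mid x\big).$$
The proof of Corollary 4.1 is provided in the Appendix; see Appendix A.4. Reducing the variance of the covariate mean differences is beneficial for precisely estimating the average treatment effect if the outcomes are correlated with the covariates. For example, Theorem 3.2 of Morgan and Rubin (2012) establishes that, under several assumptions, including additivity of the treatment effect, rerandomization reduces the variance of $\hat{\tau}$ defined in (2) by $100(1 - v_a)R^2\,\%$, where $R^2$ denotes the squared multiple correlation between the outcomes and the covariates. Now we establish how the variance of $\hat{\tau}$ behaves under ridge rerandomization.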
In the rest of this section, we assume, as in Morgan and Rubin (2012), that the treatment effect is additive. Without loss of generality, for all $i = 1, \ldots, N$, we can write the outcome of unit $i$ as
$$Y_i(W_i) = \beta_0 + x_{i\bullet}\, \beta + \tau W_i + \epsilon_i, \tag{19}$$
where $\beta_0 + x\beta$ is the projection of the potential outcomes $Y(0) = (Y_1(0), \ldots, Y_N(0))^\top$ onto the linear space spanned by $(1, x)$, and $\epsilon_i \in \mathbb{R}$ captures any misspecification of the linear relationship between the outcomes and $x$.
Theorem 4.3 establishes that the variance of $\hat{\tau}$ under ridge rerandomization is always less than or equal to the variance of $\hat{\tau}$ under randomization. Thus, ridge rerandomization never leads to a less precise treatment effect estimator than randomization.

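Theorem 4.3. Under the assumptions of Theorem 4.2, if $(\bar{\epsilon}_T - \bar{\epsilon}_C)$ is conditionally independent of $(\bar{x}_T - \bar{x}_C)$ given $x$ and there is an additive treatment effect, then
$$\mathrm{Var}(\hat{\tau} \mid x, M_\lambda \le a_\lambda) \le \mathrm{Var}(\hat{\tau} \mid x),$$
where the equality holds if and only if $\beta = 0_K$ in (19).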
The proof of Theorem 4.3 is in the Appendix; see Appendix A.5. The conditional independence assumption was also leveraged in the proof of Theorem 3.2 in Morgan and Rubin (2012). While this independence assumption may seem strong, Li et al. (2018) showed that it is justified asymptotically, which allowed them to establish that rerandomization is preferable over randomization even if treatment effects are not additive. Again, while the asymptotic properties of ridge rerandomization are outside the scope of this work, we conjecture that Theorem 4.3 holds asymptotically even without the conditional independence and additive treatment effects assumptions. Indeed, we find evidence via simulation that ridge rerandomization is still preferable over randomization (and often rerandomization) when treatment effects are heterogeneous, as discussed in Section 5.4.
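The role of the conditional independence and additivity assumptions can be seen from a short derivation. The following is a sketch of the argument, not the Appendix proof; it assumes the additive linear form $Y_i(W_i) = \beta_0 + x_{i\bullet}\beta + \tau W_i + \epsilon_i$ and uses $\bar{\epsilon}_T$ and $\bar{\epsilon}_C$ (notation introduced here) for the mean residuals in the treatment and control groups.

```latex
\begin{align*}
\hat{\tau} &= \bar{y}_T - \bar{y}_C
  = \tau + \beta^\top(\bar{x}_T - \bar{x}_C) + (\bar{\epsilon}_T - \bar{\epsilon}_C), \\
\mathrm{Var}(\hat{\tau} \mid x)
  &= \beta^\top \Sigma\, \beta + \mathrm{Var}(\bar{\epsilon}_T - \bar{\epsilon}_C \mid x)
  = \sum_{k=1}^{K} \lambda_k\, (\Gamma^\top \beta)_k^2
    + \mathrm{Var}(\bar{\epsilon}_T - \bar{\epsilon}_C \mid x), \\
\mathrm{Var}(\hat{\tau} \mid x, M_\lambda \le a_\lambda)
  &= \sum_{k=1}^{K} \lambda_k\, d_{k,\lambda}\, (\Gamma^\top \beta)_k^2
    + \mathrm{Var}(\bar{\epsilon}_T - \bar{\epsilon}_C \mid x).
\end{align*}
```

Conditional independence removes the cross term and leaves the residual variance unaffected by the balance criterion, which depends on $W$ only through $\bar{x}_T - \bar{x}_C$. Because every $d_{k,\lambda}$ lies in $(0, 1)$ and every $\lambda_k$ is positive, the third line is no larger than the second, with equality only when $\Gamma^\top \beta = 0$, i.e., when $\beta = 0_K$.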
The fact that ridge rerandomization performs better than randomization is arguably a low bar, because this is the purpose of any rerandomization scheme. The following corollary quantifies how ridge rerandomization performs compared to the rerandomization scheme of Morgan and Rubin (2012).

Corollary 4.2. Under the assumptions of Theorem 4.3, the difference in variances of $\hat{\tau}$ between rerandomization and ridge rerandomization is
$$\mathrm{Var}(\hat{\tau} \mid x, M \le a) - \mathrm{Var}(\hat{\tau} \mid x, M_\lambda \le a_\lambda) = \beta^\top \Gamma\, \mathrm{Diag}\big((\lambda_k (v_a - d_{k,\lambda}))_{1 \le k \le K}\big)\, \Gamma^\top \beta.$$
It is not necessarily the case that $d_{k,\lambda} \le v_a$ for all $k = 1, \ldots, K$, and so it is not guaranteed that ridge rerandomization will perform better or worse than rerandomization in terms of treatment effect estimation. Ultimately, the comparison of rerandomization and ridge rerandomization depends on $\beta$, which is typically not known until after the experiment has been conducted. However, in Section 5.3, we provide some heuristic arguments for when ridge rerandomization would be preferable over rerandomization, along with simulation evidence that confirms these heuristic arguments. In particular, we demonstrate that ridge rerandomization is preferable over rerandomization when there are strong collinearities among the covariates. We also discuss a "worst-case scenario" for ridge rerandomization, where $\beta$ is specified such that ridge rerandomization should perform worse than rerandomization in terms of treatment effect estimation accuracy.
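A small numerical illustration shows how the sign of this variance difference depends on the direction of $\beta$. The sketch evaluates the quadratic form $\beta^\top \Gamma\, \mathrm{Diag}(\lambda_k (v_a - d_{k,\lambda}))\, \Gamma^\top \beta$ for two choices of $\beta$. All numbers below ($\Gamma$, the eigenvalues, $v_a$, and the $d_{k,\lambda}$) are hypothetical values picked so that $d_{1,\lambda} < v_a < d_{K,\lambda}$; they are not results from the paper.

```python
import numpy as np

def variance_gap(beta, gamma, eigvals, v_a, d):
    """beta^T Gamma Diag(eigvals * (v_a - d)) Gamma^T beta."""
    proj = gamma.T @ beta                 # coordinates of beta in the eigenbasis
    return float(np.sum(eigvals * (v_a - d) * proj ** 2))

# Hypothetical inputs for illustration only.
eigvals = np.array([2.0, 0.5, 0.05])      # eigenvalues of Sigma, descending
gamma = np.eye(3)                         # eigenvectors (identity for simplicity)
v_a = 0.15                                # rerandomization factor
d = np.array([0.05, 0.12, 0.60])          # ridge factors, non-decreasing in k

# beta aligned with the first principal component: ridge has lower variance.
gap_first = variance_gap(np.array([1.0, 0.0, 0.0]), gamma, eigvals, v_a, d)
# beta aligned with the last principal component: ridge has higher variance.
gap_last = variance_gap(np.array([0.0, 0.0, 1.0]), gamma, eigvals, v_a, d)
```

A positive gap means rerandomization has the larger variance, so ridge rerandomization is preferable; a negative gap corresponds to the unfavorable direction for $\beta$ mentioned above.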
In order to implement ridge rerandomization, researchers must specify the threshold $a_\lambda \ge 0$ and the regularization parameter $\lambda \ge 0$. The next section provides guidelines for choosing these parameters.

Guidelines for choosing $a_\lambda$ and $\lambda$
For ridge rerandomization, we recommend starting by specifying an acceptance probability $p_a \in (0, 1)$, which then binds $\lambda$ and $a_\lambda$ together via the identity (7). Once $p_a$ is fixed, there exists a uniquely determined threshold $a_\lambda \ge 0$ for each $\lambda \ge 0$ such that $P(M_\lambda \le a_\lambda \mid x) = p_a$. As in Morgan and Rubin (2012), acceptable treatment allocations under ridge rerandomization are generated by randomizing units to treatment and control until $M_\lambda \le a_\lambda$. Thus, a smaller $p_a$ leads to stronger covariate balance according to $M_\lambda$ at the expense of computation time.
The only choice that remains after fixing $p_a$ is the regularization parameter $\lambda \ge 0$. The choice of $\lambda$ is investigated in Section 4.2.1. Once we fix $p_a$ and $\lambda$, we can set $a_\lambda$ equal to the $p_a$-quantile of the quadratic form $Q_\lambda$ defined by
$$Q_\lambda = \sum_{j=1}^{K} \frac{\lambda_j}{\lambda_j + \lambda} Z_j^2, \qquad Z_1, \ldots, Z_K \text{ independent } N(0, 1),$$
which will be used to choose $\lambda$, as we discuss in the remainder of this section.

Choosing λ
In this section, assume that $p_a$ has been fixed. Note that choosing $\lambda = 0$ corresponds to rerandomization using the Mahalanobis distance. Thus, we would only choose some $\lambda > 0$ if it is preferable over rerandomization, in the following sense. There are many metrics that could be used for comparing rerandomization and ridge rerandomization; for simplicity, we focus on the average percent reduction in variance across covariate mean differences. Arguably, ridge rerandomization is preferable over rerandomization only if it is able to achieve a higher average reduction in variance across covariate mean differences. Recall that, as discussed in Section 3, rerandomization reduces the variance of each covariate mean difference by $100(1 - v_a)\%$ compared to randomization, where $v_a$ is defined in (5). Meanwhile, as established by Corollary 4.1, ridge rerandomization reduces the variance of the $k$th covariate mean difference by $100(1 - v_{k,\lambda})\%$, where $v_{k,\lambda}$ is defined in (17). Thus, the average variance reduction under ridge rerandomization is greater than that under rerandomization only if
$$\frac{1}{K} \sum_{k=1}^{K} v_{k,\lambda} < v_a. \tag{22}$$
Proving the existence of some $\lambda > 0$ such that (22) holds is challenging, so we propose the following iterative procedure (see "Procedure for finding a desirable $\lambda \ge 0$") for choosing such a $\lambda > 0$ if it exists. The technical details justifying this procedure are in the Appendix; but at a high level, our procedure uses the following intuition:
• Ridge rerandomization with $\lambda > 0$ is preferable over rerandomization (i.e., ridge rerandomization with $\lambda = 0$) only if (22) holds.
• If we cannot find any $\lambda > 0$ such that (22) holds, then we set $\lambda = 0$. Otherwise, among all the $\lambda$'s satisfying (22), we set $\lambda$ such that the conditional covariance structure of $(\bar{x}_T - \bar{x}_C)$ is altered the least.
We discuss why we choose a $\lambda$ that alters the conditional covariance structure of $(\bar{x}_T - \bar{x}_C)$ the least in Section 4.3. In the procedure below, we initialize $\lambda = 0$, and then we iteratively increase candidate $\lambda$'s by increments of $\delta$, which is specified by the user. As a rule of thumb, the step size $\delta$ can be chosen as a fraction of the smallest strictly positive gap between consecutive eigenvalues, i.e., $\min\{\lambda_k - \lambda_{k+1} : k = 1, \ldots, K \text{ such that } \lambda_k > \lambda_{k+1}\}$ with the convention $\lambda_{K+1} = 0$. The stopping point of this iterative search is chosen dynamically in Step 3 of our procedure, and we discuss in Appendix A.7 why this dynamic search is guaranteed to stop in finite time. Finally, as we discuss further in Appendix A.7, the procedure is computationally efficient in the sense that $nK$ auxiliary Normal variables only need to be simulated once and can be reused when testing different values of $\lambda$.
Procedure for finding a desirable $\lambda \ge 0$
1. Specify $p_a \in (0, 1)$, $n \ge 1$, $\delta > 0$, and $\varepsilon > 0$.
2. Initialize $\lambda = 0$ and $\Lambda = \emptyset$.
3. While $|(\lambda + \delta)\, \hat{a}_{\lambda+\delta} - \lambda\, \hat{a}_\lambda| > \varepsilon$: ... $, K$, and return: ...

In our procedure, $\Lambda$ represents the set of $\lambda$ such that (22) holds. When the set $\Lambda$ is empty, we return $\lambda = 0$ (which corresponds to typical rerandomization). However, the following heuristic argument illustrates why we would expect the existence of at least one $\lambda$ such that (22) holds. The rerandomization scheme of Morgan and Rubin (2012) spreads the benefits of variance reduction across all $K$ covariates equally; however, note that the term $v_a = P(\chi^2_{K+2} \le q_{\chi^2_K}(p_a))/p_a$ is monotonically increasing in the number of covariates $K$ for a fixed acceptance probability $p_a$. Thus, the variance reduction under rerandomization, $100(1 - v_a)\%$, is monotonically decreasing in the number of covariates. A consequence of this is that if one can instead determine a smaller set of $K_e < K$ covariates that is most relevant, then that smaller set of covariates can benefit from a greater variance reduction than what would be achieved by considering all $K$ covariates. As we mentioned at the end of Section 3, this idea was partially addressed in Morgan and Rubin (2015), which extended the rerandomization scheme of Morgan and Rubin (2012) to allow for tiers of covariate importance specified by the researcher, such that the most important covariates receive the most variance reduction. Ridge rerandomization, on the other hand, automatically specifies a hierarchy of importance based on the eigenstructure of the covariate mean differences.
To provide intuition for this idea, consider a simple case where the smallest $(K - K_e)$ eigenvalues $\lambda_{K_e+1}, \ldots, \lambda_K$ are all arbitrarily close to 0. In this case, we can find $\lambda > 0$ such that $\lambda_j (\lambda_j + \lambda)^{-1} \approx 1$ for the $K_e$ largest eigenvalues and $\lambda_j (\lambda_j + \lambda)^{-1} \approx 0$ for the remaining $K - K_e$ eigenvalues, so that $M_\lambda$ would be approximately distributed as $\chi^2_{K_e}$, with an effective number of degrees of freedom $K_e$ strictly less than $K$. For some fixed acceptance probability $p_a \in (0, 1)$ and corresponding thresholds $a_e = q_{\chi^2_{K_e}}(p_a)$ and $a = q_{\chi^2_K}(p_a)$, we would then have $v_{a_e} = P(\chi^2_{K_e+2} \le a_e)/p_a < P(\chi^2_{K+2} \le a)/p_a = v_a$, since $p_a$ is fixed and $K_e < K$. The relative variance reduction for ridge rerandomization would then be $(1 - v_{a_e})$ for the first $K_e$ principal components (which in this simple example make up the total variation in the covariate mean differences), while the relative variance reduction for rerandomization would be $(1 - v_a) < (1 - v_{a_e})$ for the $K$ covariates. Thus, in this case, ridge rerandomization would achieve a greater variance reduction on a lower-dimensional representation of the covariates than typical rerandomization.
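The monotonicity of $v_a$ in $K$ can be checked numerically. The sketch below uses only the Python standard library and the closed-form chi-squared CDF that is available for an even number of degrees of freedom, $P(\chi^2_{2m} \le x) = 1 - e^{-x/2} \sum_{i=0}^{m-1} (x/2)^i / i!$, with the quantile obtained by bisection. Restricting to even $K$ is a simplification made here for self-containedness; the function names are illustrative.

```python
import math

def chi2_cdf_even(x, dof):
    """CDF of a chi-squared variable with an even number of degrees of freedom."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    if x <= 0:
        return 0.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i          # (x/2)^i / i!
        total += term
    return 1.0 - math.exp(-half) * total

def chi2_quantile_even(p, dof):
    """p-quantile by bisection on the closed-form CDF."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, dof) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def v_a(p_a, k):
    """v_a = P(chi2_{K+2} <= q_{chi2_K}(p_a)) / p_a for even K."""
    a = chi2_quantile_even(p_a, k)
    return chi2_cdf_even(a, k + 2) / p_a

values = [v_a(0.1, k) for k in (2, 4, 8, 16)]
```

For $p_a = 0.1$ the resulting values increase with $K$, so the percent variance reduction $100(1 - v_a)\%$ shrinks as more covariates are balanced simultaneously.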
This heuristic argument also hints that our method has connections to a principal-components rerandomization scheme, where one instead balances on some lower dimension of principal components rather than on the covariates themselves. We discuss this point further in Section 4.3.
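The search for $\lambda$ described in this section can be illustrated with a simplified grid search. The sketch below is not the authors' procedure: it replaces the dynamic stopping rule with a fixed grid, and it selects the smallest $\lambda$ whose estimated average of $v_{k,\lambda}$ falls below the $\lambda = 0$ baseline, as a rough stand-in for the criterion of altering the covariance structure the least. It does reuse a single set of simulated Normal draws across all candidate values. The covariance matrix, grid, and helper name `average_v` are hypothetical.

```python
import numpy as np

def average_v(eigvals, gamma, lam, p_a, z2):
    """Monte Carlo estimate of the average of v_{k,lambda} over covariates."""
    weights = eigvals / (eigvals + lam)
    q = z2 @ weights                       # draws of the weighted chi-squared sum
    a_lam = np.quantile(q, p_a)            # estimated threshold a_lambda
    d_hat = z2[q <= a_lam].mean(axis=0)    # estimated d_{k,lambda}
    g2 = gamma ** 2                        # g2[k, j] = Gamma_{kj}^2
    v = (g2 @ (eigvals * d_hat)) / (g2 @ eigvals)
    return float(v.mean())

# Hypothetical Sigma: five equicorrelated covariates with correlation 0.95.
k, rho = 5, 0.95
sigma = (1 - rho) * np.eye(k) + rho * np.ones((k, k))
eigvals, gamma = np.linalg.eigh(sigma)             # ascending eigenvalues
eigvals, gamma = eigvals[::-1], gamma[:, ::-1]     # reorder to descending

rng = np.random.default_rng(4)
z2 = rng.standard_normal((50_000, k)) ** 2         # simulated once, reused below
p_a = 0.1

baseline = average_v(eigvals, gamma, 0.0, p_a, z2)  # lambda = 0: rerandomization
grid = [0.05 * i for i in range(1, 21)]             # fixed grid (assumption)
scores = {lam: average_v(eigvals, gamma, lam, p_a, z2) for lam in grid}
candidates = [lam for lam in grid if scores[lam] < baseline]
# Assumption: take the smallest qualifying lambda; fall back to 0 if none qualifies.
chosen = min(candidates) if candidates else 0.0
```

In this highly collinear example, most of each covariate's variance lies along the first principal component, so shifting the balance criterion toward that component lowers the average $v_{k,\lambda}$ relative to the baseline.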

Connections to other rerandomization schemes
Ridge rerandomization has connections to other rerandomization schemes. Ridge rerandomization requires specifying the parameter $\lambda$; thus, consider two extreme choices of $\lambda$: for $\lambda = 0$, $M_\lambda$ reduces to the Mahalanobis distance; and as $\lambda \to \infty$, $\lambda M_\lambda \to (\bar{x}_T - \bar{x}_C)^\top(\bar{x}_T - \bar{x}_C)$, so that $M_\lambda$ tends to a scaled Euclidean distance. In other words, ridge rerandomization with $\lambda = 0$ is equivalent to rerandomization using the Mahalanobis distance; and for large $\lambda$, rerandomization using $\lambda M_\lambda$ is equivalent to rerandomization using the Euclidean distance. Note, however, that the threshold $a_\lambda$ will already take the $\lambda^{-1}$ factor into account when computing the quantile of $M_\lambda$, meaning that ridge rerandomization using $M_\lambda$ for large $\lambda$ is essentially equivalent to rerandomization using the Euclidean distance.
Thus, for any finite $\lambda > 0$, the distance defined by $M_\lambda$ can be regarded as a compromise between the Mahalanobis and Euclidean distances. Rerandomization using the Euclidean distance is similar to a rerandomization scheme that places a separate caliper on each covariate, which was proposed by Moulton (2004), Maclure et al. (2006), Bruhn and McKenzie (2009), and Cox (2009). However, Morgan and Rubin (2012) note that such a rerandomization scheme is not affinely invariant and does not preserve the correlation structure of $(\bar{x}_T - \bar{x}_C)$ across acceptable randomizations. See Morgan and Rubin (2012) for a full discussion of the benefits of using affinely invariant rerandomization criteria. As discussed in Section 4.2.1, our proposed procedure aims for larger variance reductions of covariate mean differences while mitigating the perturbation of the correlation structure of $(\bar{x}_T - \bar{x}_C)$.
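The two limiting cases can be illustrated numerically. The sketch below assumes that the ridge Mahalanobis distance in (6) has the form $M_\lambda = (\bar{x}_T - \bar{x}_C)^\top(\Sigma + \lambda I_K)^{-1}(\bar{x}_T - \bar{x}_C)$, which is consistent with the eigenvalues $\lambda_j(\lambda_j + \lambda)^{-1}$ used throughout; the function name `ridge_mahalanobis` and the toy inputs are hypothetical.

```python
import numpy as np

def ridge_mahalanobis(diff, Sigma, lam):
    """Assumed form of (6): diff' (Sigma + lam * I)^{-1} diff."""
    K = Sigma.shape[0]
    return float(diff @ np.linalg.solve(Sigma + lam * np.eye(K), diff))

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
Sigma = A.T @ A / 20                 # toy positive-definite covariance matrix
diff = rng.normal(size=3)            # stand-in for the covariate mean differences

# lambda = 0 recovers the Mahalanobis distance
m0 = ridge_mahalanobis(diff, Sigma, 0.0)
mahalanobis = float(diff @ np.linalg.inv(Sigma) @ diff)

# for large lambda, lambda * M_lambda approaches the squared Euclidean distance
lam = 1e8
scaled = lam * ridge_mahalanobis(diff, Sigma, lam)
euclidean = float(diff @ diff)

print(m0, mahalanobis)
print(scaled, euclidean)
```

Intermediate values of `lam` interpolate between the two distances, which is the compromise described above.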
As an illustration, consider a randomized experiment where N T = N C = 50 units are assigned to treatment and control; and furthermore, where there are two correlated covariates, generated as x 1j Fig. 1 shows the distribution of (x T −x C ) | x across 1000 randomizations, rerandomizations (with p a = 0.1), ridge rerandomizations (with p a = 0.1 and λ = 0.005), and rerandomizations using the Euclidean distance instead of the Mahalanobis distance.
All three rerandomization schemes reduce the variance of (x T −x C ) k | x for k ∈ {1, 2}, compared to randomization; however, rerandomization using the Euclidean distance destroys the correlation structure of (x T −x C ) | x, while rerandomization and ridge rerandomization largely maintain it. This provides further motivation for Step 4 of the procedure presented in Section 4.2.1.
Furthermore, as discussed in Sections 4.1 and 4.2.1, ridge rerandomization can be regarded as a ''soft-thresholding" version of a rerandomization scheme that would focus solely on the first K e < K principal components of (x T −x C ). A ''hard-thresholding" rerandomization scheme would use a truncated version M Ke of the Mahalanobis distance, defined as i.e., Σ Ke artificially sets the smallest (K − K e ) eigenvalues of Σ to 0. This scheme would then be EPVR for the first K e principal components of (x T −x C ) -although not necessarily EPVR for the original covariates themselves -but would effectively ignore the components associated with the smallest (K − K e ) eigenvalues of Σ.
Therefore, ridge rerandomization is a flexible experimental design strategy that encapsulates a class of rerandomization schemes, thus making it worth further investigation in future work. We expand on this point in Section 6.

Conducting inference after ridge rerandomization
Here we outline how to conduct inference for the average treatment effect after ridge rerandomization has been used to conduct an experiment. In general, there are Neymanian, Bayesian, and randomization-based modes of inference for analyzing randomized experiments (Imbens and Rubin, 2015). The Neymanian mode of inference relies on asymptotic approximations for the variance of the mean-difference estimator $\hat{\tau}$; such results are well-established for completely randomized experiments (Neyman et al., 1990), paired experiments (Imai, 2008), blocked experiments (Miratrix et al., 2013; Pashley and Miratrix, 2017), and randomized experiments with stages of random sampling (Branson and Dasgupta, 2020). In a seminal paper, Li et al. (2018) derived many asymptotic results for rerandomized experiments (as discussed in Morgan and Rubin (2012)), thereby establishing Neymanian inference for such experiments. The results therein rely on various properties of the Mahalanobis distance, which, as established by our results, differ from the properties of the ridge Mahalanobis distance. As a consequence, the theory developed in Li et al. (2018) cannot be readily applied to ridge rerandomized experiments, and a promising line of future work is deriving asymptotic results for ridge rerandomized experiments. Asymptotic results could also be used to establish Bayesian inference for such experiments, which would be particularly useful given that one's preference for rerandomization or ridge rerandomization may depend on their prior knowledge of $\beta$, as suggested by Corollary 4.2. Addressing these complications is beyond the scope of this paper. Instead, we focus on randomization-based inference, because it can be readily applied to ridge rerandomization.
Randomization-based inference focuses on inverting sharp null hypotheses that define the relationship between the potential outcomes in terms of treatment effects. The most common null hypothesis is that of an additive treatment effect τ , such that the hypothesis H τ 0 : Y i (1) = Y i (0) + τ holds for all i = 1, . . . , N. Confidence intervals derived from inverting this hypothesis were first established by Hodges Jr and Lehmann (1963) and have since been popularized for analyzing randomized experiments (e.g., see Rosenbaum, 2002;Imbens and Rubin, 2015). Here we briefly review how to obtain randomization-based confidence intervals for completely randomized experiments, and then we extend them to ridge rerandomized experiments.
As first proposed by Hodges Jr and Lehmann (1963), a valid randomization-based confidence interval is the set of $\tau$ such that we fail to reject $H_0^\tau$; such inversion of a hypothesis is a classical way to obtain a confidence set (Kempthorne and Doerfler, 1969). To obtain a valid p-value for $H_0^\tau$, a key insight is that, if $H_0^\tau$ holds, then one has full knowledge of the potential outcomes for all units: If we observe the outcome under control for a particular unit, we know that the outcome under treatment for that unit is simply the observed outcome plus $\tau$. As a result, for any hypothetical randomization, a test statistic, such as the mean-difference estimator $\hat{\tau}$, can be computed. To obtain a p-value for $H_0^\tau$ under randomization, one follows this simple three-step procedure:
1. Generate many hypothetical randomizations, $w^{(1)}, \dots, w^{(M)}$, by permuting the observed treatment indicator.
2. Compute a test statistic $t(w, x, y)$, such as the mean-difference estimator, across the randomizations $w^{(1)}, \dots, w^{(M)}$ assuming $H_0^\tau$ is true.
3. Compute the randomization-based p-value, defined as
$$p = \frac{1 + \sum_{m=1}^{M} 1\left(|t(w^{(m)}, x, y)| \ge |t^{obs}|\right)}{M + 1},$$
where $t^{obs}$ is the observed test statistic and $1(\cdot)$ denotes the indicator function. The additional 1 in the numerator and the denominator induces a very small amount of bias in order to validly control the Type I error rate and is a standard correction for randomization test p-values (Phipson and Smyth, 2010). Modern statistical software allows one to readily invert $H_0^\tau$ after Step 1 is completed (in Section 5, we will use the R package ri (Aronow and Samii, 2012) to do this), thereby producing randomization-based confidence intervals. This makes the extension to ridge rerandomization quite straightforward: In Step 1, one generates many hypothetical ridge rerandomizations (instead of randomizations), and then proceeds as usual to conduct randomization-based inference. This is identical to the approach discussed in Morgan and Rubin (2012) for obtaining confidence intervals under rerandomization, except using hypothetical ridge rerandomizations instead of hypothetical rerandomizations. This can also be viewed as inverting a conditional randomization test, where we condition on the fact that the ridge rerandomization balance criterion has been fulfilled (Hennessy et al., 2016; Branson and Miratrix, 2019). As we shall see in Section 5, confidence intervals for ridge rerandomized experiments are much more precise than intervals for completely randomized experiments, and often more precise than intervals for rerandomized experiments, especially in high-dimensional and/or high-collinearity settings.
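The three-step procedure, with Step 1 drawing hypothetical ridge rerandomizations, can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in Section 5 (which relies on the R package ri): the ridge distance is assumed to be $(\bar{x}_T - \bar{x}_C)^\top(\Sigma + \lambda I_K)^{-1}(\bar{x}_T - \bar{x}_C)$ with $\Sigma = \frac{N}{N_T N_C}\,\mathrm{cov}(x)$, the threshold is replaced by an empirical quantile from pilot randomizations, the test statistic is the mean difference centred at the hypothesized $\tau$ and compared in absolute value, and all function names and toy numbers are hypothetical.

```python
import numpy as np

def mean_diff(v, w):
    """Treated-minus-control difference in means of v (vector or matrix rows)."""
    return v[w == 1].mean(axis=0) - v[w == 0].mean(axis=0)

def draw_assignment(x, n_t, ridge_inv, a, rng):
    """Redraw complete randomizations until the ridge distance is <= a."""
    while True:
        w = np.zeros(x.shape[0], dtype=int)
        w[rng.choice(x.shape[0], size=n_t, replace=False)] = 1
        d = mean_diff(x, w)
        if d @ ridge_inv @ d <= a:
            return w

def p_value(w_obs, y_obs, x, tau0, ridge_inv, a, M, rng):
    """Randomization p-value for the sharp null Y_i(1) = Y_i(0) + tau0."""
    y0 = y_obs - tau0 * w_obs                      # imputed control outcomes under the null
    stat = lambda w: mean_diff(y0 + tau0 * w, w) - tau0   # centred mean difference
    t_obs = stat(w_obs)
    n_t = int(w_obs.sum())
    exceed = sum(
        abs(stat(draw_assignment(x, n_t, ridge_inv, a, rng))) >= abs(t_obs)
        for _ in range(M)
    )
    return (1 + exceed) / (M + 1)                  # +1 correction in numerator and denominator

# Toy illustration; every number below is an arbitrary assumption.
rng = np.random.default_rng(0)
N, K, n_t, lam, p_a = 20, 3, 10, 0.01, 0.3
x = rng.normal(size=(N, K))
Sigma = N / (n_t * (N - n_t)) * np.cov(x, rowvar=False)
ridge_inv = np.linalg.inv(Sigma + lam * np.eye(K))

# Stand-in threshold: empirical p_a-quantile of the distance over pilot randomizations.
pilot = []
for _ in range(500):
    w = np.zeros(N, dtype=int)
    w[rng.choice(N, size=n_t, replace=False)] = 1
    d = mean_diff(x, w)
    pilot.append(d @ ridge_inv @ d)
a = np.quantile(pilot, p_a)

w_obs = draw_assignment(x, n_t, ridge_inv, a, rng)
y_obs = x @ np.ones(K) + rng.normal(size=N) + 1.0 * w_obs
p = p_value(w_obs, y_obs, x, tau0=1.0, ridge_inv=ridge_inv, a=a, M=200, rng=rng)
print(p)
```

A confidence interval would then be obtained by repeating `p_value` over a grid of `tau0` values and keeping those that are not rejected.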

Simulations
We now provide simulation evidence that supports the heuristic argument presented in Section 4.2 and suggests when ridge rerandomization is an effective experimental design strategy. First, we will consider conducting an experiment where covariates are linearly related with the outcome, treatment effects are additive, and the number of treated units and the number of control units are equal. Then we will consider alternative scenarios. Throughout, we will compare rerandomization and ridge rerandomization in terms of (1) their ability to balance covariates, (2) their ability to produce precise treatment effect estimators, and (3) their ability to produce precise confidence intervals. We find that ridge rerandomization is particularly preferable over rerandomization in high-dimensional or high-collinearity settings.

Simulation setup
Consider N = 100 units, 50 of which are to be assigned to treatment and 50 are to be assigned to control. Let x be a N × K covariate matrix, generated as where 0 ≤ ρ < 1. The parameter ρ corresponds to the correlation among the covariates. Furthermore, let Y i (1) and Y i (0) be the potential outcomes under treatment and control, respectively, for unit i, generated as For this simulation study, we set the treatment effect to be τ = 1. Across simulations, we consider number of covariates K ∈ {10, . . . , 90} and correlation parameter ρ ∈ {0, 0.1, . . . , 0.9}. We discuss choices for β in Section 5.3. In Section 5.4, we discuss scenarios where covariates are nonlinearly related with the outcomes, treatment effects are non-additive, and N T ̸ = N C ; however, the results for these other scenarios are largely the same as those for the above data-generating process, and so for ease of exposition we focus on results for the case where the covariates are generated from (26) and the potential outcomes are generated from (27). We will consider three experimental design strategies for assigning units to treatment and control:  (6).
For each choice of $K$, $\rho$, and $\beta$, we ran randomization, rerandomization, and ridge rerandomization 1000 times. For rerandomization and ridge rerandomization, we set $p_a = 0.1$, which corresponds to randomizing within the 10% ''best'' randomizations according to the Mahalanobis distance and ridge Mahalanobis distance, respectively. Furthermore, for ridge rerandomization, we used the procedure in Section 4.2.1 for selecting $\lambda$, with $n = 1000$, $\delta = 0.01$, and $\varepsilon = 10^{-4}$. The value $\lambda = 0.01$ was selected for most $K$ and $\rho$, and occasionally $\lambda = 0.02$ was selected.
First, in Section 5.2, we compare how these three methods balanced the covariates x, and so the β parameter in (27) is irrelevant for this section. Then, in Section 5.3, we compare the accuracy of treatment effect estimators and precision of confidence intervals for each method; in this case, the specification of β is consequential.

Comparing covariate balance across randomizations
First, we computed the covariate mean differences across each randomization, rerandomization, and ridge rerandomization. Fig. 2 shows how much rerandomization and ridge rerandomization reduced the variance of $\bar{x}_T - \bar{x}_C$ (averaged across covariates) compared to randomization for data generated from (26). For rerandomization, the average variance reduction decreases as $K$ increases (an observation previously made in Morgan and Rubin, 2012), and it stays largely the same across values of $\rho$ for fixed $K$. As for ridge rerandomization, the average variance reduction also decreases as $K$ increases, but the average variance reduction increases as $\rho$ increases, i.e., as there is more collinearity in $x$. Finally, the right-hand plot in Fig. 2 shows that ridge rerandomization has a higher average variance reduction than rerandomization; furthermore, the advantage of ridge rerandomization over rerandomization increases in both $K$ and $\rho$. This suggests that ridge rerandomization may be particularly preferable over rerandomization in the presence of many covariates and/or high collinearity among covariates, which is intuitive given the motivation of ridge regression (Hoerl and Kennard, 1970).

Comparing treatment effect estimation accuracy across randomizations
Reducing the variance of each covariate mean difference leads to more precise treatment effect estimates if the covariates are related to the outcome, as in (27). The extent to which the covariates are related to the outcome depends on the β parameter. Theorem 4.3 guarantees that ridge rerandomization will improve inference for the average treatment effect, compared to randomization, regardless of β. However, Corollary 4.2 establishes that β dictates whether rerandomization or ridge rerandomization will perform better in terms of treatment effect estimation accuracy. First we will consider a β where the covariates are equally related to the outcome, and in this case ridge rerandomization performs better than rerandomization. Then, we will consider a β which -according to our theoretical results -should put ridge rerandomization in the worst light as compared to rerandomization.

One choice of β
Consider $\beta = 1_K$. Because the covariates have been standardized to have the same scale, such a $\beta$ implies that all of the covariates are equally important in affecting the outcome. For each of the 1000 randomizations, rerandomizations, and ridge rerandomizations generated for each $K \in \{10, \dots, 90\}$ and $\rho \in \{0, 0.1, \dots, 0.9\}$, we computed the mean-difference estimator $\hat{\tau}$. Then, we computed the MSE of $\hat{\tau}$ across the 1000 randomizations, rerandomizations, and ridge rerandomizations for each $K$ and $\rho$. Fig. 3 shows the MSE of rerandomization and ridge rerandomization relative to the MSE of randomization. A lower relative MSE represents a more accurate treatment effect estimator, compared to how that estimator would behave under randomization. Three observations can be made about Fig. 3. First, both rerandomization and ridge rerandomization reduce the MSE of $\hat{\tau}$ compared to randomization: the relative MSE for both methods is always less than 1. Second, for rerandomization, the relative MSE stays constant across values of $\rho$ and decreases as $K$ decreases. Meanwhile, for ridge rerandomization, the relative MSE decreases as $\rho$ increases and $K$ decreases. Third, for this choice of $\beta$, ridge rerandomization reduces the MSE of the treatment effect estimator more so than rerandomization, especially when $K$ and/or $\rho$ is large. These last two observations reflect the variance reduction behavior observed in Fig. 2.
Meanwhile, for each randomization, rerandomization, and ridge rerandomization, we generated a 95% confidence interval for the average treatment effect using the procedure outlined in Section 4.4. Regardless of the procedure used, coverage was near 95%. This is unsurprising, because these intervals were constructed by inverting randomization tests that are valid for their corresponding assignment mechanism; see Edgington and Onghena (2007) and Good (2013) for classical results on the validity of randomization tests. However, the width of these intervals differed across these three procedures: Fig. 4 compares the relative average interval width (compared to randomization) for rerandomization and ridge rerandomization. For the first two plots in Fig. 4, a number closer to 1 indicates intervals that are closer in width to intervals under randomization. Meanwhile, for the right-most plot in Fig. 4, a more negative number indicates narrower confidence intervals for ridge rerandomization, as compared to rerandomization. The qualitative results are identical to the previous results: Ridge rerandomization tends to provide narrower confidence intervals as the covariates' dimension and/or collinearity increases.
Fig. 3. Relative MSE of $\hat{\tau}$ under rerandomization and ridge rerandomization (relative to randomization) for potential outcomes generated from (27), as well as the difference in relative MSE between the two (i.e., the second plot minus the first).
Fig. 4. Relative average 95% confidence interval width under rerandomization and ridge rerandomization (relative to randomization) for potential outcomes generated from (27), as well as the difference between the two (i.e., the second plot minus the first).

A choice of β where ridge rerandomization has the least competitive advantage over rerandomization
As can be seen by Corollary 4.2, there may exist β where rerandomization performs better than ridge rerandomization. To assess how poorly ridge rerandomization can perform compared to rerandomization, now we will specify a β that puts ridge rerandomization in the worst light when comparing it to rerandomization in terms of treatment effect estimation accuracy.
Under the assumptions of Corollary 4.2, the difference in treatment effect estimation accuracy between rerandomization and ridge rerandomization is given by ∆ ) Γ ⊤ β, which can be artificially minimized with respect to $\beta$, subject to some constraint on $\beta$ for the minimum to exist, e.g., $\|\beta\| \le 1$. If $d_{k,\lambda} < v_a$ for all $k = 1, \dots, K$, then ridge rerandomization dominates rerandomization since $\Delta > 0$ for all $\beta \ne 0$, and these schemes are only tied when $\Delta = 0$ for $\beta = 0$, i.e., the covariates are uncorrelated with the outcomes. In other cases, we can define $\beta^* = \Gamma_{\bullet k^*}$, where $\Gamma_{\bullet k^*}$ is the $k^*$-th column of $\Gamma$ and $k^* = \mathrm{argmin}_{1 \le k \le K}\,(v_a - d_{k,\lambda})$. We would typically have $k^* = K$, because the $d_{k,\lambda}$'s are non-increasing. By construction, $\beta^*$ minimizes $\Delta$ over $\{\beta \in \mathbb{R}^K : \|\beta\| \le 1\}$ and yields $\Delta < 0$ as negative as possible. This is equivalent to $\beta$ being in the direction that accounts for the least variation in the covariates. While such a case is unlikely, we consider such a $\beta$ to see how much worse ridge rerandomization performs as compared to rerandomization in this scenario. Fig. 5 shows the relative MSE (as compared to randomization) for rerandomization and ridge rerandomization for this specification of $\beta$. Interestingly, there are occasions where rerandomization and ridge rerandomization have relative MSEs greater than 1, i.e., when they perform worse than randomization in terms of treatment effect estimation accuracy. At first this may be surprising, especially when findings from Morgan and Rubin (2012) guarantee that rerandomization should perform better than randomization. However, in this case, $\beta$ is in the direction of the last principal component of the covariate space, meaning that the covariates have nearly no relationship with the outcomes. Thus, the relative MSE that we see in the first two plots of Fig. 5 is more or less the behavior we would expect if we compared 1000 randomizations to 1000 other randomizations. Furthermore, from the third plot in Fig. 5, we can see that rerandomization occasionally performs better than ridge rerandomization, particularly when $K$ is small, but the differences in relative MSE across simulations are somewhat centered around zero. Meanwhile, Fig. 6 compares the relative average confidence interval width for rerandomization and ridge rerandomization, and the qualitative results are largely the same as the relative MSE results: Rerandomization and ridge rerandomization are fairly comparable, but rerandomization tends to provide slightly narrower confidence intervals for low-dimensional covariates.
Fig. 5. Relative MSE of $\hat{\tau} = \bar{y}_T - \bar{y}_C$ under rerandomization and ridge rerandomization (relative to randomization) for the $\beta$ such that ridge rerandomization has the least competitive advantage over rerandomization, as well as the difference in relative MSE between the two (i.e., the second plot minus the first).
Note that this specification of β is a unit vector. We could have scaled β arbitrarily large, and, as a result, the differences in the last plots of Figs. 5 and 6 could have been made arbitrarily large. Thus, ridge rerandomization can perform much worse than rerandomization when β exhibits particularly large effects in the direction of the last principal component of the covariate space, especially when the number of covariates is small. Practically speaking, such a scenario is unlikely, but it is a scenario that researchers should acknowledge and consider when comparing rerandomization and ridge rerandomization.

Additional simulations: Unequal sample sizes, nonlinearity, heterogeneous treatment effects, and rank deficiency
In the above, we considered scenarios where an equal number of units are assigned to treatment and control, covariates are linearly related with the potential outcomes, and treatment effects are additive. In Appendix A.8, we present simulation results for scenarios where $N_T \ne N_C$, covariates are nonlinearly related with the potential outcomes, and treatment effects are heterogeneous. The results presented therein are very similar to the results presented above: Rerandomization and ridge rerandomization are still preferable over randomization, and ridge rerandomization is preferable over rerandomization in high-dimensional and/or high-collinearity scenarios. We found that ridge rerandomization's advantage over rerandomization was somewhat diminished when treatment and control sample sizes were highly unequal or when covariates were nonlinearly related with the potential outcomes, but the advantage in high-dimensional and/or high-collinearity scenarios was still clear. Due to the similarity of these results, we relegated these additional simulations to the Appendix.
Fig. 6. Relative average 95% confidence interval width under rerandomization and ridge rerandomization (relative to randomization) for the $\beta$ such that ridge rerandomization has the least competitive advantage over rerandomization, as well as the difference between the two (i.e., the second plot minus the first).
Finally, note that all of our previous simulation studies focused on the case where N = 100 and K ∈ {10, 20, . . . , 90}. In this case, the covariance matrix Σ is always invertible, which we have assumed throughout the manuscript. When N ≤ K , Σ is not invertible, the Mahalanobis distance is undefined, and rerandomization cannot be implemented. However, the ridge Mahalanobis distance M λ in (6) is still defined, and ridge rerandomization can still be implemented. In Appendix A.8, we present simulation results when N = 100 and K = 101, and we again find that ridge rerandomization is preferable over randomization, especially in high-collinearity scenarios. This suggests that ridge rerandomization may be a viable experimental design strategy when N ≤ K , and interesting future work would be establishing theoretical results for ridge rerandomization even when Σ is not invertible but the ridge Mahalanobis distance is still defined.

Summary of simulation results
Importantly, the effectiveness of rerandomization or ridge rerandomization in balancing the covariates does not depend on the covariates' relationship with the outcomes. In other words, the variance reduction results in Fig. 2 do not depend on β, whereas the treatment effect estimation accuracy results in Figs. 3 and 5 and confidence interval results in Figs. 4 and 6 do. From Fig. 2 we see that ridge rerandomization appears to generally be more effective than rerandomization in balancing covariates in high-dimensional or high-collinearity settings, and from Figs. 3 and 4 we see that this can result in more precise treatment effect estimators and confidence intervals. These results also hold when treatment and control sample sizes are unequal, when the outcome is nonlinearly related with the covariates, or when there is treatment effect heterogeneity, as discussed briefly in Section 5.4 and more fully in Appendix A.8. However, from Section 5.3.2, we see that there are cases where rerandomization can perform better than ridge rerandomization in terms of treatment effect estimation. In particular, if the relationship between the covariates and the outcome is strongly in the direction of the last principal component of the covariate space, rerandomization can perform arbitrarily better than ridge rerandomization, especially when there are only a few covariates. In general, the comparison between rerandomization and ridge rerandomization depends on the relationship between the covariates and the outcomes, which is typically not known until after the experiment is conducted.
In summary, these simulations suggest that ridge rerandomization is often preferable over rerandomization by targeting the directions that best explain variation in the covariates rather than the covariates themselves. If the covariates are related to the outcomes (linearly or nonlinearly), ridge rerandomization appears to be an appealing experimental design strategy when there are many covariates and/or highly collinear covariates.

Discussion and conclusion
The rerandomization literature has focused on experimental design strategies that utilize the Mahalanobis distance. Starting with Morgan and Rubin (2012) and continuing with works such as Morgan and Rubin (2015), Branson et al. (2016), Zhou et al. (2018), and Li et al. (2018), many theoretical results have been established for rerandomization schemes using the Mahalanobis distance. However, the Mahalanobis distance is known to not perform well in high dimensions or when there are strong collinearities among covariates, settings which the current rerandomization literature has not addressed.
To address experimental design settings where there are many covariates or strong collinearities among covariates, we presented a rerandomization scheme that utilizes a modified Mahalanobis distance. This modified Mahalanobis distance inflates the eigenvalues of the covariance matrix of the covariates, thereby increasing the variance reduction of the covariates' first principal components at the expense of decreasing the variance reduction of the last principal components. Such a quantity has remained largely unexplored in the literature. We established several theoretical properties of this modified Mahalanobis distance, as well as properties of a rerandomization scheme that uses it-an experimental design strategy we call ridge rerandomization. These results establish that ridge rerandomization preserves the unbiasedness of treatment effect estimators and reduces the variance of covariate mean differences. If the covariates are related to the outcomes of the experiment, ridge rerandomization will yield more precise treatment effect estimators than randomization. Furthermore, we conducted a simulation study that suggests that ridge rerandomization is often preferable over rerandomization in high-dimensional or high-collinearity scenarios, which is intuitive given ridge rerandomization's connections to ridge regression.
This modified Mahalanobis distance represents a class of rerandomization criteria, which has connections to principal components and the Euclidean distance. This motivates future work for rerandomization schemes that utilize other criteria. In particular, our theoretical results establish that the benefit of our class of rerandomization schemes over typical rerandomization depends on the covariates' relationship with the outcomes, which usually is not known until after the experiment has been conducted. However, if researchers have prior information about the relationship between the covariates and the outcomes, this information may be useful in selecting rerandomization criteria. An interesting line of future work is further exploring other classes of rerandomization criteria, as well as demonstrating how prior outcome information can be used to select useful rerandomization criteria when designing an experiment.

A.1. Proof of Lemma 4.1
where $Z = (Z_1, \dots, Z_K)^\top \sim \mathcal{N}(0, I_K)$ marginally and independently of $x$. The matrix $(I_K + \lambda\Sigma^{-1})^{-1}$ shares the same orthonormal basis of eigenvectors $\Gamma$ as $\Sigma$, with corresponding eigenvalues $\lambda_1(\lambda_1 + \lambda)^{-1}, \dots, \lambda_K(\lambda_K + \lambda)^{-1}$. As a consequence, we have

A.2. Proof of Lemma 4.2
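Without loss of generality, let $K = 2$. Thus, the aim of this proof is to establish that $E_1 \le E_2$, i.e.,
$$E[L_1 \mid C_1L_1 + C_2L_2 \le a] \le E[L_2 \mid C_1L_1 + C_2L_2 \le a],$$
where $L_1$ and $L_2$ are independent and identically distributed non-negative random variables, $C_1 \ge C_2 \ge 0$ are constants, and $a > 0$ is a constant. First, it will be helpful to note that the event $C_1L_1 + C_2L_2 \le a$ can be partitioned into two events:
$$A = \{C_1L_1 + C_2L_2 \le a \text{ and } C_1L_2 + C_2L_1 \le a\}, \qquad B = \{C_1L_1 + C_2L_2 \le a \text{ and } C_1L_2 + C_2L_1 > a\}.$$
In other words, $A \cup B$ is equal to the event $C_1L_1 + C_2L_2 \le a$. Thus,
$$E[L_1 \mid C_1L_1 + C_2L_2 \le a] = E[L_1 \mid A]\,P(A \mid A \cup B) + E[L_1 \mid B]\,P(B \mid A \cup B), \quad (30)$$
and analogously for $L_2$. Now note that if $C_1L_1 + C_2L_2 \le a$ and $L_1 \ge L_2$, then $C_1L_2 + C_2L_1 \le a$ and thus $B$ cannot occur. To see this, note that if $L_1 \ge L_2$, then $C_2(L_1 - L_2) - C_1(L_1 - L_2) \le 0$, because $C_1 \ge C_2 \ge 0$, and therefore:
$$C_1L_2 + C_2L_1 = C_1L_1 + C_2L_2 + C_2(L_1 - L_2) - C_1(L_1 - L_2) \le C_1L_1 + C_2L_2 \le a.$$
In other words, $B$ will only occur if $L_1 < L_2$, and therefore $E[L_1 \mid B] \le E[L_2 \mid B]$. Meanwhile, due to the symmetry of $L_1$ and $L_2$ in the two constraints in $A$, $E[L_1 \mid A] = E[L_2 \mid A]$. Thus, revisiting (30), we have the following:
$$E[L_1 \mid C_1L_1 + C_2L_2 \le a] \le E[L_2 \mid A]\,P(A \mid A \cup B) + E[L_2 \mid B]\,P(B \mid A \cup B) = E[L_2 \mid C_1L_1 + C_2L_2 \le a],$$
which completes the proof. For $K > 2$, the same application of the proof applies, with the only difference being partitioning the event $\sum_{j=1}^{K} C_jL_j \le a$ into $2(K! - 1)$ events. □

A.3. Proof of Theorem 4.2
Using the same notation and reasoning as for the proof of Lemma 4.1 in Appendix A.1, in particular (28), we can write where (31) follows from the definition of $\Sigma^{1/2} = \Gamma\,\mathrm{Diag}(\sqrt{\lambda_{1:K}})\,\Gamma^\top$ along with the constructed independence of $Z$ and $x$ to get rid of the conditioning on $x$, and (32) follows from $(\Gamma^\top Z) \sim Z$ by orthogonality of $\Gamma$ and standard Normality of $Z$.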
All that is left now is to compute the conditional covariance matrix appearing in (32). Starting by its diagonal elements, the symmetry of the Normal distribution ensures that Z ∼ −Z, which implies are given by with ℓ ̸ = m, we use again the symmetry of the Normal distribution by noticing that Z ∼ Z * , where we define Z for all 1 ≤ ℓ, m ≤ K such that ℓ ̸ = m. Combining (33) and (34) gives Plugging (35) back into (32) finally yields where the d k,λ 's are given by (33). From the expression of d k,λ , we immediately have d k,λ > 0 for all k = 1, . . . , K . By using Equation (13) from Palombi and Toti (2013), we also get for all k = 1, . . . , K . Therefore, we have d k,λ ∈ (0, 1) for all k = 1, . . . , K . □ Since λ j (1 − d j,λ ) > 0 for all j = 1, . . . , K , the matrix for all v ∈ R K \{0}. In particular, by using (36) with v chosen to be the kth canonical basis vector of R K (whose elements are all 0 except its kth element equal to 1), we get, for all k = 1, . . . , K , These terms being strictly positive, this leads to v k,λ ∈ (0, 1) for all j = 1, . . . , K , i.e.
By using (19), we can writê By conditional independence of (ε T −ε C ) and (x T −x C ) given x, we have Conditional on x, M λ is a deterministic function of (x T −x C ), thus (ε T −ε C ) is conditionally independent of M λ given x.

This leads to
where (40) follows from the conditional independence of (ε T −ε C ) and M λ given x, and (41) follows from Theorem 4.2.
By plugging (39) into (41), we get As explained by (36) in the proof of Corollary 4.1, the positive definiteness of the matrix ΓDiag for all β ∈ R K , with equality if and only if β = 0. □

A.6. Calibration of a λ and d k,λ
Here we discuss how to compute the threshold a λ after the acceptance probability p a and the regularization parameter λ are set. We also discuss how to approximate the d k,λ 's in (11) via Monte Carlo.

A.6.1. Estimating a λ
As discussed in Lemma 4.1 and Section 4.2, the distribution of the ridge Mahalanobis distance M λ can be approximated as a weighted sum of independent χ 2 1 random variables. Thus, we set a λ equal to the p a -quantile of this weighted sum, defined as Q λ in (20).
Let $F_{Q_\lambda}(q) = P(Q_\lambda \le q)$ denote the CDF of $Q_\lambda$. Since $Q_\lambda$ is a weighted sum of independent $\chi^2_1$ variables, with weights $w_k = \lambda_k(\lambda_k + \lambda)^{-1}$, its characteristic function can be inverted numerically to obtain $F_{Q_\lambda}(q) = \lim_{U \to +\infty} F_{Q_\lambda,U}(q)$, with
$$F_{Q_\lambda,U}(q) = \frac{1}{2} - \frac{1}{\pi}\int_0^U \frac{\sin\theta(u)}{u\,\rho(u)}\,du, \qquad \theta(u) = \frac{1}{2}\sum_{k=1}^{K}\arctan(w_k u) - \frac{qu}{2}, \qquad \rho(u) = \prod_{k=1}^{K}(1 + w_k^2u^2)^{1/4},$$
as detailed in Equation (3.2) of Imhof (1961). In practice, for any fixed $U \ge 0$, $F_{Q_\lambda,U}(q)$ can be computed with arbitrary precision and at a negligible cost by using any (deterministic) univariate numerical integration scheme. We can then approximate $F_{Q_\lambda}(q)$ with $F_{Q_\lambda,U}(q)$ by choosing $U$ large enough. As explained in Imhof (1961), the approximation tends to improve as the number of covariates $K$ increases, and one can guarantee a truncation error of at most $\xi > 0$ in absolute value by choosing $U_\xi = \left[\xi\pi(K/2)\prod_{k=1}^{K}\sqrt{\lambda_k(\lambda_k + \lambda)^{-1}}\right]^{-2/K}$. More recent algorithms for approximating $F_{Q_\lambda}(q)$ include Davies (1980) and Bausch (2013), and computationally cheaper but less accurate alternatives to approximate $F_{Q_\lambda}$ are discussed in Bodenham and Adams (2016). Finally, we approximate the $p_a$-quantile of $Q_\lambda$ by $\hat{a}_\lambda = F^{-1}_{Q_\lambda,U}(p_a)$, i.e., the $p_a$-quantile of $F_{Q_\lambda,U}$. The hat on $\hat{a}_\lambda$ only reflects the distributional approximation of $M_\lambda$ by $Q_\lambda$, whereas the errors due to numerical integration and truncation can be regarded as virtually nonexistent compared to the Monte Carlo errors involved in the later approximations of $v_{k,\lambda}$. In the simulations of Section 5, we use $\xi = 10^{-4}$ by default.
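The calibration of the threshold can be sketched in a few lines. The code below is an illustration, not the authors' implementation: it assumes the central-case form of Imhof's inversion formula, $F(q) = \frac{1}{2} - \frac{1}{\pi}\int_0^U \sin\theta(u)/(u\rho(u))\,du$ with weights $\lambda_k(\lambda_k + \lambda)^{-1}$, uses SciPy's `quad` and `brentq` as the numerical integration and root-finding schemes, and the function names and example eigenvalues are hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import chi2

def truncation_point(weights, xi=1e-4):
    """U_xi = [xi * pi * (K/2) * prod(sqrt(w_k))]^(-2/K), bounding the truncation error by xi."""
    w = np.asarray(weights, dtype=float)
    K = w.size
    return (xi * np.pi * (K / 2) * np.prod(np.sqrt(w))) ** (-2.0 / K)

def imhof_cdf(q, weights, U):
    """Truncated Imhof approximation of P(sum_k w_k Z_k^2 <= q)."""
    w = np.asarray(weights, dtype=float)
    def integrand(u):
        if u == 0.0:                      # limit of the integrand as u -> 0
            return 0.5 * (w.sum() - q)
        theta = 0.5 * np.sum(np.arctan(w * u)) - 0.5 * q * u
        rho = np.prod((1.0 + (w * u) ** 2) ** 0.25)
        return np.sin(theta) / (u * rho)
    integral, _ = quad(integrand, 0.0, U, limit=500)
    return 0.5 - integral / np.pi

def threshold(p_a, eigvals, lam, xi=1e-4):
    """Approximate p_a-quantile of Q_lambda by root-finding on the truncated CDF."""
    eigvals = np.asarray(eigvals, dtype=float)
    w = eigvals / (eigvals + lam)
    U = truncation_point(w, xi)
    hi = w.sum()                          # start the bracket at the mean of Q_lambda
    while imhof_cdf(hi, w, U) < p_a:
        hi *= 2.0
    return brentq(lambda q: imhof_cdf(q, w, U) - p_a, 0.0, hi)

# Sanity check: with lam = 0 all weights equal 1 and Q_lambda is chi-square with K d.o.f.
print(threshold(0.1, np.ones(4), 0.0), chi2.ppf(0.1, df=4))

# Illustrative (hypothetical) eigenvalues with a ridge parameter
print(threshold(0.1, [2.0, 1.0, 0.5, 0.05], lam=0.1))
```

The bracket for the root search starts at the mean of $Q_\lambda$ so that the integrand is only evaluated for moderate $q$, where it oscillates slowly.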

A.6.2. Estimating $d_{k,\lambda}$
As discussed in Section 4.2, choosing $\lambda$ depends on the $d_{k,\lambda}$'s defined in (11), which involve intractable conditional expectations. By considering $n$ simulated sets of $K$ independent variables $Z_{ij} \overset{\text{i.i.d.}}{\sim} N(0, 1)$ for $i = 1, \dots, n$ and $j = 1, \dots, K$, the expectations appearing in (11) can be consistently estimated via Monte Carlo, for all $k = 1, \dots, K$, by the estimators $\hat{d}_{k,\lambda}$ in (44), which use $\hat{a}_\lambda$ from (43) and where $1(A)$ denotes the indicator function of an event $A$. We regard the computational cost of generating $nK$ independent Normal variables as negligible compared to the expected cost of generating $1/p_a$ successive random assignment vectors and testing the acceptability of each assignment, since the former can be done in parallel at virtually the same cost as generating one single Normal random variable.
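The Monte Carlo step can be sketched as follows. This is a hedged illustration, not the paper's code: since (11) is not reproduced in this section, the sketch assumes that $d_{k,\lambda} = E[Z_k^2 \mid \sum_j \lambda_j (\lambda_j + \lambda)^{-1} Z_j^2 \le a_\lambda]$, so that each estimate is a ratio of indicator-weighted averages. The function names and arguments are hypothetical, and the threshold `a_hat` is taken as an input.

```python
import random


def simulate_squared_normals(n, K, seed=0):
    """Draw n sets of K squared standard Normal variables (done once, then reused)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) ** 2 for _ in range(K)] for _ in range(n)]


def estimate_d(z2, eigvals, lam, a_hat):
    """Monte Carlo estimates of d_{k,lambda} under the ASSUMED definition
    d_{k,lambda} = E[Z_k^2 | sum_j w_j Z_j^2 <= a_hat], with w_j = l_j / (l_j + lam).

    Assumes strictly positive eigenvalues, so that every weight is well defined.
    """
    K = len(eigvals)
    weights = [l / (l + lam) for l in eigvals]
    sums = [0.0] * K
    accepted = 0
    for row in z2:
        # Indicator 1(sum_j w_j Z_ij^2 <= a_hat) selects the accepted draws.
        if sum(w * v for w, v in zip(weights, row)) <= a_hat:
            accepted += 1
            for k in range(K):
                sums[k] += row[k]
    if accepted == 0:
        raise ValueError("no simulated draw satisfied the acceptance criterion")
    return [s / accepted for s in sums]
```

Because the squared Normal draws do not depend on $\lambda$, the same `z2` array can be passed to `estimate_d` for every candidate value of $\lambda$, with only the weights and the threshold changing between calls.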

A.7. Details on procedure for finding a desirable $\lambda \ge 0$
Here we discuss the details of the procedure outlined in Section 4.2, specifically Steps 3 and 4 of that procedure.
The justification of our proposed procedure stems from the following facts. By definition, we have $P(M_\lambda \le a_\lambda \mid x) = p_a$ for all $\lambda \ge 0$. By taking the limit as $\lambda \to +\infty$ under the assumptions of Lemma 4.1, we get the limit (45), where $q^*(p_a)$ is the $p_a$-quantile of the distribution of $\sum_{k=1}^{K} \lambda_k Z_k^2$. This in turn implies the limits (46) for the $d_{k,\lambda}$'s, with limiting values $d^*_k$ for all $k = 1, \dots, K$. Since the limits in (46) are strictly positive, this shows that increasing $\lambda$ beyond a certain value will no longer yield any practical gain. This is in line with the intuition that the ridge Mahalanobis distance degenerates to the Euclidean distance when $\lambda \to +\infty$, as discussed in Section 4.3.
Thus, in practice, it is sufficient to search for $\lambda$ only over a bounded range of values. The lower bound $\lambda = 0$ corresponds to rerandomization with the standard Mahalanobis distance; the upper bound is determined dynamically via Step 3, which is guaranteed to stop in finite time by using an argument similar to (45). As mentioned in Section 4.2, the step size $\delta$ can be chosen as a fraction of the smallest strictly positive gap between consecutive eigenvalues, i.e., $\min\{\lambda_k - \lambda_{k-1} : k = 1, \dots, K \text{ such that } \lambda_k > \lambda_{k-1}\}$ with the convention $\lambda_0 = 0$.
Finally, among all the acceptable $\lambda$'s satisfying (22), Step 4 returns the $\lambda^\star$ that aims at altering the conditional covariance structure of $(\bar{x}_T - \bar{x}_C)$ the least, in the sense of minimizing the distance, measured in the Frobenius norm $\|\cdot\|_F$, between the conditional covariance of $\bar{x}_T - \bar{x}_C$ and the linear span of $\Sigma$, where $\hat{a}_\lambda$ and the $\hat{d}_{j,\lambda}$'s are defined in (43) and (44), respectively. The inner minimization, over the scalar multiple $c$ of $\Sigma$, is attained at $c^\star = \sum_{k=1}^{K} c_k \hat{d}_{k,\lambda}$ with $c_k = \lambda_k^2 \left(\sum_{j=1}^{K} \lambda_j^2\right)^{-1}$ for all $k = 1, \dots, K$, thus yielding Eq. (23). The outer minimization is then straightforward since the set $\Lambda$ of candidates is finite by construction.
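Two of the quantities above are simple enough to sketch in code: the step size $\delta$ and the inner minimizer $c^\star$. The sketch below is illustrative only. It assumes the eigenvalues are sorted in increasing order before gaps are taken, and it assumes that the conditional covariance and $\Sigma$ share the same eigenvectors, with eigenvalues $\lambda_k \hat{d}_{k,\lambda}$ and $\lambda_k$ respectively, so that the squared Frobenius distance to $c\Sigma$ reduces to $\sum_k \lambda_k^2 (\hat{d}_{k,\lambda} - c)^2$. This assumption is consistent with the stated form of $c^\star$ but is not quoted from the text, and the function names and the `fraction` argument are hypothetical.

```python
def step_size(eigvals, fraction=0.5):
    """Fraction of the smallest strictly positive gap between consecutive
    eigenvalues, using the convention lambda_0 = 0 (eigenvalues sorted ascending)."""
    ordered = [0.0] + sorted(eigvals)
    gaps = [b - a for a, b in zip(ordered, ordered[1:]) if b > a]
    if not gaps:
        raise ValueError("all eigenvalues are zero; no positive gap exists")
    return fraction * min(gaps)


def best_scalar_fit(eigvals, d_hat):
    """Return (c_star, squared_distance) for the inner minimization.

    ASSUMPTION: the squared Frobenius distance equals
    sum_k eigvals[k]^2 * (d_hat[k] - c)^2, a weighted least-squares problem in c.
    """
    total = sum(l ** 2 for l in eigvals)
    # Weighted average of the d_hat's with weights c_k = l_k^2 / sum_j l_j^2.
    c_star = sum((l ** 2 / total) * d for l, d in zip(eigvals, d_hat))
    dist2 = sum(l ** 2 * (d - c_star) ** 2 for l, d in zip(eigvals, d_hat))
    return c_star, dist2
```

Under this assumption, Step 4 amounts to evaluating `best_scalar_fit` for each candidate $\lambda$ in the finite set and keeping the candidate with the smallest squared distance.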
Finally, note that our procedure relies on computing $\hat{a}_\lambda$ and the $\hat{d}_{j,\lambda}$'s; these quantities rely on $nK$ auxiliary Normal variables $Z_{ij}$, which only need to be simulated once and can then be reused when testing different values of $\lambda$.

A.8. Additional simulations: Unequal sample sizes, nonlinearity, treatment effect heterogeneity, and rank deficiency
In Section 5 we considered scenarios where an equal number of units are assigned to treatment and control, covariates are linearly related with the potential outcomes, and treatment effects are additive. In this section, we provide additional simulation results for other scenarios. However, the results presented here are largely the same as those presented in Section 5, i.e., both rerandomization and ridge rerandomization are preferable over randomization, and ridge rerandomization is preferable over rerandomization in high-dimensional and/or high-collinearity scenarios.

A.8.1. Unequal sample sizes
Similar to Section 5, we consider $N = 100$ units to be assigned to treatment and control. For each unit, the covariate matrix $x$ is still generated with (26) and the potential outcomes are generated with (27), as in Section 5. However, unlike in Section 5, when implementing randomization, rerandomization, and ridge rerandomization, $N_T \neq 50$ units will be assigned to treatment and $100 - N_T$ units will be assigned to control.
We will consider $N_T \in \{10, 20, 30, 40\}$, where smaller $N_T$ denotes more unequal sample sizes between treatment and control. Similar to Section 5, we will consider collinearity $\rho \in \{0, 0.1, \dots, 0.9, 1.0\}$ for (26), and treatment effect $\tau = 1$ and coefficients $\beta = 1_K$ for (27). We will run randomization, rerandomization, and ridge rerandomization 1000 times for each setting, and then we will compare rerandomization and ridge rerandomization in terms of (1) the average reduction in variance across covariates, (2) relative MSE for the average treatment effect, and (3) relative average 95% confidence interval width for the average treatment effect. Here, ''relative'' means relative to randomization. Figs. 7, 8, and 9 show the simulation results for average reduction in variance, relative MSE, and relative average confidence interval width, respectively. These figures are analogous to Figs. 2, 3, and 4 of Section 5, but for $N_T \neq 50$. The results in these figures are nearly identical to those presented in Section 5: by focusing on the ''Difference'' plots, we see that ridge rerandomization tends to have (1) a higher average variance reduction, (2) lower relative MSE, and (3) lower relative average confidence interval width, especially in high-dimensional and/or high-collinearity settings, even if the treatment and control sample sizes are unequal. The $N_T = 10$ subfigures suggest that ridge rerandomization's advantage over rerandomization may be slightly diminished when $N_T$ and $N_C$ are highly unequal, but nonetheless ridge rerandomization appears preferable when $K$ and/or $\rho$ are large.
Fig. 8. Relative MSE of $\hat{\tau} = \bar{y}_T - \bar{y}_C$ under rerandomization and ridge rerandomization (relative to randomization), as well as the difference between the two (i.e., the second plot minus the first) for $N_T \in \{10, 20, 30, 40\}$. This is analogous to Fig. 3, but for different values of $N_T$.

A.8.2. Nonlinearity
Similar to Section 5, we consider $N = 100$ units to be assigned to treatment and control. For each unit, the covariate matrix $x$ is still generated with (26) and $N_T = N_C = 50$ units will be assigned to treatment and control when implementing randomization, rerandomization, and ridge rerandomization. However, instead of using (27) to generate the potential outcomes, we will use the model (47), where $\exp(x)$ denotes the matrix of values $e^{x}$. Again we set $\tau = 1$ and $\beta = 1_K$ and consider $K \in \{10, \dots, 90\}$ and $\rho \in \{0, 0.1, \dots, 0.9\}$ when generating the covariates.
Rerandomization and ridge rerandomization only aim to balance the first moments of the covariates. Thus, the simulations in Section 5 (where the potential outcomes are linearly related with the covariates) may be considered a ''well-specified'' scenario, whereas here we consider a misspecified scenario where averages across potential outcomes depend on more than just the first moments of covariates. This alternative model for the potential outcomes does not affect rerandomization and ridge rerandomization's ability to balance covariates' first moments, but it does affect their ability to precisely estimate treatment effects. Fig. 10 compares the relative MSE (compared to randomization) of rerandomization and ridge rerandomization, and Fig. 11 does the same for relative average 95% confidence interval width. Although ridge rerandomization does not have as clear an advantage over rerandomization in this misspecified scenario, it still tends to perform better than rerandomization in high-dimensional and high-collinearity settings. Furthermore, both rerandomization and ridge rerandomization still provide more precise inference for the average treatment effect compared to randomization, although not as much as when the potential outcomes were generated from a linear model. This is because the potential outcomes still have some linear relationship with the covariates, and thus one can still obtain more precise estimators and intervals for the average treatment effect by balancing the first moments of the covariates (Li et al., 2018). In short, the results presented here are largely the same as those presented in Section 5, where the potential outcomes were linearly related with the covariates.
Fig. 9. Relative average 95% confidence interval width under rerandomization and ridge rerandomization (relative to randomization), as well as the difference between the two (i.e., the second plot minus the first) for $N_T \in \{10, 20, 30, 40\}$. This is analogous to Fig. 4, but for different values of $N_T$.
Fig. 10. Relative MSE of $\hat{\tau}$ under rerandomization and ridge rerandomization (relative to randomization) when $\beta = 1_K$ in (47), as well as the difference in relative MSE between the two (i.e., the second plot minus the first).
Fig. 11. ... (47), as well as the difference between the two (i.e., the second plot minus the first).

A.8.3. Treatment effect heterogeneity
Similar to Section 5, we consider $N = 100$ units to be assigned to treatment and control. For each unit, the covariate matrix $x$ is still generated with (26) and $N_T = N_C = 50$ units will be assigned to treatment and control when implementing randomization, rerandomization, and ridge rerandomization. However, instead of using (27) to generate the potential outcomes, we will use the model (48). This setup is similar to the simulation setup used in Ding et al. (2016) for studying treatment effect heterogeneity.
Thus, the only simulation feature we are changing (compared to Section 5) is the way that the potential outcomes are generated. This will affect the analysis stage but not the design stage, and thus results for the average reduction in variance will be identical to those in Section 5, regardless of the heterogeneity parameter. Thus, in what follows, we will only study the relative MSE and relative average 95% confidence interval width for rerandomization and ridge rerandomization.
We will implement randomization, rerandomization, and ridge rerandomization 1000 times and compute the MSE and average confidence interval width for estimating the average treatment effect. Similar to Section 5, we focus on using the mean-difference estimator $\hat{\tau} = \bar{y}_T - \bar{y}_C$. However, unlike in Section 5, the average treatment effect is no longer simply $\tau = 1$, because each unit now has its own treatment effect $\tau_i \equiv Y_i(1) - Y_i(0) = \tau + \sigma_\tau Y_i(0)$. Thus, when computing the MSE for randomization, rerandomization, and ridge rerandomization, we compute $E[(\hat{\tau} - \bar{\tau})^2]$, where $\bar{\tau} = N^{-1} \sum_{i=1}^{N} \tau_i$. Fig. 12 compares the relative MSE (compared to randomization) of rerandomization and ridge rerandomization, and Fig. 13 does the same for relative average 95% confidence interval width. Once again, the results in these figures are nearly identical to those presented in Section 5: ridge rerandomization tends to have a lower relative MSE and lower relative average confidence interval width, especially in high-dimensional and/or high-collinearity settings, regardless of whether treatment effect heterogeneity is moderate ($\sigma_\tau = 0.25$) or large ($\sigma_\tau = 0.5$). We should note that the raw MSE and average confidence interval width (not shown) for randomization, rerandomization, and ridge rerandomization all increased from $\sigma_\tau = 0.25$ to $\sigma_\tau = 0.5$; however, their performance relative to each other did not substantially change from moderate to strong treatment effect heterogeneity, as shown by Figs. 12 and 13. In short, even though inference becomes more challenging when treatment effect heterogeneity increases, ridge rerandomization still appears to exhibit an advantage over rerandomization in high-dimensional and/or high-collinearity settings.
Fig. 12. Relative MSE of $\hat{\tau} = \bar{y}_T - \bar{y}_C$ under rerandomization and ridge rerandomization (relative to randomization), as well as the difference between the two (i.e., the second plot minus the first) for $\sigma_\tau \in \{0.25, 0.5\}$. This is analogous to Fig. 3, but for heterogeneous treatment effects using (48) to generate the potential outcomes.
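To make the change of MSE target concrete, the following minimal sketch computes the MSE of the mean-difference estimator around the average of the unit-level effects, under complete randomization only. It is not the simulation code behind Figs. 12 and 13: the control outcomes `y0` are supplied by the caller because their generating model is not reproduced here, the unit-level effects are taken to be $\tau + \sigma_\tau Y_i(0)$ as stated above, and the function name and arguments are hypothetical.

```python
import random


def mse_mean_difference(y0, tau, sigma_tau, n_treated, reps=1000, seed=0):
    """Monte Carlo MSE of the mean-difference estimator around tau_bar,
    the average of the unit-level effects tau_i = tau + sigma_tau * y0[i].

    Only complete randomization is simulated (no balance criterion is applied).
    """
    rng = random.Random(seed)
    n = len(y0)
    tau_i = [tau + sigma_tau * y for y in y0]
    y1 = [y + t for y, t in zip(y0, tau_i)]  # treated potential outcomes
    tau_bar = sum(tau_i) / n                  # estimand with heterogeneous effects
    units = list(range(n))
    sq_err = 0.0
    for _ in range(reps):
        treated = set(rng.sample(units, n_treated))
        mean_t = sum(y1[i] for i in treated) / n_treated
        mean_c = sum(y0[i] for i in units if i not in treated) / (n - n_treated)
        sq_err += (mean_t - mean_c - tau_bar) ** 2
    return sq_err / reps, tau_bar
```

Adding an acceptance check on a balance criterion inside the loop would turn this into a rerandomization comparison, with the relative MSE obtained by dividing by the value returned here.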

A.8.4. Rank deficiency
Similar to Section 5, we consider $N = 100$ units where 50 units are assigned to treatment and 50 units are assigned to control. For each unit, the covariate matrix $x$ is still generated with (26) and the potential outcomes are generated with (27), where $\beta = 1_K$ and $\tau = 1$. Again we consider $\rho \in \{0, 0.1, \dots, 0.9\}$ when generating the covariates. For this subsection, we will focus on the case where there are $K = 101$ covariates.
When $K = 101$, the covariates' covariance matrix $\Sigma$ is rank-deficient, because $N < K$. In other words, $\Sigma$ is not invertible, the Mahalanobis distance is undefined, and rerandomization cannot be implemented. Morgan and Rubin (2012) noted that when $N \le K$, the pseudo-inverse for $\Sigma$ can be used when defining the Mahalanobis distance; however, when we attempted this on our simulated data, we found that the resulting Mahalanobis distance was constant across all randomizations, thereby leaving it uninformative. In our own past exploration of the Mahalanobis distance using the pseudo-inverse (not shown), we have found this to also occasionally occur with real datasets. An interesting direction for future work would be to investigate when using the pseudo-inverse for $\Sigma$ leads to a properly defined Mahalanobis distance.
Fig. 13. Relative average 95% confidence interval width under rerandomization and ridge rerandomization (relative to randomization), as well as the difference between the two (i.e., the second plot minus the first) for $\sigma_\tau \in \{0.25, 0.5\}$. This is analogous to Fig. 4, but for heterogeneous treatment effects using (48) to generate the potential outcomes.
In any case, the ridge Mahalanobis distance $M_\lambda$ in (6) is still defined even when $N \le K$, and we can still assess the benefits of ridge rerandomization over randomization in this case, even if we cannot assess rerandomization. Similar to the previous sections, we implemented randomization and ridge rerandomization 1000 times under this scenario and computed (1) the average reduction in variance across covariates, (2) relative MSE for the average treatment effect, and (3) relative average 95% confidence interval width for the average treatment effect. Fig. 14 shows the results for $\rho \in \{0, 0.1, \dots, 0.9\}$. Once again, we see that ridge rerandomization reduces the average variance of covariate mean differences compared to randomization, and it also leads to a lower MSE and narrower confidence intervals when estimating the average treatment effect. This is especially the case when collinearity is high. This suggests that ridge rerandomization may be a viable experimental design strategy when $N \le K$.
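The point that the ridge distance remains defined for a singular $\Sigma$ can be illustrated with a small sketch. Since (6) is not reproduced in this section, the code assumes the form $M_\lambda = (\bar{x}_T - \bar{x}_C)^\top (\Sigma + \lambda I_K)^{-1} (\bar{x}_T - \bar{x}_C)$, and it uses a plain Gaussian-elimination solver to stay self-contained. The function names are hypothetical, and a numerical linear algebra library would be the natural choice at $K = 101$.

```python
def solve_linear_system(A, b, eps=1e-12):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < eps:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def ridge_mahalanobis(diff, Sigma, lam):
    """ASSUMED form of the ridge distance: diff^T (Sigma + lam * I)^{-1} diff."""
    K = len(diff)
    A = [[Sigma[i][j] + (lam if i == j else 0.0) for j in range(K)]
         for i in range(K)]
    sol = solve_linear_system(A, diff)
    return sum(d * s for d, s in zip(diff, sol))
```

For example, with the singular matrix `[[1, 1], [1, 1]]`, calling `ridge_mahalanobis` with `lam=0` raises an error because the matrix cannot be inverted, whereas any `lam > 0` gives a finite distance.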