Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values

Many testing problems are readily amenable to randomised tests such as those employing data splitting. However, despite their usefulness in principle, randomised tests have obvious drawbacks. Firstly, two analyses of the same dataset may lead to different results. Secondly, the test typically loses power because it does not fully utilise the entire sample. As a remedy to these drawbacks, we study how to combine the test statistics or p-values resulting from multiple random realisations, such as from random data splits. We develop rank-transformed subsampling as a general method for delivering large-sample inference about the combined statistic or p-value under mild assumptions. We apply our methodology to a wide range of problems, including testing unimodality in high-dimensional data, testing goodness-of-fit of parametric quantile regression models, testing no direct effect in a sequentially randomised trial and calibrating cross-fit double machine learning confidence intervals. In contrast to existing p-value aggregation schemes that can be highly conservative, our method enjoys type-I error control that asymptotically approaches the nominal level. Moreover, compared to ordinary subsampling, we show that our rank transform can remove the first-order bias in approximating the null under alternatives and greatly improve power.


Introduction
Many modern statistical procedures are randomized in the sense that the output is a random function of the data. A prominent class of such procedures are hypothesis testing methods that involve splitting the dataset into independent parts (Cox, 1975; Moran, 1973). These procedures randomly divide the data into two nonoverlapping subsets, A and B, and then perform two steps which can be described as 'hunt and test': first, Sample A is used to choose one from among a collection of test statistics; next, the chosen statistic is applied to Sample B to produce the final test statistic. This approach is attractive because in the first stage, an arbitrarily complicated procedure may be employed to hunt for an appropriate test. Clearly, were we to ignore the fact that our test statistic was selected from data and simply apply it to Sample A, we would fail to control the Type I error, a phenomenon sometimes referred to as 'double dipping' or 'data snooping'. However, because the data in A and B are independent, we can, in the second stage, effectively forget that the test statistic was chosen in a data-driven way, which permits straightforward calibration even when using a complicated 'hunting' procedure. This strategy is particularly useful in settings with complex alternatives, as the test may be chosen to target the particular alternative from which the data appear to have arisen. This approach has been used for a variety of problems, such as testing the location of multiple samples (Cox, 1975), constructing conformal prediction intervals (Lei et al., 2018; Solari & Djordjilović, 2022), goodness-of-fit testing (Janková et al., 2020), conditional (mean) independence testing (Lundborg et al., 2022; Scheidegger et al., 2021), and conducting inference that is agnostic to the asymptotic regime (Kim & Ramdas, 2024), to list just a few. As we show in this work (see Section 4.1.2), a hunt and test approach can also be used to test for a clustering structure, i.e. for testing the null of unimodality, in high-dimensional data.
Another use of data splitting relates to nonparametric or semiparametric methods where the estimator for the parameter of interest depends on nuisance parameters that must also be estimated. To ensure proper asymptotic behaviour of the final estimator, the bias from nuisance parameter estimation needs to be controlled, and this may be achieved by employing a form of sample splitting known as cross-fitting. Here, the data are first split into folds (i.e. parts) of roughly equal size, and then estimators are computed on each fold using nuisance parameters estimated from out-of-fold data. The per-fold estimates are then combined to produce the final estimate. The independence afforded by data splitting permits the use of flexible machine learning methods to estimate these parameters, as adopted by targeted estimation or double/debiased machine learning (DML) methods. This second use of sample splitting has recently become popular in practice, although the idea has a long history; see Newey and Robins (2018) and Chernozhukov et al. (2018) and references therein.
Randomized procedures also arise naturally in settings where a null hypothesis one wishes to test may more easily be stated in terms of a reweighted distribution. Thams et al. (2023) show how a wide range of problems may be cast in this framework, including testing properties of a new policy in a contextual bandits setting, model selection after covariate shift, and testing so-called generalized conditional independencies (also known as dormant independencies, or Verma constraints; see Section 4.2 for further discussion). In this work, we focus on the latter, which after appropriate reweighting may be reduced to simpler independence testing problems. Thams et al. (2023) further propose to use resampling or rejection sampling to select from the original set of observations a random subset that behaves like a sample from the reweighted distribution, to which an off-the-shelf testing procedure may then be applied.
Despite their simplicity and broad applicability, as pointed out by Cox (1975), randomized procedures have obvious drawbacks. Firstly, the extra randomness hinders replication of the analysis, an issue that is particularly concerning in view of today's 'replication crisis' in many scientific disciplines (Baker, 2016; Ioannidis, 2005; Open Science Collaboration, 2015). Although one may insist that the random seed used in a randomized algorithm should be part of the replication, when the substantive result of an analysis hinges on a particular seed, considering it replicable is questionable. Moreover, although it is sometimes argued that replicability is less of an issue when the sample size is large, as we show in Example 1 below, it remains a problem when the effect size is moderate.
A second major issue in the context of hypothesis testing is that when sampling or data splitting is used in the construction of a test, we may expect a loss of power due to the sample not being fully utilized; see, e.g. Kim and Ramdas (2024, Theorem 2.6) for a concrete case. In the context of debiased machine learning, cross-fitting can sometimes be applied to alleviate the two drawbacks to a certain extent. Ideally, the per-fold estimates are asymptotically independent and jointly Gaussian, thus giving an approximately Gaussian final averaged statistic. However, in finite samples or less ideal settings, these estimates are correlated and can result in under-coverage of standard confidence intervals; we discuss this further in Section 4.3.
To illustrate the two main drawbacks mentioned above, consider the following toy example.
Example 1 (Gaussian location experiment). Let T^(1), T^(2), … be test statistics resulting from repeatedly applying a randomized (e.g. data splitting) procedure to a given dataset. Because T^(1), T^(2), … are iid conditional on the data, unconditionally they are exchangeable in the sense that the joint distribution is invariant under any finite permutation (Kallenberg, 2005, § 1.1). Suppose the statistic is constructed such that the location of T^(1) captures the signal. In particular, suppose T^(1) ∼ N(μ, 1) marginally and we are interested in testing H_0: μ = 0 against H_1: μ > 0. The corresponding α-level test rejects H_0 when T^(1) > z_α := Φ^{-1}(1 − α), which we refer to as the single-split test. We compare this to a test that aggregates the values from L realizations (e.g. random splits of the data). In this case, it is natural to aggregate by taking the average T̄_{1:L} := (T^(1) + ⋯ + T^(L))/L. The aggregated test rejects H_0 when T̄_{1:L} > t̄_α, where the critical value t̄_α is determined by the null distribution of T̄_{1:L}. In order to permit closed-form expressions for the distribution of T̄_{1:L} and hence t̄_α, let us assume T^(1), T^(2), … are jointly normal, i.e. follow a Gaussian process. One can show that the likelihood ratio test against any fixed μ > 0 is monotone in T̄_{1:L}. Due to the exchangeability, the distribution is parametrized by (μ, ρ), where ρ ∈ [0, 1) is the pairwise correlation between T^(i) and T^(j) for every i ≠ j. In practice, we expect ρ > 0.
Let ϕ_{T^(l)} := I{T^(l) > z_α} be the test corresponding to the lth randomized test statistic. For each μ and ρ, we may compute the probability that the test cannot be replicated by another application of the same procedure on the same dataset: P(ϕ_{T^(1)} ≠ ϕ_{T^(2)}). Similarly, the probability that the aggregated test does not replicate is P(ϕ_{T̄_{1:L}} ≠ ϕ_{T̄_{(L+1):2L}}), where ϕ_{T̄_{1:L}} := I{T̄_{1:L} > t̄_α}. The two probabilities (see Appendix A, online supplementary material, for their expressions) are compared in Figure 1. Note that the probability of nonreplication for ϕ_{T^(1)} can be quite high when the effect size (relative to the sample size for constructing ϕ_{T^(1)}) is neither too weak nor too strong. The probability is significantly reduced by aggregating L = 200 realizations and approaches zero as L tends to infinity.
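The nonreplication probabilities in Example 1 can be checked by simulation. The sketch below is a hypothetical illustration (the constants mu, rho and the Monte Carlo design are our own choices, not from the paper): it draws exchangeable Gaussian statistics via a shared latent factor and estimates both probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, rho, alpha, L, reps = 1.5, 0.1, 0.05, 200, 4000
z_alpha = 1.6448536269514722     # Phi^{-1}(1 - alpha) for alpha = 0.05

# Exchangeable Gaussian statistics via a shared factor:
# T = mu + sqrt(rho)*Z0 + sqrt(1 - rho)*Z, so corr(T_i, T_j) = rho for i != j.
Z0 = rng.standard_normal((reps, 1))
Z = rng.standard_normal((reps, 2 * L))
T = mu + np.sqrt(rho) * Z0 + np.sqrt(1 - rho) * Z

# Single-split test: do two realizations on the same data agree?
phi1 = T[:, 0] > z_alpha
phi2 = T[:, 1] > z_alpha
p_single = np.mean(phi1 != phi2)

# Aggregated test on two disjoint batches of L realizations each; the null
# critical value uses var(T_bar) = rho + (1 - rho)/L under H_0.
tbar_alpha = z_alpha * np.sqrt(rho + (1 - rho) / L)
agg1 = T[:, :L].mean(axis=1) > tbar_alpha
agg2 = T[:, L:].mean(axis=1) > tbar_alpha
p_agg = np.mean(agg1 != agg2)

print(p_single, p_agg)
```

With these illustrative constants, aggregation makes nonreplication far rarer than for the single-split test.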
Moreover, the aggregation also boosts the power, as evidenced by Figure 1. One can show that the power of the aggregated test is given by

Φ( μ / √(ρ + (1 − ρ)/L) − z_α ),

where L = 1 corresponds to the power of the single-split test. For a small μ and a large L, we have Φ( μ/√(ρ + (1 − ρ)/L) − z_α ) ≈ Φ( μ/√ρ − z_α ), so the slope of local power is improved by a factor of 1/√ρ. This improvement can be particularly significant when ρ takes a small positive value, which is not uncommon for the settings considered in the paper, such as statistics resulting from two random splits of data.
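As a sanity check on this power expression, the following sketch compares the closed form with a Monte Carlo estimate under the same exchangeable Gaussian model (our own illustration; the hard-coded z_alpha is Φ^{-1}(0.95), and all other constants are arbitrary).

```python
import math
import numpy as np

def Phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def agg_power(mu, rho, L, z_alpha):
    # power of the aggregated test: Phi(mu / sqrt(rho + (1 - rho)/L) - z_alpha)
    return Phi(mu / math.sqrt(rho + (1.0 - rho) / L) - z_alpha)

mu, rho, L = 0.5, 0.1, 50
z_alpha = 1.6448536269514722     # Phi^{-1}(0.95)

# Monte Carlo check: T_bar = mu + sqrt(rho)*Z0 + sqrt((1 - rho)/L)*Z
rng = np.random.default_rng(1)
reps = 200_000
Tbar = (mu + math.sqrt(rho) * rng.standard_normal(reps)
        + math.sqrt((1.0 - rho) / L) * rng.standard_normal(reps))
tbar_alpha = z_alpha * math.sqrt(rho + (1.0 - rho) / L)
mc = float(np.mean(Tbar > tbar_alpha))

print(agg_power(mu, rho, L, z_alpha), mc)
```

At μ = 0 the expression reduces to Φ(−z_α) = α, and it increases in L towards Φ(μ/√ρ − z_α).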
From the example above, one may be tempted to conclude that to perform the aggregated test, we only need to estimate the correlation ρ (e.g. through the empirical variance computed from T^(1), T^(2), …). However, the joint normality assumption on the statistics T^(1), T^(2), …, which further implies the normality of T̄_{1:L}, need not hold in practice. In other words, even if every T^(l) is marginally normal or asymptotically normal, the dependence among the statistics need not be a normal copula even in large samples. Consider the following example due to Kim and Ramdas (2024).

Example 2. Let X_1, …, X_n be iid random vectors in R^p with mean μ and covariance Σ. We are interested in testing H_0: μ = 0 against H_1: μ ≠ 0. Let I_n be a random subset of {1, …, n} of size n_1 and let μ̂_1 := n_1^{-1} Σ_{i ∈ I_n} X_i. Treating μ̂_1 as a fixed vector, we see that μ̂_1^⊤ X_i has mean μ̂_1^⊤ μ for i ∈ I_n^c, which is zero under H_0. By studentizing this quantity, we may obtain via a central limit theorem (under appropriate conditions) a test statistic

T_n := √n_2 μ̂_1^⊤ μ̂_2 / (μ̂_1^⊤ Σ̂_2 μ̂_1)^{1/2} →_d N(0, 1)

as n → ∞ and n_2/n → q ∈ (0, 1), where μ̂_2 and Σ̂_2 are the empirical mean and covariance computed from I_n^c, and n_2 := n − n_1. We reject H_0 when T_n is large compared to N(0, 1). One benefit of using such a test statistic is that, unlike for example a norm of the empirical mean from the whole sample, the limit distribution does not depend delicately on the asymptotic limit of p/n.
As indicated earlier, however, a disadvantage is that the approach does not fully utilize the information in the sample. One might hope that this can be alleviated by considering the cross-fitted statistic [T_n(X_1, …, X_n; I_n) + T_n(X_1, …, X_n; I_n^c)]/2. However, Kim and Ramdas (2024, Proposition A.1) showed that this does not have a normal limit, rendering calibration challenging. Further, let I_n^(l) for l = 1, …, L be independent random subsets of size ⌊qn⌋ and define T_n^(l) := T_n(X_1, …, X_n; I_n^(l)) and the aggregated statistic S_n := (T_n^(1) + ⋯ + T_n^(L))/L. We see in Figure 2 that its sampling distribution under X_1, …, X_n ∼ iid N(0, Σ) is clearly nonnormal even in large samples.
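Example 2 can be explored numerically along the following lines. The sketch below is our own minimal implementation of a studentized split statistic of this type (the name split_stat and all constants are ours); it averages L random splits under the null and reports summary moments, from which departures from normality such as skewness can be inspected.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, L, reps = 200, 40, 20, 1000

def split_stat(X, idx1):
    # 'hunt': estimate a direction on sample A; 'test': studentized
    # projection of sample B onto that direction
    mask = np.zeros(len(X), dtype=bool)
    mask[idx1] = True
    mu1 = X[mask].mean(axis=0)
    v = X[~mask] @ mu1
    return np.sqrt(len(v)) * v.mean() / v.std(ddof=1)

S = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, p))          # data drawn under the null
    stats = [split_stat(X, rng.choice(n, size=n // 2, replace=False))
             for _ in range(L)]
    S[r] = np.mean(stats)                    # aggregated multiple-split statistic

skew = np.mean(((S - S.mean()) / S.std()) ** 3)
print(S.mean(), S.std(), skew)
```

Each single-split statistic is approximately N(0, 1), yet the shape of the aggregate typically departs from normal, consistent with Figure 2.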
Using an aggregate S_n of the exchangeable test statistics T_n^(1), …, T_n^(L) is more sensitive to departures from the null as it makes better use of the full data. In addition, because the conditional variability in S_n given the data decreases in L (e.g. like 1/L when S_n is the average), by taking a relatively large L, S_n is effectively derandomized; we will revisit Example 2 in Section 4.1.1 and Appendix H.1, online supplementary material, to demonstrate the improvement from using S_n. However, it should be clear from this very simple example that the main challenge in general lies in calibrating S_n. In particular, we need to handle the unknown, potentially complicated dependence among T_n^(1), …, T_n^(L), which typically causes S_n not to follow any textbook distribution. To solve this problem, we develop a data-driven calibration scheme based on subsampling to obtain tests whose size, under mild assumptions, approaches the nominal level. Further, we will demonstrate that through the rank transform we introduce, our method is able to accurately approximate the null distribution even when data are drawn from a local alternative, and this leads to power almost as good as that of an oracle procedure. Our approach is applicable to all of the randomized tests mentioned above, and as we will see, by inverting particular aggregated hypothesis tests, we can also obtain confidence intervals for cross-fitted double machine learning that can reliably deliver coverage when the standard confidence intervals cannot. Before introducing our method, we briefly summarize some existing proposals in the literature.

Existing proposals and related literature
There is a long line of work that has considered how to combine multiple test statistics. Most approaches either aim to control the Type I error under an arbitrary dependence of the test statistics, or make some specific assumptions about their dependence, such as a Gaussian copula.
Most approaches of the first type consider aggregating p-values, i.e. the case where T^(1) is marginally (super-)uniformly distributed under the null. Classical examples include the Bonferroni correction, i.e. taking the minimum p-value multiplied by the total number of p-values, and the average of the p-values multiplied by 2 (Meng, 1994; Rüschendorf, 1982). Related results have been shown for quantiles (DiCiccio et al., 2020; Meinshausen et al., 2009) and generalized means (Vovk & Wang, 2020). Other aggregation rules include taking a weighted sum of Cauchy transformations (Liu & Xie, 2020), rules developed through concentration inequalities (DiCiccio et al., 2020) and rules involving converting p-values to so-called e-values (Vovk & Wang, 2021), to name a few.
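For concreteness, a few of these classical combination rules might be implemented as follows (a sketch; the helper combine_pvalues and its method names are our own, not from any of the cited works).

```python
import numpy as np

def combine_pvalues(p, method):
    """Combine p-values from multiple randomized tests (hypothetical helper).

    'bonferroni' and 'twice_mean' are valid under arbitrary dependence of the
    p-values; 'cauchy' is the combination rule of Liu & Xie (2020)."""
    p = np.asarray(p, dtype=float)
    if method == "bonferroni":       # the minimum p-value times their number
        return min(1.0, p.size * p.min())
    if method == "twice_mean":       # twice the average p-value
        return min(1.0, 2.0 * p.mean())
    if method == "cauchy":           # average of Cauchy transformations
        t = np.mean(np.tan((0.5 - p) * np.pi))
        return 0.5 - np.arctan(t) / np.pi
    raise ValueError(method)

p = [0.01, 0.04, 0.3, 0.5]
print([combine_pvalues(p, m) for m in ("bonferroni", "twice_mean", "cauchy")])
```

On this toy input, Bonferroni returns 4 × 0.01 = 0.04 and the doubled mean returns 0.425, illustrating how conservative worst-case rules can be relative to the individual p-values.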
While these methods have the attractive guarantee of giving finite-sample valid p-values, as they necessarily must cater for the worst-case dependence, this benefit comes with the downside of conservativeness. Indeed, as pointed out by DiCiccio et al. (2020), when used with p-values produced through sample splitting, these methods can sometimes be outperformed by a single-split test; see Appendix H.4, online supplementary material, for several such numerical examples from applications considered in this paper. For combining Z-statistics, the right panel of Figure 1 illustrates a similar phenomenon in the context of Example 1 with the 'conservative' rule that rejects when T̄_{1:L} > 2z_α, where the extra factor of 2 guarantees its validity; see Theorem B.1, online supplementary material. Indeed, as we show in Proposition B.1, online supplementary material, directly comparing T̄_{1:L} to a standard normal can lead to a size of 2α, even when enforcing exchangeability of the underlying Z-statistics. This may come as a surprise given that, by Jensen's inequality, T̄_{1:L} has variance at most 1, for example. An analogous negative result for averaging p-values, which shows that the doubling rule cannot be improved under exchangeability, is proved by Choi and Kim (2022).
In view of this, other work has considered approaches relying on asymptotic joint normality of the underlying test statistics T_n^(1), T_n^(2), …; see, e.g. Romano and DiCiccio (2019, Theorem 4.1), DiCiccio (2018, Theorem 3.2), Tian et al. (2021), and Liu et al. (2022). However, as evidenced by Example 2, such a dependence structure among the test statistics is unlikely to hold in practice, especially for data splitting.
Our work also connects more broadly to a body of literature on subsampling; see Politis et al. (1999) for a monograph on the topic. Berg et al. (2010) and McMurry et al. (2012) consider subsampling for hypothesis testing and use the data to centre the estimated null distribution to improve power; this is in a similar spirit to our rank transform introduced in Section 2.2, which is a more aggressive form of centring that enforces the mean and all other moments of the null. Subsampling has also been used to reduce the variance of an estimator through what is known as 'subagging' (Bühlmann & Yu, 2002). Stability selection procedures exploit this for variable selection (Meinshausen & Bühlmann, 2010; Shah & Samworth, 2013) and can provide a form of finite-sample error control for a user-chosen variable selection method. More broadly, in the literature on resampling-based inference, ranks from the bootstrap can be used to 'prepivot' (Beran, 1987, 1988) a statistic to improve error control under the null, though this is somewhat different from our use of ranks to improve power; this is discussed further in Appendix G, online supplementary material.
A large number of randomized algorithms have been developed for various testing problems to which our method is applicable. Some of these have been mentioned in the introduction under the umbrellas of 'hunt and test' and reweighting. Other methods that do not explicitly fall within these categories include approximate co-sufficient sampling (Barber & Janson, 2022) for goodness-of-fit testing and several approaches for assessing variable importance nonparametrically (Cai et al., 2022; Dai et al., 2022; Tansey et al., 2022; Williamson et al., 2023).

Main contributions and organization of the paper
We develop a general framework for hypothesis testing with an aggregated statistic S_n := S(T_n^(1), …, T_n^(L)), where S(•) is a user-specified, symmetric aggregation function such as the arithmetic mean. To fully exploit the signal in S_n, it is essential that the aggregated test is not conservative. Therefore, we construct a test from S_n whose Type I error asymptotically approaches the nominal level α, and which outperforms existing conservative aggregation rules.
To achieve this, we employ subsampling (Politis & Romano, 1994; Politis et al., 1999), a generic tool for approximating sampling distributions under minimal assumptions. However, subsampling alone is not enough: it not only approximates the sampling distribution under the null, but under alternatives it also picks up a visible, finite-sample bias from the sampling distribution (we formalize this in a first-order asymptotic analysis of subsampling in Appendix D.3, online supplementary material). Hence, naively comparing S_n to its subsampling critical value leads to a test with suboptimal power. To fix this crucial issue, we exploit the fact that the asymptotic null distribution of T_n^(1) is typically known, e.g. unif(0, 1) for a p-value or N(0, 1) for a Z-statistic. We introduce a rank transform that we apply to the subsampled statistics to enforce the known null marginal distribution of (T_n^(1), …, T_n^(L)). In other words, to approximate the null distribution of (T_n^(1), …, T_n^(L)), and hence of the aggregated S_n, we effectively combine the known marginal null distribution with the unknown copula estimated from subsampling. We demonstrate the favourable performance of our method with three types of applications.
(i) 'Hunt-and-test' procedures: we use our framework to develop new tests for the goodness of fit of parametric quantile regression models, and for testing unimodality in high-dimensional data. We illustrate the latter on cancer gene expression data to detect the presence of cancer subtypes.
(ii) Testing hypotheses under reweighting or 'distributional shift' with resampling or rejection sampling: specifically, we study testing the sharp null of no direct treatment effect in the context of a sequentially randomized trial.
(iii) Calibrating confidence intervals for cross-fitted DML estimators: in our simulations, we look in particular at confidence intervals for partially linear models, though the methodology we develop is applicable more broadly.
The rest of this paper is organized as follows. In Section 2, we introduce our rank-transformed subsampling procedure. We present an aggregated, multiple-split test (Algorithm 2) and a variant (Algorithm 3) that allows for several user-specified aggregation functions and adapts to the best one. In Section 3, we study the theoretical properties of our method. We show that these algorithms give tests that asymptotically have size equal to a given significance level α. Further, we show that if the test statistic and the aggregated statistic converge uniformly under the null, our procedures inherit such uniformity in terms of Type I error control. In terms of power, we show under mild conditions that our test is as powerful as an oracle test that has access to the null distribution of the aggregated statistic. Moreover, we show that the power gap between the oracle test and our test is smaller than the gap between the oracle test and a test based on ordinary subsampling (i.e. without the rank transform) by an asymptotic order, and this leads to a significant power improvement in practice. We establish this result under general conditions that go beyond the settings where the Edgeworth expansion, the standard tool for higher-order asymptotic analysis in the literature, can be applied. In Section 4, we demonstrate our method with a variety of applications as mentioned above. Finally, we conclude with a discussion in Section 5 outlining possible directions for future research. The online supplementary material contains all proofs and additional theoretical and numerical results; all the appendices can be found there. An R package MultiSplit implementing our method and scripts for reproducing numerical results are available from https://github.com/richardkwo/MultiSplit.

Setup
Let X_1, …, X_n ∈ X be data points drawn iid from an underlying distribution P. We are interested in testing H_0: P ∈ P_0 vs. H_1: P ∈ P \ P_0, where P is the set of relevant laws of X that includes both the null and the alternative. Let T_n^(1), …, T_n^(L) be test statistics that can be computed from the sample X := (X_1, …, X_n) and a piece of external randomness Ω generated by the analyst. Throughout, we assume that the random vector (T_n^(1), …, T_n^(L)) is exchangeable under (X, Ω) ∼ P^n × P_Ω for every P ∈ P, where P_Ω is the distribution of Ω.
Often, such T_n^(1), …, T_n^(L) are obtained by applying the same randomized procedure L times on X. That is, we have

T_n^(l) = T_n(X_1, …, X_n; Ω^(l)), l = 1, …, L, (2)

where, without loss of generality, we assume Ω^(l) ∼ iid unif(0, 1) independently of X. For example, Ω^(l) can be used by T_n to realize a random data split or a sequence of unif(0, 1) random variables for acceptance-rejection sampling (e.g. by splitting the bits in a binary expansion). In this case, Ω = (Ω^(1), …, Ω^(L)).
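The following minimal sketch (entirely our own; the toy 'hunt' step and all names are hypothetical) illustrates how a single scalar Ω^(l) can drive all the internal randomness of a statistic: here a seed derived from omega determines a half/half split, so the statistic is a deterministic function of (X, omega).

```python
import numpy as np

def T_n(X, omega):
    """Randomized statistic T_n(X; omega): the scalar omega in (0, 1) seeds
    all internal randomness, here realizing a random half/half data split."""
    rng = np.random.default_rng(int(omega * 2**53))   # derived purely from omega
    idx = rng.permutation(len(X))
    A, B = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]
    sign = 1.0 if A.mean() >= 0 else -1.0             # toy 'hunt' step on A
    return sign * B.mean() * np.sqrt(len(B)) / B.std(ddof=1)  # 'test' step on B

rng = np.random.default_rng(3)
X = rng.standard_normal(100)
omega = rng.uniform()
print(T_n(X, omega) == T_n(X, omega))   # same omega and data give identical output
```

Repeating the call with fresh draws of omega yields the exchangeable statistics T_n^(1), …, T_n^(L).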
Alternatively, it can be that every T_n^(l) is a deterministic function of X_1, …, X_n. This can happen, for example, when there are L prespecified ways of splitting the full sample and every such way looks no different from any other. We will study this case in the context of cross-fitting in Section 4.3.
Throughout, we require that under the null, T_n^(1) converges to a known, continuous distribution F_0, such as unif(0, 1) or N(0, 1). Without loss of generality, we assume H_0 is rejected for large values of T_n^(1); other cases can be handled by redefining T_n^(1), e.g. replacing T_n^(1) with |T_n^(1)| for a two-sided test, or with 1 − T_n^(1) for a p-value. Abusing the term slightly, we call the test that rejects when T_n^(1) > F_0^{-1}(1 − α) the 'single-split' test, even though T_n^(1) itself may not be constructed with data splitting. Consider the aggregated, 'multiple-split' statistic S_n := S(T_n^(1), …, T_n^(L)), constructed with a symmetric, continuous aggregation function S: R^L → R, such as the arithmetic mean or the maximum. By taking L reasonably large, we can expect the conditional variance of S_n given X_1, …, X_n to be small enough that the aggregated test statistic is effectively derandomized. Note that the restriction that S is symmetric is rather reasonable: it follows from the Neyman-Pearson lemma that a most powerful test (for a simple null against a simple alternative) necessarily combines the exchangeable test statistics in some symmetric fashion; see Proposition C.1, online supplementary material. We will make the mild assumption (see Assumption 1 and the following discussion) that S_n converges to some distribution G_P under the null; in practice, the limit G_P is typically an unknown and often non-Gaussian continuous distribution that depends on P ∈ P_0. Our aggregated test rejects H_0 for large values of S_n, and under the null aims to mimic an oracle procedure that rejects whenever S_n exceeds the unknown upper α quantile of G_P. To do this, we use subsampling to compute Ĝ_n, an approximation to G_P, and use its quantile to determine the critical value for S_n. As mentioned earlier in Section 1.2, to have good power, however, Ĝ_n must continue to closely mimic the null sampling distribution of S_n even when data are generated under alternatives. For example, when T_n^(1) ∼ N(0, 1) under H_0 and S_n := (T_n^(1) + ⋯ + T_n^(L))/L, under alternatives Ĝ_n should, as expected from an oracle procedure, maintain zero mean even when S_n takes a positive mean. This is achieved by the rank transform introduced below.

Rank-transformed subsampling
In this section, we describe our procedure when using a single aggregation function S. We first introduce some notation relating to distribution functions and then set out our subsampling scheme.
Notation. Given a set of points {x_i} on the real line, we use F_{x_i} to denote their empirical distribution function. For a real-valued function F, let ‖F‖_∞ := sup_x |F(x)|. For a distribution function F, its upper α quantile is defined as F^{-1}(1 − α) := inf{x: F(x) ≥ 1 − α}.

Subsampling. Our method is based on subsampling, which ensures Type I error control under minimal assumptions. In general, subsampling cannot be replaced by the bootstrap without sacrificing the wide applicability of our method; we explain this in Appendix F, online supplementary material. Let m < n be a user-chosen subsample size. Throughout the paper, we require m → ∞ and m/n → 0; for the description of the algorithms and all the numerical experiments in this paper, we use m = ⌊n/log n⌋ (see Section 5 for a discussion). We randomly select a total of B sets of indices, each of size m, such that there is a sufficiently low degree of overlap among the sets. To do this, we first choose a positive integer J (e.g. J = 100) and let B := J⌊n/m⌋. Our collection B of index sets is then obtained by repeating the following J times: randomly partition {1, …, n} into ⌊n/m⌋ disjoint subsets of size m, discarding any remainder. Note that the construction guarantees that B contains J collections of ⌊n/m⌋ sets that are nonoverlapping, and so statistics computed on these subsamples are independent. This will allow us to obtain guarantees for subsampling that do not rely on approximating a scheme (e.g. Politis et al., 1999, § 2.4) where statistics on every possible subsample of size m are evaluated.
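One way to realize this index-set construction is sketched below (our own implementation, assuming the construction repeats J independent random partitions of {1, …, n} into ⌊n/m⌋ disjoint blocks of size m).

```python
import numpy as np

def subsample_indices(n, m, J, rng):
    """B = J * (n // m) index sets of size m; each of the J groups forms a
    partition into n // m mutually disjoint sets (remainder discarded)."""
    k = n // m
    sets = []
    for _ in range(J):
        perm = rng.permutation(n)
        sets.extend(perm[i * m:(i + 1) * m] for i in range(k))
    return sets

rng = np.random.default_rng(4)
n = 1000
m = int(n / np.log(n))       # the choice m = floor(n / log n) used in the paper
sets = subsample_indices(n, m, J=100, rng=rng)
print(len(sets), m)          # B = J * (n // m) sets, each of size m
```

Statistics computed on the k = n // m sets within any one of the J groups are then independent, since those subsamples share no observations.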
Let Ĥ = (Ĥ_{b,l}) be a B × L matrix whose bth row is obtained by computing (T_m^(1), …, T_m^(L)) on the bth subsample listed in B. When the statistic is a randomized test of the form of equation (2), the external random number is regenerated for every entry of Ĥ; that is, each Ĥ_{b,l} uses its own fresh draw of the external randomness. Note that although we have arranged the subsampled test statistics into a matrix, entries in the same column but different rows do not correspond directly to one another; that is, Ĥ_{b,l} is no more related to Ĥ_{b′,l} than to Ĥ_{b′,l′} for b′ ≠ b and l′ ≠ l. Now if we were to apply the aggregation function S to each row of Ĥ, we would obtain statistics Ŝ_b := S(Ĥ_{b,1}, …, Ĥ_{b,L}) with empirical distribution function Ĝ_n := F_{Ŝ_b}. By the consistency of subsampling (see C.1, online supplementary material), we have ‖Ĝ_n − G_P‖_∞ →_p 0 under P ∈ P_0. However, directly using Ĝ_n to construct the test is suboptimal because under a sequence of local alternatives, Ĝ_n contains an upward bias from sampling under the alternative. Although such a bias may vanish asymptotically, the rate at which it vanishes can be rather slow, and this can severely reduce power; this is illustrated in the bottom left panel of Figure 3. We formalize this point in Theorem D.2, online supplementary material, and numerically demonstrate the bias in Appendix D.5, online supplementary material. Therefore, instead of using Ĝ_n to calibrate our test statistic S_n, we perform the rank transform introduced below.
Rank transform. Using exchangeability, we can pool the entries of Ĥ and let F_Ĥ be the resulting empirical distribution function. With this, we form a rank-transformed version of Ĥ, denoted by H̃ = (H̃_{b,l}), filled with entries

H̃_{b,l} := F_0^{-1}( F_Ĥ(Ĥ_{b,l}) − 1/(2BL) ), (4)

where the subtraction of 1/2 from the ranks is simply a finite-sample correction to prevent infinity being produced when applying F_0^{-1}. We then compute the aggregated statistics S̃_b := S(H̃_{b,1}, …, H̃_{b,L}) and their resulting empirical distribution function G̃_n := F_{S̃_b}, which we then use to determine the critical value for S_n. The full procedure is given in Algorithm 2. We provide some intuition on why the rank transform works. Consider first the null case where P ∈ P_0. As n → ∞ (and hence n/m → ∞, B → ∞), by consistency of subsampling and exchangeability of T_m^(1), …, T_m^(L) (so they share the same marginal distribution), we expect F_Ĥ(•) − 1/(2BL) ≈ F_0(•) in equation (4). Therefore, H̃ ≈ Ĥ under the null. Because G̃_n is computed from H̃ in the same way as Ĝ_n is computed from Ĥ, we can expect that G̃_n ≈ Ĝ_n ≈ G_P with high probability under the null. This is formalized in Theorem 1 and illustrated in the top panel of Figure 3, from which we see that under the null the rank transform leaves the points almost unchanged, and both Ĝ_n and G̃_n well approximate the sampling distribution of S_n.
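The rank transform of equation (4) is straightforward to implement. The sketch below is our own code; for simplicity it takes F_0 = unif(0, 1), so F_0^{-1} is the identity. It shows that the transform restores the null marginal while preserving the ordering, and hence the copula, of the pooled entries.

```python
import numpy as np

def rank_transform(H, F0_inv):
    """Rank transform of equation (4): replace each entry of the B x L matrix
    H of subsampled statistics by F0_inv((rank - 1/2) / (B*L)), with ranks
    taken over all pooled entries; this enforces the known null marginal F0
    while leaving the ordering (and hence the copula) unchanged."""
    B, L = H.shape
    ranks = H.ravel().argsort().argsort().reshape(B, L) + 1.0
    return F0_inv((ranks - 0.5) / (B * L))

rng = np.random.default_rng(5)
# p-value-style statistics drawn under an alternative: marginally Beta(2, 1)
# rather than the null marginal F0 = unif(0, 1)
H = rng.beta(2.0, 1.0, size=(600, 10))
H_tilde = rank_transform(H, F0_inv=lambda u: u)   # F0^{-1} = identity here

S_hat = H.mean(axis=1)            # aggregates without the transform (biased up)
S_tilde = H_tilde.mean(axis=1)    # aggregates after the transform
print(H.mean(), H_tilde.mean())   # roughly 2/3 before vs exactly 1/2 after
```

The transformed entries have pooled mean exactly 1/2, the null mean under F_0 = unif(0, 1), while the ranks of the entries are untouched.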
Under an alternative P ∈ P \ P_0, we expect that F_Ĥ(•) − 1/(2BL) ≈ F_P(•), with F_P the distribution function of the test statistic T_m^(1) corresponding to the subsample size m. Thus, from equation (4) we have

H̃_{b,l} ≈ F_0^{-1}( F_P(Ĥ_{b,l}) ) = F_0^{-1}(U_{b,l})

for some U_{b,l} ∼ unif(0, 1). In this way, the rank transform enforces the marginal distribution of H̃_{b,l} to be F_0, the asymptotic null distribution of Ĥ_{b,l}, as we can observe from the side histograms in Figure 3. The dependency among Ĥ_{b,1}, …, Ĥ_{b,L}, in contrast, is left unchanged. However, particularly under local alternatives, where T_n^(1) contains just enough information to detect the deviation from the null, we would expect the dependency among test statistics constructed from the smaller subsampled data to be indistinguishable from that under the null; we formalize this notion in Definition 1. In sum, then, H̃_{b,1}, …, H̃_{b,L} should continue to approximate the null distribution of T_n^(1), …, T_n^(L), as we desire. Further, such an approximation should be more accurate than directly using Ĥ_{b,1}, …, Ĥ_{b,L}, because any bias stemming from the difference in the marginal distribution has already been removed. Indeed, this underlies the effectiveness of the rank transform in restoring null-like behaviour under the alternative, as demonstrated in the bottom panel of Figure 3 and formalized by our theory in Section 3.2.
[Algorithm 2, the aggregated multiple-split test via rank-transformed subsampling, is displayed here; its steps initialise the B × L matrices Ĥ and H̃ and the B-dimensional vector S̃, and fill them as described above.]

Adapting to the best aggregation function
We can further improve power by choosing a good aggregation function S. The performance of an aggregation function, however, depends on the joint behaviour of the single-split tests under the alternative of interest, which is usually unknown. For example, we expect S = (T_n^(1) + ⋯ + T_n^(L))/L to work particularly well if most of the single-split statistics are large under the alternative; in contrast, S = max(T_n^(1), …, T_n^(L)) should perform better if only a few of them are large under the alternative. This motivates us to allow the user to specify multiple candidate aggregation functions S_1, …, S_W, which are expected to accommodate different cases. In Algorithm 3, we present a variant of our procedure that aims to adapt to the best aggregation function among S_1, …, S_W.
The algorithm rejects for large values of R_n in equation (5), where G̃_n^w and S_n^w, respectively, are the counterparts of G̃_n and S_n in Algorithm 2 but relate to the wth aggregation function. The quantity R_n in equation (5) is, therefore, one minus the minimum of the p-values corresponding to the aggregation functions. Thus, if any one of the aggregation functions yields good power, we should expect R_n to be large: in this way, the test statistic aims to achieve power close to that of the best S_w under consideration.
We could calibrate R_n using a Bonferroni correction, but this would give a conservative test, potentially sacrificing any power we might have gained in using multiple aggregation functions. Instead, we can reuse our subsampling aggregate statistics S̃_b^w to approximate the sampling distribution of R_n under the null; the subsampled versions of R_n used for this are computed in lines 9-11 of Algorithm 3. The advantage of this approach is that it properly takes account of the dependence among the statistics corresponding to the different aggregation functions. As a consequence, the resulting test has asymptotic size equal to its prescribed level (Theorem 2), and so in this sense no power is lost.
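The calibration idea can be sketched as follows in an illustrative numpy implementation (not the authors' exact Algorithm 3; all names are ours). Each candidate's statistic is converted to one minus a p-value via the empirical distribution of its own subsampled values; R_n takes the maximum over candidates, and the same transform applied row-by-row to the subsample matrix yields a reference distribution for R_n that retains the dependence across candidates.

```python
import numpy as np

def adaptive_test(S_obs, S_sub, alpha=0.05):
    """Adaptive aggregation over W candidate functions (a sketch).

    S_obs : length-W array of aggregate statistics on the full data.
    S_sub : B x W array of subsampled aggregate statistics.
    Each statistic is mapped to one minus a p-value through the
    empirical cdf of its own column; R_n takes the best candidate and
    is calibrated against the same transform applied to each
    subsample row."""
    B, W = S_sub.shape
    ecdf = lambda w, x: np.mean(S_sub[:, w] <= x)     # empirical cdf of column w
    R_n = max(ecdf(w, S_obs[w]) for w in range(W))
    R_sub = np.array([max(ecdf(w, S_sub[b, w]) for w in range(W))
                      for b in range(B)])
    crit = np.quantile(R_sub, 1 - alpha)              # (1 - alpha) reference quantile
    return R_n, crit, R_n > crit
```

Because every row is pushed through the same candidate-wise transforms, no Bonferroni-type correction is needed.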

Behaviour under the null
In this section, we establish that our algorithms lead to asymptotically level α tests under a set of mild assumptions, which ensure the consistency of subsampling and the validity of the rank transform.

Condition 1 (Asymptotic pivotal null). For every P ∈ P_0, under (X, Ω) ∼ P^n × P_Ω, as n → ∞ it holds that T_n^(1) converges in distribution to F_0, where F_0 is a known, continuous distribution function.
Note that when T_n^(l) is a randomized test of the form in equation (2), a sufficient condition for the above is that for every fixed ω ∈ (0, 1) and every P ∈ P_0, it holds that T_n(X_1, …, X_n; ω) →_d F_0.
We study the tests constructed from rank-transformed subsampling under the null for two leading cases: (i) F_0 = unif(0, 1) (one minus a p-value) and (ii) F_0 = Φ (a Z-statistic). For other null distributions, the probability integral transform and its inverse can be applied to convert the statistic to one of these cases.
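For instance, a statistic T with a known continuous null distribution function F_0 can be mapped to the Z-statistic case via Φ^{-1}(F_0(T)); a small sketch using only the Python standard library (the helper name is ours):

```python
import math
from statistics import NormalDist

def to_z(t, F0):
    """Map a statistic with continuous null cdf F0 to a Z-statistic:
    under H0, F0(T) ~ unif(0, 1), so Phi^{-1}(F0(T)) ~ N(0, 1)."""
    return NormalDist().inv_cdf(F0(t))

# example with an Exp(1) null, F0(t) = 1 - exp(-t): the null median maps to z = 0
z = to_z(math.log(2), lambda t: 1.0 - math.exp(-t))
```

The inverse map Φ followed by F_0^{-1} converts in the other direction.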
First, we show that Algorithm 2 is an asymptotically level α test under mild assumptions. Further, when T_n^(1) and S_n converge uniformly to their limiting distributions over the null, under a mild finite-density condition, we show that the test also controls size below α uniformly over the null. It can be argued that uniform asymptotic size control, as opposed to pointwise asymptotic size control, is more relevant to practice because, in contrast to the latter, it ensures that the sample size required to control the actual Type I error below, say, 0.051 does not depend on the underlying P ∈ P_0; see Lehmann and Romano (2005, § 11.1). As uniform size control involves the behaviour of random variables under different P (rather than a single P, as in pointwise asymptotics), in what follows we use a subscript in P_P(·) to denote that (X, Ω) ∼ P^n × P_Ω.
Recall that Algorithm 2 rejects H_0 whenever S_n exceeds the upper α quantile of the rank-transformed subsampling distribution G̃_n. Control of the size is therefore intimately linked to the behaviour of G̃_n, for which we require the following. Note that in what follows, all densities are with respect to the Lebesgue measure.

Condition 2 (Lipschitz aggregation). The aggregation function S is Lipschitz with constant 1. Examples include S_n = (T_n^(1) + ⋯ + T_n^(L))/L and S_n = max(T_n^(1), …, T_n^(L)). Note that given any Lipschitz continuous S, by scaling, the Lipschitz constant 1 above is not a restriction; however, such a scaling affects the value of g_{P,max} in Assumption 1 below, which we require to be finite.
Assumption 1 (Stability of S_n). For every P ∈ P_0, under (X, Ω) ∼ P^n × P_Ω, it holds that S_n converges in distribution to G_P, where G_P is a continuous distribution that can depend on P and can be unknown. Further, G_P has a density function g_P such that g_{P,max} := sup_x g_P(x) < ∞.

If (T_n^(1), …, T_n^(L)) has a limiting joint distribution under every P ∈ P_0, then the first part of Assumption 1 holds by the definition of S_n. Given that (T_n^(1), …, T_n^(L)) is exchangeable with a limiting marginal law (Condition 1), we typically expect the joint distribution to be stable as well (see Figure 2 for an example). In fact, for a given P, the sequence (S_n)_{n=1}^∞ is uniformly tight, and so by Prohorov's theorem (van der Vaart, 2000, Theorem 2.4), there always exists a subsequence that converges in distribution; see Proposition C.3, online supplementary material. Thus the only way the stability assumption can fail is when the copula (F_{n,P}(T_n^(1)), …, F_{n,P}(T_n^(L))) in some sense 'oscillates' as n → ∞, where F_{n,P} is the distribution function of T_n^(1). Below we present results on both pointwise and uniform asymptotic size control for the test in Algorithm 2.
Theorem 1 (Validity of Algorithm 2). Let (X, Ω) ∼ P^n × P_Ω for P ∈ P_0. Suppose T_n^(1), …, T_n^(L) are exchangeable and Condition 1 holds with F_0 = unif(0, 1) or F_0 = N(0, 1). Suppose S is chosen such that Condition 2 holds. Then for all α ∈ (0, 1), the following hold: (i) Under Assumption 1, the test in Algorithm 2 is pointwise asymptotically level α. (ii) Suppose T_n^(1) and S_n converge to their respective limit distributions uniformly over the null, i.e. for every x ∈ R, sup_{P∈P_0} |F_{n,P}(x) − F_0(x)| → 0 and sup_{P∈P_0} |G_{n,P}(x) − G_P(x)| → 0, where F_{n,P} and G_{n,P}, respectively, denote the distribution functions of T_n^(1) and S_n. Also, suppose sup_{P∈P_0} g_{P,max} < ∞. Then the test in Algorithm 2 is uniformly asymptotically level α, i.e. lim sup_{n→∞} sup_{P∈P_0} P_P(reject H_0) ≤ α.

Now we establish similar results for the adaptive test in Algorithm 3, under the following joint stability assumption on the chosen aggregation functions (S_1, …, S_W).
Assumption 2 (Joint stability of multiple aggregation functions). For every P ∈ P_0, under (X, Ω) ∼ P^n × P_Ω, it holds that (S_n^1, …, S_n^W) converges jointly in distribution, where every S_n^w has a continuous limiting distribution function G_P^w that can depend on P. Further, suppose G_P^w admits a density g_P^w such that g_{P,max}^w := sup_x g_P^w(x) < ∞ for w = 1, …, W.

Power
In this section, we study the power of rank-transformed subsampling and establish its advantage over ordinary subsampling, i.e. subsampling without the rank transform. We analyse power under a sequence of local alternatives that converge 'in copula' to a null case; such a null is typically also the limit to which the sequence of local alternatives converges weakly. For any sequence P_n ∈ P, let F_{m,P_n} be the distribution function of T_m^(1)(X_1, …, X_m) under (X, Ω) ∼ P_n^m × P_Ω and let U_m denote the copula

U_m = (F_{m,P_n}(T_m^(1)), …, F_{m,P_n}(T_m^(L))). (6)

Definition 1 (Convergence in copula). Let U_m be the copula of (T_m^(1), …, T_m^(L)) under P_n^m × P_Ω given by equation (6). We say P_n converges in copula to P_0 if there exists some P_0 ∈ P_0 such that under P_0^n × P_Ω, (T_n^(1), …, T_n^(L)) converges to a limit distribution whose copula is the limit of U_m.

Convergence in copula is a rather weak notion of convergence for two reasons. Firstly, it involves the smaller sample size m (recall m = o(n)) rather than n. Consider a sequence of alternatives P_n that are only just distinguishable from the null at sample size n. At sample size m, the null and P_n should be indistinguishable; that is, the behaviour of the entire vector of test statistics (T_m^(1), …, T_m^(L)) under P_n^m × P_Ω and P_0^m × P_Ω should be asymptotically identical, and in particular convergence in copula would hold. Secondly, Definition 1 is completely insensitive to the marginal distributions of the test statistics, and so in fact we can even expect a stronger version of the convergence above to hold with m replaced by n. In particular, when P_n is a sequence of local alternatives converging to P_0 ∈ P_0 in such a way that under P_0^n × P_Ω, (T_n^(1), …, T_n^(L)) and the log-likelihood ratio log(dP_n^n / dP_0^n) jointly converge to a normal limit, then by Le Cam's third lemma (van der Vaart, 2000, Example 6.7), (T_n^(1), …, T_n^(L)) under P_n^n × P_Ω must also converge to a normal limit with the same covariance, and hence the same copula, as its null limit.
Our next result shows that under a sequence of local alternatives that converge in copula to a null P_0 ∈ P_0, the test in Algorithm 2 asymptotically has the same critical value, and hence achieves the same power, as an oracle test that has access to the asymptotic null distribution of S_n under P_0. In stating Theorems 3 and 4 below, we use G_{P_0} to denote the limiting null distribution function of S_n := S(T_n^(1), …, T_n^(L)) under P_0^n × P_Ω.
In fact, when P_n is a sequence of local alternatives converging to P_0 ∈ P_0 such that P_n^n is contiguous to P_0^n (i.e. absolutely continuous asymptotically; see van der Vaart, 2000, Ch. 6), we also expect Ĝ_n^{-1}(1 − α) →_p G_{P_0}^{-1}(1 − α) for Ĝ_n obtained from ordinary subsampling (Politis et al., 1999, Theorem 2.6.1). Hence, to capture the power improvement from the rank transform, we need a finer analysis. To this end, we characterize the first-order asymptotic behaviour of rank-transformed subsampling in the next theorem, whose full statement can be found in Appendix D.2, online supplementary material. In the below, d_TV(X, Y) denotes the total variation distance between the distributions of X and Y. The regularity conditions require, among other things, that S be nondecreasing in each coordinate, a condition that holds for S = avg, S = max and other reasonable choices. Theorem 4 (First-order behaviour of rank-transformed subsampling). Suppose Condition 1 holds and (T_n^(1), …, T_n^(L)) is exchangeable. Consider a sequence P_n ∈ P that converges in copula to some P_0 ∈ P_0 in the sense of Definition 1, with limiting copula C. Suppose the distribution of C is absolutely continuous with respect to the Lebesgue measure. Let G̃_n denote the rank-transformed subsampling distribution function obtained with a variant of Algorithm 2 that uses two independent copies of the data under (X, X′, Ω) ∼ P_n^n × P_n^n × P_Ω (see Appendix D.2, online supplementary material).
Suppose Assumption 1 holds and fix α ∈ (0, 1) such that the density G′_{P_0} is strictly positive and continuous in a neighbourhood of G_{P_0}^{-1}(1 − α). Then, under regularity conditions posed on the copula and S (see Appendix D.2, online supplementary material), for any M > 0, the first conclusion of the theorem holds; further, letting G_{n,P_n} denote the distribution function of S(T_n^(1), …, T_n^(L)) under P_n^n × P_Ω, for any M > 0, the second conclusion holds as well (both are stated precisely in Appendix D.2, online supplementary material). To interpret this result, let us take M to be a large constant and choose m = ⌊n/log n⌋. Then the first statement above says that, up to the first order (with scaling factor √log n), rank-transformed subsampling delivers an approximation to the oracle critical value that is asymptotically unbiased. In contrast, in Theorem D.2, online supplementary material, we show that the ordinary subsampling approximation to the oracle critical value is biased upwards, and typically the bias grows with the effect size of the alternative. This formalizes our observation from Figure 3: under H_1, before applying the rank transform, subsampling is biased towards the alternative sampling distribution and we can see a clear discrepancy between the subsampling distribution (red histogram) and the desired null distribution (red curve). Along a sequence of contiguous local alternatives, although this discrepancy vanishes asymptotically, it does so rather slowly (1/√log n ≈ 1/4 when n = 10^7) and can result in a significant loss of power in practice. Recall that G_{n,P_n} denotes the distribution function of S_n under P_n^n × P_Ω. The power of the oracle test is 1 − G_{n,P_n}(G_{P_0}^{-1}(1 − α)). If we ignore the dependence between the estimated critical value and the test statistic, the power of our test can be written similarly as 1 − G_{n,P_n}(G̃_n^{-1}(1 − α)). Consequently, the second statement of Theorem 4 implies pow(rank-transformed subsampling) ≈ pow(oracle); this is equation (7). Meanwhile, with G_{m,P_n} denoting the distribution function of S_m under P_n^m × P_Ω, suppose a suitable standardized discrepancy between G_{m,P_n} and its null counterpart (see the online supplementary material) converges to τ > 0 that measures the effect size. Then, in contrast to the above,
Theorem D.2, online supplementary material, implies pow(ordinary subsampling) ≈ pow(oracle) − κ_{α,τ} for some κ_{α,τ} > 0; this is equation (8). Typically, we expect κ_{α,τ} to grow as τ increases from zero up to a certain value; see Appendix D.5, online supplementary material, for a concrete example, which concerns testing a hypothesis about a regular parameter under local alternatives. We illustrate such a case using numerical results for Example 2, which considers testing the mean of a random vector. The details of the simulation study are described in Section 4.1.1. For now, let us focus on Figure 4, which shows the power of several aggregated, multiple-split tests against the effect size of local alternatives. Indeed, we can see that the rank transform (Algorithm 2) has a clear power advantage over ordinary subsampling ('no rank'), and this advantage grows with the effect size, exactly as we expect from comparing equations (7) and (8) under β = 1/2. Meanwhile, in every setting, the power of the rank-transformed subsampling closely tracks that of the oracle test regardless of the effect size, confirming equation (7).
We prove Theorem 4 in Appendix D, online supplementary material, using the functional delta method; a major technical challenge is a certain Hadamard differentiability that we establish to handle the errors introduced by the rank transform. For technical reasons, Theorem 4 is proved for a variant of Algorithm 2 that has access to two independent copies of the data, but we expect a similar result to hold for the original algorithm as well; see Appendix D.5, online supplementary material, for a concrete example with supporting numerical results. Further, in Appendix D.4, online supplementary material, we show that when C follows a Gaussian copula, the regularity conditions in Theorem 4 are satisfied by the choices of F_0 and S considered in this paper.

Applications
We illustrate our method with three types of applications. First, we study data-splitting, hunt-and-test procedures: specifically, we revisit Example 2 for testing the zero mean of a high-dimensional random vector; we develop a new test for unimodality in high dimensions; and we introduce a simple, flexible approach for goodness-of-fit testing of parametric regression models such as parametric quantile regression. Next, we consider using the data from a distribution P to test a property of a different distribution Q, where Q is related to P through reweighting. We study this in the context of causal inference, where P is the observational distribution and Q is an intervened distribution. Finally, we study the inference of cross-fitted, DML estimators. We show that the cross-fold dependence in these estimators, though often argued to be asymptotically negligible in standard well-specified, low-dimensional settings, can, in finite sample or under

Figure 4. Location is μ = τn^{−1/2}v_1 with v_1 the normalized principal eigenvector of Σ; q is the proportion of the test sample. We compare the single-split test (single) with tests based on S_n being the arithmetic mean (S = avg): Algorithm 2 is our method; 'oracle' is an oracle test that compares S_n to its null distribution; 'no rank' uses the ordinary subsampling distribution Ĝ_n to determine the critical value; 'conservative' compares S_n/2 to a standard normal, where division by two ensures validity when the statistics are not jointly normal, as evidenced by Figure 2 (see also Appendix B, online supplementary material). Observe that Algorithm 2 has a power advantage over 'no rank' and this advantage grows with τ; meanwhile, Algorithm 2 closely tracks 'oracle' regardless of τ. These observations match our first-order theory presented in Section 3.2.

Guo and Shah

misspecification, lead to under-coverage of confidence intervals (see Jiang et al., 2022, for a high-dimensional setting not considered in this paper where this issue also arises). We present an alternative construction of confidence intervals using our method that captures such dependence and restores the desired coverage. These confidence intervals can even maintain coverage when the model for a nuisance parameter in doubly robust estimation is misspecified, which we illustrate in Appendix E, online supplementary material.
For each application, we present numerical results to illustrate and benchmark the new methods we develop.Additional numerical results, including the performance of conservative aggregation rules mentioned in Section 1.1, can be found in Appendix H, online supplementary material.

Hunt and test
In this section, we consider testing a null hypothesis that can be expressed as a conjunction of simpler hypotheses,

H_0 = ∩_δ H_0(δ), (9)

where we already have an off-the-shelf test for each H_0(δ). As explained in the introduction, such null hypotheses are amenable to a hunt-and-test approach that employs data splitting, where one part of the data is used to find an appropriate δ̂ and the remaining data are used to test H_0(δ̂). Here is another perspective, due to Moran (1973). Consider testing

H_0 : θ ∈ Θ_0 vs. H_1 : θ ∈ Θ_1, (10)

where Θ_1 does not contain θ_0. The alternative parameter space Θ_1 might be so large or heterogeneous that a reasonable test for H_0 only has power against certain alternatives in Θ_1. Again, we can split our data and perform hunt and test: use the first part to estimate θ̂_1 ∈ Θ_1 and then use the second part to test θ ∈ Θ_0 vs. θ = θ̂_1.

Revisiting Example 2
In Example 2, we considered a hunt-and-test approach for testing H_0 : μ = 0 vs. H_1 : μ ≠ 0 with iid random vectors. Clearly, the problem is an instance of equation (10); it can also be viewed as an instance of equation (9), namely H_0 = ∩_{δ∈R^p} {μ : μ^⊤δ = 0}. Figure 4 compares the performance of various tests based on the aggregated statistic S_n = (T_n^(1) + ⋯ + T_n^(L))/L with L = 200 and the single-split test based on T_n^(1) alone. In our simulation, we draw X_1, …, X_n ∼ iid N(μ, Σ), where Σ ∈ R^{3×3} has entries Σ_ij = 2^{−|i−j|} and μ = τn^{−1/2}v_1, with v_1 the normalized principal eigenvector of Σ. When τ = 0, we see that our method (Algorithm 2) controls the Type I error at the nominal level. Further, as τ grows, its power clearly dominates both the single-split test and the ordinary subsampling test, while closely tracking the power of the oracle test in all regimes. These observations align with our theory presented in Section 3.2. Also, note that the power of our test is insensitive to the split ratio q. Meanwhile, our aggregated test significantly reduces the chance of nonreplication. For example, when q = 0.5, for a random dataset, there is less than a 5% chance that two applications of Algorithm 2 will give contradictory results, while the probability can be as large as 30% for the single-split test; see Appendix H.1, online supplementary material.
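A single split of this hunt-and-test construction can be sketched as follows; this is an illustrative numpy version in which the hunted direction is simply the normalized sample mean of Part A, and the function and its arguments are our own naming rather than the paper's exact procedure.

```python
import numpy as np

def single_split_stat(X, q, rng):
    """One randomized hunt-and-test statistic for H0: mu = 0.
    Part A (fraction 1 - q of the rows) hunts for a direction -- here
    simply the normalized sample mean -- and Part B (fraction q)
    returns a one-sided Z-statistic for the data projected onto it."""
    n = X.shape[0]
    idx = rng.permutation(n)
    nB = int(q * n)
    B_idx, A_idx = idx[:nB], idx[nB:]
    v = X[A_idx].mean(axis=0)
    v = v / np.linalg.norm(v)          # hunted direction from Part A
    proj = X[B_idx] @ v                # projected test sample
    return np.sqrt(nB) * proj.mean() / proj.std(ddof=1)
```

Aggregating L such statistics over independent splits and calibrating with rank-transformed subsampling gives the multiple-split test compared in Figure 4.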

Testing multivariate unimodality
Testing for nontrivial clustering structure in a high-dimensional dataset is a long-standing problem. The question cannot be directly answered by clustering algorithms, because these (e.g. k-means or hierarchical clustering) typically return clusters even when the data come from a homogeneous population (Huang, Liu, Hayes, et al., 2015). The problem is closely connected to selecting the number of clusters, as a trivial clustering structure corresponds to the true number being one. Here, we work with Euclidean data and take the perspective that there is only one cluster if the population distribution is unimodal.
A univariate distribution is called unimodal if there is a point a such that its distribution function is convex on (−∞, a) and concave on (a, +∞) (Khintchine, 1938). While there are different notions of multivariate unimodality, we take linear unimodality as our definition: we say a random vector (X_1, …, X_p) is unimodal if ∑_i a_i X_i is unimodal for every nonzero coefficient vector a = (a_1, …, a_p); this is equation (11). Linear unimodality is implied by several other notions related to multivariate unimodality, such as log-concavity (Dharmadhikari & Joag-Dev, 1988, Lemma 2.1); see also Dharmadhikari and Joag-Dev (1988, Theorem 2.15). The formulation in equation (11) naturally leads to the following hunt-and-test procedure after randomly splitting the data into two Parts A and B: 1. identify a direction â using any suitable clustering algorithm on Part A of the data; 2. test univariate unimodality of ∑_i â_i X_i on Part B.
We note that the idea of reducing to a univariate test is not new: it is used, for example, by Ahmed and Walther (2012) in projecting data onto its principal curve, and in a likelihood ratio test for log-concavity using random projection and data splitting (Dunn et al., 2021).

Dip hunting test. To identify a good direction â, on Part A we run a 2-means algorithm (initialized with k-means++; Vassilvitskii & Arthur, 2006) and choose â as the normalized vector connecting the two cluster centres. To then test for unimodality, we use a test based on the dip statistic of Hartigan and Hartigan (1985), which we describe below. Figure 5 shows a schematic of our procedure, which may be described as 'dip hunting' by analogy with the bump hunting procedure of Good and Gaskins (1980).
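The hunting step can be sketched as follows; this bare-bones numpy 2-means, with centres initialised at two far-apart points as a crude stand-in for k-means++, returns the normalized vector connecting the fitted centres.

```python
import numpy as np

def two_means_direction(X, n_iter=50):
    """Bare-bones 2-means for direction hunting: centres start at the
    point farthest from the overall mean and the point farthest from
    that one (a crude stand-in for k-means++ initialisation), followed
    by Lloyd iterations; returns the normalized vector connecting the
    two fitted cluster centres."""
    far1 = X[np.linalg.norm(X - X.mean(axis=0), axis=1).argmax()]
    far2 = X[np.linalg.norm(X - far1, axis=1).argmax()]
    c = np.stack([far1, far2])
    for _ in range(n_iter):
        # assign each point to its nearest centre, then recompute centres
        lab = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2).argmin(axis=1)
        c = np.stack([X[lab == k].mean(axis=0) for k in (0, 1)])
    a = c[1] - c[0]
    return a / np.linalg.norm(a)
```

The direction â found on Part A is then used to project Part B before applying the univariate dip test.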
Let F_n be the empirical distribution function of Y and let U be the set of unimodal univariate distribution functions. The dip statistic is defined as ρ_n = inf_{F∈U} sup_x |F_n(x) − F(x)|, and may be computed efficiently using the R package diptest (Maechler, 2021). Hartigan and Hartigan (1985) recommend comparing ρ_n to the dip statistic of a sample drawn from unif(0, 1), which serves as the least favourable null distribution. However, this approach typically results in very conservative p-values. To avoid this problem, Cheng and Hall (1998) show that when the density f of Y is unimodal, under mild regularity conditions, we have 2n^{3/5}ρ_n →_d cZ, where Z is a particular function of a standard Wiener process. The constant c depends on the density f through c = {f(x_0)^3/|f″(x_0)|}^{1/5}, where x_0 is the unique mode of f. The only unknown quantity, c, can be estimated by a plug-in ĉ computed from kernel density estimates f̂ and f̂″ evaluated at x̂_0 = arg max f̂. We use the R package kedd (Guidoum, 2015) to estimate f̂ and f̂″, with the respective bandwidths selected by maximum likelihood cross-validation (function h.mlcv; Duin, 1976; Habbema et al., 1974). From our experience, for ĉ to behave properly, it is essential to centre and rescale Y so that its values lie between 0 and 1, which does not affect c. Because the asymptotic distribution of the dip statistic depends only on c, Cheng and Hall (1998) suggest the following approach to obtain the corresponding p-value, which we adopt here. Given our observed dip statistic, we compare it to the distribution of ρ_n based on samples drawn from a known distribution whose c equals ĉ; three families of such distributions covering the range of c are provided by Cheng and Hall (1998).
Simulations. We consider the following settings.
1. Mixture of unit balls. Let B(p) be the unit p-dimensional ball centred at the origin. Consider a mixture, with bounded support, of unif B(p), the uniform density on the unit ball B(p), and its translate unif(x_0 + B(p)), where ‖x_0‖ is the Euclidean distance between the centres of the two balls. We set ‖x_0‖ = 2τ/√(2 + p) so that the density along the direction connecting the two ball centres becomes bimodal roughly when τ ≥ 1. 2. Mixture of multivariate t's. We consider a heavy-tailed setting where t_4 is the density of the p-dimensional multivariate t-distribution with 4 degrees of freedom, mean zero and scale matrix Σ ∈ R^{p×p} with entries Σ_ij = 2^{−|i−j|}. We set x_0 = τ√p v_2, where v_2 is the second normalized eigenvector of Σ.
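The mixture in setting 1 can be simulated with the standard construction for uniform sampling on a ball (Gaussian direction, radius U^{1/p}); the sketch below assumes equal mixture weights and places x_0 along the first coordinate, choices that are ours for illustration.

```python
import numpy as np

def unif_ball(n, p, centre, rng):
    """Uniform draws from the unit p-ball shifted to centre: uniform
    direction g/||g|| with Gaussian g, radius U^(1/p), U ~ unif(0, 1)."""
    g = rng.normal(size=(n, p))
    r = rng.uniform(size=(n, 1)) ** (1.0 / p)
    return centre + r * g / np.linalg.norm(g, axis=1, keepdims=True)

def ball_mixture(n, p, tau, rng):
    """Mixture of unif B(p) and unif(x0 + B(p)) with ||x0|| =
    2*tau/sqrt(2 + p); equal weights and x0 along the first
    coordinate are illustrative choices, not from the paper."""
    x0 = np.zeros(p)
    x0[0] = 2.0 * tau / np.sqrt(2.0 + p)
    X = unif_ball(n, p, np.zeros(p), rng)
    shift = rng.integers(0, 2, size=n).astype(bool)   # fair coin per point
    X[shift] = unif_ball(int(shift.sum()), p, x0, rng)
    return X
```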
Figure 6 shows the results based on sample size n = 1,000. We compare the single-split dip hunting test (single) with multiple-split versions that aggregate L = 50 dip hunting p-values, including S = avg and S = min (Algorithm 2), as well as the adaptive test (Algorithm 3) with (S_1 = avg, S_2 = min). We compare dip hunting to SigClust (Huang, Liu, Yuan et al., 2015; Huang et al., 2022; Liu et al., 2008), a widely used clustering significance testing method based on a Gaussian mixture model. Other methods include a nonparametric bootstrap approach suitable for ellipsoidal clusters (Maitra et al., 2012) and an approach based on simulating from an estimated Gaussian copula model (Helgeson et al., 2021); see also the review paper of Adolfsson et al. (2019) and the references therein. We see that at the null (τ = 0), SigClust incurs a large Type I error in the multivariate t settings, while all the dip hunting tests maintain the correct level in both settings; this is to be expected, as SigClust assumes a Gaussian distribution under the null. For the unit ball setting, SigClust loses power as p increases. We can see that S = avg is more powerful than S = min for the unit ball setting, and conversely for the multivariate t setting. Nevertheless, our adaptive test achieves the better performance of the two and also shows a significant power improvement over the single-split test.
Gene expression of cancer subtypes. We apply our test to gene expression data on renal cell carcinoma (RCC), which mainly consists of three subtypes: clear cell (ccRCC), papillary (PRCC), and chromophobe (ChRCC). We use the ICGC/TCGA Pan-Cancer dataset (ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, 2020), which contains mRNA expression levels (FPKM-UQ normalized) from 111 kidney samples, including 37 ccRCC, 31 PRCC, and 43 ChRCC cases. We use the expression levels of the 1,000 genes most relevant to RCC, selected as those with the highest |μ_1 − μ_0|/σ_0, where μ_1 and μ_0 are the case and control means and σ_0 is the control standard deviation. We apply both the single-split and the aggregated (S is the arithmetic mean, L = 6,000) dip hunting tests to every subtype, every mixture of two subtypes and the whole sample. Figure 7 shows the distribution of p-values. The aggregated test produces stable p-values, which indicate clear separation between subtypes and relative homogeneity within each subtype. In contrast, it is more difficult to disentangle subtypes ccRCC and PRCC from the single-split dip hunting test.

Goodness-of-fit testing for parametric regression models
As another application, we use hunt and test to construct flexible goodness-of-fit tests for parametric regression models.The approach we take here is closely related to the generalized residual prediction (RP) test (Janková et al., 2020; see also Shah & Bühlmann, 2018) for assessing the goodness of fit of (potentially high-dimensional) generalized linear models.Generalized RP tests also employ sample splitting and could equally well benefit from our rank transform p-value aggregation scheme.However, our construction here is applicable more broadly to testing model specification of the form where h(X) is the conditional mean or a conditional quantile of outcome Y ∈ R given X ∈ R p .

Figure 7. Testing homogeneity of gene expression levels of three subtypes of renal cell carcinoma: clear cell (cc), papillary (P), and chromophobe (Ch). The plot shows p-values resulting from repeatedly applying single-split and aggregated (S = avg, L = 6,000) dip hunting tests to each subtype, every mixture of two subtypes and the whole sample.

The starting point of our approach is the simple observation that h(X) = β^⊤X is equivalent to having γ = 0 in h(X) = β^⊤X + γg(X), where g : R^p → R is any nonlinear (measurable) function. For a given g, we may test whether γ = 0 by regressing Y on (X, g(X)) and utilizing existing inference tools for the model at hand to assess the significance of g(X). To obtain good power under an alternative, we would like to pick an appropriate g to expose the lack of fit present in the data. This suggests a hunt-and-test procedure where we randomly divide our data into Parts A and B, use Part A to hunt for a suitable g, and use Part B to assess the significance of our artificially constructed additional covariate g(X).
To find an appropriate g, we take inspiration from gradient boosting (Friedman, 2001) and proceed as follows. Let (X_i, Y_i)_{i=1}^{n′} be the iid covariate-response pairs in Part A. Suppose we have an M-estimator β̂ for estimating β in equation (12) that minimizes ∑_i ℓ(Y_i − β^⊤X_i) for some loss function ℓ : R → R. Defining residuals r_i := Y_i − β̂^⊤X_i, upon introducing a potential new covariate g(X), the loss can be locally approximated as ∑_i ℓ(Y_i − β̂^⊤X_i − γg(X_i)) ≈ ∑_i ℓ(r_i) − γ∑_i ℓ′(r_i)g(X_i), where ℓ′ is the derivative of the loss function. Hence, when γ > 0 and fixing (∑_i g(X_i)^2)^{1/2}, to locally decrease the loss by the greatest amount, we should attempt to choose g such that, approximately, g(X_i) ∝ ℓ′(r_i). To achieve this, we regress (ℓ′(r_i))_{i=1}^{n′} onto the covariates using any flexible regression or machine learning method, and take the fitted regression function to be g. Since we expect that under an alternative the resulting g(X) should have a positive coefficient, we take as our single-split statistic T_n a Z-statistic for the significance of g(X) computed on Part B. We expect T_n to be large and positive under an alternative.
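The hunt step can be sketched for the simplest case of squared loss, where ℓ′(r) = r; in the numpy sketch below, the quadratic feature expansion is an illustrative stand-in for a flexible learner, and all names are ours rather than the paper's.

```python
import numpy as np

def rp_style_stat(X, Y, rng):
    """Hunt-and-test sketch for h(X) = beta^T X under squared loss,
    so that l'(r) = r.  Part A: fit OLS, then regress the residuals
    on a quadratic feature expansion (a stand-in for a flexible
    learner) to obtain g.  Part B: refit with g(X) added as an extra
    covariate and return the t-statistic of its coefficient."""
    n = len(Y)
    idx = rng.permutation(n)
    A, Bidx = idx[: n // 2], idx[n // 2:]

    def ols(Z, y):
        return np.linalg.lstsq(Z, y, rcond=None)[0]

    def add1(Z):                       # prepend an intercept column
        return np.column_stack([np.ones(len(Z)), Z])

    # hunt on Part A: residuals, then g fitted to l'(r) = r
    resid = Y[A] - add1(X[A]) @ ols(add1(X[A]), Y[A])
    gamma = ols(add1(X[A] ** 2), resid)
    g = lambda Z: add1(Z ** 2) @ gamma
    # test on Part B: t-statistic for the coefficient of g(X)
    ZB = np.column_stack([add1(X[Bidx]), g(X[Bidx])])
    coef = ols(ZB, Y[Bidx])
    r = Y[Bidx] - ZB @ coef
    sigma2 = r @ r / (len(Bidx) - ZB.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(ZB.T @ ZB)[-1, -1])
    return coef[-1] / se
```

Under a correct linear model the statistic behaves like a standard normal; under a nonlinear alternative it blows up, which is what the single-split test exploits.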
We demonstrate the effectiveness of this approach for quantile regression. Consider a quantile regression model specified as q_τ(X) = β_0 + β^⊤X for a fixed τ ∈ (0, 1), where q_τ(X) is the τth conditional quantile of Y given the p-dimensional covariates X. The construction of goodness-of-fit tests, more commonly called lack-of-fit tests in the related literature, has largely relied on asymptotic properties of certain statistics or processes concerning the residuals; see, e.g. Escanciano and Goh (2014), Escanciano and Velasco (2010), He and Zhu (2003), and Horowitz and Spokoiny (2002). These tests tend to have difficulty scaling up to more than a handful of covariates (Conde-Amboage et al., 2015). Recently, Dong et al. (2019) recast the goodness-of-fit problem as a two-sample test problem and developed a different, highly competitive method that can handle moderate or large p. It is worth mentioning that, unlike our approach, which directly repurposes existing parameter inference for quantile regression (Koenker, 2005, Chap. 3), these aforementioned methods rely on asymptotic results that can require substantial development.
For quantile regression, we have ℓ_τ(r) = r(τ − I{r < 0}) and ℓ′_τ(r) = τ − I{r < 0} for r ≠ 0. Therefore, we use Part A to train a classifier (e.g. a random forest, Breiman, 2001) that predicts the sign of the residual from X; we take g : R^p → {−1, 1} to be the resulting prediction function. To improve numerical stability, we also partial out X from g(X) on Part B before adding g(X) to the covariates. We define the single-split test statistic as T_n := √(n/2) β̂′/σ̂ (the prime indicates that g was fitted from Part A), where σ̂ is estimated from the bootstrap. The statistic can be readily computed using the R package quantreg (Koenker, 2022).
Figure 8 shows results from a simulation study with sample size n = 1,000. The covariate vector X is drawn from a p-dimensional Gaussian with covariance Σ_ij = 2^{−|i−j|}. We fix τ = 0.5 and consider two specifications in which a nonlinear function η enters the conditional quantile. The error ε is drawn from {Exp(1), t_3} for specification (i) and from Exp(1) for specification (ii). Because Exp(1) has a nonzero median, observe that the quantile regression model q_τ(X) = β_0 + β^⊤X is well-specified if and only if v = 0. We choose g to be a random forest trained with the R package ranger (Wright & Ziegler, 2017). We run Algorithm 2 with S being the arithmetic mean. Our approach is already competitive with the state-of-the-art method of Dong et al. (2019); see Appendix H.2, online supplementary material, for results in another setting.

Testing generalized conditional independence
In this section, we consider the use of randomized procedures in causal inference. To draw causal conclusions from data, we are often faced with the challenge that the quantity of interest is defined with respect to an 'intervened' distribution, which is related to, but different from, the data-generating distribution. This is also known as 'distributional shift' in machine learning (Shimodaira, 2000).
For example, consider a two-stage sequentially randomized trial represented by graph G in Figure 9: A_1 is the first treatment, L is the first outcome, A_2 is the second treatment and Y is the final outcome. The first treatment A_1 is completely randomized; the second treatment A_2 is randomized according to A_1 and L. Variable U represents an unobserved confounder, for example, the underlying health status, which affects both outcomes and is unobserved. Graph G, typically with extra baseline covariates that we omit here, is often employed to represent observational or follow-up studies where the treatment is time-varying and is affected by previous outcomes. For example, in HIV studies, we have A_t = 1 (t = 1, 2) if the individual receives antiretroviral therapy at time t and A_t = 0 otherwise. Outcomes L and Y denote CD4 cell counts that measure the effectiveness of the therapy. See Hernán and Robins (2020, Chap. 19) for more background.
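A toy simulation of graph G can make the structure concrete. The structural equations and coefficients below are our own illustrative assumptions, not the paper's data-generating process; `simulate_trial` is a hypothetical helper.

```python
import numpy as np

def simulate_trial(n, direct_effect=0.0, rng=None):
    """One draw from graph G under an assumed parametrization."""
    rng = np.random.default_rng(rng)
    U = rng.normal(size=n)                                # latent health status
    A1 = rng.binomial(1, 0.5, size=n)                     # completely randomized
    L = U + A1 + rng.normal(size=n)                       # first outcome
    p2 = 1.0 / (1.0 + np.exp(-(0.5 * A1 + 0.5 * L)))     # propensity p(A2=1 | A1, L)
    A2 = rng.binomial(1, p2)                              # second treatment
    Y = direct_effect * A1 + A2 + U + rng.normal(size=n)  # final outcome
    return A1, L, A2, Y, p2
```

Setting `direct_effect=0.0` corresponds to the dashed edge being absent, which is the null hypothesis considered below.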
Suppose we are interested in the first treatment's direct effect τ on the final outcome (i.e. not through the second treatment), represented by the dashed edge in G. Because A_1 also affects Y through A_2, we cannot learn τ by regressing Y on A_1 alone. Moreover, because of the latent U that affects both L and Y (such variables are called 'phantoms' by Bates et al., 2022), neither can we learn τ by additionally adjusting for L or A_2 (or both) in the regression.
To learn τ, it is useful to imagine another trial, drawn as G′ in Figure 9, where both treatments A_1 and A_2 are completely randomized so that A_1 only affects Y directly. There, we can easily learn τ by regressing Y on A_1 under the G′ distribution. Although we did not carry out the G′ trial, its data distribution can be approximated by reweighting our data obtained from the G trial according to the inverse propensity of A_2 given A_1 and L. Further, we can even artificially simulate data from the G′ trial by resampling (e.g. importance resampling or rejection sampling) our data from the G trial. In other words, reweighting or resampling provides access to our distribution of interest. This idea underlies a general approach known as the g-methods (Naimi et al., 2017; Robins, 1986).
Figure 8. Testing goodness-of-fit of a quantile regression model q_{0.5}(X) = β_0 + β^⊤X: power (95% CI) at level α = 0.05 (dashed horizontal) under n = 1,000. The model is well-specified if and only if v = 0. The nonlinear function is given in equation (13).

In equation (15), q is an arbitrary positive distribution over A_2. Constraints such as this, which prescribe independence or conditional independence in a reweighted distribution, are called generalized conditional independence or Verma constraints in the literature; see Richardson et al. (2023) and the references therein.
Because the independence holds under Q instead of P, the usual permutation test is not applicable; nor is it applicable through simple reweighting as employed by Berrett et al. (2020) for testing ordinary conditional independence. Instead, the standard approach in causal inference is inverse probability weighting (IPW). Here, we consider an alternative approach: we resample our data to represent Q according to equation (15) and then use the resampled data to test A_1 ⊥⊥ Y, e.g. by a permutation test or any off-the-shelf test for independence.
More generally, as demonstrated by Thams et al. (2023), such a test-after-subsampling procedure is applicable to testing any property under 'distributional shift'. Suppose we observe an iid sample X_1, …, X_n ∼ P but we are interested in testing a property of a different target distribution Q. Distribution Q is related to P through a density ratio r = dQ/dP, which is either known or can be estimated from P. Suppose we already have a suitable test for the property based on a test statistic T_n := T_n(X̃_1, …, X̃_n) for an iid sample X̃_1, …, X̃_n ∼ Q. Assume r is bounded from above by a constant C. Then, we can test the property in two steps: (1) obtain a sample from Q through rejection sampling; (2) test the property with the test statistic computed from the accepted sample.
The next result shows that the test statistic computed in this way inherits the desired asymptotic distribution while permitting the use of an estimated density ratio r̂_n.
Proposition 1 (Test after rejection sampling). Let T_n(X_1, …, X_n) be a test statistic. Define T_0 := 0. With X := (X_1, …, X_n), suppose T_n(X_1, …, X_n) →_d T under X ∼ Q^n. Let P be a distribution and C be a positive constant such that Q is absolutely continuous with respect to P and r := dQ/dP satisfies r(x) ≤ C for P-almost every x. Then the test statistic computed from the sample accepted by rejection sampling also converges in distribution to T.

In our numerical example, we choose heteroscedastic errors V and ξ (see Appendix E, online supplementary material, for details) and fit the nuisance parameters η := (l, m) with random forests. Table 1 compares the coverage of our confidence interval (21) with the coverage of the standard DML normal confidence interval based on equation (20). For the latter, we use a plug-in estimate of σ that replaces the expectations in equation (19) by their empirical counterparts. While DML confidence intervals tend to undercover for smaller samples, where ρ(L − 1) is larger, our confidence intervals are well calibrated. When one of the two nuisance functions above is misspecified, under assumptions, θ̂_n^dml can still be consistent and asymptotically normal due to the double robustness of the estimating equation. However, the estimators θ̂_n^{(1)}, …, θ̂_n^{(L)} are correlated, even asymptotically. In such a case, the standard DML confidence interval no longer has the desired asymptotic coverage, but our confidence interval (21) should still work. See Appendix E, online supplementary material, for such an example; see also Benkeser et al. (2017) for related methods.

Discussion
Rank-transformed subsampling provides a framework that properly aggregates results from multiple applications of a randomized statistical test. Backed by this framework, we are free to design explicitly randomized tests that employ data splitting, resampling or any other random processing to solve difficult problems. In particular, data splitting and resampling can often reduce a complex hypothesis to a simpler one, for which an off-the-shelf test can be repurposed in a 'plug-and-play' fashion. Even though the resulting 'single-split' randomized procedure may have high variability or low power, as we have demonstrated, these aspects can be significantly improved by our aggregation algorithms.
There are several aspects worth further investigation. Firstly, it is of interest to go beyond Algorithm 3 and study how to learn an optimal aggregation function. Secondly, with large L and B, constructing Ĥ can be computationally intensive, although it is embarrassingly parallelizable across both rows and columns; it is desirable to develop faster approximations to Ĥ. Thirdly, we find that m = ⌊n/log n⌋ works surprisingly well in our numerical studies, in spite of guaranteeing only around seven independent subsamples for n = 1,000. Essentially, rank-transformed subsampling is able to approximate the null sampling distribution far better than the theory might suggest; it would be interesting to investigate why this is the case, and whether there is a better choice of the subsample size (e.g. using the rule of Bickel & Sakov, 2008).

Figure 3. Illustration of the rank transform. Here L = 2, and the rows of Ĥ and H are plotted as points in two dimensions in the left and right panels, respectively. Arrows indicate, for certain points, their image after the rank transform. We consider the aggregation function S as the arithmetic mean, which may be visualized as projection onto the dashed line. Top histograms: distributions Ĝ_n and G_n (curve: null density of S_n); side histograms: marginal distributions F_Ĥ and F_H (F_0 = unif(0, 1)).

Figure 4. Testing μ = 0 in Example 2: power (95% CI) at level α = 0.05 (dashed horizontal). The location is μ = τn^{−1/2} v_1, with v_1 the normalized principal eigenvector of Σ; q is the proportion of the test sample. We compare the single-split test (single) with tests based on S_n being the arithmetic mean (S=avg): Algorithm 2 is our method; 'oracle' is an oracle test that compares S_n to its null distribution; 'no rank' uses the ordinary subsampling distribution Ĝ_n to determine the critical value; 'conservative' compares S_n/2 to a standard normal, where division by two ensures validity when the statistics are not jointly normal, as evidenced by Figure 2 (see also Appendix B, online supplementary material). Observe that Algorithm 2 has a power advantage over 'no rank' and this advantage grows with τ; meanwhile, Algorithm 2 closely tracks 'oracle' regardless of τ. These observations match our first-order theory presented in Section 3.2.

Figure 6. Detecting a mixture of two p-dimensional unimodal components whose centres are separated by a distance proportional to τ: power (95% CI) at level α = 0.05 (dashed horizontal). The multivariate density is linearly unimodal when τ = 0. We compare dip hunting with the clustering significance testing method SigClust. Test 'single' is the single-split dip hunting test; 'S=avg', 'S=min' and 'S=adaptive' are multiple-split tests that aggregate L = 50 dip hunting p-values. The 'S=adaptive' test is Algorithm 3 with (S_1 = avg, S_2 = min), which is able to adapt to the better performer of the two. See also Figure H.4, online supplementary material, for the performance of conservative p-value aggregation rules.

Figure 9. Graph G depicts a sequentially randomized trial: A_1 is the first treatment, L is the first outcome, A_2 is the second treatment, and Y is the final outcome. Latent variable U represents the underlying health status that affects both outcomes. The dashed edge represents the direct effect of A_1 on Y. Graph G′ represents the population where both A_1 and A_2 are completely randomized. In this case, if A_1 has no direct effect on Y (i.e. the dashed edge is absent), then A_1 ⊥⊥ Y.
J R Stat Soc Series B: Statistical Methodology. Downloaded from https://academic.oup.com/jrsssb/advance-article/doi/10.1093/jrsssb/qkae091/7759731 by guest on 20 September 2024. See also Figures H.2 and H.5, online supplementary material.

Specifically, let us consider testing the sharp null hypothesis

H_0: A_1 has no individual direct effect on Y, (14)

which is represented by the dashed edge from A_1 to Y being absent from G. As explained earlier, this amounts to A_1 ⊥⊥ Y in the population Q represented by G′, given by

dQ/dP = q(A_2)/p(A_2 | A_1, L), (15)