On the Use of Random Forest for Two-Sample Testing

Following the line of classification-based two-sample testing, tests based on the Random Forest classifier are proposed. The developed tests are easy to use, require almost no tuning, and are applicable for any distribution on $\mathbb{R}^d$. Furthermore, the built-in variable importance measure of the Random Forest gives potential insights into which variables account for the difference in distribution. An asymptotic power analysis for the proposed tests is developed. Finally, two real-world applications illustrate the usefulness of the introduced methodology. To simplify the use of the method, the R-package "hypoRF" is provided.


Introduction
Two-sample testing via classification methods is an old idea tracing back to the work of [1]. Generally speaking, one adapts the output of a classifier to construct a two-sample test. Let $X_1, \dots, X_{n_0}$ and $Y_1, \dots, Y_{n_1}$ be a collection of $\mathbb{R}^d$-valued random vectors, such that $X_i \overset{iid}{\sim} P$ and $Y_i \overset{iid}{\sim} Q$, where $P$ and $Q$ are Borel probability measures on $\mathbb{R}^d$. The goal is to test $H_0: P = Q$ against $H_A: P \neq Q$. (1) Given these iid samples of vectors, we define labels $\ell_i = 1$ for each $X_i$ and $\ell_i = 0$ for each $Y_i$ to obtain the data $(Z_j, \ell_j)$, $j = 1, \dots, N$, with $N = n_0 + n_1$ and $Z_j = X_i$ or $Z_j = Y_i$. On this data, we train a classifier $g: \mathbb{R}^d \to \{0, 1\}$. If the trained classifier $\hat{g}$ is able to "accurately" predict $\ell$ on some test sample, this is taken as evidence against $H_0$. In this work, we assume the data is generated from a mixture distribution such that $n_1 \sim \text{Bin}(N, \pi)$, where Bin denotes the Binomial distribution. While our exposition is valid for general classifiers, we specifically target the use of the Random Forest (RF) classifier, a powerful and flexible method developed by [2] and known for its remarkably stable performance in applications (see e.g. the extensive work of [3]). This approach to testing has been used in scientific applications, especially in the field of neuroscience; we refer to [4] for an excellent literature overview. More recently, a lot of additional work has been produced in this direction in the statistical literature, see e.g., [4], [5]. Most closely related to our work appears to be the extensive recent work of [4]. Our first out-of-sample test in Section 2.1, though derived independently, is very closely related to their test in Section 9.1. Moreover, [4, Proposition 9.1] provide a consistency result for general classifiers under mild assumptions. We add to this discussion by showing that under imbalance these assumptions nonetheless break down for the Bayes classifier, such that a test based on this classifier is not consistent.
[4] also provide a rule of thumb on when to use classification-based tests, as opposed to more fine-tuned statistical tests designed for a specific problem. We extend this discussion by adding a recommendation on when to use the RF-based test, as opposed to kernel-based tests, as for instance proposed in [11], [12], [13] and [14]. These tests are natural competitors to classification-based tests and our work indicates that: 1. If the differences between $P$ and $Q$ can be found in the marginal distributions, even sparsely so, the RF-based test tends to perform very well. We demonstrate in Section 4.2 that the RF-based test succeeds in an example with marginal differences that is difficult for kernel-based tests.
2. If the change is mostly found in the dependency structure, or copula, kernel tests like MMD may be preferable. As demonstrated in Appendix B, the RF-based test still has power in this case, but less so than the kernel-based tests.
In addition, the Random Forest classifier brings two features to the two-sample testing problem: The out-of-bag (OOB) statistics and the variable importance measures. The former is used to increase sample efficiency, compared to a test based on a holdout sample, while the latter provides insights into the source of distributional differences. Our work also shares similarities with [5], [8] and [9]. The work of [8] focuses on the use of the in-sample classification error as a test statistic in the balanced case. [5] focuses attention on the power of different classifier-based test statistics for specific alternatives. They also seem to be the first to propose the use of bootstrap-based classification tests. The work of [9] presents a different approach based on regression and focuses on local testing, i.e. determining where the distributional difference appears.
The next two subsections list our contributions and demonstrate the advantages of our method with a small toy example. Section 2 introduces the two tests used, the first based on out-of-sample observations and the second on the OOB statistics. It closes with a theoretical insight into the consistency of classifier-based tests. Section 3 extends this theoretical insight into an asymptotic power analysis for a version of the OOB error-based test, using U-statistics theory. Finally, Section 4 discusses the role of the variable importance measure of the Random Forest and demonstrates the power of our tests with simulated as well as two real-world data sets.

Contributions
Our work differentiates itself from the existing literature in several aspects: -The out-of-sample test based on the class-wise errors in Proposition 1, though similar to the one in [4, Proposition 1], requires fewer assumptions to conserve the level asymptotically (though [4] focus on a setting where both the number of observations $N \to \infty$ and the dimension $d \to \infty$; in our work, $d$ is assumed to be fixed).
-We show in Lemma 1 that no test based on the Bayes classifier is consistent for $\pi \neq 1/2$, but that a simple change in the classifier's "cutoff" restores consistency.

Figure 1: (Intro) Variable importance measures with significance threshold; overall p-value: 0.0099. We sampled 300 observations from a $d = 5$ dimensional multivariate normal with no correlation between the marginals. Likewise, 300 observations were sampled from a multivariate normal with the last two marginals having a correlation of 0.8. The Random Forest used 500 trees.
-We utilize the OOB error and variable importance measure in this context to both increase the power of the test and extract more meaning in practice. As shown in simulations, the increase in power with the OOB test is substantial.
-We analyze the asymptotic normality of an OOB error-based test statistic using U-statistics theory and use it to derive an expression for the approximate power of the test in Section 3.
-We provide empirical evidence in Section 4.2, and in Appendix B, that our test constitutes an important complementary method to powerful kernel-based tests, leading to improved performance in some traditionally difficult examples.
-Finally, we provide the R-package hypoRF available on CRAN, with an implementation of the method.

Motivational example
We consider a toy example to demonstrate the proposed Random Forest two-sample testing methodology. We choose $P$ and $Q$ to be five-dimensional multivariate Gaussian probability distributions. The covariance matrix of $P$ is the identity, and $Q$ differs from $P$ only in the last two components, between which a positive correlation of 0.8 is imposed. The OOB statistics-based two-sample test correctly rejects with a p-value of 0.0099 (details are given in Section 2.2). Thus our method correctly rejects in this example and moreover delivers a hint as to which components might be responsible for the perceived difference in distribution.
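The toy example can be reproduced in outline with any Random Forest implementation. The following Python/scikit-learn sketch is our own construction (not the hypoRF package): it generates the two samples and inspects the OOB accuracy and the variable importances informally.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
d, n = 5, 300

# P: independent standard normals; Q: correlation 0.8 between the last two.
cov_q = np.eye(d)
cov_q[3, 4] = cov_q[4, 3] = 0.8
x = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)   # sample from P
y = rng.multivariate_normal(np.zeros(d), cov_q, size=n)       # sample from Q

z = np.vstack([x, y])
labels = np.r_[np.zeros(n, dtype=int), np.ones(n, dtype=int)]

# OOB accuracy of a Random Forest as an informal measure of separability;
# the feature importances hint at where the difference lies.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(z, labels)
print(rf.oob_score_)           # accuracy clearly above 0.5 suggests P != Q
print(rf.feature_importances_)
```

The actual test additionally calibrates these statistics, as described in Section 2.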

Framework
Let $Z_1, \dots, Z_N$ be random vectors with values in $\mathcal{X} \subset \mathbb{R}^d$ and $\ell_1, \dots, \ell_N$ corresponding labels in $\{0, 1\}$. A sample $Z_i$ coming from the mixture component $P$ (respectively $Q$) is labeled $\ell_i = 0$ (respectively $\ell_i = 1$).
Let $\hat{g}(Z) := g(Z, \mathcal{D}_{N_{train}})$ be a classifier trained on a subset $\mathcal{D}_{N_{train}}$ of size $N_{train} < N$ of the observed data.
Given the setting above, we now present two tests based on the discriminative ability ofĝ. The first such test uses an independent test set and is very similar to the test proposed by [4]. The second test in Section 2.2 is entirely new and uses the OOB error to obtain its decision rule.

Out-of-sample test
Let $N_{test} = N - N_{train}$ be the number of test points. Moreover, $n_{0,j}$ is the number of observations coming from class 0, and $n_{1,j}$ the number of observations from class 1, for $j \in \{train, test\}$. We assume throughout the paper that $n_{0,j} \geq 1$, $n_{1,j} \geq 1$. If there is no difference in the distribution of the two groups, $\ell_i$ is independent of $Z_i$. If $\pi = 1/2$, a test can be constructed by considering the overall out-of-sample classification error $\hat{L}^{(\hat{g})} = \frac{1}{N_{test}} \sum_i I\{\hat{g}(Z_i) \neq \ell_i\}$, which under the null hypothesis of equal distributions satisfies $N_{test} \hat{L}^{(\hat{g})} \sim \text{Bin}(N_{test}, 1/2)$. Here, $I\{\hat{g}(Z_i) \neq \ell_i\}$ takes the value 1 if $\hat{g}(Z_i) \neq \ell_i$ and 0 otherwise. In an effort to extend this principle to general $\pi$, we instead use an approach based on the class-wise errors, similar to [4]. Define, for $j \in \{0, 1\}$, the true class-wise loss of a given classifier $\hat{g}$ as $L_j^{(\hat{g})} = P(\hat{g}(Z) \neq j \mid \mathcal{D}_{N_{train}}, \ell = j)$. As shown in the proof of Proposition 1, conditioned on the training data and the number of test observations from class $j \in \{0, 1\}$, $n_{j,test} \hat{L}_j^{(\hat{g})} \mid \mathcal{D}_{N_{train}}, n_{j,test} \sim \text{Bin}(n_{j,test}, L_j^{(\hat{g})})$. The loss $L_j^{(\hat{g})}$ depends on the classifier and is generally not known, even under $H_0$. However, if $P = Q$, it holds that $L_0^{(\hat{g})} + L_1^{(\hat{g})} = 1$, where we used the independence of $\ell$ and $Z$ when $P = Q$. As a side note, this shows that $L_0^{(\hat{g})} + L_1^{(\hat{g})} = 1$ holds as soon as $\ell$ and $\hat{g}(Z)$ are independent. This follows if $P = Q$, but also if $\hat{g}$ ignores the dependence between $\ell$ and $Z$, which essentially means it has no discriminating ability.
We are then able to formulate the following decision rule: reject $H_0$ if $\hat{L}_{1/2}^{(\hat{g})} - 1/2 \leq \Phi^{-1}(\alpha)\,(\hat{\sigma}_c + \epsilon_{N_{test}})$, (2) where $\hat{L}_{1/2}^{(\hat{g})} = (\hat{L}_0^{(\hat{g})} + \hat{L}_1^{(\hat{g})})/2$, $\Phi^{-1}(\alpha)$ is the $\alpha$ quantile of the standard normal distribution, $\hat{\sigma}_c$ is an estimate of the standard deviation of $\hat{L}_{1/2}^{(\hat{g})}$, and $\epsilon_{N_{test}}$ is a decreasing sequence of small non-random numbers.

Proposition 1 There exists a sequence $\epsilon_{N_{test}}$, such that the decision rule in (2) conserves the level asymptotically.

Proposition 1 is related to the first part of Proposition 9.1 in [4]. Note that we did not put any restrictions on how $L_0^{(\hat{g})}$, $L_1^{(\hat{g})}$ change individually, and in particular we made no assumption on how $N_{train}$ behaves as $N_{test}$ goes to infinity. The reason for including the sequence $\epsilon_{N_{test}}$ is that, when $N_{train}$ increases with $N_{test}$, boundary cases are possible in which the variance $L_0^{(\hat{g})}(1 - L_0^{(\hat{g})}) + L_1^{(\hat{g})}(1 - L_1^{(\hat{g})})$ decreases as $1/N_{test}$ or faster, while still being nonzero for finite $N$. In this case the asymptotic normality of $(\hat{L}_{1/2}^{(\hat{g})} - 1/2)/\hat{\sigma}_c$ breaks down and it becomes increasingly difficult to control the behavior of the acceptance probability under the null. Adding $\epsilon_{N_{test}}$ makes it possible to circumvent this difficulty, albeit at the price of a potential loss in asymptotic power in these boundary cases. If $N_{train}$ grows at the same rate as $N_{test}$, such boundary cases appear unlikely in practice. In fact, for a Random Forest classifier, it rather seems the classifier simply outputs the majority class for all $N_{train}$ large enough, such that $\hat{\sigma}_c = 0$ and $\hat{L}_0^{(\hat{g})} = 0$, $\hat{L}_1^{(\hat{g})} = 1$ or vice versa. In this case the level is guaranteed even if $\epsilon_{N_{test}} = 0$ for all $N$. We simply take $\epsilon_{N_{test}} = 0$ for the remainder of this paper. The test is summarized in Algorithm 1.
We briefly highlight the connection between the above decision rule and the one based on the overall classification error $\hat{L}^{(\hat{g})}$, in the case of $\pi = 1/2$ and $\epsilon_{N_{test}} = 0$: for $\hat{\pi} = n_{1,test}/N_{test}$, the overall error decomposes as $\hat{L}^{(\hat{g})} = (1 - \hat{\pi})\hat{L}_0^{(\hat{g})} + \hat{\pi} \hat{L}_1^{(\hat{g})}$, so the two statistics coincide whenever $\hat{\pi} = 1/2$.
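This decomposition can be verified numerically; a minimal Python check with hypothetical labels and predictions:

```python
import numpy as np

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=200)   # hypothetical test labels
preds = rng.integers(0, 2, size=200)    # hypothetical classifier predictions

n1 = labels.sum()
pi_hat = n1 / len(labels)

overall = np.mean(preds != labels)             # overall error L-hat
l0 = np.mean(preds[labels == 0] != 0)          # class-0 test error
l1 = np.mean(preds[labels == 1] != 1)          # class-1 test error

# Overall error = (1 - pi_hat) * L0-hat + pi_hat * L1-hat
assert np.isclose(overall, (1 - pi_hat) * l0 + pi_hat * l1)
```

The identity holds exactly, since both sides count the same misclassifications.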
Algorithm 1 (Binomial test): randomly separate the data into training and test set, train a classifier $\hat{g}(\cdot)$ on $\mathcal{D}_{N_{train}}$, calculate the class-wise out-of-sample classification errors, and return the p-value of decision rule (2); in the degenerate case $\hat{\sigma}_c = 0$, set pvalue $\leftarrow I\{\hat{L}_{1/2}^{(\hat{g})} - 1/2 > 0\}$.

Naturally, the split into training and test set is not ideal. For finite sample sizes, one would like to have as many (test) samples as possible to detect differences. At the same time, it would be preferable to have the classifier trained on many data points. This in fact resembles a bias-variance trade-off, similar to what was described in [6]: Let $g^*_{1/2}$ be the Bayes classifier defined in Section 2.3. For $\pi = 1/2$, there is a trade-off between the closeness of $L^{(\hat{g})}$ to $L^{(g^*_{1/2})}$, which may be achieved through a large training set, and the closeness of $\hat{L}^{(\hat{g})}$ to $L^{(\hat{g})}$, which is generally only achieved with a large test set.
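A minimal sketch of the out-of-sample test with $\epsilon_{N_{test}} = 0$, using scikit-learn and the normal approximation; the function name and structure are our own, not the hypoRF API:

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestClassifier

def binomial_test(z, labels, seed=0):
    """Out-of-sample test sketch: split the data, train a Random Forest,
    and compare the average class-wise test error to 1/2."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    idx = rng.permutation(n)
    train, test = idx[: n // 2], idx[n // 2:]

    rf = RandomForestClassifier(n_estimators=300, random_state=seed)
    rf.fit(z[train], labels[train])
    preds = rf.predict(z[test])
    lt = labels[test]

    l0 = np.mean(preds[lt == 0] != 0)            # class-wise test errors
    l1 = np.mean(preds[lt == 1] != 1)
    l_half = (l0 + l1) / 2
    n0, n1 = np.sum(lt == 0), np.sum(lt == 1)
    sigma = np.sqrt(l0 * (1 - l0) / n0 + l1 * (1 - l1) / n1) / 2

    if sigma == 0:                     # degenerate case, e.g. majority voting
        return 1.0 if l_half >= 0.5 else 0.0
    # Under H0, (l_half - 1/2) / sigma is approximately standard normal.
    return float(norm.cdf((l_half - 0.5) / sigma))
```

A small p-value indicates that the average class-wise error is significantly below chance level.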

Out-of-bag test
For the purpose of overcoming the arbitrary split into training and testing, Random Forest delivers an interesting tool: the OOB error introduced in [2]. Since each tree is built on a bootstrapped sample taken from $\mathcal{D}_N$, approximately one third of the trees will not use the $i$th observation $(\ell_i, Z_i)$. Thus we may use this ensemble of trees not containing observation $i$ to obtain an estimate of the out-of-sample error for $i$. We slightly generalize this here by assuming we have an ensemble learner $g$: That is, we assume to have iid copies $\nu_1, \dots, \nu_B$ of a random element $\nu$, such that each $\hat{g}_{\nu_b}(Z) := g(Z, \mathcal{D}_{N_{train}}, \nu_b)$ is a different classifier. We then consider the average of the ensemble; for $B \to \infty$, this is (a.s.) $\hat{g}(Z) = \mathbb{E}_\nu[\hat{g}_\nu(Z)]$. For Random Forest, $\nu$ usually represents the bootstrap sampling of observations and the sampling of variables to consider at each split point for a given tree. Let, as before, $n_0 := \sum_{i=1}^N I\{\ell_i = 0\}$ and $n_1 := \sum_{i=1}^N I\{\ell_i = 1\}$, with $n_0 \geq 1$, $n_1 \geq 1$. We assume in the following that each $\hat{g}_{\nu_b}(Z)$ uses a bootstrapped sample from the original data, as Random Forest does. The class-wise OOB error of such an ensemble of learners trained on $N$ observations is defined as $E_j^{oob} = \frac{1}{n_j} \sum_{i: \ell_i = j} I\{\hat{g}_{-i}(Z_i) \neq j\}$, for $j \in \{0, 1\}$, where $\hat{g}_{-i}$ represents the ensemble of learners not containing the $i$th observation for training.
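In scikit-learn, the OOB class-probability estimates are exposed via `oob_decision_function_`, which allows a compact sketch of the class-wise OOB errors (the function name is our own):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classwise_oob_errors(z, labels, n_trees=500, seed=0):
    """Class-wise OOB errors E0, E1 and their average E_{1/2} (sketch)."""
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=seed)
    rf.fit(z, labels)
    # Row i of oob_decision_function_ aggregates only the trees whose
    # bootstrap sample did not contain observation i.
    oob_pred = np.argmax(rf.oob_decision_function_, axis=1)
    e0 = np.mean(oob_pred[labels == 0] != 0)
    e1 = np.mean(oob_pred[labels == 1] != 1)
    return e0, e1, (e0 + e1) / 2
```

With a large number of trees, every observation is out-of-bag for some trees with overwhelming probability, so the estimate is well defined.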
Unfortunately, the test statistic $E_{1/2}^{oob} = (E_0^{oob} + E_1^{oob})/2$ is difficult to handle; due to the complex dependency structure between the elements of the sum, it is not clear what the (asymptotic) distribution under the null is. For theoretical purposes, we consider in Section 3 a solution based on the concept of U-statistics. Here, we recommend using the OOB error together with a permutation test; see e.g., [15] or [4], who use it in conjunction with the out-of-sample error evaluated on a test set: We first calculate the class-wise OOB errors $E_0^{oob}$, $E_1^{oob}$ and then reshuffle the labels $K$ times to obtain $K$ permutations, $\sigma_1, \dots, \sigma_K$ say. For each of these new data sets, we recompute the class-wise OOB errors and thereby $E_{1/2, \sigma_k}^{oob}$, $k = 1, \dots, K$. Under $H_0$, $(\ell_1, \dots, \ell_N)$ and $(Z_1, \dots, Z_N)$ are independent, and each $E_{1/2, \sigma_k}^{oob}$ is simply an iid draw from the distribution $F$ of the random variable $E_{1/2}^{oob} \mid (Z_1, \dots, Z_N)$. As such, we can accurately approximate the $\alpha$ quantile $F^{-1}(\alpha)$ of said distribution by performing a large number of permutations, and use the decision rule that rejects $H_0$ if $E_{1/2}^{oob}$ falls below the estimated quantile. Thus, as in the decision in Equation (2), the rejection region depends on the data at hand. Nonetheless, the level is conserved, as proven e.g. in [16, Theorem 1].
Heuristically, this procedure will have power under the alternative, as in this case there is some dependence between $(\ell_1, \dots, \ell_N)$ and $(Z_1, \dots, Z_N)$, formed by the difference in the distribution of the $Z_i$. The OOB error $E_{1/2}^{oob}$ will thus tend to differ from the ones observed under permutations. The whole procedure is described in Algorithm 2. We name this test "hypoRF".
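The permutation scheme itself is generic. In the following sketch, `stat_fn` stands in for the class-wise OOB error statistic $E_{1/2}^{oob}$ (in hypoRF, a new forest is fit for each relabeling):

```python
import numpy as np

def hypo_rf_test(z, labels, stat_fn, K=100, seed=0):
    """Permutation test sketch: compare the observed statistic with its
    distribution under K random relabelings."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(z, labels)
    permuted = np.array([stat_fn(z, rng.permutation(labels))
                         for _ in range(K)])
    # A small OOB error relative to the permutation distribution is
    # evidence against H0 (one-sided p-value).
    return (1 + np.sum(permuted <= observed)) / (K + 1)
```

The `(1 + …)/(K + 1)` form guarantees a valid (slightly conservative) p-value for finite $K$.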
An interesting question is whether $g^*_{1/2}$ leads to a consistent test in our framework. We first define consistency for a hypothesis test: Let $\Theta$ be the space of tuples of all distributions on $\mathbb{R}^d$, $\theta = (P, Q) \in \Theta$, with $\Theta_0 = \{\theta \in \Theta : P = Q\}$ and $\Theta_1 = \{\theta \in \Theta : P \neq Q\}$, and let $\varphi(\theta)$ denote the rejection probability of the decision rule $\delta(\mathcal{D}_N)$ under $\theta$. Following e.g., [18], we call a test consistent at level $\alpha$ (for $\Theta_1$) if $\limsup_N \sup_{\theta \in \Theta_0} \varphi(\theta) \leq \alpha$ and, for any $\theta \in \Theta_1$, $\liminf_N \varphi(\theta) = 1$. For theoretical purposes, we extend this definition also to decision rules $\delta$ that depend on the unknown $\theta$ itself, for instance via the densities of $P$ and $Q$ respectively.
Under the assumption of equal class probabilities $\pi = 1/2$, the Bayes error has the property that $L^{(g^*_{1/2})} = \frac{1}{2} - \frac{TV(P, Q)}{4}$, where $TV(P, Q)$ is the total variation distance between $P$ and $Q$: $TV(P, Q) = 2 \sup_A |P(A) - Q(A)|$, with the supremum taken over all Borel sets on $\mathbb{R}^d$. As $TV$ defines a metric on the space of all probability measures on $\mathbb{R}^d$, it holds that $P = Q \iff TV(P, Q) = 0$. Consequently, as soon as there is any difference between $P$ and $Q$, $TV(P, Q) > 0$ and $L^{(g^*_{1/2})} < 1/2$. Thus we would expect a test based on $g^*_{1/2}$ to be consistent. More generally, [4] prove that if the classifier $\hat{g}$ is such that $L_0^{(\hat{g})} \to L_0$ and $L_1^{(\hat{g})} \to L_1$, for some $L_0, L_1 \in (0, 1)$ with $L_0 + L_1 = 1 - \varepsilon$ for some $\varepsilon > 0$, (8) then the decision rule in (2) is consistent.
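On a discrete space, the relation between the balanced Bayes error and the total variation distance (with the definition $TV = 2\sup_A |P(A) - Q(A)| = \sum_x |p(x) - q(x)|$) can be checked directly on a toy example:

```python
import numpy as np

# Two hypothetical distributions on the four points {0, 1, 2, 3}.
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.3, 0.4])

tv = np.sum(np.abs(p - q))          # = 2 * sup_A |P(A) - Q(A)|
# Balanced Bayes classifier: predict the class with the larger density.
# Its average class-wise error is (1/2) * sum_x min(p(x), q(x)).
bayes_balanced_error = 0.5 * np.sum(np.minimum(p, q))

assert np.isclose(bayes_balanced_error, 0.5 - tv / 4)
```

Here $TV = 0.8$ and the balanced Bayes error is $0.3 = 1/2 - 0.8/4$, strictly below chance level.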
Unfortunately, this assumption does not hold for $g^*_{1/2}$ if $\pi \neq 1/2$. In this case, simple counterexamples show that even when $P$ and $Q$ are different, it might still be that $L_0^{(g^*_{1/2})} + L_1^{(g^*_{1/2})} = 1$.

Lemma 1 Take $\mathcal{X} \subset \mathbb{R}$ and $\pi \neq 1/2$. Then no decision rule of the form $\delta(\mathcal{D}_N) = \delta(g^*_{1/2}(\mathcal{D}_N))$ is consistent.
Thus even though we allow the classifier $g^*_{1/2}$ to depend, for each $(P, Q) \in \Theta_1$, on the densities $p$ of $P$ and $q$ of $Q$, we are not able to construct a consistent test. The problem appears to be that the Bayes classifier minimizes the overall classification loss, so that condition (8) cannot hold. In doing so, it focuses too much on the overrepresented class. Indeed, we might define the following alternative classifier: For given $P$, $Q$, let $g^*_\pi$ be a classifier that minimizes the balanced error $L^g_{1/2}$, i.e. a classifier that solves the problem $\arg\min\{L^g_{1/2} : g : \mathcal{X} \to \{0, 1\} \text{ a classifier}\}$. (9)
It turns out that a slight variation of the Bayes classifier solves this problem: the classifier that predicts class 1 whenever $q(z) > p(z)$, which corresponds to thresholding the posterior probability of class 1 at $\pi$ instead of at $1/2$, is a solution to (9). Moreover, it holds that $L^{(g^*_\pi)}_{1/2} = \frac{1}{2} - \frac{TV(P, Q)}{4}$ for any $\pi \in (0, 1)$.
Since this theoretical classifier needs no training, the two testing approaches coincide with an evaluation of the classifier loss on the overall data $\mathcal{D}_N$. While this analysis with theoretical classifiers is by no means sufficient for the much more complicated case of a classifier $\hat{g}$ trained on data, it suggests that adapting the "cutoff" of a given classifier might resolve consistency issues. Indeed, we use the classifier that predicts class 1 whenever the estimated probability of class 1 exceeds $\hat{\pi}$, where $\hat{\pi}$ is an estimate of the prior probability based on the training data. As long as the latter is used (as opposed to the test data), the tests above remain valid.
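Adapting the cutoff is a one-line change once class probabilities are available; a sketch with scikit-learn (the helper name is ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def predict_with_cutoff(rf, z, pi_hat):
    """Predict class 1 when the estimated class-1 probability exceeds
    pi_hat (estimated on the training data) instead of the default 1/2."""
    p1 = rf.predict_proba(z)[:, 1]
    return (p1 > pi_hat).astype(int)
```

In the balanced case $\hat{\pi} = 1/2$, this reduces to the usual majority vote.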

Tests based on U-Statistics
To avoid the splitting into training and test set, we introduced an OOB error-based test in Section 2.2. In this section, we discuss a potential framework to analyse a version of such a test theoretically. Let $\mathcal{D}^{-i}_{N_{train}}$ denote the data set without observation $(\ell_i, Z_i)$. We then consider the class-wise OOB error based on $N_{train}$ observations, viewed as a function $h_{N_{train}}$ of the $N_{train}$ observations. We assume that the number of classifiers in the ensemble $B \to \infty$, so that $\hat{g}(Z) \to \mathbb{E}_\nu[\hat{g}_\nu(Z)]$, almost surely. We refer to the function $h_{N_{train}}$ as a kernel of size $N_{train}$ and define the incomplete U-statistic $\hat{U}$ as the average of $h_{N_{train}}$ over $K$ randomly chosen subsets of size $N_{train}$ -- see e.g., [19], [20], [21], [22]. We assume that $K$ goes to infinity as $N$ goes to infinity. Since we are only considering learners for which the $i$th sample point is not included, we may simply see $\hat{g}_{-i}$ as an infinite ensemble built on the data set $\mathcal{D}^{-i}_{N_{train}}$ only. Consequently, with the assumption of an infinite number of learners, the OOB error is "almost" unbiased for $\mathbb{E}[L^{(\hat{g}_{-i})}_{1/2}]$. Here, $\mathbb{E}[L^{(\hat{g}_{-i})}_{1/2}]$ refers to the expected value of the error based on the classifier trained on $N_{train} - 1$ data points; as such, it does not depend on $i$. This is essentially the same result as in [23] in the case of the leave-one-out error.
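The incomplete U-statistic itself is simple to form; a generic sketch, where `kernel` is a stand-in for the OOB-error kernel $h_{N_{train}}$ (in the test of this section, each kernel evaluation would fit a forest on the chosen subset):

```python
import numpy as np

def incomplete_u_statistic(data, kernel, m, K, seed=0):
    """Average a size-m kernel over K randomly chosen subsets of the data
    (an incomplete U-statistic)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    vals = [kernel(data[rng.choice(n, size=m, replace=False)])
            for _ in range(K)]
    return float(np.mean(vals))
```

Choosing $K$ small relative to $N$ is exactly what drives the asymptotic normality result below.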
We are now able to show that $h_{N_{train}}$ in (12) is a symmetric function, unbiased for $\mathbb{E}[L^{(\hat{g}_{-i})}_{1/2}]$. Combining arguments from [21] and [24], we obtain the conditions for asymptotic normality listed in Theorem 1. Though both papers consider the asymptotic distribution of a Random Forest prediction at a fixed $z$, the U-statistics theory they develop can be used in our context as well. We also refer to [22] and [25], who already refined the results of [21] for asymptotic normality of U-statistics with growing kernel size; [22] in particular derived a result similar to Theorem 1 independently of us. For random variables $\xi_1, \xi_2$, let $\mathbb{V}(\xi_1)$ and $\text{Cov}(\xi_1, \xi_2)$ denote the variance and covariance respectively, and define, for $c \in \{1, \dots, N_{train}\}$, $\zeta_{c, N_{train}}$ as the covariance between two evaluations of the kernel $h_{N_{train}}$ sharing exactly $c$ arguments. In particular, $\zeta_{1, N_{train}}$ and $\zeta_{N_{train}, N_{train}}$ will be of special interest. [19] provides an immediate important result: Lemma 5, which is actually true for any U-statistic, shows that whenever the second moment of the kernel $h_{N_{train}}$ exists, $\zeta_{1, N_{train}} = O(N_{train}^{-1})$. Condition (15) is hard to control in general, but with Lemma 5 it can be seen that choosing $K$ such that $K N_{train}/N \to 0$ is sufficient for both (15) and (16). If $K = \log(N_{train})^{1+d}$, this corresponds to the condition $\log(N_{train})^{1+d} N_{train}/N \to 0$ required by [24]. In the context of Random Forest, Theorem 1 essentially proves that the OOB error of a bounded prediction function is asymptotically normal if the number of trees is "high" and if $K$ forests are trained on subsamples such that (15) and (16) hold. Since the OOB error with infinite learners is essentially the leave-one-out error in the context of cross-validation, this also means that a test of the cross-validation error could be derived under much weaker assumptions than, for instance, in [20]. The key reason for the generality of the result, as was also realized by [22], is that $K$ should be chosen small relative to $N$.
This introduces additional variance, such that conditions on ζ 1,Ntrain usually required in such results, see e.g., [25], can be replaced by (18). This has an additional computational advantage, but it may come at the price of reduced power, as will be seen in Corollary 2.
Then the decision rule in (19) conserves the level asymptotically and has the approximate power given in (21). The test thus has power going to one as soon as (22) holds. Condition (22) mirrors condition (A9) in [4], in that it asks for a better-than-chance prediction in expectation. Crucially, Corollary 2 also illustrates the downside of the weak assumptions used in Theorem 1: The power depends on $\sqrt{K}$, as well as on the accuracy of the trained classifier through $\mathbb{E}[L^{(\hat{g}_{-i})}_{1/2}]$. Since our theory requires that $K$ is of small order compared to $N$, we lose power, at least theoretically. In practice, it appears from simulations with Random Forest that $\zeta_{N_{train}, N_{train}}$ decreases to zero and roughly behaves like $1/N_{train}$. From the asymptotic power expression above, it can be seen that this would offset the small order of $K$. Nonetheless, the test of Corollary 2 appears less powerful than the Binomial and hypoRF tests. In the example of Figure 2, plugging the estimate of $\mathbb{E}[L^{(\hat{g}_{-i})}_{1/2}]$ obtained from the 500 repetitions into (21) and averaging, we obtain an expected power of 0.63. The actual power, i.e. the fraction of rejected tests over the 500 repetitions, is 0.61. The Binomial test with Random Forest, on the other hand, reaches a power of 1. This illustrates that the test derived in this section still lags behind the test that uses sample splitting. Nonetheless, modern U-statistics theory gives powerful theoretical tools to construct OOB error-based tests with tractable asymptotic power.

Application
In this section, we first describe the proposed significance threshold for the variable importance measure and then apply the hypoRF test to simulated and real application cases. In the simulation section, we compare the hypoRF to recent kernel-based tests by investigating power in a selected scenario. A more extensive simulation study is given in Appendix B. In Section 4.3, two real data sets from biology and finance are considered.

Variable importance measure
Variable importance measures in the context of Random Forest are practical tools introduced by [2]. As a by-product of the hypoRF test of Section 2.2, we obtain a significance threshold for such a given variable importance measure: For each permutation, we record the maximum variable importance measure $I_\sigma$ over all variables, thus approximating the distribution of $I_\sigma$ under $H_0$. The estimated $1 - \alpha$ quantile of this distribution will then be used as the significance threshold. Every variable with an importance measure above this threshold will be called significant. This should serve as an additional hint in which components a rejection decision might originate from. We will use in all instances the "Gini" importance measure or "Mean Decrease Impurity", see e.g., [26, Section 5].
Obtaining p-values for the variable importance measure by permuting the response vector was developed much earlier in [27] and further developed in [28]. As we are not directly interested in p-values for each variable, our approach differs slightly and is more in the spirit of the Westfall-Young permutation approach, see e.g., [29]. Since we use a permutation approach already to define the decision rule of the hypoRF test, the significance threshold for the variable importance arises without any additional cost. Figure 1 in Section 1.2 demonstrates that in this example the Random Forest is able to correctly identify the effect of the last two components. This appears remarkable, as there is only a change in dependence, but no marginal change. On the other hand, one could imagine a situation, where no significant variable may be identified, but the test overall still rejects. This is illustrated in Figure 3. In this example, instead of endowing only the last two components with correlations, we introduced correlations of 0.4 between all variables when changing from P to Q. Again the hypoRF test manages to differentiate between the two distributions. However this time, no significant variables can be identified. This seems sensible, as the source of change is divided equally between the different components in this example. Any situation could also be a mixture of the above extreme examples: There could be one or several significant variables, but the test still rejects, even after removing them. Section 4.3 will show real-world examples in which some variables can be identified to be significant in the above sense.
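The threshold construction can be sketched as follows. This is a simplified Python version with our own names; in hypoRF, the permutations of the test itself are reused, so the threshold arises at no extra cost, whereas this sketch refits a forest per permutation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def importance_threshold(z, labels, K=50, alpha=0.05, seed=0):
    """Significance threshold for the Gini variable importance: record the
    maximum importance over all variables for each label permutation and
    take the 1 - alpha quantile of these maxima."""
    rng = np.random.default_rng(seed)
    max_imp = []
    for _ in range(K):
        rf = RandomForestClassifier(n_estimators=100, random_state=seed)
        rf.fit(z, rng.permutation(labels))     # importance under H0
        max_imp.append(rf.feature_importances_.max())
    return float(np.quantile(max_imp, 1 - alpha))
```

Variables whose importance on the original labels exceeds this threshold are flagged as significant.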

Simulation
In what follows, we will demonstrate the power of the proposed tests through simulation, and compare it with 3 kernel methods and a recently proposed Random Forest test based on the classification probability.
To this end, we will use both the first version of the test, as described in Algorithm 1 ("Binomial" test), and the refined version in Algorithm 2 ("hypoRF" test). For the latter, as mentioned in Section 2.2, we will use K = 100 permutations. For the Binomial test described in Algorithm 1 we decided to set N train = N test , as taking half of the data as training and the other half as test set seems to be a sensible solution a priori.
To conduct our simulations we will use the R-package "hypoRF" developed by the authors, which consists of the "hypoRF" function including the two proposed tests. For each pair of samples, we run all tests and save the decisions. The estimated power is then the fraction of rejected among the S tests.
The 3 kernel-based tests include the "quadratic time MMD" [11] using a permutation approach to approximate the $H_0$ distribution ("MMDboot"), its optimized version "MMD-full", as well as the "ME" test with optimized locations, "ME-full" [14]. The original idea of the "MMD-full" was formulated in [12]. For all tests, we use a Gaussian kernel, which is a standard and reasonable choice if no a priori knowledge about the optimal kernel is available. The Gaussian kernel requires a bandwidth parameter $\sigma$, which is tuned in MMD-full and ME-full based on training data. For MMDboot we use the "median heuristic", as described in [11, Section 8], which takes $\sigma$ to be the median (Euclidean) distance between the elements of the pooled sample. Finally, we consider the method of [10], which is a test based on the classification probability of Random Forest ("CPT-RF"). We would like to emphasize that their first publication on arXiv appeared more than 6 months after our first upload on arXiv. As such, we do not view them as a direct competitor. Nonetheless, it seems interesting to compare their performance to that of hypoRF, as they use a permutation approach based on the in-sample probability estimates.
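The median heuristic in particular is straightforward to compute; a sketch:

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic(z):
    """Median heuristic: sigma is the median Euclidean distance between
    all pairs of points in the pooled sample."""
    return float(np.median(pdist(z)))
```

It requires no tuning data, which is why MMDboot can use the full sample for testing.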
We would like to stress that we did not use any tuning for the parameters of the RF-based tests, just as we did not use any tuning for MMDboot. As such, comparing MMD-full and ME-full to the other methods might not be entirely fair. On the other hand, our chosen sample size might be too small for the optimized versions to work at full capacity. In particular, all optimized tests suffer from a similar drawback as our Binomial test: The tuning of the method takes up half of the available data. While [14] find that ME-full outperforms the MMD, they only consider settings where the latter also uses half of the data to tune its kernel, as proposed in [12]. In our terminology, they only compare ME-full to MMD-full, instead of MMDboot. It seems unclear a priori what happens if we instead employ the median heuristic for the MMD and let it use all of the available data, as in [11]. It should also be said that both optimization and testing of the ME-full scale linearly in $N$, making its performance in the experiments below all the more impressive. On the other hand, the optimization depends on some hyperparameters common in gradient-based optimization, such as the step size taken in the gradient step, the maximum number of iterations, etc. As this optimization is rather complicated for large $d$, some parameter choices can lead to a longer runtime of the ME than the calculation-intensive hypoRF and CPT-RF. In general, it seems both runtime and performance of ME-full are in practice highly dependent on the chosen hyperparameters; we tried 3 different sets of parameters based on the code in https://github.com/wittawatj/interpretable-test, with very different power results. The setting used in this simulation study is exactly the one used in their simulation study.
As discussed in [30], changing the parameters of our experiments (for instance the dimension d) should be done in a way that leaves the Kullback-Leibler (KL) Divergence constant. When varying the dimension d we generally follow this suggestion, though in our case, this is not as imminent; whatever unconscious advantage we might give our testing procedure is also inherent in the competing methods. Finally, also note that, while our methods would be in principle applicable to arbitrary classifiers, we did not compare our proposed tests with tests based on other classifiers, such as those used in [6]. Rather, we believe the choice of classifiers for binary classification is a more general problem and should be studied separately, as for example done extensively in [3]. The only exception to this, is our use of an LDA classifier-based test for the example of a Gaussian mean-shift in Appendix B.0.1.
Where not stated differently, we use for the following experiments: $N = 600$ observations, 300 per class, $d = 200$ dimensions, $K = 100$ permutations and 600 trees for the RF-based tests. In some examples, we additionally study a sparse case, where the intended change in distribution appears only in $c < d$ components. Moreover, if $P_1, \dots, P_d$ are distributions on $\mathbb{R}$, we will denote by $P_1 \otimes \dots \otimes P_d$ their product measure on $\mathbb{R}^d$; in other words, we simply take all the components of $X$ to be independent. The prime example, which we present here in the main text, is rather challenging. Let $P = N(\mu, \Sigma)$ with $\mu$ set to $50 \cdot 1$ and $\Sigma = 25 \cdot I_{d \times d}$. For the alternative, we consider the mixture $Q = (1 - \lambda) P + \lambda H_c$, with $\lambda \in [0, 1]$ and $H_c$ some distribution on $\mathbb{R}^d$. This is a "contamination" of $P$ by $H_c$, with $\lambda$ determining the contamination strength. Here, we take $H_c$ to be another independent $(d - c)$-variate Gaussian together with $c$ components that are in turn independent Binomial(100, 0.5) distributed. We thereby choose parameters such that the Binomial components in $H_c$ have the same mean and variance as the Gaussian components, and such that differentiating between Binomial and Gaussian is known to be difficult. As $\lambda$ increases, the kernel-based tests barely move over the level of 5%. On the other hand, the two proposed tests slowly grow from around 0.05 to almost 0.4 in the case of the hypoRF test. Interestingly, while relatively close at first, the difference in power between the Binomial test and the hypoRF grows and is starkest for $\lambda = 1$, again demonstrating the benefit of using the OOB error as a test statistic. Although slightly worse than the hypoRF, the CPT-RF is also clearly beating the Binomial test, highlighting the benefit of using the permutation approach with (in-sample) classification probabilities.
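The contamination alternative can be sampled as follows (a sketch with our own function name; each row is drawn from $H_c$ with probability $\lambda$, in which case its first $c$ coordinates are replaced by Binomial draws):

```python
import numpy as np

def sample_contaminated(n, d=200, c=5, lam=0.5, seed=0):
    """Sample from Q = (1 - lam) * P + lam * H_c, where P = N(50*1, 25*I)
    and H_c replaces c Gaussian coordinates by Binomial(100, 0.5), which
    has the same mean (50) and variance (25) as the Gaussian components."""
    rng = np.random.default_rng(seed)
    out = rng.normal(50.0, 5.0, size=(n, d))   # draws from P
    contaminated = rng.random(n) < lam         # mixture indicator per row
    k = contaminated.sum()
    out[contaminated, :c] = rng.binomial(100, 0.5, size=(k, c))
    return out
```

The matched first two moments are what makes this alternative hard for kernel tests with a generic bandwidth.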
Finally, we consider the case d = c, so that H c simply consists of d independent Binomial distributions. The result is displayed in Figure 6: all RF-based tests are now extremely strong, while the kernel tests fail to detect any signal.
More simulation examples can be found in Appendix B.

Real Data
As a first application, we consider a high-dimensional microarray data set from [31]. The data set concerns breast cancer and was originally provided by [32], who examined 168 patients with 2905 gene expressions each over a five-year period. The 111 patients with no metastasis of small node-negative breast carcinoma after diagnosis were labeled "good", and the 57 patients with early metastasis were labeled "poor". The application of the hypoRF test to the two groups is summarized in Figure 7. The test detects a clear difference between the groups "good" and "poor", with "8p23", "8p21" and "3q25" being the most important (and significant) genes. There seems to be a high correlation between genes that are located close to each other (especially within the same chromosome). This has the effect that the Random Forest makes a more or less arbitrary choice at a split point between those highly correlated genes, which in turn is reflected in the variable importance measure. For this reason, one should be careful when interpreting the variable importance measure on a gene level. It appears that chromosomes 8 and 3 play an important role in distinguishing the two groups. This finding is in line with [32, Figure 2, p. 1129].
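The permutation scheme behind this kind of analysis can be sketched as follows. This is a minimal Python re-implementation of the idea, using scikit-learn's RandomForestClassifier as a stand-in for the R-package hypoRF; it is a sketch, not the authors' implementation (in scikit-learn, `oob_score_` reports OOB accuracy, so the OOB error is one minus it).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def oob_error(Z, labels, n_trees, seed):
    """Fit a Random Forest and return its out-of-bag error."""
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=seed)
    rf.fit(Z, labels)
    return 1.0 - rf.oob_score_

def hypo_rf_test(X, Y, K=100, n_trees=600, seed=0):
    """Permutation two-sample test with the OOB error as statistic.

    A small OOB error means the forest separates the two samples,
    which is evidence against H0 : P = Q.  The null distribution is
    obtained by refitting the forest on K randomly relabeled data sets.
    """
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    labels = np.r_[np.ones(len(X), dtype=int), np.zeros(len(Y), dtype=int)]
    stat = oob_error(Z, labels, n_trees, seed)
    null = [oob_error(Z, rng.permutation(labels), n_trees, seed + 1 + k)
            for k in range(K)]
    # one-sided p-value: a small observed OOB error is extreme
    return (1 + sum(s <= stat for s in null)) / (K + 1)
```

The variable importances of the forest fitted on the original labels (`feature_importances_` in scikit-learn) give the per-variable ranking discussed above; assessing their significance additionally requires their permutation distribution.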
Figure 7 (caption): The test rejects the null hypothesis that the two groups "good" and "poor" come from the same distribution with a p-value of 0.0099. The 3 significant genes are "8p23", "8p21" and "3q25" (marked in red). The green triangles represent the important genes reported by [32]. Additionally, the plot of the first two principal components highlights the fact that there seem to be no obvious clusters. Note: only 15% of the total variance is explained by the first 2 principal components. The Random Forest used 1000 trees and a minimal node size to consider a random split of 4.

In the second example, we are interested in the relative importance of financial risk factors (asset-specific characteristics). We claim that a financial risk factor has explanatory power if it contributes significantly to the classification of individual stock returns above or below the overall median. We use monthly stock characteristics used by [33] from Dacheng Xiu's webpage; see http://dachxiu.chicagobooth.edu. Between 1977 and 2016, we only use stocks for which we have a full return history. This leads to 501 stocks with 94 stock-specific characteristics. The group "positive" contains stocks and time points for which the return was above the overall median, and vice versa for the "negative" group. The two groups are balanced and contain more than 120'000 observations each. The application of the hypoRF test to the two groups is summarized in Figure 8. The ordering of the different risk factors is in line with the findings in [33, Figure 5, p. 34], 1-month momentum being the most important characteristic.
One could argue that stocks that are at time point t close to the overall median are more or less randomly assigned to one of the two groups. Hence, a possible option is to only assign a stock and time point to a certain group if the return is above (below) a certain threshold, i.e., the overall median ± ε. However, we observed that the result is very robust with respect to different values of ε.
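A minimal sketch of this thresholded group assignment (the function and the toy return vector are ours, not taken from the data set):

```python
import numpy as np

def assign_groups(returns, eps=0.0):
    """Boolean masks for the 'positive'/'negative' groups, dropping
    observations within +/- eps of the overall median as ambiguous."""
    med = np.median(returns)
    return returns > med + eps, returns < med - eps

r = np.array([-0.02, 0.00, 0.01, 0.03, 0.05])
pos, neg = assign_groups(r, eps=0.005)   # 0.01 is the median; it is dropped
```

With eps = 0 every observation except exact median ties lands in one of the two groups, recovering the original assignment.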

Discussion
In this paper, we discussed two easy-to-use and powerful tests based on Random Forest and empirically demonstrated their efficacy. We presented some consistency and power results and showed a way of adapting the Bayes classifier to obtain a consistent test. This adaptation consisted simply in changing the "cutoff" of the classifier. Especially the test based on the OOB statistic (hypoRF) proved to be powerful and additionally delivered a way to assess the significance of individual variables. This was demonstrated in applications using medical and financial data.
After our first publication on arXiv, [10] developed an approach based on a smooth transformation of the in-sample probabilities. Interestingly, experiments using their approach with OOB probability estimates, as a hybrid of their and our methodology, delivered very promising results. Investigating this further could lead to a further improvement in power for RF-based tests.
Proof Let H N = {D Ntrain , n 1,test }. Note that n 1,test and n 0,test contain the same probabilistic information, so it does not matter which we condition on. We first prove that the two class-wise test errors are conditionally independent given D Ntrain , n 1,test . To prove (A.1), first note that by exchangeability (due to iid sampling), the class-wise error is a sum over {i : ℓ i = j} of I{ĝ(Z i ) = j}, j ∈ {0, 1}. Conditional on H N , the above is a sum of n j,test iid elements I{ĝ(Z i ) = j}. Finally, the event ĝ(Z i ) = j is independent of n j,test .

Moreover, set ε Ntest for all N test as above, for some ε > 0 and ν ∈ (1/2, 1). Note that we assume N test → ∞, while N train might also increase to infinity at any rate, or stay constant. Let E be the event that n 1,test /N test → π; then P(E) = 1, as n 1,test /N test → π a.s. First assume, for a realized sequence of D Ntrain , that N test σ c 2 → ∞ holds. Then for a realized sequence of n 1,test with the property that n 1,test /N test → π (i.e., on E), the claimed lim sup bound holds by the Lindeberg-Feller Central Limit Theorem. On the other hand, assume N test σ c 2 remains bounded for j ∈ {0, 1}. In this case, for all δ > 0, the corresponding bound holds on E, and (A.2) remains true. Finally, note that ε Ntest is of too small an order to make a difference in that case, given the √ N test scaling above. Now assume that N train , D Ntrain are such that N test σ c 2 → ∞ does not hold. In this case, the claim follows using again Markov's inequality. Thus we have shown that for a realized sequence of D Ntrain , n 1,test with the property that n 1,test /N test → π, the lim sup bound holds.

Lemma 6 (Restatement of Lemma 1) Take X ⊂ R and π ≠ 1/2. Then no valid decision rule of the given form, based on the Bayes classifier g * 1/2 , has power against all alternatives.

Proof We first show that if π ≠ 1/2, one can construct (P, Q) ∈ Θ 1 that the Bayes classifier is not able to differentiate. Consider π > 1/2, d = 1 and Q being the uniform distribution on (0, 1), with density q(z) = I{z ∈ (0, 1)}. We write q = I(0, 1) for short.
P is a mixture of Q and another uniform distribution on R ⊂ (0, 1). Giving Q a label of 1 and P a label of 0 when observing (1 − π)P + πQ, and taking |R| = 1/2, the Bayes classifier is then given as g * 1/2 (z) = I{η(z) > 1/2}. Simple algebra shows that for any α < min(π/(1 − π) − 1, 1), η(z) > 1/2 and thus g * 1/2 (z) = 1 for all z ∈ (0, 1). On the other hand, for any θ 0 ∈ Θ 0 , simple evaluation of η(z) shows that g * 1/2 (z) = 1 for all z. Consequently, for θ 1 = (P, Q) in the above example and θ 0 ∈ Θ 0 arbitrary, the expectation of any bounded measurable function f : {0, 1} N → R coincides under θ 1 and θ 0 . In particular, since the test conserves the level by assumption, φ(θ 1 ) = φ(θ 0 ) ≤ α and the test has no power.

Proof We show Relation (A.4) for the classifier
If this is true, it immediately follows that g * = g * π . Indeed, let h # P be the push-forward measure of P through a measurable function h : X → R. Taking h = g, for an arbitrary classifier g, it holds that P(g(Z) = 0 | ℓ = 0) − P(g(Z) = 0 | ℓ = 1) is bounded by T V (P, Q), where the first inequality follows because {x : g(x) = 0} and {y : g(y) = 0} are two Borel sets on X . Consequently, the bound holds for any classifier g. It remains to prove (A.4) for g * : it is well known that (one of) the sets attaining the maximum in the definition of T V (P, Q) is given by A * := {z : q(z) ≤ p(z)}. It is possible to rewrite A * accordingly.

Corollary 3 (Restatement of Corollary 1) The decision rule δ B (g * π (D N )) in (2) is consistent for any π ∈ (0, 1).
Proof We restate here the decision rule in (2) for completeness. First we show that the decision rule conserves the level, with ε N = 0 for all N . Since for any P = Q the relevant difference is 0 a.s., it follows that for all θ 0 ∈ Θ 0 and any sample size the rejection probability is 0.
Appendix A.2. Proofs to Section 3

Proof First we note the following. Let B(i) ≤ B be the number of classifiers in the ensemble not containing observation i. Since we assume that each classifier in the ensemble receives a bootstrapped version of D Ntrain , there is a probability p > 0 that any given classifier ĝ b will not contain observation i. Since this bootstrapping is done independently for each classifier, we have that B(i) ∼ Bin(p, B). Thus as B → ∞, also B(i) → ∞ a.s. Since the event ĝ −i (Z i ) = ℓ i is independent of n 1,train given the event ℓ i = 1, and similarly for ℓ i = 0, the claim follows.

Lemma 9 h Ntrain is a valid kernel for the expectation E[L (ĝ −i )].

Proof Unbiasedness was proven above. Symmetry follows, since for any two permutations σ 1 , σ 2 , there exist i, j such that σ 1 (j) = σ 2 (i) := u, and thus the corresponding summands agree, where D σs Ntrain = (Z σs(1) , ℓ σs(1) ), . . . , (Z σs(Ntrain) , ℓ σs(Ntrain) ), s ∈ {1, 2}. But that means the sum in (12) does not change.
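The probability that a single bootstrap sample omits a fixed observation i is p = (1 − 1/N ) N → e −1 ≈ 0.368, so indeed p > 0 for every N. The following small simulation illustrates this; it is a numerical sketch accompanying the argument, not part of the proof.

```python
import math
import random

def frac_trees_omitting(i, N, B, seed=0):
    """Draw B bootstrap samples of {0, ..., N-1} and return the fraction
    that omit observation i, i.e. an estimate of p = P(i not drawn)."""
    rnd = random.Random(seed)
    omitted = sum(
        all(rnd.randrange(N) != i for _ in range(N))  # sample omits i
        for _ in range(B)
    )
    return omitted / B

# p = (1 - 1/N)^N -> e^{-1} ~ 0.368 as N grows, so B(i) ~ Bin(p, B)
# and B(i) -> infinity a.s. as B -> infinity.
p_theory = (1.0 - 1.0 / 200) ** 200
```

Hence, with enough trees, every observation has many trees not containing it, and the OOB prediction ĝ −i (Z i ) is well defined.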
We also need a well-known auxiliary result:

Lemma 10 Let (ξ N ) N , ξ be random variables. If every subsequence of (ξ N ) N has a further subsequence converging in distribution to ξ, then ξ N converges in distribution to ξ.

Proof Let for the following ξ i = (Z i , ℓ i ) for brevity and consider the complete U-statistic Û N , where the sum is taken over all (N choose N train ) possible subsets of size N train ≤ N from {1, . . . , N }. From the "H-decomposition", see e.g., [19], the variance of Û N can be bounded; see also [24, Lemma 7]. Thus it holds for all ε > 0 that the corresponding probability vanishes, by (A.5) and (A.6).
We now use the idea of [19, Lemma A] to prove (17): As in [21], we denote by S N,Ntrain = {S j : j = 1, . . . , (N choose N train )} all possible subsamples of size N train sampled without replacement. Let M N,Ntrain = (M S1 , . . . , M S(N choose Ntrain) ) be the number of times each subsample appears when sampling K times. Let a i = (h Ntrain (S i ) − E[L (ĝ −1 )])/√ ζ Ntrain,Ntrain , as in [19]. Then Û N,2 is again a U-statistic with E[Û N,2 ] = 1, using Lemma 5. Thus Û N,2 converges in probability to 1, and this will be true for any given subsequence as well. Similarly for the remaining term, where we suppress the dependence on the chosen subsequence. Thus the subsequence converges in distribution to N (0, 1) and, by Lemma 10, so does the overall sequence.
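The incomplete-U-statistic construction used here, averaging the kernel over K randomly drawn subsamples of size N train instead of all (N choose N train ) subsets, can be illustrated with a toy kernel. In the sketch below, the unbiased sample variance stands in for the expensive classifier-based kernel h Ntrain ; names and parameter values are ours.

```python
import numpy as np

def incomplete_u_statistic(data, h, m, K, seed=0):
    """Average the kernel h over K random subsamples of size m drawn
    without replacement -- a feasible stand-in for the complete
    U-statistic that averages over all binom(N, m) subsets."""
    rng = np.random.default_rng(seed)
    N = len(data)
    vals = [h(data[rng.choice(N, size=m, replace=False)]) for _ in range(K)]
    return float(np.mean(vals))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=500)       # population variance = 4
# toy kernel: the unbiased sample variance of each subsample
est = incomplete_u_statistic(x, lambda s: s.var(ddof=1), m=20, K=2000)
```

Each subsample evaluation is unbiased for the target expectation, so the incomplete average inherits unbiasedness while only costing K kernel evaluations.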
Proof From Theorem 1 and the assumption that ζ̂ Ntrain,Ntrain is a consistent estimator, the claim follows.

Appendix B. Further Simulations
Additional simulation examples can be found in the next three subsections. We will now turn to more complex examples, where changes in the marginals alone are not as easy, or even impossible, to detect.

Appendix B.0.2. Changing the Dependency Structure
The previous example focused only on cases where the changes in distribution can be observed marginally.
For these examples, it would in principle be enough to compare the marginal distributions to detect the difference between Q and P . An interesting class of problems arises when we instead leave the marginal distributions unchanged but change the dependency structure when moving from P to Q. We will hereafter study two examples: the first concerns a simple change from a multivariate Gaussian with independent components to one with nonzero correlation; the second again takes P to have independent Gaussian components, but induces a more complex dependence structure on Q via a t-copula. Thus, for what follows, we set P = N (0, I d×d ). In case (I), some of the competing tests end up suffering from the decrease in sample size. ME-full, on the other hand, which suffers the same drawback, manages to put up a very strong performance, on par with the hypoRF. This is all the more impressive keeping in mind that the ME is a test that scales linearly in N . Case (II) can be seen in Figure B.12. Again, the resulting "sparsity" is beneficial for our test, with the hypoRF now being on par with the powerful MMD test, and with ME-full only slightly above the Binomial test.
In the second example, we study a change in dependence which is more interesting than the simple change of the covariance matrix. In particular, Q is now given by a distribution that has standard Gaussian marginals bound together by a t-copula; see e.g., [35] or [36, Chapter 5]. While the density and cdf of the resulting distribution Q are relatively complicated, it is simple and insightful to simulate from this distribution, as described in [35]: Let x ↦ t ν (x) denote the cdf of a univariate t-distribution with ν degrees of freedom, and T ν (R) the multivariate t-distribution with dispersion matrix R and ν degrees of freedom.
We first simulate from a multivariate t-distribution with dispersion matrix R and ν degrees of freedom, to obtain T ∼ T ν (R). In the second step, we simply set Y j := Φ −1 (t ν (T j )), j = 1, . . . , d, where Φ −1 is the standard Gaussian quantile function, so that each marginal of Y is standard Gaussian. What kind of dependency structure does Y have? It is well known that T ∼ T ν (R) has the representation T = N/√ G, with N ∼ N (0, R) and G ∼ Gamma(ν/2, ν/2) independent of N. As such, the dependence induced in T, and therefore in Q, is dictated through the mutual latent random variable G. It persists even if R = I d×d and induces more complex dependencies than mere correlation. These dependencies are moreover stronger the smaller ν, though this effect is hard to quantify. One reason this dependency structure is particularly interesting in our case is that it spans more than two columns, contrary to correlation, which is an inherently bivariate property. We again study case (I), with all d components tied together by the t-copula, and case (II), with only the first c = 20 < d components having a t-copula dependency, while the remaining d − c = 180 columns are again independent N (0, 1). The results for case (I) are shown in Figure B.13. Now our tests, together with ME-full, cannot compete with CPT-RF, MMD and MMD-full. However, for the ME-full this very much depends again on the hyperparameters chosen; for some settings ME-full was as good as MMD-full, though there appears to be no clear way to determine this a priori. Both MMD-based tests manage to stay at a power of almost one, even for ν = 8, which is an extremely impressive feat. The CPT-RF test falls behind the two MMD-based tests, but still has an impressively high power compared to our hypoRF test. Our best test, on the other hand, loses power quickly for ν > 4, while the Binomial test does so even for ν > 2. The results for case (II), shown in Figure B.14, are similarly insightful. Given the difficulty of this problem, it is not surprising that almost all of the tests fail to have any power for ν > 3. The exception is once again the MMD, performing incredibly strongly up to ν = 5.
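The two simulation steps can be sketched directly. This is a minimal NumPy/SciPy sketch: the composition Φ −1 ∘ t ν applied coordinate-wise is the standard construction and makes each marginal exactly N (0, 1); function name and defaults are ours.

```python
import numpy as np
from scipy import stats

def sample_t_copula_gaussian_marginals(n, d, nu, R=None, seed=0):
    """Simulate from Q: standard Gaussian marginals tied by a t-copula.

    Step 1: T ~ T_nu(R) via T = N / sqrt(G), with N ~ N(0, R) and
            G ~ Gamma(nu/2, rate nu/2) independent of N.
    Step 2: Y_j = Phi^{-1}(t_nu(T_j)), which makes every marginal
            exactly N(0, 1) while keeping the t-copula dependence.
    """
    rng = np.random.default_rng(seed)
    R = np.eye(d) if R is None else R
    Z = rng.multivariate_normal(np.zeros(d), R, size=n)
    G = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=(n, 1))  # scale = 1/rate
    T = Z / np.sqrt(G)
    return stats.norm.ppf(stats.t.cdf(T, df=nu))

Y = sample_t_copula_gaussian_marginals(600, 200, nu=2)
```

Even with R = I, all coordinates of a row share the same latent G, which is exactly the joint dependence described above.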
The performance of MMDboot is interesting not only in that it beats our tests, but also in how it beats all other kernel approaches. In particular, MMD-full stands no chance, which again is likely due, in part, to the reduced sample size it has available for testing compared to MMDboot. Though hard to generalize, it appears from this analysis that detecting a complex, rather weak dependence is a job best done by the plain MMDboot.

Appendix B.0.3. Multivariate Blob
A well-known difficult example is the "Gaussian Blob", an example where "the main data variation does not reflect the difference between P and Q" [12]; see e.g., [12] and [14]. We study here the following generalization of this idea: Let T ∈ N, µ = (µ t ) T t=1 , µ t ∈ R d , and Σ = (Σ t ) T t=1 , with Σ t a positive definite d × d matrix. We consider the equal-weight mixture of the Gaussians N (µ t , Σ t ), t = 1, . . . , T . For µ, we will always use a baseline vector w with c elements, and include in µ all possible enumerations of choosing d elements from w with replacement. This gives a total number of T = c^d possibilities, and each µ t ∈ R d is one such enumeration. For example, if c = d = 2 and w = (1, 2), then we may set µ 1 = (1, 1), µ 2 = (2, 2), µ 3 = (1, 2), µ 4 = (2, 1). We will refer to each element of this mixture as a "Blob" and study two experiments where we change the covariance matrices Σ t of the blobs when changing from P to Q, i.e., using Σ t,X for P and Σ t,Y for Q. Obviously, it quickly gets infeasible to simulate from N (µ, Σ) naively, as with increasing d the number of blobs explodes. Though, as shown below, this difficulty can be circumvented when Σ t is diagonal for all t.
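The enumeration of the c^d blob centers is a plain Cartesian product; a minimal sketch (the ordering of the centers differs from the text's example, but the set of centers is the same):

```python
import itertools
import numpy as np

def blob_means(w, d):
    """All T = len(w)**d blob centers: every way of filling d
    coordinates with elements of the base vector w, with replacement."""
    return np.array(list(itertools.product(w, repeat=d)))

mu = blob_means((1, 2), 2)   # c = d = 2: the four centers from the text
```

The combinatorial explosion is immediate: for w = (1, 2, 3) and d = 200 there would be 3^200 centers, which is why the diagonal case below matters.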
The example also considerably worsens the curse of dimensionality, as even for small d the number of observations in each Blob is likely to be very small. Thus, with 300 observations per sample, we have a rather difficult example at hand.
We will subsequently study two experiments. The first one takes w = (1, 2, 3), Σ 1,X = Σ 2,X = . . . = Σ T,X = I d×d and Σ 1,Y = Σ 2,Y = . . . = Σ T,Y = Σ, a correlation matrix with nonzero elements on the off-diagonal. In particular, we generate Σ randomly at the beginning of the S trials for a given d, such that (1) it is a positive definite correlation matrix and (2) its ratio of minimal to maximal eigenvalue is bounded from below. For d = 2, this corresponds to the original Blob example as in [12], albeit with a less strict bound on the eigenvalue ratio. The resulting distribution for d = 1 and d = 2 is plotted in Figure B.15. Table B.1 displays the result of the experiment with our usual set-up and a variation of d = 2, 3 and the number of blobs being 2^d and 3^d. Very surprisingly, our hypoRF test is the only one displaying notable power throughout the example. MMD and MMD-full are not able to detect any difference between the distributions with this sample size. Interestingly, the ME test, which we would have expected to work well in this example, is also only at the level. However, this again depends on the specification chosen for the hyperparameters of the optimization. For another parametrization, we obtained a power of 0.116 for d = 2, blobs = 2^2 and 0.082 for d = 2 and blobs = 3^2, all other values being at the level.
The second experiment takes w = (−5, 0, 5) and, for all t, Σ t,X , Σ t,Y diagonal and generated similarly to µ. That is, we take Σ t,X = diag(σ 2 t,X ), where each σ t,X is a vector including d draws with replacement from a base vector v X with c elements, and analogously with Σ t,Y . In this case, it is possible to rewrite P and Q as products of d univariate c-component mixtures, P = P X ⊗ . . . ⊗ P X and Q = P Y ⊗ . . . ⊗ P Y . As such, it is feasible to simulate from P and Q, even for large d, by simply simulating d times from P X and P Y . We consider w = (−5, 0, 5) and the standard deviations (v 1,X , v 2,X , v 3,X ) = (1, 1, 1) and (v 1,Y , v 2,Y , v 3,Y ) = (1, 2, 1). The change between the distributions is subtle even in notation; only the standard deviation of the middle mixture component is changed from 1 to 2. This has the effect that the middle component gets spread out more, causing it to melt into the other two. The resulting distribution for d = 1 and d = 2 is plotted in Figure B.16. Unsurprisingly, P looks quite similar to that in Figure B.15. The marginal plots (d = 1) appear to be very different, though this is only an effect of having centers (−5, 0, 5) instead of (1, 2, 3). On the other hand, while not clearly visible, it can be seen that the different blobs of Q display different behavior in variance; every Blob in positions (2, 1), (2, 2), (2, 3), (1, 2), (3, 2) on the 3 × 3 grid has its variance increased.
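Thanks to this product form, simulation reduces to drawing each coordinate independently from a univariate three-component mixture; a sketch (the function name is ours, and the Q standard deviations follow the text's description of changing the middle component's standard deviation from 1 to 2):

```python
import numpy as np

def sample_blob_product(n, d, centers=(-5.0, 0.0, 5.0),
                        sds=(1.0, 1.0, 1.0), seed=0):
    """n draws from the product of d iid univariate equal-weight
    Gaussian mixtures with the given centers and standard deviations.

    Because every Sigma_t is diagonal and generated coordinate-wise,
    the mixture of c**d blobs factorizes coordinate-wise, so we never
    need to enumerate the blobs.
    """
    rng = np.random.default_rng(seed)
    comp = rng.integers(len(centers), size=(n, d))  # mixture component per coordinate
    return rng.normal(np.asarray(centers)[comp], np.asarray(sds)[comp])

X = sample_blob_product(300, 200)                       # from P
Y = sample_blob_product(300, 200, sds=(1.0, 2.0, 1.0))  # from Q
```

Each coordinate is drawn in O(1), so the cost is O(nd) regardless of the 3^200 blobs the joint mixture nominally has.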
The results of the simulations are seen in Figure B.17. The Binomial, CPT-RF and hypoRF tests display a power quickly increasing with dimension, regardless of the decreasing number of observations in each Blob. This also holds true, to a lesser degree, for the ME-full, which, due to its location optimization, appears to be able to adapt to the problem structure. However, its power considerably lags behind the RF-based tests. In contrast, the behavior of the MMD-based tests quickly deteriorates as the number of samples per Blob decreases. Indeed, from a kernel perspective, all points have more or less the same distance from each other, whether they are coming from P or Q. Thus the extreme power of the MMD to detect "joint" changes in the structure of the data (i.e., dependency changes) cements its downfall here, as it is unable to detect the marginal difference.
This example might appear rather strange; it has the flavor of a mathematical counterexample, simple or even nonsensical at the outset, but proving an important point: While the differences between P and Q are obvious to the naked eye if only one marginal each is plotted with a histogram, the example manages to completely fool the kernel tests (under a Gaussian kernel at least). As such, it is not only a demonstration of the merits of our test but also a way of fooling very general kernel tests. It might be interesting to find real-world applications where such a data structure is likely to occur.