Global and Local Two-Sample Tests via Regression

Two-sample testing is a fundamental problem in statistics. Despite its long history, there has been renewed interest in this problem with the advent of high-dimensional and complex data. Specifically, in the machine learning literature, there have been recent methodological developments such as classification accuracy tests. The goal of this work is to present a regression approach to comparing multivariate distributions of complex data. Depending on the chosen regression model, our framework can efficiently handle different types of variables and various structures in the data, with competitive power under many practical scenarios. Whereas previous work has been largely limited to global tests which conceal much of the local information, our approach naturally leads to a local two-sample testing framework in which we identify local differences between multivariate distributions with statistical confidence. We demonstrate the efficacy of our approach both theoretically and empirically, under some well-known parametric and nonparametric regression methods. Our proposed methods are applied to simulated data as well as a challenging astronomy data set to assess their practical usefulness.


Introduction
Given two distributions P 0 and P 1 on R D , the global two-sample problem is concerned with testing H 0 : P 0 = P 1 versus H 1 : P 0 = P 1 , based on independent random samples from each distribution. This fundamental problem has a long history in statistics and has been well-studied in a classical setting (see, e.g., Thas, 2010). Recently, however, there has been renewed interest in this field as modern data we encounter have become more complex and diverse. Traditional approaches, which focus on low-dimensional and Euclidean data, often fail or are not easily generalizable to highdimensional and non-Euclidean data. Additionally, some recent developments in high-dimensional two-sample testing are limited to simple alternatives such as location and scale differences (see, Hu and Bai, 2016, for a recent review). In this context, there is a need to develop a new tool for the two-sample problem that can efficiently handle complex data and can detect differences beyond location and scale alternatives.
When the null hypothesis of the global two-sample test is rejected, it is often valuable (for e.g. scientific discovery, calibration of simulation models, and so on) to further explore how the two distributions are different. Specifically, as a follow-up study to the global test, one might wish to identify locally significant regions where the two distributions differ. This topic, which we refer to as the local two-sample problem, has been studied by Duong (2013) who uses kernel density estimators to identify local differences between two density functions. However, the kernel density approach may perform poorly when distributions are not in a low-dimensional Euclidean space, and hence another tool is needed for more challenging settings.
The goal of this work is to develop a general framework for both global and local two-sample problems that overcomes the aforementioned challenges. Specifically, we aim to design a two-sample test that can efficiently handle different types of variables (e.g. mixed data types) and various structure (e.g. manifold, irrelevant covariates) in the data. Consequently, the resulting test can have substantial power for a variety of challenging alternatives. We achieve our goal by connecting the two-sample problem to a regression problem as follows. Let f 0 and f 1 be density functions of P 0 and P 1 with respect to a common dominating measure. We view f 0 and f 1 as conditional densities f (x|Y = 0) and f (x|Y = 1) by introducing an indicator random variable Y ∈ {0, 1}. Then by Bayes' theorem, the hypothesis H 0 : f 0 (x) = f 1 (x) for all x ∈ S = {x ∈ R D : f 0 (x) + f 1 (x) > 0} can be verified to be equivalent to the hypothesis that involves the regression function: (1) We state the corresponding global and local alternative hypotheses as H 1 : P(Y = 1|X = x) = P(Y = 1), for some x ∈ S, and H 1 (x) : P(Y = 1|X = x) = P(Y = 1), at fixed x ∈ S, respectively. Motivated by the above reformulation, we propose a testing procedure that measures an empirical distance between the regression function P(Y = 1|X = x) and the class probability P(Y = 1). We refer to this approach as the regression test. Depending on the choice of regression method, the regression test can adapt to nontraditional data settings. As we shall see, the power of the test is closely related to the mean square error of the chosen regression estimator. In addition, by choosing a nonparametric regression method, the global regression test can be sensitive to general alternatives beyond location and scale differences. We will demonstrate the benefits of the regression test with both theoretical and empirical results.

Motivating Example
We motivate our approach by comparing multivariate distributions of galaxy morphologies, but the proposed framework benefit other areas of science and technology as well (involving, e.g., outlier detection, calibration of simulation models, and comparison of cases and controls). A galaxy's morphology is the organization of a galaxy's light, as projected into our line of sight and observed at a particular wavelength as a pixelated image. Morphological studies are key to understanding the evolutionary history of galaxies and to constraining theories of the Universe; see e.g. Conselice (2014) for a review. So far astronomers have only been able to study one or two morphological statistics (or projections of these) at a time instead of an entire ensemble. The reason is a lack of tools for effectively comparing and jointly analyzing multivariate or high-dimensional data in their native spaces. A global hypothesis test with a binary reject yes/no answer is also not informative enough to explain how two distributions are different in a multivariate feature space.
We illustrate the efficacy of the proposed global and local testing framework on the morphology statistics of two galaxy populations with high and low star-formation rate (SFR), respectively. The challenge here is not only that the problem involves multivariate data, but also that some of the morphological statistics are mixed discrete and continuous type with heavy outliers. We efficiently handle this issue by building on the success of random forests regression. The visualized local two-sample result is shown in Figure 1 and the details of the analysis can be found in Section 6. : Result of local two-sample test of differences between high-and low-SFR galaxies in a sevendimensional morphology space. The red squares indicate regions where the density of low-star-forming galaxies are significantly higher, and the blue circles indicate regions in morphology space that are dominated by high-star-forming galaxies; the gray crosses represent insignificant test points. The galaxies are embedded in a two-dimensional diffusion space for visualization purposes only (see Appendix B for details); Ψ 1 and Ψ 2 here denote the first two coordinates.

Related Work
In recent years, several attempts have been made to connect binary classification with two-sample testing. The main idea of this approach is to check whether the accuracy of a binary classifier is better than chance level and reject the null if the difference is significant. Such an approach, referred to as an accuracy or classification test, was conceptualized by Friedman (2003) and has since been investigated by several authors (Ojala and Garriga, 2010;Olivetti et al., 2015;Kim et al., 2016;Rosenblatt et al., 2016;Gagnon-Bartsch and Shem-Tov, 2016;Lopez-Paz and Oquab, 2016;Hediger et al., 2019). In the same manner as our regression framework, a key strength of the accuracy test is that it offers a flexible way for the two-sample problem as it can utilize any existing classification procedure in the literature. However, the classification accuracy framework is not easily converted to a local two-sample test. In addition, many classifiers are estimated by dichotomizing regression estimators and the discrete nature of such classifiers may result in a less powerful test (see Section 5.2 and other simulation results).
For the global two-sample test, our framework can be viewed as an instance of goodness-of-fit testing for regression models (e.g. González-Manteiga and Crujeiras, 2013, for a review). There is a substantive literature on this topic including Hardle and Mammen (1993), Weihrather (1993), González-Manteiga and Cao (1993), Zheng (1996), Zhang and Dette (2004), Hart (2013) and among others. This line of work typically concentrates on comparing differences between parametric (e.g. linear regression) and nonparametric (e.g. kernel regression) fits from an asymptotic point of view. For example, Hardle and Mammen (1993) consider the squared deviation between a parametric regression estimator and a kernel estimator. They show that their test statistic converges to a normal distribution under the null hypothesis and justify the use of the wild bootstrap procedure. However, this type of asymptotic approach is challenging to analyze beyond kernel-type estimators and often requires strong technical assumptions. In contrast, our framework is designed to compare any type of regression estimators with a specific constant fit by building upon the permutation principle. Hence the resulting test is valid in any finite sample sizes. Moreover we present a unified framework of studying the power of the regression test by taking advantage of existing results on the estimation error.
For the local two-sample test, our approach has similarities to independent work by Cazáis and Lhéritier (2015) who estimate the Kullback-Leibler divergence between P(Y = 1|X = x) and P(Y = 1). Our procedure, however, identifies locally significant areas with statistical confidence whereas Cazáis and Lhéritier (2015) graphically decide a threshold for the significance.

Overview of this paper
We outline the paper as follows: In Section 2, we introduce the proposed metrics, test statistics and algorithms for the global and local regression tests. In Section 3, we study theoretical properties of the global regression test. We begin by considering a simple scenario where two populations only differ in their means in Section 3.1. In this scenario, we show that the regression test based on Fisher's linear discriminant analysis (LDA) achieves the same local optimality as the Hotelling's T 2 test. Moving on to general regression settings in Section 3.2, we establish a connection between the testing error of the global regression test and the mean integrated square error (MISE) of the regression estimator. In Section 4, we turn to the local two-sample problem and investigate general properties of the local regression tests. In Section 4.1, we describe the testing error of the local regression test in terms of the mean square error (MSE) of the regression estimator. We further establish an optimality of the local regression tests over the Lipschitz class from a minimax point of view in Section 4.2. When data have intrinsic dimension, we show that the performance of the local regression tests based on kNN or kernel regression only depends on intrinsic dimension in Section 4.3. Section 4.4 studies the limiting distribution of the local permutation statistic to avoid a high computational cost from permutations for large sample size. In Section 5, simulation studies are provided to illustrate finite sample performance of the global and local regression tests. In Section 6, we apply the proposed approach to a problem in astronomy and demonstrate its efficacy. All the proofs are deferred to Appendix A.
Notation. Throughout this paper, we denote the class probabilities P(Y = 0) and P(Y = 1) by π 0 and π 1 , respectively, and write the joint distribution of (X, Y ) by π 0 [P 0 × δ 0 ] + π 1 [P 1 × δ 1 ] where δ k denotes the point mass at k for k = 0, 1. We denote the corresponding conditional probability P(Y = 1|X = x) by m(x), which can be explicitly written as We use P X (·) to denote the marginal probability measure of X and ||Z|| 2 denotes the Euclidean norm of a vector Z ∈ R D . The symbols p −→ and d −→ stand for convergence in probability and in distribution, respectively. We use a n b n if there exists C > 0 such that a n ≤ Cb n for all n.
Similarly, a n b n if there exist constants C, C > 0 such that C ≤ |a n /b n | ≤ C for all n. As convention, the acronym i.i.d. is used to represent independent and identically distributed.

Metrics
A common metric for comparing two distributions is the difference between two density functions f 0 (x) and f 1 (x); this metric has been used for global and local two-sample testing by Anderson et al. (1994) and Duong (2013). Another natural metric, suggested for global two-sample testing by Keziou and Leoni-Aubin (2005), Fokianos (2008) and Sugiyama et al. (2011), is the density ratio f 1 (x)/f 0 (x). Although both the density difference and density ratio metrics are intuitive, there are several weaknesses associated with each of them. For example, the estimation of a density difference is largely limited to kernel density estimators, which are sensitive to the curse of dimensionality. The density ratio, on the other hand, could potentially be estimated using various regression methods thanks to the following reformulation: (see, e.g., Qin and Zhang, 1997). The main weakness of the ratio approach, however, is that the ratio is highly sensitive to the tail behavior of distributions, and it is not even well defined when m(x) = 1. To overcome these limitations, we propose an alternative approach which instead compares the regression function with the class probability. More specifically, we consider as global and local measures of the discrepancy between two distributions where we assume that π 1 is a fixed constant within 0 < π 1 < 1 throughout this paper. By construction, both T global and T local (x) are bounded between zero and one. More importantly, we can take advantage of numerous existing regression methods (see, e.g., Friedman et al., 2009, for popular methods and descriptions) when estimating m(x). Hence, our approach maintains the flexibility of the density ratio approach while avoiding the problem of ill-defined quantities.

Test Statistics and Algorithms
Suppose we observe n pairs of samples be an estimate of m(x) based on the samples, and π 1 = 1 n n i=1 I(Y i = 1). Then by plugging these statistics into (2), we define our global and local test statistics as The null distributions of the proposed test statistics are typically unknown, and they depend on the choice of regression method as well as the distribution of the data. Hence, to keep our framework as general as possible, we use a permutation procedure to set a critical value that yields a valid level α test for any given regression estimator under any sampling scheme given in Section 2.3. The proposed permutation framework for global and local two-sample testing are summarized in Algorithm 1 and Algorithm 2, respectively.
Algorithm 1: Global Two-Sample Testing via Permutations (4) Approximate the permutation p-value by (5) Reject the null hypothesis when p < α. Otherwise, accept the null hypothesis.
Algorithm 2: Local Two-Sample Testing via Permutations , number of permutations B, significance level α, a multiple testing procedure, a regression method.
(1) Calculate the test statistic T local (x j ) at the k test points.
(3) Repeat the previous step B times to obtain { T local (x j )} k j=1 . (4) Approximate the permutation p-value at each test point x j by (5) Apply a multiple testing procedure for controlling the FWER or the FDR at α level. (6) Return the significant local test points.

Sampling Schemes
In the two-sample problem, there are two common sampling schemes for obtaining the paired data Here we note that n is fixed in advance. Then n 1 = n i=1 I(Y i = 1) and n 0 = n − n 1 are Binomial(n, π 1 ) and Binomial(n, π 0 ), respectively. This setting is common in applications of supervised learning where the goal is to build a model that can successfully predict the class label Y given the feature vector X (e.g. Friedman et al., 2009). Our goal, on the other hand, is to test whether the two distributions P 0 and P 1 are the same or not by leveraging existing methods in the regression literature.
• Separate sampling. In the case of separate sampling, n 0 and n 1 are predetermined and they are not random. We then observe n 0 and n 1 independent sample points from P 0 and P 1 separately, which provides the data set We can link the separate sampling to the i.i.d. sampling scheme by randomly ordering the (X i , Y i ) pairs, so that the data points are exchangeable and for each i ∈ {1, . . . , n}, the conditional distribution of Y i given X i = x is m(x) = π 1 f 1 (x)/{π 1 f 1 (x)+π 0 f 0 (x)} where the class probability is given by π 1 = n 1 /n. Therefore, although the joint distributions of {(X i , Y i )} n i=1 are different under i.i.d. and separate sampling schemes, they share the same regression function.
Remark 2.1. These two sampling schemes are also known as prospective sampling and retrospective (or case-control) sampling, respectively, and their relationships have been studied in different contexts. For example, it has been shown that the logistic slope estimates have similar behaviors under both sampling schemes (see, e.g. Anderson, 1972;Prentice and Pyke, 1979;Carroll, 1993, 1999;Bunea and Barbu, 2009). This result has been extended to general regression models by Scott and Wild (2001).

Global Two-Sample Tests via Regression
The choice of regression method in our framework will ultimately decide whether we achieve competitive statistical power. In Section 3.1, we illustrate the point that the global regression test can be optimal if we choose a suitable regression method. For this theoretical purpose, we focus on the regression test based on Fisher's LDA and show its optimality. In Section 3.2, we turn our attention to more general regression settings and characterize the testing error of the global regression test in terms of the mean integrated square error (MISE) of the regression estimator.

Fisher's Linear Discriminant Analysis
In this section, we consider a simple scenario of two sample normal mean to highlight the difference between our approach and the classification accuracy approach. In particular, we prove that the regression test based on Fisher's LDA achieves the same local power as Hotelling's T 2 test. This result has significance given that i) Hotelling's test is optimal under the considered scenario and ii) the classification accuracy test based on Fisher's LDA is usually underpowered (Kim et al., 2016;Rosenblatt et al., 2016). To facilitate comparison with the previous results, which are established under separate sampling, we also consider the case where n 0 and n 1 are predetermined throughout this subsection.
where n = n 0 + n 1 . The two-sample problem then becomes the problem of testing for mean differences as For this particular problem, Fisher's LDA is a natural choice for regression, assuming normality and equal class covariances. Let µ i be the sample mean vector for each group, S be the covariance matrix of the combined samples, i.e. S = n −1 n i=1 (X i − µ)(X i − µ) where µ = n −1 n i=1 X i . Then, by putting π 1 = n 1 /n, the regression estimator based on Fisher's LDA is given by One of the most popular test statistics for testing (4) is Hotelling's T 2 statistic, which yields optimal power for the normal means problem (see, e.g. Anderson, 2003). For the two-sample problem, Hotelling's T 2 statistic is defined by where S p is the pooled covariance matrix, that is On the other hand, the regression test statistic based on Fisher's LDA is given by The next theorem provides a connection between the seemingly unrelated T LDA and T 2 Hotelling statistics. Specifically, it shows that nπ −1 0 π −1 1 T LDA is asymptotically identical to Hotelling's T 2 statistic under the null. It is also worth pointing out that the theorem still holds without the normality assumption.
Theorem 3.1. Let {X i,0 } n 0 i=1 and {X i,1 } n 1 i=1 be random samples under separate sampling from two multivariate distribution with the mean vectors µ 0 and µ 1 , respectively, and the same covariance matrix Σ. Assume the pooled samples are mutually independent and the third moments of X 1,0 and X 1,1 are finite. Suppose that S p and S satisfy S −1 p = Σ −1 (1 + o P (1)) and S −1 = Σ −1 (1 + o P (1)). Then, under H 0 : µ 0 = µ 1 , it holds that Therefore, where χ 2 D is the chi-squared distribution with D degrees of freedom.
Let us now turn to the alternative hypothesis. To begin with, we consider a family of probability functions that satisfy the following smoothness condition.
Such q.m.d. families include fairly large parametric models such as exponential families in natural form. For our purpose, we focus on location q.m.d. families, denoted by {P µ , µ ∈ Ω}. Specifically, P µ is a member of {P µ , µ ∈ Ω} if its density satisfies f µ (x) = f (x − µ) for which f (x) has zero mean and covariance matrix Σ. Next, for given P µ 0 and P µ 1 from {P µ , µ ∈ Ω}, let us consider the local alternative where h = (h 1 , . . . , h D ) . Then, under H 1,n , T LDA has asymptotic behavior as follows.
∼ P µ 1 where P µ i is a member of the location q.m.d. family with the same covariance matrix Σ and finite third moments. Suppose that S p and S satisfy S −1 p = Σ −1 (1 + o P (1)) and S −1 = Σ −1 (1 + o P (1)). Under the sequence of local alternatives given in (8), we have where χ 2 D (λ) denotes a noncentral chi-square distribution with D degrees of freedom and the noncentral parameter The results from Theorem 3.1 and Theorem 3.2 imply that our regression test based on T LDA has the same asymptotic local power as Hotelling's T 2 test. As a result, the regression test based on T LDA is asymptotically optimal against the local alternatives as Hotelling's T 2 test.
To illustrate the main point of this section, we compare the performance of T LDA with Hotelling's T 2 test through Monte Carlo simulations. We randomly generate n 0 = n 1 = 100 samples from N ((0, . . . , 0) , I D ) and N ((µ, . . . , µ) , I D ), respectively and set µ 2 = 0.05 for D = 5 and µ 2 = 0.01 for D = 20. We also consider two versions of the accuracy-based tests via Fisher's LDA: the in-sample (re-substitution) accuracy and the two-fold cross-validated accuracy. To calculate the cross-validated accuracy, we use the balanced sample splitting scheme in which the first part of data is used to train the LDA, and the second part is used to estimate the accuracy of the classifier (see, Definition 1 and 2 of Rosenblatt et al., 2016, for more details). To make a fair comparison, the critical values of the given tests were all decided by the permutation procedure. As shown in Figure  2, the regression test based on T LDA has comparable power to Hotelling's T 2 test that coincides with our theory. On the other hand, the accuracy tests have less power than Hotelling's T 2 test.

The MISE and Testing Error for Global Regression
We now turn to more general regression settings and investigate general properties of the global regression test in both separate and i.i.d. sampling cases. Let M be a certain class of regression m(x) : S ⊆ R D → [0, 1] containing constant functions. Suppose that we have a regression estimator m(x) that has the mean integrated square error as where C 0 is a positive constant and δ n = o(1). In the case of i.i.d. sampling, we further assume δ n ≥ n −1 , which is typical for nonparametric regression estimators. Our main interest here is in employing the above MISE to characterize the testing error of the global regression test. Note that the plug-in global statistic in (3) is typically a biased estimator of the MISE and the bias differs from case to case. To simplify our analysis, we consider sample splitting where the half of data is used to estimate the regression function and the other is used to evaluate the empirical squared error. In detail, given samples (X 1 , Y 1 ), . . . , (X 2n , Y 2n ), the regression test statistic based on (random) sample splitting is defined by where m(·) and π 1 are calculated based on the first half of the data {(X 1 , Y 1 ), . . . , (X n , Y n )}. In the case of separate sampling, we assume a random ordering in the entire data set and similarly split it into two parts but with the additional restriction that class probabilities are the same in both parts. Based on T global , we argue that for sufficiently large C 1 > 0 and n, the testing error of the global regression test can be arbitrarily small against the class of global alternatives given by Note that since π 1 is assumed to be fixed, the regression function m(x) is completely determined by f 0 and f 1 . Thus in the following theorem and hereafter, we use the notation for all x ∈ S. With this notation in hand, we state the main theorem of this subsection.
Theorem 3.3. Consider the case of i.i.d. sampling or separate sampling. In each case, suppose that we have a regression estimator m(·) satisfying (9). Let t α be the upper α quantile of the permutation distribution of T global based on m(·) where we permute the first half of labels. For fixed α ∈ (0, 1) and β ∈ (0, 1 − α), we assume that there exists a positive constant C 0,α such that Then there exist positive constants C 1 and N depending on C 0 , C 0,α , α, β such that Theorem 3.3 uses the assumption that the permutation critical value of the regression test is uniformly bounded by δ n (up to some constant factor) with high probability. We end this subsection with a class of regression estimators, which satisfy this assumption. Let us consider a class of regression estimators with the following representation: In addition, we assume that w i (x) is a function of {X 1 , . . . , X n } but not {Y 1 , . . . , Y n }. This class of estimators, often called linear smoothers, contains many popular regression methods such as k-nearest neighbor (kNN) regression, kernel regression and local polynomial regression. Focusing on linear smoothers, we provide the following corollary.
Corollary 3.1. Consider the case of i.i.d. sampling or separate sampling. In each case, let T global be the global regression test statistic in (10) based on a linear smoother m(·) with the property in (9). Let t α be the upper α quantile of the permutation distribution of T global where we permute the first half of labels. Then for fixed α ∈ (0, 1) and β ∈ (0, 1 − α), there exist positive constants C 1 and N depending on C 0 , α, β such that • Type I error: sup

Examples
In the case of i.i.d. sampling, the convergence rate δ n of commonly used regression estimators have been well-established and these results can be directly used to study the testing error of the global regression test. We list several known results here. More examples can be found in Györfi et al. (2002), Tsybakov (2009) and Devroye et al. (2013).
• Kernel regression. Kernel regression estimators also achieve the converge rate as δ n = n −2/(2+D) for Lipschitz continuous functions and more generally as δ n = n −2β/(2β+D) for a Hölder space with smooth parameter 0 < β ≤ 1.5 (Györfi et al., 2002). The adaptivity of kernel regression to the intrinsic dimension has been proved by Kpotufe and Garg (2013). Following their results, the convergence rate becomes δ n = n −2/(2+d) n −2/(2+D) when there exists a low-dimensional structure in the data.
• Local polynomial regression. Let M be a Sobolev space with smoothness α. Then local polynomial regression estimators has the convergence rate as δ n = n −α/(α+d) where d is manifold dimension smaller than the original dimension D (Bickel and Li, 2007).
• Random forests regression. For Lipschitz continuous functions, Biau (2012) shows that the random forest estimator converges at rate δ n = n − 0.75 s log 2+0.75 where s is the number of the relevant features. Hence, the convergence rate of the random forests becomes faster than n −2/(2+D) when s ≤ D/2 under certain conditions. Wager and Walther (2015) use the guess-and-check forest algorithm to show that the convergence rate of the random forest is To the best of our knowledge, there has been no detailed investigation of the regression estimation error under separate sampling. In this case, we cannot directly take advantage of existing results on regression. However, as the sample size becomes larger, the difference between i.i.d. sampling and separate sampling becomes minor. Hence we expect that a reasonable regression estimator behaves similarly under both sampling schemes in large sample sizes, while a detailed analysis is necessary in future work. It is also worth noting that for certain regression methods, consistency results are not significantly affected by sampling scheme. For example, the consistency theory for L 1 penalized regression relies mainly on the assumption about a design matrix, which can be fulfilled under both sampling schemes (Van de Geer, 2008;Bühlmann and Van De Geer, 2011). In such a case, the same convergence rate can be established under both sampling schemes.

Local Two-Sample Tests via Regression
The global two-sample test only answers the question whether two distributions are different, whereas in some applications, it would be more valuable to describe how these two distributions differ in a multivariate space. With this goal in mind, we now move on to the local two-sample problem and study general properties of the local regression test.

The MSE and Testing Error for Local Regression
We start by establishing similar results in Section 3.2 for local regression tests. Given a local point x ∈ S of interest, suppose that a regression estimator has the mean square error such that where C 0,x is a positive constant and δ n,x = o(1). In addition, we assume δ n,x ≥ n −1 for i.i.d. sampling. Then the next theorem shows that for sufficiently large C 1,x and n, the local testing error based on the given regression estimator can be arbitrarily small against the class of local alternatives given by  (11). Let t α be the upper α quantile of the permutation distribution of T local (x). Then for fixed α ∈ (0, 1) and β ∈ (0, 1 − α), there exist positive constants C 1,x and N x such that • Type I error: sup Remark 4.1. Although Theorem 4.1 focuses on a linear smoother, the same conclusion holds for other regression methods as long as there exists a positive constant C 0,x,α such that the permutation critical value t α is bounded above by C 0,x,α δ n with high probability (see Theorem 3.3 for a more formal statement).
In order to keep things as simple and concrete as possible, we next focus on the Lipschitz class and analyze the optimality of the local regression tests from a minimax point of view. In the rest of this section (Section 4.2-4.4), we concentrate on i.i.d. sampling scheme to take full advantage of known regression results. However, as we discussed in Section 3.3, similar results are expected to hold under separate sampling as well.

Minimax Optimality over the Lipschitz Class
For a fixed constant L > 0, let us denote the Lipschitz function class by We also denote the collection of α level tests by Φ n,α = {φ : sup f 0 ,f 1 ∈M 0 P f 0 ,f 1 (φ = 1) ≤ α} and denote the class of Lipschitz local alternatives by (12) With this notation and fixed α ∈ (0, 1) and β ∈ (0, 1 − α), the minimum separation is defined by which is the smallest distance between m(x) and π 1 such that the power becomes nontrivial. Then a test is called minimax rate optimal if it has power uniformly over M Lip (δ n,x ) such that δ n,x δ n,x .
In this section, we will investigate minimax rate optimality of local regression tests over the Lipschitz class under i.i.d. sampling. First we formally state an upper bound for the local estimation error based on kNN and kernel regression in Example 4.1 and Example 4.2, respectively. We then use these results to obtain the upper bound for the minimum separation in Corollary 4.1.
Example 4.1 (kNN regression). For a fixed point x ∈ S, list the data by and assume that P(X ∈ B x, ) > τ x D where B x, is a ball of radius > 0 centered at x and τ x > 0. Then , and for k n = n 2/(2+D) , we have A similar result can be established for kernel regression estimators as follows.
Example 4.2 (Kernel regression). Given a kernel K : S → [0, ∞), the kernel regression estimator at a fixed point x is given by Assume there exists 0 < r < R and 0 < λ < 1 such that where B 0, is a ball of radius > 0 centered at the origin. Further assume that P( Remark 4.2. Example 4.1 and Example 4.2 are well-known and standard except that we keep track of the constant C 0,x over the Lipschitz class. Similar results exist in the literature but for slightly different settings. Hence, in Appendix A, we present detailed proofs for these two examples heavily building on Györfi et al. (2002). The proofs will also be used to study the performance of the kNN and kernel local regression tests under the existence of intrinsic dimension in Proposition 4.1.
From the previous examples together with Theorem 4.1, we conclude that the minimum separation in (13) satisfies δ n,x n −2/(2+D) . We summarize this result in the following corollary.
and the upper α quantile of the permutation distribution of each statistic by t α,kN N and t α,ker respectively. Suppose the conditions in Example 4.1 holds with k n = n 2/(D+2) . Then for fixed α ∈ (0, 1) and β ∈ (0, 1 − α), there exist positive constants C 1,x and N x such that • Type I error: sup On the other hand, under the conditions in Example 4.2 with h n = n −2/(2+D) and for fixed α ∈ (0, 1) and β ∈ (0, 1 − α), there exist positive constants C 1,x and N x such that • Type I error: sup As a result, the minimum separation satisfies δ n,x n −2/(2+D) .
Next based on the standard technique to lower bound the testing error (e.g., Ingster, 1987;Baraud, 2002), we establish a lower bound for the minimum separation by n −2/(2+D) δ n,x . This results matches with the upper bound in Corollary 4.1. Therefore, the tests in Corollary 4.1 are minimax rate optimal and cannot be improved.
Theorem 4.2 (Lower bound). For any given α ∈ (0, 1) and β ∈ (1 − α), there exists a constant Remark 4.3. In the context of two-sample testing, it is sometimes more natural to make smoothness assumptions on densities f 0 and f 1 rather than on the regression function. Here we briefly discuss how to translate the smoothness condition on f 0 and f 1 into a condition on the regression function.
Suppose that density functions f 0 and f 1 are uniformly bounded below by c > 0 (see, e.g. Yang and Barron, 1999, for a similar assumption). Then some algebra shows that In other words, if f 0 and f 1 are Lipschitz continuous (or more generally Hölder continuous), then the regression function is also Lipschitz continuous with a different Lipschitz constant. This means that our theoretical results will remain valid for the class of Lipschitz densities with the boundedness condition.

An Approach to Intrinsic Dimension
The previous results show that no test is uniformly powerful when the square distance between m(x) and π 1 is order of n −2/(2+D) ; therefore it demonstrates the typical curse of dimensionality. Suppose that data X ∈ S ⊆ R D has low intrinsic dimension d which is smaller than the original dimension D (e.g. manifold data). In this case, we would like to have a test whose performance only depends on intrinsic dimension and thus avoids the curse of dimensionality. For this purpose, we consider the homogeneous measure which captures local dimension of data.
Definition 4.1. (Definition 2 of Kpotufe, 2011) Fix x ∈ S ⊆ R D , and r > 0. Let C > 0 and 1 ≤ d < D. The probability measure P(·) is (C, d)-homogeneous on B x,r if we have P(X ∈ B x,r ) ≤ C −d P(X ∈ B x, r ) for all r ≤ r and 0 < < 1.
Using Definition 4.1, we reproduce Corollary 4.1 and show that the performances of the local kNN and kernel regression tests depend on the intrinsic dimension instead of the original dimension.
Proposition 4.1. Consider the same notations as in Corollary 4.1 and let x ∈ S ⊆ R D . Suppose the probability measure P(·) is (C, d)-homogeneous on B x,r . Then for the kNN regression test with k n = n 2/(2+d) and for any β ∈ (0, 1 − α), there exist positive constants C 1,x and N x such that • Type I error: sup On the other hand, for the kernel regression test with h n = n −2/(2+d) and for any β ∈ (0, 1 − α), there exist positive constants C 1,x and N x such that • Type I error: sup When the intrinsic dimension is unknown, one can employ a Bonferroni procedure to obtain the same results in Proposition 4.1. To illustrate the idea, let k n (i) = n −2/(i+2) for i = 1, . . . , D and denote the resulting kN N tests by φ i (α) = I(T are the kNN test statistic calculated with k n (i) and the corresponding α level permutation critical value, respectively. Then the final test is defined by φ max = max 1≤i≤D φ i (α/D). By using the union bound, it is easy to see that for certain C 1,x and N x . This shows that the Bonferroni test does not lose any power in terms of separation rate and it adapts to the unknown intrinsic dimension. Despite this theoretical guarantee, the Bonferroni approach should be used with caution in practice. Indeed the Bonferroni test might be too conservative since it does not take into account the dependency structure among φ 1 , . . . , φ D .
Remark 4.4. For simplicity, we illustrate our idea on the Lipschitz class which only requires a mild smoothness assumption. Nevertheless our results in Section 4.2-4.3 can be extended to a general function class such as Hölder class (e.g. Chapter 3.2 of Györfi et al., 2002) in a similar way. Indeed, all we need is a uniform bound for the MSE (11) over a general class, which can be found in the regression literature (See Section 3.3).

Limiting Distribution of Local Permutation Test Statistics
When the sample size is large, calculating the permutation distribution is time-consuming. Hence it would be useful to investigate the limiting distribution of the permutation statistic. Based on the combinatorial central limit theorem (e.g. Bolthausen, 1984), we show that the permutation distribution of our local test statistic converges to the chi-square distribution with one degree of freedom as the sample size tends to infinity.
holds and let Further let η = (η 1 , . . . , η n ) be a permutation of {1, . . . , n}. Then the permutation distribution of the one-side local regression statistic converges to the standard normal distribution as Here P η (·|X n ) is the uniform probability measure over permutations conditioned on (X 1 , Y 1 ), . . . , (X n , Y n ) and m η (x) = n i=1 w i (x)Y η i . Thereby, σ −2 n T local (x) converges to the chi-square distribution with one degree of freedom as We illustrate Theorem 4.3 using kNN and kernel regression and show that both σ −2 n T kN N (x) and σ −2 n T ker (x) converge to the chi-square distribution with one degree of freedom under appropriate conditions. (14) with

Corollary 4.2 (kNN regression). Consider the kNN estimator in
Then the permutation distribution of σ −2 n T kN N (x) converges to the chi-square distribution with one degree of freedom when n, k → ∞ and 2k < n. (15) and assume that sup t |K(t)| = K < ∞, K 2 (t)dt < ∞ and K h (t)dx = 1 where K h (t) = h −D K(t/h). Denote the density function of X by f (·). Assume that 0 < f (x) < ∞ and f (·) is twice differentiable at x. Further assume that nh D → ∞ and h → 0. Then the permutation distribution of σ −2 n T ker (x) converges to the chi-square distribution with one degree of freedom where σ 2 n is given in (18).

Simulations
In this section, we carry out simulation studies for global and local two-sample tests to examine the empirical performance of the proposed methods. Throughout our simulations, we focus on the separate sampling scenarios under which other existing two-sample tests are usually investigated. We begin by comparing the regression test based on random forests (Breiman, 2001) with other benchmark competitors in Section 5.1. Next in Section 5.2, we illustrate by an example that the classification accuracy tests can fail due to their discrete nature while the corresponding regression tests perform well. We also provide simulation results for the local regression test in Section 5.3 to validate our approach.

Random Forests Two-Sample Testing
Random forests have been proven to be a powerful tool for regression and classification problems in many application areas (see e.g., Hamza and Larocque, 2005;Díaz-Uriarte and De Andres, 2006;Cutler et al., 2007;Chen and Ishwaran, 2012). Despite the good performance of random forests in classification and regression problems, only a few works have applied these methods to statistical inference problems. To the best of our knowledge, only Gagnon-Bartsch and Shem-Tov (2016) and Hediger et al. (2019) use random forests for the two-sample problem. Now whereas Gagnon-Bartsch and Shem-Tov (2016) and Hediger et al. (2019) consider an accuracy test based on random forests, we propose a regression test based on random forests. The corresponding test statistic is given by where m RF is the regression estimator from the random forest algorithm. For our simulation study, we implement both the RF accuracy and regression tests with the randomForest package (version 4.6-12) in R with default options for the parameters. We found in our simulation study that the in-sample classification accuracy of random forests is typically one even under the null case; therefore, the resulting test has no power against any alternative. For this reason, we instead estimate the classification accuracy from out-of-bag samples (which is a default option provided by the randomForest package). Throughout this section, we denote the accuracy test statistic based on random forests by A RF .

Simulation Setting
Our simulations analyze two main settings. The first setting includes dense alternatives where the two distributions are different over a number of coordinates. The second setting, on the other hand, considers sparse alternatives where the two distributions differ in only a few coordinates. We carry out the simulations via the permutation procedure with 100 random permutations, repeated 300 times for all test statistics. The significance level is controlled at α = 0.05.
As a benchmark competitor, we consider the maximum mean discrepancy (MMD) test (Gretton et al., 2012) based on where k(x, y) is the Gaussian kernel with a bandwidth chosen by the median heuristic, i.e. k(x, y) = exp −||x − y|| 2 2 /σ median (see, Gretton et al., 2012, for details). We also consider the Energy test (Székely and Rizzo, 2004;Baringhaus and Franz, 2004) based on Energy n = 2 n 0 n 1

Simulation Results
Tables 1-4 summarize our simulation results. We see from Table 1 and 2 that MMD n and Energy n perform better than the regression test ( T RF ) and the accuracy test ( A RF ) against the dense normal location and scale alternatives. Indeed, MMD n and Energy n are known to be asymptotically optimal against the normal location alternative with the identity covariance matrix (Ramdas et al., 2015). However, they are both moment-based statistics, and hence sensitive to outliers. They are also based on the Euclidean metric. A major issue of the Euclidean and similar metrics is that they assign weights to the coordinates proportional to their scale without screening for irrelevant variables. Consequently, neither MMD n nor Energy n can properly deal with sparse alternatives, which explains their poor performance against the sparse location and scale alternatives. On the other hand, the base learner of the random forest algorithm is the decision tree. The usual splitting rule of decision trees is invariant to absolute values (see e.g., Chapter 9.2 of Friedman et al., 2009), which leads to robustness against outliers. Random forests also have the ability to handle sparse alternatives by randomly selecting a few variables during the tree-growing process. By averaging each tree, random forests eventually put more weight on informative variables. In general, T RF and A RF are comparable to or more powerful than MMD n and Energy n under the sparse location and scale alternatives. Finally, we note from our simulations that the regression test T RF exhibits higher power than the accuracy test A RF for the dense as well as the sparse alternatives.

A Comparison between Regression and Classification Accuracy Tests
As mentioned earlier, many classifiers are typically estimated by dichotomizing regression estimators. Depending on the alternative, this dichotomization can result in a less powerful accuracy test than the corresponding regression test. We specifically demonstrate this point by considering two commonly used nonparametric regression methods; namely, k-nearest neighbors regression and kernel regression.

Simulation Setting
Recall the kNN estimator and the kernel regression estimator in (14) and (15), respectively. Using these estimators, the global regression test statistics are given by Here we use the Euclidean distance to measure the pairwise distance between observations for kNN. On the other hand, we consider the Gaussian kernel with a diagonal bandwidth matrix with identical components h for kernel regression. The corresponding accuracy test statistics are respectively. For all tests, we reject the null hypothesis when the test statistic is larger than a permutation critical value.
where µ 0 = (0, . . . , 0) , µ 1 = (0.2, . . . , 0.2) , σ 2 0 = 1, and σ 2 1 = 1.2. Hence, there exist differences in both the location and scale parameters. We choose the sample sizes n 0 = n 1 = 50 and change the dimension from D = 5 to D = 75 by steps of 10. To compare the performance, we carry out the permutation test with 200 permutations, and the simulations are repeated 1,000 times to estimate the power of the test. We provide results for a range of different values of the tuning parameters: k = 5, 15, 25 for the k-NN regression, and h = 5, 15, 25 for the kernel regression.

Simulation Results
Simulation results are presented in Figure 3 and 4. From the results, it is seen that the regression tests consistently outperform the corresponding classification accuracy tests under the given scenario. The power of the accuracy tests even decreases with dimension, whereas the power of the regression tests steadily increases with dimension. The increase in power with dimension is desirable under this scenario because each coordinate presents evidence towards the alternative. The counter-intuitive result for the accuracy tests is due to the fact that the tests employ a dichotomized regression estimator. To explain it more clearly, we borrow some results from Mondal et al. (2015). First, it can be shown by the weak law of large numbers that for 1 ≤ i ≤ n 0 , 1 ≤ j ≤ n 1 , as D → ∞ while n 0 and n 1 are fixed. For the given example, we have σ 0 √ 2 < σ 2 0 + σ 2 0 + (µ 0 − µ 1 ) 2 < σ 1 √ 2, which implies that every instance is closer to an instance from the class Y = 0 than to other instances from the class Y = 1. In other words, the nearest neighbors of any observation are most likely to be from the class Y = 0. Note that both k-NN and kernel regression, explicitly or implicitly, use the Euclidean distance to calculate the proximity between two instances. Therefore, we observe with high probability that m kN N (X i ) and m KerR (X i ) are estimated as less than half and the dichotomized classifiers become Due to this dichotomization, A kN N and A KerR converge to the empirical class probability n 0 /n under the alternative, resulting in poor power performance. On the other hand, the regression tests based on T kN N and T ker can be powerful as long as m kN N (x) and m ker (x) significantly deviate from the class probability. This is indeed the case under the considered scenario and thus explains why the regression tests outperform the corresponding classification tests.

Toy Examples for Local Two-Sample Testing
Contrary to classification accuracy, our regression approach naturally leads to a local two-sample testing framework that provides further information on pointwise differences between two populations. We consider two toy examples to demonstrate the empirical performance of the local regression test. For the simulation study, we focus on the local kNN regression statistic in (16) with k n = n 2/(2+D) for the normal mixture example and k n = n 2/(2+d) for the manifold example. For both examples, we control the family-wise error rate (FWER) at α = 0.05 via the Hochberg step up procedure (Hochberg, 1988).

Manifold data example
In the second example, we create high-dimensional data with a low-dimensional manifold structure by generating edge images of size 16 × 16. Let x, y be integers on evenly spaced points between −30 and 30 that are 2 units apart. Hence the size of the domain of (x, y) becomes 16 × 16. Given two underlying parameters θ ∈ [−π, π] and ρ ∈ [−5, 5], an edge image is defined by For the simulation, we draw n 0 = n 1 = 100 samples from Here Ψ 1 and Ψ 2 denote the the first two coordinates of the diffusion map.
points into the two-dimensional diffusion space (see Appendix B for details) and the final result is provided in Figure 6.
For both examples, the kNN local regression test performs reasonably well and detects most of the local differences between two distributions.

Application to Astronomy Data
Continuing our discussion from Section 1.1, we apply our two-sample framework to galaxies in the COSMOS, EGS, GOODS-North and UDS fields observed by the Hubble Space Telescope (HST) as part of the CANDELS program. 1 For the analysis, we compute seven morphological statistics that summarize galaxy images nonparametrically: M, I, D (Freeman et al., 2013), Gini, M 20 (Lotz et al., 2004), C and A (Conselice, 2003). Each statistic (see the references for details) explains particular aspects of galaxy morphology. In brief, the M, I, D statistics capture galaxies with disturbed morphologies, Gini and M 20 describe the variance of a galaxy's stellar light distribution, and the C and A statistics measure the concentration of light and asymmetry of a galaxy, respectively. We restrict our study to relatively nearby galaxy observations that have a redshift (proxy for distance) estimate between 0.56 < z < 1.12. The final data set consists of 2736 so-called i-band-selected galaxy observations. For each galaxy, we have seven morphological image statistics along with an estimate of star-formation rate (SFR).
Galaxy morphology is closely related to other physical properties such as star formation rate, mass and metallicity (see, e.g., Snyder et al., 2015). The aim of this study is to demonstrate that our local two-sample framework can be valuable in detecting and quantifying dependencies between variables of moderate or high dimension without resorting to low-dimensional projections of summary statistics. In particular, we demonstrate that local two-sample tests can identify galaxies that lie in regions of the feature space where the estimated proportion of a particular defined class of objects (such as star-forming galaxies) differs significantly from the global proportion. Hence, we start by defining two galaxy classes based on the SFR: we say that a galaxy belongs to the high-SFR group if its SFR is higher than the upper 25% quantile of the SFR distribution (log 10 (SFR) > 1.201), and that it belongs to the low-SFR group if its SFR is lower than the lower 25% quantile of the SFR distribution (log 10 (SFR) < −0.915). We further randomly divide the data into a training set (n = 684) and a test set (n = 684). We use the training data to construct the local test statistic in (3), and we perform the local-two sample tests at the points in the test set (that is, these are the evaluation points in Algorithm 2). Note that this particular application is especially challenging because the seven morphological statistics have very different properties, and some of the statistics (M and I) are essentially of mixed discrete and continuous type with heavy outliers; hence, any metric-based estimator is bound to perform poorly even after normalizing the variables. Our regression test, however, can by-pass this problem by leveraging the random forest algorithm. Another advantage of using random forests is that the algorithm returns variable importance measures that can help us identify which morphology statistics are the most important in distinguishing the two populations ( Figure 7).

Analysis and Result
According to our global two-sample test ( T RF = .188, p < .001), there is a significant difference between the low-SFR and the high-SFR populations in terms of galaxy morphology. We follow up on this result by implementing the local two-sample testing framework according to Algorithm 2 with FWER control at α = 0.05 by the Hochberg step up procedure. To visualize locally significant points from the local test, we use diffusion maps with local scaling (Zelnik-Manor and Perona, 2005). For more information on our particular application of diffusion maps, see Appendix B. The main result of the local significance test is displayed in Figure 1. As we can see, the high-SFR and low-SFR dominated regions (that is, the regions where f LowSFR < f HighSFR and f LowSFR > f HighSFR , respectively) are fairly well-separated in morphology space. Figure 1 also shows some examples of galaxy images at significant test points. By inspecting such images, we note that the "red" galaxies in the low-SFR dominated regions of the seven-dimensional space tend to be more concentrated and less disturbed than their "blue" counterparts in the high-SFR dominated regions -this result is consistent with previous astronomical studies about irregular galaxies displaying merger activities and high star-formation rates. Our test result is further supported by the variable importance measures in Figure 7: the two most important morphology statistics in distinguishing between high-SFR and low-SFR galaxies are the Gini (Lotz et al., 2004) and I (Freeman et al., 2013) morphology statistics. Indeed, by definition, the Gini statistic describes the variance of a galaxy's stellar light distribution, and the I statistic captures galaxies with disturbed morphologies.

Conclusions
In this work, we presented a new framework for both global and local two-sample testing via regression. Depending on the chosen regression model, our framework can efficiently deal with different types of variables and different structures in the data; thereby, providing tests with competitive power against many practical alternatives. Compared to other recent approaches in the two-sample literature (such as classification tests), our framework has the key advantage of being able to detect locally significant regions in multivariate spaces. Throughout this work, we studied theoretical properties of the regression tests by building on existing regression results. We established a connection between the power of the global and local tests to the MISE and MSE of the corresponding regression estimators, and we demonstrated practical usefulness of our methods via simulations.
By taking advantage of permutation tests under the global null hypothesis, the proposed local testing framework ensures that the type I error rate is less than or equal to the significance level. When the local null hypothesis H 0 (x) : m(x) = π is of interest, on the other hand, there is no such guarantee. In this case, it would be necessary to use an asymptotic framework and investigate the limiting behavior of a local test statistic. This topic is reserved for future work. Another direction for future work is to study the optimality of global regression tests. Contrary to the local regression test, a regression estimator with the optimal estimation error rate may not necessarily return minimax optimal global regression test. We hope that future studies will establish a lower bound and matching upper bound for the global regression test.
, and write For some a ∈ (0, 1), Taylor expansion of f (x) = a/{a + (1 − a)e x } at x = 0 provides where C is a universal constant. This implies that Now based on |x + y| 3 ≤ 4|x| 3 + 4|y| 3 and Cauchy-Schwarz inequality, it can be seen that As a result, n T LDA can be approximated by Let us denote δ n = S −1 ( µ 0 − µ 1 ) and ∆ n = ( µ 0 + µ 1 )/2, and recall We also note that the residual term is negligible under the null, i.e. nπ 2 0 π 2 1 R n = o P (1), which results in The rest of the proof follows by the limiting property of Hotelling's T 2 .
A.2 Proof of Theorem 3.2 Proof. First note that the likelihood ratio for testing (8) is given by .
Since {P µ , µ ∈ Ω} is q.m.d. at µ 0 , Theorem 12.2.3 of Lehmann and Romano (2006) under n 1 /(n 0 + n 1 ) → π 1 yields that where I(µ) is the Fisher information matrix. This implies by Corollary 12.3.1 of Lehmann and Romano (2006) that the joint distribution of X 1,0 and X 1,1 under the null and the alternative are mutually contiguous. Since contiguity implies under H 1,n , the result follows by the limiting distribution of Hotelling's T 2 statistic.

A.3 Proof of Theorem 3.3
Proof. The exact type I error control of the permutation test is well-known (see e.g. Chapter 15 of Lehmann and Romano, 2006). Strictly speaking, the considered test is not the usual permutation test since the only first half of labels are permuted to decide a critical value. However, it also controls the type I error under H 0 due to Theorem 15.2.1 of Lehmann and Romano (2006). Indeed, this result holds regardless of i.i.d. sampling or separate sampling. Hence we focus on the type II error control.
• Type II error control (i.i.d. sampling) We start with the case of i.i.d. sampling. Based on the inequality (x − y) 2 ≤ 2(x − z) 2 + 2(z − y) 2 , we lower bound the test statistic as Define the events A 1 , A 2 , A 3 , A 4 such that A 1 = (π 1 − π 1 ) 2 < C 2 δ n , Using Markov's inequality, we have by the condition in (9). For the third event, denote ∆ n = E (m(X) − π 1 ) 2 and use Chebyshev's inequality to have where the last inequality uses the assumption that ∆ n ≥ C 1 δ n . Furthermore, under the assumption on the permutation critical value, P (A c 4 ) ≤ β/2. Hence, we obtain by choosing sufficiently large C 1 , C 2 , C 3 > 0 with the assumption that δ n ≥ n −1 . Using (23), the type II error of the regression test is bounded by where C 4 can be chosen by C 4 = 2C 0,α + C 2 + 2C 3 . Now by choosing C 1 > C 4 for sufficiently large n, the type II error can be bounded by β. Hence the result follows.
• Type II error control (Separate Sampling) The proof for separate sampling is almost the same as before except few details. First, we do not need to define A 1 since π 1 is known. In terms of A 2 , apply Markov's inequality to obtain where the last line uses the fact that n 0 n P 0 + n 1 n P 1 = P X . Similarly, for the event A 3 , we have by Chebyshev's inequality that The rest follows exactly the same as before. Hence the proof is complete.
A.4 Proof of Corollary 3.1 Proof. We prove the corollary by showing that the conditions in Theorem 3.3 are satisfied. In particular, it suffices to verify that for fixed α ∈ (0, 1) and β ∈ (0, 1 − α), there exists a positive constant C 0,α such that sup f 0 ,f 1 ∈M P f 0 ,f 1 (t α < C 0,α δ n ) ≥ 1 − β/2. Then the rest of the proof proceeds the same as before.
• i.i.d. sampling To start with the case of i.i.d. sampling, let η = (η 1 , . . . , η n ) be a permutation of {1, . . . , n}. Now conditioned on the data X 2n = {(X 1 , Y 1 ), . . . , (X 2n , Y 2n )}, we denote the probability and expectation under permutations by P η [·] = P η [·|X 2n ] and E η [·] = E η [·|X 2n ] respectively. Then by Markov's inequality Further note that where the first inequality uses Note that the permutation samples are not i.i.d. and thus in order to use the condition in (9) which holds for i.i.d. samples, we will associate the upper bound in (25) Therefore, we obtain which in turn implies that So the critical value of the permutation distribution is bounded by Now choose C 0,α such that 2C 0 /(αβ) ≤ C 0,α . Then based on the assumption in (9) and Markov's inequality Hence the proof completes.
• Separate Sampling Let Y * * 1 , . . . , Y * * n be Bernoulli random variables with parameter π 1 such that n i=1 Y * * i = n π 1 and they are independent of X 1 , . . . , X 2n . In the case of separate sampling, the proof follows similarly by noting that the right-hand side of (24) is the same as Now by putting the above quantity into the right-hand side of (26) and following the same lines afterwards, we complete the proof.

A.5 Proof of Theorem 4.1
This result can be proved by following the same steps in the proof of Theorem 3.3. In fact, it is simpler than the previous proof since it does not involve sample splitting to estimate the integration error; hence we omit the proof.
A.6 Proof of Example 4.1 Proof. Let m kN N (x) = E[ m kN N (x)|X 1 , . . . , X n ]. Then we have the following decomposition. .
For a fixed x, Proposition 8.1 of Biau and Devroye (2015) shows that conditioned on {X 1 , . . . , X n }, (X 1,n (x), Y 1,n (x)), . . . , ((X n,n (x), Y n,n (x))) are independent. Using this independence property, Next for (II), where the last inequality uses the Lipschitz condition. Note that for fixed > 0 by the assumption that P(X ∈ B x, ) > τ x D . Hence, Similarly to the proof of Theorem 6.2 of Györfi et al. (2002), divide the data into k n + 1 parts where the first k n parts have size n/k n and denote the first nearest neighbor of x from the jth partition by X x j . This implies that and by Jensen's inequality, and thus .

Define an event
Then .
By Lemma 4.1 of Györfi et al. (2002), where B ∼ Binominal(n, p). Using this result, For (I 2 ), note that ( m ker (x) − m ker (x)) 2 ≤ 1 and thus where the second inequality is because if there exists X i such that ||x − X i || 2 ≤ rh n , then we have n i=1 K x−X i hn ≥ λ by the assumption on the kernel. In addition, where (i) uses 1 + x ≤ e x with the assumption P (X ∈ B x, ) ≥ τ x D and (ii) uses sup z ze −z ≤ e −1 . As a result, For (II), we use Jensen's inequality and the Lipschitz condition to have Since K(x) ≤ I(x ∈ B 0,R ), we observe that Consequently, where the second inequality is by the assumption λI(x ∈ B 0,r ) ≤ K(x). By taking the expectation, (II) ≤ L 2 R 2 h 2 n + (1 − P (X ∈ B x,rhn )) n (30) ≤ L 2 R 2 h 2 n + 1 − τ x r D h D n n ≤ L 2 R 2 h 2 n + e −1 τ x r D 1 nh D n . Therefore, we conclude that E ( m ker (x) − m(x)) 2 = (I) + (II) which completes the proof.
For the first term, |m(x 1 , x 2 , . . . , x D ) − m(z 1 , x 2 , . . . , x D )| Applying the same logic to the other terms, we see that A.9 Proof of Proposition 4.1 It is enough to show that there exist universal constants C 0 , C 0,α such that Then we can apply Theorem 4.1 to complete the proof. To start with kNN regression, we only need to modify (27) and follow the same steps in the proof of Example 4.1. From the definition of the (C, d)-homogeneous measure, we see that As a result, (27) becomes P (||X 1,n (x) − x|| 2 > ) = (1 − P(X ∈ B x, )) n ≤ 1 − C d n ≤ e −C n d .
Then we end up having E ( m kN N (x) − m(x)) 2 ≤ 1 4k n + L 2 2Γ(2/d) dC 2/d k n n 2/d and the result follows by setting k n = n 2 2+d . Similarly, we only need to modify (29) and (30) in the proof of Example 4.2. By using the (C, d)-homogeneous measure, (1 − P (X ∈ B x,rhn )) n ≤ 1 − h d n C P (X ∈ B x,r ) n = 1 − C h d n n ≤ e −C nh d n and apply this result to (29) and (30). We complete the proof by following the same steps in the proof of Example 4.2.
A.10 Proof of Theorem 4.3 Proof. We use a combinatorial central limit theorem in Bolthausen (1984) to prove the result. First denote a ij = w i (x)Y j for 1 ≤ i, j ≤ n and µ = na ·· , σ 2 n = n 1≤i,j≤n (a ij − a i· − a ·j + a ·· ) 2 /(n − 1), where a i· = n j=1 a ij /n, a ·j = n i=1 a ij /n, a ·· = n 1≤i,j≤n a ij /n 2 .
A.12 Proof of Corollary 4.3 Proof. Note that Hence it suffices to show that Using the given condition, the numerator is bounded by Whereas the denominator can be decomposed into two parts: Based on the usual bias-variance decomposition of the kernel density estimation (Wasserman, 2006), each part can be approximated as Now, the sufficient condition can be further bounded by Then using the previous approximations, the denominator becomes >0 by the assumption +o P (1).
Hence (32) converges to zero in probability and the result follows.

B Diffusion Maps
Dimensionality reduction methods can be useful for visualizing and describing low-dimensional structures that are embedded in higher-dimensional spaces. In this section, we briefly describe diffusion maps (Coifman et al., 2005;Coifman and Lafon, 2006) and the particular version that we use to visualize the results of our local two-sample test. As a starting point for constructing a diffusion map, one first defines a weight that reflects the local similarity of two points x i and x j in X = {x 1 , . . . , x n }. A common choice is the Gaussian kernel where s(x i , x j ) represents (for example, the Euclidean) distance between the points. These weights are used to build a Markov random walk on the data with the transition probability from x i to x j defined as .