On the exact Berk-Jones statistics and their p-value calculation

Continuous goodness-of-fit testing is a classical problem in statistics. Despite having low power for detecting deviations at the tails of a distribution, the most popular test is based on the Kolmogorov-Smirnov statistic. While similar variance-weighted statistics, such as Anderson-Darling and the Higher Criticism statistic, give more weight to tail deviations, as shown in various works they still mishandle the extreme tails. As a viable alternative, in this paper we study some of the statistical properties of the exact M_n statistics of Berk and Jones. We derive the asymptotic null distributions of M_n, M_n^+, M_n^-, and further prove their consistency as well as asymptotic optimality for a wide range of rare-weak mixture models. Additionally, we present a new computationally efficient method to calculate p-values for any supremum-based one-sided statistic, including the one-sided M_n^+, M_n^- and R_n^+, R_n^- statistics of Berk and Jones and the Higher Criticism statistic. We illustrate our theoretical analysis with several finite-sample simulations.


Introduction
Let x_1, x_2, ..., x_n be a sample of n i.i.d. observations of a real-valued random variable X. The classical continuous goodness-of-fit (GOF) problem is to assess the validity of a null hypothesis that X follows a known (and fully specified) continuous distribution function F, against an unknown and arbitrary alternative G,

    H_0: X ~ F    vs.    H_1: X ~ G, G ≠ F.    (1.1)

Goodness-of-fit is one of the most fundamental hypothesis testing problems in statistics (Lehmann and Romano, 2005). Most GOF tests for continuous distributions can be broadly categorized into two groups. The first comprises tests based on some distance metric between the null distribution F and the empirical distribution function F̂_n(x) = (1/n) Σ_i 1(x_i ≤ x). These include, among others, the tests of Kolmogorov-Smirnov, Cramér-von Mises, Anderson-Darling, Berk-Jones, as well as the Higher Criticism and Phi-divergence tests; see for example Anderson and Darling (1954), Berk and Jones (1979) and Jager and Wellner (2007). The second group considers the first few moments of the random variable X with respect to an orthonormal basis of L^2(R). Notable representatives are Neyman's smooth test (Neyman, 1937) and its more recent data-driven versions, where the number of moments is determined in an adaptive manner; see Ledwina (1994) and Rainer et al. (2009).

[arXiv:1311.3190v3 [stat.ME], 2 Oct 2014]
Despite the abundance of proposed GOF tests, KS is nonetheless the most common test used in practice. It has several desirable properties, including asymptotic consistency against any fixed alternative, good power against a shift in the median of the distribution (Janssen, 2000), and the availability of simple procedures to compute its p-value. However, it suffers from a well-known limitation: it has little power for detecting deviations at the tails of the distribution. The latter is important in a variety of practical situations. One scenario is the detection of rare contaminations, whereby only a few of the n observations are contaminated and arise from a different distribution. A specific example is the rare-weak model (Ingster, 1997; Donoho and Jin, 2004) and its generalization to sparse mixture models (Cai and Wu, 2014). Another example involves high dimensional variable selection or multiple hypothesis testing problems under sparsity assumptions.
Given the popularity of the KS test, natural questions are whether there are simple variants of it with tail sensitivity, and what their statistical properties are. In this paper we make several contributions regarding these questions. We start in Section 2 by viewing the KS and the Higher Criticism statistics under a common framework, as different ways to measure the deviations of order statistics from their means. As described in Section 3, this leads us to study a different GOF statistic, based on the following principle: rather than looking for the largest (possibly weighted) deviation, it looks for the deviation which is most statistically significant. Since this statistic uniformly calibrates the different deviations, we denote it the Calibrated Kolmogorov-Smirnov (CKS) statistic. Independently of our work, this statistic was recently suggested in several different works, including Mary and Ferrari (2014), Gontscharuk et al. (2014) and Kaplan and Goldman (2014). As discussed in Section 3, this test statistic is in fact equivalent to the seemingly forgotten M_n statistic defined in Berk and Jones (1979). As discussed in Section 3.1, the CKS statistic is also closely related to the work of Aldor-Noiman et al. (2013), where the authors presented a new method to construct confidence bands for a Normal Q-Q plot.
On the theoretical front, Section 4 studies the asymptotic behavior of the CKS test under the null and under several alternative hypotheses. We prove its asymptotic consistency against any fixed alternative G ≠ F, as well as against series of converging alternatives G_n → F, provided that the convergence in the supremum norm ‖G_n − F‖_∞ is sufficiently slow. Furthermore, following the work of Cai and Wu (2014), we show that the CKS test is adaptively optimal for detecting a broad family of sparse mixtures.
In a second contribution, we devise in Section 5 a novel and simple O(n^2) algorithm to compute p-values for any supremum-based one-sided test. Particular examples include the Higher Criticism and Berk-Jones statistics as well as the one-sided version of CKS. While O(n^2) algorithms exist for KS, common methods for other test statistics require O(n^3) operations, for example via Noé's recursion (Noe, 1972; Owen, 1995). Finally, in Section 6 we compare the power of CKS to other tests under the following settings: i) a change in the mean or variance of a standard Gaussian distribution; and ii) rare-weak sparse Gaussian mixtures. These results showcase scenarios where CKS has improved power compared to common GOF tests. For other examples involving real data, see Aldor-Noiman et al. (2013) and Kaplan and Goldman (2014).

The Kolmogorov-Smirnov, Anderson-Darling and Higher Criticism Tests
Let us first introduce some notation. For a given sample x_1, ..., x_n, we denote by x_(1) ≤ x_(2) ≤ ... ≤ x_(n) its sorted values. The standard definition of the KS test statistic is based on a (two-sided) L_∞ distance over a continuous variable x ∈ R,

    KS_n = √n sup_{x ∈ R} |F̂_n(x) − F(x)|.    (2.1)

Whereas (2.1) involves a supremum over x ∈ R, in what follows we instead use an equivalent discrete formulation (Knuth, 2006), whereby the two-sided KS statistic is the maximum of a pair of discrete one-sided statistics, KS_n = max(KS_n^+, KS_n^−), with

    KS_n^+ = √n max_{1≤i≤n} (i/n − u_(i)),    KS_n^− = √n max_{1≤i≤n} (u_(i) − (i−1)/n),    (2.2)

where u_(i) = F(x_(i)). Note that by the definition of F̂_n, under the null hypothesis E[F̂_n(x)] = F(x) and Var[F̂_n(x)] = F(x)(1 − F(x))/n. The latter varies significantly throughout the range of x, attaining a maximum at the median of the distribution and smaller values near the tails. Anderson and Darling (1952) were among the first to suggest different weights for deviations at different locations. Based on a weight function ψ : [0, 1] → R, they proposed a weighted L_2 statistic and a lesser-known weighted L_∞ statistic, given by

    AD_{n,ψ} = n ∫ (F̂_n(x) − F(x))^2 ψ(F(x)) dF(x),    (2.3)
    AD^sup_{n,ψ} = √n sup_x |F̂_n(x) − F(x)| √(ψ(F(x))).    (2.4)

Specifically, Anderson and Darling (1952) proposed the weighting function ψ(x) = 1/(x(1−x)), which standardizes the variance of F̂_n(x). Closely related to Eq. (2.4) is the Higher Criticism statistic, whose two variants below can be viewed as one-sided GOF test statistics,

    HC_n^{2004} = √n max_{i: 1/n ≤ u_(i) ≤ α_0} (i/n − u_(i)) / √(u_(i)(1 − u_(i)))    (Donoho and Jin, 2004),    (2.5)
    HC_n^{2008} = √n max_{1 ≤ i < n} (i/n − u_(i)) / √((i/n)(1 − i/n))    (Donoho and Jin, 2008).    (2.6)

Indeed, the HC_n^{2004} test with α_0 = 1 is equivalent to a one-sided version of the AD^sup_{n,ψ} test with ψ(x) = 1/(x(1−x)).
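Given transformed, sorted values u_(1) ≤ ... ≤ u_(n), the discrete statistics above are a few lines of code each. The following is a minimal NumPy sketch; the index restrictions in hc_2004 and hc_2008 are one common convention and may differ slightly from the exact ranges used in Eqs. (2.5)-(2.6):

```python
import numpy as np

def ks_plus(u):
    """One-sided discrete KS statistic of Eq. (2.2), from sorted uniforms u."""
    n = len(u)
    i = np.arange(1, n + 1)
    return np.sqrt(n) * np.max(i / n - u)

def hc_2004(u, alpha0=0.5):
    """Higher Criticism (2004 variant), restricted to 1/n <= u_(i) <= alpha0."""
    n = len(u)
    i = np.arange(1, n + 1)
    z = np.sqrt(n) * (i / n - u) / np.sqrt(u * (1 - u))
    keep = (u >= 1.0 / n) & (u <= alpha0)    # assumed index restriction
    return np.max(z[keep])

def hc_2008(u):
    """Higher Criticism (2008 variant): deviations scaled by sqrt((i/n)(1-i/n))."""
    n = len(u)
    i = np.arange(1, n)                      # drop i = n to avoid a zero denominator
    z = np.sqrt(n) * (i / n - u[:-1]) / np.sqrt((i / n) * (1 - i / n))
    return np.max(z)
```

Note that the 2004 variant standardizes by the (random) observed values u_(i), while the 2008 variant standardizes by the deterministic quantities i/n; this difference is exactly what the p-value boundaries of Section 5 must account for.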

Order Statistics of Uniform Random Variables
To motivate the derivation of the CKS statistic, first recall that by the probability integral transform, if X is a random variable with continuous cdf F, then Y = F(X) follows a uniform distribution, Y ~ U[0,1]. Hence, under the null, the transformed values u_i = F(x_i) are an i.i.d. sample from the U[0,1] distribution and the sorted values u_(i) = F(x_(i)) are their order statistics. Next, recall that if U_(i) is the i-th order statistic of a sample of n i.i.d. draws from U[0,1], then it follows a Beta(i, n−i+1) distribution, whose density is

    f_{i,n−i+1}(x) = n!/((i−1)!(n−i)!) x^{i−1} (1−x)^{n−i}.    (2.7)

In particular, its mean and variance are

    E[U_(i)] = i/(n+1),    Var[U_(i)] = i(n−i+1)/((n+1)^2 (n+2)).    (2.8)

We now relate the KS and HC tests to U[0,1] order statistics. Up to a small O(1/√n) correction, the one-sided KS test statistic of Eq. (2.2) is the maximal deviation of the n different uniform order statistics from their expectations,

    KS_n^+ = √n max_{1≤i≤n} (E[U_(i)] − u_(i)) + O(1/√n).    (2.9)

Since the variance of each U_(i) is different, attaining a maximum at i = n/2, under the null the largest deviation tends to occur near the center. Importantly, this deviation can mask small (but statistically significant) deviations at the tails, leading to poor tail sensitivity (Mason and Schuenemeyer, 1983; Calitz, 1987).
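The Beta mean and variance formulas in Eq. (2.8) can be checked directly against the Beta distribution as implemented in scipy; a small sanity-check sketch:

```python
import numpy as np
from scipy.stats import beta

n = 10
variances = []
for i in range(1, n + 1):
    d = beta(i, n - i + 1)                   # distribution of U_(i)
    assert abs(d.mean() - i / (n + 1)) < 1e-12
    var = i * (n - i + 1) / ((n + 1) ** 2 * (n + 2))
    assert abs(d.var() - var) < 1e-12
    variances.append(var)
# the variance is maximal near the central index i ~ n/2,
# which is why the unweighted KS deviation concentrates there
assert np.argmax(variances) in (n // 2 - 1, n // 2)
```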
In contrast, up to a small correction term, the HC statistic normalizes the difference E[U_(i)] − u_(i) by its standard deviation,

    HC_n^{2004} = max_i (E[U_(i)] − u_(i)) / √(Var[U_(i)]) (1 + O(1/n)).    (2.10)

Such normalizations are common when comparing Gaussian variables with different variances. Indeed, at indices 1 ≪ i ≪ n, the distribution of U_(i) is close to Gaussian. However, it is severely skewed at extremely small or large indices. In fact, for a fixed i, as n → ∞, the distribution Beta(i, n−i+1) converges to an extreme value variate (Keilson and Sumita, 1983). Furthermore, for any n ≥ 2 the distribution of U_(1) is monotone and heavily skewed towards 0. In Section 6.1 we demonstrate and explain analytically how the above normalization significantly hurts the performance of HC.

The Calibrated Kolmogorov Smirnov Statistic
The discussion above demonstrates that neither the KS test nor the HC test uniformly calibrate the various deviations of u (i) for indices i in the entire range from 1 to n. In this paper we study a different test statistic, whose key underlying principle is to look for the deviation E[U (i) ] − u (i) which is most statistically significant. In details, for each observed u (i) it computes the one-sided p-value p (i) according to Eq. (2.7), where f a,b (x) is the density of a Beta(a, b) random variable. In analogy to KS, the Calibrated Kolmogorov-Smirnov statistic is then defined as In contrast to the KS statistic, whose range is [0, ∞) and for which large values lead to rejection of the null, the CKS statistic is always inside [0, 1], and small values indicate a bad fit to the null hypothesis. Numerically evaluating CKS is straightforward using common mathematical packages as the integral in Eq.
(3.1) is simply the regularized incomplete Beta function. Independently of our work, this test statistic has been recently suggested in several different papers, including Mary and Ferrari (2014), Gontscharuk et al. (2014) and Kaplan and Goldman (2014). However, a close examination reveals that the one-sided CKS + statistic of Eq. (3.3) above is in fact equivalent to the lesser known M + n statistic proposed by Berk and Jones (1979, Eq.(1.9)). Similarly the two-sided CKS is equivalent to their M n statistic. In contrast to our motivation, their derivation of M + n followed from their earlier work on relatively optimal combinations of test statistics (Berk and Jones, 1978). Let us emphasize that M + n (or equivalently CKS + ) should not be confused with the more popular R + n statistic, known as the Berk-Jones (BJ) statistic, which was also suggested in the same paper. Finally, note that R + n can be viewed as an approximation of M + n , which was easier to compute at the time, when digital computers and software to calculate the tails of a Beta distribution were not as widespread as today.
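Since the integral in Eq. (3.1) is the regularized incomplete Beta function, CKS reduces to a single vectorized call in scipy. A minimal sketch (the min(p, 1−p) combination reflects our reading of the two-sided definition):

```python
import numpy as np
from scipy.stats import beta

def cks_plus(u_sorted):
    """One-sided CKS^+: smallest one-sided p-value p_(i) = Pr[U_(i) <= u_(i)]."""
    n = len(u_sorted)
    i = np.arange(1, n + 1)
    return beta.cdf(u_sorted, i, n - i + 1).min()   # regularized incomplete Beta

def cks(u_sorted):
    """Two-sided CKS (Berk-Jones M_n): most significant deviation, either side."""
    n = len(u_sorted)
    i = np.arange(1, n + 1)
    p = beta.cdf(u_sorted, i, n - i + 1)
    return np.minimum(p, 1 - p).min()
```

Small values of either statistic indicate a poor fit to the null.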
In the following sections we study some theoretical properties of CKS, present a numerical procedure to compute p-values for CKS + , CKS − , and perform finite sample simulations comparing the CKS test to other GOF tests.

Confidence Bands
Often, one is interested not only in the magnitude of the most statistically significant deviation from the null hypothesis, as can be measured for example by the CKS statistic, but also in gaining insight into the nature of the deviations throughout the entire range of the sample set. One common practice is to draw a Q-Q scatter plot of the points (E[U_(i)], u_(i)) = (i/(n+1), u_(i)). Under the null, each u_(i) concentrates around its expectation, and hence we expect the Q-Q plot to be concentrated around the x = y diagonal.
Similar to Owen (1995), who constructed α-level confidence bands around the diagonal based on the Berk-Jones statistic, we can instead use the CKS statistic. Let c_α ∈ [0, 1] be the CKS threshold that corresponds to an α-level test, i.e.,

    Pr[CKS ≤ c_α | H_0] = α.
By definition (3.2), CKS > c_α if and only if the transformed order statistics all satisfy L_i < u_(i) < B_i, where L_i and B_i are the c_α and 1 − c_α quantiles of the Beta(i, n−i+1) distribution, respectively. Upon making the inverse transformation x_(i) = F^{−1}(u_(i)), this yields confidence bands for the entire Q-Q plot. In the Gaussian case, these confidence bands are precisely those of Aldor-Noiman et al. (2013). For a related construction of confidence bands and further discussion, see Dümbgen and Wellner (2014).
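For a Gaussian null, the band construction can be sketched as follows; here c_alpha is assumed to have been computed separately, e.g. by the p-value algorithm of Section 5:

```python
import numpy as np
from scipy.stats import beta, norm

def qq_confidence_band(n, c_alpha):
    """Simultaneous Q-Q band: L_i and B_i are the c_alpha and 1 - c_alpha
    quantiles of Beta(i, n-i+1), mapped to the x-scale via F^{-1} = norm.ppf."""
    i = np.arange(1, n + 1)
    L = beta.ppf(c_alpha, i, n - i + 1)       # lower band, in u-space
    B = beta.ppf(1 - c_alpha, i, n - i + 1)   # upper band, in u-space
    return norm.ppf(L), norm.ppf(B)           # map to the Gaussian x-scale
```

A Q-Q plot of the sorted sample then violates the band at some index exactly when the CKS test rejects at level α.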

Theoretical Properties of CKS
Let CKS_F(X_1, ..., X_n) denote the two-sided CKS statistic of the random variables X_1, ..., X_n, where F is the known (and fully specified) continuous distribution assumed under the null hypothesis. Our first theoretical result is that a test based on CKS_F is consistent against any fixed alternative G ≠ F. Namely, there exists a sequence c(n) such that under the null hypothesis

    Pr[CKS_F(X_1, ..., X_n) > c(n)] → 1 as n → ∞,

whereas under any fixed alternative hypothesis G ≠ F,

    Pr[CKS_F(X_1, ..., X_n) > c(n)] → 0 as n → ∞.

In fact, we show that CKS is consistent even against a series of converging alternatives G_n → F as n → ∞, provided that this convergence is sufficiently slow. Similar properties hold for KS (Lehmann and Romano, 2005, Chapter 14) and are considered desirable for any GOF test. A second result, described in Section 4.2 below, is that the CKS test is asymptotically optimal for detecting deviations from a Gaussian distribution for a wide class of rare-weak contamination models.
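This consistency is easy to see in a quick Monte Carlo experiment; the following sketch uses an arbitrary fixed alternative G = N(0.5, 1) and a sample size chosen for speed:

```python
import numpy as np
from scipy.stats import beta, norm

def cks_stat(x, cdf):
    """Two-sided CKS of sample x against the null cdf."""
    u = np.sort(cdf(x))
    n = len(u)
    i = np.arange(1, n + 1)
    p = beta.cdf(u, i, n - i + 1)
    return np.minimum(p, 1 - p).min()

rng = np.random.default_rng(0)
n = 2000
x_null = rng.standard_normal(n)           # H0: N(0, 1)
x_alt = rng.standard_normal(n) + 0.5      # fixed alternative G = N(0.5, 1)
s_null = cks_stat(x_null, norm.cdf)
s_alt = cks_stat(x_alt, norm.cdf)
# under the alternative the most significant p-value collapses towards 0,
# while under the null it stays bounded away from 0
```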

Asymptotic Consistency of CKS
To prove the asymptotic consistency of CKS under various alternative hypotheses, we first provide lower and upper bounds on its value under the null. The following theorem sharpens results in Section 6 of Berk and Jones (1979).
The theorem follows from Lemmas B.1 and B.3 in the appendix, which prove the upper and lower bounds in (4.1), respectively. The basic proof idea is to adapt known results regarding the supremum of the standardized empirical process to our setting, using a Gaussian approximation to the distribution of U (i) . Even though the Gaussian approximation is inaccurate at the smallest (and largest) indices, this poses no difficulty, since according to Lemma B.2, under the null the CKS statistic rarely attains its most significant deviation at such indices.
Next, we study the asymptotics of CKS under various alternatives. First, we consider the case of a fixed alternative, X_i i.i.d. ~ G with G ≠ F. Combining Theorems 4.1 and 4.2, we obtain the following key result.
Corollary 4.1. CKS_F is consistent against any fixed alternative G ≠ F.
In other words, as n → ∞ the CKS test perfectly distinguishes between the null hypothesis F and any fixed alternative G ≠ F. In fact, as the following corollary shows, the CKS test can perfectly distinguish between F and a series of converging alternatives {G_n}_{n=1}^∞ with G_n → F, provided that this convergence is sufficiently slow.
Remark 4.1. The proof of Theorem 4.2 is based on a simple application of Chebyshev's inequality. As such, neither Eq. (4.2) nor Eq. (4.3) of the corollary is sharp.

Sparse Mixture Detection
Next, we study properties of CKS under the following class of sparse mixture models. Suppose that under the null hypothesis X_i i.i.d. ~ F, whereas under the alternative a small fraction ε_n of the variables are contaminated and have a different distribution G_n. The corresponding hypothesis testing problem is

    H_0: X_i ~ F    vs.    H_1: X_i ~ (1 − ε_n) F + ε_n G_n.    (4.4)

Such models have been analyzed, among others, by Ingster (1997), Donoho and Jin (2004) and Cai and Wu (2014). Let us briefly review some results regarding these models, first for the Gaussian mixture model, where F = N(0,1) and G_n = N(µ_n, 1),

    H_1: X_i ~ (1 − ε_n) N(0,1) + ε_n N(µ_n, 1).    (4.5)

Recall that for n ≫ 1, the maximum of n i.i.d. standard Gaussian variables is sharply concentrated around √(2 log n). Thus, for any fixed ε_n = ε, as n → ∞, contamination strengths µ_n > √(2 log n)(1 + δ) are perfectly detectable by the maximum statistic max_i x_i. Similarly, for any fixed µ_n = µ, sparsity levels ε_n ≫ n^{−1/2} visibly shift the overall mean of the samples, and hence as n → ∞, can be perfectly detected by the sum statistic Σ_i x_i. These cases lead one to consider the parameterization

    ε_n = n^{−β},    µ_n = √(2 r log n),

and examine the asymptotic detectability in the (r, β) plane (Ingster, 1997). Since any point (r, β) where either r > 1 or β < 1/2 is easily detectable, the interesting regime is the region where both 0 < r < 1 and 1/2 < β < 1.
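For concreteness, sampling from this parameterized model takes only a few lines; the following is a sketch with arbitrary illustrative parameter values:

```python
import numpy as np

def rare_weak_sample(n, beta_exp, r, rng):
    """Draw one sample from (1 - eps_n) N(0,1) + eps_n N(mu_n, 1),
    with eps_n = n^(-beta_exp) and mu_n = sqrt(2 r log n)."""
    eps = n ** (-beta_exp)
    mu = np.sqrt(2 * r * np.log(n))
    x = rng.standard_normal(n)
    mask = rng.random(n) < eps           # which samples are contaminated
    x[mask] += mu
    return x, mask

rng = np.random.default_rng(1)
x, mask = rare_weak_sample(10_000, beta_exp=0.6, r=0.9, rng=rng)
# with these parameters eps ~ 0.004, so only a few dozen of the
# 10,000 samples carry the weak shift mu ~ 4.1
```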
For the particular model (4.5), if ε_n and µ_n are known, both H_0 and H_1 are simple hypotheses, and the optimal test is the likelihood ratio (LR). Its performance was studied by Ingster (1997), who found a sharp detection boundary in the (r, β) plane, given by

    r*(β) = β − 1/2          for 1/2 < β ≤ 3/4,
    r*(β) = (1 − √(1 − β))^2  for 3/4 < β < 1.    (4.6)

Namely, as n → ∞, the sum of type-I and type-II error rates of the LR test tends to 0 or 1 depending on whether (r, β) lies above or below this curve. While the LR test is optimal, it may be inapplicable as it requires precise knowledge of the model's parameters (µ and ε). Importantly, the Higher Criticism test based on Eq. (2.5) was proven to achieve the optimal asymptotic detection boundary without using explicit knowledge of r and β (Donoho and Jin, 2004, Theorem 1.2). Thus, the HC test has adaptive optimality for the sparse Gaussian mixture detection problem. Recently, Cai and Wu (2014) considered more general sparse mixtures of the form (4.4) where the null distribution is Gaussian and ε_n = n^{−β}, but G_n is not necessarily Gaussian. The following is a simplified version of their Theorem 1, describing the asymptotic detectability under this model.

Theorem 4.3. Let G_n be a continuous distribution with density function g_n. If the limit (4.7) exists for all u ∈ R, then the hypothesis testing problem (4.4) with F = N(0,1) and ε_n = n^{−β} has an asymptotic detection threshold β* given by Eq. (4.8). Namely, for any β < β* the error rate of the likelihood ratio test tends to zero as n → ∞.
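The piecewise detection boundary is trivial to code; a sketch of the standard form from Ingster (1997) and Donoho and Jin (2004):

```python
import numpy as np

def detection_boundary(b):
    """Sharp detection boundary r*(beta) for the sparse Gaussian mixture:
    points with r > r*(beta) are asymptotically detectable by the LR test."""
    if not 0.5 < b < 1:
        raise ValueError("boundary defined for 1/2 < beta < 1")
    return b - 0.5 if b <= 0.75 else (1 - np.sqrt(1 - b)) ** 2
```

The two branches agree at β = 3/4 (both give 1/4), and the boundary rises to r = 1 as β → 1, matching the trivial detectability of r > 1 by the maximum statistic.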
In their paper, Cai and Wu (2014) proved that HC is adaptively optimal under the conditions of Theorem 4.3. As we now show, CKS has the same adaptive optimality properties, in particular for the Gaussian mixture model of Eq. (4.5).
Theorem 4.4. Let G_n be a continuous distribution satisfying Eq. (4.7) and let β < β*, where β* is given by Eq. (4.8). Then there is a δ > 0 such that the bound below holds, where Φ is the cumulative distribution function of N(0,1).
The proof of Theorem 4.4 is in the appendix. Combining it with Theorem 4.1 gives

Corollary 4.3. For any δ > 0 and β < β*, the CKS test perfectly separates, as n → ∞, the null distribution N(0,1) from a sparse-mixture alternative of the form (1 − n^{−β}) N(0,1) + n^{−β} G_n. Namely, inside the asymptotic detectability region, the error rate of the test tends to zero.

Computing p-values
For the classical one-sided and two-sided KS statistics, there are many methods to compute the corresponding p-values; see Durbin (1973) and Brown and Harvey (2008a,b). Most of these methods, however, are particular to KS and seem inapplicable to other GOF tests. One notable exception is Noé's recursion (Noe, 1972), with complexity O(n^3), which is applicable to any supremum-based one-sided and two-sided test; see Owen (1995).
In this section we instead present a novel and simple O(n^2) algorithm to compute p-values of any supremum-based one-sided test statistic, including CKS^+, CKS^−, Berk-Jones and the Higher Criticism. For the p-value of two-sided tests, we may use an approximation based on these one-sided p-values. To describe our approach, note that combining definitions (3.3) for CKS^+ and (3.1) for p_(i) yields

    Pr[CKS^+ > c] = Pr[ U_(i) > L^n_i(c) for all i = 1, ..., n ],    (5.1)

where L^n_i(c) denotes the c-quantile of the Beta(i, n−i+1) distribution. While procedures to compute L^n_i(c) are available in many mathematical packages, directly evaluating Eq. (5.1) is not straightforward because of the set of dependencies imposed by the sorting constraints U_(1) ≤ ... ≤ U_(n). Below, we nonetheless present a simple procedure to numerically evaluate this expression. Our starting point is the fact that under the null, all the n unsorted variables are uniformly distributed, U_i i.i.d. ~ U[0,1], and hence their joint density equals 1 inside the n-dimensional box [0,1]^n. Given that there are n! distinct permutations of n indices, the joint probability density of the random vector of sorted values (U_(1), ..., U_(n)) is n! on the simplex u_1 ≤ ... ≤ u_n. From this it readily follows that, writing ℓ_i = L^n_i(c),

    Pr[CKS^+ > c] = n! ∫_{ℓ_n}^1 ∫_{ℓ_{n−1}}^{u_n} ... ∫_{ℓ_1}^{u_2} du_1 ... du_{n−1} du_n.    (5.2)

Eq. (5.2) is the key to fast calculation of p-values for CKS^+ or other one-sided tests. To proceed we simply evaluate this multiple integral from right to left, i.e., innermost first. The first integration yields a polynomial of degree 1 in U_(2), the next integration yields a polynomial of degree 2 in U_(3), and so on. While we have not found simple explicit formulas for the resulting polynomials, their numerical integration is straightforward. We store d + 1 coefficients for the d-th degree polynomial, and its numerical integration takes O(d) operations. Hence, the total time complexity to numerically evaluate Eq. (5.2) is O(n^2). Still, there are some numerical difficulties with this approach: a naïve implementation, using standard (80-bit) long double floating-point accuracy, suffers from a fast accumulation of numerical errors and breaks down completely at n ≈ 150. Nonetheless, as described in the appendix, with a modified procedure this error accumulation is significantly attenuated, thus allowing accurate calculation of one-sided p-values for up to n ≈ 50,000 samples in less than one second on a standard PC.
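The right-to-left integration can be sketched in a few lines. The version below is a naive double-precision sketch and therefore (as noted above) only reliable for moderate n; the boundary ℓ_i passed in for CKS^+ is the Beta quantile of Eq. (5.1):

```python
import numpy as np
from scipy.stats import beta

def noncrossing_prob(ell):
    """Pr[U_(i) > ell[i-1] for all i] for a nondecreasing boundary ell in [0,1),
    via right-to-left evaluation of the nested integral (5.2). O(n^2) time."""
    poly = np.array([1.0])                        # constant polynomial P_0 = 1
    for k, lk in enumerate(ell):
        # antiderivative of P_k, then subtract its value at the lower limit lk
        A = np.concatenate(([0.0], poly / np.arange(1, len(poly) + 1)))
        A[0] = -np.polynomial.polynomial.polyval(lk, A)
        poly = (k + 1) * A                        # fold n! in one factor at a time
    return np.polynomial.polynomial.polyval(1.0, poly)

def cks_plus_pvalue(m, n):
    """p-value Pr[CKS^+ <= m] of an observed one-sided statistic m."""
    i = np.arange(1, n + 1)
    ell = beta.ppf(m, i, n - i + 1)               # boundary L^n_i(m)
    return 1.0 - noncrossing_prob(ell)
```

The same routine handles HC or Berk-Jones by swapping in the appropriate boundary L^n_i(c); in plain double precision the accumulated error grows quickly with n, consistent with the breakdown at n ≈ 150 reported above.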
To the best of our knowledge, there are no O(n^2) algorithms to compute the p-value of two-sided supremum-type test statistics. The following theorem provides simple upper and lower bounds for the p-value of the two-sided CKS in terms of its one-sided p-values,

    max( Pr[CKS^+ ≤ c], Pr[CKS^− ≤ c] ) ≤ Pr[CKS ≤ c] ≤ Pr[CKS^+ ≤ c] + Pr[CKS^− ≤ c].    (5.3)

Furthermore, as n → ∞,

    Pr[CKS > c] → Pr[CKS^+ > c] · Pr[CKS^− > c].    (5.4)

Remark 5.1. As mentioned above, our algorithm can compute p-values of any supremum-type one-sided test statistic. The only difference lies in the coefficients L^n_i(c) of Eq. (5.2), which depend on the specific test statistic. For the HC^{2008} test of Eq. (2.6), for example, HC^{2008} ≤ c if and only if u_(i) ≥ i/n − (c/√n) √((i/n)(1 − i/n)) for all i. Thus, for this statistic,

    L^n_i(c) = i/n − c √( (i/n^2)(1 − i/n) ).

Remark 5.2. Historically, an equation similar to (5.2) was derived by Daniels (1945), in an entirely different context. His formula was used in later works to derive closed-form expressions for the distribution of the KS test statistic. See Durbin (1973) for a survey.
Remark 5.3. To the best of the authors' knowledge, the only other O(n 2 ) algorithm for computing p-values of L ∞ -type one-sided test statistics is that of Kotel'nikova and Chmaladze (1983). Their method is based on a different recursive formula, which involves large binomial coefficients, and may thus also require careful numerical implementation. A comparison of their method to ours is beyond the scope of this paper.

Deviations from a Standard Gaussian Distribution
We consider a null hypothesis that X_i i.i.d. ~ N(0,1), against alternatives with either a change in the mean or a change in the variance. For detecting a change in the mean, the CKS test is on par with KS, but the AD test outperforms both. The HC/AD^sup test has close to zero power in this benchmark. For detecting a change in the variance, which strongly affects the tails, CKS has a higher detection power throughout the entire range of σ. In contrast, HC/AD^sup performs poorly, and has power close to zero when σ < 1.
As we now show, the poor performance of HC/AD^sup stems from its specific normalization of the deviations at the extreme indices u_(1), u_(2), etc. For simplicity, suppose that u_(1) < 1/(c n log log n) for some constant c > 0. Then, the corresponding HC deviation at this index is

    √n (1/n − u_(1)) / √(u_(1)(1 − u_(1))) ≥ √(c log log n) (1 + o(1)).

Since under the null Pr[u_(1) < x] = 1 − (1 − x)^n, it follows that

    Pr[ u_(1) < 1/(c n log log n) ] = (1 + o(1)) / (c log log n).
In particular, for n = 100 samples as in Figure 1, a value c = 65.48 gives that with probability 1% the deviation of the first order statistic is at least √(c log log n) ≈ 10. Now suppose we conduct an HC test at a false alarm level of α = 1%. The above calculation has two important implications: First, the threshold t_α of the HC test must clearly satisfy t_α > 10. This value is significantly larger than HC's asymptotic value of √(2 log log n) ≈ 1.74 (Jaeschke, 1979). Since 1/log log n decays to zero extremely slowly, the above illustrates the very slow convergence of HC's distribution to its asymptotic limit. Second, such a high threshold prevents detection of significant deviations near the center of the distribution, as indeed is shown empirically in Figure 1. As an example, a significant deviation from the null of u_(n/2) = 1/4, which corresponds to about 5.8 standard deviations, cannot be detected by the HC test at level α = 1%.
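The numbers in this example are easy to reproduce; a direct check of the n = 100, c = 65.48 calculation:

```python
import numpy as np

n, c = 100, 65.48
lln = np.log(np.log(n))              # log log 100 ~ 1.527
x = 1.0 / (c * n * lln)              # threshold on u_(1)
prob = 1 - (1 - x) ** n              # Pr[u_(1) < x] under the null
hc_dev = np.sqrt(n) * (1 / n - x) / np.sqrt(x * (1 - x))
# prob ~ 1%, and the induced HC deviation ~ 10, dwarfing the
# asymptotic threshold sqrt(2 log log n) ~ 1.75
```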
We remark that HC's problematic handling of u_(1) was already noted by Donoho and Jin (2004), and is also discussed in several recent works, e.g., Siegmund and Li (2014) and Gontscharuk et al. (2014). Finally, we note that in our numerical example, removing u_(1) from the HC test does not resolve the problem, since the next extreme order statistics u_(2), u_(3), etc., also have non-negligible probabilities of inducing very large HC values. In contrast, the CKS statistic correctly calibrates all of these deviations.

Detecting Sparse Gaussian Mixtures
Next, we consider the problem of detecting a sparse Gaussian mixture of the form (4.5), where the parameter µ is assumed positive. We hence compare the following four one-sided test statistics: max_i X_i, Σ_i X_i, HC^{2004} and CKS^+. Figure 2 compares the resulting receiver operating characteristic (ROC) curves for two choices of ε and µ, both with n = 10,000 samples. The optimal curve is that of the likelihood-ratio test, which, unlike the other statistics, is model specific and requires explicit knowledge of the values of ε and µ.
While asymptotically, as n → ∞, both HC^{2004} and CKS^+ achieve the same performance as the optimal LR test, for finite values of n, as seen in the figure, the gap in detection power may be large. Moreover, for some parameter values (µ, ε), HC^{2004} achieves a higher ROC curve, whereas for others CKS^+ is better. A natural question thus follows: for a finite number of samples n, as a function of the two parameters ε and µ, which of these four tests has greater power? To study this question, we ran the following extensive simulation: for many different values of (µ, ε), we empirically computed the detection power of the four tests mentioned above at a significance level of α = 5%, both for n = 1000 and for n = 10,000 samples. For each sparsity value ε and contamination level µ we declared that a test T_1 was a clear winner if it had a significantly lower mis-detection rate, namely, if min_{j=2,3,4} Pr[T_j = H_0 | H_1] / Pr[T_1 = H_0 | H_1] > 1.1. Figure 3 shows the regions in the (µ, ε) plane where different tests were declared clear winners. First, as the figure shows, at the upper left part of the (µ, ε) plane, Σ_i X_i is the best test statistic. This is expected, since in this region ε is relatively large and leads to a significant shift in the mean of the distribution. At the other extreme, in the lower right part of the (µ, ε) plane, where ε is small but µ is large, very few samples are contaminated and here the HC^{2004} test statistic works best, with the max statistic a close second. In the intermediate region, which would naturally be characterized as the rare/weak region, it is the CKS^+ test that has higher power. Second, while not shown in the plot, we note that the CKS^+ test had similar power to that of the Berk-Jones R_n^+ test. Finally, in this simulation the HC^{2008} test performed worse than at least one of the other tests for all values of (µ, ε).

A.1. Asymptotics of the Beta distribution
As is well known, when both α, β → ∞, the Beta(α, β) distribution approaches N (µ, σ 2 ), where µ and σ are the mean and standard deviation of the Beta random variable. The following lemma quantifies the error in this approximation. For other approximations, see for example Peizer and Pratt (1968); Pratt (1968).
Lemma A.1. Let f_{α,β} be the density of a Beta(α, β) random variable and let g_t(α, β) be its value at t standard deviations from the mean, i.e., g_t(α, β) = f_{α,β}(µ + σt).
For any fixed t, as both α, β → ∞, the approximation (A.1) holds.

Remark A.1. For any fixed t, as α, β → ∞ all error terms tend to zero, hence demonstrating the pointwise convergence of a Beta distribution to a Gaussian. However, for this approximation to be accurate, all correction terms must be significantly smaller than one. As an example, with t = 2 standard deviations and α = n^{1/4}, to have |t|^3/√α < 0.1 we need n > 1.7 × 10^{15} samples, far beyond the reach of almost any scientific study.

Remark A.2. A closer inspection of the proof below shows that Lemma A.1 continues to hold even if t = t(α, β) → ∞, provided that α, β → ∞ and that condition (A.2) holds.

Proof of Lemma A.1. For convenience we denote a = α − 1, b = β − 1 and n = a + b + 1. In this notation, the mean and variance of Beta(α, β) are

    µ = (a + 1)/(n + 1),    σ^2 = (a + 1)(b + 1) / ((n + 1)^2 (n + 2)).

Using Stirling's approximation, n! = √(2πn) (n/e)^n (1 + O(1/n)), and the fact that σ = √(ab/n^3) (1 + O(1/a + 1/b)), we obtain the leading Gaussian factor of the density as both a, b → ∞. Next, we write the remaining terms in (A.3) as powers of (1 + σt/µ) and (1 − σt/(1 − µ)). Note that as a, b → ∞, both σ/µ and σ/(1 − µ) tend to zero. Hence, for either a fixed t, or t = t(α, β) slowly growing to ∞ such that Eq. (A.2) holds, we may replace the logarithms in Eq. (A.4) by their Taylor expansions with small approximation errors. Simple algebra then yields the quadratic term, and similarly the higher-order corrections. Finally, as a, b → ∞, the cubic term is negligible.
Combining all of these results concludes the proof of the lemma.
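The quality of this Gaussian approximation, and its slow improvement, can be probed numerically; a small sketch comparing σ f_{α,β}(µ + tσ) to the standard Gaussian density φ(t):

```python
import numpy as np
from scipy.stats import beta, norm

def gaussian_approx_error(a, b, t=2.0):
    """Relative error of approximating the Beta(a, b) density at mu + t*sigma
    by the matching Gaussian value phi(t)/sigma."""
    d = beta(a, b)
    mu, sigma = d.mean(), d.std()
    return abs(sigma * d.pdf(mu + t * sigma) / norm.pdf(t) - 1.0)

# the error shrinks as both parameters grow...
err_small = gaussian_approx_error(10, 10)
err_large = gaussian_approx_error(1000, 1000)
# ...but remains substantial for the skewed extreme order statistics,
# whose law is Beta(i, n - i + 1) with i small, e.g. Beta(1, 1000)
err_skewed = gaussian_approx_error(1, 1000)
```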
We present a simple corollary of Lemma A.1, which shall prove helpful in studying the asymptotic behavior of CKS under the null hypothesis.

A.2. Supremum of the Standardized Empirical Process
The standardized empirical process plays a central role in the analysis of the HC test statistic. As we shall see below, it is central also to the analysis of CKS. We begin with its definition followed by several known results regarding the magnitude and location of its supremum.
Let X_1, ..., X_n i.i.d. ~ F and let F̂_n(x) = (1/n) Σ_i 1(X_i ≤ x) denote their empirical distribution. The normalized empirical process is defined as

    V_n(x) = √n (F̂_n(x) − F(x)),

and similarly, the standardized empirical process is

    V̂_n(x) = √n (F̂_n(x) − F(x)) / √(F(x)(1 − F(x))).

Of particular interest to us is the supremum of V̂_n. The following lemma provides an equivalent expression for this quantity. A similar formula for the supremum of V_n was used by Donoho and Jin (2004):

    sup_x V̂_n(x) = max_{1≤i≤n} √n (i/n − U_(i)) / √(U_(i)(1 − U_(i))).    (A.9)
Proof. Without loss of generality, we may assume that F = U[0,1]. For any 0 < c < 1, since (c − u)/√(u(1 − u)) is monotone decreasing in u, the supremum in (A.9) is attained at the left edge of one of the intervals of the piecewise-constant function F̂_n, i.e., at one of the order statistics U_(i). Since F̂_n(U_(i)) = i/n, Eq. (A.9) follows.
The following lemma characterizes the supremum of V̂_n(u). It follows directly from Theorem 1 of Eicker (1979). Furthermore, the next lemma, which follows from the main Theorem of Jaeschke (1979), implies that this supremum is rarely attained at one of the first or last log n indices.
Lemma A.4. Let LT be the union of two intervals containing the first and last log n order statistics, LT = (U (1) , U (log n) ] ∪ [U (n−log n) , U (n) ). Then
We are now ready to bound the value of p_(i*). By definition,

    p_(i*) = ∫_0^{u_(i*)} f(x) dx,

where f denotes the density of a Beta(i*, n − i* + 1) random variable. Lemma A.4 states that with high probability log n < i* < n − log n. Now, since τ = √(2 log log n)(1 + o_p(1)) = o_p((log n)^{1/6}), it follows from Corollary A.1 that for all t ∈ [−τ, τ] we may approximate f(µ + σt) by the density of a standard Gaussian, namely f(µ + σt) = (φ(t)/σ)(1 + o_p(1)). Plugging in the well-known approximation to the Gaussian tail, 1 − Φ(τ) = (1 + o(1)) φ(τ)/τ, we obtain that Eq. (B.3) holds with high probability. Finally, Eq. (B.2) follows by combining the above with Eq. (B.3).
Next, we consider the location where, under the null, the CKS test attains its minimal value. Eq. (1.7) in Proposition 1 of Eicker (1979) shows that for both the normalized and standardized empirical processes, the probability of the supremum being attained at one of the first or last log n indices approaches zero as n → ∞. We prove a similar result about the CKS test. This result will be used in the proof of Lemma B.3.
Lemma B.2. Let i* denote the location of the most statistically significant deviation as measured by the two-sided CKS test, i* = arg min_i min(p_(i), 1 − p_(i)). Then under the null hypothesis,

    Pr[ i* ≤ log n or i* ≥ n − log n ] → 0 as n → ∞.

Proof. Under the null, by Eq. (3.1), p_(i) is the one-sided p-value of U_(i). By the probability integral transform, Pr[p_(i) < c] = c. A union bound over the first log n indices shows that the probability that any of them attains a p-value below a level c is at most c log n. The same argument works for the last log n elements and also applies to CKS^−.
We are now ready to prove the left inequality of Theorem 4.1.

Proof. Assume to the contrary that for some ε > 0 and some strictly positive probability p > 0, the following inequality holds for an infinite sequence of n values,

    Pr[ CKS < 1/(log n (log log n)^{1+ε}) ] ≥ p.

Under H_0, both CKS^+ and CKS^− have the same distribution. Therefore, for an infinite sequence of values of n, the following inequality holds,

    Pr[ CKS^+ < 1/(log n (log log n)^{1+ε}) ] ≥ p/2.

The main idea is to demonstrate that if CKS^+ < 1/(log n (log log n)^{1+ε}) then the supremum of the standardized empirical process is unusually large, contradicting Lemma A.3. Denote the index of the most significant one-sided deviation by i* (i.e., CKS^+ = p_(i*)). Let µ and σ^2 be the mean and variance of a Beta(i*, n − i* + 1) random variable and let f be its density. The one-sided p-value at the index i* is thus

    p_(i*) = ∫_0^{u_(i*)} f(x) dx.

In particular, this implies that with high probability µ/σ > (log n)^{1/7}. Using this inequality and Eq. (B.8), we obtain Eq. (B.10). In the corresponding domain of integration |t| < (log n)^{1/7}, so Corollary A.1 gives that f(µ + σt) = (φ(t)/σ)(1 + o(1)). From Lemma A.3 it follows that with high probability τ ≪ (log n)^{1/7}. Therefore, by applying the tail approximation (B.5) to Eq. (B.10), the second integral becomes negligible with respect to the first. Combining the resulting bound with (B.7) and using the monotonicity of (1/τ) e^{−τ^2/2}, the above inequality leads to a lower bound on τ: there exists an infinite sequence of n values and strictly positive numbers ε > 0 and p > 0 for which

    Pr[ τ > √(2 log log n) (1 + (1 + ε)(log log log n)/(4 log log n)) ] ≥ p/2,

in contradiction to Lemma A.3.

B.2. Proof of Theorem 4.2
Let t_0 ∈ R be some point at which F(t_0) ≠ G(t_0). Without loss of generality, we assume that F(t_0) < G(t_0) and derive an upper bound on CKS^+ (in the opposite case the same upper bound would be obtained on CKS^−). Let i* denote the number of random variables X_i smaller than t_0. Since for all i, Pr[X_i < t_0] = G(t_0), the random variable i* follows a binomial distribution,

    i* ~ Binomial(n, G(t_0)).
Next, we give a sketch of the proof of Eq. (5.4). Theorems 1 and 3 of Eicker (1979) imply that for the standardized empirical process, as n → ∞, the event of crossing a lower boundary is asymptotically independent of the event of crossing an upper boundary. As in the proof of Lemma B.1, we can translate these probabilities to crossing probabilities of the CKS^+ and CKS^− statistics under the null, yielding the asymptotic independence

    Pr[ CKS^+ > c_1 and CKS^− > c_2 ] → Pr[CKS^+ > c_1] Pr[CKS^− > c_2] as n → ∞.
In particular, by choosing c_1 = c_2 = c, Eq. (5.4) follows.

Table 1: Comparison of symbolic expressions for f_n(1) resulting from direct integration vs. computation using translated polynomials. L_i is shorthand for L^n_i(c).