Computing the Two-Sided Kolmogorov-Smirnov Distribution

We propose an algorithm to compute the cumulative distribution function of the two-sided Kolmogorov-Smirnov test statistic D_n, and its complementary distribution, in a fast and reliable way. Different approximations are used in different regions of (n, x). Java and C programs are available.


Introduction
The Kolmogorov-Smirnov (KS) two-sided test statistic D_n is widely used to measure the goodness-of-fit between the empirical distribution of a set of n observations and a given continuous probability distribution. It is defined by

D_n = sup_{-∞ < x < ∞} |Ĝ_n(x) − G(x)|,

where n is the number of (independent) observations, Ĝ_n is their empirical cumulative distribution function (cdf), and G is a completely specified continuous theoretical cdf. Let F_n denote the cdf of D_n under the null hypothesis H_0 that the n observations are independent and have cdf G; that is, F_n(x) = P[D_n ≤ x]. This F_n is loosely called the KS distribution.
Computing F n (x) in a fast and reliable way, for arbitrary values of n and x, is not easy.
With the help of a symbolic algebra package, Drew, Glen, and Leemis (2000) computed the exact KS distribution for 1 ≤ n ≤ 30 as a set of piecewise polynomials of degree n resulting from a large number of integrals. They also gave a short literature review. Brown and Harvey (2008) consider seven different methods proposed in the past to compute the exact KS distribution. They programmed these seven methods in Mathematica, using only rational numbers to obtain exact results. Although their Mathematica programs are very useful for verification purposes, the calculations are extremely slow for large n. For example, computing F_n(x) for n = 10000 and x = 2/√n, using the Durbin (1968) recursion formula, takes three hours on our (standard) desktop computer. The Durbin recursion formula gives the fastest exact method among all those programmed by Brown and Harvey (2008). Unfortunately, it contains expressions that suffer from subtractive cancellation between very large terms, and is thus unsuitable for floating-point computations with limited precision. Among the formulae considered by Brown and Harvey (2008), only those of Noé, Pomeranz, and the Durbin matrix formula are not subject to catastrophic cancellation, because they contain only non-negative terms.
The method of Pomeranz (1974) and the Durbin (1968) recursion method were programmed in single-precision Fortran 77 by Pomeranz (1974). The Durbin recursion method is numerically unstable and gives sensible results only for very small n. The Pomeranz method makes use of floors and ceilings which must be computed with great care; as an illustration, the KS distribution computed by the program of Pomeranz (1974) sometimes gives a non-smooth distribution with very large errors, as we will see in Section 2.2. Kallman (1977) programmed a generalized version of the Pomeranz method in Fortran 77; this program works well in single precision for small n, but starts to break down for moderate n and when F_n(x) is close to 1. Miller (1999) programmed the complementary KS distribution F̄_n(x) = 1 − F_n(x) in Fortran 90, using different approximations; his program is very fast, but gives only 0 to 3 decimal digits of precision for n > 120 and sometimes returns NaN (not a number), for example for n = 100 and x = 0.54, 0.66, 0.77, or negative values, for example for n = 50 and 0.35 < x < 0.50.
Recently, Marsaglia, Tsang, and Wang (2003) wrote a C procedure implementing the matrix formula of Durbin (1973). Although it works very well for small n and gives 13 decimal digits of precision according to the authors, it is very slow for large n and sometimes returns NaN or infinite values; for example for n = 11000 and 0.000413 ≤ x ≤ 0.000414; for n = 21000 or 21001 and 0.000434 ≤ x ≤ 0.000526; for n = 42001 and 0.000206 ≤ x ≤ 0.000263; and for n = 62000 and 0.001 ≤ x ≤ 0.007. Furthermore, computing a p-value as p = P[D_n ≥ x] = 1 − F_n(x) gives rise to loss of precision when F_n(x) is close to 1, due to subtractive cancellation.
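This cancellation problem is easy to demonstrate in a few lines of C (the function name is ours, for illustration): once F_n(x) rounds to within one unit in the last place of 1, the difference 1 − F_n(x) retains no correct digits.

```c
#include <assert.h>

/* Demonstrates why p = 1 - F_n(x) loses all precision when F_n(x)
   is within one ulp of 1: the true p-value simply vanishes. */
int pvalue_lost(void) {
    double F = 1.0 - 1e-18;   /* rounds to exactly 1.0 in a 64-bit double */
    double p = 1.0 - F;       /* computed p-value: 0.0, not 1e-18 */
    return p == 0.0;
}
```

This is why a reliable program must compute the complementary distribution F̄_n(x) directly rather than as 1 − F_n(x).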
A close look at popular statistical software reveals that even some of the best products use poor approximations of the KS distribution (or the complementary distribution) in some regions (values of (n, x)). For example, in the popular statistical software environment R (2009), the method of Marsaglia et al. (2003) was recently implemented to compute the p-value p = 1 − F_n(x) in KS tests (with the option "exact"). However, because this method is very slow for large n, the default algorithm in R for n ≥ 100 is an asymptotic approximation that gives only 0 to 2 decimal digits of precision. For example, for n = 120 and x = 0.0874483967333, R's default method returns p = 0.3178 while the correct value is p = 0.30012. For n = 500 and x = 0.037527424, R's default method returns p = 0.482 while the correct value is p = 0.47067. R's "exact" (but slow) method gives the correct values in both cases. However, for p-values smaller than 10^{−15}, neither method gives any correct decimal.

MATLAB (2009) uses different methods to approximate the KS distribution depending on n and x. For example, for n = 20 and x = 0.8008915818, MATLAB returns p = 2.2544 × 10^{−12} while the correct value is p = 2.5754 × 10^{−14}, and for n = 20 and x = 0.9004583223, it returns p = 1.5763 × 10^{−15} while the correct value is p = 1.8250 × 10^{−20}. The returned values have zero correct decimal digits in both cases.
Available programs for the KS distribution in standard programming languages such as C, Fortran, and Java are either fast but unreliable, or precise but slow. Moreover, they compute either only the KS distribution or only its complementary distribution. So there is a need for a fast and reliable program that computes both the KS distribution and its complementary distribution for arbitrary n and x.
In this paper, we describe such an algorithm and its implementation for the KS distribution with floating-point numbers of 53 bits of precision. In §2, we briefly describe the exact methods that we use to compute the KS distribution. We implemented the Pomeranz recursion formula and we compare its speed with the Durbin matrix formula. In §3, we introduce faster but less precise approximations that we use for large n. In §4, we partition the space of (n, x) into different regions and we explain why each method is more appropriate in a given region to compute the KS distribution. In §5, we do the same for the complementary KS distribution. Finally, we measure the speed of our C program at several values of n and x.

Some values in the tails
There are a few special cases for which the exact KS distribution is known and has a very simple form. In the tails (for x close to 0 or 1), the exact values for arbitrary n are (Ruben and Gambino 1982):

F_n(x) = 0                   for x ≤ 1/(2n),
F_n(x) = n! (2x − 1/n)^n     for 1/(2n) ≤ x ≤ 1/n,
F_n(x) = 1 − 2(1 − x)^n      for 1 − 1/n ≤ x ≤ 1,
F_n(x) = 1                   for x ≥ 1.

This also provides an exact formula for the complementary cdf F̄_n(x) over the same range of values of (n, x). We will use these simple formulas whenever they apply.

The Pomeranz recursion formula
The Pomeranz formula permits one to compute a (2n + 2) × (n + 1) matrix whose entries V_{i,j} are defined by a recurrence, and F_n(x) is recovered from the last of those entries. The method is described in Pomeranz (1974) and Brown and Harvey (2008), and is outlined below. It makes use of floors and ceilings which must be computed with care, because it is very sensitive to numerical imprecision, as we will see in a moment.

Let t = nx and define A_1, . . ., A_{2n+2} (which depend on t); A_2 and A_3 are determined by the fractional part of t, and the remaining A_i, for i = 4, 5, . . ., 2n + 1, together with A_{2n+2}, follow from them. The entries V_{i,j} are then defined by a recurrence, (2), over i = 2, . . ., 2n + 2, with limits of summation k_1(i, j) and k_2(i, j) given in (3), and finally F_n(x) = n! V_{2n+2, n+1}. As observed by Pomeranz, the differences A_i − A_{i−1} can take at most four different values. This can be exploited by precomputing the factors ((A_i − A_{i−1})/n)^j / j! in (2), which increases the speed of computation significantly. It is also clear from (2) that we never need to store more than two rows of the matrix V at a time, since row i can be computed from row i − 1 alone.
Pomeranz's algorithm uses the floor and ceiling of A_i ± t to define the lower and upper bounds of summation in (2), which are defined in (3). For some values of t (and x), unavoidable roundoff errors in floating-point calculations will give the wrong value of k_1(i, j) or k_2(i, j), and this may give rise to large errors in the cdf. As an illustration, Figure 1 shows (in blue) the KS distribution F_n(x) as computed by the program 487.f of Pomeranz (1974) for n = 20 and 0.17 ≤ x ≤ 0.19, compared with the correct distribution (in red). The zigzags in the blue line are due to floating-point errors in computing k_1(i, j) and k_2(i, j).
To avoid this type of problem, we precompute exactly all floors and ceilings of A_i ± t in (3) for the given value of t = nx, even though nx itself is not computed exactly, due to floating-point error in the multiplication. We decompose t = ℓ + f, where ℓ is a non-negative integer and 0 ≤ f < 1. There are three cases to consider, depending on where the fractional part f lies. We must make sure to identify correctly which case applies, and use the corresponding formulas for the precomputations.
Case (i): f = 0. Case (ii): 0 < f ≤ 1/2. Case (iii): 1/2 < f < 1. In each case, the floors and ceilings of A_i ± t reduce to simple expressions in ℓ and i, which we precompute exactly.

With floating-point numbers that follow the IEEE-754 64-bit standard for double precision, the algorithm cannot be used in this naive form as soon as n > 140, because for some regions of x, the terms V_{i,j} will underflow to 0 while the n! factor will overflow, even though the corresponding probability is finite. For this reason, we must periodically renormalize all elements of row i of V (in our implementation, we multiply all elements of the row by 2^350) when the smallest element of the row becomes too small (smaller than 10^{−280} in our implementation), and we keep track of the logarithm of the renormalization factors to recover the correct numbers.
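The renormalization step can be sketched as follows (function and constant names are ours; only the row-rescaling logic is shown, not the full recurrence):

```c
#include <assert.h>
#include <math.h>

#define RESCALE_EXP  350       /* multiply the row by 2^350 */
#define SMALL_THRESH 1e-280    /* renormalize below this value */

/* Rescale one row of V when its smallest element gets too small,
   counting the number of 2^350 rescalings in *shifts so that the
   true value of an entry v is v * 2^(-350 * *shifts). */
void renormalize_row(double *row, int len, int *shifts) {
    double min = row[0];
    for (int j = 1; j < len; j++)
        if (row[j] < min) min = row[j];
    if (min < SMALL_THRESH) {
        for (int j = 0; j < len; j++)
            row[j] = ldexp(row[j], RESCALE_EXP);  /* exact: power-of-2 scale */
        (*shifts)++;
    }
}
```

At the end, the factor 2^{−350·shifts} is combined with n! through logarithms, so that F_n(x) = n! V_{2n+2,n+1} is recovered without overflow or underflow.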

Comparison between Pomeranz and Durbin for small n
One of our objectives in this work was to provide a reliable implementation of the Pomeranz algorithm and compare its speed and precision with the Durbin matrix algorithm implemented in MTW (its main competitor), for small n. We implemented the Pomeranz algorithm in C and Java, while MTW is implemented in C. For n not too large, these two implementations both return 13 to 15 decimal digits of precision (as measured by comparing with the exact values provided by the Mathematica programs of Brown and Harvey (2008)). We also compared their speed for computing F_n(x) at values of (n, x) selected as follows. The mean of D_n (under H_0) is close to µ_0 = ln(2)√(π/(2n)) ≈ 0.868731/√n, and we selected the pairs (n, x) around this mean. More specifically, we took x = aµ_0 for a = 1/4, 1/3, 1/2, 1, 2, and 3, for n ranging from 10 to 1000. The values of F_n(x) for these pairs (n, x) are given in Table 1. For each pair (n, x) in the table, we measured the CPU time (in seconds) needed to compute F_n(x) 10^6 times with the C programs. All the timings reported in this paper were made on a computer with an AMD Athlon 4000+ processor at a clock speed of 2400 MHz, running Red Hat Linux. The results are shown in Table 2 for the Pomeranz algorithm and in Table 3 for the Durbin matrix algorithm. We see that the CPU times increase with n and with x (for fixed n). For small n, MTW is faster for x < µ_0 (roughly), while the Pomeranz program is faster for x ≥ µ_0.
When n is large, especially for x > µ 0 , these exact algorithms are too slow and have to be replaced by approximations.

Asymptotic approximations
Kolmogorov (1933) proved the following expansion, which provides an approximation of F_n(z/√n) for large n:

F_n(z/√n) ≈ K_0(z) = 1 − 2 Σ_{i=1}^{∞} (−1)^{i−1} e^{−2 i² z²}.    (5)

Pelz and Good (1976) generalized Kolmogorov's approximation (5) to an asymptotic series in 1/√n. The curves nx² = constant are curves of almost constant probability (see Table 1 for some typical values of x). Thus regions of constant nx² seem relevant for choosing between the different approximations of F_n(x).
If F̄_n(x) < 2.2 × 10^{−16}, the double precision representation of F_n(x) cannot be distinguished from 1, so we can just return F_n(x) = 1. The second term in Kolmogorov's formula (5) is 2e^{−2z²}, which is less than 2.2 × 10^{−16} when z² > 18.37. More precise calculations show that in fact F̄_n(x) < 5 × 10^{−16} whenever nx² ≥ 18. Thus we shall just return F_n(x) = 1 whenever nx² ≥ 18, and this gives 15 digits of precision in this region, labeled "F_n(x) = 1" in Figure 2.
(i) n ≤ 140. Comparing the results of the approximation of F̄_n(x) by (8) with the exact values from the Brown and Harvey (2008) Mathematica programs, we find that it returns at least 10 decimal digits of precision as long as F̄_n(x) < 10^{−3}, for n ≤ 140. To determine the region where F̄_n(x) < 10^{−3} (approximately) in terms of n and x, we first use Kolmogorov's formula (5), which gives K_0(2) = 0.99933 for nx² = 4. More precise calculations with exact methods show that for n ≤ 140, we have F̄_n(x) < 0.0006 (and thus F_n(x) > 0.9994) whenever nx² ≥ 4. When we use the approximation (8) in the region nx² ≥ 4, the absolute error on F̄_n(x) is always smaller than 10^{−14}, and thus computing F_n(x) = 1 − F̄_n(x) gives 14 decimal digits of precision for F_n(x).
In the region that remains, we use one of the exact methods of Durbin and Pomeranz. Based on the speed comparison in §2.4, we choose the Durbin matrix algorithm when nx² < 0.754693 and the Pomeranz algorithm otherwise. We avoid the Pelz-Good approximation because it returns fewer than 5 correct digits for n ≤ 140.
(ii) 140 < n ≤ 10^5. Whenever nx² ≥ 18, we simply return F_n(x) = 1, and this gives 15 digits of precision as explained above. Otherwise, we use the Pelz-Good asymptotic series (6). For the complementary distribution, values of F̄_n(x) smaller than about 10^{−308} underflow to 0 in double precision; applying this threshold to the dominant term 2e^{−2z²} in (5), we obtain the condition z² > 355. More precise calculations for different values of n with the Miller approximation (8) show that we can safely set F̄_n(x) = 0 whenever nx² ≥ 370, because F̄_n(x) is then always smaller than 10^{−308}; this corresponds to the region "F̄_n(x) = 0" in Figure 3.
Since for n > 140 the Pelz-Good approximation (6) returns 5 decimal digits of precision, we shall use the Miller approximation (8), together with the Smirnov formula (7), in the larger region nx² ≥ 2.2. In this region, we have F̄_n(x) < 0.025 and the Miller approximation returns at least 6 decimal digits of precision.
For nx² < 2.2, we simply use F̄_n(x) = 1 − F_n(x). Since for n > 140 we have at least 5 decimal digits of precision on F_n(x) in the region where F_n(x) > 10^{−12}, the program returns at least 5 decimal digits of precision everywhere for F̄_n(x) in the region nx² < 2.2.
Overall, for 140 < n ≤ 200000, our program for F̄_n(x) returns at least 5 decimal digits of precision everywhere. For n > 200000, it returns a few correct decimal digits.
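The region selection for F_n(x) described above can be summarized by a small dispatcher; this sketch encodes only the boundaries stated explicitly in the text (nx² = 18, nx² = 0.754693, n = 140) and omits the exact tail formulas and the complementary-distribution regions:

```c
#include <assert.h>
#include <string.h>

/* Pick the method used to compute F_n(x), following the regions
   described in the text (simplified sketch). */
const char *choose_method(int n, double x) {
    double w = (double)n * x * x;
    if (w >= 18.0) return "one";                      /* F_n(x) = 1 to 15 digits */
    if (n <= 140)
        return (w < 0.754693) ? "durbin" : "pomeranz"; /* exact methods */
    return "pelz-good";                               /* asymptotic, large n */
}
```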

The C and Java programs
The C program in file KolmogorovSmirnovDist.c has two public functions: KScdf(int n, double x), which computes F_n(x) for given n and x, and KSfbar(int n, double x), which computes the complementary cdf F̄_n(x) for given n and x.
The Java program in file KolmogorovSmirnovDist.java has two public static methods: cdf(int n, double x), which computes F_n(x) for given n and x, and fbar(int n, double x), which computes the complementary cdf F̄_n(x) for given n and x.
The programs are available at http://www.iro.umontreal.ca/~simardr/ksdir/. Tables 4 and 5 report the CPU times needed to compute F_n(x) 10^6 times, for selected values of n and x as in Table 1, with MTW and our C program. Note that for x = 3µ_0, MTW uses the quick approximation given by the authors, which returns only 7 decimal digits of precision. We do not use their approximation since, for n ≤ 140, the cdf it gives is much less precise than our program, and using this formula to compute F̄_n(x) = 1 − F_n(x) would give a very bad approximation. For n > 140, our program is considerably faster than MTW. On the other hand, the MTW program often gives more precision than our asymptotic expansions when it returns sensible values for n > 140. However, the MTW program loses all precision when it is used to compute F̄_n(x) and the latter is smaller than 10^{−15}, whereas our program returns sensible values for F̄_n(x) even when the latter is as small as 10^{−307}.

Tables 8 and 9 also report the relative error of the Miller approximation. Since the Miller approximation becomes better as x increases for a given n, we conclude that it gives at least 6 decimal digits of precision for F̄_n(x) in this region.

Region nx^{3/2} ≤ 1.4 and 140 < n ≤ 10^5, for F_n(x). Here we compare the Pelz-Good approximation on the separating curve nx^{3/2} = 1.4 with the Mathematica program, and also with the Durbin matrix C program for a few values of (n, x). The results are shown in Tables 10 and 11. After testing many values of F_n(x) at different n > 140 and x, we concluded that the Pelz-Good approximation becomes better as x increases for a given n, and that it gives at least 5 decimal digits of precision in this region.

Figure 1 :
Figure 1: The KS distribution for n = 20 returned by the program 487.f (in blue) compared with the correct cdf (in red)

Figure 2 :
Figure 2: Choice of method to compute (or approximate) F_n(x), and minimal number of decimal digits of precision for F_n(x), as a function of (n, x), in each region of N × [0, 1].

Figure 3 :
Figure 3: Choice of method to compute (or approximate) F̄_n(x), and minimal number of decimal digits of precision for F̄_n(x), as a function of (n, x), in each region of N × [0, 1].

Table 2 :
CPU time (seconds) to compute F_n(x) 10^6 times with the Pomeranz algorithm

Table 5 :
CPU time (seconds) to compute F_n(x) 10^6 times with our C program

Table 7 :
Values of F̄_n(x) for x = √(4/n)

Region nx² ≥ 2.2 and 140 < n ≤ 10^5. Consider the values of x = √(2.2/n) for several values of n. We compare the Miller approximation (8) for the KS complementary distribution function F̄_n(x) both with the Brown-Harvey Mathematica program and with the Durbin matrix C program. The results are shown in Tables 8 and 9.

Table 8 :
Values of F̄_n(x) for x = √(2.2/n)