Arbitrary Precision Mathematica Functions to Evaluate the One-Sided One Sample K-S Cumulative Sampling Distribution

Efficient rational arithmetic methods that can exactly evaluate the cumulative sampling distribution of the one-sided one sample Kolmogorov-Smirnov (K-S) test have been developed by Brown and Harvey (2007) for sample sizes n up to fifty thousand. This paper implements in arbitrary precision the same 13 formulae to evaluate the one-sided one sample K-S cumulative sampling distribution. Computational experience identifies the fastest implementation which is then used to calculate confidence interval bandwidths and p values for sample sizes up to ten million.


Introduction
In a recent paper, Brown and Harvey (2007) evaluated 13 formulae for calculating the exact p values of the one-sided one sample Kolmogorov-Smirnov (K-S) test. They used rational arithmetic so that the p values were calculated exactly, and the implementations of all 13 formulae yielded the same p values. From comparisons of the computational times, which increased with increasing sample size, increasing number of test statistic digits, and decreasing p value, the formulae were ranked from fastest to slowest. For ρ = 3 digits in the test statistic and a p value of 0.001, the computer time for the fastest formula was 1.406 seconds for a sample size of n = 10,000 and 81.047 seconds for a sample size of n = 50,000 on a Pentium IV running at 2.4 GHz. In contrast, for ρ = 6 digits in the test statistic and the same p value of 0.001, the computer time for the fastest formula was 23.657 seconds for a sample size of n = 10,000 and 1213.360 seconds for a sample size of n = 50,000.
Rational arithmetic stores every number as a ratio of two integers (a rational number) where each integer can have as many decimal digits as needed to express the number exactly. However, even the fastest rational arithmetic method only calculated p values for sample sizes up to fifty thousand. This paper develops an alternate computational environment that is faster than all the rational arithmetic implementations. Arbitrary precision methods are used to calculate one-sided one sample K-S p values where the accuracy of the resultant p value is specified by the user. Comparative analysis of the rational arithmetic implementations in Brown and Harvey (2007) with the arbitrary precision implementations in this paper shows that the arbitrary precision methods are over ten times faster than the rational arithmetic methods. Although arbitrary precision methods are faster, a major difficulty is determining the accuracy of any p value produced by arbitrary precision methods. The p values calculated by rational arithmetic in Brown and Harvey (2007) will be used to check the accuracy of this paper's arbitrary precision calculations.
Arbitrary precision uses floating point quantities where the number of decimal digits (precision) is specified by the user. Every arbitrary precision quantity has a fixed number of digits but unlike rational arithmetic the precision of the calculated quantities can decrease as calculations are made. However, Mathematica in its arbitrary precision package automatically keeps track of every number's precision. To quote from the Mathematica Book (Wolfram 2003) (page 731), "When you do a computation, Mathematica keeps track of which digits in your result could be affected by unknown digits in your input. It sets the precision of your result so that no affected digits are ever included. This procedure ensures that all digits returned by Mathematica are correct, whatever the values of the unknown digits may be." In other words, the user specifies the internal precision ip to evaluate an expression and then Mathematica evaluates the expression so that the resulting precision rp (the precision of the quantity found by calculating the value of the expression) is equal to ip if at all possible. If this is not possible, then Mathematica produces a result that maximizes rp given the limitations imposed by the precision of the inputs and the computations.
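The distinction between the two arithmetics can be illustrated with a small Python sketch; Python's fractions and decimal modules play the roles of Mathematica's rational and arbitrary precision arithmetic here (unlike Mathematica's significance arithmetic, decimal does not track how many digits of a result remain correct, so this is only an analogy):

```python
from fractions import Fraction
from decimal import Decimal, getcontext

# Rational arithmetic: every number is a ratio of two integers, held exactly.
p_exact = sum(Fraction(1, k) for k in range(1, 11))   # 1/1 + 1/2 + ... + 1/10
assert p_exact == Fraction(7381, 2520)                # no rounding error at all

# Arbitrary precision: the user fixes the number of significant digits first,
# analogous to specifying the internal precision ip.
getcontext().prec = 50                                # ip = 50 decimal digits
p_ap = sum(Decimal(1) / Decimal(k) for k in range(1, 11))
# p_ap agrees with p_exact to roughly 50 digits, but every division rounded.
```

The rational result is exact no matter how many operations are chained; the arbitrary precision result can silently lose digits, which is exactly why the resulting precision rp must be monitored.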
Since the internal precision ip must be specified before computations begin, determining the relationship between the internal precision ip and the resulting precision rp makes arbitrary precision methods more difficult to implement than rational arithmetic methods. By finding functions that predict the internal precision ip needed to produce a p value with a resulting precision of at least rp, this paper develops arbitrary precision methods to calculate one-sided one sample K-S p values to any desired accuracy for sample sizes up to ten million, n ≤ 10,000,000.
Mathematica 5 was used to develop all the code in this paper. However, the code was tested in Mathematica 6 and will work in both Mathematica 5 and 6.

K-S cumulative sampling distribution formulae
The one-sided one sample K-S test uses the maximum distance between the hypothesized continuous cumulative distribution F(x) and the empirical cumulative distribution F_n(x). There are two one-sided one sample random variables: the one-sided upper random variable D+_n = sup_{−∞<x<∞} {F_n(x) − F(x)} and the one-sided lower random variable D−_n = sup_{−∞<x<∞} {F(x) − F_n(x)}. Since by symmetry D+_n and D−_n have the same cumulative sampling distribution, D+_n is used to represent both cases. The cumulative sampling distribution is used to calculate the p value P(D+_n ≥ d+) for test statistic d+. Brown and Harvey (2007) found or developed 13 formulae (four direct formulae, four iterative formulae, and five recursion formulae) that can be used to evaluate the one-sided one sample K-S cumulative sampling distribution. This section summarizes the 13 formulae.

Table 1: K-S one-sided one sample direct formulae, giving each formula to compute P[D+_n ≥ d+] for 0 < d+ ≤ 1, where ⌊n(1 − d+)⌋ is the greatest integer less than or equal to n(1 − d+).

Direct formulae
A closed form expression of the one-sided one sample K-S cumulative sampling distribution was developed by Smirnov (1944) and verified by many scholars including Feller (1948) and Birnbaum and Tingey (1951). For 0 < d+ ≤ 1 and sample size n, Smirnov's distribution, SmirnovD, is shown in the first row of Table 1, where ⌊n(1 − d+)⌋ is the greatest integer less than or equal to n(1 − d+). Dwass (1959) derived a different formula, DwassD, that is also shown in Table 1. Second forms of the Smirnov distribution, SmirnovAltD, and the Dwass distribution, DwassAltD, are derived by factoring 1/n^(n−1) out of their respective formulae. Brown and Harvey (2007) transformed each of the four formulae in Table 1 into an iterative version; these are presented in Table 2. Daniels (1945) derived a difference equation that can be solved for the following formula.
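As a concrete illustration of the direct formulae, Smirnov's closed form in its standard Birnbaum and Tingey (1951) statement can be evaluated exactly in rational arithmetic, in the spirit of Brown and Harvey (2007). The sketch below is an illustration in Python rather than the paper's Mathematica implementation:

```python
from fractions import Fraction
from math import comb, floor

def smirnov_d(n, d):
    """Exact right tail p value P[D+_n >= d+] via the Smirnov /
    Birnbaum-Tingey closed form, evaluated in exact rational arithmetic:
      P = d * sum_{j=0}^{floor(n(1-d))} C(n,j) (d + j/n)^(j-1) (1 - d - j/n)^(n-j)
    (the j = 0 term equals (1 - d)^n)."""
    d = Fraction(d)
    m = floor(n * (1 - d))            # greatest integer <= n(1 - d+)
    total = Fraction(0)
    for j in range(m + 1):
        total += comb(n, j) * (d + Fraction(j, n)) ** (j - 1) \
                            * (1 - d - Fraction(j, n)) ** (n - j)
    return d * total
```

For n = 100 and d+ = 0.105632 this reproduces the p value 0.0999799038007708... used as an example later in the paper; for n = 1 it reduces to the elementary result P[D+_1 ≥ d+] = 1 − d+.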

Daniels' recursion formula
Q_0(1) = 1

Table 2: K-S one-sided one sample iterative formulae, where ⌊n(1 − d+)⌋ is the greatest integer less than or equal to n(1 − d+).

Noe and Vandewiele recursion formula
Since the Daniels recursion formula has both positive and negative terms, Noe and Vandewiele (1968) derived an alternate recursion formula that has only non-negative terms. Noe (1972) later added a correction to this recursion formula. The particular form of the recursion formula listed below containing Noe's correction is taken from Shorack and Wellner (1986, formulae 24 through 28 on page 363) and is denoted by Noe in the rest of the paper.
Bolshev recursion formula
Kotelnikov and Chmaladze (1983) used the recursion formula shown below that was later called the Bolshev recursion in Shorack and Wellner (1986).

Calculation error and computation time
Rounding and catastrophic cancellation are the two sources of calculation error. For rounding error, the size of the error grows with the number of calculations. In contrast, catastrophic cancellation can only occur when a negative number is added to a positive number and the size of the error is dependent on how close the absolute value of the negative number is to the positive number. Like possible rounding error, computation time also increases with the number of calculations.
In implementing the 13 formulae using arbitrary precision, the internal precision ip must be specified at the beginning of the calculations. At the end of the calculations, Mathematica gives both the value representing the result of the calculations and the value's resulting precision rp (number of decimal digits of precision). However, as the error increases, the resulting precision rp decreases for the same internal precision ip.
Before studying the relationship between ip and rp for the 13 formulae, the difficulty in evaluating the 13 formulae can be seen by looking at the number of terms as well as the smallest and largest terms in each formula. In each formula, the terms are added together to produce the p value P(D+_n ≥ d+ = t/n). Table 3 defines what constitutes a term for each of the 13 formulae.
For a specific test statistic d + and sample size n, the Mathematica functions implementing each of the 13 formulae in rational arithmetic from Brown and Harvey (2007) were modified to find the number of positive terms, the number of negative terms, the smallest negative term, the largest negative term, the smallest positive term, and the largest positive term. Table 4 provides a listing of these Mathematica functions which are contained in the approximately 630 kilobyte file KS1SidedOneSampleLargestSmallestTermsRational.nb.
In Section 14, file KS1SidedOneSampleLargestSmallestTermsRational.nb also contains the Mathematica function LargeSmallTermsToFileKS1SidedOneSample that uses all the Mathematica functions listed in Table 4 to produce an output file containing all the results. For the sample size n = 200 and the corresponding test statistics d+ that produce a p value near 0.001 and 0.9, Table 5 lists the number of positive terms, the number of negative terms, the smallest negative term, the largest negative term, the smallest positive term, and the largest positive term for each formula. Table 5 shows that only the DwassD, DwassAltD, DwassI, DwassAltI, and Daniels formulae (the negative/positive term formulae) contain negative terms and thus are the only formulae that can have cancellation error. Since the SmirnovD, SmirnovAltD, SmirnovI, SmirnovAltI, Noe, Steck, Conover, and Bolshev formulae contain only non-negative terms, their only source of calculation error is rounding error. In contrast, the negative/positive term formulae (DwassD, DwassAltD, DwassI, DwassAltI, and Daniels) can have both rounding error and cancellation error. Thus, we would expect that the internal precision ip needed to produce a specific resulting precision rp would be much higher for the negative/positive term formulae than for the non-negative term formulae (SmirnovD, SmirnovAltD, SmirnovI, SmirnovAltI, Noe, Steck, Conover, and Bolshev). All else being equal, a larger internal precision ip takes more computer time. So just considering the internal precision ip, the negative/positive term formulae would take more computer time than the non-negative term formulae. Another factor that greatly affects computer time is the number of terms. Using the number of terms tabulated in Table 5, the formulae ranked from the smallest to the largest number of terms are DwassD, DwassAltD, DwassI, DwassAltI, SmirnovD, SmirnovAltD, SmirnovI, SmirnovAltI, Conover, Steck, Bolshev, Daniels, and Noe.
In general, the Dwass-based formulae (DwassD, DwassAltD, DwassI, DwassAltI) have the fewest terms, the Smirnov-based formulae (SmirnovD, SmirnovAltD, SmirnovI, SmirnovAltI) have the second fewest, and the recursion formulae (Conover, Steck, Bolshev, Daniels, Noe) have by far the largest number of terms. For the rational arithmetic implementations of the 13 formulae, Brown and Harvey (2007) found the Dwass-based formulae were faster than the Smirnov-based formulae, which were much faster than the recursion formulae. This difference in speed is undoubtedly due to the differences in the number of terms between the three groups.
The relative magnitude of the terms can also affect the error. For example, with an internal precision ip = 20, adding or subtracting 1.23 × 10^1 to 4.56 × 10^23 has no effect, as the result is 4.56 × 10^23. If in later calculations catastrophic cancellation reduces the value to 1.11 × 10^2, then the contribution of 1.23 × 10^1 is lost and the correct value of 1.23 × 10^1 + 1.11 × 10^2 = 1.233 × 10^2 is not realized. Table 5 shows that the terms in each formula can have a large relative magnitude.
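The loss described in this example can be reproduced in a small Python decimal sketch, with the context precision playing the role of the internal precision ip = 20 (an illustration only; the paper's computations are in Mathematica):

```python
from decimal import Decimal, getcontext

getcontext().prec = 20            # ip = 20 significant digits
big = Decimal('4.56E23')
x = big + Decimal('12.3')         # 12.3 lies below the 20-digit window
assert x == big                   # the addition had no effect at all

# If later cancellation strips away the large part, 12.3 never comes back:
lost = x - big                    # exact arithmetic would recover 12.3
assert lost == 0
```

Once a small term falls outside the working-precision window of a much larger term, no later cancellation can restore its contribution.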
The computational tradeoff between the 13 formulae depends on three factors: the number  of terms, the relative magnitude of the terms, and whether negative terms are present. How these three factors will interact is unclear and can only be determined by implementing each formula in arbitrary precision and then comparing the computational time of all formulae.
Using the rational arithmetic implementations of all 13 formulae, Brown and Harvey (2007) found that the DwassAltD formula was the fastest. In developing techniques to implement the 13 formulae in arbitrary precision, the fastest rational arithmetic formula, DwassAltD, will be developed first. This provides a methodology that will be used to develop arbitrary precision implementations for the other formulae.

Arbitrary precision implementation of DwassAltD
Utilizing the rational arithmetic version of DwassAltD programmed by Brown and Harvey (2007), an arbitrary precision version is produced by inputting the internal precision ip to be used in all calculations and then replacing the rational arithmetic calculations with arbitrary precision calculations employing the input internal precision ip. The Mathematica function DwassAltDKS1SidedOneSampleRTArbPrecision contained in Section 1 of the KS1SidedOneSampleDwassFormulae.nb file calculates the right tail p value with the DwassAltD formula for test statistic d+ = dIn and sample size n = sampleSizeIn using arbitrary precision arithmetic with internal precision ip = internalPrecisionIn digits of precision. The input test statistic d+ = dIn is converted to a rational arithmetic number d+ = d so that Mathematica will consider the test statistic an exact number. For the test statistic d+ = 0.105632, sample size n = 100, and internal precision ip = 30, the Mathematica function DwassAltDKS1SidedOneSampleRTArbPrecision produces a right tail probability of P[D+_100 ≥ d+ = 0.105632] = 0.09997990380077079963347, which has a resulting precision of rp = 22. If the test statistic d+ = 0.105632 is not converted to a rational number, the Mathematica function DwassAltDKS1SidedOneSampleRTArbPrecisionNonRationalTestStatistic in Section 1 of the KS1SidedOneSampleDwassFormulae.nb file produces a right tail probability of P[D+_100 ≥ d+ = 0.105632] = 0.0999799, which has a resulting precision of rp = 6. The reason for this decrease in precision is that Mathematica considers the test statistic d+ = 0.105632 a machine precision number and does all the computations in machine precision. Not converting the test statistic d+ to a rational number can have very bad consequences.
For the test statistic d+ = 0.0199760, sample size n = 2,000, and internal precision ip = 60, the Mathematica function DwassAltDKS1SidedOneSampleRTArbPrecision produces a right tail probability of P[D+_2000 ≥ d+ = 0.0199760] = 0.199998648779404946563706459535437273, which has a resulting precision of rp = 36. However, if the test statistic d+ = 0.0199760 is not converted to a rational number, the Mathematica function DwassAltDKS1SidedOneSampleRTArbPrecisionNonRationalTestStatistic produces a right tail probability of P[D+_2000 ≥ d+ = 0.0199760] = −1.52071 × 10^6. These examples also show that the DwassAltD formula has a lot of catastrophic cancellation.
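The rationalization step has a direct Python analogue, sketched below: converting the decimal string form of the test statistic to an exact fraction, rather than letting it pass through machine precision floating point:

```python
from fractions import Fraction

# From the decimal string: an exact rational, just as Mathematica's exact
# number 0.105632 = 3301/31250 after Rationalize-style conversion.
d = Fraction('0.105632')
assert d == Fraction(3301, 31250)       # 105632/1000000 in lowest terms

# From the machine precision float: the nearest binary double, which is
# NOT the intended test statistic (0.105632 is not exactly representable).
d_machine = Fraction(0.105632)
assert d_machine != d
```

Starting from the exact rational keeps every downstream digit attributable to the formula itself rather than to the representation of the input.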
To deal with catastrophic cancellation in DwassAltD, the following six steps are needed to implement DwassAltD in arbitrary precision. First, implement DwassAltD in arbitrary precision where the internal precision ip is input. Second, develop a procedure to determine the minimum internal precision mp needed to produce a result with a desired precision dp. Third, determine an upper limit on the sample size n so that the computation time needed to find a p value for the sample size upper limit is around 100 seconds. Fourth, specify a representative set of sample sizes n, p values, and desired precisions dp to generate the resulting minimum precisions mp. Fifth, using the data generated in Step 4, fit a function that predicts the minimum precision mp needed for a particular n, p value, and dp. Sixth, using the fitted function found in Step 5 to predict the internal precision ip, modify the program in Step 1 so that the desired precision dp is input instead of ip.

Minimum precision
In order to get a resulting precision of rp, ip may have to be set to a value that exceeds rp but the user does not know by how much. Thus, the relationship between ip and rp needs to be investigated so realistic internal precisions ip can be set. Define rp(F, d + , n, ip) as the resulting precision rp for formula F , sample size n, test statistic d + , and internal precision ip. The resulting precision rp(F, d + , n, ip) can be found by running the arbitrary precision method for formula F and then using the Mathematica function Precision on the calculated right tail p value P [D + n ≥ d + ] to determine rp(F, d + , n, ip). Since the Mathematica function Precision will often return a non-integer value, the result will be truncated so that resulting precision rp(F, d + , n, ip) used in this paper will always be an integer.
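Mathematica reports rp directly through its Precision function. As an illustrative stand-in only (not the paper's method), the resulting precision of a computation can be estimated in Python's decimal arithmetic by recomputing at a much higher precision and measuring how many digits agree:

```python
from decimal import Decimal, getcontext

def resulting_precision_estimate(compute, ip):
    """Estimate the resulting precision rp of compute() run at internal
    precision ip, by comparing against a higher precision reference.
    A rough emulation of Mathematica's Precision[], for illustration."""
    getcontext().prec = ip
    value = compute()                 # the computation at precision ip
    getcontext().prec = ip + 50
    reference = compute()             # much more accurate reference value
    err = abs(value - reference)
    if err == 0:
        return ip                     # no detectable loss: rp = ip
    return max(0, int(-(err / abs(reference)).log10()))

# A cancellation-prone computation: (10^20 + 1) - 10^20.
cancel = lambda: (Decimal(10)**20 + 1) - Decimal(10)**20
```

At ip = 15 the estimate is rp = 0 (the +1 is lost entirely and the subtraction returns 0), while at ip = 30 no digits are lost, mirroring how rp falls below ip when cancellation strikes.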
For any d+, n, dp, and formula F, a search procedure can be used to find the minimum internal precision mp(F, d+, n, dp). The procedure to find mp(F, d+, n, dp) has two distinct steps. First, find a lower bound l and an upper bound u so that l ≤ mp(F, d+, n, dp) ≤ u. Second, use bisection search to find the minimum internal precision mp(F, d+, n, dp). As seen from the example in the last paragraph, the minimum internal precision mp(F, d+, n, dp) for DwassAltD can be much greater than the desired precision, so a good initial estimate of mp(DwassAltD, d+, n, dp) can significantly reduce the number of iterations. For the DwassAltD formula, preliminary work shows that √n is a good initial estimate.
For DwassAltD, the Mathematica function DwassAltDKS1SidedOneSampleRTArbPrecision contained in Section 1 of the KS1SidedOneSampleDwassFormulae.nb file will be used to calculate the resulting precision rp(F, d + , n, ip). The following search procedure to determine the DwassAltD minimum precision mp(DwassAltD, d + , n, dp) summarizes the Mathematica function MinPrecisionMinusDesiredPrecisionDwassAltD contained in Section 2 of the KS1SidedOneSampleDwassFormulae.nb file. The function returns the number of iterations numberTries and the minimum precision minus the desired precision mpdp(DwassAltD, d + , n, dp) = mp(DwassAltD, d + , n, dp) − dp.
Procedure to find the DwassAltD minimum precision mp(DwassAltD, d+, n, dp)

Step 1 (initial potential bounds): Set the lower limit l to the desired precision dp, l = dp. Set the upper limit and internal precision u = ip = √n + dp and then run the arbitrary precision DwassAltD method to determine rp(DwassAltD, d+, n, ip). If rp(DwassAltD, d+, n, ip) ≥ dp, go to Step 4. Otherwise, set l = u and go to Step 2.
Step 2 (new potential upper bound): At this point, mp(DwassAltD, d + , n, dp) > l. Set the upper limit u and internal precision ip to the lower limit plus an increment, u = ip = l + dp − rp(DwassAltD, d + , n, l).
Step 3 (test potential upper bound): Run the arbitrary precision DwassAltD method to determine rp(DwassAltD, d+, n, ip). If rp(DwassAltD, d+, n, ip) ≥ dp, go to Step 4. Otherwise, set l = u and go to Step 2.
Step 4 (convergence check): At this point, mp(DwassAltD, d+, n, dp) ≤ u. If u − l ≤ 1, stop; the minimum precision is mp(DwassAltD, d+, n, dp) = u. Otherwise, go to Step 5.
Step 5 (binary search): Set ip = ⌊(l + u)/2⌋ and run the arbitrary precision DwassAltD method to determine rp(DwassAltD, d+, n, ip). If rp(DwassAltD, d+, n, ip) ≥ dp, set u = ip and go to Step 4. Otherwise, set l = ip and go to Step 4.
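The procedure above can be condensed into a Python sketch, with resulting_precision standing in for a run of the arbitrary precision DwassAltD method followed by Mathematica's Precision function (the callable and its signature are assumptions made for illustration):

```python
import math

def minimum_precision(resulting_precision, n, dp):
    """Smallest internal precision ip with resulting_precision(ip) >= dp."""
    # Step 1 (initial potential bounds): l = dp and u = sqrt(n) + dp,
    # the preliminary DwassAltD estimate of the minimum precision.
    l = dp
    u = math.isqrt(n) + dp
    # Steps 2-3: while u is not yet a valid upper bound, raise it by
    # the observed precision deficit dp - rp(l).
    while resulting_precision(u) < dp:
        l = u
        u = l + dp - resulting_precision(l)
    # Steps 4-5 (binary search): shrink [l, u] until it pins down mp.
    while u - l > 1:
        ip = (l + u) // 2
        if resulting_precision(ip) >= dp:
            u = ip
        else:
            l = ip
    # l = dp may itself already achieve dp; it was never probed directly.
    return l if resulting_precision(l) >= dp else u
```

For example, if a formula always loses 37 digits to cancellation, so that rp(ip) = ip − 37, the search returns mp = dp + 37 regardless of the starting bounds.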
For an arbitrary precision formula F , the minimum precision minus the desired precision mpdp(F, d + , n, dp) will be found for a representative set of desired precisions dp, sample sizes n, and test statistics d + . Using this data, a function will be fit to predict the minimum precision minus the desired precision mpdp(F, d + , n, dp) needed for a specific value of dp, n, and d + . This function will then be used in the arbitrary precision routine to initially set the internal precision ip.
In considering what would be a good representative set of desired precisions dp, note that Mathematica uses machine precision rather than arbitrary precision if the internal precision is less than the 16 digits of machine precision numbers. Since Mathematica needs to be forced to use arbitrary precision, the smallest desired precision will be set to dp = 20. Using the desired precisions dp = 20, 40, 100, the analyses in the rest of the paper found that the minimum precision minus the desired precision mpdp(F, d+, n, dp) did not vary for the same formula F, the same test statistic d+, and the same sample size n. In other words, mpdp(F, d+, n, dp = 20) = mpdp(F, d+, n, dp = 40) = mpdp(F, d+, n, dp = 100). Thus, we will not report the minimum precision minus the desired precision mpdp(F, d+, n, dp) for different values of the desired precision dp.

Generating test statistics
The representative set of the test statistic d+ must take into account the fact that for a fixed value of d+, the p value changes with the sample size n. For d+ = 293/5000, Brown and Harvey (2007) show that for n = 1,000, P[D+_1000 ≥ 293/5000] ≈ 0.001, while for n = 5,000, P[D+_5000 ≥ 293/5000] ≈ 1.14493 × 10^−15. Given that the p value for the same test statistic d+ changes dramatically with n, a different method of determining the representative set for d+ other than just arbitrarily fixing their values should be used. One method is to set the test statistic d+ equal to a value that would give a p value close to a specified p value. The approximation of Maag and Dicaire (1971) can be used to find a test statistic d+ for a sample size n that will yield a p value close to the specified p value α_MD. Specifically, this is accomplished by solving the Maag and Dicaire approximation for d+ given n and α_MD. The Mathematica code for the Maag and Dicaire (1971) approximation is found in the Mathematica function KS1SidedOneSampleTestStatisticByMaagDicaire contained in Section 3 of the KS1SidedOneSampleDwassFormulae.nb file. The representative set of test statistics d+ is generated by using the Maag and Dicaire approximation with the p value representative set α_MD = 0.001, 0.1, 0.5, 0.9. Table 6 contains the Maag and Dicaire approximation of the test statistic d+ to six digits of precision for the representative set α_MD = 0.001, 0.1, 0.5, 0.9 and various sample sizes.
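Since the Maag and Dicaire formula itself is not reproduced in the text, the idea can only be illustrated here with the cruder Smirnov asymptotic P[D+_n ≥ d+] ≈ exp(−2n(d+)²), inverted for d+ at a target p value α (a hedged stand-in, not the refinement actually used by the paper):

```python
import math

def d_plus_for_alpha(n, alpha):
    """Invert the Smirnov asymptotic exp(-2 n d^2) = alpha for d.
    A stand-in for the more accurate Maag and Dicaire (1971) inversion."""
    return math.sqrt(-math.log(alpha) / (2 * n))
```

For n = 1,000 and α = 0.001 this gives d+ ≈ 0.05877, already close to the exact-based value 293/5000 = 0.0586 quoted above; the Maag and Dicaire refinement tightens such estimates further.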
For the Maag and Dicaire test statistic d + MD (n, α MD ), let mp(F, n, dp, α MD ) denote the minimum precision for Formula F , sample size n, desired precision dp, and test statistic d + MD (n, α MD ). Then let mpdp(F, n, dp, α MD ) = mp(F, n, dp, α MD ) − dp denote the minimum precision minus desired precision. After the minimum precision minus the desired precision mpdp(F, n, dp, α MD ) is determined for the representative set and a particular formula F , a function is fit to the data that predicts the minimum precision mp for any desired precision dp, sample size n, and p value α MD . This function is then put into the arbitrary precision routine to set the internal precision.

Representative set for DwassAltD
Since the minimum precision minus the desired precision mpdp(F, n, dp, α_MD) does not vary with the desired precision dp, the representative set for dp will be set to dp = 20. Section 4.2 specified the representative set of test statistics d+ as those generated by using the Maag and Dicaire approximation with the p value representative set α_MD = 0.001, 0.1, 0.5, 0.9. In this section, the representative set of sample sizes n used to compute the minimum precision minus desired precision mpdp(F, n, dp, α_MD) will be determined differently for each formula F because, as we will see, the computation time varies considerably from one formula to another. For each formula F, a reasonable upper limit on the sample size n will be determined experimentally so that the computation time will not exceed 100 to 200 seconds. Using this upper limit, a representative set of approximately twenty sample sizes will be specified.
To specify a representative set of sample sizes n for DwassAltD, we need to be able to determine the time in seconds needed to calculate the right tail p value for various sample sizes. The Mathematica function TimingDwassAltDKS1SidedOneSampleRTArbPrecision contained in Section 4 of the KS1SidedOneSampleDwassFormulae.nb file first determines the test statistic d+_MD(n, α_MD) corresponding to Maag and Dicaire p value α_MD. The minimum precision mp(DwassAltD, n, dp, α_MD) for sample size n, test statistic d+_MD(n, α_MD), and desired precision dp is calculated. Using test statistic d+_MD(n, α_MD) and internal precision ip = mp(DwassAltD, n, dp, α_MD), the time in seconds to calculate the right tail p value for the arbitrary precision DwassAltD formula is found. The sample output in Section 4.1 of the KS1SidedOneSampleDwassFormulae.nb file is summarized in Table 7. Since a sample size of n = 10,000,000 can have a computation time that exceeds 100 seconds but is less than 200 seconds, we will set n = 10,000,000 as the sample size upper limit for the DwassAltD formula. The sample size representative set will then be the twenty-two sample sizes listed in Table 6.

Table 6: Maag and Dicaire test statistic approximation d+_MD(n, α_MD) by sample size n for α_MD = 0.001, 0.1, 0.5, 0.9.

Minimum precision minus desired precision DwassAltD data
The minimum precision minus desired precisions in Tables 8 and 9 were produced by the Mathematica function MinPrecisionMinusDesiredPrecisionToFileDwassAltD contained in Section 5 of the KS1SidedOneSampleDwassFormulae.nb file, which writes the minimum precision minus desired precisions mpdp(DwassAltD, n, dp, α_MD) to an output file for a set of specified sample sizes n, a set of desired precisions dp, and a set of Maag and Dicaire approximation p values α_MD.

Table 7: DwassAltD time in seconds to compute a p value, with minimum precision mp(DwassAltD, n, dp, α_MD) as the internal precision, for α_MD = 0.001 with dp = 20, α_MD = 0.001 with dp = 100, and α_MD = 0.9 with dp = 20; computational times on a 3.4 GHz Pentium IV.

Predicting the minimum precision for DwassAltD
In constructing a fitted function to predict mpdp(DwassAltD, n, dp, α_MD), a tradeoff exists between how close the predicted value is to the actual value and whether the predicted value is less than the actual value. If the predicted value is less than the actual value, then the internal precision will produce a p value whose resulting precision rp is less than the specified desired precision dp. When rp < dp, the internal precision ip must be increased and the p value recalculated. The fitted function can be modified to ensure that almost no predicted values are less than the actual values by adding a constant value to the original fitted function. Thus, there is a tradeoff between living with the chance of having to calculate the p value more than once or adding a constant value to the original fitted function to almost eliminate that chance. To study this tradeoff, we need to know the relative computer time needed to calculate the p value again versus the additional computer time needed to calculate the p value with a larger internal precision ip. Table 8 shows this tradeoff by tabulating the computer time needed for DwassAltD to calculate p values for α_MD = 0.001 and various internal precisions, ip = mpdp(DwassAltD, n, dp, α_MD = 0.001) + dp. This table also shows that the computer time needed to recalculate a p value is much greater than the additional time needed for a slightly larger internal precision (ip increased by 2 to 3). Thus, in producing a function to predict mpdp(F, n, dp, α_MD), an original fitted function is first created and then a constant value is added to the original fitted function to create the final fitted function, so that a predicted value produced by the final fitted function will almost always be greater than or equal to the actual value.

Predicting internal precision for the DwassAltD formula
To investigate how mpdp(DwassAltD, n, dp, α_MD) changes with the p value α_MD, define the proportion P001mpdp(DwassAltD, n, dp, α_MD) shown in Equation (1) below.

P001mpdp(DwassAltD, n, dp, α_MD) = mpdp(DwassAltD, n, dp, α_MD) / mpdp(DwassAltD, n, dp, α_MD = 0.001)    (1)

Table 9 shows that the proportion P001mpdp(DwassAltD, n, dp, α_MD) does not change much as the sample size n varies. This suggests a two stage process for fitting a function to predict mpdp(DwassAltD, n, dp, α_MD). First, mpdp(DwassAltD, n, dp, α_MD = 0.001) is predicted by a fitted function using n as the independent variable. Second, fit a function to predict the proportion P001mpdp(DwassAltD, n, dp, α_MD) for some n (say n = 1,000,000) by using α_MD as the independent variable. Then, multiply the two fitted functions to obtain a function to predict mpdp(DwassAltD, n, dp, α_MD).
For the first step, the fitted function mpdp(DwassAltD, n, dp, α_MD = 0.001) ≈ 4.83943059 + 1.03244211 √n was found by using stepwise regression with a variety of independent variables that are all functions of n. Rounding up the coefficients yields the predicting function shown in Equation (2) below.

mpdp(DwassAltD, n, dp, α_MD = 0.001) ≈ 4.84 + 1.033 √n    (2)

To determine if the predicting function in Equation (2) is a good predictor, additional points for other sample sizes between the ones used to derive the function were calculated and then the function was used to predict the actual mpdp(DwassAltD, n, dp, α_MD = 0.001).
The results are summarized in Table 10. The differences between the actual and predicted mpdp(DwassAltD, n, dp, α_MD = 0.001) show that the predicting function in Equation (2) is a very good fit.
For the second step with n = 1,000,000, a function is fitted to predict the proportion P001mpdp(DwassAltD, n = 1000000, dp, α_MD) by using α_MD as the independent variable. Using the data in the first two columns of Table 11, stepwise regression with a variety of independent variables that are all functions of α_MD produced the fitted function shown in Equation (3).

Table 9: DwassAltD minimum precision minus desired precision mpdp(DwassAltD, n, dp, α_MD) and proportion P001mpdp(DwassAltD, n, dp, α_MD) by sample size n.
The differences reported in Table 11 between the actual and predicted proportions show that the fitted function in Equation (3) is a very good fit.
The predictive ability of Equation (4) is tested by applying it to the sample sizes n and p values α_MD in Table 12, which are different from the sample sizes n and p values used to construct the approximation in Equation (4). Table 12 shows that the initial predicted values mpdpInitPred(DwassAltD, n, dp, α_MD) are close to the actual values mpdp(DwassAltD, n, dp, α_MD). Let mpdpIPError(DwassAltD, n, dp, α_MD) defined in Equation (5) below represent the error in the initial predicted values.

mpdpIPError(DwassAltD, n, dp, α_MD) = mpdpInitPred(DwassAltD, n, dp, α_MD) − mpdp(DwassAltD, n, dp, α_MD)    (5)
Since the largest negative error in Table 13 is −1.30, construct the final prediction by conservatively adding 3 to the initial prediction mpdpInitPred(DwassAltD, n, dp, α_MD) and then rounding up. The resultant final prediction mpdpFinalPred(DwassAltD, n, dp, α_MD) is shown in Equation (6) below, where ⌈x⌉ is the smallest integer greater than or equal to x.

mpdpFinalPred(DwassAltD, n, dp, α_MD) = ⌈mpdpInitPred(DwassAltD, n, dp, α_MD) + 3⌉    (6)
If the resulting precision rp of the p value is less than the desired precision dp, rp < dp, the internal precision is increased by dp − rp and the p value is recalculated until rp ≥ dp.
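Equations (2) and (6) can be assembled into a sketch of the internal precision prediction. Because the fitted proportion of Equation (3) is not reproduced above, it enters here as a caller-supplied factor p001 (a hypothetical parameter for illustration, equal to 1 at α_MD = 0.001):

```python
import math

def predicted_internal_precision(n, dp, p001=1.0):
    """Final predicted internal precision ip for DwassAltD:
    Equation (2) scaled by the Equation (3) proportion to give the
    initial prediction, then the ceiling-plus-3 safeguard of Equation (6),
    added to the desired precision dp."""
    mpdp_init = (4.84 + 1.033 * math.sqrt(n)) * p001   # Equations (2) and (4)
    return dp + math.ceil(mpdp_init + 3)               # Equation (6)
```

For n = 10,000, dp = 20, and α_MD = 0.001 (p001 = 1), the predicted internal precision is 20 + ⌈108.14 + 3⌉ = 132 digits.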

Arbitrary precision implementation methodology
The techniques used in Section 4 for DwassAltD can be used to develop an arbitrary precision implementation for any formula F. A summary of the general methodology is shown below.
Methodology to implement a formula F in arbitrary precision:
1. Utilizing the rational arithmetic version of formula F programmed by Brown and Harvey (2007) and modified in Section 3, an arbitrary precision version is produced by inputting the internal precision ip to be used in all calculations and then replacing the rational arithmetic calculations with arbitrary precision calculations employing the input internal precision ip.
2. For formula F , develop a procedure to determine the minimum precision mp(F, d + , n, dp) and the minimum precision minus desired precision mpdp(F, d + , n, dp) by modifying the procedure in Section 4.1. Specifically, use the arbitrary precision version for formula F developed in Step 1 to develop an initial estimate of the minimum precision mp(F, d + , n, dp).
3. Specify a representative set of sample sizes n for formula F by experimentally determining the time in seconds needed to calculate the right tail p value for various sample sizes. Select the sample size upper limit on n so that the computation time needed for the upper limit is around 100 seconds. Then, select about twenty sample sizes between 10 and the sample size upper limit as the representative set of sample sizes n for formula F .
4. Since the minimum precision minus the desired precision mpdp(F, n, dp, α MD ) does not vary with the desired precision dp, specify the representative set of desired precisions as dp = 20. Specify the representative set of test statistics d + as those generated by using the Maag and Dicaire approximation with the p value representative set α MD = 0.001, 0.1, 0.5, 0.9.
5. Find the minimum precision minus desired precisions mpdp(F, n, dp, α MD ) for the representative set. Fit a function to this data that will predict mpdp(F, n, dp, α MD ).
6. Using the fitted function found in Step 5, modify the program developed in Section 4.6 for formula F .
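The representative test statistics of Step 4 can be sketched as follows. The paper does not reproduce the Maag and Dicaire approximation itself, so this sketch substitutes the simpler limiting-distribution inversion d+ ≈ sqrt(−ln α / (2n)), rounded to ρ significant digits; it illustrates the idea, not the paper's exact generator:

```python
import math

def approx_test_statistic(n, alpha, rho=6):
    """Approximate test statistic d+ with P(Dn+ >= d+) ~ alpha, using the
    limiting distribution exp(-2 n d^2) as a stand-in for the Maag and
    Dicaire approximation, rounded to rho significant digits."""
    d = math.sqrt(-math.log(alpha) / (2 * n))
    # Convert "rho significant digits" into a number of decimal places.
    decimals = rho - 1 - math.floor(math.log10(abs(d)))
    return round(d, decimals)
```

For example, n = 10000 and α MD = 0.001 give d+ ≈ 0.0185846 under this approximation, and exp(−2nd²) evaluated at that statistic recovers a p value near 0.001.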

Arbitrary precision implementation of Dwass-based formulae
Section 4 implemented the DwassAltD formula in arbitrary precision and Section 5 summarized the methodology used in Section 4. This methodology will now be used to implement all the other formulae in arbitrary precision. This section implements the remaining Dwass-based formulae (DwassD, DwassI, DwassAltI) in arbitrary precision, compares computed right tail probabilities of all Dwass-based formulae to make sure there are no implementation errors, and runs computational experience to determine the fastest formula.
The four Mathematica functions needed in the methodology to implement each Dwass-based formula are listed in Table 15 along with the section number where they are found in file KS1SidedOneSampleDwassFormulae.nb. The remainder of this section will use these Mathematica functions and the methodology to produce the desired precision function for each Dwass-based formula, which is also listed in Table 15.
Steps 1 through 4 of the methodology are basically the same for all Dwass-based formulae (DwassAltD, DwassD, DwassI, and DwassAltI), so we will use the initial estimate of the minimum precision mp(F, d + , n, dp) = 4.84 + 1.033√n in Step 2 and the representative set in Step 4 that was developed for DwassAltD in Section 4.
Step 5 of the methodology produces the minimum precision minus desired precision mpdp(F, n, dp, α MD ) data contained in Table 16 for the DwassD, DwassAltD, DwassI, and DwassAltI formulae. Since the minimum precision minus desired precision mpdp(F, n, dp, α MD ) data for DwassD and DwassAltD in Table 16 are identical, the fitted function found for DwassAltD can also be used for DwassD in Step 6 of the methodology to produce the desired precision function for DwassD, the Mathematica function DwassDKS1SidedOneSampleRTdesiredPrecision listed in Section 11 of the KS1SidedOneSampleDwassFormulae.nb file.
Similarly, Table 16 shows that mpdp(F, n, dp, α MD ) is also the same for the DwassI and DwassAltI (Dwass iterative formulae) so the same fitted function to be developed below can be used for both Dwass iterative formulae.

Predicting internal precision for the Dwass iterative formulae
To investigate how the minimum precision minus desired precision mpdp(DwassI , n, dp, α MD ) changes with the p value α MD , define the proportion P001mpdp(DwassI , n, dp, α MD ) shown in Equation (8) below.
(8)

Table 17 shows that the proportion P001mpdp(DwassI , n, dp, α MD ) does not change much as the sample size n varies. This suggests a two-stage process for fitting a function to predict mpdp(DwassI , n, dp, α MD ). First, fit a function to predict mpdp(DwassI , n, dp, α MD = 0.001) by using n as the independent variable. Second, fit a function to predict the proportion P001mpdp(DwassI , n, dp, α MD ) for some n (say n = 1, 000, 000) by using α MD as the independent variable. Then, multiply the two fitted functions to obtain a function to predict mpdp(DwassI , n, dp, α MD ).
Using the data in the first two columns of Table 17 for the first step, stepwise regression with a variety of independent variables was used to find the fitted function

mpdp(DwassI , n, dp, α MD = 0.001) ≈ 2.98591930 + 1.03190981√n.

Rounding up the coefficients yields the predicting function shown in Equation (9) below.

mpdp(DwassI , n, dp, α MD = 0.001) ≈ 2.986 + 1.032√n (9)

To determine if the predicting function in Equation (9) is a good predictor, additional points for other sample sizes between the ones used to derive the function were calculated and then the function was used to predict the actual mpdp(DwassI , n, dp, α MD = 0.001). The results are summarized in Table 18. The differences between the actual and predicted mpdp(DwassI , n, dp, α MD = 0.001) show that the predicting function in Equation (9) is a very good fit.
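The first-stage fit is ordinary least squares of mpdp on √n. A minimal sketch in Python (the paper uses Mathematica and stepwise regression; the sample sizes and data below are synthetic, generated on the Equation (9) line purely to exercise the closed-form estimator):

```python
import math

def fit_affine(xs, ys):
    """Closed-form ordinary least squares for y ~ a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Hypothetical sample sizes; responses lie exactly on the Eq. (9) line.
ns = [10, 100, 1000, 10000, 100000, 1000000]
xs = [math.sqrt(n) for n in ns]
ys = [2.986 + 1.032 * x for x in xs]
a, b = fit_affine(xs, ys)
```

Because the synthetic responses are exactly affine in √n, the fit recovers the intercept 2.986 and slope 1.032 up to floating-point error.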
For the second step with n = 1, 000, 000, a function is fitted to predict the proportion P001mpdp(DwassI , n = 1000000, dp, α MD ) by using α MD as the independent variable. Using the data in the first two columns of Table 19, stepwise regression with a variety of variables was used to find the fitted function

P001mpdp(DwassI , n = 1000000, dp, α MD ) ≈ 0.68749934 − 0.14334219α MD − 0.399053723α MD ².

Rounding the coefficients yields the predicting function shown in Equation (10) below.

P001mpdp(DwassI , n = 1000000, dp, α MD ) ≈ 0.6875 − 0.1433α MD − 0.399α MD ² (10)

The differences between the actual and predicted P001mpdp(DwassI , n = 1000000, dp, α MD ) in Table 19 show that the predicting function is a very good fit.
The predictive ability of Equation (11) is tested by applying it to the sample sizes n and p values α MD in Table 20, which are different from the sample sizes n and p values used to construct the approximation in Equation (11). Table 20 shows that the initial predicted values mpdpInitPred (DwassI , n, dp, α MD ) are close to the actual values mpdp(DwassI , n, dp, α MD ). Let mpdpIPError (DwassI , n, dp, α MD ), defined in Equation (12) below, represent the error in the initial predicted values.
mpdpIPError (DwassI , n, dp, α MD ) = mpdpInitPred (DwassI , n, dp, α MD ) − mpdp(DwassI , n, dp, α MD ) (12)

Table 21 contains the initial prediction error mpdpIPError (DwassI , n, dp, α MD ) for the sample sizes n and p values α MD in Table 20. Since the largest negative error in Table 21 is −2.44, construct the final prediction by conservatively adding 5 to the initial prediction mpdpInitPred (DwassI , n, dp, α MD ) and then rounding up. The resultant final prediction mpdpFinalPred (DwassI , n, dp, α MD ) is shown in Equation (13) below, where ⌈x⌉ is the smallest integer greater than or equal to x.
Let mpdpFPError (DwassI , n, dp, α MD ) defined in Equation (14) below represent the error in the final predicted values.
The fitted function in Equation (13) is used for both DwassI and DwassAltI to produce the desired precision Mathematica functions DwassIKS1SidedOneSampleRTdesiredPrecision and DwassAltIKS1SidedOneSampleRTdesiredPrecision listed in Sections 16 and 21 respectively of the KS1SidedOneSampleDwassFormulae.nb file.
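A sketch of the final prediction, assuming (as the two-stage description and the +5 safety margin suggest) that Equation (13) is the ceiling of the product of Equations (9) and (10) plus 5; the exact form of Equation (13) is not reproduced in the text, so this is an illustrative reconstruction:

```python
import math

def mpdp_final_pred(n, alpha):
    """Assumed form of the final predicted mp - dp for the Dwass iterative
    formulae: ceil(Eq(9) * Eq(10) + 5), where the +5 margin covers the
    largest observed negative initial prediction error of -2.44."""
    eq9 = 2.986 + 1.032 * math.sqrt(n)                    # mpdp at alpha = 0.001
    eq10 = 0.6875 - 0.1433 * alpha - 0.399 * alpha ** 2   # proportion fit
    return math.ceil(eq9 * eq10 + 5)
```

The prediction grows with √n and shrinks as the p value α MD grows, matching the behavior reported in the tables.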
The timings show that the arbitrary precision implementations are much faster than the rational DwassAltD formula and therefore much more efficient than all rational arithmetic implementations of the Dwass-based formulae.
To get an idea of the relative efficiency of the arbitrary precision implementation of the Dwass-based formulae, Tables 25 and 26 contain computer times for various sample sizes up to ten million. Unlike the rational arithmetic implementations, DwassD is the fastest arbitrary precision formula.

Arbitrary precision implementation of Smirnov formulae
Since the rational arithmetic implementations of the Smirnov-based formulae were faster than the recursion formulae, this section uses the methodology in Section 5 to implement the SmirnovAltD, SmirnovD, SmirnovAltI, and SmirnovI formulae. The five Mathematica functions needed in the methodology to implement each Smirnov-based formula are listed in Table 27 along with the section number in file KS1SidedOneSampleSmirnovFormulae.nb where they are found. The remainder of this section uses these Mathematica functions to produce the desired precision function for each Smirnov-based formula.
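The paper's Mathematica notebooks are not reproduced here, but the contrast between exact rational arithmetic and machine or fixed precision can be illustrated with one standard statement of the one-sided Smirnov right tail, P(Dn+ ≥ d) = d Σ_{j=0}^{⌊n(1−d)⌋} C(n, j)(d + j/n)^{j−1}(1 − d − j/n)^{n−j} (the Birnbaum-Tingey form). The Python sketch below is an illustration of that formula in rational arithmetic, not the paper's SmirnovD implementation:

```python
from fractions import Fraction
from math import comb

def smirnov_rt_exact(n, d):
    """Exact right tail P(Dn+ >= d) via rational arithmetic, using the
    Birnbaum-Tingey form of the one-sided Smirnov formula. The test
    statistic d must be a Fraction in (0, 1) so every term is exact."""
    total = Fraction(0)
    jmax = int(n * (1 - d))  # floor of n(1 - d), exact for Fraction d
    for j in range(jmax + 1):
        a = d + Fraction(j, n)
        b = 1 - d - Fraction(j, n)
        # For j = 0 the factor a**(j-1) is d**(-1), cancelled by the leading d.
        total += comb(n, j) * a ** (j - 1) * b ** (n - j)
    return d * total
```

For n = 1 this reduces to P(D1+ ≥ d) = 1 − d, and every returned value is an exact Fraction, which is what makes rational arithmetic results usable as a check on arbitrary precision output.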
Preliminary work shows that the minimum precision minus the desired precision for the Smirnov-based formulae is always less than ten for sample sizes up to one million, n ≤ 1, 000, 000. Thus the desired precision dp can be used as an initial estimate for both the lower and upper bounds in Step 2 of the Section 5 methodology, producing the procedure below.
Procedure to determine the Smirnov-based formulae F minimum precision mp(F, d + , n, dp)
Step 1 (initial potential bounds): Set the lower limit l to the desired precision dp, l = dp. Set the upper limit and internal precision u = ip = dp and then run the arbitrary precision F method to determine rp(F, d + , n, ip). If rp(F, d + , n, ip) ≥ dp, go to Step 4. Otherwise, set l = u and go to Step 2.
Step 2 (new potential upper bound): At this point, mp(F, d + , n, dp) > l. Set the upper limit u and internal precision ip to the lower limit plus an increment, u = ip = l + dp − rp(F, d + , n, l).
Step 3 (test potential upper bound): Run the arbitrary precision F method to determine rp(F, d + , n, ip). If rp(F, d + , n, ip) ≥ dp, go to Step 4. Otherwise, set l = u and go to Step 2.
Step 4 (check bounds): If u − l ≤ 1, go to Step 6. Otherwise, go to Step 5.
Step 5 (binary search): Set ip = ⌊(l + u)/2⌋ and run the arbitrary precision F method to determine rp(F, d + , n, ip). If rp(F, d + , n, ip) ≥ dp, set u = ip and go to Step 4. Otherwise, set l = ip and go to Step 4.
Step 6 (determine minimum precision): Run the arbitrary precision F method to determine rp(F, d + , n, l). If rp(F, d + , n, l) ≥ dp, set mp(F, d + , n, dp) = l and terminate the procedure. Otherwise, set mp(F, d + , n, dp) = u and terminate the procedure.

Using the minimum precisions mp(F, d + , n, dp) for the Smirnov-based formulae, various computer times are contained in Table 28 (all timings on a Pentium IV running at 3.4 GHz; Maag and Dicaire test statistics generated with ρ = 6 digits of precision). Since all the Smirnov-based formulae for a sample size of one million have computer times exceeding 100 seconds, the following development uses a sample size of one million (n = 1, 000, 000) as the upper limit.
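The bracketing-plus-binary-search procedure above can be sketched generically in Python. Here rp is a hypothetical stand-in for running the arbitrary precision F method at internal precision ip and reading off the resulting precision:

```python
def minimum_precision(rp, dp):
    """Smallest internal precision ip with rp(ip) >= dp, following the
    Steps 1-6 procedure: grow an upper bound by the observed shortfall,
    then binary search between the failing l and the succeeding u."""
    l = u = dp                       # Step 1: initial potential bounds
    if rp(u) < dp:
        while True:                  # Steps 2-3: grow the upper bound
            l = u
            u = l + dp - rp(l)       # increment by the observed shortfall
            if rp(u) >= dp:
                break
    while u - l > 1:                 # Steps 4-5: binary search
        ip = (l + u) // 2
        if rp(ip) >= dp:
            u = ip
        else:
            l = ip
    # Step 6: l itself may already succeed (e.g. when ip = dp sufficed).
    return l if rp(l) >= dp else u
```

With a toy resulting-precision model rp(ip) = ip − 7 (seven digits lost to cancellation) and dp = 20, the procedure returns 27, the smallest internal precision whose result carries 20 correct digits.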

Predicting internal precision for the Smirnov-based formulae
For the Smirnov-based formulae F (SmirnovAltD, SmirnovD, SmirnovAltI, SmirnovI), Table 29 contains the computed minimum precision minus desired precisions mpdp(F, n, dp, α MD ). From the data in this table, the minimum precision minus desired precision mpdp(F, n, dp, α MD ) is almost always the same for the same sample size n, independent of the desired precision dp and the p value α MD . Thus, the function predicting the internal precision depends only on the sample size n. Using binary search and α MD = 0.001, Table 30 contains the lower and upper bounds on the sample size n for each mp − dp. Conservatively adding one to the mp − dp in Table 30, Table 31 contains the predicted internal precisions used in the Smirnov-based desired precision functions listed in Sections 6 (SmirnovAltD), 12 (SmirnovD), 18 (SmirnovAltI), and 24 (SmirnovI) of the file KS1SidedOneSampleSmirnovFormulae.nb (see Table 27).

Computational experience for the Smirnov-based formulae
Using the Mathematica function SmirnovArbPrecisionDwassDTimingsToFile listed in Section 25 of the KS1SidedOneSampleSmirnovFormulae.nb file, the computer times in Tables 32 (dp = 20) and 33 (dp = 100) are generated to compare the SmirnovAltD, SmirnovD, SmirnovI, SmirnovAltI, and DwassD formulae. The DwassD formula was chosen because it is the fastest arbitrary precision implementation of all the Dwass-based formulae. As expected, every Smirnov-based arbitrary precision formula is less efficient than the DwassD formula.

Arbitrary precision implementation of recursion formulae
From Table 5 in Section 3, the number of terms in both the Dwass-based formulae and the Smirnov-based formulae is much smaller than the number of terms in the recursion formulae (Daniels, Noe, Steck, Conover, Bolshev). This data on the number of terms suggests that the recursion formulae will be significantly slower than both the Dwass-based and the Smirnov-based formulae. Consequently, this section will not use the methodology in Section 5 to implement the recursion formulae but instead will compare the computer time needed for their arbitrary precision implementations to the time needed for the Dwass- and Smirnov-based formulae. In the unlikely case that some recursion formulae are not significantly slower than the Dwass- and Smirnov-based formulae, those recursion formulae will be implemented using the methodology in Section 5. The Mathematica functions needed to implement and time the recursion formulae are listed in Table 34 along with the section number where they are found in file KS1SidedOneSampleRecursionFormulae.nb. For sample size n, preliminary analysis in finding the minimum precision minus the desired precision mp − dp for each recursion formula yielded an initial estimate of 0.384n + 0.004n² for the Daniels, Steck, Conover, and Bolshev recursion formulae. The initial estimate for the Noe recursion formula is 10. To find mp − dp for each recursion formula, these initial estimates are used in the Mathematica functions contained in Sections 2, 6, 10, 14, and 18 of file KS1SidedOneSampleRecursionFormulae.nb.
The numbers of terms in four of the recursion formulae (Daniels, Steck, Conover, Bolshev) are very similar, so they will be analyzed first. Since the number of terms for the Noe recursion formula is over 50 times the number of terms in the other recursion formulae, the Noe recursion formula will probably be much slower than the other recursion formulae and it will be analyzed separately.
For the Daniels, Steck, Conover, and Bolshev recursion formulae, the arbitrary precision Mathematica functions contained in file KS1SidedOneSampleRecursionFormulae.nb (Sections 3, 4, 7, 8, 11, 12, 15, 16) are used to determine the minimum precision minus desired precisions mp − dp in Table 35 and the timings in Tables 36 and 37. In addition, these tables include the mp − dp and timings for the fastest Smirnov-based formula (SmirnovAltD) and the fastest Dwass-based formula (DwassD). Since Daniels is the only recursion formula with negative terms, we would expect the mp − dp in Table 35 for Daniels to be larger than for the other recursion formulae, which is the case. Interestingly, Steck has a significantly smaller mp − dp than the other recursion formulae even though the number of terms is almost the same. From Tables 36 and 37, Steck is the fastest recursion formula but, as expected, is significantly slower than SmirnovAltD, which in turn is slower than DwassD.
For the Noe recursion formula, the arbitrary precision Mathematica functions contained in file KS1SidedOneSampleRecursionFormulae.nb (Sections 19 and 20) are used to determine the minimum precision minus desired precisions mp − dp in Table 38 and the timings in Table 39.
As expected, comparing the computer times in Tables 36 and 37 with those in Table 39 shows that the Noe recursion formula is significantly slower than the other recursion formulae.
From the timings in Sections 6.2, 7.2, and 8, DwassD is the fastest formula and will be used in the remainder of this paper.

Calculating the one-sided bandwidth
In addition to calculating the p value for hypothesis testing, the one-sided one sample K-S cumulative sampling distribution can be used to construct a one-sided confidence band around the empirical distribution Fn(x). The bandwidth of a one-sided confidence band with confidence coefficient 1 − α and sample size n is the value of the test statistic d + that satisfies P(Dn ≥ d + ) = α. Determining a bandwidth d + for a particular sample size n and confidence coefficient 1 − α means evaluating the inverse of the cumulative sampling distribution, which can only be done by search techniques such as binary search. Unlike the p value, a bandwidth d + cannot in practice be determined exactly because the search technique may not converge to the exact value. For example, binary search with starting values of 0 and 1 would never find d + = 1/3 and would iterate forever. Thus, search techniques are designed to stop when a specified accuracy is reached. Let d + (n, α, ρ) represent the bandwidth rounded to ρ significant digits for sample size n and confidence coefficient 1 − α. Note that bandwidth d + (n, α, ρ) is also the hypothesis testing critical value for an α level of significance.
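The inversion just described can be sketched with bisection. This Python sketch uses a toy p value function, the exact n = 1 right tail P(D1 ≥ d) = 1 − d, standing in for the arbitrary precision DwassD evaluation used in the paper:

```python
def bandwidth(pvalue, alpha, rho=6):
    """Bisection for d+ with pvalue(d+) = alpha, stopped at a fixed
    tolerance tied to rho digits. pvalue must be strictly decreasing
    in d on (0, 1); here it is a stand-in for the arbitrary precision
    K-S right tail, not the paper's DwassD code."""
    lo, hi = 0.0, 1.0
    while hi - lo > 10.0 ** (-(rho + 1)):
        mid = (lo + hi) / 2
        if pvalue(mid) > alpha:
            lo = mid   # p value still above alpha: the root is to the right
        else:
            hi = mid
    return round((lo + hi) / 2, rho)
```

Because the loop stops at a fixed tolerance rather than on exact equality, targets such as d+ = 1/3 that bisection can never hit exactly still terminate, which is precisely why d + (n, α, ρ) is defined only to ρ digits. With the toy n = 1 tail and α = 0.05, the search returns 0.95, matching P(D1 ≥ 0.95) = 0.05.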
The linear search algorithm in Brown and Harvey (2007) to determine d + (n, α, ρ) using rational arithmetic can be modified to use arbitrary precision by replacing the rational arithmetic version of DwassAltD by the desired precision DwassD function contained in Section 11 of the KS1SidedOneSampleDwassFormulae.nb file. The resulting Mathematica function KS1SidedOneSampleBandwidthArbPrecision and sample output is contained in Section 24 of the KS1SidedOneSampleDwassFormulae.nb file.
The Mathematica function KS1SidedOneSampleArbPrecisionBandwidthsToFile contained in Section 25 of the KS1SidedOneSampleDwassFormulae.nb file finds bandwidths using the Section 24 function and writes these bandwidths to a comma delimited file for input into Excel and a text file that can be used as the input into timing programs. The text file contains bandwidths where every digit in a bandwidth is output separately so the bandwidth can be reconstructed to any desired accuracy. Using the results of this function, Table 40 contains the bandwidths to six digits of precision (ρ = 6) for α = 0.2, 0.1, 0.05, 0.02, 0.01, 0.001 and representative sample sizes from n = 3, 000 through n = 10, 000, 000 (Brown and Harvey (2007) contains bandwidths up to n = 2, 000).

Conclusion and areas of future research
This paper has developed an arbitrary precision method that can be used to compute one-sided one sample K-S p values for sample sizes of at least ten million, n = 10, 000, 000. Most importantly, the method lets the user specify the precision of the resulting p value before the computations are made. In addition, the arbitrary precision method is much faster than the fastest rational arithmetic method. Consequently, it can be used to study the errors in the limiting distribution, approximations, and machine precision implementations.