Monkeying with the Goodness-of-Fit Test

The familiar Σ(OBS − EXP)²/EXP goodness-of-fit measure is commonly used to test whether an observed sequence came from the realization of n independent, identically distributed (iid) discrete random variables. It can be quite effective for testing for identical distribution, but it is not suited for assessing independence, as it pays no attention to the order in which output values are received. This note reviews a way to adjust or tamper with, that is, monkey with, the classical test to make it test for independence as well as identical distribution--in short, to test for both of the i's in iid, using monkey tests similar to those in the Diehard Battery of Tests of Randomness (Marsaglia 1995).


Introduction
We begin with an example that will illustrate the problem and suggest a solution. A general discussion will follow in Section 2.
Suppose you want to demonstrate use of the goodness-of-fit test on a sequence x_1, x_2, ..., x_n purported to be iid, that is, independent, identically distributed discrete random variables taking values 0, 1, 2, ..., 25 with equal probabilities, 1/26. You use a computer to generate such a sequence with n = 2600, using, say, the random number generator provided by your programming language's function library.
If v_i is the number of times that the value i appears in the sequence of length 2600, then E[v_i] = 100 and Q_1 = Σ_i (v_i − 100)²/100 is the standard goodness-of-fit statistic, viewed as a χ²_25 variate (25 degrees of freedom) if the process providing the 26 possible values is choosing them with equal probabilities. But the value of Q_1 is invariant under permutations of the 2600 elements in the test sequence. That Q_1 value provides no assessment of independence, only of identical distribution: the second, but not the first, i in iid.
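As a concrete illustration, here is a short Python sketch (standard library only; the seed and variable names are my own, chosen for reproducibility) that computes Q_1 for such a sequence drawn from Python's built-in generator:

```python
import random

random.seed(12345)          # fixed seed so the run is reproducible
n, k = 2600, 26             # 2600 draws from 26 equally likely values
expected = n / k            # E[v_i] = 100

seq = [random.randrange(k) for _ in range(n)]

counts = [0] * k
for x in seq:
    counts[x] += 1

# Q1 = sum over the 26 cells of (observed - expected)^2 / expected
q1 = sum((v - expected) ** 2 / expected for v in counts)
print(q1)   # compare against a chi-square distribution with 25 d.o.f.
```

Note that shuffling `seq` before counting would leave `q1` unchanged, which is exactly the permutation-invariance pointed out above.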
To consider means for assessing independence, suppose we change the values 0, 1, 2, ..., 25 produced by our random variates to A, B, C, ..., Z, in order to have a single symbol for each realization. Then the output of a sequence of realizations might look like that from the fabled monkey at a typewriter, randomly striking 26 keys. After n keystrokes, the whimsical output from the monkey might begin, then end, like this:

...LUQMAJMLQJHKKOJXOPOYMLFWWKNEDXDOKMNQOLYBJPZZYPP

A satisfactory value for Q_1 = Σ_i (v_i − n/26)²/(n/26) will suggest that the frequencies of A's, B's, ..., Z's are satisfactorily close to their expected values of n/26. But we might ask: how many times does CAT appear in that sequence? DOG? PDQ? TIT? Or, more generally, how many times does each of the possible 3-letter words appear? Or, a favorite when using this as a classroom example, which forbidden 4-letter words appear, and how often?
If n = 100 × 26³ = 1,757,600, and if we have independent realizations, then we should expect around 100 instances of each of the possible 3-letter words, and we might be tempted to use the classical Σ(OBS − EXP)²/EXP as a test statistic. But that will not work, because successive (overlapping) 3-letter words cannot reasonably be assumed to have come from a set of identically distributed, id, variates. It turns out, as we shall see in the next section, that the following is the proper way to test the 3-letter word counts, and thus extend to a full iid test, that is, a test for independence as well as identical distribution: Let w_ijk be the number of times the word IJK appears in the sequence, and form Q_3 = Σ_{i,j,k} (w_ijk − 100)²/100, the naive Pearson form. Also, with w_ij the number of times the word IJ appears in the sequence, set Q_2 = Σ_{i,j} (w_ij − 2600)²/2600, the form for pairs of letters.
For such a large number of degrees of freedom (26³ − 26² = 16900) we can take Q_3 − Q_2 to be normal with mean 16900 and variance 2 × 16900 = 33800, so the question reduces to: how many sigmas is Q_3 − Q_2 from its mean, i.e., what is the value of the standard normal variate (Q_3 − Q_2 − 16900)/√33800 = (Q_3 − Q_2 − 16900)/183.85?
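The whole recipe can be sketched in a few lines of Python (standard library only; treat this as an illustration of the procedure above, not a definitive implementation). The counts are taken circularly, anticipating the circular convention introduced in the next section:

```python
import random

K = 26
N = 100 * K ** 3            # 1,757,600 letters, so each 3-letter word has EXP = 100

random.seed(2005)           # reproducible run
s = [random.randrange(K) for _ in range(N)]

# Circular overlapping word counts: pairs (EXP = N/26^2 = 2600)
# and triples (EXP = N/26^3 = 100).
c2 = [0] * (K * K)
c3 = [0] * (K * K * K)
for i in range(N):
    a, b, c = s[i], s[(i + 1) % N], s[(i + 2) % N]
    c2[a * K + b] += 1
    c3[(a * K + b) * K + c] += 1

q2 = sum((v - 2600) ** 2 / 2600 for v in c2)
q3 = sum((v - 100) ** 2 / 100 for v in c3)

# Q3 - Q2 is asymptotically chi-square with 26^3 - 26^2 = 16900 d.o.f.;
# normalize to an (approximately) standard normal z-score.
z = (q3 - q2 - 16900) / 33800 ** 0.5
print(z)
```

For any generator worth using, `z` should land within a few sigmas of zero.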
For a truly iid sequence of 26³ × 100 uniform variates from a set of 26, we should certainly expect (Q_3 − Q_2 − 16900)/183.85 to be within ±3. For the rand() function of the C compiler I used (a congruential RNG), Q_3 − Q_2 was −1.27 sigmas from its mean, quite satisfactory. However, we get a different story if we change the rand() function to an LFSR generator based on the primitive polynomial x³¹ + x³ + 1, the very generator Whittlesey (1969) touted after I had established the lattice structure of congruential RNGs (Marsaglia 1968). The standard goodness-of-fit measure, which we might call Q_1, passed well; the letters are apparently produced with the proper frequencies. But are they produced independently? No, not even close: Q_3 − Q_2 was over 62,000 sigmas from its mean, and Q_2 − Q_1 was worse, over 136,000 sigmas from its mean.
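For readers who want to experiment with this kind of failure, here is one simple Python realization of an LFSR over x³¹ + x³ + 1; the exact construction Whittlesey used may well differ (tap-numbering conventions vary, and the reciprocal polynomial gives the time-reversed, statistically equivalent bit stream). Letters are extracted by rejection sampling of 5-bit groups so that the frequency test Q_1 is not biased by the extraction itself:

```python
def lfsr31(state):
    """Bit generator from the recurrence b_n = b_{n-3} XOR b_{n-31},
    associated with x^31 + x^3 + 1 (up to the reciprocal convention).
    The seed state must be a nonzero 31-bit integer."""
    while True:
        newbit = ((state >> 30) ^ (state >> 2)) & 1
        state = ((state << 1) | newbit) & 0x7FFFFFFF
        yield state & 1          # emit the newly generated bit

def letters(bits, count):
    """Pack 5 bits at a time; reject values >= 26 so letters stay uniform."""
    out = []
    while len(out) < count:
        v = 0
        for _ in range(5):
            v = (v << 1) | next(bits)
        if v < 26:
            out.append(v)
    return out

seq = letters(lfsr31(1), 1000)   # feed a sequence like this to the monkey test
```

Running the Q_3 − Q_2 statistic on a long stretch of this output, in place of the library generator, is what exposes the dependence among successive letters.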
Monkey tests such as these and others from Marsaglia (1995) have been used for many RNGs. A more extensive summary, and further references relating to monkey tests, are in that CDROM, mostly in the file monkey.pdf, a copy of which is available from http://www.jstatsoft.org/v14/i13/monkey.pdf. See also Knuth (1999, pp. 62, 78, 565) for discussion of the Q_r − Q_{r−1} approach. The main purpose of this note is merely to point out to interested readers that extending the standard Σ(OBS − EXP)²/EXP beyond individual frequencies (e.g., to Q_2 − Q_1 or Q_3 − Q_2) can provide a better assessment of a purported iid sequence. Using overlapping pairs, triples, and quadruples, and the resulting χ² distributions for Q_2 − Q_1, Q_3 − Q_2, or Q_4 − Q_3, provides a means to test for both of the i's in iid.

A sketch of background theory
If x_1, x_2, ..., x_n is a sequence of iid discrete variates taking values v_1, v_2, ..., v_k with probabilities p_1, p_2, ..., p_k, and if V_1, V_2, ..., V_k are counts of the number of times each v_i appears in the sequence of x's, then V_1, ..., V_k are, as n → ∞, asymptotically jointly normal with mean vector (np_1, ..., np_k) and a certain covariance matrix C. That covariance matrix will have rank k − 1 because V_1 + ... + V_k = n. For such a jointly normal distribution, if C⁻ is any weak inverse of C (CC⁻C = C), then the mean-adjusted quadratic form in the V's, with matrix C⁻, can be reduced to the sum of squares of k − 1 standard normal variates, and thus has a χ²_{k−1} distribution. This quadratic form can be expressed as Σ_i (V_i − np_i)²/(np_i) in the above case, the familiar Σ(OBS − EXP)²/EXP. Now let z_i be the value taken by x_i, and consider the concatenation z_1 z_2 z_3 ... z_n, as though it were the output of our monkey randomly hitting k different keys with probabilities p_1, p_2, ..., p_k.
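A small numeric check of this fact can be done in pure Python. For the single-letter counts the covariance matrix is the multinomial one, C = n(diag(p) − ppᵀ), and G = diag(1/(np_i)) is one weak inverse of it (CGC = C); that is why the quadratic form collapses to the Pearson statistic. The probability vector below is an arbitrary illustration, not taken from the paper:

```python
n = 10
p = [0.5, 0.3, 0.2]
k = len(p)

# Multinomial covariance matrix: C[i][j] = n * (p_i * [i==j] - p_i * p_j)
C = [[n * (p[i] * (i == j) - p[i] * p[j]) for j in range(k)] for i in range(k)]

# Candidate weak inverse: G = diag(1 / (n p_i))
G = [[(1.0 / (n * p[i])) if i == j else 0.0 for j in range(k)] for i in range(k)]

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(k)]
            for i in range(k)]

CGC = matmul(matmul(C, G), C)

# CGC == C confirms G is a weak inverse, so the mean-adjusted quadratic
# form with matrix G -- which is exactly sum_i (V_i - n p_i)^2 / (n p_i),
# Pearson's statistic -- has the chi-square distribution with k-1 d.o.f.
ok = all(abs(CGC[i][j] - C[i][j]) < 1e-9 for i in range(k) for j in range(k))
print(ok)
```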
From the assumptions, the probability that any particular 3-letter 'word' in this string will take a specific value, say v_7 v_4 v_3, is the product p_7 p_4 p_3, and the number of appearances of that particular word will have an (asymptotically) normal distribution, part of the jointly normal distribution for the counts of all 3-letter words. Thus, if we could find the covariance matrix C for the joint 3-letter word counts, and find a weak inverse C⁻ for C, then we could take a mean-adjusted quadratic form in the joint 3-letter word counts, with matrix C⁻, and have a χ²_r distribution, with r the rank of C. When I first tried this, I was stuck with an awkward covariance matrix for the joint 3-letter word counts, because the first 3-letter word has no left neighbor, the last has no right neighbor, the second has only one left neighbor, and so on, while the more central 3-letter words have two left and two right neighbors that contribute to finding the joint covariances.
I was able to overcome this difficulty by assuming the output was circular, so that every 3-letter word had two left and two right neighbors. The resulting covariance matrix had a complicated form for which I was able to find a weak inverse, and then discover a remarkable fact: the mean-adjusted quadratic form in the weak inverse for that covariance matrix can be represented as the difference, Q_3 − Q_2, between two familiar Σ(OBS − EXP)²/EXP forms, Q_3 for 3-letter and Q_2 for 2-letter words.
The general result is readily inferred from that for 3-letter words, which we use to simplify notation: If z_1 z_2 z_3 ... z_n is the (circular) concatenated output of a sequence x_1, x_2, ..., x_n of iid discrete variates taking values v_1, v_2, ..., v_k with probabilities p_1, p_2, ..., p_k, and if Q_3 is the naive Pearson form for 3-letter words,

Q_3 = Σ_{i,j,k} (V_ijk − n p_i p_j p_k)² / (n p_i p_j p_k),

where V_ijk is the count of appearances of the 3-letter word v_i v_j v_k, and if Q_2 is the corresponding naive Pearson form for 2-letter words, then Q_3 − Q_2 is (asymptotically) χ² distributed with k³ − k² degrees of freedom. More generally, for example, Q_5 − Q_4 will be asymptotically χ² distributed with k⁵ − k⁴ degrees of freedom, Q_4 − Q_3 will be χ²_{k⁴−k³} distributed, and so on.
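The general statistic is easy to compute for any word length. The following Python sketch (the function names and interface are my own, not from the paper) returns Q_r − Q_{r−1} and its degrees of freedom for circular overlapping r-letter words over an alphabet of size k, under the equal-probabilities case p_i = 1/k:

```python
import random
from collections import Counter

def q_stat(seq, k, r):
    """Naive Pearson form Q_r for circular overlapping r-letter words,
    assuming the k symbols are equally likely (EXP = n / k^r)."""
    n = len(seq)
    exp = n / k ** r
    counts = Counter(tuple(seq[(i + j) % n] for j in range(r))
                     for i in range(n))
    q = sum((c - exp) ** 2 / exp for c in counts.values())
    q += (k ** r - len(counts)) * exp   # cells with zero count contribute exp
    return q

def monkey_test(seq, k, r):
    """Return (Q_r - Q_{r-1}, degrees of freedom k^r - k^(r-1))."""
    return q_stat(seq, k, r) - q_stat(seq, k, r - 1), k ** r - k ** (r - 1)

# Example: k = 4 symbols, n chosen so each triple has expected count 200.
random.seed(7)
seq = [random.randrange(4) for _ in range(4 ** 3 * 200)]
stat, df = monkey_test(seq, 4, 3)       # compare stat against chi-square, 48 d.o.f.
```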
In practice, Q_2 − Q_1, Q_3 − Q_2, Q_4 − Q_3, etc. are themselves discrete variates whose distributions get closer to the limiting χ² forms as n increases. In particular, expected cell counts such as np_i p_j p_k may require a large value of n in order to reach the lower limit of ten or so that we often require for expected cell counts. Otherwise, the distributions of forms such as Q_3 − Q_2, Q_4 − Q_3, etc. may not be close enough to the limiting χ² distribution, and extensive simulation may be required for more accurate hypothesis testing.