Programs to Compute Distribution Functions and Critical Values for Extreme Value Ratios for Outlier Detection

A set of FORTRAN subprograms is presented to compute density and cumulative distribution functions and critical values for the range ratio statistics of Dixon (1951, The Annals of Mathematical Statistics ) These statistics are useful for detection of outliers in small samples.


Introduction
described a set of statistics for the purpose of detecting outliers in simple data sets so that they could be excluded from subsequent analysis. For a set of n observations x i , ordered such that x 1 ≤ x 2 ≤ . . . ≤ x n , the statistics are defined by r j,i−1 = (x n − x n−j )/(x n − x i ). The first subscript on the r symbol indicates the number of outliers that are suspected at the upper end of the data set, and the second subscript indicates the number of outliers suspected at the lower end. Dixon gave analytic formulas for the density and cumulative distribution functions for r 10 , r 11 , r 12 , r 20 , r 21 , and r 22 for a few small values of n, and presented numerically generated tables of critical values for each of those statistics for n ≤ 30. Dixon (1950) described the use and performance of these "range ratio" statistics for the purpose of rejecting outliers, and Dixon (1950); Beckman and Cook (1983); Barnett and Lewis (1984) and Hampel (1985) compared them to other approaches to the same problem. Dixon's ratios and many other approaches share the weakness that the user must guess the number of outliers that exist in the data set, and the tests often fail if the guess is wrong. In other ways Dixon's tests perform as well as or better than most other tests and they are very easy to use. Dixon's tests have become a standard for outlier rejection in analytical chemistry and are used in other fields as well. Dixon's r 10 is also called Q, and the outlier rejection test using it is called the Q test in the chemistry literature.
I can find no indication in the literature that anyone has recomputed the numerical values in Dixon's tables in the 55 years since he published them. In fact, the literature gives the impression that the art of generating the critical values for his statistics has been lost. Rorabacher (1991), in a paper intended to provide definitive reference values for the analytical chemistry community and to add critical values for new confidence levels, interpolated in Dixon's tables. The outliers package (Komsta 2005) for version 2 of the R statistical environment (R Development Core Team 2005) similarly interpolates between Dixon's published values. Efstathiou (1992) estimated critical values stochastically and compared them to Dixon's tables.
Dixon gives enough information in his paper to permit recomputation of his results at higher precision. This article describes one numerical approach to that task and gives FORTRAN functions for the probability density, cumulative distribution function, and critical values for each of the statistics presented by Dixon. The functions should be easy to translate into other languages for incorporation into larger software packages.
2. Formulas for density and cumulative distribution functions 2.1. Probability density for r Each of Dixon's ratios is a function of three of the n data values: r j,i−1 = (x n −x n−j )/(x n −x i ). The joint probability density for those three variables is obtained by multiplying the density functions for all of the data values, integrating over the possible values of all the variables except the three being used, and applying a combinatorial normalization factor. There are i−1 observations below x i , n − j − i − 1 observations between x i and x n−j , and j − 1 observations between x n−j and x n . The joint density for normally distributed data is therefore where φ(x) = (2π) −1/2 exp(−x 2 /2) is the density function for the standard normal distribution. Equation 1 in Dixon (1951) is the special case of Equation 1 for r 10 , that is, for j = i = 1. Following Dixon, I simply write r when there is no ambiguity.

Dixon then changes variables from
The Jacobian for this transformation is v. To obtain the density for r alone we integrate x and v over their ranges (−∞ < x < ∞, 0 ≤ v < ∞) to yield Equation 2 in Dixon (1951) is Equation 2 for j = i = 1. The last equation in his paper should be exactly Equation 2, but it is missing the Jacobian factor v. Similarly, his equation 14 should be my Equation 2 for j = 1, i = 2, but it has several errors; it should read Despite these printing errors, all the tables in Dixon's paper were computed from correct formulas.
Rewriting Equation 2 in terms of the cumulative standard normal distribution Φ(x) gives The main numerical problem solved in this paper is the evaluation of Equation 4 by numerical quadrature.

Cumulative distribution function and critical values for R
The cumulative distribution function G(R) is The range of r is 0 ≤ r ≤ 1, so G(0) = 0 and G(1) = 1. The critical values are the roots of (1 − α) − G(R) = 0 where α is a specified probability; G(R) increases monotonically from 0 to 1 so there is exactly one critical value for each value of α.

Probability density
The integrand in Equation 4 can change sharply because of the Gaussian form of the φ terms, and for large n also because of the high power in the second Φ term. I chose to regard the Gaussian dependences on x and v as weight functions and use Gaussian quadratures to accomodate them; the integral over x can be done with a classical Gauss-Hermite quadrature, but the integral over v requires a special quadrature because of the [0, ∞] limits.
Writing out the three φ terms in Equation 4 and gathering terms in x 2 and v 2 yields where N is the normalization factor (including the (2π) −3/2 term) and J(x, r, v) represents the terms containing Φ.
The change of variable t 2 = (1+r 2 )v 2 /2, u 2 = 3x 2 /2 converts the integral into a form suitable for Gauss-Hermite quadratures: can be introduced. Here the w l are the weights and the t l are the abscissas for an n hhpoint half-range Hermite quadrature. The abscissas and weights can be obtained with the ORTHPOL program package described in Gautschi (1994); the test7 program in that package does the calculation that is needed. In the programs accompanying this paper, the weights and abscissas for n hh = 17 and n hh = 15 are included in tabular form. Similarly the w k and u k are the weights and abscissas for an ordinary n fh -point Gauss-Hermite quadrature. Application of the quadrature rules yields The function rdens(r) implements Equation 11.

Cumulative distribution function
P (r) is a well-behaved function, and its integral over the domain [0, R] presents no numerical difficulties. For purposes of computing critical values it is important that the numerical implementation of G(R) should be monotonic and smooth, so it is best to avoid numerical procedures that incorporate convergence tests. I have chosen to use Gauss-Legendre quadrature with a fixed number n gl of quadrature points, so that where w m and y m are the usual Gauss-Legendre weights and abscissas on the range [−1, 1], and the variable transformation y = 2r/R−1 was used to change the [0, R] range of integration to [−1, 1]. The function rcdf(R) implements Equation 12.

Critical values of R
The critical values are the values of R such that 1−α−G(R) = 0, and they always lie between 0 and 1. The function rcrit(alpha) uses Brent's method (Brent 1973) to converge on the critical values.

User callable routines
The program package includes five user-callable routines. All floating-point arguments and function results are DOUBLE PRECISION.
1. The subroutine rinit(n,i,j,prec) initializes the programs. Its arguments are integer n, i, and j as used above, and the integer variable prec, whose value must be 1 or 2. If prec is 2, the programs use 17, 31, and 16 quadrature points in the integrals over v, x, and r respectively; if prec is 1, it uses 2 fewer quadrature points in all three integrations.
Comparison of results obtained with the two values of prec gives an estimate of the numerical error in the results. prec=2 is adequate to converge all the values in Dixon's tables to an absolute error of 5 × 10 −4 or less.
The user must call rinit once before calling any of the other programs, and again if the value of prec needs to be changed. It sets up the quadrature abscissas and weights, stores all the r-independent terms needed for evaluation of Equation 11, and calls rreset to compute the normalization factor. All the information is stored in a COMMON block in the file dixonr.fi.
2. The subroutine rreset(n,i,j) stores n, i, and j, and computes and stores the value of N in Equation 11 for use by the other functions. It must be called any time the values of n, i, or j need to be changed.
3. The function rdens(r) returns the value of the probability density function at r for r j,i−1 for n data points, where n, j, and i were specified in the most recent call to rreset or rinit. It implements Equation 11 with a single loop over a "composite index" that runs from 1 to n hh n fh , using values of t, x, ut, w k w l , and Φ(x) that were computed at all the quadrature points by rinit.
4. The function rcdf(R) returns the value of the cumulative distribution function at R, for the current n, i, j. It implements Equation 12 by calling rdens to obtain the values of P (r) at the necessary quadrature points.
5. The function rcrit(alpha) returns the critical value of R for the specified α, for the current n, i, j. It solves 1 − α − G(R) = 0 using Brent's method (Brent 1973) as implemented in the routine zeroin. It uses a subsidiary function rcerr(R), called by zeroin, to evaluate 1 − α − G(R). In the package the error tolerance for zeroin is set to 10 −6 , which is adequate for the numbers of quadrature points provided; users who generate larger numbers of quadrature points for special purposes may want to decrease the error tolerance.

Utility routines
The package contains several support routines in utility.f.
Many numerical libraries contain routines that can generate Gauss-Hermite or Gauss-Legendre abscissas and weights for arbitrary numbers of quadrature points, and fhquad and glquad can easily be replaced with such routines. The only source of the half-range Hermite abscissas and weights I know is the ORTHPOL package. More points may become necessary if i increases beyond 3 or n increases beyond 30.
A user who wishes to use different numbers of quadrature points must replace fhquad, hhquad, and/or glquad with other versions, and change the values of maxfh, maxhh, and maxgl in dixonr.fi. In addition he may want to change the error tolerance used in rcrit.
Phi(x) is a double precision routine to return the cumulative standard normal distribution at x. I have included the "little C function" of Marsaglia (2004), translated to FORTRAN.
zeroin is a function from Netlib (Browne, Dongarra, Grosse, and Rowan 1995, http: //www.netlib.org/) that uses Brent's method to find the roots of a function of one variable. It calls d1mach to obtain a value of the machine precision, and I include the current d1mach routine from Netlib that adapts automatically to the host machine.

Example programs
Several example programs, with corresponding output files, are included. To create an executable for test1, for example, the user must compile and link test1.f, rfuncs.f and utility.f; the file dixonr.fi must be in the same directory with rfuncs.f at compile time.
1. test1.f generates a table of the density and cumulative distribution functions of the Dixon r 11 statistic (j = 1, i = 2) for n = 7 by calls to rdens and rcdf.

Usage notes
The language used is fixed-form FORTRAN 90, but the few FORTRAN 90 features used are supported by most modern FORTRAN 77 compilers as well. The programs have been tested with Compaq Visual FORTRAN version 6.6 and with GNU FORTRAN 3.3.1.
The programs are reasonably efficient implementations of the numerical procedures described above, but they do the quadrature anew each time a new value of P (r) is needed. For single calls or for generating tables of values this design is adequate. However, if many values of P (r), G(R), or the critical values are needed, the programs in this package should not be used directly. Instead, it will be faster to use these programs to generate a table of accurate values with moderate density over the range of r or R needed, then use interpolation in that table to generate individual values.
The notation used in this paper is consistent with that of Dixon, and the α values and corresponding critical values of R are those appropriate for one-tailed tests. In most uses of these tables for outlier rejection, a two-tailed test is needed, so that "90% confidence" corresponds to the α = 0.05 columns in the output of test4. See Rorabacher (1991) for a discussion.