accuracy: Tools for Accurate and Reliable Statistical Computing

Most empirical social scientists are surprised to learn that low-level numerical issues in software can have deleterious effects on the estimation process. Statistical analyses that appear to be perfectly successful can be invalidated by concealed numerical problems. We have developed a set of tools, contained in accuracy, a package for R and S-PLUS, to diagnose problems stemming from numerical and measurement error and to improve the accuracy of inferences. The tools included in accuracy comprise a framework for gauging the computational stability of model results, tools for comparing model results, optimization diagnostics, and tools for collecting entropy for true random number generation.


Introduction
Social science data are often subject to measurement error, and nearly all methods of statistical computing can yield inaccurate results under some combinations of data and model specification. This is an issue of practical concern, since our previous work, and that of others, demonstrates that published analyses are affected with surprising frequency [Altman et al., 2003, 2005, McCullough and Vinod, 1999, Altman and McDonald, 2002, 2003]. Unfortunately, most practitioners ignore even routine numerical issues in statistical computing. This is in contrast to some scientific and engineering disciplines, which require assessments of the accuracy of both data and results. A simple example, using generated data, shows how numeric issues can affect even seemingly simple calculations. Consider the calculation of a variable's dispersion: in R, the function sd returns the sample standard deviation, but does not provide an option to return the population standard deviation. The three functions sketched below compute this quantity. The first two are direct implementations of textbook formulas, and the third simply adjusts the result of the sd function. To see the damaging effects of numerical errors, we apply these functions to the following generated data frames, each of which contains columns of numbers, of increasing magnitude, that all have population standard deviation 0.5.
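A minimal sketch of three such functions follows; the names sdp.formula1 through sdp.formula3 match the calls used to produce the results, but the exact bodies here are reconstructions based on the description above rather than verbatim code:

# one-pass "computational" formula: sqrt of (mean of squares minus squared mean)
sdp.formula1 <- function(x) sqrt(sum(x^2)/length(x) - mean(x)^2)
# two-pass textbook formula: sqrt of the mean squared deviation from the mean
sdp.formula2 <- function(x) sqrt(sum((x - mean(x))^2)/length(x))
# rescale the sample variance returned by var() to the population variance
sdp.formula3 <- function(x) sqrt(var(x) * (length(x) - 1)/length(x))

Applying each function to every column of a generated data frame, for example with round(rbind(sapply(dat, sdp.formula1), sapply(dat, sdp.formula2), sapply(dat, sdp.formula3)), digits = 3), shows the one-pass formula breaking down first: for columns of large magnitude, sum(x^2)/length(x) and mean(x)^2 are nearly equal large numbers whose difference is dominated by rounding error (catastrophic cancellation).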
> testMat <- function(len = 50, digits = c(3, 5, 7, 9, 11, 13,
+     15, 17, 19)) {
+     len <- 2 * len
+     dat <- NULL
+     for (i in digits) {
+         dat <- as.data.frame(cbind(dat, 1:len %% 2 + 10^i))
+     }
+     names(dat) <- as.character(digits)
+     return(dat)
+ }

Although the three formulas for calculating the population standard deviation are mathematically correct, the results computed with them are obviously wrong. The first method is particularly bad: the range of inputs that yield correct answers is much smaller for the first method than for the others, some of the errors produced by the first method are extremely large, and the pattern of errors is not easily discernible as the number of observations in each column increases.
When social scientists encounter similar behavior from software, they often find it baffling.
Ironically, as user-friendly statistical packages have enabled social scientists to adopt increasingly sophisticated statistical methods, users have become even less familiar with the computational details involved in their estimation. We have demonstrated such issues to numerous colleagues, and most consider the results entirely mysterious until the underlying mechanics have been thoroughly explained. These comments, collected by one reviewer of our previous work [Zandt, 2005] after showing a version of the standard deviation example to his colleagues, are representative: "I polled my colleagues, all quantitative psychologists who routinely perform complex statistical analyses, by showing them [a similar table of standard deviations, as computed by Excel]. The responses I got ranged from 'I don't want to hear about this,' and 'Oh my God!' to 'I'm confused.' Colleagues of a certain age, those who remember punching Fortran code onto cards or tape, were not surprised at all. Another colleague said 'These issues are completely unknown to us.'"
Our example is relatively easy to grasp because it uses simple calculations on a restricted range of inputs. In practice, the computational issues involved are more complex, and inaccurate results are less obvious to users. Many computational problems reveal themselves only as obscure warning messages delivered at the end of an analysis, or not at all, while yielding purportedly correct estimates. The sources of computational inaccuracy include: un-modeled measurement error; bugs; errors in data input; data that are ill-conditioned for a particular model; floating point underflow, overflow, and rounding; non-random structure in "random" number generators; local optima or discontinuities in optimization landscapes; inappropriate or unlucky choices of starting values; and inadequate stopping criteria.
A number of book-length works are devoted to each of these sources of error, and correcting them may require intensive examination of the numerical details of a particular problem.Fortunately, a number of relatively straightforward techniques exist to detect such problems.While these techniques cannot prevent or fix errors, they can alert the researcher to many situations in which numerical and measurement errors would otherwise lead an analysis astray.
We describe here a set of software tools we have developed to make these techniques readily available to users of the S language. In particular, we have created accuracy, a package that runs in R [R Development Core Team, 2007] and S-PLUS, to help researchers determine whether the results of their statistical analyses are potentially affected by measurement and numerical error. The package provides a variety of numerical tools for improving the accuracy and reliability of statistical analysis, including: methods to compare the results from multiple models; sensitivity analyses for numerical and measurement error; methods for checking whether a local optimum has "trapped" a maximum likelihood estimate, non-linear regression, or other optimization-based method; a routine to supply true random numbers by tapping external physical sources of entropy; and a generalized Cholesky method for recovering information from non-invertible Hessians. The accuracy package also provides an interface to Zelig, a package for R that provides a simple unified way to estimate, interpret, and present the results of a large range of statistical methods [Imai et al., 2007].

Common Sources of Inaccuracy

Accuracy and stability
What is computational accuracy? The computational accuracy of a statistical analysis can be thought of as the distance (using a well-behaved measure) between the actual results produced by the computer program and results that would have been produced had all algorithms been implemented correctly, and all calculations been performed perfectly.
Computational accuracy alone is not enough to ensure reliable computations. Given infinitely precise and accurate inputs, a computation can yield perfectly accurate results, and yet the same program can yield wildly incorrect answers in the presence of minute amounts of measurement or rounding error in the inputs. Since, in practice, data are almost certainly subject to some finite measurement error, and calculations to some finite numerical error, reliable statistical computations must be computationally stable as well as accurate.
Somewhat more formally, stability is the distance (again using a well-behaved measure) between the true estimate, given the data Y, and the computer output, given the data plus noise. Less formally, a stable algorithm gives, to quote Higham [2002], "almost the right answer to almost the same problem." Instability may arise from, or be exacerbated by, several levels of problems: accumulated numerical errors in the computations used to implement the model; limitations in the algorithms guiding these computations; or interactions between the form of the statistical model and the data being analyzed. Regardless of the source of the problem, if the data inputs are not given with absolute precision and accuracy, whether because of measurement error in collecting the data or numerical error in processing it, and if the model estimation process is unstable, then any inferences that follow are unlikely to be correct.
To understand how to detect and correct for computational problems, it is useful to understand their sources. For most statistical models, problems at the computational level fall into three broad categories. First, computational problems occur because nearly all software uses floating-point number representations rather than exact symbolic representations. Second, computational problems occur because of the limits, or incorrect use, of optimization algorithms. Third, computational problems result from non-randomness in "random" number generation.

Floating point arithmetic errors
A floating point number system is a subset of the real numbers whose elements have the form y = ±m × β^(e−t). Each such system is characterized by a base β, an exponent range e_min ≤ e ≤ e_max, a precision t, a sign, and a mantissa m with 0 ≤ m ≤ β^t − 1. For a particular y, only m, e, and the sign are stored. In IEEE floating point, which is now used in almost all software, one can think of each number as being represented in base 2 by a single bit for the sign, a sequence of t bits for the mantissa, and an exponent stored in a fixed number of bits.

Unfortunately, some numbers cannot be exactly represented in this scheme. An interesting example is the number 0.1, which is an infinitely repeating fraction in base 2: 0.0001100110011... Rounding error occurs when this number is stored as a floating point value, since it must be represented in a finite number of bits. Moreover, floating point arithmetic is not necessarily exact even when the operands happen to be represented exactly. For example, when floating point numbers are added or subtracted, their exponents must first be aligned: the mantissa of the smaller number is repeatedly divided by the base while its exponent is increased, until the two operands have the same exponent. This shifting may cause low-order bits in the mantissa of the smaller number to be lost, which results in rounding error. In addition, floating point operations are susceptible to overflow and underflow, when a number is larger than the largest, or smaller than the smallest, number capable of being represented. (Careful programmers can avoid some of these problems, with considerable effort, by explicit use of multiple precision arithmetic through packages such as gmp [Lucas et al., 2007]. Most functions and calculations in R, S-PLUS, and other statistical languages, however, remain potentially susceptible.)

As Knuth [1997, page 229] points out, one consequence of the inaccuracies in floating point arithmetic is that the associative law sometimes breaks down: (a ⊕ b) ⊕ c may differ from a ⊕ (b ⊕ c), where ⊕ denotes the floating point version of a standard arithmetic operator. Also, as Higham [2002, Section 2.10] notes, limits on precision can interfere with the mathematical properties of functions. Although many elementary functions can be efficiently calculated to any desired degree of precision, the necessity of returning the final results at a fixed precision leads to situations in which the computed function may not have all of the mathematical properties of the true function of interest. The requirements of preserving symmetries, mathematical relations and identities, and correct rounding to the destination precision can conflict, even for elementary functions.
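These effects are easy to demonstrate directly in R:

0.1 + 0.2 == 0.3                          # FALSE: none of these values is exactly representable
(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)    # FALSE: floating point addition is not associative
1e16 + 1 - 1e16                           # 0, not 1: the increment is lost to rounding
print(0.1, digits = 20)                   # shows the nearest representable value to 0.1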

Non-linear optimization errors
The effects of numerical and measurement error can both interact and accumulate. In the standard deviation examples above, rounding error had a large cumulative effect on the final result.
As another example, in linear regression, when several explanatory variables are near-collinear, their individual parameter estimates are based on little information and are thus more sensitive to measurement errors. The cumulative effects of numerical error are, however, more often visible in non-linear estimation. In addition, the accuracy of many non-linear estimates is inherently dependent on a well-informed (or fortuitous) choice of computational methods and parameters that are not included in the formal statistical model, such as starting values, convergence tolerances, iteration step sizes, and optimization algorithms.
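As a small illustration of the near-collinearity point above (all values here are synthetic):

set.seed(1)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 1e-4)     # x2 is nearly collinear with x1
y  <- x1 + x2 + rnorm(50)
coef(lm(y ~ x1 + x2))               # individual coefficients are poorly determined
x2p <- x2 + rnorm(50, sd = 1e-4)    # a perturbation far smaller than plausible measurement error
coef(lm(y ~ x1 + x2p))              # changes the individual coefficients dramatically,
                                    # even though their sum and the fitted values barely move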
Standard techniques for the optimization of likelihood functions (and other non-linear modeling techniques) typically involve the following steps:
1. Choose (implicitly or explicitly) starting values for each parameter in the model.
2. Use analytic gradients (or numerically calculated differences) of the likelihood function, given the current parameter values, to determine a "direction" to move.
3. Identify an appropriate step size.
4. Take a "step" in that direction.
5. Update the parameter values in accordance with the step size and direction.
Steps two through five are repeated until the algorithm has converged to a stationary point, or some other stopping criterion (such as a limit on the number of iterations) is satisfied.
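A minimal sketch of this loop for a one-parameter log-likelihood, using a numerical gradient and a fixed step size purely for illustration (real optimizers choose the step size adaptively):

y <- c(3, 5, 2, 4, 6)                          # illustrative count data
loglik <- function(lambda) sum(dpois(y, lambda, log = TRUE))
numgrad <- function(lambda, h = 1e-6) (loglik(lambda + h) - loglik(lambda - h)) / (2 * h)

lambda <- 1                                    # step 1: starting value
for (i in 1:500) {
    g <- numgrad(lambda)                       # step 2: direction from the (numerical) gradient
    step <- 0.05 * g                           # step 3: a fixed step size
    lambda <- lambda + step                    # steps 4-5: move and update the parameter
    if (abs(g) < 1e-8) break                   # stopping criterion
}
lambda                                         # converges to mean(y) = 4, the MLE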
Floating point errors can affect the process of nonlinear optimization by inducing discontinuities in the optimization function or its gradients, and optimization algorithms sometimes converge to such false optima. As Chaitin-Chatelin and Traviesas-Caasan [2004a] point out, the classical theory of singularities breaks down under floating point arithmetic: pseudo-singularities (induced by computational inexactness) form a generic set rather than a set of measure zero.
In addition, even without numerical error, non-linear models may be ill-conditioned with respect to a particular dataset, making the estimates particularly sensitive to errors in variables. Finally, unless the optimization surface is convex or unimodal, no computationally-tractable optimization algorithm, even without error in variables or calculation, is guaranteed to converge to a global optimum in a finite amount of time. Thus, when there are multiple local optima, the resulting parameter estimates will be incorrect unless the starting values chosen are within the basin of attraction of the global optimum.
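A small illustration of the dependence on starting values, using a deliberately bimodal objective (the function is synthetic, chosen only to have two optima):

f <- function(x) -(dnorm(x, mean = -2, sd = 0.5) + 2 * dnorm(x, mean = 3, sd = 0.5))
optim(-1, f, method = "BFGS")$par    # converges near -2, a local optimum only
optim( 2, f, method = "BFGS")$par    # converges near  3, the global optimum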

Comparing model results
Because of the potential for computational and other errors, it is sometimes necessary to systematically compare two tables of output that purport to estimate the same statistical model using the same data. accuracy provides a number of functions, introduced in version 1.19, to aid in these comparisons. These tools are meant to be used when comparing the results of statistical benchmarks to output already known to be correct, when moving to a new version of a statistical package, or when building an extension to an existing model. In each of these cases one may want to verify that the new algorithm or implementation yields the same results as an alternative.
Model comparisons may also be used as a component of robustness testing, to check whether a change in computational parameters (such as starting values) or in implementation leads to different results.
The simplest function, LRE(), computes the log relative error between two numbers (or vectors, matrices, or arrays). This is roughly interpretable as the number of significant digits of agreement between the first and second sets of numbers, with larger values indicating closer agreement.
An LRE value of 16 is the maximum achievable using standard IEEE double precision, and thus indicates complete agreement within the standard storage precision of R and S-PLUS. Values less than or equal to zero indicate complete disagreement between the two numbers.
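The underlying computation can be sketched as follows (accuracy's LRE() also accepts vectors, matrices, and arrays; this shows only the core scalar formula, with absolute error used when the correct value is zero):

lre <- function(computed, correct)
    ifelse(correct == 0,
           -log10(abs(computed)),                           # absolute error when the true value is 0
           -log10(abs(computed - correct) / abs(correct)))  # log relative error otherwise

lre(3.14159, pi)      # about 6: roughly six significant digits of agreement
lre(2.9, 3.0)         # about 1.5: barely one to two digits of agreement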
To illustrate the use of the LRE function, we compare the output of the qt function for the Student t distribution to numerically correct benchmark output. First we compute the quantiles by applying qt to the probability values given in the test data. The LRE function is then invoked to compare the computed quantiles to the correct quantiles, which are contained in another column of the benchmark data. This produces a table showing how many of the 10,000 qt evaluations tested were accurate to a given number of 'digits', from 0 (totally inaccurate) to 16 (as accurate as the benchmark can measure).

The function modelsCompare can be used to compare the results from a list of two or more fitted models. Its default behavior is to extract the estimated coefficient values from each model summary, compare these sets of coefficients using the LRE function, and report, for each coefficient, the largest of the differences between the first model given and each of the subsequent models in the list. (If one is interested in other quantities, for example both parameter estimates and standard errors, one can supply a different function to select the values of interest.) For example, to see to what extent the use of different optimization methods to solve the same problem affects the estimated coefficients, first run the models separately, and then compare them:

R> modelsCompare(list(fit1, fit2, fit3, fit4, fit5), param.function = modelBetas)

We have also supplied a convenience function, modelsAgree, which uses modelsCompare to provide a simple logical test that two models agree to a given number of digits for all coefficients. For example, either of the following lines can be used in an "if" expression to test that selected models agree to two significant digits:

R> min(as.vector(modelsCompare(list(fit1, fit2), param.function = modelBetas))) >= 2
[1] TRUE
R> modelsAgree(fit1, fit2, digits = 2, param.function = modelBetas)
[1] TRUE

While modelsCompare aids in ad hoc comparisons among different implementations of the same model, often a more systematic approach is needed. The next section shows how to use accuracy to systematically compare the results of the same model across slightly perturbed inputs.

Methods for sensitivity analysis through data perturbation
A straightforward way to test whether a given model is stable with respect to some forms of numerical and measurement error is to repeatedly introduce small random perturbations to the data, on the order of the measurement error of the instruments used to collect it, and then recalculate the estimates. When the range of estimates produced using this technique varies greatly, the model estimation is necessarily unstable, although the converse is not necessarily true. Where a model is already known to be statistically appropriate, this type of sensitivity analysis gives the researcher greater confidence that their results are robust to numerical and measurement error.
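The idea can be sketched by hand for a simple linear model, before turning to the tools that automate it (the 0.1% noise level here is purely illustrative):

perturb.once <- function(d) {
    d[] <- lapply(d, function(x) x * (1 + runif(length(x), -0.001, 0.001)))
    coef(lm(dist ~ speed, data = d))
}
set.seed(2)
reps <- t(replicate(100, perturb.once(cars)))   # re-estimate on 100 perturbed copies of 'cars'
apply(reps, 2, range)                           # range of each coefficient across the replicates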
Data perturbations were described as a sensitivity test in early work by Beaton et al. [1976a,b], who also developed a stability index based on this idea. Similar methods have been recommended by Gill et al. [1981], Pregibon [1981], Cook [1986], and Belsley [1991]. Belsley's approach evaluates collinearity in linear regression models by adding random disturbances to suspected variables and assessing the consequences for parameter stability, thus exaggerating subtle effects to make them more detectable. Hendrickx et al. [2004] provide related collinearity diagnostics, including one based on data perturbations.

To see how perturbations affect the estimation process, consider two likelihood functions: a standard form based on the observed data, ℓ(θ, x), and an identical specification based on perturbed data, ℓ_p(θ, x_p), where p denotes an individual perturbation scheme. One can show that comparing the two likelihood functions is analogous to comparing an un-weighted likelihood function ℓ(θ, x) to a weighted one. Alternatively, one can define the unperturbed likelihood function as one in which the perturbations, or weights, are null: ℓ_p0(θ, x_p0) = Σ_i p_0i ℓ_i(θ, x_i), where ℓ_i is the contribution of the i-th observation and p_0 is simply a vector of ones. This provides two maximum likelihood vectors for comparison: θ and θ_p. Our approach is to evaluate the range of θ_p produced by multiple samples of x_p, generated by randomly drawing the disturbances p, across different datasets x. This builds upon the approach of Cook [1986], who examines maximizing and minimizing perturbations, and roughly follows a simpler test for logistic regression given by Pregibon [1981].
Although this evaluation methodology is not restricted to a particular class of likelihood functions, Cook [1986] shows that for "well-behaved" likelihood functions there is a straightforward mapping between perturbations of the data and perturbations of the model. For example, normally-distributed noise added to the data induces a corresponding small mean-shift in the likelihood curve [Laurent and Cook, 1993].
Sensitivity analyses are invaluable because they can be applied where benchmark tests and independent confirmation are unavailable or inconclusive. Moreover, sensitivity analyses can be applied to the actual data under study. Perturbation analyses can thus serve to draw attention to potential problems in an algorithm, implementation, or model.
It is important to note, however, that sensitivity analysis alone cannot demonstrate that a particular set of estimates is correct, nor can it be used to improve estimates of the "correct" values. The mean of the range of parameter estimates produced by a sensitivity analysis is, in fact, likely to be slightly biased; it is well known that measurement error in explanatory variables introduces bias (although, as noted above, this bias is slight and well-behaved under certain conditions). Furthermore, in recognizing the statistical problems with measurement error, we do not advise intentionally exacerbating it as an estimation technique. Rather, the perturbations are a means of testing the reliability of the estimates produced by a particular model and set of data.
In summary, perturbation may introduce bias, but if the problem is well-conditioned and the algorithm and implementation are accurate, the bias should be small. Models that react dramatically to modest levels of measurement error warrant caution. Furthermore, any bias introduced by perturbations should be the same across different implementations of the model, so perturbations are also one means of comparing the relative stability of competing computational approaches to model estimation.

Running sensitivity analysis
The accuracy package makes perturbation-based sensitivity analysis simple to apply and to interpret. For many models, running a sensitivity analysis involves only three steps.
1. Specify the data and model.
2. Give this specification to sensitivity() to run.
3. Use summary() or plot(summary()) to display the sensitivity of the parameter estimates to perturbations.
The sensitivity() command works automatically with almost any R and S-PLUS model, including lm, glm, and nls. All that is required is for the model to accept a data frame argument and to return estimated coefficients through the standard coef() interface.
The example below shows how to conduct a sensitivity analysis using sensitivity() and its default noise functions. For example, to run a sensitivity analysis on the classic Longley [1967] dataset, a well-studied example known to have multi-collinearity, execute the following command:

R> plongley <- sensitivity(longley, lm, Employed ~ ., ptb.R = 500)

Sensitivity results can be displayed in plot form, in which each boxplot shows the distribution of estimates across repeated perturbations for a single parameter:

R> plot(summary(plongley))

The resulting plot is shown in Figure 1.

Figure 1: Sensitivity boxplots of the perturbed Longley regression, one panel per coefficient: (Intercept), GNP.deflator, GNP, Unemployed, Population, and Year.
A tabular summary can reveal more information. The summary table compares the range of parameter estimates produced under perturbation to the original, unperturbed, parameter estimates. The table also highlights parameters that are particularly sensitive to perturbation, that is, parameters whose estimate under perturbation exceeded the original estimate by more than two original standard errors in a disproportionate number of runs. In comparison, the "Thurber" example from the National Institute of Standards and Technology (NIST) statistical benchmark datasets [Rogers et al., 1998] exhibits instability in all parameters, while the anorexia model discussed below is essentially insensitive to perturbation.

If error functions are not specified, a default set of error functions is selected based on the measurement type of each variable: continuous, ordered, or unordered. Continuous variables are, by default, subject to a small amount of mean-zero, component-wise, uniformly distributed noise, which is typical of instrumentation-driven measurement error. Ordered factors are assigned a small probability of having observations reclassified into a neighboring category, and unordered factors have a small probability of being reassigned to another legal value. Alternatively, one can specify the error functions to use, or use one of the many already supplied; the accuracy package comes with a wide range of functions for adding noise to continuous variables and for randomly reclassifying factor variables. If measurement errors are correlated across variables, one can use a matrix-oriented perturbation function, as illustrated in the sketch following this discussion.

Your choice of error functions should reflect the measurement error model that is appropriate to the data you are using. In numerical analysis, uniform noise is often applied, since this is what would result from simple rounding error. Normal random noise is commonly used in statistics, under the assumption that measurement error is the sum of multiple independent error processes.
In addition, when normal perturbations are used, the result can be interpreted, for many models, as equivalent to the results of running a slightly perturbed model on unperturbed data. In some cases, such as discrete or ratio variables, other forms of noise are necessary to preserve the structure of the problem [see, for example, Altman et al., 2003, section 16.3]. The magnitude of the noise is also under the control of the researcher. Most choose a magnitude roughly proportionate to the researcher's qualitative estimate of the underlying measurement error in the data. Noise is usually adjusted to the size of each component, since this better preserves the structure of the problem; however, in some cases the underlying measurement error model may require norm-wise scaling of the noise. For more information on noise distributions and measurement error models see, e.g., Belsley [1991], Chaitin-Chatelin and Traviesas-Caasan [2004a], Cheng and Ness [1999], Fuller [1987], Carroll et al. [1995].
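A rough sketch of a matrix-oriented, correlated perturbation of the Longley data follows; the covariance matrix, the scaling, and the exact semantics of the ptb.ran.gen and ptb.rangen.ismatrix arguments are assumptions here, so consult the package documentation for the definitive interface:

if (require(MASS, quietly = TRUE)) {
    plongleym <- sensitivity(longley, lm, Employed ~ .,
        ptb.rangen.ismatrix = TRUE,
        ptb.ran.gen = function(x, size = 1) {
            p <- ncol(x)
            # strongly correlated, mean-zero noise added to every column
            noise <- mvrnorm(n = nrow(x), mu = rep(0, p),
                             Sigma = matrix(0.9, p, p) + diag(0.1, p))
            x + size * noise
        })
}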
We recommend that users run sensitivity multiple times with different noise specifications.
However, in our practical experience with social science analyses, the error model choice does not tend to affect the substantive conclusions from the sensitivity analysis.
Some researchers omit perturbations of outcome variables since, in terms of statistical theory, mean-zero measurement error in an outcome variable (as opposed to the explanatory variables) contributes only increased variance to the estimates, not bias. While this attitude is justified in the context of statistical theory, it is not similarly justified in the computational realm. If the estimation of a model is computationally unstable, errors in the outcome variable may produce large and unpredictable biases in the model estimates. Hence, the conservative default in our package is to subject all variables to perturbation, although options are available to completely control the form and magnitude of all perturbations.
Consider this example, which shows a sensitivity analysis of the anorexia analysis described in Venables and Ripley [2002]. In this case, for illustration, we leave the dependent variable unperturbed by assigning it the identity error function:

R> data("anorexia", package = "MASS")

The anorexia example is relatively stable: most of the parameter estimates vary little over repeated perturbations.
Finally, if an R or S-PLUS model function does not take a data argument or does not return coefficients through the coef method, it is usually only a matter of a few minutes to write a small wrapper that calls the original model with the appropriate data and that provides a coef method for retrieving the results. (Alternatively, you might choose to run such models in Zelig, as described in the next section.) For example, the mle function for maximum-likelihood estimation does not have an explicit data argument. Instead, it receives data implicitly through the log-likelihood function passed to it. To adapt it for use with sensitivity(), one simply constructs another function that accepts the data and a log-likelihood function separately, builds a temporary log-likelihood function with the data bound in its environment, and then calls mle with that temporary function.
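A sketch of such a wrapper, based on the documented example in the stats4 package (the construction of mleD below is one way to bind the data; the code in the original package may differ in detail):

library(stats4)

mleD <- function(data, lld, ...) {
    # copy the log-likelihood, dropping its leading data argument
    ll <- lld
    formals(ll) <- formals(lld)[-1]
    # bind the supplied data in the copy's enclosing environment
    e <- new.env(parent = environment(lld))
    assign(names(formals(lld))[1], data, envir = e)
    environment(ll) <- e
    mle(ll, ...)
}

# negative Poisson log-likelihood that takes the data explicitly
llD <- function(data, ymax = 15, xhalf = 6)
    -sum(stats::dpois(data[[2]], lambda = ymax/(1 + data[[1]]/xhalf), log = TRUE))

d <- data.frame(x = 0:10, y = c(26, 17, 13, 12, 20, 5, 9, 8, 5, 4, 8))
coef(mleD(d, llD, start = list(ymax = 15, xhalf = 6)))

Because the wrapper now takes the data as its first argument and returns its results through coef, it can be passed to sensitivity() in the same way as lm above.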

Sensitivity Analysis of Zelig Models
Zelig [Imai et al., 2007] is an R package that can estimate, and help interpret the results of, a large range of statistical models. Zelig provides a uniform interface to these models that accuracy uses to perform sensitivity analyses. In addition, accuracy can be used to perform sensitivity analyses of the robust alternatives, simulated predicted values, expected values, first differences, and risk ratios that Zelig produces for all the models it supports. Using these packages together thus provides a convenient means of analyzing the sensitivity of predicted values to measurement error.
To illustrate, we replicate Longley's analysis, using zelig() (instead of lm) to run the OLS model, and the convenience function sensitivityZelig to run the sensitivity analysis:

R> data(longley)
R> goodZelig <- require(Zelig, quietly = TRUE) && !inherits(try(zelig(Employed ~ GNP,
+     "ls", longley, cite = FALSE), silent = TRUE), "try-error")

Just as above, summary and plot(summary()) can be used to summarize the sensitivity of the model coefficients. In addition, we can use the Zelig methods setx and sim to simulate various quantities of interest, and when summary and plot are applied to the result, they display a sensitivity analysis of the predicted values.
For example, the code below generates predictions of the distribution of the outcome variable, 'Employed', around the point where 'Year' equals 1955 and the other variables are at their means, and creates a profile plot of that predicted distribution.
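A sketch of the calls that produce the perturbed Zelig fit and the simulation settings used below; the exact signature of sensitivityZelig is an assumption based on the surrounding description, while setx is Zelig's standard function for fixing covariate values (unspecified variables are held at their means):

if (goodZelig) {
    # perturbation analysis of the Zelig least-squares model for the full Longley specification
    perturb.zelig.out <- sensitivityZelig(zelig(Employed ~ ., "ls", longley, cite = FALSE))
    # explanatory variables at their means, except Year fixed at 1955
    setx.out <- setx(perturb.zelig.out, Year = 1955)
}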
This sets the values of the explanatory variables to be used for the predictive simulation. The following then performs the simulations, using the perturbed models:

R> if (goodZelig) {
+     sim.perturb.zelig.out <- psim(perturb.zelig.out, setx.out)
+ }

Finally, the range of predicted values under perturbation, which can be thought of informally as predictions that are "robust" to perturbations, is reported via a summary and an accompanying plot. The resulting plot is shown in Figure 2.

More accurate computing
When a model is shown to be sensitive to perturbations, there are a number of possible culprits, including multiple optima, ill-conditioning, poor random number generation, and rounding error.
Although there is no single tool that can identify or fix these root causes, accuracy offers a number of additional tools for dealing with these sorts of issues. We discuss these in the following sections.

True random numbers through entropy collection
'Random' numbers aren't. The numbers provided by routines such as runif() are not genuinely random. Instead, they are produced by pseudo-random number generators (PRNGs): deterministic processes that create a sequence of numbers. A pseudo-random number generator starts with a single "seed" value (specified by the user or based on a default) and generates a repeating sequence with a certain fixed length, or 'period', p. This sequence is statistically similar, in limited respects, to random draws from a uniform distribution.
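The determinism is easy to see: resetting the seed reproduces the sequence exactly.

set.seed(42); runif(3)
set.seed(42); runif(3)    # identical to the first call: the "random" sequence is deterministic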
The earliest PRNGs, which are still in use in some places and which were used in early versions of R, come from the family of linear congruential generators (LCGs), defined as LCG(a, m, s, c): x_0 = s, x_{n+1} = (a·x_n + c) mod m, where a is the multiplier, m the modulus, s the seed, and c the increment.
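A toy implementation of the recurrence (the default constants are those of the classic 'minimal standard' generator, a = 16807, m = 2^31 − 1, used purely for illustration; R's default generator is the far better Mersenne-Twister):

lcg <- function(n, a = 16807, c = 0, m = 2^31 - 1, seed = 1) {
    x <- numeric(n)
    state <- seed
    for (i in seq_len(n)) {
        state <- (a * state + c) %% m    # the LCG recurrence
        x[i] <- state / m                # scale to (0, 1)
    }
    x
}
lcg(5)    # the same seed always yields the same five numbers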

Tests of global optimality

Whether an optimization algorithm finds the true global optimum often depends on the starting point of the search. Maximum likelihood functions, non-linear regression models, and the like are not guaranteed to be globally convex in general, and even where convexity is guaranteed by statistical theory, inaccuracies in statistical computation can sometimes induce false local optima (discontinuities that may cause local search algorithms to converge, or at least stop). A poor or unlucky choice of starting values may therefore cause a search algorithm to converge at a local optimum that is far from the real global optimum of the function, and inferences based on the values of the parameters at that local optimum will not be correct.
Knowing when a function has reached its true maximum is something of an art. While the plausibility of the solution in substantive terms is often used as a check, relying solely on the expected answer as a diagnostic may bias researchers toward Type I errors, rejecting the null hypothesis when it is true. Diagnostic tests are therefore useful to provide evidence that the computed solution is the true solution. If such tests indicate that the global optimum has not been reached, the user may consider a closer examination of starting values, an alternative optimization algorithm, and/or a heuristic designed for non-smooth optimization problems, such as the simulated annealing option for optim(), or the optimizers provided by the gafit [Tendys, 2002], genalg [Willighagen, 2005], and rgenoud [Mebane and Sekhon, 2007] packages.

A number of strategies related to the choice of starting values have been formalized as tests of global optimality. In this package we implement two: the 'Starr' test and the 'Dehaan' test. The intuition behind the Starr test is to run the optimization from different starting points in order to observe 'basins of attraction', and then to estimate the number of unobserved basins of attraction from the number observed. The greater the number of observed basins of attraction, the lower the probability that a global optimum has been located. This idea has been attributed to Turing [1948], and the test statistic was developed by Starr [1979]. The statistic, V2, estimates the probability that a convergence point has not been observed, based on r, the number of randomly chosen starting points; S, the number of convergence points that were produced from exactly one (Single) starting value; and D, the number of convergence points that were produced from exactly two (Double) different starting values. Finch et al. [1989] demonstrate the value of the statistic by analyzing a one-parameter equation on the [0, 1] interval for r = 100. While the proposed statistic is compelling, their example is similar to an exhaustive grid search of the [0, 1] interval. (Starr's result generalizes to triples and higher-order clumping of starting values into their basins of attraction, but Finch, Mendell, and Thode assert that counting the number of singles and doubles is usually sufficient.) The statistic may be infeasible to compute for an unbounded parameter space of high dimensionality, but the intuition behind it can still be applied soundly in such cases: if multiple local optima are identified over the course of a search for good starting values, a researcher should not simply stop once an apparent best fit has been found, especially if a number of the local optima have basins of attraction that were identified only once or twice. Our implementation of the Starr test provides a ready-to-use interface that can easily be incorporated into a search of the parameter space for good optimization starting values.

For computationally intensive problems, another test, proposed by Veall [1990], drawing upon a result presented by de Haan [1981], may be more practical. The de Haan/Veall test relies on sampling the optimization function itself rather than identifying basins of attraction. A confidence interval for the value of the likelihood function's global optimum is generated from the points sampled from the likelihood surface. This procedure is much faster than the Starr test because the likelihood function is calculated only once for each trial, rather than the optimization algorithm being run many times to identify basins of attraction. As with starting-value searches, researchers are advised to increase the bounds of the search area and the number of trials if the function to be evaluated has a high degree of dimensionality or if a large number of local optima have been identified.
Veall suggests that, by using a random search and applying extreme-value asymptotic theory, a confidence interval for the candidate solution can be formulated. The method, according to Veall (p. 1460), is to randomly choose a large number, n, of values for the parameter vector using a uniform density over the entire parameter space. Call the largest value of the evaluated likelihood function L1 and the second largest value L2. The 1 − p confidence interval for the candidate solution is then [L1, Lp], where Lp is computed from L1, L2, and a quantity k that is a function of n. As Veall (p. 1461) notes, the bounds of the search of the parameter space must be large enough to capture the global maximum, and n must be large enough for the asymptotic theory to apply.
In Monte Carlo simulations, Veall finds that 500 trials are sufficient to reject candidate solutions that are merely local optima rather than the a priori identified global optimum.
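The multi-start intuition behind the Starr test can be sketched directly (the objective here is synthetic, chosen only to have one local and one global minimum):

f <- function(x) (x^2 - 4)^2 + x                 # global minimum near -2, local minimum near +2
starts <- runif(50, -4, 4)                       # 50 random starting values
ends <- sapply(starts, function(s) round(optim(s, f, method = "BFGS")$par, 2))
table(ends)   # convergence points, and how often each basin of attraction was hit;
              # basins seen only once or twice suggest that further optima may remain unobserved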

A generalized Cholesky method
We include in the accuracy package an implementation of the Schnabel and Eskow [1990] Cholesky decomposition algorithm, as implemented in Gill and King [2004]. Essentially, this asks the question: if my Hessian (or any other matrix) cannot be decomposed by the Cholesky algorithm for low-level numerical reasons (or perhaps other reasons), what is the smallest amount by which I need to change the matrix to make it decomposable? Recall that the Cholesky decomposition of a matrix C is defined as the matrix V in C = V′V. Of course, this idea of "change" has multidimensional consequences for the algorithm, and the elegance of the Schnabel and Eskow approach is that it takes this into account, unlike more primitive solutions in which changes can impose greater consequences later in the procedure [Gill and Murray, 1974]. Their method, based on Gerschgorin bounds, is implemented in our function sechol.
The generalized inverse is a commonly used technique in statistical analysis, but the generalized Cholesky had not, to our knowledge, been used for statistical purposes prior to its appearance in Altman et al. [2003]. When the inverse of the negative Hessian does not exist, we suggest the following procedure: create a pseudo-variance matrix and use it, in place of the inverse, in an importance resampling scheme. In brief, applying a generalized inverse (when necessary, to avoid singularity) and a generalized Cholesky decomposition (when necessary, to guarantee positive definiteness) together often produces a pseudo-variance matrix for the mode that is a reasonable summary of the curvature of the posterior distribution. This method is developed and analyzed in detail in Gill and King [2004]; here we provide a brief sketch.
The Gill/Murray Cholesky factorization of a singular matrix C adds a diagonal matrix E such that the standard Cholesky procedure is defined for C + E. Unfortunately, it often increments C by an amount much larger than necessary, producing a pseudo-Cholesky result that is further away from the intended result. Schnabel and Eskow [1990] improve on the C + E procedure of Gill and Murray by applying the Gerschgorin Circle Theorem to reduce the infinity norm of the E matrix. The strategy is to calculate delta values that reduce the overall difference between the singular matrix and the incremented matrix. This improves on the Gill/Murray approach of incrementing the diagonal values of a singular matrix just enough that the Cholesky steps can be performed.
This technique is complex to describe but simple to use. The simplest case is a matrix that the standard chol function cannot decompose because of singularity; sechol instead returns the decomposition of a minimally incremented matrix. More generally, consider the example from Altman et al. [2003]: a matrix that is positive definite, so that its Cholesky decomposition can be computed in the usual way. Suppose now that the values in its corners are changed from 2.4 to 2.5. The matrix is then non-positive definite, and the Cholesky decomposition cannot be calculated in the normal manner. However, sechol returns a decomposition that preserves the sense that only a small change (2.4 to 2.5) has been made.
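A short sketch of sechol in use; the matrix here is illustrative, not the one from the original example, and we assume only that sechol accepts a symmetric matrix in the same way as chol:

library(accuracy)
C <- matrix(c(2,   0, 2.5,
              0,   2, 0,
              2.5, 0, 2), nrow = 3, byrow = TRUE)
try(chol(C))     # fails: C is not positive definite
V <- sechol(C)   # generalized Cholesky of a minimally incremented matrix
t(V) %*% V       # close to C, apart from the small diagonal increment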

Example: prior elicitation for logit regression
Bayesian applications of the logit model for dichotomous outcomes are quite common and often use rather vague priors.Suppose, instead, that we want to elicit prior information from subject matter experts and include this explicitly qualitative information in the statistical model.In this way the Bayesian posterior distribution is a compromise between expert knowledge and the data at hand.
Arguments for this approach can be found in Gill and Walker [2005].
Begin with the basic logit regression model conforming to the standard assumptions, defining terms conventionally: Y = [1 + exp(−Xθ)]^(−1) + ε, where X is an n × p, rank p matrix of explanatory variables with a leading column of ones for the constant, θ is a p × 1 vector of coefficients, Y is an n × 1 vector of observed outcome values, and ε is an n × 1 vector of errors. The Bayesian approach to uninformed priors for this model often specifies that p(θ) ∝ c over (−∞, ∞) for an arbitrary constant value c.
To elicit judgments for the prior, experts need to be queried in such a way that some parametric form for the prior on θ can be stipulated that conforms to their assessments. Kadane et al. [1980] suggest the following approach: establish j design points of the explanatory variable vector, X1, X2, . . ., Xj, such that these represent interesting cases spanning the range of the p variables.
The assessors are asked to study each of the Xi scenarios and produce Y*i, an expected outcome (zero or one) corresponding to each design point. Such a value represents a typical response to the hypothesized design point Xi. For a single expert, the result is a "stacked" design matrix obtained by collecting the Xi values, and an expected outcome vector obtained by collecting the Y*i values.
An elicited prior point estimate for θ is produced by running a logit model as if these were conventional data. However, if the researcher does not ensure that the design matrix leads to a positive definite Hessian matrix (the matrix of second derivatives at the MLE point estimates) in the estimation process, then she may complete the elicitation before noticing that the Hessian cannot produce a variance/covariance matrix for the estimated coefficients (nothing in the definition of interesting Xi cases ensures that it can). Since these experts are often busy and difficult to schedule, repeating the process may not be possible. Obviously, the matrix can be reconfigured by deleting cases, but this leads to a loss of information and potential biases.
Consider the following simple case from education policy. The California Department of Education (CDE) collected testing data in 1998 for unified school districts and collections of schools that logically constitute similar units (n = 303), by requiring students in the 2nd through 11th grades to take standardized tests (the Stanford 9) in a variety of subjects, including mathematics, which we analyze here. The raw test scores are replaced with a binary outcome indicating whether or not the student exceeded the national median, summed by district, as a way to deal with large measurement error (this is also a criterion for evaluating administrators). A design matrix with insufficient variation is stipulated, and elicitations are produced by an expert in education policy, giving the vector Y = [1, 0, 0, 0, 1, 1, 0, 1] (test scores are heavily influenced by demographics in this policy environment).
These numbers are admittedly synthetic, but they are contrived to show how easily the following problem emerges.
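An illustrative sketch of the problem: the design matrix below is hypothetical, since the original design points are not reproduced here, and only the elicited outcome vector is taken from the text.

design <- data.frame(
    pct.disadvantaged   = c(0.1, 0.1, 0.2, 0.2, 0.8, 0.8, 0.9, 0.9),
    pct.english.learner = 1 - c(0.1, 0.1, 0.2, 0.2, 0.8, 0.8, 0.9, 0.9))  # no independent variation
Y <- c(1, 0, 0, 0, 1, 1, 0, 1)    # elicited outcomes
fit <- glm(Y ~ pct.disadvantaged + pct.english.learner, family = binomial, data = design)
coef(fit)    # the aliased coefficient is NA: the Hessian is singular, so no full
             # variance/covariance matrix for the elicited prior is available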
One gets somewhat farther constructing the probit likelihood and using optim.

Conclusion

In this article, we have discussed a number of tools and methods for assessing the sensitivity of complex models to numerical and measurement error. These tools are all included in the accuracy package, which can be obtained from any CRAN mirror (http://cran.r-project.org/), from the new CSAN site (http://csan.insightful.com), or from our website: http://www.hmdc.harvard.edu/micah_altman/software. accuracy (with Zelig) has also been incorporated in the Virtual Data Center (VDC) [Altman et al., 2001] digital library and data analysis system, which offers a web-based interface for simulation and sensitivity analysis that is simple enough for even novices to use. The VDC system is available from http://thedata.org.