Nonparametric regression analysis of data from the Ames mutagenicity assay.

The Ames assay has received widespread attention from statisticians because of its popularity and importance to risk assessment. However, investigators have yet to routinely apply modern regression methods that have been available for more than a decade. We study yet another approach, the application of nonparametric regression techniques, not as the ultimate solution but rather as a framework within which to address some of the shortcomings of other methods. But nonparametric regression is itself prone to difficulties when applied to Ames assay data, as we show through the use of two examples and some simulation studies. We argue that there remains a great need for further development of statistical methods suitable to the Ames assay. It is hoped that such work can be stimulated and guided by greater collaboration between statisticians and laboratory investigators.


Introduction
The Ames (Salmonella/histidine reversion) mutagenicity assay (1,2) may be the most popular short-term procedure for the evaluation of chemical mutagenic and carcinogenic potency. Because of its central role in risk assessment, the assay has received the attention of numerous statisticians who have sought to develop and apply appropriate methods of analysis with the goal of providing valid tests of mutagenicity and estimates of potency. Breslow and Kaldor (3) reviewed the wide range of statistical methods that have been proposed for use with the Ames assay and other mutagenicity assays. In particular, modem methods involving regression models and incorporating assumptions concerning biological mechanisms and distributional properties ofthe response data have become available and should be preferred over the use of simpler, more classical methods.
In light of these advances, it might be expected that investigators would routinely apply modem methods ofanalysis. Yet, an informal scan of recent volumes of one major journal revealed that these methods are almost never used by investigators. Most inferences regarding mutagenicity seem to be based on classical methods ofanalysis (e.g., simple regression or ANOVA, multiple m-tests, etc.) or even nonstochastic methods such as the twofold rule (4). The problem ofbridging the gap between statisticians and toxicologists therefore remains to be overcome. This contrasts sharply with the fields of epidemiologic research and clinical trials, where modem regression methods are much in vogue.
Given this state of affairs, is it reasonable to introduce new methods and to further develop existing ones? Should not efforts be directed toward facilitating better application of existing methods? We would argue that both problems require further attention, namely: a) Statisticians have not yet achieved an ultimate solution to the problem ofhow best to analyze Ames assay data; and b) investigators should make concerted efforts to understand and apply, with the aid of statistical collaborators, modem methods of analysis. In this way, investigators can help statisticians to evaluate the utility of various methods by providing input as to their performance on a large amount ofreal data. Such feedback should enable statisticians to develop better methods. We should interject at this point that we strongly advocate collaboration between statisticians and laboratory investigators. We do not deem simplicity or ease of understanding on the part of nonstatisticians to be the only criterion for recommending methods of analyzing laboratory data. Proposing overly simple and statistically inefficient methods to make them palatable to statistically untrained researchers can lead to misleading results and, perhaps, distrust of statisticians. We hope that this conference, by bringing together statisticians and toxicologists, will allow toxicologists to acquire better understanding of statistical methods and help build the kind ofcollaboration that will further the interests of both fields of endeavor.
With these introductory ideas in mind, we turn now to focus specifically on the first issue, the need for further development of methods of analysis. We will not address the issue of facilitating better application. We propose a new direction for methodological development in the area of the Ames assaynonparametric regressionand present a specific example based on the smoothing spline. This paper is not meant to be the final word on statistical methods for the analysis ofAmes data, nor do we wish to imply that the present method is the only possible application of nonparametric regression methods. In fact, it will be seen that this new method may at present have only limited usefulness. Rather, we use our approach as a framework within which to evaluate state-of-the-art methods and characterize the need for further research. In the end, we hope it will be obvious that there remains a great deal ofwork to be done to overcome problems that exist in currently available methods. This work could be in the form of better alternative methods, further development ofexisting methods, or comparative studies to provide guidance as to when and how existing methods should be applied. It is expected that all ofthese can and should be performed in close collaboration with laboratory investigators.
Overview of the Ames Assay Figure 1 summarizes the Ames assay. Mutant strains of the bacterium Salmonella typhimurium, which are dependent on histidine for growth, are plated on petri dishes that contain the test chemical and a limited amount of histidine in the growth medium. The end point is reverse mutation to histidine independence, which is detected by the existence ofsurviving colonies ofbacteria after histidine has been depleted. The outcome of interest for statistical analysis is the frequency of such colonies. Several doses of test chemical are used, including a zerodose control. Each dose is typically tested in duplicate or triplicate. Interesting statistical features ofdata arising from the Ames assay include overdispersion (relative to PNisson sampling) and competing underlying dose responses (mutagenicity and toxicity). It is often desirable to condense the results of an assay to a single number, the "mutagenic potency," to rank and compare different chemicals. The most common potency measure used in the Ames assay is the initial slope of the mutagenicity dose response. The two major goals ofanalysis are to test for mutagenicity and to estimate mutagenic potency (given a positive test). Often, the mutagenicity test is based on the potency estimator, but there are methods that test mutagenicity independently from potency estimation [e.g., Simpson and Margolin (5)1. Throughout this paper, we use the initial slope both to estimate potency and to test for mutagenicity. We leave it to others to discuss potency measures and their relative advantages (6,7). (b) Quinoline data (9). Fitted curves represent point rejection (---), nonlinear regression using Equation 1 withoutthe quadratic term (--), quasilikelihood with a loglinear model (---) (12), and the smoothing spline (-). concerns mutagenicity of base extract from crude tar effluent following coal gassification (8). In this example, the presence of a mutagenic effect seems clear, and we can ascertain the nature of the dose response from a plot of the data. Note, however, the possibility of little or no mutagenic effect at low doses because of the quadratic shape of the low-dose response. Fitted lines are the estimated dose responses from nonlinear regression, point rejection, and nonparametric regression (see subsequent sections). The second example (Fig. 2b) is from a study of the mutagenicity of the chemical quinoline (9). In this example it is not easy to ascertain the dose response because ofthe variability in the data, although both mutagenicity and a downturn due to toxicity at high doses are evident. Fitted curves are from the point rejection, nonlinear regression, quasi-likelihood, and nonparametric regression methods (see subsequent sections).
We will keep mathematics to a minimum, focusing rather on concepts. Nevertheless, some basic notation is required. Let Yij be the observed frequency at dose i (i = 1, . . ., K), replicate] (1 = 1, . .. , ni) and let the actual value ofthe ith dose bex x(xi = 0). Let the mean and variance of Yij be Hi and ai?, respectively.
The total sample size is n= K n i. Finally, call the mutagenicity and survival functions M(x) and S(x), respectively, and let the composite dose response function be 14(x), the expected mutant frequency given dose x.

Regression Analysis of Ames Assay Data
Breslow and Kaldor (3) have reviewed statistical methods for analyzing Ames assay data. We mention here only modern methods that specifically accommodate regression analysis of Poisson data. All such methods are based in some way on the model of bacterial mutagenesis of Haynes and Eckardt (10), where M(x) and S(x) are exponentiated polynomials and No is the (unknown) number of plated organisms. A typical example is where M(x) is approximated by its exponent, which is presumed to be small. [Note that No has been absorbed into the parameters of M(x); we subsequently write MN(x) in place of NoM(x).] Such models have been fit by nonlinear regression (8), Poisson maximum likelihood (11), quasi-likelihood (12), and adaptive procedures [e.g., the point rejection method (13) based on the negative binomial likelihood]. The parameters mo (spontaneous mutant frequency) and ml (mutagenic potency) in Equation 1 may be constrained to be nonnegative if desired, but this is not necessary in practice. Background mutation frequency is typically large enough that mo is not estimated close to , and negative estimates of m, can be taken as no evidence of low-dose mutagenicity.
Margolin and others (9,14) have generalized these models to incorporate multiple generation effects in the assay, fitting their models with maximum likelihood using the negative binomial distribution. In this paper we use their multigeneration model IV (9), with three generations for mutagenesis and histidine depletion and unlimited toxicity, to generate random data for evaluating the performance ofvarious methods. The form ofthis model is Several restrictive assumptions accompany the use of Haynes-Eckardt models. One is that mutagenicity and toxicity are stochastically independent. If such independence does not hold (e.g., mutants have different survival rates), then a more general model may be needed. Another is the lack of metabolic and mutagenic mechanisms in the model. Indeed, one major criticism ofthe Ames assay as used for human cancer risk assessment is that it lacks the complex chemical and cellular mechanisms involved in human carcinogenesis [briefly reviewed by Weinstein (15)].
Apart from the issue ofvalidity ofthe basic model itself, other difficulties arise in the case ofspecific methods. For example, the point rejection method circumvents toxic effects by rejecting high-dose points when there is a significant downward departure from linearity. Two potential sources of bias are: a) model misspecification when the dose response is nonlinear and b) onesided (downward only) point rejection, which can occur by chance even in the absence of toxicity. The multiple generation model is theoretically quite attractive, but the choice ofgeneration times can have a substantial effect on the magnitude of the estimated potency (9).
Although all methods have potential deficiencies in the response model, all the methods mentioned above accommodate nonconstant Poisson variance and overdispersion (with the exception ofmaximum Poisson likelihood). In practice, they seem to perform well in many examples, but more widespread use of these methods would facilitate a much-needed critical evaluation of their general applicability. In a later section, we use Monte Carlo studies to characterize, for some of these methods, circumstances in which they are most appropriate.

Nonparametric Regression
In this section we only briefly describe one approach to analyzing Ames data by nonparametric regression. Technical details will be made available elsewhere (Cologne, in preparation); a brief outline is presented in the Appendix.

Motivation
Before introducing our approach mathematically, it may be useful to motivate it from a nontechnical perspective. The hypothetical data shown in Figure 3 provide an exaggerated illustration of the concept of fitting nonparametric functions to data. Most investigators will be familiar with the use ofthe sample means plus or minus two standard errors as depicted in Figure  3a. A straight line connecting the means gives some indication of the dose response, but it is a crude and inefficient estimate. Furthermore, the initial slope estimate is based only on the first two dose levels, ignoring information about the shape ofthe dose response. Finally, the model interpolates the sample means, thereby ignoring their variability. Parametric models attempt to account for variability in the sample means by smoothing them according to some mathematical model (such as Equation 1 ofthe previous section). Figure 3b shows linear and linear-quadratic Haynes-Eckardt models fit to these data. Initial slope estimates based on these models incorporate information from the entire experimental range of doses, not just the first two doses. However, neither model fits the data well; presumably they do not reflect the true mechanistic model. In addition, different models with seemingly equivalent fits to the data can produce disparate potency estimates (as is well known to occur in the case of low-dose extrapolation in the animal bioassay).
As an alternative, we consider using a general class ofmathematical functions, rather than the specific functions implied by use of the Haynes-Eckardt model. Figure 3c shows the fit of a "cubic interpolating spline," a piecewise polynomial function that connects points in a smooth fashion. (In fact, the line in Figure 3a is a linear interpolating spline.) The amount of smoothness at the dose values depends on the degrees of the polynomials (compare Figure 3c to Figure 3a). We can add higher-order polynomial terms to make the function more smooth at the dose values. Generally, cubic splines are sufficiently smooth so that higher-order terms are not necessary, but by interpolating the means, the cubic interpolating spline again ignores sampling variability. The result is often a curve that is extremely wavy in the regions between dose values. The final step is to "penalize" the spline in such a way that it neither overfits (interpolation, leading to large variability in the estimated response) nor oversmooths (straight line regression, leading to bias in the estimated response). This is done by crossvalidation, a method that balances bias and variability by letting the data determine the optimal amount of smoothness. Figure 3d shows the fit of such a "cross-validated cubic smoothing spline" to these hypothetical data. The fitted curve more closely describes the underlying response than those in Figure 3b, because it is based on fewer assumptions concerning the mathematical form of the response.
The use of flexible, nonspecific, smooth functions to fit data is generally called "nonparametric regression." Note, however, that there is no clear separation between parametric and nonparametric regression methods. The difference is the degree to which the fitted curve is constrained by a mathematical model. Parametric regression is based on specific models thought to describe, or at least approximate, underlying biological mechanisms, whereas nonparametric models encompass a wider class ofmathematical functions that have no such biologically based interpretation. Furthermore, nonparametric regression produces parameters; in general, the number ofparameters with nonparametric regression will be larger than with a parametric model, reflecting the greater flexibility of the nonparametric function (i.e., our uncertainty about the true model).
In general, there is a trade-off involved in choosing between parametric and nonparametric regression methods. If the model is correct, we expect that the parametric model will provide more powerful inference concerning mutagenicity and potency. However, if the model is wrong, the results can be biased, possibly leading us to incorrect conclusions. There does not yet seem to be any single mathematical model for the Ames assay that a) incorporates all ofthe biological features ofthe assay, b) requires no unverifiable assumptions, and c) can be successfully .,'} 'I fit to the variety of results encountered in practice. The lack of such a model motivated us to investigate the possible applicability of nonparametric regression methods to Ames assay data.

Implementation
We have chosen to use the cubic smoothing spline (CSS) because it is visually smooth and is easily differentiated. The latter is important for computing potency, which requires calculation ofthe initial slope ofthe dose-response curve (see the next section for some discussion ofbiases in CSS estimation ofpotency). Detailed treatment ofthe CSS is available elsewhere (16,17).
The CSS is the solution k(x) to the following penalized weighted least-squares problem: minimize x over the set oftwice continuously differentiable functions g with square integrable second derivative. The smoothing parameter X 2 0 determines where the solution lies between a straight line (as Xoo) and a cubic interpolating spline (as X -0). The estimator is a set of piecewise cubic polynomials g(x) = gi + 1i(xxi) + -i(X Xi)2 + Si(xxi)3 (4) (xi < x < xi+,, i = 1, . . ., K-1), which are constrained to be continuous and to have continuous first and second derivatives. These constraints result in a reduction of Equation 4 to K parameters, which we may take to be the fitted values jii=g(xi). For fixed X, Equation 3 reduces to a set of linear equations = (gL . **,I 9)T =B(X)y where y = (y,..., yK) isthe vector of sample means and B(X) is a smoothing matrix depending on the ui and the xi but not on y (see the Appendix). An obvious estimator of mutagenic potency is (3k, the initial slope of the piecewise curve (Eq. 4) fit to the segment between xI = 0 and x2, the lowest nonzero dose. Note that is, depends on the other data points as well; in fact, This a linear combination of it: = a Tp = aTB(X)y with a being a known vector (Appendix), and so i31 is also a linear function ofy for fixed X. The variance of(31 is easily obtained using distribution theory for linear combinations of sample means. In practice, X and the at must be estimated. Because of replicates in the Ames assay, we have the usual unbiased sample variance estimates 2 1 which ordinarily are not available in CSS applications. The value of X is chosen optimally according to cross-validation by minimizing (Y. -2 CV(X)= -ni~~~n with respect to X, where bi is the ith diagonal element of B(X) and the ji depend implicitly on X. We denote cross-validated estimators by an asterisk (*).
Although the cross-validated fitted values ji* are not a linear function of -, it may be shown that, for a fixed set ofdose values, asymptotically, j* andy have the same distribution ifthe number of replicate observations increases at each dose (18). This result is especially useful in providing a means of making inference concerning the inital slope. Since i , ' is a linear function of j*, it is asymptotically a linear combination ofy and so has a limiting normal distribution: where fB is the initial slope of a cubic interpolating spline fit to 21 the vector of true means I = (ju, .. ., r F = diag 'vyf and O < vi = lim-< < is the limiting fraction ofpoints at xi. We may thereforebase inference concerning the initial slope upon the standard normal z score z= A; SE(f3) assuming 3,B is close to the true mutagenic potency (see the discussion ofbiases in the next section). Monte Carlo studies of the small sample behavior of (t (not shown) reveal that both its bias and variance can be strongly affected by estimation of the smoothing parameter X. The magnitude of these effects is evaluated briefly in comparison studies discussed below.

Biases in the Potency Estimator
The validity of (3r as a potency estimator depends on the unknown model A (x) because The validity of 3,r also depends on how well a cubic spline approximates the true dose-response function g(x). Let biasa = , -u '(0) be called the "approximation bias"; this is the bias due to approximating the initial slope of A(x), using the interpolating spline fit to the true means. This bias may be calculated for specific dose-response models by simulating the model and fitting the cubic interpolating spline. Figure 4 shows the relative bias bias, /1 '(0) as a function of true mutagenic potency ml in Haynes-Eckardt (Eq. 1) and multigeneration (Eq. 2) models, using various values of the toxicity parameter s (0.001, upper panels; 0.01, lower panels). In these simulations, six dose groups were used with relative spacing (0, 0.015625,0.125,0.25,0.5, 1); the highest dose was calculated to give either 90 % or 10 % survival. [The appropriate maximum dose for achieving 90% or 10% survival with the multiple generation model assuming unlimited toxicity may be calculated using probabilities derived by Margolin et al. (9).] These results show that the spline approximation bias is quite large with 10% survival, presumably because there are insufficient data for the spline to mimic the dose-response structure between the first two dose values in the low-dose portion ofthe curve. Increasing the quadratic term also increases the relative bias.
Finally, there will be bias due to estimation of the smoothing parameter X; call bias, = St -(3 the "smoothing bias." According to the above asymptotic distribution the smoothing bias will be small in large samples. This bias is not easily computed theoretically, but it may be ascertained from Monte Carlo studies by subtracting the toxicity and approximation biases from the total bias of the estimator, as illustrated in the next section. The total bias in Of1 as an estimate of mlin

Comparative Studies
We now describe Monte Carlo studies designed to evaluate the CSS approach and compare the CSS to two other regression methods: nonlinear regression and the point rejection method. We also compare the fits ofthese methods to the two example data sets introduced earlier. The multigeneration approach of Margolin et al. (9,14) was not compared because ofthe difficulty of estimating the generation times.

Monte Carlo Studies
In the Monte Carlo studies we used a variety of Haynes-Eckardt and multigeneration models, with varying degrees of curvature and either 90% or 10% minimum survival, to obtain random Ames assay type data. All studies were based on five dose levels with relative spacing (0, 0.125, 0.25, 0.5, 1). Random Poisson or negative binomial variables were generated according to algorithm "NG" ofAhrens and Dieter (19). Three replicates were generated at each dose. Each experiment consisted of2000 simulated data sets. Ifthere was no quadratic mutagenicity term (m2 = 0) in the model that generated the data, we fit a linear   = 0]; otherwise we fit the full model (Eq. 1). With all three methods, the test of mutagenicity was based on asymptotic normality of the initial slope estimator divided by its standard error. Power was estimated for a one-sided level 0.05 test by the proportion of times the z score exceeded the critical value 1.645.
The first set of results (Table 1) is for data generated by Equation 1, taking mo = 25 and using various values of mi, m2, and s at 90% minimum survival. Sampling standard error is the standard deviation of potency estimates from the 2000 simulated data sets in one experiment; computed standard error is the average of the 2000 individual standard errors computed from each simulated data set in an experiment. Simulated data were either Poisson or negative binomial with variance 141 + 0.005A]. Because the models evaluated were nearly linear over the range of doses at which 90% minimum survival was achieved, the nonlinear regression method failed to converge about 25 % of the time except in the case where m2 = 0. Most notable from these studies were the positive bias and inappropriate test size of the point rejection method under the null hypothesis (ml = 0) in the presence of upward curvature, contrasted with the lack of power of the smoothing spline and nonlinear regression methods under the alternative model (m, = 2). Negative binomial errors (to mimic overdispersion) led to a reduction in power and a slight increase in standard error with each method. Results for data generated by the multiple generation model with 90% minimum survival were similar and so are not shown.
The second set of results (Table 2) is for data generated by Haynes-Eckardt and multiple generation models, using various parameter values, and with 10% minimum survival. For multiple generation models, parameters ofthe mutagenicity function M(x) may be approximately related to parameters of the Haynes-Eckardt model MN (x) by dividing the latter by 7 X No (13). We took No = 108. All simulated models displayed substantial curvature over the wider range ofdoses produced by allowing survival to decrease to 10%, and the nonlinear regression method nearly always converged. The power of all three methods was greater than in the above cases of 90% minimum survival. The nonlinear regression method performed quite well on data generated by Haynes-Eckardt models, whereas the smoothing spline and point rejection initial slope estimates were highly sensitive to the magnitude of the quadratic term.
With multiple generation models and 10% minimum survival, the smoothing spline was the only method able to closely estimate a small positive potency (last column ofTable 2). The nonlinear regression method gave highly biased and variable initial slope estimates, whereas the point rejection method overestimated the initial slope when the true potency was small.
As an example ofthe total bias calculations for the smoothing spline (see the previous section), consider the estimate 1.65 from

Example Data Sets
Earlier, we introduced two example data sets from Ames assays ofcoal tar base fraction (Fig. 2a) and quinoline (Fig. 2b).   Table 3 displays estimated potencies obtained from the three methods compared above, along with standard errors and z scores.
As with the Monte Carlo studies, there is large bias in the point rejection slope due to an upwardly curving response in the case ofthe coal base data. However, the point rejection method is the most powerful in the case ofthe quinoline data, in which the lowdose linearity assumption appears reasonable, and all three methods gave similar estimates. The smoothing spline doseresponse estimate for the quinoline data appears strange in the high-dose region because there is a large gap with no information between the highest and next highest doses (a seemingly uninteresting region of the dose response). This points out the need to consider issues of experimental design; the same total sample size applied to a narrower dose range would provide more information concerning mutagenic potency.

Conclusions
Each of the methods we compared performed best in situations where its assumptions most closely held true. The nonlinear regression method [based on the Haynes-Eckardt model (Eq. 1)] performed well when the true model was Equation 1 and the dose range was wide enough that all of the parameters were estimable from the data. The point rejection method was quite powerful when the underlying dose response was nearly linear, but otherwise it overestimated mutagenic effects. The smoothing spline performed best in situations when the nonlinear regression was inappropriate because ofthe wrong model (e.g., with multigeneration models) and when the point rejection method was wrong due to curvature in the initial dose response (e.g., with nonlinear models and wide dose ranges). No single method performed well under all situations.
The smoothing spline approach was designed to provide an alternative to model-based regression methods when the underlying model is nonlinear and does not follow Equation 1. We observed that, when the dose range is selected so that minimum survival is at least 90%, even multigeneration models are nearly linear over the entire dose range, so that the point rejection method produces quite reasonable results. In this sense, the smoothing spline is rather disappointing because it appears that we may achieve model robustness using the point rejection method so long as adequate dosing experiments are performed first. However, a greater understanding ofthe Ames assay on the part of statisticians is needed before it can be said whether this will be true generally in practice. For example, day-to-day variation in Ames assay results decreases the effectiveness ofdosing experiments (B. H. Margolin, personal communication); this may lead to difficulties in attempts to routinely achieve approximate linearity. Based on the lack ofa single acceptable method, it appears that more work is needed to further develop existing methods or to find new methods that will perform well with Ames assay data without being so closely tied to model assumptions. In addition, further comparisons among methods are needed to understand in which circumstances the various methods are appropriate. Such studies, as well as comparisons on a number ofactual Ames data sets, should be performed in conjunction with laboratory investigators so that situations that are most relevant in practice may be identified and studied. Figure 5 summarizes the need for further research into statistical methods suitable for use with the Ames assay. It emphasizes the important role that collaboration and application should play in this work. Statisticians can also contribute by advising researchers on how they can design better experiments so as to obtain the sort ofdata needed for robust analyses. Finally, it would be useful to have methods for comparing goodness of fit among the variety ofmodels discussed in this paper. It is hoped that more statisticians will become interested in these problems and that greater collaboration between statisticians and laboratory investigators will occur in the future. J S = (,TS-i + nXQ)-lJTS-ly = (Sy+ nx4 S;'y = (X)y often produces satisfactorily smooth dose-response estimates.
To compute the initial slope estimator, we use the fact that Q may be written as 0 = QR-1QT, where Q and R are tridiagonal matrixes (16 This work was supported in part by U.S. Public Health Service (U.S. PHS) training grant no. T-32-CA-09168 while the first author was a graduate student in the Department of Biostatistics at the University of Washington, and in part by U.S. PHS grant CA-40644 to N.E.B. We gratefully acknowledge Elaine Faustman for invaluable advice, Leslie Bernstein, who kindly provided the point rejection program, James Trosko for a critical reading ofan earlier draft, and Ritsko Sato, who helped with the preparation offigures and with numerous administrative tasks. Finally, we thank David Hoel and Akira Sakuma for the invitation to present this work.