Statistical studies in genetic toxicology: a perspective from the U.S. National Toxicology Program.

This paper surveys recent, as yet unpublished, statistical studies arising from research in genetic toxicology within the U.S. National Toxicology Program (NTP). These studies all involve analyses of data from Ames Salmonella/microsome mutagenicity tests, but the statistical methodologies are broadly applicable. Three issues are addressed. First, what is a tenable sampling model for Ames test data, and how does one best test the adequacy of the Poisson sampling assumption? Second, given that nonmonotone dose-response curves are fairly common in the Salmonella assay, what new statistical techniques or modifications of existing ones seem appropriate to accommodate this reality? Finally, an intriguing question: how can the extensive NTP Ames test database be used to assess the characteristics of any mutagen-nonmutagen decision rule? The last issue is illustrated with the commonly used "two-times background" rule.


Introduction
During the last decade the science of genetic toxicology has experienced dramatic growth in its volume of experimentation, its variety of assays, and the level of public awareness of it. Even laymen are likely to have heard of some of the tests in this field or seen newspaper accounts of results from one. This growth, in all its dimensions, is attributable to the ability of these test systems to detect, rapidly and relatively inexpensively, environmental agents that are genotoxic; these agents are thought to be implicated in such diverse human health problems as cancer, aging, and birth defects (1). It is reputed that over 2000 laboratories worldwide, in industry, academia, and government, currently perform the Ames Salmonella/microsome test (2), the best known and most widely employed of the short-term tests for genetic toxicity. In many parts of the industrialized world, regulatory decisions regarding the registration of pesticides or pharmaceuticals are based in part on results from tests for genetic toxicity. In some countries, such as Japan and the United States, these tests are used in national programs to screen agents already in the environment. It is worth remembering, however, that this area of scientific research is far from mature; much remains to be achieved in terms of understanding the precise implications of results from such tests for the assessment of risks to human health. To date, man-made industrial agents have been the primary focus of research interest in this area; there is, however, an increasing emphasis on naturally occurring potential sources of genetic toxicity, such as common dietary components. The term "genetic toxicity" is applied to the induction of genetic damage by any agent, whether the damage be DNA point mutations at a particular locus, induction of DNA repair, binding to DNA, or chromosomal aberrations, such as fragments or aneuploidy.
The chronic rodent carcinogenicity bioassay, technically speaking, is not a test for genetic toxicity because tumor development has not yet been demonstrated to result directly from genetic damage. The somatic-mutation theory of cancer (3), however, is seemingly reinforced weekly by new experimental findings.
Unlike the chronic rodent carcinogenicity bioassay, for which there is a rich statistical literature, the tests for genetic toxicity have only recently begun to attract the attention of research statisticians; witness the dearth of published papers containing new statistical methodology motivated wholly or in part by problems in genetic toxicology. Two exceptions are the works by Collings, Margolin and Oehlert (4) on the analysis of binomial data and by Tarone (5) on the use of historical control data. Although five years ago Hollstein et al. (6), in an excellent review of short-term tests for genetic toxicity, could cite over 100 assays that had at least a modicum of representation in the published literature, the U.S. National Toxicology Program (NTP) has fewer than 20 assays in use, undergoing validation or in development. Figures 1 and 2 present these assays, separated as to whether the target used to probe an agent's ability to induce genetic damage is a somatic or a germ cell. Fewer than half a dozen of these assays have been studied carefully by statisticians, and fewer still have methods of statistical analysis that are generally accepted.
The NTP statistical effort in the area of genetic toxicology has emphasized the development of objective analyses of individual test results, methods for meaningful assay validation regarding operating characteristics, and large data bases, which can be exploited for a variety of purposes, such as devising screening strategies, measuring interlaboratory and interassay concordance, and attempting to ascertain the degree of predictivity of short-term tests for the chronic rodent carcinogenicity bioassay. This paper surveys the methodological components of a series of statistical projects, largely unpublished as of this date, that were conducted under the direction of the author in response to perceived needs in genetic toxicology within the NTP. The principal issues addressed are: tenable sampling models and goodness of fit; nonmonotone dose-response relationships and tests of significance; and external validation of tests of hypothesis when there is sufficient replication. All three issues will be illustrated with the Ames test, but their importance transcends this one assay.

Tenable Sampling Models and Goodness of Fit
The development of a parametric statistical analysis for data arising from short-term tests is best achieved after scrutiny of a variety of data, preferably generated by different technicians and at least two laboratories. One main component in the developmental process is the creation of a tenable sampling model. Significant departures of reality from assumption with regard to the sampling distribution can impact substantially on false positive and false negative rates, and on the efficiency of estimators (7,8).
For short-term tests, the response of interest is frequently a count, which may be bounded by definition or unbounded. Early authors discussing analyses of Ames test data assumed Poisson sampling without producing any empirical supporting evidence. From a theoretical standpoint, the usual Poisson assumptions seem credible for a given plate, but to extrapolate from one plate and claim that a set of plate counts behaves like a random sample from a Poisson distribution requires an additional assumption of homogeneity of environments across plates. In some laboratories that condition may obtain, but the key point is that this issue is open to empirical study. The concept of uniformity trials (9) from agricultural research deserves renewed consideration by experimenters and statisticians; it suggests the desirability of running assays early in their development as one would to test a compound for genetic toxicity, but with no test compound added. Ideally, data from such negative or solvent control trials can then support or refute a particular sampling model, and can be used to assess the possibility of hidden components of variability.
Margolin et al. (8) reported results from 20 replicated control plates for Ames tests conducted by each of three laboratories. If $Y_1, \ldots, Y_n$ represent the control plate counts observed by a laboratory on a given day, and if $\bar{Y}$ denotes their mean, then a standard test of the Poisson sampling assumption is based on the statistic
$$T = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 / \bar{Y}. \tag{1}$$
When the data are a random sample from a Poisson distribution, the statistic $T$ is well approximated by a chi-square random variable with $n - 1$ degrees of freedom. Using this fact, Margolin et al. (8) demonstrated that the Poisson model is inadequate to describe Ames test data; they reported sample variance-to-mean ratios of 4 or larger. In place of the Poisson, those authors adopted a negative binomial sampling model, which they motivated as a stochastic mixture of Poisson distributions created by pipetting errors. Additional evidence, both empirical and theoretical, in support of the negative binomial (NB) model for Ames test data is given by Collings and Margolin (10), who employed the following form of the negative binomial distribution:
$$\Pr(Y = y) = \frac{\Gamma(y + c^{-1})}{y!\,\Gamma(c^{-1})} \left(\frac{cm}{1 + cm}\right)^{y} \left(\frac{1}{1 + cm}\right)^{c^{-1}} \tag{2}$$
for $y = 0, 1, 2, \ldots$, $0 < m < \infty$, and $0 < c < \infty$. As a shorthand, $Y$ will be said to be distributed NB$(m, c)$ if Eq. (2) obtains. Here $m$ is the mean of $Y$, and the limit of Eq. (2) as $c \to 0$ is the Poisson distribution with mean $m$. Thus, Eq. (2) extends naturally to incorporate the Poisson distribution at $c = 0$. With this formulation, one can speak of the distribution of the maximum likelihood estimate (MLE) for $c$, which now has finite moments of all positive orders (11). Contrast this with the more common parametrization in terms of $k = c^{-1}$, where the MLE of $k$ does not possess a proper distribution (12).
Although control trials are highly useful, they are rarely available. In general, even a good-sized random sample of control plates is hard to come by. For example, the data of Margolin et al. (8,10) are unique in the literature on the Ames test.
For the general short-term test in which unbounded count data are observed, the test results that would be available for assessing the goodness of fit of the Poisson assumption are from experiments with varying doses of true test compounds. These data are not identically distributed, but rather have a one-way layout structure indexed by dose. An extension to the one-way layout of the goodness-of-fit test for Poisson sampling based on Eq. (1) is studied by Collings and Margolin (10), who obtain the following result. The null sampling distribution of $T_c$ and the power of the test based upon it were also studied (10).
The theorem above generalizes a result obtained by Potthoff and Whittinghill (13) for the case of the goodness-of-fit test for a Poisson random sample based upon Eq. (1). A competitor to the test statistic in Eq. (3) is obtained by aggregating the values of Eq. (1) computed for each group separately, i.e.,
$$S_c = \sum_{i=1}^{r} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i+})^2 / \bar{Y}_{i+}, \tag{4}$$
and referring the result to a chi-square distribution with $(\sum n_i) - r$ degrees of freedom. Collings and Margolin (10) prove that if $n_i / \sum_j n_j \to p_i$, a constant for each $i = 1, \ldots, r$, such that $0 < p_i < 1$ and $\sum p_i = 1$, then the Pitman asymptotic relative efficiency of $S_c$ to $T_c$ is given by
$$e_c = \Big(\sum m_i p_i\Big)^2 \Big/ \sum m_i^2 p_i.$$
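The dispersion statistics discussed above are simple to compute. The following is a minimal Python sketch, not code from the cited papers: it evaluates the single-sample statistic of Eq. (1), the group-wise aggregate of Eq. (4), and the efficiency ratio $e_c$. The function names are hypothetical, and every group is assumed to have a positive mean.

```python
from typing import Sequence, Tuple


def poisson_dispersion(counts: Sequence[float]) -> Tuple[float, int]:
    """Eq. (1): T = sum (Y_i - Ybar)^2 / Ybar, referred to chi-square on n - 1 df."""
    n = len(counts)
    mean = sum(counts) / n  # assumed > 0; an all-zero sample is not handled
    t_stat = sum((y - mean) ** 2 for y in counts) / mean
    return t_stat, n - 1


def pooled_dispersion(groups: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Eq. (4): sum of the Eq. (1) statistic over dose groups; df = (sum n_i) - r."""
    s_c, df = 0.0, 0
    for group in groups:
        t_stat, group_df = poisson_dispersion(group)
        s_c += t_stat
        df += group_df
    return s_c, df


def pitman_are(means: Sequence[float], proportions: Sequence[float]) -> float:
    """e_c = (sum m_i p_i)^2 / sum m_i^2 p_i; never exceeds 1."""
    num = sum(m * p for m, p in zip(means, proportions)) ** 2
    den = sum(m * m * p for m, p in zip(means, proportions))
    return num / den


# Illustrative (made-up) plate counts for two dose groups.
print(pooled_dispersion([[10, 14, 12], [20, 30, 25]]))
print(pitman_are([10, 30], [0.5, 0.5]))
```

By the Cauchy-Schwarz inequality $e_c \le 1$, with equality only when all group means coincide, so the aggregate statistic loses efficiency as the dose-group means spread apart.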

Nonmonotone Dose-Response and Tests of Significance
Were the possibility of hyper-Poisson sampling variability for Ames test data their only distinguishing feature, one could readily modify inference procedures intended for Poisson data so that these procedures were appropriate for negative binomial data, thereby accommodating the overdispersion. To illustrate, a commonly used procedure to test a quantitative factor $d$, such as dose, for its effect on Poisson means is to compute the Cochran-Armitage test (14,15) for trend in the means. If for each $i$, $X_i$ is distributed as a Poisson random variable with mean $\lambda_i$ and this observation is associated with a level $d_i$ of a quantitative factor, then the trend test of $H_0$: $\lambda_i = \lambda$ for all $i$, versus $H_1$: $\lambda_i$ ordered by $d_i$, is based on the statistic
$$Z = \sum_{i=1}^{n} X_i (d_i - \bar{d}) \Big/ \Big[ s_x^2 \sum_{j=1}^{n} (d_j - \bar{d})^2 \Big]^{1/2}, \tag{6}$$
where $s_x^2 = \bar{X} = \sum X_i / n$ and $\bar{d} = \sum d_i / n$. $Z$ in Eq. (6) can easily be seen to be the regression coefficient for $X$ regressed on $d$, normalized by its estimated standard deviation. Under $H_0$, $Z$ is distributed approximately as a standard normal random variable. Tarone (5) has shown that the test based upon Eq. (6) is asymptotically locally optimal against any smooth monotone function expressing $\lambda$ in terms of $d$.
The modification of the Cochran-Armitage trend test needed to permit its use for negative binomial data is to define $s_x^2 = \bar{X}(1 + \hat{c}\bar{X})$, where $\hat{c}$ is the MLE of $c$ in Eq. (2) when the data are considered as a random sample ($H_0$). Again, the reference distribution for $Z$ is the standard normal. The Appendix contains a demonstration paralleling that of Tarone (5), which establishes that the test for trend among negative binomial means is asymptotically locally optimal against any smooth monotone function that expresses $m$ in terms of $d$. As Collings and Margolin (10) note, the negative binomial distribution in Eq. (2) can be extended to include the binomial as well as the Poisson distribution. The Appendix then contains a proof that holds for all three models. Table 1 presents results from a small Monte Carlo study of the size of the one-tailed test for trend in negative binomial means. To mimic typical experimentation, the Monte Carlo included six dose groups, with either three or five replicate observations per dose. The dosing was either linear (specified by $d = 0, 1, \ldots, 5$) or logarithmic (specified by $d = 0, 1, 10, \ldots, 10^4$). Note that these specifications entail no loss in generality because Eq. (6) is invariant to scale transformations of dose. The values for $m$ were set at 15 and 150, whereas $c$ was either 0 (Poisson) or $3/m$ (highly overdispersed). Each of the 1000 data sets randomly generated for a given set of conditions was analyzed two ways, once with the true $c$ used in $s_x$ in Eq. (6) and once with the MLE of $c$, as would be the case with real data. The results indicate that the size of the trend test is well approximated by the standard normal tail area whether $c$ is known or estimated from the data.
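The trend statistic and its negative binomial modification can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind Table 1: the dispersion `c` is supplied by the caller (computing its MLE is not attempted here), `c = 0` gives the Poisson version, and the function names are hypothetical.

```python
import math
from typing import Sequence


def trend_z(x: Sequence[float], d: Sequence[float], c: float = 0.0) -> float:
    """Trend statistic Z of Eq. (6), with s_x^2 = xbar * (1 + c * xbar).

    x and d are parallel sequences (one entry per plate), so replicate
    plates at the same dose simply repeat the dose value in d.
    """
    n = len(x)
    x_bar = sum(x) / n
    d_bar = sum(d) / n
    numerator = sum(xi * (di - d_bar) for xi, di in zip(x, d))
    dose_ss = sum((di - d_bar) ** 2 for di in d)
    s2 = x_bar * (1.0 + c * x_bar)  # null variance: Poisson if c == 0
    return numerator / math.sqrt(s2 * dose_ss)


def upper_tail_p(z: float) -> float:
    """One-tailed p-value from the standard normal reference distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Illustrative (made-up) counts at linear doses 0..5.
counts = [10, 12, 14, 16, 18, 20]
doses = [0, 1, 2, 3, 4, 5]
z_poisson = trend_z(counts, doses)         # Poisson variance
z_nb = trend_z(counts, doses, c=3.0 / 15)  # overdispersed, c = 3/m with m = 15
print(z_poisson, upper_tail_p(z_poisson))
print(z_nb, upper_tail_p(z_nb))
```

With $c = 3/m$ the null variance is four times the Poisson variance, so for the same data the statistic is exactly halved. Rescaling the doses, for example multiplying every entry of `doses` by 10, leaves $Z$ unchanged, which is the invariance noted in the text.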
A more interesting characteristic of Ames test data that separates them from most other dose response data treated in the statistics literature is that the dose-response for Ames test data is frequently not monotone (8). There are other in vitro assays for genetic toxicity that exhibit similar behavior, e.g., the fluctuation test (4) and the mouse lymphoma assay (personal communication from W. Caspary, NTP). The common decrease in mean response at high doses, sometimes to levels below that for the control, is usually attributed to toxicity that prevents an experimental unit from exhibiting phenotypic evidence of mutagenicity. Decreases in the mean response at high doses, especially to or below control levels, impact heavily on the power of trend tests (4), which place their greatest weight on the responses to the control and maximum dose.
Three published significance tests for various short-term tests attempt to cope with a nonmonotone dose response. First, Collings et al. (4) proposed the use of an isotonic test for fluctuation test data; this test, while not tailored to the situation under discussion, exhibits a greater degree of power robustness against downturns than does the binomial trend test. Second, Bernstein et al. (16) proposed a recursive analysis for Ames test data in which the response at the highest dose is subjected to a pretest for downward departure from linearity. If the pretest supports such a downturn, then the highest dose is excluded from the analysis and the next highest dose is similarly scrutinized. When this "point-rejection" procedure terminates, the remaining doses are subjected to a trend test modified for unequal variances. Finally, Margolin et al. (8) developed mechanistic biomathematical models that reflect a somewhat simplified view of the underlying biology of an Ames test. They proposed a test of significance based on the MLE of a parameter in their model that represents a mutagenic index. The last two analyses are clearly in need of further study to understand better their operating characteristics. Work on the latter is nearing completion and will be reported elsewhere.
The use of nonparametric procedures, especially Jonckheere's test (17), has been advocated for analyzing data from short-term tests for genetic toxicity (18,19). Simpson and Margolin (unpublished manuscript) have shown that nonparametric tests that are tailored to detect ordered alternatives, such as Jonckheere's, can have their power functions substantially depressed by a downturn in the underlying dose-response function. Consequently, they devised a recursive strategy that excludes data obtained at the highest dose if there is evidence of a substantial downturn in response at that dose, i.e., a departure from monotonicity. This check for a downturn is performed recursively with a Wilcoxon test, and when it terminates, the remaining doses are subjected to Jonckheere's test. The key consideration in doing this analysis recursively is to retain control of the size of the test. Simpson and Margolin present both empirical and analytic evidence for proper size behavior of their test. They also show that their procedure is consistent for the cases of interest and offers substantial improvement in power over Jonckheere's test when there is a sizeable downturn in dose response at high doses. This gain is achieved at a cost of a modest loss of power when the underlying response is, in fact, monotone in dose.
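The effect of a downturn on Jonckheere's test can be seen in a small numerical illustration. The Python sketch below computes the standard normal-approximation z-score for Jonckheere's statistic (ties counted as one half, no tie correction in the variance) and compares it with and without the highest dose. It is not the Simpson and Margolin procedure: here the top dose is dropped by hand, whereas their method decides this with a recursive Wilcoxon pretest designed to control the size of the test. The data and function name are hypothetical.

```python
import math
from typing import Sequence


def jonckheere_z(groups: Sequence[Sequence[float]]) -> float:
    """z-score for Jonckheere's ordered-alternative statistic.

    groups must be listed in increasing dose order. J counts, over all
    pairs of groups (lower dose, higher dose), the observation pairs in
    which the higher-dose value exceeds the lower-dose value.
    """
    j_stat = 0.0
    for a in range(len(groups)):
        for b in range(a + 1, len(groups)):
            for lo in groups[a]:
                for hi in groups[b]:
                    if hi > lo:
                        j_stat += 1.0
                    elif hi == lo:
                        j_stat += 0.5
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    mean = (n * n - sum(s * s for s in sizes)) / 4.0
    var = (n * n * (2 * n + 3) - sum(s * s * (2 * s + 3) for s in sizes)) / 72.0
    return (j_stat - mean) / math.sqrt(var)


# Made-up plate counts: control plus three doses, three plates each.
monotone = [[10, 12, 11], [15, 14, 16], [20, 19, 21], [25, 26, 24]]
downturn = monotone[:3] + [[9, 8, 7]]  # toxicity at the highest dose

print(jonckheere_z(monotone))       # about 3.84
print(jonckheere_z(downturn))       # 0.0: the downturn cancels the trend
print(jonckheere_z(downturn[:3]))   # 3.0 once the highest dose is excluded
```

In this constructed example a single toxic dose erases all evidence of trend, while excluding it restores a clearly significant z-score. The price, as noted above, is some loss of power when the response is truly monotone, since data at the highest dose are then sometimes discarded unnecessarily.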

External Validation of Tests of Hypothesis
One further important way in which the Ames Salmonella assay is unusual is in its sheer volume of usage; because the assay is fast and relatively inexpensive, it lends itself nicely to screening efforts. Since its creation in 1978, the NTP has had as one of its broad goals the extensive screening of environmental agents for evidence of genetic toxicity. To date, the data collected have come overwhelmingly from Ames tests on four strains of Salmonella typhimurium (TA98, TA100, TA1535, TA1537) tested separately at each of three levels of metabolic activation: rat liver, hamster liver, or none. The two mammalian liver (S9) preparations represent an attempt to recreate in vitro the metabolic processes that occur in humans. It is well known that apparently innocuous chemicals can be converted in vivo into noxious metabolites, so the use of an S9 activation attempts to provide for this possibility.
Chemicals are nominated in many different ways for NTP testing. If the scientific interest or evidence for concern is sufficient to justify the experimentation, the selected chemical proceeds through a battery of 12 strain-activation tests. The NTP Salmonella/microsome database currently consists of over 24,000 experiments, where an experiment refers to a test with a particular chemical, strain and activation in a given laboratory on a given day. Table 2, from Margolin, Kim, and Risko (unpublished manuscript, hereafter referred to as MKR), indicates the frequency of replication by strain and activation among 941 chemicals tested; zero frequencies have been suppressed. Experimental loss due to contamination or extreme toxicity, together with ad hoc decisions by experimenters not to take a second replicate, produced the singlets. Equally ad hoc decisions to obtain additional replicates beyond the two required by the protocol account for the replicates numbering greater than two. MKR report that the decision to proceed with additional replicates beyond two was apparently triggered on occasion by results observed for TA100 with either rat or hamster S9 activation. These two combinations were viewed by the experimenters as the two combinations with highest sensitivity to mutagens, and so clear resolution of these cases was frequently sought. The potential bias in the results for these two combinations suggests focusing attention on results for the other ten.
Table 2. Frequency of replication by strain and activation among the 941 chemicals.
                         Number of replicates
Strain    Activation     1     2     3     4     5     6
TA100     None          72   774    70    22     1     1
TA100     Hamster       30   810    69    27     2     2
TA100     Rat           33   815    68    21     3     1
TA98      None         114   775    42     9     1     0
TA98      Hamster       80   784    62    10     3     2
TA98      Rat           76   787    61    13     2     2
TA1537    None         140   742    52     5     1     0
TA1537    Hamster      121   765
MKR note that if a given chemical is tested in $n$ replicates of a given strain and activation, then the operating characteristics of any decision rule that assigns a "mutagenic" or "nonmutagenic" label to the individual experiments can be assessed by use of a finite mixture of binomials model. Specifically, in the notation of MKR, if $Y_i$ of the $n_i$ replicates for chemical $i$ are judged positive and labeled mutagenic by a decision rule, then the probability distribution function of $Y_i$ can be written as
$$f(Y_i; p, \tau_i) = z_i\, b(Y_i; n_i, p) + (1 - z_i)\, b(Y_i; n_i, \tau_i), \tag{7}$$
where $b(x; n, \phi)$ is the binomial probability distribution function for $x$ successes out of $n$ trials with success probability $\phi$; $z_i$ is an indicator variable with value 1 for nonmutagenicity and 0 for mutagenicity of chemical $i$ in the particular strain/activation; $p$ is the true probability that an experiment with a nonmutagen in the particular strain/activation will yield a result judged positive by the decision rule; $\tau_i$ is the probability that an experiment with the particular strain/activation for chemical $i$, given that chemical $i$ is a mutagen in this combination, will yield a result judged positive by the decision rule; and, by assumption, $\tau_i > p$ for all $i$ that correspond to mutagens.
MKR reason that $p$ is presumably constant for all nonmutagens tested with a given strain and activation, but that $\tau$ clearly depends upon a mutagen's potency and toxicity for a given strain and activation. Nevertheless, they argue that the paucity of information regarding the behavior of a given chemical for a specific strain and activation suggests, as a first approximation, assuming $\tau$ to be constant across all mutagens for a given strain and activation. They then construct a version of the EM algorithm (20) for the MLEs of $(\pi, p, \tau)$. Using results of Louis (21), MKR also obtain the observed information matrix for the parameters, and so produce estimates of the precision of the MLEs as well. MKR apply their analytic technique to two decision rules. The first is a modified statistical analysis based on the mechanistic models of Margolin et al. (8), while the second is really not a rule, but rather a set of decisions arrived at by a senior NTP toxicologist upon his review of the experimental data.
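A generic EM iteration for a two-component binomial mixture conveys the idea. The Python sketch below is not MKR's algorithm, whose details are unpublished; it is a textbook EM under the assumptions that the unobserved labels $z_i$ are independent, that $\pi$ denotes the proportion of mutagens among the chemicals tested, and that $\tau$ is constant across mutagens. Starting values, the iteration count, and all names are hypothetical, and the precision estimates based on Louis (21) are not implemented.

```python
import math
from typing import List, Sequence, Tuple

Data = Sequence[Tuple[int, int]]  # (positive calls y_i, replicates n_i) per chemical


def binom_pmf(y: int, n: int, prob: float) -> float:
    """Binomial probability of y successes in n trials."""
    return math.comb(n, y) * prob ** y * (1.0 - prob) ** (n - y)


def log_likelihood(data: Data, pi: float, p: float, tau: float) -> float:
    """Log-likelihood of the mixture pi * b(y; n, tau) + (1 - pi) * b(y; n, p)."""
    return sum(
        math.log(pi * binom_pmf(y, n, tau) + (1.0 - pi) * binom_pmf(y, n, p))
        for y, n in data
    )


def em_fit(data: Data, pi: float = 0.3, p: float = 0.05, tau: float = 0.8,
           iters: int = 500) -> Tuple[float, float, float]:
    """Return EM estimates (pi, p, tau); starting values are arbitrary guesses."""
    for _ in range(iters):
        # E-step: posterior probability that each chemical is a mutagen.
        weights: List[float] = []
        for y, n in data:
            mut = pi * binom_pmf(y, n, tau)
            non = (1.0 - pi) * binom_pmf(y, n, p)
            weights.append(mut / (mut + non))
        # M-step: weighted proportions of positive calls in each component.
        pi = sum(weights) / len(weights)
        tau = (sum(w * y for w, (y, _) in zip(weights, data))
               / sum(w * n for w, (_, n) in zip(weights, data)))
        p = (sum((1.0 - w) * y for w, (y, _) in zip(weights, data))
             / sum((1.0 - w) * n for w, (_, n) in zip(weights, data)))
    return pi, p, tau


# Made-up data: 100 chemicals, three replicates each.
example = [(0, 3)] * 60 + [(1, 3)] * 6 + [(2, 3)] * 6 + [(3, 3)] * 28
pi_hat, p_hat, tau_hat = em_fit(example)
print(pi_hat, p_hat, tau_hat)
```

With only one or two replicates per chemical the three parameters are not separately identifiable from the counts alone; the chemicals with three or more replicates are what make estimation feasible. No safeguards against degenerate estimates (probabilities of exactly 0 or 1) are included in this sketch.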
In the present paper, the same technique is applied to a decision rule that has been widely employed in toxicology but is poorly understood. Labeled the "two-times background" rule, this rule declares a chemical mutagenic if the average response for at least one dose of test chemical is greater than twice the observed concurrent control mean. This rule, which has a long history of application, is indifferent to the number of doses tested, the number of replicates observed per dose, any empirical measure of variability, and any consideration of level of significance. The results of applying the MKR technique to the decisions of the "two-times background" rule with regard to the NTP database are in Table 3.
Table 3. Estimates ± one standard deviation of the proportion of mutagens among chemicals tested, the false positive probability, and the true positive probability for the "two-times background" rule by strain and activation.
As one might well predict intuitively, this rule is moderately conservative, yielding false positive rates of approximately 0.01 for TA100 and 0.02 for TA98 and TA1535, irrespective of activation level. For TA1537, however, with its very low background rates, this rule has a false positive rate of approximately 0.07. These estimates apply to the NTP protocol as executed by the NTP contractual laboratories, and to no other context. If one requires a repeated positive result for confirmation, then the probability of a falsely confirmed positive is $p^2$. For TA100, TA98, and TA1535, this probability is estimated to be $1 \times 10^{-4}$ to $4 \times 10^{-4}$. For the NTP screening program, in which scientific judgment in chemical nomination and selection produces a population of test chemicals highly enriched with mutagens, decision rules with probabilities of confirmed false positives on the order of $10^{-4}$ are too conservative and counterproductive.
The attendant loss in sensitivity to detect weak mutagens is a heavy price to pay in order to obtain a simple rule of thumb. In many instances, mutagens may not be able to achieve a doubling of background levels because of toxicity, solubility or other limitations, yet they may well exhibit highly reproducible patterns of mutation induction. An excellent example of this phenomenon is phenobarbital (22).
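The rule itself and the arithmetic for confirmed false positives are easy to state in code. The Python sketch below is an illustration with made-up plate counts; the function names and the data layout (a mapping from dose to replicate counts) are hypothetical.

```python
from typing import Mapping, Sequence


def two_times_background(control: Sequence[float],
                         treated: Mapping[float, Sequence[float]]) -> bool:
    """Call an experiment positive if any dose mean exceeds twice the
    concurrent control mean. Variability and replication are ignored,
    exactly as the rule prescribes."""
    control_mean = sum(control) / len(control)
    return any(sum(plates) / len(plates) > 2.0 * control_mean
               for plates in treated.values())


def confirmed_false_positive(p: float) -> float:
    """Probability that two independent experiments on a nonmutagen are both
    judged positive, given single-experiment false positive probability p."""
    return p * p


# Control mean is 20, so a dose mean must exceed 40 to trigger the rule.
control = [20, 22, 18]
print(two_times_background(control, {10: [25, 30, 28], 100: [45, 41, 40]}))  # True
print(two_times_background(control, {10: [25, 30, 28], 100: [39, 40, 38]}))  # False

for p in (0.01, 0.02):
    print(p, confirmed_false_positive(p))
```

The second call shows the weakness discussed above: a response that rises steadily to nearly double the control, reproducibly across plates, is still labeled nonmutagenic because no dose mean crosses the fixed threshold.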

Concluding Remark
The statistical studies briefly surveyed here all had their origins in problems that arose from genetic toxicology. From this survey, one conclusion is clear: genetic toxicology represents a rapidly growing area of science that is rich with research opportunities for statisticians.