Statistical considerations for dominant lethal mutagenic trials.


General Considerations
Statistics can be used in a biological assay as either a method of checking on the validity of conclusions or as a guide to the interpretation of experimental results. The division is sometimes made as one between hypothesis testing and estimation. However, the tool of hypothesis testing can be used as an interpretive aid in conjunction with the tools of estimation, so I prefer to make the divisions in terms of the uses made by the biologist.
If we are to apply statistical procedures to the dominant-lethal trial as a check of validity of conclusions, we would have to consider the inherent theoretical faults in the design. For instance, the animals actually treated constitute a small set of males and any conclusions must be conditional upon that set of males; thus, the probability levels computed are, themselves, random variables, functions of the random choice of males. Or, a typical trial will involve more than one test compound or dose of a test compound against the controls. Thus, any formal probability level computed must take into account questions of multiple comparisons. With comments like these, the statistician can sit in the marble halls of his floating island and throw bolts of doubt at the whole procedure. But, almost any biological assay can be made to suffer from this kind of criticism. At this stage in the development of mutagenicity testing, statistics can be much more effectively used to aid in the organization and interpretation of empirical results.
*Pfizer Central Research, Medical Research Laboratory, Groton, Connecticut 06340.
Three general problems face the experimenter in trying to organize and interpret the results of a dominant-lethal trial. He must be able to understand and display a vast amount of data (typically, 10 males per test group, 30 females per week, or 240 paired observations of implants and dead implants per group). He must have a running test procedure that will decide if a substance is to be suspected, a procedure with known probabilities of false positive and false negative results. He must be able to estimate the degree of an effect if he suspects one. A running test procedure will have to make use of the formal terminology of hypothesis testing. However, this is not a classical situation. For instance, in view of the inherent vagaries of the animals, it may be impossible to reduce the probability of false positives to a fixed constant (such as 5%) and the biologist may be forced to accept what he can, relying on statistical techniques to estimate the true running levels of error.

Review of the Literature
A detailed review of the literature from a statistical point of view is given elsewhere (1). Prior to 1970, statistics appear to have been used strictly as a check on validity. Chi-square contingency tables tend to be presented to compare counts of implants in order to "prove" the obvious by computing significance levels of 1% or less. Attempts to organize the results to derive useful estimates of differential effect appear to be entirely without benefit of mathematical clergy.
In 1970, Krüger (2) attempted to interpret the results of dominant lethal trials in terms of a simple genetic mathematical model. He concluded that it was not possible to estimate the mutagenic effect, unconfounded by other factors, and he proposed no useful tools of analysis. He did not have available raw data from these tests and could only check the validity of his model against mean counts in the published literature. On the other hand, Bishop (3) was able to use a large number of control animals from actual experiments. She was able to establish that the distribution of control data was sufficiently well behaved to allow for the use of standard robust statistical techniques. She set up a routine test method involving the use of a two-way analysis of variance with interaction [(weeks) x (treatment)] and introduced the use of variance stabilizing transforms to deal with counts of dead implants. She has kindly sent us a copy of her computer program. We have checked it against our data, and it appears to be a very well-written and well-documented program that can effectively handle the range of problems that might be expected in running a dominant-lethal trial.
The Bishop approach is much better than anything else appearing in the literature, but it is still far from ideal. In particular, it tests the overall mean levels of treatments and makes no provision for testing effects at specific stages of mutagenesis. In fact, a mild mutagen which affects the sperm during only one period will not produce a statistically significant treatment effect (although it might produce a significant interaction effect); and, thus, the method of testing chosen is ill-suited to the kind of alternative hypothesis one might expect in real life. Bishop also failed to consider what optimum combination of observations might work for a single mutagenic index. Instead, she chose to allow the analysis to run separately for counts of dead implants (estimating postimplantation losses) and for counts of total implants (estimating preimplantation losses).

Results of Investigations at Pfizer
At Pfizer, we were able to examine the data from over 4000 females taken during the control phases of more than 20 trials. From the analysis of this data we have developed an on-going method of computer analysis which has enabled us to estimate the true alpha level of our procedure, and we have been able to examine the value of our procedures with respect to known mutagens and compounds of unknown mutagenic potential. Details of our analysis of control data have been described elsewhere (1).
In general, we found that the number of implants for a given pregnant female mouse (all Charles River strain) can be effectively approximated by a binomial variate, as if there had been n implant sites, with each one having an independent opportunity to bear an implant with fixed probability p. Thus, the number of implants Y_i found in the ith control female has a probability frequency of the form
$$\binom{n}{Y_i} p^{Y_i} (1-p)^{n-Y_i}$$
The parameter p appears to be fixed at about 1/2, but the parameter n varies from one lot of females to another, ranging between 22 and 26.
If the number of implants can be fitted to a binomial (n;p), then the total number of implants in M pregnant females will also be a binomial with parameters (Mn;p). Furthermore, if we let the occurrence of dead implants be a set of independent Bernoulli variables, conditional on the occurrence of an implant, with probability of death, r, then X_i, the number of dead implants in the ith female, has a conditional frequency of the form
$$\binom{Y_i}{X_i} r^{X_i} (1-r)^{Y_i-X_i}$$
From these theoretical considerations, it can be shown that the total number of dead implants, z, in M pregnant females has an unconditional binomial distribution with parameters (Mn;pr) or a frequency of the form
$$\binom{Mn}{z} (pr)^{z} (1-pr)^{Mn-z}$$
If we assume that the effect of a mutagen will be to change the parameters p (the probability of an implant) or r (the conditional probability of a death, given an implant), then the number of implants and dead implants will continue to have a binomial distribution, even with a treatment effect. Thus, the arcsine transformation will stabilize the variance for both treatment and controls, regardless of the effect of treatment. This is not true of the square-root transformation chosen by Bishop, since that variance stabilization will hold only as long as the probability of a dead implant (r) remains small. By the arcsine transform, we mean
$$Z = 2 \arcsin\sqrt{\frac{\text{number of implants}}{nM}} \quad \text{or} \quad Z = 2 \arcsin\sqrt{\frac{\text{number of dead implants}}{nM}}$$
Regardless of the underlying probabilities, this transform has a variance of approximately 1/(nM). In our running method of analysis, we chose n = 24. It is clear that the variances will remain stabilized even if we have misestimated the value of n, as long as the ratio under the radical sign remains less than 1.0.
We have also examined the patterns of change that occurred over time among our control animals. There is a clear indication that the control parameters change with time. Figure 1 illustrates these changes. The upper part of the figure displays the average number of implants per pregnant control female (an estimate of n/2 in our binomial model if we assume p is constant at 0.5) across the entire 8 weeks of specific trials. The lower part of the figure displays the arcsine transform of the mean numbers of dead implants. There is an apparent cyclic pattern in the number of implants, with peaks occurring in September and valleys in April. There is also a significant (p<0.01) upward trend in the number of dead implants.

Methods of Analysis Now Being Used at Pfizer
We now have a running computer program which is written in Fortran but is mildly bound to the specific input/output configurations of the PDP-10 computer. Copies of the set of programs are available to anyone who requests them, with the understanding that some minor changes will have to be made in the flow of data. The program produces four pages of output for each treatment group. The first page lists the daily counts of pregnant females, numbers of implants, and numbers of dead implants that form the basic input, along with appropriate ratios and 3-day subtotals. This enables the experimenter to see gross and obvious patterns at a glance and to check for transcription errors in the initial input data.
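The variance-stabilizing claim made earlier for the arcsine transform can be checked numerically. The sketch below is illustrative only and is not the Fortran program described here. It computes the exact variance of a transformed binomial count and compares the arcsine transform with a plain square root of the count. The value M = 30 is hypothetical, and the form sqrt(count) is an assumption, since the exact square-root transform used by Bishop is not given in the text. Under these assumptions one would expect the scaled arcsine variance to stay near 1 whether the underlying probability is small (dead implants) or near 1/2 (implants), while the scaled square-root variance drifts with the probability.

```python
import math


def binomial_pmf(k, trials, prob):
    """Binomial probability, computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(trials + 1) - math.lgamma(k + 1)
               - math.lgamma(trials - k + 1)
               + k * math.log(prob) + (trials - k) * math.log(1.0 - prob))
    return math.exp(log_pmf)


def exact_variance(transform, trials, prob):
    """Exact variance of transform(K) when K ~ Binomial(trials, prob)."""
    pmf = [binomial_pmf(k, trials, prob) for k in range(trials + 1)]
    mean = sum(w * transform(k) for k, w in enumerate(pmf))
    return sum(w * (transform(k) - mean) ** 2 for k, w in enumerate(pmf))


N_SITES = 24                    # n, the value chosen in the text
N_FEMALES = 30                  # M, a hypothetical number of pregnant females
TRIALS = N_SITES * N_FEMALES    # nM, the binomial index for the pooled count


def arcsine_z(count):
    """Z = 2 * arcsin(sqrt(count / (nM))), as defined in the text."""
    return 2.0 * math.asin(math.sqrt(count / TRIALS))


def square_root(count):
    """Assumed form of a square-root transform: sqrt(count)."""
    return math.sqrt(count)


def scaled_variances(prob):
    """Return (nM * Var[arcsine Z], 4 * Var[sqrt(count)]) for a given probability.

    Both scalings are chosen so that a perfectly stabilized variance equals 1.
    """
    return (TRIALS * exact_variance(arcsine_z, TRIALS, prob),
            4.0 * exact_variance(square_root, TRIALS, prob))


if __name__ == "__main__":
    for prob in (0.05, 0.25, 0.5):   # illustrative values of p*r or p
        arcsine_scaled, sqrt_scaled = scaled_variances(prob)
        print(prob, round(arcsine_scaled, 3), round(sqrt_scaled, 3))
```

The exact calculation is used instead of simulation so that the comparison is deterministic; the probabilities looped over are illustrative and are not estimates from the Pfizer control data.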
The second page displays mean levels of implants, numbers of pregnant females, implants, dead implants, living implants, and ratios of these for each of the 8 weeks of trial. This enables the investigator to see the entire eight weeks of a single treatment group together, to gain subjective or "gut feeling" insights.
The third page is of the kind displayed in Figure 2. This is a plot of mean daily levels of a given measure (one of the z statistics described in the previous section or one of the more sophisticated second-order moment indices described in the next section) against a regression plotted from the mean daily control values. This regression, based on controls, is an important part of the running analysis. In order to increase the power of the test, the entire 8 weeks of control values are compared against a single week of treatment values. Early experience with the trial indicated that the mean levels for controls tended to change over the 8-week period, so it was inappropriate to compare the treatment values for a single week against the overall mean of the controls. Instead, we fit a linear regression [...]
[...] of the magnitude indicated on the horizontal axis. Each ray represents a fixed number of females per group impregnated each day. This figure suggests, for instance, that we can be 99% sure of detecting a doubling in the mean number of dead implants per female if we run at six pregnant females a day.
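The idea of comparing one week of treatment values against a regression fitted to all of the control weeks can be sketched as follows. This is a minimal illustration under stated assumptions and not the procedure of the program described above, whose details are not fully recoverable from the text. The sketch assumes weekly pooled counts, an ordinary least-squares line through the arcsine-transformed control values, and the theoretical variance 1/(nM) for each transformed value. The function names and the standardized-difference statistic are hypothetical.

```python
import math

N_SITES = 24  # n, the value chosen in the text


def arcsine_z(count, n_females, n_sites=N_SITES):
    """Z = 2 * arcsin(sqrt(count / (n*M))); approximate variance 1/(n*M)."""
    ratio = count / (n_sites * n_females)
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio under the radical must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(ratio))


def fit_line(xs, ys):
    """Ordinary least squares; returns (intercept, slope, mean_x, sxx)."""
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, mean_x, sxx


def treatment_vs_control_line(control_counts, control_females,
                              treat_count, treat_females, week):
    """Standardized difference between one treatment week and the control line.

    control_counts[i] and control_females[i] are pooled totals for week i + 1.
    Assumption: every transformed value has variance 1/(n*M), with the average
    weekly M used for the control weeks.
    """
    weeks = list(range(1, len(control_counts) + 1))
    control_z = [arcsine_z(c, m)
                 for c, m in zip(control_counts, control_females)]
    intercept, slope, mean_x, sxx = fit_line(weeks, control_z)
    predicted = intercept + slope * week

    mean_m = sum(control_females) / len(control_females)
    var_control = 1.0 / (N_SITES * mean_m)
    # Variance of the fitted control value at the chosen week.
    var_predicted = var_control * (1.0 / len(weeks)
                                   + (week - mean_x) ** 2 / sxx)
    var_treatment = 1.0 / (N_SITES * treat_females)

    treat_z = arcsine_z(treat_count, treat_females)
    return (treat_z - predicted) / math.sqrt(var_predicted + var_treatment)
```

With hypothetical counts of 30 dead implants among 30 pregnant control females in each of 8 weeks, `treatment_vs_control_line([30] * 8, [30] * 8, 60, 30, week=3)` gives the standardized difference for a treatment week in which the count of dead implants has doubled. Because all 8 control weeks enter the fitted line, the control side contributes much less variance than the single treatment week, which is the gain in power described above.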

Environmental Health Perspectives
Constructing a Single Mutagenic Index
The actual daily observations from the dominant lethal mutagenic trial consist of a two-dimensional vector, (number of implants, number of dead implants). It should be possible to derive from these two numbers a single mutagenic index that will cover both pre- and postimplantation losses. In fact, a great deal of the pre-1970 literature deals with just this question. If we fall back upon the binomial model proposed above, it can be shown that, if X_i = number of dead implants for the ith female and Y_i = total number of implants for the ith female, then [...]
The mean number of dead implants per pregnant female estimates E(X), and the mean number of implants per pregnant female estimates E(Y). It does not seem possible to find ratios or simple linear combinations of these two estimates that will be an increasing function of both pre- and postimplantation losses. Table 1 displays various indices based upon these two estimates, with the appropriate combinations of parameters of which they are consistent estimators. The columns labeled "conditions" show that for some mutagenic conditions they will tend to remain constant or actually decrease.
It would appear that any attempt to construct a consistent estimator of a combination of parameters that will increase for both pre- and postimplantation losses will have to involve second-order moments like the variances or covariance. With clever enough juggling of the formulae for expectation, variance, and covariance, a number of such indices can be found. Two of these indices are displayed in Table 1. If we use the sample moments of the data, we can construct moment estimators of such indices. These are consistent estimators, but they may be biased, and it might be possible to find more efficient ones by means of maximum likelihood computations.
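For reference, the first and second moments implied by the binomial model can be written out. The block below is derived here from the model as stated (Y_i binomial with parameters n and p, and X_i conditionally binomial with parameters Y_i and r). It is offered as a worked derivation and is not a transcription of the paper's own equations or of the indices in Table 1.

```latex
\begin{align}
E(Y_i) &= np, & \operatorname{Var}(Y_i) &= np(1-p), \\
E(X_i) &= npr, & \operatorname{Var}(X_i) &= npr(1-pr), \\
\operatorname{Cov}(X_i, Y_i) &= r \operatorname{Var}(Y_i) = npr(1-p).
\end{align}
```

Under these expressions the ratio E(X)/E(Y) = r does not respond to a change in p, and E(X) = npr falls when p falls, which is consistent with the remark that indices built from the two means alone can remain constant or decrease. A ratio involving a second-order moment, such as Cov(X, Y)/E(X) = 1 - p, does respond to p, which suggests why second-order moments are needed.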
However, the present state of the art is such that we should first find a useful index that can make sense to the biologist. So, in our first tentative attempts to locate such an index, we have restricted attention to these moment estimators. It would appear, from our first few runs, that the estimator, ln (S'2/1) is the best of those tried, best in the sense that it will declare statistical significance for known mutagens and fail to call significance for many of the situations we have identified as false positives using mean number of dead implants (or its arcsine transform).

Acknowledgements
A great deal of the work behind this paper is due to the efforts of Dr. Verne Ray and Leon Just of Pfizer Central Research, who have joined with me in writing a more definitive paper (1). In addition to contributing a great deal of the statistical and mathematical back-up, Mr. Just is responsible for the running computer program described here. Dr. Ray has provided the impetus, the biological insights, and the general air of sensibility behind the work reported here. Furthermore, Miss Martha Hyneck, who has overseen the development of and actually controlled the running of our dominant-lethal mutagenic trials over a 2 1/2 year period, deserves the credit for many of the initial insights that led to our analyses of data and for the amazing amount of careful hard work, without which the numbers we analyzed would never have been available.