Multiple comparisons in long-term toxicity studies.

Several multiple comparison procedures (MCPs) are discussed in relation to the specific formulation of type I and type II errors in toxicity studies and the typical one-way design of control versus k treatment/dose groups. Examples of these MCPs are: the standard many-to-one MCP (Dunnett's procedure), sequential rejection modifications, closed testing procedures, many-to-one MCPs with an ordered alternative hypothesis, procedures based on the assumption of a mixing distribution of responders and nonresponders, and MCPs for multiple end points.


Introduction
Why is it that multiple comparison procedures (MCPs) are being discussed in toxicology even today, despite the fact that they are everyday procedures in biostatistics? This paper deals with several sources of multiplicity in long-term toxicity studies and possible methods for suitable statistical analysis.
Based on the closed testing principle discussed by Marcus et al. (1), a revolution in MCPs has taken place. We can thus diminish the antagonism between enforcing $\alpha_{exp}$ (the experimentwise type I error) and the resulting decrease in the power $\pi$ (where $\pi = 1 - \beta$ and $\beta$ is the type II error). This paper presents a special case where $\alpha_{exp}$ is held and the maximum power of the two-sample case is guaranteed. The paper is limited to regulatory toxicity studies, e.g., carcinogenicity, mutagenicity, according to national/international guidelines, for example, the European Community (EC) guideline (2).
Regulatory toxicity studies are so-called safety studies, the purpose of which is to ascertain carcinogenic, mutagenic, and other side effects. For this purpose, the statistical hypothesis in relation to type I and II errors should be specified: a) The risk of a type I error, $\alpha$, represents the producer's risk: the conclusion is that a toxic side effect exists, while in fact this is not the case. b) The risk of a type II error, $\beta$, represents the consumer's risk: the conclusion is that a toxic effect does not exist, while in truth one actually does. Intuitively, it is clear that both risks must be handled with care, even though controlling the type II error should be of primary concern in toxicology. Usually, the type II error is defined comparisonwise and the type I error experimentwise ($\alpha_{exp}$). A typical design analyzes comparisons between the control and treatment/dose groups, several time points, both sexes, elements of a multivariate end point vector, and multiple tumor sites. Because of a dramatic increase in the type II error with such a high-dimensional design, an $\alpha_{exp}$ formulation is normally used for the subdesign control versus k treatment/dose groups.
The purpose of an adequate statistical analysis is to minimize the type II error while holding $\alpha_{exp}$ constant. This article will therefore investigate several MCPs to establish the conditions under which the above requirement can be fulfilled.

Department of Biostatistics, German Cancer Research Center, Im Neuenheimer Feld 280, D-6900 Heidelberg, Germany. This paper was presented at the International Biostatistics Conference on the Study of Toxicology that was held May 13-25, 1991, in Tokyo, Japan.

Experimental Design of Long-Term Toxicity Studies
In long-term toxicity studies, three types of experimental design can be distinguished for the above-mentioned subdesign:
a) Control, $\mathrm{dose}_1, \ldots, \mathrm{dose}_k$, where $C = 0 < D_1 < \ldots < D_k$. The purpose is dose-response analysis or estimation of the no-observed-effect dose.
b) Control, $\mathrm{treatment}_1, \ldots, \mathrm{treatment}_k$, where the treatments $T_j$ are several substances, combinations, etc. The purpose is to characterize all contrasts [control versus $T_j$, $\forall j \in (1, \ldots, k)$].
c) Control, [$D_j$ or $T_j$], $P^+$. The purpose of using a positive control group, $P^+$ (administration of a known toxic substance), is to check the sensitivity of the test system currently in use (animals, bacteria, etc.). Using a simple closed testing procedure, $\alpha_{exp}$ can also be held constant in this most complex design (L. Hothorn, in preparation).

Multiple Comparison versus Modeling
Two widely used and distinct types of statistical approach are possible for long-term toxicity studies: modeling, i.e., choosing a suitable dose-response model and fitting the model to the data, e.g., for the Ames assay according to Margolin et al. (3); and MCPs. This paper discusses only MCPs.
MCPs are suitable for all three above-mentioned types of design. Modeling is sometimes uncertain for the typical guideline-related design with two or three dose groups. MCPs usually require fewer a priori assumptions (e.g., there is no problem of correct model choice), whereas in the modeling approach an interaction of incorrect model choice and estimation error is possible. Robust use of MCPs in the routine evaluation of studies with multiple end points is possible. Of course, the MCP approach also has several disadvantages, such as no possibility of extrapolation.

MCPs in Control versus k Treatment/Dose Groups Design

Two-Sample versus k-Sample Testing
Toxicology journals often contain papers in which the statistical analysis is based on the two-sample t-test or the Wilcoxon-Mann-Whitney U-test, even in the k-sample many-to-one situation (4), using a comparisonwise $\alpha_{comp}$ level, i.e., testing each contrast independently at, for example, the 0.05 level. With this simple approach, the experimentwise $\alpha_{exp}$ level is violated on the one hand, whereas on the other the type II error is smaller than with an MCP and does not depend on the number of treatment/dose groups. This is the testing dilemma always faced in toxicity studies. Several compromises and an ideal situation (minimum type II error while holding $\alpha_{exp}$) will now be discussed.
Many-to-one MCPs can be recommended on the whole. But if two-sample tests are used, then they should be used only for the contrasts $(C - D_j)$, and not for the between-dose contrasts $(D_j - D_i)$ with $(i \neq j) \in (1, \ldots, k)$.
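
The size of the violation of the experimentwise level can be illustrated with a minimal sketch. It assumes, purely for illustration, that the k two-sample tests are independent; in the many-to-one design all contrasts share the control group and are positively correlated, so the true experimentwise rate is somewhat lower than this value. The function name is hypothetical and not part of any package.

```python
def experimentwise_alpha(alpha_comp: float, k: int) -> float:
    """Probability of at least one false rejection among k level-alpha_comp
    tests when all null hypotheses hold.

    Assumption (illustration only): the k tests are independent.
    """
    if not 0.0 <= alpha_comp <= 1.0:
        raise ValueError("alpha_comp must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - alpha_comp) ** k


# Experimentwise level for an increasing number of dose groups,
# each contrast tested at the comparisonwise level 0.05.
for k in (1, 2, 3, 5, 10):
    print(k, round(experimentwise_alpha(0.05, k), 4))
```

For k = 3 dose groups this gives approximately 0.143 instead of the nominal 0.05.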

k-Sample Tests versus k-Sample Procedures
It is worth clarifying the difference between tests and MCPs from both a toxicological and a biostatistical viewpoint. A k-sample test, e.g., the well-known F-test, represents a single decision problem:
$H_0: F_C = F_{D_1} = \ldots = F_{D_k}$
$H_A: F_C \neq F_{D_1} \neq \ldots \neq F_{D_k}$
with $F$ the distribution function, for testing the global substance effect. An MCP represents a multiple decision problem:
$H_A: F_i \neq F_j \ \forall (i \neq j) \in (1, \ldots, k)$
for testing every contrast $(C - D_j)\ \forall j \in (1, \ldots, k)$. Because not only the global effect but also each single contrast $(C - D_j)$ is of interest in toxicology, application of an MCP is recommended. A combination of both approaches based on the closed testing principle is also possible, providing both global and local information.

All-Pair versus Many-to-One Procedures
Commonly used statistical software packages are generally oriented to all-pair MCPs, such as Tukey, Scheffe, Duncan, etc.
All-pair MCPs analyze not only the contrasts of interest $(C - D_j)$ but also the contrasts $(D_j - D_i)$ with $(i \neq j) \in (1, \ldots, k)$. The type II error rate thus increases (5) in comparison with the many-to-one MCP (Dunnett), as shown for control versus $k = 3$ dose groups, $\alpha_{exp} = 0.05$, $\sigma/d = 1.0$, $n_j = 24$ (with $\sigma$ the end point-specific variance and $d$ the detectable difference).

The Standard Many-to-One MCP: Dunnett's Procedure
In the control versus k treatment design, Dunnett's (6) procedure is commonly used for approximately normally distributed end points. Other types of end points occurring in toxicology will not be discussed in this paper. For dichotomous end points, see Piegorsch (7).
Dunnett's procedure is relatively robust against violation of the normal distribution assumption (12; Ortseifen and Hothorn, in preparation). However, for $n_j > 10$, the nonparametric analog according to Steel (13) shows better power behavior (even in the nearly normally distributed case). The maximum power of Dunnett's procedure is attained with $n_C = \sqrt{k}\, n_j$ (14-16). This is not the case for the Williams (17) procedure, which assumes an ordered alternative (18). The power depends on the number of treatment or dose groups k, which implies that inclusion of further nonsignificant treatment groups can lead to overlooking significant effects (19). A rule for design using MCPs is therefore to use only the minimal necessary number of treatment/dose groups.
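
The allocation rule $n_C = \sqrt{k}\, n_j$ can be turned into group sizes for a fixed total number of animals. The following is a minimal sketch under the assumption that the total $N = n_C + k\, n_j$ is given and that all dose groups are of equal size; the function name and the rounding choice (dose groups rounded down, remainder assigned to the control) are illustrative assumptions, not part of the cited work.

```python
import math


def dunnett_allocation(total_n: int, k: int) -> tuple:
    """Split total_n animals into one control group and k equal dose groups
    so that n_control is approximately sqrt(k) * n_dose.

    Returns (n_control, n_dose). Rounding rule (an assumption of this
    sketch): dose groups are rounded down, the remainder goes to control.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if total_n < k + 1:
        raise ValueError("need at least one animal per group")
    # Solve N = sqrt(k) * n_dose + k * n_dose for n_dose.
    n_dose = max(1, math.floor(total_n / (k + math.sqrt(k))))
    n_control = total_n - k * n_dose
    return n_control, n_dose


print(dunnett_allocation(120, 4))  # (40, 20): control is twice a dose group
print(dunnett_allocation(96, 3))   # (36, 20): ratio 1.8, close to sqrt(3)
```

Compared with a balanced design (24 animals in each of four groups for N = 96, k = 3), the rule shifts animals from the dose groups to the shared control.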
In the case of variance heterogeneity, Dunnett's procedure is not robust (12). Other approaches should be used in this case, e.g., $\alpha$-adjusted Welch tests or the Brownie (20) type of control group variance inclusion (Hothorn and Ortseifen, in preparation).

Procedures
The closed testing principle in many-to-one MCPs is quite simple (in comparison with all-pair MCPs) because a complete system of hypotheses with $(2^k - 1)$ elementary hypotheses (21) is given. Several types of sequentially rejective modifications will be discussed: a) Bonferroni/Holm (22) (29). The testing strategy (for simplicity, given here for C, $k = 3$) is shown in Figure 1. This multiple procedure works simply as follows: a level-$\alpha$ test is performed at stage 1. If and only if the stage 1 hypothesis $H_0$ is rejected, all subhypotheses at stage 2 are tested at the same $\alpha$ level, and so on. If a hypothesis $H_0(\cdot)$ is not rejected, none of its subhypotheses is rejected.
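
The structure of the closed hypothesis system can be sketched by enumerating all $2^k - 1$ intersection hypotheses explicitly. In the sketch below, each intersection hypothesis is tested with a Bonferroni test at level $\alpha$; this choice of local test is an assumption made for illustration (any level-$\alpha$ test of the intersection could be substituted), and the function name is hypothetical. An elementary contrast $(C - D_j)$ is declared significant only if every intersection hypothesis containing it is rejected.

```python
from itertools import combinations


def closed_test_bonferroni(p_values, alpha=0.05):
    """Closed testing over all non-empty subsets of k many-to-one contrasts.

    p_values[j] is the p-value of contrast control vs. group j+1.
    Local test (assumption of this sketch): the intersection hypothesis
    for a subset of size m is rejected if min(p) <= alpha / m.
    Returns (decisions, number_of_hypotheses).
    """
    k = len(p_values)
    local_rejection = {}
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            smallest_p = min(p_values[j] for j in subset)
            local_rejection[subset] = smallest_p <= alpha / size

    # Contrast j is rejected iff all intersections containing j are rejected.
    decisions = [
        all(rej for subset, rej in local_rejection.items() if j in subset)
        for j in range(k)
    ]
    return decisions, len(local_rejection)


decisions, n_hypotheses = closed_test_bonferroni([0.010, 0.020, 0.30])
print(n_hypotheses)  # 7 hypotheses for k = 3
print(decisions)     # [True, True, False]
```

Because every hypothesis in the system is tested at the full level $\alpha$, the experimentwise level is held without a further adjustment across stages.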
Procedure with a priori Hierarchical Hypotheses. Use specific two-sample tests for the elementary contrasts $(C - D_j)$, and estimate the related p-values (without ordering).

Decision scheme:
If $p_k > \alpha$, STOP: $H_0^{(1)}, \ldots, H_0^{(k)}$ cannot be rejected; otherwise $H_0^{(k)}$ is rejected, and go to dose level $(k-1)$.
If $p_{k-1} > \alpha$, STOP: $H_0^{(1)}, \ldots, H_0^{(k-1)}$ cannot be rejected; otherwise $H_0^{(k-1)}$ is rejected, and go to dose level $(k-2)$, etc.
This procedure represents a special case of the closed testing procedure under the assumption of an ordered alternative hypothesis. If the $p_i$ values within a real study are ordered, then with this procedure we find an ideal situation in MCP: holding $\alpha_{exp}$ and guaranteeing the maximum powers $\pi_j$ of the two-sample tests (based on the comparisonwise $\alpha$). This procedure is moderately robust against violations of this monotonicity assumption (28).
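
The decision scheme with a priori hierarchical hypotheses can be written in a few lines. This is a minimal sketch, assuming the two-sample p-values have already been computed and are supplied in the order of increasing dose; the function name is hypothetical.

```python
def hierarchical_procedure(p_values, alpha=0.05):
    """A priori ordered testing from the highest dose downward.

    p_values[j] is the two-sample p-value for control vs. dose j+1, with
    doses in increasing order. Each contrast is tested at the full level
    alpha; testing stops at the first non-rejection, and all lower doses
    are then left unrejected without being tested.
    """
    rejected = [False] * len(p_values)
    for j in range(len(p_values) - 1, -1, -1):
        if p_values[j] > alpha:
            break  # STOP: this and all lower dose levels cannot be rejected
        rejected[j] = True
    return rejected


# Monotone case: the two highest doses are declared significant.
print(hierarchical_procedure([0.20, 0.03, 0.001]))  # [False, True, True]

# Non-monotone case: the lowest dose is never tested although p = 0.01.
print(hierarchical_procedure([0.01, 0.30, 0.001]))  # [False, False, True]
```

The second call shows the price of the ordering assumption: an effect at a low dose is missed when an intermediate dose fails to reach significance.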

Nonrestricted versus Ordered Alternative Hypotheses
Now we will consider the design $C, D_1, \ldots, D_k$. Assuming a monotonic dependence of the effect on dose, restriction of the alternative hypotheses is possible (19). For Poisson-distributed end points, the closed testing procedure is based on Lee's (34) trend test (19).

Comparison of Several Procedures with Simulation Studies
For commonly observed conditions of real toxicity study data, namely expected value profiles, dimension of k, sample sizes $n_j$, $\alpha$ levels, variances, etc., several procedures were investigated with simulation studies (5, 18, 28, 30, 35-37). For practical application, these simulation results can be summarized in a rather simple way: the Hommel (24)/Hochberg (26) procedure can be recommended, because without restriction of the alternative hypothesis a power behavior close to that of the MCPs with ordered alternative was observed. It should be pointed out that for sequential rejection procedures, the estimation of confidence intervals has not yet been solved satisfactorily.
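
To illustrate how sequentially rejective procedures operate on the p-values of the k contrasts, the following sketch implements the step-down Holm and the step-up Hochberg procedures in their usual textbook form. The function names are hypothetical, and the Hommel procedure is not implemented here. Note that the Hochberg procedure holds the experimentwise level under independence or certain forms of positive dependence of the test statistics; whether this applies to a given study is an assumption to be checked.

```python
def holm_rejections(p_values, alpha=0.05):
    """Step-down Holm: start at the smallest p-value, compare the i-th
    smallest with alpha / (k - i + 1), stop at the first non-rejection."""
    k = len(p_values)
    order = sorted(range(k), key=lambda j: p_values[j])
    rejected = [False] * k
    for rank, j in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[j] > alpha / (k - rank):
            break
        rejected[j] = True
    return rejected


def hochberg_rejections(p_values, alpha=0.05):
    """Step-up Hochberg: start at the largest p-value; as soon as the i-th
    smallest p-value is <= alpha / (k - i + 1), reject it and all smaller."""
    k = len(p_values)
    order = sorted(range(k), key=lambda j: p_values[j])
    rejected = [False] * k
    for rank in range(k - 1, -1, -1):
        if p_values[order[rank]] <= alpha / (k - rank):
            for j in order[: rank + 1]:
                rejected[j] = True
            break
    return rejected


p = [0.030, 0.040, 0.045]
print(holm_rejections(p))      # [False, False, False]
print(hochberg_rejections(p))  # [True, True, True]
```

The example shows the power advantage of the step-up version: when all p-values lie just below $\alpha$, Hochberg rejects every contrast while Holm rejects none.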

Unimodal versus Mixing Distribution Assumption
All MCPs discussed in the preceding sections compare expected values. In real data, two situations may occur: greater variability (variance) with increasing response, and existence of a subpopulation of nonresponders. This problem can be treated by several approaches: a) use of MCPs that are robust under variance heterogeneity (Hothorn and Ortseifen, in preparation); b) so-called location-scale models, e.g., a combination of the U-test (location) and the Ansari/Bradley (38) test [scale (39)], or the Brownie (20) type of control group variance inclusion; c) assumption of a mixing distribution of responders and nonresponders with the following hypotheses:
$H_0: F_C(x) = F_D(x)$
$H_A: F_C(x) < F_D(x)$ with $F_D(x) = (1 - r)\, F_C(x) + r\, F_{patho}(x)$
where $r$ is unknown, $(1 - r)$ is the proportion of nonresponders, and $r$ is the proportion of responders. Two types of Lehmann (40) alternative will be considered here: shift, $F_{patho}(x) = F_C(x - \delta)$, according to Good (41); and power, $F_{patho}(x) = F_C^{a}(x)$, according to Lehmann (40). Johnson et al. (42) suggested, for the shift alternative, approximate score statistics based on a mixed normal score function in which $i$ is a rank in the combined (control + treatment) sample, $d$ is a constant (in the simulation study, $d = 0.5, 1, 1.5, 2$ were used; only the case $d = 1$ will be reported here), and $\Phi^{-1}$ is the inverse distribution function of the normal distribution.
As a generalization of Wilcoxon-Mann-Whitney (WMW) scores, Conover and Salsburg (43) proposed the following approximate score function for the power alternative: $sc(i) = \left(i/(n_C + n_T + 1)\right)^{a-1}$, where $a$ is an integer constant ($a = 3, 4, 5, 6$ were used in the simulation study; here, only the case $a = 4$ will be reported).
In toxicology, tests based on this mixing distribution assumption were used for behavioral studies (44), teratological studies (45), sister chromatid exchange mutagenicity assays (42), chronic studies (5), and micronucleus mutagenicity assays (46). With simulation studies (42, 46), advantages in power can be shown for several practical data situations in toxicology.

Many-to-One MCPs for Multiple End Points
In long-term toxicity studies, several types of end points occur (19): approximately normally distributed (e.g., body mass); nonnormally distributed [e.g., the skewed distributed liver enzyme ASAT (5)]; binomially distributed (e.g., tumor rate); Poisson-distributed (e.g., number of tumors). The commonly used evaluation consists of a separate univariate analysis of each single end point, e.g., Unkelbach et al. (47), but a multivariate analysis of multiple end points in the many-to-one design is also possible: a) the $T^2$ modification according to Higazi and Dayton (48); b) with better power behavior for the typical one-sided hypothesis, multiple end point analysis (49); the version based on Dunnett's procedure (50) is a special case of parametric testing after k-ranking transformation. Both approaches have, however, a major disadvantage: they allow a decision only on the global end point vector. No information is available on the combinations of end points, down to the single end point case. The multivariate problem also consists of a complete hypothesis system of $2^k - 1$ elementary hypotheses (21). The decision scheme is quite simple (51), as can be seen for the four end points in Figure 2. Based on a level-$\alpha$ test at each step, this procedure shows good power behavior.
This procedure is available as a PC program for up to 10 end points (Hothorn and Nagel, submitted).
An interesting extension of this method is possible for toxicity studies with both multiple end points and multiple treatment or dose groups, based on the closed testing procedure under the assumption of an ordered alternative using the Williams (17) MCP. With this approach, decisions can be made both on the multiple end points and on the multiple dose groups, based on level-$\alpha$ tests at each step but holding $\alpha_{exp}$ (50).

Summary
This paper reveals several sources of multiplicity within long-term toxicity studies and their suitable treatment. It shows that the antagonism between holding $\alpha_{exp}$ and ensuring the maximum power can be reduced, and that special MCPs for the biostatistical analysis of long-term toxicological studies are necessary and are available as a PC program.