Notes on the use of historical controls.

The principles and methods of incorporating historical controls in four cases (C1, stable historical control; C2, rare occurrence of responses; C3, small group size; C4, historical control as a reference) are discussed. Two points are emphasized: one is that the historical control should be regarded as a given condition, and the other is that the historical control should be used conservatively. Incorporating historical controls is recommended only when it is advantageous under a conditional evaluation of performance, and even then only with conservative use of the controls. For case C1, adjusting the critical value of the Cochran-Armitage trend test is proposed. For case C2, a modified conditional trend test proposed by Yanagawa et al. is endorsed as a proper procedure. For case C3, a conservative use of interblock information is discussed. The incorporation of the historical control is not recommended for case C4.


Introduction
Ever since the work of R. A. Fisher, we have, in principle, analyzed the data of a toxicological experiment independently of other experiments. The number of well-controlled experiments conducted under the same protocol has recently increased, however, and this has tempted us to combine the data from past experiments with the data from current experiments. The temptation is especially strong in the following cases, where the control groups in past experiments are referred to as the historical control. Incorporating the historical control would serve to a) increase the power of hypothesis testing when the historical control is stable, as in the case reported in Hayashi et al. (1): C1; b) carry out hypothesis testing when the occurrence of the response is rare, as in Tarone (2) or Yanagawa and Hoel (3): C2; c) increase the power of hypothesis testing when the group size is small, as when the experiment is conducted on dogs: C3; d) validate the judgment that an observed significance is a realization of a type I error, as when so many items are tested that an inflated type I error is likely to occur: C4. Having experienced many such cases, researchers engaged in toxicological experiments ask statisticians the following questions: In what cases can we use the historical control? How should we incorporate the historical control? The purpose of this paper is to address these questions.

The Meaning of "Historical"
Because statistical testing is a principal concern in the data analysis of the cases mentioned above, we concentrate our attention in this paper on testing procedures. There are many works in the literature addressing the use of the historical control in toxicological experiments. Among them, Margolin and Risko (4) presented a good summary of the general principle. For carcinogenicity studies, Tarone (2), Yanagawa and Hoel (3), Tamura and Young (5), Hoel and Yanagawa (6), Krewski et al. (7), and Yanagawa et al. (8) proposed procedures or discussed problems to be considered based on the β-binomial model.
In their arguments (2-8), however, the time dependence of the historical control relative to the current experiment is not consciously considered. The arguments, except the one in Yanagawa et al. (8), remain valid even when we reanalyze the data of a past experiment by incorporating data from succeeding experiments. In real situations, incorporation of the historical control is sought only after a laboratory has collected enough data from past experiments and has confirmed some sort of homogeneity among those experiments. Once the historical control is saved in a database, researchers in that laboratory refer to the same data repeatedly in the data analysis of succeeding experiments. In this circumstance, the data in the historical control are not randomly realized values but constitute a given condition fixed in advance of the data analysis of the current experiment. To address the questions mentioned earlier, we should examine the performance of each procedure from the viewpoint that the historical control is a given, fixed condition. This viewpoint of conditional use is the first point of our assertion.
The historical control is used to estimate unknown parameters related to the current experiment. The assertion that the historical control should be regarded as a given, fixed condition means that we should regard the estimates based on the historical control as including some deviation from the true parameter values, though the amount of deviation is within random variation. Because the historical control is a realization of random variables, either a negative or a positive deviation may occur. In general, when a positive deviation causes a test based on the historical control to be conservative, a negative deviation causes the test to be liberal, and vice versa. Because we cannot know which situation has been realized, we have to design a test procedure that is conservative in the sense that type I errors are controlled within a target significance level even when a disadvantageous deviation has occurred. This viewpoint of conservative use is the second point of our assertion.

Stable Historical Control
To make the arguments clear, we deal only with simple cases. Assume that the current experiment consists of (a+1) groups, A_0, A_1, ..., A_a, of n individuals each, and that each individual in the group A_i is exposed to dose d_i (d_0 < d_1 < ... < d_a) of a chemical. Let the observed binary response of the jth individual in A_i be X_ij, let X_i = Σ_j X_ij, and let π_i denote the response rate of the group A_i. Assume further that the historical control consists of b past control groups of size n with response counts Y_1, ..., Y_b and response rates π_0^(1), ..., π_0^(b). Under this formulation, it is reasonable to regard the case where π_0^(1) = π_0^(2) = ... = π_0^(b) = π_0 as the case C1, that is, the case with the stable historical control. In this case, the observed variables can be reduced to (X, Y) = (X_0, X_1, ..., X_a, Y), where Y = Σ_i Y_i is distributed binomially as B(bn, π_0). The problem is to test the null hypothesis H_0: π_0 = π_1 = ... = π_a against the alternative hypothesis H_1: π_0 ≤ π_1 ≤ ... ≤ π_a, with at least one strict inequality, based on (X, Y). A typical procedure for this problem is the Cochran-Armitage trend test, because it is the uniformly most powerful unbiased test against logistic alternatives (9). When the group size n is so large that the normal approximation to the binomial distribution is available, two versions of the trend test with significance level α, say T_c and T_h, can be considered by excluding or including Y:

Test T_c: Reject H_0 if

  T_c = Σ_{i=0}^{a} (d_i − d̄_c) X_i / [ Σ_{i=0}^{a} (d_i − d̄_c)^2 · n p̂_c (1 − p̂_c) ]^{1/2} > u(α),  (1)

where d̄_c = (d_0 + d_1 + ... + d_a)/(a+1), p̂_c = (X_0 + X_1 + ... + X_a)/((a+1)n), and u(α) is the upper 100α% point of N(0, 1).

Test T_h: Reject H_0 if

  T_h = [ Σ_{i=0}^{a} (d_i − d̄_h) X_i + (d_0 − d̄_h) Y ] / [ ( Σ_{i=0}^{a} (d_i − d̄_h)^2 + b (d_0 − d̄_h)^2 ) · n p̂_h (1 − p̂_h) ]^{1/2} > u(α),  (2)

where d̄_h = [(b+1)d_0 + d_1 + ... + d_a]/(a+b+1) and p̂_h = (Y + X_0 + X_1 + ... + X_a)/((a+b+1)n); note that p̂_h = w(Y/bn) + (1 − w)p̂_c with w = b/(a+b+1), the weight of the historical control in the pooled rate.

If Y is regarded as a random variable, the test T_h is obviously better than the test T_c. But if Y is regarded as a given constant y, the type I error of the test T_h is not controlled within the target significance level α, as is explained below.

Under H_0, the statistic T_h can be written approximately as

  T_h ≈ (D_X + B D_Y) / (1 + B^2)^{1/2},  (6)

where D_X is the standardized trend statistic of the current experiment, distributed as N(0, 1),

  D_Y = (bnπ_0 − y) / [bnπ_0(1 − π_0)]^{1/2},  (9)

and B^2 = b(d_0 − d̄_h)^2 / Σ_{i=0}^{a} (d_i − d̄_h)^2. The quantity B^2 can be regarded as small, because b is generally much greater than a in the situations where the use of the historical control is sought; for example, if a = 4, b = 20, and d_i = i, then B^2 = 0.14. Because D_X is distributed as N(0, 1) independently of D_Y, the conditional distribution of T_h given Y = y is positively (or negatively) biased from N(0, 1) if y < bnπ_0 (or y > bnπ_0), and the conditional type I error is approximately

  P(T_h > u(α) | Y = y) ≈ 1 − Φ(u(α) − B D_Y),  (8)

where Φ is the distribution function of N(0, 1). Because D_Y is distributed as N(0, 1), |D_Y| < 1.645 with probability 0.90. For d_i = i and for D_Y within this range, some numerical values of the right-hand side of Equation 8 are shown in Table 1, together with the type I errors obtained by a Monte Carlo simulation. Table 1 shows the possibility of an inflation of the type I error.
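The two trend statistics T_c and T_h above can be sketched in code. The following is a minimal sketch, not the authors' implementation; the function names are ours, and the group counts, doses, and sizes used below are illustrative assumptions.

```python
import numpy as np

def trend_test_current(x, d, n):
    """Cochran-Armitage trend statistic T_c from the current experiment
    only: x[i] = responders in group A_i of size n, d[i] = dose d_i."""
    x, d = np.asarray(x, float), np.asarray(d, float)
    a1 = len(d)                                  # a + 1 groups
    dbar = d.mean()                              # d-bar_c
    p = x.sum() / (a1 * n)                       # pooled rate p-hat_c
    num = np.sum((d - dbar) * x)
    var = np.sum((d - dbar) ** 2) * n * p * (1.0 - p)
    return num / np.sqrt(var)

def trend_test_historical(x, d, n, y, b):
    """Trend statistic T_h incorporating the historical control:
    y = total responders among b past control groups of size n."""
    x, d = np.asarray(x, float), np.asarray(d, float)
    a1 = len(d)
    dbh = ((b + 1) * d[0] + d[1:].sum()) / (a1 + b)   # d-bar_h
    p = (y + x.sum()) / ((a1 + b) * n)                # pooled rate p-hat_h
    num = np.sum((d - dbh) * x) + (d[0] - dbh) * y
    var = (np.sum((d - dbh) ** 2) + b * (d[0] - dbh) ** 2) * n * p * (1 - p)
    return num / np.sqrt(var)
```

Either statistic is compared with u(α), for example 1.645 for α = 0.05.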
One idea for controlling the type I error within the significance level α is to adjust the critical value of the test T_h. If we evaluate the possible maximum value of D_Y by u(α'), we obtain an adjusted test T_a as follows:

Test T_a: Reject H_0 if T_h > u(α) + B u(α').  (10)

We think that a reasonable value of α' is 0.05, which gives u(α') = 1.645. According to the above argument, the incorporation of the historical control is advantageous only when the power of the test T_a is greater than that of the test T_c.
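The inflation of the conditional type I error and the effect of the adjusted critical value can be checked by a small simulation. The sketch below fixes a disadvantageous historical outcome y and samples only the current experiment under H_0; the design values (a = 4, b = 20, d_i = i) follow the example in the text, while n, π_0, and the simulation size are illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n, pi0 = 4, 20, 50, 0.1
d = np.arange(a + 1.0)
dbh = ((b + 1) * d[0] + d[1:].sum()) / (a + b + 1)           # d-bar_h
B = np.sqrt(b * (d[0] - dbh) ** 2 / np.sum((d - dbh) ** 2))  # B^2 = 0.14 here
u = 1.645                                                    # u(0.05)

# Fix y about 1.645 historical standard deviations BELOW bn*pi0,
# the direction in which T_h becomes liberal.
y = round(b * n * pi0 - 1.645 * np.sqrt(b * n * pi0 * (1 - pi0)))

reps, hit_h, hit_a = 20000, 0, 0
for _ in range(reps):
    x = rng.binomial(n, pi0, size=a + 1)        # current experiment under H_0
    ph = (y + x.sum()) / ((a + b + 1) * n)      # pooled rate p-hat_h
    num = np.sum((d - dbh) * x) + (d[0] - dbh) * y
    var = (np.sum((d - dbh) ** 2) + b * (d[0] - dbh) ** 2) * n * ph * (1 - ph)
    t = num / np.sqrt(var)
    hit_h += t > u                  # unadjusted test T_h
    hit_a += t > u + B * 1.645      # adjusted test T_a with alpha' = 0.05
rate_h, rate_a = hit_h / reps, hit_a / reps
print(rate_h, rate_a)  # rate_h is well above 0.05; rate_a is pulled back near it
```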
Because, under H_1 and for given y, the statistics T_c and T_a are approximately normally distributed, their powers can be evaluated; some numerical values are shown in Table 2, together with values obtained by a Monte Carlo simulation. Table 2 shows that part of the increase in power for the test T_h is spurious, being due to an inflation of the type I error, and that the advantage of incorporating the historical control is rather limited even when the historical control is stable enough. According to our assertion, whether the historical control should be incorporated or not is judged by comparing the two tests T_a and T_c. The choice is possible in advance of the current experiment, because the powers of the two tests can be evaluated from the design of the current experiment, the target alternatives, and the realized values of the historical control.

Rare Occurrence of Responses
Cases of the stable historical control discussed in the previous section rarely occur in toxicological experiments. We have seen such cases only in in vitro experiments or in short-term experiments. In most cases, there is more or less variability among the historical controls and the concurrent control. Treating this variability as a prior distribution, Tarone (2) proposed a trend test based on the β-binomial model. Yanagawa and Hoel (3) proposed a set of trend tests along the same line of thought as Tarone. In addition, the latter authors proposed exact test procedures that control the type I error when the asymptotic theory is not applicable.
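As an orientation to the β-binomial setting, the two parameters of the beta prior can be estimated from historical control counts by the method of moments. This is only a sketch under our own naming; the cited papers use their own estimators.

```python
import numpy as np

def beta_binomial_moments(y, n):
    """Method-of-moments estimates of the beta parameters (alpha, beta)
    from historical control counts y, each out of n animals.
    Requires overdispersion relative to the binomial (positive rho)."""
    p = np.asarray(y, float) / n               # observed control rates
    m, v = p.mean(), p.var(ddof=1)
    # Solve m = alpha/(alpha+beta) and
    # v = m(1-m)(1 + (n-1)rho)/n, with rho = 1/(alpha+beta+1).
    rho = (n * v / (m * (1.0 - m)) - 1.0) / (n - 1.0)
    s = 1.0 / rho - 1.0                        # s = alpha + beta
    return m * s, (1.0 - m) * s
```

The estimated pair then plays the role of the known parameters in the procedures above; the sensitivity of those procedures to this estimation step is exactly the difficulty discussed below.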
To make the argument simple, let us assume a logistic response model for the rate parameter π_i of the binary response, that is,

  π_i = exp(γ_0 + δ(d_i − d_0)) / [1 + exp(γ_0 + δ(d_i − d_0))],  i = 0, 1, ..., a,

and assume that the control rate π_0 varies from experiment to experiment according to a beta distribution with parameters α and β (Equation 19). If the two parameters α and β of Equation 19 are known, no problem arises in the proposed procedures. In real situations, however, these parameters are unknown and must be estimated from the historical control. Tamura and Young (5) pointed out that the estimation error seriously affects the type I error of Tarone's procedure. This sensitivity was also recognized by Yanagawa and Hoel (3). To overcome this defect, Krewski et al. (7) proposed a two-stage procedure that contains an optionally chosen parameter governing the choice of the second-stage test. Though the result of the procedure is highly influenced by this parameter value, they give no reasonable method to determine it, which makes their procedure ambiguous. In the same situation, Yanagawa and Hoel (3) proposed an exact conditional trend test based on a statistic, say T_y, where X = Σ_i X_i is the total number of responses in the current experiment. Along this statistic, exact probabilities are accumulated, where "exact" means that conditional probabilities given X are calculated. By taking either the upper or the lower confidence limit for each of the estimated parameters, four exact p-values are obtained for the test statistic T_y. The use of the maximum of these values as the p-value for testing the hypothesis H_0: δ = 0 is Yanagawa et al.'s proposal (8).
Let us denote this testing procedure by Ty.
In the test T_y, the historical control is used only to estimate α and β, and the estimated values of α and β are used as given constants. As a result, all the information contained in the historical control is included in the given condition, so the resulting procedure fits our viewpoint. The conservative use of the historical control to keep type I errors within a target significance level is entirely the same idea as the one explained in the previous section. Therefore, the test T_y is recommended. The problem is how to judge whether the test T_y is superior to the corresponding test, say T_e, that makes no use of the historical control. It is reasonable to assume that, in the test T_e, the p-value is calculated by accumulating exact probabilities along the statistic T_e defined by

  T_e = Σ_{i=0}^{a} (d_i − d_0) X_i − X Σ_{i=0}^{a} (d_i − d_0) / (a+1).  (21)

The comparison of the two tests can be carried out in the following manner. Define two sets S_y and S_e as

  S_y = {x | p_y(x) ≤ α, p_e(x) > α},  S_e = {x | p_y(x) > α, p_e(x) ≤ α},  (22)

where p_y(x) and p_e(x) are the p-values corresponding to T_y and T_e, respectively, and α is the target significance level. If the probability of S_y under the target alternatives is greater than that of S_e, the test T_y should be used. The choice is possible in advance of the current experiment because the powers of the two tests can be evaluated from the design of the current experiment, the target alternatives, and the realized values of the historical control, which specify the range of the nuisance parameter π_0. Though we have not confirmed it, the method of Monte Carlo simulation seems to be effective for evaluating these probabilities.
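The "exact" conditional calculation behind a statistic such as T_e can be sketched as follows: under H_0, conditionally on the total X, the group counts follow a multivariate hypergeometric distribution, so an exact p-value is obtained by enumeration. This sketch, under our own naming, is practical for small designs only and does not reproduce the four-p-value procedure for T_y, which additionally requires confidence limits for the nuisance parameters.

```python
from itertools import product
from math import comb

def exact_trend_pvalue(x, d, n):
    """Exact conditional p-value of the trend statistic
    sum_i (d_i - d_0) * X_i, given the total X = sum_i X_i, for
    (a+1) groups of common size n under H_0 (equal rates)."""
    a1, total = len(x), sum(x)
    def stat(counts):
        return sum((d[i] - d[0]) * counts[i] for i in range(a1))
    t_obs = stat(x)
    num = den = 0.0
    # Enumerate all ways to distribute `total` responders over the groups.
    for counts in product(range(min(n, total) + 1), repeat=a1 - 1):
        last = total - sum(counts)
        if 0 <= last <= n:
            full = counts + (last,)
            w = 1.0
            for c in full:                     # multivariate hypergeometric
                w *= comb(n, c)                # weight, up to a constant
            den += w
            if stat(full) >= t_obs:
                num += w
    return num / den
```

Conditioning on X removes the common response rate, which is why the calculation needs no estimate of π_0.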

Small Group Size
In toxicological experiments using large animals such as dogs or monkeys, the group size is as small as three or four. In such cases, any testing procedure for a hypothesis rarely yields significant results because of the lack of power. The incorporation of the historical control is thus highly attractive as a means of increasing the power of the test. This is the case C3 mentioned above.
In this case, we naturally assume that the observed response is quantitative. Assume, as before, that the current experiment consists of (a+1) groups of size n with doses d_i, and that the observed responses are independent normal variables. Let the jth response of the group A_i be X_ij and assume the structural model X_ij = μ + β(d_i − d_0) + e_ij. Following Margolin and Risko (4), when the two variance components σ_a^2 and σ_u^2 are known, the maximum likelihood estimator of β is a weighted combination of two estimators,

  β* = w β̂ + (1 − w) β̂_h,

where β̂ is the usual estimator of β based on the current experiment, β̂_h is the interblock estimator of β based on the historical control and the overall mean of the X's, and the weight w is a monotone increasing function of σ_a^2/σ_u^2, whose explicit form is shown in Margolin and Risko (4). If the historical control can be regarded as a random quantity, then we can test the hypothesis H_0: β = 0 by standardizing β* with the square root of Var(β*), but that is not consistent with our viewpoint. β̂_h should be regarded as a given constant, more or less deviated from the true value of β. In addition, σ_a^2 and σ_u^2 are unknown in real situations. They are estimated from the historical control and from the within-group variances of the current experiment. This induces an estimation error in the weight w and causes an inflation or deflation of type I errors. We have to devise a conservative procedure that controls the type I error within a target significance level. In principle, the same idea as in the previous section is available, that is, to use confidence limits in place of the true value of w; but its realization has not been achieved up to now. Proposals of practical procedures are left for future studies.
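The weighted combination of the two slope estimators can be sketched as follows, with the weight w supplied externally. The function name and data are illustrative assumptions; in the conservative spirit of the text, one would evaluate the combination at confidence limits for w rather than at a point estimate.

```python
import numpy as np

def combined_slope(x_cur, d, beta_h, w):
    """Weighted combination w*beta_c + (1-w)*beta_h of the usual
    least-squares slope beta_c from the current experiment with an
    interblock slope estimate beta_h taken from the historical control.
    x_cur: list of per-group response arrays; d: corresponding doses."""
    d = np.asarray(d, float)
    means = np.asarray([np.mean(g) for g in x_cur])   # group means
    dc = d - d.mean()
    beta_c = np.sum(dc * means) / np.sum(dc ** 2)     # usual slope estimator
    return w * beta_c + (1.0 - w) * beta_h
```

With only three or four animals per group, beta_c is noisy, which is exactly why the historical term is attractive; the open problem noted above is how to keep the resulting test conservative when w itself is estimated.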

Historical Control as a Reference
In chronic toxicity studies, hundreds of items are inspected over a long time interval within one experiment. This brings about many repetitions of statistical tests and causes an inflation of type I errors, that is, frequent occurrences of false-positive results due to the multiplicity of tests. Aware of such errors, toxicologists do not usually accept the results of statistical data analysis unless the results are confirmed by toxicological and/or biological knowledge. When a toxicologist believes an observed statistical significance to be a realization of a type I error, he or she wants to validate this belief with evidence. In such circumstances, the historical control is used as evidence that the statistical test gave a false positive. In fact, Matsumoto (10) found and reported many such cases through a survey of a volume of a journal. This is the case C4 mentioned in the initial section.
Our opinion on the use of the historical control in this case is rather negative, because the variability among experiments is so large that the observed deviation of a treatment group from the concurrent control group can almost always be dismissed by using the historical control as the reference. This violates the rationality of the statistical reasoning. In this case, we recommend, in principle, the use of the distribution of p-values (11) to evaluate the inflation of the type I errors, or the reduction of the many items to a few end points to avoid multiplicity (12), though the construction of practical procedures is not easy.

Concluding Remarks
Two points are emphasized in this paper: one is that the historical control should be regarded as a given condition, and the other is that it should be used conservatively. We recommend the incorporation of historical controls only when it is advantageous under such a conditional evaluation of performance; even then, the controls should be used conservatively.
In this paper, we considered only simple situations and simple procedures. In such cases, the choice of whether to incorporate the historical control is not difficult, because the performance of the two procedures can be, at least approximately, evaluated and compared from the design of the current experiment, the target alternatives, and the realized values of the historical control, all of which are available in advance. The application of this viewpoint seems easy in more complicated situations if we concentrate our attention on simple procedures. In real situations, however, a more complex procedure may fit our viewpoint better than such simple procedures; one example is shown in Hayashi et al. (1). The evaluation of such complex procedures is left for future investigations.