Effect of a preliminary test of homogeneity of stratum-specific odds ratios on their confidence intervals

Consider a case-control study in which the aim is to assess the effect of a factor on disease occurrence. We suppose that this factor is dichotomous. Also suppose that the data consists of two strata, each stratum summarized by a two-by-two table. A commonly-proposed two-stage analysis of this type of data is the following. We carry out a preliminary test of homogeneity of the stratum-specific odds ratios. If the null hypothesis of homogeneity is accepted then we find a confidence interval for the assumed common value (across strata) of the odds ratio. We examine the statistical properties of this two-stage analysis, based on the Woolf method, on confidence intervals constructed for the stratum-specific odds ratios, for large numbers of cases and controls for each stratum. We provide both a Monte Carlo simulation method and an elegant large-sample method for this examination. These methods are applied to obtain numerical results in the context of the large numbers of cases and controls for each stratum that arose in a real-life dataset. In this context, we find that the preliminary test of homogeneity of the stratum-specific odds ratios has a very harmful effect on the coverage probabilities of these confidence intervals.


Introduction
Consider a case-control study in which the aim is to assess a factor's effect on disease occurrence. We suppose that this factor is dichotomous. Also suppose that the data consists of two strata, each stratum summarized by a 2 × 2 table. The parameters of interest are the stratum-specific odds ratios. A commonly-proposed two-stage analysis of this type of data is the following, see e.g. section 4.4 of Breslow and Day (1980), Section 16.2 of Pagano and Gauvreau (2000) and Section 13.6 of Rosner (2011). We carry out a preliminary test of homogeneity of the stratum-specific odds ratios. If the null hypothesis of homogeneity is accepted then we find a confidence interval for the assumed common value (across strata) of the odds ratio.
From a practical point of view, we must state what action we take when the null hypothesis of homogeneity of stratum-specific odds ratios is rejected. It would not make sense for a consulting statistician to tell a client that this null hypothesis has been rejected and so the statistician will do nothing. There is some awareness of the need to clearly state what action we take when the null hypothesis of homogeneity is rejected, see e.g. p.279 of Rothman et al (2008) and p.620 of Rosner (2011).
The latter states that "If the true ORs are significantly different, then it makes no sense to obtain a pooled-OR estimate ... Instead, separate ORs should be reported".
We suppose that when the null hypothesis of homogeneity is rejected, we compute confidence intervals for each of the stratum-specific odds ratios.
Our aim is to examine the statistical properties of this two-stage analysis, in the context of simultaneous inference for the stratum-specific odds ratios. In Section 2, we provide a precise general formulation of this two-stage analysis. We examine the statistical properties of this two-stage analysis using the Woolf method (described e.g. on p.139 of Breslow and Day, 1980) to carry out the preliminary test of homogeneity of the stratum-specific odds ratios and to construct confidence intervals for the stratum-specific odds ratios, for two strata and large numbers of cases and controls for each stratum. We provide both a Monte Carlo simulation method and an elegant large-sample method for this examination. These methods are applied to obtain numerical results in the context of case and control sample sizes that come from a study whose aim is to assess the effect of the consumption of caffeinated coffee on nonfatal myocardial infarctions for adult males under the age of 55 (Pagano andGauvreau, 2000 andRosenberg et al, 1988). Our general conclusion is that the preliminary test of homogeneity of the stratum-specific odds ratios has a very harmful effect on the coverage probabilities of the confidence intervals for these odds ratios, for two strata when the numbers of cases and controls in each stratum is large.

Precise general formulation of the two-stage analysis
For easier cross-referencing with the notation used in Section 3, we phrase our discussion in terms of log odds ratios. Let θ i denote the log odds ratio for the i th stratum (i = 1, 2). To provide a precise general formulation of the two-stage analysis, our first step is to describe what we would do if it was known with certainty (a) that θ 1 = θ 2 and (b) that θ 1 = θ 2 . We consider simultaneous inference for θ 1 and θ 2 . Consequently, this description is in terms of simultaneous confidence intervals for θ 1 and θ 2 .
• Suppose that θ 1 = θ 2 LetĴ be the confidence interval for θ = θ 1 = θ 2 based on the two-by-two tables for both strata, with approximate coverage 1 − α. LetĴ 1 =Ĵ andĴ 2 =Ĵ be confidence intervals for θ 1 and θ 2 , respectively. These confidence intervals have simultaneous The two-stage analysis is precisely formulated as follows. If the null hypothesis of homogeneity is rejected then we use the confidence intervalsÎ 1 andÎ 2 for θ 1 and θ 2 , respectively. If, on the other hand, this null hypothesis is accepted then we use the confidence intervalsĴ 1 andĴ 2 for θ 1 and θ 2 , respectively. The nominal simultaneous coverage of the resulting confidence intervals for θ 1 and θ 2 is 1 − α. We will assess this two-stage analysis by comparing the actual simultaneous coverage probability of these confidence intervals with 1 − α. Of course, there are several different possible choices of preliminary test of homogeneity and confidence intervals that can be used in the two-stage analysis. As explained in the next section, we use tests and confidence intervals based on the Woolf method.
3. The two-stage analysis that will be evaluated We use the following notation for the 2 × 2 contingency table that summarizes the data for the i th stratum (i = 1, 2). Let n i denote the number of subjects with the disease (cases), with y i of these subjects exposed to the factor. Also, let n ′ i denote the number of subjects without the disease (controls), with y ′ i of these subjects exposed to the factor. We use upper case to denote random variables and lower case to denote observed values. Thus, for example, Y i is the random variable corresponding to the observed value y i . We use the following model for the data in this table.
The random variables Y i and Y ′ i are independent, with Y i ∼ Binomial(n i , p i ) and Let ǫ be a specified small positive number (0 < ǫ < 1 2 ). Suppose that p i ∈ [ǫ, 1 − ǫ] and p ′ i ∈ [ǫ, 1 − ǫ], for i = 1, 2. The parameter of interest for this table is the odds ratio We find a confidence interval for ψ i as follows. We first find a confidence interval for the log odds ratio θ i = ln(ψ i ) and then transform this in the obvious way into a confidence interval for the odds ratio ψ i .
We consider the two-stage analysis, described in Section 2, implemented using Woolf's method. The maximum likelihood estimates of p i and p ′ i arep i = y i /n i and p ′ i = y ′ i /n ′ i , respectively. The resulting estimator of θ i is .
This estimator has a number of disadvantages, including the fact that it is undefined for y i either 0 or n i and for y ′ i either 0 or n ′ i . We do not use this estimator. Instead, we follow the common recommendation (see e.g. page 139 of Breslow and Day, 1980) of estimating p i and p respectively. The resulting estimator of θ i iŝ .
This estimator has the following three advantages. Firstly, it is defined for all possible values of y i and y ′ i . Secondly, according to page 32 of Cox and Snell (1989), Θ i is an asymptotically less biased estimator of θ i thanΘ i . Thirdly, the use of this type of adjustment of the maximum likelihood estimates of p i and p ′ i can be remarkably effective in improving the coverage probability properties of Wald-type confidence intervals based on these estimates, see e.g. Agresti and Caffo (2000).
Woolf's method is based on the approximation that and the approximation that σ 2 i is equal tô We test the null hypothesis of homogeneity H 0 : θ 1 = θ 2 against the alternative hypothesis H 1 that the θ 1 = θ 2 . We carry out this test using the test statistiĉ We make the approximation thatT ∼ N(0, 1) under H 0 . Let β denote the nominal level of significance of this test.
If H 0 is rejected then the confidence intervals for θ 1 and θ 2 , with nominal simultaneous coverage 1 − α, areÎ 1 andÎ 2 respectively, wherê If, on the other hand, H 0 is accepted then we carry out inference based on the assumption that , which is the estimator of θ assuming that θ 1 = θ 2 = θ. If θ = θ 1 = θ 2 then the following confidence interval for θ has nominal coverage 1 − α: If H 0 is accepted then the confidence intervals for θ 1 and θ 2 , with nominal simultaneous coverage 1 − α, areĴ 1 andĴ 2 , respectively.

Application to case and control sample sizes that arise in a real-life data set
Consider the case-control study data with two strata, described on p. 376 of Pagano and Gauvreau (2000). This data originates from Rosenberg et al (1988). Pagano and Gauvreau (2000) carry out a preliminary test of homogeneity of the odds ratios for these 2 strata, which is almost identical to that described in the previous section.
They conclude that they cannot reject the null hypothesis of the odds ratios being the same for these 2 strata. They then use the Mantel-Haenszel method to estimate the odds ratio, which is assumed to be the same for both of these strata. However, the Mantel-Haenszel method is inefficient, in the context of a fixed number of strata and large numbers of cases and controls for each stratum, unless special circumstances hold (Tarone et al, 1983). This is one of the reasons why we estimate the common odds ratio from the two strata in the previous section using Woolf's method. The other reason for doing this is that this permits us to find the elegant large-sample approximation described in Section 5.
Our aim is to find confidence intervals for θ 1 and θ 2 with simultaneous coverage 1 − α. We suppose that 1 − α = 0.95. We also suppose that (p 1 , p ′ 1 , p 2 , p ′ 2 ) belongs to the set A = [0.02, 0.98] 4 . Under this restriction, it is expected that the distributions ofΘ 1 andΘ 2 will be close to normal. To see this, consider the rule-of-thumb that the cdf of Y ∼ Binomial(n, p) is approximated well by the N(np, np(1 − p)) cdf if np(1 − p) ≥ 5 (see e.g. p.133 of Rosner, 2011). Note that Consequently, we expect the distributions ofΘ 1 andΘ 2 to be close to normal.
Firstly, consider the Woolf method confidence intervalsÎ 1 andÎ 2 , when we do not carry out a preliminary test of homogeneity of the stratum-specific odds ratios.
Because n 1 , n ′ 1 , n 2 and n ′ 2 are large, we expect that the simultaneous coverage probability P (θ 1 ∈Î 1 , θ 2 ∈Î 2 ) will not fall far below 1−α = 0.95 for all (p 1 , p ′ 1 , p 2 , p ′ 2 ) in A. Using the simulation method described in Appendix A, we obtained the rough estimate 0.951 of the minimum simultaneous coverage probability. This coverage probability is attained at (p 1 , p ′ 1 , p 2 , p ′ 2 ) = (0.596, 0.788, 0.308, 0.788). All of the computations presented in this paper were performed with programs written in MATLAB, using the statistics toolbox. This shows that the confidence intervals resulting from the two-stage analysis are completely inadequate.
Firstly,Θ i has an N(θ i , σ 2 i ) distribution, when both n i and n ′ i are large (i = 1, 2). Secondly, in the expressions forΘ,T ,Ĵ andÎ i (given in the previous section), we may replaceσ i by σ i (i = 1, 2). Thirdly, we assume that σ 2 1 and σ 2 2 are known. In Appendix C, we apply this approximation to obtain a large-sample approximation to the simultaneous coverage probability of the confidence intervals for θ 1 and θ 2 , with nominal simultaneous coverage 1 − α, resulting from the two-stage analysis.

Numerical results obtained using the large-sample approximation
The case-control study data described in Section 3 consists of two strata with sample sizes n 1 = 1092, n ′ 1 = 467, n 2 = 449 and n ′ 2 = 488. In the present section, we consider the same number of strata and the same sample sizes. Our aim is to find confidence intervals for θ 1 and θ 2 with simultaneous coverage 1 − α. As in Section 3, we suppose that (p 1 , p ′ 1 , p 2 , p ′ 2 ) belongs to the set A = [0.02, 0.98] 4 .
Firstly, consider the Woolf method confidence intervalsÎ 1 andÎ 2 , when we do not carry out a preliminary test of homogeneity of the stratum-specific odds ratios. Obviously, the large-sample approximation described in Section 4 tells us that Now consider the two-stage analysis. As in Section 3, suppose that 1 − α = 0.95. . The large-sample approximation is a smooth function of (p 1 , p ′ 1 , p 2 , p ′ 2 ) and so this minimum value can be expected to be an accurate approximation to large sample approximation minimized over (p 1 , p ′ 1 , p 2 , p ′ 2 ) in A.
For the two-stage analysis, the simulation estimate of the minimum simultaneous coverage probability of the confidence intervals for the stratum-specific odds ratios was found to be 0.131. This is quite close to the minimum value of the large-sample approximation to the simultaneous coverage probability of these confidence intervals, which was found to be 0.134847. In this context, we find that the preliminary test of homogeneity of the stratum-specific odds ratios has a very harmful effect on the coverage probabilities of these confidence intervals.
7. The simultaneous coverage probability of the confidence intervals resulting from the two-stage analysis is small away from the boundaries of the parameter space In this section we deal exclusively with the confidence intervals resulting from the two-stage analysis. As noted in the previous sections, the minimum simultaneous coverage probability is achieved on the boundary of the parameter space A for both the simulation and large-sample estimates of this minimum coverage probability. If this simultaneous coverage probability is small only at or near the boundaries of the parameter space then it might be argued that statistical practitioners need not be concerned about the smallness of the minimum simultaneous coverage probability.
Therefore, it is natural to ask the question: Is this simultaneous coverage probability small only at or near the boundaries of the parameter space?
In this section, we show that this simultaneous coverage probability is also small far from the boundaries of the parameter space A. We do this as follows. Suppose that (p 1 , p ′ 1 , p 2 , p ′ 2 ) belongs to the set A = [ǫ, 1 − ǫ] 4 , where ǫ is a small specified positive number (0 < ǫ < 1 2 ). Let ∆ = (1/n 1 ) + (1/n 2 ) and r = (1/n ′ 1 ) + (1/n ′ 2 ) (1/n 1 ) + (1/n 2 ) Also let ∆ ′ = r ∆. For each p 1 in [ǫ + ∆, 1 − ǫ − ∆] and p ′ 1 in [ǫ + ∆ ′ , 1 − ǫ − ∆ ′ ], we find the minimum over p 2 = p 1 + δ and p ′ of the large-sample simultaneous coverage probability of the confidence intervals for θ 1 and θ 2 resulting from the two-stage analysis. We then examine this partiallyminimized coverage probability using a contour plot of it, as a function of (p 1 , p ′ 1 ) Note that for (p 1 , p ′ 1 ) not close to the boundaries of this set, the partially-minimized coverage is achieved at a value of (p 1 , p ′ 1 , p 2 , p ′ 2 ) not close to the boundaries of A. The large-sample analysis described in Appendix C includes the test statistic T which has an N(λ, 1) distribution, where λ = (θ 1 − θ 2 )/ σ 2 1 + σ 2 2 . In this partial minimization, λ is a function of δ ∈ [−∆, ∆], for each given (p 1 , p ′ 1 ). In Appendix D, we show that the range of this function includes the interval [−2, 2]. Therefore, this partial minimization includes the consideration of a wide interval of values of λ, suggesting that the partiallyminimized coverage will be quite small. Of course, whether or not this is, indeed, the case needs to be assessed numerically.
Consider the case-control study data described in Section 3, which consists of two strata with sample sizes n 1 = 1092, n ′ 1 = 467, n 2 = 449 and n ′ 2 = 488. In this case, ∆ = 0.056062 and r = 1.333316, so that ∆ ′ = 0.074748. Figure 1 is a contour plot of the partially minimized coverage probability, as a function of Figure  1 demonstrates that the large-sample approximation to the simultaneous coverage probability is much less than 0.95 for (p 1 , p ′ 1 , p 2 , p ′ 2 ) far from the boundaries of the It is straightforward to show that the harmful effect of the preliminary test of homogeneity of the stratum-specific odds ratios does not disappear as the sample sizes increase. Consider n 1 = 1092N, n ′ 1 = 467N, n 2 = 449N and n ′ 2 = 488N, where N is a positive integer. It may be shown that, as we increase N, the partiallyminimized coverage probability converges to a limiting value for each (p 1 , p ′ 1 ). The contour plot shown in Figure 1 does not differ greatly from the contour plot of this limiting value. In other words, the harmful effect of the preliminary test of homogeneity of the stratum-specific odds ratios does not disappear as the sample sizes increase. This plot shows the partially-minimized coverage probability as a function of (p 1 , p ′ 1 ).

Discussion
The literature on the effect of preliminary statistical model selection (using, for example, hypothesis tests or minimizing a criterion such as AIC) on confidence intervals begins with the work of Freeman (1989) who analyzed the effect of a preliminary test of the null hypothesis of zero differential carryover in a two-treatment two-period crossover trial on the confidence interval for the difference of treatment effects. This literature has grown steadily since this work of Freeman and is reviewed by Kabaila (2009). It is commonly the case that preliminary model selection has a highly detrimental effect on the coverage probability of these confidence intervals. However, each case (specified by a model, a model selection procedure and parameters of interest) needs to be considered individually on its merits.
Our results show that the preliminary test of homogeneity of the stratum-specific odds ratios should not be used. The harmful effect of this preliminary test is very substantial and exists far from the boundaries of the parameter space. Furthermore, this harmful effect does not disappear with increasing sample sizes.

the minimum simultaneous coverage probability
In this appendix we describe the search through the parameter space that was used to find an approximation to the minimum simultaneous coverage probability of specified confidence intervals for the log-odds ratios θ 1 and θ 2 . As shown in Appendix B for the particular case P (θ 1 ∈Î 1 , θ 2 ∈Î 2 ), this simultaneous coverage probability is a discontinuous function of (p 1 , p ′ 1 , p 2 , p ′ 2 ). This makes it difficult to get a very accurate estimate of the simultaneous coverage probability minimized over (p 1 , p ′ 1 , p 2 , p ′ 2 ) in the parameter space A. Nonetheless, the following search method provides a rough estimate of the minimum simultaneous coverage probability.
For a given value of (p 1 , p ′ 1 , p 2 , p ′ 2 ), we estimate the simultaneous coverage probability of the confidence intervals by Monte Carlo simulation. We use a search method of the type described by Kabaila and Giri (2008) (cf. Section 3.1 of Kabaila and Leeb, 2006). Define the step length h = 0.096. The simultaneous coverage probability of these confidence intervals is estimated using M = 10000 simulations for each (p 1 , p ′ 1 , p 2 , p ′ 2 ) belonging to the set (0.02, 0.02+h, 0.02+2h, . . . , 0.98) 4 . The 10 values of (p 1 , p ′ 1 , p 2 , p ′ 2 ) with the lowest estimates of this coverage probability are then selected for further consideration. For each of these 10 values, the coverage probability is then re-estimated using M = 200000 simulations. The value of (p 1 , p ′ 1 , p 2 , p ′ 2 ) with the lowest estimate of this coverage probability is then selected for further consider-ation. For this value, the coverage probability is then re-estimated using M = 10 6 simulations.
If H 0 is rejected then the confidence intervals for θ 1 and θ 2 , with nominal simultaneous coverage 1 − α, are I 1 and I 2 respectively, where , which is the estimator of θ, assuming that θ = θ 1 = θ 2 . If θ = θ 1 = θ 2 then the following confidence interval for θ has coverage 1 − α: Let J 1 = J and J 2 = J. If H 0 is accepted then the confidence intervals for θ 1 and θ 2 , with nominal simultaneous coverage 1 − α, are J 1 and J 2 , respectively.
Our aim is to evaluate the coverage probability of the simultaneous confidence intervals for θ 1 and θ 2 resulting from the above procedure for given n i , n ′ i , p i and p ′ i (i = 1, 2). By the law of total probability, this coverage probability is equal to P θ 1 ∈ J, θ 2 ∈ J, |T | ≤ c β + P θ 1 ∈ I 1 , θ 2 ∈ I 2 , |T | > c β .