Statistical analysis of K 2 x 2 tables: a comparative study of estimators/test statistics for association and homogeneity.

In order to control for confounding variables, epidemiologists often obtain data in the form of a 2 x 2 table. One variable is usually the disease status, while the other variable represents a dichotomous exposure variable that is suspected of being a risk factor. If a confounding variable is present, the data are often stratified into several 2 x 2 tables. The objectives of the analysis are to test for the association between the suspected risk factor and the disease and to estimate the strength of this relationship. Before estimating a common odds ratio, it is important to check whether the odds ratios are homogeneous. This paper presents the results of a Monte Carlo study that was performed to determine the size and power of a number of tests of association and homogeneity when the data are sparse. We also evaluated the performance of three estimators of the common odds ratio. For the Monte Carlo studies, equal numbers of cases and controls were used in a wide variety of sparse data situations. On the basis of these studies, we recommend the Breslow-Day test for nonsparse data, and the T4 and T5 statistics for sparse data to test for homogeneity. The Mantel-Haenszel test of association is recommended for sparse and nonsparse data sets. With sparse data, none of the odds ratio estimators are entirely satisfactory.


Introduction
Epidemiologists often stratify data to control for a confounding variable in order to evaluate the relationship between a suspected risk factor and disease. If K levels of the confounding variable are used, and if the risk factor and the disease are dichotomous, then the data can be arranged in a K x 2 x 2 table of observed cell counts.
A common objective is to perform a test of association between the disease and the risk factor after controlling for the confounding variable. Another common objective is to estimate the common disease (or exposure) odds ratio. Before performing a test of association, it is desirable to determine whether the assumption of a common disease (or exposure) odds ratio is tenable. Therefore, a test of homogeneity is often a first step in the analysis of several 2 x 2 tables.
Many tests and estimates have been proposed for multiway contingency tables (1). However, tests that are based on asymptotic distribution theory may not be valid when used on tables having few observations in some of the cells. One objective of this study is to evaluate the behavior of these large sample tests when some of the counts are small, i.e., for sparse data.
In the last few years some homogeneity tests have been designed specifically for the sparse data setting, where the number of strata (K) is large but the cell counts are small (2). A second objective is to evaluate the performance of these sparse data tests. Monte Carlo methods were used in this study since analytic comparisons are not feasible for these situations. This paper reviews a portion of the more detailed studies we have published on these topics (3,4).

Description of Simulation
In both studies (3,4) we use Monte Carlo methods to generate cell counts for K x 2 x 2 tables. These cell counts are used to compute homogeneity tests, association tests, and odds ratio estimators. For the ith 2 x 2 table we use the notation for the cell counts presented in Table 1.
Let $x_i$ be the binomial count from $n_i$ independent trials with probability of success $p_{1i}$, and let $y_i$ be an independent binomial count from $m_i$ independent trials with probability of success $p_{2i}$. In a case-control study $x_i$ is the number of exposed cases while $y_i$ is the number of exposed controls. Let $t_i = x_i + y_i$. For the Monte Carlo study we specify the probability of a control having been exposed ($p_{2i}$) for $i = 1, \ldots, K$ and we specify the odds ratio ($\psi_i$) for $i = 1, \ldots, K$.
Using the formula

$$p_{1i} = \frac{\psi_i \, p_{2i}}{1 - p_{2i} + \psi_i \, p_{2i}}$$

we compute the probability of a case having been exposed for $i = 1, \ldots, K$. For the $i$th stratum the number of exposed cases ($x_i$) is the number of random numbers that are less than $p_{1i}$ out of $n_i$ calls to the uniform [0,1) random number generator. The number of exposed controls ($y_i$) is obtained in a similar fashion, and the remainder of the table is computed by subtraction. In our simulation studies equal numbers of cases and controls were used in all strata. Also, for most of the studies the strata were balanced so that $m_i = n_i = N/(2K)$, where $N$ is the total sample size.
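The generation scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function names, the tuple layout (x, n, y, m), the seed, and the example parameter values (N = 256, K = 8, control exposure probability 0.3, odds ratio 2.0) are all hypothetical.

```python
import random

def exposure_prob_cases(p2, psi):
    """Exposure probability for cases implied by the control exposure
    probability p2 and the odds ratio psi."""
    return psi * p2 / (1.0 - p2 + psi * p2)

def simulate_stratum(n, m, p2, psi, rng):
    """Return (x, n, y, m): x exposed cases out of n, y exposed controls out of m."""
    p1 = exposure_prob_cases(p2, psi)
    # Count the uniform [0,1) draws that fall below the exposure probability.
    x = sum(rng.random() < p1 for _ in range(n))
    y = sum(rng.random() < p2 for _ in range(m))
    return (x, n, y, m)

def simulate_tables(N, K, p2s, psis, seed=0):
    """Generate K balanced strata with n_i = m_i = N // (2K)."""
    rng = random.Random(seed)
    n = m = N // (2 * K)
    return [simulate_stratum(n, m, p2, psi, rng) for p2, psi in zip(p2s, psis)]

# Hypothetical configuration: 128 cases and 128 controls spread over 8 strata.
tables = simulate_tables(N=256, K=8, p2s=[0.3] * 8, psis=[2.0] * 8)
for x, n, y, m in tables:
    # Unexposed counts follow by subtraction.
    print(f"cases: {x} exposed, {n - x} unexposed; controls: {y} exposed, {m - y} unexposed")
```

Each stratum is stored as the pair of binomial counts together with its sample sizes, so the four cells of the 2 x 2 table can always be recovered by subtraction.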

Tests of Homogeneity
The likelihood ratio test of homogeneity (LRTH) and the Pearson test of homogeneity (PH) can be computed from the maximum likelihood cell estimates (1) of the cell probabilities. These estimates are obtained from the iterative proportional fitting algorithm (5).
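Breslow and Day (6) proposed the statistic

$$BD = \sum_{i=1}^{K} \frac{\{x_i - e_i(\hat{\psi}_{MH})\}^2}{\mathrm{Var}(x_i \mid \hat{\psi}_{MH})},$$

where $\hat{\psi}_{MH}$ is the Mantel-Haenszel (7) estimator of the common odds ratio; $e_i(\hat{\psi}_{MH})$ is the expected value of $x_i$ given $\hat{\psi}_{MH}$ and is computed as the solution to the quadratic equation

$$e_i (m_i - t_i + e_i) = \hat{\psi}_{MH} (n_i - e_i)(t_i - e_i);$$

and the variance estimator is given by

$$\mathrm{Var}(x_i \mid \hat{\psi}_{MH}) = \left\{ \frac{1}{e_i} + \frac{1}{n_i - e_i} + \frac{1}{t_i - e_i} + \frac{1}{m_i - t_i + e_i} \right\}^{-1}.$$

Tarone (8) ...

The second score test statistic is a normal approximation for a mixture model and is given by ...

The weighted least squares test statistic for association according to Wolf (10) ...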
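As an illustration, the Breslow-Day statistic can be computed directly from the quantities defined above: the Mantel-Haenszel estimate, the root of the quadratic equation for the expected count, and the variance term. The sketch below is a minimal one under stated assumptions: each stratum is a hypothetical tuple (x, n, y, m), strata with t = 0 or t = n + m are skipped because they carry no information about the odds ratio, and the Mantel-Haenszel estimate is assumed to be finite and positive. The function names are illustrative.

```python
import math

def mh_odds_ratio(tables):
    """Mantel-Haenszel estimate of the common odds ratio; strata are (x, n, y, m)."""
    num = sum(x * (m - y) / (n + m) for x, n, y, m in tables)
    den = sum((n - x) * y / (n + m) for x, n, y, m in tables)
    return num / den

def expected_exposed_cases(n, m, t, psi):
    """Solve e(m - t + e) = psi (n - e)(t - e) for the admissible root e."""
    if abs(psi - 1.0) < 1e-12:
        return n * t / (n + m)          # the equation is linear when psi = 1
    a = 1.0 - psi
    b = (m - t) + psi * (n + t)
    c = -psi * n * t
    disc = math.sqrt(b * b - 4.0 * a * c)
    lo, hi = max(0, t - m), min(n, t)   # range in which all four cells are nonnegative
    for root in ((-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)):
        if lo - 1e-9 <= root <= hi + 1e-9:
            return root
    raise ValueError("no admissible root found")

def breslow_day(tables):
    """Return (psi_mh, BD) for a list of strata (x, n, y, m)."""
    psi = mh_odds_ratio(tables)
    stat = 0.0
    for x, n, y, m in tables:
        t = x + y
        if t == 0 or t == n + m:
            continue                    # uninformative stratum (assumption: skip it)
        e = expected_exposed_cases(n, m, t, psi)
        var = 1.0 / (1.0 / e + 1.0 / (n - e) + 1.0 / (t - e) + 1.0 / (m - t + e))
        stat += (x - e) ** 2 / var
    return psi, stat

# Two strata whose observed odds ratios differ (3 and 1/3).
psi_hat, bd = breslow_day([(10, 20, 5, 20), (5, 20, 10, 20)])
print(psi_hat, bd)
```

In large strata the statistic is usually referred to a chi-square distribution with K - 1 degrees of freedom; the critical value lookup is left out of the sketch.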

Odds Ratio Estimators
The three odds ratio estimators that are used in this study are the Mantel-Haenszel (1) estimator, the weighted least squares estimator (9), and the conditional maximum likelihood estimator (10).
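Two of these estimators have simple closed forms and can be sketched as follows. The code is an illustration rather than the implementation used in the study. It assumes the usual inverse-variance weighting of the stratum log odds ratios for the weighted least squares estimator, and the `correction` argument (0.5 added to every cell so that zero cells do not break the logarithm) is an assumption of this sketch, since the handling of zero cells is not stated here. The conditional maximum likelihood estimator requires an iterative solution and is omitted.

```python
import math

def mh_estimator(tables):
    """Mantel-Haenszel estimator; each stratum is (x, n, y, m)."""
    num = sum(x * (m - y) / (n + m) for x, n, y, m in tables)
    den = sum((n - x) * y / (n + m) for x, n, y, m in tables)
    return num / den

def wls_estimator(tables, correction=0.5):
    """Weighted average of stratum log odds ratios with inverse-variance weights.

    `correction` is added to every cell (an assumption of this sketch);
    with correction=0.0 any stratum containing a zero cell raises an error.
    """
    weighted_sum = weight_total = 0.0
    for x, n, y, m in tables:
        # a, b: exposed and unexposed cases; c, d: exposed and unexposed controls
        a, b, c, d = (v + correction for v in (x, n - x, y, m - y))
        log_or = math.log(a * d / (b * c))
        w = 1.0 / (1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
        weighted_sum += w * log_or
        weight_total += w
    return math.exp(weighted_sum / weight_total)

# A small sparse example with 8 cases and 8 controls per stratum (made-up counts).
tables = [(3, 8, 1, 8), (2, 8, 2, 8), (4, 8, 1, 8), (0, 8, 1, 8)]
print(mh_estimator(tables), wls_estimator(tables))
```

The last stratum contains a zero cell: the Mantel-Haenszel estimator handles it without adjustment, whereas the weighted least squares estimator depends on the chosen correction.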

Results of Monte Carlo Study
Full details of the simulations are described elsewhere (3,4). Here we only describe the key findings from our studies.

Results for Tests of Homogeneity
The sizes of the tests of homogeneity are estimated from the percentage of times the hypothesis of a common odds ratio is rejected when it is true. When compared to the tabulated chi-square values, the tests based on PH, BD, MBD, and CS generally maintain their nominal size in the large stratum situation, while the test based on LRTH rejects much too often. In the sparse data situation the tests based on T4 and T5 maintain their size, while the tests using PH, BD, MBD, and CS generally do not reject often enough.
The powers of the tests are estimated by the proportion of times the test statistics lead to rejection of the hypothesis of a common odds ratio when the odds ratios were not held constant. The odds ratios were generated according to lognormal, exponential, two-point, and uniform distributions. Among the tests that maintain their sizes near the 5% level, the PH, BD, MBD, and CS tests have about equal power in the large stratum setting, where they are superior to T4 and T5. In the sparse data setting T4 and T5 are generally more powerful than the other statistics. It should be noted that all of these tests of homogeneity have low power. For example, with 128 cases and 128 controls, if we generate $\psi_i$ from a uniform [1.0, 4.0] distribution, the power is less than 13% for all of the tests for K = 2, 4, 8, 16, and 32. Also, for many situations studied, the power is not sensitive to the number of strata (K) so long as the total sample size is kept constant. Jones et al. (4) give further details and a discussion concerning the reasons for the low power.
We also studied the situation where 50% of the cases and controls were placed in one large table while the remainder were placed equally among the other tables. In these unbalanced tables we used $\psi_i = 1$ for the large table and $\psi_i > 1$ for the other tables. For these situations the test based on T5 was most powerful and the test based on BD was the second most powerful. In our studies T5 performed well in both the balanced and unbalanced sparse data settings, while the BD statistic performed well in the large stratum settings.

Results for Tests of Association
The MH test maintained its size for both large stratum and small stratum situations. The LRA test held its size for large strata but tended to be anti-conservative in the small stratum setting. The PA and WLS tests maintained their size for the large stratum case but were much too conservative with sparse data. The powers of these tests were estimated by the proportion of simulations that led to rejection of the hypothesis of a common odds ratio of 1.0 when the common odds ratio exceeded 1.0. The powers of the LRA and MH tests were approximately equal and were not related to the number of strata used. Because of their conservative sizes, the powers of the PA and WLS tests were considerably below the powers of the LRA and MH tests with sparse data.
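The way size and power are estimated for a test of association can be sketched with the Mantel-Haenszel test. The code below is a minimal illustration, not the study code. Its assumptions are: the statistic is computed without a continuity correction, the 5% critical value 3.841 of the chi-square distribution with 1 degree of freedom is used, and the configuration (128 cases and 128 controls in 32 strata, control exposure probability 0.3, odds ratio 2.0 under the alternative, 1000 replicates) is a hypothetical example. All names are illustrative.

```python
import random

CHI2_1DF_5PCT = 3.841  # upper 5% point of chi-square with 1 degree of freedom

def simulate_tables(n, m, p2, psi, K, rng):
    """K strata (x, n, y, m) generated with a common odds ratio psi."""
    p1 = psi * p2 / (1.0 - p2 + psi * p2)
    return [
        (sum(rng.random() < p1 for _ in range(n)), n,
         sum(rng.random() < p2 for _ in range(m)), m)
        for _ in range(K)
    ]

def mh_chi_square(tables):
    """Mantel-Haenszel statistic, no continuity correction (an assumption)."""
    observed = expected = variance = 0.0
    for x, n, y, m in tables:
        t, total = x + y, n + m
        observed += x
        expected += n * t / total
        variance += n * m * t * (total - t) / (total ** 2 * (total - 1))
    if variance == 0.0:
        return 0.0  # no informative stratum; counted as a non-rejection
    return (observed - expected) ** 2 / variance

def rejection_rate(psi, K, cases=128, p2=0.3, reps=1000, seed=1):
    """Proportion of simulated data sets in which the test rejects at the 5% level."""
    rng = random.Random(seed)
    n = m = cases // K
    rejections = sum(
        mh_chi_square(simulate_tables(n, m, p2, psi, K, rng)) > CHI2_1DF_5PCT
        for _ in range(reps)
    )
    return rejections / reps

size = rejection_rate(psi=1.0, K=32)    # null hypothesis true: estimates the size
power = rejection_rate(psi=2.0, K=32)   # common odds ratio above 1: estimates the power
print(f"estimated size {size:.3f}, estimated power {power:.3f}")
```

Repeating the two calls for several values of K with the total number of cases held fixed gives a rough check of how the rejection rates depend on the degree of stratification.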

Results for Odds Ratio Estimators
The median and the interquartile ranges of the three odds ratio estimators were also estimated in the Monte Carlo study. For nonsparse data the variabilities of the three odds ratio estimators are approximately equal. For sparse data the interquartile range of the WLS estimator is less than that of $\hat{\psi}_{MH}$ and $\hat{\psi}_{MCLE}$.

Summary
We compared the performance of three combined odds ratio estimators and four tests of association using Monte Carlo techniques (3,4). For these Monte Carlo studies a constant odds ratio is used with an equal number of cases and controls. In addition, a wide range of odds ratios, probabilities of exposure, numbers of cases, and strata are used. For each of the K 2 x 2 tables, 1000 simulations were generated for each configuration of the parameters studied. The Mantel-Haenszel (7), the weighted least squares (9), and maximum conditional likelihood (10) estimators of the odds ratio were computed. In addition, the likelihood ratio (1), Mantel-Haenszel, Pearson, and weighted least squares tests of association are studied. These studies indicate that the interquartile range of the weighted least squares estimator is usually less than that of the other estimators, although in many situations the median of this least squares estimator is far from the population odds ratio. With sparse data the Mantel-Haenszel test for association maintains its size. For the range of parameters studied here, the degree of stratification does not greatly affect the power of the likelihood ratio and the Mantel-Haenszel test statistics.
In addition to studying measures of association and tests for association, we also examine several tests for homogeneity. We conclude that the Breslow-Day statistic (6) is a reasonable statistic for use in nonsparse data settings when taking into account both the size and power of the test. In balanced sparse data settings the T4 statistic of Liang and Self (2) performs the best when all tables, regardless of sample size, have odds ratios generated from the same distribution. In sparse data settings characterized by a large table with an odds ratio of 1 and many small tables with odds ratios greater than 1, the T5 statistic of Liang and Self (2) performs the best. One result of these investigations is that virtually all of the homogeneity tests have generally low power in the presence of sparse data.

Recommendations
The Breslow-Day test of homogeneity is recommended for nonsparse data. For sparse data the T4 and T5 statistics are the most powerful tests of homogeneity and are recommended. The choice between T4 and T5 should be based on considerations found in (4). For tests of association the Mantel-Haenszel test is recommended. The three estimators studied here cannot be recommended for sparse data, although the Mantel-Haenszel estimator performs reasonably well. A modified version of the Mantel-Haenszel estimator studied by Hauck et al. (12) may be preferred in extreme sparse data settings.