Testing homogeneity of effect sizes in pooling 2x2 contingency tables from multiple studies: a comparison of methods

Almalik et al., Cogent Mathematics & Statistics (2018), 5: 1478698. https://doi.org/10.1080/25742558.2018.1478698

© 2018 The Author(s). This open access article is distributed under a Creative Commons Attribution (CC-BY) 4.0 license.
Received: 03 January 2018. Accepted: 13 May 2018. First Published: 24 May 2018.

*Corresponding author: Osama Almalik, Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, the Netherlands. E-mail: O.Almalik@tue.nl

Reviewing Editor: Yueqing Hu, Fudan University School of Life Sciences, China

Additional information is available at the end of the article.

Abstract: Meta-analysis is a statistical methodology that combines the outcomes of several independent studies. Either a fixed-effects or a random-effects model is used to combine the results of the independent studies. The choice depends on whether the outcomes can be considered homogeneous or not. Several methods have been developed to test the homogeneity of effect sizes. Although many of them have been compared already, they have not been studied in more realistic practical settings where individuals each have their own risk of an event, and a complete overview is lacking. In this article, we investigate the performance of 10 statistical methods for testing homogeneity of the effect of a binary exposure on a binary clinical outcome, using an extensive simulation study and a real-life meta-analysis. We evaluated the Type I error and the statistical power. The fixed-effects regression model with a treatment-by-study interaction was found to be a slightly liberal test, while the Q-statistic and the Bliss statistic can be considered conservative tests. The random-effects regression model, the Peto statistic and the I² statistic perform rather poorly compared to the other methods. All chi-square tests that are based on the calculation of a specific odds ratio and the fixed-

ABOUT THE AUTHOR
Edwin van den Heuvel is a full professor at the Department of Mathematics and Computer Science of the Eindhoven University of Technology and head of the Statistics group within the Stochastics section. The Statistics group develops and compares data-analytical methods for analyzing and sampling complex structured correlated datasets. Its work includes parameter estimation, model fitting, latent variable models, mixed models, missing data, statistical process control, survival and reliability theory, time series analysis, and statistical learning methods. One of the central themes is the analysis of high-dimensional temporal datasets and other large datasets.
The group actively explores new research lines in data science. This research article is part of the PhD project of Osama Almalik (the principal author) on meta-analysis for non-standard or complex situations, under the supervision of van den Heuvel. It should be beneficial to medical researchers, as well as statisticians, when carrying out a meta-analysis that combines effect sizes from cohort studies and clinical trials.

PUBLIC INTEREST STATEMENT
Testing homogeneity of effect sizes (treatment effects, risk-factor associations) is an essential part of meta-analysis, since it helps us to interpret the consistency of these effects across groups of participants. The testing can be done with various statistical tools, which can lead to different conclusions. For this reason, we thought it important to study the performance of the existing statistical tests for homogeneity. Specific comparisons have been made before, but we also explored the performance of statistical tests that have never been applied to this particular problem, and we adjusted existing statistical tests. We studied the performance of 10 different methods for testing the homogeneity of effect sizes using a case study and an extensive simulation study. Based on our study, we recommend one specific statistical test to meta-analysts for testing homogeneity, since it outperformed the others in most of the settings we studied.

Introduction
Pooling (analysis) results (like odds ratios) from multiple studies is common practice in the medical, social and biological sciences. It is the main goal of an aggregate-data meta-analysis or of an experimental study in which some kind of stratification is applied (like multi-center randomized controlled trials). Researchers are typically interested mainly in the overall or pooled estimate of a binary exposure or treatment effect, regardless of study differences, since it may better represent an estimate for the full target population than the individual studies do. For a binary outcome, this means pooling 2 × 2 contingency tables from several studies. A traditional approach for pooling 2 × 2 contingency tables is the Cochran-Mantel-Haenszel statistic (Breslow, 1994; Breslow & Day, 1980; Mantel & Haenszel, 1959). However, this approach requires that the odds ratios across studies be homogeneous. This emphasizes the need for a test of the homogeneity of the odds ratios or, more generally, of the homogeneity of treatment effects on the binary clinical outcome. This topic is not new, and many different approaches have been developed.
A classic test for homogeneity of effect sizes is the Q-statistic (Cochran, 1954). It creates a weighted pooled effect size, using the inverse squared within-study standard errors as weights, and evaluates whether the variation around this pooled effect is larger than can be expected from the within-study standard errors. It was initially developed for pooling potency estimates from bioassay experiments (Mzolo, Hendriks, & van den Heuvel, 2013). A similar approach used in this area is the method of Bliss (1952), which uses the Q-statistic but in a different way. Related measures used in systematic reviews are the H², I² and R² statistics (Higgins & Thompson, 2002). The I² statistic is also fully determined by the Q-statistic, and it was initially developed to quantify the amount of heterogeneity rather than to test for homogeneity. The H² statistic is just an alternative measure for I², while the R² statistic describes the inflation of the random-effects estimate relative to the fixed-effects estimate and is viewed more as a measure than as a test statistic. However, all three measures of heterogeneity can be used as test statistics with appropriate confidence intervals.
Alternatively, fixed- and random-effects logistic regression models can be applied to test homogeneity. This is done by including an interaction term between study and treatment, and then applying a Wald test, a score test, or the likelihood ratio test for its significance (Agresti, 2002).
A method that studies the heterogeneity of the odds ratios directly is the Breslow-Day test as adjusted by Tarone (1985). It is a form of Pearson's chi-square statistic. Other types of chi-square test statistics, developed for specific situations, are the Peto statistic (Yusuf, Peto, Lewis, Collins, & Sleight, 1986), the Liang and Self statistics (Liang & Self, 1989) and the Zelen statistic (Zelen, 1971). Yusuf et al. (1986) originally presented the Peto method for pooling odds ratios and accompanied it with a homogeneity test. Liang and Self (1989) focused on developing test statistics for the case in which the data are sparse, i.e. some cells in the study-specific contingency tables contain only a few participants. Zelen (1971) developed his statistic to test the homogeneity of relative risks.
Several of these statistics have been discussed, evaluated and compared on their performance in detecting heterogeneity but, as Sutton and Higgins (2008) point out, an overview of the results is lacking. The Q-statistic has been shown to have low statistical power when the number of studies is small (Higgins & Thompson, 2002; Higgins, Thompson, Deeks, & Altman, 2003; Huedo-Medina, Sánchez-Meca, Marín-Martínez, & Botella, 2006) and to be conservative for a large number of studies (Higgins & Thompson, 2002). Moreover, the application of the Q-statistic to test the homogeneity of treatment effects has recently come under criticism (Hoaglin, 2016). Hoaglin (2016) points out that Cochran did not use the Q-statistic itself as a test statistic, and that under the null hypothesis the Q-statistic approaches a chi-square distribution only asymptotically, as the within-study sample sizes increase (since the individual terms then become approximately normally distributed). In addition, the assumptions that the effect sizes are normally distributed and that the within-study variances are known do not hold in many meta-analyses. This is especially the case when the effect sizes are risk differences, logarithms of risk ratios or logarithms of odds ratios. Hoaglin's concerns also apply directly to the widely used I² (Hoaglin, 2016). Jackson (2006) derived a formula to calculate the power of the homogeneity test based on the Q-statistic. The author considered only the case in which the number of studies is odd and the studies have the same within-study variance, and concluded that the homogeneity test based on the Q-statistic indeed has low power.
Kulinskaya, Dollinger and Bjørkestøl (2011) improved the performance of the Q-statistic using two approximations, a Gamma distribution based on an expansion of the first two moments of the Q-statistic and a chi-square distribution with fractional degrees of freedom matching the first moment, and recommended using the latter approximation. Sánchez-Meca and Marín-Martínez (1997) compared the performance of the Q-statistic with the Schmidt and Hunter rule for homogeneity on a number of continuous outcomes. The outcomes investigated were the standardized mean difference, the point-biserial correlation coefficient and the standardized correlation coefficient. The Schmidt and Hunter rule rejects the homogeneity of treatment effects if the sampling error variance is less than a specific threshold of the total variance (e.g. 75% or 90%). Sánchez-Meca and Marín-Martínez concluded that all tests had low statistical power, and that the Q-statistic resulted in nominal Type I errors while the Schmidt and Hunter rule overestimated the Type I errors. In a Monte Carlo study, Jones, O'Gorman, Lemke and Woolson (1989) examined the performance of the following tests: the likelihood ratio test, Pearson's chi-square test, the Breslow-Day test and Tarone's adjustment to it (Tarone, 1985), a conditional score test, and Liang and Self's normal approximation to the score test (Liang & Self, 1989). They found that the likelihood ratio test was too liberal. They recommended using the Breslow-Day statistic when the number of studies is large, and Liang and Self's normal approximation to the score test when the data are sparse, i.e. when there are few participants in some studies. Similar results were reached by Paul and Donner (1992). Jones et al. (1989) concluded that all the tests considered had low power when the number of studies is small.
This was supported by Gavaghan, Moore, and McQuay (2000), who examined the Peto statistic (Yusuf et al., 1986), the Woolf statistic (Woolf, 1955), the Q-statistic applied to estimates of the risk difference (DerSimonian & Laird, 1986), Liang and Self's normal approximation to the score test, and the Breslow-Day test. These tests were also found to have incorrect Type I error rates for truly homogeneous sets. The authors concluded that the Q-statistic was too liberal and the Woolf statistic conservative, and they recommended the Breslow-Day test for routine use (Gavaghan et al., 2000). Recently, the Cochrane Reviews have started including I² in their reviews (Breslow, 1994). However, the I² statistic has been shown to be conservative and to have low power for continuous outcomes when the number of studies is small (Huedo-Medina et al., 2006). Bagheri, Ayatollahi and Jafari (2011) compared the performance of the random-effects logistic regression model to Cochran's Q-statistic and the Breslow-Day test, and found that the Breslow-Day test had the highest power. The authors recommended using the random-effects logistic regression model, citing practical advantages, although these models are known for their computational difficulties (Wolfinger & O'Connell, 1993).
This overview demonstrates that a lot of work has already been done. However, some measures have never been investigated (e.g. the Bliss statistic), and not all measures have been investigated in all settings. More importantly, all studies assumed that the binomial assumption on the underlying probabilities for the treatments in the stratified contingency tables is satisfied, and used this in their simulations, while in practical situations this assumption would typically be violated. The aggregate data in a contingency table are generated by individuals who each have a different probability of obtaining the clinical outcome, due to other variables like age and gender. It is unclear whether the above-mentioned statistical methods remain robust in such more realistic practical settings. We use an extensive simulation study that tries to account for hidden variation that can occur in practice, and we study the performance of 10 methods, including statistical methods that have not been applied before for this purpose.
This article is organized as follows. In Section 2, the statistical approaches for testing the homogeneity of odds ratios are presented, under the assumption that there is no individual risk. Section 3 describes the simulation model, in which individual risk scores are simulated. In Section 4 we present the simulation results and a case study on tuberculosis to illustrate the methods. The final section provides a discussion of the results, recommendations for the use of the methods and possible further research.

Statistical models
Before we introduce the methods, we first introduce notation that will be used throughout this section. The binary exposure or treatment will be referred to simply as treatment, and the clinical outcome is considered binary. Let $X_{hi}$ be the number of successes for treatment $h$ ($= 0, 1$) in study $i$ ($= 1, 2, \ldots, m$) out of $n_{hi}$ trials. The control group is indexed by $h = 0$ and the treatment group by $h = 1$. The random variables $X_{h1}, X_{h2}, \ldots, X_{hm}$ are assumed to be mutually independent (independent studies). For the methods that test homogeneity of effect sizes, it is assumed that the expected value of $X_{hi}$ is equal to $E(X_{hi}) = n_{hi} \cdot p_{hi}$, with $p_{hi}$ the probability of a success for treatment $h$ in study $i$. All methods for testing homogeneity assume that the proportions $p_{hi}$ satisfy the form $p_{hi} = F(\alpha_i + (\beta + \gamma_i) t_{hi})$, with $F$ some function (typically logistic), $\alpha_i$ an intercept for study $i$, $\beta$ the (mean) treatment effect, $t_{hi}$ a treatment indicator variable for study $i$ with value 1 when $h = 1$ and 0 otherwise, and $\gamma_i$ an interaction effect for study $i$. Homogeneity then means that

$$\gamma_1 = \gamma_2 = \cdots = \gamma_m = 0 \quad (1)$$

when the parameters $\gamma_i$ are considered fixed, or

$$\mathrm{var}(\gamma_i) = 0 \quad (2)$$

when they are assumed random.
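As a small illustration of this notation, the following Python sketch implements the logistic choice of $F$ and shows that under homogeneity ($\gamma_i = 0$) every study has the same log odds ratio $\beta$, regardless of its intercept $\alpha_i$ (the numeric values used here are purely illustrative):

```python
import math

def F(x):
    # Logistic distribution function, the typical choice for F in the text.
    return math.exp(x) / (1.0 + math.exp(x))

def success_prob(alpha_i, beta, gamma_i, t_hi):
    # p_hi = F(alpha_i + (beta + gamma_i) * t_hi); t_hi = 1 for treatment, 0 for control.
    return F(alpha_i + (beta + gamma_i) * t_hi)

def study_log_or(alpha_i, beta, gamma_i):
    # Log odds ratio of study i. It equals beta + gamma_i, so gamma_i = 0 for
    # all i (homogeneity) means every study shares the same log odds ratio beta.
    p1 = success_prob(alpha_i, beta, gamma_i, 1)
    p0 = success_prob(alpha_i, beta, gamma_i, 0)
    return math.log(p1 / (1 - p1)) - math.log(p0 / (1 - p0))
```

For example, `study_log_or(0.2, 0.7, 0.0)` and `study_log_or(-1.5, 0.7, 0.0)` both equal 0.7, while a nonzero interaction shifts the study-specific log odds ratio away from $\beta$.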
Note that in practice a meta-analysis could also include one or more available covariates, e.g. the average age per study or the percentage of males or females. In this setting, however, we assume that the only available information is the number of successes and the number of units in the treatment and control groups. This way, the statistical methods discussed below can be applied to data with the least available information. Many of these methods can easily be generalized to accommodate extra covariates.

Logistic regression approaches
For fixed-effects logistic regression analysis, the function $F$ is a distribution function of the form $F(x) = \exp(x)/[1 + \exp(x)]$. We also need to impose a constraint to obtain identifiability of the model. A natural choice is to take $\gamma_1 = 0$ or $\gamma_m = 0$. Additionally, logistic regression implies that $X_{hi}$ is binomially distributed and that $X_{0i}$ and $X_{1i}$ are independent (this is often not true due to different risks for individuals, but without additional information $X_{hi}$ is assumed binomial).
To test (1) we apply the likelihood ratio test. The likelihood of the full model, for $F$ the logistic distribution function, is

$$L = \prod_{i=1}^{m} \prod_{h=0}^{1} \binom{n_{hi}}{X_{hi}} F(\alpha_i + \beta_i t_{hi})^{X_{hi}} \left[1 - F(\alpha_i + \beta_i t_{hi})\right]^{n_{hi} - X_{hi}},$$

with $X_i = X_{0i} + X_{1i}$ the total number of events and $\beta_i = \beta + \gamma_i$. The maximum likelihood estimates are denoted by $\hat{\alpha}_i^F$, $\hat{\beta}^F$ and $\hat{\gamma}_i^F$ for the full model, and by $\hat{\alpha}_i^0$ and $\hat{\beta}^0$ under the null hypothesis. The likelihood ratio test is now defined by

$$\mathrm{LRT} = 2\left[\ell(\hat{\alpha}_i^F, \hat{\beta}^F, \hat{\gamma}_i^F) - \ell(\hat{\alpha}_i^0, \hat{\beta}^0)\right],$$

with $\ell$ the log likelihood. The LRT has an approximate $\chi^2$ distribution with $m - 1$ degrees of freedom [28]. The LRT statistic can be determined with standard software packages (we applied the GENMOD procedure of SAS).
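To make the structure of this test concrete, the sketch below computes the LRT from scratch for small examples: the full model saturates each 2 × 2 table, and the null model is fitted by profiling out the intercepts with bisection and a crude grid search over $\beta$. This is only an illustrative stand-in for the GENMOD fit, not the paper's implementation; tables are tuples (X_0i, n_0i, X_1i, n_1i).

```python
import math

def F(x):
    # Logistic distribution function.
    return 1.0 / (1.0 + math.exp(-x))

def loglik(tables, alphas, beta):
    # Binomial log likelihood (binomial coefficients dropped; they cancel in the LRT).
    ll = 0.0
    for (x0, n0, x1, n1), a in zip(tables, alphas):
        for x, n, t in ((x0, n0, 0), (x1, n1, 1)):
            p = F(a + beta * t)
            ll += x * math.log(p) + (n - x) * math.log(1 - p)
    return ll

def solve_alpha(x0, n0, x1, n1, beta):
    # For fixed beta, alpha_i solves x0 + x1 = n0*F(a) + n1*F(a + beta);
    # the right-hand side is monotone in a, so bisection works.
    lo, hi = -30.0, 30.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if n0 * F(mid) + n1 * F(mid + beta) < x0 + x1:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lrt_homogeneity(tables):
    # Full model: own alpha_i and beta_i per study, so each table is fitted exactly.
    full = 0.0
    for x0, n0, x1, n1 in tables:
        for x, n in ((x0, n0), (x1, n1)):
            p = x / n
            if 0 < p < 1:
                full += x * math.log(p) + (n - x) * math.log(1 - p)
    # Null model: common beta; profile out the intercepts, then grid-search beta.
    def prof(beta):
        alphas = [solve_alpha(*tb, beta) for tb in tables]
        return loglik(tables, alphas, beta)
    beta = max((b / 100.0 for b in range(-500, 501)), key=prof)
    return 2.0 * (full - prof(beta))
```

For two identical tables, the null model can fit both tables exactly, so the LRT is (numerically) zero; heterogeneous tables give a positive statistic to compare against a $\chi^2_{m-1}$ quantile.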
Instead of a fixed-effects model, the interaction effects $\gamma_i$ can be taken random for the logistic link function. Under this assumption, it is common to also assume that the intercepts $\alpha_i$ are random. Thus, we assume that $(\alpha_i, \gamma_i)^T$ is bivariate normal with mean $(\alpha, 0)^T$ and variance-covariance matrix $\Sigma$ given by

$$\Sigma = \begin{pmatrix} \sigma_C^2 & \rho\,\sigma_C \sigma_{CT} \\ \rho\,\sigma_C \sigma_{CT} & \sigma_{CT}^2 \end{pmatrix},$$

with $\sigma_C^2$ the variance of the random intercepts, $\sigma_{CT}^2$ the variance of the random interaction term and $\rho$ the correlation between $\alpha_i$ and $\gamma_i$.
The number of events $X_{hi}$, conditionally on $\alpha_i$ and $\gamma_i$, is considered binomial with parameters $n_{hi}$ and $F(\alpha_i + (\beta + \gamma_i) t_{hi})$, and $X_{0i}$ and $X_{1i}$ are independent given $\alpha_i$ and $\gamma_i$. The log likelihood of the full model can be written as

$$\ell = \sum_{i=1}^{m} \log \int \prod_{h=0}^{1} \binom{n_{hi}}{X_{hi}} F(a + (\beta + g) t_{hi})^{X_{hi}} \left[1 - F(a + (\beta + g) t_{hi})\right]^{n_{hi} - X_{hi}} \phi(a, g)\, \mathrm{d}a\, \mathrm{d}g,$$

with $\phi(a, g)$ the bivariate normal density with mean $(\alpha, 0)^T$ and variance-covariance matrix $\Sigma$. The homogeneity hypothesis in (2) can again be evaluated with a likelihood ratio test. The LRT becomes

$$\mathrm{LRT} = 2\left[\ell(\hat{\alpha}, \hat{\beta}, \hat{\Sigma}) - \ell(\hat{\alpha}^0, \hat{\beta}^0, \hat{\sigma}_C^0)\right],$$

with $\hat{\alpha}^0$, $\hat{\beta}^0$ and $\hat{\sigma}_C^0$ the maximum likelihood estimators under the null hypothesis. Note that the variance-covariance matrix $\Sigma$ reduces to the single variance $\sigma_C^2$ of the random intercepts when the null hypothesis is true, since then $\mathrm{var}(\gamma_i) = 0$. The approximate asymptotic distribution of the LRT is a mixture of a chi-square distribution with two degrees of freedom and a degenerate distribution at zero. This means that we should compare the LRT statistic with a chi-square value at significance level $2\alpha$ (McCulloch, Searle, & Neuhaus, 2008; Verbeke & Molenberghs, 2003). The LRT statistic can be determined with standard software packages (we applied the NLMIXED procedure of SAS).

Approaches based on the Q-statistic
The Q-statistic was developed by Cochran (1954), and it can be applied to test the homogeneity of any type of effect size. The Q-statistic is given by $Q = \sum_{i=1}^{m} (\hat{\beta}_i - \bar{\beta})^2 / \hat{se}_i^2$, with $\hat{\beta}_i$ the estimated effect size, $\bar{\beta}$ a weighted average and $\hat{se}_i^2$ an estimated squared standard error for $\hat{\beta}_i$. For the log odds ratio of study $i$, we have

$$\hat{\beta}_i = \log\left(\frac{X_{1i}/(n_{1i} - X_{1i})}{X_{0i}/(n_{0i} - X_{0i})}\right),$$

the weighted average is given by

$$\bar{\beta} = \frac{\sum_{i=1}^{m} \hat{\beta}_i / \hat{se}_i^2}{\sum_{i=1}^{m} 1/\hat{se}_i^2},$$

with $\hat{se}_i^2$ given by

$$\hat{se}_i^2 = \frac{1}{X_{0i}} + \frac{1}{n_{0i} - X_{0i}} + \frac{1}{X_{1i}} + \frac{1}{n_{1i} - X_{1i}}.$$

Cochran showed that the Q-statistic is approximately $\chi^2$ distributed with $m - 1$ degrees of freedom, although he did not apply or discuss log odds ratios as effect sizes. This was done by Woolf (1955). Bliss (1952) used the Q-statistic in a slightly different way to test the homogeneity hypothesis, with a test statistic (the B-statistic) that involves an average degrees of freedom $\bar{n}$. Note that Bliss' method was originally developed for testing the equality of potencies obtained from multiple bioassays, similar to that of Cochran. In such cases, $\bar{n}$ is the average of the corresponding degrees of freedom over all bioassays. The degrees of freedom per bioassay result from applying a parallel-line bioassay model, i.e. a linear regression model (Mzolo et al., 2013). In our case we apply a fixed-effects logistic regression per study, so two parameters are estimated per study. Hence, we have chosen the degrees of freedom per study equal to $\sum_{h=0}^{1} n_{hi} - 2$. Under the null hypothesis, the B-statistic may be approximated by a $\chi^2$ distributed random variable with $m - 1$ degrees of freedom.
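For illustration, the Q-statistic for log odds ratios can be computed directly from the 2 × 2 tables. A minimal sketch (tables are tuples (X_0i, n_0i, X_1i, n_1i); no zero cells are handled here):

```python
import math

def log_or_and_se2(x0, n0, x1, n1):
    # Log odds ratio of one study and its estimated squared standard error (Woolf).
    b = math.log(x1 / (n1 - x1)) - math.log(x0 / (n0 - x0))
    se2 = 1 / x1 + 1 / (n1 - x1) + 1 / x0 + 1 / (n0 - x0)
    return b, se2

def cochran_q(tables):
    # Q = sum_i (b_i - bbar)^2 / se_i^2, with inverse-variance weighted mean bbar.
    stats = [log_or_and_se2(*t) for t in tables]
    w = [1 / se2 for _, se2 in stats]
    bbar = sum(wi * b for wi, (b, _) in zip(w, stats)) / sum(w)
    return sum(wi * (b - bbar) ** 2 for wi, (b, _) in zip(w, stats))
```

Identical tables give Q = 0, and the observed Q is compared against a $\chi^2_{m-1}$ quantile.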
As mentioned in Section 1, Higgins and Thompson (2002) introduced several measures for quantifying the heterogeneity of treatment effects that are based on the Q-statistic (Sutton & Higgins, 2008). The authors suggested a homogeneity test based on a confidence interval for the I² statistic. With $H = \sqrt{Q/(m-1)}$, the confidence interval for $H$ is given by

$$\exp\left(\log \hat{H} \pm z_{\alpha/2}\, \hat{se}(\log \hat{H})\right),$$

with $z_{\alpha/2}$ the $\alpha/2$ quantile of the standard normal distribution and $\hat{se}(\log \hat{H})$ an estimated standard error of $\log \hat{H}$. This confidence interval for $H$ can be transformed to a confidence interval for $I^2 = (H^2 - 1)/H^2$, and the transformed confidence interval can be used to test the hypothesis in (1) by verifying whether the value zero lies in the interval or not.
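The point estimates behind this test follow directly from Q; a small sketch of the relations $H^2 = Q/(m-1)$ and $I^2 = (H^2 - 1)/H^2$ (with the usual truncation of $I^2$ at zero):

```python
def heterogeneity_measures(q, m):
    # H^2 = Q/(m - 1); I^2 = (H^2 - 1)/H^2, truncated at zero so I^2 stays in [0, 1).
    h2 = q / (m - 1)
    i2 = max(0.0, (h2 - 1.0) / h2)
    return h2, i2
```

For example, Q = 8 with m = 3 studies gives H² = 4 and I² = 0.75, while any Q below m − 1 gives I² = 0.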

Chi-square approaches
One of the oldest approaches for testing homogeneity of odds ratios is the Breslow-Day test (Breslow & Day, 1980). It uses the Cochran-Mantel-Haenszel pooled odds ratio, calculated by

$$\widehat{OR}_{CMH} = \frac{\sum_{i=1}^{m} X_{1i}(n_{0i} - X_{0i})/n_i}{\sum_{i=1}^{m} X_{0i}(n_{1i} - X_{1i})/n_i},$$

with $n_i = n_{0i} + n_{1i}$, and it assumes the same distributional assumptions for $X_{hi}$ as the fixed-effects logistic regression. The test compares $X_{1i}$ with its expected value $E_{1i} = E(X_{1i} \mid X_i, \widehat{OR}_{CMH})$ under the assumption that the null hypothesis in (1) is correct. $E_{1i}$ can be obtained by solving the following quadratic equation:

$$\frac{E_{1i}(n_{0i} - X_i + E_{1i})}{(n_{1i} - E_{1i})(X_i - E_{1i})} = \widehat{OR}_{CMH}. \quad (3)$$

For testing homogeneity, the Tarone test statistic (Tarone, 1985) is used. This statistic is given by

$$T = \sum_{i=1}^{m} \frac{(X_{1i} - E_{1i})^2}{\mathrm{var}(X_{1i} \mid X_i, \widehat{OR}_{CMH})} - \frac{\left[\sum_{i=1}^{m} (X_{1i} - E_{1i})\right]^2}{\sum_{i=1}^{m} \mathrm{var}(X_{1i} \mid X_i, \widehat{OR}_{CMH})}, \quad (4)$$

with

$$\mathrm{var}(X_{1i} \mid X_i, \widehat{OR}) = \left(\frac{1}{E_{1i}} + \frac{1}{X_i - E_{1i}} + \frac{1}{n_{1i} - E_{1i}} + \frac{1}{n_{0i} - X_i + E_{1i}}\right)^{-1}. \quad (5)$$

Note that we applied the Tarone statistic instead of the Breslow-Day statistic. The Breslow-Day statistic is based on the Cochran-Mantel-Haenszel odds ratio estimator, which has been shown to be consistent but not efficient. For this reason, Tarone modified the Breslow-Day statistic (Tarone, 1985). If (1) is true, the statistic in (4) approximately follows a $\chi^2$ distribution with $m - 1$ degrees of freedom. The Tarone-adjusted Breslow-Day test can be carried out with standard software packages (we applied the FREQ procedure of SAS). Zelen (1971) also presented a statistic to test the homogeneity of odds ratios. However, it was found to be biased and inconsistent (Halperin et al., 1977). Instead, Halperin et al. (1977) presented a corrected version of the Zelen statistic, which we will use. It also uses an expected value $E_{1i}$, like Breslow-Day, but with an alternative estimate of the pooled odds ratio. The odds ratio $\widehat{OR}_{CZ}$ is given by $\widehat{OR}_{CZ} = \exp(\hat{\beta}^0)$, with $\hat{\beta}^0$ the maximum likelihood estimator of the fixed-effects logistic regression analysis under the null hypothesis of homogeneity. The expected value $E_{1i} = E(X_{1i} \mid X_i, \widehat{OR}_{CZ})$ can be obtained by solving Equation (3) with $\widehat{OR}_{CMH}$ replaced by $\widehat{OR}_{CZ}$.
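The Tarone computation can be sketched as follows: compute the CMH odds ratio, take the admissible root of the quadratic in (3) for each study, and combine the standardized deviations as in (4). This is an illustrative re-implementation, not the SAS FREQ procedure; tables are tuples (X_0i, n_0i, X_1i, n_1i).

```python
import math

def cmh_or(tables):
    # Cochran-Mantel-Haenszel pooled odds ratio.
    num = sum(x1 * (n0 - x0) / (n0 + n1) for x0, n0, x1, n1 in tables)
    den = sum(x0 * (n1 - x1) / (n0 + n1) for x0, n0, x1, n1 in tables)
    return num / den

def expected_x1(x0, n0, x1, n1, or_):
    # Solve E(n0 - X + E) = OR * (n1 - E)(X - E) for E, with X = x0 + x1,
    # i.e. (1 - OR) E^2 + [n0 - X + OR (n1 + X)] E - OR n1 X = 0.
    X = x0 + x1
    a = 1.0 - or_
    b = n0 - X + or_ * (n1 + X)
    c = -or_ * n1 * X
    if abs(a) < 1e-12:           # OR = 1: the equation is linear
        return -c / b
    disc = math.sqrt(b * b - 4 * a * c)
    for e in ((-b + disc) / (2 * a), (-b - disc) / (2 * a)):
        if max(0.0, X - n0) <= e <= min(n1, X):
            return e             # the admissible root (valid cell counts)
    raise ValueError("no admissible root")

def tarone(tables):
    or_ = cmh_or(tables)
    t, s, sv = 0.0, 0.0, 0.0
    for x0, n0, x1, n1 in tables:
        X = x0 + x1
        e = expected_x1(x0, n0, x1, n1, or_)
        v = 1.0 / (1 / e + 1 / (X - e) + 1 / (n1 - e) + 1 / (n0 - X + e))
        t += (x1 - e) ** 2 / v
        s += x1 - e
        sv += v
    return t - s * s / sv        # Tarone's correction subtracts the pooled term
```

For identical tables the expected counts equal the observed ones and the statistic is zero; otherwise the statistic is compared against a $\chi^2_{m-1}$ quantile.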
The corrected Zelen statistic for testing homogeneity is then given by

$$Z = \sum_{i=1}^{m} \frac{(X_{1i} - E_{1i})^2}{\mathrm{var}(X_{1i} \mid X_i, \widehat{OR}_{CZ})},$$

with the variance given in (5) using the solution $E_{1i} = E(X_{1i} \mid X_i, \widehat{OR}_{CZ})$. If the null hypothesis in (1) is true, the corrected Z-statistic follows an asymptotic $\chi^2$ distribution with $m - 1$ degrees of freedom (Halperin et al., 1977). Liang and Self (1989) presented test statistics for sparse data. (Recall that sparse data have low cell counts, thus not necessarily a low number of studies.) Again, an expected value $E_{1i}$ is used, given $X_i$ and a pooled odds ratio $\widehat{OR}_{CL}$. Here $\widehat{OR}_{CL}$ is given by $\widehat{OR}_{CL} = \exp(\hat{\beta}_{CL}^0)$, with $\hat{\beta}_{CL}^0$ the maximum likelihood estimator of the conditional likelihood function given $X_i$. The conditional logistic regression can be carried out with standard software packages (we applied the LOGISTIC procedure of SAS). The expected value $E_{1i} = E(X_{1i} \mid X_i, \widehat{OR}_{CL})$ can be obtained by solving Equation (3) with $\widehat{OR}_{CMH}$ replaced by $\widehat{OR}_{CL}$. The Liang and Self test statistic $T_2$ is defined by

$$T_2 = \sum_{i=1}^{m} \frac{(X_{1i} - E_{1i})^2}{\mathrm{var}(X_{1i} \mid X_i, \widehat{OR}_{CL})},$$

with $\mathrm{var}(X_{1i} \mid X_i, \widehat{OR}_{CL})$ given by (5) and $E_{1i}$ the described solution of (3) using $\widehat{OR}_{CL}$. The statistic $T_2$ is approximately chi-square distributed with $m - 1$ degrees of freedom (Liang & Self, 1989).
Following Liang and Self's approach, we derived an alternative statistic. Instead of using conditional logistic regression, we apply a random-effects logistic regression model, assuming that $\mathrm{var}(\gamma_i) = 0$. The odds ratio we apply is $\widehat{OR}_R = \exp(\hat{\beta}_R^0)$, with $\hat{\beta}_R^0$ the maximum likelihood estimator described in Section 2.1. The test statistic $T_R$ is equal to $T_2$, but with $E_{1i}$ the solution of (3) with $\widehat{OR}_{CMH}$ replaced by $\widehat{OR}_R$, and the variance given by (5) with $E_{1i}$ the described solution of (3) using $\widehat{OR}_R$.
Finally, a completely different $\chi^2$ statistic is the Peto statistic. Yusuf et al. (1986) presented the Peto statistic to test the homogeneity of odds ratios. The Peto statistic $P$ is given by

$$P = \sum_{i=1}^{m} \frac{(X_{1i} - E_{1i})^2}{V_i} - \frac{\left[\sum_{i=1}^{m} (X_{1i} - E_{1i})\right]^2}{\sum_{i=1}^{m} V_i},$$

with $E_{1i} = n_{1i} X_i / n_i$ and $V_i = n_{1i} n_{0i} X_i (n_i - X_i) / [n_i^2 (n_i - 1)]$ the expectation and variance of $X_{1i}$ under the hypergeometric distribution. When the null hypothesis in (1) is true, the Peto statistic has an approximate $\chi^2$ distribution with $m - 1$ degrees of freedom (Yusuf et al., 1986).
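A sketch of this computation, using the hypergeometric moments commonly used with the Peto method (tables are tuples (X_0i, n_0i, X_1i, n_1i)):

```python
def peto_heterogeneity(tables):
    # Observed minus expected and the hypergeometric variance per study,
    # then the between-study component of the Peto statistic.
    t, s, sv = 0.0, 0.0, 0.0
    for x0, n0, x1, n1 in tables:
        n, X = n0 + n1, x0 + x1
        e = n1 * X / n                                   # E(X_1i) under H0
        v = n1 * n0 * X * (n - X) / (n * n * (n - 1))    # hypergeometric variance
        t += (x1 - e) ** 2 / v
        s += x1 - e
        sv += v
    return t - s * s / sv
```

As with the Tarone statistic, identical tables give a statistic of zero, and the observed value is referred to a $\chi^2_{m-1}$ distribution.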

Simulation study
Although all approaches discussed above assume equal risk probabilities for all participants in a given study and treatment group, this assumption is hardly ever true in practice. In reality, the risk probabilities of individuals are influenced by individual characteristics. Therefore, the data are simulated so that each individual has a unique risk probability. This is done by creating covariates for each individual that affect the risk probability in addition to the influence of the treatment. Since populations vary across studies, the distributions of the covariates are affected by study, which already causes one form of treatment-study interaction. We also added treatment-study interactions independent of the covariates.

Simulation model and settings
The simulation model to generate $m$ studies can be described as follows. First, the total number of participants $n_i$ for study $i$ is drawn. We used an overdispersed Poisson distribution with parameter $\lambda$: a value $\gamma_i \sim \Gamma(a_0, b_0)$ is drawn to form a study-specific parameter $\lambda_i = \lambda \cdot \exp(0.5 \cdot \gamma_i)$, and $n_i$ is then drawn from a Poisson($\lambda_i$) distribution. As mentioned in Section 1, to make the simulation somewhat more realistic we generated covariates, in this case age and gender. The covariate age $a_{ij}$ for the $j$th participant in study $i$ is created by

$$a_{ij} = \alpha_0 + \eta_i + \varepsilon_{ij},$$

with $\alpha_0$ the average age over all participants in all studies, $\eta_i \sim N(0, \tau_s^2)$ a study effect for age and $\varepsilon_{ij} \sim N(0, \tau_{a(s)}^2)$ the age effect within study $i$. The covariate gender $g_{ij}$ for the $j$th participant in study $i$ is generated with a uniform variable $z_{ij}$: the $j$th participant is assigned the value $g_{ij} = 1$ if $z_{ij} > 0.5$ and zero otherwise.
Then we need to generate study-specific treatment effects independent of the covariates. We use a bivariate normally distributed variable $(u_{0i}, u_{1i})$ with means $\alpha$ and $\alpha + \beta$, variances $\sigma_0^2$ and $\sigma_1^2$, and correlation $\rho$. The parameter $\alpha$ represents an intercept that generates the risk of an event for the control group, and $\beta$ is the direct treatment effect. Additionally, a random variable $v_{ij} \sim N(0, \sigma^2)$ is drawn to generate heterogeneity across participants in the study due to variables other than age and gender.
A treatment indicator $\delta_{ij}$ is generated with a uniform(0, 1) variable $w_{ij}$: $\delta_{ij} = 1$ if $w_{ij} > p$ and zero otherwise, with $p$ the allocation ratio between control and treatment. The event indicator $X_{ij}$ is Bernoulli($p_{ij}$) distributed, with $p_{ij}$ satisfying

$$p_{ij} = F\left(\alpha_a a_{ij} + \beta_a a_{ij} \delta_{ij} + \alpha_g g_{ij} + u_{ij} + v_{ij}\right), \quad (6)$$

with $\alpha_a$ the effect of age on the risk of an event irrespective of treatment, $\beta_a$ a treatment-age interaction effect, $\alpha_g$ the effect of gender and $u_{ij} = u_{0i}(1 - \delta_{ij}) + u_{1i} \delta_{ij}$ the study treatment effect.
Finally, the dichotomous response variable $X_{ij}$ is drawn with a uniform(0, 1) variable: $X_{ij} = 1$ if $y_{ij} \le p_{ij}$ and zero otherwise, with $y_{ij} \sim U(0, 1)$. This procedure is repeated 1,000 times, resulting in 1,000 simulation runs for each set of parameters. Per simulation run, it is determined whether the null hypothesis of homogeneity is rejected or not using the approaches in Section 2, which all ignore the individual heterogeneity. The percentage of rejected null hypotheses over the number of simulations is reported.
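A condensed sketch of one simulation run under this model is given below. All parameter values are illustrative placeholders, not the settings of the paper's Table 1, and the means $\alpha$ and $\alpha + \beta$ for $(u_{0i}, u_{1i})$ are taken from the description above. The output is one 2 × 2 table (X_0i, n_0i, X_1i, n_1i) per study, so the aggregate data hide the individual-level risk variation.

```python
import math, random

rng = random.Random(20180524)

def rpois(lam):
    # Poisson draw: Knuth's method for small lam, normal approximation for large lam.
    if lam > 30:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    l, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= l:
            return k
        k += 1

def simulate_meta(m=10, lam=100.0, a0=1.0, b0=0.25, alpha0=50.0, tau_s=5.0, tau_as=10.0,
                  alpha=-1.0, beta=0.5, alpha_a=0.01, beta_a=0.0, alpha_g=0.3,
                  sigma0=0.3, sigma1=0.3, rho=1.0, sigma=0.2, p=0.5):
    tables = []
    for _ in range(m):
        gamma_i = rng.gammavariate(a0, b0)                  # overdispersion
        n_i = max(2, rpois(lam * math.exp(0.5 * gamma_i)))  # study size
        eta_i = rng.gauss(0.0, tau_s)                       # study effect for age
        # Correlated study-specific effects (u0, u1) for control and treatment.
        z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
        u0 = alpha + sigma0 * z0
        u1 = alpha + beta + sigma1 * (rho * z0 + math.sqrt(1 - rho ** 2) * z1)
        x, n = [0, 0], [0, 0]
        for _ in range(n_i):
            a_ij = alpha0 + eta_i + rng.gauss(0.0, tau_as)  # age
            g_ij = 1 if rng.random() > 0.5 else 0           # gender
            d_ij = 1 if rng.random() > p else 0             # treatment indicator
            v_ij = rng.gauss(0.0, sigma)                    # residual heterogeneity
            u_ij = u1 if d_ij else u0
            lin = alpha_a * a_ij + beta_a * a_ij * d_ij + alpha_g * g_ij + u_ij + v_ij
            p_ij = 1.0 / (1.0 + math.exp(-lin))
            n[d_ij] += 1
            x[d_ij] += 1 if rng.random() <= p_ij else 0
        tables.append((x[0], n[0], x[1], n[1]))
    return tables
```

Each resulting set of tables can then be fed to the homogeneity tests of Section 2, which see only the aggregate counts.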
Note that the final model in (6) has a study-specific latent variable $Z_{0i} = \alpha_a \eta_i + u_{0i}$ for the control group and a study-specific latent variable $Z_{1i} = (\alpha_a + \beta_a) \eta_i + u_{1i}$ for the treatment group. The variance-covariance matrix of these latent variables is given by

$$\begin{pmatrix} \alpha_a^2 \tau_s^2 + \sigma_0^2 & \alpha_a(\alpha_a + \beta_a)\tau_s^2 + \rho \sigma_0 \sigma_1 \\ \alpha_a(\alpha_a + \beta_a)\tau_s^2 + \rho \sigma_0 \sigma_1 & (\alpha_a + \beta_a)^2 \tau_s^2 + \sigma_1^2 \end{pmatrix}.$$

When $\beta_a = 0$, $\rho = 1$ and $\sigma_0 = \sigma_1$, the correlation between $Z_{0i}$ and $Z_{1i}$ is equal to 1 and there is no treatment-study interaction. In case $\beta_a \neq 0$ or $\rho \neq 1$, there is a treatment-study interaction and thus heterogeneity in treatment effects. The size of the heterogeneity is given by $\beta_a^2 \tau_s^2 + 2\sigma_0^2(1 - \rho)$ when $\sigma_0 = \sigma_1$. The larger this value, the higher the statistical power of the methods. Thus, the heterogeneity decreases as $\rho$ increases to 1. For the simulation, we chose $\sigma_0^2 = \sigma_1^2$. The Type I error was simulated assuming $\rho = 1$ and $\beta_a = 0$. The different simulation settings that we studied are presented in Table 1.

Meta-analysis of antituberculosis therapy
We applied the statistical methods presented in Section 2 to a meta-analysis on antituberculosis therapy (Pasipanodya, Srivastava, & Gumbo, 2012). This meta-analysis consisted of 26 different regimens (plans or regulated courses of treatment). Within each regimen, patients were divided into two groups: slow acetylators (control) and rapid acetylators (treatment). The objective was to test whether microbiological failure after therapy, relapse after therapy and acquired drug resistance (ADR) were equally likely to occur among slow and rapid acetylators. The authors limited their selection of studies to randomized, controlled clinical trials in which outcomes were reported in table, graph or text format. All patients who were considered microbiological failures, relapses or ADR during treatment were recorded for slow and rapid acetylators. For each of the three outcomes, the authors then calculated risk ratios (RR) with 95% confidence intervals, using regimen as the unit of analysis. Random-effects models were used to combine regimens when I² > 50% and the corresponding P-value was below 0.05; otherwise, fixed-effects models were used to combine the RR.
For our study we only consider the results of one outcome, namely microbiological failure. The authors performed a test of homogeneity of the relative risks using the I² and found I² = 39% with a corresponding P = 0.035 (Pasipanodya et al., 2012). The authors then applied mixed-effects modeling for this outcome, resulting in a combined RR of 2.0 with a 95% confidence interval of [1.5, 2.7] for the RR.
However, four of the 26 regimens contained zero cells, which the selected methods cannot handle unless a continuity correction is applied to the cell counts (Gart, 1966). We chose to remove these four regimens from our data and to apply all the selected approaches to the remaining datasets.
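As an illustration of the alternative route, a minimal sketch of a 0.5 continuity correction, adding 0.5 to every cell of any 2x2 table that contains a zero; the table values here are invented for illustration:

```python
import numpy as np

def continuity_correct(tables, c=0.5):
    """Add c to every cell of each 2x2 table that contains a zero.

    `tables` is a list of 2x2 arrays [[events_t, non_t], [events_c, non_c]].
    Tables without zero cells are left unchanged.
    """
    corrected = []
    for t in tables:
        t = np.asarray(t, dtype=float)
        corrected.append(t + c if (t == 0).any() else t)
    return corrected

# Hypothetical regimens: the second table has a zero cell.
tables = [np.array([[5, 95], [3, 97]]),
          np.array([[0, 50], [4, 46]])]
fixed = continuity_correct(tables)
```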
The test statistic, degrees of freedom and the P-value for rejecting the null hypothesis are presented in Table 2. At the 5% level of significance, the null hypothesis is rejected only by the random-effects logistic regression model; the P-values range from 0.035 to 0.124. Thus, the proposed methods do not all agree. Note that, in contrast to what the authors suggested, we could not reject homogeneity of the treatment effects with I². This did not change when we included the four removed studies and shifted their cell counts by 0.5.
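A minimal sketch of how Cochran's Q and the I² statistic can be computed from a set of 2x2 tables on the log odds-ratio scale, using the usual inverse-variance weights; the example tables are invented for illustration:

```python
import numpy as np
from scipy.stats import chi2

def q_and_i2(tables):
    """Cochran's Q, I^2 (in %) and the chi-square P-value for the
    log odds ratios of k 2x2 tables [[a, b], [c, d]], with
    var(log OR) = 1/a + 1/b + 1/c + 1/d.
    """
    t = np.asarray(tables, dtype=float)
    a, b, c, d = t[:, 0, 0], t[:, 0, 1], t[:, 1, 0], t[:, 1, 1]
    theta = np.log(a * d / (b * c))            # per-study log odds ratio
    w = 1.0 / (1/a + 1/b + 1/c + 1/d)          # inverse-variance weights
    theta_bar = np.sum(w * theta) / np.sum(w)  # fixed-effects pooled estimate
    q = np.sum(w * (theta - theta_bar) ** 2)
    df = len(tables) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2, chi2.sf(q, df)

tables = [[[12, 88], [8, 92]],
          [[20, 80], [9, 91]],
          [[15, 85], [14, 86]]]
q, i2, p = q_and_i2(tables)
```

Identical tables yield Q near zero and I² = 0, while strongly discrepant odds ratios drive Q above its degrees of freedom and I² toward 100%.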

Simulation results
The full set of results is presented in the supplementary material; a subset is shown in Tables 3-6. Tables 3-5 present simulation results where p_hi depends on age, gender and other heterogeneous factors (v_ij). They represent the results for the most realistic practical settings, and serve both to contrast these settings and to allow comparison of our results with the literature. Table 6 presents a selection of the Type I error and statistical power results of the test methods under the assumption of the binomial distribution; in that case, the success proportion p_hi does not depend on age, gender or any other participant variables (v_ij). The random-effects logistic regression analysis occasionally showed numerical problems, in particular when heterogeneity between studies was small or absent. In around 13.7% of all simulation runs, the full random-effects logistic regression model did not converge; these problematic runs were eliminated from the results.
For the realistic practical settings (Table 3), the fixed-effects logistic regression analysis has a slightly increased Type I error when the number of participants within studies is relatively small, irrespective of the number of studies, but this is still acceptable. When sample sizes within studies increase, the Type I error is closer to nominal. The Q-statistic, the Bliss statistic and the I² statistic consistently underestimate the Type I error (Table 3). This is most apparent when only a third of the participants are exposed to the treatment, and it improves slightly when the number of studies increases (supplementary material). The I² statistic has the smallest Type I error rates of all the methods, except for a few cases where the Peto statistic gives even smaller values. The random-effects logistic regression model mostly underestimates the Type I error rate when σ_0² and σ_1² are small; as σ_0² and σ_1² increase, however, it overestimates the Type I error rate (Table 3). This gets even worse when the number of studies increases (supplementary material). The Type I error rates of the chi-square methods using estimates of the odds ratio, i.e. the Breslow-Day statistic, the corrected Zelen statistic, Liang and Self's T² and its adjusted version T_R, are almost at the nominal value. However, when one-third of the participants is exposed to treatment, the chi-square methods stay just below the nominal level; this disappears when the number of studies increases. When participants no longer have their individual risk, the Type I error rates (Table 6) are mostly similar to those for the realistic practical settings, except for two cases. The fixed-effects logistic regression no longer gives liberal Type I error rates when there is no treatment effect (β = 0).
Investigating the power for the realistic practical settings (Tables 4 and 5) shows, as expected, that the statistical power increases with the number of studies. This improved power for testing the interaction effect can be explained by the reduced sampling error due to the increased sample size. Furthermore, an increase in σ_0² and σ_1² results in higher power, since it increases the treatment interaction effects through the term 2σ_0²(1 − ρ) in the size of the heterogeneity. The statistical power is negatively correlated with ρ, with no interaction effect when ρ = 1 and β_a = 0 (as mentioned earlier). The highest statistical power is obtained by the fixed-effects logistic regression model, followed by the χ² statistics (the Breslow-Day test, the corrected Zelen test, T² and T_R). The lowest statistical power is obtained by the I² and the random-effects logistic regression model, respectively. These results hold irrespective of the number of studies. Although Bliss uses the Q-statistic differently from Cochran, the two tests behave almost identically, and both are less powerful than the best test statistics. It should be noted that heterogeneity in treatment effects arising from differences in age effects across studies (β_a > 0) does not increase the power much, a result that holds even when the number of studies increases. This is due to the selected settings.
The heterogeneity effect in the simulation was only increased by β_a²τ_a² = (0.01)² · 10 = 0.001, which is very small. Comparing the power results from the setting in which participants have a common risk (Table 6) with those from the setting in which individuals have their own risks (Tables 4 and 5) shows a few interesting differences. First, the power is substantially smaller when individuals are heterogeneous in their risk rates, which makes it more difficult to detect treatment heterogeneity across studies. The power values for testing treatment interactions increase slightly in the presence of a treatment effect when individuals have heterogeneous risk rates. The opposite holds when individuals are not heterogeneous, since the power values for treatment heterogeneity are substantially smaller for β = 2 than for β = 0 (Table 6 and supplementary material). With respect to choosing the best method for detecting treatment effect heterogeneity, no real difference could be detected.

Discussion
The purpose of this article was to investigate the performance of statistical methods for testing the homogeneity of odds ratios in meta-analysis. Ten methods were applied to a real-life meta-analysis study and to practically realistic simulated datasets. The simulated datasets were created under different scenarios, varying the number of studies, the number of participants within studies, the balance of observations between treatment and control groups, the presence or absence of an actual treatment effect, individual risks through covariates, and the level of heterogeneity.
The Peto statistic is somewhat unreliable with respect to its Type I error rate, a conclusion that has been reached earlier (Gavaghan et al., 2000). The fixed-effects logistic regression analysis is a slightly liberal test, although the inflated Type I error rate is not too large and decreases with a larger number of studies. This conclusion has also been reached earlier (Paul & Donner, 1992), although we could not demonstrate it in settings where all individuals have the same constant risk rate, the settings used in the literature. In our study, the inflation occurred when the proportion of success depends on individual covariates and the number of studies is small. This interesting result merits further investigation to find its exact analytical cause, but that is beyond the scope of this article.
The fact that the fixed-effects logistic regression has the highest power of all methods is partly due to this inflated Type I error rate. The I² statistic can be considered a very conservative test. This conclusion had been reached earlier (Huedo-Medina et al., 2006), but only for continuous responses. It makes the I² an unsuitable test statistic for homogeneity of odds ratios. The random-effects logistic regression analysis is also conservative, albeit less conservative than the I² for small numbers of studies, but it is less powerful than I² for larger numbers of studies. This low power, combined with the unreliable Type I error rate, makes the random-effects model unsuitable for testing homogeneity; see also Bagheri et al. (2011). The Q-statistic and the Bliss statistic can also be considered conservative tests, a conclusion reached earlier (Higgins & Thompson, 2002; Higgins et al., 2003; Huedo-Medina et al., 2006); heterogeneity in the individuals' risk rates makes this more apparent. The Type I error rates of the chi-square methods are close to the nominal value, but Paul and Donner (1989) found T² to have liberal Type I error rates in the setting of a small number of centers and a large number of units. In our study, however, T² (and all the χ² methods) hardly overestimated the Type I error in such settings, although we did not study large numbers of participants within studies. The Type I error rate of T² is not affected by heterogeneity of the individuals' risk rates. Since the allocation of participants to treatment and control groups seems to affect the statistical power, it should be taken into consideration. Additionally, heterogeneity in the individuals' risk rates reduces the power, in particular when no treatment effect is present.
Overall, the chi-square tests based on odds ratios and the fixed-effects logistic regression analysis perform best. We recommend the Breslow-Day test adjusted by Tarone for convenience, since it is available in most statistical software packages while the other tests are not. The Breslow-Day test had been recommended before by Jones et al. (1989), but only when the number of studies is large; we also recommend it when the number of studies is small. However, when a treatment effect is suspected to be present and only a small number of studies is available (m ≤ 10), we recommend the fixed-effects logistic regression over the χ² tests, since it has slightly higher power than the chi-square test statistics with an almost nominal Type I error rate. We advise against using the Q-statistic, the Bliss statistic and the I² statistic because of their poor performance, and because of the recent theoretical concerns expressed by Hoaglin (2016). The Peto statistic and the random-effects logistic regression are not recommended either, due to their poor performance.

This article has its limitations. First, the conclusions presented only hold for dichotomous outcomes. Second, although the simulation method creates data on a unit (subject) level, and hence adds more variation, it uses normally distributed latent variables. This assumption may not always be realistic in practice. Finite mixture models for the latent variables could be a useful alternative; more specifically, the beta-binomial distribution (Skellam, 1948) could be contemplated for the interaction test. Another suggestion is to use methods that estimate the between-study variance; see e.g. Brockwell and Gordon (2001) and DerSimonian and Laird (1986) for an overview. The estimated confidence intervals of the between-study variance can then be used to test the homogeneity of effect sizes.
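The recommended test is indeed widely available (e.g. `StratifiedTable.test_equal_odds(adjust=True)` in Python's statsmodels). As a hedged, self-contained NumPy sketch of what it computes, the Breslow-Day statistic with Tarone's adjustment, evaluated at the Mantel-Haenszel common odds ratio, can be written as follows; the example tables are invented:

```python
import numpy as np
from scipy.stats import chi2

def breslow_day(tables, tarone=True):
    """Breslow-Day homogeneity test for k 2x2 tables [[a, b], [c, d]],
    evaluated at the Mantel-Haenszel common odds ratio psi.
    With tarone=True the Tarone adjustment is subtracted."""
    t = np.asarray(tables, dtype=float)
    a, b, c, d = t[:, 0, 0], t[:, 0, 1], t[:, 1, 0], t[:, 1, 1]
    n1, n2, m1, n = a + b, c + d, a + c, a + b + c + d
    psi = np.sum(a * d / n) / np.sum(b * c / n)   # MH common odds ratio

    # Expected a-cell under psi solves
    # (psi-1)E^2 - [psi(n1+m1) + n2 - m1]E + psi*n1*m1 = 0.
    if np.isclose(psi, 1.0):
        e = n1 * m1 / n                           # independence expectation
    else:
        qa = psi - 1.0
        qb = -(psi * (n1 + m1) + n2 - m1)
        qc = psi * n1 * m1
        disc = np.sqrt(qb ** 2 - 4 * qa * qc)
        r1, r2 = (-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)
        lo, hi = np.maximum(0.0, m1 - n2), np.minimum(n1, m1)
        e = np.where((r1 >= lo) & (r1 <= hi), r1, r2)  # admissible root

    v = 1.0 / (1/e + 1/(n1 - e) + 1/(m1 - e) + 1/(n2 - m1 + e))
    stat = np.sum((a - e) ** 2 / v)
    if tarone:
        stat -= np.sum(a - e) ** 2 / np.sum(v)    # Tarone adjustment
    df = len(tables) - 1
    return stat, chi2.sf(stat, df)

# Identical tables share one odds ratio, so the statistic should be ~0.
stat, p = breslow_day([[[10, 20], [15, 25]]] * 3)
```

Since all three tables have the same odds ratio, the expected cells equal the observed ones and the statistic vanishes, giving no evidence against homogeneity.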
Finally, Bayesian approaches to test the interaction effects between study and treatment effect are other options for future research.

Supplementary material
Supplemental material for this article can be accessed at https://doi.org/10.1080/25742558.2018.1478698.