A comparison of analytic approaches for individual patient data meta-analyses with binary outcomes

Background Individual patient data meta-analyses (IPD-MA) are often performed using a one-stage approach-- a form of generalized linear mixed model (GLMM) for binary outcomes. We compare (i) one-stage to two-stage approaches (ii) the performance of two estimation procedures (Penalized Quasi-likelihood-PQL and Adaptive Gaussian Hermite Quadrature-AGHQ) for GLMMs with binary outcomes within the one-stage approach and (iii) using stratified study-effect or random study-effects. Methods We compare the different approaches via a simulation study, in terms of bias, mean-squared error (MSE), coverage and numerical convergence, of the pooled treatment effect (β 1) and between-study heterogeneity of the treatment effect (τ 12). We varied the prevalence of the outcome, sample size, number of studies and variances and correlation of the random effects. Results The two-stage and one-stage methods produced approximately unbiased β 1 estimates. PQL performed better than AGHQ for estimating τ 12 with respect to MSE, but performed comparably with AGHQ in estimating the bias of β 1 and of τ 12. The random study-effects model outperformed the stratified study-effects model in small size MA. Conclusion The one-stage approach is recommended over the two-stage method for small size MA. There was no meaningful difference between the PQL and AGHQ procedures. Though the random-intercept and stratified-intercept approaches can suffer from their underlining assumptions, fitting GLMM with a random-intercept are less prone to misfit and has good convergence rate. Electronic supplementary material The online version of this article (doi:10.1186/s12874-017-0307-7) contains supplementary material, which is available to authorized users.


Background
Individual Patient Data (IPD) meta-analyses (MA) are regarded as the gold standard in evidence synthesis and are increasingly being used in current practice [1,2]. However, the implementation of the analysis of IPD-MA requires additional expertise and choices [3], particularly when the outcome is binary. These include (i) should a one-or two-stage model be used [4,5], (ii) what estimation procedure should be used to estimate the one-stage model [6,7] and, (iii) should the study effect be fixed or random [8].
Although IPD-MA were conventionally analyzed via a two-stage approach [9], over the last decade, use of the one-stage approach has increased [10]. Recently, some have suggested that the two-stage and one-stage frameworks produce similar results for MA of large randomized controlled trials [5]. The literature suggests the one-stage method is particularly preferable when few studies or few events are available as it uses a more exact statistical approach than relying on a normality approximation [3][4][5].
When IPD are available and the outcome is binary, the one-stage approach consists of estimating Generalized Linear Mixed Models (GLMMs) with a random slope for the exposure, to allow the exposure effect to vary across studies. Penalized quasi-likelihood (PQL) introduced by Breslow and Clayton is a popular method for estimating the parameters in GLMMs [11]. However, regression parameters can be badly biased for some GLMMs, especially with binary outcomes with few observations per cluster, low outcome rates, or high between cluster variability [12,13]. Adaptive Gaussian Hermite quadrature (AGHQ) is the current favored competitor to PQL, which approximates the maximum likelihood by numerical integration [14]. Although estimation becomes more precise as the number of quadrature points increases, it often gives rise to computational difficulties for high-dimension random effects and convergence problems where variances are close to zero or cluster sizes are small [14].
The heterogeneity between studies is an important aspect to consider when carrying out IPD-MA. Such heterogeneity may arise due to differences in study design, treatment protocols or patient populations [8]. When such heterogeneity is present, the convention is to include a random slope in the model as it captures the variability of the exposure across studies. However, there are corresponding assumptions in regards to the study effect being modelled as stratified or random [4,15].
Few comparisons of GLMMs have been reported in the context of IPD-MA with binary outcomes [4,15], that is, when the number of studies and the number of subjects within each study is small, study sizes are imbalanced, in the presence of large between-study heterogeneity and small exposure effects and there is an interest in the variance parameter of the random treatment effect. According to previous literature, these factors have all been identified as influencing model performance [6]. While several simulation studies have been published, these have mainly limited their attention to simple models with only random intercepts [13,16]. Thus, the performance of the random effects models including both a random intercept and a random slope are less well known.
Our objective was to assess and compare via simulation studies, (i) one-stage approaches to conventional twostage approaches (ii) the performance of different estimation procedures for GLMMs with binary outcomes, and (iii) using stratified study-effect or random study-effects in a randomized trial setting. We use our results to develop guidelines on the choice of methods for analyzing data from IPD-MA with binary outcomes and to understand explicitly the trade-offs between computational and statistical complexity.
Methods section introduces the models we are considering, the design of the simulation study and the assessment criteria. In Results section, results for the different methods under varying conditions are presented and discussed. Discussion section concludes with a discussion.

Methods
We conducted a simulation study to compare various analytic approaches to analyze data from IPD-MA with binary outcomes. Hereto, our methods all assume that between-study heterogeneity exists, as it is likely in practice, and so only random treatment-effects IPD meta-analysis models are considered.

Data Generation
The data generation algorithm was developed to generate two-level data sets (e.g. patients grouped into studies). We generated a binary outcome (Y ij ) and a single binary exposure (X ij ). We denote the number of studies j = 1, 2 …, K and i = 1, 2 …, n j denotes the individuals per study. Therefore, Y ij is the outcome observed for the i th individual from the j th study.
The dichotomous exposure variable, X ij , was generated from a Bernoulli distribution with probability = 0.5 and recoded AE 1 2 = to indicate control/treatment group [15]. To generate the binary outcome variable Y ij , first the probability of the outcome was calculated from the random-study and -treatment effects logistic regression model (Eq. 1), or the stratified-study effects model (Eq. 2): Here π ij is the true probability of the outcome for the i th individual from the j th study, β 0 denotes the mean log-odds of the outcome (study-effect) and β 1 the pooled treatment effect (log odds ratio). The random effects (b 0j and b 1j ) were generated from a bivariate normal distribution with mean = 0 and variance-covariance matrix Σ ¼ σ 2 ρστ ρστ τ 2 for the random study-effect case. In the stratified study effects case, (i.e. Eq. (2)), β j , were generated from a uniform distribution and b 1j was generated from a normal distribution with zero mean and variance, τ 2 . A Bernoulli distribution with probability π ij from Eqs. (1) and (2) was used to generate the binary outcome Y ij .
The number of studies, study size, total sample size, variances and correlation of the random effects, and average conditional probability were all varied, with levels described in Table 1. For each distinct combination (n = 480) of simulation parameters, 1000 IPD-MA were generated from each Eqs. (1) and (2), allowing us to investigate a wide range of scenarios. The heterogeneity was set at I 2 = 0.01, 0.23 and 0.55 as defined by τ 2 /(τ 2 + π 2 /3) for a binary outcome using an odds ratio [17]. The levels correspond to: little or no, low and moderate heterogeneity respectively [18].
A sensitivity analysis was also considered to explore the performance of different methods when just 5% of observation had a positive outcome.

Two-stage IPD methods
In the two-stage approach, each study in the IPD was analyzed separately via logistic regression The first step estimated the study-specific intercept and slope and their associated within-study covariance matrix (consisting of the variances of the intercept and slope, as well as the covariance) for each study. This model reduces the IPD to its relative treatment effect estimate and variance for each study then at the second stage these aggregate data (AD) are synthesized (described below).
Model 1-Bivariate meta-analysis The AD were combined via a bivariate random-effects model that simultaneously synthesized the estimates whilst accounting for their correlation, and the within-study correlation [4]. The model assumes that the true effects follow a bivariate normal distribution and is estimated via restricted maximum likelihood with the following marginal distributions of the estimates [19]: where Σ is the unknown between-study variancecovariance matrix of the true effects (γ 0 and γ 1 ) and C j (j = 1, …, K) the with-in study variance-covariance matrix with the variances of the estimates.

Model 2: Conventional DerSimonian and Laird approach
The with-in study and between-study covariance estimates are often times not estimated since most researchers assumed that studies are independent, and instead a univariate meta-analysis of the logit of the odds ratios is performed [20]. The marginal distribution of the pooled estimated treatment effect under this approach is easily obtained as: with unknown parameters γ 1 and τ 1 2 , estimated via the inverse variance weighted non-iterative method (methodof-moments) [21].

One-stage IPD methods
The one-stage approach analyzes the IPD from all studies simultaneously, while accounting for clustering of subjects within studies [4]. The one-stage model is a form of GLMM. Two different specifications are considered.

Model 3-Random intercept and random slope
We estimated a GLMM with a random study effect u 0j and a random treatment effect u 1j via PQL and AGHQ, and allowed the random effects to be correlated, which implies that the between-study covariance between u 0j and u 1j is fully estimated.
Model 4-Stratified intercept one-stage Finally, the stratified one-stage approach estimates a separate intercept for each study rather than constraining the intercepts to follow a normal or other distribution. Therefore, there is no need for the normality assumption for the study membership, hence, the between-study covariance term is no longer estimated. The model is defined as follows: where I k = j indicates that a separate intercept should be estimated for each study j = 1, …, K and u 1j~N (0, τ 1 2 ). Parameters of both Models 3 and 4 were estimated via PQL and AGHQ. Correlation between random effects: In a sensitivity analysis, we extended the number of studies to 50 with an average sample size of 9000 and reduced the prevalence of the outcome to 5%. The prevalence of the outcome was fixed to 30% by setting the value of the intercept β 0 to -0.85 b The number of subjects per study was reported for only large studies when data sets were generated with imbalanced study sizes (bold text: 25% large studies-10 times more subjects)

Estimation Procedures and Approximations
The parameters of the one-stage models were estimated using PQL and AGHQ. For the two-stage approach, a logistic regression was first estimated for each study via maximum likelihood. The parameters of the two-stage model were estimated via method-of-moments (MOM) (Model 2) and restricted maximum likelihood (REML) (Model 1) [21][22][23] at the second stage. Both likelihood-based methods (PQL and AGHQ) were implemented on SAS version 9.4 using PROC GLIMMIX with default options [24]. The number of quadrature points in AGHQ was selected automatically [25], the absolute value for parameter convergence criterion was 10-8 and the maximum number of iterations was 100.
Therefore, for each generated data set the following models were fit.

Assessment criteria
The performance of the estimation methods was evaluated using: a) numerical convergence, b) absolute bias; c) root mean square error (RMSE); and d) coverage probabilityof the pooled treatment effect and its between-study variability.
Numerical convergence The convergence rate was estimated for all models fit, as the number of simulation repetitions that did converge (without returning a warning message) divided by the total attempted (M = 1000). Models that returned a warning message specifying that the estimated variance-covariance matrix was not positive definite or that the optimality condition was violated were considered not to have converged.
Bias The Monte Carlo bias of the pooled treatment effect and its between-study heterogeneity is defined as the average of the bias in the estimates provided by each method as compared to the truth, across the 1000 IPD-MA in each scenario. The Monte Carlo estimate of the bias is computed as whereθ j were the parameter estimates and θ was the true parameter of the pooled treatment effect or its between-study variance. We also reported the mean absolute bias (AB).
Mean square error The mean square error (MSE) is a useful measure of the overall accuracy, because it penalizes an estimate for both bias and inefficiency. The Monte Carlo estimate of the MSE is: For each scenario, the RMSE of the pooled treatment effect and its between-study heterogeneity was reported, as this measure is on the same scale as the parameter.
Coverage probability We estimated coverage for the pooled treatment effect and its between-study heterogeneity for the various methods. Gaussian coverage was estimated, where if θ̂−θ j j≤1:96 Â SE θð Þ the true value was covered, and if θ̂−θ j j> 1:96 Â SE θð Þ it was not. We reported the median, the 25 th and 75 th percentiles of the AB and RMSE of the pooled treatment effect and its between-study heterogeneity but reported percentages for the numerical convergence and coverage rate.

Results
Tables 2, 3, 4, 5, 6 and 7 present the median and interquartile range of the AB, RMSE, coverage and convergence of the pooled treatment effect and its between-study variance, respectively, as estimated via two-and one-stage; AGHQ and PQL; random-intercept and stratified-intercept methods. We reported results for data generated with imbalances in study sizes (different sample size in all studies) for both the random-intercept and stratified-intercept data generation (Eqs. 1 and 2) with correlated random effects (ρ = 0.5), as this scenario is likely the closest to real-life.
We did not exclude results from meta-analyses that returned a warning message (imperfect convergence). These meta-analyses were included as non-convergence and although these models failed to produce proper parameter estimates, these estimates were included in the calculation of the bias and the MSE.

One-versus Two-stage
In Tables 2 and 3, results for the absolute bias (AB) of the estimates for the pooled treatment effect β 1 are given. Recalling that the true parameter value was 0.18, we see that the biases were identical and under 0.05 in the one-stage and the two-stage approaches in both small and large data sets. Results were very comparable when the outcome rate was reduced from 30 to 5% (Additional file 1: Table S1). For both the one-and the ): (Random treatment-effect variance, random study-effect variance) f The two-stage approach did not return a confidence interval for τ 1 2 , hence no coverage was estimated and comparison was not applicable (NA) to the one-stage method ): (Random treatment-effect variance, random study-effect variance) f The two-stage approach did not return a confidence interval for τ 1 2 , hence no coverage was estimated and comparison was not applicable (NA) to the one-stage method    two-stage, results depended on the true τ 2 , and the sample size. For the larger sample size, root mean square error (RMSE) in β 1 was generally slightly larger when the onestage method was used than when the two-stage was used. The picture was similar across all heterogeneity levels (Tables 2 and 3) and when the outcome rate was reduced (Additional file 2: Table S3).
Neither one-stage nor two-stage methods yielded coverage of β 1 close to nominal levels (Tables 2 and 3). Increasing sample size had a positive effect on percent coverage, and increasing the true heterogeneity made estimation more difficult, hence decreasing the coverage (Table 3).
Absolute bias of the between-study heterogeneity, τ 1 2 was usually slightly lower when the one-stage approach was used than when the two-stage approach was (Tables 2 and 3), particularly, when the sample size was small ( Table 2) and when greater amount of heterogeneity exist in the random effects (Bottom panel of Table 2). Regarding the effects of the simulation parameters, AB decreased when data was generated with equal study sizes and increased when the rate of occurrence was reduced (Additional file 3: Table S2). In these cases, the one-stage approach was most biased. The RMSE of τ 1 2 for the one-stage estimates was mostly smaller than the RMSE of the two-stage method estimates. For increased sample size or reduction in the level of heterogeneity in the random effects, RMSE of τ 1 2 decreased at least by a factor of three across both methods. While the RMSE of τ 1 2 was inflated when the outcome rate was reduced, the one-stage method continued to outperform that of the two-stage method (Additional file 4: Table S4).
Convergence was not a problem for the two-stage approach while convergence of the one-stage method varied from 90 to 100% (Tables 2 and 3).

AGHQ versus PQL
One-stage models estimated via PQL and AGHQ methods often yielded similar AB in β 1 . There was no observed difference in the AB (β 1 ) between the methods when the outcome rate was reduced (Additional file 1: Table S1).
RMSE of β 1 were generally greater when AGHQ was used than when PQL was used (Tables 4 and 5). Decreasing sample size, increasing the variances of the random effects or reducing the event rate (Additional file 2: Table S3) made precise estimation more difficult, hence RMSE increased.
When the true heterogeneity was large and total sample was small (Top panel of Table 4), AGHQ provided coverage for β 1 closer to nominal levels than PQL, while both methods provided comparable coverage when the sample size was increased ( Table 5). Note that across both methods, levels of coverage were higher as heterogeneity increased and similar coverage was observed when the outcome rate was reduced (Additional file 5: Table S5).
AB in τ 1 2 , was very comparable but slightly lower when PQL was used rather than AGHQ (Tables 4 and  5). The AB decreased with increasing sample size, particularly, when PQL was used (Table 5). There was substantial bias in τ 1 2 estimates when the event rate was reduced (Additional file 3: Table S2).
On account of a better overall performance of PQL with regards to AB, RMSE of τ 1 2 was generally lower with PQL than with AGHQ (Tables 4 and 5). RMSE decreased with decreased variability in the random effects, and with increased sample size. In addition, PQL-estimates continued to yield smaller RMSE than AGHQ-estimates when the outcome rate was reduced (Additional file 4: Table S4).
We found important under-coverage of the estimates for τ 1 2 for both estimation methods, particularly when PQL was used (Tables 4 and 5). The percent coverage was usually fair for both estimation methods when sample size increased, but was poor when the outcome rate was reduced (Additional file 6: Table S6).
Convergence occurred more often when AGHQ was used than when PQL was used (Tables 4 and 5). Convergence was problematic for PQL, particularly when true heterogeneity was low and sample size was small (Bottom panel of Table 4). Comparable convergence was seen when the event rate was reduced (Additional file 5: Table S5).

Random-intercept versus stratified-intercept
The results of the simulation studies, modeling the intercept as random or fixed (random slope was always considered) via PQL estimation are summarized in the Tables 6 and 7.
The convergence was markedly low (14-97%) for the fixed intercept & random slope method (Tables 6 and 7). Convergence was only reasonable for the approach when the sample size was large and heterogeneity was small, whereas convergence was always greater than 80% for the random intercept and slope approach.
In general, AB in β 1 was similar for both stratifiedintercept (random-slope only) and random intercept & slope methods. Regarding the simulation parameters, sample size and variability of the random effects, were not influential in reducing the AB in β 1 .
The RMSE in β 1 was smaller when estimated via the random intercept and slope model than when only a random slope was fit (Tables 6 and 7).
Increased sample size and level of heterogeneity in the random effect was most influential in determining coverage probability.
Absolute bias in τ 1 2 was clearly comparable when fit with a random intercept & slope approach or a random slope only (Tables 6 and 7). For lower outcome rate, there was a trend towards less pronounced bias when a random slope only was fit (Additional file 3: Table S2).
We observed lower RMSE of τ 1 2 when a random intercept was fit, especially when the true heterogeneity was large (Top panels of Tables 6 and 7). Comparable results were seen when both models were fit in large sample and the true heterogeneity was small (Bottom panel of Table 7)-also when outcome rate was reduced (Additional file 4: Table S4).
We found significant under coverage of τ 1 2 when both models were fit, however, this was more severe when a random slope only model was fit (Tables 6 and 7). When the generated values of τ 0 2 or τ 1 2 were low (i.e. low variability in the random effects) and sample size was increased, we had less difficulty to estimate the coverage of τ 1 2 when both models were fit. The coverage probability continued to be an issue when the rate of occurrence was reduced (Additional file 6: Table S6).

Findings
Our simulation results indicate that when the number of subjects per study is large, the one-and two-stage methods yield very similar results. Our results also confirm the finding of previous empirical studies [5,26,27] that in some cases, the one-stage and two-stage IPD-MA results coincide. However, we found discrepancies between these methods, with a slight preference towards the one-stage method when the number of subjects per study is small. In these situations, neither method produced accurate estimates for the between-study heterogeneity associated with the treatment-effect; however, the biases were larger for the two-stage approach. Furthermore, one-stage methods produced less biased and more precise estimates of the variance parameter and had slightly higher coverage probabilities, though these differences may be due to using the REML estimate of τ 1 2 instead of the der Simonian and Laird estimator used in the two-stage approach.
Estimation of GLMMs with binary outcomes continues to pose challenges, with many methods producing biased regression coefficients and variance components [7]. AGHQ has been shown to overestimate the variance component with few clusters or few subjects [17]. On the contrary, PQL has been found to underestimate the variance component while the standard errors are overestimated [12]. In the context of IPD-MA, we found similar absolute bias of the PQL-and AGHQ-estimated pooled treatment effect, while the PQL-estimates of the between-study variance had greater precision when study sizes were small and random effects were correlated. This somewhat confirms previous results, which found that PQL suffers from large biases but performs better in terms of MSE than AGHQ [6]. Both estimation methods experienced difficulty in attaining nominal coverage of the between-study heterogeneity associated with the treatment effect in two situations: (i) when the number of studies included was small and/or (ii) the true variances of the random effects were small. We also found that convergence was not an important problem for AGHQ when meta-analyses included studies with less than 50 individuals per study. However, convergence was poor when the prevalence of the outcome was reduced to 5% and the true heterogeneity was close to zero.
Stratification of the intercept in one-stage models avoids the need to estimate the random effect for the intercept and the correlation between the random effects. This approach may be preferable in situations not investigated in this work (e.g. when the distribution of the random effects is skewed). However, this approach suffered from marked convergence rates when fit to small data sets (15 studies and on average 500 subjects).

Strengths and Limitations
We used simulation studies to compare various analytic strategies to analyze data arising from IPD-MA across a wide range of data generation scenarios but made some simplifications. We only considered binary outcomes, one dichotomous treatment variable, a two-level data structure, and no confounders. Moreover, we estimated GLMMs via PQL and AGHQ, but did not compare Bayesian or other estimation methods, which might be particularly useful in sparse scenarios [28]. We have made the assumption throughout that IPD were available. Certainly, the time and cost associated with collecting IPD are considerable. However, once such data is in hand, we have addressed several open questions relating to the best way to analyze it. We should also note that methods exist for combining IPD and aggregated data [7]. Further study is needed to investigate alternative confidence intervals (or coverage probability) for the between-study heterogeneity that can be used to remedy the under-coverage of Gaussian intervals. The normalitybased intervals (coverage rate) we studied greatly underperformed in most scenarios because the constructions of the confidence interval are likely to be invalid. A further simplification that limits the generalizability of this work is that it is restricted to only two-arm trials. The extension to three or more arms would require careful consideration of more complicated correlation structures in treatment effects across arms and within studies [29].
One important comparison we have not addressed is, computational speed where the two-stage method had a distinct advantage over the one-stage; PQL was faster than AGHQ and the stratified-intercept model run-time was quicker than the random-intercept model.
As far as we know, this simulation study is the first to simultaneously generate data with normally distributed and stratified random intercepts. This study also compares approaches that include a random intercept for study membership to those that do not. Furthermore, the use of simulation -to systematically investigate the robustness of the approaches to variation in sample size, study number, outcome rate, magnitude of correlation and variances. As a result, our scenarios have allowed us to assess performance without being too exhaustive.

Guidelines for Best Practice
On the basis of these findings, we can make several recommendations. When the IPD-MA included many studies and the outcome rate was not too low, this work supports the conclusion of a previous study [5] that the conventional two-stage method by DerSimonian and Laird [21] is a good choice under the data conditions simulated here. Cornell et al. found that the DL method produced too-narrow confidence bounds and p values that were too small when the number of studies was small or there was high between-study heterogeneity [30]. In such cases, a modification such as the Hartung-Knapp approach may be preferable [31]. Further, while the bivariate two-stage approach is very rarely used in practice, we found that it tended to yield good overall model performance, comparable with that of the one-stage models when study sizes are small. In addition, our results also suggest that the one-stage method can be used in IPD-MA where study sizes are less than 50 subjects per study or few events were recorded in most studies (outcome rate of 5%). In these cases, the one-stage approach is more appropriate as it models the exact binomial distribution of the data and offers more flexibility in model specification over the two-stage approach [32].
If interest lies in estimation of the pooled treatment effect or the between-study heterogeneity of the treatment effect, estimation using PQL appeared to be a better choice due to its lower bias and mean square error for the settings considered. On the contrary, computational issues such as convergence occurred less with this technique than with AGHQ. However, it is important to note that convergence and coverage in τ 2 was an issue in small and large total sample sizes and also, when level of true heterogeneity was large.
For these simulated data, the results of both the random-intercept and stratified-intercept models were not importantly different. However, under both data generations, fitting a GLMM with the random-intercept was overall less sensitive to misspecification in small sample sizes with large between-study heterogeneity than the stratified-intercept GLMM since we have observed high rates of non-convergence via the stratified-intercept model.
There are four important caveats to these recommendations. First, our simulations show greater accuracy of the pooled odds ratio as the number of studies increase. Therefore, an IPD-MA with more studies will provide more accurate estimates. Secondly, our results show that the estimation of the between-study heterogeneity of the treatment effect is highly biased regardless of the sample size and number of studies. Therefore, we should always expect that the variance parameter be estimated with some error. Thirdly, small overall samples mark the trade-off under which a meta-analyst might consistently choose precision over bias and our simulations show that PQL estimation may be preferred in these situations. Finally, large overall sample size can eliminate the lack of statistical power present in small overall samples. In such cases, comparable results are seen for one-and two-stage methods and fitting a two-stage analysis as a first step may be advisable. This could aid as a quick and efficient investigation of heterogeneity and treatment-outcome association.

Conclusion
To summarize, the one-and two-stage methods consistently produced similar results when the number of studies and overall sample are large. Although the PQL and AGHQ estimation procedures produced similar bias of the pooled log odds ratios, PQL-estimates had lower RMSE than the AGHQ-estimates. Both the random-intercept and stratified-intercept models yielded precise and similar estimates for the pooled log odds ratios. However, the random-intercept models gave good coverage probabilities of the between-study heterogeneity in small sample sizes and yielded overall good convergence rate as compared to the random slope only model.