Random effects meta‐analysis: Coverage performance of 95% confidence and prediction intervals following REML estimation

A random effects meta‐analysis combines the results of several independent studies to summarise the evidence about a particular measure of interest, such as a treatment effect. The approach allows for unexplained between‐study heterogeneity in the true treatment effect by incorporating random study effects about the overall mean. The variance of the mean effect estimate is conventionally calculated by assuming that the between study variance is known; however, it has been demonstrated that this approach may be inappropriate, especially when there are few studies. Alternative methods that aim to account for this uncertainty, such as Hartung–Knapp, Sidik–Jonkman and Kenward–Roger, have been proposed and shown to improve upon the conventional approach in some situations. In this paper, we use a simulation study to examine the performance of several of these methods in terms of the coverage of the 95% confidence and prediction intervals derived from a random effects meta‐analysis estimated using restricted maximum likelihood. We show that, in terms of the confidence intervals, the Hartung–Knapp correction performs well across a wide‐range of scenarios and outperforms other methods when heterogeneity was large and/or study sizes were similar. However, the coverage of the Hartung–Knapp method is slightly too low when the heterogeneity is low (I 2 < 30%) and the study sizes are quite varied. In terms of prediction intervals, the conventional approach is only valid when heterogeneity is large (I 2 > 30%) and study sizes are similar. In other situations, especially when heterogeneity is small and the study sizes are quite varied, the coverage is far too low and could not be consistently improved by either increasing the number of studies, altering the degrees of freedom or using variance inflation methods. Therefore, researchers should be cautious in deriving 95% prediction intervals following a frequentist random‐effects meta‐analysis until a more reliable solution is identified. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.


Introduction
A random effects meta-analysis combines the results of several independent studies in order to summarise a particular measure of interest, such as a treatment effect. The approach allows for unexplained betweenstudy heterogeneity in the true treatment effect by incorporating random study effects about the overall mean [1,2]. Interest may be in estimating the overall mean (summary or pooled) effect and its confidence interval, quantifying the magnitude of heterogeneity itself or deriving predictive inferences about the treatment effect in a future setting, such as a 95% prediction interval or the probability the treatment will be effective [1].
There are numerous methods for constructing confidence intervals for a random effects meta-analysis, and several articles have examined the performance of these methods. Cornell

The random effects model
Suppose that there are k studies investigating a treatment effect, that is, the difference in outcome value or risk between a treatment group and a control group (for example, a mean difference, log odds ratio or a log hazard ratio). Further, suppose that each study provides the treatment effect estimatêi and its variance 2 i , for i = 1, · · · , k. The random effects model assumes that, because of heterogeneity between studies, each study is potentially estimating a different true effect i . Therefore, the conventional parametric approach to random effects meta-analysis, as outlined by Whitehead and Whitehead [11], is given bŷ where is the mean (also known as overall, summary or pooled) treatment effect, the 2 i are assumed known and 2 is the between study variance of the treatment effects. There are numerous methods for fitting the random effects model (1); however, REML is often preferred, as (unlike ML estimation) it accounts for the simultaneous estimation of and and (unlike methods of moments [10]) is based on normality of the random effects, which is convenient for making predictive inferences (see the discussion for more on this normality assumption).

Derivation of confidence intervals
Following estimation of model (1) using REML, the mean treatment effect estimatêis typically assumed to be normally distributed for large samples. Thus, a (1 − )100% confidence interval for the mean effect is conventionally calculated bŷ± Z 2̂, wherêis the REML estimate of ,̂is its standard error and Z 2 is the upper 2 quantile of the standard normal distribution. However,̂does not account for uncertainty in the REML estimate of̂. Moreover, this approach also assumes that the within study variances are fixed and known, which is an untenable assumption when study sizes are small. As a result, Equation (2) may not provide an accurate reflection of the total uncertainty in the mean effect̂.
To address this, Hartung and Knapp [7] suggest using the adjusted confidence interval where Var HK = q̂2, t k−1; 2 is the upper 2 quantile of the t distribution with k − 1 degrees of freedom and where w i = 1 2 i +̂2 and̂2 is the REML estimate of 2 . The Hartung-Knapp (HK) confidence interval in Equation (3) will usually be wider than the normal confidence interval (2) because it is based on the t distribution. However, it is possible, if q is sufficiently small, that the resulting interval will be shorter than the unadjusted interval. This obviously contradicts the idea that these intervals should be widened to account for additional uncertainty. With this in mind, Rover et al. [12] suggest a modification to the HK method which ensures that the resultant interval is not shorter than the conventional confidence interval in Equation (2). In particular, they suggest usinĝ where Var * HK = q * ̂2 and q * = max{1, q}. Sidik and Jonkman [5] also independently suggested using (3) to calculate confidence intervals for and subsequently proposed an alternative modification to the HK method, which uses a more robust estimator of the variance. In particular, they suggest usinĝ where The Sidik-Jonkman (SJ) confidence interval in Equation (5) will also usually be wider than the conventional confidence interval. However, Sidik and Jonkman [6] comment that Var SJ is biased, especially for small k and so suggest an alternative confidence interval which uses the bias corrected estimator where Another alternative is discussed by Kenward and Roger [9], which is implemented in the Stata package, ipdmetan [13]. In particular, they suggest adjusting the variance of̂using the expected information and appropriately modifying the degrees of freedom of the t distribution. For univariate random effects meta-analysis, the expected information is given by Then the adjusted variance is given by while the adjusted degrees of freedom are Thus the Kenward-Roger (KR) confidence interval is given bŷ

Derivation of prediction intervals
To summarise the potential treatment effect in a new setting, a (1 − )100% prediction interval is conventionally calculated using the equation proposed by Higgins et al. [1], wherê2 is the REML estimate of the between study heterogeneity, while the variance of the mean effect̂2 is the conventional value obtained following REML estimation. However, in our simulations, we also consider modification of Equation (8) to replacê2 with another measure, such as as Var HK (̂) or Var SJ (̂) . In addition, although Kenward and Roger [9] do not discuss modifying Equation (7) to propose a prediction interval, a natural extension from their work is to replace Equation (8) witĥ

Methods
We now perform a simulation study to examine the coverage performance of these aforementioned methods for deriving confidence and prediction intervals following REML estimation. The step by step guide to the simulation is summarised in the succeeding discussions.
Consider that a meta-analysis of k trials is of interest, for summarising a treatment effect. For a set number of studies k, to simulate a meta-analysis data set, we generate the k random effects i from a normal distribution with mean and variance 2 . The k within study varianceŝ2 i are simulated from a chi-square distribution centred on 2 and with n − 1 degrees of freedom. Here, n relates to the average within study sample size and controls the variability of the within study variance. We then generate the treatment effect estimates,̂i, in each study from a normal distribution with mean i and 2 .
Next, we fit the random effects model (1) to the simulated data set, using REML, to obtain the estimates of the mean treatment effect , its variancê2 and the between study variance 2 and use these to construct 95% confidence intervals for the mean effect using the conventional method (N) in Equation (2), the HK method in Equation (3), the modified HK method (HK2) in Equation (4), the SJ method in Equation (5), Table I. Summary of the parameters used in the simulation study. In addition, we consider different sample size mixes around the average: one study 10 times larger, one study 10 times smaller and a mixture of large and small studies.   (6) and the KR method in Equation (7). We then determine whether the derived confidence intervals contain the true mean treatment effect . Similarly, we also construct 95% prediction intervals using Equation 8 for each of the HK and SJ methods and using Equation (9) for the KR method. We also simulate a new treatment effect from the true random effects distribution (second line in model (1)) and determine whether the prediction interval derived actually contains this new treatment effect.
The given process is then repeated 10 000 times for the chosen parameter values, to produce 10 000 meta-analysis data sets and 10 000 confidence and prediction intervals for each method of interest. This allows us to calculate the coverage of the confidence and prediction intervals (that is, the proportion of times the true mean or the new treatment effect are found to lie in the derived confidence or prediction interval, respectively across all 10 000 replications).
Simulations were considered for a range of different scenarios that differed according to the parameter values chosen. The parameter values are summarised in Table I. In particular, we varied the number of studies k, (either 3,5,7,10 or 100) and the relative degree of heterogeneity, controlled by = 2 2 . We consider a true mean treatment effect = 1 and an average within study variance 2 = 0.1 with the variability in the within study variance controlled by n = 30. The degree of heterogeneity is controlled using the ratio = 2 2 . In particular, we consider 2 = 1, 0.1, 0.05 and 0.01, which corresponds to = 10, 1, 0.5 and 0.1, respectively. However, we also report the average I 2 over each of the 10 000 metaanalyses, to give the reader a feel for the relative magnitude of the between-study variation relative to the total variation in the meta-analysis for each scenario.
In addition to the situation where all studies are of the same size (balanced: 2 i = 2 ), we also consider the impact of including one large study 10 times bigger than the others (a within study variance 10 times smaller: 2 i = 2 ∕10) and one small study (with a within study variance 10 times larger: 2 i = 10 × 2 ) and also the effect of a mixture of large and small studies. In these scenarios, we do not modulate 2 or 2 to account for modified sample size, and so, these modifications naturally impact the degree of heterogeneity in these settings. It is also important to note that when the study sizes are unbalanced, one must be careful when interpreting and I 2 as both these measures of heterogeneity rely on the idea of a 'typical' within study variance.
With 10 000 studies, the Monte Carlo error is approximately 0.4, and so, if the coverage is indeed 0.95, coverage outside the range of [94.6, 95.4] is unexpected.

Results
Tables II-V display the coverage of the 95% confidence and prediction intervals for each of the approaches, for each of a variety of scenarios defined by different k and . The tables also report the average I 2 over each of the 10 000 simulations, to give the reader a feel for the relative magnitude of the between-study variation relative to the total variation in the meta-analysis for each scenario. Figures 1 and 2 graphically display the confidence and prediction interval coverage, respectively, for each of the methods, based on balanced studies with a large ( = 10) and small ( = 0.1) degree of heterogeneity. Table II. The coverage of the 95% confidence interval (CI) and prediction interval (PI) obtained by fitting the model (1) using restricted maximum likelihood (N) and the coverage of the adjusted CI and PI using a variety of methods. The results are based on a random effects meta-analysis with k studies and = 10. The table also includes the average I 2 across each of the 10 000 simulations.

Method
Balanced studies One small study   Table III.
The coverage of the 95% confidence interval (CI) and prediction interval (PI) obtained by fitting the model (1) using restricted maximum likelihood (N) and the coverage of the adjusted CI and PI using a variety of methods. The results are based on a random effects meta-analysis with k studies and = 1. The table also includes the average I 2 across each of the 10 000 simulations.

Method
Balanced studies One small study   Table IV.
The coverage of the 95% confidence interval (CI) and prediction interval (PI) obtained by fitting the model (1) using restricted maximum likelihood (N) and the coverage of the adjusted CI and PI using a variety of methods. The results are based on a random effects meta-analysis with k studies and = 0.5. The table also includes the average I 2 across each of the 10 000 simulations.

Method
Balanced studies One small study

Performance of confidence intervals
For relatively large heterogeneity ( ⩾ 1), the conventional confidence intervals (Equation (2)) are consistently too short when there are only a small number of studies. For example, in Table II ( = 10), the conventional confidence intervals have coverage of 82% and 88% for k = 3 (I 2 = 75%) and k = 5 (I 2 = 84%) balanced studies, respectively. These results agree with previous simulations studies [5][6][7]. The coverage of the conventional confidence intervals improves as k increases, but at least 10 studies are required to obtain coverage, which is reasonably close to 95%. For example, in Table III ( = 1), the conventional confidence intervals have coverage of 93% and 95% for k = 10 (I 2 = 43%) and k = 100 (I 2 = 52%) balanced studies, respectively. All of the modified methods (Equations (3)-(7)) perform reasonably well in terms of confidence interval coverage provided k ⩾ 5. For example, in Table II ( = 10), the coverage probabilities of the HK method is 95%, while the HK2 confidence interval is slightly wider on average with coverage 96%. Similarly, the KR and SJ methods are 96% and 93% respectively for k = 5 (I 2 = 84%) balanced studies, compared with 88% for the conventionally calculated confidence intervals. In general, the KR method seems to produce confidence intervals that are slightly too wide on average, while the SJ confidence intervals are slightly too short on average; however, as k increases, both these converge on the expected coverage of 95%.
When there is very little heterogeneity, the conventional confidence intervals perform much better. For example, in Table V ( = 0.1), the conventional confidence interval has reasonably good coverage in almost all cases and is comparable to the HK method. In this case, the KR and HK2 methods generate confidence intervals that are slightly too wide when there are a small number of studies. For example, for k = 10 (I 2 = 17%) balanced studies, the KR and HK2 methods both have coverage of 98%.
Of all the modified methods, the HK method is frequently the best in terms of confidence interval coverage; however, when the heterogeneity is small and the study sizes are mixed, the HK method produces confidence intervals that are too wide. For example, in Table V ( = 0.1), the coverage probabilities of the HK and HK2 methods are both 97% for k = 100 (I 2 = 69%) studies with a mixture of sizes, compared with 95% for the conventionally calculated confidence intervals.

Performance of prediction intervals
Prediction interval coverage is reasonably good for all methods provided that the relative degree of heterogeneity is reasonably large ( > 1; I 2 > 30% approximately for balanced studies) and there are at least k = 5 studies. For example, in Table II ( = 10) for k = 3 (I 2 = 75%) balanced studies, the coverage is too large, but for k ⩾ 5 (I 2 ⩾ 84%), the prediction interval coverage is close to 95% for all methods.
However, the coverage performance of the prediction intervals is poor in situations where the relative degree of heterogeneity is small or moderate. For example, in Table III where = 1, there is departure from the expected coverage of 95% for all methods when k ⩽ 10, including the conventional method (Equation (8)), with coverage often considerably less than 95%. Similarly, for < 1 (I 2 < 30% approximately for balanced studies), there are even more serious departures from the expected coverage of 95%. For example, in Table IV, the conventional method has coverage of 87% for k = 10 (I 2 = 30%) balanced studies. Also, in Table V, the prediction interval coverage is 91% for all methods even when there are k = 100 (I 2 = 15%) balanced studies.
Coverage of prediction intervals is also poor when there are a mixture of large and small studies. For example, for = 0.1 and k = 100 (I 2 = 69%), the prediction intervals are generally too wide, with coverage of 99% for all methods. Changing the degrees of freedom used in Equation (8), for example, to k − 3 rather than k − 2, did not consistently improve coverage performance to the required level (results not shown).

Example
We consider a meta-analysis of seven randomised trials looking into the effect of anti-hypertensive treatment versus control on reducing diastolic blood pressure (DBP) in patients with hypertension, as detailed elsewhere [14]. The treatment effect of interest is the difference in the final mean DBP value between the treatment and control groups, after adjusting for baseline DBP. Treatment effect estimates and their variances were derived in each study, using analysis of covariance as detailed by Riley et al. [15] and a random-effects meta-analysis used to synthesise the results using REML. Heterogeneity is expected given the diversity of included populations. Crucially, there is also a mixture of study sizes, with the largest having 6991 participants and the smallest having only 172.
Additionally, for illustrative purposes, we also took the mean difference between treatment groups at baseline in each study and then performed a random effects meta-analysis for this measure [15]. This was done to consider a situation where homogeneity is expected, such that the relative amount of betweenstudy variance should be very small.
In each meta-analysis, we construct confidence and prediction intervals using each of the methods outlined in the previous section, and the findings are now described.

Pre-treatment meta-analysis
The forest plot in Figure 3 shows the results of the REML random effects meta-analysis (model (1)) on the pre-treatment difference in DBP between the treatment and control groups. As expected, prior to treatment, there is a relatively low degree of heterogeneity between studies with I 2 = 17.4% [0%; 61.3%] and̂= 0.09. Table VI and Figure 4 summarise the confidence and prediction intervals constructed using each of the methods. Unsurprisingly, the summary mean difference is very small indicating that the two randomised groups are well balanced in terms of baseline DBP. All the derived 95% confidence intervals are reasonably similar, with the exception of the KR method, which is much wider than the others. As identified in the simulation study, the KR method generally produces slightly wider confidence intervals in this setting, namely, when there are a mixture of study sizes and a relatively low degree of heterogeneity. Almost all of the adjustment methods successfully generate wider confidence intervals than the conventional method, but the SJ2 confidence interval actually shrinks compared with the conventional method. Again, this is identified in the simulation study as the coverage of the SJ2 method is too small in this setting.
Similarly, the prediction intervals show good agreement across all methods, with the exception of the KR method. The KR method generates a particularly erratic and uninformative prediction interval. This is caused by an unusually small value for the degrees of freedom = 1.54, which results in a hugely inflated value for t −1; 2 . Also, observe that the HK prediction interval actually shrinks, but this is corrected when applying the HK2 method.
More importantly, based on the findings of the simulation study, given that the relative degree of heterogeneity is small, it is likely that (aside from the erratic prediction interval for the KR method) all the derived prediction intervals are too narrow and thus inaccurate (see the bottom right of Table IV). Therefore, one should be cautious about using prediction intervals in this setting and, at best, they should be viewed as approximate. Forest plot for the pre-treatment difference in diastolic blood pressure between the treatment and control groups based on the random effects meta-analysis model (1) fit using restricted maximum likelihood and confidence interval for the mean treatment effect calculated using the conventional method. Table VI. The pre-treatment difference in diastolic blood pressure between the treatment and control groups (̂0) along with the 95% confidence intervals (CI) and standard error (s.e.) for each study. The summary resultŝare based on a random effects meta-analysis fit using restricted maximum likelihood with confidence and prediction intervals constructed using a variety of methods

Post-treatment meta-analysis
The forest plot in Figure 5 shows the results of a meta-analysis on the post-treatment difference in DBP between the treatment and control groups, based on a random effects meta-analysis fit using model (1) with REML. As expected, given the diversity of the populations, at the end of follow-up, there is a large amount of between study heterogeneity in the treatment effect with I 2 = 69.4% [32.6%; 86.1%] and̂= 1.73. Table VII and Figure 6 summarise the confidence and prediction intervals for each of the methods. All methods generate confidence intervals that are entirely below zero, providing strong evidence that the mean treatment effect is to reduce DBP more than control. However, note that the conventional method still produces a 95% confidence interval that is narrower than the others, and based on the simulation study, the other methods (especially HK) would appear more appropriate. The SJ2 method produces a much tighter confidence interval than the other methods; however, the simulation study results for heterogeneous studies of varied size show that the SJ2 method produces over-precise results in this setting (see the bottom right of Table III). Because heterogeneity is large in this meta-analysis, a prediction interval may be a more appropriate summary than the confidence interval for the mean effect; in other words, it is of interest whether the treatment effect is likely to always be beneficial in new populations. Prediction intervals from all methods agree that treatment is likely to be effective in at least 95% of settings at reducing DBP in a new setting, Figure 4. The summary results for the pre-treatment difference in diastolic blood pressure between the treatment and control groups based on a random effects meta-analysis fit using restricted maximum likelihood with confidence and prediction intervals constructed using a variety of methods.   as none contain zero. However, there is some concern that these prediction intervals may be too short because the simulation study reveals that in this setting, the coverage of prediction intervals is too low (for example, see the bottom right of Table III). All methods produce reasonably similar prediction intervals, but they should clearly be viewed as approximate given the simulation study. Further, it is clear that the KR interval is wider than the others. Again, the simulation study identifies that the KR prediction interval is likely to be overly conservative in this setting.

Key findings
In this paper, we compared the performance of several different methods for constructing confidence intervals for the mean effect and prediction intervals for the effect in a new setting from a random effects meta-analysis estimated using REML. In particular, we used a simulation study to examine the coverage performance of confidence and prediction intervals constructed using both conventional and novel estimators for the variance of the mean effect, combined with a normal or t distribution for deriving intervals. The key findings are summarised in Figure 7.
Firstly, it was demonstrated that when there is a large degree of heterogeneity ( > 1; I 2 > 30% approximately for balanced studies), confidence intervals constructed by the conventional approach are consistently too short if there are few studies. The coverage improves as the number of studies increases, but at least 10 studies are required to obtain reasonable coverage. By contrast, other methods that correct the confidence interval to account for additional uncertainty perform reasonably well in terms of coverage, provided there are at least five studies. Of these methods, when the heterogeneity is large and when study sizes are similar, the standard HK method appears consistently the best in terms of coverage based on our simulation findings. This conclusion is in line with other, previous simulations studies [5][6][7]. However, even the HK method is problematic in situations where heterogeneity is low, and the study sizes are quite varied, as the coverage appears slightly too low.
When there is very little heterogeneity, such as = 0.1 (approximately 15% < I 2 < 18% for balanced studies), the conventional confidence intervals are generally very accurate in terms of coverage, even for just k = 3 studies. As a result, some of the methods that widen the conventional confidence interval (for example, KR and HK2) have coverage that is too large. However, the HK method remains consistently accurate in terms of coverage for this situation.
For prediction intervals, it was shown that all methods perform reasonably well in terms of coverage provided there are at least five balanced studies and a reasonably large degree of heterogeneity ( > 1; I 2 > 30% approximately for balanced studies). However, the performance of all methods rapidly deteriorates as the between study variability is reduced and/or there is a large imbalance in the study sizes. For example, when = 1 prediction intervals are generally too short in all cases when k ⩾ 5, even when there are a large number of studies. When the level of heterogeneity is even lower ( < 1; I 2 < 30% approximately for balanced studies), prediction intervals are extremely inaccurate. Even when heterogeneity is large (I 2 = 60% for example), if there are a mixture of study sizes then the performance of the prediction intervals may still be poor, especially when there are fewer than 10 studies. Figure 7. Key findings regarding the coverage performance of confidence intervals (CIs) and prediction intervals (PIs) obtained using the conventional and modified intervals for a random-effects meta-analysis model (1) with estimation via restricted maximum likelihood.

Recommendations
Ideally, for the reporting of a random effects meta-analysis based on heterogeneous studies, both confidence and prediction intervals should be reported. Confidence intervals can be used to explain the uncertainty in the mean effect, while prediction intervals describe the potential effect in a new setting. However, the results of this paper identify that, in certain situations, one should interpret the confidence and/or prediction interval with caution. When there is little heterogeneity, confidence intervals constructed using the conventional method are generally very accurate in terms of coverage. Indeed, some of the other methods (for example, KR and HK2) actually generate confidence intervals that are too wide in this case. Hence, provided there are sufficient studies with little variability in the treatment effect between them, the conventional method can be used to construct accurate confidence intervals.
On the other hand, when there is a moderate or high degree of heterogeneity, one should not routinely construct confidence intervals using the conventional method unless there are a large number of studies. Indeed, if heterogeneity is large and only a small number of studies are available then it would be prudent to construct confidence intervals by a number of different methods. The HK method appears generally very good in terms of coverage in this situation and indeed most other settings, unless there is a large imbalance in the study sizes.
If there is substantial between-study heterogeneity then it is important to consider prediction intervals for the effect in a new population, in addition to the confidence interval. If the degree of heterogeneity is sufficiently large, one can construct accurate prediction intervals by almost any of the methods discussed here provided there are at least five studies that are reasonably well balanced in terms of size.
When the study sizes are unbalanced or there are lower levels of heterogeneity, all of the methods discussed here fail to construct accurate prediction intervals, including the conventional and increasingly used equation proposed by Higgins et al. [1] and Riley et al. [2]. Even when heterogeneity is large, if the study sizes are quite different, then even the conventional method may still perform poorly, with under-coverage again a concern. Therefore, in these settings, one should be cautious of interpreting frequentist prediction intervals, irrespective of how they are constructed; at best, they should be viewed as approximate, and further research is needed to address this in the frequentist setting.
For researchers who remain keen to produce prediction intervals, perhaps the most natural approach is to use a Bayesian method [16,17], for example, with an empirically based prior distribution for the between-study variance [18].

Limitations and further research
The simulation study in this paper considers the performance of the different models for a non-specific normally distributed outcome measure. While this means that our findings and recommendations are applicable to a wide range of measures, it ignores issues associated with some, more problematic, outcome measures (e.g. rare binary outcomes). Moreover, for binary outcomes in general, the standard error will be strongly correlated with the effect estimate. The performance of the random effects models in such situations should be the subject of further, more-tailored simulation studies.
Our simulation study also considered the impact of different mixtures of sample sizes on the model performance. These scenarios, which included incorporating one large or small study as well as a mixture of large and small studies, provide useful information for how these models might perform in practice. However, as our study did not modulate the within or between study variance to account for this, we were unable to control degree of heterogeneity in these cases.
Further, while we considered a broad range of values for k, the number of studies in the meta-analysis, we did not consider every plausible value of k. However, although we did not consider a study size between 10 and 100, it is unlikely that the results would change dramatically on this range.
Another limitation of this study is that we only consider estimating 2 using REML. Additional simulations (not presented here) show that the results remain relatively consistent when estimating 2 using DerSimonian and Laird's method of moments [10]. However, for a discussion on the impact of other heterogeneity estimators on the results of a random effects meta-analysis,refer to Sánchez-Meca and Marín-Martínez [19] and Langan et al. [20].
Another important issue when conducting a random effects meta-analysis is the specification of the random effects distribution. While it is common practice to assume the random effects are normally distributed, there is little justification for this [21]. Consequently, there are other methods that relax the assumption of normal random effects, such as the Bayesian method proposed by Lee and Thompson [16]. Thus, an appraisal of the impact of non-normal random effects is of considerable interest for future research.

Conclusions
When assuming the conventional random effects model with normally distributed random effects, confidence intervals for the mean effect should account for the uncertainty in the estimate of between-study variation. Generally, following REML estimation of a random effects meta-analysis, we recommend the Hartung-Knapp method although further work is needed to address the slightly low coverage for this approach in the situation where there are a variety of study sizes or low between-study heterogeneity. Further, prediction intervals derived using the conventional method proposed by Higgins et al. [1] are only accurate when the between-study heterogeneity is relatively large and study sizes are similar. In other situations, the coverage of prediction intervals can be too low and so should be treated with caution, with derivation in a Bayesian framework potentially preferred.