Tests for publication bias are unreliable in case of heteroscedasticity

Regression-based methods for the detection of publication bias in meta-analysis have been extensively evaluated in the literature. When dealing with continuous outcomes, hidden factors such as heteroscedasticity may interfere with the test statistics. In this paper we investigate the influence of residual heteroscedasticity on the performance of four tests for publication bias: the Egger test, the Begg-Mazumdar test, and two tests based on weighted regression. In the presence of heteroscedasticity, the Egger test and the weighted regression tests substantially inflate the Type I error rate, while the Begg-Mazumdar test deflates it. Although all four tests already have low statistical power, heteroscedasticity typically reduces it further. Our results, combined with earlier criticism of publication bias tests, lead us to conclude that applying these tests to continuous treatment effects is not warranted.


Introduction
In a meta-analysis, publication bias can lead to an incorrect pooled estimate of a treatment effect. In the presence of publication bias, the treatment effect is associated with factors that affect publication, e.g., the size of the standard error of the treatment effect or the size of the study. Thus studies lacking statistical significance, or smaller studies, are less likely to be published.
Several methods have been proposed in the literature to test for this type of publication bias, e.g., the Egger test [1], the rank-correlation test [2], and several others [1,3-9]. These tests allow for heteroscedastic residual variances of the study effect sizes, but when their performance was studied, the underlying data models were typically homoscedastic once sample size differences and the realization of standard errors are ignored. We believe that homoscedasticity is not always a valid assumption, especially when dealing with continuous outcomes on individuals. For many clinical and social outcomes the variance may be (inversely) proportional to the mean (e.g., blood pressure in cardiovascular disease [10], forced expiratory volume in respiratory disease [11], dairy sire evaluation in genetics [12], the smoking-mood relationship in psychology [13], income and consumption in economics [14], grade point average in education [15], stock market volatility in finance [16], radioimmunoassay in biology [17], pharmacokinetics, enzyme kinetics, and chemical kinetics in pharmacology [18], as well as inequalities in sociology [19]).
In the presence of heteroscedasticity, studies may or may not follow the premises of the test statistics, but when the variance of the outcome is correlated with its mean, the resulting treatment effect estimates will be correlated with their standard errors as well. This may lead to the detection of artificial "publication bias" without any real publication bias process being present. Since treatment-related heteroscedasticity is not testable with aggregated data in a meta-analysis, there is no guarantee that positive test results for publication bias are truly positive at all. It is therefore crucial to better understand the performance of publication bias tests for aggregated data under heteroscedasticity.
The objective of this paper is to demonstrate that heteroscedasticity negatively affects the performance of test statistics for publication bias in the case of continuous outcomes. We perform a simulation study, since analytical investigation is complicated. We ignore bias adjustment methods, because we expect that correcting pooled estimates for publication bias is even more difficult when the presence or absence of publication bias is hard to determine. The Egger test, the rank-correlation test, and two tests based on a weighted regression model [4] are considered. The choice of the Egger test follows from the objective of this paper, since it tests the dependence of the standardized estimated treatment effect on the precision of the study effect size (i.e., the inverse of its standard error). The two weighted regression methods were included because they have been recommended for practice in the literature [4], and they might in theory be robust to heteroscedasticity. We also included the rank-correlation test because it is not a regression-based method and it is based on the correlation between the normalized treatment effect and its standard error. We did not consider publication bias tests for other outcome types (e.g., binary outcomes) or other forms of publication bias (e.g., language bias) [1,3,5-9].
In Section 2 we describe the four test statistics for publication bias in an aggregated-data meta-analysis. In the same section, we introduce a heteroscedastic mixed effects model for individual participants in a study [17,20]. This model is used to formulate the aggregated treatment effect per study and to simulate data. The third part of Section 2 describes a mechanism of publication bias, based on concepts for study selection described in the literature [1,8,21,22]. Section 3 then describes the simulation study and the choice of parameter settings. Section 4 presents the simulation results, and a discussion is provided in Section 5.

Tests for publication bias on aggregated data
An aggregated-data meta-analysis usually consists of treatment effect estimates D_i obtained from different studies i = 1, 2, ..., m, accompanied by their estimated standard errors S_i [23]. The four test statistics for testing the hypothesis of no publication bias investigate the association between D_i and S_i.

Egger's Test:
The Egger test uses a regression model with the standardized effect size D_i/S_i as the response variable and the precision 1/S_i as the independent variable. Using a t-test, the null hypothesis of no publication bias is rejected if the intercept of the regression model deviates significantly from zero [1].
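Although the paper's computations were done in SAS, the Egger test is simple enough to sketch in a few lines. The following minimal Python illustration (function name and OLS details are ours, not the paper's) regresses D_i/S_i on 1/S_i and t-tests the intercept:

```python
import numpy as np
from scipy import stats

def egger_test(d, s):
    """Egger's test: regress the standardized effect sizes d/s on the
    precisions 1/s and t-test whether the intercept deviates from zero.
    Returns (estimated intercept, two-sided p-value)."""
    d, s = np.asarray(d, float), np.asarray(s, float)
    y = d / s                       # standardized effect sizes
    x = 1.0 / s                     # precisions
    fit = stats.linregress(x, y)
    m = len(d)
    resid = y - (fit.intercept + fit.slope * x)
    sigma2 = np.sum(resid**2) / (m - 2)          # residual variance
    # standard error of the OLS intercept
    se0 = np.sqrt(sigma2 * (1.0 / m + x.mean()**2 / np.sum((x - x.mean())**2)))
    t = fit.intercept / se0
    return fit.intercept, 2.0 * stats.t.sf(abs(t), df=m - 2)
```

A clearly asymmetric funnel, i.e., effect sizes that grow with the standard error, drives the intercept away from zero and yields a small p-value.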

Weighted regression:
The weighted regression method uses D_i as the response variable, S_i as the independent variable, and 1/(S_i² + τ²) as the weight,¹ with τ² the between-study variance estimated with the DerSimonian-Laird method [24]. Using a t-test, the null hypothesis of no publication bias is rejected if the slope of the independent variable deviates significantly from zero [4]. This method will be referred to as the weighted DL test. However, the DerSimonian-Laird estimator has been found to underestimate the between-study variance, thereby producing confidence intervals for the mean treatment effect that are too narrow [25]. Instead of the DerSimonian-Laird estimator, the restricted maximum likelihood (REML) estimator has been recommended in the literature [26]. We therefore also considered the weighted regression model with τ² estimated by REML, here referred to as the weighted REML test. We used procedure MIXED in SAS, version 9.4, to calculate this REML estimate of τ².
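A minimal Python sketch of the weighted DL test, with the standard DerSimonian-Laird moment estimator written out (the paper used SAS; this illustrative implementation and its function names are ours):

```python
import numpy as np
from scipy import stats

def dl_tau2(d, s):
    """DerSimonian-Laird moment estimate of the between-study variance."""
    d, s = np.asarray(d, float), np.asarray(s, float)
    w = 1.0 / s**2
    mu = np.sum(w * d) / np.sum(w)               # fixed-effect pooled mean
    q = np.sum(w * (d - mu)**2)                  # Cochran's Q statistic
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    return max(0.0, (q - (len(d) - 1)) / c)      # truncate at zero

def weighted_dl_test(d, s):
    """Weighted regression of d on s with weights 1/(s^2 + tau^2);
    t-test on the slope (H0: no association between effect and SE)."""
    d, s = np.asarray(d, float), np.asarray(s, float)
    w = 1.0 / (s**2 + dl_tau2(d, s))
    X = np.column_stack([np.ones_like(s), s])    # intercept + slope design
    XtW = X.T * w                                # equivalent to X^T diag(w)
    cov = np.linalg.inv(XtW @ X)
    beta = cov @ XtW @ d                         # WLS estimates
    m = len(d)
    sigma2 = np.sum(w * (d - X @ beta)**2) / (m - 2)
    t = beta[1] / np.sqrt(sigma2 * cov[1, 1])
    return beta[1], 2.0 * stats.t.sf(abs(t), df=m - 2)
```

Swapping `dl_tau2` for a REML estimate of τ² would give the weighted REML variant; the regression step is unchanged.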

Rank correlation:
The rank-correlation test applies Kendall's tau to the normalized effect sizes D*_i and the estimated variances of the effect sizes (under the assumption of homogeneity of effect sizes). This non-parametric test statistic approximately follows a standard normal distribution when D*_i and S_i² are independent. The associated p-value is thus used to test the hypothesis of no publication bias [2,8].
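A sketch of the rank-correlation test in Python, using the usual Begg-Mazumdar normalization of the effect sizes against the fixed-effect pooled mean (the implementation and function name are ours, for illustration):

```python
import numpy as np
from scipy import stats

def begg_mazumdar_test(d, s):
    """Rank-correlation (Begg-Mazumdar) test: Kendall's tau between the
    normalized effect sizes and their variances, assuming homogeneity."""
    d, s = np.asarray(d, float), np.asarray(s, float)
    v = s**2                                   # estimated variances
    w = 1.0 / v
    mu = np.sum(w * d) / np.sum(w)             # fixed-effect pooled mean
    v_star = v - 1.0 / np.sum(w)               # variance of d_i - mu
    d_star = (d - mu) / np.sqrt(v_star)        # normalized effect sizes
    return stats.kendalltau(d_star, v)         # (tau, p-value)
```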

Heteroscedastic mixed effects model on individual participants
Let Y_ijk denote the continuous response variable of individual k = 1, ..., n_ij, exposed to treatment j = 0, 1, in study i = 1, ..., m. A heteroscedastic linear mixed effects model per individual [20] can then be described as

Y_ijk = μ_j + U_ij + ξ_j exp(V_i/2) ε_ijk,    (1)

with μ_j the mean of treatment or control group j, θ = μ_0 − μ_1 the mean treatment effect, U_ij a study-specific random effect for group j, U_i0 − U_i1 a random treatment effect for study i, ξ_j² a treatment-specific residual variance parameter, V_i normally distributed, and ε_ijk ∼ N(0, 1) standard normally distributed and independent of the random effects U_i0, U_i1, and V_i. Residual heteroscedasticity at the individual level is introduced via the parameter ξ_j² and the random term exp(V_i). The variance ξ_j² represents a fixed difference in variability between individuals in the two treatment groups (i.e., treatment affects both the level and the variability), while exp(V_i) represents random heteroscedasticity between individuals across studies (i.e., individuals are more or less alike within studies). Thus it is the random variable V_i, and not the variance parameters ξ_j², that makes this a heteroscedastic mixed effects model, because these parameters only introduce heteroscedasticity within a study. If V_i is degenerate at 0, model (1) becomes an ordinary mixed effects model.
It is assumed that (U_i0, U_i1, V_i)^T has a multivariate normal distribution with mean 0 and variance-covariance matrix Σ given by

Σ = ( σ_0²          ρ_M σ_0 σ_1   ρ_V σ_0 σ_2 )
    ( ρ_M σ_0 σ_1   σ_1²          ρ_V σ_1 σ_2 )
    ( ρ_V σ_0 σ_2   ρ_V σ_1 σ_2   σ_2²        ).

The value ρ_M represents the correlation between the study-specific random effects U_i0 and U_i1 for the treatment and the control group, respectively. The random treatment effect U_i0 − U_i1 determines the heterogeneity of the study effect sizes as follows: if ρ_M = 1 and σ_0 = σ_1, then U_i0 − U_i1 is degenerate at zero or non-existent, while all other settings with ρ_M < 1, σ_0 > 0, and σ_1 > 0 lead to study heterogeneity. The value ρ_V represents the correlation between the mean and the logarithm of the random heteroscedastic residual variance.
The treatment effect per study is given by the raw mean difference D_i = Ȳ_i0 − Ȳ_i1, where Ȳ_ij = Σ_k Y_ijk / n_ij is the average value for group j in study i. The standard error S_i for the effect size in study i satisfies S_i² = s_i0²/n_i0 + s_i1²/n_i1, where s_ij² is the sample variance for treatment group j in study i. Based on model (1), the treatment effect can be written in the form of the well-known random effects model² for meta-analysis studies [23],

D_i = θ + U_i + ε_i,    (2)

with U_i = U_i0 − U_i1 the random study effect and ε_i the residual of study i.
Without the existence of V_i, the residuals ε_i in (2) are homoscedastic if the sample sizes n_ij are equal across studies. The variance S_i² can be rewritten in terms of the sample variances s_ij², with (n_ij − 1) s_ij² / (ξ_j² exp(V_i)) chi-square distributed with n_ij − 1 degrees of freedom.

¹ In case the weight is changed to 1/S_i², the weighted regression approach is identical to Egger's test.
² In the random effects model it is often assumed that the random variables U_i and ε_i are independent and normally distributed, but due to our random heteroscedastic variable exp(V_i) both assumptions will be violated.
Clearly, the introduced random heteroscedasticity V_i affects both the treatment effect D_i and the standard error S_i. As a consequence, V_i affects the responses and the independent variables used in the Egger, the weighted DL, the weighted REML, and the rank correlation test. It also makes an analytical investigation of the test statistics more complex, since the joint distribution of the responses and independent variables is less tractable. We therefore studied the influence of the random residual heteroscedasticity on the four test statistics by simulation.
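To illustrate how V_i propagates into both D_i and S_i, the model of this section can be sketched as a small simulation. All parameter values below are illustrative defaults, not the paper's settings, and the residual scale ξ_j exp(V_i/2) is the form implied by the variance description (residual variance ξ_j² exp(V_i)):

```python
import numpy as np

def simulate_study_effects(m, n, mu=(160.0, 155.0), xi=(10.0, 10.0),
                           sd=(2.0, 2.0, 0.5), rho_m=0.5, rho_v=0.0, seed=0):
    """Draw (D_i, S_i) for m studies of n participants per arm from
    Y_ijk = mu_j + U_ij + xi_j * exp(V_i / 2) * eps_ijk, so that the
    residual variance within study i and arm j is xi_j^2 * exp(V_i)."""
    rng = np.random.default_rng(seed)
    s0, s1, s2 = sd
    cov = np.array([[s0**2,       rho_m*s0*s1, rho_v*s0*s2],
                    [rho_m*s0*s1, s1**2,       rho_v*s1*s2],
                    [rho_v*s0*s2, rho_v*s1*s2, s2**2]])
    u0, u1, v = rng.multivariate_normal(np.zeros(3), cov, size=m).T
    d = np.empty(m)
    se = np.empty(m)
    for i in range(m):
        scale = np.exp(v[i] / 2.0)             # random residual heteroscedasticity
        y0 = mu[0] + u0[i] + xi[0] * scale * rng.standard_normal(n)
        y1 = mu[1] + u1[i] + xi[1] * scale * rng.standard_normal(n)
        d[i] = y0.mean() - y1.mean()           # raw mean difference D_i
        se[i] = np.sqrt(y0.var(ddof=1) / n + y1.var(ddof=1) / n)
    return d, se
```

Setting `sd=(s0, s1, 0.0)` with `rho_v=0.0` recovers the homoscedastic special case in which V_i is degenerate at zero.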

Publication bias mechanism on aggregated data
We briefly describe the selection model which creates publication bias at the study level [5,21]. For each study i ∈ {1, 2, ..., m} in the meta-analysis, the selection model assumes a latent variable Z_i that depends on the standardized mean difference D_i/S_i. If the latent variable is positive (Z_i > 0), study i is published and appears in the meta-analysis; when it is non-positive (Z_i ≤ 0), the study is not published, which may create selection bias in the meta-analysis. The latent variable is given by

Z_i = α + β D_i/S_i + δ_i,

with α and β > 0 two constants and δ_i ∼ N(0, 1) standard normally distributed and independent of all other random terms. Thus the larger the standardized treatment effect, the larger the probability of being selected (assuming that effect sizes are more frequently positive).
Note that the selection process of studies will be affected by the random residual heteroscedasticity V_i through the standardized effect size D_i/S_i. The standardized effect size can be rewritten in terms of ν ∼ N(0, 1) and U_i = U_i0 − U_i1. Thus the difference in behavior of Z_i with and without heteroscedasticity is determined by the difference in behavior of (β + U_i)exp{−V_i} and β + U_i, respectively. The distribution of β + U_i is symmetric and normal, while that of (β + U_i)exp{−V_i} is skewed to the right and non-normal. Combined with the choices of the constants α and β, the probability of selecting a study is lower under heteroscedasticity than under homoscedasticity when these studies would have the same standardized effect size D_i/S_i.
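The selection step itself is a one-liner. A sketch with hypothetical inputs (in practice α and β come from the quantile conditions described in Section 3; the function name is ours):

```python
import numpy as np

def select_studies(d, s, alpha, beta, rng):
    """Publication decision: study i is kept when the latent variable
    Z_i = alpha + beta * D_i/S_i + delta_i is positive (delta_i ~ N(0,1)).
    Returns a boolean 'published' mask."""
    d, s = np.asarray(d, float), np.asarray(s, float)
    z = alpha + beta * d / s + rng.standard_normal(len(d))
    return z > 0
```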

Simulation study
We simulated a meta-analysis with m studies and varied the sample size n_i of study i = 1, ..., m. This sample size was drawn from an overdispersed Poisson distribution, i.e., n_i | γ_i ∼ Poi(λ exp{0.5γ_i}), with γ_i ∼ Γ(a_0, b_0) drawn from a gamma distribution. Within each study the participants were then randomly allocated to the treatment and the control group with equal probabilities, resulting in n_i1 participants in the treatment group and n_i0 participants in the control group (i.e., n_i0 | n_i ∼ Bin(n_i, p)). The continuous response Y_ijk was then simulated according to the heteroscedastic linear mixed effects model described in Section 2.2. The data from this model were then used to calculate the study effect size D_i and its standard error S_i.
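The study-size mechanism can be sketched as follows (the function name and the shape/scale reading of Γ(a_0, b_0) are our assumptions; with a_0 = b_0 = 1 the two conventions coincide):

```python
import numpy as np

def draw_sample_sizes(m, lam=100.0, a0=1.0, b0=1.0, p=0.5, seed=0):
    """Overdispersed Poisson study sizes n_i | g_i ~ Poisson(lam*exp(0.5 g_i)),
    g_i ~ Gamma(a0, scale=b0), with participants allocated Binomial(n_i, p)."""
    rng = np.random.default_rng(seed)
    g = rng.gamma(a0, b0, size=m)              # overdispersion draws
    n = rng.poisson(lam * np.exp(0.5 * g))     # total study sizes
    n0 = rng.binomial(n, p)                    # control-arm sizes
    return n, n0, n - n0                       # total, control, treatment
```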
To introduce publication bias, a selection process was simulated according to the selection model described in Section 2.3. We used the 5% and 95% quantiles of the set of standardized treatment effects D_1/S_1, D_2/S_2, ..., D_m/S_m, denoted q_5 and q_95, respectively. The values α and β were chosen such that P(Z_i > 0 | D_i/S_i = q_95) = 0.99 and P(Z_i > 0 | D_i/S_i = q_5) = 0.025. Thus standardized effect sizes that are relatively small with respect to the other studies are published with low probability, and large standardized effect sizes are published almost always. That the publication of one study depends on other studies may be reasonable if more research is already known on the same topic. Solving the two probability equations results in α = (−1.96 q_95 − 2.33 q_5)/(q_95 − q_5) and β = (−1.96 − α)/q_5, using the normality of the random term δ_i and its independence of the standardized treatment effects. Note that constructing α and β in this way results in different values of α and β per simulated meta-analysis.
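This closed-form solution for α and β can be checked numerically; a sketch (function name ours) that recovers the two target selection probabilities:

```python
import numpy as np

def selection_constants(std_effects):
    """Solve for (alpha, beta) so that a study at the 95% quantile of the
    standardized effects is published with probability ~0.99 and one at
    the 5% quantile with probability ~0.025 (using Phi(2.33) ~ 0.99 and
    Phi(-1.96) ~ 0.025)."""
    q5, q95 = np.quantile(std_effects, [0.05, 0.95])
    alpha = (-1.96 * q95 - 2.33 * q5) / (q95 - q5)
    beta = (-1.96 - alpha) / q5
    return alpha, beta
```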
Different simulation settings were considered, both with and without publication bias and with (σ_2² > 0) and without (σ_2² = 0) random heteroscedasticity. The parameter settings were chosen such that the simulation corresponds approximately to a meta-analysis of clinical trials on, for instance, hypertension treatment. Parameter settings used to generate the aggregated data (D_i, S_i) from the individual participant data were m ∈ {20, 50, 100}, λ = 100, a_0 = b_0 = 1, p = 0.5, μ = 160, and θ = −5. Based on the number of studies that remained in the meta-analysis, the four methods in Section 2.1 were used to test for publication bias. The four test statistics were applied to the same meta-analysis data and were considered significant at the level of 0.1 in order for our results to be comparable to other studies [4,6,27-30]. We study the Type I error and the power of these tests. We also report the effective number of studies used in the meta-analyses. The simulation of the meta-analysis data and the analysis of the data were conducted with SAS software, version 9.4.

Results
The Type I error rate and the power of Egger's test, the rank correlation test (RC), the weighted DL test (wDL), and the weighted REML test (wREML) are presented in Table 1 and Table 2, respectively. For the power values, the effective number of studies (as a percentage of the number m) ranged from 37.18% to 39.55% on average for ρ_V = −0.7 to ρ_V = 0.7.
From Table 1 it can be seen that when heteroscedasticity is absent (σ_2 = 0), the weighted regression approaches and the rank correlation approach show a nominal Type I error rate of approximately 10% (although the rank correlation test was slightly conservative at m = 20). When heteroscedasticity is introduced, the Type I error rates of the Egger test, the weighted DL test, and the weighted REML test remain close to their Type I error rates under homoscedasticity when m = 20 and ρ_V ≥ −0.3. However, the Type I error rate of these three tests increases when ρ_V < −0.3, compared to their rates under homoscedasticity. When the number of studies m increases, we see a different pattern: the Type I error rates of these three tests increase as |ρ_V| increases.
When publication bias is introduced, heteroscedasticity influences the statistical power of the four test statistics in a different way than it does the Type I error rates. The power increases with the correlation ρ_V. This makes sense: if the heteroscedasticity is negatively correlated with the heterogeneity of effect sizes (ρ_V < 0), this correlation reduces the positive correlation between study effect sizes and standard errors that is introduced by the publication bias (antagonism). For positive values of ρ_V the publication bias is reinforced (synergism). However, the situation is somewhat more complicated than just the presence of synergism and antagonism, because heteroscedasticity also affects the publication bias mechanism itself (see Fig. 1). Under heteroscedasticity, non-selected studies may have higher standardized effect sizes than under homoscedasticity, while selected studies may have lower standardized effect sizes (see Fig. 1). As a consequence, the power of all four test statistics at ρ_V = 0 is lower than the power under homoscedasticity when the number of studies is relatively large, due to this altered publication bias mechanism.

Discussion and conclusion
The performance of the Egger test, the rank correlation test, the weighted DL test, and the weighted REML test was investigated for testing publication bias in the presence of residual heteroscedasticity for mean differences. Residual heteroscedasticity for mean differences is plausible and realistic, since it represents heterogeneity in inter-participant variability across studies. Indeed, participants can be more alike in one study than in another, depending on the implemented selection criteria for the study participants. For instance, pragmatic clinical trials may hardly use any selection criteria, since they focus on effectiveness, while other clinical trials may focus on efficacy in a specific set of participants with a particular symptom or disease in a limited age range. Note that the form of heteroscedasticity that we included has been studied in more sophisticated ways in the area of multilevel models [20].
In the simulation study, we found the Type I error of Egger's test inflated. This is a well-known phenomenon for log odds ratios [31] and standardized mean differences [32,33], but in our simulation such inflation was unexpected. The cause is related to the way we simulated the data. We chose to simulate data from an IPD model, which differs from the aggregated-data models that other researchers have used to study the performance of tests for publication bias [22,24,28]. In our simulation, the distribution of the sample size n_i affects Egger's test directly. If we select the overdispersion parameter γ_i ∼ Γ(1/3, 3) in n_i | γ_i ∼ Poi(λ exp{0.5γ_i}), the distribution of n_i is less extreme and less skewed than with γ_i ∼ Γ(1, 1), and the Type I error rate of Egger's test reduces to 12.1% (for m = 50). One very large study in a meta-analysis becomes an influential point in Egger's regression analysis that strongly affects the test on the intercept, especially when heterogeneity in study effect sizes is not accounted for, as it is in the weighted regression approaches.
Heteroscedasticity renders the four tests unreliable for testing publication bias in meta-analysis. It causes the Type I error to deviate substantially from the nominal level, and also from the level obtained under homoscedasticity when the test was not nominal to begin with (Egger's test). For the Egger test, the introduction of heteroscedasticity increases the variability in 1/S_i. Together with the correlation between the heteroscedasticity V_i and the heterogeneity U_i0 − U_i1, this reduces the standard error of the intercept in the regression model, leading to more rejections of the null hypothesis than under homoscedasticity. For the weighted regression approaches something similar happens: the correlation between heteroscedasticity and heterogeneity induces a correlation between D_i and S_i, which inflates the Type I error rate. For the rank-correlation test, introducing heteroscedasticity causes the Type I error rate to drop below the nominal level, but the larger the number of studies, the smaller this effect. The introduction of heteroscedasticity reduces the variance of Kendall's tau statistic, and thereby the variance of the test statistic. This results in a non-standard normal distribution with variance less than one, producing the conservative Type I error rates. Furthermore, heteroscedasticity strongly decreases the power in most cases.
Tests for publication bias have always shown low statistical power, even in the absence of heteroscedasticity [3,28,29,33]. Given this known evidence and our newly presented criticism, testing for publication bias in meta-analyses with continuous outcomes should be avoided. We did not include any of the publication bias methods (e.g., trim and fill, Copas' selection method) that could correct the pooled estimate of a meta-analysis [9,22].
These estimation approaches typically make use of the correlation between the study effect size and its standard error in one way or another, as do the test statistics we investigated. Since heteroscedasticity distorts this relationship, we expect that these correction methods would not be able to correct the pooled estimate appropriately. As we have demonstrated, diagnosing publication bias through the correlation between effect size and standard error is strongly affected. Heteroscedasticity also affected the mechanism of publication bias itself, making it difficult to disentangle the two mechanisms at an aggregated level. Thus when heteroscedasticity is anticipated from the topic of study, it may be recommended to collect and pool the individual participant data. The heteroscedasticity can then potentially be modeled, although it remains unknown how to incorporate publication bias into this approach. More research is needed to be able to model the correlation between study effect sizes and their standard errors.