Examining publication bias: a simulation-based evaluation of statistical tests for publication bias

Background: Publication bias is a form of scientific misconduct. It threatens the validity of research results and the credibility of science. Although several tests for publication bias exist, no in-depth evaluations are available that examine which test performs best for different research settings.
Methods: Four tests for publication bias, Egger's test (FAT), p-uniform (PU), the test of excess significance (TES), as well as the caliper test (CT), were evaluated in a Monte Carlo simulation. Two different types of publication bias and three degrees of bias (0%, 50%, 100%) were simulated. The type of publication bias was defined either as file-drawer, meaning the repeated analysis of new datasets, or as p-hacking, meaning the inclusion of covariates in order to obtain a significant result. In addition, the underlying effect (β = 0, 0.5, 1, 1.5), effect heterogeneity, the number of observations in the simulated primary studies (N = 100, 500), and the number of studies available to the publication bias tests (K = 100, 1,000) were varied.
Results: All evaluated tests were able to identify publication bias both in the file-drawer and in the p-hacking condition. The false positive rates were, with the exception of the 15%- and 20%-caliper tests, unbiased. The FAT had the largest statistical power in the file-drawer conditions, whereas under p-hacking the TES was, except under effect heterogeneity, slightly better. The CTs were, however, inferior to the other tests under effect homogeneity and had a decent statistical power only in conditions with 1,000 primary studies.
Discussion: The FAT is recommended as a test for publication bias in standard meta-analyses with no or only small effect heterogeneity. If two-sided publication bias is suspected, or if p-hacking is the suspected mechanism, the TES is the first alternative to the FAT. The 5%-caliper test is recommended under conditions of effect heterogeneity and a large number of primary studies, which may be found when publication bias is examined in a discipline-wide setting where primary studies cover different research problems.


Funnel asymmetry test (FAT)
The first class of tests addresses publication bias through the association between effect sizes and their variances. Because the variance (se²) of an effect size in a primary study (es) is strongly related to the sample size, small studies with a low number of observations (N) show an increased variation of effects around the unobserved true effect. The larger the N, the smaller the variation, and thus the more precise the effect size of the study. Under publication bias, small non-significant studies are mostly omitted, whereas small effects estimated precisely in large-N studies remain in the analysis. When this pattern for a small positive effect is represented in a scatterplot, a typical inverted funnel-shaped pattern can be observed (called a "funnel plot"; Light & Pillemer 1984: 63-69). In the exemplary Figure A1, studies in the lower left of the right graph are missing because of publication bias with a preference for significant positive effects. The left graph, in contrast, shows a symmetric funnel with no publication bias.
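The selection mechanism behind an asymmetric funnel can be illustrated with a short simulation. This is a hedged sketch, not the simulation design of the study itself; the sample-size range, the true effect of 0.2, and the one-sided 5% test are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_studies(k, beta, file_drawer, rng):
    """Simulate k published effect sizes with varying precision.

    Under file-drawer selection, non-significant results are discarded
    and redrawn, which empties the lower corner of the funnel plot.
    """
    kept_es, kept_se = [], []
    while len(kept_es) < k:
        n = rng.integers(20, 500)      # sample size of the primary study
        se = 1 / np.sqrt(n)            # standard error shrinks with N
        es = rng.normal(beta, se)      # observed effect around the true beta
        significant = es / se > 1.96   # one-sided 5% significance test
        if significant or not file_drawer:
            kept_es.append(es)
            kept_se.append(se)
    return np.array(kept_es), np.array(kept_se)

es_fair, se_fair = simulate_studies(200, 0.2, False, rng)
es_sel, se_sel = simulate_studies(200, 0.2, True, rng)

# Imprecise (high-se) studies survive selection only with inflated effects,
# so the mean published effect is biased upwards:
print(round(es_fair.mean(), 3), round(es_sel.mean(), 3))
```

Plotting `es` against `1/se` for both samples reproduces the symmetric and the truncated funnel of Figure A1.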

Figure A1 Funnel asymmetry test (FAT)
Exemplary funnel plot showing a symmetric funnel in the unbiased left graph and an asymmetric funnel in the right graph with an asymmetry towards positive effects.
Relying only on subjective graphical information, as provided by funnel plots, might be misleading (Tang & Liu 2000). Begg & Mazumdar (1994: 1089) examine the rank correlation between the standardised effect (t = es/se) and its variance (se²). A similar approach by Egger et al. (1997) 1 regresses t on the inverse standard error (1/se). t is chosen as the dependent variable in order to account for the unequal variance across the effects (heteroscedasticity) by weighting each observation by the inverse of its variance. Compared to a regression of es on se, this weighting changes the interpretation of the coefficients.
1 This estimator is equivalent to the bivariate FAT-PET recommended by Stanley & Doucouliagos (2014). The FAT-PET furthermore makes it possible to include "potential effect modifiers" (Deeks et al. 2008: 284) in a meta-regression model. This is especially necessary if the literature being studied shows, besides its theoretically meaningful overall effect, systematic differences (e.g. different implementations of an experimental stimulus, different experimental populations, etc.).
In this regression, t = β0 + β1(1/se) + ε, the constant β0 provides the test for publication bias (the FAT indicates publication bias if β0 ≠ 0), whereas β1 makes it possible to identify a true empirical effect while controlling for publication bias (Egger et al. 1997: 632). In the left graph of Figure A2 a primary study (depicted as one dot) with almost no precision would not be able to find an effect (H0: β0 = 0 could not be rejected). In contrast, in the right graph under publication bias, a study with no precision would nevertheless find a substantial effect.
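The FAT regression described above can be sketched in a few lines. The weighted model is fitted here as an ordinary regression of t on 1/se, which is algebraically equivalent to the WLS formulation; the data-generating step is a hypothetical illustration, not the study's simulation design:

```python
import numpy as np

def fat_egger(es, se):
    """Funnel asymmetry test: regress the standardised effect t = es/se
    on the precision 1/se and test the intercept (beta0) against zero.
    Returns (beta0, beta1, t-statistic of beta0); a large positive
    t-statistic indicates asymmetry towards positive effects."""
    es, se = np.asarray(es, float), np.asarray(se, float)
    y = es / se
    X = np.column_stack([np.ones_like(se), 1.0 / se])  # columns: beta0, beta1
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - 2)              # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[0], coef[1], coef[0] / np.sqrt(cov[0, 0])

# Illustration: a symmetric funnel vs. one fully censored at significance.
rng = np.random.default_rng(0)
se = 1 / np.sqrt(rng.integers(20, 500, size=2000).astype(float))
es = rng.normal(0.3, se)
sig = es / se > 1.96
b0_all, b1_all, t_all = fat_egger(es, se)              # no selection
b0_sig, b1_sig, t_sig = fat_egger(es[sig], se[sig])    # 100% file-drawer
```

Without selection the intercept stays near zero while β1 recovers the true effect; after censoring, the intercept's t-statistic becomes clearly significant.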

Figure A2 Funnel asymmetry test (FAT)
Exemplary graphical example of the FAT indicating no publication bias in the left (intercept through the origin) and publication bias in the right graph (positive intercept).
Despite its strengths, the central weakness of the FAT lies in its low statistical power in settings with only a small number of primary studies (Macaskill et al. 2001 simulated its performance based on only 20 primary studies). 2

p-uniform (PU)
The tests discussed so far focus on the empirical effect sizes, whereas the p-curve method, proposed by Simonsohn et al. (2014b), and the similar PU, a method proposed by van Assen et al. (2015), focus entirely on the distribution of significant p-values. All non-significant values are therefore dropped from the analysis. The sample is, furthermore, restricted to the direction of the suspected publication bias: only positive or only negative effects are examined (Simonsohn et al. 2014a: 677). In a second step the skewness of the pp-distribution is tested (Simonsohn et al. 2015: 1149). Right skewness shows an overrepresentation of findings with substantial statistical significance and indicates a genuine empirical effect. Left skewness, in contrast, shows an overrepresentation of just-significant estimates that barely pass the significance threshold (in this case 5%) and indicates publication bias under the null hypothesis (Simonsohn et al. 2014b: 536).
2 In addition to the performance of the FAT, multiple simulation studies (Alinaghi & Reed 2016; Paldam 2015; Reed 2015) also examine the unbiasedness of the effect estimate (PET, the estimated underlying effect size corrected for publication bias), which is not of interest in the study at hand. The PET is especially threatened by an increased false positive rate under effect heterogeneity (Deeks et al. 2005; Stanley 2017); the properties of the FAT in these conditions have not yet been examined.

In the case of an underlying null effect, p-curve is therefore a special case of PU. In the numerator, the effect size estimate is conditioned on the underlying effect (μ), similar to a one-sample z-test. The denominator of the pp-value is not fixed to 0.05 as in p-curve, but is also conditioned on the underlying effect (μ), which is subtracted from the effect threshold (et) an effect has to reach to become statistically significant given its standard error (se). Taken together, pp = [1 − Φ((es − μ)/se)] / [1 − Φ((et − μ)/se)].
The test statistic is gamma-distributed with k degrees of freedom. 5 Because the skewness is now conditional on the underlying empirical effect, left skewness observed by PU identifies publication bias across all underlying empirical effects, as depicted in Figure A3.
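The conditional pp-value and the gamma-based skewness test described above can be sketched as follows. This assumes normally distributed effect estimates and a one-sided 5% threshold; with μ = 0 the denominator reduces to 0.05, so the computation coincides with the p-curve case. The pile-up data at the end are a hypothetical illustration:

```python
import numpy as np
from scipy import stats

def pp_values(es, se, mu=0.0, alpha=0.05):
    """Conditional pp-values: the probability of an effect at least as
    large as es, given that it exceeds the effect threshold
    et = z_crit * se, both evaluated under the assumed mu."""
    es, se = np.asarray(es, float), np.asarray(se, float)
    et = stats.norm.isf(alpha) * se         # one-sided effect threshold
    num = stats.norm.sf((es - mu) / se)     # P(ES >= es | mu)
    den = stats.norm.sf((et - mu) / se)     # P(ES >= et | mu); 0.05 if mu = 0
    return num / den

def pu_bias_test(es, se, mu=0.0):
    """Under H0 the pp-values are uniform, so the sum of -log(pp) across
    k studies follows a Gamma(k, 1) distribution.  A small sum (pp-values
    piled up near 1, i.e. just-significant estimates) indicates bias."""
    pp = pp_values(es, se, mu)
    return stats.gamma.cdf(-np.log(pp).sum(), a=len(pp))

# Hypothetical pile-up of just-significant z-values (se = 1, mu = 0):
rng = np.random.default_rng(1)
z_hacked = rng.uniform(1.65, 1.75, size=50)
print(pu_bias_test(z_hacked, np.ones(50)))   # small p-value: bias detected
```

A genuine effect instead shifts the pp-values towards 0 (right skewness), and the same statistic evaluated in the upper gamma tail would indicate evidential value.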

Caliper test (CT)
In contrast to the three aforementioned tests, the CT, developed by Gerber and Malhotra (2008a; 2008b), ignores most of the information provided by the included studies and looks only at a narrow interval (caliper = c) around the significance threshold (th) in the distribution of absolute z-values. Given a continuous distribution of z-values, just significant studies in the interval above the threshold (the so-called over-caliper; xz = 1) should be as likely as just non-significant studies in the interval below it (the so-called under-caliper; xz = 0).
Gerber and Malhotra (2008a; 2008b) use a 5%, 10%, 15% and 20% interval (c) proportional to the significance threshold (th). In particular, the widest 20% caliper may be too wide because it fully overlaps the 10% significance level, which could be another target threshold for publication bias. The higher the overrepresentation in the over-caliper, the higher the likelihood of publication bias. This is also shown in Figure A4: in the left graph with no publication bias no discontinuities are seen around the arbitrary 5% significance threshold (dashed line), whereas in the right graph a stepwise increase of just significant results indicates publication bias. As with the TES, a one-sided binomial test is used to test the equal distribution of z-values in the over- and under-caliper. 9
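A minimal version of the CT can be sketched as follows; the z-values at the end are hypothetical and chosen to show a pile-up just above the two-sided 5% threshold of 1.96:

```python
import math

def caliper_test(z_values, th=1.96, caliper=0.05):
    """Caliper test: compare the counts of absolute z-values just above
    vs. just below the significance threshold th, using a caliper width
    proportional to th.  Returns (n_over, n_under, one-sided binomial
    p-value for an overrepresentation in the over-caliper)."""
    width = caliper * th
    over = sum(1 for z in z_values if th <= abs(z) < th + width)
    under = sum(1 for z in z_values if th - width <= abs(z) < th)
    n = over + under
    # exact one-sided binomial test of H0: P(over-caliper) = 0.5
    p = sum(math.comb(n, i) for i in range(over, n + 1)) / 2 ** n
    return over, under, p

# Hypothetical absolute z-values with a pile-up just above 1.96:
z = [0.5, 1.2, 1.95, 1.97, 1.98, 1.99, 2.01, 2.02, 2.03, 2.05, 3.4]
over, under, p = caliper_test(z)
print(over, under, round(p, 4))   # 7 over vs. 1 under
```

With the 5% caliper the window is 1.96 ± 0.098, so only the two innermost values and their neighbours enter the test; all other studies are ignored, which is why the CT needs a large K to achieve power.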

Figure A4 Caliper test (CT with 5% caliper)
Exemplary graphical example of the CT indicating no publication bias in the left (no jump point around the significance threshold visualized by the red dashed line) and publication bias in the right graph (jump point at the 5% significance level).

Results in detail by simulation conditions
The following section presents the results for each simulation condition, beginning with the false positive rates. In addition to the statistical power (Tables A2-A5) of the evaluated publication bias tests, the shares of committed as well as successful publication bias are reported alongside the results. As in the regression analysis in the article, the impact of publication bias on the meta-analytical p-value is also reported.
False positive rates
Table A1 shows the false positive rates of the publication bias tests across all simulated conditions. Inflated false positive rates are highlighted in bold. Across all conditions the FAT, PU, the TES, as well as the narrower CTs (3%, 5%), had unbiased false positive rates. The FAT was closest to the expected 5% error rate. PU and the TES, as well as the 3% and 5% CTs, in contrast, were in most cases very conservative, falling far below 0.05. This overconservatism may be problematic with respect to a decreased statistical power, a matter discussed later on. The wider 10% and 15% CTs suffered from inflated false positive rates because, due to the large caliper width, the assumption of a uniform distribution in both calipers was violated. 10 For the 10% CT the specified false positive rate doubles to more than 10%, whereas in the case of the 15% CT it more than quadruples.

Statistical power
Looking at conditions with 50% publication bias in the file-drawer condition (see Table A2), the FAT had a superior power compared to the other tests in 14 of 20 conditions, as indicated by the underlined numbers. The FAT is, however, closely followed by the TES, which had a larger number of conditions with a satisfactory power (> 0.8) than the FAT (7 vs. 6). In the first condition, with N = 100 as well as K = 100, the TES was superior in the case of an underlying small or moderate effect (β = 0.5; 1; 1.5). The large variability of the primary-study effects, caused by the low N and low K in the meta-analyses, resulted in an overall minor statistical power. A sufficient power (highlighted in bold) was only reached in conditions with a low or moderate underlying true effect (β = 0.5; 1). This is caused by the high prevalence of committed publication bias (PB com) that is also successful (PB suc, meaning p < 0.05). None of the CTs yielded a sufficient power.

This picture changed when more studies were included in the meta-analysis. With K = 1000 most of the tests yielded a sufficient power. In particular, the FAT had a statistical power close to 100%, also under effect heterogeneity. PU and the TES failed to uncover file-drawer behaviour under effect heterogeneity, but performed well under homogeneity. PU was only able to discover file-drawer behaviour under low underlying true effects. The CTs profited the most from an increased K; the wider calipers (10%, 15%) had a larger statistical power than the narrower ones but also had inflated false positive rates (see Table A1) that might invalidate the conclusions (grey shaded area). The narrower calipers had a sufficient power only in studies with no or small underlying effects (β = 0; 0.5). K = 100 and N = 500 decreased the power of all tests drastically. In this condition the FAT had the largest, but still not satisfactory, power.
With K = 1000 a sufficient power was yielded in conditions with a low overall effect (β = 0; 0.5).

Note (as in Table A1): PB com displays the share of studies committing publication bias; PB suc the share of studies successfully committing publication bias; p defl. the deflation of the meta-analytical p-value by publication bias.
The statistical power of the tests increased when the intent to engage in file-drawer behaviour was set to 100% (see Table A3). Overall, more publication bias tests achieved a satisfactory statistical power to detect publication bias. Also in these conditions, the FAT dominated in 13 of 20 conditions. As before, neither the TES nor PU was able to detect publication bias under effect heterogeneity. The TES was, furthermore, not able to detect publication bias with an underlying null effect, even though publication bias was successfully applied in 21.3% of the cases.
The dominance of the FAT weakened when looking at the 50% p-hacking condition (see Table A4). Instead, the TES was, besides the 15% CT, superior under most conditions, and had the advantage that its false positive rate was not inflated. The overall pattern was, however, quite similar: both PU and the TES had almost no power to detect p-hacking under effect heterogeneity.
Also, the statistical power was only satisfactory for PU when K = 100. With a large number of included studies, however, the power of the CTs came close to, or even exceeded, that of the FAT, PU and the TES.
In the 100% p-hacking condition (see Table A5) the FAT caught up with the TES and yielded an increased power, especially in the case of K = 100. Despite the dominance of the 15% CT, the TES and the FAT followed closely. The CT showed a similar strength to that demonstrated in the earlier conditions under effect heterogeneity and K = 1000. The underperformance of all tests in the condition with N = 500 and moderate underlying effects (β = 1; 1.5) is caused by the fact that most results in this condition are already significant.
Overall, the FAT dominated under the file-drawer condition. The TES, in contrast, had a slightly higher statistical power than the FAT under the p-hacking condition without effect heterogeneity. However, the differences between the two tests were quite small. The CTs performed well under both the file-drawer and the p-hacking condition with heterogeneous effect sizes and large numbers of included studies (K = 1000). Although the 10% and 15% calipers had the highest power to detect p-hacking, these tests should not be applied due to their increased false positive rates.