Generalized mean p-values for combining dependent tests: comparison of generalized central limit theorem and robust risk analysis

The test statistics underpinning several methods for combining p-values are special cases of generalized mean p-value (GMP), including the minimum (Bonferroni procedure), harmonic mean and geometric mean. A key assumption influencing the practical performance of such methods concerns the dependence between p-values. Approaches that do not require specific knowledge of the dependence structure are practically convenient. Vovk and Wang derived significance thresholds for GMPs under the worst-case scenario of arbitrary dependence using results from Robust Risk Analysis (RRA). Here I calculate significance thresholds and closed testing procedures using Generalized Central Limit Theorem (GCLT). GCLT formally assumes independence, but enjoys a degree of robustness to dependence. The GCLT thresholds are less stringent than RRA thresholds, with the disparity increasing as the exponent of the GMP ( r) increases. I motivate a model of p-value dependence based on a Wishart-Multivariate-Gamma distribution for the underlying log-likelihood ratios. In simulations under this model, the RRA thresholds produced tests that were usually less powerful than Bonferroni, while the GCLT thresholds produced tests more powerful than Bonferroni, for all r> − ∞. Above r> − 1, the GCLT thresholds suffered pronounced false positive rates. Above r> − 1/2, standard central limit theorem applied and the GCLT thresholds no longer possessed any useful robustness to dependence. I consider the implications of these results in the context of various interpretations of GMPs, and conclude that the GCLT-based harmonic mean p-value procedure and Simes' (1986) test represent good compromises in power-robustness trade-off for combining dependent tests.


Introduction
Combining p-values is a convenient and widely used form of meta-analysis that aggregates evidence across studies or tests, e.g. 1,2. Aggregating evidence in this way improves the sensitivity (power) of formal tests to detect subtle signals in data, making better use of resources and improving the potential for scientific discovery. There are many methods for combining tests, e.g. 3,4. In general, combining tests using the full data is more powerful than using summary statistics. However, access to the full data may be difficult for many reasons, for example computational tractability or issues of consent in statistical genetics. For these reasons, parameter estimates and standard errors, or Z-statistics, are often provided instead, e.g. 5. Combining Z-statistics, rather than p-values, allows parameters to be jointly estimated across datasets, e.g. 6. Nevertheless, combining p-values may be preferred when: (i) parameters are dataset-specific, (ii) hypotheses are mutually exclusive, or (iii) only the p-values are available. Fisher's is a widely-used method in scenario (i) that is appropriate when the datasets are independent 7 . The harmonic mean p-value (HMP) is suited to scenario (ii) 8, 9 . The Bonferroni procedure 10 is a universal method for combining p-values under arbitrary dependence. These methods are closely connected and can be thought of as occupying different strategies in trading off power against robustness to dependence.

Box 1. Kinds of combined tests for p-values
Some methods for combining p-values were originally formulated as constructing simultaneous confidence intervals for multiple tests, including the Bonferroni and Šidák procedures 10,11 . They account for multiple comparisons by widening the confidence interval of each test from 100(1 − )% to 100(1 − /K)% and 100(1 − ) 1/K %, assuming arbitrary dependence and independence respectively, where K is the number of tests. A p-value can be defined as the widest 100(1− p)% confidence interval that rejects the null hypothesis. Therefore, these approaches are equivalent to increasing the stringency of the p-value threshold of each test from to /K and 1 − (1 − ) 1/K respectively. When any individual p-value falls below the adjusted threshold, the grand null hypothesis that none of the tests are significant can be rejected. Adjusting confidence intervals and significance thresholds for multiple testing is thus equivalent to a combined test in which the minimum p-value is compared against the adjusted significance threshold. For instance, Tippett's combined test 12 , in which the minimum p-value is compared to significance threshold 1 − (1 − ) 1/K is equivalent to Šidák correction for this reason. In this article, I do not distinguish between these formulations of combining p-values. All aim to limit the probability of falsely rejecting the grand null hypothesis that none of the individual tests is significant, at a pre-determined level (the weak-sense familywise error rate 13 ). Some combined tests (e.g. [9][10][11] are additionally able to reject subsets of tests when some null hypotheses are false while limiting the probability of falsely rejecting any true null hypotheses (the strong-sense family-wise error rate 13, 14 ).
The focus of this article is to consider significance thresholds for the generalized mean p-value (GMP) and compare their performance under different dependence assumptions. The test statistics underpinning the Bonferroni, Šidák, HMP, Fisher and other procedures are special cases of the GMP:  (1) which includes the maximum (when r → ∞), arithmetic mean (r = 1), geometric mean (r → 0), harmonic mean (r = -1) and minimum (r → -∞) 15 . The exponent parameter r affects the characteristics of the test, so that as r approaches -∞, the GMP is more influenced by smaller p-values, and as it approaches ∞ is it more influenced by larger p-values. These characteristics affect the interpretation of the GMP as suitable for particular purposes, such as model averaging (HMP; r = -1), or combining evidence (Fisher's method; r → 0). (See section 6 for more on interpretation).
In general, GMPs cannot be interpreted directly as p-values because they are not uniformly distributed even when the null hypothesis (that the constituent p-values are uniformly distributed) is true 15 . Instead the GMP can be used as a test statistic by calculating a significance threshold , ( ) r K  � Ψ for rejecting the null hypothesis, which limits the false positive rate to some pre-specified level, e.g. = 5%: Pr ( ,..., ) ( )~(0.1) .
This requires an assumption about dependence between the constituent p-values. For example, the p-values may be assumed (i) to be independent, as in Fisher's method 7 , (ii) to conform to a particular model of dependence, as in Brown's method 16 , or (iii) to possess arbitrary dependence, in which the worst case is usually considered, as in the Bonferroni procedure 10 . Different assumptions about dependence produce different significance thresholds, which in turn affect the power of the test to reject the null hypothesis. More conservative assumptions produce more stringent thresholds that trade off reduced statistical power against greater robustness of the false positive rate to dependence. Vovk and Wang 15 derived significance thresholds for GMPs under arbitrary dependence by considering the worst case scenario using robust risk analysis (RRA). For the HMP they derived a significance threshold of RRA, 1, ( ) log The result is precise (asymptotically as K → ∞) in the sense that for any value of , there is a form of dependence under which the threshold is not conservative (so the equality in Equation 2 is satisfied) 15 .
In contrast, Wilson 9 derived a considerably less stringent threshold using generalized central limit theorem (GCLT): where 1 ≈ and K is assumed large. This result implies that the HMP can be directly interpreted as if it were a p-value when it is small, because GCLT, 1, ( ) as 0.
The difference in the stringency of the GCLT and RRA thresholds, which approaches log K for small , stems from different assumptions about dependence between the constituent p-values. Formally, the GCLT derivation assumes independence, but the heavy-tailed distribution of p -1 confers robustness to dependence 9,17 . Specifically, a result by Davis  3. Do GMPs with exponents r ≠ -1 enjoy a more favourable power-robustness trade-off?
To address these questions, I derived significance thresholds for GMPs using GCLT (section 2). I motivated a model of p-value dependence based on the Wishart-Multivariate-Gamma distribution (section 3). I simulated under this model to test the power and false positive rates of the GCLT and RRA significance thresholds (section 4). To complete the picture, I derived procedures to control the strong-sense family-wise error rate based on the GCLT thresholds (section 5) and considered the interpretation of combining p-values using GMPs (section 6). The results indicate that the power of the GMP to combine p-values, under relevant dependence assumptions at = 0.05, was better than the Bonferroni procedure for GCLT thresholds, but worse than the Bonferroni procedure for RRA thresholds. However, GCLT thresholds began to suffer pronounced false positive rates for r > -1, and enjoyed apparently no robustness to dependence whatever for r > -1/2. I conclude that the GCLT-based HMP procedure 9 and the related Simes (1986) test 20 represent good compromises in power-robustness trade-off for combining dependent p-values. These methods are interpretable in terms of model-averaging and require no specific knowledge of the dependence structure.

GCLT significance thresholds for generalized mean p-values
This section uses GCLT to infer the distribution of GMPs under the grand null hypothesis and thereby construct significance thresholds, assuming the number of constituent p-values K is large. The GCLT derivation formally assumes the p-values are independent, but the Davis-Resnick condition extends to -∞ < r < 0, which implies that robustness to dependence is expected for small . The case r = 0 (geometric mean) cannot be directly attained by the same GCLT approach, but Fisher's method provides the exact solution anyway.
Assuming independence between p 1 , . . . , p K , GCLT states 21 that , , converges to a Stable distribution with heavy tail index λ > 0, where λ ≥ 2 corresponds to the Normal distribution. The heavy tail index is determined by the tails of the individual X i s, characterised as Assuming that p ~ U (0, 1), In other words, Beta(1/ ,1) when 0 Thus c = 1, d = 0 and Uchaikin and Zolotarev  Table 1, S λ,1 is an Extremal Stable distribution in Nolan's S1 parameterization (Equation 1 of 22) with parameters α = λ, β = 1, σ = 1, µ = 0. This is equivalent to Nolan's S0 parameterization (Equation 3 of 22 but it seemed to me there was a missing factor of i on line 2, second term in square brackets) with parameters α = λ, β = 1, σ = 1 and, if α = 1, µ 0 = 0, or if α ≠ 1, µ 0 = β tan(πλ/2). The moments required by Table 1 The main result of this paper is that a general significance threshold for the GMP with exponent r based on GCLT is therefore where 1 ,1 F λ − is the inverse cumulative distribution function of an Extremal Stable random variable S λ,1 and the coefficients a r,K and b r,K are defined in Table 1. It is notable that for any distribution. This occurs when 1 2 r > − because the p r s are no longer heavy-tailed in the sense that their variances are defined. Those results are therefore equivalent to a straightfoward application of central limit theorem. This transition in the behaviour of the GMP has implications for its robustness to dependence.

Small approximation
By the theory of regularly varying functions, the general significance threshold (Equation 14) simplifies when 1 2 r < − (i.e. λ < 2) to subject to the Davis-Resnick condition (Equation 6).
The small approximation shows that the HMP is the only GMP that can be directly interpreted as if it were a p-value, and only then when → 0. The small approximation is compared to the significance thresholds of Vovk and Wang 15 in Table 2.
appears to be applicable in the transitional region 1 0 (i.e. 2) 2 r − ≤ < λ ≥ (i.e. λ ≥ 2) but here the tail behaviour can be alternatively characterised as Gaussian. Empirically, both approximations appear to struggle, so I caution that the small approximation is not helpful for 1 . 2 r ≥ − For small , the GCLT significance threshold is less stringent than the RRA significance threshold by a factor of r/(1 + r) when r < -1, log K when r = -1 and (r + 1) 1/r /K 1+1/r when -1 < r < -1/2. As r → -∞, the GCLT and RRA thresholds converge to those of the Šidák and Bonferroni procedures respectively, which are equivalent for small . Even below r = -2, the difference in stringency is less than two-fold when is small, suggesting that in this region, the approaches are similar. However, directly comparing the significance thresholds only allows a comparison of the extreme cases of independence and arbitrary dependence. More generally, an explicit model of p-value dependence is required, which is the subject of the next section.

Dependence structure of likelihood ratio tests
In motivating a dependence structure for p-values, I consider the p-values to have arisen from nested likelihood ratio tests, in which each p-value is a regularly varying function of the maximized likelihood ratio R i for a pair of nested models M 0 and M i . Asymptotic theory for classical inference states (under various assumptions 23 ) that the deviance equals the true value of the parameter, then under the null hypothesis, S i ~ Normal νi (0, I). I.e. S i follows an uncorrelated standard multivariate normal distribution, and 2logR i follows a Chi-Squared(ν i ) distribution.
The above outline implies that where ℂ represents the covariance. Therefore Table 2. Significance thresholds for M r,K (p 1 , . . . , p K ) assuming large K and, for the GCLT threshold, small .
Thus, S has a multivariate Normal distribution with variancecovariance matrix equal to the block matrix of individual correlation matrices between the maximum likelihood estimates of each pair of models. Call that matrix ρ. The matrix might be positive-semi-definite, rather than positive-definite, because of collinearity between the K models.
When the null hypothesis is true The diagonal of SS′, which has a Wishart-Multivariate-Gamma distribution 24 , models the dependence within and between the terms in all the sums 2log , 1... .
However, the analytical results for this distribution are limited, so in practice Equation 21 is used for simulation. After computing the maximized likelihood ratios R 1 , . . . , R K via Equation 17, the p-values are computed from the quantile functions of the corresponding Chi-Squared(ν i ) distributions.

Simplified Wishart-Multivariate-Gamma dependence
For the simulations to test the power and false positive rate of the GMP significance thresholds, I used a simplification of the Wishart-Multivariate-Gamma dependence structure with a single parameter, 0 ≤ ρ ≤ 1, which measures the strength of positive dependence between the log-likelihood ratios, and hence the p-values. I made the simplifyingn assumption that ν i = ν for all i = 1 . . . K and (for i ≠ j) 0Ĉ or [ , ] . 0 In this scenario, every alternative hypothesis has ν free parameters compared to its nested null hypothesis. These might be considered to represent parameters that are in some way analogous from test to test. For any pair of likelihood ratio tests, estimates of the analogous parameters are correlated (with correlation coefficient ρ) but estimates of non-analogous parameters are uncorrelated. In which case the joint distribution of (2 log R 1 , . . . , 2 log R K )′ when the null hypothesis is true is modelled by the diagonal elements of In particular, I took ν = 2, which produces the simple relationship p i = 1/R i . While the parameter ρ can be seen as characterizing the strength of the dependence from mild to strong, all models with ρ > 0 could be considered as representing a particularly 'dense' form of dependence in which every p-value is equally correlated with every other one.

Power-robustness trade-offs 4.1 Independence versus arbitrary dependence
To assess the relative performance of the GCLT and RRA significance thresholds for GMPs, I began by directly comparing the thresholds themselves. This allowed me to assess the false positive rates of the RRA threshold under the assumption of independence and the GCLT threshold under worst case dependence.
For a target false positive rate, , the GCLT and RRA significance thresholds for the GMP are written GCLT, , ( )  r K Ψ and RRA, , ( ) r K  Ψ respectively. Two quantities can be studied easily: 1. The false positive rate of the RRA threshold under independence: 2. The false positive rate of the GCLT threshold under worst case dependence (assuming the RRA thresholds are precise): The two quantities are expected to be below and above respectively. Neither extreme (independence nor worst case dependence) is thought to represent empirical dependence: Wishart-Multivariate-Gamma dependence scenarios are considered later.
The two quantities are plotted in Figure 1 for = 0.01 and K = 1000 over a range of r. The plots support the idea that large negative values of r produce tests that are most robust to assumptions regarding dependence and significance thresholds that are most similar. In the extreme case that r → -∞, this implies use of Bonferroni and Šidák correction under arbitrary dependence and independence respectively: the resulting significance thresholds converge for small . r it converges to a Normal distribution. The Davis-Resnick condition implies that that sum of heavy-tailed random variables is robust to dependence in the tail: Figure 1 (right side) shows that robustness to arbitrary dependence is a process that begins at 1 = -2 r but requires r substantially below 1 -2 to become appreciable. For example, under worst case dependence, the false positive rate of the HMP (r = -1) is still elevated 6.2-fold above the target false positive rate, = 0.01.
As the exponent falls to r = -2, the inflation in false positive rate of the GCLT threshold above its target drops to two-fold, a notable value that corresponds to the worst case for direct interpretation of the arithmetic mean p-value (AMP, r = 1) 26,27 . If, however, one applies the GCLT significance threshold to the AMP, rather than directly interpreting it, the false positive rate jumps to 0.96 under worst case dependence because direct interpretation of the AMP is highly conservative under independence. The disparity in false positive rates illustrates the vastly superior robustness to dependence of the GCLT threshold, and the GMP in general, at r = -2 versus r = 1. It also shows that direct interpretation the AMP is questionable: not only is it up to two-fold anti-conservative under worst case dependence between tests, but its power will be highly compromised for independent tests. GMPs with smaller exponents have intrinsically greater robustness to dependence.
Robustness to dependence is a desirable property in the false positive rate, but there may be a cost in terms of the power to reject the null hypothesis when it is false. The well-known conservatism of Bonferroni correction suggests this is inevitable. False positive rates. Using the GCLT thresholds, the GMP with r < -1 exhibited false positive rates that were close to the target of = 5% under mild dependence (ρ ≤ 0.2) and substantially below it under strong dependence (ρ ≥ 0.6) ( Figure 2). The GMP with r > -1 exhibited substantial inflation of false positive rates under mild to strong dependence. In contrast, the HMP (r = -1) maintained a false positive rate close to the target in all cases, erring on the side of inflation for 0.2 ≤ ρ ≤ 0.6, as observed previously 17,19 .

Simulations under Wishart-Multivariate-Gamma dependence
Using the RRA thresholds, the false positive rate of the GMP was always well-controlled below the target of = 5%, as expected, and therefore below the rates achieved using the GCLT thresholds. Often it was far below the target false positive rate. Usually it was even below the false positive rate of the Bonferroni procedure ( Figure 2).

Power.
In considering power, the GCLT thresholds for r > -1 can be disregarded as inadmissible when ρ ≠ 0 because they could not control the false positive rate close to the target. In this range, the RRA thresholds were admissible but exhibited very poor power under all scenarios (Figure 2). The conclusion that neither the GCLT nor RRA thresholds are useful in the range -1 < r < -1/2 is relevant to the later section on the interpretation of the GMP (section 6). From a practical perspective, these simulations suggest that GMPs with r > −1, including the geometric mean p-value (r = 0; Fisher's method) and the arithmetic mean p-value (r = 1), are not useful unless independence or an explicit model of dependence can be safely assumed, e.g. 7,16.
In the Needle-in-a-haystack scenario, the GCLT thresholds for r ≤ -1 achieved power comparable to, or slightly worse than, the Bonferroni procedure. For -10 ≤ r ≤ -1, the RRA thresholds were worse than the Bonferroni procedure, considerably so for -2 ≤ r ≤ -1 (Figure 3).
In both the Mixture of signals and Subtle pervasive signal scenarios, the power of the GCLT thresholds increased from r = -10 to r = -1. The trend was reversed for the RRA thresholds (except for ρ = 1). The power of the RRA thresholds was uniformly worse (often substantially worse) than for the GCLT thresholds, and it was usually worse than for the Bonferroni thresholds too. The effect of increasing the strength of dependence was to reduce power for the GCLT, RRA and Bonferroni thresholds.
The Simes test resembles the HMP using the GCLT threshold, both in terms of interpretation and performance 9 .
Simes' test had the advantage of avoiding inflation in false positive rate under the Null hypothesis with 0.2 ≤ ρ ≤ 0.6. The power of Simes' test was no better than the HMP (except at ρ = 1), but often it was appreciably worse in the Mixture of signals and Subtle pervasive signal scenarios. For smaller values of r, e.g. r = -1.25, the inflation in false positive rate under mild dependence was reduced compared to the HMP. The GCLT threshold at r = -1.25 also generally outperformed Simes' test in terms of power, with the exception of ρ = 1. No value of r enabled the RRA thresholds to outperform the power of Simes' test. As expected, Bonferroni correction behaved much like the GMP with r = -10 under both the GCLT and RRA thresholds. As expected, Fisher's method was indistinguishable from the GMP with r = 10 -6 under the GCLT threshold. For ρ = 0, Fisher's method could not be bettered in the Mixture of signals and Subtle pervasive signal scenarios, but it was roundly outperformed in the Needle-in-a-haystack scenario ( Figure 2).
In conclusion, the HMP (with GCLT threshold) and Simes' test appear to offer superior performance to the alternatives over the range of dependence structures considered. The HMP enjoys greater power than Simes' test, at the cost of an inflated false positive rate. For end-users, the relative importance of power versus a conservative false positive rate will influence the choice of test. The simulations support the claim that the HMP is robust to dependence in the sense that the realized false positive rate is close to the target across all dependence structures investigated.

Inherent power of the HMP vs Simes' test
The superior power of the HMP compared to Simes' test in the presence of dependence is attributable, in part, to its higher false positive rate. However, receiver-operator curves (ROCs, Figure 4) summarizing the simulations above show that in the Mixture of signals and Subtle pervasive signal scenarios, the HMP is inherently more powerful than Simes' test. However, the advantage is reduced, and even reversed, as dependence increases. Simes' test is inherently more powerful than the HMP in the Needle-in-a-haystack scenario.
Inherent power (the power of a test when its threshold achieves the target false positive rate) cannot be realized without analytical results, which are not available for Wishart-Multivariate-Gamma dependence, or simulations. Nevertheless, one can use simulations to compare inherent power to actual power to  quantify the shortfall or excess power attributable to conservatism or anti-conservatism of the false positive rate. In Figure 5, the bars show inherent power while the grey whiskers compare that to actual power, using GCLT thresholds. When the grey whisker exceeded the bar, as it often did for the HMP (r = -1), the actual power was elevated relative to inherent power because of an inflated false positive rate. When the grey whisker fell below the bar, as it usually did for Simes' test, the actual power was less than the inherent power because of a conservative false positive rate.
For r < -1, the tendency for actual power to exceed inherent power was reduced, and often reversed compared to the HMP. However, the cost of this greater robustness to dependence was reduced actual power in the Mixture of signals and Subtle pervasive signal scenarios. Empirically, the trend appeared to be monotonic except for when log K/(1 -log K) < r < -1, in which region the inherent and actual power were drastically reduced (not shown). Therefore, r = log K/(1 -log K), which equalled -1.17 for K = 1000, was the GMP with r < -1 whose characteristics most closely resembled the HMP. This coincided with the derivation of the RRA threshold for r = -1, which was taken as the tightest bound based on r < -1, which occurs at r = log K/(1 -log K) 15 . Whether this recommends the use of the GMP at r = log K/(1 -log K) over the HMP is unclear. Figure 3 showed that GMPs with r < -1 can still be subject to mildly inflated false positive rates using GCLT thresholds, even if they are attenuated relative to the HMP. Bonferroni usually over-powered the GMP using RRA thresholds.

Strong-sense family-wise error rates
One of the advantages of the HMP procedure is its ability to test arbitrary combinations of the K p-values while controlling the strong-sense familywise error rate at a pre-specified level , known as multilevel testing 9 . A full assessment of the relative performance of other GMPs to the HMP therefore involves a comparison of their performance as multilevel tests. This requires a closed testing procedure (CTP 14 ) to be derived for the GMP with any exponent r.
Suppose that R is an index set of the i = 1 . . .
assuming the least favourable case that , 1.
The condition in Equation 29 is therefore satisfied by ( ) The above reveals that the multilevel test suffers complications when r > 0. This is because for subsets of p-values smaller than K(1 − Ψ r,K ( ) r ), the numerator of Equation 32 can be negative when r > 0. The interpretation is that there is no value of , r p R small enough to compensate for the assumption that , Thus no individual p-value can be sufficiently significant to guarantee that the combined test is significant. (Although knowledge of the rank of the p-value would alter this conclusion, e.g. knowing it was the maximum. The problem arises because this multilevel test is a shortcut to the full CTP and is based on single p-values. CTPs are not ruled out for r > 0 in general.) From a practical perspective, it means that when r > 0, no subsets of p-values smaller than K(1 − Ψ r,K ( ) r ) can be significant within this multilevel test, limiting the finest levels at which inference can be made.
The multilevel test simplifies when r < −1 because the GCLT thresholds (Equation 14 and Table 1) and RRA thresholds ( This produces a convenient form of the CTP: Thus by RRA one has, and by GCLT (Equation 16 and 9), one has a simple small approximation Further, Equation 34 shows that all CTPs in the range −∞ < r < −1 have a cost relative to Bonferroni (r → −∞) in the sense that the significance threshold for an individual p-value is more stringent by a factor shows that the latter is close to 1 for small . Intuitively, this represents the cost of the additional power to make statements about groups of p-values over and above the statements one can make about individual p-values 17 . However, the multilevel Simes test does not incur this penalty.

Interpretation
The p-value can be seen as a low-dimensional summary of the data that is relevant to hypothesis testing. From this perspective, the distribution of the p-value under the alternative can be modelled directly, e.g. 28. Heard and Rubin-Delanchy 4 considered Beta distributions for the p-value under the alternative, subject to the constraint that the density is non-increasing in p, which implies that 0 < ξ ≤ 1 and 1 ≤ ζ. By the Neyman-Pearson lemma 29 , they argued that one can identify uniformly most powerful tests for combining independent p-values under the Beta distribution assumption. Fisher's method was optimal for ζ = 1, the subset of Beta distributions that have been advocated for local alternatives 30 . Pearson's method 31 was optimal for ξ = 1, the subset of Beta distributions that have been advocated for simple alternatives 30 .
The likelihood ratio for a p-value that is Beta distributed under the alternative can be written where B(·) is the Beta function. The notation BF i is used because the local alternatives assumption amounts to a Bayesian prior distribution over effect sizes with hyper-parameter ξ, and the likelihood ratio is therefore a Bayes factor.
A mean Bayes factor (or likelihood ratio) of the form ( ) ( ) arises in a model-averaging setting in which the alternative hypothesis is a mixture of individual, mutually exclusive alternatives. Here each p-value uses the same data to evaluate each competing alternative hypotheses against a common nested null hypothesis. This implies an interpretation of GMPs as uniformly most powerful tests for model-averaged alternative hypotheses, each of which is a different local alternative to the common nested null hypothesis. Under these conditions, ξ = 1 + r and ζ = 1 so that , . . . , , 1 The model-averaging interpretation applies when −1 < r < 0. Unfortunately, the simulations summarized in Figure 2 showed that for −1 < r < 0, the GCLT threshold suffered greater elevation in false positive rates than the HMP. (The RRA threshold is not considered here because power was low and usually worse than Bonferroni). Nevertheless, outside the range −1 < r < 0, the GMP can be viewed as a bound on the model-averaged Bayes factor. Defining r* < r < 0, one has the relationship , * 1 BF , . . . , because M r,K (p 1 , . . . , p K ) ≥ M r*,K (p 1 , . . . , p K ). Therefore GMPs supply upper bounds on Bayes factors when r ≤ −1 and lower bounds on Bayes factors when r > 0. The HMP supplies the tightest upper bound for well-powered tests (r ↓ −1). Simes' test and Bonferroni can be framed similarly as providing lower bounds on the model-averaged Bayes factor.
While the HMP has a natural interpretation as a tight bound on the model-averaged Bayes factor for wellpowered tests, it does not correspond exactly to any Bayes factor and it does not control the frequentist false positive rate exactly for some dependence structures, making it an approximation to both approaches that could be criticized for satisfying neither.

Conclusions
Taking the generalized mean p-value of a group of tests extends a number of existing methods for combining p-values including the Bonferroni, Šidák, harmonic mean p-value and Fisher procedures 7,9-12 ( Figure 6). The interpretation varies by (i) the exponent of the GMP, and (ii) the key assumption regarding dependence between the tests. Two appealing interpretations occur when −1 < r < 0 and r = 0. When −1 < r < 0, combining p-values using the GMP can be interpreted as model averaging if the same data have been used to evaluate mutually exclusive alternative hypotheses against a common null hypothesis. In this interpretation, when r is closer to −1, very small p-values are assumed more likely when the alternative hypothesis is true, implying the individual tests were more powerful. Outside the range −1 < r < 0, the GMP is interpretable as approximating this approach, with r = −1 (the HMP) offering the closest approximation for well-powered tests. When r = 0, combining p-values using the GMP can be interpreted as aggregating evidence for related pairs of alternative and null hypotheses, if independent data were used for the individual tests, in which case the method is equivalent to Fisher's for many tests. Outside these specific interpretations, the GMP offers a flexible non-parametric approach to combining p-values where r controls the sensitivity to small values.
GMPs are not directly intepretable as p-values in general 15 . Instead, significance thresholds are required. Generalized central limit theorem and robust risk analysis provide convenient methods for defining significance thresholds for GMPs that do not require explicit knowledge of the dependence structure, providing robustness to dependence to varying degrees. RRA thresholds provide robustness to arbitrary dependence in the sense that the false positive rate will not exceed the target 15 . GCLT thresholds provide a weaker form of robustness to forms of dependence that satisfy the Davis-Resnick condition (Equation 6), but only for sufficiently small values of the target false positive rate and r ≤ −1. Subject to these conditions, the HMP is the only GMP that can be directly interpreted as if it were a p-value 9,17 .
The cost of robustness to arbitrary dependence was too high to make the RRA thresholds directly useful in practice, because they were usually rendered less powerful than the Bonferroni procedure in simulations (Figure 2, Figure 3). However, they remain theoretically valuable because they bound the worst-case inflation of the false positive rate of the GCLT thresholds. The RRA and GCLT thresholds agreed more closely as r → −∞. The trend for RRA thresholds to deliver less powerful tests as r increased was reversed for GCLT thresholds. In practice the GCLT thresholds were generally more powerful than Bonferroni, and increasingly so as r increased, but they began to suffer inflated false positive rates. The GCLT threshold for the HMP has previously been shown to suffer modest inflation under mild dependence 19 . However, above r = −1, the point at which the underlying distribution of p r transitioned from very heavy tailed (r ≤ −1) to heavy tailed (−1 < r ≤ −1/2), inflation accelerated to the point that there was no useful robustness to non-independence ( Figure 2). Despite this problem, incorporating knowledge of dependence into standard central limit theorem, applicable for light-tailed distributions (r > −1/2), would be straightforward, requiring knowledge only of ) . The loss of robustness to dependence recommends against the use of the GMP with GCLT thresholds for r > −1 except when independence can be safely assumed.
The arithmetic mean p-value, which arises in numerous applications including posterior predictive p-values, is known to be directly interpretable subject to a maximum two-fold inflation in false positive rate 25,27 . However, for the dependence structures considered here, there appeared to be little merit in direct interpretation of the AMP: to do so would be far too conservative under independence and likely less powerful than Bonferroni under worst-case dependence. An interesting alternative might be the GMP with r = −2, whose GCLT threshold is /  K for small and does suffer at worst two-fold inflation under arbitrary dependence (for large K). Unlike the HMP, the GMP with r = −2 did not suffer even mild inflation in false positive rate in simulations, but the HMP was more powerful (Figure 3). The GMP with r = −2 performed remarkably similarly in false positive rate and power to Simes' test 20 . Simes' test and the HMP can be seen as offering similarly-performing but complementary solutions to the power-robustness trade-off for model-averaged p-values 9 , erring on the side of conservatism versus power respectively.
There were several limitations in the current study: (i) Equal weights were assumed throughout, although simulations for the HMP 9 suggest there may be robustness to unequal weights, at least for r ≤ −1. (ii) The distribution of p-values under the null hypothesis was assumed to be Uniform(0,1). However, valid p-values are generally defined such that 0 Pr( . Conservatism of this sort was not explored, but is likely to profoundly diminish the power of the GMP. (iii) The simulations considered here assumed a particular form of dependence in which the p-values were chi-squared tail probabilities of underlying log-likelihood ratios that for large samples would follow a Wishart-Multivariate-Gamma distribution. A particularly dense form of dependence was assumed that applied to every pair of p-values. Some results, such as the inflation in false positive rates for the GCLT thresholds in the region r = −1, will depend quantitatively on the details of the simulations. The conclusion that the RRA thresholds are less powerful than Bonferroni may apply more widely because it stems from the theoretical divergence in GCLT and RRA thresholds as r increases, and it might seem reasonable to assume that the behaviour of empirically relevant p-value dependence is intermediate between their respective assumptions of independence and arbitrary dependence.
In conclusion, simulations under a form of dependence relevant to p-values calculated from likelihood ratio tests showed that the GMP is practically useful for combining dependent p-values for exponents r ≤ −1 using Figure 6. Summary of the generalized mean p-value (GMP), its relation to other tests, Bayesian interpretation, significance thresholds derived using generalized central limit theorem (GCLT) and robust risk analysis (RRA) 15 , test assumptions and performance characteristics. thresholds derived from generalized central limit theorem. Robust risk analysis provides corresponding upper bounds on the false positive rate under worst case dependence 15 , but these upper bounds were not directly useful as significance thresholds because they produced tests typically less powerful than Bonferroni. Those wishing to protect themselves against worst case dependence should therefore prefer the Bonferroni procedure. However, there is increasing interest in exploiting heavy tail behaviour to confer desirable properties in terms of power and robustness to dependence upon combined tests 9,32 , and the GMP for r ≤ −1 with GCLT thresholds extends this class of methods.

Data availability
Underlying data No underlying data are associated with this article.

Xuexia Wang
Department of Mathematics, University of North Texas, Denton, TX, USA Methods of combining p-values are very useful in many fields, such as genetic meta analysis by combining multiple studies and gene-based genetic association studies by combing multiple single marker tests together. Several combination test statistics underpinning the Bonferroni, Šidák, harmonic mean p-value (HMP), Fisher, and other procedures are special cases of the generalized mean p-value (GMP). GMP cannot be interpreted directly as p-values since they are not uniformly distributed under the null hypothesis, but the GMP can be used as a test statistic by calculating a significance threshold for rejecting the null hypothesis with a pre-specified limit (significance level). In this manuscript, the author considered significance thresholds for the GMP and compared their performance under different dependence assumptions. The author performed very important work and generated some very valuable results.
There are some published work regarding to the significance thresholds. Vovk and Wang (Vovk, 2018) 1 derived significance thresholds for GMPs under arbitrary dependence using robust risk analysis (RRA). Wilson (Wilson, 2019) 2 derived a considerably less stringent threshold using generalized central limit theorem (GCLT). In this manuscript, the author further derived significance thresholds for GMPs using GCLT and motivated a model of p-value dependence based on the Wishart-Multivariate-Gamma distribution. In simulation studies, the author simulated data under the model to test the power and false positive rates of the GCLT and RRA significance thresholds. Simulations under a form of dependence relevant to p-values calculated from likelihood ratio tests showed that the GMP is practically useful for combining dependent p-values for exponents r ≤ −1 using thresholds derived from GCLT. The RRA thresholds produced tests that were usually less powerful than Bonferroni, while the GCLT thresholds produced tests more powerful than Bonferroni, for all r> − ∞.
In general, this is a very important work. I have several minor comments: In the Introduction, the author mentioned "combining Z-statistics, rather than p-values, allows parameters to be jointly estimated across datasets". What parameters can be jointly estimated across datasets? 1.
Below Figure 2, the author mentioned "I conducted 10000 simulations". Does the 10000 simulations mean 10000 replications?
© 2020 Rustamov R. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Raif Rustamov
Data Science and AI Research, Chief Data Office, AT&T Inc, Bedminster, NJ, USA P-value combination approaches have started attracting increasing attention after the recent introduction of the Harmonic Mean P-value (HMP) 1 and Cauchy Combination Test 2 combination methods. The motivations for these two approaches are very different -Bayesian and Cauchy additive property, but interestingly, these methods are asymptotically equivalent in the most relevant regime, namely for small p-values 3 . A fundamental question is whether there are other p-value combination approaches that a) have a different asymptotic behavior and b) are at least as efficient as these two methods. The paper under review analyzes this problem from both theoretical and simulation aspects, and provides evidence that such a method may not exist.
The main theoretical contribution of this paper is to generalize the logic behind HMP to a parametrized class of p-value combination methods by computing the appropriate thresholds via generalized central limit theorem (GCLT). These thresholds are based on the independence assumption yet are robust to violations. They are directly useful for practical applications, and provide an alternative to the robust risk analysis (RRA) based thresholds 4 that are universal but rather conservative. An argument using a result from Davis-Resnick shows that there exists an explicit condition that guarantees the proper test size control at the limit. It seems to me that this condition can be relaxed; for example a theorem in 2 shows that a certain bivariate normality condition leads to the correct limiting size control. I take this as the evidence that the proposed thresholds can be used more widely.
It is the comparison of GCLT and RRA thresholds that is very revealing. It becomes clear that HMP, which corresponds to the case of r=-1 has the most potential for gains due to the GCLT/RRA threshold ratio being log K. For methods with r < -1 the thresholds differ at most by a factor of 2 and there is little room to provide viable alternatives to HMP. Finally, methods with r > -1 are neither robust to dependencies (as shown by experiments) nor provide threshold gains at the limit of large K. This can be seen as evidence to the unique place of r=-1 and that the only interesting/efficient methods for p-value combination should have the same asymptotic behavior as HMP. Of course, this only establishes the asymptotic form of the tail, but the specific form can make big difference in practice.
The paper continues on to experimentally compare this class of methods using both GCLT and RRA thresholds for different values of r. The experiments consider a wide variety of setups and are rather detailed. One aspect of these experiments is that the proposed setup is very challenging due to the inclusion of all-pair dependencies, and it essentially pushes this methodology to the limit. As a consequence, the presented results may seem somewhat discouraging even for HMP. However, in my own work 3,5 with p-value combination approaches I have found both HMP and Cauchy Combination Test to work rather well even at the nominal rate of 0.05; the dependencies usually have limited range (e.g. correlation matrix is banded, not full) and thus the violations of independence are not as severe. Perhaps pointing out the this aspect more prominently would make sure that the reader finds this methodology useful.
I would like to make a number of minor suggestions that would potentially help the readability of the paper: Providing an explicit example of this new class of methods would be very helpful. For Yes