Distribution-free Multiple Testing

We study a stylized multiple testing problem where the test statistics are independent and assumed to have the same distribution under their respective null hypotheses. We first show that, in the normal means model where the test statistics are normal Z-scores, the well-known method of (Benjamini and Hochberg, 1995) is optimal in some asymptotic sense. We then show that this is also the case of a recent distribution-free method proposed by Foygel-Barber and Cand\`es (2015). The method is distribution-free in the sense that it is agnostic to the null distribution - it only requires that the null distribution be symmetric. We extend these optimality results to other location models with a base distribution having fast-decaying tails.


The risk of a multiple testing procedure
Consider a setting where we want to test n null hypotheses, denoted H 1 , . . . , H n . The test that we use for H i rejects for large values of a statistic X i . Throughout, we assume that X 1 , . . . , X n are independent. Denote the vector of test statistics by X = (X 1 , . . . , X n ). Let Ψ i denote the survival function 1 of X i and Ψ = (Ψ 1 , . . . , Ψ n ).
Remark 1. In a large portion of the literature, it is assumed that P-values can be computed (or at least approximated). The simplest such case is when H i is a singleton, H i = {Ψ null i }, and the null distributions Ψ null 1 , . . . , Ψ null n are known. In that case, the i-th P-value is defined as P i = Ψ null i (X i ), which is the probability of exceeding the observed value of the statistic under its null distribution. In this context, working with the statistics X 1 , . . . , X n is equivalent to working with the P-values P 1 , . . . , P n .
Let F ⊂ [n] ∶= {1, . . . , n} index the false null hypotheses, meaning (1) A multiple testing procedure R takes the test statistics X and return a subset of R(X) ⊂ [n] representing the null hypotheses that the procedure rejects. Given such a procedure R, the false discovery rate is defined as the expected value of the false discovery proportion (Benjamini and Hochberg, 1995) fdr Ψ (R) = E Ψ (fdp(R(X))), fdp(R) ∶= where we denoted the cardinality of a set A ⊂ [n] by A and we adopt the convention that 0 0 = 0. While the FDR of a multiple testing procedure is analogous to the level or size of a test procedure, the false non-discovery rate plays the role of power and is defined as the expected value of the false non-discovery proportion fnr Ψ (R) = E Ψ (fnp(R(X))), fnp(R) ∶= F ∖ R F .
(Note that our definition is different from that of Genovese and Wasserman (2002). ) Accordingly, we define the risk of a multiple testing procedure R as risk Ψ (R) = fdr Ψ (R) + fnr Ψ (R).
Remark 2. For any multiple testing procedure, fnp → 0 in probability if and only if fnr → 0. Indeed, one direction is justified by Markov's inequality, and the other direction is justified by dominated convergence (using the fact that fnp ≤ 1 always).

Natural procedures
Under our circumstances, the following properties are natural, and are for example satisfied by the BH procedure, and in fact, any other procedure we can think of, at least in the present context.
• Permutation invariance. We say that a procedure R is permutation invariant if, as a function of n variables, it is invariant with respect to the order of these variables.
• Threshold. We say that a procedure R is a threshold procedure if it is of the form for some threshold function τ .
• Monotonicity. We say that a procedure R is monotonic if, for any i ∈ R(X 1 , . . . , X n ), increasing X i leaves R unchanged.
Because they are so natural, but also for ease of exposition, we focus on natural procedures throughout the paper.

The normal model and the optimality of the BH method
This model corresponds to the setting above with X i ∼ Ψ i = N (µ i , 1) and H i ∶ µ i = 0, so that H i is a singleton equal to Ψ null i = N (0, 1). In this context it is compelling to ask how the µ i 's need to be in order for the risk of the BH procedure to tend to zero. To the best of our knowledge, this question has not been directly answered in the literature.
Our inspiration for considering the normal (location) model comes from the seminal work of Ingster (Ingster, 1997;Ingster and Suslina, 2003) and Donoho and Jin (2004) on testing the global null ⋂ i H i . In (Ingster, 1997) we find the following first-order asymptotic result. Assume a prior under which m ≤ n randomly picked µ i 's are set to √ 2r log n and the others are set to 0. An interesting parameterization happens to be m n ∼ n −β with β > 0 fixed. Focusing on the so-called sparse regime, where β > 1 2, one finds that the detection boundary is at r = ρ(β), where This means that, taking r to be fixed, when r < ρ(β) all tests have risk 2 at least 1 in the large sample limit (which is as bad as random guessing), while when r > ρ(β) the likelihood ratio test has risk 0 in the large sample limit. Donoho and Jin (2004) propose an adaptive test procedure based on Tukey's higher criticism (HC) that achieves this optimal detection boundary.
Returning to the question of identifying the false null hypotheses, which is our concern here, we know that r > 1 allows for the identification of the false nulls with a control of the family-wise error rate (FWER) at any fixed level. In fact, r = 1 is the precise boundary for this to be possible -we leave this as an exercise to the reader. This is true for all β ∈ (0, 1). We are more interested in controlling the risk (4). The following is a special case of a more general lower bound appearing later in the paper.
Corollary 1. In the normal model, assume that β ∈ (0, 1) and r ≥ 0 are both fixed. If r < β, then the risk of any natural multiple testing procedure has limit inferior at least 1 as n → ∞.
Remark 3. Such a lower bound was in fact obtained earlier by Ji et al. (2012) in a more general normal model allowing for dependencies.
In our context, we know that Corollary 1 is tight because the BH method achieves the stated multiple testing boundary. The following is also a special case of a more general result appearing later in the paper.
Corollary 2. In the setting of Corollary 1, if instead r > β, then the risk of the BH procedure (properly calibrated) tends to 0 as n → ∞.
Remark 4. A similar result was derived in (Jin and Ke, 2014) for a form of oracle threshold method with threshold chosen based on knowledge of the number of false nulls. By comparison, the BH method -which does not require this knowledge -is shown here to be adaptive to the number of false nulls.
Together, Corollary 1 and Corollary 2 establish that the BH procedure is asymptotically optimal to first order in the normal model. We will see that this remains true for a much wider class of models.

Multiple testing under symmetry
The P-values are based on the assumed knowledge of the null distribution of each test statistic. In many practical settings, this is not strictly the case, resulting in P-values that are only approximately uniformly distributed under their respective null hypothesis. This can jeopardize the control of the FDR. In the same way that it may be appealing in some situations to use a distribution-free test such as the signed-rank test instead of the t-test, it may be desirable to use a distribution-free procedure for multiple testing.
Our working assumption is the following X 1 , . . . , X n are independent with common null distribution that is symmetric about 0.
The assumption of symmetry is standard in the literature on nonparametric tests (Hettmansperger, 1984). However, it is not standard in the context of multiple testing. Although quite natural, we only know of two instances where this assumption is made: • In our own work on testing the global null (Arias-Castro and Wang, 2013).
Whether following one lead or the other, the resulting procedure is essentially the same, which we call the Barber-Candès (BC) procedure, properly defined later on. This procedure is shown in (Foygel-Barber and Candès, 2015) to control the FDR at any desired level. Here we show that, under fairly general conditions, it achieves the multiple testing boundary. In particular, it does as well as the BH procedure with knowledge of the null distributions. The following is a special case of a more general results appearing later on.
Corollary 3. The conclusions of Corollary 2 apply to the BC procedure.
We mention that other distribution-free procedures have been suggested in the literature. Except for those cited above, all the other ones we know of are based on resampling (Ge et al., 2003;Romano and Wolf, 2007;Westfall and Young, 1993;Yekutieli and Benjamini, 1999). These methods are not applicable in the setting assumed here. They are typically applied to situations, as in microarray analysis, where each test statistic is based on comparing two (or more) samples.
Another approach suggested in the literature is that of estimating the null distribution of the test statistics (assumed to be the same). This is advocated in (Efron, 2004;Pollard and van der Laan, 2004). The asymptotic risk properties of such methods remain unknown, despite the fact that there is some theory on the topic of estimating the null distribution (Cai and Jin, 2010;Jin and Cai, 2007).

Content
In Section 2 we derive a lower bound on the boundary for multiple testing in a location model where the base distribution is asymptotically generalized Gaussian. This comprises the normal model. In Section 3 we analyze the performance of the BH procedure based on the full knowledge of the null distribution, while in Section 4 we analyze the performance of the BC procedure. We present the result of some numerical experiments in Section 5. The proofs are gathered in Section 6.

The AGG model
We start by defining an oracle procedure, which is all we can hope for when using a natural procedure. We then define a family of location models where the base distribution is asymptotically polynomial in log-scale -which in particular encompasses the normal model -and in the context of such a model, we establish a lower bound on the risk of the oracle procedure. The result is an oracle risk bound.

The oracle procedure
In the context of natural procedures, the oracle procedure is defined as the threshold procedure using the threshold that minimizes the risk of a particular realization, namely, In words, with full knowledge of the set of false null distributions F, the procedure chooses a threshold that partitions the test statistics in a way that minimizes the sum of the false discovery and non-discovery proportions. The expected risk of this procedure is what we call below the oracle risk.
Remark 5. Of course, if one knew F, one would simply reject H i for all i ∈ F and, in the end, there would not any multiple testing problem to deal with! This oracle procedure is, however, constrained to be of threshold type. We will only use this fictitious procedure as a benchmark among threshold-type procedures.
Remark 6. We note that our oracle is stronger (provides more information) than the oracle considered by Meinshausen et al. (2011) in the context of FWER control.

Asymptotically generalized Gaussian model
In a location model, we assume that we know the null survival function Ψ, assumed to be continuous for simplicity, and the test statistics are independent with respective distribution where µ i = 0 under the null H i and µ i > 0 otherwise. Both minimax and Bayesian considerations lead to considering a prior on the µ i 's where m ≤ n randomly picked µ i 's are set equal to some µ > 0 and the others are set to 0. The prior is therefore defined based on m and µ, which together control the signal strength. Beyond the normal model, we consider other location models where the base distribution has a polynomial right tail in log scale.
The AGG class of distributions is nonparametric and quite general. It includes the parametric class of generalized Gaussian (GG) distributions with densities {ψ γ , γ > 0} given by log ψ γ (x) ∝ − x γ γ, which comprises the normal distribution (γ = 2) and the double exponential distribution (γ = 1). We assume that γ ≥ 1 so that the null distribution has indeed a sub-exponential right tail. Remark 7. We note that the scale (e.g., standard deviation) is fixed, but this is really without loss of generality as both the BH and BC methods are scale invariant. For the BH method, this is because the P-values are scale invariant. However, this is so because we provide the BH method with the null distribution, including the scale. The BC method, by contrast, can operate without knowledge of the scale. (Donoho and Jin, 2004) considered the problem of testing the global null in a GG location model and derived the detection boundary. We use the same prior, where m nulls chosen uniformly at random are designated to be false and all positive µ i 's are set equal to µ, with and µ = µ γ (r) = (γr log n) 1 γ , with r > 0 (fixed).
Theorem 1. Consider a location model where the base distribution is AGG with exponent γ ≥ 1, with prior described above, and with the parameterization (9)-(10). If r < β, then the oracle risk has limit inferior at least 1 as n → ∞.
Remark 8. We believe the result is valid for all procedures, regardless of whether they are natural or not. As usual, proving such a result would require the application of tools from information theory or decision theory (Tsybakov, 2009). Fano's inequality is appealing, but to obtain a result as strong as Theorem 1, one would need to develop a stronger form of Fano's inequality.

The performance of the BH method
Recall that P i = Ψ(X i ) is the P-value corresponding to the test statistic X i , where Ψ denotes the survival function under the null. We order the X i 's in decreasing order, to obtain the following order statistics X (1) ≥ ⋯ ≥ X (n) . Given a desired FDR control at q, the BH procedure of (Benjamini and Hochberg, 1995) is defined is the threshold procedure (5), with threshold where P (1) ≤ ⋯ ≤ P (n) are the ordered P-values. This procedure is shown in (Benjamini and Hochberg, 1995) to control the FDR at q, for example, when the tests are independent -which we assume throughout. Typically, q is set to a small number, like q = 0.10. In this paper we allow q to tend to 0 as n → ∞, but slowly. Specifically, we always assume that q > 0 such that n a q → ∞ for any a > 0 fixed.
The following result establishes the BH procedure as optimal in the AGG model, in the sense that it achieves the detection boundary (r = β) stated in Theorem 1.
Theorem 2. In the setting of Theorem 1, if instead r > β, then the risk of the BH procedure with q satisfying (12) has fnp tending to 0 in probability as n → ∞. In particular, if q → 0, then its risk tends to 0.
We note that the second part of the theorem follows from Remark 2 and that Corollary 2 follows immediately from this result.

The performance of the BC method
Under the assumption of symmetry, given the desired FDR control level q, the Barber-Candès (BC) procedure defines the data-dependent threshold τ BC as: where, as usual, the infimum is infinite if the set is empty, X ∶= { X i ∶ i = 1, . . . , n} is the set of sample absolute values, and is a measure of how asymmetric the set of observations {X i ∶ X i ≥ t} is. The notation is borrowed from (Foygel-Barber and Candès, 2015) and is justified by the fact that this quantity aims at estimating fdp(R t ), where R t = {i ∶ X i ≥ t} as in (8). The BC procedure is shown in (Foygel-Barber and Candès, 2015) to control the FDR at level q.
The following result shows that, although agnostic to the null distribution, the BC procedure achieves the detection boundary in a AGG model as long as the underlying distribution is symmetric.
Theorem 3. In the setting of Theorem 1, and assuming that the null distribution is symmetric about 0, if instead r > β, then the risk of the BC procedure with q satisfying (12) has fnp tending to 0 in probability as n → ∞. In particular, if q → 0, then its risk tends to 0.

Another variant
We mentioned another inspiration for considering the BC procedure as a potential candidate for achieving the detection boundary in such a nonparametric setting -which is now confirmed by Theorem 3. It relates to our own work on testing the global null in a similar setting (Arias-Castro and Wang, 2013). Following closely the reasoning there leads us to consider the following procedure. Let ξ (1) , . . . , ξ (n) be the respective signs of the observations arranged in decreasing order of absolute value and let S k = ∑ i≤k ξ (i) denote their partial sum up to k. Given a desired q ∈ (0, 1), we define the threshold index The procedure is then defined as the threshold procedure with threshold τ = X (ι) , where X (i) denotes the i-th largest observation in absolute value. To make a strong parallel with (Arias-Castro and Wang, 2013), let us call this procedure the cumulative sum (CUSUM) sign procedure. The two procedures (BC and CUSUM sign) are equally principled, and in fact, they are very closely related, as we show next. Assume that the observations come from a continuous distribution, so that no observation is equal to 0 with probability 1, or that the observations equal to 0 have been removed. In that case, We can express fdp( X (k) ) as a function of S k : Then, under mild assumptions, and when this is the case the event fdp( X (k) ) ≤ q is approximately equivalent to the event S k ≥ (1 − 2q (1 + q))k. Thus we suspect that the CUSUM sign procedure performs comparably to the BC procedure. We did observed this in all the numerical experiments we performed (none reported here).

Numerical experiments
In this section, we perform simple simulations to compare the BH and BC procedures on finite data, with the goal of illustrating the theory we established. We consider the normal model and the double-exponential model. We reemphasize that the BH procedure plays a role of oracle here since it requires knowledge of null distribution to compute the P-values. In contrast, the BC method does not require knowledge of the null distribution.

Fixed sample size
In this first set of experiments, the sample size is chosen large at n = 10 5 . The FDR control level is set at q = 0.05. We draw m observations from the alternative distribution Ψ(⋅ − µ), and the other n − m from the null distribution Ψ. All the models are parameterized as described in Section 2.2, in particular, (9) and (10). We choose a few values for the parameter β so as to exhibit different sparsity levels, while the parameter r takes values in a grid of range (0, 1). Each situation is repeated 100 times for each test and we report the average FDP and FNP of both tests.

Normal model
In this model, Ψ is the standard normal distribution. The simulation results are reported in Figure 1 and Figure 2. In Figure 1 we report the FDP. Recall that the methods are set to control the FDR at the desired level (q = 0.05). We see that the BC method becomes more conservative than the BH method as β increases. In Figure 2 we report the FNP. We see that the BC method performs comparably to the (oracle) BH method at β = 0.3 and β = 0.5, but is clearly less powerful in the sparsest regime β = 0.7. This is in line with the earlier observation that the BC method becomes more conservative with increasing values of β. It can also be explained by the fact, at β = 0.7, the number of false nulls (m = 31 out of n = 10 5 ) is too small to reveal the asymptotic power of the BC method. Finally, we remark that the transition from high FNP to low FNP happens in the vicinity of the theoretical threshold (r = β).

Double-exponential model
In this model, Ψ is double-exponential distribution with variance of 1. The simulation results are reported in Figure 3 (FDP) and Figure 4 (FNP). Here we observe that the BC method is rather conservative regardless of β. The two methods are again comparable in terms of FNP, in fact a bit more so than in the normal setting. The transition from FNP near 1 to FNP near 0 happens, again, in the vicinity of the theoretical threshold, but is much sharper here.

Varying sample size
In this second set of experiments, we examine the effect of various sample sizes on the risk of BH and BC procedures under standard normal model and double-exponential model (with variance 1). We simultaneously explore the effect of letting the desired FDR control level q tend to 0, in accordance with (12). Specifically, we set it as q = q n = 1 log n. We choose n on a log scale, specifically, n ∈ {10 2 , 10 3 , 10 4 , 10 5 , 10 6 }. Each time, we fix a value of (β, r) such that r > β.
In the first setting, we set (β, r) = (0.4, 0.9). The simulation results are reported in Figure 5 and Figure 6. We see that, in both models, the risks of the two procedures decrease to zero rapidly as the sample size gets larger. The BH method clearly dominates (in terms of FNP) up until n = 10 3 , and after that the two methods behave similarly. In the second setting, we set (β, r) = (0.7, 1.5) for normal model and (β, r) = (0.7, 1.2) for double-exponential model. The simulation results are reported in Figure 7 and Figure 8. In this sparser regime, we can see that the BC method is much more conservative than BH method when n is relatively small. But as n gets larger, this is less pronounced. The BH method clearly dominates (in terms of FNP) up until n = 10 3 and past n = 10 4 the two methods behave similarly. The difference is much more dramatic here, in line with our findings in Section 5.1.

Proofs
We prove our results in this section.

Proof of Theorem 1
We first remark that, for any procedure (left implicit) Hence, to show that the procedure has risk tending to 1, by dominated converge, it suffices to show that F R → 0 or R F → 0. This brutish tactic is in fact enough in all cases below except for the last one.
We have the following simple facts. On the one hand, under (9). On the other hand, for any t ∈ R, we have In particular, by Chebyshev's inequality, We will also use the fact that By shifting Ψ if needed, we may assume with loss of generality that Ψ(0) = 1 2.
Case: t ≤ 0. In this case there are too many rejections. Indeed, for any t ≤ 0, so that F R t → 0.
Case: t > 0. Such a threshold t > 0 is necessarily of the form t = (γr n log n) 1 γ for some r n > 0. Extracting subsequences if needed, we may restrict ourselves to the following situations.
• Subcase: r n → r ∞ = β. In this case, the number of rejections is about right, but the rejections themselves are not accurate. Indeed, we have with Ψ(t − µ) = o(1) since t − µ → ∞ due to the fact that r n → β > r. Hence, R t ∩ F F → 0, which in turn implies that F ∖ R t F → 1.

Proof of Theorem 3
The proof borrows a number of arguments from Section 6.2. We use the same notation and assume as before that the X i 's are distinct. We assume, in addition, that Ψ is symmetric about 0. Define the threshold τ = inf t ∶ fdp(t) ≤ q .
The difference with τ BC in (13) is that the range is not limited to X . It can be seen that τ = X (ι BC +1) unless ι BC = n (the BC procedure rejects all the nulls), in which case τ = τ BC . This, in particular, implies fnp(R τ ) ≤ fnp(R τ BC ) ≤ fnp(R τ ) + 1 m .