New FDR bounds for discrete and heterogeneous tests

Abstract: To find interesting items in genome-wide association studies or next generation sequencing data, a crucial point is to design powerful false discovery rate (FDR) controlling procedures that suitably combine discrete tests (typically binomial or Fisher tests). In particular, recent research has been striving for appropriate modifications of the classical Benjamini-Hochberg (BH) step-up procedure that accommodate the discreteness and heterogeneity of the data. However, despite a large number of attempts, these procedures did not come with theoretical guarantees. In this paper, we provide new FDR bounds that allow us to fill this gap. More specifically, these bounds make it possible to construct BH-type procedures that incorporate the discrete and heterogeneous structure of the data and provably control the FDR for any fixed number of null hypotheses (under independence). Markedly, our FDR controlling methodology also makes it possible to incorporate the quantity of signal in the data (corresponding therefore to a so-called π_0-adaptive procedure) and to recover some prominent results of the literature. The power advantage of the new methods is demonstrated in a numerical experiment and on appropriate real data sets.


Introduction
Multiple testing procedures are now routinely used to find significant items in massive and complex data. An important focus has been given to methods controlling the false discovery rate (FDR) because this scalable type I error rate "survives" in high dimensions. Since the original procedure of Benjamini and Hochberg (1995), much effort has been devoted to designing FDR controlling procedures that adapt to various underlying structures of the data, such as the quantity of signal, the signal strength and the dependencies, among others.
In this work, our motivation is to deal with adaptation to discrete and heterogeneous data. This type of data arises in many relevant applications, in particular when data are represented by counts. Examples can be found in clinical studies (see, e.g., Westfall and Wolfinger, 1997), genome-wide association studies (GWAS) (see, e.g., Dickhaus et al., 2012) and next generation sequencing (NGS) data (see, e.g., Chen and Doerge, 2015b). It is well known (see, e.g., Westfall and Wolfinger, 1997) that using discrete test statistics can generate a severe power loss, already at the stage of the single tests. A consequence is that blindly using the BH procedure with discrete p-values will control the FDR, but in an overly conservative manner. Therefore, more powerful procedures that avoid this conservatism are much sought after in applications, see for instance Karp et al. (2016), van den Broek et al. (2015) and Dickhaus et al. (2012).
In the literature, building multiple testing procedures that take into account the discreteness of the test statistics has a long history that can be traced back to Tukey and Mantel (1980). Some null hypotheses can be a priori excluded from the study because the corresponding tests are unable to produce sufficiently small p-values. This results in a multiplicity reduction that should increase the power. While this idea has been exploited in Tarone (1990) and, in a more general manner, in Westfall and Wolfinger (1997) for the family-wise error rate, a first attempt was made for the FDR in Gilbert (2005). More recently, Heyse (2011) has proposed a more powerful solution, relying on the following averaged cumulative distribution function (c.d.f.):

F(t) = (1/m) Σ_{i=1}^m F_i(t), t ∈ [0, 1],    (1)

where each F_i corresponds to the c.d.f. of the i-th test p-value under the null hypothesis. To illustrate the potential benefit of using F, Figure 1 displays this function for the pharmacovigilance data from Heller and Gur (2011) (see Section 5 for more details). It is important to note that heterogeneity and discreteness are both essential in (1): on the one hand, without any heterogeneity (all the F_i's are equal), we have F(t) = F_1(t) and there is no benefit in averaging the null c.d.f.'s; on the other hand, without discreteness, the F_i's are essentially invertible and the p-values can be transformed to be (continuous) uniform under the null, so that the standard BH procedure can be applied. Both structures commonly arise when multiple conditional tests are performed, for which the heterogeneity and discreteness come from the marginal counts of contingency tables, e.g., for multiple Fisher exact tests (see the simulations in Section 6).
The critical values of the Heyse procedure can be obtained by inverting F at the values αk/m, 1 ≤ k ≤ m. Thus, the smaller the F-values, the larger the critical values. For the example depicted in Figure 1, the Heyse critical values improve the BH critical values roughly by a factor of 3, thereby yielding a potentially strong rejection enhancement. Furthermore, since the functions F_i are known in practice, so is F. Hence, the user has a good prior idea of the improvements reachable by this discrete approach. Unfortunately, the Heyse procedure does not rigorously control the FDR in general; counterexamples are provided in Heller and Gur (2011) and Döhler (2016) (and also in Appendix B.1).
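To make this inversion step concrete, here is a minimal R sketch (the step functions and the level are toy choices, not taken from the paper) that builds F from known null c.d.f.'s and reads off Heyse-type critical values on the overall p-value support:

```r
# Toy illustration: averaged null c.d.f. F and Heyse-type critical values.
m     <- 3
alpha <- 0.05
# each null c.d.f. F_i as a right-continuous step function (jump points, c.d.f. values)
F_list <- list(
  stepfun(c(0.02, 0.30, 1), c(0, 0.02, 0.30, 1)),
  stepfun(c(0.10, 0.55, 1), c(0, 0.10, 0.55, 1)),
  stepfun(c(0.01, 0.20, 1), c(0, 0.01, 0.20, 1))
)
F_bar <- function(t) mean(sapply(F_list, function(F) F(t)))  # averaged c.d.f. as in (1)
A <- sort(unique(unlist(lapply(F_list, knots))))             # overall p-value support
# Heyse critical values: largest support point t with F_bar(t) <= alpha*k/m
tau_heyse <- sapply(1:m, function(k) max(c(0, A[sapply(A, F_bar) <= alpha * k / m])))
tau_heyse
```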
Meanwhile, different solutions have been explored by directly modifying the p-values, either by randomisation (see Habiger, 2015 and references therein), or by shrinking them to build so-called mid p-values (see Heller and Gur, 2011 and references therein). While randomised approaches possess attractive theoretical properties, they are often criticised for their lack of reproducibility (see, e.g., Berger, 1996 and Ripamonti et al., 2017). Other approaches incorporate discreteness and heterogeneity by constructing less conservative FDR estimates, see, e.g., Pounds and Cheng (2006), or by combining grouping and weighting approaches, see Chen and Doerge (2015b).
Overall, although many new procedures have been proposed in the literature, only few of them have been proved to achieve a rigorous FDR control under standard conditions, especially in the finite sample case. To the best of our knowledge, we can only refer to the discretised version of the procedure of Benjamini and Liu (1999) introduced by Heller and Gur (2011) and to the asymptotic work of Ferreira (2007). Our paper offers a solution to this problem by presenting new procedures that achieve both theoretical validity and good practical performance. These procedures are readily implemented in computer software and are therefore easy to apply. Moreover, since neither randomisation nor any additional choice of tuning parameters is necessary, their results are easy to interpret.
The paper is organised as follows: after having precisely defined the setting in Section 2, we introduce in Section 3 new procedures relying on the following modifications of the F function:

(1/m) Σ_{i=1}^m F_i(t) / (1 − F_i(t))  and  (1/m) Σ_{i=1}^m F_i(t) / (1 − F_i(τ_m)), t ∈ [0, 1],

(with the convention 1/0 = +∞), where an appropriate choice of τ_m is made. To see how slight these modifications are, Figure 1 displays these functions and shows that they are very close to the original F for small values of t. In addition, we also introduce more powerful "adaptive" versions, meaning that the derived critical values are designed in a way that "implicitly estimates" the overall proportion of true null hypotheses and thus may outperform the original Heyse procedure. Next, in Section 4, we establish rigorous FDR control of the corresponding non-adaptive and adaptive procedures under standard conditions. Our proofs, presented in Section 8, rely on new FDR bounds that generalise some prominent results of the multiple testing literature. These bounds are the main mathematical contribution of the paper and are interesting in their own right, beyond the discrete setting. Also, to explore in detail the improvement brought by our procedures, we analyse both real and simulated data in Sections 5 and 6. Finally, while some additional procedures are presented in Appendix A, other complementary results are provided in Appendices B, C and D.

General model
Let us observe a random variable X, defined on a probability space and valued in an observation space (X, 𝒳). We consider a set P of candidate distributions for X and we denote the true distribution of X by P. We assume that m null hypotheses H_{0,i}, 1 ≤ i ≤ m, are available for P and we denote the corresponding set of true null hypotheses by H_0(P) = {1 ≤ i ≤ m : H_{0,i} is satisfied by P}. We also denote by H_1(P) the complement of H_0(P) in {1, . . . , m} and by m_0(P) = |H_0(P)| the number of true nulls. We assume that the user has at hand a set of p-values to test each null, that is, a set of random variables {p_i(X), 1 ≤ i ≤ m}, valued in [0, 1]. Throughout the paper, we also make the following important (but classical) assumption: for each i, the function

F_i(t) = sup { P'(p_i(X) ≤ t) : P' ∈ P satisfies H_{0,i} }, t ∈ [0, 1],

is assumed to be known. Note that we necessarily have F_i(·) nondecreasing, F_i(t) ∈ [0, 1], F_i(1) = 1, and we add the technical condition F_i(0) = 0. Loosely, each F_i corresponds to the (least favorable) cumulative distribution function of p_i under the null. Above, we have taken the supremum to cover the case where the null hypothesis is composite: in that situation, each F_i is adjusted according to the least favorable configuration within the null H_{0,i}.
Here are some conditions on the F_i's that will be useful to compare some of the studied procedures (these conditions are not assumed in our results unless explicitly mentioned):

F_i(t) ≤ t for all t ∈ [0, 1] and 1 ≤ i ≤ m;    (2)
F_i(t) = t for all t ∈ [0, 1] and 1 ≤ i ≤ m.    (3)

Condition (2) ensures that the p-values have marginals stochastically lower-bounded by a uniform variable under the null, called a super-uniform distribution in the sequel. This is the classical setting which is used in most of the work dealing with FDR controlling theory, see, e.g., Benjamini and Hochberg (1995). Condition (3) is more restrictive: if each null hypothesis is a singleton, it is equivalent to the p-values having uniform marginals under the null.

Discrete and continuous modelling
In order to describe the overall support of the p-value distributions, we assume one of the two following situations to be at hand throughout the paper (except in Section 4, which is written in a more general manner):

• Continuous setting: each F_i is continuous on [0, 1]; in that case we let A = [0, 1].
• Discrete setting: each p_i takes its values (under the null and under the alternative) in a finite set A_i ⊂ [0, 1]; in that case we let A = ∪_{1 ≤ i ≤ m} A_i.

The continuous setting is typically valid in situations where the p-values are calibrated from test statistics having a continuous distribution under the null. In this situation, (3) is often satisfied. The discrete setting typically arises in situations where the p-values are calibrated from test statistics having a finitely supported distribution under the null. In this situation, we generally have that (3) holds true only on the support of F_i, that is,

F_i(t) = t for all t ∈ A_i, 1 ≤ i ≤ m.    (4)

In the discrete framework, let us underline that while (4) will typically hold, the equality F_i(t) = t, t ∈ A, will fail in general because A contains points of A_j for j ≠ i. As a result, F(t) defined by (1) will be smaller than t in general (see Figure 1), which is exactly the property that we want to exploit in this paper. To illustrate the above framework, we provide below two simple examples (for more advanced examples, see for instance Chen and Doerge, 2015b).
Example 1 (Gaussian testing). Observe X = (X_i)_{1≤i≤m} with independent coordinates and marginals X_i ∼ N(μ_i, 1), where μ_i ∈ R is the parameter of interest, 1 ≤ i ≤ m. In that situation, a possible hypothesis testing problem is to consider the nulls H_{0,i}: "μ_i ≤ 0" against the alternatives H_{1,i}: "μ_i > 0", with the p-values p_i(X) = Φ̄(X_i), 1 ≤ i ≤ m (where Φ̄ denotes the upper-tail distribution function of a standard Gaussian variable). This corresponds to the continuous setting above, and both (2) and (3) hold.
Example 2 (Binomial testing). Observe X = (X_i)_{1≤i≤m} with independent coordinates and marginals X_i ∼ B(n_i, θ_i), where n_i ≥ 1 is known and θ_i ∈ (0, 1) is the parameter of interest, 1 ≤ i ≤ m. In that situation, a possible hypothesis testing problem is to consider the nulls H_{0,i}: "θ_i ≤ 1/2" against the alternatives H_{1,i}: "θ_i > 1/2", with the p-values p_i(X) = G_{n_i}(X_i), 1 ≤ i ≤ m, where G_{n_i} is the upper-tail distribution function of a binomial distribution of parameters (n_i, 1/2). The support of the p-values under the null and the alternative is given by the values 2^{-n_i} Σ_{j=k}^{n_i} (n_i choose j), 0 ≤ k ≤ n_i. We easily check in that case that (3) is violated while (2) and (4) hold.
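The following minimal R sketch makes Example 2 concrete (n_i = 10 is a toy choice): it computes the attainable p-values and the null c.d.f. F_i, and checks that (4) holds on the support while F_i(t) < t off the support.

```r
# One-sided binomial p-value for testing theta_i <= 1/2 against theta_i > 1/2.
n_i  <- 10
pval <- function(x) pbinom(x - 1, size = n_i, prob = 0.5, lower.tail = FALSE)  # P(Bin(n_i,1/2) >= x)
A_i  <- sort(pval(0:n_i))                                           # attainable p-values
F_i  <- function(t) sum(dbinom(0:n_i, n_i, 0.5)[pval(0:n_i) <= t])  # null c.d.f. of p_i
all.equal(sapply(A_i, F_i), A_i)   # (4): F_i(t) = t on the support A_i
F_i(0.04)                          # strictly below 0.04 here: F_i(t) < t off the support
```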
Step-up and step-down procedures

Let p_(1) ≤ · · · ≤ p_(m) denote the ordered p-values (with the conventions p_(0) = 0 and τ_0 = 0). Given a nondecreasing sequence of critical values τ = (τ_k)_{1≤k≤m}, the step-up procedure SU(τ) rejects the i-th hypothesis if p_i ≤ τ_{k̂}, with k̂ = max{k ∈ {0, 1, . . . , m} : p_(k) ≤ τ_k}. The step-down procedure with the same critical value sequence, denoted by SD(τ), rejects the i-th hypothesis if p_i ≤ τ_{k̃}, with k̃ = max{k ∈ {0, 1, . . . , m} : ∀k' ≤ k, p_(k') ≤ τ_{k'}}. It is straightforward to check that, for the same set of critical values, the step-up version always rejects at least as many hypotheses as the step-down version. More comments and illustrations on step-wise procedures can be found in Blanchard et al. (2014) and Dickhaus (2014), among others.
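As a sketch (not taken from the paper), these two definitions translate directly into R; both functions return the indices of the rejected hypotheses for a p-value vector p and a nondecreasing critical value vector tau:

```r
step_up <- function(p, tau) {
  ps    <- sort(p)
  below <- which(ps <= tau)                                  # ranks k with p_(k) <= tau_k
  k_hat <- if (length(below)) max(below) else 0
  which(p <= (if (k_hat > 0) tau[k_hat] else -Inf))
}
step_down <- function(p, tau) {
  ok      <- sort(p) <= tau
  k_tilde <- if (all(ok)) length(p) else which(!ok)[1] - 1   # largest k with p_(k') <= tau_k' for all k' <= k
  which(p <= (if (k_tilde > 0) tau[k_tilde] else -Inf))
}
# e.g. the BH procedure at level alpha is step_up(p, alpha * (1:length(p)) / length(p))
```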

False discovery rate
We measure the quantity of false positives of a step-up (resp. step-down) procedure by using the false discovery rate (FDR), introduced and popularised by Benjamini and Hochberg (1995), which is defined as the averaged proportion of errors among the rejected hypotheses. More formally, for some procedure R rejecting the i-th hypothesis if p_i ≤ t̂(X) (for some threshold t̂(X)), we let

FDR(R, P) = E_{X∼P} [ |R ∩ H_0(P)| / max(|R|, 1) ].

The main contribution of this work is to propose procedures that control the FDR at a prescribed level α and that incorporate the knowledge of the F_i's in a way that increases the number of discoveries.
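For intuition, the FDR of a given procedure can be approximated by Monte Carlo as the average false discovery proportion over simulated data sets. Here is a minimal R sketch (a toy Gaussian setting, not from the paper) using the step_up() function defined above:

```r
set.seed(1)
m <- 100; m0 <- 80; alpha <- 0.05
fdp <- replicate(2000, {
  mu  <- c(rep(0, m0), rep(3, m - m0))                 # first m0 hypotheses are true nulls
  p   <- pnorm(rnorm(m, mean = mu), lower.tail = FALSE)
  rej <- step_up(p, alpha * (1:m) / m)                 # BH procedure
  if (length(rej)) mean(rej <= m0) else 0              # false discovery proportion
})
mean(fdp)                                              # estimated FDR; close to alpha * m0 / m here
```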

Procedures
In this section we briefly review some existing methods for FDR control and introduce our new procedures.

Existing methods
We use the following methods as starting points for constructing new procedures.
- [BH]: the seminal procedure proposed in Benjamini and Hochberg (1995), corresponding to the step-up procedure SU(τ) with critical values τ_k = αk/m, 1 ≤ k ≤ m;
- [BR-λ], λ ∈ (0, 1): an adaptive version of the BH procedure that was proposed in Blanchard and Roquain (2009), corresponding to the step-up procedure SU(τ) with critical values τ_k = min{α(1 − λ)k/(m − k + 1), λ}, 1 ≤ k ≤ m;
- [GBS]: an adaptive version of the BH procedure that has been proposed in Gavrilov et al. (2009), corresponding to the step-down procedure SD(τ) with critical values τ_k = αk/(m + 1 − k(1 − α)), 1 ≤ k ≤ m;
- [Heyse]: the step-up procedure SU(τ) using the critical values

τ_k = max{ t ∈ A : F(t) ≤ αk/m }, 1 ≤ k ≤ m,    (8)

where F is defined by (1). This procedure was proposed in Heyse (2011).
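These critical value sequences are easily computed; the following R sketch uses the forms recalled above (the exact [BR-λ] expression should be checked against Blanchard and Roquain (2009); F_bar and A are as in the earlier sketch):

```r
crit_bh    <- function(m, alpha) alpha * (1:m) / m
crit_br    <- function(m, alpha, lambda) pmin(alpha * (1 - lambda) * (1:m) / (m - (1:m) + 1), lambda)
crit_gbs   <- function(m, alpha) alpha * (1:m) / (m + 1 - (1:m) * (1 - alpha))
crit_heyse <- function(F_bar, A, m, alpha)
  sapply(1:m, function(k) max(c(0, A[sapply(A, F_bar) <= alpha * k / m])))
```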
The rationale behind the critical values of [BR-λ] and [GBS] is that they are intended to mimic the oracle critical values τ k = αk/m 0 (P ), 1 ≤ k ≤ m, which are less conservative than those of [BH] when m 0 (P )/m is not close to 1, see, e.g., Benjamini et al. (2006); Blanchard and Roquain (2009) for more details on this issue. Also, among adaptive procedures, [GBS] satisfies a kind of optimality as a finite sample version of the asymptotically optimal rejection curve, see Finner et al. (2009).
Let us now comment on [Heyse]. First, in the continuous setting where (2) holds, F(t) ≤ t, t ∈ [0, 1], and thus the critical values given by (8) satisfy τ_k ≥ αk/m, 1 ≤ k ≤ m, which means that [Heyse] rejects at least as many hypotheses as [BH]. When (3) additionally holds, we have F(t) = t, t ∈ [0, 1], and the two critical value sequences are the same. Second, in the discrete setting where (2) holds, A is finite and τ_k is not necessarily greater than αk/m anymore. However, [Heyse] is also less conservative than (or equal to) [BH] in the latter case, as stated in the following result (proved in Appendix B for completeness).
Lemma 1. Consider the model of Section 2.1 assuming (2), both in the continuous and the discrete setting described in Section 2.2. Then the set of nulls rejected by [Heyse] contains the one of [BH] (almost surely). Furthermore, under (4) and if F_i = F_j for all i ≠ j, these two rejection sets are equal (almost surely).

The equality case of Lemma 1 was provided in Proposition 2.3 of Heller and Gur (2011). It can be seen as a limitation of the Heyse procedure in the homogeneous case. In the heterogeneous case, however, F(t) is smaller than t (see Figure 1) and [Heyse] can substantially improve [BH] (see Figure 2).
While [Heyse] incorporates the knowledge of the F i 's in a natural way (see also Remark 1 below), it is not correctly calibrated for a rigorous FDR control (see Appendix B.1). We propose suitable modifications of [Heyse] in the next sections.

Remark 1 (Empirical Bayes point of view on the Heyse procedure).
We claim that [Heyse] corresponds to a suitable empirical Bayes procedure. To see this, consider the "binomial example" of Section 2.2, but assume now that the counts n_1, . . . , n_m are obtained as a sample N_1, . . . , N_m drawn i.i.d. from some prior distribution ν. Under the null, each p-value then has the same marginal c.d.f. F_0(t) = E_{N∼ν}[F_N(t)], where F_n denotes the null c.d.f. of the binomial p-value based on n trials. This suggests normalising the p-values p_i as F_0(p_i), which leads to the step-up procedure with critical values τ_k = max{t : F_0(t) ≤ αk/m}. Following an empirical Bayes approach, the prior ν can be estimated by the empirical distribution ν̂ of n_1, . . . , n_m, for which the corresponding averaged c.d.f. is equal to F given by (1). Hence, the corresponding (empirical Bayes) step-up procedure reduces to [Heyse].

Two new methods
We now present two procedures that aim at correcting [Heyse]:

- [HSU] (heterogeneous step-up): the step-up procedure SU(τ) using the critical values defined in the following way:

τ_m = max{ t ∈ A : Σ_{i=1}^m F_i(t) / (1 − F_i(t)) ≤ αm },    (9)

τ_k = max{ t ∈ A, t ≤ τ_m : Σ_{i=1}^m F_i(t) / (1 − F_i(τ_m)) ≤ αk }, 1 ≤ k ≤ m − 1;    (10)

- [HSD] (heterogeneous step-down): the step-down procedure SD(τ) using the critical values defined in the following way:

τ_k = max{ t ∈ A : Σ_{i=1}^m F_i(t) / (1 − F_i(t)) ≤ αk }, 1 ≤ k ≤ m    (11)

(in both cases with the conventions 1/0 = +∞ and max ∅ = 0). [HSU] can be seen as a correction of [Heyse]: the correction term in the critical values (10) lies in the additional denominator 1 − F_i(τ_m). A consequence is that [HSU] can be more conservative than [BH]. However, the magnitude of this phenomenon is always small, as the next lemma shows (proved in Appendix B for completeness).
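As a computational sketch, the following R function implements the critical values (9)-(11) as written above (F_list is a list of the m null c.d.f.'s given as functions, A the finite support grid; it assumes F_i(τ_m) < 1 for all i):

```r
crit_hsu_hsd <- function(F_list, A, alpha) {
  m    <- length(F_list)
  Fmat <- sapply(F_list, function(F) sapply(A, F))       # |A| x m matrix of F_i(t), t in A
  sd_sum  <- rowSums(Fmat / (1 - Fmat))                   # sum_i F_i(t)/(1-F_i(t)); 1/0 = Inf
  tau_hsd <- sapply(1:m, function(k) max(c(0, A[sd_sum <= alpha * k])))   # (11)
  tau_m   <- tau_hsd[m]                                   # (9)
  Fm      <- sapply(F_list, function(F) F(tau_m))
  su_sum  <- rowSums(sweep(Fmat, 2, 1 - Fm, "/"))         # sum_i F_i(t)/(1-F_i(tau_m))
  tau_hsu <- c(sapply(1:(m - 1), function(k)
    max(c(0, A[su_sum <= alpha * k & A <= tau_m]))), tau_m)               # (10)
  list(HSU = tau_hsu, HSD = tau_hsd)
}
```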

Lemma 2. Under the conditions of Lemma 1, the set of nulls rejected by [HSU] contains the one of [BH] taken at level α/(1 + α) (almost surely).
For [HSD], the following result can be established.

Lemma 3. Under the conditions of Lemma 1, the set of nulls rejected by [HSD]
contains the one of the step-down procedure with critical values (αk/m)/(1 + αk/m), 1 ≤ k ≤ m (almost surely).
From (10) and (11) it is clear that the critical values of [HSD] are always at least as large as those for [HSU]. However, since the step-up direction is more powerful than the step-down direction (see Section 2.3) neither of the two generally dominates the other one.

Remark 2.
We may ask whether we can construct a uniform improvement of [BH] that incorporates the F_i's. There is indeed such a procedure, see procedure [RBH] in Appendix A.1 for more details. However, the improvement brought by the F_i's information is less substantial than for [HSU], so we have chosen to omit [RBH] from the main body of the paper.

Adaptive versions
In this section, we define adaptive versions of [HSU] and [HSD] in the following way:

- [AHSU] (one-stage adaptive heterogeneous step-up): the step-up procedure SU(τ) using the critical values defined in the following way: τ_m as in (9) and, for 1 ≤ k ≤ m − 1,

τ_k = max{ t ∈ A, t ≤ τ_m : Σ_{j=1}^{m−k+1} ( F(t) / (1 − F(τ_m)) )_{(j)} ≤ αk },    (12)

where each ( F(t) / (1 − F(τ_m)) )_{(j)} denotes the j-th largest element of the set { F_i(t) / (1 − F_i(τ_m)), 1 ≤ i ≤ m };

- [AHSD] (one-stage adaptive heterogeneous step-down): the step-down procedure SD(τ) using the critical values defined in the following way: for 1 ≤ k ≤ m,

τ_k = max{ t ∈ A : Σ_{j=1}^{m−k+1} ( F(t) / (1 − F(t)) )_{(j)} ≤ αk },    (13)

where each ( F(t) / (1 − F(t)) )_{(j)} denotes the j-th largest element of the set { F_i(t) / (1 − F_i(t)), 1 ≤ i ≤ m }.

Since the sums in (12) and (13) only involve the m − k + 1 largest terms (instead of all m terms as in (10) and (11)), the adaptive critical values are always at least as large as their non-adaptive counterparts. The adaptive procedures can also be compared to [BR-λ] and [GBS], as shown in the next lemma (proved in Appendix B for completeness).

Lemma 4. Under the conditions of Lemma 1, (i) the set of nulls rejected by [AHSU] contains the one of [BR-λ] with λ = τ_m (almost surely); (ii) the set of nulls rejected by [AHSD] contains the one of [GBS] (almost surely).

The above lemma ensures that the user can incorporate the knowledge of the F_i's in adaptive procedures with a "no loss" guarantee with respect to [BR-λ] and [GBS].
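The adaptive critical values can be computed along the same lines as before; the following R sketch implements (13) (the [AHSU] values (12) are analogous, with the frozen denominator 1 − F_i(τ_m)); the interface is the same as in crit_hsu_hsd above:

```r
crit_ahsd <- function(F_list, A, alpha) {
  m     <- length(F_list)
  Fmat  <- sapply(F_list, function(F) sapply(A, F))
  ratio <- Fmat / (1 - Fmat)                              # F_i(t)/(1-F_i(t)), with 1/0 = Inf
  sapply(1:m, function(k) {
    # for each t, keep only the m-k+1 largest terms in the sum
    s <- apply(ratio, 1, function(r) sum(sort(r, decreasing = TRUE)[seq_len(m - k + 1)]))
    max(c(0, A[s <= alpha * k]))
  })
}
```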

New FDR bounds for heterogeneous nulls
In this section, we present new FDR bounds which are the main mathematical contributions of this paper and that are of independent interest. They generalise some classical bounds from super-uniform null distributions to arbitrary heterogeneous (not necessarily discrete) null distributions, and immediately yield FDR control of our new procedures.

Results
The following result holds. It only assumes independence between the p-values and not super-uniformity of the null distributions.

Theorem 1. Consider any family of nondecreasing critical values 0 ≤ τ_1 ≤ · · · ≤ τ_m ≤ 1 and assume that the p-values p_1, . . . , p_m are mutually independent (Indep). Then, for all P ∈ P,

FDR(SU(τ), P) ≤ max_{1≤k≤m} max_{A⊂{1,...,m}, |A|=m−k+1} { (1/k) Σ_{i∈A} F_i(τ_k) / (1 − F_i(τ_m)) },    (14)

FDR(SD(τ), P) ≤ max_{1≤k≤m} max_{A⊂{1,...,m}, |A|=m−k+1} { (1/k) Σ_{i∈A} F_i(τ_k) / (1 − F_i(τ_k)) },    (15)

with the convention 1/0 = +∞.
The proof of Theorem 1 is deferred to Section 8. It combines several techniques: the first tool is an expression of the FDR introduced by Ferreira (2007) (step-up case) and Roquain and Villers (2011) (step-down case). A second idea comes from the work of Blanchard and Roquain (2009) (step-up case) and Gavrilov et al. (2009) (step-down case), which introduced a new term (here, the denominator 1 − F_i(·)) to make the proof work. Finally, another inspiration is the study of Roquain and van de Wiel (2009) and Döhler (2016), which allowed us to deal with heterogeneous FDR thresholding. Let us underline that the resulting proof is especially concise, which means that these different techniques fit together perfectly well; this is perhaps surprising at first glance, see Section 8.
Next, let us note that taking the maximum over the subset A in (14) and (15) allows us to adapt to the unknown number of true null hypotheses: loosely, if k − 1 is the number of rejections, A corresponds to the acceptance set (hence of cardinality m − k + 1), which "estimates" H_0, and thus the sums in (14) and (15) are indexed by a set "close" to the unknown set H_0. Taking the maximum then corresponds to the least favorable possible H_0.
Finally, let us underline again that the above bounds do not use the super-uniformity of the F_i's, which makes them quite general and flexible tools. Several examples are given below.

Application to adaptiveness and weighting
Let us now give some intuition behind these bounds and illustrate their generality by showing how they encompass previous work in the literature. First, assuming the super-uniformity condition F_i(t) ≤ t for all t and i, these bounds entail simpler bounds (16), (17) and (18), which encompass Theorem 1 of Benjamini and Hochberg (1995), Theorem 9 of Blanchard and Roquain (2009) and Theorem 1.1 of Gavrilov et al. (2009), respectively. Second, by removing the adaptive part of the bounds, that is, by replacing A by {1, . . . , m}, we obtain the simpler but more conservative bounds (19) and (20). Here, we show how these bounds can be used to recover some of the finite sample FDR controlling results of Roquain and van de Wiel (2009). To this end, consider a family (Δ_i)_{1≤i≤m} of nondecreasing functions on [0, 1]. This family can be considered as "weighting" the p-values. It is a free parameter that adds an extra flexibility which can be useful in different contexts, see, e.g., Ignatiadis et al. (2016), Durand (2017). An important point is then to make sure that this weighting maintains the FDR control. For this, let us first modify the family (Δ_i)_{1≤i≤m} into a suitably rescaled family (Δ̃_i)_{1≤i≤m}, with the convention Δ̃_i(1) = 1 (to make Δ̃_i meet the properties of a c.d.f.). Then we can consider the BH procedure using the transformed p-values p̃_i = Δ̃_i^{−1}(p_i), 1 ≤ i ≤ m, which can be interpreted as a "weighted BH procedure", in the sense that each p-value p_i has an importance which is increased or diminished in the procedure according to the value of Δ̃_i^{−1} at p_i. Since each p̃_i has F_i ∘ Δ̃_i as null c.d.f., the bound (19) can be applied to the transformed p-values, which shows that this weighted BH procedure controls the FDR at level α,
which recovers the results of Theorem 4.1 in Roquain and van de Wiel (2009) (step-up part). The step-down part can be recovered from (20) in a similar way.

Application to the new procedures
To make a connection between Theorem 1 and our new procedures, especially [AHSU] and [AHSD] (see Section 3.3), observe that, by construction of the critical values, for each k the maximum over the subsets A of cardinality m − k + 1 of Σ_{i∈A} F_i(τ_k)/(1 − F_i(τ_m)) (resp. of Σ_{i∈A} F_i(τ_k)/(1 − F_i(τ_k))) equals the sum of the m − k + 1 largest terms, which is at most αk by (12) (resp. by (13)); the same holds a fortiori for [HSU] and [HSD] by (9)-(11). Therefore, Theorem 1 implies that our new procedures enjoy the desired FDR controlling property.

Corollary 1. In the setting of Theorem 1, the procedures [HSU], [HSD], [AHSU], [AHSD] all control the FDR at level α.
Now let us focus on the discrete case. In that situation, recall that the individual p-values cannot be transformed (without randomisation) to be uniform under the null. Rather, our Heyse-type procedures "average" the heterogeneous nulls. As a consequence, if some of the F i 's are really small, they will not contribute much to the average, offering some additional room for the other F j 's.
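As a quick illustration of the FDR control in the discrete case, the following R sketch runs a small Monte Carlo experiment under the full null with heterogeneous binomial tests (a toy setting, re-using the crit_hsu_hsd() and step_up() sketches above):

```r
set.seed(2)
n      <- sample(5:15, 50, replace = TRUE)               # heterogeneous numbers of trials
F_list <- lapply(n, function(ni) {                       # null c.d.f. of each one-sided binomial p-value
  pv   <- pbinom((0:ni) - 1, ni, 0.5, lower.tail = FALSE)
  supp <- sort(pv)
  stepfun(supp, c(0, sapply(supp, function(t) sum(dbinom(0:ni, ni, 0.5)[pv <= t]))))
})
A   <- sort(unique(unlist(lapply(F_list, knots))))
tau <- crit_hsu_hsd(F_list, A, alpha = 0.1)$HSU
fdr <- mean(replicate(2000, {
  x <- rbinom(length(n), n, 0.5)                         # all nulls true here
  p <- pbinom(x - 1, n, 0.5, lower.tail = FALSE)
  length(step_up(p, tau)) > 0                            # under the full null, FDP = 1{any rejection}
}))
fdr                                                       # expected to stay below 0.1
```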
Finally, let us underline that our bounds can be useful for other discrete-type procedures. As a case in point, consider mid p-values, which were introduced by Lancaster (1961) and are sometimes used for analysing discrete data (see, e.g., Karp et al., 2016). These p-values are no longer super-uniform under the null hypotheses; however, our theorem can accommodate such distributions in a natural way and still yield valid FDR controlling procedures.

Empirical data
To illustrate the performance of FDR-controlling procedures for discrete data, we analyse two classical data sets. In what follows, our main goal is to compare the performance of the new procedures [HSU] and [AHSU] to the classical [BH] and [Storey] procedures and also to [Heyse]. The procedure [Storey] was proposed in Storey et al. (2004) and corresponds to the step-up procedure SU(τ) with critical values τ_k = min{αk/m̂_0, λ}, 1 ≤ k ≤ m, where m̂_0 = (1 + #{i : p_i > λ}) / (1 − λ) is an estimate of the number m_0 of true null hypotheses among the m hypotheses. We use the standard value λ = 1/2. All analyses were performed using the R language for statistical computing (R Core Team, 2016).
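For completeness, here is a minimal R sketch of the [Storey] critical values as recalled above (the estimator and the cap at λ follow that form; see Storey et al. (2004) for the original definition):

```r
crit_storey <- function(p, alpha, lambda = 0.5) {
  m     <- length(p)
  m0hat <- (1 + sum(p > lambda)) / (1 - lambda)   # estimate of the number of true nulls
  pmin(alpha * (1:m) / m0hat, lambda)
}
# rejected <- step_up(p, crit_storey(p, alpha = 0.05))
```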

Pharmacovigilance data
This data set is derived from a database for reporting, investigating and monitoring adverse drug reactions, maintained by the Medicines and Healthcare products Regulatory Agency in the United Kingdom. It contains the number of reported cases of amnesia as well as the total number of adverse events reported for each of the m = 2446 drugs in the database. For more details we refer to Heller and Gur (2011) and to the accompanying R-package 'discreteMTP' (Heller et al., 2012), which also contains the data. Heller and Gur (2011) investigate the association between reports of amnesia and suspected drugs by performing, for each drug, a one-sided Fisher exact test of association between the drug and amnesia, while adjusting for multiplicity by using several (discrete) FDR procedures.

Next generation sequencing data
We also revisit the next generation sequencing (NGS) count data analysed by Chen and Doerge (2015b), to which we also refer for more details. More specifically, we reanalyse the methylation data set for cytosines of Arabidopsis in Lister et al. (2008), which is part of the R-package 'fdrDiscreteNull' (Chen and Doerge, 2015a). This data set contains the counts for a biological entity under two different biological conditions or treatments. Following Chen and Doerge (2015b), m = 7421 genes whose treatment-wise total counts are positive but whose row-total counts are no greater than 100 are analysed using the exact binomial test (see Chen and Doerge, 2015b).

Results

Table 1 summarises the number of discoveries for the pharmacovigilance and NGS data when using the respective FDR procedures at level α = 0.05. Compared to the classical [BH] procedure, the discrete procedures are able to detect three additional candidates linking amnesia and drugs in the pharmacovigilance data. This data set seems to contain very few signals, so there is no benefit in using adaptive procedures; in fact, the (finite sample) [Storey] procedure performs worse than the [BH] procedure. Note also that our new procedures, while being correctly calibrated, still reject the same number of hypotheses as [Heyse]. In contrast, the Arabidopsis data seems to contain a large portion of signals, so that in particular the [Storey] procedure performs much better than [BH]. The [HSU] and [Heyse] procedures also outperform [BH], while the [Storey] procedure is dominated by the [AHSU] procedure. Figure 2 illustrates graphically the data and the critical constants of the involved multiple testing procedures (the critical constants of [BH], [HSU], [Heyse], [Storey] and [AHSU] are represented by solid lines in different colours). In particular, the benefit of taking discreteness into account becomes more apparent: for the pharmacovigilance data, the discrete critical values are considerably (by a factor of 2.5 to 3.5) larger than their respective classical counterparts. This leads to more powerful procedures. For the NGS data, we can observe quite clearly that the [HSU] critical constants are dominated by the [AHSU] constants, as explained in Section 3. This leads to roughly 100 additional rejections. Again, the discrete critical values are considerably larger than their respective classical counterparts. In Section 3.2, we mentioned that the correction factor 1 − F_i(τ_m), introduced for guaranteeing FDR control of [HSU], may lead to a procedure which is more conservative than [BH]. However, Figure 2 shows that, at least for the data sets considered here, this risk is by far compensated by the benefit of taking discreteness adequately into account.

Simulation study
We now investigate the power of the procedures from the previous section in a simulation study similar to those described in Gilbert (2005), Heller and Gur (2011) and Döhler (2016). Again, we focus on comparing the performance of the new discrete procedures to [BH], [Storey] and [Heyse].

Simulated scenarios
We simulate a two-sample problem in which a vector of m independent binary responses ("adverse events") is observed for each subject in two groups, where each group consists of N = 25 subjects. The goal is then to simultaneously test the m null hypotheses H_{0i}: "p_{1i} = p_{2i}", i = 1, . . . , m, where p_{1i} and p_{2i} are the success probabilities for the i-th binary response in group 1 and 2, respectively. Before we describe the simulation framework in more detail, we explain how this set-up leads to discrete and heterogeneous p-value distributions. Suppose we have simulated two vectors of dimension m where each component represents a count in {0, . . . , 25}. These data can be represented by m contingency tables. Now each hypothesis is tested using Fisher's exact test (two-sided) for the corresponding contingency table, which is performed by conditioning on the (simulated) pair of marginal counts. Thus, we can determine for every contingency table i the discrete distribution function F_i of the p-values for Fisher's exact test under the null hypothesis. For differing (simulated) contingency tables, these induced distributions will generally be heterogeneous, and our inference is conditional on the marginal counts.
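The conditional null c.d.f. of a Fisher exact test p-value can be computed explicitly from the hypergeometric distribution of the table entries given the margins. Here is a minimal R sketch for one table with group sizes N1 = N2 = 25 (the total success count s = 6 is an arbitrary illustrative value):

```r
fisher_null_cdf <- function(s, N1 = 25, N2 = 25) {
  a  <- max(0, s - N2):min(N1, s)                        # possible group-1 success counts
  pv <- sapply(a, function(ai)
    fisher.test(matrix(c(ai, N1 - ai, s - ai, N2 - (s - ai)), nrow = 2))$p.value)
  pr <- dhyper(a, N1, N2, s)                             # conditional null (hypergeometric) probabilities
  supp <- sort(unique(pv))
  stepfun(supp, c(0, sapply(supp, function(t) sum(pr[pv <= t]))))
}
F_ex <- fisher_null_cdf(s = 6)
F_ex(0.05)    # at most 0.05, and typically much smaller due to discreteness
```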
We take m = 800, 2000 where m = m 1 + m 2 + m 3 and data are generated so that the response is Bernoulli(0.01) at m 1 positions for both groups, Bernoulli(0.10) at m 2 positions for both groups and Bernoulli(0.10) at m 3 positions for group 1 and Bernoulli(q) at m 3 positions for group 2 where q = 0.15, 0.25, 0.4 represents weak, moderate and strong effects respectively. The null hypothesis is true for the m 1 and m 2 positions while the alternative hypothesis is true for the m 3 positions. We also take different configurations for the proportion of false null hypotheses, m 3 is set to be 10%, 30% and 80% of the value of m, which represents small, intermediate and large proportion of effects (the proportion of true nulls π 0 is 0.9, 0.7, 0.2, respectively). Then, m 1 is set to be 20%, 50% and 80% of the number of true nulls (that is, m − m 3 ) and m 2 is taken accordingly as m − m 1 − m 3 . For each of the 54 possible parameter configurations specified by m, m 3 , m 1 and q, 10000 Monte Carlo trials are performed, that is, 10000 data sets are generated and for each data set, an unadjusted two-sided p-value from Fisher's exact test is computed for each of the m positions, and the multiple testing procedures mentioned above are applied at level α = 0.05. The power of each procedure was estimated as the fraction of the m 3 false null hypotheses that were rejected, averaged over the 10000 simulations. Note that while our procedures are designed to control the FDR conditionally on the marginal counts, our power results are presented in an unconditional way for the sake of simplicity. For random number generation the R-function rbinom was used. The two-sided p-values from Fisher's exact test were computed using the R-function fisher.test.
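For concreteness, the following R sketch runs one replicate of one of the configurations described above (m = 800, m3 = 80 so that π0 = 0.9, m1 = 144, strong effect q = 0.4), using the step_up() function from the earlier sketch with the [BH] critical values:

```r
set.seed(123)
N  <- 25; m3 <- 80; m1 <- 144; m2 <- 800 - m1 - m3
p1 <- c(rep(0.01, m1), rep(0.10, m2), rep(0.10, m3))     # group 1 success probabilities
p2 <- c(rep(0.01, m1), rep(0.10, m2), rep(0.40, m3))     # group 2: alternatives at the last m3 positions
x1 <- rbinom(length(p1), N, p1); x2 <- rbinom(length(p2), N, p2)
pv <- mapply(function(a, b)
  fisher.test(matrix(c(a, N - a, b, N - b), nrow = 2))$p.value, x1, x2)
rej <- step_up(pv, 0.05 * seq_along(pv) / length(pv))
mean(tail(seq_along(pv), m3) %in% rej)                   # fraction of the m3 false nulls rejected
```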

Results
We have computed the (average) power and FDR of the five procedures under investigation in all scenarios (see Tables 3 and 4 in Appendix E for the full display). For weak and moderate effects, i.e. q = 0.15 and q = 0.25, none of the procedures possesses relevant power. For strong effects, the results are summarised in Figure 3, which displays the power of the [BH], [Storey], [HSU], [Heyse] and [AHSU] procedures in the simulation study (the coloring is the same as in Figure 2). Since the power of the discrete procedures is slightly increasing in m_1 for fixed m_3 and q, we present, in order to avoid over-optimism, the configuration with smallest m_1.
The results are consistent with the findings of the previous section: the new discrete procedures are considerably more powerful than the [BH] procedure. When the proportion of alternatives is large, the [Storey] procedure provides large gains over [BH] but is still dominated by the discrete adaptive procedure [AHSU].

Conclusion and discussion
In this paper, we provided new bounds for the FDR of step-up and step-down procedures that use heterogeneous test statistics. This made it possible to define a new class of multiple testing procedures that provably control the FDR while incorporating the discreteness and heterogeneity of the test statistics in a convenient way. We have shown that our approach can be seen as correcting and improving the approach of Heyse (2011): while it ensures a theoretical control, it can also make more rejections when the signal is strong enough.
Our new procedures are easily interpreted since they involve neither randomisation nor any additional choice of tuning parameters (in Appendix C we present a comparison with a randomised p-value approach). Furthermore, an R-package implementing them is currently being developed, which will make these methods available for the practitioner.
Additionally, our methodology can deal with other null distributions F_i that arise in the context of discrete testing: as a case in point, consider mid p-values, which were introduced by Lancaster (1961) and are sometimes used for analysing discrete data (see, e.g., Karp et al., 2016). These p-values are no longer super-uniform under the null hypotheses; however, our methods can accommodate such distributions in a natural way and still yield valid FDR controlling procedures.
Finally, this paper opens several directions for future research, especially by trying to extend our arguments to other frameworks. For instance, an important point is to relax the independence requirement. In this respect, we believe that our procedures will inherit the behavior of the BH procedure: while the FDR control is likely to be maintained under "realistic" dependence, formally proving such a result is probably a challenging problem. Another challenge is to develop mathematically valid plug-in procedures for discrete data. A first step in this direction is sketched in Appendix D.

Lemmas for step-down and step-up procedures
Let us introduce the following modifications of SU(τ) and SD(τ), in which the i-th p-value is left out. The following lemma (Lemma 5, a variation of a well known lemma; see, e.g., Ferreira and Zwinderman, 2006) is proved in Appendix B for completeness. The following lemma (a variation of Gavrilov et al., 2009; Roquain and Villers, 2011) is also proved in Appendix B for completeness:

Lemma 6. For all i ∈ {1, . . . , m}, the following assertions are equivalent: (i)

Proof of Theorem 1, step-up part
By using Lemma 5 (ii) and (iii), we obtain because p_i ≤ τ_{k̂} is equivalent to p_i ≤ τ_{k̂^{−i}+1}, and both imply k̂^{−i} + 1 = k̂. Now using independence between k̂^{−i} and p_i, we obtain because for any i ∈ H_0 and any t, we have P(p_i ≤ t) ≤ F_i(t). Now, on the one hand,

Next, on the other hand, by using again (Indep) and that for any i ∈ H_0, and where the latter inequality comes from the last assertion of Lemma 5. Now, since τ_{k̂+1} ≤ τ_m, we have that the last display is smaller than or equal to by taking the maximum over all the possible realisations of the set which is the index set corresponding to the non-rejected null hypotheses of SU(τ) (the latter being by definition of cardinality m − k̂). This concludes the proof.

Proof of Theorem 1, step-down part
It is similar to the step-up case, with some subtle changes: , and both imply k̃′^{−i} + 1 = k̃ (keep in mind that k̃′^{−i} might be different from k̃^{−i}), by applying Lemma 6. Now using independence between (k̃^{−i}, k̃′^{−i}) and p_i, we obtain

because k̃′^{−i} + 1 ≥ k̃^{−i} + 1 and because for i ∈ H_0, and any t, we have . This gives the first part of the bound. Next, by using again (Indep), we obtain Now using the last assertion of Lemma 6, the last display is smaller than or equal to since both sets correspond to the set of non-rejected hypotheses of SD(τ). Since SD(τ) rejects exactly k̃ hypotheses, the proof is completed.

A.1. A rescaled BH procedure
The procedure [RBH] (rescaled-BH) is defined as the step-up procedure using the critical values The following result is straightforward from Theorem 1 (SU part).

A.2. A heterogeneous BR procedure
The procedure [HBR-λ] (discrete BR) is defined as the step-up procedure SU(τ) using the critical values defined in the following way: for k ∈ {1, . . . , m}, where each (F(t))_{(j)} denotes the j-th largest element of the set {F_i(t), 1 ≤ i ≤ m}. The following result is straightforward from Theorem 1 (SU part).

Corollary 3.
In the setting of Theorem 1, with the additional assumption (2), we have ∀P ∈ P, FDR(HBR, P ) ≤ α. Moreover, the set of nulls rejected by [HBR-λ] is larger than the one of [BR-λ] (almost surely), with equality (almost surely) under (4) and F i = F j for all i = j.

B.1. Counterexample
We present here a modification of the counterexample due to Krieger given in Heller and Gur (2011). Consider m = 3 p-value null distributions given by where δ_{x} denotes the Dirac distribution at x. It is easy to verify that (1) yields Then the critical values of [Heyse] at level α = 0.25 are given by τ_1 = 0.2, τ_2 = τ_3 = 0.29, see (8). Now consider an alternative distribution for P_3 given by where the corresponding parameter will be suitably chosen further on. Assume that the p-values p_1, p_2, p_3 are independent, with p_i ∼ P_i for i ∈ {1, 2} (hypotheses H_1 and H_2 true) and p_3 ∼ Q_3 (hypothesis H_3 false).
Altogether, we obtain

B.2. Proofs for lemmas comparing procedures
The lemmas presented here rely on the fact that there is almost surely no p-value in [0, 1] \ A (both in the continuous and discrete cases). All symbols "=" or "⊂" are intended to be valid almost surely in this section.
A result which will be used extensively in the proofs of this section is the following one: for p-values valued in the set A, the step-up (or step-down) procedure with critical values τ_k, 1 ≤ k ≤ m, has the same rejection set as the step-up (or step-down) procedure with critical values ξ_k = max{t ∈ A : t ≤ τ_k}, 1 ≤ k ≤ m. This fact comes from the following simple observation: for all k, p_(k) ≤ τ_k if and only if p_(k) ≤ ξ_k, because p_(k) ∈ A. The ξ_k's are called the "effective" critical values of SD(τ) or SU(τ) in the sequel.
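In R, this amounts to projecting each critical value onto the support A (a one-line sketch, consistent with the earlier code):

```r
effective_crit <- function(tau, A) sapply(tau, function(tk) max(c(0, A[A <= tk])))
# e.g. effective [BH] critical values: effective_crit(alpha * (1:m) / m, A)
```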

B.2.1. Proof of Lemma 1
The effective critical values of the BH procedure are given by the quantities ξ_k = max{t ∈ A : t ≤ αk/m}, 1 ≤ k ≤ m. If (2) holds, then F(t) ≤ t and each ξ_k is clearly smaller than the k-th critical value of [Heyse]. This implies that the rejection set of [Heyse] is larger than the one of [BH]. Conversely, under (4) and if F_i = F_j = F for all i ≠ j, we always have F(t) = F_i(t) = t for t ∈ A. This implies that the ξ_k's are the critical values of [Heyse] and shows the reversed inclusion.

B.2.2. Proof of Lemmas 2 and 3
Let τ_k, 1 ≤ k ≤ m, be the critical values of [HSU]. Let us consider ξ_k = max{t ∈ A : t ≤ (α/(1 + α)) k/m}, 1 ≤ k ≤ m, the effective critical values of [BH] taken at level α/(1 + α). Using (2), where the last inequality follows from the definition of τ_m. Thus we have F^{SU}(ξ_m) ≤ α, which in turn implies ξ_m ≤ τ_m. Additionally, the bound (23) yields where we used that ξ_m ≤ τ_m. This proves Lemma 2. The proof of Lemma 3 is analogous and is left to the reader.

B.2.3. Proof of Lemma 4
Let us first focus on the case (i) and denote by τ k , 1 ≤ k ≤ m, the critical values of [AHSU]. From (2), we have for 1 ≤ k ≤ m − 1, which correspond to the effective critical values of [BR-λ] with λ = τ m . Now consider the case (ii) and denote again by τ k , 1 ≤ k ≤ m, the critical values of [AHSD]. From (2), we have for 1 ≤ k ≤ m, which correspond to the effective critical values of [GBS]. This implies the result.

B.3.1. Proof of Lemma 5
First note that, for any step-up procedure, which is sometimes more handy, because this definition avoids relying explicitly on the order statistics of the p-values. Now, it is not difficult to check that k̂^{−i} ≥ k̂ − 1 always holds: this comes from the inequality (note that we can assume without loss of generality k̂ ≥ 1 here). This means that (i) implies (ii). Now, when p_i ≤ τ_{k̂^{−i}+1}, we have 1{p_j ≤ τ_{k̂^{−i}+1}} and thus k̂^{−i} + 1 ≤ k̂ (by using the definition of k̂). Since, again, k̂^{−i} ≥ k̂ − 1 always holds, we have k̂^{−i} + 1 = k̂. Hence, (ii) implies (iii). Now, if k̂^{−i} + 1 = k̂, we have by definition of τ^{−i}, which gives that (iii) implies (i). Now, to prove the last statement, we first note that k̂ ≥ k̂^{−i} always holds. Furthermore, if p_i > τ_m, let us prove k̂ ≤ k̂^{−i}. First, k̂ = m is impossible because p_i is above τ_m and thus p_i cannot be rejected by SU(τ). Hence, k̂ ≤ m − 1 and thus τ^{−i}_{k̂} is well defined. Now, since p_i > τ_m, we obtain which implies k̂ ≤ k̂^{−i} by definition of SU^{−i}(τ).

B.3.2. Proof of Lemma 6
First note that, for any step-down procedure, which gives k̃ < k̃′^{−i} + 2 by definition of k̃ and thus k̃ ≤ k̃′^{−i} + 1. Next, if so that k̃ > k̃′^{−i} and thus k̃ ≥ k̃′^{−i} + 1. This proves that (i) implies (iv). Next, which entails k̃ < k̃^{−i} + 1 and thus k̃ ≤ k̃^{−i}. This proves k̃ = k̃′^{−i} + 1. Hence, (iv) implies (iii). The fact that (iii) implies (ii) is obvious because k̃ ≥ k̃^{−i} always holds. Finally, we merely check that k̃ is such that which means that the set of p-values rejected at threshold τ_{k̃} is the same as the set of p-values rejected at threshold τ_{k̃+1}. This gives that (ii) implies (i). For the last assertion, it has been proved in the above reasoning while showing that (iv) implies (iii).

Appendix C: Empirical analyses for randomised p-values
In this section we follow the suggestion of one of the reviewers to investigate how using randomised p-values (see, e.g., Habiger, 2015) compares to our procedures. We do this by reanalysing the pharmacovigilance and Arabidopsis methylation data from Section 5. To be more specific, we apply the BH and the Storey procedure (with λ = 1/2) to randomised p-values and denote these procedures by [r-BH] and [r-Storey]. Each draw of randomised p-values results in a random set of rejected hypotheses. We repeat this simulation 1000 times and for each simulation run determine the number of rejected hypotheses. The resulting distribution of the number of rejected hypotheses is summarised numerically in Table 2 and displayed visually in Figure 4. The pharmacovigilance data seems to contain very few signals, so there is no benefit in using (either randomised or non-randomised) adaptive procedures as compared to discrete procedures (in fact, [r-BH] is more powerful than [r-Storey]). This is also consistent with the findings in Sections 5 and 6. In contrast, the Arabidopsis methylation data seems to contain a large portion of signals, so that adaptive procedures become effective. We see that [r-Storey] considerably outperforms the adaptive discrete procedures from Table 1, which are not based on a plug-in method. We think the reason for this phenomenon is that this situation is tailored to the strengths of the plug-in method (again, this is consistent with the findings in Sections 5 and 6). Summarising the findings from this section, it appears that the amount of power that is lost by avoiding randomisation depends primarily on the proportion of alternatives. The pharmacovigilance data set may serve as an example for a small proportion of alternatives. In this setting, no power is lost, on average, by using discreteness and avoiding randomisation. When the proportion of alternatives is large, however (the Arabidopsis methylation data may be considered a prototype example here), we think that the behaviour of the randomised/discrete procedures is determined primarily by how the quantity of signal is estimated. To support this, we propose in Appendix D a new procedure that combines our approach with the Storey estimator and rejects a comparable number of hypotheses to [r-Storey].

Appendix D: A plug-in version of [HSU]
In this section, we sketch a discrete plug-in procedure in the spirit of Storey et al. (2004) for adapting to the unknown quantity of a discrete signal. To keep the exposition short, we describe this approach only for step-up procedures; however, our ideas carry over directly to step-down procedures. As the proof of Theorem 1 shows, we have from (22) the following bound for the FDR:

FDR(SU(τ), P) ≤ max_{1≤k≤m} max_{A⊂{1,...,m}, |A|=m_0} { (1/k) Σ_{i∈A} F_i(τ_k) / (1 − F_i(τ_m)) },

where m_0 = |H_0| is the (unknown) number of true null hypotheses. Choosing a critical value sequence τ_1(m_0), . . . , τ_m(m_0) that satisfies

max_{A⊂{1,...,m}, |A|=m_0} Σ_{i∈A} F_i(τ_k(m_0)) / (1 − F_i(τ_m(m_0))) ≤ αk, 1 ≤ k ≤ m,

yields a new HSU-type procedure which is adapted to the number of true null hypotheses. In applications, m_0 is an unknown quantity which has to be estimated appropriately; for more details on this issue, see, e.g., Roquain (2009), Storey et al. (2004), Liang and Nettleton (2012), Heesen and Janssen (2016) and references therein. Our plug-in method works as follows: 1. Given the data, determine an appropriate estimate m̂_0 of m_0. 2. Apply the step-up procedure with critical values τ_1(m̂_0), . . . , τ_m(m̂_0).
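The following R sketch illustrates the plug-in idea in a simplified form: the subset size is frozen at an estimate m0hat and only the m0hat largest terms enter the sums defining the critical values (same interface as crit_hsu_hsd above, assuming m0hat ≤ m; this is a heuristic with no proven FDR control, as stressed in the text, and the precise denominators should follow the HSU construction):

```r
crit_plugin <- function(F_list, A, alpha, m0hat) {
  m    <- length(F_list)
  Fmat <- sapply(F_list, function(F) sapply(A, F))
  top_sum <- function(M) apply(M, 1, function(r) sum(sort(r, decreasing = TRUE)[seq_len(m0hat)]))
  tau_m <- max(c(0, A[top_sum(Fmat / (1 - Fmat)) <= alpha * m]))   # analogue of (9), m0hat largest terms
  Fm    <- sapply(F_list, function(F) F(tau_m))
  s     <- top_sum(sweep(Fmat, 2, 1 - Fm, "/"))                    # frozen denominators 1 - F_i(tau_m)
  c(sapply(1:(m - 1), function(k) max(c(0, A[s <= alpha * k & A <= tau_m]))), tau_m)
}
```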
We emphasize that this approach is only a heuristic one and currently we do not have a proof for FDR control.
Depending on the amount of signal and the discreteness of the p-values, this approach can lead to strongly enhanced rejection numbers. As an example, we revisit the analysis of the Arabidopsis methylation data (see Section 5). Figure 5 depicts the number of rejections R as a function of π_0 = m_0/m for this data set. The estimator used by the [Storey] procedure in Table 1 yields the estimate π̂_0 = 0.6 and thus the corresponding discrete plug-in procedure rejects R = 2659 hypotheses. The randomised p-values that were used for evaluating [r-Storey] in Appendix C result in an average estimate π̂_0 = 0.468. Using this estimate, the discrete plug-in procedure rejects R = 2836 hypotheses, which lies within the range of the rejection numbers for the completely randomised procedure [r-Storey].