Adaptive p-value weighting with power optimality

Weighting the p-values is a well-established strategy that improves the power of multiple testing procedures while dealing with heterogeneous data. However, how to achieve this task in an optimal way is rarely considered in the literature. This paper contributes to fill this gap in the case of group-structured null hypotheses, by introducing a new class of procedures named ADDOW (for Adaptive Data Driven Optimal Weighting) that adapts both to the alternative distribution and to the proportion of true null hypotheses. We prove the asymptotic FDR control of ADDOW and its power optimality among all weighted procedures, which shows that it dominates all existing procedures in that framework. Numerical experiments show that the proposed method preserves its optimal properties in the finite sample setting when the number of tests is moderately large.


Introduction
Recent high-throughput technologies bring to the statistical community new types of data that are increasingly large, heterogeneous and complex. Addressing significance in such a context is particularly challenging because of the number of questions that can naturally come up. A popular statistical method is to adjust for multiplicity by controlling the False Discovery Rate (FDR), which is defined as the expected proportion of errors among the items declared significant.
Once the amount of possible false discoveries is controlled, the question of increasing the power, that is, the amount of true discoveries, arises naturally. In the literature, it is well known that the power can be increased by clustering the null hypotheses into homogeneous groups. The latter can be derived in several ways:
• Sample size: a first example is the well-studied data set of the Adequate Yearly Progress (AYP) study (Rogosa, 2005), which compares the results in mathematics tests between socioeconomically advantaged and disadvantaged students in Californian high schools. As studied by Cai and Sun (2009), ignoring the sizes of the schools tends to favor large schools among the detections, simply because large schools have more students and not because the effect is stronger. By grouping the schools into small, medium, and large schools, more rejections are allowed among the small schools, which increases the overall detection capability. This phenomenon also appears in larger-scale studies, as in GWAS (Genome-Wide Association Studies) by grouping hypotheses according to allelic frequencies (Sun et al., 2006), or in microarray experiments by grouping the genes according to the DNA copy number status (Roquain and van de Wiel, 2009). Common practice is generally used to build the groups from this type of covariate.
• Spatial structure: some data sets naturally involve a spatial (or temporal) structure into groups. A typical example is neuroimaging: in Schwartzman, Dougherty and Taylor (2005), a study compares diffusion tensor imaging brain scans on 15443 voxels of 6 normal and 6 dyslexic children. By estimating the densities under the null of the voxels of the front and back halves of the brain, some authors highlight a noteworthy difference, which suggests that analysing the data by making two groups of hypotheses is more appropriate, see Efron (2008) and Cai and Sun (2009).
• Hierarchical relation: groups can be derived from prior knowledge of a hierarchical structure, like pathways for genetic studies, based for example on known ontologies (see e.g. The Gene Ontology Consortium (2000)). Similarly, in clinical trials, the tests are usually grouped into primary and secondary endpoints, see Dmitrienko, Offen and Westfall (2003).
In these examples, while ignoring the group structure can lead to overly conservative procedures, this knowledge can easily be incorporated by using weights. This method can be traced back to Holm (1979), who presented a sequentially rejective Bonferroni procedure that controls the Family-Wise Error Rate (FWER) and added weights to the p-values. Weights can also be added to the type-I error criterion instead of the p-values, as presented in Benjamini and Hochberg (1997) with the so-called weighted FDR. Blanchard and Roquain (2008) generalized the two approaches by weighting both the p-values and the criterion, with a finite positive measure to weight the criterion (see also Ramdas et al. (2017) for recent further generalizations). Genovese, Roeder and Wasserman (2006) introduced the p-value weighted BH procedure (WBH), which has been extensively used afterwards with different choices for the weights. Roeder et al. (2006) and Roeder and Wasserman (2009) built the weights upon genomic linkage, to favor regions of the genome with strong linkage. Hu, Zhao and Zhou (2010) calibrated the weights by estimating the proportion of true nulls inside each group (a procedure named HZZ here). Zhao and Zhang (2014) went one step further by improving HZZ and BH with weights that maximize the number of rejections at a threshold computed from HZZ and BH. They proposed two procedures, Pro1 and Pro2, shown to control the FDR asymptotically and to have better power than BH and HZZ.
However, the problem of finding optimal weights (in the sense of achieving the maximal average number of rejected false nulls) has been only scarcely considered in the literature. For FWER control and Gaussian test statistics, Wasserman and Roeder (2006) designed oracle and data-driven optimal weights, while Dobriban et al. (2015) considered a Gaussian prior on the signal. For FDR control, Roquain and van de Wiel (2009) and Habiger (2014) designed oracle optimal weights by using the knowledge of the distribution of the hypotheses under the alternative. Unfortunately, this knowledge is not available in practice. This leads to the natural idea of estimating the oracle optimal weights by maximizing the number of rejections. This idea has been followed by Ignatiadis et al. (2016) with a procedure called IHW. While they proved that IHW asymptotically controls the FDR, its power properties have not been considered. In particular, it is unclear whether maximizing the overall number of rejections is appropriate in order to maximize power. Other recent works (Li and Barber, 2016; Ignatiadis and Huber, 2017; Lei and Fithian, 2018) suggest weighting methods (with additional steps or different threshold computing rules), but they do not address the power question theoretically either.
In this paper, we present a general solution to the problem of optimal data-driven weighting of the BH procedure in the case of grouped null hypotheses. The new class of procedures is called ADDOW (for Adaptive Data-Driven Optimal Weighting). It relies on the computation of weights that maximize the number of detections at any rejection threshold, combined with the application of a step-up procedure with those weights. This is similar to IHW; however, by taking a larger weight space thanks to the use of estimators of the true null proportion in each group, we allow for larger weights, hence more detections. Under mild assumptions, we show that ADDOW asymptotically controls the FDR and has optimal power among all weighted step-up procedures. Interestingly, our study shows that the heterogeneity with respect to the proportion of true nulls should be taken into account in order to attain optimality. This fact seems to have been ignored so far: for instance, we show that IHW has optimality properties when the true nulls are evenly distributed across groups, but we also show with a numerical counterexample that its performance can quickly deteriorate otherwise.
In Section 2, we present the mathematical model and assumptions. In Section 3, we define what a weighted step-up procedure is and discuss some procedures of the literature. In Section 4, we introduce ADDOW. Section 5 provides our main theoretical results. Our numerical simulations are presented in Section 6, while the overfitting problem is discussed in Section 7 with the introduction of a variant of ADDOW. We conclude in Section 8 with a discussion. The proofs of the two main theorems are given in Section 9 and more technical results are deferred to the appendix. Let us underline that an effort has been made to make the proofs as short and concise as possible, while keeping them as clear as possible.
Throughout the paper, the probability space is denoted $(\Omega, \mathcal{A}, \mathbb{P})$. The notations $\xrightarrow{\text{a.s.}}$ and $\xrightarrow{\mathbb{P}}$ stand for almost sure convergence and convergence in probability, respectively.

Model
We consider the following stylized grouped p-value model. Let $G \ge 2$ be the number of groups; let us emphasize that $G$ is kept fixed throughout the paper. Because our study is asymptotic in the number of tests $m$, for each $m$ we assume that we test $m_g$ hypotheses in group $g \in \{1, \dots, G\}$, where the $m_g$ are non-decreasing integer sequences depending on $m$ (the dependence is not written for conciseness) and such that $\sum_{g=1}^{G} m_g = m$. In each group $g \in \{1, \dots, G\}$, let $H_{g,1}, \dots, H_{g,m_g}$ be binary variables corresponding to the null hypotheses to be tested in this group, with $H_{g,i} = 0$ if the null is true and $H_{g,i} = 1$ otherwise. Consider in addition $p_{g,1}, \dots, p_{g,m_g}$, some random variables in $[0, 1]$, where each $p_{g,i}$ is the p-value testing $H_{g,i}$. Note also $m_{g,1} = \sum_{i=1}^{m_g} H_{g,i}$ the number of false nulls and $m_{g,0} = m_g - m_{g,1}$ the number of true nulls in group $g$.
We make the following marginal distributional assumptions on the $p_{g,i}$.

Assumption 2.1. If $H_{g,i} = 0$, $p_{g,i}$ is uniformly distributed on $[0, 1]$.

Assumption 2.2. If $H_{g,i} = 1$, $p_{g,i}$ follows a common distribution with c.d.f. $F_g$, which is strictly concave on $[0, 1]$.
In particular, note that the p-values are assumed to have the same alternative distribution within each group. Note that the concavity assumption is mild (and implies continuity, as proven in Lemma A.1 for completeness). Furthermore, by concavity, $x \mapsto \frac{F_g(x) - F_g(0)}{x - 0}$ has a right limit at $0$ that we denote by $f_g(0^+)$, and $x \mapsto \frac{F_g(x) - F_g(1)}{x - 1}$ has a left limit at $1$ that we denote by $f_g(1^-)$.

Assumption 2.3. There exist $\pi_g > 0$ and $\pi_{g,0} > 0$ such that, for all $g$, $m_g/m \to \pi_g$ and $m_{g,0}/m_g \to \pi_{g,0}$ as $m \to \infty$. Additionally, for each $g$, $\pi_{g,1} = 1 - \pi_{g,0} > 0$.

The above assumption means that, asymptotically, no group, and no proportion of signal or sparsity, is vanishing. We denote by $\pi_0 = \sum_g \pi_g \pi_{g,0}$ the mean of the $\pi_{g,0}$'s and denote by (ED) the particular case where the nulls are evenly distributed across groups:
$$\pi_{g,0} = \pi_0, \quad 1 \le g \le G. \quad \text{(ED)}$$
Let us finally specify assumptions on the joint distribution of the p-values.
Assumption 2.4. The p-values are weakly dependent within each group: for each $g$ and all $x \in [0, 1]$,
$$m_g^{-1} \sum_{i \,:\, H_{g,i}=0} \mathbf{1}\{p_{g,i} \le x\} \xrightarrow{\mathbb{P}} \pi_{g,0}\, x, \quad (2.1)$$
$$m_g^{-1} \sum_{i \,:\, H_{g,i}=1} \mathbf{1}\{p_{g,i} \le x\} \xrightarrow{\mathbb{P}} \pi_{g,1}\, F_g(x). \quad (2.2)$$

This assumption is mild and classical, see Storey, Taylor and Siegmund (2004). Note that weak dependence is trivially achieved if the p-values are independent, and that no assumption is made on the p-value dependence across groups. Finally, note that there is a hidden dependence in $m$ in the joint distribution of the p-values $(p_{g,i})_{1 \le g \le G,\, 1 \le i \le m_g}$, but this does not impact the rest of the paper as long as (2.1) and (2.2) are satisfied.

$\pi_{g,0}$ estimation
Assumption 2.5. For each $g$, we have at hand an (over-)estimator $\hat\pi_{g,0} \in (0, 1]$ of $m_{g,0}/m_g$ such that $\hat\pi_{g,0} \xrightarrow{\mathbb{P}} \bar\pi_{g,0}$ for some $\bar\pi_{g,0} \ge \pi_{g,0}$. Let also $\bar\pi_0 = \sum_g \pi_g \bar\pi_{g,0}$.

In the model of Section 2.1, this assumption can be fulfilled by using the estimators introduced in Storey, Taylor and Siegmund (2004): for a given parameter $\lambda \in (0, 1)$, let
$$\hat\pi_{g,0}(\lambda) = \frac{\frac{1}{m_g}\left(1 + \sum_{i=1}^{m_g} \mathbf{1}\{p_{g,i} > \lambda\}\right)}{1 - \lambda} \wedge 1, \quad (2.3)$$
(the $\frac{1}{m_g}$ applied to the $1$ is here just to ensure $\hat\pi_{g,0}(\lambda) > 0$). It is easy to deduce from (2.1) and (2.2) that $\frac{1}{m_g} \sum_{i=1}^{m_g} \mathbf{1}\{p_{g,i} \le \lambda\} \xrightarrow{\mathbb{P}} \pi_{g,0}\lambda + \pi_{g,1} F_g(\lambda)$, which provides our condition:
$$\hat\pi_{g,0}(\lambda) \xrightarrow{\mathbb{P}} \left(\pi_{g,0} + \pi_{g,1}\frac{1 - F_g(\lambda)}{1 - \lambda}\right) \wedge 1 \;\ge\; \pi_{g,0}.$$
While $(\bar\pi_{g,0})_g$ is left arbitrary in our setting, some particular cases will be of interest in the sequel. The first is the Evenly Estimated case (EE), where
$$\bar\pi_{g,0} = \bar\pi_0, \quad 1 \le g \le G. \quad \text{(EE)}$$
In that case, our estimators all share the same limit, and in doing so they do not take into account the heterogeneity with respect to the proportion of true nulls. Case (EE) is relevant when the proportion of true nulls is homogeneous across groups, that is, when (ED) holds. A particular subcase of (EE) is the Non Estimation case (NE), where
$$\hat\pi_{g,0} = 1, \quad 1 \le g \le G. \quad \text{(NE)}$$
Case (NE) is basically the case where no estimation is intended, and the estimators are simply taken equal to $1$. Let us also introduce the Consistent Estimation case (CE), for which the estimators $\hat\pi_{g,0}$ are assumed to be all consistent:
$$\bar\pi_{g,0} = \pi_{g,0}, \quad 1 \le g \le G. \quad \text{(CE)}$$
While this corresponds to a favorable situation, this assumption can be met in classical settings where $f_g(1^-) = 0$ and $\lambda = \lambda_m$ tends to $1$ slowly enough in definition (2.3), see Lemma A.2 in Section A. The condition $f_g(1^-) = 0$ is called "purity" in the literature. It was introduced in Genovese and Wasserman (2004) and then deeply studied, along with the convergence of Storey estimators, in Neuvial (2013).
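The Storey-type estimator of Equation (2.3) is straightforward to implement. Below is a minimal sketch; the capping at 1 matches the range required by Assumption 2.5, and all parameter names and simulated data are illustrative:

```python
import numpy as np

def storey_estimator(pvals, lam=0.5):
    """(Over-)estimate the proportion of true nulls in one group.

    Sketch of the Storey-type estimator of Equation (2.3); the +1 in the
    numerator ensures the estimate stays positive, the cap keeps it <= 1.
    """
    m_g = len(pvals)
    return min(1.0, (1.0 + np.sum(pvals > lam)) / (m_g * (1.0 - lam)))

rng = np.random.default_rng(0)
# Toy group: 80% true nulls (uniform p-values), 20% signal (Beta(0.1, 1)).
p = np.concatenate([rng.uniform(size=800), rng.beta(0.1, 1.0, size=200)])
print(storey_estimator(p))  # close to 0.8, slightly above (over-estimation)
```

As expected from the limit computed above, the estimator overshoots $\pi_{g,0}$ by the term $\pi_{g,1}(1 - F_g(\lambda))/(1 - \lambda)$, which vanishes only when the alternative p-values put no mass above $\lambda$.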
Finally, the main case of interest is the Multiplicative Estimation case (ME), defined as follows:
$$\exists C \ge 1, \quad \bar\pi_{g,0} = C\,\pi_{g,0}, \quad 1 \le g \le G. \quad \text{(ME)}$$
Note that the constant $C$ above cannot depend on $g$. Interestingly, the (ME) case covers the (CE) case (with $C = 1$) and also the case where (ED) and (EE) both hold (with $C = \bar\pi_0/\pi_0$). So the (ME) case can be viewed as a generalization of the previous cases.

Criticality
Depending on the choice of α, multiple testing procedures may make no rejection at all when $m$ tends to $\infty$. This case is not interesting, and we shall focus on the other case. To this end, Chi (2007) introduced the notion of criticality: he defined a critical alpha level, denoted $\alpha^*$, such that the BH procedure has no asymptotic power if $\alpha < \alpha^*$. Neuvial (2013) generalized this notion to any multiple testing procedure (see Section 2.5 therein) and also established a link between criticality and purity.
In Section A, Definition A.1, we define $\alpha^*$ in our heterogeneous setting, and our results will focus on the supercritical case.
Assumption 2.6. The target level α lies in $(\alpha^*, 1)$. Lemma A.3 states that $\alpha^* < 1$, so such an α always exists.

While the formal definition of $\alpha^*$ is deferred to the appendix for the sake of clarity, let us emphasize that it depends on the parameters of the model, that is, $(F_g)_g$, $(\pi_g)_g$ and $(\pi_{g,0})_g$, and on the parameters of the chosen estimators, that is, $(\bar\pi_{g,0})_g$.

Leading example
While our framework allows a general choice for F g , a canonical example that we have in mind is the Gaussian one-sided framework where the p-values are derived from Gaussian test statistics.
In this case, Assumption 2.1 is fulfilled, $F_g$ is strictly concave, and hence Assumption 2.2 is also fulfilled. Furthermore, we easily check that $f_g(0^+) = \infty$ and $f_g(1^-) = 0$, which means that this framework is supercritical ($\alpha^* = 0$, see Definition A.1) with purity, and can then achieve consistent estimation (CE) under additional independence assumptions. Two particular subcases of interest arise when the covariance matrix Σ of the test statistics is block diagonal, $\Sigma = \mathrm{diag}(\Sigma^{(1)}, \dots, \Sigma^{(G)})$, where $\Sigma^{(g)}$ is a square matrix of size $m_g$. The first subcase is when $\Sigma^{(g)}$ is the identity matrix: the p-values are then all independent and Assumption 2.4 is fulfilled by the strong law of large numbers. The second subcase is when $\Sigma^{(g)}$ is a Toeplitz matrix with $\Sigma^{(g)}_{j,k} = \frac{1}{|j-k|+1}$. In this case, Assumption 2.4 is also fulfilled (see e.g. Delattre and Roquain, 2016, Proposition 2.1, Equation (LLN-dep) and Theorem 3.1).
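The leading example can be simulated directly. The sketch below draws one group of one-sided Gaussian p-values under the independent subcase; the parameter values (`m_g`, `pi_g0`, `mu_g`) are illustrative choices, not values from the paper:

```python
import numpy as np
from math import erf, sqrt

def phi(x):
    # Standard normal c.d.f.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(1)

def simulate_group(m_g, pi_g0, mu_g):
    """One group of one-sided Gaussian p-values (hypothetical parameters).

    True nulls: X ~ N(0,1); false nulls: X ~ N(mu_g, 1); p = 1 - Phi(X),
    so null p-values are uniform on [0, 1] and F_g is strictly concave.
    """
    H = (rng.uniform(size=m_g) > pi_g0).astype(int)   # 1 = false null
    X = rng.normal(loc=mu_g * H, scale=1.0)
    p = np.array([1.0 - phi(x) for x in X])
    return p, H

p, H = simulate_group(m_g=1000, pi_g0=0.8, mu_g=2.0)
# Null p-values are uniform; alternative p-values concentrate near 0.
print(p[H == 0].mean(), p[H == 1].mean())
```

The first printed mean is close to 1/2 (uniform nulls), the second is much smaller, reflecting the concentration of the alternative distribution near 0.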

Criterion
The set of indices corresponding to true nulls is denoted by $\mathcal{H}_0$, that is, $(g, i) \in \mathcal{H}_0$ if and only if $H_{g,i} = 0$, and we also denote $\mathcal{H}_1 = \mathcal{H}_0^c$. In this paper, we define a multiple testing procedure $R$ as the set of indices that are rejected: $p_{g,i}$ is rejected if and only if $(g, i) \in R$. The False Discovery Proportion (FDP) of $R$, denoted FDP($R$), is defined as the number of false discoveries divided by the number of rejections if there are any, and $0$ otherwise:
$$\mathrm{FDP}(R) = \frac{|R \cap \mathcal{H}_0|}{|R| \vee 1}.$$
We denote by $\mathrm{FDR}(R) = \mathbb{E}[\mathrm{FDP}(R)]$ the FDR of $R$. Its power, denoted Pow($R$), is defined as the mean number of true positives divided by $m$:
$$\mathrm{Pow}(R) = \frac{\mathbb{E}\,|R \cap \mathcal{H}_1|}{m}.$$
Note that our power definition is slightly different from the usual one, for which the number of true discoveries is divided by $m_1 = \sum_g m_{g,1}$ instead of $m$. This simplifies our expressions (see Section 9.1) and has no repercussion, because the two definitions differ only by a multiplicative factor converging to $1 - \pi_0 \in (0, 1)$ when $m \to \infty$.
Finally, let us emphasize that the power is the (rescaled) number of good rejections, that is, the number of rejected hypotheses that are false nulls. The power is different from the total number of rejections; this distinction is fundamental and will be discussed throughout this paper (for example, when discussing Heuristic 3.1, or in the simulations of Section 6.4).
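For concreteness, the FDP and the rescaled power of a rejection set can be computed as follows; the toy rejection pattern is ours:

```python
import numpy as np

def fdp(rejected, H):
    """False Discovery Proportion: false rejections over rejections (0 if none)."""
    n_rej = rejected.sum()
    if n_rej == 0:
        return 0.0
    return float((rejected & (H == 0)).sum()) / n_rej

def pow_rescaled(rejected, H, m):
    """Power as defined above: true discoveries divided by m
    (not by m_1, following the paper's rescaled convention)."""
    return float((rejected & (H == 1)).sum()) / m

H = np.array([0, 0, 1, 1, 0, 1])                       # 0 = true null
rejected = np.array([True, False, True, True, False, False])
print(fdp(rejected, H))              # 1 false rejection out of 3
print(pow_rescaled(rejected, H, 6))  # 2 true discoveries out of m = 6
```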

Weighting the BH procedure
Say we want to control the FDR at level α. Assuming that the p-values are arranged in increasing order $p_{(1)} \le \dots \le p_{(m)}$, with $p_{(0)} = 0$, the classic BH procedure consists in rejecting all $p_{g,i} \le \alpha \hat k/m$, where $\hat k = \max\{k \ge 0 : p_{(k)} \le \alpha k/m\}$. For a nondecreasing function $h$ defined on $[0, 1]$ such that $h(0) = 0$ and $h(1) \le 1$, we denote $I(h) = \sup\{u \in [0, 1] : h(u) \ge u\}$. Some properties of the functional $I(\cdot)$ are gathered in Lemma A.4; in particular, $h(I(h)) = I(h)$. We now reformulate BH with the use of $I(\cdot)$, because it is more convenient when dealing with asymptotics. Doing so, we follow the formalism notably used in Roquain and van de Wiel (2009) and Neuvial (2013). Define the empirical function
$$\hat G(u) = m^{-1} \sum_{g,i} \mathbf{1}\{p_{g,i} \le \alpha u\}, \quad u \in [0, 1].$$
Then BH rejects all $p_{g,i} \le \alpha I(\hat G)$; this is a particular case of Lemma A.5. Note that $\hat G(u)$ is simply the number of p-values that are less than or equal to $\alpha u$, divided by $m$.
The graphical representation of the two points of view on BH is depicted in Figure 1 with $m = 10$. The p-values are plotted on the right part of the figure along with the function $k \mapsto \alpha k/m$, and we see that the last p-value under the line is the sixth one. On the left, the function $\hat G$ corresponding to these p-values is displayed alongside the identity function, with the last crossing point located between the sixth and seventh jumps; thus $I(\hat G) = 6/m$ and 6 p-values are rejected.
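The equivalence between the two formulations of BH (the step-up over sorted p-values and the last crossing point $I(\hat G)$) can be checked numerically. A minimal sketch, with an illustrative mixture of null and signal p-values:

```python
import numpy as np

def bh_threshold_index(pvals, alpha):
    """Classic step-up: k_hat = max{k : p_(k) <= alpha*k/m} (0 if none)."""
    m = len(pvals)
    p_sorted = np.sort(pvals)
    below = np.nonzero(p_sorted <= alpha * np.arange(1, m + 1) / m)[0]
    return 0 if below.size == 0 else below[-1] + 1

def bh_via_crossing(pvals, alpha):
    """Same procedure via I(G_hat): the last crossing point of
    G_hat(u) = #{p <= alpha*u}/m with the identity, on the grid u = k/m."""
    m = len(pvals)
    G = lambda u: np.sum(pvals <= alpha * u) / m
    for k in range(m, -1, -1):   # scan from the top: I is a supremum
        if G(k / m) >= k / m:
            return k

rng = np.random.default_rng(2)
p = np.concatenate([rng.uniform(size=90), rng.beta(0.05, 1.0, size=10)])
k1, k2 = bh_threshold_index(p, 0.1), bh_via_crossing(p, 0.1)
print(k1, k1 == k2)  # the two formulations reject the same number
```

Both functions return the same $\hat k$, since $\#\{p_{g,i} \le \alpha k/m\} \ge k$ holds exactly when $p_{(k)} \le \alpha k/m$.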
The weighted BH procedure (WBH) with weight vector $w \in \mathbb{R}_+^G$ is defined by computing
$$\hat G_w(u) = m^{-1} \sum_{g,i} \mathbf{1}\{p_{g,i} \le \alpha u\, w_g\}$$
and rejecting all $p_{g,i} \le \alpha I(\hat G_w)\, w_g$. We denote it WBH($w$). Note that $w$ is allowed to be random, hence it can be computed from the p-values. In particular, BH = WBH($\mathbf{1}$), where $\mathbf{1} = (1, \dots, 1) \in \mathbb{R}_+^G$. Following Roquain and van de Wiel (2009), to deal with optimal weighting we need to further generalize WBH into a multi-weighted BH (MWBH) procedure by introducing a weight function $W : [0, 1] \to \mathbb{R}_+^G$, which can be random, such that the function
$$\hat G_W(u) = m^{-1} \sum_{g,i} \mathbf{1}\{p_{g,i} \le \alpha u\, W_g(u)\}$$
is nondecreasing. Letting $\hat u_W = I(\hat G_W)$, MWBH($W$) rejects all $p_{g,i} \le \alpha \hat u_W W_g(\hat u_W)$. MWBH is a generalization of the class of WBH procedures, because for any weight vector $w$, $w$ can be seen as a constant weight function $u \mapsto w$ and $\hat G_w$ is nondecreasing. Note that there is a simple way to compute $\hat u_W$: for each $r$ between $1$ and $m$, denote the $W(r/m)$-weighted p-values $p^{[r]}_{g,i} = p_{g,i}/W_g(r/m)$ (with the convention $p_{g,i}/0 = \infty$) and order them $p^{[r]}_{(1)} \le \dots \le p^{[r]}_{(m)}$; then $m\,\hat u_W$ is the largest $r$ such that $p^{[r]}_{(r)} \le \alpha r/m$ (this is Lemma A.5). As in previous works (see e.g. Genovese, Roeder and Wasserman, 2006, or Zhao and Zhang, 2014), in order to achieve a valid FDR control, these procedures should be used with weights that satisfy some specific constraints. The following weight spaces will be used in the rest of the paper:
$$K_{NE} = \Big\{w \in \mathbb{R}_+^G : \sum_g \tfrac{m_g}{m}\, w_g \le 1\Big\}, \qquad K = \Big\{w \in \mathbb{R}_+^G : \sum_g \tfrac{m_g}{m}\, \hat\pi_{g,0}\, w_g \le 1\Big\}.$$
Note that $K$ may appear unusual because it depends on the estimators $\hat\pi_{g,0}$; however, it is completely known and usable in practice. Some intuition about the choice of $K$ is given in the next section. Note also that $K = K_{NE}$ in the (NE) case. Finally, for a weight function $W$ and a rejection threshold $u \in [0, 1]$, we denote by $R_{u,W}$ the double-indexed procedure rejecting the p-values less than or equal to $\alpha u W_g(u)$ (so that $\hat G_W(u)$ is the number of rejections of $R_{u,W}$, divided by $m$), and MWBH($W$) can also be written as $R_{\hat u_W, W}$.
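A sketch of WBH($w$) with a fixed weight vector may help fix ideas; the groups, weights, and level below are illustrative, with $w$ chosen on the boundary of $K_{NE}$:

```python
import numpy as np

def wbh(pvals, groups, w, alpha):
    """Weighted BH (WBH(w)) with a fixed, positive weight vector w, one weight
    per group: step-up on the weighted p-values p_{g,i}/w_g, then reject all
    p_{g,i} <= alpha * (k_hat/m) * w_g.  (The paper's convention p/0 = +inf
    would handle zero weights; here we simply assume w_g > 0.)"""
    m = len(pvals)
    q = pvals / w[groups]                       # weighted p-values
    q_sorted = np.sort(q)
    below = np.nonzero(q_sorted <= alpha * np.arange(1, m + 1) / m)[0]
    k_hat = 0 if below.size == 0 else below[-1] + 1
    return q <= alpha * k_hat / m               # boolean rejection indicator

rng = np.random.default_rng(3)
# Two hypothetical groups: group 0 carries strong signal, group 1 is pure noise.
p = np.concatenate([rng.beta(0.05, 1.0, size=20), rng.uniform(size=80)])
g = np.concatenate([np.zeros(20, int), np.ones(80, int)])
# Weight constraint: (20/100)*2.0 + (80/100)*0.75 = 1, i.e. w is in K_NE.
w = np.array([2.0, 0.75])
print(wbh(p, g, w, 0.1).sum(), "rejections")
```

Up-weighting the small, signal-rich group enlarges its rejection threshold at the price of a smaller threshold for the noise group, which is exactly the trade-off the constraint on $K_{NE}$ encodes.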

Choosing the weights
Take $W$ and $u$ deterministic, and let $P_W^{(m)}(u) = \mathrm{Pow}(R_{u,W})$. We then have
$$P_W^{(m)}(u) = \sum_g \frac{m_{g,1}}{m}\, F_g\big(\alpha u\, W_g(u)\big).$$
Note that this relation is valid only if $W$ and $u$ are deterministic. In particular, it is not valid when used a posteriori with a data-driven weighting and $u = \hat u_W$.
In Roquain and van de Wiel (2009), the authors define the oracle optimal weight function $W^*_{or}$ as, for each $u \in [0, 1]$,
$$W^*_{or}(u) \in \underset{w \in K}{\mathrm{argmax}}\ \sum_g \pi_g \pi_{g,1} F_g(\alpha u\, w_g).$$
Note that they defined $W^*_{or}$ only in case (NE), but their definition easily extends to the general case as above, by replacing $K_{NE}$ by $K$. They proved the existence and uniqueness of $W^*_{or}$ when both (ED) and (NE) hold and that, asymptotically, MWBH($W^*_{or}$) controls the FDR at level $\pi_0 \alpha$ and has better power than every MWBH($w^{(m)}$) for deterministic weight vectors $w^{(m)} \in K_{NE}$ satisfying a convergence criterion.
However, computing $W^*_{or}$ requires the knowledge of the $F_g$, which is not available in practice, so the idea is to estimate $W^*_{or}$ by a data-driven weight function $\hat W^*$ and then apply MWBH with this random weight function. To this end, consider the functional defined, for any (deterministic) weight function $W$ and $u \in [0, 1]$, by
$$G_W^{(m)}(u) = \sum_g \frac{m_{g,0}}{m}\, \alpha u\, W_g(u) + \sum_g \frac{m_{g,1}}{m}\, F_g\big(\alpha u\, W_g(u)\big),$$
where $G_W^{(m)}(u)$ is the mean ratio of rejections of the procedure rejecting each $p_{g,i} \le \alpha u W_g(u)$: $P_W^{(m)}(u)$ is the rescaled mean number of true positives (i.e. the power) of this procedure, while $H_W^{(m)}(u) = G_W^{(m)}(u) - P_W^{(m)}(u)$ is the contribution of the true nulls. Zhao and Zhang (2014) proceed in two stages. In the first stage, they compute the thresholds $\hat u_{\hat w^{(1)}}$ and $\hat u_{\hat w^{(2)}}$ associated with the weight vectors $\hat w^{(1)}$ and $\hat w^{(2)}$ (the weight vectors of the ABH and HZZ procedures, see below), and let $\hat u_M = \max(\hat u_{\hat w^{(1)}}, \hat u_{\hat w^{(2)}})$. In the second stage, they maximize $\hat G_w(\hat u_M)$ over $K$, which gives rise to the weight vector $\hat W^*(\hat u_M)$ according to our notation. Then they define their procedures as
$$\text{Pro 1} = R_{\hat u_M,\, \hat W^*(\hat u_M)} \quad \text{and} \quad \text{Pro 2} = \mathrm{WBH}\big(\hat W^*(\hat u_M)\big).$$
Pro 2 comes from an additional step-up step compared to Pro 1, hence its rejection threshold, $\hat u_{\hat W^*(\hat u_M)}$, is larger than $\hat u_M$ and allows for more detections. The caveat of this approach is that the initial thresholding, that is, the definition of $\hat u_M$, seems somewhat arbitrary, which results in sub-optimal procedures, see Corollary 5.3. As a side remark, $\hat w^{(1)}$ and $\hat w^{(2)}$ are involved in other procedures of the literature. The HZZ procedure of Hu, Zhao and Zhou (2010) is WBH($\hat w^{(2)}$), and WBH($\hat w^{(1)}$) is the classical adaptive BH procedure (see e.g. Lemma 2 of Storey, Taylor and Siegmund (2004)), denoted here by ABH. Ignatiadis et al. (2016) actually used Heuristic 3.1 with multi-weighting (although their formulation differs from ours), which consists in maximizing $\hat G_w(u)$ in $w$ for each $u$. However, their choice of the weight space is only suitable for the case (NE) and can make Heuristic 3.1 break down, because in general $H_W^{(m)}(u)$ can still depend on $w$, see Remark 3.1 below. In the next section, we take the best of the two approaches to attain power optimality with data-driven weighting. Let us already mention that the crucial point is Lemma B.3, which fully justifies Heuristic 3.1, but only in case (ME). When (ME) does not hold, we must be aware that Heuristic 3.1 can fail, for the same reason that it can fail with IHW. Thereby, in general, more detections do not necessarily imply more power.
Remark 3.1. In particular, we can compute numerical counterexamples where BH has larger asymptotic power than IHW. For example, if we break (ED) by taking a small $\pi_{1,0}$ (almost pure signal) and a large $\pi_{2,0}$ (sparse signal), along with a small group and a large one ($\pi_1$ much smaller than $\pi_2$) and strong signal in both groups, we can achieve a larger power with BH than with IHW. Our interpretation is that, in that case, IHW slightly favors group 2 because of its size, whereas the oracle optimal weighting favors group 1 thanks to the knowledge of the true parameters. BH, by weighting uniformly, does not favor any group, which allows its power to end up between the power of the oracle and the power of IHW. This example is studied in Section 6.4 and illustrated in Figures 7 and 8.

Recent weighting methods
Besides IHW, there are several recent methods putting weights on p-values; we briefly discuss three of them. The first is a variation of IHW by the same authors, IHWc (Ignatiadis and Huber, 2017), where the letter 'c' stands for 'censoring'. The method brings two innovations to IHW. The first is the use of cross-weighting, thanks to a subdivision of the hypotheses into folds: the weights of the p-values of a fold are computed by only using the p-values of the other folds. This approach reduces overfitting since, during the step-up procedure, the information brought by a given p-value is used only once instead of twice. The second innovation is the censoring, where a threshold τ is fixed and only p-values larger than τ are used to compute the weights, while only p-values smaller than τ can be rejected during the step-up. Together, these innovations allow IHWc to control the FDR in finite samples at level α if the p-values associated with true nulls are independent. However, using only large p-values to compute the weights seems somewhat counterintuitive: large p-values are likely to be associated with true nulls and to be uniform, so they won't allow the weights to properly discriminate between the groups and to increase the power compared to BH. We will verify this intuition in Section 6.3. Finally, it is worth noting that IHWc allows for a kind of $\pi_{g,0}$ estimation à la Storey, with a variant called IHWc-Storey.
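To illustrate the mechanics of censoring (not the actual IHWc algorithm, which adds cross-weighting and a more refined weight optimization), here is a toy rule that derives group weights from the p-values above a threshold τ only; the specific weighting rule is hypothetical:

```python
import numpy as np

def censored_group_weights(pvals, groups, n_groups, tau):
    """Toy censored weighting: weights computed only from p-values larger than
    tau. The rule below (inverse fraction of large p-values per group,
    normalized into K_NE) is a hypothetical illustration, not IHWc's rule."""
    m = len(pvals)
    w = np.empty(n_groups)
    for g in range(n_groups):
        in_g = groups == g
        # A group with few large p-values likely carries more signal.
        frac_large = (pvals[in_g] > tau).sum() / max(in_g.sum(), 1)
        w[g] = 1.0 / max(frac_large, 1e-3)
    m_g = np.bincount(groups, minlength=n_groups)
    return w / ((m_g / m) * w).sum()  # normalize: sum_g (m_g/m) w_g = 1

rng = np.random.default_rng(5)
p = np.concatenate([rng.beta(0.05, 1.0, size=50), rng.uniform(size=50)])
g = np.concatenate([np.zeros(50, int), np.ones(50, int)])
w = censored_group_weights(p, g, 2, tau=0.1)
print(w)  # the signal-rich group receives the larger weight
```

With strong signal, the censored counts still discriminate the groups; the concern raised above is that under weak signal the censored p-values look uniform in every group, leaving little for the weights to exploit.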
The censoring idea originates from the Structure Adaptive BH Algorithm (SABHA, Li and Barber, 2016), which has a group-structured version with an FDR bounded by αC for a known constant C > 1 when the p-values are independent. Hence, applying the group-structured SABHA at level α/C gives FDR control at level α, but using a target level smaller than α can induce conservatism, especially since computing the weights only with the large p-values involves the same risks that we highlighted when discussing IHWc.
Lastly, AdaPT (Lei and Fithian, 2018) introduces threshold surfaces $s_t(x)$ that can be considered as weights and adapted to the group setting. AdaPT is not a WBH procedure; its whole philosophy is different and relies on symmetry properties of the true null distribution of the p-values, by using an estimator of the FDP different from the one implicitly used in BH-like methods, which also relies on symmetry and allows masking p-values during the procedure (see also Barber et al., 2015, and Arias-Castro et al., 2017, for more details on this pioneering paradigm). We won't further consider AdaPT, because of its fundamental differences with WBH procedures and because we are mainly interested in optimality among said WBH procedures.

New procedure: ADDOW
We exploit Heuristic 3.1 and propose to estimate the oracle optimal weights $W^*_{or}$ by maximizing in $w \in K$ the empirical counterpart to $G_w^{(m)}(u)$, that is, $\hat G_w(u)$.

Definition 4.1. We call an adaptive data-driven optimal weight function any random function $\hat W^* : [0, 1] \to K$ such that, for all $u \in [0, 1]$,
$$\hat W^*(u) \in \underset{w \in K}{\mathrm{argmax}}\ \hat G_w(u).$$

Such a maximum is guaranteed to exist because $\{\hat G_w(u), w \in K\}$ is a finite set: indeed, it is a subset of $\{k/m, k \in \{0, \dots, m\}\}$. However, for a given $u$, $\hat W^*(u)$ may not be uniquely defined, hence there is no unique optimal weight function $\hat W^*$ in general. So, in all the following, we fix a certain $\hat W^*$; our results do not depend on this choice. An important fact is that $\hat G_{\hat W^*}$ is nondecreasing (see Lemma A.6), so $\hat u_{\hat W^*}$ exists and the corresponding MWBH procedure is well-defined.

Definition 4.2. The ADDOW procedure is the MWBH procedure using $\hat W^*$ as the weight function, that is, ADDOW = MWBH($\hat W^*$).
One shall note that ADDOW is in fact a class of procedures, depending on the estimators $\hat\pi_{g,0}$ through $K$. Its rationale is similar to IHW in that we intend to maximize the number of rejections, but incorporating the estimators $\hat\pi_{g,0}$ allows for larger weights and more detections. Finally, note that, in the (NE) case, ADDOW reduces to IHW.

Remark 4.1. It turns out that ADDOW is equal to a certain WBH procedure; this comes from part 2 of the proof of Theorem 5.2 and Remark 9.2. Moreover, to every MWBH procedure corresponds a WBH procedure with power higher or equal. This fact does not limit the interest of the MWBH class, because computing the dominating WBH procedure of a given MWBH($\hat W$) procedure requires the knowledge of the step-up threshold $\hat u_{\hat W}$, which is only known by actually computing MWBH($\hat W$).
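ADDOW's exact maximization of $\hat G_w(u)$ over $K$ is an optimization problem; as a rough illustration only, the sketch below replaces $K$ by a finite random set of candidate weight vectors on its boundary and then applies the step-up. This is not the paper's algorithm, just a toy approximation:

```python
import numpy as np

def addow_sketch(pvals, groups, pi0_hat, alpha, n_candidates=200, seed=0):
    """Toy ADDOW: maximize the number of rejections G_hat_w(u) over a finite
    random subset of the weight space K (so only an approximation of the true
    maximizer), then apply the step-up with the selected weights."""
    rng = np.random.default_rng(seed)
    m = len(pvals)
    n_groups = pi0_hat.size
    m_g = np.bincount(groups, minlength=n_groups)
    # Random candidate weights, rescaled onto the boundary of K:
    # sum_g (m_g / m) * pi0_hat_g * w_g = 1.
    W = rng.uniform(0.05, 1.0, size=(n_candidates, n_groups))
    W /= (W * (m_g / m) * pi0_hat).sum(axis=1, keepdims=True)

    def n_rej(w, u):
        return np.sum(pvals <= alpha * u * w[groups])

    # Step-up: the largest u = k/m with max_w G_hat_w(u) >= u.
    for k in range(m, 0, -1):
        u = k / m
        best = max(W, key=lambda w: n_rej(w, u))
        if n_rej(best, u) >= k:
            return pvals <= alpha * u * best[groups]
    return np.zeros(m, dtype=bool)

rng = np.random.default_rng(4)
p = np.concatenate([rng.beta(0.05, 1.0, size=30), rng.uniform(size=70)])
g = np.concatenate([np.zeros(30, int), np.ones(70, int)])
pi0_hat = np.array([0.4, 1.0])  # hypothetical (over-)estimates of pi_{g,0}
rej = addow_sketch(p, g, pi0_hat, alpha=0.1)
print(rej.sum(), "rejections")
```

Note how the constraint involves $\hat\pi_{g,0}$: a group with a small estimated null proportion can receive a weight well above the $K_{NE}$ budget, which is precisely the extra room that distinguishes ADDOW from IHW.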

Main results
We now present the two main theorems of this paper. Both are asymptotic and justify the use of ADDOW when $m$ is large. The first is the control of the FDR at level at most α. The second shows that ADDOW has maximal power over all MWBH procedures in the (ME) case. Both are proven in Section 9.

Relation to IHW
Recall that ADDOW reduces to IHW in the (NE) case, that (NE) is a subcase of (EE), and that when both (EE) and (ED) hold then (ME) is achieved. Hence, as a byproduct, we deduce from Theorems 5.1 and 5.2 the following result on IHW. Compared to the FDR control established in Ignatiadis et al. (2016) (under slightly stronger assumptions on the smoothness of the $F_g$'s), the FDR controlling result of Corollary 5.1 gives a slightly sharper bound ($\pi_0 \alpha$ instead of α) in the (ED) case.
The power optimality stated in Corollary 5.1 is new and was not shown in Ignatiadis et al. (2016).It thus supports the fact that IHW should be used under the assumption (ED) and when π 0 is close to 1 or not estimated.

Comparison to other existing procedures
For any estimators $\hat\pi_{g,0} \in [0, 1]$, any weighting satisfying $\sum_g \frac{m_g}{m} w_g \le 1$ also satisfies $\sum_g \frac{m_g}{m} \hat\pi_{g,0} w_g \le 1$, that is, $K_{NE} \subset K$. Hence, any MWBH procedure with weights in $K_{NE}$ is also an MWBH procedure with weights in $K$. The next corollary simply states that ADDOW outperforms many procedures of the "weighting with $\pi_0$ adaptation" literature.
The results for Pro2, HZZ and ABH follow directly from Theorem 5.2 because these are MWBH procedures.The proof for Pro1 (which is not of the MWBH type) can be found in Section D.

Numerical experiments
6.1. Simulation setting
The FDR analysis and power analysis of Sections 6.2 and 6.3 are conducted using simulations whose setting we describe here. Section 6.4 presents a counterexample using its own setting.
Our experiments have been performed using the four following scenarios; each simulation of each scenario is replicated 1000 times.
• Scenario 4: $\mu_1 = \bar\mu$ and $\mu_2 = 0.01$, and the dependence follows the Toeplitz pattern described at the end of Section 2.4.
In each scenario, three groups of procedures are compared; the difference between the three groups lies in the way $\pi_0$ is estimated. Group 1 corresponds to the (NE) case: $\hat\pi_{g,0} = 1$. Group 2 corresponds to the (CE) case, with an oracle estimator: $\hat\pi_{g,0} = \pi_{g,0}$. Group 3 uses the Storey estimator $\hat\pi_{g,0}(1/2)$ defined in Equation (2.3); we choose λ = 1/2 as it is a standard value (see e.g. Storey, 2002). The compared procedures are the following:
• ABH as defined in Section 3.2 (which is BH in Group 1),
• HZZ as defined in Section 3.2 (except in Group 1, where it is not defined),
• Pro2 as defined in Section 3.2 (for Group 1, we only use the BH threshold),
• ADDOW (which is equal to IHW in Group 1),
• an oracle ADDOW, which is the MWBH procedure using the oracle weights $W^*_{or}$ given by Equation (3.5) (only in Groups 1 and 2),
• IHWc (only in Groups 1 and 3); the version of IHWc used in Group 3 is IHWc-Storey.
For IHWc, the censoring level chosen is the default of the IHW R package, that is α.
In the following, only plots of scenarios 1 and 3 are shown, as the situation with Toeplitz dependence is found to be similar to the independent case, up to a slight increase of the FDR of most of the procedures.

FDR control
The FDR of all the above procedures is compared in Figures 2 and 3. In scenario 1, we can distinguish two different regimes depending on the signal strength. For $\bar\mu \ge 1$, the signal strength is not weak in either group (from $\mu_1 = \bar\mu$ and $\mu_2 = 2\bar\mu$), and the FDR is controlled at level α for all procedures of Groups 2 and 3 except ADDOW and Pro2, the two procedures using the data-driven weights $\hat W^*$. In particular, Oracle ADDOW in Group 2 controls the FDR at level α. As the data-driven weights converge to the oracle weights (see Lemma C.4), we get an illustration of Theorem 5.1 in the (CE) case. The situation is similar for Group 1 and level $\pi_0 \alpha$, except for Oracle ADDOW, which controls the FDR only for $\bar\mu \ge 2$.
The situation gets more confused when the signal is weak ($\bar\mu < 1$). The FDR of ADDOW (in each group) is largely inflated. The FDR control at level α also fails sometimes for Oracle ADDOW, Pro2, ABH and HZZ (only in Group 2).
In scenario 3, one group always has weak signal. The FDR inflation of ADDOW (in each group) and of Group 2 is worse for small μ, whereas, for large μ, the situation is similar to scenario 1, up to one exception: the FDR of ABH and IHWc in Group 3 does not reach α as it did in scenario 1, which suggests some sort of conservatism.
In both scenarios, procedures of Group 2 have a larger FDR than their equivalent in Group 3, which in turn have larger FDR than in Group 1.
As a side note, in both scenarios, and both Groups 1 and 3, the FDR plots of IHWc and ABH are nearly indistinguishable.
In both settings regarding μ (large or small), procedures based on W * suffer from some sort of overfitting causing a loss of FDR control. This is discussed in Section 7 with an attempt to stabilize the weights. Let us underline that this does not contradict Theorem 5.1, because a small µ g might imply a slower convergence rate while m stays below 10^4 in our setting.

Power analysis
Now that the FDR control has been studied, let us compare the procedures in terms of power. First, to better emphasize the benefit of adaptation, the power is rescaled in the following way: we define the normalized difference of power with respect to BH, or DiffPow, by DiffPow(R) = (Pow(R) − Pow(BH)) / Pow(BH) for any procedure R.
Figures 4 and 5 display the power of all the procedures defined in Section 6.1. Figures 6a and 6b display only a subset of them in scenario 1, for clarity. We can make several observations: • In both scenarios, procedures of Group 2 are more powerful than their equivalents in Group 3, which are better than in Group 1 (up to one exception, see the next point), see e.g. Figure 6a.
In particular, the difference between Group 2 and Group 1 is huge. This illustrates the importance of incorporating the knowledge of π 0 to improve power. • In scenario 3, HZZ is largely better in Group 3 than in Group 2. Our interpretation is that the signal is so weak in the second group of p-values that the estimator π2,0 (1/2) is close to one, while π1,0 (1/2) stays close to π 1,0 . Hence ŵ(2) 1 in Group 3 is larger than ŵ(2) 1 in Group 2, which allows for more good discoveries. The drawback of a smaller ŵ(2) 2 is not a real one, since the signal is so small that it is impossible to detect no matter the weight. Recall that ŵ(2) is defined in Section 3.2.
• In every Group (that is, for any choice of πg,0 ), and for both scenarios, ADDOW achieves the best power (see e.g. Figure 6b), which supports Theorem 5.2. Additionally, maybe surprisingly, Pro2 behaves quite well, with a power close to that of ADDOW (sometimes larger than Oracle ADDOW), despite its theoretical sub-optimality. • Inside Group 2 or Group 3, and for both scenarios, comparing ABH and HZZ to ADDOW and Pro2 shows the benefit of adding the F g adaptation to the π 0 adaptation: ADDOW and Pro2 have better power than ABH and HZZ for all signals (see e.g. Figure 6b). In scenario 1, for Groups 2 and 3, we can see a zone of moderate signal (around μ = 1.5) where the two categories of procedures are close. That is the same zone where HZZ becomes better than ABH. We deduce that in that zone the optimal weighting is the same as the uniform ŵ(1) weighting of ABH. • The comparison of the DiffPow between, on the one hand, IHW and, on the other hand, ABH or HZZ from Group 2, in Figure 4, shows the difference between adapting only to the F g 's versus adapting only to π 0 . No method is generally better than the other: as we see in the plot, it depends on the signal strength. We also see that neither ABH nor HZZ is better than the other. • In scenario 1, for all signals, methods of Group 3 are close to their equivalents in Group 2, which indicates that using λ = 1/2 gives a good estimate of π g,0 in practice (see e.g. Figure 6a). Furthermore, the larger the signal, the closer the methods of Group 3 get to those of Group 2.
• In both scenarios, once again IHWc and ABH are nearly indistinguishable, which confirms the intuition given in Section 3.3 that IHWc performs badly in terms of power due to using only large p-values to compute the weights.See in particular how the power of IHW is larger than the power of IHWc (and even than the power of IHWc-Storey) in Figure 4.

Importance of (ME) for optimality results
We provide here a setting and a simulation where Corollary 5.1 fails because (ED) does not hold, to illustrate the importance of (ME) in Theorem 5.2 and in Theorem 5.1 (to get (5.2)).The setting is chosen according to what we sketched in Remark 3.1 and is the following.
We consider again the one-sided Gaussian framework described in Section 2.4 for G = 2 groups and independent p-values. The parameters are the same as in Section 6.1 and each simulation of each scenario is replicated 1000 times. We choose a large value for α (α = 0.7), which is unlikely to appear in practice but allows us to get our counterexample. We set m 1 = 1000 and m 2 = 9000, m 1,0 /m 1 = 0.05 and m 2,0 /m 2 = 0.85. So group 1 is small and has a lot of signal, while group 2 is large but has little signal. The signal strength is given by µ 1 = 2 and µ 2 = μ, with μ ∈ {1.7, 1.8, 1.9, 2, 2.1, 2.2, 2.3}, so the signal is strong and almost equal in both groups.
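One replication of this data-generating step can be sketched as follows. The paper's simulation code is not shown, so this is a plausible sketch under the stated parameters (function name ours; µ 2 fixed to 2 for illustration):

```python
import numpy as np
from statistics import NormalDist

def simulate_section_6_4(rng):
    """One replication of the two-group one-sided Gaussian setting.

    Sizes m_g, null proportions m_{g,0}/m_g and means mu_g follow the
    text: group 1 is small with a lot of signal, group 2 large with
    little signal. Returns p-values, group labels and null indicators.
    """
    sizes = (1000, 9000)     # m_1, m_2
    pi0 = (0.05, 0.85)       # m_{g,0} / m_g
    mu = (2.0, 2.0)          # mu_1 = 2; mu_2 = mu-bar (2 here)
    sf = lambda x: 1.0 - NormalDist().cdf(x)  # one-sided p-value
    pvals, group, null = [], [], []
    for g in range(2):
        m0 = int(round(pi0[g] * sizes[g]))   # number of true nulls
        means = np.concatenate([np.zeros(m0), np.full(sizes[g] - m0, mu[g])])
        x = rng.normal(loc=means, scale=1.0)
        pvals.extend(sf(v) for v in x)
        group.extend([g] * sizes[g])
        null.extend([True] * m0 + [False] * (sizes[g] - m0))
    return np.array(pvals), np.array(group), np.array(null)
```

Averaging the FDP and the proportion of rejected alternatives over replications of such draws gives the Monte Carlo FDR and power curves.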

FDR plot
Fig 7: FDR of ADDOW and BH against μ in the simulation of Section 6.4. The two solid lines are the α and π 0 α levels; the FDR of BH coincides with the π 0 α level. ADDOW in the (NE) case is given by the black triangles and ADDOW in the (CE) case is given by the red triangles.
The results are given as an FDR plot in Figure 7 and a DiffPow plot in Figure 8. In Figure 7, the FDR of BH is π 0 α as expected, and we see that the FDR of IHW is above that level, hence Equation (5.3) is violated. On a side note, we see that, thanks to a large m (10^4) and a rather strong signal, ADDOW in (CE) does not overfit, and we get an illustration of Equation (5.2) with C = 1.
Figure 8 is rather unequivocal and shows that, with our parameter choice, IHW has smaller power than BH (ADDOW in the (CE) case stays better, as expected), hence Equation (5.4) is violated. Let us recall our interpretation proposed in Remark 3.1: IHW favors the large and sparse second group of hypotheses, whereas the optimal power is achieved by favoring the small first group of hypotheses, which contains almost only signal. As a WBH procedure with weights (1, 1), BH does not favor any group. Figure 8 demonstrates the limitation of Heuristic 3.1 by providing a direct counterexample, and underlines the necessity of estimating the π g,0 when nothing suggests that (ED) may be met.

Overfitting phenomena
Since ADDOW uses the data both through the p-values and the weights, it suffers from an overfitting phenomenon where the FDR in finite samples is above the target level α, as we saw in Section 6.2. In our setting, if the signal is strong enough, this drawback is proved to vanish when m is large enough, see the simulations and Theorem 5.1. However, the latter is not true for weak signal: if the data are close to being random noise, the weight optimization leads ADDOW to assign its weighting budget at random, and giving large weights to the wrong groups increases the FDP. As said before, our intuition is that the overfitting is at least partly due to using each p-value twice in the step-up procedure of ADDOW: in the expression 1 {pg,i≤αu W * g (u)} , p g,i appears on both sides of the inequality because it is used to compute W * g (u). Following this, we propose a variation of ADDOW that uses the same cross-weighting trick as IHWc.
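For reference, the WBH step-up underlying these procedures, for a fixed weight vector, can be sketched as follows (our own minimal implementation, not the paper's code): it is equivalent to applying BH to the weighted p-values p i /w i .

```python
import numpy as np

def weighted_bh(pvals, weights, alpha=0.05):
    """Weighted BH (WBH) step-up with a fixed weight vector.

    Applies BH to p_i / w_i (convention p/0 = +inf): rejects the k
    smallest weighted p-values for the largest k such that the k-th one
    is at most alpha * k / m. Weights are assumed to satisfy the budget
    constraint sum(w) = m, as in the weight space K of the paper.
    """
    p, w = np.asarray(pvals, float), np.asarray(weights, float)
    m = p.size
    q = np.where(w > 0, p / np.maximum(w, 1e-300), np.inf)  # weighted p-values
    order = np.argsort(q)
    below = q[order] <= alpha * (np.arange(1, m + 1) / m)
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, bool)
    rejected[order[:k]] = True
    return rejected
```

ADDOW replaces the fixed weights by data-driven weights W *(u) computed from the same p-values, which is exactly the double use discussed above.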

The crADDOW variant
The main idea is to split the p-values into F folds, where F is some fixed integer ≥ 2, and to use only p-values of the remaining F − 1 folds to compute the weights assigned to the p-values of a given fold.The resulting procedure can be seen as a WBH procedure using F × G groups.
Formally, for each m we have a random fold-assignment function F m (mapping each index (g, i) to a fold in {1, . . ., F }), which simply means that the p-values of each group g are evenly distributed between the F folds. Some dependence assumptions are required: Assumption 7.1. The σ-algebra generated by (F m ) m and the σ-algebra generated by (p g,i ) (g,i),m are independent.
Assumption 7.2. Conditionally on (F m ) m , we have weak dependence (as in Assumption 2.4) inside each fold.
For each fold f ∈ {1, . . ., F }, we compute ADDOW −f , that is, ADDOW using only the p-values of the folds in {1, . . ., F } \ {f }. This is done by constructing the empirical function and then maximizing it in w ∈ K−f for each u ∈ [0, 1]. While this expression seems complicated, note that if F divides each m g , then |{1 ≤ i ≤ m g : F m (g, i) = f }| = m g /F for every fold f . The maximization provides a weight function W * −f and the MWBH procedure provides a step-up threshold û −f . Our ADDOW variant, named crADDOW for cross-ADDOW, is the WBH procedure which assigns the weight w * g,f to all p-values p g,i such that F m (g, i) = f . Now, in 1 {pg,i≤αuw * g,f } , p g,i is only used once. While we do not have a finite-sample result about crADDOW, we expect it to have a smaller FDR than ADDOW, especially for weak signal. We expect crADDOW to act as a stabilization of ADDOW and not to lose the good performance of ADDOW when the signal is not weak. These intuitions are verified in the simulations of Section 7.3. Moreover, crADDOW has the nice property of being asymptotically equivalent to ADDOW.
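The data-splitting step can be sketched as follows. Here `weight_fn` is a hypothetical stand-in for the ADDOW weight optimization (which is not reproduced here); the point illustrated is only the split: folds are balanced within each group, and the weight attached to a fold's p-values is computed from the other folds only.

```python
import numpy as np

def cross_weights(pvals, groups, n_folds, weight_fn, rng):
    """Cross-weighting skeleton in the spirit of crADDOW.

    `weight_fn(pvals, groups)` returns a dict {group: weight} and plays
    the role of the ADDOW optimization on the retained folds. Each
    p-value then enters the step-up indicator only once.
    """
    pvals, groups = np.asarray(pvals), np.asarray(groups)
    m = pvals.size
    folds = np.empty(m, dtype=int)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        fg = np.arange(idx.size) % n_folds  # balanced fold labels in group g
        rng.shuffle(fg)                     # random assignment F_m
        folds[idx] = fg
    w = np.empty(m)
    for f in range(n_folds):
        held_out = folds == f
        # weights for fold f computed on the complement of fold f
        wf = weight_fn(pvals[~held_out], groups[~held_out])
        w[held_out] = [wf[g] for g in groups[held_out]]
    return w, folds
```

The resulting per-hypothesis weights can then be passed to a WBH step-up, giving a procedure with F × G effective groups as described above.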
Theorem 7.1. Let us assume that Assumptions 2.1 to 2.6, 7.1 and 7.2 are fulfilled. Assume also that α ≤ π0 . Then crADDOW is asymptotically equivalent to ADDOW. This theorem is proved in Section E.

Simulations with crADDOW
The simulations presented here are the same as the simulations depicted in Section 6.1, with the addition of crADDOW in each Group.
From the FDR plots, we see that the FDR is greatly deflated and is now controlled at level α for small μ in each scenario, while for large μ we are still slightly above the target level, but with a small improvement over ADDOW. In scenario 1 there is a small window between large and small μ, around μ = 0.75, where crADDOW in Group 2 overfits more than for very large μ, but even there we see a large improvement over ADDOW.
As for the power, crADDOW is less powerful than ADDOW, as expected since it rejects fewer hypotheses, but in most Groups and scenarios the loss of power is almost negligible, and crADDOW even remains as powerful as Oracle ADDOW (with the exception of Group 1 in scenario 1). The difference of power between crADDOW and Pro2 is even smaller, and crADDOW is better in most configurations, with the exception of Groups 2 and 3 around μ = 1.5, which is the zone that we identified in Section 6.3 as the zone where the optimal weights are given by the uniform ŵ(1) weighting of ABH.
The simulations hence confirm our intuitions about the stabilization properties of crADDOW, especially for weak signal where ADDOW was totally unreliable. Studying the finite sample properties of crADDOW, especially its FDR, is an interesting direction for future work.

Concluding remarks
In this paper we presented a new class of data-driven step-up procedures, ADDOW, that generalizes IHW by incorporating πg,0 estimators in each group. We showed that while this procedure asymptotically controls the FDR at the targeted level, it has the best power among all MWBH procedures when the π 0 estimation can be made consistently. In particular it dominates all the existing procedures of the weighting literature and solves the p-value weighting issue in a group-structured multiple testing problem. As a by-product, our work established the optimality of IHW in the case of a homogeneous π 0 structure. Finally we proposed a stabilization variant designed to deal with the case where only a few discoveries can be made (very small signal strength or sparsity). Some numerical simulations illustrated that our properties are also valid in a finite sample framework, provided that the number of tests and the signal strength are large enough. We also introduced crADDOW, a variant of ADDOW that uses cross-weighting to reduce the overfitting while having the exact same asymptotic properties.

Fig 9: FDR against μ in scenario 1. Same legend as in Figure 2, with the addition of crADDOW (yellow lines). The color of the points (black, red, green) indicates the Group (respectively, 1, 2 and 3).

Fig 10: FDR against μ in scenario 3. Same legend as in Figure 9.

[Figures: Difference of power w.r.t. BH against μ, scenarios 1 and 3, with crADDOW.]
Assumptions
Our assumptions are rather mild: basically we only added the concavity of the F g to the assumptions of Ignatiadis et al. (2016). Notably, we dropped the other regularity assumptions on F g that were made in Roquain and van de Wiel (2009), while keeping all the useful properties of W * in the (NE) case. Note that the criticality assumption is often made in the literature, see Ignatiadis et al. (2016) (Assumption 5 of the supplementary material), Zhao and Zhang (2014) (Assumption A.1), or the assumption of Theorem 4 in Hu, Zhao and Zhou (2010). Finally, the weak dependence assumption is used extensively in our paper. An interesting direction could be to extend our results to some strongly dependent cases, for instance by assuming PRDS (positive regression dependence on a subset), as some previous work already studied properties of MWBH procedures under that assumption, see Roquain and Van De Wiel (2008).

Computational aspects
The actual maximization problem of ADDOW is difficult: it involves a mixed-integer linear program that may take a long time to solve. Some regularization variant may be needed for applications. To this end, one can use the least concave majorant (LCM) instead of the empirical c.d.f. in Equation (3.1) (as proposed in modification (E1) of IHW in Ignatiadis et al., 2016). As we show in Section 9, ADDOW can be extended to that case (see especially Section 9.1) and our results remain valid for this new regularized version of ADDOW.
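The LCM regularization itself is straightforward to compute. As an illustration (our own sketch, not the paper's or the IHW package's code), the least concave majorant of a set of points, such as the knots of an empirical c.d.f., is an upper convex hull followed by linear interpolation:

```python
import numpy as np

def least_concave_majorant(x, y):
    """Least concave majorant of the points (x_i, y_i), x increasing.

    Builds the upper convex hull with a monotone-chain scan (hull
    slopes are decreasing, hence the envelope is concave) and returns
    the majorant evaluated at the input x's by linear interpolation.
    """
    hull = []  # indices of the hull vertices kept so far
    for i in range(len(x)):
        while len(hull) >= 2:
            i1, i2 = hull[-2], hull[-1]
            # drop the middle vertex i2 if it lies below the chord (i1, i)
            cross = (x[i2] - x[i1]) * (y[i] - y[i1]) \
                  - (y[i2] - y[i1]) * (x[i] - x[i1])
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    hx, hy = np.asarray(x)[hull], np.asarray(y)[hull]
    return np.interp(x, hx, hy)
```

Replacing the empirical c.d.f. by its LCM in the weight optimization smooths the objective, which is what makes the regularized maximization tractable.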

Toward nonasymptotic results
An interesting direction for future research is to investigate the convergence rates in our asymptotic results. One possibility is to build on the work of Neuvial (2008). However, this would require computing the Hadamard derivative of the functional involved in our analysis, which might be very challenging. Finally, another interesting future work could be to develop other versions of ADDOW that ensure a finite sample FDR control property: this certainly requires a different optimization process, which will make the power optimality difficult to maintain. A possible such variation is crADDOW, whose finite sample FDR has yet to be investigated.

Further generalization
Define, for any u and W , 1 {pg,i≤αuWg(u),Hg,i=0} , and For the sake of generality, D g is not the only estimator of D g (defined in Equation (B.1)) that we will use to prove our results (for example, we can use the LCM of D g , denoted LCM( D g ), see Section 8). So let us slightly enlarge the scope of the MWBH class by defining G W (u) = If W is such that G W is nondecreasing, we then define the generalized MWBH as If ( D g ) g is such that we can define, for all u ∈ [0, 1], we define the generalized ADDOW by the latter being well defined because G W * is nondecreasing (by a proof similar to that of Lemma A.6). Note that for any continuous D g , such as LCM( D g ) or D g itself, the arg max in (9.1) is nonempty and GADDOW can then be defined.
What we show below are more general theorems, valid for any GADDOW ( D g ) g . Our proofs combine several technical lemmas deferred to Sections B and C, which build on the previous work of Roquain and van de Wiel (2009), Hu, Zhao and Zhou (2010) and Zhao and Zhang (2014).
Remark 9.1.GADDOW ( D g ) g when D g = LCM( D g ) and πg,0 = 1 is exactly the same as IHW with modification (E1) defined in the supplementary material of Ignatiadis et al. (2016).In our notation, the latter is WBH W * ũ W * , which is the same as GADDOW ( D g ) g because ũ W * = ũ W * (ũ W * ) (same proof as in Remark 9.2).

Proof of Theorem 5.2
First, in any case,
For the rest of the proof, we assume we are in case (ME), which implies by Lemma B.3 that W * (u) ∈ arg max w∈K ∞ P ∞ w (u) for all u, and that P ∞ W * is nondecreasing. We split the proof into two parts. In the first part we assume that, for all m, W is a weight vector ŵ ∈ K, and therefore does not depend on u. In the second part we conclude with a general sequence of weight functions.
Part 1: W = ŵ ∈ K for all m. Let = lim sup Pow (MWBH ( ŵ)). Up to extracting a subsequence, we can assume that = lim E P ŵ(û ŵ) , that πg,0 a.s.−→ πg,0 for all g, and that the convergences of Lemma C.1 hold almost surely. Define the event Ω. Now consider that Ω occurs and fix a realization of it; the rest of Part 1 is deterministic. Let = lim sup P ŵ(û ŵ). The sequences (m/m g ) πg,0 converge and are therefore bounded, hence the sequence ( ŵ) is also bounded. By compactness, once again up to extracting a subsequence, we can assume that = lim P ŵ(û ŵ) and that ŵ converges to a given w cv . By taking m → ∞ in the relation
Part 2: Now consider the case where W is a weight function u → W (u). Observe that
so by definition of I(•), û W ≤ û W (û W ) , and then
As a consequence, Pow MWBH W ≤ Pow WBH W (û W ) . Finally, apply Part 1 to the weight vector sequence W (û W ) to conclude.
Remark 9.2.We just showed that for every MWBH procedure, there is a corresponding WBH procedure with better power.In particular, by defining û = u W * the ADDOW threshold, we showed that û ≤ û W * (û) .But G W * ≥ G ŵ and then û ≥ u ŵ for any ŵ.Hence û = û W * (û) and ADDOW is equal to the WBH procedure associated to the weight vector W * (û).
Remark 9.3.We actually proved a stronger result, as we can replace the statement W : πalt g,0 w g ≤ 1 and the πalt g,0 are such that πalt g,0 P −→ πalt g,0 for some πalt g,0 ≥ πg,0 .That is, the weight space W belongs to does not have to be the same weight space where we apply ADDOW, as long as it uses over-estimators of the limits of the over-estimators used in K.
Appendix A: Lemmas and proofs of Section 2
Lemma A.1. For all g, F g is continuous.
Proof. F g is concave so it is continuous over (0, 1). F g is continuous at 0 because it is càdlàg. F g is continuous at 1 by concavity and monotonicity.
Lemma A.2. Take a real-valued sequence (λ m ) with λ m ∈ (0, 1), converging to 1, such that 0 for all g and the p-values inside each group are mutually independent, then ∀g ∈ {1, . . ., G}, πg,0 (λ m ) The two suprema of the last display, when multiplied by √ m, converge in distribution (by the Kolmogorov–Smirnov theorem). So when divided by 1 − λ m they converge to 0 in distribution and then in probability. Definition A.1. The critical alpha value is , where Proof. We only need to show that for one w ∈ K ∞ , we have g π g w g (π g,0 + π g,1 f g (0 + )) > 1.
Let us show that this is true for every w ∈ K ∞ such that g π g πg,0 w g = 1, e.g. the w defined by w g = 1/ πg,0 for all g. We use the fact that f g (0 + ) > (F g (1) − F g (0))/(1 − 0) = 1 by the strict concavity of F g . Then π g,0 + π g,1 f g (0 + ) > 1 and
Recall that I(•) is defined as I(h) = sup {u ∈ [0, 1] : h(u) ≥ u} on the function space F, which has the natural order; F is also normed with the sup norm ‖ • ‖.
Lemma A.4. For all h ∈ F, I(h) is a maximum and h (I(h)) = I(h). Moreover, I(•), seen as a map on F, is nondecreasing and continuous at each continuous h 0 ∈ F such that either u → h 0 (u)/u is decreasing over (0, 1], or I(h 0 ) = 0.
Proof.I(h) is a maximum because there exists n → 0 such that

So h (I(h)) ≥ I(h). Then h (h (I(h))) ≥ h (I(h)), thus h (I(h)) ≤ I(h) by the definition of I(h) as a supremum.
If u + > 1 then obviously It is a maximum by continuity over a compact and is such that s γ < 0, because s γ ≥ 0 would contradict the maximality of I(h 0 ).Then, for all u ∈ [u + , 1], and then sup Hence, as soon as We can then write the following: as soon as h − h 0 ≤ 1 2 (h 0 (u − ) − u − ).This implies I(h) > u − .Taking completes the proof.g,i = p g,i /W g (r/m) (with the convention p g,i /0 = ∞), order them p (1) , . . ., p [r] (r) ≤ α r m .Then r/m ≤ ûW by definition of ûW .Second, we know that ûW can be written as κ/m because ûW = G W (û W ), so we want to show that κ ≤ r which is implied by r, p Furthermore,

Appendix B: Asymptotical weighting
Define, for a weight function W : [0, 1] → R G + , possibly random, and W is nondecreasing, we also define It is the asymptotic version of K. We now define oracle optimal weights over K ∞ for G ∞ • (u) and P ∞ • (u), for all u > 0.
, it is a singleton.In this case, its only element w * belongs to [0, 1 αu ] G and satisfies g π g πg,0 w (u) ≤ 1 with equality if and only if αu ≥ π0 .The same statements are true for P ∞ • , except that the upper bound of max w∈K ∞ P ∞ w (u), which is achieved if and only if αu ≥ π0 , is not For the rest of the proof u is greater than 0.
First we show that there is no w * (u) such that αuw * g1 > 1 and αuw * g2 < 1 for some g 1 , g 2 ≤ G. Then we define w such that wg = w * g for all g ∈ {g 1 , g 2 }, wg1 = 1 αu and So w belongs to K ∞ and satisfies π g D g (αuw * g ) + π g1 + π g2 D g2 (αu wg2 ) because D g is increasing over [0, 1] and then constant equal to 1. This contradicts the definition of w * , so this is impossible.
Next we distinguish three cases.
We showed that the only w * ∈ arg max Furthermore g π g πg,0 w * g = 1 : if not there would exist a w with wg1 > w * g1 (for the same g 1 as in previous sentence) and wg = w * g for all g = g 1 such that w • , by replacing D g by π g,1 F g .
From now on, W * (u) denotes an element of arg max w∈K ∞ G ∞ w (u) (just like we write W * (u) as an element of arg max w∈ K G w (u)), our results will not depend on the chosen element of the argmax.Next Lemma gives some properties on the function W * is nondecreasing by exactly the same argument as in the proof of Lemma A.6.The result can be strengthened thanks to Lemma B.1, by writing, for u Because the expression above is continuous of the w g , they can always be chosen nonzero.We have α ∧1] with a < b and λ ∈ (0, 1), by Lemma B.1, we have that αaW * g (a), αbW * g (b) ≤ 1 and then, for all g: .We have w ∈ K ∞ and then for all g: , the inequality being strict for g 1 .Finally by summing: 1.The fact that u * = 1 ⇐⇒ α ≥ π0 follows directly from the previous statements and Lemma B.1.The decreasingness of u → G ∞ W * (u)/u is straightforward from strict concavity properties because it is the slope of the line between the origin and the graph of G ∞ W * at abscissa u > 0. Previous statements imply that G ∞ W * is continuous at least over (0, π0 α ∧ 1) and, if α ≥ π0 , over → 0 when u → 0 which gives the continuity in 0. As in the proof of Lemma A.1, the continuity in π0 α ∧ 1 is given by the combination of concavity and nondecreasingness.Remark B.1.The case α ≥ π0 is rarely met in practice because α is chosen small and the signal is assumed to be sparse (so π0 is large) but it is kept to cover all situations.It confirms the intuitive idea that in this situation the best strategy is to reject all hypotheses because then the FDP is equal to π 0 ≤ π0 ≤ α.
Remark B.2.For a weight vector w Figure 13 illustrates all the properties stated in Lemma B.2, with the two cases α ≥ π0 and α < π0 .
The next lemma justifies the intuitive idea that maximizing the rejections and maximizing the power are the same thing (as explained in Section 3.2), but only under (ME).
u) and arg max w∈K ∞ P ∞ w (u) are both equal to the set of weights w ∈ K ∞ such that αuw g ≥ 1 for all g.Now if αu ≤ π0 , both arg max are singletons.Take w * the only element of arg max w∈K ∞ P ∞ w (u).Recall that there exists C ≥ 1 such that, for all 1 ≤ g ≤ G, πg,0 = Cπ g,0 , and write, for all w ∈ K ∞ , imsart-generic ver.2014/10/16 file: Durand2018v2.texdate: March 1, 2022 because g π g πg,0 w * g = 1 and αuw * g ≤ 1 for all g, by Lemma B.1.This means that w * is also the unique element of arg max w∈K ∞ G ∞ w (u).
In particular, if αu ≤ π0 , Proof.If the statement is false, there exists some > 0 and some sequence (w

Appendix C: Convergence lemmas
Recall that ‖ • ‖ is the sup norm for bounded functions on their definition domain. Lemma C.1. The following quantities converge to 0 in probability:
From now on, D g is assumed to converge uniformly to D g in probability, and W * (u) ∈ arg max w∈ K G w (u) is assumed to exist for all u.
The next lemma is the main technical one (it has the longest proof).
Lemma C.2. We have the following convergence in probability:
Proof. First,
where the first term tends to 0 by (C.1), so we work on the second term.
The main idea is to use the maximality of G w (u) at W * (u) and the maximality of G ∞ w (u) at W * (u). The problem is that one is a maximum over K and the other is over K ∞ . The solution consists in defining small variations of W * (u) and W * (u) to place them respectively in K ∞ and K.

mean of the number of its false positives.
Heuristic 3.1. Maximizing G (m) W (u) should be close to maximizing P (m) W (u). Indeed, consider weight functions W such that g (m g,0 /m) W g (u) = 1 and replace U (x) by x for all x ∈ R + (whereas U (x) = x only holds for x ≤ 1); then H (m) W (u) becomes αu g (m g,0 /m) W g (u) = αu, which does not depend on the weights. So P (m) W (u) is the only term depending on W, and maximizing G (m) W (u) is the same. Now, we can evaluate the constraint we just put on W by estimating it (this corresponds to the weight space K defined in Equation (3.3)), and G (m) w (u) can easily be estimated by the (unbiased) estimator G w (u). As a result, maximizing the latter in w should lead to good weights, not too far from W *or (u). Zhao and Zhang (2014) followed Heuristic 3.1 by applying a two-stage approach to derive two procedures, named Pro1 and Pro2. Precisely, in the first stage they use the weight vectors ŵ( π1 ), where πg,1 = 1 − πg,0 .
Corollary 5.1. Let us assume that Assumptions 2.1 to 2.6 are fulfilled, with the additional assumption that (ED) holds. Then lim m→∞ FDR (IHW) = π 0 α, (5.3) and for any sequence of random weight functions ( W ) m≥1 such that W : [0, 1] → K NE and G W is nondecreasing, we have lim m→∞ Pow (IHW) ≥ lim sup m→∞ Pow (MWBH ( W )). (5.4) While Equation (5.1) of Theorem 5.1 covers Theorem 4 of the supplementary material of Ignatiadis et al. (2016), here the bound uses a weight function valued in K. This immediately yields the following corollary.
Corollary 5.2. Let us assume that Assumptions 2.1 to 2.6 are fulfilled, with the additional assumption that (ME) holds. Then lim m→∞ Pow (ADDOW) ≥ lim sup m→∞ Pow (R), for any R ∈ {BH, IHW}.

Fig 2: FDR against μ in scenario 1. Group 1 in black; Group 2 in red; Group 3 in green. The type of procedure depends on the shape: Oracle ADDOW (triangles and solid line); ADDOW (triangles and dashed line); Pro2 (disks); HZZ (diamonds); and finally BH/ABH (crosses). IHWc and IHWc-Storey are in blue, respectively with black and green points. Horizontal lines: α and π 0 α levels. See Section 6.1.

Fig 6: Details of Figure 4, where only a subset of procedures is plotted.

Fig 8: DiffPow of ADDOW against μ in the simulation of Section 6.4. ADDOW in the (NE) case is given by the black triangles and ADDOW in the (CE) case is given by the red triangles.
strictly concave over [0, u ] and constant equal to M on [u , 1], hence u → G ∞ w (u)/u is decreasing. So whether w = 0 or not, I(•) is continuous at G ∞ w by Lemma A.4. Remark B.3. The proof of the strict concavity of G ∞ W * can easily be adapted to show the (not necessarily strict) concavity of G W * when D g = LCM( D g ).
Finally the properties on P ∞ W * are obtained by the same proof as Lemma B.2. The next lemma is only a deterministic tool used in the proof of Lemma C.4. Define the distance d of a weight vector w to a subset S of R G + by d(w, S) = inf w̃∈S max g |w g − w̃ g |. Let M u = arg max w∈K ∞ G ∞ w (u) to lighten notation. Lemma B.4. Take some u ∈ (0, 1]. Then we have: and D g − D g , for all g ∈ {1, . . ., G}. Furthermore, for any ( D g ) g such that D g − D g P −→ 0, sup w∈R G + G w − G ∞ w P −→ 0. (C.1) Proof. By using the same proof as that of the Glivenko–Cantelli theorem (which adapts trivially to convergence in probability instead of almost sure convergence), we get from (2.1) and (2.2) that, for all g, ,i≤•,Hg,i=1} − π g,1 F g P −→ 0. So by summing, D g − D g P g + π g D g − D g P −→ 0.
The BH procedure applied to a set of 10 p-values. Right plot: the p-values and the function k → αk/m.