Some optimality properties of FDR controlling rules under sparsity

False Discovery Rate (FDR) and the Bayes risk are two different statistical measures which can be used to evaluate and compare multiple testing procedures. Recent results show that under sparsity FDR controlling procedures, like the popular Benjamini-Hochberg (BH) procedure, also perform very well in terms of the Bayes risk. In particular, asymptotic Bayes optimality under sparsity (ABOS) of BH was previously shown for location and scale models based on log-concave densities. This article extends that work to a substantially larger set of distributions of effect sizes under the alternative, where the distribution of true signals does not change with the number of tests $m$, while the sample size $n$ slowly increases. ABOS of BH and of the corresponding step-down procedure based on FDR levels proportional to $n^{-1/2}$ is proved. A simulation study shows that these asymptotic results are relevant already for relatively small values of $m$ and $n$. Apart from showing asymptotic optimality of BH, our results on the optimal FDR level provide a natural extension of the well known results on the significance levels of Bayesian tests.


Introduction
Driven by a vast number of applications, over the last few years multiple hypothesis testing with sparse alternatives has become a topic of intensive research (see [1,9,12,13,24] or [31]). As a result of this interest many new multiple testing procedures have been proposed, which can be compared according to several different optimality criteria. In the classical context, a multiple testing procedure is considered to be optimal if it maximizes the number of true discoveries, while keeping one of the type I error measures (like the Family Wise Error Rate, the False Discovery Rate or the expected number of false positives) at a certain fixed level (see [10,19,23,30,34,36,41] or [42]). A different notion of optimality is proposed in [38] and [6], which investigate multiple testing procedures in the context of minimizing the Bayes risk.
The frequentist measures of accuracy for multiple testing procedures, like the False Discovery Rate (FDR) or the Family Wise Error Rate (FWER), are seemingly very different from the Bayes risk. Nevertheless several papers (see e.g. [4,5] or [21]) reported that under sparsity (i.e. when the proportion $p$ of alternatives among all tests is small) the popular Benjamini-Hochberg procedure (BH, [2]) has very good properties with respect to minimizing the Bayes risk under 0-1 loss. These empirical findings have been confirmed by the theoretical results reported in [6] and [33], where BH is proved to have some asymptotic optimality properties with respect to minimizing the Bayes risk under sparsity. These results complement the findings of [1], where it is shown that the estimator of the vector of means based on the hard-thresholding rule, with cutoff values provided by BH, is asymptotically minimax under sparsity.
Asymptotic results of [6] are concerned with testing hypotheses about means of normal distributions, where the true means under the alternative have a normal distribution with a standard deviation slowly increasing with m. These results were further extended in [33] to a very general class of location and scale models. While the results of [33] are of substantial theoretical value, in practical applications it is perhaps too limiting to consider test statistics whose marginal distribution under the alternative differs from the null distribution only by a location or a scale parameter. In this article we present an extension of [6] which we believe to be of more practical relevance than the setting discussed in [33].
To motivate our proposed model we want to look at two examples from statistical genetics. First consider the rather simple two groups model for RNA micro-array data given in (3.11) of [16]. Each test statistic is assumed to have a normal distribution $N(\mu_i, \sigma^2)$, which typically can be justified by the Central Limit Theorem. Furthermore it is assumed that only a very small proportion $p$ of the true means $\mu_1, \dots, \mu_m$ are different from zero, and that their conditional distribution can be described by a normal distribution $N(\mu_0, \sigma_0^2)$. It follows that the marginal distribution of the test statistics under the alternative is $N(\mu_0, \sigma^2 + \sigma_0^2)$, which differs from the null distribution $N(0, \sigma^2)$ both by a location and a scale parameter. Thus even this simple example goes beyond the framework of [33], where tests for the difference in only one of these parameters are considered.
A second example is given by genetic association studies, where it is quite common that effect sizes are modelled by exponential or Laplace distributions. For example in the model of [32] the effect sizes of causative SNPs are Laplace, while the environmental effects are normal, corresponding exactly to our setting. The marginal distribution under the alternative is thus the convolution of a Laplace and a normal distribution, while the null distribution is again normal, which again places this practically important example outside the class of models treated in [33].
In this article we will consider only testing the point null hypotheses that $\mu_i = 0$. This is less general than the normal scale mixture model from [6], where $\mu$ was allowed to be normally distributed with small variance under the null hypothesis. However, this restriction can be justified in many applications. For example, whenever it is assumed in statistical genetics that phenotypes characteristic of a disease are influenced only by a small number of genes, the restriction to point null hypotheses is legitimate. Considering point null hypotheses furthermore allows nontrivial asymptotic inference (which means positive asymptotic power) when keeping the distribution of true effects under the alternative fixed while increasing the number of tests. This assumption, natural in many practical applications, substitutes the assumptions of [6] and [33], where the magnitude of true effects increases with the number of tests.
We believe that both assumptions, increasing the effect size and increasing the sample size, are relevant in certain practical situations. For example in RNA micro-array experiments the sample size is often rather small, while effect sizes are believed to be quite large, which relates to the asymptotic analysis of [6] and [33]. On the other hand in association studies it is common to collect data from very large samples just to be able to detect relatively moderate effect sizes. For instance in genome wide association studies one typically has samples of size $n > 1000$, while the number of genetic markers is $m > 10^6$. On the other hand the number of markers $k$ which are actually expected to be associated with the trait in question is usually considered to be rather moderate, like $k < 100$. Then for the proportion of markers under the alternative hypothesis it holds that $p < 10^{-4}$. These figures motivate an asymptotic scheme where the number of tests $m$ goes to infinity, while the sample size $n = n_m$ increases at a slower rate, and $p = p_m$ converges to 0.
From a mathematical point of view, the major contribution of the present article is the extension of the results of [6] to cover a wide range of distributions of effect sizes $\mu_i$ under the alternative. The technical details on the class of feasible distributions are provided in Assumption (B), which includes for example all distributions with positive density over the real line. The main mathematical challenge is that one no longer obtains closed form expressions for the critical values of the Bayes rule. The rather elaborate asymptotic analysis which is then necessary to obtain approximate formulas for the Bayes risk uses some techniques introduced in [25].
We will develop the asymptotic theory under the general condition that $p \to 0$, but to explain the main message of this article we summarize here the results for the special case where $p \propto m^{-\beta}$, with $\beta \in (0, 1]$. We will see that in this scenario nontrivial asymptotic inference is possible only for $n$ converging to infinity at least at a rate $n \propto \log m$. In terms of applications this allows the number of tests to be much larger than the sample size, as for example in genome wide association studies.
Asymptotic Bayes optimality under sparsity (ABOS) depends on the loss function under which the Bayes risk is computed. We consider a generalized 0-1 loss, and define $\delta$ as the ratio of losses between type I and type II errors. Our main result is concerned with BH, which controls FDR at the nominal level $\alpha$. When the loss ratio $\delta$ is kept fixed, and if $\alpha$ converges to 0 at a rate which is roughly proportional to $n^{-1/2}$, then BH is ABOS for any $\beta \in (0, 1]$. In that sense BH adapts well to any unknown level of sparsity. Apart from BH we also consider the Bonferroni procedure, which controls FWER at the nominal level $\alpha$. Under the conditions on $\alpha$ which yield ABOS of BH, it turns out that the Bonferroni rule is ABOS only in the extremely sparse case $\beta = 1$ (which means that $p \propto 1/m$). We will furthermore illustrate some situations where the Bonferroni rule is optimal when $\beta < 1$, but then the nominal level $\alpha$ depends on $\beta$. Thus in contrast to BH the Bonferroni rule does not adapt well to an unknown level of sparsity.
Apart from these theoretical findings, we report the results of an extensive simulation study, comparing the performance of BH, Bonferroni correction and a multiple testing procedure based on the empirical Bayes estimate proposed in [26]. The study shows that our asymptotic results are relevant already for quite moderate values of m and n.
The rest of the paper is organized as follows. In Section 2 we present our statistical model, asymptotic assumptions and Theorem 2.2, which provides the conditions under which fixed threshold multiple testing rules are ABOS. The resulting Corollary 2.2 shows that the universal threshold $\sqrt{2\log m}$ of [14] is ABOS in the extremely sparse case $p \propto m^{-1}$ and for sample size $n$ satisfying $\log n = o(\log m)$. The most important theorems on multiple testing are given in Section 3. Specifically, Theorem 3.2 specifies the conditions under which BH and its corresponding step-down procedure are ABOS, while Theorem 3.3 gives the respective conditions for the Bonferroni correction. The results of the simulation study are presented in Section 4. Finally, Section 5 summarizes our results and discusses directions for future work. Most of the technical proofs have been put in the Appendix, which also includes a discussion on the relationship between rules controlling the Bayesian False Discovery Rate (BFDR) and the Bayes classifier.
Asymptotic Bayes optimality of multiple testing rules with fixed threshold

Statistical model
Consider a set of $m$ normal populations $N(\mu_i, \sigma^2)$, $i = 1, \dots, m$. We are interested in testing the point null hypotheses $H_{0i}: \mu_i = 0$ against the alternatives $H_{Ai}: \mu_i \neq 0$, based on simple random samples $X_i = (X_{1i}, \dots, X_{ni})$ of size $n$ from each of these populations. The effects under study $\mu_i$ are supposed to be independent and identically distributed according to the mixture distribution
$$\mu_i \sim (1-p)\, d_0 + p\, \nu \;, \qquad (2.1)$$
where $d_0$ is the Dirac measure at 0, $\nu$ is a probability measure on the real line describing the distribution of $\mu_i$ under the alternative, and $p \in (0, 1)$ is the proportion of alternatives among all tests. Since $\nu$ describes the alternative distribution of the different $\mu_i$, we assume that $\nu(\{0\}) = 0$. Furthermore both positive and negative values of $\mu_i$ should be possible, that is
$$\nu(0, \infty) > 0 \quad \text{and} \quad \nu(-\infty, 0) > 0 \;. \qquad (2.2)$$
From (2.1) it easily follows that the marginal distribution of the sample mean is
$$\bar X_i \sim (1-p)\, N(0, \sigma^2/n) + p \left(\nu * N(0, \sigma^2/n)\right) \;, \qquad (2.3)$$
where the pdf of the second component is computed as the convolution of $\nu$ and $N(0, \sigma^2/n)$.
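To make the model concrete, the following minimal sketch (in Python, with an illustrative choice of $\nu$; the helper names are ours, not the authors') generates sample means from the two-groups model (2.1)-(2.3):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sample_means(m, n, p, sigma=1.0, rnu=None):
    """Draw sample means from the two-groups model (2.1)-(2.3).

    rnu: function mapping k to k draws from the alternative measure nu;
         a standard normal is used by default, purely for illustration.
    """
    if rnu is None:
        rnu = rng.standard_normal
    signal = rng.random(m) < p              # indicators of H_Ai being true
    mu = np.zeros(m)
    mu[signal] = rnu(signal.sum())          # mu_i ~ nu under the alternative
    xbar = mu + rng.normal(0.0, sigma / np.sqrt(n), size=m)
    return xbar, mu, signal

xbar, mu, signal = simulate_sample_means(m=1000, n=20, p=0.01)
z = np.sqrt(20) * xbar                      # scaled test statistics Z_i
```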
Remark 2.1. To reduce the complexity of notation we assume that the population variances $\sigma_i^2$ are the same for every $i$. However, the results on asymptotic optimality presented in this article also hold if the sequence of possibly different variances $\sigma_i^2 \in (0, \infty)$ satisfies $\liminf_{i\to\infty} \sigma_i^2 > 0$ and $\limsup_{i\to\infty} \sigma_i^2 < \infty$.

Remark 2.2. Note that for sufficiently large $n$ the assumption that $\bar X_i \,|\, \mu_i \sim N(\mu_i, \sigma_i^2/n)$ can be justified by the Central Limit Theorem, even when the conditional distribution of $X_i \,|\, \mu_i$ is not normal. According to [17], under weak assumptions on the moments of the distribution of $X_i \,|\, \mu_i$, the level of the simultaneous tests based on the t-test statistics $\sqrt{n}(\bar X_i - \mu_i)/\hat\sigma_i$ is accurate provided $\log m$ increases at a strictly slower rate than $n^{1/3}$. Based on these findings we expect that under similar assumptions our asymptotic results can be generalized to the case where the distribution of $X_i \,|\, \mu_i$ is not normal and the variances $\sigma_i^2$ are estimated separately for every $i$. This is partially confirmed by our simulation study, which illustrates the performance of BH based on the t-statistics, with $\sigma_i$ estimated separately for each $i$.

Asymptotic framework and the power of the Bayes oracle
Our decision theoretic framework for multiple testing is based on a generalization of the standard 0-1 loss. There are $m$ decisions to be made. For each false rejection (type I error) we assign a loss of $\delta_0$, and for missing a true signal (type II error) a loss of $\delta_A$. The total loss of a multiple testing procedure is then defined as the sum of the losses for the individual tests. The total loss is clearly minimized by applying the Bayes classifier to each individual test, the decision rule which was called the Bayes oracle in [6].
As noted in [25], if $p \in (0, 1)$ then for any measure $\nu$ satisfying (2.2) and sufficiently large $n$, the Bayes classifier chooses $H_{0i}$ if $\bar X_i \in (a_n, b_n)$, where the critical values are uniquely defined as the two solutions $a_n < 0 < b_n$ of
$$\int \exp\left(\frac{n}{\sigma^2}\left(x \mu - \frac{\mu^2}{2}\right)\right) d\nu(\mu) = \delta f \;, \qquad (2.4)$$
with $\delta = \delta_0/\delta_A$ and $f = (1-p)/p$. Equivalently one can work with the scaled test statistics $Z_i = \frac{\sqrt n}{\sigma}\bar X_i$, for which the critical values $c_a := \frac{\sqrt n}{\sigma} a_n$ and $c_b := \frac{\sqrt n}{\sigma} b_n$ of the Bayes rule are given by the corresponding equations (2.5). All the results in the main text will be presented at the scale of the test statistics, that means in terms of $c_a$ and $c_b$. However, some of the technical proofs in the appendix are more conveniently stated working with (2.4).
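Since the critical values have no closed form for general $\nu$, they can be found numerically. The sketch below solves our reading of (2.4) (based on its representation in Appendix B) by root-finding, assuming $\nu$ has a density `rho`; the bracketing interval is ad hoc and may need adjustment:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

def oracle_thresholds(n, p, delta, rho, sigma=1.0, span=5.0):
    """Solve (our reading of) equation (2.4) for the oracle cutoffs a_n < 0 < b_n:
        int exp(n(x*mu - mu^2/2)/sigma^2) rho(mu) dmu = delta*(1-p)/p .
    rho is the (assumed) density of nu; 'span' brackets the roots ad hoc."""
    target = delta * (1.0 - p) / p

    def g(x):
        val, _ = quad(lambda mu: np.exp(n * (x * mu - mu**2 / 2) / sigma**2)
                      * rho(mu), -span, span, limit=200)
        return val - target

    return brentq(g, -span, 0.0), brentq(g, 0.0, span)

# Example: nu = N(0,1); the cutoffs widen as p decreases (sparser signals).
a_n, b_n = oracle_thresholds(n=20, p=0.01, delta=1.0, rho=norm.pdf)
```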
As explained in the Introduction, our asymptotic scheme which is motivated for example by genetic association studies assumes that the number of tests m goes to infinity, while the sample size n = n m increases at a slower rate, and p = p m converges to 0. From now on sequences will be indexed with m, but one has to be aware that m → ∞ will also imply that n → ∞ and p → 0.
Also, in our analysis we are mainly interested in the situation where the loss ratio $\delta = \delta_0/\delta_A$ is kept constant, which directly relates to the risk based on misclassification loss. However, we also consider the situation where $\delta = \delta_m$ slowly converges to zero, such that $\log\delta = o(\log p)$. This allows us in Theorem 3.2 to obtain optimality results for multiple testing rules at a constant nominal FDR level $\alpha$.
We will focus on an asymptotic scenario under which nontrivial asymptotic inference is possible. Specifically, we concentrate on the situation where the optimal Bayes rule has non-vanishing asymptotic power. As shown in the proof of Theorem 2.1, the power of the Bayes rule diminishes as $p$ gets smaller, which has to be balanced by increasing the sample size. Lemma 2.1 shows that nontrivial asymptotic inference is possible only if the sequence $\frac{\log p}{n}$ remains bounded.
Lemma 2.1. Assume that the distribution $\nu$ of true signals under the alternative is such that $\nu(0, \infty) > 0$ and $\nu(-\infty, 0) > 0$. Moreover, assume that $p = p_m$ converges to zero and that the loss function satisfies $\log\delta = o(\log p)$. Then the asymptotic power of the Bayes classifier is positive only if the sequence $\frac{\log p}{n}$ remains bounded.
Proof. The proof is given in Section A of the Appendix.

Remark 2.4. Corollary 2.1 specifies the relationship between the number of tests and the sample size under which the asymptotic power of the Bayes classifier does not vanish in the limit. Similar conditions appear quite often in the literature on high dimensional inference. For an example in the context of multiple testing see [17], as discussed in Remark 2.2. Similarly, in the context of model selection the popular Lasso [43] has an asymptotically optimal tuning parameter $\lambda \propto \sqrt{\log m / n}$, where $m$ is the total number of regressors. As discussed in [8], Lasso is consistent if the vector of true regression coefficients satisfies $\|\beta\|_1 = o(\sqrt{n/\log m})$. So, if $\|\beta\|_1$ stays constant, then for consistency one needs $\frac{\log m}{n} \to 0$.
Motivated by the above considerations we will investigate the performance of the multiple testing rules under the following asymptotic assumption.
Assumption (A): $p \to 0$, $\delta$ is bounded from above and such that $\log\delta = o(\log p)$, and $\frac{-2\log p}{n} \to C$, where $0 \le C < \infty$.
The second set of assumptions used throughout this paper imposes a mild restriction on the measure $\nu$.

Remark 2.5. In principle this assumption requires only that the measure $\nu$ has a positive density in the neighborhood of one or two points, depending on the constant $C$ specified in Assumption (A). Thus it is for example satisfied whenever $\nu$ has a positive, bounded density on the real line. Note that in the case $C = 0$, Assumption (B) is satisfied whenever $\nu$ has a positive density at $\mu = 0$ (i.e. very small signals are probable).
Using the techniques from [25] it can be shown that under Assumptions (A) and (B) the threshold values of the Bayes oracle satisfy
$$c_a = -\sqrt{\log v}\,(1 + o_m) \quad \text{and} \quad c_b = \sqrt{\log v}\,(1 + o_m) \;, \qquad (2.6)$$
where $v := n\delta^2/p^2$ (the formula is derived in Section B of the Appendix).
The risk of a multiple testing rule under the additive loss is computed simply as the sum of the risks of the individual tests. A multiple testing rule using for given $m$ the same critical values for all $m$ test statistics will henceforth be called a fixed threshold rule. Typically the acceptance region of such a rule will be an interval of the form
$$\bar X_i \in (\tilde a_m, \tilde b_m) \;, \quad \text{or equivalently} \quad Z_i \in (\tilde c_a, \tilde c_b) \;, \qquad (2.7)$$
where $\tilde c_a, \tilde c_b$ are the critical values for the scaled test statistics $Z_i$, and $\tilde a_m, \tilde b_m$ are the critical values for the empirical means $\bar X_i$. Note that for the specified mixture model (2.3) the type I error rate $t_1$ and the type II error rate $t_2$ of a fixed threshold rule are identical for each individual test. The corresponding risk is therefore defined as
$$R = m(1-p)\,\delta_0\, t_1 + m\,p\,\delta_A\, t_2 \;. \qquad (2.8)$$
In the following theorem we provide the asymptotic risk $R_{opt}$ of the Bayes oracle.
Proof. The proof combines techniques of [25] and [6], and is given in Section C of the Appendix.
Remark 2.6. As in [6], the asymptotics of the optimal Bayes risk is determined by the cost associated with the type II error. From the proof of Theorem 2.1 it immediately follows that the Bayes rule has asymptotic power $1 - \nu(-T, T)$.
Remark 2.7. In a setting of sparsity one might consider the question how the trivial rule of not rejecting any hypothesis performs. This procedure has the risk $R_0 = mp\delta_A$, which in case of $C > 0$ can be rewritten as $R_0 = \frac{R_{opt}}{\nu(-T,T)}(1 + o_m)$. Therefore when the asymptotic power of the Bayes rule is close to zero the trivial rule does almost as well as the Bayes rule, and it will be difficult for any other statistical procedure to perform reasonably well. From a practical perspective the following results for $C > 0$ are therefore more meaningful when the asymptotic power of the Bayes rule is sufficiently larger than zero.

Asymptotic Bayes optimality under sparsity
The Bayes classifier defined in (2.4) and in (2.5) requires full knowledge of the mixture distribution. Since this information is usually not available, in practice one needs to consider other multiple testing rules. This section provides a theorem characterizing the set of fixed threshold rules which asymptotically attain the optimal risk.

Definition. A multiple testing rule is called asymptotically Bayes optimal under sparsity (ABOS) if its risk $R$ satisfies $\frac{R}{R_{opt}} \to 1$ under Assumption (A).
The proof is provided in Section D of the Appendix.
The following corollary gives a simple multiple testing rule, which is asymptotically optimal in case of extreme sparsity, where the number $m$ of tests increases to infinity, but the expected number of true signals $mp$ remains constant or increases only very slowly with $m$.
Corollary 2.2. If for $n = n_m$ the extreme sparsity assumption is fulfilled, then any fixed threshold rule whose squared critical value lies between $z_2 = 2\log m + d$ and $z_1 = \log(nm^2) + d$, for some constant $d$, is ABOS.
Proof. Simply observe that $z_1 = \log(nm^2) + d$ and $z_2 = 2\log m + d$ fulfill the requirements of Theorem 2.2 under the assumption of the corollary.
Remark 2.8. Corollary 2.2 states that under extreme sparsity and when $\log n = o(\log m)$ the universal threshold $\sqrt{2\log m}$ of [14] is ABOS. However, when $\log n$ increases to infinity at least at the rate of $\log m$, then the universal threshold needs to be supplemented by a $\log n$ term to preserve its ABOS property. Similar criteria in the context of model selection in multiple regression have been proposed and discussed e.g. in [22,3] and [7].
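As a small numerical illustration of Remark 2.8 (under our reading of Corollary 2.2), the following snippet compares the universal threshold with the version supplemented by the $\log n$ term; when $\log n = o(\log m)$ the two agree asymptotically:

```python
import numpy as np

def universal_threshold(m, d=0.0):
    # Donoho-Johnstone universal threshold: c^2 = 2 log m + d
    return np.sqrt(2 * np.log(m) + d)

def supplemented_threshold(m, n, d=0.0):
    # version including the log n term: c^2 = log(n m^2) + d
    return np.sqrt(np.log(n * m**2) + d)

# With log n = o(log m) the two thresholds agree asymptotically:
for m, n in [(10**4, 20), (10**6, 50), (10**8, 100)]:
    print(m, n, round(universal_threshold(m), 3),
          round(supplemented_threshold(m, n), 3))
```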

ABOS of some popular multiple testing rules
In this section we present the main results of this article. First we give conditions under which the FDR controlling procedure BH and its corresponding step-down procedure SD at nominal FDR level $\alpha$ are ABOS. Then we consider the Bonferroni correction at nominal FWER level $\alpha$. Specifically, ABOS of BH and SD is derived under Condition (Op) on $\alpha$. This condition is necessary and sufficient for the asymptotic optimality of rules based on the Genovese-Wasserman approximation to BH (see Theorem 3.1), and of fixed threshold rules controlling the Bayesian False Discovery Rate, which are discussed in Section E of the Appendix. Assertion a) delivers perhaps the most important consequence, namely that our optimality results hold for nominal levels $\alpha \propto n^{-1/2}$. Assertions c) and d) are concerned with signals at the verge of detectability, where it is possible to obtain optimality for constant $\alpha$, but only when letting $\delta$ get smaller with growing $m$. From a practical perspective this is quite reasonable: it means that the relative cost of missing a true signal is increased when the signals become more sparse. If one wants to consider only $\delta = const$, which includes the simple misclassification rate, then the last assertion states that at the verge of detectability it is possible to attain optimality for $\alpha \to 0$ at a rate which is substantially slower than $\alpha \propto n^{-1/2}$.

FDR controlling procedures
The Benjamini-Hochberg rule [2], which we will also call the step-up FDR controlling procedure, is defined as follows: Denote the ordered p-values $p_i = 2(1 - \Phi(|Z_i|))$ of the scaled test statistics $Z_i$ as $p_{[1]} \le p_{[2]} \le \dots \le p_{[m]}$. For the step-up procedure at the FDR level $\alpha$ compute
$$k_F = \max\left\{i : p_{[i]} \le \frac{i\alpha}{m}\right\}$$
and reject the $k_F$ hypotheses with p-values smaller than or equal to $p_{[k_F]}$.
In our asymptotic analysis we will also consider the corresponding step-down procedure (SD) at level $\alpha$. For this one needs to compute
$$k_G = \min\left\{i : p_{[i]} > \frac{i\alpha}{m}\right\} \qquad (3.3)$$
and reject the $k_G - 1$ hypotheses with p-values smaller than $p_{[k_G]}$. It is known that under sparsity both procedures behave very similarly (see [1]). However, the proof of ABOS for SD turns out to be slightly more challenging than for BH. The proof of the optimality results for the step-up FDR controlling rule in [6] and [33] relied upon the definition of a random threshold $c_{BH}$ for the BH rule (3.4). A detailed motivation of (3.4) is given in [15].
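For concreteness, here is a minimal sketch of both procedures for two-sided z-statistics (our own helper names, not the authors' code):

```python
import numpy as np
from scipy.stats import norm

def bh_stepup(z, alpha):
    """Benjamini-Hochberg step-up rule for two-sided z-statistics:
    k_F = max{i : p_(i) <= i*alpha/m}; reject the k_F smallest p-values."""
    m = len(z)
    pv = 2 * norm.sf(np.abs(z))
    order = np.argsort(pv)
    below = pv[order] <= alpha * np.arange(1, m + 1) / m
    k_f = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k_f]] = True
    return reject

def sd_stepdown(z, alpha):
    """Step-down analogue (3.3): stop at k_G, the first i with
    p_(i) > i*alpha/m, and reject the k_G - 1 smaller p-values."""
    m = len(z)
    pv = 2 * norm.sf(np.abs(z))
    order = np.argsort(pv)
    above = pv[order] > alpha * np.arange(1, m + 1) / m
    k_g = above.nonzero()[0][0] + 1 if above.any() else m + 1
    reject = np.zeros(m, dtype=bool)
    reject[order[:k_g - 1]] = True
    return reject
```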
Alternatively let us denote $1 - F_m^{(SD)}(y) = \#\{|Z_i| > y\}/m$. Similarly as in the case of BH it is easy to check that SD rejects the null hypothesis $H_{0i}$ if and only if $|Z_i| \ge c_{SD}$, where $c_{SD}$ is defined in (3.5). It was proved by Genovese and Wasserman (GW) in [21] that for fixed $p$, as the number of tests increases, the random threshold $c_{BH}$ can be approximated by the non-random threshold $c_{GW}$ solving
$$2(1 - \Phi(c)) = \alpha\,(1 - F(c)) \;, \qquad (3.6)$$
where $F(y) = P(|Z_1| \le y)$. From Corollary F.1 of Appendix F it immediately follows that for any $\alpha \in (0, 1)$ there exists a unique positive solution of (3.6). Figure 1 illustrates the thresholds $c_{BH}$, $c_{SD}$ and $c_{GW}$. Comparing $c_{BH}$ and $c_{SD}$ with $c_{GW}$, the only change is in replacing the cumulative distribution function of $|Z_i|$ by the corresponding empirical distribution function. Thus, for fixed $p$ and $m$ converging to infinity, $c_{SD}$ will also converge to $c_{GW}$. Approximations of $c_{BH}$ by $c_{GW}$ in the case of sparsity were given in [6]. In Section G of the Appendix we consider the approximation of $c_{SD}$ by $c_{GW}$ under our asymptotic assumptions. A much simpler result is that under sparsity the multiple testing rule based on the Genovese-Wasserman threshold $c_{GW}$ is ABOS. For the denser case (3.8) the additional conditions (3.9) should hold. Then both BH and SD are ABOS if their nominal FDR levels $\alpha$ satisfy Condition (Op).
Proof. The proof is given in Section G of the Appendix.
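The non-random GW threshold is easy to compute numerically. The sketch below assumes that (3.6) takes the form $2(1-\Phi(c)) = \alpha(1 - F(c))$, which is consistent with the BFDR identity used in Appendix F:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def gw_threshold(alpha, F, c_max=40.0):
    """Genovese-Wasserman threshold: root of 2(1-Phi(c)) - alpha*(1-F(c)),
    where F is the marginal cdf of |Z_1| under the mixture (2.3).
    A unique positive root exists for alpha in (0,1) (Corollary F.1);
    c_max may need tuning so that the bracket exhibits a sign change."""
    g = lambda c: 2 * norm.sf(c) - alpha * (1.0 - F(c))
    return brentq(g, 0.0, c_max)

# Example: nu = N(0, tau^2), so Z_1 | H_A ~ N(0, 1 + n*tau^2/sigma^2).
n, tau, p, alpha = 20, 1.0, 0.01, 0.1
s = np.sqrt(1 + n * tau**2)
F = lambda c: (1 - p) * (2 * norm.cdf(c) - 1) + p * (2 * norm.cdf(c / s) - 1)
print(gw_threshold(alpha, F))
```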
Remark 3.1. The upper bound on $m$ provided in the second condition of (3.9) is not very restrictive. Specifically, it is satisfied whenever $p \propto m^{-\beta}$ with $\beta \in (0, 1]$. For $p$ decreasing to 0 at a slower rate (for example like $(\log m)^{-1}$) one can replace this bound with the condition
$$n \ge m^{\gamma_3} \quad \text{for some } \gamma_3 > 0 \;. \qquad (3.10)$$
The following corollaries are easy consequences of Theorem 3.2.

Remark 3.2. Corollary 3.1 states that under some mild restrictions on $\delta$, BH and SD at the FDR level $\alpha \propto n^{-1/2}$ are ABOS for the whole range of sparsity parameters $p = m^{-\beta}$ with $\beta \in (0, 1]$. This illustrates that these rules adapt to the unknown sparsity parameter. Corollary 3.2 says that in the case when $n \propto \log m$, and under the additional requirement that $\delta$ converges slowly towards zero, BH and SD at the fixed FDR level $\alpha \in (0, 1)$ are also ABOS. This result substantially extends the results of [6] to the case where the prior on $\mu_i$ is fixed and not normal, while the sample size $n$ slowly increases to infinity. It additionally justifies the use of a fixed FDR level for BH in many applications, like e.g. in bioinformatics, where $n$ is much smaller than $m$. As discussed in [6] the condition $\delta \to 0$ is quite reasonable in this context, since the cost of missing a true positive is usually large if $p$ is very small. Note that Condition (Op) does not allow keeping both $\delta$ and $\alpha$ fixed at the same time.
Remark 3.3. Our results on the optimal FDR level are closely related to the well known fact that for fixed $p$ the significance level of the Bayes test is roughly proportional to $n^{-1/2}$ (see e.g. [11]). Interestingly, in the context of multiple testing under moderate sparsity (i.e. when $\beta \in (0, 1)$) the Bayesian classifier controls FDR rather than FWER.

Bonferroni correction
In the applied sciences the most popular multiple testing procedure is still the Bonferroni correction. For the scaled test statistics $Z_i$ one computes the two-sided p-values $p_i = 2(1 - \Phi(|Z_i|))$. The test decision is based on comparing these p-values with the corrected significance level $\alpha/m$: in case of $p_i < \alpha/m$ one rejects the null hypothesis.
Equivalently this procedure can be viewed as a fixed threshold rule, where the corresponding critical value $c_{Bon}$ for the test statistic $Z_i$ is defined by
$$c_{Bon} = \Phi^{-1}\left(1 - \frac{\alpha}{2m}\right) \;.$$
The procedure controls the family wise error rate at level $\alpha$.
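In code the Bonferroni rule is essentially a one-liner (a sketch with our own naming):

```python
import numpy as np
from scipy.stats import norm

def bonferroni(z, alpha):
    """Reject H_0i iff p_i < alpha/m, i.e. |Z_i| > Phi^{-1}(1 - alpha/(2m))."""
    c_bon = norm.isf(alpha / (2 * len(z)))
    return np.abs(z) > c_bon
```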
Note that the threshold of the Bonferroni correction is determined by the choice of the nominal FWER $\alpha$. So one can expect that by an appropriate choice of $\alpha$ one can obtain an optimal rule. Unfortunately, the threshold of the optimal Bayes classifier depends on $n$, $\delta$ and the unknown sparsity level $p$. As a consequence the optimal FWER $\alpha$ will also depend on these parameters. The following Theorem 3.3 states that under the conditions on $\alpha$ for which BH is optimal, the Bonferroni correction is ABOS only in the extremely sparse case, where $p$ is roughly proportional to $\frac{1}{m}$.

Theorem 3.3. Suppose Assumptions (A), (B) and Condition (Op) for $\alpha = \alpha_m$ are satisfied. The Bonferroni procedure at FWER level $\alpha$ is ABOS if and only if the sparsity condition (2.14) holds.
Proof. The proof of Theorem 3.3 is provided in Section H of the Appendix. The proof relies upon technical results concerning the relationship between BFDR controlling rules and the Bayes classifier, which are discussed in Appendix Section E.
Remark 3.4. Our results show that under Condition (Op) the Bonferroni correction is ABOS only under extreme sparsity, while BH is ABOS under a much wider range of sparsity. This is due to the fact that the Bonferroni procedure has a larger type II error than BH. The Bonferroni-Holm procedure serves as a popular alternative, which is more powerful than the Bonferroni procedure while still controlling FWER. The Bonferroni-Holm procedure works like the SD procedure (3.3), but the ordered p-values are compared with $\alpha/(m - i + 1)$ instead of $i\alpha/m$. Thus the Holm procedure is sandwiched between Bonferroni and SD, and one might ask if it is ABOS for a wider range of sparsity parameters than the Bonferroni correction. Unfortunately this is not the case. The gain in power of the Bonferroni-Holm procedure comes from an implicit estimation of $1 - p$, but this does not bring any advantage when $p \to 0$.
The following proposition gives conditions on α under which the Bonferroni rule is ABOS also for the denser case.
The proof is given in Appendix H. For ease of presentation the proposition is stated with the simple sufficient condition (3.11). However, the condition indicates that the significance level $\alpha$ of the Bonferroni rule depends on the unknown sparsity parameter $\beta$. In that sense, contrary to the FDR controlling procedures, the Bonferroni rule does not adapt to the unknown level of sparsity.

Simulations
The purpose of this simulation study is to illustrate the theoretical results of Section 3. We focus particularly on the implications of Corollaries 3.1 and 3.2. For various scenarios we generated independent test statistics according to the mixture model (2.3). The performance of different multiple testing procedures under sparsity is compared with the Bayes oracle, where the risk ratio is estimated for a given loss function based on 2000 simulation runs for each scenario.
The following features are analyzed systematically (a minimal simulation sketch is given after this list):

1. Multiple testing rules: We consider the Bonferroni rule and the Benjamini-Hochberg procedure. Initially we also compared two further procedures, namely the step-down procedure (3.3) as well as the asymptotically optimal rejection curve from [19]. However, under sparsity the performance of both procedures is almost indistinguishable from BH, and therefore these results are not shown. In our simulation study we also included the empirical Bayes multiple testing procedure from Johnstone and Silverman [26,27]. As shown in [26] this procedure adapts to the unknown sparsity and provides an asymptotically minimax thresholding rule for the estimation of unknown means. The effect of different significance levels for Bonferroni (FDR levels for BH) was studied. In particular we considered fixed nominal levels $\alpha = 0.05$ and $\alpha = 0.1$, as well as $\alpha \propto n^{-1/2}$ (where $\alpha = 0.1$ for $n = 5$). In the manuscript we only present results for fixed $\alpha = 0.1$, and in case of BH also for $\alpha \propto n^{-1/2}$. The empirical Bayes approach of [26] was based on a Laplace prior with scale parameter $\tau$ (so that the variance of the prior equals $2\tau^2$), which performed best in the simulations of Johnstone and Silverman. Note that the Laplace prior is heavy tailed and spiked at zero, which makes it very well suited for the detection of signals under the assumption of sparsity. This prior is often used in the context of sparse model selection and is directly related to the $L_1$ penalty applied in Lasso [43]. In accordance with the presentation in [26] the procedure EB uses the scaling parameter $\tau = 2$, for which the variance of the Laplace distribution is equal to 8. The procedure EB2 estimates the scaling parameter $\tau$ (see [27] for details).

2. Known and unknown error variance $\sigma^2$: In all simulations the error variance was $\sigma^2 = 1$. We performed the analysis both for known $\sigma$ (z-tests) as well as under the assumption that $\sigma$ is not known (t-tests). To apply the empirical Bayes approach of [26] in case of unknown $\sigma$ we transformed the t-test statistics into z-scores, yielding test statistics which are standard normal under the null hypothesis.
3. For the sample size $n$ and the number of tests $m$ two prototypic scenarios are considered: a polynomial and an exponential growth rate of $m$ with respect to $n$. For $n \in \{5, 10\}$ the difference between the polynomial and the exponential scenario is quite small. But for larger $n$ the number of tests in the exponential case gets much larger, and leads to the situation referred to as 'being on the verge of detectability'. Results for the polynomial case are presented in Figure 2 and Figure 3, results for the exponential case in Figure 4 and Figure 5.

4. Simulations were run for two different sparsity levels: $p \propto m^{-\beta}$ for $\beta = 1$ and $\beta = 1/4$.
For the case of extreme sparsity $\beta = 1$ we used the proportionality constant 8, which yields for $n = 5$ the sparsity parameter $p = 0.4$ in the polynomial case, and $p = 0.4211$ in the exponential case. For the denser case $\beta = 1/4$ we wanted the sparsity for $n = 5$ to be the same as in the case of extreme sparsity, which is achieved using the proportionality constants 0.8459 for the polynomial case and 0.8791 for the exponential case.

5. Concerning the loss function, mainly the usual 0-1 loss is studied, which corresponds to $\delta = 1$ and results in the Bayes risk being equal to the misclassification rate. In view of Corollary 3.2 we analyze for the exponential case also the loss $\delta = 1/n$, which corresponds to a loss ratio $\delta \propto (\log(m/5))^{-1}$ (with proportionality constant $\log(1.3)$).

6. Distribution of effect sizes under the alternative: Apart from the standard normal distribution we considered an asymmetric double exponential (ADE) distribution with density given in (4.2). The particular choice of ADE is motivated twofold. In [35] it was suggested to use the asymmetric Laplace distribution to model microarray data. While [35] models the data directly with ADE, it appears reasonable to consider a model where the effect sizes under the alternative stem from ADE. In statistical genetics there is a long history of modelling effect sizes using Laplace distributions, and ADE is a natural asymmetric generalization. On the other hand ADE is simple enough to allow computing the thresholds of the Bayes rule (details are given below). Results are presented for the specific choice $\lambda_1 = 1.5$ and $\lambda_2 = 3$.
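The following self-contained sketch (ours, not the original simulation code) estimates the misclassification risk of BH and Bonferroni under the normal mixture; the constants mirror the extremely sparse polynomial setting described above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def mc_loss(m, n, p, alpha, reps=500, tau=1.0):
    """Monte Carlo 0-1 risk of BH and Bonferroni under the normal mixture
    (mu ~ N(0, tau^2) under the alternative, sigma = 1)."""
    loss_bh = loss_bon = 0.0
    for _ in range(reps):
        signal = rng.random(m) < p
        mu = np.where(signal, tau * rng.standard_normal(m), 0.0)
        z = np.sqrt(n) * mu + rng.standard_normal(m)
        pv = 2 * norm.sf(np.abs(z))
        order = np.argsort(pv)
        below = pv[order] <= alpha * np.arange(1, m + 1) / m
        k = below.nonzero()[0].max() + 1 if below.any() else 0
        rej_bh = np.zeros(m, bool)
        rej_bh[order[:k]] = True
        loss_bh += np.sum(rej_bh != signal)
        loss_bon += np.sum((pv < alpha / m) != signal)
    return loss_bh / reps, loss_bon / reps

# extremely sparse polynomial setting at n = 5: p = 8/m with m = 20
print(mc_loss(m=20, n=5, p=0.4, alpha=0.1))
```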
Remark 4.1. The scenario with normally distributed mean under the alternative, $\mu \sim N(0, \tau^2)$, coincides with the scale mixture model discussed in [6].
The ADE as a generalization of the Laplace distribution gives an example of a distribution with heavier tails. If we make the specific choice $\lambda_1 = c_1/\tau$ and $\lambda_2 = c_2/\tau$, then ADE can be viewed as a scaling family with scaling parameter $\tau$. While in [6] the focus was on normal scale mixture models and the asymptotic analysis was performed in terms of the parameter $\tau^2/\sigma^2$, this article studies the more general class of distributions characterized by Assumption (B), and the asymptotic analysis is driven by the sample size $n$. However, for scaling families it is easy to take into account both the influence of the effect size $\tau^2/\sigma^2$ and the influence of the sample size $n$, by working with the parameter $u = n\tau^2/\sigma^2$. The following simulations are performed with $\sigma = 1$, and for the normal mixture model $\tau$ is kept fixed at $\tau = 1$, a value which guarantees that the Bayes oracle has reasonable power to detect signals within the considered range of $n$ and $m$. The presented results are representative for any other choice of $n$, $\sigma$ and $\tau$ yielding the same parameter $u = n\tau^2/\sigma^2$.
For the normal scale mixture model one readily computes that the threshold values of the Bayes oracle for $\bar X_i$ are given by
$$\pm\sqrt{\frac{\sigma^2}{n}\cdot\frac{1+u}{u}\left(\log(1+u) + 2\log(\delta f)\right)} \;. \qquad (4.1)$$
For ADE the threshold values of the Bayes oracle have no closed form expression. However, for (4.2) the optimal threshold values $b_l < 0$ and $b_r > 0$ of $\bar X_i$ are quite easily obtained as numerical solutions of the equations defining the Bayes classifier.

Figure 2 illustrates the performance of the multiple testing procedures for polynomial growth rates of $m$ with respect to $n$, in case of known error variance $\sigma^2$. In the upper two panels (a) and (b) the mean under the alternative is simulated from a standard normal distribution $\mu \sim N(0, 1)$, whereas the lower two panels (c) and (d) relate to the asymmetric double exponential (4.2). Panels of the left column (a) and (c) show the extremely sparse case $\beta = 1$, whereas (b) and (d) of the right column show the denser case $\beta = 1/4$.
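Returning to the oracle thresholds used as the benchmark in these figures: both are easy to evaluate. In the sketch below, the closed form for the normal mixture is our reconstruction of (4.1), obtained by equating the weighted marginal densities under the null and the alternative, and the ADE density is our reading of (4.2) (a normalized two-sided exponential):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def oracle_cutoff_normal(n, p, delta, tau, sigma=1.0):
    """Our reconstruction of (4.1): with u = n*tau^2/sigma^2 and f = (1-p)/p,
    accept iff xbar^2 <= (sigma^2/n)*(1+u)/u*(log(1+u) + 2*log(delta*f))."""
    u = n * tau**2 / sigma**2
    f = (1 - p) / p
    b2 = (sigma**2 / n) * (1 + u) / u * (np.log(1 + u) + 2 * np.log(delta * f))
    return np.sqrt(b2)

def oracle_cutoffs_ade(n, p, delta, lam1, lam2, sigma=1.0, span=5.0):
    """Numerical cutoffs b_l < 0 < b_r for the ADE prior; the density below
    is our assumed form of (4.2)."""
    f = delta * (1 - p) / p
    s2 = sigma**2 / n
    c0 = lam1 * lam2 / (lam1 + lam2)       # assumed ADE normalizing constant

    def rho(mu):
        return c0 * np.exp(lam1 * mu if mu < 0 else -lam2 * mu)

    def g(x):  # marginal likelihood ratio (alternative/null) minus f
        val, _ = quad(lambda mu: np.exp((2 * x * mu - mu**2) / (2 * s2))
                      * rho(mu), -2 * span, 2 * span, limit=200)
        return val - f

    return brentq(g, -span, 0.0), brentq(g, 0.0, span)

# Section 4 setting: lambda_1 = 1.5, lambda_2 = 3
print(oracle_cutoffs_ade(n=20, p=0.01, delta=1.0, lam1=1.5, lam2=3.0))
```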
The first observation is that the estimates of the risk ratio for the extremely sparse case in Figure 2 (a) and (c) show much more fluctuation due to simulation error than the estimates in panels (b) and (d). This is not surprising, because under extreme sparsity the expected number of true signals does not increase with $m$, which makes it difficult to estimate the mixture density accurately. In accordance with our theoretical results the Bonferroni correction does quite well in case of extreme sparsity, where its risk ratio rapidly converges towards the optimal value 1. In contrast, for the denser case the Bonferroni rule is far too conservative and its risk ratio is an increasing function of $n$.
The BH procedure with fixed FDR level $\alpha = 0.1$ (abbreviated as BH01) gets too liberal for large $n$, and its risk ratio starts growing. However, in Figure 2 (b) BH01 seems to attain the optimal risk for $n = 20$. What actually happens here is that the FDR of the Bayes oracle is a decreasing function of $n$, with FDR $\simeq 0.2$ for $n = 5$ and FDR $\simeq 0.03$ for $n = 200$. Roughly at $n = 20$ the FDR of the Bayes oracle is close to the nominal FDR level 0.1 of BH01, resulting in the minimum of the risk ratio. In Figure 2 (d) the same observation can be made at $n = 50$. Much better performance for large $n$ is obtained by the BH procedure at FDR level proportional to $n^{-1/2}$ (abbreviated as BHn). Both in case of extreme sparsity as well as for the denser case its risk ratio converges towards 1, which confirms the theoretical results from Corollary 3.1.

[Figure 2: Risk ratio in case of known $\sigma$ for the polynomial growth rate of $m$ with respect to $n$ as described in the main text. The loss ratio $\delta = 1$ is kept fixed.]
The observed behavior of the risk ratios for Bonferroni, BH01 and BHn is qualitatively quite similar for normal alternatives (panels (a) and (b)) and for ADE (panels (c) and (d)), though there are obviously big quantitative differences. In particular, in Figure 2 (d) the risk ratio of BHn with decreasing FDR level converges rather slowly towards 1. This is mainly because in this scenario the Bayes oracle is much more liberal than BHn. We observed also in other simulation scenarios (not presented here) that in case of very large effect sizes (large variance of the alternative distribution) BHn is more conservative than the Bayes oracle, and then only for very large values of $m$ and $n$ does the risk ratio get close to 1. However, in such an "easy" situation the misclassification rate of BHn will be rather small; it is only in comparison with the Bayes oracle that its performance appears poor.
Concerning the empirical Bayes approach, our simulation study confirms the very good results reported in [26]. Both EB and EB2 perform quite well under all scenarios presented in Figure 2. Qualitatively EB performs similarly to BHn. For the extremely sparse case the risks of these procedures are very similar. In the denser case EB has a substantially larger risk than BHn, but still seems to be ABOS. In the denser case (panels (b) and (d)) EB2 performs substantially better than EB as long as $m > 300$. In particular for ADE (or in general for alternative distributions with large variance) EB2 performs extremely well in the denser case. However, for smaller $m$ EB2 has difficulties estimating $p$ correctly, leading to a rather large misclassification rate. Also, in the extremely sparse case EB2 needs at least $m > 1000$ tests to be able to estimate both the mixture rate $p$ and the scaling parameter $\tau$ simultaneously, and even then it tends to have a larger risk ratio than BHn.

Figure 3 presents the same simulation results for the polynomial growth rate as Figure 2, but now in case of unknown $\sigma$. Qualitatively we obtain very similar results as for known $\sigma$, with the most striking difference that for small $n$ the risk ratio is generally much larger in case of unknown $\sigma$. This is obviously a consequence of the fact that the Bayes oracle has the advantage of knowing $\sigma$, and for small $n$ it is difficult to estimate $\sigma$ from the data. The risk ratio of the Bonferroni procedure converges towards 1 only in case of extreme sparsity (Figures 3 (a) and (c)), whereas for BHn convergence towards 1 occurs in all four scenarios.

Figure 4 and Figure 5 refer to simulations at exponential growth rates, which relates to the situation of 'being on the verge of detectability'. In case of known $\sigma$, Figures 4 (a) and (b) illustrate again the implications of Corollary 3.1, which states that for fixed $\delta$ the BH procedure with $\alpha \propto n^{-1/2}$ is ABOS. Just like in the polynomial case, BH01 is too liberal for larger $n$, and its risk ratio is increasing with $n$. In contrast the risk ratio of BHn converges to 1 both in panel (a) and (b), resembling very much the results of Figure 2. In Figures 4 (c) and (d) risk ratios are shown for the loss ratio $\delta = n^{-1}$. In that case BH01 with fixed FDR level performs better than BHn with decreasing $\alpha$ level. Here the risk ratio of BH01 actually appears to be decreasing for large $n$, which corresponds with Corollary 3.2, although convergence towards 1 seems to be rather slow.
Finally, in Figure 5 simulation results at the verge of detectability in case of unknown $\sigma$ are reported. While in the denser case $\beta = 1/4$ (panels (b) and (d)) the qualitative behavior of all risk ratios is fairly similar to the case of known $\sigma$ in Figure 4, this seems to be no longer true for extreme sparsity (panels (a) and (c)). To a certain extent these findings are in accordance with the results of [17], which require $\log m = o(n^{1/3})$ for the accurate uniform approximation of the distribution of t-statistics by a normal distribution.

Discussion
In this paper we extend the asymptotic optimality results on FWER and FDR controlling rules from [6] to the practically relevant situation where the distribution of effect sizes under the alternative does not change with the number of tests, and has an arbitrary positive density over the entire real line. This makes it possible to describe realistic scenarios where the distribution of effect sizes is not symmetric.
Our results are easiest to describe for a sparsity parameter $p$ of the form $p \propto m^{-\beta}$, with $\beta \in (0, 1]$. When furthermore the ratio of losses for type I and type II errors fulfills $\log\delta = o(\log m)$, then nontrivial inference is possible only when the sample size $n$ increases to infinity at least at the rate of $\log m$. We show that in general if the FDR level $\alpha \propto n^{-1/2}$, then BH and the corresponding step-down procedure are optimal for all sparsity levels $\beta \in (0, 1]$. For the limiting case $n \propto \log m$ the FDR controlling procedures at a fixed level $\alpha \in (0, 1)$ are asymptotically optimal when $\delta$ slowly converges to 0. This condition on $\delta$ seems justifiable, since typically the cost of missing a true signal gets larger as the proportion of true signals decreases.
We also show that for $\alpha \propto n^{-1/2}$ the Bonferroni correction and the universal threshold of [14] are ABOS in the extremely sparse case $p \propto 1/m$. Interestingly, the universal threshold remains ABOS also when $\delta = const$. For the denser case $\beta < 1$ the Bonferroni rule is ABOS when the nominal significance level depends on the sparsity parameter $\beta$ roughly like $\alpha \propto m^{1-\beta} n^{-1/2}$. Thus, contrary to the FDR controlling rules at level $\alpha \propto n^{-1/2}$, the Bonferroni rule does not adapt well to the unknown level of sparsity.
Our results on the optimal FWER and FDR levels can be seen as an extension of the well known result that the significance levels of Bayesian tests are roughly proportional to $n^{-1/2}$ (see e.g. [11]). They suggest that in the context of Bayesian multiple testing under sparsity FDR is a better analogue of the type I error than FWER. Specifically, in Section E of the Appendix we observe that under our sparsity assumptions the Bayesian classifier controls FDR at the level $(n(\log n + \log m))^{-1/2}$. Additionally, our asymptotic results provide some margin for the FDR level under which FDR controlling rules are still ABOS.
Our model assumes that $\bar X_i \,|\, \mu_i \sim N(\mu_i, \sigma^2/n)$. The asymptotic results can also be applied directly to the case when the variances of $X_i \,|\, \mu_i$ are known, but not necessarily equal to each other (assuming that the sequence of variances is bounded and bounded away from zero). Also, as shown in [17], even when the distribution of $X_i \,|\, \mu_i$ is not normal and the variances are estimated separately for each $i$, the level of the simultaneous tests based on the t-test statistics $\sqrt{n}(\bar X_i - \mu_i)/\hat\sigma_i$ is accurate provided $\log m$ increases at a strictly slower rate than $n^{1/3}$. Based on these results we expect that under similar assumptions our asymptotic results can be generalized to the case where the distribution of $X_i \,|\, \mu_i$ is not normal and the variances $\sigma_i^2$ are estimated separately for every $i$. We consider this an interesting topic for further research.
Motivated by the genetic applications from the introduction (for example [16] and [32]) we have focused essentially on continuous distributions of $\mu_i$ (or at least on distributions which are continuous in a neighborhood of $\pm T$). One referee brought up the topic of discretely distributed $\mu_i$, which indeed would need a rather different analysis. Discrete distributions fulfill Condition A1 of [25]; thus when the asymptotic power of the Bayes oracle equals one (that is when $C = 0$) the techniques of [25] can be applied to obtain formulas for the optimal risk analogous to (2.9). From a statistical point of view it is easier to detect signals when the smallest possible effect size is bounded away from 0. Accordingly, for discrete distributions the optimal risk decays exponentially with $n$. It remains open to study which effect this has on the conditions under which FDR controlling rules are ABOS. For signals at the verge of detectability (that is when $C > 0$) we would expect that results for discrete distributions are qualitatively rather similar to the results presented in this paper. However, there remain certain technical difficulties to be solved when the distribution has no probability mass at $\pm T$.
Another interesting topic for further research is the analysis of the asymptotic optimality properties of the plug-in version of the Bayes oracle, based on empirical Bayes (EB) estimates of the mixture distribution. Encouraging theoretical results from [26] and [9] make us believe that EB procedures are ABOS when $mp \to \infty$, i.e. when the expected number of true signals increases with $m$. However, our simulation study suggests that EB procedures might be ABOS even in the extremely sparse case $p \propto 1/m$. Also, based on our simulation study and the results of [5], empirical Bayes or full Bayes multiple testing procedures perform better than BH for somewhat denser scenarios. To distinguish between these rules in our asymptotic context one could consider investigating the rate of convergence of the risk ratio to 1, as was proposed in [33] in the context of testing within families parametrized by the location or the scale parameter.
Our theory is based on the assumption that the test statistics $\bar X_1, \dots, \bar X_m$ are independent. However, it can be rather directly applied to the case where the correlation matrix between the test statistics has a diagonal block structure (e.g. correlated SNPs, gene pathways), when one tests the significance of an entire block based on some aggregated test statistic. Then our results hold as long as the distribution of the aggregated test statistics can be approximated by a normal (or chi-square) distribution.
If the correlation structure is more involved, then the Bayes classifier for a given hypothesis will use the information from the other test statistics. Depending on the relationship between $m$ and $n$, such a Bayes classifier can be approximated by full or empirical Bayes methods, which may estimate the dependence structure using some parametric assumptions or one of the (sparse) high-dimensional methods for estimating covariance matrices. We believe that such methods will in general work much better than BH based on individual p-values. It remains to verify under which forms of dependence BH is still ABOS, and to find the conditions under which empirical or full Bayes methods are ABOS. Due to the more complicated form of the Bayes classifier, the extension of our results to the case of dependent test statistics will require further development of the proof technique and remains an interesting topic for further research.
Appendix B: Derivation of formula (2.6)

Lemma B.1. Let Assumptions (A) and (B) hold. Then the critical values converge, with limits $a_n \to -T$ and $b_n \to T$.
The proof of Lemma B.1 relies on the following technical result.
Lemma B.2. Let $a_n \to a$ be any convergent sequence. Define
$$h_n(\mu) := \exp\left(\frac{a_n \mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\right) \quad \text{and} \quad h(\mu) := \exp\left(\frac{a \mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\right) \;.$$
Then
$$\lim_n \|h_n\|_{L^n(\nu)} = \|h\|_{L^\infty(\nu)} \;. \qquad (B.1)$$

Proof. First note that for all $n$ it holds that $h_n \in L^\infty(\nu)$, and therefore also $h_n \in L^m(\nu)$ for all $m > 0$. It is easy to check that $\lim_n \|h_n - h\|_{L^\infty(\nu)} = 0$. Thus for any $\epsilon > 0$ and sufficiently large $n$ we have $\|h_n - h\|_{L^n(\nu)} \le \|h_n - h\|_{L^\infty(\nu)} < \epsilon$. Now (B.1) easily follows by the triangle inequality and the fact that $\lim_n \|h\|_{L^n(\nu)} = \|h\|_{L^\infty(\nu)}$.

Now we are ready to prove Lemma B.1.
Proof. Let, as before, $h_n(\mu) = \exp\left(\frac{a_n\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\right)$, where $a_n$ are now the solutions of (2.4). Then $(\delta f)^{1/n} = \|h_n\|_{L^n(\nu)}$, and due to Assumption (A), $\lim_n (\delta f)^{1/n} = e^{C/2}$. Note that $a_n$ has to be bounded, since otherwise the sequence $\|h_n\|_{L^n(\nu)}$ could not be bounded. Let $a$ be an accumulation point of $a_n$. By Lemma B.2, for any subsequence $a_j \to a$ it holds that
$$\|h\|_{L^\infty(\nu)} = \lim_j \|h_j\|_{L^j(\nu)} = e^{C/2} \;. \qquad (B.2)$$
Let $S$ denote the support of $\nu$. By (B.2) it holds that
$$\sup_{\mu \in S} \left(\frac{2a\mu - \mu^2}{2\sigma^2}\right) = \frac{C}{2} = \frac{T^2}{2\sigma^2} \;. \qquad (B.3)$$
Since the unconstrained maximum of $2a\mu - \mu^2$ equals $a^2$, (B.3) implies $a^2 \ge T^2$, and as the lower critical values $a_n$ are negative this gives $a \le -T$. On the other hand, using the assumption that $-T \in S$, we have
$$\frac{-2aT - T^2}{2\sigma^2} \le \frac{C}{2} \;, \qquad (B.4)$$
which implies that $a \ge -T$. Thus (B.3) and (B.4) lead to the conclusion that $a = -T$. The proof that $b_n \to T$ goes exactly along the same lines.
The following Lemma B.3 specifies the rate at which $a_n$ and $b_n$ converge to zero in the case $C = 0$; the precise statement involves the approximations (B.5) and (B.6). Throughout the rest of the appendix we will make use of the notation $o_m = o_m(1)$.

To get the exact behavior we further split the domain of the integral into $(-\epsilon, -g_n)$, $(-g_n, 0)$ and $(0, \epsilon)$, where $g_n$ is a positive sequence such that $a_n = o(g_n)$. For the first interval we get a bound by evaluating the integrand at $-g_n$; for the second and third intervals we repeat the computations leading to (B.9) with the corresponding boundaries, and finally obtain an estimate which yields (B.5). The proof for $b_n$ is exactly the same.
Remark B.1. The proof of Lemma B.3 relies upon choosing a suitable sequence $g_n$. The choice of the sequence $g_n$ strongly depends on the asymptotic behavior of $\delta f$. If for example for sufficiently large $n$, $\delta f \le n^{\alpha}$ with $\alpha > 0$, one might use $g_n = \frac{\log n}{\sqrt n}$, the choice of Johnson and Truax (1973). Another situation occurs if $\delta f \sim e^{n^{1-\gamma}}$ with $0 < \gamma < 1$, where $g_n = n^{-\gamma/3}$ is a suitable choice.
Remark B.2. As shown in the proof of Lemma B.3, the accuracy of the approximations provided in (B.5) and (B.6) depends on the asymptotic behavior of $\delta f$ and on the regularity of $\rho$ in a neighborhood of 0. Assuming for example that $\rho$ is one-sided Lipschitz (on both sides of 0) and that $\delta f$ is polynomially bounded, one obtains that the ratio of the right- and left-hand sides of (B.5) and (B.6) can be expressed as $1 + z_n$ with $z_n = o(n^{-1/2}\log n)$.
In case of $C = 0$, equation (2.6) then immediately follows from Lemma B.3, where we use the notation $v = n\delta^2/p^2$ and remember that $f = \frac{1}{p}(1 + o_m)$. Otherwise, if $C > 0$, Lemma B.1 provides $a_n \to -T = -\sigma\sqrt C$ and equation (2.6) follows from Assumption (A).

Appendix C: Proof of Theorem 2.1
Notice that the type II error of the Bayes oracle is given by
$$t_2 = \int \Psi_n(\mu)\, d\nu(\mu) \;, \quad \text{where } \Psi_n(\mu) = P\left(\bar X_i \in (a_n, b_n) \mid \mu\right) .$$
We will now calculate the asymptotic formula for the type II error in the case $C = 0$. Consider first the integral over $\mu \in (-\infty, 0)$. Remember that $a_n \to 0$; thus for $n$ sufficiently large $\nu$ has a density $\rho(\mu)$ on $(2a_n, 0)$. Applying the mean value theorem and substitution, the contribution of $(2a_n, 0)$ can be expressed through some $\rho_n \in [\inf_{\mu\in(2a_n,0)} \rho(\mu), \sup_{\mu\in(2a_n,0)} \rho(\mu)]$, where the last equality in this computation holds due to equation (2.6) from the manuscript. It remains to show that the integral over $(-\infty, 2a_n)$ is of lower order.

In case of $0 < C < \infty$ we know from Lemma B.1 that $a_n \to -T$ and $b_n \to T$, where $T = \sigma\sqrt C > 0$. For $\mu \in (-T, T)$, $\Psi_n(\mu) \to 1$, while for $\mu \in (-\infty, -T) \cup (T, \infty)$, $\Psi_n(\mu) \to 0$. Then by the dominated convergence theorem $t_2 \to \nu(-T, T)$, and $\nu(-T, T) > 0$, since the distribution has a positive density in neighborhoods of $-T$ and $T$.

The Bayes risk can be written in the form (2.8). Thus by (C.1) and (C.2), to complete the proof of Theorem 2.1 it is enough to bound the type I error component. In case of $C = 0$, equation (2.6) and the normal tail approximation yield $t_1 \propto (v \log v)^{-1/2}$, and from (C.1) we easily obtain the claim. In case of $C > 0$ we write $t_1 = t_{1a} + t_{1b}$, where $t_{1a} = \Phi(\sqrt n\, a_n/\sigma)$ and $t_{1b} = 1 - \Phi(\sqrt n\, b_n/\sigma)$. Using the fundamental equality (2.4) for $a_n$, and because $a_n \to -T$, similar considerations as in (B.7) show that the corresponding integral vanishes rapidly for $\mu \notin (-T - \epsilon, -T + \epsilon)$. Thus we finally obtain $\delta f\, t_{1a} = O\left(\frac{1}{n}\right)$. Analogous considerations for $t_{1b}$ finish the proof.

Appendix D: Proof of Theorem 2.2
First consider the case $C = 0$. To prove sufficiency of (2.12) and (2.13) for ABOS of a fixed threshold rule, note that computing the type II error for rules of the form (2.11) involves computations similar to those leading to (C.1), but using $\tilde a_n$ and $\tilde b_n$ defined in (2.7) instead of $a_n$ and $b_n$. Taking into account (2.11) and (2.12) one thus obtains an expression (D.1) which is asymptotically equivalent to the first contribution to the type II error of the Bayes oracle. The remaining contribution (D.2) is of lower order, where the last equality follows from the first part of Assumption (A) and (2.13).
Thus the type II error component of the risk, $R_2 = mp\delta_A t_2$, satisfies $R_2 = R_{opt}(1 + o_m)$. Now, using (C.1) and the tail approximation for the type I error, we obtain a bound on the type I error component $R_1$. Thus under assumption (2.13), $R_1 = o(R_{opt})$, which completes the proof of sufficiency for $C = 0$.
In case of $C > 0$, due to (2.12) it holds that $\tilde a_n \to -T$ and $\tilde b_n \to T$, and hence the thresholds specified by (2.11) also have a type II error of the form (C.2). For sufficiency it remains to establish (C.3). To this end note that the type I error admits a simple tail approximation. Hence, with $C_\nu = \frac{\sqrt C}{\nu(-T,T)}$, under assumption (2.13) again $R_1 = o(R_{opt})$, and the proof of sufficiency is completed.
Concerning necessity, similar arguments as in the proof of Theorem 3.2 of [6] show that (2.12) is necessary for ABOS. In that case the computations leading to (D.1) and (D.2) are still valid and imply the necessity of (2.13).

Appendix E: Bayesian False Discovery Rate vs the Bayes risk
To derive results on classical multiple testing procedures controlling FWER or FDR, we first consider rules controlling the Bayesian False Discovery Rate (BFDR), a concept which was introduced in [15]:
$$BFDR = \frac{(1-p)\, t_{1i}}{(1-p)\, t_{1i} + p\,(1 - t_{2i})} \;, \qquad (E.1)$$
where $t_{1i}$ and $t_{2i}$ are the probabilities of the corresponding type I and type II errors. As discussed in [40], under the mixture model (2.3) the BFDR of a fixed threshold multiple testing rule is closely related to FDR, by a formula involving the distribution of the total number $R$ of rejections.
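For a fixed threshold rule with symmetric cutoff $c$, (E.1) is straightforward to evaluate. A small sketch for the normal mixture (an illustrative choice of $\nu$; our own naming):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def bfdr(c, n, p, tau, sigma=1.0):
    """(E.1) for the symmetric rule |Z_i| > c under the normal mixture."""
    t1 = 2 * norm.sf(c)                       # type I error probability
    s = np.sqrt(1 + n * tau**2 / sigma**2)    # sd of Z_i under H_A
    return (1 - p) * t1 / ((1 - p) * t1 + p * 2 * norm.sf(c / s))

def bfdr_threshold(alpha, n, p, tau, sigma=1.0):
    """Unique c with BFDR(c) = alpha, for alpha in (0, 1-p); existence and
    uniqueness follow from the monotonicity stated in Lemma E.3."""
    return brentq(lambda c: bfdr(c, n, p, tau, sigma) - alpha, 1e-8, 35.0)

print(bfdr_threshold(alpha=0.05, n=20, p=0.01, tau=1.0))
```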
Based on the formulas for the type I and type II errors of the Bayes classifier presented in Section C of the Appendix, it is easy to show that under Assumptions (A) and (B) both the BFDR and the FDR of the Bayes classifier are proportional to $(n(\log n - \log p))^{-1/2}$. It follows that for sparsity levels of the form $p \propto m^{-\beta}$ both are proportional to $(n(\log n + \log m))^{-1/2}$. Now Theorem 2.2 provides some margin for thresholds which yield asymptotically optimal rules. Below we translate this margin of "asymptotically optimal" thresholds into a margin of "asymptotically optimal" BFDR levels.
Here we will restrict our attention to BFDR controlling rules based on symmetric thresholds, such that $a^B_m = -b^B_m$, and use $c_B := \frac{\sqrt n}{\sigma}\, b^B_m$ to denote the corresponding threshold for the scaled test statistics $Z_i = \frac{\sqrt n \bar X_i}{\sigma}$. As shown in Lemma E.3, under the mixture model (2.3) BFDR decreases continuously from $1 - p$ for $c_B = 0$ to 0 as $c_B \to \infty$, implying that each $\alpha \in (0, 1-p]$ corresponds uniquely to a threshold value $c_B$. Obtaining the threshold value $c_B$ with BFDR level $\alpha$ is equivalent to solving
$$2(1 - \Phi(c_B)) = \frac{\alpha}{1-\alpha} \cdot \frac{p}{1-p}\,\left(1 - t_2(c_B)\right) \;, \qquad (E.3)$$
where $1 - t_2(c_B)$ denotes the power $P(|Z_i| > c_B \mid H_{Ai})$. This is the key relation for the proof of the following theorem, which provides conditions on $\alpha$ for which the BFDR controlling rule is ABOS and gives the asymptotic approximation of the corresponding threshold $c_B$.

Theorem E.1. Assume that Assumptions (A) and (B) hold. Additionally assume that $\alpha \to \alpha_\infty < 1$ and $\frac{\log\alpha}{n} \to 0$. As before let $v = \frac{n\delta^2}{p^2}$. Then a rule with $BFDR = \alpha$ is ABOS if and only if Condition (Op) is satisfied, and the threshold value $c_B$ of a rule with BFDR equal to $\alpha$ admits the asymptotic approximation derived in the proof.

Proof. According to (E.3) the threshold value $c_B$ with BFDR level $\alpha$ fulfills the above relation. Let us define $u^B_n = c_B\,\sigma/\sqrt n$. First we want to show that $u^B_n$ is bounded. Assume on the contrary that for some subsequence $u^B_j \to \infty$; then a contradiction follows for any constant $K > 0$.

Lemma E.1. For any fixed $s \neq 0$ the function considered in points a)-c) has the stated properties.

Proof. Points a) and b) easily follow by elementary algebra. To prove point c) let us define $g(c)$ and $h(c)$ accordingly; then straightforward calculations yield the required derivatives. Let us consider first the case $s > 0$. In this situation it is enough to show that $g(c)$ is increasing. To show that $h(c)$ is a decreasing function, observe that the standard bound on the tail of the normal distribution yields $\sqrt{2\pi}\, c^2 (1 - \Phi(c)) < c\, e^{-c^2/2}$, which implies that $h'(c) < 0$. The proof for $s < 0$ goes analogously. In that case $g(c)$ has to be decreasing, which yields $h(c) > h(c - s)$, and again $h(c)$ is a decreasing function.
The following Lemma E.2 easily follows from Lemma E.1.
Lemma E.2 then provides the monotonicity properties needed below.

Lemma E.3. Let $\nu(\cdot)$ be any probability measure such that $\nu(\{0\}) < 1$, and define $BFDR(c)$ as in (E.1) with the symmetric threshold $c$. Then $BFDR(c)$ is continuously decreasing from $1 - p$ for $c = 0$ to 0 for $c \to \infty$.

Proof. Observe that $BFDR(c)$ can be expressed through $H(c)$ as in (E.7). Thus Lemma E.3 is a direct consequence of Lemma E.2.
Appendix F: Existence of $c_{GW}$ and proof of Theorem 3.1

Corollary F.1. For any $p$ there is a unique positive solution $c_{GW}$ of (3.6) for each $\alpha \in (0, 1)$.
Proof of Theorem 3.1. First observe that the threshold $c_{GW}$ of (3.6) at the level $\alpha$ coincides with the threshold of the BFDR controlling rule at the level $\alpha' = \alpha(1-p)$. The rest of the proof follows by observing that when $p \to 0$, then $\alpha'$ satisfies Condition (Op) if and only if $\alpha$ satisfies Condition (Op).
Appendix G: Proof of Theorem 3.2

To prove optimality of the type II risk component of SD in the denser case we first show that with large probability the random threshold of SD is bounded from above by the asymptotically optimal threshold $\tilde c_{1n}$.
Lemma G.1. Let $c_{SD}$ be the random SD threshold at the level $\alpha_m$ and let $\tilde c_1 = \tilde c_{1m}$ be the GW threshold (3.6). Then the fixed threshold rule based on $\tilde c_1$ is asymptotically optimal, and for any $\gamma_u > 0$ and sufficiently large $m$ it holds that $P(c_{SD} > \tilde c_1) \le m^{-\gamma_u}$. In the proof, taking another intersection of the right-hand side with $\{c_{SD} \ge \tilde c_1\}$, one concludes that $P(c_{SD} \ge \tilde c_1)$ can be bounded as required.
With this lemma we are ready to prove Theorem 3.2 itself. First note that BH is more liberal than SD; thus it is enough to control the risk contribution of the type I error for BH, as well as the risk contribution of the type II error for SD.
Under the first condition in (3.9) the proof for the type I error of BH follows along the same lines as the proof of Lemma 5.4 in [6]. Also, under the condition of extreme sparsity (2.14), according to Theorem 3.3 the Bonferroni procedure is ABOS. Therefore the optimality of the type II error component of the risk of SD in the extremely sparse case follows directly from a comparison with the more conservative Bonferroni correction. It remains to bound the type II error component of the risk of SD for the denser case (3.8), which overlaps with the extremely sparse case.
Denote by $L_A$ the number of false negatives under the SD rule, and by $R_A$ the corresponding type II error component of the risk. Furthermore let $\tilde c_1$ be the GW threshold defined in Lemma G.1. Clearly
$$E(L_A) \le E(L_A \mid c_{SD} \le \tilde c_1)\, P(c_{SD} \le \tilde c_1) + m\, P(c_{SD} > \tilde c_1) \;,$$
and also $E(L_A \mid c_{SD} \le \tilde c_1)\, P(c_{SD} \le \tilde c_1) \le E L_1$, where $L_1$ is the number of false negatives produced by the rule based on the threshold $\tilde c_1$. Since by Lemma G.1 the rule based on $\tilde c_1$ is asymptotically optimal, it follows that $\delta_A E L_1 = R_{opt}(1 + o_m)$. On the other hand, by Lemma G.1 it holds that $P(c_{SD} > \tilde c_1) \le m^{-\gamma_u}$ for any $\gamma_u > 0$ if only $m$ is sufficiently large. Now, using assumptions (3.8) and (3.9), and choosing e.g. $\gamma_u = \gamma_2/2 + 1$, we conclude that $\delta_A m^{1 - \gamma_u} = o(R_{opt})$, and the proof is thus complete.

Appendix H: Proof of Theorem 3.3

For the denser case the optimality condition (2.12) would require that $\log(n/p^2) = o(\log v)$. Specifically, remember that $v = n\delta^2/p^2$, and because of Assumption (A), $\delta$ is bounded from above and $\log\delta = o(\log p)$. Thus it is not possible that $\log(n/p^2)/\log(\delta^2) \to 0$. It follows that condition (2.12) cannot be fulfilled, and the Bonferroni rule is not ABOS. Under assumption (3.11) it is clear that $\log^2\log(m/\alpha) = o(v)$, and thus the first condition for a fixed threshold rule to be optimal becomes
$$\frac{2(1-\beta)\log m - \log(n\alpha^2)}{\log n - 2\log p} \to 0 \;.$$
This is trivially fulfilled because (3.11) is equivalent to $2(1-\beta)\log m = \log(n\alpha^2) + const$. It is easy to check that this choice also fulfills the second condition for optimality.