False discovery rate control with multivariate $p$-values

Multivariate statistics are often available, as well as necessary, in hypothesis tests. We study how to use such statistics to control not only the false discovery rate (FDR) but also the positive FDR (pFDR) with good power. We show that FDR can be controlled through nested regions of multivariate $p$-values of test statistics. If the distributions of the test statistics are known, then the regions can be constructed explicitly to achieve FDR control with maximum power among procedures satisfying certain conditions. On the other hand, our focus is on the case where the distributions are only partially known. Under certain conditions, a type of nested regions is proposed and shown to attain (p)FDR control with asymptotically maximum power as the pFDR control level approaches its attainable limit. The procedure based on these nested regions is compared with those based on other nested regions that are easier to construct, as well as with those based on more straightforward combinations of the test statistics.


Introduction
In multiple hypothesis tests, it is common to evaluate nulls with univariate statistics. This has especially been the case for tests based on FDR control [2,12,13,16,20,22,23]. On the other hand, for hypotheses on high dimensional data, such as those in classification or recognition for complex signals, multivariate statistics are in general prerequisite for satisfactory results [1,6,24]. Each such hypothesis involves a sample of random vectors, from which a multivariate statistic is derived to capture critical features of the sample. Given the conceptual appeal of FDR control, it is natural to ask how it can be achieved using multivariate statistics.
The FDR of a multiple testing procedure is defined as E[V /(R ∨ 1)], where R is the number of rejected nulls and V that of rejected true nulls [2]. In addition to FDR, power and pFDR [20] are two important measures to assess the performance of a procedure. Recall that pFDR = E[V /R | R > 0] and the power is E[(R − V )/((n − N ) ∨ 1)], where n is the number of nulls and N that of true nulls. The importance of power is well appreciated in the FDR literature [2,13,14,20,22]. In contrast, the issue of pFDR seems more subtle. Oftentimes, as follow-up actions can ensue only after some rejections are made, pFDR is more relevant than FDR. However, unlike FDR, in general pFDR is not necessarily controllable at a desirable level, say below 0.4. The reason is that in many cases, test statistics cannot provide strong enough evidence to assess the nulls, especially when the data distribution is only partially known and the number of observations for each null is small. The controllability of pFDR can strongly affect power. For the well-known Benjamini-Hochberg (BH) procedure [2], if its FDR control parameter is below the minimum attainable pFDR, then its power tends to 0 as n → ∞ [7]. In light of this, power and pFDR should be considered together when designing testing procedures. A direct way to improve power and pFDR control is to collect more observations for each null. However, this may not be feasible due to constraints on resources. On the other hand, if the observations can be viewed from different aspects, each containing some unique information, then the aspects may be exploited together to yield more substantive evidence.
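As a concrete illustration (not from the paper), these quantities can be estimated by Monte Carlo for the BH procedure under a simple mixture; the sample sizes, the proportion a and the Beta(0.1, 1) alternative below are all hypothetical choices.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha):
    """Return boolean rejection indicators of the BH procedure at level alpha."""
    n = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest i with p_(i) <= alpha*i/n
        reject[order[: k + 1]] = True
    return reject

rng = np.random.default_rng(0)
n, a, alpha = 2000, 0.2, 0.1
V_sum = 0.0; pfdr_samples = []; power_samples = []
for _ in range(200):
    theta = rng.random(n) < a                  # indicators of false nulls
    p = rng.random(n)                          # Unif(0,1) under true nulls
    p[theta] = rng.beta(0.1, 1.0, theta.sum()) # small p-values under false nulls
    rej = benjamini_hochberg(p, alpha)
    V = np.sum(rej & ~theta); R = rej.sum()
    V_sum += V / max(R, 1)                     # contributes to FDR = E[V/(R v 1)]
    if R > 0: pfdr_samples.append(V / R)       # contributes to pFDR = E[V/R | R > 0]
    power_samples.append((R - V) / max(theta.sum(), 1))
print("FDR ~", V_sum / 200, " pFDR ~", np.mean(pfdr_samples),
      " power ~", np.mean(power_samples))
```

In this well-separated setting the estimated FDR stays below the control parameter, and pFDR and FDR are close because the power is positive.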
The approach of the paper is to first establish FDR control based on multivariate p-values, and then evaluate power and pFDR control. Among procedures that attain the same pFDR, the one with the highest power is preferred. Section 2 sets up notations and recalls known results. It then gives an example to illustrate when multivariate p-values may be useful for pFDR control. Section 3 presents a general FDR controlling procedure which uses an arbitrary family of nested regions in the domain of p-values. Then, it shows that if the data distribution under true nulls and that under false nulls are both known, then the nested regions can be chosen in such a way that the procedure has the maximum power among those with the same pFDR while satisfying certain consistency conditions. However, since full knowledge about data distributions is usually unavailable, the emphasis of the section is FDR control based on nested regions that approximate the optimal regions. Under certain conditions, the approximating regions are ellipsoids under an L ε -norm, where ε > 0 in general is a non-integer. Section 4 analyzes the power of the procedure based on the approximating regions. It shows that under certain conditions, the power is asymptotically maximized as the pFDR tends to the minimum attainable level. The procedure is compared with several others, including those that work "directly" on test statistics instead of p-values, for example, procedures that reject nulls with large L a -norms of the test statistics. It will be seen that only for a < 0 can the "direct" procedures attain the same pFDR as the procedure based on the approximating regions. The section also considers a procedure based on nested rectangle regions in the domain of p-values and shows that it has the same level of pFDR control as the one based on the approximating regions. Although less powerful, the procedure is simpler to compute.
Section 5 considers examples of t and F statistics. Section 6 reports a simulation study on the procedures considered in previous sections. Section 7 concludes with some remarks. Most of the technical details are collected in the Appendix.

Notation
Denote by K the dimension of a multivariate p-value. Points in R K will be taken as column vectors. With a little abuse of notation, for f : A → R with A ⊂ R K , sup f will denote the essential supremum of f , i.e. inf{a : ℓ(f −1 ((a, ∞))) = 0}, where ℓ(·) is the Lebesgue measure. If ξ i1 , . . . , ξ iK are marginal or conditional p-values under a null H i , then ξ i = (ξ i1 , . . . , ξ iK ) ′ will be referred to as a multivariate p-value associated with H i . The discussion is under a random effects model as follows [10,13]. Denoting by a ∈ (0, 1) the proportion of false nulls and θ i = 1{H i is false}, (θ i , ξ i ) are i.i.d. such that θ i ∼ Bernoulli(a); given θ i = 0, ξ i1 , . . . , ξ iK are i.i.d. ∼ Unif(0, 1); given θ i = 1, ξ i ∼ G with density g. (2.1) The density of ξ i is then 1 − a + ag. The assumption that ξ i1 , . . . , ξ iK are independent under true H i should be checked carefully. Generally speaking, the design of test statistics with independent p-values is problem-dependent [5]. One situation in which independence may arise is where multiple data sets on the same nulls are collected independently following different protocols, e.g., with different experiment designs being used or different physical attributes being recorded. In this situation, observations in different data sets may not follow the same distributions, and hence cannot be combined into larger i.i.d. samples. Nevertheless, the p-values derived respectively from them can be combined into multivariate p-values with independent coordinates.
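A minimal sketch of sampling from model (2.1), assuming a hypothetical alternative G with independent Beta(1, 2) coordinates (so g is bounded and g(0) is finite):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, a = 100000, 2, 0.1

theta = rng.random(n) < a            # theta_i ~ Bernoulli(a); 1 marks a false null
xi = rng.random((n, K))              # given theta_i = 0: xi_ik i.i.d. Unif(0, 1)
xi[theta] = rng.beta(1.0, 2.0, (theta.sum(), K))   # given theta_i = 1: xi_i ~ G

# the mixture density of xi_i is 1 - a + a*g; marginal checks:
print(round(xi[~theta].mean(), 3))   # ~ 0.5, consistent with Unif(0, 1)
print(round(xi[theta].mean(), 3))    # ~ 1/3, the Beta(1, 2) mean
```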
In general, a nested family of Borel sets in [0, 1] K can often be parameterized so that procedure (3.2) is applicable to them.
As ℓ(D t ) ≡ t, D t will be referred to as the regularization of Γ u . Since nested regions naturally occur as decision regions in hypothesis tests, as seen below, by regularization a test can be turned into an FDR controlling procedure.
Example 3.1. (a) Suppose a test rejects a null if and only if min k ξ k ≤ u, where ξ = (ξ 1 , . . . , ξ K ) ′ is the associated p-value and u a threshold value. The corresponding rejection region is Γ u = {x ∈ [0, 1] K : min k x k ≤ u}, with ℓ(Γ u ) = 1 − (1 − u)^K. (b) Likewise, if a test rejects a null if and only if ∏ k ξ k ≤ u, then the rejection region is Γ u = {x ∈ [0, 1] K : ∏ k x k ≤ u}, with ℓ(Γ u ) = 1 − F K (− log u), where F K is the Gamma distribution with shape parameter K and scale parameter 1.
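The measures of such rejection regions, and the inversion used for regularization, can be checked numerically; a sketch with K = 3 chosen arbitrarily:

```python
import numpy as np

K = 3
rng = np.random.default_rng(2)
x = rng.random((400000, K))

# (a) min rule: l(Gamma_u) = 1 - (1 - u)^K, so the regularization D_t uses
# u(t) = 1 - (1 - t)^(1/K), giving l(D_t) = t
u_of_t = lambda t: 1 - (1 - t) ** (1 / K)
fracs = {t: np.mean(x.min(axis=1) <= u_of_t(t)) for t in (0.05, 0.3, 0.7)}
print(fracs)                                    # each value ~ its key t

# (b) product rule: l({prod_k x_k <= u}) = 1 - F_K(-log u) with F_K = Gamma(K, 1),
# checked against sums of K independent Exp(1) variables
u = 0.05
emp = np.mean(np.prod(x, axis=1) <= u)
ref = np.mean(rng.exponential(1.0, (400000, K)).sum(axis=1) >= -np.log(u))
print(round(emp, 4), round(ref, 4))             # the two estimates should agree
```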

Regions with maximum power under consistency condition
If the distribution under true nulls and that under false nulls are known, then, in light of the Neyman-Pearson lemma, it is natural to ask if FDR can be controlled with maximum possible power using the likelihood ratios of the test statistics. Some work has been done on this idea [18,21]. We next show that the idea is correct under certain conditions and can be realized by procedure (3.2) with an appropriate nested family {D t , t ∈ [0, 1]} of regions in [0, 1] K . Let X = (X 1 , . . . , X K ) ′ ∈ R K be a test statistic. Suppose that under true nulls, X ∼ Q 0 with density q 0 and under false nulls, X ∼ Q 1 with density q 1 . Our construction of D t is based on a familiar transformation φ of X into multivariate p-values. Denote by f k (x 1 , . . . , x k ) the marginal density of X 1 , . . . , X k under Q 0 . Clearly q 0 = f K .
Suppose i) sppt(q 1 ) ⊂ sppt(q 0 ), ii) all f k are continuous, and iii) q 1 is continuous on sppt(q 0 ). Then 1) φ : sppt(q 0 ) → [0, 1] K is continuous and 1-to-1; 2) under true nulls, ξ 1 , . . . , ξ K are i.i.d. ∼ Unif(0, 1); and 3) under false nulls, ξ has a continuous density g(x) := (q 1 /q 0 )(φ −1 (x)). Condition i) is not restrictive because hypothesis testing is trivial when X = x ∈ sppt(q 1 ) \ sppt(q 0 ) (cf. [18]). Conditions ii) and iii) are used to make sure all the transformations involved on X are still well-defined random variables. Under these conditions, any rule based on the likelihood ratio of X has an equivalent based on g(ξ). In the rest of the section, we will only consider tests on p-values.
A multiple testing procedure can be regarded as a deterministic or random function δ that maps each data point to an n-tuple (δ 1 , . . . , δ n ) with δ i = 1 {H i is rejected}. In our setup, the data point is an n-tuple (ξ 1 , . . . , ξ n ) jointly distributed with θ = (θ 1 , . . . , θ n ). We need two conditions on δ.
(A) For n ≥ 1, δ and θ are independent conditional on ξ 1 , . . . , ξ n . Most multiple testing procedures are deterministic functions of test statistics and therefore satisfy condition (A). The condition means that the observed test statistics contain all the available information on θ: if there is any prior knowledge on θ, it has already been fully incorporated into ξ, and hence any randomness introduced into δ is a "pure guess".
The second condition imposes some consistency on δ. For an n-tuple S = (ξ 1 , . . . , ξ n ), denote by F̂(x; S) its empirical distribution function and by R(S; δ) the number of nulls rejected when δ is applied to S. (B) For any sequence of n k -tuples S k with n k → ∞, if F̂(x; S k ) converges in the sense that sup x |F̂(x; S k ) − F (x)| → 0 for a distribution function F , then R(S k ; δ)/n k converges in probability.
Basically, the condition requires that, when δ is applied to samples with similar empirical distributions, it should reject similar fractions of nulls from them. Loosely speaking, that means as far as the fraction of rejected nulls is concerned, δ has to "stick to" a single way of testing, rather than alternate between different ways for different data sets.
For 0 ≤ u < ∞, define Γ u = {x ∈ [0, 1] K : g(x) ≥ u}. Although {Γ u } is decreasing instead of increasing, its regularization can be made increasing. Let h(u) = ℓ(Γ u ). Then h is decreasing, h(0) = 1 and h(u) → 0 as u → ∞. Proposition 3.2 below shows that, under conditions (A) and (B), procedure (3.2) with D t belongs to those which asymptotically have the maximum power as n → ∞. Furthermore, the following statements on the procedure hold: 1) it always rejects the same set of nulls as the BH procedure does when applied to p i = h(g(ξ i )), i = 1, . . . , n; 2) for α > α * , the power is asymptotically positive and pFDR → FDR = (1 − a)α; and 3) for α < α * , the power is asymptotically 0 and pFDR → (1 − a)α * .
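Statement 1) can be sketched numerically when g is fully known: estimate h(u) = ℓ({x : g(x) ≥ u}) by uniform Monte Carlo and run the BH procedure on p i = h(g(ξ i )). The bivariate g below (independent Beta(1, 8) coordinates) is a hypothetical stand-in; for it, α* = 1/(1 − a + ag(0)) ≈ .07, so α = .2 exceeds α*.

```python
import numpy as np

rng = np.random.default_rng(3)
K, n, a, alpha = 2, 5000, 0.2, 0.2

g = lambda x: np.prod(8.0 * (1.0 - x) ** 7, axis=-1)   # assumed known density under false nulls

# h(u) = Lebesgue measure of {x in [0,1]^K : g(x) >= u}, via uniform Monte Carlo
gU = np.sort(g(rng.random((400000, K))))
h = lambda u: 1.0 - np.searchsorted(gU, u, side="left") / len(gU)

theta = rng.random(n) < a
xi = rng.random((n, K))
xi[theta] = rng.beta(1.0, 8.0, (theta.sum(), K))

p = h(g(xi))                       # p_i = h(g(xi_i)) are valid p-values under true nulls
order = np.argsort(p)
below = p[order] <= alpha * np.arange(1, n + 1) / n
reject = np.zeros(n, dtype=bool)
if below.any():
    reject[order[: np.max(np.nonzero(below)[0]) + 1]] = True
V = int(np.sum(reject & ~theta)); R = int(reject.sum())
print("R =", R, " FDP =", round(V / max(R, 1), 3))
```

The realized false discovery proportion hovers near (1 − a)α = .16, in line with statement 2).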
Example 3.2. To illustrate that in general condition (B) is needed in Proposition 3.2, let K = 1. First consider the case where the p-values ξ 1 , . . . , ξ n are i.i.d.
Finally, given c ∈ (0, 1), by a small variation of G on [t 1 , t 2 ], one can construct a G which is smooth and strictly concave, such that the above conclusions still hold. By Proposition 3.2, the most powerful procedure (3.2) satisfying condition (B) in this case is the BH procedure, and hence it is strictly less powerful than the randomized procedure at the same pFDR level.
As noted in a discussion in [4], by either accepting all nulls with probability 1 − α or rejecting all of them with probability α, it is guaranteed that FDR ≤ α; however, the FDR attained in this way is useless, because it cannot say how well one can learn from the data being analyzed. Without some coherence of a procedure, one can hardly make a sensible evaluation of its performance in a particular instance based on a measure defined as a long term average, as the measure incorporates not only the way of testing chosen for the data at hand, but also others that are potentially very different. The same comments apply to pFDR as well. Condition (B) aims to impose some coherence, which is possible if the data follows the law of large numbers. This is similar to the ergodicity assumption, whereby long term average can be approximated by an average over a single large sample.
The construction in this section requires full knowledge of the density g, which is often unavailable. However, if g is known to possess some regularities, then it is possible to apply procedure (3.2) to regular shaped D t with reasonable power. This possibility is explored next.

Nonconstant approximation of lowest order
In many cases, the following is true for the distribution G of p-values under false nulls: g(0) = sup g < ∞. (3.6) Under (3.6), smaller p-values are stronger evidence against nulls; nevertheless, the strength is bounded. By Proposition 2.1, the minimum attainable pFDR is (1 − a)α * , where α * is now equal to 1/[1 − a + ag(0)]. Following Taylor's expansion, suppose that for some γ k > 0 and ε > 0, g(x) = g(0) − γ ′ x ε + o(|x| ε ) as x → 0, (3.7) where γ = (γ 1 , . . . , γ K ) ′ and x ε denotes (x ε 1 , . . . , x ε K ) ′ . It is perhaps desirable, and expected to be true, that ε is a positive integer. However, under regular conditions, this usually is not the case. As will be seen in Section 5, for the upper-tail p-values associated with t or F statistics, ε usually is a fraction of 1. More generally, g may admit an expansion with a different ε k for each coordinate; however, for simplicity, this case will not be discussed. Rewrite the region in (3.5) as {x ∈ [0, 1] K : γ ′ x ε ≤ g(0) − u}. Near 0 such regions are ellipsoids under an L ε -norm, suggesting that the latter may be used in procedure (3.4) with reasonable power. In general, for ν = (ν 1 , . . . , ν K ) ′ with ν k ≥ 0, not all 0, define Γ u as Γ u (ν) = {x ∈ [0, 1] K : ν ′ x ε ≤ u}. (3.8) Then by procedure (3.4), the following procedure obtains.
Control based on regions (3.8). Given FDR control parameter α ∈ (0, 1), apply the BH procedure to s i = h(ν ′ ξ ε i ; ν), i = 1, . . . , n, where h(u; ν) = ℓ(Γ u (ν)). (3.9) Procedure (3.9) is "scale invariant" in ν, i.e., the set of nulls rejected by using Γ u (cν) is the same for all c > 0. If K = 1, then the procedure is simply the BH procedure and the parameter ν = ν 1 has no effect on its performance. However, when K > 1, the power of procedure (3.9) depends on ν. To analyze the power, the next lemma will be used.
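Reading (3.8) as Γ u (ν) = {x ∈ [0, 1] K : Σ k ν k x k ε ≤ u} (a hypothetical but consistent concretization), h(u; ν) can be checked against the closed form V ε ∏ k (u/ν k ) 1/ε , valid while the ellipsoid stays inside the cube (u ≤ min k ν k ); here V ε = Γ(1 + 1/ε) K /Γ(1 + K/ε) is the volume of the positive-orthant unit L ε ball.

```python
import numpy as np
from math import gamma

K, eps = 2, 0.5
nu = np.array([1.0, 2.0])
rng = np.random.default_rng(4)
x = rng.random((500000, K))

# V_eps = volume of the positive-orthant unit L^eps ball {y >= 0 : sum_k y_k^eps <= 1}
V_eps = gamma(1 + 1 / eps) ** K / gamma(1 + K / eps)
closed = lambda u: V_eps * np.prod((u / nu) ** (1 / eps))   # valid while u <= min(nu)

results = {}
for u in (0.5, 0.8):
    mc = np.mean((nu * x ** eps).sum(axis=1) <= u)          # Monte Carlo h(u; nu)
    results[u] = (mc, closed(u))
print(results)
```

Since Γ u (cν) = Γ cu (ν), rescaling ν merely reindexes the same nested family, which is the scale invariance noted above.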

Analysis of power
Recall that the power is E[(R − V )/((n − N ) ∨ 1)], with n the number of nulls and N that of true nulls. As our focus is the case n ≫ 1, we shall consider the limit Pow(α) of power at pFDR level (1 − a)α as n → ∞. In general, closed form formulas for Pow(α) are not available. To get a handle on Pow(α), our approach is to look at how fast it drops to 0 as α ↓ α * , by approximating Pow(α) as a linear combination of (α − α * ) a , a ∈ [0, ∞), or for that matter, G(D t ) as a linear combination of t a , with D t a family of nested regions used by procedure (3.2). Thus, the analysis is essentially a type of Taylor's expansion, which can provide useful qualitative information for comparing powers of different procedures.
Our analysis will only yield approximations of low orders. It remains to be seen how high order approximations can be obtained. In order to apply the results in section 3, we shall assume g satisfies (3.6) and (3.7) such that ℓ(Γ u ) is continuous where Γ u is defined in (3.5).

Dependence on parameter values
For α close to α * , the dependency of the power of procedure (3.9) on ν can be characterized as follows.
Proposition 4.2. For α > α * , let Pow o (α) be the limit of power of procedure (3.2) based on (3.5), and Pow(α) that of procedure (3.9) with ν = γ. Then Pow(α)/Pow o (α) → 1 as α ↓ α * . Moreover, let V o and D o be the sets of true nulls and false nulls, respectively, that are rejected by the first procedure, and V and D those by the second one.

Other types of nested regions
In order to compare the power of procedure (3.9) and that of procedure (3.2) based on other types of nested regions, the following comparison lemma will be used, which says that if a nested family of regions can "round up" more false nulls, then procedure (3.2) based on the regions has more power.
Let τ i be defined as in (3.2), and assume that as n → ∞, τ i converges in probability. To start with, for Γ u (ν) in (3.8), by (3.10), the regularization is given by D t = Γ u(t) (ν) with u(t) chosen so that ℓ(D t ) = t. Example 4.1. A common rule is to reject a null if and only if ∏ k ξ k is small. The rejection regions are Γ ′ u = {x ∈ [0, 1] K : ∏ k x k ≤ u}. We next show that for α ≈ α * , procedure (3.2) based on Γ ′ u has strictly less power than procedure (3.9) for any ν with ν 1 · · · ν K > 0. Roughly, the reason is that, for u ≪ 1, most of Γ ′ u is spread around the boundary surfaces x k = 0, where the density of false nulls is lower than that around 0.
Thus, by Lemma 4.1, in order to compare the powers for α ≈ α * , it is enough to compare G(D 1t ) and G(D 2t ) for t ≪ 1. Recall that ∫ D it dx dy = t. Fix a i ∈ (0, γ i ) and η > 0. The case K > 2 can be treated likewise; see Appendix A.2.
Example 4.2. Normal quantile transformations of p-values have been used as a convenient representation of data in multiple testing [9].
On the other hand, with D 2t as in Example 4.1, a similar computation applies. Thus, for α ≈ α * , the maximum power of procedure (3.2) based on D 1t is strictly less than the one based on D 2t . By Proposition 4.3, the power by using the rectangles is therefore of the same order as that by using Γ u (ν), albeit lower.
Procedure (3.2) based on rectangles has some advantages, even though its power is not maximum. First, rectangles are much easier to construct than the regularization of Γ u (ν). Second, for g satisfying (3.7), there is no need to know ε. Indeed, it suffices to try f k (t) = c k t 1/K with c 1 · · · c K = 1. By (3.4), the next procedure obtains.
Control based on rectangles. Given FDR control parameter α ∈ (0, 1), apply the BH procedure to s i = max k (ξ ik /c k ) K , i = 1, . . . , n, (4.6) where c k > 0 satisfy c 1 · · · c K = 1. By Appendix A.2, the limit of the power of procedure (4.6) can be evaluated; in particular, if c k = (γ/γ k ) 1/ε , the power is asymptotically maximized. Now suppose ξ ik are upper-tail probabilities of test statistics X ik and f k (t) = t 1/K . Then procedure (4.6) rejects H i if and only if ξ ik ≤ τ 1/K for all k, where τ is random. If τ is small, then procedure (4.6) may be viewed as one that rejects H i with large min k X ik , which makes it seem unnatural, as max k X ik is more often used in testing. However, it will be seen in Example 4.4 that procedures based on max k X ik in general cannot attain the same level of pFDR control as procedure (4.6).
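A sketch of the rectangle procedure under the reading above: ξ i ∈ D t iff t ≥ max k (ξ ik /c k ) K , so the BH procedure runs on s i = max k (ξ ik /c k ) K . The Beta(1, 8) alternative and the value of c are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(5)
K, n, a, alpha = 2, 5000, 0.2, 0.2
c = np.array([2.0, 0.5])                 # c_1 * c_2 = 1

theta = rng.random(n) < a
xi = rng.random((n, K))
xi[theta] = rng.beta(1.0, 8.0, (theta.sum(), K))   # hypothetical false-null p-values

# xi_i lies in D_t = prod_k [0, c_k t^(1/K)]  iff  t >= max_k (xi_ik / c_k)^K,
# and s_i = max_k (xi_ik / c_k)^K behaves as a p-value for small values
s = np.max(xi / c, axis=1) ** K
order = np.argsort(s)
below = s[order] <= alpha * np.arange(1, n + 1) / n
reject = np.zeros(n, dtype=bool)
if below.any():
    reject[order[: np.max(np.nonzero(below)[0]) + 1]] = True
V = int(np.sum(reject & ~theta)); R = int(reject.sum())
print("R =", R, " FDP =", round(V / max(R, 1), 3))
```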

Direct combination of test statistics
Procedure (3.9) requires p-values of test statistics. Oftentimes, procedures that only use simple combinations of test statistics seem more desirable because they do not have to evaluate p-values. However, as seen next, in many cases such procedures cannot attain pFDR control levels as low as procedure (3.9), and hence have strictly less power at low pFDR control levels.
We only consider test statistics X i = (X i1 , . . . , X iK ) with X ik being independent under false H i as well as under true H i . Let X ik follow F 0k under true H i and F k under false H i , and suppose F 0k and F k have continuous densities f 0k and f k respectively, such that f k /f 0k is increasing with a finite limit, (4.8) i.e., for each X ik , larger values are stronger evidence against H i ; however, the strength is bounded. By (4.8) and (4.9), g k are strictly decreasing and g k (0) = ρ k ∈ (1, ∞). Then g(0) = ∏ k ρ k = sup g and (3.6) is satisfied. If, in addition, (4.10) holds, then g(x) satisfies (3.7), and so by Proposition 4.1, the minimum attainable pFDR level for procedure (3.9) is (1 − a)α * , with α * = 1/(1 − a + ag(0)). In the following examples, we shall assume (4.10) holds.
Since max k ρ k can be much smaller than g(0) = ∏ k ρ k , the minimum pFDR can be significantly higher than (1 − a)α * .
with T > 0 a threshold value. In Appendix A.2, it is shown that L < g(0). As a result, the minimum attainable pFDR is strictly greater than (1 − a)α * . The argument for ∑ k X ik can then be applied to G 0k and G k . Indeed, letting ξ ik be the upper-tail p-value of X ik , by the derivations in the next section, the rejection regions admit expansions with remainder terms o k (u) → 0 as u → 0. Consequently, if ε k ≡ ε, then the above procedure can be formulated as one based on ξ i , such that the associated rejection regions are approximately Γ u (ν) in (3.9) when contracting to 0. Provided the regularizations of the rejection regions are readily available, they can be used in procedure (3.2) for FDR control as well, with approximately the same power as procedure (3.9) as α ↓ α * .

Examples of special distributions
We next show that t and F distributions satisfy (3.6) and (3.7). To this end, some general formulas are needed. Suppose X 1 , . . . , X K are test statistics that are independent not only under true nulls but also under false nulls. Let F 0k and f 0k be the marginal distribution and density of X k under a true null, and F k and f k those under a false null.
Suppose that, for some positive constants a, b, c k , d k and r k , 1 ≤ k ≤ K, conditions (5.1) and (5.2) hold. Let ξ k be the upper-tail p-value of X k and g k (u) its density. By (4.9) and (5.2), as u ↓ 0, g k (u) → r k , implying g k (0) = r k . By independence, ξ has density g(u) = ∏ k g k (u k ), which satisfies (3.6) and (3.7). Moreover, the parameters in (3.7) are determined by a, b, c k , d k and r k . Note that for F 0k = N (0, σ 2 ) and F k = N (µ, σ 2 ), where µ > 0 and σ is known, (5.3) does not hold, as sup g k = sup f k /f 0k = ∞. For simplicity, we next only consider K = 1 and omit the index k.

t distribution
Let F = t p,δ , the noncentral t distribution with p degrees of freedom and noncentrality parameter δ ≥ 0, and F 0 = t p = t p,0 . From the density of t p,δ it follows that, in (5.1) and (5.2), a = p, b = 2 and c = A/p. Then ε is obtained from (5.4); clearly, for p > 2, ε is a fraction of 1.

Numerical study
We report a simulation study on the procedures described in previous sections. The simulations are implemented in the R language [17]. We focus on testing mean vectors of multivariate normals with only partial knowledge about their variances. Model (2.1) is used to sample H 1 , . . . , H n , such that H i is "µ i = 0 in N (µ i , Σ i )" and 1{H i is false} ∼ Bernoulli(a) with a = .05 or .02, where µ i ∈ R K . We mimic the situation where the only available information is 1) under true H i , Σ i is diagonal and 2) under false H i , all the coordinates of µ i are positive. Since Σ i are not necessarily the same for different H i and there is no knowledge on their relations whatsoever, Σ i cannot be estimated by pooling the observations. Since Σ i are unknown and the values of µ i under false H i are also unknown, the tests have to rely on t-statistics. In the simulations, for each H i a sample X i1 , . . . , X i,df+1 is drawn and the test statistic is computed as (T i1 , . . . , T iK ) ′ , where T ik = √(df + 1) X̄ ik /S ik , with X̄ ik and S ik the mean and standard deviation of the k-th coordinates of the X ij . It follows that for true H i , T ik ∼ t df and for false H i , T ik ∼ t df, √(df+1) µ ik /σ ik , where µ ik is the k-th coordinate of µ i and σ ik the k-th diagonal entry of Σ i . The corresponding p-value is ξ i = (ξ i1 , . . . , ξ iK ) ′ , with ξ ik the marginal upper-tail p-value of T ik under t df .
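The data-generating step can be sketched as follows (the values µ ik ≡ .6 and σ ik ≡ 1 are hypothetical; upper-tail p-values are taken against a large simulated t df reference sample instead of the exact CDF):

```python
import numpy as np

rng = np.random.default_rng(6)
df, K, n, a = 8, 2, 20000, 0.05
mu = 0.6                                   # hypothetical common mean under false nulls

theta = rng.random(n) < a                  # 1{H_i is false} ~ Bernoulli(a)
m = df + 1                                 # observations per null
X = rng.normal(0.0, 1.0, (n, K, m))
X[theta] += mu                             # shift the samples of false nulls

Xbar = X.mean(axis=2)
S = X.std(axis=2, ddof=1)
T = np.sqrt(m) * Xbar / S                  # T_ik ~ t_df under true H_i

# upper-tail p-values xi_ik of T_ik under t_df, via a simulated null reference
ref = np.sort(rng.standard_t(df, 1_000_000))
xi = 1.0 - np.searchsorted(ref, T, side="right") / len(ref)
```

Under true nulls the ξ ik are approximately Unif(0, 1); under false nulls they concentrate near 0.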

Bivariate p-values
In this part, we compare procedures (3.9) and (4.6) on bivariate p-values in 3 groups of simulations (Table 1). In each group, procedures (3.9) and (4.6) are implemented for 6 pairs of a and Σ(r), with a = .05, .02 and r = 0, ±1/5. For the pairs with r = 0, α * = 1/[1 − a + ag(0)] is calculated using (5.4) and the results in Section 5.1. The value of α * as well as those of ε = 2/df, γ 1 and γ 2 are given in Table 1. As is seen, for all the pairs α = .15 > α * , so pFDR = (1 − a)α is attainable by the procedures. On the other hand, if only the ith (i = 1, 2) coordinate of the p-value is used, pFDR = (1 − a)α is not attainable, because in this case the minimum attainable pFDR is (1 − a)α * i with α * i > .15. The procedures are implemented with ν and c as in (6.1), where s > 0 is a tuning parameter. The reason why s ε instead of s is used for ν will be seen later. For r = ±1/5, the procedures are tested with the same sets of values ν and c as well. For groups 1 and 2, (3.13) is used to calculate h(u; ν) for procedure (3.9). For group 3, as ε = 1, (3.15) is used.
The plots of power and (p)FDR vs log 2 s are shown in Figures 1-3, labeled with "e" and "r" for procedures (3.9) and (4.6), respectively. The label "e" refers to "ellipsoid", due to the similarity of the nested regions in procedure (3.9) to Euclidean ellipsoids. For most of the plots, a = .05. The results for a = .02 are qualitatively the same, except that the power is lower and the pFDR is harder to control. For illustration, Figure 1 includes the plots of power and (p)FDR for (µ, df) = ((.6, .2), 8), r = 0 and a = .02.
The results show that for α significantly greater than α * , the power may still exhibit patterns similar to those for α ≈ α * . First, by Proposition 4.1 and (4.7), for α ≈ α * , the maximum power of procedure (3.9) is strictly greater than that of procedure (4.6). The left panels of Figures 1-3 show that this remains the case for α = .15. Second, for ν and c as in (6.1), as α ≈ α * , for both procedures, the power is approximately proportional to (s ε + 1/s ε ) −K/ε . As a result, the power curves of the procedures should be approximately symmetric, decreasing in | log 2 s| and parallel to each other. This holds quite well for α = .15, except for the plots for procedure (4.6) in Figure 1, which exhibit moderate asymmetry that may be attributable to the unequal marginal distributions of the coordinates of the p-values.
In (6.1), it is necessary to use s ε to tune ν in order to get a power curve parallel to the one for procedure (4.6). For t distributions with df > 2, ε < 1, suggesting that procedure (3.9) is more sensitive to the change in ν than procedure (4.6) is to the change in c. However, since ε is known, as the results show, the sensitivity is easy to address. For df = 2, ε = 1 and hence the power of procedure (3.9) is uniformly greater than that of procedure (4.6) (Figure 3).
The results also demonstrate the difference between pFDR and FDR. In the simulations, the FDR remains constant. However, as power decreases, the pFDR increases, sometimes quite rapidly. Unless power is high enough, the pFDR is strictly greater than the FDR. Note that this observation is made when the number of tested nulls is 5000. In theory, if power is positive, then pFDR → FDR as the number of nulls tends to ∞. The observed discrepancy between the pFDR and FDR is due to the fact that the number of nulls is not large enough for the asymptotics to take effect. Finally, as seen from Figures 1 and 2, statistical dependency between the coordinates of the test statistics may have significant influence on power and pFDR control. Nevertheless, the modality and symmetry of the power curves are quite stable. Furthermore, the effects of correlations are not obvious in Figure 3, where ε = 1 and the joint distribution of the p-values is symmetric. This apparent stability remains to be explained.

Comparison with other procedures
In this part, we take K ≥ 2 and compare procedures (3.9) and (4.6) with the methods in Examples 4.1, 4.4 and 4.5. The first method rejects H i with small ∏ k ξ ik , the second one rejects H i with large max k T ik and the third one rejects H i with large ∑ k T ik . We refer to the methods as "by-product", "by-max" and "by-sum", respectively.
The basic setup in this part is as follows. We conduct 7 groups of simulations with the values of (µ, df) given in Table 2.
Each simulation makes 3000 runs, and each run tests 6000 nulls. The (p)FDR and power are computed as Monte Carlo averages over the runs. For K > 2, unless ε = 1, the evaluation of h(u; ν) in procedure (3.9) is rather difficult. To get around this problem, by (3.10), we approximate the procedure by replacing h(u; ν) with V ε (u/ν) K/ε for all u ∈ [0, 1]. A more difficult issue is how to compare the procedures and the methods. One idea is to compare their powers at the same pFDR level (1 − a)α. However, by Examples 4.1, 4.4 and 4.5, for the values of (µ, df) in Table 2, except for the 7th one, no method attains pFDR < α = .15. For this reason, we choose to examine the (p)FDR levels of the methods when they have the same power as either procedure at pFDR level (1 − a)α. The steps are as follows. Take the by-product method and procedure (3.9) for example. Suppose the latter rejects D false nulls when applied to the ξ i . If D > 0, then sort ∏ k ξ ik in increasing order and keep rejecting the sorted nulls, starting from the first one, until D false nulls are rejected; if D = 0, then reject no null. In this way, the number of rejected true nulls of the by-product method is minimized while the number of rejected false nulls is the same as for procedure (3.9).
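The power-matching step can be sketched as follows; the mixture used to generate p-values is a hypothetical stand-in, and D plays the role of the number of false nulls rejected by the reference procedure.

```python
import numpy as np

def match_by_product(xi, theta, D):
    """Reject nulls in increasing order of prod_k xi_ik until D false nulls
    are rejected; return the realized FDP of this matched by-product method."""
    if D == 0:
        return 0.0
    order = np.argsort(np.prod(xi, axis=1))
    false_seen = np.cumsum(theta[order])          # false nulls among the top-ranked
    R = int(np.searchsorted(false_seen, D)) + 1   # smallest R catching D false nulls
    return (R - D) / R                            # V / R, with V = R - D true nulls

rng = np.random.default_rng(7)
n, K, a = 5000, 2, 0.1
theta = rng.random(n) < a
xi = rng.random((n, K))
xi[theta] = rng.beta(1.0, 4.0, (int(theta.sum()), K))   # hypothetical alternative
fdp = match_by_product(xi, theta, D=int(0.5 * theta.sum()))
print(round(fdp, 3))
```

By construction, the matched method rejects the fewest true nulls compatible with rejecting the same number of false nulls, so its reported (p)FDR is a lower bound.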
In each group, for each combination of a and Σ(r), procedures (3.9) and (4.6) are simulated with ν k = γ k and c k = (γ/γ k ) 1/ε , which are approximately the parameter values yielding maximum power for α ≈ α * . For each procedure, the by-product, by-max and by-sum methods are compared to it in the way described above. The results are reported in Tables 3-6. In groups 1-6, procedure (3.9) has more power than (4.6), often with a large margin. In all the cases where both procedures are able to control the pFDR around (1−a)α, the methods have substantially higher FDR and pFDR when their powers are matched to that of procedure (3.9) or (4.6). The results show that the methods either cannot control the pFDR at the level of (1 − a)α (which is indeed the case) or, alternatively, they can only control the pFDR with much lower power than procedures (3.9) and (4.6).
Unlike groups 1-6, in group 7, each coordinate of the vector of t-statistics provides strong evidence to identify false nulls. By only using the 1st or 3rd coordinate, the minimum attainable pFDR is 2.4×10 −5 for a = .05 and 6×10 −5 for a = .02, and by only using the 2nd coordinate, the value is even lower.
Table 3. Simulation results for groups 1 and 2 described in Section 6.2. In each group, the results are organized according to r = 0, 1/5, −1/5. The numbers in the rows for the "by-product" method are its FDR and pFDR as its power is pegged to procedure (3.9) or (4.6). The numbers in the rows for the "by-sum" and "by-max" methods are likewise.
As Table 6 shows, procedures (3.9) and (4.6) identify all the false nulls. Since almost all the p-values of false nulls are smaller than those of true nulls, due to how the by-product method is implemented, it rejects very few true nulls and hence has near-zero (p)FDR. The same is true for the other two methods. The pFDR of procedure (3.9) is significantly lower than (1 − a)α, because the approximation we use for h(u; ν), i.e. V ε (u/ν) K/ε , is strictly greater than h(u; ν) when u > min ν k and hence inflates s i in (3.9). This causes the BH procedure to reject more nulls with s i not very close to 0. As these nulls are exclusively true nulls, the resulting (p)FDR is lower. Finally, in order to see how procedures (3.9) and (4.6) perform when the parameters ν and c are not set to their respective asymptotically optimal values, we simulate groups 1 and 2 again with ν = c = (1, 1, . . . , 1). As Table 7 shows, across the simulations, for each procedure, the power is lower than in Table 3 but not dramatically so, while the (p)FDR is quite stable. The (p)FDR levels of the other 3 methods tend to be lower than in Table 3. As in group 7, this can be explained by how the methods are implemented.

Discussion

Role of p-values
We have followed the tradition of using p-values for hypothesis testing. The general procedure in this work, i.e., (3.2), utilizes the fact that the p-value of a continuous multivariate statistic can be defined in such a way that its coordinates are i.i.d. ∼ Unif(0, 1). The interpretation of a p-value as a measure of how "rare" or "suspicious" an observation looks is irrelevant, even though in many cases smaller p-values are indeed more likely to be associated with false nulls. Thus, in this work, p-values serve as a mechanism to "flatten" the probability landscape of true nulls and hence facilitate exploring subtle differences between true and false nulls.
Table 6. Simulation results for group 7 described in Section 6.2. "-" means value equal to the nonmissing value in the same row.
Since what essentially matters to procedure (3.2) is nested events with specific probabilities, it can easily be modified to directly handle test statistics instead of their p-values. Indeed, in (3.2), D_t ⊂ [0, 1]^K can be replaced with nested E_t in the domain of the test statistics, such that P(E_t) = t under true nulls. Analysis of the power of this modification might yield some useful insight. For example, weighted L_2 norms are commonly used as the criterion for acceptance/rejection. However, as shown in Examples 4.5 and 4.6, in more challenging cases one may need to consider L_p norms with p < 0. On the other hand, the modification does not simplify the testing problem, as probabilities still have to be evaluated. Nevertheless, as remarked next, the notion of using nested regions in spaces other than [0, 1]^K is useful.
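To make the nested-region mechanism concrete, here is a minimal numerical sketch (not the procedures of Sections 3-4; the scoring rule and the simulated alternative are illustrative assumptions). Each multivariate p-value ξ_i ∈ [0, 1]^K is mapped to the level t of the smallest region in the nested family {x : Π_k x_k ≤ u} that contains it, and the BH procedure is then run on the resulting scalar levels:

```python
import math

import numpy as np

def bh_procedure(p, alpha):
    """Benjamini-Hochberg step-up: reject the nulls with the R smallest
    p-values, where R = max{i : p_(i) <= i * alpha / n}."""
    n = len(p)
    order = np.argsort(p)
    below = np.flatnonzero(p[order] <= alpha * np.arange(1, n + 1) / n)
    reject = np.zeros(n, dtype=bool)
    if below.size:
        reject[order[: below[-1] + 1]] = True
    return reject

def product_region_level(x):
    """Level t of the smallest region in the nested family
    {x : prod_k x_k <= u} containing x: for K i.i.d. Unif(0,1)
    coordinates, t = P(prod_k U_k <= u) = u * sum_{k<K} (-ln u)^k / k!."""
    u = float(np.prod(x))
    return u * sum((-math.log(u)) ** k / math.factorial(k) for k in range(len(x)))

rng = np.random.default_rng(0)
K, n, n_false = 3, 1000, 50
xi = rng.uniform(size=(n, K))                        # true nulls: Unif[0,1]^K
xi[:n_false] = rng.uniform(size=(n_false, K)) ** 6   # false nulls: pushed toward 0
levels = np.array([product_region_level(x) for x in xi])
reject = bh_procedure(levels, alpha=0.1)
print(reject[:n_false].sum(), reject[n_false:].sum())  # rejected false / true nulls
```

The product family here corresponds to the "by-product" combination; any other nested family with computable levels under true nulls could be substituted for `product_region_level`.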

Incorporating discrete components
Oftentimes, test statistics have nontrivial discrete components. For example, test statistics for different nulls may have different dimensions or degrees of freedom. In this case, the discrete component may be expressed as a scalar. However, if the test statistics are multivariate but only partially observed, then the discrete component in general has to be set-valued, accounting for the observed coordinates. Procedure (3.2) can be modified as follows. Suppose Z is the discrete component of test statistic T such that for any z, the conditional distribution of T given Z = z has a density and is K(z)-dimensional. Then, in (3.2), the nested regions can be chosen conditionally on Z, i.e. D_t(z) ⊂ [0, 1]^{K(z)} with P(ξ_i ∈ D_t(z) | Z = z) = t under true nulls. An apparently simpler alternative is to conduct separate tests on statistics with different values of the discrete components. This alternative fails to take into account the distribution of the discrete components and hence may have lower power.

Table 7. Simulation results for groups 1 and 2. The setting is similar to that in Table 3, except that ν in (3.9) and c in (4.6) are set equal to (1, 1, . . . , 1).
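A minimal sketch of this pooled, conditional construction (the max-coordinate family below is a hypothetical choice, used only because its level under a true null has the closed form u^{K(z)}): each null's multivariate p-value is mapped to a scalar level using a nested family matched to its own dimension K(z), and all levels enter a single BH run rather than separate tests per value of z.

```python
import numpy as np

def max_region_level(x):
    # Level of the smallest region {v in [0,1]^K : max(v) <= u} containing x
    # when the K coordinates are i.i.d. Unif(0,1): P(max_k U_k <= u) = u**K.
    return float(np.max(x)) ** len(x)

rng = np.random.default_rng(1)
# Hypothetical setup: 500 nulls with K(z) = 2 and 500 with K(z) = 3,
# all true, with multivariate p-values uniform on [0,1]^{K(z)}.
data = [(z, rng.uniform(size=z)) for z in (2, 3) for _ in range(500)]
levels = np.array([max_region_level(x) for _, x in data])

# One pooled BH run at level 0.1 across both values of z,
# instead of separate tests for z = 2 and z = 3.
n = len(levels)
order = np.argsort(levels)
passed = np.flatnonzero(levels[order] <= 0.1 * np.arange(1, n + 1) / n)
n_reject = int(passed[-1]) + 1 if passed.size else 0
print(n_reject)  # all nulls are true here, so few if any rejections
```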

Power optimization
When the distribution under false nulls is only partially known, attaining maximum power can be a difficult issue. To see this better, consider testing the null "µ = 0" for N(µ, I) based on a single observation X. As the variance is known to be I, the most powerful test statistic would be ν′X, provided that the true value ν of µ under false nulls is known. However, when ν is unknown, unless there is strong evidence on its whereabouts, one has to search over a large region of µ to improve the power, which becomes more difficult as the dimension of ν gets higher.
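A quick simulation illustrates the point (the dimension, mean, and directions below are illustrative assumptions, not from the paper): with |ν| = 1, ν′X ∼ N(ν′µ, 1), so the one-sided level-.05 test based on ν′X loses most of its power when the chosen direction misses the true mean.

```python
import numpy as np

# Under a false null, X ~ N(mu, I) with mu unknown; the statistic v'X is
# most powerful when v points along mu.  With |v| = 1, v'X ~ N(v'mu, 1),
# and the one-sided level-.05 test rejects when v'X > 1.645.
rng = np.random.default_rng(2)
d = 10
mu = np.zeros(d)
mu[0] = 2.0                              # true mean under the false null
X = rng.standard_normal((100_000, d)) + mu
v_good = mu / np.linalg.norm(mu)         # oracle direction: v'mu = 2
v_bad = np.ones(d) / np.sqrt(d)          # diffuse guess:    v'mu = 2/sqrt(10)
power_good = (X @ v_good > 1.645).mean()
power_bad = (X @ v_bad > 1.645).mean()
print(round(power_good, 3), round(power_bad, 3))
```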
One way to improve power is to restrict the search to parametric families of nested regions. This is the approach taken in Section 3.3. If the parameter involved is of high dimension, some type of stochastic optimization [19] may be needed. On the other hand, the regions that attain maximum power may have several disconnected components, which makes it difficult to approximate them with a single parametric family of nested regions. An alternative, therefore, is to try different families of nested regions at different locations in the domain of p-values and combine the results appropriately [7].
Appendix: Proofs

To show Proposition 3.2, we need a few preliminary results. Recall p_i = h(g(ξ_i)).
Lemma A.1.1. Let h be continuous. Then 1) p_i ≤ t ⇐⇒ ξ_i ∈ D_t, and 2) under the assumptions of Lemma 3.1, G(D_t) is strictly concave.
proving the right-continuity of D_t. By the continuity of h, ℓ(D_t) = ℓ(Γ_{h*(t)}) = h(h*(t)) = t. The rest of (3.1) is easy to check.
Denote by N the number of true nulls and, for any given procedure, denote by R and V the numbers of rejected nulls and rejected true nulls, respectively. We shall show: i) procedure (3.2) with D_t satisfies conditions (A) and (B) and attains FDR = (1 − a)α; ii) the search for procedures with maximum power can be restricted to those that reject, and only reject, nulls with the largest g(ξ_i); and iii) for such a procedure, R/n converges in probability to a nonrandom number. From these results, the proof follows without much difficulty.

i) By Lemma A.1.1 1), procedure (3.2) with D_t = Γ_{h*(t)} is the same as the BH procedure applied to p_1, . . . , p_n. Therefore, statement 1) holds and FDR = (1 − a)α [2,22]. Since the set of rejected nulls is uniquely determined by ξ_1, . . . , ξ_n, the procedure satisfies condition (A).
ii) θ_i is conditionally independent of (δ_i, R) given ξ_1, . . . , ξ_n, and R = Σ_{i=1}^n δ_i. Consequently, the expected number of rejected false nulls is maximized, with equality attained, exactly when the rejected nulls are those with the R largest r_i; here N = Σ_{i=1}^n (1 − θ_i), and θ_(i) denotes the θ corresponding to the null with the ith largest r_i. Note that r_i ≥ r_j ⇐⇒ g(ξ_i) ≥ g(ξ_j) ⇐⇒ p_i ≤ p_j. Construct procedure δ′ which first applies δ and then, provided δ rejects R nulls, rejects the nulls with the R smallest p_i instead. It follows that 1) if δ has FDR ≤ (1 − a)α, then so does δ′; 2) δ′ is at least as powerful as δ; 3) if δ satisfies condition (A), then, as the second step of δ′ is uniquely determined by ξ_1, . . . , ξ_n, δ′ satisfies condition (A) as well; and 4) since δ and δ′ reject the same number of nulls, if one satisfies condition (B), so does the other.
iii) Let δ satisfy conditions (A) and (B) and attain maximum power asymptotically while having FDR ≤ (1 − a)α. As n → ∞, the empirical distribution of ξ_1, . . . , ξ_n converges to the distribution with density 1 − a + ag(x). By condition (B), R/n converges in probability to some s ∈ [0, 1]. Let F_n be the empirical distribution of the p_i. Then F_n converges in probability to F. Since F is strictly concave and continuous, F is strictly increasing. By Lemma A.1.3, F*_n(R/n) converges in probability to F*(s). Since δ rejects, and only rejects, nulls with p_i ≤ F*_n(R/n), δ is asymptotically equivalent to a procedure which rejects, and only rejects, nulls with p_i ≤ t* = F*(s). If t* > 0, then by the law of large numbers and dominated convergence, as n → ∞, FDR → (1 − a)t*/F(t*). In order to attain maximum power while maintaining FDR ≤ (1 − a)α, t* has to be the largest value of t satisfying t/F(t) ≤ α. It is easy to see that for α ∈ (α*, 1), t* is the unique positive solution of t/α = F(t). Combined with part i) of the proof, this shows procedure (3.2) with D_t = Γ_{h*(t)} can be taken as δ. Furthermore, in this case, power → G(D_{t*}) > 0 and, since P(R > 0) → 1, pFDR = (1 + o(1))FDR → (1 − a)α. Thus 2) is proved. On the other hand, for α < α*, no t > 0 satisfies t/F(t) ≤ α. As a result, t* = 0. Thus the power of δ is asymptotically 0, and procedure (3.2) with D_t again can be taken as δ. Furthermore, by [7], the procedure has pFDR → (1 − a)α*. This proves 3).
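The characterization of t* as the unique positive solution of t/α = F(t) lends itself to fixed-point iteration. A toy illustration (a and G(D_t) below are assumptions, chosen so that F(t) = (1 − a)t + aG(D_t) is strictly concave as required):

```python
import math

a, alpha = 0.2, 0.1
F = lambda t: (1 - a) * t + a * math.sqrt(t)   # hypothetical G(D_t) = sqrt(t)

# Iterate t <- alpha * F(t); since F is concave and increasing, the
# iteration contracts to the unique positive solution of t/alpha = F(t).
t = 1.0
for _ in range(200):
    t = alpha * F(t)

# For this F the solution has a closed form: t* = (a*alpha/(1-(1-a)*alpha))**2
print(t)
```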
By (4.2), the density of h(Z_i) at 0 is g(0). Because procedure (3.9) is the BH procedure applied to the h(Z_i), it follows that its minimum attainable pFDR is (1 − a)α*.
Proof of Lemma 4.1. By the assumptions of the lemma and [12], Pow_i(α) = G(D_{i,t*_i}), where t*_i = sup{t : (1 − a)t + aG(D_{it}) ≥ t/α}, and furthermore t*_i < T. Since G(D_{1t}) < G(D_{2t}) for t < T and both are continuous, it is seen that t*_1 < t*_2. On the other hand, G(D_{2t}) is nondecreasing in t, so G(D_{1,t*_1}) < G(D_{2,t*_1}) ≤ G(D_{2,t*_2}). As a result, Pow_1(α) < Pow_2(α).