Multiple hypothesis testing on composite nulls using constrained p-values

Abstract: Multiple hypothesis testing often encounters composite nulls and intractable alternative distributions. In this case, using p-values that are defined as maximum significance levels over all null distributions ("$p_{\max}$") often leads to very conservative testing. We propose constructing p-values via maximization under linear constraints imposed by the data's empirical distribution, and show that these p-values allow the false discovery rate (FDR) to be controlled with substantially more power than $p_{\max}$.


Introduction
In multiple hypothesis testing, much research has been carried out on the case of simple nulls, where the data associated with true nulls follow a common distribution [8,12]. On the other hand, the case of composite nulls, where the data associated with true nulls may follow different distributions, is often encountered and challenging.
Typically, for a multiple testing procedure, p-values are the only accessible information [14,19,20,23,31]. Thus, how p-values are defined is important to the performance of the procedure. For composite nulls, p-values are usually defined as maximum significance levels over all null distributions, resulting in conservative tests [19]. On the other hand, a Bayesian approach is possible [12], which assumes that there is a known mixture probability distribution on true nulls. Under the assumption, the overall distribution associated with true nulls can be obtained as a weighted integral of the null distributions, which essentially reduces testing on composite nulls to testing on simple nulls.
The article aims to develop an approach in between the above two. Its premise is that there exists a mixture probability distribution on true nulls; however, it is unknown. In general, in cases like this, an important issue is how to extract extra information from the data [7,8]. One way is to assume that there is specific knowledge of both the null and alternative distributions, so that a few parameters suffice to characterize them. Under this assumption, the parameters can be estimated from data [17]. However, in exploratory studies, the goal is often to identify novel signals with no prior knowledge of what they might look like; that is to say, no parametric form of the alternative distribution can be presumed. In this case, there can be a wide choice of mixture probability distributions and alternative distributions. Roughly speaking, given a candidate mixture probability distribution, all the misfits to the data can be attributed to false nulls. This makes consistent estimation of the mixture probability distribution infeasible. At the same time, although there can be many different distributions associated with false nulls, since they and their composition are intractable, it is suitable to treat the data associated with false nulls as being sampled from a single, intractable, overall distribution.
From the above position, we shall study the case where there are only a finite number of null distributions. This case may serve as an approximation to the case where there are infinitely many null distributions; see the discussion in Section 6. Meanwhile, in pattern classification, it is common to deal with a finite number of distributions [4,15]. On the one hand, for patterns that are known, their respective distributions can be learned beforehand as null distributions. On the other, in a novel environment, the composition of these patterns may be intractable due to the presence of false nulls, i.e., unknown novel patterns. Regardless of the specifics, the basic point is that the mixture of the null distributions is dominated by the empirical distribution of the data, except for a small error. As a result, the p-values can be defined via maximization over a range of linear combinations of the null distributions, with the coefficients constrained by the empirical distribution.
The foremost goal of the article is the construction of constrained p-values. These p-values can be used like any other p-values for multiple testing. To evaluate their efficacy, one has to choose an error criterion. We select the FDR due to the interest it has received in the literature. Several FDR controlling procedures have been proposed with the aim of improving power [1,10,30]. We shall study the FDR control and power of some of these procedures based on the constrained p-values.
In the next section, we set up basic assumptions and review some FDR controlling procedures. In Section 3, we describe the initial step of the construction of constrained p-values, i.e., obtaining p-values under individual nulls. We also formulate some common types of p-values in ways that motivate the construction. In Section 4, we introduce two types of constrained p-values and state some results on the FDR control based on them. As the proofs of the results are somewhat involved, we put them in the Appendix. In Section 5, we conduct a numerical study of the FDR control and power of several procedures based on the constrained p-values and compare the results with those for other types of p-values. Section 6 concludes with some discussion.

Step-up procedure for FDR control
Step-up (SU) procedures are widely used for multiple testing [16,20,28]. If $p_{(1)} \le \cdots \le p_{(n)}$ are the sorted p-values, then an SU procedure rejects and only rejects the nulls whose p-values have ranks no greater than $R = \max\{i : p_{(i)} \le c_{i:n}\}$, with $\max\emptyset := 0$, where $c_{1:n}, \ldots, c_{n:n}$ are certain critical values [11].
To conduct multiple tests, one first has to specify an error criterion. There are quite a few such criteria available, and a major line of research is to improve power while controlling the error [24,25]. We shall use the FDR as the error criterion. Several SU procedures have been proposed to control the FDR, all of which set $c_{i:n} = \rho(i/n)$ for some function ρ. Perhaps the most well-known such procedure is the Benjamini-Hochberg procedure, which uses $\rho(t) = \alpha t$, where α is the target FDR control level [1]. The procedure will be coded as BH. We shall also consider a few of its variants that aim to improve power.
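To make the SU template concrete, the following R sketch (ours, not from the article; the helper name step_up is hypothetical) implements a generic SU procedure and applies it with the BH critical values $c_{i:n} = \alpha i/n$.

    # A minimal sketch of a generic step-up procedure: reject the R nulls with
    # the smallest p-values, where R = max{i : p_(i) <= c_{i:n}}, max(empty) := 0.
    step_up <- function(p, crit) {
      ord <- order(p)
      idx <- which(p[ord] <= crit)
      R <- if (length(idx) == 0) 0 else max(idx)
      rejected <- logical(length(p))
      if (R > 0) rejected[ord[seq_len(R)]] <- TRUE
      rejected
    }

    # BH uses the critical values c_{i:n} = alpha * i / n.
    n <- 1000; alpha <- 0.25
    p <- c(runif(950), rbeta(50, 1, 50))     # hypothetical p-values, 50 false nulls
    sum(step_up(p, alpha * seq_len(n) / n))  # number of rejections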
The first variant was proposed by Storey, Taylor and Siegmund in [30]. Let
$$\widehat{\mathrm{FDR}}^*_\lambda(t) = \frac{\hat\pi_0(\lambda)\, n t}{R(t) \vee 1}, \qquad \hat\pi_0(\lambda) = \frac{n - R(\lambda) + 1}{n(1-\lambda)}, \qquad (2.1)$$
where λ ∈ (0, 1) is a parameter and $R(t) = \#\{i : p_i \le t\}$. The procedure rejects and only rejects the nulls whose p-values have ranks no greater than $R = \max\{i : \widehat{\mathrm{FDR}}^*_\lambda(p_{(i)}) \le \alpha\}$. The procedure is an SU procedure, with data-dependent critical values determined by $\widehat{\mathrm{FDR}}^*_\lambda$. It will be coded as STS.
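For reference, a minimal R sketch of the STS rule as recalled above, assuming the standard estimator $\hat\pi_0(\lambda)$ from [30]; the function name sts_rejections is ours.

    # Sketch of STS: reject nulls with ranks up to max{i : FDR*_lambda(p_(i)) <= alpha}.
    # For distinct p-values, R(p_(i)) = i, so FDR*_lambda(p_(i)) = pi0_hat * n * p_(i) / i.
    sts_rejections <- function(p, alpha = 0.25, lambda = 0.5) {
      n <- length(p)
      pi0_hat <- (n - sum(p <= lambda) + 1) / (n * (1 - lambda))
      ps <- sort(p)
      idx <- which(pi0_hat * n * ps / seq_len(n) <= alpha)
      if (length(idx) == 0) 0 else max(idx)   # the number of rejections R
    }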
We shall also consider the SU procedures proposed by Finner, Dickhaus, and Roters in [10]. Fix a parameter κ ∈ (0, 1] and let $f_\alpha$ denote their asymptotically optimal rejection curve,
$$f_\alpha(t) = \frac{t}{\alpha + (1-\alpha)t};$$
the three procedures use critical values (2.2) derived from $f_\alpha$ and κ. The procedures will be coded as F1, F2, and F3, respectively. Recall that for multiple testing, if R is the number of rejected nulls and V that of rejected true nulls, then $\mathrm{FDR} = E[V/(R \vee 1)]$. By definition, if there are n nulls in total and N of them are true, then $\mathrm{power} = E[(R - V)/((n - N) \vee 1)]$.

P-values under individual null distributions
To construct suitable p-values for composite nulls, we first need to consider how to define p-values under an individual $F_\theta$. These p-values are the basic "building blocks". Let $\{D_t \subset \mathbb{R}^d : t \in \mathbb{R}\}$ be a family of Borel sets such that

D1. the family is increasing and right-continuous, i.e., $D_s \subset D_t$ for $s < t$ and $D_t = \bigcap_{s>t} D_s$ for any t;
D2. $\bigcup_t D_t = \mathbb{R}^d$;
D3. $\bigcap_t D_t = \emptyset$.

For each t, denote the probabilities of $D_t$ under G and $F_\theta$ by
$$\psi(t) = G(D_t), \qquad \phi_\theta(t) = F_\theta(D_t). \qquad (3.1)$$
By D1-D3, $\psi(t)$ and $\phi_\theta(t)$ are nondecreasing and right-continuous, with $\psi(-\infty) = \phi_\theta(-\infty) = 0$ and $\psi(\infty) = \phi_\theta(\infty) = 1$. Define
$$s(x) = \inf\{t : x \in D_t\}. \qquad (3.2)$$
By D2, $s(x) < \infty$ is well-defined, and by D3, $s(x) > -\infty$. The next result, Proposition 3.1, is basic. Its part (1) implies that $\phi_\theta(s_i)$ can be used as p-values of the $X_i$ under $F_\theta$, and its part (3) shows that $R_n(t)/n$, with $R_n(t) = \#\{i : s_i \le t\}$ (3.3), is the empirical distribution of the $s_i$. Apparently, the p-values depend on $D_t$. While the selection of $D_t$ can have a strong effect on multiple testing [5], the issue is beyond the scope of the article. Remark.
(1) If the index set of a nested family $\{D_t, t \in I\}$ is an interval $I \ne \mathbb{R}$, then, by defining $D_s = \bigcap_{t\in I} D_t$ for $s \le \inf I$ and $D_s = \mathbb{R}^d$ for $s \ge \sup I$, we can extend the nested family to one with index set $\mathbb{R}$, so that the discussion in the subsequent sections still applies.
(2) In single hypothesis tests, nested rejection regions are usually indexed by the significance level. As already seen, other indices can be used as well, which is sometimes more natural and avoids potential problems caused by different regions having the same significance level. As an example, let $X_i$ be real-valued. If we set $D_t = (-\infty, t]$, then $s_i = X_i$ and $\phi_\theta(s_i) = F_\theta(s_i)$, the lower-tail probability of $X_i$; if we set $D_t = [-t, \infty)$, then $s_i = -X_i$ and $\phi_\theta(s_i)$ is the upper-tail probability of $X_i$. For $F_\theta$ continuous at 0, if we set $D_t = [-t, t]$ for $t \ge 0$ and $D_s = \{0\}$ for $s < 0$, then almost surely, $s_i = |X_i|$ and $\phi_\theta(s_i) = F_\theta([-|X_i|, |X_i|])$, the central probability of $X_i$.
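To illustrate the remark, the following R lines (a sketch assuming a hypothetical N(0, 1) null) show the p-value $\phi_\theta(s_i)$ induced by each of the three choices of $D_t$.

    x <- 1.7                        # a hypothetical observation
    pnorm(x)                        # D_t = (-inf, t]: s = x,  lower-tail probability
    1 - pnorm(x)                    # D_t = [-t, inf): s = -x, upper-tail probability
    pnorm(abs(x)) - pnorm(-abs(x))  # D_t = [-t, t]:   s = |x|, central probability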

Maximum significance levels as p-values
The function $M_{\max}(t) = \sup_{\theta\in\Theta} \phi_\theta(t)$ is nondecreasing with $M_{\max}(-\infty) = 0$ and $M_{\max}(\infty) = 1$. By Proposition 3.1, $M_{\max}(s_i)$ is the maximum significance level of $X_i$ over all possible null distributions, which is a conventional p-value under composite nulls [19]. We henceforth denote $p_{i,\max} = M_{\max}(s_i)$.
It is known that using $p_{i,\max}$ as p-values, BH and its variants in (2.1) and (2.2) can control the FDR [3,10,30]. The issue here is the power of the procedures using $p_{i,\max}$. This will be studied in the simulations in Section 5.
Observe that $M_{\max}(t)$ can be written as
$$M_{\max}(t) = \sup_{\mu} \int_\Theta \phi_\theta(t)\, d\mu(\theta), \qquad (3.4)$$
where the supremum is taken over all possible measures µ on Θ with $\mu(\Theta) \le 1$. If no information is available on ν or a, then the above unconstrained supremum is justified. If, on the other hand, it is known that ν and a satisfy certain conditions, then, by constraining the supremum with the conditions, it is possible to get significance levels closer to $p_{i,\mathrm{mix}} = M_{\mathrm{mix}}(s_i)$, hence improving the performance of multiple testing. Here
$$M_{\mathrm{mix}}(t) = (1-a)\int_\Theta \phi_\theta(t)\, d\nu(\theta). \qquad (3.5)$$
Indeed, since $\int_\Theta \phi_\theta(t)\, d\nu(\theta)$ is the probability of $D_t$ under the mixture of the null distributions, it is known that by using $p_{i,\mathrm{mix}}$ as the p-values, the power of BH is maximized when the FDR is controlled at a target level [2,13,29].
Remark. Since for $a > 0$, the largest possible value of $p_{i,\mathrm{mix}}$ is $1 - a < 1$, strictly speaking, the $p_{i,\mathrm{mix}}$ are not significance levels. Nevertheless, for convenience, we shall still refer to them as significance levels or p-values.
(Henceforth, boldfaced letters denote L-dimensional vectors.) Denote $\phi = (\phi_1, \ldots, \phi_L)^\top$, $\nu = (\nu_1, \ldots, \nu_L)^\top$, and $\Delta = \{c \in [0,1]^L : c_1 + \cdots + c_L \le 1\}$. Let $s_i = s(X_i)$ as in (3.2). From Proposition 3.1, $s_1, \ldots, s_n$ are i.i.d. $\sim Q$, with
$$Q(t) = (1-a)\nu^\top \phi(t) + a\psi(t),$$
and their empirical distribution function is $F_n(t) = R_n(t)/n$, where $R_n$ is as in (3.3). Now (3.4) and (3.5) can be rewritten as
$$M_{\max}(t) = \sup_{c\in\Delta} c^\top \phi(t), \qquad M_{\mathrm{mix}}(t) = (1-a)\nu^\top \phi(t).$$
At the same time, we still have $p_{i,\max} = M_{\max}(s_i)$ and $p_{i,\mathrm{mix}} = M_{\mathrm{mix}}(s_i)$.
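As a concrete instance of these formulas, here is an R sketch with L = 2 hypothetical null distributions N(0, 1) and N(1, 1) and $D_t = (-\infty, t]$; over the simplex ∆, the supremum $\sup_{c\in\Delta} c^\top\phi(t)$ reduces to $\max_k \phi_k(t)$.

    # phi(t) = (F_1(D_t), F_2(D_t)) for two hypothetical nulls.
    phi <- function(t) c(pnorm(t), pnorm(t, mean = 1))
    M_max <- function(t) max(phi(t))                   # yields p_max at t = s_i
    M_mix <- function(t, a = 0.05, nu = c(0.5, 0.5))   # yields p_mix at t = s_i
      (1 - a) * sum(nu * phi(t))
    M_max(-1.5); M_mix(-1.5)                           # M_mix < M_max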
Fig. 1. Constraints based on the empirical distribution. Since a and ν are unknown, the graph of $(1-a)\nu^\top\phi(t)$ is intentionally missing.
The expressions for $M_{\max}$ and $M_{\mathrm{mix}}$ suggest a general approach to constructing p-values, namely, by constrained maximization of $c^\top\phi(s_i)$. Certainly, we hope to get p-values that are not as conservative as $p_{i,\max}$ but are still conservative. Meanwhile, the closer we can make the p-values to $p_{i,\mathrm{mix}}$, the better. What constraints can be used, then?
A straightforward idea is to find a suitable $C \subset \Delta$ and define p-values as $\sup_{c\in C} c^\top\phi(s_i)$. Under the above considerations, we need $(1-a)\nu \in C$, which sometimes can be satisfied. For example, if it is known for sure that $a \le a_0$, then, as $(1-a)\nu \in \Delta' = \{c \in \Delta : c_1 + \cdots + c_L \ge 1 - a_0\}$, one can set $C = \Delta'$. In general, however, are there constraints that are available under more general conditions? It turns out that by exploiting properties of empirical distributions, many constraints can be obtained. To get the constraints, we assume that $\phi_1, \ldots, \phi_L$ and ψ are continuous.
The constraints can be roughly divided into two types, as described next; see Fig. 1 for reference.

First type
Since $(1-a)\nu^\top[\phi(t) - \phi(s)] \le Q(t) - Q(s)$ for all $s < t$, if Q were observable, the inequalities
$$c^\top[\phi(t) - \phi(s)] \le Q(t) - Q(s), \qquad -\infty \le s < t, \qquad (\mathrm{A})$$
could serve as constraints on the maximization of $c^\top\phi(t)$. Since Q is in fact unobservable, we need to find an observable function $\tilde Q$ as a substitute. In general, $\tilde Q$ has to be made from the data, so it will have some fluctuations. With this in mind, we should expect a new set of inequalities (B) which have to meet two criteria with high probability. First, the inequalities should hold for $(1-a)\nu$, so that the resulting p-values are conservative. Second, the constraints imposed by (B) should not be too lax compared with (A). A good choice for $\tilde Q$ is $F_n + \epsilon_n$, where $0 < \epsilon_n \ll 1$ is such that, with high probability, $F_n(t) - F_n(s) + \epsilon_n \ge Q(t) - Q(s)$ for all $t > s \ge -\infty$. Using probability inequalities, such $\epsilon_n$ can be found. Then the inequalities in (B) become
$$c^\top[\phi(t) - \phi(s)] \le F_n(t) - F_n(s) + \epsilon_n, \qquad -\infty \le s < t. \qquad (4.1)$$
In practice, among the infinitely many inequalities in (4.1), one has the freedom of choice. If, for example, only $c^\top\phi(t) \le F_n(t) + \epsilon_n$ are used as constraints due to concerns about computational cost, then $c_1^\top\phi(t)$ and $c_2^\top\phi(t)$ in Fig. 1 are functions satisfying the constraints.
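One hedged way to choose $\epsilon_n$ is via the DKW inequality (the simulations in Section 5 instead use $\epsilon_n = \ln n/n$): if $\sup_t |F_n(t) - Q(t)| \le \epsilon_n/2$ with probability at least $1 - \delta$, then the inequalities in (4.1) hold for $(1-a)\nu$ on that event.

    # DKW: P(sup_t |F_n(t) - Q(t)| > lam) <= 2 * exp(-2 * n * lam^2).  Solving
    # 2 * exp(-2 * n * (eps/2)^2) = delta for eps gives the margin below.
    eps_dkw <- function(n, delta = 0.05) 2 * sqrt(log(2 / delta) / (2 * n))
    eps_dkw(5000)   # about 0.038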
Remark. (1) Under the constraints in (4.1), it is possible that $c^\top\phi(t) > Q(t)$, as seen from the graphs of $c_2^\top\phi$ and Q in Fig. 1. (2) For different t, the value of c that maximizes $c^\top\phi(t)$ may be different. For example, in Fig. 1, for t around $s_{(i)}$, $c_2$ is a more plausible maximizer than $c_1$, whereas for t around $s_{(n)}$, the opposite is true.

Second type
Since the constraints in (4.1) are meant to be satisfied with high probability by all pairs $s < t$ simultaneously, they are not necessarily very tight for particular values of t. It is possible to replace some of the constraints with tighter ones based on local properties of the empirical distribution. Since the tails of a sample are often handled more carefully than its other parts in hypothesis tests, we shall focus on $s_{(i)}$ with $i \ll n$. However, the discussion applies equally well to any small set of the $s_i$.
The idea is as follows: for each small i, find $z_i$ such that, with high probability, $Q(s_{(i)}) \le z_i$. Since $(1-a)\nu^\top\phi(t) \le Q(t)$ for every t, the inequalities
$$c^\top\phi(s_{(i)}) \le z_i, \qquad i \le m_n, \qquad (4.2)$$
then hold for $c = (1-a)\nu$ with high probability and can serve as constraints.
To find such $z_i$, note that the $Q(s_i)$ are i.i.d. ∼ Unif(0, 1), as $Q = (1-a)\nu^\top\phi + a\psi$ is continuous by our assumption. Therefore, $Q(s_{(1)}), \ldots, Q(s_{(n)})$ have the same joint distribution as $(S_1/S_{n+1}, \ldots, S_n/S_{n+1})$, where $S_i = \xi_1 + \cdots + \xi_i$ and the $\xi_j$ are i.i.d. standard exponential random variables. Since $S_{n+1}/n \to 1$ almost surely as $n \to \infty$, given $0 < \beta < 1$, for $n \gg 1$, with high probability, $Q(s_{(i)}) \le \gamma_i$, where $\gamma_i$ is a random variable following Gamma(i, 1/(nβ)). Thus $z_i$ can be set equal to a high percentile of this Gamma distribution. In order for the constraints in (4.2) to be stronger than their counterparts in (4.1), one needs $z_i < F_n(s_{(i)}) + \epsilon_n$, which has to be checked case by case for the specific choice of $\epsilon_n$.
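A sketch of the resulting bounds in R, reading Gamma(i, 1/(nβ)) as shape i and scale 1/(nβ) (our assumption about the parametrization); z_bound is a hypothetical helper.

    # z_i = an upper percentile of Gamma(shape = i, scale = 1/(n * beta)),
    # so that with high probability Q(s_(i)) <= z_i for small i.
    z_bound <- function(i, n, beta = 0.95, tail = 0.01)
      qgamma(1 - tail, shape = i, rate = n * beta)
    round(z_bound(1:5, n = 5000), 5)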

Construction of p-values
To design concrete ways to construct p-values using the foregoing constraints, a principle we will follow is that a viable construction should yield p-values that are an increasing function of $s_i$. This monotonicity naturally holds for $\phi_k(s_i)$, as they are p-values under individual null distributions. However, since different sets of constraints may be applied to different observations when calculating their p-values for composite nulls, the monotonicity does not automatically hold and should be checked for specific constructions.
It is impossible to use all of the foregoing constraints, as there are infinitely many of them. Besides, we need to take computational cost into account. The constraints we shall use fall into three categories. First, some "hard" constraints on $c = (c_1, \ldots, c_L)$, the most basic ones being $c_k \ge 0$ and $c_k \le 1$. Second, upper bounds on $c^\top\phi(s_i)$ derived from (4.1) or (4.2), depending on the rank of $s_i$ among $s_1, \ldots, s_n$. Third, upper bounds on $c^\top[\phi(t) - \phi(s)]$, where $s < t$ belong to some pre-selected finite set of "check points".
We need to set some parameters first. To impose hard constraints on c, fix a closed set $\Delta' \subset \Delta$ that is known for sure to contain $(1-a)\nu$. In most cases, $\Delta' = \Delta$ for lack of direct information about ν. To set the margin of error in the constraints of the form (4.1), fix $\epsilon_n > 0$ such that $\epsilon_n \to 0$ as $n \to \infty$. To impose constraints involving $c^\top[\phi(t) - \phi(s)]$, let $T_n$ be a finite set of check points. To set the $z_i$ in (4.2), fix β ∈ (0, 1) and $m_n$ such that $m_n \ll n$ for large n.
Denote by $\Gamma^*(z; a)$ the z-th upper-tail quantile of Gamma(a, 1); the $z_i$ in (4.2) are then defined from $\Gamma^*$, β and $m_n$ (4.3). Since by assumption Q is continuous, with probability 1, the $s_i$ are distinct from each other and $F_n(s_{(i)}) = i/n$.
Sequential construction. Denote by $p_{i,\mathrm{seq}}$ the p-value of $s_i$ obtained by this construction. Then $p_{i,\mathrm{seq}} = M_{n,\mathrm{seq}}(s_i)$, where for any t,
$$M_{n,\mathrm{seq}}(t) = \sup\{c^\top\phi(t) : c \in C_{n,\mathrm{seq}}(t)\}, \qquad (4.4)$$
with $C_{n,\mathrm{seq}}(t)$ being the set of $c \in \Delta$ satisfying the hard constraint $c \in \Delta'$ together with those constraints of the forms (4.1) and (4.2) that only involve $s_j \ge t$ and the check points in $T_n$.

Global construction. Denote by $p_{i,\mathrm{glb}}$ the p-value of $s_i$ constructed in this way. Then $p_{i,\mathrm{glb}} = M_{n,\mathrm{glb}}(s_i)$, where for any t,
$$M_{n,\mathrm{glb}}(t) = \sup\{c^\top\phi(t) : c \in C_{n,\mathrm{glb}}\}, \qquad (4.5)$$
with $C_{n,\mathrm{glb}}$ being the set of $c \in \Delta$ satisfying the hard constraint $c \in \Delta'$ together with the constraints of the forms (4.1) and (4.2) based on all the $s_j$ and check points.

Remark.
(1) Since $C_{n,\mathrm{seq}}(t)$ and $C_{n,\mathrm{glb}}$ are convex closed subsets of ∆ and contain 0, the suprema in the definitions of $M_{n,\mathrm{seq}}$ and $M_{n,\mathrm{glb}}$ are attainable. Each of $C_{n,\mathrm{seq}}(t)$ and $C_{n,\mathrm{glb}}$ has the property that if c belongs to it, then so does any d with $0 \le d_i \le c_i$.
(2) For every given t, unlike $M_{\max}(t)$ and $M_{\mathrm{mix}}(t)$, both $M_{n,\mathrm{seq}}(t)$ and $M_{n,\mathrm{glb}}(t)$ depend on $s_1, \ldots, s_n$ and hence are random.
(3) The first construction is dubbed "sequential" because each $p_{i,\mathrm{seq}}$ is computed based on the $s_j \ge s_i$: if we imagine that the $s_i$ are input one by one, starting with the largest one, then $p_{i,\mathrm{seq}}$ can be computed only after all $s_j > s_i$ have been input. The second construction is dubbed "global" because each $p_{i,\mathrm{glb}}$ is computed based on all the $s_j$. While presumably imposing stronger constraints, the global construction has a higher computational cost.
(4) In both constructions, the constraints are linear, allowing each p-value to be computed by linear programming. It is easy to see that both allow parallelized computation of the p-values.
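As an illustration of remark (4), the following sketch solves one such linear program with the Rglpk interface to GLPK (the article names the solver, not its code; all numbers are hypothetical).

    library(Rglpk)

    # Hypothetical ingredients for L = 2: the objective phi(s_i), one simplex
    # constraint c_1 + c_2 <= 1, and one (4.1)-type constraint at a check point t0.
    phi_si <- c(0.03, 0.08)                    # phi(s_i)
    A <- rbind(c(1.00, 1.00),                  # c in Delta
               c(0.05, 0.10))                  # c' phi(t0) <= F_n(t0) + eps_n
    b <- c(1, 0.09)
    sol <- Rglpk_solve_LP(obj = phi_si, mat = A, dir = c("<=", "<="), rhs = b,
                          max = TRUE)          # variables are >= 0 by default
    sol$optimum                                # the constrained p-value at s_i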

Application to FDR control
Once $p_{i,\mathrm{seq}}$ and $p_{i,\mathrm{glb}}$ are computed, they can be applied to testing procedures just as $p_{i,\mathrm{mix}}$ and $p_{i,\max}$ are. In this section, we state some theoretical results on the FDR control of BH based on $p_{i,\mathrm{seq}}$ and $p_{i,\mathrm{glb}}$.
The first result, Theorem 4.1, deals with $p_{i,\mathrm{seq}}$ and bounds the FDR of BH based on these p-values. The tools for its proof are the optional sampling theorem (cf. [30]) and the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality [21].
The bound contains terms in addition to α. For appropriate $\epsilon_n$ and $T_n$, these terms are o(1). Under certain conditions, R is of the same order as n. Thus, the bound shows that the FDR can be asymptotically controlled at α. On the other hand, the simulations in Section 5 indicate that usually the realized FDR is substantially lower than α, which is perhaps not surprising because $p_{i,\mathrm{seq}} \ge p_{i,\mathrm{mix}}$.
Unlike for the above result, we have not been able to apply the optional sampling theorem to $p_{i,\mathrm{glb}}$. We will settle for an asymptotic result. For $S, T \subset \mathbb{R}$, denote $\delta(S, T) = \sup_{t\in T}\inf_{s\in S}|s - t|$. A sequence of finite sets $S_n$ is said to be increasingly dense in T if for any $r > 0$, $\delta(S_n, T \cap [-r, r]) \to 0$ as $n \to \infty$.
Theorem 4.2. Suppose 1) $\phi_1, \ldots, \phi_L$ and ψ are continuous and 2) as $n \to \infty$, $\epsilon_n \to 0$ and $T_n$ is increasingly dense in $\mathbb{R}$. Then BH based on $p_{i,\mathrm{glb}}$ attains $\lim_{n\to\infty} \mathrm{FDR} \le \alpha$ and is asymptotically equivalent to the procedure that rejects the nulls whose $s_i$ fall below a deterministic threshold. The result is based on the observation that, as $n \to \infty$, $M_{n,\mathrm{glb}}(t) \to m(t)$ under a certain metric, as well as a fixed point argument [12].

Setup
We next use simulations to compare the performance of multiple testing when different types of p-values are used. In the study, we only consider multiple tests for univariate observations. Given a data distribution $Q = (1-a)\sum_{k=1}^L \nu_k F_k + aG$ and $X_1, \ldots, X_n$ i.i.d. ∼ Q, the nulls to be tested are $H_i$: "$X_i \sim F_k$ for some k", $i = 1, \ldots, n$. We use $F_k(X_i)$ as the p-values under the individual null distributions; cf. the remark at the end of Section 3.1. The parameters of the data distributions used in the study are listed in Table 1.
Table 1. Parameters of the data distributions for the simulation study; $t_{n,c}$ denotes the noncentral t distribution with n degrees of freedom and noncentrality c.

Throughout, the target FDR control level is α = 0.25 and, unless otherwise specified, the fraction of false nulls is a = 0.05. To calculate $p_{i,\mathrm{seq}}$ and $p_{i,\mathrm{glb}}$, in (4.3) we set $\epsilon_n = \ln n/n$, β = 0.95 and $m_n = n^{1/5}$; in (4.4) and (4.5), we let $T_n$ be a set of $\lfloor(\ln n)^2\rfloor$ equally spaced points that begins with $X_{(1)}$ and ends with $X_{(n)}$, and set $\Delta' = \{c \in \Delta : \sum_k c_k \ge 1 - a_0\}$, with $a_0 = 1$ or 0.1. For each data distribution, the simulation proceeds as follows. First, 1000 samples are drawn, each consisting of n i.i.d. $X_i$. Then, for each sample, the different types of p-values are evaluated and the different procedures are applied to them. Finally, measures of performance are estimated by averaging over the samples. All the simulations are conducted in the R language [22]; $p_{i,\mathrm{seq}}$ and $p_{i,\mathrm{glb}}$ are computed with the R linear programming package glpk.
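For concreteness, the tuning parameters above can be set up as in the following sketch; the construction of the check-point grid is our reading of the description.

    n     <- 5000
    eps_n <- log(n) / n                 # margin in the (4.1)-type constraints
    beta  <- 0.95                       # rate parameter in the (4.2)-type bounds
    m_n   <- n^(1/5)                    # number of order statistics with (4.2) bounds
    x     <- sort(rnorm(n))             # a hypothetical sample
    T_n   <- seq(x[1], x[n], length.out = floor(log(n)^2))  # check points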

Performances of BH based on different p-values
We apply BH to $p_{i,\mathrm{seq}}$, $p_{i,\mathrm{glb}}$, $p_{i,\max}$ and $p_{i,\mathrm{mix}}$, respectively, with $p_{i,\mathrm{seq}}$ and $p_{i,\mathrm{glb}}$ computed by setting $\Delta' = \Delta$ in (4.4) and (4.5). The power and FDR are estimated by the sample averages of $(R-V)/(n-N)$ and $V/(R \vee 1)$, where N, R and V are the total numbers of true nulls, rejected nulls, and rejected true nulls, respectively. For the first six data distributions (cf. Table 1), we use n = 5000; for the last one, due to high computational intensity, we use n = 2000.
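The two performance measures are estimated as in this sketch, where rej is a 1000 × n logical matrix of rejections over the simulated samples and is_true flags the true nulls; both names are ours.

    fdr_power <- function(rej, is_true) {
      V <- rowSums(rej[, is_true, drop = FALSE])   # rejected true nulls per sample
      R <- rowSums(rej)                            # all rejections per sample
      c(FDR   = mean(V / pmax(R, 1)),              # average of V / (R v 1)
        power = mean((R - V) / sum(!is_true)))     # average of (R - V) / (n - N)
    }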
As seen from Table 2, each type of p-values allows BH to control the FDR. On the other hand, $p_{i,\mathrm{mix}}$ yields substantially higher power than the others, while $p_{i,\mathrm{seq}}$ and $p_{i,\mathrm{glb}}$ yield substantially higher power than $p_{i,\max}$. To get an idea why this is the case, we compare the plots of the p-values. Since BH tests nulls by comparing $np_{(i)}/i$ and α, where $p_{(i)}$ is the i-th smallest p-value of a given type, it is informative to graph $n\bar p_{(i)}/i$ vs $i/n$, where $\bar p_{(i)}$ is the sample average of $p_{(i)}$. As Fig. 2 shows, for small $i/n$, $n\bar p_{(i),\mathrm{seq}}/i$ and $n\bar p_{(i),\mathrm{glb}}/i$ are similar, explaining why the performances of BH are similar when it is applied to the two types of constrained p-values. At the same time, both are substantially lower than $n\bar p_{(i),\max}/i$ and increase more rapidly than $n\bar p_{(i),\mathrm{mix}}/i$, which explains the differences in power of BH when it is applied to these different types of p-values. Thus, when the $\nu_k$ are intractable, utilizing properties of empirical processes to reduce over-evaluation of p-values can significantly increase the power of BH.
We also look at how linear programming works in the evaluation of the constrained p-values. For each $p_{(i),\mathrm{seq}}$ or $p_{(i),\mathrm{glb}}$, denote by $c_{1,(i)}, \ldots, c_{L,(i)}$ the coefficients obtained by the corresponding optimization in (4.4) or (4.5). Unlike the p-values, the coefficients exhibit very different patterns (Fig. 3). For each k, when $i/n$ is small, the two types of $c_{k,(i)}$ are similar. However, as $i/n$ increases, for $p_{(i),\mathrm{seq}}$, all but one $c_{k,(i)}$ become 0, while for $p_{(i),\mathrm{glb}}$, a more complicated combination of the $c_{k,(i)}$ emerges. This difference may be partially due to how linear programming is implemented by the software used. However, it also suggests that the $c_k$ cannot be used as suitable estimates of the $\nu_k$.

Effects of stronger hard constraint
Observe that in Fig. 3, $\sum_k c_{k,(i)} \le 0.4$ for small $i/n$. Since $a = 1 - \sum_k c_k$, this means that in the evaluation of the constrained p-values, 0.6 is counted as a feasible value of a, which is too high compared with the true value a = 0.05. This suggests that, by imposing more constraints on the $c_k$, the power may be improved. We therefore simulate the scenario where it is known that a ≤ 0.1. In this scenario, the hard constraint becomes $c \in \Delta' = \{c \in \Delta : \sum_k c_k \ge 0.9\}$. Denote by $p'_{i,\mathrm{seq}}$ and $p'_{i,\mathrm{glb}}$ the p-values evaluated under the extra constraint. Table 3 compares the powers of BH when it is applied to $p_{i,\mathrm{seq}}$, $p_{i,\mathrm{glb}}$, $p'_{i,\mathrm{seq}}$ and $p'_{i,\mathrm{glb}}$, respectively. For each type of p-values, the sample SDs of $(R-V)/(n-N)$ are also shown. For some of the data distributions (1-3), there is a small but significant increase in power when BH is applied to $p'_{i,\mathrm{seq}}$ and $p'_{i,\mathrm{glb}}$, while for the other data distributions, there is no significant difference. Fig. 4 shows the plots of $n\bar p_{(i)}/i$ for the 1st and 5th data distributions. Since all the rejected nulls are associated with $i \ll n$, we only compare the plots for $i \le 0.05n$. Like those for $p_{(i),\mathrm{seq}}$ and $p_{(i),\mathrm{glb}}$, the plots for $p'_{(i),\mathrm{seq}}$ and $p'_{(i),\mathrm{glb}}$ are very close to each other. On the other hand, for the 1st data distribution, the latter two are significantly lower than the former two, which explains the improved power of BH when it is applied to $p'_{i,\mathrm{seq}}$ and $p'_{i,\mathrm{glb}}$. Finally, as Figs. 3 and 5 show, the new hard constraint on c substantially changes the $c_{k,(i)}$.

Comparisons between BH and its variants
For multiple testing on simple nulls with independent p-values, STS (2.1) and F1-3 (2.2) are more powerful than BH [10,30]. They are also easy to implement. Since constrained p-values are computationally costly, one may ask whether they can be dispensed with in favor of STS and F1-3. In the context of the study, the question becomes whether the power of these procedures when applied to $p_{i,\max}$ can be as high as the power of BH when applied to $p_{i,\mathrm{seq}}$ or $p_{i,\mathrm{glb}}$.
To answer the question, we apply STS, F1, F2 and F3 to the p-values used in Section 5.2. We set λ = 0.5 in (2.1) for STS and κ = 0.5 in (2.2) for F1-3. Table 4 shows the powers of the procedures. The results from F1, F2, and F3 are identical, so they are grouped together. Compared with Table 2, it is seen that the powers of BH, STS and F1-3 when they are applied to $p_{i,\max}$ are similar, and all are substantially lower than the power of BH when it is applied to $p_{i,\mathrm{seq}}$ or $p_{i,\mathrm{glb}}$. The following observations can also be made. First, when these procedures are applied to $p_{i,\mathrm{mix}}$, STS consistently has more power than the others. In contrast, when they are applied to the other types of p-values, STS has the lowest power. Second, F1-3 in general have a little more power than BH when applied to each type of p-values. The results suggest that, when a is small, how p-values are defined has a stronger influence on multiple testing than how BH is modified, and variants of BH can have more or less power depending on the p-values being used.
To see how the procedures perform when the fraction of false nulls becomes larger, we repeat the simulations for moderate and large values of a, while keeping the other parameters unchanged. Tables 5 and 6 display the powers of the procedures in the simulations using the 6th set of parameters in Table 1. Since all the procedures control the FDR at or below α = 0.25, the values of FDR are omitted. As seen from the tables, for all the values of a, the power of BH when it is applied to the constrained p-values is substantially higher than the powers of STS and F1-3 when they are applied to $p_{i,\max}$. This again shows that the constrained p-values cannot be dispensed with by the variants of BH. The following observations can also be made. First, for each procedure, across the values of a, the constrained p-values yield more power than $p_{i,\max}$, but less power than $p_{i,\mathrm{mix}}$. Second, the powers of BH when it is applied to the two types of constrained p-values are similar, even when a is large. This is also the case for F2 and F3. In contrast, as a increases, STS gains more power from $p_{i,\mathrm{glb}}$ than from $p_{i,\mathrm{seq}}$, and even for moderate a, the gain is substantial. For F1, such a gain in power from $p_{i,\mathrm{glb}}$ is obvious only for large a. Third, as a increases, both STS and F1-3 become more powerful than BH when all of them are applied to the constrained p-values. The difference is large when a is large. On the other hand, between STS and F1-3, the picture is more complicated. Depending on the value of a, each one can be more powerful than the other when applied to $p_{i,\mathrm{glb}}$. However, when applied to $p_{i,\mathrm{seq}}$ or $p_{i,\max}$, F1-3 have more power, whereas when applied to $p_{i,\mathrm{mix}}$, STS has more power, especially for moderate a (≤ 0.3).

Table 5. Powers of BH (top), STS (middle) and F1-3 (bottom) based on different types of p-values, when a = 0.05k, k = 1, . . . , 6. Except for a, the parameters are identical to the 6th group in Table 1.
In summary, for moderate or large a, $p_{i,\mathrm{glb}}$ yields more power than $p_{i,\mathrm{seq}}$ and $p_{i,\max}$, and STS and F1 tend to have more power than the other procedures. The above results indicate that for large a, the two types of constrained p-values are no longer similar to each other. This is confirmed by Fig. 6: when a = 0.5, the plots for $p_{i,\mathrm{seq}}$ and $p_{i,\mathrm{glb}}$ are significantly different. It is also worth noting that in this case, for large i, the $p_{(i),\mathrm{glb}}$ are significantly smaller than the $p_{(i),\mathrm{mix}}$. The difference is also noticeable for a = 0.4 and even greater for a = 0.6 (results not shown). This suggests that, even though a cannot be estimated consistently, by exploiting the empirical distribution of the observations, the linear programming for $p_{i,\mathrm{glb}}$ can impose strong constraints on the feasible values of a. Of course, it should be noted that this only occurs when a is large.

Effects of sample size
Finally, we run simulations on BH with n = 500, 1000, 2000, and 5000 for the first six distributions in Table 1, and n = 500, 1000, and 2000 for the last one. Table 7 shows the results for the 6th and 7th distributions. The results for the others show similar patterns. Across all values of n, the FDR is controlled. The power shows a decreasing trend as n increases. The trend is more pronounced for $p_{i,\mathrm{seq}}$, $p_{i,\mathrm{glb}}$ and $p_{i,\max}$ than for $p_{i,\mathrm{mix}}$.

Discussion
We have only considered the case where the number of null distributions is finite. Formally, it is straightforward to generalize the constrained maximization to the case where there are infinitely many null distributions. Generally speaking, however, the maximization will involve infinitely many degrees of freedom, and it is unclear how to accommodate this with a finite number of observations. An alternative approach might be to partition the set of null distributions into a finite number of subsets and use the envelopes of the subsets to compute p-values. More specifically, given a partition $\Theta_1, \ldots, \Theta_L$ of Θ, let $u_k(t) = \sup\{\phi_\theta(t) : \theta \in \Theta_k\}$ and $l_k(t) = \inf\{\phi_\theta(t) : \theta \in \Theta_k\}$. Then define, for example, $M_n(t) = \sup c^\top u(t)$, where the supremum is taken over $c \in \Delta$ such that $c^\top l(t)$ is dominated by the empirical distribution function. One issue is how to select the partition. Too coarse a partition will only yield loose constraints on the $c_k$, and too fine a partition will result in too many degrees of freedom. Either way, the obtained $M_n(t)$ may not be much different from the unconstrained maximum significance levels.
As is known, the local FDR can be used for multiple testing [9]. For simple nulls, the local FDR is $(1-a)f_0(x)/h(x)$, where $f_0$ is the density under the true nulls and h the overall density of the data. If the number of null distributions is finite, with densities $f_1, \ldots, f_L$, one could define a conservative local FDR as $\rho(x)/h(x)$, where $\rho(x) = \max\{c^\top f(x) : c \in \Delta \text{ and } c^\top f \le h\}$. Furthermore, if the dimension of the data is high, one could instead use the empirical distribution of its low-dimensional transformations to get constraints.
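A hedged sketch of this suggestion, again via Rglpk; discretizing the constraint $c^\top f \le h$ onto a finite grid is our simplification, and all names are hypothetical.

    library(Rglpk)

    # rho(x) = max{ c'f(x) : c in Delta, c'f <= h on a grid }; the conservative
    # local FDR is rho(x) / h(x).  f_list holds the null densities f_1, ..., f_L.
    cons_local_fdr <- function(x, f_list, h, grid) {
      L  <- length(f_list)
      fx <- vapply(f_list, function(f) f(x), numeric(1))
      Fg <- vapply(f_list, function(f) f(grid), numeric(length(grid)))  # |grid| x L
      A  <- rbind(rep(1, L), Fg)            # sum(c) <= 1 and c'f(t) <= h(t)
      b  <- c(1, h(grid))
      sol <- Rglpk_solve_LP(obj = fx, mat = A, dir = rep("<=", nrow(A)),
                            rhs = b, max = TRUE)
      sol$optimum / h(x)
    }

    # Example with two hypothetical nulls and a mixture density h:
    h <- function(t) 0.95 * (dnorm(t) + dnorm(t, 1)) / 2 + 0.05 * dunif(t, -3, 3)
    cons_local_fdr(0.5, list(dnorm, function(t) dnorm(t, 1)), h, seq(-3, 3, 0.5))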
The article only considers the case where the data collected for different nulls are independent. Some progress has been made on multiple testing for time series [17]. Since the constraints used here are derived from empirical marginal distributions, they can also be applied to time series. At issue is the strength of the constraints, which is determined by how well the empirical marginal distribution approximates the underlying marginal distribution. For the i.i.d. case, this can be resolved by the DKW inequality. For other cases, if similar inequalities are available, then one may also get useful constraints.

Appendix
In this Appendix, the sup-norm of $f \in C(\mathbb{R})$ will be denoted by $\|f\|$.

A.2. Proof of Theorem 4.1
To prove the result, we will employ a stopping time technique and also rely on a few probability inequalities, especially the DKW inequality [21]. Let
$$\tau = \sup\{t : M_{n,\mathrm{seq}}(t) \le \alpha[R_n(t) \vee 1]/n\},$$
where $R_n(t)$ is as in (3.3). We will show that BH is equivalent to a thresholding procedure with τ as the cut-off, making it possible to apply the optional sampling theorem. Together with a few probability inequalities, this will give Theorem 4.1.
Recall that by assumption, the $\phi_k$ and ψ are continuous. Denote $F = \nu^\top\phi$. Then $F \in C(\mathbb{R})$ and, by Proposition 3.1, under true $H_i$, the $s_i$ are i.i.d. ∼ F.

Lemma A.1. $M_{n,\mathrm{seq}}$ is always nondecreasing with $M_{n,\mathrm{seq}}(-\infty) = 0$. Almost surely, 1) $M_{n,\mathrm{seq}}$ is continuous at every t other than $s_1, \ldots, s_n$, and 2) $M_{n,\mathrm{seq}}$ is left-continuous and has a right limit at each $s_i$.
To show 1) and 2), it suffices to show a) $M_{n,\mathrm{seq}}$ is left-continuous, and b) $M_{n,\mathrm{seq}}$ is right-continuous at every $t \notin S = \{s_1, \ldots, s_n\}$ and has a right limit at every $s_i$. a) Fix t. If $0 < t - u \ll 1$, then $[u, t)$ has no point in $T_n \cup S$. Thus $C_{n,\mathrm{seq}}(u) = C_{n,\mathrm{seq}}(t)$. Denote $K = C_{n,\mathrm{seq}}(t)$. It is not hard to see that K is compact and $c^\top\phi(s)$ is uniformly continuous in $(c, s) \in K \times \mathbb{R}$. Then $\sup_{c\in K} c^\top\phi(s)$ is continuous in s, implying $M_{n,\mathrm{seq}}(u) \to M_{n,\mathrm{seq}}(t)$ as $u \uparrow t$. Thus $M_{n,\mathrm{seq}}$ is left-continuous.
b) Since $M_{n,\mathrm{seq}}$ is nondecreasing, it has a right limit at every t. It remains to show that $M_{n,\mathrm{seq}}$ is right-continuous at every $t \notin S$. Given $t \notin S$, if $0 < u - t \ll 1$, then $[t, u)$ has no point in $T_n \cup S$, and thus $C_{n,\mathrm{seq}}(u) = C_{n,\mathrm{seq}}(t)$. The right-continuity then follows from the same argument as for the left-continuity.
(2) Almost surely, $u := s_{(n)} \vee \max T_n < \infty$. For $t > u$, $M_{n,\mathrm{seq}}(t) = \sup\{c^\top\phi(t) : c \in \Delta'\}$. Since $(1-a)\nu \in \Delta'$ and $1 - a > \alpha$, for all $t \gg u$, $M_{n,\mathrm{seq}}(t) > \alpha$, giving $\tau < \infty$. (3) follows from (1), (2), and the facts that $M_{n,\mathrm{seq}}$ is left-continuous and $R_n$ is nondecreasing.

Let N be the total number of true nulls. For each t, let $\mathcal{F}_t$ be the σ-field generated by the $s_i$ and $\eta_i$ that are "observable" on $[t, \infty)$, i.e., by the pairs $(s_i, \eta_i)$ with $s_i \ge t$, where $\eta_i \in \{*\} \cup \Theta$ marks whether $H_i$ is false ($\eta_i = *$) or true with null distribution $F_{\eta_i}$, and ø denotes a value being missing. Then $R_n(t-) = n - \#\{i : s_i \ge t\}$ and $V_n(t-) = N - \#\{i : s_i \ge t, \eta_i \in \Theta\}$ are $\mathcal{F}_t$-measurable. Likewise, for $s \ge t$, $R_n(s)$ and $V_n(s)$ are $\mathcal{F}_t$-measurable. It is seen that $\{\mathcal{F}_t, t \in \mathbb{R}\}$ is a backward filtration, i.e., $\mathcal{F}_t \subset \mathcal{F}_s$ for $t > s$.
Proof. It suffices to show that given t and $0 \le z < 1$, $\{M_{n,\mathrm{seq}}(t) > z\} \in \mathcal{F}_t$. Let S be a dense countable subset of $\Delta'$. Note that $S \cap C_{n,\mathrm{seq}}(t)$ may not be dense in $\Delta' \cap C_{n,\mathrm{seq}}(t)$. By the properties of $C_{n,\mathrm{seq}}(t)$ noted at the end of Section 4.2, it can be seen that $M_{n,\mathrm{seq}}(t) > z$ if and only if, for some fixed rational number $r > 0$ and every k, there is $c \in T_{r,k} = \{c \in S : (c - \mathbf{1}/k)^\top\phi(t) \ge z + r - 2L/k\}$ such that $(c - \mathbf{1}/k)^+ \in C_{n,\mathrm{seq}}(t)$, where $\mathbf{1} = (1, \ldots, 1)^\top$ and $d^+$ stands for the point with coordinates $d_i \vee 0$. That is,
$$\{M_{n,\mathrm{seq}}(t) > z\} = \bigcup_{r\in\mathbb{Q}^+} \bigcap_{k\ge 1} \bigcup_{c\in T_{r,k}} \{(c - \mathbf{1}/k)^+ \in C_{n,\mathrm{seq}}(t)\},$$
where $\mathbb{Q}^+$ denotes the set of positive rational numbers. Since $T_{r,k}$ is countable, by the above expression, it suffices to show that for any c, the event $\{c \in C_{n,\mathrm{seq}}(t)\}$ belongs to $\mathcal{F}_t$, which follows easily from the definition of $C_{n,\mathrm{seq}}(t)$.
Proof. We already know that $V_n(t-)$ is $\mathcal{F}_t$-measurable. It suffices to show the remaining identity. As t is fixed, it is more convenient to write $\mathcal{F}_t$ as in ([26], p. 172). Given $I, J \subset \{1, \ldots, n\}$, let $f_{IJ}$ be a function only in the $a_i$ and $b_i$ with $i \in I$, with $A_{IJ}$ and $B_I$ defined accordingly (A.3). Assuming (A.3) is true for now, observe that for each fixed disjoint pair I and J, $A_{IJ}$ and $B_I$ are independent, because $A_{IJ}$ is determined by $(s_i, \eta_i, i \in I)$, while $B_I$ is determined by $(s_i, \eta_i, i \notin I)$. Furthermore, $E(A_{IJ})$ is independent of the value of s, while $E(B_I)$ is continuous in s by the continuity of F. In particular, letting $s = t$, it is seen that $V_n(s-) = |\{k : s_k < s, \eta_k \in \Theta\}| = \sum_{k\in I} 1\{s_k < s, \eta_k \in \Theta\}$. As a result, the claim follows from the definitions of $A_{IJ}$ and $B_I$ and the definitions of I and J, via $\sum_{i\in I} 1\{s_i < t, \eta_i \in \Theta\}$,
hence showing the claim.
In light of the lemmas, our intention is to apply the optional sampling theorem to τ and $\{Z(t), \mathcal{F}_t\}$. There is one minor problem with this: the index t starts at ∞ instead of a finite value. We can get around the problem using truncation. Let $b_0$ be as in part (1) of Lemma A.2 and let $\tau_c$ be the version of τ truncated at $c \ge b_0$.

Proof. To show that $\tau_c$ is a stopping time of the backward filtration $\mathcal{F}_t$, it suffices to show $\{\tau_c \ge t\} \in \mathcal{F}_t$ for every $t \le c$. By the same argument as for (3) of Lemma A.2, $M_{n,\mathrm{seq}}(\tau_c) \le \alpha[R_n(\tau_c) \vee 1]/n$. Therefore, $\{\tau_c \ge t\} = \{g(s) \le 0 \text{ for some } s \in [t, c]\}$, where $g(s) = M_{n,\mathrm{seq}}(s) - \alpha[R_n(s) \vee 1]/n$, and the latter event is $\mathcal{F}_t$-measurable. The bound then follows from the optional sampling theorem (cf. [18], Ch. 1).

We shall need a few probability inequalities. To get the first one, for each t, let $E(t) = \{c \in \Delta : c \text{ satisfies conditions 1) and 2)}\}$, where the conditions are 1) $c^\top\phi(s_{(j)}) \le F_n(s_{(j)}) + \epsilon_n$ for all $s_{(j)} \ge t$; and 2) the corresponding bounds at the check points in $T_n$.

Proof. Recall $Q \in C(\mathbb{R})$. By the DKW inequality in [21], for any λ > 0, $P(\sup_t |F_n(t) - Q(t)| > \lambda) \le 2e^{-2n\lambda^2}$. On the other hand, we will show that (A.5) holds for any given $x \in \mathbb{R}$. Assuming (A.5) is true for now, the desired inequality follows. It remains to get (A.5). Let $y = Q(x)$. By the quantile transformation, set $V_i = Q(s_i)$. Then the $V_i$ are i.i.d. ∼ Unif(0, 1) and $\xi = \sup_{0\le s\le 1-y}[s - G'_n(s)]$, where $G'_n$ is the empirical distribution of the $V_i$. Applying the DKW inequality to ξ, (A.5) follows.
It is easy to see that for $0 \le s < r$, $(1-a)\nu \in \Gamma_s \subset \Gamma_r$ and $\Gamma_s = \bigcap_{r>s} \Gamma_r$.
(1) If $K_n = \emptyset$, then by definition, $M_{n,\mathrm{glb}}(t) \equiv 1$. On the other hand, if $K_n \ne \emptyset$, then, since $K_n$ is compact and $\phi \in C(\mathbb{R})$ is bounded, the functions $c^\top\phi(t)$, $c \in K_n$, are equicontinuous and uniformly bounded in t. It follows that $M_{n,\mathrm{glb}} \in C(\mathbb{R})$. Likewise, $m \in C(\mathbb{R})$.