Interactive Martingale Tests for the Global Null

Global null testing is a classical problem going back about a century to Fisher's and Stouffer's combination tests. In this work, we present simple martingale analogs of these classical tests, which are applicable in two distinct settings: (a) the online setting in which there is a possibly infinite sequence of $p$-values, and (b) the batch setting, where one uses prior knowledge to preorder the hypotheses. Through theory and simulations, we demonstrate that our martingale variants have higher power than their classical counterparts even when the preordering is only weakly informative. Finally, using a recent idea of "masking" $p$-values, we develop a novel interactive test for the global null that can take advantage of covariates and repeated user guidance to create a data-adaptive ordering that achieves higher detection power against structured alternatives.


Introduction
This paper proposes new martingale-based methods for testing the global null corresponding to hypotheses $\{H_i\}_{i \in I}$ using a corresponding set of p-values $\{p_i\}_{i \in I}$ and possibly other covariates $\{x_i\}_{i \in I}$, where the index set $I$ can be finite or countably infinite. Global null testing corresponds to testing whether all individual hypotheses are true nulls (denoted as $H_i = 0$), against the complement:
$$H_0^G: H_i = 0 \text{ for all } i \in I, \qquad H_1^G: H_i = 1 \text{ for some } i \in I.$$
As we review later in the introduction, this is a well-studied classical problem. We consider two settings, the batch setting and the online setting, and our proposed framework applies to both:
• Batch setting: we have access to a fixed batch of $n$ hypotheses, thus $I = \{1, \dots, n\}$.
• Online setting: the hypotheses arrive one at a time in a possibly infinite sequence, thus $I$ is countably infinite.
Most common global null tests involve a one-step operation, comparing a single statistic with a critical value derived from its null distribution. Observing that many classical tests effectively use a martingale-type test statistic, we propose novel martingale analogs of these tests that are inherently sequential (multi-step) in nature, and thus naturally apply in the online setting, or in the batch setting if an ordering can be created using prior knowledge and/or the data. Intriguingly, the ordering may also be created interactively: a human may adaptively create the ordering in a data-dependent manner, provided they adhere to a particular protocol of masking and unmasking. To understand why our interactive martingale tests have desirable properties (both controlling Type 1 errors and having higher power in structured settings), it is necessary to present them last, after having derived the vanilla non-interactive martingale global null tests, which are also novel in their own right. Specifically, to progressively develop intuition, our treatment follows three steps of increasing complexity:
• (Preordered setting, Section 2) In the batch setting, the analyst employs prior (data-independent) knowledge to preorder the hypotheses. In the online setting, an ordering of hypotheses is provided by nature.
• (Data-adaptive ordering, Section 3) In the batch setting, the hypotheses are unordered, but an adaptive data-dependent ordering is created based on "masked" p-values. In the online setting, nature orders hypotheses, but the analyst discards some hypotheses from the ordering based on their masked p-values. Even though the data-adaptive and preordered settings proceed sequentially and handle the p-values one at a time, the analyst plays no role during this sequential process, as all the rules for how to order the hypotheses are prespecified before the data is observed.
• (Interactive ordering, Section 4) The utility of masking to enable interaction with a human is most compelling in the batch setting, where in addition to the unordered hypotheses, we suppose that the analyst also has additional side information in the form of covariates, and perhaps prior knowledge in the form of structural constraints on the non-null set. Using these, and any statistical models of their choice, the analyst interactively creates an ordering by initially observing only masked p-values, and progressively unmasking them one at a time. The analyst can update their prior knowledge and/or structural constraints and/or statistical model in the middle of the process (when only some hypotheses have been ordered and their p-values unmasked), thus intervening to change the rest of the ordering. It is important to note that even though a human is allowed to make subjective decisions at each step of the interaction, an algorithm can be deployed to act on the human's behalf.
Since all our tests proceed sequentially, accumulating evidence from one hypothesis at a time, the Type 1 error guarantee we achieve is that
$$P_0(\exists i \in I: \text{the test stops and rejects } H_0^G \text{ after step } i) \le \alpha,$$
where $P_0$ is the probability under the global null $H_0^G$. The tests are judged by their power, $P_1(\exists i \in I: \text{the test stops and rejects } H_0^G \text{ after step } i)$, where $P_1$ is the probability under some alternative in $H_1^G$. We remark that even though we formulate our tests in terms of a target Type 1 error level $\alpha$, there is an equivalent formulation in terms of creating a sequential "always-valid" p-value for the global null that is valid at any arbitrary stopping time. Section 8 explicitly connects these two interpretations.

Assumptions
Instead of assuming that the marginal distribution of null p-values is exactly uniform, we relax this requirement by allowing conservative p-values, defined in two different ways. We either assume that (a) if the global null is true, all p-values are stochastically larger than uniform:
$$\text{If } H_0^G \text{ is true, } \Pr(p_i \le t) \le t \text{ for all } t \in [0, 1],\ i \in I, \qquad (1)$$
or assume that (b) if the global null is true, all p-values are mirror-conservative:
$$\text{If } H_0^G \text{ is true, } f_i(a) \le f_i(1 - a) \text{ for all } 0 \le a \le 0.5,\ i \in I, \qquad (2)$$
where $f_i$ is the probability mass function of $p_i$ for discrete p-values or the density function otherwise. Neither of the aforementioned conditions implies the other, though the former is more commonly made. Examples of mirror-conservative p-values include permutation p-values and one-sided tests of univariate parameters with monotone likelihood ratio (Lei and Fithian, 2018). In the majority of the paper, it may be easier for the reader to pretend that the null p-values are exactly uniform for simplicity. Later in the paper, we explicitly demonstrate the distinct advantages of our tests for conservative p-values.
We also assume that if the global null is true, the null p-values are independent of each other:
$$\text{If } H_0^G \text{ is true, } \{p_i\}_{i \in I} \text{ are jointly independent.}$$
This is also a common assumption, made for example by Fisher's test (Fisher, 1934) and Tukey's Higher Criticism (Donoho and Jin, 2015). Even though we are cognizant that independence is a strong assumption that only holds in some limited situations in practice (like meta-analysis), we wish to explore how much it can be exploited to design novel tests, for instance enabling the use of martingale techniques and "masking", as described soon. We remark that all aforementioned assumptions on the null p-values only need to hold under the global null. If the global null is not true, we do not require the null p-values (or the non-nulls) to have any particular marginal distribution or to satisfy any independence assumptions.

Related work
Our paper builds on and connects three distinct lines of work: classical work on global null testing, modern ideas on permitting interaction using p-value masking, and recent ideas on uniform martingale concentration inequalities. We discuss these separately below.
Global null testing. Most previous tests for the global null work in the batch setting. Our work is most directly connected to tests which accumulate information as a sum, such as Fisher's and Stouffer's tests (Stouffer et al., 1949). There are many other global null tests like the Bonferroni method, Simes' test (Simes, 1986), and Higher Criticism, and our techniques do not apply to these. Importantly: We do not wish to claim that our interactive martingale tests are more powerful than prior work in any universal sense, but instead attempt to expand the creative design space of new procedures that can involve a human in the loop and explore their potential benefits.
Permitting interaction by masking p-values. The motivation behind masking p-values is to permit interaction with an analyst, who may freely employ models, prior knowledge and intuition, without any risk of violating Type 1 error control. The main idea is to decompose each individual p-value $p_i$ into two parts,
$$h(p_i) = 2 \cdot 1\{p_i < 0.5\} - 1 \quad \text{and} \quad g(p_i) = \min(p_i, 1 - p_i).$$
Here, $g(p_i)$ is called the masked p-value, while $h(p_i)$ is called the missing bit since it is either plus or minus one. The critical observation is that $h(p_i)$ and $g(p_i)$ are independent if $H_i$ is null ($p_i$ is uniformly distributed). Masking was introduced recently by Lei and Fithian (2018) in the context of false discovery rate (FDR) control, and further generalized and extended in Lei et al. (2017) for FDR control under structural constraints. In this paper, we show that masking is also useful for global null testing in structured settings, and that permitting interaction with an insightful analyst can improve power (while it remains impossible for any analyst to violate Type 1 error control).
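The decomposition can be illustrated with a minimal Python sketch. It assumes the usual masking functions $g(p) = \min(p, 1 - p)$ and $h(p) = 2 \cdot 1\{p < 0.5\} - 1$; the function names `masked_p_value`, `missing_bit` and `unmask` are hypothetical and chosen only for this illustration.

```python
def masked_p_value(p: float) -> float:
    """g(p) = min(p, 1 - p): shows how extreme p is, but not on which side of 0.5."""
    return min(p, 1.0 - p)


def missing_bit(p: float) -> int:
    """h(p) = +1 if p < 0.5 and -1 otherwise: the single bit hidden from the analyst."""
    return 1 if p < 0.5 else -1


def unmask(g: float, h: int) -> float:
    """Recover the p-value from its masked value and its missing bit."""
    return g if h == 1 else 1.0 - g


for p in (0.03, 0.97):
    g, h = masked_p_value(p), missing_bit(p)
    print(p, round(g, 6), h, round(unmask(g, h), 6))
```

Both 0.03 and 0.97 share (up to rounding) the same masked value, so an analyst who sees only $g(p_i)$ cannot tell a promising p-value from an unpromising one without the missing bit.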
Uniform martingale concentration inequalities. All new test statistics in this paper are designed to be martingales under the global null. The Type 1 error control guarantees for our tests thus stem from utilizing uniform martingale concentration inequalities. These "boundary crossing" inequalities are high-probability statements about the behavior of the entire trajectory of the martingale. In fact, several of our martingales have increments which are either fair coin flips ($\pm 1$) or standard Gaussians, which are some of the most well-studied objects in sequential analysis, especially through their natural connections to Brownian motion (Siegmund, 1986). In this paper, we care about nonasymptotic guarantees on the Type 1 error, and hence we use some recent line-crossing inequalities (Howard et al., 2018a) and new curve-crossing inequalities (Howard et al., 2018b) that are nonasymptotic generalizations of the law of the iterated logarithm. For a martingale $M_k$, these boundaries are denoted $u_\alpha(k)$ and satisfy
$$\Pr(\exists k \in \mathbb{N}: M_k > u_\alpha(k)) \le \alpha.$$
In the next section we provide the exact expressions for the $u_\alpha(k)$ that we use, which are chosen because they have similar qualitative behavior but tighter constants than earlier work, references to which may be found within the aforementioned papers.

Outline
To progressively build intuition, the preordered martingale test is described in Section 2 followed by the adaptively ordered martingale test in Section 3. In Section 4, the general interactively ordered martingale test is presented. For all these methods, the Type 1 error guarantees are presented immediately after the algorithms. However, power guarantees for all algorithms in the Gaussian sequence model are derived in Section 5. We then perform extensive simulations in Section 6. In Section 7, we examine the robustness of our test to conservative nulls. Finally, Section 8 explicitly describes how to interpret our tests as tracking an anytime sequential p-value. We end with a brief summary in Section 9, and defer all proofs and additional experiments to the supplementary material.

The preordered martingale test
The preordered martingale test is not a single test, but instead a general framework to extend the application of many classical methods that use the sum or product of transformed p-values, such as Stouffer's method (Stouffer et al., 1949) and Fisher's method (Fisher, 1934), from the batch setting to the online setting. In this section, the ordering of hypotheses is fixed in advance by nature, or by the analyst using prior knowledge to place potential/suspected non-nulls early in the ordering.
The general framework. Our test takes the following general form: reject the global null if
$$\exists k \in \mathbb{N}: \sum_{i \in M_k} f(p_i) > u_\alpha(k),$$
where $M_k$ denotes the first $k$ hypotheses in the ordering, $f(\cdot)$ is some transformation of the p-value, and $\{u_\alpha(k)\}_{k \in \mathbb{N}}$ is a boundary sequence depending on the choice of $f$. The boundary is determined by first establishing that the sequence $\{\sum_{i \in M_k} f(p_i)\}_{k \in \mathbb{N}}$ is a martingale under the global null (after appropriate centering if needed). We then characterize the tail behavior of the martingale increments $f(p_i)$ for a uniform p-value. Finally, to control the Type 1 error, we employ recent results (Howard et al., 2018a,b) which provide boundaries under parametric and nonparametric conditions on the increments, such that with high probability the entire trajectory of the martingale is contained within the boundary. The preordered martingale test improves on its original batch version in two aspects. First, the applicability of the original test is extended from the batch setting to the online setting. Second, in the case of sparse non-nulls, the martingale version greatly improves the detection power if the non-nulls appear early on. As an example of converting a classic test to its martingale version, we develop the martingale Stouffer test below. An additional example involving a martingale Fisher test using $f(p_i) = -2 \log p_i$ can be found in Appendix E.
An example: martingale Stouffer test. The batch test by Stouffer et al. (1949) uses the statistic
$$S_n = \sum_{i=1}^{n} \Phi^{-1}(1 - p_i),$$
where $\Phi(\cdot)$ denotes the standard Gaussian CDF. Since the distribution of $S_n$ under the global null is $N(0, n)$, the batch test rejects when $S_n > \sqrt{n}\,\Phi^{-1}(1 - \alpha)$. To design the martingale test, simply observe that $\{S_k\}_{k \in I}$ is a martingale whose increments $f(p_i) = \Phi^{-1}(1 - p_i)$ are standard Gaussians under the global null. There are several types of uniform boundaries $u_\alpha(k)$ for a Gaussian increment martingale, and here we give two examples: linear and curved. The first boundary, which can be derived from the Gaussian sequential probability ratio test (Wald, 1945), grows linearly with time. Specifically, the test rejects the global null if
$$\exists k \in \mathbb{N}: \sum_{i=1}^{k} \Phi^{-1}(1 - p_i) > \sqrt{\frac{-\log \alpha}{2m}}\, k + \sqrt{\frac{-m \log \alpha}{2}}, \qquad (4)$$
where $m \in \mathbb{R}_+$ is a tuning parameter that determines the time at which the bound is tightest: a larger $m$ results in a lower slope but a larger offset, making the bound loose early on. We suggest a default value of $m = n/4$ if the number of hypotheses $n$ is finite, but it should be chosen based on the time by which we expect to have encountered most non-nulls (if any). In contrast, the martingale Stouffer test with a curved boundary (Howard et al., 2018b) rejects the global null if
$$\exists k \in \mathbb{N}: \sum_{i=1}^{k} \Phi^{-1}(1 - p_i) > 1.7 \sqrt{k \left( \log\log(2k) + 0.72 \log\frac{5.19}{\alpha} \right)}. \qquad (5)$$
These bounds differ in the share of the error budget allotted to each step $k = 1, 2, \dots$, which can influence the detection power of the martingale test, as the statistic is more likely to exceed a tighter bound. Curved bounds have a slower growth rate $O(\sqrt{k \log\log k})$ than the linear bounds, indicating a tighter bound for large enough $k$, but they are usually looser for small $k$. Comparisons of the test with several linear and curved boundaries are given in Appendix D. Generally, the linear bound is recommended for the batch setting, and the curved bound for the online setting.
The martingale Stouffer test with either boundary controls the Type 1 error if, under the global null, the sum $\{\sum_{i=1}^{k} \Phi^{-1}(1 - p_i)\}_{k \in \mathbb{N}}$ is stochastically upper bounded by a martingale with standard Gaussian increments, which holds under our assumption that the null p-values are stochastically larger than uniform, as stated below.
Theorem 1. If the p-values are independent and stochastically larger than uniform under the global null, then the martingale Stouffer test with linear boundary (4) or curved boundary (5) controls the Type 1 error at level $\alpha$.
The next natural question is what we can prove about the detection power of the aforementioned tests. While this is treated more formally later in the paper, for now it suffices to say that the power of the martingale Stouffer test relies on a good preordering that places non-nulls up front. If such prior knowledge is not available (and say the preordering is completely random, or even adversarial), then the preordered martingale tests can have poor power. This motivates the development of methods based on data-adaptive orderings, as treated next.
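To make the preordered procedure concrete, the following Python sketch implements a martingale Stouffer test under stated assumptions. The closed forms used for the two boundaries are assumptions of this sketch: a linear boundary with slope $\sqrt{-\log\alpha/(2m)}$ and offset $\sqrt{-m\log\alpha/2}$, and a curved boundary $1.7\sqrt{k(\log\log(2k) + 0.72\log(5.19/\alpha))}$. The function names and the clipping constant `eps` are hypothetical.

```python
import math
from statistics import NormalDist


def linear_boundary(k, alpha, m):
    """Assumed linear boundary: lower slope but larger offset as m grows."""
    log_inv_alpha = -math.log(alpha)
    return math.sqrt(log_inv_alpha / (2 * m)) * k + math.sqrt(m * log_inv_alpha / 2)


def curved_boundary(k, alpha):
    """Assumed curved boundary with iterated-logarithm growth."""
    return 1.7 * math.sqrt(k * (math.log(math.log(2 * k)) + 0.72 * math.log(5.19 / alpha)))


def martingale_stouffer_test(p_values, boundary, eps=1e-12):
    """Return the first step k with S_k > boundary(k), or None if the test never rejects."""
    inv_cdf = NormalDist().inv_cdf
    s = 0.0
    for k, p in enumerate(p_values, start=1):
        p = min(max(p, eps), 1.0 - eps)  # clip so the Gaussian quantile stays finite
        s += inv_cdf(1.0 - p)            # increment f(p_k) = Phi^{-1}(1 - p_k)
        if s > boundary(k):
            return k
    return None


alpha, n = 0.1, 20
signal_first = [0.001] * 5 + [0.5] * 15  # suspected non-nulls placed early in the ordering
print(martingale_stouffer_test(signal_first, lambda k: linear_boundary(k, alpha, m=n / 4)))
print(martingale_stouffer_test(signal_first, lambda k: curved_boundary(k, alpha)))
print(martingale_stouffer_test([0.5] * n, lambda k: curved_boundary(k, alpha)))
```

Passing the boundary as a function of $k$ keeps the accumulation loop identical for both boundary shapes, so swapping in another uniform boundary only requires a different callable.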

The adaptively ordered martingale test
If we naively use the p-values both to determine the ordering and to form the test statistic, the resulting "double-dipped" sequence of test statistics does not form a martingale under the global null. In order to allow using the p-value for determining the ordering, we use a recent idea called masking, as briefly mentioned in the introduction. Each p-value $p_i$ is decomposed as
$$h(p_i) = 2 \cdot 1\{p_i < 0.5\} - 1 \quad \text{and} \quad g(p_i) = \min(p_i, 1 - p_i),$$
where $h(p_i)$ is called the missing bit, and $g(p_i)$ is called the masked p-value. The masked p-values are used to create the ordering (by placing smaller ones up front) while the test statistic just sums the missing bits $h(p_i)$ in that order. Since $h(p_i)$ and $g(p_i)$ are independent under the global null, sorting by the $g(p_i)$ values results in a uniformly random ordering, and the sum of $h(p_i)$ is just a random walk of independent coin flips. Formally, define the set $M_k$ as the first $k$ hypotheses in ascending order of $g(p_i)$. Our test rejects the global null if
$$\exists k \in \mathbb{N}: \sum_{i \in M_k} h(p_i) > u_\alpha(k),$$
where the upper bound $u_\alpha(k)$ is the same as for the martingale Stouffer test in equations (4) and (5), since the sequence of sums $\sum_{i \in M_k} h(p_i)$ is also a martingale with 1-sub-Gaussian increments under the global null. The adaptively ordered martingale test in the batch setting is summarized below.
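Algorithm 1: The adaptively ordered martingale test for the batch setting
Input: p-values $(p_i)_{i=1}^n$, target Type 1 error rate $\alpha$;
Procedure:
  Sort the hypotheses in ascending order of $g(p_i)$;
  for $k = 1, \dots, n$ do
    Let $M_k$ be the first $k$ hypotheses in the sorted order;
    if $\sum_{i \in M_k} h(p_i) > u_\alpha(k)$ then
      reject the global null and stop;
    end
  end

The adaptively ordered martingale test in the online setting proceeds slightly differently: it screens the hypotheses by $g(p)$ so that only promising non-nulls enter the set $M_k$. Specifically, given a threshold parameter $c$ (such as 0.1), the set $M_k$ expands at time $t$ only if $g(p_t) < c$, as summarized below.

Algorithm 2: The adaptively ordered martingale test for the online setting
Input: target Type 1 error rate $\alpha$, threshold parameter $c$;
Procedure:
  Initialize $M_0 = \emptyset$ and $k = 0$;
  for $t = 1, 2, \dots$ do
    if $g(p_t) < c$ then
      Set $k \leftarrow k + 1$ and $M_k = M_{k-1} \cup \{t\}$;
      if $\sum_{i \in M_k} h(p_i) > u_\alpha(k)$ then
        reject the global null and stop;
      end
    end
  end

The adaptively ordered martingale test controls Type 1 error if under the global null, all p-values are mirror-conservative (condition (2)), as formally stated below.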
Theorem 2. If the p-values are independent and mirror-conservative under the global null, then the adaptively ordered martingale test controls the Type 1 error at level $\alpha$.
In the batch setting, the adaptive ordering (as realized by the nested sequence $\{M_k\}$) is fully determined at the start of the procedure by sorting the masked p-values. Next, we demonstrate that in the presence of independent covariates $x_i$ for each hypothesis and side information such as structural constraints on potential rejected sets, it is actually beneficial to interactively determine the ordering one step at a time with a human-in-the-loop, who may be guided by the masked p-values as well as intuition and statistical models.
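The batch and online variants described in this section can be sketched in Python as follows. This is an illustration under assumptions rather than a reference implementation: it assumes the masking functions $g(p) = \min(p, 1-p)$ and $h(p) = \pm 1$ according to whether $p < 0.5$, and a linear boundary with slope $\sqrt{-\log\alpha/(2m)}$ and offset $\sqrt{-m\log\alpha/2}$. All function names are hypothetical.

```python
import math


def linear_boundary(k, alpha, m):
    """Assumed linear boundary for a martingale with 1-sub-Gaussian increments."""
    log_inv_alpha = -math.log(alpha)
    return math.sqrt(log_inv_alpha / (2 * m)) * k + math.sqrt(m * log_inv_alpha / 2)


def mask(p):
    """Return (masked p-value g(p), missing bit h(p))."""
    return min(p, 1.0 - p), (1 if p < 0.5 else -1)


def batch_adaptive_test(p_values, boundary):
    """Sort by masked p-value, then sum missing bits; return the rejection step k or None."""
    order = sorted(range(len(p_values)), key=lambda i: mask(p_values[i])[0])
    s = 0
    for k, i in enumerate(order, start=1):
        s += mask(p_values[i])[1]
        if s > boundary(k):
            return k
    return None


def online_adaptive_test(p_stream, c, boundary):
    """Admit hypothesis t only if g(p_t) < c; return the rejection time t or None."""
    s, k = 0, 0
    for t, p in enumerate(p_stream, start=1):
        g, h = mask(p)
        if g < c:
            k += 1
            s += h
            if s > boundary(k):
                return t
    return None


alpha, m = 0.1, 7.5
bound = lambda k: linear_boundary(k, alpha, m)
batch = [0.7, 0.6] * 10 + [0.01 * j for j in range(1, 11)]  # 10 small p-values among 30
print(batch_adaptive_test(batch, bound))
print(online_adaptive_test([0.8, 0.01] * 10, c=0.1, boundary=bound))
```

Note that the ordering and the screening step only ever read $g(p)$, while the test statistic only ever reads $h(p)$; this separation is what the Type 1 error argument relies on.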

The interactively ordered martingale test
The interactively ordered martingale test also applies to both the batch and online settings. We first describe the framework in the batch setting with side information and structural constraints, where the power of interaction is more compelling.
To begin, first suppose that in addition to the p-values, the scientist also has some side information about each hypothesis available to them in the form of covariates $x_i$. For example, if the hypotheses are arranged in a rectangular grid, then $x_i$ could be the coordinate on the grid for hypothesis $i$. We then suppose that the scientist also has some prior knowledge or intuition about what structural constraints the non-nulls would have, if the global null is false. For example, perhaps the scientist thinks that the non-nulls (if any) would be clustered on the grid, themselves forming a rectangular shape (of some size, at some location). Our main additional assumption about the covariates is that:
$$\text{Under the global null, } x_i \perp p_i \text{ for all } i \in I.$$
Our interactively ordered martingale test satisfies the following two properties: (a) if the global null is true, the Type 1 error is controlled, regardless of what the scientist thinks or how they act; (b) if the global null is false, and the prior knowledge and/or structural constraints are accurate (or somewhat so), then the power of the test is high. The interactive test proceeds as follows:
• At the beginning, all covariates and masked p-values $(x_i, g(p_i))_{i \in I}$ are revealed to the scientist, while only the missing bits $(h(p_i))_{i \in I}$ remain hidden. We initialize $M_0 = \emptyset$.
• The scientist repeats the following at each time step $k \ge 1$: they choose a single promising hypothesis $i_k$ from $[n] \setminus M_{k-1}$, and update $M_k = M_{k-1} \cup \{i_k\}$.
• On doing so, they learn $h(p_{i_k})$, and can thus keep track of $S_k := \sum_{i \in M_k} h(p_i)$. If $S_k > u_\alpha(k)$ for any $k$, we stop and reject the global null.
Type 1 error control is essentially guaranteed because regardless of how the scientist acts at each step, if the global null is true, all the $g(p_i)$ values and the revealed $h(p_i)$ values do not provide any information about the still hidden missing bits, and thus $S_k$ is a martingale. Importantly, when the global null is false, we expect the power to be high for the following reasons. First, the scientist may use any statistical model of their choice (or none at all) to guide their choice at each step. For example, they can attempt to estimate the non-null likelihood for each hypothesis $i$ at each step $k$, denoted as $\pi_i^{(k)}$. In fact, as they learn the missing bits at each step, they can change their model or update their prior knowledge based on the p-values observed thus far. The information available to the scientist at the end of step $k$ is denoted by the filtration
$$\mathcal{F}_k := \sigma\big((x_i, g(p_i))_{i=1}^n, (p_i)_{i \in M_k}\big),$$
and thus, naturally, the choice $i_k$ is predictable, meaning it is measurable with respect to $\mathcal{F}_{k-1}$. The general interactive framework is summarized below as Algorithm 3.

Algorithm 3: The interactively ordered martingale test for the batch setting
Information available to the scientist: side covariate information and/or structural constraints, and masked p-values $\mathcal{F}_0 := \sigma((x_i, g(p_i))_{i=1}^n)$, target error $\alpha$;
Procedure:
  Initialize $M_0 = \emptyset$;
  for $k = 1, \dots, n$ do
    Pick any $i_k \in [n] \setminus M_{k-1}$ using $\mathcal{F}_{k-1}$;
    Update $M_k = M_{k-1} \cup \{i_k\}$;
    Reveal $h(p_{i_k})$ and update $\mathcal{F}_k$;
    if $\sum_{i \in M_k} h(p_i) > u_\alpha(k)$ then
      reject the global null and exit;
    end
  end

The aforementioned algorithm (or framework) comes with the following error guarantee, regardless of the choices made by the scientist.
Theorem 3. If under $H_0^G$, the p-values are mirror-conservative and are independent of each other and of the covariates $x_i$, then the interactively ordered martingale test controls the Type 1 error at level $\alpha$.
Note that there is no requirement whatsoever on the null or non-null p-values when the global null is false. As before, note that under the global null the missing bits are random fair coin flips and the masked p-values are uniform on [0, 0.5] and completely uninformative about the missing bit. However, under the alternative, the true signals have very small masked p-values (say 0.01, 0.003, etc.) and along with covariate information, one may be able to infer that the missing bit is more likely to be +1 and thus include it in the ordering. Continuing the grid example from the start of this section, by revealing all but one bit per p-value at the start of the procedure, the scientist can possibly notice if small masked p-values are randomly scattered or clustered on the grid.
Remark 1. It is critical to remark that for any particular setup, like our example of a grid with a cluster of signals, it may be possible to design a better global null test that is perfectly suited for that setting. Hence, we do not claim that our interactive method is the right test to use in all problem setups. Its main advantage is its generality: instead of having to design a new test for each situation (trying to figure out how to optimally combine prior knowledge, structural constraints and covariates from scratch), our general framework provides a simple and flexible alternative.
The correctness of the test (proof in Appendix A.2) hinges on one bit from each p-value being hidden from the scientist. Once this protocol has been run once, and all p-values have been unmasked, the procedure obviously cannot be run a second time from scratch. In other words, our interactive setup does not prevent these and related forms of p-hacking. This is similar to the traditional offline setup, where it is not allowed to pick the global null test after observing the p-values and guessing which test will have the most power to reject, and if scientists do this anyway and report only the final finding, we would have no way to know whether such inappropriate double-dipping has occurred.
It is worth remarking on the main disadvantage of such a test, relative to (say) the martingale Stouffer test introduced earlier. The interactive test statistic is a sum of coin flips (missing bits): no matter how strong the signal might be, the interactive test statistic can increase by at most one per step. On the other hand, the martingale Stouffer test adds up Gaussians, and if there is a strong signal (very small p-value), it can stop very early. If a relatively good prior ordering is known to the scientist, the martingale Stouffer test should be preferred. However, if the prior knowledge is not in the form of an ordering, but some intuition about how the covariates and p-values may be related or what type of structure the non-nulls may have (if any), then the interactive test can be much more powerful.
The above framework leaves the specific strategy of expanding $M_k$ unspecified, allowing much flexibility. Now, we give one example of how $i_k$ can be chosen based on the available information $\mathcal{F}_{k-1}$. One straightforward choice for $i_k$ is the hypothesis not in $M_{k-1}$ with the highest non-null likelihood, computed with the aid of an assumed statistical model, like the Bayesian two-groups model, where each p-value $p_i$ is drawn from a mixture of a null distribution $F_0$ with probability $1 - \pi_i$ and an alternative distribution $F_1$ with probability $\pi_i$:
$$p_i \sim (1 - \pi_i) F_0 + \pi_i F_1.$$
For example, we can choose $F_0$ as a uniform and $F_1$ as a beta distribution. We may further assume a model that treats $\pi_i$ as a smooth function of $x_i$. The masked p-values $g(p_i)$ and the revealed missing bits in $\mathcal{F}_{k-1}$ can be used to infer the other missing bits using the EM algorithm (details in Appendix F). The missing bits that are inferred to be more likely $+1$ should be chosen, potentially in accordance with other structural constraints. Importantly, the Type 1 error is controlled regardless of the correctness of the model or any other heuristics to expand $M_k$.
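The interactive protocol can be sketched in Python with the scientist replaced by a callback. Everything specific here is an assumption made for illustration: the masking functions $g(p) = \min(p, 1-p)$ and $h(p) = \pm 1$, the linear boundary formula, and in particular `cluster_chooser`, a hypothetical heuristic for one-dimensional covariates that stands in for the model-based strategy (it is not the EM procedure of Appendix F).

```python
import math


def linear_boundary(k, alpha, m):
    """Assumed linear boundary for a sum of +/-1 missing bits."""
    log_inv_alpha = -math.log(alpha)
    return math.sqrt(log_inv_alpha / (2 * m)) * k + math.sqrt(m * log_inv_alpha / 2)


def interactive_test(p_values, covariates, choose, boundary):
    """Run the masking protocol; `choose` sees covariates, masked p-values, revealed bits only."""
    masked = [min(p, 1.0 - p) for p in p_values]
    revealed = {}  # index -> missing bit, in the order chosen
    s = 0
    for k in range(1, len(p_values) + 1):
        i_k = choose(covariates, masked, revealed)
        if i_k in revealed:
            raise ValueError("each hypothesis may be chosen only once")
        revealed[i_k] = 1 if p_values[i_k] < 0.5 else -1  # unmask one bit
        s += revealed[i_k]
        if s > boundary(k):
            return k, list(revealed)
    return None, list(revealed)


def cluster_chooser(positions, masked, revealed, bonus=0.05):
    """Hypothetical heuristic: favor small masked p-values adjacent to revealed +1 bits."""
    def score(i):
        support = sum(1 for j, bit in revealed.items()
                      if bit == 1 and abs(positions[i] - positions[j]) == 1)
        return masked[i] - bonus * support
    candidates = [i for i in range(len(masked)) if i not in revealed]
    return min(candidates, key=score)


n, alpha = 40, 0.1
p_vals = [0.01 if 10 <= i < 20 else (0.4 if i % 2 == 0 else 0.6) for i in range(n)]
step, chosen = interactive_test(p_vals, list(range(n)), cluster_chooser,
                                lambda k: linear_boundary(k, alpha, m=n / 4))
print(step, chosen)
```

Because `choose` never receives the raw p-values of unrevealed hypotheses, any replacement strategy with the same signature stays within the protocol.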

Power guarantees
The interactively ordered martingale test expands the testing set $M_k$ interactively based on the filtration $\mathcal{F}_k$ containing masked p-values, and tests the global null using the missing bits $h(p_i)$. Another cumulative test we proposed in Section 2, the martingale Stouffer test, uses the complete p-values for testing but relies on a good prior ordering to expand $M_k$. In terms of detection power, neither method dominates the other. This section assesses the conditions that guarantee $1 - \beta$ power for both frameworks, that is, the conditions that control the Type 2 error at $\beta$. Specifically, our analysis considers a simple multiple testing problem where each hypothesis is a one-sided test on the mean value of a Gaussian, as described in Setting 1.
Setting 1. When each individual hypothesis tests whether a unit-variance Gaussian has zero or positive mean, the global test is equivalent to considering a mixture of $n$ Gaussians $N(\mu_i, 1)$ and testing the set of mean values:
$$H_0^G: \mu_i = 0 \text{ for all } i \in I \quad \text{versus} \quad H_1^G: \mu_i > 0 \text{ for some } i \in I.$$
Though the power is compared under the above simple setting of testing Gaussian means, our proposed methods apply to many other types of hypotheses as long as their p-values are mirror-conservative under the null. Such hypotheses include permutation tests and one-sided tests with monotone likelihood ratio. In fact, the alternative mean $\mu_i$ in Setting 1 can be interpreted as a general measure of the separation between the null and the alternative. This setting also does not assume any prior knowledge, since accounting for prior knowledge gives the interactively ordered martingale test so much flexibility that a general power analysis becomes ill-defined. Without prior knowledge, the interactively ordered martingale test collapses to its special case of ordering the hypotheses by $g(p)$, the adaptively ordered martingale test (Algorithm 2). Nevertheless, we note that there is much potential for the interactively ordered martingale test to further improve power using prior knowledge, as shown later by simulation in Section 6.
The interactively ordered martingale test and the martingale Stouffer test both apply to two settings, the batch setting and the online setting. We separately discuss the power guarantees and compare them with the corresponding alternatives in the two settings.

Power guarantees in the batch setting
This section derives the conditions to guarantee 1 − β power of the martingale Stouffer test, the interactively ordered martingale test, and the batch Stouffer test for comparison.

The batch Stouffer test and the martingale Stouffer test
The conditions are all in the form of comparing the expected mean value (non-null signal) with the number of hypotheses.
Theorem 4. A necessary and sufficient condition for the batch Stouffer test with Type 1 error $\alpha$ to have at least $1 - \beta$ power is
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \mu_i \ge \Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta).$$
In contrast, a sufficient condition for the martingale Stouffer test to have power at least $1 - \beta$ is
$$\exists k \in [n]: \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \mu_i \ge C_{\alpha,k} + C_{\beta,k},$$
where $C_{\alpha,k} = 1.7\sqrt{\log\log(2k) + 0.72 \log\frac{5.19}{\alpha}}$ and $C_{\beta,k} = 1.7\sqrt{\log\log(2k) + 0.72 \log\frac{5.19}{\beta}}$ are almost constants with respect to $k$. Further, the aforementioned condition is (up to constants) necessary when $\alpha < 1 - \beta$.
The proof is in Appendix B. For an additional comparison, the power of the Bonferroni method is less than $1 - \beta$ unless at least one individual mean value is large. While the condition for the batch Stouffer test considers the mean averaged over all the hypotheses, the martingale Stouffer test considers the cumulative means, so that even if the overall averaged mean is small the martingale Stouffer test may still successfully reject the null. Here is an illustrative example. Suppose there are $10^4$ hypotheses, among which 100 have mean value $\mu = 1$ and the others have $\mu = 0$. Under Type 1 error $\alpha = 0.1$, the Bonferroni method and the batch Stouffer test have power less than 0.8; in contrast, the martingale Stouffer test can have power greater than 0.8 if the non-zero mean values are exactly the first 100 ones.
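The illustrative example can be checked numerically with a short Python sketch. The forms of the two conditions are assumptions of the sketch: the batch condition is taken to be $\sum_i \mu_i / \sqrt{n} \ge \Phi^{-1}(1-\alpha) + \Phi^{-1}(1-\beta)$, which follows from $S_n \sim N(\sum_i \mu_i, n)$, and the martingale condition is taken to be $\sum_{i \le k} \mu_i \ge \sqrt{k}(C_{\alpha,k} + C_{\beta,k})$ for some $k$, with $C_{\cdot,k} = 1.7\sqrt{\log\log(2k) + 0.72\log(5.19/\cdot)}$. Function names are hypothetical.

```python
import math
from statistics import NormalDist


def c_const(level, k):
    """Assumed constant C_{level,k} = 1.7 * sqrt(log log(2k) + 0.72 * log(5.19 / level))."""
    return 1.7 * math.sqrt(math.log(math.log(2 * k)) + 0.72 * math.log(5.19 / level))


def batch_condition_holds(mu, alpha, beta):
    """Check sum(mu) / sqrt(n) >= z_{1-alpha} + z_{1-beta} for the batch Stouffer test."""
    z = NormalDist().inv_cdf
    return sum(mu) / math.sqrt(len(mu)) >= z(1 - alpha) + z(1 - beta)


def martingale_condition_step(mu, alpha, beta):
    """Return the first k whose cumulative mean meets the assumed sufficient condition."""
    total = 0.0
    for k, m in enumerate(mu, start=1):
        total += m
        if total >= math.sqrt(k) * (c_const(alpha, k) + c_const(beta, k)):
            return k
    return None


mu = [1.0] * 100 + [0.0] * 9900  # non-nulls placed first, as in the example
print(batch_condition_holds(mu, alpha=0.1, beta=0.2))
print(martingale_condition_step(mu, alpha=0.1, beta=0.2))
```

Under these assumed forms, the batch condition fails (the normalized sum is 1, below the required sum of quantiles) while the cumulative condition is met within the first 100 hypotheses, in line with the example.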
The interactively ordered martingale test
For clarity, we assume all the non-nulls have the same mean value, $\mu_i = \mu$ if $r_i = 1$ (that is, if hypothesis $i$ is non-null). Denote the number of non-nulls as $N_1$ and the number of nulls as $N_0$. Let $Z(\nu)$ be a Gaussian random variable with unit variance and mean $\nu$; then the non-nulls are $\{Z_j(\mu)\}$ for $j = 1, \dots, N_1$, and we let $Z_{(j)}(\mu)$ denote the $j$-th non-null after ordering by absolute value.
Theorem 5. The interactively ordered martingale test with level $\alpha$ has at least $1 - 2\beta$ power if
The proof is in Appendix B. For interpretation, we present a sufficient condition. Suppose there is a sufficient number of non-nulls such that $N_1 \ge 6(C_{\alpha,n} + C_{\beta,n})^2$, but they are sparse with respect to the number of nulls such that $N_0 > 0.1 N_1^2$. Then the interactively ordered martingale test has $1 - 2\beta$ power once $\mu$ exceeds a threshold. For comparison, the batch Stouffer test has power at least $1 - \beta$ only if
$$\mu \ge \sqrt{\frac{N_0 + N_1}{N_1^2}} \left( \Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta) \right).$$
Both conditions become stricter as the ratio $N_0 / N_1^2$ grows, which corresponds to sparser non-nulls. The condition for the interactively ordered martingale test is less sensitive to this ratio since the ratio enters through a log term. For example, when the number of nulls is $N_0 = 10^6$ and the number of non-nulls is 400, under Type 1 error $\alpha = 0.1$, the power of the interactively ordered martingale test is greater than 0.8 if $\mu > 4.0$, while the batch test has less than 0.8 power unless $\mu > 5.3$. In addition, we confirm that the aforementioned conditions do not violate the detection threshold derived in Donoho and Jin (2015) for the same setting of detecting sparse Gaussian mixtures (Appendix B.2).
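The batch Stouffer threshold quoted in the example can be reproduced with a few lines of Python. The sketch assumes that, with $N_1$ non-nulls of common mean $\mu$ among $N_0 + N_1$ hypotheses, the batch statistic is $N(N_1 \mu, N_0 + N_1)$, so power $1 - \beta$ requires $\mu \ge \sqrt{N_0 + N_1}\,(\Phi^{-1}(1-\alpha) + \Phi^{-1}(1-\beta)) / N_1$. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def batch_stouffer_min_mu(n_null, n_nonnull, alpha, beta):
    """Smallest common non-null mean giving the batch Stouffer test power 1 - beta."""
    z = NormalDist().inv_cdf
    n = n_null + n_nonnull
    return math.sqrt(n) / n_nonnull * (z(1 - alpha) + z(1 - beta))


# Example from the text: 10^6 nulls, 400 non-nulls, alpha = 0.1, target power 0.8.
print(round(batch_stouffer_min_mu(10**6, 400, alpha=0.1, beta=0.2), 2))
```

The result is about 5.3, matching the threshold stated for the batch test; the corresponding threshold of 4.0 for the interactive test is not recomputed here because it depends on the condition of Theorem 5.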
To summarize, the martingale Stouffer test and the interactively ordered martingale test require weaker conditions than the batch Stouffer test to achieve the same power. The martingale Stouffer test relies on a good pre-defined ordering, whereas the interactively ordered martingale test relies on a clear distinction between the nulls and the non-nulls. The above results concern the batch setting, and we expect to see similar advantages in the online setting.

Power guarantees in the online setting
When testing the global null, the natural test to compare to is the online Bonferroni method, which chooses a sequence of significance levels $\{l_k(\alpha)\}_{k=1}^{\infty}$ such that $\sum_{k=1}^{\infty} l_k(\alpha) = \alpha$, and rejects the global null if Though the power may be high if the values of $l_k(\alpha)$ are well chosen, without prior knowledge the sequence is typically decaying, with $l_k(\alpha) < \alpha/k$ for all $k = 1, 2, \ldots$ The following sections compare the power guarantee of the online Bonferroni method with those of the martingale Stouffer test and the interactively ordered martingale test.
The online Bonferroni method and the preordered martingale test. Unlike the previous methods, the online Bonferroni method does not aggregate the p-values, so its power guarantee depends on the individual mean values.
Theorem 6. A necessary condition for the online Bonferroni method with Type 1 error $\alpha$ to have In contrast, a sufficient condition for the martingale Stouffer test to have at least $1 - \beta$ power is Further, the aforementioned condition is (up to constants) necessary, because if $\alpha < 1 - \beta$, the power of the martingale Stouffer test is less than $1 - \beta$ whenever The proof is in Appendix C. If the mean values are non-zero but all small, the online Bonferroni method has little power, but the martingale Stouffer test can have good power. For example, suppose the mean value $\mu_k$ weakens as $k$ grows, $\mu_k = k^{-1/3}$ for $k = 1, 2, \ldots$ Under Type 1 error $\alpha = 0.15$, the online Bonferroni method has less than 0.5 power, whereas the martingale Stouffer test has power one.
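As a concrete sketch of the online Bonferroni method, one possible level sequence is $l_k(\alpha) = 6\alpha/(\pi^2 k^2)$, which sums to $\alpha$ and satisfies $l_k(\alpha) < \alpha/k$. This particular sequence is a hypothetical choice made for illustration, and the rule "reject at the first $k$ with $p_k \leq l_k(\alpha)$" is the standard Bonferroni rule assumed here.

```python
from math import pi


def bonferroni_level(alpha, k):
    # Hypothetical level sequence: l_k(alpha) = 6 * alpha / (pi^2 * k^2).
    return 6.0 * alpha / (pi**2 * k**2)


def online_bonferroni(pvalues, alpha):
    # Return the first index k (1-based) with p_k <= l_k(alpha), else None.
    for k, p in enumerate(pvalues, start=1):
        if p <= bonferroni_level(alpha, k):
            return k
    return None


alpha = 0.05
partial_sum = sum(bonferroni_level(alpha, k) for k in range(1, 100001))
first_rejection = online_bonferroni([0.4, 0.2, 1e-4, 0.7], alpha)
print(partial_sum, first_rejection)
```

Because the levels shrink quickly with $k$, a late-arriving non-null must have a very small p-value to trigger rejection, which is the weakness the martingale tests aim to avoid.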
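To see why the martingale Stouffer test eventually detects the decaying signal $\mu_k = k^{-1/3}$, the sketch below searches for the first $k$ at which the cumulative mean exceeds $u_\alpha(k) + u_\beta(k)$. Both the form of this sufficient condition (taken from the argument in Appendix B.1.2) and the form of the curved bound $u_\alpha(k) = 1.7\sqrt{k(\log\log(2k) + 0.72\log(5.19/\alpha))}$ are assumptions, and the value $\beta = 0.01$ is chosen only for illustration.

```python
from math import log, sqrt


def curved_bound(k, level):
    # Assumed form of the curved bound u_level(k).
    return 1.7 * sqrt(k * (log(log(2 * k)) + 0.72 * log(5.19 / level)))


def first_guaranteed_detection(alpha, beta, max_k=10**6):
    # Smallest k with sum_{i<=k} i^(-1/3) >= u_alpha(k) + u_beta(k), else None.
    cumulative_mean = 0.0
    for k in range(1, max_k + 1):
        cumulative_mean += k ** (-1.0 / 3.0)
        if cumulative_mean >= curved_bound(k, alpha) + curved_bound(k, beta):
            return k
    return None


k_star = first_guaranteed_detection(alpha=0.15, beta=0.01)
print(k_star)
```

The cumulative mean grows like $k^{2/3}$ while the bounds grow only slightly faster than $k^{1/2}$, so a crossing exists for every $\beta > 0$, which is consistent with the test having power one in this example.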
The interactive martingale test. For clarity, we consider the same mean value for all the non-nulls, $\mu_i = \mu$ if $r_i = 1$. Let the Z-score for each hypothesis $H_i$ be $Z_i = \Phi^{-1}(1 - p_i)$. To simplify notation, we replace the screening rule $g(p_i) < c$ with an equivalent rule $|Z_i| > c$.
Theorem 7. A sufficient condition for the interactively ordered martingale test with Type 1 error $\alpha$ and parameter $c$ to have $1 - 3\beta$ power is if the non-null proportion The threshold on the right-hand side has two terms in $k$, and decreases almost at rate $O(k^{-1/2})$ (since $C_{\alpha,k} + C_{\beta,k}$ grows at rate $\log\log(2k)$). To simplify the terms $A(\mu; c)$ and $B(\mu; c)$, we consider the case $c = \mu$ (which already demonstrates advantages over the martingale Stouffer test, though ideally a good choice of $c$ should minimize $A(\mu; c)$, which however does not have a closed-form solution). The term $A(\mu; \mu)$ decreases at an exponential rate in $\mu$ when $\mu > 0.25$, $A(\mu; \mu) \leq e^{-\mu^2/4}$, and the term $B(\mu; \mu)$ is upper bounded by a constant for all $\mu \geq 0$. For comparison, the power of the martingale Stouffer test is less than $1 - 3\beta$ if whose threshold also decreases almost at the rate $O(k^{-1/2})$ in $k$, but the term $\mu^{-1}$ decreases much more slowly than the term $A(\mu; \mu)$ for the interactively ordered martingale test. Therefore, the condition for the interactively ordered martingale test is weaker when the non-nulls have sufficiently large mean values but are sparse. As an illustrative example, suppose the non-nulls have mean $\mu = 3.7$ but the non-null proportion is extremely low: it is always less than 0.012%, but after the first $10^4$ hypotheses it is at least 0.01%. Under Type 1 error $\alpha = 0.1$, the martingale Stouffer test has less than 0.7 power until $3.02 \times 10^7$ hypotheses have arrived, whereas the interactively ordered martingale test has at least 0.85 power once $2.86 \times 10^7$ hypotheses have arrived. We use this uncommon example to demonstrate that the sufficient condition for the interactively ordered martingale test can be weaker than the necessary condition for the martingale Stouffer test. In later simulations, we find that the interactively ordered martingale test rejects the null earlier than the martingale Stouffer test under more realistic settings.
Overall, both in the batch setting and the online setting, the martingale Stouffer test and the interactively ordered martingale test require weaker conditions than the classical methods to guarantee the same power when the non-nulls are sparse. The martingale Stouffer test relies on good prior knowledge to order the hypotheses, while the interactively ordered martingale test uses masked p-values to generate a good ordering; thus, each outperforms the other in different situations. The theoretical analyses in this section discuss the case with no prior knowledge, and the simulations in the next section assume non-null structures to demonstrate the higher power of the interactively ordered martingale test over the martingale Stouffer test and other classical methods.

Numerical simulations
While the martingale Stouffer test can only use prior knowledge in the form of non-null probabilities for each hypothesis, the interactively ordered martingale test combines (a) side covariate information (which could include prior non-null probabilities as a component) with (b) structural constraints on the unknown non-null set, and (c) masked p-values, to infer whether a hypothesis is a non-null and thus include it earlier in the ordering. Here, we demonstrate that prior structural constraints can help the interactively ordered martingale test attain higher power than the martingale Stouffer test and some other classical methods (Section 6.1). Even in the absence of such structural information or prior knowledge in the online setting, we show that the interactively ordered martingale test can still have high power in some cases (Section 6.2).

Power against structured alternatives
We consider two non-null structures as simple examples: a blocked structure within a grid and a hierarchical structure within a tree. For each of these, we customize a heuristic strategy to expand M k in the interactively ordered martingale test (recalling that Type 1 error is controlled regardless of the heuristic used, and only power is affected).

Clustered non-nulls in a grid of hypotheses
Consider the setting where the hypotheses are arranged in a rectangular grid, and if the null is false, then the non-nulls form a single coherent cluster. This is a common structure which, as a hypothetical example, is a reasonable belief when trying to detect if there is a tumor in a brain image. Here, the covariates x i are simply the two-dimensional location of the hypothesis H i on the grid. The blocked non-null structure is utilized in specifying the non-null likelihood using model (6) by constraining the prior non-null probabilities π i to be a smooth function of x i . Details can be found in Appendix F.
The block structure is also imposed in the strategy of interactively expanding $M_k$, so that $M_k$ remains a single connected component; we call this the "including method". The including method expands $M_k$ by only including possible non-nulls that are on the boundary of $M_k$ (see Figure 1 for an example).
We compare the interactively ordered martingale test with the martingale Stouffer test and the batch Stouffer test. We use the martingale Stouffer test (MST) with a preordering that starts at the center of the grid; subsequent hypotheses are added to the preordering in randomly chosen (data-independent) directions such that the included hypotheses always form a single cluster. Our simulation has $10^4$ hypotheses arranged in a 100 × 100 grid with a disc of about 150 non-nulls, placed either at the grid center or at a corner of the grid. We use Setting 1, varying the non-null mean over (0.3, 0.6, 0.9, 1.2, 1.5, 1.8). For this experiment and the rest of the paper, the Type 1 error is $\alpha = 0.05$, and the power is estimated using 100 repetitions.
The interactively ordered martingale test has high power for both positions of the non-null block, whereas the power of the martingale Stouffer test drops quickly when the block is not at the center (Figure 2). This is because the martingale Stouffer test has no information about the block position (its preordering starts from the center by default), whereas the interactively ordered martingale test uses masked p-values to learn the block position. It is worth noting that even with a bad preordering, the martingale Stouffer test does not do worse than the batch version, and it has much higher power with a good preordering.
Remark 2. As mentioned in the introduction, we do not intend to claim that the interactively ordered martingale test is in any sense the "best" test for this problem setting. It is possible, or even likely, that several other generic tests (Bonferroni, chi-squared, higher criticism, or many others) or specialized tests (scan statistics) might have higher power. Our goal in this section is to demonstrate the tradeoffs between the batch and martingale versions of the same test (Stouffer in this case), and between the interactive and preordered martingale tests. Also note that the power of our martingale tests depends crucially on the preordering, or on the model and heuristic used to form the ordering interactively, and perhaps better models or algorithms might even improve the power of our own tests. We chose settings that are easy to visualize for intuition, keeping in mind that our tests apply to general covariates $x_i$, and to any prior knowledge, structural constraints, or statistical models.
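The including method described above can be sketched as a greedy expansion over the grid. As a simplifying assumption, the sketch ranks boundary cells by their masked p-value alone, whereas the simulations rank them by a non-null likelihood estimated from model (6); the function name `including_order`, the 4-neighbor notion of boundary, and the toy grid are all hypothetical.

```python
import heapq


def including_order(masked, start):
    # Order all grid cells so that every prefix is a single connected
    # component: repeatedly add the boundary cell with the smallest masked
    # p-value (a stand-in for the highest estimated non-null likelihood).
    rows, cols = len(masked), len(masked[0])
    included = {start}
    order = [start]
    frontier = []  # min-heap of (masked p-value, cell) for boundary cells

    def push_neighbors(cell):
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in included:
                heapq.heappush(frontier, (masked[nr][nc], (nr, nc)))

    push_neighbors(start)
    while frontier:
        _, cell = heapq.heappop(frontier)
        if cell in included:  # stale duplicate entry
            continue
        included.add(cell)
        order.append(cell)
        push_neighbors(cell)
    return order


masked = [
    [0.40, 0.45, 0.30],
    [0.02, 0.01, 0.35],
    [0.03, 0.25, 0.50],
]
order = including_order(masked, start=(1, 1))
print(order)
```

In an actual test, the missing bits $h(p_i)$ would be summed along this order and the expansion would stop as soon as the martingale crosses its bound; the sketch only produces the ordering.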

A sub-tree of non-nulls in a tree of hypotheses
In applications such as wavelet decomposition, the hypotheses can have a hierarchical structure, where a child can be a non-null only if its parent is a non-null. We consider the hierarchical structure in two settings, the batch setting and the online setting.
A fixed tree in the batch setting. The hierarchical structure is again encoded in modeling the non-null likelihood (6) by adding a partial order constraint on $\pi_i$: $\pi_i \geq \pi_j$ if $i$ is the parent of $j$.
The hierarchical structure is also imposed in the strategy for updating $M_k$, so that $M_k$ remains a sub-tree. Specifically, we compare the non-null likelihoods of all the children of $M_k$ and choose the highest one.
We compare the interactively ordered martingale test with the martingale Stouffer test and the chi-square test, where the martingale Stouffer test orders the hypotheses by level and from left to right within each level. We simulate a tree of five levels (the root has twenty children, and each parent node after that has three children) with over 800 nodes in total, 7 of them being non-nulls. Each node tests whether a Gaussian has zero mean as described in Setting 1, where we vary the mean value of the non-nulls over (1, 2, 3, 4). The interactively ordered martingale test is implemented without modeling the non-null likelihoods to limit the computational cost, and Appendix G shows the implementation with modeling on a smaller tree. The interactively ordered martingale test has higher power, especially when the signal is strong, so that the masked p-values provide a better guide for the $M_k$ update (Figure 9a).
(a) Hypothesis tree with decreasing non-null probability, marked by fading red nodes.
A growing tree in the online setting. The online tree grows a new level at every step, with each node's probability of being non-null no larger than that of its parent. For an arriving level $k$, the interactively ordered martingale test models the non-null likelihood $\pi_j^{(k)}$ for the new hypothesis $H_j$ by equation (6), where the prior non-null likelihood is the same as that of its direct parent $H_i$ from level $k - 1$, The parameter $c$ for the interactively ordered martingale test in the online setting is set to 0.5. We compare the interactively ordered martingale test with the martingale Stouffer test and a classical method, the online Bonferroni method. In the online setting, their performance is assessed by the average number of hypotheses required to reject the global null (detection time); smaller is better.
We simulate the online tree with thirty children for the root node and three children for each parent node after that. The probability of being non-null for the first-generation children is set to 0.2 for 25 children and 0.9 for the other 5 children. The three children of each subsequent node decay the probability of being non-null by proportions of 100%, 80%, and 0%. Each node tests whether a Gaussian has zero mean as described in Setting 1, where we vary the mean value of the non-nulls over (1, 2, 3, 4). The interactively ordered martingale test needs a much shorter detection time unless the signals are so strong that any method can detect the non-nulls once they appear (Figure 3c).
Overall, both in the batch setting and the online setting, the interactively ordered martingale test has higher detection power than the martingale Stouffer test, the chi-square test, and the online Bonferroni method when the alternatives are structured. Without structure or any form of prior knowledge, the interactively ordered martingale test and the martingale Stouffer test also have high power in some cases of the online setting.

Powerful in the online setting even without prior knowledge
Without prior knowledge, the martingale Stouffer test and the adaptively ordered martingale test perform better than the few existing alternatives in the online setting. The test performance is evaluated by the average number of hypotheses required to reject the global null (detection time); earlier is better. We compare the adaptively ordered martingale test, the martingale Stouffer test, and the online Bonferroni method; which of them has the shortest detection time depends on the proportion and the mean value of the non-nulls. We simulate hypotheses that each test whether a Gaussian has zero mean, as in Setting 1. Each hypothesis has the same probability of being non-null (a theoretical non-null proportion $\pi$), and the non-nulls share the same mean value, denoted $\mu$. We vary the non-null proportion $\pi \in (0, 1)$ and the mean value of the non-nulls $\mu \in (0.5, 1, 2, 3, 4, 5)$. The martingale Stouffer test rejects the global null first when the proportion of non-nulls is not high, $\pi < 50\%$, or their mean values are not large, $\mu < 4$ (Figure 4). The adaptively ordered martingale test is the first to reject if the non-nulls are very sparse, $\pi < 5\%$, but their mean values are sufficiently high, $\mu \in (2, 3)$.
We note the advantage of the interactively ordered martingale test in practice, where prior knowledge often exists in various forms. The interactively ordered martingale test is highly flexible: it allows modifications to the strategy of expanding $M_k$ at any step and in any form that a human analyst (or a program) wants. For example, the interactively ordered martingale test for the block structure assumes that the non-nulls are in a single block, but it can be changed to develop several blocks if the masked p-values or some side information indicates so. The next section demonstrates one more advantage of the interactively ordered martingale test, namely robustness to conservative nulls.

Robustness to conservative nulls
In all the above simulations, the nulls have uniformly distributed p-values, but in practice they could be stochastically larger than uniform (condition 1) or mirror conservative (condition 2); both are henceforth referred to as "conservative nulls". Such conservative nulls diminish the detection power of many batch global null tests like Fisher's and Stouffer's methods. For example, each term in the Stouffer test is Φ −1 (1 − p), whose value can be smaller than −2 if the p-value is bigger than 0.98; thus as the nulls grow more conservative and their p-values closer to one, its power can quickly drop to zero.
To examine the effect of conservative nulls on the interactively ordered martingale test, we first propose an alternative definition of a masked p-value as $\tilde{g}(p) := \min(p, (p + \tfrac{1}{2}) \bmod 1)$. Recalling that $g(p) = \min(p, 1 - p)$, we refer to $g$ and $\tilde{g}$ as the tent and railway functions respectively (see Figure 5a, Figure 5b). Note that if the p-value is exactly uniformly distributed, $\tilde{g}(p)$ is still independent of $h(p)$, and $g(p)$ has the same distribution as $\tilde{g}(p)$, so all previous results still hold with the new masking function in place of the old one. However, when the p-values are conservative, the new masking function has a clear advantage. To see this, consider a p-value of 0.99. The original masked p-value would be 0.01, thus causing the methods to potentially confuse this with a non-null masked p-value, but the new masked p-value would be 0.49, which the methods would easily exclude as being a null.
Figure 5: Comparing the interactively ordered martingale test with tent and railway masking functions, the martingale Stouffer test, and the chi-square test for robustness to conservative nulls.
As an example, we consider the simple case with no prior knowledge and simulate 1000 hypotheses with 100 non-nulls. Each hypothesis is a one-sided test of whether a Gaussian has zero mean, as described in Setting 1. The alternative mean values are set to 1.5. The mean values for the nulls are negative, so that the resulting null p-values are conservative. We tried nine values from 0 to −4 for the mean of the nulls, with smaller values indicating higher conservativeness. Figure 5c compares the power of the interactive martingale test with tent and railway functions, the martingale Stouffer test, and the chi-squared test. The power of most tests drops sharply to zero, but the power of the interactively ordered martingale test with the new railway function initially dips and then improves. The reason for the initial dip is that the increasingly conservative nulls influence the interactive martingale test in two opposite directions: (a) more null $h(p)$ values are now equal to −1 (instead of being ±1 with equal probability), and this hurts power because including a null $h(p)$ in the martingale almost always lowers its value (instead of increasing and lowering its value with equal probability); (b) as the p-values get more conservative, $\tilde{g}(p)$ approaches 0.5 for nulls, allowing the tests to easily distinguish between the non-nulls and the nulls, which increases the power. When the p-values are only slightly conservative, effect (a) dominates and hurts power, causing the initial dip in power in Figure 5c.
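The two masking functions are easy to compare directly. The sketch below is a minimal illustration with hypothetical function names; it takes the missing bit to be $h(p) = +1$ when $p < 0.5$ and $-1$ otherwise, which matches the statement in Appendix B.2 that $h(p_i) > 0$ is equivalent to $Z_i > 0$.

```python
def tent_mask(p):
    # Original masking function g(p) = min(p, 1 - p).
    return min(p, 1.0 - p)


def railway_mask(p):
    # Alternative masking function: min(p, (p + 1/2) mod 1).
    return min(p, (p + 0.5) % 1.0)


def missing_bit(p):
    # h(p): +1 if p < 0.5, else -1 (sign convention assumed, see lead-in).
    return 1 if p < 0.5 else -1


for p in (0.01, 0.30, 0.60, 0.99):
    print(p, round(tent_mask(p), 4), round(railway_mask(p), 4), missing_bit(p))
```

For a small p-value such as 0.01 both functions return 0.01, so non-nulls look the same under either masking; the difference only appears for p-values near one, which the railway function maps near 0.5 instead of near zero.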

Anytime p-values
In this paper, we defined the problem as testing the global null at a predefined level α. Instead, we could ask the test to output a sequential or anytime p-value for the global null, which is a sequence of p-values {p t } ∞ t=1 that are valid at any stopping time. We use p t to differentiate it from p t -the latter is the input to our global null test, the former is the desired output of our global null test. Specifically, p t is a function of p 1 , . . . , p t , such that if p 1 , . . . , p t are all null, then p t will be a valid p-value (its distribution will be stochastically larger than uniform), and this fact will be true uniformly over t.
Recall that all of the proposed procedures follow the same form; we reject the global null if where S k is a martingale under the global null and u α (k) is a sequence of upper bounds at level α. The anytime p-value p t at time t is defined by the smallest level at which our test would have rejected the null at or before time t.
Definition 1. The p-value $p_t$ can be defined as the smallest level $\alpha$ at which the test would have rejected at or before time $t$, Viewing $u_\alpha(k)$ as a function of two variables $k, \alpha$, we define an inverse function at a fixed $k$ with respect to the level $\alpha$ as which is unique for a given input $S$ since the bound $u_\alpha(k)$ is continuous and strictly decreasing in $\alpha$. Then the p-value at time $t$ can be computed as As one example, if $u_\alpha(k)$ is the linear bound as in test (4), its inverse is The p-value sequence $\{p_t\}_{t=1}^{\infty}$ has the following nice properties:
1. The anytime p-values decrease with time: $p_{t+j} \leq p_t$ for all $j, t > 0$.
2. $\inf_{t \in I} p_t$ is also a valid p-value for the global null: In fact, $\inf_{t \in I} p_t$ is the global p-value, the smallest level $\alpha$ at which the test would ever reject:
3. For any arbitrary stopping time $\tau \in I$, $p_\tau$ is a valid p-value: for all $x \in (0, 1)$.
The second property implies that the p-value at any time $t$ is a valid p-value. Recalling that fixed-sample p-values are dual to fixed-sample confidence intervals, it is also the case that anytime p-values are dual to anytime confidence intervals. These ideas are explored and explained in depth in Howard et al. (2018b). The idea of anytime p-values is also studied by Grünwald (2018); Grünwald et al. (2019), where they propose a new measure as a substitute for classical p-values. The main takeaway message for our current paper is that all aforementioned tests can be reformulated as calculating anytime p-values. To exactly recover our level $\alpha$ tests, we just stop the first time that $p_t \leq \alpha$ and reject.
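The construction of the anytime p-value can be sketched for the curved bound. The sketch assumes the bound has the form $u_\alpha(k) = 1.7\sqrt{k(\log\log(2k) + 0.72\log(5.19/\alpha))}$, inverts it in closed form with respect to $\alpha$, and takes the running minimum over $k \leq t$; the function names and the example increments are hypothetical.

```python
from math import exp, log, sqrt


def curved_bound(k, level):
    # Assumed form of the curved bound u_level(k).
    return 1.7 * sqrt(k * (log(log(2 * k)) + 0.72 * log(5.19 / level)))


def inverse_bound(s, k):
    # Level alpha solving u_alpha(k) = s, capped at 1 (and 1 when s <= 0).
    if s <= 0:
        return 1.0
    level = 5.19 * exp(-((s / 1.7) ** 2 / k - log(log(2 * k))) / 0.72)
    return min(1.0, level)


def anytime_pvalues(increments):
    # Running minimum over k <= t of the inverted bound at the partial sum S_k.
    pvals, running_sum, current = [], 0.0, 1.0
    for k, x in enumerate(increments, start=1):
        running_sum += x
        current = min(current, inverse_bound(running_sum, k))
        pvals.append(current)
    return pvals


z_scores = [2.1, 1.4, -0.3, 2.8, 1.9, 2.5]  # illustrative martingale increments
p_anytime = anytime_pvalues(z_scores)
print([round(p, 4) for p in p_anytime])
```

The running minimum makes the monotonicity property immediate, and stopping the first time the output drops below $\alpha$ recovers the level-$\alpha$ test based on the same bound.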

Summary
We have introduced martingale analogs of some classical global null tests, and used these to build adaptively ordered martingale tests through the idea of masking. These are further generalized to a protocol for interactively ordered martingale tests that possess the following interesting advantages:
• It is a general global null testing framework that can utilize any type of covariates, structural constraints, prior knowledge, and repeated user interaction guided by an assumed statistical model, all while provably controlling the Type 1 error.
• It permits the use of Bayesian modeling techniques while retaining frequentist error guarantees.
• It applies to both the batch and online settings.
• It is robust against conservative nulls.
• It has favorable theoretical power guarantees in simple settings, and performs well in simulations.
We believe that interactive testing protocols are only beginning to be explored in the literature, and constitute both an intellectually fascinating direction for further exploration, as well as a potentially powerful one. Masking (and progressive unmasking) is a promising technique that permits interaction, and it deserves further scrutiny and generalization to other settings.

A Error control
This section proves the Type 1 error control for our proposed methods: the martingale Stouffer test and the interactively ordered martingale test.

A.1 Proof of Theorem 1
Proof. Under the global null, because p-values are independent and stochastically larger than the uniform, the transformed p-values Φ −1 (1 − p i ) are independent and stochastically smaller than a standard Gaussian. Thus given the uniform bound for a Gaussian increment martingale u α (k), where G i for i ∈ I are i.i.d. standard Gaussians. By definition the above argument proves the Type 1 error control.
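For concreteness, the martingale Stouffer test whose error control is proved above can be sketched as follows. The sketch assumes the curved bound $u_\alpha(k) = 1.7\sqrt{k(\log\log(2k) + 0.72\log(5.19/\alpha))}$ as the uniform bound and p-values strictly between 0 and 1; the function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def curved_bound(k, alpha):
    # Assumed uniform bound for a Gaussian increment martingale.
    return 1.7 * sqrt(k * (log(log(2 * k)) + 0.72 * log(5.19 / alpha)))


def martingale_stouffer(pvalues, alpha):
    # Accumulate Z_i = Phi^{-1}(1 - p_i) in the given order and return the
    # first k (1-based) at which the sum exceeds the bound, or None.
    std = NormalDist()
    running_sum = 0.0
    for k, p in enumerate(pvalues, start=1):
        running_sum += std.inv_cdf(1.0 - p)
        if running_sum > curved_bound(k, alpha):
            return k
    return None


print(martingale_stouffer([0.05] * 20, alpha=0.05))
print(martingale_stouffer([0.5] * 50, alpha=0.05))
```

Because the bound holds uniformly over $k$, the loop may be stopped at any data-dependent time without inflating the Type 1 error, which is what makes the same code usable in the online setting.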

A.2 Proof of Theorem 3
This proof also implies Theorem 2 since the adaptively ordered martingale test is a special case of the interactively ordered martingale test.
Proof. We start by considering a special case that p-values are exactly uniform. We argue that the sum { i∈M k+1 h(p i )} k∈I is a martingale with respect to the filtration {F k } k∈I . First, the sum First note that for any i / ∈ M k , where the last equation is because 1(i * k = i) ∈ F k . Observe that under the global null, at any step k, the unobserved missing bits h(p i ) are independent of all observed information: so the expectation of the increment is E(h(p i ) | F k ) = E(h(p i )) = 0. Because the increment h(p i * k ) | F k is a Bernoulli (of value ±1), i∈M k+1 h(p i ) is a 1-subGaussian increment martingale. Therefore, for any uniform bound u α (k) of the Gaussian increment martingale.
In the general case where p-values are mirror conservative (condition (2)), the increment is stochastically smaller than Ber(0.5) because $E(h(p_{i^*_k}) \mid F_k) \leq 0$, as argued below. A missing bit $h(p_i)$ conditioned on its corresponding masked p-value $g(p_i)$ is stochastically smaller than a fair coin flip: Under the global null, for any hypothesis $i \notin M_k$, observe that $h(p_i) \mid F_k$ has the same distribution as $h(p_i) \mid g(p_i)$. Thus, the conditional expectation $E(h(p_{i^*_k}) \mid F_k)$ is upper bounded by zero: Thus, the increment is stochastically smaller than Ber(0.5), and following the same argument as in Appendix A.1, the test using a bound for a Gaussian increment martingale controls the Type 1 error.

B Proofs of the power guarantees in the batch setting
This section presents the proofs of power guarantees in the batch setting for 1) the batch Stouffer test, 2) the martingale Stouffer test and 3) the interactively ordered martingale test.

B.1 Proof of Theorem 4
We divide the proof into two subsections for the batch Stouffer test and the martingale Stouffer test.

B.1.1 The batch Stouffer test
Proof. Define the Z-score for each hypothesis $H_i$ as $Z_i = \Phi^{-1}(1 - p_i)$. Under Setting 1 of testing a Gaussian mean, the Z-score is Gaussian, $Z_i \sim N(\mu_i, 1)$, or written as $N(r_i\mu_i, 1)$ to separate the true nulls from the true non-nulls. Thus, the sum $S_n = \sum_{i=1}^{n} Z_i$ is also Gaussian, $S_n \sim N(\sum_{i=1}^{n} r_i\mu_i, n)$. The power of the batch Stouffer test is A power of at least $1 - \beta$ is equivalent to which can be rewritten as which is the condition in Theorem 4.
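The chain of equivalences in this proof can be written out as follows. This is a sketch under the assumption that the batch Stouffer test rejects when $S_n/\sqrt{n}$ exceeds the $1 - \alpha$ quantile of the standard Gaussian, denoted here by $z_{1-\alpha}$; this quantile notation is introduced only for the illustration.

```latex
\begin{align*}
\text{power}
  &= \mathbb{P}\left(\frac{S_n}{\sqrt{n}} > z_{1-\alpha}\right)
   = 1 - \Phi\left(z_{1-\alpha} - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} r_i \mu_i\right), \\
\text{power} \geq 1 - \beta
  &\iff z_{1-\alpha} - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} r_i \mu_i \leq \Phi^{-1}(\beta) = -z_{1-\beta} \\
  &\iff \frac{1}{\sqrt{n}} \sum_{i=1}^{n} r_i \mu_i \geq z_{1-\alpha} + z_{1-\beta}.
\end{align*}
```

The first line uses $S_n/\sqrt{n} \sim N(n^{-1/2}\sum_i r_i\mu_i, 1)$, and the last line makes explicit that the batch condition depends on the means only through their sum scaled by $\sqrt{n}$.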

B.1.2 The martingale Stouffer test
Proof. The power of the martingale Stouffer test is at least since under such condition, The last step holds because a Gaussian increment martingale is symmetric, so that $-u_\beta(k)$ is a uniform lower bound. The power of the martingale Stouffer test is less than $1 - \beta$ if Thus we find a sufficient condition and a necessary condition for the martingale Stouffer test to have $1 - \beta$ power. The proof is completed by plugging the curved bound in test (5) into the conditions. Unless otherwise specified, $u_\alpha(k)$ in the rest of the proofs denotes the curved bound.

B.2 Proof of Theorem 5
The interactively ordered martingale test uses the missing bits h(p i ) for testing, and under no prior knowledge, uses the masked p-values g(p i ) to order the hypotheses. We divide the proof into three steps, 1) derive the power guarantee given a fixed order in Lemma 1; 2) quantify the effect of ordering by masked p-values in Lemma 2, and 3) derive the power guarantee for the interactively ordered martingale test (Theorem 5).
The power of interactively ordered martingale test given a fixed order Lemma 1. Given a fixed sequence of {M k } n k=1 with the size |M k | = k, the interactively ordered martingale test with Type 1 error control α has power at least 1 − β if ∃k ∈ {1, · · · , n} : ) is a measurement of the "signal strength" from the non-nulls and S i (0) = P(h(p i ) = 1 | r i = 0, {M k } n k=1 ) is from the nulls. Meanwhile the power is less than 1 − β if ∀k ∈ {1, · · · , n} : Proof. Consider the re-scaled increment (h(p i * k ) + 1)/2 | F k , which follows a Bernoulli So the cumulative sum S k is a martingale with sub-Gaussian increments after centering, whose ex- 1)). So the power of interactively ordered martingale test is The proof can be completed by following similar steps in the proof for martingale Stouffer test (Appendix B.1.2).
The effect of ordering. Define the Z-score as $Z_i = \Phi^{-1}(1 - p_i)$ for each hypothesis $H_i$. Under Setting 1, $Z_i$ is Gaussian with unit variance and mean value $\mu_i$. We consider the simple case where $\mu_i = \mu$ for all the non-nulls. The interactively ordered martingale test orders the hypotheses increasingly by $g(p_i)$, which is equivalent to ordering decreasingly by $|Z_i|$. Following the notation in Theorem 5, the Z-scores for the non-nulls have the same distribution as $Z(\mu)$, and $Z_{(j)}(\mu)$ is the Z-score of the $j$-th non-null when they are ordered decreasingly by $|Z_i|$. We describe the effect of ordering by the size of the set $M_k$ right after the $j$-th non-null enters, denoted as $M(j)$.
Proof. In M (j), the number of non-nulls is known as j − 1 and the number of nulls is random. The nulls in M (j) should have a higher absolute Z-score than |Z (j) (µ)|. Note that the Z-scores of the nulls are i.i.d. standard Gaussians, so the probability of a null to be in front of the j-th non-null is P(|Z(0)| > |Z (j) (µ)|) for any nulls. Thus the number of nulls before the j-th non-null follows a binomial distribution: Thus, the size of M (j) is distributed as By the Bonferroni correction, with high probability |M (j)| is upper bounded by We further characterize the Binomial quantile B β/N1 (N 0 , j) (proof of Remark 3). The quantile is upper bounded (Chernoff inequality): The proof completes by showing that the probability term P(|Z(0)| > |Z (j) (µ)|) is upper bounded by The bound (11) holds because the event |Z(0)| > |Z (j) (µ)| can be viewed as comparing Z(0) with N 1 Gaussians with the same distribution as Z(µ), and Z(0) is bigger than N 1 − j + 1 of them. Given that the probability of Z(0) bigger than one Z(µ) is P (µ), let X be Bin(N 1 , P (µ)) and the bound (11) holds because P(|Z(0)| > |Z (j) (µ)|) = P(X > N 1 − j + 1) for j = 1, · · · , N 1 (1 − P (µ)) + 1 . The proof of Remark 3 is completed by plugging bound (11) in the upper bound for B β/N1 (N 0 , j).

Proof of Theorem 5
Proof. Lemma 1 provides a condition for the interactively ordered martingale test to have at least $1 - \beta$ power given any choice of $\{M_k\}_{k=1}^{n}$; thus, when $\{M_k\}_{k=1}^{n}$ is random, the power is at least $1 - \beta$ if ∃k ∈ {1, · · · , n} : where $S_i(0)$ and $S_i(1)$, as probabilities conditional on $M_k$, are random. Whether the above condition holds is therefore not deterministic, and Theorem 5 provides a sufficient condition under which the above condition holds with high probability.
First, for all the nulls, where (a) is because, by the definition of the Z-score, $h(p_i) > 0$ is equivalent to $Z_i > 0$; and (b) is because $\{M_k\}_{k=1}^{n}$ is determined by $|Z_i|$, which is independent of $1(Z_i > 0)$. Thus, $(2S_i(0) - 1)(1 - r_i) = 0$, and in the above condition the sum on the left-hand side only increases when a non-null enters $M_k$. Therefore, the above condition is satisfied if and only if it is satisfied when a non-null enters $M_k$: ∃j ∈ {1, · · · , $N_1$} : Second, the non-nulls in $M(j)$ are the ones with the $j$ highest absolute Z-scores, whose Z-scores are $Z_{(1)}(\mu), \ldots, Z_{(j)}(\mu)$. Thus, $\sum_{i \in M(j), r_i = 1} S_i(1) = \sum_{s=1}^{j} P(Z_{(s)}(\mu) > 0)$, and the above condition can be rewritten as The above condition holds with probability at least $1 - \beta$ if ∃j ∈ {1, · · · , $N_1$} : where $C_{\alpha,n} + C_{\beta,n} \geq C_{\alpha,|M(j)|} + C_{\beta,|M(j)|}$ and $j + B_{\beta/N_1}(j)$ is the uniform upper bound of $|M(j)|$ by Lemma 2.
Overall, when condition (13) holds, the probability of failing to reject is less than the sum of 1) the probability that $|M(j)|$ exceeds its upper bound, which is less than $\beta$; and 2) the probability of not rejecting when condition (12) is satisfied, which is also less than $\beta$; thus the power is at least $1 - 2\beta$.

B.3 Proof of condition (8)
Proof. Let $j = N_1/2$ in Theorem 5; the power of the interactively ordered martingale test is at least First, the left-hand side can be lower bounded by since the term $\frac{1}{j}\sum_{s=1}^{j} 2P(Z_{(s)}(\mu) > 0) - 1$ decreases in $j$ and is minimized at $j = N_1$, whose value is Second, on the right-hand side, $B_{\beta/N_1}(N_0, N_1/2)$ can be upper bounded (Chernoff inequality): in which the probability term $P(|Z(0)| > |Z_{(N_1/2)}(\mu)|)$ can be further upper bounded by where (a) is in the proof of Remark 3 in Appendix B.2; (b) holds given $N_1 \geq 6(C_{\alpha,n} + C_{\beta,n})^2$ and $\mu > 2$ (an assumption we will revisit later); and (c) is because P Plugging the lower bound of the left-hand side and the upper bound of the right-hand side, condition (14) is implied by Given $N_1 \geq 6(C_{\alpha,n} + C_{\beta,n})^2$ and $\mu > 2$, the above condition holds if Because $1 - \Phi(\mu) \leq e^{-\mu^2/2}/2$ when $\mu > 2$ and $\log(N_1/\beta) < N_1/5$ when $N_1 \geq 6(C_{\alpha,n} + C_{\beta,n})^2$, a sufficient condition for the above condition is which can be written as a condition on $\mu$: $\mu \geq \sqrt{2\log(N_0/N_1^2) + 4\log(C_{\alpha,n} + C_{\beta,n}) + 3.45}$.
Finally we complete the proof by noting that the above condition implies the assumption µ ≥ 2 when N 0 > 0.1N 2 1 .

C Proofs of the power guarantees in the online setting
This section proves the power guarantees in the online setting for three methods: the martingale Stouffer test, the interactively ordered martingale test, and a benchmark, the online Bonferroni method.

C.1 Proof of Theorem 6
The power guarantee for the martingale Stouffer test in the online setting follows the same steps as in the batch setting (Appendix B.1.2), except that the range of $k$ is changed from $\{1, \dots, n\}$ to $\{1, 2, \dots\}$. We present the proof of the power guarantee for the online Bonferroni method as follows.
Denote $Z_k = \Phi^{-1}(1 - p_k)$, whose distribution is $N(\mu_k, 1)$ under Setting 1, and the bound above can be rewritten as Thus, a necessary condition for the online Bonferroni method to have at least $1 - \beta$ power is which can be rearranged as and this concludes the proof.

C.2 Proof of Theorem 7
Theorem 7 is a simplified version of the following Theorem 8 (by Claim 1). Before stating Theorem 8, we first define the distinction measure $D(c)$ as where $c$ is the screening parameter in the online interactively ordered martingale test. A larger $D(c)$ indicates greater distinction. Further, denote by $N_1(k) = \sum_{i=1}^{k} r_i$ the number of non-nulls after $k$ hypotheses arrive, and by $N_0(k) = \sum_{i=1}^{k} (1 - r_i)$ the number of nulls.
Theorem 8. The interactively ordered martingale test with Type 1 error $\alpha$ and threshold $c$ guarantees
Proof. Denote by $M_k$ the set of hypotheses that pass screening ($|Z_i| > c$) after $k$ hypotheses arrive. By extending Lemma 1 from $k = 1, \dots, n$ to $k = 1, 2, \dots$, the power of the interactively ordered martingale test is at least $1 - \beta$ if where, for the non-nulls that pass screening, $S_i(1) = P(h(p_i) = 1 \mid r_i = 1, i \in M_i) = P(Z_i > 0 \mid r_i = 1, |Z_i| > c) = S(\mu, c)$, and for the nulls that pass screening, $S_i(0) = P(Z_i > 0 \mid r_i = 0, |Z_i| > c) = P(Z(0) > 0 \mid |Z(0)| > c) = 0.5$. By the lemmas presented below, the right-hand side is upper bounded by with probability $1 - \beta$ (Lemma 3). The left-hand side is lower bounded by i∈M k with probability $1 - \beta$ (Lemma 4). The condition in Theorem 8 results from plugging the bounds of both sides into condition (17). Overall, when the condition in Theorem 8 holds, the probability of failing to reject is less than the sum of (1) the probability that the upper bound for the right-hand side is violated, which is less than $\beta$; (2) the probability that the lower bound for the left-hand side is violated, which is less than $\beta$; and (3) the probability of not rejecting when condition (17) is satisfied, which is less than $\beta$. Thus, the power is at least $1 - 3\beta$.
Lemma 3. The size of $M_k$ in the online setting is uniformly upper bounded:
Proof. The probability of a hypothesis $H_i$ passing screening is $P(|Z(\mu)| > c)$ when $H_i$ is a non-null, and $P(|Z(0)| > c)$ when $H_i$ is a null. Let $X_i$ be the indicator of whether $H_i$ passes the screening, so that $|M_k| = \sum_{i=1}^{k} X_i$. Because the $X_i$'s are independent and each $X_i$ is a mixture of two Bernoullis (taking values in $\{0, 1\}$), the size $|M_k|$, once centered at its expectation, is a martingale with $\frac{1}{4}$-sub-Gaussian increments. Therefore, where $u_\beta(k)$ is the upper bound for a martingale with Gaussian increments as in test (5). The expected value is which completes the proof.
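The variance proxy of $\frac{1}{4}$ used in the proof of Lemma 3 is the standard consequence of Hoeffding's lemma for a variable bounded in $[0, 1]$. The step is spelled out below for convenience; the symbol $\theta_i$ for the passing probability of $H_i$ is introduced here only for illustration and does not appear in the original text.

```latex
% Hoeffding's lemma applied to X_i \in \{0, 1\} with mean \theta_i = E[X_i]:
\mathbb{E}\left[ e^{\lambda (X_i - \theta_i)} \right]
  \le \exp\left( \frac{\lambda^2 (1 - 0)^2}{8} \right)
  = \exp\left( \frac{\lambda^2}{2} \cdot \frac{1}{4} \right)
  \quad \text{for all } \lambda \in \mathbb{R}.
% Hence each centered increment X_i - \theta_i is sub-Gaussian
% with variance proxy 1/4, whatever the value of \theta_i.
```

Since the bound does not depend on $\theta_i$, it applies equally to nulls and non-nulls, which is why the mixture of the two passing probabilities causes no difficulty.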
Lemma 4. The number of non-nulls in $M_k$ is uniformly lower bounded: The proof follows the same steps as in Lemma 3, by considering only the non-nulls, or equivalently assuming $r_i = 1$ for all $i$.
Claim 1. The condition in Theorem 7 for the interactively ordered martingale test to have power $1 - 3\beta$ implies the condition in Theorem 8.
Proof. First, the condition in Theorem 8 can be written as a quadratic inequality on N 1 (k), 2P(|Z(µ)|>c) ≥ 0.9N 1 (k) since the condition in Theorem 7 guarantees N 1 (k) ≥ C β,k 0.2P(|Z(µ)|>c) 2 . Solve the quadratic inequality for N 1 (k) to get a sufficient condition of the above one: c) . Note that under the square root, the last two terms involving k is upper bounded by for a, b > 0, an upper bound on the right-hand side is Thus, the above condition on N 1 (k) is implied by . Finally we review the assumptions made throughout the proof: 1) we assume N 1 (k) ≥

D Choices for the uniform bounds in the martingale Stouffer test
The martingale Stouffer test has the general form: where $u_\alpha(k)$ is the uniform bound for a martingale with standard Gaussian increments. We present four bounds from Howard et al. (2018a) and Howard et al. (2018b),

a linear bound
where m ∈ R + is a tuning parameter that determines the time at which the bound is tightest: a larger m results in a lower slope but a larger offset, making the bound loose early on.
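The trade-off controlled by $m$ can be made concrete with a small sketch. It assumes that the test rejects at the first $k$ where the running sum of $\Phi^{-1}(1 - p_i)$ exceeds the boundary, and it uses one standard linear boundary, $u_\alpha(k) = \sqrt{\log(1/\alpha)/(2m)}\,k + \sqrt{m\log(1/\alpha)/2}$, obtained from Ville's inequality applied to $\exp(\lambda S_k - \lambda^2 k/2)$. Both the boundary and the function names are assumptions made for illustration and need not coincide with bound (18).

```python
import math
from statistics import NormalDist


def linear_bound(k, alpha, m):
    """Assumed linear boundary: slope shrinks and offset grows as m increases."""
    log_term = math.log(1.0 / alpha)
    slope = math.sqrt(log_term / (2.0 * m))
    offset = math.sqrt(m * log_term / 2.0)
    return slope * k + offset


def martingale_stouffer(pvals, alpha, m):
    """Return the first k at which the running Z-score sum exceeds the
    boundary, or None if it never does. Each p-value must lie in (0, 1)."""
    inv_cdf = NormalDist().inv_cdf
    running_sum = 0.0
    for k, p in enumerate(pvals, start=1):
        # Phi^{-1}(1 - p) = -Phi^{-1}(p); the latter is stabler for tiny p.
        running_sum += -inv_cdf(p)
        if running_sum > linear_bound(k, alpha, m):
            return k
    return None
```

For example, `martingale_stouffer(pvals, alpha=0.05, m=len(pvals) // 4)` would correspond to the default $m = n/4$ discussed in this appendix, under the assumed boundary.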
We use simulations to explore two choices in the martingale Stouffer test: 1) the choice of parameter m in the linear bound (18); and 2) the choice among the above four types of bound.
Choice of the parameter m in the linear bound. A good choice of the parameter $m$ should make the bound tight where most non-nulls appear; thus, it depends on how the non-nulls are distributed. A smaller $m$ results in a steeper slope but a tighter bound early on, so it is preferable when the non-nulls are concentrated at the front, and vice versa. We seek a robust value of $m$ such that the resulting test has relatively high power under different levels of non-null sparsity. The following constructed simulation is used for exploring bounds in both the martingale Stouffer test and the martingale Fisher test.
Setting 2. Consider the hypothesis of testing if a Gaussian has zero mean as in Setting 1. In total, $n = 10^4$ samples are simulated, with 100 from the non-null distribution $N(1.5, 1)$ and the rest from the null $N(0, 1)$. The non-null sparsity is varied by restricting the range over which the non-nulls are randomly distributed. The non-null range is set as $H_1$ to $H_l$, and we test values $l = 100, 10^3, 2 \times 10^3, \dots, 10^4$. We define the non-null sparsity as $l/n$; a larger value indicates a sparser non-null distribution.
We compare three choices, $m = n/4, n/2, 3n/4$, with an oracle benchmark of $m = l$ (whose corresponding bound is tightest right after all the non-nulls appear). In both the martingale Stouffer test (Figure 6a) and the martingale Fisher test (Figure 6b), the choice $m = n/4$ leads to the highest power, which is also close to the oracle benchmark.
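Data from Setting 2 could be generated along the following lines. This is a minimal sketch that assumes one-sided p-values $p_i = 1 - \Phi(Z_i)$ and non-null positions drawn uniformly without replacement from the first $l$ indices; the function name, its arguments, and the seeding are hypothetical choices, not taken from the original experiments.

```python
import random
from statistics import NormalDist


def simulate_setting2(n=10_000, n_nonnull=100, l=1_000, mu=1.5, seed=0):
    """Simulate p-values with n_nonnull non-nulls N(mu, 1) placed at random
    among the first l positions (requires l >= n_nonnull) and nulls N(0, 1)
    everywhere else. Returns (pvals, labels), labels[i] True for non-nulls."""
    rng = random.Random(seed)
    nonnull_positions = set(rng.sample(range(l), n_nonnull))
    std_normal = NormalDist()
    pvals, labels = [], []
    for i in range(n):
        is_nonnull = i in nonnull_positions
        z = rng.gauss(mu if is_nonnull else 0.0, 1.0)
        # One-sided p-value 1 - Phi(z), computed as Phi(-z) for stability.
        pvals.append(std_normal.cdf(-z))
        labels.append(is_nonnull)
    return pvals, labels
```

Varying `l` over the grid above while keeping `n` fixed then traces out the sparsity axis $l/n$.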
Choice of the uniform bound. The four bounds presented above can be classified into two types: linear and curved. Curved bounds grow at the rate $O(\sqrt{k \log\log k})$, slower than the linear bound, so they are tighter for large enough $k$, but they are usually looser for small $k$ (Figure 7b).
In the batch setting, where the number of hypotheses $n$ is finite, we use simulation Setting 2, and the linear bound (18) (with $m = n/4$) results in the highest power (Figure 7a). As with tuning the parameter $m$ in the linear bound, we also explored tuning the implicit parameters in the curved bounds, yet the linear bound still has the highest power. However, in the online setting, where new hypotheses keep arriving, the tests with curved bounds need less time (fewer hypotheses) on average to reach rejection.

E Martingale Fisher test
The batch test by Fisher (1934) calculates $S_n = -2\sum_{i=1}^{n} \log p_i$. Since the distribution of $S_n$ under the global null is $\chi^2_{2n}$ (chi-square with $2n$ degrees of freedom), the batch test rejects when $S_n$ exceeds the $1 - \alpha$ quantile of $\chi^2_{2n}$. To design the martingale test, simply observe that $\{S_k\}_{k \in I}$, after centering as $S_k - 2k$, is a martingale under the global null whose increments $f(p_i) = -2\log p_i$ are $\chi^2_2$ distributed. As with the martingale Stouffer test, there are several types of uniform boundaries $u_\alpha(k)$ for a martingale with chi-square increments. We present three types: chi-square characterized (linear), sub-exponential characterized (linear), and sub-Gamma characterized (curved). The general form of the martingale Fisher test rejects the global null if, where $u_\alpha(k)$ can be any of the following uniform boundaries:
1. a chi-square characterized linear bound $u_\alpha(k) = 2x_{m,\alpha}\log(x_{m,\alpha} \dots$ where $x_{m,\alpha} = \min\left\{x : \exp\left\{-\frac{x}{2} + m + m\log\frac{x}{2m}\right\} \le \alpha\right\}$.
2. a sub-exponential characterized linear bound $u_\alpha(k) = \left[\left(\frac{1.41m}{x_{m,\alpha}} + 2\right)\log\left(1 + \frac{1.41x_{m,\alpha}}{m}\right) - 2\right](k - m) + 2.82x_{m,\alpha}$, where $x_{m,\alpha} = \min\left\{x : \exp\left\{-0.71x + \frac{m}{2}\log\left(1 + \frac{1.41x}{m}\right)\right\} \le \alpha\right\}$.
3. a sub-Gamma characterized curved bound $u_\alpha(k) = 4.81\sqrt{k\left(\log\log(2k) + 0.72\log\frac{5.19}{\alpha}\right)} + 13.47\left(\log\log(2k) + 0.72\log\frac{5.19}{\alpha}\right)$.
Both linear bounds contain a parameter $m$ with the same interpretation as $m$ in the linear bound (4) for the martingale Stouffer test: it determines the time at which the bound is tightest, with a larger $m$ resulting in a lower slope but a larger offset, making the bound loose early on. Again, we suggest a default value of $m = n/4$ if the number of hypotheses $n$ is finite, but it should be chosen based on the time by which we expect to have encountered most non-nulls (if any).
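To make the two ingredients of this appendix concrete, the sketch below computes the batch Fisher p-value and the centered path $S_k - 2k$. It relies on the closed form of the chi-square tail for even degrees of freedom, $P(\chi^2_{2n} > s) = e^{-s/2}\sum_{j=0}^{n-1} (s/2)^j / j!$. The crossing rule "reject at the first $k$ with $S_k - 2k > u_\alpha(k)$" is an assumed reading of the general form, the boundary is supplied by the caller, and all function names are hypothetical.

```python
import math


def fisher_batch_pvalue(pvals):
    """P-value of the batch Fisher test: P(chi^2_{2n} > S_n), S_n = -2 sum log p_i.
    Each p-value must lie in (0, 1]."""
    n = len(pvals)
    half_s = -sum(math.log(p) for p in pvals)  # equals S_n / 2
    if half_s == 0.0:
        return 1.0
    # exp(-s/2) * sum_{j<n} (s/2)^j / j!, with each term computed in log space
    # so that large n does not overflow.
    return sum(
        math.exp(j * math.log(half_s) - math.lgamma(j + 1) - half_s)
        for j in range(n)
    )


def centered_fisher_path(pvals):
    """Running centered sums S_k - 2k, a martingale under the global null."""
    path, s = [], 0.0
    for k, p in enumerate(pvals, start=1):
        s += -2.0 * math.log(p)
        path.append(s - 2.0 * k)
    return path


def first_crossing(pvals, bound):
    """First k with S_k - 2k > bound(k), else None (assumed rejection rule)."""
    for k, value in enumerate(centered_fisher_path(pvals), start=1):
        if value > bound(k):
            return k
    return None
```

Any of the three boundaries listed above could be passed as `bound`, once written as a function of `k` for fixed $\alpha$ and $m$.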
(a) Power with varying sparsity score. (b) Type 1 error with varying sparsity score.
The power of the martingale Fisher test using the three bounds is compared under different levels of non-null sparsity (using simulation Setting 2). The test using the chi-square characterized linear bound has the highest power, closely followed by the one using the sub-exponential characterized linear bound (Figure 8a). This indicates that using the sub-exponential characterized bound when the exact distribution of the increments is unknown can be almost as powerful as the oracle case of knowing the distribution. The sub-Gamma characterized curved bound loses power quickly when the non-nulls are rather sparse, consistent with the comparison between linear and curved bounds for the martingale Stouffer test (Appendix D).

F Bayesian modeling for the non-null likelihoods
Modeling the non-null likelihoods. Define the Z-score for hypothesis $H_i$ as $Z_i = \Phi^{-1}(1 - p_i)$. Instead of modeling the p-values, we choose to model the Z-scores, since under Setting 1 they are Gaussian under both the null and the alternative: where $\mu$ is the mean value for all the non-nulls. We model $Z_i$ by a mixture of Gaussians: where $q_i$ is the indicator of whether the hypothesis $H_i$ is a true non-null.
The non-null structures are imposed through constraints on the non-null probabilities $\pi_i$. In our examples, the blocked non-null structure is encoded by fitting the non-null probabilities $\pi_i$ as a smooth function of the hypothesis position (covariates) $x_i$, specifically as a logistic regression model on a spline basis: where $\phi(x_i)$ is a spline basis. The hierarchical structure is imposed by a partial ordering constraint on $\pi_i$: $\pi(i) \ge \pi(j)$ if $i$ is the parent of $j$.
An EM framework for the non-null likelihoods. An EM algorithm is used to train the model because masked p-values are modeled. Specifically, we treat the p-values as hidden variables and the masked p-values $g(p)$ as observed. In terms of $Z_i$, $Z_i$ is a hidden variable and the observed variable $\tilde{Z}_i$ is its absolute value $|Z_i|$. Define a sequence of hypothetical labels $w_i = I(Z_i = \tilde{Z}_i)$; then the likelihood of observing $\tilde{z}_i$ is
The E-step updates $w_i$ and $q_i$. For $w_i$, if the p-value is revealed, $w_i = 1$; otherwise the update is where $\lambda = \frac{1 - \pi_i}{\pi_i}\exp\{-\tilde{Z}_i\mu + \mu^2/2\}$. For $q_i$, if the p-value is revealed, the update is $q_{i,\text{new}} = E(q_i \mid Z_i) = \frac{1}{1 + \frac{1 - \pi_i}{\pi_i \exp\{Z_i\mu - \mu^2/2\}}}$; otherwise the update is $q_{i,\text{new}} = E(q_i \mid \tilde{Z}_i) = \frac{1 + \exp\{-2\tilde{Z}_i\mu\}}{1 + \exp\{-2\tilde{Z}_i\mu\} + 2\frac{1 - \pi_i}{\pi_i \exp\{\tilde{Z}_i\mu - \mu^2/2\}}}$.
In the M-step, the parameters $\pi_i$ and $\mu$ are updated. The update for $\mu$ is The update for the $\pi_i$'s depends on the constraints, which vary with the non-null structure. Under the block non-null structure, updating $\pi_i$ is equivalent to fitting $q_i$ by a logistic regression: $(\pi_{i,\text{new}}) = \arg\max_{\beta} \sum_i \left[q_i\log\pi_\beta(x_i) + (1 - q_i)\log(1 - \pi_\beta(x_i))\right]$, where $\pi_\beta(x_i)$ is defined in equation (26). Under the hierarchical structure, it is equivalent to fitting a partial isotonic regression on $q_i$ (proven by Robertson et al. (1988)): $(\pi_{i,\text{new}}) = \arg\max_{\text{partially ordered } \{\pi_i\}} \sum_i \left[q_i\log\pi_i + (1 - q_i)\log(1 - \pi_i)\right] = \arg\min$ where the partial ordering is defined in statement (27).
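The E-step update for $q_i$ can be sketched as follows. This is an illustrative reading of the two formulas above, assuming the two-component model with nulls $N(0, 1)$ and non-nulls $N(\mu, 1)$, and assuming that in the masked case only $|Z_i|$ is observed. The function name and argument names are hypothetical; the update for $w_i$ and the M-step are not covered.

```python
import math


def posterior_nonnull(z, pi, mu, revealed):
    """E-step update for q_i: posterior probability of being non-null.

    z        -- the Z-score if revealed, otherwise its absolute value |Z_i|
    pi       -- prior non-null probability pi_i, strictly between 0 and 1
    mu       -- common non-null mean
    revealed -- True if the p-value (hence the sign of Z_i) is unmasked
    """
    # (1 - pi) / (pi * exp(z * mu - mu^2 / 2)): null-to-non-null odds at z.
    ratio = (1.0 - pi) / pi * math.exp(-z * mu + 0.5 * mu * mu)
    if revealed:
        return 1.0 / (1.0 + ratio)
    # Masked case: both +z and -z are possible, adding the mirrored
    # non-null density (factor exp(-2 z mu)) and doubling the null term.
    mirrored = math.exp(-2.0 * z * mu)
    return (1.0 + mirrored) / (1.0 + mirrored + 2.0 * ratio)
```

As a sanity check, with $\mu = 0$ the two components coincide and the function returns the prior $\pi_i$ in both cases.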
G Tests with modeled non-null likelihoods under the Hierarchical structure simulation
Two trees are simulated, one with the probability of being non-null decreasing from a parent to its children and the other with it increasing. Each tree has over 100 nodes, and 7 of them are non-nulls. As expected, the interactively ordered martingale test has higher power than the non-adaptive martingale Stouffer test (Figure 9).
(a) Power with varying alternative mean, in a hypothesis tree with decreasing probability of being non-null.
(b) Power with varying alternative mean, in a hypothesis tree with increasing probability of being non-null.