Performance of Test Supermartingale Confidence Intervals for the Success Probability of Bernoulli Trials

Given a composite null hypothesis ℋ0, test supermartingales are non-negative supermartingales with respect to ℋ0 with an initial value of 1. Large values of test supermartingales provide evidence against ℋ0. As a result, test supermartingales are an effective tool for rejecting ℋ0, particularly when the p-values obtained are very small and serve as certificates against the null hypothesis. Examples include the rejection of local realism as an explanation of Bell test experiments in the foundations of physics and the certification of entanglement in quantum information science. Test supermartingales have the advantage of being adaptable during an experiment and allowing for arbitrary stopping rules. By inversion of acceptance regions, they can also be used to determine confidence sets. We used an example to compare the performance of test supermartingales for computing p-values and confidence intervals to Chernoff-Hoeffding bounds and the “exact” p-value. The example is the problem of inferring the probability of success in a sequence of Bernoulli trials. There is a cost in using a technique that has no restriction on stopping rules, and, for a particular test supermartingale, our study quantifies this cost.

Experiments in physics require very high confidence to justify claims of discovery or to unambiguously exclude alternative explanations [6]. Particularly striking examples in the foundations of physics are experiments to demonstrate that theories based on local hidden variables, called local realist (LR) theories, cannot explain the statistics observed in quantum experiments called Bell tests. See Ref. [7] for a review and Refs. [8,9,16,18] for the most definitive experiments to date. Successful Bell tests imply the presence of some randomness in the observed statistics. As a result, one of the most notable applications of Bell tests is to randomness generation [1]. In this application, it is necessary to certify the randomness generated, and these certificates are equivalent to extremely small significance levels in an appropriately formulated hypothesis test. In general, such extreme significance levels are frequently required in protocols for communication or computation to ensure performance.
Bell tests consist of a sequence of "trials", each of which gives a result $M_i$. LR models restrict the statistics of the $M_i$ and therefore constitute a composite null hypothesis to be rejected. Traditionally, data has been analyzed by estimating the value of a Bell function $B$ and its standard error $\hat{\sigma}$ from the collective result statistics (see [12,21]). Under the null hypothesis, $\hat{B}$ is expected to be negative, so a large value of $\hat{B}$ compared to $\hat{\sigma}$ is considered strong evidence against the null hypothesis. This method suffers from several problems, including the failure of the Gaussian approximation in the extreme tails and the fact that the trials are observably not independent and identically distributed (i.i.d.) [21].
In Ref. [21] a method was introduced that can give rigorous p-value bounds against LR. These p-value bounds are memory-robust, that is, without any assumptions on dependence of trial statistics on earlier trials. The method can be seen as an application of test supermartingales as defined in Ref. [17]. Test supermartingales were first considered, and many of their basic properties were proved, by Ville [20] in the same work that introduced the notion of martingales. The method involves constructing a non-negative stochastic process V i determined by (M j ) j≤i such that the initial value is V 0 = 1 and, under LR models, the expectations conditional on all past events are non-increasing. As explained further below, the final value of V i in a sequence of n trials has expectation bounded by 1, so its inverse p = 1/V n is a p-value bound according to Markov's inequality. A large observed value of such a test supermartingale thus provides evidence against LR models. Refs. [21,22] give methods to construct V i that achieve asymptotically optimal gain rate E(− log(p)/n) for i.i.d. trials, where E(. . .) is the expectation functional. This is typically an improvement over other valid memory-robust Bell tests. Additional benefits are that V i can be constructed adaptively based on the observed statistics, and the p-value bounds remain valid even if the experiment is stopped based on the current value of V i . These techniques were successfully applied to experimental data from a Bell test with photons where other methods fail [5].
Although the terminology is apparently relatively recent, test supermartingales have traditionally played a major theoretical role. Carefully constructed test supermartingales contribute to the asymptotic analysis of distributions and proofs of large deviation bounds. They can be constructed for any convex-closed null hypothesis viewed as a set of distributions, so they can be used for memory- and stopping-robust adaptive hypothesis tests in some generality. The application to Bell tests shows that, at least in a regime where high-significance results are required, test supermartingales can perform as well as or better than other methods. Here we compare the performance of test supermartingales directly to (1) the standard large deviation bounds based on the Chernoff-Hoeffding inequality [4,10], and (2) "exact" p-value calculations. Our comparison is for a case where all calculations can be performed efficiently, namely for testing the success probability in Bernoulli trials. The three p-value bounds thus obtained have asymptotically optimal gain rates. Not surprisingly, for any given experiment, test supermartingales yield systematically worse p-value bounds, but the difference is much smaller than the experiment-to-experiment variation. This effect can be viewed as the cost of robustness against arbitrary stopping rules. For ease of calculation, we do not use an optimal test supermartingale construction, but we expect similar results no matter which test supermartingale is used.
Any hypothesis test parametrized by $\phi$ can be used to construct confidence regions for $\phi$ by acceptance region inversion (see Ref. [19], Sect. 7.1.2). Motivated by this observation, we consider the use of test supermartingales for determining confidence regions. We expect that they perform well in the high-confidence regime, with an increase in region size associated with robustness against stopping rules. We therefore compared the methods mentioned above for determining confidence intervals for the success probability in Bernoulli trials. After normalizing the difference between the interval endpoints and the success probability by the standard deviation, which is $O(1/\sqrt{n})$, we find that while large deviation bounds and exact regions differ by a constant at fixed confidence levels, the test supermartingale's normalized endpoint deviation is $\Omega(\sqrt{\log(n)})$ instead of $O(1)$. This effect was noted in Ref. [17] and partially reflects a suboptimal choice of supermartingale. To maintain robustness against stopping rules, one expects $\Omega(\sqrt{\log\log(n)})$ according to the law of the iterated logarithm. However, we note that if the number of trials $n$ is fixed in advance, the normalized endpoint deviation can be reduced to $O(1)$ with an adaptive test supermartingale. So although the increase in confidence region necessitated by stopping-rule robustness is not so large for reasonably sized $n$, when $n$ is known ahead of time it can, in principle, be avoided without losing the ability to adapt the test supermartingale on the fly during the experiment in non-i.i.d. situations.
The remainder of the paper is structured as follows. We establish the notation to be used and define the basic concepts in Sect. II. Here we also explain how adaptivity can help reject hypotheses for stochastic processes. We introduce the three methods to be applied to Bernoulli trials in Sect. III. Here we also establish the basic monotonicity properties and relationships of the three p-value bounds obtained. In Sect. IV we determine the behavior of the p-value bounds in detail, including their asymptotic behavior. In Sect. V we introduce the confidence intervals obtained by acceptance region inversion. We focus on one-sided intervals determined by lower bounds but note that the results apply to two-sided intervals. The observations in Sects. IV and V are based on theorems whose proofs can be found in the Appendix. While many of the observations in these sections can ignore asymptotically small terms, the results in the Appendix uncompromisingly determine interval bounds for all relevant expressions, with explicit constants. Concluding remarks can be found in Sect. VI.

II. BASIC CONCEPTS
We use the usual conventions for random variables (RVs) and their values. RVs are denoted by capital letters such as X, Y, . . . and their values by the corresponding lower case letters x, y, . . .. All our RVs are finite valued. Probabilities and expectations are denoted by P(. . .) and E(. . .), respectively. For a formula φ, the expression {φ} refers to the event where the formula is true. The notation µ(X) refers to the distribution of X induced on its space of values. We use the usual conventions for conditional probabilities and expectations. Also, µ(X|φ) denotes the probability distribution induced by X conditional on the event {φ}.
We consider stochastic sequences of RVs such as $\mathbf{X} = (X_i)_{i=1}^{n}$ and $\mathbf{X}_{\le k} = (X_i)_{i=1}^{k}$. We think of the $X_i$ as the outcomes from a sequence of trials. For our study, we consider $\mathbf{B} = (B_i)_{i=1}^{n}$, where the $B_i$ are $\{0,1\}$-valued RVs, and a Bernoulli RV $B$ satisfies $P(B=1) = \theta$. The parameter $\theta$ is also referred to as the success probability. We denote the distribution of $B$ by $\nu_\theta$. We define $S_k = \sum_{i=1}^{k} B_i$ and $\hat{\Theta}_k = S_k/k$. We extend the RV conventions to the Greek letter $\hat{\Theta}_k$. That is, $\hat{\theta}_k = s_k/k = \sum_{i=1}^{k} b_i/k$ is the value of the RV $\hat{\Theta}_k$ determined by the values $b_i$ of $B_i$. We may omit subscripts on statistics such as $S_n$ and $\hat{\Theta}_n$ when they are based on the full set of $n$ samples. Some expressions involving $\hat{\Theta}_n$ require that $n\hat{\Theta}_n$ is an integer, which is assured by the definition.
A null hypothesis for $X$ is equivalent to a set $\mathcal{H}_0$ of distributions of $X$, which we refer to as the "null". For our study of Bernoulli RVs, we consider the nulls $\mathcal{B}_\phi$ parametrized by $0 \le \phi \le 1$, where $\mathcal{B}_\phi = \{\nu_\theta : \theta \le \phi\}$ is the set of distributions of Bernoulli RVs with $P(B = 1) \le \phi$.
One can test the null hypothesis determined by a null by means of special statistics called p-value bounds. A statistic $P = P(X) \ge 0$ is a p-value bound for $\mathcal{H}_0$ if for all $\mu \in \mathcal{H}_0$ and $p \ge 0$, $P_\mu(P \le p) \le p$. Here, the subscript $\mu$ on $P_\mu(\ldots)$ indicates the distribution with respect to which the probabilities are to be calculated. We usually just write "p-value" instead of "p-value bound", even when the bounds are not achieved by a member of $\mathcal{H}_0$. Small p-values are strong evidence against the null. Since we are interested in very small p-values, we preferentially use their negative logarithm $-\log(P)$ and call this the log(p)-value. In this work, logarithms are base $e$ by default. A general method for constructing p-values is to start with an arbitrary real-valued RV $Q$ jointly distributed with $X$. Usually $Q$ is a function of $X$. Define the worst-case tail probability of $Q$ as $P(q) = \sup_{\mu \in \mathcal{H}_0} P_\mu(Q \ge q)$. Then $P(Q)$ is a p-value for $\mathcal{H}_0$. The argument is standard. Define $F_\mu(q) = P_\mu(Q \ge q)$. The function $F_\mu$ is non-increasing. We need to show that for all $\mu \in \mathcal{H}_0$, $P_\mu(P(Q) \le p) \le p$. Since $F_\mu(q) \le P(q)$, we have $P_\mu(P(Q) \le p) \le P_\mu(F_\mu(Q) \le p)$. The set $\{q : F_\mu(q) \le p\}$ is either of the form $[q_{\min}, \infty)$ or $(q_{\min}, \infty)$ for some $q_{\min}$. In the first case, $P_\mu(F_\mu(Q) \le p) = P_\mu(Q \ge q_{\min}) = F_\mu(q_{\min}) \le p$. In the second, $P_\mu(F_\mu(Q) \le p) = P_\mu\left(\bigcup_n \{Q \ge q_{\min} + 1/n\}\right) = \lim_n P_\mu(Q \ge q_{\min} + 1/n) = \lim_n F_\mu(q_{\min} + 1/n) \le p$, with $\sigma$-additivity applied to the countable monotone union.
When referring to $\mathcal{H}_0$ as a null for $\mathbf{X}$, we mean that $\mathcal{H}_0$ consists of the distributions where the $X_i$ are i.i.d., with $X_i$ distributed according to $\mu$ for some fixed $\mu$ independent of $i$. To go beyond i.i.d., we extend $\mathcal{H}_0$ to the set of distributions of $\mathbf{X}$ that have the property that for all $\mathbf{x}_{\le i-1}$, $\mu(X_i|\mathbf{X}_{\le i-1} = \mathbf{x}_{\le i-1}) = \mu_i$ for some $\mu_i \in \mathcal{H}_0$, where $\mu_i$ depends on $i$ and $\mathbf{x}_{\le i-1}$. We denote the extended null by $\overline{\mathcal{H}}_0$. In particular, $\overline{\mathcal{B}}_\phi$ denotes the extended null obtained from $\mathcal{B}_\phi$. The LR models mentioned in the introduction constitute a particular null $\mathcal{H}_{\mathrm{LR}}$ for sequences of trials called Bell tests. In Ref. [21], a technique called the probability-based ratio (PBR) method was introduced to construct p-values $P_n$ that achieve asymptotically optimal gain rates defined as $E(\log(1/P_n))/n$. The method is best understood as a way of constructing a test supermartingale for $\overline{\mathcal{H}}_{\mathrm{LR}}$. A test supermartingale for $\overline{\mathcal{H}}_0$ with respect to $\mathbf{X}$ is a stochastic sequence $\mathbf{T} = (T_i)_{i=0}^{n}$ such that $T_0 = 1$, $T_i \ge 0$, $T_i$ is determined by $\mathbf{X}_{\le i}$, and $E_\mu(T_{i+1}|\mathbf{X}_{\le i}) \le T_i$ for all $\mu \in \overline{\mathcal{H}}_0$. In this work, to avoid unwanted boundary cases, we further require $T_i$ to be positive. The definition of test supermartingale used here is not the most general one because we consider only discrete time and avoid the customary increasing sequence of $\sigma$-algebras by making it dependent on an explicit stochastic sequence $\mathbf{X}$. Every test supermartingale defines a p-value by $P_n = 1/T_n$. This follows from $E(T_n) \le T_0 = 1$ (one of the characteristic properties of supermartingales) and Markov's inequality for non-negative statistics, according to which $P(T_n \ge \kappa) \le E(T_n)/\kappa \le 1/\kappa$. From martingale theory, the stopped process $T_\tau$ for any stopping rule $\tau$ with respect to $\mathbf{X}$ also defines a p-value by $P = 1/T_\tau$. Further, $P^*_n = 1/\max_{i=1}^{n} T_i$ also defines a p-value. See Ref. [17] for a discussion and examples.
A test supermartingale $\mathbf{T}$ can be viewed as the running product of the $F_i = T_i/T_{i-1}$, which we call the test factors of $\mathbf{T}$. The defining properties of $\mathbf{T}$ are equivalent to having $F_i > 0$ and $E(F_i|\mathbf{X}_{\le i-1}) \le 1$ for all distributions in the null, for all $i$. The PBR method adaptively constructs $F_i$ as a function of the next trial outcome $X_i$ from the earlier trial outcomes $\mathbf{X}_{\le i-1}$. The method is designed for testing $\overline{\mathcal{H}}_0$ for a closed convex null $\mathcal{H}_0$, where asymptotically optimal gain rates are achieved when the trials are i.i.d. with a trial distribution $\nu$ not in $\mathcal{H}_0$. If $\nu$ were known, the optimal test factor would be given by $F_i(x) = \nu(x)/\mu(x)$, where $\mu \in \mathcal{H}_0$ is the member of $\mathcal{H}_0$ closest to $\nu$ in Kullback-Leibler (KL) divergence $\mathrm{KL}(\nu|\mu) = \sum_x \nu(x)\log(\nu(x)/\mu(x))$ [11]. Since $\nu$ is not known, the PBR method obtains an empirical estimate $\hat{\nu}$ of $\nu$ from $\mathbf{x}_{\le i-1}$ and other information available before the $i$'th trial. It then determines the KL-closest $\mu \in \mathcal{H}_0$ to $\hat{\nu}$. The test factor $F_i$ is then given by $F_i(x) = \hat{\nu}(x)/\mu(x)$. The test factors satisfy $E_{\mu'}(F_i) \le 1$ for all $\mu' \in \mathcal{H}_0$; see Ref. [21] for a proof and applications to the problem of testing LR.
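As an illustration of the test-factor construction, the following is a minimal sketch for the Bernoulli null with success probability at most $\phi$. It assumes that the KL-closest member of this null to an estimate with success probability `nu_hat` is obtained by clipping the success probability at $\phi$. The function names `pbr_test_factor` and `expected_factor` are hypothetical and chosen only for this example.

```python
def pbr_test_factor(nu_hat: float, phi: float) -> tuple[float, float]:
    """Return (F(0), F(1)) for the Bernoulli null {nu_theta : theta <= phi}.

    nu_hat is the estimated success probability and phi the null parameter;
    both are assumed to lie strictly between 0 and 1.
    """
    # KL-closest member of the null to nu_hat: clip the success probability at phi.
    mu = min(nu_hat, phi)
    return (1.0 - nu_hat) / (1.0 - mu), nu_hat / mu


def expected_factor(theta: float, factor: tuple[float, float]) -> float:
    """Expectation of the test factor when the outcome is Bernoulli(theta)."""
    f0, f1 = factor
    return (1.0 - theta) * f0 + theta * f1


# Check E(F) <= 1 on a grid of null distributions theta <= phi.
phi = 0.3
factor = pbr_test_factor(nu_hat=0.5, phi=phi)
worst_case = max(expected_factor(phi * i / 1000, factor) for i in range(1001))
print(worst_case)
```

In this sketch the worst case over the grid is attained at $\theta = \phi$. When the estimate already lies in the null, the factor is trivial (equal to 1 for both outcomes).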
The ability to choose test factors adaptively helps reject extended nulls when the distributions vary as the experiment progresses, both when the distributions are still independent (so only the parameters vary) and when the parameters depend on past outcomes. Suppose that the distributions are sufficiently stable so that the empirical frequencies over the past $k$ trials are statistically close to the next trial's probability distribution. Then we can adaptively compute the test factor to be used for the next trial from the past $k$ trials' empirical frequencies, for example by following the strategy outlined in the previous paragraph. The procedure now has an opportunity to reject an extended null provided only that there is a sufficiently long period where the original null does not hold. For example, consider the extended null $\overline{\mathcal{B}}_\phi$. The true success probabilities $\theta_i$ at the $i$'th trial may vary, maybe as a result of changes in experimental parameters that need to be calibrated. Suppose that the goal is to calibrate for $\theta_i > \phi$. If we use adaptive test factors and find at some point that we cannot reject $\overline{\mathcal{B}}_\phi$ according to the running product of the test factors, we can recalibrate during the experiment. If the recalibration succeeds at pushing $\theta_i$ above $\phi$ for the remaining trials, we may still reject the extended null by the end of the experiment. In many cases, the analysis is performed after the experiment, or it may not be possible to stop the experiment for recalibration. For this situation, if the frequencies for a run of $k$ trials clearly show that $\theta_i < \phi$, the adaptive test factors chosen would tend to be trivial (equal to 1), in which case the next trials do not contribute to the final test factor product. This is in contrast to a hypothesis test based on the final sum of the outcomes, for which all trials contribute equally.
Let ϕ be a parameter of distributions of X. Here, ϕ need not determine the distributions. There is a close relationship between methods for determining confidence sets for ϕ and hypothesis tests. Let H ϕ be a null such that for all distributions µ with parameter ϕ, µ ∈ H ϕ . Given a family of hypothesis tests with p-values P ϕ for H ϕ , we can construct confidence sets for ϕ by inverting the acceptance region of P ϕ , see Ref. [19], Sect. 7.1.2. According to this construction, the confidence set C a at level a is given by {ϕ|P ϕ (X) ≥ a} and is a random quantity. The defining property of a level a confidence set is that its coverage probability satisfies P µ (ϕ ∈ C a ) ≥ 1 − a for all distributions µ ∈ H ϕ . When we use this construction for sequences B of i.i.d. Bernoulli RVs with the null B ϕ , we obtain one-sided confidence intervals of the form [ϕ 0 , 1] for θ = E(B i ). When the confidence set is a one-sided interval of this type, we refer to ϕ 0 as the confidence lower bound or endpoint. If B has a distribution µ that is not necessarily i.i.d., we can define Θ max = max i≤n E µ (B i |B ≤i−1 ). If we use acceptance region inversion with the extended null B ϕ , we obtain a confidence region for Θ max . Note that Θ max is an RV whose value is covered by the confidence set with probability at least 1−a. The confidence set need not be an interval in general, but including everything between its infimum and its supremum increases the coverage probability, so the set can be converted into an interval if desired.
While our focus is on one-sided confidence intervals, our observations immediately apply to two-sided intervals by means of a standard method for obtaining a two-sided confidence interval from two one-sided intervals. For our example, we can obtain confidence upper bounds at level $a$ by symmetry, for example by relabeling the Bernoulli outcomes 0 → 1 and 1 → 0. To obtain a two-sided interval at level $a$, we compute lower and upper bounds at level $a/2$. The two-sided interval is the interval between the bounds. The coverage probability of the two-sided interval is valid according to the union bound applied to the maximum non-coverage probabilities of the two one-sided intervals.
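The construction of a two-sided interval from one-sided bounds can be sketched as follows. As a stand-in for the one-sided lower bound, this sketch uses the simple Hoeffding bound $\hat{\theta} - \sqrt{\log(1/a)/(2n)}$, which is an assumption made for illustration and is not one of the three tests compared in this paper. The names `hoeffding_lower` and `two_sided` are hypothetical.

```python
from math import log, sqrt


def hoeffding_lower(successes: int, n: int, a: float) -> float:
    """One-sided level-a lower confidence bound from Hoeffding's inequality."""
    return max(0.0, successes / n - sqrt(log(1.0 / a) / (2.0 * n)))


def two_sided(successes: int, n: int, a: float, lower=hoeffding_lower):
    """Two-sided level-a interval built from two one-sided bounds at level a/2."""
    lo = lower(successes, n, a / 2.0)
    # Relabel outcomes 0 <-> 1: failures play the role of successes,
    # and the resulting lower bound on 1 - theta is an upper bound on theta.
    hi = 1.0 - lower(n - successes, n, a / 2.0)
    return lo, hi


lo, hi = two_sided(successes=50, n=100, a=0.01)
print(lo, hi)
```

Any other one-sided lower bound with the same call signature can be passed through the `lower` argument.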

III. BERNOULLI HYPOTHESIS TESTS
We compare three hypothesis tests for the nulls $\mathcal{B}_\phi$ or the extended nulls $\overline{\mathcal{B}}_\phi$: the "exact" test with p-value $P_X$, the Chernoff-Hoeffding test with p-value $P_{CH}$ and a PBR test with p-value $P_{PBR}$. In discussing properties of these tests with respect to the hypothesis parameter $\phi$, the true success probability $\theta$ and the empirical success probability $\hat{\Theta}$, we generally assume that these parameters are in the interior of their range. In particular, $0 < \phi < 1$, $0 < \theta < 1$, and $0 < \hat{\Theta} < 1$. When discussing purely functional properties with respect to values $\hat{\theta}$ of $\hat{\Theta}$, we use the variable $t$ instead of $\hat{\theta}$. By default $nt$ is a positive integer.
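The p-value for the exact test is obtained from the tail for i.i.d. Bernoulli RVs:
$$P_{X,n}(\hat{\Theta}|\phi) = \sum_{k \ge n\hat{\Theta}} \binom{n}{k} \phi^{k} (1-\phi)^{n-k},$$
where $\hat{\Theta} = S_n/n = \sum_{i=1}^{n} B_i/n$ as defined in Sect. II. Note that unlike the other p-values we consider, $P_{X,n}$ is not just a p-value bound. It is achieved by a member of the null. The quantity $P_{X,n}(t|\phi)$ is decreasing as a function of $t$, given $0 < \phi < 1$. It is smooth and monotonically increasing as a function of $\phi$, given $t > 0$. To see this, compute
$$\frac{d}{d\phi} P_{X,n}(t|\phi) = n \binom{n-1}{nt-1} \phi^{nt-1} (1-\phi)^{n-nt}.$$
This is positive for $\phi \in (0, 1)$. The probability that $S_n \ge tn$, given that all $B_i$ are distributed as $\nu_\theta$ with $\theta \le \phi$, is bounded by $P_{X,n}(t|\theta) \le P_{X,n}(t|\phi)$. That $P_X$ is a p-value for the case where the null is restricted to i.i.d. distributions now follows from the standard construction of p-values from worst-case (over the null) tails of statistics (here $S_n$) as explained in the previous section. That $P_X$ is a p-value for the extended null $\overline{\mathcal{B}}_\phi$ follows from the observations that the tail probabilities of $S_n$ are linear functions of the distribution parameters $\theta_1, \theta_2, \ldots, \theta_n$ where $\theta_i \le \phi$, $i = 1, 2, \ldots, n$, the extremal distributions in $\overline{\mathcal{B}}_\phi$ have $B_i$ independent with $P(B_i = 1) = \theta_i \le \phi$, and the tail probabilities of $S_n$ are monotonically increasing in $P(B_i = 1)$ for each $i$ separately. See also Ref. [2], App. C.
Define $\hat{\Theta}_{\max} = \max(\hat{\Theta}, \phi)$. The p-value for the Chernoff-Hoeffding test is the optimal Chernoff-Hoeffding bound [4,10] for a binary random variable given by
$$P_{CH,n}(\hat{\Theta}|\phi) = \left(\frac{\phi}{\hat{\Theta}_{\max}}\right)^{n\hat{\Theta}_{\max}} \left(\frac{1-\phi}{1-\hat{\Theta}_{\max}}\right)^{n(1-\hat{\Theta}_{\max})}.$$
This is a p-value for our setting because $P_{CH,n}(t|\phi) \ge P_{X,n}(t|\phi)$, see Ref. [10]. For $\phi \le t$, we have $-\log(P_{CH,n}(t|\phi)) = n\,\mathrm{KL}(\nu_t|\nu_\phi)$. We abbreviate $\mathrm{KL}(\nu_t|\nu_\phi)$ by $\mathrm{KL}(t|\phi)$. For $\phi \le t < 1$, $P_{CH,n}(t|\phi)$ is monotonically increasing in $\phi$, and decreasing in $t$. For $0 \le t \le \phi$, it is constant.
The p-value for the PBR test that we use for comparison is constructed from a p-value for the point null $\{\nu_\phi\}$ defined as
$$P^{0}_{PBR,n}(\hat{\Theta}|\phi) = \phi^{n\hat{\Theta}} (1-\phi)^{n(1-\hat{\Theta})} (n+1) \binom{n}{n\hat{\Theta}}.$$
The PBR test's p-value for $\mathcal{B}_\phi$ is
$$P_{PBR,n}(\hat{\Theta}|\phi) = \max_{0 \le \phi' \le \phi} P^{0}_{PBR,n}(\hat{\Theta}|\phi'). \tag{7}$$
That $P_{PBR}$ is a p-value for $\overline{\mathcal{B}}_\phi$ is shown below. As a function of $\phi$, $P^{0}_{PBR,n}(t|\phi)$ has an isolated maximum at $\phi = t$.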
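This can be seen by differentiating $\log(P^{0}_{PBR,n}(t|\phi))$ with respect to $\phi$. Thus in Eq. 7 when $\phi \ge \hat{\Theta}$, the maximum is achieved by $\phi' = \hat{\Theta}$. We can therefore write Eq. 7 as
$$P_{PBR,n}(\hat{\Theta}|\phi) = P^{0}_{PBR,n}(\hat{\Theta}|\min(\phi, \hat{\Theta})). \tag{8}$$
By definition, $P_{PBR,n}(t|\phi)$ is non-decreasing in $\phi$ and strictly increasing for $\phi \le t$. As a function of $t$, it is strictly decreasing for $t \ge \phi$ (integer-valued $nt$). To see this, consider $k = nt \ge n\phi$ and compute the ratio of successive values as follows:
$$\frac{P_{PBR,n}((k+1)/n|\phi)}{P_{PBR,n}(k/n|\phi)} = \frac{\phi}{1-\phi}\,\frac{n-k}{k+1} < \frac{\phi}{1-\phi}\,\frac{n-k}{k} \le 1. \tag{9}$$
The expression for $P^{0}_{PBR,n}$ is the inverse of the final value of a test supermartingale obtained by constructing test factors $F_{k+1}$ from $S_k$. Define
$$\tilde{\Theta}_k = \frac{S_k + 1}{k + 2}. \tag{10}$$
Thus, $\tilde{\Theta}_k$ would be an empirical estimate of $\theta$ if there were two initial trials $B_{-1}$ and $B_0$ with values 0 and 1, respectively. The test factors are given by
$$F_{k+1} = \left(\frac{\tilde{\Theta}_k}{\phi}\right)^{B_{k+1}} \left(\frac{1-\tilde{\Theta}_k}{1-\phi}\right)^{1-B_{k+1}}. \tag{11}$$
One can verify that $E_{\nu_\theta}(F_{k+1}) = 1$ for $\theta = \phi$. More generally, set $\delta = \theta - \phi$ and compute
$$E_{\nu_\theta}(F_{k+1}|\mathbf{B}_{\le k}) = \theta\,\frac{\tilde{\Theta}_k}{\phi} + (1-\theta)\,\frac{1-\tilde{\Theta}_k}{1-\phi} = 1 + \delta\,\frac{\tilde{\Theta}_k - \phi}{\phi(1-\phi)}.$$
As designed, $T_n = \prod_{k=1}^{n} F_k$ is a test supermartingale for the point null $\{\nu_\phi\}$. Thm. 5 in App. VII B establishes that $T_n = 1/P^{0}_{PBR,n}(\hat{\Theta}|\phi)$. The definition of $P_{PBR,n}(\hat{\Theta}|\phi)$ as a maximum of p-values for $\nu_{\phi'}$ with $\phi' \le \phi$ in Eq. 7 ensures that $P_{PBR,n}(\hat{\Theta}|\phi)$ is a p-value for $\mathcal{B}_\phi$.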
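The relationship between the test factors and the point-null p-value can be checked numerically. The sketch below assumes the estimate $(S_k+1)/(k+2)$ (two fictitious initial trials with values 0 and 1), test factors equal to the ratio of this estimate's probability of the observed outcome to that under $\nu_\phi$, and the closed form $\phi^{s}(1-\phi)^{n-s}(n+1)\binom{n}{s}$ for the point-null p-value. It also verifies by exhaustive enumeration, for a small $n$, that the probability under $\nu_\phi$ that the running maximum reaches $\kappa$ is at most $1/\kappa$. All function names are hypothetical.

```python
from itertools import product
from math import comb


def supermartingale_path(bits, phi):
    """Running products T_1, ..., T_n of the test factors for the point null nu_phi."""
    path, t_val, s = [], 1.0, 0
    for k, b in enumerate(bits):         # k = number of trials already seen
        theta_tilde = (s + 1) / (k + 2)  # estimate with two fictitious trials 0 and 1
        t_val *= theta_tilde / phi if b else (1 - theta_tilde) / (1 - phi)
        s += b
        path.append(t_val)
    return path


def p0_pbr(n, s, phi):
    """Closed form assumed for the point-null p-value given s successes in n trials."""
    return phi**s * (1 - phi) ** (n - s) * (n + 1) * comb(n, s)


def prob_max_exceeds(n, phi, kappa):
    """Exact P(max_i T_i >= kappa) for i.i.d. Bernoulli(phi) trials, by enumeration."""
    total = 0.0
    for bits in product((0, 1), repeat=n):
        if max(supermartingale_path(bits, phi)) >= kappa:
            s = sum(bits)
            total += phi**s * (1 - phi) ** (n - s)
    return total


bits = (1, 0, 1, 1, 1, 0, 1, 1)
final_value = supermartingale_path(bits, 0.4)[-1]
print(final_value * p0_pbr(len(bits), sum(bits), 0.4))  # product of T_n and the p-value
print(prob_max_exceeds(10, 0.4, 5.0))
```

The final value depends on the outcomes only through the number of successes, which is why the order of `bits` does not matter for the first check.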
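To show that $P_{PBR}$ is a p-value for $\overline{\mathcal{B}}_\phi$, we establish that for all $t$ (integer-valued $nt$), $P_{PBR,n}(t|\phi) \ge P_{CH,n}(t|\phi)$. By direct calculation for both $\phi \le t$ and $t \le \phi$, we have
$$\frac{P_{PBR,n}(t|\phi)}{P_{CH,n}(t|\phi)} = (n+1) \binom{n}{nt} t^{nt} (1-t)^{n(1-t)}.$$
The expression $t^{k}(1-t)^{n-k}\binom{n}{k}$ is maximized at $k = nt$ as can be seen by considering ratios for successive values of $k$ and the calculation in Eq. 9, now applied also for $k < nt$. Therefore,
$$(n+1) \binom{n}{nt} t^{nt} (1-t)^{n(1-t)} \ge \sum_{k=0}^{n} \binom{n}{k} t^{k} (1-t)^{n-k} = 1.$$
A better choice for test factors to construct a test supermartingale to test $\overline{\mathcal{B}}_\phi$ would be
$$F'_{k+1} = \left(\frac{\max(\tilde{\Theta}_k, \phi)}{\phi}\right)^{B_{k+1}} \left(\frac{1-\max(\tilde{\Theta}_k, \phi)}{1-\phi}\right)^{1-B_{k+1}}.$$
This choice ensures that $E_{\nu_\theta}(F'_{k+1}|\mathbf{B}_{\le k}) \le 1$ for all $\theta \le \phi$, but the final value of the test supermartingale obtained by multiplying these test factors is not determined by $S_n$, which would complicate our study.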
We summarize the observations about the three tests in the following theorem.
Theorem 1. The three tests satisfy the following monotonicity properties for $0 < \phi < 1$ and $0 < t < 1$ with $nt$ integer-valued:
$P_X(t|\phi)$ is strictly increasing in $\phi$ and strictly decreasing as a function of $t$.
$P_{CH}(t|\phi)$ is strictly increasing in $\phi$ for $\phi \le t$, constant in $\phi$ for $\phi \ge t$, strictly decreasing in $t$ for $t \ge \phi$ and constant in $t$ for $t \le \phi$.
$P_{PBR}(t|\phi)$ is strictly increasing in $\phi$ for $\phi \le t$, constant in $\phi$ for $\phi \ge t$ and strictly decreasing in $t$ for $t \ge \phi$.
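The three p-values can be evaluated directly for small $n$. The following sketch assumes the binomial tail for the exact test, the Chernoff-Hoeffding bound $\exp(-n\,\mathrm{KL}(t|\phi))$ for $\phi \le t$ (and 1 otherwise), and the PBR point-null form $\phi^{k}(1-\phi)^{n-k}(n+1)\binom{n}{k}$ evaluated at $\min(\phi, k/n)$. The function names are hypothetical, and the code requires $0 < k < n$ and $0 < \phi < 1$.

```python
from math import comb


def p_exact(n, k, phi):
    """Binomial tail P(S_n >= k) when every trial has success probability phi."""
    return sum(comb(n, j) * phi**j * (1 - phi) ** (n - j) for j in range(k, n + 1))


def p_ch(n, k, phi):
    """Chernoff-Hoeffding bound; equals 1 when the empirical frequency is below phi."""
    t = max(k / n, phi)
    return (phi / t) ** (n * t) * ((1 - phi) / (1 - t)) ** (n * (1 - t))


def p_pbr(n, k, phi):
    """PBR p-value: the point-null expression evaluated at min(phi, k/n)."""
    p = min(phi, k / n)
    return p**k * (1 - p) ** (n - k) * (n + 1) * comb(n, k)


n, k, phi = 20, 14, 0.5
values = (p_exact(n, k, phi), p_ch(n, k, phi), p_pbr(n, k, phi))
print(values)
```

For this choice of `n`, `k` and `phi` the three values are ordered as exact, Chernoff-Hoeffding, PBR from smallest to largest. For larger `n` it is safer to work with logarithms to avoid underflow.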

IV. COMPARISON OF p-VALUES
We begin by determining the relationships between $P_X$, $P_{CH}$ and $P_{PBR}$ more precisely. Since we are interested in small p-values, it is convenient to focus on the log(p)-values instead and determine their differences to $O(1/\sqrt{n})$. Because of the identity $-\log(P_{CH,n}(t|\phi)) = n\,\mathrm{KL}(t|\phi)$, we reference all log(p)-values to $-\log(P_{CH,n})$. Here we examine the differences for $t \ge \phi$ determined by the following theorem:
Theorem 2. For $0 < \phi < t < 1$ with $nt$ integer-valued,
$$-\log(P_{PBR,n}(t|\phi)) = -\log(P_{CH,n}(t|\phi)) - \frac{1}{2}\log(n) + \frac{1}{2}\log(2\pi t(1-t)) + O(1/\sqrt{n}),$$
$$-\log(P_{X,n}(t|\phi)) = -\log(P_{CH,n}(t|\phi)) + \frac{1}{2}\log(n) + \frac{1}{2}\log(2\pi t(1-t)) + \log\left(\frac{t-\phi}{t(1-\phi)}\right) + O(1/\sqrt{n}).$$
The theorem follows from Thms. 6, 7 and Cor. 8 proven in the Appendix, where explicit interval expressions are obtained for these log(p)-value differences. The order notation assumes fixed $t > \phi$. The bounds are not uniform, see the expressions in the Appendix for details.
The most notable observation is that there are systematic gaps of $\log(n)/2 + O(1)$ between the log(p)-values. As we already knew, there is no question that the exact test is the best of the three for this simple application. While these gaps may seem large on an absolute scale, representing factors close to $\sqrt{n}$, they are in fact much smaller than the experiment-to-experiment variation of the p-values. To determine this variation, we consider the asymptotic distributions. We can readily determine that the log(p)-values are asymptotically normal with standard deviations proportional to $\sqrt{n}$, which is transferred from the variance of $\hat{\Theta}$. Compared to these standard deviations the gaps are negligible. The next theorem determines the specific way in which asymptotic normality holds. Let $N(\mu, \sigma^2)$ denote the normal distribution with mean $\mu$ and variance $\sigma^2$. The notation $X_n \xrightarrow{D} N(\mu, \sigma^2)$ means that $X_n$ converges in distribution to the normal distribution with mean $\mu$ and variance $\sigma^2$.
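Theorem 3. Assume $0 < \phi < \theta < 1$. For $P = P_{CH,n}$, $P = P_{PBR,n}$ or $P = P_{X,n}$, the log(p)-value $-\log(P)$ converges in distribution according to
$$\sqrt{n}\left(\frac{-\log(P(\hat{\Theta}|\phi))}{n} - \mathrm{KL}(\theta|\phi)\right) \xrightarrow{D} N(0, \sigma_G^2),$$
with
$$\sigma_G^2 = \theta(1-\theta)\left(\log\frac{\theta(1-\phi)}{(1-\theta)\phi}\right)^2.$$
The theorem is proven in the Appendix, see Thm. 10. For the rest of the paper, we write $P$ or $P_n$ for the p-values of any one of the tests when it does not matter which one.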
We display the behavior described in the above theorems for $n = 100$ and $\theta = 0.5$ in Fig. 1. We conclude that the phenomena discussed above are already apparent for small numbers of trials. For Fig. 1, we computed the quantiles of the log(p)-values numerically using the formulas provided in the previous section, substituting for $t$ the corresponding quantile of $\hat{\Theta}$ given that $P(B = 1) = \theta$. To be explicit, let $t_{r,n}(\theta)$ be the $r$-quantile of $\hat{\Theta}$ defined as the minimum value $\hat{\theta}$ of $\hat{\Theta}$ satisfying $P(\hat{\Theta} \le \hat{\theta}) \ge r$. (For simplicity we do not place the quantile in the middle of the relevant gap in the distribution.) For example, $t_{0.5,n}(\theta)$ is the median. Then, by the monotonicity properties of the tests, the $r$-quantile of $-\log(P_n(\hat{\Theta}|\phi))$ is given by $-\log(P_n(t_{r,n}(\theta)|\phi))$.
As noted above, the gaps between the log(p)-values are of the form $\log(n)/2 + O(1)$. In fact, it is possible to determine the asymptotic behavior of these gaps. After accounting for the explicitly given $O(1)$ terms in Thm. 2, they are asymptotically normal with variances of order $O(1/n)$. The standard deviations of the gaps are therefore small compared to their size. The precise statement of their asymptotic normality is Thm. 11 in the Appendix. Fig. 2 shows how these gaps depend on the value $\hat{\theta}$ of $\hat{\Theta}$ given $\phi$. The gaps are scaled by $\log(n)$ so that they can be compared to $\log(n)/2$ visually for different values of $n$. The deviation from $\log(n)/2$ is most notable near the boundaries, where convergence is also slower, particularly for $P_X$. This behavior is consistent with the divergences as $t$ approaches $\phi$ in the explicit interval bounds in Thm. 7 and Cor. 8.
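The size of the gaps relative to $\log(n)$ can be checked with a short numerical sketch. It assumes the same three p-value expressions as in the text (binomial tail, Chernoff-Hoeffding bound, and the PBR form with the factor $(n+1)\binom{n}{nt}$) and evaluates them in log space with `math.lgamma` to avoid underflow. The function names are hypothetical, and $0 < \phi < t < 1$ is assumed.

```python
from math import exp, lgamma, log


def log_binom(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_p_exact(n, k, phi):
    """log of the binomial tail P(S_n >= k), via a log-sum-exp over the tail terms."""
    logs = [log_binom(n, j) + j * log(phi) + (n - j) * log(1 - phi) for j in range(k, n + 1)]
    m = max(logs)
    return m + log(sum(exp(x - m) for x in logs))


def log_p_ch(n, k, phi):
    """log of the Chernoff-Hoeffding bound, assuming phi <= k/n < 1."""
    t = k / n
    return n * t * log(phi / t) + n * (1 - t) * log((1 - phi) / (1 - t))


def log_p_pbr(n, k, phi):
    """log of the PBR p-value, assuming phi <= k/n."""
    return k * log(phi) + (n - k) * log(1 - phi) + log(n + 1) + log_binom(n, k)


def scaled_gaps(n, t, phi):
    """Gaps between log(p)-values, divided by log(n): (exact minus CH, CH minus PBR)."""
    k = round(n * t)
    ch = -log_p_ch(n, k, phi)
    return (-log_p_exact(n, k, phi) - ch) / log(n), (ch + log_p_pbr(n, k, phi)) / log(n)


gap_exact, gap_pbr = scaled_gaps(n=10_000, t=0.5, phi=0.3)
print(gap_exact, gap_pbr)
```

Both scaled gaps should be close to, but not exactly, 1/2 because of the $O(1)$ terms.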

V. COMPARISON OF CONFIDENCE INTERVALS
Let $P$ be one of $P_{CH,n}$, $P_{PBR,n}$ or $P_{X,n}$. Given a value $\hat{\theta}$ of $\hat{\Theta}$, the level-$a$ confidence set determined by the test with p-value $P$ is $I = \{\phi \mid P(\hat{\theta}|\phi) \ge a\}$. By the monotonicity properties of $P$, the closure of $I$ is an interval $[\phi_a(\hat{\theta}; P), 1]$. We can compute the endpoint $\phi_a$ by numerically inverting the exact expressions for $P$. An example is shown in Fig. 3, where we show the endpoints according to each test for $a = 0.01$ and $\hat{\theta} = 0.5$ as a function of $n$. All tests' endpoints converge to 0.5 as the number of trials grows. Notably, the relative separation between the endpoints is not large at level $a = 0.01$.
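The numerical inversion can be sketched with a bisection on $\phi$, which relies only on the p-values being non-decreasing in $\phi$. The log-space p-value functions below are assumptions written out for this example (binomial tail, Chernoff-Hoeffding bound, PBR form with the factor $(n+1)\binom{n}{k}$), and all function names are hypothetical. The bisection assumes that the p-value at $\phi = k/n$ is at least $a$.

```python
from math import exp, lgamma, log


def log_binom(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_p_exact(n, k, phi):
    logs = [log_binom(n, j) + j * log(phi) + (n - j) * log(1 - phi) for j in range(k, n + 1)]
    m = max(logs)
    return m + log(sum(exp(x - m) for x in logs))


def log_p_ch(n, k, phi):
    t = max(k / n, phi)
    return n * t * log(phi / t) + n * (1 - t) * log((1 - phi) / (1 - t))


def log_p_pbr(n, k, phi):
    p = min(phi, k / n)
    return k * log(p) + (n - k) * log(1 - p) + log(n + 1) + log_binom(n, k)


def lower_endpoint(log_p, n, k, a, tol=1e-10):
    """Smallest phi in (0, k/n] with p-value >= a, found by bisection."""
    lo, hi = 0.0, k / n
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if log_p(n, k, mid) >= log(a):
            hi = mid   # mid is accepted, so the endpoint is at or below mid
        else:
            lo = mid   # mid is rejected, so the endpoint is above mid
    return hi


n, k, a = 1000, 500, 0.01
e_exact = lower_endpoint(log_p_exact, n, k, a)
e_ch = lower_endpoint(log_p_ch, n, k, a)
e_pbr = lower_endpoint(log_p_pbr, n, k, a)
print(e_pbr, e_ch, e_exact)
```

Because a larger p-value accepts more values of $\phi$, the PBR endpoint is the lowest and the exact endpoint the highest of the three.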
To quantify the behavior of the endpoints for the different tests, we normalize by the empirical standard deviation $\hat{\sigma} = \sqrt{\hat{\theta}(1-\hat{\theta})/n}$. The empirical endpoint deviation is then defined as
$$\gamma_a(\hat{\theta}; P) = \frac{\hat{\theta} - \phi_a(\hat{\theta}; P)}{\hat{\sigma}}.$$
For the exact test and for large $n$, we expect this quantity to be determined by the tail probabilities of a standard normal distribution. That is, if the significance $a$ is the probability that a normal RV of variance 1 exceeds $\kappa$, we expect $\gamma_a(\hat{\theta}; P_X) \approx \kappa$.
We take the point of view that the performance of a test is characterized by the size of the endpoint deviation. If the relative size of the deviations for two tests is close to 1, then they perform similarly for the purpose of characterizing the parameter $\theta$. Another way of comparing the intervals obtained is to consider their coverage probabilities. For our situation, the coverage probability for test $P$ at $a$ can be approximated by determining $a'$ such that $\gamma_{a'}(\hat{\theta}; P_X) = \gamma_a(\hat{\theta}; P)$. From Thm. 4 below, one can infer that the coverage probability is then approximately $1 - a' \ge 1 - a$. The coverage probabilities can be very conservative (larger than $1 - a$), particularly for small $a$ and $P = P_{PBR}$.
We determined interval bounds for the empirical endpoint deviation for all three tests. The details are in App. VII E. The next theorem summarizes the results asymptotically.
Theorem 4. Let $q(x) = -\log(P_{N(0,1)}(N \ge x))$ be the negative logarithm of the tail of the standard normal. Fix $\hat{\theta} \in (0, 1)$. Write $\alpha = |\log(a)|$. There is a constant $c$ (depending on $\hat{\theta}$) such that for $\alpha \in (1, cn]$, $\gamma_a$ satisfies γ a (θ; The last expression has the following approximation relevant for sufficiently large $\alpha$: For $\alpha = o(\sqrt{n})$, the relative error of the approximation in the first two identities goes to zero as $n$ grows. This is not the case for the last identity, where the relative error for large $n$ is dominated by the term $O(\log(\alpha)/\alpha^{3/2})$, and large $\alpha$ is required for a small relative error.
Proof. The expression for γ a (θ; P CH ) is obtained from Thm. 12 in the Appendix by changing the relative approximation errors into absolute errors.
To obtain the expressions for γ a (θ; P X ), we refer to Thm. 14, where the lower bound on α implies α ≥ 1 > log(2). The intervals in Thm. 14 give relative errors that need to be converted to absolute quantities. By positivity and monotonicity of q −1 , for sufficiently large n and for some positive constants u and v, we have Explicit values for u and v can be obtained from Thm. 14. We simplified the argument of q −1 by absorbing the additive terms in the theorem into the term uα √ α/ √ n with the constant u chosen to be sufficiently large. Consider Eq. 94 with δ = u √ α/ √ n. For sufficiently large n, the expression in the denominator of the approximation error on the right-hand side exceeds a constant multiple of α. From this, with some new constant u , which, with order notation simplifies further to It now suffices to apply q −1 (α) = O( √ α) (see the proof of Eq. 24 below) and Eq. 23 is obtained.
The expression for $\gamma_a(\hat{\theta}; P_X)$ confirms our expectation that it approaches the expected value for a standard normal distribution and may be compared to the Berry-Esseen theorem [14]. The empirical endpoint deviation of the CH test approaches that of the exact test for small $a$ (large $\alpha$). Their squares differ by a term of order $\log(\alpha) = \log|\log(a)|$. Notably, the ratio of the PBR and CH tests' empirical endpoint deviations grows as $\Theta(\sqrt{\log(n)/\alpha})$. The relationships are visualized in Figs. 4, 5 and 6 for different values of $a$. The figures show that the relative sizes of the empirical endpoint deviations tend toward 1 with smaller $a$. The $\Theta(\sqrt{\log(n)/\alpha})$ relative growth of the PBR test's endpoint deviations leads to less than a doubling of the deviations relative to the exact test's at $a = 0.01$ and $a = 0.001$ even for $n = 10^6$. So while the test's coverage probabilities are much closer to 1 than the nominal value of $1 - a$, we believe that it does not lead to unreasonably conservative results in many applications.
Next we consider the behavior of the true endpoint deviations given by the normalized difference of the true success probability $\theta$ and the endpoint obtained from one of the tests. Let $\sigma = \sqrt{\theta(1-\theta)/n}$ be the true standard deviation and define the true endpoint deviation determined by test $P$ as
$$\tilde{\gamma}_a(\hat{\Theta}; P) = \frac{\theta - \phi_a(\hat{\Theta}; P)}{\sigma}.$$
The true endpoint deviations show how the inferred endpoint compares to $\theta$ and therefore directly exhibit the statistical fluctuations of $\hat{\Theta}$. In contrast, the empirical endpoint deviations are to lowest order independent of $\hat{\theta} - \theta$.
We take the view that two tests' endpoints perform similarly if their true endpoint deviations differ by an amount that is small compared to the width of the distribution of the true endpoint deviations. To compare the three tests on this basis, we consider the quantiles for Θ̂ corresponding to ±κ Gaussian standard deviations from θ with κ constant. The quantiles satisfy θ_{±κ} = θ ± κσ(1 + O(1/√n)), by the Berry-Esseen theorem or from Thm. 14. Since σ̂ = σ(1 + O(1/√n)), we can also see that γ_a(θ_{±κ}|P) = γ_a(θ|P) + O(1/√n), and so, by substituting into the definition, the true endpoint deviations at these quantiles are determined up to O(1/√n) terms, where the implicit constants depend on κ. For large α, the CH and exact tests' endpoints are close and are dominated by κ, so they perform similarly. But this does not hold for the comparison of the CH or the exact test's endpoints to those of the PBR test, since the latter's endpoint deviation grows as log(n)/2. The PBR test's robustness to stopping rules requires that its endpoint deviations grow. Qualitatively, we expect a growth of at least Ω(√(log log(n))) due to the law of the iterated logarithm. This growth is slower than the log(n)/2 growth found above, suggesting that improvements are possible, as observed in Ref. [17]. In many applications, the number of trials to be acquired can be determined ahead of time, so full robustness to stopping rules is not necessary. However, the ability to adapt to changing experimental conditions may still be helpful, as the example in Sect. II shows. If we know the number of trials ahead of time, we can retain the ability to adapt while avoiding the asymptotic growth of the endpoint deviations of the PBR test.
A strategy for avoiding the asymptotic growth of the PBR test's endpoint deviations is to set aside the first m = λn of the trials for training to infer the probability of success, and then use this to determine the test factor to be used on the remaining (1 − λ)n of the trials. With this strategy, the endpoint deviations are bounded on average and typically. We formalize the training strategy as follows: Modify Eq. 11 by setting F_{k+1} = 1 for k < m and, for k ≥ m, by using the test factor determined by the training estimate Θ̂_m. Let G_{k+1} = F_{k+1} if ϕ ≤ Θ̂_m and G_{k+1} = 1 otherwise. The G_{k+1} are valid test factors for the null B_ϕ. The resulting p-value for testing B_ϕ is expressed in terms of Θ̂_m and Θ̂′_m, where Θ̂′_m is defined by (n − m)Θ̂′_m + mΘ̂_m = nΘ̂_n. We call this the P_λ test. We also define a companion quantity Q_λ(B|ϕ) such that for ϕ ≤ Θ̂_m, Q_λ(B|ϕ) = P_λ(B|ϕ).

To investigate the behavior of these quantities, we consider values b, θ̂, θ̂_m and θ̂′_m of the corresponding RVs. As a function of ϕ, Q_λ(b|ϕ) is maximized at ϕ = θ̂′_m and monotone on either side of θ̂′_m. If θ̂_m ≤ ϕ ≤ θ̂′_m, then Q_λ(b|ϕ) ≥ 1 = P_λ(b|ϕ). So for ϕ ≤ max(θ̂_m, θ̂′_m), we can use Q_λ instead of P_λ without changing endpoint calculations. For determining the endpoint of a level-a one-sided confidence interval from P_λ, we seek the maximum ϕ such that for all ϕ′ ≤ ϕ, P_λ(b|ϕ′) ≤ a. This maximum value of ϕ satisfies ϕ ≤ min(θ̂_m, θ̂′_m): For θ̂_m ≤ θ̂′_m, this follows from P_λ(b|θ̂_m) = 1. For θ̂_m ≥ θ̂′_m, the location of the maximum of Q_λ implies that P_λ(b|θ̂′_m) ≥ P_λ(b|θ̂_m) = 1.
We show that endpoint deviations from the P_λ test are typically a constant factor larger than those of the CH test. For large α, the factor approaches 1/√(1 − λ), approximating the endpoint deviations for a CH test with (1 − λ)n trials. We begin by comparing P_λ to P_{CH,(1−λ)n} with the latter applied to the last (1 − λ)n trials and ϕ ≤ θ̂_m, where we can use Q_λ in place of P_λ.
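The training strategy can be illustrated with a minimal sketch. It assumes that the test factor for trials after the first m is the likelihood ratio of the training estimate against ϕ, and that the p-value bound is the reciprocal of the product of the test factors without capping at 1; both choices and all function names (`log_q_lambda`, `p_lambda`) are assumptions made for illustration, not definitions taken from the text.

```python
import math

def kl(t, phi):
    """Bernoulli Kullback-Leibler divergence KL(t|phi), using 0*log(0) = 0."""
    out = 0.0
    if t > 0:
        out += t * math.log(t / phi)
    if t < 1:
        out += (1 - t) * math.log((1 - t) / (1 - phi))
    return out

def log_q_lambda(xs, m, phi):
    """log of Q_lambda: minus the log of the product of the assumed test factors
    F = nu_est(x)/nu_phi(x) over trials m+1..n, with est the training frequency."""
    est = sum(xs[:m]) / m                  # estimate from the first m (training) trials
    if not 0 < est < 1:
        raise ValueError("sketch assumes a training estimate strictly inside (0, 1)")
    rest = xs[m:]
    successes = sum(rest)
    failures = len(rest) - successes
    return (successes * math.log(phi / est)
            + failures * math.log((1 - phi) / (1 - est)))

def p_lambda(xs, m, phi):
    """P_lambda: the test factors are switched to 1 when phi exceeds the training estimate."""
    est = sum(xs[:m]) / m
    if phi > est:
        return 1.0
    return math.exp(log_q_lambda(xs, m, phi))   # equals Q_lambda for phi <= est

# Deterministic example: 50 training trials (33 successes), 150 test trials (105 successes).
m = 50
xs = [1] * 33 + [0] * 17 + [1] * 105 + [0] * 45
print(p_lambda(xs, m, 0.5), p_lambda(xs, m, 0.8))
```

Under these assumptions, log Q_λ equals −(n − m)[KL(θ̂′_m|ϕ) − KL(θ̂′_m|θ̂_m)], which makes the comparison to a CH test on the last (1 − λ)n trials explicit: the only loss is the factor contributed by the mismatch between the training estimate and the frequency in the remaining trials.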

VI. CONCLUSION
It is clear that for the specific problem of one-sided hypothesis testing and confidence intervals for Bernoulli RVs, it is always preferable to use the exact test in the ideal case, where the trials are i.i.d. For general nulls, exact tests are typically not available, so approximations are used. The approximations often do not take into account failure of underlying distributional assumptions. The approximation errors can be large at high significance. Thus trustworthy alternatives such as those based on large deviation bounds or test supermartingales are desirable. Our goal here is not to suggest that these alternatives are better for the example of Bernoulli RVs, but to determine the gap between them and an exact test, in a case where an exact test is known and all tests are readily calculable. The suggestion is that for high-significance applications, the gaps are relatively small on the relevant logarithmic scale. For p-values, they are within what is expected from experiment-to-experiment variation, even for moderate significances. For confidence intervals, the increase in size is bounded by a constant if the number of trials is known ahead of time, but there is a slowly growing cost with the number of trials if we allow for arbitrary stopping rules.

A. Preliminaries
Notation and definitions are as introduced in the text. The p-value bounds obtained by the three tests investigated are denoted by P_X for the exact, P_CH for the Chernoff-Hoeffding, and P_PBR for the PBR test. They depend on n, ϕ and Θ̂. For reference, here are the definitions again.
The gain per trial for a p-value bound P_n is G_n(P_n) = −log(P_n)/n. The values of ϕ, θ̂ and θ are usually constrained. Unless otherwise stated, we assume that 0 < ϕ, θ̂, θ < 1 and n ≥ 1.
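As an illustration of the three p-value bounds and the gain per trial, the following sketch evaluates them for a small example. The exact test is taken to be the binomial tail probability and the CH bound is exp(−n KL(t|ϕ)) for t > ϕ. The closed form used for the PBR bound, (n + 1) C(n, s) ϕ^s (1 − ϕ)^{n−s}, is an assumption based on the uniform-mixture (dF(t) = dt) martingale and is not capped at 1; the function names are hypothetical.

```python
import math

def kl(t, phi):
    """Bernoulli Kullback-Leibler divergence KL(t|phi), using 0*log(0) = 0."""
    out = 0.0
    if t > 0:
        out += t * math.log(t / phi)
    if t < 1:
        out += (1 - t) * math.log((1 - t) / (1 - phi))
    return out

def p_exact(n, s, phi):
    """Exact binomial tail Prob(S_n >= s) at success probability phi (moderate n only)."""
    return sum(math.comb(n, k) * phi**k * (1 - phi)**(n - k) for k in range(s, n + 1))

def p_ch(n, s, phi):
    """Chernoff-Hoeffding bound exp(-n KL(t|phi)) for t = s/n > phi, and 1 otherwise."""
    t = s / n
    return 1.0 if t <= phi else math.exp(-n * kl(t, phi))

def p_pbr(n, s, phi):
    """Assumed PBR bound (n+1) C(n,s) phi^s (1-phi)^(n-s); it may exceed 1."""
    log_p = (math.log(n + 1)
             + math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
             + s * math.log(phi) + (n - s) * math.log(1 - phi))
    return math.exp(log_p)

def gain_per_trial(p, n):
    """Gain per trial G_n(P_n) = -log(P_n)/n."""
    return -math.log(p) / n

n, s, phi = 100, 70, 0.5
for name, p in (("exact", p_exact(n, s, phi)), ("CH", p_ch(n, s, phi)), ("PBR", p_pbr(n, s, phi))):
    print(name, p, gain_per_trial(p, n))
```

For this example the gains per trial appear in the order PBR ≤ CH ≤ exact, and the CH gain per trial equals KL(t|ϕ) by construction.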
Most of this appendix is dedicated to obtaining upper and lower bounds on log(p)-values and lower bounds on endpoints of confidence intervals. We make sure that the upper and lower bounds differ by quantities that converge to zero as n grows. Their differences are O(1/n) for log(p)-values and O(1/√n) for confidence lower bounds. We generally aim for simplicity when expressing these bounds, so we do not obtain tight constants.
The expression in Eq. 50 can be seen as the inverse of a positive martingale for H_0 = {ν_ϕ} determined by S_n. The complete family of such martingales was obtained by Ville [20], Chapter 5, Sect. 3, Eq. 21. Ours is obtained from Ville's with dF(t) = dt as the probability measure.

C. Log-p-Value Approximations
We use −log(P_{CH,n}(t|ϕ)) = n KL(t|ϕ) as our reference value. According to Thm. 1, the log(p)-values are ordered according to −log(P_PBR) ≤ −log(P_CH) ≤ −log(P_X). To express the asymptotic differences between the log(p)-values, we use auxiliary functions. The first is H_n(t). The first two terms of its defining expression can be recognized as the Shannon entropy of n independent random bits, each with probability t for bit value 1. For t ∈ [1/n, 1 − 1/n] and with Stirling's approximation √(2πn)(n/e)^n e^{1/(12n+1)} ≤ n! ≤ √(2πn)(n/e)^n e^{1/(12n)} applied to the binomial coefficient, we get bounds containing the term log((n/e)^n/((tn/e)^{tn}((1 − t)n/e)^{(1−t)n})). We can increase the interval to simplify the bounds while preserving convergence for large n. For the lower bound, we use −1/(12t(1 − t)n). For the upper bound, note that (12tn + 1)(12(1 − t)n + 1) is maximized at t = 1/2. We can therefore increase the upper bound according to 1/(12n) − (12n + 2)/((12tn + 1)(12(1 − t)n + 1)) for n ≥ 1. From this we obtain the interval expression valid for t ∈ [1/n, 1 − 1/n]. The boundary values of H_n at t = 0 and t = 1 are −log(n + 1)/2.

The next auxiliary function is Y(t), whose bounds are taken from Ref. [13]. See this reference for a summary of all properties of Y mentioned here, or Ref. [15] for more details. The function Y is related to the tail of the standard normal distribution, the Q-function, by Q(t) = e^{−t²/2} Y(t)/√(2π). The function Y is monotonically decreasing and convex, Y(0) = √(π/2), and it satisfies the differential equation (d/dt)Y(t) = tY(t) − 1. We make use of the following bounds involving Y: log(t) ≤ −log(Y(t)) ≤ log(t) + 1/t² for t > 0. The lower bound comes from the upper bound 1/t for Y(t). The upper bound is from the lower bound t/(1 + t²) for Y(t). Specifically, we compute −log(Y(t)) ≤ −log(t/(1 + t²)) = log(t) + log(1 + 1/t²) ≤ log(t) + 1/t².
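The stated properties of Y and the Stirling bounds can be checked numerically. The sketch below computes Y from the relation Q(t) = e^{−t²/2} Y(t)/√(2π), with Q evaluated through the complementary error function; the function names are chosen for illustration only, and the direct formula for Y is only suitable for moderate t because the exponential factor grows quickly.

```python
import math

def Q(t):
    """Tail of the standard normal distribution."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def Y(t):
    """Y(t) = sqrt(2*pi) * exp(t^2/2) * Q(t), obtained by inverting the stated relation."""
    return math.sqrt(2.0 * math.pi) * math.exp(t * t / 2.0) * Q(t)

def stirling_log_bounds(n):
    """Lower and upper bounds on log(n!) from the two-sided Stirling approximation."""
    base = 0.5 * math.log(2.0 * math.pi * n) + n * math.log(n) - n
    return base + 1.0 / (12 * n + 1), base + 1.0 / (12 * n)

for t in (0.5, 1.0, 2.0, 5.0):
    y = Y(t)
    # Bounds t/(1+t^2) <= Y(t) <= 1/t and the derived bounds on -log(Y(t)).
    print(t, t / (1 + t * t), y, 1 / t, math.log(t), -math.log(y), math.log(t) + 1 / t**2)

for n in (1, 2, 5, 10, 100):
    lo, hi = stirling_log_bounds(n)
    print(n, lo, math.lgamma(n + 1), hi)
```

The differential equation can be checked in the same way by comparing a central finite difference of Y with tY(t) − 1.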
With these definitions, we can express the log(p)-values in terms of their difference from − log(P CH ).
Then for 0 < ϕ < t < 1, − log(P X,n (t|ϕ)) ∈ − log(P PBR,n (t|ϕ)) + log(n + 1) − log(P X,n (t|ϕ)) ∈ − log(P CH,n (t|ϕ)) + Observe that lE n (t|ϕ) is O(1) with respect to n for t > ϕ constant. The first term in the defining minimum is smaller than 1 only for ϕ within less than one standard deviation (which is O(1/ √ n)) of t. It is defined so that the primary dependence on the parameters is visible in the interval bounds.
Proof. For approximating P X , we apply Thm. 2 of Ref. [13] with the following sequence of substitutions, the first four of which expand the definitions in the reference: With the given substitutions and Y (t) as defined by Eq. 57, we obtain for t ≥ ϕ, The second identity of the theorem follows by substituting the expression from Thm. 6.
We can eliminate the function Y from the expressions by applying the bounds from Eq. 58.

D. Asymptotic Normality of the log(p)-Values and Their Differences
The main tool for establishing the asymptotic distribution of the log(p)-values is the "delta method". A version sufficient for our purposes is Thm. 1.12 and Cor. 1.1 of Ref. [19]. The notation X_n →^D N(µ, σ²) means that X_n converges in distribution to the normal distribution with mean µ and variance σ². By the central limit theorem, Θ̂_n = S_n/n satisfies √n(Θ̂_n − θ) →^D N(0, θ(1 − θ)). An application of the delta method therefore yields the next lemma.
Theorem 10. For P = P_CH, P = P_PBR or P = P_X, and 0 < ϕ < θ < 1 constant, the gain per trial G_n(P) converges in distribution according to √n(G_n(P) − KL(θ|ϕ)) →^D N(0, σ_G²) with σ_G² = θ(1 − θ)(log(θ(1 − ϕ)/(ϕ(1 − θ))))².

Proof. Consider P = P_CH first. In Lem. 9, define F(x) = KL(x|ϕ) = x log(x/ϕ) + (1 − x) log((1 − x)/(1 − ϕ)) so that F(Θ̂_n) = G_n(P_CH). For the derivative of F at x = θ, we get F′(θ) = log(θ/ϕ) − log((1 − θ)/(1 − ϕ)). The theorem now follows for P_CH by applying Lem. 9. Thm. 6 and the law of large numbers imply that (−log(P_PBR)/√n) − (−log(P_CH)/√n) converges in probability to 0. Cor. 8 implies the same for P_X, namely that (−log(P_X)/√n) − (−log(P_CH)/√n) converges in probability to 0. In general, if X_n − Y_n converges in probability to 0 and Y_n →^D µ, then X_n →^D µ, see Ref. [3], Prop. 6.3.3. The statement of the theorem to be proven now follows for P = P_PBR and P = P_X by comparison of √n G_n(P_PBR) and √n G_n(P_X) to √n G_n(P_CH).
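The delta-method step for P_CH can be checked without simulation. The sketch below computes the exact mean and variance of √n(KL(Θ̂_n|ϕ) − KL(θ|ϕ)) under the binomial distribution and compares the variance to θ(1 − θ)F′(θ)², the value the delta method predicts for F(x) = KL(x|ϕ). The function names and the example parameters are assumptions made for illustration.

```python
import math

def kl(x, phi):
    """Bernoulli Kullback-Leibler divergence KL(x|phi), using 0*log(0) = 0."""
    out = 0.0
    if x > 0:
        out += x * math.log(x / phi)
    if x < 1:
        out += (1 - x) * math.log((1 - x) / (1 - phi))
    return out

def exact_moments(n, theta, phi):
    """Exact mean and variance of sqrt(n)*(KL(S/n|phi) - KL(theta|phi)), S ~ Binomial(n, theta)."""
    center = kl(theta, phi)
    mean = 0.0
    second = 0.0
    for s in range(n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
                   + s * math.log(theta) + (n - s) * math.log(1 - theta))
        weight = math.exp(log_pmf)
        z = math.sqrt(n) * (kl(s / n, phi) - center)
        mean += weight * z
        second += weight * z * z
    return mean, second - mean * mean

def delta_method_variance(theta, phi):
    """theta*(1-theta)*F'(theta)^2 with F'(theta) = log(theta/phi) - log((1-theta)/(1-phi))."""
    slope = math.log(theta / phi) - math.log((1 - theta) / (1 - phi))
    return theta * (1 - theta) * slope ** 2

mean, var = exact_moments(2000, 0.7, 0.5)
print(mean, var, delta_method_variance(0.7, 0.5))
```

The exact mean is small but nonzero at finite n because of the curvature of KL; it vanishes as n grows.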
The differences of the log(p)-values have much tighter distributions. They are also asymptotically normal with scaling and variances given in the next theorem. The differences are Ω(log(n)) with standard deviations O(1/√n).
Similarly, from Cor. 8 and taking note of the definition of lE n (t|ϕ), converges in probability to zero. The relevant derivative is and combining the two observations gives Eq. 72.

E. Confidence Interval Endpoints
For the one-sided confidence intervals, we need to determine the lower boundaries of acceptance regions, that is, the confidence lower bounds. By monotonicity of the p-values in ϕ, it suffices to solve equations of the form −log(P(θ̂, ϕ)) = α, where a = e^{−α} is the desired significance level. Here we obtain lower and upper bounds on the solutions ϕ.
To illuminate the asymptotic behavior of solutions ϕ of −log(P(θ̂, ϕ)) = α, we reparametrize the log-p-values so that our scale is set by an empirical standard deviation, namely σ̂ = √(θ̂(1 − θ̂)/n). Thus we express the solution as ϕ(γ, θ̂) = θ̂ − γσ̂, in terms of a scaled deviation down from θ̂. Inverting for γ we get γ = (θ̂ − ϕ)/σ̂.

Theorem 12. Let 0 < θ̂ < 1 and α > 0. Suppose that α ≤ nθ̂²(1 − θ̂)²/8. Then there is a solution γ_α > 0 of the identity −log(P_CH(θ̂, ϕ(γ_α, θ̂))) = α that satisfies explicit interval bounds. The constants in this theorem and elsewhere are chosen for convenience, not for optimality; better constants can be extracted from the proofs. Note that the upper bound on α ensures that the reciprocal square root is bounded away from zero. However, for the relative error to go to zero as n grows requires α = o(n).
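The endpoint equation can also be solved numerically. The sketch below finds the confidence lower bound by bisection for the CH test and for an assumed closed form of the PBR bound, (n + 1) C(n, s) ϕ^s (1 − ϕ)^{n−s}, and converts each endpoint to an empirical endpoint deviation γ = (θ̂ − ϕ)/σ̂. The PBR closed form, the function names, and the example values are assumptions made for illustration.

```python
import math

def kl(t, phi):
    """Bernoulli Kullback-Leibler divergence KL(t|phi), using 0*log(0) = 0."""
    out = 0.0
    if t > 0:
        out += t * math.log(t / phi)
    if t < 1:
        out += (1 - t) * math.log((1 - t) / (1 - phi))
    return out

def neg_log_p_ch(n, t, phi):
    """-log of the CH bound for phi <= t."""
    return n * kl(t, phi)

def neg_log_p_pbr(n, t, phi):
    """-log of the assumed PBR bound (n+1) C(n,s) phi^s (1-phi)^(n-s), with s = t*n."""
    s = round(t * n)
    return -(math.log(n + 1)
             + math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
             + s * math.log(phi) + (n - s) * math.log(1 - phi))

def lower_endpoint(neg_log_p, n, t, alpha, iters=200):
    """Bisection for the phi in (0, t] with neg_log_p = alpha; neg_log_p decreases in phi there."""
    lo, hi = 1e-12, t
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if neg_log_p(n, t, mid) > alpha:
            lo = mid          # p-value still below a: the endpoint lies further up
        else:
            hi = mid
    return 0.5 * (lo + hi)

def endpoint_deviation(phi, n, t):
    """Empirical endpoint deviation gamma = (t - phi)/sigma_hat."""
    return (t - phi) / math.sqrt(t * (1 - t) / n)

n, t, alpha = 10**6, 0.5, math.log(100)      # significance level a = 0.01
gamma_ch = endpoint_deviation(lower_endpoint(neg_log_p_ch, n, t, alpha), n, t)
gamma_pbr = endpoint_deviation(lower_endpoint(neg_log_p_pbr, n, t, alpha), n, t)
print(gamma_ch, math.sqrt(2 * alpha), gamma_pbr / gamma_ch)
```

In this example the CH deviation is close to √(2α), and the PBR deviation is larger by a factor between 1 and 2.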
The function q(x) is the negative logarithm of the Q-function, which is the tail of the standard normal distribution. The lower bound on α in Thm. 14 ensures that there is a solution with γ_α > 0, because q(0) = log(2). For reference, the constants multiplying the interval expressions are 64/(15√15) ≈ 1.102, 8/√15 ≈ 2.066, √(π/6) ≈ 0.724, 2/√5 ≈ 0.894. Note that in the large n limit, where the O(1/√n) terms are negligible, the value of γ_α in Thm. 14 corresponds to the (1 − e^{−α})-quantile of the standard normal.
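The function q and its inverse are easy to evaluate numerically for moderate α. The sketch below, with function names chosen for illustration, computes q through the complementary error function, inverts it by bisection, and compares q^{−1}(α) with the (1 − e^{−α})-quantile of the standard normal and with √(2α).

```python
import math
from statistics import NormalDist

def q(x):
    """Negative logarithm of the standard normal tail Q(x)."""
    return -math.log(0.5 * math.erfc(x / math.sqrt(2.0)))

def q_inv(alpha, iters=200):
    """Inverse of q on [0, inf) by bisection; requires alpha >= log(2) and moderate alpha."""
    if alpha < math.log(2.0):
        raise ValueError("q_inv needs alpha >= q(0) = log(2)")
    lo, hi = 0.0, 1.0
    while q(hi) < alpha:       # q is increasing, so doubling brackets the solution
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if q(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for alpha in (1.0, 3.0, 6.0):
    quantile = NormalDist().inv_cdf(1.0 - math.exp(-alpha))
    print(alpha, q_inv(alpha), quantile, math.sqrt(2.0 * alpha))
```

The comparison with √(2α) illustrates q^{−1}(α) = O(√α); it follows from the standard bound Q(x) ≤ e^{−x²/2}/2 for x ≥ 0.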
From the inequality (d/dy)q^{−1}(y) ≤ 1/q^{−1}(y) in Eq. 91, integration and monotonicity, we obtain a bound for 0 ≤ z ≤ δ. To determine the relative error, write δ′ = δ/α to obtain an interval inclusion. For α(1 − δ′) > q(1), the interval relationship can be weakened. The relative error on the right-hand side is given by the term multiplying the interval, and can be written as αδ′/(α − (αδ′ + q(1) − 1)). If αδ′ + q(1) − 1 ≤ α/2, then the relative error is bounded by 2δ′, which is twice the relative error of α. Of course, for the interval bounds to converge, we need α = o(n).