On matching confidence intervals and tests for some discrete distributions: methodological and computational aspects

Exact two-tailed tests and two-sided confidence intervals (CIs) for a binomial proportion or Poisson parameter by Sterne (Biometrika 41:117–129, 1954) or Blaker (Can J Stat 28(4):783–798, 2000) are successful in reducing conservatism of the Clopper–Pearson method. However, the methods suffer from an inconsistency between the tests and the corresponding CIs: In some cases, a parameter value is rejected by the test, though it lies in the CI. The problem results from non-unimodality of the test p value functions. We propose a slight modification of the tests that avoids the inconsistency, while preserving nestedness and exactness. Fast and accurate algorithms for both the test modification and calculation of confidence bounds are presented together with their theoretical background.


Introduction
We follow up, in a sense, Fay (2010a, b), who pointed to some problems in categorical data analysis that may occur when a statistical test and a confidence interval (CI) are applied to the same data. It may happen that the test rejects a parameter value that lies in the CI, or the other way round. He recommended, in order to avoid such inconsistencies, using tests and CIs in matching pairs, and offered (unlike some prestigious statistical programs, at least in their versions of that time) software tools (R packages) making it possible to follow the recommendation.
However, even following this recommendation leaves some discrepancies between CIs and test results unresolved. Fay (2010b) calls them unavoidable inconsistencies. They stem from the fact that inverting some tests yields non-connected confidence sets whose gaps have to be "filled" in order to obtain confidence intervals. When a parameter value lies in such a gap, it is rejected by the test, yet it belongs to the CI.
We show that even such "unavoidable" inconsistencies can be avoided; it is only necessary to modify the test slightly. We propose an appropriate modification, as well as algorithms and software tools that make it applicable in some important cases, such as Sterne's and Blaker's methods for the binomial and Poisson distributions.

General considerations
Throughout the paper, we will deal with count data and their probability model: a nonnegative integer-valued variable T whose distribution has a single scalar parameter θ to be estimated and/or tested. (Possible additional parameters are supposed to be fixed and known.) There are many types of tests and CIs for such data. That stems perhaps from the typical impossibility of constructing uniformly most powerful tests, or CIs with uniform coverage probability (unless the tests and CIs are randomized), as well as from the lack of consensus about the principles and criteria worth following. Moreover, different principles, each of them looking appealing, may prove to be mutually incompatible, and it remains up to individual choice which one should be preferred.
The binomial distribution is an archetypal example illustrating the multiplicity of methods. It is known (see, e.g., Brown et al. 2001) that CIs for the binomial proportion cannot have constant coverage probability over the whole parameter space [0, 1], or equivalently, a nonrandomized test of the hypothesis θ = θ0 cannot have a constant type I error rate over all θ0.
This explains the diversity of approaches: some authors require coverage probabilities never below the nominal confidence level and test sizes never exceeding the nominal α (Schilling and Doi 2014; Andres and Tejedor 2004; Reiczigel et al. 2008), whereas others advocate approximate methods and require that the actual coverage or type I error rate be close to the nominal one on average (Agresti and Coull 1998).
The approximate CI referred to nowadays as the standard or Wald CI dates back to Laplace (1812), and there is quite a long list of alternative proposals (see Brown et al. (2001) for a review) trying to overcome some drawbacks of the original method (primarily its coverage probability falling far below the nominal level near the extremes of the parameter space).
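Both phenomena, the Wald interval's coverage collapse near the boundary and the guaranteed (if conservative) coverage of exact intervals, are easy to check numerically. The following Python sketch is our own illustration (the function names are ours, not taken from any package); it computes the exact coverage probability of a CI method at a given θ by summing the binomial probabilities of those samples whose interval covers θ:

```python
from math import comb, sqrt

def binom_pmf(n, theta, k):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def wald_ci(n, k, z=1.96):
    # standard (Wald) interval: point estimate +/- normal-approximation margin
    p = k / n
    h = z * sqrt(p * (1 - p) / n)
    return (p - h, p + h)

def clopper_pearson_ci(n, k, alpha=0.05):
    # exact interval: invert the two one-sided tail conditions by bisection
    def solve(pred, lo, hi):
        for _ in range(100):
            mid = (lo + hi) / 2
            if pred(mid):
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(
        lambda t: sum(binom_pmf(n, t, i) for i in range(k, n + 1)) > alpha / 2,
        0.0, k / n)
    upper = 1.0 if k == n else solve(
        lambda t: sum(binom_pmf(n, t, i) for i in range(k + 1)) < alpha / 2,
        k / n, 1.0)
    return (lower, upper)

def coverage(ci, n, theta):
    # exact coverage probability: P(theta lies in the interval computed from T)
    return sum(binom_pmf(n, theta, k) for k in range(n + 1)
               if ci(n, k)[0] <= theta <= ci(n, k)[1])
```

Near θ = 0.05 with n = 20, for instance, the Clopper-Pearson coverage stays above 0.95 (at the price of conservatism), while the Wald coverage drops far below it, largely because the degenerate interval obtained for k = 0 cannot cover any positive θ.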
The first exact interval for the binomial proportion was proposed by Clopper and Pearson (1934), but it was later strongly criticized for being far too conservative, or equivalently, much longer than necessary, in the case of two-sided intervals. Later on, other authors developed less conservative exact intervals for the two-sided case. Sterne (1954) proposed a principle leading to the shortest possible confidence sets but, unfortunately, it was shown by Crow (1956) that the sets may not be connected. The natural remedy is to transform them into intervals by filling the gaps. Reiczigel (2003) demonstrated that the resulting intervals are still fairly short. The length optimality, nevertheless, is lost. Blyth and Still (1983) and Casella (1986) proposed a method of obtaining exact CIs of minimum length, but these suffer from not being nested: a CI at a lower level may not be a subset of a higher level interval. In terms of testing this means that the same hypothesis may be rejected at significance level α1, but not rejected at a level α2 > α1. Blaker (2000) proved that exact binomial CIs cannot be length-optimal and nested at the same time. His intervals are subsets of the Clopper-Pearson intervals, nested, and nearly length-optimal.
The above review of binomial CIs is by no means complete (see, e.g., Schilling and Doi 2014; Lecoutre and Poitevineau 2016).

Eliminating inconsistencies: methodological proposal
Our aim is to set up a framework for constructing exact two-tailed tests and exact two-sided CIs for a real-valued parameter θ of the distribution of a variable T on {0, 1, 2, ...}. Here are three simple and natural criteria that we regard as important.
(I) The confidence sets are intervals.
(II) Confidence sets computed from the same sample at different confidence levels are nested: a confidence set at a higher confidence level contains the one at a lower level.
(III) The test and the confidence set invert each other, that is, the test rejects the hypothesis θ = θ0 at significance level α if and only if θ0 is not in the 1 − α confidence set.
When (II) and (III) hold, p value functions (also referred to as evidence functions, CI functions, etc.; see Hirji (2006)) f_k(θ), one for each possible value k of T, may be defined as

f_k(θ) = sup{α ∈ (0, 1); θ belongs to the 1 − α confidence set for T = k}    (1)

or, equivalently,

f_k(θ) = inf{α ∈ (0, 1); θ does not belong to the 1 − α confidence set for T = k}.

Equivalence of the above two definitions follows from nestedness (criterion II).
On the other hand, assume that some functions f_k(θ) with values in [0, 1] are given. Let us define a test based on these functions as

reject θ = θ0 at significance level α for T = k if and only if f_k(θ0) ≤ α,    (2)

and also a level 1 − α confidence set as {θ; f_k(θ) > α}. Then the test and confidence sets satisfy (II) and (III). However, (II) and (III) cannot guarantee that (I) holds as well. In order to explain this fact, we need the concept of function unimodality, and since the definitions in the literature vary slightly, we state the one we use.
Definition 1 A function f(x) defined on a nonempty set of reals I is unimodal if for every triplet of arguments x1 < x2 < x3 the inequality f(x2) ≥ min{f(x1), f(x3)} holds, and strictly unimodal if the inequality is strict. A function f(x) is reversely unimodal (strictly reversely unimodal) if −f(x) is unimodal (strictly unimodal).
(The property of being unimodal is referred to elsewhere as bimonotonicity, see Hirji (2006), or quasiconcavity, see Klaschka (2010).) It is easy to show that a p value function f_k(x) defined on an interval is unimodal if and only if the sets {x; f_k(x) > a} are intervals for all a ∈ R. So, when f_k(x) is not unimodal, it yields, in some cases (for some α values), non-connected confidence sets {θ; f_k(θ) > α}, and it is natural to substitute them with their convex hulls. After that, (I) holds, and (II) as well (since the operation of taking the convex hull preserves set inclusion). However, (III) is violated: when θ0 lies in the convex hull of {θ; f_k(θ) > α} but not in the set itself, the hypothesis θ = θ0 is rejected for T = k although θ0 belongs to the CI. That is what Fay (2010b) calls unavoidable inconsistencies (in contrast with other inconsistencies between test results and CIs that result from applying side by side two different types of test and CI).
This happens, for example, in the binomial case if testing H0: θ = 0.15 at level α = 0.05 by Sterne's test and observing k = 8 successes in a sample of n = 100. The resulting p value is p = 0.0496, whereas the 95% Sterne CI is (0.0375, 0.1534). Figure 1 illustrates the situation.
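The inconsistency can be reproduced directly from the definition of Sterne's p value (the total probability of the outcomes no more likely than the observed one). The sketch below is a minimal pure-Python illustration of ours, not the implementation from any of the packages discussed later; the grid step used to locate the gap is our own choice:

```python
from math import comb

def binom_pmf(n, theta, k):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def sterne_p(n, k, theta):
    # Sterne p value: total probability of outcomes no more likely than k
    pk = binom_pmf(n, theta, k)
    return sum(binom_pmf(n, theta, i) for i in range(n + 1)
               if binom_pmf(n, theta, i) <= pk)

# H0: theta = 0.15 is rejected at alpha = 0.05 ...
p_at_015 = sterne_p(100, 8, 0.15)

# ... yet the test does not reject some larger theta inside (0.15, 0.1534),
# which is why 0.15 lies in the gap-filled 95% CI (0.0375, 0.1534)
gap = [0.15 + j * 2e-5 for j in range(1, 170)
       if sterne_p(100, 8, 0.15 + j * 2e-5) > 0.05]
```

In the symmetric case n = 10, k = 2, θ = 0.5 the function returns exactly P(T ≤ 2) + P(T ≥ 8) = 112/1024, which is a convenient correctness check.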
A closer look at the problem of "unavoidable" inconsistencies reveals, however, that they can be avoided. It only requires that the p value functions be derived from the convex hulls of the (potentially non-connected) confidence sets by (1). Being derived from nested intervals, the modified p value functions are unimodal. Of course, the test should be modified accordingly, that is, derived from the modified p value functions via (2). That is a way, and in our opinion the most reasonable one, to fulfil all three conditions (I), (II), and (III).
More explicitly, the modified p value function f*_k(θ) (the uniformly smallest unimodal function satisfying f*_k(θ) ≥ f_k(θ)) may be written as

f*_k(θ) = min{ sup_{ξ≤θ} f_k(ξ), sup_{ξ≥θ} f_k(ξ) }.    (3)

Formula (3) still cannot be utilized directly for calculation. In Sect. 3 we give practical recipes for calculating the modified p values in some important cases.

Computational algorithms
In this section we describe the principles that the algorithms are based on, and then specify all details of the modified Sterne and Blaker methods with respect to the binomial and Poisson distributions.

Methods
Let T be a nonnegative integer-valued random variable whose distribution belongs to a family with a single parameter θ ∈ Θ, the parameter space Θ being an interval of reals.
Sterne's method (the probability based method in Hirji (2006)) is based on the p value functions

f^S_k(θ) = Σ_{i: P_θ(T=i) ≤ P_θ(T=k)} P_θ(T = i).    (4)

Another method of our interest is Blaker's method (in Hirji (2006), the combined tails method). Its p value function for T = k (referred to as acceptability in Blaker 2000) is defined as

f^B_k(θ) = P_θ( γ(T, θ) ≤ γ(k, θ) ),

where

γ(i, θ) = min{ P_θ(T ≤ i), P_θ(T ≥ i) }.

Both Sterne's and Blaker's methods guarantee (Blaker 2000) that the confidence sets {θ; f_k(θ) > α} are nested and exact, and all the more so their convex hulls (CIs).
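For concreteness, here is a small pure-Python sketch of both p value functions for the binomial case (our own illustrative code, not taken from any package; a tiny relative tolerance is used for the tie comparisons in the floating-point setting):

```python
from itertools import accumulate
from math import comb

def binom_pmfs(n, theta):
    return [comb(n, i) * theta**i * (1 - theta)**(n - i) for i in range(n + 1)]

def sterne_p(n, k, theta):
    # f^S_k: sum of P(T = i) over i with P(T = i) <= P(T = k)
    pmf = binom_pmfs(n, theta)
    return sum(q for q in pmf if q <= pmf[k] * (1 + 1e-12))

def blaker_p(n, k, theta):
    # f^B_k: P(gamma(T, theta) <= gamma(k, theta)), with
    # gamma(i, theta) = min(P(T <= i), P(T >= i))
    pmf = binom_pmfs(n, theta)
    low = list(accumulate(pmf))                 # P(T <= i)
    upp = list(accumulate(pmf[::-1]))[::-1]     # P(T >= i)
    gam = [min(a, b) for a, b in zip(low, upp)]
    return sum(pmf[i] for i in range(n + 1) if gam[i] <= gam[k] * (1 + 1e-12))
```

In the symmetric case n = 10, k = 2, θ = 0.5 both functions equal P(T ≤ 2) + P(T ≥ 8) = 112/1024; in general they differ, and blaker_p never exceeds the Clopper-Pearson p value min{1, 2P_θ(T ≤ k), 2P_θ(T ≥ k)} discussed next.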
Moreover, Blaker's p value function f^B_k(θ) is always less than or equal to the Clopper-Pearson p value function min{1, 2P_θ(T ≤ k), 2P_θ(T ≥ k)}, so that the Blaker CI must be a subset of the Clopper-Pearson one. Note that an analogous statement does not hold for the Sterne p value function and the corresponding CIs.
We will specifically apply the Sterne and Blaker methods to the binomial Bin(n, θ) (n fixed) and Poisson Poi(θ ) distributions.

Theoretical background of the algorithms
The following two Lemmas hold for the binomial and Poisson distributions.
Lemma 1 Let T have either the Bin(n, θ), or the Poi(θ) distribution. Then for i, j from the support of T, i < j, there is a unique solution θ = σ^S_ij to

P_θ(T = i) = P_θ(T = j).    (5)

The difference P_θ(T = j) − P_θ(T = i) changes its sign from negative to positive at σ^S_ij, and σ^S_ij < σ^S_rs whenever i ≤ r, j ≤ s and {i, j} ≠ {r, s}.

Proof Simple algebra.
Lemma 2 Let T have either the Bin(n, θ), or the Poi(θ) distribution. Then for i, j from the support of T, i < j, there is a unique solution θ = σ^B_ij to

P_θ(T ≤ i) = P_θ(T ≥ j).

The difference P_θ(T ≥ j) − P_θ(T ≤ i) changes its sign from negative to positive at σ^B_ij, and σ^B_ij < σ^B_rs whenever i ≤ r, j ≤ s and {i, j} ≠ {r, s}.

Proof The Lemma is a trivial consequence of the facts that the functions P_θ(T ≤ i) are continuous in θ, and that both distribution families are stochastically increasing, i.e. P_θ(T ≤ i) is a strictly decreasing function of θ unless i is the maximum of the support of T.
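Both lemmas translate directly into code. In the binomial case σ^S_ij has a closed form (equate the two probabilities and solve for the odds θ/(1 − θ)), while σ^B_ij is found numerically from the sign change of P_θ(T ≥ j) − P_θ(T ≤ i). The following is an illustrative Python sketch of ours, not code from the BlakerCI package:

```python
from math import comb

def binom_pmf(n, theta, k):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def sigma_sterne(n, i, j):
    # P(T = i) = P(T = j), i < j  <=>  (theta/(1-theta))**(j-i) = C(n,i)/C(n,j)
    r = (comb(n, i) / comb(n, j)) ** (1.0 / (j - i))
    return r / (1.0 + r)

def sigma_blaker(n, i, j, iters=100):
    # P(T <= i) = P(T >= j), i < j; the difference P(T >= j) - P(T <= i)
    # changes sign from negative to positive (Lemma 2), so bisection applies
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        upper = sum(binom_pmf(n, mid, t) for t in range(j, n + 1))
        lower = sum(binom_pmf(n, mid, t) for t in range(i + 1))
        if upper > lower:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```

By symmetry, both roots equal 0.5 for n = 10, i = 2, j = 8, and the monotonicity in the lemmas (σ increases when i or j does) is easy to spot-check.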
Due to Lemmas 1 and 2, both Sterne's and Blaker's p value functions may be described in a unified manner. Though dealing in detail just with the binomial and Poisson distributions, we present the description so that "the door is left open" for a broader class of distributions. Thus, we will write the minimum of the support of the variable T as k_min (instead of 0), and the maximum, if one exists, as k_max (instead of n in case of Bin(n, θ)). Symbols f_k and σ stand either for f^S_k and σ^S, or for f^B_k and σ^B. So, in case of a finite support of T the p value function may be described as

f_k(θ) = 1 − Σ_{i<k; σ_ik>θ} P_θ(T = i) − Σ_{i>k; σ_ki<θ} P_θ(T = i),    (6)

and in case of an unbounded support by the same expression with the second sum running over the infinite set {i > k; σ_ki < θ}    (7).

We can see that for the binomial and Poisson distributions, and, more generally, for any distribution such that (6) or (7) holds and the likelihood functions P_θ(T = k) are continuous, both Sterne's and Blaker's p value functions f_k(θ) are piecewise continuous, and the only discontinuities are jumps upwards at {σ_ik; i < k} and downwards at {σ_ki; i > k}. At each discontinuity point each of the p value functions has both one-sided limits, and takes the higher of the limits as its value. An example is presented in Fig. 2.
Note that in case of the Blaker method, formulas (6) and (7) apply to all stochastically increasing families, and analogous ("mirror-reflected") ones to the stochastically decreasing families. However, the same is not true for the Sterne method: for example, in case of the geometric distribution, (5) has no solution, and (4) leads to one-sided tests and CIs.
The following pair of Lemmas on the shape and maxima of the continuous parts of the p value functions is important for numerical calculations.
Lemma 3 Let T be distributed as either Bin(n, θ), or Poi(θ), and let f_k(θ) stand for f^S_k(θ), or f^B_k(θ). When f_k(θ) is continuous and different from the unit constant on an interval I, then it is strictly reversely unimodal on I.
Proof From (6) and (7) it follows that under the assumptions of the Lemma, f_k(θ) = 1 − Σ_{i∈A} P_θ(T = i) on I, where A is a nonempty finite proper subset of the support of T, and the elements of A are consecutive integers. It is easy to show (see details in the "Appendix") that g(θ) = Σ_{i∈A} P_θ(T = i) is then strictly unimodal in the whole parameter space, and thus also on I.

As a consequence of the above Lemma, continuous parts of the p value functions (for the given methods and distributions) are, roughly speaking, maximized on the interval boundaries. However, since not only closed intervals have to be considered, the following Lemma states the fact in a more careful way.
Lemma 4 Let T be distributed as either Bin(n, θ), or Poi(θ), and let f_k(θ) stand for f^S_k(θ), or f^B_k(θ). When f_k(θ) is continuous on an interval I with bounds a < b, then

sup_{θ∈I} f_k(θ) = max{ lim_{θ→a+} f_k(θ), lim_{θ→b−} f_k(θ) }.

Unless f_k(θ) equals the unit constant on I, f_k(θ) is smaller than the supremum over I at every point of the interior of I.
Proof A trivial consequence of Lemma 3.

p value function modification
Formula (3) defines a modified p value f*_k(θ) at θ through suprema of f_k(·) over the potentially infinite sets {ξ; ξ ≤ θ} and {ξ; ξ ≥ θ}, and so it cannot be used directly as a calculation recipe. However, for the distributions and methods of our interest, the modified p value calculation may be reduced, as shown below, to taking the maximum of two numbers.
Proposition 1 Let T be distributed as either Bin(n, θ), or Poi(θ), and let f_k(θ) and σ_ij stand either for f^S_k(θ) and σ^S_ij, or for f^B_k(θ) and σ^B_ij. Then the modified p value function defined by (3) can be written equivalently as

f*_k(θ) = f_k(θ) when f_k(θ) = 1, or θ lies in the leftmost or the rightmost continuity segment of f_k,

f*_k(θ) = max{ f_k(θ), f_k(σ) } otherwise,

where σ is the discontinuity point bounding the continuity segment of θ on the side away from the region where f_k attains 1.
The proof of the Proposition is elementary in case of the Blaker method, but in case of the Sterne method it is harder and based on the following Lemma.
Lemma 5 Let T be distributed as either Bin(n, θ), or Poi(θ). Let f^{S−}_k(θ) denote the smaller of the one-sided limits of f^S_k at a discontinuity point θ. Let θ1, θ2 be a pair of discontinuity points of f^S_k(θ) such that either both lie to the right of the region where f^S_k attains 1 and θ1 < θ2, or both lie to the left of that region and θ1 > θ2. Then f^S_k(θ2) ≤ f^{S−}_k(θ1).

Proof See the "Appendix".

Proof of Proposition 1
We may assume that σ = σ_ki, i > k; the case σ = σ_ik, i < k, is analogous. Due to Lemma 4,

sup_{θ≤ξ≤σ} f_k(ξ) = max{ f_k(θ), f_k(σ) },

and all that remains to be proved is that f_k(ξ) < f_k(σ) for ξ > σ. In case of the Sterne method, this follows from Lemmas 4 and 5: due to Lemma 4, the supremum of f^S_k(ξ) over {ξ; ξ > σ} equals one of the one-sided limits of f^S_k(ξ) at the boundaries of the continuous segments to the right of σ. Due to Lemma 5 (and the fact that there is a jump downwards at σ), however, all these are smaller than f^S_k(σ). In case of the Blaker method, obviously f^B_k(ξ) ≤ 2P_ξ(T ≤ k) < 2P_σ(T ≤ k) = f^B_k(σ) for ξ > σ. Thus, the proof of the Proposition is complete.
Due to Proposition 1, a common framework of the modified p value calculation for all four cases of interest (combinations of method and distribution) can be outlined as follows:

1. Input data: θ0 (hypothetical parameter value), k (number of successes), and in case of the binomial distribution n (number of trials).
2. Identify the continuous segment of f_k(θ) where θ0 belongs.
3. Calculate f_k(θ0).
4. If f_k(θ0) = 1 or θ0 belongs to the leftmost or rightmost segment, output f_k(θ0) and finish. Otherwise continue.
5. Put σ equal to the discontinuity point bounding the segment of θ0 on the side away from the region where f_k attains 1, and calculate f_k(σ).
6. Output max{f_k(θ0), f_k(σ)} and finish.

Remarks:
- In case of the Sterne method, the discontinuity points σ_ik and σ_ki can be calculated algebraically. Thus the easiest (though not the only) way of performing Step 2 is to calculate these points and compare θ0 with them.
- In case of the Blaker method, a point σ_ij can only be calculated numerically, as the root of the equation P_θ(T ≤ i) = P_θ(T ≥ j) in θ. Thus, it appears better to perform Step 2 by determining the ordinal number of the continuous segment first, and only then to calculate numerically the point σ from Step 5. For instance, when σ_ik ≤ θ0 < σ_{i+1,k}, index i can be found as y − 1, where y is the α-quantile of the corresponding distribution (Bin(n, θ0), or Poi(θ0)), and α = P_θ0(T ≥ k).
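For the Sterne method and the binomial distribution, the whole procedure fits in a few lines of Python. The sketch below is our own rendering of the steps above (it is not the BlakerCI code): the discontinuity points come from the closed-form solution of P_θ(T = i) = P_θ(T = k), and the value of f_k at a discontinuity, i.e. the higher one-sided limit, is approximated by evaluating just beside the point.

```python
from math import comb

def binom_pmf(n, theta, k):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def sterne_p(n, k, theta):
    pk = binom_pmf(n, theta, k)
    return sum(binom_pmf(n, theta, i) for i in range(n + 1)
               if binom_pmf(n, theta, i) <= pk * (1 + 1e-12))

def sterne_jumps(n, k):
    # discontinuity points of f_k: solutions of P(T = i) = P(T = k), i != k
    pts = []
    for i in range(n + 1):
        if i != k:
            a, b = min(i, k), max(i, k)
            r = (comb(n, a) / comb(n, b)) ** (1.0 / (b - a))
            pts.append(r / (1.0 + r))
    return sorted(pts)

def modified_sterne_p(n, k, theta0):
    p0 = sterne_p(n, k, theta0)
    pts = sterne_jumps(n, k)
    # Steps 2-4: leftmost/rightmost segment, or f_k already equal to 1
    if p0 >= 1.0 or theta0 <= pts[0] or theta0 >= pts[-1]:
        return min(p0, 1.0)
    # Step 5: segment boundary on the side away from the MLE k/n
    if theta0 < k / n:
        sigma = max(t for t in pts if t <= theta0)
    else:
        sigma = min(t for t in pts if t >= theta0)
    # Step 6: higher one-sided limit at sigma, then the maximum
    f_sigma = max(sterne_p(n, k, sigma - 1e-9), sterne_p(n, k, sigma + 1e-9))
    return max(p0, f_sigma)
```

Applied to the example of Sect. 2 (k = 8, n = 100, θ0 = 0.15), the original p value stays below 0.05 while the modified one exceeds it, in accordance with 0.15 lying inside the 95% CI.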
Figure 3 illustrates the relationship between the original and modified Sterne p value function using the same example as was shown in Fig. 2.

Confidence bound calculation

Lemma 6 Let T be distributed as either Bin(n, θ), or Poi(θ), and let k > 0. Let θ1 be the Clopper-Pearson lower 1 − α confidence bound (so that 2P_θ1(T ≥ k) = α), and let θ2 be the smallest discontinuity point σ^B_ik of f^B_k satisfying θ2 ≥ θ1. Then the lower limit of the level 1 − α Blaker confidence interval lies in [θ1, θ2].

Proof Let us consider the case of the lower confidence limit (the other case is analogous), and assume that k > 0 (case k = 0 is trivial). Clearly, f^B_k(θ) ≤ 2P_θ(T ≥ k) for all θ. Equality holds at the discontinuity points σ^B_ik, i < k (and so for θ = θ2). The right-hand side is a strictly increasing function of θ, and equals α for θ = θ1 (since θ1 is the Clopper-Pearson bound). Consequently, f^B_k(θ) < α for θ < θ1, whereas f^B_k(θ2) = 2P_θ2(T ≥ k) ≥ α.

The search for the lower confidence bound proceeds as follows.

1. Input parameters: α (1 − confidence level), k (number of successes), in case of the binomial distribution n (number of trials), ε (numerical tolerance).
2. If k = 0, return 0 and finish. Otherwise continue.
3. Calculate θ1 as the Clopper-Pearson lower 1 − α confidence bound. Remark: A single call of the beta distribution quantile function (binomial case), or of the chi-square quantile function (Poisson case).
4. Find i as the smallest j such that θ1 < σ^B_jk. Remark: Calculated as the α1-quantile of Bin(n, θ1) (binomial case) or Poi(θ1) (Poisson case), where α1 = P_θ1(T ≥ k) = α/2.
5. Initialize θ_low as θ1, and θ_upp as θ_ML, the maximum likelihood estimate of θ (i.e. k/n for the binomial, and k for the Poisson distribution). Remark: By Lemma 6, the confidence bound must lie in the initial search interval; for the legitimacy of setting θ_upp = θ_ML, see Neumann (1966, 1970) for a proof.
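A complete runnable sketch of the lower-bound search, in Python rather than R, is given below. It follows the bracket of Lemma 6 and the interval-halving idea, but it is our own simplified completion (the actual BlakerCI implementation differs in details): the Clopper-Pearson bound and the discontinuity points σ^B_ik are all found by bisection, and the monotone function that the final bisection acts on is the running supremum of f^B_k from θ1, evaluated via Lemmas 3 and 4 at segment endpoints only.

```python
from itertools import accumulate
from math import comb

def binom_pmfs(n, theta):
    return [comb(n, i) * theta**i * (1 - theta)**(n - i) for i in range(n + 1)]

def blaker_p(n, k, theta):
    pmf = binom_pmfs(n, theta)
    low = list(accumulate(pmf))                 # P(T <= i)
    upp = list(accumulate(pmf[::-1]))[::-1]     # P(T >= i)
    gam = [min(a, b) for a, b in zip(low, upp)]
    return sum(pmf[i] for i in range(n + 1) if gam[i] <= gam[k] * (1 + 1e-12))

def bisect(pred, lo, hi, iters=80):
    # smallest point where the monotone predicate switches to True
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pred(mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def blaker_lower_limit(n, k, alpha=0.05):
    if k == 0:
        return 0.0
    upper_tail = lambda t: sum(binom_pmfs(n, t)[k:])          # P(T >= k)
    lower_tail = lambda t, i: sum(binom_pmfs(n, t)[:i + 1])   # P(T <= i)
    # Step 3: Clopper-Pearson lower bound (2 P(T >= k) = alpha there)
    theta1 = bisect(lambda t: upper_tail(t) > alpha / 2, 0.0, k / n)
    # discontinuity points sigma_ik: P(T <= i) = P(T >= k), i < k (Lemma 2)
    sigs = [bisect(lambda t, i=i: upper_tail(t) > lower_tail(t, i), 0.0, 1.0)
            for i in range(k)]
    def sup_f(theta):
        # sup of f^B_k over [theta1, theta]: by Lemmas 3-4 it is attained
        # at the interval ends or at the enclosed discontinuity points
        vals = [blaker_p(n, k, theta1), blaker_p(n, k, theta)]
        vals += [blaker_p(n, k, s) for s in sigs if theta1 <= s <= theta]
        return max(vals)
    # interval halving on the nondecreasing running supremum
    return bisect(lambda t: sup_f(t) > alpha, theta1, k / n)
```

For k = 8, n = 100 the returned limit lies above the Clopper-Pearson lower bound, in accordance with the inclusion of the Blaker CI in the Clopper-Pearson CI.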
The above algorithms for the binomial and Poisson distributions are implemented in R package BlakerCI.

Discussion
Beyond the test modification eliminating inconsistencies between tests and CIs, we presented algorithms for the calculation of modified p values and confidence bounds. Our algorithms exhibit considerable improvements of speed and precision compared to those by Blaker (2000) and Reiczigel (2003), where the p value function is evaluated on an equidistant grid in the parameter space in order to find the smallest and greatest values not rejected by the test at level α. The fixed-grid algorithms clearly have a poor precision-speed tradeoff and, moreover, the resolution of the grid does not imply the same precision for the CI endpoints (see Klaschka (2010) for details). Fay (2010a, Online supplementary material) was aware of the deficiencies of the fixed-grid algorithms, and designed a more sophisticated one for the R package exactci. Each of the confidence limits is localized in an interval of length ε (an input parameter) by stepwise partitioning of the parameter space.
The algorithm has a much better precision-speed tradeoff than the fixed-grid algorithms, but it is still much slower than the algorithms of Sect. 3.4 (for speed comparisons in the Blaker case see Klaschka 2010). The reason lies in ignoring the facts stated by Lemmas 3, 4, and 6. Furthermore, the algorithm suffers from precision problems in case of the Blaker method. Namely, although the discontinuity points of the p value function are calculated numerically with a given precision ε, an error smaller than ε in a discontinuity point may lead to a much larger error in the confidence limit.
An algorithm for Blaker's confidence limits similar to that of Sect. 3.4 was proposed by Lecoutre and Poitevineau (2016). It is based on properties of the p value functions analogous to those mentioned in Lemmas 3 and 6, and uses interval halving within the interval bounded by the Clopper-Pearson confidence limit on one side, and a discontinuity point of the p value function on the other side. The algorithm applies to more distributions than the one in this paper. On the other hand, the Sterne case is not covered.
The reason why inconsistencies between tests and CIs persist is that a non-connected confidence set is incontestably annoying, whereas when testing a hypothesis like θ = θ0, one does not feel the need to test other values in the neighbourhood of θ0 at the same time. Thus, rejecting a parameter value which lies between two non-rejected ones is not at all conspicuous, so in most cases it remains unnoticed. Moreover, the original p value functions are given by an easy-to-calculate, comprehensible formula, and modifying it (which may require, as shown in Sect. 3.3, the application of numerical methods) represents some inconvenience.
Of course, a more efficient way of enforcing our proposal would consist in convincing software developers rather than users. In this respect, however, we do not expect an immediate victory. Even Fay (2010a, b), who addressed a more frequent and much less subtle kind of inference inconsistencies, does not seem to have been generally heeded so far. For instance, poisson.test(5, r = 1.8) in R still results, nowadays, in a (Sterne test) p value of 0.036 but a 95% (Clopper-Pearson) CI of (1.6, 11.7), as in Fay (2010b).
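The discrepancy quoted above can be reproduced outside R as well; here is a pure-Python check of ours (simple helper functions, not the implementation behind poisson.test; the Poisson support is truncated at a cutoff that is negligible for the parameter values involved):

```python
from math import exp

def pois_pmfs(lam, cutoff=100):
    # P(T = 0), ..., P(T = cutoff - 1), computed iteratively
    p, out = exp(-lam), []
    for i in range(cutoff):
        out.append(p)
        p *= lam / (i + 1)
    return out

def pois_sterne_p(k, lam, cutoff=100):
    # Sterne p value: outcomes no more likely than k
    pmf = pois_pmfs(lam, cutoff)
    return sum(q for q in pmf if q <= pmf[k] * (1 + 1e-12))

def pois_cp_ci(k, alpha=0.05, hi=1000.0):
    # Clopper-Pearson interval by bisection on the tail conditions
    def solve(pred, lo, hi):
        for _ in range(200):
            mid = (lo + hi) / 2
            if pred(mid):
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2
    lower_tail = lambda lam: sum(pois_pmfs(lam, k + 1))    # P(T <= k)
    upper_tail = lambda lam: 1.0 - sum(pois_pmfs(lam, k))  # P(T >= k)
    lower = 0.0 if k == 0 else solve(lambda l: upper_tail(l) > alpha / 2,
                                     0.0, float(k))
    upper = solve(lambda l: lower_tail(l) < alpha / 2, float(k), hi)
    return lower, upper
```

With k = 5 and λ0 = 1.8 the Sterne p value is about 0.036 (rejection at the 5% level), while the 95% Clopper-Pearson CI, roughly (1.62, 11.67), contains 1.8, exactly the mismatch reported for poisson.test.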
We have to note that there exists yet another important kind of inconsistency pertinent to the methods that we are dealing with in this paper. Vos and Hudson (2008) show that, when dealing with binomially distributed data, some methods, including Sterne's and Blaker's, generate conflicts between inferences from samples of different sizes. For instance, when testing H0: θ = 0.2 and observing 2 successes out of 33 trials, Sterne's test results in a p value of 0.0495. One would then intuitively expect that with 2 successes out of 34 trials, the same test should reject the same hypothesis "even more resolutely". The test, however, yields 0.0504 as the p value. Nonetheless, since our opinions on the problem and its possible solutions diverge, we do not address it here, and it is left as the topic of another paper.

Appendix

Proof (Unimodality claim from the proof of Lemma 3, binomial case) Let g(θ) = Σ_{i=x}^{y} b(i; n, θ), where b(i; n, θ) denotes the Bin(n, θ) probability function and 0 ≤ x ≤ y ≤ n. For x = 0 or y = n, g(θ) is strictly monotone, hence strictly unimodal. Otherwise g'(θ) = n[b(x − 1; n − 1, θ) − b(y; n − 1, θ)], and since the ratio b(y; n − 1, θ)/b(x − 1; n − 1, θ) is, as a function of θ, a continuous function strictly increasing from 0 to +∞, the derivative of g(θ) vanishes at a single θ* ∈ (0, 1), where b(x − 1; n − 1, θ*) = b(y; n − 1, θ*). Moreover, the derivative is positive or negative on [0, θ*] or [θ*, 1], resp. Function g(θ) is thus strictly increasing for θ ≤ θ* and strictly decreasing for θ ≥ θ*, which is evidently sufficient for strict unimodality. This completes the proof for the binomial case.
The Poisson case is analogous: Let g(θ) = Σ_{i=x}^{y} π(i, θ), where π(i, θ) denotes the Poi(θ) probability function and 0 ≤ x ≤ y < ∞. We have to prove that g(θ) is strictly unimodal on [0, +∞). For x = 0, function g(θ) is clearly strictly decreasing in the whole domain. Otherwise, since for i ≥ 1, dπ(i, θ)/dθ = π(i − 1, θ) − π(i, θ), we have g'(θ) = π(x − 1, θ) − π(y, θ), and since the ratio π(y, θ)/π(x − 1, θ) is a continuous function of θ strictly increasing from 0 to +∞, the derivative of g(θ) vanishes and changes its sign from positive to negative at a single θ* ∈ (0, +∞), where π(x − 1, θ*) = π(y, θ*). This completes the Poisson case, and the whole proof.
Proof (Proof of Lemma 5) Clearly, it is sufficient to deal with the case k < i < j (the other case is analogous), and assume j = i + 1.

Fig. 1
Fig. 1 A portion of Sterne's p value function for a binomial proportion given k = 8 successes out of n = 100 trials. Testing H0: θ = 0.15 results in rejection, whereas the 95% CI contains the value 0.15

Fig. 2
Fig. 2 Sterne p value function for a binomial proportion given k = 2 successes out of n = 15 trials. The locations of the jumps σ_{i,k} are the parameter values where P_{σ_{i,k}}(T = i) = P_{σ_{i,k}}(T = k) for some i ≠ k, so that the critical region of the test, {i; P_θ(T = i) ≤ P_θ(T = k)}, changes