The cost of using exact confidence intervals for a binomial proportion

When computing a confidence interval for a binomial proportion p, one must choose between an exact interval, which has coverage probability of at least 1 − α for all values of p, and a shorter approximate interval, which may have lower coverage for some p but on average has coverage equal to 1 − α. We investigate the cost of using the exact one- and two-sided Clopper-Pearson confidence intervals rather than shorter approximate intervals, first in terms of increased expected length and then in terms of the increase in sample size required to obtain a desired expected length. Using asymptotic expansions, we also give a closed-form formula for determining the sample size for the exact Clopper-Pearson methods. For two-sided intervals, our investigation reveals an interesting connection between the frequentist Clopper-Pearson interval and Bayesian intervals based on noninformative priors.


Introduction
Inference for a binomial proportion p is one of the most commonly encountered statistical problems, with important applications in areas such as clinical trials, risk analysis and quality control. Consequently, a large number of two-sided confidence intervals and one-sided confidence bounds for p have been proposed by different authors. These are of two different types: exact methods, that have a coverage at least equal to 1 − α for all p ∈ (0, 1), and approximate methods, that may have coverage less than 1 − α for some values of p, but that have a coverage that in some sense is approximately equal to 1 − α.
Research on confidence intervals and bounds for a binomial proportion has mostly focused on approximate intervals. In the methodological literature, exact intervals have often been deemed to be too conservative (Agresti & Coull, 1998; Brown et al., 2001; Newcombe & Nurminen, 2011), as they tend to be quite wide and have actual coverage levels that often are noticeably greater than 1 − α. Nevertheless, the use of exact intervals for proportions is abundant among practitioners: see e.g. Abramson et al. (2013), Ibrahim et al. (2013), Ward et al. (2013) and Sullivan et al. (2013) for some recent examples. By far the most widely used exact interval is the Clopper-Pearson interval, introduced by Clopper & Pearson (1934).
The benefit of using an exact interval is obvious: one does not risk that the actual coverage falls below 1 − α. For this reason, some regulatory authorities require that exact intervals be used. Moreover, the binomial distribution is unusual in that we often can be sure that it is an accurate description of that which we are modelling and not just an approximation to the true distribution, as is often the case when continuous distributions are used for modelling. In such a situation, using an exact method seems reasonable. But there are also costs associated with the use of such an interval. When choosing between approximate and exact confidence methods, there is a trade-off in that exact intervals and bounds by construction are wider than the best approximate intervals, or equivalently, require a larger sample size in order to obtain a certain expected length. If one is unwilling to accept intervals and bounds with undercoverage for some values of p, there is a cost to pay in terms of expected length or required sample size. This paper seeks to quantify these costs.
In planned experiments, it is always important to determine a suitable sample size. Sample size determination for binomial confidence intervals has received much attention in recent years (Katsis, 2001; Piegorsch, 2004; Krishnamoorthy & Peng, 2007; M'Lan et al., 2008; Gonçalves et al., 2012; Wei & Hutson, 2013), with different authors studying different intervals and methods for sample size calculations, the latter often of a computer-intensive nature. The first main contribution of this paper is closed-form formulas for computing the sample size required for the Clopper-Pearson methods to obtain a given expected length. This eliminates the need for computer-intensive methods for computing sample sizes and gives a better understanding of how the desired length and the parameters p and α affect the sample size.
The second main contribution is closed-form expressions for the excess length and increase in required sample size that comes from using the exact Clopper-Pearson methods instead of approximate methods. We obtain these expressions by deriving asymptotic expansions for the exact Clopper-Pearson methods, extending the work of Brown et al. (2002), Cai (2005) and Staicu (2009) on the asymptotics of approximate binomial confidence methods to exact intervals and bounds.
The rest of the paper is organised as follows. In Section 2 we introduce the Clopper-Pearson methods along with other exact and approximate confidence methods. In Section 3 we give an asymptotic expression for the expected length of the Clopper-Pearson interval. This allows us to give a formula for computing the sample size, and to determine the cost of using an exact interval rather than an approximate interval, in terms of expected length and sample size. In Section 4 we discuss the one-sided Clopper-Pearson bound and give expressions for its expected distance to p and the cost of using an exact bound. In Section 5 we discuss costs associated with approximate intervals and state some conclusions. All proofs and technical details are deferred to an appendix.
2 Binomial confidence methods

The Clopper-Pearson interval and bounds
The two-sided Clopper-Pearson interval for a proportion p is an inversion of the equal-tailed binomial test: the interval contains all values of p that are not rejected by the test at significance level α. Given an observation X = x, the lower limit is thus given by the value of p_L such that

P(X ≥ x | p_L) = \sum_{k=x}^{n} \binom{n}{k} p_L^k (1 - p_L)^{n-k} = α/2,   (1)

and the upper limit is given by the p_U such that

P(X ≤ x | p_U) = \sum_{k=0}^{x} \binom{n}{k} p_U^k (1 - p_U)^{n-k} = α/2.   (2)

As is well known, the computation of p_L and p_U is simplified by the following equality from Johnson et al. (2005). Let f(t, a, b) be the density function of a Beta(a, b) random variable. Then

\sum_{k=x}^{n} \binom{n}{k} p^k (1 - p)^{n-k} = \int_0^p f(t, x, n - x + 1) dt.   (3)

When (3) is plugged into (1) and (2), the problem of finding p_L and p_U reduces to inverting the distribution functions of two beta distributions. Consequently, the endpoints of the Clopper-Pearson interval are given by quantiles of beta distributions: with B(α, a, b) denoting the α-quantile of the Beta(a, b) distribution,

(p_L, p_U) = (B(α/2, X, n - X + 1), B(1 - α/2, X + 1, n - X)).   (4)

When X is either 0 or n, closed-form expressions for the interval bounds are available: when X = 0 the interval is (0, 1 − (α/2)^{1/n}) and when X = n it is ((α/2)^{1/n}, 1). For other values of X, (4) must be evaluated numerically. The interval is implemented in most statistical software packages; it can for instance be found in the PropCIs package in R and computed using the PROC FREQ command in SAS. Some authors (Agresti & Coull, 1998; Brown et al., 2001) have argued that when choosing between confidence intervals, it is often preferable to use an interval with a simple closed-form formula rather than one that requires numerical evaluation, as the former is easier to present and to interpret. Next, we give asymptotic expansions of p_L and p_U that serve as good approximations when n ≥ 40 and can be used if a closed-form formula for the Clopper-Pearson interval is desired. As an example, when n = 50 the upper bound is accurate to two decimal places for X ∉ {0, 1, 2, n}.
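As a quick illustration, the beta-quantile representation (4) takes only a few lines to evaluate. A minimal sketch, assuming scipy is available; the function name is ours:

```python
# Sketch of the Clopper-Pearson interval computed from beta quantiles.
# The helper name clopper_pearson is ours; scipy is assumed.
from scipy.stats import beta

def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided 1 - alpha confidence interval for a binomial proportion."""
    lower = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    upper = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lower, upper
```

For X = 0 this reproduces the closed form (0, 1 − (α/2)^{1/n}) mentioned above.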
Theorem 1. Let X ∈ {1, 2, . . . , n − 1} be fixed. Let p̂ = X/n, q̂ = 1 − p̂ and let z_{α/2} be the upper α/2 quantile of the standard normal distribution. Then the bounds of the Clopper-Pearson interval admit asymptotic expansions in p̂, q̂ and z_{α/2} that are accurate up to O(n^{−3/2}).
Similar in construction to the two-sided interval, the one-sided Clopper-Pearson bounds are obtained by inverting one-sided binomial tests. Thus the 1 − α Clopper-Pearson upper bound is given by the p_U such that

P(X ≤ x | p_U) = \sum_{k=0}^{x} \binom{n}{k} p_U^k (1 - p_U)^{n-k} = α.

In the following, we limit our study to upper bounds. For symmetry reasons, the results are equally valid for lower bounds: for the bounds under consideration, a lower bound p_L for p is equivalent to an upper bound for q, since q_U = 1 − p_L. If a closed-form expression for p_U is desired, it can be obtained as an asymptotic expansion by replacing α/2 with α in Theorem 1 above.
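The one-sided bounds can be sketched the same way, with α/2 replaced by α, and the symmetry q_U = 1 − p_L gives the lower bound for free. Function names are ours; scipy is assumed:

```python
# One-sided 1 - alpha Clopper-Pearson bounds; as in the text, the upper
# bound is the two-sided formula with alpha/2 replaced by alpha.
from scipy.stats import beta

def clopper_pearson_upper(x, n, alpha=0.05):
    """Exact 1 - alpha upper confidence bound for p."""
    return 1.0 if x == n else beta.ppf(1 - alpha, x + 1, n - x)

def clopper_pearson_lower(x, n, alpha=0.05):
    """Exact 1 - alpha lower bound, via the symmetry q_U = 1 - p_L."""
    return 1.0 - clopper_pearson_upper(n - x, n, alpha)
```

By the reflection property of the beta distribution, the lower bound so obtained equals the beta quantile B(α, X, n − X + 1) directly.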

Other exact intervals
In much of the medical literature, as well as the rest of the present paper, the Clopper-Pearson interval is referred to as the exact confidence interval for a binomial proportion. Despite this terminology, several other exact intervals have been proposed throughout the years. These alternative intervals do not admit closed-form expressions and are, to varying extents, computer-intensive.
There are several reasons as to why the Clopper-Pearson interval is the most widely used exact interval. One is simply tradition and availability: it has found its way into classic statistical textbooks and has been implemented in almost all statistical software packages. Compared to the computer-intensive alternatives, the Clopper-Pearson interval is also considerably simpler computationally. Finally, it remains a natural choice in that it is the inversion of the well-known equal-tailed binomial test.
In the two-sided case, however, there is room for improvement, at least if one is willing to let go of some natural properties of confidence intervals. Other exact intervals have been designed to be shorter than the Clopper-Pearson interval, by inverting two-sided tests that need not be equal-tailed. Moreover, the coverage probabilities of these intervals often fluctuate less from 1−α than does the coverage of the Clopper-Pearson interval.
The Blyth-Still-Casella interval (Blyth & Still, 1983;Casella, 1986) is guaranteed to be the shortest exact interval, but has the odd property that it is not nested, in the sense that the 90 % interval need not be contained in the 95 % interval (Blaker, 2000, Theorem 2). This is also true for the intervals of Crow (1956).
The Sterne (1954) procedure yields nested intervals that are shorter than the Clopper-Pearson interval, but will in some cases result in two separate intervals rather than one connected interval. Blaker (2000) proposed a nested exact interval that, while wider than the Blyth-Still-Casella interval, always is contained in the Clopper-Pearson interval. It is however sometimes a union of disjoint intervals and its upper bound is decreasing but not strictly decreasing in α when n and X are fixed (Vos & Hudson, 2008). The interval based on the inverted exact likelihood ratio test suffers from similar problems (Vos & Hudson, 2008).
The Clopper-Pearson interval, in contrast, is nested, is always a connected set and has bounds that are strictly monotone in α. While it is possible to obtain shorter exact confidence intervals for a binomial proportion, this seems to be associated with the loss of nestedness, connectedness or monotonicity. As we consider these properties to be of importance, we will only include the Clopper-Pearson interval and bounds in the following sections, and will out of convenience refer to them as the exact methods.
Implementations of some of the alternative exact intervals are readily available. The Blyth-Still-Casella interval has been implemented in StatXact and Blaker (2000) gave an S-PLUS function for his interval. A more efficient implementation of Blaker's interval is found in the R package BlakerCI (Klaschka, 2010).

Approximate confidence intervals and bounds
Throughout the text, the Clopper-Pearson methods will be compared to several well-known approximate methods. These are described below, along with the commonly used Wald interval. For more thorough reviews of binomial confidence methods, see Newcombe (2012), Cai (2005) and Brown et al. (2001, 2002). In the descriptions below, p̂ = X/n is the sample proportion, q̂ = 1 − p̂ and z_{α/2} is the 100(1 − α/2)th percentile of the standard normal distribution.
The Wald interval. Inversion of the large sample test |(p̂ − p)(p̂q̂/n)^{−1/2}| ≤ z_{α/2} leads to the Wald interval, which is presented in virtually every introductory statistics course: p̂ ± z_{α/2}(p̂q̂/n)^{1/2}. The Wald interval suffers from particularly erratic coverage properties, and cannot be recommended for general use (Brown et al., 2001; Newcombe, 2012).
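A minimal sketch of the Wald interval, using only the standard library (the function name is ours). Note the well-known pathology that the interval degenerates to a single point when X = 0 or X = n:

```python
# The Wald interval, included for illustration only given its poor coverage.
# NormalDist is from the Python standard library; the function name is ours.
from statistics import NormalDist

def wald_interval(x, n, alpha=0.05):
    """Large-sample Wald interval: p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    half = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half, p_hat + half
```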
The Wilson score interval. Like the Wald interval, the Wilson (1927) score interval is based on an inversion of the large sample normal test |(p̂ − p)/d(p)| ≤ z_{α/2}, where d(p) is the standard error of p̂. Unlike the Wald interval, however, the inversion is obtained using the null standard error (p(1 − p)/n)^{1/2} instead of the sample standard error. The solution of the resulting quadratic equation leads to the confidence interval

\frac{X + z_{α/2}^2/2}{n + z_{α/2}^2} ± \frac{z_{α/2}}{n + z_{α/2}^2}\left(np̂q̂ + z_{α/2}^2/4\right)^{1/2}.

The Wilson score interval has favourable coverage and length properties and is often recommended for general use (Brown et al., 2001; Newcombe, 2012).
The Agresti-Coull interval. For 95% nominal coverage, Agresti & Coull (1998) proposed the use of the Wald interval with two successes and two failures added, i.e. with n replaced by n + 4 and X replaced by X + 2. More generally, let ñ = n + z_{α/2}^2, X̃ = X + z_{α/2}^2/2, p̃ = X̃/ñ and q̃ = 1 − p̃. Brown et al. (2001) dubbed the interval p̃ ± z_{α/2}(p̃q̃/ñ)^{1/2} the Agresti-Coull interval. It has performance close to that of the Wilson interval, but is somewhat simpler to use.
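Both intervals can be sketched in a few lines (function names ours, standard library only). The two share the same midpoint (X + z²/2)/(n + z²), with the Agresti-Coull interval slightly wider:

```python
# Sketches of the Wilson score and Agresti-Coull intervals described above.
# Function names are ours; NormalDist is from the standard library.
from statistics import NormalDist

def wilson_interval(x, n, alpha=0.05):
    """Wilson score interval: inversion of the score test."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    centre = (x + z * z / 2) / (n + z * z)
    half = (z / (n + z * z)) * (n * p_hat * (1 - p_hat) + z * z / 4) ** 0.5
    return centre - half, centre + half

def agresti_coull_interval(x, n, alpha=0.05):
    """Wald-type interval after adding z^2/2 successes and z^2/2 failures."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_tilde = n + z * z
    p_tilde = (x + z * z / 2) / n_tilde
    half = z * (p_tilde * (1 - p_tilde) / n_tilde) ** 0.5
    return p_tilde - half, p_tilde + half
```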
Bayesian Beta intervals and bounds. Let B(α, a, b) denote the α-quantile of the Beta(a, b) distribution. An equal-tailed Bayesian credible interval based on the Beta(a, b) prior is given by (B(α/2, X + a, n − X + b), B(1 − α/2, X + a, n − X + b)). Similarly, an upper bound is given by B(1 − α, X + a, n − X + b). As these methods make use of beta quantiles, they are algebraically very similar to the Clopper-Pearson interval. This connection is discussed further in Section 3.4.
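A sketch of the equal-tailed Beta-prior interval; with a = b = 1/2 it is the Jeffreys interval. The function name is ours and scipy is assumed:

```python
# Equal-tailed Bayesian credible interval under a Beta(a, b) prior.
# With a = b = 0.5 this is the Jeffreys interval. scipy assumed.
from scipy.stats import beta

def beta_prior_interval(x, n, a=0.5, b=0.5, alpha=0.05):
    """Equal-tailed 1 - alpha credible interval for p under Beta(a, b)."""
    return (beta.ppf(alpha / 2, x + a, n - x + b),
            beta.ppf(1 - alpha / 2, x + a, n - x + b))
```

A quick check of the algebraic similarity to the Clopper-Pearson interval: the Jeffreys interval is always contained in it, since its beta parameters sit between those of the two Clopper-Pearson quantiles.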
The second-order correct bound. Cai (2005) proposed a coverage-corrected version of the one-sided Wald bound, based on second-order asymptotic expansions. Cai (2005) recommended it for general use and gave a closed-form expression for the bound.
The modified loglikelihood root bound. Staicu (2009) studied the bound obtained by inverting the modified loglikelihood root test and found it to have very favourable coverage and length properties. It cannot be expressed in a closed form, but Staicu (2009) gave asymptotic expansions that can be used as approximations.

Expected length
Let q = 1 − p and let L_{CP} = p_U − p_L denote the length of the Clopper-Pearson interval. Next, we present an asymptotic expression for the expectation of L_{CP}.
The expansion (6) is compared to the actual expected length in Figure 1. Even for small values of n, the approximation comes quite close to the actual expected length over the entire parameter space. Having an expression for the expected length of the Clopper-Pearson interval allows us to evaluate its performance for different combinations of n, p and α.
When planning an experiment, this is extremely useful as it can be used to determine what sample size we need in order to achieve a desired expected length. Methods for determining sample size are discussed next.

Sample size determination
Several different criteria can be considered when determining sample size, as discussed e.g. by Gonçalves et al. (2012). We focus on a comparatively simple criterion: for a fixed confidence level 1 − α we wish to find the smallest sample size n such that the expected length of the confidence interval is some fixed value d. As the value of n will depend on p, we require that an initial guess p_0 for p is available.
Studying the Clopper-Pearson interval, Krishnamoorthy & Peng (2007) gave a first-order approximation of E(L_{CP}) in the form of beta quantiles and used it to numerically calculate the sample size required to obtain a desired expected length d. Ignoring the higher terms of the expansion (6), we obtain the second-order approximation E(L_{CP}) ≈ 2z_{α/2}n^{−1/2}(pq)^{1/2} + n^{−1}, which can be evaluated analytically. Given an initial guess p_0 for p, the equation 2z_{α/2}n^{−1/2}(p_0q_0)^{1/2} + n^{−1} = d has the solution

n = \frac{\left(z_{α/2}(p_0q_0)^{1/2} + (z_{α/2}^2 p_0q_0 + d)^{1/2}\right)^2}{d^2},   (7)

when rounded up to the nearest integer. This is a good approximation of the actual required sample size, with a small positive bias. At the 95% level it typically does not differ by more than 4 from the solution obtained by more complicated (and computer-intensive) exact numerical computations. For p close to 1/2, the Krishnamoorthy-Peng method is slightly more accurate, whereas for p close to 0 or 1, (7) gives a better approximation. In either case, both approximations are accurate enough for most applications. As an example, when p_0 = 0.05 and d = 0.05, the actual required sample size is 329, while our approximation yields n = 331, corresponding to an actual expected length of 0.0498. In comparison with exact methods or the Krishnamoorthy-Peng procedure, (7) offers greater computational ease without sacrificing much accuracy. It is likewise possible to solve the cubic equation that results from including the n^{−3/2}-term of (6), but the solution does not yield a simple formula and does not give substantially improved accuracy.
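The closed-form sample size formula is trivial to evaluate; a sketch with our function name, using only the standard library:

```python
# Closed-form sample size for the Clopper-Pearson interval: the solution of
# 2 z n^(-1/2) sqrt(p0 q0) + 1/n = d, rounded up. Function name is ours.
import math
from statistics import NormalDist

def cp_sample_size(d, p0, alpha=0.05):
    """Approximate smallest n giving expected CP length d at level 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    root = z * math.sqrt(p0 * (1 - p0))
    return math.ceil((root + math.sqrt(root * root + d)) ** 2 / d ** 2)
```

For p_0 = 0.05 and d = 0.05 this reproduces the n = 331 example from the text.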
A downside to this approach to sample size determination is that the initial guess p 0 may be quite wrong. This is particularly problematic if p is closer to 1/2 than is p 0 , in which case the calculated required sample size will be too small. As a safety measure, it is sometimes recommended to use the conservative guess p 0 = 1/2, which maximizes the required sample size. More often than not, however, this choice is needlessly conservative.
An alternative approach, with a Bayesian flavour, is to use a prior distribution for p when determining the sample size. Beta distributions constitute a flexible and analytically tractable class of priors for p. For p ∼ Beta(a, b), we have

E\left(2z_{α/2}n^{-1/2}(pq)^{1/2} + n^{-1}\right) = 2z_{α/2}n^{-1/2}R(a, b) + n^{-1},  where  R(a, b) = \frac{Γ(a + 1/2)Γ(b + 1/2)Γ(a + b)}{Γ(a)Γ(b)Γ(a + b + 1)},

so that the required sample size is obtained by replacing (p_0q_0)^{1/2} by R(a, b) in (7). When applying a frequentist procedure, the prior information about p is typically diffuse, indicating that a low-informative prior should be used so as not to bias the sample size determination. One example is the Jeffreys prior Beta(1/2, 1/2), which puts more probability mass close to 0 and 1 and yields R(1/2, 1/2) = 1/π. Other examples include the uniform Beta(1, 1) prior, for which we have R(1, 1) = π/8, and the Beta(2, 2) prior, which puts more mass close to 1/2, yielding R(2, 2) = 9π/64. The required sample size for different combinations of p and α is shown in Figure 2. It is decreasing in α, increasing in p when p < 0.5 and decreasing in p when p > 0.5.

Remark. In formulas similar to those above, some authors use d to denote the expected half-length, or error tolerance, of a confidence interval. This may be inappropriate in the binomial setting, since using the half-length might give the false impression that all confidence intervals are symmetric about the unbiased estimator p̂ = X/n. This is not the case for the Clopper-Pearson interval and most good approximate intervals, including those presented in Section 2.3. As an example, when n = 50 and p = 0.01, the expected length of the Clopper-Pearson interval is 0.044. Since the interval is boundary-respecting, most of its length will be placed above p. The expected length is very much an interesting quantity when determining sample size, but for binomial proportions it should not be interpreted in terms of error tolerances.
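The ratio R(a, b) is easy to evaluate via log-Gamma functions; a sketch, standard library only, with the function named after the text's R notation:

```python
# R(a, b) = E[(p(1-p))^(1/2)] for p ~ Beta(a, b), evaluated through
# log-Gamma functions for numerical stability. Standard library only.
import math

def R(a, b):
    """Prior expectation of sqrt(p * (1 - p)) under a Beta(a, b) prior."""
    lg = math.lgamma
    return math.exp(lg(a + 0.5) + lg(b + 0.5) + lg(a + b)
                    - lg(a) - lg(b) - lg(a + b + 1))
```

The special values quoted in the text (1/π, π/8 and 9π/64) come out directly.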

The cost of using the exact interval
Next, we will study the cost of using the exact Clopper-Pearson interval instead of an approximate interval. We will do so by comparing the exact interval to three of the approximate intervals described in Section 2.3: the Wilson score, Jeffreys and Agresti-Coull intervals. These intervals have been recommended as default intervals for a single proportion by several authors (Agresti & Coull, 1998; Brown et al., 2001; Newcombe, 2012).
First, we measure the cost in terms of increased expected length. By comparing the expansion in Theorem 2 to the expansions in Theorem 7 of Brown et al. (2002), we get the following expressions for how much the expected length of the confidence interval increases when the Clopper-Pearson interval is used instead of an approximate interval.
Corollary 1. The Clopper-Pearson interval is asymptotically wider than the approximate intervals described in Section 2.3. In particular, compared to the length L_J of the Jeffreys interval,

E(L_{CP}) − E(L_J) = n^{-1} + O(n^{-3/2}),   (8)

and if L_A denotes the length of the Wilson or Agresti-Coull interval,

E(L_{CP}) − E(L_A) = n^{-1} + O(n^{-3/2}).   (9)

Expanded versions of (9) for the different intervals, including the n^{−3/2}-terms, are given in the proof in the appendix. Up to O(n^{−3/2}), the increase in expected length is inversely proportional to n. Note that, up to O(n^{−3/2}), the increase does not depend on p or α. The cost of using an exact interval, in terms of expected length, is thus more or less constant for a fixed n. This is an interesting and somewhat unexpected fact, since the expected lengths of these confidence intervals are highly dependent on both p and α.
Next, we consider required sample size. As the Clopper-Pearson interval is wider than the approximate intervals, it naturally requires larger sample sizes to obtain a particular expected length d. Let n_{CP}(d, p, α) be the minimum sample size for which E_p(L_{CP}) ≤ d at the 1 − α level. Similarly, let n_J(d, p, α) be the minimum sample size for which the expected length of the Jeffreys interval is at most d under p at the 1 − α level.
As noted by Piegorsch (2004), the sample size for the Jeffreys interval is well approximated by n_J(d, p_0, α) = 4z_{α/2}^2 p_0q_0 d^{-2}. Comparing this to (7) without rounding, the increase in required sample size n_J^+(d, p_0, α) = n_{CP}(d, p_0, α) − n_J(d, p_0, α) can be approximated by

n_J^+(d, p_0, α) ≈ \frac{2z_{α/2}(p_0q_0)^{1/2}\left((z_{α/2}^2 p_0q_0 + d)^{1/2} - z_{α/2}(p_0q_0)^{1/2}\right) + d}{d^2}.   (10)

This approximation is quite accurate, generally differing by less than 1 when compared to the value for n_J^+ obtained using substantially more computer-intensive exact computations.
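The increase can be computed directly as the difference between the unrounded (7) and Piegorsch's approximation 4z²_{α/2}p_0q_0d^{−2}; a sketch with our function name, standard library only:

```python
# Extra observations needed by the Clopper-Pearson interval relative to
# the Jeffreys interval, as the difference between the two sample size
# approximations. Function name is ours.
import math
from statistics import NormalDist

def jeffreys_excess_n(d, p0, alpha=0.05):
    """Approximate increase in required sample size over the Jeffreys interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    root = z * math.sqrt(p0 * (1 - p0))
    n_cp = (root + math.sqrt(root * root + d)) ** 2 / d ** 2  # unrounded (7)
    n_j = 4 * root * root / d ** 2                            # Piegorsch's n_J
    return n_cp - n_j
```

At d = 0.05 the excess is close to 40 over a wide range of p_0, illustrating the insensitivity to p discussed in the text.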
Expression (10) is plotted as a function of d for three choices of p_0 in Figure 3. When shorter intervals are desired, the increase in required sample size can be substantial. When d = 0.05, for instance, n_J^+ is 40 for 0.05 ≤ p_0 ≤ 0.95. As was the case for the expected length, the increase n_J^+ is remarkably insensitive to p and α: there is no discernible difference when 0.05 ≤ p ≤ 0.95 and 0.001 ≤ α ≤ 0.2. The cost of using an exact interval instead of the Jeffreys interval is, in terms of required sample size, essentially constant for a fixed expected length d.
Moving on to the Wilson score interval, Piegorsch (2004) gave a closed-form formula n_{WS}(d, p_0, α) for its sample size. The increase n_{WS}^+(d, p_0, α) = n_{CP}(d, p_0, α) − n_{WS}(d, p_0, α) can thus be approximated by subtracting this formula from (7), yielding (11). The approximation is good when p_0 is not very small, typically not differing by more than 2 from the exact value. Similarly, Piegorsch (2004) gave the formula n_{AC}(d, p_0, α) = 4z_{α/2}^2 p_0q_0 d^{-2} − z_{α/2}^2 for the sample size of the Agresti-Coull interval. Consequently, the increase n_{AC}^+(d, p_0, α) = n_{CP}(d, p_0, α) − n_{AC}(d, p_0, α) can be approximated by adding z_{α/2}^2 to (10), yielding (12). The expressions (11) and (12) are plotted for some combinations of p and α in Figure 3. For the Agresti-Coull interval, the cost is more or less constant in p, but is sensitive to changes in α. For the Wilson score interval, the cost depends on both p and α.
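The Agresti-Coull comparison can be sketched likewise (function names ours; the Clopper-Pearson sample size is the unrounded version of (7), and the n_AC formula is the one attributed to Piegorsch (2004) in the text):

```python
# Sample size for the Agresti-Coull interval and the extra observations
# required by the Clopper-Pearson interval. Function names are ours.
import math
from statistics import NormalDist

def ac_sample_size(d, p0, alpha=0.05):
    """n_AC = 4 z^2 p0 q0 / d^2 - z^2 (Piegorsch, 2004)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 4 * z * z * p0 * (1 - p0) / d ** 2 - z * z

def cp_sample_size_unrounded(d, p0, alpha=0.05):
    """Second-order approximation for the Clopper-Pearson interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    root = z * math.sqrt(p0 * (1 - p0))
    return (root + math.sqrt(root * root + d)) ** 2 / d ** 2

def ac_excess(d, p0, alpha=0.05):
    """Increase in required sample size over the Agresti-Coull interval."""
    return cp_sample_size_unrounded(d, p0, alpha) - ac_sample_size(d, p0, alpha)
```

The excess over the Agresti-Coull interval exceeds the excess over the Jeffreys interval by exactly z², which is why it varies with α but barely with p.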

The exact frequentist interval and Bayesian credible intervals with noninformative priors
Equation (8) in Corollary 1 and the fact that (10) is so insensitive to p and α reveal a strong connection between the frequentist Clopper-Pearson interval and the Bayesian credible interval derived under the Jeffreys prior. In the light of these results, it seems natural to think of the Bayesian interval as a sort of continuity correction of the Clopper-Pearson interval, in which conservativeness is sacrificed in order to get a short interval. Attempts to connect the exact frequentist interval with Bayesian intervals have previously been made by Brown et al. (2001), who argued that the Jeffreys interval can be thought of as a continuity-corrected version of the Clopper-Pearson interval. Their argument comes from a comparison between the Jeffreys interval and the mid-p interval, which generally is considered to be a continuity-corrected Clopper-Pearson interval. However, the key step in their argument is their equation (17), which is incorrect; it relies on the false assumption that for two continuous functions f_1 and f_2, (f_1 + f_2)^{-1} = f_1^{-1} + f_2^{-1}.

Another natural noninformative Bayesian interval is that based on the uniform prior, Beta(1, 1). The Clopper-Pearson interval is essentially this interval after the prior information has been removed, a fact which we have not seen mentioned before in the literature. To see this, note that for a central Bayesian interval with prior Beta(a, b), a, b > 0, the lower bound is given by the beta quantile p_{L,B}(a, b, X, n) = B(α/2, X + a, n − X + b). The parameters a and b can be interpreted as additional successes and failures added to the data. For the uniform prior, a = b = 1. The lower bound of the Clopper-Pearson interval is similarly the beta quantile B(α/2, X, n − X + 1). When X ∉ {0, n} this can be written as B(α/2, (X − 1) + 1, (n − 1) − (X − 1) + 1) = p_{L,B}(1, 1, X − 1, n − 1), the lower bound of the Beta(1, 1) interval with one success and one failure removed.
Expressed in words, the lower bound of the Clopper-Pearson interval equals the lower bound of the Bayesian interval with the uniform prior after the prior information has been removed. Similarly, the upper bound is 1 − p_{L,B}(1, 1, n − X − 1, n − 1), i.e. 1 minus the lower bound for q under the uniform prior with one success and one failure removed. The Beta(1, 1) interval can thus be thought of as a shrinkage Clopper-Pearson interval.
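The identity for the lower bound is easy to check numerically (function names ours, scipy assumed):

```python
# Numerical check of the identity in the text: the Clopper-Pearson lower
# bound equals the uniform-prior Bayesian lower bound computed from the
# data with one success removed. scipy assumed; helper names are ours.
from scipy.stats import beta

def bayes_lower(a, b, x, n, alpha=0.05):
    """Lower limit of the equal-tailed Beta(a, b) credible interval."""
    return beta.ppf(alpha / 2, x + a, n - x + b)

def cp_lower(x, n, alpha=0.05):
    """Clopper-Pearson lower limit, B(alpha/2, X, n - X + 1)."""
    return beta.ppf(alpha / 2, x, n - x + 1)
```

For instance, cp_lower(10, 40) and bayes_lower(1, 1, 9, 39) coincide, as both reduce to the Beta(10, 31) quantile.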

One-sided bounds

Expected distance to the true proportion
For one-sided confidence bounds, it is not the expected length that is of interest, but how close the bound is to p. Let L U,CP = p U − p denote the distance from p U to p. The next theorem gives an asymptotic expansion for the expectation of L U,CP .
Theorem 3. As n → ∞, the expected distance to p for the 1 − α one-sided Clopper-Pearson upper bound is

E(L_{U,CP}) = n^{-1/2}z_α(pq)^{1/2} + (3n)^{-1}\left(2(1/2 - p)z_α^2 + 1 + q\right) + O(n^{-3/2}).   (13)

The expansion (13) is compared to the actual expected distance to p in Figure 4. Like the expansion for the expected length of the two-sided interval, (13) is close to the actual expected distance even for small n.

Sample size determination
The expressions we obtain in the one-sided case are not quite as simple as those in the two-sided case. Let d denote the desired expected distance to p and let p_0 be the initial guess for the value of p. Proceeding as before, using the second-order approximation E(L_{U,CP}) ≈ n^{-1/2}z_α(pq)^{1/2} + (3n)^{-1}(2(1/2 - p)z_α^2 + 1 + q) yields the required sample size after solving the resulting quadratic equation in n^{−1/2}. This approximation is very good when d is not too small. For smaller d it has a small negative bias: when α = 0.05 and p_0 = 1/2 the actual required sample size for d = 0.02 is n = 1738, whereas the above expression gives the approximation n = 1721, corresponding to a true expected distance of d = 0.0201. For most purposes, this will probably be a sufficiently accurate approximation.
As in the two-sided case, we may consider using a prior distribution for p, rather than a fixed p_0, to determine a reasonable sample size. The expectation of the second-order approximation with respect to a Beta(a, b) prior for p is given by (14). Note that this expression is undefined when a, b ≥ 2, limiting which priors we can use. When (14) is well-defined, a general formula for the required sample size can be obtained by equating (14) to d and solving for n, but the resulting expression is rather complicated. It is however readily evaluated for particular values of a and b; for the Jeffreys prior, for instance, a closed-form solution is available. The solutions for the Jeffreys and uniform priors, as well as the low-informative asymmetric Beta(1/2, 1) prior, are shown in Figure 5, along with the solutions for fixed p_0 and different values of α.
In contrast to the two-sided case, d can in fact be interpreted as an error tolerance for the one-sided bound. This makes the interpretation of d easier in this case.

The cost of using the exact bound
The cost of using the exact bound will be evaluated in relation to three approximate bounds: the Jeffreys, second-order correct and modified loglikelihood root bounds, described in Section 2.3. Comparing (13) to the expansions in Corollary 1 of Cai (2005) and Proposition 2.2 of Staicu (2009), the following corollary is immediate.
Corollary 2. When L_{U,A} denotes the distance to p for the Jeffreys, second-order correct or modified loglikelihood root bounds, the difference E(L_{U,CP}) − E(L_{U,A}) admits an expansion analogous to those in Corollary 1.
It should be noted that there are one-sided versions of the Wald and Wilson score intervals, but since these have very poor performance (Cai, 2005) they are omitted from our comparison. They can however readily be compared to the exact bound by comparing (13) to the corresponding expansions in Corollary 1 of Cai (2005).
For one-sided bounds, the approximation of the increase in sample size when the exact bound is used is more involved than it was in the two-sided case. To preserve space, we simply use the naive first-order formula n = z_α^2 pq d^{-2} to determine the sample sizes for the approximate bounds. This works reasonably well most of the time. Let n^+(d, p, α) be the increase in sample size when the Clopper-Pearson bound is used instead of an approximate bound. Then, with ω(z, d, p) = 9z^2 pq + 12dz^2 − 24dz^2 p, the increase can be written in closed form as (15). Compared to the increase in sample size in the two-sided setting, (15) is more sensitive to changes in p and α. The cost is smallest when p = 0.5. When evaluating the increase in sample size, p_0 = 0.5 is therefore not to be recommended as the default choice, as this can lead to a serious underestimation of the increase, especially for smaller d.


Minimum coverage or mean coverage?
The Clopper-Pearson methods are exact in the sense that their minimum coverage over all p is at least 1 − α. An alternative measure of coverage is mean coverage, which typically is taken to be the expected coverage with respect to a uniform pseudo-prior on p. In recent papers on binomial confidence intervals, approximate methods have often been considered preferable to exact methods (Agresti & Coull, 1998; Brown et al., 2001; Cai, 2005; Newcombe & Nurminen, 2011), the argument being that it makes more sense to interpret the confidence level as the mean coverage probability rather than the minimum coverage probability, as this corresponds better to how many modern-day statisticians think of coverage levels. Reasoning along the lines of Newcombe & Nurminen (2011), the minimum coverage can occur in an uninteresting part of the parameter space, typically close to the boundaries, possibly rendering it an uninteresting measure of coverage. This is discussed further in the next section.
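The two coverage criteria are easy to compare numerically, since for fixed n the exact coverage at any p is a finite sum of binomial probabilities. A sketch for the Clopper-Pearson interval (function names ours; numpy and scipy assumed):

```python
# Minimum and mean coverage of the Clopper-Pearson interval over a grid of
# p values. The coverage at p sums the binomial probabilities of all x whose
# interval contains p. Helper names are ours; numpy and scipy assumed.
import numpy as np
from scipy.stats import beta, binom

def cp_interval(x, n, alpha=0.05):
    lo = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    hi = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lo, hi

def cp_coverage(n, alpha=0.05, grid=None):
    """Return (minimum, mean) coverage over a grid of p values."""
    if grid is None:
        grid = np.linspace(0.005, 0.995, 199)
    bounds = np.array([cp_interval(x, n, alpha) for x in range(n + 1)])
    xs = np.arange(n + 1)
    cov = []
    for p in grid:
        inside = (bounds[:, 0] <= p) & (p <= bounds[:, 1])
        cov.append(binom.pmf(xs[inside], n, p).sum())
    cov = np.array(cov)
    return cov.min(), cov.mean()
```

The grid minimum stays at or above 1 − α, as exactness requires, while the mean sits strictly above it, quantifying the conservatism discussed above.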
As noted e.g. by Newcombe & Nurminen (2011), using mean coverage is very much in line with current statistical practice in other problems. Widely used methods based on bootstrapping and MCMC, for instance, typically only control confidence levels and type I error rates approximately, attaining the 1 − α level only on average. This is particularly reasonable when the model is known to be an imperfect representation of the underlying process, in which case even minimum coverage criteria are approximate at best. Unlike in many other applications, however, one can often be rather certain that a random variable truly is binomial. This raises the question of whether one should resort to approximations or use methods that really are guaranteed to be exact.
If the Bayesian credible intervals based on either the Jeffreys Beta(1/2, 1/2) or the uniform Beta(1, 1) priors are used, an additional argument for the mean coverage criterion is given by the Bayesian interpretation of these intervals. If we accept mean coverage as a criterion when choosing between confidence intervals, we can obtain intervals that simultaneously admit both frequentist and objective Bayesian interpretations.
The minimum coverage criterion underlying the Clopper-Pearson interval is in line with classical statistical theory. It asserts that overcoverage is a less serious problem than undercoverage, or, in other words, that it is better to be more confident than you believe yourself to be than to be overconfident. In order to evaluate this argument further, we next discuss just how overconfident one risks being when using approximate intervals.

The cost of using approximate methods
Just as there are costs associated with using exact methods, there are costs associated with using approximate methods: the actual coverage level may, even for large n, drop below the nominal 1 − α. There is no guarantee that the true p is not in an unfortunate area with low coverage. However, these coverage anomalies usually occur close to the boundaries of the parameter space, so unless we are interested in inference for p close to 0 or 1, it may be more relevant to consider the coverage on the interior of the parameter space. A coverage of 0.94 for a nominal 0.95 method is well below what one should expect for sample sizes as large as n = 2000. If undercoverage of this size is unacceptable, one may apply computer-intensive coverage-adjustment methods similar to those discussed in Reiczigel (2003), decreasing α to some γ for which the minimum coverage over some set of values of p is at least 1 − α, thus making the methods exact. Decreasing α will however increase the expected length of the intervals.
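A coverage adjustment of this kind can be sketched as follows (our illustration in the spirit of Reiczigel (2003), not that paper's implementation; the choice of the Jeffreys interval, the interior grid and the step size are all arbitrary assumptions): the nominal level γ of the approximate interval is decreased until its minimum coverage over a grid of interior p-values reaches 1 − α.

```python
# Illustrative coverage adjustment: shrink the nominal level gamma of the
# approximate Jeffreys interval until its minimum coverage over an interior
# grid of p-values is at least 1 - alpha. Grid and step size are arbitrary.
import numpy as np
from scipy.stats import beta, binom

def jeffreys_interval(x, n, gamma):
    """Equal-tailed interval from the Jeffreys Beta(x + 1/2, n - x + 1/2) posterior."""
    post = beta(x + 0.5, n - x + 0.5)
    return post.ppf(gamma / 2), post.ppf(1 - gamma / 2)

def min_coverage(n, gamma, grid):
    xs = np.arange(n + 1)
    lo, hi = np.array([jeffreys_interval(x, n, gamma) for x in xs]).T
    pmf = binom.pmf(xs[:, None], n, grid[None, :])
    inside = (lo[:, None] <= grid) & (grid <= hi[:, None])
    return (pmf * inside).sum(axis=0).min()

n, alpha = 100, 0.05
grid = np.linspace(0.1, 0.9, 161)        # interior region of interest
gamma = alpha
while min_coverage(n, gamma, grid) < 1 - alpha:
    gamma -= 0.001                       # decrease gamma until exact on the grid
print("adjusted gamma:", round(gamma, 3))
```

The resulting γ ≤ α, and the widened interval then has guaranteed coverage on the chosen grid, at the price of increased expected length.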
Comparing sample sizes for the 1 − γ Jeffreys interval and the 1 − α Clopper-Pearson interval, we have the following: for n between 1000 and 1500, computer-intensive adjustments of the Jeffreys interval lead to γ ≈ 0.04 (the actual γ being somewhat larger than 0.04). For p0 = 1/2 and d = 0.04, we get n+(0.04, 1/2, 0.05, 0.04) ≈ −186, i.e. the Clopper-Pearson interval requires 186 fewer observations to obtain the desired expected length. In general, approximate intervals that have been adjusted to be exact are outperformed by the Clopper-Pearson interval.
Similarly, if one is willing to use approximate intervals, it is possible to apply coverage adjustments to the Clopper-Pearson interval in order to adjust its mean coverage to 1 − α. The resulting γ > α, meaning that the interval becomes shorter after the adjustment. Thulin (2013) studied this problem in detail for n ≤ 100, showing that the adjusted Clopper-Pearson interval often outperformed its competitors.
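This reverse adjustment can be sketched in the same spirit (a rough grid-based illustration of ours, not the method of Thulin (2013); n and the averaging grid are arbitrary choices): find γ > α such that the mean coverage of the 1 − γ Clopper-Pearson interval, averaged over a uniform grid of p-values, is approximately 1 − α.

```python
# Illustrative mean-coverage adjustment of the Clopper-Pearson interval:
# solve for gamma > alpha so that mean coverage over a uniform grid of
# p-values is approximately 1 - alpha. Grid and n are arbitrary choices.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import beta, binom

def cp_interval(x, n, gamma):
    """Clopper-Pearson endpoints via beta quantiles."""
    lo = beta.ppf(gamma / 2, x, n - x + 1) if x > 0 else 0.0
    hi = beta.ppf(1 - gamma / 2, x + 1, n - x) if x < n else 1.0
    return lo, hi

def mean_coverage(n, gamma, grid):
    xs = np.arange(n + 1)
    lo, hi = np.array([cp_interval(x, n, gamma) for x in xs]).T
    pmf = binom.pmf(xs[:, None], n, grid[None, :])
    inside = (lo[:, None] <= grid) & (grid <= hi[:, None])
    return float((pmf * inside).sum(axis=0).mean())

n, alpha = 50, 0.05
grid = np.linspace(0.005, 0.995, 199)
# Mean coverage exceeds 1 - alpha at gamma = alpha (conservatism) and falls
# well below it at gamma = 0.5, so the root lies strictly in between.
gamma = brentq(lambda g: mean_coverage(n, g, grid) - (1 - alpha), alpha, 0.5)
print("adjusted gamma:", round(gamma, 3))
```

Since γ > α, the adjusted interval is strictly shorter than the standard 1 − α Clopper-Pearson interval while retaining mean coverage 1 − α.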
It should be noted that criteria other than coverage and expected length can be used for comparing confidence intervals. Newcombe (2011, 2012) compared location properties, i.e. left and right non-coverage, of intervals and found the Clopper-Pearson interval to have good properties in comparison to some approximate intervals. Vos & Hudson (2005) considered two criteria related to p-values, motivated by the interpretation of confidence intervals as inverted tests, and found the Clopper-Pearson interval to be better than its competitors.

Conclusions
When choosing between exact and approximate confidence methods, it is important to be aware of the benefits and the costs associated with the two types of methods. The coverage fluctuations of approximate intervals have been compared in several studies, making it easy for practitioners to compare how costly these intervals can be in terms of undercoverage. We have attempted to make the costs of using exact methods explicit, by giving expressions for how much larger the expected length of the exact intervals is and for how much the sample size increases when a fixed expected length is to be attained.
For the two-sided Jeffreys interval, exactness comes at a fixed price: the cost of using the Clopper-Pearson interval instead of this interval is, in terms of expected length and required sample size, insensitive to p and α. For the Agresti-Coull interval, the cost depends only on α. This stands in contrast to the Wilson score interval and one-sided bounds, for which p and α can greatly affect the cost. In all cases the required sample sizes for the exact methods can be substantially larger than those of the approximate methods.
In our comparison of exact and approximate methods, the only exact methods considered were the Clopper-Pearson interval and bound. While other shorter exact two-sided intervals exist, they suffer from various problems that make them unsuitable for use. Moreover, the Clopper-Pearson interval is used far more often than the other exact intervals, which merits its role as the main subject of this study.