Improved Confidence Intervals of a Small Probability from Pooled Testing with Misclassification

This article concerns construction of confidence intervals for the prevalence of a rare disease using Dorfman’s pooled testing procedure when the disease status is classified with an imperfect biomarker. Such an interval can be derived by converting a confidence interval for the probability that a pool tests positive. Wald confidence intervals based on a normal approximation are shown to be inefficient in terms of coverage probability, even for a relatively large number of pools. A few alternatives are proposed, and their performance is investigated in terms of coverage probability and interval length.


INTRODUCTION
Screening for subjects infected with a disease can be costly and time-consuming, especially when the disease prevalence is low. In an effort to overcome these barriers, Dorfman (1) proposed the pooling of blood samples to test for syphilis antigen. According to his procedure, blood samples from subjects under screening are pooled prior to testing. If a pool of blood samples is tested negative, then all subjects in the pool are declared free of infection. Otherwise, a positive test result on a pool indicates that at least one subject is infected and retesting of all individuals in that pool is then conducted to find the infected subjects.
Since its appearance, Dorfman's (1) pooled testing (also known as group testing) approach has drawn considerable attention. The approach has been applied to screening for diseases other than syphilis, such as human immunodeficiency virus (HIV) infection [e.g., Westreich et al. (2)]. A number of variations have been developed, and the scope has been expanded to include estimation of the prevalence of a disease (without necessarily identifying the diseased individuals). However, there is relatively little discussion of the possibility of misclassification (i.e., that the disease status of an individual or a pool of individuals can be assessed incorrectly because the biomarker may be imperfect).
The existing literature on estimating the prevalence of a disease using the pooled testing approach is focused on point estimation (3, 5-7, 13-15, 17, 20, 22, 23). Construction of confidence intervals for the prevalence of a disease has been discussed by Hepworth (10-12) and Tebbs and Bilder (21). These authors assumed that the disease status of a subject can be accurately determined, which may be unrealistic in practice. For example, Weiss et al. (25) reported 97.7% sensitivity and 92.6% specificity for detecting HIV infection with an enzyme-linked immunosorbent assay, and Deitz et al. (26) reported 94% sensitivity for determining the status of N-acetyltransferase 2 with a commonly used 3-single nucleotide polymorphism genotyping assay.
This article focuses on construction of efficient confidence intervals for the prevalence of a rare disease using Dorfman's pooled testing procedure when the disease status is determined by an imperfect biomarker subject to misclassification. We investigate the unified approach of Tu et al. (27), which produces a confidence interval for the disease prevalence by converting a confidence interval for the probability of a pool being tested positive. We then demonstrate that Wald confidence intervals based on a normal approximation are inefficient in that they have a repetitive up-and-down behavior in the coverage probability, similar to that of the classical normal approximation binomial confidence interval discovered by Brown et al. (28). In the present context, this up-and-down behavior persists even when the number of pools is relatively large. We derive alternative confidence intervals by extending the methods of Wilson (29), Clopper and Pearson (30), Agresti and Coull (31), and Blaker (32). Simulation studies are conducted to compare the performance of the proposed methods in terms of coverage probability and mean length. The methods are applied to a real example concerning the seroprevalence of HIV among newborns.

INTERVAL ESTIMATION UNDER POOLED TESTING
Suppose one wants to estimate the prevalence of a disease in a population, $p = P(D = 1)$, where $D$ denotes the disease status of a subject in the population, with $D = 1$ if the subject is infected with the disease and $D = 0$ otherwise. We assume that the disease status is determined using an imperfect biomarker $M$, taking values 0 and 1, and a subject is classified as infected if $M = 1$. The accuracy of the biomarker is measured by its specificity $\pi_0 = P(M = 0 \mid D = 0)$ and sensitivity $\pi_1 = P(M = 1 \mid D = 1)$. For the biomarker to be of practical use we assume that $1/2 < \pi_0, \pi_1 \le 1$; otherwise a random assignment of the disease status would perform better than the biomarker.
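Suppose a random sample of size $nk$, where $n$ and $k$ are positive integers, is available from the target population. The conventional approach to estimating $p$ is based on individually observed values of $M$, say $M_1, \ldots, M_{nk}$. Dorfman's procedure for estimating $p$ is carried out by randomly assigning the $nk$ individuals to $n$ pools of $k$ individuals each and testing each pool for positivity of the biomarker. Inference on $p$ is then based on the number of pools that test positive. Let $\delta$ denote the probability that a pool tests positive, and assume that the sensitivity and specificity of the test do not depend on the pool size. A pool contains no infected subject with probability $(1 - p)^k$, in which case it tests positive with probability $1 - \pi_0$; otherwise it tests positive with probability $\pi_1$. Consequently,

$$\delta = (1 - \pi_0)(1 - p)^k + \pi_1\{1 - (1 - p)^k\} = \pi_1 - r(1 - p)^k, \qquad r = \pi_0 + \pi_1 - 1. \tag{1}$$

For fixed $k$, $\pi_0$, and $\pi_1$, the value of $\delta$ as a function of $p$ increases as $p$ increases. Because $0 \le p \le 1$, we have

$$1 - \pi_0 \le \delta \le \pi_1. \tag{2}$$

Using the relationship given by equation (1) along with the constraint (2), a unified (and straightforward) approach (27) to constructing a confidence interval for $p$ is as follows. Suppose $[\delta_L, \delta_U]$ is a confidence interval for $\delta$ with level $1 - \alpha$. Define

$$p_L = 1 - \left(\frac{\pi_1 - \delta_L}{r}\right)^{1/k}, \qquad p_U = 1 - \left(\frac{\pi_1 - \delta_U}{r}\right)^{1/k}. \tag{3}$$

Then $[p_L, p_U]$ is a confidence interval for $p$ with level $1 - \alpha$.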
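The conversion between the pool-level probability δ and the prevalence p can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and r is taken to be π0 + π1 − 1, the value for which δ = π1 − r(1 − p)^k equals 1 − π0 at p = 0 and π1 at p = 1.

```python
def pool_positive_prob(p, k, spec, sens):
    """Probability that a pool of size k tests positive: sens - r * (1 - p)**k."""
    r = spec + sens - 1.0  # assumed definition of r (gives delta = 1 - spec at p = 0)
    return sens - r * (1.0 - p) ** k


def prevalence_from_delta(delta, k, spec, sens):
    """Invert the relationship above; delta must lie in [1 - spec, sens]."""
    r = spec + sens - 1.0
    ratio = (sens - delta) / r
    if not -1e-12 <= ratio <= 1.0 + 1e-12:
        raise ValueError("delta must lie in [1 - spec, sens]")
    ratio = min(max(ratio, 0.0), 1.0)  # guard against rounding at the end points
    return 1.0 - ratio ** (1.0 / k)


def convert_interval(delta_lo, delta_hi, k, spec, sens):
    """Map an interval for delta to one for p (the map is increasing in delta)."""
    return (prevalence_from_delta(delta_lo, k, spec, sens),
            prevalence_from_delta(delta_hi, k, spec, sens))


# Illustrative values only: pools of size 5, specificity 0.85, sensitivity 0.90.
delta = pool_positive_prob(0.01, 5, 0.85, 0.90)
p_lo, p_hi = convert_interval(0.15, 0.25, 5, 0.85, 0.90)
```

Because the map from δ to p is increasing, the lower limit for δ maps to the lower limit for p, and the coverage of the interval is unchanged by the conversion.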

CONSTRUCTING CONFIDENCE INTERVALS FOR δ
In this section we propose a few methods to construct a confidence interval for $\delta$. Once derived, the interval can be converted into a confidence interval for $p$, as indicated in the previous section. Let $M_i$, $i = 1, \ldots, n$, here denote the test result for the $i$th pool. The $M_i$ are independent and identically distributed Bernoulli variables with $P(M_i = 1) = \delta \in [1 - \pi_0, \pi_1]$. Thus a confidence interval for $\delta$ can be constructed using methods developed for a binomial probability. However, the constraint (2) must be taken into account. In what follows we extend several popular methods for binomial confidence intervals to construct confidence intervals for $\delta$, taking the constraint (2) into consideration.

THE WALD CONFIDENCE INTERVAL
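The Wald confidence interval is based on the fact that the estimator of $\delta$, $\hat\delta = S/n$, is asymptotically normally distributed with mean $\delta$ and variance $\delta(1 - \delta)/n$, where $S = \sum_{i=1}^{n} M_i$ is the number of pools that test positive. Without the constraint (2), the Wald confidence interval is given by

$$\hat\delta \pm z_{\alpha/2}\sqrt{\hat\delta(1 - \hat\delta)/n},$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution. With the constraint, the Wald confidence interval for $\delta$ is modified so that it respects the bounds in (2).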
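As a rough sketch, the unconstrained Wald interval can be computed with the Python standard library. The second function shows one possible way to respect the constraint (2), namely clipping both limits into [1 − π0, π1]. That clipping rule is an assumption made here for illustration and is not necessarily the modification defined in the article; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(s, n, alpha=0.05):
    """Unconstrained Wald interval for a binomial probability from s successes in n trials."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    d = s / n
    half = z * sqrt(d * (1.0 - d) / n)
    return d - half, d + half


def clip_to_range(lo, hi, spec, sens):
    """ASSUMPTION: enforce 1 - spec <= delta <= sens by clipping each limit."""
    a, b = 1.0 - spec, sens
    return min(max(lo, a), b), min(max(hi, a), b)


# Illustrative values: 3 positive pools out of 20, specificity 0.85, sensitivity 0.90.
lo, hi = wald_interval(3, 20)
c_lo, c_hi = clip_to_range(lo, hi, spec=0.85, sens=0.90)
```

In this example the unconstrained lower limit is negative, which illustrates why the constraint cannot be ignored when the number of pools is small.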

THE WILSON CONFIDENCE INTERVAL
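Without any constraints on the binomial probability, the Wilson confidence interval (29) is

$$\frac{\hat\delta + z_{\alpha/2}^2/(2n)}{1 + z_{\alpha/2}^2/n} \pm \frac{z_{\alpha/2}}{1 + z_{\alpha/2}^2/n}\sqrt{\frac{\hat\delta(1 - \hat\delta)}{n} + \frac{z_{\alpha/2}^2}{4n^2}},$$

where $\hat\delta = S/n$ and $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution. Accounting for the constraint (2), the modified Wilson confidence interval for $\delta$ is restricted to $[1 - \pi_0, \pi_1]$.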
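The standard (unconstrained) Wilson score interval is straightforward to compute; the sketch below uses a hypothetical function name and only the Python standard library. It does not implement the constrained modification.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(s, n, alpha=0.05):
    """Unconstrained Wilson score interval from s successes in n trials."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    d = s / n
    denom = 1.0 + z * z / n
    center = (d + z * z / (2.0 * n)) / denom
    half = (z / denom) * sqrt(d * (1.0 - d) / n + z * z / (4.0 * n * n))
    return center - half, center + half


# Illustrative value: no positive pools among 10.
lo, hi = wilson_interval(0, 10)
```

Unlike the Wald interval, the Wilson interval has positive length even when no pool (or every pool) tests positive, and it never leaves [0, 1].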

THE CLOPPER-PEARSON CONFIDENCE INTERVAL
The Clopper-Pearson confidence interval is often referred to as the exact confidence interval because it is derived from the binomial distribution rather than from a normal approximation. Note that $S$ follows a binomial distribution with size $n$ and probability $\delta$. Let $s$ be the observed value of $S$. If there are no constraints on $\delta$, then the lower bound $\delta_L$ and the upper bound $\delta_U$ of the Clopper-Pearson interval can be derived by solving the equations

$$1 - B(s - 1; n, \delta_L) = \alpha/2 \quad \text{and} \quad B(s; n, \delta_U) = \alpha/2,$$

respectively, where $b(s; n, \delta) = P_\delta(S = s)$ is the binomial probability mass function with size $n$ and probability $\delta$, and $B(s; n, \delta) = \sum_{i=0}^{s} b(i; n, \delta) = P_\delta(S \le s)$ is the corresponding binomial distribution function, with the conventions $\delta_L = 0$ when $s = 0$ and $\delta_U = 1$ when $s = n$. Tu et al. (27) suggested using this interval without any modification for $\delta$. The modified Clopper-Pearson confidence interval accounts for the constraint (2) by restricting the interval to $[1 - \pi_0, \pi_1]$.
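The two tail equations can be solved numerically. The sketch below uses plain bisection on the binomial tail probabilities, which is a simple stand-in for the usual beta-quantile formula; the function names are hypothetical, and only the unconstrained interval is computed.

```python
from math import comb


def binom_cdf(s, n, d):
    """P(S <= s) for S ~ Binomial(n, d)."""
    return sum(comb(n, i) * d**i * (1.0 - d) ** (n - i) for i in range(s + 1))


def solve_monotone(f, lo=0.0, hi=1.0, tol=1e-10):
    """Bisection for a monotone f whose sign differs at lo and hi."""
    sign_lo = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(s, n, alpha=0.05):
    """Unconstrained Clopper-Pearson interval from s successes in n trials."""
    # Lower limit solves P(S >= s) = alpha/2; it is 0 by convention when s = 0.
    lower = 0.0 if s == 0 else solve_monotone(
        lambda d: (1.0 - binom_cdf(s - 1, n, d)) - alpha / 2.0)
    # Upper limit solves P(S <= s) = alpha/2; it is 1 by convention when s = n.
    upper = 1.0 if s == n else solve_monotone(
        lambda d: binom_cdf(s, n, d) - alpha / 2.0)
    return lower, upper


lo, hi = clopper_pearson(0, 10)
```

For s = 0 the upper equation reduces to (1 − δ)^n = α/2, so the upper limit has the closed form 1 − (α/2)^{1/n}, which gives a quick check of the numerical solution.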

THE AGRESTI-COULL CONFIDENCE INTERVAL
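The Agresti-Coull confidence interval is a modification of the Wald confidence interval with $\hat\delta$ replaced by

$$\tilde\delta = \frac{S + z_{\alpha/2}^2/2}{n + z_{\alpha/2}^2},$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution. Thus, when $\delta$ is not constrained, the Agresti-Coull confidence interval in its standard form (28, 31) is given by

$$\tilde\delta \pm z_{\alpha/2}\sqrt{\tilde\delta(1 - \tilde\delta)/\tilde n}, \qquad \tilde n = n + z_{\alpha/2}^2.$$

With the constraint (2), the Agresti-Coull confidence interval is restricted to $[1 - \pi_0, \pi_1]$.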
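A short sketch of the unconstrained Agresti-Coull interval follows. It uses the common general-level form, in which z²/2 pseudo-successes and z²/2 pseudo-failures are added before applying the Wald formula. Whether the article uses exactly this variant is an assumption, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def agresti_coull_interval(s, n, alpha=0.05):
    """Unconstrained Agresti-Coull interval from s successes in n trials."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    n_adj = n + z * z                 # adjusted sample size
    d_adj = (s + z * z / 2.0) / n_adj  # adjusted point estimate
    half = z * sqrt(d_adj * (1.0 - d_adj) / n_adj)
    return d_adj - half, d_adj + half


# Illustrative value: no positive pools among 10.
lo, hi = agresti_coull_interval(0, 10)
```

The lower limit can fall below zero for small counts, so some restriction to the admissible range of δ is needed before converting the interval to one for p.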

THE BLAKER CONFIDENCE INTERVAL
The confidence intervals of Wilson, Clopper-Pearson, and Agresti-Coull are highly recommended by Brown et al. (28). Blaker (32) proposed a method to improve the standard "exact" confidence intervals for discrete distributions, and called the resulting confidence intervals acceptability intervals. For the binomial case, the author showed that the acceptability interval is shorter than the Wald, Wilson, and Agresti-Coull intervals. Define

$$\gamma(\delta, s) = \min\{P_\delta(S \ge s),\, P_\delta(S \le s)\}, \qquad \alpha^*(\delta, s) = P_\delta\{\gamma(\delta, S) \le \gamma(\delta, s)\}.$$

Then for the binomial probability $\delta$ with no constraints, by reformulating the notation of Blaker (32), the Blaker interval is given by

$$I_B = \{\delta : \alpha^*(\delta, s) > \alpha\}.$$

Blaker (32) showed that $I_B$ is indeed an interval and has coverage probability at least $1 - \alpha$. When $\delta$ is constrained by equation (2), the Blaker confidence interval is confined to $[1 - \pi_0, \pi_1]$.
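Blaker's construction can be illustrated with a simple grid search. The sketch below evaluates the acceptability of each candidate value of δ, defined as the probability of observing an outcome whose smaller tail probability is no larger than that of the observed count, and keeps the smallest and largest values whose acceptability exceeds α. The grid resolution, the small numerical tolerance, and the function names are choices made here for illustration; the result is accurate only to the grid spacing and ignores the constraint (2).

```python
from math import comb


def acceptability(d, s, n):
    """Blaker's acceptability of the value d given s successes in n trials."""
    pmf = [comb(n, i) * d**i * (1.0 - d) ** (n - i) for i in range(n + 1)]
    lower, total = [], 0.0
    for q in pmf:                      # lower[i] = P(S <= i)
        total += q
        lower.append(total)
    upper, total = [0.0] * (n + 1), 0.0
    for i in range(n, -1, -1):         # upper[i] = P(S >= i)
        total += pmf[i]
        upper[i] = total
    gamma = [min(lo, up) for lo, up in zip(lower, upper)]
    cutoff = gamma[s] * (1.0 + 1e-7)   # tolerance for floating-point ties
    return sum(q for q, g in zip(pmf, gamma) if g <= cutoff)


def blaker_interval(s, n, alpha=0.05, grid=2000):
    """Approximate unconstrained Blaker interval on a grid of step 1/grid."""
    accepted = [j / grid for j in range(grid + 1)
                if acceptability(j / grid, s, n) > alpha]
    return min(accepted), max(accepted)


# Illustrative value: no positive pools among 10.
lo, hi = blaker_interval(0, 10)
```

A production implementation would typically refine the end points by root finding rather than a fixed grid; the grid is used here only to keep the logic transparent.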

SIMULATIONS
There has been a large amount of research on the performance of various binomial confidence intervals for the disease prevalence $p$ under the usual setting where independent and identically distributed Bernoulli observations of the disease status are available. However, not much research has been conducted under Dorfman's pooled testing setting, especially in the presence of misclassification. In this section we conduct simulations to compare the coverage probability and mean length of the confidence intervals for $p$, and to investigate the effect of the pool size $k$ and the misclassification rates (i.e., $1 - \pi_0$ and $1 - \pi_1$) on the precision (coverage and length) of the intervals. It is worth noting that when a confidence interval $[\delta_L, \delta_U]$ for $\delta$ is converted into a confidence interval $[p_L, p_U]$ for $p$ via equation (3), the coverage probability remains unchanged because of the monotonicity of $p$ as a function of $\delta$.

THE OSCILLATION BEHAVIOR OF WALD CONFIDENCE INTERVALS
Brown et al. (28) investigated the performance of a number of confidence intervals for a binomial probability in the usual setting, where the individual disease status is observed without error, corresponding to $k = 1$ and $\pi_0 = \pi_1 = 1$ in our setting. The authors showed a remarkable oscillating (up-and-down) behavior in the coverage of the widely used Wald confidence intervals based on a normal approximation: the coverage probability increases from far below the nominal level of $1 - \alpha$ to the nominal level and then decreases, and the pattern repeats until the sample size becomes rather large. We demonstrate here that Wald confidence intervals show the same oscillation under pooled testing with misclassification. Fixing specificity $\pi_0 = 0.85$ and sensitivity $\pi_1 = 0.90$, we computed via simulation the coverage probability of the Wald confidence interval for $p$ in a variety of scenarios with $k = 2, 5$; $10 \le n \le 150$; and $p = 0.01, 0.10$. For each configuration of $(k, n, p)$, 10,000 simulations were conducted. For each simulation, we generated a random observation from the binomial distribution with size $n$ and probability $\delta = \pi_1 - r(1 - p)^k$, and constructed the 95% Wald confidence interval for $\delta$ according to equation (4). This confidence interval for $\delta$ was then converted into a confidence interval for $p$ using equation (3). The estimated coverage probability is the proportion of the 10,000 intervals that contain the true value of $p$. Figure 1 presents the simulation results. It is clear that, for each configuration, the coverage probability as a function of $n$ starts out very low, usually below 85%, and then gradually increases as $n$ gets larger to a value near the nominal level of 95%. Then it quickly decreases to a low coverage probability. The trend repeats until $n$ is large enough to stabilize the coverage probability.
Therefore, unless n is sufficiently large, the Wald confidence interval does not provide the desired coverage and should not be recommended. This unfortunate observation is consistent with that of Brown et al. (28) for the classical binomial confidence intervals.
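The dependence of the Wald coverage on n can be explored without simulation, because the number of positive pools has only n + 1 possible values. The sketch below computes the coverage exactly by summing binomial probabilities over the outcomes whose interval contains the true δ. This swaps the Monte Carlo approach described above for an exact enumeration, and it enforces the constraint on δ by clipping the Wald limits into [1 − π0, π1], which is an assumption made here and may differ from the article's constrained interval. The function name and the definition r = π0 + π1 − 1 are likewise assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald_coverage(n, p, k, spec, sens, alpha=0.05):
    """Exact coverage of a clipped Wald interval for delta (equivalently, for p)."""
    r = spec + sens - 1.0                # assumed definition of r
    delta = sens - r * (1.0 - p) ** k    # true probability of a positive pool
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    a, b = 1.0 - spec, sens              # admissible range of delta
    coverage = 0.0
    for s in range(n + 1):
        d = s / n
        half = z * sqrt(d * (1.0 - d) / n)
        lo = min(max(d - half, a), b)    # ASSUMPTION: clip limits to [a, b]
        hi = min(max(d + half, a), b)
        if lo <= delta <= hi:
            coverage += comb(n, s) * delta**s * (1.0 - delta) ** (n - s)
    return coverage


# Coverage as a function of the number of pools for one configuration of the text.
coverage_by_n = [wald_coverage(n, p=0.01, k=2, spec=0.85, sens=0.90)
                 for n in range(10, 151)]
```

Plotting `coverage_by_n` against n gives a curve that can be compared with Figure 1, keeping in mind that the clipping rule used here is only one way of handling the constraint.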

COMPARISON OF CONFIDENCE INTERVALS
Using simulations again, we compared the four alternative confidence intervals (the Wilson, Clopper-Pearson, Agresti-Coull, and Blaker intervals) with the Wald interval in terms of mean length and coverage probability. To set up the simulation we considered various representative configurations of $(p, n, k, \pi_0, \pi_1)$ with $p = 0.001, 0.1, 0.3$; $n = 10, 20, 50$; $k = 2, 5, 10$; and $(\pi_0, \pi_1) = (0.85, 0.95)$. A total of 10,000 simulations were conducted, and the coverage probability of each interval was estimated in the same way as for the Wald interval. The mean length of each interval was estimated by averaging over the 10,000 simulated intervals. Table 1 shows the estimated coverage probability and average length of each confidence interval in various scenarios with $p = 0.001$ (the results for $p = 0.1$ and $p = 0.3$ are similar and therefore not shown). In almost all cases considered, the four alternative confidence intervals provide satisfactory coverage probability around the 95% nominal level of confidence. The Wald intervals are quite unstable, with poor precision when $n$ is small.

FIGURE 1 | The oscillation up-and-down behavior of the Wald confidence interval under pooled testing with misclassification.
The Clopper-Pearson and Agresti-Coull intervals tend to be more conservative, producing higher coverage probability, followed by the Blaker interval and then the Wilson interval. However, conservatism usually comes at the price of longer intervals, as shown in Table 1.
The effect of misclassification on the Wald interval is unclear because of its oscillating behavior. For the other intervals, it appears that the coverage probability increases as the sensitivity $\pi_1$ increases or as the specificity $\pi_0$ decreases. For fixed misclassification rates $(\pi_0, \pi_1)$, including more samples in a pool seems to improve the coverage probability, up to a certain pool size. This latter observation seems to agree with that of Tu et al. (23) and Liu et al. (33), who found that in the presence of misclassification the efficiency of estimation increases with the pool size up to a certain point.

EXAMPLE
We now illustrate the methods by applying them to a real example concerning the seroprevalence of HIV among newborns in the State of New York (34). The data were obtained by testing blood specimens from all infants born in this state during a 28-month period (from November 30, 1987, through March 31, 1990). The test was targeted at serum antibodies produced by the immune system in response to HIV infection. A positive test result indicates HIV infection in the mother but not necessarily in the child. To illustrate the methods, we focus on the Manhattan area, where 50,364 newborns were tested with 799 positive results.
Because the study did not involve pooled testing, we create pools in a post hoc manner by grouping subjects randomly into pools of a given size (k = 5 or 10). With k = 10, for instance, we obtain 5,036 pools of size 10, ignoring the four remaining subjects. The test result for each pool is taken to be the maximum of all individual test results in the pool; that is, a pool is declared positive if and only if it contains one or more infants with positive test results. To account for possible misclassification, estimation of HIV seroprevalence requires knowledge of the sensitivity and specificity of the test. Because the true values of these performance measures are not known precisely, we perform a sensitivity analysis that covers a range of plausible values for the sensitivity and the specificity of the test. The reasoning of Tu et al. (27) and the numerical results in their Table 1 suggest that the specificity of the HIV test in this study is at least 99%. Accordingly, our sensitivity analysis includes the values 99, 99.5, and 99.9% for the specificity. There appears to be less information about the sensitivity of the HIV test in this study, and we therefore consider a wider range (95, 97.5, 99, and 99.9%) for the sensitivity. For each pair of sensitivity and specificity values and each value of k, we apply the five methods described earlier to the pooled dataset to obtain five 95% confidence intervals for the individual-level HIV seroprevalence rate, in addition to a point estimate (common to all five methods). Table 2 presents the results of our sensitivity analysis (with different combinations of sensitivity and specificity values) for each value of k (5 or 10). It appears that the results are more sensitive to the specificity of the HIV test than to the sensitivity of the test.
The point estimate and the confidence limits (for all five methods) tend to decrease with the sensitivity of the test and increase with the specificity of the test, as predicted by theory. Intuitively, increased sensitivity means fewer false negatives, and increased specificity means fewer false positives, and these are reflected in the estimates in Table 2. Between the different pool sizes (5 and 10), which lead to different datasets, there are some numerical differences, especially at lower values of the sensitivity and the specificity. However, when the sensitivity and the specificity are both high (say, 99.9%), there is remarkable agreement between the estimates based on k = 5 and those based on k = 10. In any case, the five confidence intervals are generally similar to each other, perhaps as a result of the large sample size.
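The post hoc pooling step and the resulting point estimate can be sketched as follows. The individual-level counts (50,364 newborns, 799 positives) are those quoted above; the random seed, the function name, the plug-in estimator obtained by inverting δ = π1 − r(1 − p)^k with r = π0 + π1 − 1, and the clipping of the observed pool positivity rate into [1 − π0, π1] are assumptions made for illustration. Because the pooling is random, the number of positive pools (and hence the estimate) varies with the seed and will not reproduce Table 2 exactly.

```python
import random


def pooled_point_estimate(results, k, spec, sens, seed=0):
    """Randomly pool 0/1 results into groups of k and estimate the prevalence."""
    rng = random.Random(seed)
    shuffled = list(results)
    rng.shuffle(shuffled)
    n_pools = len(shuffled) // k                   # leftover subjects are ignored
    pools = [shuffled[i * k:(i + 1) * k] for i in range(n_pools)]
    positives = sum(max(pool) for pool in pools)   # pool positive if any member is
    delta_hat = positives / n_pools
    delta_hat = min(max(delta_hat, 1.0 - spec), sens)  # ASSUMPTION: clip to range
    r = spec + sens - 1.0                          # assumed definition of r
    ratio = min(max((sens - delta_hat) / r, 0.0), 1.0)
    p_hat = 1.0 - ratio ** (1.0 / k)               # plug-in estimate of prevalence
    return n_pools, positives, p_hat


# Individual-level data from the Manhattan example: 799 positives among 50,364.
results = [1] * 799 + [0] * (50364 - 799)
n_pools, positives, p_hat = pooled_point_estimate(results, k=10, spec=0.99, sens=0.99)
```

Repeating the call over the grid of sensitivity and specificity values described above would give the point-estimate column of a sensitivity analysis of the same shape as Table 2.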

DISCUSSION
In this article we proposed a few approaches to constructing a confidence interval for the disease prevalence under pooled testing with misclassification. These approaches share a common feature in that they are all obtained by converting a valid confidence interval for the probability of a pool being tested positive. Our investigation of the coverage probability and mean length of the confidence intervals indicates that caution needs to be taken in using the Wald interval when the sample size is not large enough. From our overall evaluation it appears that the Clopper-Pearson and Agresti-Coull intervals, though somewhat conservative, tend to be more valid than the Wilson and Blaker intervals, especially when the disease probability and the sample size are relatively small. Misclassification of the disease status clearly impacts the precision of the confidence intervals, as demonstrated by the simulation results in Figure 1 and Table 1. In this article, the misclassification is assumed to be independent of the pool size, which seems to be a reasonable assumption in some situations. However, as noted by Cahoon-Young (35), this assumption may be violated when the pool size gets larger. It remains to be seen how the performance of a confidence interval might be affected by pool-size-dependent misclassification.