Pass-Fail Testing: Statistical Requirements and Interpretations

Performance standards for detector systems often include requirements for probability of detection and probability of false alarm at a specified level of statistical confidence. This paper reviews the accepted definitions of confidence level and of critical value. It describes the testing requirements for establishing either of these probabilities at a desired confidence level. These requirements are computable in terms of functions that are readily available in statistical software packages and general spreadsheet applications. The statistical interpretations of the critical values are discussed. A table is included for illustration, and a plot is presented showing the minimum required numbers of pass-fail tests. The results given here are applicable to one-sided testing of any system with performance characteristics conforming to a binomial distribution.

Key words: binomial distribution; confidence bounds; confidence coefficient; critical value; probability of detection; probability of false alarm.


Introduction
In evaluating the efficacy of equipment meant for detecting hidden contraband or dangerous substances, the instrument is often tested against performance requirements set forth in protocols issued by national or international standards organizations. These requirements include probability of detection (PD) and probability of false alarm (PFA) at a specified level of statistical confidence.
The detection systems considered in this paper are all assumed to behave according to a binomial distribution. Only two outcomes are considered for independent trials with contraband present: the detection system either correctly reports detection or does not. Furthermore, the probability of detection must remain constant during the period of the testing. Otherwise, it may be meaningless to perform tests based on the binomial model to estimate this quantity. Similarly, for tests with contraband absent, the detection system either correctly reports no detection or falsely reports the presence of contraband, and the probability of a false alarm is presumed to remain fixed throughout the period of testing. For a detection system, PD or PFA can only be determined accurately by a sufficient number of trials. However, there is a number called the confidence level (CL) that gives some sense of the adequacy of the results from a series of trials of a given size.
CL is defined in terms of the binomial probability mass function, also called the binomial discrete density function,
$$b(m; n, p) = \binom{n}{m} p^{m} (1 - p)^{n - m}, \tag{1}$$
where $m = 0, 1, \ldots, n$ denotes the number of successes (successful detections or false alarms) in n independent trials with p = PD or p = PFA, $0 \le p \le 1$ (see Johnson, Kotz, and Kemp, 1992). The number of successes in n repeated independent trials conforms to this function if each trial can be scored as either success or failure and the probability for success is fixed.
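As a minimal numerical sketch of the binomial probability mass function b(m; n, p), the following Python snippet evaluates it using only the standard library. The function name `binom_pmf` and the values n = 30 and p = 0.9 are illustrative assumptions and are not taken from the paper.

```python
from math import comb


def binom_pmf(m: int, n: int, p: float) -> float:
    """b(m; n, p): probability of exactly m successes in n independent trials."""
    return comb(n, m) * p**m * (1.0 - p) ** (n - m)


# Illustrative values only (assumed, not from the paper).
n, p = 30, 0.9
probabilities = [binom_pmf(m, n, p) for m in range(n + 1)]

print(f"P(30 successes in 30 trials) = {probabilities[30]:.4f}")
print(f"Sum over m = 0..n            = {sum(probabilities):.6f}")
```

The probabilities sum to one over m = 0, ..., n, which is a quick sanity check on any implementation of this function.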
In Sec. 2 we discuss the definitions of CL and related critical values in detection problems. Section 3 gives statistical interpretation of these values in terms of hypothesis testing and confidence bounds. The note is concluded with Sec. 4 containing some examples.

Definitions and Test Requirements
The quantity CL can be loosely interpreted as the likelihood that any such system conforming to a binomial distribution, with m successes in a series of n independent trials, will have a true PD value greater than or equal to a chosen value, $\mathrm{PD}_c$.
More formally, the accepted definition of CL in setting testing requirements is stated in terms of the equation below. The usage of this term is consonant with that of ASTM standard C 1236-99 (2005).
For a number m of successes found in a series of n pass-fail trials, with a fixed value of PD, designated $\mathrm{PD}_c$, the confidence level $\mathrm{CL}(m, n, \mathrm{PD}_c)$ is defined by the equation
$$\mathrm{CL}(m, n, \mathrm{PD}_c) = \sum_{i=0}^{m-1} b(i; n, \mathrm{PD}_c). \tag{2}$$
In other words, if for $x = 0, 1, \ldots, n$, $0 \le p \le 1$,
$$\mathrm{BINCDF}(x, n, p) = \sum_{i=0}^{x} b(i; n, p) \tag{3}$$
denotes the binomial cumulative distribution function, then (2) can be expressed as
$$\mathrm{CL}(m, n, \mathrm{PD}_c) = \mathrm{BINCDF}(m - 1, n, \mathrm{PD}_c). \tag{4}$$
Note that under this definition $\mathrm{CL}(m, n, \mathrm{PD}_c)$ cannot exceed $1 - \mathrm{PD}_c^{\,n}$.
To find the critical value $m_c$, i.e., the minimum value of m establishing the $\mathrm{PD}_c$ of interest with a preselected, fixed level of confidence, CL, one must invert the inequality
$$\mathrm{CL}(m, n, \mathrm{PD}_c) \ge \mathrm{CL}, \tag{5}$$
that is,
$$\mathrm{BINCDF}(m - 1, n, \mathrm{PD}_c) \ge \mathrm{CL}. \tag{6}$$
Since BINCDF(x, n, p) is a step function in x (i.e., is not strictly increasing), it does not have a proper inverse function. If we set $m_c - 1$, $1 \le m_c \le n$, to be the least integer such that $\mathrm{BINCDF}(m_c - 1, n, \mathrm{PD}_c)$ equals or exceeds CL, then
$$m_c = 1 + \mathrm{INVBINCDF}(\mathrm{CL}, n, \mathrm{PD}_c), \tag{7}$$
where INVBINCDF(CL, n, p) is the inverse cumulative binomial distribution function (i.e., the smallest nonnegative integer such that the cumulative distribution function evaluated at this value equals or exceeds CL). Versions of this function are available in many statistical software packages, including MATLAB (binoinv), R (qbinom), NAG, GAMS, IMSL, S-PLUS, and SAS, and in general spreadsheet applications, such as EXCEL (function CRITBINOM(n, p, CL)).
The binomial cumulative distribution function can be expressed through the incomplete beta function (Abramowitz and Stegun, 1972),
$$\mathrm{BINCDF}(m - 1, n, p) = \frac{\Gamma(n + 1)}{\Gamma(m)\,\Gamma(n - m + 1)} \int_{p}^{1} t^{m-1} (1 - t)^{n-m}\, dt, \tag{8}$$
so that for fixed m and n, BINCDF(m − 1, n, p) is a decreasing function of p, $0 \le p \le 1$. This formula allows one to define BINCDF(m − 1, n, p) for any real (noninteger) values m and n such that 0 < m < n + 1.
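The definitions of the confidence level for PD and of the critical value can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names (`bincdf`, `invbincdf`, `confidence_level_pd`, `critical_value_pd`) are hypothetical, the inverse function is written out by hand rather than taken from a statistics package, and the values n = 30, PD_c = 0.85, CL = 0.95 are chosen only for illustration.

```python
from math import comb


def bincdf(x: int, n: int, p: float) -> float:
    """BINCDF(x, n, p) = P(X <= x) for X ~ BIN(n, p); 0.0 when x < 0."""
    return sum((comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(x + 1)), 0.0)


def invbincdf(cl: float, n: int, p: float) -> int:
    """Smallest nonnegative integer k with BINCDF(k, n, p) >= cl."""
    total = 0.0
    for k in range(n + 1):
        total += comb(n, k) * p**k * (1.0 - p) ** (n - k)
        if total >= cl:
            return k
    return n  # reached only through floating-point rounding


def confidence_level_pd(m: int, n: int, pd_c: float) -> float:
    """CL(m, n, PD_c) = BINCDF(m - 1, n, PD_c)."""
    return bincdf(m - 1, n, pd_c)


def critical_value_pd(cl: float, n: int, pd_c: float) -> int:
    """m_c = 1 + INVBINCDF(CL, n, PD_c); a result of n + 1 means n is too small."""
    return 1 + invbincdf(cl, n, pd_c)


# Illustrative values only (assumed, not from the paper's table).
n, pd_c, cl = 30, 0.85, 0.95
m_c = critical_value_pd(cl, n, pd_c)
print(f"m_c = {m_c}")
print(f"CL(m_c, n, PD_c)     = {confidence_level_pd(m_c, n, pd_c):.4f}")
print(f"CL(m_c - 1, n, PD_c) = {confidence_level_pd(m_c - 1, n, pd_c):.4f}")
```

In a statistics package the hand-written `invbincdf` would normally be replaced by the corresponding library function, such as those listed in the text.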
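An analogous definition of CL applies to testing for PFA in systems where no contraband or dangerous substance is present. For any chosen value of PFA, designated $\mathrm{PFA}_c$, the confidence level $\mathrm{CL}(m, n, \mathrm{PFA}_c)$ equals the probability that the number of false alarms occurring in a series of n independent binary trials exceeds m. Thus, this level is defined by the equation
$$\mathrm{CL}(m, n, \mathrm{PFA}_c) = \sum_{i=m+1}^{n} b(i; n, \mathrm{PFA}_c) = 1 - \mathrm{BINCDF}(m, n, \mathrm{PFA}_c). \tag{9}$$
The critical value $M_c$ is the maximum value of m establishing the $\mathrm{PFA}_c$ of interest at the fixed level of confidence CL, i.e., the largest m satisfying
$$1 - \mathrm{BINCDF}(m, n, \mathrm{PFA}_c) \ge \mathrm{CL}. \tag{10}$$
To express $M_c$ through the same inverse function as in (7), notice that for $x = 0, \ldots, n - 1$,
$$\mathrm{BINCDF}(x, n, p) = 1 - \mathrm{BINCDF}(n - 1 - x, n, 1 - p), \tag{13}$$
so that
$$1 - \mathrm{BINCDF}(m, n, \mathrm{PFA}_c) = \mathrm{BINCDF}(n - 1 - m, n, 1 - \mathrm{PFA}_c). \tag{14}$$
Therefore,
$$M_c = n - 1 - \mathrm{INVBINCDF}(\mathrm{CL}, n, 1 - \mathrm{PFA}_c), \tag{15}$$
so that $M_c \le n - 1$, and $M_c$ is not defined when $\mathrm{CL}(0, n, \mathrm{PFA}_c) < \mathrm{CL}$, i.e., when $(1 - \mathrm{PFA}_c)^{n} > 1 - \mathrm{CL}$.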
Thus (15) and (7) show that under the same value of CL, when PD = 1 − PFA, a simple formula,
$$m_c + M_c = n, \tag{16}$$
relates $m_c$ and $M_c$.
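The relation between the two critical values can be checked numerically. The sketch below is a hedged illustration: the function names are hypothetical, `critical_value_pfa_direct` is a brute-force search added here only as a cross-check, and the values n = 30, CL = 0.95, PD_c = 0.85, PFA_c = 0.15 are assumptions chosen for the example.

```python
from math import comb


def invbincdf(cl: float, n: int, p: float) -> int:
    """Smallest nonnegative integer k with BINCDF(k, n, p) >= cl."""
    total = 0.0
    for k in range(n + 1):
        total += comb(n, k) * p**k * (1.0 - p) ** (n - k)
        if total >= cl:
            return k
    return n  # reached only through floating-point rounding


def critical_value_pd(cl: float, n: int, pd_c: float) -> int:
    """m_c = 1 + INVBINCDF(CL, n, PD_c)."""
    return 1 + invbincdf(cl, n, pd_c)


def critical_value_pfa(cl: float, n: int, pfa_c: float) -> int:
    """M_c = n - 1 - INVBINCDF(CL, n, 1 - PFA_c); -1 means M_c is not defined."""
    return n - 1 - invbincdf(cl, n, 1.0 - pfa_c)


def critical_value_pfa_direct(cl: float, n: int, pfa_c: float) -> int:
    """Largest m with P(number of false alarms > m) >= cl, by direct search."""
    cdf, best = 0.0, -1
    for m in range(n + 1):
        cdf += comb(n, m) * pfa_c**m * (1.0 - pfa_c) ** (n - m)
        if 1.0 - cdf >= cl:
            best = m
    return best


# Illustrative values only (assumed): PD_c = 1 - PFA_c.
n, cl, pd_c, pfa_c = 30, 0.95, 0.85, 0.15
m_c = critical_value_pd(cl, n, pd_c)
M_c = critical_value_pfa(cl, n, pfa_c)
print(f"m_c = {m_c}, M_c = {M_c}, m_c + M_c = {m_c + M_c}")
print(f"direct search for M_c gives {critical_value_pfa_direct(cl, n, pfa_c)}")
```

A returned value of −1 corresponds to the case in which the number of trials is too small for the critical value to be defined.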

Hypothesis Testing and Confidence Bounds on Binomial Probability
We give here two statistical interpretations of Eq. (7) and Eq. (15). The first of these is related to a (lower) confidence limit for the binomial probability p. Such limits are supposed to provide a data-dependent interval containing the unknown p with a given probability called the confidence coefficient (see Hahn and Meeker, 1991).
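Assume that for the given CL, a lower confidence bound for PD = p of confidence coefficient CL is desired; that is, for a binomial observation $X \sim \mathrm{BIN}(n, p)$, one requires a function $p_- = p_-(X, n, \mathrm{CL})$ such that
$$P\{p_-(X, n, \mathrm{CL}) \le p\} \ge \mathrm{CL} \quad \text{for all } p. \tag{17}$$
The well-known solution of this problem for $X \ge 1$ is the root $p_-$ of the equation
$$\mathrm{BINCDF}(X - 1, n, p_-) = \mathrm{CL} \tag{18}$$
(e.g., Casella and Berger, 2002). When X = 0, $p_-(0, n, \mathrm{CL}) = 0$. Because BINCDF(X − 1, n, p) decreases in p, with $m_c$ defined by (7) the inequalities $p_- < \mathrm{PD}_c$ (strict inequality) and $X \le m_c - 1$ (non-strict inequality) are equivalent. Therefore, $m_c - 1$ is the largest value of the binomial BIN(n, p) variable for which the lower confidence bound for p stays below $\mathrm{PD}_c$; equivalently, the critical value $m_c$ is the smallest value for which this bound is at least $\mathrm{PD}_c$.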
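The one-sided lower confidence bound can be computed numerically as the root in p of BINCDF(X − 1, n, p) = CL. The sketch below uses plain bisection, which is an assumption made here for self-containment (a statistics package would typically use a beta quantile instead); the function names and the values n = 30, CL = 0.95, PD_c = 0.85 are illustrative only.

```python
from math import comb


def bincdf(x: int, n: int, p: float) -> float:
    """BINCDF(x, n, p) = P(X <= x) for X ~ BIN(n, p); 0.0 when x < 0."""
    return sum((comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(x + 1)), 0.0)


def lower_bound(x: int, n: int, cl: float, tol: float = 1e-12) -> float:
    """Lower confidence bound: root p of BINCDF(x - 1, n, p) = cl, or 0 if x == 0."""
    if x == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    # BINCDF(x - 1, n, p) decreases from 1 at p = 0 to 0 at p = 1,
    # so the root can be bracketed and found by bisection.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bincdf(x - 1, n, mid) > cl:
            lo = mid  # CDF still above cl: the root lies at a larger p
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative values only (assumed): compare the bound with PD_c = 0.85.
n, cl, pd_c = 30, 0.95, 0.85
for x in (27, 28, 29, 30):
    bound = lower_bound(x, n, cl)
    verdict = "reaches" if bound >= pd_c else "stays below"
    print(f"X = {x}: lower bound = {bound:.4f} ({verdict} PD_c = {pd_c})")
```

For these assumed values, the counts X = 28 and X = 29 fall on opposite sides of PD_c = 0.85, so the bound can be compared directly with the critical value obtained from the inverse cumulative distribution function.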
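A related interpretation is provided by the statistical hypothesis testing problem, $H_0: p < \mathrm{PD}_c$ against the alternative $H_1: p \ge \mathrm{PD}_c$. The most powerful test of level 1 − CL rejects $H_0$ when the observed value X exceeds $m_c - 1$, i.e., $X \ge m_c$ (which means the same as $p_-(X, n, \mathrm{CL}) \ge \mathrm{PD}_c$). The critical value for PFA has a similar statistical interpretation; namely, $M_c$ is the largest value of the binomial variable for which the upper confidence bound for the binomial probability does not exceed $\mathrm{PFA}_c$. Indeed, an upper confidence bound $p_+ = p_+(X, n, \mathrm{CL})$ of confidence coefficient CL, which satisfies
$$P\{p \le p_+(X, n, \mathrm{CL})\} \ge \mathrm{CL} \quad \text{for all } p, \tag{19}$$
has the form of the root $p_+$ of the equation
$$\mathrm{BINCDF}(X, n, p_+) = 1 - \mathrm{CL} \tag{20}$$
for $X \le n - 1$, with $p_+(n, n, \mathrm{CL}) = 1$. Since BINCDF(X, n, p) decreases in p, the inequality $p_+ \le \mathrm{PFA}_c$ holds exactly when $1 - \mathrm{BINCDF}(X, n, \mathrm{PFA}_c) \ge \mathrm{CL}$, i.e., when $X \le M_c$.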

Examples
Consider an example in which one finds twenty-nine correct results in a single set of thirty trials. If the system under test conforms to a binomial distribution, then based on the result of twenty-nine out of thirty correct responses in that one set of tests, one can make multiple correct inferences, such as: the PD > 0.95 with 44 % confidence, the PD > 0.90 with 81 % confidence, or the PD > 0.85 with 95 % confidence.
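The three confidence levels quoted for twenty-nine correct results in thirty trials can be reproduced from the definition CL(m, n, PD_c) = BINCDF(m − 1, n, PD_c). The short Python check below is a sketch; the function name `confidence_level_pd` is a hypothetical helper.

```python
from math import comb


def confidence_level_pd(m: int, n: int, pd_c: float) -> float:
    """CL(m, n, PD_c): sum of b(i; n, PD_c) for i = 0, ..., m - 1."""
    return sum(comb(n, i) * pd_c**i * (1.0 - pd_c) ** (n - i) for i in range(m))


# Twenty-nine correct results in thirty trials, as in the example above.
m, n = 29, 30
for pd_c in (0.95, 0.90, 0.85):
    print(f"PD_c = {pd_c:.2f}: CL = {confidence_level_pd(m, n, pd_c):.3f}")
```

The computed levels agree with the quoted percentages once the fractional part of each percentage is dropped.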
One can easily construct a table that simultaneously includes requirements for both PD and PFA. Table 1 gives the critical values $M_c$ and $n - m_c$ for 68 % confidence to show the general characteristics of these quantities. These are the maximum permissible numbers of incorrect results that may be tolerated in establishing the specified PD or PFA values at this level of confidence. If the tabulated value is indicated as "*", then the number of trials in that set is insufficient to establish the corresponding PD or PFA at this confidence level. One may generate tables of this kind for any CL, PD, and PFA using Eq. (7) and Eq. (15) with the previously mentioned functions, such as binoinv or CRITBINOM, from statistical software packages or spreadsheet applications. The actual value of $M_c$ and $n - m_c$ given by these functions in the cases marked by "*" is −1.
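A table of this kind can be generated with a few lines of code. The following Python sketch is only an illustration: the helper names are hypothetical, the inverse cumulative distribution function is written out by hand, and the row and column values (numbers of trials and PD values) are assumptions that do not reproduce the layout of Table 1.

```python
from math import comb


def invbincdf(cl: float, n: int, p: float) -> int:
    """Smallest nonnegative integer k with BINCDF(k, n, p) >= cl."""
    total = 0.0
    for k in range(n + 1):
        total += comb(n, k) * p**k * (1.0 - p) ** (n - k)
        if total >= cl:
            return k
    return n  # reached only through floating-point rounding


def max_errors(cl: float, n: int, p_correct: float) -> int:
    """n - m_c: largest permitted number of incorrect results; -1 if n is too small.

    p_correct is PD_c for detection tests, or 1 - PFA_c for false alarm tests.
    """
    return n - (1 + invbincdf(cl, n, p_correct))


# Assumed rows and columns, for illustration only.
cl = 0.68
pd_values = (0.80, 0.90, 0.95)
trial_counts = (10, 20, 30, 50, 100)

print(f"{'n':>5}" + "".join(f"{pd:>8.2f}" for pd in pd_values))
for n in trial_counts:
    cells = []
    for pd in pd_values:
        errors = max_errors(cl, n, pd)
        cells.append("*" if errors < 0 else str(errors))
    print(f"{n:>5}" + "".join(f"{cell:>8}" for cell in cells))
```

Because of the symmetry between the two critical values, the column for a given PD also serves for PFA = 1 − PD.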
The symmetry of testing requirements when PFA = 1 -PD permits tabulating the results for PFA and PD in a single table, but it does not imply that PFA should or must always be chosen equal to 1 -PD. The PD and PFA values may be assigned independently in any testing protocol. In fact, to avoid disruption of the stream of commerce by large numbers of false alarms, it is often necessary to require inspection equipment to have PFA smaller than 1 -PD.
By solving (6) or (10), we obtain a formula for the minimum number of required trials $n_k$ needed to establish a given value of PD or PFA for the same CL,
$$n_k = \left\lceil \frac{\ln(1 - \mathrm{CL})}{\ln \mathrm{PD}} \right\rceil \quad \text{or} \quad n_k = \left\lceil \frac{\ln(1 - \mathrm{CL})}{\ln(1 - \mathrm{PFA})} \right\rceil. \tag{21}$$
Here $\lceil a \rceil$ denotes the smallest integer greater than or equal to a. This formula is useful in designing test protocols that give the most satisfactory requirement with the least amount of testing. Figure 1 shows $n_k$ plotted as a function of PD and CL. This function increases much more rapidly for PD → 1 than for CL → 1.
Similarly, $n_k$ in (21) would increase much more rapidly for PFA → 0 than for CL → 1.
When only the minimum number of trials $n_k$ is performed, the system must give 100 % correct results to establish the specified PD or PFA at the desired confidence CL. In statistical terms, $n_k$ is the smallest number of trials with 100 % correct detections such that the CL lower confidence bound for the detection probability exceeds the given value PD. The same is true when there are no false alarms, with the CL upper confidence bound on the false alarm probability being less than PFA. A table such as Table 1 shows how many errors may be permitted if a larger number of trials is carried out, while still establishing the specified PD or PFA at the desired CL.
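The minimum number of trials follows from requiring 1 − PD^n ≥ CL when all n trials are correct (or 1 − (1 − PFA)^n ≥ CL when there are no false alarms). The Python sketch below evaluates the resulting logarithmic formula; the function name `min_trials` and the sample values of CL, PD, and PFA are illustrative assumptions.

```python
from math import ceil, log


def min_trials(cl: float, p_correct: float) -> int:
    """Smallest n with 1 - p_correct**n >= cl, i.e. ceil(ln(1 - cl) / ln(p_correct)).

    p_correct is PD for detection tests, or 1 - PFA for false alarm tests.
    Floating-point rounding could matter if the ratio is extremely close to an integer.
    """
    return ceil(log(1.0 - cl) / log(p_correct))


# Illustrative values only (assumed).
for cl, pd in ((0.68, 0.90), (0.95, 0.90), (0.95, 0.95), (0.95, 0.99)):
    print(f"CL = {cl:.2f}, PD = {pd:.2f}: n_k = {min_trials(cl, pd)}")

# The same function serves for PFA by passing 1 - PFA.
pfa = 0.10
print(f"CL = 0.95, PFA = {pfa:.2f}: n_k = {min_trials(0.95, 1.0 - pfa)}")
```

The rapid growth of the result as PD approaches 1 can be seen by comparing the rows for PD = 0.95 and PD = 0.99.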

Discussion and Conclusions
The formula for $n_k$ shows that requiring either PD or CL to be too near unity can result in impossibly large numbers of pass-fail tests. If such rigorous criteria are in fact required, then one should search for some method of verification different from pass-fail testing.
The results presented here make it possible to design pass-fail testing protocols based on functions readily available in statistical software packages and general spreadsheet applications.