Inverse Sampling for McNemar's Test

Inverse sampling for McNemar's test is studied. Sampling is conducted until a pre-specified number of discordant pairs is observed instead of sampling until a pre-specified total number of pairs is observed. The joint likelihood is decomposed into a product of a negative binomial distribution for the number of pairs required to observe r discordant pairs, a binomial distribution for the number of successes in the concordant observations, and a binomial distribution for the number of successes in the discordant observations. Since inference in this problem is based on the discordant observations, inverse sampling controls the type II error when small numbers of discordant observations are observed and the exact binomial test is required. The control results from fixing the sample size for the exact binomial test. Standard sampling instead lets the sample size for the exact binomial test vary and then performs the test conditionally on the observed number of discordant pairs.


Introduction
This paper focuses on planning an experiment on paired binary responses. We want our hypothesis test comparing the probability of success in the two margins to control both the type I error and power. During the planning stage of the experiment, there are many possible 2x2 tables that may result, and each has a critical region. The size of the critical region under the null hypothesis is the type I error rate, and the size of the critical region under the alternative hypothesis is the power. We create the critical regions for all possible tables so that the type I error is held at some fixed level; however, this means that the power may vary from table to table. Generally, the power of a discrete test is taken to be the expected power over all the possible tables that can result in the experiment. This metric is adequate if we wish to describe the average performance of the test when used repeatedly by the scientific community. However, for the individual experiment, only one table is observed, and we would like some assurance that the size of the critical region is controlled under both the null and alternative hypotheses. Miettinen (1968) calls the average power "unconditional power" and the power of an individual table "conditional power". We want to avoid the unlucky situation of making a type II error simply because conditional power is low for the table we observe in our experiment.
This paper examines conditional power for different sampling schemes and test statistics for the paired binary experiment. Inverse sampling may be used to bring the conditional power of McNemar's test (McNemar, 1947) closer to the nominal power. Inference is based on the number of discordant pairs. Inverse sampling controls the type II error when small numbers of discordant observations are observed and the exact binomial test is used. The control results from fixing the sample size for the exact binomial test. Standard sampling lets the sample size for the exact binomial test vary and then performs the test conditionally on the observed number of discordant pairs. The conditional power can be less than nominal if a small number of discordant events occur.
The paper first presents probability models, estimators and test statistics for paired binary experiments. Probability models for standard and inverse sampling schemes are developed. Large sample and exact tests are presented. Sample size estimation is then considered for the sampling schemes and test statistics. An example follows, which illustrates the kinds of considerations needed in selecting a sampling scheme and test statistic. Tests are compared in terms of the percent of tables with conditional power less than nominal power and the standard deviation of the conditional power.

Probability Models
This section provides probability models, densities for the two sampling schemes, estimators, and some advice on sample size selection. Note that McNemar (1947) describes the large sample test under standard sampling. The exact test is an obvious extension of McNemar's test when the number of discordant events is small.

Probability Model for the Paired Binary Experiment
McNemar's test is used when two binary measurements are made on the same unit of analysis. Example experiments include before/after studies on the same subject and studies in which two judges rate the same experimental unit. Table 1 summarizes the responses where two treatments are given to a subject and the response to each treatment is positive or negative. There are 4 possible pairs of outcomes: (1,1), (1,0), (0,1) and (0,0). A concordant pair is one where the two responses are the same, and N_c denotes the number of concordant pairs; a discordant pair is one where the responses differ, and N_d denotes the number of discordant pairs.
For standard sampling, the total number of subjects is fixed in advance, say N = N_c + N_d, and sampling continues until N is reached. The number of discordant events, r, is random and is conditioned upon at the end of the study. For inverse sampling, the total number of discordant events is fixed in advance, r = N_d, and sampling continues until r is reached. In this case, the number of concordant events, N_c, is a random variable.
The usual treatment difference of interest is between the first row marginal probability and the first column marginal probability, ∆ = p_1. − p_.1 = p_10 − p_01. This subtraction removes p_11 from the treatment comparison and isolates the treatment difference in the difference between the probabilities of the discordant pairs. The cell probabilities may be reparameterized into expressions involving the probability of a discordant pair, ρ = p_10 + p_01, the treatment difference and the marginal row/column probabilities (Table 2), so that p_10 = (ρ + ∆)/2 and p_01 = (ρ − ∆)/2. Note that this parameterization imposes the constraint |∆| ≤ ρ so that p_01 and p_10 lie in [0, 1]. That is, the absolute value of the treatment difference must be less than or equal to the probability of a discordant pair for this parameterization to be well formulated.
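As a minimal sketch of this reparameterization, the following Python function recovers the four cell probabilities from the two margins and ρ. It assumes ∆ = p_1. − p_.1 = p_10 − p_01 and ρ = p_10 + p_01; the function and argument names are illustrative and do not come from the paper. The inputs are the design values used in the Examples section.

```python
def cell_probabilities(p_row1, p_col1, rho):
    """Map (p_1., p_.1, rho) to (p11, p10, p01, p00).

    Assumption: delta = p_1. - p_.1 = p10 - p01 and rho = p10 + p01.
    """
    delta = p_row1 - p_col1
    if abs(delta) > rho:
        raise ValueError("need |delta| <= rho so that p10 and p01 are in [0, 1]")
    p10 = (rho + delta) / 2
    p01 = (rho - delta) / 2
    p11 = p_row1 - p10          # first-row margin minus its discordant cell
    p00 = 1 - rho - p11         # remaining concordant probability
    if p11 < 0 or p00 < 0:
        raise ValueError("margins and rho imply a negative concordant cell")
    return p11, p10, p01, p00


# Design values from the Examples section: p_1. = 0.87, p_.1 = 0.75, rho = 0.2.
p11, p10, p01, p00 = cell_probabilities(0.87, 0.75, 0.2)
```

The explicit check on |∆| ≤ ρ makes the constraint visible at the point where a design is specified, rather than surfacing later as an invalid probability.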

Joint Distribution for Standard Sampling
Under standard sampling, the response pairs are 4-fold multinomial random variables. The multinomial distribution may be factored into 3 densities under McNemar's parameterization (see Appendix 5.1). There is a binomial distribution for each diagonal in the table and a binomial distribution for the number of discordant pairs. The joint distribution under standard sampling is therefore the product of three binomial densities: r ∼ Bin(N, ρ), x_10 | r ∼ Bin(r, π) with π = (ρ + ∆)/(2ρ), and x_11 | r ∼ Bin(N − r, p_11/(1 − ρ)). Tables may be generated by first generating r, and then x_11 and x_10 using the generated value of r. The test statistics are well known (McNemar, 1947; Miettinen, 1968; Newcombe, 1998):
• exact test statistic: x_10 in an exact binomial test that π = 0.5, where π = (ρ + ∆)/(2ρ) = 0.5 =⇒ H_0: ∆ = 0.
Although the information about the treatment difference, ∆, seems isolated in the distribution for x_10, the constraint |∆| ≤ ρ makes ρ a nuisance parameter that cannot be entirely removed from power calculations, and information about ρ is contained in the other factors of the likelihood. The probability of a discordant pair is only removed from the model for x_10 under the null hypothesis.
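The table-generation recipe for standard sampling can be sketched as follows. This is an illustration under stated assumptions, not the author's simulation code: the conditional success probabilities π = (ρ + ∆)/(2ρ) and p_11/(1 − ρ) follow from the factorization, the helper names are hypothetical, and binomial draws are built from Bernoulli trials so that only the standard library is needed.

```python
import random


def draw_binomial(rng, n, p):
    """Binomial(n, p) draw as a sum of n Bernoulli trials (stdlib only)."""
    return sum(rng.random() < p for _ in range(n))


def standard_sampling_table(rng, n_total, rho, delta, p11):
    """Return (x11, x10, x01, x00) for a fixed total of n_total pairs."""
    r = draw_binomial(rng, n_total, rho)                    # number of discordant pairs
    pi = (rho + delta) / (2 * rho)                          # P(pair is (1,0) | discordant)
    x10 = draw_binomial(rng, r, pi)
    x11 = draw_binomial(rng, n_total - r, p11 / (1 - rho))  # (1,1) among concordant pairs
    return x11, x10, r - x10, n_total - r - x11


rng = random.Random(2024)   # arbitrary seed, for reproducibility only
table = standard_sampling_table(rng, n_total=111, rho=0.2, delta=0.12, p11=0.71)
```

Generating r first and then the two diagonals mirrors the order described in the text; the total always equals N while the number of discordant pairs varies from table to table.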

Joint Distribution for Inverse Sampling
There is a binomial distribution for each diagonal in the table and a negative binomial distribution for the number of concordant pairs (see Appendix 5.2). The joint density is the product of two binomial densities and a negative binomial density: N_c ∼ NB(r, ρ), x_11 | N_c ∼ Bin(N_c, p_11/(1 − ρ)), and x_10 ∼ Bin(r, π). Tables may be generated by first generating N_c, and then x_11 using the generated value of N_c. The value of x_10 is generated independently of x_11 and N_c. The test statistics are well known (McNemar, 1947; Miettinen, 1968; Newcombe, 1998):
• small sample test statistic: x_10 in an exact binomial test that π = 0.5, where π = (ρ + ∆)/(2ρ) = 0.5 =⇒ H_0: ∆ = 0,
• N_c is random: it is the number of concordant pairs observed by the time r discordant pairs have been observed, and it is independent of x_10.
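A comparable sketch for inverse sampling is given below. It assumes the same conditional probabilities as the factorization describes and simulates N_c directly by counting concordant pairs until the r-th discordant pair arrives; the function name is hypothetical.

```python
import random


def inverse_sampling_table(rng, r, rho, delta, p11):
    """Return (x11, x10, x01, x00); sampling stops at the r-th discordant pair."""
    n_c = 0
    discordant = 0
    while discordant < r:                 # N_c follows a negative binomial(r, rho)
        if rng.random() < rho:
            discordant += 1
        else:
            n_c += 1
    pi = (rho + delta) / (2 * rho)
    x10 = sum(rng.random() < pi for _ in range(r))                  # independent of N_c
    x11 = sum(rng.random() < p11 / (1 - rho) for _ in range(n_c))   # depends on N_c
    return x11, x10, r - x10, n_c - x11


rng = random.Random(2024)   # arbitrary seed, for reproducibility only
x11, x10, x01, x00 = inverse_sampling_table(rng, r=23, rho=0.2, delta=0.12, p11=0.71)
```

Here the discordant total x_10 + x_01 is always r, while the overall sample size r + N_c changes from one simulated study to the next.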

Maximum Likelihood Estimators
The joint likelihoods for the two sampling schemes differ by a normalization factor and in which variables are random. The kernels of both likelihoods are symbolically identical. With N = N_c + r, the estimators are ρ̂ = r/N, p̂_10 = x_10/N, p̂_11 = x_11/N and ∆̂ = (x_10 − x_01)/N. Under standard sampling, ρ̂, p̂_10 and p̂_11 are just functions of the estimated cell probabilities since N = N_c + r. For inverse sampling, ρ̂ is the usual MLE of ρ for the negative binomial distribution. Confidence intervals for ∆ under standard sampling are discussed in Newcombe (1998). The simplest interval is ∆̂ ± z_{1−α/2} se, where se = √((N_c N_d + 4 x_10 x_01)/N^3). A similar confidence interval for ∆ under inverse sampling is ∆̂ ± z_{1−α/2} √V(∆̂), where V(∆̂) may be found in equation 8 of Appendix 5.3.
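The simplest interval under standard sampling can be computed as in the sketch below. It assumes the point estimates ρ̂ = N_d/N and ∆̂ = (x_10 − x_01)/N together with the standard error formula quoted above; the function name and the example counts are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(x11, x10, x01, x00, alpha=0.05):
    """Return (rho_hat, delta_hat, se, (lower, upper)) for a 2x2 paired table."""
    n_c, n_d = x11 + x00, x10 + x01
    n = n_c + n_d
    rho_hat = n_d / n
    delta_hat = (x10 - x01) / n
    se = sqrt((n_c * n_d + 4 * x10 * x01) / n**3)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return rho_hat, delta_hat, se, (delta_hat - z * se, delta_hat + z * se)


# Hypothetical counts (N = 111, N_d = 23), not taken from the paper.
rho_hat, delta_hat, se, ci = wald_interval(x11=79, x10=18, x01=5, x00=9)
```

The standard error expression is algebraically the same as √(N_d − (x_10 − x_01)²/N)/N, which gives a quick consistency check on any implementation.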

Sample Size Selection under Inverse Sampling
Sample size selection for standard sampling is discussed in Miettinen (1968) and Connet et al. (1987). Under inverse sampling, methods for exact binomial sample size estimation may be used because r is fixed and x_10 ∼ Bin(r, π). The hypothesis of interest is H_0: π = 0.5 versus H_1: π = δ, where δ = (ρ + ∆)/(2ρ). However, the detectable difference, δ, depends on the probability of a discordant event, ρ. This is also an issue in standard sampling. The dependency is further complicated by the fact that the size of the treatment difference, |∆|, cannot exceed ρ. For a fixed ∆, larger values of ρ reduce the value of δ and decrease the power of the exact binomial test. Some prior information about ρ is needed to help pick δ. Miettinen (1968) and Connet et al. (1987) suggest internal pilot studies or external evidence to estimate ρ. Note that most information about ρ is contained in N_c ∼ NB(r, ρ), and this distribution does not contain direct information about the treatment difference. Hence information about ρ may be gathered as the study progresses without unmasking the treatment codes. Futility might be declared if the current estimate of ρ is unlikely to exceed a pre-specified value of ∆.
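The dependence of the detectable difference on ρ can be made concrete with a short sketch. It assumes δ is the value of π = (ρ + ∆)/(2ρ) under the alternative, which matches δ = 0.8 for ρ = 0.2 and ∆ = 0.12 in the example design; the function name is hypothetical.

```python
def detectable_pi(rho, delta):
    """Alternative value of pi = P(pair is (1,0) | pair is discordant)."""
    if not 0 < abs(delta) <= rho <= 1:
        raise ValueError("need 0 < |delta| <= rho <= 1")
    return (rho + delta) / (2 * rho)


# For a fixed treatment difference, the alternative moves toward 0.5 as rho grows.
for rho in (0.12, 0.2, 0.3, 0.4):
    print(rho, round(detectable_pi(rho, 0.12), 3))
```

At the boundary ρ = |∆| every discordant pair falls in one cell, so π reaches 1; as ρ increases the alternative approaches the null value 0.5, which is why a prior guess at ρ matters for planning.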

Examples
All of the examples assume the same design: p_1. = 0.87, p_.1 = 0.75, r = 23, ∆ = 0.12, ρ = 0.2, type I error = 0.05 one-sided, and type II error = 0.1. The number of discordant pairs, r = 23, is the smallest sample size for which the exact binomial test under inverse sampling has at least 90% power. With r = 23 chosen for inverse sampling, the fixed sample size for standard sampling is set to N = 111 so that ceiling(ρ * N) = r. Ten thousand 2x2 tables were generated for each sampling method assuming these parameters.
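The choice of r can be checked with a direct search over exact binomial tests. The sketch below assumes a one-sided test of π = 0.5 that rejects for large x_10, with the alternative π = (ρ + ∆)/(2ρ) = 0.8 implied by the design; the function names are hypothetical and only the standard library is used.

```python
from math import comb


def upper_tail(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))


def critical_value(n, alpha):
    """Smallest c with P(X >= c | pi = 0.5) <= alpha; n + 1 means 'never reject'."""
    return next(c for c in range(n + 2) if upper_tail(c, n, 0.5) <= alpha)


def exact_power(n, pi_alt, alpha):
    """Probability of the critical region under the alternative pi_alt."""
    return upper_tail(critical_value(n, alpha), n, pi_alt)


def smallest_r(pi_alt, alpha, target_power, r_max=500):
    """First r whose exact one-sided test reaches the target power."""
    for r in range(1, r_max + 1):
        if exact_power(r, pi_alt, alpha) >= target_power:
            return r
    raise ValueError("no r <= r_max reaches the target power")


r = smallest_r(pi_alt=0.8, alpha=0.05, target_power=0.90)
c = critical_value(r, 0.05)
power = exact_power(r, 0.8, 0.05)
```

Under these assumptions the search stops at r = 23 with rejection for x_10 ≥ 16, in line with the design stated above. Because the power of a discrete test is not monotone in r, the search returns the first r that reaches the target rather than assuming every larger r also does.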

Effect of Sampling Methods on Large Sample Test
This section provides some examples of the effect of the sampling methods on the large sample tests. In particular, the examples demonstrate what may happen to conditional power and ρ in equations 4.3 and 6.2 of Miettinen (1968). Note that a popular sample size calculator program, nQuery Advisor (nQuery Advisor 6.0, 2005, Method POT1), computes unconditional power using equation 5.3 from Miettinen (1968). The conclusion is that inverse sampling produces a smaller probability of having a table with less than 90% conditional power.
The algorithm for generating tables in this example is as follows: Generate a table from the sampling distribution under the alternative distribution.
Repeat these steps many times and summarize the output.
The distribution of T appears similar for both methods, which means they will reach similar decisions. The standard error of the difference is constant for inverse sampling and variable for standard sampling. Consequently, the confidence interval width on the treatment difference will be constant for inverse sampling. The difference between Figures 1 and 2 is remarkable. While both curves have declining power as the fraction of discordant events increases, the power may be much lower for standard sampling at values of ρ smaller than 0.2. Also, while both sampling schemes have declining power with increasing ρ, inverse sampling presents a more predictable decline that agrees with the fact that δ declines with increasing ρ. The reason is that the number of random variables in the test statistic, T, differs between the sampling methods. Under inverse sampling T involves only one random variable, whereas under standard sampling it involves two.
Under standard sampling, 31.9% of tables have less than 90% conditional power and the standard deviation of the conditional power is 5.4% (Figure 1). Under inverse sampling, 18.8% of tables have less than 90% conditional power and the standard deviation of the conditional power is 5.0% (Figure 2).

Effect of Sampling Methods on Exact Test
This section provides some examples of the effect of the sampling methods on the exact tests. In particular, the examples demonstrate what may happen to power and ρ.

The algorithm is:
• Generate a table from the sampling distribution under the alternative distribution.
• Use the number of discordant events to determine the critical values for the test assuming the null distribution.
• Calculate the size of the critical region under the assumed alternative distribution (conditional power).
• Calculate the proportion of discordant events.
• Repeat these steps many times and plot ρ̂ = N_d/N versus power.
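The steps above can be sketched for the exact test as follows. This is a minimal illustration under stated assumptions rather than the original simulation: it uses the example design (N = 111, ρ = 0.2, alternative π = 0.8, one-sided α = 0.05), generates only the number of discordant pairs because the conditional power depends on the table only through N_d, and uses hypothetical function names.

```python
import random
from functools import lru_cache
from math import comb


def upper_tail(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p); an empty range gives 0."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))


@lru_cache(maxsize=None)
def conditional_power(n_d, pi_alt=0.8, alpha=0.05):
    """Power of the one-sided exact test given n_d discordant pairs."""
    # Critical value from the null distribution Binomial(n_d, 0.5).
    c = next(c for c in range(n_d + 2) if upper_tail(c, n_d, 0.5) <= alpha)
    # Size of that critical region under the alternative.
    return upper_tail(c, n_d, pi_alt)


def simulate_standard(rng, n_tables, n_total=111, rho=0.2):
    """Return (proportion discordant, conditional power) for each simulated table."""
    out = []
    for _ in range(n_tables):
        n_d = sum(rng.random() < rho for _ in range(n_total))
        out.append((n_d / n_total, conditional_power(n_d)))
    return out


results = simulate_standard(random.Random(1), n_tables=2000)
share_below = sum(power < 0.90 for _, power in results) / len(results)
inverse_power = conditional_power(23)   # r is fixed, so every table has this power
```

Plotting the pairs in `results` gives the kind of scatter described for standard sampling, while under inverse sampling the single value `inverse_power` applies to every table.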
For standard sampling, N = 111 pairs were used, and the expected number of discordant pairs, rounded up, is ceiling(ρ * N) = 23. The detectable difference, δ, is 0.80 to get 90% power in a one-sided exact binomial test with a 5% chance of a type I error. The critical region is to reject the null hypothesis if x_10 ≥ 16. Note that the size of the critical region varies as the number of discordant pairs varies under standard sampling, but not under inverse sampling.
For inverse sampling, we sample until r = 23 discordant pairs are observed, so the total sample size will vary around N = 111 pairs. However, the critical region will stay fixed since r is fixed. The detectable difference will remain at δ = 0.8. As a consequence, there is only one test to be planned for, and its power will remain constant.
Figures 3 and 4 illustrate these observations. The power function decreases as the probability of discordant events decreases in standard sampling. Still, it is a smoother power function than that of the large sample test. This indicates that the exact test is probably a better test to use than T in terms of the predictability of the size of the critical region under the alternative hypothesis.

Conditional Power Comparisons
This section presents the sort of considerations needed for planning a study. While this example shows a weakness of using a paired binary experiment, not all proposed designs will have these weaknesses to the same degree. The large sample test may actually be acceptable under standard sampling and different assumptions about sample size, parameters and type I/II errors.
Table 3 contains descriptive statistics for the four tests in this example. The first row is the average of the conditional power, which estimates unconditional power. The second row is the percent of tables with conditional power less than the nominal value of 90%. The last row is the standard deviation of the conditional power.
While the unconditional power of the large sample test under standard sampling compared favorably with that of the exact test under inverse sampling, there was a 31.9% chance that the conditional power was less than the nominal power for standard sampling. The exact test under standard sampling had less than nominal unconditional power, 87.2%, since the number of discordant pairs was allowed to vary. It also had a 60.8% chance of having conditional power less than 90%. The large sample test under inverse sampling had 94.5% unconditional power, but an 18.8% chance of having less than 90% conditional power. The standard deviations of the conditional power were comparable for the first 3 tests in the table, but the standard deviation was 0% for the exact test under inverse sampling.
Another thing to consider for inverse sampling is the total sample size. In this example, the average sample size was N = 92, which is a savings over standard sampling. There was a 79% chance that the sample size was less than 111. The median was N = 97 with an interquartile range of 26.
Figure: Empirical distribution function of total sample size under inverse sampling

Summary
Inverse sampling can control the size of the critical region under the alternative for tables that may result in a paired binary experiment. Inverse sampling also removes the variability of the number of discordant events at the expense of a variable overall sample size.
One remaining issue is to develop interim analyses for inverse sampling. It is fairly easy to apply well known procedures for the one sample binomial test based on the number of discordant pairs (Jennison & Turnbull, 2000). However, the number of concordant pairs is unbounded. To stop based on futility, one could use a Bayesian rule based on the constraint that |∆| ≤ ρ. A conjugate beta prior could be placed on ρ, say ρ ∼ Beta(α, β), with N_c ∼ NB(r, ρ). The hyperparameters would be selected based on the proposed design and sample size calculation. In particular, one would select the hyperparameters on the basis of how precisely ρ can be estimated under the proposed design. It is easy to show that the posterior distribution of ρ given n_c is also a beta distribution, ρ | n_c ∼ Beta(α + r, β + n_c). One could compute the posterior probability that the constraint will be met, P[ρ > |∆| | n_c], and stop for futility if that probability is small. The advantage of this test is that it does not require unmasking the treatment code. This test may also work for standard sampling since most of the information about ρ is contained in a binomial density. However, the posterior distribution would condition on a second random variable, n_d, through α* = α + n_d. More work is needed to formulate something like this based on frequentist testing, but the principles would be similar.
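The posterior probability in this futility rule can be approximated by simulation, as in the sketch below. It assumes the conjugate posterior Beta(α + r, β + n_c) stated above; the uniform Beta(1, 1) prior, the number of draws and the function name are illustrative choices, not recommendations from the paper.

```python
import random


def prob_constraint_met(alpha, beta, r, n_c, delta, draws=20000, seed=0):
    """Monte Carlo estimate of P[rho > |delta| | n_c].

    Assumes rho | n_c ~ Beta(alpha + r, beta + n_c).
    """
    rng = random.Random(seed)
    hits = sum(rng.betavariate(alpha + r, beta + n_c) > abs(delta)
               for _ in range(draws))
    return hits / draws


# Hypothetical interim looks with a uniform prior (alpha = beta = 1).
likely = prob_constraint_met(1, 1, r=23, n_c=92, delta=0.12)     # about as planned
unlikely = prob_constraint_met(1, 1, r=23, n_c=400, delta=0.12)  # far more concordant pairs
```

Only the counts r and n_c enter the calculation, so the check can be run without knowing which discordant pairs favor which treatment.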

Factorization of the Joint Density for Standard Sampling
The sampling distribution is multinomial. The multinomial coefficient may be factored into 3 binomial coefficients by multiplying and dividing by n_d! n_c!. The desired factorization then results from multiplying and dividing by ρ^{n_d} (1 − ρ)^{n − n_d} and substituting the parameterization from Table 2.
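The two steps just described can be written out as a worked derivation. The display below is a reconstruction from the definitions in the text, assuming n_d = x_10 + x_01, n_c = x_11 + x_00 = n − n_d and ρ = p_10 + p_01; it is a sketch and may differ in notation from the paper's own display.

```latex
\begin{align*}
\frac{n!}{x_{11}!\,x_{10}!\,x_{01}!\,x_{00}!}
  &= \frac{n!}{n_d!\,n_c!}\cdot\frac{n_d!}{x_{10}!\,x_{01}!}\cdot\frac{n_c!}{x_{11}!\,x_{00}!}
   = \binom{n}{n_d}\binom{n_d}{x_{10}}\binom{n_c}{x_{11}}, \\[4pt]
p_{11}^{x_{11}}\,p_{10}^{x_{10}}\,p_{01}^{x_{01}}\,p_{00}^{x_{00}}
  &= \rho^{n_d}(1-\rho)^{n-n_d}
     \left(\frac{p_{10}}{\rho}\right)^{x_{10}}\left(\frac{p_{01}}{\rho}\right)^{x_{01}}
     \left(\frac{p_{11}}{1-\rho}\right)^{x_{11}}\left(\frac{p_{00}}{1-\rho}\right)^{x_{00}}.
\end{align*}
```

Multiplying the two lines gives a Bin(n, ρ) density for n_d, a Bin(n_d, p_10/ρ) density for x_10 and a Bin(n_c, p_11/(1 − ρ)) density for x_11, where p_10/ρ = (ρ + ∆)/(2ρ) = π.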

Figure 1. Conditional power for tables for large sample test under standard sampling
Figure 2. Conditional power for tables for large sample test under inverse sampling

Figure 3. Conditional power for tables for exact test under standard sampling

Figure 4. Conditional power for tables for exact test under inverse sampling

Table 1. 2x2 Table Count Statistics in the Paired Binary Problem

Table 2. 2x2 Table Probabilities in Paired Binary Problem

Table 3. Descriptive Statistics for Conditional Power for the 4 Tests in the Example.