A two-stage validation study for determining sensitivity and specificity.

A two-stage procedure for estimating sensitivity and specificity is described. The procedure is developed in the context of a validation study for self-reported atypical nevi, a potentially useful measure in the study of risk factors for malignant melanoma. The first stage consists of a sample of N individuals classified only by the test measure. The second stage is a subsample of size m, stratified according to the information collected in the first stage, in which the presence of atypical nevi is determined by clinical examination. Using missing data methods for contingency tables, maximum likelihood estimators for the joint distribution of the test measure and the "gold standard" clinical evaluation are presented, along with efficient estimators for the sensitivity and specificity. Asymptotic coefficients of variation are computed to compare alternative sampling strategies for the second stage.


Introduction
In epidemiologic research, validation studies often have as a goal the determination of the sensitivity and specificity of a "test" for the presence of a risk factor. This test is usually easier and cheaper to administer than a more accurate "gold standard" method. Knowledge of the specificity and sensitivity of the test can be used in sample size calculations for subsequent studies of the effect of the risk factor and to adjust relative risk estimates for measurement error. This knowledge can also be important in the evaluation of the clinical utility of the test as a diagnostic tool.
An example of the use of a test measure occurs in epidemiologic studies of incidence and prevalence of neoplastic skin disease, where the use of self-reported counts of atypical nevi rather than clinical examination can lead to substantial reduction in study costs. Recent studies of the importance of the presence of nevi as a predictor of melanoma have employed both self-reported counts [e.g., Bain et al. (1)] and physical examination by a dermatologist or trained interviewer [e.g., Augustsson et al. (2,3)]. However, the sensitivity and specificity of self-reports of the presence of atypical nevi are not well known, although validation studies of self-reported aggregate measures of body nevus density (4) suggest a measurable correlation with interviewer determinations, and studies using both types of data are able to show significant relative risks for melanoma using either measure (5).
A recent survey in Sweden successfully solicited mail questionnaire responses from 50,000 women between the ages of 35 and 55 ("Women's Lifestyles and Health Study," H-O Adami, unpublished). Respondents were asked to examine their skin for the presence of atypical moles. At present, no analysis of the results of the survey has been performed. A number of alternative designs are being considered to use the survey to estimate the sensitivity and specificity of self-reported atypical nevi (SRAN). Specifically, a two-stage procedure is proposed in which the first stage would be a random sample of the original cohort to determine the prevalence of SRAN. In the second stage, a random sample of the individuals identified in the first stage would be examined to obtain physician-diagnosed atypical nevi (PDAN). The second-stage sample could be stratified by level of SRAN as determined in the first stage. In this case, an advantage in estimating sensitivity and specificity could be achieved by manipulating the relative sizes of the test positive and test negative samples.
The purpose of this article is to demonstrate the utility of a two-stage validation study design as applied to self-reported counts of atypical nevi. In the sections that follow, estimators for sensitivity and specificity in the proposed designs are described, along with their approximate standard errors. The relative efficiencies of the designs are used to compare the proposed designs and to assess the design for the SRAN validation study.
At the first stage, design B also specifies a sample of size $N$ to obtain SRAN. However, in the second stage the number of individuals selected who have a positive SRAN, $m_1$, and the number of individuals selected who have a negative SRAN, $m_0$, are controlled, under the constraint that $m_0 + m_1 = m$.

Design Options
For simplicity, it is assumed that the counts of atypical nevi have been reduced to a binary classification (i.e., presence or absence of nevi). The object of the design is to manipulate the ratio $m_1/m_0$ to achieve a higher precision than the other designs at an equivalent cost. Of course, it is not possible for $m_1$ to exceed the number of test positives in the first stage, $N_1$, or for $m_0$ to exceed the number of test negatives in the first stage, $N_0$.
The data from both these designs can be arranged as in Figure 2. The number of complete observations is $m$, the size of the second stage in both designs A and B. The number of incomplete observations is $r = N - m$. These are the individuals not selected for the second stage, who therefore have only the SRAN (test) classification. In design A, the margins $m_1$ and $m_0$ are not fixed, whereas in design B they are chosen to achieve design objectives such as minimizing the variance of estimators of sensitivity and specificity.

Estimation
In describing the methods of estimation and the properties of the estimators, the following notation will be adopted. Let $\{\pi_{ij};\ i = 0,1;\ j = 0,1\}$ represent the joint distribution of the binary indicators SRAN and PDAN, with $i$ indexing the level of SRAN and $j$ indexing the level of PDAN. Thus $\pi_{10}$ is the probability that SRAN is one or more and PDAN is zero. Sensitivity can be expressed as $s = \pi_{11}/(\pi_{01} + \pi_{11})$, and specificity as $S = \pi_{00}/(\pi_{00} + \pi_{10})$. The prevalence is $\pi = \pi_{01} + \pi_{11}$. To simplify the following developments, let $\theta = (\pi_{00}, \pi_{01}, \pi_{10}, \pi_{11})^T$. The primary objective of a validation study is to estimate $s$ and $S$. However, the prevalence may also be of interest, and could be included in the development that follows with little trouble.
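The relationship between the joint distribution and the three summary quantities can be checked numerically. The following is a minimal Python sketch, not part of the original article; the function names and the nested-list layout `pi[i][j]` (SRAN level `i`, PDAN level `j`) are illustrative assumptions.

```python
def joint_from_operating_characteristics(sens, spec, prev):
    """Return pi[i][j] = P(SRAN = i, PDAN = j) implied by sensitivity,
    specificity, and prevalence (hypothetical helper for illustration)."""
    return [
        [spec * (1.0 - prev), (1.0 - sens) * prev],   # SRAN = 0 (test negative)
        [(1.0 - spec) * (1.0 - prev), sens * prev],   # SRAN = 1 (test positive)
    ]


def operating_characteristics(pi):
    """Recover (sensitivity, specificity, prevalence) from pi[i][j]."""
    prev = pi[0][1] + pi[1][1]                   # P(PDAN = 1)
    sens = pi[1][1] / prev                       # P(SRAN = 1 | PDAN = 1)
    spec = pi[0][0] / (pi[0][0] + pi[1][0])      # P(SRAN = 0 | PDAN = 0)
    return sens, spec, prev


# Round trip with the values assumed later for Figure 3.
pi = joint_from_operating_characteristics(sens=0.9, spec=0.7, prev=0.15)
sens, spec, prev = operating_characteristics(pi)
```

The round trip returns the inputs up to floating-point error, which confirms that the three quantities and the four cell probabilities carry the same information.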
A method of estimation can be devised for designs A and B by treating the $r$ individuals not included in the second stage sample as having missing data for the gold standard PDAN. In design A, the $m$ individuals included in the second stage are selected randomly without regard to their SRAN status. Conversely, the $r$ individuals not included in the second stage can also be considered a simple random sample of the $N$ individuals in the first stage.
For the second stage of design B, $m_0$ individuals are selected randomly from among the $N_0$ SRAN negative individuals identified in the first stage, and $m_1$ individuals are selected randomly from the $N_1$ SRAN positive individuals. Thus the probability of having a missing value of PDAN (i.e., not being included in the second stage) depends upon SRAN status.
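The two second-stage selection schemes can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' procedure: the function names are hypothetical, and the first-stage results are assumed to be held as a list `sran` of 0/1 test classifications.

```python
import random


def design_a_sample(sran, m, rng):
    """Design A: simple random sample of m individuals, ignoring SRAN status."""
    return sorted(rng.sample(range(len(sran)), m))


def design_b_sample(sran, m0, m1, rng):
    """Design B: m0 test negatives and m1 test positives, sampled separately."""
    negatives = [k for k, t in enumerate(sran) if t == 0]
    positives = [k for k, t in enumerate(sran) if t == 1]
    # m_0 and m_1 cannot exceed the first-stage counts N_0 and N_1.
    if m0 > len(negatives) or m1 > len(positives):
        raise ValueError("m0 or m1 exceeds the first-stage stratum size")
    return sorted(rng.sample(negatives, m0) + rng.sample(positives, m1))


# Hypothetical first stage: 30 test positives and 70 test negatives.
rng = random.Random(0)
sran = [1] * 30 + [0] * 70
chosen_a = design_a_sample(sran, m=25, rng=rng)
chosen_b = design_b_sample(sran, m0=10, m1=15, rng=rng)
```

In design A the number of test positives among `chosen_a` varies from sample to sample, whereas in design B it is fixed at `m1` by construction.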
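For both design A and design B, the data for PDAN are what Little and Rubin (6) refer to as "missing at random." In fact, they discuss a closely related problem in their chapter on models for partially classified contingency tables. In the section dealing with monotone missing data patterns, they give formulas for the maximum likelihood estimates of the elements of $\theta$, which in this case is the joint distribution of PDAN and SRAN. These formulas may be written as
$$\hat{\pi}_{ij} = \frac{m_{ij} + (m_{ij}/m_i)\, r_i}{N},$$
where $m_{ij}$ is the number of second-stage individuals with SRAN level $i$ and PDAN level $j$, and $r_i = N_i - m_i$ is the number of individuals with SRAN level $i$ who were not selected for the second stage.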
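The maximum likelihood formula can be illustrated with a short Python sketch. The function name, the argument layout, and the counts in the usage lines are hypothetical and chosen only for illustration; the sketch assumes both second-stage strata are non-empty.

```python
def mle_joint(first_stage_counts, m):
    """Estimate pi[i][j] from two-stage data.

    first_stage_counts: (N_0, N_1), first-stage counts by SRAN level.
    m: m[i][j], second-stage counts with SRAN = i and PDAN = j.
    """
    N = first_stage_counts[0] + first_stage_counts[1]
    pi_hat = [[0.0, 0.0], [0.0, 0.0]]
    for i in (0, 1):
        m_i = m[i][0] + m[i][1]              # second-stage size in stratum i
        r_i = first_stage_counts[i] - m_i    # stratum i members with PDAN missing
        for j in (0, 1):
            # Observed count plus the share of the r_i unexamined
            # individuals allocated to PDAN level j.
            pi_hat[i][j] = (m[i][j] + (m[i][j] / m_i) * r_i) / N
    return pi_hat


# Hypothetical data: N = 10,000 with 3,900 test positives; m = 400.
pi_hat = mle_joint((6100, 3900), [[195, 5], [130, 70]])
sens_hat = pi_hat[1][1] / (pi_hat[0][1] + pi_hat[1][1])
spec_hat = pi_hat[0][0] / (pi_hat[0][0] + pi_hat[1][0])
```

Because $m_i + r_i = N_i$, each estimate reduces to $(N_i/N)(m_{ij}/m_i)$: the first-stage proportion in stratum $i$ multiplied by the second-stage proportion of that stratum found at PDAN level $j$.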
Intuitively, the estimators can be thought of as distributing the individuals not selected for the second stage between the two PDAN classifications. Formulas for the elements of the asymptotic covariance matrix of $\hat{\theta}$ are given in Little and Rubin (6, section 9.2.3). This covariance matrix will be denoted as $\Sigma_\theta$. More detailed expressions, provided in the appendix to the current article, clearly illustrate the dependence of the precision of the distribution estimates on the choice of $N$, $m_1$, and $m_0$.
[Equation 3]

Efficiency Calculations
The variance formulas given in Equation 3 can be used to compare the efficiency of designs A and B for a given sensitivity, specificity, prevalence, and first- and second-stage sample sizes. The coefficients of variation $CV(s) = \sqrt{\mathrm{Var}(\hat{s})}/s$ and $CV(S) = \sqrt{\mathrm{Var}(\hat{S})}/S$ are used as a basis for the comparisons shown in this section. Figure 3 shows the coefficients of variation for sensitivity and specificity for a first stage sample size of $N = 10{,}000$, a second stage sample size of $m = 400$, and a range of values for $m_1$ in design B. A sensitivity of $s = 0.9$, a specificity of $S = 0.7$, and a prevalence of $\pi = 0.15$ are assumed. For design A, $m_1$, the number of SRAN positives in the second stage, is not controlled, and the coefficients of variation are shown as constant over all values of $m_1$. The plots reveal that, in design B, an increase in $m_1$ tends to increase the imprecision of estimation for sensitivity and decrease the imprecision for specificity. This reflects the nature of the adjustment for the stratified sampling scheme. Figure 4 shows the sum of the coefficients of variation for sensitivity and specificity for designs A and B. All the assumptions are the same as in Figure 3, except that the lower panel uses a sensitivity of $s = 0.99$. The results show that design B is better than design A for only a small range of values for $m_1$ for a sensitivity of 0.9, whereas a fairly broad range of values

A Validation Study for SRAN
The ability to self-diagnose atypical nevi has important implications for screening efforts aimed at prevention and early detection of melanoma. If it can be shown that atypical nevi are accurately self-reported, large population surveys aimed at targeting high-risk groups for intervention or prevention could be conducted inexpensively by mail. In addition to screening applications, the validation study may provide a foundation for population-based follow-up studies of melanoma risk associated with atypical nevi, case-control evaluation of risk factors for atypical nevi, and preventive efforts against melanoma.
In the recent Swedish survey described above, 50,000 women between the ages of 35 and 55 were asked to examine their lower extremities for the presence of irregular moles resembling the atypical nevi pictured in color photographs. A proposed validation study would evaluate self-screening efforts of survey respondents living in the Uppsala area using the two-stage design proposed in this article (L. Titus-Ernstoff, unpublished grant application).
In the first stage, a random sample of 2000 women living in one of two counties adjacent to the Uppsala medical facilities will be selected from among the respondents to the original survey. Based upon the results of a population-based study of atypical nevi in Gothenburg, Sweden (2,3), it is estimated that about 300 of the 2000 women who live in the eligible counties will report an atypical nevus. A total of 400 women will be selected randomly for the second-stage validation study. Study recruits will undergo a physician-conducted skin examination, during which pigmentation characteristics (including mole counts) and atypical mole counts will be recorded. Table 1 shows optimal strata sizes and coefficients of variation for design B, as applied to the Gothenburg example. Four assumptions are used for sensitivity and specificity. Anticipated coefficients of variation are shown for s and S. These results demonstrate that a two-stage design for measuring sensitivity and specificity can improve on overall precision by controlling the number of test positives and negatives in the second stage.
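A design comparison of this kind can also be explored by simulation. The sketch below is a Monte Carlo substitute for the asymptotic formulas, not the authors' calculation: it draws repeated two-stage samples and reports empirical coefficients of variation, with the standard deviation of each estimator divided by the true parameter value. All function names are hypothetical. The sensitivity, specificity, and prevalence in the usage lines are the values assumed for Figure 3, not the Table 1 assumptions, and `m1=150` is an arbitrary illustrative choice.

```python
import random
import statistics


def draw_cohort(N, sens, spec, prev, rng):
    """Simulate N individuals as (sran, pdan) pairs."""
    cohort = []
    for _ in range(N):
        pdan = 1 if rng.random() < prev else 0
        p_report = sens if pdan else 1.0 - spec   # P(SRAN = 1 | PDAN)
        sran = 1 if rng.random() < p_report else 0
        cohort.append((sran, pdan))
    return cohort


def estimate(cohort, second_stage):
    """Return (s_hat, S_hat), or None if an estimate is undefined."""
    N = len(cohort)
    N_i = [0, 0]
    for sran, _ in cohort:
        N_i[sran] += 1
    m = [[0, 0], [0, 0]]
    for sran, pdan in second_stage:
        m[sran][pdan] += 1
    pi = [[0.0, 0.0], [0.0, 0.0]]
    for i in (0, 1):
        m_i = m[i][0] + m[i][1]
        if m_i == 0:
            return None
        for j in (0, 1):
            pi[i][j] = (N_i[i] / N) * (m[i][j] / m_i)   # simplified MLE form
    pos, neg = pi[0][1] + pi[1][1], pi[0][0] + pi[1][0]
    if pos == 0 or neg == 0:
        return None
    return pi[1][1] / pos, pi[0][0] / neg


def empirical_cv(N, m, m1, sens, spec, prev, reps, seed=0):
    """Empirical (CV(s), CV(S)). m1=None gives design A; otherwise design B
    with m1 test positives and m - m1 test negatives. Replicates in which
    the design is infeasible or an estimate is undefined are skipped."""
    rng = random.Random(seed)
    s_hats, S_hats = [], []
    for _ in range(reps):
        cohort = draw_cohort(N, sens, spec, prev, rng)
        if m1 is None:
            stage2 = rng.sample(cohort, m)
        else:
            pos = [c for c in cohort if c[0] == 1]
            neg = [c for c in cohort if c[0] == 0]
            if m1 > len(pos) or m - m1 > len(neg):
                continue
            stage2 = rng.sample(pos, m1) + rng.sample(neg, m - m1)
        est = estimate(cohort, stage2)
        if est is not None:
            s_hats.append(est[0])
            S_hats.append(est[1])
    return statistics.pstdev(s_hats) / sens, statistics.pstdev(S_hats) / spec


cv_a = empirical_cv(N=2000, m=400, m1=None, sens=0.9, spec=0.7, prev=0.15, reps=200)
cv_b = empirical_cv(N=2000, m=400, m1=150, sens=0.9, spec=0.7, prev=0.15, reps=200)
```

Scanning `m1` over a grid and comparing the sum of the two coefficients of variation against design A gives a rough, simulation-based analogue of the comparison described in the text. The number of replicates here is small, so the results are noisy.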

Discussion
A new proposal for two-stage designs of validation studies has been presented. These designs may lower the cost of validation studies by reducing the use of the more expensive gold standard measurement. A method of estimation has been derived by using methods for missing data.
It should be noted that similar issues have been considered in investigations of "verification" or "work-up bias" (7) in diagnostic test assessment. An example of such a situation might be a study of "silent" coronary heart disease in which determining the gold standard disease classification would involve giving an invasive test to populations of apparently healthy individuals, thus incurring a high degree of noncompliance. In these situations, the decision to apply the invasive procedure may be influenced by the results of the screening test, with a positively screened individual being more likely to receive the gold standard. This selection can bias the estimate of the operating characteristics of the tests, and has been termed work-up bias.
The two-stage design suggested in this paper can be viewed as a study which deliberately incurs work-up bias. The estimates of sensitivity and specificity presented using corrections for missing data are equivalent to the bias-corrected estimators suggested by Begg
To obtain asymptotic variances for the purposes of comparing designs, the estimates in these formulas are replaced by the values of the parameters $\theta$. To compute the asymptotic variances for design A, cC is set to 0.