Screening tests: a review with examples

Screening tests are widely used in medicine to assess the likelihood that members of a defined population have a particular disease. This article presents an overview of such tests including the definitions of key technical (sensitivity and specificity) and population characteristics necessary to assess the benefits and limitations of such tests. Several examples are used to illustrate calculations, including the characteristics of low dose computed tomography as a lung cancer screen, choice of an optimal PSA cutoff and selection of the population to undergo mammography. The importance of careful consideration of the consequences of both false positives and negatives is highlighted. Receiver operating characteristic curves are explained as is the need to carefully select the population group to be tested.


Introduction
A screening test (sometimes termed medical surveillance) is a medical test or procedure performed on members (subjects) of a defined 1 asymptomatic population or population subgroup to assess the likelihood of their members having a particular disease. 2 With few exceptions, screening tests do not diagnose the illness. Rather subjects who test positive typically require further evaluation with subsequent diagnostic tests or procedures. Examples of actual or proposed screening tests include the pap smear for cervical cancer (Arbyn et al., 2008;Mayrand et al., 2007), mammography (or tomosynthesis) for breast cancer (Friedewald et al., 2014;Rafferty et al., 2013), PSA (and/or digital rectal exam) for prostate cancer (Catalona et al., 1991), cholesterol level for heart disease, X-ray (or computed tomography) for lung cancer (discussed below), PKU test for phenylketonuria in newborns, B-natriuretic peptide test for screening patients undergoing echocardiography to determine left ventricular dysfunction (Maisel et al., 2001), and urinalysis or other screening tests for sexually transmitted diseases or illicit drug use (Gastwirth, 1987;Jafari et al., 2013;Watson et al., 2002). Screening tests may be based on the measurement of a particular chemical in the blood or urine (a quantitative measurement) or some qualitative assessment by a trained observer (e.g. interpretation of an x-ray or CT scan, or semi-quantitative analysis by a polygraph operator).
A major objective of most screening tests is to reduce morbidity or mortality in the population group being screened for the disease by early detection, when treatment may be more successful. 3 An alternative objective might be to reduce morbidity or mortality in persons other than the screened population who might be impacted by a communicable and preventable disease (such as screening for HIV in blood donors 4 ) among subjects in the population being tested.
Although some of the key analytical/statistical results applicable to the design and evaluation of screening tests have been around since the late 1700s, when the Reverend Thomas Bayes first developed the theorem that bears his name and numerous tutorials or review articles have been written more recently (Alberg et al., 2004;Altman & Bland, 1994a,b,c;Deeks & Altman, 2004;Goetzinger & Odibo, 2011;Lalkhen & McCluskey, 2008;Thompson et al., 2005;Zou et al., 2007), there is still some confusion among practitioners about how to interpret and assess the utility of screening tests (Casscells et al., 1978;Grimes & Schutz, 2002;Manrai et al., 2014;Wegwarth et al., 2012), which is why the article might be of interest to readers of Inhalation Toxicology.

Definitions
In its simplest form, the screening test has only two outcomes: positive (suggesting that the subject has the disease or condition) or negative (suggesting that the subject does not have the disease or condition). 5 An ideal screening test would have a positive result if and only if the subject actually has the disease and a negative result if and only if the subject did not have the disease. Actual screening tests typically fall short (sometimes far short, see below) of this ideal. Instead, most screening tests exhibit what are termed false positives and false negatives to varying degrees. Logical possibilities are described in the 2 Â 2 Table 1.
In most cases, 6 screening tests need to be benchmarked against an agreed ''Gold Standard'' test (Greenhalgh, 1997). The gold standard test is a diagnostic test that is usually regarded as definitive (e.g. by biopsy or autopsy). The actual gold standard test may be invasive (e.g. biopsy), unpleasant, too late (e.g. autopsy) to be relevant, too expensive or otherwise impractical to be used widely as a screening test. Table 2 provides examples of various screening tests and possible Gold Standards.
In principle, a ''Gold Standard'' should have 100% sensitivity and 100% specificity (see below for definitions), that is, it would never make a classification error. In practice that may not be the case and the ''Gold Standard'' is regarded as the best test under ''reasonable conditions.'' As noted by Versi (1992): ''As science increases its hold on the practice of medicine we become more aware of the limitations of the clinical method. Unfortunately, we also become more aware of the limitations of various diagnostic tests. Nevertheless, at any given time there may well be a consensus that a given test in a given situation is the best available test. It therefore serves as the gold standard against which newer tests can be compared. When enough data have accumulated to make that gold standard untenable, it can perfectly reasonably be replaced by another. This can then preside until it too is toppled.' ' Troy et al. (1996) offered the following perspective on gold standards: ''. . . however, gold standards for comparison are not always available. Moreover, a perfect gold standard is less often available than an imperfect gold standard ('alloyed gold standard'), an adopted standard based on observed data which is measured with error.'' For the purposes of this discussion, the gold standard test is assumed to be without error. Several authors have developed statistical approaches for dealing with ''alloyed gold standards'' (Dendukuri, 2011;Hawkins et al., 2001; Johnson et al., 2001;Joseph et al., 1995;Lewis & Torgerson, 2012;Rutjes et al., 2007;van Smeden et al., 2014;Walter & Irwig, 1988). As might be expected, none of these alternative procedures perform as well as if a true Gold Standard were available, but several are improvements over naively assuming the Gold Standard is ''unalloyed.'' The possible outcomes shown in Table 1 are quantified by two probabilities, termed the test sensitivity and specificity. These are two key characteristics of a screening test.
Sensitivity is the test's ability to correctly designate a subject with the disease as positive; it is the conditional probability (Pr{T + jD + }) 7 , denoted by the symbol S that a subject who has the disease, D + , tests positive, T + . A highly sensitive test means that there are few false negative results; few actual cases are missed. Ceteris paribus, tests with high sensitivity have potential value for screening, because they rarely miss subjects with the disease (Goetzinger & Odibo, 2011). Specificity is the test's ability to correctly designate a subject without the disease as negative; it is the conditional probability (Pr{T À jD À }), denoted by the symbol Sp that a subject who does not have the disease, D À , tests negative, T À . A highly specific test means that there are few false positive results. Therefore, high specificity tests perform well for diagnosis because of low false positive errors. Tests with low specificity have the disadvantage that (among other things) many subjects without the disease will screen positive and potentially receive unnecessary (and possibly invasive, risky or expensive) follow-up diagnostic or therapeutic procedures. Publications about screening tests typically report both the sensitivity and specificity of the test. It is clearly desirable to have a test that is both highly sensitive and highly specific. (In some cases, it may be possible to structure the test so as to tradeoff sensitivity and specificity, as discussed below.) Figure 1 shows a sample of reported sensitivities and specificities of various screening tests as summarized by Alberg et al. (2004), denoted by the triangles, and from our own literature search (refer Table A1), denoted by the circles. As can be seen, there are a substantial number of screening tests with both high sensitivity and high specificity, but also many that fall far short of this ideal. It should be noted that not all the screening tests shown in Figure 1 are actually used at present -some may have been found wanting.
As some test results (e.g. reading an X-ray) require interpretation, it is possible that there will be inter-observer variation (notwithstanding attempts at standardization) so that the reported sensitivity and specificity may vary with the observer (Deeks, 2001;Elmore et al., 2002 for illustrations). This creates issues when large scale screening tests are being contemplated and it is necessary to extrapolate or generalize from screening test data based on pilot studies, often conducted by highly specialized and experienced personnel. Refer Whiting et al., (2004) and Table 3 for a useful systematic review of sources of variation and bias in studies of diagnostic or screening accuracy.
Frequently, it is of interest to compare one screening test (a potential improvement) with another. It is important to make careful statistical comparisons to assess whether the ''improved'' test is actually superior and to calculate confidence intervals on the various proportions (e.g. using the methods given in Newcombe, 1998). Ideally such comparisons should be made on the same population and randomly assigning subjects to each test.
Before addressing additional important definitions applicable to screening tests it is appropriate to mention some of the consequences of false negatives and false positives in screening tests. Briefly: A false negative means that a subject with the disease is misclassified as not having the disease on the basis of the screening test. The subject is given a misleading impression that he/she is free of the disease and thus does not undergo more suitable diagnostic tests. At a minimum this means that correct diagnosis is delayed (perhaps until the subject develops symptoms) and, in the case of diseases for which early treatment offers improved chances of recovery, there is increased risk of morbidity and mortality [refer Kaufman et al. (2014) for an example with breast cancer]. False negatives from a screening test for illicit drug use or a polygraph test design to detect deception have obvious negative consequences (Gastwirth, 1987). As another example, failure to detect someone with an STD, may result in increased morbidity or mortality of future sexual partners of the subject. Systematic reviews of the consequences of false negatives are provided in Petticrew et al. (2000Petticrew et al. ( , 2001. False negatives may also lead to legal action being taken by affected individuals and may reduce public confidence in screening. A false positive means that a subject without the disease is misclassified as having the disease on the basis of the screening test. The subject is given the misleading impression that he/she has the disease and thus endures  (Table A1) while triangles represent studies reported in Alberg et al. (2004). 7 The symbols T+ and TÀ denote the events that the test outcome is positive and negative, respectively. The symbols D+ and DÀ denote the events that the subject has or does not have the disease.
the unnecessary psychological consequences as well has having to undergo possibly invasive diagnostic or treatment procedures. The consequences of a false positive can be material. For example: Elmore et al. (1998) provides examples of the consequences of false-positive screening mammograms. Among other things, false-positive mammograms led to more outpatient visits, diagnostic imaging examinations, and biopsies than false positive clinical breast examinations. In one patient, cellulitis requiring hospitalization for surgical debridement and intravenous antibiotic therapy developed after a biopsy prompted by a false positive mammogram. Wiener et al. (2011) provide an assessment of the population-based risk for complications after transthoracic needle lung biopsy of a pulmonary nodule discovered using a CT scan. Croswell et al. (2009) reported on the cumulative incidence of false-positive results in repeated multimodal cancer screening. Among other things this study revealed that for a woman the cumulative risk of undergoing a false-positive-prompted invasive diagnostic procedure was about 12.3% after 4 tests increasing to 22.1% after 14 tests. For men the corresponding percentages were 17.2% after 4 tests and 28.5% after 14 tests. A study of 12 669 Swedish youths (aged 16 and over) diagnosed with cancer found a 60% increased risk of suicide or attempted suicide (Lu et al., 2013). A false positive in this instance has a material adverse consequence. Another study (Baade et al., 2006) of people diagnosed with cancer in Queensland, Australia indicated that this group experienced a SMR of 149.9 for non-cancer deaths. False positives may also decrease the likelihood that a subject will return for subsequent follow-up procedures (Á lamo-Junquera et al., 2011). And, false positives may also result in litigation and loss of public confidence in screening. The consequences of both false positives and false negatives need to be carefully considered in assessing the utility of a screening test. In some cases, it may be possible to alter the decision criterion (or criteria) of a particular screening test to alter the sensitivity or specificity of the test and thus trade off one type of error for another. For such cases, the usual procedure is to calculate a receiver-operating characteristic curve (discussed below) for the test (Thompson et al., 2005;Zou et al., 2007). As well, some of the consequences of false positives can be altered by the choice of follow-up procedures among those subjects who test positive. For example, less invasive or non-invasive diagnostic tests can be selected, depending upon the specific outcomes of the initial screening test. The prevalence of the disease is the fraction, Å, of subjects in the population under study that have the disease. It is equal to the a priori probability (Pr{D + }) that a subject selected at random from the population or subgroup has the disease. 8 Prevalence, along with sensitivity and specificity, is a key determinant of the utility of the screening test (see below). For reasons discussed below, it is desirable to be able to define the population to be screened in such a way that the prevalence in the test population is high. The reported prevalence among various populations that are the subject of screening tests (Alberg et al., 2004) range from 0.05 to 0.9, but clustered among the higher values.
There are four additional relevant characteristics of a screening test, the positive predictive value, negative predictive value, accuracy and likelihood ratio: The positive predicted value (PPV) is the probability that a subject with a positive (abnormal) test actually has the disease (Pr{D + jT + }) also called the a posteriori probability. Given the above notation; PPV ¼ ÅS/((ÅS + (1 -Å)(1 -Sp)). In words, the a posteriori probability that the subject has the disease given a positive test is the ratio of true positives (the product of the prevalence and sensitivity) divided by total positives (the sum of true positives and false positives). It is desirable that the screening test has a high PPV.
The negative predicted value (NPV) is the post-test probability that the subject has no disease given a negative test result (Pr{D À jT À }) also termed the a posteriori probability given a negative test. Given the above notation: Table 3. Common sources of bias in study design.

Type of bias Description
Verification bias Non-random selection for definitive assessment for disease with the old standard reference test Errors in the reference True disease status is subject to misclassification because the gold standard is imperfect Spectrum bias Types of cases and controls included are not representative of the population Test interpretation bias Information is available that can distort the diagnostic test Unsatisfactory tests Tests that are uninterpretable or incomplete do not yield a test result Extrapolation bias The conditions or characteristics of populations in the study are different from those in which the test will be applied Lead time bias Earlier detection by screening may erroneously appear to indicate beneficial effects on the outcome of a progressive disease Length bias Slowly progressing disease is over-represented in screened subjects relative to all cases of disease that arise in the population Overdiagnosis bias Subclinical disease may regress and never become a clinical problem in the absence of screening, but is detected by screening Source: Pepe (2003).
NPV ¼ (1 À Å)Sp/((1 À Å)Sp + Å(1 À S)). In words, the a posteriori probability that the subject does not have the disease given a negative test is the ratio of the true negatives (complement of prevalence times the specificity) divided by the total negatives (the sum of true negatives and false negatives). It is also desirable that the test has a high NPV.
The accuracy (also termed overall accuracy, diagnostic accuracy or test efficiency) of a test is the overall proportion of correct test results. This includes true positives and true negatives. Mathematically it is calculated from the equation: ÅS + (1 À Å)Sp. The term ÅS includes the true positives (prevalence times sensitivity) and the term (1 À Å)Sp are the true negatives (probability the subject does not have the disease times the probability that the test is negative given the subject is without disease). As noted by Alberg et al., (2004): ''Overall accuracy is the weighted average of a test's sensitivity and specificity, where sensitivity is weighted by prevalence and specificity is weighted by the complement of prevalence.'' Refer to Alberg et al. (2004) for a discussion of the limitations of this measure of screening efficiency.
The Likelihood ratio is another term used to characterize screening tests; it is defined as the probability of a subject who has the disease testing positive divided by the probability of a subject who does not have the disease testing positive, L ¼ S/(1 À Sp). To many, the PPV or NPV are the key characteristics of a screening program. It is important to remember that the PPV or NPV are dependent on both the population under study and the technical characteristics of the screening test. 9 A screening test with relatively high sensitivity and specificity may still have a low PPV if the population prevalence is sufficiently low. Thus, to assess a proposed screening test it is necessary to evaluate both the technical and population characteristics.
The probability of a positive test, Pr(T + ) is the sum of the probabilities of a subject with the disease correctly testing positive and someone without the disease incorrectly testing positive, or ÅS + (1 À Å)(1 À Sp). Some have suggested using the observed fraction of positive tests, F + , (sometimes termed the apparent prevalence) as a surrogate for or estimate of Å, but, unless both the sensitivity and specificity are both equal to unity, this will give a biased answer. Given the definitions, an improved estimate for Å is equal to (F + + Sp À1)/(S + Sp -1). Refer Gart & Buck (1966), Gastwirth (1987), Levy & Kass (1970) and Rogan & Gladen (1978) for a derivation and Karaagaoglu (1999) for additional analyses.

A numerical example
All of these screening test characteristics are determined by testing a particular population (using one or more screening tests) and recording the number of subjects that fall into the various categories shown in Table 1. To illustrate, Table 4 provides a hypothetical data from a screening test evaluation of a population of 10 000 subjects, assumed to have a disease prevalence of 0.5, with a calculated sensitivity of 0.9 (95% confidence interval including continuity correction [0.8913, 0.9081]), and a specificity of 0.3 (95% confidence interval including continuity correction [0.2874, 0.313]). 10 Table 4 also illustrates the equations and numerical computation of the various quantities defined above. The bottom of Table 4 provides the calculation (using Bayes' theorem) of the a posteriori probabilities corresponding to either a positive or negative test outcome. In this example, a subject who tests positive has an a posteriori probability of having the disease of 0.5625 -not materially greater than the a priori prevalence (0.5) in the population. This is because although the sensitivity is relatively high, the specificity of the test is relatively low. Conversely, a subject who tests negative has an a posteriori probability of not having the disease of 0.75 -in this case, clearly different from the a priori prevalence (0.5) in the population in this example.

Further analysis of the numerical example
The a posteriori probability of having the disease given a positive test result, or PPV, is one obvious measure of the evidence provided by the test. Other things being equal, tests with high specificity (few false positives) tend to have a high PPV. However, unlike sensitivity or specificity (which might be termed ''pure characteristics'' of the test), the PPV is also a function of the characteristics of the population under study; PPV is a function of the prevalence. In the numerical example given in Table 4, the prevalence was assumed to be 0.5 (i.e. 50% of the population or subpopulation had the disease). Figure 2 shows how the PPV, NPV and accuracy depend upon the assumed prevalence Å in the population being screened. As can be seen, both PPV and accuracy decrease (sharply in the case of PPV) as Å decreases from the base case assumption of 0.50. Conversely, the NPV increases as the prevalence decreases.
To help place the content of Figure 2 in perspective, note that if the prevalence, Å, were as low as 0.16, the Positive Predicted Value, PPV, would be only 0.2. Put another way, a subject who tested positive under these circumstances would have an 80% chance of not having the disease! And, if Å were as low as 0.08, there would be a 90% chance that a subject with a positive test would be disease free. If the consequences of a positive test (e.g. worry, invasive or expensive and unnecessary follow-up procedures) were substantial, this would not be a satisfactory screening test. Thus, the quantity 1 -PPV might aptly be termed the regret probability. Figure 3 shows how the regret (so defined) varies with both the prevalence and specificity, when the sensitivity is held 9 It is beyond the scope of this article to consider optimal screening study designs, but it is appropriate to comment on one possible design, the case control design. As noted by Goetzinger & Odibo (2011): ''It is important to highlight that the case control study design cannot be used to determine predictive values because these values are influenced by disease prevalence. Because cases and controls are selected for inclusion, the prevalence of the disease is, therefore, ''fixed'' by the study design. Reproducing a generalizable spectrum of patients also becomes difficult with this type of study design''.
Probability of positive test True positives + false positives divided by total tests Probability of negative test True negatives + false negatives divided by total tests  constant at 0.90. Looking at Figure 3, you can see how the likelihood that a subject who tests positive actually is disease free changes as the prevalence changes. If the actual prevalence in the population were say 0.3, the regret would be approximately 0.7 and if the prevalence were as low as 0.08, the regret would be 0.9. This example illustrates the point that both technical parameters of the screening test and prevalence need to be considered. Figure 4 shows the locus of points that have a constant regret (equal to 0.8) as a function of specificity and prevalence for values of sensitivity ranging from 0.7 through 0.9; this test characteristic does not have much leverage in this example. Rather the prevalence and specificity are the key variables.
Although the example given in Table 4 is hypothetical, it is relevant to many actual tests. One is described below.

An example: LDCT scans for lung cancer
Low dose computed tomography (LDCT) has been proposed as a screening test for lung cancer. Several studies (Humphrey et al., 2013;Tiitola et al., 2002) have shown that this technique has a high sensitivity (as a percentage ranging from 80 to nearly 100%, Humphrey et al. [2013]) at detecting nodules. The National Lung Cancer Screening Trial (NLST) Research Team (2011) published a study reporting that the estimated reduction in mortality from use of LDCT screening was approximately 20% compared to alternative test strategies. Subjects included in the study population were between 55 and 74 years of age at the time of randomization, had a history of cigarette smoking of at least 30 pack-years, and, if former smokers, had quit within the previous 15 years. These criteria were used in an attempt to define a population with a relatively high prevalence and thus a high Positive Predictive Value: Smokers (    However, this and earlier proposals for LDCT screening have also had numerous critics or skeptics (American Academy of Family Physicians (AAFP) 2014; Heffner & Silvestri, 2002;Ruano-Ravina et al., 2013;Silvestri, 2011;Vansteenkiste et al., 2012), some arguing that the estimated benefits of LDCT screening in reducing mortality are uncertain, lower than estimated or absent (Bach et al., 2007(Bach et al., , 2012Black, 2000;Oken et al., 2011;Pastorino et al., 2012;Saghir et al., 2012), others that the procedure is not costeffective (Mahadevia et al., 2003), and yet others that the radiation risks might be excessive (Brenner, 2004).
One of the major concerns about the use of LDCT even among advocates (Marshall et al., 2013) is that LDCT detects a large number of benign but uncalcified pulmonary nodulesproperly termed false positives -that are challenging to diagnose (MacRedmond et al., 2006;Nawa et al., 2002;Patz et al., 2004;Swensen et al., 2002Swensen et al., , 2003Swensen et al., , 2005 and which create other problems depending upon what is done as part of the follow-up to a positive test (Wiener et al., 2011). As noted by Diedrerich (2008): Many pulmonary nodules even in smokers are due to benign lesions such as granulomas and hamartomas. 12 In short, although this test is highly sensitive, it has a low specificity. Table 5 provides estimates of the false positive rate (benign nodules discovered by CT scans) as reported in several studies -even those that favor routine screening of this population subgroup. The calculated or reported false positive rates shown in Table 5 vary substantially among the studies; 13 some of the differences can be explained by different criteria for defining a positive (e.g. size of the nodule that is classified as a positive) and whether or not multiple LDCTs were used (and the criteria for a positive on multiple tests) as part of the procedure. Despite this variability it is apparent that most reported estimates of false positive probabilities are quite high. The study by van Klavern et al.
[2009] reports a false positive probability very much lower than the other results depicted in Table 5. The actual test and decision criteria developed by these investigators differed from others. Specifically, they used a mathematical model to evaluate a non-calcified nodule according to its volume or volume-doubling time. Growth was defined as an increase in volume of at least 25% between two scans. The first-round screening test used by these investigators was considered to be negative if the volume of a nodule was less than 50 mm 3 , if it was 50 to 500 mm 3 but had not grown by the time of the 3month follow-up CT, or if, in the case of those that had grown, the volume-doubling time was 400 days or more. Another concern of critics of the NLST is that it might be difficult to generalize the results to community practices. Silvestri (2011), for example, wrote: Participants in the NLST were enrolled in tertiary care hospitals with expertise in all aspects of cancer care.
[LDCT] studies were interpreted by dedicated chest radiologists with expertise in characterizing nodules and providing appropriate recommendations for follow up. As a result, few patients required invasive testing and radiographic follow-up was sufficient for many patients. However, community radiologists without expertise in evaluating lung nodules may feel compelled to advise invasive testing for a screening-detected nodule. Of the 26 309 persons randomly assigned to chest CT screening in the NLST, 7191 (27%) had an abnormal finding. Most scans (96.4%) yielded false-positive results that were followed by serial radiography. Variation in how nodules are managed could lead to a substantial increase in transthoracic needle aspiration of lung nodules, unnecessary surgery, additional morbidity and even mortality for some persons who never had cancer to begin with.
From the data given in Table 5, it is clear that a conservative estimate of the false positive probability is at least 0.7, which means that the specificity of this test is at most 0.3 -the value assumed in the hypothetical example given in Table 4 -and might be much lower. Thus, even for the potentially high risk group of elderly heavy cigarette smokers included in the screening trials, the Positive Predictive Value of the test is not likely to be high.
There is some discrepancy in reported PPVs for the NLST; according to Humphrey et al. (2013) reported calculated positive predictive values (PPVs) for abnormal screening results ranging from 2.2% to 36.0%, while Ruano-Ravina et al. (2013) report the PPV for the NLCT as only 3.6%. Ruano-Ravia et al. (2013) have summarized the PPVs for 14 other LDCT investigations. Including their estimate for the NLCT, PPVs in these tests range from 0.028 to 0.115 with a median value of 0.053 and an arithmetic mean of 0.064, meaning that the probability that someone with a single positive test does not have lung cancer ranges from 0.885 to 0.972! Kovalchik et al. (2013) examined how the reduction in lung cancer mortality as reported by the NLST varied with the estimated risk based on a prediction model using age, bodymass index, family history of lung cancer, pack-years of smoking, years since smoking cessation and emphysema diagnosis. Based on model predictions, they divided the study population into quintiles based on a predicted 5-year risk of lung cancer. They analyzed the NLST data and found: Screening with low-dose CT prevented the greatest number of deaths from lung cancer among participants who were at highest risk and prevented very few deaths among those at lowest risk. These findings provide empirical support for risk-based targeting of smokers for such screening.
This finding highlights the importance of identifying the target population that is likely to benefit most from the screening procedure.
Overdiagnosis is another factor to consider in assessing the merits of LDCT cancer screening. This is because although screening has a high sensitivity and potential to detect aggressive tumors, screening will also detect indolent tumors that otherwise might not cause immediate clinical symptoms. Patz et al. (2014) used data from the NLST to estimate that more than 18% of all lung cancers detected by LDCT seemed to be more indolent, and the potential of overdiagnosis should be considered when describing the risks of LDCT for lung cancer.
Depending upon what is done in terms of follow up in the event of a positive screening test result, the impact of false positives could be substantial. Wiener et al. (2011) determined population-based estimates of risks of complications following transthoracic needle biopsy of a pulmonary nodule. This group collected data on the percentage of biopsies complicated by hemorrhage, any pneumothorax and pneumothorax requiring chest tube, and computed adjusted odds ratios for these complications associated with various biopsy characteristics, calculated using multivariable populationaveraged generalized estimating equations among a population of 15 865 adults (in California, Florida, Michigan and New York) who underwent transthoracic needle biopsy of a pulmonary nodule.
These investigators reported: Although hemorrhage was rare, complicating 1.0% (95% CI 0.9-1.2%) of biopsies, 17.8% (95% CI 11.8-23.8%) of patients with hemorrhage required a blood transfusion. In contrast, the risk of any pneumothorax was 15.0% (95% CI 14.0-16.0%), and 6.6% (95% CI 6.0-7.2%) of all biopsies resulted in a pneumothorax requiring chest tube. Compared to patients without complications, those who experienced hemorrhage or pneumothorax requiring chest tube had longer lengths of stay (p 5 0.001) and were more likely to develop respiratory failure requiring mechanical ventilation (p ¼ 0.02). Patients aged 60-69 years (as opposed to younger or older patients), smokers and those with chronic obstructive pulmonary disease had higher risk of complications.
It is apparent form these results that the consequences of false positives are potentially material.
Based largely on concerns over the high false positive rate, the Medicare Evidence Development and Coverage Advisory Committee (MEDCAC) in the United States recently recommended against covering the procedure for this patient group based on a lack of evidence to support the benefits of the screening test (http://www.aafp.org/news/health-of-thepublic/20140521medcacctrec.html). MEDCAC makes recommendations, not decisions; as noted by the Centers for Medicare and Medicaid Services (http://www.cms.gov/ Regulations-and-Guidance/Guidance/FACA/ MEDCAC.html): The MEDCAC reviews and evaluates medical literature, technology assessments, and examines data and information on the effectiveness and appropriateness of medical items and services that are covered under Medicare, or that may be eligible for coverage under Medicare. The MEDCAC judges the strength of the available evidence and makes recommendations to CMS based on that evidence.
The merits of this screening test are likely to be reviewed by other panels and the MEDCAC recommendation may ultimately be reversed -the Centers for Medicare & Medicaid Services (CMS) is expected to issue a proposed decision on the issue by November 2014, and a final decision in February 2015. The decision may be made on policy grounds but, from a scientific perspective, the ultimate outcome is likely to hinge on the judgment of the key parameters prevalence in the population, the high false positive rate, and ultimately the low PPV (Nelson, 2009;Phend, 2014;US Preventive Services Task Force, 2013).

Another example of LDCT screening
As a related example, we were asked by ECFIA (a trade association of manufacturers of high temperature insulating wools) to comment on the suitability of routine use of LDCT scans in a medical surveillance program for workers of all ages (including both smokers and non-smokers) engaged in the manufacture of refractory ceramic fiber (RCF) and other high temperature insulating wools in France. The available results of a mortality study of these workers in two US plants does not indicate any increase over baseline cancer rates (LeMasters et al., 2003;Utell & Maxim, 2010), so the likely prevalence of lung cancer in this population is not likely to be high. This is because most of the employed population is substantially younger than those included in the NLST (indeed, the retirement age in France is 60-62 depending upon what age the employee entered the workforce) and not all employees are smokers, let alone heavy smokers.
According to data from SEER (see http://www.cancer.org/ cancer/cancerbasics/lifetime-probability-of-developing-ordying-from-cancer) the lifetime probability of contracting lung cancer among American males (including both smokers and non-smokers) is approximately 7.6%. Taking this as an estimate applicable to the French population prevalence 14 and using the sensitivity and specificity values from Table 4, the positive Predictive Value of CT lung cancer screening is approximately 0.1. (Obviously it would be much lower for young men and non-smokers and higher among those nearing retirement and heavy smokers.) This means that the a posteriori probability (regret) that a subject who tests positive in a single CT scan does not have lung cancer is approximately 0.9. Despite the high probability that a subject with a positive test does not have lung cancer, these subjects would be subject to whatever follow-up procedures might accompany such a test result. Members of this group would, at a minimum, suffer some mental distress and would be subject to follow-up CT scans and possibly invasive procedures. This screening test would clearly be inappropriate for this group. In this context it is noteworthy that the American Lung Association's guidance document (ALA, 2012) that endorsed LDCT scans for older smokers also states: Low-dose CT screening should NOT be recommended for everyone. [Emphasis in original.] And, Bach et al. (2012) also endorsed LDCT screens for older smokers, but also recommended: 15 For individuals who have accumulated fewer than 30 packyears of smoking or are either younger than 55 years or older than 74 years, or individuals who quit smoking more than 15 years ago, and for individuals with severe comorbidities that would preclude potentially curative treatment, limit life expectancy or both, we suggest that CT screenings should not be performed.
Thus, regardless of whether one believes that LDCT is an appropriate screening test for the population of older smokers, it is not justified for a population with much lower prevalence or those who are not likely to benefit from a correct diagnosis, evaluation and treatment.

Periodic screenings
Some screening tests are designed as ''once-off'' tests, but many are intended to be administered periodically, such as annually. For example, mammography and clinical breast examination have been proposed for screening for breast cancer. As Elmore et al. (1998) wrote: If a woman undergoes annual screening beginning at the age of 40, she will have had 60 opportunities for a false positive result by the age of 70, with 30 mammograms and 30 clinical breast examinations. The cumulative lifetime risk from her having a result from a screening test that requires further workup, even though no breast cancer is present, is not known. . .It is important to determine the cumulative risk of false positive tests, because women are advised to have breast-cancer screening every 1-2 years over several decades of their lifetimes, and false positive rates can provoke anxiety, increase costs and cause morbidity.
Thus, in evaluating periodic screening, it is necessary to measure or calculate cumulative probabilities. Care must be taken because the results of multiple tests may not be independent events.

Choosing the right population subgroup
As the PPV of a screening test depends critically on the prevalence of the disease in the population it is important to identify criteria to define a population group or subgroup with a high disease incidence to begin with. As noted above, this is why the LDCT program was limited to older smokers. Lung cancer rates increase with age and the vast majority of lung cancers occur in smokers. This is potentially a reasonable population subgroup for screening. 14 Male mortality rates from lung cancer are approximately the same in France and the United States (see http://www.oecd-ilibrary.org/docserver/ download/8111101ec007.pdf?expires¼1404337643&id¼id&accname ¼guest&checksum¼03F45C46CE1A31E393DD2EAFDF0157D3). Moreover, the 7.6% figure assumed for the prevalence is for an entire lifetime. The probability of contacting cancer through age 60 or 62 (when workers will retire) is certainly lower. Thus, this estimate probably overstates the actual prevalence for the worker cohort.
To illustrate the selection of a relevant population subgroup, we use an example from a study of breast cancer screening. Kerlikowske et al. (1993) reported on a cross-sectional study of 31 814 women aged 30 years and older referred for mammography at the University of California. They segmented the population into women of various age groups with and without a family history of breast cancer. Figure 5 shows a bar chart of the estimated PPVs for these groups. These investigators found that five times as many cancers per 1000 first-screening mammographic examinations were diagnosed in women aged 50 years or older compared with women aged less than 50 years. The highest PPVs for mammography were older women with a family history of breast cancer. This finding guided their recommendation.
Possible criteria for defining a population subgroup include various demographic factors (age, gender, race and country), known risk factors (e.g. smoking), medical history and occupation. For screening to be highly effective, the prevalence in the population should be as high as is practicable. Harper et al. (2000) provides additional comments on the importance of the study population.

ROC curves
It is noted above that there may be opportunities to design a screening test that has different combinations of sensitivity and specificity. If so, there are opportunities to design the test to possess characteristics that are superior in terms of the combination of possible consequences of false positives and false negatives. For example in the LDCT test, the threshold for size (mm) of the nodules or other characteristics (e.g. solid or semisolid nodules) might be varied (Lam et al., 2013), choices that would alter the sensitivity or specificity. Thompson et al. (2005) and Zou et al. (2007) offer other relevant examples of ROC curves. Figure 6 shows a typical receiver operating characteristic (ROC) 16 curve for a prostate specific antigen (PSA) test administered to men aged 70 or more.
Each subject is tested and a specific PSA score determined. Subjects were also administered digital rectal examinations and biopsies -those with positive biopsies were used as the gold standard for assessment of disease status. A series of possible PSA cutoff scores (measured in nanograms per milliliter ng/ml) were considered for the screening test. Each cutoff score resulted in a partitioning of subjects into those who tested positive and those who tested negative. Knowing the actual disease status of the subjects enabled calculation of the sensitivity and specificity of the test. The ROC curve plots the calculated sensitivity against the false positive error (1 -Sp). Thus, each plotted point on the curve represents a 16 ROC analysis emerged from the study of signal detection problems differentiating signals from noise. These were first used by scientists in Britain during World War II as the abilities of radar receiver operators were being assessed based on their ability to differentiate signal (e.g. enemy aircraft) from noise (non-relevant targets). The term was later borrowed by statisticians assessing screening tests. different possible screening test with its own sensitivity and specificity. By considering the consequences of false positives and false negatives, it is possible to determine a cutoff value for the PSA test that is optimal in some sense. One statistic often used to characterize the ROC is the area under the curve (AUC). A perfectly discriminatory ROC would have an AUC ¼ 1.0. The value for the PSA tests studied by Thompson et al. (2005) was 0.678. Thompson et al. (2005) also considered using a so-called Gleason score 17 with a cutoff of 8 or more in this population. Figure 7 shows the ROC curve (topmost curve) for this possible screening test. As can be seen, this series of tests dominates the tests based upon PSA score alone (the AUC in this case is 0.827). The dashed line in Figure 7 shows the ROC curve that would occur under chance alone.
Whether or not and for whom PSA screening is appropriate requires the same sort of analysis noted for the LDCT screening evaluation. The ROC curve is just one piece of the puzzle, but this type of analysis shows that it is possible to design a screening test with several alternative combinations of sensitivity and specificity.
A complete specification of a screening test includes the intrinsic test characteristics (sensitivity, selectivity and cost) and ROC curve (if multiple tests are possible), characteristics of the subject population (including opportunities for segmenting the population to identify high risk groups), the key derived quantities (PPV and NPV) and the consequences of false positives and negatives.

Concluding remarks
Screening tests have the potential to be a cost effective means for identifying subjects with early stage (and thus potentially more treatable) disease before symptoms develop and therefore, for saving lives. The ideal screening test would discriminate perfectly between those who have or do not have the disease and be inexpensive and not invasive. In practice, screening tests exhibit false positives and false negatives -errors with consequences that need to be carefully considered when evaluating the advantages and disadvantages of the test.
The predictive value of the test depends in part on the technical parameters of the test, including the sensitivity and specificity, but also on the prevalence of the disease in the population. For this reason, it is necessary to be able to define the population to be tested so that the prevalence is high. This is why mammography is appropriate only for older women and those with a family history of breast cancer and why lung CT scans are not appropriate for screening the general population.
With some screening tests it is possible to alter the test decision criterion to alter the balance between sensitivity and specificity in which case it may be possible to develop an optimal screening test.
Nonetheless, screening of asymptomatic populations is not always appropriate and could do more harm than good. 18 Table 6 summarizes the circumstances/conditions when screening might be either appropriate or contra-indicated.

Declaration of interest
This paper represents independent research and the authors are solely responsible for the content. Two of the authors (LDM and MJU) were asked by ECFIA to give an opinion on the use of CT scans for workers engaged in the production of high temperature insulating wools. Disease constitutes a significant public health problem, meaning that it is a relatively common condition with significant morbidity and mortality or disease is contagious and might infect others before symptoms occur and disease detected.
Disease is rare or not serious or, if serious there is no effective treatment for disease.
The population to be screened can be so defined that the prevalence is high and there are no significant co-morbidities.
Unknown or low population prevalence Treatment before symptoms occur is more effective than if treatment is delayed No benefit to early treatment and/or significant likelihood of overdiagnosis (pseudodisease) ''Gold Standard'' diagnostic exists and screening test sensitivity and specificity is high and based on adequate sample size Screening test data is based on small sample sizes or is difficult to extrapolate to larger pool of screening centers with high sensitivity and specificity (e.g. high inter-observer variability) Consequences of false negative or false positives are modest Consequences of one or more of these errors significant Screening test is inexpensive, easy to administer, not harmful and reliable Any of these circumstances not met There must be some mechanism for follow-up of subjects with positive screening results to ensure subsequent diagnostic testing and ultimate treatment takes place.