Abstract
The recent movement of regulatory agencies toward probabilistic analyses of human health and environmental risks has focused greater attention on the quality of the estimates of variability and uncertainty that underlie them. Of particular concern is how uncertainty (a measure of what is not known) is characterized, as uncertainty can play an influential role in analyses of the need for regulatory controls or in estimates of the economic value of additional research. This paper reports the second phase of a study, conducted as an element of the National Human Exposure Assessment Survey (NHEXAS), to obtain and calibrate exposure assessment experts' judgments about uncertainty in residential ambient, residential indoor, and personal air benzene concentrations experienced by the nonsmoking, nonoccupationally exposed population in U.S. EPA's Region V. Subjective judgments (i.e., the median, interquartile range, and 90% confidence interval) about the means and 90th percentiles of each of the benzene distributions were elicited from the seven experts participating in the study. The calibration or quality of the experts' judgments was assessed by comparing them to the actual measurements from the NHEXAS Region V study using graphical techniques, a quadratic scoring rule, and surprise and interquartile indices. The results from both quantitative scoring methods suggested that, considered collectively, the experts' judgments were relatively well calibrated, although on balance underconfident. The calibration of individual expert judgments appeared variable, highlighting potential pitfalls in reliance on individual experts. In a surprising finding, the experts' judgments about the 90th percentiles of the benzene distributions were better calibrated than their predictions about the means; the experts tended to be overconfident in their ability to predict the means.
This paper is also one of the first calibration studies to demonstrate the importance of taking intraexpert correlation into account when assessing the statistical significance of the findings. When the judgments were assumed to be independent, analysis of the surprise and interquartile indices found evidence of poor calibration (P<0.05). However, when the intraexpert correlation in the study was taken into account, these findings were no longer statistically significant. The analysis further found that the experts' judgments scored better than estimates of Region V benzene concentrations simply drawn from earlier studies of ambient, indoor, and personal benzene levels in other U.S. cities. These results suggest the value of careful elicitation of expert judgments in characterizing exposures in probabilistic form. Additional calibration studies need to be undertaken to corroborate and extend these findings.
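As a rough illustration of the two interval-based indices named in the abstract, the following minimal Python sketch assumes that the surprise index is the fraction of measured values falling outside an expert's 90% interval and that the interquartile index is the fraction falling inside the interquartile range. The `Judgment` class, the function names, and all numbers are hypothetical and are not NHEXAS data.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Elicited fractiles for one quantity (hypothetical container)."""
    p05: float
    p25: float
    p50: float
    p75: float
    p95: float


def interquartile_index(judgments, truths):
    """Fraction of true values that fall inside the elicited interquartile range."""
    hits = sum(j.p25 <= t <= j.p75 for j, t in zip(judgments, truths))
    return hits / len(truths)


def surprise_index(judgments, truths):
    """Fraction of true values that fall outside the elicited 90% interval."""
    misses = sum(not (j.p05 <= t <= j.p95) for j, t in zip(judgments, truths))
    return misses / len(truths)


# Made-up judgments and "measured" values, for illustration only.
judgments = [
    Judgment(1, 2, 3, 4, 6),
    Judgment(2, 4, 5, 7, 10),
    Judgment(1, 3, 4, 6, 9),
    Judgment(5, 8, 10, 12, 20),
]
truths = [3.5, 11.0, 2.0, 9.0]

iq_index = interquartile_index(judgments, truths)
s_index = surprise_index(judgments, truths)
print(iq_index, s_index)
```

For well-calibrated judgments one would expect an interquartile index near 0.50 and a surprise index near 0.10; a surprise index well above 0.10 points toward overconfidence, and one well below it toward underconfidence.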
Notes
In subjective judgment research, calibration has a meaning similar to that in exposure assessment, where an instrument is first tested to determine whether it is giving a “true” measurement as defined by an appropriate standard and, if not, is subsequently adjusted until it does. Originally, calibration methods were developed for use during elicitations to provide feedback to those giving the judgments in the hope of causing them to “adjust” their subsequent judgments closer to what they truly knew, i.e., to help them be more “self-knowledgeable” and thus better calibrated. In more recent studies, as in this one, the term calibration refers to the first step only.
The original study design called for 350 samples in ambient, indoor, and personal air, and this is the number conveyed to the experts during the elicitation. The final number of samples differed and is reported with the NHEXAS results.
The Brier score is computed as the average of the squared differences between the stated probability R_ij for the ith question and the jth class (e.g., rain or no rain, m = 2) and the individual score achieved, s_ij, where s_ij = 1 if the response turns out to be correct and s_ij = 0 if it does not:

$$B = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{m}\left(R_{ij} - s_{ij}\right)^{2}$$

where N denotes the number of questions.
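The verbal definition of the Brier score in this note can be sketched in a few lines of Python. This is a minimal illustration rather than the scoring code used in the study: the function name `brier_score` and the rain/no-rain numbers are assumptions made for the example.

```python
def brier_score(stated, outcomes):
    """Average over questions of the summed squared differences (R_ij - s_ij)^2.

    stated[i][j]   : probability R_ij assigned to class j on question i
    outcomes[i][j] : s_ij, 1 if class j turned out to be correct, else 0
    """
    total = 0.0
    for r_i, s_i in zip(stated, outcomes):
        total += sum((r - s) ** 2 for r, s in zip(r_i, s_i))
    return total / len(stated)


# Hypothetical two-class example (rain, no rain), m = 2; it rained every time.
stated = [[0.8, 0.2], [0.3, 0.7], [1.0, 0.0]]
outcomes = [[1, 0], [1, 0], [1, 0]]

score = brier_score(stated, outcomes)
print(round(score, 4))
```

In this positive convention a lower score is better, and a forecaster who assigns probability 1 to the class that occurs on every question scores 0.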
Another difference is that Matheson and Winkler defined their score as negative although the interpretation is the same. Because most studies report the positive version of the Brier score and its components, we have used the same convention here.
Wallace reports the “global” mean residential ambient and personal air concentrations from the Total Exposure Assessment Methodology (TEAM) studies to be 6 and 15 μg/m³, respectively. The overall mean for indoor benzene concentrations and the 90th percentile estimates were not reported. The method used to develop the “global” estimates was not given in the paper, but the estimates were approximately consistent with unweighted grand means for all of the studies in the paper. The corresponding 90th percentile estimates were 11, 18, and 31 μg/m³ for ambient, indoor and residential air, respectively. Modest uncertainty distributions with geometric standard deviations of 1.5 were assumed.
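To show what a geometric standard deviation of 1.5 implies, the sketch below computes fractiles of an uncertainty distribution around a point estimate. It assumes, purely for illustration, that the distribution is lognormal and that the literature value is treated as its median; the note does not state either choice, and the function name `lognormal_fractiles` is hypothetical.

```python
from statistics import NormalDist


def lognormal_fractiles(median, gsd, probs=(0.05, 0.25, 0.50, 0.75, 0.95)):
    """Fractiles of a lognormal distribution with the given median and GSD.

    The p-th fractile is median * gsd ** z_p, where z_p is the
    standard normal quantile for probability p.
    """
    z = NormalDist()
    return {p: median * gsd ** z.inv_cdf(p) for p in probs}


# Assumption: the 6 ug/m3 ambient mean from the note is used as the median.
ambient = lognormal_fractiles(6.0, 1.5)
for p, value in ambient.items():
    print(f"{p:.2f}: {value:.2f} ug/m3")
```

Under these assumptions the 90% interval spans roughly a factor of two on either side of the median, which gives a sense of how "modest" the assumed uncertainty is.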
The beta-binomial model has long been used to analyze data from animal experiments in which the experimental unit is a litter and where “[r]esponses within a litter are assumed to form a set of Bernoulli trials whose success probability varies between litters in the same treatment group according to a two parameter beta distribution” (Williams, 1975). In our study, each of the seven experts represents a “litter,” each of which has the same number of “animals” or questions. The experts' surprise index or interquartile index results constitute binary response data. Using a model in S+ developed by one of the authors, P. Catalano, to analyze experimental data with animals, two parameters of interest could be estimated: the mean success probability (mean probabilities of surprise and of including the true NHEXAS values in the interquartile range) across all experts and rho, ρ, a measure of intraexpert correlation (Williams, 1975).
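The following sketch illustrates the beta-binomial idea described in this note; it is not the authors' S+ model. It assumes the common parameterization in which the beta parameters are a = μ(1 − ρ)/ρ and b = (1 − μ)(1 − ρ)/ρ, and it fits μ and ρ by a coarse grid search over the likelihood. The surprise counts, the choice of six questions per expert, and all function names are hypothetical.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_loglik(mu, rho, counts, n):
    """Beta-binomial log-likelihood for per-expert counts out of n questions.

    mu  : mean success probability across experts
    rho : intraexpert (intra-"litter") correlation, 0 < rho < 1
    """
    a = mu * (1 - rho) / rho
    b = (1 - mu) * (1 - rho) / rho
    ll = 0.0
    for x in counts:
        ll += (math.log(math.comb(n, x))
               + log_beta(x + a, n - x + b) - log_beta(a, b))
    return ll


def fit_betabinom(counts, n, steps=99):
    """Coarse grid-search maximum likelihood estimate of (mu, rho)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    best = max((betabinom_loglik(mu, rho, counts, n), mu, rho)
               for mu in grid for rho in grid)
    return best[1], best[2]


# Made-up data: number of "surprises" for each of seven experts,
# each answering n_questions questions.
surprise_counts = [0, 1, 0, 3, 0, 2, 1]
n_questions = 6

mu_hat, rho_hat = fit_betabinom(surprise_counts, n_questions)
print(mu_hat, rho_hat)
```

Under this model the variance of each expert's count is n·μ(1 − μ)·[1 + (n − 1)ρ], so a positive ρ inflates the variance relative to independent binomial trials. This is one way to see why findings that look significant under independence can lose significance once intraexpert correlation is taken into account.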
The value of a perfect score will vary with the number and spacing of the fractiles to some degree. In the discrete case, where an expert is asked to answer true/false questions to which the answers are known and he assigns the probability of being correct a zero or a one, a total score of zero is possible if all questions are answered correctly. However, when the study design allows different probabilities to be specified, the expertise portion of the score will always be positive and to some extent a measure of the dispersion in the probabilities stated by the experts, Ri.
We obtained each of these papers to reconfirm that the scoring rule used by Winkler and Poses was the same as that used in our paper. Besides these studies, Winkler and Poses also analyzed data from a study by Christensen-Szalanski and Bushyhead (1981). We could not confirm the scores from this study, and because they seemed suspect, we have not reported them here.
As discussed earlier, because the expertise portion of the score is in part determined by the fractiles elicited in a particular study, direct comparisons across studies may be misleading and are therefore not reported here.
This study originally envisioned expert judgment on a pair of chemicals, one for which substantial data existed and one about which little exposure data were known. Benzene fulfilled the first requirement, but circumstances prevented the second part of the study from being completed.
Cooke (1991) recommends no more than two hours be spent at a time on an elicitation!
Abbreviations
- BEADS: Benzene Exposure and Absorbed Dose Simulation model
- CRARM: Commission on Risk Assessment and Risk Management
- ETS: environmental tobacco smoke
- NHEXAS: National Human Exposure Assessment Survey
- NRC: National Research Council
- TEAM: Total Exposure Assessment Methodology
- U.S. EPA: U.S. Environmental Protection Agency
References
Bevington P.R. Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill, New York, 1969.
Bostrom A., Fischhoff B. and Morgan M.G. Characterizing mental models of hazardous processes: a methodology and an application to radon. J Soc Issues 1992: 48(4): 85–100.
Bostrom A., Atman C., Fischhoff B. and Morgan M.G. Evaluating risk communications: completing and correcting mental models of hazardous processes: Part II. Risk Anal 1994: 14(5): 789–798.
Box G.E.P., Hunter W.G. and Hunter J.S. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley, New York, 1978.
Brier G.W. Verification of forecasts expressed in terms of probability. Mon Weather Rev 1950: 78(1): 1–3.
Christensen-Szalanski J.J.J. and Beach L.R. The citation bias: fad and fashion in the judgment and decision literature. Am Psychol 1984: 39: 75–79.
Christensen-Szalanski J.J.J. and Bushyhead J.B. Physicians' use of probabilistic information in a real clinical setting. J Exp Psychol Hum Percept Perform 1981: 7: 928–935.
Cooke R.M. Experts in Uncertainty: Opinion and Subjective Probability in Science. Oxford University Press, New York, 1991.
Cooke R.M., Mendel M. and Thys W. Calibration and information in expert resolution: a classical approach. Automatica 1988: 24: 87–94.
DeGroot M. and Fienberg S. The comparison and evaluation of forecasters. The Statistician 1983: 32: 12–22.
Desmet A.A., Fryback D.G. and Thornbury J.R. A second look at the utility of radiographic skull examination for trauma. AJR Am J Roentgenol 1979: 132: 95–99.
Evans J.S., Cooper D.W. and Kinney P.L. On the propagation of error in air pollution measurements. Environ Monit Assess 1984: 4: 139–153.
Evans J.S., Graham J.D., Gray G.M. and Sielken R.L. A distributional approach to characterizing low-dose cancer risk. Risk Anal 1994: 14(1): 25–34.
Fiering M., Wilson R., Kleiman E. and Zeise L. Statistical distributions of health risk. Civ Eng Syst 1984: 1: 129–138.
Hawkins N.C. and Evans J.S. Subjective estimation of toluene exposures: a calibration study of industrial hygienists. Appl Ind Hyg J 1989: 4(3): 61–68.
Henrion M. and Fischhoff B. Assessing uncertainty in physical constants. Am J Phys 1986: 54(9): 791–798.
Hynes M.E. and Van Marke E.H. Reliability of embankment performance predictions. In: Mechanics in Engineering, 1st ASCE-EMD Specialty Conference, University of Waterloo, May 26–28, 1976 (as cited in Cooke, 1991).
Lichtenstein S.B., Fischhoff B. and Phillips L. Calibration of probabilities: the state of the art to 1980. In: Kahneman D., Slovic P., Tversky A. (Eds.), Judgment Under Uncertainty: Heuristics and Biases. Cambridge University Press, New York, 1982.
Macintosh D.L., Xue J., Özkaynak H., Spengler J.D. and Ryan P.B. A population-based exposure model for benzene. J Exposure Anal Environ Epidemiol 1995: 5(3): 375–403.
Matheson J.E. and Winkler R.L. Scoring rules for continuous probability distributions. Manage Sci 1976: 22: 1087–1096.
Morgan M.G. and Henrion M. Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. Cambridge University Press, New York, 1990.
Mullin T.M. Understanding and supporting the process of probabilistic estimation. 1986. PhD dissertation. As cited in Morgan and Henrion, 1990. Carnegie Mellon University, Pittsburgh.
Murphy A.H. Scalar and vector partitions of the probability score: Part I. Two-state situation. J Appl Meteorol 1972: 11: 273–282.
Murphy A.H. A new vector partition of the probability score. J Appl Meteorol 1973: 12: 595–600.
Murphy A.H. and Winkler R.L. Reliability of subjective probability forecasts of precipitation and temperature. Appl Stat 1977: 26: 41–47.
Murphy A.H. and Winkler R.L. Diagnostic verification of probability forecasts. Int J Forecast 1992: 7: 435–455.
Pellizzari E. Concentrations of benzene in ambient, indoor, and personal air in Region V NHEXAS study. Personal communication, 1998.
Pellizzari E., Lioy P., Whitmore R., Freeman N., Clayton C.A., Rodes C. and Thomas K. Quality Systems and Implementation Plan for Human Exposure Assessment. Research Triangle Institute, Research Triangle Park, NC, 1995.
Raiffa H. Decision Analysis. Addison-Wesley, Reading, MA, 1968.
Samet J.H., Shevitz A., Fowle J. and Singer D.E. Hospitalization decisions in febrile intravenous drug users. Am J Med 1990: 89: 53–57.
Seiler F.A. Error propagation for large errors. Risk Anal 1987: 7(4): 509–518.
Sexton K. Informed decisions about protecting and promoting public health: rationale for a national human exposure assessment survey. J Exposure Anal Environ Epidemiol 1995: 5(3): 233–256.
Shlyakhter A.I., Kammen D.M., Broido C.L. and Wilson R. Quantifying the credibility of energy projections from trends in past data — the United States energy sector. Energ Policy 1994: 22(2): 119–130.
Taylor A.C., Evans J.S. and McKone T.E. The value of animal test information in environmental control decisions. Risk Anal 1993: 13(4): 403–412.
Thompson K. and Evans J.S. The value of improved national exposure information for perchloroethylene (perc); a case study for dry cleaners. Risk Anal 1997: 17(2): 253–271.
Tierney W.M., Fitzgerald J., McHenry R., Roth B., Psaty B., Stump D. and Anderson K. Physicians' estimates of the probability of myocardial infarction in emergency room patients with chest pain. Med Decis Making 1986: 6(1): 12–17.
Van Lenthe J. ELI: an interactive elicitation technique for subjective probability. Organ Behav Decis Process 1993: 55: 379–413.
Walker K.D., Macintosh D. and Evans J.S. Use of expert judgment in exposure assessment: Part I. Characterization of personal exposure to benzene. J Exposure Anal Environ Epidemiol 2001: 11: 308–322.
Wallace L. Environmental exposure to benzene: an update. Environ Health Perspect 1996: 104(Suppl. 6): 1129–1139.
Williams D.A. The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics 1975: 31: 949–952.
Winkler R.L. and Poses R.M. Evaluating and combining physicians' probabilities of survival in an intensive care unit. Manage Sci 1993: 39(12): 1526–1543.
Yaniv I. and Foster D.P. Precision and accuracy of judgmental estimation. J Behav Decis Making 1997: 10(1): 21–32.
Acknowledgements
We are thankful to John Graham of the Harvard School of Public Health for his insightful comments on this study. This work was supported by U.S. EPA Cooperative Agreement CR822038-03-1-3; Health Resources and Services Administration, Bureau of Health Professions Grant A03-AH01165-01; National Science Foundation Grant SES-0084372, “Combining Expert Judgments for Environmental Risk Analysis,” and the Leslie Silverman Foundation.
Cite this article
Walker, K., Catalano, P., Hammitt, J. et al. Use of expert judgment in exposure assessment: Part 2. Calibration of expert judgments about personal exposures to benzene. J Expo Sci Environ Epidemiol 13, 1–16 (2003). https://doi.org/10.1038/sj.jea.7500253
DOI: https://doi.org/10.1038/sj.jea.7500253