  • Original Article

Use of expert judgment in exposure assessment: Part 2. Calibration of expert judgments about personal exposures to benzene

Abstract

The recent movement of regulatory agencies toward probabilistic analyses of human health and environmental risks has focused greater attention on the quality of the estimates of variability and uncertainty that underlie them. Of particular concern is how uncertainty (a measure of what is not known) is characterized, as uncertainty can play an influential role in analyses of the need for regulatory controls or in estimates of the economic value of additional research. This paper reports the second phase of a study, conducted as an element of the National Human Exposure Assessment Survey (NHEXAS), to obtain and calibrate exposure assessment experts' judgments about uncertainty in residential ambient, residential indoor, and personal air benzene concentrations experienced by the nonsmoking, nonoccupationally exposed population in U.S. EPA's Region V. Subjective judgments (i.e., the median, interquartile range, and 90% confidence interval) about the means and 90th percentiles of each of the benzene distributions were elicited from the seven experts participating in the study. The calibration or quality of the experts' judgments was assessed by comparing them to the actual measurements from the NHEXAS Region V study using graphical techniques, a quadratic scoring rule, and surprise and interquartile indices. The results from both quantitative scoring methods suggested that, considered collectively, the experts' judgments were relatively well calibrated although, on balance, underconfident. The calibration of individual expert judgments appeared variable, highlighting potential pitfalls in reliance on individual experts. In a surprising finding, the experts' judgments about the 90th percentiles of the benzene distributions were better calibrated than their predictions about the means; the experts tended to be overconfident in their ability to predict the means.
This paper is also one of the first calibration studies to demonstrate the importance of taking into account intraexpert correlation on the statistical significance of the findings. When the judgments were assumed to be independent, analysis of the surprise and interquartile indices found evidence of poor calibration (P<0.05). However, when the intraexpert correlation in the study was taken into account, these findings were no longer statistically significant. The analysis further found that the experts' judgments scored better than estimates of Region V benzene concentrations simply drawn from earlier studies of ambient, indoor and personal benzene levels in other U.S. cities. These results suggest the value of careful elicitation of expert judgments in characterizing exposures in probabilistic form. Additional calibration studies need to be undertaken to corroborate and extend these findings.
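
As a minimal sketch of how the surprise and interquartile indices mentioned in the abstract can be tallied, assume each judgment is stored as its elicited 5th, 25th, 75th, and 95th percentiles together with the value later measured. Treating the 90% confidence interval as the 5th to 95th percentile range is an assumption, and the `Judgment` class, the function names, and all numbers below are hypothetical illustrations rather than NHEXAS data or the authors' code. For well-calibrated judgments one would expect a surprise index near 0.10 and an interquartile index near 0.50.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One elicited distribution (hypothetical layout) plus the measured value."""
    p05: float
    p25: float
    p75: float
    p95: float
    actual: float


def surprise_index(judgments):
    """Fraction of measured values that fall outside the elicited 90% interval."""
    misses = sum(1 for j in judgments if not (j.p05 <= j.actual <= j.p95))
    return misses / len(judgments)


def interquartile_index(judgments):
    """Fraction of measured values that fall inside the elicited interquartile range."""
    hits = sum(1 for j in judgments if j.p25 <= j.actual <= j.p75)
    return hits / len(judgments)


# Made-up values for illustration only.
judgments = [
    Judgment(2.0, 4.0, 8.0, 12.0, 5.0),   # inside the interquartile range
    Judgment(1.0, 3.0, 6.0, 10.0, 9.0),   # inside the 90% interval only
    Judgment(2.0, 5.0, 9.0, 15.0, 20.0),  # a "surprise"
    Judgment(3.0, 6.0, 10.0, 18.0, 7.0),  # inside the interquartile range
]

print("surprise index:", surprise_index(judgments))
print("interquartile index:", interquartile_index(judgments))
```

A surprise index well above 0.10 points toward overconfidence (intervals too narrow), while an interquartile index well above 0.50 points toward underconfidence (intervals too wide).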

Notes

  1. In subjective judgment research, calibration has a meaning similar to that in exposure assessment, where an instrument is first tested to determine whether it is giving a “true” measurement as defined by an appropriate standard and, if not, is subsequently adjusted until it does. Originally, calibration methods were developed for use during elicitations to provide feedback to those giving the judgments in hopes of causing them to “adjust” their subsequent judgments closer to what they truly knew, i.e., to help them be more “self-knowledgeable” and thus better calibrated. In more recent studies, as in this one, the term calibration refers to the first step only.

  2. The original study design called for 350 samples in ambient, indoor, and personal air, and this is the number conveyed to the experts during the elicitation. The final number of samples differed and is reported with the NHEXAS results.

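  3. The Brier score is computed as the average of the squared differences between the stated probability Rij for the ith question and the jth class (e.g., rain or no rain, m=2) and the individual score achieved, sij, where sij=1 if the response turns out to be correct and sij=0 if it does not. In the form given by Brier (1950), with N denoting the number of questions:

     \[ B = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} \left( R_{ij} - s_{ij} \right)^{2} \]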

  4. Another difference is that Matheson and Winkler defined their score as negative although the interpretation is the same. Because most studies report the positive version of the Brier score and its components, we have used the same convention here.

  5. Wallace (1996) reports the “global” mean residential ambient and personal air concentrations from the Total Exposure Assessment Methodology (TEAM) studies to be 6 and 15 μg/m³, respectively. The overall mean for indoor benzene concentrations and the 90th percentile estimates were not reported. The method used to develop the “global” estimates was not given in the paper, but the estimates were approximately consistent with unweighted grand means for all of the studies in the paper. The corresponding 90th percentile estimates were 11, 18, and 31 μg/m³ for ambient, indoor, and personal air, respectively. Modest uncertainty distributions with geometric standard deviations of 1.5 were assumed.

  6. The beta-binomial model has long been used to analyze data from animal experiments in which the experimental unit is a litter and where “[r]esponses within a litter are assumed to form a set of Bernoulli trials whose success probability varies between litters in the same treatment group according to a two parameter beta distribution” (Williams, 1975). In our study, each of the seven experts represents a “litter,” each of which has the same number of “animals” or questions. The experts' surprise index or interquartile index results constitute binary response data. Using a model in S+ developed by one of the authors, P. Catalano, to analyze experimental data with animals, two parameters of interest could be estimated: the mean success probability (mean probabilities of surprise and of including the true NHEXAS values in the interquartile range) across all experts, and rho, ρ, a measure of intraexpert correlation (Williams, 1975).

  7. The value of a perfect score will vary with the number and spacing of the fractiles to some degree. In the discrete case, where an expert is asked to answer true/false questions to which the answers are known and he assigns the probability of being correct a zero or a one, a total score of zero is possible if all questions are answered correctly. However, when the study design allows different probabilities to be specified, the expertise portion of the score will always be positive and to some extent a measure of the dispersion in the probabilities stated by the experts, Ri.

  8. We obtained each of these papers to reconfirm that the scoring rule used by Winkler and Poses was the same as that used in our paper. Besides these studies, Winkler and Poses also analyzed data from a study by Christensen-Szalanski and Bushyhead (1981). We could not confirm the scores from this study, and because they seemed suspect, we have not reported them here.

  9. As discussed earlier, because the expertise portion of the score is in part determined by the fractiles elicited in a particular study, direct comparisons across studies may be misleading and are therefore not reported here.

  10. This study originally envisioned expert judgment on a pair of chemicals, one for which substantial data existed and one about which little exposure data were known. Benzene fulfilled the first requirement, but circumstances prevented the second part of the study from being completed.

  11. Cooke (1991) recommends no more than two hours be spent at a time on an elicitation!
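
Note 6 describes treating each expert as a “litter” in a beta-binomial model. The sketch below shows one way such a model could be fitted, using a coarse grid search over the mean success probability and the intraexpert correlation ρ. It is not the S+ model the authors used. It assumes the common parameterization α = μ(1 − ρ)/ρ and β = (1 − μ)(1 − ρ)/ρ, so that ρ = 1/(α + β + 1); the function names, the grid resolution, and the counts are hypothetical.

```python
import math


def betabinom_loglik(k, n, mu, rho):
    """Log-likelihood of k successes in n trials under a beta-binomial model
    with mean mu and intra-cluster correlation rho, both strictly in (0, 1)."""
    a = mu * (1.0 - rho) / rho
    b = (1.0 - mu) * (1.0 - rho) / rho
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_beta_num = math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den


def fit_betabinom(counts, n, steps=99):
    """Grid-search estimate of (mu, rho); counts holds one success count per
    expert, and every expert is assumed to answer the same n questions."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]  # 0.01, 0.02, ..., 0.99
    best = None
    for mu in grid:
        for rho in grid:
            ll = sum(betabinom_loglik(k, n, mu, rho) for k in counts)
            if best is None or ll > best[0]:
                best = (ll, mu, rho)
    return best[1], best[2]


# Made-up counts for seven experts answering six questions each.
similar_experts = [1, 1, 1, 1, 1, 1, 1]      # every expert behaves alike
clustered_experts = [0, 0, 6, 6, 0, 6, 0]    # results cluster within experts

mu_s, rho_s = fit_betabinom(similar_experts, n=6)
mu_c, rho_c = fit_betabinom(clustered_experts, n=6)
print("similar experts:   mu =", mu_s, "rho =", rho_s)
print("clustered experts: mu =", mu_c, "rho =", rho_c)
```

When ρ is large, the 42 answers carry far less independent information than 42 Bernoulli trials would, which is why a test that assumes independence can overstate statistical significance. In practice a numerical optimizer would replace the grid search.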

Abbreviations

BEADS:

Benzene Exposure and Absorbed Dose Simulation model

CRARM:

Commission on Risk Assessment and Risk Management

ETS:

environmental tobacco smoke

NHEXAS:

National Human Exposure Assessment Survey

NRC:

National Research Council

TEAM:

Total Exposure Assessment Methodology

U.S. EPA:

U.S. Environmental Protection Agency

References

  • Bevington P.R. Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill, New York, 1969.

  • Bostrom A., Fischhoff B. and Morgan M.G. Characterizing mental models of hazardous processes: a methodology and an application to radon. J Soc Issues 1992: 48(4): 85–100.

  • Bostrom A., Atman C., Fischhoff B. and Morgan M.G. Evaluating risk communications: completing and correcting mental models of hazardous processes: Part II. Risk Anal 1994: 14(5): 789–798.

  • Box G.E.P., Hunter W.G. and Hunter J.S. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley, New York, 1978.

  • Brier G.W. Verification of forecasts expressed in terms of probability. Mon Weather Rev 1950: 78(1): 1–3.

  • Christensen-Szalanski J.J.J. and Beach L.R. The citation bias: fad and fashion in the judgment and decision literature. Am Psychol 1984: 39: 75–79.

  • Christensen-Szalanski J.J.J. and Bushyhead J.B. Physicians' use of probabilistic information in a real clinical setting. J Exp Psychol Hum Percept Perform 1981: 7: 928–935.

  • Cooke R.M. Experts in Uncertainty: Opinion and Subjective Probability in Science. Oxford University Press, New York, 1991.

  • Cooke R.M., Mendel M. and Thys W. Calibration and information in expert resolution: a classical approach. Automatica 1988: 24: 87–94.

  • DeGroot M. and Fienberg S. The comparison and evaluation of forecasters. The Statistician 1983: 32: 12–22.

  • Desmet A.A., Fryback D.G. and Thornbury J.R. A second look at the utility of radiographic skull examination for trauma. AJR Am J Roentgenol 1979: 132: 95–99.

  • Evans J.S., Cooper D.W. and Kinney P.L. On the propagation of error in air pollution measurements. Environ Monit Assess 1984: 4: 139–153.

  • Evans J.S., Graham J.D., Gray G.M. and Sielken R.L. A distributional approach to characterizing low-dose cancer risk. Risk Anal 1994: 14(1): 25–34.

  • Fiering M., Wilson R., Kleiman E. and Zeise L. Statistical distributions of health risk. Civ Eng Syst 1984: 1: 129–138.

  • Hawkins N.C. and Evans J.S. Subjective estimation of toluene exposures: a calibration study of industrial hygienists. Appl Ind Hyg J 1989: 4(3): 61–68.

  • Henrion M. and Fischhoff B. Assessing uncertainty in physical constants. Am J Phys 1986: 54(9): 791–798.

  • Hynes M.E. and Van Marke E.H. Reliability of embankment performance predictions. In: Mechanics in Engineering, 1st ASCE-EMD Specialty Conference, University of Waterloo, May 26–28, 1976 (as cited in Cooke, 1991).

  • Lichtenstein S.B., Fischhoff B. and Phillips L. Calibration of probabilities: the state of the art to 1980. In: Kahneman D., Slovic P., Tversky A. (Eds.), Judgment Under Uncertainty: Heuristics and Biases. Cambridge University Press, New York, 1982.

  • Macintosh D.L., Xue J., Özkaynak H., Spengler J.D. and Ryan P.B. A population-based exposure model for benzene. J Exposure Anal Environ Epidemiol 1995: 5(3): 375–403.

  • Matheson J.E. and Winkler R.L. Scoring rules for continuous probability distributions. Manage Sci 1976: 22: 1087–1096.

  • Morgan M.G. and Henrion M. Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. Cambridge University Press, New York, 1990.

  • Mullin T.M. Understanding and supporting the process of probabilistic estimation. 1986. PhD dissertation. As cited in Morgan and Henrion, 1990. Carnegie Mellon University, Pittsburgh.

  • Murphy A.H. Scalar and vector partitions of the probability score: Part I. Two-state situation. J Appl Meteorol 1972: 11: 273–282.

  • Murphy A.H. A new vector partition of the probability score. J Appl Meteorol 1973: 12: 595–600.

  • Murphy A.H. and Winkler R.L. Reliability of subjective probability forecasts of precipitation and temperature. Appl Stat 1977: 26: 41–47.

  • Murphy A.H. and Winkler R.L. Diagnostic verification of probability forecasts. Int J Forecast 1992: 7: 435–455.

  • Pellizzari E. Concentrations of benzene in ambient, indoor, and personal air in Region V NHEXAS study. Personal communication, 1998.

  • Pellizzari E., Lioy P., Whitmore R., Freeman N., Clayton C.A., Rodes C. and Thomas K. Quality Systems and Implementation Plan for Human Exposure Assessment. Research Triangle Institute, Research Triangle Park, NC, 1995.

  • Raiffa H. Decision Analysis. Addison-Wesley, Reading, MA, 1968.

  • Samet J.H., Shevitz A., Fowle J. and Singer D.E. Hospitalization decisions in febrile intravenous drug users. Am J Med 1990: 89: 53–57.

  • Seiler F.A. Error propagation for large errors. Risk Anal 1987: 7(4): 509–518.

  • Sexton K. Informed decisions about protecting and promoting public health: rationale for a national human exposure assessment survey. J Exposure Anal Environ Epidemiol 1995: 5(3): 233–256.

  • Shlyakhter A.I., Kammen D.M., Broido C.L. and Wilson R. Quantifying the credibility of energy projections from trends in past data — the United States energy sector. Energ Policy 1994: 22(2): 119–130.

  • Taylor A.C., Evans J.S. and McKone T.E. The value of animal test information in environmental control decisions. Risk Anal 1993: 13(4): 403–412.

  • Thompson K. and Evans J.S. The value of improved national exposure information for perchloroethylene (perc); a case study for dry cleaners. Risk Anal 1997: 17(2): 253–271.

  • Tierney W.M., Fitzgerald J., McHenry R., Roth B., Psaty B., Stump D. and Anderson K. Physicians' estimates of the probability of myocardial infarction in emergency room patients with chest pain. Med Decis Making 1986: 6(1): 12–17.

  • Van Lenthe J. ELI: an interactive elicitation technique for subjective probability. Organ Behav Decis Process 1993: 55: 379–413.

  • Walker K.D., Macintosh D. and Evans J.S. Use of expert judgment in exposure assessment: Part I. Characterization of personal exposure to benzene. J Exposure Anal Environ Epidemiol 2001: 11: 308–322.

  • Wallace L. Environmental exposure to benzene: an update. Environ Health Perspect 1996: 104(Suppl. 6): 1129–1139.

  • Williams D.A. The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics 1975: 31: 949–952.

  • Winkler R.L. and Poses R.M. Evaluating and combining physicians' probabilities of survival in an intensive care unit. Manage Sci 1993: 39(12): 1526–1543.

  • Yaniv I. and Foster D.P. Precision and accuracy of judgmental estimation. J Behav Decis Making 1997: 10(1): 21–32.

Acknowledgements

We are thankful to John Graham of the Harvard School of Public Health for his insightful comments on this study. This work was supported by U.S. EPA Cooperative Agreement CR822038-03-1-3; Health Resources and Services Administration, Bureau of Health Professions Grant A03-AH01165-01; National Science Foundation Grant SES-0084372, “Combining Expert Judgments for Environmental Risk Analysis,” and the Leslie Silverman Foundation.

Author information

Corresponding author

Correspondence to Katherine D Walker.

About this article

Cite this article

Walker, K., Catalano, P., Hammitt, J. et al. Use of expert judgment in exposure assessment: Part 2. Calibration of expert judgments about personal exposures to benzene. J Expo Sci Environ Epidemiol 13, 1–16 (2003). https://doi.org/10.1038/sj.jea.7500253

  • DOI: https://doi.org/10.1038/sj.jea.7500253
