Use of expert judgment in exposure assessment: Part 2. Calibration of expert judgments about personal exposures to benzene

Walker, Katherine D; Catalano, Paul; Hammitt, James K; Evans, John S

doi:10.1038/sj.jea.7500253

Original Article
Published: 01 January 2003

Katherine D Walker¹, Paul Catalano², James K Hammitt³ & John S Evans¹

Journal of Exposure Science & Environmental Epidemiology volume 13, pages 1–16 (2003)

Abstract

The recent movement of regulatory agencies toward probabilistic analyses of human health and environmental risks has focused greater attention on the quality of the estimates of variability and uncertainty that underlie them. Of particular concern is how uncertainty (a measure of what is not known) is characterized, as uncertainty can play an influential role in analyses of the need for regulatory controls or in estimates of the economic value of additional research. This paper reports the second phase of a study, conducted as an element of the National Human Exposure Assessment Survey (NHEXAS), to obtain and calibrate exposure assessment experts' judgments about uncertainty in residential ambient, residential indoor, and personal air benzene concentrations experienced by the nonsmoking, nonoccupationally exposed population in U.S. EPA's Region V. Subjective judgments (i.e., the median, interquartile range, and 90% confidence interval) about the means and 90th percentiles of each of the benzene distributions were elicited from the seven experts participating in the study. The calibration or quality of the experts' judgments was assessed by comparing them to the actual measurements from the NHEXAS Region V study using graphical techniques, a quadratic scoring rule, and surprise and interquartile indices. The results from both quantitative scoring methods suggested that, considered collectively, the experts' judgments were relatively well calibrated although, on balance, underconfident. The calibration of individual expert judgments appeared variable, highlighting potential pitfalls in reliance on individual experts. In a surprising finding, the experts' judgments about the 90th percentiles of the benzene distributions were better calibrated than their predictions about the means; the experts tended to be overconfident in their ability to predict the means.
This paper is also one of the first calibration studies to demonstrate the importance of taking intraexpert correlation into account when assessing the statistical significance of the findings. When the judgments were assumed to be independent, analysis of the surprise and interquartile indices found evidence of poor calibration (P<0.05). However, when the intraexpert correlation in the study was taken into account, these findings were no longer statistically significant. The analysis further found that the experts' judgments scored better than estimates of Region V benzene concentrations simply drawn from earlier studies of ambient, indoor and personal benzene levels in other U.S. cities. These results suggest the value of careful elicitation of expert judgments in characterizing exposures in probabilistic form. Additional calibration studies need to be undertaken to corroborate and extend these findings.
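
The surprise and interquartile indices named in the abstract can be illustrated with a short sketch. It assumes each judgment is summarized by its elicited 5th, 25th, 75th, and 95th percentiles, takes the surprise index to be the fraction of true values falling outside the 90% interval (about 0.10 for a well-calibrated expert), and takes the interquartile index to be the fraction falling inside the interquartile range (about 0.50). The `Judgment` class, the function names, and all numbers are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One elicited distribution, summarized by four fractiles (hypothetical layout)."""
    p05: float
    p25: float
    p75: float
    p95: float


def surprise_index(judgments, truths):
    """Fraction of true values falling outside the elicited 90% interval."""
    misses = sum(1 for j, t in zip(judgments, truths) if t < j.p05 or t > j.p95)
    return misses / len(truths)


def interquartile_index(judgments, truths):
    """Fraction of true values falling inside the elicited interquartile range."""
    hits = sum(1 for j, t in zip(judgments, truths) if j.p25 <= t <= j.p75)
    return hits / len(truths)


# Made-up numbers for illustration only; they are not NHEXAS results.
judgments = [
    Judgment(2.0, 4.0, 8.0, 12.0),
    Judgment(1.0, 3.0, 5.0, 9.0),
    Judgment(5.0, 10.0, 20.0, 30.0),
    Judgment(3.0, 6.0, 9.0, 15.0),
]
truths = [6.0, 10.0, 12.0, 4.0]

si = surprise_index(judgments, truths)
iqi = interquartile_index(judgments, truths)
print(f"surprise index = {si:.2f}, interquartile index = {iqi:.2f}")
```

Under these definitions, a surprise index well above 0.10 or an interquartile index well below 0.50 points toward overconfidence (intervals too narrow), and the reverse pattern points toward underconfidence.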

Notes

In subjective judgment research, calibration has a meaning similar to that in exposure assessment, where an instrument is first tested to determine if it is giving a “true” measurement as defined by an appropriate standard and, if not, is subsequently adjusted until it does. Originally, calibration methods were developed for use during elicitations to provide feedback to those giving the judgments in hopes of causing them to “adjust” their subsequent judgments closer to what they truly knew, i.e., to help them be more “self-knowledgeable” and thus better calibrated. In more recent studies, as in this one, the term calibration refers to the first step only.
The original study design called for 350 samples in ambient, indoor, and personal air, and this is the number conveyed to the experts during the elicitation. The final number of samples differed and is reported with the NHEXAS results.
The Brier score is computed as the average of the squared differences between the stated probability R_ij for the ith question and the jth class (e.g., rain or no rain, m=2) and the individual score achieved, s_ij, where s_ij=1 if the response turns out to be correct and s_ij=0 if it does not.
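
A minimal sketch of this computation follows. It assumes the squared differences are summed over the m classes and averaged over the questions, which is the normalization in Brier (1950); whether the paper normalizes in exactly the same way is an assumption. The function name and the forecast values are hypothetical.

```python
def brier_score(stated, outcomes):
    """Mean over questions of the summed squared differences (R_ij - s_ij)^2.

    stated[i][j]   -- probability R_ij assigned to class j on question i
    outcomes[i][j] -- s_ij, 1 if class j occurred on question i, else 0
    """
    n = len(stated)
    total = 0.0
    for r_row, s_row in zip(stated, outcomes):
        total += sum((r - s) ** 2 for r, s in zip(r_row, s_row))
    return total / n  # assumption: the average is taken over the n questions


# Illustrative two-class forecasts (rain, no rain); the values are made up.
stated = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
outcomes = [[1, 0], [1, 0], [1, 0]]
score = brier_score(stated, outcomes)
print(score)
```

With this normalization and two classes, a score of 0 corresponds to certain and correct forecasts on every question, and a score of 2 to certain and wrong forecasts on every question.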
Another difference is that Matheson and Winkler defined their score as negative although the interpretation is the same. Because most studies report the positive version of the Brier score and its components, we have used the same convention here.
Wallace reports the “global” mean residential ambient and personal air concentrations from the Total Exposure Assessment Methodology (TEAM) studies to be 6 and 15 μg/m³, respectively. The overall mean for indoor benzene concentrations and the 90th percentile estimates were not reported. The method used to develop the “global” estimates was not given in the paper, but the estimates were approximately consistent with unweighted grand means for all of the studies in the paper. The corresponding 90th percentile estimates were 11, 18, and 31 μg/m³ for ambient, indoor and residential air, respectively. Modest uncertainty distributions with geometric standard deviations of 1.5 were assumed.
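
To give a sense of how wide an uncertainty distribution with a geometric standard deviation of 1.5 is, the sketch below computes the central 90% interval of a lognormal distribution. It assumes, for illustration only, that the distribution is lognormal and that the reported 6 μg/m³ ambient value serves as its median; the note does not say how the distributions were centered. The function name is hypothetical.

```python
from statistics import NormalDist


def lognormal_interval(median, gsd, coverage=0.90):
    """Central interval of a lognormal given its median and geometric SD."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)  # about 1.645 for 90%
    factor = gsd ** z
    return median / factor, median * factor


# Assumption: the 6 ug/m3 ambient value is used as the median of the distribution.
low, high = lognormal_interval(6.0, 1.5)
print(f"90% interval: {low:.1f} to {high:.1f} ug/m3")
```

Under these assumptions the interval spans a factor of roughly two on either side of the median.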
The beta-binomial model has long been used to analyze data from animal experiments in which the experimental unit is a litter and where “[r]esponses within a litter are assumed to form a set of Bernoulli trials whose success probability varies between litters in the same treatment group according to a two parameter beta distribution” (Williams, 1975). In our study, each of the seven experts represents a “litter,” each of which has the same number of “animals” or questions. The experts' surprise index or interquartile index results constitute binary response data. Using a model in S+ developed by one of the authors, P. Catalano, to analyze experimental data with animals, two parameters of interest could be estimated: the mean success probability (mean probabilities of surprise and of including the true NHEXAS values in the interquartile range) across all experts and rho, ρ, a measure of intraexpert correlation (Williams, 1975).
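
The role of ρ can be sketched with a simple method-of-moments calculation. This is not the authors' S+ model, which is not described here; it only uses the standard beta-binomial variance, Var(X) = nμ(1−μ)[1 + (n−1)ρ], for a count X of successes out of n questions per expert. The function names, the number of questions, and the counts are all hypothetical.

```python
from statistics import variance


def beta_binomial_moments(successes, n_questions):
    """Moment estimates of the mean success probability and intra-expert rho.

    successes   -- count of 'successes' per expert (one entry per expert)
    n_questions -- number of questions answered by every expert
    Uses Var(X) = n*mu*(1-mu)*(1 + (n-1)*rho) for a beta-binomial count X.
    """
    mu = sum(successes) / (len(successes) * n_questions)
    binomial_var = n_questions * mu * (1.0 - mu)
    inflation = variance(successes) / binomial_var  # observed / binomial variance
    rho = (inflation - 1.0) / (n_questions - 1)
    return mu, max(rho, 0.0)  # truncate at zero when counts are under-dispersed


def design_effect(rho, n_questions):
    """Factor by which correlation inflates the variance of the pooled proportion."""
    return 1.0 + (n_questions - 1) * rho


# Hypothetical data: seven experts, ten questions each (not the study's counts).
counts = [1, 1, 2, 2, 3, 5, 7]
mu, rho = beta_binomial_moments(counts, 10)
deff = design_effect(rho, 10)
print(f"mu = {mu:.3f}, rho = {rho:.3f}, design effect = {deff:.2f}")
```

A design effect above 1 means the pooled proportion is less precise than it would be if all judgments were independent, which is one way a result that looks significant under independence can lose significance once intraexpert correlation is accounted for.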
The value of a perfect score will vary with the number and spacing of the fractiles to some degree. In the discrete case, where an expert is asked to answer true/false questions to which the answers are known and he assigns the probability of being correct a zero or a one, a total score of zero is possible if all questions are answered correctly. However, when the study design allows different probabilities to be specified, the expertise portion of the score will always be positive and to some extent a measure of the dispersion in the probabilities stated by the experts, R_i.
We obtained each of these papers to reconfirm that the scoring rule used by Winkler and Poses was the same as that used in our paper. Besides these studies, Winkler and Poses also analyzed data from a study by Christensen-Szalanski and Bushyhead (1981). We could not confirm the scores from this study, and because they seemed suspect, we have not reported them here.
As discussed earlier, because the expertise portion of the score is in part determined by the fractiles elicited in a particular study, direct comparisons across studies may be misleading and are therefore not reported here.
This study originally envisioned expert judgment on a pair of chemicals, one for which substantial data existed and one about which little exposure data were known. Benzene fulfilled the first requirement, but circumstances prevented the second part of the study from being completed.
Cooke (1991) recommends no more than two hours be spent at a time on an elicitation!

Abbreviations

BEADS: Benzene Exposure and Absorbed Dose Simulation model
CRARM: Commission on Risk Assessment and Risk Management
ETS: environmental tobacco smoke
NHEXAS: National Human Exposure Assessment Survey
NRC: National Research Council
TEAM: Total Exposure Assessment Methodology
U.S. EPA: U.S. Environmental Protection Agency

References

Bevington P.R. Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill, New York, 1969.
Bostrom A., Fischhoff B. and Morgan M.G. Characterizing mental models of hazardous processes: a methodology and an application to radon. J Soc Issues 1992: 48(4): 85–100.
Bostrom A., Atman C., Fischhoff B. and Morgan M.G. Evaluating risk communications: completing and correcting mental models of hazardous processes: Part II. Risk Anal 1994: 14(5): 789–798.
Box G.E.P., Hunter W.G. and Hunter J.S. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley, New York, 1978.
Brier G.W. Verification of forecasts expressed in terms of probability. Mon Weather Rev 1950: 78(1): 1–3.
Christensen-Szalanski J.J.J. and Beach L.R. The citation bias: fad and fashion in the judgment and decision literature. Am Psychol 1984: 39: 75–79.
Christensen-Szalanski J.J.J. and Bushyhead J.B. Physicians' use of probabilistic information in a real clinical setting. J Exp Psychol Hum Percept Perform 1981: 7: 928–935.
Cooke R.M. Experts in Uncertainty: Opinion and Subjective Probability in Science. Oxford University Press, New York, 1991.
Cooke R.M., Mendel M. and Thys W. Calibration and information in expert resolution: a classical approach. Automatica 1988: 24: 87–94.
DeGroot M. and Fienberg S. The comparison and evaluation of forecasters. The Statistician 1983: 32: 12–22.
Desmet A.A., Fryback D.G. and Thornbury J.R. A second look at the utility of radiographic skull examination for trauma. AJR Am J Roentgenol 1979: 132: 95–99.
Evans J.S., Cooper D.W. and Kinney P.L. On the propagation of error in air pollution measurements. Environ Monit Assess 1984: 4: 139–153.
Evans J.S., Graham J.D., Gray G.M. and Sielken R.L. A distributional approach to characterizing low-dose cancer risk. Risk Anal 1994: 14(1): 25–34.
Fiering M., Wilson R., Kleiman E. and Zeise L. Statistical distributions of health risk. Civ Eng Syst 1984: 1: 129–138.
Hawkins N.C. and Evans J.S. Subjective estimation of toluene exposures: a calibration study of industrial hygienists. Appl Ind Hyg J 1989: 4(3): 61–68.
Henrion M. and Fischhoff B. Assessing uncertainty in physical constants. Am J Phys 1986: 54(9): 791–798.
Hynes M.E. and Vanmarcke E.H. Reliability of embankment performance predictions. In: Mechanics in Engineering, 1st ASCE-EMD Specialty Conference, University of Waterloo, May 26–28, 1976 (as cited in Cooke, 1991).
Lichtenstein S.B., Fischhoff B. and Phillips L. Calibration of probabilities: the state of the art to 1980. In: Kahneman D., Slovic P., Tversky A. (Eds.), Judgment Under Uncertainty: Heuristics and Biases. Cambridge University Press, New York, 1982.
Macintosh D.L., Xue J., Özkaynak H., Spengler J.D. and Ryan P.B. A population-based exposure model for benzene. J Exposure Anal Environ Epidemiol 1995: 5(3): 375–403.
Matheson J.E. and Winkler R.L. Scoring rules for continuous probability distributions. Manage Sci 1976: 22: 1087–1096.
Morgan M.G. and Henrion M. Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. Cambridge University Press, New York, 1990.
Mullin T.M. Understanding and supporting the process of probabilistic estimation. 1986. PhD dissertation. As cited in Morgan and Henrion, 1990. Carnegie Mellon University, Pittsburgh.
Murphy A.H. Scalar and vector partitions of the probability score: Part I. Two-state situation. J Appl Meteorol 1972: 11: 273–282.
Murphy A.H. A new vector partition of the probability score. J Appl Meteorol 1973: 12: 595–600.
Murphy A.H. and Winkler R.L. Reliability of subjective probability forecasts of precipitation and temperature. Appl Stat 1977: 26: 41–47.
Murphy A.H. and Winkler R.L. Diagnostic verification of probability forecasts. Int J Forecast 1992: 7: 435–455.
Pellizzari E. Concentrations of benzene in ambient, indoor, and personal air in Region V NHEXAS study. Personal communication, 1998.
Pellizzari E., Lioy P., Whitmore R., Freeman N., Clayton C.A., Rodes C. and Thomas K. Quality Systems and Implementation Plan for Human Exposure Assessment. Research Triangle Institute, Research Triangle Park, NC, 1995.
Raiffa H. Decision Analysis. Addison-Wesley, Reading, MA, 1968.
Samet J.H., Shevitz A., Fowle J. and Singer D.E. Hospitalization decisions in febrile intravenous drug users. Am J Med 1990: 89: 53–57.
Seiler F.A. Error propagation for large errors. Risk Anal 1987: 7(4): 509–518.
Sexton K. Informed decisions about protecting and promoting public health: rationale for a national human exposure assessment survey. J Exposure Anal Environ Epidemiol 1995: 5(3): 233–256.
Shlyakhter A.I., Kammen D.M., Broido C.L. and Wilson R. Quantifying the credibility of energy projections from trends in past data — the United States energy sector. Energ Policy 1994: 22(2): 119–130.
Taylor A.C., Evans J.S. and McKone T.E. The value of animal test information in environmental control decisions. Risk Anal 1993: 13(4): 403–412.
Thompson K. and Evans J.S. The value of improved national exposure information for perchloroethylene (perc); a case study for dry cleaners. Risk Anal 1997: 17(2): 253–271.
Tierney W.M., Fitzgerald J., McHenry R., Roth B., Psaty B., Stump D. and Anderson K. Physicians' estimates of the probability of myocardial infarction in emergency room patients with chest pain. Med Decis Making 1986: 6(1): 12–17.
Van Lenthe J. ELI: an interactive elicitation technique for subjective probability. Organ Behav Decis Process 1993: 55: 379–413.
Walker K.D., Macintosh D. and Evans J.S. Use of expert judgment in exposure assessment: Part I. Characterization of personal exposure to benzene. J Exposure Anal Environ Epidemiol 2001: 11: 308–322.
Wallace L. Environmental exposure to benzene: an update. Environ Health Perspect 1996: 104(Suppl. 6): 1129–1139.
Williams D.A. The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics 1975: 31: 949–952.
Winkler R.L. and Poses R.M. Evaluating and combining physicians' probabilities of survival in an intensive care unit. Manage Sci 1993: 39(12): 1526–1543.
Yaniv I. and Foster D.P. Precision and accuracy of judgmental estimation. J Behav Decis Making 1997: 10(1): 21–32.
Acknowledgements

We are thankful to John Graham of the Harvard School of Public Health for his insightful comments on this study. This work was supported by U.S. EPA Cooperative Agreement CR822038-03-1-3; Health Resources and Services Administration, Bureau of Health Professions Grant A03-AH01165-01; National Science Foundation Grant SES-0084372, “Combining Expert Judgments for Environmental Risk Analysis,” and the Leslie Silverman Foundation.

Author information

Authors and Affiliations

Department of Environmental Health, Harvard School of Public Health, Boston, 02115, Massachusetts, USA
Katherine D Walker & John S Evans
Department of Biostatistics, Harvard School of Public Health, Boston, 02115, Massachusetts, USA
Paul Catalano
Department of Health Policy and Management, Harvard School of Public Health, Boston, 02115, Massachusetts, USA
James K Hammitt

Corresponding author

Correspondence to Katherine D Walker.

About this article

Cite this article

Walker, K., Catalano, P., Hammitt, J. et al. Use of expert judgment in exposure assessment: Part 2. Calibration of expert judgments about personal exposures to benzene. J Expo Sci Environ Epidemiol 13, 1–16 (2003). https://doi.org/10.1038/sj.jea.7500253

Received: 19 August 2002
Published: 01 January 2003
Issue Date: 01 January 2003
DOI: https://doi.org/10.1038/sj.jea.7500253

This article is cited by

Exposure Modeling of Benzene Exploiting Passive–Active Sampling Data
- Spyros P. Karakitsios
- Pavlos A. Kassomenos
- Georgios A. Pilidis
Environmental Modeling & Assessment (2010)
Health risk assessment for nanoparticles: A case for using expert judgment
- Milind Kandlikar
- Gurumurthy Ramachandran
- William A. Toscano
Journal of Nanoparticle Research (2006)
Reliability of a semi-quantitative method for dermal exposure assessment (DREAM)
- Berna van Wender de Joode
- Joop J van Hemmen
- Hans Kromhout
Journal of Exposure Science & Environmental Epidemiology (2005)