Geographic exposure modeling: a valuable extension of geographic information systems for use in environmental epidemiology.

Geographic modeling of individual exposures using air pollution modeling techniques can help in both the design of environmental epidemiologic studies and in the assignment of measures that delineate regions that receive the highest exposure in space and time. Geographic modeling can help in the interpretation of environmental sampling data associated with airborne concentration or deposition, and can act as a sophisticated interpolator for such data, allowing values to be assigned to locations between points where the data have actually been collected. Recent advances allow for quantification of the uncertainty in a geographic model and the resulting impact on estimates of association, variability, and study power. In this paper we present the terminology and methodology of geographic modeling, describe applications to date in the field of epidemiology, and evaluate the potential of this relatively new tool.

GIS, or geographic information system, refers to a series of computerized maps (a base map and overlays) that provide for the storage and retrieval of an extensive amount of geographically indexed data. Although data storage and retrieval are the primary functions of a GIS, such systems readily lend themselves to geographic analysis of health data, as in searching for spatial clustering of disease, assessing disease rates by proximity to a pollution source (3), or comparing census tracts supposedly high in lead exposure with the actual levels found in blood lead screening records (4).
In comparison, geographic modeling converts GIS data into quantities that allow estimation of exposure with increasingly individualized precision, depending on the level of information available on residential history and personal activity. For example, in a case-control study of stillbirths, Ihrig et al. (5) estimated the dispersion of airborne arsenic from several sources, comparing health outcome with the exposure estimated for the mother's address at time of hospital admission for delivery, i.e., residence late in pregnancy. A more individualized approach was taken by Stevens et al. (6) in a study of people potentially exposed to radioactive fallout from nuclear testing. Individual exposure levels were obtained by integrating predictions of concentrations of radionuclides in the food supply at different locations and time periods with questionnaire data on residence history and the amount of milk and vegetables consumed at various ages.
In both cases an explicit model was used to compute the concentrations of concern based on the science underlying the exposure. This is in contrast to reliance on implicit assumptions such as those inherent in proximity analysis, i.e., that closeness to a facility determines the degree of exposure.
Geographic modeling is appropriate in an epidemiologic study either when an investigator wants to go beyond proximity as a measure of exposure or when direct measurements of environmental pollutants are too limited. Geographic modeling strives to create the equivalent of a hypothetical ideal monitoring system that would have measured the concentration of pollutants at all locations and times in the medium and domain under study. As with a real monitoring system, once the system is validated, it is possible to use the output to compute a cumulative exposure estimate for a desired time period, taking into account physiologic factors, lifestyle factors, and residence/work history to the extent such information is available.
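As a minimal sketch of the cumulative-exposure computation just described, the fragment below sums modeled concentrations over a residence history. The one-source 1/distance concentration surface and all numbers are hypothetical illustrations, not any study's actual model:

```python
import math

# Hypothetical modeled concentration surface: annual-average concentration
# (arbitrary units) at a location, here a simple 1/distance falloff from a
# single source at the origin. Purely illustrative.
def modeled_concentration(x_km, y_km, strength=50.0):
    distance = max(math.hypot(x_km, y_km), 0.1)  # clamp to avoid blow-up at source
    return strength / distance

# Residence history from a questionnaire: (x_km, y_km, years at address)
history = [(2.0, 1.0, 10), (8.0, -3.0, 5), (1.5, 0.5, 3)]

# Cumulative exposure: concentration at each address times years lived there,
# summed over all addresses; the model plays the role of an ideal monitor.
cumulative = sum(modeled_concentration(x, y) * years for x, y, years in history)
```

Physiologic and lifestyle factors would enter as further multipliers on each term when the questionnaire data support them.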
Reconstruction of a monitoring system by a geographic model based on release rates and transport models has some advantages over an actual system, which is inherently limited in its geographical coverage. Should a threshold be involved in causing a particular health end point, relying on monitor data that do not capture the highest exposures could miss the connection (7). As a result, a combination of both measurements and models offers the best method for specifying exposures across a large population (8).

Applications of Geographic Modeling in Environmental Epidemiology
Environmental Health Perspectives * Vol 107, Supplement 1 * February 1999

Although geographic modeling has been recognized for a number of years as an important and rapidly developing technique (9-11), applications of spatial modeling methods in the field of environmental epidemiology have been limited until recently. Examples from the first wave of epidemiologic research using air pollution or groundwater modeling to estimate exposure include a study of birth defects and exposure to solvent-contaminated drinking water (12); a study of cancer and exposure to industrial air pollution (13); and a study of releases from the Three Mile Island (TMI) nuclear plant (14). Generally, no attempt was made in these first-wave studies to account for any changes in odds ratios, regression coefficients, or confidence limits that would result from uncertainty in exposure modeling.
More recent studies have used increasingly sophisticated modeling methods, which account explicitly for model uncertainty. This second wave of geographic modeling includes studies of leukemia and thyroid cancer in relation to radioactive fallout from nuclear testing in Utah (6,15-18). An additional study is currently in progress, namely the Hanford [Washington] Thyroid Disease Study (HTDS) (19,20).
There have also been some recent historical reconstructions of occupational exposure that come close to the complexity of the Utah and Hanford reconstructions (21,22), although an analysis of the uncertainty in exposure classification was not included in these studies. Table 1 lists the most recent geographic modeling exercises along with the epidemiologic studies that have used them. Table 2 describes the models and the methods used to assess uncertainty.
To reach the level of sophistication evident in the Utah and Hanford thyroid studies (15,20), complex models were needed whose development was funded by a mandate from the U.S. Congress. Not all epidemiologists will have the resources to develop such models and research the parameters needed for them, nor is such sophistication always required to provide a suitable exposure marker. Simpler transport models may be accurate enough if the uncertainties in the health data do not justify more than a crude delineation between the high- and low-exposure regions.
Similarly, it is not always necessary to have as much individualized information as was used in the Utah and Hanford reconstructions (15,20). For instance, if location is only available at the area level-say by zip code of residence-then exposures can still be assigned at the group level. Geographic modeling has the advantage that it can assign limits to exposure within the group. If the range of exposures is small compared to the spread across groups, group-level assignments are not likely to introduce significant errors in epidemiologic measures of association or risk. Even if the estimated exposure range within groups is large enough to suggest that significant misclassification is possible, greater detail may still not be warranted at an early stage of research on the health effects of some environmental exposure. Average exposure at the group level may still be sufficient to differentiate those likely to be highly exposed from those with little exposure.
Nevertheless, it is preferable to have individualized information. Questionnaires usually provide the greatest opportunity to obtain relevant data. Residential and occupational histories can be taken that allow exposure contributions to be summed regardless of the number of geographic moves an individual has made. Questions can also be asked about dietary choices and other personal activities that the analyst may be able to use to individualize exposures, for instance, if contamination information is available by food product.

[Table 1 lists recent geographic modeling exercises, including a Rocky Flats study (30), the Utah fallout dose reconstruction studies (15,16), and the Three Mile Island (Pennsylvania) Dose Reconstruction (32), alongside their associated epidemiologic studies: the Hanford Thyroid Disease Study (in progress) (19), a study under consideration by the Centers for Disease Control and Prevention (24), the Colorado Reproductive Outcome Study (26,28), the arsenic stillbirth study (5), the Belarus Epidemiological Study (29), the Utah leukemia and thyroid epidemiologic studies (6,17,18,31), and the Three Mile Island epidemiologic studies (14,33); for Rocky Flats, no study is known to be contemplated, although some epidemiologic studies have been carried out using earlier plutonium dose contours.]
In a complex pathway study, exposure information might be partly at the individual and partly at the group level. For instance, analysts might be able to assign exposure from inhalation at the individual level, based on an address history, but only be able to assign exposure from the food pathway for a particular pollutant at a town level based on survey data found in the literature on food consumption aggregated at the town level.
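A hedged sketch of such a mixed-level assignment follows; the subject IDs, town names, and numbers are all hypothetical:

```python
# Individual-level inhalation exposure from each subject's address history
# (the output of a dispersion model), plus a group-level ingestion component
# assigned by town from published food-consumption survey data.
inhalation = {"s1": 120.0, "s2": 45.0, "s3": 80.0}       # per-subject estimates
town = {"s1": "Easton", "s2": "Easton", "s3": "Weston"}  # residence town
ingestion_by_town = {"Easton": 30.0, "Weston": 12.0}     # per-town averages

# Total exposure mixes the two levels of resolution.
total = {s: inhalation[s] + ingestion_by_town[town[s]] for s in inhalation}
```

Subjects in the same town share the ingestion term exactly, which is why an area indicator (next paragraph) can soak up residual area-wide effects.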
Although the group-level exposure component is not as accurate as the individualized component, there are good reasons for keeping it in the analysis. When sources of area-wide exposure are unknown, unmeasured, or contribute to individual exposure in ways not fully understood, adding an area indicator to the individual level model may reduce the potential for confounding that could arise from omitting these areawide factors (34). Mixed models of this kind have been used primarily to examine social risk factors but have applications to environmental epidemiology as well.

Modeling Terms Defined
Source Term

This refers to data on the source or sources of exposure. Quantitatively, the source term may refer to emissions per unit time from a unique location, per unit length from a line source (e.g., a highway), or per unit area from a polluted region. The units may be in absolute physical units or in relative units if a surrogate source term such as sales data on pesticides is used as a substitute for actual release estimates.

Fate and Transport
The crudest transport models are simple distance models in which exposure is assumed to decrease as a function of distance from the source. Such a model can be misleading for an elevated release, where exposures can be higher away from the source at the point where the elevated plume touches down. Nevertheless, if one starts far enough out or the release is from the ground, a simple distance model might be appropriate for an epidemiologic study, provided that wind direction patterns are reasonably symmetric around the source location and the terrain is flat. To account for asymmetries in wind patterns, the chosen distance dependence is weighted by a function of angle obtained from data on wind direction frequency. Such data are often available from a local power plant. Because the standard Gaussian plume straight-line model (35) is easily implemented with the same meteorological data, it is not difficult to include this better description of the variation in concentration with distance. The Gaussian model has been used for more than 30 years in predicting concentrations downwind of smokestacks in relatively flat terrain. Although more sophisticated air dispersion models are available and have been used in epidemiologic studies, it may not make sense to put great resources into the transport model if the uncertainties in source terms are large.
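As an illustration, here is a bare-bones ground-level Gaussian plume calculation for an elevated continuous release. The power-law dispersion curves and every parameter value are illustrative stand-ins, not regulatory coefficients:

```python
import math

def plume_concentration(x_m, y_m, q=1.0, u=5.0, h=50.0):
    """Ground-level concentration at (x downwind, y crosswind), in meters,
    from a continuous release q at stack height h into wind speed u, with
    reflection off the ground. Sigma curves are illustrative power laws."""
    if x_m <= 0:
        return 0.0
    sigma_y = 0.08 * x_m ** 0.9   # crosswind spread (assumed form)
    sigma_z = 0.06 * x_m ** 0.9   # vertical spread (assumed form)
    return (q / (math.pi * u * sigma_y * sigma_z)
            * math.exp(-y_m ** 2 / (2 * sigma_y ** 2))
            * math.exp(-h ** 2 / (2 * sigma_z ** 2)))
```

This toy model reproduces the touchdown behavior described above: for an elevated release the ground-level concentration is near zero beside the stack, peaks where the plume reaches the ground (here roughly a kilometer downwind), and falls off beyond, which is exactly the pattern a simple distance model cannot capture.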
If the source term data warrant it, there are a number of ways to improve on the Gaussian plume model: puff models that track puffs of pollutants moving in zigzag patterns as the winds shift; k-theory models that account better for the turbulent boundary layer between surface and air than do Gaussian plume models; and complex terrain models that account for landscapes with hills and channels.
An interesting use of a fate and transport model in geographic modeling is to improve a researcher's ability to interpolate between data points that might have been collected for other purposes, such as air pollution monitoring, but whose spacing is too sparse to allow satisfactory interpolation by conventional means. With geographic modeling, it is the fate and transport model that determines the functional form for intermediate values between data points. The model is forced to fit several data points, one region at a time, by adjusting model parameters. Estimators for missing data are then obtained by running the model at intermediate locations before moving to the next region, where new model parameters are fit to a new set of data points. This approach is of particular use in regions where the pollutant concentration measured is expected to change rapidly over distances between sampling points, e.g., close to individual point sources of pollution.

Calibration

In most complex models there are an enormous number of parameters that can be refined through optimization. Many of the parameters and their probability distributions are chosen through subjective assessment (36,37). Calibration against field data allows the analyst to improve the choice of parameters. Models can be calibrated against field data in two ways. Parameters of the model can be altered to improve fits to field data, or entire classes of models can be rejected because of inadequacies that parameter adjustments cannot correct.
Model calibration has been used both to choose a source term and to pin down values of key model parameters. Calibrated models should produce better matches to airborne concentrations than uncalibrated models (38). Calibration is likely to be particularly important when multiple sources of pollution are involved, as uncertainties of scale that would keep relative exposure rankings the same for a single source can switch rankings when two or more scale factors are involved.
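A minimal sketch of the first kind of calibration, fitting a single source-term scale factor to field measurements by least squares; all numbers are invented:

```python
# Field measurements at four monitoring locations, and the model's
# predictions at the same locations when run with a unit source term.
measured = [4.1, 2.0, 1.1, 0.6]
predicted_unit = [2.0, 1.0, 0.5, 0.25]

# Least squares over s in: measured ~ s * predicted_unit
# has the closed form  s = sum(m * p) / sum(p * p).
scale = (sum(m * p for m, p in zip(measured, predicted_unit))
         / sum(p * p for p in predicted_unit))
```

The calibrated source term is then `scale` times the unit release; a poor fit even at the best `scale` would be grounds for rejecting the model class outright, the second kind of calibration described above.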
In the ChemRisk study at Rocky Flats in Colorado (30), analysts used soil data to calibrate the source term for their model. They used air-monitoring data to fix a parameter in their model, namely the wind-speed dependence of resuspension of plutonium. They chose not to hold back any data for validation; instead all were used at the calibration stage.
In contrast, the Hanford study (20) chose to use all field data for validation and none for calibration. Similarly, in our TMI study (14) we used all the data for validation, primarily because the relevant data set was not available at the time the exposure model was developed.
Other authors take a middle position, using part of the field data for calibration and saving some hold-out data to validate the model (23,39,40).
Types of field data that might be relevant to calibrating a geographic model of the type discussed in this paper and used in an epidemiologic study include air concentrations, soil samples, vegetation samples, blood samples, house dust, food samples, and peat sediment. Almost any measurement of a pollutant that has a significant contribution from the air pathway is a serious candidate.

Validation
Modeling approaches were reviewed in a 1991 document on exposure assessment issued by the National Research Council (NRC) of the National Academy of Sciences (9). The NRC panel recommended that assumption-dependent deterministic models be validated against field data to assess uncertainties before being used to estimate exposure. Although validation can never qualify a model for use in all contexts (41), it is obviously important in geographic modeling, where dependence on multiple parameters is common. Validation can also be used to assess misclassification bias (42).
Validation is routine in exposure assessments for epidemiologic purposes (22,39,40,43-48). Validation can give the analyst the best overall assessment of model uncertainty, at least in the spatial and temporal domain covered by the data. Absolute errors, which affect all study subjects equally (nondifferential uncertainties), are of less importance in epidemiology than errors that can affect individuals differentially. Whereas a model that overpredicts exposures everywhere by a factor of 20 in a validation exercise may be judged to have failed as a tool for communicating risk to the public, such a model can be adequate for finding associations in epidemiology because such an error in scale will not change ratio measures (e.g., odds ratios).
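The point about scale errors can be made concrete: a uniform factor-of-20 overprediction leaves every subject's exposure ranking, and hence any high/low classification used to form ratio measures, unchanged. A small sketch with invented exposures:

```python
exposures = [3.0, 11.0, 7.0, 1.5, 9.0, 4.0]    # modeled exposures
overpredicted = [20.0 * e for e in exposures]  # same model, scale error of 20

def above_median(values):
    """Classify each subject as high (True) or low (False) exposure."""
    cut = sorted(values)[len(values) // 2]  # median split
    return [v >= cut for v in values]

# The high/low classification, and so any odds ratio built from it,
# is identical despite the factor-of-20 absolute error.
same = above_median(exposures) == above_median(overpredicted)
```

Differential errors, by contrast, reshuffle the ranking and can bias ratio measures, which is why they matter more in epidemiology.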
It is doubtful that sufficient data will be available to fully characterize model uncertainty from a validation exercise alone, so it is likely that Monte Carlo simulations will be a component of state-of-the-art geographic modeling.

Monte Carlo Simulation
One form of uncertainty in complex exposure models is conceptually simple to handle, namely, uncertainty in input parameters (49). Once a likelihood distribution is chosen for the parameters, the propagation of the variance can be computed by Monte Carlo simulation. Random numbers are used to sample from the various distributions and the model run; then the resulting exposure output is tabulated. Repeating the process many times generates an output frequency distribution for each individual's exposure, as shown in Figure 1. The variance of these frequency distributions is taken to characterize the uncertainty in individual exposure estimates. For typical distributions computed for individuals in the Hanford Thyroid Disease Study (HTDS), the ratio of the exposure at 95% frequency to the exposure at 5% frequency was a factor of 25 (20). For the Utah leukemia case-control study the corresponding ratio was approximately 5 (16). The ratio for the Utah thyroid cohort study was approximately 60 (15). Although a ratio of 60 represents a large uncertainty for an individual, it proved small on a relative basis, as the variation in exposure across the thousands of study subjects spanned more than four orders of magnitude.
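A toy version of this procedure, using a made-up three-factor exposure model and assumed lognormal parameter distributions, computes the same kind of 95%/5% ratio quoted above:

```python
import math
import random

random.seed(1)

def exposure(release, dilution, intake):
    # Toy multiplicative exposure model; the three factors stand in for a
    # source term, an atmospheric dilution factor, and a personal intake rate.
    return release * dilution * intake

# Propagate parameter uncertainty: sample each parameter from its assumed
# distribution, run the model, and tabulate the outputs.
samples = sorted(
    exposure(random.lognormvariate(math.log(100.0), 0.5),
             random.lognormvariate(math.log(1e-5), 0.4),
             random.lognormvariate(math.log(20.0), 0.3))
    for _ in range(5000)
)

p05 = samples[int(0.05 * len(samples))]
p95 = samples[int(0.95 * len(samples))]
ratio_95_to_5 = p95 / p05  # analogous to the ratios of 5, 25, and 60 above
```

With these assumed spreads the ratio comes out near 10; wider parameter distributions inflate it rapidly, since the log-variances add.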
Unlike validation, such simulations can never capture uncertainty in the model structure; nevertheless, it is often the case that the impacts of parameter uncertainty are expected to dominate uncertainty in model structure. Although the overall uncertainty for an individual's exposure is large in most historical exposure reconstructions, it is still possible to have meaningful statistical power in a study with sufficient numbers of subjects (usually in the thousands) and sufficient geographical or temporal variation in exposure.
One issue at the forefront of research in the field of Monte Carlo exposure assessment is the importance of hypothetical correlations between model parameters. Commonly, uncertainties in input parameters are assumed to be independent. The failure of this assumption can lead to both under-and overestimates of exposure using Monte Carlo techniques (50).
Typically, the output of Monte Carlo simulations is 100 to 200 realizations of exposures to all study subjects. Each realization can be used separately to compute regression coefficients relating health outcomes to exposure estimates. The variation in epidemiologic quantities computed with different realizations of exposure provides a measure of the impact of exposure uncertainty on the study's results. Importantly for the biostatistician, the correlations between individual exposures are maintained when these realizations are used.
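A sketch of this use of realizations, with a toy linear model and an invented error structure; a shared per-realization scale factor preserves the between-subject correlations just mentioned:

```python
import random

random.seed(2)

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Simulated truth: outcome depends linearly on the (unobserved) true exposure.
true_exposure = [random.uniform(0, 10) for _ in range(200)]
outcome = [2.0 * x + random.gauss(0, 1) for x in true_exposure]

# 100 Monte Carlo realizations of the exposure estimates. The shared scale
# factor per realization correlates all subjects' errors; each subject also
# gets an individual multiplicative error.
slopes = []
for _ in range(100):
    shared = random.lognormvariate(0, 0.3)
    realization = [x * shared * random.lognormvariate(0, 0.2)
                   for x in true_exposure]
    slopes.append(ols_slope(realization, outcome))

spread = max(slopes) - min(slopes)  # impact of exposure uncertainty on the slope
```

The spread of `slopes` across realizations is the measure of impact the text describes; collapsing all realizations to their mean would hide it.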
The Accuracy and Robustness of Geographic Modeling

A number of concerns have been raised about geographic modeling. First, some of the underlying data may be poor or incomplete, forcing the analyst to make a number of unverifiable assumptions. Source terms, for instance, are rarely known with certainty. Second, transport processes through the environment are often so complex that, even if the problems with the underlying data used in the models were absent, the models themselves would be questionable and inherently unverifiable. Critics of modeling believe it is preferable to rely on variables that can be known with greater accuracy, such as duration of residence in a particular area. However, such variables are themselves likely to introduce measurement error because they are crude surrogates for exposure and will tend to create overinclusive definitions of who is exposed. Relying on accurately measured surrogates that nonetheless label truly unexposed individuals as exposed is no panacea. Because the intent behind geographic modeling is to determine who among all potentially exposed individuals has actually been exposed, the exposure classification will tend to have high specificity. As a result, estimates from modeling may be less biased than those based on simpler data.
The best response to concerns about accuracy is to present a quantitative estimate of uncertainty for each exposure variable. When possible this should include a test of the model or its major components against field data. The inclusion of an uncertainty distribution provides a firm basis for evaluating the exposure estimates. If the variation in exposure is much greater than the uncertainty assigned to the estimates, the model will be able to accurately stratify exposure into at least some spatial and temporal regions where reliable regressions can be made against health data. This was the experience in our earlier study on the risk of cancer associated with emissions from the TMI nuclear facility (14). By integrating information on wind direction and using a Gaussian dispersion model modified to handle terrain, the model showed that only a minority of the population living in one sector around the plant had any real opportunity for exposure. Two validation strategies were used to confirm the result. First, we looked at the sensitivity of exposure predictions as key model parameters were varied about the default values. Second, we compared the exposure predictions to those generated by an independent source term determined from backfitting to offsite monitoring data using a weighted least-squares approach that accounted for measurement error. Agreement between the default exposures used in the study and the alternates was good, helping to build confidence in the model (32).

Uncertainty Analysis
Advances in the field of risk assessment in quantifying uncertainties in exposure modeling have made it relatively straightforward, in principle, to account for exposure uncertainties in the determination of confidence intervals and, in some cases, to correct for a tendency of errors to introduce bias, generally toward a null result (2,51,52).
Analytical methods of computing overall uncertainty due to lack of exact knowledge of parameter values are limited to a restricted class of models (53). On the other hand, Monte Carlo simulations can be used to characterize parameter uncertainty in any model.
Methods and guidelines for making and using Monte Carlo simulations to characterize exposure predictions have been discussed by a number of authors (54,55). In the absence of field data to characterize a parameter distribution, analysts sometimes rely on expert judgments from a sample of experts to define the distribution (56). The use and accuracy of such elicitations have been discussed in detail by Cooke (37). Reliance on expert judgment to estimate ranges for uncertain model parameters has obvious similarities to the use in occupational epidemiology of a panel of experts to develop job exposure matrices that rank exposure levels for different work situations and time periods (40,57-59).
The uncertainty in a geographic model affects the power of a study to find a significant correlation between exposure and health outcome. The effects on power can be determined by simulating an epidemiologic study and examining the reduction in power that occurs as the measured exposures vary further from the true exposures. In our simulation of an epidemiologic study of breast cancer, we found that the power changed slowly at first as the overall uncertainty in exposure was increased, ultimately plummeting after some critical threshold was reached. Presumably, the rapid decline in power occurred as the differential uncertainty in the exposure estimates began to overwhelm the variability in exposure across the population.
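A toy simulation in the same spirit, using linear regression with multiplicative exposure error; the sample sizes and parameters are invented for illustration and are not the breast cancer simulation itself:

```python
import math
import random

random.seed(3)

def study_power(error_sd, n=150, trials=200, beta=0.5, noise_sd=2.0):
    """Fraction of simulated studies in which the exposure-outcome slope is
    detected (t > 1.96) when exposures carry multiplicative lognormal error."""
    hits = 0
    for _ in range(trials):
        true_x = [random.uniform(0, 10) for _ in range(n)]
        y = [beta * x + random.gauss(0, noise_sd) for x in true_x]
        measured = [x * random.lognormvariate(0, error_sd) for x in true_x]
        # OLS slope and its standard error on the error-laden exposures
        mx, my = sum(measured) / n, sum(y) / n
        sxx = sum((x - mx) ** 2 for x in measured)
        slope = sum((x - mx) * (v - my) for x, v in zip(measured, y)) / sxx
        resid_ss = sum((v - my - slope * (x - mx)) ** 2
                       for x, v in zip(measured, y))
        se = math.sqrt(resid_ss / (n - 2) / sxx)
        hits += slope / se > 1.96
    return hits / trials

low_error_power = study_power(0.1)   # modest exposure uncertainty
high_error_power = study_power(1.5)  # uncertainty overwhelming the contrast
```

Sweeping `error_sd` between these extremes traces out the slow-then-plummeting power curve the text describes.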
Confidence intervals around the quantities calculated will be widened by the uncertainty in exposures. To our knowledge, the HTDS is the first study to account for the correlations in exposure uncertainty that exist across study subjects. For instance, the exposure estimated for all study subjects present in 1945 will increase when the scale factor for release of radioiodine in 1945 is increased. The HTDS team plans to adapt a methodology developed by Guo and Thompson (60) to correct for the attenuation in regression coefficients that exposure uncertainty can bring and to estimate the impact exposure uncertainty has on widening the confidence limits (61).
In pilot work on polycyclic aromatic hydrocarbon (PAH) exposure, we used the bootstrap approach (63) to handle the exposure uncertainty problem and sampling error simultaneously. The bootstrap approach, so named to convey its power of seemingly lifting oneself by one's biostatistical bootstraps, is now standard, allowing analysts to take advantage of normally untapped information a data set carries about the distribution from which it is sampled. In particular, the bootstrap approach, which here involves repeated resampling from the original set of cases and controls without concern about duplication, is useful for estimating complex quantities such as confidence limits. For each of our Monte Carlo exposure realizations, we generate a new bootstrap set of simulated cases and controls and perform a regression. The results of several hundred of these regressions generate frequency data from which 95% confidence limits can be read off for each coefficient linked to an explanatory variable. As long as Monte Carlo sampling is already part of the analysis to characterize exposure uncertainty, the addition of simultaneous bootstrap resampling of cases and controls adds no significant increase in computer time.
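A compact sketch of the combined loop, with a simple linear regression standing in for the study's actual regression and all data simulated:

```python
import random

random.seed(4)

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

n = 120
best_exposure = [random.uniform(0, 5) for _ in range(n)]  # point estimates
outcome = [1.5 * x + random.gauss(0, 1.5) for x in best_exposure]

slopes = []
for _ in range(300):
    # one Monte Carlo realization of every subject's uncertain exposure ...
    realization = [x * random.lognormvariate(0, 0.15) for x in best_exposure]
    # ... paired with one bootstrap resample of subjects, with replacement
    idx = [random.randrange(n) for _ in range(n)]
    slopes.append(ols_slope([realization[i] for i in idx],
                            [outcome[i] for i in idx]))

slopes.sort()
lower, upper = slopes[7], slopes[292]  # ~95% limits from 300 replicates
```

The resulting limits reflect sampling error and exposure uncertainty jointly, which is the point of running the two resampling steps in the same loop.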
In risk assessments the preferred approach is to distinguish between variability and uncertainty (64). Variability refers to real variations that occur in people or nature, as opposed to our ignorance of a model parameter. Real variations include person-to-person differences in breathing rate as well as site-to-site differences in terrain that affect meteorological dispersion. Although the distinction between variability and uncertainty is important in risk communication and risk management, it is not yet a distinction made by epidemiologists.

Assembling the Model Components and Identifying Project-Specific Parameters
Exposure reconstruction reports have been likened to detective stories. "Each one has its own flavor..., but the flow of the plot is pretty much the same" (65).
With so many diverse fate and transport models in existence, the analyst's first responsibility-to find an appropriate model or suite of models-is relatively easy. Web sites of the U.S. Environmental Protection Agency (U.S. EPA) (66-68) and the California Air Resources Board (69) are valuable resources for locating and downloading exposure models of all types. Furthermore, the U.S. EPA has a suite of models, including multisource models, on its "Exposures Models Library" CD-ROM (70). An advantage of working with a U.S. EPA-approved model is that the uncertainties and limitations of the model have already been investigated.
More difficult than finding an appropriate suite of models is gathering the information for the project-specific parameters that these models require. Complete historical information and scientific understanding are never available. In general, for an exposure or dose reconstruction one assembles as much of the parameter information as is easy to find, bridges the gaps with the best approximations that can be made, and proceeds as far down the model chain as possible, relying on the uncertainty analysis to give the overall process its final rigor.
The effort may be as straightforward as analyzing readily available plant process records to estimate annual average releases of arsenic and inputting them to the U.S. EPA fugitive dust model (70), which uses average meteorological frequency data. This input and analysis was done in the study by Ihrig et al. (5) to get inhalation exposures. Alternatively, the effort may be complex. Analysts for the Hanford Environmental Dose Reconstruction Study (20) sifted through warehouses of documents to find information for model parameters, modeled the process of dissolving irradiated nuclear fuel to obtain daily releases of radioiodine, and then entered the release rates into a massive suite of computer models that took into account time-sequenced meteorological data.
From our review of the literature on exposure reconstructions, we have identified a number of steps that analysts generally follow. The order of the steps listed is somewhat arbitrary.

Step 1: Review Health Effects Data for the Pollutants of Concern. The U.S. EPA's Integrated Risk Information System (IRIS) is a useful source of toxicity data (71), as are the publications of the Agency for Toxic Substances and Disease Registry (ATSDR) (72). Sources of data for radionuclides can be found in reports of the International Commission on Radiological Protection (73) and the U.S. EPA (74). IRIS (71) and the ATSDR publications (72) also include references to the literature on the pharmacokinetics of chemicals once they have entered the body. For some chemicals (and most radionuclides), sufficient information is available to relate intake of pollutants to the quantity that reaches target organs or, in the case of radioactivity, the energy absorbed. In such cases an exposure model can proceed to a dose model. Usually this step is handled by simple multiplication using age-specific coefficients taken from the literature. Note, however, that there are individual variations in organ uptake that should be included in the overall assessment when possible. For certain pollutants the dose response for the health end point of concern has been identified in earlier studies. If any of the pollutants have health-effects thresholds, modeling effort for that pollutant should be focused on exposure levels in a range bracketing the threshold.
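The intake-to-dose multiplication by age-specific coefficients can be sketched as follows; the coefficients and intakes here are invented placeholders, not values from the cited compilations:

```python
# Hypothetical age-specific dose coefficients (dose per unit intake).
dose_coefficient = {"child": 3.7e-7, "adult": 1.1e-7}

# Modeled annual intakes of the pollutant, tagged by life stage,
# as produced by the exposure model over the subject's history.
annual_intakes = [("child", 5000.0), ("child", 4200.0), ("adult", 3000.0)]

# Dose model: simple multiplication by the age-specific coefficient,
# summed over the exposure history.
dose = sum(dose_coefficient[age] * intake for age, intake in annual_intakes)
```

Individual variation in organ uptake would enter as a further per-subject factor on each term when data permit.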
Interestingly, geographic models can provide more information than epidemiologists are used to seeing; such models can give complete time histories of exposure, not just cumulative or peak exposures. To our knowledge such information has not yet been exploited in environmental epidemiology.
Step 2: Review the History of Pollutant Usage and the Nature of Releases to the Environment. A review of a substance's general historical usage, including its sales history, is helpful in establishing the time period that might bound the modeling exercise. When the focus of the study is a particular facility, such as a smelter, a review of the facility's geographical/hydrological layout and history of operations is helpful in choosing which type of transport model is appropriate. For instance, there are dispersion models adapted for high stack emissions, for resuspension from dust piles, for leaks from building cracks, and so on. There are models for point sources and for distributed sources. If the temperature of the effluent and/or its vent velocity is high, a special model may be necessary to estimate the height to which the plume will rise before dispersing horizontally.
As for water dispersion, the nature of the facility's past and present operations will determine whether to include models for surface runoff, surface percolation, and/or deep well disposal.
Step 3: Determine the Population and Time Frame to Be Modeled. A decision, informed to a certain extent by an understanding of the pollutant transport, must ultimately be made as to the size, extent, and distribution of the population and the time period for which exposures are to be modeled. This will determine such practical modeling details as how far back in time releases must be estimated and whether a short-range or long-range dispersion model is necessary.
Step 4: Quantify the Pollutant Release Rates (Source Terms) over Time. It is in the gathering of site-specific information, such as in the determination of a source term, that exposure reconstructions are likely to differ the most. Project-specific research is always necessary and may involve searching through files and interviewing people with direct experience in project-related issues. Data may be collected on stack or tailpipe emission measurements at the facility or at similar facilities. Information on historical records of operations, such as product shipped, chemicals purchased, and chemicals in stock, may also be useful.
Source terms do not have to be from point sources. To model DDT air exposures in our Long Island pilot study (62), we used as the basic database the geographic area of salt marshes and farms that were sprayed with DDT before it was banned in the early 1970s. With the assumption that spraying was constant per unit area, it is the areal shape that determines the location of highest exposures. We were able to locate handwritten records from the Suffolk County, New York, agricultural extension service. These records provided the acreage of potato crops from before the U.S. Department of Agriculture (USDA) began recording them and before DDT was first produced. This project-specific data-gathering exercise was important because it demonstrated that the potato acreage had not changed much since the 1940s.
Although the source term from specific releases is usually determined from plant records or from knowledge of the engineering of the plant's operation, sometimes there are sufficient environmental measurements available in space and time to allow a backfit to be made to infer the magnitude and timing of the releases. This was the case at Rocky Flats for plutonium, where accumulated deposits in soil had been measured since the 1970s. On the other hand, no such environmental record was available for chemicals, so engineering calculations were used to estimate this component. In some cases a geographic model might bypass considerations of the pollutant's origin altogether and work directly with a detailed map of deposition on the ground, modeling how much of the contaminants would have been eaten by grazing animals and ended up in products sold in stores.
Values for parameters not identifiable from facility records, or, more likely, frequency distributions for such parameters, must be determined from the literature and/or field measurements. Only in rare cases does the analyst find that the information needed has already been collected and is available at a reasonable cost, as is the case for U.S. EPA data on large combustion sources. More often one finds that the full set of data is available from a private source but is too expensive. It is then necessary to try other, less complete approaches. A review of the modeling literature indicates that analysts use certain basic principles to fill gaps in data. In some cases analysts simply use the values they have extracted from the literature to estimate a parameter distribution. To guard against expert overconfidence (37), it is preferable to fit the data to long-tailed distributions such as log-normals. Elicitation of subjective parameter estimates from experts that are combined into an overall distribution is now done in a formal manner, taking into account lessons learned about the accuracy of past expert assessments (56).
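Fitting a long-tailed (log-normal) distribution to a handful of literature values can be sketched as follows; the sample values are hypothetical, not drawn from any cited study:

```python
import math
import statistics

# Fit a log-normal distribution to a few parameter values extracted
# from the literature. Working in log space and adopting a long-tailed
# form helps guard against understating uncertainty.

def fit_lognormal(values):
    """Return (mu, sigma) of log(values): the log-normal parameters."""
    logs = [math.log(v) for v in values]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)  # sample standard deviation of the logs
    return mu, sigma

# Hypothetical emission-factor values gathered from several sources
samples = [0.8, 1.5, 2.0, 4.5]
mu, sigma = fit_lognormal(samples)
median = math.exp(mu)  # geometric mean = median of the fitted log-normal
print(median, sigma)
```

The fitted (mu, sigma) pair can then feed directly into a Monte Carlo uncertainty analysis rather than a single point estimate.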
In some cases the needed parameter cannot be determined by any of these methods but can be approximated from related data. Common techniques used for this purpose include interpolation, extrapolation, and disaggregation.
Interpolation refers to the process of inferring data values at locations that lie between points where measurements have actually been made. Often a smooth functional form will be fit to the existing data and the functional form used to infer the actual imputed value. Interpolation has been widely used in exposure modeling for epidemiologic purposes. For instance, the fallout deposition GIS database underlying the Utah studies made heavy use of interpolation between measured values found on unpublished fallout maps collected after each weapons test (65).
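The interpolation step can be sketched in a few lines; the transect distances and deposition values below are hypothetical, not the Utah fallout data:

```python
# Piecewise-linear interpolation of deposition measurements along a
# transect: a smooth form is fit to the measured values and used to
# impute values at unmeasured locations. All values are hypothetical.

measured_km = [0.0, 2.0, 5.0, 10.0]   # distance along transect (km)
deposition = [40.0, 28.0, 15.0, 5.0]  # measured deposition (arbitrary units)

def interpolate(x, xs, ys):
    """Impute a value at x from the bracketing measured points."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x outside measured range; that is extrapolation")
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Impute deposition at an unmeasured point 3.5 km along the transect
print(interpolate(3.5, measured_km, deposition))
```

In practice a GIS would do this in two dimensions (e.g., kriging), but the principle is the same.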
Extrapolation differs from interpolation in that the inferred data lie outside the region containing the measured values. The functional form fit to the available values is extended beyond the data points. For instance, at Fernald, Ohio, scrubber filter efficiencies measured between 1961 and 1965 were extrapolated as far back as 1951 and as far forward as 1981, assuming similar trends before and after (23).
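Extrapolation can be sketched the same way, by extending a fitted trend beyond the measured years; the efficiency values here are hypothetical, not the Fernald measurements:

```python
# Extrapolate a linear trend fitted to a few measured years beyond the
# data range, in both directions. Scrubber-efficiency values below are
# hypothetical stand-ins for the kind of data described in the text.

years = [1961, 1962, 1963, 1964, 1965]
efficiency = [0.900, 0.902, 0.904, 0.906, 0.908]  # hypothetical

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line(years, efficiency)

def predict(year):
    return slope * year + intercept

# Extend the trend back to 1951 and forward to 1981, outside the data
print(predict(1951), predict(1981))
```

Because the inferred values lie outside the data, the uncertainty attached to them should grow with distance from the measured range.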
Disaggregation is a technique that few modern risk assessments can avoid, yet is unfamiliar to those outside the modeling community. It is the process of breaking down summed data into its unmeasured components based on reasonable assumptions. For instance, at Hanford there were periods for which only total releases of radioactivity to the river were available, not values for individual radionuclides. To obtain estimates for the individual release percentages, analysts used subjective distributions they believed to be reasonable as input to a Monte Carlo simulation (75).
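A minimal sketch of disaggregation by Monte Carlo: a measured total is split into unmeasured components by sampling subjective distributions for the component fractions. All fractions below are hypothetical, not the Hanford values:

```python
import random

# Disaggregate a measured total release into unmeasured components by
# sampling subjective bounds on each component's share, then
# renormalizing so every realization sums to the measured total.
# The totals and fraction bounds below are hypothetical.

total_release = 1000.0  # measured total (e.g., curies); split unknown

# Subjective (low, high) bounds on each radionuclide's share
subjective_fractions = {"nuclide_A": (0.50, 0.70),
                        "nuclide_B": (0.20, 0.30),
                        "nuclide_C": (0.05, 0.15)}

def sample_split(rng):
    """One Monte Carlo realization of the component releases."""
    draws = {k: rng.uniform(lo, hi)
             for k, (lo, hi) in subjective_fractions.items()}
    norm = sum(draws.values())
    return {k: total_release * v / norm for k, v in draws.items()}

rng = random.Random(42)
samples = [sample_split(rng) for _ in range(10000)]
mean_A = sum(s["nuclide_A"] for s in samples) / len(samples)
print(mean_A)  # central estimate of nuclide_A's share of the total
```

Keeping all 10,000 realizations, rather than just the mean, is what lets the uncertainty be propagated to the final exposure values.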
With all three techniques (interpolation, extrapolation, and disaggregation), an estimate of the uncertainty of the derived values should be made and propagated through to the final exposure values.
In some cases surrogate values are used for parameter values or distributions. Model validation is particularly important in such circumstances. In our TMI study (14) we used a surrogate for releases of stack radioactivity, namely strip-chart readings from radiation counters near the exhaust stack. The readings were thought to rise and fall with the emissions of radioactivity. In our pilot for the Long Island Breast Cancer Study Project (62) we used carbon monoxide emissions, for which extensive databases exist, as a surrogate for airborne releases of PAHs, because both emissions are associated with incomplete combustion. For validation, we found data correlating carbon monoxide (CO) emissions (76) and PAH air deposition in high marsh sediment (77) as far back as 1940, as well as direct correlations between airborne CO and PAH over periods of months (78).
Step 5: Determine the Major Pathways by Which Study Pollutants Likely Reached the Study Population. The obvious air pathways are direct migration through air to lungs and deposition onto food. The obvious water pathways are direct passage through groundwater to wells or into rivers tapped downstream for community drinking water. However, pollutants can cross media boundaries before reaching people and can spend considerable time in reservoirs before being recycled into the air and into drinking water.
As an aid to identifying a full set of pathways, analysts can review other geographic modeling studies, risk assessments, and/or environmental impact statements carried out for facilities or technologies comparable to those under study. These documents, along with ATSDR publications on specific pollutants (72), also can help in reaching a decision as to which pathways are significant enough to justify modeling.
Step 6: Pick the Transport/Storage Model to Be Used for Each Included Pathway. Transport models convert emission rates of pollutants to concentrations, whether they are measured in nanograms per cubic meter of air, curies per square meter of land surface, or micrograms per kilogram of food. The analyst must match the requirements of the study to the choice of model. To do so it is helpful to ask a series of questions: Can the pollutant be transformed on the way to people? For instance, if it is a metal, can its valence state change? If it is radioactive, can it decay and possibly transform into another radioactive substance? If so, one may want to use a transport model that contains options for transformation, or one may add on simple correction factors. Can the pollutant be stored and released at a later time? If so, one may add in a compartment model. Is any stack on site particularly high or ejecting very hot gases? If so, it may be necessary to make use of a special plume rise model.
What is the nature of the terrain? If it consists of rolling hills of moderate slope, it will be possible to use a program that is based on Gaussian plume algorithms, such as the U.S. EPA fugitive dust model (70). On the other hand, if the terrain is complex, with valleys and high hills, one will want to use a complex terrain model, although the uncertainty in complex terrain models can be large and has not always been well characterized.
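The Gaussian plume approach mentioned above can be sketched in a few lines. The dispersion coefficients below use a simple power-law form with illustrative constants; they are not a specific U.S. EPA stability-class parameterization:

```python
import math

# Ground-level concentration downwind of a continuous point source,
# using the standard Gaussian plume equation with ground reflection:
#   C(x, y, 0) = Q / (2*pi*u*sy*sz) * exp(-y^2/(2*sy^2))
#                * 2 * exp(-H^2/(2*sz^2))
# The power-law sigma constants are illustrative only.

def sigma(x_m, a=0.08, b=0.90):
    """Illustrative power-law dispersion coefficient (meters)."""
    return a * x_m ** b

def plume_concentration(q_g_per_s, u_m_per_s, h_m, x_m, y_m):
    """Ground-level concentration (g/m^3) at downwind x, crosswind y."""
    sy = sz = sigma(x_m)  # simplification: same spread in y and z
    coeff = q_g_per_s / (2.0 * math.pi * u_m_per_s * sy * sz)
    crosswind = math.exp(-y_m ** 2 / (2.0 * sy ** 2))
    vertical = 2.0 * math.exp(-h_m ** 2 / (2.0 * sz ** 2))  # reflection
    return coeff * crosswind * vertical

# 10 g/s release, 4 m/s wind, 50 m stack, receptor 2 km downwind on axis
print(plume_concentration(10.0, 4.0, 50.0, 2000.0, 0.0))
```

A regulatory-grade model would select sigma curves by atmospheric stability class and handle terrain, but the structure of the calculation is as shown.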
Realistically, choice of a model is often based on convenience. A model may have been used for years at the site under study. The analyst may be familiar with a particular version. As long as the transport model utilized has been quantitatively compared with other standard models or is independently validated, there is no reason to begrudge the savings in time and effort that will result from using that model.
Step 7: Decide If Any New Model Components Must Be Constructed. It is possible that some site-specific feature might require modeling that cannot be obtained from one of the programs in the repertoire provided by the U.S. EPA or the larger state agencies. However, in our experience, it is almost always possible to find a relevant model in the scientific literature or obtain one from a government agency, whether it involves revolatilization of pesticides from soil, uptake of pollution by vegetation, or some other site-specific process.
Step 8: Convert Concentration Predictions to Integrated Exposures or Doses. Whereas the geographic model produces concentrations in time and space, the epidemiologist may want a measure of the cumulative intake or absorption of a pollutant. This information may be provided by a dose calculation or approximated by integrating concentration over exposure time. A dose calculation, or a full-scale modeling effort in complex cases, tabulates the amount of pollutant entering or absorbed by the cells of a study subject, accounting for the duration of time spent by people at each location in the study area. Current approaches to determining integrated exposures or doses from concentration data are based on relatively general assumptions about the activity patterns that determine how much time people spend in each location (9). Questionnaire data can improve these estimates.
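Integrating modeled concentrations over the time a subject spent at each location can be sketched as follows; the residence history and concentration values are hypothetical:

```python
# Approximate a subject's cumulative exposure by integrating modeled
# concentration over the time spent at each residence, as in Step 8.
# Concentrations and the residence history below are hypothetical.

# Modeled annual-average concentration (e.g., ug/m^3) by location and year
modeled = {("home_A", 1960): 3.0, ("home_A", 1961): 2.5,
           ("home_B", 1961): 1.0, ("home_B", 1962): 0.8}

# Residence history: (location, year, fraction of that year spent there)
history = [("home_A", 1960, 1.0), ("home_A", 1961, 0.5),
           ("home_B", 1961, 0.5), ("home_B", 1962, 1.0)]

def cumulative_exposure(history, modeled):
    """Sum concentration * time over the history (ug/m^3-years)."""
    return sum(modeled[(loc, yr)] * frac for loc, yr, frac in history)

print(cumulative_exposure(history, modeled))
```

Questionnaire data on time spent at home, work, or school would refine the yearly fractions used here.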
Step 9: Ensure the Results Are Convenient for Epidemiologic Analysis. It is desirable to produce exposure output that includes estimates of the impacts of measurement error and exposure misclassification. Validation of the model against field data, as discussed in "Modeling Terms Defined," can help in this regard, as can Monte Carlo techniques. If the models chosen for the study are not inherently adapted for Monte Carlo analysis, a simple solution may be to build a metaprogram that will repeatedly run the model in batch mode with different assumptions, thereby effectively turning the system into a Monte Carlo engine that can give the desired epidemiologic estimates.
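The metaprogram idea can be sketched as a wrapper that repeatedly runs a deterministic model with parameters drawn from their uncertainty distributions; the model function and distributions here are stand-ins, not any specific transport code:

```python
import math
import random
import statistics

# Turn a deterministic exposure model into a Monte Carlo engine by
# repeatedly running it in batch mode with parameters sampled from
# their uncertainty distributions. exposure_model stands in for one
# run of the real model; both distributions are hypothetical.

def exposure_model(emission_rate, dispersion_factor):
    """Stand-in for one batch run of the transport model."""
    return emission_rate * dispersion_factor

def monte_carlo(n_runs, seed=0):
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        emission = rng.lognormvariate(math.log(10.0), 0.5)  # hypothetical
        dispersion = rng.uniform(1e-5, 5e-5)                # hypothetical
        results.append(exposure_model(emission, dispersion))
    return results

runs = monte_carlo(5000)
# Summarize the exposure distribution for epidemiologic use
print(statistics.median(runs), statistics.quantiles(runs, n=20)[18])
```

The full set of realizations, not just the summary statistics, is what allows measurement error and misclassification to be carried into the epidemiologic analysis.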
A summary of the approach taken by key studies is given in Table 2.

Conditions Appropriate for Use of Geographic Modeling in Epidemiology
Our review of the literature identified three situations where geographic modeling could be useful for epidemiologic studies: to assist in study design, to assist in assigning exposure estimates to study subjects, and to assist in analysis and use of field measurements.

Conditions When Geographic Modeling Can Be Useful in Study Design
Geographic modeling can provide an estimate of both the typical magnitude of the exposure and the range of variability to be expected across various potential study populations. Geographic modeling can be of particular assistance if multiple sources are involved, making intuition an unreliable predictor of relative exposure.
Geographic modeling can help in the design of study questionnaires. For example, as a result of our modeling efforts, a question about distance to the nearest road was included on the soil-sampling questionnaire for the Long Island Breast Cancer Study Project (62). Pilot work had indicated that distance would account for a significant amount of the variation in pollution from automobiles.
Geographic modeling can help in deciding on the number of samples that should be collected and their geographic distribution, if environmental sampling data are to be used as an exposure surrogate.

Conditions Appropriate for Assigning Exposure Estimates by Geographic Modeling
Geographic modeling might serve as a substitute for, or complement to, simpler exposure estimates: for instance, when simple distance relationships from sources are not expected to be a reliable indicator of exposure; when multiple sources are close enough together to merge their impacts; when historical exposures may have been different from modern exposures, making current monitoring levels a poor indicator of past exposure; or when the population is sufficiently mobile and the pollutant sufficiently distributed that account must be made of exposures at multiple locations, such as home and work.
Conditions When Geographic Modeling Can Help with Analysis and Utilization of Field Data
As discussed in "Fate and Transport," a possible use of geographic modeling is to improve a researcher's ability to interpolate between data that might have been collected for other purposes, such as air pollution monitoring. Even for data specifically collected for an epidemiologic study, model-based interpolation could be useful if data have been collected for only a subset of the full study population.

Conclusion
Recent developments in quantifying uncertainty have helped to answer many of the methodological concerns expressed in the past about the use of geographic modeling in epidemiologic studies. Experience gained in large, complex historical reconstructions has shown that, with sufficient effort, data can be found to reconstruct exposures as many as 50 years in the past. Even when individual exposure uncertainties have turned out to be large, the range in exposures across the study population has proven larger still, allowing meaningful estimates of health effects to be made, at least in studies involving thousands of subjects. This review of geographic modeling via the air pathway shows that the methodology has a potentially wide application in epidemiology.