Methodologic research needs in environmental epidemiology: data analysis.

A brief review is given of data analysis methods for the identification and quantification of associations between environmental exposures and health events of interest. Data analysis methods are outlined for each of the study designs mentioned, with an emphasis on topics in need of further research. Particularly noted are the need for improved methods for accommodating exposure assessment measurement errors in analytic epidemiologic studies and for improved methods for the conduct and analysis of aggregate data (ecologic) studies.


Introduction
Nearly all study of the health consequences of environmental and lifestyle exposures in human populations is purely observational. This means that the validity of the comparison of disease rates between more exposed and less or nonexposed persons is dependent on the assumption that disease rates in the two groups are comparable in the absence of such exposure. This comparability assumption can be weakened somewhat by the measurement and accommodation of other factors that are associated with disease risk and that have a different distribution in the compared exposure groups. If such confounding factors are accurately measured and adequately acknowledged in the data analysis, it is then sufficient that in the absence of the exposure of interest, the groups being compared have common disease rates conditional on the values of the confounding factors. Lack of validity (i.e., bias) in testing or estimation can be expected if there are unidentified confounding factors, if the recorded confounding factors are measured with error, or if the treatment of individual confounding factors is inadequate (e.g., linear allowance for confounders having effects that are substantially nonlinear). Bias also can be introduced if the exposure variables of interest or the health effects under study are not measured accurately. In practice these sources of bias can be reduced, but it is unlikely that they will be completely eliminated.

This manuscript was prepared as part of the Environmental Epidemiology Planning Project of the Health Effects Institute, September 1990-September 1992.
*Author to whom correspondence should be addressed.
This work was supported partially by grant CA-53996 from the National Institutes of Health. The authors wish to thank Professor John Tukey, the members of the Environmental Working Group, and a reviewer for helpful suggestions.
The sources of bias mentioned here are the principal reasons why epidemiologic cohort studies, among others, may yield inaccurate and conflicting results. Concern about residual, uncontrolled confounding can never be completely eliminated in any nonexperimental study. Hence, such studies are most reliable for the detection of moderate to large health effects (e.g., increase in disease incidence by a factor of two or more among highly exposed persons) that are unlikely to be qualitatively affected by modest confounding. There is also a strong role for the replication of results in diverse populations that are presumed to have different potentials for severe confounding. It is worth noting that experimental studies also have important practical limitations in the context of environmental epidemiology. Data analysis methods for cohort studies with accurate and complete assessment of exposure variables, confounding factors, and potential health consequences are well developed, as summarized in "Exposure-Response Estimation in Cohort Studies," below.
Case-control studies in which exposure and confounding factors are assessed retrospectively are subject to all of the biases noted above, as well as to recall bias, which occurs when diseased individuals (cases) and disease-free individuals (controls) differentially recall their exposures, their confounders, or their health outcome. Aggregate data studies, referred to later in this paper as ecologic studies, attempt to relate the exposure and confounding factor experience of groups to their corresponding disease rates. Such studies may be subject to additional biases if the statistical model for the group disease rate does not equal the average of valid disease rate models for the individuals being aggregated. Apparent disagreement between environmental epidemiologic studies can also arise, not from bias, but from lack of power combined with attention to point estimates rather than confidence limits. The ability to detect an association between the levels of an exposure variable or exposure history and the risk of a disease depends primarily on the observed number of disease events in the study sample, on the range of exposures in the sample, and on the strength of association between exposure and disease. The distribution of exposures in the study cohort, or in the cohort from which cases and controls are selected for a case-control study, also has important influences on study power. While random measurement error in (univariate) exposure assessment will not invalidate, under weak conditions, a test for the hypothesis that no association between exposure and disease exists, test power may be reduced considerably by such measurement errors. Also, estimates of dose-response parameters may be substantially distorted (usually biased downward), including the possibility of a loss of monotonicity of dose-response trends (1).
Thus, the proper analysis and interpretation of environmental epidemiologic studies rely heavily on the investigator's assessment of the magnitude of both potential biases and study power in the absence of such biases. For practical reasons, the power of specific studies will often be rather low, and knowledge of disease mechanisms and measurement properties will be too limited to place useful bounds on potential biases. Hence, there are important uses for formal tests of the equality of exposure-disease associations from two or more studies in differing populations and for techniques used in combining the results of several studies. This topic will be discussed in the section titled "Comparing and Combining the Results of Several Studies." The following section describes statistical- and biological-based models that can serve as the basis for exposure-disease analyses.

Models for Disease Occurrence
The simplest cohort studies occur when exposure takes place in one instant, as in Japanese atomic bomb survivors, or is constant over the individual's lifetime, as in some animal inhalation experiments. However, most exposures, and most confounders, are complex functions of time and demand a more complicated mathematical description. Our discussion of descriptive disease occurrence models begins with the over-simplified case.
Let λ0(t) denote the instantaneous rate of occurrence of a study disease or other health-related event for subjects of age t who have not received the exposure of interest. This means that if N such persons, all at age t, were observed for a short time dt, the expected number of disease occurrences would be Nλ0(t)dt. If a person of age t received an exposure Z, the instantaneous occurrence rate would be altered from λ0(t) to λ(t;Z), and the (instantaneous) relative risk is λ(t;Z)/λ0(t).
These rates are nonnegative and, provided neither is zero, one can take the logarithm of this relative risk. It is often convenient and useful to assume the logarithm of the relative risk to be a linear function of exposure and confounding factor measurements. This is equivalent to modeling the relative risk as an exponential function, exp(Xβ), where the vector X = (X1,...,Xp), which replaces the more general Z, consists of carefully chosen (and usually incomplete) measures of exposure or confounding factors, with X = (0,...,0) corresponding to no exposure and standard values for confounders. The regression coefficients (β1,...,βp) that comprise the vector β (or, more precisely, its transpose βT) then tell us about the impact of each Xi on relative risk when the other X's are held fixed.
The result is a simple proportional hazards (or Cox) model λ(t|Z) = λ0(t)exp(Xβ) [1], which is used widely in the analysis of failure time data (2). In order to deal with complications inherent in most environmental epidemiologic studies, one must generalize this discussion and complicate the appearance of some formulae, but be careful not to change the essentials of the approach. Such generalization follows in the next subsection.
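As a numeric sketch of model [1], the following Python fragment (with hypothetical baseline rate and coefficient values, chosen purely for illustration) verifies that the ratio of hazards for exposed versus unexposed subjects reduces to exp(Xβ) and does not depend on age t:

```python
import math

def hazard(t, x, beta, baseline):
    """Cox model [1]: lambda(t|Z) = lambda0(t) * exp(x . beta)."""
    xb = sum(xi * bi for xi, bi in zip(x, beta))
    return baseline(t) * math.exp(xb)

# Hypothetical baseline rate (rises steeply with age) and coefficients.
baseline = lambda t: 0.001 * (t / 50.0) ** 4
beta = [0.7, 0.2]             # log relative risks per unit of each exposure

x = [1.0, 2.0]                # exposure measurements; x = [0, 0] is unexposed
rr_60 = hazard(60.0, x, beta, baseline) / hazard(60.0, [0.0, 0.0], baseline=baseline, beta=beta)
rr_30 = hazard(30.0, x, beta, baseline) / hazard(30.0, [0.0, 0.0], baseline=baseline, beta=beta)
```

The proportional hazards structure is visible in the result: rr_60 equals exp(0.7·1 + 0.2·2) = exp(1.1), and rr_30 is identical, since λ0(t) cancels.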
Descriptive Relative-Risk Models

As above, let λ0(t) denote the instantaneous rate of a study disease (or other health-related problem) at age t in the absence of the exposure of interest. A person of age t may have received exposures z(u) at certain ages u < t. One can refer to Z(t) = {z(u), u < t} as the person's exposure history up to age t. Furthermore, one can allow the vector z(u) to include the values of confounding factors at age u, so that Z(t) includes both exposure and confounding factor histories up to age t. The disease rate at age t is λ{t|Z(t)}, a function of this exposure and confounder history. The relative risk associated with history Z(t) is then the ratio λ{t|Z(t)}/λ0(t). Because this ratio is nonnegative, it can be, and often is, modeled using an exponential function exp{X(t)β}, where X(t) = {X1(t),...,Xp(t)}. This function consists of data-analyst-defined functions of Z(t) and t, with X(t) = (0,...,0) again corresponding to no exposure and standard confounder histories, while βT = (β1,...,βp) is a corresponding vector of relative-risk parameters to be estimated.
This relative risk (RR) regression model λ{t|Z(t)} = λ0(t)exp{X(t)β} [2], also called the Cox regression model or (inaccurately) the proportional hazards model (2-4), or an approximation to these models, forms the basis for most descriptive analyses of environmental epidemiologic studies. As a simple example to illustrate the notation, consider the relationship between exposure to ionizing radiation and the rate of a certain cancer in the atomic bomb-exposed populations in Hiroshima and Nagasaki. One could define z(u0)T = {z1(u0), z2(u0)} [3] as the gamma and neutron exposures for a person at age u0 in 1945 when the exposure occurred, and z(u) ≡ 0 otherwise. A specification X(t) ≡ z(u0) then assumes a log-relative risk function that is linear in gamma and neutron exposure levels. The regression model can be relaxed to allow, for example, the relative risk to depend on age at exposure and time since exposure and to allow for nonlinear dependencies of the log-relative risk on gamma and neutron exposure. As noted above, the histories of potential confounding factors can also be included in Z(t), in which case X(t) will include functions of both the exposures of interest and other factors, while product terms between the two will allow the relative risk associated with a given exposure history to depend on the value of other variables. This allowance is termed effect modification in epidemiologic parlance. Confounding factors may also be controlled by means of stratification rather than, or in addition to, regression modeling using the descriptive model λ{t|Z(t)} = λ0s(t)exp{X(t)β} [4], where the baseline rate λ0s(t) is allowed to vary across a number of strata s defined as functions of age t and confounding factor values.
Relative-risk forms other than exponential also may be considered in the above models.
In particular, the linear form 1 + X(t)β often is felt to be theoretically and empirically more appropriate for certain carcinogenic exposures and has been used widely in the radiation literature, sometimes with the addition of quadratic terms. Absolute- rather than relative-risk models, such as λ{t|Z(t)} = λ0(t) + X(t)β [5], also have been used in modeling radiation effects, although there is a consensus that this form generally does not fit well without the addition of terms for the modifying effect of age at exposure and latency. It may also be useful for modeling certain rare diseases such as mesothelioma, for which the baseline rate in the absence of asbestos exposure is virtually zero. In all of these alternatives to the standard exponential relative-risk model, estimates of the relative-risk parameters and baseline rates are often found to have poor statistical properties. However, quite general programs that use likelihood-based methods to obtain appropriate confidence limits (5) are now available to fit a broad class of relative- and absolute-risk models with combinations of linear and exponential terms.
Suppose that the regression vector X(t) in the above unstratified model consists only of functions of the exposure variable under study, and let pr{X(t)} denote the probability density for value X(t) in the source population of the modeled regression vector. In addition to estimating the relative-risk function, one may be interested in the fraction of the disease incidence at age t that may be attributed to exposure. If the disease rate for all study subjects were reduced to the baseline rate λ0(t), then the overall incidence at age t would be reduced by the attributable proportion 1 − ∫λ0(t)pr{X(t)}dX(t) / ∫λ0(t)exp{X(t)β}pr{X(t)}dX(t). [6] A similar expression can be written for the attributable proportion under the stratified relative-risk model. In some applications of these relative-risk models, it is convenient to define the basic time variable t to be chronological time or time from entry into a certain cohort rather than age, which is accommodated through stratification or regression modeling. For example, in a cohort study with covariate information collected at specified points in chronological time, such a definition can help ensure comparability of the covariate (i.e., exposure and confounding) information on all study subjects at a given value of t.
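For a discrete exposure distribution, the attributable proportion in [6] reduces to 1 − 1/E[exp{X(t)β}], since the baseline rate λ0(t) cancels from numerator and denominator. A minimal sketch, with hypothetical exposure probabilities and coefficient:

```python
import math

# Hypothetical discrete exposure distribution pr{X(t) = x} and a
# log-linear relative risk exp(beta * x); values are illustrative only.
exposure_probs = {0.0: 0.6, 1.0: 0.3, 2.0: 0.1}

def attributable_proportion(beta):
    # The baseline rate lambda0(t) cancels in the ratio of expression [6],
    # leaving 1 - 1 / (average relative risk in the source population).
    mean_rr = sum(p * math.exp(beta * x) for x, p in exposure_probs.items())
    return 1.0 - 1.0 / mean_rr

ap = attributable_proportion(0.5)       # about 27% of incidence attributable
ap_null = attributable_proportion(0.0)  # no exposure effect: nothing attributable
```

With these illustrative numbers, roughly 27% of the incidence would be removed if all subjects had the baseline rate, and the proportion is exactly zero when β = 0.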
There are distinct advantages in using hazard rates, or instantaneous disease rates λ{t|Z(t)}, in our formulae rather than disease rates over some specified age or time period, in part because the interpretation of these latter rates will depend on the duration of the age or time period in question, which will vary inevitably from study to study. Nevertheless, in some studies one observes only whether disease occurs in a certain time period rather than the actual times or ages of disease occurrence. Let D = 1 denote disease occurrence during a prescribed disease ascertainment period for a study and D = 0 denote lack of occurrence. Ignoring issues such as competing risks and losses to follow-up, one may choose to model the disease probabilities pr{D = 1|Z(t0)} by an exponential-form odds-ratio model in which [pr{D = 1|Z(t0)} / pr{D = 0|Z(t0)}] / [pr{D = 1|Z(t0) = Z0} / pr{D = 0|Z(t0) = Z0}] = exp{X(t0)β}, [7] where Z(t0) denotes a subject's exposure and confounding factor history at age t0 at the beginning of the ascertainment period, and Z0 denotes the standard, or base, covariate history. This odds-ratio model can be rewritten as a logistic regression model pr{D = 1|Z(t0)} = exp{α(Z0) + X(t0)β} / [1 + exp{α(Z0) + X(t0)β}], [8] where the function α(Z0) may, for example, be defined to take value αs whenever the study subject falls in stratum s, which is defined as a function of potential confounding factor values at t0.
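A small numerical check that the logistic form [8] yields exactly the exponential-form odds ratio [7], using a hypothetical stratum intercept and coefficient:

```python
import math

def logistic_prob(alpha, xb):
    """Model [8]: pr{D=1 | Z(t0)} = exp(alpha + x.beta) / [1 + exp(alpha + x.beta)]."""
    e = math.exp(alpha + xb)
    return e / (1.0 + e)

alpha = -2.0   # hypothetical stratum intercept alpha_s
beta = 0.8     # hypothetical log odds ratio per unit of exposure

p1 = logistic_prob(alpha, beta * 1.0)  # exposed subject, X(t0) = 1
p0 = logistic_prob(alpha, 0.0)         # base covariate history Z0, X(t0) = 0

# Expression [7]: the odds ratio comparing the two covariate histories.
odds_ratio = (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))
```

The intercept α(Z0) cancels in the odds ratio, which equals exp(β) regardless of its value.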
The above relative-risk and odds-ratio models are purely descriptive models. Their application is intended as an aid for summarizing and displaying aspects of large, complex data sets. In some situations, such as a regulatory decision concerning the safe level of a certain exposure, it will be essential to bring to bear any available biologic or mechanistic knowledge on the inference problem. Such knowledge could be used, for example, to specify a form for the relative risk at age t as certain elements of X(to) approach zero, where these elements capture the dosage, duration, or other aspects of the exposure in question. Similarly, knowledge or assumption about the pertinent biological mechanisms could be used to derive models for X{tIZ(t)} of forms other than those mentioned above. The next subsection overviews two classes of carcinogenesis models that have been proposed on mechanistic or biological grounds.

Mechanistic and Biologically Based Models

Efforts to describe a disease process in terms of deterministic or stochastic models have focused mostly on models for the spread of infectious diseases in a population and models for carcinogenesis. Some of the work on carcinogenesis models, as outlined below, may be pertinent to other diseases.
Much of the early work on mathematical models for cancer was reviewed in a classic paper by Armitage and Doll (6). Whittemore and Keller (7) also provide a comprehensive review. A major contribution of the Armitage and Doll paper is the use of the multistage model of carcinogenesis. This model is based on the assumptions that cancer results from a single cell line undergoing a series of discrete, heritable changes (e.g., point mutations, chromosomal breaks or translocations, or other types of copying errors) in a particular sequence, and that the rates of such transitions do not depend explicitly on age, although they may be affected by exposure to carcinogens or by factors that modify the rate of cell division. As a consequence of these and some additional assumptions, the age-specific incidence rate is predicted to vary approximately as the (k−1)st power of age, where k is the number of transitions required (usually estimated to be about 5 to 7 for adult tumors). If a carcinogenic exposure occurs at a constant rate over time, the incidence will vary approximately as a polynomial function of dose rate of order equal to the number of dose-dependent transitions. If exposure is instantaneous or varies over time, the incidence rate will be modified by age at exposure and/or time since exposure, depending upon which stage(s) is dose-dependent.
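Under these assumptions, the Armitage-Doll approximation predicts incidence rising as the (k−1)st power of age, so log incidence versus log age is a straight line of slope k−1. A minimal sketch (scale constant arbitrary):

```python
import math

def multistage_incidence(t, k, c=1e-12):
    """Armitage-Doll approximation: incidence varies as c * t**(k - 1)."""
    return c * t ** (k - 1)

k = 6  # illustrative number of required transitions (typically ~5 to 7)

# Slope of log incidence against log age between ages 40 and 80:
slope = (math.log(multistage_incidence(80.0, k)) -
         math.log(multistage_incidence(40.0, k))) / (math.log(80.0) - math.log(40.0))
```

The computed slope is exactly k − 1 = 5, the log-log linearity that early empirical tests of the model examined in population age-specific rates.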
Until recently, most of the empirical tests of these predictions have been done by fitting the model to aggregate data on population age-specific rates, or to broadly grouped data on cohorts, stratified by dose, age at exposure, or time since exposure. A problem with this approach is the difficulty of separating the effects of dose rate, age at first exposure, duration of exposure, time since last exposure, and attained age, all of which influence the predictions of the model. Simple comparisons of one factor without controlling the other factors can be misleading. This is less of a problem when animal bioassay data are used, as these are usually limited to constant, lifetime dose regimens. However, such data are not informative about whether the carcinogen acts at an early or late stage. Nevertheless, the approach has been used for risk-assessment purposes by many regulatory agencies. The default approach advocated by the U.S. Environmental Protection Agency (EPA) and others involves fitting the multistage model to available epidemiologic or toxicologic data and using an upper confidence limit on the estimated slope coefficient (scaled for species differences in weight and life span) to compute the lifetime excess risk in humans. The scientific and statistical validity of this approach is controversial (8,9).
With the development of general relative-risk models ("Descriptive Relative-Risk Models" above), it has become possible to test the multistage and other models by fitting them directly to data on individuals. This offers great advantages for dealing with time-dependent exposures, which are the most informative about the stage at which a carcinogen acts. This approach has been applied to data on occupational exposures to asbestos (10), arsenic (11), and benzene (12); on the atomic bomb survivors (13); and on smoking (14), with varying results. The three occupational applications all were consistent with a single stage of action (relatively late for asbestos and arsenic, early for benzene), while the radiation and smoking data both showed signs of two stages being affected.
The multistage model has several important limitations, including its inability to account for leukemia and childhood cancers, the genetics of cancer, and the distinction between mechanisms of initiation and promotion. It also has been criticized for its need for as many as 5 to 7 stages to account for the steep age dependence, when only two or three have been established in experimental systems. Moolgavkar and Knudson (15) have proposed an alternative model that addresses these issues. This model assumes that two mutational events are required and that cell lines that have experienced the first event may be at a competitive advantage (proliferation) or disadvantage (repair) relative to normal cells. Carcinogens might act by affecting either mutation rates or proliferation rates. Major gene effects are accounted for by assuming that individuals who inherit the gene begin life with all cells in the intermediate stage. This model has been successful in fitting epidemiologic data on smoking (15,16), breast cancer (17), and radon (18). In the latter example, data from an experimental study of rats exposed to radon were fitted to the model, and radon was found to have an effect on both the mutation and proliferation rates. However, the interpretation of this result is complicated by the authors' use, for both of these dependencies, of a power-function dose-response relationship with a very low exponent rather than a simple linear dose-response. Thomas (19) has proposed a variant of this model that adds an additional stage to the process to try to explain the difference in the modifying effect of dose rate and the duration of exposure for different types of radiation; so far, no attempt has been made to test this model.
Environmental Health Perspectives Supplements, Volume 101, Supplement 4, December 1993
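The flavor of the two-mutation idea can be sketched numerically. The fragment below is a crude deterministic (Euler) approximation with entirely hypothetical rate parameters, not the stochastic Moolgavkar-Knudson likelihood; it simply shows how a net proliferative advantage (promotion) for intermediate cells sharply raises cumulative cancer risk:

```python
import math

# Deterministic sketch of the two-mutation model: normal cells mutate to
# intermediates at rate mu1; intermediates grow at net rate g (division
# minus death) and mutate to malignant cells at rate mu2.
# All parameter values are hypothetical, chosen only for illustration.

def cumulative_risk(mu1, mu2, g, n_cells=1e7, age=70.0, dt=0.1):
    intermediates, cum_hazard, t = 0.0, 0.0, 0.0
    while t < age:
        intermediates += (n_cells * mu1 + g * intermediates) * dt
        cum_hazard += mu2 * intermediates * dt
        t += dt
    return 1.0 - math.exp(-cum_hazard)

risk_promotion = cumulative_risk(mu1=1e-7, mu2=1e-7, g=0.1)   # promoting exposure
risk_no_promotion = cumulative_risk(mu1=1e-7, mu2=1e-7, g=0.0)
```

With the same two mutation rates, the promoted (g = 0.1/year) scenario yields a lifetime risk tens of times higher than the unpromoted one, illustrating why carcinogens may act through proliferation rather than mutation rates.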
With the rapid growth in our understanding of the fundamental biology of cancer, further development of methods to validate these mechanistic ideas and, where appropriate, to incorporate them into the analysis of epidemiologic data would be worthwhile. Most of the models that have been considered seriously are sufficiently general that some parameter values can be found to provide an adequate fit to epidemiologic data sets. Thus, these models are not easily falsified as a class, and it is unlikely that one could choose among them on purely statistical grounds. Instead, their utility lies in the types of comparisons that can be made within the context of a particular model—whether a carcinogen acts at an early or a late stage in the multistage model or as an initiator or a promoter in the two-stage model, for example. Their real value, therefore, lies in their ability to organize a complex set of hypotheses into a unified framework and to suggest empirical tests, in populations of humans or animals, of mechanistic ideas suggested by observations at the cellular level. Research efforts to identify and measure the assumed biological entities on the pathway to cancer cell formation seem particularly well motivated.

Exposure-Response Estimation in Cohort Studies
Relative-Risk and Odds-Ratio Estimation

Consider the unstratified relative-risk regression model of "Relative-Risk Models." A cohort study involves the selection of a sample of individuals from the population under study, followed by follow-up to observe disease occurrence. The relative-risk parameter β can be estimated by maximizing a partial likelihood function L(β) that is a product, over all disease occurrence times (ages) that appear in the sample, of the ratio of the relative risk for the subject developing disease to the sum of the relative risks for all subjects at risk at that time (20). The corresponding likelihood function under the above stratified relative-risk model is simply the product over strata of the stratum-specific likelihood functions. Note that this estimation procedure is quite general in that exposure variables, confounding variables, and stratum assignments each can vary with follow-up time. The principal assumption underlying this estimation procedure requires the set of subjects at risk for disease at any follow-up time to be representative of the base population, conditional on the covariate history and stratum assignment. This assumption will be satisfied, for example, if study subjects are sampled randomly and independently from the study population, and if rates of censoring (e.g., losses to follow-up) at a given follow-up time depend at most on the covariate histories and stratum assignments at that time. Also, under weak conditions, L(β) can be manipulated as if it were an ordinary likelihood function for asymptotic inference on β (21,22). Various computer programs are now available for the estimation of β and, therefore, also of the relative-risk process exp{X(t)β}.
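The structure of this partial likelihood can be made concrete with a toy cohort and a crude grid search (real software uses Newton-type iteration); the data values and the grid are hypothetical, for illustration only:

```python
import math

# Toy cohort: (follow-up time, event indicator, exposure measurement x).
data = [(2.0, 1, 1.0), (3.0, 1, 0.0), (4.0, 0, 1.0), (5.0, 1, 0.0)]

def log_partial_likelihood(beta):
    """One factor per disease occurrence time: relative risk of the case
    divided by the sum of relative risks over the risk set at that time."""
    ll = 0.0
    for t_i, d_i, x_i in data:
        if d_i:
            risk_set = [math.exp(beta * x) for t, _, x in data if t >= t_i]
            ll += beta * x_i - math.log(sum(risk_set))
    return ll

# Crude maximization over a grid of candidate beta values.
betas = [b / 100.0 for b in range(-300, 301)]
beta_hat = max(betas, key=log_partial_likelihood)
```

For these four subjects the score equation can be solved in closed form, giving a maximizer at 0.5·ln 2 ≈ 0.347; the grid search lands on the nearest grid point.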
The score statistic U(β0), defined as the value at β = β0 of the derivative with respect to β of log L(β), can be used to test β = β0. If β0 = 0 and X(t) consists only of indicator variables to distinguish exposure groups, then U(β0) = U(0) is known as the log-rank statistic. Other choices of X(t) yield other familiar censored data test statistics, including generalizations of the Wilcoxon statistic.
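For a binary exposure indicator, U(0) has the familiar log-rank "observed minus expected" form: each disease occurrence contributes the case's exposure indicator minus the proportion exposed among those still at risk. Computed directly on toy data (values hypothetical):

```python
# Toy cohort: (follow-up time, event indicator, binary exposure indicator).
data = [(1.0, 1, 1), (2.0, 1, 0), (3.0, 1, 1), (4.0, 0, 0)]

u0 = 0.0
for t_i, d_i, x_i in data:
    if d_i:
        # Risk set: everyone still under follow-up at time t_i.
        risk_set = [x for t, _, x in data if t >= t_i]
        # Observed exposure of the case minus its expectation under beta = 0.
        u0 += x_i - sum(risk_set) / len(risk_set)
```

Here the three event-time contributions are (1 − 2/4), (0 − 1/3), and (1 − 1/2), summing to 2/3; a positive score suggests more events among the exposed than expected under no association.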
Suppose now that there is no possibility of early censorship in the cohort study throughout the follow-up. The odds-ratio parameter β in the logistic regression model of "Relative-Risk Models," along with the location parameters αs = α(Z0), can be estimated from a likelihood function L(β) that is simply the product over all study subjects of the logistic regression probabilities pr{D = 1|Z(t0)} for subjects developing disease, and one minus such probabilities for other study subjects. Computer programs are widely available for inference on β from this likelihood function. If there are few disease events in stratum s, it is preferable to eliminate αs by conditioning on the number of such events prior to applying standard likelihood procedures for the estimation of β (23).
The likelihood functions just described may seem esoteric to readers not having a statistical background. The main point to note, however, is that estimation of relative-risk and odds-ratio parameters in the very flexible models of "Relative-Risk Models" is now routine, and suitable software is available. Of course, the odds-ratio parameter will approach the relative-risk parameter if the disease acquisition period dt becomes short. This occurs because the odds of disease, pr{D = 1|Z(t0)} / [1 − pr{D = 1|Z(t0)}], [9] then typically approaches λ{t|Z(t0)}dt, from which the exponential-form odds ratio approaches a corresponding exponential-form relative risk with identical regression parameter.
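This convergence of the odds ratio to the relative risk as the ascertainment period shrinks can be checked numerically with constant hypothetical hazards:

```python
import math

# Constant hypothetical hazards for exposed and unexposed subjects.
lam1, lam0 = 0.02, 0.01
true_rr = lam1 / lam0  # relative risk = 2

def odds_ratio(dt):
    """Odds ratio [7] when D indicates disease within a period of length dt."""
    p1 = 1.0 - math.exp(-lam1 * dt)  # pr{D = 1} for exposed over the period
    p0 = 1.0 - math.exp(-lam0 * dt)  # pr{D = 1} for unexposed
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))

or_long = odds_ratio(50.0)   # long ascertainment period: OR overstates RR
or_short = odds_ratio(0.01)  # short period: OR approaches RR
```

Over a 50-year period the odds ratio is roughly 2.6, well above the true relative risk of 2, whereas over a very short period it agrees with the relative risk to three decimals.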
Estimation of the relative-risk regression parameter β may be computationally demanding if there are many distinct disease incidence times and if the regression vector and stratum assignment depend on time.
However, if each λ0s(t) is defined to be constant over a partition of the time axis and X(t) is restricted to be constant within the elements of this partition, then β can be estimated in a computationally simple fashion using Poisson regression methods. See Preston et al. (24) for application of such methods to radiation dose-response estimation from the Hiroshima and Nagasaki cohorts. Particular care is required if these estimation procedures are applied to cohorts having few cases or if most cases occur within a small portion of the overall range of exposures. Asymptotic formulae for interval estimation on β may then be inaccurate, and more specialized procedures (e.g., resampling methods) may be required. In fact, there has been little study of the cohort data configurations under which such asymptotic formulae will provide adequate approximations. Kalbfleisch and others consider estimation of Λ0s(t) = ∫λ0s(u)du, [10] the cumulative baseline disease rate in stratum s in the stratified model of "Relative-Risk Models," over the range of ages t ≥ t0s represented in the cohort. A simple nonparametric estimator of Λ0s(t) can then be defined as the sum, over all disease occurrence times in stratum s, of the ratio of the number of stratum s failures to the sum of the relative risks for all subjects at risk in stratum s at that time, with all relative risks evaluated at the β that maximizes L(β). As with ordinary regression methods, model-checking procedures are important to the application of relative-risk and odds-ratio models. Such procedures naturally focus on the assumed relative-risk process, exp{X(t)β}, because other aspects of the model essentially are nonparametric. For example, the postulated relative-risk function can be generalized by adding well-selected additional elements to X(t) and testing the hypothesis that corresponding coefficients equal zero.
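The grouped person-time (Poisson regression) idea is simplest with a single binary exposure, where the maximum likelihood log rate ratio has a closed form; the cell counts below are hypothetical, and analyses with several covariates require iterative fitting:

```python
import math

# Grouped person-time table: (disease events, person-years) per exposure cell.
d0, py0 = 20, 10000.0  # unexposed follow-up
d1, py1 = 15, 3000.0   # exposed follow-up

# With one binary exposure, the Poisson maximum likelihood estimate of the
# log rate ratio is just the log of the ratio of the two observed rates.
beta_hat = math.log((d1 / py1) / (d0 / py0))
rate_ratio = math.exp(beta_hat)
```

Here the exposed rate is 15/3000 = 0.005 per person-year against a baseline of 20/10000 = 0.002, so the estimated rate ratio is 2.5; with piecewise-constant baseline rates and covariates, the same likelihood is maximized iteratively by standard Poisson regression software.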
Computationally feasible methods also have been developed for approximating the influence of each study subject or each age group on β-estimation, in order to highlight questionable data points and to highlight vulnerabilities of the inference to model assumptions (25). Graphical procedures are particularly useful. In addition to the usual types of plots of influence (i.e., sensitivity) values and residuals, plots of separate estimates of Λ0s(t) for subsets of the cohort can provide useful visual checks on proportionality and other relative-risk assumptions (3).
The fact that the baseline rates λ0s(t) are unrestricted is an important source of robustness with respect to β-estimation. Specifically, relative-risk estimation is unlikely to be affected much if the intensity of ascertainment of disease events in the cohort varies somewhat across time or among strata. Similarly, location shifts in the modeled regression vector X(t) across different values of t would not affect β-estimation in the exponential-form relative-risk model. However, more general measurement error in the ascertainment of X(t) may have a profound effect on relative-risk estimation.
Measurement Error in Exposure Variables and Confounding Factors

Epidemiologists have long recognized that errors in the measurement of the study variables, including misclassification in the case of categorical variables, can lead to biased tests and estimates of the associations under study. Measurement error in the exposure histories or confounding factor histories may be of particular importance in environmental epidemiologic applications. Unfortunately, the methodology for avoiding bias due to measurement error is still at a rudimentary stage of development.
Consider the unstratified relative-risk regression model of "Relative-Risk Models" and suppose that, rather than the covariate history Z(t), one observes an estimate W(t). The disease rate function at age (or chronological time) t, given the observed covariate history W(t), can then be written (26) λ{t|W(t)} = λ0(t)E[exp{X(t)β}|W(t)], [11] where the expectation also is conditional on lack of disease occurrence or censorship prior to t. In fact, this induced relative-risk model also requires λ{t|Z(t),W(t)} = λ{t|Z(t)}, [12] so that W(t) is unrelated to disease risk, given the true covariate history Z(t). Unfortunately, the expectation in λ{t|W(t)} generally depends on the baseline rates λ0(u), u ≤ t, which complicates the estimation. However, in cohort studies in which the cumulative probability of disease occurrence is small, this dependence usually can be ignored, and estimation of β can be based on a likelihood function in the form described above upon specifying a measurement error distribution for X(t) given W(t), from which λ{t|W(t)} can be calculated.
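The attenuating effect of classical measurement error is easily simulated. The sketch below uses a linear-model analogue rather than the induced relative-risk model [11], with hypothetical variances: the fitted slope on the error-prone surrogate W is biased toward zero by the factor var(Z)/[var(Z) + var(error)]:

```python
import math
import random

random.seed(1)
n = 20000
beta = 0.5               # true effect of the exposure Z on the outcome
var_z, var_e = 1.0, 1.0  # hypothetical exposure and measurement error variances

zs = [random.gauss(0.0, math.sqrt(var_z)) for _ in range(n)]
ws = [z + random.gauss(0.0, math.sqrt(var_e)) for z in zs]  # W = Z + error
ys = [beta * z + random.gauss(0.0, 0.5) for z in zs]        # outcome driven by Z

# Least-squares slope of the outcome on the error-prone W.
mean_w = sum(ws) / n
mean_y = sum(ys) / n
slope = (sum((w - mean_w) * (y - mean_y) for w, y in zip(ws, ys)) /
         sum((w - mean_w) ** 2 for w in ws))

attenuation = var_z / (var_z + var_e)  # classical attenuation factor = 0.5
```

With equal exposure and error variances the recovered slope is close to 0.25, half the true effect, illustrating the downward bias of dose-response estimates noted earlier.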
Specification of the distribution of X(t) given W(t) would seem to be a hazardous undertaking unless there is a subsample in which both Z(t) and W(t) are available. In the presence of such a validation sample, simultaneous inference on relative-risk parameters and measurement error distribution parameters is possible (27), though further development is necessary before such estimation can be viewed as routine. More difficult issues arise if a true validation sample is not available. A reliability sample, in which separate estimates W1(t) and W2(t) of Z(t) are obtained on a cohort subsample at two (or more) points in time, permits insight into some aspects of the measurement error distribution, but additional strong assumptions are required for the estimation of β. Even if the exposures under study are precisely estimated and pertinent confounding factors are identified, severe confounding may occur if confounding factor histories are measured with error (28), as is obvious if one considers an extreme situation in which measurement error produces a totally useless confounding factor estimate. This bias is likely to be more acute if the exposure and confounding factor values appearing in X(t) are highly correlated.
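One common use of a validation sample is regression calibration: E[Z|W] is estimated from the subsample in which both the true exposure and its surrogate are observed, and this calibrated value replaces W in the main analysis. The sketch below, with simulated data and a simple least-squares calibration line, is a simplified stand-in for the simultaneous inference cited above:

```python
import random

random.seed(2)

# Simulated validation subsample: true exposure Z and surrogate W = Z + error.
pairs = [(z, z + random.gauss(0.0, 1.0))
         for z in (random.gauss(0.0, 1.0) for _ in range(5000))]

zs = [z for z, _ in pairs]
ws = [w for _, w in pairs]
mz, mw = sum(zs) / len(zs), sum(ws) / len(ws)

# Least-squares calibration line for E[Z | W].
b = sum((w - mw) * (z - mz) for z, w in pairs) / sum((w - mw) ** 2 for w in ws)
a = mz - b * mw

def calibrated(w):
    """Regression-calibration estimate of E[Z | W = w]."""
    return a + b * w
```

With unit exposure and error variances the calibration slope is near 0.5; substituting calibrated(w) for w in the disease model approximately corrects the attenuation, at the cost of the additional assumptions discussed in the text.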
A hypothetical cohort study of prenatal exposure to passive smoking in relation to the risk of lower respiratory disease during the first 3 years of life provides an illustration in Morgenstern and Thomas (this volume). Any elevation in the odds of lower respiratory disease among more heavily exposed neonates may be severely attenuated by inaccuracies in exposure assessment in such a study. An analysis that controls for passive smoke exposure during the first 3 years of life, an exposure that would often be highly correlated with prenatal exposure, may be dominated by measurement error and be totally unreliable. A more practical illustration of the impact of measurement error is seen in the analysis of the mortality rates of various cancers in relation to gamma and neutron exposures in the Hiroshima and Nagasaki cohorts. Individual exposure estimates were constructed based on each study subject's location and shielding information as early as 1960. These estimates have continued to be refined in succeeding years through the use of improved models for the yields of the two bombs and more sophisticated models for the formation, transmission, and attenuation of gamma and neutron radiation. Many of the analyses of these cohorts simply combine gamma and neutron exposures into a single total dose estimate. The corresponding cancer mortality analyses have been affected somewhat by the changes in total dose estimates from one dosimetry system to the next (e.g., in the magnitude of elevated relative risks and the apparent shape of the dose-response curves), whereas analyses that attempt to estimate simultaneously the effects of gamma and neutron exposures on relative risk have been completely changed by dose estimate modifications.
This illustrates the difficulty of reliably estimating exposure-disease associations when there are two or more exposure variables that are each measured with error (random or systematic) or, analogously, when there are exposure and confounding variables each measured with error. Very similar issues arise in epidemiologic studies of nonenvironmental factors; for example, they arise in nutritional epidemiology in attempts to separate the effects of fat and calories, or the effects of types of fat by degree of saturation, on cancer risk (29).
Some recent work has concentrated on developing methods to adjust associations for the effects of measurement errors when their distributions are known. A very general framework for attacking this problem has been outlined by Clayton (30), who specifies the problem in terms of component models: the disease model describes the dependence of disease risk on true exposures and other factors; the measurement error model describes the relationship between true and measured exposures and any modifying factors; and the exposure model describes the population distribution of true exposures. These three models are combined in a maximum likelihood framework, and approaches to estimating the parameters of the disease model are described. Unfortunately, the approach is mathematically intractable in its general form, but useful progress has been made in some special cases. For categorical variables, Greenland and Kleinbaum (31) described a method based on applying the inverse of a matrix of known misclassification rates to the subject counts by measured exposure and disease classifications. Hui (35) and others have discussed approaches that replace the measured doses with empirical Bayes estimates of the true dose and use these in standard analyses. For a general review of these approaches, see Armstrong (36) and Thomas et al. (37). Another recent development involves combining nonparametric density estimation techniques with a computational device known as Gibbs sampling to overcome the tractability problems in the Clayton approach and avoid the need for parametric assumptions about the distribution of true doses. This method has been applied to data on studies of leukemia and thyroid disease in Utah residents downwind of the Nevada Test Site (38,39).
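The matrix method for categorical variables can be sketched numerically. The sensitivity and specificity values and the cell counts below are hypothetical, chosen only to show the two facts the method rests on: nondifferential misclassification attenuates the observed odds ratio, and applying the inverse of the (known) misclassification matrix to the observed counts recovers the true one.

```python
import numpy as np

# Hypothetical misclassification matrix (sensitivity 0.9, specificity 0.8):
# column j gives P(observed class i | true class j); columns sum to 1.
M = np.array([[0.9, 0.2],
              [0.1, 0.8]])

cases_true = np.array([100.0, 300.0])   # truly exposed / unexposed cases
ctrls_true = np.array([50.0, 350.0])    # truly exposed / unexposed controls

cases_obs = M @ cases_true              # expected observed counts
ctrls_obs = M @ ctrls_true

def odds_ratio(cases, ctrls):
    return (cases[0] * ctrls[1]) / (cases[1] * ctrls[0])

# Nondifferential misclassification attenuates the observed odds ratio ...
print(odds_ratio(cases_obs, ctrls_obs), odds_ratio(cases_true, ctrls_true))

# ... and applying the inverse of M to the observed counts recovers it.
cases_adj = np.linalg.solve(M, cases_obs)
ctrls_adj = np.linalg.solve(M, ctrls_obs)
print(odds_ratio(cases_adj, ctrls_adj))
```

In practice the misclassification rates themselves are only estimated, which is exactly the limitation taken up below.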
These approaches are in an early stage of development, but they offer the prospect of removing the bias due to misclassification, correcting the shapes of dose-response curves, adjusting for covariates, and examining interaction effects, all while allowing for the additional uncertainty introduced by errors in the exposure estimates. Further developments along these lines are highly desirable.
Most of the literature on correcting for measurement errors has assumed that the misclassification rates were known and were constant across subjects. In practice, only estimates of these error distributions are available, whether from earlier validation studies, from replicate measurements, from gold standard measurements on a subset of the subjects, or from theoretical uncertainty analysis. Methods need to be developed to account for uncertainties in the estimates of these misclassification rates (40). As a design issue, the optimal allocation of resources between high-quality measurements on a subset and larger numbers of approximate measurements should be considered (41,42). A unique aspect of the Utah fallout studies is the availability of individual-specific uncertainty estimates based on elaborate sensitivity analyses of the exposure pathways. This has allowed subjects with more precise exposure estimates to be given heavier weight in the analysis. Whether such efforts are warranted in terms of improved precision needs to be considered.
In summary, covariate measurement errors can bias severely the results of environmental epidemiologic studies. Improved analytic methods for accommodating random, nondifferential covariate measurement errors are required. Such methodologic developments might naturally focus on the potential for obtaining a true validation sample, on validation study design, and on the incorporation of validation study data in the overall estimation procedure (27).

Exposure-Response Estimation Under Case-Control and Other Sampling Procedures
Relative-risk and odds-ratio estimation often can be carried out more economically by sampling only subjects developing the study disease (the cases) or a random sample thereof, along with a suitably matched sample of subjects without disease (the controls). Typically covariate histories Z(t), where t is the age (time) of case or control ascertainment, then have to be obtained retrospectively.
Consider the stratified relative-risk model of "Relative-Risk Models" and suppose that each case has one or more randomly selected controls that are matched on age at ascertainment (t) and stratum (s). Given the covariate histories {Z1(t),...,Zm(t)} for a case and its (m − 1) age- and stratum-matched controls, the probability that exposure history Z1(t) corresponds to the case is simply the relative risk at t for the case divided by the sum of such relative risks for the m matched subjects (including the case). Hence, the relative-risk parameter β can be estimated by maximizing the likelihood function L(β), which is formed by multiplying these ratios over all matched case-control sets (43). Relaxations of this sampling scheme that avoid strict matching on (t, s) are possible.
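The matched-set likelihood just described can be illustrated with simulated data. Everything in this sketch is hypothetical (1:1 matched pairs, a single scalar covariate, β = 0.7, and a simple grid search in place of a proper Newton-type maximizer): within each set the case is the member drawn with probability proportional to exp{zβ}, and L(β) is the product over sets of the case's relative risk divided by the set total.

```python
import numpy as np

rng = np.random.default_rng(1)
beta_true = 0.7                               # assumed value for the simulation
n_sets = 5000

# Simulate 1:1 matched sets: the case is the member selected with
# probability proportional to exp(z * beta_true).
z = rng.normal(size=(n_sets, 2))              # covariate for the two members
p = np.exp(z * beta_true)
p /= p.sum(axis=1, keepdims=True)
case_idx = (rng.random(n_sets) < p[:, 1]).astype(int)

def cond_loglik(beta):
    # log L(beta) = sum over sets of [z_case*beta - log sum_j exp(z_j*beta)]
    zc = z[np.arange(n_sets), case_idx]
    return np.sum(zc * beta - np.log(np.exp(z * beta).sum(axis=1)))

# Crude grid-search maximizer, for illustration only
grid = np.linspace(0.0, 1.5, 301)
beta_hat = grid[np.argmax([cond_loglik(b) for b in grid])]
print(beta_hat)                               # close to beta_true = 0.7
```

The same likelihood form extends directly to 1:(m − 1) matched sets by letting each set contribute one case term over a denominator of m relative risks.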
Similarly, the odds-ratio parameter β in the logistic regression model of "Relative-Risk Models" can be estimated under case-control sampling by maximizing the resulting logistic regression likelihood function as though a prospective study had been conducted, though the estimates of the αs no longer reflect disease incidence probabilities (23). In fact, the baseline rates λ0s(·) and the αs in the relative-risk and odds-ratio regression models of "Relative-Risk Models" cannot be identified from case-control data in the absence of additional information on case and control sampling fractions.
In general, relative-risk and odds-ratio parameter estimates from case-control studies will be subject to the same biases as cohort studies. They also may be subject to recall bias if exposures or other covariate histories are differentially recalled by cases and controls or if they involve measurements that are affected by disease occurrence or its sequelae. There are often various practical steps that can be taken to minimize bias in ascertaining the covariate histories Z(t) (e.g., interviewers blinded to case or control status), but usually it is not possible to identify residual recall bias because the requirement to obtain prediagnosis and postdiagnosis covariate histories on a sufficient sample of cases would often eliminate much of the efficiency of the case-control design.
As with the cohort study design, nondifferential measurement errors lead to the expectation

E[exp{X(t)β} | W(t)], [13]

where X(t) is the true and W(t) is the measured regression vector at age t, as the identifiable relative-risk function under age- and stratum-matched case-control sampling. To the extent that a representative validation sample can be ascertained retrospectively, there will be a potential to conduct valid relative-risk estimation from this type of study without making further assumptions.
A case-cohort (case-base) sampling procedure can also be considered as a means of reducing the cost or simplifying the logistics of a cohort study. With this design, covariate histories Z(t) are assembled only for cases and a (stratified) random sample of the study cohort. This sampling procedure has advantages if several endpoints (diseases) are to be studied in relation to an exposure. Also, the subcohort may be used to monitor exposures and other variables during the study's follow-up. However, estimation may be less efficient than estimation based on a case-control study with a comparable number of study subjects if cases and subcohort members are not well matched (44,45), and recall bias typically will be an issue. Prentice (46) has developed a procedure for estimating the relative-risk and odds-ratio parameters from case-cohort samples, and, in contrast to case-control sampling, baseline rates also can be estimated without external information. Comparisons and refinements of these sampling procedures are worthwhile research activities. Note also that the use of so-called two-stage designs (47,48) can lead to further valuable efficiency gains in some case-control study applications.

Exposure-Response Estimation in Aggregate Data (Ecologic) Studies
As discussed previously, sometimes it will be economical and convenient to examine an exposure-disease association by relating the disease rates among several groups of individuals to aspects of the exposure experience of each group. Such studies can be referred to as aggregate data studies, since they involve the disease rates and exposures for the aggregate rather than for individuals. They also are commonly referred to as ecologic studies, since groups having differing exposure histories are sometimes defined on an ecologic or geographic basis. Denote by λki(t) the age- and sex-specific disease rate for individual i in the kth group during (chronological) time period t. A multiple-group study involves the analysis of estimates of the group rates λk(t), k = 1,...,K, during a fixed time period; a time-trend study involves estimates of λk(t), t = 1,...,T, in a single population; a mixed study involves estimates at several values of both k and t. An exponential-form relative-risk model for λki(t) can be written, in the notation of "Relative-Risk Models," as

λki(t) = λk0(t)exp{Xki(t)β}, [14]

from which the average disease rate λ̄k(t) for the nk(t) individuals in group k during time period t is

λ̄k(t) = λk0(t)nk(t)⁻¹ Σi exp{Xki(t)β} [15]
      = λk0(t)exp{X̄k(t)β} nk(t)⁻¹ Σi exp{dki(t)β}, [16]

where X̄k(t) = nk(t)⁻¹ Σi Xki(t) and dki(t) = Xki(t) − X̄k(t). [17]

Let yk(t) denote the observed age- and sex-specific disease incidence rate in group k during time period t, as may be available from a disease register or other administrative source. From the above expression for λ̄k(t), one expects a regression of log yk(t) on X̄k(t), for various values of k or t (or both), to yield biased estimates of the relative-risk parameter β because of the influence of the residuals dki(t), even if the logarithms of the baseline rates λk0(t) can be regarded as independent random variables with a common mean. This specification bias will be small if the dki(t) values are small, that is, if the exposure and other regression variables have little variation within groups.
Such bias presumably can be reduced by extending the regression equation to include averages of squares and of higher powers of the dki(t) terms, though there does not appear to have been specific study of this approach. A closely related approach would replace the exponential-form relative-risk model by a linear-form model, so that

λki(t) = λk0(t){1 + Xki(t)β} [18]

and

λ̄k(t) = λk0(t){1 + X̄k(t)β}, [19]

from which the regression of yk(t) on X̄k(t), under certain random-effects assumptions on the baseline rates {λk0(t)}, will yield valid estimates of the linear relative-risk parameter β (49). Note, however, that an exponential-form relative-risk model often might be more parsimonious than a linear-form model in environmental epidemiologic applications, so that the regression vector in a linear relative-risk model may need to be quite lengthy and involve, for example, the average of product terms between exposure and potential confounding factors in order to describe the data adequately. In a multigroup study, it may be sensible to assume the λk0(t) terms are independent random variables with a common mean for k = 1,...,K, though it often may be useful to allow for the possibility of correlation among groups in a similar geographic area. In time-trend and mixed studies, however, it will typically be essential to model, or otherwise accommodate, the correlation structure among λk0(t), t = 1,...,T, at any fixed k. Inadequate modeling of the {λk0(t)} may lead to aggregation bias. These types of data analysis methods have received very little attention in the scientific literature and constitute an important gap in the collection of methods pertinent to environmental epidemiologic applications.
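The contrast between equations [14]-[17] and [18]-[19] can be seen in a small simulation (all parameter values are illustrative assumptions). Under the exponential-form model the true group-average rate exceeds the rate implied by the group-mean exposure by a factor of roughly exp{β²σ²/2} for normally distributed within-group exposures with standard deviation σ, and the gap grows with σ; under the linear-form model the aggregation is exact.

```python
import numpy as np

rng = np.random.default_rng(2)
beta, lam0 = 0.5, 1.0                      # illustrative parameter values

# Exponential-form model: the group-average rate is lam0 * mean(exp(x*beta)),
# while an ecologic regression implicitly uses lam0 * exp(xbar*beta).
for sd in (0.1, 0.5, 1.0):
    x = rng.normal(2.0, sd, size=100_000)  # exposures within one group
    true_avg = lam0 * np.exp(x * beta).mean()
    ecologic = lam0 * np.exp(x.mean() * beta)
    print(sd, true_avg / ecologic)         # ratio grows toward exp(beta^2*sd^2/2)

# Linear-form model: averaging commutes with the rate model, so the
# group-mean exposure reproduces the group-average rate exactly.
x = rng.normal(2.0, 1.0, size=100_000)
linear_avg = (1.0 + x * beta).mean()
linear_eco = 1.0 + x.mean() * beta
print(linear_avg, linear_eco)
```

This is the sense in which small within-group variation (small dki(t)) limits specification bias for the exponential model, while the linear model avoids it by construction.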
Aggregate data studies involving the simple linear regression of disease rates, or the logarithm of disease rates, on average exposures and average values of potential confounding factors can often be conducted quickly and cheaply and can play a useful role in hypothesis generation. It is obvious, however, that more comprehensive data sources and more sophisticated data analyses typically will be required if aggregate data studies are to contribute reliably to the identification and estimation of exposure-disease associations. Better data could come from randomly sampling each of the compared groups in order to obtain estimates X̄k(t) of acceptable precision for use in a linear relative-risk model, or to obtain estimates of the average of exp{Xki(t)β}, i = 1,...,nk(t), for use in an exponential relative-risk model. Random measurement error in the ascertainment of individual exposure and confounding factors could substantially affect survey design. Better data analyses may arise from the application of so-called marginal methods (50,51) to mean and covariance models for the set of yk(t) or log yk(t) values being analyzed.
Most effort to date concerning aggregate data studies has been directed to identifying the biases that may arise from aggregation, confounding, and other sources (52,53). It seems timely to direct a major effort to the development of procedures to prevent (or greatly reduce) such biases and, hence, to evaluate whether aggregate data studies can play a more fundamental and useful role in environmental epidemiologic studies and in epidemiologic research more generally.

Comparing and Combining the Results of Several Studies
Studies of a certain exposure-disease association may, for a variety of practical reasons, be lacking in power, and they may be subject to biases that differ according to the population under study, the type of study design, and the rigor of the investigation. It follows that tests of agreement among the results of various studies, and the formal combining of results from pertinent studies, can play an important role in an overall exposure-disease association assessment.
Under ideal conditions, each of the types of studies described above can yield a valid estimate β̂ of the logarithm of the relative risk associated with a specified exposure history, as well as an estimate σ̂² of its variance.

Environmental Health Perspectives Supplements, Volume 101, Supplement 4, December 1993

The logarithm is used here because its estimate is likely to adhere more closely to a normal distribution (with mean β) than the estimate of the relative risk itself. Suppose m independent studies yield (scalar) log-relative-risk estimates β̂1,...,β̂m with corresponding variance estimates σ̂1²,...,σ̂m². Then

β̂ = Σi σ̂i⁻²β̂i / Σi σ̂i⁻² [20]

estimates a weighted mean of the βi's, which reduces to a common β if all the βi's are identical. To obtain the most stable estimate of this common mean, one can follow developments arising from Cochran's (54) introduction of partial weighting, thereby avoiding unstable weights σ̂i⁻². Under the hypothesis of a common β, the statistic

Σi σ̂i⁻²(β̂i − β̂)² [21]

will have an approximate chi-square distribution with m − 1 degrees of freedom, thereby giving a simple test of "all βi = β" (assuming each β̂i is distributed normally). If the βi's are not identical, then a t-procedure can be used to set confidence limits for the weighted mean

β̄ = Σi σ̂i⁻²βi / Σi σ̂i⁻². [22]

Confidence limits on β̄ are approximately

β̂ ± tν(Σi σ̂i⁻²)^(−1/2), [23]

where tν is a critical value of t on ν (somewhat less than m) degrees of freedom. These limits are often conservative, particularly when the β̂i follow longer-tailed distributions.
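Equations [20] and [21] can be sketched directly. The study estimates and variances below are hypothetical, chosen only to show the mechanics: inverse-variance weighting of the log-relative-risk estimates, the pooled standard error, and the chi-square heterogeneity statistic on m − 1 degrees of freedom.

```python
import numpy as np

# Hypothetical log-relative-risk estimates and variances from m = 4 studies
beta_i = np.array([0.40, 0.25, 0.55, 0.10])
var_i  = np.array([0.04, 0.02, 0.09, 0.03])

w = 1.0 / var_i                                      # inverse-variance weights
beta_pooled = np.sum(w * beta_i) / np.sum(w)         # weighted mean, eq. [20]
se_pooled = np.sqrt(1.0 / np.sum(w))                 # pooled standard error

# Heterogeneity statistic, approximately chi-square on m - 1 df under
# the hypothesis that all beta_i are equal (eq. [21])
Q = np.sum(w * (beta_i - beta_pooled) ** 2)

print(beta_pooled, se_pooled, Q)
```

Here Q would be compared to a chi-square critical value on m − 1 = 3 degrees of freedom; with these illustrative inputs it falls well short of conventional significance thresholds.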
There are various reasons why the chi-square test described may provide evidence of heterogeneity of the relative-risk estimates from the m studies. For example, studies of the same type (e.g., m cohort studies) may have controlled differently for confounding or may have defined and measured exposure differently. Studies of different types (e.g., cohort, case-control, and aggregate studies) have different sources of potential bias, for example, recall bias in case-control studies and aggregation bias in ecologic studies. Hence, it may be useful first to contrast and combine studies of the same type and then to examine whether the summary estimates of β from each study type are heterogeneous. With respect to studies of the same type, the overview, or meta-analysis, may be strengthened by analyzing the raw data from each study in a uniform format, which would maximize their comparability in terms of confounding control and exposure modeling. A fundamental principle of such analyses is that the parameter estimate β̂ is based only on the combination of within-study information, as is the case for the heterogeneity test and the log-relative-risk estimate described above.
Measurement error in exposure and covariate assessment may be a particularly important source of heterogeneity among relative-risk estimates. For example, random measurement error may severely attenuate or otherwise distort relative-risk estimates in a cohort or case-control study if exposure assessment is based on data provided by individual interviews (e.g., location and shielding information in the Hiroshima and Nagasaki cohorts), but such attenuation may not be an issue in an aggregate data study if the desired averages (see "Exposure-Response Estimation in Aggregate Data (Ecologic) Studies") can be estimated precisely. In this circumstance, some effort to deattenuate the analytic study relative-risk estimates, or to attenuate equally the aggregate data relative-risk estimates, is essential prior to the comparison of these estimates. See Prentice and Sheppard (55) for a recent attempt to study the consistency of international disease rate, time-trend, case-control, and cohort studies in the dietary carcinogenesis area. Note also that β̂ will be biased as an estimator of β if the available log-relative-risk estimates β̂1,...,β̂m are a biased sample of estimates from existing studies, which may arise if there is so-called publication bias, in which relative-risk estimates that are significantly different from unity are more likely to be reported in the scientific literature. See Yusuf et al. (56) for a discussion of some issues in the conduct of such meta-analyses.

Other Data Analysis Topics
The above presentation emphasized time-to-disease endpoints and corresponding relative-risk and odds-ratio models. In some areas of environmental epidemiologic research (e.g., respiratory epidemiology or neuroepidemiology), important endpoints are continuous. Much of the corresponding data analysis methodology is well established and does not need to be discussed here. However, methods for handling measurement error with continuous data (57) also require much additional development. Recent advances in methods for the analysis of longitudinal data (50), whether discrete or continuous, also are quite relevant to the analysis of certain types of environmental epidemiologic data.
Preceding sections also have not addressed the simultaneous analysis of two or more endpoints. For example, in respiratory epidemiology there may be several measures of lung function, and a data analysis goal may be to summarize exposure effects over several correlated measures of change in lung function. The estimating equation approaches mentioned above (50,51) provide an approach to such problems with discrete or continuous outcomes, but work could be done to compare these methods to univariate methods based on some summary endpoint. Methods for the analysis of correlated failure-time data currently are not well established, though much statistical research is underway. See, for example, Clayton and Cuzick (58), Wei et al. (59), and Prentice and Cai (60) for recent contributions. Correlated failure-time methods also are required for the investigation of genetic factors or gene-environment interactions under certain types of study designs. For example, in a pedigree cohort study, it typically will be essential to allow for dependence between the disease occurrence times of family members when studying environmental exposure effects in relation to genetic indicators of susceptibility.
Morgenstern and Thomas, in this volume, mention certain designs other than those discussed thus far in this article, as well as the use of biomarker endpoints. Corresponding data analysis issues and methods will be mentioned only briefly here.
It was noted that experimental designs are occasionally practical in environmental epidemiologic research. The relative-risk and odds-ratio regression methods described above apply equally well for the comparison of disease incidence (or mortality) rates between randomization groups in individually randomized designs. However, a group-randomized design (e.g., with community as the unit of randomization) is more likely to be feasible, in which case it is essential to acknowledge the possibility of correlation among the responses (e.g., disease incidence times) of subjects in the same randomization group, which requires the use of the type of correlated failure-time methods mentioned above.
In the discussion of ecologic designs it was noted that descriptive studies of the clustering of disease (e.g., in space or time) can play a useful role in the generation of environmental health hypotheses. These types of studies also have specialized data-analytic issues and methods. Statistical analysis has little to offer in the event of an isolated cluster discovered by ad hoc methods. Clusters within which the disease counts substantially exceed expectation may be addressed by direct fieldwork to identify a putative cause. On the other hand, hypotheses of a general tendency to cluster can be addressed statistically by using methods that compare the number of cases in certain neighborhoods of each case to the expected number of cases, while also taking account of population density. Local neighborhood tests also are available with case-control sampling. See Rothman (61) and other papers in this volume for discussions of disease-clustering methods. The design chapter (in this issue) also emphasizes cross-sectional studies for the estimation of prevalence rates. The logistic regression methods outlined in "Relative-Risk and Odds-Ratio Estimation" may be used to relate prevalence probabilities to retrospectively obtained exposure and confounding factor histories. Of course, such prevalence probabilities reflect aspects of both disease incidence and disease duration and, therefore, may be difficult to interpret. Keiding (62) provides a comprehensive discussion of the relationships among prevalence probabilities, incidence rates, and disease durations, and of the possibility of deriving estimates of age-specific incidence from cross-sectional studies.
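A minimal version of a case-control neighborhood test of clustering can be sketched as follows. The statistic and the simulated data are this illustration's own choices (in the spirit of, but not identical to, the neighborhood tests cited above): count the cases whose nearest other subject is also a case, and compare the observed count to a permutation null obtained by relabeling case-control status at random.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 40 cases and 160 controls at random planar locations
xy = rng.random((200, 2))
status = np.array([1] * 40 + [0] * 160)

# Nearest neighbor of each subject (Euclidean distance, self excluded)
d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
np.fill_diagonal(d, np.inf)
nn = d.argmin(axis=1)

def stat(lab):
    # Number of cases whose nearest neighbor is also a case
    return int(np.sum(lab & lab[nn]))

t_obs = stat(status)
null = [stat(rng.permutation(status)) for _ in range(999)]
p = (1 + sum(t >= t_obs for t in null)) / 1000.0
print(t_obs, p)       # no clustering was built into these data
```

A full treatment would weight by local population density, as the text notes; the permutation step here conditions on the observed locations, which serves a similar purpose in the case-control setting.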
As discussed previously, biomarkers may serve usefully as exposure indicators or as early indicators of disease (see Hatch and Thomas, this volume). An example of a biomarker as an intermediate endpoint is seen in chromosomal abnormalities in the radiation-exposed cohorts of Hiroshima and Nagasaki. The rates of such abnormalities among long-lived lymphocytes (usually 100 cells examined for each subject) have played a useful role in assessing the health effects of radiation exposure in these populations.
The correlation among chromosomal events in cells from the same study subject has a strong influence on dose-response analyses in this application (35,63). Recent advances in the ability to study the cellular and molecular mechanisms involved in exposure response and disease pathogenesis will inevitably lead to greater use of biomarkers and biological measurements in environmental epidemiologic studies. Hence, data analysis methods that incorporate such measurements in a biologically meaningful fashion are required. Suitable methods for dose-response analysis with biomarker endpoints will vary according to the type of endpoint(s) involved. Recent estimating equation approaches (50,51) often may be useful for such analyses. The circumstances under which a biomarker endpoint can substitute for disease occurrence and yield valid dose-response tests and estimates are also of considerable interest. See Prentice (64) for the introduction and discussion of such criteria.
Finally, it seems worth noting that the interpretation of relative-risk estimates from a study may depend on prior knowledge and on study goals. For example, if such estimation takes place in the context of a study specifically designed to confirm a particular association, the corresponding tests and confidence intervals are more appropriately taken at face value than if the relative risk is estimated in a purely exploratory context wherein various other exposures also are examined in relation to disease risk. In this latter situation, formal methods may be used to acknowledge the multiple hypotheses being examined, but precise statistical methods for doing so in a general way are not available. (So-called Bonferroni methods are available widely and may be precise enough.) Also, one is often neither in a purely exploratory nor a purely confirmatory mode in data analysis.

Summary Recommendations
Perhaps the single most important data analysis research need in environmental epidemiology concerns the development of improved methods to accommodate measurement errors in exposure assessment. Efforts aimed at the design and use of validation studies would be particularly useful, as would studies to document the scope and magnitude of measurement error influences.
A second important need concerns improved methods for the conduct and analysis of aggregate data (ecologic) studies. The development of strategies for controlling potential confounding, particularly by using individual surveys in multigroup studies, along with corresponding innovative data analysis methods, will be important. Empirical studies that illustrate various analytic and aggregate data analyses of real data sets also would be valuable.
Other pertinent topics for data analysis research include the development of improved methods for meta-analyses when studies of different types, with differing potential for measurement error biases, are available; the development of flexible data analysis methods; and the study of the properties of analyses based on biomarker indicators of exposure or biomarker endpoints. Studies that evaluate and compare strategies for the control of confounding also merit continuing attention in environmental epidemiology, as in other observational research areas. Further work on biologically based mathematical models for cancer and for other diseases also would be well motivated.