Assessing, accommodating, and interpreting the influences of heterogeneity.

Heterogeneity, ranging from measurement error to variation among individuals or regions, influences all levels of data collected for risk assessment. In its role as a nemesis, heterogeneity can reduce the precision of estimates, change the shape of a population model, or reduce the generalizability of study results. In many contexts, however, heterogeneity is the primary object of inference. Indeed, some degree of heterogeneity in excess of a baseline amount associated with a statistical model is necessary in order to identify important determinants of response. This report outlines the causes and influences of heterogeneity, develops statistical methods used to estimate and account for it, discusses interpretations of heterogeneity, and shows how it should influence study design. Examples from dose-response modeling, identification of sensitive individuals, assessment of small area variations, and meta-analysis provide applied contexts.


Introduction
Without heterogeneity, or variation, there would be no potential for gaining scientific information. Yet, unappreciated heterogeneity can degrade or distort scientific interpretations. Therefore, though heterogeneity may play a negative role in some situations, it is our lifeblood in others. This report discusses issues, techniques, and examples related to assessing, accommodating, interpreting, and controlling the effects of heterogeneity. Though no abstract definition of heterogeneity is satisfactory for all settings, we shall propose operational definitions that are relevant to specific scientific and practical questions. We do not go into great detail on any specific issue or example; rather, we provide an overview and a key into a wide literature.
Heterogeneity can be defined in many ways, with the most inclusive being variation in general. More restrictive definitions help structure our approach. Heterogeneity as variation in excess of a baseline model provides an important framework. For example, Knuiman (4) shows that intra- and interlaboratory variation in the number of revertants in the Ames test far exceeds that predicted from the Poisson distribution, leaving a considerable amount of unexplained variation. Interpretations of the Ames test should be based on the true variability, so identifying the excess has direct value. Equally important is an attempt to explain excess variation. Much of this may be inexplicable, but some may be accounted for by covariates such as variation in the growth medium and incubation temperature. Explaining some of the excess allows improved experimental design and, with a validated model that relates covariates to outcomes, gives the ability to adjust results and to put outcomes on a common basis. This adjustment can reduce variability and bias.
The problem of defining and identifying sensitive individuals provides another instructive example. As Bailar and Louis (5) discuss for lung responsiveness to challenge, smokers as a group may be considered sensitive relative to nonsmokers, yet some smokers (even in the same smoking-rate category) are more sensitive than others. This example raises the question of scoping the assessment of heterogeneity: one needs to group study units that can be considered homogeneous (that follow a baseline model). This decision is made using a combination of scientific and practical considerations. A third class of examples, called measurement error models, shows the importance of identifying and accounting for heterogeneity. Consider the effect of diastolic blood pressure on risk of stroke. Blood pressure measurements are made with error (6,7), and it is well known that estimated regression slopes are attenuated relative to the relation between true blood pressure and risk of stroke. Of course, the reported regression slope is correct for the question of how stroke risk relates to measured blood pressure. But this slope does underestimate the true influence of blood pressure on stroke. MacMahon et al. (8) show, for example, that the reported slope should be increased by approximately 60% to estimate the underlying relation. This increase has profound policy and health implications.
Identifying this attenuation of slope is important when combining evidence over a variety of studies through a meta-analysis or overview (9). If each study uses a different measurement system (for example, taking the average of a different number of blood pressure readings), reported slopes will need adjustment before combining. Failure to do so can result in observed heterogeneity of slopes in excess of that predicted by sampling variability (the baseline model). In this case, the unexplained variation can be explained and accommodated by an errors-in-variables model.
One more class of examples serves to introduce another feature of unappreciated heterogeneity. Vaupel and Yashin (10) and Yashin (11) discuss "heterogeneity's ruses," where the shape of the hazard for death in a population is different from that for any individual. For example, each individual may have a constant hazard (exponential distribution), but the population curve will show a decreasing hazard. Relatively speaking, the frail die out early, leaving the hardy (low hazard) to live. Again, at the population level, the decreasing hazard is appropriate, but if one wishes to study aging or policy impacts, an understanding of the ruses, and an attempt using data and theory to uncover them, is vital. These ruses are just another example of Simpson's paradox.
The policy issue in survival models is easily seen in studies of the effects of smoking cessation on the risk of lung cancer. A multistage model predicts that the excess risk continues to increase even after an individual stops smoking (though the increase is less than for an individual who continues to smoke), whereas many data sets show that the excess decreases. The decrease may be true, and surely is for the population, yet each individual's excess risk may indeed increase, but heterogeneously.
The examples discussed so far suggest several definitions of heterogeneity, including general variability, variability in excess of a baseline model, variance components, measurement error, and variation in latent parameters. To these definitions we can add heterogeneity induced by variation in data-analytic approaches. In the end, heterogeneity is a vague concept and is best appreciated through a series of examples.

Two-Stage Models

Basic
The two-stage model provides a convenient way to represent variation in excess of a baseline model. In this model we have stage II: a parameter (vector) is sampled from a distribution; and stage I: data are generated from a sampling distribution, conditional on the parameter. This two-stage sampling process can be repeated for each experimental unit (e.g., clinics, petri dishes, small geographic areas), and it underlies Bayes and empirical Bayes analysis. For concreteness, consider a stage II governed by a prior distribution (G) that is Gaussian and a stage I that is Gaussian with a known variance (12,13). Let k denote the experimental unit, k = 1, . . ., K; the basic model becomes: θ₁, θ₂, . . ., θ_K are iid N(μ, τ²), and Y_k | θ_k ~ N(θ_k, σ²). This model can be extended in a wide variety of ways, including having repeated sampling of Y values for each θ, allowing distributions different from the Gaussian, using unequal sampling variances, and introducing a regression structure for the Y so that the θ values come from different, but related, distributions. Continuing with this basic model, we have: Y₁, . . ., Y_K are iid N(μ, σ² + τ²), and the Y_k values are overdispersed relative to the sampling distribution with variance σ², but the excess may be explicable.
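A minimal simulation sketch of this two-stage Gaussian model (all numerical values hypothetical) shows that the marginal variance of the Y_k is σ² + τ²:

import numpy as np

rng = np.random.default_rng(0)
K, mu, tau2, sigma2 = 5000, 0.0, 4.0, 6.0                # hypothetical values

theta = rng.normal(mu, np.sqrt(tau2), size=K)            # stage II: unit-level parameters
y = rng.normal(theta, np.sqrt(sigma2))                   # stage I: one observation per unit

print(np.var(y, ddof=1))                                 # near sigma2 + tau2 = 10, i.e.,
                                                         # overdispersed relative to sigma2 alone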
This model is the Model II or random-effects ANOVA, and tests of the null hypothesis ask whether τ² = 0. If τ² > 0, then we have variation not accounted for by the baseline model and can look for explanations through covariance adjustment. If useful covariates are present, excess (unexplained) variation will be reduced by the adjustment. Notice that the notion of excess variation depends on the stage I model (the sampling distribution). If we find in a data set that the sample variance Σ(Y_k − Ȳ)²/(K − 1) = 10, and we know (assume) that σ² = 6, then a natural estimate for τ² is 4. If, however, we assume σ² = 9, then the estimate of τ² is 1. In practice, we need either direct information on sampling variation (through replication within units) or reliable assumptions (e.g., use of arcsine-transformed binomial data) to identify τ².
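A small sketch of the corresponding moment estimate of τ² (the function name is illustrative):

import numpy as np

def tau2_moment_estimate(y, sigma2):
    # Moment estimate of the stage II variance: sample variance of the
    # unit-level values minus the assumed sampling variance, truncated at zero.
    return max(np.var(y, ddof=1) - sigma2, 0.0)

# With a sample variance near 10, assuming sigma^2 = 6 gives tau^2-hat near 4,
# while sigma^2 = 9 gives a value near 1, as in the text.
rng = np.random.default_rng(1)
y = rng.normal(0.0, np.sqrt(10.0), size=2000)
print(tau2_moment_estimate(y, 6.0), tau2_moment_estimate(y, 9.0))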
Frequently, variability among the Y values is greater than that predicted by their sample mean, and a two-stage model is suggested (1-3). For example, assuming that the Y_k are conditionally Poisson with means θ_k and that the θ values are gamma with mean μ and squared coefficient of variation α², we find that the Y_k values are negative binomial with mean μ and variance μ(1 + α²μ), allowing for overdispersion. Even if we do not accept the two-stage hierarchy, the negative binomial allows more flexible modeling than does the Poisson.
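A brief sketch of the Gamma-Poisson compounding, with hypothetical values, showing the negative binomial overdispersion:

import numpy as np

rng = np.random.default_rng(2)
mu, cv2, K = 3.0, 0.5, 200_000              # hypothetical mean and squared coefficient of variation
shape, scale = 1.0 / cv2, mu * cv2          # gamma with mean mu and CV^2 = cv2

theta = rng.gamma(shape, scale, size=K)     # stage II: unit-specific Poisson means
y = rng.poisson(theta)                      # stage I: counts given the means

# Marginally the counts are negative binomial: variance mu + cv2*mu^2 > mu.
print(y.mean(), y.var(ddof=1), mu + cv2 * mu**2)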

Consequences
Going back to the Gaussian example, let us consider estimation of the mean (μ) and estimation of the individual θ values. For the former, allow n replications for each θ. The optimal estimate of μ is Ȳ, the mean of all observations. The variance of Ȳ consists of two components, one induced by the sampling distribution and one induced by the prior: with Ȳ = (1/K)Σ_k Ȳ_k, V(Ȳ) = σ²/(Kn) + τ²/K. Notice that increasing either K or n will reduce the first term, but the second term is controlled by K, the number of θ values that we sampled (primary units). In fact, if n = ∞, so that we have perfect knowledge of each θ_k, we still have uncertainty in estimating μ.
A fixed-effects analysis, where we wish to estimate θ̄, the mean of the θ values generating these data, assumes no connection among them. The random-effects analysis strives to make an inference about μ, which is broader than one restricted to the current experimental units, and this broader inference comes with increased variability. To get an idea of the increase, consider the ratio of the variability for the random-effects (RE) versus fixed-effects (FE) analysis. We have: V_RE/V_FE = 1 + (n − 1)ρ, where ρ = τ²/(σ² + τ²) is the intraclass correlation. If n = 1 or ρ = 0 (τ² = 0), there is no variance inflation; otherwise it can be considerable. This ratio is called the design effect. In the extreme case with ρ = 1, n observations on an experimental unit are no more informative (variance reducing) than one observation, and V_RE is n times V_FE.
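A small sketch of the design effect computation (the function name is illustrative):

def design_effect(n, rho):
    # Variance inflation of the random-effects (broad-inference) analysis
    # relative to the fixed-effects analysis: 1 + (n - 1) * rho, where
    # rho = tau^2 / (sigma^2 + tau^2) is the intraclass correlation.
    return 1.0 + (n - 1) * rho

# No inflation when n = 1 or rho = 0; with rho = 1, n observations on a
# unit are worth no more than one, and the ratio equals n.
for n, rho in [(1, 0.5), (5, 0.0), (5, 0.5), (5, 1.0)]:
    print(n, rho, design_effect(n, rho))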
The design effect and generalizations thereof can be used to determine how many repeat measurements on a unit (n) and how many units (K) are needed to produce a desired accuracy. Setting n = 1 minimizes the total number of observations, but generally repeat observations on a unit are more relevant or less expensive than adding units, so an n > 1 usually produces the optimal solution under resource constraints.
The decision to use fixed or random effects depends on inferential goals, and differences in goals underlie the current controversy over when to use meta-analysis (16). The random-effects approach produces considerably larger standard errors on an estimated treatment effect than does the fixed-effects analysis. Peto and colleagues argue in favor of inferences limited to the meta-analyzed studies and use fixed effects (17). Others desire a broadened inference to the population of similar studies, and this promotes random effects (18). The narrow inference has a well-defined reference population but does not easily generalize.
The broad inference is possibly more relevant, but the target population is somewhat vague.
Irrespective of these scoping issues, the stage II variance component can be of independent interest. Consider the meta-analysis of the effect of coaching on Scholastic Aptitude Test (SAT) scores conducted by Laird and DerSimonian (19). Table 1 shows that estimated effects are greater for uncontrolled than for controlled studies (likely because much of the coaching effect is really regression to the mean), reports the standard error of the estimate (the SE depends both on the number of studies and on the sample size for each study), and shows that the stage II variability is greater for uncontrolled studies. We can see that unexplained variation is greater for the uncontrolled studies (likely because they are performed under a wide variety of conditions), and under the Gaussian assumption we can get an idea of the variation in true coaching effect. The true coaching effect for uncontrolled studies can be expected to vary according to a N(41, 25²) distribution, while for matched/randomized studies it follows something close to a N(10, 3²). This information is of policy interest. For example, it is virtually impossible for there to be a negative coaching effect in the randomized approach applied to populations similar to those used in the current studies.

[Table 1. The effect of coaching on SAT scores (19).]
Tests for heterogeneity in a Gaussian framework ask whether τ² = 0. In general, these tests are less informative than reporting an estimate, from which one can determine whether the excess variation is a threat to interpretation or generalization. Consider, for example, the meta-analysis conducted by Beaumont and Breslow (20) on the cancer risk from vinyl chloride. Table 2 shows their summary, indicating statistically significant relative risks for liver and brain, but not for lung tumors. For liver, the test for heterogeneity conclusively shows that the studies are not estimating a common relative risk (there is heterogeneity). An estimated variance component would allow one to see whether variation in relative risks is sufficient to produce values below 1 (a qualitative interaction) or simply variation all to the right of 1 (a quantitative interaction). The former causes problems in interpretation; the latter does not. Both set the stage for finding explanations for the excess variation.

Example
An early two-stage analysis of water contamination was published by von Mises (21). At each of 3420 sampling sites, 5 samples were taken and the number of contaminated samples recorded. Table 3 gives the data and expected frequencies (rounded) under the binomial distribution, computed with the estimated contamination probability of 0.025. For this event probability, the binomial assumption predicts far too few occurrences of 2 to 5 contaminated samples, and a two-stage variance components model can capture the excess variation among geographic areas. The sample variance of the observed contamination counts is 0.1885, which is greater than the 0.1219 predicted from the binomial [0.1219 = 5(0.025)(0.975)], suggesting overdispersion.
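A short sketch reproducing the overdispersion check, together with a moment-based estimate of the beta-binomial dispersion parameter M introduced in the next paragraph (the reported variances are from the text; the calculation is illustrative):

n_samples, p_hat = 5, 0.025                     # five samples per site, estimated contamination probability
binom_var = n_samples * p_hat * (1 - p_hat)     # 0.1219, the binomial prediction
obs_var = 0.1885                                # sample variance reported in the text

print(obs_var / binom_var)                      # > 1: overdispersion across sites

# Beta-binomial moments: Var = n*p*(1-p)*[1 + (n-1)*rho], with rho = 1/(M+1)
rho = (obs_var / binom_var - 1) / (n_samples - 1)
print(rho, 1 / rho - 1)                         # M-hat near 6, close to the 6.2 quoted below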
The beta distribution is a common model for stage II, and we parameterize it by the mean μ and the intraclass correlation (M + 1)⁻¹. Specifically, with θ the binomial parameter, θ has a beta distribution with mean μ and dispersion parameter M, so that the correlation between samples from the same site is (M + 1)⁻¹. From the von Mises data we obtain estimates μ = 0.025 and M = 6.2. The 6.2 indicates high a priori variation, whereas M → ∞ would indicate no prior variation. We can go further with this example and free ourselves from assuming a specific parametric shape for the prior by using a nonparametric estimate. Laird (22) introduced the algorithm, and we obtain a discrete prior with masses (0.606, 0.261, 0.092, 0.022, 0.018) at mass points (0.012, 0.021, 0.029, 0.052, 0.408). This nonparametric approach allows for flexible modeling, and when its properties are better developed, it should become a standard approach to variance component problems.

[Table 3 (partial). Contaminated samples per site, observed vs. binomial-expected frequencies: 0: 3086 vs. 3017; 1: 279 vs. 383; 2: 32 vs. 19; 3: 15 vs. 0.]

This treatment-by-clinic random interaction effect in the untransformed scale is induced by clinics having different baseline responses (the a_k vary) and by the model being misspecified. Random-effects models in the untransformed scale can pick up the interaction induced by misspecification and serve to make the standard analysis more robust. Notice that interaction induced by these transformation models is never qualitative (all the b_k are of the same sign), so combining evidence is scientifically valid. The random-effects model delivers an estimated variance for the treatment effect that accounts for the heterogeneity.

Estimating Individual Components
Now let us consider estimating the underlying mean θ_k for an individual unit. The empirical Bayes approach shrinks the unit's observed value toward the overall mean by an amount that depends on the relative sizes of σ² and τ², borrowing strength across units and reducing the mean squared error of the ensemble of estimates. This improvement in estimates carries over to improved confidence intervals in that they attain the nominal coverage but are of shorter length than the classical intervals (13,34). This empirical Bayes advantage holds even when the intervals are broadened to account for uncertainty in estimating the prior distribution. With the development of calibrated confidence intervals, the empirical Bayes approach produces inferences ideally suited to risk assessment investigations when data from related sources are available.
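A minimal sketch of the empirical Bayes (shrinkage) calculation under the Gaussian two-stage model, with moment estimates of the prior; the data and variance values are hypothetical:

import numpy as np

def eb_estimates(y, sigma2):
    # Shrink each unit's value toward the overall mean; the weight on the
    # unit's own data is tau^2 / (tau^2 + sigma^2), estimated by moments.
    mu_hat = y.mean()
    tau2_hat = max(np.var(y, ddof=1) - sigma2, 0.0)
    shrink = tau2_hat / (tau2_hat + sigma2)
    return mu_hat + shrink * (y - mu_hat)

rng = np.random.default_rng(3)
theta = rng.normal(0.0, 2.0, size=200)               # true unit means (tau^2 = 4)
y = rng.normal(theta, 3.0)                           # observed values (sigma^2 = 9)
theta_hat = eb_estimates(y, sigma2=9.0)

# The shrinkage estimates typically have smaller total squared error than the raw y.
print(np.sum((y - theta)**2), np.sum((theta_hat - theta)**2))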

Consequences of Heterogeneity

Bias
We have seen that accounting for heterogeneity can adjust standard errors, provide a valid basis for generalization, and improve estimation of individual parameters. Ignoring the heterogeneity can have dire consequences. Cox (15) considered the Gamma-Poisson example. If we wish to estimate the mean event rate from several units, the mean Ȳ is fully efficient, but we do need to account for heterogeneity to get a proper standard error. However, consider estimating a nonlinear function of the mean μ, for example e^μ. The estimate e^Ȳ will not have expectation e^μ even for large sample sizes; naive plug-in estimates of such nonlinear functionals can be inconsistent, and a consistent estimate requires accounting for the heterogeneity.
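As one hedged illustration of the Gamma-Poisson point, take the nonlinear target to be the marginal probability of zero events (a choice made here for concreteness); the plug-in estimate that ignores heterogeneity settles on the wrong value even with many units:

import numpy as np

rng = np.random.default_rng(4)
mu, cv2, K = 2.0, 0.5, 500_000
theta = rng.gamma(1.0 / cv2, mu * cv2, size=K)   # heterogeneous unit-level rates
y = rng.poisson(theta)

# Target: the marginal probability of zero events, E[exp(-theta)].
naive = np.exp(-y.mean())                        # plug-in that ignores heterogeneity
empirical = np.mean(y == 0)                      # consistent under the compound model
print(naive, empirical)                          # the naive value understates the zero probability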

Correlation
Discovery and accounting for heterogeneity are key ingredients in the analysis of longitudinal data (35,36). At the most basic level, consider the correlation between lung function measurements (the forced expiratory volume) in adults at 3-year intervals. With no covariates the correlation is about 0.90, but adjusting for age, height, and gender brings the correlation down to about 0.80. Some of what used to be unexplained (co)variation has been explained. It is important to note that this covariation is measured using residuals from the model producing expectations, not from the raw data.
Unexplained (co)variation produces a variety of phenomena in longitudinal data. Consider relating adjacent residuals by plotting e_{t+1} versus e_t, where the e values are residuals from a model. A plot with slope 1 represents tracking, where an individual's deviation from the population prediction tends to stay put. A slope less than 1 indicates regression to the mean, where the subsequent deviation tends to be less extreme than the current value. A slope greater than 1 indicates the horse race, where residuals tend to increase in absolute value. The name derives from the idea that horses in the lead tend to be running fastest and so increase their lead. Of course, a plot showing random scatter with slope 0 indicates no association.
As one includes additional, effective covariates in the model, these residual plots change, sometimes with changed slope, usually with less extreme relations. Again, as the model explains additional (co)variation, we change the structure. For example, one explanation for the horse race comes from assuming that covariates not yet in the model (e.g., smoking) influence the rate of change in lung function. Without the covariate in the model, smokers have the fastest decline and the lowest lung function. Their lead increases over time. Once the smoking covariate is included, residuals compare smokers to smokers, and residuals have a chance to be both positive and negative. Though there still may be a horse race phenomenon, it will be less pronounced and could be reduced further by including additional predictors of decline. When to stop adding predictors depends on the available information, subject-area knowledge, and statistical art and science.
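A small sketch of the residual lag-plot slope; the function and the simulated "subject effect" are illustrative:

import numpy as np

rng = np.random.default_rng(5)

def lag_slope(residuals):
    # Least-squares slope of e_{t+1} on e_t, pooled over subjects.
    # Near 1: tracking; below 1: regression to the mean; above 1: the horse race.
    e_now, e_next = residuals[:, :-1].ravel(), residuals[:, 1:].ravel()
    return np.cov(e_next, e_now)[0, 1] / np.var(e_now, ddof=1)

# Hypothetical residuals: a persistent subject effect plus noise induces a
# positive lag slope (partial tracking).
subject_effect = rng.normal(0, 1, size=(300, 1))
residuals = subject_effect + rng.normal(0, 1, size=(300, 6))
print(lag_slope(residuals))   # roughly 0.5 here; a covariate explaining the
                              # subject effect would push the slope toward 0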

Errors in Variables
Unappreciated heterogeneity can influence scientific and policy conclusions by disguising true parameter values and relational shapes. Berkson provided the classic example, where we are relating variables through a linear model. For specificity, consider the model: toxicity = a + b × dose, where we take measurements on dose and toxicity. Consider a large sample, so that statistical variation in parameter estimates is not the issue. If dose is measured without error, then b relates true dose to toxicity; but if measured dose is a random deviation from true dose, then the estimated b will be closer to zero (attenuated) relative to the slope appropriate for true dose. Yet it is the appropriate value for relating observed dose to toxicity.
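A brief simulation sketch of the attenuation, with hypothetical values: classical measurement error on dose shrinks the fitted slope by the reliability ratio Var(dose)/[Var(dose) + Var(error)]:

import numpy as np

rng = np.random.default_rng(6)
N, a, b = 100_000, 1.0, 2.0                              # hypothetical intercept and true slope

dose_true = rng.normal(10.0, 2.0, N)
toxicity = a + b * dose_true + rng.normal(0.0, 1.0, N)
dose_measured = dose_true + rng.normal(0.0, 2.0, N)      # classical measurement error

slope_true = np.polyfit(dose_true, toxicity, 1)[0]       # about 2.0
slope_obs = np.polyfit(dose_measured, toxicity, 1)[0]    # attenuated: about 2.0 * 4/(4+4) = 1.0
print(slope_true, slope_obs)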
Since all we ever deal with are observed values, why should we care about this apparent bias? We care for at least two reasons: (a) If different experiments are performed with different measurement accuracies, then failure to account for these differences will produce apparently different slopes. Any meta-analysis of experimental results will require adjustment to a common basis. (b) Possibly more importantly, even though the unadjusted slope is appropriate for relating measured dose to response, it is inappropriate when making policy recommendations. For example, many studies report on the positive relation between blood pressure and heart attack risk. Since blood pressure generally is measured with considerable variation, the effectiveness of blood pressure control is underestimated. Since policies are aimed at reducing true blood pressure, the adjusted slope is relevant.
Many authors have discussed the consequences of errors in variables and methods for reducing their effects through design and analysis (6,7,37-41). Design considerations include averaging repeated measures to reduce heterogeneity. Analyses that explicitly or implicitly de-attenuate regression slopes are effective in performing the necessary adjustments. In studies relating a risk factor to a response in the presence of a confounder, Kupper (6) shows that unreliability in measurement of the confounder can be more damaging than unreliability in the risk factor, sometimes producing a sign change between the estimated slope and the slope appropriate for true confounder-adjusted risk. Extreme care is needed in defining research questions and in designing and analyzing studies.
The effects of unaccommodated heterogeneity can be more dramatic than slope attenuation. Consider the linear model Y = a + bx + error, where x is the true regressor. Let the observed regressor X, conditional on the true regressor, be distributed as a log-normal variable with mean x. Then the regression using the observed regressor is of the form Y = a + bX^p + error, where p < 1, so a linear relation is converted to a nonlinear one. Bailar et al. (41) suggest that this type of phenomenon (operating as variation in true slope from rodent to rodent) may produce the apparent lack of conservatism for linear extrapolation of dose-response relations. The dose-response curve for vinyl chloride is a classic example.
When we start with a nonlinear model, such as a logistic regression or a multistage model for carcinogenicity, heterogeneity due to noisy regressors or due to variation in true slopes from unit to unit (the compound model) changes the shape of the relation. Techniques are available for estimating and adjusting (39,41), but more development of numerical methods is required.
With S(t) the survival curve, the hazard is μ(t) = −d[log S(t)]/dt, so that S(t) = exp[−U(t)] with U(t) = ∫₀ᵗ μ(u)du. Assume that each individual in a population has a hazard depending on a parameter θ that multiplies a baseline hazard (proportional hazards), and that θ varies from person to person according to a probability distribution G (again, the two-stage model). Then, for the population, the survival curve is S_G(t) = E_G{exp[−θH(t)]} and the hazard is μ_G(t) = E_G(θ | T > t)h(t), where h is the baseline hazard and H is its integral. It can be shown that E_G(θ | T > t) is decreasing in t, so that the shape of μ_G(t) is different from that of h(t).
Vaupel and Yashin (10) give several examples of this difference in shape. The easiest example has h(t) ≡ 1 and G putting mass on two points θ₁ < θ₂. Then, though each individual has a constant hazard, the population hazard decreases toward θ₁ as a function of t. Intuitively, those alive at a large t are not representative of the original population, but overrepresent those with the smaller θ (θ₁); individuals with θ₂ tend to die off early. As in errors-in-variables regression, μ_G(t) is the appropriate population curve, but it may be deceptive in terms of determining appropriate policies or their effects. For example, it may be true that it is good for every individual in a population to stop smoking (all age-specific hazards decrease), but that after some follow-up interval, some age-specific hazards are higher than they would have been without the intervention. Some of the frail individuals who would have died earlier if they had continued to smoke have had their deaths postponed to later ages. This is another manifestation of Simpson's paradox, aptly put as an intervention that is good for men and good for women but appears bad for people. Vaupel and Yashin (10) call this the apparent failure of success and remind us to consider heterogeneity carefully when designing and evaluating programs.
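A minimal numerical sketch of the two-point frailty example, with hypothetical hazards and mixing weight:

import numpy as np

theta1, theta2, w = 0.5, 3.0, 0.5             # hypothetical individual hazards and mixing weight

def population_hazard(t):
    # Each subgroup has constant hazard; the population hazard is the
    # mixture density over the mixture survival, i.e., E_G(theta | T > t) when h(t) = 1.
    s1, s2 = np.exp(-theta1 * t), np.exp(-theta2 * t)
    survival = w * s1 + (1 - w) * s2
    density = w * theta1 * s1 + (1 - w) * theta2 * s2
    return density / survival

for t in [0.0, 0.5, 1.0, 2.0, 5.0]:
    print(t, population_hazard(t))            # declines over time toward theta1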

Risk Assessment
The decision on using historical controls in the analysis of the carcinogen bioassay illuminates several issues related to heterogeneity. Consider a bioassay comparing lifetime tumor rates between an exposed and a control group of rodents, each with 50 rodents. Table 4 presents typical data when all tumors occur in the exposed group. The Fisher exact one-sided p-value is approximately (0.5)^X, so that if X is less than 5, the p-value will exceed the usual level for statistical significance. The p-values for X = 0, 1, 2, 3, 4, 5 are 1.000, 0.500, 0.247, 0.121, 0.059, and 0.028. However, in many situations pathologists will report that even an X = 3 is biologically significant, because in a long series of experiments virtually no tumors of the type being considered have been found in the control group.
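The quoted p-values can be reproduced directly; when all X tumors occur in the exposed group of a 50-versus-50 bioassay, the one-sided Fisher exact p-value is C(50, X)/C(100, X):

from math import comb

# One-sided Fisher exact p-value: probability that all X tumors fall in the
# exposed group, given the table margins; approximately (1/2)^X.
for x in range(6):
    print(x, round(comb(50, x) / comb(100, x), 3))
# Output: 1.0, 0.5, 0.247, 0.121, 0.059, 0.028 -- matching the text.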
This dissensus between the statistical procedure and scientific opinion can be explained and rectified by a two-stage model in which the control rate for the current experiment is considered to be sampled from a prior distribution. In the case we are considering, this prior puts almost all weight on a control rate of 0 and is equivalent to increasing the control sample size. Reference (49) and the references thereof explain approaches. Formalized use of the historical data will help resolve controversies such as those that surrounded the assay for DMT (dimethylterephthalate), where the data showed 2%, 16%, and 27% lifetime incidence of alveolar/bronchiolar adenomas and carcinomas in male mice for the control, low-dose, and high-dose groups (p < 0.0001), but previous control groups in the same laboratories had rates of 10%, 13%, and 18% (50). Refinements are needed to incorporate time-until-tumor and cause-of-death information, and the approach should be included as a formal method of focusing discussion of bioassay results. As evidenced by Freedman (51), the carcinogen bioassay generates uncertainty and controversy more broadly.

[Table 4. Hypothetical results from a carcinogen bioassay.]
Hierarchical models incorporating complicated variance component structures have been used to relate data from seemingly unrelated human studies and to formalize interspecies extrapolations of risk. DuMouchel and Harris (52) present an analysis of the health effects of environmental emissions, and Laird (53) investigates the thyroid cancer risk of ionizing radiation. Both of these analyses carefully lay out assumptions and study the sensitivity of results to changes in assumptions. The formal approach provides explicit documentation of methods and helps focus discussion of modifications. These models will always be augmented by expert opinion and political considerations when used as input to risk assessment and control, but they serve an extremely valuable role, going beyond more descriptive approaches such as those of Crouch and Wilson (54).

Scoping
When assessing or accommodating heterogeneity, the analysis frame is extremely important. One needs to decide on the basic unit of analysis (e.g., the individual, the publication, the small geographic area), the baseline model (e.g., logistic dose response, Poisson counts), the form of heterogeneity (e.g., gamma, Gaussian), and the type of units to be admitted to the analysis. We refer to delimiting the types of units analyzed as scoping; it concerns defining the types of units that can be expected to be related by the heterogeneity distribution. Too broad a scoping will produce estimated prior variation so large as to unduly inflate the standard error of parameter estimates and to reduce the advantage, in estimating unit-specific parameters, of combining evidence over units. Too narrow a scoping produces unstable estimates of the prior and constrains generalizations of the findings.
One approach to scoping includes units that are thought to have parameters sampled from a common prior distribution. A more general approach allows covariates to adjust the prior distribution. For example, in performing a small area analysis of disease incidence (29), the prior can be adjusted for age, gender, and risk factors by building a regression model for the prior mean. This approach allows for broadened inferences and is the model-based method for explaining unexplained variation. Use of historical controls in the bioassay provides a good example of the issues. The analyst needs to decide what controls to include. Should they be from the same laboratory or from several laboratories? Should controls from tests performed several years previously be included, or only recent controls? Should control rates from experiments read by the same pathologist be included? Each choice influences the effect of including historical controls and the operating characteristics of the test procedure. Amato and Lagakos (55) analyze the effects of disagreement among pathologists when characterizing the tumors in rodents used in a carcinogen bioassay. They show how dose-response curves vary from pathologist to pathologist even when the same slides are read. For bladder and liver tumors, dose-response curves from different pathologists essentially do not cross, implying that some individuals have a substantially higher rate of positives at all doses. This variation in call rate produces variation in the statistical power of the bioassay and limits its generalizability. The variation invites explanation, with likely explanations including variation in individual pathologists' perceived prevalence of tumor types and variation in the degree to which they incorporate the consequences of false positives and false negatives into their decision rule. If the causative factors can be isolated, pathology protocols can be modified to reduce this inter-rater variation.

Summary
We have seen that heterogeneity is the foundation of statistical science and that its identification and accommodation are centrally important to the design, conduct, analysis, and interpretation of statistical studies. Key issues include determining baseline models and defining analytic goals. If the goals are primarily to make inferences at the population level, then heterogeneity modeling is quite robust and generally serves to expand the flexibility of population models to represent expectations and variances. If, however, inferences are directed at the unit-specific level, considerable care is needed in specifying baseline models and forms for heterogeneity. Commonly, true replications at the unit level are unavailable, and models will be based on a combination of scientific reasonableness and statistical/mathematical convenience. Usually, the specific forms chosen are not uniquely best for the observed data, and careful interpretation coupled with sensitivity analysis is required.
Effective assessments, accommodations, and interpretations of heterogeneity require effective teamwork among statisticians and other scientists, tuning the approach to the application. The scientifically challenging and societally important problems in quantitative risk assessment provide fertile ground for developing and applying methods that will increase scientific understanding and improve the public health.