Statistical models for genetic susceptibility in toxicological and epidemiological investigations.

Models are presented for use in assessing genetic susceptibility to cancer (or other diseases) with animal or human data. Observations are assumed to be in the form of proportions, hence a binomial sampling distribution is considered. Generalized linear models are employed to model the response as a function of the genetic component; these include logistic and complementary log forms. Susceptibility is measured via odds ratios of response, relative to a background genetic group. Significance tests and confidence intervals for these odds ratios are based on maximum likelihood estimates of the regression parameters. Additional consideration is given to the problem of gene-environment interactions and to testing whether certain genetic identifiers/categories may be collapsed into a smaller set of categories. The collapsibility hypothesis provides an example of a mechanistic context wherein nonhierarchical models for the linear predictor can sometimes make sense.


Introduction
Recent technological advances in biomedical experimentation have greatly improved identification of genetic damage (1) and recognition of genetic factors that may affect disease susceptibilities in animal (2) and human (3) subjects. Molecular genetic techniques are now used to identify specific genotypes or genetic patterns in individuals affected by some disease or, for example, exhibiting cancer. For instance, the role of genetic factors in lung tumor onset and progression has been recently highlighted (4-7), as have genetic components in the development of human bladder tumors and other cancers (8-10). To study these effects, various biochemical, cytogenetic, and molecular probes are used (11-14), and epidemiologic research has moved to use these methods in studies of disease/cancer susceptibility (15,16). Of interest is whether individuals in various genetic categories display greater risk of cancer or disease than those identified in some background, control, or genetic "wild-type" category. Statistical models and methods for assessing these risks in both animal experiments and human population studies are described in the following sections.

Statistical Models: Generalized Linear Forms
Assume the existence of $T \geq 2$ genetic susceptibility groups or categories, identified without error via some form of biological/biomolecular probe, and indexed by $i = 0, \ldots, T-1$. For instance, the experimental study design might compare the effects of a set of $T$ different genotypes or polymorphisms on the cancer or disease under study. The group at $i = 0$ is considered the background group to which susceptibility comparisons are to be made. From each prospectively sampled group, a count, $Y_i$, of individuals exhibiting tumors is recorded. This is compared to the total number, $N_i$, of individuals in each group ($i = 0, \ldots, T-1$). For example, one might compare the proportion of mice developing tumors between two different inbred strains. The $Y_i$ are assumed to follow the binomial distribution (17) with (known) sample size parameter $N_i$ and probability parameter $p_i$.
As noted above, susceptibility in the $i$th genetic group is measured relative to the response in the background, $i = 0$ group. This is quantified via the odds ratios $\psi_i = \{p_i/(1-p_i)\}/\{p_0/(1-p_0)\}$, $i = 1, \ldots, T-1$. Under the logistic model of Equation 1, with genetic group effects $\alpha_i$ in the linear predictor and $\alpha_0 = 0$, these become $\psi_i = \exp(\alpha_i)$, so the no-susceptibility hypothesis is $H_0: \alpha_1 = \cdots = \alpha_{T-1} = 0$. To test $H_0$ against a global one-sided departure, $H_1: \alpha_i > 0, \forall i$ (i.e., that all odds ratios exceed one), one appeals to the large-sample normality of the ML estimate $\hat{\alpha}_i$. A recommended approach that simultaneously identifies individual departures from $H_0$ (i.e., which of the individual $\alpha_i$ are positive) is based on a modification of the well-known Bonferroni inequality (22). Begin with the individual Wald test (23) p-values $P_i = 1 - \Phi\{\hat{\alpha}_i/\mathrm{se}(\hat{\alpha}_i)\}$, where $\mathrm{se}(\hat{\alpha}_i)$ is the large-sample standard error of $\hat{\alpha}_i$, and $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. Order these values from smallest to largest; denote $P_{(i)}$ as the $i$th smallest ordered probability. Set the desired simultaneous confidence level to $1-\alpha$. Then, calculate the index, $A$, which is the largest $i$ such that $P_{(T-1-i+v)} > v\alpha/i$ (this inequality must hold for every value of $v = 1, \ldots, i$, at each $i$ under scrutiny). Conclude that $\alpha_i$ is significantly greater than 0 with simultaneous confidence $1-\alpha$ if $P_i < \alpha/A$. (If $A$ cannot be calculated, conclude that $\alpha_i$ is significantly greater than 0 at all levels of $i$.) Other aspects of this and related approaches to one-sided testing are described in detail elsewhere (24).
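The stepwise rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the threshold is taken to read $P_{(T-1-i+v)} > v\alpha/i$ (an assumption about the intended inequality), and the p-values in the usage lines are invented for demonstration.

```python
from statistics import NormalDist


def one_sided_wald_pvalues(estimates, std_errors):
    """P_i = 1 - Phi(estimate_i / se_i) for each non-background group."""
    phi = NormalDist().cdf
    return [1.0 - phi(a / s) for a, s in zip(estimates, std_errors)]


def modified_bonferroni(pvalues, alpha=0.05):
    """Return (A, rejections) for the stepwise rule described in the text.

    A is the largest i such that the (n - i + v)th smallest p-value exceeds
    v * alpha / i for every v = 1, ..., i, where n = len(pvalues) = T - 1.
    If no such i exists, A is None and every hypothesis is rejected.
    """
    n = len(pvalues)
    ordered = sorted(pvalues)
    candidates = [
        i
        for i in range(1, n + 1)
        # ordered[] is 0-indexed, hence the "- 1" on the 1-indexed rank.
        if all(ordered[n - i + v - 1] > v * alpha / i for v in range(1, i + 1))
    ]
    if not candidates:
        return None, [True] * n
    A = max(candidates)
    return A, [p < alpha / A for p in pvalues]


# Hypothetical p-values for T - 1 = 3 non-background genetic groups.
A, rejected = modified_bonferroni([0.01, 0.5, 0.03], alpha=0.05)
print(A, rejected)
```

Working with the sorted list keeps the rule close to its written form; the comparison against $\alpha/A$ is then applied to the p-values in their original group order.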
If a quantification via some intensity variable, say, $v_i$, exists for each genetic category/group, the logistic model may be enhanced by incorporating this quantitative information. With a single slope parameter, $\beta$, on $v_i$ in the linear predictor, susceptibility increases with $v_i$ if and only if $\beta > 0$. One-sided testing is again of interest, although it takes on a simpler formulation in the dose-response setting, since only one parameter, $\beta$, is assessed.

Example 1
To illustrate use of the logistic model in Equation 1, consider the lung tumor susceptibility data given by Ryan et al. (4). These authors considered susceptibility to the known murine carcinogen urethan by examining specific allelic forms of the K-ras-2 proto-oncogene in recombinant offspring from crosses of inbred strains of mice. The susceptibility allele is characterized by a shorter initial exon (length 0.55 kb) compared to the normal allele (length 0.70 kb). The mice under study were known to be either homozygous for the 0.70-kb allele, or heterozygous (0.70 kb/0.55 kb). If the 0.55-kb allele were to confer or otherwise indicate increased susceptibility to lung tumorigenesis, heterozygous mice would exhibit greater lung tumor rates and thus an odds ratio relative to the homozygous mice greater than one.
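The data for the $T = 2$ groups are shown in Table 1. Applying a logistic model to these data gives an ML estimate of the regression parameter as $\hat{\alpha}_1 = 2.398$, with $\mathrm{se}(\hat{\alpha}_1) = 1.16$. For these data, the corresponding large-sample 95% confidence limits on the odds ratio of heterozygous relative to homozygous mice, $\psi_1 = \exp(\alpha_1)$, are $1.14 < \psi_1 < 106.45$.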
Based on this analysis, it seems fair to conclude that the 0.70-kb/0.55-kb heterozygous genotype exhibited moderately increased risk of murine lung tumorigenesis relative to the homozygous genotype. Sample sizes are small; total samples of at least 100 have been suggested to achieve nominal operating characteristics in one-sided testing under the logistic model (24). Thus, further experimentation and analysis are required before unequivocal conclusions can be reached as to the heterozygote susceptibility in this setting.
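The Wald quantities in this example can be retraced from the reported estimate and standard error alone. The sketch below assumes the quoted interval is a 95% Wald interval for $\alpha_1$ with exponentiated endpoints; the variable names are illustrative. Because the standard error is quoted to only two decimals, the computed limits differ slightly from the published 1.14 and 106.45.

```python
import math
from statistics import NormalDist

alpha_hat = 2.398  # reported ML estimate of alpha_1
se_hat = 1.16      # reported standard error, rounded to two decimals

std_normal = NormalDist()

# One-sided Wald test of H0: alpha_1 = 0 against alpha_1 > 0.
z = alpha_hat / se_hat
p_one_sided = 1.0 - std_normal.cdf(z)

# Odds ratio estimate and Wald limits, exponentiated from the logit scale.
odds_ratio = math.exp(alpha_hat)
z_crit = std_normal.inv_cdf(0.975)
lower = math.exp(alpha_hat - z_crit * se_hat)
upper = math.exp(alpha_hat + z_crit * se_hat)

print(f"Z = {z:.2f}, one-sided p = {p_one_sided:.3f}")
print(f"odds ratio = {odds_ratio:.2f}, limits = ({lower:.2f}, {upper:.2f})")
```

The lower limit lies only just above one, which is consistent with the cautious reading of the result given above.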

Two-Way Models and Gene-Environment Interactions
In Example 1, only $T = 2$ genetic groups were examined for murine lung cancer susceptibility. With $T = 2$ groups, a number of possible analyses can identify susceptibility, including 2 × 2 contingency table calculations with $\chi^2$ tests (25), Fisher exact tests (26,27), etc. These approaches often provide similar inferences. For example, the one-sided p-value from the Fisher exact test comparing the two proportions in Example 1 is 0.024, almost identical to the value of 0.019 achieved with the Wald test of $\alpha_1$.
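For readers who wish to see how a one-sided Fisher exact p-value is formed, a short Python sketch follows. The counts used (11 tumor-bearing mice of 12 heterozygotes, 8 of 16 homozygotes) are an assumption made for illustration: they were chosen because they are consistent with the summary statistics quoted in the text, and they are not taken from Table 1. The function name is hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """Upper-tail Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]].

    With all margins fixed, the count in the (1, 1) cell is hypergeometric;
    the p-value is the probability of a count at least as large as a.
    """
    row1 = a + b        # size of the first group
    col1 = a + c        # total number of responders
    n = a + b + c + d   # total sample size
    denom = comb(n, row1)
    tail = sum(
        comb(col1, x) * comb(n - col1, row1 - x)
        for x in range(a, min(row1, col1) + 1)
    )
    return tail / denom


# Assumed counts: rows are (heterozygous, homozygous),
# columns are (tumor, no tumor).
p_fisher = fisher_one_sided(11, 1, 8, 8)
print(f"one-sided Fisher exact p = {p_fisher:.3f}")
```

Exact integer arithmetic via `math.comb` avoids any rounding in the tail sum until the final division.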
The usefulness of the logistic model is more evident, however, in cases with many genetic categories or a dose-response under study, or when additional factors or other sources of variability are identified and examined as part of the susceptibility analysis. For example, with human lung cancer, susceptibility may associate with genetic effects, lifestyle-related factors such as cigarette smoking, occupational or environmental exposures to pulmonary genotoxins, or a combination of these factors (15). (An application of the logistic model to such data is presented in Example 2, below.) Indeed, recognition is growing that genetic susceptibility must be studied in the context of external environmental exposures that might initiate or contribute to disease progression (28). In these cases, the linear predictor in Equation 1 is easily extended, and the logistic model facilitates estimation of the additional parameters.
(If either index is 0, the corresponding standard error will be 0, since we set $\alpha_0 = \beta_0 = 0$. Similarly, $\mathrm{cov}(\hat{\alpha}_i, \hat{\beta}_j) = 0$ if $i = 0$ or $j = 0$.)

Collapsibility
Collapsibility over genetic categories is another potential area of interest in studies of genetic susceptibility to disease. One questions whether the different genotypes have equivalent effects and may be collapsed into one or a small group of categories. For example, a single-locus, two-allele (say, B and b) system generates three genetic categories: BB, Bb, bb. If B behaves as a simple dominant allele, we can collapse these three categories into two: B- and bb. It may be of interest to assess any such collapse statistically.
Obviously, the nature of the collapsibility hypothesis will depend on the gene under study. For instance, if one encountered a situation where the genetic factor is important to the detoxification of an environmental exposure, or is important in either producing or inactivating a toxic metabolic product of the exposure, then there may be no apparent genetic effect in those individuals without the exposure. In such a case, this may indicate collapsibility at only certain levels of the exposure variable. The logistic model provides a means for assessing collapsibility of this (or any) sort in the $T \times J$ setting. At any fixed environmental level, say $j = j_0$, consider collapsibility over $C$ of the genetic levels, indexed by $i_1, \ldots, i_C$. This is expressed as equality of the odds of response across these $C$ genetic categories at level $j_0$. Under the logistic model, this corresponds to $H_0: \alpha_{i_1} + \gamma_{i_1 j_0} = \alpha_{i_2} + \gamma_{i_2 j_0} = \cdots = \alpha_{i_C} + \gamma_{i_C j_0}$, where the $\gamma_{ij}$ are the gene-environment interaction parameters. Departure from $H_0$ is assessed via a generalized LR statistic (32), for example, with a limiting $\chi^2$ distribution on $C-1$ df.
Extensions to collapsibility over multiple levels of j are straightforward.
Notice that at $j = 0$ (i.e., at the "background" environmental level) the collapsibility hypothesis is $H_0: \alpha_{i_1} + \gamma_{i_1 0} = \alpha_{i_2} + \gamma_{i_2 0} = \cdots = \alpha_{i_C} + \gamma_{i_C 0}$. Under the identifiability constraints $\gamma_{i0} = 0$, for all $i$, this simplifies to $H_0: \alpha_{i_1} = \alpha_{i_2} = \cdots = \alpha_{i_C}$. Thus, collapsibility at the environmental background corresponds to collapsibility among genetic main effects, as would be expected. This illustrates a situation where a nonhierarchical model (with interactions, but not all main effects, fully modeled) makes sense from a mechanistic perspective, a paradigm not typically encountered in (generalized) linear modeling.
Collapsibility can also be assessed under a no-interaction condition, i.e., when $\gamma_{ij} = 0\ \forall i,j$. The collapsibility hypothesis becomes $H_0: \alpha_{i_1} = \alpha_{i_2} = \cdots = \alpha_{i_C}$ at any $j$. This condition is independent of $j$, however, so collapsibility at any $j$ when $\gamma_{ij} = 0\ \forall i,j$ corresponds to collapsibility over all $j$.
Joint collapsibility follows similarly. Suppose collapsibility is considered over the two nonoverlapping categories indexed by $i_1, \ldots, i_C$ and $I_1, \ldots, I_D$. Then, with $\psi_i$ denoting the odds ratio for genetic category $i$ relative to the background category, joint collapsibility is expressed as a $(C-1)+(D-1)$ df hypothesis: $H_0: \psi_{i_1} = \psi_{i_2} = \cdots = \psi_{i_C};\ \psi_{I_1} = \psi_{I_2} = \cdots = \psi_{I_D}$, corresponding to $H_0: \alpha_{i_1} = \alpha_{i_2} = \cdots = \alpha_{i_C};\ \alpha_{I_1} = \alpha_{I_2} = \cdots = \alpha_{I_D}$. Of course, one should have some a priori basis for considering certain sets of genetic types as reasonable candidates for collapse, because exploratory analyses over all possible collapsings run the risk of data overinterpretation.
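In the simplest one-way layout (a single environmental level, with a separate effect for every genetic group), the LR test of collapsibility has a closed form, because the ML fitted probabilities are the observed proportions under the full model and the pooled proportion over the collapsed groups under $H_0$. The Python sketch below illustrates this special case only; the counts are hypothetical, the function names are illustrative, and the chi-square tail probability is computed with a standard recurrence rather than a statistics library.

```python
import math


def binom_loglik(y, n, p):
    """Binomial log-likelihood kernel (binomial coefficient omitted)."""
    ll = 0.0
    if y > 0:
        ll += y * math.log(p)
    if n - y > 0:
        ll += (n - y) * math.log(1.0 - p)
    return ll


def collapse_lr(counts, collapse):
    """LR statistic and df for collapsing the groups listed in `collapse`.

    counts: list of (responders, group size), one pair per genetic group.
    """
    members = set(collapse)
    full = sum(binom_loglik(y, n, y / n) for y, n in counts)
    y_pool = sum(counts[i][0] for i in members)
    n_pool = sum(counts[i][1] for i in members)
    p_pool = y_pool / n_pool
    reduced = sum(
        binom_loglik(y, n, p_pool if i in members else y / n)
        for i, (y, n) in enumerate(counts)
    )
    return 2.0 * (full - reduced), len(members) - 1


def chi2_sf(x, df):
    """Upper-tail chi-square probability for a positive integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2.0)), 1   # exact for 1 df
    else:
        q, k = math.exp(-x / 2.0), 2              # exact for 2 df
    while k < df:                                  # step up by 2 df at a time
        q += (x / 2.0) ** (k / 2) * math.exp(-x / 2.0) / math.gamma(k / 2 + 1)
        k += 2
    return q


# Hypothetical counts for T = 3 groups; test collapse of groups 1 and 2.
counts = [(10, 50), (20, 50), (22, 50)]
lr, df = collapse_lr(counts, [1, 2])
p_value = chi2_sf(lr, df)
print(f"LR = {lr:.3f} on {df} df, p = {p_value:.3f}")
```

In the two-way setting with interaction terms, the reduced model generally has no closed form, and the two log-likelihoods would instead come from iterative fits of the constrained and unconstrained logistic models.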

Applications in Epidemiology
Once a potential susceptibility gene has been identified in humans, its association with disease can be tested in epidemiologic population studies using case-control study designs (33). Sampling in a case-control study is carried out separately for cases and controls in a retrospective manner and thus in effect is conditioned on disease status (34). As is well known, however, one can reverse this conditioning and model the logit of the risk of disease as a function of covariates as in Equation 1 or 2, treating the data as if they had arisen prospectively. The resulting regression coefficients are asymptotically unbiased for the associated log odds ratios (35,36), and likelihood ratio testing based on the prospective logistic model is valid when applied to case-control data (37). Thus the (prospective) logistic models described above are applicable in retrospective (and prospective) epidemiologic studies where genetic susceptibility is under examination.
In a case-control study, Caporaso et al. (38) examined individual subjects' abilities to metabolize the drug debrisoquine and related these metabolic activities to lung cancer susceptibility. Increased ability to metabolize agents such as debrisoquine is conjectured to indicate increased cancer susceptibility because heterogeneity in drug metabolism may associate with heterogeneity in cancer susceptibility. (For example, the drug metabolism pathway may play a role in carcinogen metabolism, either by deactivating a carcinogen or by activating a proto-carcinogen into a carcinogen.) Debrisoquine metabolism is polymorphic in humans: most individuals receiving the drug rapidly excrete large amounts of debrisoquine metabolite (39); these individuals are "extensive metabolizers" (EM). Some individuals excrete reduced amounts of the metabolite or excrete the drug almost unchanged. They are "intermediate" (IM) or "poor" metabolizers (PM), respectively (5). These polymorphisms lead to $T = 3$ genetic categories for study.
An initial question of interest is whether the intermediate or poor metabolizers truly constitute two distinct genetic classes with respect to their lung cancer susceptibility. That is, is collapsibility evidenced between PM and IM? To study this question, Caporaso et al. (38) reported the case-control data shown in Table 2. Since the logistic model is applicable to this retrospective sampling scenario (37), we consider the prospective form in Equation 1 to model the genetic effects. As suggested above, collapsibility of the PM and IM categories corresponds to equality of main effects parameters: $H_0: \alpha_1 = \alpha_2$. To test this hypothesis, a GLIM (21) analysis of these data yields a 1 df LR statistic of 0.178. No significant departure is evidenced, and we conclude that the data support the contention that PM and IM metabolizers exhibit similar susceptibilities to lung cancer.
Caporaso et al. (38) also reported data on the potential interaction of genetic susceptibility (as evidenced by debrisoquine phenotype) and environmental factors, such as asbestos exposure (Table 3). Notice that the data reflect the previous recognition of PM/IM collapsibility. Referring now to the two-way model from Equation 2, no interaction between debrisoquine phenotype and asbestos exposure corresponds to testing $H_0: \gamma_{11} = 0$. The LR statistic for this significance test is 1.609, on 1 df. The corresponding p-value is 0.205. No evidence is seen for a significant interaction between the genetic and environmental factors.
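The p-values attached to the two LR statistics in this example can be checked directly, since for 1 df the chi-square upper-tail probability reduces to a complementary error function. The short Python sketch below uses only the statistics reported in the text; the function name is illustrative.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


lr_collapse = 0.178      # reported LR statistic for H0: alpha_1 = alpha_2
lr_interaction = 1.609   # reported LR statistic for the interaction test

p_collapse = chi2_sf_1df(lr_collapse)
p_interaction = chi2_sf_1df(lr_interaction)

print(f"collapsibility:  p = {p_collapse:.3f}")
print(f"interaction:     p = {p_interaction:.3f}")
```

The second value agrees with the 0.205 reported for the interaction test; the first shows how far the collapsibility statistic lies from conventional significance thresholds.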

Data Truncation
In some settings, the experimental end point may involve the number of occurrences of some phenomenon, such as the number of tumors seen in a certain organ of an experimental animal (40) or the number of cells in a tissue or culture responding to a chemical stimulus (41). Denote the random variable associated with this discrete-valued response by $U$. If the observing mechanism or technique is such that only the occurrence of a non-null state is recorded (e.g., "no tumors" versus "some tumors"), the data will be truncated into a dichotomous response. Sobel and Elashoff (42) have referred to this sampling scheme as (binomial) "group testing"; also see Chen and Swallow (43).
When interest centers on the nonresponse, $\Pr[Y = 0]$, the data are often referred to as "Hansen frequencies" (44), based on applications of E. W. Hansen's work in the behavioral sciences (45).
This sort of data truncation could occur in a susceptibility study, where multiple tumors occur in each individual, but only the presence or absence of the cancer is noted. Thus, for the $k$th individual in the $i$th genetic category or group, one observes $Y_{ik}$ as the indicator of individual tumorigenic response. In this truncation scenario one often takes the $U_{ik}$ as independently distributed Poisson variates, so that the probability of response is one minus the Poisson probability of zero occurrences. This construction, based on Poisson occurrence rates, was discussed in detail by Cochran (46), who had in mind application to bacterial concentrations in suspension and the planning of dilution experiments. He suggested that the concept was fairly well known, starting with the work of McCrady (47) on the concentration of organisms in liquids.

Example 1 (continued)
To illustrate use of the complementary log model from Equation 3, consider again the K-ras-2/lung tumor susceptibility data described earlier. For the inbred recombinant mice studied in that experiment, it is common to observe multiple lung neoplasms per mouse (4). Reporting the data as dichotomous outcomes therefore involves a $U \to Y$ data truncation of the form considered herein. The complementary log formulation becomes a viable model candidate for quantitative assessment of the cancer susceptibility.
Applying a complementary log model to these data gives an ML estimate of the regression parameter as $\hat{\alpha}_1 = 1.792$, with $\mathrm{se}(\hat{\alpha}_1) = 0.9895$. A Wald test of the no-susceptibility hypothesis $H_0: \alpha_1 = 0$ yields a test statistic of $Z = 1.81$, with one-sided p-value equal to 0.035. Again, increased susceptibility is evidenced, although with slightly lesser significance than that exhibited with the logistic analysis. (The call for larger sample sizes remains valid, and is perhaps in greater evidence here.) Additional similarity is seen with the ML estimate of the odds ratio, constructed from the ML estimate of $\alpha_1$. Under the complementary log formulation, $\hat{\psi}_1 = 11.003$ for these data, an almost indistinguishable change from the logistic estimate reported above.
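The complementary log calculations can be sketched in closed form for two groups. The Python code below rests on two assumptions that should be read as such: first, that the complementary log link is $\eta = -\log(1 - p)$, which is the form suggested by the Poisson truncation argument since $\Pr[Y = 1] = 1 - \exp(-\lambda)$ for Poisson mean $\lambda$; second, that the counts are 11 responders of 12 heterozygotes and 8 of 16 homozygotes, values chosen because they are consistent with the reported summaries and not taken from Table 1. Standard errors use the delta method, and all names are illustrative.

```python
import math
from statistics import NormalDist


def comp_log_fit(y, n):
    """Complementary log transform of an observed proportion.

    Returns eta = -log(1 - p) and its delta-method variance,
    var(p_hat) / (1 - p)^2 = p / (n * (1 - p)).
    """
    p = y / n
    eta = -math.log(1.0 - p)
    var = p / (n * (1.0 - p))
    return eta, var


# Assumed counts (responders, group size); see the caveat above.
eta0, var0 = comp_log_fit(8, 16)    # background (homozygous) group
eta1, var1 = comp_log_fit(11, 12)   # heterozygous group

# With two groups the model is saturated, so the ML estimate of alpha_1
# is simply the difference of the transformed proportions.
alpha1_hat = eta1 - eta0
se_alpha1 = math.sqrt(var0 + var1)

z = alpha1_hat / se_alpha1
p_one_sided = 1.0 - NormalDist().cdf(z)

# Odds of response under this link are exp(eta) - 1.
odds_ratio = (math.exp(eta1) - 1.0) / (math.exp(eta0) - 1.0)

print(f"alpha_1 = {alpha1_hat:.3f}, se = {se_alpha1:.4f}")
print(f"Z = {z:.2f}, one-sided p = {p_one_sided:.3f}")
print(f"odds ratio = {odds_ratio:.3f}")
```

Because the two-group model is saturated under either link, the fitted proportions coincide and the odds ratio estimate is the same apart from numerical convergence error, which may explain the close agreement noted above.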
Thanks are due to Joseph K. Haseman, David G. Hoel, Jack A. Taylor, Clarice R. Weinberg, and Takashi Yanagawa for their helpful comments and support during the preparation of this manuscript.