Recent developments in the multistage modeling of cohort data for carcinogenic risk assessment.

The modeling of cohort data based on the Armitage-Doll multistage model of the carcinogenic process has gained popular acceptance as a methodology for quantitative risk assessment for estimating the dose-related relationships between different occupational and environmental carcinogenic exposures and cancer mortality. The multistage model can be used for extrapolation to low doses relevant for setting environmental standards and also provides information regarding whether more than one stage is dose-related, which assists in determining whether different carcinogens affect different stages of the cancer process. This paper summarizes recent developments in the multistage modeling of cohort data and emphasizes practical issues such as handling data arising from large epidemiologic follow-up studies, the time-dependent nature of exposures and statistical issues such as multicollinearity in time-related variables, robustness of parameter estimates, validating of the fitted models, and routine inferential procedures. Problems related to uncertainties of risk estimates are discussed also. Computer programs for fitting multistage models with one or two dose-related stages to cohort data incorporating time-dependent exposure patterns; constructing confidence regions for model parameters; and predicting lifetime risks of dying from cancer adjusting for competing causes of death are detailed. Illustrations are provided using lung cancer mortality in a cohort of nonwhite male coke oven workers exposed to coal tar pitch volatiles.


Introduction
Recent literature on epidemiologic studies ofhuman cancer indicates that the Armitage-Doll multistage model of carcinogenesis can be used to examine the relationships of excess cancer mortality to some carcinogenic exposure and time-related variables such as age at initial exposure; duration of exposure; and time since exposure terminated (1)(2)(3)(4). These relationships help to predict benefits of different strategies for reducing carcinogenic exposure in the workplace or environment.
Because multistage models can be fonnulated to provide information regarding whether more than one stage is dose-related, these models assist in determining whether different carcinogens affect different stages of the cancer process. They also are used in carcinogenic quantitative risk assessment for estimating the dose-response relationships and for extrapolating to low doses relevant for setting environmental standards (5,6). The dose can be constant or time-related, and the response is usually measured by lifetime risk of dying from cancer. This paper summarizes recent developments in multistage modeling of cohort data detailing methodological as well as computational advancements. These developments lie mostly in dealing with inferential problems and developing computer programs.
A recently developed computer program package for the direct fitting of multistage models with one or two doserelated stages and time-dependent exposure patterns is de-scribed. In addition to tests of significance related to one-ata-time inference, this package provides computing algorithms for the construction of confidence regions for simultaneous inferences regarding parameters of two-stage, dose-related multistage models and algorithms for computations of lifetime risks adjusting for competing causes of death. Programs in this package can be grouped conveniently according to the need of researchers.
Several statistical concerns that arise when cohort data are analyzed for carcinogenic risk assessment are discussed. New methodological approaches to deal with these concerns are proposed. Specific issues addressed include the appropriateness of statistical inferential procedures, robustness of parameter estimates, model validation, and uncertainties ofrisk estimates.
Illustrations ofthe operations ofthe program are provided, using lung cancer mortality in a cohort ofnonwhite, male coke oven workers exposed to coal tar pitch volatiles. Results from the analysis of this data set are used to facilitate discussions of methodological and statistical concerns addressed in the paper.

Armitage-Doll Multistage Model of Carcinogenesis
The simple multistage model of carcinogenesis assumes that a tumor results when k events occur in order in a single cell. The occurrence rate (at age t) ofthe ith event given that i-i events have occurred, is assumed to be: ri = a + bD(t) (1) where a, is the background rate independent of age, bi is the potency parameter for the ith cellular changes, and D(t) is the dose rate at age t. The cumulative hazard rate (or cumulative incidence rate) by age t is then given by t t0 Uk u2 where to is the time from initiation of tumor to clinical expression.
In accordance with the above formulation, the age-specific cancer death rate [h(t)] at time t can be expressed as follows: (3) where ho(t) is a baseline age-specific cancer death rate derived either from a standard population or a suitably chosen control population, and h(t) is the excess death rate. The expression for the excess death rate depends on the number and the order ofthe dose-related stage(s) and the nature of the dose rate patterns.

Recent Developments in Multistage Modeling of Cohort Data
The methodology used in the present paper for multistage modeling of cohort data incorporates time-dependent exposure patterns. A briefdescription ofthis methodology is given below followed by recent developments in its implementation. Developments in areas ofmodel validation and appropriateness of inferential procedures are also discussed.

Direct Fitting of the Multistage Model
In occupational cohort studies, historical exposure estimates for one or more agents are usually taken as the measure ofdose. Simplified expressions for multistage models under the assumption of constant exposure rate are easily amenable to model fitting. Applications ofthis approach for occupational arsenic and coke oven exposures are given by Brown  The assumption ofconstant exposure rate throughout the work history is not realistic under most occupational settings. The fitting ofmultistage models to cohort data that reflect realistic exposure experience necessitates the incorporation of timedependent exposure patterns. A methodology to incorporate exact time-dependent exposure patterns in the calculations of excess death rates has been provided by Crump and Howe (7). A formulation of this approach to epidemiologic data when one stage of the multistage model is dose-related is discussed in a subsequent paper (8).
Our first effort toward enhancing the methodology and com- *Statistically significant, p < 0.01 puter software for the fitting of multistage models to cohort data sets incorporating time-dependent exposure patterns consisted ofextending the formulation provided by Crump et al. (8) to the situation when two stages are dose-related and developing a computer program package for model fitting and computing lifetime risks. A briefdescription ofthe formulation ofthe problem with exact expressions for the excess risk function is provided first to facilitate our discussion of the computer package developed.
To illustrate the multistage modeling programs, we used a data set from the Steelworkers Mortality Study, a project conducted in the Departnent ofBiostatistics, University ofPittsburgh (9). The selected cohort consists of 1100 Allegheny County black male coke oven workers who worked between 1951 and 1953 and were followed through 1970. There were 50 lung cancer deaths during this follow-up period (Table 1). Exposure data consist of concentrations ofcoal tar pitch volatiles (CTPV) in milligrams per cubic meter (10). Exposure profiles were constructed for the workers given the levels of CTPV at jobs over the entire work periods divided into 1-month intervals. These exposure profiles provide the time-dependent exposure rates.

General Formulation of the Two-Stage Dose-Related Multistage Model
In the direct-fitting procedure of the multistage models to cohort data, a discrete version ofEq. (3) is needed. Ifwe divide age into 1-year intervals, xobeing the age when the worker starts his first job and the exposure rate in the ith interval is assumed to be a constant c; (i=1,2 ... ), then for a k stage model with the rth and Ith stages dose-related, Eq. (3) takes the form fixed value which without any loss of generality can be taken toO.
Denoting by Z,m,n z2mn. and Z3ms,, Eqs. (5-7) for the nth individual and Nm, the total person-years at risk at age xm, the expected number of cancer deaths at age xm is given by Using this formulation, we had developed a computer program package for multistage modeling ofcohort data when two stages are dose-related. The program package with an application has been reported (11,12).
Our most recent research deals with the methodological and statistical concerns related to the two-stage, dose-related multistage models with time-dependent exposure patterns. This includes adaptation of Armitage-Doll constraints in the general formulation of two-stage dose-related models described earlier and development of additional computer programs for the model fitting and inferential procedures.

Armitage-Doll Formulation of the Two-Stage Dose-Related Model
The formulation of the two-stage, dose-related multistage model described previously provides a more generalized form than given by the Armitage-Doll theory, which requires the following constraints on the model parameters: = ,B X1 and 1 >°'2 > 0 (9) These constraints are usually referred to as the Armitage-Doll constraints.
Detailed formulation of the two-stage, dose-related multistage model for epidemiologic data with the Armitage-Doll constraints, associated inferential procedures, and an application to coke oven workers cohort data set are detailed by Patwardhan (13). The expression given in Eq.  where, z3* = Z3m (k xk-'/a.) and other quantities remain the same. Computer programs have been developed for simultaneous estimation, testing, and construction ofconfidence regions for model parameters using three distinct statistical procedures: Wald, likelihood ratio, and score.
In our initial program package for multistage modeling, generalized linear interactive modeling (14) was used for the estimation purpose, and the Armitage-Doll constraints could not be imposed on the estimation algorithm. As part of our most recent work, the computer package has been enhanced by including programs for the constrained estimation methods. In addition, the new package contains several programs for inferential procedures. The package for single-stage, doserelated models is called "Multistage Modeling Programs I" and the package for two-stage, dose-related models is called "Multistage Modeling Programs II." Flowcharts of these two program packages are given in Figures 1 and 2.
Lifetime risks of dying from cancer adjusted for competing causes of death are calculated using the procedure developed by Gail (15). Using the Multistage Modeling Programs I to analyze the coke oven workers data set and specifying the number of stages (k) as 4; the dose-related stage (r) as 1; groupingsofdeaths andperson-years by intervals ofz values; and time-dependent exposure patterns as observed, the following information was generated: a) likelihood plot (Fig. 3); b) estimate ofthe potency parameter, standard error, goodnessof-fit statistics, and Wald, likelihood ratio-based, and scorebased confidence intervals and associated test statistics ( Table 2); and c) lifetime risks (Table 3).
Using the Multistage Modeling Program II to evaluate the data set and specifying the number of stages (k) as 4; dose-related stages (l,r) as 3 and 1; groupings of deaths and person-years by cross-classified cells ofz value vector; and timedependent exposure patterns as observed, the following information was generated: a) likelihood surface (Fig. 4); b) estimates of potency parameters, standard errors, goodness-of-fit statistics, and Wald, likelihood ratio-based, and score test statistics (Table 4); c) Wald, likelihood ratio-based, and scorebased confidence regions (Fig. 5); and d) lifetime risks (not shown).

Coverage and Power Study of Test Procedures
Null hypotheses about the potency parameters of multistage models can be tested using standard inferential procedures such as Wald's procedure, the likelihood ratio-based procedure, and the procedure based on score statistics. The question often arises about the appropriateness and relative merits or differences in the performances of these procedures. To address this question, one must undertake simulation studies that are conducted under conditions corresponding to those of the data set being analyzed. We have developed computer programs to perform such simulations and used them to analyze the coke oven workers data set (15). Results from the simulation study indicated that for a single-stage, dose-related multistage model, confidence intervals obtained by all three procedures have good coverage and power properties. For two-stage dose-related models, the Wald confidence region performs poorly for extrene parameter values, whereas the other confidence regions perform satisfactorily.

Robustness
Assessing robustness of a regression fit is an important step in any modeling endeavor. There are two different aspects of this assessment. The first aspect involves assessing the robustness of the parameter for model-fitting purposes within some fixed distributional assumptions. The second aspect    involves examining the robustness (sensitivity) ofthe parameter estimates to the inherent assumptions of the model, mainly the assumption about the probability distribution of the random component of the model. A new method has been proposed for investigating model robustness using the generalized linear model approach (16). Briefly, the method quantifies model robustness by defining Table 3. Estimated lifetime lung cancer risk (through age 85 + years) for a hypothetical U.S. black male exposed to coal tar pitch volatiles for 40 years, exposure starting at age 20.  cEach test statistic has a chi-square distribution with 2 degrees of freedom. robust regions ofthe parameters in the variance function and link function in a generalized linear model (GLM). These robust regions are defined as ranges in the spaces ofvariance function parameter or link function parameter for which the corresponding estimates ofthe potency parameter lie within the confidence interval of the potency parameter determined under the basic model assumptions and the model fitting is adequate. Wedderbum's quasilikelihood approach (17) and several criteria based on commonly used statistics are used to obtain the robust regions. The method applies to the robustness investigation of either the variance function or the link function or both simultaneously. The GLM method has been applied to the first of four stages dose-related multistage model fitted to the coke oven workers mortality data. Results show that the selected multistage model is sufficiently robust to the change of the relationship between the first two moments of the random component (variance function) but sensitive to the change ofthe relationship between the systematic component and the mean (link function). Figure  6 displays one such variance-link robust region.

Cross-Validation
The coke oven workers mortality study provides a unique opportunity to cross-validate models fitted to death rates for quantitative carcinogenic risk assessment since the cohort has been followed for about 30 years (the most recent follow-up is through 1982). The extended follow-up allows fitting a model with data through a certain follow-up period and then making predictions of excess deaths for subsequent calendar periods. These predictions can then be validated with actual observations from the extended follow-up periods. As commonly used, goodness-of-fit statistics (e.g., deviance and Pearson's chisquare) are found to be not very sensitive in choosing the best fitted model. We believe that the cross-validation method is more meaningful for the modeling of cohort data.
In the context of multistage modeling ofcohort data with timedependent exposure patterns, this cross-validation method requires first the grouping of Z values (vector) by specific calendar year periods and using the estimated potency parameters (s) and calculating the predicted excess death rates beyond the calendar period used in model fitting. The predicted excess death rates can then be compared with the observed excess death rates obtained from the extended follow-up information. A computer program has been developed for grouping deaths and personyears by calendar year periods and was used to cross-validate a multistage model with time-dependent exposure pattern fitted to the lung cancer mortality of a different cohort of coke oven workers (18).

Discussion
In recent years, the use of multistage models has become an integral part of environmental carcinogenic risk assessment. Multistage models provide a conceptual framework that facilitates understanding of the relationships between different time-dependent variables and tumor incidence, and they also give a reasonable description of the epidemiology of many nonhormonally dependent cancers of epithelial origin.
Concerns about the basis of multistage models have been raised by Moolgavkar and his colleagues in a series ofpapers (19,20). A deficiency ofthe Armitage-Doll model is that the target tissue is not allowed to grow and that no allowance is made for the fact that most epithelia constantly shed and replenish cells. Moolgavkar and his colleagues have developed a two-stage model to rectify this defect (18). We have developed some of the necessary software for the modeling of large cohort data sets using two-stage models and will continue efforts to complete the remaining work. Programs that are developed for the multistage modeling purpose will be modified suitably for two-stage modeling.
We have developed methods to assess several statistical issues that arise in multistage modeling including the appropriateness and merits ofvarious procedures for testing the null hypothesis, the robustness ofparameter estimates, and the cross-validation of models. However, the methods developed so far fall short of adequately addressing uncertainties in the risk estimates. The confidence limits and the confidence regions of the potency parameters can be used in providing ranges of risk estimates. In addition, we intend to use bootstrap methods to provide alternative estimates of the confidence limits (regions) based on asymptotic theory which then can be used to provide alternate ranges of the risk estimates.