Relative risk regression analysis of epidemiologic data.

Relative risk regression methods are described. These methods provide a unified approach to a range of data analysis problems in environmental risk assessment and in the study of disease risk factors more generally. Relative risk regression methods are most readily viewed as an outgrowth of Cox's regression and life table model. They can also be viewed as a regression generalization of more classical epidemiologic procedures, such as that due to Mantel and Haenszel. In the context of an epidemiologic cohort study, relative risk regression methods extend conventional survival data methods and binary response (e.g., logistic) regression models by taking explicit account of the time to disease occurrence while allowing arbitrary baseline disease rates, general censorship, and time-varying risk factors. This latter feature is particularly relevant to many environmental risk assessment problems wherein one wishes to relate disease rates at a particular point in time to aspects of a preceding risk factor history. Relative risk regression methods also adapt readily to time-matched case-control studies and to certain less standard designs. The uses of relative risk regression methods are illustrated and the state of development of these procedures is discussed. It is argued that asymptotic partial likelihood estimation techniques are now well developed in the important special case in which the disease rates of interest have interpretations as counting process intensity functions. Estimation of relative risk processes corresponding to disease rates falling outside this class has, however, received limited attention. The general area of relative risk regression model criticism has, as yet, not been thoroughly studied, though a number of statistical groups are studying such features as tests of fit, residuals, diagnostics and graphical procedures.
Most such studies have been restricted to exponential form relative risks as have simulation studies of relative risk estimation procedures with moderate numbers of disease events.


Introduction
One of the most important developments in biostatistics in recent years has been the evolution of regression methods for "failure" time data. In epidemiology, failure may refer to the diagnosis of a certain disease or to death from the disease. Primary interest typically centers around the relationship between individual characteristics or exposures and subsequent disease incidence or mortality.
In a cohort study, a group of subjects is selected from a population of interest and followed forward in time for disease occurrence. Both baseline characteristics or exposures, and characteristics or exposures measured during follow-up, may be of interest as disease risk factors. Such information will be referred to as the subjects' covariate history.
A cohort study is often too long-term and expensive to be feasible particularly for exploratory studies of rare diseases. A case-control design involves the monitoring of a large population for disease occurrence followed by a retrospective ascertainment of covariate histories. Such ascertainment takes place both for a representative sample of cases of disease and for a suitably selected disease-free, or control, group. Case-control design strategies often involve some degree of matching of controls to cases in respect to potential "confounding" variables that may otherwise obscure the relationship between covariates of primary interest and disease occurrence.
The ideas of case-control sampling can also be useful in the context of a cohort study. Specifically, cases of disease arising in a cohort may be compared to a subset of the disease free group in the cohort in order to avoid the assembly of covariate histories on the entire cohort. Such an approach is useful, for example, in the exploitation of a serum bank, since biochemical or viral analysis of stored sera on every cohort member may be prohibitively expensive. More generally a "synthetic" case-control analysis of a large cohort data set may be considered strictly for computational reasons.
A hybrid "case-cohort" design in which covariate histories are assembled only for a preselected subcohort and for cases developing disease may give rise to further cost saving in the context of certain types of cohort studies.

Regression Analysis of Cohort Data

Regression Models
Disease occurrence data in a cohort study take the form of a random time variable T for each subject. Typically T will be defined as time from entry into the cohort until disease occurrence, though other time specifications, such as age at disease occurrence, may be more natural in some applications. The time variate T will usually be subject to right censorship, as the subject may be without disease at the cut-off time for data analysis or may be lost to follow-up. Suppose initially that each subject has a fixed covariate vector z describing baseline characteristics or exposures under study, along with auxiliary data, for example, on potential confounding factors. The probability distribution for an absolutely continuous T can be equivalently described by its density, its survivor or distribution function, or by its (instantaneous) disease rate function
$$\lambda(t;z) = \lim_{\Delta t \to 0} \mathrm{pr}(t \le T < t + \Delta t \mid T \ge t, z)/\Delta t.$$
The disease rate, or hazard rate, function is a convenient representation for modeling purposes since it is natural to think in terms of disease rates and variations in disease rates over the follow-up course of a cohort study. Conventional parametric models, such as exponential and Weibull regression models, specify a hazard rate function of the form
$$\lambda(t;z) = \lambda_0(t)\, r(x\beta) \tag{1}$$
where $\lambda_0(\cdot)$ and $r(\cdot)$ are fixed functions, $x = x(z)$ is a row vector consisting of functions of z, and $\beta$ is a corresponding column vector to be estimated. An exponential regression model is characterized by a constant $\lambda_0(t)$, while the $\lambda_0(\cdot)$ function is a power function of time in a Weibull regression model. Since the ratio of hazard functions at any two z-values is independent of t, the class (1) is sometimes referred to as the proportional hazards model. For uniqueness one requires $r(0) = 1$ so that $r(x\beta) = \lambda(t;z)/\lambda(t;z_0)$, where $z_0$ is a "standard" covariate vector giving rise to $x(z_0) = 0$.
Since $r(x\beta)$ is the ratio of the failure rate at a general covariate vector to that at a standard vector, it is often referred to as the relative risk function. Very often the relative risk function will be taken to be of exponential form, $r(\cdot) = \exp(\cdot)$, but other forms such as $r(\cdot) = 1 + (\cdot)$ may be more useful in some applications.
In many epidemiologic risk factor problems estimation of the relative risk is of primary interest, while the baseline disease rate function $\lambda_0(t) = \lambda(t;z_0)$ can be thought of as a nuisance parameter. A major advance in the theory of failure time regression took place when Cox (1) discovered that estimation of the regression parameter $\beta$ could conveniently take place without placing any restrictions on the baseline hazard function $\lambda_0(\cdot)$. The function used for this purpose is
$$L(\beta) = \prod_{i=1}^{d} \frac{r(x_i\beta)}{\sum_{l \in R(t_i)} r(x_l\beta)} \tag{2}$$
where $t_1, \ldots, t_d$ denote the distinct disease occurrence times, $x_i$ is the regression vector for the subject developing disease at $t_i$, and $R(t)$ denotes the set of subjects at risk for disease at $t^-$, for estimation of the relative risk parameter $\beta$. Under independent failure times and independent censorship (see below) the i-th factor in Eq. (2) is precisely the probability that failure occurs on the subject with regression vector $x_i$, given the risk set $R(t_i)$ and given that exactly one failure is observed at $t_i$. Since the factors in Eq. (2) are dependent, special justification is required to show that Eq. (2) can be manipulated as an ordinary likelihood function, at least as far as asymptotic inference is concerned. Kalbfleisch and Prentice (2) showed $L(\beta)$ to have a marginal likelihood interpretation. Cox (3) introduced the notion of partial likelihood, which encompasses not only Eq. (2) but also a range of important related functions arising from generalizations of the class of models (1). The fact that Eq. (2) is a partial likelihood function implies, very generally, that the score statistic
$$U(\beta) = \partial \log L(\beta)/\partial\beta = \sum_{i=1}^{d} \partial \log L_i(\beta)/\partial\beta = \sum_{i=1}^{d} U_i(\beta),$$
where $L_i(\beta)$ denotes the i-th factor of Eq. (2), is such that each $U_i(\beta)$ has mean 0 and conditional variance estimated by $-\partial^2 \log L_i(\beta)/\partial\beta^2$, and that score statistic components $U_i(\beta)$ and $U_j(\beta)$ are uncorrelated, $i \ne j$. The partial likelihood structure then sets the stage for central limit theory to show $n^{1/2}(\hat\beta - \beta)$ to converge in distribution to a normal variate with mean vector zero and with variance matrix estimated by $n I(\hat\beta)^{-1} = -n\{\partial^2 \log L(\hat\beta)/\partial\beta^2\}^{-1}$, where n is the cohort size and $\hat\beta$ is the maximum partial likelihood estimate defined by $U(\hat\beta) = 0$.
Formal asymptotic convergence results were developed somewhat later, notably by Tsiatis (4). Efron (5) and Oakes (6) showed that it is not possible to improve on the efficiency of $\hat\beta$ provided $\lambda_0(\cdot)$ is completely unrestricted, and, equally important, that generally good efficiency properties obtain relative to the maximum likelihood estimates from parametric submodels of (1), even relative to parametric models that specify $\lambda_0(\cdot)$ up to a single scale parameter.
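As a concrete illustration of the partial likelihood (2), the following minimal Python sketch evaluates its logarithm for a single scalar covariate. It is not taken from the paper: the function name, the data layout (parallel lists of follow-up times, event indicators, and covariate values), and the restriction to untied disease times are all assumptions made for the example. The relative risk form r is passed as an argument, so that either the exponential or the linear form can be tried.

```python
import math

def log_partial_likelihood(beta, times, events, x, r=math.exp):
    """Log of the partial likelihood (2) for a scalar covariate.

    times  : follow-up time for each subject (hypothetical data layout)
    events : 1 if disease was observed at that time, 0 if censored
    x      : fixed scalar covariate for each subject
    r      : relative risk function with r(0) == 1
    Assumes no tied disease times.
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set R(t_i): subjects still under observation just before t_i.
        denom = sum(r(x[l] * beta) for l, t_l in enumerate(times) if t_l >= t_i)
        total += math.log(r(x[i] * beta)) - math.log(denom)
    return total

# Small made-up data set: four subjects, one censored at time 3.
times = [2.0, 3.0, 5.0, 7.0]
events = [1, 0, 1, 1]
x = [1.0, 0.0, 1.0, 0.0]
print(log_partial_likelihood(0.0, times, events, x))
print(log_partial_likelihood(0.5, times, events, x, r=lambda u: 1.0 + u))
```

At beta = 0 every subject has relative risk one, so each factor reduces to one over the size of the risk set; this gives a quick check on any implementation. A maximum partial likelihood estimate could then be obtained by passing the negative of this function to a general-purpose optimizer.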
With arbitrary $\lambda_0(\cdot)$, the sole restriction in the model (1) is the relative risk specification $r(x\beta)$. The requirement that this relative risk be independent of follow-up time may be unnecessarily restrictive in many applications; in fact, the change over time in the relative risk associated with a certain characteristic or exposure may be of considerable interest in some settings. For example, one may be interested in latent periods and other aspects of the temporal pattern of cancer relative risk in the follow-up of cohorts exposed to ionizing radiation or other carcinogens. The model (1) is readily relaxed to allow a dependence of relative risk on time by setting
$$\lambda(t;z) = \lambda_0(t)\, r\{x(t)\beta\} \tag{3}$$
where the modeled regression vector $x(t)$ now may consist not only of functions of z but also of product terms between functions of z and t. For example, with a single binary z, $r(\cdot) = \exp(\cdot)$ and $x(t) = (z, z \log t)$, the relative risk $\lambda(t;z = 1)/\lambda(t;z = 0)$ is $e^{\beta_1} t^{\beta_2}$, which is constant, monotone increasing, or monotone decreasing according to whether the coefficient $\beta_2$ is zero, positive, or negative. Based on Eq. (3), a partial likelihood function is readily developed that differs from Eq. (2) only through the replacement of $x_i$ and $x_l$ in the i-th factor of Eq. (2) by $x_i(t_i)$ and $x_l(t_i)$, respectively, for $i = 1, \ldots, d$.
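The binary-covariate example above can be checked numerically. The short Python sketch below is purely illustrative (the function name and the coefficient values are made up, not taken from the paper); it evaluates the relative risk under the exponential form with x(t) = (z, z log t) and compares it with the closed-form expression.

```python
import math

def relative_risk(t, beta1, beta2):
    """Relative risk at time t for z = 1 versus z = 0 when
    r = exp and x(t) = (z, z * log t)."""
    return math.exp(beta1 + beta2 * math.log(t))

# Hypothetical coefficients: the sign of beta2 controls the time trend.
for beta2 in (0.0, 0.3, -0.3):
    values = [relative_risk(t, 0.5, beta2) for t in (1.0, 2.0, 4.0, 8.0)]
    print(beta2, [round(v, 3) for v in values])
```

With beta2 = 0 the printed values are constant in t, and they rise or fall with t when beta2 is positive or negative, matching the statement in the text.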
To this point the regression model (3) presumes the parametric modeling of a relative risk function that includes not only the characteristics or exposures of primary interest, but also the auxiliary variables in z that may have been included, for example, to control confounding. Epidemiologic tradition, dating from the seminal paper by Mantel and Haenszel (7), very much concentrates on the use of stratification to control confounding or other potential biases.
The model (3) may be generalized to permit stratification by writing
$$\lambda(t;z) = \lambda_{0s}(t)\, r\{x(t)\beta_s\} \tag{4}$$
where the population is divided into q strata, $s \in \{1, \ldots, q\}$, on the basis of z values, and baseline disease rates $\lambda_{0s}(\cdot)$ are allowed to differ arbitrarily among strata. Note also that the regression parameter can be allowed to vary among strata. A partial likelihood function for $\beta = (\beta_1, \ldots, \beta_q)$ is readily developed as
$$L(\beta) = \prod_{s=1}^{q} \prod_{i=1}^{d_s} \frac{r\{x_{si}(t_{si})\beta_s\}}{\sum_{l \in R_s(t_{si})} r\{x_l(t_{si})\beta_s\}} \tag{5}$$
where $t_{s1}, \ldots, t_{sd_s}$ denote the distinct disease incidence times in stratum s and $R_s(t)$ denotes the set of subjects at risk in stratum s at t. A convenient approximation (8) is available to accommodate tied disease times within a stratum. Note also that stratum assignments may be time-dependent, that is $s = s(t,z)$, as a subject may move from one stratum to another during the course of follow-up. Model (4) allows the data analyst the choice of stratification or regression modeling for the control of confounding factors. It therefore allows one to avoid excessive stratification that sometimes poses a problem in direct application of the Mantel-Haenszel technique, and also avoids the unnecessary restrictions or unwieldy regression models that may arise if Eq. (4) were used without stratification. In short, Eq. (4) allows one to extract the best from traditional epidemiologic methods and modern failure time data methods. In large cohorts with relatively rare disease occurrence there is evidently little efficiency loss through a detailed stratification on key confounding variables. Some further study of this topic would be worthwhile.
The regression model (4) presumes a fixed baseline regression vector z. An important aspect of a number of large scale epidemiologic cohort studies, however, is the periodic recording of risk factor and confounding factor levels during the course of follow-up. Denote by $z(u)$ a covariate measurement pertaining to follow-up time u and by $Z(t) = [z(u); u < t]$ the entire covariate history for a subject prior to time t. The disease rate at time t may be defined as
$$\lambda\{t;Z(t)\} = \lim_{\Delta t \to 0} \mathrm{pr}\{t \le T < t + \Delta t \mid T \ge t, Z(t)\}/\Delta t$$
and a relative risk regression model
$$\lambda\{t;Z(t)\} = \lambda_{0s}(t)\, r\{x(t)\beta_s\} \tag{6}$$
may be defined, where $\lambda_{0s}(\cdot)$ is a baseline disease rate for stratum s, $x(t) = x[t,Z(t)]$ is a row regression p-vector that specifies the dependence of disease rate on risk factor histories under study, such that $x(t) = 0$ corresponds to a standard risk factor history, and by convention $r(0) = 1$. Regression models in the class (6) provide a flexible framework for a broad range of analyses to relate risk factor levels and changes in risk factor levels to subsequent disease incidence. A partial likelihood function for the estimation of $\beta = (\beta_1, \ldots, \beta_q)$ is once again given by Eq. (5).

Illustrations
There are many examples of the use of Eq. (4) in the literature. For example, Prentice et al. (9) apply Eq. (4) to a cohort of over 18,000 mice receiving a single time exposure to gamma radiation. The time-dependent feature of the regression variable in Eq. (4) was used to show that, for most cancer sites, the relative risk associated with a specific radiation dose drops off markedly as the animal ages.
For an illustration involving periodically measured covariate values consider a cohort of nearly 20,000 residents of Hiroshima and Nagasaki followed by the Radiation Effects Research Foundation. Prentice et al. (10) use data from this cohort to study the relationship between serial blood pressure measurements and subsequent cardiovascular disease incidence. Systolic and diastolic blood pressure, along with a number of other cardiovascular disease risk factors and potential confounding factors, were measured during the course of biennial examinations, beginning in 1958. The analyses described (10) make use of data on 16,711 subjects examined at least once during the time period 1958-74, including 108 incident cases of cerebral hemorrhage, 469 incident cases of cerebral infarction, and 218 incident cases of coronary heart disease. Specific objectives of their analysis concerned the relative importance of systolic and diastolic blood pressure as risk indicators for the three major cardiovascular disease categories just mentioned, and the relative importance of blood pressure levels from two or more biennial exam periods before a risk period, given the blood pressure measurements from the most recent examination period. The application of relative risk regression methods described (10) used model (6) with t defined as the examination cycle (i.e., t = 1 in 1958-60, t = 2 in 1960-62, ...) and with 32 strata defined on the basis of sex and 16 five-year age-at-baseline categories. The modeled regression vector $x(t)$ was taken to consist of systolic and diastolic blood pressure levels in examination cycles 1, 2, ..., t-1 or functions thereof. Naturally, in order that $x(t)$ be defined, it is necessary that certain preceding examinations have been attended and that the desired blood pressure measurements have been taken.
In order to accommodate missing covariate data it is necessary to assume that the set of subjects at risk in examination cycle t with covariate history $Z(t)$ are represented by the subset for whom the corresponding $x(t)$ value can be specified. This assumption is subsumed in the independent censorship process described below. In terms of the partial likelihood function (5), the risk sets $R_s(t)$ consist only of those subjects under active follow-up in stratum s for whom the modeled regression vector $x(t)$ can be derived from available covariate data. Table 1 shows the results of relative risk regression analyses with $r(\cdot) = \exp(\cdot)$, $x(t) = [\mathrm{SBP}(t-1), \mathrm{DBP}(t-1)]$, the systolic and diastolic blood pressure measurements in examination cycle t-1, and with common regression parameters across strata ($\beta_s = \beta$ for all s). Note that previous cycle diastolic blood pressure is the important disease risk predictor for cerebral hemorrhage, while the corresponding systolic blood pressure is the more important predictor for cerebral infarction and for coronary heart disease. This observation has clinical implications and provides insight into the three disease processes. Table 2 gives results of analyses in which a sequence of blood pressure measurements is related to subsequent disease incidence. The regression vector is now defined as $x(t) = [\mathrm{DBP}(t-1), \mathrm{DBP}(t-2), \mathrm{DBP}(t-3)]$ for cerebral hemorrhage and $x(t)$ equal to the corresponding SBP values from the three preceding cycles for cerebral infarction and coronary heart disease. Note that for a subject to contribute to the risk set in examination cycle t, all three previous biennial examinations need to have been attended. From Table 2 one can note that the most recent systolic blood pressure measurement is highly predictive of cerebral infarction risk, while the next most recent makes some additional contribution to risk prediction.
With coronary heart disease, on the other hand, a recent elevated systolic blood pressure measurement is not predictive, or is possibly even negatively predictive, of risk given the levels of SBP in the two preceding cycles. One possible explanation for this result would be that antihypertensive medication brings about blood pressure control without a corresponding reduction in coronary heart disease risk. The analysis for cerebral hemorrhage indicates that both elevated diastolic blood pressure and the duration of elevation are strong risk predictors.
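The restriction of risk sets to subjects whose regression vector can be specified may be easier to see in code. The Python sketch below is a hypothetical illustration, not the procedure of reference (10): the data layout (a dictionary of examination cycle to measurement for each subject), the function names, and the blood pressure values are all assumptions. It builds x(t) from measurements a given number of cycles before t and drops from the risk set any subject with a required measurement missing.

```python
def regression_vector(bp_history, t, lags=(1,)):
    """Return x(t) built from the measurements `lags` cycles before
    cycle t, or None when any required measurement is missing."""
    values = []
    for lag in lags:
        value = bp_history.get(t - lag)
        if value is None:
            return None  # examination not attended or value not recorded
        values.append(value)
    return values

def risk_set(histories, under_follow_up, t, lags=(1,)):
    """Subjects under follow-up at cycle t whose x(t) can be specified."""
    return [i for i in under_follow_up
            if regression_vector(histories[i], t, lags) is not None]

# Made-up systolic readings keyed by examination cycle.
histories = {
    "a": {1: 120, 2: 130, 3: 125},
    "b": {1: 140, 3: 150},          # missed cycle 2
    "c": {2: 110, 3: 115},          # entered at cycle 2
}
subjects = ["a", "b", "c"]
print(risk_set(histories, subjects, 4, lags=(1,)))
print(risk_set(histories, subjects, 4, lags=(1, 2, 3)))
```

Requiring three preceding measurements, as in the Table 2 analyses, shrinks the risk set relative to requiring only the most recent one, which is the trade-off noted in the text.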
Most applications to date of relative risk regression methods have presumed the exponential relative risk form $r(\cdot) = \exp(\cdot)$. Thomas (11) and Prentice et al. (12) use the linear form $r(\cdot) = 1 + (\cdot)$ to examine the joint dependence of certain cancer relative risks on radiation exposure and other factors. Table 3, from Prentice et al. (12), is based on data from 40,498 subjects in a larger cohort monitored by the Radiation Effects Research Foundation. These subjects were surveyed at least once in the time period 1964-70 in respect to cigarette smoking habits and had available (T65) total body radiation dose estimates. In this analysis T is defined to be years since the subject's first survey participation and $Z(t) = [Z_1(t), Z_2(t)]$ is defined to consist of radiation exposure information $Z_1(t)$ and cigarette smoking data (cigarettes per day and duration of smoking) $Z_2(t)$. The analysis also involved 128 fixed strata defined on the basis of age at radiation exposure (16 five-year classes), city, sex, and survey date (before or after the end of 1966). Table 3 shows results of fitting both exponential form $r(u) = \exp(u)$ and linear form $r(u) = 1 + u$ relative risk models with $x(t)$ defined to include linear and quadratic terms in T65 total dose estimate (truncated at 600 rads), indicator variables for four cigarettes per day categories, and a single term involving both exposures defined as the product of T65 dose (truncated) and a cigarette per day variate that takes value 0 for nonsmokers, and values 1 to 4 for the four cigarette per day categories indicated in Table 3. The results given in Table 3 are based on 1570 cancer deaths excluding hematologic cancers (which are apparently not smoking related) and excluding short-term smokers with smoking durations of between 5 and 20 years. Smokers of less than 5 years duration were pooled with nonsmokers. The coefficient of the product term [(T65 dose/100) × cig/day category] is of particular interest. The significantly negative coefficient in the exponential form regression indicates that the relative risk corresponding to a joint exposure to radiation and cigarette smoke is less than the product of relative risks for the individual exposures. For example, the estimated relative risk for a nonsmoker exposed to 100 rads (T65) of radiation is exp{0.237 - 0.009} = 1.25, the estimated relative risk for a long-term 20 cigarette per day smoker with no radiation exposure is exp{0.565} = 1.76, while the estimated relative risk for a long-term 20 cigarette per day smoker with 100 rads of radiation exposure is exp{0.237 - 0.009 + 0.565 - 0.067(3)} = 1.81. This last number can be compared with the estimate (1.25)(1.76) = 2.20 which would apply under a multiplicative relative risk model. In good agreement, the linear form relative risk model gives estimates of 1 + 0.292 - 0.001 = 1.29 for a nonsmoker exposed to 100 rads, 1 + 0.774 = 1.77 for a long-term 20 cigarette per day smoker unexposed to radiation, and 1 + 0.292 - 0.001 + 0.774 - 0.094(3) = 1.78 for the long-term 20 cigarette per day smoker with an estimated 100 rads of exposure. This latter number may be compared with a relative risk estimate of 1 + 0.292 - 0.001 + 0.774 = 2.06 which would apply under an additive relative risk model. Table 3 thus implies that the relative risk for all nonhematologic cancer among individuals exposed to both radiation and cigarette smoke is less than a multiplicative model would imply and possibly less than additive as well. When a more thorough account of age at radiation exposure is taken, however, there ceases to be evidence against an additive relative risk model, but evidence for submultiplicativity remains. Such analyses provide useful insights into the carcinogenic mechanism in addition to their obvious public health implications.

Footnotes to Table 3: (a) $\hat\beta$ values are maximum partial likelihood estimates; (b) asymptotic significance levels for testing $\beta = 0$ are given in parentheses.
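The arithmetic in the preceding paragraph can be reproduced with a few lines of Python. This is only a sketch: the coefficient values are those quoted in the text, while the dictionary keys, the function name, and the use of category value 3 for the 20 cigarette per day group (inferred from the factor of 3 in the quoted expressions) are assumptions made for the example. Because the quoted coefficients are rounded, results may differ from the published figures in the last digit.

```python
import math

# Coefficients as quoted in the text, for a T65 dose of 100 rads
# (dose/100 = 1, so the linear and quadratic dose terms both enter once).
EXP_FORM = {"dose": 0.237, "dose_sq": -0.009, "smoke": 0.565, "product": -0.067}
LIN_FORM = {"dose": 0.292, "dose_sq": -0.001, "smoke": 0.774, "product": -0.094}
SMOKE_CATEGORY = 3  # assumed cig/day category value for 20 cigarettes per day

def linear_predictor(coef, exposed, smoker):
    """Value of x(t)beta for the four exposure combinations in the text."""
    u = 0.0
    if exposed:
        u += coef["dose"] + coef["dose_sq"]
    if smoker:
        u += coef["smoke"]
    if exposed and smoker:
        u += coef["product"] * SMOKE_CATEGORY
    return u

for label, exposed, smoker in [("radiation only", True, False),
                               ("smoking only", False, True),
                               ("both", True, True)]:
    rr_exp = math.exp(linear_predictor(EXP_FORM, exposed, smoker))
    rr_lin = 1.0 + linear_predictor(LIN_FORM, exposed, smoker)
    print(f"{label:15s} exponential form {rr_exp:.2f}   linear form {rr_lin:.2f}")
```

Comparing the joint relative risk with the product (exponential form) or the sum of excess relative risks (linear form) of the single-exposure values shows directly how the negative product-term coefficient produces a less than multiplicative, and possibly less than additive, joint effect.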

Distribution Theory
Table 3. Relative risk regression analyses of cigarette smoking and radiation exposure in relation to all nonhematologic cancer mortality. Ex-smokers and smokers with duration of smoking between 5 and 20 years are excluded, while short-term smokers (<5 years) are pooled with nonsmokers.

Rigorous distribution theory for the maximum partial likelihood estimator and corresponding baseline disease rate estimators is given in Andersen and Gill (13). Their work was limited to the exponential relative risk form, a restriction removed by Prentice and Self (14). An overview of these developments will be given here. It is convenient to change notation slightly and to assume a single stratum for notational ease. Denote the n subjects in the cohort by i = 1, ..., n. For the i-th subject, define the counting process $N_i(t)$ to take value [...]

Two basic assumptions underlie the regression analyses described above. An independent failure time assumption among distinct study subjects requires
$$\lambda_i(t;\mathcal{F}_{t^-}) = \lambda_i(t;\mathcal{F}^{i}_{t^-}), \quad \text{all } (i,t) \tag{7}$$
indicating that the disease rate at time t for the i-th subject does not depend on data recorded for other study subjects. An independent censorship assumption requires further that
$$\lambda_i(t;\mathcal{F}_{t^-}) = Y_i(t)\, \lambda\{t;Z_i(t)\}, \quad \text{all } (i,t). \tag{8}$$
The probability that subject i develops disease at $t_i$, given $\mathcal{F}_{t_i^-}$ and given exactly one disease occurrence at $t_i$, is easily calculated as
$$\frac{\lambda_i(t_i;\mathcal{F}_{t_i^-})}{\sum_{l=1}^{n} \lambda_l(t_i;\mathcal{F}_{t_i^-})} = \frac{Y_i(t_i)\, r\{x_i(t_i)\beta\}}{\sum_{l=1}^{n} Y_l(t_i)\, r\{x_l(t_i)\beta\}}$$
and, as before, a partial likelihood function for $\beta$ is given by the product of these factors over the observed disease occurrence times. The reason for introducing counting process and stochastic integral notation in this context is to make use of the counting process decomposition
$$N_i(t) = \Lambda_i(t) + M_i(t), \quad i = 1, \ldots, n$$
where $M_i$ is a locally square integrable martingale and, under slight regularity (15), the cumulative intensity process $\Lambda_i$ relates to the above disease rate process via
$$\Lambda_i(t) = \int_0^t \lambda_i(u;\mathcal{F}_{u^-})\, du.$$
The relative risk regression model specifies
$$\lambda\{t;Z_i(t)\} = \lambda_0(t)\, r\{x_i(t)\beta\}, \quad \text{all } (i,t). \tag{9}$$
Together Eqs. (7), (8), and (9) imply
$$\lambda_i(t;\mathcal{F}_{t^-}) = Y_i(t)\, \lambda_0(t)\, r\{x_i(t)\beta\}. \tag{10}$$
The disease rate process, which has been modeled via Eq. (10) as a relative risk regression model, then has a representation as a counting process intensity. This representation allows convergence results for stochastic integrals over martingales to be applied in order to develop asymptotic convergence results for the maximum partial likelihood estimate and related quantities. Since convergence results for stochastic integrals with respect to martingales require the integrand to be a "predictable" process, it is natural to require the processes appearing in Eq. (10), namely the censoring process $Y_i$ and the regression process $x_i$, to have sample paths that are left continuous with right hand limits. The principal results to arise from applying martingale convergence theory are: (i) $n^{-1/2}\, \partial \log L(\beta)/\partial\beta$ converges in distribution to a normal variate with mean zero and with variance matrix consistently estimated by $\hat{I} = -n^{-1}\, \partial^2 \log L(\hat\beta)/\partial\beta^2$; (ii) $n^{1/2}(\hat\beta - \beta)$ converges in distribution to a normal variate with mean zero and with variance matrix consistently estimated by $\hat{I}^{-1}$; and (iii) $n^{1/2}(\hat\Lambda_0 - \Lambda_0)$ converges to a certain Gaussian process, where $\hat\Lambda_0$ is a natural estimator of the cumulative baseline disease rate $\Lambda_0(t) = \int_0^t \lambda_0(u)\, du$. Without going into detail, sufficient conditions for these convergences include a finite follow-up period, the asymptotic stability of certain processes arising in $\log L(\beta)$ and its first and second derivatives, a Lindeberg condition, certain asymptotic regularity conditions and, in order to accommodate regression forms other than $r(\cdot) = \exp(\cdot)$, a regression positivity condition and a condition to assure the asymptotic stability of the observed information matrix. In spite of this rather lengthy list, these conditions are collectively quite unrestrictive. For example, it is not necessary that $[N_i, Y_i, Z_i]$ be independent and identically distributed.
Recently Gill (16) has given an informal and intuitive presentation of this martingale approach.
An interesting technical point in this use of stochastic covariates relates to the fact that the disease rate process being modeled, namely $\lambda_i(t;\mathcal{F}_{t^-})$, conditions on the subject's entire preceding covariate history. Risk factor associations of interest may, however, involve the relationship between disease rate and a subset of the preceding covariate history. For example, Table 1 above is concerned with cardiovascular disease rates in relation to previous blood pressure measurements, but only blood pressure measurements recorded in the immediately preceding examination cycle. An application of the asymptotic results just mentioned to Table 1 would then implicitly require one to assume disease rates to be independent of earlier blood pressure measurements, given the most recent measurements, an assumption not substantiated by Table 2. To address this issue, Self and Prentice, in a submitted manuscript, have generalized the above results to allow aspects of preceding covariate history to be excluded from the conditioning at the division points of a time axis partition. The relative risk parameter is then chosen to maximize a pseudo-likelihood function that is the product of partial likelihoods over the elements of the time axis partition. The maximum pseudo-likelihood estimate is identical to that which would be obtained by specifying an oversimplified intensity process model that involves only selected aspects of the preceding covariate history. An adjustment is required, however, to give a consistent variance estimator for this maximum pseudo-likelihood estimator.

Generalizations and Current Status of Relative Risk Regression Methods
The counting process formulation described above encompasses multivariate failure time data. Such a feature may be useful, for example, in studying the epidemiology of epileptic seizures or asthmatic attacks. Prentice et al. (17) and Andersen and Gill (13) consider relative risk regression models of the type
$$\lambda_i(t;\mathcal{F}_{t^-}) = \lambda_{0s}(t)\, r\{x(t)\beta_s\} \tag{11}$$
which merely continue the intensity process modeling for the i-th subject beyond the first failure time to the times of second and subsequent failures. Note that in Eq. (11) $\mathcal{F}_{t^-}$ will include the counting process histories, including all preceding random failure times, for each subject and that the stratification $s = s(t)$ and regression variable may be defined to reflect aspects of the subject's preceding failure time information. For example, the subject may be required to move to the next stratum whenever the subject experiences a failure. Model (11) directly gives rise to a partial likelihood function to which the asymptotic results previously cited apply. A second class of multivariate relative risk regression models (17) can be written
$$\lambda_i(t;\mathcal{F}_{t^-}) = \lambda_{0s}(t - t_i^*)\, r\{x(t)\beta_s\} \tag{12}$$
where $t_i^*$ is the most recent random failure time on subject i prior to time t. This model also naturally gives rise to a partial likelihood function provided the stratification is fine enough to require the subject to enter a new stratum each time the subject experiences a failure. Formal asymptotic results for such estimation have been given in certain special cases (18).
Competing risk generalizations of relative risk regression models have also been described (19). Specifically, if m distinct disease categories may arise in a follow-up study, a relative risk regression model
$$\lambda\{t,j;Z(t)\} = \lambda_{0j}(t)\, r\{x(t)\beta_j\}$$
may be specified for the rate of disease j occurrence, for selected values of $j \in \{1, 2, \ldots, m\}$. Straightforward partial likelihood estimation of the disease-j relative risk regression parameter $\beta_j$ proceeds by regarding disease occurrences of types other than j as censored.
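In practice, regarding other disease types as censored amounts to a simple recoding of the event indicator before the disease-j partial likelihood is formed. The Python sketch below illustrates that recoding only; the coding convention (0 for a censored subject, 1 to m for the disease category observed) and the function name are assumptions for the example.

```python
def cause_specific_events(causes, j):
    """Event indicators for disease type j.

    causes : per-subject code, 0 if censored, otherwise the disease
             category (1..m) observed at the end of follow-up
    Occurrences of any type other than j are treated as censored.
    """
    return [1 if c == j else 0 for c in causes]

# Made-up outcome codes for six subjects and m = 3 disease categories.
causes = [0, 1, 2, 1, 3, 0]
for j in (1, 2, 3):
    print(j, cause_specific_events(causes, j))
```

Each recoded indicator vector, together with the unchanged follow-up times and covariates, can then be used in a partial likelihood of the usual form to estimate the parameter for that disease category.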
Some work (20) has also taken place to allow relative risk regression parameter estimation and testing in the presence of random measurement errors in the covariate processes, a topic of obvious practical importance in epidemiologic research.
In the context of occupational mortality data studies, Breslow et al. (21) have considered the use of external mortality rate data, for example from vital statistical records, in order to partially specify the baseline disease rates $\lambda_{0s}(\cdot)$. In most applications such usage turns out to provide little benefit in respect to estimation efficiency, provided the baseline rates are allowed to differ from the external rates by a scale factor. Such external rates are, of course, indispensable if the cohort is essentially homogeneous in respect to the covariate histories of interest. See Breslow (22) for a discussion of relative risk regression estimation in these circumstances.
To date there has been rather limited study of the sample sizes and data configurations necessary to ensure a good approximation by the asymptotic distributions mentioned above. Johnson et al. (23) describe some simulation results pertinent to a fixed regression vector and exponential form relative risk function.
A full regression approach, of course, requires not only suitable model fitting and estimation procedures, but also a range of procedures for model criticism. In general the area of model criticism is at a rather early stage of development for relative risk regression methods. Some relevant works include proposals in respect to tests of fit (24), residuals (25-28), regression diagnostics (29), and choice of relative risk form (11).

Relative Risk Regression for Time-Matched Case-Control Studies
Suppose now that a large population is being monitored for disease occurrence, perhaps by means of a cancer registry or by a mortality index. It would often be impractical to enumerate and collect covariate data on such a large cohort, and furthermore, since disease rates are likely to be low, the data on many of the individuals who do not develop disease in some defined "follow-up" period will be largely redundant. The case-control design provides a valuable and much used alternative to the cohort design in such circumstances. A time-matched case-control study would proceed by matching each case that arises in some defined case accession period to one or more control subjects who are without disease at the time of case ascertainment. Here time would usually refer to age, although other specifications, including calendar time, may be preferable in some applications. The cases ascertained by the disease surveillance system should be representative of the cases arising in the population in respect to their prior covariate histories, and controls selected should be representative of the sub-population who are without disease at the "time" of control ascertainment. Controls may also be matched to cases in respect to other potential confounding factors, in which case the controls corresponding to a specific case need only be representative of the disease-free group in that stratum at the time of case occurrence. Upon selection the covariate histories $Z(t)$ are ascertained retrospectively for cases and controls, usually by personal interview. Here t refers to the time of case occurrence. A major methodologic concern relates to the ability to retrospectively construct accurate covariate histories, and to do so equally for cases and controls (recall bias), and the ability to sample randomly from case and control populations (selection bias). Assuming these concerns are met, a relative risk regression model (6) is readily applied to case-control data (30,31).
Specifically, a suitable likelihood function is again given by Eq. (5), where $t_{s1}, \ldots, t_{sd}$ are the times of case ascertainment in the s-th stratum and $R_s(t_{si})$ consists only of the case occurring at $t_{si}$ along with its corresponding time- and stratum-matched controls. The (s,i) factor of Eq. (5) can be derived as the conditional probability that covariate history $Z_{si}(t_{si})$, giving rise to the regression vector $x_{si}(t_{si})$, corresponds to the diseased individual, given the set of covariate histories $[Z_l(t_{si}),\ l \in R_s(t_{si})]$ and the fact that $R_s(t_{si})$ includes exactly one case. This assertion requires an independent disease times assumption. Such an assumption furthermore implies that the contributions to Eq. (5) at distinct (s,i) are statistically independent, since distinct individuals are involved at each (s,i), so that Eq. (5) has a conditional likelihood interpretation. It follows that standard asymptotic likelihood methods can be expected to apply to Eq. (5) under time-matched case-control sampling and mild conditions (32). Note that, under model (6), covariate histories need be assembled only to the point of permitting x(t) to be specified at the time of occurrence for a case, or at the time of the corresponding case occurrence for a matched control.
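As an illustration of the conditional likelihood just described, the following Python sketch evaluates the log of the product of (s,i) factors over a collection of matched sets. It assumes the exponential relative risk form exp{x(t)β}; other relative risk forms would replace the exponential. The function name `matched_set_log_likelihood` and the data layout (one case vector plus a list of control vectors per matched set, all evaluated at the case occurrence time) are hypothetical choices made for this sketch, not notation from the text.

```python
import math


def matched_set_log_likelihood(beta, matched_sets):
    """Log conditional likelihood over time-matched case-control sets.

    ASSUMPTION: exponential relative risk exp(x . beta).
    Each matched set is (case_x, control_xs): the regression vector of the
    case and a list of regression vectors for its matched controls, all
    evaluated at the time of case occurrence.
    """
    total = 0.0
    for case_x, control_xs in matched_sets:
        case_score = sum(b * x for b, x in zip(beta, case_x))
        scores = [case_score]
        for cx in control_xs:
            scores.append(sum(b * x for b, x in zip(beta, cx)))
        # log of the denominator, computed stably via log-sum-exp
        m = max(scores)
        log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
        # log of: exp(case_score) / sum over the matched set of exp(score)
        total += case_score - log_denom
    return total
```

With β = 0 every member of a matched set is equally likely to be the case, so a set with one case and three controls contributes log(1/4). Maximizing this function over β, for example with a general-purpose optimizer, gives the conditional likelihood estimate under the stated assumptions.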

Synthetic Case-Control and Case-Cohort Designs
Consider again the cohort study discussed above. Partial likelihood estimation based on Eq. (5) can be computationally intensive, especially with large cohorts and time-dependent regression variables. Consequently, a number of authors (31,33-37) have suggested the imposition of case-control sampling on the cohort for computational reasons. This idea involves replacing the denominator in each (s,i) factor of Eq. (5) by a summation over a set that includes only the subject developing disease at $t_{si}$ and a comparison group randomly selected from $R_s(t_{si})$. In many situations, selection of as few as five "controls" per case will yield regression parameter estimates of high efficiency (e.g., 80% or more) compared to a full cohort analysis, though Breslow et al. (21) indicate that twenty or more controls per case may be necessary to ensure good efficiency in the presence of large relative risks and unbalanced regression variable distributions.
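The sampling step can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions: a single stratum, no tied disease times, and a hypothetical record layout in which each subject is a dictionary with keys `id`, `entry`, `exit`, and `event`. The function name `sample_synthetic_risk_sets` is likewise an invention of this sketch.

```python
import random


def sample_synthetic_risk_sets(subjects, controls_per_case, seed=0):
    """Replace each full risk set by the case plus randomly sampled controls.

    ASSUMPTIONS: one stratum, no tied disease times, and subjects given as
    dicts with 'id', 'entry', 'exit' (end of follow-up) and 'event' (True if
    follow-up ended with disease occurrence).
    Returns a list of (case_time, case_id, sampled_ids), case listed first.
    """
    rng = random.Random(seed)
    sampled = []
    cases = sorted((s for s in subjects if s["event"]), key=lambda s: s["exit"])
    for case in cases:
        t = case["exit"]
        # Subjects other than the case who are under observation at time t.
        # Subjects who develop disease later remain eligible as controls.
        at_risk = [
            s["id"]
            for s in subjects
            if s["id"] != case["id"] and s["entry"] < t <= s["exit"]
        ]
        k = min(controls_per_case, len(at_risk))
        controls = rng.sample(at_risk, k)
        sampled.append((t, case["id"], [case["id"]] + controls))
    return sampled
```

Each returned set would then take the place of the full risk set in the denominator of the corresponding factor of Eq. (5), with regression vectors evaluated at the case time.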
The synthetic case-control approach is a useful aid to the data analyst in the exploration of a large cohort data set. Not only might risk sets in Eq. (5) involving several thousand subjects be replaced by sets involving only 10 or 20 subjects, but also only a single fixed regression vector x(t) needs to be stored for each subject selected.
Equally important, the synthetic case-control approach gives the possibility of considerable cost saving in relative risk estimation in circumstances wherein assembly of key covariate data requires expensive synthesis of specimens or other raw materials that have been collected and stored during the course of a cohort study. For example, a number of prominent cohort studies and disease prevention trials have developed blood serum banks on large numbers of participating subjects. The use of these serum samples, for example to relate biochemical factors to subsequent disease incidence, may, however, be prohibitively expensive. The synthetic case-control design allows efficient relative risk estimation based on serum analyses for cases and a small number of time-matched controls. A full cohort analysis, on the other hand, would typically involve a much larger number of serum analyses. The synthetic case-control design does not, however, appear to be the most efficient approach to this type of estimation problem. In particular, a given subject could properly serve as a control for a number of cases arising at times during the subject's risk period, whereas the synthetic case-control approach rather arbitrarily links a specific control subject to a single case. Prentice has proposed (38) a case-cohort design to avoid this limitation. In such a design a subcohort is randomly selected from the entire cohort to serve as a comparison group for all cases arising during follow-up. The sampling can be relaxed to allow different sampling fractions among baseline-defined strata.
Estimation can then be based on Eq. (5) with the risk set $R_s(t_{si})$ replaced by a set that consists only of the case occurring at $t_{si}$ and the subcohort risk set at $t_{si}$.
It follows that covariate histories need be assembled only for cases and subcohort members. With the risk sets modified as just mentioned, standard asymptotic likelihood formulae can evidently be applied to Eq. (5), with a modification to the score statistic variance to accommodate a correlation among score statistic contributions within a stratum. Specifically, the score statistic contribution at $t_{si}$ will typically be weakly correlated with score statistic contributions at $t_{sj}$, $j < i$, whenever the disease occurrence at $t_{si}$ arises outside the selected subcohort.
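The construction of the modified risk sets can be sketched in Python as follows. The sketch assumes a single stratum, simple random sampling of a subcohort of fixed size, no tied disease times, and the same hypothetical record layout (`id`, `entry`, `exit`, `event`) used for illustration only. The function name `case_cohort_risk_sets` is an invention of this sketch, and the score variance modification mentioned above is not implemented here.

```python
import random


def case_cohort_risk_sets(subjects, subcohort_size, seed=0):
    """Build case-cohort risk sets: the case plus subcohort members at risk.

    ASSUMPTIONS: one stratum, a simple random subcohort of fixed size, no
    tied disease times, and subjects given as dicts with 'id', 'entry',
    'exit' (end of follow-up) and 'event' (True if disease occurred).
    Returns (subcohort_ids, risk_sets), where each risk set is a tuple
    (case_time, case_id, sorted_member_ids).
    """
    rng = random.Random(seed)
    ids = [s["id"] for s in subjects]
    subcohort = set(rng.sample(ids, subcohort_size))
    risk_sets = []
    cases = sorted((s for s in subjects if s["event"]), key=lambda s: s["exit"])
    for case in cases:
        t = case["exit"]
        # Subcohort members under observation at the case time t.
        members = {
            s["id"]
            for s in subjects
            if s["id"] in subcohort and s["entry"] < t <= s["exit"]
        }
        # The case always belongs to its own risk set, whether or not it
        # was selected into the subcohort.
        members.add(case["id"])
        risk_sets.append((t, case["id"], sorted(members)))
    return subcohort, risk_sets
```

Unlike the synthetic case-control sketch, a single subcohort member here can appear in the risk sets of many cases. Covariate histories would be needed only for the union of the returned subcohort and the case identifiers.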
This work was supported by grants GM-24472, GM-28314 and CA-34847 from the National Institutes of Health. Parts of this manuscript are identical to subsets of a manuscript by R. L. Prentice and V. T. Farewell, which appeared in the Proceedings of Second IMACS Symposium on Biomedical Systems Modeling, North Holland, Amsterdam.