Risk ratio estimation in case-cohort studies.

In traditional (cumulative-incidence) case-control studies, the exposure odds ratio can be used as an estimator of the risk ratio only when the disease under study is rare. The case-cohort study is a recently developed useful modification of the case-control study. This design allows direct estimation of the risk ratio from a fixed cohort, but does not require any rare-disease assumption. This article reviews recent developments in risk ratio estimation procedures for the analysis of case-cohort data. In the crude analysis, it is shown that the empirical risk ratio estimator is not fully efficient, and the maximum likelihood estimation of the crude risk ratio is discussed. In the stratified analysis, several common risk ratio estimation procedures and standardization methods have been proposed for large strata. However, the Mantel-Haenszel risk ratio and its variance estimator are the only available methods for sparse data.


Introduction
Cohort and case-control studies are well-established epidemiologic designs for studying individual-level exposure-disease relationships. Suppose we are interested in estimating a risk ratio, that is, a ratio of incidence proportions between the exposed and unexposed populations. In fixed cohort studies, the exposed and unexposed subjects, initially disease-free, are followed over a given risk period. We then ascertain the incidence proportions in these two groups and obtain an estimate of the risk ratio. In traditional case-control studies, cases of a study disease are sampled from all incident cases in a fixed cohort and controls are sampled from noncases, the population at risk at the end of the risk period. Exposure histories among cases and controls are identified retrospectively and compared. Under such cumulative-incidence sampling of controls (1), we cannot estimate incidence proportions without external information. However, we may use the exposure odds ratio as a good approximation of the risk ratio when the disease under study is "rare" (2).
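The rare-disease condition can be checked numerically. The sketch below is illustrative only: the incidence proportions are hypothetical values, not data from any study, and the odds ratio is computed directly from the two risks on the assumption that cases and noncases are sampled independently of exposure.

```python
def risk_ratio(p1: float, p0: float) -> float:
    """Ratio of incidence proportions, exposed over unexposed."""
    return p1 / p0


def odds_ratio(p1: float, p0: float) -> float:
    """Odds ratio that cumulative-incidence case-control sampling estimates."""
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


if __name__ == "__main__":
    # Hypothetical baseline risks; the true risk ratio is held at 2.
    for p0 in (0.001, 0.01, 0.1, 0.3):
        p1 = 2.0 * p0
        print(f"p0={p0:<6} RR={risk_ratio(p1, p0):.2f} OR={odds_ratio(p1, p0):.2f}")
```

The odds ratio stays close to 2 only while both risks are small. With a baseline risk of 0.3 it is 3.5, so the approximation overstates the risk ratio considerably.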
In 1975, Kupper et al. (3) proposed a useful modification of traditional case-control studies. In their design, cases are sampled from all incident cases, as in traditional case-control studies, but controls are sampled from the initial cohort members (the population at risk at the start of the risk period) regardless of their future disease status. This design allows estimation of the risk ratio without the need for the rare-disease assumption. Since it is a compromise between fixed cohort and case-control studies, Kupper et al. called it the hybrid epidemiologic design. It is also called the case-base (4) or case-cohort (5) study, because the control group is a sample from the study "base" or the full cohort. (Some use the term case-base for risk ratio estimation and case-cohort for incidence rate ratio estimation (6), but I use the term case-cohort throughout the article.) In this article, I will review recent developments in risk ratio estimation procedures in case-cohort studies, and discuss the maximum likelihood method and sparse risk ratio estimation.

This paper was presented at the 4th Japan-US Biostatistics Conference on the Study of Human Cancer held 9-11 November 1992 in Tokyo, Japan. This work was supported by grant-in-aid for scientific research 04857063.

Crude Analysis
Suppose that a fixed cohort of $N$ initially disease-free subjects is followed for a given risk period and that $M$ of the $N$ subjects develop the disease under study by the end of the risk period. In case-cohort studies, $m$ cases are randomly selected from the total of $M$ incident cases with a sampling proportion $r_1$, and $n$ controls (the subcohort) are randomly selected from the $N$ initial cohort members with a sampling proportion $r_0$ (3,4). We assume that $(N, M)$ and $(r_1, r_0)$ are unobservable.
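The sampling scheme can be made concrete with a small simulation. This is a minimal sketch under stated assumptions: the function name, the parameter values, and the use of fixed sample sizes `round(r1 * M)` and `round(r0 * N)` are illustrative choices. The cell labels follow one reading of the notation (index 0 for sampled cases outside the subcohort, 1 for subcohort cases also in the case sample, 2 for subcohort cases not in the case sample), which is an assumption because Table 1 is not reproduced here.

```python
import random


def simulate_case_cohort(N, pe, p1, p0, r1, r0, seed=0):
    """Draw one case-cohort sample and return the eight cell counts.

    Assumed cell layout (hypothetical reading of the notation):
      a0/b0: sampled cases outside the subcohort (exposed/unexposed)
      a1/b1: subcohort cases that are also in the case sample
      a2/b2: subcohort cases that are not in the case sample
      c/d:   subcohort noncases (exposed/unexposed)
    """
    rng = random.Random(seed)
    exposed = [rng.random() < pe for _ in range(N)]
    diseased = [rng.random() < (p1 if x else p0) for x in exposed]

    # Cases are sampled from all incident cases; the subcohort is sampled
    # from the whole initial cohort regardless of later disease status.
    case_ids = [i for i in range(N) if diseased[i]]
    case_sample = set(rng.sample(case_ids, round(r1 * len(case_ids))))
    subcohort = set(rng.sample(range(N), round(r0 * N)))

    counts = dict.fromkeys(("a0", "a1", "a2", "c", "b0", "b1", "b2", "d"), 0)
    for i in case_sample | subcohort:
        if not diseased[i]:
            cell = "c" if exposed[i] else "d"
        else:
            if i not in subcohort:
                index = "0"
            elif i in case_sample:
                index = "1"
            else:
                index = "2"
            cell = ("a" if exposed[i] else "b") + index
        counts[cell] += 1
    return counts


if __name__ == "__main__":
    print(simulate_case_cohort(N=2000, pe=0.3, p1=0.2, p0=0.1, r1=0.5, r0=0.1))
```

In a real study only these cell counts are available to the analyst. The cohort size, the number of incident cases, and the sampling proportions used to generate them are treated as unknown.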
The subcohort may contain cases (7); some are included in the case sample and some are not. The observed and expected counts in the case-cohort sample are shown in Table 1. Here $p_1$ and $p_0$ are the incidence proportions in the exposed and the unexposed, and $p_e$ is the exposure prevalence in the initial cohort. Let $a_+ = a_0 + a_1 + a_2$ and $b_+ = b_0 + b_1 + b_2$ be all the exposed and the unexposed cases; $e = a_1 + a_2$ and $f = b_1 + b_2$, the exposed and the unexposed cases in the subcohort; and $n_1 = a_1 + a_2 + c$ and $n_0 = b_1 + b_2 + d$, the exposed and the unexposed in the subcohort.
We assume that the appropriate effect measure is the risk ratio, which is defined by

$$\phi = \frac{p_1}{p_0} = \frac{P(E|D)/[1 - P(E|D)]}{p_e/(1 - p_e)},$$

where $P(E|D)$ is the exposure prevalence in diseased cases. Since $a_+/b_+$ and $n_1/n_0$ consistently estimate $P(E|D)/[1 - P(E|D)]$ and $p_e/(1 - p_e)$, respectively, the empirical estimator of the crude risk ratio (3,4) is given by

$$\hat\phi_E = \frac{n_0 a_+}{n_1 b_+}. \qquad [1]$$

Kupper et al. (3) considered the situation in which the sampled cohort was the target population. As criticized by Mantel (8), they failed to take account of random incidence variation, and hence their confidence interval method is not valid for inference beyond that cohort. The correct variance estimate of $\log\hat\phi_E$ is given by

$$V_E = \frac{1}{a_+} + \frac{1}{b_+} + \frac{1}{n_1} + \frac{1}{n_0} - 2\left(\frac{e}{a_+ n_1} + \frac{f}{b_+ n_0}\right),$$

which was independently derived by Greenland and Nurminen (S. Greenland, personal communication, 1992), as reported in Miettinen (4). When the full cohort is observed ($r_1 = r_0 = 1$), $\hat\phi_E$ becomes the full cohort risk ratio, and $V_E$ becomes identical to the variance estimator of its logarithm. When the subcohort has no cases, $\hat\phi_E$ becomes the odds ratio and $V_E$ becomes identical to the variance estimator of the log odds ratio (9).

For risk ratio estimation in Equation 1, we do not exclude the cases from the subcohort. Miettinen (4), however, noted that we should make a usual case and noncase comparison, as in traditional case-control studies, when testing zero exposure effect. He gave the simple Pearson chi-square statistic

$$X_0^2 = \frac{t\,(a_+ d - b_+ c)^2}{(a_+ + b_+)(c + d)(a_+ + c)(b_+ + d)},$$

where $t = a_+ + b_+ + c + d$ is the total number of distinct subjects in the case-cohort sample. Given that the exposure has no effect, $X_0^2$ has an approximately chi-square distribution with one degree of freedom (d.f.).

Environmental Health Perspectives 53

[...]

Since the test and the confidence interval method in the above example are inconsistent, Nurminen (9) [...] where summations are over all strata. Greenland (11) showed that this summary risk ratio is asymptotically biased. The large-strata expectation becomes [equation not recoverable]. It does not reduce to the common value $\phi$ except when $\phi = 1$.
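Returning to the crude analysis, the empirical estimator in Equation 1, its log-scale variance, and the case versus noncase chi-square test can be sketched as follows. Several points are assumptions rather than quotations: the variance line uses $1/a_+ + 1/b_+ + 1/n_1 + 1/n_0 - 2[e/(a_+ n_1) + f/(b_+ n_0)]$, a delta-method expression that reproduces both limiting cases described above; the test is the ordinary Pearson statistic for the 2x2 table of cases and subcohort noncases; and the 95% Wald interval on the log scale is added for illustration. The function name is hypothetical, and the counts are those of stratum 2 of the stratified example given later in the article, used here purely as an illustration.

```python
import math


def crude_case_cohort(a0, a1, a2, c, b0, b1, b2, d):
    """Crude risk ratio analysis of a single case-cohort table (sketch)."""
    a_plus, b_plus = a0 + a1 + a2, b0 + b1 + b2   # all exposed / unexposed cases
    e, f = a1 + a2, b1 + b2                       # cases inside the subcohort
    n1, n0 = e + c, f + d                         # exposed / unexposed subcohort

    rr = (n0 * a_plus) / (n1 * b_plus)            # Equation 1
    # Assumed delta-method variance of log(rr); the covariance terms arise
    # because subcohort cases enter both the case and the subcohort margins.
    var_log = (1 / a_plus + 1 / b_plus + 1 / n1 + 1 / n0
               - 2 * (e / (a_plus * n1) + f / (b_plus * n0)))
    half_width = 1.96 * math.sqrt(var_log)
    ci = (rr * math.exp(-half_width), rr * math.exp(half_width))

    # Pearson chi-square comparing cases with subcohort noncases.
    t = a_plus + b_plus + c + d
    x2 = (t * (a_plus * d - b_plus * c) ** 2
          / ((a_plus + b_plus) * (c + d) * (a_plus + c) * (b_plus + d)))
    return {"rr": rr, "var_log": var_log, "ci95": ci, "x2": x2}


if __name__ == "__main__":
    print(crude_case_cohort(a0=8, a1=0, a2=1, c=41, b0=6, b1=1, b2=0, d=190))
```

Setting `a0 = b0 = 0` reduces the variance to the familiar full cohort form, and setting `a1 = a2 = b1 = b2 = 0` reduces it to the log odds ratio variance, which is a quick way to check the expression against the two limiting cases.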
Unbiased adjustment methods have been given by Greenland (11). Applying Miettinen's arguments (12), he derived the SMR estimator

$$\hat\phi_{SMR} = \frac{\sum_k a_{+k}}{\sum_k n_{1k} b_{+k}/n_{0k}}$$

and the large-strata variance estimator of its logarithm

$$V_{SMR} = \frac{\sum_k a_{+k}^2 V_{Ek}}{\left(\sum_k a_{+k}\right)^2}.$$

We may use the stratum-specific maximum likelihood estimators in the SMR estimator (10).
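As a minimal sketch, the SMR point estimate is the observed number of exposed cases divided by the number expected if the exposed had the risk of the unexposed in each stratum. The function name and the dictionary layout are hypothetical; the margins are computed from the two-stratum example given later in the article.

```python
def smr_risk_ratio(strata):
    """SMR-type summary risk ratio from per-stratum margins (sketch)."""
    observed = sum(s["a_plus"] for s in strata)
    # n1 * b_plus / n0 scales the unexposed cases to the exposed subcohort size.
    expected = sum(s["n1"] * s["b_plus"] / s["n0"] for s in strata)
    return observed / expected


if __name__ == "__main__":
    strata = [
        {"a_plus": 83, "b_plus": 2, "n1": 84, "n0": 19},
        {"a_plus": 9, "b_plus": 7, "n1": 42, "n0": 191},
    ]
    print(round(smr_risk_ratio(strata), 3))
```

Because the estimator divides by $n_{0k}$ within each stratum, it is unstable when unexposed subcohort members are few, which is one reason it is described as a large-strata method.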
Modifications are quite simple: change the numbers of the exposed and the unexposed in the subcohort $(n_{1k}, n_{0k})$ to their maximum likelihood estimators $(\tilde n_{1k}, \tilde n_{0k})$.

Although the SMR does not require risk ratio homogeneity, we will have a more efficient estimator for the common risk ratio when the stratum-specific risk ratios are common across strata. By analogy with the Mantel-Haenszel odds ratio (14), Miettinen (15) [...]

Greenland (11) gave two closed-form Mantel-Haenszel type estimators for the common risk ratio. One is the Tarone estimator:

$$\hat\phi_T = \frac{\sum_k n_{0k} a_{+k}/s_k}{\sum_k n_{1k} b_{+k}/s_k},$$

where $s_k = a_{0k} + b_{0k} + c_k + d_k$. It is the inverse null variance weighting of the stratum-specific risk ratios, and is asymptotically fully efficient under zero exposure effect. When we study the full cohort, $\hat\phi_T$ becomes identical to the Tarone estimator for the common risk ratio (16). The other is the Mantel-Haenszel estimator:

$$\hat\phi_{MH} = \frac{\sum_k n_{0k} a_{+k}/t_k}{\sum_k n_{1k} b_{+k}/t_k},$$

where $t_k = a_{+k} + b_{+k} + c_k + d_k$ is the total number of distinct subjects in the $k$th stratum. The Mantel-Haenszel estimator is dually consistent for $\phi$, that is, consistent in both the large-strata and the sparse-data (the number of strata $K$ becomes large, as in the matched sample) limiting cases, while the Tarone estimator is consistent only in the large-strata case. Greenland (11) gave the large-strata variance estimator of $\log\hat\phi_T$ (and implicitly of $\log\hat\phi_{MH}$). The dually consistent variance estimator of $\log\hat\phi_{MH}$ is given by

$$V_{MH} = \frac{\sum_k W_k/t_k^2}{\left(\sum_k n_{0k} a_{+k}/t_k\right)\left(\sum_k n_{1k} b_{+k}/t_k\right)},$$

where $W_k = (b_{0k} + d_k) n_{1k} a_{+k} + (a_{0k} + c_k) n_{0k} b_{+k} + a_{0k} d_k + b_{0k} c_k$ (17). With the full cohort observed, $V_{MH}$ becomes identical to the Mantel-Haenszel variance derived by Greenland and Robins (18). By changing the $t_k$ in $V_{MH}$ to $s_k$, we have the variance estimator of $\log\hat\phi_T$, but it is consistent only in large strata. A confidence interval method based on the estimating function has also been proposed (17).

Three other large-strata common risk ratio estimators, more efficient than the Tarone or the Mantel-Haenszel estimator, are available.
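The two closed-form estimators differ only in the stratum weight, so one routine can compute both. The following is a sketch under stated assumptions: the function name and data layout are hypothetical, and the variance term is the ratio form $\sum_k W_k/t_k^2$ over the product of the numerator and denominator sums, with $W_k = (b_{0k} + d_k) n_{1k} a_{+k} + (a_{0k} + c_k) n_{0k} b_{+k} + a_{0k} d_k + b_{0k} c_k$ as best it can be read from the text. With the full cohort observed this form reduces to the Greenland and Robins variance, as the text requires. The counts are those of the two-stratum example given later in the article.

```python
import math


def summary_risk_ratio(strata, weight="mh"):
    """Mantel-Haenszel ('mh') or Tarone ('tarone') common risk ratio (sketch).

    Each stratum is a dict with cells a0, a1, a2, c, b0, b1, b2, d.
    Returns the point estimate and an assumed variance of its logarithm.
    """
    num = den = w_sum = 0.0
    for s in strata:
        a_plus = s["a0"] + s["a1"] + s["a2"]
        b_plus = s["b0"] + s["b1"] + s["b2"]
        n1 = s["a1"] + s["a2"] + s["c"]
        n0 = s["b1"] + s["b2"] + s["d"]
        t_k = a_plus + b_plus + s["c"] + s["d"]       # distinct subjects
        s_k = s["a0"] + s["b0"] + s["c"] + s["d"]     # Tarone weight
        w = t_k if weight == "mh" else s_k
        num += n0 * a_plus / w
        den += n1 * b_plus / w
        # W_k as read from the text (assumption where the source is unclear).
        big_w = ((s["b0"] + s["d"]) * n1 * a_plus
                 + (s["a0"] + s["c"]) * n0 * b_plus
                 + s["a0"] * s["d"] + s["b0"] * s["c"])
        w_sum += big_w / w ** 2
    return num / den, w_sum / (num * den)


if __name__ == "__main__":
    data = [
        dict(a0=74, a1=4, a2=5, c=75, b0=2, b1=0, b2=0, d=19),
        dict(a0=8, a1=0, a2=1, c=41, b0=6, b1=1, b2=0, d=190),
    ]
    for kind in ("mh", "tarone"):
        rr, v = summary_risk_ratio(data, weight=kind)
        half = 1.96 * math.sqrt(v)
        print(kind, round(rr, 2), round(rr * math.exp(-half), 2), round(rr * math.exp(half), 2))
```

The Wald interval printed here is an illustrative addition. The interval based on the estimating function mentioned in the text is a different method and is not implemented in this sketch.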
Greenland (11) gave the Woolf (weighted least squares) estimator based on the stratum-specific empirical risk ratios. Using the corresponding maximum likelihood estimators, we have the modified Woolf estimator

$$\log\hat\phi_W^* = \frac{\sum_k \log\hat\phi_{MLk}/V_{MLk}}{\sum_k 1/V_{MLk}}$$

and the large-strata variance estimator of $\log\hat\phi_W^*$,

$$V_W^* = \left(\sum_k 1/V_{MLk}\right)^{-1}.$$

The following two estimators do not have a closed form. Nurminen (9) proposed an estimator as an extension of the cohort chi-square function approach (19) [...] which has an asymptotically chi-square distribution with one degree of freedom under zero exposure effect (11,15). This test is applicable to both the large-strata and the sparse-data cases.

Example 2. Consider stratified case-cohort data with $K = 2$: $a_{01} = 74$, $a_{11} = 4$, $a_{21} = 5$, $c_1 = 75$, $b_{01} = 2$, $b_{11} = 0$, $b_{21} = 0$, and $d_1 = 19$ for stratum 1; and $a_{02} = 8$, $a_{12} = 0$, $a_{22} = 1$, $c_2 = 41$, $b_{02} = 6$, $b_{12} = 1$, $b_{22} = 0$, and $d_2 = 190$ for stratum 2 (10,17). The Mantel-Haenszel test gives $X_{MH}^2 = 26.7$ with P value = 0.0, highly significant. Several summary risk ratios and 95% confidence intervals are shown in Table 2.
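The Woolf-type estimator is an inverse-variance weighted mean of the stratum-specific log risk ratios, and its variance is the reciprocal of the summed weights. The sketch below shows only that pooling step. The function name is hypothetical, and the inputs are made-up stratum estimates rather than the maximum likelihood values, whose formulas are not reproduced in this section.

```python
import math


def woolf_pool(log_rrs, variances):
    """Inverse-variance pooling of stratum-specific log risk ratios (sketch)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * x for w, x in zip(weights, log_rrs)) / sum(weights)
    return pooled, 1.0 / sum(weights)


if __name__ == "__main__":
    # Hypothetical stratum-specific risk ratios and log-scale variances.
    log_rrs = [math.log(9.4), math.log(5.8)]
    variances = [0.57, 0.28]
    pooled, var = woolf_pool(log_rrs, variances)
    half = 1.96 * math.sqrt(var)
    print(round(math.exp(pooled), 2),
          round(math.exp(pooled - half), 2),
          round(math.exp(pooled + half), 2))
```

Because each weight is the reciprocal of an estimated variance, a stratum with a zero cell receives an undefined weight, which is why this approach is restricted to large strata.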
Volume 102, Supplement 8, November 1994

The upper half of Table 2 gives the common risk ratio estimates and the lower half gives the indirectly standardized estimates. The Tarone and the Mantel-Haenszel risk ratios give virtually the same results. The Woolf, Nurminen, and maximum likelihood methods also give close point estimates, but the Nurminen method gives the narrower 95% interval. The two SMR estimates are also close.

Concluding Remarks
In the crude analysis of case-cohort data, the maximum likelihood estimator for the risk ratio should be used. It is more efficient than the empirical risk ratio estimator and easy to compute. The chi-square test given by Miettinen (4) is still valid, because it is identical to the efficient score test. In the stratified analysis, there are several options for summary risk ratio estimation in large strata. Greenland (11) gives tentative recommendations on choosing between large-strata estimators. When the data are sparse, the Mantel-Haenszel estimator is the only available common risk ratio estimator. We may improve its efficiency simply by using the contrasts $\tilde n_{0k} a_{+k} - \phi \tilde n_{1k} b_{+k}$ rather than $n_{0k} a_{+k} - \phi n_{1k} b_{+k}$, where $\tilde n_{1k}$ and $\tilde n_{0k}$ are the maximum likelihood estimators of the exposed and unexposed subcohort counts.

This article has reviewed recent developments in risk ratio estimation procedures in case-cohort studies when censoring is unimportant. If censoring is important, the risk ratio estimate not adjusted for it is misleading (20), and a correct risk ratio estimation procedure has been proposed by Flanders et al. (21). When time to response is of primary concern, incidence rate ratio (hazard ratio) estimation is available (5).