Dimension reduction and estimation in the secondary analysis of case-control studies

Studying the relationship between covariates based on retrospective data is the main purpose of secondary analysis, an area of increasing interest. We examine the secondary analysis problem when multiple covariates are available, while only a regression mean model is specified. Despite the completely parametric modeling of the regression mean function, the case-control nature of the data requires special treatment and semi-parametric efficient estimation generates various nonparametric estimation problems with multivariate covariates. We devise a dimension reduction approach that fits with the specified primary and secondary models in the original problem setting, and use reweighting to adjust for the case-control nature of the data, even when the disease rate in the source population is unknown. The resulting estimator is both locally efficient and robust against the misspecification of the regression error distribution, which can be heteroscedastic as well as non-Gaussian. We demonstrate the advantage of our method over several existing methods, both analytically and numerically.


Introduction
Case-control studies are popular tools for investigating risk factors associated with various uncommon diseases, such as cancer and myocardial infarction, often because these studies are less expensive and more convenient to implement than designs such as cross-sectional and prospective cohort studies. Typically, a population-based case-control study employs a random sample of cases (diseased subjects) and a separate random sample of controls (non-diseased subjects). It also collects covariate asymptotically normally distributed even if the posited functions are incorrectly specified; and (b) it is efficient if the posited functions are correctly specified. An estimator with the properties (a) and (b) will be called locally efficient throughout this article.
Because the approach of Ma and Carroll (2016) was developed by adopting a superpopulation concept and viewing case-control samples as independent and identically distributed observations sampled from the superpopulation, they need to link the quantities in the superpopulation to the ones in the true population. As a consequence, several additional conditional distributions arise in the likelihood formulation, including quantities conditional on the covariates. This leads to the need to perform several nonparametric regressions on the covariates in their estimator. When the covariate dimension increases, such nonparametric regressions inevitably suffer from the curse of dimensionality.
In this paper, we work in the superpopulation framework and handle the potential dimensionality problem using a dimension reduction modeling approach. We assume several quantities of interest depend on the covariates X only through linear combinations of X and/or known functions of X. This allows us to avoid multivariate nonparametric regression. However, because of the inherent relation between the covariates assumed in the original true population, the dimension reduction structure is not completely arbitrary. Instead, it is subject to various constraints, which makes the problem different from the classical dimension reduction modeling and estimation. Taking these various special features into consideration, we construct asymptotically consistent estimators for the regression parameters in the true population model. These estimators have a parametric convergence rate and are robust to the misspecification of the conditional distribution of Y given X.
We emphasize that ours is not a paper about advancing dimension reduction modeling, which already has a massive literature (Ma and Zhu, 2013b; Li, 1991; Li and Duan, 1989; Li, 1992; Li and Dong, 2009; Li and Wang, 2007; Li et al., 2008, 2005; Dong and Li, 2010; Ma and Zhu, 2012b, 2013a; Zhu et al., 2010; Cook, 2009; Cook and Li, 2002; Yin and Cook, 2002; Cook, 1994; Setodji and Cook, 2004; Cook and Setodji, 2003; Yin and Bura, 2006; Xia, 2007). Instead, it is about using dimension reduction ideas to solve a semiparametric problem in the secondary analysis of case-control studies when the dimensionality of the covariates is potentially large.

Background
Let D be disease status, where D = 1 denotes a case and D = 0 denotes a control. Also let $(X^T, Y)^T$ be a $(p + 1) \times 1$ vector of covariates, where X is a p-dimensional vector and Y is a scalar. We assume that both X and Y are continuous and that they are related to disease status D through
$$\mathrm{pr}(D = d \mid X = x, Y = y) = f^{\mathrm{true}}_{D \mid X, Y}(d, x, y) = H(d, x, y, \alpha) = \frac{\exp\{d(\alpha_c + x^T\alpha_1 + y\alpha_2)\}}{1 + \exp(\alpha_c + x^T\alpha_1 + y\alpha_2)}, \qquad (2.1)$$
where $\alpha = (\alpha_c, \alpha_1^T, \alpha_2)^T$.
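As a small numerical illustration of the disease model (2.1), the following Python sketch evaluates H(d, x, y, α) for a single subject. The function name `disease_prob` and all parameter values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math

def disease_prob(d, x, y, alpha_c, alpha_1, alpha_2):
    """H(d, x, y, alpha) from model (2.1): pr(D = d | X = x, Y = y)."""
    # Linear predictor alpha_c + x^T alpha_1 + y * alpha_2.
    eta = alpha_c + sum(xj * aj for xj, aj in zip(x, alpha_1)) + y * alpha_2
    # exp(d * eta) / {1 + exp(eta)} for d in {0, 1}.
    return math.exp(d * eta) / (1.0 + math.exp(eta))

# Hypothetical covariates and parameters (p = 4).
x = [0.2, 0.5, 0.1, 0.9]
alpha_1 = [0.5, 0.5, 0.5, 0.5]
p_case = disease_prob(1, x, 0.3, alpha_c=-3.0, alpha_1=alpha_1, alpha_2=1.0)
p_control = disease_prob(0, x, 0.3, alpha_c=-3.0, alpha_1=alpha_1, alpha_2=1.0)
```

By construction the two probabilities sum to one, which is a quick sanity check on any implementation of H.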
As mentioned before, the goal of secondary analysis is to investigate the relationship between X and Y in the source population, which we assume is of the form f X, Y, D (x, y, d) Y D true (x, y, d) = N d N η 1 (x)η 2 (ϵ, x)H(d, x, y, α)
Here we use the fact that the distribution of (X, Y) conditional on the disease status D is identical in the superpopulation and in the true population, which links the distributions in these two populations. Ma and Carroll (2016) derived the semiparametric efficient score function corresponding to the above superpopulation, $S_{\mathrm{eff}}(X_i, Y_i, D_i) = S(X_i, Y_i, D_i) - g\{Y_i - m(X_i, \beta), X_i\} - (1 - D_i)v_0 - D_i v_1$. The resulting efficient estimating equation is κ(x, y) ≡ ∑

Background
The estimating equation (2.5) contains three expectations conditional on the covariates X, i.e., $E^{\mathrm{true}}\{\epsilon^2\kappa(X, Y) \mid X\}$, $E^{\mathrm{true}}\{\epsilon\mu_s(X, Y) \mid X\}$ and $E^{\mathrm{true}}\{\epsilon f_{D \mid X, Y}(0, X, Y) \mid X\}$, which need to be estimated nonparametrically. However, such estimation may be extremely hard when the covariates X are multivariate. To bypass the potential curse of dimensionality caused by the multivariate nature of X, we use a dimension reduction modeling strategy, i.e., we assume that all three quantities in the conditional expectations depend on X only through several linear combinations $X^T\gamma$ or several linear combinations of functions of X. Under such a dimension reduction structure, we can construct nonparametric regression estimators for high dimensional covariates X in a way similar to the univariate case, with the desired bias and MSE order, hence facilitating the estimation procedure via solving the estimating equation (2.5).
Let $f_0(X, Y, \alpha) = f_{D \mid X, Y}(0, X, Y)$. All three functions $\kappa(x, y)$, $\mu_s(x, y)$ and $f_0(x, y)$ depend on $\pi_d = \pi_d(\alpha)$. To emphasize this, we replace $\pi_d$ with $\pi_d(\alpha)$ in those three functions and we use the notation $\kappa(x, y, \alpha)$, $\mu_s(x, y, \alpha)$, $f_0(x, y, \alpha)$ to distinguish them from the ones using the true parameter value α. In addition, we define $\epsilon(X, Y, \beta) = Y - m(X, \beta)$ to distinguish it from the true ϵ, which is $Y - m(X, \beta)$ evaluated at the true parameter value β. $Z_\beta$ general model (3.1)-(3.3) without specifying the particular form of $Z_\beta$. Of course, we need to estimate $\gamma_j$ and $\zeta_j(\cdot)$ for j = 1, 2, 3. To resolve the issue of estimating conditional expectations in the true population while we only have a random sample from the superpopulation, the key point is to recognize the connection between the two populations and to adjust the case-control data in the context of conditional expectations via procedures and algorithms for both cases, with the algorithm for different indices in Appendix A.2.1 and that for the same index in Appendix A.2.2.
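The adjustment between the two populations can be illustrated with a weighted kernel regression on a single index. The Python sketch below is not the estimator of Appendix A.2; it is a minimal illustration under several assumptions: the true-population probabilities $\pi_d$ are treated as known inputs, the index values $x_i^T\gamma$ are precomputed, and an Epanechnikov kernel is used. All function and argument names are hypothetical. The only fact it relies on is that a subject with D = d is sampled with frequency $N_d/N$ in the case-control data but has probability $\pi_d$ in the true population, so the weight $\pi_d/(N_d/N)$ restores the true-population proportions.

```python
def epanechnikov(u):
    """Second-order kernel supported on (-1, 1)."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def reweighted_nw(u0, index, g, d, pi, h):
    """Weighted Nadaraya-Watson estimate of E_true{g | index = u0}.

    index[i] : value of the linear combination x_i^T gamma
    g[i]     : quantity being averaged, e.g. eps_i^2 * kappa(x_i, y_i)
    d[i]     : disease status (0 or 1), given as a list
    pi[k]    : assumed true-population probability pr(D = k)
    h        : bandwidth
    """
    n = len(d)
    n_d = [d.count(0), d.count(1)]
    num = 0.0
    den = 0.0
    for ui, gi, di in zip(index, g, d):
        # Weight pi_d / (N_d / N) undoes the oversampling of cases.
        w = pi[di] / (n_d[di] / n)
        k = epanechnikov((ui - u0) / h)
        num += w * k * gi
        den += w * k
    # Undefined (division by zero) if no observation falls within h of u0.
    return num / den
```

For instance, taking g equal to the disease indicator and a balanced sample whose index values all equal u0, the function returns $\pi_1$ rather than the sample case fraction of one half, which is the intended correction.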

Remark 1.
It is worth pointing out that the estimation of π via (3.5) originates from Thus, the estimator takes into account the difference between the superpopulation and the population from which the case-control sample is drawn, and thus leads to a consistent estimator of π 0 .

Estimation Algorithm Using Different Indices
The estimating equation in (2.5) relies on the unknown probability density function η 2 .
Here, we use a posited model η 2 *, which is not necessarily the truth, to calculate the efficient score and other related quantities. The resulting estimating function is denoted by S eff * . We will show that the resulting estimator is still consistent, and it is efficient if the posited model η 2 * is the correct one.
The main difficulty in calculating S eff * lies in approximating functions g, v 0 , and v 1 , because For convenience, we adopt γ 1 = γ 2 = γ 3 = γ in all the simulations, where the lower square block of γ is set to be identity to ensure identifiability. The algorithm in this simplified case is identical to the one described above except step 3. The detailed algorithms for cases using different indices and using a common index are given in Appendix A.2.

Distribution Theory
We now establish the asymptotic distribution theory of our estimators, stated as Theorem 1 below, with necessary regularity conditions C1-C11 listed in Appendix A.3. The proof of Theorem 1 is detailed and lengthy and is thus sketched in the Appendix Section A.5. While Theorem 1 holds for both the estimator using different indices and the estimator using a common index, we only provide the proof and regularity conditions for the algorithm with different indices. One can easily adapt the conditions and proof to the case of a common index.
Under the regularity conditions C1-C11 listed in Appendix A.3, the following theorem holds. The proof is in the Appendix Section A.5.

Define
The estimator θ obtained from solving the estimating equation T and θ is locally efficient, see the definition of locally efficient in Section 1.

Setup
We performed a series of simulations to understand the behaviour of our method and compare it to competitors. The simulations displayed in this section are for the case that the regression errors are Gaussian or centered Gamma, both homoscedastic and heteroscedastic.
In these simulations, we considered different disease rates, different dimensions and distributions for X and different error variance structures. The results indicate that our methods have small bias and good coverage probability in all the cases we examined. Here, due to space limitations, we only list the results for two typical scenarios, where the first one is homoscedastic and the second one is heteroscedastic. In both cases, we chose a balanced design with N 1 = 1000 cases and N 0 = 1000 controls, set the disease rate to be approximately 4.5% and let X be exchangeable with p = dim(X) = 4.
More specifically, we generated $X = (X_1, \ldots, X_4)^T$ in the following way.
Hence, X is an exchangeable vector of random variables satisfying $X_i \sim \mathrm{Uniform}[0, 1]$ for $i = 1, \ldots, 4$ and $\mathrm{corr}(X_i, X_j) = \mathrm{corr}(X_k, X_l)$ for all $i \neq j$, $k \neq l$. In our simulation studies, we used ρ = 0.2, which resulted in $\mathrm{corr}(X_i, X_j) \approx 0.191$ for all $1 \le i \neq j \le 4$.
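The generating mechanism itself is not reproduced in the text above. One construction that is consistent with the reported values, offered here only as an assumption, is a Gaussian copula: draw equicorrelated standard normal variables with correlation ρ and apply the standard normal distribution function to each. For that construction the Pearson correlation between two components is $(6/\pi)\arcsin(\rho/2)$, which is about 0.191 when ρ = 0.2. The function names below are hypothetical.

```python
import math
import random

def exchangeable_uniforms(p, rho, rng):
    """Draw one p-vector with Uniform[0, 1] margins and a common correlation."""
    shared = rng.gauss(0.0, 1.0)
    x = []
    for _ in range(p):
        # Equicorrelated standard normal: corr(z_i, z_j) = rho for i != j.
        z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        # Probability integral transform: Phi(z) is Uniform[0, 1].
        x.append(0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    return x

def implied_uniform_corr(rho):
    """Pearson correlation of Phi(z_i) and Phi(z_j) when corr(z_i, z_j) = rho."""
    return 6.0 / math.pi * math.asin(rho / 2.0)

rng = random.Random(0)
sample = [exchangeable_uniforms(4, 0.2, rng) for _ in range(20000)]
theory = implied_uniform_corr(0.2)
```

Whether the original simulations used exactly this mechanism cannot be confirmed from the text; the sketch only shows that ρ = 0.2 and a correlation near 0.191 are compatible under it.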
We set the posited model $\eta_2^*$ to be Normal(0,1) and adopted the estimation algorithm discussed in Section 3.4 and Appendix A.2 for the three important conditional expectations $E^{\mathrm{true}}\{\epsilon^2\kappa(X, Y) \mid X\}$, $E^{\mathrm{true}}\{\epsilon\mu_s(X, Y) \mid X\}$ and $E^{\mathrm{true}}\{\epsilon f_{D \mid X, Y}(0, X, Y) \mid X\}$. In steps (a)-(c) in Appendix A.2 that involve nonparametric calculations, we used the asymptotically justified bandwidth $h = c n_0^{-1/5}$: we found that when $c \in [1, 6]$, the estimation results are very similar.

Results
We contrasted three methods. The first one is ordinary least squares using controls only. The second one is the semiparametric efficient method that assumes the regression error to be normally distributed with homoscedastic variance and E(Y | X) to be linear in X, or equivalently in our notation, $m(X, \beta) = \beta_0 + X^T\beta_1$ (Lin and Zeng, 2009). This method also requires a rare or known disease rate, which was set to 0.1% in the simulations. The third is our method described in Section 3.4, which does not require the rare disease assumption and does not put any restriction on ϵ other than that E(ϵ | X) = 0.
To implement Lin and Zeng's method, we used their software SPREG provided on http://dlin.web.unc.edu/software/spreg-2/, which adopts the rare disease assumption if the input disease rate is less than 1%. This software was designed to work in a semiparametric framework where it assumes a fully parametric Gaussian model for ϵ but the distribution of X is nonparametric. However, through multiple attempts we found that their software can only handle the case where the components of X are independent. Thus, before running SPREG, we decorrelated X by multiplying it by $L^{-1}$, where L is the Cholesky factor of cov(X) = Σ, satisfying $LL^T = \Sigma$. In the simulations, we used the true covariance matrix Σ to fulfill the restriction of SPREG. However, when dealing with the mammographic density data in Section 6, the true covariance matrix Σ is unknown, so we estimated it using only the controls.
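The decorrelation step can be sketched in a few lines of Python. This is a minimal pure-Python illustration, not the code used in the study: the helper names are hypothetical, and the example covariance matrix is an assumption built from Uniform[0, 1] margins (variance 1/12) with a common correlation of 0.191.

```python
import math

def cholesky(sigma):
    """Lower-triangular L with L L^T = sigma (sigma symmetric positive definite)."""
    p = len(sigma)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L

def decorrelate(L, x):
    """Return L^{-1} x by forward substitution, without forming L^{-1}."""
    z = []
    for i in range(len(x)):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((x[i] - s) / L[i][i])
    return z

# Assumed covariance: variance 1/12 on the diagonal, correlation 0.191 elsewhere.
sigma = [[(1.0 / 12.0) * (1.0 if i == j else 0.191) for j in range(4)]
         for i in range(4)]
L = cholesky(sigma)
z = decorrelate(L, [0.2, 0.5, 0.1, 0.9])
```

Since cov(X) = $LL^T$, the transformed vector $L^{-1}X$ has identity covariance, so its components are uncorrelated; they are independent only in special cases such as Gaussian X.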
The results are summarized in Tables 1-2. In the homoscedastic Gaussian scenario (Table 1), the approach using only controls ("Ctrl") is asymptotically valid with small bias and near nominal coverage. Lin and Zeng's method ("Param"), which assumes normality and homoscedasticity, has the smallest standard deviation among the three methods since it is efficient if the errors are normal. However, it suffers from slight bias since the true disease rate is 4.5%, larger than 1%. Our method ("Semi"), which assumes neither normality nor rare disease, is superior in overall performance. It has the smallest bias of the three methods. In addition, its mean squared error efficiency is from 60.0% to 79.9% greater than using only controls and is comparable to Lin and Zeng's method. In the homoscedastic Gamma case (Table 2), Lin and Zeng's method has considerable bias, under-coverage and loss of mean squared error efficiency.
In the heteroscedastic scenario, for both Gaussian and Gamma errors, both the "Ctrl" and the "Param" methods suffer from low coverage probabilities while our approach ("Semi") maintains nominal coverage. The approach using only controls is reasonably unbiased in the Gaussian case but suffers from much larger bias in the Gamma case. In both cases, Lin and Zeng's parametric method gives badly biased estimates, low coverage probabilities and low mean squared error efficiency. Taking $\beta_{13}$, the third element in $\beta_1$, as an example, while the nominal coverage is 95%, the actual coverage rates are 40.6% and 43.7% in the Gaussian and Gamma cases, respectively. Our approach has a bias no larger than 4% of the true value, which is the best among the three methods. It also achieves the best coverage probabilities and the smallest mean squared errors.
We have done other simulations with different disease rates, and the overall picture remains the same as what we have described above. For example, in the Appendix Section A.7, we display results for the case that the intercept was adjusted to make the disease rate in the source population ≈ 10%.
We have also done simulations when the dimension of X is 6, 8 and 10 with an approximate 4.5% disease rate, and found results similar to the ones previously described. Of course the computation takes longer as the dimension of X increases. Please see the Appendix Section A.8 for numerical results.
Remark 2. While a number of methods for secondary analysis exist in the literature, none of them is applicable in our setting. For example, Jiang et al. (2006) and Li et al. (2010) focused on binary Y, for which a logistic regression model for Y and X or Y and (X, D) was considered. Ma and Carroll (2016) adopted kernel density regression in their estimation procedure, and thus their method is not applicable to cases with multivariate X due to the curse of dimensionality. Wei et al. (2013) requires the known or rare disease assumption as well as homoscedastic regression errors, and hence is also not applicable in our model setting. Likewise, Lin and Zeng (2009) requires the known or rare disease assumption and is applicable only when the secondary model is parametric. Thus, we have compared our approach to only two methods: the control only method, for its simplicity and sometimes surprisingly good results when the disease is truly rare, and Lin and Zeng's method, for its gold standard status in practice for parametric models.

Analysis of Mammographic Density Data
Here we apply our methodology in a case-control study of breast cancer, where the data were collected from women in the breast cancer detection demonstration project (BCDDP), see Chen et al. (2006) and . The study recruited a total of 284,780 women, starting from January 1, 1973 and ended December 31, 1995. Then in the following five years, follow-up annual screening was performed for each subject. Here the period from 1973-1980 is referred to as the "screening phase" of the study. At the end of the screening phase, the study selected all cases, i.e. women who developed breast cancer, and sampled from the controls. All the selected women were included in a further extended follow-up study from 1980 to 1995. Standard risk factors, including age at menarche, age at first live birth and body mass index, were available in this study. However, we were only able to retrieve mammographic density measurements at baseline in 1973-1975 for N 1 = 2092 cases and N 0 = 3295 controls.
Mammographic density is a measure of the average of dense tissue percentage in both breasts. Women's breasts consist of fat, breast tissue, nerves, veins, arteries and connective tissue that holds everything in place. Both breast tissue and connective tissue are denser than fat. Previous studies showed that higher mammographic density is a strong risk factor for breast cancer. In addition, age at menarche and age at first live birth are both known to be associated with breast cancer. Women who have their first menstruation before age 12 have a slightly higher chance of developing breast cancer compared with those who have their first period after 14; women who give birth to their first child at a young age tend to have a relatively lower risk of developing breast cancer. Body mass index is another risk factor for breast cancer. Before menopause, being slightly overweight can reduce breast cancer risk. However, there is little existing work discussing the interrelationship between mammographic density, age at menarche, age at first live birth and body mass index. The goal of our analysis is to investigate this interrelationship. Before implementing our method, we used an inverse logistic transformation on mammographic density and rescaled the other three risk factors to [0,1] by subtracting their minimums and dividing by the ranges.
Preliminary analysis based on only the controls data showed that mammographic density is reasonably linear in age at menarche, age at first live birth and body mass index. To check this, we fit both a linear regression model and a quadratic regression model using controls and compared these two models via analysis of variance. The p-value is about 0.78, which indicates the linear model is preferred over the quadratic model. Hence, we adopted a linear m(·) in the secondary analysis. The diagnostic plots of linear regression are given in Figure 1. The left plot is the kernel density estimate of the residuals from a linear fit on the controls, with an overlaid normal density. It shows that the regression error almost follows a normal distribution but with slightly negative skewness. The right plot is the LOWESS smoother of fitted values versus the square roots of absolute values of residuals, which indicates the regression error is homoscedastic.
The results of secondary analysis using only controls, Lin and Zeng's parametric method and our semiparametric approach, based on 1000 bootstrap samples, are given in Table 3. All three methods have fairly consistent results, as expected, because the regression error is homoscedastic and close to normal. For all three methods, age at first live birth is highly statistically significant with a positive effect on mammographic density. That is, women who gave birth to their first children earlier tend to have a lower mammographic density, and hence obtain some protective effect against developing breast cancer. Both age at menarche and body mass index have negative coefficients, which indicates that having a relatively late first period or being moderately overweight can slightly reduce mammographic density. However, neither of them is statistically significant.
As expected, Lin and Zeng's parametric method has a much smaller bootstrap standard deviation compared with the ordinary least squares using only controls, with an average efficiency of 1.60. Here the efficiency is defined as the square of the ratio of bootstrap standard deviation compared with using only controls. Our semiparametric approach, which assumes neither homoscedasticity nor normality, has almost the same bootstrap standard deviation as Lin and Zeng's method. The bootstrap standard errors of Lin and Zeng's parametric approach for age at menarche, age at first live birth and body mass index are 0.131, 0.106 and 0.138, respectively, while those of our semiparametric approach are 0.129, 0.107 and 0.137, respectively. The average efficiency of our approach is 1.63, which is even slightly larger than that of Lin and Zeng's method.
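The efficiency measure is a squared ratio of standard deviations. The Python sketch below assumes the ratio is oriented as reference over method, which is the reading consistent with efficiencies above 1 for the methods with smaller standard deviations; the function name is hypothetical. Because the controls-only standard deviations are not quoted in the text, the example compares the two standard-error triples reported above against each other, taking Lin and Zeng's method as the reference.

```python
def efficiency(sd_reference, sd_method):
    """Squared ratio of bootstrap standard deviations, reference over method.

    Values above 1 mean the method is more precise than the reference.
    """
    return (sd_reference / sd_method) ** 2

# Bootstrap standard errors quoted in the text
# (age at menarche, age at first live birth, body mass index).
sd_param = [0.131, 0.106, 0.138]
sd_semi = [0.129, 0.107, 0.137]

# Efficiency of "Semi" relative to "Param", coefficient by coefficient.
relative = [efficiency(p, s) for p, s in zip(sd_param, sd_semi)]
average_relative = sum(relative) / len(relative)
```

The same function gives the efficiencies in Table 3 when the controls-only standard deviations are supplied as the reference.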

Discussion
We have extended the work of Ma and Carroll (2016) and have overcome the potential dimensionality issue involved in their nonparametric kernel regression. Multivariate kernel regression is avoided by using dimension reduction modeling ideas. We repeat that our work is not about fitting dimension reduction models per se, but about using them in the secondary analysis of case-control studies. Our method makes no distributional assumptions about the regression errors, and we do not need to make a rare disease assumption or require a known disease rate.
The dimension reduction assumptions stated in (3.1)-(3.3) are mild in general, see Proposition 1, and are applicable in many practical situations. An interesting topic for future work would be to consider using regularization to further reduce the dimension of Zβ so as to obtain an even more parsimonious model.
Alternative dimension reduction modeling approaches could exist, although it is not easy to identify them based on our preliminary analysis along this line. For example, generalized additive models do not appear to be suitable in the common regression error structures described in Section 3.2. For example, in (3.1), where G is a function of the logistic distribution function, i.e., a function of several exponential functions. It is not clear that this can be written as a generalized additive model. Even if it can be done, using such a dimension reduction approach will still require careful exploration and new methodology development because off-the-shelf results on generalized additive models may not apply due to the case-control sampling nature.
Finally, in some cases, it might be possible to posit a parametric form for var(ϵ | X). We believe that our approach can be extended to this case, and that doing so would further improve efficiency in estimating β. This will be pursued in future work.

Acknowledgments
* Research was supported by grants from the National Cancer Institute (U01-CA057030). † Research was supported by the National Science Foundation (DMS-1206693) and the National Institute of Neurological Disorders and Stroke (R01-NS073671).
In the usual case, we have that where $y = m(x, \beta) + \epsilon$. Here $\eta_2^*$ is the posited conditional density of ϵ given X, not necessarily the true model. Let $w(d, x, y; \alpha) = d - H(1, x, y, \alpha)$, so that Hence, We assume the following models hold.
where Z ={X T , m(X, β)} T when m is nonlinear while Z = X when m is linear. For identifiability, the lower square blocks of γ 2j ,j = 1, 2, 3 are fixed to be identity.

5.
Estimate $E^{\mathrm{true}}\{\epsilon\mu_s(X, Y) \mid X\}$ using nonparametric regression under the dimension reduction model assumption (3.2). Because $E^{\mathrm{true}}\{\epsilon\mu_s(X, Y) \mid X\}$ actually consists of three separate dimension reduction models, its estimation is slightly more complex. We give the estimation details in Appendix A.1 and denote the resulting estimator by $\hat{E}^{\mathrm{true}}\{\epsilon(X, Y, \beta)\mu_s(X, Y, \alpha) \mid X\}$.

6.
Estimate $E^{\mathrm{true}}\{\epsilon f_0(X, Y) \mid X\}$ using nonparametric regression under the dimension reduction model assumption (3.3).

(a)
and solve the corresponding estimating equation.

A.2.2. Algorithm Using A Common Index
Specifically, we replace the steps 4-6 of Appendix A.2.1 with the following three steps.

Author Manuscript

A.3.: Regularity Conditions
Let ℓ be the dimensionality of the kernel regressions in our method after dimension reduction. In our simulations and example, we took ℓ =1. The set of regularity conditions required by Theorem 1 is listed below.

C1
The univariate kernel function K integrates to 1, has support (−1, 1) and has order r, i.e., $\int K(u)u^t\,du = 0$ if $1 \le t < r$ and $\int K(u)u^r\,du \neq 0$. The ℓ-dimensional kernel function, still denoted by K, is a product of ℓ univariate kernel functions, that is, $K(u) = \prod_{i=1}^{\ell} K(u_i)$ for an ℓ-dimensional u.
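As a concrete instance of condition C1, the Epanechnikov kernel has support (−1, 1), integrates to 1 and has order r = 2. The Python sketch below checks its moments numerically and builds the product kernel. The choice of the Epanechnikov kernel and the function names are assumptions made for illustration; the condition does not prescribe a particular kernel.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on (-1, 1), zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def kernel_moment(kernel, t, n_grid=20000):
    """Midpoint-rule approximation of the integral of K(u) u^t over (-1, 1)."""
    step = 2.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        u = -1.0 + (i + 0.5) * step
        total += kernel(u) * u ** t
    return total * step

def product_kernel(kernel, u_vec):
    """Product of univariate kernels evaluated at each coordinate of u_vec."""
    value = 1.0
    for u in u_vec:
        value *= kernel(u)
    return value

m0 = kernel_moment(epanechnikov, 0)  # integrates to 1
m1 = kernel_moment(epanechnikov, 1)  # first moment vanishes
m2 = kernel_moment(epanechnikov, 2)  # first nonzero moment, equal to 1/5
```

A higher-order kernel (r > 2) would make the second moment vanish as well, at the cost of taking negative values on part of its support.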

C2
Let $\xi^{\mathrm{true}}_{i, \beta}$ be the true population density of $Z_\beta^T\gamma_i$ for i = 1, 2, 3 and β in a local neighborhood of the true β. Assume that the $\xi^{\mathrm{true}}_{i, \beta}$ are bounded away from 0 and that they all have third order bounded and continuous derivatives.
T γ 3 have (r + 1)th order bounded and continuous derivatives for any θ in a local neighborhood of θ.
T γ 3 have (r + 1) th order bounded and continuous derivatives for any θ in a local neighborhood of θ.

A.4.: Proof of Proposition 1
We provide a detailed proof that the first dimension reduction model (3.1) satisfies Proposition 1. Proving that the other two dimension reduction models (3.2) and (3.3) also satisfy Proposition 1 is similar.
In (3.1), k(x, y, α) is a function of the weighted sum of H(d, x, y) with d = 0,1. As a result, where h(·) is a differentiable function.

A.5.1. Introduction
Following Ma and Carroll (2016), we divide the N observations randomly into three sets, where the first set contains $n_1 = N - N^{1-\delta} - N^{1-2\delta}$ observations, the second set contains $n_2 = N^{1-\delta}$ observations and the third set contains $n_3 = N^{1-2\delta}$ observations, where δ is a small positive number. For convenience of proof, we require the disease proportion in the third data set to be the same as in the whole data set. That is, $n_{30}/n_{31} = N_0/N_1$, where $n_{30}$ and $n_{31}$ are the numbers of controls and cases in the third set of data, respectively. We form and solve the estimating equation (2.5) using data in the first set, while calculating all the estimated quantities described in Appendix A.2 steps 1-3 using data in the second set and the other estimated quantities defined in Appendix A.2 steps 4-6 using the data in the third set.
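The sizes of the three sets can be computed directly from N and δ. The Python sketch below is only an illustration: rounding to the nearest integer is an assumption, since the text does not say how non-integer sizes are handled, and the values N = 2000, N_1 = 1000 and δ = 0.1 are hypothetical choices.

```python
def partition_sizes(n_total, delta):
    """Sizes (n1, n2, n3) of the three sets for a small positive delta."""
    n2 = round(n_total ** (1.0 - delta))        # second set: N^(1 - delta)
    n3 = round(n_total ** (1.0 - 2.0 * delta))  # third set: N^(1 - 2 delta)
    n1 = n_total - n2 - n3                      # first set: the remainder
    return n1, n2, n3

def third_set_counts(n3, n_cases, n_total):
    """Controls and cases in the third set, matching the overall proportion."""
    n31 = round(n3 * n_cases / n_total)
    return n3 - n31, n31

n1, n2, n3 = partition_sizes(2000, 0.1)
n30, n31 = third_set_counts(n3, n_cases=1000, n_total=2000)
```

For any small positive δ the third set is the smallest of the two auxiliary sets, and for moderate N the first set can also be small, which is a feature of the proof device rather than a practical recommendation.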

A.5.2. Lemmas
Before proving Theorem 1, we first state several lemmas, which ensure the quantities defined in Appendix A.2 steps 4-6 have the desired orders of bias and mean square error, i.e., the same as that of the usual nonparametric estimators.
From (3.5), we can easily show that Lemma 1.
For some σ πd(α) We now analyze the properties of our estimators defined in Appendix A.2 steps 4-6. For notational brevity, we focus only on the first conditional expectation $E^{\mathrm{true}}\{\epsilon^2\kappa(X, Y) \mid X\}$.
The other two conditional expectations have similar properties. We split the analysis into three parts: i) analyze the properties of E 1 π (X j , γ 1 , θ); ii) analyze the properties of γ 1 (θ) for θ near θ; iii) show that E true {ϵ 2 (X, Y, β)κ(X, Y, α) | X} has desired bias order and standard deviation order.
For the first part of the analysis, we establish the following lemma.

Lemma 2.
Under the regularity conditions C1-C10, Proof. Denote the numerator and denominator of E 1 π (X j , γ 1 , θ) by q num and q den respectively.
We can replace π d (α) in q num and q den with π d (α) without changing the error order due to the data partition scheme we use. That is, With further calculations, this means that Similarly, we have q den = (n 3 − 1) Here we used the regularity conditions C1-C2, C5, C8-C10.
In addition, with the regularity conditions C1-C4 and C8-C10, we have

A.7
Using Lemmas 2 and 3 and the regularity condition C10, we have that the fourth term in (A. 7) By applying Lemma A1 in Ma and Zhu (2012a), we obtain that the second and third terms in (A.7) are of order O p h r + n 3 1/2 h 2r + log 2 n 3 / n 3 h 2 = o p (1).
Hence, the estimating equation can be written as A.8 We now show that the influence function given in (A.8) has mean 0 at θ = θ.
The last equality holds because of the single index model assumption (3.1). In practice, we replace θ by $\hat{\theta}$, the solution of the estimating equation defined in (4.1). As long as $\hat{\theta} \to \theta$ in probability, the above expectation approaches 0.

Lemma 5.
Under the regularity conditions C1-C10,

Table 1 Simulation study in Section 5 with N_1 = 1,000 cases and N_0 = 1,000 controls, disease rate of approximately 4.5% and 4-dimensional correlated covariates X over 1,000 simulated data sets. The results for the homoscedastic normal error model are listed on the left and the results for the heteroscedastic normal error model are listed on the right. The three analyses performed are "Ctrl", which is ordinary least squares using only controls, "Param", which is the semiparametric efficient method proposed by Lin and Zeng (2009) assuming normality and homoscedasticity, and "Semi", which is our new estimator described in Section 3.4. Here, we list the sample mean ("mean"), the sample standard deviation ("s.d."), the mean estimated standard deviation ("est. sd") and the coverage for the nominal 95% confidence intervals ("95%") for all three methods. In addition, we computed the mean squared error efficiency compared to using only controls for the "Param" and "Semi" methods.

Table 2 Simulation study in Section 5 with N_1 = 1,000 cases and N_0 = 1,000 controls, disease rate of approximately 4.5% and 4-dimensional correlated covariates X over 1,000 simulated data sets. The results for the homoscedastic gamma error model are listed on the left and the results for the heteroscedastic gamma error model are listed on the right. The three analyses performed are "Ctrl", which is ordinary least squares using only controls, "Param", which is the semiparametric efficient method proposed by Lin and Zeng (2009) assuming normality and homoscedasticity, and "Semi", which is our new estimator described in Section 3.4. Here, we list the sample mean ("mean"), the sample standard deviation ("s.d."), the mean estimated standard deviation ("est. sd") and the coverage for the nominal 95% confidence intervals ("95%") for all three methods. In addition, we computed the mean squared error efficiency compared to using only controls for the "Param" and "Semi" methods.

Table 3 Analyses of the mammographic density data from the breast cancer detection demonstration project (BCDDP) in Section 6, which has N_1 = 2092 cases and N_0 = 3295 controls, using only controls ("Ctrl"), Lin and Zeng's method ("Param") and our approach ("Semi"). Displayed are the mean estimates of the coefficients for age at menarche (MENARCHE), age at first live birth (1STLB) and body mass index (BMI), their bootstrap standard deviation ("boot. sd"), the mean estimated bootstrap standard deviation ("est. sd") and the lower and upper end values of the 95% confidence intervals ("Lower" and "Upper"). Also displayed is the efficiency ("Eff"), which is the square of the ratio of bootstrap standard deviation to that using only controls.

Table 5 500 simulations, 1000 cases/1000 controls, 10% disease rate, correlated covariates X with dimension 4, Gamma error. See Table 1 for definitions.

Table 6 500 simulations, 1000 cases/1000 controls, 4.5% disease rate, correlated covariates X with dimension 6, Gaussian error. See Table 1