A nested semiparametric method for case‐control study with missingness

We propose a nested semiparametric model to analyze a case-control study where genuine case status is missing for some individuals. The concept of a noncase is introduced to allow for the imputation of the missing genuine cases. The odds ratio parameter of the genuine cases compared to controls is of interest. The imputation procedure predicts the probability of being a genuine case compared to a noncase semiparametrically in a dimension reduction fashion. This procedure is flexible, and vastly generalizes the existing methods. We establish the root-n asymptotic normality of the odds ratio parameter estimator. Our method yields stable odds ratio parameter estimation owing to the application of an efficient semiparametric sufficient dimension reduction estimator. We conduct finite sample numerical simulations to illustrate the performance of our approach, and apply it to a dilated cardiomyopathy study.


INTRODUCTION
Our work is motivated by a case-control study of dilated cardiomyopathy conducted using the University of Pennsylvania hospital electronic health record (EHR). Cases and controls were identified from Penn EHRs using separate rules that were created based on EHR data elements. The rule for identifying controls was rigorous, so that controls were identified accurately, as in typical EHR-based case-control studies. A more relaxed rule was used for identifying candidate cases.
A larger number of genuine cases could be included in the study using this more relaxed rule, which is essential for ensuring study power and generalizability of study results. However, such a relaxed rule led to the inclusion of patients who are not genuine cases and also do not satisfy the control definition. These patients are referred to as "noncases" (Wang et al., 2020). Noncases differ from genuine cases, and they differ from controls as well, making them ineligible for the study. When estimating odds ratio association parameters, naively treating noncases as genuine cases will lead to biased results (Little & Rubin, 2019). Because it is often very difficult to create a binary decision rule for discerning which candidate cases truly have the condition, this challenge is common in EHR-based case-control studies. In this work, we propose an innovative method to effectively account for inaccurate case selection.
Our problem can be viewed within the missing data framework, where the true status of being a noncase or genuine case is unknown for some of the identified candidate cases. More specifically, the probability model for predicting genuine cases in the combined population of genuine cases and noncases automatically serves as a model for the missingness, which places us in the missing at random (MAR) framework. Our method imputes the true status by modeling the relationship between genuine cases and noncases from a validated subset. One key feature of our method is that we form a two-layer nested case-control study by treating the genuine cases and noncases as a new case-control data structure alongside the primary case-control data. Another key feature is that we impute the missing case status through a semiparametric model that is sufficiently flexible and accommodates many covariates.
The imputation step in our approach is nonstandard and plays a different role from what is typically done in the classical imputation literature. Imputation is a widely applied approach for accommodating missing data (Aerts et al., 2002; Little & Rubin, 1987), including missing binary outcomes (Mukaka et al., 2016). However, few works focus on case-control studies with missing genuine case status when there is a third group of individuals who are ineligible for the study. Wang et al. (2020) proposed a parametric imputation method for case-control studies in this framework, introducing imputation into the estimating equation to correct the bias caused by the missing genuine case status. To retain flexibility while bypassing the curse of dimensionality (Wang et al., 2004), we propose a semiparametric sufficient dimension reduction model and apply an efficient procedure (Ma & Zhu, 2012) to obtain the efficient probability prediction in our imputation procedure. This leads to stabilized odds ratio parameter estimation in the main model. This modeling and estimation approach allows us to impose minimal assumptions on the missingness scheme while limiting its influence on the odds ratio parameter estimation. In addition, in an intermediate step of our method we impute with a probability instead of a randomly generated outcome. This practice minimizes potential bias, especially when the predicted probability is extreme (Bernaards et al., 2007), and stabilizes the computation of the overall method.

MODELING THE CASE-CONTROL DATA WITH MISSINGNESS
Let D denote the outcome, with D = 0 indicating the controls, D = 1 the genuine cases, and D = 2 the noncases. A noncase here is simply anyone who is neither a genuine case nor a control. Let N_1 be the number of candidate cases (i = 1, …, N_1), which include both genuine cases (D = 1) and noncases (D = 2). There are also N_0 controls (D_i = 0, i = N_1 + 1, …, N ≡ N_1 + N_0). Further, n_1 observations (i = 1, …, n_1) from the N_1 candidate cases are fully observed, and we use the indicator S to denote the validated outcome status: S = 1 indicates a genuine case and S = 0 a noncase. Let X be a p-dimensional covariate vector and Z a q-dimensional covariate vector. X and Z are allowed to share common components or can even be identical.
Our goal is to fit a logistic regression model using the genuine cases and controls. When all patients in the model are either genuine cases or controls, the probability that a patient is a genuine case is

pr(D = 1 | X, D ∈ {0, 1}) = exp(β_c + β_1^T X) / {1 + exp(β_c + β_1^T X)}.  (1)

Note that there is no Z in model (1). In other words, we use X to represent all the covariates that are responsible for separating genuine cases from controls in the combined population of genuine cases and controls. Hence, the probability that a patient is a control is 1 / {1 + exp(β_c + β_1^T X)}, where β_c ∈ R and β_1 ∈ R^p. We note that the indicators D_i for i = n_1 + 1, …, N_1 are not observed. We therefore propose to recover the missingness in D by utilizing the underlying structure among the candidate cases. To do this, we assume that given an observation is a candidate case (i.e., is not a control), the probability of being a genuine case is

pr(D = 1 | Z, D ≠ 0) = exp{η(γ^T Z)} / [1 + exp{η(γ^T Z)}].  (2)

Note that there is no X in (2). The covariates that are responsible for predicting genuine cases from noncases in the combined population of genuine cases and noncases are collected in Z. Hence, the probability of being a noncase is 1 / [1 + exp{η(γ^T Z)}], where γ ∈ R^{q×d} and η(⋅): R^d → R is an arbitrary function. Note that we have used X to represent all the predictive covariates for (1), and Z to represent all the predictive covariates for (2). Because X and Z are allowed to overlap or even be identical, we are not imposing any additional independence assumptions. The function η is unspecified in (2), and the dimension d will be selected via the data-driven Validated Information Criterion (VIC) (Ma & Zhang, 2015) in practice. When d is selected to be q, the parameter γ becomes the identity matrix, and (2) becomes a purely nonparametric model. In this sense, (2) can be viewed as a maximally flexible model and enjoys the same robustness against model misspecification as any nonparametric model.
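To make the two layers concrete, the following sketch evaluates models (1) and (2) at hypothetical parameter values (the values of β_c, β_1, γ, and the covariates below are illustrative choices, not the study's fitted values; the link η(u) = 1 − u² is the one used in our second simulation study):

```python
import numpy as np

def expit(t):
    # H(t) = exp(t) / {1 + exp(t)}, the logistic link used in both layers
    return 1.0 / (1.0 + np.exp(-t))

# Layer 1 (model (1)): pr(D = 1 | X, D in {0, 1}) with hypothetical beta_c, beta_1
beta_c, beta_1 = -0.5, np.array([1.0, -0.8, 0.3])
x = np.array([0.2, 0.1, -0.4])
p_case_vs_control = expit(beta_c + beta_1 @ x)

# Layer 2 (model (2)): pr(D = 1 | Z, D != 0) = H{eta(gamma^T Z)} with a
# hypothetical single index (d = 1); the upper d x d block of gamma is fixed
# at the identity for identifiability, so gamma[0] = 1
gamma = np.array([1.0, 0.5, -0.5, 0.2])
eta = lambda u: 1.0 - u**2
z = np.array([0.2, 0.1, -0.4, 0.3])
p_case_vs_noncase = expit(eta(gamma @ z))

print(round(p_case_vs_control, 4), round(p_case_vs_noncase, 4))
```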
In this study, we have assumed that (2) is correct. Because (2) serves as the missingness mechanism model in our problem formulation, this places us directly in the MAR framework. In the special case when Z contains only 1, corresponding to the intercept term, the problem degenerates to missing completely at random (MCAR), and our method still applies. On the other hand, if some important covariates related to D are not included in Z, then (2) is misspecified and the estimation procedure will break down. Indeed, in this case the missingness of S depends on those unobserved covariates, which may in turn be related to whether D = 0 or D = 1 in the combined population of controls and genuine cases; we are then actually in the missing not at random (MNAR) framework. It is well known that any method developed under MAR will produce biased results when the true data structure is MNAR.

Estimating equation
According to our proposed model, the estimating equation for the case-control odds ratio parameters β_c and β_1 in (1) is equivalent to

Σ_{i=1}^{n_1} S_i (1, X_i^T)^T {1 − H(β_c + β_1^T X_i)} + Σ_{i=n_1+1}^{N_1} S̃_i (1, X_i^T)^T {1 − H(β_c + β_1^T X_i)} − Σ_{i=N_1+1}^{N} (1, X_i^T)^T H(β_c + β_1^T X_i) = 0,  (3)

where H(t) ≡ exp(t)/{1 + exp(t)} and S̃_i, i = n_1 + 1, …, N_1, denotes the hypothetical case indicator within the unobserved n_2 ≡ N_1 − n_1 samples whose status D can be either a genuine case or noncase. In other words, S̃_i = 1 if the ith candidate case is a genuine case and S̃_i = 0 if it is a noncase. Because the S̃_i's are not available, our intention is to first impute the S̃_i's using (2).

Semiparametric imputation model
Following the multiple imputation idea, imputing S̃_i is equivalent to replacing the S̃_i's with the probabilities predicted by (2). We first need to estimate the parameters in (2). To this end, we take advantage of an efficient semiparametric method (Ma & Zhu, 2012) to estimate the unknown function η(⋅) and the high-dimensional parameter γ simultaneously. To avoid the identifiability issue, we assume the upper d × d block of γ is the identity matrix, so only the lower (q − d) × d block of γ needs to be estimated. In the semiparametric model (2), the nuisance parameters are η(⋅) and f_Z(z), the density of the covariate Z. The corresponding nuisance tangent space is Λ = Λ_1 ⊕ Λ_2, where Λ_1 and Λ_2 are the tangent spaces associated with η(⋅) and f_Z(⋅), respectively.
The efficient score is

S_eff(S, Z; γ) = [S − H{η(γ^T Z)}] vecl[{Z − E(Z | γ^T Z)} η′(γ^T Z)^T],  (4)

where "vecl" vectorizes the lower (q − d) × d block of a matrix. Hence we can solve the estimating equation Σ_{i=1}^{n_1} S_eff(S_i, Z_i; γ) = 0 to obtain an efficient estimator of γ. We point out that when efficiency of the estimation of γ is not sought, (4) can be generalized to

Σ_{i=1}^{n_1} [S_i − H{η(γ^T Z_i)}] g(γ^T Z_i, Z_i) = 0,

where g(⋅, ⋅) is an arbitrary nontrivial function whose first argument lies in R^d. This estimator retains the consistency of the γ estimation as well (Ma & Zhu, 2012).
Since both E(Z_i | γ^T Z_i) and η(γ^T Z_i) are unknown in (4), we use the following approach to estimate these two quantities. First, we posit a working model η*(⋅), with corresponding derivative η*′(⋅), and estimate E(Z | γ^T Z) nonparametrically, for example by kernel estimation. This yields the estimating equation (5); write its solution as γ̂_1, which is a consistent estimator of γ. Second, we estimate η and η′ by solving the equations (6) for b_0 and b_1, which yields the estimates η̂(t) and η̂′(t). We then insert η̂ and η̂′ into (4) and solve for the efficient estimator γ̂ from the estimating equation (7).
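As a concrete sketch of the kernel step, the following hedged implementation estimates η at a point by a kernel mean regression of S on the index γ^T Z followed by a logit transform (the Epanechnikov kernel, bandwidth, and simulated data are illustrative assumptions; the paper's full procedure iterates this with the efficient score in (5)-(7)):

```python
import numpy as np

def eta_hat(u0, index, s, h):
    """Kernel estimate of eta at u0: logit of the local mean of S.

    index : n-vector of gamma^T Z_i values
    s     : n-vector of binary genuine-case indicators
    h     : bandwidth
    """
    # Epanechnikov kernel: Lipschitz, symmetric, compactly supported (Condition C2)
    u = (index - u0) / h
    k = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    p = np.sum(k * s) / np.sum(k)          # local mean of S near u0
    p = np.clip(p, 1e-6, 1 - 1e-6)         # guard against extreme probabilities
    return np.log(p / (1 - p))             # logit transform recovers eta

rng = np.random.default_rng(0)
idx = rng.normal(size=2000)                # simulated gamma^T Z values
true_eta = idx                             # eta(u) = u, as in simulation study 1
s = rng.binomial(1, 1 / (1 + np.exp(-true_eta)))
print(eta_hat(0.5, idx, s, 0.4))           # should be near eta(0.5) = 0.5
```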

Nested estimating equation for odds ratio parameters
By incorporating the estimates η̂(⋅) and γ̂ from (6) and (7), we can impute the status of the n_2 unvalidated candidate cases using the fitted model

p̂_i = H{η̂(γ̂^T Z_i)}, i = n_1 + 1, …, N_1.  (8)

Using multiple imputation, say with B imputations drawn from (8) and then averaged, the average converges to p̂_i in probability as B → ∞. Hence, we obtain the estimating equation

Σ_{i=1}^{n_1} S_i (1, X_i^T)^T {1 − H(β_c + β_1^T X_i)} + Σ_{i=n_1+1}^{N_1} p̂_i (1, X_i^T)^T {1 − H(β_c + β_1^T X_i)} − Σ_{i=N_1+1}^{N} (1, X_i^T)^T H(β_c + β_1^T X_i) = 0.  (9)

We solve this equation to obtain β̂_c and β̂_1. Following Chen and Ibrahim (2014), considering infinite B eliminates the additional between-imputation variation. Note that the estimation of β_c and β_1 is completely separated from the estimation of γ and η. Both estimation procedures are standard, hence the computation is not challenging. Below, we provide the detailed algorithm.
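The limit argument can be checked numerically: the average of B random binary imputations drawn from a probability p̂_i converges to p̂_i itself as B grows, so imputing with the probability directly is the B → ∞ version of multiple imputation. A sketch with hypothetical probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)
p_hat = np.array([0.1, 0.45, 0.8])    # hypothetical imputed probabilities

B = 200_000                           # number of imputations
draws = rng.binomial(1, p_hat, size=(B, p_hat.size))
mi_average = draws.mean(axis=0)       # average of B binary imputations

# As B grows, the average of the imputed S-tilde values converges to p_hat,
# which is why the probability itself enters the estimating equation.
print(np.abs(mi_average - p_hat).max())
```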
Step 1 (obtain an initial estimator γ̂_1). Solve (5) based on the data {S_i, Z_i}, i = 1, …, n_1, where η* and η*′ in (5) come from a working model.
Step 2 (estimate η(γ^T Z_i), η′(γ^T Z_i), and γ).
Step 2.1 For any γ and t, solve (6) for b_0 and b_1 to obtain η̂(γ^T Z_i) and η̂′(γ^T Z_i).
Step 2.2 Insert η̂(γ^T Z_i) and η̂′(γ^T Z_i) from Step 2.1 into (7) and solve it to obtain an updated γ̂_1 based on the data {S_i, Z_i}, i = 1, …, n_1.
Step 2.3 Repeat Steps 2.1 and 2.2 until convergence. The resulting γ̂_1 is the efficient estimator of γ; let γ̂ = γ̂_1.
Step 3 (apply the imputation model). Compute the imputed probabilities p̂_i in (8) for i = n_1 + 1, …, N_1.
Step 4 (obtain β̂_c and β̂_1). Compute β̂_c and β̂_1 by solving (9), based on γ̂ from Step 2.3, the imputed probabilities from Step 3, and the data. All the equations in the algorithm are solved by Powell's algorithm (Powell, 1965), which is designed for multivariate nonlinear problems.
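Given the imputed probabilities, solving (9) amounts to a logistic-regression score equation with fractional responses, so any multivariate root finder applies. A minimal sketch using scipy's Powell-type hybrid solver (the function and variable names, and the toy data, are our illustrative assumptions):

```python
import numpy as np
from scipy.optimize import root

def solve_ee(y, X):
    """Solve sum_i (1, X_i)^T {y_i - H(b_c + b_1^T X_i)} = 0, where y_i is
    1 (validated genuine case), 0 (control), or the imputed probability
    p_hat_i (unvalidated candidate case)."""
    Xd = np.column_stack([np.ones(len(y)), X])       # add intercept column

    def score(beta):
        p = 1 / (1 + np.exp(-Xd @ beta))             # H(beta_c + beta_1^T X_i)
        return Xd.T @ (y - p)                        # score of equation (9)

    sol = root(score, np.zeros(Xd.shape[1]), method="hybr")  # MINPACK hybrid
    return sol.x

# Hypothetical toy data: 2 validated cases, 2 controls, 1 imputed probability
X = np.array([[1.0], [0.8], [-1.0], [-0.7], [0.4]])
y = np.array([1.0, 1.0, 0.0, 0.0, 0.65])
beta_hat = solve_ee(y, X)
print(beta_hat)
```

The fractional response 0.65 is exactly how an unvalidated candidate case enters (9): it contributes a partial "case" vote weighted by its imputed probability.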

ASYMPTOTIC PROPERTIES
We now derive the asymptotic properties of the estimator of β_1 from (9), taking into account the variability of η̂(⋅) and γ̂. For simplicity, we denote the expit function by H, that is, H(t) = exp(t)/{1 + exp(t)}. We prove the results for the case where γ is a vector; when γ is a matrix, the results are similar but involve more complex notation for the matrix operations.
First, we list the regularity conditions for deriving the asymptotic properties.
C1 There exist two constants 0 < c_1 < c_2 < ∞ such that the sample sizes satisfy c_1 < n_1/n_2 < c_2 and c_1 < N_1/N_0 < c_2.
C2 The univariate kernel function K(⋅) is Lipschitz, symmetric, and has compact support. It is an mth-order kernel, that is, ∫K(u)du = 1, ∫u^i K(u)du = 0 for 1 ≤ i ≤ m − 1, and ∫u^m K(u)du ≠ 0. Here we use the same K regardless of the dimension of its argument.
C3 The bandwidth satisfies h = O(n_1^{−κ}) for 1/(4m) < κ < 1/(2d).
C4 The density functions of Z and γ^T Z, denoted respectively by f_Z(z) and f_{γ^T Z}(γ^T z), are bounded from below and above. Each entry of the matrix E(ZZ^T | γ^T z) is locally Lipschitz-continuous and bounded from above as a function of γ^T z.
C5 E(Z | γ^T z) f_{γ^T Z}(γ^T z) and g(γ^T z) are mth-order differentiable, and their mth derivatives, as well as f_{γ^T Z}(γ^T z), are locally Lipschitz-continuous.
C6 (Boundedness.) The parameter space of γ is bounded.
These are very mild conditions. Condition C1 requires that the proportions of cases and controls do not degenerate to zero, both in the population and in the sample. Conditions C2 and C3 are common requirements on the kernel function and the bandwidth. Conditions C4 and C5 assume sufficient smoothness and boundedness of the covariate densities, as required by the efficient semiparametric method. To ensure a unique solution in the parameter estimation, we assume the boundedness of the parameter space in Condition C6.

Lemma 1. Under Conditions C2-C5, in particular n_1^{1/2} h^4 → 0 and n_1 h^2 → ∞, and using results from Ma and Zhu (2012), γ̂ is a consistent estimator of γ, and n_1^{1/2}(γ̂ − γ) is asymptotically normal with mean zero.
We do not include the details of Lemma 1 since it was carefully proved and discussed in Ma and Zhu (2013). Following Ma and Zhu (2013), the above expansion still holds if we replace the pre-decided function g(⋅) with its estimated version g_0(⋅). To further estimate η(⋅), regardless of which g(⋅) function is used in obtaining γ̂, we propose to simply perform a kernel mean regression followed by a logit transformation; that is, at any u_0, we set

η̂(u_0) = logit[{Σ_{i=1}^{n_1} K_h(γ̂^T Z_i − u_0) S_i} / {Σ_{i=1}^{n_1} K_h(γ̂^T Z_i − u_0)}],

where logit(p) = log{p/(1 − p)} and K_h(⋅) = K(⋅/h)/h. Next, we provide the asymptotic properties of β̂ based on the properties of γ̂ and η̂.
Theorem 1. Under Conditions C1-C6, β̂ is consistent and asymptotically normal with mean zero and variance of the sandwich form A_β^{−1} V A_β^{−T}, where V = V_1 + V_2 + V_3.
The asymptotic variance in Theorem 1 has the typical sandwich form. The matrix A_β results from the derivative of (9) with respect to β. The matrix V contains three components: V_1 captures the variability contributed by the randomness of the fully observed genuine cases and noncases; V_2 corresponds to the variability due to the randomness of the unvalidated candidate cases; and V_3 corresponds to the variability due to the randomness of the controls.
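The sandwich form can be sketched generically: with estimating function ψ_i evaluated at the estimate, the variance estimate is Â^{-1} V̂ Â^{-T}/n, with Â the average derivative of ψ and V̂ the outer-product estimate of var(ψ). The following is a generic numerical sketch, not the paper's exact V_1, V_2, V_3 decomposition (which additionally accounts for the estimated γ̂ and η̂); the mean-estimation example is an illustrative assumption:

```python
import numpy as np

def sandwich_variance(psi, dpsi):
    """Generic sandwich estimator: A^{-1} V A^{-T} / n.

    psi  : (n, p) array of estimating-function values psi_i at the estimate
    dpsi : (p, p) average derivative of psi with respect to the parameter
    """
    n = psi.shape[0]
    V = psi.T @ psi / n                 # outer-product estimate of var(psi)
    A_inv = np.linalg.inv(dpsi)
    return A_inv @ V @ A_inv.T / n

# Illustration with i.i.d. mean estimation: psi_i = x_i - mu_hat, dpsi = -I,
# so the sandwich reduces to the usual variance of the sample mean.
rng = np.random.default_rng(2)
x = rng.normal(size=(5000, 2))
psi = x - x.mean(axis=0)
cov = sandwich_variance(psi, -np.eye(2))
print(np.diag(cov))                     # each entry near 1/5000 = 2e-4
```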
Theorem 1 shows that the proposed estimator β̂ is consistent with a root-n convergence rate. It also provides an approach to estimating the asymptotic variance of β̂. The proof of Theorem 1 is in Appendix S1.

Data generation procedure
The population can be divided into three parts, D = 0, 1, and 2, according to the model. The ratio between D = 0 and D = 1 is 1 : exp(β_c + β_1^T X), and the ratio between D = 1 and D = 2 is exp{η(γ^T Z)} : 1. Thus, the ratio among D = 0, D = 1, and D = 2 is 1 : exp(β_c + β_1^T X) : exp{β_c + β_1^T X − η(γ^T Z)}. Therefore, we use the following data generating process to conduct finite sample studies.
1. Generate a population following the model, with the class ratios for D = 0, 1, 2 given above.
2. Sample N_0 observations from the D = 0 subpopulation.
3. Sample N_1 observations from the D = 1 and D = 2 subpopulations combined.
4. Sample n_1 observations from the subsample of size N_1 above, and set S = 1 if D = 1 and S = 0 if D = 2. Mask out the D information on all the N_1 observations.
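This sampling scheme can be sketched as follows (the parameter values, dimensions, and η below are illustrative assumptions, not the study's true settings):

```python
import numpy as np

rng = np.random.default_rng(3)

def generate(N0, N1, n1, beta_c, beta_1, gamma, eta):
    """Build the three-class population, then sample N0 controls and
    N1 candidate cases, of which n1 are validated."""
    # Step 1: a large population with per-subject class ratios
    # P(D=0) : P(D=1) : P(D=2) = 1 : exp(bc + b1'X) : exp(bc + b1'X - eta(g'Z))
    M = 200_000
    X = rng.normal(size=(M, beta_1.size))
    Z = X.copy()                                   # here Z = X for simplicity
    w1 = np.exp(beta_c + X @ beta_1)
    w2 = w1 * np.exp(-eta(Z @ gamma))
    probs = np.column_stack([np.ones(M), w1, w2])
    probs /= probs.sum(axis=1, keepdims=True)
    U = rng.random(M)
    D = (U > probs[:, 0]).astype(int) + (U > probs[:, :2].sum(axis=1))

    # Steps 2-3: sample controls and candidate cases
    ctrl = rng.choice(np.flatnonzero(D == 0), N0, replace=False)
    cand = rng.choice(np.flatnonzero(D > 0), N1, replace=False)

    # Step 4: validate n1 candidate cases (S observed); D is masked, so only
    # the covariates and the n1 validated S values are returned
    S = np.where(D[cand[:n1]] == 1, 1, 0)
    return X[ctrl], X[cand], S

Xc, Xa, S = generate(1000, 1000, 500, -0.5, np.array([1.0, -0.5]),
                     np.array([1.0, 0.5]), lambda u: u)
print(Xc.shape, Xa.shape, S.shape)
```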

Finite sample study
We study the finite sample performance of our method through simulation studies. In each study, we generate 1000 datasets. In the first study, we generate a p = 6 dimensional covariate vector X from the multivariate normal distribution with mean zero and identity variance-covariance matrix. We set (Z_1, Z_2, Z_3)^T = (X_1, X_2, X_3)^T, and generate (Z_4, Z_5, Z_6)^T from the multivariate normal distribution with mean zero and identity variance-covariance matrix I_3; thus, the dimension of Z is q = 6. We set the true parameter values (…, 2.5, −0.5)^T and consider the true η function to be η(γ^T Z) = γ^T Z. We experiment with sample size N_0 + N_1 = 2000: n_1 = 500 are observed candidate cases whose true status D is known (D = 1 or D = 2), N_0 = 1000 are controls (D = 0), and the remaining N_1 − n_1 = 500 are candidate cases whose true status is unobserved. In the second study, we repeat the same analysis, except that the true η function is η(γ^T Z) = 1 − (γ^T Z)^2. In the first two studies, we set the bandwidth of the nonparametric estimator to c SD(γ^T Z)(N_1 − n_1)^{−1/3}, where c is a constant in the range 0.1 to 10; the results are insensitive to c within this range.
The performance of β̂ in the first simulation study is shown in Table 1 and Figure 1. The estimators of β clearly have very small biases and SDs. We also report the estimated SD of the main regression parameter estimator β̂ using the asymptotic results provided in Section 4. The average estimated SD is close to the sample SD, and the resulting 95% confidence interval has coverage close to the nominal level. The estimation of γ is also consistent, with very small SEs: the absolute biases of γ̂_{−1} are (0.025, 0.004, 0.052, 0.022, 0.001)^T and the corresponding SEs are (0.609, 0.552, 0.649, 0.599, 0.610)^T, where the subscript −1 indicates all components of γ̂ except the first. More details on estimating γ are rigorously discussed in Ma and Zhu (2012) and Ma and Zhu (2013). From Figure 1 we can see that the mean of η̂(γ^T z) is close to the true function η(γ^T z) overall, with worse performance at the boundary than in the interior, as is typical of nonparametric estimators. The results of estimating γ in the second simulation study are in Table 2 and Figure 2. The estimates of γ_{−1} have small absolute biases, (0.072, 0.051, 0.076, 0.092, 0.086)^T, and SEs (0.470, 0.511, 0.482, 0.482, 0.518)^T. The same conclusions can be drawn as in the first simulation.
For comparison, we also report the results from a naive method and the original estimating equation (OEE) method (Wang et al., 2020). The naive method treats the noncases as genuine cases. The OEE method uses a weighted estimating equation to overcome the bias of the naive method, with the weight calculated by estimating, parametrically, the probability of being a genuine case given the covariates; when that parametric model is correct, its odds ratio parameter estimation is unbiased. OEE performs well in the first study because the model is correctly specified, but performs poorly in the second study, where the model is misspecified. Unsurprisingly, the naive method performs poorly in both studies.
We also conduct a third simulation to evaluate the performance of the proposed model in a high-dimensional covariate case that imitates the dilated cardiomyopathy dataset. In this scenario, we generate X and Z from independent standard uniform distributions with dimensions p = q = 20 and d = 2, and the two covariate vectors share 10 common components. The true η function depends on the two indices γ_1^T Z and γ_2^T Z, where γ_1 and γ_2 stand for the first and second column vectors of γ, respectively. The sample sizes are N_1 = 2000 candidate cases and N_0 = 5000 controls. Among the candidate cases, we randomly mask out D for 1000 observations. The bandwidth of the nonparametric estimator is set to c{SD(γ_1^T Z) + SD(γ_2^T Z)}(N_1 − n_1)^{−1/5}, where c is a constant in the range 0.1 to 10; the results are insensitive to c within this range.
The γ_{−1} estimates in the third simulation are reported in Table 3, along with the corresponding SD estimates and 95% coverage probabilities. We also display the estimated η(⋅), that is, η̂(⋅), in Figure 3, where the mean and 95% confidence band are reported. The estimate η̂ captures the trend of η even in this high-dimensional situation. A referee points out that the estimated SDs are almost identical to each other; this is because the covariate components in X and Z happen to be generated from the same distribution in this simulation. Following a referee's request, we further conduct two additional simulation studies to investigate the performance of our method in a small sample size situation and in an MNAR situation, respectively. In Study 4, the data are generated from the same model and parameter setting as in Study 2, but with sample sizes n_1 = 50, N_1 = 100, and N_0 = 100, hence a total size of N_1 + N_0 = 200. The results are provided in Table 4 and Figure 4. They show that when the total sample size is 200, our method deteriorates, although it still performs better than the OEE and naive methods. Our method captures the missingness mechanism well in terms of estimating η, although the confidence band is wider than in Study 2 due to the very small sample size. In the fifth simulation, we set (Z_1, Z_2, Z_3)^T = (X_1, X_2, X_3)^T as before, and generate (Z_4, …, Z_8)^T from the multivariate normal distribution with mean zero and identity variance-covariance matrix I_5; thus, the dimension of Z is q = 8. Otherwise, all settings are the same as in Study 2.
Thus, the data generation mechanism for the true outcome status D depends on all the covariates in Z, while we use only the first six components of Z to estimate the imputation model (2). This is an MNAR setting, and it mimics the situation in which the two covariates (Z_7, Z_8)^T are not observed. The estimates of γ_{−1} and η are reported in Table 5 and Figure 5. Compared with the correctly specified model in the second simulation, our method retains the major trend in the estimation with slight biases, indicating some degree of robustness of our method when the missingness model is misspecified.
Simulation performance in Study 3. First line: η̂ versus γ_1^T Z at fixed values of γ_2^T Z, from left to right. Second line: η̂ versus γ_2^T Z at γ_1^T Z = 0.8, 0.9, 1.1, from left to right. Black solid line: true η; blue solid line: mean of η̂; blue dashed lines: 0.025 and 0.975 quantiles of η̂.

DILATED CARDIOMYOPATHY DATASET ANALYSIS
We apply the proposed model to the analysis of a dilated cardiomyopathy case-control study using data from the University of Pennsylvania EHR. The subjects in this study are patients of European descent who are enrolled in the Penn Biobank. The main goal of the study is to assess the association of the hiPSI TTNtv with the phenotype dilated cardiomyopathy. The adjusting covariates include a patient's gender, age, a collection of ICD-9 and ICD-10 codes related to dilated cardiomyopathy, summary measures derived from echocardiograms (EKGs), and genetic principal components to help control for population stratification. Additionally, a number of individuals in the data set are missing the EKG summary measures, so we include an indicator for each patient of whether each summary measure is available. Patients' ICD-9 and ICD-10 codes were mapped to PheWAS codes (Haggerty et al., 2019). In this analysis, a candidate case is defined as one who had at least one visit for dilated cardiomyopathy or has had at least one of the following diagnosis codes: I42.0, 425.4, 425.8, 425.9, I42.8, and I42.9. The dilated cardiomyopathy visits are any encounters with the words "Dilated Cardiomyopathy" in the clinical notes; these encounters are identified using natural language processing, a technique for text mining. The genuine cases were defined using an algorithm validated by the clinician team, and the remaining patients in the case pool who did not meet the genuine case definition were treated as noncases. Everyone who does not match the definition of a candidate case is considered a control. The sample size of candidate cases is 1723, of whom 400 individuals were fully observed. We obtained the validated sample by randomly drawing a subset of 400 individuals from the candidate cases; the genuine case status D was retained for these 400 individuals and masked for the remainder of the candidate cases. We also have 6120 controls. The bandwidth in
the nonparametric estimator is set to SD(γ^T Z) × 1323^{−1/3}. Before applying the proposed semiparametric method to estimate the indices γ_i, i = 1, 2, …, d, we first determine the number of indices d by minimizing the VIC (Ma & Zhang, 2015), an information criterion for sufficient dimension reduction models that accounts for goodness of fit and dimensionality simultaneously under mild assumptions. The best choice of d corresponds to the smallest VIC.
The most preferable choice of d is 1, corresponding to VIC = 123.462. The estimate of η is reported in Figure 6. According to the plot, the probability of being a genuine case decreases, with small perturbations, as γ_1^T Z increases. Large variability occurs at both ends, where fewer data points are observed. The estimates of β from the three methods are reported in Table 6. The coefficient for the hiPSI TTNtv is significant with the same sign in all methods. Meanwhile, the estimation efficiency of the odds ratio parameter is higher for the proposed method than for the OEE method. OEE estimates Age to be nonsignificant, with the opposite sign compared to the other methods. All methods conclude that only the first genetic principal component is significant. Compared to the naive analysis, which treats all missing values as genuine cases, the proposed method does not lose much efficiency.

CONCLUSION
We propose a nested semiparametric method for analyzing EHR-based case-control studies in which the true outcome status of some of the candidate cases is missing. Our method imputes the missing values by introducing an additional category, the noncases, and by modeling the genuine case/noncase contrast semiparametrically. The imputation process is very flexible because of the semiparametric structure and the dimension reduction association. Meanwhile, applying the efficient semiparametric sufficient dimension reduction estimator helps retain stability in the odds ratio parameter estimation in the main model even though the missingness scheme is unknown. Many alternative approaches from the missing data literature would apply if the imputation model were known or parametric, such as the maximum likelihood estimator (MLE) and fully Bayesian methods (Ibrahim et al., 2005; Mitra & Reiter, 2011). However, a prespecified functional form increases the chance of model misspecification (Si & Reiter, 2013), and misspecification leads to biased results (Chen & Ibrahim, 2014). To improve robustness, modifications have been made to both MLE and Bayesian methods, such as incorporating splines to estimate the nonparametric components (Rizopoulos & Ghosh, 2011; Su & Hogan, 2008). These modified MLE and Bayesian methods are alternatives to our semiparametric imputation approach and reflect different general strategies for handling missing data in the literature. Although we only considered binary outcomes subject to missingness, the flexibility of the semiparametric modeling allows a straightforward extension to more complex data formats.

ORCID
Ge Zhao https://orcid.org/0000-0002-2875-8652

Simulation performance in Study 1. Left panel: boxplot of β̂. Right panel: performance of η̂. Black solid line: truth in both panels. Blue solid line: mean of η̂; lower blue dashed line: 0.05 quantile of η̂; upper blue dashed line: 0.95 quantile of η̂.
Simulation performance in Study 2. Left panel: boxplot of β̂. Right panel: performance of η̂. Black solid line: truth in both panels. Blue solid line: mean of η̂; lower blue dashed line: 0.05 quantile of η̂; upper blue dashed line: 0.95 quantile of η̂.
Simulation performance in Study 4. Left panel: boxplot of β̂. Right panel: performance of η̂. Black solid line: truth in both panels. Blue solid line: mean of η̂; lower blue dashed line: 0.05 quantile of η̂; upper blue dashed line: 0.95 quantile of η̂.
Simulation performance in Study 5. Left panel: boxplot of β̂. Right panel: performance of η̂. Black solid line: truth in both panels. Blue solid line: mean of η̂; lower blue dashed line: 0.05 quantile of η̂; upper blue dashed line: 0.95 quantile of η̂.
Estimated logit probability of genuine cases among candidate cases in the dilated cardiomyopathy dataset analysis, that is, η̂.
TABLE 1. Results of Study 1, based on 1000 simulations with 1000 controls and 1000 candidate cases.
Abbreviations: Bias, average of absolute bias; CI, average 95% confidence interval; Coverage, coverage of the 95% confidence interval; Mean, average of β̂; SD, sample standard deviation; ŜD, average of the estimated standard deviations.
TABLE 2. Results of Study 2, based on 1000 simulations with 1000 controls and 1000 candidate cases. Abbreviations as in Table 1.
TABLE 3. Results of Study 3, based on 1000 simulations with 2000 controls and 2000 candidate cases. Abbreviations as in Table 1.
TABLE 4. Results of Study 4, based on 1000 simulations with 100 controls and 100 candidate cases.