Method for Evaluating Multiple Mediators: Mediating Effects of Smoking and COPD on the Association between the CHRNA5-A3 Variant and Lung Cancer Risk

A mediation model explores the direct and indirect effects between an independent variable and a dependent variable by including other variables (or mediators). Mediation analysis has recently been used to dissect the direct and indirect effects of genetic variants on complex diseases using case-control studies. However, bias could arise in the estimations of the genetic variant-mediator association because the presence or absence of the mediator in the study samples is not sampled following the principles of case-control study design. In this case, the mediation analysis using data from case-control studies might lead to biased estimates of coefficients and indirect effects. In this article, we investigated a multiple-mediation model involving a three-path mediating effect through two mediators using case-control study data. We propose an approach to correct bias in coefficients and provide accurate estimates of the specific indirect effects. Our approach can also be used when the original case-control study is frequency matched on one of the mediators. We employed bootstrapping to assess the significance of indirect effects. We conducted simulation studies to investigate the performance of the proposed approach, and showed that it provides more accurate estimates of the indirect effects as well as the percent mediated than standard regressions. We then applied this approach to study the mediating effects of both smoking and chronic obstructive pulmonary disease (COPD) on the association between the CHRNA5-A3 gene locus and lung cancer risk using data from a lung cancer case-control study. The results showed that the genetic variant influences lung cancer risk indirectly through all three different pathways. The percent of genetic association mediated was 18.3% through smoking alone, 30.2% through COPD alone, and 20.6% through the path including both smoking and COPD, and the total genetic variant-lung cancer association explained by the two mediators was 69.1%.


Introduction
A mediation model is a statistical approach that explores the direct and indirect effects of an independent variable (i.e., initial variable) on a dependent variable (i.e., outcome variable) by including one or more mediating variables (or mediators) [1]. In some scenarios, the mediation model can infer the causal effects from the initial variable to the mediator variable and then to the outcome variable [1]. Mediation models have been widely applied in many different fields [2], such as psychology, behavioral science, genetic epidemiology, prevention research, and political communication research. Recently, there have been efforts in using mediation analysis to dissect the direct and indirect effects of genetic variants on complex diseases in genetic variant association studies [3][4][5][6][7]. Most of these studies used data from genome-wide association (GWA) studies, in which the outcome variables were selected on the basis of case-control study design. For example, our group has applied single-mediator analysis (i.e., the Baron-Kenny procedure) to identify the mediation effects of smoking and chronic obstructive pulmonary disease (COPD) on the association between the CHRNA5-A3 genetic locus and lung cancer risk using data from a case-control GWA study of lung cancer [6]. However, ignoring the case-control study design and applying standard regressions might result in biased estimations of the indirect effects. According to recent studies of secondary phenotypes, the bias could arise in the estimations of the genetic variant-mediator association because the presence or absence of the mediator (i.e., cases and controls with respect to the mediator) is not sampled following the principles of case-control study design [8][9][10][11][12]. In this case, the mediation analysis using data from case-control studies might lead to biased indirect effect estimates, either over-or under-estimated depending on the prevalence values of outcome and mediators.
Lung cancer GWA studies have consistently shown that the CHRNA5-A3 gene cluster is strongly associated with an increased risk of lung cancer. Also, multiple studies have associated SNPs spanning this region with heavy smoking, nicotine dependence, smoking cessation and COPD [13][14][15][16][17][18][19]. Thus, there is a debate about whether the genetic variants have an impact on lung cancer risk directly or exert their effect largely through the profound effect of the variants on smoking intensity [20][21][22] or COPD [23]. Further work investigating this association concluded that there are dual pathways between the genetic variant and lung cancer association, independently via a direct effect on lung carcinogenesis and through smoking behavior [6,7,15,[24][25][26]. More recent studies of current smokers have shown that the genetic variants on CHRNA5-A3 gene cluster have a stronger association with cotinine levels than with self-reported smoking behavior, and suggested that the effect of the genetic variants on lung cancer risk, is largely, if not exclusively, through their effect on smoking intensity [27][28][29]. However, in an accompanying editorial Spitz et al [21] concluded that the degree to which the association is mediated by smoking is yet to be determined. Prior studies focused on one mediator (e.g., smoking) at a time, and none has studied multiple mediators simultaneously in one model. However, in reality, more than one mediator could affect the association between the genetic variant and lung cancer risk. In our previous analysis [6], we found that in single-mediator analyses smoking and COPD were mediators of the association between the singlenucleotide polymorphism (SNP) rs1051730 and risk of lung cancer. However, analyzing multiple mediators in one model could have some advantages over such single-mediator analyses [30].
The multiple-mediation model used for the study of the SNP, smoking, COPD and lung cancer risk is depicted as a path diagram in Figure 1. The multiple-mediation model includes a three-path mediating effect through both smoking and COPD, which allows one mediator (i.e., smoking) to causally affect the other mediator (i.e., COPD) [31]. This causal association is biologically compelling because smoking is the known major risk factor for COPD [32]. The underlying assumption of this threepath mediating effect is that the individuals carrying the deleterious allele of rs1051730 are more likely to be heavy smokers, which in turn leads to a higher risk of COPD, which in turn increases the risk of lung cancer. Thus, in addition to the indirect effects passing through each of the mediators alone, we will investigate the indirect effect passing through both mediators.
To our knowledge, there has been no previous study investigating such a multiple mediation model in the case-control study design setting, in which the standard regression approach could provide biased estimations for the indirect effects as we described above. Therefore, we developed an approach to conduct a multiple-mediation analysis using the model shown in Figure 1. We conducted simulations to investigate the performance of the proposed approach, and these showed the approach can provide accurate estimates of the indirect effects. The bootstrapping approach was applied to assess the significance of the indirect effects and total effect. We also developed an approach for when the original case-control study is frequency matched on one of the mediators, as in our lung cancer case-control study where controls are frequency matched to cases with respect to smoking status. We applied the proposed approach to the multiple-mediation study of the simultaneous mediating effects of smoking and COPD on the association between SNP rs1051730 and lung cancer risk using lung cancer case-control GWA study data.

Methods
Let X, M 1 , M 2, and Y denote the genetic variant, two mediator phenotypes, and the disease variable, respectively. We assumed binary random variables for both mediator variables and the disease variable, denoted as M 1~f 0, 1g, M 2~f 0, 1g, and Y~f0, 1g, respectively, with 0 representing non-occurrence and 1 representing occurrence of the mediator phenotypes or the disease. We considered a SNP locus with two alleles: deleterious allele A and normal allele a. We first assumed a dominant or recessive genetic model for the genetic variant and also denoted it as a binary random variable, X~f0, 1g. For a dominant genetic model, 0 represents genotype (a, a) and 1 represents genotypes (A, a) and (A, A); for a recessive genetic model, 0 represents genotypes (a, a) and (A, a) and 1 represents genotype (A, A). Note that if an additive genetic model was assumed, a categorical random variable X~f0, 1, 2g was denoted to represent genotypes (a, a), (A, a), and (A, A), respectively. Given the random variables, X, M 1 , M 2, and Y, the association among all random variables shown in Figure 1 can be expressed using the following conditional probabilities with logistic models: Pr , and ð2Þ where a 0 , b 0 , c 0 , a 1 , a 2 , b 1 , b 2 , d, and c9 are regression coefficients and i, j, k = 0, 1. There are different indirect effects in this model [33] (see Figure 1): (i) the indirect effect passing through the mediator M 1 , bypassing M 2 , which can be assessed as a 1 b 1 (denoted as IE 1 ); (ii) the indirect effect passing through the mediator M 2 , bypassing M 1 , which can be assessed as a 2 b 2 (denoted as IE 2 ); and (iii) the three-path indirect effect passing through both mediators, which can be assessed as a 1 db 2 (denoted as IE 3 ). Therefore, the total indirect effect passing through the mediators can be given as the summation of the above indirect effects: a 1 b 1 +a 2 b 2 +a 1 db 2 (denoted as IE t ). The regression coefficient c9 represents the effect of the genetic variant on the disease not mediated by either mediator and is usually called the direct effect. In general, the total effect of the genetic variant on the disease is estimated by regressing the disease variable on the genetic variant variable directly. However, the previous analysis has shown that the total effect estimated in this way could be biased when the disease variable and/or mediator variables are binary [34]. Therefore, in this study we reported the total effect (TE) using an alternative formula defined as the summation of the indirect and direct effects (denoted as TE = IE t +c9). In this case, the percentages of the association explained by the different mediation paths (percent mediated, PM) can be assessed as the specific indirect effects divided by the defined total effect, respectively, and denoted as PM 1 = IE 1 /TE, PM 2 = IE 2 /TE, PM 3 = IE 3 /TE, and PM t = IE t /TE, which represents PM of M 1 bypassing M 2 , PM of M 2 bypassing M 1 , PM of both M 1 and M 2 , and the total PM through different paths, respectively. When the data of interest are randomly sampled from the general population, the estimations of the indirect effects and the percent mediated are accurate. However, if the data are sampled based on a case-control study design, the estimated associations among the initial variable and both mediators (i.e., a 1 , a 2 , and d) will be biased if standard logistic regressions are employed, which in turn, will result in biased estimations of indirect effects and the percent mediated [8][9][10][11][12]. To obtain accurate estimations of the coefficients a 1 , a 2 , and d, we modified the bias-correction approach proposed in our previous study [12]. Briefly, the biased coefficient estimated from the logistic regression, the prevalence values of the disease, and both mediator phenotypes can be expressed using non-linear equations. The prevalence values are obtained from the literature, and the robustness of this approach to the misspecification of prevalence values has been investigated in our previous works [12,35]. Solving the system of non-linear equations gives us the corrected coefficients. For the purpose of the multiplemediator model, different non-linear equations were employed to correct different coefficients. The correction approach for the regression coefficient d for the M 1 -M 2 association, while regressing M 2 on M 1 and X (see Figure 1), is given below. The correction approaches for the other parameters, namely a 1 and a 2 , are given in Text S1.

Correction of Coefficient d
As stated above, the regression coefficient d, of the M 1 -M 2 association while regressing M 2 on M 1 and X, could be biased. We used the following non-linear estimating equation approach to correct the bias. Given a sample of N participants, of which N 1 are cases (Y = 1) and N 0 are controls (Y = 0) with respect to the disease, the odds ratio (OR) for the association between the mediators M 1 and M 2 (exp(d)) can be expressed as follows: where E kj is the expected number of individuals in the sample, with M 2 = k and M 1 = j, which is given as where j, k, r = 0, 1. The conditional probability p kj|r is written as p kjjr~P r (M 2~k , M 1~j jY~r) Pr(Y~rjM 2~k , M 1~j ) Pr (M 2~k jM 1~j ) Pr (M 1~j ) Pr (Y~r) p rjkj p kjj p j q r , for j, k, r = 0, 1.
The probabilities p 1 and q 1 represent the prevalence of the mediator M 1 and the disease, respectively, in the general population. The conditional probabilities p r|kj and p k|j are given as functions of regression coefficients: and where b 0 , c 0 , and d are unknown coefficients of interest. Based on the conditional probabilities given above, we can write the estimated prevalences of the disease and the mediator M 2 as follows: Given a sample with N independent individuals for a casecontrol study of the disease (Y), one can estimate the regression coefficients b 1 and b 2 as well as the biased coefficient d using logistic regressions based on Equations (1),(3). Therefore, Equations (4),(6) are a system of nonlinear equations with three unknown variables, c 0 , b 0 , and d. We employed the ''fsolve'' function in Matlab [36] to solve the nonlinear equation system with the use of default settings. By default, the ''fsolve'' function uses the trust-region dogleg algorithm, which is a variant of the Powell dogleg method [37]. The solution to this nonlinear equation system will give us the corrected estimate for coefficient d for the association between two mediators. As mentioned above, for brevity, the details of correction for the coefficients a 1 and a 2 were given in Text S1. We denoted the corrected coefficients asã a 1 , a a 2 , andd d. Given these corrected coefficients, the indirect effects can be estimated as IE 1 =ã a 1 b 1 , IE 2 =ã a 2 b 2 , and IE 3 =ã a 1d db 2 .

Additive Genetic Model
When the genetic variant is assumed to be additive, special care needs to be taken. In this situation, we used a categorical random variable, X~f0, 1, 2g, to denote the three genotypes (a,a), (A,a), and (A,A). We employed the property that the biased OR obtained using logistic regression is given by the per-allele OR and adapted the approach for an additive model proposed in our previous study [35]. To obtain the true per-allele OR, we assessed biased OR in two ways. First, we obtained the biased OR 1 by calculating the OR of SNP random variable X = 1 versus X = 0, which gives the OR for heterozygous genotype against wild-type homozygous genotype. Second, we obtained the biased OR 2 by calculating the OR of SNP random variable X = 2 versus X = 0, which gives the OR for homozygous genotype for variant allele against wild-type homozygous genotype. On the basis of OR 1 and OR 2 , and following the different formulas in our previous study [12], we obtained two corrected coefficients, and the final corrected coefficient for the additive genetic model is the average of these.

Frequency-matched Case-Control Study
Frequency matching is an important and commonly used study design for known risk confounders and has been widely used in case-control studies [38]. In the analysis of real lung cancer data, because smoking is a well-known risk confounder for the association between lung cancer and other risk factors, controls were frequency matched to lung cancer cases with respect to smoking status. That is, for the multiple mediation model shown in Figure 1, the disease cases and controls are frequency matched on the mediator M 1 . In this scenario, frequency-matching design also contributes to bias in the estimate of the coefficients for associations among the SNP and the mediators (i.e., a 1 , a 2 , and d). Therefore, we adapted the approach proposed in our previous work [12] with some modifications. We first considered the calculation ofã a 1 . The expected numbers of individual E ji can be calculated as for i = 0, 1, 2 and j = 0, 1.
The parameter D was denoted as the difference in the proportions of individuals with the presence of the mediator M 1 in the disease cases and controls, given as D = prop(M 1 = 1|Y = 0) prop(M 1 = 1|Y = 1). In reality, the selection of controls in a frequency-matched study does not have to be perfect, that is, the proportions of individuals with the matched variables do not have to be exactly the same in the disease cases and controls (D = 0). For example, in the study of lung cancer, the proportion of current smokers was 48% in lung cancer cases and 42% in controls, and the difference in the proportions was D = 20.06. Therefore, the inclusion of the parameter D can take into account variations that occur when selecting controls that are frequency matched on the mediator, and therefore, improve the robustness of our approach. The conditional probabilities h ijj0 and h jj1 can be calculated using the same formulas given in our previous work [12]: for i = 0, 1, 2, and j = 0, 1. When assessing the corrected coefficientd d, we used a similar formula to evaluate the expected numbers of individual E kj : for j, k = 0, 1. The conditional probabilities g kjj0 and g jj1 are defined as: for j, k = 0, 1.
If the original disease case-control study is frequency matched on the mediator M 1 , the estimated value of b 1 will be nonsignificant or biased and will not represent the true association between the mediator M 1 and the disease. However, because the matching design considers the known risk-confounding factor at the study design phase, we typically know the associated risk. Therefore, for the frequency-matching case-control studies, we added one more constraint on the value of b 1 , which is fixed as the known risk coefficient (from the literature or estimated from unmatched case-control studies). Given the new formulas for E ji and E kj , one can follow the same procedure described for the unmatched study to assess the corrected coefficientsã a 1 andd d, respectively. The corrected coefficientã a 2 can be evaluated using the same formula of E ki that was used in the unmatched casecontrol study because the calculation ofã a 2 does not involve the matched mediator variable M 1 .

Bootstrapping Confidence Intervals for Indirect Effects
Bootstrapping has been employed to evaluate the significance of indirect effects in a multiple-mediator model [30,33] to overcome the difficulty in assessing standard errors for the indirect effects. In this study, we also used the empirical confidence intervals (CIs), based on a resampling-based method with replacement [39]. Given the regression coefficients b 1 , and b 2 obtained using the standard regression and the corrected coefficientsã a 1 ,ã a 2 , andd d obtained using the proposed approach, the empirical CIs of the corrected individual indirect effects IE 1 =ã a 1 b 1 , IE 2 =ã a 2 b 2 , and IE 3 =ã a 1d db 2 , as well as the total indirect effect IE t =ã a 1 b 1 +ã a 2 b 2 +ã a 1d db 2 , were obtained by the following steps: 1. Take B samples with replacement from the study data, each with n 1 individuals from the disease cases and n 0 samples from the disease controls (n = n 0 +n 1 ). Note that n 0 #N 0 and n 1 #N 1 , where N 0 and N 1 are numbers of cases and controls with respect to the disease in the study sample.
2. Evaluate the bootstrap regression coefficients using logistic regressions based on the bootstrap samples. Denote the bootstrap coefficients as , andd d Ã u , u = 1, 2, …, B are calculated by using the approaches described above.

The bootstrap indirect effects are assessed as
and IE Ã t½u be the uth ordered bootstrap indirect effects estimations, respectively. Then the 100(1-c)% CIs of indirect effects are given as

Simulation Approach
We performed simulation studies to investigate the performance of our approach for evaluating the indirect effects in the multiplemediation model in a case-control study ( Figure 1). To mimic the real data analysis of lung cancer, we assumed a single di-allele SNP with a minor allele frequency (MAF) of 37%. We used 14%, 24%, and 12% as the prevalence values for the disease (Y), the mediator M 2 , and the mediator M 1 , respectively, which approximate the prevalence values of lung cancer [40], COPD [41], and heavy smokers [42] in ever smokers. We considered two different sets of regression coefficients for the associations among the SNP, the mediators, and the disease. For the first scenario, we fixed the coefficients as a 1 = 0.4055, a 2 = 0.4055, d = 0.6931, c9 = 0.4055, b 1 = 1.0986, and b 2 = 1.0986, which correspond to ORs of 1.5, 1.5, 2, 1.5, 3, and 3, respectively; for the second scenario, we fixed the coefficients as a 1 = 0.3365, a 2 = 0.3365, d = 0.3365, c9 = 0.6931, b 1 = 0.4055, and b 2 = 0.4055, which correspond to ORs of 1.4, 1.4, 1.4, 2, 1.5, and 1.5, respectively. The ORs used in this simulation studies were chosen to reflect the observed ORs found in many GWA studies of common human diseases [20,[43][44][45]. According to these settings, the theoretical true values of the percentage of the total indirect effect among the association of interest are about 75% for scenario one and 32% for scenario two. For each scenario, we considered different study designs (i.e., unmatched study and frequency-matched study with respect to mediator M 1 ) and different genetic models for the SNP (i.e., dominant, additive, and recessive genetic models). For the frequency-matched study, we also considered different values for the parameter D (0, 60.05, 60.1), which represents the difference in the proportion of individuals with the mediator M 1 in disease cases (Y = 1) and controls (Y = 0). On the basis of these parameters, we obtained the values for the intercept regression coefficients a 0 , b 0 , and c 0 for different situations.
First, we generated genotypes for a SNP using the genotype frequencies, which can be calculated from the MAF. The mediator M 1 values were then generated on the basis of the dataset of realizations of the SNP using Equation (1), assuming different genetic models for the SNP. Conditioned on mediator M 1 and the SNP values, we used Equation (2) to generate the values of the mediator M 2 . Last, the disease cases and controls were generated conditional on values of the SNP and both mediators M 1 and M 2 using Equation (3). In this way, we simulated a large amount of data on the population of interest and then randomly sampled 1,000 disease cases (Y = 1) and 1,000 disease controls (Y = 0). When a frequency-matched case-control study design with respect to the mediator M 1 was considered, the 1,000 disease cases were still sampled randomly. However, the 1,000 controls were sampled so that the proportion of the presence of the mediator M 1 in the controls was approximately equal to that in the cases [38]. The average results of coefficients and indirect effects reported for the simulation studies were based on 1,000 replicate datasets.

Simulation Study
The average results of the regression coefficients a 1 , a 2 , b 1 , b 2 , c9, and d estimated using both standard logistic regression and the approach proposed in this article are reported in Table 1. In the table, the top panel shows the results for the first simulation scenario and the bottom panel shows the results for the second simulation scenario. The true regression coefficients used to generate the data are also listed in the table for the purpose of comparison. For each scenario, we investigated different study designs (unmatched and frequency-matched), different genetic models (dominant, additive, and recessive), and differences in the proportions of the matched variable (M 1 ) between the disease cases and controls (D = 0, 60.05, and 60.1).
For the unmatched case-control study design, when the standard logistic regressions were applied, the estimates of c9, b 1 , and b 2 were close to the corresponding true values, which was expected because selection of the disease cases and controls does not introduce bias in these estimations. For example, for scenario one using the dominant genetic model (unmatched study), the estimated values for c9, b 1 , and b 2 were 0.4041, 1.0967, and 1.0989, respectively, which were very close to the true values of 0.4055, 1.0986, and 1.0986 used for the simulations. However, the estimated values for a 1 , a 2 , and d were 0.4615, 0.4547 and 0.7551, respectively, which were biased compared to the true values of 0.4055, 0.4055, and 0.6931. On the other hand, the proposed approach led to estimates ofã a 1 ,ã a 2 , andd d as 0.4119, 0.4069, and 0.6942, respectively, which agreed well with the true values.
When the case-control study was frequency-matched with mediator M 1 , in addition to the coefficients a 1 , a 2 , and d, the coefficient b 1 was also highly biased, as expected when the standard regression approach is applied; the coefficients c9 and b 2 were still correctly estimated, as in the unmatched study. For example, in scenario one for frequency-matched design, when the proportion of individuals with presence of M 1 was higher in cases than in controls by 5% (D = 20.05) and the dominant genetic model was assumed, the estimated values of c9 and b 2 were 0.4072 and 1.1003, respectively, which were close to the true values of simulation; however, the estimated values of a 1 , a 2 , d, and b 1 were 0.3500, 0.4502, 0.5171, and 0.1189, respectively, which were all highly biased compared to the true values. When we applied the proposed correction approach, however, accurate estimates ofã a 1 ,ã a 2 , andd d were obtained (0.3986, 0.4020, and 0.6930, respectively). Table 2 reports the average results for the indirect effects and the percent mediated through two mediators on the effect of the genetic variant on the disease, assessed on the basis of the regression coefficient results reported in Table 1. The true indirect effects, total effect, and percent mediated are listed in the table for each scenario. We considered several specific indirect effects involved in the multiple-mediation model (Figure 1), including the indirect effect through the mediator M 1 , bypassing mediator M 2 (IE 1 ), the indirect effect through the mediator M 2 , bypassing mediator M 1 (IE 2 ), the three-path indirect effect through both mediators (IE 3 ), and the total indirect effect, which is the summation of all the specific indirect effects (IE t ). We also reported the total effect of the genetic variant on the disease (TE), as well as the percentages of the SNP-disease association explained by different paths (PM 1 , PM 2 , PM 3 , and PM t ). For scenario one, on the basis of the coefficients used for the simulations, the true values of the specific indirect effects and the total effect were given as For scenario two, the true values of the indirect effects, total effect, and corresponding percentages mediated were assessed using the same formulas and were given as follows: IE 1 = 0.14, IE 2 = 0.14, IE 3 = 0.05, IE t = 0.32, and TE = 1.01 for indirect effects and total effect and PM 1 = 13%, PM 2 = 13%, PM 3 = 5%, and PM t = 32% for the percentages mediated through different indirect effects. Similar to Table 1, the top panel of Table 2 reports the results for scenario one and the bottom panel reports the results for scenario two. The results from both the standard logistic regressions and the proposed approach are reported. For the unmatched case-control study, when the standard regression approach was applied, the estimates of the specific indirect effects, as well as the total effect, were biased compared to the true values. This was expected because the coefficients used to assess the indirect effects and total effect were biased. For example, for scenario one with a dominant genetic model (unmatched study), the specific indirect effects and the total effect were given as IE 1 = 0.51, IE 2 = 0.50, IE 3 = 0.38, IE t = 1.39, and TE = 1.79, When the case-control study was frequency matched with mediator M 1 , the magnitudes of bias in the estimations of indirect effects, total effect, and percent mediated were larger than those for the unmatched study when applying the standard approach. For example, in scenario one for the frequency-matched design, when the proportion of individuals with presence of M 1 was higher in the cases than in the controls by 5% (D = 20.05) and the genetic model was assumed as dominant, the estimates of indirect effects and total effect were IE 1 = 0.04, IE 2 = 0.50, IE 3 = 0.20, IE t = 0.73, and TE = 1.14, respectively, which were highly biased compared with the true values of 0.45, 0.45, 0.31, 1.20, and 1.61, respectively. Accordingly, the percentages mediated through different indirect effects were estimated as 4%, 43%, 17%, and 64%, respectively, which were also biased compared with the true values of 28%, 28%, 19%, and 75%, respectively. On the other hand, the proposed approach provided estimates of IE 1 = 0.44, IE 2 = 0.44, IE 3 = 0.30, and IE t = 1.18 for different indirect effects, TE = 1.59 for the total effect, and PM 1 = 28%, PM 2 = 28%, PM 3 = 19%, and PM t = 74%, which were all close to the true values.
Therefore, we observed from the overall simulation results that the standard logistic regressions provided biased estimates for the coefficients a 1 , a 2 , and d in all situations (e.g., frequency-matched and unmatched studies) and biased estimates for the coefficient b 1 when the study was frequency matched on the mediator M 1 , which in turn led to biased estimates of the indirect effects, total effect, and percent mediated by two mediators. The magnitude of the bias in the estimations increased when the proportion difference (D) was relatively large and positive, when the original study was frequency matched on mediator M 1 , and when the true values used for the simulations were relatively large (i.e., scenario one versus scenario two). However, the approach proposed in this article can uniformly provide accurate estimates for the coefficients a 1 , a 2 , and d, and in turn provide accurate estimates for the indirect effects, total effect, and percent mediated for all situations.

Application to the Study of Lung Cancer, COPD, Smoking and SNP rs1051730
We applied our approach to assess the mediating effects of smoking behavior and COPD simultaneously on the association between the SNP rs1051730 and lung cancer risk using a multiplemediation model (Figure 1) based on the data from a lung cancer GWA study [6,20,25]. This analysis included N 1 = 1,153 lung cancer case subjects who were current or former smokers and N 0 = 1,137 control subjects frequency matched to the cases by age, sex, and smoking status. All the case and control subjects were Caucasian. Lung cancer cases were accrued at The University of Texas MD Anderson Cancer Center and were histologically confirmed. Controls were ascertained through a multi-specialty physician practice from the same area. Questionnaire data providing information on smoking were obtained by personal interview. This study was approved by the institutional review board at MD Anderson Cancer Center, and all participants provided written informed consent (LAB10-0347). We selected the number of cigarettes per day, or daily smoking quantity (SQ), as the measurement of smoking intensity. The SQ measure is categorized into two levels: SQ,25, light smokers (coded as 0); SQ$25, heavy smokers (coded as 1) [42]. All the lung cancer cases and controls also self-reported whether a physician had ever diagnosed them with COPD, which was categorized as present or absent. The genetic variant (rs1051730) was coded as having additive effects, as in the original GWA study [20]. Since the lung cancer controls were frequency matched to the cases by smoking status, we employed the proposed approach for frequency matching with respect to the mediator M 1 to investigate the mediating effects of smoking and COPD. All the analyses were adjusted for age. Table 3 reports the estimated coefficients, indirect effects, total effects, and percentages mediated for the SNP-lung cancer association obtained using both the standard and proposed approaches. As we showed in the simulation studies, the estimated coefficients b 2 and c9 should be unbiased, but the estimated coefficients a 1 , a 2 , and d will be biased. Also, the estimated coefficient b 1 for smoking-lung cancer association was statistically non-significant at the 0.05 level of significance (b 1 = 0.1036, 95% CI = 20.0667, 0.2739) owing to the frequency matching on smoking status. To assess the corrected coefficientsã a 1 ,ã a 2 , andd d, we estimated the MAF of the SNP rs1051730 from the data as 37%, and therefore, under Hardy-Weinberg proportion, the genotyping frequenciesp p i , i = 0, 1, and 2, were calculated as 0.40, 0.46, and 0.14, respectively. The prevalences of lung cancer (f f Y ), COPD (f f M2 ), and heavy smokers (f f M1 ) in ever smokers were obtained from the literature as 14% [40], 24% [41], and 12% [42], respectively. We further assumed the OR of association between SQ and lung cancer as 1.86, as reported by Peto et al. [46]. The 95% CIs of the coefficients for the proposed approach were obtained on the basis of our previous work [11]. To obtain the 95% CIs for different indirect effects and total effect, we performed the bootstrapping approach, as described in the Methods section, with B = 10,000. In addition to the estimates of the specific indirect effects, we also reported the percentage of each specific indirect effect, thus explaining the total SNP-lung cancer association.
When the standard logistic regression approach was applied, not all three specific indirect effects were statistically significant, as evidenced by some bootstrap CIs containing zeros ( Table 3). The first indirect effect carries the effect of the SNP on increasing lung cancer risk through only smoking, bypassing COPD. This indirect effect was assessed by the product of a 1 and b 1 and shown to be statistically non-significant because the 95% bootstrap CI contained zero (IE 1 = a 1 b 1 = 0.0257, 95% CI = 20.0181, 0.0752). This biased result using the standard approach was not surprising because the case-control data used for this analysis was frequency matched by smoking status (see also our simulation results for frequency-matched study design). The second indirect effect carries the effect of the SNP on increasing lung cancer risk through only COPD, bypassing smoking behavior. This indirect effect was assessed by the product of a 2 and b 2 and shown to be statistically significant, as the 95% bootstrap CI did not contain zero (IE 2 = a 2 b 2 = 0.2615, 95% CI = 0.0283, 0.4059). This result means that an individual carrying the deleterious allele for SNP rs1051730 is more likely to develop COPD, and in turn, lung cancer, independent of the individual's smoking behavior. The last indirect effect is the effect of the SNP on lung cancer risk through both smoking and COPD, which is the product of a 1 , d, and b 2 , and this effect was also statistically significant (IE 3 = a 1 db 2 = 0.1753, 95% CI = 0.0733, 0.2995). This result shows that the individual carrying the deleterious allele is more likely to become a heavy smoker, which in turn causes COPD and then leads to a higher lung cancer risk. The total indirect effect was evaluated as the summation of the three specific indirect effects, and as expected, it was statistically significant (IE t = IE 1 +IE 2 + IE 3 = 0.4625, 95% CI = 0.1890, 0.6481). Meanwhile, the direct effect of c9 was also statistically significant (c9 = 0.2350, 95% CI = 0.1101, 0.3599), which suggested that the SNP also affects lung cancer risk through a pathway or pathways other than smoking and COPD.
The total effect of the SNP on lung cancer risk was calculated as the sum of the direct (c9) and total indirect (IE t ) effects (TE = 0.6975, 95% CI = 0.3864, 0.9079). Given the total effect, we also calculated the percentage of the SNP-lung cancer association explained by each of the specific indirect effects. The percentages mediated were estimated as 3.7%, 37.5%, and 25.1% for the three specific indirect effects, respectively, suggesting that the path through smoking alone explains 3.7% of the SNP-lung cancer association, the path through COPD alone explains about 37.5% of the association, and the path through both smoking and COPD explains about 25.1% of the association. Thus, the results obtained from the standard approach suggested that the SNP influences the lung cancer risk indirectly through two pathways (COPD only and both smoking and COPD) but not through the smoking only pathway. However, this conclusion is likely to be biased, as our simulation results showed that the case-control study design could introduce bias in the estimations of indirect effects, and furthermore, frequency matching on the basis of smoking status could conceal the true underlying association. Therefore, we applied the new approach proposed in this article to estimate the indirect effects of smoking and COPD on the association between the SNP and lung cancer risk (see Table 3). The indirect effect of smoking, bypassing COPD, was evaluated by using the product ofã a 1 and the fixed b 1 value (i.e., log(1.86)) and was found to be equal to IE 1 = 0.1385, 95% CI = 0.0601, 0.2195, which was statistically significant. This result suggested that an individual carrying the deleterious allele is more likely to become a heavy smoker and, in turn, have a higher risk of lung cancer, independent of the individual's COPD risk. The other two indirect effects were calculated as the products ofã a 2 b 2 andã a 1d db 2 , respectively, and were also statistically significant (IE 2 = 0.2284, 95% CI = 0.0398, 0.4624; IE 3 = 0.1558, 95% CI = 0.0566, 0.3123). The total indirect effect (IE t = 0.5227, 95% CI = 0.2722. 0.8476) and the total effect (TE = 0.7577, 95% CI = 0.4846, 1.0987) were higher than those obtained from the standard approach, mainly due to the significant indirect effect through smoking only. The percentages of the SNP-lung cancer association explained by each of the specific indirect effects were also calculated and suggested that the path through smoking alone explains about 18.3% of the association, the path through COPD alone explains about 30.2% of the association, and the path through both explains about 20.6% of the association. The results obtained from the proposed approach showed that the SNP rs1051730 influences lung cancer risk indirectly through all three pathways and suggested that a higher percentage of the SNP-lung cancer association was explained by the two mediators with three pathways than indicated by the results obtained from the standard regression approach (69.1% versus 66.3%).

Discussion
In this study, we investigated the multiple-mediation model involving a three-path mediating effect using data from a casecontrol study. Such multiple-mediation models have been studied previously but not in the context when the study subjects are sampled according to case-control design [31,33]. We found that bias arises in evaluating the indirect effects if the case-control sampling study design is ignored and standard logistic regressions are applied. Therefore, we proposed an approach to correct bias in estimating coefficients from the mediation analysis and provide accurate estimates of the specific indirect effects. This approach can also be employed when the original case-control study is frequency matched on one of the mediators. We employed the bootstrapping approach to assess the significance of the indirect effects. We conducted simulation studies to investigate the performance of the proposed approach and showed that, compared with the standard approach, the proposed approach provides more accurate estimates of the indirect effects as well as of the percentages mediated by the mediators. The multiplemediation model investigated in this study is related to directed graphic models, which have been applied to the study of genetic data. For example, Zhu and Zhang [47] investigated the association between genetic variants and multiple traits using a similar scenario as considered in Figure 1. However, their analysis was focused on testing multiple traits (e.g., primary disease and mediators) simultaneously for identifying a common genetic variant, while our study is focused on decomposing the potential direct and/or indirect effects of a genetic variant on the primary disease. Moreover, their study was based on a family-based study design, while our study is focused on a case-control study design of the primary disease in which the controls may be frequencymatched to cases with respect to one of the mediators. We applied the approach to investigate the mediating effects of smoking and COPD on the association between the SNP rs1051730 and lung cancer risk using lung cancer case-control GWA study data where the multiple-mediation model was employed. We concluded on the basis of the results obtained from the proposed approach that the SNP rs1051730 influences lung cancer risk indirectly through all three pathways: through smoking only, bypassing COPD (18.3%); through COPD only, bypassing smoking (30.2%); and through both smoking and COPD (20.6%). The percentages mediated through different pathways (total 69.1%) obtained using the proposed approach were more correct, according to our simulation results, whereas the percentages mediated obtained using the standard approach were either under-estimated or over-estimated. Our findings that COPD mediates the effect of the SNP on the lung cancer association concurs with a previous study of the association between the SNP rs16969968 (in tight linkage disequilibrium with rs1051730) and COPD [23], in which the authors proposed that the association between the a5 subunit nAChR SNP and lung cancer could be largely explained through its relationship to COPD. Importantly, our results confirm previous findings from our group [6] that the association between the SNP rs1051730 and COPD was mediated by smoking behavior (percentage mediated = ,40%). Thus, the study emphasizes the complex interrelationships among smoking, genes, COPD, and lung cancer.
One may argue that the use of self-reported, physiciandiagnosed emphysema as a COPD measure could result in misclassification of the disease. For example, some studies have shown that when spirometry is used to assess COPD in smokers, estimates of undiagnosed COPD range from 50-80% [23,[48][49][50][51][52]. Such misclassification would lead to under-estimation of effect sizes for the association between genetic variants and COPD risk. However, a few studies suggest that the questionnaire-based approach to defining COPD is quite accurate for epidemiologic studies [53][54][55].
This study extends our previous work investigating the mediating effects of smoking and COPD on the association between the rs1051730 SNP and lung cancer using a singlemediator model [6]. However, the previous study ignored the case-control study design, which might under-estimate the indirect effect of each mediator, as well as the percent mediated by each mediator. VanderWeele et al. [7] used a weighted regression approach [56] to address the problem of case-control study design when assessing the direct and indirect effects of genetic variants on 15q25.1 on the lung cancer risk through smoking. That study focused on only a single mediator (i.e., smoking) and showed that smoking intensity only explained a small portion (,5%) of the association between the SNP rs1051730 and lung cancer risk, which differs from the percentage we have obtained for the path through smoking only (,18%). This difference could be due to multiple reasons. First, different types of data sets were employed in the two studies: we used only ever smokers, whereas VanderWeele et al. used both never and ever smokers for the analysis. Second, we employed a multiple-mediation model, so the indirect effect through smoking only was assessed by controlling for the other mediator, COPD, whereas VanderWeele et al. did not include COPD in their model. Moreover, the study of VanderWeele et al. used a different measure based on ORs to evaluate the percentage of the effect of the SNP mediated by smoking intensity, which assumes a rare outcome disease [56] and is not applicable to our situation because lung cancer is not rare in ever smokers. Most importantly, the difference in the results is due to the different scales used for the smoking intensity measure as the mediator variable. In the study of VanderWeele et al. [7], the square root of the number of cigarettes smoked per day was employed as a continuous mediator variable. In this case, the mediating effect can be interpreted as the effect at the square-root scale of the individual smoking one cigarette per day on the association between the SNP and lung cancer risk. In contrast, in our study we categorized the individuals into light smokers (,25 cigarettes smoked per day [mean number of cigarettes smoked per day = 17]) and heavy smokers ($25 cigarettes smoked per day [mean number of cigarettes smoked per day = 38]). In this sense, the mediating effect should be interpreted as the effect of heavy smoking compared to light smoking on the association between the SNP and lung cancer risk, which as expected, would be higher than the square-root scale used in the VanderWeele study [7].
Munafo et al. [27] studied the association between genetic variants on chromosome 15q25 locus and tobacco exposure as measured by self-reported daily cigarette consumption and also based on a single measurement of cotinine levels in current smokers. They found that the genetic variants have a stronger association with cotinine level than with self-reported cigarette consumption and the per-allele increase in cotinine level indicated a per-allele increase risk of lung cancer with OR = 1.31. Since the lung cancer GWA studies suggested that the genetic variants increase lung cancer risk by 1.32 fold [20], Munafo and colleagues concluded that the association of 15q25 locus with lung cancer risk is likely to be mediated largely via tobacco exposure. Compared to our approach, this study in actuality did not perform any formal mediation analysis, but inferred the results partially based on the published data, and therefore, could not provide the percentage of the genetic variant-lung cancer association mediated by tobacco exposure. This fact was also noted by Spitz et al. [28]. The major difference in the conclusions of these two studies could also be due to the different samples (current smokers versus ever smokers) and different smoking measures (cotinine level versus smoking quantity) used.
In our study, we focused on the multiple-mediator model shown in Figure 1, which allows for the causal association of one mediator to another mediator (i.e., smoking to COPD). In our real data analysis, the causal association of smoking to COPD was known from previous studies. However, in reality, the assumed causal direction might not be known in advance and has to be obtained using theoretical justification or intuition about the area of investigation [57]. The alternative is to consider both mediators to co-vary in the model, as in a parallel multiple-mediator model [30]. Our approach can be applied to such models as well to correct the potential bias in the estimations of the indirect effects when case-control study data are employed.
The measure of percent mediated used in our study is usually applicable when the signs of the indirect and direct effects are the same [58]. However, in the multiple-mediation model, it is possible that the indirect effects, as well as the direct effect, will have different signs. In this situation, the total effect assessed by the summation of the indirect effects and the direct effect could be arbitrary, and therefore, the percent mediated by each mediator could be greater than 1 (i.e., the total effect is less than the indirect effect), negative (i.e., the total effect and the indirect effect have opposite signs), or undefined (i.e., the total effect approaches zero) [59]. One possible solution is to assess the percentages mediated using the absolute values for all indirect and direct effects [60]. Alternatively, one may use other measures, such as the measure referencing the indirect effect relative to the direct effect and the proportion of the variance in outcome variable explained by the indirect effect [61]. However, these measures might have the same issues, such as producing a negative value. In this study, we assumed there were no confounding factors mitigating associations among the SNP, smoking behavior, COPD, and lung cancer risk [7].
It should be noted that, when we refer to the direct effect, we mean the effect of the SNP on lung cancer risk directly or through pathways other than smoking and COPD.
In summary, we investigated the multiple-mediator model, which involves a three-way mediating effect from one mediator to another in a case-control study. We proposed an approach to correct the biased estimations of the indirect effects in such models due to case-control study design. The proposed approach can provide accurate estimations for indirect effects and percent mediated. It is also robust to the case-control study being frequency matched on one of the mediators. The application of the proposed multiple-mediation approach to the study of the association between SNP rs1051730 and lung cancer risk suggests that the SNP has an indirect association with lung cancer risk mainly through its effect on both smoking behavior and COPD, as well as a relatively weaker direct association with lung cancer risk. Currently, several studies are ongoing to identify genetic variants associated with smoking behaviors and COPD using existing GWA study data collected for lung cancer using simplistic regression analyses. Such studies should use more sophisticated statistical models that take into account the complex interplay of smoking, COPD, and lung cancer. Finally, additional studies that include metabolomics markers, and biochemical assays of lung carcinogens as suggested by Spitz et al. [28], and spirometry assessment among smokers as suggested by Young et al. [23], as well as together with CT scans would be needed to more accurately tease out the direct and indirect effects of the genetic variants on lung cancer risk.

Supporting Information
Text S1 Correction of coefficients a 1 and a 2 for unmatched casecontrol study design.