The Effect of Ignoring Statistical Interactions in Regression Analyses Conducted in Epidemiologic Studies: An Example with Survival Analysis Using Cox Proportional Hazards Regression Model

Objective To demonstrate the adverse impact of ignoring statistical interactions in regression models used in epidemiologic studies. Study design and setting Based on different scenarios that involved known values for coefficient of the interaction term in Cox regression models we generated 1000 samples of size 600 each. The simulated samples and a real life data set from the Cameron County Hispanic Cohort were used to evaluate the effect of ignoring statistical interactions in these models. Results Compared to correctly specified Cox regression models with interaction terms, misspecified models without interaction terms resulted in up to 8.95 fold bias in estimated regression coefficients. Whereas when data were generated from a perfect additive Cox proportional hazards regression model the inclusion of the interaction between the two covariates resulted in only 2% estimated bias in main effect regression coefficients estimates, but did not alter the main findings of no significant interactions. Conclusions When the effects are synergic, the failure to account for an interaction effect could lead to bias and misinterpretation of the results, and in some instances to incorrect policy decisions. Best practices in regression analysis must include identification of interactions, including for analysis of data from epidemiologic studies.


Introduction
The terms interaction effect and effect modification, or effect-measure modification are often used interchangeably, particularly for health related research in epidemiology [1]. From an epidemiologic prospective effect modification refers to a situation where the effect of one predictor variable (e.g., exposure) on the outcome is dependent on the values of some other covariates [1 , 2]. From a statistical prospective, an interactive multivariable regression model could be fit by including the product of two or more predictor variables along with their corresponding individual variables in regression models [1].
Identification of statistical interactions in a regression model is important because this could lead to significant public health implications. For example, using data from an Epidemiologic Study on the Insulin Resistance Syndrome (DESIR) cohort, Gautier A. et al. (2010) found significant interactions between baseline BMI categories and higher waist circumference in relation to progression to type 2 diabetes using a logistic regression model [3]. Based on their findings the authors concluded that for reducing incidence of type 2 diabetes in their study population, it is important to monitor and prevent increases in waist circumference, particularly for those with BMI <25 kg/m 2 . Similarly, Hanai K. et al. examined whether obesity modifies the association of serum leptin levels with the progression of diabetic kidney disease by including an interaction term between the obesity and leptin in a Cox proportional hazards regression model [4]. The authors detected a significant interaction between leptin levels and obesity, hence concluded that obesity modifies the effect of leptin on kidney function decline in patients with type 2 diabetes. On the other hand, in an Iranian sex-stratified cohort study which was conducted to investigate risk factors for incidence of type 2 diabetes using Cox proportional hazards regression model [5], the authors recommended different preventive strategies by sex because of their findings that along with central adiposity in women and general adiposity in men, a lower education level conferred a higher risk for incidence of diabetes among men. Using data from Second Manifestations of Arterial Disease (SMART) study, Verhagen S. N. et al. investigated the relation between high-sensitivity C-reactive protein (hsCRP) levels and the incidence of type 2 diabetes in patients with manifest arterial disease or any cardiovascular risk factors [6]. The authors identified a significant interaction between gender and hsCRP in relation to incidence of type 2 diabetes in a Cox proportional hazards model. They concluded that manifest arterial disease patients with high hsCRP plasma levels are at increased risk to develop type 2 diabetes and this risk is more pronounced in female than male patients.
While some epidemiologic studies investigated and found significant interactions or effect modifications, in a majority of epidemiologic studies the interactions or effect modifications are not explored. We estimated the relative frequency of lack of exploring potential interactions in regression analyses by conducting a non-systematic search in PubMed for the period, January 2004 to December 2013. Since interactions and effect modifications are usually explored via multivariable regression models and different terms such as "multivariable regression" or "multiple regressions" or "regression" are used in regression analysis [7], we restricted our search only to "multivariable regression" and "multiple regression", which are often used interchangeably by epidemiologists. Next, among papers using the terms "multivariable regression", "multiple regression" or "regression", we searched for the terms "statistical interaction", "effect modifier", "effect modification", or "heterogeneity of effect". The result revealed that 4.4% of the published studies in PubMed that used terms "multivariable regression" (or "multiple regression") have used terms related to interactions, effect modifications, or heterogeneity of effects in their publications.
Although some of the publications may not be related to epidemiologic studies, or the terms interaction or heterogeneity may be used under different context, our search revealed that a majority of investigators may not explore interactions or effect modifications when conducting regression analysis.
The aim of this study was to demonstrate the adverse effects of ignoring interactions in regression models by conducting simulation studies in different scenarios. Specifically, we studied the effect of ignoring interactions in a Cox proportional hazards regression model. In addition, to demonstrate the importance of including interactions in regression models in real life epidemiologic studies, we compared results from additive and interactive models using data from the Cameron County Hispanic Cohort (CCHC) [8]. For this we specifically focused on investigating the role of peripheral white blood cells (WBC) counts, WBC differential counts and BMI in association with the time to incidence of type 2 diabetes while controlling for the effect of other known type 2 diabetes risk factors.

Simulation study for investigating interactive effects
Dataset generation-To assess the impact of ignoring the interactive effects, we simulated data from two fully specified Cox proportional hazards regression models h(t| x)=h 0 (t)exp(β′x) where h(t|x) is the hazard rate at time t for an individual with risk vector x, β is the vector of regression coefficients, and h 0 (t) is the baseline hazard, with two predictor variables that had only additive or interactive effects. In addition different magnitudes for the coefficient of the interaction term were considered. All Cox proportional hazards regression models that were used for simulation are presented in Table 1. For each of four scenarios 1000 replications, with sample sizes of 600 each, were generated in a manner that are described in the following.
In order to simulate data from the underlying Cox models, the effect of the covariates have to be translated from the hazard functions to the survival times, as documented in the literature [9 , 10]. It can be shown that the survival time for exponential survival random variable could be written as T= −In[ S ( t)]/[λexp(β ′ x)], where S(t) is the probability of surviving beyond time t and λ>0 is the scale parameter of the exponential distribution. When S(T) has a uniform distribution between 0 and 1 (i.e., S(T)~U(0,1)) then −In[S ( T)] is exponentially distributed with parameter 1, and T is exponentially distributed with parameter λ′ (x)=λexp(β ′ x). To create survival time data we generated exponentially distributed survival times T with scale parameters λ′ (x)=λexp(β ′ x) with an arbitrarily chosen constant baseline hazard rate λ=h 0 ( t ) [ 9 , 10]. Next, we generated exponentially distributed censoring time c with arbitrary chosen constant baseline censoring rate λ 1 [ 10]. To generate rightcensored survival time data the observed time to event was defined as minimum of the survival T and censoring variable C. That is generated observed time to events were (time to event=T when T<C), where the censoring status variable had a value of 1 if the time to event is censored (C>T) or 0 otherwise [11 , 12]. The parameters of the exponential distribution were determined by several iterations such that the censoring rate in each simulated dataset was higher than 30%. While in Cox proportional hazards regression model the predictor variables are not required to be normally distributed, for this simulation study the prespecified continuous variable x 1 was generated from a normal distribution with mean 6.4 and variance 2.25 that represents the distribution of WBC counts observed in the CCHC data. The pre-specified binary variable x 2 was generated as a Bernoulli random variable with probability of success p=0.5, that closely represents the distribution of obese status of individuals in the CCHC subset. The regression coefficients in all scenarios were arbitrarily set to β 1 =2 and β 2 =1.
Model comparisons-Using the simulated datasets Cox models with and without interaction terms were fitted under each of the scenarios. To summarize the simulation results, estimates for each regression coefficient and the corresponding standard errors from the 1000 replications were averaged for each of the scenarios.
For comparing coefficient estimates between correctly specified and misspecified Cox regression models percentage of bias was calculated using , where is the average of the estimated regression coefficient β̂ over 1000 replications and β is the true value of the regression coefficient. Overall accuracy of the estimates was assessed by mean squared error (MSE) which is the sum of the variance of the estimator and the squared bias (MSE = var(β) + (Bias(β,β)) 2 ).

Empirical example for demonstrating interaction effects using Cameron County Hispanic Cohort Data
To demonstrate the importance of identifying interactions in epidemiologic studies, in this section we referred to an earlier finding from analyses of CCHC data with Cox proportional hazards regression model [8]. Briefly, using data from 636 participants in the CCHC, this study assessed crude and adjusted effects of total and differential WBC counts (lymphocytes, monocytes, and granulocytes), C-reactive protein (CRP), and Body Mass Index (BMI) in relation to time to progression to type 2 diabetes in Mexican Americans. All participants without any evidence of type 2 diabetes after the baseline visit, death or lost-tofollow up were considered as censored at their last observation dates. Other covariates, such as age at the time of the event or censoring, gender, family history of diabetes, pre-diabetes status, smoking, and triglycerides were also included in the regression analysis. The effect modification of continuous BMI and BMI levels (e.g., BMI<25 (normal), 25 ≤ BMI<30 (overweight), and BMI ≥30 (obese)) on total and differential WBC counts were evaluated. In this study, the presence of the interactions was tested by including the product of the dichotomized BMI as BMI ≥35 and BMI <35 with total or differential WBC counts in the models [17] and by using a likelihood ratio test for the nested models with and without an interaction term. Also, we conducted multivariable Cox proportional hazards regression models stratified analyses by BMI categories to compare stratum specific hazard ratios with those obtained from our final interactive Cox proportional hazards regression models.
Cox proportional hazards regression model assumptions for the functional form of the continuous covariates (e.g., the linear relationship between log cumulative hazard and a covariate) and proportionality assumption of the hazards of all covariates were evaluated before and after including the interaction term in the model. Our result indicated that family diabetes history and pre-diabetes status did not satisfy the proportional hazards assumption; hence all models that included these two variables were stratified by these variables by including strata statement in SAS Proc Phreg [18]. Details regarding the model development are described elsewhere [8].

Calculation of hazard ratios in the presence of interactive effects
The calculations of Hazard Ratios (HRs) from an interactive Cox proportional hazards model is different from the calculations of HRs from an additive model. For example in the following interactive Cox proportional hazards model.
Where x 1 is a continuous variable and x 2 is a dichotomous variable with values 1 or 0, the HRs for x 1 by the levels of variable x 2 in (1) can be written as HR=exp (x 1 (β 1 + 0β 3 ))=exp (x 1 β 1 ) when x 2 =0, and the HR for x 1 is HR=exp (x 1 (β 1 + β 3 )) when x 2 =1.
In contrast in an additive Cox proportional hazards regression model.   Table 3 shows the results of the analyses on simulated data with Cox proportional hazards models with interaction term under scenarios (2), (3) and (4). Compared to the correctly specified models, the regression coefficient estimates for variables x 1 and x 2 in the misspecified models in all scenarios were more biased and less accurate as the magnitude of the coefficient of the interaction term increased [10]. Specifically, the bias increased from 352% to 894% and MSE increased from 12.43 to 80.18, as the magnitude of the coefficient of the interaction term increased from 0.5 to 1.5.

Findings from the empirical study
The results from the analyses of CCHC data are presented in Tables 4 and 5 including estimates for the coefficients and their standard errors. In our final models we identified significant interactions between dichotomous BMI (i.e., obese categories) and total WBC (p-value=0.0137), granulocytes (p-value=0.0291), and lymphocytes (p-value=0.0478) after adjusting for the effect of age at the time of the event or censoring, gender, smoking, and triglycerides and stratified for family history of diabetes and pre-diabetes status by including strata statement in SAS Proc Phreg. Because of the significant interaction between dichotomous BMI and total WBC, counts, granulocytes, and lymphocytes, we calculated estimated adjusted hazard ratios for WBC counts, granulocytes, and lymphocytes by different levels of the interacting variable BMI. Figure 1a and 1b present the plots of estimated adjusted hazard ratios for total WBC counts for BMI ≥35 vs. BMI <35 under a model with no interaction and a model including interaction term between BMI and total WBC counts, Under the additive model (i.e., with no interactions) the lines for the hazard ratios were somewhat parallel and the vertical distance between them represented the BMI adjusted hazard ratio for WBC counts (Figure 1a). The plot based on the interactive model ( Figure 1b) Figure 2a and 2b with two non-parallel lines present the HRs of granulocytes and lymphocytes respectively by the levels of BMI under the models with interaction, adjusting for the effect of age at the time of the event or censoring, gender, smoking, and triglycerides and stratified by family history of diabetes and pre-diabetes status by including strata statement in SAS Proc Phreg. Finally, Figure 2c presents the interaction effect between monocytes and BMI in relation to time to progression to type 2 diabetes and it shows that when monocytes were between 0 and 0.4, and 0.8 and 1.1 the HR lines for BMI ≥35 and BMI<35 are approximately parallel, and the interaction effect between the two variables was not statistically significant (p-value=0. 16).
The results from stratified analyses with multivariable Cox proportional hazards regression models by BMI categories are reported in Table 5 which, are similar to those of the adjusted interactive models in Table 4. The results in Table 5 reconfirm that the significant associations of total WBC counts, lymphocytes and granulocytes with progression to type 2 diabetes were only among participants with BMI ≥35. Furthermore, the stratum specific HR estimates for each of the variables in Table 5 were not similar indicating the presence of interaction effects of BMI with total WBC counts, lymphocytes and granulocytes.

Discussion
In this paper we have demonstrated that a majority of epidemiologist and scientists do not report whether they have explored potential interactive effects when reporting their findings in publications that involve regression analysis. We have also examined the adverse effects of ignoring interactions when using a Cox proportional hazards regression model. We used simulation studies and data from the Cameron County Hispanic Cohort, to demonstrate detection, calculation of effects based on interactive models, and graphical presentation of interactive effects. The approach for the simulation studies was to first generate data from an additive model with known distributions and apply an additive (without their product term) Cox proportional hazards regression model with two covariates x 1 and x 2 to estimate the regression parameters. In this scenario the relationship between the variable x 1 and the time to event did not depend on the values of variable x 2 . Therefore in the perfect additive Cox proportional hazards regression model the inclusion of the interaction of the two covariates did not have a significant impact on main effects of regression coefficients estimates, hence on the estimated hazard ratios and their interpretation. In the second scenario data were generated using Cox proportional hazards regression model with two covariates x 1 and x 2 and their product term (Model 2). The analyses demonstrated that the misspecified models with no interaction term resulted with biased regression coefficient estimates, hence erroneous interpretations. In the interactive model the effect of x 1 on time to event was dependent on the levels of variable x 2 and we demonstrated how to calculate the adjusted HRs in the presence of interactions between x 1 and x 2 .
Since this study focused on recognition of statistical interaction in regression analysis, in this paper we do not fully discus the findings from the CCHC data. However, the results from the CCHC prospective cohort study demonstrated with real life variables the difference between models with and without interaction term, more specifically the improvement of the inferences for hazard ratios for total WBC counts, granulocytes and lymphocytes after testing and identifying their interaction with dichotomous BMI (BMI ≥35 vs. BMI <35). The models with interaction terms revealed that WBC counts, lymphocytes and granulocytes were significant risk factors in the subjects with BMI ≥35. The presence of the interaction effects and the positive statistical association between total WBC counts, granulocytes and lymphocytes and progression to diabetes in individuals with BMI ≥35 were confirmed when Cox proportional hazards regression models fitted separately in the BMI strata. Stratification is not only a technique to evaluate the presence of possible confounding but also allows for an evaluation of possible presence of interaction effect [1 , 2].
A key underlying assumption (assumption of proportional hazards) of the Cox proportional hazards model is that the hazard ratio is constant over time [1 , 18]. In our simulation studies and in the CCHC study the interactive variables met the proportional hazards assumption. However, stratification by the variable where this assumption does not hold is a popular solution to the problem by including strata statement in SAS Proc Phreg.

Strengths and Limitations
With the analyses of the simulated datasets and CCHC data we highlighted that failure to test for interaction effects in a regression analysis does lead to a significant bias in parameter estimates due to unaccounted interaction effect when it actually exists. We have also demonstrated calculation and interpretation of effects based on parameters estimates in a Cox regression models with a significant interactive effect. Statistical guidelines for best practices recommend assessment of interactions as one of the steps during the model building process [17 , 19]. Although the work presented here is based on Cox regression models, the findings from this study are generalizable to other regression models including linear, logistic, Poisson, and negative binomial regression models.
Our studies have some limitations. These studies illustrated a two-way interaction effect between a continuous and a dichotomous variable. In practice interaction effect may occur between continuous or categorical variables and it may even be represented by a higher order product terms. In addition in the Cox regression models to handle non-proportional hazards we used stratification in the variable where non-proportional hazards assumption was not met. Another method to handle non-proportional hazards is adding an interaction term between some function of time variable and the covariate which takes into consideration the non-constant influence of covariate on the hazard [20].
Since the main purpose of the study was to demonstrate the adverse effects of ignoring statistical interactions in regression models used in epidemiologic studies we did not use the developed sampling weights from the CCHC which were created to account for imbalances in the distribution of sex and age due to unequal participation of household members in the census tracts and to scale the sample to the population [21].

Conclusions
Our findings from a non-systematic review for this study revealed that a majority of published literature on epidemiologic studies that used multivariable regression models have not mentioned anything related to testing for statistical interactions, effect modification, or heterogeneity of effect. Although calculation and interpretation of interactive effects are more difficult these are essential if the effects are interactive or synergic. We recommend inclusion of interaction terms that are clinically significant even if the interaction effects are not statistically significant. The failure to identify interactive effects in regression models could lead to significant bias, misinterpretation of the results, and in some instances to incorrect public health interventions with potential adverse implications.   Table 1 Cox proportional hazards regression models used for different scenarios of the simulation study.  Table 2 Findings from multivariable Cox proportional hazards regression model fitted using generated data in scenario 1.  Table 3 Findings from multivariable Cox proportional hazards regression models fitted using generated data in scenarios 2, 3 and 4.