Detection of and Adjustment for Multiple Unmeasured Confounding Variables in Logistic Regression by Bayesian Structural Equation Modeling

Method Article. Kupek; JAMPS, 3(1): 42-51, 2015; Article no. JAMPS.2015.025

Aim: To compare the bias magnitude between logistic regression and Bayesian structural equation modeling (SEM) in a small sample with strong unmeasured confounding from two correlated latent variables. Study Design: Statistical analysis of artificial data. Methodology: Artificial binary data with the above characteristics were generated and analyzed by logistic regression and Bayesian SEM over a plausible range of model parameters, deduced by comparing the parameter bounds for two extreme scenarios of no versus maximum confounding. Results: Bayesian SEM with flat priors achieved an almost fourfold absolute bias reduction for the effects of observed independent variables on the binary outcome in the presence of two correlated unmeasured confounders, in comparison with standard logistic regression which ignored the confounding. The reduction was achieved despite a relatively small sample (N=100) and large bias and variance of the factor loadings for the latent confounding variables. However, the magnitude of residual confounding was still high. Conclusion: Logistic regression bias due to unmeasured confounding may be considerably reduced with Bayesian SEM even in small samples with multiple confounders. The assumptions for Bayesian SEM are far less restrictive than those for the instrumental variable method aimed at correcting the effect size bias due to unmeasured confounders.


INTRODUCTION
Logistic regression is probably the most frequently used regression method in medical publications due to its capacity to quantify risk in a way which is easy to understand and to communicate among health professionals. The odds of the event of interest are calculated for each profile of the independent variables, and their logarithm is treated much like a continuous outcome in the well-known analysis-of-variance (ANOVA) model [1], with the model estimates back-transformed to odds. The robustness of logistic regression against omitted independent variables has been widely praised, but its limitation when such a variable acts as a confounder is less well known. Most of the methodological work on confounding deals with measured confounding variables, i.e. with variables available for analysis. Stratifying by the confounding variable, matching for observed confounders or for propensity score in observational studies, restricting the analysis to highly homogeneous exposure groups, covariate adjustment for confounding, and instrumental variables have been recommended as the main analytical strategies to deal with this problem [2][3][4].
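The back-transformation from the logit-scale linear predictor to odds can be sketched as follows; the intercept and coefficients are hypothetical illustrative values, not estimates from this study.

```python
import math

def log_odds(intercept, coefs, profile):
    """Linear predictor on the logit scale, treated much like an ANOVA-style linear model."""
    return intercept + sum(b * x for b, x in zip(coefs, profile))

def odds(intercept, coefs, profile):
    """Back-transform the logit-scale prediction to the odds of the event."""
    return math.exp(log_odds(intercept, coefs, profile))

# Hypothetical fitted values for two binary covariates (illustration only)
b0, b = -1.0, [0.7, -0.4]

# The odds ratio for the first covariate is the ratio of odds between two
# profiles differing only in that covariate, which equals exp(b[0]).
or_x1 = odds(b0, b, [1, 0]) / odds(b0, b, [0, 0])
```

The ratio form shows why each coefficient communicates directly as an odds ratio, independently of the values of the remaining covariates.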
However, many confounding variables are unmeasured or even unknown. The former may be effects hypothesized but not verified empirically with the data available, whilst the latter belong to the vast area not yet covered by scientific research. Simulation of the influence of measurement errors on the differential impact of predicting variables across units of analysis has been a key analytical strategy, often labeled sensitivity analysis, with either a Bayesian or a frequentist background [5][6][7][8][9][10]. It is worth noting that widely used hierarchical or multi-level models assume an unknown latent variable, omitted from the regression analysis, which affects the distribution of the impact of intervention variables across units (e.g. subjects), so that some of these are more strongly influenced by unknown factors than others. In other words, random effects quantify the variability of the main ("fixed") effect over units of analysis due to unknown predicting variables, which may or may not be confounders. The statistical term for such an effect is heteroscedasticity. It may point to previously unknown variance components which exert causal influence, e.g. genetic susceptibility to a disease.
Although the heuristic value of variance component analysis has been widely established and reinforced by random-effects analysis in multilevel modelling, the era of "big data" has extended this approach to environments considerably less controlled than experimental or case-control settings, such as population or hospital cohort studies and spatial/environmental analyses [11][12][13][14]. Under these conditions, unmeasured confounders are more likely to distort causal effects evaluated by regression techniques. Although such distortion may be attenuated with multi-level analysis, this approach does not estimate the associations between observed and latent variables which may point to a confounding effect.
Bayesian confirmatory factor analysis has been proposed to explore and test specific confounding hypotheses [15,16]. The method has proven useful for detecting a previously unknown (latent) single source of confounding but was rather limited in reducing the bias, despite outperforming standard multivariate logistic regression in this task. In this paper, an updated version of the method is applied to a more complicated setting with strong confounding, realistically large error components for an observational study, and a small sample with multiple sources of confounding. The aim of this work is to compare the bias magnitude between standard multivariate logistic regression and Bayesian structural equation modeling (SEM).
METHODOLOGY

Continuous outcome variable (y) was defined by equation (1) using SEM. Continuous dependent (y) and independent regression (x) variables were both transformed into binary ones by the logistic function, with a threshold >0.5 for events.
Almost 40% of the continuous outcome variation is accounted for solely by independent error terms (d1, d2, e1) and almost 80% solely by the two sources of confounding, as indicated by two separate linear regression analyses. None of these latent variables were available for the subsequent analysis, whose aim was to determine the contribution of six independent binary predictors, denominated x1bin-x6bin, to the binary outcome, denominated ybin. No distinction was made between intervention (treatment) and other covariates, some of which may be confounders, as they are all associated with the outcome.
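A data generating mechanism of this kind can be sketched as below. The exact generating equations are given in the paper; the loadings, error variances and regression weights used here are illustrative placeholders, with only the confounder correlation (0.40), the dichotomization rule and the sample size (N=100) taken from the study.

```python
import math
import random

random.seed(1)

def logistic_event(eta):
    """Dichotomize via the logistic function with threshold >0.5 for events
    (equivalent to eta > 0 on the continuous scale)."""
    return 1 if 1.0 / (1.0 + math.exp(-eta)) > 0.5 else 0

def generate_record(rho=0.40):
    # Two correlated latent confounders f1, f2 (correlation 0.40, as in the study)
    f1 = random.gauss(0, 1)
    f2 = rho * f1 + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
    # Six continuous predictors driven by the confounders plus error (illustrative loadings)
    xs = [0.8 * f1 + random.gauss(0, 0.6) for _ in range(3)]
    xs += [0.8 * f2 + random.gauss(0, 0.6) for _ in range(3)]
    # Continuous outcome dominated by the two confounders plus an error term
    y = 0.9 * f1 + 0.9 * f2 + random.gauss(0, 0.7)
    # x1bin-x6bin and ybin after dichotomization
    return [logistic_event(x) for x in xs], logistic_event(y)

sample = [generate_record() for _ in range(100)]  # N = 100, as in the study
```

Because f1 and f2 drive both the predictors and the outcome but are dropped before analysis, any regression of ybin on x1bin-x6bin alone is confounded by construction.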

Prior Distributions
Bayesian SEM prior distribution range was derived by comparing two models: a multivariate logistic regression assuming no confounding and a latent variable regression assuming maximum confounding. The latter was a factor analysis model in which the number of common factors was determined by exploratory factor analysis (EFA) [17,18] with "geomin" oblique rotation, and the factor scores were regressed onto the binary outcome in SEM as the only predictor variables, thus assuming zero values for the effects of all observed binary variables on the outcome. The difference between the multivariate logistic regression coefficients on the logarithmic scale and zero was considered the plausible range of confounding. Flat priors were used over this range in Bayesian SEM. Inverse Wishart distribution priors were used for the variances and residual variances of the latent variables, whereas normal distribution priors were assumed for the regression parameters.
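The flat-prior range described above can be expressed compactly: for each coefficient, the plausible range of confounding is bounded by the naive logistic estimate (no-confounding extreme) and zero (maximum-confounding extreme). A minimal sketch, with hypothetical coefficients:

```python
def flat_prior_bounds(naive_logit_coefs):
    """For each log-odds coefficient, return (lower, upper) bounds of a flat
    prior spanning the no-confounding extreme (the naive logistic estimate)
    and the maximum-confounding extreme (a coefficient of zero)."""
    return [(min(0.0, b), max(0.0, b)) for b in naive_logit_coefs]

# Hypothetical naive multivariate logistic regression coefficients (log scale)
bounds = flat_prior_bounds([0.5, -1.2, 0.3])
# bounds == [(0.0, 0.5), (-1.2, 0.0), (0.0, 0.3)]
```

Each pair would then define the support of a uniform prior for the corresponding regression parameter in the Bayesian SEM.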

Model Checking
Absolute bias was calculated as the absolute value of the difference between the true and estimated value for each parameter and estimation method. Its average over the parameters of interest was multiplied by 100 to arrive at the percentage of average absolute bias.
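The bias measure can be written directly from that definition; the parameter values below are hypothetical, for illustration only.

```python
def pct_average_absolute_bias(true_values, estimates):
    """Average of |true - estimated| over the parameters of interest,
    multiplied by 100 to give the percentage of average absolute bias."""
    diffs = [abs(t, ) if False else abs(t - e) for t, e in zip(true_values, estimates)]
    return 100.0 * sum(diffs) / len(diffs)

# Hypothetical true and estimated log-odds coefficients (illustration only):
# |1.0-0.4| + |-0.5-(-0.6)| + |0.8-0.5| = 0.6 + 0.1 + 0.3 = 1.0, average 1/3
bias_pct = pct_average_absolute_bias([1.0, -0.5, 0.8], [0.4, -0.6, 0.5])
```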
The number of common factors in EFA was determined by the chi-square test for additional model parameters as the number of factors increased.
A Markov chain Monte Carlo (MCMC) algorithm with Gibbs sampling was used to iteratively approximate the posterior distributions of the model parameters. Two parallel chains were run, with a convergence criterion of proportional scale reduction close to 1 and the first half of the iterations treated as the "burn-in" phase [19]. In addition, 100,000 iterations were used to further verify the convergence process graphically. The posterior predictive P-value was also used to evaluate the quality of the model obtained.
Bayesian model convergence was checked by examining the autocorrelation function and the trace and density plots of the parameters. The Bayesian 95% credibility interval was determined by the 2.5th and 97.5th percentiles of the posterior distribution.
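The proportional scale reduction criterion can be sketched as a generic Gelman-Rubin computation over parallel chains; this is an illustrative stdlib version, not the exact Mplus implementation.

```python
def proportional_scale_reduction(chains):
    """Gelman-Rubin style scale reduction factor for parallel MCMC chains,
    discarding the first half of each chain as burn-in; values close to 1
    indicate convergence."""
    halves = [chain[len(chain) // 2:] for chain in chains]
    n, m = len(halves[0]), len(halves)
    means = [sum(c) / n for c in halves]
    grand_mean = sum(means) / m
    # Between-chain variance of the chain means
    between = n * sum((mu - grand_mean) ** 2 for mu in means) / (m - 1)
    # Average within-chain variance
    within = sum(
        sum((x - mu) ** 2 for x in c) / (n - 1) for c, mu in zip(halves, means)
    ) / m
    # Pooled posterior variance estimate and its ratio to the within-chain variance
    pooled_var = (n - 1) / n * within + between / n
    return (pooled_var / within) ** 0.5
```

When the chains explore the same region, the between-chain term vanishes and the factor approaches 1; chains stuck in different regions inflate it well above 1.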
Statistical software Mplus 6.11 was used for all analyses [20]. The code for statistical programs used in the paper is available upon request.

RESULTS
Observed crude odds ratios (Table 1) suggest a strong protective effect of x2bin and a strong risk increase in the presence of the independent variables x4bin and x6bin, although none of these effects reached the usual statistical significance level (p<0.05).
The same goes for the associations among the independent regression variables which were low to moderate in magnitude. Compared to the true values (see data generating equations), the largest distortions of observed associations due to measurement errors and unmeasured confounding were the reversed sign of the x2bin effect and about fourfold underestimation of the x4bin effect on the binary outcome.
The logistic regression goodness-of-fit evaluated by the Hosmer & Lemeshow statistic [21] was excellent (P=0.94). No heteroscedasticity of the model residuals was indicated, as these were approximately normally distributed and uncorrelated with the predicted values.
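The Hosmer & Lemeshow statistic groups subjects by deciles of predicted risk and compares observed with expected event counts in each group. A sketch of the statistic itself (the reference P-value comes from a chi-square distribution with g - 2 degrees of freedom, which is omitted here to keep the example stdlib-only):

```python
def hosmer_lemeshow_statistic(predicted_probs, outcomes, g=10):
    """Chi-square style Hosmer-Lemeshow statistic over g risk-ordered groups."""
    pairs = sorted(zip(predicted_probs, outcomes))
    size = len(pairs) // g
    statistic = 0.0
    for k in range(g):
        group = pairs[k * size:(k + 1) * size] if k < g - 1 else pairs[k * size:]
        n = len(group)
        expected = sum(p for p, _ in group)   # expected number of events
        observed = sum(y for _, y in group)   # observed number of events
        p_bar = expected / n
        if 0.0 < p_bar < 1.0:                 # skip degenerate groups
            statistic += (observed - expected) ** 2 / (n * p_bar * (1.0 - p_bar))
    return statistic
```

A small statistic (large P-value) indicates that predicted risks track observed event rates across the risk spectrum, as reported above.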
Bayesian posterior predictive checking showed a 95% confidence interval for the difference between the observed and the replicated chi-square values from -25.228 to 22.202 and a posterior predictive P-value of 0.541, thus indicating good model fit.
EFA correctly identified the presence of two moderately correlated latent variables acting as unmeasured confounders but their factor loadings showed considerable average bias and large variation (Table 2).
When regressed on the binary outcome, the effect size of the first confounder was considerably underestimated and the sign of the second confounder was reversed in Bayesian SEM (Table 3).
The correlation between the two latent confounders produced by simulation was 0.40 but was estimated at a much lower value of 0.12 in Bayesian SEM.
The 95% credible interval included the true values for all but one observed variable in each of the two latent confounders (Table 2), the exceptions being due to significant underestimation of the strongest effects (x4bin for f1 and x5bin for f2).
The largest average absolute bias was found for logistic regression, followed by Bayesian SEM with prior distributions over the plausible range of the regression coefficients of the observed independent binary variables on the binary outcome (Table 3). Compared to the standard logistic regression with an average absolute bias of 111%, the Bayesian logistic regression with latent confounding determined from the prior exploratory factor analysis reduced the bias almost fourfold, to 30%.

DISCUSSION
Bayesian SEM considerably reduced the bias of the regression effects of observed independent variables on the binary outcome in the presence of two correlated unmeasured confounders, in comparison with standard logistic regression which ignored the confounding. Almost fourfold bias reduction was achieved despite a relatively small sample (N=100) and the large bias and variance of the factor loadings for the latent confounding variables. However, the magnitude of residual confounding was still high. In other words, the heuristic value of Bayesian SEM for unmeasured confounding seems to lie more in detecting the relationship between observed and latent confounding variables than in their precise quantification.
The large bias of the logistic regression parameters in the present study may come as a surprise, but the difficulty of the statistical task should be kept in mind. It is worth recalling that 40% of the continuous outcome variation was due to measurement error terms (d1, d2, e1) and almost 80% of the variation was determined solely by the two latent confounders (f1, f2). After logistic transformation to binary variables, such large measurement errors and strong confounding produced some very misleading crude odds ratios and a lack of power as judged by the 95% confidence intervals (Table 1), thus leading to distorted logistic regression parameters. Observed binary variables only partially reflect the causal process defined on the continuous (logit) scale. In addition, the missing information was not "missing at random", i.e. fully explainable by the observed data, and the sample size was too small to evaluate a large number of parameters and their complex relationships in the data generating process. From this perspective, only partial elucidation of this process via the Bayesian SEM parameters comes as no surprise, and the almost fourfold bias reduction in comparison with the logistic regression parameters can be considered a very satisfactory result.
There is a growing number of Internet surveys which are particularly susceptible to self-selection of the respondents in a way that affects the outcomes of interest, because the motivation to participate is correlated with personal characteristics which increase the chances of modifying the outcomes (e.g. unhealthy behaviors). If there is a follow-up to such a survey, the self-selection process tends to gather strength and eventually may become the strongest predictor of the outcome change over time. The same process also reduces the precision of the estimates with the growing depletion of respondents, thus inflating the measurement error components in the regression model and reducing statistical power. In addition, many sources of confounding are multi-factorial, i.e. include several related but distinct sources (e.g. motivation to improve one's own health, motivation to follow medical advice, motivation to please friends and family, etc.). Therefore many statistical analyses of data gathered in this way do not share the methodological advantages of a well designed study and have to deal with large measurement errors and multiple sources of confounding. This can lead to large biases when standard logistic regression is used, whereas Bayesian SEM is better suited to include additional information which can reduce the bias. For example, a common factor of personal characteristics associated with both respondent self-selection and the outcome of interest may be added to the logistic regression to arrive at more realistic effect size estimates.
In epidemiology, confounding is generally viewed as a nuisance, an obstacle to sound scientific results. From a different angle, however, unmeasured confounders are just other causes unaccounted for in a particular study, so elucidating their nature through associations with other already established causes is a common scientific endeavor. The structural causal modeling framework [22] aims to integrate SEM and causal modeling by examining the hypothesis of interest as a "counterfactual probability" to be compared with an alternative hypothesis in order to arrive at a causal effect. The EFA in the present study can be seen as a method to derive the counterfactual under an unmeasured confounding scenario, conditional on the observed variables.
The conditions under which counterfactual hypothesis may be inferred from the data define it empirically [22] and provide the basis for the sensitivity analysis of causal effects.
The direction of unmeasured confounding can be determined for a binary intervention variable if its sign is ascertained and there is only one confounder or a set of mutually uncorrelated confounders [23]. However, these conditions do not apply to the example presented in this paper because of the correlation between the two latent confounding variables, and they seem unlikely to hold in many observational studies. In addition, the requirement that no unmeasured confounder is the cause of another unmeasured confounder rules out its use for a sequence of outcomes which naturally arises in chronic disease progression, where one outcome (e.g. high arterial blood pressure) is a predictor of another (e.g. tissue damage) and so on (e.g. myocardial infarct). The statistical framework for analyzing such models includes mediation and latent growth analysis in SEM, and instrumental and marginal structural modeling in the medical statistics tradition. The way these methods deal with confounding is outlined next.
Latent growth models [20] search for the patterns that best describe individual trajectories in terms of the variables relevant for a time-related process, typically risk factors, sequential outcomes and intervention variables. Longitudinal observational studies are prone to time-dependent confounding, as the intervention at a given point in time may depend on the effectiveness of the previous interventions as indicated by successive outcome evaluations. For example, a patient who responded well to the first anti-hypertensive medication prescribed may carry on with it, as opposed to one whose response was not satisfactory and who was therefore prescribed another drug regimen. As treatment decisions lie on the causal pathway of disease progression, their allocation is not independent of the outcomes, so both may be influenced by an unmeasured confounder (e.g. the lifestyle characteristics that increase the hypertension risk also decrease adherence to its treatment). The so-called confounding by indication is driven by a similar process, where the confounders are clinical and other indications for treatment.
Mediation analysis has rarely been advocated in medical statistics in the way it has been used in econometrics and psychometrics [18,24]. The instrumental variable method is a kind of mediation analysis in which an "instrument" variable affects intervention allocation but not directly the outcome of interest. If such a variable is a cause, or a proxy for a cause, of the intervention and is not associated with unmeasured confounding, then an unbiased estimate of the intervention effect can be obtained by comparing the effect of the instrument on the intervention with the effect of the instrument on the outcome [4,25,26].
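For a binary instrument, this comparison reduces to the classical Wald ratio. A minimal sketch with toy data (perfect compliance, so the exposure equals the instrument); the data are illustrative, not from the study:

```python
def wald_iv_estimate(instrument, exposure, outcome):
    """Wald (instrumental variable) estimator for a binary instrument: the
    instrument's effect on the outcome divided by its effect on the exposure.
    Unbiased only when the instrument is unrelated to the unmeasured
    confounder and affects the outcome solely through the exposure."""
    def mean(values):
        return sum(values) / len(values)

    def z_contrast(target):
        # Difference in means of `target` between instrument groups z=1 and z=0
        z1 = [t for z, t in zip(instrument, target) if z == 1]
        z0 = [t for z, t in zip(instrument, target) if z == 0]
        return mean(z1) - mean(z0)

    return z_contrast(outcome) / z_contrast(exposure)

# Toy data: exposure equals instrument, outcome rises by 2 units under exposure
z = [0, 0, 0, 0, 1, 1, 1, 1]
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1.0, 1.2, 0.8, 1.0, 3.0, 3.2, 2.8, 3.0]
effect = wald_iv_estimate(z, x, y)  # equals 2.0
```

With imperfect compliance the denominator shrinks below 1 and the ratio scales the outcome contrast up accordingly.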
The attraction of the instrumental variable method is that the average causal effect can be correctly estimated even in the presence of unmeasured confounding, under the following conditions: no correlation between the instrumental variable and the unmeasured confounder; no direct causal relationship between the instrumental variable and the outcome (the effect must be mediated through the exposure/intervention); and the instrumental variable must be a cause of the exposure/intervention variable [4,25,26]. However, these conditions may prove quite restrictive, especially for observational studies. The best way to calculate mediated regression effects, including those of confounding variables, particularly in the non-linear models used to analyze binary outcomes, has been debated in the literature for some time [7,18,24].

Marginal structural modeling [27] uses inverse probability weights to account for individual differences in the propensity to follow a particular trajectory (e.g. disease evolution) given the observed covariates. The method is based on the potential outcome framework [3] and assumes the absence of unmeasured confounding, the possibility of following any course of treatment at any point in time, and consistent estimation of the weights [27].
Several limitations of the present study should be kept in mind. First, only one among many possible confounding models was analyzed.
Although the complexity of the model was considerable and the sample size quite small, a variety of different covariance structures with unmeasured confounding should be tested [28] to better evaluate the effectiveness of Bayesian SEM in reducing the confounding bias.
Observational studies based on "big data" are particularly prone to selection bias and complex dependency structures between observed and latent variables. Second, although only artificial data guarantee full knowledge of the underlying data generating mechanism, real data examples should also be tested to evaluate the practical gains of the method proposed. The uncertainty of the EFA solution regarding the true data generating mechanism was certainly underestimated, although this problem may be mitigated by adding parameters which reflect this uncertainty (e.g. hyperpriors or sensitivity analysis with various EFA models). Third, the absence of a distinction between the different roles of independent regression predictors (e.g. treatment versus covariates) ruled out informative priors which would reduce the model standard errors, but this can be done with real data examples. Fourth, MCMC convergence may be difficult to achieve with certain types of SEM.

This paper challenges the frequently stated notion that observational studies cannot empirically verify the presence of unmeasured confounders and the bias they cause. Bayesian SEM may considerably reduce such bias, provided reasonable assumptions are made about the plausible impact of confounding, bounded between the two extreme scenarios of no versus maximum confounding. The difficulty of eliciting priors for the association between observed and unmeasured confounders has been recognized and recently dealt with by assuming their exchangeability, i.e. a similar magnitude and distribution of bias effects [7]. In this perspective, an unmeasured confounder is a missing covariate associated with the observed ones. Bayesian SEM in the present study goes along the same line, only it uses EFA to elicit the common factor distribution as a plausible generating mechanism for the observed covariates and evaluates its effect on the outcome within the range of the two extreme scenarios (no versus maximum confounding effect).
This appears to be a very general elicitation method which requires little knowledge of the strength of association between observed and unmeasured confounders. Furthermore, as a common factor score is a continuous variable similar to a propensity score, it has higher statistical power than dichotomized latent variables representing unmeasured confounding [29].

CONCLUSION
Bayesian SEM holds promise to considerably reduce the effect size bias due to unmeasured confounding. It may be particularly useful for observational studies prone to such bias. Further research should cast more light on how to apply this method for different covariance structures of unmeasured confounding variables.

CONSENT
It is not applicable.

ETHICAL APPROVAL
It is not applicable.