Adjustment for unmeasured confounding through informative priors for the confounder-outcome relation

Background Observational studies of medical interventions or risk factors are potentially biased by unmeasured confounding. In this paper we propose a Bayesian approach by defining an informative prior for the confounder-outcome relation, to reduce bias due to unmeasured confounding. This approach was motivated by the phenomenon that the presence of unmeasured confounding may be reflected in observed confounder-outcome relations being unexpected in terms of direction or magnitude. Methods The approach was tested using simulation studies and was illustrated in an empirical example of the relation between LDL cholesterol levels and systolic blood pressure. In simulated data, a comparison of the estimated exposure-outcome relation was made between two frequentist multivariable linear regression models and three Bayesian multivariable linear regression models, which varied in the precision of the prior distributions. Simulated data contained information on a continuous exposure, a continuous outcome, and two continuous confounders (one considered measured one unmeasured), under various scenarios. Results In various scenarios the proposed Bayesian analysis with an correctly specified informative prior for the confounder-outcome relation substantially reduced bias due to unmeasured confounding and was less biased than the frequentist model with covariate adjustment for one of the two confounding variables. Also, in general the MSE was smaller for the Bayesian model with informative prior, compared to the other models. Conclusions As incorporating (informative) prior information for the confounder-outcome relation may reduce the bias due to unmeasured confounding, we consider this approach one of many possible sensitivity analyses of unmeasured confounding. Electronic supplementary material The online version of this article (10.1186/s12874-018-0634-3) contains supplementary material, which is available to authorized users.


Background
Inferences from observational epidemiological studies are often hampered by confounding [1,2]. To estimate the causal effect of exposure on the outcome, adjustment for a minimal set of confounding variables (or confounders) is required [3][4][5][6]. However, there may be unmeasured variables that result in unmeasured (or residual) confounding. Several design and analytical methods to account for unmeasured confounding have been proposed [7], including cross-over designs e.g., [8,9], instrumental variable analysis e.g., [10,11], the use of negative controls [12], and approaches to collect information on unmeasured confounding variables in a subsample e.g., [13,14]. In addition, sensitivity analysis of unmeasured confounding is used to quantify the potential impact of unmeasured confounding [15][16][17].
Sensitivity analyses can be performed within a frequentist framework as well as within a Bayesian framework. The latter requires for example assumptions on prior distributions for the unknown parameters of the unmeasured confounder and its relations with exposure and outcome [18][19][20][21]. However, eliciting prior distributions for these unknown parameters can be very challenging as unmeasured confounders may actually be unknown. So far, Bayesian sensitivity analyses focused on allocating informative priors to the effect of the unmeasured confounders on the exposure or on the outcome [18,19,22]. Instead, it may be more straightforward to elicit prior distributions for the parameters of the effects of the observed confounders on the outcome.
Unmeasured confounding of the exposure-outcome relation may not only affect that relation, but may also bias the observed relations between confounders and outcome [23]. Constraining the estimation of the confounder-outcome relation, or incorporating (informative) prior information for the confounder-outcome relation, may (indirectly) reduce the bias due to unmeasured confounding of the exposure-outcome relation.
The aim of this research was to assess to what extent using prior information on parameters for an observed relation between a measured confounder and the outcome in a Bayesian analysis can reduce bias due to unmeasured confounding in an estimator of the exposure outcome relation. The remainder of this article is structured as follows. The bias due to omitting one or more confounders from a regression model is quantified in section 2. In section 3, the use of informative priors for the observed confounder-outcome relation was tested using simulation studies. Section 4 illustrates the approach using an empirical example of the relation between LDL cholesterol levels and systolic blood pressure. Section 5 provides a general discussion to the paper.

Notation
We consider studies of a continuous exposure (denoted by X), a continuous outcome (Y), and two continuous confounders (Z and U). All relations are assumed to be linear. All variables are considered related to the outcome, according to the model: y i = β yx x i + β yz z i + β yu u i + ε i , where lower case letters represent the realisations of the random variables Y, X, Z, and U, i is a subject indicator (i = 1, …, n), and ε~N(0,σ 2 ). The confounders are considered related to the exposure: x i = β xz z i + β xu u i + ζ i , and the confounders are also related to each other: z i = β zu u i + ξ i , with ζ~N(0,σ x 2 ) and ξ~N(0,σ z 2 ). For all models, the intercepts are assumed independent of all other terms in the models and are omitted here and in the following equations. The coefficients of these models represent an increase in the dependent variable by β .. for each unit increase in the independent variable. The structural relations between the variables are presented in Fig. 1.

Bias due to unmeasured confounding
For the fairly simple model outlined in Fig. 1, there are three possible scenarios of confounding adjustment: scenario 1.) both confounders Z and U are measured and adjusted for (e.g., by a multivariable regression analysis of Y on X, including Z and U as covariates); scenario 2.) none of the confounders are measured and hence none is adjusted for; and scenario 3.) one confounder (Z) is measured and adjusted for, while the other (U) is not. Because our interest is in situations in which unmeasured confounding is present, we only consider scenarios 2 and 3.
In both scenarios, the effect of X on Y can be estimated by means of a linear regression model. In the following, we assume all assumptions of the linear regression model are met, except that unmeasured confounding may be present. As a result, the estimator for the effect of X on Y is expected to be biased due to unmeasured confounding. Details about the bias due to unmeasured confounding are provided in Additional file 1: Appendix 1.
In scenario 2, the bias due to omitting Z and U from the data analytical model can be expressed as: where Var(Z), Var(X), and Var(U) denote the marginal variances of Z, X, and U, respectively. Equation (1) indicates that the bias resulting from omitting two confounders is independent of the true exposure-outcome relation β yx . Furthermore, the bias increases with increasing strength of the relation between each of the confounders and the outcome or the exposure (β yz , β yu , β xz , and β xu ). The bias is the result of different backdoor paths [24] from X to Y: Z ← U → Y, and X ← U → Z → Y, which can be identified in the equation.
In scenario 3 the bias due to omitting U from the data analytical model, while adjusting for Z, can be expressed as: where ρ 2 uz is the squared (Pearson's) correlation between U and Z, ρ 2 xz is the squared correlation between X and Z, and VarðUÞð1−ρ 2 uz Þ and VarðXÞð1−ρ 2 xz Þ, represent the conditional variances of U given Z and of X given Z, respectively. Equation (2) shows that the bias resulting from omitting one confounder from the adjustment model is independent of the true exposure-outcome relation β yx . Furthermore, the bias increases as the relation between the unmeasured confounder and the outcome (β yu ) or the exposure (β xu ) increases.
As the correlation between the confounders (ρ uz ) increases, the bias of the estimator of the exposure-outcome relation decreases. Intuitively, when two confounders are correlated, adjusting for one accounts for some of the variability (and thus confounding effect) in the other. Therefore, adjustment for one confounder may reduce the bias that is caused by the other [25,26]. In addition, in a linear model, Var(X|Z) ≤ Var(X) and the larger the absolute value of ρ xz the smaller Var(X|Z). Because of this decreased Var(X|Z), the residual bias carried by U, i.e. β xu β yu VarðUÞð1−ρ 2 uz Þ , is amplified. This bias amplification particularly happens when the confounder (Z) that is adjusted for acts like an instrumental variable (IV) or near-IV, meaning that it has a stronger relation with the exposure (X) than with the outcome (Y) [27,28].
In scenario 3, the linear regression analysis of Y on X and Z, yielding an estimate of β yx | z , is a biased estimator of the relation between X and Y. However, this linear regression analysis is also a biased estimator of the relation between Z and Y (β yz | x ). When we assume all variables follow a multivariate standard normal distribution, the bias in the β yz | x relation can be expressed as: where β 0 yu represents the conditional (or direct) effect of U on Y if both are standardized. Equation (3) shows that the unmeasured confounder (U) of the exposureoutcome relation may also confound the observed relation between the measured confounder (Z) and the outcome. If Z and X are independent (i.e., ρ xz = 0), the bias is simply the result of the backdoor path from Z to Y via U (i.e., β 0 yu ρ zu ). Note that even if Z and U are independent, the observed relation between Z and Y is biased, due to conditioning on X, which is a collider of Z and U and hence conditioning on X opens a path from Z to Y via U [24].
Reducing unmeasured confounding using a Bayesian model As indicated above, unmeasured confounding of the exposure-outcome relation can also bias the relation between an observed confounder and the outcome. Hence, an unexpected relation between a confounder and the outcome may suggest the presence of unmeasured confounding. Allocating informative priors to the observed confounder-outcome relation may not only reduce the bias in that parameter, but also may reduce the bias due to unmeasured confounding of the exposure-outcome relation.
In the absence of information about the confounder U, the relation between X and Y only can be controlled for confounding by Z. In a Bayesian framework, we can specify a linear model of Y as a function of X and Z. The parameters of interest, β yx , β yz and σ 2 , can then be estimated using their joint posterior distribution given the data for Y, X, and Z. The joint posterior distribution is proportional to the product of the density of the data times the joint prior distribution of the parameters: where g(β xy , β yz , σ 2 ) is the joint prior distribution and f(Y| X, Z, β yx , β yz , σ 2 ) is the probability density of Y conditional on the parameters: Assuming independent priors for the different parameters, the joint prior is simply a product of all marginal priors.
Incorporating (informative) prior information for the confounder-outcome relation, may (indirectly) reduce the bias due to unmeasured confounding (by the unmeasured variable U) of the exposure-outcome relation. This was tested through simulation studies, which are described in the next section.
Simulation study of Bayesian analysis to control for unmeasured confounding Objective A simulation study was performed to test the possible decrease in bias in the estimator of the exposure-outcome relation by using informative priors for the confounder-outcome relation. In simulated data, a comparison of the estimated relation between the exposure (X) and the outcome (Y) was made between two frequentist (OLS) multivariable linear regression models and three Bayesian multivariable linear regression models.

Data analysis
Every simulated data set was analysed in five different ways: two frequentist analyses and three Bayesian analyses. The two frequentist regression models included none or one of the two confounding variables: linear regression analysis without and with adjustment for the measured confounder Z. The three Bayesian regression analyses all incorporated the information about one confounder, but used different informative priors for the confounder-outcome relation. The performance of these methods was compared in terms of bias and precision of the estimator of the exposure-outcome relation. The simulation study was performed in R, version 3.1.1 [29].
The Bayesian model described in section 2.3 was used. All Bayesian regression analyses were adjusted for Z, but not for U. We used uninformative priors for σ 2 and β yx : σ ∼ U(0,100) and β yx ∼ N(μ = 0, τ = 0.001), where τ indicates the precision of the distribution. We used informative priors for the parameter β yz , but with different levels of precision. A normal informative prior was assumed for β yz , with the true value for β yz as the mean and different values for the precision, which were proportionate to the sample size n of the simulated data sets: β yz ∼ N(μ = β yz , τ = n, n/10, n/100). The precision could take three different values representing different degrees of certainty in the prior information. The Bayesian models were specified using the rjags package in R [30], which provides an interface from R to JAGS (http://mcmc-jags.sourceforge.net).
Since the priors for σ y and β yx were non-informative, the posterior distributions could be approximated by the product of the density of the data and the prior of β yz . The Gibbs sampler was used with four parallel chains for 2000 iterations. The first 1000 iterations were discarded as burn-in runs. Since the marginal posterior was normal, we chose to present the mean of the posterior distribution as an estimate of β yx|z .

Data generation
Data were generated according to the structure depicted in Fig. 1 and consisted of a continuous exposure (X), a continuous outcome (Y), and two continuous confounders (Z and U). First, U was sampled from a normal distribution: U~N(0, σ u 2 ). Second, Z was generated based on U: z i = β zu u i + ξ i , with ξ~N(0, σ z 2 ). Then, X was generated based on U and Z: x i = β xz z i + β xu u i + ζ i , with ζ~N(0, σ x 2 ). Finally, Y was generated based on U, Z, and X: y i = β yx x i + β yz z i + β yu u i + ε i , with ε~N(0, σ 2 ).
In all simulations, the variances σ u 2 , σ z 2 , σ x 2 , and σ 2 were set to 1. Furthermore, the exposure-outcome relation was fixed at β yx = 0 (i.e. zero relation). The parameter β zu was set at 0, or 1. The parameters β yz , and β xz were set at 1 or 2, indicating that the observed confounder Z was related to X and to Y in all scenarios. The parameters β yu and β xu were set at 0, 1, or 2. All combinations of the parameters settings were evaluated through simulations, leading to 72 different scenarios.

Comparison of methods
For each scenario 100 datasets of 1000 subjects each were generated. In each dataset the methods described above were applied. For each scenario separately, the performance of these methods was compared in terms of bias of the estimator of the relation between X and Y, the empirical standard deviation (SD) of the estimated relations between X and Y, and the mean squared error (MSE). For the frequentist models, we computed the average of the estimated regression coefficients (bias), their standard deviation (SD), and the mean of the squared difference between the estimated regression coefficient and the true exposure-outcome relation (MSE). For the Bayesian models, we computed the average of the posterior means (bias), their standard deviation (SD), and the mean of the squared difference between the posterior mean and the true exposure-outcome relation (MSE).

Example study of the relation between cholesterol levels and blood pressure
To illustrate the application of the use of informative priors for the observed confounder-outcome relation we used data on the relation between low-density lipoprotein (LDL cholesterol) levels and systolic blood pressure (SBP). This example was based on the Second Manifestations of Arterial disease (SMART) study, which is an ongoing prospective cohort study of patients with manifest vascular disease of vascular risk factors [31]. For this example, we assumed that there are two possible confounders of the LDL-SBP relation, namely body mass index (BMI) and blood glucose levels (BGL). A data set of 1000 observations was simulated based on the variance-covariance matrix and the vector of means of these four variables in the cohort study. In all analyses, BMI was considered to be a measured confounder, while BGL was considered to be unmeasured.

Comparison of methods
The different methods described in section 3.2.1 were applied to the example data. As a reference, we fitted a linear regression model of SBP on LDL, including BMI and BGL as covariates (referred to as the 'full model'). BMI was considered to be a measured confounder, while BGL was considered to be unmeasured. The performance of the different methods was assessed by the difference between the estimated LDL-SBP relations from the different models and the LDL-SBP relation obtained from the full model.
The Bayesian approach was implemented in two ways. We first used the estimated regression coefficient of the effect of BMI on systolic blood pressure from the full model (i.e., 0.32), as the mean for the prior distribution of the measured confounder on the outcome, and precision equal to the sample size (i.e., τ = 1000). We then used an relation from the literature as the prior mean. A previous study on the relation between BMI and SBP in adults found a linear regression coefficient of 0.77 [32]. This relation was used as the mean of the prior distribution of the measured confounder and outcome. Since we were less certain about this prior information, we used a smaller precision (τ = 100). For all the other relations we used uninformative priors as described in Section 3.2.1.

Results
Simulation study Table 1 shows the results of the simulation study for the scenarios where β xz = β yz = 2. Similar patterns were observed for other values of β xz and β yz ; these are omitted from the Table for brevity. Results for all simulated scenarios can be found in Additional file 2: Appendix 2. The Bayesian model with precision 100 (i.e., n/10) showed results that were in between those of the Bayesian models with precision 1000 (i.e., n) and precision 10 (i.e., n/100). Results for the Bayesian model with precision 100 are omitted for clarity (see Additional file 2: Appendix 2).
In most scenarios, the Bayesian model with precision 1000 showed less bias than the frequentist model with covariate adjustment. Noticeable exceptions in Table 1 are scenarios 8 and 14, in which the Bayesian model with precision 1000 was more biased than the frequentist model with covariate adjustment (which was actually unbiased). The reason for this is that in these scenarios U is not a confounder of the X-Y relation (because β xu = 0), yet it is a confounder of the Z-Y relation (e.g., in scenario 8 d β yzjx = 1.50, while β yz = 1). As the Bayesian model corrects the bias in the Z-Y relation, it induces a bias in the X-Y relation. In scenarios 10 and 16 in Table 1, the Bayesian models and the frequentist model with covariate adjustment yielded similar, yet biased, results. In these scenarios, the estimated relation between Z and Y from the frequentist   Bias refers to the bias in the estimator of the relation between X and Y, compared to the true X-Y relation (βyx = 0). τ indicates the precision of the prior distribution of the Z-Y relation in the Bayesian model and is proportional to the sample size of each generated data set (n = 1000). Abbreviations: SDstandard deviation of the empirical distributions of the parameter estimates; MSEmean squared error of the parameter estimates. See text for details on simulation study model with covariate adjustment corresponded with the mean of the prior distribution of this relation (i.e., d β yzjx = 1.00 and β yz = 1). Hence, the Bayesian model did not reduce bias, compared to the frequentist model. In scenarios 1-7, all methods that adjusted for the measured confounder Z yielded unbiased results, because the variable U was not a confounder in these scenarios (β yu = 0). The extent to which the Bayesian model reduced bias was substantially smaller when the precision was 10 instead of 1000. The standard deviation (SD) of the empirical distribution of the parameter estimates was smaller for the Bayesian model with precision 1000, compared to the frequentist model with covariate adjustment and the Bayesian model with precision 10 (the latter two showing approximately the same SD). Also, in general MSE was smaller for the Bayesian model with precision 1000, compared to the other models.

Empirical example
In the empirical example of the relation between low-density lipoprotein (LDL cholesterol) levels and systolic blood pressure (SBP)., LDL increased BP, after adjustment for BMI and BGL, but omitting BGL from the data analytical model reduced the estimated effect substantially from 1.24 to 1.03 ( Table 2). The amount of bias of the LDL-SBP relation slightly decreased when using an informative prior for the confounder outcome relation (i.e., for the BMI-SBP relation). However, even when the 'correct' prior, based on the full model, was used, the estimated effect of LDL on SBP remained substantially different from the reference value.

Discussion
This simulation study on the value of Bayesian analysis with informative priors for the relation between the measured confounder and the outcome in the presence of unmeasured confounding shows that such an analysis can reduce the bias due to unmeasured confounding substantially. The magnitude of the remaining bias decreases as the precision of the (correct) informative prior increases.
An obvious prerequisite when using the proposed Bayesian approach to correct for unmeasured confounding is prior knowledge about the relation between the measured confounder and the outcome. We argue that in many clinical research situations, such prior knowledge exists for many observed confounders, at least in terms of direction and order of magnitude of the relation. That information may be obtained from rigorously designed and conducted large epidemiological studies or from meta-analysis of individual patient data of randomised trials. Obviously, the impact of the Bayesian approach depends on the precision of the prior distribution. Informative priors with relatively small precision have little impact in term of confounding correction, yet allow Bayesian algorithms to be used. In practice it might be difficultor researchers may be reluctantto specify relatively highly informative priors.
If only the direction (but not the magnitude) of the confounder-outcome relation is included in the prior, the precision of the prior will be relatively small and the impact of the Bayesian analysis may be relatively small too. We did not include this particular form of prior distribution in our simulation study, but instead focused on distributions with the same mean, yet different precision.
As with any simulation study, an obvious limitation to our work is the finite number of simulated scenarios that we evaluated. For example, we only considered situations with two confounders, one being measured, one unmeasured. Although the two confounders Z and U could be considered as representing two sets of measured and unmeasured confounders, respectively, future research could address scenarios of multiple confounders with, e.g., different distributions of the confounders. Another scenario that we did not evaluate and could be the topic of future research is specification of the priors, such that these do not correspond to the 'true' confounder-outcome relation. The robustness to various levels of misspecifications of the prior distribution still needs to be studied.
Where to position this Bayesian approach in the toolbox of the researcher doing observational epidemiologic research? Given that many observational studies In all analyses (except for the reference), BMI was considered a measured confounder of the LDL-SBP relation, while blood glucose level was considered unmeasured confounder. Bayesian analysis 1 and Bayesian analysis 2 differ in the mean and precision of the prior distribution of the relation between BMI and SBP a The reference is based on the full model, i.e., is adjusted for BMI and blood glucose levels b SBP was measured in mmHg, LDL in mmol/l, and BMI in kg/m 2 potentially suffer from unmeasured confounding, sensitivity analysis of unmeasured confounding is often important. Eliciting priors for unobserved (and possibly unknown) confounding variables is likely to be difficult. On the other hand, focusing on the approximate size of the relations between measured confounders and the outcome provides the opportunity to perform a Bayesian sensitivity analysis as outlined in this paper. Informative priors for the measured confounder-outcome relations can reduce unmeasured confounding bias of the exposure-outcome relation. In case of observing unexpected confounder-outcome relations a sensitivity analysis of unmeasured confounding could be considered, in which prior information about the observed confounder-outcome relations is incorporated through Bayesian analysis.

Conclusions
In this paper we proposed a Bayesian approach to reduce bias due to unmeasured confounding by expressing an informative prior for a measured confounder-outcome relation. A simulation study on the value of this Bayesian analysis with informative priors for the relation between the measured confounder and the outcome in the presence of unmeasured confounding shows that such an analysis can indeed reduce the bias due to unmeasured confounding substantially. The magnitude of the remaining bias decreases as the precision of the (correct) informative prior increases. We consider this approach one of many possible sensitivity analyses of unmeasured confounding.

Additional files
Additional file 1: Appendix 1. Expressions of bias. (PDF 105 kb) Additional file 2: Appendix 2 Table A1. Results of the simulation study of different methods to control for confounding (PDF 395 kb) Abbreviations BGL: Blood glucose levels; BMI: Body mass index; LDL: Low-density lipoprotein; MSE: Mean squared error; SBP: Systolic blood pressure; SD: Standard deviation