High-dimensional confounding adjustment using continuous spike and slab priors

In observational studies, estimation of a causal effect of a treatment on an outcome relies on proper adjustment for confounding. If the number of the potential confounders ($p$) is larger than the number of observations ($n$), then direct control for all potential confounders is infeasible. Existing approaches for dimension reduction and penalization are generally aimed at predicting the outcome, and are less suited for estimation of causal effects. Under standard penalization approaches (e.g. Lasso), if a variable $X_j$ is strongly associated with the treatment $T$ but weakly with the outcome $Y$, the coefficient $\beta_j$ will be shrunk towards zero thus leading to confounding bias. Under the assumption of a linear model for the outcome and sparsity, we propose continuous spike and slab priors on the regression coefficients $\beta_j$ corresponding to the potential confounders $X_j$. Specifically, we introduce a prior distribution that does not heavily shrink to zero the coefficients ($\beta_j$s) of the $X_j$s that are strongly associated with $T$ but weakly associated with $Y$. We compare our proposed approach to several state of the art methods proposed in the literature. Our proposed approach has the following features: 1) it reduces confounding bias in high dimensional settings; 2) it shrinks towards zero coefficients of instrumental variables; and 3) it achieves good coverages even in small sample sizes. We apply our approach to the National Health and Nutrition Examination Survey (NHANES) data to estimate the causal effects of persistent pesticide exposure on triglyceride levels.


Introduction
In observational studies, we are often interested in estimating the causal effect of a treatment T on an outcome Y , which requires proper adjustment of a set of potential confounders X.In the context of high-dimensional data, where the number of potential measured confounders p could be even larger than the sample size n, standard methods for confounding adjustment such as regression or propensity scores (Rosenbaum & Rubin, 1983) will fail.
In the context of prediction, a variety of methods exist for imposing sparsity in regression models with a high-dimensional set of covariates.Arguably the most popular, the lasso (Tibshirani, 1996) places a penalty on the absolute value of the coefficients from a regression model, thus shrinking many of them to be exactly zero, leading to a more parsimonious model.A variety of extensions to the lasso have been proposed such as the SCAD, elastic net, and adaptive lasso penalties, to name a few (Fan & Li, 2001;Zou & Hastie, 2005;Zou, 2006).One challenge encountered with all of these approaches is the difficulty to provide a meaningful assessment of uncertainty around estimates of the regression coefficients.While progress has been made recently on this topic (Lockhart et al. , 2014;Taylor & Tibshirani, 2016), it remains difficult to obtain valid confidence intervals for parameters under complex, high-dimensional models.
Bayesian models can alleviate these issues by providing valid inference from posterior samples.Much of the recent work has centered around shrinkage priors, which can be represented as scale mixtures of Gaussian distributions and allow for straightforward posterior sampling.Park & Casella (2008) introduced the Bayesian lasso: a scale mixture of Gaussians with an exponential mixing distribution that induces wider tails than a standard normal prior.More recently, global-local shrinkage priors have been advocated that have a global shrinking parameter that applies to all parameters, as well as local shrinking parameters which are unique to the individual coefficients.Carvalho et al. (2010) introduced the horseshoe prior, which is a scaled mixture of Gaussians with a half-cauchy mixing distribution that has been shown empirically to have good performance in high-dimensional settings.Bhattacharya et al. (2015) introduced a new class of distributions that are also scaled mixtures of Gaussians with an additional Dirichlet mixing component and proved that it's posterior concentrates at the optimal rate.Ročková & George (2016) differ somewhat in that they adopt the spike and slab formulation of George & McCulloch (1993), however both the spike and slab priors are laplace distributions.All of these approaches are aimed at obtaining ideal amounts of shrinkage in high-dimensional settings where large coefficients should be shrunken a small amount, while others are shrunken heavily towards zero.
These and many other approaches have been based on the same principles of aiming to reduce shrinkage for important covariates in the context of prediction of Y .Several authors have pointed out that frequentist and Bayesian procedures for variable selection or shrinkage that focus on predicting Y perform poorly when the inferential goal is estimation of the effect of T on Y (Crainiceanu et al. , 2008;Wang et al. , 2012;Belloni et al. , 2014Belloni et al. , , 2017;;Hahn et al. , 2017).A variety of data driven methods have been developed to select confounders in causal inference (van der Laan & Gruber, 2010;De Luna et al. , 2011;Vansteelandt et al. , 2012;Wang et al. , 2012;Zigler & Dominici, 2014).Many of these approaches rely on the specification of a treatment model E(T |X), and an outcome model E(Y |T, X).Wang et al. (2012) introduced a Bayesian model averaging approach for estimating the effect of T on Y averaged across models that include different sets of potential confounders.They assume a priori that if a covariate X j is associated with the treatment T than this covariate should have high probability to be included into the outcome model, even if this covariate is weakly associated with the outcome.Many ideas have been built on this prior specification to address the issue of confounder selection and model uncertainty (Talbot et al. , 2015;Wang et al. , 2015;Cefalu et al. , 2016;Antonelli et al. , 2017).All of the aforementioned approaches have been shown to work well in identifying confounders or adjusting for confounding; however, none of these approaches are well-suited to a high dimensional vector of confounders.
Recently, there has been increased attention to estimating treatment effects when p ≥ n.Wilson & Reich (2014) introduced a decision theoretic approach to confounder selection for p ≥ n.They showed that their approach has connections to the adaptive lasso, but with weights aimed at reducing shrinkage of confounders, rather than predictors.Belloni et al. (2014) applied standard lasso models on both the treatment model E(T |X), and the outcome model E(Y |T, X), separately.Then, they identify as confounders the union of the variables that were not shrunk to zero in the two models, and estimate the causal effect using this reduced set of covariates.Farrell (2015) first applies lasso models to treatment and outcome models to select confounders.Second, they calculate a double robust estimator using the resulting unpenalized treatment and outcome models.Antonelli et al. (2016) implemented a similar doubly robust estimation approach using standard lasso outcome and treatment models but in context of matching on both the propensity and prognostic scores.Ertefaie et al. (2015) proposed an alternative approach for selecting confounders in high-dimensional settings by penalizing a joint likelihood for both the treatment and the outcome model to ultimately lead to the selection of important confounders.Shortreed & Ertefaie (2017) used similar ideas by fitting an adaptive lasso to a propensity score model, and show that it leads to the inclusion of only covariates necessary for confounding adjustment or outcome model prediction.Hahn et al. (2016) utilized horseshoe priors on a re-parameterized likelihood that aims to reduce shrinkage for important confounders.Athey et al. (2016) combined high-dimensional regression with the balancing weights of Zubizarreta (2015) to obtain valid inference of treatment effects even when the true data generating models are not sparse.All of these approaches have the advantage of being able to handle settings with p ≥ n, however, as we will demonstrate in simulations, existing approaches relying on asymptotic theory are likely to provide poor coverage in finite samples.Also, many of these approaches will tend to include instrumental variables, are only applicable to binary treatments and are not applicable to studying the effects of continuous environmental exposures (see Section 5), or both.
In Section 2 we propose spike and slab priors for confouding adjustment in the context of a homogeneous and heterogeneous treatment effects.In section 3, we detail the Bayesian computations, including selection of tuning parameters to achieve a good compromise between sparsity and addressing confounding bias.In section 4, we present results from several simulation studies that include homogeneous and heterogeneous treatment effects, strong and weak confounders, instrumental variables, and sparse and non sparse settings.In Section 5, we present the data analysis which considers continuous treatments, In section 6 we conclude with a summary of the strengths and weaknesses of the proposed approach and future research directions.

Spike and slab priors for confounding adjustment
Throughout, we will assume that we observe D i = (Y i , T i , X i ) for i, . . ., n, where n is the sample size of the observed data, Y i is the outcome, T i is the treatment, and X i is a p-dimensional vector of pretreatment covariates for subject i.We will assume that Y i is continuous, though we will not make assumptions regarding T i , as it can be binary, continuous, or categorical.In general we will be working under the highdimensional scenario of p ≥ n.Our estimand of interest is the average treatment effect (ATE), defined as , where Y i (t) is the potential outcome subject i would receive under treatment t.We will assume that the probability of receiving any value of treatment is greater than 0 for any combination of the covariates, commonly referred to as positivity.We make the stable unit treatment value assumption (SUTVA) (Little & Rubin, 2000), which states that the treatment received by one observation or unit does not affect the outcomes of other units and the potential outcomes are well-defined.We will further assume strong ignorability conditional on the observed covariates, and that the covariates necessary for ignorability are an unknown subset of X.Strong ignorability implies that potential outcomes are independent of T conditional on X.

Model and priors
In this section we assume a homogeneous treatment effect, i.e that the treatment effect is the same across all values of the covariates X.We will relax this assumption in Section 2.3.We introduce the following hierarchical formulation: (1) Under these assumptions, ∆(t 1 , t 2 ) = (t 1 − t 2 )β t .If γ j = 1 then β j ∼ ψ 1 (β j ) -the slab component of the prior.If γ j = 0 then β j ∼ ψ 0 (β j ) -the spike component of the prior.Therefore γ j = 1 indicates that X j is potentially an important confounder.We set ψ 1 (•) and ψ 0 (•) to be Laplace distributions with densities ψ 1 (β j ) = λ1 2 e −λ1|βj | and ψ 0 (β j ) = λ0 2 e −λ0|βj | , respectively.More specifically, when γ j = 1, the prior standard deviation of β j is √ 2/λ 1 and when γ j = 0 the prior standard deviation is √ 2/λ 0 .Following Ročková & George (2016), we will fix λ 1 to a small value, say 0.1, so that the prior variance for coefficients in the slab component of the prior is high enough to be reasonably uninformative.We will estimate λ 0 to determine how much to shrink coefficients that are placed in the spike component of the prior.
If we first assume that w j = 1, then the parameter θ denotes the prior probability that γ j = 1 and we assume a priori that θ ∼ Beta(a, b).We will adopt standard practice in the high-dimensional Bayesian literature and set a to a constant, and set b = p (Zhou et al. , 2014;Ročková & George, 2016).This prior more aggressively shrinks coefficients to the spike component of the prior as p grows.This feature is desirable in high-dimensional models where we must more aggressively shrink parameters as the covariate space grows to avoid the curse of dimensionality.Finally, for σ 2 , β 0 and β t , we assume conjugate but uninformative priors.

Selection of w j
The introduction of the weight w j in the prior specification on equation 3 and borrowing information from (T, X i ) to guide the specification of good values for w j are two key innovative features of our approach.If a variable X j is associated with T , then omitting X j in the outcome model could lead to confounding bias.Consequentially, it is desirable to increase the prior probability that β j is in the slab component of the prior.With this guiding principle in mind, we do the following: 1) we use lasso to fit the exposure model E(T |X) (Tibshirani, 1996); 2) for each X j that has a non zero regression coefficient from the lasso estimation of the exposure model, we set w j = δ where 0 < δ < 1. Please note that if δ < 1, then θ δ > θ which leads to a higher prior probability for β j to be included into the slab component of the prior.Smaller values of δ, lead to more protection against omitting an important confounder.However, values of δ too small might lead to inclusion of instrumental variables which decrease efficiency and can amplify bias in the presence of unmeasured confounding (Pearl, 2011).Now we provide guidance on how to select a reasonable value of δ.We start by defining the conditional probability of being included into the slab component of the prior: which depends on λ 0 , λ 1 , θ, β j , and w j .Figure 1 shows p * θ (β j ) as a function of w j and β j when we fix λ 0 = 30, λ 1 = 0.1, and θ = 0.01.This represents a sparse situation in which only 1 out of 100 variables would be included in the slab component of the prior.It is clear from the figure that the weights, w j can have a large impact on certain types of covariates.A covariate with a strong association with the outcome (β j = 0.4) has a value of p * θ (β j ) near 1, regardless of the value of w j .Instrumental variables or variables with no association with either outcome and treatment, i.e noise variables (β j = 0), have a value of p * θ (β j ) near 0 for all values of w j except for w j 0. Variables with moderate associations with the outcome (β j = 0.2) have p * θ (β j ) ≈ 0 when w j = 1, while p * θ (β j ) greatly increases as we decrease w j .Figure 1 can be used to guide our choice of δ.We assume w j = δ ∀j, and we want to set δ to be as small as possible to protect us against shrinking the coefficients for important confounders, but we also want to ensure that coefficients for instrumental variables or noise variables are heavily shrunk towards 0. We can see in Figure 1 that we can set δ to be as small as possible to increase p * θ (β j ) for variables with moderate associations with the outcome, while still keeping p * θ (β j = 0) low.One possibility is to select the minimum value of δ such that p * θ (β j = 0) is less than some threshold, such as 0.1.This would imply that the probability of including an instrument or noise variable into the slab component of the prior would be 0.1.Intuitively, this threshold represents the point at which we can get the most protection against residual confounding bias while alleviating the impact of instrumental variables.
Additionally, we have found that when the treatment assignment is not sparse, assigning weights of δ to all covariates identified by a treatment model can lead to poor performance in the subsequent outcome model.One approach to these types of issues is to cap the number of variables that are prioritized by the treatment model to k covariates.We will explore a scenario where this is the case in the simulation study of Section 4 and assess the extent that this problem is corrected when we limit the number of variables prioritized.

Heterogeneous treatment effects
In this section we describe our approach under the more general case of treatment effect heterogeneity and in the context where the treatment variable is binary or categorical.Addressing treatment effect heterogeneity in the presence of continuous treatments is a more difficult problem, which is beyond the scope of this paper.In the case of a binary treatment, we now specify the same model as in Section 2 but separately for t = 1 and t = 0: It is important to note that for each treatment level t, we fit the model only using subjects i with T i = t.Once posterior draws of all model parameters are obtained, it is straightforward to obtain posterior distributions for E(Y i (t)) for all subjects i and treatments levels t.If we let s denote the s th posterior draw obtained in our MCMC algorithm then the posterior distribution of the treatment effect can be defined as In brief, we have built separate regression models for each of the treatment levels and then taken the difference in the mean predicted values from these models for each observation in the data.One nice feature of doing the analysis within the Bayesian paradigm is that once we have obtained posterior distributions for each of E(Y (t)), we have the posterior distribution of any function of these values and therefore have the posterior distribution of ∆(t 1 , t 2 ) automatically.This means that performing inference on the treatment effect involves no further calculation.If we were interested in estimating treatment effects within particular subgroups such as the treatment effect on the treated, then the sum in equation 9 would be over only those subjects of interest.

Bayesian computation
Posterior distributions of all the unknown parameters can be easily obtained via standard gibbs-sampling as each parameter is conditionally conjugate, with the exception of θ, which is easy to sample from since it is univariate.A key component driving the ease of sampling is that the laplace distribution has the following representation as a scale mixture of Gaussians with an exponential mixing weight.
This makes the prior distribution of β multivariate normal with a covariance matrix equal to diag(τ 2 1 , . . ., τ 2 p ). Full details of this mixture as well as posterior implementation can be found in the Appendix A. The most important parameter of the procedure is λ 0 , which dictates how strongly parameters are shrunk towards zero when they are included in the spike part of the model, i.e γ j = 0. Bayesian inference allows for viable alternatives over cross validation, which is commonly used in the penalized likelihood literature.We will examine estimation of λ 0 using an empirical Bayes procedure, though it is possible to utilize a fully Bayesian specification that places a prior on λ 0 as discussed in Park & Casella (2008).

Selection of λ 0
In many complex settings, such as the current one, empirical Bayes estimators of tuning parameters can not be done analytically.To alleviate this issue Casella (2001) proposed a Monte Carlo based approach to finding empirical Bayes estimates of hyperparameter values.The general idea is very similar to the EM algorithm for estimating missing or unknown parameters, however, expectations in the E-step are calculated using draws from a Gibbs Sampler.In our example, we set λ (0) 0 = λ * 0 , a starting value of the algorithm.Then for iteration k, set where the expectations are approximated with averages from the previous iteration's Gibbs Sampler.Due to Monte Carlo error, this algorithm will not exactly converge, but rather will bounce around the maximum likelihood estimate.The more posterior samples used during each iteration, the less this will occur.Once this has run long enough and the maximum likelihood estimate of λ 0 is found, inference can proceed by running the same gibbs sampler with the selected λ 0 .A derivation of this quantity can be found in Appendix B.

Posterior mode estimation
The most natural implementation of the above formulation is within the Bayesian paradigm, where we can obtain samples of γ directly.This is advantageous as we can examine p(γ|D), which provides an assessment of model uncertainty and can be used to identify the best-fitting models.In some situations, however, MCMC can become burdensome if p is very large.An alternative approach is to formulate model estimation as a penalized likelihood problem, in which we estimate the posterior mode of the model.While in this paradigm we lose some of the aforementioned features of Bayesian inference, estimation can be done in a fraction of the time.Furthermore, the posterior mode of our model will be sparse, i.e many of the regression coefficients will be estimated to be exactly zero allowing us to quickly perform confounder selection in high-dimensions.The details on the penalized likelihood implementation are in Appendix C. In Appendix D we also show that the posterior mode of ∆ from model 1 is consistent at a rate equal to O logp n .

Simulation study
In this section we compared our proposed approach with several state of the art alternatives for confounding adjustment in the context of p ≥ n.We consider three data generating mechanisms: 1) homogeneous treatment effect and sparsity; 2) heterogeneous treatment effect and sparsity; and 3) homogeneous treatment effect and non sparsity.Our goal is always to estimate the average treatment effect.Before we detail the data generating mechanisms to be examined, we describe the approaches being compared.
1. Proposed approach using MCMC, where we estimate λ 0 using the empirical Bayes approach described in Section 3.1 and choose δ as described in Section 2.2 (EM-SSL) 2. EM-SSL for heterogeneous treatment effect 3. Outcome lasso that includes treatment and covariates, but only penalizes the covariates 4. Re-fit an unpenalized regression model using the covariates identified by the outcome lasso approach above (Post selection lasso) 5. Double post selection approach of Belloni et al. (2014) 6. Doubly robust lasso approach of Farrell (2015) 7. Approximate residual de-biasing approach of Athey et al. (2016) The purpose of this simulation study is to assess the performance of our proposed approach compared to competitors when the true outcome model is linear.In the context of non linear relationships between the T and Y none of the methods compared here would perform well.For some of these estimators, an extension to heterogeneous treatment effects exist, while others implicitly account for treatment effect heterogeneity.With the exception of our estimator, we will always use the version of the estimator that matches the data generating mechanism, e.g the homogeneous version of the estimators will be used for the homogeneous simulation studies.In all simulations the covariates are drawn from a multivariate normal distribution with marginal variances set to 1 and correlation of 0.6 between all covariates.For each scenario, we compare average percent bias, mean squared error, 95% interval coverage, and the ratio of the average estimated standard errors and the true standard errors.For our approach, the interval coverage is calculated as the percentage of the time our posterior credible interval covers the true parameter, while all other approaches use frequentist confidence intervals.Finally, we present additional simulation results for differing sample sizes and differing confounding strengths in Appendix E, though we found the results to be very similar to those seen here.

Homogeneous treatment effects
We now examine our approach in a high-dimensional setting where p = 500 and n = 200, in which there exists strong confounders, weak confounders, and instruments.We simulate the treatment and outcome from the following models: where i ∼ N (0, 1).The first 8 elements of β are (1, −1, 0.3, −0.3, 0, 0, 1, −1), while the remaining elements are draw from a normal distribution with a standard deviation of 0.1.The first 6 elements of ψ are (1, −1, 1, −1, 1, −1) and the remaining values are set to zero.This leads to a treatment prevalance of 50%.In this setting, covariates 1 and 2 are strong confounders, covariates 3 and 4 are so called "weak" confounders that are weakly associated with the outcome and strongly associated with treatment, covariates 5 and 6 are instruments, and covariates 7 and 8 are strong predictors of the outcome.The remaining covariates have no association with the treatment and a small to moderate association with the outcome.This situation is not strictly sparse due to the small signals in β, however, it is approximately sparse in the sense that only a small number of covariates are needed to obtain unbiased estimates of the treatment effect.  1 shows the results of the proposed simulation across 1000 simulated datasets, and we see that the proposed approach performs the best with respect to all metrics.EM-SSL achieves the minimum bias of 13.1% and the minimum MSE of 0.09.The heterogeneous version of the EM-SSL procedure performs slightly worse in terms of bias and efficiency, which is to be expected given that it splits the sample into the treated and controls and estimates models separately in the two groups.The next best performing estimator in terms of bias and MSE was the double post selection procedure that had a bias of 15% and an MSE of 0.12.The doubly robust lasso had a small bias of 16.5%, though it was quite variable due to the instability of weights in high-dimensions.In terms of interval estimation we do the best in terms of 95% interval coverages (93%) whereas all the other estimators have coverages well below the nominal level (84%, 73%, and 77%).Looking at the ratio of the average estimated to true standard errors, our approach does well (1.05), while most procedures were substantially smaller than 1.The approximate residual de-biasing procedure does well at estimating the standard errors, but is too biased to achieve good interval coverages.Our EM-SSL Heterogeneous procedure also does well at estimating the standard errors, but has an interval coverage of 84% due to the larger amount of bias relative to the homogeneous version.

Heterogeneous treatment effects
We now simulate data with n = 400 and p = 800 and a heterogeneous treatment effect.The data generating models are of the following form: where i ∼ N (0, 1).This simulation is similar to the previous section with three changes: 1) covariates 9 through 500 have no association with either treatment or outcome; 2) there is an interaction between the treatment and covariates 1 and 3; and 3) the prevalence of the treatment has been dropped from approximately 50% to 25%.We have increased the sample size from 200 to 400 since we lowered the prevalence of the treatment, and all methods explored need a sufficient sample size in both the treated and control groups to estimate heterogeneous treatment effects.Table 2 shows the results averaged across 1000 simulations.Results are similar to the homogeneous treatment effects setting.The double post selection approach again fares relatively well across all metrics, however, is slightly outperformed by the EM-SSL approach.The approximate residual de-biasing approach has fairly substantial amounts of bias in this setting, which also leads to poor interval coverages.The doubly robust lasso again has a higher MSE due to the instability of inverse propensity weights in high-dimensional settings.Our EM-SSL procedure does not achieve the nominal interval coverage in this setting, though this is due to the bias incurred by assuming a homogeneous treatment effect.Our EM-SSL heterogeneous procedure achieves interval coverages of 95% and an average estimated standard error that is close to the truth (0.96

Dense treatment model
Here we simulate data as described in Athey et al. (2016) where the treatment model is purposely chosen to be dense.More specifically, first we define 20 clusters, {c 1 , . . ., c 20 } where c k ∼ N (0, I pxp ).Second, we draw C i uniformly at random from one of the 20.Third, we draw the covariates from a multivariate normal distribution centered at C i with the identity matrix as the covariance.Fourth, we set T i = 1 with probability 0.1 for the first 10 clusters, and T i = 1 with probability 0.9 for the remaining clusters.Finally, we generate data from the outcome model defined as and is normalized such that ||β|| 2 2 = 18.Here we will again set n = 200 and p = 500.Intuitively, this is a simulation scenario in which the outcome model is approximately sparse, though the treatment model is dense as all of the covariates are associated with the treatment.
Results are summarize in Table 3.Because the data generating mechanism does not assume sparsity, our original EM-SSL procedure performs poorly compared to approximate residual de-biasing, double post selection, and outcome LASSO approaches.We obtain a large MSE of 1.76, while also doing very poorly at estimating the standard errors of our approach as the ratio of the average estimated to true standard errors is 0.52.However, under a non sparse setting, if we impose a restriction that only the top k = 10 variables most associated with the treatment (identified by the magnitude of their coefficients in the treatment lasso model) are prioritized with w j = δ, then our approach (EM-SSL Restricted) performs the best in terms of MSE (0.59) and interval coverage (93%).It is also important to note that while we did not show the restricted results in the other simulation scenarios that had sparse treatment models, the restricted approach performed almost identically to the original EM-SSL approach.).The data has also been aggregated and made available by Patel et al. (2016).The NHANES data is a nationally representative study, in which participants were questioned regarding their health status, with a subset of these patients providing extensive clinical and laboratory tests to provide information on a variety of environmental attributes such as chemical toxicants, pollutants, allergens, bacterial/viral organisms, and nutrients (Patel et al. , 2010).Our analysis will center on data from the 1999-2000, 2001-2002, 2003-2004, and 2005-2006 surveys.We build on the analysis described in Patel et al. (2012), by applying our proposed methodology to estimating the effects of volatile compounds (VCs) on triglyceride levels in humans.VCs were measured in n = 177 subjects, and there were p = 127 covariates.The list of potential confounders consists of other volatile compounds, their interactions, other persistent pesticides measured in the respective subsample, body measurements, demographic, and socioeconomic variables.To evaluate the proposed approach in a setting with p > n, we ran an additional analysis, which looked at the effect of VCs on triglycerides in subjects over 40 years old.This led to a sample size of n = 77 subjects.
In previous work (Patel et al. , 2012) these exposures were evaluated individually without controlling for the remaining pesticides, and only a small subset of pre-selected covariates such as age, BMI, and gender were controlled for.Patel & Ioannidis (2014) wrote that persistent pesticides tend to be highly correlated with one another and that many of the associations found by previous exposome studies (Patel et al. , 2012) could simply be to confounding bias that was unadjusted for due to the small set of confounders used and the fact that other pesticides were not adjusted for.This highlights the need for an analysis that adjust for all potential confounders, but in our analyses we have p = 127 covariates with small sample sizes.Therefore, due to the large model space, some confounder selection or shrinkage is required to obtain efficient estimates of exposure effects.

VC analyses
We now analyze the NHANES data described above.In particular, we will examine the effect of each of 10 volatile compounds on triglyceride levels while controlling for an extensive set of potential confounders.We analyze the effect of each volatile compound using three approaches: 1) an unadjusted model that regresses the outcome on the treatment without confounder adjustment, 2) the homogeneous EM-SSL procedure described above, and 3) the double post selection approach described in Belloni et al. (2014).The approximate residual de-biasing and the doubly robust lasso approaches are only applicable to categorical treatments and are therefore left out.We restrict attention to the homogeneous treatment effect application of our approach, because addressing heterogeneity in the setting of continuous treatments would require additional work that is beyond the scope of this paper.
Figure 2 shows the point estimates and 95% confidence intervals (credible interval for EM-SSL approach) from the analysis across the 10 volatile compounds for the three approaches under consideration and each of the two data sets being analyzed.The results are qualitatively very similar across the three approaches for each of the 10 exposures we looked at, in both the full data set and the data set restricting to older subjects.One exception is VC7 in the analysis of older subjects where the EM-SSL estimate has a credible interval that does not contain zero, while the confidence intervals for the other two approaches do.In general, however, the results are very similar in magnitude and direction across the approaches, with the only major difference coming in the widths of the corresponding confidence (and credible) intervals.

Comparison of standard errors across approaches
While there are not drastic differences in point estimates, there are large differences in the widths of the confidence intervals of the two approaches that aim to adjust for confounding.Of interest is the ratio of the standard errors for the EM-SSL procedure and the double post selection approach.These can be seen in Figure 3.We see that in the full data set the EM-SSL procedure is more efficient overall than the double post selection approach.The majority of the analyses (8/10) had smaller confidence intervals under the EM-SSL procedure with an average confidence interval ratio of 0.9 as indicated by the dashed line in Figure 3.This means that the EM-SSL procedure on average has a 10% smaller standard error than the double post selection approach and occasionally has a standard error 30% smaller.The results are even more striking when we subset the data to the n = 77 subjects who are over 40 years old.In this case all analyses were more efficient using the EM-SSL procedure, with an average standard error ratio of 0.78, highlighting the ability of our estimator in high-dimensional scenarios.That our estimator is more efficient than the double post selection estimator is not surprising.The goal of the double post selection estimator was to obtain valid inference in high-dimensional scenarios, not to provide the most efficient estimate of the treatment effect.Nonetheless, this analysis highlights an important difference between the approaches in their finite sample performance and how they address instrumental variables.

Discussion
In this paper we have introduced a novel approach for estimating treatment effects in high-dimensional settings.We introduced a generalization of the spike and slab formulation to allow the probability of entering into the slab component of the prior to depend on the association between each potential confounder and the treatment.We highlighted how this could drastically reduce the shrinkage of important confounders, while still shrinking to zero the coefficients of instrumental and noise variables.Through simulation we showed that our proposed approach has highly competitive performance compared with state of the art approaches under data generating mechanisms that are more or less sparse and also in the context of heterogeneous treatment effects.By tackling the problem within the Bayesian paradigm we achieve good interval coverage rates even in small samples unlike existing approaches in the literature.We expect that as the sample size increases, the performance of existing approaches will improve and give good interval coverages.Importantly, we applied the proposed approach to an exposome study and found that our approach gave smaller confidence intervals than existing approaches for confounding adjustment in high dimensional settings.
Our proposed approach has limitations.First, we make the strong assumption of a linear outcome model which allows us to: 1) handle ultra high-dimensional covariate spaces; and 2) borrow information from the treatment model when estimating the causal effects.While similar assumptions of linearity are also used in existing approaches to high-dimensional confounding adjustment (Belloni et al. , 2014;Farrell, 2015;Athey et al. , 2016;Antonelli et al. , 2016;Shortreed & Ertefaie, 2017), a topic of future research would be to extend these ideas to nonlinear settings to overcome challenges inherent to model misspecification.Second, we also make the assumption of sparsity of both the treatment and outcome models.While we showed that we can potentially overcome a lack of sparsity in the treatment model by restricting that only a small percentage of the total number of covariates be prioritized in the outcome model, our approach still relies on sparsity of the outcome model to obtain good results in terms of estimation and interval coverages.Third, although we consider scenarios of large p, MCMC can become computationally intensive in ultra-high dimensions where the number of covariates is in the tens of thousands.In this setting, alternative approaches such as double post selection, the doubly robust lasso, and approximate residual de-biasing could be used.Fourth, in the current manuscript we restricted attention to the case where w j = δ for all variables associated with the treatment.We can relax this assumption to let w j vary for each covariate j associated with the treatment.One idea is to set w j to be inversely proportional to the strength of a variables association with the treatment.This and other approaches are possible and merit further research to identify the best strategy to including important confounders while excluding instrumental variables.
More generally, the idea to borrow information from the treatment model to guide the amount of shrinkage in the outcome model can be extended to other high-dimensional priors beyond the spike and slab one seen here.A number of priors are used in high-dimensional Bayesian modeling, and this idea can potentially be extended to many of them.There are a number of other generalizations to our approach that could make it more widely applicable to different data structures.For example, the extension to binary or ordinal outcome variables is conceptually straightforward due to latent variable representations of GLMs first seen in Albert & Chib (1993).Finally, our approach can also be extended beyond the current setting of confounding adjustment, as the idea to place more weight on certain variables would be useful in any setting where important variables can be identified a priori.

Software
An R package implementing the methodology proposed here can be found at github.com/jantonelli111/HDconfounding

A Posterior computation
The laplace distribution has the following representation as a scale mixture of Gaussians with an exponential mixing weight.
This makes the conditional distribution of β multivariate normal.To simplify notation we will also define Conditional on a value of λ 0 the algorithm iterates through the following steps: 1. Sample β * from the full conditional: where D τ is a diagonal matrix with diagonal (K, K, τ 2 1 , . . ., τ 2 p ) 2. For j = 1, . . ., p sample γ j from a bernoulli distribution with probability (1 − θ wj ) (1−γj ) (17) 4. Letting η 2 j = 1/τ 2 j , sample from the full conditional: where the inverse-Gaussian density is given by 5. Sample σ 2 from the full conditional B Finding empirical Bayes estimate of λ 0 As seen in Casella (2001) the Monte Carlo EM algorithm treats the parameters as missing data and iterates between finding the "complete data" log likelihood where the parameter values are filled in with their expectations from the gibbs sampler, and maximizing this expression as a function of the hyperparameter values.After removing the terms not involving λ 0 the "complete data" log likelihood can be written as And now, adopting the same notation that is typically used with the EM algorithm and was seen in Park & Casella (2008) we can take the expectation of this expression, where expectations are taken as the averages over the previous iterations posterior samples And now we can find the maximum of this expression with respect to λ 0 and we find that the maximum occurs at

C Posterior mode estimation
In the manuscript we briefly described the details involved with estimating the posterior mode of our model.
Here we provide a more detailed description of the coordinate ascent algorithm.These ideas were first introduced in Ročková & George (2016), and we extend them here to differing weights, w j .We will defer many of the implementation details to Ročková & George (2016), though we will highlight many of them here to illustrate how the weights, w j can reduce shrinkage for potential confounders.

Assuming θ is fixed
Adopting similar notation to Ročková & George (2016) we can formulate the problem in the standard penalized regression framework Notice that in the penalty we have conditioned on a value of θ.To illustrate the results and the implementation of the approach we will first illustrate them for a fixed value of θ.Obviously, we want to let the data inform θ, which will then provide us with multiplicity control Scott et al. (2010).As shown in Ročková & George (2016) the extension to random values of θ is straightforward and the same results will apply with a minor adjustment.The penalty can be written as where An important feature of the penalty can be seen by looking at it's derivative, which gives insight into the level of shrinkage induced for β j . ∂pen(β|θ) where The intuition for λ * θ (β j ) is the larger it gets, the more shrinkage that is induced for β j .If p * θ (β j ) is large, then it is likely that covariate j is important and it will get shrunk by a smaller amount.These ideas can be seen by looking at the eventual estimation strategy for estimating β, which we detail now.We can define where there is a subscript j, because ρ(t|θ) depends on the weight, w j .Then the global mode, β satisfies the following where . This highlights a couple very important features both for implementation of the approach as well as understanding the role of the prior distribution in penalizing covariates when the goal is confounder selection.This is very useful for implementation, as we can adopt a coordinate ascent algorithm that iterates through each of the covariates and updates them according to (30).We also now see the role that p * θ (β j ), and therefore w j plays in both the shrinkage of covariates and the decision to force them to zero or not.As w j gets smaller, p * θ (β j ) gets larger and then ∆ j gets smaller making it less likely that we estimate β j to be zero.For those coefficients that are not thresholded to zero, there still exists soft thresholding or shrinkage and this is dictated by λ * θ (β j ).The smaller we set w j , the smaller λ * θ (β j ) becomes leading to less shrinkage.This shows that if we can set w j to be smaller for potential confounders then we have both increased their probability of making it into the final model (i.e β j = 0), and we have reduced the shrinkage affecting potential confounders.

Assuming θ is random
Now that we have illustrated the approach for a fixed value of θ, we can extend it to random θ, by letting the data inform the level of sparsity.Ročková & George (2016) showed a crucial result when attempting to implement a random value for θ.Many of the previous results were allowed because the penalty on β was separable, i.e the penalties for each β j were independent.Now they are not independent as the penalty marginalizes over θ, which is estimated from all of the regression coefficients and the penalties are no longer marginally independent.Ročková & George (2016) showed, however, that all of the previous results hold in the situation where θ is estimated if we replace θ in all the expressions with θ j = E(θ| β \j ) where β \j is the vector of estimated regression coefficients with the j th element removed.When p is large then E(θ| β \j ) ≈ E(θ| β), and therefore we can let the value of θ remain constant for each j.This conditional expectation is given by E(θ| β) = This expression does not have a closed form, but is fairly easy to calculate numerically.Because we can simply take all the implementation steps from above and replace θ with E(θ| β), we can update our coordinate ascent algorithm accordingly.Now after each time we update β j for each j, we must also update θ according to (31).This can be computationally expensive, particularly in very high-dimensional settings.
In reality E(θ| β), will not change much after each updated β j , especially in sparse settings when most of the parameters are not included in the model, i.e β j = 0.In practice, one can update θ in the coordinate ascent algorithm after every m th covariate is updated, where m is some large value that is also smaller than p.

Selection of λ 0
As with fully Bayesian inference, our coordinate ascent algorithm relies on a well chosen value of λ 0 , and there are two distinct solutions within this paradigm.The first of which, cross validation, is commonly used for penalized likelihood procedures to find tuning parameter values.Cross validation is used to find a value of the tuning parameter that aims to optimize the model's predictive ability.We are interested in confounding adjustment, not prediction, though cross validation still might ultimately be useful for finding an appropriate λ 0 estimate.Another approach, as described in Ročková & George (2016), involves estimating the model for an increasing sequence of λ 0 values and then finding the value of λ 0 at which the estimates stabilize.We can evaluate β for λ 0 ∈ λ (1) 0 , λ (2) 0 , . . ., which is a sequence of penalties that more aggressively penalize coefficients.As shown in Ročková & George (2016) the estimates β eventually converge as we increase λ 0 and we can take the estimates of β at this stabilization point as the final posterior mode estimate.

D Theoretical considerations
Ročková & George (2016) provide a number of asymptotic results justifying the use of the spike and slab lasso prior in high dimensional settings.In particular, they derived asymptotic rates of convergence for β, such as the fact that |X( β − β)| 2 < C 1 q(1 + C 2 )log p, where C 1 , C 2 are positive constants and q = ||β|| 0 .It is a straightforward application of their proofs to obtain identical results under our prior, however, our interest is in estimating β t , not in estimating β.One can use Equation 30 to see that We can then apply the triangle inequalty and Hölder's inequality to see that where in the last step we used that 1 n |T | 1 → 0 and we made the assumption that T has bounded second moments, which is very reasonable, particularly when T is binary where it is certain to hold.This shows that our estimate of the treatment effect converges, but at a slower rate than n 1/2 as we see the usual log p penalty found in high-dimensional models.Interestingly, Belloni et al. (2014); Farrell (2015); Belloni et al. (2017); Athey et al. (2016) showed that n 1/2 −consistent estimation can be achieved when estimating treatment effects in high dimensional scenarios.It is possible we could achieve a similar result where we fit a post-selection estimator conditional on the covariates identified by our model as significant.This result would depend heavily on the choice of the weights, w j , as the weights would have to decrease with the sample size to eliminate non-negligible omitted variable bias as described in Belloni et al. (2014)

Figure 3 :
Figure3: The left panel shows a histogram of the ratios of standard errors for the EM-SSL approach and the double post selection approach for the analysis of volatile compounds in the full data.The right panel shows the corresponding histogram for the analysis of subjects over the age of 40.The dashed vertical line is the mean of the ratios that make up the histogram.

Table 1 :
Results for estimating the average treatment effect under the simulation scenario of Section 4.1 type% Bias MSE 95% interval coverage E( SE( β t ))/SD( β t )

Table 2 :
). Results for estimating the average treatment effect under the simulation scenario of Section 4.2 type% Bias MSE 95% interval coverage E( SE( β t ))/SD( β t ) (Wild, 2005;Patel et al. , 2010;Louis et al. , 2012;Patel et al. , 2012;Patel & Ioannidis, 2014)idis, 2014)has centered on studying the effects of a vast set of exposures on disease.These analyses, termed environmental wide association studies (EWAS), examine environmental factors and aim to improve understanding of the long term effects of different exposures and toxins that humans are invariably exposed to on a daily basis.The National Health and Nutrition Examination Survey (NHANES), is a cross-sectional data source made publicly available by the Centers for Disease Control and Prevention (CDC

Table 6 :
. Further research on this topic is merited.type % Bias MSE 95% interval coverage E( SE( β t ))/SD( β t ) Results for estimating the average treatment effect under the third additional simulation scenario.