A variational Bayes approach to variable selection

We develop methodology and theory for a mean field variational Bayes approximation to a linear model with a spike and slab prior on the regression coefficients. In particular, we show how our method forces a subset of regression coefficients to be numerically indistinguishable from zero; under mild regularity conditions, estimators based on our method consistently estimate the model parameters with easily obtainable and appropriately sized standard error estimates; and, most strikingly, our method selects the true model at an exponential rate in the sample size. We also develop a practical method for simultaneously choosing reasonable initial parameter values and tuning the main tuning parameter of our algorithms which is both computationally efficient and empirically performs as well as or better than some popular variable selection approaches. Our method is also faster and highly accurate when compared to MCMC.


Introduction
Variable selection is one of the key problems in statistics, as evidenced by a literature too extensive to survey here beyond a small subset. Major classes of model selection approaches include criteria based procedures (Akaike, 1973; Mallows, 1973; Schwarz, 1978), penalized regression (Tibshirani, 1996; Fan and Li, 2001; Fan and Peng, 2004) and Bayesian modeling approaches (Bottolo and Richardson, 2010; Hans et al., 2007; Li and Zhang, 2010; Stingo and Vannucci, 2011). Despite the amount of research in the area there is as yet no consensus on how to perform model selection, even in the simplest case of linear regression with more observations than predictors. One of the key forces driving this research is model selection for large scale problems where the number of candidate variables is large or where the model is nonstandard in some way. Good overviews of the latest approaches to model selection are Johnstone and Titterington (2009); Fan and Lv (2010); Müller and Welsh (2010); Bühlmann and van de Geer (2011); Johnson and Rossell (2012).
Bayesian model selection approaches have the advantage of being able to easily incorporate many sources of variation simultaneously, including prior knowledge. However, except for special cases such as linear models with very carefully chosen priors (see for example Liang et al., 2008; or Maruyama and George, 2011), Bayesian inference via Markov chain Monte Carlo (MCMC) for moderate to large scale problems is inefficient. For this reason an enormous amount of effort has been put into developing MCMC and similar stochastic search based methods which can be used to explore the model space efficiently (Nott and Kohn, 2005; Hans et al., 2007; O'Hara and Sillanpää, 2009; Bottolo and Richardson, 2010; Li and Zhang, 2010; Stingo and Vannucci, 2011).
However, for sufficiently large scale problems even these approaches can be too slow to be used in practice. Further drawbacks of these methods include sensitivity to prior choices and the lack of diagnostics to determine whether the MCMC chain has converged or has explored a sufficient proportion of models in the model space.
Mean field variational Bayes (VB) is an efficient but approximate alternative to MCMC for Bayesian inference (Bishop, 2006; Ormerod and Wand, 2010). While a fair comparison between MCMC and VB is difficult (for reasons discussed in Section 5.1), VB is typically a much faster, deterministic alternative to stochastic search algorithms. However, unlike MCMC, methods based on VB cannot achieve arbitrary accuracy. Nevertheless, VB has been shown to be an effective approach to several practical problems including document retrieval (Jordan, 2004), functional magnetic resonance imaging (Flandin and Penny, 2007; Nathoo et al., 2014), and cluster analysis for gene-expression data (Teschendorff et al., 2005). Furthermore, the speed of VB in such settings gives it an advantage for exploratory data analysis where many models are typically fit to gain some understanding of the data.
A criticism often leveled at VB methods is that they can fail to provide reliable estimates of posterior variances. Such criticism can be made on empirical grounds, e.g., Wand et al. (2011); Carbonetto and Stephens (2011), or theoretical grounds, e.g., Wang and Titterington (2006); Rue et al. (2009). However, as previously shown in You et al. (2014), such criticism does not hold for VB methods in general, at least in an asymptotic sense. Furthermore, variational approximation has been shown to be useful in frequentist settings (Hall et al., 2011a,b).
In this paper we consider a spike and slab prior on the regression coefficients (see Mitchell and Beauchamp, 1988; George and McCulloch, 1993) in order to encourage sparse estimators. This entails using VB to approximate the posterior distribution of indicator variables which select the variables to be included in the model. We consider this to be amongst the simplest modifications of the standard Bayesian linear model (using conjugate priors) which automatically selects variables to be included in the model. Our contributions are: (i) we show how our VB method induces sparsity on the regression coefficients; (ii) we show, under mild assumptions, that our estimators of the model parameters are consistent with easily obtainable and appropriately sized standard error estimates; (iii) under these same assumptions our VB method selects the true model at an exponential rate in n; and (iv) we develop a practical method for simultaneously choosing reasonable initial parameter values and tuning the main tuning parameter of our algorithms.
The paper is organized as follows. Section 2 considers model selection for a linear model using a spike and slab prior on the regression coefficients and provides a motivating example from real data. Section 3 summarizes our main results, which are proved in Appendix A. Section 4 discusses initialization and hyperparameter selection. Numerical examples are presented in Section 5 and illustrate the good empirical properties of our methods. We discuss our results and conclude in Section 6.

Bayesian linear model selection
Suppose that we have observed data (y_i, x_i), 1 ≤ i ≤ n, and hypothesize that

$$y_i \overset{\text{ind.}}{\sim} N(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2), \quad 1 \le i \le n,$$

for some coefficients β and noise variance σ², where x_i is a p-vector of predictors. When a spike and slab prior and a conjugate prior are employed on β and σ², respectively, a Bayesian version of the linear regression model can be written as

$$\mathbf{y} \,|\, \boldsymbol{\beta}, \sigma^2 \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}), \qquad \beta_j \overset{\text{ind.}}{\sim} \rho\, N(0, \sigma_\beta^2) + (1 - \rho)\, \delta_0, \qquad \sigma^2 \sim \text{Inverse-Gamma}(A, B), \tag{1}$$

where X is the n × p design matrix whose ith row is x_i^T (possibly including an intercept), β = (β_1, . . ., β_p)^T is a p-vector of regression coefficients, Inverse-Gamma(A, B) is the inverse gamma distribution with shape parameter A and scale parameter B, and δ_0 is the degenerate distribution with point mass at 0. The parameters σ²_β, A and B are fixed prior hyperparameters, and ρ ∈ (0, 1) is a further hyperparameter which controls sparsity. Contrasting with Ročková and George (2014), we use ρ rather than σ²_β as the tuning parameter controlling sparsity. The selection of ρ (or σ²_β for that matter) is particularly important and is a point which we will discuss later.
Replacing β_j with γ_j β_j for 1 ≤ j ≤ p we can recast the model as

$$\mathbf{y} \,|\, \boldsymbol{\beta}, \boldsymbol{\gamma}, \sigma^2 \sim N(\mathbf{X}\boldsymbol{\Gamma}\boldsymbol{\beta}, \sigma^2 \mathbf{I}), \tag{2}$$

where Γ = diag(γ_1, . . ., γ_p), which is easier to deal with using VB. In the signal processing literature this is sometimes called the Bernoulli-Gaussian (Soussen et al., 2011) or binary mask model and is closely related to ℓ_0 regularization (see Murphy, 2012, Section 13.2.2). Wand and Ormerod (2011) also considered what they call the Laplace-zero model, where the normally distributed slab in the spike and slab is replaced with a Laplace distribution. Using their naming convention this model might also be called a normal-zero or Gaussian-zero model.
We refer the interested reader to these papers rather than summarize the approach again here.
Using a variational Bayes approximation of p(β, σ², γ | y) by

$$q(\boldsymbol{\beta}, \sigma^2, \boldsymbol{\gamma}) = q(\boldsymbol{\beta})\, q(\sigma^2) \prod_{j=1}^{p} q(\gamma_j),$$

the optimal q-densities are of the form: q*(β) is a N(µ, Σ) density, q*(σ²) is an Inverse-Gamma(A + n/2, s) density and q*(γ_j) is a Bernoulli(w_j) density for j = 1, . . ., p, where a necessary (but not sufficient) condition for optimality is that the system of equations (3)-(7) holds, where 1 ≤ j ≤ p, λ = logit(ρ), w = (w_1, . . ., w_p)^T, W = diag(w), Ω = ww^T + W(I − W), and D is a diagonal matrix. Algorithm 1 below describes an iterative process for finding parameters satisfying this system of equations via a coordinate ascent scheme. Note that for a general matrix A we write A_j for the jth column of A and A_{−j} for A with the jth column removed; later we will write A_{i,j} for the component in the ith row and jth column of A and A_{i,−j} for the ith row of A with the jth component removed. The w_j's can be interpreted as an approximation to the posterior probability of γ_j = 1 given y, that is, p(γ_j = 1 | y), and can be used for model selection purposes: the approximate posterior probability of including the jth covariate is w_j, and if, say, w_j > 0.5 we include the jth covariate in the model.
The VB approach gives rise to a lower bound on log p(y; ρ), where the marginalization over γ is interpreted as a combinatorial sum over all possible binary configurations of γ. At the bottom of Algorithm 1 the lower bound on log p(y; ρ) takes a simplified form. Let log p^(t)(y; ρ) denote the value of the lower bound at iteration t. Algorithm 1 is terminated when the increase of the lower bound is negligible, that is, when

$$\log \underline{p}^{(t+1)}(\mathbf{y}; \rho) - \log \underline{p}^{(t)}(\mathbf{y}; \rho) < \epsilon, \tag{8}$$

where ε > 0 is a small number. In our implementation we chose ε = 10^{-6}. Note that Algorithm 1 is only guaranteed to converge to a local maximizer of this lower bound. For the n < p case Algorithm 1 is efficiently implemented by calculating ‖y‖², X^T y and X^T X only once outside the main loop of the algorithm; each iteration of the algorithm can then be implemented cheaply.

Algorithm 1: Iterative scheme to obtain the optimal q*(β, σ², γ) for our model; iterate the updates (3)-(7) until the increase of log p(y; ρ) is negligible.
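To make the scheme concrete, the following is a minimal R sketch of the coordinate ascent updates, assuming the standard mean field forms for this conjugate model. The update expressions below are illustrative reconstructions of (3)-(7), not a verbatim transcription of Algorithm 1, and the function name is ours.

```r
# Sketch of coordinate ascent VB for the Bernoulli-Gaussian model (2),
# under assumed standard mean field updates for mu, Sigma, tau and w.
vb_spike_slab <- function(y, X, lambda, sigma2.beta = 10,
                          A = 0.01, B = 0.01, tau = 1000,
                          eps = 1e-6, maxit = 100) {
  n <- nrow(X); p <- ncol(X)
  XtX <- crossprod(X); Xty <- crossprod(X, y); yty <- sum(y^2)
  w <- rep(1, p)
  for (t in seq_len(maxit)) {
    Omega <- tcrossprod(w) + diag(w * (1 - w))
    Sigma <- solve(tau * (XtX * Omega) + diag(1 / sigma2.beta, p))
    mu <- tau * Sigma %*% (w * Xty)
    s <- B + 0.5 * (yty - 2 * sum(w * mu * Xty) +
           sum((XtX * Omega) * (tcrossprod(mu) + Sigma)))
    tau <- (A + n / 2) / s
    for (j in seq_len(p)) {
      # logit of the approximate inclusion probability for variable j
      eta_j <- lambda -
        0.5 * tau * (mu[j]^2 + Sigma[j, j]) * XtX[j, j] +
        tau * mu[j] * (Xty[j] - sum(XtX[j, -j] * (w[-j] * mu[-j])))
      w[j] <- 1 / (1 + exp(-eta_j))
    }
    # in practice the lower bound log p(y; rho) is monitored here and
    # the loop stops once its increase falls below eps, as in (8)
  }
  list(mu = mu, Sigma = Sigma, tau = tau, w = w)
}
```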
To illustrate the effect of ρ on the sparsity of the VB method we consider the prostate cancer dataset originating from a study by Stamey et al. (1989). The data consist of n = 97 samples with variables lcavol (log cancer volume), lweight (log prostate weight), age, lbph (log of the amount of benign prostatic hyperplasia), svi (seminal vesicle invasion), lcp (log of capsular penetration), gleason (Gleason score), pgg45 (percent of Gleason scores 4 or 5), and lpsa (log of prostate specific antigen). Friedman et al. (2001) illustrate the effect of tuning parameter selection for ridge regression and the Lasso for a linear response model using lpsa as the response variable and the remaining variables as predictors. We also consider the regularization paths produced by the SCAD penalty as implemented by the R package ncvreg (Breheny and Huang, 2011). These regularization paths are illustrated in Figure 1, where for our VB method we plot the values of µ (which serve as point estimates for β) as a function of λ.
From Figure 1 we make several observations about the VB estimates: (A) the estimated components of β appear to be stepwise functions of λ, with components being either zero or constant over various ranges of λ; and (B) large negative values of λ tend to give rise to simpler models while positive values tend to give rise to more complex models.
Note that (A) holds only approximately but illustrates empirically the model selection properties of estimators obtained through Algorithm 1. This contrasts with the Lasso and other penalized regression methods, where the analogous penalty parameter enforces shrinkage and, possibly, bias in the estimates of the model coefficients. Observation (B) highlights that care is required when selecting ρ (or equivalently λ).

Theory
Instead of Algorithm 1, we analyze the properties of a slightly adjusted version, Algorithm 2 (consisting of Algorithms 2a and 2b), which applies the updates in a different order. Note that Algorithm 2 is less computationally efficient than Algorithm 1 due to the extra convergence step in Algorithm 2b. Also note that Algorithm 2 is technically not a VB algorithm.
The reason involves the way the updates for η = (η_1, . . ., η_p) are performed. Another important difference between Algorithm 1 and Algorithm 2 is that in Algorithm 2 we have also chosen w^(1) = 1 to ensure that w^(1) is initialized at a correct model. By a correct model we mean a model which includes all variables with non-zero coefficient values in the underlying true model. Let β_0 be the true value of β. A correct model γ is a p-vector with elements such that γ_j ∈ {0, 1} if β_{0j} = 0 and γ_j = 1 if β_{0j} ≠ 0.
Hence, the true model γ_0 and the full model γ = 1 are both correct models.
Although Algorithm 2 is less computationally efficient and not a VB algorithm, it simplifies the analysis of the estimators. The properties of µ, Σ, τ and {w_j}_{1≤j≤p} when the system of equations (3)-(7) holds simultaneously are difficult to analyze. Algorithm 2 allows us to decouple equations (3), (4) and (5) from equations (6) and (7) over the iterations of the algorithm. Based on Algorithm 2, our analysis of the theoretical properties assumes that (3), (4) and (5) hold exactly in each iteration of Algorithm 2a (namely at the completion of Algorithm 2b), and that both (6) and (7) hold exactly for 1 ≤ j ≤ p at the completion of Algorithm 2a.

Algorithm 2a: Iterative scheme to obtain the optimal q*(θ) for our model; iterate until the increase of log p(y; ρ) is negligible.
In Appendix A we will show, under certain assumptions, the following two main results. The first result concerns the behavior of the VB estimates when particular w_j's are small.
Algorithm 2b: iterate until the increase of log p(y; ρ) is negligible.

Main Result 1 (proof in Appendix A.1). Suppose that (3), (4) and (5) hold. Then for w_j ≪ 1, 1 ≤ j, k ≤ p, and observed y and X, the update for w_j^(t+1) in Algorithm 2a with small w_j^(t) is approximately equal to exp(λ − τ^(t) ‖X_j‖² σ²_β / 2). Thus, when σ²_β is sufficiently large, w_j^(t+1) is, for numerical purposes, identically 0. This explains why Algorithms 1 and 2 provide sparse estimates of w and β. Furthermore, all successive values of w_j^(t) remain either small or numerically zero, and the corresponding variables may be removed safely from the algorithm, reducing its computational cost.
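The underflow effect is easy to illustrate numerically; the magnitudes below are hypothetical but representative of a standardized design.

```r
# With tau = 1, ||X_j||^2 = 1000 (e.g., a standardized column with
# n = 1000) and sigma2.beta = 10, the approximate update for a small
# w_j underflows double precision and is exactly 0 in floating point:
lambda <- 0; tau <- 1; normXj2 <- 1000; sigma2.beta <- 10
exp(lambda - tau * normXj2 * sigma2.beta / 2)  # exp(-5000) evaluates to 0
```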
In order to establish various asymptotic properties in Main Result 2, we use the following assumptions (which are similar to those used in You et al., 2014) and treat y_i and x_i as random quantities (only) in Main Result 2 and the proof of Main Result 2 in Appendix A: (A1) y_i = x_i^T β_0 + ε_i with ε_i independent N(0, σ²_0), 0 < σ²_0 < ∞, where β_0 and σ²_0 are the true values of β and σ², with β_0 being element-wise finite; (A2) for 1 ≤ i ≤ n the random variables x_i ∈ R^p are independent and identically distributed; (A3) rank(X) = p; and (A4) for 1 ≤ i ≤ n the random variables x_i and ε_i are independent.
We view these as mild regularity conditions on the y i 's, ε i 's and the distribution of the covariates.
Assumptions (A5) and (A6) will simplify later arguments, whereas Assumption (A7) is necessary for our method to identify the true model, which we will denote by γ_0.
We now define some notation to simplify later proofs. For an indicator vector γ, the square matrix W_γ (W_{−γ}) is the principal submatrix of W obtained by keeping (removing) the rows and columns specified by γ. The matrix D_γ (D_{−γ}) is defined in the same manner. The matrix X_γ (X_{−γ}) is the submatrix of X obtained by keeping (removing) the columns specified by γ. For example, if X has 4 columns and γ = (1, 0, 0, 1)^T then X_γ is constructed from the first and fourth columns of X, and W_γ is the submatrix of W consisting of the first and fourth rows and the first and fourth columns of W. We use similar notation when indexing through a vector of indices v; for example, if v = (1, 4) then X_v is constructed from the first and fourth columns of X and W_v is the submatrix of W consisting of the first and fourth rows and columns of W. We rely on context to specify which notation is used. We denote by O_p^v(·) a vector where each entry is O_p(·), by O_p^m(·) a matrix where each entry is O_p(·), and by O_p^d(·) a diagonal matrix whose diagonal elements are O_p(·). We use similar notation for o_p(·) matrices and vectors.
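In code this notation corresponds to simple row and column subsetting; a small illustrative helper (the function name is ours):

```r
# Illustration of the gamma-subscript notation: X_gamma keeps the
# columns flagged by gamma, and W_gamma keeps the matching principal
# submatrix of W.
sub_gamma <- function(X, W, gamma) {
  keep <- which(gamma == 1)
  list(X_gamma = X[, keep, drop = FALSE],
       W_gamma = W[keep, keep, drop = FALSE])
}
# e.g., gamma = c(1, 0, 0, 1) keeps the first and fourth columns of X.
```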
Main Result 2 gives expansions for µ^(t) and Σ^(t) and, for 1 ≤ j ≤ p, for the w_j^(t) over the iterations of Algorithm 2: expansions hold at t = 1; if, in addition to the aforementioned assumptions, Assumption (A7) holds, then corresponding expansions hold for µ^(2) and Σ^(2) and, for 1 ≤ j ≤ p, for the w_j at t = 2, and likewise for t > 2. Remark: This result suggests, under Assumptions (A1)-(A7) and in light of Lemma 1, that the vector w^(t) in Algorithm 2 approaches γ_0 at an exponential rate in n.

Hyperparameter selection and initialization
We will now briefly discuss selecting the prior hyperparameters. We have used A = B = 0.01, σ²_β = 10 and initially set τ = 1000. This leaves us to choose the parameter ρ = expit(λ) and the initial values for w. The theory in Sections 3 and 4 suggests that if we choose w = 1 and, say, λ ∝ −√n then, provided with enough data, Algorithm 1 will select the correct model.
However, in practice this is not an effective strategy in general since Algorithm 1 may converge to a local maximizer of the lower bound (which means w should be carefully selected), all values of λ satisfy Assumption (A7) when n is fixed, and we do not know how much data is sufficient for our asymptotic results to guide the choice of λ.
Ročková and George (2014) used a deterministic annealing variant of the EM algorithm proposed by Ueda and Nakano (1998) to avoid local maxima problems, which proved to be successful in that context. We instead employ a simpler stepwise procedure which initially "adds" the variable j (by setting w_j to 1 for some j) which maximizes the lower bound log p(y; ρ) with ρ = expit(−0.5√n). We then:

(I) For fixed w, select the ρ_j = expit(λ_j) which maximizes the lower bound log p(y; ρ_j), where λ_j ranges over an equally spaced grid of 50 points between −15 and 5.
(II) Next, for each 1 ≤ j ≤ p, calculate the lower bound log p(y; ρ) when w_j is set to each of 0 and 1. The value w_j is set to the value which maximizes log p(y; ρ) if this maximum exceeds the current largest log p(y; ρ).

(III) Return to (I).
This procedure is more specifically described in Algorithm 3. Note that in Algorithm 3 we use the notation w_j^(k) to denote the vector w with the jth element set to k.
Algorithm 3: Iterative scheme to tune ρ and select the initial w for Algorithm 1; if L does not improve, return the output of Algorithm 1 with inputs y, X, σ²_β, A, B, τ_0, ρ, w.
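A minimal R sketch of this stepwise tuning scheme is given below; fit_lb is a hypothetical wrapper that runs Algorithm 1 for a given λ and initial w and returns the attained lower bound log p(y; ρ).

```r
# Greedy tuning of rho = expit(lambda) and the initial w, following
# steps (I)-(III); fit_lb(y, X, lambda, w) is assumed to return the
# lower bound attained by Algorithm 1.
tune_rho_w <- function(y, X, fit_lb, n_grid = 50) {
  p <- ncol(X); n <- nrow(X)
  lambda <- -0.5 * sqrt(n)
  w <- rep(0, p)
  # initial "add": the single variable maximizing the lower bound
  lbs <- sapply(seq_len(p), function(j) fit_lb(y, X, lambda, replace(w, j, 1)))
  w[which.max(lbs)] <- 1
  best <- max(lbs)
  repeat {
    improved <- FALSE
    # (I) grid search over lambda with w fixed
    grid <- seq(-15, 5, length.out = n_grid)
    lbs <- sapply(grid, function(l) fit_lb(y, X, l, w))
    if (max(lbs) > best) {
      best <- max(lbs); lambda <- grid[which.max(lbs)]; improved <- TRUE
    }
    # (II) set each w_j to whichever of {0, 1} improves the bound
    for (j in seq_len(p)) {
      for (k in 0:1) {
        lb <- fit_lb(y, X, lambda, replace(w, j, k))
        if (lb > best) { best <- lb; w[j] <- k; improved <- TRUE }
      }
    }
    if (!improved) break  # (III) otherwise return to (I)
  }
  list(lambda = lambda, w = w)
}
```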

Numerical examples
In the following three numerical examples we consider only simulated, but hopefully sufficiently realistic, data in order to reliably assess the empirical qualities of the different methods when the truth is known. We start by considering a data set with few explanatory variables before looking at higher dimensional situations: p = 41 and n = 80 in the second example, and p = 99 and n = 2118 in the third example.
We use the mean squared error (MSE) to measure the quality of prediction and the F_1-score to assess the quality of model selection. The F_1-score is the harmonic mean of precision and recall,

$$F_1 = \frac{2\,\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN},$$

with TP, FP and FN being the number of true positives, false positives and false negatives respectively. Note that F_1 takes values between 0 and 1, with higher values preferred; this measure favors methods which select neither none nor all of the variables. We compare the performance of our VB method against the Lasso, SCAD and MCP penalized regression methods, using 10-fold cross-validation to choose the tuning parameter, as implemented by the R package ncvreg (Breheny and Huang, 2011). Our methods were implemented in R and all code was run on the first author's laptop computer (64 bit Windows 8, Intel i7-4930MX central processing unit at 3GHz with 32GB of random access memory).
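Both measures are straightforward to compute; in this sketch the MSE is taken to be the average squared prediction error of X(β̂ − β₀), which is an assumption since the MSE display above is only implicit.

```r
# Performance measures: gamma_hat and gamma_0 are 0/1 inclusion vectors.
f1_score <- function(gamma_hat, gamma_0) {
  TP <- sum(gamma_hat == 1 & gamma_0 == 1)
  FP <- sum(gamma_hat == 1 & gamma_0 == 0)
  FN <- sum(gamma_hat == 0 & gamma_0 == 1)
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  2 * precision * recall / (precision + recall)
}
# Assumed form of the prediction MSE on the design X.
mse <- function(X, beta_hat, beta_0) mean((X %*% (beta_hat - beta_0))^2)
```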

Comparison with MCMC
Comparisons between VB and MCMC are fraught with difficulty. In terms of computational cost per iteration VB has a similar cost to an MCMC scheme based on Gibbs sampling; the latter method has a slightly higher cost from drawing samples from a set of full conditional distributions rather than calculating approximations of them. The full conditionals corresponding to model (2) are given by (16), from which Gibbs sampling can be easily implemented.
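For concreteness, a sketch of such a Gibbs sampler is given below, using the standard full conditionals of the conjugate binary mask model (2). This is our reconstruction of (16), not a verbatim transcription, and the function name is ours.

```r
# Gibbs sampler for model (2): beta | rest is multivariate normal,
# sigma^2 | rest is inverse gamma, and each gamma_j | rest is Bernoulli.
gibbs_spike_slab <- function(y, X, rho, sigma2.beta = 10, A = 0.01,
                             B = 0.01, n_iter = 1e5, burn = 1e3) {
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p); gam <- rep(1, p); sigma2 <- 1
  draws <- matrix(NA_real_, n_iter, p)
  for (t in seq_len(burn + n_iter)) {
    Xg <- sweep(X, 2, gam, `*`)                 # X Gamma
    Q <- crossprod(Xg) / sigma2 + diag(1 / sigma2.beta, p)
    R <- chol(Q)
    m <- backsolve(R, forwardsolve(t(R), crossprod(Xg, y) / sigma2))
    beta <- as.numeric(m + backsolve(R, rnorm(p)))
    r <- y - Xg %*% beta
    sigma2 <- 1 / rgamma(1, A + n / 2, B + sum(r^2) / 2)
    for (j in seq_len(p)) {
      r0 <- y - X[, -j, drop = FALSE] %*% (gam[-j] * beta[-j])
      ss1 <- sum((r0 - X[, j] * beta[j])^2)     # residual SS if gamma_j = 1
      ss0 <- sum(r0^2)                          # residual SS if gamma_j = 0
      odds <- rho / (1 - rho) * exp(-(ss1 - ss0) / (2 * sigma2))
      gam[j] <- rbinom(1, 1, odds / (1 + odds))
    }
    if (t > burn) draws[t - burn, ] <- gam * beta
  }
  draws
}
```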
Despite the similarity between Algorithm 1 and (16), a fair comparison of these methods is difficult since the choices of when each method is stopped and which statistics are used to compare their outputs can unduly favor one method or the other. This MCMC scheme is appropriate for determining the quality of the VB method for performing Bayesian inference under model (2). However, if we wanted to compare model selection strategies, another MCMC scheme might be more appropriate for comparison. Here we focus on comparing the quality of the VB method via Algorithm 1 with its Gibbs sampling counterpart (16).
Firstly, comparison is hampered by the difficulty of determining whether an MCMC scheme has converged to its stationary distribution or, in the model selection context, whether the MCMC scheme has explored a sufficient portion of the model space. Furthermore, the number of samples required to make accurate inferences may depend on the data at hand and on the choice of which inferences are to be made. For these reasons both an overly large number of burn-in samples and an overly large total number of samples are commonly chosen. However, by making the number of burn-in samples sufficiently large, MCMC methods can be made arbitrarily slower than VB.
Similarly, convergence tolerances for VB trade accuracy against speed. We have chosen ε in (8) to be 10^{-6}; larger values of ε result in cruder approximations and smaller values of ε are usually wasteful. Since each completion of Algorithm 1 takes very little time we are able to tune the parameter ρ via Algorithm 3. In comparison, MCMC schemes can be sensitive to the choice of hyperparameter values and prohibitively time consuming to tune in practice.

With the above in mind we consider using (16) with identical hyperparameters and ρ selected via Algorithm 3. For each of the examples we used 10^5 MCMC samples for inference after discarding a burn-in of 10^3 samples. No thinning was applied. For the comparisons with MCMC, in addition to the F_1-score and MSE, we also compare the posterior density accuracy introduced in Faes et al. (2011), defined by

$$\text{accuracy} = 100 \left\{ 1 - \tfrac{1}{2} \int \left| q(\theta_j) - p(\theta_j \,|\, \mathbf{y}) \right| d\theta_j \right\} \%,$$

where θ_j is an arbitrary parameter, and the mean bias of the regression coefficient estimates. In our tables the MSE and BIAS results are reported on a negative log scale (where higher values are better) and bracketed values represent standard error estimates.
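In practice the accuracy integral can be approximated numerically, for example by comparing the VB marginal density with a kernel density estimate constructed from the MCMC draws; a rough sketch (function names ours):

```r
# L1 accuracy between a VB marginal density function q_dens and the
# posterior estimated from MCMC draws, evaluated on an equally spaced grid.
accuracy <- function(q_dens, mcmc_draws, grid) {
  kde <- density(mcmc_draws, from = min(grid), to = max(grid),
                 n = length(grid))
  l1 <- sum(abs(q_dens(grid) - kde$y)) * (grid[2] - grid[1])
  100 * (1 - 0.5 * l1)
}
# e.g., for beta_j under q*(beta) = N(mu, Sigma):
# accuracy(function(x) dnorm(x, mu[j], sqrt(Sigma[j, j])),
#          draws[, j], seq(-3, 3, length.out = 1001))
```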
For the first (solid waste) example, we see in the first panel of Figure 2 that VB selects the correct model in almost every simulation. We also see in the second panel that VB provides smaller prediction errors when compared to the Lasso, SCAD and MCP penalty methods. All methods perform almost identically in the dense case, where the data generating model is the full model. The mean times per simulation for our VB method and for the Lasso, SCAD and MCP penalized regression methods were 0.73, 0.15, 0.17 and 0.18 seconds respectively. The results of the comparisons between VB and MCMC based on 100 simulations are summarized in Table 1. Note that the MCMC approach took an average of 32.56 seconds per simulation setting. For each of the settings we see high agreement between VB and MCMC in terms of accuracy and model selection performance. For model sizes 3, 4 and 5 the prediction and bias measures are also very similar. However, for model sizes 1 and 2 the VB approach seems to have smaller prediction errors and less biased estimates of the regression coefficients.

Example 2: Diets simulation
We use the following example, modified from Garcia et al. (2013). Let m_1 and n be parameters of this simulation which are chosen to be integers. For this example we suppose that there are two groups of diets with n/2 subjects in each group. We generate m_1 + 1 explanatory variables as follows. First, we generate a binary diet indicator z where, for each subject i = 1, . . ., n, z_i indicates the diet group. The remaining variables are generated using u_{ik}, which are independent uniform(0, 1) random variables, v_1, . . ., v_{0.75 m_1}, which are independent uniform(0.25, 0.75) random variables, and v_{0.75 m_1 + 1}, . . ., v_{m_1}, which are identically zero.
We generate 100 independent data sets for each value of κ in the set {1, 2, 3, 4} and apply each of the variable selection procedures we consider. Note that larger values of κ in the range κ ∈ [1, 7] correspond to a larger signal to noise ratio. Garcia et al. (2013) considered the case where κ = 1 and n = 40. The results are summarized in the two panels of Figure 3.
In Figure 3 we see in the first panel that VB selects the correct model and, in the second panel, that it provides smaller prediction errors in almost every simulation, with the exception of κ = 4, corresponding to the smallest signal to noise scenario. The mean times per simulation for our VB method and for the Lasso, SCAD and MCP penalized regression methods were 6.8, 0.4, 0.3 and 0.2 seconds respectively.
The results of the comparisons between VB and MCMC based on 100 simulations are summarized in Table 2. Here we see similar results to Table 1, except that the posterior density accuracies are lower than for the solid waste example. In particular we note that the MSE and parameter biases are either similar to MCMC or better for some settings. Note that the MCMC approach took an average of 131.4 seconds per simulation setting.

Example 3: Communities and crime data
We use the Communities and Crime dataset obtained from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime). The data were collected as part of a study by Redmond and Baveja (2002) combining socio-economic data from the 1990 United States Census, law enforcement data from the 1990 United States Law Enforcement Management and Administrative Statistics survey, and crime data from the 1995 Federal Bureau of Investigation's Uniform Crime Reports. The raw data consist of 2215 samples of 147 variables, the first 5 of which we regard as non-predictive; the next 124 are regarded as potential covariates while the last 18 variables are regarded as potential response variables. Roughly 15% of the data is missing and we proceed with a complete case analysis. We first remove any potential covariates which contain missing values, leaving 101 covariates. We also remove the variables rentLowQ and medGrossRent since these variables appear to be nearly linear combinations of the remaining variables (the matrix X had two singular values of approximately 10^{-9} when these variables were included). We use the nonViolPerPop variable as the response and then remove any remaining samples where the response is missing. The remaining dataset consists of 2118 samples and 99 covariates. Finally, the response and covariates are standardized to have mean 0 and standard deviation 1.
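Assuming the raw UCI file has been read into a data frame raw with the documented column layout, the preprocessing amounts to a few lines; this is a sketch, with column positions and names following the description above.

```r
# Complete-case preprocessing of the Communities and Crime data.
covs <- raw[, 6:129]                       # 124 potential covariates
covs <- covs[, colSums(is.na(covs)) == 0]  # drop covariates with NAs
covs <- covs[, !(names(covs) %in% c("rentLowQ", "medGrossRent"))]
keep <- !is.na(raw$nonViolPerPop)          # drop samples missing the response
X <- scale(as.matrix(covs[keep, ]))        # standardize covariates
y <- as.numeric(scale(raw$nonViolPerPop[keep]))
```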
For this data we use the following procedure as the basis for simulations.
• Use the LARS algorithm to obtain the whole Lasso path, that is, the solution for β for all positive values of λ. The solution for β is a piecewise function of λ with a finite number of pieces, say J, which can be represented by the set {λ^(j), β^(j)}_{1≤j≤J}.
• For the jth element in this path:
  - Let X^(j) be the columns of X corresponding to the non-zero elements of β^(j).
For this data we use σ² = 0.1, the first J = 20 elements of the LARS path and S = 50; that is, S = 50 such datasets are simulated for each of the J = 20 elements. We use the R package lars (Hastie and Efron, 2013) in the above procedure. Results for the comparisons between VB, Lasso, SCAD and MCP are summarized in Figure 4, where we see again that VB has, except for a few model settings, the best model selection performance and smallest prediction error. The mean times per simulation for our VB method and for the Lasso, SCAD and MCP penalized regression methods were 45, 12, 8 and 6 seconds respectively.
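A sketch of this simulation design using the lars package is given below; the loop bounds follow the values above, and names such as y_star are ours.

```r
# Simulate S = 50 datasets from each of the first J = 20 Lasso path
# solutions, with noise variance sigma2 = 0.1.
library(lars)
fit <- lars(X, y, type = "lasso")
betas <- coef(fit)                        # path solutions, one row per step
sigma2 <- 0.1
for (j in seq_len(20)) {
  active <- which(betas[j, ] != 0)        # non-zero elements of beta^(j)
  mu_j <- X[, active, drop = FALSE] %*% betas[j, active]
  for (s in seq_len(50)) {
    y_star <- mu_j + rnorm(nrow(X), sd = sqrt(sigma2))
    # ... apply VB, Lasso, SCAD and MCP to (y_star, X) and record
    # F1-scores and MSEs ...
  }
}
```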

Conclusion
In this paper we have provided theory for a new approach which induces sparsity on the estimates of the regression coefficients in a Bayesian linear model. We have shown that these estimates are consistent, can be used to obtain valid standard errors, and that the true model can be found at an exponential rate in n. Our method performs well empirically compared to the penalized regression approaches on the numerical examples we considered and is both much faster than and highly accurate compared to MCMC.
A drawback of our theory is that we only consider the case where p < n. This might be mitigated to a certain extent by the use of screening procedures such as sure independence screening (Fan and Lv, 2008), which can be seen as searching for a correct model with high probability. Theoretical extensions include considering the case where both p and n diverge. Such theory would be important for understanding how the errors of our estimators behave as p grows relative to n.
A second important extension would be to analyze the effect of more elaborate shrinkage priors on the regression coefficients, e.g., where the normal "slab" in the spike and slab is replaced by the Laplace, horseshoe, negative-exponential-gamma or generalized double Pareto distributions (see for example Neville et al., 2014). Another theoretical extension is adapting the theory presented here to non-Gaussian responses. However, such methodological (as opposed to theoretical) extensions would be relatively straightforward, as would extensions which handle missing data or measurement error, highlighting the strength and flexibility of our approach.
Appendices

Proof of Result 1:
A matrix Ω is positive definite if and only if for all non-zero real vectors a = [a_1, . . ., a_p]^T the scalar a^T Ω a is strictly positive (Horn and Johnson, 2012, Section 7.1). Since by definition w_j > 0, 1 ≤ j ≤ p, we have

$$\mathbf{a}^T \boldsymbol{\Omega} \mathbf{a} = \Big( \sum_{j=1}^{p} a_j w_j \Big)^2 + \sum_{j=1}^{p} a_j^2 w_j (1 - w_j) > 0$$

for any non-zero vector a. Hence, the result is as stated. □

Let

$$(\mathbf{X}^T \mathbf{X}) \circ \boldsymbol{\Omega} = \mathbf{U}\, \text{diag}(\boldsymbol{\nu})\, \mathbf{U}^T \tag{17}$$

be the eigenvalue decomposition of (X^T X) ∘ Ω, where U is an orthonormal matrix and ν = [ν_1, . . ., ν_p]^T is the vector of eigenvalues of (X^T X) ∘ Ω.
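The Schur product theorem step used in the proof of Result 2 below is easy to check numerically; this is an illustrative check, not part of the proof.

```r
# The Hadamard product of the positive definite matrices X^T X
# (rank(X) = p) and Omega = w w^T + W(I - W) has strictly positive
# eigenvalues.
set.seed(1)
X <- matrix(rnorm(50 * 5), 50, 5)
w <- runif(5)
Omega <- tcrossprod(w) + diag(w * (1 - w))
min(eigen(crossprod(X) * Omega, symmetric = TRUE)$values) > 0  # TRUE
```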

Proof of Result 2:
Let the eigenvalue decomposition (17) hold. Since w_j ∈ (0, 1], 1 ≤ j ≤ p, the matrix Ω is positive definite by Result 1. By the Schur product theorem (Horn and Johnson, 2012, Theorem 7.5.2), the matrix (X^T X) ∘ Ω is then also positive definite. Hence, ν_i > 0, i = 1, . . ., p. Then, using properties of the orthonormal matrix U, the stated expression for dof(α, w) follows. Clearly, dof(α, w) is monotonically decreasing in α, dof(0, w) = p and dof(α, w) approaches zero only as α → ∞. □

The next lemma follows from Horn and Johnson (2012, Section 0.7.3):

Lemma 2. The inverse of a real symmetric matrix can be written as

$$\begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}^T & \mathbf{C} \end{bmatrix}^{-1} = \begin{bmatrix} \bar{\mathbf{A}} & -\bar{\mathbf{A}} \mathbf{B} \mathbf{C}^{-1} \\ -\mathbf{C}^{-1} \mathbf{B}^T \bar{\mathbf{A}} & \mathbf{C}^{-1} + \mathbf{C}^{-1} \mathbf{B}^T \bar{\mathbf{A}} \mathbf{B} \mathbf{C}^{-1} \end{bmatrix} \tag{18}$$

$$= \begin{bmatrix} \mathbf{A}^{-1} + \mathbf{A}^{-1} \mathbf{B} \bar{\mathbf{C}} \mathbf{B}^T \mathbf{A}^{-1} & -\mathbf{A}^{-1} \mathbf{B} \bar{\mathbf{C}} \\ -\bar{\mathbf{C}} \mathbf{B}^T \mathbf{A}^{-1} & \bar{\mathbf{C}} \end{bmatrix}, \tag{19}$$

where Ā = (A − B C^{-1} B^T)^{-1} and C̄ = (C − B^T A^{-1} B)^{-1}, provided all inverses in (18) and (19) exist.
Then, by equation (18) in Lemma 2, the quantity a^T [M + diag(b)]^{-1} a can be expanded in terms of the blocks of M + diag(b). Note that any principal submatrix of a positive definite matrix is positive definite (Horn and Johnson, 2012, Chapter 7.1.2). Hence, the quantity in (20) is strictly decreasing as b_1 increases. The result follows for b_j, 2 ≤ j ≤ p, after a relabeling argument. □

Result 3. Suppose that (3) and (5) hold. Then (21) holds.

Proof of Result 3: Substituting (3) into (5) we may rewrite τ as in (21).
The result then follows after rearranging.

□
Remark: Note from Result 3 that the numerator in (21) may become negative when 2A + n < p and σ²_β is sufficiently large. The practical implication of this is that Algorithm 1 can fail to converge in these situations. For these reasons, and to simplify our results, we only consider the case where n > p.
The following result bounds the values that τ can take and is useful because these bounds do not depend on µ, Σ or w.
Proof of Result 4: Using Result 2, and letting the eigenvalue decomposition (17) hold, we obtain an upper bound on τ, where the last line follows from Lemma 3. Combining this inequality with the fact from Result 2 that dof(τ^{-1} σ^{-2}_β, w) ≤ p, together with Result 3, obtains the lower bound on τ. □

A.1 Proof of Main Result 1

It is clear from the numerical example in Section 2 that sparsity in the vector µ is achieved (at least approximately). In order to understand how sparsity is achieved we need to understand how the quantities µ, Σ and η behave when the elements of the vector w are small. We define the n × n matrix P_j, for j ≠ k, 1 ≤ j, k ≤ p, a corresponding matrix indexed by j and k, and, for an indicator vector γ, a corresponding matrix indexed by γ.

Result 5 gives expansions for the elements of Σ, for 1 ≤ j ≤ p and for j ≠ k, 1 ≤ j, k ≤ p, when (3) holds; for µ when (3) and (4) hold; and for τ when (3), (4) and (5) hold.

Proof of Result 5: For a given indicator vector γ we can rewrite (3) accordingly.
Thus τ → σ^{-2}_0, so that τ is bounded almost surely between two constants. Finally, since almost sure convergence implies convergence in probability, the result holds. □

We will next derive some properties of correct models. Note that β_{0,−γ} = 0 by definition. In the following, we write j ∈ γ if γ_j = 1 and j ∉ γ if γ_j = 0. In the main result we assume that w^(t) is "close" to a correct model in probability (defined in Result 7). Under this assumption we prove, in the following order, that:

• µ^(t) is a consistent estimator of β;
• τ^(t) is a consistent estimator of σ^{-2}_0;
• Σ^(t) = cov(β_LS) + O_p^m(n^{-3/2}); and
• w^(t+1) is also "close" to the true model in probability.
The result is proved after applying the continuous mapping theorem, expanding, and dropping appropriate lower order terms in the above expression. Hence, using Result 9 and equation (34), we obtain the required bound. For T_7 we need to consider the behavior of w^(t).

Figure 1: Top right panel: An illustration of the final values of the components of µ for multiple runs of Algorithm 1 over a grid of λ = logit(ρ) values, where we have used w_j = 1, j = 1, . . ., p, τ = 1000, and hyperparameters selected as described in Section 4, on the prostate cancer dataset originating in Stamey et al. (1989). Remaining panels: the regularization paths for ridge, Lasso and SCAD penalized regression fits.


Figure 2: Summaries of the model selection and prediction accuracies of the VB, Lasso, SCAD and MCP methods for the solid waste example.

Figure 3: Summaries of the model selection and prediction accuracies of the VB, Lasso, SCAD and MCP methods for the diet simulation example.

Figure 4: Summaries of the model selection and prediction accuracies of the VB, Lasso, SCAD and MCP methods for the Communities and Crime example.

Lemma 3. Let M be a real positive definite symmetric p × p matrix, a = [a_1, . . ., a_p]^T be a real vector, and let the elements of the vector b = [b_1, . . ., b_p]^T be positive. Then the quantity a^T [M + diag(b)]^{-1} a is a strictly decreasing function of any element of b.

Proof of Lemma 3: Let the matrix M + diag(b) be partitioned as

$$\mathbf{M} + \text{diag}(\mathbf{b}) = \begin{bmatrix} M_{11} + b_1 & \mathbf{m}_{12}^T \\ \mathbf{m}_{12} & \mathbf{M}_{22} + \mathbf{B}_2 \end{bmatrix},$$

where m_{12} = [M_{12}, . . ., M_{1p}]^T and B_2 = diag(b_2, . . ., b_p).

Table 1: Performance measure comparisons between VB and MCMC based on 100 simulations for the solid waste example.

Table 2: Performance measure comparisons between VB and MCMC based on 100 simulations for the diet simulation example.

Table 3: Performance measure comparisons between VB and MCMC based on 30 simulations for the Communities and Crime example.