Selection of Tuning Parameters, Solution Paths and Standard Errors for Bayesian Lassos

Penalized regression methods such as the lasso and elastic net (EN) have become popular for simultaneous variable selection and coefficient estimation. Implementation of these methods requires selection of the penalty parameters. We propose an empirical Bayes (EB) methodology for selecting these tuning parameters as well as for computing the regularization path plots. The EB method does not suffer from the "double shrinkage problem" of the frequentist EN. It also avoids the difficulty of constructing an appropriate prior on the penalty parameters. The EB methodology is implemented by an efficient importance sampling method based on multiple Gibbs sampler chains. Since the Markov chains underlying the Gibbs samplers are proved to be geometrically ergodic, the Markov chain central limit theorem can be used to provide asymptotically valid confidence bands for profiles of EN coefficients. The practical effectiveness of our method is illustrated by several simulation examples and two real-life case studies. Although this article considers the lasso and EN for brevity, the proposed EB method is general and can be used to select shrinkage parameters in other regularization methods. MSC 2010 subject classifications: primary 62F15, 62J07; secondary 60J05.


Introduction
Consider the standard linear model y = μ1_n + Xβ + ε, where y ∈ R^n is the vector of responses, μ ∈ R is the overall mean, 1_n is the n × 1 vector of 1's, X = (X_1, X_2, ..., X_p) is the n × p (standardized) covariate matrix, β ∈ R^p is the unknown vector of regression coefficients, and ε is the n × 1 vector of iid normal errors with mean zero and unknown variance parameter σ². The ordinary least squares (OLS) method for estimating β and σ² is not applicable when p > n, which is very common in modern data sets arising in genetics, medical science, and other scientific disciplines. OLS also has problems when "n is not much larger than p" (James et al., 2013, p. 203). The shrinkage (also known as regularization) approach, where regression coefficients are shrunken toward zero, can be used to analyze these types of data sets. For example, ridge regression (Hoerl and Kennard, 1970) penalizes large values of the coefficients using the L2 norm. The result of ridge regression, however, is not sparse, and all the regression coefficients remain nonzero at the end of the analysis. The extremely popular least absolute shrinkage and selection operator (lasso) (Tibshirani, 1996), which is based on L1 norm regularization, can simultaneously perform shrinkage and variable selection, as many of the coefficients can be estimated to be exactly zero. The lasso estimate of β is obtained by solving

β̂_lasso = argmin_β { (ỹ − Xβ)ᵀ(ỹ − Xβ) + λ Σ_{i=1}^p |β_i| },    (1)

for some shrinkage parameter λ ≥ 0, where ỹ = y − ȳ1_n. The lasso estimator, although it has been shown to be very useful in many situations, does have some shortcomings. Since the lasso solves a convex optimization problem, it cannot select more variables than the sample size. But many problems, for example microarray experiments, involve many more predictors than the available sample size. Also, the lasso performs unsatisfactorily in situations where predictors are highly correlated.
Finally, if there is some group structure among the variables, the lasso tends to select only one variable from a group, ignoring the others. Zou and Hastie (2005) proposed the elastic net (EN) to achieve better performance in the above three scenarios where the lasso has limitations. The EN estimator is obtained by solving

β̂_EN = argmin_β { (ỹ − Xβ)ᵀ(ỹ − Xβ) + λ1 Σ_{i=1}^p |β_i| + λ2 Σ_{i=1}^p β_i² },    (2)

where λ1 and λ2 are tuning parameters. From (2) we see that the elastic net uses both an L1 penalty, as in the lasso, and an L2 penalty, as in ridge regression. A variety of other penalty terms have been proposed in the literature for incorporating the grouping structure among the variables and for overcoming other limitations of the lasso, for example, the group lasso penalty of Yuan and Lin (2006), the adaptive lasso of Zou (2006), and the octagonal shrinkage and clustering algorithm for regression (OSCAR) of Bondell and Reich (2008).
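The contrast between the lasso's one-per-group behavior and the elastic net's grouping effect can be seen in a small numerical experiment. The following sketch uses scikit-learn's Lasso and ElasticNet as stand-ins for the estimators in (1) and (2); the data, penalty values, and variable names are illustrative and not taken from this article.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 50
# Two blocks of four nearly identical predictors each
z1, z2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([z1 + 0.01 * rng.normal(size=n) for _ in range(4)]
                    + [z2 + 0.01 * rng.normal(size=n) for _ in range(4)])
beta_true = np.array([3.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# L1 penalty only vs. combined L1 + L2 penalty
lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)

# The lasso tends to keep one representative per correlated block,
# while the elastic net spreads weight across the whole block.
print("lasso nonzeros:", int(np.sum(np.abs(lasso.coef_) > 1e-8)))
print("enet  nonzeros:", int(np.sum(np.abs(enet.coef_) > 1e-8)))
```

The point of the sketch is qualitative: with highly correlated blocks, the L2 component of the EN penalty stabilizes the fit and encourages correlated predictors to enter or leave the model together.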
One of the problems that we consider in this paper is the selection of the shrinkage parameters in penalized regression methods. The tuning parameters control the impact of the penalty terms. For example, in (2), if λ1 = 0 = λ2 the penalty terms have no effect, and in this case the EN produces the OLS estimates. On the other hand, as λ1 or λ2 → ∞, the impact of the penalty terms increases and the EN estimates approach zero. Typically, cross validation methods are used for selecting the shrinkage parameters. However, there are problems with the use of cross validation for selecting the tuning parameters. For example, in the context of the EN, the cross validation method described in Zou and Hastie (2005) "selects λ1 and λ2 sequentially instead of simultaneously and causes the double shrinkage problem" (Li and Lin, 2010, p. 152). In the Bayesian framework, the shrinkage parameters can be estimated using either an empirical Bayes (EB) approach or a fully Bayesian analysis with appropriate priors on λ1 and λ2. There are several problems with the fully Bayesian approach: choosing an appropriate prior may not be easy, and, as seen in Section 4, the choice of prior can influence subsequent inference. Also, sampling from the full conditional distributions of λ1 and λ2, which is required in the Markov chain Monte Carlo (MCMC) sampling for the fully Bayesian analysis, is computationally expensive. As shown in Section 2.1, the full conditional distributions of λ1 and λ2 mentioned in Kyung et al. (2010) are incorrect. The correct conditionals of λ1 and λ2 are nonstandard, complex distributions. Thus MCMC sampling for a fully Bayesian analysis of the EN is computationally demanding.
It has been recently shown by Khare and Hobert (2013) that when λ in (1) is assumed fixed, the Gibbs sampler Markov chain for the Bayesian lasso of Park and Casella (2008) has a geometric rate of convergence, while no such convergence result is currently known for the MCMC algorithm for the full Bayesian lasso model. (See Section 2 for the definition of geometric convergence.) In Section 2, we prove that the Bayesian elastic net Gibbs sampler that does not update λ1 and λ2 is geometrically ergodic. Park and Casella (2008) mention that a Monte Carlo EM algorithm can be used for calculating the maximum marginal likelihood estimate of λ in the Bayesian lasso. Li and Lin (2010) use Monte Carlo EM for jointly estimating the tuning parameters of the EN. In the Monte Carlo EM algorithm, a new, fully converged Markov chain must be run in each iteration of the EM algorithm, which is computationally inefficient. In Section 4, we see that the EM algorithm can be extremely slow. Also, we may want to obtain the entire marginal likelihood surface instead of just the maximizing value that the Monte Carlo EM algorithm provides. Park and Casella (2008, p. 683) mention that an importance sampling method can be used to approximate the ratio of marginal likelihoods near the maximizing value of λ. As we explain in Section 3, this naive importance sampling method is in general inefficient. In this paper we develop an EB approach with efficient generalized importance sampling methods based on multiple Markov chains for estimating the shrinkage parameters in penalized regression methods.
In penalized regression methods such as lasso and EN, a plot of the profiles of the estimated regression coefficients as a function of the penalty parameter is used to display the amount of shrinkage corresponding to different tuning parameter values. These plots are also useful for comparing different shrinkage methods. We show that the proposed generalized importance sampling method can efficiently compute such regularization path plots.
The other issue that we consider in this paper is standard error estimation for the EN estimator. In the frequentist setting, various standard error estimates have been proposed for the lasso estimator. The problem with the ridge regression approximation suggested by Tibshirani (1996) and the sandwich estimator of Fan and Li (2001) is that they produce a standard error estimate of zero whenever the regression parameter estimate is zero. Kyung et al. (2010) have shown that the bootstrap method of standard error estimation does not "attach valid standard error estimates to the values of the lasso that are shrunk to zero, in the sense that these estimators are inconsistent" (see also Knight and Fu, 2000). On the other hand, as mentioned above, Khare and Hobert (2013) have recently shown that the Bayesian lasso Gibbs sampler is geometrically ergodic, which allows for the calculation of asymptotically valid standard errors of lasso estimates (see Section 2 for details). Following Khare and Hobert (2013), in this paper we prove that the Bayesian EN Gibbs sampler is geometrically ergodic. This implies that there is a central limit theorem (CLT) for posterior EN estimates based on the Gibbs sampler. Moreover, it also justifies the use of the batch means estimator proposed in Roy et al. (2015) for producing confidence bands for marginal likelihood surface estimates as well as regularization path estimates.
The rest of the paper is organized as follows. Section 2 presents the Gibbs samplers as well as the statement regarding their convergence rates. Section 3 describes an efficient importance sampling method for selecting the shrinkage parameters. It also provides regularization path estimators. Section 4 contains numerical results involving simulation studies and real data application. Some concluding remarks appear in Section 5. Proofs of results are relegated to the Web based supplementary materials (Roy and Chakraborty, 2016).
Hierarchical models and Gibbs samplers for lasso and elastic net

Gibbs samplers for lasso and elastic net

Tibshirani (1996) noted that lasso estimates can be obtained as the posterior mode when the regression parameters have independent and identical Laplace priors. Using a Gaussian mixture representation (with an exponential mixing density) of the Laplace distribution, Park and Casella (2008) consider the following hierarchical Bayesian lasso model:

y | μ, β, σ² ~ N_n(μ1_n + Xβ, σ²I_n),
β | σ², τ² ~ N_p(0, σ²D_τ), with D_τ = diag(τ²_1, ..., τ²_p),
τ²_j ~ independent Exponential(λ²/2), j = 1, ..., p,
μ ~ flat prior on R, σ² ~ Inverse-Gamma(α, ξ).    (3)

The improper prior π(σ²) = 1/σ² (used in Park and Casella (2008)) is obtained by setting α = 0, ξ = 0 in the above model. The connection of the above hierarchical representation with the lasso penalty in (1) can be seen from the following result of Andrews and Mallows (1974): for any a > 0,

(a/2) exp(−a|z|) = ∫₀^∞ (2πs)^{−1/2} exp(−z²/(2s)) (a²/2) exp(−a²s/2) ds.    (4)

From (4) it follows that by integrating out τ²_1, ..., τ²_p, the conditional density of β | σ² is ∏_{i=1}^p (λ/2σ) exp(−λ|β_i|/σ). Let ỹ = y − ȳ1_n. Since the columns of X are centered, an easy calculation shows that μ can be integrated out analytically, leaving the marginal density f(ỹ | β, σ²) in (5); that is, marginalization over μ does not break the conjugacy offered by the use of conjugate priors in the above hierarchical model. In fact, the full conditional distributions of β, τ², and σ² are all standard: β | τ², σ², ỹ is multivariate normal, the 1/τ²_j | β, σ², ỹ are independent inverse-Gaussian, and σ² | β, τ², ỹ is inverse-gamma. The Bayesian lasso Gibbs sampler is a fixed-scan Gibbs sampling algorithm which updates the parameters (β, τ², σ²) in each iteration using draws from these three conditional distributions sequentially.
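One sweep of this fixed-scan sampler can be sketched concretely as follows. This is a minimal Python/NumPy sketch under the improper prior 1/σ²; the conditional families are those of Park and Casella (2008), but the function name, variable names, and default settings are ours and the code is illustrative rather than production-grade.

```python
import numpy as np

def bayes_lasso_gibbs(X, y, lam, n_iter=2000, seed=1):
    """Fixed-scan Gibbs sampler for the Bayesian lasso (Park and Casella, 2008)
    with pi(sigma^2) = 1/sigma^2; y is centred internally (mu integrated out).
    Returns the (n_iter, p) array of beta draws."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    yt = y - y.mean()                       # y-tilde
    XtX, Xty = X.T @ X, X.T @ yt
    beta = np.linalg.lstsq(X, yt, rcond=None)[0]
    sig2, tau2 = 1.0, np.ones(p)
    draws = np.empty((n_iter, p))
    for m in range(n_iter):
        # beta | tau^2, sigma^2 ~ N_p(A^{-1} X' y-tilde, sigma^2 A^{-1}),
        # where A = X'X + D_tau^{-1}
        A_inv = np.linalg.inv(XtX + np.diag(1.0 / tau2))
        beta = rng.multivariate_normal(A_inv @ Xty, sig2 * A_inv)
        # 1/tau_j^2 | beta, sigma^2 ~ Inverse-Gaussian(sqrt(lam^2 sig2 / beta_j^2), lam^2)
        mu_ig = np.sqrt(lam**2 * sig2 / beta**2)
        tau2 = 1.0 / rng.wald(mu_ig, lam**2)
        # sigma^2 | beta, tau^2 ~ Inverse-Gamma((n-1)/2 + p/2, rate)
        rate = 0.5 * np.sum((yt - X @ beta) ** 2) + 0.5 * np.sum(beta**2 / tau2)
        sig2 = 1.0 / rng.gamma((n - 1) / 2 + p / 2, 1.0 / rate)
        draws[m] = beta
    return draws
```

Note the inverse-Gaussian step is implemented with NumPy's `wald` draw (mean and shape parameterization), and the inverse-gamma step as the reciprocal of a gamma draw.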
Following Park and Casella (2008), Kyung et al. (2010) present hierarchical models for other generalized lasso methods, including the elastic net. We consider the hierarchical model for the Bayesian elastic net given in (6). Here we also allow the improper prior 1/σ² (corresponding to α = 0, ξ = 0 in (6)) for σ². The connection of this hierarchical representation with (2) can be seen from the conditional prior density of β | σ² given in (7), which is obtained using the mixture representation (4). Note that the independent exponential priors on the τ²_j used in the Bayesian EN model of Kyung et al. (2010) do not lead to the prior density of β | σ² in (7). Kyung et al. (2010) assumed independent Exponential(λ₁²/2) priors on τ²_j, j = 1, ..., p. Consequently, with independent gamma priors on λ₁² and λ₂, the full conditionals of λ₁² and λ₂ become gamma distributions. As shown above, however, the prior of τ²_j depends on both λ1 and λ2, and in this case the conditional distributions of λ₁² and λ₂ are complicated. Thus MCMC sampling for the full Bayesian analysis of the EN is computationally expensive (Lee et al., 2015).
As in the Bayesian lasso model, the parameter μ can be analytically integrated out of the joint posterior distribution corresponding to (6). The full conditional distributions of β, τ², and σ² are similar to those for the Bayesian lasso and are given in (8)-(10). In Section 2.2 we analyze the Gibbs sampler which is run by updating (β, τ², σ²) in each iteration using draws from these three conditional distributions.

Geometric convergence of the elastic net Gibbs sampler
Let {(β_m, τ²_m, σ²_m)}_{m≥0} be the Markov chain underlying the elastic net Gibbs sampling algorithm discussed in the previous section. That is, at iteration m, given θ_m ≡ (β_m, τ²_m, σ²_m), this Markov chain uses three steps to move to the new value θ_{m+1}: draw σ²_{m+1} from (10) given (β_m, τ²_m), then draw τ²_{m+1} from (9) given (β_m, σ²_{m+1}), and finally draw β_{m+1} | τ²_{m+1}, σ²_{m+1} from (8) given (τ²_{m+1}, σ²_{m+1}). (Note that τ²_m is used to denote both the mth component of τ² and the value of τ² at iteration m.) The Markov transition density (Mtd) of this Gibbs sampler is

k(θ_m, θ_{m+1}) = π(σ²_{m+1} | β_m, τ²_m, y) π(τ²_{m+1} | β_m, σ²_{m+1}, y) π(β_{m+1} | τ²_{m+1}, σ²_{m+1}, y),    (11)

where π(β | τ², σ², y), π(τ² | β, σ², y), and π(σ² | β, τ², y) are the full conditional densities corresponding to the joint posterior density

π(β, τ², σ² | y) = f(ỹ | β, σ²) π(β, σ², τ² | λ1, λ2) / m_{λ1,λ2}(y),    (12)

where f(ỹ | β, σ²) is given in (5), π(β, σ², τ² | λ1, λ2) is the prior density of (β, σ², τ²) given in (6), and

m_{λ1,λ2}(y) = ∫ f(ỹ | β, σ²) π(β, σ², τ² | λ1, λ2) dβ dτ² dσ²    (13)

is the normalizing constant of the posterior density (12). Standard calculations show that the posterior density (12) is the invariant density for the EN Gibbs Markov chain {(β_m, τ²_m, σ²_m)}_{m≥0}. Since the Mtd k is strictly positive, the EN Gibbs chain is irreducible with respect to the Lebesgue measure on R^p × R^p_+ × R_+ (Meyn and Tweedie, 1993, Chapter 4). Then from Asmussen and Glynn (2011) it follows that {(β_m, τ²_m, σ²_m)}_{m≥0} is a positive Harris recurrent Markov chain. Thus, the EN Gibbs chain can be used to produce strongly consistent estimators (Meyn and Tweedie, 1993, Chapter 17). In particular, if E_π|g| < ∞, that is, if g is an integrable function with respect to the posterior density (12), then

ḡ_m := (1/m) Σ_{i=0}^{m−1} g(β_i, τ²_i, σ²_i) → E_π g almost surely as m → ∞,

no matter what the initial distribution of (β_0, τ²_0, σ²_0) is. However, this estimator is useful in practice only if an associated standard error is provided. In fact, if there is a CLT for g, that is, if

√m (ḡ_m − E_π g) →_d N(0, ψ²_g) as m → ∞,    (14)

and if ψ̂²_g is a consistent estimator of the asymptotic variance ψ²_g, then an asymptotic standard error for ḡ_m is ψ̂_g/√m. Unfortunately, Harris recurrence of a Markov chain does not guarantee the existence of such a CLT.
The only standard method for establishing a Markov chain CLT and obtaining consistent estimates of the asymptotic variance is to prove that the underlying Markov chain is geometrically ergodic (see Jones and Hobert, 2001; Roy and Hobert, 2007; Flegal and Jones, 2010). The EN Gibbs chain is called geometrically ergodic if there exist a positive real valued function W and a constant r ∈ (0, 1) such that, for all m,

‖K^m((β, τ², σ²), ·) − Π(·|y)‖_TV ≤ W(β, τ², σ²) r^m,    (15)

where K^m((β, τ², σ²), ·) denotes the probability distribution of the Markov chain started at (β, τ², σ²) after m steps, Π(·|y) denotes the probability measure corresponding to the posterior density (12), and ‖·‖_TV denotes the total variation norm. Note that the function W and the constant r may depend on λ1 and λ2. It is known that if the EN Gibbs chain is geometrically ergodic, then there is a CLT (14) for every function g such that E_π|g|^{2+δ} < ∞ for some δ > 0 (Roberts and Rosenthal, 1997). We have the following result.
Proposition 1. If n ≥ 4, the EN Gibbs Markov chain {(β_m, τ²_m, σ²_m)}_{m≥0} is geometrically ergodic.

Hence the CLT (14) holds for the EN Gibbs chain, and it can be used to construct asymptotic standard errors for posterior estimates. We establish geometric ergodicity of the EN Gibbs chain by establishing a so-called drift condition, together with an associated minorization condition, which we now describe. (See Jones and Hobert (2001) for an introduction to these ideas.) We begin with the drift condition. Consider a drift function V(β, τ², σ²), and let (KV)(β_0, τ²_0, σ²_0) denote the expectation of V(·) with respect to the Mtd k in (11), that is,

(KV)(β_0, τ²_0, σ²_0) = ∫ V(β, τ², σ²) k((β_0, τ²_0, σ²_0), (β, τ², σ²)) dβ dτ² dσ².

We use V(β, τ², σ²) to establish the following drift condition.
Lemma 1. If n ≥ 4, then there exist constants 0 ≤ γ < 1 and d > 0 such that

(KV)(β_0, τ²_0, σ²_0) ≤ γ V(β_0, τ²_0, σ²_0) + d for all (β_0, τ²_0, σ²_0).    (16)

We also establish a minorization condition associated with the geometric drift condition (16). For every L > 0, consider the set of points (β, τ², σ²) with V(β, τ², σ²) ≤ L. Below we present the required minorization condition.

Lemma 2. For every L > 0, there exist a constant ε = ε(L) > 0 and a probability density function u(·) on R^p × R^p_+ × R_+ such that

k((β_0, τ²_0, σ²_0), (β, τ², σ²)) ≥ ε u(β, τ², σ²)

for all (β_0, τ²_0, σ²_0) with V(β_0, τ²_0, σ²_0) ≤ L.

The drift and minorization conditions in Lemma 1 and Lemma 2 imply that the EN Gibbs sampler is geometrically ergodic, that is, Proposition 1 holds (Rosenthal, 1995). Moreover, Rosenthal's (1995) Theorem 12 provides a computable upper bound on W(β, τ², σ²)r^m in (15) that involves the functions and constants from the drift and minorization conditions. Since the drift function V(·) depends on λ2, and the constants γ, d, ε, and the pdf u(·) may depend on λ1 and λ2, the upper bound may also depend on λ1 and λ2. The proofs of Lemma 1 and Lemma 2 are given in the Web based supplementary materials.
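As an aside, the consistent asymptotic-variance estimator ψ̂²_g that geometric ergodicity licenses is, in practice, usually the batch means estimator. The following is a generic Python sketch (the function name is ours, and the AR(1) illustration is a standalone toy chain, not the EN sampler); it assumes the chain is long enough that means of consecutive batches are roughly uncorrelated.

```python
import numpy as np

def batch_means_se(g_vals, n_batches=30):
    """Batch means standard error for an MCMC average of g values.
    The chain is split into consecutive batches; the sample variance of
    the batch means, scaled by the batch size, estimates the asymptotic
    variance psi_g^2 in the Markov chain CLT."""
    g_vals = np.asarray(g_vals, dtype=float)
    m = len(g_vals) - len(g_vals) % n_batches     # drop the remainder
    b = m // n_batches                            # batch size
    batch_means = g_vals[:m].reshape(n_batches, b).mean(axis=1)
    psi2_hat = b * np.var(batch_means, ddof=1)
    return np.sqrt(psi2_hat / m)                  # SE of the chain average

# Illustration on a chain with known autocorrelation: an AR(1) series,
# for which the naive iid standard error is badly optimistic.
rng = np.random.default_rng(0)
N, rho = 40000, 0.9
chain = np.empty(N)
chain[0] = 0.0
for t in range(1, N):
    chain[t] = rho * chain[t - 1] + rng.normal()
se_bm = batch_means_se(chain)
se_iid = chain.std(ddof=1) / np.sqrt(N)  # ignores autocorrelation
```

For a positively autocorrelated chain, the batch means standard error is substantially larger than the naive iid one, which is exactly the correction an honest Monte Carlo standard error requires.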

Selection of tuning parameters and computation of regularization paths using generalized importance sampling methods
In this section, we propose a method for selecting the penalty parameters of the penalized regression methods and computing the regularization paths. For brevity we describe the proposed method in the context of Bayesian elastic net model.

Selection of the tuning parameters
Here we consider an empirical Bayes approach for making inference on the hyperparameters λ1 and λ2 in the Bayesian elastic net model (6). For notational convenience we denote (λ1, λ2) by λ. In particular, we estimate λ ≡ (λ1, λ2) by

λ̂ = argmax_{λ ∈ Λ} m_λ(y),    (17)

where m_λ(y) ≡ m_{λ1,λ2}(y) is the marginal density defined in (13), and Λ = R_+ × R_+. Note that m_λ(y) is not available in closed form, and in order to select the value of λ which maximizes m_λ(y), one can estimate m_λ(y) for a large number of values of λ and compute λ̂ using these estimated values. Monte Carlo estimation of the marginal likelihood is extremely difficult; for example, Newton and Raftery's (1994) harmonic mean estimator is known to perform poorly (Wolpert and Schmidler, 2012). It is often much easier to estimate {a m_λ(y), λ ∈ Λ} than {m_λ(y), λ ∈ Λ} for an appropriately chosen constant a; since a fixed constant does not affect the maximizer, we calculate and subsequently compare the values of the Bayes factors (BFs)

B_{λ,λ0} = m_λ(y) / m_{λ0}(y)

for a fixed λ0 ∈ Λ. Note that

B_{λ,λ0} = E_{λ0} [ π(β, τ², σ² | λ) / π(β, τ², σ² | λ0) ],

where π(β, τ², σ² | λ) denotes the prior density in (6) and E_{λ0} denotes expectation with respect to the posterior density π(β, τ², σ² | y, λ0). Let {(β_m, τ²_m, σ²_m)}_{m≥1} be the Gibbs Markov chain mentioned in Section 2.2 with invariant density π(β, τ², σ² | y, λ0); then by the ergodic theorem,

B̂_{λ,λ0} = (1/M) Σ_{m=1}^{M} π(β_m, τ²_m, σ²_m | λ) / π(β_m, τ²_m, σ²_m | λ0)    (18)

is a simple consistent estimator of B_{λ,λ0} as M → ∞. Note that in (18) a single Markov chain {(β_m, τ²_m, σ²_m)}_{m=1}^{M} with stationary density π(β, τ², σ² | y, λ0) is used to estimate B_{λ,λ0} for different values of λ. In general, the naive importance sampling estimator (18) is very unstable; indeed, when λ is not close to λ0, only a few of the ratios π(β_m, τ²_m, σ²_m | λ)/π(β_m, τ²_m, σ²_m | λ0) may dominate the estimator.
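This instability is easy to demonstrate on a toy target in which π(·|λ) is simply a N(λ, 1) density and the "chain" is iid from π(·|λ0). (The densities and values below are purely illustrative, not those of the EN model.) The effective sample size of the importance weights collapses once λ moves away from λ0:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 5000
lam0 = 0.0
draws = rng.normal(lam0, 1.0, size=M)   # samples targeting lambda_0

def ess(lam):
    """Effective sample size of the weights w_m = pi(theta_m|lam)/pi(theta_m|lam0)
    for a N(lam, 1) numerator and N(lam0, 1) denominator."""
    log_w = -0.5 * (draws - lam) ** 2 + 0.5 * (draws - lam0) ** 2
    w = np.exp(log_w - log_w.max())     # normalize for numerical stability
    return w.sum() ** 2 / (w ** 2).sum()

# Near lam0 the weights are balanced; far from lam0 a handful dominate.
print(ess(0.5), ess(3.0))
```

For a unit-variance normal pair, the expected relative ESS decays like exp(−(λ − λ0)²), so even a modest separation wipes out most of the sample; this is the motivation for the multiple-chain scheme described next.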
Recently, Doss (2010) described a method for efficiently computing large families of BFs (see also Geyer, 1996). The basic idea is to appropriately choose k reference points λ_0, λ_1, ..., λ_{k−1} ∈ Λ and replace π(β_m, τ²_m, σ²_m | λ_0) in (18) with a linear combination of the prior densities evaluated at these skeleton points. We now describe this generalized importance sampling (GIS) method based on multiple Markov chains.
Note that (20) can be used to estimate the entire family of BFs {B_{λ,λ0}, λ ∈ Λ}. The function B̂_{λ,λ0}(r̂) can also be optimized directly to find λ̂, instead of estimating B_{λ,λ0} for a large number of values of λ. We use a quasi-Newton optimization procedure to maximize B̂_{λ,λ0}(r̂) and estimate λ. Since the prior density of σ² does not depend on λ, the estimator (20) simplifies to (21). We use the two-stage procedure of Doss (2010) for computing {B̂_{λ,λ0}(r̂), λ ∈ Λ}. In stage I, we draw large MCMC samples {β^{(j;l)}, τ²^{(j;l)}, σ²^{(j;l)}}_{l=1}^{M_j} from π(β, τ², σ² | y, λ_j) for each j = 0, ..., k − 1. Since the EN Gibbs sampler does not involve any computationally demanding calculations, these MCMC samples can be obtained quickly. We estimate r by Geyer's (1994) reverse logistic regression method using these samples. Independently of stage I, in stage II we obtain new MCMC samples from π(β, τ², σ² | y, λ_j) for each j = 0, ..., k − 1 and use these stage II samples to estimate B_{λ,λ0} using B̂_{λ,λ0}(r̂) given in (20). The reason for using this two-stage procedure is that the amount of computation required to calculate B̂_{λ,λ0}(r̂) is linear in M, which rules out large M in stage II (Doss, 2010). On the other hand, it is desirable to use large M_j in stage I to estimate r accurately. Roy et al. (2016) use this two-stage method for selecting correlation parameters and link function parameters in spatial generalized linear mixed models (see also Roy, 2014, for another application). They use Roy et al.'s (2015) standard error estimates of B̂_{λ,λ0}(r̂) for choosing good values of k and the skeleton points λ_0, ..., λ_{k−1} (see also Buta and Doss, 2011, p. 2671). In order to use Roy et al.'s (2016) method for choosing the skeleton points, a CLT must be established for the estimator B̂_{λ,λ0}(r̂) defined in (20). A CLT for B̂_{λ,λ0}(r̂) is also required for constructing (pointwise) confidence intervals for the BFs B_{λ,λ0}.
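The structure of the GIS estimator can be illustrated with a toy family for which everything is available in closed form. Below, ν_λ(θ) = exp(−λθ) on θ > 0 plays the role of the unnormalized density, its normalizing constant m_λ = 1/λ is treated as unknown, iid draws stand in for the MCMC samples, and the vector r of normalizing-constant ratios is taken as known (in the actual procedure it is estimated in stage I by reverse logistic regression). All names here are ours; this is a sketch of the estimator's form, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy family: nu_lam(t) = exp(-lam * t), t > 0, with m_lam = 1/lam.
skeleton = np.array([0.5, 1.0, 2.0])   # reference points lam_0, ..., lam_{k-1}
M = 4000                               # sample size at each skeleton point
chains = [rng.exponential(1.0 / lam, size=M) for lam in skeleton]
theta = np.concatenate(chains)         # pooled draws from all k samplers

# r_s = m_{lam_s} / m_{lam_0}; known in this toy, estimated in practice.
r = (1.0 / skeleton) / (1.0 / skeleton[0])

def B_hat(lam):
    """GIS estimate of B_{lam, lam_0} = m_lam / m_{lam_0}: each pooled draw is
    weighted by nu_lam over the r-weighted mixture of the skeleton densities."""
    num = np.exp(-lam * theta)
    den = sum(M * np.exp(-s * theta) / r_s for s, r_s in zip(skeleton, r))
    return np.sum(num / den)

# True value: B_{lam, lam_0} = lam_0 / lam; e.g. 0.5/1.2 for lam = 1.2.
```

Because the denominator mixes all k reference densities, the weights stay controlled for any λ inside the region covered by the skeleton points, which is precisely what the single-chain estimator (18) lacks.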
Buta and Doss's (2011) Theorem 1 presents the conditions under which B̂_{λ,λ0}(r̂) has a CLT. The Markov chains {β^{(j;l)}, τ²^{(j;l)}, σ²^{(j;l)}}_{l=1}^{M_j} need to be geometrically ergodic, which we have shown in Section 2.2. Define the function Z in (22), and let E_λ Z denote the posterior mean of Z with respect to the density (12). Another condition of Buta and Doss's (2011) Theorem 1 is that there exists ε > 0 such that E_{λ_l}(Z^{2+ε}) < ∞ for l = 0, 1, ..., k − 1. The following lemma provides a condition under which Z(·) is a bounded function and, in particular, has moments of all orders.
Proof of Lemma 3 is given in the Web based supplementary materials.
Note that for the original lasso model (3), since the prior distributions of β and σ² do not depend on the penalty parameter λ, the estimator (21) for the lasso takes an even simpler form: the importance weights depend on the chains only through the τ² draws, since only the Exponential(λ²/2) prior on the τ²_j involves λ.
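Concretely, for the lasso the λ-dependent part of the prior ratio is ∏_j (λ²/λ0²) exp(−(λ² − λ0²)τ²_j/2), so a chain run at λ0 yields the whole BF profile. The sketch below (Python; the function name is ours, and for brevity it shows the single-reference-point k = 1 version rather than the full multi-chain GIS estimator) computes that profile from stored τ² draws:

```python
import numpy as np

def lasso_bf_profile(tau2_chain, lam_grid, lam0):
    """Profile of Bayes factor estimates B_{lam, lam0} for the Bayesian lasso
    from a single chain run at lam0 (the k = 1 special case). Only the
    Exponential(lam^2/2) prior on tau_j^2 depends on lam, so the importance
    ratio is (lam^2/lam0^2)^p * exp(-(lam^2 - lam0^2) * sum_j tau2_j / 2).
    tau2_chain: (M, p) array of tau^2 draws."""
    M, p = tau2_chain.shape
    s = tau2_chain.sum(axis=1)          # sum_j tau2_{jm}, one value per draw
    profile = []
    for lam in lam_grid:
        log_w = p * np.log(lam**2 / lam0**2) - 0.5 * (lam**2 - lam0**2) * s
        profile.append(np.mean(np.exp(log_w)))
    return np.array(profile)
```

As a sanity check, feeding the function iid draws from the Exponential(λ0²/2) prior itself gives a profile identically equal to one in expectation, since the two normalized priors have the same total mass.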
Since the EN Gibbs sampler is geometrically ergodic, Roy et al.'s (2015) Theorem 3 can be used to compute standard errors of η̂^{[g]}. To use this theorem, we need to show that there exists ε > 0 such that E_{λ_l}(|gZ|^{2+ε}) < ∞ for l = 0, 1, ..., k − 1, where Z is defined in (22). The following corollary follows from Lemma 3.

Corollary 1. If there exists at least one
We use the GIS estimator η̂^{[g]} to efficiently compute the profile plot of E_λ β, the posterior mean of the regression coefficients, for the Bayesian lasso and EN. Also, Roy et al.'s (2015) Theorem 3 is used to produce asymptotically valid (pointwise) confidence bands for these regularization path estimates. A step-by-step summary of the proposed inferential procedure is given in the Web based supplementary materials.

Applications
In this section we perform simulations to compare our empirical Bayes lasso and empirical Bayes elastic net models, fitted using GIS, with other competing models. We also apply these models to two real data sets.

Simulation study
We compare the performance of our two models against several frequentist and Bayesian versions of the lasso and elastic net, as follows: (i) lasso (Tibshirani, 1996), (ii) elastic net (EN) (Zou and Hastie, 2005), (iii) full hierarchical Bayes lasso (FB-lasso) (Park and Casella, 2008), (iv) full hierarchical Bayes elastic net (FB-EN) (Kyung et al., 2010), (v) empirical Bayes lasso fitted using Monte Carlo EM (EM-lasso) (Park and Casella, 2008), and (vi) empirical Bayes elastic net fitted using Monte Carlo EM (EM-EN) (Li and Lin, 2010). For each simulation scenario, we generate a training set and a test set separately. We fit GIS-lasso, GIS-EN, and all the competing models (i)-(vi) using the training data and obtain the parameter estimates. Then we use the fitted models from the training set to calculate the prediction accuracy on the test set. The five simulation setups we use are very similar to those in the original EN paper (Zou and Hastie, 2005) and in the OSCAR paper (Bondell and Reich, 2008).
In each of the five simulation scenarios, the training set sample size is 100 and the test set sample size is 400.
For each of the five above simulation scenarios, we generate 100 training data sets and 100 test data sets from the respective models.
The tuning parameters for the frequentist lasso and elastic net are selected using five-fold cross-validation on the training set. We use the glmnet package in R to fit the frequentist lasso and elastic net models. In FB-lasso and FB-EN the prior distributions for the tuning parameters λ, λ1, and λ2 are selected following the recommendations provided in Park and Casella (2008) and Kyung et al. (2010). To study the effect of different choices of prior distribution for the tuning parameters, we fitted the Bayesian lasso and Bayesian elastic net under several choices of priors for λ, λ1, and λ2. In FB-lasso we assign the priors (a) λ² ~ Gamma(1, .01), (b) λ² ~ Gamma(1, .1), and (c) λ² ~ Gamma(1, 1). In FB-EN we assign the priors (a) λ₁² ~ Gamma(1, .01) and λ₂ ~ Gamma(1, .01), (b) λ₁² ~ Gamma(1, .1) and λ₂ ~ Gamma(1, .1), and (c) λ₁² ~ Gamma(1, 1) and λ₂ ~ Gamma(1, 1). All these choices provide near-diffuse priors for the tuning parameters (Kyung et al., 2010) and thus to some extent ensure the objectivity of the analysis. For EM-lasso and EM-EN the tuning parameters are estimated by maximizing the marginal likelihood, implemented with the EM-Gibbs algorithm proposed by Kyung et al. (2010). The FB-lasso, FB-EN, EM-lasso, and EM-EN models are all fitted using computer code obtained from Prof. Minjung Kyung (of Kyung et al., 2010). The convergence of the MCMC chains is checked via trace plots.
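For reference, the five-fold cross-validation step for the frequentist fits has the following shape. This Python sketch uses scikit-learn's LassoCV and ElasticNetCV in place of glmnet's cross-validation in R; the simulated data and candidate grids are illustrative only.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

# Five-fold CV over an automatic grid of penalty values; for the elastic
# net, the L1/L2 mixing proportion is searched over a small candidate list.
lasso_cv = LassoCV(cv=5).fit(X, y)
enet_cv = ElasticNetCV(cv=5, l1_ratio=[0.1, 0.5, 0.9]).fit(X, y)

print("selected lasso penalty:", lasso_cv.alpha_)
print("selected EN penalty and l1_ratio:", enet_cv.alpha_, enet_cv.l1_ratio_)
```

Note that this sequential/grid search over two EN tuning constants is exactly the setting in which the double shrinkage problem discussed in the introduction arises, which the EB approach avoids by maximizing the marginal likelihood jointly.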
Our GIS-lasso and GIS-EN do not require any prior distribution assignment for the tuning parameters. The improper prior 1/σ² is chosen for σ². Due to the use of this improper prior, m_λ(y) is not uniquely defined. Nevertheless, the Bayes factor between any two models, say m_λ(y)/m_{λ0}(y), is well-defined because the same improper prior is assigned to the shared parameters of the two models (see e.g. Kass and Raftery (1995, Section 5) and Liang et al. (2008, Section 2)). We need to specify the skeleton points needed to calculate the marginal likelihood. In all five simulation studies we use around 10 to 20 skeleton points, which produce highly satisfactory results. We choose the skeleton points using the method described in Roy et al. (2016). The two-stage procedure described in Section 3.1 is used to estimate the entire profile of the marginal likelihood of the tuning parameters. Then the marginal likelihood is maximized to find the estimates of the tuning parameters. In the first stage of generating parameter samples at the selected skeleton points, we ran MCMC chains of length 10,000 with the first 5,000 iterations as burn-in. In the second stage we ran MCMC chains of length 1,000 with the first 500 iterations as burn-in. In all our simulation settings and the two real data analyses, these choices of MCMC chain lengths produced fast and accurate estimates. Finally, MCMC samples from π(β, τ², σ² | y, λ̂), the posterior density corresponding to the estimated tuning parameters, are used to estimate the other parameters (β, τ², σ²) and to predict y at a new covariate vector x.
Each simulation scenario is repeated 100 times, and each time a separate training set and test set are generated. In Tables 1 and 2 we report the average test set mean squared error of prediction (MSEP), along with the corresponding standard deviations, for our two models and all other competing lasso and elastic net models. The ranges of the estimates of the tuning parameters are also included in Tables 1 and 2.

Real data sets
In this section, we apply our GIS-lasso and GIS-EN models to two real data sets, namely the diabetes data (Efron et al., 2004) and the soil data (Bondell and Reich, 2008). We also fit all of the competing models discussed in the simulation study section. The tuning parameters for the frequentist models are all selected by five-fold cross-validation, as before. The FB-lasso is fitted with the prior choice λ² ~ Gamma(1, .01). The FB-EN is fitted with the prior choices λ₁² ~ Gamma(1, .01) and λ₂ ~ Gamma(1, .01). As in the simulation examples, we use posterior mean estimates of λ (respectively, (λ1, λ2)) for FB-lasso (respectively, FB-EN).

Diabetes data
This data set arises from a study of 442 diabetes patients (Efron et al., 2004). The predictor or baseline variables are age, sex, body mass index (bmi), average blood pressure, and six blood serum measurements. The response variable is a quantitative measure of disease progression one year after the baseline variables were measured. The main goal here is to give a good prediction of disease progression along with detecting the important baseline variables. We randomly split the data, with 300 observations in the training set and the remaining 142 observations in the test set. From the results reported in Table 3, all of the lasso-type models produced much higher MSE than the elastic net based models. Our GIS-EN effectively reduces the out-of-sample MSE by at least 7% relative to all other competing models. The GIS-lasso solution path in Figure 1(a) selects bmi, lamotrigine (ltg), mean arterial pressure (map), sex, total cholesterol (tc), and high density lipoprotein (hdl) as important predictors, whereas the solution path of GIS-EN in Figure 1(b) identifies only ltg, bmi, map, hdl, and tch (thyroid stimulating hormone) as important. We also include the full profile of the BFs for GIS-lasso, along with a 95% confidence band, in Figure 1(c). For GIS-EN, since there are two tuning parameters (λ1 and λ2), we include the contour plot of the BF and the corresponding estimate of the standard error (SE) relative to the Bayes factor in Figures 1(d) and 1(e), respectively. For better illustration, in Figures 2 and 3 we include the 95% confidence bands for the GIS-lasso and GIS-EN solution paths for each individual coefficient. These plots illustrate the ability of our method to quantify the uncertainty over the full solution path of the coefficients.
With GIS-lasso and GIS-EN we can obtain the full profile of the BF and its corresponding SEs (Figure 1(c)-(e)), which give a clear picture of the role of the optimal choice of the tuning parameters. Moreover, these plots (Figure 1(c)-(e)) provide a guideline for the selection of the skeleton points (see e.g. Roy et al., 2015): we suggest selecting more skeleton points from the region of the tuning parameter space where the SE relative to the Bayes factor is relatively large.

Soil data
This data set comes from a study of the association between soil characteristics and forest diversity in the Appalachian mountains of North Carolina (Bondell and Reich, 2008). The response of interest is the number of different plant species found in a plot. The covariates are 15 soil characteristics: (1) % base saturation, (2) sum cations, (3) CEC, (4) calcium, (5) magnesium, (6) potassium, (7) sodium, (8) phosphorous, (9) copper, (10) zinc, (11) manganese, (12) humic matter, (13) density, (14) soil pH, and (15) exchangeable acidity. A more detailed description can be found in Bondell and Reich (2008). In this study, we have only 20 samples, and several predictors are very highly correlated (absolute correlation > 0.90). For example, predictor variables 1 to 5 are all very strongly correlated, the correlation between sodium and phosphorous is also very high, and there is a strong negative correlation between soil pH and exchangeable acidity. Given the strong correlation structure among the covariates, it would be very useful to group together the variables that are strongly correlated. Therefore the lasso, and all Bayesian counterparts of the lasso that cannot take the underlying correlation structure into account, will give suboptimal results. In Table 4 we report the leave-one-out cross-validation MSE. In the presence of strong correlation among the predictor variables, our GIS-EN resulted in the lowest MSE of all the methods. Compared with all the lasso variants, GIS-EN lowered the CV error by more than 10%; compared with the existing elastic net and EM-EN, GIS-EN improved the accuracy by around 4%. It is important to mention that although GIS-EN and EM-EN seem to perform equally well in this comparison, EM-EN is extremely slow to implement and is thus hardly usable in practical applications, whereas our GIS-EN produces the same accurate results about 15 times faster.
In Figure 4 we show the solution paths for GIS-lasso and GIS-EN. From the solution paths (Figures 4(a) and (b)) we can see that GIS-lasso picked 9 variables as important, whereas GIS-EN selected only 6. The 9 variables selected by GIS-lasso are soil pH, copper, exchangeable acidity, potassium, density, humic matter, calcium, zinc, and phosphorous. The 6 variables selected by GIS-EN are soil pH, copper, exchangeable acidity, potassium, density, and humic matter.


Discussions
Variable selection plays a fundamental role in modern statistical modeling. Classical approaches to variable selection are based on regularization methods, which minimize the residual sum of squares subject to an imposed penalty. These methods are closely related to Bayesian methods, as the estimate given by a regularization method is often the posterior mode of a corresponding Bayesian model. The estimation of penalty parameters, although very important, can be tricky. In this article, an empirical Bayes methodology based on efficient importance sampling schemes is used for estimating the penalty parameters as well as the full solution paths for the Bayesian lasso and EN. The Gibbs sampler for the Bayesian EN is proved to be geometrically ergodic, which allows users to calculate asymptotically valid standard errors for the posterior estimates.
In many applications, interactions exist among the variables. In recent years, a variety of models has been proposed in the literature to deal with the presence of highly correlated predictors (see e.g. Bondell and Reich, 2008; Liu et al., 2014). As a possible avenue for future work, it would be interesting to apply our proposed methods to estimate the penalty parameters and solution paths of these other grouped lasso methods. The GIS method can also be applied with other shrinkage priors, such as global-local priors (e.g. the horseshoe prior of Carvalho et al., 2010) and nonlocal priors (Johnson and Rossell, 2012).

Supplementary Material
Supplementary Material for "Selection of Tuning Parameters, Solution Paths and Standard Errors for Bayesian Lassos" (DOI: 10.1214/16-BA1025SUPP; .pdf). The online supplementary materials contain the proofs of lemmas. Also a summary of the steps involved in the estimation of the tuning parameters and the solution paths is given in the supplementary materials.