Variable selection via penalized credible regions with Dirichlet-Laplace global-local shrinkage priors

The method of Bayesian variable selection via penalized credible regions separates model fitting from variable selection: the idea is to search for the sparsest solution within the joint posterior credible region. Although the approach has been successful, it depended on the use of conjugate normal priors. More recently, global-local shrinkage priors have brought substantial improvements to high-dimensional Bayesian variable selection. In this paper, we incorporate global-local priors into the credible region selection framework. The Dirichlet-Laplace (DL) prior is adapted to linear regression. Posterior consistency for the normal and DL priors is shown, along with variable selection consistency. We further introduce a new method to tune hyperparameters in prior distributions for linear regression. We propose choosing the hyperparameters to minimize a discrepancy between the induced distribution on R-square and a prespecified target distribution. Prior elicitation on R-square is more natural than elicitation on the coefficients themselves, particularly when there is a large number of predictor variables, in which case elicitation on the coefficient scale is not feasible. For a normal prior, the hyperparameters that minimize the Kullback-Leibler divergence between the two distributions are available in closed form.


Introduction
High-dimensional data have become increasingly common in all fields. Linear regression is a standard and intuitive way to model dependency in high-dimensional data. Consider the linear regression model

Y = Xβ + ε,    (1)

where X is the n × p high-dimensional matrix of covariates, Y is the n-vector of responses, β = (β_1, · · · , β_p)^T is the p-dimensional coefficient vector, and ε is the error term assumed to have E(ε) = 0 and Var(ε) = σ^2 I_n. Ordinary least squares is not feasible when the number of predictors p is larger than the sample size n, so variable selection is necessary to reduce the large number of candidate predictors. Classical variable selection methods include subset selection; criteria such as AIC (Akaike 1998) and BIC (Schwarz 1978); and penalized methods such as the least absolute shrinkage and selection operator (Lasso; Tibshirani 1996), smoothly clipped absolute deviation (SCAD; Fan & Li 2001), the elastic net (Zou & Hastie 2005), the adaptive Lasso (Zou 2006), the Dantzig selector (Candes & Tao 2007), and the octagonal shrinkage and clustering algorithm for regression (OSCAR; Bondell & Reich 2008).
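As a quick numerical illustration of why ordinary least squares breaks down when p > n, the Gram matrix X^T X is necessarily rank-deficient, so the normal equations have no unique solution. A minimal sketch in Python (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 500  # more candidate predictors than observations

X = rng.standard_normal((n, p))
gram = X.T @ X  # p x p, but rank(X^T X) = rank(X) <= n < p
rank = np.linalg.matrix_rank(gram)
print(rank)      # at most n = 60
print(rank < p)  # True: OLS coefficients are not identifiable
```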
In the Bayesian framework, approaches for variable selection include stochastic search variable selection (SSVS; George & McCulloch 1993), Bayesian regularization (Park & Casella 2008, Li et al. 2010, Polson et al. 2013, Leng et al. 2014), empirical Bayes variable selection (George & Foster 2000), spike and slab variable selection (Ishwaran & Rao 2005), and global-local (GL) shrinkage priors of the form

β_j ∼ N(0, w ξ_j), ξ_j ∼ π(ξ_j), (w, σ^2) ∼ π(w, σ^2),    (2)

where w controls the global shrinkage towards the origin, while ξ_j allows local deviations in the degree of shrinkage. Options for shrinkage priors on β include the normal-gamma prior (Griffin et al. 2010), the Horseshoe prior (Carvalho et al. 2009, 2010), the generalized double Pareto prior, the Dirichlet-Laplace (DL) prior (Bhattacharya et al. 2015), the Horseshoe+ prior (Bhadra et al. 2015), and others that can be represented as (2). GL shrinkage priors typically shrink small coefficients heavily, due to a tight peak at zero, while rarely shrinking large coefficients, due to heavy tails. It has been shown that GL shrinkage priors enjoy improved posterior concentration (Bhattacharya et al. 2015). However, a shrinkage prior by itself does not lead to variable selection; some rule must be applied to the posterior. Bondell & Reich (2012) proposed a Bayesian variable selection method based only on posterior credible regions. However, the implementation and results of that paper depended on the use of conjugate normal priors. Due to the improved concentration, incorporating global-local shrinkage priors into this framework can perform better, both in theory and in practice. We show that the DL prior yields consistent posteriors in this regression setting, along with selection consistency.
Another difficulty in high-dimensional data is the choice of hyperparameters, which can strongly affect the results. In this paper, we also propose an intuitive default method to tune the hyperparameters in the prior distributions. By minimizing a discrepancy between the induced distribution of R^2 from the prior and the desired distribution (a Beta distribution by default), one obtains a default choice of hyperparameter value. For the normal prior, the hyperparameter that minimizes the Kullback-Leibler (KL) divergence between the distributions is shown to have a closed form solution.
Overall, compared to other Bayesian methods, on the one hand, our method exploits the advantage of global-local shrinkage priors, which can simultaneously shrink small coefficients effectively and estimate the coefficients of important variables reliably. On the other hand, by using the credible region variable selection approach, we can easily transform the non-sparse posterior estimates into sparse solutions. Compared to common frequentist methods, our approach provides the flexibility to estimate the tuning parameter jointly with the regression coefficients, allows easy incorporation of external information or hierarchical modeling into the Bayesian regularization framework, and leads to straightforward computing through Gibbs sampling.
The remainder of the paper is organized as follows. Section 2 reviews the penalized credible region variable selection method. Section 3 details the proposed method which combines shrinkage priors and penalized credible region variable selection. Section 4 presents the posterior consistency under the choice of shrinkage priors, as well as the asymptotic behavior of the selection consistency for diverging p. Section 5 discusses a default method to tune the hyperparameters in the prior distributions based on the induced prior distribution on R^2. Section 6 reports the simulation results, and Section 7 gives the analysis of a real-time PCR dataset. All proofs are given in the Appendix.

Background
The conjugate normal prior of Bondell & Reich (2012) is used,

β | σ^2, γ ∼ N(0, (σ^2/γ) I_p),    (3)

where σ^2 is the error variance term as in (1), and γ is the ratio of the prior precision to the error precision. The variance σ^2 is often given a diffuse inverse Gamma prior, while γ is a hyperparameter that is either fixed or given a Gamma hyperprior.
The credible region approach is to find β̃ such that

β̃ = argmin_β ||β||_0 subject to β ∈ C_α,    (4)

where ||β||_0 is the L_0 norm of β, i.e., the number of nonzero elements, and C_α is the (1 − α) × 100% posterior credible region based on the particular prior distribution. The use of elliptical posterior credible regions yields the form

C_α = {β : (β − β̂)^T Σ̂^{-1} (β − β̂) ≤ c_α}

for some nonnegative c_α, where β̂ and Σ̂ are the posterior mean and covariance, respectively.
Then, replacing the L_0 penalty in (4) with a smooth homotopy between L_0 and L_1 proposed by Lv & Fan (2009), followed by a linear approximation, the optimization problem in (4) becomes

β̃ = argmin_β (β − β̂)^T Σ̂^{-1} (β − β̂) + λ_α Σ_{j=1}^p |β_j| / β̂_j^2,    (5)

where there is a one-to-one correspondence between c_α and λ_α. The sequence of solutions to (5) can be computed directly by plugging in the posterior mean and covariance and using the LARS algorithm (Efron et al. 2004). Thus, the penalized credible region selection method can be feasibly performed by plugging the posterior mean and covariance matrix into the optimization algorithm (5). So given any GL prior, once MCMC produces posterior samples, the sample mean, β̂, and sample covariance, Σ̂, are obtained, and variable selection can then be performed through the penalized credible region method. In this paper, we modify the Dirichlet-Laplace (DL) prior to implement it in the regression setting. We also consider the Laplace prior, also referred to as the Bayesian Lasso, described in Park & Casella (2008) and Hans (2010),

π(β_j | σ, λ) = (λ / (2σ)) exp(−λ |β_j| / σ),    (6)

where λ is the Lasso parameter, controlling the global shrinkage.
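The reduction of the optimization in (5) to a weighted Lasso can be sketched as follows. This is a hedged illustration, not the authors' code: `beta_hat` and `Sigma_hat` stand in for the posterior summaries from the MCMC output (the toy values below are hypothetical), the weights 1/β̂_j² come from the linear approximation described above, and sklearn's `lars_path` plays the role of the LARS algorithm.

```python
import numpy as np
from scipy.linalg import cholesky
from sklearn.linear_model import lars_path

def credible_region_path(beta_hat, Sigma_hat):
    """Solution path of a penalized credible region criterion of the form
    (beta - beta_hat)' Sigma_hat^{-1} (beta - beta_hat) + lambda * sum |beta_j| / beta_hat_j^2.

    Substituting theta_j = beta_j / beta_hat_j^2 turns the problem into an
    ordinary Lasso in theta, solvable by LARS."""
    D = np.diag(beta_hat ** 2)
    # Sigma_hat^{-1} = L^T L with L the upper-triangular Cholesky factor
    L = cholesky(np.linalg.inv(Sigma_hat))
    X_star = L @ D          # pseudo-design
    y_star = L @ beta_hat   # pseudo-response
    _, _, thetas = lars_path(X_star, y_star, method="lasso")
    return D @ thetas       # columns: beta solutions along the path

# Toy example with a hypothetical posterior summary
rng = np.random.default_rng(1)
p = 5
beta_hat = np.array([2.0, 0.1, -1.5, 0.05, 0.02])
A = rng.standard_normal((p, p))
Sigma_hat = A @ A.T / p + 0.1 * np.eye(p)
path = credible_region_path(beta_hat, Sigma_hat)
print(path.shape)  # (p, number of path steps)
```

The path starts at the zero vector (maximal penalty) and ends at the unpenalized solution β̂, so sweeping along its columns traces out the nested sequence of candidate models.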

Dirichlet-Laplace Priors
For the normal means model, Bhattacharya et al. (2015) proposed a new class of Dirichlet-Laplace (DL) shrinkage priors, possessing an optimal posterior concentration property. We construct a generalization of the DL prior for the linear regression model. The proposed hierarchical DL prior is, for j = 1, · · · , p,

β_j | σ, φ, τ ∼ DE(σ φ_j τ), (φ_1, · · · , φ_p) ∼ Dir(a, · · · , a), τ ∼ Ga(pa, 1/2).    (7)
where DE(b) denotes a zero mean Laplace kernel with density f(y) = (2b)^{-1} exp{−|y|/b} for y ∈ R, Dir(a, · · · , a) is the Dirichlet distribution with concentration vector (a, · · · , a), and Ga(pa, 1/2) denotes a Gamma distribution with shape pa and rate 1/2. Here, small values of a lead most of (φ_1, · · · , φ_p) to be close to zero with only a few nonzero, while large values allow less singularity at zero, thus controlling the sparsity of the regression coefficients. The φ_j's are local scales, allowing deviations in the degree of shrinkage.
As pointed out in Bhattacharya et al. (2015), τ controls global shrinkage towards the origin and to some extent determines the tail behaviors of the marginal distribution of β j 's.
We also assume a common prior on the variance term σ 2 , IG(a 1 , b 1 ), the inverse Gamma distribution with shape a 1 and scale b 1 .
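The sparsity role of a in the hierarchy above can be seen by simulating from the prior directly. A minimal sketch (assuming σ² = 1; the values of p, a, and the number of draws are illustrative):

```python
import numpy as np

def sample_dl_prior(p, a, size, rng):
    """Draw coefficient vectors from the DL prior with sigma^2 = 1:
    phi ~ Dir(a, ..., a), tau ~ Ga(pa, rate 1/2), beta_j ~ DE(phi_j * tau)."""
    phi = rng.dirichlet(np.full(p, a), size=size)            # local scales
    tau = rng.gamma(shape=p * a, scale=2.0, size=(size, 1))  # global scale (rate 1/2)
    return rng.laplace(loc=0.0, scale=phi * tau)

rng = np.random.default_rng(0)
small_a = sample_dl_prior(p=100, a=0.01, size=500, rng=rng)
large_a = sample_dl_prior(p=100, a=0.5, size=500, rng=rng)

# Smaller a concentrates the Dirichlet weights on a few coordinates,
# so most |beta_j| draws are tiny: the prior draws are effectively sparser.
print(np.median(np.abs(small_a)), np.median(np.abs(large_a)))
```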

Asymptotic Theory
In this section, we first study the posterior properties of the normal and DL priors when both n and p_n go to infinity, and then investigate the selection consistency of the penalized variable selection method. Assume the true regression parameter is β_n^0, and the estimated regression parameter is β_n. Denote the true set of non-zero coefficients by A_n^0 = {j : β_nj^0 ≠ 0, j = 1, · · · , p_n}, and the estimated set of non-zero coefficients by A_n = {j : β_nj ≠ 0, j = 1, · · · , p_n}. Also let q_n = |A_n^0| denote the number of predictors with nonzero true coefficients. As n → ∞, consider the sequence of credible sets of the form {β_n : (β_n − β̂_n)^T Σ̂_n^{-1} (β_n − β̂_n) ≤ c_n}, where β̂_n and Σ̂_n are the posterior mean and covariance matrix, respectively, and c_n is a sequence of non-negative constants. Let Γ_n denote the p_n × p_n matrix whose columns are the eigenvectors of X_n^T X_n / n, ordered by decreasing eigenvalues d_1 ≥ d_2 ≥ · · · ≥ d_{p_n} ≥ 0, so that X_n^T X_n / n = Γ_n D_n Γ_n^T with D_n = diag{d_1, · · · , d_{p_n}}. Assume the following regularity conditions throughout.
Assumption (A2), requiring the eigenvalues to be bounded away from 0 and ∞, is a necessary condition for estimation consistency in Bayesian methods, and also for consistency of ordinary least squares in the case of growing dimension with p_n = o(n). This is akin to the fixed-dimension condition that X^T X/n converges to a positive definite matrix (Assumption (A2) in Bondell & Reich (2012)). The basic intuition is that without a lower bound on the eigenvalues there is an asymptotic singularity, leaving a linear combination of the regression parameters that is not identifiable, i.e., one whose variance would be infinite, so that it could not be estimated consistently. The upper bound, on the other hand, ensures that there is a proper covariance matrix for every p_n. If each row of X_n is a random draw from a p_n-dimensional probability distribution, the bounded eigenvalue condition is an assumption on the true sequence of covariance matrices, since for large n and p_n = o(n) the sample covariance X_n^T X_n/n (assuming centered variables) converges to the true covariance. Typical covariance structures have the bounded eigenvalue property.
Also note that Assumption (A5) restricts the minimum signal size for the non-zero coefficients, while also ensuring that there are not too many small signals. Posterior consistency has previously been established for several shrinkage priors, including Horseshoe-like priors (Armagan, Dunson, Lee, Bajwa & Strawn 2013). By definition, posterior consistency means that the posterior distribution of β_n converges in probability to β_n^0, i.e., for any ε > 0, P(β_n : ||β_n − β_n^0|| > ε | Y_n) → 0 as p_n, n → ∞. In this section, we show that the normal and Dirichlet-Laplace priors also yield consistent posteriors; however, the DL prior can yield consistent posteriors under weaker conditions on the signal.

Theorem 1. Under assumptions (A1)-(A3) and p_n = o(n), if q_n = o{n^{1−ρ}/(p_n (log n)^2)} for ρ ∈ (0, 1), and σ^2/γ_n = C/(√p_n n^{ρ/2} log n) for finite C > 0, the normal prior (3) yields a consistent posterior.
Note that the difference between the above two theorems lies in the allowed number of nonzero components, q_n. Since n/log n > n^{1−ρ}/(p_n (log n)^2), the Dirichlet-Laplace prior leads to posterior consistency over a much broader domain than the normal prior, as well as the Laplace prior, which also yields consistent posteriors as shown in Theorem 2 of Armagan, Dunson, Lee, Bajwa & Strawn (2013). This strengthens the theoretical justification for replacing the normal prior with the DL prior. However, note that the theorems give only sufficient conditions for posterior consistency under each of the priors. The sufficient condition does allow a broader domain for q_n in Theorem 2, for the Dirichlet-Laplace prior, than in Theorem 1 for the normal prior; but it is not clear that these conditions are also necessary. So although we are able to prove consistency for the Dirichlet-Laplace prior under a more general condition than for the normal prior, there may be room to improve the conditions in either or both cases.

Selection Consistency of Penalized Credible Regions
Bondell & Reich (2012) showed that when p is fixed and β is given the normal prior in (3), the penalized credible region method is consistent in variable selection. In this paper, we show that consistency of the posterior distribution under a global-local shrinkage prior also yields consistency in variable selection in the case p_n → ∞. For the DL prior in (7), we have the following result.
Corollary 1. Under assumptions (A1)-(A5), given the DL prior in (7), if q_n = o(n/ log n) and a_n = C/(p_n n^ρ log n) for any finite ρ > 0 and finite C > 0, then the penalized credible region method is consistent in variable selection.

Note that the variable selection consistency is derived from the posterior consistency.
However, Assumption (A3) is not necessary to ensure variable selection consistency. If (A3) is not satisfied, i.e., β_n^0 is truly unbounded, then although it would not be possible to obtain a consistent estimator or posterior, the credible region would become bounded away from zero in that direction, and hence that direction would still be picked out consistently.

Tuning Hyperparameters
The values of the hyperparameters in the prior distribution play an important role in the posterior. For example, in the normal prior (3), γ is the hyperparameter whose value controls the degree of shrinkage. It is often fixed at a "large" value or given a hyperprior. However, the choice of the "large" value affects the results, as does the choice of hyperprior, such as a gamma prior, particularly in the high-dimensional case. Similarly, in the DL prior (7), the choice of a is critical: if a is too small, the DL prior shrinks every dimension of β towards zero, while if a is too large, there is no strong concentration around the origin. Instead of fixing a, a discrete uniform prior can be placed on a, supported at several points in some interval (for example, [1/max(n, p), 1/2]). However, introducing a hyperprior for the hyperparameters not only introduces new values to tune, but also increases the complexity of the MCMC sampling.
In practice, although the specification of a p-dimensional prior on β may be difficult, prior information on some univariate function of the parameters may be easier to obtain. The motivation is to incorporate such prior information on a one-dimensional function into the prior on the p-dimensional β.
In this paper, we propose an intuitive way to tune the values of the hyperparameters by incorporating a prior on R^2 (the coefficient of determination). Practically, a scientist may have information on R^2 from previous experiments, and this can be coerced into, say, a Beta(a, b) distribution. In this way, tuning the hyperparameters is equivalent to searching for the hyperparameter whose induced distribution of R^2 is closest to the desired distribution. Intuitively, if we fix any value for b, then as we increase a, R^2 approaches 1; hence this controls the total size of the signal anticipated in the data. As we will see shortly, it is a prior on the value of the quadratic form β^T X^T Xβ. Combining this with the choice of prior also gives the degree of sparsity. For example, with a Dirichlet-Laplace prior, the parameter of the DL distribution controls how this total signal is distributed among the coefficients: either to a few coefficients, giving a sparse model, or to many coefficients, giving a dense model. In many cases, a scientist may have conducted many similar experiments before and can look back at the values of the sample coefficient of determination from all of those studies. Treating these as a sample from a Beta distribution, the hyperparameters a and b can be obtained from this fit. Without any prior information on R^2, a uniform prior, Beta(1, 1), may be used as the default.
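The "look back at previous studies" step can be sketched directly: given sample R² values from earlier, comparable experiments, fit the Beta target by maximum likelihood. The numbers below are purely hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical sample coefficients of determination from previous,
# similar experiments (illustrative values only).
past_r2 = np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.58, 0.74, 0.60])

# Fit a Beta(a, b) target distribution; fixing loc = 0 and scale = 1
# keeps the support on the unit interval.
a, b, _, _ = stats.beta.fit(past_r2, floc=0.0, fscale=1.0)
print(round(a, 2), round(b, 2))
```

The fitted (a, b) then serve as the target distribution for the tuning procedure described below; with no historical data, Beta(1, 1) (uniform) is the default.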
For the linear regression model (1), the population form can be represented as y = x^T β + ε, with x independent of ε. Let σ_y^2 be the marginal variance of y and σ^2 be the variance of the random error term. The population R^2 is defined as

R^2 = (σ_y^2 − σ^2) / σ_y^2,

which is the proportion of the variation of y in the population explained by the independent variables. Furthermore, for fixed β, it follows that σ_y^2 = β^T Cov(x) β + σ^2. Assuming E(x) = 0, we can estimate Cov(x) by X^T X/n, so that R^2 as a function of β and σ^2 is given by

R^2(β, σ^2) = β^T (X^T X/n) β / {β^T (X^T X/n) β + σ^2}.

Given that the form of the prior distributions considered includes σ in the scale, we can write β = ση, where η has the distribution of the prior with σ^2 = 1 fixed. Hence

R^2 = η^T (X^T X/n) η / {η^T (X^T X/n) η + 1}.    (9)

For a specified prior on η, the induced distribution of R^2 can be derived from (9).
The hyperparameter values yielding the induced distribution of R^2 closest to the desired distribution are then the tuned values. To build intuition, we give a simple example. Suppose σ^2 = 1 and we have an intercept-only model, i.e., model (1) simplifies to Y = 1_n β + ε, with 1_n the n-dimensional vector of ones. Then a Beta(a, b) distribution on R^2 in (9) induces the prior density on β

π(β) = {Γ(a + b) / (Γ(a) Γ(b))} |β|^{2a−1} / (1 + β^2)^{a+b},

where Γ(·) denotes the gamma function. The left panel of Figure 1 shows the distribution on R^2 for 4 choices of hyperparameters in the Beta distribution, while the right panel shows the corresponding induced prior distribution on β. We see that for a uniform distribution on R^2, we obtain a distribution on β that puts its mass slightly skewed away from zero on each side. For a bathtub distribution (a = b = 0.5), the induced prior reduces to the Cauchy distribution, whose heavy tails produce R^2 near one and whose peak around zero produces R^2 near zero. We also see two other extremes: as a → 0 with b = 1 fixed, we obtain a distribution that decays very quickly and puts most of its mass around zero, as expected; while as b → 0 with a fixed at 1, we obtain a density proportional to |β|/(1 + β^2), allowing larger values of β with high probability.
In practice, one can consider a grid of possible values of the hyperparameters. For each value, draw a sample of vectors η and convert each draw to a draw of R^2. A comparison between the sample of R^2 values and the desired distribution is then performed, for example via a Kolmogorov-Smirnov (KS) test, and the best-fitting hyperparameter value is chosen. The whole tuning process involves only the prior distributions, with no MCMC sampling, thus avoiding intensive computation.
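The grid-search procedure above can be sketched as follows. This is a hedged illustration: the function names are my own, the KS statistic serves as the discrepancy, and the example tunes γ in the normal prior (3), for which prior draws with σ = 1 are η ∼ N(0, (1/γ) I_p).

```python
import numpy as np
from scipy import stats

def induced_r2(eta_draws, XtX_n):
    """Map prior draws of eta to draws of R^2 via equation (9):
    R^2 = q / (q + 1) with q = eta^T (X^T X / n) eta."""
    q = np.einsum("ij,jk,ik->i", eta_draws, XtX_n, eta_draws)
    return q / (q + 1.0)

def tune_by_r2(draw_eta, XtX_n, grid, target=(1.0, 1.0), ndraw=2000, seed=0):
    """Pick the hyperparameter on `grid` whose induced R^2 sample is
    closest (one-sample KS statistic) to the Beta target.
    `draw_eta(h, m, rng)` draws m eta vectors from the prior with sigma = 1."""
    rng = np.random.default_rng(seed)
    a, b = target
    best, best_ks = None, np.inf
    for h in grid:
        r2 = induced_r2(draw_eta(h, ndraw, rng), XtX_n)
        ks = stats.kstest(r2, "beta", args=(a, b)).statistic
        if ks < best_ks:
            best, best_ks = h, ks
    return best

# Example: tune gamma in the normal prior, eta ~ N(0, (1/gamma) I_p).
rng0 = np.random.default_rng(1)
n, p = 60, 20
X = rng0.standard_normal((n, p))
XtX_n = X.T @ X / n
normal_eta = lambda g, m, rng: rng.standard_normal((m, p)) / np.sqrt(g)
gamma_star = tune_by_r2(normal_eta, XtX_n, grid=[0.1, 1.0, 5.0, 20.0, 100.0])
print(gamma_star)
```

With a uniform Beta(1, 1) target, the selected γ lands near p, consistent with the rule of thumb discussed in Section 6.2.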
Alternatively, given a specific prior for β, the exact induced distribution of R^2 can sometimes be derived from (9) as a function of the hyperparameters. By minimizing the Kullback-Leibler directed divergence between this distribution and the desired distribution (a Beta distribution by default), a default hyperparameter value can be found. For continuous random variables with density functions f_1 and f_2, the KL divergence is defined as

KL(f_1 || f_2) = ∫ f_1(x) log{f_1(x) / f_2(x)} dx.

For the choice of normal priors, the following theorem shows that there is a closed form solution for the hyperparameter minimizing the KL divergence for large p.
Theorem 5. For the normal prior in (3), to minimize the KL directed divergence between the induced distribution of R^2 and the Beta(a, b) distribution as p → ∞, the hyperparameter γ has a closed form solution expressed in terms of C = P^2/9 − Q/3, A = PQ/6 − P^3/27 − R/2, B = A^2 − C^3 ≥ 0, and the eigenvalues d_1, · · · , d_p of X^T X/n.

In principle, for other continuous priors, one can derive the optimal hyperparameters similarly; however, the calculation can be quite complex, in which case the simulation-based approach discussed earlier can be implemented. Moreover, since GL priors can be represented as normal scale mixtures (see Section 1), matching their prior precision with that of the normal prior allows the default solution in Theorem 5 to offer an intuitive starting value for the hyperparameters of GL shrinkage priors.
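The precision-matching idea in the last sentence can be checked by Monte Carlo: estimate the DL marginal variance Var(β_j) and compare it with the normal-prior variance 1/γ. This is a sketch under stated assumptions (σ = 1, the hierarchy (7)); the grid of a values and the target γ are illustrative, not from the paper.

```python
import numpy as np

def dl_marginal_variance(p, a, ndraw=100_000, seed=0):
    """Monte Carlo estimate of Var(beta_j) under the DL prior (7) with
    sigma = 1. By symmetry, one Dirichlet coordinate suffices."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(p, a), size=ndraw)[:, 0]
    tau = rng.gamma(shape=p * a, scale=2.0, size=ndraw)  # Ga(pa, rate 1/2)
    beta = rng.laplace(loc=0.0, scale=phi * tau)
    return beta.var()

# Match the DL prior precision to a normal prior N(0, (1/gamma) I_p):
# scan a and compare the marginal variance with the target 1/gamma.
p, gamma = 100, 50.0
variances = {a: dl_marginal_variance(p, a) for a in (0.01, 0.05, 0.2)}
print(variances, "target:", 1.0 / gamma)
```

Since the marginal variance increases with a, a simple bisection over a against the target 1/γ gives the matched hyperparameter.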
Simulation Results

Comparisons of Different Priors
To compare the performance of the penalized credible region variable selection method using different shrinkage priors, including the normal prior (3), Laplace prior (6), and DL prior (7), a simulation study is conducted. Bondell & Reich (2012) demonstrated the improvement in performance of the credible region approach using the normal prior over both Bayesian and Frequentist approaches, such as SSVS, Lasso, adaptive Lasso, Dantzig Selector, and SCAD. Given the previous comparisons, the focus here is to see if replacing the normal prior with the global-local prior can even further improve the performance of the credible region variable selection approach.
We use a simulation setup similar to Bondell & Reich (2012). In each setting, 200 datasets are simulated from the linear model (1) with σ^2 = 1, sample size n = 60, and the number of predictors p varying in {50, 500, 1000}. To represent different correlation settings, the X_ij are generated from the standard normal distribution, with the correlation between x_{ij_1} and x_{ij_2} equal to ρ^{|j_1 − j_2|}, for ρ = 0.5 and 0.9. The true coefficient vector is β = (0_{10}^T, B_1^T, 0_{20}^T, B_2^T, 0_{p−40}^T)^T for p ∈ {50, 500, 1000}, where 0_k represents the k-dimensional zero vector, and B_1 and B_2 are both 5-dimensional vectors generated component-wise from the uniform distribution on (0, 1). For each choice of shrinkage prior, the posterior mean and covariance are obtained from the Gibbs sampler and then plugged into the optimization algorithm (5). The compared credible region methods are listed below:

• Method "Normal hyper" refers to the normal prior with "non-informative" hyperparameters, i.e., N(0, σ_b^2) is the prior for β, and an IG(0.001, 0.001) prior is given for σ_b^2.
• Method "Normal tune", refers to the normal prior (3), where γ is tuned through the R 2 method introduced in Section 5, with a target of uniform distribution.
• Method "Laplace tune", means Laplace prior (6), and λ is tuned through the R 2 method introduced in Section 5, with a target of uniform distribution.
• Method "DL hyper" is the DL prior (7), in which a is given a discrete uniform prior supported on the interval [1/max(n, p), 1/2] with 1000 support points in this interval.
• Method "DL tune" is the DL prior (7), in which a is tuned through the R 2 method introduced in Section 5, with a target of uniform distribution.
In all of the above cases, the variance term σ^2 is given an IG(0.001, 0.001) prior. In addition, we show the results from the Lasso (Tibshirani 1996) fit via the LARS algorithm (Efron et al. 2004). For each of the above priors (normal, Laplace, and DL), we ran the MCMC chain (Gibbs sampling) for 15,000 iterations, with the first 5,000 as burn-in. The posterior mean and covariance were calculated from the remaining 10,000 samples and plugged into the penalized credible region optimization algorithm (5) to conduct variable selection. Table 1 gives the mean and standard error of the area under the ROC and PRC curves for p = 50 with ρ ∈ {0.5, 0.9}. In addition, Figure 2 plots the mean ROC and PRC curves over the 200 datasets for the selected methods. Table 2 and Figure 3 give the results for the p = 500 case, and Table 3 and Figure 4 for the p = 1000 case. Since the Lasso can select at most min{n, p} predictors, when p = 500 or 1000 the ROC and PRC curves cannot be fully constructed; the areas under the curves therefore cannot be compared directly between the Lasso and the other methods and are omitted from Tables 2 and 3, but the partial ROC and PRC curves are still plotted in Figures 3 and 4. In terms of whether giving the hyperparameter a hyperprior or tuning it through the R^2 method proposed in Section 5 leads to better posterior performance, one can compare each "* hyper" and "* tune" pair in Tables 1, 2, and 3. In general, for all three priors, the tuning method leads to significantly better posterior performance than the hyperprior method in all simulation setups.
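For concreteness, the ROC and PRC summaries used in the tables can be computed by scoring each predictor and comparing against the true support. This is a hedged sketch, not the authors' evaluation code: here the score is proxied by the magnitude of a noisy coefficient estimate rather than the position along the credible region path, and the noise levels are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
p = 50
truth = np.zeros(p, dtype=int)
truth[10:15] = 1  # positions of B_1 in the simulation design
truth[35:40] = 1  # positions of B_2

# Hypothetical per-predictor scores: signals get a uniform (0.5, 1.5)
# effect, everything is perturbed by small noise.
est = truth * rng.uniform(0.5, 1.5, p) + rng.normal(0.0, 0.1, p)
scores = np.abs(est)

auc_roc = roc_auc_score(truth, scores)   # area under the ROC curve
auc_prc = average_precision_score(truth, scores)  # PRC summary
print(round(auc_roc, 3), round(auc_prc, 3))
```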
On the other hand, in terms of comparing the performance of the different priors within the penalized credible region variable selection framework, combining the tables and figures, the PRC curve is arguably a more appropriate measure than the ROC curve in these sparse settings. For example, when p = 1000, in both the ρ = 0.5 and 0.9 cases, the PRC curves in Figure 4 show that the DL prior is significantly better than the normal prior. The ROC curve of the normal prior is higher when the FPR (or 1 − Specificity) is large; however, when the FPR is small (the region of most interest), the DL prior still leads to significantly larger sensitivity than the normal prior. Overall, the DL prior outperforms the normal prior, as does the Laplace prior.

Additional Simulations on Hyperparameter Tuning
To examine the role of a in the DL prior, additional simulations were conducted. Table 4 gives the average squared error of the posterior mean, based on the same 200 datasets as in Section 6.1, for DL priors with a fixed at 1/2, 1/n, and 1/p. The results show that when p is large or there is strong correlation among the predictors, a = 1/n performs better than a = 1/2; when p is small and the correlation is only moderate, a = 1/2 is recommended.
Since the performance of different values of a varies depending on the number of predictors and their correlation structure, fixing a is difficult; thus either giving a hyperprior for a or tuning a via the R^2 method of Section 5 is suggested. Furthermore, to verify Theorem 5, additional calculations were performed. For each of the 200 datasets above, "Normal tune" returns a "best" tuned γ via the practical procedure introduced in Section 5, which we call the "Tuned" value. By Theorem 5, the theoretical "best" γ can also be derived from the eigenvalues of X^T X/n for each dataset; we call this the "Derived" value. In addition, since the design matrix X in each dataset is generated from a multivariate normal distribution with a specific, fixed covariance structure, the eigenvalues of that true covariance matrix, instead of X^T X/n, can be used to derive the theoretical "best" γ; we call this the "Theoretic" value. Table 5 gives the "Theoretic" value and the means of the "Derived" and "Tuned" values, together with their standard errors across the 200 datasets, for the setups ρ = 0.5 and 0.9. In general, the three values are similar, and all are close to the value of p. So in practice, γ can be set to the "Derived" value based on the eigenvalues of X^T X/n, or, for simplicity, γ = p can be used.

Real Data Analysis
We now analyze data on mouse gene expression from the experiment conducted by Lan et al. (2006). There were 60 arrays, from 31 female and 29 male mice, monitoring the expression levels of 22,575 genes. Quantitative real-time PCR was used to measure several physiological phenotypes, including amounts of phosphoenolpyruvate carboxykinase (PEPCK), glycerol-3-phosphate acyltransferase (GPAT), and stearoyl-CoA desaturase 1 (SCD1). The gene expression data and the phenotypic data can be found at GEO (http://www.ncbi.nlm.nih.gov/geo; accession number GSE3330).
First, by ordering the magnitudes of the marginal correlations between the genes and each of the three responses from largest to smallest, the 22,575 genes were screened down to the top 999 genes, reducing the number of candidate predictors for each of the three linear regressions.
Note that the top 999 genes were not the same for the three responses. For each of the three regressions, the dataset is thus composed of n = 60 observations and p = 1,000 predictors (gender along with the 999 genes). After the screening, the Lasso estimator and the penalized credible region method with the normal, Laplace, and DL priors were applied. The hyperparameters in those prior distributions were tuned through the R^2 method introduced in Section 5, with a uniform target distribution.
To evaluate the performance of the proposed approach, the first step was to randomly split the 60 observations into a training set of size 55 and a testing set of size 5. The stopping rule was BIC: the selected model was the one with smallest BIC among all models with fewer than 30 predictors. The selected model was then used to predict the 5 held-out observations, yielding the prediction error. We repeated this for 100 replicates in order to compare the prediction errors. Table 6 shows the mean squared prediction error (with its standard error) based on the 100 random splits of the data; the mean selected model size (with its standard error) is also included. Overall, the results show that the proposed penalized credible region selection method using global-local shrinkage priors such as the DL prior performs well. For all three responses, the penalized credible region approach with the DL prior performs better than the Lasso estimator while selecting fewer predictors. For PEPCK and SCD1, the DL prior performs significantly better than the normal and Laplace priors; for GPAT, there is no significant difference between the normal and DL priors. In all, for this dataset, the proposed approach generally improves performance by replacing the normal prior with the DL prior.

Discussion
In this paper, we extend the penalized credible region variable selection approach by using global-local shrinkage priors. Simulation studies show that the GL shrinkage priors outperform the original normal prior. Our main results also include modifying the Dirichlet-Laplace prior to accommodate the linear regression model, rather than the simple normal means problem as in Bhattacharya et al. (2015). In theory, we obtain selection consistency for the penalized credible region method using global-local shrinkage priors when p = o(n).
Posterior consistency for the normal and DL priors is also shown.
Furthermore, this paper introduces a new default method to tune the hyperparameters in prior distributions based on the induced prior distribution of R 2 . The hyperparameter is chosen to minimize a discrepancy between the induced distribution of R 2 and a default Beta distribution. For the normal prior, a closed form of the hyperparameters is derived.
This method is straightforward and efficient as it only involves the prior distributions. A simulation study illustrates that our proposed tuning method improves upon the usual hyperprior method.
We now derive the marginal distribution of β_j. Without loss of generality, assume σ = 1. By Proposition 3.1 in Bhattacharya et al. (2015), the marginal density function of β_j, for any 1 ≤ j ≤ p, can be expressed in terms of the modified Bessel function of the second kind,

K_ν(x) = {Γ(ν + 1/2)(2x)^ν / √π} ∫_0^∞ cos t / (t^2 + x^2)^{ν+1/2} dt.
Note: this cannot be ensured for every j = 1, · · · , p.

Proof. Suppose c_n/(p_n log n) → c. Since the solution occurs on the boundary of the credible set, we have n(β_n − β̂_n)^T Σ̃_n^{-1} (β_n − β̂_n) = c_n. Multiplying both sides by 1/(p_n log n), the right hand side satisfies c_n/(p_n log n) → c. So (n/(p_n log n)) (β_n − β̂_n)^T Σ̃_n^{-1} (β_n − β̂_n) → c. Then the w_j^* Z_j^* are bounded in probability.
Since, from Assumption (A5), (p_n/n) T_n = O(1), we have that (p_n/n) S(β_n^*) is bounded in probability. But (p_n/n) S(β̃_n) → ∞, as shown above. Hence there exists n large enough that (p_n/n) S(β_n^*) < (p_n/n) S(β̃_n), and thus β̃_n with (n/p_n)(β̃_n1 − β̂_n1) → ∞ cannot be the minimizer.

Therefore, (n/p_n)(β̃_nj − β̂_nj) → ∞ can hold only for j ∈ A_n^0.

Proof of Theorem 3
Proof. By assumption, p_n = o(n/ log n) and c_n/(p_n log n) → c ∈ (0, ∞). This implies that c_n/p_n → ∞ and c_n/n → 0. From Lemma 1, if c_n/p_n → ∞, then the true β_n^0 is contained in the credible region with probability tending to one.
Furthermore, if c_n/n → 0, it must follow that the credible region itself is shrinking around β̂_n. Since the solution occurs on the boundary of the credible set,

(β_n − β̂_n)^T Σ_n^{-1} (β_n − β̂_n) = (β_n − β̂_n)^T {(X_n^T X_n + γ_n I_{p_n}) / σ̂_n^2} (β_n − β̂_n) = c_n,

or equivalently, (β_n − β̂_n)^T (X_n^T X_n + γ_n I_{p_n})(β_n − β̂_n) = c_n σ̂_n^2. Multiplying both sides by 1/n, the right hand side satisfies c_n σ̂_n^2 / n → 0. The left hand side becomes (β_n − β̂_n)^T {(X_n^T X_n + γ_n I_{p_n})/n} (β_n − β̂_n), so it also goes to zero. Together with Assumption (A2) and γ_n = o(n), this shows the credible region is shrinking around β̂_n. This implies that, with probability tending to one, for all j ∈ A_n^0 the credible region is bounded away from zero in that direction. Hence we have that P(A_n^0 ∩ A_n^c ≠ ∅) → 0.
Consider the first term on the left hand side. Since β_n1^0 = 0, we have β̂_n1 → 0, where c_n/(p_n log n) → c. Note that by Lemma 3, both the numerator and denominator in (17) are O(1). Also, by Lemma 4, the denominator cannot be 0, due to the presence of (β_{np_n} − β̂_{np_n}).