Variance prior forms for high-dimensional Bayesian variable selection

Consider the problem of high-dimensional variable selection for the Gaussian linear model when the unknown error variance is also of interest. In this paper, we show that the use of conjugate shrinkage priors for Bayesian variable selection can have detrimental consequences for such variance estimation. Such priors are often motivated by the invariance argument of Jeffreys (1961). Revisiting this work, however, we highlight a caveat that Jeffreys himself noticed; namely, that biased estimators can result from inducing dependence between parameters a priori. In a similar way, we show that conjugate priors for linear regression, which induce prior dependence, can lead to underestimation of the error variance in the high-dimensional Bayesian regression setting. Following Jeffreys, we recommend as a remedy treating the regression coefficients and the error variance as independent a priori. Using such an independence prior framework, we extend the Spike-and-Slab Lasso of Ročková and George (2018) to the unknown variance case. This extended procedure outperforms both the fixed variance approach and alternative penalized likelihood methods on simulated data. On the protein activity dataset of Clyde and Parmigiani (1998), the Spike-and-Slab Lasso with unknown variance achieves lower cross-validation error than alternative penalized likelihood methods, demonstrating the gains in predictive accuracy afforded by simultaneous error variance estimation.


Introduction
Consider the classical linear regression model
Y = Xβ + ε, ε ~ N_n(0, σ²I_n), (1.1)
where Y ∈ R^n is a vector of responses, X = [X_1, . . . , X_p] ∈ R^{n×p} is a fixed regression matrix of p potential predictors, β = (β_1, . . . , β_p)^T ∈ R^p is a vector of unknown regression coefficients, and ε ∈ R^n is the noise vector of independent normal random variables with σ² as their unknown common variance. When β is sparse so that most of its elements are zero or negligible, finding the non-negligible elements of β, the so-called variable selection problem, is of particular importance.
Whilst this problem has been studied extensively from both frequentist and Bayesian perspectives, much less attention has been given to the simultaneous estimation of the error variance σ². Accurate estimates of σ² are important to discourage fitting the noise beyond the signal, thereby helping to mitigate overfitting of the data. Variance estimation is also essential in uncertainty quantification for inference and prediction.
In the frequentist literature, the question of estimating the error variance in our setting has begun to be addressed with papers including the scaled Lasso (Sun and Zhang, 2012) and the square-root Lasso (Belloni et al., 2014). Contrastingly, in the Bayesian literature, the error variance has been fairly straightforwardly estimated by including σ 2 in prior specifications. Despite this conceptual simplicity, the majority of theoretical guarantees for Bayesian procedures restrict attention to the case of known σ 2 , as there is not a generally agreed upon prior specification when σ 2 is unknown. More specifically, priors on β and σ 2 are typically introduced in one of two ways: either via a conjugate prior framework or via an independence prior framework.
Conjugate priors have played a major role in regression analyses. The conjugate prior framework for (1.1) begins with specifying a prior on β that depends on σ² as follows:
β | σ², V ~ N_p(0, σ²V), (1.2)
where V may be fixed or random. This prior (1.2) results in a Gaussian posterior for β and as such is conjugate. To complete the framework, σ² is assigned an inverse-gamma (or equivalently scaled-inverse-χ²) prior. A common choice in this regard is the right-Haar prior for the location-scale group (Berger et al., 1998):
π(σ²) ∝ 1/σ². (1.3)
Whilst the right-Haar prior is improper, it can be viewed as the limit of an inverse-gamma density. When combined with (1.2), the prior (1.3) results in an inverse-gamma posterior for σ² and as such it behaves as a conjugate prior. Prominent examples that utilize the above conjugate prior framework include:
• Bayesian ridge regression priors, with V = τ²I;
• Zellner's g-prior, with V = g(X^TX)^{-1}; and
• Gaussian global-local shrinkage priors, with V = τ²Λ, for Λ = diag{λ_j}_{j=1}^p.
We note that the conjugate prior framework refers only to the prior characterization of β and σ 2 , and allows for any prior specification on subsequent hyper-parameters such as g and τ 2 which do not appear in the likelihood.
A main reason for the popularity of the conjugate prior framework is that it often allows for marginalization over β and σ², resulting in closed-form expressions for Bayes factors and updates of posterior model probabilities. This has allowed for analyses of model selection consistency (Bayarri et al., 2012) as well as more computationally efficient MCMC algorithms (George and McCulloch, 1997). Despite these advantages, however, the conjugate prior framework is not innocuous for variance estimation, as we will show in this work.
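To make the marginalization advantage concrete, under (1.2) combined with the right-Haar prior (1.3) the marginal likelihood of the data is available in closed form; the following is a standard calculation, with the shorthand M_V introduced here only for exposition:

```latex
% Marginal likelihood under the conjugate prior (1.2)-(1.3); M_V is shorthand for exposition.
\begin{aligned}
Y \mid \sigma^2, V &\sim N_n\!\left(0,\; \sigma^2 M_V\right), \qquad M_V = I_n + X V X^T,\\[2pt]
m(Y \mid V) &= \int_0^\infty N_n\!\left(Y;\, 0,\; \sigma^2 M_V\right)\, \frac{d\sigma^2}{\sigma^2}
\;\propto\; |M_V|^{-1/2}\,\bigl(Y^T M_V^{-1} Y\bigr)^{-n/2},
\end{aligned}
```

so that Bayes factors between candidate models reduce to ratios of such closed-form quantities.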
Alternatively to the conjugate prior framework, one might treat β and σ² as independent a priori. The formulation corresponding to (1.2) for this independence prior framework is:
β | V ~ N_p(0, V), π(σ) ∝ 1/σ. (1.4)
Note that the prior characterization (1.4) does not yield a normal inverse-gamma posterior distribution on (β, σ²) and as such is not conjugate. In addition to the above prior frameworks, Bayesian methods for variable selection can be further categorized by the way they treat negligible predictors. Discrete component Bayesian methods for variable selection exclude negligible predictors from consideration, adaptively reducing the dimension of β. Examples of such discrete component methods include spike-and-slab priors where the "spike" distribution is a point-mass at zero (Mitchell and Beauchamp, 1988). In contrast, continuous Bayesian methods for variable selection shrink, rather than exclude, negligible predictors and as such β remains p-dimensional (George and McCulloch, 1993; Polson and Scott, 2010; Ročková and George, 2014).
In this paper, we show that for continuous Bayesian variable selection methods, the conjugate prior framework can result in underestimation of the error variance when: (i) the regression coefficients β are sparse; and (ii) p is of the same order as, or larger than n. Intuitively, conjugate priors implicitly add p "pseudo-observations" to the posterior which can distort inference for the error variance when the true number of non-zero β is much smaller than p. This is not the case for discrete component methods which adaptively reduce the size of β. To avoid the underestimation problem in the continuous case, we recommend the use of independent priors on β and σ 2 . Further, we extend the Spike-and-Slab Lasso of Ročková and George (2018) to the unknown variance case with an independent prior formulation, and highlight the performance gains over the known variance case via a simulation study. On the protein activity dataset of Clyde and Parmigiani (1998), we demonstrate the benefit of simultaneous variance estimation for both variable selection and prediction. The implementation of the Spike-and-Slab Lasso is publicly available in the R package SSLASSO (Ročková and Moran, 2017).
It is important to note the difference in the scope of this work with previous work on variance priors, including Gelman (2004); Bayarri et al. (2012); Liang et al. (2008). Here, we are focused on the estimation of the error variance, σ². In contrast, the aforementioned works are concerned with the choice of priors for hyper-parameters which do not appear in the likelihood, i.e. the g in the g-prior, and τ² and λ_j² for global-local priors. We recognize the importance of the choice of these priors for Bayesian variable selection; however, the focus of this paper is the prior choice for the error variance in conjunction with variable selection.
We also note that our discussion considers only Gaussian-related prior forms for the regression coefficients. Despite this seemingly limited scope, we note that the majority of priors used in Bayesian variable selection can be cast as a scale-mixture of Gaussians, and that popular frequentist procedures such as the Lasso and variants thereof also fall under this framework.
The paper is structured as follows. In Section 2, we discuss invariance arguments for conjugate priors and draw connections with Jeffreys priors. We then highlight situations where we ought to depart from Jeffreys priors; namely, in multivariate situations. In Section 3, we take Bayesian ridge regression as an example to highlight why conjugate priors can be a poor choice. In Section 4, we draw connections between Bayesian regression and concurrent developments in variance estimation in the penalized likelihood literature. In Section 5, we examine the mechanisms of the Gaussian global-local shrinkage framework and illustrate why they can be incompatible with the conjugate prior structure. In Section 6, we consider the Spike-and-Slab Lasso of Ročková and George (2018) and highlight how the conjugate prior yields poor estimates of the error variance. We then extend the procedure to the unknown variance case using an independent prior structure and demonstrate via simulation studies how this leads to performance gains over not only the known variance case, but also a variety of other variable selection procedures. In Section 7, we apply the Spike-and-Slab Lasso with unknown variance to the protein activity dataset of Clyde and Parmigiani (1998), highlighting the improved predictive performance afforded by simultaneous variance estimation. We conclude with a discussion in Section 8.

Invariance Criteria
A common argument used in favor of the conjugate prior for Bayesian linear regression is that it is invariant to scale transformations of the response (Bayarri et al., 2012). That is, the regression coefficients depend a priori on σ² in a "scale-free way" through π(β | σ²) = σ^{-p} h(β/σ) for some proper density function h(·). This means that the units of measurement used for the response do not affect the resultant estimates; for example, if Y is scaled by a factor of c, one would expect the estimates of the regression coefficients β to be scaled by c and the estimate of the error variance σ² to be scaled by c².
A more general principle of invariance was proposed by Jeffreys (1961) in his seminal work, The Theory of Probability, a reference which is also sometimes given for the conjugate prior. In this section, we examine the original invariance argument of Jeffreys (1961) and highlight a caveat with this principle that the author himself noted; namely, that it should be avoided in multivariate situations. We then draw connections between this suboptimal multivariate behavior and the conjugate prior framework, ultimately arguing, similarly to Jeffreys, that we should treat the mean and variance parameters as independent a priori.

Jeffreys Priors
For a parameter α, the Jeffreys prior is π(α) ∝ |I(α)|^{1/2}, where I(α) is the Fisher information matrix. The main motivation given by Jeffreys (1961) for these priors was that they are invariant under all nonsingular transformations of the parameters. This property appeals to intuition regarding objectivity; ideally, the prior information we decide to include should not depend upon the choice of the parameterization, which itself is arbitrary. Despite this intuitively appealing property, the following problem with this principle was spotted in the original work of Jeffreys (1961) and later re-emphasized by Robert et al. (2009) in their revisit of the work. Consider the model Y_i | μ, σ² ~ N(μ, σ²), independently for i = 1, . . . , n. If we treat the parameters μ and σ independently, the Jeffreys priors are π(μ) ∝ 1 and π(σ) ∝ 1/σ. However, if the parameters are considered jointly, the Jeffreys prior is π(μ, σ) ∝ 1/σ². This discrepancy is exaggerated when we include more parameters. In effect, by considering the parameters jointly as opposed to independently, we are implicitly including additional "pseudo-observations" of σ² and consequently distorting our estimates of the error variance.
This "pseudo-observation" interpretation can be seen explicitly in the conjugate form of the Jeffreys prior for a Gaussian likelihood. For example, suppose now we have an n-dimensional mean denoted by μ = (μ_1, . . . , μ_n). That is, Y_i | μ_i, σ² ~ N(μ_i, σ²), independently for i = 1, . . . , n. The joint Jeffreys prior π(μ, σ) ∝ 1/σ^{n+1} is an improper inverse-gamma prior with shape parameter n/2 and scale parameter zero. As the prior is conjugate, the posterior distribution for the variance is also inverse-gamma:
σ² | Y, μ ~ IG( n/2 + n/2, 0/2 + Σ_{i=1}^n (Y_i − μ_i)²/2 ), (2.3)
where the first term of both the shape and scale parameters in (2.3) are the prior hyperparameters. Thus, the dependent Jeffreys prior can be thought of as encoding knowledge of σ² from a previous experiment where there were n observations which yielded a sample variance of zero. This results in the prior concentrating around zero for large n and will severely distort posterior estimates of σ². As we shall see later, this dependent Jeffreys prior for the parameters is in some cases akin to the conjugate prior framework in (1.2). This prior dependence between the parameters is explicitly repudiated by Jeffreys (1961), who states (with notation changed to match ours): "in the usual situation in an estimation problem, μ and σ² are each capable of any value over a considerable range, and neither gives any appreciable information about the other. We should then take: π(μ, σ) = π(μ)π(σ)." That is, Jeffreys' remedy is to treat the parameters independently a priori, a recommendation which we also adopt. In addition, Jeffreys points out that the key problem with the joint Jeffreys prior is that it does not have the reduction of degrees of freedom required by the introduction of additional nuisance parameters. We shall examine this phenomenon in more detail in Section 3, where we will discuss the consequences of using dependent Jeffreys priors and other conjugate formulations in Bayesian linear regression.
We note a possible exception to this independence argument, found later in The Theory of Probability, where Jeffreys argues that for simple normal testing, the prior on μ under the alternative hypothesis should depend on σ². However, it is important to note that this recommendation is for the situation where μ is one-dimensional, so the underestimation problem observed in (2.3) does not arise. Given Jeffreys' earlier concerns regarding multivariate situations, it is unlikely he intended this dependence to generalize to higher dimensional μ.

Prior considerations
Consider again the classical linear regression model in (1.1). For a non-informative prior, it is common to use π(β, σ²) ∝ 1/σ² (see, for example, Gelman et al., 2014). Similarly to our earlier discussion, this prior choice corresponds to multiplying the independent Jeffreys priors for β and σ. In contrast, the joint Jeffreys prior would be π(β, σ²) ∝ 1/σ^{p+2}. Let us now examine the estimates resulting from the former, independent Jeffreys prior. In this case, we have the following marginal posterior mean estimate for the error variance:
E[σ² | Y] = ||Y − Xβ̂||² / (n − p − 2), (3.1)
where β̂ = (X^TX)^{-1}X^TY is the usual least squares estimator. We observe that the degrees of freedom adjustment, n − p, naturally appears in the denominator. This does not occur for the joint Jeffreys prior, where the marginal posterior mean is given by:
E[σ² | Y] = ||Y − Xβ̂||² / (n − 2). (3.2)
For large p, this estimator with the joint Jeffreys prior will severely underestimate the error variance. For this reason, it is commonly accepted that the independent Jeffreys prior π(β, σ²) ∝ 1/σ² should be the default non-informative prior in this setting.
There is no such clarity, however, in the use of conjugate priors for Bayesian linear regression. To add to this discourse, we show that these conjugate priors can suffer the same problem as the dependent Jeffreys priors and recommend, similarly to Jeffreys, that independent priors should be used instead. We make this point with the following example. A common conjugate prior choice for Bayesian linear regression is
β | σ², τ² ~ N_p(0, σ²τ²I). (3.3)
If we consider the parameter τ² to be fixed, this prior choice corresponds to Bayesian ridge regression. With an additional non-informative prior π(σ²) ∝ 1/σ², we then have the joint prior
π(β, σ²) ∝ (1/σ^{p+2}) exp{ −||β||² / (2σ²τ²) }. (3.4)
Note again the σ^{p+2} in the denominator, similarly to the joint Jeffreys prior. It is illustrative to consider the conditional prior dependence here of σ² on β from a "pseudo-observation" perspective: the implicit conditional prior on σ² from (3.4) is given by
σ² | β ~ IG( p/2, ||β||² / (2τ²) ). (3.5)
Similarly to the discussion in Section 2, this inverse-gamma prior has the following interpretation: from a previous experiment, the sample variance of p observations was ||β||²/(pτ²). For regions where β is sparse, this dependence leads to prior concentration of σ² around zero, as illustrated by the following.
Proposition 1. Suppose ||β||_0 = q and max_j β_j² = K for some constant K ∈ R. Denote the true variance as σ₀². Then
P( σ² ≥ εσ₀² | β ) ≤ qK / ( (p − 2) τ² ε σ₀² ). (3.6)
Proof. Proposition 1 follows from Markov's inequality and the bound ||β||² ≤ qK.
Proposition 1 implies that as q/p → 0, we can choose 0 < ε < 1 such that the prior places decreasing mass on values of σ² greater than εσ₀². Thus, in regions of bounded sparse regression coefficients, the conjugate Gaussian prior can result in poor estimation of the true variance.
From a more philosophical perspective, (3.5) corresponds to prior knowledge that the error variance is implicitly the sample variance of previous observations of the regression coefficients, β. This is troubling given that the error variance is independent of the signal and in particular of the regression coefficients.
In the next section, we conduct a simulation study for the simple case of Bayesian ridge regression and show empirically how this implicit prior on σ 2 can distort estimates of the error variance.

The failure of a conjugate prior
As an illustrative example, we take n = 100 and p = 90 and compare the least squares estimates of β and σ² to Bayesian ridge regression estimates with (i) the conjugate formulation (3.3) and (ii) the independent prior formulation with
β ~ N_p(0, τ²I). (3.7)
For both Bayesian ridge regression procedures we use the non-informative error variance prior π(σ²) ∝ 1/σ². The predictors X_i, i = 1, . . . , p, are generated as independent standard normal random variables. The true β_0 is set to be sparse with only six non-zero elements; the non-zero coefficients are set to {−2.5, −2, −1.5, 1.5, 2, 2.5}. The response Y is generated according to (1.1) with the true variance being σ² = 3. We take τ = 10 as known and highlight that this weakly informative choice leads to poor variance estimates in the conjugate prior framework. Whilst an empirical or fully Bayes approach for estimating τ² may be preferable for high-dimensional regression, it is troubling that the conjugate prior yields poor results for a simple example where n > p and in which least squares and the independent prior perform well. The conjugate prior formulation allows for exact expressions for the marginal posterior means of β and σ²:
E[β | Y] = [X^TX + τ^{-2}I]^{-1} X^TY (3.8)
and
E[σ² | Y] = Y^T(I_n − H_τ)Y / (n − 2), (3.9)
where H_τ = X[X^TX + τ^{-2}I]^{-1}X^T. Similarly to (3.2), the above marginal posterior mean for σ² does not incorporate a degrees of freedom adjustment and so we expect this estimator to underestimate the true error variance.
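For illustration, the following R snippet generates data as just described and computes the least squares variance estimate alongside the closed-form conjugate estimate (3.9); the seed and object names are ours and the numbers will vary by replication.

```r
set.seed(1)
n <- 100; p <- 90; tau <- 10; sigma2_true <- 3
beta0 <- c(-2.5, -2, -1.5, 1.5, 2, 2.5, rep(0, p - 6))
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X %*% beta0 + rnorm(n, sd = sqrt(sigma2_true)))

# Least squares estimate of sigma^2 (with the usual degrees of freedom adjustment)
fit_ls <- lm(Y ~ X - 1)
sigma2_ls <- sum(resid(fit_ls)^2) / (n - p)

# Conjugate ridge: marginal posterior mean of sigma^2, as in (3.9)
H_tau <- X %*% solve(crossprod(X) + diag(1 / tau^2, p)) %*% t(X)
sigma2_conj <- drop(t(Y) %*% (diag(n) - H_tau) %*% Y) / (n - 2)

c(least_squares = sigma2_ls, conjugate_ridge = sigma2_conj)
```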
It is illuminating to observe the underestimation problem when considering the conditional posterior mean of σ², instead of the marginal:
E[σ² | Y, β] = ( ||Y − Xβ||² + ||β||²/τ² ) / (n + p − 2). (3.10)
The additional p in the denominator here leads to severe underestimation of σ² when β is sparse and bounded as in Proposition 1 and p is of the same order as, or larger than, n, as discussed in the previous section. We note in passing that a value of τ² close to ||β||²/(pσ²), which may be obtainable with an empirical or fully Bayes approach, would avoid this variance underestimation problem, as can be seen from (3.10). This is in contrast to the conditional posterior mean for σ² using the independent prior formulation (1.4), which we also consider. This estimator is given by:
E[σ² | Y, β] = ||Y − Xβ||² / (n − 2). (3.11)
Here we do not observe a degrees of freedom adjustment because (3.11) is the conditional posterior mean, not the marginal. Earlier in (3.1) we considered the marginal posterior mean for the independent Jeffreys prior, which led to the n − p in the denominator. For the marginal posterior means of β and σ², the independent prior formulation does not yield closed-form expressions. To assess these, we use a Gibbs sampler, the details of which may be found in the appendix. When τ² is large, the estimate of β for both the conjugate and independent formulations is almost exactly the least squares estimate, β̂ = [X^TX]^{-1}X^TY. However, the estimates of the variance σ² differ substantially.
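The Gibbs sampler itself is standard; the following is a minimal sketch of one possible implementation for the independent formulation (3.7) with π(σ²) ∝ 1/σ² (the version in the appendix of the paper may differ in its details), which can be run on the data simulated above.

```r
# Gibbs sampler for Bayesian ridge regression with independent priors:
# beta ~ N_p(0, tau^2 I) and pi(sigma^2) proportional to 1/sigma^2.
gibbs_ridge <- function(X, Y, tau, n_iter = 5000, burn = 1000) {
  n <- nrow(X); p <- ncol(X)
  XtX <- crossprod(X); XtY <- crossprod(X, Y)
  sigma2 <- var(Y); beta <- rep(0, p)
  sigma2_draws <- numeric(n_iter)
  for (t in 1:n_iter) {
    # beta | sigma2, Y ~ N( A^{-1} X'Y, sigma2 A^{-1} ) with A = X'X + (sigma2/tau^2) I
    A <- XtX + diag(sigma2 / tau^2, p)
    R <- chol(A)
    mu <- backsolve(R, forwardsolve(t(R), XtY))
    beta <- drop(mu + sqrt(sigma2) * backsolve(R, rnorm(p)))
    # sigma2 | beta, Y ~ Inverse-Gamma( n/2, ||Y - X beta||^2 / 2 )
    rss <- sum((Y - X %*% beta)^2)
    sigma2 <- 1 / rgamma(1, shape = n / 2, rate = rss / 2)
    sigma2_draws[t] <- sigma2
  }
  mean(sigma2_draws[-(1:burn)])
}
# e.g. gibbs_ridge(X, Y, tau = 10)
```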
In Figure 1, we display boxplots of the estimates of σ² over 100 replications for (i) least squares, (ii) conjugate Bayesian ridge regression, (iii) Bayesian regression with Zellner's g-prior, that is (1.2) with V = g(X^TX)^{-1}, and (iv) independent Bayesian ridge regression. Here the estimates from least squares and the independent ridge are similarly centered around the truth; however, the conjugate ridge and Zellner's priors consistently underestimate the error variance, with medians of σ̂² = 0.27 and 0.55, respectively. This poor performance is a direct result of the bias induced by adding p "pseudo-observations" of σ², as discussed in Section 3.1, which also occurs for the Zellner prior. This phenomenon of underestimating σ² is also seen in EMVS (Ročková and George, 2014), which can be viewed as iterative Bayesian ridge regression with an adaptive penalty term for each regression coefficient β_j instead of the common τ² above. EMVS also uses a conjugate prior formulation in which β depends on σ² a priori, similarly to (3.3). As in the above ridge regression example, with this prior EMVS yields good estimates for β, but severely underestimates σ². This is evident in the Section 4 example of Ročková and George (2014) with n = 100 and p = 1000. There, conditionally on the modal estimate of β, the associated modal estimate of σ² is 0.0014, a severe underestimate of the true variance σ² = 3. Fortunately, EMVS can be easily modified to use the independent prior specification, as has now been done in the publicly available EMVS R package (Ročková and Moran, 2018). It is interesting to note that the SSVS procedure of George and McCulloch (1993) used an independent prior formulation in which β does not depend on σ² a priori. A natural question to ask is: how does the poor estimate of the variance in the conjugate case affect the estimated regression coefficients? Insight is obtained by comparing (3.8) to the conditional posterior mean of β in the independent case, given by:
E[β | Y, σ²] = [X^TX + σ²τ^{-2}I]^{-1} X^TY. (3.13)
In (3.8), the Gaussian prior structure allows for σ² to be factorized out so that the estimate of β does not depend on the variance. This lack of dependence on the variance is troubling, however, as we want to select fewer variables when the error variance is large, making the signal-to-noise ratio low. This is in contrast to (3.13): when σ² is large relative to τ², the signal-to-noise ratio is low and so the posterior estimate for β will be close to zero, reflecting the relative lack of information. This does not occur for the posterior mean of β in the conjugate case.

What about a prior degrees of freedom adjustment?
At this point, one may wonder: if the problem seems to be the extra σ^p in the denominator, why not use the prior π(σ²) ∝ σ^{p−4} instead of the right-Haar prior π(σ²) ∝ σ^{−2} that is commonly used? This "p-sigma" prior then results in the joint prior
π(β, σ²) ∝ (1/σ⁴) exp{ −||β||² / (2σ²τ²) }. (3.14)
We can again consider the implicit conditional prior on σ²:
σ² | β ~ IG( 1, ||β||² / (2τ²) ). (3.15)
For the simulation setup in Section 3.2, this alternative conjugate prior would in fact remedy the variance estimates of the conjugate formulation (3.3). However, the p-sigma prior suffers from other drawbacks.
In particular, the p-sigma prior can actually lead to overestimation of the error variance, as opposed to the underestimation by the conjugate prior formulation (1.2) observed in Section 3.2. This overestimation can be seen from the concentration of the prior captured in Proposition 2 below. A similar overestimation phenomenon was also found for a penalized likelihood procedure which implicitly uses a p-sigma prior, as we will discuss in Section 4.
Proposition 2. Suppose ||β||_0 = q and min_{j: β_j ≠ 0} β_j² = K for some constant K ∈ R. Denote the true variance as σ₀². Then
P( σ² ≤ εσ₀² | β ) ≤ exp{ −qK / (2τ²εσ₀²) }. (3.16)
Proof. The conditional prior (3.15) is inverse-gamma with shape parameter one, so P(σ² ≤ x | β) = exp{−||β||²/(2τ²x)}; the bound follows from ||β||² ≥ qK.
Proposition 2 implies that as q → ∞, we can choose an arbitrarily large ε > 0 and the prior will still place vanishing mass on values of σ² below εσ₀², so that σ² will overestimate the true variance. As many posterior concentration results require q → ∞, albeit at a much slower rate than p (see, for example, van der Pas et al., 2016), this is particularly troublesome.
Another concern regarding the p-sigma prior is more philosophical. As p gets larger, the p-sigma prior puts increasing mass on larger and larger values of σ 2 , which does not seem justifiable.
In contrast, the only drawback of the independent prior form is that it can be more computationally intensive. This was seen in Section 3.1 where the independent form did not yield closed form expressions for the posterior means. However, most variable selection and shrinkage problems today use more complicated global-local prior forms for the regression coefficients which are also computationally intensive, no matter whether one uses a conjugate or independent prior.
For these reasons, we prefer the independent prior forms for the regression coefficients and error variance. We are also of the opinion that the simplicity of the independent prior is in its favor.

Connections with Penalized Likelihood Methods
Here we pause briefly to examine connections between Bayesian methods and developments in estimating the error variance in the penalized regression literature. Such connections can be drawn as penalized likelihood methods are implicitly Bayesian; the penalty functions can be interpreted as priors on the regression coefficients so these procedures also in effect yield MAP estimates.
One of the first papers to consider the unknown error variance case for the Lasso was Städler et al. (2010), who suggested the following penalized loss function for introducing unknown variance into the frequentist Lasso framework:
L_λ(β, σ) = n log σ + ||Y − Xβ||² / (2σ²) + λ||β||_1 / σ. (4.1)
Optimizing this objective function is in fact equivalent to MAP estimation for a Bayesian model in which the β_j are conditionally Laplace a priori, π(β_j | σ) = (λ/2σ) e^{−λ|β_j|/σ}, combined with a p-sigma prior on σ of the kind discussed in Section 3.2. Interestingly, Sun and Zhang (2010) proved that the resulting estimator for the error variance overestimates the noise level unless λ||β*||_1/σ* = o(1), where β* and σ* are the true values of the regression coefficients and error variance, respectively. Let us examine this condition more closely. Suppose again the true dimension of β* is q, where q ≪ p, and λ = A√((2/n) log p) for some constant A > 1 (the λ required by Sun and Zhang (2012) for consistency). Suppose also that max_j |β_j*| = K for some constant K ∈ R. Then, the error variance estimate from this procedure will be upwardly biased unless λ||β*||_1 ≈ qK√((2/n) log p) is o(1).
That is, the true dimension q cannot at the same time increase at the rate required for posterior contraction and result in consistent estimates of the error variance. Note also the connection to Proposition 2: there, the prior mass on σ² will concentrate on values greater than the true variance σ₀² unless ||β||²/τ² = o(1). To resolve this issue of overestimating the error variance, Sun and Zhang (2012) proposed as an alternative the "scaled Lasso", an algorithm which minimizes the following penalized joint loss function via coordinate descent:
L_λ(β, σ) = ||Y − Xβ||² / (2nσ) + σ/2 + λ||β||_1. (4.3)
This loss function is a penalized version of Huber's concomitant loss function, and so may be viewed as performing robust high-dimensional regression. It is also equivalent to the "square-root Lasso" of Belloni et al. (2014). Minimization of the loss function (4.3) can be viewed as MAP estimation for the Bayesian model (with a slight modification):
Y | β, σ ~ N_n(Xβ, σI_n), π(β_j) = (λ/2) e^{−λ|β_j|}, σ ~ Gamma(n + 1, n/2). (4.4)
Note that to interpret the scaled Lasso as a Bayesian procedure, σ, rather than σ², plays the role of the variance in (4.4). Sun and Zhang (2012) essentially then re-interpret σ as the standard deviation again after optimization of (4.3). This re-interpretation can be thought of as an "unbiasing" step for the error variance. It is a little worrisome, however, that the implicit prior on the error variance is very informative: as n → ∞, this Gamma prior concentrates around σ = 2. Sun and Zhang (2012) proved that the scaled Lasso estimate σ̂(X, Y) is consistent for the "oracle" estimator
σ*² = ||Y − Xβ*||² / n, (4.5)
where β* are the true regression coefficients, for the value of λ_0 ∝ √((2/n) log p). This estimator (4.5) is called the oracle because it treats the true regression coefficients as if they were known. The term ||Y − Xβ*||² is then simply a sum of squared independent normal errors, so (4.5) reduces to Σ_{i=1}^n ε_i²/n. More recently, Sun and Zhang (2013) proposed a different value of λ_0 which achieves tighter error bounds than in their previous work. Specifically, they propose λ_0 = √2 L_n(k/p), where L_n(t) = n^{-1/2}Φ^{-1}(1 − t), Φ is the standard Gaussian cdf, and k is the solution to k = L_1⁴(k/p) + 2L_1²(k/p).
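To make the alternating structure of the scaled Lasso concrete, here is a minimal R sketch of its coordinate-descent outer loop using the glmnet package; the tolerance, starting value and default λ_0 below are illustrative choices of ours, not those of Sun and Zhang (2012).

```r
library(glmnet)

# Alternating minimization of the scaled Lasso loss (4.3):
# for fixed sigma, beta solves a Lasso with penalty sigma * lambda0;
# for fixed beta, the minimizing sigma is ||Y - X beta|| / sqrt(n).
scaled_lasso <- function(X, Y, lambda0, tol = 1e-4, max_iter = 100) {
  n <- nrow(X)
  sigma <- sd(Y)
  beta <- rep(0, ncol(X))
  for (iter in 1:max_iter) {
    fit <- glmnet(X, Y, lambda = sigma * lambda0,
                  standardize = FALSE, intercept = FALSE)
    beta <- as.numeric(coef(fit))[-1]   # drop the intercept entry
    sigma_new <- sqrt(sum((Y - X %*% beta)^2) / n)
    if (abs(sigma_new - sigma) < tol) break
    sigma <- sigma_new
  }
  list(beta = beta, sigma = sigma)
}
# e.g. scaled_lasso(X, Y, lambda0 = sqrt(2 * log(ncol(X)) / nrow(X)))
```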

Global-Local Shrinkage
In this section, we examine how the use of a conjugate prior affects the machinery of the Gaussian global-local shrinkage paradigm. The general structure for this class of priors is given by:
β_j | τ², λ_j² ~ N(0, τ²λ_j²), λ_j² ~ π(λ_j²), τ² ~ π(τ²), j = 1, . . . , p, (5.1)
where τ² is the "global" variance and λ_j² is the "local" variance. Note that taking τ² to be the same as the error variance σ² would result in a conjugate prior in this setting. This is exactly what was done in the original formulation of the Bayesian lasso by Park and Casella (2008), which can be recast in the Gaussian global-local shrinkage framework as follows (notation changed slightly for consistency):
β_j | σ², λ_j² ~ N(0, σ²λ_j²), λ_j² ~ Exp(λ²/2), j = 1, . . . , p. (5.2)
In the conjugate formulation in (5.2), σ² plays the dual role of representing the error variance as well as acting as the global shrinkage parameter. This is problematic in light of the mechanics of global-local shrinkage priors. Specifically, Polson and Scott (2010) give the following requirements for the global and local variances in (5.1): π(τ²) should have substantial mass near zero to shrink all the regression coefficients so that the vast majority are negligible; and π(λ_j²) should have heavy tails so that it can be quite large, allowing a few large coefficients to "escape" the heavy shrinkage of the global variance.
This heuristic is formalized in much of the shrinkage estimation theory. For the normal means problem where X = I_n and β ∈ R^n, van der Pas et al. (2016) prove that the following conditions result in the posterior recovering nonzero means at the optimal rate: (i) π(λ_j²) should be a uniformly regular varying function which does not depend on n; and (ii) τ² ≍ (q/n)² log(n/q), where q is the number of non-zero β_j. The uniformly regular varying property in (i) intuitively preserves the "flatness" of the prior even under transformations of the parameters, unlike traditional "non-informative" priors (Bhadra et al., 2016). In preserving these heavy tails, such priors for λ_j² allow a few large coefficients to be estimated. The condition (ii) requires τ² to tend to zero, which would be a concerning property if it were also the error variance. These results suggest we cannot identify the error variance with the global variance parameter on the regression coefficients as in (5.2): it cannot simultaneously shrink all the regression coefficients and be a good estimate of the residual variance. Finally, we note that Hans (2009) also considered the independent case for the Bayesian lasso, in which the error variance is not identified with the global variance.
An alternative conjugate formulation for Gaussian global-local shrinkage priors is to instead include three variance terms in the prior for β_j: the error variance, σ², the global variance, τ², and the local variance, λ_j². For example, Carvalho et al. (2010) give the conjugate form of the horseshoe prior:
β_j | σ², τ², λ_j² ~ N(0, σ²τ²λ_j²), λ_j² ~ π(λ_j²), j = 1, . . . , p. (5.3)
This prior formulation (5.3) remedies the aforementioned issue in the Bayesian lasso as it separates the roles of the error variance and global variance. However, this prior structure can still be problematic for error variance estimation. Consider the conditional posterior mean of σ² for the model (5.3) with π(σ²) ∝ 1/σ²:
E[σ² | Y, β, τ², λ²] = ( ||Y − Xβ||² + Σ_{j=1}^p β_j²/(τ²λ_j²) ) / (n + p − 2). (5.4)
Proposition 3 highlights that, given the true regression coefficients, the conditional posterior mean of σ² underestimates the oracle variance (4.5) when β is sparse.
Proposition 3. Consider the global-local prior formulation given in (5.3). Denote the true vector of regression coefficients by β*, where ||β*||_0 = q. Suppose max_j (β_j*)² = M_1 for some constant M_1 ∈ R. Denote the oracle estimator for σ given in (4.5) by σ* and suppose σ* = O(1). Suppose also that for j ∈ {1, . . . , p} with β_j* ≠ 0, we have τ²λ_j² > M_2 for some M_2 ∈ R. Then
E[σ² | Y, β*, τ², λ_j²] ≤ nσ*²/(n + p − 2) + (qM_1/M_2)/(n + p − 2). (5.5)
In particular, as p/n → ∞ and q/p → 0, we have
E[σ² | Y, β*, τ², λ_j²] → 0. (5.6)
Given the mechanics of global-local shrinkage priors, the assumption in Proposition 3 that the term τ²λ_j² is bounded from below for non-zero β_j is not unreasonable. This is because for large β_j, the local variance λ_j² must be large enough to counter the extreme shrinkage effect of τ². Indeed, the prior for λ_j² must have "heavy enough" tails to enable this phenomenon. We should note that Proposition 3 illustrates the poor performance of the posterior mean (5.4) given the true regression coefficients β*, whereas the horseshoe procedure does not actually threshold the negligible β_j to zero in the posterior mean of β. For these small β_j, the term τ²λ_j² may be very small and potentially counteract the underestimation phenomenon. However, it is still troubling to use an estimator for the error variance that does not behave as the oracle estimator when the true regression coefficients are known. This is in contrast to the independent prior formulation, where the conditional posterior mean of σ² is simply
E[σ² | Y, β] = ||Y − Xβ||² / (n − 2).
Note also that the problem of underestimation of σ² is exacerbated for modal estimation under the prior (5.3). This is because modal estimators often threshold small coefficients to zero and so the term Σ_{j=1}^p β_j²/(τ²λ_j²) becomes negligible, as in Proposition 3. As MAP estimation using global-local shrinkage priors is becoming more common (see, for example, Bhadra et al., 2017), we caution against the use of these conjugate prior forms.
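A small numerical illustration of Proposition 3 in R, using the conditional posterior mean (5.4): the values chosen for τ²λ_j² below are illustrative only, taken large for the few true signals and small for the rest.

```r
set.seed(2)
n <- 100; p <- 1000; q <- 6; sigma2_true <- 3
beta_star <- c(rep(2, q), rep(0, p - q))
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X %*% beta_star + rnorm(n, sd = sqrt(sigma2_true)))

# Illustrative conditional variances: large for the q signals, tiny elsewhere
tau2_lambda2 <- c(rep(10, q), rep(1e-4, p - q))

rss <- sum((Y - X %*% beta_star)^2)
oracle <- rss / n                                                  # oracle estimator (4.5)
conjugate <- (rss + sum(beta_star^2 / tau2_lambda2)) / (n + p - 2) # conditional mean (5.4)
independent <- rss / (n - 2)                                       # independent prior
c(oracle = oracle, conjugate = conjugate, independent = independent)
```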
A different argument for using conjugate priors with the horseshoe is given by Piironen and Vehtari (2017). They advocate for the model (5.3), arguing that it leads to a prior on the effective number of non-zero coefficients which does not depend on σ² and n. However, this quantity is derived from the posterior of β and so does not take into account the uncertainty inherent in the variable selection process. As a thought experiment: suppose that we know the error variance, σ², and the number of observations, n. If the error variance is too large and the number of observations is too small, we would not expect to be able to say much about β at all, and this intuition should be reflected in the effective number of non-zero coefficients. This point is similar to our discussion at the end of Section 3.2 regarding estimation of β.
As before, we recommend independent priors on both the error variance and regression coefficients to both prevent distortion of the global-local shrinkage mechanism and to obtain better estimates of the error variance.
Spike-and-Slab Lasso with Unknown Variance

Spike-and-Slab Lasso
We now turn to the Spike-and-Slab Lasso (SSL, Ročková and George, 2018) and consider how to incorporate the unknown variance case. As the name suggests, the SSL involves placing a mixture prior on the regression coefficients β, where each β_j is assumed a priori to be drawn from either a Laplacian "spike" concentrated around zero (and hence be considered negligible), or a diffuse Laplacian "slab" (and hence may be large). Thus the hierarchical prior over β and the latent indicator variables γ = (γ_1, . . . , γ_p) is given by
π(β | γ) = Π_{j=1}^p [ γ_j φ_1(β_j) + (1 − γ_j) φ_0(β_j) ], γ_j | θ ~ Bernoulli(θ), θ ~ Beta(a, b), (6.1)
where φ_1(β) = (λ_1/2) e^{−λ_1|β|} is the slab distribution and φ_0(β) = (λ_0/2) e^{−λ_0|β|} is the spike (λ_1 ≪ λ_0), and we have used the common exchangeable beta-binomial prior for the latent indicators.
We note that despite the use of the spike-and-slab prior typically associated with "two-group" Bayesian variable selection methods, the Spike-and-Slab Lasso can also be seen as a "one-group" method as the spike density is continuous. Similarly to the Bayesian lasso, the Spike-and-Slab Lasso can be cast in the Gaussian global-local shrinkage framework using the Gaussian scale-mixture representation of the Laplace density:
(λ/2) e^{−λ|β|} = ∫_0^∞ N(β; 0, s) (λ²/2) e^{−λ²s/2} ds. (6.3)
Although in formulation (6.3) there is only a local variance and not a global variance, van der Pas et al. (2016) show that the Spike-and-Slab Lasso can be interpreted similarly to global-local shrinkage priors. In this interpretation, the parameter θ (the proportion of non-zero β_j) essentially plays the role of the global variance in that θ shrinks the majority of β to zero. Ročková and George (2018) recast this hierarchical model into a penalized likelihood framework, allowing for the use of existing efficient algorithms for modal estimation while retaining the adaptivity inherent in the Bayesian formulation. The regression coefficients β are then estimated by the posterior mode. Ročková and George (2018) note a number of advantages in using a mixture of Laplace densities in (6.1), instead of the usual mixture of Gaussians as has been standard in the Bayesian variable selection literature. First, the Laplacian spike serves to automatically threshold modal estimates of β_j to zero when β_j is small, much like the Lasso. However, unlike the Lasso, the slab distribution in the prior serves to stabilize the larger coefficients so they are not downward biased. Additionally, the heavier Laplacian tails of the slab distribution yield optimal posterior concentration rates (Ročková, 2018).
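The adaptive shrinkage can be made concrete with a small R sketch (function names are ours): the conditional slab probability p*(β; θ) = θφ_1(β)/[θφ_1(β) + (1 − θ)φ_0(β)], which reappears in the estimation procedure below, implies a penalty weight λ*(β; θ) = λ_1 p*(β; θ) + λ_0 (1 − p*(β; θ)) that interpolates between λ_0 for negligible coefficients and λ_1 for large ones.

```r
# Mixture weight p*(beta; theta) and adaptive penalty weight lambda*(beta; theta)
# for the Spike-and-Slab Lasso; a sketch with our own function names.
laplace_density <- function(beta, lam) (lam / 2) * exp(-lam * abs(beta))

p_star <- function(beta, theta, lambda0, lambda1) {
  slab <- theta * laplace_density(beta, lambda1)
  slab / (slab + (1 - theta) * laplace_density(beta, lambda0))
}

lambda_star <- function(beta, theta, lambda0, lambda1) {
  ps <- p_star(beta, theta, lambda0, lambda1)
  lambda1 * ps + lambda0 * (1 - ps)
}

# Small coefficients receive a penalty near lambda0, large ones near lambda1:
lambda_star(c(0, 0.1, 1, 3), theta = 0.5, lambda0 = 20, lambda1 = 1)
```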
A possible route for adding the unknown variance case to the SSL procedure is to follow the prior framework of Park and Casella (2008) in their Bayesian Lasso. There, Park and Casella (2008) used the following prior for the regression coefficients:
π(β | σ²) = Π_{j=1}^p (λ / (2√σ²)) e^{−λ|β_j|/√σ²}.
In the next section, we illustrate why an analogous conjugate prior formulation for the Spike-and-Slab Lasso would be a poor choice. Afterwards, we introduce the SSL with unknown variance, which utilizes an independent prior framework.

The failure of the conjugate prior
The conjugate prior formulation for the Spike-and-Slab Lasso scales both Laplace components by σ, in analogy with Park and Casella (2008):
π(β | γ, σ²) = Π_{j=1}^p [ γ_j (λ_1/2σ) e^{−λ_1|β_j|/σ} + (1 − γ_j) (λ_0/2σ) e^{−λ_0|β_j|/σ} ], π(σ²) ∝ 1/σ².
We find the posterior modes of our parameters using the EM algorithm; the details and the updates at the (k + 1)st iteration can be found in the appendix. Let us take a closer look at the estimator of σ. Following the line of reasoning in Sun and Zhang (2010), an expert with oracle knowledge of the true regression coefficients β* would estimate the noise level by the oracle estimator (σ*)² = ||Y − Xβ*||²/n of (4.5). However, the maximum a posteriori estimate of σ at the true values of β*, γ* is of the form
σ̂_MAP = ( τ + √(τ² + (σ*)²) ) / ( 1 + p/n + 2/n ), (6.17)
where τ ≥ 0 depends on the penalty terms and vanishes as n → ∞ for fixed β*. Here we see that if n → ∞ with p fixed, we have σ̂_MAP → σ*. If, however, we have p/n → ∞ and q/p → 0, where the underlying sparsity is q = ||β*||_0, we have σ̂_MAP → 0. Thus, similarly to the discussion of global-local shrinkage priors in Section 5, we will severely underestimate the error variance. As before, the remedy is to use independent priors on σ² and β.

Spike-and-Slab Lasso with Unknown Variance
We now introduce the Spike-and-Slab Lasso with unknown variance, which considers the regression coefficients and error variance to be a priori independent. The hierarchical model is
π(β | γ) = Π_{j=1}^p [ γ_j φ_1(β_j) + (1 − γ_j) φ_0(β_j) ], γ_j | θ ~ Bernoulli(θ), θ ~ Beta(a, b), π(σ²) ∝ 1/σ², (6.20)
with β and σ² independent a priori. The log posterior, up to an additive constant, is given by
log π(β, σ² | Y) = −(n/2 + 1) log σ² − ||Y − Xβ||²/(2σ²) + Σ_{j=1}^p pen(β_j | θ_j), (6.21)
where, for j = 1, . . . , p,
pen(β_j | θ_j) = −λ_1|β_j| + log[ p*(0; θ_j) / p*(β_j; θ_j) ], (6.22)
θ_j = E[θ | β_{\j}], (6.23)
and p*(β; θ) = θφ_1(β) / [θφ_1(β) + (1 − θ)φ_0(β)] is the conditional probability that β is drawn from the slab. For large p, Ročková and George (2018) note that θ_j varies little across j and can be replaced by the common quantity θ̂_β = E[θ | β]. To find the modes of (6.21), we pursue a similar coordinate ascent strategy to Ročková and George (2018), cycling through updates for each β_j and σ² while updating the conditional expectation θ̂_β. This conditional expectation does not have an analytical expression; however, Ročková and George (2018) note that it can be approximated by
θ̂_β ≈ ( a + Σ_{j=1}^p p*(β_j; θ̂_β) ) / (a + b + p). (6.24)
We now outline the estimation strategy for β. As noted in Lemma 3.1 of Ročková and George (2018), there is a simple expression for the derivative of the SSL penalty:
∂pen(β_j | θ) / ∂|β_j| = −λ*(β_j; θ), where λ*(β; θ) ≡ λ_1 p*(β; θ) + λ_0 [1 − p*(β; θ)]. (6.26)
Using the above expression, the Karush-Kuhn-Tucker (KKT) conditions yield the following necessary condition for the global mode β̂:
β̂_j = (1/n) [ |z_j| − σ²λ*(β̂_j; θ) ]_+ sign(z_j), j = 1, . . . , p, (6.27)
where z_j = X_j^T( Y − Σ_{k≠j} β̂_k X_k ) and we assume that the design matrix X has been centered and standardized so that each column has norm √n. The condition (6.27) is very close to the familiar soft-thresholding operator for the Lasso, except that the penalty term λ*(β̂_j; θ) differs for each coordinate. Similarly to other non-convex methods, this enables selective shrinkage of the coefficients, mitigating the bias issues associated with the Lasso. Also similarly to non-convex methods, however, (6.27) is not a sufficient condition for the global mode. This is particularly problematic when the posterior landscape is highly multimodal, a consequence of p ≫ n and large λ_0. To eliminate many of these suboptimal local modes from consideration, Ročková and George (2018) develop a more refined characterization of the global mode. This characterization follows the arguments of Zhang and Zhang (2012) and can easily be modified for the unknown variance case of the SSL, as detailed in Proposition 4.
Proposition 4. The global mode β̂ satisfies
β̂_j = (1/n) [ |z_j| − σ²λ*(β̂_j; θ̂_β) ]_+ sign(z_j) 1(|z_j| > ∆), (6.28)
where
∆ ≡ inf_{t>0} [ nt/2 − σ² pen(t | θ̂_β)/t ]. (6.29)
Unfortunately, computing (6.29) can be difficult. Instead, we seek an approximation to the threshold ∆. A useful upper bound is ∆ ≤ σ²λ*(0; θ̂_β) (Zhang and Zhang, 2012). However, when λ_0 gets large, this bound is too loose and can be improved. The improved bounds are given in Proposition 5, the analogue of Proposition 3.2 of Ročková and George (2018) for the unknown variance case; the statement involves a function g(x; θ) which parallels the one used there.
Proposition 5. When σ(λ_0 − λ_1) > 2√n and g(0; θ̂_β) > 0, the threshold ∆ is bounded above by
∆_U = √( 2nσ² log[1/p*(0; θ̂_β)] ) + σ²λ_1 (6.32)
and below by a corresponding lower bound ∆_L of the same form up to a discrepancy term d_j. Thus, when λ_0 is large and consequently d_j → 0, the lower bound on the threshold approaches the upper bound, yielding the approximation ∆ ≈ ∆_U. We additionally note the central role that the error variance plays in the thresholds in Proposition 5. As σ² increases, the thresholds also increase, making it more difficult for regression coefficients to be selected. This is exactly what we want when the signal-to-noise ratio is small.
Bringing this all together, we incorporate this refined characterization of the global mode into the update for the coefficients via the generalized thresholding operator of Mazumder et al. (2011):
S̃(z_j) ≡ (1/n) [ |z_j| − σ²λ*(β_j; θ̂_β) ]_+ sign(z_j) 1(|z_j| > ∆). (6.34)
The coordinate-wise update is then
β_j^{(k+1)} = S̃(z_j^{(k)}), j = 1, . . . , p. (6.35)
The conditional expectation θ̂_β is updated according to (6.24). Finally, given the most recent update of the coefficient vector β, the update for the error variance σ² is a simple closed-form step, its conditional mode:
σ²^{(k+1)} = ||Y − Xβ^{(k+1)}||² / (n + 2). (6.36)
Note that this update for σ² is a conditional mode, not a marginal mode, and so it does not underestimate the error variance in the same way as (3.9). Indeed, conditional on the true regression coefficients, (6.36) is essentially the oracle estimator (4.5). However, although we retain the update (6.36) during optimization in order to iterate between the modes of β and σ², after the algorithm has converged, our final estimator of σ² is obtained as
σ̂² = ||Y − Xβ̂||² / (n − q̂), (6.37)
where q̂ = ||β̂||_0. Note that (6.37) incorporates an appropriate degrees of freedom adjustment to account for the fact that β̂ is an estimate of the unknown true β. In principle, both σ² and the conditional expectation θ̂_β should be updated after each β_j, j = 1, . . . , p. In practice, however, there will be little change after one coordinate update and so both σ² and θ̂_β can be updated after M coordinates are updated, where M is the update frequency given in Algorithm 1. The default implementation updates σ² and θ̂_β after every M = 10 coordinate updates.
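A simplified R sketch of one such coordinate ascent pass may help fix ideas. It uses the approximation ∆ ≈ ∆_U as the selection threshold, holds θ fixed rather than updating θ̂_β, and updates σ² after every full sweep; it is a sketch under these simplifying assumptions, not the SSLASSO package implementation.

```r
# Simplified coordinate ascent for the SSL with unknown variance (single lambda0).
# Assumes the columns of X are centered and scaled to have norm sqrt(n).
ssl_coord_sketch <- function(X, Y, lambda0, lambda1, theta = 0.5,
                             max_iter = 200, tol = 1e-6) {
  n <- nrow(X); p <- ncol(X)
  laplace <- function(b, lam) (lam / 2) * exp(-lam * abs(b))
  p_star <- function(b) {
    slab <- theta * laplace(b, lambda1)
    slab / (slab + (1 - theta) * laplace(b, lambda0))
  }
  lambda_star <- function(b) lambda1 * p_star(b) + lambda0 * (1 - p_star(b))

  beta <- rep(0, p); sigma2 <- var(Y)
  for (iter in 1:max_iter) {
    beta_old <- beta
    # Threshold approximation Delta ~ Delta_U in (6.32)
    Delta <- sqrt(2 * n * sigma2 * log(1 / p_star(0))) + sigma2 * lambda1
    for (j in 1:p) {
      z_j <- sum(X[, j] * (Y - X[, -j, drop = FALSE] %*% beta[-j]))
      beta[j] <- if (abs(z_j) <= Delta) 0 else
        sign(z_j) * max(abs(z_j) - sigma2 * lambda_star(beta[j]), 0) / n
    }
    # Conditional-mode update for the error variance, cf. (6.36)
    sigma2 <- sum((Y - X %*% beta)^2) / (n + 2)
    if (max(abs(beta - beta_old)) < tol) break
  }
  q_hat <- sum(beta != 0)
  # Final degrees-of-freedom-adjusted variance estimate, cf. (6.37)
  list(beta = beta, sigma2 = sum((Y - X %*% beta)^2) / max(n - q_hat, 1))
}
```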

Implementation
In the SSL with known variance, Ročková and George (2018) propose a "dynamic posterior exploration" strategy whereby the slab parameter λ_1 is held fixed and the spike parameter λ_0 is gradually increased to approximate the ideal point-mass prior. Holding the slab parameter fixed serves to stabilize the non-zero coefficients, unlike the Lasso which applies an equal level of shrinkage to all regression coefficients. Meanwhile, gradually increasing λ_0 over a "ladder" of values serves to progressively threshold negligible coefficients. More practically, the dynamic strategy aids in mode detection: when (λ_1 − λ_0)² ≤ 4/σ², the objective is convex (Ročková and George, 2018). In fact, when λ_0 = λ_1, it is equivalent to the Lasso. As λ_0 is increased, the posterior landscape becomes multimodal, but using the solution from the previous value of λ_0 as a "warm start" allows the procedure to more easily find modes. Thus, progressively increasing λ_0 acts as an annealing strategy.
When σ² is treated as unknown, the successive warm start strategy of Ročková and George (2018) requires additional intervention. For small λ_0 ≈ λ_1, there may be many negligible but non-zero β_j included in the model. This severe overfitting results in all the variation in Y being explained by the model, forcing the estimate of the error variance, σ², to zero. If this suboptimal solution is propagated to larger values of λ_0, the optimization routine will remain "stuck" in that part of the posterior landscape. As an implementation strategy to avoid this absorbing state, we keep the estimate of σ² fixed at an initial value until λ_0 reaches a value at which the algorithm converges in less than 100 iterations. We then reinitialize β and σ² and begin to simultaneously update σ² for the next largest λ_0 value in the ladder. The intuition behind this strategy is that we first find a promising part of the posterior landscape and then update β and σ² together.
For initialization, we follow Ročková and George (2018) and initialize the regression coefficients, β, at zero and θ 0 = 0.5. For the error variance, we devised an initialization strategy that is motivated by the prior for σ 2 used in Chipman et al. (2010). Those authors used a scaled-inverse-χ 2 prior for the error variance with degrees of freedom ν = 3 and scale parameter chosen such that the sample variance of Y corresponds to the 90th quantile of the prior. This is a natural choice as the variance of Y is the maximum possible value for the error variance. We set the initial value of σ 2 to be the mode of this scaled-inverse-χ 2 distribution, a strategy which we have found to be effective in practice.
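Concretely, this initialization can be computed as in the following R snippet; the closed form assumes the standard scaled-inverse-χ² parameterization, and the function name is ours.

```r
# Initial value for sigma^2: the mode of a scaled-inverse-chi^2(nu, s^2) distribution
# with nu = 3 and s^2 chosen so that var(Y) is its 90th quantile (cf. Chipman et al., 2010).
init_sigma2 <- function(Y, nu = 3, quant = 0.9) {
  v <- var(Y)
  s2 <- v * qchisq(1 - quant, df = nu) / nu   # ensures P(sigma^2 <= var(Y)) = quant
  nu * s2 / (nu + 2)                          # mode of the scaled-inverse-chi^2
}
# e.g. init_sigma2(Y)
```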
The entire implementation strategy is summarized in Algorithm 1.
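For reference, a typical call to the publicly available implementation might look as follows; the argument names are those of the SSLASSO R package as we understand it, and defaults may differ across versions (see ?SSLASSO).

```r
library(SSLASSO)

# Spike-and-Slab Lasso with unknown variance over a ladder of spike parameters,
# lambda0 in {1, 2, ..., n}, with the slab parameter lambda1 held fixed at 1.
lambda0_ladder <- seq_len(nrow(X))
fit <- SSLASSO(X, Y, variance = "unknown",
               lambda1 = 1, lambda0 = lambda0_ladder)

plot(fit)   # coefficient paths along the lambda0 ladder
```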

Scaled Spike-and-Slab Lasso
An alternative approach for extending the SSL to unknown variance is to follow the scaled Lasso framework of Sun and Zhang (2012). In their original scaled Lasso paper, Sun and Zhang (2012) note that their loss function can be used with many penalized likelihood procedures, including the MCP and SCAD penalties. Here, we develop the scaled Spike-and-Slab Lasso. The loss function for the scaled SSL is the same as that of the scaled Lasso but with a different penalty:
L(β, σ) = ||Y − Xβ||² / (2nσ) + σ/2 − (1/n) Σ_{j=1}^p pen(β_j | θ̂_β), (6.38)
where pen(β_j | θ̂_β) is as defined in (6.22) and again we use the approximation (6.24) for the conditional expectation θ̂_β. In using this loss function, we are of course departing from the Bayesian paradigm and simply considering this procedure as a penalized likelihood method with a spike-and-slab inspired penalty. The algorithm to find the modes of (6.38) is very similar to Algorithm 1, the only difference being that we replace all σ² terms in the updates (6.34) and (6.35) with σ. This is because the refined thresholds for the coefficients are derived using the KKT conditions, where the only difference between the two procedures is σ versus σ².
The update for σ² is only very slightly different from that of the SSL with unknown variance:
σ²^{(k+1)} = ||Y − Xβ^{(k+1)}||² / n.
How do we expect the scaled Spike-and-Slab Lasso to compare to the Spike-and-Slab Lasso with unknown variance? The threshold levels ∆ for the scaled SSL will be smaller after replacing σ² with σ. This may potentially result in more false positives being included in the scaled SSL model. In terms of variance estimation, the updates for σ² are effectively the same; the only differences we should expect are those arising from a more saturated estimate of β. These hypotheses are examined in the simulation study in the next section.

Simulation Study
We now compare the Spike-and-Slab Lasso with unknown variance with several penalized likelihood methods, including the original Spike-and-Slab Lasso with fixed variance of Ročková and George (2018) as well as the scaled Spike-and-Slab Lasso outlined in the previous section. We investigate both the efficacy of the SSL with unknown variance and the benefits of simultaneously estimating the regression coefficients β and error variance σ 2 in variable selection. We do not consider the SSL with the p-sigma prior as the objective is similar to Städler et al. (2010) (albeit with an adaptive penalty) and so we would expect similar overestimation of σ 2 as proved by Sun and Zhang (2012). We consider three different simulation settings.
The analysis was repeated 100 times, with new covariates and responses generated each time. For each replication, the metrics recorded were: the Hamming distance (HAM) between the support of the estimated β̂ and that of the true β_0; the prediction error (PE); the dimension of the estimated β̂ (q̂); the Matthews correlation coefficient (MCC); the percentage of times the method found the correct model (COR); and the time in seconds (TIME). The average of these metrics for each method over the 100 repetitions is displayed in Table 1.
We can see that the Spike-and-Slab Lasso with the variance fixed and equal to the truth (σ 2 = 3) performs the best in terms of the Hamming distance, prediction error, and MCC. Encouragingly, the Spike-and-Slab Lasso with unknown variance performs almost as well as the "oracle" version where the true variance is known. The SSL with unknown variance in turn performs better than a naive implementation of the SSL with fixed variance (σ 2 = 1). We note that the prediction error for the latter implementation is higher than the Adaptive Lasso and SCAD; however, these frequentist methods use cross-validation to choose their regularization parameter and so are optimizing for prediction to the possible detriment of other metrics; the SSL (σ 2 = 1) still has fewer false positives and a higher MCC. However, both the SSL (σ 2 = 3) and unknown σ 2 have smaller prediction error than the rest of the methods, including those which use cross-validation, which highlights the predictive gains afforded by variance estimation.
Following from the discussion in the previous section, we can see that the scaled SSL indeed finds more false positives than the SSL with unknown variance. This is a result of the smaller thresholds in estimating the regression coefficients. We can see that the scaled Lasso significantly reduces the number of false positives found as compared to the Lasso; however, the issues with the Lasso penalty remain. Figure 2 shows the variance estimates over the 100 repetitions for the SSL with unknown variance, the scaled SSL and the scaled Lasso. For the SSL with unknown σ², these are the estimates (6.37). For the scaled SSL and the scaled Lasso variance estimates, we also applied a degrees of freedom correction similar to (6.37), using the number of non-zero coefficients found by each method. The variance estimates from the SSL (unknown σ²) have a median of 2.87 and standard error 0.04. Meanwhile, the scaled SSL slightly underestimates the variance, with a median of 2.76 and standard error 0.04, as expected from the larger number of false positives observed in Table 1. Finally, the scaled Lasso highly inflates the variance, with a median of 5.88 and standard error 0.14.

Protein activity data
We now apply the Spike-and-Slab Lasso with unknown variance to the protein activity data set from Clyde and Parmigiani (1998). Following those authors, we code the categorical variables by indicator variables, consider all main effects and two-way interactions, and quadratic terms for the continuous variables. This results in a linear model with p = 88 potential predictors. The sample size is n = 96. We assess the performance of the Spike-and-Slab Lasso with unknown variance in both variable selection and prediction.

Variable selection
As an approximation to the "truth", we use the Bayesian adaptive sampling algorithm (BAS, Clyde et al., 2011), which has previously been applied successfully to this dataset. BAS gives posterior inclusion probabilities (PIP) for each of the p potential predictors, from which we determined the median probability model (MPM: predictors with PIP > 0.5). The median probability model found by BAS consisted of q = 7 predictors:
• interaction of protein concentration and detergent T (con:detT)
• detergent T (detT)
• interaction of buffer TRS and detergent N (bufTRS:detN)
• protein concentration (con)
• interaction of buffer PO4 and temperature (bufPO4:temp)
• detergent N (detN)
• interaction of detergent N and temperature (detN:temp)
For the SSL with unknown variance, we used the same settings as the simulation study with λ_1 = 1 and λ_0 ∈ {1, 2, . . . , n}. The procedure found a model with q̂ = 6 predictors, including four of the MPM: con, detN, bufTRS:detN, con:detT. Additionally, instead of detT, the SSL with unknown variance found the interaction of pH with detergent T (pH:detT). The correlation between detT and pH:detT is 0.988, rendering the two predictors essentially exchangeable. Thus, 5 out of the 6 predictors found by the SSL with unknown variance matched the benchmark MPM.
For the SSL with known variance, we fixed σ² = 0.24. This is the mean of the scaled-inverse-χ² distribution induced by the variance of the response, as detailed in Section 6.4. For the protein data, the sample variance of the response is 0.41, and so fixing σ² = 1 overestimates the variance, resulting in no signal being found. The SSL with this fixed variance found q̂ = 2 predictors: one of the MPM (detT) and one not in the MPM but having a correlation of 0.735 with detN.
Here, we can see the benefit of simultaneously estimating the error variance; the estimate from SSL with unknown variance was σ 2 = 0.167, resulting in a less sparse solution.

Predictive Performance
We now compare the predictive performance of the SSL with unknown variance with the penalized regression methods from the simulation study in Section 6.5 using 8-fold cross-validation. We split the data into K = 8 sets and denote each set by S_k, k = 1, . . . , K. The 8-fold cross-validation error is given by
CV = Σ_{k=1}^K Σ_{i ∈ S_k} ( Y_i − X_i^T β̂^{\k} )²,
where β̂^{\k} is the estimated regression coefficient vector using the data in S_k^C. We repeated this procedure 100 times and display the resulting cross-validation errors in Figure 3. We do not display the results from the scaled Lasso in Figure 3 as there were a number of outliers: the cross-validation error for the scaled Lasso was greater than 25 in 20% of the replications.
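For completeness, a generic sketch of such a cross-validation loop in R (the fitting function is a placeholder for any of the compared methods):

```r
# Generic K-fold cross-validation error; fit_fun(Xtr, Ytr) should return a
# coefficient vector and is a placeholder for any of the compared methods.
cv_error <- function(X, Y, fit_fun, K = 8) {
  n <- nrow(X)
  folds <- sample(rep(1:K, length.out = n))
  err <- 0
  for (k in 1:K) {
    test <- which(folds == k)
    beta_k <- fit_fun(X[-test, , drop = FALSE], Y[-test])
    err <- err + sum((Y[test] - X[test, , drop = FALSE] %*% beta_k)^2)
  }
  err
}
```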
We can see that the SSL with unknown variance has the smallest cross-validation error. This highlights the gains in predictive performance that can be achieved by simultaneously estimating the error variance and regression coefficients. This result is also very encouraging given that all the other methods (except for the SSL with fixed variance) use cross-validation in choosing their regularization parameters. This also explains the slightly worse performance of the SSL with fixed variance, which we expect would be competitive if we were to also choose the regularization parameters with cross-validation. However, SSL with fixed variance still performs well without the need for computationally intensive cross-validation to choose the parameters.