Sparsity information and regularization in the horseshoe and other shrinkage priors

The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but has previously suffered from two problems. First, there has been no systematic way of specifying a prior for the global shrinkage hyperparameter based on the prior information about the degree of sparsity in the parameter vector. Second, the horseshoe prior has the undesired property that there is no possibility of specifying separately information about sparsity and the amount of regularization for the largest coefficients, which can be problematic with weakly identified parameters, such as the logistic regression coefficients in the case of data separation. This paper proposes solutions to both of these problems. We introduce a concept of effective number of nonzero parameters, show an intuitive way of formulating the prior for the global hyperparameter based on the sparsity assumptions, and argue that the previous default choices are dubious based on their tendency to favor solutions with more unshrunk parameters than we typically expect a priori. Moreover, we introduce a generalization to the horseshoe prior, called the regularized horseshoe, that allows us to specify a minimum level of regularization to the largest values. We show that the new prior can be considered as the continuous counterpart of the spike-and-slab prior with a finite slab width, whereas the original horseshoe resembles the spike-and-slab with an infinitely wide slab. Numerical experiments on synthetic and real world data illustrate the benefit of both of these theoretical advances.


Introduction
This paper deals with sparse Bayesian estimation and is an extension to our earlier work (Piironen and Vehtari, 2017a).We consider statistical models with a large number of parameters θ = (θ 1 , . . ., θ D ) but so that it is reasonable to assume that only some of them are far from zero.A typical example -and also the case we will mostly focus in this paper -is a regression or classification problem with a large number of predictor variables out of which we expect only a few to be relevant and therefore have a regression coefficient distinguishable from zero.
A vast number of different estimators, both Bayesian and non-Bayesian, have been proposed for these problems.In the non-Bayesian literature the sparse problems are typically handled by Lasso (Tibshirani, 1996) or one of its generalizations (for an overview, see e.g., Hastie, Tibshirani and Wainwright, 2015).We focus on the probabilistic approach and carry out full Bayesian inference on the problem.
Two prior choices dominate the Bayesian literature: two component discrete mixture priors known as the spike-and-slab (Mitchell and Beauchamp, 1988;George and McCulloch, 1993), and a variety of continuous shrinkage priors (see e.g., Polson and Scott, 2011, and references therein).The spike-and-slab prior is intuitively appealing as when the spike is taken to be a delta-spike in the origin, it is equivalent to Bayesian model averaging (BMA) (Hoeting et al., 1999) over the variable combinations, and often has good performance in practice.The disadvantages are that the results can be sensitive to prior choices (slab width and prior inclusion probability) and that the posterior inference can be computationally demanding with a large number of variables, due to the huge model space.The inference could be sped up by analytical approximations using either expectation propagation (EP) (Hernández-Lobato, Hernández-Lobato andSuárez, 2010, 2015) or variational inference (VI) (Titsias and Lázaro-Gredilla, 2011), but this comes at the cost of a substantial increase in the amount of analytical work needed to derive the equations separately for each model and a more complex implementation.
The continuous shrinkage priors on the other hand are easy to implement, provide convenient computation using generic sampling tools such as Stan (Stan Development Team, 2017), and can yield as good or better results.A particularly interesting example is the horseshoe prior (Carvalho, Polson andScott, 2009, 2010) θ j | λ j , τ ∼ N 0, τ 2 λ 2 j , λ j ∼ C + (0, 1) , j = 1, . . ., D, (1.1) which has shown comparable performance to the spike-and-slab prior in a variety of examples where a sparsifying prior on the model parameters θ j is desirable (Carvalho, Polson andScott, 2009, 2010;Polson and Scott, 2011).The horseshoe is one of the so called global-local shrinkage priors, meaning that there is a global hyperparameter τ that shrinks all the parameters towards zero, while the heavy-tailed half-Cauchy priors for the local hyperparameters λ j allow some θ j to escape the shrinkage (see Sec. 2.1 for more thorough discussion).Despite its good performance in many problems, the horseshoe prior has previously suffered from two shortcomings.First, there has been no consensus on how to carry out inference for the global hyperparameter τ which determines the overall sparsity in the parameter vector θ and therefore has a large impact on the results.We prefer full Bayesian inference (see Sec. 3.1) but the existing methodology has been lacking a systematic way of placing a prior for τ based on the information about the sparsity.Second, the horseshoe prior has the undesired property that the parameters far from zero are not regularized at all (see Sec. 2.3).While this is often considered as a key strength of the prior, it can be harmful especially when the parameters are only weakly identified by the data, for instance in the case of a flat likelihood due to separable data in logistic regression.
We propose a solution to both of these problems.We introduce a concept of effective number of nonzero parameters m eff (Sec.3.3), derive analytically its relationship between the global shrinkage parameter τ , and show an easy and intuitive way of formulating the prior for τ based on the prior information about the sparsity of θ.Based on these theoretical considerations, we argue that the previously proposed default priors are dubious based on the prior they impose on m eff , and that they yield good results only when τ (and therefore m eff ) is strongly identified by the data.Moreover, we introduce a generalization of the horseshoe prior, called the regularized horseshoe, that operates otherwise similarly as the original horseshoe but allows specifying the regularization to the coefficients that are far from zero (see Sec. 2.3).We show that the regularized horseshoe can be considered as the continuous counterpart of the spike-and-slab prior with a finite slab width, whereas the original horseshoe resembles the spikeand-slab with an infinitely wide slab.The benefit of both of these theoretical advances will be illustrated with examples on synthetic and real world data (Sec.4).
As a final remark, although we focus on the horseshoe in our discussion, we want to emphasize that both of these ideas could also be applied to other shrinkage priors, and several promising alternatives to the horseshoe have been proposed during the recent years (Bhattacharya et al., 2015;Zhang, Reich and Bondell, 2016;Ghosh, Li and Mitra, 2017).

Horseshoe prior and its extension
This section discusses the horseshoe and its connection to the spike-and-slab prior (Section 2.2).We also present an extension (Section 2.3) that both helps understanding the theoretical properties of the original horseshoe and -as will be demonstrated in Section 4 -robustifies the prior and improves its practical performance.

Horseshoe prior for linear regression
Consider the single output linear Gaussian regression model with several input variables, given by where x is the D-dimensional vector of inputs, β contains the corresponding weights and σ 2 is the noise variance.The horseshoe prior is set for the regression coefficients β = (β 1 , . . ., β D ) (2.2) If an intercept term β 0 is included in model (2.1), we give it a relatively flat prior, because there is usually no reason to shrink it towards zero.As discussed in the introduction, the horseshoe prior has been shown to possess several desirable theoretical properties and good performance in practice (Carvalho, Polson andScott, 2009, 2010;Polson and Scott, 2011;Datta and Ghosh, 2013;van der Pas, Kleijn and van der Vaart, 2014).The intuition is the following: the global parameter τ pulls all the weights globally towards zero, while the thick half-Cauchy tails for the local scales λ j allow some of the weights to escape the shrinkage.Different levels of sparsity can be accommodated by changing the value of τ : with large τ all the variables have very diffuse priors with very little shrinkage, but letting τ → 0 will shrink all the weights β j to zero.The above can be formulated more formally as follows.Let X denote the n-by-D matrix of observed inputs and y the observed targets.The conditional posterior for the coefficients β given the hyperparameters and data D = (X, y) can be written as where Λ = diag λ 2 1 , . . ., λ 2 D and β = (X T X) −1 X T y is the maximum likelihood solution (assuming the inverse exists).If the predictors are uncorrelated with zero mean and variances Var(x j ) = s 2 j , then X T X ≈ n diag s 2 1 , . . ., s 2 D , and we can approximate βj = (1 − κ j ) βj , (2.3) where is the shrinkage factor for coefficient β j .The shrinkage factor describes how much coefficient β j is shrunk towards zero from the maximum likelihood solution (κ j = 1 meaning complete shrinkage and κ j = 0 no shrinkage).From (2.3) and (2.4) it is easy to verify that β → 0 as τ → 0, and β → β as τ → ∞.
The result (2.4) holds for any prior that can be written as a scale mixture of Gaussians like (2.2), regardless of the prior for λ j .The horseshoe employs independent half-Cauchy priors for all λ j , and for this choice one can show that, for fixed τ and σ, the shrinkage factor (2.4) follows the prior where a j = τ σ −1 √ n s j .When a j = 1, this reduces to Beta 1 2 , 1 2 which looks like a horseshoe, see Figure 1.Thus, a priori, we expect to see both relevant 0.0 0.5 1.0 0.0 0.5 1.0 The continuous curves show the densities for the shrinkage factor (2.4) for the horseshoe prior (2.2) when a j = τ σ −1 √ n s j = 1 (left) and when a j = 0.1 (right).The bars denote the corresponding point mass function for the spikeand-slab prior (2.7) with infinite slab width c → ∞, when π/(1 − π) = 1 (left) and π/(1 − π) = 0.1 (right).To aid visualization, the bars illustrating the point masses are scaled and show only the relative probability masses.
(κ j = 0, no shrinkage) and irrelevant (κ j = 1, complete shrinkage) variables.By changing the value of τ , the prior for κ j places more mass either close to 0 or 1.For instance, choosing τ so that a j = 0.1 favors complete shrinkage (κ = 1) and thus we expect more coefficients to be close to zero a priori.Notice though that for a fixed τ , the sparsity assumptions will be dependent on the input dimension D, and to get around this issue, we need to consider the values of all the shrinkage factors κ j together.Using this idea, Section 3 discusses an intuitive way of designing a prior distribution for τ based on the assumptions about the number of nonzero components in β.
Notice also that those variables which vary on larger scale s j are treated as more relevant a priori, which is the reason why we usually scale all the variables to have unit variance s 2 j = 1, unless the original scales really carry information about the relevances.Another way would be to use the original scales but adjust scales for the local parameters accordingly λ j ∼ C + 0, s −2 j .

Spike-and-slab prior
The spike-and-slab (Mitchell and Beauchamp, 1988;George and McCulloch, 1993) is a popular shrinkage prior that is often considered as the "gold standard" for sparse Bayesian estimation.The prior is often written as a two-component mixture of Gaussians (2.6) so that ε c and the indicator variable λ j ∈ {0, 1} denotes whether the coefficient β j is close to zero (comes from the "spike", λ j = 0) or nonzero (comes from the "slab", λ j = 1).Often we set ε = 0, that is, the spike is taken to be a delta spike at the origin δ 0 , although also ε > 0 could be used (George and McCulloch, 1993).The user then has to specify the values (or priors) for the slab width c and the prior inclusion probability π, which encodes the prior information about the sparsity of the coefficient vector β.Fixing c is probably the most common approach but by giving it a hyperprior, one can obtain a more heavy-tailed, such as Laplacian, slab (Johnstone and Silverman, 2004).
With the choice ε = 0, the prior (2.6) can be written analogous to (2.2) as so instead of giving continuous priors for λ j s as in the horseshoe, here only two values (λ j = 0, 1) are allowed.Thus also the shrinkage factor κ j only has mass at κ j = 1 1+nσ −2 s 2 j c 2 and at κ j = 1, and the probabilities are π and 1−π, respectively.Letting c → ∞, all the mass is concentrated at the extremes κ j = 0 and κ j = 1, and the resemblance to the horseshoe becomes obvious, see Figure 1.Given the similarity of the shrinkage profiles between the horseshoe and spike-and-slab, it is not surprising that the two priors have shown comparable performance in a variety of experiments (Carvalho, Polson andScott, 2009, 2010;Polson and Scott, 2011).The next section discusses an extension of the horseshoe that closely resembles the spike-and-slab prior with a finite slab width c < ∞.

Regularized horseshoe
As discussed in Section 2.1, the horseshoe prior favors solutions β j ≈ 0 and β j ≈ βj , and it can be shown that under certain conditions, βj → βj when | βj | → ∞ (Carvalho, Polson and Scott, 2010).While this guarantees that the strong signals will not be overshrunk -and is often considered to be one of the key assets of the prior -this property can also be harmful, especially when the parameters are weakly identified.An example of such case is the flat likelihood arising in logistic regression with separable data.As the horseshoe has Cauchy tails, in these problems it suffers basically from the same problems as the Cauchy prior, namely that the posterior means for the regression coefficients may vanish (Ghosh, Li and Mitra, 2017).Therefore it would be very useful to be able to control the amount of shrinkage for the largest coefficients, which in spike-and-slab prior (Sec.2.2) is achieved by controlling the slab width.
To guarantee that the prior always shrinks the coefficients at least by a small amount, we introduce the following regularized horseshoe prior where c > 0 is a constant that we assume is given for now.The intuition behind this definition is the following.When τ 2 λ 2 j c 2 , meaning the coefficient 0.0 0.5 1.0 0.0 0.5 1.0 Shrinkage profiles as in Figure 1 but now for the regularized horseshoe (2.8) and spike-and-slab (2.7) with a finite slab width c = 1.In comparison to Figure 1, for both priors the first mode is shifted from κ j = 0 to κ j = 1/(1 + nσ −2 s 2 j c 2 ) (for the plots we have selected nσ −2 s 2 j = 10).
β j is close to zero, then λ2 j → λ 2 j and the prior (2.8) approaches the original horseshoe.However, when τ 2 λ 2 j c 2 , meaning the coefficient is far from zero, then λ2 j → c 2 /τ 2 and the prior (2.8) approaches N 0, c 2 .Thus the prior will shrink the small signals as the horseshoe but will also regularize even the largest coefficients as a Gaussian slab with variance c 2 .
Another way to see this is to notice that the conditional prior for β j can be factored as (2.9) from which it is easy to see that depending on the relative magnitudes of τ 2 λ 2 j and c 2 , the prior operates (roughly) as the narrower one of the two factors.Therefore the role of N 0, c 2 is to "soft-truncate" the extreme tails of the horseshoe, thereby controlling the magnitude of the largest β j s.Letting c → ∞, we recover the original horseshoe.
The shrinkage profile of the regularized horseshoe is illustrated in Figure 2 together with the spike-and-slab with the slab width c, which demonstrates the similarity of the two priors.Using c < ∞ has the advantage that it regularizes the parameters β j when they are weakly identified, and allows us also to specify our prior information about the maximum effect β j we expect to see.The benefit of the proposed approach is illustrated in Section 4.1.2.
It must be noted that the shrinkage profile in Figure 2 does not have exactly the same shape as the original horseshoe shifted and scaled from interval (0, 1) to (b, 0) with b > 0, although it is very close to this.There is slightly more mass near the left hand side mode, although the difference is too small to be visible in Figure 2. It is possible to retain the exact shape of the horseshoe by defining the modified local parameter λ2 j as (2.10) The details of this result are spelled out in Appendix A. The reason why we define the prior as (2.8) is that unless n or the slab width c 2 is very small, the term σ 2 ns 2 j is typically small compared to c 2 and leaving it out has little influence in practice.On the other hand, formulation (2.8) is simpler and has the nice interpretation as a product of the original horseshoe and the Gaussian slab (2.9).Thus we report the result (2.10)only for completeness.
Our formulation requires choosing a value or prior for c.Unless substantial knowledge about the scale of the relevant coefficients exists, we generally recommend placing a prior for c instead of fixing it.Often a reasonable choice is which translates to a Student-t ν (0, s 2 ) slab for the coefficients far from zero and is typically a good default choice for a weakly informative prior.Another motivation for using inverse-Gamma is that it has a heavy right tail accompanied by a light left tail thereby preventing much mass from accumulating near zero.This is natural as we do not want to shrink those coefficients heavily towards zero that are already deemed to be far from zero.Still, we emphasize that out approach is not limited to this choice and also other hyperpriors are possible.It would also be possible to use variable specific slab widths c i , but we do not explore this further in this paper and leave it for future investigation.Finally, we would also like to point out that the tail-cutting idea of Equation (2.9) could be used more generally with other priors when relevant.For instance, we expect our idea to be useful with the horseshoe+ prior of Bhadra et al. (2017) which also has Cauchy tails and therefore suffers from the same problem as the horseshoe.

Hierarchical shrinkage
In addition to the problems with vanishing means, earlier we have also reported sampling issues with the original horseshoe even in simple regression problems (Piironen and Vehtari, 2015).Technically speaking, the problem arises due to posterior having an extreme funnel shape which is challenging for Markov chain Monte Carlo (MCMC) methods.The problem was revealed with the help of the divergence diagnostics of the NUTS algorithm (Hoffman and Gelman, 2014;Betancourt and Girolami, 2015;Betancourt, 2017a,b) when fitting the models in Stan.
The problem is related to the thick Cauchy tails of the prior, and to overcome the sampling issues, in our technical report we tentatively proposed replacing the half-Cauchy priors for the local parameters λ j in (2.2) with half-t priors with small degrees of freedom, such as ν = 3, and named this approach the "hierarchical shrinkage".With large enough ν, this seems to help with the sampling issues and remove the divergent transitions produced by NUTS, but the drawback is that the prior becomes less sparsifying.This is because when the tails of p(λ j ) are made slimmer, we need to increase the value for τ to accommodate large signals, and therefore the prior is not able to shrink the small coefficients efficiently towards zero.Another limitation of this approach is that in order to fight the problems arising from data separation in logistic regression (Sec.2.3), we would also need to refrain from using half-Cauchy prior for the global parameter τ (which we might want to use, see Sec. 3) as this would also lead to Cauchy tails for p(β j ).
The less sparsifying nature of the choice ν > 1 will be demonstrated in practice in Section 4.1.2,where we also show that the regularized horseshoe (Sec.2.3) clearly outperforms this approach.Thus we no longer recommend increasing the local degrees of freedom, but instead of using the regularized horseshoe.

The global shrinkage parameter
This section discusses the prior choice for the global hyperparameter τ .We begin with a short note on why we prefer full Bayesian inference for τ over point estimation, and then go on to discuss how we propose to set up the prior p(τ ).

Full Bayes versus point estimation
In principle, one could use a plug-in estimate for τ , obtained either by crossvalidation or maximum marginal likelihood (sometimes referred to as "empirical Bayes").The maximum marginal likelihood estimate has the drawback that it is always in danger of collapsing to τ = 0 if the parameter vector happens to be very sparse.Moreover, rather than being computationally convenient, this approach might actually complicate matters as the marginal likelihood is not analytically available for non-Gaussian likelihoods.While cross-validation avoids the latter problem and possibly also the first one, it is computationally less efficient than the full Bayesian solution and fails to account for the posterior uncertainty.For these reasons we recommend full Bayesian inference for τ , and focus on how to specify the prior distribution.

Earlier approaches
Carvalho, Polson and Scott (2009) also recommend full Bayesian inference for τ , and following Gelman (2006), they propose prior whereas Polson and Scott (2011) recommend If the target variable y is scaled to have marginal variance of one, unless the noise level σ is very small, both of these priors typically lead to quite similar posteriors.However, as we argue in Section 3.3, there is a theoretical justification for letting τ scale with σ.The main motivation for using a half-Cauchy prior for τ is that it evaluates to a finite positive value at the origin, yielding a proper posterior and allowing even complete shrinkage τ → 0, while still having a thick tail which can accommodate a wide range of values.For these reasons, C + 0, η 2 is a desirable choice when there are enough observations to let τ be identified by data.Still, we show that in several cases one can clearly benefit by choosing the scale η in a more careful manner than simply η = 1 or η = σ, because for most applications these choices place far to much mass for implausibly large values of τ .This point is discussed in Section 3.3.Moreover, the synthetic example in Section 4.1.1shows that in some cases one could clearly benefit from even more informative prior.van der Pas, Kleijn and van der Vaart (2014) study the optimal selection of τ in model (3.3)They prove that in such a setting, the optimal value (up to a log factor) in terms of mean squared error and posterior contraction rates in comparison to the true β * is where p * denotes the number of nonzeros in the true coefficient vector β * (assuming such exists).Their proofs assume that n, p * → ∞ and p * = o(n).Model (3.3) corresponds to setting X = I and D = n in the usual regression model (2.1).It is unclear whether and how this result could be extended to a more general X, and how one should utilize this result when p * is unknown (as it usually is in practice).In section 3.3, we formulate our method of constructing the prior p(τ ) based on the prior information about p * , and show that if p * was known, our method would also give rise to result (3.4), but is more generally applicable.

Effective number of nonzero coefficients
Consider the prior distribution for the shrinkage factor of the jth regression coefficient for the linear Gaussian model, Eq. (2.5).The mean and variance can be shown to be where a j = τ σ −1 √ n s j as earlier.A given value for the global parameter τ can be understood intuitively via the prior distribution that it imposes on the effective number of coefficients distinguishable from zero (or effective number of nonzero coefficients, for short), which we define as (3.7) When the shrinkage factors κ j are close to 0 and 1 (as they typically are for the horseshoe prior), this quantity describes essentially how many active or unshrunk variables we have in the model.It serves therefore as a useful indicator of the effective model size.
Using results (3.5) and (3.6), the mean and variance of m eff given τ and σ are given by Let us now assume that, in addition of having a zero mean, each variable also has a unit variance s 2 j = 1.In this case the equations above simplify to The expression for the mean (3.10) is helpful.First, from this expression it is evident that to keep our prior information about m eff consistent, τ must scale as σ/ √ n.Priors that fail to do so, such as (3.1), favor models of varying size depending on the noise level σ and the number of data points n.Second, if our prior guess for the number of relevant variables is p 0 , it is reasonable to choose the prior so that most of the prior mass is located near the value which is obtained by solving equation E(m eff | τ, σ) = p 0 .Note that this is typically quite far from 1 or σ, which are used as scales for priors (3.1) and (3.2).For instance, if D = 1000 and n = 200, then prior guess p 0 = 5 gives about τ 0 = 3.6 • 10 −4 σ.
To further develop the intuition about the connection between τ and m eff , it is helpful to visualize the prior imposed on m eff for different prior choices for τ .This is most conveniently done by drawing samples for m eff ; we first draw τ ∼ p(τ ) and λ 1 , . . ., λ D ∼ C + (0, 1), then compute the shrinkage factors κ 1 , . . ., κ D from (2.4), and finally m eff from (3.7).
Figure 3 shows histograms of prior draws for m eff for some different prior choices for τ , with total number of variables D = 10 and D = 1000, assuming n = 100 observations with σ = 1.The first three priors utilize the value τ 0 which is computed from (3.12) using p 0 = 5 as our hypothetical prior guess for the number of relevant variables.Fixing τ = τ 0 results in a nearly symmetric distribution around p 0 , while a half-normal prior with scale τ 0 yields a skewed distribution favoring solutions with m eff < p 0 but allowing larger values to also be accommodated.The half-Cauchy prior behaves similarly to the half-normal, but results in a distribution with a much thicker tail giving substantial mass also to values much larger than p 0 when D is large.Figure 3 also illustrates why prior τ ∼ C + (0, 1) is often a dubious choice: it places far too much mass on large values of τ , consequently favoring solutions with most of the coefficients unshrunk.Thus when only a small number of the variables are relevant -as we typically assume -this prior results in sensible inference only when τ is strongly identified by data.Notice also that, if we changed the value of σ or n, the first three priors for τ would still impose the same prior for m eff , but this is not true for τ ∼ C + (0, 1).
This way, by studying the prior for m eff , one can easily choose the prior for τ based on the information about the number of nonzero parameters.Because the prior information can vary substantially for different problems and the results depend on the information carried by the data, there is no globally optimal prior for τ that works for every single problem.Some recommendations, however, will be given in Section 5 based on these theoretical considerations and experiments presented in Section 4.
We conclude by pointing out a connection between our reference value (3.12) and the oracle result (3.4) for the simplified model (3.3).As pointed out in the last section, model (3.3) corresponds to setting X = I (which implies n = D and X T X = I) in the usual regression model (2.1).Using this fact and repeating the steps needed to arrive at (3.12), we get Suppose now that we select p 0 = p * , that is, our prior guess is oracle.Using the same assumptions as van der Pas, Kleijn and van der Vaart (2014), namely that n, p * → ∞ and p * = o(n), and additionally that σ = 1, we get τ 0 → p * /D = τ * .This result is natural, as it means it is optimal to choose τ so that the imposed prior for the effective number of nonzero coefficients m eff is centered at the true number of nonzeros p * .This further motivates why m eff is a useful quantity.

Regularized horseshoe and other shrinkage priors
As discussed in Section 2.3, when we set c < ∞, the shrinkage profile of the regularized horseshoe (2.8) becomes approximately equivalent to that of the  where m eff is the effective number of nonzeros for the original horseshoe.Thus with a given τ , the effective complexity for the regularized horseshoe is always less than for the pure horseshoe, because those coefficients that are far from zero are still affected by the slab.Therefore we can naturally use result (3.12) also for the regularized horseshoe with p 0 as our prior guess for the number of coefficients far from zero, but remembering that those coefficients will experience the regularization by the slab.
The concept of effective number of nonzeros could also be used with shrinkage priors other than the (regularized) horseshoe, as long as the prior can be written as a scale mixture of Gaussians like (2.2).Many such alternatives have been proposed, including the double-exponential or Laplace (Park and Casella, 2008), Dirichlet-Laplace (Bhattacharya et al., 2015), R-square induced Dirichlet decomposition (Zhang, Reich and Bondell, 2016), and horseshoe+ (Bhadra et al., 2017).For instance, the Dirichlet-Laplace prior has a Dirichlet concentration hyperparameter a that strongly affects the sparsity properties of the prior, and based on the experiments of Bhattacharya et al. (2015) has a substantial effect on the results.It would therefore be interesting the investigate a prior information calibrated selection of a using our framework.
Depending on the prior, corresponding analytical results like (3.10) and (3.11) may or may not be available, but as long as one is able to sample both from p(τ ) and p(λ j ), it is always easy to draw samples from the prior distribution for m eff , and therefore investigate the effect of hyperprior or hyperparameter choice on the effective model complexity.It must be noted though, that for those prior for which the shrinkage factors κ are not near 0 or 1, the values of m eff can be more difficult to interpret.

Non-Gaussian observation models
When the observation model is non-Gaussian, the exact analysis from Section 3.3 is analytically intractable.We can, however, perform the analysis using a Gaussian approximation to the likelihood.Using the second order Taylor expansion for the log likelihood, the approximate posterior for the regression coefficients given the hyperparameters becomes where z = (z 1 , . . ., zn ), Σ = diag(σ 2 1 , . . ., σ2 n ) and β = (X T Σ−1 X) −1 X T Σ−1 z (assuming the first inverse exists).Here φ denotes the possible dispersion parameter and (z i , σ2 i ) the location and variance for the ith Gaussian pseudoobservation.These are obtained from the first and second order derivatives of the log-likelihood terms L i (f i , φ) with respect to the linear predictor f i = β T x i at the posterior mode fi = βT x i (Gelman et al., 2013, ch. 16.2) The fact that some of the observations are more informative than othersmeaning σ2 i is not constant -makes further simplification somewhat difficult.To proceed, we make the rough assumption that we can replace each σ2 i by a single variance term σ2 .Assuming further that the covariates are uncorrelated with zero mean and variances Var(x j ) = s 2 j (as in Sec.3.3), the posterior mean for the jth coefficient satisfies βj = (1 − κ j ) βj with shrinkage factor given by The discussion in Section 3.3 therefore also approximately holds for the non-Gaussian observation model, except that σ 2 is replaced by σ2 .Still, this leaves us with the question, which value to choose for σ2 to exploit this result in practice?
Table 1 The pseudo variances for the most commonly used generalized linear models to be used as approximate plug-in values for σ 2 in equations of Section 3.3.In practice when necessary, we usually replace µ by the sample mean ȳ.In the Gamma distribution, α denotes the shape parameter so that Var(y) = µ 2 /α.In the inverse Gaussian, λ is the shape parameter so that Var(y) = µ 3 /λ.See McCullagh and Nelder (1989).
For the generalized linear models with y having a distribution in the exponential family with natural parameter θ and dispersion φ, the log likelihood for a single observation has the form (McCullagh and Nelder, 1989) for some specific functions A(•), B(•) and C(•).The pseudo variance for a given observation y is then (see Appendix B) where V (µ) = B (θ) is the variance function, µ = E(y) = B (θ) the expected value.The simplified expressions for the most commonly used generalized linear models with their canonical links are listed in Table 1.
We observe that σ2 is a product between A(φ) and a term that in general depends on f , µ, and y, although for canonical links the dependence from y vanishes because the derivative of ( ∂µ ∂f )/V (µ) with respect to f is zero (McCullagh and Nelder, 1989, ch. 2).Thus it makes sense that for those non-Gaussian models that have a dispersion parameter φ (like Gamma and inverse Gaussian models), the pseudo variance and therefore also τ should scale with A(φ).For the non-constant multiplier of A(φ) in (3.15) we can use, for example, value obtained by setting µ equal to the sample mean of ȳ.This approach, although crude, seems to work reasonably well in practice.For instance, in binary classification, if we have the same number of observations from both classes, then µ = 0.5, yielding σ2 = 4 which was observed to give good results in our earlier study (Piironen and Vehtari, 2017a).

More complex models
Although we limit our discussion to generalized linear models, different authors have employed horseshoe prior in various other models.For instance, Faulkner and Minin (2017) consider trend filtering for modelling time series and use horseshoe as a sparsifying prior on the kth-order forward differences to express prior assumptions about the number of rapid changes in the underlying signal.As another example, Ghosh and Doshi-Velez (2017) use horseshoe prior over the weights in Bayesian neural networks to effectively turn off some of the nodes in the network.
When the model gets more complicated, one cannot simply use the reference result (3.12) to guide the hyperprior choice because it is derived assuming the linear model (2.1).Unless similar results can easily be derived for the model of interest, we recommend a pragmatic approach of drawing from the prior for different values of τ and studying the effect on the sparsity (how many coefficients fall below a certain threshold) to get an idea of a reasonable range of values.Although this strategy may seem crude, we argue that it should still be better than not thinking about the prior at all.Moreover, our results (Sec.4.2) suggest that with a weakly informative prior such as τ ∼ C + 0, τ2 0 , having even a rough ballpark figure of the correct magnitude for τ 0 can already be a clear improvement compared to the simple τ ∼ C + (0, 1).

Experiments
This section illustrates the benefit of the theoretical advances on synthetic and real world data.All considered models were fitted using Stan1 (codes in the supplementary material) with the default settings unless otherwise stated, running 4 chains, 2000 samples each, first halves discarded as warmup.

Toy example
We first illustrate the impact of the hyperprior choice p(τ ) with a toy example similar to the one discussed by van der Pas, Kleijn and van der Vaart (2014).Consider model (3.3),where each y i is generated by adding Gaussian noise with σ 2 = 1 to the corresponding signal β i .We generated 100 data realizations with n = 400 and the true β * having p * = 20 nonzero entries equal to A = 1, 2, . . ., 10 with the rest of the entries being zeros.We then computed the mean squared error (MSE) between the estimated posterior mean β and the true β * assuming the pure horseshoe prior with hyperpriors τ ∼ C + (0, 1), τ ∼ C + 0, τ 2 0 and τ = τ 0 , where τ 0 is calculated from Equation (3.13) with the oracle prior guess p 0 = p * .Although the last choice reflects stronger prior information than we would typically expect to have in practice, the purpose of this setup is simply to demonstrate that one can substantially improve the inferences using our framework provided substantial prior knowledge exists.Notice though, that even when we set τ = τ 0 , we do not treat τ as completely fixed, because it depends on σ which is treated as an unknown parameter with vague prior p(σ 2 ) ∝ σ −2 .Figure 4 shows the MSE for the three hyperpriors for different values of A. For each prior, the error is largest close to A = √ 2 log 400 ≈ 3.5, which is called the "universal threshold" for this problem (Johnstone and Silverman, 2004;van der Pas, Kleijn and van der Vaart, 2014).Below this threshold the nonzero components in β are too small to be detected and are thus shrunk too heavily towards zero which introduces error.For A = 4 the informative τ = τ 0 actually yields the worst results due to this overshrinkage (see discussion below), but gives clearly superior results for larger A. The choice τ ∼ C + 0, τ 2 0 gives better results than τ ∼ C + (0, 1) but is clearly inferior to τ = τ 0 .
Figure 5 illustrates the data y and the estimated coefficients β for one particular data realization when A = 4 and A = 6.In both cases the informative choice τ = τ 0 helps to shrink the zero components in β towards zero, but for A = 4 also overshrinks the nonzero components.The reason for the overshrinkage is that some observations y i that correspond to zero signal β i = 0 happen to have similar magnitude to the observations coming from a nonzero signal β i = A, and thus these irrelevant components "steal" from the limited budget for m eff .For this particular value of A the overshrinkage of the actual signals happens to be worse in terms of MSE than undershrinkage of the zero components, and thus one would get better results by setting p 0 to be slightly above the true p * (results not shown).For A = 6 the actual signals are large enough to be distinguished from zero, and the informative selection of τ yields substantially better estimate for β.
Finally, Figure 6 illustrates the importance of scaling τ with the noise level σ.The top row shows one data realization and the posterior mean estimates β when the global parameter is fixed either to τ = p0 D−p0 or to τ = p0 D−p0 σ.For these data both yield essentially the same result, since the true noise variance σ 2 is one.However, when the observations are scaled by multiplying them by 0.1 (bottom row), the value for τ that does not scale with σ yields clearly worse results than in the first case, while the results for the latter value remain practically unchanged.What essentially happens is that when the observations are transformed to a smaller scale, then fixing τ = p0 n−p0 increases the prior expectation for m eff by the same factor (10 in this case) and thus it will favor solutions with many more coefficients far from zero.
This small experiment has relevance also regarding the more general model (2.1) because in the ideal case of uncorrelated predictors with unit variance X T X = nI, we can think that we have a single observation of each β j with variance σ 2 /n.In practice these conditions are rarely met but the idea is still useful.

Classification with separable data
The purpose of this example is to illustrate the problem with the original horseshoe not controlling the magnitude of the largest coefficients and show how the regularized horseshoe can solve this issue.We generated n = 30 binary classification observations so that for the instances in the first class, the first two features were drawn from a Gaussian with mean 1 and scale 0.5, whereas in the other class the mean of the first two features was −1 (the data are visualized in Fig. 7).In addition, we generated 98 irrelevant features drawn from the standard Gaussian, so that the total input dimension was D = 100.We then fitted the standard logistic regression model with prior β 0 ∼ N 0, 5 2 for the intercept and four different priors for the regression coefficients β j : the original horseshoe, hierarchical shrinkage with ν = 3, and regularized horseshoe with slab scales c = 2 and c 2 ∼ Inv-Gamma(2, 8), the latter of which corresponds to a Studentt 4 0, 2 2 slab (see Sec. 2.3).In each case we used hyperprior τ ∼ t + 3 0, τ 2 0 (the conclusions of this example are not sensitive to this choice).For consistent sparsity assumptions, for the original and the regularized horseshoe, τ 0 was calculated from (3.12) using p 0 = 2, and for the hierarchical shrinkage by solving numerically E(m eff | τ = τ 0 ) = p 0 = 2. Figure 7 shows the scatter plots of the posterior draws for β 1 and β 2 (top row) as well as the medians and 80%-intervals for β 3 , . . ., β 100 (middle row), for the four different priors.Because the data are separable using only x 2 , this feature is a "solitary separator" and thus the mean for β 2 does not exist under the horseshoe prior due to its Cauchy tails (Ghosh, Li and Mitra, 2017).Although for β 1 the mean exist (x 1 is not a solitary separator), the posterior has substantial mass for very large values for this parameter also.Moreover, for the original horseshoe the NUTS produces almost 200 divergent transitions after the warmup showing clear problems with sampling.Using ν = 3 for the local parameters cuts down the tails and reduces the number of divergent transitions to a few, but still yields quite fat posterior tails for these two coefficients and results in much less shrinkage for the coefficients of the irrelevant features (middle row).Finally, the First two columns denote the results for the pure horseshoe and hierarchical shrinkage, and the last two columns for the regularized horseshoe.
regularized horseshoe exhibits the most satisfactory performance cutting down the tails for β 1 and β 2 while still being able to shrink the irrelevant coefficients as well as the horseshoe.In this case, fixing the scale of the Gaussian slab to c = 2 results in too strong regularization for the relevant coefficients and generally we would prefer a less informative choice such as c ∼ Inv-Gamma(2, 8), but our purpose here was to demonstrate how easy it is to specify different levels of regularization for largest coefficients using the new prior (2.8).
The top row of Figure 7 clearly reveals the multimodality of the posterior for β 1 and β 2 .This does not produce marked problems in this case (e.g.MCMC convergence problems) but in general the multimodality can be an issue both for the original and the regularized horseshoe, and we will discuss this further in Sections 4.2 and 6.

Real world data -microarray cancer classification
This section further illustrates the important concepts with some real world examples.We use the four microarray cancer classification datasets from our earlier paper (Piironen and Vehtari, 2017a).The datasets are summarized in Table 2 and can be found online.2For all the datasets we used the standard logistic regression model with a vague prior β 0 ∼ N 0, 10 2 for the intercept, and the original and regularized horseshoe priors for the regression coefficients to compare the differences between the two.For these problems, we reduced the number of draws per chain to 1000 to reduce the computation time.
We first consider the Ovarian dataset as a representative example of how the prior choice p(τ ) and the use of the regularized horseshoe can affect the results.We fitted the model to the data with two hyperprior choices, τ ∼ C + (0, 1) and τ ∼ C + 0, τ 2 0 , where τ 0 is computed from (3.12) using p 0 = 3 as our prior guess for the number of relevant variables.For the regularized horseshoe we used hyperprior c 2 ∼ Inv-Gamma(2, 8) (as with the synthetic example, Sec. 4.1.2) which corresponds to a Student-t 4 0, 2 2 slab (see Sec. 2.3).
Figure 8 shows prior and posterior draws for τ and m eff , and the absolute values of the posterior means for the regression coefficients, for the different prior configurations.The results for τ ∼ C + (0, 1) illustrate how weakly τ is identified by the data: there is very little difference between the prior and posterior samples for τ and consequently for m eff , and thus this "non-informative" prior actually has a strong influence on the posterior.With the pure horseshoe this results in severe under-regularization and implausibly large magnitude for the regression coefficients.This happens because the effective model complexity is so large that the classes become separable, and due to the vanishing maximum likelihood estimate also the posterior mean vanishes (as in the synthetic example in Sec.4.1.2).With the regularized horseshoe, even with the complete separation, the magnitude of the coefficients remains sensible due to the regularization by the slab.
Replacing the scale of the half-Cauchy hyperprior with τ 0 , reflecting a more sensible guess for the number of relevant variables, has a substantial effect on the posterior: the posterior mass for m eff becomes concentrated on much smaller The first two columns denote the results for the pure and regularized horseshoe with hyperprior τ ∼ C + (0, 1), and the last two columns the same but with τ ∼ C + 0, τ 2 0 , where τ 0 is computed from (3.12) with prior guess p 0 = 3.
values and the magnitude of the regression coefficients more sensible.For this hyperprior choice the difference between the original and regularized horseshoe is much less severe, but also in this case the largest coefficients are smaller for the regularized horseshoe.How this affects the predictive accuracy will be discussed in a moment.A potential explanation for why τ and therefore m eff are not strongly identified here is that there are a lot of correlations in the data.For instance, the predictor j = 1491 which appears relevant based on its regression coefficient, has an absolute correlation of at least |ρ| = 0.5 with 314 other variables, out of which 65 correlations exceed |ρ| = 0.7.This indicates that there are a lot of potentially relevant but redundant predictors in the data, and thus similar fit could be obtained by models with varying levels of sparsity.
Another difficulty with correlating predictors is that they cause the posterior of the regression coefficients to become multimodal, which usually results in convergence issues.Figure 9 shows the posterior draws under the regularized horseshoe prior for the three coefficients with the largest absolute mean.All three coefficients have most of the draws near zero but some of the draws are far from zero, illustrating the multimodality.The R-values (potential scale reduction factor, see e.g., Gelman et al., 2013, ch. 11) for these parameters were 1.09, 1.20 and 1.05 indicating problems with the convergence of the MCMC chains.Although these convergence issues do not seem to have a drastic negative impact on the predictive accuracy (see the discussion below), the multimodality dramatically reduces the sampling efficiency and is therefore a major practical and theoretical concern and deserves more attention in the future research (see also the discussion in Section 6).
To investigate the effect of the prior choices on the prediction accuracy, we split each dataset into two halves, using one fifth of the data as a test set.All the results were then averaged over 50 such random splits into training and test sets.We carried out the tests for priors τ ∼ C + 0, τ 2 0 and τ ∼ N + 0, τ 2 0 with various τ 0 .The half-normal prior was included in the tests to investigate whether it sometimes could be beneficial to have a strong control over τ by using a short-tailed prior.To get a baseline for the comparisons, we also computed the prediction accuracies to Lasso with the regularization parameter tuned by 10fold cross-validation.The Lasso results were computed with the default settings of the R-package glmnet (Friedman, Hastie and Tibshirani, 2010).The top row of Figure 10 shows the effect of the prior choice on the test prediction accuracy, and the other two rows the computation time 3 and the fraction of divergent transitions produced by the NUTS after the warm-up.For the original horseshoe (solid lines) the results illustrate a clear benefit from using even a crude prior guess for the number of relevant variables p 0 : transforming any guess between p 0 = 1 and p 0 = n into a value of τ 0 using Equation (3.12) (shaded region) and using this as the scale for the half-Cauchy prior instead of τ 0 = 1 yields improved prediction accuracy and reduced computation time in all the datasets.The regularized horseshoe seems in general less sensitive to the prior choice for τ (Ovarian and Colon datasets), but in some cases clearly benefits from a more carefully chosen prior (Prostate and Leukemia datasets).For both priors, using half-Cauchy hyperprior for τ is clearly less sensitive to the , and the fraction of divergent transitions after warmup (bottom row) as a function of τ 0 for two hyperpriors: τ ∼ N + 0, τ 2 0 (red), and τ ∼ C + 0, τ 2 0 (yellow).Solid lines denote the pure horseshoe and dashed lines the regularized horseshoe with c 2 ∼ Inv-Gamma(2, 8).The shaded area denotes the values for τ 0 that correspond to sparsity guesses between p 0 = 1 and p 0 = n (Eq.(3.12)).The black dotted line in the top row plots shows the MLPD for Lasso.All the results are averaged over 50 random splits into training and test sets.
prior guess τ 0 , and yields better results than the half-normal especially when τ 0 was chosen to be too small.
In terms of the predictive accuracy, the regularized horseshoe performs overall comparably to the original horseshoe: for the Colon dataset it performs slightly better, for Leukemia slightly worse, and for Ovarian and Prostate about the same as the pure horseshoe.Only for the Colon and Leukemia datasets the difference is statistically significant (the errorbars are left out from the plot to avoid mess).In Leukemia dataset where the regularized horseshoe loses to the pure horseshoe the chosen prior for the slab width c is probably unnecessarily restrictive leading to too large regularization for the largest coefficients.It is evident that in this case using a looser prior for c could improve the results (because when c → ∞ we recover the original horseshoe), but we feel that these results are enough to convince the reader that a reasonably chosen hyperprior p(c) does not necessarily compromise and in some cases can even improve the accuracy compared to the pure horseshoe.
On the other hand, the regularized horseshoe clearly improves the sampling robustness of the posterior.The regularized horseshoe produces only very few divergent transitions after the warm-up if any (the fraction is non-negligible only for the Prostate dataset with a poorly chosen global prior p(τ )), whereas for the original horseshoe the fraction varies between 1-30% which is really a lot.With this many divergences, there is always a concern about a biased inference and this cannot be taken too lightly (see the next Section for recommendations).Moreover, also the computation times are systematically either smaller or similar to the original horseshoe, which is due to better behaving posterior.
For any reasonably selected prior (shaded region), the (regularized) horseshoe consistently outperforms Lasso in terms of predictive accuracy, but in some cases the difference is not very large.A clear advantage for Lasso, on the other hand, is that it is hugely faster, with computation time of less than a second for these problems, whereas even for the regularized horseshoe, the computation starts to get quite involved for the two biggest problems (around 30-60 minutes for Prostate and Leukemia datasets), which is the price for the full Bayesian inference.

Recommendations
Based on the theoretical considerations and the experimental results, instead of the original horseshoe, we recommend using the regularized horseshoe with a weakly informative prior on c, such as (2.11) with appropriately chosen scale s and degrees of freedom ν.As illustrated in Section 4, this does not compromise the predictive accuracy but greatly improves the sampling robustness of the prior and is resistant to problems originating from weak identifiability of the parameters.Still, it is worthwhile to compare the results to a model with a relatively loose prior on c using standard model assessment techniques (Vehtari and Ojanen, 2012;Vehtari, Gelman and Gabry, 2017) to get an indication if the slab width was chosen to be too restrictive.However, even if a very large value for c yields better predictive fit, if it produces a large number of divergent transitions, it cannot be recommended because in such cases the inference is likely to have a bias with an unknown magnitude (Betancourt, 2017a).In our experience, for the regularized horseshoe it is often possible to get rid off the divergences by tuning Stan's adaptation routine as explained by Betancourt (2017b).
Also, instead of using the simple τ ∼ C + (0, 1), for linear regression we generally recommend a weakly informative default choice τ | σ ∼ C + 0, τ 2 0 , where τ 0 is computed from (3.12) using the prior guess p 0 for the number of relevant variables.For the other generalized linear models (GLMs), an approximately equivalent choice is obtained by replacing σ with a appropriate plug-in value as explained in Section.3.5.Based on the results on the real world data, this choice seems to perform well unless p 0 is chosen to be much too large.However, the toy example shows that sometimes even better results can be obtained by more informative prior.

Discussion
This paper has discussed the use of the horseshoe prior for sparse Bayesian generalized linear models.We considered two methodological advances.First, we proposed a generalization of the horseshoe -called the regularized horseshoe -that operates otherwise similarly as the horseshoe but allows specifying the regularization to the coefficients that are far from zero.Second, we introduced a new concept -effective number of nonzero parameters -which is useful for guiding the hyperprior choice for the global shrinkage parameter.
The experiments demonstrated the benefit of both of these approaches.The ability to regularize those parameters that are far from zero is useful especially when the parameters are only weakly identified by the data.As an example we discussed the logistic regression with data separation.Adding a small regularization to the largest coefficients ensures that the posterior mean will exist, leading to more reasonable parameter estimates and faster posterior exploration.The regularized horseshoe solves also the problems with the divergent transitions that have previously been an indication about problems in posterior simulation and possibly biased inference (Betancourt, 2017b).
Regarding the hyperprior choice for the global shrinkage parameter, we argued that the previous default choices are often dubious based on their tendency to favor solutions with too many parameters unshrunk.The experiments show that for many datasets, one can obtain clear improvements -in terms of better parameter estimates, prediction accuracy and faster computation -by coming up even with a crude guess for the number of relevant variables and transforming this knowledge into a prior for τ using our proposed framework.Based on our results, for a reasonably selected global hyperprior, the (regularized) horseshoe outperforms Lasso in terms of predictive accuracy.A notable difference is that Lasso produces a truly sparse solution with exact zeros for some coefficients, whereas horseshoe does not.After fitting the full model, a truly sparse solution without losing predictive accuracy could be obtained using the projective variable selection (Piironen and Vehtari, 2017b) 4 .
Although these advances improve the overall performance and applicability of the horseshoe prior to various problems, some challenges still persist.When using the horseshoe prior with correlating predictors, the major concern is always the multimodality of the posterior, which can lead to difficulties in sampling and especially to slow convergence of the MCMC.The experiments indicated that the regularized horseshoe can improve the sampling robustness but does not remove the multimodality.It must be noted though, that the multimodality is a direct consequence of the sparsifying prior assumption that favors solutions where only one in a group of correlating predictors would have a nonzero coefficient.Whether this assumption is reasonable in practice is a more fundamental question.It could make sense -both computationally and theoretically -to employ a horseshoe or other sparsifying prior on some transformed set of features -for instance principal components of the original predictors -instead of the original variables themselves.This would yield a unimodal posterior but the effects on the other aspects such as the predictive accuracy remain to be explored.We leave these ideas for future investigation.We denote µ = B (θ) and V (µ) = B (θ), from which we can also deduce ∂µ ∂θ = V (µ).The first derivative is given by the chain rule The second derivative is then The following shows the Stan code for the linear Gaussian model with the regularized horseshoe prior using a straightforward parametrization.In our experience this code works fine, but in Appendix C.2 we also provide another code using different parametrization (with which we generated our results).This is worth trying if the simple code has issues with sampling (produces divergences).
In the code, both τ and λ j are given half-t priors with the degrees of freedom and the scale defined by the user (the scale can be adjusted only for τ ).Setting nu local = 1 corresponds to the horseshoe.nu global = 1 gives τ a half-Cauchy prior.The scale for τ is scale global*sigma, so if we want to set this to be τ 0 = p0 n .The code assumes a Student-t slab for regularizing the largest coefficients, and the scale and degrees of freedom can be specified using slab scale and slab df arguments.

Fig 3 :
Fig 3: Histograms of prior draws for m eff (effective number of nonzero regression coefficients, Eq. (3.7)) imposed by different prior choices for τ , when the total number of input variables is D = 10 and D = 1000.τ 0 is computed from formula (3.12) assuming n = 100 observations with σ = 1 and p 0 = 5 as the prior guess for the number of relevant variables.Note the varying scales on the horizontal axes in the bottom row plots.

Fig 4 :
Fig 4: Toy example: Mean squared error (MSE) between the estimated and the true coefficient vector of length n = 400 on average over 100 different data realizations.The true coefficient vector has p * = 20 elements with a nonzero value equal to A and the rest are zeros.

Fig 5 :
Fig 5: Toy example: An example data realization y = (y 1 , . . .y n ) crosses), posterior mean β = ( β1 , . . ., βn ) (black dots) and the true signal β * (red lines) for A = 4 (top row) and A = 6 (bottom row).In both cases the oracle value for τ helps to shrink the zero components in β but also overshrinks the actual signals in the case A = 4.

Fig 6 :
Fig 6: Toy example: Top row: one data realization y (gray crosses), the posterior estimates β (black dots) and the true signal β * (red lines) when A = 10 and the global parameter is fixed either to τ = p0 D−p0 or τ = p0 D−p0 σ.Bottom row: Otherwise the same but now the scale of the observations is changed by multiplying them by 0.1.

Fig 7 :
Fig 7: Separable classification.Top row: draws for the regression coefficients of the two relevant features.Middle row: median and 80%-interval for the coefficients of the irrelevant features.Bottom row: the observed data (red and green denoting the different classes) and the predictive class probabilities for ỹ = 1 as a function of the first two inputs, given that the rest of the inputs are set to zero.First two columns denote the results for the pure horseshoe and hierarchical shrinkage, and the last two columns for the regularized horseshoe.

Fig 8 :
Fig 8: Ovarian dataset : Histograms of prior (grey) and posterior (red) draws for τ (top row) and m eff (middle row), and absolute values of the posterior means for the regression coefficients | βj | (bottom row) imposed by different prior choices.The first two columns denote the results for the pure and regularized horseshoe with hyperprior τ ∼ C + (0, 1), and the last two columns the same but with τ ∼ C + 0, τ 2 0 , where τ 0 is computed from (3.12) with prior guess p 0 = 3.

Fig 9 :
Fig 9: Ovarian dataset : Histograms of the posterior draws under the regularized horseshoe prior for the three regression coefficients with the largest absolute mean.The histograms highlight the multimodality of the posterior.

Fig 10 :
Fig 10: Microarray datasets : Mean log predictive density (MLPD) on test data (top row), computation time (middle row), and the fraction of divergent transitions after warmup (bottom row) as a function of τ 0 for two hyperpriors: τ ∼ N + 0, τ 2 0 (red), and τ ∼ C + 0, τ 2 0 (yellow).Solid lines denote the pure horseshoe and dashed lines the regularized horseshoe with c 2 ∼ Inv-Gamma(2, 8).The shaded area denotes the values for τ 0 that correspond to sparsity guesses between p 0 = 1 and p 0 = n (Eq.(3.12)).The black dotted line in the top row plots shows the MLPD for Lasso.All the results are averaged over 50 random splits into training and test sets.

Table 2
Summary of the real world microarray cancer datasets; number of predictor variables D and dataset size n.