The Horseshoe+ Estimator of Ultra-Sparse Signals

We propose a new prior for ultra-sparse signal detection that we term the "horseshoe+ prior." It is a natural extension of the horseshoe prior, which has achieved success in the estimation and detection of sparse signals and has been shown to possess a number of desirable theoretical properties while remaining computationally feasible in high dimensions. The horseshoe+ prior builds upon these advantages. We prove that the horseshoe+ posterior concentrates at a rate faster than that of the horseshoe in the Kullback-Leibler (K-L) sense. We also establish theoretically that the proposed estimator has lower posterior mean squared error in estimating signals compared to the horseshoe, and achieves the optimal Bayes risk in testing up to a constant. For global-local scale mixture priors, we develop a new technique for analyzing the marginal sparse prior densities using the class of Meijer-G functions. In simulations, the horseshoe+ estimator demonstrates superior performance in a standard design setting against competing methods, including the horseshoe and Dirichlet-Laplace estimators. We conclude with an illustration on a prostate cancer data set and by pointing out some directions for future research.

1. Introduction. Ultra-sparse signal detection poses a challenge for developing statistical estimators. In the classical normal means inference problem, we observe data from the probability model $(y_i \mid \theta_i) \sim N(\theta_i, 1)$ for $i = 1, \ldots, n$, and wish to estimate the vector of normal means $\theta = (\theta_1, \ldots, \theta_n)$. Sparsity occurs when a large portion of the parameter vector consists of zeros. The "ultra-sparse" or "nearly black" case occurs when $\theta$ lies in the set $\ell_0[p_n] \equiv \{\theta : \#(\theta_i \neq 0) \le p_n\}$, with the number of non-zero parameter values $p_n = o(n)$ and $p_n \to \infty$ as $n \to \infty$.
To motivate the need for new prior distributions, consider the classic James-Stein "global" shrinkage rule, $\hat\theta_{JS}(y)$. This estimator uniformly dominates the traditional sample mean estimator $\hat\theta = y$. For all values of the true parameter $\theta$ and for $n > 2$, we have the classical mean squared error (MSE) risk bound:
$$R(\hat\theta_{JS}, \theta) := E_{y|\theta} \|\hat\theta_{JS}(y) - \theta\|^2 < n = E_{y|\theta} \|y - \theta\|^2, \quad \forall\, \theta.$$
However, for a sparse signal, $\hat\theta_{JS}(y)$ performs poorly. Suppose that the true parameter $\theta$ is an "$r$-spike," with $r$ coordinates of magnitude $\sqrt{n/r}$ and the rest set at zero, giving $\|\theta\|^2 = n$. Then Johnstone and Silverman (2004) showed that the classical risk satisfies $R(\hat\theta_{JS}, \theta) \ge n/2$, whereas simple thresholding at $\sqrt{2 \log n}$ performs with risk of order $\sqrt{\log n}$. To address this issue, a "global-local" shrinkage estimator called the horseshoe estimator was proposed by Carvalho, Polson and Scott (2010). The horseshoe estimator, $\hat\theta_{HS}(y)$, provides a Bayes rule that inherits the good MSE properties of global shrinkage estimators while simultaneously attaining asymptotically minimax risk for estimating sparse signals. For example, Polson and Scott (2012) showed that $\hat\theta_{HS}(y)$ uniformly dominates the traditional sample mean estimator in terms of MSE, and van der Pas, Kleijn and van der Vaart (2014) showed that the horseshoe estimator has good posterior concentration properties. Specifically, the horseshoe estimator achieves $\sup_{\theta \in \ell_0[p_n]} E_{y|\theta} \|\hat\theta_{HS}(y) - \theta\|^2 \asymp p_n \log(n/p_n)$, which is the asymptotically minimax risk rate in $\ell_2$ for nearly black objects (Donoho et al., 1992). Here the "worst" $\theta \in \ell_0[p_n]$ is obtained at the maximum absolute difference $|\hat\theta_{HS}(y) - y|$, where $\hat\theta_{HS}(y) = E_{HS}(\theta \mid y)$ can be interpreted as a Bayes posterior mean, which is optimal under the Bayes MSE.
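To make the contrast concrete, the following sketch (ours, not from the paper) simulates an $r$-spike and compares the empirical risk of the positive-part James-Stein rule with hard thresholding at $\sqrt{2 \log n}$; the function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 1000, 10
theta = np.zeros(n)
theta[:r] = np.sqrt(n / r)          # r-spike: ||theta||^2 = n

def james_stein(y):
    # positive-part James-Stein shrinkage toward zero
    shrink = max(0.0, 1.0 - (len(y) - 2) / np.sum(y**2))
    return shrink * y

def hard_threshold(y, t):
    return np.where(np.abs(y) > t, y, 0.0)

reps = 200
js_risk = th_risk = 0.0
t = np.sqrt(2 * np.log(n))
for _ in range(reps):
    y = theta + rng.standard_normal(n)
    js_risk += np.sum((james_stein(y) - theta) ** 2)
    th_risk += np.sum((hard_threshold(y, t) - theta) ** 2)
js_risk /= reps
th_risk /= reps
print(js_risk, th_risk)  # JS risk is close to n/2; thresholding risk is far smaller
```

The global shrinkage factor treats every coordinate identically, so James-Stein pays roughly $n/2$ on this configuration, while coordinate-wise thresholding exploits the sparsity.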
Though the horseshoe prior was originally designed to provide an accurate and efficient estimator of a sparse normal mean vector, it turns out that the multiple testing rule induced by the horseshoe prior also enjoys the "oracle property" in testing under 0-1 loss (Datta and Ghosh, 2013). For the multiple testing problem in the classical two-groups model, many approaches involve explicitly modeling the ultra-sparse mean as a mixture of a point mass at zero and a heavy-tailed alternative, also known as the "spike-and-slab" approach (Mitchell and Beauchamp, 1988). This results in a posterior distribution over a high-dimensional discrete space, the exploration of which often carries an extreme computational cost. The one-group model, inspired by the widespread popularity of the lasso for variable selection in regression (Tibshirani, 1996), is computationally more tractable, and can be used to select a model through concentration of measure in a space of pseudo-probabilities, rather than in the $n$-dimensional Euclidean space (Carvalho, Polson and Scott, 2010; Polson and Scott, 2010; Datta and Ghosh, 2013). In particular, the horseshoe prior leads to "pseudo-posterior" probabilities that mimic the true posterior inclusion probabilities from a two-groups mixture model, and induces a multiple testing rule with attractive properties. Specifically, Datta and Ghosh (2013) extended the "asymptotic Bayes optimality under sparsity" framework introduced by Bogdan et al. (2011), and proved that the Bayes risk of the horseshoe estimator attains the Bayes risk of the oracle if the global shrinkage parameter is of the same order as the proportion of non-zero components of $\theta$. It thus seems natural to require that any new sparse signal recovery prior attain the oracle risk up to a multiplicative constant, and improve upon the error rates both in theory and in practice.
The purpose of our paper, then, is to provide an estimator that sharpens the ability of the Bayes estimator to extract signals from sparsity while maintaining the optimal properties of the induced decision rule. We provide theoretical justification by demonstrating that the proposed estimator has sharper information-theoretic bounds and better MSE bounds than the horseshoe estimator. We illustrate that the horseshoe+ estimator achieves greater separation of signals and noise in a standard simulation setting, and we provide a comprehensive MSE comparison with existing sparse estimators. We develop a hierarchical model that is a natural extension of the horseshoe model of Carvalho, Polson and Scott (2010); hence our terminology, the horseshoe+ hierarchical model.
The rest of the paper is outlined as follows. Section 2 motivates the class of one-group global-local mixture shrinkage priors for sparse signal estimation as a suitable alternative to the commonly used two-groups models. Section 3 describes the horseshoe+ estimator with particular reference to global-local shrinkage estimators. Section 4 provides theoretical properties of our proposed estimator. The main theoretical findings of this paper are as follows:
1. The decision rule induced by the horseshoe+ prior attains the risk of the Bayes oracle under 0-1 loss up to a multiplicative constant, with the constant in the Bayes risk close to the constant in the oracle. We also obtain a sharper bound on the probability of type-I error compared to the horseshoe prior.
2. The mean squared error of the horseshoe+ estimator is always smaller than that of the horseshoe estimator in estimating a large signal.
3. The estimated sampling density using the horseshoe+ prior converges to the true density at a super-efficient rate when the true parameter value is zero, where efficiency is measured by the Kullback-Leibler (K-L) distance between the true density and the estimated sampling density. The rate of convergence is shown to be faster than that of the horseshoe estimator, using asymptotic properties of the prior derived via Meijer-G functions (Mathai, Saxena and Haubold, 2009).
Section 5 provides comparisons of our proposed approach with other shrinkage rules using a standard design setting. We compare horseshoe+ with the Dirichlet-Laplace estimator (Bhattacharya et al., 2014) and the horseshoe estimator (Carvalho, Polson and Scott, 2010), illustrating superior performance of the horseshoe+ estimator in both estimation (under squared error loss) and testing (under 0-1 loss). Section 6 discusses the application of the proposed prior on a high-dimensional prostate cancer data set. Section 7 concludes with some directions for future research.
2. The one and two groups models. Consider the model of Section 1, i.e., $(y_i \mid \theta_i) \sim N(\theta_i, 1)$, for $i = 1, \ldots, n$, where $\theta$ is ultra-sparse or nearly black, in the sense that $\theta \in \ell_0[p_n]$. Our interest might lie in testing whether each $\theta_i$ is zero or non-zero, based on a suitably normalized test statistic, or in proposing a suitable estimate $\hat\theta_i$ that has attractive properties, e.g., low mean squared error. The large number of parameters, together with sparsity, requires further modeling of the data to facilitate learning via empirical Bayes or full Bayes methods. The two-groups model (see, e.g., Mitchell and Beauchamp, 1988; Efron, 2008) provides a natural Bayesian hierarchical framework for the sparse multiple testing problem, where conditionally i.i.d. $\theta_i$ are modeled as
$$(2.1) \qquad \theta_i \mid \mu \sim (1 - \mu)\,\delta_{\{0\}} + \mu\, N(0, \psi^2),$$
where $\delta_{\{0\}}$ denotes a point mass at zero and the parameter $\psi^2 > 0$ is the non-centrality parameter that determines the separation between the two groups. Under this setting, the marginal distribution of $y_i \mid \mu$ is given by
$$(2.2) \qquad y_i \mid \mu \sim (1 - \mu)\, N(0, 1) + \mu\, N(0, 1 + \psi^2).$$
As can be seen from Equation (2.2), the two-groups model leads to a sparse estimate, i.e., it puts exact zeros in the model. The two-groups model enjoys a number of attractive theoretical properties, detailed as follows: 1. Johnstone and Silverman (2004) showed that a thresholding-based estimator for $\theta$ under the two-groups model, with an empirical Bayes estimate for $\mu$, is minimax in the $\ell_2$ sense.
2. Castillo and van der Vaart (2012) treated a full Bayes version of the problem and again found an estimate that is minimax in $\ell_2$. 3. Bogdan et al. (2011) found that the estimator under the two-groups model provides asymptotically optimal performance in testing, in the sense that its performance matches the Bayes oracle up to a constant.
Thus, while the two-groups approach is a recognized gold standard for Bayesian sparse signal detection and estimation, a number of arguments favor an alternative approach via global-local shrinkage priors, also termed the one-group model. First, in many real-life applications, such as studies involving "high-dimensional, low sample size" gene expression data, the majority of the effect sizes are negligible, but not exactly zero, which argues against the exact sparsity induced by the model in Equations (2.1-2.2). From a more pragmatic point of view, the one-group model leads to much faster computation, owing to the simple batch updating of the latent local shrinkage parameters in the Gibbs sampler. We refer the reader to Carvalho, Polson and Scott (2010) for further arguments and insights. A useful outcome of the two-groups model is that the posterior mean $E(\theta_i \mid y_i)$ can be written as follows:
$$(2.3) \qquad E(\theta_i \mid y_i) = \omega_i \cdot \frac{\psi^2}{1 + \psi^2} \cdot y_i,$$
where $\omega_i = P(\theta_i \neq 0 \mid y_i)$ is the posterior inclusion probability. Looking at the form of the posterior mean, one can see that it involves a "global" component $\psi^2/(1 + \psi^2)$ that shrinks all of the parameters towards zero, while the "local" component $\omega_i$ allows the signal terms to escape from being shrunk too close to zero. The lack of a local shrinkage term explains why Stein-type global shrinkage estimators perform poorly in a nearly-black setting.
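For concreteness, the two-groups posterior mean of Equation (2.3) can be computed in closed form; the sketch below is our illustrative implementation (the helper names are ours):

```python
import numpy as np

def npdf(y, sd):
    # normal density with mean zero and standard deviation sd
    return np.exp(-0.5 * (y / sd) ** 2) / (np.sqrt(2 * np.pi) * sd)

def two_groups_posterior_mean(y, mu, psi2):
    """E(theta | y) under theta ~ (1 - mu) delta_0 + mu N(0, psi2):
    omega * psi2 / (1 + psi2) * y, with omega the inclusion probability."""
    m0 = npdf(y, 1.0)                  # marginal of y under the null group
    m1 = npdf(y, np.sqrt(1.0 + psi2))  # marginal of y under the signal group
    omega = mu * m1 / ((1.0 - mu) * m0 + mu * m1)
    return omega * psi2 / (1.0 + psi2) * y

# small observations are shrunk almost to zero; large ones barely at all
print(two_groups_posterior_mean(0.5, mu=0.1, psi2=25.0))
print(two_groups_posterior_mean(6.0, mu=0.1, psi2=25.0))
```

This makes the global/local decomposition visible in code: the factor `psi2 / (1 + psi2)` is shared by every coordinate, while `omega` adapts to each observation.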
The key to success in a one-group model is to design a "global-local" shrinkage term that gives the same form of the posterior mean as in the two-groups model. The horseshoe prior of Carvalho, Polson and Scott (2010) is one such global-local shrinkage prior that has been shown to possess a number of theoretically attractive properties along with a considerably easier computational implementation compared to the two-groups model.
1. Carvalho, Polson and Scott (2010) showed that the horseshoe estimator has good information-theoretic properties when the true parameter vector is sparse, in the sense that the K-L distance between the estimated and true densities decreases at a super-efficient rate. 2. Datta and Ghosh (2013) proved that the decision rule induced by the horseshoe estimator is asymptotically Bayes optimal for multiple testing under 0-1 loss, up to a multiplicative constant. 3. van der Pas, Kleijn and van der Vaart (2014) showed that the horseshoe estimator is minimax in $\ell_2$ in a nearly-black case up to a constant. The constant they were able to achieve is at least twice as large as the minimax constant of Donoho et al. (1992).
These theoretical properties, coupled with the ease of computational implementation, suggest that the one-group model holds considerable promise. Some other important examples of the one-group model include the three-parameter beta prior (Armagan, Clyde and Dunson, 2011), the normal-exponential-gamma prior (Griffin and Brown, 2010), the generalized double Pareto prior (Armagan, Dunson and Lee, 2013), and the Dirichlet-Laplace prior (Bhattacharya et al., 2014). Below we describe the one-group horseshoe hierarchical model and then propose the horseshoe+ model, which leads to considerable improvements upon the horseshoe.
3. The horseshoe+ estimator. Given normally distributed data $(y_i \mid \theta_i) \sim N(\theta_i, 1)$, the horseshoe hierarchical model is defined by the set of conditional distributions
$$(3.1) \qquad (\theta_i \mid \lambda_i) \sim N(0, \lambda_i^2), \qquad (\lambda_i \mid \tau) \sim C^+(0, \tau),$$
where $C^+$ denotes a half-Cauchy distributed scale parameter $\lambda_i$ with density
$$(3.2) \qquad p(\lambda_i \mid \tau) = \frac{2}{\pi \tau \{1 + (\lambda_i/\tau)^2\}}, \qquad \lambda_i > 0,$$
as discussed by Gelman (2006). The horseshoe+ hierarchical model is defined similarly by the set of conditionals
$$(3.3) \qquad (\theta_i \mid \lambda_i) \sim N(0, \lambda_i^2), \qquad (\lambda_i \mid \eta_i, \tau) \sim C^+(0, \tau \eta_i), \qquad \eta_i \sim C^+(0, 1),$$
where we have introduced a further half-Cauchy mixing variable $\eta_i$. In both models, the local shrinkage random effects $\lambda_i$ are not marginally independent after mixing over the global shrinkage parameter $\tau$. The horseshoe+ model builds on the horseshoe by assuming that the $\lambda_i$ are conditionally independent given another level of local shrinkage parameters $\eta_i$, in addition to $\tau$. Integrating over $\eta_i$ gives the density of $\lambda_i$ as
$$(3.4) \qquad p(\lambda_i \mid \tau) = \frac{4}{\pi^2 \tau} \, \frac{\log(\lambda_i/\tau)}{(\lambda_i/\tau)^2 - 1}, \qquad \lambda_i > 0.$$
Although conceptually a natural extension, we will see that the additional $\log(\lambda_i/\tau)$ term in the numerator leads to very different properties of the proposed estimator compared to the horseshoe. There are a number of ways of dealing with the global shrinkage parameter $\tau$. In a full Bayesian approach, one can put a standard half-Cauchy prior or a Uniform(0, 1) prior on $\tau$.
Another approach appeals to an asymptotic argument suggesting that the empirical Bayes estimate of $\tau$ be set to $\hat\tau = p_n/n$, where $p_n$ is the number of non-zero entries in $\theta$ (van der Pas, Kleijn and van der Vaart, 2014).
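Both hierarchies are easy to simulate forward. The sketch below is our own (with illustrative names), consistent with the half-Cauchy levels described above; it illustrates that the extra half-Cauchy mixing level gives the horseshoe+ both more mass very near zero and heavier tails than the horseshoe:

```python
import numpy as np

rng = np.random.default_rng(1)

def half_cauchy(size, scale):
    # |C(0, scale)| draws via the standard Cauchy
    return np.abs(scale * rng.standard_cauchy(size))

def draw_horseshoe(n, tau):
    lam = half_cauchy(n, tau)            # lambda_i | tau ~ C+(0, tau)
    return rng.standard_normal(n) * lam  # theta_i | lambda_i ~ N(0, lambda_i^2)

def draw_horseshoe_plus(n, tau):
    eta = half_cauchy(n, 1.0)            # eta_i ~ C+(0, 1)
    lam = half_cauchy(n, tau * eta)      # lambda_i | eta_i, tau ~ C+(0, tau * eta_i)
    return rng.standard_normal(n) * lam

hs = draw_horseshoe(200_000, 0.1)
hsp = draw_horseshoe_plus(200_000, 0.1)
# horseshoe+ places more mass both very near zero and far in the tails
print(np.mean(np.abs(hsp) < 0.01), np.mean(np.abs(hs) < 0.01))
print(np.mean(np.abs(hsp) > 10), np.mean(np.abs(hs) > 10))
```

The product of two half-Cauchy scales spreads $\lambda_i$ on the log scale, which is the Monte Carlo counterpart of the extra $\log(\lambda_i/\tau)$ factor in the mixing density.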
To further develop the distributional properties of the horseshoe+ prior, we write it as a member of the class of global-local shrinkage priors, with marginal prior density $p(\theta_i \mid \tau) = \int_0^\infty N(\theta_i \mid 0, \lambda_i^2)\, p(\lambda_i \mid \tau)\, d\lambda_i$. Transforming to a shrinkage scale with $\kappa_i = 1/(1 + \lambda_i^2 \tau^2)$ yields a density on $\kappa_i \in [0, 1]$, a shrinkage "weight." The corresponding "ultra-sparse" Bayes estimator is
$$(3.5) \qquad E(\theta_i \mid y_i, \tau) = \{1 - E(\kappa_i \mid y_i, \tau)\}\, y_i,$$
where we need to compute $E(\kappa_i \mid y_i, \tau)$. Comparing the posterior mean of $\theta_i$ under the one-group model in Equation (3.5) to that under the two-groups model in Equation (2.3), it is apparent that the quantity $\hat\omega_i = 1 - E(\kappa_i \mid y_i, \tau)$ behaves as the posterior inclusion probability $P(\theta_i \neq 0 \mid y_i)$. This results in a natural threshold for simultaneously testing $H_{0i}: \theta_i = 0$ vs. $H_{1i}: \theta_i \neq 0$ for $i = 1, \ldots, n$. We will consider the following multiple testing procedure, proposed by Carvalho, Polson and Scott (2010) and later shown to be optimal under 0-1 loss by Datta and Ghosh (2013) for the horseshoe prior:
$$(3.6) \qquad \text{reject } H_{0i} \text{ if } \hat\omega_i = 1 - E(\kappa_i \mid y_i, \tau) > \tfrac{1}{2}.$$
Note that the marginal data likelihood satisfies $p(y_i \mid \kappa_i, \tau) \propto \kappa_i^{1/2} \exp(-\kappa_i y_i^2/2)$. Signals are identified when $\kappa_i \to 0$ and sparsity occurs when $\kappa_i \to 1$ in the posterior. There are no shrinkage factors in the marginal likelihood to "help" identify sparsity in the normal model; this is precisely why the normal prior performs poorly in sparse settings. The horseshoe prior was designed to cancel the factor $\kappa_i^{1/2}$ and to simultaneously introduce a shrinkage weight that finds zeroes at $\kappa_i = 1$. See Carvalho, Polson and Scott (2010) for further discussion. For the horseshoe prior, the density $p(\lambda_i \mid \tau) = 2/[\pi \tau \{1 + (\lambda_i/\tau)^2\}]$ leads to one density on the $\kappa_i$ scale, whereas the horseshoe+ density $p(\lambda_i \mid \tau) = 4 \log(\lambda_i/\tau)/[\pi^2 \tau \{(\lambda_i/\tau)^2 - 1\}]$ leads to another. The main difference, then, is in the Jacobian term introduced by the representation on the shrinkage scale. This term has fundamentally different behavior in separating signals ($\kappa_i = 0$) from the noise terms ($\kappa_i = 1$).
The horseshoe+ prior introduces another horseshoe-shaped ($U$-shaped) Jacobian factor that pushes posterior mass towards the places of most interest, $\kappa_i = 0$ and $\kappa_i = 1$. This provides an extra level of efficiency in the ultra-sparse signal case. Figure 1 plots the Jacobians of the horseshoe and horseshoe+ priors with $\tau$ set to 0.5 and 2 to make the difference explicit. The horseshoe Jacobian displays unequal shrinkage behavior near the two extremities of $\kappa_i$. The Jacobian term can also be interpreted on the shrinkage scale: writing $\bar\kappa = 1/(1 + \tau^2)$, the density of $\kappa_i$ can be expressed relative to $\bar\kappa$. This representation shows that the horseshoe+ prior allows differential shrinkage for $\kappa_i$ around $\bar\kappa$ (and is continuous at $\kappa_i = \bar\kappa$), and suggests that the global shrinkage parameter $\tau^2$ can also be interpreted as a scaling factor for the shrinkage weights $\kappa_i$.
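The pseudo-inclusion probability $\hat\omega_i = 1 - E(\kappa_i \mid y_i, \tau)$ can be approximated by self-normalized Monte Carlo, using prior draws of $\lambda_i$ as the proposal and the $\kappa_i$-scale likelihood as the weight. The sketch below is ours; it assumes the parametrization in which $\kappa_i = 1/(1 + \lambda_i^2 \tau^2)$ with $\lambda_i \sim C^+(0, 1)$ for the horseshoe, and $\lambda_i \mid \eta_i \sim C^+(0, \eta_i)$, $\eta_i \sim C^+(0, 1)$ for the horseshoe+:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 400_000

def pseudo_inclusion(y, tau, plus=False):
    """Monte Carlo estimate of 1 - E(kappa | y, tau), reweighting prior
    draws by p(y | kappa) proportional to kappa^{1/2} exp(-kappa y^2 / 2)."""
    eta = np.abs(rng.standard_cauchy(N)) if plus else 1.0
    lam = np.abs(eta * rng.standard_cauchy(N))   # extra half-Cauchy level if plus
    kappa = 1.0 / (1.0 + lam**2 * tau**2)
    w = np.sqrt(kappa) * np.exp(-0.5 * kappa * y**2)
    return 1.0 - np.sum(kappa * w) / np.sum(w)

# noise-like observation: omega-hat < 1/2 (keep the null); large |y|: reject
print(pseudo_inclusion(0.5, 0.05))
print(pseudo_inclusion(6.0, 0.05))
print(pseudo_inclusion(6.0, 0.05, plus=True))
```

Thresholding these quantities at $1/2$ is exactly the testing rule discussed above; in practice one would use MCMC rather than this brute-force reweighting.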

4. Theoretical properties of the horseshoe+ estimator.
In this section we investigate the proposed estimator and establish its theoretical properties. We present our main results in the form of eight theorems; proofs and technical details are given in Appendices A.1-A.8.
A proof is given in Appendix A.1. Theorem 4.1 establishes that the horseshoe+ prior has unbounded mass near the origin, and also establishes its tail behavior. Figures 2 and 3 show the behavior of several global-local shrinkage priors near the origin and at the tails. The priors we plot are: horseshoe+, horseshoe (Carvalho, Polson and Scott, 2010), Dirichlet-Laplace (Bhattacharya et al., 2014), generalized double Pareto (Armagan, Dunson and Lee, 2013), standard Cauchy, and standard Laplace (double-exponential). Note that only the horseshoe+ and horseshoe densities are unbounded near the origin. Further note that the horseshoe+ puts more mass than the horseshoe both in a bounded neighborhood of the origin and at the tails. Carvalho, Polson and Scott (2010) established the importance of having a prior with unbounded density near the origin in a sparse signal setting. By Theorem 4.1, the horseshoe+ estimator enjoys the resulting advantages.
4.2. Bound on bias for the horseshoe+ estimator. van der Pas, Kleijn and van der Vaart (2014) showed that the horseshoe estimator, or the posterior mean under the horseshoe prior, attains the minimax rate for the mean squared error, possibly up to a multiplicative constant. This depends crucially on a bound on the bias term. Similar to a result in Carvalho, Polson and Scott (2010) for the horseshoe prior, Theorem 4.2 gives a bound on the absolute value of the difference between the horseshoe+ estimator and an observation.
THEOREM 4.2. For fixed $y_i$, the absolute difference between the horseshoe+ estimator and the observation, $|E(\theta_i \mid y_i, \tau) - y_i|$, is bounded above by a quantity that vanishes as $\tau \to 0$. A proof is given in Appendix A.2. This result yields a bound on the bias terms for small $|y_i|$, since it is clear that the bias goes to zero when $|y_i| = O(\tau^{2+\alpha})$ for some $\alpha > 0$ as $\tau \to 0$. The question of bounding the bias for large $|y_i|$ is deferred until Section 4.5.
4.3. Asymptotic Bayes optimality under sparsity. The optimality of the Bayes risk for the horseshoe prior follows from the fact that the posterior mass of $\kappa_i$ concentrates near one (uniformly in $y_i$) if the global shrinkage parameter $\tau$ goes to zero, and near zero if $|y_i| \to \infty$ for any fixed $\tau \in (0, 1)$ (by Theorems 3.1 and 3.2, and the corollaries that follow them, in Datta and Ghosh (2013)). To obtain the optimal Bayes risk, one additionally needs the global shrinkage parameter $\tau$ to adapt to the proportion of signals $\mu_n$, i.e., $\lim_{n \to \infty} \tau \mu_n^{-1} \in (0, \infty)$, where $\mu_n = \#\{\theta_i \neq 0\}/n$. It turns out that similar concentration inequalities, but with sharper bounds, hold for the posterior distribution of $\kappa_i$ under the horseshoe+ prior. In this section, we state the two posterior concentration inequalities, along with the asymptotic type-I and type-II error probabilities, to establish the oracle property for the horseshoe+.
Below, we briefly describe the asymptotic framework of Bogdan et al. (2011), namely the notion of the Bayes oracle in the context of multiple testing. Assume the two-groups data generating model of Equations (2.1-2.2). As shown in Bogdan et al. (2011), the optimal Bayes rule under an additive 0-1 loss for testing $H_{0i}: \theta_i = 0$ vs. $H_{1i}: \theta_i \neq 0$ is a thresholding rule, given in Equation (4.1). We call this rule the Bayes oracle, as its risk is the lower bound of $(1/n)$ times the risk of any multiple testing procedure under the two-groups model. Further reparametrizing by $u = \psi^2$ and $v = u f^2$, the threshold in the oracle takes a simpler form. We use the asymptotic framework of Bogdan et al. (2011), which requires the following assumption. ASSUMPTION 1. The sequence of vectors $\gamma_n = (\psi_n, \mu_n)$ satisfies the conditions of Bogdan et al. (2011). REMARK 4.1. The asymptotic framework provides a natural way to study the properties of the Bayes risk as the parameter vector $\gamma = (\psi, \mu)$ defining the Bayes oracle in Equation (4.1) varies through an infinite sequence indexed by the number of tests $n$ increasing to infinity. To reduce notational complexity, we suppress the index $n$ from $\gamma_n, \mu_n, \tau_n, \psi_n$ throughout the remainder of this section; statements such as $\mu \to 0$ mean $\mu_n \to 0$ as $n \to \infty$.
Under Assumption 1, Bogdan et al. (2011) provided simple asymptotic expressions for the type-I and type-II error rates of the Bayes oracle, where we use the notation $o_n$ to denote a sequence of terms, indexed by the number of tests $n$, converging to zero as $n \to \infty$.
The last expression follows from the fact that the Bayes risk of a fixed-threshold multiple testing rule is given by $R = n\{(1 - \mu)t_1 + \mu t_2\}$ under an additive 0-1 loss, when the data are generated from the two-groups model. A decision rule is said to attain ABOS (Asymptotic Bayes Optimality under Sparsity) if the ratio of its Bayes risk to the risk of the Bayes oracle (Equation 4.3) goes to 1 as the multiplicity $n$ goes to $\infty$. Bogdan et al. (2011) also provided conditions under which the Benjamini-Hochberg rule attains the optimal risk. We now present the concentration inequality on the posterior distribution of $\kappa_i$, providing the conditions under which the posterior mass of $\kappa_i$ concentrates near one. We show this by deriving an upper bound on the posterior probability that $\kappa_i$ lies below any fixed $\epsilon \in (0, 1)$, and showing that this upper bound decays at the rate $\tau^2$. THEOREM 4.3. Suppose we have observations $y_1, \ldots, y_n$ where $y_i \sim N(\theta_i, 1)$, for $i = 1, \ldots, n$, and the prior on $\theta_i$ is the horseshoe+ prior with the hierarchical model given by Equation (3.3). Then the posterior distribution of $\kappa_i = (1 + \lambda_i^2 \tau^2)^{-1}$ given $y_i$ and $\tau$ satisfies $P(\kappa_i < \epsilon \mid y_i, \tau) = O(\tau^2)$, for any fixed $\epsilon \in (0, 1)$ and any $\tau \in (0, 1)$.
The proof is given in Appendix A.3. An implication of this theorem is that the posterior distribution of the shrinkage coefficient $\kappa_i$, given $\tau$ and the observation $y_i$, converges to the degenerate distribution at one as $\tau \to 0$. This leads to the following bound on the probability of type-I error for the horseshoe+ prior, with proof given in Appendix A.4.
THEOREM 4.4. Suppose we have observations $y_1, \ldots, y_n$ from the two-groups model in Equation (2.2), and we want to test $H_{0i}: \theta_i = 0$ vs. $H_{1i}: \theta_i \neq 0$ using the decision rule of Equation (3.6) induced by the horseshoe+ prior. Suppose furthermore that Assumption 1 holds for the parameter vector $(\psi, \mu)$. Then the probability of type-I error of the horseshoe+ decision rule vanishes as $\tau \to 0$. REMARK 4.2. It should be noted that the bound (and consequently the type-I error rate) obtained for the horseshoe+ prior is sharper than that obtained for the horseshoe prior: Theorem 4.3 shows $P_{HS+}(\kappa_i < \epsilon \mid y_i, \tau) = O(\tau^2)$, whereas the corresponding bound for the horseshoe obtained by Datta and Ghosh (2013) decays more slowly. We now present the concentration inequality in the other direction, with a proof in Appendix A.5.
THEOREM 4.5. Suppose we have observations $y_1, \ldots, y_n$ where $y_i \sim N(\theta_i, 1)$, for $i = 1, \ldots, n$, and the prior on $\theta_i$ is the horseshoe+ prior with the hierarchical model given by Equation (3.3). Then the posterior distribution of $\kappa_i = (1 + \lambda_i^2 \tau^2)^{-1}$ given $y_i$ and $\tau$ concentrates near zero, for any fixed $\eta \in (0, 1)$, any fixed $\delta \in (0, 1/\{\eta(1 + \tau^2)\})$, and uniformly in $y_i$ with $|y_i|$ sufficiently large. An immediate consequence of this theorem is that the posterior distribution of the shrinkage coefficient $\kappa_i$, given $\tau$ and the observation $y_i$, converges to the degenerate distribution at zero as $|y_i| \to \infty$.
A crucial step in proving the optimality of the horseshoe prior is the choice of the global shrinkage parameter $\tau$. Datta and Ghosh (2013) chose $\tau$ to be of the same order as the proportion of signals $\mu$, i.e., $\tau = \tau_n = O(\mu_n)$, and argued that the optimality of the decision rule induced by the horseshoe prior depends on how well the sparsity is captured by the hyperparameter $\tau$. This was further supported by van der Pas, Kleijn and van der Vaart (2014), who showed that the condition $\tau = O(\mu)$ is also required to prove that the horseshoe estimator attains the minimax squared error risk and that the posterior distribution contracts around the horseshoe estimator at the minimax rate. Since the role of $\tau$ as a scale parameter for the prior on $\lambda_i$ does not change under the horseshoe+ prior, intuitively the same choice of $\tau$ leads to the optimal type-II error rates.
Under this choice of the hyperparameter $\tau$, it follows that the type-II error of the horseshoe+ decision rule has the same asymptotic order as the type-II error rate of the Bayes oracle. Let $C$ denote the constant in the expression for the risk of the Bayes oracle, as it appears in Equation (4.2). Then it follows from Theorem 4.5 that the type-II error rate has the following upper bound. THEOREM 4.6. Suppose we have observations $y_1, \ldots, y_n$ from the two-groups model in Equation (2.2), and we wish to test $H_{0i}: \theta_i = 0$ vs. $H_{1i}: \theta_i \neq 0$ using the decision rule of Equation (3.6). Suppose furthermore that Assumption 1 holds for the parameter vector $(\psi, \mu)$, and that the global shrinkage parameter $\tau$ decreases to zero with $\tau = O(\mu)$. Then for all $\eta \in (0, 1)$ and $\delta \in (0, 1/\{\eta(1 + \tau^2)\})$, the probability of type-II error of the decision rule induced by the horseshoe+ prior is bounded above by a quantity of the same asymptotic order as the oracle type-II error, up to a $(1 + o(1))$ factor.
The proof is given in Appendix A.6, and follows steps similar to the proof of the type-II error rate for the horseshoe prior in Datta and Ghosh (2013), where fixed values $\eta = 1/4$ and $\delta = 1/9$ were used to derive an explicit expression. It then follows from Theorems 4.4 and 4.6 that, since the risk of the Bayes oracle is $R_{BO} = n \mu (2\Phi(\sqrt{C}) - 1)(1 + o(1))$, the horseshoe+ decision rule attains the Bayes oracle risk up to a multiplicative constant. 4.4. Kullback-Leibler risk bounds. Carvalho, Polson and Scott (2010) proved that, for the horseshoe, the Bayes estimate of the sampling density, measured using the Kullback-Leibler distance between the true model and the estimated density function, converges to the truth at a super-efficient rate. Let $\theta_0$ be the true parameter value and $f(y \mid \theta)$ the sampling model. Further, let $K(q_1, q_2) = E_{q_1} \log(q_1/q_2)$ denote the K-L divergence of a density $q_2$ from $q_1$. The proof utilizes a result of Clarke and Barron (1990), stated as Proposition 4.1.
Using this proposition, Theorem 4 of Carvalho, Polson and Scott (2010) establishes a super-efficient rate for the Cesàro-average risk of the horseshoe estimator when the true parameter is $\theta_0 = 0$. This rate is faster than that of any prior without a pole at zero, and is super-efficient in the sense that the risk is lower than that of the MLE, which has the rate $O(\log n/n)$. The same result holds for the horseshoe+ estimator, owing to its infinite mass near zero (by Theorem 4.1). However, we demonstrate that the horseshoe+ prior in fact has a better rate of convergence than the horseshoe prior. Our result is based on the following theorem.
THEOREM 4.7. Let $p^0_{HS+}(\theta)$ and $p^0_{HS}(\theta)$ denote the marginal densities of the horseshoe+ and horseshoe priors at the origin. Then each density admits an explicit expansion at the origin, involving the Euler-Mascheroni constant $\gamma$; in particular, $p^0_{HS+}$ carries an extra logarithmic factor relative to $p^0_{HS}$. The proof is given in Appendix A.7. Due to this extra $\log(n)$ factor, the horseshoe+ prior places more mass in a neighborhood of the origin than the horseshoe prior. Thus, when $\theta_0 = 0$, taking $\epsilon = 1/\sqrt{n}$ in Proposition 4.1, one immediately sees that the constant $b$ in Equation (4.6) is larger for the horseshoe+ than for the horseshoe. Thus, while both estimators display super-efficiency when the truth is sparse, the horseshoe+ Cesàro-average risk decays faster than that of the horseshoe estimator. 4.5. Mean squared error. It is well known that if $p(|y_i - \theta_i|)$ is the standard normal density and $p(\theta_i)$ is a zero-mean scale mixture of normals, with the scale parameter $\lambda^2$ following a proper prior law, the posterior moments of $\theta_i$ admit the following representations, also known as "Tweedie's formula" (Efron, 2011):
$$(4.7) \qquad E(\theta_i \mid y_i) = y_i + \frac{d}{dy_i} \log m(y_i), \qquad (4.8) \qquad V(\theta_i \mid y_i) = 1 + \frac{d^2}{dy_i^2} \log m(y_i),$$
where $m(y_i)$ is the marginal density of $y_i$ (see, for example, Pericchi and Smith (1992) and Carvalho, Polson and Scott (2010)). Furthermore, we can use properties of slowly varying functions to show that if the prior on $\theta_i$ can be written as a normal scale mixture with a "slowly varying" prior on the scale parameter, then the marginal inherits the slowly varying property. For priors with a polynomially heavy tail, it can also be shown that the resulting posterior mean is asymptotically robust, in that the difference $|E(\theta_i \mid y_i, \tau) - y_i|$ vanishes for large $|y_i|$ with $\tau$ fixed. This complements the bound on the bias for small values of $|y_i|$ from Theorem 4.2.
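Tweedie's formula, $E(\theta \mid y) = y + (d/dy)\log m(y)$, is easy to verify numerically. The sketch below is ours; it uses a standard Cauchy prior as a stand-in heavy-tailed scale mixture (any proper prior works) and checks the identity against direct quadrature:

```python
import numpy as np

# Numerical check of Tweedie's formula for a N(theta, 1) likelihood
# with a standard Cauchy prior on theta.
theta = np.linspace(-60, 60, 200_001)
dtheta = theta[1] - theta[0]
prior = 1.0 / (np.pi * (1 + theta**2))

def marginal(y):
    lik = np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(lik * prior) * dtheta

def post_mean_direct(y):
    # E(theta | y) by direct quadrature over the grid
    lik = np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(theta * lik * prior) / np.sum(lik * prior)

def post_mean_tweedie(y, h=1e-4):
    # E(theta | y) = y + d/dy log m(y), via a central difference
    return y + (np.log(marginal(y + h)) - np.log(marginal(y - h))) / (2 * h)

print(post_mean_direct(4.0), post_mean_tweedie(4.0))  # the two agree
```

The heavy Cauchy tail also illustrates the robustness property: at $y = 4$ the posterior mean stays close to the observation rather than being shrunk aggressively toward zero.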
Let m HS+ (y i ) and m HS (y i ) denote the marginals under the horseshoe+ and horseshoe priors respectively. Proposition 4.2 immediately shows that we have that m HS+ (y i ) = m HS (y i ) log(|y i |)(1 + o(1)) as |y i | → ∞, since the only difference between the horseshoe and horseshoe+ mixing densities is the additional slowly varying (log λ i ) term in the scale mixing density for the horseshoe+ prior. Furthermore, the (1 + o(1)) term vanishes when we take derivative of the log marginal density. In particular, as |y i | → ∞, we have where "constant" denotes the collection of all terms that does not involve y i . Thus, Using Equations (4.7) and (4.8), in combination with Equations (4.9) and (4.10), allows one to relate the bias and variance, and hence the MSE, for the horseshoe and the horseshoe+ estimators. We have the following result: THEOREM 4.8. Suppose p(|y i − θ i |) is the standard normal density, and p HS (θ i ) and p HS+ (θ i ) denote the horseshoe and horseshoe+ prior densities on θ i , leading to the posterior means E HS (θ i |y i ) and E HS+ (θ i |y i ) respectively. Then, for large values of |y i |, we have, The proof is given in Appendix A.8. This theorem establishes that the horseshoe+ estimator has asymptotically lower MSE compared to the horseshoe estimator when y i is large, due to the extra (log |y i |) factor in the marginal, which in turn is due to the extra (log λ i ) term in the prior mixing density. 5.1. Sum of squared error about the posterior median. We follow the simulation setting described in Bhattacharya et al. (2014). We simulate data y i |θ i ∼ Normal(θ i , 1) for i = 1, . . . , n, where θ i = A in fraction q of its components with the magnitude of A = 7, 8 and θ i = 0 in the remaining components. We report simulation results for n = 200 in Table 1. Each configuration is replicated 100 times and the average sum of squared error about the posterior median is reported.

5. Numerical examples.
We compare the proposed horseshoe+ prior with two competitors: the horseshoe prior of Carvalho, Polson and Scott (2010) and the Dirichlet-Laplace (D-L) prior of Bhattacharya et al. (2014). To deal with the global shrinkage parameter τ for the horseshoe and horseshoe+ priors, we try two scenarios: (a) τ ∼ C+(0, 1/n) and (b) τ ∼ Unif[0, 1]. For posterior sampling, we use the Stan software package (Stan Development Team, 2014) to draw 10,000 samples in each case, half of which are treated as burn-in and discarded. We monitored MCMC convergence and found no evidence of mixing problems. The D-L prior is implemented in its hierarchical normal-exponential form, and the horseshoe and horseshoe+ priors by the hierarchical models in Equations (3.1) and (3.3) respectively.
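Prior draws from the two hierarchies can be generated directly, which makes the difference between them concrete: the horseshoe+ local scale is a product of two half-Cauchy variables. This is a minimal sketch of the hierarchy as we read it from the text (λ_i | η_i ∼ C+(0, η_i), η_i ∼ C+(0, 1) for the horseshoe+; λ_i ∼ C+(0, 1) for the horseshoe), not the Stan implementation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def half_cauchy(size, scale=1.0):
    # C+(0, scale) draws via the absolute value of a standard Cauchy.
    return np.abs(rng.standard_cauchy(size)) * scale

def sample_horseshoe_prior(n, tau):
    lam = half_cauchy(n)                   # lambda_i ~ C+(0, 1)
    return rng.normal(0.0, lam * tau)      # theta_i | lambda_i ~ N(0, lam^2 tau^2)

def sample_horseshoe_plus_prior(n, tau):
    eta = half_cauchy(n)                   # eta_i ~ C+(0, 1)
    lam = half_cauchy(n, scale=eta)        # lambda_i | eta_i ~ C+(0, eta_i)
    return rng.normal(0.0, lam * tau)

theta_hs = sample_horseshoe_prior(10_000, tau=1.0)
theta_hsp = sample_horseshoe_plus_prior(10_000, tau=1.0)
```

The extra half-Cauchy layer is what produces both the heavier tail and the larger spike at zero discussed in Section 4.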
In Table 1, the estimator with the lowest average SSE is shown in bold in each simulation setting (rows). The horseshoe+ prior with the half-Cauchy prior on τ has the lowest SSE in all but two cases, in which the horseshoe prior performs best. The C+(0, 1/n) prior on τ yields better performance than the Uniform(0, 1) prior for both the horseshoe and the horseshoe+, since the former puts more mass in a neighborhood of zero, helping τ adapt to the sparsity level of the data.
To make the difference between the horseshoe and horseshoe+ estimates clear, we plot E(κ_i|y_i) and E(θ_i|y_i) for i = 1, . . . , n, for the horseshoe in Figure 5. In both cases, the prior on τ is Unif[0, 1]. We used n = 200 and simulated y_i with 10 components with mean equal to 7 and the rest with mean 0. Without loss of generality, the components (true values and estimates) with true non-zero means are plotted as the first 10 data points and those with true zero means are plotted afterwards. The posterior means are shown as dots and the middle 95% posterior credible intervals as solid lines. Comparing the estimates, it is clear that the horseshoe+ does a much better job than the horseshoe at shrinking the noise terms to zero (estimated κ_i closer to 1 or, equivalently, estimated θ_i closer to zero).
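The shrinkage-weight reading of these plots can be made explicit with a small sketch: for a fixed local scale, κ_i = 1/(1 + τ²λ_i²), and the posterior mean takes the form θ̂_i = (1 − E(κ_i|y_i))y_i, so κ_i near 1 means the observation is squelched to zero and κ_i near 0 means it is left nearly untouched. The local scales below are illustrative values, not posterior draws.

```python
import numpy as np

# Shrinkage weight kappa_i = 1 / (1 + tau^2 * lambda_i^2):
# small lambda_i -> kappa near 1 (shrink to zero),
# large lambda_i -> kappa near 0 (signal kept).
tau = 1.0
lam = np.array([0.01, 1.0, 100.0])       # illustrative local scales
kappa = 1.0 / (1.0 + tau**2 * lam**2)

y = 7.0
theta_hat = (1.0 - kappa) * y            # shrinkage applied to one observation
```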

5.2. Misclassification probabilities.
We compared the performance of the multiple testing rule induced by the horseshoe+ prior with two other global-local shrinkage priors, the horseshoe prior of Carvalho, Polson and Scott (2010) and the Dirichlet-Laplace prior of Bhattacharya et al. (2014), in terms of the misclassification probability (MP). We use the misclassification probability as the criterion for our experiment since it equals the Bayes risk under 0-1 additive loss for data generated by a two-groups model. We follow the same experimental setup as in Bogdan, Ghosh and Tokdar (2008), replicated in Datta and Ghosh (2013), where the Bayes oracle (BO) acts as the lower bound and the MP = µ line as the upper bound, µ being the proportion of signals. We simulated data of size n = 200, with ψ_n = √(2 log n) = 3.26. Our data generation scheme follows the conditions provided by Bogdan et al. (2011), which guarantee the optimality of the Benjamini-Hochberg procedure, allowing us to use it as another practical lower bound along with the Bayes oracle. Figure 6 shows the misclassification probabilities for the different shrinkage priors considered, for ten equispaced values of µ ∈ [0.01, 0.5], along with the oracle and the straight line MP = µ. The MP for the horseshoe+ prior is very close to that of the Bayes oracle for a wide range of values of µ, and departs a little for values higher than 0.2. Furthermore, the horseshoe+ decision rule leads to superior performance compared to both the horseshoe and the Dirichlet-Laplace priors. We have also plotted the MP for the Benjamini-Hochberg rule, for α = 1/log n = 0.1887, along with the one-group shrinkage priors. Under this setting, the Benjamini-Hochberg rule achieves the same MP as the oracle, in concordance with the theoretical results for the optimality of BH in Bogdan et al. (2011).
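The Benjamini-Hochberg benchmark and the misclassification probability in this experiment can be sketched as follows. The proportion µ, the seed, and the use of two-sided p-values are illustrative choices; the sketch shows the metric, not the paper's full replication.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, mu = 200, 0.1                       # mu: proportion of signals (illustrative)
psi = np.sqrt(2 * np.log(n))           # signal magnitude, about 3.26 for n = 200

# Two-groups data: each mean is psi with probability mu, else 0.
is_signal = rng.random(n) < mu
theta = np.where(is_signal, psi, 0.0)
y = rng.normal(theta, 1.0)

# Benjamini-Hochberg at level alpha = 1/log(n), two-sided p-values:
# reject the ordered p-values up to the largest k with p_(k) <= k*alpha/n.
alpha = 1.0 / np.log(n)
p = 2 * norm.sf(np.abs(y))
order = np.argsort(p)
k = np.arange(1, n + 1)
passed = p[order] <= k * alpha / n
cutoff = k[passed].max() if passed.any() else 0
reject = np.zeros(n, dtype=bool)
reject[order[:cutoff]] = True

# Misclassification probability: average 0-1 loss over the n tests.
mp = np.mean(reject != is_signal)
```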
We used the full Bayes estimates of the hyperparameters for both the horseshoe prior and the double exponential prior. For estimating τ, we assumed a standard half-Cauchy prior on τ and derived the full conditionals for a Gibbs sampler. As pointed out by Carvalho, Polson and Scott (2009) and Scott and Berger (2006), the fully Bayesian approach to estimating τ has a few advantages over its alternatives, viz. empirical Bayes and cross-validation. In the extremely sparse case, the empirical Bayes estimate of τ might collapse to 0 (Scott and Berger, 2010; Bogdan, Ghosh and Tokdar, 2008). Cross-validation, though free of this problem, uses plug-in estimates of the signal-to-noise ratio. Carvalho, Polson and Scott (2009) argue that the plug-in estimates are not necessarily wrong, but caution should be exercised when using them for extremely sparse problems.
6. Application to a prostate cancer data set. We illustrate the performance of the horseshoe+ prior on the benchmark prostate cancer data, introduced by Singh et al. (2002) and made popular by Efron (2008, 2010a), among others. The prostate cancer data contain gene expression values for n = 6,033 genes for m = 102 subjects, with m_1 = 50 normal controls and m_2 = 52 prostate cancer patients. The goal is to identify genes that are differentially expressed between the controls and the cancer patients. The test statistic values are calculated for each of the 6,033 genes by first computing a two-sample t-statistic, say t_i, i = 1, 2, . . . , n = 6,033, for each gene, and then applying the inverse normal CDF transformation to obtain y_i = Φ^{-1}(F_{t,100 d.f.}(t_i)). The y_i values can be modeled as independent Gaussian variables with means θ_i, i.e. y_i = θ_i + ε_i with ε_i ∼ N(0, 1), casting this problem as a high-dimensional normal means inference problem. The corresponding multiple testing problem is to simultaneously test the hypotheses H_0i: θ_i = 0, for i = 1, . . . , n. Under the global null hypothesis of no differentially expressed genes, one would expect the histogram of the test statistics to follow a N(0, 1) density curve, but the histogram shows a heavier tail, suggesting the presence of a few regulatory genes.
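The t-to-z transformation y_i = Φ^{-1}(F_{t,100}(t_i)) is a one-liner with scipy. The t statistics below are made-up values for illustration, not the actual prostate data.

```python
import numpy as np
from scipy.stats import norm, t

# z-scores from two-sample t statistics, as described for the prostate data:
#   y_i = Phi^{-1}( F_{t,100}(t_i) ),  with 100 degrees of freedom.
def t_to_z(t_stat, df=100):
    return norm.ppf(t.cdf(t_stat, df=df))

t_stats = np.array([-2.0, 0.0, 3.5])    # illustrative values only
z = t_to_z(t_stats)
```

The transformation is monotone and maps 0 to 0; because the t distribution has heavier tails than the normal, large |t_i| are pulled slightly toward zero.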
For a proper appraisal of the tail behavior of the horseshoe+ prior relative to the horseshoe prior, we perform the following experiment. We consider the top 10 genes selected by Efron (2010b) and their effect sizes estimated by a two-groups normal hierarchical model. We apply both the horseshoe and the horseshoe+ priors to the 6,033 test statistics and compare the effect-size estimates θ̂_i for these genes. One would expect the horseshoe+ prior to shrink these "top" genes even less than the horseshoe prior, so that the posterior mean θ̂_i = (1 − E(κ_i|y_i, τ))y_i would be closer to the observed test statistic y_i. Table 2 shows the top 10 genes selected by Efron (2010b) and the effect-size estimates under the horseshoe and horseshoe+ priors. For both priors, we implemented a Gibbs sampler with 15,000 draws and a burn-in period of 3,000 draws. The benefit of a heavier tail is apparent from this table: for 9 out of the top 10 genes, the horseshoe+ estimates are closer to the observed test statistics than the horseshoe estimates. One might naturally wonder about the performance of the two competing Bayesian models for the "uninteresting" genes, and it turns out that both priors have equal strength in squelching the noisy test statistics to zero. Figure 7 shows the posterior means under the two priors against the observed test statistics. All the procedures show good shrinkage properties near zero, and the only difference comes from the performance near the tails, that is, robustness to large signals. This is also reflected in the estimated mean squared prediction error. The values of the mean squared prediction error for the horseshoe+ and the horseshoe prior are 1.189 and 1.045, illustrating the superiority of the horseshoe+ prior over the horseshoe prior.
7. Discussion. We have provided a default Bayesian shrinkage estimator for extracting signals from a sparse parameter vector. The proposed prior is called the horseshoe+ prior, as it arises as a natural extension of the horseshoe prior, and it provides substantial improvement in the "nearly-black" or "ultra-sparse" situation. In particular, the heavier tails of the horseshoe+ prior lead to an increased ability to separate the signals, and the larger mass near the origin leads to better handling of sparsity and a higher order of super-efficiency for the risk in density estimation. We have examined this new prior both theoretically and empirically, considering the estimation accuracy for a high-dimensional parameter vector as well as the error rates of the multiple testing rule induced by applying a threshold rule to the pseudo posterior inclusion probabilities.

Table 2. The test statistics (y-values) and the effect-size estimates for the top 10 genes selected by Efron (2010b), under the horseshoe model, the horseshoe+ model, and Efron's two-groups model. (Columns: Gene, y-value, θ̂_i^{HS+}, θ̂_i^{HS}, θ̂_i; entries omitted here.)
Our asymptotic results demonstrate that the horseshoe+ estimator achieves a lower MSE and that the horseshoe+ decision rule attains the Bayes oracle in testing up to an O(1) constant, with a sharper bound on the type-I error rate than the horseshoe. While we have not discussed the asymptotic minimaxity properties of the horseshoe+ estimator in the ℓ_2 sense, we conjecture that asymptotic minimaxity continues to hold, likely with a sharper bound on the constant term than in van der Pas, Kleijn and van der Vaart (2014). The sharpening effect of the horseshoe+ prior can be attributed to the extra shrinkage gained by having a U-shaped Jacobian rather than a lopsided one, in addition to the U-shaped prior induced on κ_i.
In the recent past, a number of shrinkage priors have been proposed that we collectively call 'global-local' shrinkage priors, following Polson and Scott (2010). These include the generalized double Pareto (Armagan, Dunson and Lee, 2013), the normal-exponential-gamma (Griffin and Brown, 2010), the three-parameter beta (Armagan, Clyde and Dunson, 2011), and the Dirichlet-Laplace (Bhattacharya et al., 2014), among others. These priors exhibit shrinkage properties similar to the horseshoe prior in that they simultaneously squelch the noise to zero and recover the signals. Though these priors lead to competitive performance in the sparse signal recovery problem, they also have unique, distinguishable characteristics. For example, the generalized double Pareto prior leads to a closed-form prior density for θ and induces a sparsity-favoring penalty in regularized least squares, while the Dirichlet-Laplace prior models the joint distribution of θ under the two-groups model via the joint distribution of the shrinkage parameters. The behavior of the marginal prior densities on θ can be seen in Figures 2 and 3, and our simulation results suggest improvements in both estimation and testing, but it remains an open question whether a set of necessary conditions can be imposed on the class of global-local shrinkage priors that guarantees certain desirable properties.
A key insight we gain from the success of the family of global-local shrinkage priors is that the global shrinkage parameter plays a vital role in controlling the behavior of the posterior. Specifically, the global shrinkage parameter in the horseshoe prior needs to be of the order of the proportion of non-null effects to ensure asymptotic minimaxity in estimation (van der Pas, Kleijn and van der Vaart, 2014) as well as the optimality of the induced decision rule in testing (Datta and Ghosh, 2013). We have proved that the same condition also guarantees the optimal performance of the horseshoe+ prior in testing.
Finally, the horseshoe+ prior can be further extended by modeling the local shrinkage parameter λ_i as a higher-order product of independent half-Cauchy random variables, leading to an even heavier tail and a larger spike at zero. The moments and densities of the Cauchy product C_1 C_2 . . . C_k are given in Bourgade, Fujita and Yor (2007), who give the density Ψ_k(·) of the k-fold product C_1 C_2 . . . C_k separately for even and odd k. Furthermore, one might use the "universal prior" due to Rissanen (1983) over the number of terms k in the product density. The universal prior is defined by the mass function P(k) = 2^{−log*(k)}/c, where log*(x) = log x + log log x + . . . , the sum involving only the nonnegative terms, and c = Σ_j 2^{−log*(j)} ≈ 2.865064. The family of Cauchy-product densities can be used in conjunction with Rissanen's universal prior described above to define an adaptive shrinkage estimator such as the Polyshrink estimator due to Foster and Stine (2005), where the amount of shrinkage varies adaptively with the estimation task. For a p-dimensional parameter θ, the Polyshrink estimator uses a collection of discrete mixture models G_p = {g_k(y) = (1 − ε_k)φ(y) + ε_k ψ(y), ε_k = 2^{k−(K+1)}}, for k = 1, . . . , K with K = 1 + log_2(p), where φ(·) and ψ(·) denote the standard normal density and the Cauchy density with scale √2 respectively. We conjecture that the advantages of the one-group model over the two-groups model would naturally carry over to this case if we use a collection of one-group priors defined by Cauchy products of different orders to achieve different amounts of shrinkage. The possibility of such extensions was first discussed in Polson and Scott (2012), and it would be interesting to settle this issue theoretically.
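A minimal sketch of Rissanen's universal prior, assuming base-2 logarithms in log*(·): the partial normalizing sum converges, very slowly, toward c ≈ 2.865064.

```python
import math

def log2_star(x):
    # log*(x) = log x + log log x + ..., keeping only the nonnegative
    # iterates (base-2 logarithms, as in Rissanen's universal prior).
    total, v = 0.0, float(x)
    while True:
        v = math.log2(v)
        if v < 0:
            break
        total += v
        if v == 0:
            break
    return total

def universal_mass(k):
    # Unnormalized mass 2^{-log*(k)}; dividing by c gives P(k).
    return 2.0 ** (-log2_star(k))

# Partial sum of the normalizing constant c = sum_k 2^{-log*(k)};
# the full sum is approximately 2.865064, approached very slowly.
c_partial = sum(universal_mass(k) for k in range(1, 100_000))
```

For example, log2_star(4) = 2 + 1 + 0 = 3, so the unnormalized mass at k = 4 is 1/8; the prior decays just fast enough in k to be summable.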

APPENDIX A: PROOFS OF THEOREMS
A.1. Proof of Theorem 4.1. Let p_HS+(θ) be the marginal density of the horseshoe+ prior. From Equations (3.3)-(3.4) with τ = 1, we have the integral representation below. First note that applying the transformation ζ = 1/λ² gives the expression that follows. An upper bound then follows from a corresponding upper bound on log ζ/(ζ − 1); for a lower bound one can use the fact that log ζ/(ζ − 1) ≥ 2/(1 + ζ) for ζ > 0.
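The lower bound on log ζ/(ζ − 1) can be spot-checked numerically over a grid; this is a sanity check under an illustrative grid, not part of the proof. Both sides tend to 1 as ζ → 1, which is why the singular point is excluded.

```python
import numpy as np

# Numerical check of the bound used in the proof:
#   log(z) / (z - 1) >= 2 / (1 + z)  for z > 0, z != 1.
z = np.concatenate([np.linspace(0.01, 0.99, 200),
                    np.linspace(1.01, 100.0, 200)])
lhs = np.log(z) / (z - 1.0)
rhs = 2.0 / (1.0 + z)
gap = lhs - rhs          # nonnegative everywhere on the grid
```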

This gives
where E_1(·) is the exponential integral. Combining the lower and upper bounds completes the proof of Part (1). Part (2) then follows from the fact that the lower bound tends to ∞ as θ goes to zero. In comparison, the horseshoe satisfies the bounds (K/2) log(1 + 4/θ²) < p_HS(θ) < K log(1 + 2/θ²), with K = 1/(2π³)^{1/2}. Both the upper and lower bounds are sharper than those for the horseshoe+. However, the bounds for the horseshoe+ can be sharpened using better approximations to log(ζ)/(ζ − 1), for which an infinite product representation is given by log(ζ)/(ζ − 1) = Π_{k=1}^∞ 2/(1 + ζ^{1/2^k}). A.2. Proof of Theorem 4.2. We use the following upper bound on the Jacobian term, from which the required bound is obtained. Also, for κ_i < 1/(1 + τ²), we would have the stated logarithmic bound, where B(·, ·) denotes the beta function and the last inequality follows from the identity B(x + 1, y) = B(x, y) · x/(x + y).
A.3. Proof of Theorem 4.3. We write the posterior density of κ_i given y_i on the τ scale as follows. We then use the inequality 1 − 1/x < log(x) < x − 1 for x > 0, x ≠ 1, to derive bounds for the Jacobian term. First note that P(κ_i < ε | y_i, τ) ≤ P(κ_i < ε | y_i, τ)/P(κ_i > ε | y_i, τ), since the denominator is at most 1. Moreover, we can bound the ratio P(κ_i < ε | y_i, τ)/P(κ_i > ε | y_i, τ) as follows. The final step follows from the penultimate step by bounding the integral by the extreme values of the integrand multiplied by the length of the interval of integration.
A.4. Proof of Theorem 4.4. We need the following identity, proved in Appendix A.3. To use the probability concentration inequality of Theorem 4.3 to prove the bound on the type-I error rate, we need to establish the following fact:
We will prove this in two steps. We first show that the posterior mean can be well approximated by evaluating the integral from 1/2 to 1, i.e.
as τ → 0. In the next step, we prove the corresponding limit as τ → 0. First note the following: thus, the claim holds up to a (1 + o(1)) factor as τ → 0. Next, note that we then have the stated bound, again up to a (1 + o(1)) factor. The asymptotic expression for the type-I error rate is then calculated by applying Theorem 4.3 with ε = 1/2. A.5. Proof of Theorem 4.5. First note the following identity:
A.7. Proof of Theorem 4.7. From Appendix A.1, the marginal prior p_HS+(θ) for the horseshoe+ is given by the convolution below. We can use Meijer G-function convolution identities (Mathai, Saxena and Haubold, 2009) to provide a special-function representation of such distributions. Let G^{m,n}_{p,q}(a_p; b_q | x) denote the Meijer G-function. This class of functions satisfies a very useful convolution identity with many applications in Bayesian computation; we have the integral identity below. Furthermore, when one of the G-functions is an exponential function, we obtain a general form for the Laplace transform of a G-function, valid for Re(s) > 0. This can be used to calculate the implied prior via the identity log(x)/(x − 1) = G^{2,2}_{2,2}(0, 0 | x), for appropriate x. Equation (A.2) is then p_HS+(θ) = (1/(√2 π^{5/2})) G^{2,3}_{3,2}(1, 1, 1; 1, 1 | 2θ^{−2}).
Similarly, using Equation (A.6) we have the expansion below, where the last line follows from the fact that Bias_HS(θ_i|y_i) = O(1/y_i²) (from Theorem 3 in Carvalho, Polson and Scott (2010)).