Conditions for Posterior Contraction in the Sparse Normal Means Problem

The first Bayesian results for the sparse normal means problem were proven for spike-and-slab priors. However, these priors are less convenient from a computational point of view. In the meantime, a large number of continuous shrinkage priors have been proposed. Many of these shrinkage priors can be written as a scale mixture of normals, which makes them particularly easy to implement. We propose general conditions on the prior on the local variance in scale mixtures of normals, such that posterior contraction at the minimax rate is assured. The conditions require tails at least as heavy as Laplace, but not too heavy, and a large amount of mass around zero relative to the tails, more so as the sparsity increases. These conditions provide general guidelines for choosing a shrinkage prior for estimation under a nearly black sparsity assumption. We verify these conditions for the class of priors considered by Ghosh and Chakrabarti (2015), which includes the horseshoe and the normal-exponential-gamma priors, and for the horseshoe+, the inverse-Gaussian prior, the normal-gamma prior, and the spike-and-slab Lasso, and thus extend the number of shrinkage priors which are known to lead to posterior contraction at the minimax estimation rate.


Introduction
In the sparse normal means problem, we wish to estimate a sparse vector θ based on a vector X^n = (X_1, . . . , X_n) ∈ R^n, generated according to the model

X_i = θ_i + ε_i, i = 1, . . . , n,

where the ε_i are independent standard normal random variables. The vector of interest θ is sparse in the nearly black sense, that is, most of the parameters are zero. We wish to separate the signals (nonzero means) from the noise (zero means). Applications of this model include image reconstruction and nonparametric function estimation using wavelets [16].
The model is an important test case for the behaviour of sparsity methods, and has been well-studied. A great variety of frequentist and Bayesian estimators has been proposed, and the popular Lasso [24] is included in both categories. It is but one example of many approaches towards recovering θ; restricting ourselves to Bayesian methods, other approaches include shrinkage priors such as the spike-and-slab type priors studied by [16,7] and [6], the normal-gamma prior [14], non-local priors [15], the Dirichlet-Laplace prior [3], the horseshoe [5], the horseshoe+ [2] and the spike-and-slab Lasso [23].
Our goal is twofold: recovery of the underlying mean vector, and uncertainty quantification. The benchmark for the former is estimation at the minimax rate. In a Bayesian setting, the typical choice for the estimator is some measure of center of the posterior distribution, such as the posterior mean, mode or median. For the purpose of uncertainty quantification, the natural object to use is a credible set. In order to obtain credible sets that are narrow enough to be informative, yet not so narrow that they neglect to cover the truth, the posterior distribution needs to contract to its center at the same rate at which the estimator approaches the truth.
For recovery, spike-and-slab type priors give optimal results ([16,7,6]). These priors assign independently to each component a mixture of a point mass at zero and a continuous prior. Due to the point mass, spike-and-slab priors shrink small coefficients to zero. The advantage is that the full posterior has optimal model selection properties, but this comes at the price of, in general, too narrow credible sets. Another drawback of spike-and-slab methods is that they are computationally expensive, although the complexity is much better than what had previously been believed ([26]).
Thus, we might ask whether there are priors which are smoother and shrink less than the spike-and-slab, but still recover the signal at a (nearly) optimal rate. A naive choice would be the Laplace prior ∝ e^{−λ‖θ‖_1} with ‖θ‖_1 = Σ_{i=1}^n |θ_i|, since in this case the maximum a posteriori (MAP) estimator coincides with the Lasso, which is known to achieve the optimal rates for sparse signals. In [6], Section 3, it was shown that although the MAP-estimator has good properties, the full posterior spreads a non-negligible amount of mass over large neighborhoods of the truth, leading to recovery rates that are sub-optimal by a polynomial factor in n. This example shows that if the prior does not shrink enough, we lose the recovery property of the posterior.
The conditions can be summarized as follows: the prior on the local variance should have tails that are at least as heavy as Laplace, but not too heavy, and there should be a sizable amount of mass close to zero relative to the tails, especially when the underlying vector is very sparse. This paper is organized as follows. We state our main result, providing conditions on sparsity priors such that the posterior contracts at the minimax rate, in Section 2. We then show, in Section 3, that these conditions hold for the class of priors of [12], as well as for the horseshoe+, the inverse-Gaussian prior, the normal-gamma prior, and the spike-and-slab Lasso. A simulation study is performed in Section 4, and we conclude with a Discussion. All proofs are given in Appendix A.
Notation. Denote the class of nearly black vectors by ℓ_0[p_n] = {θ ∈ R^n : #{i : θ_i ≠ 0} ≤ p_n}. The minimum min{a, b} is denoted by a ∧ b. The standard normal density is denoted by φ, its cdf by Φ, and we set Φ_c(x) = 1 − Φ(x). The norm ‖·‖ is the ℓ_2-norm.

Main results
Each coefficient θ_i receives a scale mixture of normals as a prior:

θ_i | σ_i² ∼ N(0, σ_i²), σ_i² ∼ π,   (1)

where π : [0, ∞) → [0, ∞) is a density on the positive reals. While π might depend on further hyperparameters, no additional priors are placed on such parameters, rendering the coefficients independent a posteriori. The goal is to obtain conditions on π such that posterior concentration at the minimax estimation rate is guaranteed.
We use the coordinatewise posterior mean to recover the underlying mean vector. By Tweedie's formula [22], the posterior mean for θ_i given an observation x_i is equal to x_i + (d/dx) log p(x)|_{x=x_i}, where p is the marginal density of x_i. The posterior mean for parameter θ_i can thus be written as E[θ_i | x_i] = x_i m_{x_i}, where

m_x = ∫_0^∞ u(1+u)^{−3/2} e^{−x²/(2(1+u))} π(u) du / ∫_0^∞ (1+u)^{−1/2} e^{−x²/(2(1+u))} π(u) du.   (2)

We denote the estimate of the full vector θ by θ̂ = (X_1 m_{X_1}, . . . , X_n m_{X_n}). An advantage of scale mixtures of normals as shrinkage priors over spike-and-slab-type priors is that the posterior mean can be represented as the observation multiplied by the ratio (2), which can be computed via integral approximation methods such as a quadrature routine. See [20], [21] and [25] for more discussion on this point in the context of the horseshoe.
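As an illustration of how the ratio (2) can be evaluated by a quadrature routine, the sketch below computes the shrinkage weight m_x for the horseshoe prior, using the standard fact that σ ∼ C⁺(0, τ) induces the density τ/(π√u(τ² + u)) on u = σ². The function names are our own, and the routine is a minimal sketch rather than a production implementation:

```python
import numpy as np
from scipy import integrate

def pi_horseshoe(u, tau):
    # density of the local variance u = sigma^2 when sigma ~ C+(0, tau)
    return tau / (np.pi * np.sqrt(u) * (tau**2 + u))

def shrinkage_weight(x, tau):
    """m_x = E[u/(1+u) | x]; the posterior mean is x * m_x, as in (2)."""
    # posterior weight on u is proportional to N(x; 0, 1+u) * pi(u);
    # the common factor exp(-x^2/2) cancels in the ratio
    w = lambda u: np.exp(-x**2 / (2.0 * (1.0 + u))) / np.sqrt(1.0 + u) * pi_horseshoe(u, tau)
    num = integrate.quad(lambda u: u / (1.0 + u) * w(u), 0, np.inf)[0]
    den = integrate.quad(w, 0, np.inf)[0]
    return num / den

# small observations are shrunk heavily, large ones barely
print(shrinkage_weight(0.5, 0.05), shrinkage_weight(6.0, 0.05))
```

Observations near zero receive a weight close to zero (heavy shrinkage), while observations well past the threshold √(2 log n) are left essentially untouched.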
Our main theorem, Theorem 2.1, provides three conditions on π under which a prior of the form (1) leads to an upper bound on the posterior contraction rate of the order of the minimax rate. We first state and discuss the conditions. In addition, we present stronger conditions that are easier to verify. Condition 1 is required for our bounds on the posterior mean and variance for the nonzero means. The remaining two are used for the bounds for the zero means.
The first condition involves a class of regularly varying functions. Recall that a function ℓ is called regularly varying (at infinity) if for any a > 0, the ratio ℓ(au)/ℓ(u) converges to the same non-zero limit as u → ∞. For our estimates, we need a slightly different notion, which is introduced next. We say that a function L is uniformly regular varying if there exist constants R, u_0 ≥ 1, such that

1/R ≤ L(au)/L(u) ≤ R, for all a ∈ [1, 2], and all u ≥ u_0.   (3)
In particular, L(u) = u^b and L(u) = log^b(u) with b ∈ R are uniformly regular varying (take for example R = 2^{|b|} and u_0 = 2). An example of a function that is not uniformly regular varying is L(u) = e^u. From the definition, we can easily deduce the following properties of uniformly regular varying functions. Firstly, on [u_0, ∞), such a function L is either everywhere positive or everywhere negative. If L is uniformly regular varying, then so is u ↦ 1/L(u); and if L_1 and L_2 are uniformly regular varying, then so is their product L_1 L_2.
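The definition can be probed numerically. The sketch below (our own helper, which checks (3) only on a finite grid, so it is a heuristic rather than a proof) is consistent with log²(u) satisfying the bounds with R = 4, u_0 = 2, while e^u violates any fixed R:

```python
import numpy as np

def ratio_range(L, u0=2.0, u_max=1e6, n_u=200, n_a=50):
    # empirical min/max of L(a*u)/L(u) over a in [1, 2] and u in [u0, u_max]
    us = np.geomspace(u0, u_max, n_u)
    avals = np.linspace(1.0, 2.0, n_a)
    ratios = np.array([[L(a * u) / L(u) for a in avals] for u in us])
    return ratios.min(), ratios.max()

lo, hi = ratio_range(lambda u: np.log(u) ** 2)   # uniformly regular varying
print(lo, hi)                                    # stays within [1, 4]
lo_e, hi_e = ratio_range(np.exp, u_max=300.0)    # not: ratio e^{(a-1)u} explodes
print(hi_e)                                      # astronomically large
```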
We are now ready to present Condition 1, and the stronger Condition 1', which implies Condition 1, as shown in Lemma A.1.
Condition 1. The prior density can be written as

π(u) = L_n(u) e^{−bu}, b ≥ 0,

where L_n is a function that satisfies (3) for some R, u_0 ≥ 1 which do not depend on n. Suppose further that there are constants C′, K, b′ ≥ 0 and u* ≥ 1, such that

L_n(u) ≥ C′ (p_n/n)^K u^{−b′}, for all u ≥ u*.   (4)

Condition 1'. Consider a global-local scale mixture of normals:

θ_i | σ_i², τ² ∼ N(0, τ² σ_i²), σ_i² ∼ π̃.   (5)

Assume that π̃ is a uniformly regular varying function which does not depend on n, and τ = (p_n/n)^α for some α ≥ 0.
Condition 1 ensures that the posterior recovers nonzero means at the optimal rate. Thus, the condition can be seen as a sufficient condition on the tail behavior of the density π for ℓ_2-recovery. The tail may decay exponentially fast, which is consistent with the conditions found on the 'slab' in the spike-and-slab priors discussed by [7]. In general, π will depend on n through a hyperparameter. Condition 1 requires that this n-dependence behaves roughly as a power of p_n/n.
In the important special case where each θ_i is drawn independently from a global-local scale mixture, Condition 1 is satisfied whenever the density on the local variance is uniformly regular varying, as stated in Condition 1'. Below, we give the conditions on π that guarantee posterior shrinkage at the minimax rate for the zero coefficients. The first of these ensures that the prior π puts a positive amount of mass on the interval [0, 1].

Condition 2.
Suppose that there is a constant c > 0 such that ∫_0^1 π(u) du ≥ c. We turn to Condition 3, which describes the decay of π away from a neighborhood of zero. To state the condition, it will be convenient to write s_n := (p_n/n) log(n/p_n).
Condition 3. Let b_n = log(n/p_n) and assume that there is a constant C, such that

∫_{s_n}^1 u π(u) du + ∫_1^{b_n} u π(u) du + b_n ∫_{b_n}^∞ π(u) du ≤ C s_n.

In order to allow for many possible choices of π, the tail condition involves several terms. It is surprising that some control on the integral ∫_{s_n}^1 u π(u) du is needed, but this turns out to be sharp: Theorem 2.2 proves that if we relax the condition to ∫_{s_n}^1 u π(u) du ≲ t_n for an arbitrary rate t_n ≫ s_n, then there is a prior that satisfies all the other conditions needed for the zero coefficients, but which does not concentrate at the minimax rate.
Below we state two stronger conditions, each of which implies Condition 2 and Condition 3 for sparse signals, that is, when p_n = o(n).
Condition A. Assume that there is a constant C, such that

π(u) ≤ C u^{−3/2} (p_n/n) √(log(n/p_n)), for all u ≥ s_n.
Condition B. Assume that there is a constant C, such that ∫_{s_n}^∞ π(u) du ≤ C p_n/n.
In this case, an even stronger version of Condition 2 holds, in the sense that nearly all mass is concentrated in the shrinking interval [0, s_n]. Notice that Condition 3 does not imply Condition 2 in general: if, for example, π is a point mass at n², then Condition 3 holds but Condition 2 does not. Condition 1 and Condition 3 depend on the relative sparsity p_n/n. Indeed, Condition 1 becomes weaker as the signal becomes more sparse, while Condition 3 simultaneously becomes stronger. This matches intuition, as the prior should shrink more in this case, and thus the assumptions responsible for the shrinkage effect should become stronger.

Figure 1 presents plots of the priors π on the local variance, and the corresponding priors on the parameters θ_i, for three priors for which the three conditions are verified in Section 3: the horseshoe, inverse-Gaussian, and normal-gamma priors. The parameter τ, in the notation of Section 3, should be thought of as the sparsity level p_n/n. Figure 1 shows that the priors start to resemble each other as τ decreases. If the setting is more sparse, corresponding to more zero means, the mass of the prior π on σ_i² concentrates around zero, leading to a higher peak at zero in the prior density on θ_i.

We now present our main result. The minimax estimation rate for this problem, under ℓ_2 risk, is given by 2p_n log(n/p_n) [10]. We write θ_0 = (θ_{0,i})_{i=1,...,n} and consider posterior concentration of the zero and non-zero coefficients separately. Asymptotics always refer to n → ∞.
Theorem 2.1. Work under model X^n ∼ N(θ_0, I_n) and assume that the prior is of the form (1). Suppose further that p_n = o(n) and let M_n be an arbitrary positive sequence tending to +∞. Under Condition 1,

sup_{θ_0 ∈ ℓ_0[p_n]} E_{θ_0} Π( Σ_{i:θ_{0,i}≠0} (θ_i − θ_{0,i})² > M_n p_n log(n/p_n) | X^n ) → 0,

and

sup_{θ_0 ∈ ℓ_0[p_n]} E_{θ_0} Σ_{i:θ_{0,i}≠0} (θ̂_i − θ_{0,i})² ≲ p_n log(n/p_n).

Under Condition 2 and Condition 3 (or either Condition A or B),

sup_{θ_0 ∈ ℓ_0[p_n]} E_{θ_0} Π( Σ_{i:θ_{0,i}=0} θ_i² > M_n p_n log(n/p_n) | X^n ) → 0,

and

sup_{θ_0 ∈ ℓ_0[p_n]} E_{θ_0} Σ_{i:θ_{0,i}=0} θ̂_i² ≲ p_n log(n/p_n).

Thus, under Conditions 1-3 (or Condition 1 with either Condition A or B),

sup_{θ_0 ∈ ℓ_0[p_n]} E_{θ_0} Π( ‖θ − θ_0‖² > M_n p_n log(n/p_n) | X^n ) → 0 and sup_{θ_0 ∈ ℓ_0[p_n]} E_{θ_0} ‖θ̂ − θ_0‖² ≲ p_n log(n/p_n).

The statement is split into the zero and non-zero coefficients of θ_0 in order to make the dependence on the conditions explicit. Indeed, posterior concentration for the non-zero coefficients follows from Condition 1, and posterior concentration for the zero coefficients is a consequence of Conditions 2 and 3. It is well known that posterior concentration at a rate ε_n implies the existence of a frequentist estimator with the same rate (cf. [11], Theorem 2.5). Thus, the rate of contraction around the true mean vector θ_0 must be sharp. This also means that credible sets computed from the posterior cannot be so large as to be uninformative, an effect that, as discussed in the introduction, occurs for the Laplace prior connected to the Lasso.

If one wishes to use a credible set centered around the posterior mean, then its radius might still be too small to cover the truth. The first step towards guarantees on coverage is a lower bound on the posterior variance. Such a lower bound was obtained for the horseshoe in [25], and for priors very closely resembling the horseshoe in [12]. No such results have been obtained so far for priors on σ_i² with a tail of a different order than (σ_i²)^{−3/2}. This is a delicate technical issue that we will not pursue further here.

The results also indicate how to build adaptive procedures. The method does not require explicit knowledge of p_n, but in order to get minimax concentration rates, we need to find priors that satisfy the conditions of Theorem 2.1.
Consider for example the prior defined by

π(u) = u^{−3/2} √(log n)/n, for all u ≥ √(log n)/n,

with the remaining mass distributed arbitrarily on the interval [0, √(log n)/n). Then Condition A holds for any 1 ≤ p_n = o(n), and thus also Condition 2 and Condition 3. Whenever we impose an upper bound p_n ≤ n^{1−δ} with δ > 0, Condition 1 holds as well, and thus Theorem 2.1 follows. This shows that in principle, priors can be constructed that adapt over the whole range of possible sparsity levels and lead to a theoretical guarantee. From a practical point of view, however, these priors shrink too much and have too little mass in the tails. A better procedure would be to get a rough estimate of the relative sparsity p_n/n in a first step, and then to use a prior that lies on the "boundary" of the conditions, in the sense that both sides in the inequality of Condition 3 are of the same order. An empirical Bayes procedure that first estimates the sparsity was found to work well in [25], arguing along the lines of [16]. The sparsity level estimator counts the number of observations that are larger than the 'universal threshold' √(2 log n). Similar results are likely to hold in our setting, as long as the posterior mean is monotone in the parameter that is taken to depend on p_n.
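A minimal sketch of such an empirical Bayes step, implementing the counting rule described above (the helper name is our own; the lower truncation at one keeps the plug-in value p_n/n positive):

```python
import numpy as np

def estimate_sparsity(x):
    # count observations exceeding the universal threshold sqrt(2 log n),
    # truncated below at 1 so the plug-in sparsity estimate stays positive
    n = len(x)
    count = np.sum(np.abs(x) > np.sqrt(2.0 * np.log(n)))
    return max(int(count), 1) / n

rng = np.random.default_rng(0)
n, p = 1000, 10
theta = np.zeros(n)
theta[:p] = 5.0 * np.sqrt(2.0 * np.log(n))   # strong signals, as in Section 4
x = theta + rng.standard_normal(n)
print(estimate_sparsity(x))   # typically close to p/n = 0.01
```

Since all signals lie far beyond the threshold, the estimate is driven by the true sparsity plus a small number of noise exceedances.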

Necessary conditions
The imposed conditions are nearly sharp. To see this, consider the Laplace prior, under which each θ_i is drawn independently from a Laplace distribution with parameter λ. It is well known that the Laplace distribution with parameter λ can be represented as a scale mixture of normals where the mixing density is exponential with parameter λ² (cf. [1] or [18], Equation (4)). Thus, the Laplace prior fits our framework (1) with π(u) = λ² e^{−λ²u}, for u ≥ 0. As mentioned in the introduction, the MAP-estimator under this prior is the Lasso, but the full posterior does not shrink at the minimax rate. Indeed, Theorem 7 in [6] shows that if the true vector is zero, then the posterior concentration rate for the squared ℓ_2-norm has the lower bound n/λ², provided λ lies in a suitable range. This should be compared to the optimal minimax rate log n (the rate for sparsity zero is the same as the rate for sparsity p_n = 1). Thus, the lower bound shows that the rate is sub-optimal as long as λ ≪ √(n/log n).
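The scale-mixture representation is easy to verify numerically. The sketch below uses the convention in which an exponential mixing density with rate λ²/2 produces the Laplace density (λ/2)e^{−λ|θ|}; conventions differ by factors of two between references, so this is one consistent choice rather than the only one:

```python
import numpy as np
from scipy import integrate

def laplace_via_mixture(theta, lam):
    # integrate the N(theta; 0, u) density against an Exp(rate = lam^2/2) mixing density
    f = lambda u: (np.exp(-theta**2 / (2 * u)) / np.sqrt(2 * np.pi * u)
                   * (lam**2 / 2) * np.exp(-lam**2 * u / 2))
    return integrate.quad(f, 0, np.inf)[0]

lam, theta = 2.0, 1.3
print(laplace_via_mixture(theta, lam))   # matches (lam/2) * exp(-lam * |theta|)
```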
If λ ≳ √(n/log n), the lower bound is no longer sub-optimal, but in this case the non-zero components cannot be recovered at the optimal rate. The lower bound shows that the posterior does not shrink enough unless λ is taken to be huge, and thus either Condition 2 or Condition 3 must be violated, as these are the two conditions that guarantee shrinkage of the zero mean coefficients.
Obviously, ∫_0^1 π(u) du ≥ ∫_0^1 e^{−u} du > 0 for λ ≥ 1, and thus Condition 2 holds. For Condition 3, notice that the integral can be split into ∫_{s_n}^1 u π(u) du plus an integral over [1, ∞). Now, if λ tends to infinity at least at some polynomial rate in n, then the integral over [1, ∞) is exponentially small in n. Thus, Condition 3 can only fail because the integral ∫_{s_n}^1 u π(u) du is of a larger order than s_n = n^{−1} log n. To see this, observe that for λ ≤ √(n/log n),

∫_{s_n}^1 u π(u) du ≍ 1/λ².   (7)

Now, we see that Condition 3 fails if and only if the r.h.s. of (7) is of a larger order than s_n. Indeed, if λ ≪ √(n/log n), then the r.h.s. is of larger order than s_n, and if λ ≳ √(n/log n), then Condition 3 holds. This shows that this bound is sharp.
In order to state this as a formal result, let us introduce the following modification of Condition 3. Let κ n denote an arbitrary positive sequence.
Condition 3(κ_n). Let b_n = log(n/p_n) and assume that there is a constant C, such that

∫_{s_n}^1 u π(u) du + ∫_1^{b_n} u π(u) du + b_n ∫_{b_n}^∞ π(u) du ≤ C s_n / κ_n.

In particular, we recover Condition 3 for κ_n = 1.
Theorem 2.2. Work under model X^n ∼ N(θ_0, I_n) and assume that the prior is of the form (1). For any positive sequence (κ_n)_n tending to zero, there exists a prior π satisfying Condition 2 and Condition 3(κ_n) for p_n = 1, and a positive sequence (M_n)_n tending to infinity, such that

E_0 Π( ‖θ‖² ≤ M_n log n | X^n ) → 0.   (8)

This theorem shows that the posterior asymptotically puts all of its mass outside an ℓ_2-ball around the truth with squared radius M_n log n ≫ log n, and is thus suboptimal. The proof can be found in the appendix.

Examples
In this section, Conditions 1-3 are verified for the horseshoe-type priors considered by [12] (a class which includes the horseshoe and the normal-exponential-gamma prior), the horseshoe+, the inverse-Gaussian prior, the normal-gamma prior, and the spike-and-slab Lasso. To the best of our knowledge, there are no existing results showing that the horseshoe+, the inverse-Gaussian and the normal-gamma priors lead to posterior contraction at the minimax estimation rate. Posterior concentration for the horseshoe and horseshoe-type priors was already established in [25] and [12], and for the spike-and-slab Lasso in [23]. Here, we obtain the same results, but thanks to Theorem 2.1 the proofs become extremely short. In addition, we can show that a restriction on the class of priors considered by [12] can be removed.
The global-local scale prior is of the form (1) with

π(u) = K τ^{2a} u^{−1−a} L(u/τ²),   (9)

where K is a normalizing constant, a > 0, and L is a positive measurable function that is bounded above by a constant M, and bounded below by a constant c_0 > 0 for all arguments larger than some t_0. We assume that the polynomial decay in u is at least of order 3/2, that is, a ≥ 1/2. In particular, the horseshoe lies directly at the boundary in this sense. Depending on a, we allow for different values of τ: if 1/2 ≤ a < 1, we assume τ^{2a} ≤ (p_n/n)√(log(n/p_n)); if a = 1, we assume τ² ≤ p_n/n; and if a > 1, we assume τ² ≤ (p_n/n) log(n/p_n).

Condition 1':
It is enough to show that π̃(u) = K u^{−1−a} L(u) is a uniformly regular varying function. Notice that L is uniformly regular varying and satisfies (3) with R = M/c_0 and u_0 = t_0, and the same holds for u ↦ u^{−1−a}. If two functions are uniformly regular varying, then so is their product, and thus π̃ is uniformly regular varying.

Condition 3:
Since L is bounded in sup-norm by M, and s_n ≥ τ², we find that π(u) ≤ KM τ^{2a} u^{−1−a}, for all u ≥ s_n. With this bound, it is straightforward to verify Condition 3.
Thus, we can apply Theorem 2.1.
In particular, the posterior concentration theorem holds even more generally than shown by [12], as the restriction a < 1 can be removed. Thus, for example, we recover Theorem 3.3 of [25] and, in addition, find that the normal-exponential-gamma prior of [13] contracts at most at the minimax rate for γ = p_n/n and any λ ≥ 1/2.

The inverse-Gaussian prior
Caron and Doucet [4] propose to use the inverse-Gaussian distribution as the prior on σ_i². For positive constants b and τ, the variance σ_i² is drawn from an inverse-Gaussian distribution with mean √2 τ and shape parameter √2 b. Thus, the prior on the components is of the form (1) with

π(u) = (τ e^{2τ√b}/√π) u^{−3/2} e^{−bu − τ²/u},

where τ e^{2τ√b}/√π is the normalization factor. (In the notation of [4], this corresponds to the reparametrization γ = √2 b and α/n = √2 τ, where K = n is the dimension of the unknown mean vector.) As τ becomes small, the distribution concentrates near zero. [4] suggests taking τ proportional to 1/n, and we find that optimal rates can be achieved if (p_n/n)^K ≤ τ ≤ (p_n/n)√(log(n/p_n)) for some K > 1.
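The normalizing constant in the reconstructed density above can be sanity-checked numerically, using the classical identity ∫_0^∞ u^{−3/2} e^{−pu−q/u} du = √(π/q) e^{−2√(pq)} (the function name and parameter values below are our own):

```python
import numpy as np
from scipy import integrate

def pi_inv_gauss(u, tau, b):
    # pi(u) = (tau * e^{2*tau*sqrt(b)} / sqrt(pi)) * u^{-3/2} * e^{-b*u - tau^2/u}
    c = tau * np.exp(2.0 * tau * np.sqrt(b)) / np.sqrt(np.pi)
    return c * u**-1.5 * np.exp(-b * u - tau**2 / u)

total, _ = integrate.quad(pi_inv_gauss, 0, np.inf, args=(0.3, 1.0))
print(total)   # ≈ 1.0
```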
Below we verify Condition 1 and Condition A, which together imply Theorem 2.1. The inverse-Gaussian prior does not fit within the class considered by [12], because of the additional exponential factors.
Condition 1: For u ≥ 1 and τ ≤ 1, we have e^{−1} ≤ e^{−τ²/u} ≤ 1. Thus, u ↦ e^{−τ²/u} is uniformly regular varying with constants R = e and u_0 = 1. Since products of uniformly regular varying functions are again uniformly regular varying, we can write π(u) = L_n(u)e^{−bu} with L_n uniformly regular varying.
Hence, the statement of Theorem 2.1 follows.

The horseshoe+ prior
The horseshoe+ prior was introduced by [2] as an extension of the horseshoe with an additional latent variable. A Cauchy random variable with scale parameter λ that is conditioned to be positive is said to be half-Cauchy, and we write C⁺(0, λ) for its distribution. The horseshoe+ prior can be defined via the hierarchical construction

θ_i | σ_i ∼ N(0, σ_i²), σ_i | η_i, τ ∼ C⁺(0, τ η_i), η_i ∼ C⁺(0, 1),

and should be compared to the horseshoe prior

θ_i | σ_i ∼ N(0, σ_i²), σ_i | τ ∼ C⁺(0, τ).

The additional variable η_i allows for another level of shrinkage, a role which falls solely to τ in the horseshoe prior. In [2], the claim is made that the horseshoe+ is an improvement over the horseshoe in several senses, but no posterior concentration results were known so far. With Theorem 2.1, we can show that the horseshoe+ enjoys the same upper bound on the posterior contraction rate as the horseshoe, provided (p_n/n)^K ≤ τ ≤ (p_n/n)(log(n/p_n))^{−1/2}, for some K > 1.
The horseshoe+ prior is of the form (1) with

π(u) = (τ/π²) log(u/τ²) / (√u (u − τ²)),

the density of σ_i² when σ_i/τ is distributed as the product of two independent standard half-Cauchy random variables. Below, we verify Conditions 1-3.

Normal-gamma prior
The normal-gamma prior, discussed by [4] and [14], draws the local variance from a gamma distribution with shape parameter τ > 0 and rate parameter β > 0:

π(u) = (β^τ/Γ(τ)) u^{τ−1} e^{−βu}.

In [14], it is observed that decreasing τ leads to a distribution with a lot of mass near zero, while preserving heavy tails. This is also illustrated in the right-most panels of Figure 1. The class of normal-gamma priors includes the double-exponential prior as a special case, with τ = 1. We now show that the normal-gamma prior satisfies the conditions of Theorem 2.1 for any fixed β, and for any (p_n/n)^K ≤ τ ≤ (p_n/n) log(n/p_n) ≤ 1, for some fixed K.

Condition 1:
We define L_n(u) = (β^τ/Γ(τ)) u^{τ−1}, so that π(u) = L_n(u)e^{−bu} with b = β. Note that since τ → 0 and β is fixed, there exists a constant C such that C^{−1} ≤ β^τ ≤ C. We now prove that L_n is uniformly regular varying: we have L_n(au)/L_n(u) = a^{τ−1}, which lies in [1/2, 1] for all a ∈ [1, 2] and τ ∈ (0, 1].
Thus, we can apply Theorem 2.1.
In [14], it is discussed that the extra modelling flexibility afforded by generalizing the double exponential prior to include the parameter τ is essential, and indeed the double exponential (τ = 1) does not allow a dependence on p n and n such that our conditions are met.

Spike-and-slab Lasso prior
The spike-and-slab Lasso prior was introduced by [23]. It may be viewed as a continuous version of the usual spike-and-slab prior with a Laplace slab, as studied in [7,6], where the spike component has been replaced by a very concentrated Laplace distribution. Recent theoretical results, including posterior concentration at the minimax rate, have been obtained in [23]. Here, we recover Corollary 6.1 of [23].
For a fixed constant a > 0, a sequence τ → 0 and a mixing weight ω ∈ (0, 1), we define the spike-and-slab Lasso as the prior of the form (1) with hyperprior

π(u) = ω a e^{−au} + (1 − ω)(1/τ) e^{−u/τ}   (10)

on the variance. Recall that the Laplace distribution with parameter λ is a scale mixture of normals where the mixing density is exponential with parameter λ². Applied to model (1), the prior on θ_i is thus a mixture of two Laplace distributions with parameters √a and τ^{−1/2} and mixing weights ω and 1 − ω, respectively, and this justifies the name.
We now prove that the prior satisfies the conditions of Theorem 2.1 for mixing weights satisfying (p_n/n)^K ≤ ω ≤ (p_n/n) log(n/p_n) ≤ 1/2, for some K > 1, and τ = (p_n/n)^α with α ≥ 1.

Condition 1: To prove that Condition 1 holds, we rewrite the prior π as

π(u) = e^{−au} [ ω a + (1 − ω)(1/τ) e^{−(1/τ − a)u} ] =: e^{−au} L_n(u).

For n large enough, we have 1/τ − a > 1/(2τ). Hence, for all u > 1 and for C > 0 a constant depending only on K and α,

(1 − ω)(1/τ) e^{−(1/τ − a)u} ≤ (1/τ) e^{−1/(2τ)} ≤ C ω.

Hence, for sufficiently large n, aω ≤ L_n(u) ≤ (a + C)ω for all u ≥ 1. Thus L_n is uniformly regular varying with u_0 = 1. Since also π(u) ≥ aωe^{−au} and ω ≥ (p_n/n)^K, Condition 1 holds.

Condition 3:
We split the two mixing components in (10) and write π =: π_1 + π_2. To verify the condition for the first component π_1, we use that e^{−au} ≤ 1 for u ≤ 1 and that e^{−au} decays faster than any polynomial for u > 1. For Condition 3 to be satisfied, we thus need ω ≲ (p_n/n) log(n/p_n). For π_2, there exists a constant C such that π_2(u) ≤ Cτ/u² for all u ≥ s_n, due to s_n ≥ τ. Straightforward computations show that π_2 satisfies Condition 3, since τ ≤ p_n/n.
Thus, we can apply Theorem 2.1.

Simulation results
To illustrate how sharp our conditions are, we compute the average square loss for two priors that do not meet them, and compare the results with those for two of the examples from Section 3.
The first prior that does not meet the conditions is of the form (9) of Section 3.1 with a = 0.1 < 1/2 and L(u) = e^{−1/u}, so that its density is π_1(u) ∝ u^{−1.1} e^{−τ_1²/u}, and we take τ_1 = p_n/n. Note that π_1 does not meet our conditions, as explained in Section 3.1, and will be called the 'bad' prior. The second prior included in this simulation that does not fit our assumptions is the Laplace prior (see Section 3.4). The two priors considered in this simulation study that do meet the conditions are the horseshoe and the normal-gamma priors, both with τ = p_n/n.
For each of these priors, we sample from the posterior distribution using a Gibbs sampling algorithm, following the one proposed for the horseshoe prior by [5]. To do so, we first compute the full conditional distributions of θ and σ². The conditional of θ given σ² is normal, so the only difficulty is sampling from p(σ² | X, θ). For the horseshoe prior we follow the approach proposed by [5], and we apply a similar method for the normal-gamma prior using the approach proposed by [8]. Sampling under the bad prior is even simpler, given that in this case p(σ² | X, θ) is an inverse-gamma distribution. We compute the average square loss on 500 replicates of simulated data of size n = 100, 200, 500, 1000. For each n, we fix the number of nonzero means at p_n = 10, and take the nonzero coefficients equal to 5√(2 log n). This value is well past the 'universal threshold' of √(2 log n), and thus the signals should be relatively easy to detect. For each data set, we compute the posterior square loss using 5000 draws from the posterior with a burn-in of 20%.
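For the bad prior, both full conditionals are explicit, and the Gibbs sampler reduces to the two draws sketched below (our own function and variable names; the variance-conditional is inverse-gamma, sampled via the identity InvGamma(α, β) = β / Gamma(α, 1)):

```python
import numpy as np

def gibbs_bad_prior(x, tau, a=0.1, n_iter=2000, seed=1):
    # theta_i | u_i, x_i ~ N(x_i * u_i/(1+u_i), u_i/(1+u_i))
    # u_i | theta_i     ~ InvGamma(a + 1/2, theta_i^2/2 + tau^2)
    rng = np.random.default_rng(seed)
    n = len(x)
    u = np.ones(n)
    draws = np.empty((n_iter, n))
    for t in range(n_iter):
        w = u / (1.0 + u)
        theta = rng.normal(x * w, np.sqrt(w))
        u = (theta**2 / 2.0 + tau**2) / rng.gamma(a + 0.5, 1.0, size=n)
        draws[t] = theta
    return draws

x = np.array([0.0, 10.0])                  # one noise and one signal observation
post = gibbs_bad_prior(x, tau=0.1)[400:]   # drop 20% burn-in
print(post.mean(axis=0))                   # noise coordinate near 0, signal near 10
```

The inverse-gamma conditional follows from multiplying the N(0, u) likelihood of θ_i by the density u^{−1.1} e^{−τ²/u}.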
The results are presented in Figure 2. Given that p_n = 10 is fixed, if the posterior contracts at the minimax rate, then the integrated square loss should be linear in log n. However, we see that for the Laplace and bad priors, the slope of the loss grows with n, while it remains steady for the other two priors. This suggests that the horseshoe and normal-gamma priors have a risk of a lower order than the bad and Laplace priors, illustrating that our conditions are very sharp.

Discussion
Our main theorem, Theorem 2.1, expands the class of shrinkage priors with theoretical guarantees for the posterior contraction rate. Not only can it be used to obtain the optimal posterior contraction rate for the horseshoe+, the inverse-Gaussian and normal-gamma priors, but the conditions also provide a characterization of the properties of sparsity priors that lead to desirable behaviour. Essentially, the tails of the prior on the local variance should be at least as heavy as Laplace, but not too heavy, and there needs to be a sizable amount of mass around zero compared to the amount of mass in the tails, in particular when the underlying mean vector becomes more sparse.
In [19] global-local scale mixtures of normals like (5) are discussed, with a prior on the parameter τ 2 . Their guidelines are twofold: the prior on the local variance σ 2 i should have heavy tails, while the prior on the global variance τ 2 should have substantial mass around zero. They argue that any prior on σ 2 i with an exponential tail will force a tradeoff between shrinking the noise towards zero and leaving the large nonzero means unshrunk, while the shrinkage of large signals will go to zero when a prior with a polynomial tail is chosen. This matches the intuition behind our conditions, with the remark that exponential tails are possible, but they should not be lighter than Laplace.
Besides the three discussed goals of recovery, uncertainty quantification, and computational simplicity, we might have mentioned a fourth: performing model selection or multiple testing. Priors of the type studied in this paper are not directly applicable for this goal, as the posterior mean will, with probability one, not be exactly equal to zero. A model selection procedure can however be constructed, for example by thresholding the observed values of m_{x_i}: if m_{x_i} is larger than some constant, we consider the underlying parameter to be a signal, and otherwise we declare it noise. Such a procedure was proposed for the horseshoe by [5], and was shown to enjoy good theoretical properties by [9]. Similar results were found for the horseshoe+ [2]. The same thresholding procedure, and similar analysis methods, may prove to be fruitful for the more general prior (1).
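A sketch of such a selection rule (our own helper; the shrinkage weights m_{x_i} would come from a quadrature routine as in Section 2, and the cutoff 1/2 is one common choice, not a value prescribed by the text):

```python
import numpy as np

def select_signals(shrinkage_weights, cutoff=0.5):
    # flag coordinate i as a signal when the posterior mean keeps more than
    # `cutoff` of the observation, i.e. when m_{x_i} >= cutoff
    return np.asarray(shrinkage_weights) >= cutoff

print(select_signals([0.02, 0.97, 0.40]))   # -> [False  True False]
```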

Appendix A: Proofs
This section contains the proofs of Theorem 2.1 and Theorem 2.2, followed by the statement and proofs of the supporting Lemmas. The proof of Theorem 2.1 follows the same structure as that of Theorem 3.3 in [25], but requires more general methods to bound the integrals involved in the proof.
In the course of the proofs, we use the following two transformations of π:

g(z) = z^{−2} π((1 − z)/z) and h(z) = (1 − z)^{−3/2} π(z/(1 − z)).

The function g is a density on [0, 1], resulting from transforming the density π on σ_i² into a density for z = (1 + σ_i²)^{−1}. The function h is a rescaled version of π.
Proof (of Lemma A.1). Observe that π(u) = π̃(u/τ²)/τ². Since by assumption π̃ is uniformly regular varying, (3) holds for some constants R and u_0 which do not depend on n. To check the first part of Condition 1, it is enough to see that π̃(·/τ²) is uniformly regular varying as well, and satisfies (3) with the same constants as π̃.
Proof of Theorem 2.1. Applying Lemma A.5 gives, under Condition 1,

Σ_{i:θ_{0,i}≠0} E_{θ_0}(θ̂_i − θ_{0,i})² ≲ p_n log(n/p_n) and Σ_{i:θ_{0,i}≠0} E_{θ_0} Var(θ_i | X_i) ≲ p_n log(n/p_n).

These inequalities, combined with Markov's inequality, prove the first two statements of the theorem. Similarly, under Condition 2 and Condition 3, we obtain from Lemma A.6 and Lemma A.7,

E_{θ_0} Σ_{i:θ_{0,i}=0} θ̂_i² ≤ n E_0 (X m_X)² ≲ p_n log(n/p_n) and Σ_{i:θ_{0,i}=0} E_0 Var(θ_i | X_i) ≲ p_n log(n/p_n).

Together with Markov's inequality, this proves the third and fourth statements of the theorem.
Proof of Theorem 2.2. Without loss of generality, we can take κ_n such that κ_n ≥ n^{−1/4} for all n. Consider the prior under which each θ_i is drawn from the Laplace density with parameter λ = √(κ_n/s_n). This prior is of the form (1) with π(u) = λ² e^{−λ²u} (cf. Section 2.1). Theorem 7 in [6] shows that (8) holds with M_n = 1/κ_n → ∞. Thus, it remains to prove that π satisfies Condition 2 and Condition 3(κ_n).
Lemma A.2. The posterior variance can be written as in (12), and can be bounded by

Var(θ_i | x_i) ≤ m_{x_i} + x_i² m_{x_i} ≤ 1 + x_i².   (13)

Proof. By Tweedie's formula [22], the posterior variance for θ_i given an observation x_i is equal to 1 + (d²/dx²) log p(x)|_{x=x_i}. Taking derivatives with respect to x and substituting h(z) = (1 − z)^{−3/2} π(z/(1 − z)) gives the claimed representation. From that we can derive (12), noting that the third term on the r.h.s. is 1 − m_x. The last display also implies the first inequality in (13). Representation (12), together with the trivial bound m_x ≤ 1, yields the second inequality in (13).

Lemma A.3. Suppose that L is uniformly regular varying, and that R and u_0 are chosen such that (3) holds. Then, for any a ≥ 1,

R^{−1−log₂ a} L(u) ≤ L(au) ≤ R^{1+log₂ a} L(u), for all u ≥ u_0,

where log₂ denotes the binary logarithm.
Lemma A.4. Assume that L is uniformly regular varying and satisfies (3) with R and u 0 . Then, the shifted function L(· − 1) is also uniformly regular varying with constants R 3 and u 0 ∨ 2.
For u ≥ u_0 ∨ 2, we apply (3) to each of the three fractions, and this completes the proof.
The following lemma states that if the density g can be decomposed as a product of a function that is uniformly regular varying (and possibly n-dependent) and a factor of the form z ↦ e^{−bz}, then the posterior recovers the size of the non-zero components of θ at the minimax estimation rate, provided that the n-dependence is of the right order.
Lemma A.5. If Condition 1 holds, there exists a constant C, which is independent of n, such that

Σ_{i:θ_{0,i}≠0} E_{θ_0}(θ̂_i − θ_{0,i})² ≤ C p_n log(n/p_n) and Σ_{i:θ_{0,i}≠0} E_{θ_0} Var(θ_i | X_i) ≤ C p_n log(n/p_n).

Proof. We prove the two statements separately. The main argument is a careful analysis of the integral representations (2) and (11). Throughout the remaining proof, let C_1 be a generic constant which is independent of n and which might change from line to line. Without loss of generality, we may assume that u_0 ≥ 2.
Arguing as for (14) completes the proof.
Next, we provide the technical lemmas establishing the rate for the zero coefficients. Recall that s_n = (p_n/n) log(n/p_n), and define q_n := (p_n/n)√(log(n/p_n)).
Suppose that Condition 2 and Condition 3 hold with constants c and C, respectively. With (2), we obtain a chain of upper bounds on m_x, where for the last inequality we split the integral ∫_1^∞ = ∫_1^{log(n/p_n)} + ∫_{log(n/p_n)}^∞ and use Condition 3 twice. These inequalities will be very useful in the proofs below. For the variance bound, the last bound is not sharp enough, and we need to work with the upper bound induced by the second inequality.
Lemma A.6. Work under Condition 2 and Condition 3. Then, E_0 (X m_X)² ≲ (p_n/n) log(n/p_n).
To bound the term I_1, we use the first of these inequalities, which leads to I_1 ≲ s_n² a_n³ + q_n² a_n e^{a_n²/2}.
There is a constant C_K, depending only on K, such that x² log^K(1/x) ≤ C_K x for all x ≤ 1. Thus, I_1 ≲ (p_n/n) log(n/p_n).
Plugging the expression for a_n into the r.h.s. shows that also I_2 ≲ (p_n/n) log(n/p_n), and this finally gives E_0 (X m_X)² ≲ (p_n/n) log(n/p_n).
Proof (of Lemma A.7). Let a_n = √(2 log(n/p_n)). It is enough to show that E_0 Var(θ | X) ≲ p_n log(n/p_n)/n. To prove this, we treat the cases |X| > a_n and |X| ≤ a_n separately. To bound the variance, we use (13), that is, Var(θ | X) ≤ m_X + X² m_X ≤ 1 + X².
Using the expression for a n shows that this can be further bounded by (p n /n) log(n/p n ).
For the second term, E_0 X² m_X 1{|X| ≤ a_n}, we use the second inequality in (19) and find E_0 X² m_X 1{|X| ≤ a_n} ≲ s_n. Together with (21), this shows that E_0 Var(θ | X) 1{|X| ≤ a_n} ≲ s_n. Since in both cases the upper bound is of order (p_n/n) log(n/p_n), the result follows.