Uncertainty quantification for the horseshoe

We investigate the credible sets and marginal credible intervals resulting from the horseshoe prior in the sparse multivariate normal means model. We do so in an adaptive setting without assuming knowledge of the sparsity level (number of signals). We consider both the hierarchical Bayes method of putting a prior on the unknown sparsity level and the empirical Bayes method with the sparsity level estimated by maximum marginal likelihood. We show that credible balls and marginal credible intervals have good frequentist coverage and optimal size if the sparsity level of the prior is set correctly. By general theory honest confidence sets cannot adapt in size to an unknown sparsity level. Accordingly the hierarchical and empirical Bayes credible sets based on the horseshoe prior are not honest over the full parameter space. We show that this is due to over-shrinkage for certain parameters and characterise the set of parameters for which credible balls and marginal credible intervals do give correct uncertainty quantification. In particular we show that the fraction of false discoveries by the marginal Bayesian procedure is controlled by a correct choice of cut-off.


Introduction
Despite the ubiquity of problems with sparse structures, and the large amount of research effort into finding consistent and minimax optimal estimators for the underlying sparse structures Tibshirani (1996); ; ; ; Jiang and Zhang (2009); Griffin and Brown (2010); Johnson and Rossell (2010); ; Caron and Doucet (2008); Bhattacharya et al. (2014); ; Rocková (2015), the number of options for uncertainty quantification in the sparse normal means problem is very limited. In this paper, we show that the horseshoe credible sets and intervals are effective tools for uncertainty quantification, unless the underlying signals are too close to the universal threshold in a sense that is made precise in this work. We first introduce the sparse normal means problem, and our measures of quality of credible sets.
The sparse normal means problem, also known as the sequence model, is frequently studied and considered as a test case for sparsity methods, and has some applications in, for example, image processing ). A random vector Y n = (Y 1 , . . . , Y n ) of observations, taking values in R n , is modelled as the sum of fixed means and noise: where the ε i follow independent standard normal distributions. The sparsity assumption made on the mean vector θ 0 = (θ 0,1 , . . . , θ 0,n ) is that it is nearly black, which stipulates that most of the means are zero, except for p n = n i=1 1 θ0,i =0 of them. The sparsity level p n is unknown, and assumed to go to infinity as n goes to infinity, but at a slower rate than n: p n → ∞ and p n = o(n).
This paper studies the Bayesian approach based on the horseshoe prior Carvalho et al. ( , 2009); Scott (2011); Polson and Scott (2012a,b). The horseshoe prior is popular due to its good performance in simulations and under theoretical study (e.g. Carvalho et al. ( , 2009); Scott (2012a, 2010); Bhattacharya et al. (2014); Armagan et al. (2013); ; Datta and Ghosh (2013)). The horseshoe prior is a scale mixture of normals, with a half-Cauchy prior on the variance. It is given by Across i the variables are assumed independent, with the exception of the hyperparameter τ if this is given a prior as well. The "global hyperparameter" τ was determined to be important towards the minimax optimality of the horseshoe posterior mean as an estimator of θ 0 (van der Pas et al. (2014)). The results in van der  show that τ can be interpreted as the proportion of nonzero parameters, up to a logarithmic factor. If it is set at a value of the order (p n /n) log(n/p n ), then the horseshoe posterior contracts around the true θ 0 at the (near) minimax estimation rate for quadratic loss. Adaptive posterior contraction, where the number p n is not assumed known but estimated by empirical Bayes or hierarchical Bayes as in this paper, was proven for estimators of τ that are bounded above by (p n /n) log(n/p n ) with high probability in .
The adaptive concentration of the horseshoe posterior is encouraging towards the usefulness of the horseshoe credible balls for uncertainty quantification, as in the Bayesian framework the spread of the posterior distribution over the parameter space is used as an indication of the error in estimation. It follows from general results of ; Robins and van der Vaart (2006); Nickl and van de Geer (2013) that honest uncertainty quantification is irreconcilable with adaptation to sparsity. Here honesty of confidence setsĈ n =Ĉ n (Y n ) relative to a parameter spaceΘ ⊂ R n means that lim inf n→∞ inf θ0∈Θ P θ0 (θ 0 ∈Ĉ n ) ≥ 1 − α, for some prescribed confidence level 1 − α. Furthermore, adaptation to a partitionΘ = ∪ p∈P Θ p of the parameter space into submodels Θ p indexed by a hyper-parameter p ∈ P , means that, for every p ∈ P and for r n,p the (near) minimax rate of estimation relative to Θ p , lim inf n→∞ inf θ0∈Θp P θ0 (diam(Ĉ n ) ≤ r n,p ) = 1.
This second property ensures that the good coverage is not achieved by taking conservative, overly large confidence sets, but that these sets have "optimal" diameter. In our present situation we may choose the models Θ p equal to nearly black bodies with p nonzero coordinates, in which case r 2 n,p p log(n/p), if p n. Now it is shown in  that confidence regions that are honest over all parameters inΘ = R n cannot be of square diameter smaller than n 1/2 , which can be (much) bigger than p log(n/p), if p n 1/2 . Similar restrictions are valid for honesty over subsets of R n , as follows from testing arguments (see the appendix in Robins and van der Vaart (2006)). Specifically, in Nickl and van de Geer (2013) it is shown that confidence regions that adapt in size to nearly black bodies of two different dimensions p n,1 p n,2 cannot be honest over the union of these two bodies, but only over the union of the smallest body and the vectors in the bigger body that are at some distance from the smaller body. As both the full Bayes and empirical Bayes horseshoe posteriors contract at the near square minimax rate r n,p , adaptively over every nearly black body, it follows that their credible balls cannot be honest in the full parameter space.
In Bayesian practice credible balls are nevertheless used as if they were confidence sets. A main contribution of the present paper is to investigate for which parameters θ 0 this practice is justified. We characterise the parameters for which the credible sets of the horseshoe posterior distribution give good coverage, and the ones for which they do not. We investigate this both for the empirical and hierarchical Bayes approaches, both when τ is set deterministically, and in adaptive settings where the number of nonzero means is unknown. In the case of deterministically chosen τ , uncertainty quantification is essentially correct provided τ is chosen not smaller than (p n /n) log(n/p n ). For the more interesting full and empirical Bayes approaches, the correctness depends on the sizes of the nonzero coordinates in θ 0 . If a fraction of the nonzero coordinates is detectable, meaning that they exceed the "threshold" 2 log(n/p n ), then uncertainty quantification by a credible ball is correct up to a multiplicative factor in the radius. More generally, this is true if the sum of squares of the non-detectable nonzero coordinates is suitably dominated, as in .
We show in this work that the uncertainty quantification given by the horseshoe posterior distribution is "honest" only under certain prior assumptions on the parameters. In contrast, interesting recent work within the context of the sparse linear regression model is directed at obtaining confidence sets that are honest in the full parameter set Zhang and Zhang (2014); van de Geer et al. (2014); Liu and Yu (2013). The resulting methodology, appropriately referred to as "de-sparsification", might in our present very special case of the regression model reduce to confidence sets for θ 0 based on the trivial pivot Y n − θ 0 , or functions thereof, such as marginals. These confidence sets would have uniformly correct coverage, but be very wide, and not accommodate the presumed sparsity of the parameter. This seems a high price to pay; sacrificing some coverage so as to retain some shrinkage may not be unreasonable. Our contribution here is to investigate in what way the horseshoe prior makes this trade-off. In addition, we provide a specific example of an estimator that meets our conditions for adaptive coverage: the maximum marginal likelihood estimator (MMLE). The MMLE is introduced in detail in . In this paper, we expand on the MMLE results in  by showing that it meets the imposed conditions for adaptive coverage as well.
Uncertainty quantification in the case of the sparse normal means model was addressed also in the recent paper . These authors consider a mixed Bayesian-frequentist procedure, which leads to a mixture over sets I ⊂ {1, 2, . . . , n} of projection estimators (Y i 1 i∈I ), where the weights over I have a Bayesian interpretation and each projection estimator comes with a distribution. Treating this as a posterior distribution, the authors obtain credible balls for the parameter, which they show to be honest over parameter vectors θ 0 that satisfy an "excessive-bias restriction". This interesting procedure has similar properties as the horseshoe posterior distribution studied in the present paper. While initially we had derived our results under a stronger "self-similarity" condition, we present here the results under a slight weakening of the "excessive-bias restriction" introduced in .
The performance of adaptive Bayesian methods for uncertainty quantification for the estimation of functions has been previously considered in Szabó et al. (2015a,b); Serra and Krivobokova (2017); Castillo and Nickl (2014); Ray (2014); Sniekers and van der Vaart (2015a,c,b); Belitser (2017); Rousseau and Szabo (2016). These papers focus on adaptation to functions of varying regularity. This runs into similar problems of honesty of credible sets, but the ordering by regularity sets the results apart from the adaptation to sparsity in the present paper.
For single coordinates θ 0,i uncertainty quantification by marginal credible intervals is quite natural. Credible intervals can be easily visualised by plotting them versus the index (cf. Figure 1). A simulation study in the context of the linear regression model is given in Bhattacharya et al. (2015). Marginal credible intervals may also be used as a testing device, for instance by declaring coordinates i for which the credible interval does not contain 0 to be discoveries. We show that the validity of these intervals depends on the value of the true coordinate. On the positive side we show that marginal credible intervals for coordinates θ 0,i that are either close to zero or above the detection boundary are essentially correct. In particular, the fraction of false discoveries tends to zero. On the negative side the horseshoe posteriors shrink intervals for intermediate values too much to zero for coverage. Different from the case of credible balls, these conclusions are hardly affected by whether the sparseness level τ is set by an oracle or adaptively, based on the data. The paper is organized as follows. Section 2 is concerned with marginal credible intervals. Consequences for the false and true discoveries are explored in Section 3. Results for credible balls are collected in Section 4. In all cases, the results are given for deterministic and general empirical and hierarchical Bayes approaches. The coverage as well as model selection properties of the marginal credible sets are investigated in a simulation study in Section 5. Section 6 contains proofs for the marginal credible intervals. A supplement  contains the proofs of the other results, as a sequence of appendices.
The cardinality of the discrete set S is denoted by #(S).

Credible intervals
We study the coverage properties of credible intervals for the individual coordinates θ 0,i . We show that the marginal credible intervals fall into three categories, dependent on τ . We show that coordinates θ 0,i that are either "small" or "large" will be covered, in the sense that within both categories the fraction of correct intervals is arbitrarily close to 1. On the other hand, none of the "intermediate" coordinates θ 0,i are covered. We show this first for the deterministic case, where the boundaries between the categories are at multiples of τ and ζ τ respectively. Furthermore, we show that the results for deterministic marginal credible intervals extend to the adaptive situation for any true parameter θ 0 , with slight modification of the boundaries between the three cases of small, intermediate and large coordinates. We elaborate on the implications for model selection in Section 3.

Definitions
Non-adaptive marginal credible intervals can be constructed from the marginal posterior distributions Π(θ : θ i ∈ · | Y n , τ). By the independence of the pairs (θ i , Y i ) given τ , the ith marginal depends only on the ith observation Y i . We consider intervals of the form is the marginal posterior mean, L a positive constant, and r i (α, τ ) is determined so that, for a given 0 < α ≤ 1/2, Adaptive empirical Bayes marginal credible intervals are defined by plugging in an estimator τ n for τ in the intervalsĈ ni (L, τ ) defined by (3).
Similarly full Bayes credible intervalsĈ ni (L) are defined from the full Bayes marginal posterior distributions, centered around the coordinates of the full posterior meanθ = E(θ | Y ) asĈ forr i (α) determined so thatĈ ni (1) has posterior probability 1 − α.

Credible intervals for deterministic τ
The coverage of the marginal credible intervals depends crucially on the value of the true coordinate θ 0,i . For given τ → 0, positive constants k S , k M , k L and numbers f τ ↑ ∞ as τ → 0, we distinguish three regions (small, medium and large) of signal parameters: The conditions on the constants and f τ in the following theorem make it that these three sets may not cover all coordinates θ 0,i , but their boundaries are almost contiguous. The following theorem shows that the fractions of coordinates contained in S and in L that are covered by the credible intervals are close to 1, whereas no coordinate in M is covered.
Remark 1. Statements (5) and (7) concern the fractions of intervals that cover. Under the conditions of Theorem 1 it is also true that the individual intervals satisfy with L = L S and L = L L for i ∈ S and i ∈ L, respectively. This is shown as part of the proof of Theorem 1 in Section 6.
Remark 2. The results of Theorem 1 can be extended to the class of global-local scale mixtures of normals introduced in  with density π(λ 2 i ) given as where a ≥ 1/2, K > 0 and the function L : (0, ∞) → (0, ∞) satisfies that sup t>0 L(t) ≤ M and inf t≥t0 L(t) ≥ c 0 for some c 0 , t 0 > 0. This class of priors includes the horseshoe prior, normal-exponential-gamma priors, the three parameter beta normal mixtures, the generalized double Pareto, the inverse gamma and half-t priors. The resulting constants L S and L L will depend on the hyper-parameters c 0 , t 0 , M, K and a.

Adaptive credible intervals
We show that the adaptive credible intervals mimic the behaviour of the intervals for deterministic τ given in Theorem 2. The adaptive results require some conditions on either the empirical Bayes estimator of τ , or the hyperprior on τ . In the empirical Bayes case, one condition on the estimator of τ suffices, stated below. It is the same condition under which adaptive contraction of the empirical Bayes horseshoe posterior was proven in van der Pas et al. (2017a).
A natural choice of estimatorτ n is the marginal maximum likelihood estimator (MMLE), defined as where g τ (θ) = The restriction of the MMLE to the interval [1/n, 1] corresponds to an assumption that the number of signals is between 1 and n, following the interpretation of τ as (approximately) the proportion of signals. In , and in the simulation study in Section 5, the MMLE is compared to the "simple" estimator of , which estimates p n by counting the number of observations that are larger than (a constant multiple of) the universal threshold √ 2 log n and its computation is discussed. It is proven that the MMLE meets Condition 1, and thus that the empirical Bayes procedure with the MMLE as a plug-in estimate of τ leads to adaptive posterior concentration results.
In the hierarchical Bayes procedure, we impose the same conditions on the hyperprior π n as for adaptive posterior concentration in . We recall them below.
Condition 3. Let t n = C u π 3/2 τ n (p n ), with the constant C u as in Lemma G.8(i). The prior density π n satisfies tn tn/2 π n (τ ) dτ e −cpn , for some c < C u /2.
Condition 3 may be replaced by the weaker Condition 4, at the price of suboptimal rates.
Condition 4. For t n as in Condition 3 the prior density π n satisfies, tn tn/2 π n (τ ) dτ t n .
Examples of priors meeting Conditions 2 and 4 are the Cauchy prior on the positive reals, or the uniform prior, both truncated to [1/n, 1]. They satisfy the stronger Condition 3 if p n ≥ C log n, for a sufficiently large C > 0.
In the adaptive case, the three regions (small, medium and large) of signal parameters are defined as, for given positive constants k S , k M , k L , and f n : Theorem 2. Suppose that k S > 0, k M < 1, k L > 1, and f n ↑ ∞. Ifτ n satisfies Condition 1, then for any sequence γ n → c for some 0 ≤ c ≤ 1/2 such that ζ 2 γn log(1/τ n (p n )), we have that P θ0 θ 0,i / ∈Ĉ ni (L,τ n )) → 1, for any L > 0 and i ∈ M a , with L S and L L given in Theorem 1. Under Conditions 2 and 3 and in addition p n log n the same statements hold for the hierarchical Bayes marginal credible sets. This is also true under Conditions 2 and 4 if f n log n, with different constants L S and L L .
with L = L S and L = L L for i ∈ S a and i ∈ L a , respectively. The same statement holds for the hierarchical Bayes marginal credible intervals. This is shown as part of the proof of Theorem 2 in Appendix A.1 in the supplement (van der Pas et al., 2017b).
Figure 1: 95% marginal credible intervals based on the MMLE empirical Bayes method, constructed using the 2.5% and 97.5% quantiles, for a single observation Y n of length n = 200 with p n = 10 nonzero parameters, the first 5 (from the left) being 7 (green), the next 5 equal to 1.5 (orange); the remaining 190 parameters are coded (blue). The inserted plot zooms in on credible intervals 5 to 13, thus showing one large mean and all intermediate means. Figure 1 illustrates Theorem 2 by showing the marginal credible sets for just a single simulated data set, in a setting with n = 200, and p n = 10 nonzero coordinates. The value τ was chosen equal to the MMLE, which realised as approximately 0.11. The means were taken equal to 7, 1.5 or 0, corresponding to the three regions L, M, S listed in the theorem ( √ 2 log n ≈ 3.3). All the large means (equal to 7) were covered; only 2 out of 5 of the medium means (equal to 1.5) were covered; and all small (zero) means were covered, in agreement with Theorem 2. It may be noted that intervals for zero coordinates are not necessarily narrow.

Model selection
Marginal credible sets give rise to a natural model selection procedure: a parameter is selected as a signal (or "is a discovery") if and only if the corresponding credible interval does not contain zero. We can study this procedure again both in the case of deterministic τ and in the adaptive case, where τ is estimated from the data or receives a hyperprior. For simplicity in this section we only state the result for the adaptive case, leaving the non-adaptive case to the supplement (van der Pas et al., 2017b), see Theorem B.1 in Appendix B .
For zero coordinates θ 0,i selection is the same as coverage, but for nonzero coordinates selection is easier. While coverage involves both the center and the spread of the posterior distribution, selection depends only on the posterior probability that the signal is positive (or negative). This makes that the blow-up constant L in the definition (3) or (4) of a credible interval is unimportant. Thus we consider these intervals with an arbitrary constant L > 0, and say that a parameter θ 0,i = 0 is falsely selected, or that a parameter θ 0,i = 0 is correctly selected, if, in both cases, it is contained in the interval C ni (L,τ ) orĈ ni (L), in the empirical Bayes or full Bayes cases respectively. These are the false and true positives, respectively. Now few zero parameters (the ones with index in N := {1 ≤ i ≤ n : θ 0,i = 0}) are falsely selected and most large signals (the ones with index in L a ) are correctly selected. However most of the remaining parameters (the ones with index in (N c ∩ S a ) ∪ M a ) are not selected, and hence constitute false negatives. Thus the procedure is conservative, the good news being that discoveries tend to be true discoveries.
Theorem 3. Suppose that k M < 1 < k L and f n ↑ ∞ and let L > 0. For any sequence γ n such that ζ 2 γn ζ 2 τn(pn) , the following statements hold, with probability tending to one: (i) The number of selected parameters with i ∈ N divided by the total number #(N ) of zero parameters is at most γ n .
(ii) The number of selected parameters in i ∈ L a divided by the total number of large parameters #(L a ) is at least 1 − γ n , i.e.
(iii) At most a fraction γ n of the parameters within i ∈ (N c ∩S a )∪M a will be selected.
Proof. See Appendix B in the supplement (van der Pas et al., 2017b).
The assertion of the theorem are in the spirit of "false discovery rates". However, none of the statements concerns the usual false discovery rate, defined as the number of falsely selected parameters divided by the total number of selected parameters. Our current methods do not seem to provide realistic bounds on this quantity, partly because we are working under the assumption of sparsity.
An alternative method for model selection using the horseshoe was proposed by . They proposed to select as nonzero coordinates the indices such that the ratio κ i (τ ) =θ i (τ )/Y i exceeds a threshold (to be precise κ i (τ ) > 1/2). This method has similar behaviour to the credible set based model selection approach, as proven in Theorem B.2 in Appendix B in the supplement (van der Pas et al., 2017b). We refer to Datta and Ghosh (2013) for theoretical properties of this procedure, and compare the credible interval and thresholding methods further through simulation in Section 5.

Credible balls
By their definition, credible sets contain a fixed fraction, e.g. 95 %, of the posterior mass. The diameter of such sets will be at most of the order of the posterior contraction rate. The upper bounds on the contraction rates of the horseshoe posterior distributions given in  imply that the horseshoe credible sets are narrow enough to be informative. However, these bounds do not guarantee that the credible sets will cover the truth. The latter is dependent on the spread of the posterior mass relative to its distance to the true parameter. For instance, the bulk of the posterior mass may be highly concentrated inside a ball of radius the contraction rate, but within a narrow area of diameter much smaller than its distance to the true parameter.
In this section we study coverage of credible balls, that is, credible sets for the full parameter vector θ 0 ∈ R n relative to the Euclidean distance. We do so first in the case of deterministic τ and next for the empirical and full Bayes posterior distributions.

Definitions
Given a deterministic hyperparameter τ , possibly depending on n and p n , we consider a credible ball of the form whereθ(τ ) = E(θ | Y n , τ) is the posterior mean, L a positive constant, and for a given α ∈ (0, 1) the numberr(α, τ ) is determined such that Thusr(α, τ ) is the natural radius of a set of "Bayesian credible level" 1 − α, and L is a constant, introduced to make up for a difference between credible and confidence levels, similarly as in Szabó et al. (2015b). Unlike in the latter paper the radiir(α, τ ) do depend on the observation Y n , as indicated by the hat in the notation.
In the empirical Bayes approach we define a credible set by plugging in an estimator τ n of τ into the non-adaptive credible ballĈ n (L, τ ) given in (12): In the hierarchical Bayes case we use a ball around the full posterior meanθ = θ Π(dθ | Y n ), given byĈ where L is a positive constant andr(α) is defined from the full posterior distribution by The question is whether these Bayesian credible sets are appropriate for uncertainty quantification from a frequentist point of view.

Credible balls for deterministic τ
The following lower bound forr(α, τ ) in the case that nτ → ∞ is the key to the frequentist coverage. The assumption nτ /ζ τ → ∞ is satisfied for τ of the order the "optimal" rate τ n (p n ) provided p n → ∞ (as we assume).
Theorem 4. If τ ≥ τ n and τ → 0 and p n → ∞ with p n = o(n), then, there exists a large enough L > 0 such that Proof. The probability of the complement of the event in the display is equal to P θ0 ( θ 0 −θ(τ ) 2 > Lr(α, τ )). In view of Lemma 1 this is bounded by o(1) plus By Theorem 3.2 of van der Pas et al. (2014) the numerator on the right is bounded by a multiple of p n log(1/τ ) + nτ log 1/τ . By the assumption τ ≥ τ n ≥ 1/n the quotient is smaller than α for appropriately large choice of L.
Theorem 4 combined with the upper bound on the posterior contraction rate in van der  show that a (slightly enlarged) credible ball centered at the posterior mean is of rate-adaptive size and covers the truth provided τ is chosen of the order of the "optimal" value τ n (p n ). This is not possible in general, as it requires knowing the number of signals. In the next sections, we will show that if the empirical Bayes estimator of τ is "close" to τ n (p n ), or if a hyperprior on τ places "enough" mass on a neighborhood of a quantity of order τ n (p n ), then adaptation to the unknown number of signals is possible.

Adaptive credible balls
We now turn to credible sets in the more realistic scenario that the sparsity parameter p n is not available. We investigate both the empirical Bayes and the hierarchical Bayes credible balls. We show that both empirical and hierarchical credible balls cover the true parameter θ 0 , if θ 0 satisfies the "excessive-bias restriction", given below, under some conditions on the empirical Bayes plug-in estimate or the hierarchical Bayes hyperprior on τ .

The excessive-bias restriction
Unfortunately, coverage can be guaranteed only for a selection of true parameters θ 0 . The problem is that a data-based estimate of sparsity may lead to over-shrinkage, due to a too small value of the plug-in estimator or concentration of the posterior distribution of τ too close to zero. Such over-shrinkage makes the credible sets too small and close to zero. A simple condition preventing over-shrinkage is that a sufficient number of nonzero parameters θ 0,i are above the "detection boundary". The minimum threshold for detection required in our proof is 2 log(n/p n ). This leads to the following condition.
The two constants C s and A will be fixed to universal values, where necessarily C s ≥ 1 and it is required that A > 1.
The problem of over-shrinkage is comparable to the problem of over-smoothing in the context of nonparametric density estimation or regression, due to the choice of a too large bandwidth or smoothness level. The preceding self-similarity condition plays the same role as the assumptions of "self-similarity" or "polished tail" used by Picard and Tribouley (2000); Giné and Nickl (2010) (2011), which imposes a lower bound on the nonzero signals in order to achieve consistent selection of the set of nonzero coordinates of θ 0 . However, the present condition is different in spirit both by the size of the cut-off and by requiring only that a fraction of the nonzero means is above the threshold.
For ensuring coverage of credible balls the condition can be weakened to the following more technical condition.
Assumption 2 (excessive-bias restriction). A vector θ 0 ∈ 0 [p] satisfies the excessivebias restriction for constants A > 1 and C s , C > 0, if there exists an integer q ≥ 1 with The set of all such vectors θ 0 (for fixed constants A, C s , C) is denoted by Θ[p], and p =p(θ 0 ) denotes #(i : |θ 0,i | ≥ A 2 log(n/q)), for the smallest possible q.
If θ 0 ∈ 0 [p] is self-similar, then it satisfies the excessive-bias restriction with q = p, C = 2A 2 and the same constants A and C s . This follows, because the sum in (16) is trivially bounded by #(i : θ 0,i = 0) A 2 2 log(n/q).
In the following example we show that the excessive-bias restriction is also implied by a condition with the same name introduced in . The latter condition motivated Assumption 2, which is more suited to our investigation of the horseshoe credible sets.
We now show that in this case θ 0 also satisfies Assumption 2, with q = #(Ĩ). Let , by the minimising property of θ 0,i , verifying the second inequality in (16). Second {j : again by the minimising property of θ 0,i . Thus the first inequality of (16) follows by the fact that G(Ĩ) ≤ C#(Ĩ) log(ne/#(Ĩ)).

Empirical Bayes condition and the MMLE
To obtain coverage in the empirical Bayes setting, we replace Condition 1 by the following.
Condition 5. The estimator τ n satisfies, for a given sequence p n and some constant The lower bound of order τ n (p) instead of 1/n prevents over-shrinkage. Although this condition may appear more restrictive than Condition 1, Condition 5 may not be more stringent than Condition 1, because it only needs to hold for vectors θ 0 that meet the excessive-bias restriction.
For the coverage results in this paper, we need the additional result that the MMLE is of the order τ n (p(θ 0 )) for all vectors θ 0 satisfying the excessive-bias restriction.
Proof. See Appendix D.1 in the supplement (van der Pas et al., 2017b).
The relative performances of the empirical Bayes procedures with the MMLE or the "simple" estimator are studied further in Section 5.

Main result on adaptive credible balls
Under the excessive-bias restriction, both the empirical and hierarchical Bayes credible balls are honest and adaptive. In the hierarchical Bayes setting, the hyperprior is assumed to be supported on [1/n, 1], similar to the MMLE.
Theorem 5. Letp n ≤ p n be given sequences withp n → ∞ and p n = o(n). If the estimator τ n of τ satisfies Condition 5, then for a sufficiently large constant L (depending on A, C s , C) the empirical Bayes credible ballĈ n (L,τ n ) has honest coverage and rate adaptive (oracle) size: In particular, these assertions are true for the MMLE. Furthermore, ifp n ≥ C log n for a sufficiently large constant C, then the hierarchical Bayes method with τ ∼ π n for π n probability densities on [1/n, 1] that are bounded away from zero also yields adaptive and honest confidence sets: for sufficiently large L, Proof. See Appendix E.1 in the supplement (van der Pas et al., 2017b).
It may be noted that for self-similar θ 0 the square diameter of the credible balls is of the order p log(n/p), improving on the square contraction rate p log n obtained in van der . For parameters satisfying the excessive-bias restriction, this may further improve top log(n/p).
The size of the required blow-up factor L in the radius of the credible ball depends on the constants A, C s , C in the excessive-bias restriction, Assumption 2. As argued in the introduction no statistical procedure can simultaneously adapt to sparsity and give uniform coverage over all parameters, so that with every given L necessarily some parameters will not be covered. In practice the validity of the excessive-bias restriction for particular A, C s , C cannot be verified, but one must be satisfied with coverage for a set of "reasonable" parameters. The choice of L operationalises "reasonable"; it will be hard to define an optimal choice. In our simulations in the next section we used L = 1, which is the natural Bayesian choice and seems to work well, at least for the parameters considered in the simulation.

Simulation study
In the first simulation study in Section 5.1, we compare four versions of the horseshoe (empirical Bayes with two different estimators and hierarchical Bayes with two different priors) and evaluate the coverage properties and interval lengths of the resulting credible intervals. In addition, we include an approximation to the credible intervals based on the normal distribution.
In the simulation study in Section 5.2, we compare the model selection properties of the method based on credible intervals resulting from the horseshoe with the MMLE, as discussed in Section 3, to the thresholding method introduced by , with the MMLE of τ plugged in. We use the MMLE because the best results are obtained for the horseshoe with MMLE in the first simulation in Section 5.1. All simulations were carried out using the R package 'horseshoe' (van der Pas et al., 2016b).
We study the relative performances of the empirical Bayes and hierarchical Bayes approaches further through simulation studies, extending the simulation study in . We consider empirical Bayes combined with either (i) the simple estimator (with c 1 = 2, c 2 = 1) or (ii) the MMLE, and for hierarchical Bayes with either (iii) a Cauchy prior on τ , or (iv) a Cauchy prior truncated to [1/n, 1] on τ . We study the coverage and average lengths of the marginal credible intervals resulting from these four methods, as well as intervals based solely on the posterior mean and variance. In addition, we study intervals of the formθ i (y i , τ M ) ± 1.96 var(θ i | y i , τ M ), based on a normal approximation to the posterior, whereθ i (y i , τ M ) is the posterior mean and var(θ i | y i , τ M ) refers to the posterior variance, both with the MMLE plugged in. We include the approximation because it offers a computational advantage over the other methods, as no MCMC is required.
We consider a mean vector of length n = 400, with p n ∈ {20, 200}. We draw the nonzero means from a N (A, 1)-distribution, with A = c √ 2 log n for c ∈ {1/2, 1, 2}, corresponding to most nonzero means being below the universal threshold, close to the universal threshold, or well past the universal threshold, respectively. Instead of the symmetric intervals studied in our theorems, we computed the practically appealing quantile-based 95% marginal credible sets for the hierarchical and empirical Bayes methods, taking the 2.5%-and 97.5%-quantiles of the MCMC samples as the endpoints. We did not include a blow-up factor. The procedure was repeated N = 500 times. Figure 2 gives the coverage results averaged over the 500 iterations, for all parameters, and separately for the p n nonzero means and the (n − p n ) zero means. The average lengths of the credible sets, again for all signals and separately for the nonzero and zero means, are displayed in Figure 3. Figure 4 gives the mean value of τ -in the hierarchical Bayes settings, the posterior mean of τ was recorded for each iteration. No value is given for the normal approximation, as it uses the MMLE as a plug-in value for τ .
We remark on some aspects of the results. First, we see that the zero means are nearly perfectly covered by all methods in all settings, and the main differences lie in Figure 2: Average coverage of all parameters (top), the nonzero means (middle) and the zero means (bottom) for the five methods, from left to right: empirical Bayes with simple estimator (c 1 = 2, c 2 = 1) and MMLE, normal approximation, hierarchical Bayes with Cauchy prior on τ and with Cauchy prior truncated to [1/n, 1]. The p n nonzero means were drawn from a N (A, 1) distribution. Results are based on averaging over 500 iterations.
the nonzero means. Secondly, coverage of the nonzero means improves as their values increase. Thirdly, the lengths of the credible intervals adapt to the signal size. They are smaller for the zero means than for the nonzero means, and smaller for the nonzero means corresponding to A = (1/2) √ 2 log n than for the nonzero means corresponding to A = √ 2 log n and A = 2 √ 2 log n, while there is not much difference between the interval lengths in those latter two settings, suggesting that the interval length does not increase indefinitely with the size of the nonzero mean.
Furthermore, empirical Bayes with the simple estimator achieves the lowest overall coverage, and especially bad coverage of the nonzero means. This appears to be due to smaller interval lengths caused by lower estimates of τ compared to the other methods. The normal approximation leads to better coverage than the simple estimator, and has the highest coverage of the nonzero means, even though the corresponding intervals are Figure 3: Average length of the credible sets of all parameters (top), the nonzero means (middle) and the zero means (bottom) for the five methods, from left to right: empirical Bayes with simple estimator (c 1 = 2, c 2 = 1) and MMLE, normal approximation, hierarchical Bayes with Cauchy prior on τ and with Cauchy prior truncated to [1/n, 1]. The p n nonzero means were drawn from a N (A, 1) distribution. Results are based on averaging over 500 iterations.
slightly shorter than those of empirical Bayes with the MMLE and the hierarchical Bayes approaches. However, its coverage of nonzero means is worse than that of those three methods, while the corresponding intervals are longer, except in the case where A is largest. The normal approximation appears to be reasonable for very large signals only.
The hierarchical Bayes approach with a non-truncated Cauchy on τ leads to the highest overall coverage and coverage of the nonzero means, albeit by a small margin. The price is slightly larger intervals compared to the other methods, mostly for the zero means. These larger intervals are most likely due to the larger values of τ that are employed, this being the only approach that allows for estimates of τ larger than one, and it avails itself of the opportunity in the non-sparse setting. Finally, we again observe that the results for empirical Bayes with the MMLE and hierarchical Bayes with a truncated Cauchy lead to highly similar results. Their coverage is comparable to that of hierarchical Bayes with a non-truncated Cauchy in all settings except when p n = 200 and A is at least at the threshold, in which case the non-truncated Cauchy has slightly better coverage. Their intervals are shorter on average, because τ is not allowed to be larger than one.
In conclusion, empirical Bayes with the simple estimator should not be used for uncertainty quantification. The normal approximation is faster to compute than the marginal credible sets, but leads to worse coverage of the nonzero compared to the empirical Bayes with the MMLE and the hierarchical Bayes approaches, unless the nonzero means are very large. The results of those latter three methods are very similar to each other. All these results can be understood in terms of the behaviour of the estimate of τ : larger values lead to larger intervals and better coverage, which may lead to worse estimates however (as seen in the previous section). Empirical Bayes with the MMLE, or hierarchical Bayes with a truncated Cauchy, appear to be the best choices when considering both estimation and coverage. Those two approaches yield highly similar results and the choice for one over the other may be based on other considerations such as computational ones.

Model selection
We compare the procedure based on credible intervals studied in Section 3 to the thresholding method introduced in . Two scenarios are considered. In the first, the signals are either "small", "intermediate" or "large", as defined in Section 2.3. In the second, all signals are drawn from a distribution.
In the credible interval method, a parameter is selected as a signal if zero is not contained in the corresponding credible interval. For the thresholding method of , the posterior mean is divided by the observation. The result is a number between zero and one, which indicates the amount of shrinkage of that particular observation. If this number is larger than 0.5, the corresponding parameter is considered a signal. For both methods, we estimate τ by the MMLE.  In the first scenario, we have n observations, with p n signals. The p n signals are divided into three groups, corresponding to the three intervals of Section 2.3. The small ones are equal to 1/n, the intermediate ones are 0.5 2 log(1/τ n (p n )), and the large ones are equal to 1.5 √ 2 log n. We study four combinations of n and p n : n = 400, p n = 60; n = 800, p n = 60; n = 800, p n = 120 and n = 1600, p n = 120. We count the number of false positives, that is the noise signals that are incorrectly selected as signals, and the number of correctly selected signals in each group. The number of true discoveries, averaged over N = 500 iterations, are in Figure 5, and the false discovery rate (FDR) is in the upper left panel of Figure 7.
In the second scenario, all signals are drawn from a distribution: the Laplace distribution with dispersion parameter equal to 3, the Gamma distribution with shape and scale equal to 2, or the Cauchy distribution with scale equal to 5. The number of false positives and the number of correctly selected variables are counted. The number of true discoveries, averaged over N = 100 iterations, are in Figure 6, and the FDRs are in Figure 7.
Both simulation scenarios tell a consistent story: the thresholding method results in more discoveries, both true and false, than the credible interval method. The findings of Figures 5 are as expected based on the theoretical results of Section 3: almost none of the small and medium signals are detected, while the large signals are nearly perfectly detected by both methods. In scenario 2, where the signals are drawn from a distribution, the thresholding method finds more of the signals. Comparing the left and right columns of Figure 6, we see that both methods detect more of the signals when the truth is less Figure 7: False discovery rate in scenarios 1 and 2. (i) n = 400, p n = 60; (ii) n = 800, p n = 60; (iii) n = 800, p n = 120; (iv) n = 1600, p n = 120.
sparse. This may be due to the behaviour of the MMLE, which is likely to be larger in the less sparse settings, leading to less shrinkage of the true signals.
The FDR of the credible interval method remains well below 0.05 in all settings (Figure 7). In contrast, the FDR of the thresholding method exceeds 0.10 in all cases, and is much larger still when the observations are drawn from a Cauchy distribution. The FDR of the thresholding method can of course be lowered by taking a different cut-off than 0.5, but no guidelines are available at the moment, and a decrease of the FDR will come at the cost of the number of true discoveries. The credible intervals have low FDR, but fail to detect small and medium observations. We speculate that improvement might be possible by combining the information contained in the posterior mean and variance.

Proof of Theorem 1
The posterior distribution of θ i given (Y i , τ, λ i ) is normal with mean and variancê Furthermore, the posterior distribution of λ i given (Y i , τ) possesses a density function given by We show that this is true, or not, for θ 0,i belonging to the three regions separately for S, L and M.
Case S: proof of (5). If i ∈ S, then |θ 0,i −θ i (τ )| ≤ k S τ + τ |Y i |e Y 2 i /2 , by the triangle inequality and Lemma 3(iii). Below we show thatr i (α, τ ) ≥ τ z α c, with probability tending to one, for z α the standard normal upper α-quantile and every c < 1/2. Hence θ 0,i ∈ C ni (L, τ ) as soon as (1 − u)), a fraction 1 − γ of the variables Y i with i ∈ S is bounded above by k S τ + 2 log(2/γ) + δ = k S τ + ζ γ/2 + δ, with probability tending to 1, for any δ > 0. Then the corresponding fraction of parameters θ 0,i is contained in their credible interval if L is chosen big enough that As the right hand side of the above inequality is bounded above by 2 γ ζ γ/2 (1 + ε), where ε → 0 if γ, τ, δ → 0, this is certainly true for L S as in the theorem. We also note that the above argument implies that for every given i ∈ S the variable Y i is bounded from above by k S τ + ζ γ/2 + δ with probability at least 1 − γ.
We finish by proving the lower bound for the radiusr i (α, τ ). Because the conditional distribution of θ i given ( for any r > 0. Furthermore, by the monotonicity of the variance in λ i of this conditional distribution, the last function is increasing in λ i . Ifπ(· | τ ) is the probability density given byπ(λ On the other hand, since sd(θ i | Y i , τ, λ i ) ≥ τ /2(1 + o(1)), for λ i ≥ 1/2, by (18), the normality of the conditional distribution of θ i given ( Here the second last inequality follows from as τ → 0, by two applications of the dominated convergence theorem. Combination of (19) and (20) shows thatr i (α, τ ) ≥ z α τ /2(1 + o (1)).
For |Y i | > Aζ τ and A > 1 we can choose c sufficiently close to zero so that the first exponential is of order τ A for some A > 1. Then it is much smaller than the denominator, which is of order τ /g 2 τ , provided g τ tends to infinity slowly. If we choose d > 1, then the term involving the second exponential will also tend to zero for |Y i | > Aζ τ as soon as e −cζ 2 τ /g 2 τ g 3 τ → 0, for a sufficiently small constant c. This is true (for any c > 0) for instance if g τ = √ ζ τ . Then the quotient tends to zero, whence the analogon of (20) is lower bounded by α (1 − o(1)) > α, eventually.
We finish by proving thatr i (α, τ ) U τ , with probability tending to one. As a first step we show that, for k < 1, By the explicit form of the posterior density of λ i we have We split the remaining integral over the intervals [M, τ −a ) and [τ −a , ∞), for some a < 1.
On the first interval we use that y 2 /(1 + λ 2 i τ 2 ) = y 2 + o(1), uniformly in |y| ζ τ and λ i ≤ τ −a , while on the second we simply bound the factor e −y 2 /(2(1+λ 2 i τ 2 )) by 1, to see that the preceding display is bounded above by The first term in square brackets (times the leading term) contributes less than a multiple of ∞ M λ −2 dλ = 1/M , while the second term contributes less than e y 2 /2 τ a ≤ τ −k 2 +a , for |y| ≤ kζ τ , which tends to zero if a > k 2 . This concludes the proof of (24).

By the reverse triangle inequality, for any
For sufficiently large M the second term on the far right is smaller than α/2 by the preceding paragraph and for r = z α/4 sup λ≤M r i (τ, λ) the first term on the right is smaller than α/2 as well, by the normality of θ i given (Y i , λ i , τ) and the definition of r i (τ, λ i ). The inequality remains valid if |θ i (τ, λ i ) −θ i (τ )| in the first line is replaced by sup λi≤M |θ i (τ, λ i )| + |θ i (τ )|. It follows that The first term is bounded above by Mτ , and the second by Mτ |Y i |, by the definitions of r i (τ, λ) andθ i (τ, λ), while |θ i (τ )| ≤ τ |Y i |e Y 2 i /2 , by Lemma 3(iii). This concludes the proof thatr i (α, τ ) U τ .

Invited comment on Article by van der Pas, Szabó, and van der Vaart
Ismaël Castillo * The presently discussed paper by Stéphanie van der Pas, Botond Szabó and Aad van der Vaart is a third of a series of very interesting works on convergence properties of posterior distributions associated to the horseshoe prior in the sparse normal means model. The horseshoe prior distribution as considered in the paper is a specific scale mixture of normal distributions. Given τ , it is the distribution of θ 1 obtained from Convergence rates are obtained in van der Pas et al. (2014) and adaptive counterparts are derived in van der Pas et al. (2017). In the present paper the authors make an important step further and study uncertainty quantification: they demonstrate that under certain conditions credible sets derived from the horseshoe posterior distribution, either local marginal credible intervals or global 2 credible balls, can be used as confidence sets, asymptotically in the number of observations. This is, after , one of the first works on the subject using Bayesian methods in sparse settings.
I really enjoyed reading this paper and the previous ones. Below I discuss two main points and then close my discussion with a couple of more specific questions. The first comment draws some analogies with spike and slab priors with sparsity parameter calibrated by empirical Bayes (EB) and asks for possibly more general horseshoe-type distributions. In a second comment, we discuss model selection properties and credible sets for the horseshoe.
Some of the comments below are inspired by current work in progress with Romain Mismer Castillo and Mismer (2017) and Botond Szabó Castillo and Szabó (2017), in which we consider related questions for spike and slab prior distributions for some absolutely continuous distribution G and α calibrated by an empirical Bayes approach: following the steps of , who studied risks of a class of point estimators derived from the EB approach, we consider the convergence of the full EB-posterior and related credible sets properties.

More flexible horseshoe prior distributions?
Following , if π(θ 1 ) denotes the marginal density of θ 1 in (1), This implies that the horseshoe prior, given τ , has a pole at zero and Cauchy tails. The pole at zero guarantees shrinkage of small signals while heavy tails avoid over-shrinkage of large signals.
There seems to be a striking correspondance between the tuning parameter τ of the horseshoe and the success probability α in the spike and slab prior, especially when G is taken to be a distribution with Cauchy tails. For instance, when using a marginal maximum likelihood empirical Bayes (MMLE) method to estimate α for such a spike and slab prior with Cauchy tails, one can show Castillo and Mismer (2017) (thereby slightly improving, in the case one restricts to 0 [p n ] classes, upon the estimate from Lemma 10 and (101) in ) the estimateα is such that, as n → ∞, where p n is the sparsity parameter. This is the same as the upper boundary for τ obtained by the authors who established in van der Pas et al. (2017) which is part of Condition 1 of the present paper. This suggests that tails of the horseshoe and tails of the slab distribution play a similar role, also at level of precise conditions arising in the proofs.
This naturally leads to the question of whether it is possible to allow for other tail distributions for the marginal distribution of θ 1 for horseshoe-type priors. Another reason why we mention this is that it appears from Castillo and Mismer (2017)-Castillo and Szabó (2017) that in the spike and slab case, tails of G are particularly critical in obtaining optimal adaptive rates and confidence sets when using an empirical Bayes method. While Cauchy tails are fine in the spike and slab case when the squared 2 loss θ − θ 2 = i (θ i − θ i ) 2 is considered, they presumably lead to suboptimal rates if the loss is measured in terms of d q -distances d q (θ, θ) := i |θ i − θ i | q (as in Castillo and van der Vaart (2012)) when q < 1 (we note here that we are talking about results for the full EB posterior distribution, not aspects of it such as the median or mode as in , for which this phenomenon does not arise).
Perhaps heavier tails, such as θ −1−δ 1 with δ < 1 could be obtained by considering one of the other mixture priors mentioned in the paper such as the normal-exponentialgamma or the more general global-local scale mixture of normals, although we could not find any explicit results on tails of the marginal distribution in the mentioned references.

Model selection: 'sparsifying' the horseshoe?
By construction, a draw from the posterior distribution associated to the horseshoe prior does not set any component exactly to zero. In a sense, again by construction, the horseshoe prior is not exactly 'made for' 0 [p n ] classes. Still, as the authors nicely prove, it leads to very good results for estimation and confidence sets for the squared 2 loss and 0 [p n ] classes.
When one looks at a different type of results, such as model selection, or results for loss functions that are more sensible to missing the exact zeros, such as d q -losses, something must be done, and the authors propose an additional selection rule to set some of the coefficients to zero.
The selection rule consists in looking at marginal credible intervals for individual coefficients θ i and to select the given index i if the credible interval does not contain zero. This rule is very intuitive, but is there a qualitative justification of this specific choice? For instance, can something be said about its corresponding 'threshold' in the sense of the smallest signal strength that gives detection?
Part of the interesting message from Sections 2 (credible intervals) and 3 (model selection) from the paper is that, after the selection rule is applied, the resulting procedure does qualitatively something similar to what priors with a built-in selection procedure, such as spike and slab, would do: most true zero parameters are set to zero, large enough signals are always detected, while 'intermediate' signals are often set to zero.
One can wonder whether it is possible to recover some results obtained for priors with built-in selection with the horseshoe combined with the selection rule, for instance in the following two directions a) Number of non-zero coefficients. From (i) of Theorem 3.1, it follows that the total number of selected coefficients is no larger than p n + (n − p n )γ n (I believe γ n should be read (n − p n )γ n in point (i) of the statement). The condition on γ n implies that nγ n is of larger order than p n . Could one prove that the bound is close to p n , or rather here, say, a constant times p n log(n/p n )?
b) d q -losses. In principle, one could also expect that, once some of the smallest coefficients of the horseshoe estimator are set to zero, the resulting 'after selection'estimate would perform well also in terms of d q -distances, at least for some qs in (0, 2). This question arises for estimation as in van der Pas et al. (2017) but also for credible sets as in Section 4 of the present paper.

Specific questions
(i) Adaptive minimax rate with precise logarithmic term.
In the companion paper van der Pas et al. (2017), the authors obtain a nearly optimal minimax rate Cp n log n for the horseshoe posterior, which may miss the minimax rate of the order p n log(n/p n ) for signals that are nearly dense (e.g. p n = n/ log n or p n = n/e √ log n ). It would be interesting to see whether the precise logarithmic term can be obtained.
In principle, when looking at classes of sparse vectors that do not specifically contain zeros, such as strong or weak p classes (0 < p < 2), the horseshoe estimator should perform even better, in the sense that it is not 'penalised' by the fact of not setting some coefficients to zero. Did the authors do some simulations in this type of setting? Also, how does one choose in practice the blow-up factor L of the credible intervals or credible balls? Is there a recommended rule to chose it in simulations?

Invited comment on Article by van der Pas, Szabó, and van der Vaart
Ryan Martin *

Introduction
Since its first appearance, not quite 10 years ago, the horseshoe prior has come a very long way. When I first learned about it as a graduate student in 2008, 1 I remember thinking that building a hierarchical model around a prior with both a huge spike at zero and very heavy tails was a clever way to induce this kind of shrinkage needed for structured high-dimensional problems. What I didn't realize at the time was that the horseshoe was more than just a clever idea; it was a game-changer, motivating the now very active research on general classes of global-local priors. While I may have missed my chance in 2008 to get in on the ground floor of the horseshoe enterprise, it is great to have the opportunity here in 2017 to reflect a bit on these developments.
The normal mean model discussed here has X i ∼ N (θ i , σ 2 ), i = 1, . . . , n, independent, and the goal is inference on the mean vector θ = (θ 1 , . . . , θ n ). Let Π n denote the posterior distribution for θ based on the horseshoe prior, with either a plug-in estimator or a hyper-prior for the global scale factor τ . The theoretical questions that Bayesians are currently interested in revolve around the asymptotic concentration, as n → ∞, of the posterior Π n at certain "true" mean vectors θ 0 ∈ R n , in particular, at θ 0 which are p-sparse. The most precise of such questions concern the properties of marginal posterior credible intervals or posterior credible balls; that is, does a credible set have the right size and the right coverage probability? The impossibility theorem (e.g.,  says that no confidence sets-Bayesian or otherwise-are both adaptive, in the sense that their size corresponds to the minimax rate for the true-but-unknown sparsity level p = o(n), and honest, in the sense that they attain the nominal frequentist coverage probability, uniformly over all θ 0 . It is intuitively clear that the troublemaker θ 0 's, the so-called "inconvenient truths" in , have too many |θ 0,i | close to a certain detectability boundary. The authors of the present paper, namely, van der Pas, Szabó, and van der Vaart, are to be congratulated for their efforts to give a precise mathematical characterization of these troublemakers in the context of the horseshoe model. A natural next step would be to develop similar results for other horseshoe-like models, such as those in Ghosh et al. (2016) and Ghosh and Chakrabarti (2017). While I am excited to see how far these efforts can go, I also have some reservations.
For my discussion here, I'll focus on two higher-level points, namely, the choice of a one-group model, like the horseshoe, versus a two-groups model, and coming to terms with the inevitable dishonesty of posterior credible sets.

One group or two?
The horseshoe is an example of a one-group model since it does not treat the zero and non-zero θ i 's differently a priori. A two-groups model, on the other hand, treats the zero and non-zero θ i differently, usually with a discrete mass at zero and continuous density away from zero, respectively. If there is reason to believe that the true θ 0 is sparse in the sense that it contains many exact zeros, then the two-groups formulation is clearly more natural. But the one-group approach has an apparent computational advantage in that it does not require posterior exploration over the very large space of zero/non-zero configurations. This advantage is further boosted by the theoretical fact that a two-groups formulation with conjugate, fixed-center normal priors for the non-zero θ i 's can lead to sub-optimal posterior concentration properties (e.g., Castillo and van der Vaart, 2012, Theorem 2.8). The heavier-than-normal-tailed priors that have the desired theoretical properties, such as Laplace, make computation more expensive. Of course, the computational benefits afforded to these one-group models would be of little relevance if there was no supporting theory, but see the present paper, as well as  and .
Despite the nice results for one-group priors, I still would lean towards a two-groups model in this case: the sparsity assumption is genuine prior information, "genuine" in the sense that I'm willing to make that assumption in the supporting theory, so my prior ought to allow, and even encourage, many of the means to be exact zeros. But how to address the challenges presented in the previous paragraph? It turns out that both the computational and theoretical obstacles can be overcome by taking a conjugate normal prior for the non-zero means but with an informative choice of center. In the context of this normal means model, Martin and Walker (2014) suggest centering the (conditional) prior for the non-zero means at the observations; that is, where μ i is chosen to be X i and γ > 0 is some (small) fixed constant. Then, of course, a prior is assigned to the configurations of zero and non-zero means, which is where the sparsity assumption is incorporated. This prior has some intuitive appeal: we are taking an informative prior on the configuration, the aspect of θ that we have genuine prior information about, and a data-driven "non-informative" prior on the actual values of the non-zero means, about which we have no genuine prior information. From a computational point of view, the data-dependence does not affect conjugacy, so posterior computations are relatively simple. Theoretically, Stephen and I showed that the corresponding "empirical Bayes" posterior has the same adaptive and asymptotically minimax concentration rate as has been demonstrated for the horseshoe. I won't say any more here about our specific formulation, but I'll note that Stephen and I have since extended these results to sparse high-dimensional regression  and even to some nonparametric problems .
To conclude this part of my discussion, I want to revisit van der Pas, Szabó, and van der Vaart's example in Section 2.2 of their paper. There they have n = 200 means, p = 10 of which are non-zero, with five means equal to 7 and the other five equal to 1.5. The point is that 7 easily exceeds their detection boundary but 1.5 is right near Figure 1: Marginal equal-tailed 95% credible intervals based on (a) the two-groups empirical Bayes approach in Martin and Walker (2014) and (b) the horseshoe, for the same experiment summarized in Figure 1 of van der Pas, Szabó, and van der Vaart.
it; therefore, according to their theory, the means equal to 1.5 ought to be difficult to cover. For comparison, Figure 1 shows the naive equal-tailed 95% marginal empirical Bayes credible intervals from the same example but based on the two-groups empirical Bayes formulation in Martin and Walker (2014) and our simple Gibbs sampler. Just like van der Pas, Szabó, and van der Vaart, we cover all the zero means (green) and all the size-7 means (black), but miss a couple of the size-1.5 means (red). However, some of the marginal credible intervals are actually singletons, whereas the horseshoe always returns intervals. I wonder if the two-groups formulation has some efficiency advantage over the one-group version in the sense that the coverage rates ought to be the same but the former's "overall length" is shorter? In this particular instance, the average length of the 200 one-and two-groups intervals were about 1.5 and 0.25, respectively.
To be fair, coverage properties for our credible sets have not yet been worked out, but given the results in Belitser and Nurushev (2017), we have every reason to think that this can be done, and we'll report these details elsewhere.
eter values at which adaptivity holds but honesty fails, as van der Pas, Szabó, and van der Vaart, among others, have given, is theoretically valuable, primarily for the insights it provides. Unfortunately, these insights don't translate to practical guidance; for example, for fixed n, it's impossible to tell if a particular θ 0 satisfies the excessivebias restriction. Moreover, it's exactly those "intermediate" parameters carved about by the theorem's conditions for which a precise uncertainty quantification is needed. In any case, I think many users of Bayesian methods are sold by the often-spoken but rarely-written claim that "Bayes provides automatic uncertainty quantification." But the impossibility theorem says that if the posterior is good, i.e., adaptive, then this rationale breaks down. Of course, the impossibility theorem applies to Bayes and non-Bayes approaches alike, but Bayes isn't needed to construct an adaptive estimator so, if the posterior doesn't provide honest uncertainty quantification, then what does it have to offer? Call me over-dramatic, but does taking the "honesty is the best policy" advice, as I think we should, require a change of perspective? Our theoretical efforts thus far have focused primarily on adaptation-driven priors; is the next game-changer a class of honesty-driven priors? If so, then what will be the horseshoe's fate?
To conclude, I want to mention that this conflict between concentration properties and uncertainty quantification is not unique to the type of high-dimensional problems considered in van der Pas, Szabó, and van der Vaart's paper and the ensuing discussion. For example, even in apparently simple fixed-dimensional problems, Fraser (2011) and Fraser et al. (2016) describe situations where Bayesian uncertainty quantification is less than fully satisfactory. More generally, there are known concerns about uncertainty quantification via the marginal posterior distribution for non-linear interest parameters, including the extreme cases in Gleser and Hwang (1987) where marginal posterior credible intervals could have zero coverage probability for all n. My co-authors and I have been writing on these points recently (e.g., Martin and Liu, 2016b;Balch et al., 2017;Martin, 2017), advocating for a stronger property called validity that can stand up to these problematic cases. Efforts to develop a framework for valid statistical inference are underway (e.g., Martin and Liu, 2016a), and I am excited to see how the new approach can handle these high-dimensional problems.

Invited comment on Article by van der Pas, Szabó, and van der Vaart
Nicholas G. Polson * Let me first congratulate the authors on an impressive paper that solves an open problem on uncertainty quantification for the horseshoe estimator. These are challenging problems and of great importance to our understanding of uncovering sparse signals.
Most of my comments are based on their main result (Theorem 2) and the ensuing Figure 1 which illustrates the marginal credible sets in a simple simulation.
The sparse normal means problem is concerned with inference for the parameter vector θ = (θ 1 , . . . , θ p ) where we observe data y i = θ i + i where the level of sparsity might be unknown. From both a theoretical and empirical viewpoint, regularised estimators have won the day. This still leaves open the question of how does specify a penalty, denoted by π HS , (a.k.a. log-prior, − log p HS )? Lasso simply uses an L 1 -norm, K i=1 |θ i |, as opposed to the horseshoe which (essentially) uses the penalty The motivation for the horseshoe penalty arises from the analysis of the prior mass and influence on the posterior in both the tail and behaviour at the origin. The latter is the key determinate of the sparsity properties of the estimator. See  for a recent review that compares and contrasts Lasso and Horseshoe. The "choice" of τ depends on how much one is willing to assume a priori about the sparsity properties of the underlying vector. Among other things, the authors propose a marginal maximum likelihood estimator (MMLE), defined in (8). This has been discussed by many authors, e.g. Gelman (2006), personally I like to relate this to a similar problem in the Bayesian analysis of variance, see Tiao and Tan (1965), Stein (1969) which I discuss below.
From an applied perspective, many of the authors' results can be inferred from their Figure 1. This illustrates their main theoretical result in Theorem 2.4. The marginal credible sets for uncovering the parameter vector are shown for a single simulated normal means problem with n = 200 and p = 10 non-zero coordinates. The true means are taken to be 0, 1.5, 7, corresponding to the three regions analyzed theoretically to provide bounds in Theorem 2. As predicted by theory, all the means equal to 7 are recovered nicely-as an aside, this would no be true for lasso.The sparse zeroes are also recovered by design. The "honesty" of the estimator, as defined by the authors, can be seen by behavior at recovering the means equal to 1.5 where only 2 out of 5 succeeded for this particular simulation. Lasso would have done a better job! As there is no free lunch for admissible estimators, maybe this result is not that surprising. Assessing the magnitudes (a.k.a. uncertainty quantification) is the goal of the authors' analysis.
From a historical perspective, James-Stein (a.k.a L 2 -regularisation) is only a global shrinkage rule-there are no local parameters-to learn about sparsity. A simple sparsity example shows the issue with L 2 -regularisation. Consider the sparse r-spike problem where focusing solely on rules with the same global shrinkage weight (albeit benefiting from pooling of information) has an issue. Let the true parameter value be θ p = ( d/p, . . . , d/p, 0, . . . , 0). James-Stein is equivalent to the hierarchical model This dominates the plain MLE but loses admissibility! This is due to the fact that a "plug-in" estimate of global shrinkageτ is used. Tiao and Tan's original "closed-form" analysis is particularly relevant here. They point out that the mode of p(τ 2 |y) is zero exactly when the shrinkage weight turns negative (their condition 6.6). From a risk perspective E θJS − θ ≤ p, ∀θ showing the inadmissibility of the MLE. At origin the risk is 2, but! p θ 2 p + θ 2 ≤ R θ JS , θ p ≤ 2 + p θ 2 p + θ 2 implying R(θ JS , θ p ) ≥ (p/2). Simple thresholding rule beats this with a risk, √ log p. This simple historical example merely shows that the choice of penalty should not be taken for granted. The same thought applies to lasso-the credible sets implicit in lasso are not optimal and the horseshoe approach achieves much large gains both theoretically and empirically-which the optimality properties and caveats revealed by the authors' paper.
There are still many fruitful areas of research in Bayesian sparsity and horseshoe estimation. One avenue is to further understand Tiao and Tan's (1965) condition for the posterior on τ to have a mode at the origin in the case of sparsity.
On the empirical side, it is pleasing to see many R packages available to implement horseshoe estimation in a variety of situations, see  for further discussion. Bhattacharya et al. (2016) provide one such package and also shows how horseshoe can vastly outperform lasso in typical applied contexts. These packages are closing the gap on computational speed that lasso enjoys. On the theoretical side, hyper-parameter selection stills seems an interested area to me where many of the old discussions are still relevant. Recent applications of these methods are in evermore complicated models, such as deep learning (Polson and Sokolov, 2017), and understanding the theoretical underpinnings is as important as ever.

Contributed comment on Article by van der Pas, Szabó, and van der Vaart
It is now well known that adaptive credible sets cannot do honest uncertainty quantification over all possible true signals, and some of these signals must be permanently excluded. To this end, the authors discussed two criteria for removal, one based on the concept of self-similarity and the other based on the excessive bias restriction introduced by Belitser and Nurushev (2017). In the present setting of sparse signals, the key insight into these conditions is that the true signals must be at some distance away from the zero signal in a sense made precise in the paper. Interestingly as mentioned in Remark 3, the three regimes become more "contiguous" under self-similarity when compared to the situation where self-similarity was not assumed, as this is evident by comparing S a , M a , L a with S, M, L for τ = τ n (p n ). This in turn suggests that throwing away troublesome truths enables the horseshoe credible interval to fill in the gaps between the three regions.
For the sake of discussion, let us continue working under the self-similarity or excessive bias restriction. From the simulation results, it is clear that the horseshoe credible intervals have the best performance in terms of high coverage and shortest lengths when the means (or the true signals) are zero. Two settings of p n the number of nonzero signals were used, i.e., p n = 20 and p n = 200 when n = 400. Now let us increase the proportion of zero means (n − p n ) to a point that the self-similarity condition is violated, will the horseshoe still enjoy this near-perfect performance? By looking at the bar charts on coverage and lengths, it is conceivable that we can still get good performance even if this condition is violated slightly. My question concerns whether it is possible to observe empirically what will happen when sparse signals become non self-similar in the sense discussed in the paper.
Uncertainty quantification is undoubtedly one of the most active research areas in Bayesian statistics, and as the present paper shows, it involves resolving many delicate technical and practical issues. The horseshoe prior has proven itself to be optimal in sparse signal recovery, and we are able to access the quality of this recovery thanks to the theories and methods developed in the paper. It would be interesting also to consider other classes of priors, e.g., spike-and-slab types, and I hope that there will be more papers on Bayesian uncertainty quantification for sparse models in the future.

Contributed comment on Article by van der Pas, Szabó, and van der Vaart
Juho Piironen * , Michael Betancourt † , Daniel Simpson ‡ , and Aki Vehtari § The authors present a detailed analysis of the asymptotic frequentist properties of credible sets derived from posteriors with normal-linear measurement models and horseshoe priors. Although we disagree with the claim that "In Bayesian practice credible balls are nevertheless used as if they were confidence sets", the results in the paper are important for identifying where the horseshoe priors are fragile asymptotically, and hence particularly dangerous in the non-asymptotic regimes more typical in the applied problems where sparse models are needed.
One clarification we believe is warranted is that the horseshoe family of prior distributions does not encode sparsity as is typically interpreted. Instead of partitioning parameters into those that are zero and non-zero, the horseshoe priors actually separate parameters into those that are resolvable by measurements and those that are not. In particular, as with any model the horseshoe priors cannot be interpreted outside of the context of a particular likelihood (Gelman et al., 2017). Consequently the statement that "τ can be interpreted as the proportion of nonzero parameters, up to a logarithmic factor" is not quite true. Piironen and Vehtari (2017b;2017c) demonstrate that the effects of τ in horseshoe priors are intimately related to the measurement variability σ, even for the simple normal-linear measurement model. Figure 6 of Piironen and Vehtari (2017c), for example, clearly illustrates that rescaling the data changes the impact of the horseshoe prior unless τ is scaled by σ, even with an oracle prior information about the true number of significant parameters, p 0 = p n . In particular, the resolution threshold 2 log(n/p n ) arising in the paper implicitly assumes that the measurement variability σ is equal to 1, but a more realistic threshold has to take into account the value of σ, which is typically unknown a priori. We are very curious as to how robust the results presented in the paper are to these circumstances where also σ must be inferred.
Additionally, we find that the focus on marginal credible intervals is a significant limitation. One of the defining features of the family of horseshoe priors, and indeed a strong reason for their utility, is that they do not regularize each parameter independently but rather induce a joint regularization over the entire parameter space. In particular, joint credible intervals can behave much differently from marginal intervals. Figure 1 illustrates that with a linear model that employs a uniform prior over the slopes of two correlating predictors x 1 and x 2 it may happen that the joint posterior concentrates away from the origin without either of the marginals clearly distinguished from zero. * Dept.  The situation becomes even more difficult with a large number of correlating predictors when utilizing the horseshoe prior. In this case even for the most relevant variables most of the posterior mass can concentrate around zero, see for example Figure 9 in Piironen and Vehtari (2017c), which makes a reliable variable selection based on the posterior intervals challenging. Moreover, the method by  of including all variables with κ j > 1/2 would fail in this case because none of the predictors have κ j > 1/2. Consequently, we believe the only reliable variable selection strategy in these situations is based on the estimated effect on the predictive distribution, for example using the projection predictive variable selection (Piironen and Vehtari, 2017a). This framework has the added benefit that it provides guidance on how to select out significant parameters jointly, instead of one by one as discussed in the paper.
Finally, we advise caution with regard to the recommendation of the maximum marginal likelihood estimator (MMLE) for τ in practical problems. The large p, small n applications where horseshoe priors are most needed lie far away from the asymptotic regime that stabilizes the MMLE. Any complexity of the measurement model beyond the normal-linear model only makes the matter worse.

Contributed comment on Article by van der Pas, Szabó, and van der Vaart
Eduard Belitser * and Nurzhan Nurushev † We congratulate the authors for this very interesting article focused on the frequentist coverage of the credible sets resulting from the horseshoe prior in the sparse multivariate normal means model both for the empirical and hierarchical Bayes approaches. In paper  (last version from 2017), we studied the empirical Bayes approach to the same problem but using a different (mixture of normals) prior. For brevity, we refer to the discussion paper as paper PSV and our paper as paper BN. We skip (because of space limitations) some computations that might be needed to back up some claims we state below, these can be provided upon request to an interested reader.
The main results of PSV are adaptive over the sparsity scale within a grand space Although this excludes some "almost sparse" parameters that are formally non-sparse (with many very small, but nonzero, entries), this is a mild assumption (as p n is not assumed to be known) and in fact necessary to ensure the asymptotic regime n → ∞ considered in PSV. In BN we obtain local results without relating to any sparsity scale, e.g., the true parameter θ may be not 0 [p n ]-sparse at all. For example, as a consequence we derive the results not only for 0 [p n ], but also for other sparsity scales, such as weak s -balls m s [p n ]. Besides, in BN we derive local non-asymptotic exponential concentration bounds, which give a refined characterization of the quality of coverage and size relation results (finer, than, e.g., Theorem 5 from PSV, which is asymptotic in n → ∞) and allow subtle analysis for various asymptotic regimes. We should mention that the derivation of our somewhat stronger results in BN relies on certain explicit posterior expressions resulting from our choice of prior (mixture of normals, although the model is not assumed to be normal), whereas the horseshoe prior studied in PSV leads to only implicit posterior quantities so that the authors had to overcome complicated technical issues in the proofs.
It is well known and understood that in the studied model it is impossible to construct a confidence set simultaneously with a good coverage and optimal size adaptively to sparsity scales. Insisting on the uniform coverage would necessarily lead to a "big" size of resulting confidence set (therefore uninteresting, although optimal among all sets of uniform coverage), while pursuing the optimal size results in bad coverage for some "deceptive" parameters. In the both papers, PSV and BN, the second situation is studied, where the non-deceptive parameters are described by the so called "excessive-bias restriction" (EBR). This condition was introduced in BN and slightly weakened in PSV. Let us give an intuition behind the EBR. For σ 2 = 1 and any θ ∈ R n , the oracle rate introduced in EBR is min I⊆[n] G(I) = G(I o ) where G(I) = G(I, θ) = i∈I c θ 2 i + τ |I| log(en/|I|) = B(I c ) + V (I). It is always smaller than the minimax rate over any sparsity class containing θ: G(I o , θ) p n log(en/p n ) for all θ ∈ 0 [p n ]. Think of I o as the set of the (oracle) significant coordinates, B(I c o ) as the approximation term (or "bias") and V (I o ) as the complexity term (or "variance") of the oracle rate. Because of the non-asymptotic study, we need log(en/|I o |) instead of log(n/|I o |) (as compared with PSV) in the "variance" V (I o ), which are of the same order if |I o | ≤ p n = o(n).
The EBR from BN basically means that the bias of the oracle rate is of the order of the variance: B(I c o ) ≤ tV (I o ) for some t > 0. Consider now the weakened version of the EBR introduced in PSV. Since p n = o(n), without loss of generality one can set C s = 1 in the EBR condition of PSV. If the EBR from BN is fulfilled, then the EBR from PSV (the two relations of display (16)) is also fulfilled which is in our notation as follows: for τ > 2, C > 0, B(I c o ) ≤ CV (I o ) and #{i : θ 2 i ≥ τ log(n/|I o |)} ≥ |I o |. The first relation is exactly the EBR from BN and the second relation is implied by the definition of the oracle I o . The EBR from PSV is based on the fact that it is always possible to enlarge the oracle set I o to the smallest I * ⊇ I o such that the relation B(I c * ) ≤ CV (I * ) is fulfilled. At worst, I * is the support of θ: {i ∈ [n] : θ i = 0}. It is easy to show that V (I * ) G(I * ) G(I o ), i.e., the "variance" is the main term in the rate G(I * ) that is of the oracle rate order. Then the EBR in PSV givesp |I * | and the sizẽ p log(n/p) G(I * ) of the confidence setĈ n (L) in Theorem 5 is actually of the oracle rate order. The first relation of the EBR from PSV (in display (16)) can always be made satisfied, if not for the oracle set I o , then for an enlarged version I * of it. But this does not mean that the problem of adaptive confidence is solved without any price: "there is no free lunch". The point is that the second relation #{i : θ 2 i ≥ τ log(n/|I * |)} ≥ |I * | in display (16) from PSV is not automatically fulfilled for I * as it was for the oracle set I o . It is this relation that becomes the actual restriction on the parameter space, and the (uniform) lower bound on the constant τ is crucial here. In paper PSV, τ > 2 and this has to do with the fact that ε i ind ∼ N (0, 1). In BN, we assume that the error vector ε satisfies only certain mild exchangeable exponential moment condition (allowing non-normal and dependent coordinates) and the constant τ depends on the parameters of that condition. The main massage here is that the second relation in (16) must hold for some sufficiently large (depending on the distribution of ε) constant τ . Motivated by the above discussion, we further weaken the EBR condition. EBR condition. For some (fixed) C > 0 and ∈ (0, 1) there exists a k ∈ [n] such that n−k i=1 θ 2 (i) ≤ Ck log(en/k) and n− k i=n−k+1 θ 2 (i) ≥ (1 − )τ k log(en/k) for sufficiently large τ , where θ 2 (1) ≤ θ 2 (2) ≤ . . . ≤ θ 2 (n) . Let i * be the smallest possible such k. The above EBR condition follows from the EBR version of PSV with i * =p and : θ 2 i ≥ θ 2 (n−i * +1) } becomes the structure of the parameter θ. The EBR condition ensures that the structure I * of θ becomes "identifiable", which is the necessary ingredient in constructing the adaptive confidence sets. Under this weaker EBR version, we can prove exactly the same results as in BN with slightly modified constants.
We finish with two remarks/questions to the authors of PSV. First, the size relation for the constructed confidence set in Theorem 5 of PSV holds uniformly only over the EBR class (intersected with 0 [p n ]), whereas in BN the size relation holds uniformly over the whole space R n . It would be interesting to know whether this is an artefact of the proofs or of the Bayesian procedure based on the horseshoe prior. Second, it might be interesting to study whether the (weakest version of) EBR is minimal for the existence of adaptive confidence sets, and what would even be a proper formulation of this property.
Yoo appears to raise the question if the noncoverage of intermediate values of the parameters is intrinsic or due to the horseshoe or due to our proof. Unless our proof is in error, the third possibility can be safely discarded. As we have pointed out, there are absolute restrictions on confidence sets, which cannot be overcome by any methods, Bayesian or nonBayesian, so the horseshoe seems not to blame either. But perhaps Yoo is referring more to the small gaps between the regions of small, medium and large parameter values in our presentation. There is indeed room for further research on parameters near the boundaries of the regions. This is likely to be delicate, and may not yield essentially more insight than our current treatment, the gaps between the regions being small. Yoo also asks what would happen for a steadily increasing number of zero parameters. As shown by our results, this will depend on the values of the nonzero parameters, not just on their proportion of the whole.
Polson contrasts the LASSO and the horseshoe. We are also not a fan of the LASSO, perhaps for other reasons. As pointed out in , in the sparse case the Bayesian LASSO posterior, as opposed to the LASSO as a posterior mode, lacks accurate uncertainty quantification through the spread of the posterior. This fact is not surprising, as the LASSO prior does not model sparsity in any way. Polson makes a link to the James-Stein estimator and notes that the choice of penalty (i.e. prior density) has a big effect on the type of shrinkage one gets. We agree. For a sparse model one needs a prior that allows for sparsity. Both the Laplace prior of the LASSO and the Gaussian prior behind James-Stein are incapable to induce sparsity in the posterior, although the LASSO at least has a sparse mode. One might ask if a nonparametric prior, for instance a hierarchical one where the means come from a distribution that itself receives a Dirichlet process prior, or the empirical Bayes version of this, might work in both sparse and non-sparse situations. The simulation study in Koenker and Mizera (2014) seems to say that at least recovery is good. We do not know about uncertainty quantification. We note that in a true Bayesian spirit a comparison between various priors should always take into account the full posterior distribution, not just a measure of its center. Thus the prior plays a bigger role than just suggesting a penalty that is added on to the likelihood.
Martin contrasts continuous priors with a peak at zero, such as the horseshoe, with "two-group" priors that have a pointmass at zero. If many parameters are thought to be * Leiden University, svdpas@math.leidenuniv. exactly zero and not just "small", then such a two-group parameter is indeed attractive as a model. The exact zero values in Ryan's Figure 1 are a boon (although one gets almost the same picture by shrinking the zero intervals of the horseshoe to a point; in fact in this example this gives an even more correct result). We have shown elsewhere Castillo and Van der Vaart (2012);  that the recovery obtained with the spike-and-slab and horseshoe (and many other priors) is comparable, and expect the uncertainty quantification of the full posterior distributions also to be similar. (Preliminary work by Castillo appears to confirm this.) We have shown in Castillo and Van der Vaart (2012) that a heavier-tailed distribution than Gaussian for the non-zero parameters is useful here, as the Gaussian shrinks too much to zero (unless the noise variance is very small relative to the prior Gaussian variance). Martin proposes to solve the latter problem by using Gaussian "priors" centered at the observations, a device that it is similar to the one used in Belitser (2017); . This will indeed get rid of the shrinkage effect, but is perhaps not elegant from a Bayesian point of view. Regrettably the device does not remedy the computational disadvantage of the two-group priors, which is the main motivation for continuous priors such as the horseshoe. Martin concludes with an intriguing suggestion to look for a "change of perspective" going against honesty of credible intervals. As a contribution to this discussion we like to highlight that the failures of credible sets, which we observed in function estimation  and in sparse estimation in the present paper, appear to be of different natures. In function estimation the problem is the extrapolation from more or less observable features of an unknown response function to clearly unobserved ones that cannot be known. The data-analyst is not forced to extrapolate; the trouble is that hierarchical Bayesian procedures (and other adaptive schemes) automatically do so. In sparse estimation parameters are shrunk to zero, with the amount of shrinkage determined by a hierarchical or other adaptive scheme. In both cases we like hierarchical procedures, even if they necessarily bring trouble for credible sets. A difference between the cases is that in sparse estimation we know the type of trouble: spurious near-zeros. A possible change-of-perspective here is to interpret credible sets in the sense of false discovery rates: a non-zero is a true discovery, but a zero may be a false nondiscovery. In the case of function estimation the distortion is much less predictable. However, the response functions for which distortion occurs, the "inconvenient truths" of , are perhaps so unrealistic that they can be a-priori ruled out (don't worry be happy). The "a-priori" may be taken literally here, as  show that the priors used there do never produce them. This seems also a difference between the two cases: the intermediate values that are shrunk too much to zero in sparse estimation, seem very real (and their location depends on sample size and error variance and in regression on number of parameters and presumably the design matrix). We feel that honesty is a good policy, and do not feel that this lofty goal is at odds with adaptivity. A honest description of the honesty of adaptive credible intervals, and admission that there is a gray area where we run into a detectability limit, offers actionable insight into the results of a data-analysis, and is surely not in conflict with a statistician's integrity.
Belitser and Nurushev give a valuable summary of their interesting paper. They are interested in finding an optimal procedure, in some sense, under minimal conditions, in some sense, whereas our interest was to study the coverage properties of a reasonable and popular Bayesian procedure. It is interesting that the horseshoe comes close to optimality in the sense of Belitser and Nurushev, and also that their optimal procedure, although not Bayesian, uses empirical Bayesian thinking. At the end of their contribution Belitser and Nurushev note that their procedure possesses a certain uniformity property over the full parameter space (restricted to nearly black bodies), whereas our stated results on the horseshoe do not. One might indeed look further into uniformity. However, in our current formulation the parameterp is defined by θ 0 and so without the restriction we impose our statement would not make any sense. So this is built into formulation, and not an 'artefact of our proof'.
Castillo notes parallels between the performance of the horseshoe procedure and the procedure based on the spike-and-slab prior. For instance, there is a strong correspondence between the horseshoe tuning parameter τ and the weight of the spike. We are very much looking forward to the work in progress by him and his co-authors. Conceptually the spike-and-slab is attractive, perhaps more so than the horseshoe. The performance of the recovery procedure (as opposed to the uncertainty quantification) has been studied in more detail for the spike-and-slab than for the horseshoe, for instance relative to other loss functions than the square Euclidean norm and for other test classes than the class 0 [p] of nearly black bodies. We agree that it will be interesting to study these for the horseshoe as well. We do not know of work or simulation studies in this direction. Our qualitative justification of using marginal credible intervals as a selection rule of nonzero parameters is purely Bayesian: it seems reasonable to use posterior uncertainty in this manner. We do provide 'some justification' by studying the frequentist properties and do mention some thresholds. One might wish to refine these results. Regarding the logarithmic factors in the case of signals with very few zeros, which we do not really cover, we agree that it would be interesting to refine the results, although we note that a purely asymptotic treatment without getting hold of the constants may not be very informative about performance. The suggested refinement of the bound on the number of selected coefficients would indeed be interesting. We would also welcome results on the false discovery rate, which do not follow from our results (although some of our results have that flavour). Regarding the choice of the blow-up factors L in practice, we recommend to set them equal to 1, as we did in our simulations, as this yields the natural Bayesian credible sets. Bayesian and frequentist coverage are not the same, and we see no compelling reason to correct Bayesian credible sets for exact frequentist coverage, in particular in situations where full frequentist coverage is known to be impossible. For instance, joint credible balls will have large frequentist coverage only for parameters that satisfy an 'excessive bias condition', and the coverage probability will depend both on the constants in the latter condition and the blow-up factor. Then there is no universally good blow-up factor. For credible intervals the case appears to be more favourable in that at least the large intervals seem to have a universal constant, for which we have set a concrete value L L . (In the final version of the paper, we have improved this to 1.1ζ γ/2 /z α/2 ≈ 1 if γ = α are small; in the asymptotics 1.1 can be replaced by any constant strictly bigger than 1.) Piironen, Betancourt, Simpson and Vehtari contest our claim that τ can be interpreted as the 'proportion of nonzero parameters, up to a logarithmic factor'. Their argument against it seems to be that one can make such a statement only relative to a likelihood. That is true of course. By itself the horseshoe is even a prior for a single parameter, and it makes no logical sense to call one of its hyper parameters a proportion. Clearly the likelihood we have in mind is the one of the sequence model, and we give ample arguments for our claim, in the form of mathematical theorems backed up by simulations, referring both to recovery and uncertainty quantification. It is true that in the paper under discussion we have set the variances of the observations equal to 1. In earlier work we used a flexible value, and this leads to the same findings as long as one scales the horseshoe prior with the error standard deviation, as is standard (and the threshold then scales with the error standard deviation as well). Piironen, Betancourt, Simpson and Vehtari have done simulations with the horseshoe prior on parameters in a regression model. This model is beyond our current paper, and similar results can only be expected if one makes strong assumptions on the design matrix, as is common in the literature ('restricted isometry', 'compatibility', etc.). This is studied in the Bayesian context for recovery in ; we do not know of work on uncertainty quantification. The restrictions are necessary, because in high-dimensional regression models there must be collinearities between the columns of the design matrix, which the data cannot resolve, so that the posterior will load on multiple columns. We wonder if this is what happened in the figure of a bivariate posterior marginal they contributed in their discussion. We are certainly not against also considering bivariate marginals, or other projections, but in the context of our model they seem not so interesting, because the likelihood does not make the parameters interact. We confirmed this by making some pictures of bivariate posteriors in our model; examples are shown in Figure 1. Piironen, Betancourt, Simpson and Vehtari close with a warning against the marginal maximum likelihood estimator. They are not the first to do so. We can only say that we have not noted problems, not in the theory and not in the simulations. We also prefer full Bayes, but the greater efficiency may weigh in the other direction.