Bayesian Shrinkage towards Sharp Minimaxity

Shrinkage priors are becoming increasingly popular in Bayesian modeling of high dimensional sparse problems due to their computational efficiency. Recent works show that a polynomially decaying prior leads to satisfactory posterior asymptotics under regression models. In the literature, statisticians have investigated how the global shrinkage parameter, i.e., the scale parameter, of a heavy-tailed prior affects the posterior contraction. In this work, we explore how the shape of the prior, or more specifically, the polynomial order of the prior tail, affects the posterior. We discover that, under the sparse normal means model, the polynomial order does affect the multiplicative constant of the posterior contraction rate. More importantly, if the polynomial order is sufficiently close to 1, it induces optimal posterior convergence, in the sense that the Bayesian contraction rate is sharply minimax, i.e., not only the order but also the multiplicative constant of the posterior contraction rate is optimal. This Bayesian sharp minimaxity holds when the global shrinkage parameter follows a deterministic choice that depends on the unknown sparsity $s$. Therefore, a Beta-prior modeling is further proposed, under which our sharply minimax Bayesian procedure becomes adaptive to the unknown $s$. Our theoretical discoveries are supported by simulation studies.


Introduction
In Bayesian inference for high dimensional sparse models, the prior distribution needs to incorporate certain a priori knowledge of the structural sparsity. The classical spike-and-slab modeling assigns a two-group prior to each entry of a sparse n-dimensional parameter vector θ^(n) = (θ_1, ..., θ_n)^T, i.e., π(θ_i) is a mixture of two distributions corresponding to θ_i = 0 and θ_i ≠ 0 respectively. This natural modeling, however, requires expensive posterior computation. An alternative one-group Bayesian modeling, the so-called shrinkage prior, is computationally much more attractive, and hence has been gaining popularity in the Bayesian community.
In this work, we consider the sparse normal means model, y^(n) = θ^(n) + ε, where y^(n) = (y_1, ..., y_n)^T ∈ R^n, ε ∼ N(0, σ² I_n), and the parameter of interest is θ^(n) ∈ R^n. Throughout this work, the variance σ² of the error term is assumed to be known, and w.l.o.g. we let σ² = 1. Suppose that the true parameter value θ* is a sparse vector with s^(n) nonzero entries, and asymptotically, we allow s^(n) to increase with n. For notational simplicity, we drop the superscript (n) from y^(n), θ^(n) and s^(n) in what follows. This normal means model can be viewed as the simplest regression problem under orthogonality. Theoretically, the study of the sparse normal means model can provide important insights about Bayesian regression under a shrinkage prior. In practice, applications of the normal means model include multiple testing problems where y_i is a z-test statistic, and signal detection problems where y_i can be a noisy pixel in a functional magnetic resonance imaging (fMRI) scan. In the literature, a variety of shrinkage priors have been proposed for Bayesian inference on θ. Popular examples include the Bayesian Lasso (Park and Casella, 2008; Hans, 2009), the Horseshoe prior (Carvalho, Polson and Scott, 2010), the Dirichlet-Laplace prior (Bhattacharya et al., 2015), the Normal-Exponential-Gamma distribution (Griffin and Brown, 2011), the Generalized double Pareto distribution (Armagan et al., 2013), the Generalized Beta mixture of Gaussian distributions (Armagan, Clyde and Dunson, 2011), etc. A general form of shrinkage priors can be written as
π(θ_i | τ) = (1/τ) π_0(θ_i/τ), i = 1, ..., n, τ ∼ π(τ), (1.1)
where the scale parameter τ is called the global shrinkage parameter, which controls the overall shrinkage effect. τ can either follow a prior distribution as in (1.1) or take a deterministic value. If, furthermore, π_0 can be expressed as a scaled mixture of Gaussian
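The model and prior family above can be sketched numerically. The following Python sketch is a minimal illustration of our own (not the paper's code); it assumes the general form (1.1) is the standard scale family π(θ_i | τ) = π_0(θ_i/τ)/τ, and uses a standard Cauchy π_0 (polynomial tail of order α = 2) as an example:

```python
import numpy as np

def sparse_normal_means(n, s, signal, rng):
    """Generate y = theta* + eps with s nonzero entries and sigma^2 = 1."""
    theta = np.zeros(n)
    theta[:s] = signal
    y = theta + rng.standard_normal(n)
    return theta, y

def shrinkage_prior_density(theta, tau, pi0):
    """Assumed general form (1.1): pi(theta_i | tau) = pi0(theta_i / tau) / tau."""
    return pi0(theta / tau) / tau

# Example pi0 with a polynomial tail: the standard Cauchy density (alpha = 2).
cauchy = lambda x: 1.0 / (np.pi * (1.0 + x**2))

rng = np.random.default_rng(0)
theta_star, y = sparse_normal_means(n=100, s=10, signal=3.0, rng=rng)
dens = shrinkage_prior_density(y, tau=0.1, pi0=cauchy)
```

A small global shrinkage τ concentrates π(θ_i | τ) near zero while the polynomial tail leaves mass for large observations, which is the mechanism the rest of the section analyzes.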
distributions, (1.1) then leads to the so-called local-global shrinkage: θ_i ∼ N(0, λ_i² τ²), λ_i² ∼ π(λ_i²), τ ∼ π(τ), where the λ_i's are called local shrinkage parameters. Note that the Dirichlet-Laplace prior is an exception and doesn't fit the general form (1.1).
Given the wide choice of shrinkage priors, certain criteria are necessary to evaluate and compare them, e.g., computational efficiency and theoretical convergence. A benchmark for theoretical performance is the posterior contraction rate, which characterizes how fast the posterior distribution (not just the Bayes estimator) converges to the true parameter. For example, the L_2 posterior contraction rate r_n satisfies lim_{n→∞} E*[π(‖θ − θ*‖ ≥ r_n | y)] = 0 for any θ*, where E* denotes the expectation with respect to the data-generating measure of the n-dimensional data y under the true parameter θ*. It is well known that the frequentist minimax rate for normal means models is min_{θ̂} max_{θ*} E‖θ̂ − θ*‖ = {(2 + o(1)) s log(n/s)}^{1/2} (Donoho et al., 1992), and many Bayesian works show that Bayesian contraction rates are comparable to this minimax rate. For example, the Dirichlet-Laplace prior (Bhattacharya et al., 2015) (under certain conditions on θ*) achieves r_n ≍ {s log(n/s)}^{1/2}, and van der Pas, Kleijn and van der Vaart A. W.
(2014) showed that the horseshoe prior achieves r_n = M_n {s log(n/s)}^{1/2} for any M_n → ∞. Furthermore, recent works (e.g., van der Pas, Salomond and Schmidt-Hieber, 2016; Ghosh and Chakrabarti, 2014; Song and Liang, 2017) show that the tail behavior of π_0 plays an important role in posterior asymptotics, and suggest choosing a polynomially decaying π_0 in order to achieve a (near-)optimal posterior contraction rate. It is worth noting that all the aforementioned shrinkage priors, except the Dirichlet-Laplace prior, have a polynomial tail. Thus we believe polynomially decaying shrinkage priors are, generally speaking, good choices for high dimensional problems. As for the Dirichlet-Laplace prior, we notice that, although its posterior contraction is rate-minimax, its theory requires ‖θ*‖ ≤ s^{1/2} log²(n). We comment that this is actually a very strong condition. Under this condition, even the naive estimator θ̂ = 0 achieves rate-{s^{1/2} log²(n)} convergence, and the minimax rate offers merely a logarithmic improvement.
Given that π_0 is polynomially decaying, the existing literature focuses on how to (adaptively) choose the global shrinkage τ, i.e., the scale of the prior, such that the order of the posterior contraction rate is (near-)optimal (e.g., Song and Liang, 2017; Ghosh and Chakrabarti, 2014; van der Pas, Kleijn and van der Vaart A. W., 2014; van der Pas, Szabo and van der Vaart, 2017). In this work, we study another aspect of this story, namely how the shape of the prior distribution, i.e., the polynomial order of π_0, affects the posterior asymptotics. The contribution of this work is two-fold. First, we show that if the polynomial order of π_0 is sufficiently close to 1, we can achieve Bayesian sharp minimaxity, i.e., r_n/{2s log(n/s)}^{1/2} is sufficiently close to 1. This sharp minimaxity holds for the L_1 norm as well. Our simulation study also demonstrates that it is necessary to choose a tiny polynomial order in order to obtain the optimal contraction. In the Bayesian literature, sharpness in terms of the multiplicative constant is barely investigated, and our work sharpens all existing results on Bayesian posterior convergence for the normal means problem. To attain such sharp minimaxity, the choice of τ depends on the true sparsity ratio (s/n), which in practice is unknown. Therefore, our second contribution is to propose a Beta modeling on τ. This leads to a Bayesian sharply minimax inference procedure that is adaptive to the unknown sparsity. Simulations show that this adaptive Beta modeling has excellent performance.
This paper is organized as follows. In Section 2, we study the relationship between the polynomial order of π_0 and the posterior contraction rate, and establish Bayesian sharp minimaxity. In Section 3, we propose an adaptive modeling that doesn't depend on s. Two simulation studies and one real cancer data application are presented in Section 4. In the end, Section 5 provides further discussions and remarks. Technical proofs are provided in the Appendix.
Throughout this work, we use D_n to denote the observed data (i.e., D_n = {y_i}_{i=1}^n), and π(·|D_n) to denote the posterior based on data D_n. Given two positive sequences a_n and b_n, we write a_n ≺ b_n if a_n/b_n → 0, and a_n ≍ b_n if a_n/b_n is bounded away from 0 and ∞. We use ‖·‖ and ‖·‖_1 to denote the L_2 and L_1 norms of a vector, respectively.

Sharp Bayesian minimaxity
As discussed in Section 1, we consider Bayesian inference for the sparse normal means model under the general prior specification (1.1), where π_0 has a polynomial tail; in other words, we assume the following conditions on the model sparsity and the prior distribution π_0: [C.1] The true model is sparse: s = o(n).
[C.3] The tail of π_0(·) is polynomially decaying with polynomial order α > 1, i.e., there exist some positive constants M_1 and M_2 such that M_1|θ|^{−α} ≤ π_0(θ) ≤ M_2|θ|^{−α} for all sufficiently large |θ|. Note that condition C.3 (i.e., the polynomial decay of π_0) implies polynomial decay of the same order for π(θ_i|τ) as well. For simplicity of analysis, in this section we only investigate the posterior asymptotics when the global shrinkage parameter τ is a deterministic value, under which the posteriors of the θ_i's are mutually independent.
Let us first intuitively understand the posterior properties induced by a polynomially decaying prior. The posterior distributions of the θ_i's independently follow
π(θ_i | y_i) = C exp{−(y_i − θ_i)²/2} π(θ_i | τ)
for some normalizing constant C. Since both the function exp{−(y_i − θ_i)²/2} and the prior π(θ_i | τ) are unimodal, heuristically, the posterior π(θ_i|y_i) has two major modes, around 0 and y_i respectively. The posterior mass of the mode around 0 is approximately exp{−y_i²/2} multiplied by the prior mass of a small δ-neighborhood of 0, while the posterior mass of the mode around y_i is approximately √(2π) π(y_i), up to the normalizing constant, for small neighborhood constants δ and δ′. Therefore, if exp{−y_i²/2} ≻ π(y_i), the dominating posterior mode is the one at 0; if exp{−y_i²/2} ≺ π(y_i), the dominating posterior mode is the one at y_i. Note that in the above comparison, we consider different neighborhood radii around 0 and y_i; this is due to the fact that the landscape of π(θ_i|y_i) has two unequally wide modes (refer to Figure 3 for an illustration). This heuristic comparison demonstrates a hard thresholding phenomenon for the Bayesian posterior center: when |y_i| ≺ t, the posterior π(θ_i|y_i) shrinks to 0; when |y_i| ≻ t, the posterior concentrates around y_i, where the threshold value t satisfies exp{−t²/2} ≍ π(t) ≍ t^{−α} τ^{α−1}. This is analogous to the hard thresholding estimator θ̂_i = y_i 1(|y_i| ≥ t) (refer to Figure 4 for more details). It is known that the hard thresholding estimator achieves asymptotic sharp minimaxity when t = {2 log(n/s)}^{1/2}; thus we conjecture that such sharp minimaxity carries over to the posterior mean of Bayesian hard thresholding if the same threshold value is used, i.e., if the prior satisfies exp{−log(n/s)} ≍ τ^{α−1} {2 log(n/s)}^{−α/2}. In addition, to derive minimax posterior contraction beyond a minimax posterior mean, we also need to control the posterior variation, especially the posterior variation for the zero θ_i's. Hence, we need to impose a sufficiently strong shrinkage effect such that the posteriors of the zero θ_i's contract inside a small neighborhood around zero. Equivalently, this requires a sufficiently small global shrinkage parameter τ. Rigorously, we establish the following theorem, whose proof is presented in Appendix A.1.
We have several comments on this theorem. There are two conditions on the global shrinkage parameter τ. The first condition says that τ shall not be too small: an overly small τ may cause over-shrinkage of the truly nonzero θ_i's. This condition also echoes our heuristic argument s/n ≍ τ^{α−1}(2 log(n/s))^{−α/2} in the previous paragraph. The second condition says that τ shall not be too large, since an overly large τ fails to impose a sufficient shrinkage effect on the zero θ_i's. To satisfy both conditions on τ, we need to choose α ∈ (1, 1 + ω/2). The theorem provides an L_1 contraction result as well. Convergence in L_1, compared with L_2 convergence, requires stronger posterior variation control for the zero θ_i's. Note that the L_1 contraction result cannot be easily derived from the quantification of the posterior mean bias and posterior variance (which is the proof technique adopted by van der Pas, Kleijn and van der Vaart A. W., 2014; Bai and Ghosh, 2017). Our proof technique, which directly studies the entry-wise posterior contraction for all the zero θ_i's, enables us to perform the L_1 contraction analysis. Several insights can be obtained from the theoretical results of this theorem. First, for any polynomially decaying prior, with a proper choice of the global shrinkage τ, the order of the Bayesian contraction rate is exactly O({s log(n/s)}^{1/2}). Note that van der Pas, Kleijn and van der Vaart A.
W., 2014; Ghosh and Chakrabarti, 2014; van der Pas, Salomond and Schmidt-Hieber, 2016 only showed that the order of contraction is O(M_n {s log(n/s)}^{1/2}) with M_n → ∞, since their results are based on Markov's inequality. In contrast, our proof is based on constructing hypothesis tests that can test the true parameter versus balls of alternatives with exponentially small error probabilities. Secondly, the multiplicative constant of the Bayesian contraction rate is positively related to the polynomial order of π_0(·). Therefore, to obtain sharply L_2/L_1 minimax contraction, i.e., so that ω can be sufficiently small, a sufficient choice is to let the polynomial order of π_0 be very close to 1.
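As a numerical sanity check of the heuristic threshold calibration (our own sketch, not from the paper), one can solve exp{−t²/2} = t^{−α} τ^{α−1} under the calibration τ^{α−1} = (s/n)(2 log(n/s))^{α/2} implied by s/n ≍ τ^{α−1}(2 log(n/s))^{−α/2}, and verify that the resulting threshold coincides with the sharp-minimax value {2 log(n/s)}^{1/2}:

```python
import numpy as np

def bayes_threshold(n, s, alpha):
    """Upper root of exp(-t^2/2) = t^(-alpha) * tau^(alpha-1), with the
    calibration tau^(alpha-1) = (s/n) * (2*log(n/s))**(alpha/2)."""
    log_tau_pow = np.log(s / n) + (alpha / 2) * np.log(2 * np.log(n / s))
    # f(t) = log LHS - log RHS; positive between the two crossing points.
    f = lambda t: -t**2 / 2 + alpha * np.log(t) - log_tau_pow
    lo, hi = 1.0, 100.0  # f(lo) > 0 and f(hi) < 0 for moderate n/s
    for _ in range(200):  # bisection on the decreasing branch of f
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

t_hat = bayes_threshold(n=10000, s=100, alpha=1.1)
# t_hat agrees with {2 log(n/s)}^(1/2) = (2 * log(100))^(1/2) ≈ 3.035
```

Plugging t = {2 log(n/s)}^{1/2} into both sides gives s/n exactly, so the bisection recovers this value for any α > 1, which is why the multiplicative constant rather than the threshold order is what the polynomial order controls.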
Since the logarithmic term log(n/s) is asymptotically dominated by (n/s)^c for any small constant c > 0, the conditions on τ in Theorem 2.1 can be simplified (by replacing log(n/s) with an arbitrarily small power of (n/s)), and we obtain the following corollary.
Theorem 2.1 and Corollary 2.1 suggest an optimal choice for the global shrinkage as τ ≍ (s/n)^{(α+δ)/(α−1)} for some small value δ ≥ 0. In practice, where the true sparsity is unknown, this theoretical suggestion is not directly useful. Therefore, in Section 3, we discuss a full Bayesian approach which is adaptive to the unknown sparsity. To end this section, we consider a possible alternative choice τ ≍ (1/n)^{(α+δ)/(α−1)}, that is, to substitute the unknown s with 1. Obviously, if we assume that the sparsity s is a fixed quantity, this choice of τ is of the same order as the optimal suggestion. But if s increases with n, it leads to a suboptimal upper bound for the contraction rate. The details are provided in the next theorem.
for the same functions C_1(ω) and C_2(ω) used in Theorem 2.1.
The proof of this theorem is similar to the proof of Theorem 2.1, and hence is omitted in this manuscript. This result allows one to choose τ of order (1/n)^{(α+δ)/(α−1)} for any constant δ > 0, and the resulting L_2 contraction rate is of order {s log(n)}^{1/2}. The rate {s log(n)}^{1/2} is considered suboptimal in the literature. When comparing C_1(ω){s log(n)}^{1/2} with C_1(ω){s log(n/s)}^{1/2}, we have that (1) the two rates are asymptotically the same when log s ≺ log n (i.e., the sparsity s increases slower than any polynomial rate); (2) the former is of the same order as the latter, but has a larger multiplicative constant, when s ≍ n^c for some c ∈ (0, 1); (3) the former is of strictly greater order when log s ∼ log n (e.g., s = n/log n). Note that here we only compare the upper bounds of the posterior contraction rates obtained in Theorems 2.1 and 2.2; thus it is not rigorous to claim that the prior specification in Theorem 2.2 leads to suboptimal posterior convergence.
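The two choices of τ and the three sparsity regimes above can be made concrete with a short numerical illustration (our own sketch; the function names are not from the paper):

```python
import numpy as np

def optimal_tau(n, s, alpha, delta=0.05):
    """Sparsity-aware choice tau = (s/n)^((alpha+delta)/(alpha-1))."""
    return (s / n) ** ((alpha + delta) / (alpha - 1))

def sparsity_blind_tau(n, alpha, delta=0.05):
    """Alternative from Theorem 2.2 that replaces the unknown s by 1."""
    return (1.0 / n) ** ((alpha + delta) / (alpha - 1))

def rate_ratio(n, s):
    """Ratio {s log n}^(1/2) / {s log(n/s)}^(1/2); the s^(1/2) factor cancels."""
    return np.sqrt(np.log(n) / np.log(n / s))

n = 1e6
r_slow = rate_ratio(n, np.log(n))       # regime (1): ratio stays near 1
r_poly = rate_ratio(n, np.sqrt(n))      # regime (2): ratio -> sqrt(2)
r_dense = rate_ratio(n, n / np.log(n))  # regime (3): ratio diverges with n
tau_opt = optimal_tau(n=10000, s=100, alpha=1.1)
tau_blind = sparsity_blind_tau(n=10000, alpha=1.1)
```

With α close to 1 the exponent (α + δ)/(α − 1) is large, so both choices of τ are extremely small, and the sparsity-blind choice is always smaller than the sparsity-aware one when s > 1.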

Adaptive Bayesian inference
In this section, we consider the case where information about s is not available, hence adaptive ways to determine the global shrinkage τ are necessary. In the Bayesian paradigm, there are at least two popular approaches to handle hyperparameters: one is empirical Bayes, i.e., maximizing the marginal likelihood of the data, and the other is the full Bayesian approach, i.e., assigning a prior distribution to the hyperparameters. For example, van der Pas, Szabo and van der Vaart (2017) studied both approaches for the global shrinkage parameter in the horseshoe modeling, and established an adaptive horseshoe Bayesian inference with a suboptimal contraction rate of order O({s log n}^{1/2}). In this work, we are particularly interested in constructing an appropriate prior for τ.
Theorem 2.1 suggests that the deterministic choice of τ decreases to zero as n grows. This motivates us to design the prior π(τ) to be stochastically decreasing as n increases; more specifically, the prior distribution π(τ) shrinks toward 0 at a proper rate. In the meantime, this prior must not shrink too fast, so that it still assigns a minimal prior density around the optimal choice τ ≍ (s/n)^{(α+δ)/(α−1)}. Our next theorem provides a sufficient condition on the prior π(τ) such that the Bayesian sharp minimaxity still holds: under this condition on π(τ), together with the signal-strength condition log(max_j |θ*_j|)/log(n/s) ≤ ω/(5α), results (2.1) and (2.2) still hold. The theorem thus claims that under a proper choice of π(τ), sharp minimaxity is still attainable when we choose α close to 1. Let us first discuss the two conditions on the hyper-prior π(τ). The first condition requires that π(τ) maintain a minimal prior probability on the optimal range [(s/n)^{(1+ω/2)/(α−1)}, (s/n)^{α/(α−1)}]. The second condition requires that π{τ > (s/n)^{α/(α−1)}} decay rapidly to 0, i.e., the distribution π(τ) becomes more and more concentrated around 0. A particular choice satisfying these two conditions is the Beta modeling (3.1) for some c ∈ (α/(α−1), (1+ω/2)/(α−1)), and one can easily verify that both conditions hold for this choice. It is worth mentioning that the Beta prior is widely used as a hyperprior in spike-and-slab Bayesian modeling (Castillo and van der Vaart, 2012; Rocková, 2015). For example, a common spike-and-slab modeling assigns a prior probability p to each θ_i being selected into the model, i.e.,
π(θ_i ≠ 0) = p, and Castillo and van der Vaart (2012) suggested a hyperprior p ∼ Beta(1, 4n+1). In the literature, van der Pas, Szabo and van der Vaart (2017) proposed a truncated half-Cauchy hyperprior for the global shrinkage parameter in the horseshoe modeling. Both the Beta modeling (3.1) and the truncated half-Cauchy prior have compact support within [0, 1], but a big difference between the two is that the Beta prior distribution converges to a Dirac measure at 0 as n goes to infinity, while the truncated half-Cauchy prior converges to a non-degenerate distribution, namely the half-Cauchy distribution truncated to [0, 1].
In the above theorem, we also impose an additional technical condition on the magnitude of the true nonzero θ_j's, namely log(max_j |θ*_j|)/log(n/s) ≤ ω/(5α). Note that log(n/s) → ∞, hence this condition still allows the true signal strength to grow. If the true signal grows sub-polynomially, i.e., log(max_j |θ*_j|) = o(log(n/s)), then ω can be arbitrarily small and we obtain sharply minimax contraction; if the true signal grows polynomially, i.e., max_j |θ*_j| ≍ (n/s)^a for some constant a > 0, then Theorem 3.1 still ensures that the rate of contraction is O({s log(n/s)}^{1/2}), albeit with a larger multiplicative constant. This constraint on |θ*_j| is essentially equivalent to requiring that the prior density at the true parameter (i.e., π(θ*)) be bounded away from 0. Similar conditions, which require that the prior is "thick" around the true parameter value θ*, are regularly used in the Bayesian theoretical literature (e.g., Jiang, 2007; Kleijn and van der Vaart, 2006; Ghosal, Ghosh and Van Der Vaart, 2000; Ghosal and Van Der Vaart, 2007). As mentioned in Section 1, the Dirichlet-Laplace prior (Bhattacharya et al., 2015) also imposes an upper bound constraint on the magnitude of θ*, but our condition is much weaker.
The proof of this theorem is a combination of the proofs of Theorems 2.1 and 3.1, and hence is omitted. To construct a π(τ) that satisfies the conditions in this theorem, we can simply modify the above Beta modeling as:

Simulation and data analysis
In this section, we demonstrate two simulation studies to justify our theoretical discoveries, as well as a cancer data application. In the first simulation, we assume that the sparsity s is known in advance, and we empirically compare different choices of the polynomial order of the prior tail. The theorems presented in Section 2 assert that it is sufficient to assign a very small α; we use numerical studies to evaluate how good such a choice is, and how necessary the small-α condition is. In the second simulation, we study the performance of the adaptive prior proposed in Theorem 3.1 when s is unknown, and compare it to the adaptive horseshoe prior proposed by van der Pas, Szabo and van der Vaart (2017). We also present a Bayesian analysis of real prostate cancer data.

Simulation I: Comparison between different choices of polynomial order
In this simulation, to study the asymptotic behavior, we let the data dimension increase as n = 50, 100, 500, 1000, and set the sparsity s equal to the rounded value of n^{1/2}. The nonzero coefficients are chosen as θ*_i = {t log(n/s)}^{1/2} for 1 ≤ i ≤ s, where t = 1.2, 2.2, 4.2 and 6.2. These different choices of t represent a range of signal strengths. To implement a class of polynomial priors with different tail orders, we let π_0 be the t distribution with α − 1 degrees of freedom, i.e.,
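The data-generating design just described can be sketched in a few lines (our own illustration, not the paper's code):

```python
import numpy as np

def simulation_design(n, t, rng):
    """Simulation I design: s = round(sqrt(n)) nonzero means, each equal to
    {t log(n/s)}^(1/2); observations are the means plus N(0, 1) noise."""
    s = int(round(np.sqrt(n)))
    theta = np.zeros(n)
    theta[:s] = np.sqrt(t * np.log(n / s))
    y = theta + rng.standard_normal(n)
    return theta, y, s

rng = np.random.default_rng(1)
designs = {(n, t): simulation_design(n, t, rng)
           for n in (50, 100, 500, 1000)
           for t in (1.2, 2.2, 4.2, 6.2)}
```

The parameter t directly controls whether the signals sit below, at, or above the detection boundary relative to the threshold discussed in Section 2.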
In Figure 1, we compare the posterior contraction among these 6 priors. We estimate the posterior probability π(‖θ − θ*‖² ≥ 2.2 s log(n/s) | D_n) by posterior samples, and plot this probability against the different n and t values. For a minimax Bayesian procedure, this probability converges to 0 as n increases, regardless of the magnitude of θ*. The figure clearly indicates that α = 1.1 has the best performance: its posterior probability always decreases toward 0 for all the different t's. For the other 5 priors, the posteriors don't contract into the {2.2 s log(n/s)}^{1/2}-neighborhood for some values of t. It is worth mentioning that for the t prior with α = 1.1, the case t = 2.2 leads to the slowest convergence, as the red curve decreases very slowly. This is because, as we discussed, t = 2.2 corresponds to the most difficult scenario for α = 1.1. Besides, the plots of the horseshoe and the t-distribution with α = 2.1 and c = 1/(α − 1) have very similar patterns, since these two prior specifications are almost equivalent in terms of their tail behavior.
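The posterior probability estimated for Figure 1 is a simple Monte Carlo average over posterior draws; a minimal sketch (our own, with `samples` standing for an array of posterior draws, one draw per row) is:

```python
import numpy as np

def contraction_prob(samples, theta_star, n, s, c=2.2):
    """Monte Carlo estimate of pi(||theta - theta*||^2 >= c*s*log(n/s) | D_n),
    where each row of `samples` is one posterior draw of theta."""
    sq_dist = np.sum((samples - theta_star) ** 2, axis=1)
    return np.mean(sq_dist >= c * s * np.log(n / s))

# Toy check: draws glued to theta* give probability 0, distant draws give 1.
theta_star = np.zeros(100)
near = np.tile(theta_star + 0.01, (50, 1))
far = np.tile(theta_star + 10.0, (50, 1))
p_near = contraction_prob(near, theta_star, n=100, s=10)
p_far = contraction_prob(far, theta_star, n=100, s=10)
```

In the actual study `samples` would come from an MCMC run under each prior, and the plotted curves are this estimate as a function of n.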
In Figure 2, we present comparisons of the posterior mean squared L_2 errors and L_1 errors. As a reference for the comparison, we also plot the curves corresponding to the minimax squared L_2 and L_1 errors, namely 2s log(n/s) and s{2 log(n/s)}^{1/2} respectively. Note that when s = n^{1/2}, the suboptimal contraction rate satisfies 2s log(n) = 2(2s log(n/s)) and s{2 log n}^{1/2} = {2}^{1/2} s{2 log(n/s)}^{1/2}. When t = 1.2, i.e., signals are weak, the L_2 errors of all 4 priors don't exceed the minimax rate. This is not surprising, because under weak signals, any method that imposes enough shrinkage, including the naive estimator θ̂ = 0, induces an L_2 error smaller than the minimax error. It also shows that the t-prior with α = 2.1, c = (α + 0.05)/(α − 1) does have a better L_2 error than the t-prior with α = 1.1 under weak signals. When t = 2.2, i.e., the signal strength is in the boundary case, the horseshoe and the t-prior with c = 1/(α − 1) begin to exceed the minimax L_2 error, while the two t-priors with c = (α + 0.05)/(α − 1) have L_2 errors that are almost the same as, but slightly higher than, the minimax error. When t = 4.2, i.e., signals are strong, the t-prior with α = 1.1 is the only one that achieves asymptotic sharp minimaxity. When t is even larger (not presented in Figure 2), the L_2 error of the t-prior with α = 1.1 is much smaller than the minimax rate, but the L_2 errors of the other three are much larger than the minimax rate. In summary, a small polynomial order universally ensures that the L_2 estimation error is asymptotically bounded by the minimax rate. As for the error rates under the L_1 norm, the L_1 error is much more sensitive to small variations in the coordinates than the L_2 error. The choice τ = (s/n)^{(α+0.05)/(α−1)} satisfies the upper bound condition on τ in Theorem 2.1 and, as we discussed, guarantees sufficient posterior shrinkage for the zero θ_i's, hence it leads to a small L_1 error. This argument is consistent with our simulation results: the two priors with the choice c =
(α + 0.05)/(α − 1) have much smaller L_1 errors. As in our comparison of L_2 errors, only the small polynomial order prior (α = 1.1) ensures sharp minimaxity. Similar to Figure 1, we see that the t-distribution with α = 2.1 and c = 1/(α − 1) has almost identical performance to the horseshoe prior. The above simulation results successfully demonstrate the sharpness of the Bayesian minimax contraction when the polynomial order α of the prior is close to 1. However, there are still some discrepancies between the displayed finite-sample behavior and the Bayesian hard thresholding phenomenon described in Section 2. According to the Bayesian hard thresholding phenomenon, when y_i² is greater than the thresholding value 2α log(n/s), the posterior will have one dominating mode which is approximately N(y_i, 1). This implies that when t = 4.2 > 2α = 2.2, the posterior mean squared L_2 error or L_1 error of the nonzero θ_i's is asymptotically of order O(s), which should thus lead to a much smaller estimation error compared with the boundary case t = 2.2. But in Figure 2, there is no noticeable difference in posterior mean errors between t = 2.2 and t = 4.2 under the t prior with α = 1.1. This is because our asymptotic theory relies on the sparsity assumption log(n/s) → ∞, while in our simulation experiments, the ratio n/s is not large enough to reflect the asymptotic behavior. To illustrate this, Figure 3 plots the histograms of the posterior π(θ_i | y_i = [3.2 log(n/s)]^{1/2}) under the t-prior with α = 1.1 and τ = (s/n)^{1.15/0.1}, for different values of n/s. As shown, the posterior does have two modes, but the mode around y_i isn't the dominating one until n/s is as large as 100,000. In contrast, other priors with a different polynomial order of tail decay, for example the horseshoe prior, require a much smaller value of n/s for the mode around y_i to dominate. This implies that in a small-sample real application, a prior with polynomial order close to 1 is less powerful in terms of detecting signals, i.e., it yields a sparser model selection result compared with the horseshoe prior. On the other hand, the horseshoe prior specification fails to induce a sufficient shrinkage effect for the zero θ_i's, and therefore its estimation performance for the whole vector θ is still inferior. For the t-prior with α = 1.1, we also plot its posterior mean shrinkage coefficient E(θ_i|y_i)/y_i with respect to the values of c = y_i²/log(n/s) and n/s in Figure 4. As n/s tends to infinity, the posterior shrinkage behaves more and more similarly to hard thresholding. In addition, several further simulation experiments are presented in the Supplementary Material, where we explore (1) the posterior contraction of shrinkage priors under a varying nonzero θ scenario, (2) the posterior convergence performance for the nonzero θ's and the zero θ's respectively, and (3) uncertainty quantification and Bayesian model selection with shrinkage priors.

Simulation II: Adaptive Bayesian modeling and comparisons
In the second simulation, we no longer assume that s is known, and the global shrinkage τ is now chosen in an adaptive Bayesian manner. We compare the following two adaptive Bayesian procedures. The first prior is constructed based on Theorem 3.1: we consider a t-prior with α = 1.1, and let τ follow the Beta modeling (3.1) with c = (α + 0.05)/(α − 1). As for the posterior sampling of this adaptive t-prior, in addition to (4.2), we need the conditional posterior distribution of τ. Since this conditional posterior has a compact support, it can be sampled via inverse cumulative-distribution sampling. The other prior is the horseshoe prior with τ following a half-Cauchy distribution truncated to [1/n, 1] (van der Pas, Szabo and van der Vaart, 2017). The simulation settings for data dimension and signal strength are exactly the same as in the first simulation study. Naturally, one should expect the performance of the adaptive Bayesian approaches to be worse than in the case where s is known.
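Inverse-CDF sampling on a compact support can be sketched generically as follows (our own grid-based version, not the paper's implementation; `log_density` stands for any unnormalized log conditional density, such as that of τ):

```python
import numpy as np

def inverse_cdf_sample(log_density, lo, hi, u, grid_size=10000):
    """Draw one sample from an unnormalized density supported on [lo, hi]
    by numerically inverting its CDF on a grid; u is a Uniform(0,1) draw."""
    x = np.linspace(lo, hi, grid_size)
    logd = log_density(x)
    w = np.exp(logd - np.max(logd))  # stabilized unnormalized density values
    cdf = np.cumsum(w)
    cdf /= cdf[-1]                   # normalized discrete CDF
    idx = min(np.searchsorted(cdf, u), grid_size - 1)
    return x[idx]

# Example: a uniform conditional on [0, 1] returns its median at u = 0.5.
x_med = inverse_cdf_sample(lambda t: np.zeros_like(t), 0.0, 1.0, 0.5)
```

Because the support is compact, the grid approximation is uniformly accurate, which is what makes this sampler convenient inside a Gibbs update for τ.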
Figure 5 presents a comprehensive comparison between the adaptive t-prior and the adaptive horseshoe prior, in terms of posterior contraction, posterior mean squared L_2 error and posterior mean L_1 error. It shows that the adaptive t-prior has a better performance in almost every aspect. The posterior contraction plots of the t-prior always have a decreasing trend towards 0 under the different signal strengths, and its convergence pattern is very similar to Figure 1. This implies that the Beta modeling of τ is a good substitute for the optimal choice τ = (s/n)^{(α+0.05)/(α−1)}. The posterior contraction of the adaptive horseshoe, on the other hand, doesn't converge at all. Its plot pattern is also quite different from the posterior contraction plot in Figure 1, especially for the case t = 6.2. This indicates that the truncated half-Cauchy prior doesn't adapt well to large signals. For the posterior mean L_2 error of the adaptive t-prior, when the signal is weak or strong, the error is well bounded by the minimax rate if n is large. When the signal strength is moderate, the error slightly exceeds the minimax rate. However, there is no trend suggesting that the L_2 error will increase faster than the minimax rate; thus we believe that if n continues to grow and α is closer to 1, the error of the adaptive t-prior should be asymptotically bounded by the minimax rate. The adaptive horseshoe prior, in contrast, induces a much larger error than the minimax rate, except in the weak signal situation. Similarly, in terms of the L_1 norm, the adaptive t-prior attains the minimax rate, and it clearly outperforms the horseshoe prior regardless of the signal strength.
In conclusion, the two simulation studies presented demonstrate the necessity of choosing α sufficiently close to 1. A prior specification as simple as a t-distribution with a tiny degree of freedom ensures excellent Bayesian contraction and estimation. The proposed adaptive Beta modeling of the global shrinkage τ leads to very stable results and significantly outperforms the adaptive horseshoe prior.

Real data set analysis
We consider a popular prostate cancer dataset (Efron, 2008; Singh et al., 2002) from a microarray experiment, which consists of expression levels for n = 6033 genes from 50 normal control subjects and 52 cancer patients. Two-sample tests are performed to compare the expression level of each gene between the control and patient groups. The corresponding p-values are then converted into z-statistics, i.e., z_i = Φ^{−1}(p_i/2) for i = 1, ..., n. Hence, it is appropriate to model these z-statistics as a normal means model z_i = θ_i + ε_i, where θ_i = 0 if the mean expression levels for the ith gene are the same between the control and patient populations. We use Bayesian shrinkage to make inference on the parameter θ.
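The p-value-to-z conversion described above takes only a couple of lines (a sketch using the standard library's normal quantile function; note that the formula as stated maps every p-value to a negative z, so in practice the sign of the underlying test statistic is often reattached):

```python
import numpy as np
from statistics import NormalDist

def pvalues_to_z(pvals):
    """Convert two-sided p-values to z-statistics via z_i = Phi^{-1}(p_i / 2).
    All outputs are negative; signs of the two-sample statistics can be
    restored afterwards if directional information is desired."""
    inv_cdf = NormalDist().inv_cdf  # standard normal quantile function
    return np.array([inv_cdf(p / 2) for p in pvals])

z = pvalues_to_z([0.05, 0.5])
```

Small p-values map to large |z|, so genes with strong differential expression appear as large-magnitude observations in the normal means model.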
We implement the adaptive t-prior with α = 1.1 and τ following the Beta modeling (3.1), and the adaptive horseshoe prior with τ following the truncated half-Cauchy prior. Note that the horseshoe prior has already been used to analyze this prostate cancer data in the literature (Bhattacharya et al., 2015; Bai and Ghosh, 2017; Bhadra et al., 2017), but all these applications choose τ to follow the non-truncated half-Cauchy prior. As illustrated by van der Pas, Szabo and van der Vaart (2017), the empirical performances of the truncated and non-truncated half-Cauchy hyper-priors are quite different; hence the horseshoe posterior summary presented in this section is not comparable to the results in the literature.
In Figure 6, we plot the posterior means |E(θ_i|D_n)| against the observations |y_i|. For larger |y_i|'s, the posterior means under the adaptive t-prior and the adaptive horseshoe prior are close. For smaller |y_i|'s, the adaptive t-prior apparently induces a stronger shrinkage effect. This observation is consistent with our simulation results that the horseshoe prior imposes insufficient posterior shrinkage on the zero θ_i's. If we perform variable selection by selecting the genes for which |E(θ_i|D_n)|/|y_i| > 1/2, then the horseshoe prior selects the top 8 genes and the t-prior selects the top 6 genes. This is also consistent with our previous arguments that the t-prior with polynomial order close to 1 is less powerful than the horseshoe prior. Note that the posterior π(θ_i|D_n) has two major modes, around 0 and around y_i, hence the selection rule |E(θ_i|D_n)|/|y_i| > 1/2 heuristically means that the posterior mass of the mode around y_i is greater than one half.
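The two-mode structure and the selection rule can be sketched numerically. This is a rough illustration under assumed settings (a scaled Student-t prior with df = 0.1, i.e., α = 1.1, and an illustrative global scale τ = 0.01, with the posterior mean computed by grid quadrature), not the exact computation behind Figure 6:

```python
import numpy as np
from scipy.stats import norm, t as student_t

def posterior_mean(y, df=0.1, tau=0.01):
    """E(theta | y) under y ~ N(theta, 1) and theta ~ tau * t_df, by quadrature."""
    theta = np.linspace(-20.0, 20.0, 20001)
    log_w = norm.logpdf(y - theta) + student_t.logpdf(theta / tau, df) - np.log(tau)
    w = np.exp(log_w - log_w.max())  # stabilized unnormalized posterior
    return np.sum(theta * w) / np.sum(w)

# The posterior mixes a mode near 0 and a mode near y; the rule
# |E(theta|y)| / |y| > 1/2 selects y roughly when the mode near y
# carries more than half of the posterior mass.
obs = np.array([0.5, 6.0])
means = np.array([posterior_mean(v) for v in obs])
selected = np.abs(means) / np.abs(obs) > 0.5
```

A small observation such as y = 0.5 is shrunk towards 0 and not selected, while y = 6 retains a posterior mean close to itself and is selected.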

Final remarks
In this work, we study Bayesian inference on the high dimensional sparse normal sequence model with a polynomially decaying prior distribution. Our main result, Theorem 2.1, reveals the connection between the upper bound of the posterior contraction and the polynomial order α. This provides a sufficient condition for sharp posterior minimaxity: choosing a sufficiently tiny α − 1, the ratio between the Bayesian posterior contraction rate and the minimax rate will be sufficiently close to 1. We conjecture that necessity holds for Theorem 2.1 as well, in the sense that the smaller the α − 1, the better the Bayesian contraction in terms of the multiplicative constant. Empirical studies also show a great improvement in the accuracy of the Bayesian shrinkage procedure using a t-prior with a tiny degree of freedom. Our study considers α to be a fixed, sufficiently small hyperparameter. Alternatively, one can investigate letting the polynomial order α decrease to 1 as n increases, i.e., lim α_n = 1 at a proper rate. Another related question is whether it is a good choice to use an improper prior with exactly α = 1, e.g., π_0(θ) ∝ θ^{-1}. Our theoretical results break down when α = 1 (the term α − 1 appears in the denominator), and our technical tool, which follows the arguments of Le Cam-Birgé testing theory (Birgé, 1984; Barron, 1998; Le Cam, 1986), only works for proper prior specifications. The primary research interest of this paper is the L2 and L1 posterior contraction rates. Another important research objective is sparsity recovery, i.e., identifying the set {j : θ*_j ≠ 0}. Given a continuous posterior distribution induced by a shrinkage prior, one easy way to perform model selection is threshold truncation: a variable is selected if a posterior summary, such as the posterior mean, exceeds some thresholding value. This simple approach has been widely used in the literature; however, it usually leads to over-selection, with the number of false positives being of order O(s) (e.g., Theorem 3.4 of Bhattacharya et al., 2015). A different model selection approach is to select the θ_i's whose marginal credible intervals exclude 0; the consistency of this Bayesian selection method is investigated by van der Pas, Szabó and van der Vaart (2017) for the horseshoe prior.
This work focuses on the normal sequence model; thus it would be of substantial interest to conduct a similar investigation for general regression models. Our results rely heavily on the independence among the y_i's, and it is not trivial to extend them to regression models with a correlated design matrix. Song and Liang (2017) study the posterior asymptotics for the general linear regression model, including order-(s log p/n)^{1/2} L2 contraction and model selection consistency, when a polynomially decaying prior is used. We believe that the choice of the polynomial order also affects the multiplicative constant of the posterior contraction rate under regression models, and we conjecture that the optimal choice of α will depend on the eigenstructure of the design matrix. If the design matrix X is nearly orthogonal, e.g., all entries of X follow independent Gaussian distributions, we conjecture that the same results as in Theorem 2.1 will still hold, and one needs to choose α ≈ 1 in order to obtain optimal Bayesian contraction.
The next lemma is a refined result of Lemma 6 in Barron (1998).

Lemma A.3. Let f* be the true probability density of the data generation, let f_θ be the likelihood function with parameter θ ∈ Θ, and let E*, E_θ denote the corresponding expectations respectively. Let B_n and C_n be two subsets of the parameter space Θ, and let φ_n be a testing function satisfying φ_n(D_n) ∈ [0, 1] for any realization D_n of the data generation.

Proof. Define Ω_n to be the event … By the Fubini theorem, … Combining the above inequalities leads to the conclusion.
To prove (A.9), we consider the testing function φ(ỹ) = max_{|ξ| ≤ cδs} 1(‖ỹ_ξ‖ ≥ {δs log(n/s)}^{1/2}), where ỹ_ξ is the subvector of ỹ corresponding to the model ξ. (A.12) Note that with dominating probability, ‖y‖ ≤ (1 + c')n^{1/2} for any c' > 0, and (A.13) holds for any positive η. By the condition on τ, …

van der Pas et al. (2017) investigated uncertainty quantification of the horseshoe prior. In this experiment, we consider the marginal credible interval for θ_i of the form E(θ_i|D_n) ± 1.96 {Var(θ_i|D_n)}^{1/2} (the same interval was used in the simulation studies of van der Pas et al., 2017), and the Bayesian model selection result is based on whether the marginal credible interval includes zero or not. We assess the coverage of the marginal credible intervals and the correctness of the model selection results for the t-prior with α = 1.1, c = (α + 0.05)/(α − 1), and the horseshoe prior with global shrinkage τ = (s/n){log(n/s)}^{1/2}. Namely, we evaluate the percentage of selected true nonzero θ_i's and the percentage of non-selected zero θ_i's (for both percentages, the higher the better), as well as the average credible interval coverage for both the nonzero and zero θ_i's. The dimension increases as n = 50, 100, 500, 1000, and the sparsity s equals the rounded value of n^{1/2}. The nonzero coefficients are chosen as θ*_i = {t log(n/s)}^{1/2} with t = 1.2, 2.2, 4.2. The results, based on 100 repetitions, are displayed in Figure 3.
For both model selection and marginal credible interval coverage, the zero θ_i's are almost never selected and have almost 100% coverage. On the other hand, for the nonzero θ_i's, the probability of being selected and the marginal credible interval coverage rely heavily on the signal strength: larger strength leads to better performance. This observation is consistent with the theoretical results discovered by van der Pas et al. (2017): consistent model selection and interval coverage do not occur for medium-strength signals. Compared with the horseshoe prior, the t-prior performs better for model selection and has comparable credible interval coverage. In addition, we also study the credible interval coverage of the t-prior with α = 1.1 and τ = (1/n)^c, where c = (α + 0.05)/(α − 1). By Theorem 2.2, we believe this choice leads to suboptimal posterior contraction, and we would like to see whether sacrificing the convergence rate can improve the coverage performance of the credible intervals. The results are displayed by the green curves and show no improvement over the black curves (which correspond to the t-prior with τ = (s/n)^c).
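The marginal credible interval construction and its coverage among the zero means can be sketched as follows. This is again a rough illustration under assumed settings (a scaled Student-t prior with df = 0.1, an illustrative global scale τ = 0.01, and posterior moments computed by grid quadrature), not the exact simulation behind Figure 3:

```python
import numpy as np
from scipy.stats import norm, t as student_t

def posterior_moments(y, df=0.1, tau=0.01):
    """Posterior mean and variance of theta | y under y ~ N(theta, 1)
    and a scaled Student-t prior theta ~ tau * t_df, by quadrature."""
    theta = np.linspace(-20.0, 20.0, 20001)
    log_w = norm.logpdf(y - theta) + student_t.logpdf(theta / tau, df) - np.log(tau)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    m = np.sum(theta * w)
    return m, np.sum((theta - m) ** 2 * w)

rng = np.random.default_rng(1)
n, s = 200, 14                                   # sparsity s ~ round(n^{1/2})
theta_star = np.zeros(n)
theta_star[:s] = np.sqrt(4.2 * np.log(n / s))    # strong-signal case t = 4.2
y = theta_star + rng.standard_normal(n)

covered = []
for yi, th in zip(y, theta_star):
    m, v = posterior_moments(yi)
    half = 1.96 * np.sqrt(v)                     # E(theta|D_n) +/- 1.96 sd interval
    covered.append(m - half <= th <= m + half)
coverage_zero = float(np.mean(np.asarray(covered)[s:]))
```

Consistent with the text, the intervals for the zero θ_i's cover 0 almost all of the time; coverage for the nonzero θ_i's depends on the signal strength.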

FIG 2. Posterior mean squared L2 and L1 errors of the 4 selected prior specifications.

FIG 6. Comparison between the adaptive t-prior and the adaptive horseshoe prior.