Bayesian variance estimation in the Gaussian sequence model with partial information on the means

Consider the Gaussian sequence model under the additional assumption that a fixed fraction of the means is known. We study the problem of variance estimation from a frequentist Bayesian perspective. The maximum likelihood estimator (MLE) for $\sigma^2$ is biased and inconsistent. This raises the question of whether the posterior is able to correct the MLE in this case. By developing a new proof strategy that uses refined properties of the posterior distribution, we find that the marginal posterior is inconsistent for any i.i.d. prior on the mean parameters. In particular, no assumption on the decay of the prior needs to be imposed. Surprisingly, we also find that consistency can be retained for a hierarchical prior based on Gaussian mixtures. In this case we also establish a limiting shape result and determine the limit distribution. In contrast to the classical Bernstein-von Mises theorem, the limit is non-Gaussian. We show that the Bayesian analysis leads to new statistical estimators outperforming the correctly calibrated MLE in a numerical simulation study.


Introduction
For given $0 \leq \alpha \leq 1$, suppose we observe $n$ independent and normally distributed random variables
$$X_i \sim \mathcal{N}\big(\mu_i^0 \mathbb{1}(i > n\alpha),\, \sigma_0^2\big), \qquad i = 1, \ldots, n. \quad (1.1)$$
The parameters in the model are $\mu_i^0$, $i > n\alpha$, and $\sigma_0 > 0$. The goal is to estimate the variance $\sigma_0^2$ while treating the mean vector $\mu^0 := (\mu^0_{\lceil n\alpha \rceil}, \ldots, \mu^0_n)$ as a nuisance parameter. For $\alpha = 0$, we recover the Gaussian sequence model. For $\alpha > 0$, this can be viewed as the Gaussian sequence model with the additional knowledge that the means of the first $\lfloor n\alpha \rfloor$ observations are known (in which case we can subtract them from the data).
One can think of model (1.1) as a simple prototype of a combined dataset. Using for instance different measurement devices, one often faces merged datasets collected from multiple sources. The different sources might not be of the same quality concerning the underlying parameter; see [24] for an example. An alternative viewpoint is to interpret model (1.1) as a sparse sequence model with known support. Since a $(1-\alpha)$-fraction of the data is perturbed, we are in the dense regime. Knowledge of the support is then crucial, as otherwise there is no consistent estimator for $\sigma_0^2$. If $n$ is even and $\alpha = 1/2$, then (1.1) is equivalent to the Neyman-Scott model [25] up to a reparametrization. Model (1.1) is in this case equivalent to observing $U_i := X_{n/2+i} + X_i$ and $V_i := X_{n/2+i} - X_i$ for $i = 1, \ldots, n/2$. Since $U_i$ and $V_i$ are independent, this is equivalent to observing independent random variables $U_i, V_i \sim \mathcal{N}(\mu^0_{n/2+i}, \widetilde{\sigma}_0^2)$ with $\widetilde{\sigma}_0^2 = 2\sigma_0^2$. Estimation of $\widetilde{\sigma}_0^2$ in the latter model is known as the Neyman-Scott problem.
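This reparametrization can be checked numerically. The following sketch (our own illustration; the sample size, seed, and mean value are arbitrary choices) simulates model (1.1) with $\alpha = 1/2$ and verifies that the transformed pairs are essentially uncorrelated with variance close to $2\sigma_0^2$:

```python
import random

# Simulate model (1.1) with n even and alpha = 1/2: the first n/2 observations
# have mean zero, the second half has non-zero means (here all equal to 1.5).
random.seed(0)
m = 50_000                       # m = n/2 pairs
sigma0 = 1.0
mu = 1.5
X_first = [random.gauss(0.0, sigma0) for _ in range(m)]
X_second = [random.gauss(mu, sigma0) for _ in range(m)]

# Neyman-Scott reparametrization: U_i = X_{n/2+i} + X_i, V_i = X_{n/2+i} - X_i.
U = [a + b for a, b in zip(X_second, X_first)]
V = [a - b for a, b in zip(X_second, X_first)]

def sample_cov(a, b):
    """Empirical covariance of two equal-length samples."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

# Both variances should be near 2 * sigma0^2 = 2, the covariance near 0.
var_U, var_V, cov_UV = sample_cov(U, U), sample_cov(V, V), sample_cov(U, V)
```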
Although $\sigma_0^2$ can be estimated at the parametric rate based on the first $n\alpha$ observations, a striking feature of the model is that the MLE for $\sigma_0^2$ is inconsistent. In fact, the MLE $\widehat{\sigma}^2_{\mathrm{mle}}$ converges to $\alpha\sigma_0^2$, thereby underestimating the true variance by the factor $\alpha$. The reason is that the likelihood of the observations with non-zero mean significantly affects the total likelihood viewed as a function of $\sigma^2$.
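A quick Monte Carlo sketch (our own illustration, not part of the original analysis) makes the inconsistency visible: profiling out the nuisance means ($\widehat{\mu} = Z$) leaves $\widehat{\sigma}^2_{\mathrm{mle}} = \|Y\|^2/n$, which concentrates near $\alpha\sigma_0^2$ rather than $\sigma_0^2$.

```python
import math
import random

def mle_variance(n, alpha, sigma2, seed=0):
    """Full-sample MLE for sigma^2 in model (1.1): after profiling out the
    nuisance means (mu_hat = Z), the MLE reduces to ||Y||^2 / n, where Y
    collects the n1 = floor(n * alpha) observations with known zero mean."""
    rng = random.Random(seed)
    n1 = int(math.floor(n * alpha))
    sigma = math.sqrt(sigma2)
    Y = [rng.gauss(0.0, sigma) for _ in range(n1)]
    # The Z-part contributes no term to the profiled likelihood maximiser.
    return sum(y * y for y in Y) / n

# With sigma0^2 = 1 and alpha = 0.5, the MLE concentrates near 0.5, not 1.
est = mle_variance(n=100_000, alpha=0.5, sigma2=1.0)
```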
We study what happens when a Bayesian approach is implemented for the estimation of the variance and whether the posterior distribution can correct for the bias of the MLE. The Bayesian method can be viewed as a weighted likelihood method: instead of taking the parameter with the largest likelihood, the posterior puts mass on parameter sets with large likelihood. Because of this, the posterior can in some cases correct the flaws of the MLE. Examples are irregular models; see [15,11,26].
In the first part of the paper, we prove that whenever the nuisances are independently generated from a proper distribution, the posterior does not contract around the true variance. This shows that, for a large class of natural priors, the Bayesian method is unable to correct the MLE. In frequentist Bayes, several lower bound techniques have been developed in order to describe when Bayesian methods do not work, [4,8,9,29,10,19]. These results can be used for instance to show that a certain decay of the prior is necessary to ensure posterior contraction. Our lower bounds are of a different flavor and do not require a condition on the tail behavior.
Since no additional structure is assumed for the non-zero means, there is no way to infer one mean from knowledge of all the other means. Therefore, one might be tempted to think that a correlated prior on the means cannot perform better than an i.i.d. prior and consequently must lead to an inconsistent posterior as well. Surprisingly, this is not true, and in the second part of the article we construct a Gaussian mixture prior for which the posterior contracts at the parametric rate around the true variance. For this prior we derive the limit distribution in the Bernstein-von Mises sense. In contrast with the classical Bernstein-von Mises theorem, the posterior limit is non-Gaussian in the case of small means. In this case the posterior also incorporates information from the second part of the sample into the estimator, and we show in a simulation study that the maximum a posteriori estimate based on the limit distribution outperforms the $\sqrt{n}$-consistent estimator that only uses the observations with zero mean.
Estimation of the variance in model (1.1) can also be interpreted as a semiparametric problem. The results in this article therefore contribute to the recent efforts to understand frequentist Bayes in semiparametric models. Semiparametric Bernstein-von Mises theorems are derived under various conditions in [27,5,3,7]. For specific priors, it has been observed that there can be a large bias in the posterior limit; see [6,7,26]. In all the cases studied so far, it is unclear whether the bias is due to the specific choice of prior or whether it reflects a fundamental limitation of the Bayesian method. To the best of our knowledge, our results show for the first time that the posterior can be inconsistent for all natural priors. Related to model (1.1), [14] studies Bayesian variance estimation for the errors in the nonparametric regression model. It is shown there that if the posterior contracts around the true regression function at a rate $o(n^{-1/4})$, then the marginal posterior for the variance contracts at the parametric rate around the true error variance and a Bernstein-von Mises result holds.
The article is organized as follows. In Section 2, we discuss aspects of the problem related to the likelihood and the posterior distribution. A crucial identity for the log-posterior is derived in Section 3. This then leads to the general negative result in Section 4. The Gaussian mixture prior with parametric posterior contraction is constructed in Section 5. This section also contains the limiting shape result and a numerical simulation study. All proofs are deferred to the appendix.
Notation: For a vector $u = (u_1, \ldots, u_k)$, we write $\|u\|^2 = \sum_{i=1}^k u_i^2$ and $\overline{u^2} = \|u\|^2/k$ for the average of the squares (not to be confused with the squared average). We write $n_1 = \lfloor n\alpha \rfloor$ and $n_2 = n - n_1$. The probability and expectation induced by model (1.1) are denoted by $P_0^n$ and $E_0^n$.

Likelihood and posterior
The MLE. For the subsequent analysis, it is convenient to split the data vector $X = (X_1, \ldots, X_n)$ into the part with zero means $Y = (X_1, \ldots, X_{n_1})$ and the observations with non-zero means $Z = (X_{n_1+1}, \ldots, X_n)$, such that $X = (Y, Z)$. The likelihood function of the model is
$$L(\sigma^2, \mu \,|\, Y, Z) = (2\pi\sigma^2)^{-n/2} \exp\Big(-\frac{\|Y\|^2 + \|Z - \mu\|^2}{2\sigma^2}\Big). \quad (2.1)$$
If only based on the subsample $Y$, the MLE for $\sigma_0^2$ would be $\|Y\|^2/n_1$, and this converges to $\sigma_0^2$ at the parametric rate $n^{-1/2}$. The full-sample MLE is instead $\|Y\|^2/n$, which converges to $\alpha\sigma_0^2$ and thus underestimates the truth by the factor $\alpha$. It is clear that there is very little extractable information about the parameter $\sigma_0^2$ in $Z$. A frequentist estimator can simply discard $Z$ and only use the subsample $Y$. The MLE also does this but leads to an incorrect scaling of the estimator.
The incorrect scaling factor of the MLE can be explained in different ways. One interpretation is that the MLE can be written as
$$\widehat{\sigma}^2_{\mathrm{mle}} = \frac{n_1}{n}\,\widehat{\sigma}^2_{Y,\mathrm{mle}} + \frac{n_2}{n}\,\widehat{\sigma}^2_{Z,\mathrm{mle}},$$
with $\widehat{\sigma}^2_{Y,\mathrm{mle}} = \|Y\|^2/n_1$ the MLE based on the subsample $Y$ and $\widehat{\sigma}^2_{Z,\mathrm{mle}} = 0$ the MLE based on the subsample $Z$. The fact that the overall MLE just forms a linear combination of the MLEs for the subsamples shows again that too much weight is given to $Z$.
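This decomposition is a simple algebraic identity and can be checked directly. The sketch below (our own, with hypothetical variable names and arbitrary data) compares the full-sample MLE with the weighted combination of the subsample MLEs:

```python
def mle_decomposition(Y, Z):
    """Check that the full-sample MLE for sigma^2 equals the weighted average
    (n1/n) * sigma2_Y_mle + (n2/n) * sigma2_Z_mle, where the Z-part is zero
    because mu_hat = Z fits the second subsample exactly."""
    n1, n2 = len(Y), len(Z)
    n = n1 + n2
    sigma2_Y = sum(y * y for y in Y) / n1   # MLE from the zero-mean subsample
    sigma2_Z = 0.0                          # MLE from Z alone
    full = sum(y * y for y in Y) / n        # full-sample MLE
    combined = (n1 / n) * sigma2_Y + (n2 / n) * sigma2_Z
    return full, combined

full, combined = mle_decomposition([1.0, -2.0, 0.5], [3.0, 4.0, -1.0])
```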
Another explanation for the incorrect scaling of the MLE is to observe that in (2.1) the likelihood based on the second subsample is $L(\sigma^2, \mu \,|\, Z) \propto \sigma^{-n_2}$ if $\mu = \widehat{\mu}_{\mathrm{mle}}$. If we took the likelihood only over the first part of the sample $Y$, we would obtain the optimal estimator $\|Y\|^2/n_1$, but since the likelihood over the full sample is the product of the likelihood functions for $Y$ and $Z$, an additional factor $\sigma^{-n_2}$ occurs in the overall likelihood, which leads to the incorrect scaling. More generally, we conjecture that likelihood methods do not perform well for combined datasets where one part of the data is informative about a parameter and the other part is affected by nuisance parameters.
Adjusted profile likelihood. For the profile likelihood, we first compute the maximum likelihood estimator of the nuisance parameter for fixed $\sigma^2$, denoted by, say, $\widehat{\mu}_{\sigma^2}$, and then maximize $\sigma^2 \mapsto L(\sigma^2, \widehat{\mu}_{\sigma^2} \,|\, Y, Z)$. Obviously $\widehat{\mu}_{\sigma^2} = Z$ for any $\sigma^2 > 0$, and the profile likelihood estimator coincides with the MLE for $\sigma^2$ in the Neyman-Scott problem. If the parameter of interest and the nuisance parameters are orthogonal with respect to the Fisher information, that is,
$$E\Big[-\frac{\partial^2}{\partial\sigma^2 \partial\mu_j} \log L(\sigma^2, \mu \,|\, Y, Z)\Big] = 0, \quad (2.3)$$
the adjusted profile likelihood estimator [12,23,13] is the maximizer of
$$\sigma^2 \mapsto L(\sigma^2, \widehat{\mu}_{\sigma^2} \,|\, Y, Z)\, \det\big(I(\sigma^2)\big)^{-1/2} \quad (2.4)$$
for the matrix-valued function $I(\sigma^2) := \big(-\partial^2/(\partial\mu_j \partial\mu_\ell) \log L(\sigma^2, \widehat{\mu}_{\sigma^2} \,|\, Y, Z)\big)_{j,\ell}$. Since $-\partial^2/(\partial\mu_j \partial\mu_\ell) \log L(\sigma^2, \mu \,|\, Y, Z) = \sigma^{-2}\mathbb{1}(j = \ell)$, the adjusted profile likelihood estimator for $\sigma^2$ coincides with the MLE for the subsample $Y$, namely $\|Y\|^2/n_1$. In particular, the adjusted profile likelihood results in an unbiased $\sqrt{n}$-consistent estimator for $\sigma^2$.
The posterior distribution. From a Bayesian perspective it is quite natural to draw $\sigma^2$ and the mean vector $\mu$ from independent distributions. Due to the orthogonality with respect to the Fisher information (2.3), we expect no strong interactions between $\sigma^2$ and the mean parameters in the likelihood that would have to be taken care of by a dependent prior. Suppose that $\mu \sim \nu$ and that the prior for $\sigma^2$ has Lebesgue density $\pi$. The marginal posterior distribution is then given by Bayes' formula,
$$\pi(\sigma^2 \,|\, Y, Z) \propto \pi(\sigma^2)\, L(\sigma^2 \,|\, Y, Z), \quad (2.5)$$
with the marginal likelihood
$$L(\sigma^2 \,|\, Y, Z) := \int L(\sigma^2, \mu \,|\, Y, Z)\, d\nu(\mu). \quad (2.6)$$
In [28] it has been argued that, by a multivariate Laplace approximation,
$$\pi(\sigma^2 \,|\, Y, Z) \approx \pi(\sigma^2)\, \widetilde{L}(\sigma^2), \quad (2.7)$$
with $\widetilde{L}(\sigma^2)$ the adjusted profile likelihood in (2.4). This suggests that the posterior distribution should be centered around the adjusted profile likelihood estimator $\|Y\|^2/n_1$, therefore correcting the MLE.

Associated sequence model with random means. For the Gaussian sequence model with partial information (1.1) equipped with the product prior $\pi \otimes \nu$, define the associated sequence model with random means, where we observe independent random variables
$$Y_i \sim \mathcal{N}(0, \sigma_0^2), \quad i = 1, \ldots, n_1, \qquad \text{and} \qquad Z_i \,|\, \mu \sim \mathcal{N}(\mu_i, \sigma_0^2), \quad i = n_1+1, \ldots, n, \quad (2.8)$$
with $\mu \sim \nu$ and $\nu$ known. In this model, the nuisance parameters are replaced by additional randomness. The only parameter in this model is $\sigma_0^2$ and the model is therefore parametric.

Bayes with improper uniform prior. If the prior on the mean vector in the Bayes formula is chosen as the Lebesgue measure, the formula for the posterior simplifies to
$$\pi(\sigma^2 \,|\, Y, Z) \propto \pi(\sigma^2)\, \sigma^{-n_1} \exp\Big(-\frac{\|Y\|^2}{2\sigma^2}\Big).$$
This is the same posterior we would obtain if we discarded the subsample $Z$. It follows from the parametric Bernstein-von Mises theorem that if $\pi$ is positive and continuous in a neighbourhood of $\sigma_0^2$, the posterior contracts around the true variance $\sigma_0^2$. Notice that in the case of the uniform prior, the Laplace approximation in (2.7) is exact and does not involve any remainder terms. Obviously the Lebesgue measure is not a probability measure and the prior is improper.
This then raises the question of whether there are also proper priors for which the marginal posterior is consistent on the whole parameter space. We will address this problem in the next sections.

On the derivative of the log-posterior
We first derive a differential equation for the posterior. Denote by $\mu \,|\, (Z, \sigma^2)$ the posterior distribution of $\mu$ given the sample $Z$, that is,
$$\Pi(d\mu \,|\, Z, \sigma^2) \propto \exp\Big(-\frac{\|Z - \mu\|^2}{2\sigma^2}\Big)\, \nu(d\mu). \quad (3.1)$$
In particular, we set
$$V\big(\mu \,|\, (Z, \sigma^2)\big) := \int \|Z - \mu\|^2\, \Pi(d\mu \,|\, Z, \sigma^2). \quad (3.2)$$
The quantity $V(\mu \,|\, (Z, \sigma^2))$ measures the spread of $\Pi(\mu \,|\, Z, \sigma^2)$ around the vector $Z$. Recall moreover the definition of $L(\sigma^2 \,|\, Y, Z)$ in (2.6).
Proposition 3.1. The marginal posterior satisfies
$$\partial_{\sigma^2} \log \pi(\sigma^2 \,|\, Y, Z) = \frac{\pi'(\sigma^2)}{\pi(\sigma^2)} - \frac{n}{2\sigma^2} + \frac{\|Y\|^2}{2\sigma^4} + \frac{V\big(\mu \,|\, (Z, \sigma^2)\big)}{2\sigma^4}. \quad (3.3)$$
By Remark 2.1, the right-hand side is, up to the prior term, a closed-form expression for the score function for $\sigma^2$ in the random means model (2.8). If the MLE in (2.8) does not lie on the boundary, the score function vanishes at the MLE. By the Bernstein-von Mises phenomenon, it is conceivable that the posterior will concentrate around this MLE. For the MLE to be close to the truth $\sigma_0^2$, the score function evaluated at $\sigma_0^2$ must be small. Since $\|Y\|^2 = n\alpha\sigma_0^2 + O_P(\sqrt{n})$, this leads to the condition
$$V\big(\mu \,|\, (Z, \sigma_0^2)\big) = n_2\sigma_0^2 + O_P(\sqrt{n}).$$
In the next section, we derive a very general negative result. The main part of the argument is to show that the previous equality does not hold in a neighborhood of $\sigma_0^2$, see (A.12).

Posterior inconsistency for product priors
In this section we study posterior contraction under the following condition.
Prior. The prior on µ is independent of the prior on σ 2 . Under the prior, each component of the mean vector µ is drawn independently from a distribution ν on R. The prior on σ 2 has a positive and continuously differentiable Lebesgue density on R + .
So far, ν denoted the prior on the mean vector. By a slight abuse of language we denote the prior on the individual components also by ν. The assumptions on the prior are mild enough to account for proper priors with heavy tails and possibly no moments.
The i.i.d. prior is the natural choice if we believe that there is no structure in the non-zero means. From (2.8) it follows that the corresponding sequence model with random means is
$$Y_i \sim \mathcal{N}(0, \sigma_0^2), \quad i = 1, \ldots, n_1, \qquad \text{and} \qquad Z_i \,|\, \mu_i \sim \mathcal{N}(\mu_i, \sigma_0^2), \quad i = n_1+1, \ldots, n, \quad (4.1)$$
with $\mu_i \overset{\mathrm{i.i.d.}}{\sim} \nu$. For $\alpha = 1/2$ and unknown $\nu$, this model has been studied in [21]. It is shown there that the MLE for $\sigma_0^2$ and the MLE for the distribution function of the means are consistent. Since the random means model leads to the same posterior distribution, as explained in Remark 2.1, this suggests that the posterior might concentrate around the truth.
We now provide a second heuristic that leads to a different conclusion, indicating that it makes a huge difference whether the distribution of the means $\nu$ is known or unknown. In the framework of (4.1), $\nu$ is known. If $\int u^2\, d\nu(u) < \infty$, then $\overline{\mu^2} = \int u^2\, d\nu(u) + O_P(n^{-1/2})$ and $\overline{Z^2} = \overline{\mu^2} + \sigma_0^2 + O_P(n^{-1/2})$, so we have $\overline{Z^2} - \int u^2\, d\nu(u) = \sigma_0^2 + O_P(n^{-1/2})$. This means that model (4.1) carries a lot of information about $\sigma_0^2$, in the sense that $\sigma_0^2$ can be estimated at the parametric rate from the subsample $Z$ only. Since the posterior only sees model (4.1), it is therefore natural for it to give a lot of weight to the subsample $Z$ as well, which, from a frequentist perspective, is wrong.
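This heuristic is easy to verify numerically. The sketch below is our own illustration and assumes, purely for concreteness, the mixing distribution $\nu = \mathcal{N}(0, \tau^2)$, so that $\int u^2\, d\nu(u) = \tau^2$ and each $Z_i$ is marginally $\mathcal{N}(0, \sigma_0^2 + \tau^2)$:

```python
import random

def sigma2_from_Z(n2, sigma2, tau2, seed=1):
    """Heuristic of Section 4: if the means are i.i.d. from a *known* nu
    (here assumed N(0, tau2)), then mean(Z_i^2) - E_nu[u^2] estimates
    sigma0^2 at the parametric rate, using the noisy subsample Z only."""
    rng = random.Random(seed)
    # Z_i = mu_i + eps_i is marginally N(0, sigma2 + tau2) under this nu.
    sd = (sigma2 + tau2) ** 0.5
    Z = [rng.gauss(0.0, sd) for _ in range(n2)]
    second_moment_nu = tau2          # known, since nu = N(0, tau2) is known
    return sum(z * z for z in Z) / n2 - second_moment_nu

# Should be close to sigma0^2 = 1 even though every Z_i has a nuisance mean.
est = sigma2_from_Z(n2=200_000, sigma2=1.0, tau2=4.0)
```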
This heuristic does not say anything about heavy-tailed priors with $\int u^2\, d\nu(u) = \infty$. But even in this case, we will show that the posterior is inconsistent. The first result states that in a neighborhood of $\sigma_0^2$ the posterior is increasing extremely fast with high probability.

Proposition 4.1. Given $\alpha < 1$ and the prior above, then, for all sufficiently large $\sigma_0^2$, there exists a mean vector $\mu_0$ such that, with probability tending to one, the posterior density is increasing on the interval $[\sigma_0^2, 2\sigma_0^2]$.

The proof of Proposition 4.1 constructs a lower bound on $\sigma_0^2$ that is independent of $n$ and moreover guarantees that $\nu$ has sufficiently small mass outside $[-\sigma_0^2, \sigma_0^2]$. It therefore depends on the tail behavior of the prior mean distribution $\nu$. The mean vector $\mu_0$ is subsequently chosen with all means being equal to an expression only depending on $\sigma_0^2$. Thus the means in $\mu_0$ are uniformly bounded and independent of $n$ as well.

Suppose that almost all posterior mass is close to $\sigma_0^2$. By the previous proposition, the posterior is increasing at least up to $2\sigma_0^2$. Hence, there must be even more mass around $2\sigma_0^2$. This is a contradiction and shows that the posterior does not concentrate around $\sigma_0^2$. The proof of the next theorem is based on this argument. For this result, the means in the vector $\mu_0$ can again be chosen to be uniformly bounded.
Theorem 4.2. Given $\alpha < 1$ and the prior above, then, for all sufficiently large $\sigma_0^2$, there exists a mean vector $\mu_0$ such that the posterior mass of any sufficiently small neighbourhood of $\sigma_0^2$ vanishes asymptotically. Consequently, the posterior is inconsistent and assigns all its mass outside of a neighbourhood of the true variance.
The posterior is therefore inferior when compared to the frequentist variance estimator $\overline{Y^2}$, which achieves the parametric rate $n^{-1/2}$ uniformly over the nuisance parameters. It is remarkable that no conditions on the tail behavior of the prior distribution $\nu$ are required for Theorem 4.2. Recall that for the improper uniform prior the posterior contracts around $\sigma_0^2$. This shows that for distributions with heavy-tailed densities, very sharp bounds are required.
To the best of our knowledge, there are no negative results in the nonparametric Bayes literature that hold for such a large class of priors. The proof strategy for establishing Proposition 4.1 is based on a highly non-standard shrinkage argument that we sketch here. By expanding the square term in (3.2), we can lower bound the right-hand side of (3.3) in terms of quantities $V_i$ attached to the individual observations $Z_i$. For an improper uniform prior, one can check that $V_i \geq Z_i^2$, making the lower bound negative and useless. For a proper prior, however, there is a shrinkage phenomenon, in the sense that for any $c > 0$ there are parameters $(\mu_0, \sigma_0^2)$ for which the shrinkage is strong enough that the lower bound becomes positive, which yields the conclusion by choosing $c > 0$ small enough.

In Proposition 4.1 we showed that the posterior overshoots the true variance $\sigma_0^2$ whenever the true means are large enough. By analyzing the Gaussian case in the next section, we will see that for small means the posterior will in fact underestimate $\sigma_0^2$, and that only for a small range of mean vectors can one hope that the posterior concentrates around the true variance.

Gaussian priors
To illustrate our approach, we first consider an i.i.d. Gaussian prior on the mean vector, $\mu_i \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \theta^2)$. From Theorem 4.2 we already know that the posterior will be inconsistent in this case. Nevertheless, the Gaussian assumption yields more explicit formulas, and this allows us to build a hierarchical prior resulting in good posterior contraction properties. By Remark 2.1, the marginal likelihood is the same as in the sequence model with random means (4.1). The marginal posterior is therefore
$$\pi(\sigma^2 \,|\, Y, Z) \propto \pi(\sigma^2)\, \sigma^{-n_1} \exp\Big(-\frac{\|Y\|^2}{2\sigma^2}\Big) (\sigma^2 + \theta^2)^{-n_2/2} \exp\Big(-\frac{\|Z\|^2}{2(\sigma^2 + \theta^2)}\Big), \quad (5.1)$$
which can also be written as the product of two inverse Gamma densities. In view of the Bernstein-von Mises phenomenon, the posterior concentrates around the MLE for parametric problems. Similarly, we can argue here that the posterior will be concentrated around the value $\widehat{\sigma}^2$ maximizing the likelihood part of (5.1). By differentiation, we find
$$n_1 \widehat{\sigma}^2 + \frac{n_2 \widehat{\sigma}^4}{\widehat{\sigma}^2 + \theta^2} = \|Y\|^2 + \frac{\widehat{\sigma}^4 \|Z\|^2}{(\theta^2 + \widehat{\sigma}^2)^2},$$
and rearranging yields an expansion (5.2) of $\widehat{\sigma}^2 - \sigma_0^2$ in terms of $\overline{\mu_0^2} - \theta^2$, where we set $\overline{\mu_0^2} = \|\mu_0\|^2/n_2$ and suppress the dependence of the $O(\cdot)$ term on $\sigma_0^2$ and $\mu_0$. Since $\theta$ is fixed, this shows that for $\widehat{\sigma}^2 = \sigma_0^2 + O_P(n^{-1/2})$, we need $\theta^2 = \overline{\mu_0^2} + O_P(n^{-1/2})$. Differently speaking, to force the maximizer $\widehat{\sigma}^2$ to be close to $\sigma_0^2$, the variance $\theta^2$ of the prior has to match the empirical variance $\overline{\mu_0^2}$ of the nuisance parameters. We can also deduce from (5.2) that if $|\overline{\mu_0^2} - \theta^2| \gg n^{-1/2}$ and $\theta$ is fixed, then also $|\widehat{\sigma}^2 - \sigma_0^2| \gg n^{-1/2}$; more precisely, the deviation $\widehat{\sigma}^2 - \sigma_0^2$ has the same sign and order as $\overline{\mu_0^2} - \theta^2$, see (5.3). This shows that, depending on the size of $\overline{\mu_0^2}$ compared to $\theta^2$, the posterior can either overestimate or underestimate the true variance.
If $\theta$ is allowed to vary with $n$, we can make the right-hand side in (5.2) arbitrarily small by letting $\theta$ tend to infinity. As $\theta^2$ is the variance of the prior, the behavior then resembles that of the improper uniform prior, which, as we already know, leads to posterior consistency. If we think of a prior as a prior belief on the parameters, then the prior should not change depending on the amount of available data and, in particular, it is unnatural that the prior becomes more vague as the sample size increases. In the next section we show that there are sample-size-independent mixture priors leading to parametric posterior contraction rates.

Mixture priors
Section 4 explains the posterior inconsistency for an i.i.d. prior on the nuisance parameters. It seems intuitive that introducing dependence in the prior on the nuisance parameters cannot help to avoid posterior inconsistency for $\sigma_0^2$. Surprisingly, this is not true. In this section, we first provide some intuition for why mixture priors can resolve the issues of i.i.d. priors. Afterwards, we discuss and analyze a specific prior construction.
Analyzing the Gaussian priors above, (5.3) suggests that for any nuisance parameter vector $\mu_0$ there exists an i.i.d. prior which seems to work. This i.i.d. prior does, however, depend on the unknown $\mu_0$ and can therefore not be chosen without knowledge of the data. Intuitively, if the posterior had the chance to see all possible i.i.d. priors on $\mu$, instead of just one, it is conceivable that it would automatically select one that is adapted to the unknown nuisance parameter and consequently leads to posterior consistency for the parameter of interest. De Finetti's theorem [18] states that an exchangeable prior $\nu$ on the infinite sequence $\mu = (\mu_1, \mu_2, \ldots)$ can be written as a mixture over i.i.d. priors, in the sense that
$$\nu = \int q^{\otimes \infty}\, d\lambda(q),$$
with $\lambda$ a probability measure on the set $\mathcal{P}(\mathbb{R})$ of probability densities on $\mathbb{R}$. Assuming that the order of integration can be interchanged, the posterior (2.5) then becomes
$$\pi(\sigma^2 \,|\, Y, Z) \propto \pi(\sigma^2) \int_{\mathcal{P}(\mathbb{R})} \int L(\sigma^2, \mu \,|\, Y, Z) \prod_i q(\mu_i)\, d\mu\, d\lambda(q),$$
where $q$ denotes the probability density function of the mixing distribution $Q$. Let $q_0$ be the i.i.d. prior maximizing the interior integral. Suppose that this is a unique maximum and that the outer integral is determined by the behavior of the integrand in a suitable neighborhood $S$ of $q_0$. This means that
$$\pi(\sigma^2 \,|\, Y, Z) \approx \pi(\sigma^2) \int L(\sigma^2, \mu \,|\, Y, Z) \prod_i q_0(\mu_i)\, d\mu.$$
The right-hand side is the posterior density of $\sigma^2$ for the i.i.d. prior $\prod_i q_0(\mu_i)$ on the components.
Although this argument is only a sketch, it suggests that something might be gained by mixing over i.i.d. priors instead of just choosing one. Maximizing the marginalized likelihood in (5.1) over $\theta^2$ yields
$$\widehat{\theta}^2 = \Big(\frac{\|Z\|^2}{n_2} - \sigma^2\Big) \vee 0. \quad (5.4)$$
Plugging this value into (5.1) makes the factor involving $Z$ independent of $\sigma^2$ whenever $\overline{Z^2} > \sigma^2$. The posterior then coincides with the posterior density based on the first part of the sample only, which we know has good posterior contraction properties.
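Assuming, as the derivation above suggests, that the maximizer over $\theta^2 \geq 0$ is $\widehat{\theta}^2 = (\|Z\|^2/n_2 - \sigma^2) \vee 0$, this can be verified against a brute-force grid search (a sketch with arbitrary data; the variable names are ours):

```python
import math

def z_loglik(v, Z):
    """Gaussian-prior marginal log-likelihood of Z as a function of
    v = sigma^2 + theta^2: each Z_i is marginally N(0, v)."""
    n2 = len(Z)
    return -0.5 * n2 * math.log(v) - sum(z * z for z in Z) / (2.0 * v)

Z = [1.0, -2.0, 0.5, 3.0]
sigma2 = 0.5
Zbar2 = sum(z * z for z in Z) / len(Z)          # = 3.5625 for this data
theta2_hat = max(Zbar2 - sigma2, 0.0)            # claimed maximiser over theta^2 >= 0

# Brute-force comparison: the closed form should beat every grid candidate.
best_grid = max(z_loglik(sigma2 + 0.01 * k, Z) for k in range(1, 1001))
```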
Another heuristic about the posterior properties for this prior can again be derived by making the link to the associated sequence model with random means (2.8). For the prior considered here, the random means model has the form
$$Y_i \sim \mathcal{N}(0, \sigma_0^2), \quad i = 1, \ldots, n_1, \qquad \text{and} \qquad Z_i \,|\, \theta^2 \sim \mathcal{N}(0, \sigma_0^2 + \theta^2), \quad i = n_1+1, \ldots, n, \quad (5.5)$$
with $\theta^2 \sim \gamma$. If $\theta^2$ were a second parameter and not generated from $\gamma$, the variance $\sigma_0^2$ would not be identifiable if only the $Z_i$'s were observed. In model (5.5) we know the density $\gamma$, but this is not enough to consistently reconstruct $\sigma_0^2$ from the subsample $Z$. By Remark 2.1, this model leads to the same posterior for $\sigma^2$. The posterior should therefore realize that there is little extractable information about $\sigma_0^2$ in $Z$ and discard these observations. We will see in the limiting shape result below that this is roughly what happens.
The log-likelihoods appearing in (5.6) can be written in terms of inverse-gamma densities. We denote by $\mathrm{IG}(\gamma, \beta)$ the inverse-gamma distribution with shape $\gamma > 0$ and scale $\beta > 0$. The corresponding p.d.f. is
$$f_{\mathrm{IG}(\gamma, \beta)}(x) = \frac{\beta^\gamma}{\Gamma(\gamma)}\, x^{-\gamma - 1}\, e^{-\beta/x}, \qquad x > 0, \quad (5.7)$$
where $\Gamma(\cdot)$ is the Gamma function. Rewriting the posterior, we have that
$$\pi(\sigma^2 \,|\, Y, Z) \propto \pi(\sigma^2)\, f_{\mathrm{IG}(\gamma_1, \beta_1)}(\sigma^2) \int_0^\infty f_{\mathrm{IG}(\gamma_2, \beta_2)}(\sigma^2 + \theta^2)\, \gamma(\theta^2)\, d\theta^2, \quad (5.8)$$
for suitable shape and scale parameters $(\gamma_1, \beta_1)$ and $(\gamma_2, \beta_2)$ depending on the data. Starting from Lemma 5.1, we can develop a heuristic argument for how to recover the shape of the limit posterior distribution. We interpret the posterior $\Pi(\cdot \,|\, Y, Z)$ with density (5.8) as the marginalization, over $\theta^2 \in (0, \infty)$, of the distribution whose density is given by
$$\pi(\sigma^2, \theta^2 \,|\, Y, Z) \propto f_{\mathrm{IG}(\gamma_1, \beta_1)}(\sigma^2)\, f_{\mathrm{IG}(\gamma_2, \beta_2)}(\sigma^2 + \theta^2)\, \gamma(\theta^2)\, \pi(\sigma^2), \quad (5.9)$$
and refer to this distribution as the joint posterior on $(\sigma^2, \theta^2) \in (0, \infty)^2$. The first step is a double localization. Thanks to the exponential tails of the inverse-gamma distribution, the joint posterior asymptotically concentrates on a set $\{\sigma^2 \in B_1\} \cap \{\theta^2 \in B_2\}$ of shrinking neighbourhoods of width of order $\zeta_n \asymp \sqrt{\log n/n}$. This also implies that the joint posterior (5.9) is arbitrarily close, in total variation distance, to the truncated posterior distribution with density proportional to $\pi(\sigma^2, \theta^2 \,|\, Y, Z)\, \mathbb{1}(\{\sigma^2 \in B_1\} \cap \{\theta^2 \in B_2\})$. In particular, this means that the hyperparameter $\theta^2$ concentrates on a neighborhood of the maximal value derived in (5.4).
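For concreteness, the inverse-gamma density with shape $\gamma$ and scale $\beta$ can be implemented in a few lines using only the standard library (a sketch; the function name is ours):

```python
from math import exp, gamma

def ig_pdf(x, shape, scale):
    """Density of IG(shape, scale):
    scale^shape / Gamma(shape) * x^(-shape - 1) * exp(-scale / x) for x > 0."""
    if x <= 0:
        return 0.0
    return scale ** shape / gamma(shape) * x ** (-shape - 1.0) * exp(-scale / x)

# Sanity check: at shape = scale = 1, the density at x = 1 is e^{-1}.
val = ig_pdf(1.0, 1.0, 1.0)
```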
Arguing as in the classical proof of the Bernstein-von Mises theorem, we can then show that the truncated posterior distribution asymptotically does not depend on the prior, and prove that the posterior given by (5.8) behaves asymptotically like the corresponding truncated likelihood expression (5.11). Using essentially a Laplace approximation, we show that the log-likelihoods $\ell(\sigma^2 \,|\, Y)$ and $\ell(\sigma^2 + \theta^2 \,|\, Z)$ in (5.6) can always be uniformly approximated by a second-order Taylor expansion around their respective maxima $\overline{Y^2}$ and $\overline{Z^2} - \sigma^2$. Thus the localized posterior converges in total variation distance to a distribution whose density factorizes into a truncated Gaussian density with mode $\overline{Y^2}$ and variance $2\sigma_0^4/n_1 = O(n^{-1})$ and the integral of a truncated Gaussian density with mode $\overline{Z^2} - \sigma^2$ and variance $2(\sigma_0^2 + \overline{\mu_0^2})^2/n_2 = O(n^{-1})$. By undoing the localization argument, we can show that the restriction to the sets $B_1$ and $B_2$ can be removed from (5.11), and the posterior given by (5.8) converges in total variation distance to the posterior limit distribution
$$\pi_\infty(\sigma^2 \,|\, Y, Z) \propto \mathbb{1}(\sigma^2 > 0)\, \exp\Big(-\frac{n_1 (\sigma^2 - \overline{Y^2})^2}{4\sigma_0^4}\Big)\, \Phi\bigg(\frac{\overline{Z^2} - \sigma^2}{\sqrt{2(\sigma_0^2 + \overline{\mu_0^2})^2/n_2}}\bigg), \quad (5.12)$$
with $\Phi$ the c.d.f. of the standard normal distribution. Recall that $\overline{Z^2} \approx \sigma_0^2 + \overline{\mu_0^2}$. This suggests that the term involving $\Phi$ in the posterior limit distribution should asymptotically disappear if $\overline{\mu_0^2} \gg n^{-1/2}$. The limit of the posterior should then be the truncated Gaussian with mode $\overline{Y^2}$ and variance $2\sigma_0^4/n_1 = O(n^{-1})$, that is,
$$\pi_\infty(\sigma^2 \,|\, Y) \propto \mathbb{1}(\sigma^2 > 0)\, \exp\Big(-\frac{n_1 (\sigma^2 - \overline{Y^2})^2}{4\sigma_0^4}\Big). \quad (5.13)$$
The next result is a formal statement of the arguments mentioned above. Passing to (5.13) involves an additional $\log n$ factor in the required signal strength of $\overline{\mu_0^2}$. Denote by $\|\cdot\|_{\mathrm{TV}}$ the total variation distance and recall that the expectation $E_0^n$ is taken with respect to model (1.1).

Theorem 5.2. Let $\Pi_\infty(\cdot \,|\, Y, Z)$ and $\Pi_\infty(\cdot \,|\, Y)$ be the distributions corresponding to the densities (5.12) and (5.13), respectively.
If the prior densities $\gamma, \pi : [0, \infty) \to (0, \infty)$ are positive and uniformly continuous, then, for any compact sets $K \subset (0, \infty)$ and $K' \subset (-\infty, \infty)$, the expected total variation distance between the posterior and the respective limit distribution vanishes uniformly as $n \to \infty$. As a corollary of the proof, posterior contraction around the true variance $\sigma_0^2$ at rate $O(\sqrt{\log n/n})$ can be established. In the case of large means this is an immediate consequence of the posterior limit $\Pi_\infty(\cdot \,|\, Y)$ and the parametric Bernstein-von Mises theorem. For small means it is less obvious because of the non-standard limit of the posterior.
The posterior limit distribution is closely related to the class of skew-normal distributions; see [1,2]. We now derive an alternative characterization of the limit distribution. From the argumentation above, the p.d.f.
$$\pi_\infty(\sigma^2, \theta^2 \,|\, Y, Z) \propto \mathbb{1}(\sigma^2 > 0)\, \mathbb{1}(\theta^2 > 0)\, \exp\Big(-\frac{n_1(\sigma^2 - \overline{Y^2})^2}{4\sigma_0^4}\Big) \exp\Big(-\frac{n_2(\sigma^2 + \theta^2 - \overline{Z^2})^2}{4(\sigma_0^2 + \overline{\mu_0^2})^2}\Big) \quad (5.14)$$
can be viewed as the joint posterior limit of $(\sigma^2, \theta^2)$. In particular, the posterior limit distribution is its marginal distribution with respect to $\sigma^2$. As this is clear from the context, we do not write explicitly that the following distributions are conditional on $Y, Z$; that is, $Y, Z$ are treated as fixed.
In particular, the posterior limit distribution $\Pi_\infty(\cdot \,|\, Y, Z)$ coincides with the distribution of $\xi$ conditioned on the event $\{0 \leq \xi \leq \eta\}$, where $\xi \sim \mathcal{N}(\overline{Y^2}, 2\sigma_0^4/n_1)$ and $\eta \sim \mathcal{N}(\overline{Z^2}, 2(\sigma_0^2 + \overline{\mu_0^2})^2/n_2)$ are independent. If the standard deviations of $\xi$ and $\eta$ are small compared to their means, the posterior limit distribution essentially compares the means $\overline{Y^2}$ and $\overline{Z^2}$. This behavior is very reasonable because if $\overline{\mu_0^2}$ is small, then $\overline{Y^2} \approx \overline{Z^2}$ and the subsample $Z$ becomes informative about $\sigma^2$.
The posterior limit depends on unknown quantities. A frequentist estimator mimicking the posterior would estimate $\sigma^2$ by the MLE for zero means, $\overline{X^2}$, in the case that the means are small. To detect whether small means are present, we can check whether $\overline{Y^2} \geq \overline{Z^2}$, which leads to the estimator
$$\widehat{\sigma}^2 = \overline{X^2}\, \mathbb{1}\big(\overline{Y^2} \geq \overline{Z^2}\big) + \overline{Y^2}\, \mathbb{1}\big(\overline{Y^2} < \overline{Z^2}\big).$$
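A sketch of this switching estimator, under our reading that it equals $\overline{X^2}$ when $\overline{Y^2} \geq \overline{Z^2}$ and $\overline{Y^2}$ otherwise (function and variable names are ours):

```python
def variance_estimator(Y, Z):
    """Switching estimator: use the full-sample second moment mean(X_i^2)
    when small means are indicated (Ybar2 >= Zbar2), otherwise fall back
    to the zero-mean subsample estimator mean(Y_i^2)."""
    Ybar2 = sum(y * y for y in Y) / len(Y)
    Zbar2 = sum(z * z for z in Z) / len(Z)
    if Ybar2 >= Zbar2:
        n = len(Y) + len(Z)
        return (sum(y * y for y in Y) + sum(z * z for z in Z)) / n  # Xbar^2
    return Ybar2

# Small-means branch: Ybar2 = 2 >= Zbar2 = 1, so the full sample is used.
est_small = variance_estimator([2.0, 0.0], [1.0, 1.0])
```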

Finite sample analysis
We compare the estimators $\widehat{\sigma}^2_Y = \overline{Y^2}$ and $\widehat{\sigma}^2$ to the maximum $\widehat{\sigma}^2_{\mathrm{map},\infty}$ and the mean $\widehat{\sigma}^2_{\mathrm{mean},\infty}$ of the limit density $\sigma^2 \mapsto \pi_\infty(\sigma^2 \,|\, Y, Z)$ for sample sizes $n \in \{10, 100, 1000\}$. As discussed above, we expect to see some differences for small means. We study the performance for $\sigma_0^2 = 1$ and $\mu$ the vector with all entries equal to $t/n^{1/4}$ for the values $t \in \{0, 1, 2, 5\}$. Since $\widehat{\sigma}^2_Y$ does not depend on the means, this estimator performs equally well in all setups. Table 1 reports the average of the squared errors and the corresponding standard errors based on 10,000 repetitions. The rescaled MLE $\widehat{\sigma}^2_Y$ performs worse than any of the other estimators for small signals. Among the other estimators there is no clear winner. For $t = 5$, the risk of all estimators is nearly the same. For larger values of $t$, our simulation experiments did not show any changes compared to $t = 5$, and the results are therefore omitted from the table.

There has been a long-standing debate about whether Bayesian methods perform well when interpreted as frequentist methods. Results like the complete class theorem and the Bernstein-von Mises theorem have been foundational in this regard; see [22,16]. Our theory highlights another instance where Bayes leads to new estimators with good finite sample properties. The analysis moreover shows that the construction of a prior resulting in a posterior with good frequentist properties can be highly non-intuitive.
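A miniature version of this comparison (our own sketch with $t = 0$, $\alpha = 1/2$, and 3,000 repetitions; it is not the simulation reported in Table 1) illustrates why $\widehat{\sigma}^2_Y$ loses for small signals:

```python
import random

def estimators(Y, Z):
    """Ybar2 uses only the zero-mean subsample; the adaptive estimator
    switches to the full-sample second moment when Ybar2 >= Zbar2
    (our reading of the small-means detection rule)."""
    Ybar2 = sum(y * y for y in Y) / len(Y)
    Zbar2 = sum(z * z for z in Z) / len(Z)
    n = len(Y) + len(Z)
    Xbar2 = (sum(y * y for y in Y) + sum(z * z for z in Z)) / n
    adaptive = Xbar2 if Ybar2 >= Zbar2 else Ybar2
    return Ybar2, adaptive

random.seed(1)
n1 = n2 = 50
sigma0 = 1.0
reps = 3000
mse_Y = mse_adapt = 0.0
for _ in range(reps):
    Y = [random.gauss(0.0, sigma0) for _ in range(n1)]
    Z = [random.gauss(0.0, sigma0) for _ in range(n2)]  # t = 0: all means zero
    e1, e2 = estimators(Y, Z)
    mse_Y += (e1 - sigma0**2) ** 2 / reps
    mse_adapt += (e2 - sigma0**2) ** 2 / reps
```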

A.2. Proofs for Section 4
Proof of Proposition 4.1. It is enough to show that the following statements hold for all sufficiently large sample sizes $n$. Here $\alpha$ denotes the fraction of known zero means in the model, and the non-zero means are chosen to be equal to a fixed value depending only on $\sigma_0^2$. The interval $I$ is compact and the prior $\pi$ is continuous and positive on $\mathbb{R}_+$, so that $\inf_{\sigma^2 \in I} \pi(\sigma^2) > 0$. Since we also assumed that $\pi'$ is continuous, the prior term in (3.3) is uniformly bounded on $I$ for all sufficiently large $n$. Combining (3.3) and (A.2) yields (A.5). Using (3.1) and (3.2), we expand $V(\mu \,|\, (Z, \sigma^2))$. Since the integrands in the resulting display are positive for $|\mu_i| \geq 2|Z_i|$, we can set $V_i := |Z_i| \int_{|\mu| \leq 2|Z_i|} |\mu|\, \pi(\mu \,|\, Z_i, \sigma^2)\, d\mu$ and obtain a corresponding lower bound. As a next step in the proof, we establish a uniform lower bound over $\sigma^2 \in I$. To prove this inequality, we distinguish the cases $|Z_i| > R$ and $|Z_i| \leq R$, decomposing the corresponding terms. Next, we bound the term $B_i$ in (A.8). In the sequel, we frequently make use of the fact that $\sigma^2 \in I$. The idea is to split the domain of integration $0 \leq |\mu| \leq 2|Z_i|$ into the sets $|\mu| \leq \sigma_0$ and $\sigma_0 < |\mu| \leq 2|Z_i|$. The contribution of the first part can be bounded by $\sigma_0$. More work is needed for the second part. By expanding the square $(\mu - Z_i)^2$ in the exponent, the $Z_i^2$-terms in the numerator and denominator cancel against each other, as they do not depend on $\mu$.
We now treat numerator and denominator separately. For the numerator, we use that the function $y \mapsto y e^{-y^2/2}$ attains its maximum at $y = 1$ and is bounded by $e^{-1/2}$.

A.3. Proofs for Section 5
Proof of Lemma 5.1. We can write the posterior as a product of the likelihood factors for $Y$ and $Z$ and the prior densities. By using (5.6) and (5.7), we obtain (5.8).
Along similar lines, we now show that, on the event $A_n$, $\Pi(\theta^2 \notin B_2 \,|\, Y, Z) \to 0$ as $n$ tends to infinity. Since $\Pi(\sigma^2 \notin B_1 \,|\, Y, Z)$ tends to zero by (A.24), it is sufficient to establish convergence of the remaining term to zero. We can argue similarly as for the upper bound above, using that $\ell(\sigma^2 \,|\, Y) \leq \ell(\overline{Y^2} \,|\, Y)$. By following the same steps as for (A.22) and (A.23), and using that $a \mapsto \ell(a \,|\, Z)$ is increasing on $(0, \overline{Z^2}]$ and decreasing on $[\overline{Z^2}, \infty)$, the numerator in (5.9) integrated over the complement of $B_2$ can be bounded accordingly. By definition (see (A.16)), the constant $C_2 > 0$ satisfies $n_2 C_2 - 4n_2 - n_1 > 4n_2$.
Since $\delta_n \asymp \sqrt{\log n/n}$, this implies that the right-hand side of (A.26) is bounded above by $n \exp(-n_2 \delta_n^2/4) \to 0$ as $n \to \infty$. Together with (A.24), this completes the proof of part (ii).
Proof of (iii): It is well known that the total variation distance between probability measures $P, Q$ defined on the same measurable space $\mathcal{X}$ can be bounded in terms of conditioning on an event of large probability; see Lemma E.1 in [26]. We apply this with $A = B_1 \cap B_2$, $P = \Pi(\cdot \,|\, Y, Z)$, and $\Pi_0(\cdot \,|\, Y, Z)$ the distribution with the correspondingly truncated density. By bounding the $L_1$-distance between the densities, we then show that $\Pi_0(\sigma^2 \in \cdot \,|\, Y, Z)$ and $\Pi_1(\sigma^2 \in \cdot \,|\, Y, Z)$ are close in total variation, using the following lemma.
Proof of Theorem 5.2. We insert $1 = \mathbb{1}((Y, Z) \in A_n) + \mathbb{1}((Y, Z) \notin A_n)$ into the expectation. Since the total variation distance between probability measures is bounded, the result follows from Proposition A.1.
Proof of Lemma 5.4. To prove the result, we derive an expression for the joint density of $(\xi, \eta - \xi)$ conditioned on $\{0 \leq \xi \leq \eta\}$. The corresponding distribution function is zero if $s \leq 0$. Suppose now that $0 \leq s \leq t$. Conditioning on $\eta$, the distribution function can be rewritten accordingly. Taking the derivatives $\partial_s \partial_t$, the density of $(\xi, \eta - \xi)$ conditioned on $\{0 \leq \xi \leq \eta\}$ at the point $(s, t)$ equals, up to a multiplicative constant, $f_\xi(s) f_\eta(t + s)$. This completes the proof for the case $0 \leq s \leq t$.
The case $0 \leq t \leq s$ is similar, and the proof for this case is therefore omitted.
Since the posterior limit distribution is the marginal over the first component of the joint distribution in (5.14), it must coincide with the distribution of $\xi$ conditioned on $\{0 \leq \xi \leq \eta\}$.