Posterior Asymptotic Normality for an Individual Coordinate in High-dimensional Linear Regression

We consider the sparse high-dimensional linear regression model $Y=Xb+\epsilon$, where $b$ is a sparse vector. For the Bayesian approach to this problem, many authors have studied the behavior of the posterior distribution when, in truth, $Y=X\beta+\epsilon$ for some given $\beta$. There have been numerous results about the rate at which the posterior distribution concentrates around $\beta$, but few results about the shape of that posterior distribution. We propose a prior distribution for $b$ such that the marginal posterior distribution of an individual coordinate $b_i$ is asymptotically normal, centered around an asymptotically efficient estimator, under the truth. Such a result yields Bayesian credible intervals that match the confidence intervals obtained from an asymptotically efficient estimator of $b_i$. We also discuss ways of obtaining such asymptotically efficient estimators of individual coordinates, comparing the two-step procedure proposed by Zhang and Zhang (2014) with a one-step modified penalization method.


Consider the regression model

Y = Xb + ε,  ε ~ N(0, I_n).

The design matrix X is of dimension n × p. We are particularly interested in the case where p > n, for which b itself is not identifiable. In such a setting identifiability can be attained by adding a sparsity constraint on |b|_0, the number of nonzero b_i's. That is, the model consists of a family of probability measures {P_b : b ∈ R^p, |b|_0 ≤ s*}, and the observation Y is distributed N(Xb, I_n) under P_b.
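The model above is easy to simulate. The following sketch (the dimensions n, p and sparsity level s* are arbitrary illustrative choices, not values from the paper) draws a design with i.i.d. standard normal entries, a sparse coefficient vector b, and Y ~ N(Xb, I_n):

```python
import numpy as np

def simulate_sparse_model(n=100, p=400, s_star=5, seed=0):
    """Draw (X, Y, beta) from the sparse model: Y = X beta + eps, eps ~ N(0, I_n)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                      # n x p design with p > n
    beta = np.zeros(p)
    support = rng.choice(p, size=s_star, replace=False)  # |beta|_0 = s_star
    beta[support] = 2.0 + rng.standard_normal(s_star)
    Y = X @ beta + rng.standard_normal(n)
    return X, Y, beta

X, Y, beta = simulate_sparse_model()
print(X.shape, np.count_nonzero(beta))
```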
We are interested in Bayesian inference on the vector b when Y is actually distributed N(Xβ, I_n) for some truth β. If p were fixed and X were full rank, classical theorems (the Bernstein-von Mises theorem, as in [8, page 141]) give conditions under which the posterior distribution of b is asymptotically normal, centered at the least squares estimator with variance (X^T X)^{-1}, under P_β.
The classical theorem fails when p > n. Although sparse priors have been proposed that give good posterior contraction rates [3] [5], posterior normality of b is only obtained under strong signal-to-noise ratio (SNR) conditions, such as the SNR conditions of Castillo et al. [3, Corollary 2], which force the posterior to eventually have the same support as β. Effectively, their conditions reduce the problem to the classical, fixed-dimensional case. However, that is not the most interesting scenario. Without the SNR condition, Castillo et al. [3, Theorem 6] pointed out that under the sparse prior, the posterior distribution of b behaves like a mixture of Gaussians.
However, there is hope of obtaining posterior normality results without the SNR condition if one considers the situation where only one component of b is of interest, say b_1 without loss of generality. All the other components are viewed as nuisance parameters. As shown by Zhang and Zhang [9] in a non-Bayesian setting, it is possible to construct estimators β̂_1 that are efficient in the classical sense that

β̂_1 = β_1 + X_1^T ε / |X_1|^2 + o_p(1/√n).   (2)

We will use o_p(·) as shorthand for a term of stochastically smaller order under P_β throughout this document. Here X_i denotes the i-th column of X. Later we also write X_{-i} to denote the n × (p−1) matrix formed by all columns of X except X_i. The |·| norm on a vector refers to the Euclidean norm.
Approximation (2) is useful when |X_1| is of order √n, in which case the expansion (2) implies weak convergence [6, page 171]: under P_β,

|X_1|(β̂_1 − β_1) ⇒ N(0, 1).

(Such behavior of |X_1| occurs with high probability when the entries of X are generated i.i.d. from the standard normal distribution.) More precisely, Zhang and Zhang [9] proposed a two-step estimator that satisfies (2) under some regularity assumptions on X and no SNR conditions. They required the following behavior of X.
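The claim about |X_1| can be checked directly: for i.i.d. standard normal entries, |X_1|^2 is a chi-squared variable with n degrees of freedom, so |X_1|/√n concentrates near 1. A minimal numerical check (n is an arbitrary choice):

```python
import numpy as np

# For an i.i.d. N(0, 1) design, |X_1|^2 ~ chi^2_n, hence |X_1| / sqrt(n) -> 1.
rng = np.random.default_rng(1)
n = 10_000
X1 = rng.standard_normal(n)             # a single column of the design
ratio = np.linalg.norm(X1) / np.sqrt(n)
print(ratio)                            # concentrates near 1 for large n
```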
Assumption 1. There exists a constant c_1 > 0 for which max

Assumption 2. (REC(3s*, c_2)) There exist constants c_2, c > 0 for which

Assumption 3. The model dimension satisfies

Remark 1. Assumption 2 is known as the restricted eigenvalue condition [1, page 1710], required for penalized regression estimators such as the LASSO estimator [7, page 1] and the Dantzig selector [2, page 1] to enjoy optimal l_1 and l_2 convergence rates.
The goal of this paper is to give a Bayesian analogue of Theorem 1, in the form of a prior distribution on b such that, as n, p → ∞, the posterior distribution of b_1 starts to resemble a normal distribution centered around an estimator of the form (2). Note that the sparse prior introduced by Castillo et al. [3] does not meet our goal, since the marginal posterior distribution of b_1 under the sparse prior converges weakly to a mixture of normal distributions without consistent model selection.
Theorem 2. Under assumptions 1, 2 and 3 and a further constraint on s*, there exists a prior on b for which the posterior distribution of b_1 satisfies

d_BL( Π(b_1 ∈ · | Y), N(β̂_1, 1/|X_1|^2) ) → 0 in P_β-probability,   (4)

where β̂_1 is an estimator of β_1 with expansion (2).
The metric used here to quantify the discrepancy between probability measures is the bounded-Lipschitz metric [4, page 1]. Convergence of a sequence of distributions to a fixed distribution in the bounded-Lipschitz metric is equivalent to weak convergence.
2 The prior and its background.

How does de-biasing work?
In sparse linear regression, penalized likelihood estimators such as the LASSO are often used and tend to have good global properties. One desirable property is the following bound on the l_1 loss: with high probability under P_β,

|β̂ − β|_1 ≤ C s* λ_n for some constant C,   (5)

where λ_n is as defined in assumption 1. For example, Bickel et al. [1, Theorem 7.1] showed that under the REC condition (assumption 2) the LASSO estimator satisfies (5).
In general, penalized likelihood estimators introduce bias in the estimation of individual coordinates. To eliminate this bias, Zhang and Zhang [9] proposed a two-step procedure. First find a β̃, perhaps via a LASSO procedure, that satisfies (5). Then define

β̂_1(ZZ) = X_1^T (Y − X_{-1} β̃_{-1}) / |X_1|^2.

The idea behind this estimator is to penalize the magnitude of all coordinates except the one of interest. Under assumptions 1, 2 and 3, the two-step estimator β̂_1(ZZ) is asymptotically unbiased with expansion (2). The same asymptotic behavior can be obtained in a single step, as in the next theorem. The idea of penalizing all coordinates but one to eliminate the bias is seen more clearly there.
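The two-step idea can be sketched in code. The following is a simplified illustration, not the exact procedure of [9] (which uses a relaxed projection direction rather than X_1 itself): step one computes a LASSO-type pilot estimate with a homemade proximal-gradient (ISTA) solver, and step two regresses the partial residual Y − X_{-1}β̃_{-1} on X_1. The penalty level, dimensions, and coefficient values below are ad hoc choices for the sketch.

```python
import numpy as np

def lasso_ista(X, Y, lam, n_iter=2000):
    """LASSO pilot estimate via proximal gradient (ISTA):
    minimize 0.5 * ||Y - X b||^2 + lam * |b|_1."""
    L = np.linalg.norm(X, 2) ** 2                # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - X.T @ (X @ b - Y) / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return b

def debiased_first_coord(X, Y, lam):
    """Step 1: pilot LASSO.  Step 2: OLS of the partial residual on X_1."""
    b_pilot = lasso_ista(X, Y, lam)
    X1 = X[:, 0]
    partial_resid = Y - X[:, 1:] @ b_pilot[1:]
    return X1 @ partial_resid / (X1 @ X1)

rng = np.random.default_rng(2)
n, p = 200, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:4] = [1.0, -1.5, 2.0, 0.8]
Y = X @ beta + rng.standard_normal(n)
est = debiased_first_coord(X, Y, lam=np.sqrt(2 * n * np.log(p)))
print(round(est, 2))          # close to the true beta_1 = 1.0
```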
Theorem 3. Choose η_n to be a large enough multiple of nλ_n. Under assumptions 1, 2 and 3, the one-step de-biasing estimator

β̂ = argmin_b |Y − Xb|^2 + η_n |b_{-1}|_1

achieves the l_1 control (5) and de-biasing simultaneously; the estimator of the first coordinate satisfies expansion (2).

Proof. In the proof of theorem 3 we will refer to the one-step estimator as β̂.
We will first show that β̂ satisfies (5). We know that when the penalty involves all coordinates of b, the bound on the l_1 norm holds [1, Theorem 7.1]. It turns out that leaving one term out of the penalty does not ruin that property.
As in the proof of [1, Theorem 7.1], we compare the evaluations of the penalized likelihood function at β̂ and at the truth β. By the definition of β̂,

|Y − Xβ̂|^2 + η_n|β̂_{-1}|_1 ≤ |Y − Xβ|^2 + η_n|β_{-1}|_1.
Plugging in Y = Xβ + ε, the above reduces to

|X(β̂ − β)|^2 ≤ 2ε^T X(β̂ − β) + η_n( |β_{-1}|_1 − |β̂_{-1}|_1 ).   (6)

From here we need to discuss two situations. First consider the case where 1 is in S, the support of β. The expression above is bounded by
By choosing η_n to be a large enough multiple of nλ_n, we have
Since the left-hand side is nonnegative, the above implies
Therefore, under the assumption REC(c_0/c_1, κ), we can further bound the prediction loss by
So far we have shown, with high probability,   (7)
Under the REC assumption, we can go back and bound the l_1 loss.
Therefore with (7) we have the l_1 bound in this case. The proof for the other case turns out to be messier, but the general idea remains the same. When 1 ∈ S^c, we can bound the right-hand side of (6) by
Choosing η_n to be a large multiple of √(n log p), as in the case 1 ∈ S, we have
Again, use assumption 2 to deduce the l_1 control (5).
Observe that the penalty term does not involve b_1.
We only need to show that the second term in (9) is of order o_p(1/√n). Bound the absolute value of that term using assumption 1 and the l_1 control (5).

Remark 2. With some careful manipulation, the REC(3s*, c_2) condition of assumption 2 can be reduced to REC(s*, c_2). The proof would require an extra step of bounding

The ideas in the proofs for the two de-biasing estimators β̂_1(ZZ) and β̂(one-step) are similar. Ideally we want to run the regression

Y − X_{-1}β_{-1} = X_1 b_1 + ε.   (10)

That gives a perfectly efficient and unbiased estimator. However, β_{-1} is not observed. It is natural to replace it with an estimator which is made globally close to the truth β_{-1} using a penalized likelihood approach. As seen in the proof of Theorem 3, most of the work goes into establishing the global l_1 control (5). The de-biasing estimator is then obtained by running an ordinary least squares regression like (10), replacing β_{-1} by an estimator satisfying (5), so that the solution of the least squares problem is close to the solution of (10) with high probability.
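The one-step idea, penalizing every coordinate except the one of interest, can be sketched with a proximal-gradient scheme by simply setting the soft-threshold level of the first coordinate to zero. The penalty level, dimensions, and coefficient values below are illustrative choices, not tuned values from the paper.

```python
import numpy as np

def one_step_estimator(X, Y, eta, n_iter=2000):
    """Sketch of the one-step idea: minimize 0.5 * ||Y - X b||^2 + eta * |b_{-1}|_1
    via ISTA, where the first coordinate is never soft-thresholded."""
    L = np.linalg.norm(X, 2) ** 2
    thresh = np.full(X.shape[1], eta / L)
    thresh[0] = 0.0                    # leave b_1 unpenalized
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - X.T @ (X @ b - Y) / L
        b = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)
    return b

rng = np.random.default_rng(3)
n, p = 200, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:4] = [1.0, -1.5, 2.0, 0.8]
Y = X @ beta + rng.standard_normal(n)
b_hat = one_step_estimator(X, Y, eta=np.sqrt(2 * n * np.log(p)))
print(round(b_hat[0], 2))              # near the true beta_1 = 1.0
```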

Bayesian analogue of de-biasing estimators.
We would like to give a Bayesian analogue of the de-biasing estimators discussed above. As pointed out in the last section, it is essential to establish l_1 control on the vector b_{-1} − β_{-1}. Castillo et al. [3] and Gao et al. [5] have proposed priors that penalize the submodel dimension and provided theoretical guarantees such as LASSO-type contraction rates under the posterior distribution. We adopt the prior construction of Gao et al. [5], who gave conditions under which we have a good l_1 posterior contraction rate.
Lemma 1. (Corollary 5.4, [5]) If the design matrix X satisfies

κ_0((2 + δ)s*, X) ≥ c   (11)

for some positive constants c, δ, then there exist a constant c_3 > 0 and a large enough D > 0 for which

We slightly modify the sparse prior of Gao et al. [5] to obtain good, asymptotically normal posterior behavior for a single coordinate. As we discussed in the last section, classical approaches to de-biasing exploit the idea of penalizing all coordinates except the one of interest. Our prior construction mimics that idea by putting the sparse prior only on b_{-1}.

The prior.
Denote by H the matrix projecting R^n onto span(X_1). Under the model where Y ~ N(Xb, I_n), the likelihood function has the factorization

|Y − Xb|^2 = |HY − X_1 b*_1|^2 + |(I − H)Y − W b_{-1}|^2,

where b*_1 = b_1 + X_1^T X_{-1} b_{-1} / |X_1|^2 and W = (I − H) X_{-1}. The likelihood thus factorizes into a function of b*_1 and a function of b_{-1}. Therefore, if we make b*_1 and b_{-1} independent under the prior, they will be independent under the posterior. We put a Gaussian prior on b*_1 to mimic the ordinary least squares step in the classical approaches. We put a sparse prior analogous to that of Gao et al. [5, section 3] on b_{-1}, using W as the design matrix in the prior construction. By lemma 1, b_{-1} is close to β_{-1} in l_1 norm with high posterior probability as long as κ_0((2 + δ)s*, W) is bounded away from 0.
We make b*_1 and b_{-1} independent under the prior distribution. The product distribution corresponds to a prior distribution on the original vector b. Note that under this prior, b_1 and b_{-1} are not necessarily independent. The modified prior also has the effect of eliminating a bias term, in a fashion analogous to that of the two-step procedure β̂_1(ZZ). The joint posterior distribution of b*_1 and b_{-1} factorizes into two marginals. In the X_1 direction, the posterior distribution of b*_1 is asymptotically Gaussian, centered around X_1^T Y / |X_1|^2. After we reverse the reparametrization, we want the posterior distribution of b_1 to be asymptotically Gaussian centered around an efficient estimator

β̂_1 = β_1 + X_1^T ε / |X_1|^2 + o_p(1/√n).

That can be obtained from the l_1 control on b_{-1} − β_{-1} under the posterior. In the next section we give the proof of our main posterior asymptotic normality result (Theorem 2) in detail.
Since that prior and the likelihood of b*_1 are both Gaussian, we can work out the exact posterior distribution: under the N(0, σ_n^2) prior,

b*_1 | Y ~ N( X_1^T Y / (|X_1|^2 + σ_n^{-2}), (|X_1|^2 + σ_n^{-2})^{-1} ).   (12)
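The conjugate computation can be checked numerically. The sketch below assumes a N(0, σ_n^2) prior on b*_1 (σ_n is a hypothetical prior scale) and verifies that for large σ_n the posterior mean is essentially the least squares value X_1^T Y / |X_1|^2:

```python
import numpy as np

def posterior_b1_star(X1, Y, sigma_n):
    """Conjugate update: prior b*_1 ~ N(0, sigma_n^2); in the X_1 direction the
    sufficient statistic satisfies X_1^T Y ~ N(|X_1|^2 b*_1, |X_1|^2).
    The posterior is Gaussian with precision |X_1|^2 + sigma_n**-2."""
    precision = X1 @ X1 + sigma_n ** -2
    return (X1 @ Y) / precision, 1.0 / precision   # posterior mean, variance

rng = np.random.default_rng(4)
n = 200
X1 = rng.standard_normal(n)
Y = 0.7 * X1 + rng.standard_normal(n)              # data in the X_1 direction only
mean, var = posterior_b1_star(X1, Y, sigma_n=10.0)
ols = (X1 @ Y) / (X1 @ X1)
print(round(mean, 3), round(ols, 3))               # nearly identical for large sigma_n
```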
Note that without conditioning on b −1 , the posterior distribution of b 1 is not necessarily Gaussian.
The goal is to show that the bounded-Lipschitz metric between the posterior distribution of b_1 and N(β̂_1, 1/|X_1|^2) goes to 0 under the truth. From Jensen's inequality and the definition of the bounded-Lipschitz norm we have
For simplicity, denote the posterior mean and variance in (12) by ν_n and τ_n^2 respectively. The bounded-Lipschitz distance between two normals N(µ_1, σ_1^2) and N(µ_2, σ_2^2) is at most |µ_1 − µ_2| + |σ_1 − σ_2|√(2/π), by coupling the two through a common standard normal. Therefore, to obtain the desired convergence in (4), we only need to show   (13)
To show (13), notice that the integrand is bounded; hence it is equivalent to show convergence in probability. Write
The first term is no longer random in b, and it can be made as small as we wish. For the second term, we will apply lemma 1 to deduce that it also goes to 0 under P_β. To apply the posterior contraction result we need to establish the compatibility assumption (11) on W.

Lemma 2. Under assumptions 1, 2, 3 and a further constraint on s*, we have κ_0((2 + δ)s*, W) ≥ c for some c, δ > 0.
We will prove the lemma after the proof of Theorem 2.
To show (14), note that the integrand is not a random quantity. It suffices to show
That is certainly true for a sequence {σ_n} chosen large enough. Combining (13), (14) and the bound on the bounded-Lipschitz distance, we have shown the desired convergence (4).

Proof of lemma 2. We will justify the compatibility assumption on W in two steps. First we will show that the compatibility assumption on the matrix X follows from the REC assumption 2. Then we will show that the compatibility constants of X and W are not very far apart.
Let us first show that under assumption 2, there exist constants 0 < δ < 1 and c > 0 for which κ_0((2 + δ)s*, X) = inf
Denote the support of g by S. We have
The second term is of order o(1) under assumption 3.

The sparse prior on the sub-model is constructed in three steps:
1. The size s of the dimension of the sub-model in the direction orthogonal to X_1 has probability mass function π(s) ∝ exp(−Ds log p).
2. Given s, the support is drawn uniformly: S | s ~ Unif(Z_s := {S ⊂ {1, ..., p} : |S| = s, X_S is full rank}).
3. Given the subset selection S, the coefficients b_S have density f_S(b_S) ∝ exp(−η_n |X_S b_S|) for suitably chosen η_n.
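A sampler for this three-step prior can be sketched as follows (the design matrix, D and η_n below are arbitrary illustrative choices; in the paper the sparse component is applied with W as the design matrix). Step 3 uses a change of variables: writing X_S = QR (reduced QR) and w = R b_S, the density ∝ exp(−η_n|X_S b_S|) becomes spherically symmetric in w, so the radius |w| has a Gamma(s, η_n) distribution and the direction is uniform on the unit sphere in R^s.

```python
import numpy as np

def sample_sparse_prior(X, D=1.0, eta=1.0, s_max=None, rng=None):
    """Sample b from: pi(s) ∝ exp(-D s log p); S | s uniform over size-s supports;
    b_S | S with density ∝ exp(-eta * |X_S b_S|)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    s_max = min(n, p) if s_max is None else s_max
    # Step 1: sample the sub-model size s.
    sizes = np.arange(s_max + 1)
    logw = -D * sizes * np.log(p)
    probs = np.exp(logw - logw.max())
    probs /= probs.sum()
    s = rng.choice(sizes, p=probs)
    b = np.zeros(p)
    if s == 0:
        return b
    # Step 2: sample the support S uniformly (full rank holds a.s. for Gaussian X).
    S = rng.choice(p, size=s, replace=False)
    # Step 3: radius |w| ~ Gamma(s, eta), direction uniform on the sphere in R^s,
    # then solve R b_S = w for b_S, where X_S = Q R.
    Q, R = np.linalg.qr(X[:, S])
    direction = rng.standard_normal(s)
    direction /= np.linalg.norm(direction)
    radius = rng.gamma(shape=s, scale=1.0 / eta)
    b[S] = np.linalg.solve(R, radius * direction)
    return b

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 100))
b = sample_sparse_prior(X, D=1.0, eta=2.0, s_max=10, rng=rng)
print(np.count_nonzero(b))
```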

Since b*_1 and b_{-1} are independent under the posterior distribution, the above is also the distribution of b*_1 given Y and b_{-1}. That implies that the distribution of |X_1|(b_1 − β̂_1) given Y and b_{-1} is