High-dimensional posterior consistency for hierarchical non-local priors in regression

The choice of tuning parameters in Bayesian variable selection is a critical problem in modern statistics. In particular, for Bayesian linear regression with non-local priors, the scale parameter in the non-local prior density is an important tuning parameter: it reflects the dispersion of the non-local prior density around zero, and implicitly determines the size of the regression coefficients that will be shrunk to zero. Current approaches treat the scale parameter as given, and suggest choices based on prior coverage or asymptotic considerations. In this paper, we consider the fully Bayesian approach introduced in (Wu, 2016) with the pMOM non-local prior and an appropriate Inverse-Gamma prior on the tuning parameter, and analyze its theoretical properties. Under standard regularity assumptions, we establish strong model selection consistency in a high-dimensional setting, where $p$ is allowed to increase at a polynomial, or even sub-exponential, rate with $n$. Through simulation studies, we demonstrate that our model selection procedure can outperform other Bayesian methods which treat the scale parameter as given, as well as commonly used penalized likelihood methods, in a range of simulation settings.


Introduction
The literature on Bayesian variable selection in linear regression is vast and rich. Many priors and methods have been proposed, such as stochastic search variable selection ([6]), empirical Bayes variable selection ([5]), spike and slab selection ([7], [13]), g-prior based variable selection ([11]), and penalized credible regions ([2]), to name a few. In recent years, the use of non-local priors in this context has generated a lot of interest.
Non-local priors were first introduced by Johnson and Rossell in [8], in the context of hypothesis testing, as densities that are identically zero whenever a model parameter is equal to its null value. Compared to local priors, which remain positive at

X. Cao et al.
null parameter values, non-local prior distributions have appealing properties for Bayesian model selection. In particular, as indicated in [8], non-local priors discard spurious covariates faster as the sample size $n$ grows, while preserving exponential learning rates for detecting non-zero coefficients. These priors were further extended to Bayesian model selection problems in [9] by imposing non-local prior densities on a vector of regression coefficients.
In particular, let $y_n$ denote a random vector of responses, $X_n$ an $n \times p$ design matrix of covariates, and $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$ a $p \times 1$ vector of regression coefficients. Under the linear regression model, $y_n \sim N(X_n \beta, \sigma^2 I_n)$.
In [9], the authors introduce the product moment (pMOM) non-local prior with density
$$\pi(\beta \mid \tau, \sigma^2) = d_p (2\pi)^{-p/2} (\tau \sigma^2)^{-rp - p/2} |A_p|^{1/2} \exp\left( -\frac{\beta^T A_p \beta}{2\tau\sigma^2} \right) \prod_{i=1}^{p} \beta_i^{2r}. \tag{1.1}$$
Here $A_p$ is a $p \times p$ nonsingular matrix, $r$ is a positive integer referred to as the order of the density, and $d_p$ is the normalizing constant, independent of $\tau$ and $\sigma^2$. Variations of the density in (1.1), called the piMOM and peMOM densities, have also been developed in [9, 14]. Clearly, the density in (1.1) is zero when any component of $\beta$ is zero. Under appropriate regularity conditions, the authors in [9, 15] demonstrate that in high-dimensional settings, model selection procedures based on the pMOM and piMOM non-local prior densities can achieve strong model selection consistency, i.e., the posterior probability of the true model converges to 1 as the sample size $n$ increases. As noted in [9], the scale parameter $\tau$ is of particular importance, as it reflects the dispersion of the non-local prior density around zero, and implicitly determines the size of the regression coefficients that will be shrunk to zero. In [8, 9], the authors treat $\tau$ as a constant and suggest a choice of $\tau$ which leads to a high prior probability for significant values of the regression coefficients. In [15], the authors again treat $\tau$ as a constant, and consider a setting where $p$ and $\tau$ vary with the sample size $n$. They show that high-dimensional model selection consistency is achieved under the peMOM prior (another variation of the priors above, introduced in [14]), as long as $\tau$ is of a larger order than $\log p$ and of a smaller order than $n$.
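The zero-at-the-origin property of the pMOM density can be illustrated with a short sketch that evaluates the log-density in (1.1) up to the normalizing constant $d_p$. The helper name and the default choice $A_p = I_p$ are ours, for illustration only:

```python
import numpy as np

def pmom_log_density_unnorm(beta, tau, sigma2, r=1, A=None):
    """Unnormalized log pMOM density, following the form of (1.1):
    proportional to (tau*sigma2)^{-(r*p + p/2)}
    * exp(-beta' A beta / (2*tau*sigma2)) * prod_i beta_i^{2r}.
    The density vanishes whenever any component of beta is zero."""
    beta = np.asarray(beta, dtype=float)
    p = beta.size
    if A is None:
        A = np.eye(p)                    # illustrative default scale matrix A_p = I_p
    if np.any(beta == 0.0):
        return -np.inf                   # non-local: zero at any null component
    quad = beta @ A @ beta
    return (-(r * p + p / 2.0) * np.log(tau * sigma2)
            - quad / (2.0 * tau * sigma2)
            + 2.0 * r * np.sum(np.log(np.abs(beta))))
```

For instance, any vector with a zero entry gets log-density $-\infty$, while vectors bounded away from zero in every coordinate receive finite mass; this is exactly the mechanism by which spurious covariates are discarded.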
The primary goal and innovation of this paper is to propose and analyze a fully Bayesian approach with the pMOM non-local prior, in which we place an appropriate Inverse-Gamma prior on the parameter $\tau$. A clear advantage of such an approach, compared to previous approaches, is that it is more robust and comparatively immune to misspecification of $\tau$. The extra prior layer, however, creates technical challenges for a high-dimensional theoretical consistency analysis. Under standard regularity assumptions, which include restricting the prior over all models to those with model size less than an appropriate function of the sample size $n$, we establish posterior ratio consistency (Theorem 3.1), i.e., the ratio of the maximum marginal posterior probability assigned to a "non-true" model to the posterior probability assigned to the "true" model converges to zero in probability. In particular, this implies that the true model will be the mode of the posterior distribution with probability tending to 1 as $n \to \infty$. Next, under the additional assumption that $p$ increases at a polynomial rate with $n$, we show strong model selection consistency (Theorem 3.2), which implies that the posterior probability of the true model converges in probability to 1 as $n \to \infty$. The assumption of restricting the prior to models of appropriately bounded size (i.e., putting zero prior mass on unrealistically large models) has also been used in [13] and [15] for regression models.
The authors in [1] also discuss the potential advantages of using priors on $\tau$ in the context of generalized linear models, and establish model selection consistency. While there are some connections between our model and the one in [1], there are fundamental differences between the two models and the corresponding analyses. A detailed explanation is provided in Remark 1.
The rest of the paper is structured as follows. In Section 2 we introduce our hierarchical fully Bayesian model. Model selection consistency results are stated in Section 3, and the proofs are provided in Section 4. Details about how to approximate the posterior density for model selection are given in Section 5. In Section 6 we use simulation experiments to illustrate the posterior ratio consistency result, and demonstrate the benefits of model selection using our fully Bayesian approach as compared to approaches which treat $\tau$ as a constant, as well as to existing penalized likelihood approaches.

Model specification
We start by considering the standard Gaussian linear regression model with $p$ coefficients, and by introducing some required notation. Let $y_n$ denote a random vector of responses, $X_n$ an $n \times p$ design matrix of covariates, and $\beta$ a $p \times 1$ vector of regression coefficients. Our goal is variable selection, i.e., to correctly identify all the non-zero regression coefficients. In light of that, we denote a model by $k = \{k_1, k_2, \ldots, k_m\}$ if and only if all the non-zero elements of $\beta$ are $\beta_{k_1}, \beta_{k_2}, \ldots, \beta_{k_m}$, and write $\beta_k = (\beta_{k_1}, \beta_{k_2}, \ldots, \beta_{k_m})^T$. For any $p \times p$ matrix $A$, let $A_k$ represent the submatrix formed from the columns of $A$ corresponding to model $k$. In particular, let $X_k$ denote the design matrix formed from the columns of $X_n$ corresponding to model $k$. For the rest of the paper, we simply let $k = |k|$ also represent the cardinality of model $k$, for notational convenience.
The class of pMOM densities (1.1) can be used for model selection through the hierarchical model (2.1), in which the pMOM prior is placed on $\beta_k$ (conditional on $\tau$ and $\sigma^2$), an Inverse-Gamma prior is placed on $\tau$, and a prior is placed over the space of models.

Note that in (2.1), no specific form or condition has yet been assigned to the prior over the space of models. Some standard regularity assumptions for this prior will be provided in Section 3.
By (2.1) and Bayes' rule, the resulting posterior probability for model $k$ is given by
$$\pi(k \mid y_n) = \frac{\pi(k) \, m_k(y_n)}{\pi(y_n)}, \tag{2.2}$$
where $\pi(y_n)$ is the marginal density of $y_n$, and $m_k(y_n)$ is the marginal density of $y_n$ under model $k$, obtained by integrating out $\beta_k$, $\tau$ and $\sigma^2$. Here $E_k(\cdot)$ denotes the expectation with respect to a multivariate normal distribution with mean $\tilde{\beta}_k = C_k^{-1} X_k^T y_n$ and covariance matrix $V = \sigma^2 C_k^{-1}$. In particular, these posterior probabilities can be used to select a model by computing the posterior mode, defined by
$$\hat{k} = \arg\max_k \pi(k \mid y_n). \tag{2.4}$$
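The mechanics of the quantities above can be sketched as follows. The helper names, the choice $A_k = I$, and the use of residual sum of squares as a stand-in score for the exact marginal $m_k(y_n)$ are all our illustrative assumptions, not the paper's procedure:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p, tau, sigma2 = 50, 4, 1.0, 1.0
X = rng.standard_normal((n, p))
# Data generated from the model with true support {0, 1}:
y = X[:, [0, 1]] @ np.array([2.0, -1.5]) + rng.standard_normal(n)

def normal_moments(k):
    """For a model k (tuple of column indices), form C_k = X_k'X_k + A_k/tau
    (taking A_k = I as a simplifying assumption) and return the mean
    beta_tilde_k = C_k^{-1} X_k' y and covariance V = sigma2 * C_k^{-1}
    of the multivariate normal appearing in E_k(.)."""
    Xk = X[:, list(k)]
    Ck = Xk.T @ Xk + np.eye(len(k)) / tau
    beta_tilde = np.linalg.solve(Ck, Xk.T @ y)
    V = sigma2 * np.linalg.inv(Ck)
    return beta_tilde, V

def rss(k):
    """Crude model score: residual sum of squares at beta_tilde_k
    (a proxy for ranking models, NOT the exact marginal m_k)."""
    beta_tilde, _ = normal_moments(k)
    return np.sum((y - X[:, list(k)] @ beta_tilde) ** 2)

# Enumerate all one- and two-variable models and take the best-scoring one,
# mimicking the arg-max over models in (2.4).
models = [k for m in (1, 2) for k in combinations(range(p), m)]
best = min(models, key=rss)
```

With a strong signal, the best-scoring model recovers the true support $\{0, 1\}$; in the actual procedure the ranking is by $\pi(k \mid y_n)$ rather than this proxy.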

Model selection consistency: main results
In this section we explore the high-dimensional asymptotic properties of the Bayesian model selection approach specified in Section 2. In particular, we consider a setting where the number of regression coefficients $p = p_n$ increases with the sample size $n$. The true data generating mechanism is given by $y_n = X_n \beta_0 + \epsilon_n$. Here $\beta_0$ is the true $p_n$-dimensional vector of regression coefficients, whose dependence on $n$ is suppressed for notational convenience, and the entries of $\epsilon_n$ are i.i.d. Gaussian with mean zero and variance $\sigma_0^2$. As in [9], we assume that the true vector of regression coefficients is sparse, i.e., all the entries of $\beta_0$ are zero except those corresponding to a subset $t \subseteq \{1, 2, \ldots, p_n\}$, and that $t$, $\beta_{0,t}$ and $\sigma_0^2$ do not vary with $n$. Our results can be easily extended to the case where $|t|$ and the entries of $\beta_{0,t}$ and $\sigma_0^2$ vary with $n$ but stay bounded; however, we assume these quantities stay fixed for ease of exposition. For any matrix $A$, let $\nu_j(A)$ denote its $j$-th largest nonzero eigenvalue. In order to establish our asymptotic results, we need the following mild regularity assumptions.
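A minimal simulation of this data-generating mechanism reads as follows; the dimensions, support indices and coefficient values are illustrative choices of ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 200                          # p = p_n grows with n; here p > n
t = [3, 7, 11]                           # true model t: support of beta_0
beta0 = np.zeros(p)
beta0[t] = [1.5, -2.0, 2.5]              # beta_{0,t}, held fixed in n
sigma0 = 1.0                             # true error standard deviation sigma_0
X = rng.standard_normal((n, p))          # design matrix X_n
eps = rng.normal(0.0, sigma0, size=n)    # i.i.d. N(0, sigma_0^2) errors
y = X @ beta0 + eps                      # y_n = X_n beta_0 + eps_n
```

All but $|t| = 3$ of the $p = 200$ coefficients are exactly zero, which is the sparsity structure the consistency results are about.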
Assumption 5. For every $n \geq 1$, the hyper-parameters for the non-local pMOM prior in (2.1) satisfy the following bounds, where $a_1, a_2$ are constants not depending on $n$.
The authors in [9] assume all the eigenvalues of $X_n^T X_n$ to be bounded by constants, which is unrealistic in the high-dimensional setting. In our work, Assumption 1 requires the non-zero eigenvalues of the sub-matrices of the design matrix not to be bounded by a constant, but to be uniformly bounded by an appropriate function of $n$. Assumption 5 is standard, and requires the prior covariance matrices to be uniformly bounded in $n$. Assumption 3 states that the prior on the space of the $2^{p_n}$ possible models places zero mass on unrealistically large models (identical to the corresponding assumption in [15]). Assumption 4 states that the ratio of the prior probabilities assigned to the true model and any non-true model stays bounded below in $n$ (identical to the corresponding assumption in [9]). This type of prior has also been considered in [16] and [15]. As indicated in [15], since the pMOM and piMOM priors already induce a strong penalty on the model size, it is no longer necessary to use priors that further penalize larger models, as for example in [18] and [3]. Assumption 2 states that $p$ can grow at an appropriate polynomial rate with $n$.
We now state and prove the main model selection consistency results. Our first result establishes what we refer to as posterior ratio consistency. This notion of consistency implies that the true model will be the mode of the posterior distribution with probability tending to 1 as $n \to \infty$.

Theorem 3.1 (Posterior ratio consistency). Under Assumptions 1, 3, 4 and 5, the ratio of the maximum posterior probability assigned to a non-true model to the posterior probability of the true model converges to zero in probability. In particular, this implies that the probability that the posterior mode $\hat{k}$ defined in (2.4) is equal to the true model $t$ converges to 1. We would like to point out that posterior ratio consistency (Theorem 3.1) does not require any restriction on the growth rate of the number of predictors; such a restriction is only needed for strong selection consistency (Theorem 3.2). Next, we establish a stronger result, which implies that the posterior mass assigned to the true model $t$ converges to 1 in probability; we refer to this notion of consistency as strong selection consistency (Theorem 3.2). The above results are established under the pMOM priors. Another class of non-local priors introduced in [9] are the piMOM priors on the regression coefficients, for which the density of the regression coefficients under the model $k = \{k_1, k_2, \ldots, k_m\}$ is given by
$$\pi(\beta_k \mid \tau, \sigma^2) = \prod_{i=1}^{m} \frac{(\tau\sigma^2)^{r/2}}{\Gamma(r/2)} |\beta_{k_i}|^{-(r+1)} \exp\left(-\frac{\tau\sigma^2}{\beta_{k_i}^2}\right),$$
where $r$ is a positive integer referred to as the order of the density. The corollary below establishes strong model selection consistency under the piMOM priors, and can be obtained immediately by combining Theorem 3.2 with [10, Eqs. (59), (60)].
Corollary 3.1 (Strong selection consistency for piMOM priors). Under the same conditions as in Theorem 3.2, when piMOM priors are imposed on $\beta_k$ in model (2.1), the posterior probability of the true model converges to 1 in probability.
Remark 1. In the context of generalized linear regression, Bian and Wu [1] consider a hierarchical Bayesian model on the regression coefficients.
In particular, they put an independent piMOM prior on each regression coefficient (conditional on the hyper-parameter $\tau_i$), and an Inverse-Gamma prior on $\tau_i$. In this setting, the authors in [1] establish strong selection consistency for the regression coefficients (assuming the prior is constrained to leave out unrealistically large models). There are similarities between the models and the consistency analyses in [1] and our work, namely the use of non-local priors and the Inverse-Gamma distribution. Despite these similarities, however, there are some fundamental differences between the two models and the corresponding analyses. Firstly, unlike the piMOM prior, the pMOM prior in our model does not in general correspond to assigning an independent prior to each entry of $\beta_k$. In particular, pMOM distributions introduce correlations among the entries of $\beta_k$, which creates additional theoretical challenges; because of these correlations, properties such as the detection of small coefficients are not apparent in our case. Also, the pMOM prior imposes exact sparsity on $\beta_k$, which is not the case in [1]; hence it is structurally different from the prior in [1]. Secondly, in order to prove their consistency results, the authors in [1] assume that the product of the response variables and the entries of the design matrix is bounded by a constant. The former assumption is rarely seen in the literature, and the latter can be problematic in practice; see Assumption C1 in [1]. In addition, Assumption C2 in [1] imposes a lower bound on the true regression coefficients, which is not required in our analysis. Thirdly, in terms of proving posterior consistency, we bound the ratio of posterior probabilities for a non-true model and the true model by a 'prior term', which results from the Inverse-Gamma prior on $\tau$, and a 'data term'. The consistency proof is then a careful exercise in balancing these two terms against each other on a case-by-case basis, while the authors in [1] directly follow the proof in [15] and require additional assumptions on the Hessian matrix.

Proof of Theorems 3.1 and 3.2
The proof of Theorems 3.1 and 3.2 will be broken up into several steps. First, for any model $k$, we define the quantity $R_k^*$. Our method of proving consistency involves approximating $R_t$ and $R_k$ (in (2.3)) by $R_t^*$ and $R_k^*$ respectively. Fix a model $k \neq t$ arbitrarily, and let $u = k \cup t$, with $u = |u|$ also denoting the cardinality of $u$. Note that

and
Next, we establish two tail probability bounds, for the $\chi^2$ distribution and the non-central $\chi^2$ distribution respectively, which will be useful in our analysis.
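For concreteness, standard bounds of this type (the Laurent-Massart inequalities and Birgé's non-central analogue are well-known instances; the paper's exact bounds may differ in constants) take the form:

```latex
% Laurent--Massart: for W \sim \chi^2_d and all x > 0,
P\left( W \ge d + 2\sqrt{dx} + 2x \right) \le e^{-x},
\qquad
P\left( W \le d - 2\sqrt{dx} \right) \le e^{-x}.

% Birg\'e: for a non-central chi-squared W' \sim \chi^2_d(\lambda)
% with non-centrality parameter \lambda, for all x > 0,
P\left( W' \ge d + \lambda + 2\sqrt{(d + 2\lambda)x} + 2x \right) \le e^{-x},
\qquad
P\left( W' \le d + \lambda - 2\sqrt{(d + 2\lambda)x} \right) \le e^{-x}.
```

Bounds of this form control both the noise quadratic forms (central case) and the signal terms driven by the non-centrality parameter.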
as $n \to \infty$. Similarly, the corresponding bounds hold as $n \to \infty$. Next, by Lemma 4.2, since $u \supset t$, an analogous bound holds as $n \to \infty$, for some constant $c' > 0$. We then define the event $C_n$ accordingly.
We now analyze the behavior of the posterior ratio under different scenarios in a sequence of lemmas. Recall that our goal is to find an upper bound for the posterior ratio which converges to 0 as $n \to \infty$. For the following lemmas, we restrict ourselves to the event $C_n^c$.
Lemma 4.3. Under Assumptions 1 and 5, for all $k \neq t$, there exists $N$ such that when $n > N$, the stated bound holds, where $A$, $B$ are constants.
Proof. It follows from the form of the non-local priors in (2.1), and from Assumptions 1 and 5, that the corresponding prior bounds hold. Hence, by (2.3) and (4.9), when $k \neq t$, by [10, Lemma 6], we get the stated inequality. Note that $V \leq \frac{1}{n \epsilon_n^5} y_n^T P_u y_n$ under Assumption 1, where $P_u = X_u (X_u^T X_u)^{-1} X_u^T$. Therefore, from (4.10), for large enough constants $M_1$, $M_2$, we get the corresponding bound. Since $1 + x \leq e^x$, integrating out both $\tau$ and $\sigma^2$ gives a bound on $m_k(y_n)$. Similarly, for the true model $t$, using (2.3) and (4.9), the inequality $1 + x \leq e^x$, and first integrating out $\sigma^2$, we get (4.12).

Note that, by Assumptions 1 and 5, the corresponding bound holds, and from (4.12) we obtain the stated inequality. Note that $P_t \leq P_u$. By (4.2) and (4.5), there exists $N_1$ such that for all $n > N_1$, the required bound holds. Using (4.13) and (4.14), and integrating out $\tau$, we obtain the stated bound. It then follows from (4.11) and Assumption 1 that the upper bound for $m_t(y_n)$ is given by the stated expression, where $A_1$ and $B_1$ are large enough constants. Using Stirling's approximation, there exists $N_2$ such that for all $n > N_2$ and a large enough constant $M'$, we have the corresponding bound. By (4.16), there exists $N = \max\{N_1, N_2\}$ such that for $n > N$, the stated inequality holds. In order to show that this upper bound converges to 0, we need the following lemma. Let $\delta = \frac{1}{2} \min_{j \in t} |\beta_{0,j}|$.
Lemma 4.4. Suppose $k \not\supset t$ and $u = k \cup t$. Further note that $P^*$ is idempotent, and from the assumptions of our theorem, the non-centrality parameter satisfies the stated lower bound. Let $Z$ be a standard normal random variable. By Lemma 4.4, it follows from the relation between the non-central chi-squared and normal distributions that the stated bound holds. The next two lemmas provide upper bounds on the posterior ratio as a function of $n$ under different cases of the "non-true" model $k$.
Lemma 4.5. Under Assumptions 1, 3 and 5, for all $k \not\supset t$, there exists $N$ such that when $n > N$, the stated bound holds, where $K'$ and $L'$ are constants.
Proof. Recall that $u = k \cup t$ is a superset of $t$; therefore, $R_u^* \leq R_k^*$. By (4.5) and Assumption 1, for large enough $n$, say $n > N_1$, we obtain the stated bound. Consider the first term of (4.17). When $k$ exceeds the stated threshold, for large enough $n > N_1$, using (4.3), (4.4) and Assumption 1, together with $1 + x \leq e^x$, we obtain a bound with constants $K_1$ and $L_1$. When $k$ is below that threshold, it follows from (4.18) and (4.20) that the first part of (4.17) is bounded accordingly; by $1 + x \leq e^x$ and $1 - x \leq e^{-x}$, the corresponding bound holds for large enough $n$, say $n > N_3$, using $\sqrt{u - t} \leq \sqrt{k} \leq k$. Now consider the second term of (4.17). When $k$ exceeds the stated threshold, using (4.3), (4.4) and Assumption 3, for large enough $n$, say $n > N_4$, and similarly to (4.21), we obtain a bound with large enough constants $K_3$ and $L_3$. When $k$ is below that threshold, using (4.18) and an argument similar to (4.22), the corresponding bound follows.
By $1 + x \leq e^x$, we obtain the stated bound, where $K_4$ and $L_4$ are large enough constants, for large enough $n$, say $n > N_6$.
Combining all cases and letting $N = \max\{N_1, N_2, N_3, N_4, N_5, N_6\}$, we obtain the stated result.
Lemma 4.6. Under Assumptions 1, 3 and 5, for all $k \supset t$, there exists $N$ such that when $n > N$, the stated bound holds, where $S'$ and $T'$ are constants.
Proof. Similarly to (4.20), for large enough $n > N_1$, we have the stated bound on $V$, where $S_1$ and $T_1$ are large enough constants. When $t < k$ is below the stated threshold, by the boundedness of $k$, together with (4.2), (4.3) and (4.4), the first part of (4.17) is bounded accordingly, for large enough constants $S_2$, $T_2$ and large enough $n$, say $n > N_3'$. Now consider the second part of (4.17): when $k$ exceeds the stated threshold, the corresponding bound holds for large enough constants $S_3$ and $T_3$.
The proof of Theorem 3.1 now follows immediately from these two lemmas. By Lemma 4.5, if we restrict to $C_n^c$, the stated bound holds for any $k \not\supset t$; otherwise, when $k \supset t$, the bound from Lemma 4.6 applies. Note that $P(C_n^c) \to 1$ as $n \to \infty$. It then follows from (2.2) and Assumption 4 that, for $k \neq t$, we have (4.30).

We now move on to the proof of Theorem 3.2. First note that when $\xi < 1 - \frac{4\gamma}{3r}$, the stated minimum bound holds. It follows from (4.30) and Assumption 2 that if we restrict to $C_n^c$, then the corresponding bound holds. Using the fact that the number of models of size $k$ is at most $p^k$, together with (4.31), we obtain the desired convergence as $n \to \infty$.

Computation
The integral formulation in (2.2) is quite complicated, and hence the posterior probabilities cannot be obtained in closed form. We therefore use the Laplace approximation to compute $m_k(y_n)$ and $\pi(k \mid y_n)$; a similar approach to computing posterior probabilities has been used in [9] and [15]. For any model $k$, the Laplace approximation of $m_k(y_n)$ is given by (5.2), where $(\hat{\beta}_k, \hat{\tau}, \hat{\sigma}^2) = \arg\max_{(\beta, \tau, \sigma^2)} f(\beta, \tau, \sigma^2)$, and $V(\hat{\beta}_k, \hat{\tau}, \hat{\sigma}^2)$ is a $(k+2) \times (k+2)$ symmetric matrix with the blocks given in (5.3). The above Laplace approximation can be used to compute the log of the posterior probability ratio between any given model $k$ and the true model $t$, and to select the model with the highest posterior probability.
Fig 1. Log of posterior probability ratio for $k$ and $t$ for various choices of the "non-true" model $k$.
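The mechanics of a Laplace approximation can be illustrated in one dimension (a generic sketch of ours, not the exact $(k+2)$-dimensional approximation in (5.2)): expand the log-integrand to second order around its mode and integrate the resulting Gaussian.

```python
import numpy as np

def laplace_approx_1d(log_f, theta_hat, h=1e-4):
    """Approximate the integral of exp(log_f(theta)) d(theta) by
    exp(log_f(theta_hat)) * sqrt(2*pi / (-log_f''(theta_hat))),
    where theta_hat is the mode of log_f; the second derivative is
    computed by central finite differences with step h."""
    d2 = (log_f(theta_hat + h) - 2.0 * log_f(theta_hat)
          + log_f(theta_hat - h)) / h ** 2
    return np.exp(log_f(theta_hat)) * np.sqrt(2.0 * np.pi / (-d2))

# Sanity check on a Gaussian integrand, for which Laplace is exact:
# integral of exp(-n (theta - 1)^2) d(theta) = sqrt(pi / n).
n = 50.0
approx = laplace_approx_1d(lambda th: -n * (th - 1.0) ** 2, 1.0)
exact = np.sqrt(np.pi / n)
```

In the paper's setting the same expansion is carried out jointly in $(\beta_k, \tau, \sigma^2)$, with the negative Hessian $V(\hat{\beta}_k, \hat{\tau}, \hat{\sigma}^2)$ playing the role of $-\log f''$ above.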
We then perform model selection using our hierarchical Bayesian approach. This is done by computing the posterior probabilities using the Laplace approximation in (5.2), and exploring the model space using the simplified shotgun stochastic search algorithm in [15]. We remind the reader that in our model we do not need to specify a fixed value for $\tau$; rather, we put a prior on $\tau$ (as opposed to [9] and [15], where $\tau$ is treated as a fixed parameter). We also provide model selection performance results with $\tau$ fixed at $\tau = n$ (the reciprocal of the expected value of $1/\tau$ under our prior), and with the numerical values used in the R package BayesS5 (a choice of fixed $\tau$ based on the results in [15]). Additionally, we provide model selection performance results for the Lasso and SCAD penalized likelihood methods.
It is clear that our Bayesian approach outperforms both the penalized likelihood approaches and the fixed-$\tau$ settings on almost all measures and in all cases. The TPR values for all methods other than the fixed-$\tau$ setting are very close to 1, which means that the Bayesian approach with fixed $\tau$ captures far fewer true indices than the other three methods. Also, the PPV values for our Bayesian approach are all higher than those for the Lasso and SCAD methods, though slightly lower than in the fixed-$\tau$ case. In addition, the FPR values for our Bayesian approach are all significantly smaller than those for the penalized approaches. It is also worth noting that the numerical procedure for choosing $\tau$ implemented in BayesS5 requires additional run time, while our fully Bayesian approach involves no such tuning step for $\tau$.

(Continuation of the proof of Lemma 4.6.) When $(1 + \frac{3}{4} r) t \leq \frac{1}{2} r k$, again using (4.3), (4.4) and Assumption 3, for large enough $n$, say $n > N_4'$, the corresponding bound holds. Since the relevant quantity is at most $k - t$, it follows that for large enough $n$, say $n > N_5'$, the second part of (4.17) is bounded by $S_4 T_4^k n^{-r(k-t)}$, for large enough constants $S_4$ and $T_4$. Combining all cases and letting $N$ be the maximum of the $N_i'$ completes the proof.